Article

Logical Contradictions in the One-Way ANOVA and Tukey–Kramer Multiple Comparisons Tests with More Than Two Groups of Observations

1 Higher School of Economics, National Research University, 101978 Moscow, Russia
2 Rutgers Business School, Rutgers University, Piscataway, NJ 08854, USA
* Author to whom correspondence should be addressed.
Symmetry 2021, 13(8), 1387; https://doi.org/10.3390/sym13081387
Submission received: 14 June 2021 / Revised: 9 July 2021 / Accepted: 11 July 2021 / Published: 30 July 2021
(This article belongs to the Special Issue Topological Graph Theory and Discrete Geometry)

Abstract:
We show that the one-way ANOVA and Tukey–Kramer (TK) tests agree on any sample with two groups. This result is based on a simple identity connecting the Fisher–Snedecor and studentized range probability distributions and is proven without any additional assumptions; in particular, the standard ANOVA assumptions (independence, normality, and homoscedasticity (INAH)) are not needed. In contrast, it is known that for a sample with $k > 2$ groups of observations, even under the INAH assumptions, with the same significance level $\alpha$, the above two tests may give opposite results: (i) ANOVA rejects its null hypothesis $H_0^A: \mu_1 = \cdots = \mu_k$, while the TK hypothesis $H_0^{TK}(i,j): \mu_i = \mu_j$ is not rejected for any pair $i,j \in \{1,\dots,k\}$; (ii) the TK test rejects $H_0^{TK}(i,j)$ for some pair $(i,j)$ with $i \neq j$, while ANOVA does not reject $H_0^A$. We construct two large infinite pseudo-random families of samples of both types satisfying INAH: in case (i) for any $k \geq 3$ and in case (ii) for some larger $k$. Furthermore, in case (ii), ANOVA restricted to the pair of groups $(i,j)$ may reject the equality $\mu_i = \mu_j$ with the same $\alpha$. This is an obvious contradiction, since $\mu_1 = \cdots = \mu_k$ implies $\mu_i = \mu_j$ for all $i,j \in \{1,\dots,k\}$. Such contradictions appear already in the symmetric case for $k = 3$, or in other words, for three groups of $d$, $d$, and $c$ observations with sample means $+1$, $-1$, and $0$, respectively. We outline conditions necessary and sufficient for this phenomenon. Similar contradictory examples are constructed for multivariable linear regression (MLR). However, for these constructions, it seems difficult to verify the Gauss–Markov assumptions, which are standardly required for MLR. Mathematics Subject Classification: 62 Statistics.

1. One-Way ANOVA and Tukey–Kramer Multiple Comparisons Tests

We use standard statistical definitions and notation; the reader can find more details in [1] or [2].

1.1. One-Way ANOVA

Consider an arbitrary sample that consists of k groups of randomly chosen real values. A group $j \in \{1,\dots,k\}$ contains $n_j$ values $x_{\ell j}$ with $\ell = 1,\dots,n_j$. Then, $n = n_1 + \cdots + n_k$ is the total number of values in the sample.
As usual, $\bar{x}_j$ and $\mu_j$ denote the sample and population means of group $j$, for $j = 1,\dots,k$.
We test
$$H_0^A: \mu_1 = \cdots = \mu_k \quad \text{versus} \quad H_1^A: \text{not all } \mu_i,\ i = 1,\dots,k, \text{ are the same}.$$
The one-way ANOVA test rejects the null hypothesis $H_0^A$ with significance level $\alpha$, that is, with confidence $100(1-\alpha)\%$, if and only if
$$F_{stat} > F_{crit}(\alpha, k-1, n-k),$$
or equivalently, if the p-value corresponding to $F_{stat}$ is less than $\alpha$.
Here, $F_{crit}(\alpha, k-1, n-k)$ is the critical value of the Fisher–Snedecor distribution corresponding to the significance level $\alpha$, with degrees of freedom $df_1 = k-1$ of the numerator and $df_2 = n-k$ of the denominator.
The value $F_{stat}$ is given by the ratio
$$F_{stat} = \frac{MS(Tr)}{MSE},$$
where
$$MS(Tr) = \frac{SS(Tr)}{k-1}, \qquad SS(Tr) = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{\bar{x}})^2 = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=j+1}^{k} n_j n_i (\bar{x}_j - \bar{x}_i)^2,$$
$$\bar{\bar{x}} = \frac{1}{n} \sum_{j=1}^{k} \sum_{\ell=1}^{n_j} x_{\ell j} = \frac{1}{n} \sum_{j=1}^{k} n_j \bar{x}_j,$$
$$MSE = \frac{SSE}{n-k}, \qquad SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2. \qquad (1)$$
Thus, ANOVA rejects $H_0^A$ if and only if
$$MSE < \big[n(k-1)\,F_{crit}(\alpha, k-1, n-k)\big]^{-1} \sum_{j=1}^{k} \sum_{i=j+1}^{k} n_j n_i (\bar{x}_j - \bar{x}_i)^2. \qquad (2)$$
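The computation above is easy to sketch numerically. The following minimal illustration (with made-up data, not from the paper) computes $F_{stat}$ from the definitions and checks it against SciPy's one-way ANOVA:

```python
import numpy as np
from scipy import stats

# Three groups of observations (illustrative data, not from the paper).
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 4.0, 6.0]),
          np.array([5.0, 6.0, 7.0])]

k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# SS(Tr) and SSE as defined above.
ss_tr = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_stat = (ss_tr / (k - 1)) / (sse / (n - k))

# The same statistic via scipy's one-way ANOVA.
f_scipy, p_value = stats.f_oneway(*groups)
assert abs(f_stat - f_scipy) < 1e-10

# Reject H0 at level alpha iff f_stat exceeds the critical value,
# equivalently iff the p-value is below alpha.
alpha = 0.05
f_crit = stats.f.ppf(1 - alpha, k - 1, n - k)
print(f_stat > f_crit, p_value < alpha)  # the two criteria agree
```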

1.2. Tukey–Kramer’s Test

For each pair $i,j \in \{1,\dots,k\}$ with $i \neq j$, we test the null hypothesis
$$H_0^{TK}(i,j): \mu_i = \mu_j \quad \text{versus} \quad H_1^{TK}(i,j): \mu_i \neq \mu_j.$$
Tukey [3] proposed a procedure for testing these hypotheses in the case of equal group sizes $n_1 = \cdots = n_k$. It was then extended in [4,5] to arbitrary group sizes. This test, called the Tukey–Kramer (TK) test, uses the studentized range statistic
$$Q = \frac{\bar{y}_{\max} - \bar{y}_{\min}}{\sqrt{MSE/n}},$$
where $\bar{y}_{\max}$ and $\bar{y}_{\min}$ are the largest and the smallest of the $k$ sample means.
The TK test rejects $H_0^{TK}(i,j)$ if and only if
$$|\bar{x}_i - \bar{x}_j| > CR(\alpha, k, n; i, j), \qquad (3)$$
where the critical range (CR) is defined by the formula
$$CR(\alpha, k, n; i, j) = Q(\alpha, k, n-k) \sqrt{\frac{MSE}{2} \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}, \qquad (4)$$
and $Q(\alpha, k, n-k)$ is the critical value of the studentized range $Q$ corresponding to the significance level $\alpha$, with parameters $k$ (the number of groups) and $n-k$ (the degrees of freedom).
Equivalently, (3) can be stated as
$$MSE < \frac{2}{Q^2(\alpha, k, n-k)} (\bar{x}_i - \bar{x}_j)^2 \frac{n_i n_j}{n_i + n_j}. \qquad (5)$$
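Criterion (3)–(4) can also be sketched in code. A minimal illustration with assumed data follows; `scipy.stats.studentized_range` (available in SciPy ≥ 1.7) provides the critical values $Q(\alpha, k, n-k)$:

```python
import numpy as np
from scipy import stats

# Illustrative data (not from the paper): k = 3 groups.
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 4.0, 6.0]),
          np.array([5.0, 6.0, 7.0])]
k = len(groups)
n = sum(len(g) for g in groups)
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)

alpha = 0.05
q_crit = stats.studentized_range.ppf(1 - alpha, k, n - k)  # Q(alpha, k, n - k)

# Tukey-Kramer decision for every pair (i, j) via the critical range (4).
decisions = {}
for i in range(k):
    for j in range(i + 1, k):
        ni, nj = len(groups[i]), len(groups[j])
        cr = q_crit * np.sqrt(mse / 2 * (1.0 / ni + 1.0 / nj))
        diff = abs(groups[i].mean() - groups[j].mean())
        decisions[(i + 1, j + 1)] = diff > cr  # reject H0(i, j)?
print(decisions)
```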

1.3. Comparing ANOVA and Tukey–Kramer Tests

Both the ANOVA and TK tests are based on the following standard assumptions: independence, normality, and homoscedasticity (INAH). The last one means that all groups have equal standard deviations, $\sigma_1 = \cdots = \sigma_k$. Both rejection criteria (2) and (5) are based on these assumptions; see [1,2,6] for more details.
In this note, we concentrate on the agreement between the above two tests rather than on their validity. Both inequalities (2) and (5) have the same left-hand side, M S E , which can be any number and is irrelevant for the sake of comparison of two tests.
By definition, $H_0^A$ holds if and only if $H_0^{TK}(i,j)$ holds for all pairs $(i,j)$ with $i \neq j$. When $k = 2$, there is only one such pair, and hence the ANOVA and TK tests should agree, and indeed they do. In Section 2, we prove this for an arbitrary sample. In particular, even if the INAH assumptions are not met, both tests either reject their null hypothesis or both do not, for any fixed significance level $\alpha$.
However, when $k > 2$, even under the INAH assumptions, the ANOVA and TK tests may disagree, and both cases (i) and (ii) defined in the Abstract may occur. Case (i) is not a paradox. Indeed, if $H_0^{TK}(i,j): \mu_i = \mu_j$ holds with significance slightly larger than $\alpha$, then it is not rejected by the TK test. This may happen for all pairs $i,j \in \{1,\dots,k\}$ with $i \neq j$. Yet, the number $\frac{k(k-1)}{2}$ of these pairs is greater than 1 when $k > 2$. Thus, ANOVA may still reject $H_0^A: \mu_1 = \cdots = \mu_k$ with significance $\alpha$.
Somewhat surprisingly, the inverse also happens: H 0 A may hold with a fixed α , while H 0 TK may be rejected for some pair ( i , j ) with the same α . Such examples are known. Hsu [7], on page 177, remarks: “An unfortunate common practice is to pursue multiple comparisons only when the null hypothesis of homogeneity is rejected.”
We construct two large families of samples of both types considered above. In Section 3.1, we provide two randomly generated samples with three groups in each, k = 3 , and in Section 3.2, two infinite families of pseudo-random samples with k 3 for type (i) and with some larger k for type (ii). It is important to note that these constructions are realized under INAH assumptions.
When $k > 2$, Formula (4) appears somewhat strange: the critical range is defined for a given pair $(i,j)$ via the value of $MSE$, which depends on all observations in all $k$ groups. These observations are independent random variables; hence, their values in a group $\ell$ cannot affect the equality $H_0^{TK}(i,j): \mu_i = \mu_j$ whenever $i$, $j$, $\ell$ are pairwise distinct. Moreover, group $\ell$ may not relate to groups $i$ and $j$ at all.

1.4. Modified Tukey–Kramer Test

Given a significance level $\alpha$, a sample with $k > 2$ groups, and a pair $(i,j)$ with $i \neq j$, let us modify the TK test for $H_0^{TK}(i,j): \mu_i = \mu_j$ by eliminating all groups but $i$ and $j$ from the sample. Thus, we obtain a new sample with $k' = 2$, $n' = n_i + n_j$, and
$$MSE' = \frac{SSE'}{n'-2}, \qquad SSE' = \sum_{\ell=1}^{n_i} (x_{\ell i} - \bar{x}_i)^2 + \sum_{\ell=1}^{n_j} (x_{\ell j} - \bar{x}_j)^2. \qquad (6)$$
Then, we define
$$CR'(\alpha, k, n; i, j) = Q(\alpha, 2, n'-2) \sqrt{\frac{MSE'}{2} \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}. \qquad (7)$$
In Section 3.3, assuming homoscedasticity ($\sigma_1 = \cdots = \sigma_k$) and also that $n_1 = \cdots = n_k$ and $n/k$ is large enough, we will show that $CR' \leq CR$. Hence, the modified TK test rejects $H_0^{TK}(i,j): \mu_i = \mu_j$ whenever the standard TK test does.
Remark 1.
Note that, in general, the inequality $CR' \leq CR$ may fail, since $MSE$ may be much smaller than $MSE'$. Indeed, if $s_\ell = 0$ (resp., small) for all $\ell \in \{1,\dots,k\} \setminus \{i,j\}$, while $s_i > 0$ and $s_j > 0$ (resp., large), then $SSE = SSE' > 0$ (resp., $SSE - SSE'$ may be an arbitrarily small non-negative number). Notice, however, that the homoscedasticity assumption might not hold when $s_i$ and $s_j$ differ significantly from the remaining standard deviations. Furthermore, $n-k$ may be much larger than $n'-2$, resulting in $Q(\alpha,k,n-k)\sqrt{MSE} < Q(\alpha,2,n'-2)\sqrt{MSE'}$.

1.5. Counterintuitive Examples with Symmetric Samples of Two and Three Groups

Another infinite set of pseudo-random examples will be constructed in Section 4. Given two groups of observations with $n_1 = n_2 = d$, $\bar{x}_1 = +1$, $\bar{x}_2 = -1$, $\sigma_1 = \sigma_2 = \sigma$, and a significance level $\alpha$, we find $d$ and $\alpha$ such that ANOVA rejects $H_0: \mu_1 = \mu_2$, for some $\sigma$, with confidence $1 - \alpha$. Then, we add a third group of observations with $n_3 = c$, $\bar{x}_3 = 0$, $\sigma_3 = \sigma$ and show that $H_0^A: \mu_1 = \mu_2 = \mu_3$, for some $\sigma$, is not rejected by ANOVA with the same confidence when $0 < c < d$.
As we already mentioned, this is a logical contradiction. Let us add that the condition $c < d$ seems counterintuitive. Indeed, $\bar{x}_3 = 0$; hence, group 3 contains values that typically lie between $\bar{x}_2 = -1$ and $\bar{x}_1 = +1$, which could be viewed as "an argument" in support of $\mu_1 = \mu_2 = \mu_3$. Furthermore, the larger $c$ is, the stronger this argument becomes.

1.6. ANOVA Is Not Inclusion Monotone on the Subsets of Its k Groups of Observations $\{1,\dots,k\}$

Given a significance level $\alpha$, a sample with $k > 2$ groups, and a pair $i,j \in \{1,\dots,k\}$ with $i \neq j$, recall case (ii): the TK test rejects $H_0^{TK}(i,j): \mu_i = \mu_j$, while ANOVA does not reject $H_0^A: \mu_1 = \cdots = \mu_k$.
Reduce the sample to only the two groups $i$ and $j$, eliminating the $k-2$ remaining groups, and apply the ANOVA and modified TK tests to the obtained sample. According to the previous subsection, the latter still rejects the equality $\mu_i = \mu_j$ and, based on Theorem 1, ANOVA also rejects it, while $\mu_1 = \cdots = \mu_k$ is not rejected. This is a contradiction.

1.7. Logical Contradictions in F- and t-Tests of Multivariable Linear Regression (MLR)

The general multivariable linear regression model with $k$ predictors $X_1,\dots,X_k$ and response $Y$ can be written as
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \epsilon.$$
The properties of the estimators of the coefficients β i are derived under the Gauss–Markov assumptions (GMA); see, for example, [8].
Commonly used tests in regression analysis are the F-test
$$H_0^F: \beta_1 = \cdots = \beta_k = 0 \quad \text{versus} \quad H_1^F: \text{at least one } \beta_i \text{ is not } 0, \text{ for } i = 1,\dots,k,$$
and the t-test for an individual coefficient $\beta_i$, for $i \in \{1,\dots,k\}$:
$$H_0^{t_i}: \beta_i = 0 \quad \text{versus} \quad H_1^{t_i}: \beta_i \neq 0.$$
It is well known that the F- and t-tests are equivalent in the case of simple linear regression (SLR), that is, when $k = 1$. In this case, the p-values of the tests are equal due to the identity $F(1,\nu) = t^2(\nu)$ for all natural $\nu$, where $F(1,\nu)$ is a Fisher–Snedecor random variable with $df_1 = 1$ and $df_2 = \nu$, and $t(\nu)$ is a random variable having Student's distribution with $\nu$ degrees of freedom.
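The identity $F(1,\nu) = t^2(\nu)$ behind the SLR equivalence is easy to verify at the level of critical values; a quick SciPy check:

```python
from scipy import stats

# For SLR (k = 1) the F- and t-tests are equivalent:
# F_crit(alpha, 1, nu) equals the squared two-sided t critical value.
alpha = 0.05
for nu in (5, 10, 30, 100):
    f_crit = stats.f.ppf(1 - alpha, 1, nu)
    t_crit = stats.t.ppf(1 - alpha / 2, nu)
    assert abs(f_crit - t_crit ** 2) < 1e-8
print("F(1, nu) critical values equal squared t critical values")
```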
Yet, for MLR, k > 1 , logical contradictions similar to ones outlined in Section 1.3, Section 1.4 and Section 1.5 appear. With the same α , the F- and t-tests for MLR may give opposite results as follows:
(j)
the F-test rejects $H_0^F: \beta_1 = \cdots = \beta_k = 0$, while $H_0^{t_i}: \beta_i = 0$ is rejected for no $i \in \{1,\dots,k\}$;
(jj)
the F-test does not reject $H_0^F$, while the t-test rejects $H_0^{t_i}$ for some (or even for all) $i \in \{1,\dots,k\}$.
Similarly to case (i) of ANOVA, case (j) is not a paradox: $H_0^{t_i}$ may fail to be rejected with significance $\alpha$ for each particular $i$, while the joint hypothesis is still rejected with this significance. In contrast, case (jj) is an obvious contradiction, since $H_0^F: \beta_1 = \cdots = \beta_k = 0$ implies $H_0^{t_i}: \beta_i = 0$ for every $i \in \{1,\dots,k\}$.
The corresponding examples are shown in Section 5 for k = 2 with the following inequalities for the p-values:
  • $p_{12} > p_1$ and $p_{12} > p_2$ in case (j) in Section 5;
  • $p_{12} < p_1$ and $p_{12} < p_2$ in case (jj) in Section 5,
where $p_{12}$ is the p-value of the F-test, while $p_1$ and $p_2$ are the p-values of the t-tests for $\beta_1$ and $\beta_2$, respectively.
Similarly to Section 1.5 for ANOVA, we will show that MLR may not be inclusion monotone on the set $\{1,\dots,k\}$ of its predictors. More precisely, consider the MLR F-test with $k$ predictors and eliminate $k-1$ of them, all but $X_i$, obtaining $k$ SLR problems, one for each predictor $X_i$ and the same response $Y$, for $i \in \{1,\dots,k\}$. Denote by $p_i$ the p-value of the $i$th SLR test. (Recall that the F- and t-tests for SLR are equivalent and have equal p-values.)
The example also has the following property: after removing predictor $X_1$, we obtain $p_{12} > 0.05 > p_2$. Hence, for significance level $\alpha = 0.05$, the null hypothesis $H_0^F: \beta_1 = \beta_2 = 0$ is not rejected, while $p_2 < 0.05$ both in the MLR and in the reduced SLR. Thus, case (jj) holds and, furthermore, the pair of predictors $X_1$ and $X_2$ is jointly insignificant, while $X_2$ alone is significant.
Let us remark finally that our constructions of Section 1.3, Section 1.4 and Section 1.5 satisfy INAH assumptions for ANOVA; however, it seems difficult to verify the GMA, which are standardly required for MLR.

2. Two Groups, k = 2

In this case, the pair $(i,j) = (1,2)$ is unique, and the multiple comparisons turn into a single one. The null hypotheses $H_0$ of the ANOVA and TK tests coincide, both stating that $\mu_1 = \mu_2$.
Theorem 1.
In the case of two groups, k = 2, ANOVA and TK tests are equivalent.
Proof. 
It is enough to show that inequalities (2) and (5) are equivalent when $k = 2$. In this case, they can be rewritten as
$$MSE < \frac{n_1 n_2}{n\,F_{crit}(\alpha,1,n-2)} (\bar{x}_1 - \bar{x}_2)^2$$
and
$$MSE < \frac{2 n_1 n_2}{n\,Q^2(\alpha,2,n-2)} (\bar{x}_1 - \bar{x}_2)^2,$$
where
$$MSE = \frac{1}{n-2} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2.$$
Thus, it is sufficient to prove the identity
$$Q^2(\alpha, 2, n-2) = 2 F_{crit}(\alpha, 1, n-2),$$
which is implied by the following lemma.
Let $F(1,\nu)$ be a Fisher–Snedecor random variable with $df_1 = 1$ and $df_2 = \nu$, and let $Q(2,\nu)$ be a random variable having the studentized range distribution with $k = 2$ groups and $\nu$ degrees of freedom. □
Lemma 1.
The equality $2F(1,\nu) = Q^2(2,\nu)$ holds in distribution.
Proof. 
The probability density function of the studentized range $Q$ in the case $k = 2$ is given by
$$f_Q(q;2,\nu) = \frac{4\sqrt{2\pi}\,\nu^{\nu/2}}{2^{\nu/2}\,\Gamma(\nu/2)} \int_0^\infty s^{\nu}\,\phi(\sqrt{\nu}\,s) \int_{-\infty}^{\infty} \phi(z+qs)\,\phi(z)\,dz\,ds = \frac{4\sqrt{2\pi}\,\nu^{\nu/2}}{2^{\nu/2}\,\Gamma(\nu/2)} \int_0^\infty s^{\nu}\,\frac{1}{\sqrt{2\pi}}e^{-\nu s^2/2} \int_{-\infty}^{\infty} \frac{1}{2\pi}e^{-\frac{(z+qs)^2+z^2}{2}}\,dz\,ds;$$
see [3]. We transform this formula as follows: substituting $u = \sqrt{2}\,z$ and using $(z+qs)^2 + z^2 = \left(u + \frac{qs}{\sqrt{2}}\right)^2 + \frac{q^2 s^2}{2}$, we integrate out $u$ and obtain
$$f_Q(q;2,\nu) = \frac{4\sqrt{\pi}\,\nu^{\nu/2}}{2^{\nu/2}\,\Gamma(\nu/2)} \int_0^\infty s^{\nu}\,\frac{1}{\sqrt{2\pi}}e^{-\nu s^2/2}\,\frac{1}{\sqrt{2\pi}}e^{-\frac{q^2 s^2}{4}}\,ds = \frac{2\,\nu^{\nu/2}}{2^{\nu/2}\sqrt{\pi}\,\Gamma(\nu/2)} \int_0^\infty s^{\nu} e^{-\frac{\left(\nu+\frac{q^2}{2}\right)s^2}{2}}\,ds.$$
Then, by substituting $t = s^2$,
$$f_Q(q;2,\nu) = \frac{\nu^{\nu/2}}{2^{\nu/2}\sqrt{\pi}\,\Gamma(\nu/2)} \int_0^\infty t^{\frac{\nu-1}{2}} e^{-\frac{\nu+\frac{q^2}{2}}{2}t}\,dt,$$
and by substituting $y = \frac{\nu+\frac{q^2}{2}}{2}\,t$,
$$f_Q(q;2,\nu) = \frac{\sqrt{2}\,\nu^{\nu/2}}{\sqrt{\pi}\,\Gamma(\nu/2)} \left(\nu+\frac{q^2}{2}\right)^{-\frac{\nu+1}{2}} \int_0^\infty y^{\frac{\nu+1}{2}-1} e^{-y}\,dy = \frac{\sqrt{2}\,\nu^{\nu/2}}{\sqrt{\pi}\,\Gamma(\nu/2)} \left(\nu+\frac{q^2}{2}\right)^{-\frac{\nu+1}{2}} \Gamma\!\left(\frac{\nu+1}{2}\right). \qquad (8)$$
For $X = \frac{Q^2}{2}$, we obtain
$$P(X \leq x) = P\!\left(\tfrac{Q^2}{2} \leq x\right) = P\!\left(Q \leq \sqrt{2x}\right),$$
and then
$$f_X(x) = \frac{d}{dx}\,P\!\left(Q \leq \sqrt{2x}\right) = \frac{1}{\sqrt{2x}}\,f_Q(\sqrt{2x};2,\nu),$$
which, based on (8), implies that
$$f_X(x) = \frac{1}{\sqrt{2x}} \cdot \frac{\sqrt{2}\,\nu^{\nu/2}}{\sqrt{\pi}\,\Gamma(\nu/2)}\,(\nu+x)^{-\frac{\nu+1}{2}}\,\Gamma\!\left(\frac{\nu+1}{2}\right) = \frac{1}{B\!\left(\frac{1}{2},\frac{\nu}{2}\right)\sqrt{\nu x}} \left(1+\frac{x}{\nu}\right)^{-\frac{\nu+1}{2}}, \qquad (9)$$
where $B(a,b)$ is the beta function.
It is well known (see, for example, [9]) that (9) is the probability density function of the Fisher–Snedecor distribution with degrees of freedom $df_1 = 1$ of the numerator and $df_2 = \nu$ of the denominator. □
This proves the Theorem.
Note that Theorem 1 holds for an arbitrary sample. In particular, the p-values of the ANOVA and TK tests are equal regardless of the validity of the INAH assumptions.
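Lemma 1, in the form of the critical-value identity $Q^2(\alpha,2,\nu) = 2F_{crit}(\alpha,1,\nu)$, can also be confirmed numerically; a quick SciPy check:

```python
from scipy import stats

# Check Q^2(alpha, 2, nu) = 2 * F_crit(alpha, 1, nu) for several alpha and nu.
for alpha in (0.01, 0.05, 0.10):
    for nu in (3, 10, 50, 200):
        q = stats.studentized_range.ppf(1 - alpha, 2, nu)
        f = stats.f.ppf(1 - alpha, 1, nu)
        assert abs(q * q / (2 * f) - 1) < 1e-4, (alpha, nu)
print("identity Q^2(alpha, 2, nu) = 2 F_crit(alpha, 1, nu) confirmed numerically")
```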

3. Some Cases When ANOVA and TK Tests Disagree

In this section, we provide several examples where the considered two tests disagree: (i) H 0 A is rejected, while H 0 TK is not or (ii) vice versa. In Section 3.1, we provide two randomly generated samples illustrating (i) and (ii) with three groups in each, k = 3 ; and in Section 3.2, we construct two infinite families of pseudo-random samples with k 3 for (i) and with some larger k for (ii).

3.1. Two Examples with Three Groups

Using R, we generated two random samples with $k = 3$ groups, $n_1 = n_2 = n_3 = 10$, from normal distributions with parameters $\mu_1 = 10$, $\mu_2 = 25$, $\mu_3 = 40$, and $\sigma_1 = \sigma_2 = \sigma_3 = 25$ (see Table 1).

3.1.1. Case 1: ANOVA Rejects H 0 A , While TK Does Not Reject H 0 T K

Let us fix the significance level $\alpha = 0.05$; then, TK does not distinguish any pair $\mu_i$ and $\mu_j$ (see Table 2), while Table 3 shows that the ANOVA test rejects the null hypothesis $H_0^A$.

3.1.2. Case 2: ANOVA Does Not Reject H 0 A , While TK Rejects H 0 T K

Let us fix again the significance level α = 0.05 . The generated random sample for Case 2 is shown in Table 4. Then, Table 5 shows that TK distinguishes μ 1 and μ 3 at significance level α = 0.05 . In contrast, Table 6 shows that the ANOVA test does not reject H 0 A for the same α . In this case, we can apply the approach suggested in Section 1.4 and Section 1.5. Let us reduce the sample by eliminating group 2 and apply the ANOVA and (modified) TK test.
According to Theorem 1, these two tests are equivalent: the p-value is 0.02432 (see Table 7 and Table 8) for the equality $\mu_1 = \mu_3$ in both cases. Yet, for the original sample of three groups, the p-value was 0.05624 for the equality $\mu_1 = \mu_2 = \mu_3$. This is an obvious contradiction: ANOVA rejects $\mu_1 = \mu_3$ with confidence 97.5%, but cannot reject the stronger statement $\mu_1 = \mu_2 = \mu_3$ (which should be easier to reject) even with confidence 95%.
Recall that this example was generated by R under the INAH assumptions. This did not take too many trials: with the given parameters $k = 3$, $n_1 = n_2 = n_3 = 10$, $\mu_1 = 10$, $\mu_2 = 25$, $\mu_3 = 40$, and $\sigma_1 = \sigma_2 = \sigma_3 = 25$, roughly one in 20 trials provides an example with such properties.
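The frequency of such disagreements can be estimated by simulation. The following sketch uses Python rather than R, with the same parameters; `scipy.stats.tukey_hsd` (SciPy ≥ 1.8) plays the role of R's `TukeyHSD`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, k, m = 0.05, 3, 10
mus, sigma = (10.0, 25.0, 40.0), 25.0

# Count samples where ANOVA does not reject H0^A while the
# Tukey-Kramer test separates at least one pair of means.
trials, count = 1000, 0
for _ in range(trials):
    groups = [rng.normal(mu, sigma, m) for mu in mus]
    _, p_anova = stats.f_oneway(*groups)
    p_tk = stats.tukey_hsd(*groups).pvalue
    min_pair = min(p_tk[i][j] for i in range(k) for j in range(i + 1, k))
    if p_anova > alpha and min_pair < alpha:
        count += 1
print("disagreement rate:", count / trials)
```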

3.2. Two Large Families of Examples with k Groups

In this subsection, we consider samples with $k$ groups such that $n$ is divisible by $k$, and
$$n_1 = \cdots = n_k = \frac{n}{k}, \qquad (10)$$
$$s_1 = \cdots = s_k = s, \qquad (11)$$
where $s_i$ is the standard deviation of the $i$th group. In this case, we have
$$SSE = k\left(\frac{n}{k}-1\right)s^2, \qquad MSE = \frac{SSE}{n-k} = s^2. \qquad (12)$$

3.2.1. Case 1: ANOVA Rejects H 0 A , While TK Does Not Reject H 0 TK

Based on (2) and (5), this happens if and only if
$$\big[n(k-1)F_{crit}(\alpha,k-1,n-k)\big]^{-1} \sum_{j=1}^{k}\sum_{i=j+1}^{k} n_j n_i (\bar{x}_j-\bar{x}_i)^2 > MSE > \frac{2}{Q^2(\alpha,k,n-k)} (\bar{x}_i-\bar{x}_j)^2 \frac{n_i n_j}{n_i+n_j}, \qquad (13)$$
which implies
$$Q^2(\alpha,k,n-k) \sum_{j=1}^{k}\sum_{i=j+1}^{k} n_j n_i (\bar{x}_j-\bar{x}_i)^2 > 2n(k-1)F_{crit}(\alpha,k-1,n-k)\,(\bar{x}_i-\bar{x}_j)^2 \frac{n_i n_j}{n_i+n_j}. \qquad (14)$$
Consider any sample with an even number $k$ of groups satisfying (10) and (11), and
$$\bar{x}_1 = \cdots = \bar{x}_{k/2} = 1, \qquad \bar{x}_{k/2+1} = \cdots = \bar{x}_k = 0. \qquad (15)$$
Based on (12), $MSE = s^2$, and (14) can be rewritten as
$$\frac{n^2}{4}\,Q^2(\alpha,k,n-k) > \frac{n^2(k-1)}{k}\,F_{crit}(\alpha,k-1,n-k),$$
which can be simplified to
$$G(\alpha,k,n-k) = \frac{Q^2(\alpha,k,n-k)}{F_{crit}(\alpha,k-1,n-k)} - 4\left(1-\frac{1}{k}\right) > 0. \qquad (16)$$
The function $G(\alpha,k,n-k)$ has the following properties:
  • $G(\alpha,k,n-k) = 0$ if $k = 2$;
  • $G(\alpha,k,n-k)$ is monotonically increasing with respect to $n-k$ and converges as $n \to \infty$ for each $k$;
  • $G(\alpha,k,n-k) > 0$ for $k \geq 3$ and all $n-k > 0$, for $\alpha = 0.005, 0.01, 0.025, 0.05, 0.1, 0.25$, and $0.5$.
It is not our goal to study the function $G(\alpha,k,n-k)$ in detail; we are primarily interested in its positivity, required by condition (16). The required inequality (16) holds for any $k \geq 3$.
Given an even $k$ and $n$ divisible by $k$, we generate a desired pseudo-random sample as follows: it satisfies (10), (11), and (15); in addition, whenever (16) holds, we can choose $s^2 = MSE$ satisfying (13).
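The positivity of $G$ claimed above is easy to probe numerically; a sketch:

```python
from scipy import stats

def G(alpha: float, k: int, df: int) -> float:
    """G(alpha, k, n - k) from (16), with df = n - k."""
    q = stats.studentized_range.ppf(1 - alpha, k, df)
    f = stats.f.ppf(1 - alpha, k - 1, df)
    return q * q / f - 4.0 * (1.0 - 1.0 / k)

# G vanishes for k = 2 (by Lemma 1) and is positive for k >= 3 here.
assert abs(G(0.05, 2, 30)) < 1e-3
values = {k: G(0.05, k, 30) for k in range(3, 11)}
print(values)
assert all(v > 0 for v in values.values())
```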

3.2.2. Case 2: ANOVA Does Not Reject H 0 A , While TK Rejects H 0 T K

Based on (2) and (5), this happens if and only if
$$\big[n(k-1)F_{crit}(\alpha,k-1,n-k)\big]^{-1} \sum_{j=1}^{k}\sum_{i=j+1}^{k} n_j n_i (\bar{x}_j-\bar{x}_i)^2 < MSE < \frac{2}{Q^2(\alpha,k,n-k)} (\bar{x}_i-\bar{x}_j)^2 \frac{n_i n_j}{n_i+n_j}, \qquad (17)$$
which implies
$$Q^2(\alpha,k,n-k) \sum_{j=1}^{k}\sum_{i=j+1}^{k} n_j n_i (\bar{x}_j-\bar{x}_i)^2 < 2n(k-1)F_{crit}(\alpha,k-1,n-k)\,(\bar{x}_i-\bar{x}_j)^2 \frac{n_i n_j}{n_i+n_j}. \qquad (18)$$
Note that if $k = 2$, (18) turns into the identity of Lemma 1.
Consider any sample with $k$ groups satisfying (10) and (11), and
$$\bar{x}_1 = 1, \qquad \bar{x}_2 = \cdots = \bar{x}_k = 0. \qquad (19)$$
Then, (18) turns into
$$(k-1)\frac{n^2}{k^2}\,Q^2(\alpha,k,n-k) < \frac{n^2(k-1)}{k}\,F_{crit}(\alpha,k-1,n-k),$$
which simplifies to
$$H(\alpha,k,n-k) = \frac{Q^2(\alpha,k,n-k)}{F_{crit}(\alpha,k-1,n-k)} - k < 0. \qquad (20)$$
Since $s$ can be chosen arbitrarily, we can always find $MSE$ satisfying (17) whenever (18) holds.
The function $H(\alpha,k,n-k)$ shares the first two properties of $G(\alpha,k,n-k)$ listed above; furthermore, $H(\alpha,k,n-k) > 0$ for sufficiently small $k$, and $H(\alpha,k,n-k) < 0$ for sufficiently large $k$. Again, it is not our goal to study $H(\alpha,k,n-k)$ in detail, since we are primarily interested in its negativity, required by condition (20).
The signs of $H(\alpha,k,n-k)$, depending on $k$, are shown in Table 9 for some values of $\alpha$. The second (resp., third) column contains the values of $k$ such that $H(\alpha,k,n-k) > 0$ (resp., $H(\alpha,k,n-k) < 0$) for all $n$. Missing values of $k$ correspond to the cases when the sign of $H(\alpha,k,n-k)$ depends on $n$.
One can see that the required inequality (20) holds when the number of groups k is large enough.
Given $k$ and $n$ divisible by $k$, we generate the desired pseudo-random sample as follows: it satisfies (10), (11), and (19); in addition, whenever (20) holds, we can choose $s^2 = MSE$ satisfying (17).
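Analogously, the sign change of $H$ with growing $k$ can be located numerically; a sketch (the value of $df$ is illustrative):

```python
from scipy import stats

def H(alpha: float, k: int, df: int) -> float:
    """H(alpha, k, n - k) from (20), with df = n - k."""
    q = stats.studentized_range.ppf(1 - alpha, k, df)
    f = stats.f.ppf(1 - alpha, k - 1, df)
    return q * q / f - k

# H is 0 at k = 2 (Lemma 1), positive for small k, negative for large k.
alpha, df = 0.05, 100
assert H(alpha, 5, df) > 0
assert H(alpha, 50, df) < 0
smallest_negative = next(k for k in range(3, 100) if H(alpha, k, df) < 0)
print("H < 0 first at k =", smallest_negative)
```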
Remark 2.
We can vary the choice of sample means in (15) and (19) to increase the feasible regions for (16) and (20), respectively. Obviously, $k \geq 4\left(1-\frac{1}{k}\right)$, with equality if and only if $k = 2$.
Remark 3.
We can extend considerably the family of the constructed examples by relaxing equalities (11), (15), and (19), and replacing them with approximate equalities.

3.3. Critical Range in Modified TK Test

In Section 1.4, we modified the standard TK multiple comparisons test, replacing it with a pairwise comparison version, as follows: given a significance level $\alpha$, a sample with $k > 2$ groups, and a pair $i,j \in \{1,\dots,k\}$ with $i \neq j$, consider the null hypothesis for the corresponding two groups, $H_0^{TK}(i,j): \mu_i = \mu_j$, and eliminate all groups but $i$ and $j$ from the sample, obtaining a new one with $k' = 2$ and $n' = n_i + n_j$. For the standard and modified TK tests, the critical ranges $CR = CR(\alpha,k,n-k;i,j)$ and $CR' = CR'(\alpha,2,n'-2;i,j)$ and the corresponding values of $MSE$ and $MSE'$ are given by Formulas (1), (4), (6), and (7).
We are looking for conditions implying the inequality $CR' \leq CR$, in which case the modified TK test rejects $H_0^{TK}(i,j)$ whenever the standard one does. In general, this inequality may fail; see Remark 1.
Let us assume INAH and, in addition, (10) and (11). As we know, in this case, $MSE = MSE' = s^2$, and the formulas for $CR$ and $CR'$ simplify as follows:
$$CR = Q(\alpha,k,n-k)\,s\sqrt{\frac{k}{n}}, \qquad CR' = Q\!\left(\alpha,2,\frac{2n}{k}-2\right)s\sqrt{\frac{k}{n}} = Q\!\left(\alpha,\frac{k}{k/2},\frac{n-k}{k/2}\right)s\sqrt{\frac{k}{n}}.$$
Thus, in the considered case,
$$\frac{CR'}{CR} = \frac{Q(\alpha,2,\nu)}{Q(\alpha,2\ell,\ell\nu)},$$
where $\ell = k/2 \geq 1$ and $\nu = \frac{n-k}{\ell} = \frac{2n}{k}-2$.
The critical value $Q(\alpha,2\ell,\ell\nu)$ of the studentized range monotonically increases with $\ell$ when $\nu = \frac{2n}{k}-2$ is large enough; see Table 10.
In these cases, $CR' \leq CR$, and hence, the conclusions of Section 1.5 are applicable. Recall the construction of Section 3.2.2, Case 2, in which ANOVA does not reject $H_0^A: \mu_1 = \cdots = \mu_k$, while the standard TK test rejects $H_0^{TK}(i,j): \mu_i = \mu_j$. This pseudo-random construction satisfies INAH. Let us apply the ANOVA and TK tests to the reduced sample that consists of only the two groups $i$ and $j$, with the remaining $k-2$ groups eliminated. Based on the above arguments, the modified TK test still rejects its hypothesis $H_0^{TK}(i,j): \mu_i = \mu_j$ and, based on Theorem 1, ANOVA rejects it as well. However, ANOVA does not reject the stronger hypothesis $H_0^A: \mu_1 = \cdots = \mu_k$ with the same significance level $\alpha$, which is an obvious contradiction.
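The ratio $CR'/CR$ from the display above can be tabulated directly; a sketch (the values of $k$ and $n$ are illustrative):

```python
from scipy import stats

# CR'/CR = Q(alpha, 2, 2n/k - 2) / Q(alpha, k, n - k)
# under equal group sizes and homoscedasticity (Section 3.3).
alpha = 0.05
for k, n in ((4, 200), (6, 300), (10, 500)):
    q_pair = stats.studentized_range.ppf(1 - alpha, 2, 2 * n // k - 2)
    q_full = stats.studentized_range.ppf(1 - alpha, k, n - k)
    ratio = q_pair / q_full
    print(k, n, round(ratio, 3), ratio < 1)  # CR' <= CR when ratio < 1
```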

4. Symmetric Samples with Two and Three Groups

4.1. Two Groups

Consider two groups 1 and 2 with $d$ observations in each, that is, $k = 2$, $n_1 = n_2 = d$, $n = n_1 + n_2 = 2d$, with means $\bar{x}_1 = +1$, $\bar{x}_2 = -1$, and standard deviations $\sigma_1 = \sigma_2 = \sigma$. We can assume that INAH holds.
Obviously, $SS(Tr) = MS(Tr) = 2d$; furthermore, according to (1),
$$SSE = \sigma^2, \qquad MSE = \frac{SSE}{n-k} = \frac{\sigma^2}{2d-2},$$
$$F_{stat} = \frac{MS(Tr)}{MSE} = \frac{4d(d-1)}{\sigma^2}.$$
According to (2), ANOVA rejects its null hypothesis $H_0^A: \mu_1 = \mu_2$ if and only if
$$\frac{4d(d-1)}{\sigma^2} > F_{crit}(\alpha,k-1,n-k) = F_{crit}(\alpha,1,2d-2). \qquad (21)$$
As for the TK test (in this case, just the Tukey test), we have $|\bar{x}_1 - \bar{x}_2| = 2$, and based on (4), the critical range is given by
$$CR(1,2) = Q(\alpha,2,2d-2)\sqrt{MSE/d} = \frac{Q(\alpha,2,2d-2)\,\sigma}{\sqrt{2d(d-1)}}.$$
Thus, according to (3), the Tukey test rejects $\mu_1 = \mu_2$ if and only if
$$\frac{8d(d-1)}{\sigma^2} > Q^2(\alpha,2,2d-2). \qquad (22)$$
Criteria (21) and (22) are equivalent, based on Lemma 1.

4.2. Three Groups

Let us add to groups 1 and 2 a third group of $c$ observations, obtaining $k = 3$ and $n = n_1 + n_2 + n_3 = 2d + c$. Furthermore, set $\bar{x}_3 = 0$ and $\sigma_3 = \sigma$, and assume that INAH holds. Then, based on (1) and (2),
$$SS(Tr) = 2d, \qquad MS(Tr) = \frac{SS(Tr)}{k-1} = d,$$
$$SSE = \sigma^2, \qquad MSE = \frac{SSE}{n-k} = \frac{\sigma^2}{2d+c-3},$$
$$F_{stat} = \frac{MS(Tr)}{MSE} = \frac{d(2d+c-3)}{\sigma^2}.$$
Thus, ANOVA rejects its null hypothesis $H_0^A: \mu_1 = \mu_2 = \mu_3$ if and only if
$$\frac{d(2d+c-3)}{\sigma^2} > F_{crit}(\alpha,k-1,n-k) = F_{crit}(\alpha,2,2d+c-3). \qquad (23)$$
As for the TK test, we have $|\bar{x}_1 - \bar{x}_2| = 2$, while based on (4), the critical range is
$$CR(1,2) = Q(\alpha,k,n-k)\sqrt{MSE/d} = \frac{\sigma\,Q(\alpha,3,2d+c-3)}{\sqrt{d(2d+c-3)}}.$$
Thus, based on (3), the TK test rejects $\mu_1 = \mu_2$ if and only if
$$\frac{4d(2d+c-3)}{\sigma^2} > Q^2(\alpha,3,2d+c-3). \qquad (24)$$

4.3. ANOVA for k = 2 and k = 3

ANOVA rejects $\mu_1 = \mu_2$ and does not reject $\mu_1 = \mu_2 = \mu_3$ if and only if (21) holds while (23) fails. This happens, for some $\sigma$, if and only if
$$\frac{F_{crit}(\alpha,1,2a)}{F_{crit}(\alpha,2,2a+b)} < \frac{4a}{2a+b}, \qquad (25)$$
where $a = d-1$ and $b = c-1$.
Remark 4.
Obviously, the set of feasible σ is an interval. In fact, this holds for all our examples showing logical contradictions. Such examples are relatively rare because the interval for σ is small.
Consider $\alpha = 0.05$. Inequality (25) holds whenever $b < a$, or equivalently, $c < d$. It seems that (25) can be solved, with respect to $a$ and $b$, explicitly. Consider the following sequence of positive integers $S = (6, A^2, B^4, A, B^5, (A, B^6), (A, B^6), \dots)$, where $A = (8,9,8,8,9)$, $B = (8,9,8,8,9,8,8,9)$, and a power denotes the number of repetitions. Thus, $S$ is a quasi-periodic sequence with the period $(A, B^6)$ of length $5 + 8 \cdot 6 = 53$. To each $a$ we assign a non-negative integer $a(s)$ uniquely defined by the inequalities
$$\sum_{i=1}^{a(s)} s_i \leq a < \sum_{i=1}^{a(s)+1} s_i.$$
Then, (25) holds if and only if $b < a + a(s)$. This criterion is confirmed by computations for $a \leq 500$. We conjecture that it holds for all $a$ and that similar criteria hold for all $\alpha$.
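Inequality (25) is straightforward to check by computation. The following sketch confirms it for $b < a$ on a small range and shows failure for a large $b$:

```python
from scipy import stats

def holds_25(alpha: float, a: int, b: int) -> bool:
    """Inequality (25): Fcrit(alpha,1,2a)/Fcrit(alpha,2,2a+b) < 4a/(2a+b)."""
    lhs = stats.f.ppf(1 - alpha, 1, 2 * a) / stats.f.ppf(1 - alpha, 2, 2 * a + b)
    return lhs < 4 * a / (2 * a + b)

alpha = 0.05
# (25) holds whenever b < a (i.e., c < d) ...
assert all(holds_25(alpha, a, b) for a in range(2, 40) for b in range(1, a))
# ... and fails when b is much larger than a.
assert not holds_25(alpha, 5, 100)
print("inequality (25) behaves as described")
```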

4.4. TK Test for k = 2 and k = 3

The TK test rejects the equality $\mu_1 = \mu_2$ for $k = 2$ and does not reject it for $k = 3$ if and only if (22) holds while (24) fails. This happens, for some $\sigma$, if and only if
$$\frac{Q^2(\alpha,3,2a+b)}{Q^2(\alpha,2,2a)} > \frac{2a+b}{2a}. \qquad (26)$$
Consider $\alpha = 0.05$. Inequality (26) holds whenever $b$ is small enough relative to $a$. Again, it seems that (26) can be solved, with respect to $a$ and $b$, explicitly. Consider the sequence of positive integers $S = (3, 7, (8, 7^7, 8, 7^6), (8, 7^7, 8, 7^6), \dots)$; it is quasi-periodic with the period $(8, 7^7, 8, 7^6)$ of length 15.
Then, (26) holds if and only if $b \leq a - a(s)$, where $a(s)$ is defined with respect to this sequence as in Section 4.3. This criterion is confirmed by computations for $a \leq 500$. We conjecture that it holds for all $a$ and that similar criteria hold for all $\alpha$.

5. Logical Contradictions in Multivariable Linear Regression

Here, we provide the examples announced in Section 1.7; see Table 11, Table 12, Table 13, Table 14, Table 15, Table 16, Table 17 and Table 18.

5.1. Construction for Case (j): F-Test Rejects $H_0^F: \beta_1 = \beta_2 = 0$, While $H_0^{t_1}: \beta_1 = 0$ and $H_0^{t_2}: \beta_2 = 0$ Are Not Rejected by t-Tests with the Same Significance $\alpha = 0.05$

Note that $p_1 = 0.05598 > 0.05$, $p_2 = 0.28837 > 0.05$, and $p_{12} = 0.04865 < 0.05$. Hence, case (j) holds.

5.2. Constructions for Case (jj): F-Test Does Not Reject $H_0^F$, While t-Tests Reject $H_0^{t_1}$ or $H_0^{t_2}$ with the Same Significance

Here, $p_1 = 0.08480$, $p_2 = 0.04959$, and $p_{12} = 0.12690$. Hence, the F-test does not reject $H_0^F$, while the t-tests reject $H_0^{t_1}$ and $H_0^{t_2}$ with significance $\alpha = 0.1$.
The next example also illustrates case (jj) and, in addition, shows that the F-test may fail to be inclusion monotone on the set of predictors.
Note that $p_2 = 0.03719 < 0.05 < 0.09723 = p_{12}$. Hence, case (jj) holds. Furthermore, eliminating predictor $X_1$ yields the following SLR table:
We observe again that $p_2 = 0.04808 < 0.05 < 0.09723 = p_{12}$. Thus, the F-test states that the predictors $X_1$ and $X_2$ are jointly insignificant, while $X_2$ alone is significant at $\alpha = 0.05$.

6. Concluding Remarks

Both the ANOVA and TK multiple comparisons tests with $k$ groups may result in logical contradictions when $k > 2$, even if the INAH assumptions hold. Therefore, the good old approach of using pairwise comparisons instead of multiple ones is a bit slower but more reliable. Furthermore, all contradictions disappear if we replace the ANOVA and TK tests by their pairwise versions, applying them to every pair of groups $i,j \in \{1,\dots,k\}$ with $i \neq j$. Then, according to Theorem 1, these two tests become equivalent.
Similar contradictions appear for the linear regression with the number of predictors k > 1 (MLR). Already for k = 2 , with the same level of significance α , it may happen that t-test rejects H 0 t 1 : β 1 = 0 , while F-test fails to reject the stronger null hypothesis H 0 F : β 1 = β 2 = 0 .
In general, estimating the quality of a prediction made by ANOVA or MLR seems much more doubtful than the prediction itself.

Author Contributions

Data curation, M.N.; Supervision, V.G. Both authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

This paper was prepared within the framework of the HSE University Basic Research Program.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Miller, I.; Miller, M. John E. Freund's Mathematical Statistics, 6th ed.; Prentice Hall: Upper Saddle River, NJ, USA, 2004.
  2. Montgomery, D.C. Design and Analysis of Experiments, 9th ed.; John Wiley: New York, NY, USA, 2017.
  3. Tukey, J. Comparing Individual Means in the Analysis of Variance. Biometrics 1949, 5, 99–114.
  4. Kramer, C.Y. Extension of multiple range tests to group means with unequal numbers of replications. Biometrics 1956, 12, 307–310.
  5. Kramer, C.Y. Extension of multiple range tests to group correlated adjusted means. Biometrics 1957, 13, 13–18.
  6. Hayter, A.J. A proof of the conjecture that the Tukey–Kramer multiple comparisons procedure is conservative. Ann. Stat. 1984, 12, 61–75.
  7. Hsu, J.C. Multiple Comparisons: Theory and Methods; Chapman & Hall: London, UK, 1996.
  8. Wooldridge, J.M. Introductory Econometrics: A Modern Approach, 5th ed.; Thomson/South-Western: Mason, OH, USA, 2012.
  9. Fisher, R.A. Statistical Methods for Research Workers. In Breakthroughs in Statistics; Kotz, S., Johnson, N.L., Eds.; Springer Series in Statistics; Springer: New York, NY, USA, 1992.
Table 1. Generated random sample for Case 1.

| Group 1 | Group 2 | Group 3 |
|---|---|---|
| 33.73617429 | 41.34327861 | 1.949654854 |
| 6.532109599 | −2.29596015 | 64.73534452 |
| −15.87068125 | 37.80911436 | 17.47791461 |
| 24.41853292 | −38.5488504 | 43.91077426 |
| 32.52469512 | 28.81447508 | 15.70006485 |
| −36.67775074 | 91.99464773 | 54.51355702 |
| 4.946144821 | 54.96895462 | 31.54941908 |
| −11.48789077 | 77.16877 | 0.819316511 |
| 32.04750431 | 95.1318948 | 73.84330239 |
| −24.02530782 | 6.722405014 | 45.92357859 |
Table 2. The results of TK test for the example in Case 1.

| Group | Diff | lwr | upr | p adj |
|---|---|---|---|---|
| Group2-Group1 | 34.696519918 | −1.294542536 | 70.68758237 | 0.06045 |
| Group3-Group1 | 30.427939620 | −5.563122834 | 66.41900207 | 0.10952 |
| Group3-Group2 | −4.268580298 | −40.259642752 | 31.72248216 | 0.95353 |
Table 3. ANOVA table for the example in Case 1.

| | Df | Sum Sq | Mean Sq | F Value | Pr(>F) |
|---|---|---|---|---|---|
| group | 2 | 7159.763 | 3579.881 | 3.39789 | 0.04828 |
| Residuals | 27 | 28,446.164 | 1053.562 | | |
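As a sanity check, the sums of squares and the F value in Table 3 can be recomputed directly from the Table 1 sample; a minimal Python sketch (the p-value, which requires the Fisher–Snedecor CDF, is omitted):

```python
# Recompute the one-way ANOVA of Table 3 from the Case 1 sample (Table 1).
g1 = [33.73617429, 6.532109599, -15.87068125, 24.41853292, 32.52469512,
      -36.67775074, 4.946144821, -11.48789077, 32.04750431, -24.02530782]
g2 = [41.34327861, -2.29596015, 37.80911436, -38.5488504, 28.81447508,
      91.99464773, 54.96895462, 77.16877, 95.1318948, 6.722405014]
g3 = [1.949654854, 64.73534452, 17.47791461, 43.91077426, 15.70006485,
      54.51355702, 31.54941908, 0.819316511, 73.84330239, 45.92357859]
groups = [g1, g2, g3]
n = sum(len(g) for g in groups)                  # 30 observations, k = 3 groups
grand_mean = sum(sum(g) for g in groups) / n
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
f_value = (ss_between / (3 - 1)) / (ss_within / (n - 3))
print(round(ss_between, 3), round(ss_within, 3), round(f_value, 5))
```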
Table 4. Generated random sample for Case 2.

| Group 1 | Group 2 | Group 3 |
|---|---|---|
| 19.65656273 | 30.47282693 | 97.66594506 |
| 31.63471018 | 2.359493274 | 37.29706457 |
| 5.474716521 | 25.94822801 | 37.28238885 |
| 7.325738946 | −6.706730014 | −9.215515132 |
| 47.16633 | 56.00337827 | 44.75306142 |
| −28.99487682 | 22.37945513 | 72.60365833 |
| 14.99564807 | 73.81358543 | 21.39501942 |
| 48.12035772 | 5.44699726 | 71.5651277 |
| 25.54178184 | −3.745973145 | 63.33149261 |
| −16.61305101 | 48.61987107 | 26.01262136 |
Table 5. The results of TK test for the example in Case 2.

| Group | Diff | lwr | upr | p adj |
|---|---|---|---|---|
| Group2-Group1 | −10.0283214 | −40.8231285393 | 20.76648573 | 0.70172 |
| Group3-Group1 | 30.8382946 | 0.0434874678 | 61.63310174 | 0.04962 |
| Group3-Group2 | 20.8099732 | −9.9848339369 | 51.60478033 | 0.23279 |
Table 6. ANOVA table for the example in Case 2.

| | Df | Sum Sq | Mean Sq | F Value | Pr(>F) |
|---|---|---|---|---|---|
| group | 2 | 4948.742 | 2474.371 | 3.20804 | 0.05624 |
| Residuals | 27 | 20,825.208 | 771.304 | | |
Table 7. ANOVA table for groups 1 and 3 in Case 2.

| | Df | Sum Sq | Mean Sq | F Value | Pr(>F) |
|---|---|---|---|---|---|
| group | 1 | 4755.002 | 4755.002 | 6.044 | 0.02432 |
| Residuals | 18 | 14,161.164 | 786.731 | | |
Table 8. The results of modified TK test for groups 1 and 3 in Case 2.

| Group | Diff | lwr | upr | p adj |
|---|---|---|---|---|
| Group3-Group1 | 30.8382946 | 4.484804399 | 57.1917848 | 0.02431 |
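The agreement of Tables 7 and 8 (p = 0.02432 vs. p = 0.02431, the gap being rounding) illustrates Theorem 1: for two groups, the studentized range statistic q and the ANOVA statistic F are linked by the standard identity q = √(2F), so the pairwise tests are equivalent. A small Python check of this identity on the printed values (the standard error uses MS within / n with n = 10 observations per group):

```python
import math

# Two-group case: the TK studentized range statistic q equals sqrt(2*F).
ms_within = 786.731     # Mean Sq of Residuals, Table 7
f_stat = 6.044          # F Value, Table 7
diff = 30.8382946       # mean difference Group3 - Group1, Table 8
n = 10                  # observations per group
q = diff / math.sqrt(ms_within / n)
assert abs(q - math.sqrt(2 * f_stat)) < 1e-3
```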
Table 9. The signs of H ( α , k , n − k ) for selected α depending on k.

| α | H ( α , k , n − k ) > 0 | H ( α , k , n − k ) < 0 |
|---|---|---|
| 0.005 | 3 ≤ k ≤ 10 | k ≥ 14 |
| 0.01 | 3 ≤ k ≤ 10 | k ≥ 14 |
| 0.025 | 3 ≤ k ≤ 10 | k ≥ 13 |
| 0.05 | 3 ≤ k ≤ 10 | k ≥ 12 |
| 0.1 | 3 ≤ k ≤ 10 | k ≥ 12 |
| 0.25 | 3 ≤ k ≤ 10 | k ≥ 11 |
| 0.5 | 3 ≤ k ≤ 9 | k ≥ 10 |
Table 10. Conditions for monotonic increasing of the studentized range Q ( α , 2 , ν ) .

| α | 0.005 | 0.01 | 0.025 | 0.05 | 0.1 | 0.25 | 0.5 |
|---|---|---|---|---|---|---|---|
| ν | ν ≥ 7 | ν ≥ 5 | ν ≥ 4 | ν ≥ 3 | ν ≥ 2 | ν ≥ 1 | ν ≥ 1 |
Table 11. Generated random sample for Case (j).

| X1 | X2 | Y |
|---|---|---|
| 1.713673333 | 0.891652019 | 1.718488057 |
| 0.932830925 | 0.353231823 | 1.311861467 |
| −0.053673724 | 1.132586717 | 1.903344806 |
| 1.055482137 | 0.248411619 | 1.582305067 |
| −0.248355435 | −0.174256727 | 2.607296494 |
| −0.004449867 | 0.115550588 | 2.352411276 |
| 0.086988258 | −0.833496007 | 2.558602277 |
| 0.687284914 | −0.417171685 | 1.721811264 |
| −0.253474712 | 0.045371123 | 1.982673543 |
| 0.135747949 | −0.145817805 | 2.309533234 |
Table 12. Regression output for the sample in Table 11.

| Coefficients | Estimate | Std. Error | t Value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 2.193398826 | 0.12112005 | 18.10929596 | 3.87188 × 10⁻⁷ |
| X1 | −0.397578417 | 0.173782994 | −2.28778667 | 0.05599 |
| X2 | −0.225853392 | 0.196597153 | −1.148813135 | 0.28837 |
Residual standard error: 0.321371159 on 7 degrees of freedom
Multiple R-squared: 0.578426902, Adjusted R-squared: 0.457977446
F-statistic: 4.802237548 on 2 and 7 DF, p-value: 0.04865
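In such a regression summary, each t value is the ratio of the estimate to its standard error; the Table 12 rows can be checked in a few lines of Python:

```python
# t Value = Estimate / Std. Error; rows taken from Table 12.
rows = [
    (2.193398826, 0.12112005, 18.10929596),    # (Intercept)
    (-0.397578417, 0.173782994, -2.28778667),  # X1
    (-0.225853392, 0.196597153, -1.148813135), # X2
]
for est, se, t in rows:
    assert abs(est / se - t) < 1e-5
```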
Table 13. Generated random sample for Case ( j j ).

| X1 | X2 | Y |
|---|---|---|
| 1.173699045 | 1.507797593 | 1.693611518 |
| 1.527866588 | 1.204880159 | 1.719565524 |
| −0.237756887 | 0.321525784 | 2.313343543 |
| 0.424876707 | 0.372472796 | 2.215619921 |
| 0.155008273 | −0.382097849 | 1.752313506 |
| 0.078297635 | 0.202406996 | 1.985018225 |
| −0.739378749 | −1.77490523 | 1.280511608 |
| −0.325947264 | −0.170739193 | 1.751709441 |
| 0.057639294 | 0.025498039 | 2.285726127 |
| 0.317517151 | 0.439566564 | 1.809984615 |
Table 14. Regression output for the sample in Table 13.

| Coefficients | Estimate | Std. Error | t Value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 1.927174047 | 0.094242869 | 20.44901714 | 1.67739 × 10⁻⁷ |
| X1 | −0.534309107 | 0.266285323 | −2.006528569 | 0.08480 |
| X2 | 0.478129512 | 0.20172263 | 2.37023239 | 0.04959 |
Residual standard error: 0.271926659 on 7 degrees of freedom
Multiple R-squared: 0.445574672, Adjusted R-squared: 0.287167435560639
F-statistic: 2.812842909 on 2 and 7 DF, p-value: 0.12690
Table 15. Another generated random sample for Case ( j j ).

| X1 | X2 | Y |
|---|---|---|
| 1.568562319 | 0.927834903 | 1.612462698 |
| 1.48286001 | 1.09773946 | 2.466033052 |
| −0.573115658 | 0.981537183 | 2.518881417 |
| −0.050008016 | −0.49329821 | 1.301806858 |
| 0.165268254 | −0.397500853 | 1.436310825 |
| −0.306203404 | −0.193130393 | 2.072432714 |
| −0.399941489 | −0.096035236 | 1.573998771 |
| 0.21069356 | 0.603984432 | 1.827021014 |
| 0.431810105 | −0.383312909 | 1.933014077 |
| 0.080628207 | 0.231611299 | 2.002964703 |
Table 16. Regression output for the sample in Table 15.

| Coefficients | Estimate | Std. Error | t Value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 1.8040579 | 0.1149434 | 15.69519 | 1.0317 × 10⁻⁶ |
| X1 | −0.1814855 | 0.1720305 | −1.05496 | 0.32649 |
| X2 | 0.5168505 | 0.2013693 | 2.56668 | 0.03719 |
Residual standard error: 0.3324189 on 7 degrees of freedom
Multiple R-squared: 0.486189, Adjusted R-squared: 0.3393858
F-statistic: 3.311843 on 2 and 7 DF, p-value: 0.09723
Table 17. Regression output for the generated random sample for Case ( j j ) with independent variable X 1 only.

| Coefficients | Estimate | Std. Error | t Value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 1.86701480 | 0.14635023 | 12.75717 | 1.3432 × 10⁻⁶ |
| X1 | 0.02864453 | 0.19718444 | 0.14527 | 0.88809 |
Residual standard error: 0.4332275 on 8 degrees of freedom
Multiple R-squared: 0.002631, Adjusted R-squared: −0.122040
F-statistic: 0.0211027 on 1 and 8 DF, p-value: 0.8880929
Table 18. Regression output for the sample in Table 15 with independent variable X 2 only.

| Coefficients | Estimate | Std. Error | t Value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 1.7797247 | 0.1133974 | 15.69458 | 2.7117 × 10⁻⁷ |
| X2 | 0.4157529 | 0.1783505 | 2.33110 | 0.04808 |
Residual standard error: 0.3347572 on 8 degrees of freedom
Multiple R-squared: 0.404497, Adjusted R-squared: 0.330059
F-statistic: 5.434026 on 1 and 8 DF, p-value: 0.04808
