1. Introduction
The tests developed here build on and extend the notion of contrasts. In undergraduate courses and texts for users of statistics, contrasts are often met as a means to better understand a significant main effect by making comparisons between the factor level means. Such texts include [1,2] (see, for example, p. 477), and [3] (see Section 12.8); a more extensive account is given in [4], Section 3.2. Although occasionally ordered levels are mentioned, most accounts focus on when the levels of the factor are unordered. See, for example, [5].
The idea of orthogonal contrasts is to decompose a statistic, such as an ANOVA sum of squares, into components that may be used to detect alternatives in important and distinct subsets of the parameter space. The orthogonality ensures that these components are independent. However, we have found that most accounts do not consider unbalanced designs where there are different numbers of observations in the levels of the factors; they rarely give any detail of when the factor levels are ordered, nor do they discuss contrasts for interactions.
The tests proposed here seek to extend the benefits of contrasts by aiming to detect alternatives in different, important subsets of the parameter space. While we focus on two-factor models, extensions to multifactor ANOVA models are certainly possible. However, they are not considered here. We view the processes we propose as exploratory data analysis, with each component providing an input. Too many inputs would typically be too difficult to incorporate into an overall picture. We return to this issue in Section 7.
In the next section we derive orthonormal contrasts for main effect tests. Importantly, our definition of contrasts allows for unbalanced designs. Although almost part of statistical folklore, these parametric tests are, we believe, unfamiliar to many users of statistics. When the levels of a factor are ordered, our contrast-based tests allow testing for polynomial effects in the factor levels, such as increasing means and umbrella effects. When the levels of a factor are not ordered, we recommend tests based on pairwise comparisons. These involve non-orthogonal comparisons and, we believe, give useful objective insights for data analysts.
When considering tests for interaction effects, we note that in unbalanced ANOVAs, tests for main and interaction effects use regression methods that will be unfamiliar to some data analysts. Instead, we observe that for balanced designs, an important part of the definition of the interaction sum of squares involves a quantity that we call a coefficient. The coefficient definition can readily be extended to unbalanced designs and used in focused nonparametric tests for aspects of the interaction effect. We give examples where the levels of neither, one, or two factors are ordered.
Most importantly, the nonparametric interaction tests developed here allow objective assessment of effects usually gleaned subjectively from interaction plots. We contend that both are important and provide useful insights in data analysis.
2. The Main Effects Test Statistics
The parametric model of interest here is a multifactor fixed-effects ANOVA, but as previously indicated, we focus only on two factors, A and B, say. In Section 3 the model assumes there is an AB interaction. Suppose $Y_{ijk}$ is the $k$th of $n_{ij} > 1$ observations of the $i$th of $r$ levels of factor A and the $j$th of $c$ levels of factor B. The design is balanced if all $n_{ij}$ are equal. All observations are mutually independent and normally distributed with constant variance $\sigma^2$.
We now focus on parametric tests for main effects. Although the design may have multiple factors, we focus on one only, say A. Suppose $Y_{ij}$ is the $j$th of $n_i$ observations of the $i$th of $t$ levels of factor A. There are $n = \sum_{i=1}^{t} n_i$ observations in all. A design is balanced if all $n_i$ are equal. The completely randomized design is an example of a design that is in general unbalanced. The randomized block, balanced incomplete block, and Latin square designs are all balanced.
At this point the levels of A may or may not be ordered. Write $Y_{i\cdot}$ for the sum of the observations for level $i$ and $\bar{Y}_{i\cdot}$ for the mean of the same.
Under the null hypothesis of no factor A effect, the unconditional factor A sum of squares, $SS_F = \sum_{i=1}^{t} n_i(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 = \sum_{i=1}^{t} Y_{i\cdot}^2/n_i - Y_{\cdot\cdot}^2/n$, has the $\sigma^2\chi^2_{t-1}$ distribution. As usual, the dot notation denotes summation. Note that unbalanced multifactor designs are not in general orthogonal. Thus, in R, the first factor called is unconditional, and the second is conditional on the first factor called. Subsequently, here we only work with the unconditional factor A sum of squares.
Put, for $i = 1, \ldots, t$, $p_i = n_i/n$ and $Z_i = \sqrt{n_i}\,(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})$, and put $Z = (Z_i)$ and $D = \mathrm{diag}(\sqrt{p_i})$. Suppose $H$ is a $t \times t$ orthogonal matrix with columns $h_1, \ldots, h_t$, with $h_t = D 1_t = (\sqrt{p_i})$. Put $H = (h_1 | \cdots | h_t)$, so that $H H^T = h_1 h_1^T + \cdots + h_t h_t^T = I_t$. Then, since the $Z_i$ are centred in the sense that $h_t^T Z = \sum_{i=1}^{t} \sqrt{p_i}\, Z_i = 0$,

$$SS_F = Z^T Z = Z^T H H^T Z = \sum_{r=1}^{t-1} (h_r^T Z)^2.$$
We now give two constructions: first for the levels of factor A being ordered, and second, those levels being unordered.
If the levels of the factor are ordered, one choice for $H$ is to first construct $G$ with columns $g_1, \ldots, g_t$, with $g_r$ the orthonormal polynomial of degree $r$ on the weight function $(p_i)$. The degree zero polynomial is taken to be identically one and is specified as the $t$th: $g_t = 1_t$. Now, as $\sum_i p_i g_{ri} g_{si} = \delta_{rs}$ (the Kronecker delta), if, as above, $D = \mathrm{diag}(\sqrt{p_i})$, then $G^T D^2 G = I_t$ and $H = DG$ is an orthogonal matrix: $H = (h_1 | \cdots | h_t) = (Dg_1 | \cdots | Dg_t)$, so that $h_r = Dg_r$.
Now, as above, $SS_F = \sum_{r=1}^{t-1} (h_r^T Z)^2$. However, for $r = 1, \ldots, t-1$,

$$h_r^T Z = \sum_{i=1}^{t} \sqrt{p_i}\, g_{ri} Z_i = \sqrt{n} \sum_{i=1}^{t} p_i g_{ri} \bar{Y}_{i\cdot},$$

so that $SS_F = n \sum_{r=1}^{t-1} \left( \sum_{i=1}^{t} p_i g_{ri} \bar{Y}_{i\cdot} \right)^2$. (See [6] for a definition of contrast in unbalanced designs.) The $h_r^T Z$ are contrasts because, as just shown, $\sum_i \sqrt{p_i}\, Z_i = 0$ (the $Z_i$ are centred) and $\sum_i p_i g_{ri} = \sum_i p_i g_{ri} g_{ti} = 0$ for $r = 1, \ldots, t-1$: the contrast coefficients sum to zero. The $(h_r^T Z)^2$ decompose $SS_F$ into $t-1$ orthonormal contrasts that have the same interpretation as in the balanced case, for which again see [6].
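To make the construction concrete, the following R sketch (all object names are ours, and the level scores are taken as equally spaced) builds the orthonormal polynomials on the weight function $(p_i)$ via a QR decomposition and confirms that the squared contrasts sum to $SS_F$:

# Orthonormal polynomial contrasts on the weights p_i = n_i/n for an
# unbalanced one-way layout; a sketch, not the authors' code.
decompose_ssf <- function(y, f) {
  ni   <- tabulate(f); tl <- nlevels(f)            # tl is t, the number of levels
  p    <- ni / length(y)
  ybar <- tapply(y, f, mean)
  Z    <- as.vector(sqrt(ni) * (ybar - mean(y)))   # the centred Z_i
  V    <- outer(seq_len(tl), 0:(tl - 1), `^`)      # monomials in the level scores
  H    <- qr.Q(qr(diag(sqrt(p)) %*% V))            # columns: h_t (degree 0), h_1, ..., h_{t-1}
  contrasts <- drop(t(H[, -1]) %*% Z)              # the t - 1 polynomial contrasts
  c(SSF = sum(Z^2), sum_sq_contrasts = sum(contrasts^2))  # these agree
}

# Example: four levels with unequal replication.
set.seed(1)
f <- factor(rep(1:4, c(3, 5, 6, 6)))
decompose_ssf(rnorm(length(f)), f)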
Now suppose the levels of the factor are not ordered. We take $p_i = n_i/n$ for $i = 1, \ldots, t$, $h_t = (\sqrt{p_i})$, and $D = \mathrm{diag}(\sqrt{p_i})$. Then $h_t^T Z = \sum_i \sqrt{p_i}\, Z_i = 0$ and $SS_F = \sum_{r=1}^{t-1} (h_r^T Z)^2$, since, as before, the $Z_i$ are centred random variables. Thus, the $(h_r^T Z)^2$ decompose $SS_F$ into $t-1$ orthonormal components.
One choice for $H$ is to take $h_1 = (1, -1, 0, \ldots, 0)^T/\sqrt{2}$. Clearly $h_1^T Z = (Z_1 - Z_2)/\sqrt{2}$ gives a comparison of the first two levels of the factor. For it to be a contrast requires that the potential contrast coefficients, proportional to $\sqrt{p_1}$ and $-\sqrt{p_2}$, come to zero. Here $\sqrt{p_1} - \sqrt{p_2} \neq 0$ unless $p_1 = p_2$. Thus $h_1^T Z$ is not in general a contrast.
For $r = 2, \ldots, t-1$, the $r$th column of $H$ can be taken to have $r$ ones and zeros thereafter, and the Gram–Schmidt orthogonalization process applied. This gives a decomposition of $SS_F$ into $t-1$ orthonormal components, the first of which is the component based on $h_1$.
However, it is more convenient to proceed as in the balanced case, discussed in [7]. We construct a matrix $M$ with last row $(\sqrt{p_j})$ and, for all $r < s$, rows with $1/\sqrt{2}$ in the $r$th position, $-1/\sqrt{2}$ in the $s$th, and zeros elsewhere. As above, the elements of $MZ$ are not, in general, contrasts. Moreover, such a matrix will not be orthogonal, and so the sum of the squared comparisons is not the treatment sum of squares. However, the elements of $MZ$ give all the pairwise comparisons, and although these are not mutually orthonormal comparisons, their squares give the basis for an objective comparison between all pairs of the levels of the factor.
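A short R sketch (names ours) of the pairwise comparisons given by the non-trivial elements of $MZ$, under the $1/\sqrt{2}$ normalization used above:

# All pairwise comparisons (Z_r - Z_s)/sqrt(2) for an unbalanced one-way
# layout; a sketch, not the authors' code.
pairwise_comparisons <- function(y, f) {
  ni <- tabulate(f)
  Z  <- as.vector(sqrt(ni) * (tapply(y, f, mean) - mean(y)))
  rs <- combn(nlevels(f), 2)                       # all pairs with r < s
  setNames((Z[rs[1, ]] - Z[rs[2, ]]) / sqrt(2),
           paste(levels(f)[rs[1, ]], levels(f)[rs[2, ]], sep = " vs "))
}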
For parametric testing, whether the levels are ordered or not, we seek to test the null hypothesis that the expectation of the contrast/component is zero, usually against two-sided alternatives. This may be done parametrically by referring $(h_r^T Z)^2/\mathrm{ems}$ to the $F_{1,\mathrm{edf}}$ distribution, where ems is the error mean square and edf are the error degrees of freedom. See [7], Section 2 for a proof. When the levels of the factor of interest are ordered and the orthonormal polynomials are used, a significant result for $(h_i^T Z)^2$ suggests a degree $i$ effect in the levels of the factor. In most applications it would be expected that linear and quadratic effects are of most interest. In nonparametric testing, we use permutation testing to obtain $p$-values based on the $(h_r^T Z)^2$, often comparing them with the corresponding parametric $p$-values based on the $F$ distribution.
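In R, the parametric test of a single contrast can be carried out along the following lines; this is a sketch with hypothetical objects (contrast holding one value of $h_r^T Z$, and fit the fitted full model), not code from the source.

# Refer (h_r^T Z)^2 / ems to the F(1, edf) distribution.
# `contrast` and `fit` are assumed to exist; names are ours.
ems <- deviance(fit) / df.residual(fit)                       # error mean square
Fr  <- contrast^2 / ems                                       # contrast F statistic
pf(Fr, df1 = 1, df2 = df.residual(fit), lower.tail = FALSE)   # p-value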
Steroid production example. The data in Table 1 are given in [4], p. 224, exercise 10. The response is steroid production per 100 mg of gland per hour for each of two treatments with the glands taken from rats at four different stages of growth. We assume the stages are ordered, but the treatments are not.
The design is not equally replicated and therefore not orthogonal, so different calls in R give different output. We are interested in the call that gives the stages sum of squares in the form assumed in this section.
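One such call, sketched with hypothetical names (a data frame rats with columns steroid, stage, and treatment, which are ours rather than the source's), puts stage first so that its type I sum of squares is unconditional:

# Hypothetical data frame and variable names, not from the source.
# stage is called first: its type I sum of squares is then unconditional.
fit <- aov(steroid ~ stage * treatment, data = rats)
summary(fit)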
The ANOVA is given in Table 2. We note that the Shapiro–Wilk test for normality of the residuals had a p-value of 0.6287. There is no reason to doubt the parametric model.
The parametric p-values are 0.0751 for stages, and 0.1274, 0.0442, and 0.3456 for the orthonormal contrasts of degree one, two, and three, respectively.
We also calculated permutation test p-values: are they similar to the parametric p-values? Observed values of the F statistics for stages and their contrasts were calculated. Then residuals were calculated by removing treatment and interaction effects but not stage effects, since these are zero under the null hypothesis. See Appendix A to see how the residuals are defined. The residuals were then permuted, and for each permutation, the F statistics were calculated. The proportion that exceeded the observed values gave the required p-values.
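A minimal sketch of this permutation scheme in R (names hypothetical; the residuals e are assumed to have been computed as in Appendix A, with treatment and interaction effects removed and stage effects retained):

# Permutation p-value for the stage F statistic; a sketch using far fewer
# permutations than the 1,000,000 used below.
f_stage <- function(d) anova(lm(steroid ~ stage * treatment, data = d))["stage", "F value"]
f_obs <- f_stage(rats)
set.seed(1)
B <- 10000
f_perm <- replicate(B, {
  d <- rats
  d$steroid <- sample(e)          # permute the Appendix A residuals
  f_stage(d)
})
mean(f_perm >= f_obs)             # proportion exceeding the observed value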
With 1,000,000 permutations, we found p-values of 0.0755 for stages and 0.1275, 0.0445, and 0.3440 for the degree one, two, and three contrasts, respectively. These agree well with the parametric p-values.
Although the factor stage is not significant at the 0.05 level, its degree two contrast is. This is reflected in the cell mean plot in Figure 1: the treatment means in the plot have a clear quadratic shape across the stages. We also note that initially treatment 1 is roughly constant, then increases at stage 4, while initially treatment 2 decreases, then increases at stage 4. The significant interaction manifests in the two treatments behaving differently; their plots are not parallel. This, however, is a subjective conclusion.
Even when the parametric model is in serious doubt, the parametric p-values may still be substantially correct, as the following example shows.
Biomass example. The data in Table 3 are modified from data that at the time of writing could be found at [8]. The purpose of the modification was to achieve an unbalanced two-way fixed effects ANOVA, and to that end six randomly selected responses were removed from the original data set.
Output for the parametric ANOVA, both with fertilizer preceding irrigation and with irrigation preceding fertilizer, is given in Table 4 and Table 5. The conclusions for both analyses are similar: both main effects and the interaction are all significant at levels less than 0.001. However, for both analyses, the Shapiro–Wilk test for normality of the ANOVA residuals returned a p-value of less than $2.618 \times 10^{-10}$. Thus the conclusions of the parametric analyses are not valid. For comparison purposes, we will proceed to analyze these data both parametrically and nonparametrically.
Subjective conclusions can be drawn from the cell mean plots in Figure 2. The left hand panel plots yield against fertilizer, while the right hand panel plots yield against irrigation.
The levels of the fertilizer factor are clearly ordered. We find a strong parametric fertilizer main effect, with contrast p-values of less than 0.0001, 0.8756, and 0.0040 for linear, quadratic, and cubic effects, respectively. There is strong evidence of both linear and cubic effects, but not of a quadratic effect.
We assume the irrigation levels are ordered, from A to B to C and then to D, perhaps by quantity of water released. There is evidence of a strong parametric irrigation effect, with a p-value less than 0.0001. The parametric irrigation contrast p-values are less than 0.0001, 0.0965, and less than 0.0001 for linear, quadratic, and cubic effects, respectively. There is strong evidence of both linear and cubic effects, with only weak evidence, at the 0.1 level but not at the 0.05 level, of a quadratic effect.
If we take the irrigation levels to not be ordered, pairwise comparisons are appropriate. The irrigation means are 3100.6667, 3080.6111, 335.8947, and 126.1053 for A, B, C, and D, respectively. All pairwise parametric p-values are less than 0.0001 except for A and B, with a p-value of 0.7233, and C and D, with a p-value of 0.0003. These are consistent with the left panel cell mean plot of yield against fertilizer in Figure 2.
Using 1,000,000 permutations, p-values were calculated for fertilizer main effects. Residuals were calculated by removing the irrigation and interaction effects but not the fertilizer effects, since these are zero under the null hypothesis. We find linear, quadratic, and cubic effect p-values for the contrasts of less than 0.0001, 0.8786, and 0.0044, respectively. The agreement with the parametric p-values is, given the rejection of the parametric model, surprisingly good.
For irrigation, we considered levels both ordered and not. For ordered levels and 1,000,000 permutations, the overall p-value was less than 0.0001, while for contrasts, they were less than 0.0001 for linear effects, 0.0960 for quadratic effects, and less than 0.0001 for cubic effects. For levels not ordered and 1,000,000 permutations, the overall p-value was less than 0.0001, while all pairwise comparisons returned p-values less than 0.0001 except for comparing A and B, with a p-value of 0.7226, and comparing C and D, with a p-value of 0.0004.
For clearer comparison of the parametric and nonparametric p-values, these are given in Table 6 and Table 7.
Even if the parametric model is seriously compromised, the parametric p-values may be indicative. However, to avoid invalid inference, the nonparametric p-values should always be calculated.
3. The Interaction Effect Test Statistics
The parametric model of interest here is a multifactor fixed-effects ANOVA, but as previously indicated, we focus only on two factors, A and B, say. The model assumes there is an AB interaction. Suppose $Y_{ijk}$ is the $k$th of $n_{ij} > 1$ observations of the $i$th of $r$ levels of factor A and the $j$th of $c$ levels of factor B. The design is balanced if all $n_{ij}$ are equal. All observations are mutually independent and normally distributed with constant variance $\sigma^2$. Write $n_{i\cdot} = \sum_j n_{ij}$, $n = \sum_{i,j} n_{ij}$, $\bar{Y}_{ij\cdot}$ for the mean of the observations for level $i$ of factor A and level $j$ of factor B, $\bar{Y}_{i\cdot\cdot}$ for the mean of the observations for level $i$ of factor A, $\bar{Y}_{\cdot j\cdot}$ for the mean of the observations for level $j$ of factor B, and $\bar{Y}_{\cdot\cdot\cdot}$ for the mean of all of the observations.
We now focus on the balanced parametric model with $n_{ij} = s$ for all $i$ and $j$. The interaction sum of squares is

$$SS_{AB} = s \sum_{i=1}^{r} \sum_{j=1}^{c} \left( \bar{Y}_{ij\cdot} - \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot j\cdot} + \bar{Y}_{\cdot\cdot\cdot} \right)^2.$$

Note that $\bar{Y}_{ij\cdot} - \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot j\cdot} + \bar{Y}_{\cdot\cdot\cdot}$ is the $(i,j)$th aligned cell mean. Alignment is a tool sometimes used to strip main effects from the observations. See, for example, [9], Section 9.4. Put $Z_I = (Z_{ij})$, in which $Z_{ij} = \sqrt{s}\left( \bar{Y}_{ij\cdot} - \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot j\cdot} + \bar{Y}_{\cdot\cdot\cdot} \right)$.
It was shown in [6] that

$$SS_{AB} = \sum_{t=1}^{(r-1)(c-1)} (m_t^T Z_I)^2,$$

in which $m_1, \ldots, m_{(r-1)(c-1)}$ are the eigenvectors corresponding to the eigenvalues 1 of an idempotent matrix with rank $(r-1)(c-1)$, and $t$ indexes the cells. The $m_t$ are not uniquely defined; it is only required that they are mutually orthonormal and orthogonal to $1_{rc}$. A useful approach is to first suppose A is any $r \times r$ idempotent matrix of rank $r-1$ and B is any $c \times c$ idempotent matrix of rank $c-1$. Then A has $r-1$ eigenvalues of one and one eigenvalue of zero, and B has $c-1$ eigenvalues of one and one eigenvalue of zero. It follows that if $\otimes$ represents the Kronecker product, then $A \otimes B$ is idempotent and has $(r-1)(c-1)$ eigenvalues of one and $r + c - 1$ eigenvalues of zero. In particular, suppose the eigenvectors of A corresponding to the eigenvalue one are $a_1, \ldots, a_{r-1}$ and the eigenvectors of B corresponding to the eigenvalue one are $b_1, \ldots, b_{c-1}$. The $a_u \otimes b_v$, $u = 1, \ldots, r-1$ and $v = 1, \ldots, c-1$, are an appropriate choice for the eigenvectors of $A \otimes B$ corresponding to the eigenvalue 1 and are mutually orthonormal.
Take $m_t = a_u \otimes b_v$, where $t$ indexes $(u, v)$, $u = 1, \ldots, r-1$, and $v = 1, \ldots, c-1$. Now define $Z$ to be the $r \times c$ matrix $(Z_{ij})$, quite distinct from but obviously related to $Z_I$. Then

$$m_t^T Z_I = (a_u \otimes b_v)^T Z_I = a_u^T Z b_v.$$
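As a quick numerical check of this identity (a sketch with our own names; contr.poly supplies orthonormal vectors orthogonal to the constant, as the construction requires):

# Check that m_t^T Z_I = a_u^T Z b_v when m_t = kronecker(a_u, b_v) and
# Z_I lists the cells of Z row by row; names ours.
nr <- 3; nc <- 4                                 # r and c in the notation above
a_u <- contr.poly(nr)[, 1]
b_v <- contr.poly(nc)[, 2]
Z   <- matrix(rnorm(nr * nc), nr, nc)
Z_I <- as.vector(t(Z))                           # row-major vectorization
all.equal(sum(kronecker(a_u, b_v) * Z_I),
          drop(t(a_u) %*% Z %*% b_v))            # TRUE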
Parametric tests for testing $E[a_u^T Z b_v] = 0$ against its two-sided negation can be based on $F_t = (a_u^T Z b_v)^2 / \{SS_E/\mathrm{edf}\}$, which has the $F_{1,\mathrm{edf}}$ distribution. The expression here for $F_t$ seems to be the most intuitively appealing form of the contrast test statistic.
A coherent interpretation of $a_u^T Z b_v$ is that it is the projection of the aligned cell means onto the direction of the parameter space spanned by $a_u \otimes b_v$, or, alternatively, it gives the degree $(u, v)$ interaction effect. We call these quantities coefficients, and distinguish the degree-degree, level-degree, and level-level coefficients depending on whether the levels of the factors are ordered or not.
Whatever is assumed about the ordering of the levels of the factors, a test for an overall interaction effect can be based on the sum of the Ft.
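For a balanced design with both factors ordered, the degree-degree coefficients can be computed along the following lines (a sketch with hypothetical names; contr.poly supplies orthonormal polynomial contrast vectors):

# Matrix of degree-degree coefficients a_u^T Z b_v for a balanced r x c
# design; a sketch, not the authors' code.
interaction_coefs <- function(y, A, B) {
  r <- nlevels(A); c <- nlevels(B)
  s <- length(y) / (r * c)                       # common cell count (balanced)
  cellm   <- tapply(y, list(A, B), mean)         # r x c matrix of cell means
  aligned <- sweep(sweep(cellm, 1, rowMeans(cellm)), 2, colMeans(cellm)) +
    mean(cellm)                                  # aligned cell means
  Z <- sqrt(s) * aligned
  t(contr.poly(r)) %*% Z %*% contr.poly(c)       # (r - 1) x (c - 1) coefficients
}

Squaring any entry of the resulting matrix and dividing by $SS_E/\mathrm{edf}$ gives the corresponding $F_t$.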
For unbalanced designs with $n_{ij} \neq s$ for some $i$ and $j$, we instead define $Z_{ij} = \sqrt{n_{ij}}\left( \bar{Y}_{ij\cdot} - \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot j\cdot} + \bar{Y}_{\cdot\cdot\cdot} \right)$ and proceed as in the balanced case, except that now the distribution of $F_t$ is not clear, so instead we use permutation testing to obtain $p$-values.
4. Crop Yield Example
The data set in Table 8 was available at the time of writing at [10], where it is analyzed using regression techniques. No scenario is given, but it is apparent that three crops are subjected to two fertilizers and the yields noted. All other influences are assumed to have been randomized. Clearly the levels of neither factor are ordered.
The data are analyzed in R, which uses type I sums of squares, for which the order of the factors in the call is important. Interestingly, the p-values resulting from calling crop before fertilizer and from calling fertilizer before crop are the same, although there are slight differences in some of the test statistics. The ANOVA with crop called before fertilizer is given in Table 9.
From the output, neither main effect is significant at even the 0.1 level, with a crop p-value of 0.6427 and a fertilizer p-value of 0.3163. The interaction p-value of 0.0383 is significant at the 0.05 level, while the Shapiro–Wilk normality test returns a p-value of 0.6226. There is no apparent reason to doubt the parametric p-values.
Based on 1,000,000 permutations of the original data, the fertilizer p-value is 0.3128, while for the crop main effect the p-value is 0.6370, with component p-values of 0.3493 comparing corn and soy, 0.6800 comparing corn and rice, and 0.5947 comparing soy and rice. These compare with the corresponding parametric component p-values of 0.3530, 0.6832, and 0.5976. Thus there is good agreement between the parametric and nonparametric p-values. No significant pairwise differences are found, as might be expected with such a large crop main effect p-value. Note that the crop p-value is calculated based on the crop F statistic in the parametric model.
Figure 3 gives cell mean plots of these data. In the left panel the fertilizer lines are clearly not parallel, but that is a subjective judgement. The right panel gives the plot without the interference of the main effects. The parametric analysis suggests there is weak evidence of an interaction, but it would be useful to have focused component tests to see whether there is an effect that is masked by non-significant effects.
If we align by removing the main effects, permutation testing for level-level effects based on 1,000,000 permutations gave p-values of 0.0276, 0.9695, and 0.0254 for pairwise effects reflecting effects due to soy and rice, corn and rice, and corn and soy, respectively. The corresponding p-values without aligning, just permuting the raw data, were 0.0281, 0.9697, and 0.0252 for the pairwise effects: very similar. This is not surprising, since the main effects were relatively small.
Apparently corn and rice do not contribute to the interaction effect, but both soy and rice, and finally corn and soy, do. This is reflected in the Figure 3 plots if the crops are visualized two at a time. Subjectively it is clear that soy is responding to the fertilizer blends in the opposite manner to corn and rice.
7. Conclusions
We noted in the introduction that contrasts “detect alternatives in different, important subsets of the parameter space”. We developed contrasts for main effects in unbalanced ANOVA designs and showed how to test for polynomial effects when the factor levels are ordered and for pairwise effects when they are not. Although here we focus on these choices, we note that other choices are possible. For interaction effects, a quantity in the interaction sum of squares in balanced designs was generalized to unbalanced designs and used to test for ‘important subsets of the parameter space’. Our preferred test statistics are based on contrasts and components with F distributions when the design is balanced, but for interaction effects in unbalanced designs, we use permutation testing.
The main advantages of our approach are that we are able to analyze unbalanced designs, to accommodate factors whose levels are either ordered or unordered, and to objectively assess interaction effects that are usually gleaned subjectively from interaction plots.
In particular, Ref. [11] noted that "Since the data obtained in many areas of psychological inquiry are … frequently unbalanced, researchers using the conventional procedure will erroneously claim treatment effects when none are present …". We hope our treatment of unbalanced designs will appeal to these researchers.
In two-factor designs in which the levels of both factors are ordered, interpreting coefficient effects can be problematic. While what we might call a degree (1, 2) effect is an umbrella effect, it is not clear how to interpret, for example, a degree (2, 3) effect, beyond indicating a complex interaction effect.
As noted in the introduction, we have not developed coefficient tests for multiway interactions. Many, as just noted, would not be interpretable. Moreover, when testing at the 0.05 level, 5% of the effects would be significant even when there was no interaction effect. Even when a significant effect is interpretable, it may be spurious, one of the 5%. In two-way ANOVAs the interaction degrees of freedom are usually far fewer, and a single significant interaction effect may either be confirmed from the cell mean plot or discounted from it, and thus identified as a complex and unimportant effect.
In the examples we have examined in which both the parametric and nonparametric analyses are available, we have found that the parametric and nonparametric p-values for a particular effect are almost always similar. This suggests that the nonparametric tests are not noticeably inferior to their parametric competitors. However, even an indicative power study would be a sizable undertaking, and we leave that for another time.
In unbalanced ANOVAs, tests for main and interaction effects use regression methods. The interaction tests that we propose are routine extensions of tests for balanced ANOVAs, and we believe they will be especially appealing to data analysts familiar only with balanced ANOVA testing.