2. Materials & Methods: Meta-Analysis
In this section, the core concepts of meta-analysis are briefly introduced. The remainder of this paper focuses on illustrating how theory-based hypotheses, often containing inequality constraints, can be evaluated using null hypothesis testing, the AIC, and the GORICA.
Meta-analysis aims to aggregate evidence from several different studies, usually in the form of a statistical parameter or set of parameters (e.g., an effect size measure or standardized regression parameters) estimated from different samples, to come to an overall estimate of one or more population parameters [1,2]. In principle, meta-analysis combines evidence from different studies by taking a weighted average of the parameter estimates, where the study weights reflect the amount of information or certainty in a given estimate. This weighting procedure can be applied univariately (via weighted least squares, using the inverse of the variance of one or more estimates) or multivariately (via generalized least squares, using the inverse of the covariance matrix of the estimates) to take into account dependencies between the parameters that are the target of the meta-analysis. The latter is comparable to the method called parameter-based meta-analytic structural equation modeling (MASEM) (cf. [7]). For a more in-depth treatment of these weighting procedures, the reader is referred to Becker & Wu [2] and Demidenko and colleagues [8].
Another distinction in meta-analytic techniques can be made based on the assumed underlying population parameter model: researchers can assume a single underlying parameter, in a fixed-effect analysis (also referred to as a common-effect or equal-effect model), or a distribution of population parameters, in a random-effects model. The latter reduces to the former if all variance in each of the parameter estimates is assumed to come from sampling variance alone. Note that a random-effects model should be used to generalize meta-analytic results beyond the primary studies included in the meta-analysis. For an elaboration on meta-analytic techniques, see for instance Borenstein and colleagues [1] and Becker & Wu [2].
Additionally, one can include moderators, that is, predictors on the study level, in a (multiple) meta-regression. Moderators can be used to explain some of the heterogeneity variance in the meta-analysis (which may be due to differences in study designs or, as another example, publication year). Note that subgroup analysis is a special case in which the predictors are dummy/grouping/categorical variables. In this model, it is assumed that not all studies stem from the same population (because of different study characteristics) and that the true overall effects differ per subgroup (or vary with the moderator values). Such a model is also referred to as a moderator analysis, a (multilevel) fixed-effects (plural) model, or a (multilevel) mixed-effects model.
All these types of meta-analysis models are captured by the following equation:

y_s = θ + β x_s + u_s + ε_s,

where y_s is the observed effect size(s) of Study s (which can consist of multiple elements), x_s is the moderator(s), ε_s is the sampling error (how much the effect size deviates from its true effect), and u_s is the random effect denoting between-study heterogeneity (implying that the true effect size comes from an overarching distribution of effect sizes), which is independent from the sampling error. In case there are no moderators, 'β x_s' is left out of the equation or, stated otherwise, β is assumed to be 0. In case there is no random effect (i.e., there is solely a fixed effect), 'u_s' is left out of the equation; stated otherwise, the variance of u_s (often referred to as τ²) is assumed to be zero.
In principle, meta-analysis takes a weighted average of the effect-size estimates, where the contribution of each study is weighted by the inverse of the variance of the estimate (or the inverse of the covariance matrix of the estimates), that is, the amount of information or certainty in the estimate(s). To be more precise, the study contribution is weighted by the inverse of the sampling variance (matrix) in the case of a fixed-effect model and by the inverse of the total variance (matrix) in the case of a random-effects model.
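As a small numerical illustration (with made-up effect sizes and sampling variances, not taken from any dataset in this paper), the weighted average and the corresponding weights can be written as:

```latex
\hat{\theta} \;=\; \frac{\sum_s w_s\, y_s}{\sum_s w_s},
\qquad
w_s = \frac{1}{v_s}\ \ \text{(fixed-effect)}
\quad\text{or}\quad
w_s = \frac{1}{v_s + \hat{\tau}^2}\ \ \text{(random-effects)}.
```

For example, for two studies with y_1 = 0.4, v_1 = 0.01 and y_2 = 0.2, v_2 = 0.04, the fixed-effect weights are w_1 = 100 and w_2 = 25, so θ̂ = (100 · 0.4 + 25 · 0.2)/125 = 0.36: the more precise study dominates the pooled estimate.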
Software
There are several software programs to perform a meta-analysis. In this manuscript, I will make use of the R package metafor [
9].
Next, I will describe some currently used methods to test (null) hypotheses, the differences between these tested (null) hypotheses and the ones of interest, and how the GORICA can help out.
4. Materials & Methods: GORICA
By using the generalized order-restricted information criterion (GORIC) [4,5] or its approximation (GORICA) [3], researchers' theories can be examined directly by evaluating theory-based hypotheses, like H: θ1 > θ2 or H: θ1 > θ2 > θ3. Thus, the GORIC and GORICA can evaluate theory-based hypotheses containing order restrictions on the parameters ("<" and/or ">") besides equality restrictions ("="). They can evaluate hypotheses with restrictions on linear combinations of parameters (notably, a restriction regarding the square of, say, θ1 is not possible, but this is oftentimes also not of interest).
The GORIC is an extension of the AIC (and, thus, also an estimate of the Kullback–Leibler discrepancy) and is of the form

GORIC = −2 · log-likelihood + 2 · penalty.

This expression is based on the order-restricted maximum likelihood (i.e., the maximum likelihood under the order restrictions in the hypothesis) and has a more general penalty expression (using so-called chi-bar-square weights) such that the order restrictions are properly accounted for (for more details, see Appendix C). The penalty equals, loosely speaking, the expected number of distinct parameters. For example, H: β1 > β2 represents 1.5 distinct regression parameters and not 2, as would be the case in the AIC (which would evaluate the unconstrained H: β1, β2). If there are solely equality constraints ("=") and/or no constraints (",") and, thus, no order restrictions, the GORIC reduces to the AIC.
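The value of 1.5 can be understood with a sketch (the exact chi-bar-square weights depend on the constraints and on the covariance matrix of the estimates; the 50/50 split below holds for a single inequality constraint such as β1 > β2):

```latex
\text{penalty} \;=\; 0.5 \cdot 1 \;+\; 0.5 \cdot 2 \;=\; 1.5,
```

that is, with probability 0.5 the constraint is active (leaving 1 distinct parameter at the boundary β1 = β2) and with probability 0.5 it is inactive (leaving 2 free parameters).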
The GORICA is an approximation method which eases the calculation of the GORIC for a broad range of models. It uses the fact that maximum likelihood estimates are asymptotically normally distributed:

GORICA = −2 · fit + 2 · penalty,

where 'fit' is the (order-restricted) fit part (different from that of the GORIC) and 'penalty' is the (order-restricted) complexity part, which has the same expression as that of the GORIC; both parts account for the order restrictions in the hypothesis. The fit part of the GORICA is (besides the order restrictions) based on the maximum likelihood estimates (MLEs) and their covariance matrix (which are a summary of the data) instead of the data themselves. Using the central limit theorem, the fit part of the GORICA is always based on the normal distribution of the MLEs, even if the data do not follow one (like in a logistic regression). The fit part is the maximum of this distribution given the restrictions in the theory-based hypothesis (i.e., it is an order-restricted maximum). The interested reader is referred to
Appendix C for details about the log likelihood and penalty parts of the GORIC and GORICA.
Because of the different fit expressions, the fit values of the GORIC and GORICA differ in an absolute sense but, asymptotically, not in the relative sense when comparing candidate hypotheses (cf. [3] and Appendix C). Therefore, the GORICA asymptotically selects the hypothesis with the smallest distance to the truth (while the GORICA value itself is not an estimate of the Kullback–Leibler discrepancy). The GORICA, like the AIC and GORIC, orders the hypotheses in the set, where the hypothesis with the smallest value is the preferred one.
Note that the GORICA only needs the estimates of the (unconstrained) parameters of interest and their covariance matrix. To be more precise, the estimates and their covariance matrix are only needed for the parameters included in the set of hypotheses, which often do not include variance components (for more details, see
Appendix C). Therefore, it can easily be applied to all types of meta-analyzed estimates (like effect-size measure estimates or standardized regression estimates), as long as their covariance matrix is also known (which is often, if not always, part of meta-analytic software).
4.1. GORICA Weights
To improve the interpretation of information criterion values, one should transform them into weights. The GORICA weight for Hypothesis H_m is calculated by:

w_m = exp(−GORICA_m / 2) / Σ_{m′=1}^{M} exp(−GORICA_{m′} / 2),

for m = 1, …, M, with M the total number of hypotheses in the set and GORICA_m the GORICA value of Hypothesis H_m. Notably, the GORICA weights and the GORIC weights are asymptotically the same. In the case of no order restrictions, the GORIC weights equal the Akaike weights (cf. [14]) and, thus, the GORICA weights asymptotically equal Akaike weights (and can thus be used as a proxy).
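As a worked example with made-up GORICA values of 10 and 14 for two hypotheses:

```latex
w_1 = \frac{e^{-10/2}}{e^{-10/2} + e^{-14/2}} = \frac{1}{1 + e^{-2}} \approx 0.881,
\qquad
w_2 = 1 - w_1 \approx 0.119,
\qquad
\frac{w_1}{w_2} = e^{2} \approx 7.39.
```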
Bear in mind that an information criterion (IC) can be written as

IC_m = −2 · log-likelihood_m + 2 · penalty_m.

Consequently, the ratio of two IC weights can be written as

w_m / w_{m′} = (likelihood_m / likelihood_{m′}) · exp(penalty_{m′} − penalty_m).

Hence, the IC weights are comparable to likelihood ratios, only now the complexity of the hypotheses/models is also taken into account.
The IC weights reflect the strength/likelihood/support of a hypothesis given the data and the set of hypotheses [4,14,15,16]. That is, w_m denotes the weight of evidence that Hypothesis H_m / Model m is the best hypothesis for the data at hand given the M candidate hypotheses. Thus, when inspecting another set of hypotheses, the weights for the same hypotheses may change.
For the comparison of two hypotheses, one can use the ratio of their weights, denoting the relative support of one hypothesis versus the other. For instance, GORICA weights for Hypothesis H_1 and a competing hypothesis H_2 of 0.875 and 0.125 mean that H_1 has 0.875/0.125 = 7 times more support than the competing hypothesis H_2. Stated otherwise, H_1 is 7 times more likely than the competing hypothesis H_2. For readers who are familiar with Bayesian statistics, the information criterion weights (e.g., GORICA weights) are comparable to posterior model probabilities and the relative support (i.e., ratio of weights) to Bayes factors. Note that the relative support (i.e., ratio of weights) does not depend on the full set of candidate models; it is the support for one hypothesis relative to one other hypothesis. For example, when researchers would include an additional hypothesis in the set, they may find weights of 0.7, 0.1, and 0.2, but the relative support of H_1 vs. H_2 still equals 7, namely 0.7/0.1 = 7.
4.2. Software
There are two R functions that can calculate GORICA values and weights: the goric function [17] in the restriktor package [18] and the gorica function in the gorica package [19]. These functions render the same results, of course, but there are some differences in functionality (cf. [20]). The goric function of the restriktor package is used in this paper.
The next section demonstrates, among other things, how the GORICA can be applied to meta-analyzed parameter estimates and gives insight into the (dis)advantages of using the GORICA. It also contains remarks for specific types of hypotheses, which are explicitly and more elaborately addressed in
Appendix B.
5. Results: Illustrations
I will make use of empirical meta-analytic studies, using datasets provided on the site of Wolfgang Viechtbauer (accessed on 1 June 2022). For several data sets, I ran a meta-analysis including null hypothesis testing in R [6] and applied both the AIC and the GORICA to the meta-analyzed estimates. An R script containing annotated R code is available on my GitHub page (accessed on 1 June 2022). These include meta-analyses regarding effect size measures and model parameters, both with and without moderators (i.e., examples for each of the four cases discussed in Appendix A). For brevity, I will next show (a part of) one of the meta-analyses (including R code). Based on that, I will give insight into the comparison of evaluating null hypotheses in meta-analysis (using null hypothesis testing and the AIC) with the proposed theory-based hypothesis evaluation using the GORICA.
In this section, the meta-analytic study of Berkey and colleagues [21] is used, in which surgical and non-surgical treatments for medium-severity periodontal disease are compared in five trials on two outcomes: attachment level (AL) and probing depth (PD) one year after treatment. In this meta-analysis, the effect size to be aggregated is the (raw) mean difference, where non-surgical treatment is the reference category. This means that a positive value indicates that surgery was more effective than non-surgical treatment. Note that the outcomes are negatively related: a positive estimate indicates effectiveness of surgery in either increasing the attachment level or decreasing the probing depth.
Meta-analysis is used to obtain an estimate of the population mean differences, where the latter will be denoted by the population parameters θ_AL and θ_PD. From theory, it might be expected that the first parameter is negative and the latter positive, leading to the following hypothesis of interest:

H1.1: θ_AL < 0, θ_PD > 0.

In the case that there might also be reason to believe that θ_PD could be negative, there is a competing hypothesis of interest:

H1.2: θ_AL < 0, θ_PD < 0.

Alternatively, it can be the case that there is theory regarding the size of the effect size, or one wants to compare it to a cut-off value for that specific effect size type (e.g., when using Cohen's d or Hedges' g). In this example, there might be theory regarding the size of the mean difference stating:

H2: |θ_AL| > 0.2, |θ_PD| > 0.2,

where |x| denotes the absolute value of x.
As another example, it might be expected from theory that the absolute size of θ_AL is smaller than that of θ_PD:

H3: |θ_AL| < |θ_PD|.
Note that, for a fair comparison of parameters, both outcomes should be on the same scale (as is the case here). Comparing sizes can be meaningful in the case of multiple outcomes, like here, or in the case of multiple (aggregated) standardized regression estimates (where one then compares the importance of the corresponding predictors).
Next, the input and (part of) the output for the meta-analysis are given. Subsequently, one can find the results and conclusions regarding Hypotheses H1.1 to H3 when performing null hypothesis testing, model selection using the AIC, and model selection using the GORICA.
5.1. Meta-Analysis
The R code to (multivariately) aggregate the estimates of the five trials with a meta-analysis using the metafor package:
library(metafor)
# data
data <- dat.berkey1998
# Covariance matrix, needed for multivariate meta-analysis
V <- bldiag(lapply(split(data[,c("v1i", "v2i")], data$trial), as.matrix))
# meta-analysis
metaan <- rma.mv(yi, V, mods = ~ outcome - 1, random = ~ outcome | trial,
                 struct="UN", data=data, method="ML")
print(metaan, digits=3)
Note that the function rma.mv() uses restricted maximum likelihood (REML) estimation by default. Hence, method = "ML" must be explicitly requested, which is done here to mimic [21].
This renders output; the part relevant for this paper is:
Model Results:
estimate se zval pval ci.lb ci.ub
outcomeAL -0.338 0.080 -4.237 <.001 -0.494 -0.182 ***
outcomePD 0.345 0.049 6.972 <.001 0.248 0.442 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Be aware that the theory-based hypotheses are (and should be) formulated before inspecting this output.
5.2. Null Hypothesis Testing
In the example where H1.1 (and possibly H1.2) is of interest, there are two null hypotheses, namely that surgery was equally effective as non-surgical treatment in increasing the attachment level and in decreasing the probing depth. This can be represented by the following two statistical hypotheses:

H0,AL: θ_AL = 0 and H0,PD: θ_PD = 0.

Classical null hypothesis tests are part of the meta-analysis output, as can be seen in the previous subsection. From this, it can be concluded that both hypotheses can be rejected (p < 0.001). Thus, there is a significant difference in the effectiveness of the treatments for both outcomes. When inspecting the sign of the meta-analyzed estimates, it can be concluded that, on average, surgery is more effective in decreasing probing depth (i.e., 0.35 > 0) and less effective in increasing the attachment level (i.e., −0.34 < 0) than non-surgery (i.e., non-surgery is more effective in increasing the attachment level). Note that one can also do a one-tailed test of, for instance, θ_PD > 0 (by dividing the p-value by two when the sign is in agreement with the expectation; otherwise, one should use 1 − p/2).
In the above, the two null hypotheses are not tested simultaneously. One may like to test them simultaneously by testing the following null hypothesis:

H0: θ_AL = 0, θ_PD = 0.

This null can be tested by inspecting elliptical/multivariate confidence intervals (which are based on the covariance matrix of the meta-analyzed estimates), but only the univariate confidence intervals are reported when using metafor. Alternatively, one can use a chi-square test. When using metafor, one can test H0 with the (Wald-type) chi-square test:
# R code
anova(metaan)
# Output:
Test of Moderators (coefficients 1:2):
QM(df = 2) = 155.7728, p-val < 0.0001
This omnibus test renders a p-value smaller than 0.001, indicating that H0 is rejected. If of interest, one can test a null hypothesis for a specific subset of parameters via the btt argument in the anova function (e.g., 'anova(metaan, btt = 1:2)' for the first two parameters). Note that one cannot test H1.1 this way, since there is no such one-tailed test.
Thus, even when testing the null hypotheses simultaneously (by testing H0 directly), this cannot address the hypotheses of interest (especially when there are more than two parameters). In this analysis with two parameters, it does give some insight into H1.1 and perhaps H1.2, when also inspecting the meta-analyzed estimates. Namely, H1.1 seems to be supported by the results, while H1.2 is not or only partly. However, it is not clear how large their support is. Additionally, one cannot compare (the support for) these two hypotheses.
When inspecting H2 and H3, one can test '|θ_AL| = 0.2, |θ_PD| = 0.2' and '|θ_AL| = |θ_PD|', respectively. One again has to additionally inspect the meta-analyzed estimates to obtain more insight into the hypotheses of interest. Additionally, one still cannot quantify the support for the hypotheses of interest: in general, a p-value and (the width of) elliptical confidence intervals do not quantify the support for any hypothesis.
As will be demonstrated next, model selection can be used to evaluate restrictions on parameters simultaneously and quantify the (relative) support of the hypotheses included in the set.
5.3. Null Hypothesis Selection (Using the AIC)
To evaluate and compare null hypotheses, one can conduct model selection using the AIC value. Because metafor cannot render the AIC values for the null hypotheses of interest in this example, I will use the GORICA weights as a proxy for the Akaike weights and refer to them as AIC weights. Note that, in this way, I can still compare model-selection evaluation of null hypotheses versus that of theory-based hypotheses.
To apply the GORICA to meta-analyzed estimates in R, these estimates and their covariance matrix should be extracted:
# Extract estimates from metaan, to be used in the goric function
est <- coef(metaan)
names(est) <- c("theta_AL", "theta_PD")
VCOV_est <- vcov(metaan)
In this analysis with two parameters, the number of possible equality hypotheses is not that large. Therefore, all possibilities will be inspected here. To prevent choosing the best from a set of weak/bad hypotheses (i.e., from a set of hypotheses not supported by the data), the unconstrained hypothesis (which does not restrict parameters) is included in the set as a fail-safe. Stated otherwise, when the equality-restricted hypotheses are not supported by the data, the unconstrained hypothesis will be the best of the set. See
Appendix B.2 for more information.
Notably, in the case of more than two parameters, one may want to reduce the number of hypotheses, as Burnham & Anderson [14] also recommend. This should then be based on theory. As a side note, in case there is theory, I expect a researcher to have expectations which will most probably include order restrictions, as is the case in Hypotheses H1.1 to H3 above (e.g., θ_AL < 0, θ_PD > 0). In such a case, one should evaluate these directly with the GORICA, as will become clear later.
Based on the three example sets of hypotheses specified, the following three hypothesis sets are used. The first set compares each of the two parameters to zero:

H01: θ_AL = 0, θ_PD = 0;  H02: θ_PD = 0;  H03: θ_AL = 0.

The second set compares each of the two parameters to 0.2 (in an absolute sense):

H04: |θ_AL| = 0.2, |θ_PD| = 0.2.

Note that, when using the GORICA, one can also evaluate restrictions regarding absolute values of parameters. The third set compares the parameters to each other (in an absolute sense):

H05: |θ_AL| = |θ_PD|.
In R, using the goric function in the restriktor package, these three hypotheses sets are formulated as follows:
# Set 1
H01 <- "theta_AL == 0; theta_PD == 0"
H02 <- "theta_PD == 0" # i.e., theta_AL, theta_PD == 0
H03 <- "theta_AL == 0" # i.e., theta_AL == 0, theta_PD
# Note: By default, the unconstrained hypothesis is added to the set.
# Set 2
H04 <- "abs(theta_AL) == 0.2; abs(theta_PD) == 0.2"
# Note: This can be compared to its complement, but because of the equality,
# the complement equals the unconstrained hypothesis.
# Set 3
H05 <- "abs(theta_AL) == abs(theta_PD)"
# Note: This can be compared to its complement, but because of the equality,
# the complement equals the unconstrained hypothesis.
The following R code should be used to evaluate these sets with the GORICA:
# Apply GORICA to obtain AIC weights
# Set 1
results_AIC_Set1 <- goric(est, VCOV = VCOV_est, H01, H02, H03,
type = "gorica")
results_AIC_Set1
# Set 2
results_AIC_Set2 <- goric(est, VCOV = VCOV_est, H04,
comparison = "complement", type = "gorica")
results_AIC_Set2
# Set 3
results_AIC_Set3 <- goric(est, VCOV = VCOV_est, H05,
comparison = "complement", type = "gorica")
results_AIC_Set3
The output next shows the AIC weights (‘gorica.weights’) for the first set of hypotheses without order restrictions. Note that the reported log likelihood (loglik) and penalty are based on the structural parameters, that is, the parameters included in the set of hypotheses. For more details, see
Appendix C.
Results:
model loglik penalty gorica gorica.weights
1 H01 -73.975 0.000 147.950 0.000
2 H02 -20.393 1.000 42.787 0.000
3 H03 -5.063 1.000 12.127 0.000
4 unconstrained 3.912 2.000 -3.824 1.000
---
From this, it is concluded that the unconstrained hypothesis is the best hypothesis, since it has the smallest IC value and the largest IC weight. It even has full support, reflected by an IC weight of 1. This implies that the other three hypotheses are weak hypotheses, that is, hypotheses not supported by the data. Note that these three hypotheses (i.e., H01 to H03) do not reflect any of the mentioned possible hypotheses of interest (i.e., H1.1 and H1.2), which are included in the unconstrained hypothesis. Based on the results, one can conclude that there is overwhelming support that both estimates are not zero. When inspecting the signs of the meta-analyzed estimates, something can be said about H1.1 and H1.2, but one cannot quantify the support for these hypotheses or their support relative to each other.
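As a check, the weights follow directly from the GORICA values in the output above; for instance, comparing the best-fitting equality hypothesis H03 to the unconstrained hypothesis:

```latex
\frac{w_{H03}}{w_{\text{unc}}} \;=\; \exp\!\left(-\frac{12.127 - (-3.824)}{2}\right) \;=\; \exp(-7.976) \;\approx\; 0.00034,
```

which is why the reported weights round to 0.000 and 1.000.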
The output next shows the AIC weights (‘gorica.weights’) for the second set of hypotheses without order restrictions:
Results:
model loglik penalty gorica gorica.weights
1 H04 -9.560 0.000 19.120 0.000
2 complement 3.912 2.000 -3.824 1.000
---
The order-restricted hypothesis ‘H04’ has 0.000 times more support
than its complement.
From this, it is concluded that there is no support for H04. Thus, there is full support for the unconstrained hypothesis, the complement of H04. However, the unconstrained hypothesis contains both values above and below 0.2 (in an absolute sense). Thus, one should perhaps inspect the size of the meta-analyzed estimates. Although this renders more insight, one still cannot quantify the support for the hypothesis of interest H2.
The output next shows the AIC weights (‘gorica.weights’) for the third set of hypotheses without order restrictions:
Results:
model loglik penalty gorica gorica.weights
1 H05 3.910 1.000 -5.820 0.731
2 complement 3.912 2.000 -3.824 0.269
---
The order-restricted hypothesis ‘H05’ has 2.713 times more support
than its complement.
From this, it is concluded that H05 has 2.71 times more support than the unconstrained hypothesis (which includes H05). This may not seem to be convincing evidence, but it is. This has to do with evaluating an equality which is almost true in the data (judged by the two almost equal log likelihood (loglik) values), as discussed in Appendix B.4. As was the case in null hypothesis testing, inspecting the meta-analyzed estimates renders more insight, but one still cannot quantify the support for the hypothesis of interest H3.
The examples above show that the AIC can quantify the support for the hypotheses in the set, where multiple parameters can be constrained simultaneously. Nevertheless, the hypotheses in the set are not per se the hypotheses a researcher is interested in. In such a case, a researcher still cannot quantify the support for the hypotheses of interest and/or compare their support. As will be shown next, this is possible when evaluating order-restricted hypotheses with the GORICA.
5.4. GORICA
The GORICA can evaluate hypotheses with (and without) order restrictions. Hence, it can directly evaluate Hypotheses H1.1 to H3 mentioned above (e.g., H1.1: θ_AL < 0, θ_PD > 0). To apply the GORICA to meta-analyzed estimates in R, these estimates and their covariance matrix should be extracted:
# Extract estimates from metaan, to be used in the goric function
est <- coef(metaan)
names(est) <- c("theta_AL", "theta_PD")
VCOV_est <- vcov(metaan)
Next, different sets of theory-based hypotheses are evaluated.
Hypothesis H1.1 can be evaluated against its complement (see Appendix B.2 for more information) with the following R code:
# Hypothesis of interest
H1.1 <- "theta_AL < 0; theta_PD > 0"
# Apply GORICA
set.seed(123) # set seed: to obtain the same results when you re-run it
results_H1.1 <- goric(est, VCOV = VCOV_est, H1.1,
comparison = "complement", type = "gorica")
results_H1.1
The corresponding output is:
Results:
model loglik penalty gorica gorica.weights
1 H1.1 3.912 0.799 -6.226 1.000
2 complement -5.063 1.701 13.529 0.000
---
The order-restricted hypothesis ‘H1.1’ has 19482.703 times more support
than its complement.
From this, it can be concluded that the hypothesis of interest has full support (when compared to any other ordering/theory). Thus, there is overwhelming support for the hypothesis of interest, stating that surgery is less effective in increasing the attachment level than non-surgery and more effective in decreasing the probing depth.
Now, assume a researcher is not only interested in H1.1 but also in the competing hypothesis H1.2. Since the two hypotheses do not cover the whole space / do not cover all possible orderings of the parameters, one should include the unconstrained hypothesis to prevent choosing a weak/bad hypothesis (see
Appendix B.2 for more information). This can be done using the following code:
# Hypothesis of interest
H1.1 <- "theta_AL < 0; theta_PD > 0"
H1.2 <- "theta_AL < 0; theta_PD < 0"
# Note: By default, the unconstrained hypothesis is added to the set.
# Apply GORICA
set.seed(123) # set seed: to obtain the same results when you re-run it
results_H1 <- goric(est, VCOV = VCOV_est, H1.1, H1.2,
type = "gorica")
results_H1
round(results_H1$ratio.gw, digits = 2)
This results in the following output:
Results:
model loglik penalty gorica gorica.weights
1 H1.1 3.912 0.799 -6.226 0.769
2 H1.2 -20.393 1.201 43.189 0.000
3 unconstrained 3.912 2.000 -3.824 0.231
> round(results_H1$ratio.gw, digits = 2)
vs. H1.1 vs. H1.2 vs. unconstrained
H1.1 1.0 53728634328 3.32
H1.2 0.0 1 0.00
unconstrained 0.3 16165185252 1.00
From this, one can conclude that H1.1 is not a weak hypothesis, since it has (3.32 > 1 times) more support than the unconstrained hypothesis, and that H1.2 is weak (since it has 0 or, to be more precise, 1/16,165,185,252 times the support of the unconstrained), that is, H1.2 is not supported by the data. Because at least one of the hypotheses of interest is not weak, these hypotheses can be meaningfully compared to each other, which is the interest in this example. It can be concluded that H1.1 has many times (namely, 53,728,634,328) the support of H1.2.
Note that the unconstrained hypothesis includes all possible hypotheses and, thus, also H1.1. Therefore, the support for the unconstrained hypothesis includes support for H1.1. If one would leave out the unconstrained hypothesis (which is only included as a safeguard), H1.1 would have full support (i.e., an IC weight of 1) here, which is also reflected by the relative GORICA weights (i.e., 53,728,634,328). Thus, the results for the set excluding the unconstrained hypothesis can be inferred from the one including it (for this, one could also use the R function IC.weights [22] when necessary).
In conclusion, there is overwhelming support for the hypothesis of interest H1.1, stating that surgery is less effective in increasing the attachment level than non-surgery and more effective in decreasing the probing depth, compared to H1.2, stating that surgery is less effective in increasing the attachment level and less effective in decreasing the probing depth than non-surgery.
In case H2 would be the hypothesis of interest, the following R code should be used:
# Hypothesis of interest
H2 <- "abs(theta_AL) > 0.2; abs(theta_PD) > 0.2"
# Apply GORICA
set.seed(123) # set seed: to obtain the same results when you re-run it
results_H2 <- goric(est, VCOV = VCOV_est, H2,
comparison = "complement", type = "gorica")
results_H2
This renders the following output:
Results:
model loglik penalty gorica gorica.weights
1 H2 3.912 0.799 -6.226 0.917
2 complement 2.417 1.701 -1.431 0.083
---
The order-restricted hypothesis ‘H2’ has 10.996 times more support
than its complement.
From this, it can be concluded that H2 is 11 times more supported than its complement (i.e., any other hypothesis/ordering). Thus, there is convincing support for the hypothesis of interest, which states that the mean difference between surgery and non-surgery is, in absolute value, larger than 0.2 for both outcomes.
In case H3 would be the hypothesis of interest, the following code should be used:
# Hypothesis of interest
H3 <- "abs(theta_AL) < abs(theta_PD)"
# Apply GORICA
set.seed(123) # set seed: to obtain the same results when you re-run it
results_H3 <- goric(est, VCOV = VCOV_est, H3,
comparison = "complement", type = "gorica")
results_H3
This renders the following output:
Results:
model loglik penalty gorica gorica.weights
1 H3 3.912 1.500 -4.824 0.500
2 complement 3.910 1.500 -4.820 0.500
---
The order-restricted hypothesis ‘H3’ has 1.002 times more support
than its complement.
From the output above, it can be concluded that both hypotheses, H3 and its complement, are (nearly) equally likely, that is, they have (about) the same support. The maximum log likelihood values are nearly the same and, consequently, the weights largely depend on the penalty values. In that case, the GORICA weights resemble or even equal the penalty weights, that is, the weights based on solely the penalty parts:
library(devtools)
install_github("rebeccakuiper/ICweights")
library(ICweights)
#?IC.weights
#citation("ICweights")
# Weights based on penalty values
# Note: the penalty part of the IC is 2*'penalty', i.e., 2*results_H3$result[,3]
IC.weights(2*results_H3$result[,3])$IC.weights
# This renders penalty weights of:
# [1] 0.5 0.5
# which equal the 'gorica.weights' above.
Since both hypotheses are of the same size (i.e., have the same penalty), the GORICA weights for the two hypotheses (with approximately the same fit) are also the same. Notably, if the penalty values differed across the hypotheses, the GORICA weights would differ across the hypotheses as well, but the GORICA weights would still equal the penalty weights. When the GORICA weights equal the penalty weights, like here, one can conclude that there is support for the overlap (here, border) of these two hypotheses, |θ_AL| = |θ_PD|, reflecting equal 'absolute' strength. Consequently, there is no support for H3: |θ_AL| < |θ_PD|. We do find evidence for '|θ_AL| = |θ_PD|', which can be evaluated in future research.
Hence, the GORICA can evaluate a hypothesis of interest directly and quantify its support (or the lack thereof) in comparison with one or more competing hypotheses. This aids in either confirming an a priori theory or in developing a new or competing theory for future research.
6. Discussion
This paper demonstrated how theories regarding relationships based on multiple studies can be evaluated using current methods (i.e., null hypothesis testing and the AIC) and the GORICA. Current methods to test or evaluate hypotheses in meta-analysis can only address equality restrictions and, therefore, often do not address the hypothesis of interest, let alone quantify the support for it. Fortunately, this is possible when using the GORICA. Notably, if the goal of the meta-analysis is prediction rather than the evaluation of one or more theories/hypotheses, the researcher should not use model selection as I propose in this paper.
I only inspected ‘regular’ meta-analysis accompanied by null hypothesis tests and model selection using the AIC. An increasingly popular meta-analytic method is meta-analytic structural equation modeling (MASEM [23]). This method additionally provides measures of the overall fit of a model, that is, goodness-of-fit indices, which include the AIC. When using MASEM, meta-analyzed estimates can be restricted and, thus, the AIC values of MASEM can be used to evaluate equality-restricted hypotheses regarding effect-size parameters. Note that MASEM cannot evaluate restrictions regarding absolute values, like those in Sets 2 and 3 in the illustrations section (i.e., Section 5); in that case, one should use the corresponding hypotheses without absolute values instead. When using MASEM, one has to specify and run each of the models separately (including the unconstrained model). MASEM then provides the AIC value for each model, and one then selects the model with the smallest AIC value. To inspect and compare the relative support for these equality-restricted models, one should also inspect the AIC weights, which are not part of MASEM but can be calculated using the ICweights package [22]. MASEM is thus, like the other current methods, only fit for equality restrictions and was therefore not included in the comparison. For more on the similarity of MASEM and the GORICA and/or how the GORICA can be of added value to MASEM, see Appendix D.
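The AIC weights mentioned above follow the standard Akaike-weight formula, which the ICweights R package implements. A minimal sketch of that formula (in Python for illustration; the AIC values are hypothetical):

```python
import math

def ic_weights(ic_values):
    """Akaike-type weights from a list of information-criterion values.

    Smaller IC = better. Subtracting the minimum first keeps exp()
    numerically stable; the resulting weights sum to 1.
    """
    best = min(ic_values)
    raw = [math.exp(-0.5 * (ic - best)) for ic in ic_values]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical AIC values for three separately fitted MASEM models
print(ic_weights([102.3, 104.3, 110.0]))
```

The model with the smallest AIC receives the largest weight, and the weight ratios quantify the relative support among the equality-restricted models.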
To apply the GORICA, a researcher only needs the meta-analyzed estimates of interest and their covariance matrix, which is output by most meta-analytic software. Notably, in the case of a single study where this information is not available (and which is, thus, not a meta-analysis as discussed in this paper), one should use the original data to apply the GORICA. Alternatively, one can create multiple covariance matrices (based on expertise and previous research) and perform a sensitivity/robustness check. By using the GORICA, researchers can quantify the support for their hypothesis or hypotheses of interest. One can, for instance, make claims like: the hypothesis of interest is 10 times more likely than any other theory (i.e., any other expectation about the ordering of parameters); or: the hypothesis of interest is 10 times more likely than a competing hypothesis of interest.
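Claims like "10 times more likely" are simply ratios of GORICA weights. A small sketch (in Python for illustration, with hypothetical weight values):

```python
def evidence_ratio(weight_a, weight_b):
    """Relative support of hypothesis A over B: the ratio of their GORICA weights."""
    return weight_a / weight_b

# Hypothetical GORICA weights for a hypothesis and its competitor
print(evidence_ratio(0.91, 0.09))  # about 10.1: A has roughly 10 times more support
```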
A disadvantage of the GORICA is that it is an asymptotic method: it assumes that the parameters of interest (i.e., the ones used in the hypotheses of interest) are normally distributed (as many methods do). This may be an unrealistic assumption for some types of parameters and/or for small samples. Nevertheless, simulations thus far show that the performance of the GORICA is good (cf. the simulation in Altınışık et al. [3]). An advantage of the GORICA is that, since it only needs the (unconstrained) parameters of interest and their covariance matrix, it can easily be applied to parameters from all types of statistical models. The GORICA can thus also easily be applied to all types of meta-analyzed effect-size estimates, as long as their covariance matrix is known.
The GORICA is not the only method that can address theory-based hypotheses. There are multiple confirmatory methods, e.g., the F-bar test ([24], pp. 25–4), the Bayes factor (a.o., [25]), and the GORIC [4,5]. The most practical ones, when having secondary data, are the Bayes factor in the R package bain [26] and the GORICA in the R function goric [17] of the restriktor package [18]. The latter is used in this paper; the other methods have, as far as I know, not been applied to meta-analyzed estimates to evaluate theory-based hypotheses. Readers who favor the Bayesian framework over the information-theoretical one are referred to bain. Note that most, if not all, of what is described in this paper also applies to bain. One important difference between the GORICA and Bayesian approaches is that the latter use a prior and, therefore, the results may depend on the choice of the prior, especially when there are one or more equality constraints in one of the hypotheses of interest.
By evaluating theory-based hypotheses using the GORICA, researchers from all types of fields (e.g., psychology, sociology, political science, biomedical science, and medicine) can quantify the support for their hypothesis or hypotheses of interest. Evaluating theory-based hypotheses also increases the statistical power of selecting the correct hypothesis, comparable to one- versus two-sided testing in null hypothesis testing (cf. [27,28], who show that confirmatory methods evaluating theory-based hypotheses have more power than exploratory ones). Hence, meta-analyses could contribute to theory confirmation and/or development by evaluating a priori specified, theory-based hypotheses. Furthermore, the use of meta-analyzed estimates leads to an increased (combined) sample size, which increases the statistical power as well. The quantification of support and the increase in power bolster, for instance, the development of evidence-based treatments and policies.
As a final remark, meta-analysis heavily depends on equal or quite similar study designs across the primary studies. If designs differ, either incomparable estimates are aggregated, or one aggregates (via meta-regression) only the estimates of subsets of studies whose designs are equal (since the moderator selects studies that are comparable; see Appendix A.1.3 for more details). Instead of aggregating estimates, as in meta-analysis, one could aggregate the support for the hypothesis of interest, as Kuiper et al. [29] do for Bayesian model selection. The next step is to develop such a method for the GORICA.