1. Introduction
Meta-analysis is a methodology for evaluating the overall treatment effect by integrating the results of past clinical trials and is widely recognized as one of the research methods that underlie “Evidence Based Medicine” [1, 2]. Generally, the methodology involves the integration of summary statistics, such as odds ratios or hazard ratios reported in published papers, by using appropriate statistical methods to estimate the average treatment effect [1, 2, 3]. Various biases that could affect the validity of the synthesized results of a meta-analysis have been widely studied, for example, (1) publication bias, whereby positive results are more likely than negative or null results to be published [4]; (2) language bias, whereby non-English studies tend to be excluded from meta-analyses [5]; (3) time-lag bias, whereby positive results tend to have shorter time differences from trial completion to publication than negative or null results [6]; (4) reporting bias, whereby studies selectively report outcomes favoring their hypothesis [1]; (5) outlier bias, whereby a single or a few studies disproportionately influence the overall results of a meta-analysis [7]; (6) categorization bias, whereby studies use different categorization or stratification schemes for the same outcome [8]; and (7) covariate set bias, whereby studies use different covariate sets in the regression model that share the same regression task across the studies [9]. In this study, we focus on a new source of bias in meta-analyses, the “cherry-picking” bias.
The simplest setup of a meta-analysis is to assume that there are $K$ independent studies, each yielding an estimate $\hat{\theta}_i$ ($i = 1, \dots, K$) of an underlying treatment effect parameter $\theta$. The standard fixed-effect model is defined as
$$\hat{\theta}_i \sim N(\theta, \sigma_i^2), \qquad (1)$$
where $\sigma_i^2$ is the reported (known) within-study variance of the $i$th study. Under this fixed-effect model, the maximum likelihood estimate of $\theta$ is defined by the weighted average
$$\hat{\theta} = \frac{\sum_{i=1}^{K} w_i \hat{\theta}_i}{\sum_{i=1}^{K} w_i}, \qquad (2)$$
where the $i$th study is assigned the weight $w_i = 1/\sigma_i^2$. The corresponding standard normal test statistic is
$$Z = \hat{\theta} \sqrt{\sum_{i=1}^{K} w_i},$$
and the resulting confidence interval (CI) is
$$\hat{\theta} \pm \frac{z_{\alpha/2}}{\sqrt{\sum_{i=1}^{K} w_i}},$$
where $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$ is the standard normal percentage point for the coverage of $1 - \alpha$, and $\Phi$ is the cumulative distribution function of the standard normal distribution [10, 11, 12, 13]. In practice, researchers are often interested in a hypothesis regarding whether a given treatment has no effect ($\theta = 0$) or is beneficial ($\theta > 0$). The one-sided p-value of the $i$th study is defined as
$$p_i = 1 - \Phi\left(\hat{\theta}_i / \sigma_i\right).$$
Similarly, the p-value of Equation (2) is defined as
$$p = 1 - \Phi(Z).$$
Equation (1) is based on the fixed-effect assumption that each study shares the same underlying effect $\theta$. When heterogeneity between included studies is suspected, the random-effect model is fitted as
$$\hat{\theta}_i \sim N(\theta, \sigma_i^2 + \tau^2),$$
where $\tau^2$ is the between-study variance, which can be estimated from the data using standard methods, such as the method proposed in DerSimonian and Laird [3]. The same reasoning can be applied by replacing the weights $w_i$ in Equation (2) with
$$w_i = \frac{1}{\sigma_i^2 + \tau^2}. \qquad (6)$$
Refer to [10, 11, 12] and Cooper et al. [13] for a detailed discussion of the various methods used for meta-analysis.
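As an illustration of the inverse-variance pooling described above, the following is a minimal Python sketch, not the authors' R implementation. The function name `fixed_effect`, the optional `tau2` argument, and the example numbers are assumptions introduced here for illustration only. Setting `tau2` to a positive value switches the weights to the random-effect form.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides cdf (Phi) and inv_cdf


def fixed_effect(estimates, variances, alpha=0.05, tau2=0.0):
    """Pool study estimates with inverse-variance weights.

    Returns the pooled estimate, the Z statistic, the one-sided
    (right-tailed) p-value, and the two-sided (1 - alpha) CI.
    tau2 = 0 gives the fixed-effect weights 1 / sigma_i^2; a positive
    tau2 gives the random-effect weights 1 / (sigma_i^2 + tau2).
    """
    weights = [1.0 / (v + tau2) for v in variances]
    total_w = sum(weights)
    pooled = sum(w * t for w, t in zip(weights, estimates)) / total_w
    z = pooled * sqrt(total_w)
    p_one_sided = 1.0 - STD_NORMAL.cdf(z)
    half_width = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) / sqrt(total_w)
    return pooled, z, p_one_sided, (pooled - half_width, pooled + half_width)


# Hypothetical effect estimates (e.g., log odds ratios) and their variances.
estimates = [0.10, 0.30, -0.05, 0.20]
variances = [0.04, 0.09, 0.02, 0.05]
pooled, z, p, ci = fixed_effect(estimates, variances)
print(f"pooled={pooled:.3f}  Z={z:.3f}  p={p:.3f}  CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

The sketch treats the within-study variances as known, as the text does.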
One of the most important stages of a meta-analysis is the specification of the inclusion and/or exclusion criteria, because the selection of studies for a literature review is known to influence the conclusions. One must carefully consider which studies to include or exclude from the review to obtain unbiased and fair conclusions. In reality, however, many meta-analyses are published without a protocol that defines the inclusion and exclusion criteria before the meta-analysis and systematic review are conducted, and it is uncommon for papers to state such criteria in advance and adhere to them. For example, Page et al. (2016) examined the reporting completeness of biomedical research meta-analyses and found that only 16% of the included reviews had a publicly accessible protocol published before the review was conducted [14]. In addition, Tawfic et al. (2020) found that only 37.4% of researchers who were trying to conduct a meta-analysis agreed that protocol registration prior to the main analysis should be mandatory [15].
Given a set of included studies, the conclusions obtained from the results of meta-analyses are frequently based on statistical tests and their associated p-values in practice. Ideally, a statistical test with a type 1 error rate of $\alpha$ should be used to keep the rate of false findings at or below $\alpha$. However, inclusion and/or exclusion criteria can be misused by (sometimes malicious) meta-analysts (i.e., the authors of a meta-analysis who intentionally or unintentionally report false (non)significant overall effects, regardless of the actual treatment effect) to pick a subset of all studies that changes the result and sometimes leads to their desired conclusion. This practice is also known as cherry-picking, and it means that the resulting p-value no longer controls the rate of false findings. Figures in Section 4 show practical examples. Reviewer selection bias is also known in the field of meta-analysis as the situation where reviewers (un)intentionally seek only a subset of existing studies that satisfy certain criteria, so the chosen subset does not reflect all available evidence [16]. The degree of bias in a synthesized result can depend on a selector’s prior knowledge, research field, existing collaborators, and opinion regarding the research question of interest [17]. Other similar biases related to inclusion and/or exclusion criteria include the English language bias (whereby non-English studies are more likely to be excluded), the data availability bias (whereby only studies with individual patient data are included), and the database bias (whereby only studies published in journals indexed in popular databases such as Embase or Medline are included); see [17, 18, 19] for an overview of this topic. For instance, Ahmed et al. [17] investigated 31 meta-analyses and found that 29% of them suffered from a significant selection bias based on the use of selective or nonsystematic approaches for the identification of relevant studies. They concluded that biased synthesized results can lead to incorrect decisions by medical practitioners, which can harm patients because inefficient or ineffective treatments may be chosen. Such results can also mislead future research efforts [20]. However, although the selection bias has a similar impact on synthesized results to the publication bias, which has been widely studied in the field of meta-analysis, no attempts have been made to examine the selection bias from a statistical perspective. In this study, we demonstrate that it is possible to modify the results of a meta-analysis by changing the inclusion and/or exclusion criteria to select an arbitrary subset of studies, so that they support a biased conclusion, such as (i) the treatment of interest having a significant effect, despite there being no actual effect or (ii) the treatment having a nonsignificant effect, despite the presence of an actual effect. The reliability of a meta-analysis is decreased in the presence of such a selection bias. The goal of this study is to identify the possibility of cherry-picking.
The remainder of the article is organized as follows: In Section 2, we show theoretical guarantees on the chance of cherry-picking by meta-analysts who intentionally or unintentionally select the subset of studies. To demonstrate that conventional meta-analysis procedures have a significant cherry-picking effect, the results of extensive simulation studies are presented in Section 3, and two clinical datasets are examined in Section 4. Lastly, Section 5 presents a discussion and our conclusions.
2. Methods
We consider the simple fixed-effect meta-analysis setting defined in Equation (1). An extension to a random-effect model is described in Section 2.2 and later in the discussion section. We assume that there are $K$ studies $\mathcal{K} = \{1, \dots, K\}$ collected via data extraction from several databases such as PubMed, Medline, and Embase. Each study is supposed to report an estimate $\hat{\theta}_i$ and corresponding variance $\sigma_i^2$ (or $w_i$, equivalently). Meta-analysts determine the inclusion and/or exclusion criteria to select a subset of $S$ studies from all $K$ studies found in the databases. This subset is denoted as $\mathcal{S} \subseteq \mathcal{K}$. Therefore, $\mathcal{S}$ may suffer from a selection bias. In this study, we assume that meta-analysts (intentionally or unintentionally) select studies $\mathcal{S}$ to (i) overstate the effect of the treatment of interest (Case 1), despite the treatment having no actual effect (i.e., $\theta = 0$), or (ii) understate the effect of the treatment (Case 2), despite the treatment having an actual effect (i.e., $\theta > 0$). Furthermore, we assume that meta-analysts use a statistical testing framework by defining the null and alternative hypotheses as $H_0: \theta = 0$ and $H_1: \theta > 0$, respectively. The null hypothesis $H_0$ states that the treatment has no effect, while the alternative hypothesis $H_1$ states that the treatment has a beneficial effect. Statistical significance at a level of $\alpha$ for the dataset $\mathcal{S}$ is defined as
$$Z_{\mathcal{S}} = \hat{\theta}_{\mathcal{S}} \sqrt{\sum_{i \in \mathcal{S}} w_i} > z_{\alpha}, \qquad (7)$$
where $\hat{\theta}_{\mathcal{S}} = \sum_{i \in \mathcal{S}} w_i \hat{\theta}_i / \sum_{i \in \mathcal{S}} w_i$, $z_{\alpha} = \Phi^{-1}(1 - \alpha)$, and $w_i = 1/\sigma_i^2$ or Equation (6) is used in the fixed- and random-effect models, respectively. The extension to two-sided tests is easy and is discussed later in Section 5.
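The one-sided significance rule for a selected subset can be sketched in a few lines of Python. This is an illustrative sketch under the fixed-effect assumptions above. The helper name `subset_is_significant` and the data are hypothetical and do not come from the paper or its repository.

```python
from math import sqrt
from statistics import NormalDist


def subset_is_significant(estimates, variances, subset, alpha=0.05, tau2=0.0):
    """Right-tailed test of H0: theta = 0 using only the studies in `subset`.

    `subset` holds the indices of the included studies. tau2 = 0 gives
    fixed-effect weights; a positive tau2 gives random-effect weights.
    """
    idx = list(subset)
    weights = [1.0 / (variances[i] + tau2) for i in idx]
    total_w = sum(weights)
    pooled = sum(w * estimates[i] for w, i in zip(weights, idx)) / total_w
    z_subset = pooled * sqrt(total_w)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return z_subset > z_alpha


# Hypothetical data: four studies whose effects roughly cancel out.
estimates = [0.5, -0.5, 0.4, -0.3]
variances = [0.04, 0.04, 0.04, 0.04]

print(subset_is_significant(estimates, variances, range(4)))  # all K studies
print(subset_is_significant(estimates, variances, [0, 2]))    # favorable subset
```

In this made-up example the full set of studies gives a pooled Z of about 0.25, while the two favorable studies alone give a Z above 3, so the conclusion flips purely through the choice of subset.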
2.1. Chance of Cherry-Picking in a Meta-Analysis
This section describes how the standard hypothesis testing procedure is no longer robust against selection bias due to the cherry-picking of studies using biased inclusion and/or exclusion criteria. We used similar techniques to those employed by Komiyama and Maehara [21] in the following derivation.
Theorem 1 guarantees that, under certain mild conditions, it is possible for meta-analysts to have sufficient statistical power to falsely conclude that a significant effect of the treatment of interest (Case 1) exists, even if the treatment has no actual effect. This is achieved by cherry-picking the subset that provides the top-S smallest p-values.
Theorem 1. For any , , and , if andwith , and , then meta-analysts can select such that with a probability of at least .
Similarly, Theorem 2 guarantees that under certain conditions, it is also possible to falsely conclude that the treatment has an insignificant effect (Case 2), even if the treatment has an actual effect. This is achieved by cherry-picking the subset that provides the top-S largest p-values.
Theorem 2. For any , , and , if andthe meta-analysts can select such that with a probability of at least δ.
Together, these theorems imply that meta-analysts have a chance to change the results of a meta-analysis, regardless of the real treatment effect, by cherry-picking an appropriate value for $S$. When $S$ satisfies the conditions outlined in the theorems, readers or inspectors of the meta-analysis results can claim that the possibility of cherry-picking exists. So far, we have assumed that meta-analysts cherry-pick the subset of studies $\mathcal{S}$ yielding the “top-$S$” (largest/smallest) p-values, which may seem unrealistic because actual meta-analysts might cherry-pick the subset of studies in a more arbitrary manner. However, even if meta-analysts cherry-pick an arbitrary subset of all studies, such as the subset of studies with moderate p-values, these theorems are still valid because the current assumption on $\mathcal{S}$ is the most aggressive, worst-case setting, i.e., we assume that $\mathcal{S}$ provides the minimum/maximum p-value in the proof. Thus, the theorems still hold under the more relaxed assumption of cherry-picking moderate p-values. The proofs for the theorems can be found in Appendix A, Appendix B and Appendix C.
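The effect of top-$S$ selection can be explored with a small Monte Carlo sketch in Python. It is not the simulation design of Section 3. It assumes unit within-study variances (so each estimate equals its own z-score), and the function name `cherry_pick_rate`, the seed, and the parameter values are hypothetical choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def cherry_pick_rate(K, S, theta=0.0, alpha=0.05, n_rep=2000, seed=0,
                     largest_p=False):
    """Fraction of simulated meta-analyses declared significant after
    keeping only S of K studies.

    largest_p=False keeps the S smallest p-values (Case 1, overstating);
    largest_p=True keeps the S largest p-values (Case 2, understating).
    Assumes sigma_i^2 = 1 for every study, so the fixed-effect Z of the
    chosen subset is sum(chosen) / sqrt(S).
    """
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    hits = 0
    for _ in range(n_rep):
        # Smallest one-sided p-values correspond to the largest z-scores.
        z_scores = sorted((rng.gauss(theta, 1.0) for _ in range(K)),
                          reverse=not largest_p)
        chosen = z_scores[:S]
        hits += sum(chosen) / sqrt(S) > z_alpha
    return hits / n_rep


# No true effect: honest analysis of all 20 studies vs. the best 5 of 20.
print("all studies, theta=0 :", cherry_pick_rate(K=20, S=20))
print("top-5 of 20, theta=0 :", cherry_pick_rate(K=20, S=5))
# True effect: keeping only the 5 least favorable studies hides it.
print("bottom-5, theta=0.5  :", cherry_pick_rate(K=20, S=5, theta=0.5,
                                                  largest_p=True))
```

With no selection, the rejection rate under the null stays near the nominal level, whereas selecting the most favorable studies pushes it far above that level.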
2.2. Extension to a Random-Effect Model
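In the above section, we tentatively assumed that $\tau^2$ was known, which corresponds to a fixed-effect model in a meta-analysis. However, we can also consider cases in which $\tau^2$ in Equation (6) is estimated. In other words, we can estimate the between-study variance using the random-effect model. In practice, $\tau^2$ is estimated from the data, frequently by using the method proposed by DerSimonian and Laird [3]. Given the selected subset $\mathcal{S}$, the DerSimonian–Laird estimate of $\tau^2$ is defined as
$$\hat{\tau}^2_{\mathcal{S}} = \max\left\{0, \frac{Q_{\mathcal{S}} - (S - 1)}{\sum_{i \in \mathcal{S}} w_i - \sum_{i \in \mathcal{S}} w_i^2 / \sum_{i \in \mathcal{S}} w_i}\right\}, \qquad (8)$$
where $Q_{\mathcal{S}} = \sum_{i \in \mathcal{S}} w_i (\hat{\theta}_i - \hat{\theta}_{\mathcal{S}})^2$ and $\hat{\theta}_{\mathcal{S}} = \sum_{i \in \mathcal{S}} w_i \hat{\theta}_i / \sum_{i \in \mathcal{S}} w_i$, with the fixed-effect weights $w_i = 1/\sigma_i^2$. Extending Theorems 1 and 2 to this setting is nontrivial because this estimate depends on the choice of $\mathcal{S}$, and the selection of the studies with the top-$S$ largest study-level test statistics depends in turn on the estimate. These factors eliminate the simplicity of Theorems 1 and 2 and require a more sophisticated analysis. One possible approach is that, instead of using Equation (8) as it stands, we replace $\mathcal{S}$ and $S$ in Equation (8) with the full set of studies $\mathcal{K}$ and $K$, respectively. This corresponds to the situation where, once $\tau^2$ is estimated, it is regarded as a fixed constant in the model, so that the same discussion applies as for Theorems 1 and 2. The results of the random-effect models are examined in the simulation and application sections. In our future work, we plan to extend our results to cover cases in which $\hat{\tau}^2$ depends on the choice of $\mathcal{S}$.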
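To make the dependence of the between-study variance estimate on the selected subset concrete, here is a short Python sketch of the standard DerSimonian–Laird moment estimator. The function name `dersimonian_laird_tau2` and the data are hypothetical. The sketch follows the textbook form of the estimator and is not taken from the authors' code.

```python
def dersimonian_laird_tau2(estimates, variances):
    """DerSimonian-Laird moment estimate of the between-study variance.

    Uses fixed-effect weights w_i = 1 / sigma_i^2 and truncates at zero.
    Returns 0.0 when fewer than two studies are supplied.
    """
    k = len(estimates)
    if k < 2:
        return 0.0
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    pooled = sum(wi * t for wi, t in zip(w, estimates)) / sum_w
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(wi * (t - pooled) ** 2 for wi, t in zip(w, estimates))
    denom = sum_w - sum(wi ** 2 for wi in w) / sum_w
    return max(0.0, (q - (k - 1)) / denom)


# Hypothetical data: the estimate changes with the chosen subset.
estimates = [0.8, 0.7, 0.6, -0.4, -0.6, 0.1]
variances = [0.05, 0.04, 0.06, 0.05, 0.04, 0.03]
top3 = [0, 1, 2]  # indices of the three most favorable studies

print("tau^2, all studies :", dersimonian_laird_tau2(estimates, variances))
print("tau^2, top-3 subset:", dersimonian_laird_tau2(
    [estimates[i] for i in top3], [variances[i] for i in top3]))
```

Computing the estimate once on all $K$ studies and then holding it fixed, as suggested in the text, avoids this feedback between selection and estimation.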
5. Discussion
The conclusions of any meta-analysis can be biased if meta-analysts intentionally or unintentionally cherry-pick a subset of all studies that lead to a desired favorable result. This is achieved by choosing beneficial inclusion and/or exclusion criteria. We theoretically assessed the conditions under which such cherry-picking is possible. To prevent cherry-picking in a meta-analysis, one solution is to mandate stricter adherence to Cochrane and other guidelines. This would require meta-analysts to register and publish their protocol before carrying out the primary meta-analysis. In addition, a more advanced mechanism would be necessary to verify the inclusion/exclusion criteria that were not initially included in the protocol but were subsequently added. The R code is provided in a GitHub repository (https://github.com/kingqwert/R/tree/master/metaCherry/, accessed on 2 March 2023) and will be hosted on the R CRAN repository (https://www.r-project.org/, accessed on 2 March 2023) in the near future, allowing others to apply our method easily.
Extensive Monte Carlo simulations were conducted to illustrate that the standard meta-analysis method could be subject to cherry-picking, leading to biased results. The chance of cherry-picking is remarkably high, especially when S is small. Furthermore, two real data analysis problems were simulated to provide new insights into the results of RCTs on the effectiveness of magnesium on AMI and St. John’s wort on depression. We demonstrated that it is easy to obtain favorable, i.e., biased, conclusions by cherry-picking studies based on biased inclusion and/or exclusion criteria. We encourage the re-evaluation of our approach using other datasets.
We demonstrated that meta-analysts can cherry-pick a subset of studies by modifying inclusion and/or exclusion criteria. However, this type of cherry-picking should not be taken too literally: the theorems presented in this study can be applied to any type of cherry-picking if information regarding $K$, $S$, and
is available. In addition, we analyzed the case of cherry-picking from a ‘subset’ of studies (i.e., the case of $S \le K$). It is trivial to extend this analysis to the case of $S > K$, where meta-analysts use unsuitable inclusion and/or exclusion criteria to increase the total number of studies to obtain a favorable conclusion. Similarly, although we focused on the case of one-sided right-tailed hypothesis testing in this study, it is simple to extend our results to (i) the one-sided left-tailed hypothesis case ($H_0: \theta = 0$ and $H_1: \theta < 0$) by using $Z_{\mathcal{S}} < -z_{\alpha}$, and (ii) the two-sided hypothesis case ($H_0: \theta = 0$ and $H_1: \theta \neq 0$) by using $|Z_{\mathcal{S}}| > z_{\alpha/2}$ instead of Equation (7). In addition, the assumption of cherry-picking the top-$S$ results is sometimes unrealistic, and actual meta-analysts might try to cherry-pick the subset of studies in a more arbitrary manner. However, we note again that, as discussed in Section 2.1, the theorems are still valid, even if meta-analysts cherry-pick an arbitrary subset of all studies.
Similar to most published studies on meta-analyses, the within-study variance $\sigma_i^2$ in Equation (1) was assumed to be known, ignoring the fact that it must be estimated in practice. If the estimated $\hat{\sigma}_i^2$ and $\hat{\tau}^2$ values are used to define the p-value, the underlying test statistic no longer follows a standard normal distribution under the null hypothesis [27, 28, 29], eliminating the simplicity of our theorems. In such cases, a more complicated asymptotic analysis would be required. Furthermore, there have been many previous attempts to formulate a “publication bias” using p-values [30, 31]. It would be worthwhile to consider both selection and publication biases simultaneously by using the proposed framework for hypothesis testing and its associated p-value. However, further discussion about the conceptual difference between the publication bias and selection bias due to cherry-picking is required.