1. Introduction
The comparison of two proportions is a topic of special interest in statistics [1], with important applications in medicine and the health sciences in general. Of particular interest is the case in which the two proportions are paired, for example when a binary variable is observed in a sample of n individuals before and after a certain treatment, or when the sensitivities (specificities) of two binary diagnostic tests are compared with respect to the same gold standard [2,3]. This problem also arises frequently in clinical trials [4], such as when assessing the effectiveness of a new drug or treatment. These situations give rise to the analysis of a 2 × 2 table, in which the only value set by the researcher is the sample size n. Numerous statistical methods have been proposed to solve this problem. Classically, it has been solved by conditioning on the discordant pairs and thus neglecting the frequencies of the concordant pairs. This approach has given rise to different methods, of which the McNemar test [5] is the best known [6,7,8]. The problem can be solved with exact tests (conditional and unconditional) and with approximate tests (conditional and unconditional). All test statistics of the approximate methods are approximately distributed as a chi-square distribution with one degree of freedom.
In the statistical literature, there are numerous methods to solve the hypothesis test to compare two paired proportions. May and Johnson [9], Park [10], and, more recently, Fagerland et al. [11,12,13] have compared different methods to solve this problem. However, these works studied only some of the existing methods. This is one of the main motivations for our study, which proposes new methods and compares a large number of different methods for testing the equality of two paired binomial proportions.
An alternative to the hypothesis test, and one directly related to it, consists of comparing the two paired proportions using confidence intervals for their difference (or ratio). A review of the most common confidence intervals can be found in Pradhan et al. [4] and Tan et al. [14]. In addition, new intervals are proposed in Pradhan et al. [15], more recently in Fay et al. [16], and in Chan et al. [17]. A review of different methods to solve the hypothesis test, as well as confidence intervals for the difference and the ratio of two paired proportions, can be found in Fagerland et al. [13].
Therefore, the purpose of this manuscript is to compare the asymptotic behaviour, in terms of type I error rates and power, of different methods to solve the hypothesis test to compare two paired binomial proportions, and to provide general rules of application for the methods. The rest of the article is structured as follows. Section 2 describes 24 methods to solve the hypothesis test for comparing two paired binomial proportions. Section 3 describes the criteria used to compare the asymptotic behaviour of the 24 methods. Section 4 reports extensive Monte Carlo simulation experiments to study the type I error rates and the powers of the methods. Section 5 presents general rules of application for the methods. In Section 6, the results are applied to a real example on the diagnosis of coronary heart disease, and Section 7 discusses the conclusions obtained.
2. Notation and Methods
In general terms, and focusing on common problems in the field of medicine, let us consider a binary random variable, with the categories ‘success’ and ‘failure’, which is observed in a random sample of n individuals before and after a treatment. This situation gives rise to Table 1, where the only value set by the researcher is the sample size n. This table also shows the theoretical probabilities of each cell. The data observed in this table,
, came from a multinomial distribution with probability vector
, verifying that
. Variance–covariance matrix of
was as follows:
and the estimator of
was
.
In this situation, the comparison of two paired binomial proportions consisted of solving the hypothesis test:
which was equivalent to solving the test:
Estimators of
and
were as follows:
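As a compact reference for this set-up, the hypotheses and estimators can be written out under an assumed cell notation. The symbols below are assumptions chosen for illustration and may differ from those of Table 1: $n_{ij}$ are the observed frequencies and $p_{ij}$ the cell probabilities, with the first index for 'before' and the second for 'after' (1 = success, 2 = failure).

```latex
% Assumed notation (not taken from Table 1):
% n_{ij} observed frequencies, p_{ij} cell probabilities, n = n_{11}+n_{12}+n_{21}+n_{22}.
\begin{align*}
p_{1\cdot} &= p_{11} + p_{12}, & p_{\cdot 1} &= p_{11} + p_{21},\\
H_0 &: p_{1\cdot} = p_{\cdot 1} \;\Longleftrightarrow\; H_0 : p_{12} = p_{21},\\
\hat p_{1\cdot} &= \frac{n_{11} + n_{12}}{n}, & \hat p_{\cdot 1} &= \frac{n_{11} + n_{21}}{n}.
\end{align*}
```

The equivalence in the second line holds because $p_{11}$ appears in both marginal proportions and cancels, which is why only the discordant cells matter for the conditional methods.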
The following describes 24 statistical methods to solve this hypothesis test. Of these 24 methods, two were exact, one was quasi-exact, and 21 were approximate (of which five were new).
- 1.
Conditional exact test (CET)
The probabilities
and
did not intervene in the hypothesis test (1) so that these probabilities could be ignored, as frequencies
and
could, because they did not influence the result of the hypothesis test (1). Conditioning on the sum of the discordant frequencies, an exact test was obtained using the binomial distribution [
13,
18]. Conditioning on
, it was verified that
, and therefore,
was the product of a binomial distribution of parameters
and
, i.e.,
. If
was true, then
, and the hypothesis test (1) was equivalent to the test:
The
p-value could be calculated directly from the binomial distribution. If we assumed that
, then the following was derived:
where
. Finally, the two-sided exact
p-value for the comparison test of the two paired binomial proportions was as follows:
The conditional exact test is conservative; that is, when H0 is true, the p-value falls below α in fewer than 100α% of cases, where α is the nominal error level.
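A minimal sketch of the conditional exact p-value, assuming the discordant frequencies are called `n12` and `n21` (hypothetical names, since the original symbols are not available here). It doubles the smaller binomial tail under a Binomial(n12 + n21, 1/2) distribution and caps the result at 1.

```python
from math import comb


def cet_pvalue(n12: int, n21: int) -> float:
    """Two-sided conditional exact p-value from the discordant counts."""
    m = n12 + n21                      # number of discordant pairs
    if m == 0:
        return 1.0                     # no discordant pairs: no evidence against H0
    k = min(n12, n21)
    # One-sided tail P(X <= k) for X ~ Binomial(m, 1/2)
    one_sided = sum(comb(m, x) for x in range(k + 1)) / 2 ** m
    # Doubling the tail can exceed 1 when n12 == n21, hence the cap
    return min(1.0, 2.0 * one_sided)


print(cet_pvalue(1, 5))  # 0.21875
```

The concordant frequencies never enter the function, which reflects the conditioning described above.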
- 2.
Conditional exact mid-p test (MidpT)
The conditional exact
mid-p test [
19] is a modification of the
CET that consists of subtracting the probability of the observed outcome
from (3), as in the following:
Then, the mid-
p value to compare the two proportions is as follows:
The conditional exact mid-p test is less conservative than the CET.
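The mid-p adjustment can be sketched as follows, again with the hypothetical names `n12` and `n21` for the discordant frequencies. The sketch follows the usual mid-p convention, in which half the probability of the observed outcome is removed from the one-sided tail, so that the full point probability is removed from the doubled tail. Conventions for the tied case n12 = n21 vary between references; this version simply returns the uncapped expression, which equals 1 there.

```python
from math import comb


def midp_pvalue(n12: int, n21: int) -> float:
    """Two-sided conditional mid-p value from the discordant counts."""
    m = n12 + n21
    if m == 0:
        return 1.0
    k = min(n12, n21)
    tail = sum(comb(m, x) for x in range(k + 1)) / 2 ** m   # P(X <= k)
    point = comb(m, k) / 2 ** m                             # P(X = k)
    # Doubled tail minus the probability of the observed outcome
    return min(1.0, 2.0 * tail - point)


print(midp_pvalue(1, 5))  # 0.125
```

For the same data, the exact two-sided p-value is 0.21875, which illustrates why the mid-p version is less conservative.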
- 3.
McNemar Test (MT)
The McNemar test [
4,
13,
18] is the asymptotic version of the
CET. Conditioning on
and applying the central limit theorem, the test statistic for hypothesis test (1) is as follows:
whose distribution is approximately a standard normal distribution and where the following occurs:
Since we are conditioning on
(frequencies
and
are disregarded), then
and
. If
is true, then the following are derived:
and
Substituting
with
and
for
in the expression of the test statistic
z, the test statistic of the McNemar test (without continuity correction) is as follows:
whose distribution is approximately (it is traditionally required that
) a standard normal. Very often, the test statistic is expressed in terms of the chi-square distribution:
whose distribution is approximately chi-square with one degree of freedom.
The MT has good asymptotic behaviour in terms of type I error rate and power.
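For illustration, the uncorrected McNemar statistic can be computed as below. The names `n12` and `n21` for the discordant frequencies are assumptions, and the p-value uses the identity P(χ²₁ > x) = erfc(√(x/2)) so that only the standard library is needed.

```python
from math import erfc, sqrt


def mcnemar_test(n12: int, n21: int):
    """Return (z, chi2, p_value) for the McNemar test without continuity correction."""
    m = n12 + n21
    if m == 0:
        raise ValueError("the statistic is undefined without discordant pairs")
    z = (n12 - n21) / sqrt(m)          # approximately standard normal under H0
    chi2 = z * z                       # approximately chi-square with 1 df under H0
    p_value = erfc(sqrt(chi2 / 2.0))   # upper tail of the chi-square(1) distribution
    return z, chi2, p_value


z, chi2, p = mcnemar_test(5, 15)
print(round(chi2, 4), round(p, 4))
```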
- 4.
McNemar test with Yates continuity correction (MTYcc)
The McNemar test approximates the binomial distribution by the normal distribution. In this situation, it is common to apply a continuity correction (cc), whose objective is to improve the approximation. Edwards [20] proposed the following test statistic with the Yates cc [21]:
whose distribution is approximately a standard normal distribution. It is also common to express this test statistic in terms of the chi-square distribution [
13,
18]:
- 5.
McNemar test with continuity correction (MTcc1)
Conditioning on
, the random variable
jumps from 1 to 1, so a
cc is
(half the jump) [
22]. Therefore, another test statistic of the McNemar test with
cc is as follows:
or what is the same:
This cc has been used by Chang et al. [17] to estimate the difference between two paired binomial proportions using confidence intervals. These authors also proposed other continuity corrections: 0.125 and 0.25. We propose applying these continuity corrections to the McNemar test statistic, obtaining the following new test statistics (called MTcc2 and MTcc3, respectively):
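A hedged sketch of a continuity-corrected McNemar statistic with an adjustable correction `c` is given below. With `c = 1` it reproduces the familiar Edwards form (|n12 − n21| − 1)² / (n12 + n21). Treating the other corrections mentioned above (for example 0.125 and 0.25) as values of `c` that enter in the same way is an assumption of this sketch, as are the names `n12`, `n21`, and `c`, and the truncation of the numerator at zero.

```python
from math import erfc, sqrt


def mcnemar_cc(n12: int, n21: int, c: float = 1.0):
    """Return (chi2, p_value) for a McNemar statistic with continuity correction c."""
    m = n12 + n21
    if m == 0:
        raise ValueError("the statistic is undefined without discordant pairs")
    # Assumption: the corrected difference is truncated at zero
    corrected = max(abs(n12 - n21) - c, 0.0)
    chi2 = corrected ** 2 / m
    return chi2, erfc(sqrt(chi2 / 2.0))   # chi-square(1) upper tail


print(mcnemar_cc(5, 15, c=1.0)[0])   # 4.05
print(mcnemar_cc(5, 15, c=0.5)[0])   # 4.5125
```

Setting `c = 0` recovers the uncorrected statistic, so the effect of each correction can be compared on the same data.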
- 6.
Modified McNemar test (MMT)
Bennett and Underwood [
23] proposed a modification of the McNemar test statistic by adding
to the observed frequencies, with the aim of improving the approximation to the chi-square distribution. Thus, the test statistic is as follows:
- 7.
Wald test (WT)
The hypothesis test (1) can be solved by applying the Wald method [
24,
25]. Since
is the probability vector of a multinomial distribution, its variance-covariance matrix is as follows:
The hypothesis test (2) is equivalent to checking the following:
where
It is easy to verify that the estimated variance of
is as follows:
Applying the central limit theorem, the following is derived:
After some algebra, the Wald test statistic for test (2) is as follows:
whose distribution is approximately a chi-square distribution with one degree of freedom.
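The algebra behind the Wald statistic can be sketched as follows. The notation is assumed for illustration ($n_{12}$, $n_{21}$ for the discordant frequencies, $\hat p_{ij} = n_{ij}/n$, and $\chi^2_W$ as a label for the statistic); the derivation uses only the multinomial variances and covariance of the two discordant cells.

```latex
% Assumed notation: n_{12}, n_{21} discordant frequencies, \hat p_{ij} = n_{ij}/n;
% \chi^2_W is a hypothetical label for the Wald statistic.
\begin{align*}
\widehat{\operatorname{Var}}(\hat p_{12} - \hat p_{21})
  &= \frac{\hat p_{12}(1-\hat p_{12}) + \hat p_{21}(1-\hat p_{21}) + 2\,\hat p_{12}\hat p_{21}}{n}
   = \frac{\hat p_{12} + \hat p_{21} - (\hat p_{12} - \hat p_{21})^2}{n},\\[4pt]
\chi^2_{W}
  &= \frac{(\hat p_{12} - \hat p_{21})^2}{\widehat{\operatorname{Var}}(\hat p_{12} - \hat p_{21})}
   = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21} - (n_{12} - n_{21})^2 / n}.
\end{align*}
```

Unlike the McNemar statistic, this expression depends on the total sample size $n$, so it is an unconditional statistic.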
- 8.
Modified Wald test (MWT)
May and Johnson [
9] proposed modifying the Wald test statistic by adding
to
and to
. Thus, the modified Wald test statistic is as follows:
This method has good asymptotic behaviour and is recommended as one of the best methods to solve the hypothesis test [
9].
- 9.
Likelihood-ratio test (LRT)
The hypothesis test (1) can be solved by applying the likelihood-ratio test [
26]. The likelihood function of the data is as follows:
where
. If
is true, then it is verified that the likelihood function of the data is as follows:
and that the following is derived:
Applying the likelihood-ratio test [25,26], the likelihood-ratio test statistic to compare the two proportions is as follows:
whose distribution is approximately chi-square with one degree of freedom. Thus, the test statistic of the LRT method contains only the frequencies of the discordant pairs.
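As an illustration, the standard likelihood-ratio statistic for this problem, 2[n12 ln(2 n12 / (n12 + n21)) + n21 ln(2 n21 / (n12 + n21))], can be computed as below. The names `n12` and `n21` are assumed, and a zero frequency is taken to contribute zero, following the usual convention 0 · ln 0 = 0.

```python
from math import erfc, log, sqrt


def lrt_test(n12: int, n21: int):
    """Return (g2, p_value) for the likelihood-ratio test on the discordant counts."""
    m = n12 + n21
    if m == 0:
        raise ValueError("the statistic is undefined without discordant pairs")

    def term(x: int) -> float:
        # Convention: a zero count contributes 0 to the statistic
        return x * log(2.0 * x / m) if x > 0 else 0.0

    g2 = max(2.0 * (term(n12) + term(n21)), 0.0)   # guard against tiny negative rounding
    return g2, erfc(sqrt(g2 / 2.0))                # chi-square(1) upper tail


g2, p = lrt_test(5, 15)
print(round(g2, 4))
```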
- 10.
Unconditional exact test (UET)
The
CET method conditions on
. Suissa and Shuster [
27] have proposed, from the McNemar test statistic, an exact test that uses all the observed frequencies and therefore does not condition on
. When the two proportions were compared, the power function of the test was as follows:
where
and
, with
and
as the calculated value of the McNemar statistic. If
was true, then the distribution of
was a trinomial distribution with parameters
and probability vector
, and the power function was as follows:
where
was the nuisance parameter. The nuisance parameter was eliminated by maximizing this function over the range of
. The function
was simplified as follows:
where
,
,
was the integer function and
was the cumulative binomial distribution function with parameters
and
. Finally, the two-sided exact
p-value was calculated as follows:
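The idea of eliminating the nuisance parameter by maximisation can be illustrated with a brute-force sketch. It is not the simplified formula of Suissa and Shuster: it orders tables directly by the absolute McNemar statistic (a two-sided ordering, rather than doubling a one-sided value) and maximises over a grid rather than exactly, so its values may differ slightly from theirs. The names `n12`, `n21`, `n`, `p`, and `grid` are assumptions, and the approach is only practical for small `n`.

```python
from math import comb, sqrt


def z_abs(x12: int, x21: int) -> float:
    """Absolute McNemar statistic; defined as 0 when there are no discordant pairs."""
    d = x12 + x21
    return abs(x12 - x21) / sqrt(d) if d > 0 else 0.0


def uet_pvalue(n12: int, n21: int, n: int, grid: int = 200) -> float:
    """Grid-maximised unconditional p-value (illustrative sketch, small n only)."""
    z_obs = z_abs(n12, n21)
    # All tables (x12, x21, n - x12 - x21) at least as extreme as the observed one
    extreme = [(a, b) for a in range(n + 1) for b in range(n + 1 - a)
               if z_abs(a, b) >= z_obs - 1e-12]
    best = 0.0
    for k in range(1, grid + 1):
        p = 0.5 * k / grid   # common value p12 = p21 = p under H0, 0 < p <= 1/2
        total = sum(comb(n, a) * comb(n - a, b)
                    * p ** (a + b) * (1.0 - 2.0 * p) ** (n - a - b)
                    for a, b in extreme)
        best = max(best, total)   # maximise over the nuisance parameter
    return min(best, 1.0)


print(uet_pvalue(1, 5, 10))
```

Because the concordant pairs enter through `n`, the result changes with the total sample size even when the discordant frequencies are fixed.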
- 11.
Unconditional McNemar test (UMT)
Lu [
28] has proposed a test statistic for the McNemar test that does not condition on
. Hypothesis test (1) was equivalent to the following hypothesis test:
If
was true, then
(or
) was the product of a binomial distribution with parameters
and
, that is to say,
. The mean and variance of the estimators of this binomial distribution were as follows:
respectively. Applying the central limit theorem to approximate this distribution by the normal distribution, the unconditional test statistic was as follows:
or rather
whose distribution was approximately a chi-square distribution with one degree of freedom. In order to apply this method, it was required that
, and its asymptotic behaviour was very similar to that of the
CET [
28].
- 12.
Unconditional likelihood-ratio test (ULRT)
Lu [
29] also proposed a likelihood-ratio test statistic to compare two paired binomial proportions that uses all the frequencies. The likelihood-ratio test statistic is obtained in two phases: (I) the likelihood-ratio test statistic is calculated when the four
frequencies are combined in two,
and
; (II) the likelihood-ratio test statistic is calculated when the four
frequencies are combined in another two,
and
. Corresponding test statistics were as follows:
and
Finally, the likelihood-ratio test statistic was calculated as the mean of both likelihood-ratio test statistics:
and its distribution was approximately a chi-square distribution with one degree of freedom. The
ULRT can be applied in most cases, although the test statistic does not fit well to the chi-square distribution when the difference between
and
is large, especially when
is also large, and in this situation, it was a better method than the
LRT [
29].
- 13.
New revised version of the McNemar test (NMT)
Lu et al. [
30] revised the unconditional McNemar test [
28]. Under the hypothesis that there is no difference in the number of “success” and “failure” results between “before” and “after”, the estimated probability of obtaining a “success” is as follows:
and the estimated probability of obtaining a “failure” is as follows:
Frequencies
and
correspond to “success” and “failure” in “before” measurements. The estimated mean is as follows:
and the estimated standard deviation is as follows:
Applying the central limit theorem, the test statistic was as follows:
and its distribution was approximately a standard normal distribution when
and
. Alternatively, the following was derived:
This method has better asymptotic behaviour than the UMT [
30].
- 14.
New revised version of the McNemar test with cc (NMTcc)
Lu et al. [
30] revised the unconditional McNemar test and proposed the following unconditional test statistic with
cc:
- 15.
Haber test (HT)
Haber [
31] has studied the use of continuity corrections in hypothesis testing, with particular attention to 2 × 2 tables. Haber proposed a McNemar test statistic with a
cc based on the McNemar test statistics:
where
is the McNemar test statistic and
m is the number of different values
z may attain. The number of different achievable values of
is very close to
, and since the range of
values is
, the
cc based on the average difference of the successive values gives rise to the test statistic:
and its distribution is approximately a chi-square with one degree of freedom.
- 16.
Irony et al. test (IT)
Irony et al. [
32] have studied the comparison of two binomial proportions from a Bayesian perspective. The Dirichlet distribution is the natural conjugate prior for
. Therefore, the prior distribution of π is a Dirichlet with parameter
, and its posterior distribution is also Dirichlet with parameter
, where
. The objective is to solve the hypothesis test:
where
. This hypothesis test is equivalent to the following:
where
. Therefore, the only parameters of interest are
and
, and therefore, only the trinomial data
are considered. The likelihood function is written as a product of two factors: one depending only on the parameter of interest
and the other depending only on the nuisance parameter
. Distribution of
is as follows:
and distribution of
is as follows:
Parameters
and
are independent. An interval for
is constructed by generating a large number of observations from the posterior distribution of
, that is, a Dirichlet distribution with parameter
. Irony et al. [
32] have shown that posterior mean of
is as follows:
and posterior variance of
is as follows:
A confidence interval for
is as follows:
where
and
q is the
quantile of the standard normal distribution. From the previous equations, the test statistic for the hypothesis test (1) was as follows:
whose distribution was approximately a chi-square distribution with one degree of freedom.
- 17.
RR test (RRT)
The hypothesis test (1) was equivalent to the hypothesis test:
where
Lui [
33] solved this hypothesis test by applying weighted least squares. The estimator of RR is as follows:
and, applying the delta method, the estimated variance of
is as follows:
where
Applying the central limit theorem, the test statistic for hypothesis test (4) was as follows:
or equivalently
whose distribution was approximately a chi-square distribution with one degree of freedom.
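To make the delta-method step concrete, the following sketch derives the standard large-sample variance of the log relative risk for paired data. It is offered under assumptions: the notation ($n_{ij}$, $\hat p_{1\cdot}$, $\hat p_{\cdot 1}$), the use of the log scale, and the direction of the ratio are chosen for illustration, and the statistic obtained by Lui [33] through weighted least squares may take a different form.

```latex
% Assumed notation: \hat p_{1\cdot} = (n_{11}+n_{12})/n, \hat p_{\cdot 1} = (n_{11}+n_{21})/n,
% \widehat{RR} = \hat p_{1\cdot} / \hat p_{\cdot 1}; working on the log scale is an assumption.
\begin{align*}
\operatorname{Var}(\ln \widehat{RR})
  &\approx \frac{\operatorname{Var}(\hat p_{1\cdot})}{p_{1\cdot}^{2}}
   + \frac{\operatorname{Var}(\hat p_{\cdot 1})}{p_{\cdot 1}^{2}}
   - \frac{2\operatorname{Cov}(\hat p_{1\cdot}, \hat p_{\cdot 1})}{p_{1\cdot}\,p_{\cdot 1}}
   = \frac{p_{12} + p_{21}}{n\,p_{1\cdot}\,p_{\cdot 1}},\\[4pt]
\widehat{\operatorname{Var}}(\ln \widehat{RR})
  &= \frac{n_{12} + n_{21}}{(n_{11}+n_{12})(n_{11}+n_{21})},\\[4pt]
z &= \frac{\ln \widehat{RR}}{\sqrt{\widehat{\operatorname{Var}}(\ln \widehat{RR})}},
\qquad z^{2} \;\dot\sim\; \chi^{2}_{1} \ \text{under } H_0 : RR = 1.
\end{align*}
```

The first line uses $\operatorname{Var}(\hat p_{1\cdot}) = p_{1\cdot}(1-p_{1\cdot})/n$, $\operatorname{Var}(\hat p_{\cdot 1}) = p_{\cdot 1}(1-p_{\cdot 1})/n$, and $\operatorname{Cov}(\hat p_{1\cdot}, \hat p_{\cdot 1}) = (p_{11} - p_{1\cdot}p_{\cdot 1})/n$; the concordant 'success' frequency $n_{11}$ now enters the statistic, unlike in the conditional methods.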
- 18.
OD test (ODT)
The hypothesis test (1) was also equivalent to the following:
where
Lui [
33] solved this hypothesis test by applying the same method as the one used in the
RR test. Following an analogous procedure, the test statistic for the hypothesis test (5) is as follows:
and where
The distribution of the test statistic is the same as the one in the previous case.
- 19.
ODM test (ODMT)
The hypothesis test (1) was also the same as the following:
where
Applying the same method as in the two previous cases, Lui [
33] proposed the following test statistic:
where
The distribution of test statistic was the same as in the previous cases.
- 20.
RR, OD, and ODM test with cc (RRTcc, ODTcc, and ODMTcc)
The previous three methods can also be obtained by adding a
cc. We proposed to add
to each one of the observed frequencies, i.e., in the following:
Thus, the expressions of test statistics
,
, and
were replaced by
,
, and
as follows:
respectively. In this way, new test statistics
,
, and
were obtained, and their distributions were the same as in previous cases.
7. Discussion
The comparison of two paired binomial proportions is a problem that appears frequently in medical and clinical studies. The statistical literature contains diverse methods to solve this hypothesis test, and it is therefore necessary to determine which methods have the best asymptotic behaviour in terms of the type I error rate and power. We reviewed 19 existing methods and proposed 5 new ones, and we carried out extensive simulation experiments to study their asymptotic behaviour. From the results obtained, we have given some general rules of application for the methods studied.
May and Johnson [9] compared, through simulation experiments, the asymptotic behaviour of eight methods (CET, MidpT, MT, MTYcc, MMT, WT, MWT, and LRT) and recommended using the MidpT, MWT, and MT methods when it is verified that
. May and Johnson used the criterion that the type I error rate must not be higher than the nominal error level.
Park [10] compared, using the same criterion as May and Johnson, the asymptotic behaviour of the CET, MT, LRT, and WT methods, concluding that the method with the best behaviour is the MT.
Fagerland et al. [11,12,13] also compared, through simulation experiments, the asymptotic behaviour of five methods: CET, MidpT, UET, MT, and MTYcc. These authors used the same criterion as May and Johnson and recommended using the MidpT and MT methods.
The studies of May and Johnson [9] and Fagerland et al. [11,12,13] used the same criterion to assess the type I error rates, and both studies recommended the MidpT and MT methods. Park [10] recommended the MT method.
Our criterion to assess the type I error rate of each method is more flexible, allowing the type I error rate of a method to exceed the nominal error level without the method being considered too liberal. For an approximate test, the type I error rate is expected to fluctuate around the nominal error level when the sample size is large, and it can therefore exceed that level. With our criterion, it can be slightly higher than the nominal error level. For an exact test, the type I error rate must not exceed the nominal error level, as happens with the results obtained for CET and UET (Table 2, Table 3, Table 4 and Table 5).
The simulation experiments carried out allowed us to establish some general rules of application for the methods. The WT, LRT, RRT, and ODMT methods can be used whatever the sample size, and if the sample size is large, the MT and MWT methods can also be applied. Of these six methods, two are conditional (MT and LRT) and four are unconditional (WT, MWT, RRT, and ODMT); therefore, the problem can be addressed from either perspective (conditional or unconditional), with very similar results. Another important conclusion from the simulation experiments is that continuity corrections do not improve the asymptotic behaviour of the methods studied. Therefore, although the statistical literature contains different methods that incorporate continuity corrections, their application is not justified.
In this manuscript, we have studied the comparison of two paired proportions using hypothesis tests. An alternative is to carry out this comparison using confidence intervals instead of hypothesis testing. In this context, there are also numerous intervals (exact and approximate) that can be used [4,13,14,15,16]. Fagerland et al. [12,13] compared the behaviour of some of the most widely used intervals, but this comparison may now be somewhat incomplete. Therefore, given that new confidence intervals have been investigated in recent years [14,15,16], it is of great practical interest to determine which intervals have the best asymptotic behaviour.