1. Introduction
Medical comparative studies often involve collecting either bilateral or unilateral data from subjects. For instance, in ophthalmologic studies, it is reasonable to assume that the measurements from the two eyes of the same subject are generally correlated. Rosner [
1] pointed out that the fundamental unit for statistical analysis in ophthalmologic studies is often the eye rather than the person. For example, when an individual contributes both eyes’ information to analysis, such as comparing intraocular pressures between age groups, the values of the two eyes are generally correlated. Therefore, methods of analysis that treat each eye independently are not valid. For bilateral data, Rosner [
1] proposed a constant R model to test whether the proportions of affected eyes are the same among the
g groups of patients while accounting for the intra-person dependence. Dallal [
2] criticized the appropriateness of Rosner’s model, noting that it would fit poorly if the trait almost certainly occurs bilaterally with widely differing population-specific prevalences, and proposed an alternative approach based on compound multinomial sampling. Donner [
3] proposed an alternative approach based on a simple adjustment of the standard Pearson chi-square test for the homogeneity of proportions. Neuhaus [
4] compared the magnitudes of covariate effects and the differences in the interpretation of the regression coefficients measured by the different classes of approaches. Tang et al. [
5] investigated eight procedures for testing the equality of proportions between two groups in correlated otolaryngologic data. Their empirical results show that tests based on the approximate unconditional method usually produce better empirical type I error rates than the corresponding asymptotic tests. Based on the model of Donner [3], Lin et al. [6] discussed the relationship between the disease probability and covariates via logistic regression in ophthalmologic studies and proposed a new minorization-maximization (MM) algorithm and a fast quadratic lower bound (QLB) algorithm to calculate the MLEs of the vector of regression coefficients. Most recently, Westgate [
7] assessed the residual pseudo-likelihood approach with the GLIMMIX procedure in SAS and deemed it a practical and reliable method for estimating the intra-cluster correlation coefficient in binary outcome data from cluster randomized trials. It allows for easy confidence interval construction and offers the advantage of small-sample adjustments to ensure valid statistical inference. Furthermore, the Poisson regression model with a sandwich variance estimator provides covariate-adjusted risk ratios and standard errors for prospective studies with independent binary outcomes. Zou and Donner [
8] adjusted the middle term in the sandwich estimator and extended the model to studies with correlated outcomes in longitudinal or cluster randomization studies. Also, Li and Tong [
9] stated that sample size formulas for relative risk regression can be applied to the modified Poisson model in cluster-randomized trials without losing efficiency. These formulas also work well with variable cluster sizes when appropriate corrections to the sandwich variance estimator are applied and at least 10 clusters are present.
Procedures that fail to utilize both bilateral and unilateral data are less powerful. Several studies have therefore considered the combined data. Pei et al. [
10] studied ten test statistics to test the equality of two proportions and found that Rosner and Wald-type statistics based on dependency models and constrained maximum likelihood estimation perform satisfactorily for small to large samples. For general
groups, Wang and Ma [
11] worked on the combined bilateral and unilateral data and developed three types of confidence intervals for the relative ratio. Their simulation results show that score tests are always robust under different configurations, with the median near the nominal type I error rate, and that they yield higher powers than other test statistics, assuming the paired organs are not completely independent. The coverage of the score CIs is the closest to the nominal coverage probability, with reasonable interval widths over the whole parameter space.
In practice, stratified data analysis has attracted much attention, and several statistical methods have been proposed in clinical trials. Stratification involves dividing subjects into subgroups based on specific characteristics, such as age, gender, disease severity, or other relevant factors. By stratifying subjects, researchers can ensure that each treatment group has a similar distribution of these characteristics, which can help to minimize the impact of these variables on the outcomes of the trial. Nam [
12] provided the sample size using the modified homogeneity score method and compared it with the goodness-of-fit method in stratified studies. Li and Tang [
13] showed that the modified score test based on bootstrap resampling is a desirable test procedure for testing the homogeneity of rate ratios in stratified matched-pair designs. Tang and Qiu [
14] explored the homogeneity test of differences between two proportions in a stratified bilateral sample design, and a modified score test statistic was proposed. Further, Tang et al. [
15] developed ten confidence intervals for risk differences; notably, eight of them are hybrid confidence intervals that do not rely on any model assumption. Their simulation results show that the hybrid confidence interval based on the Mid-P individual confidence interval is well controlled, although the individual confidence interval has no closed-form solution. Most recently, Shen et al. [
16] introduced a common test and interval estimation of the difference between two proportions in stratified bilateral designs under an equal intra-class correlation coefficient model. That work is well motivated but considers only bilateral data. Analysis methods that do not take unilateral data into account are not comprehensive enough.
In a stratified data analysis with two groups (experimental vs. placebo), the homogeneity test of the difference between two proportions is used to assess whether the group differences are the same across strata. When the differences vary significantly across strata, we can conclude that a confounding effect exists. If they do not, we can combine the stratified data into a pooled two-group dataset. Motivated by the previous discussion, we apply a homogeneity test of the differences between two proportions across strata to stratified bilateral and unilateral data. In this article, focusing on Donner’s common
model, we will derive the constrained and unconstrained MLEs and develop three MLE-based testing procedures in
Section 2. In addition, a model-based method that treats the measurements from paired organs of each subject as repeated observations is provided for comparison with the MLE-based methods. Generalized estimating equations (GEEs) are briefly introduced as the theoretical basis for how the model-based method handles correlated binary outcomes. In
Section 3, Monte Carlo simulations are conducted to evaluate the type I error rates and powers of the tests under various configurations. In
Section 4, a real example from an otolaryngological study is analyzed using the proposed methods. Discussions are presented in
Section 5.
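For reference, the homogeneity hypothesis described above can be written compactly. The notation is an assumption made here for illustration, since the formal definitions belong to Section 2: \pi_{ij} is taken to be the probability that an organ is affected in group i (i = 1, 2) of stratum j (j = 1, ..., J), and d_j the difference of the two proportions in stratum j.

```latex
d_j = \pi_{1j} - \pi_{2j}, \qquad j = 1, \dots, J,
\\
H_0:\; d_1 = d_2 = \cdots = d_J = d
\quad \text{versus} \quad
H_a:\; d_j \neq d_{j'} \ \text{for some } j \neq j'.
```

Under H_0 the strata share a common difference d; rejecting H_0 indicates that the group effect is not the same in every stratum.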
3. Simulation Study
We evaluate the testing performance of the three MLE-based methods and the model-based method in terms of the type I error rate and power. For the type I error rate, we run Monte Carlo simulations with a balanced sample size
= 25, 50, and 100 and unbalanced sample size (
) = (60,40), where
, and specifically,
J = 2, 4 and 8. We also consider
= 0.2, 0.5, 0.7, and
= 0.3, 0.5. Then, we test the homogeneity of the risk differences across strata under the null hypothesis
by choosing
d = 0, 0.1, and 0.2. In each configuration, 10,000 random samples are generated under the null hypothesis, and the empirical type I error rate is calculated as the proportion of replicates in which the null hypothesis is rejected. The calculations are based on the converged cases among the 10,000 replicates. At the 5% significance level,
Table 2,
Table 3 and
Table 4 report the empirical type I error rates based on all proposed configurations for
J = 2, 4, and 8, respectively.
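The simulation step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code. It assumes Donner's model in its usual form, where each organ is affected with marginal probability `pi` and the two organs of a bilateral subject have correlation `rho`. The function names `bilateral_probs`, `simulate_cell`, and `empirical_rate`, and the arguments `m` (bilateral subjects) and `n` (unilateral subjects), are hypothetical. The test statistics themselves are not reproduced here; `empirical_rate` only shows how a rejection rate could be computed from p-values once non-converged replicates (coded as NaN) are dropped.

```python
import numpy as np

def bilateral_probs(pi, rho):
    """P(0, 1, 2 affected organs) for one bilateral subject under the
    assumed Donner-type model (marginal risk pi, correlation rho)."""
    p2 = pi**2 + rho * pi * (1 - pi)         # both organs affected
    p1 = 2 * pi * (1 - pi) * (1 - rho)       # exactly one organ affected
    p0 = (1 - pi)**2 + rho * pi * (1 - pi)   # neither organ affected
    return np.array([p0, p1, p2])

def simulate_cell(m, n, pi, rho, rng):
    """Draw the data for one group in one stratum.
    Returns (bilateral counts with 0/1/2 affected organs,
             number of affected organs among n unilateral subjects)."""
    bilateral = rng.multinomial(m, bilateral_probs(pi, rho))
    unilateral = rng.binomial(n, pi)
    return bilateral, unilateral

def empirical_rate(p_values, alpha=0.05):
    """Share of converged replicates (non-NaN p-values) that reject at level alpha."""
    p = np.asarray(p_values, dtype=float)
    converged = ~np.isnan(p)
    return float(np.mean(p[converged] < alpha))

# Illustrative draw; the parameter values are arbitrary, not the paper's settings.
rng = np.random.default_rng(2024)
bilateral, unilateral = simulate_cell(m=50, n=50, pi=0.5, rho=0.3, rng=rng)
```

In a full study, `simulate_cell` would be called for each group and stratum in every replicate, a test would be applied to the resulting table, and `empirical_rate` would summarize the collected p-values.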
Following Tang et al. [
5], a test is considered liberal if the ratio of the empirical type I error to the nominal type I error is greater than 1.2 (i.e., empirical type I error > 0.06 for α = 0.05), conservative if the ratio is smaller than 0.8 (i.e., empirical type I error < 0.04 for α = 0.05), and robust otherwise. From the tables, the GEE method is robust for all scenarios. The Wald test performs well when
J = 2, but becomes liberal when
J increases. The Likelihood ratio test is liberal when
is large and the sample size is small or unbalanced. The Score test becomes liberal in some cases with a larger
,
d, and
under an unbalanced sample size. Moreover, the Score test is robust when
d = 0, 0.1. All of the tests perform better as the sample size increases. Running time should also be considered when evaluating these methods. Although the GEE method gives satisfactory results, its running time is much longer than that of the MLE-based methods.
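The liberal/conservative/robust rule quoted from Tang et al. [5] in this paragraph reduces to a ratio check. The following Python sketch is only an illustration; the function name `classify_test` is hypothetical, and the thresholds 1.2 and 0.8 are the ones stated above.

```python
def classify_test(empirical_rate, nominal=0.05):
    """Label a test by the ratio of its empirical to nominal type I error."""
    ratio = empirical_rate / nominal
    if ratio > 1.2:
        return "liberal"
    if ratio < 0.8:
        return "conservative"
    return "robust"

# Illustrative rates only; these are not values from the paper's tables.
for rate in (0.035, 0.052, 0.071):
    print(rate, classify_test(rate))
```

Applying such a function to every cell of a results table gives a quick count of how often each test falls outside the robust band.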
In addition to the above parameter settings, we also conduct simulations with parameters chosen at random. For each of
and sample size
, the parameters
,
, and
d are generated from a uniform distribution for 1000 configurations. The empirical type I error rates are calculated from 10,000 replications for each configuration. The boxplots in
Figure 1 summarize the results. The plot shows that the likelihood ratio test performs worst. When
J and the sample size are small, the LR test behaves liberally. The Wald test also becomes inflated when
J increases. The Score test shows relatively stable type I error control, and its boxplots are closest to the nominal level α = 0.05. All three tests behave better with a larger sample size and a smaller
J. In general, the Score test behaves satisfactorily.
Next, we investigate the power of the four tests. The settings of
are the same as those used for the empirical type I error rates. Furthermore, the parameter settings under the alternative hypothesis are listed in
Table 5.
Table 6,
Table 7 and
Table 8 report the empirical powers for the same scenarios for
J = 2, 4, 8, respectively. We observe that power increases as the sample size increases. Also, as the difference of d in the alternative hypothesis increases, the powers increase substantially. In some settings, the power of the likelihood ratio test is inflated because its empirical type I error rate exceeds the nominal level. Moreover, the powers of the three MLE-based methods are very close, and the model-based test has the lowest power. Overall, the Score test is recommended because of its satisfactory type I error control and good performance in terms of power.
4. Real Example
We illustrate the four methods using data from a double-blind clinical trial by Mandel et al. [
19]. The study compared cefaclor and amoxicillin for treating acute otitis media with effusion in 214 children (293 ears), who received either unilateral or bilateral tympanocentesis and were randomly assigned to either the cefaclor or the amoxicillin group. The number of cured ears was recorded after 14 days of treatment. The data are summarized in
Table 9, stratified by the children’s age.
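As a quick consistency check on the counts reported above, suppose each child contributes either one or two ears, as the unilateral/bilateral design implies. Writing b for the number of children with bilateral data and u for the number with unilateral data (symbols introduced here for illustration only), the totals give:

```latex
b + u = 214, \qquad 2b + u = 293
\;\Longrightarrow\;
b = 293 - 214 = 79, \qquad u = 214 - 79 = 135.
```

Under this assumption, 79 children contributed two ears and 135 contributed one, so a sizable share of the sample is unilateral and would be lost by a bilateral-only analysis.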
Based on these data, the constrained and unconstrained MLEs of the parameters are listed in
Table 10. The test statistics and the p-values of the proposed methods are reported in
Table 11. All the p-values are greater than the nominal level α, so none of the methods provides evidence that the corresponding d varies across strata. However, we notice that the unconstrained MLEs of
d for different strata are vastly different, which may be due to the small sample size for the groups.
5. Conclusions
In this article, we compare three MLE-based test procedures with a model-based method for testing the homogeneity of differences between two proportions across strata for stratified bilateral and unilateral data. The three MLE-based methodologies are illustrated in detail, especially the procedures used to compute the constrained and unconstrained MLEs of the parameters. According to the simulation results, the Score test shows relatively stable type I error control and good performance in terms of power. The Wald test behaves well when the number of strata J = 2, and the Likelihood ratio test works well when the difference
d = 0. All four tests are close to the nominal level when the sample size is set to 100. The model-based method is robust, sometimes even more so than the Score test, and performs well in terms of power. However, the GEE approach for repeated measures analysis has no explicit form for its test statistic and is very time-consuming. The explicit form of the Score test statistic is useful for its simplicity and for the further development of an exact test. Developing an exact test for small samples would be nearly infeasible with the model-based method because of the extensive calculations involved. Moreover, many studies in clinical ophthalmology take the eye, rather than the person with paired organs, as the unit of analysis. For example, Saba and Walter [
20] reported a retrospective consecutive case series of 626 eyes (543 patients) with nvAMD treated with 1438 brolucizumab injections. Some of these patients contributed one eye and some contributed two, and it would be imprudent to ignore the correlation between two eyes from the same patient in a statistical analysis. Therefore, our proposed methods for the combined data structure can be applied to such cases to compare drug efficacy. In summary, the Score test is worth a more in-depth investigation.