1. Introduction
In randomized clinical trials, bilateral data frequently arise when patients receive a treatment on paired body parts or organs (such as eyes, ears, or kidneys). Because the outcome for each patient naturally falls into one of three types (no response, unilateral response, or bilateral response), accounting for the intraclass correlation in bilateral data is a natural way to avoid misleading results [1, 2, 3, 4, 5]. Intraclass correlation in bilateral data has been investigated in recent decades with various statistical methods [1, 2, 3, 5, 6, 7, 8]. Rosner proposes that the conditional probability of a response on one side of a pair of body parts or organs, given a response on the other side, is a positive constant R times the response rate [1]. Tang et al. provide statistical inference for correlated binary paired data under the R model and evaluate the asymptotic tests in terms of Type I error and power [9]. Ma, Shan, and Liu develop an asymptotic testing method under the homogeneity assumption for Rosner’s R model [5]. Donner assumes that all treatment groups share one intraclass correlation coefficient ρ [2], and Thompson evaluates the robustness of Donner’s ρ model for paired data through simulations [10]. Liu et al. test the equality of correlation coefficients based on Donner’s ρ model for paired binary data with multiple groups [11]. Later, Liu et al. explore exact methods for testing the homogeneity of prevalence for correlated binary data under Donner’s ρ model [8]. However, Dallal criticizes Rosner’s assumption and points out that “the constant R model will give a poor fit if the characteristic is almost certain to occur bilaterally with widely varying group-specific prevalence” [3]. He proposes that it is more advantageous to assume the characteristic emerges through a triggering mechanism, in which the probability of a subsequent occurrence is unaffected by the probability of initiating the trigger [3]. Under Dallal’s model, the conditional probability of a response on one side of a pair of body parts or organs, given a response on the other side, is a constant [3]. Li et al. develop asymptotic and exact methods following Dallal’s model [12]. Then, Chen et al. propose multiple test statistics for comparing response rates across groups under Dallal’s model [13].
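For reference, the three correlation assumptions described above can be written side by side. The notation is an assumption made here for illustration and is not taken from this section: Z_{ijk} ∈ {0, 1} is the response of the kth organ (k = 1, 2) of the jth patient in group i, π_i is the response rate in group i, and γ denotes the constant in Dallal’s model. The last line gives the probabilities of zero, one, and two responses that follow from Dallal’s assumption by elementary probability.

```latex
\begin{align*}
\text{Rosner:}\quad & \Pr(Z_{ij1}=1 \mid Z_{ij2}=1) = R\,\pi_i, \\
\text{Donner:}\quad & \operatorname{Corr}(Z_{ij1}, Z_{ij2}) = \rho, \\
\text{Dallal:}\quad & \Pr(Z_{ij1}=1 \mid Z_{ij2}=1) = \gamma, \\
\text{so that}\quad & p_{0i} = 1 - 2\pi_i + \gamma\pi_i,\qquad
                      p_{1i} = 2\pi_i(1-\gamma),\qquad
                      p_{2i} = \gamma\pi_i .
\end{align*}
```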
Testing the homogeneity of an effect size measure between different groups amounts to testing whether the measure takes a common value. Three measures are commonly used to evaluate the effect size in randomized clinical trials: the odds ratio, the risk difference, and the relative risk [14]. The relative risk is more informative than the risk difference in some cases [15]. Compared with the odds ratio, the relative risk can handle sparse data better. Stratifying bilateral data by control variables (e.g., disease phase or age) provides more refined statistical results that can address the different research questions posed in randomized clinical trials [16, 17]. For example, evaluating the effect size between treatment and control groups across strata of disease phase amounts to testing the homogeneity of the effect size across the strata. Zhuang et al. investigate the homogeneity test of the ratio of two proportions in stratified bilateral data based on Donner’s ρ model [18]. Shen et al. test the homogeneity of the difference between two proportions for stratified correlated paired binary data under Donner’s ρ model [19]. Xue and Ma propose interval estimation of proportion ratios for stratified bilateral correlated binary data under Rosner’s constant R model [20].
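As a quick illustration of the three effect size measures named above, the following minimal sketch computes them from two response proportions. The function name `effect_sizes` and the example proportions are hypothetical and are not taken from the paper or its data.

```python
def effect_sizes(p1: float, p2: float) -> dict:
    """Risk difference, relative risk, and odds ratio of group 1 versus group 2."""
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("both proportions must lie strictly between 0 and 1")
    return {
        "risk_difference": p1 - p2,
        "relative_risk": p1 / p2,
        "odds_ratio": (p1 / (1.0 - p1)) / (p2 / (1.0 - p2)),
    }


# Hypothetical proportions chosen only for illustration.
for name, value in effect_sizes(0.30, 0.15).items():
    print(f"{name}: {value:.4f}")
```

The relative risk (the proportion ratio) is the measure whose homogeneity across strata is studied in this paper.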
Rosner’s R and Donner’s ρ models have limitations and restrictions. Rosner’s R model assumes that the conditional probability of a response on one side of a pair of body parts or organs, given a response on the other side, is a positive constant R times the response rate [1]. However, the constant R cannot reach one unless each response rate is equal, and the model is not appropriate if all of the patient’s paired body parts or organs respond and different groups have different response rates [3, 20]. In Donner’s ρ model, it is assumed that all treatment groups share one intraclass correlation coefficient ρ [2]. However, when the correlation coefficients of the groups differ significantly, Donner’s ρ model cannot be used for data analysis. In practice, the equality of the correlation coefficients must be verified first to decide whether the Donner model is appropriate. Compared with Rosner’s R and Donner’s ρ models, Dallal’s model does not require concern over whether all of a patient’s body parts or organs have responded or whether different groups have varying response rates. Additionally, it eliminates the need to verify that the correlation coefficients are equal. Instead, it provides a straightforward intraclass correlation constant for the conditional probability of a response in one member of a pair of body parts or organs given a response in the other.
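To make Dallal’s assumption concrete, the sketch below computes the probabilities of zero, one, and two responses for a single patient and recovers the two parameters from counts in one group. The symbols `pi` (response rate) and `gamma` (the constant conditional probability), the function names, and the counts are assumptions made here for illustration. The estimates are the unconstrained single-group ones, which follow from reparametrizing the trinomial proportions; they are not the constrained MLEs under homogeneity across strata that the paper derives in Section 3.

```python
def dallal_cell_probs(pi: float, gamma: float) -> tuple:
    """P(0 responses), P(1 response), P(2 responses) for one patient."""
    p2 = gamma * pi                # both organs respond
    p1 = 2.0 * pi * (1.0 - gamma)  # exactly one organ responds
    p0 = 1.0 - p1 - p2             # neither organ responds
    if min(p0, p1, p2) < 0.0:
        raise ValueError("(pi, gamma) lies outside the valid parameter space")
    return p0, p1, p2


def single_group_estimates(m0: int, m1: int, m2: int) -> tuple:
    """Estimates of (pi, gamma) from the numbers of patients with 0, 1, 2 responses."""
    patients = m0 + m1 + m2
    responses = m1 + 2 * m2        # total number of responding organs
    if patients == 0 or responses == 0:
        raise ValueError("need at least one patient and one response")
    pi_hat = responses / (2 * patients)
    gamma_hat = 2 * m2 / responses
    return pi_hat, gamma_hat


# Hypothetical counts, not taken from the paper's tables.
print(dallal_cell_probs(0.4, 0.5))
print(single_group_estimates(40, 40, 20))
```

In this parametrization the conditional probability stays fixed while the response rate is free to vary by group, which is the property the paragraph above contrasts with Rosner’s R model.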
To avoid the limitations of Rosner’s R and Donner’s ρ models, we investigate the proportion ratios in clinical trial designs with stratified bilateral data under Dallal’s model. The remainder of this paper is organized as follows: Section 2 presents the data structure and hypotheses, and Section 3 introduces the maximum-likelihood estimation under homogeneity. In Section 4, we propose three tests to examine the homogeneity of proportions across strata under Dallal’s model. We then investigate the performance and robustness of the three tests using simulation studies in Section 5. In Section 6, we use two real data examples to illustrate our proposed methods. Conclusions and future work are given in Section 7.
6. Real Data Examples
Data from two randomized clinical trials are used as real data examples to illustrate the three proposed tests. The first example is a double-blinded randomized clinical trial reported by Mandel et al. [21]. In this trial, children who suffered from otitis media with effusion (OME) and underwent bilateral tympanocentesis were randomized into two groups, receiving a 14-day treatment of either amoxicillin or cefaclor [21]. The number of cured ears after treatment is summarized in Table 4, with age group as the stratum. To explore whether the cure rates of the two groups (amoxicillin and cefaclor) are clinically equivalent across age strata, we test the homogeneity using the three proposed tests. The MLEs of the parameters, based on the observed data, are listed in Table 5, and the three test statistics and p-values are summarized in Table 6. All of the p-values are greater than 0.05, which means there is no statistical evidence to reject the null hypothesis of homogeneity.
The second example is a randomized, double-blinded, placebo-controlled trial presented by Postlethwaite et al. [22]. In this trial, the primary measurement was the modified Rodnan Skin Score (MRSS), and patients diagnosed with diffuse scleroderma were randomly divided into a treatment group and a control group. The duration of the disease (early or late phase) defined two separate strata. The number of improved hands is summarized by group and disease phase in Table 7. The MLEs of the parameters are given in Table 8, and the test statistics and p-values are summarized in Table 9. All of the p-values are greater than 0.05, which means there is no statistical evidence to reject the null hypothesis of homogeneity.
7. Conclusions
This article uses three MLE-based tests (the likelihood ratio test (LRT), the Wald-type test, and the score test) to test the homogeneity of the relative risk of two proportions in stratified bilateral correlated data.
The three Monte Carlo simulation studies show that the score test performs robustly in terms of empirical Type I error and power. Even with a small sample size and multiple strata, the score test still yields a stable empirical Type I error and satisfactory power. The likelihood ratio test and the Wald-type test offer reasonable power but have unstable empirical Type I errors. With a small sample size, the Wald-type test shows an unsatisfactory empirical Type I error, although its performance becomes reasonable as the sample size increases. Under a small sample size with multiple strata, the score test is slightly unstable, but its performance also becomes robust as the sample size increases.
However, asymptotic methods may have some limitations due to poor performance under a small sample size with multiple strata. Future work might consider exact tests to investigate related issues.
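The empirical Type I error discussed above is the fraction of datasets simulated under the null hypothesis for which a test rejects at the nominal level. The sketch below shows that Monte Carlo recipe in generic form. It is not the paper's simulation design: the function names are hypothetical, and the stand-in test is a two-sided z-test for a zero mean with known unit variance, used only because its p-value has a simple closed form. One of the paper's tests could be substituted by supplying its own data generator and p-value function.

```python
import math
import random


def empirical_rejection_rate(simulate, p_value, n_rep=2000, alpha=0.05, seed=0):
    """Fraction of simulated datasets on which the test rejects at level alpha."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    rejections = 0
    for _ in range(n_rep):
        data = simulate(rng)
        if p_value(data) < alpha:
            rejections += 1
    return rejections / n_rep


def simulate_null(rng, n=30):
    """Stand-in data generator: n standard normal draws, so the null is true."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def z_test_p_value(sample):
    """Two-sided p-value for H0: mean = 0 when the variance is known to be 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2.0))


rate = empirical_rejection_rate(simulate_null, z_test_p_value)
print(f"empirical Type I error: {rate:.3f}")
```

Running the same loop with data generated under an alternative gives the empirical power instead.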