1. Introduction
Adaptive designs offer significant advantages in clinical research because they address limitations that have hindered the application of a static, pre-established protocol. In particular, interim analyses (sequential testing procedures) enable the early termination of trials when one treatment demonstrates clear superiority over the other. These interim analyses are used to evaluate early evidence of an intervention’s effectiveness and safety, which helps researchers make informed decisions about whether to continue the trial or adjust the study design [1,2,3,4]. Interim analyses can be planned or unplanned, and they can be comparative or noncomparative. A significant challenge with comparative interim analyses is the increased risk of committing a Type I error through repeated testing. A Type I error is the incorrect rejection of a true null hypothesis, falsely indicating a treatment effect. The Type I error rate, known as alpha (α), is predetermined and typically set at 0.05 to control this risk, ensuring that the likelihood of mistakenly declaring a treatment effect significant does not exceed 5%. Robust statistical techniques, predetermined stopping rules, and transparent decision-making rules and regulations should be applied to manage this challenge. Group sequential designs incorporate a strategy of predefined comparative interim analyses, including guidelines that may terminate a trial based on significant or insignificant results. A larger sample size is usually needed for a trial design that includes stops for interim analyses. In addition, there are many ways to define the stopping boundaries in terms of how α is allocated across analyses, and many of the critical values adopted in clinical trials have been suggested [5,6,7].
Interim analyses have been applied successfully in practice. For instance, numerous researchers have utilized the critical values proposed by O’Brien and Fleming to determine the significance of their studies [8]. For example, Hammond et al. (2022) [9] demonstrated that the combination of nirmatrelvir and ritonavir, when given to COVID-19 patients at the onset of illness, significantly reduces the likelihood of patients developing severe symptoms; the treatment was also shown to rapidly decrease SARS-CoV-2 viral levels. These findings emphasize the importance of taking proactive measures and represent a positive step forward in the battle against COVID-19. Likewise, Goldberg et al. (2004) [10] conducted a study of chemotherapy for metastatic colorectal cancer using three different treatments and the critical values from O’Brien and Fleming. The research established that while there were no disparities in overall survival across the three treatments, two of the treatments, FOLFOX and FOLFIRINOX, had better response rates and longer progression-free survival (PFS) than the third treatment, FOLFIRI. In addition, Marcus et al. (2017) [11] employed O’Brien and Fleming’s methodology in their study and decided to stop the GALLIUM trial early for treatment-naive follicular lymphoma patients. These studies indicate that the group sequential testing procedure developed by O’Brien and Fleming, when applied to various types of clinical trials, is valuable and useful in a range of circumstances. It facilitates more precise and timely decisions and can spare patients ineffective and redundant treatments. O’Brien and Fleming used an approximate distribution and resorted to simulation to demonstrate that their method shares the same Type I error rate and power as a fixed one-stage chi-square test, a strength that has encouraged researchers to follow their steps and modify the procedure. The O’Brien and Fleming group sequential testing procedure has since undergone significant refinements to address practical limitations and expand its applicability. Kung-Jong Lui (1993, 1994) extended the procedure by developing methods to incorporate intraclass correlation. This allows the model to be used in cluster-randomized trials, which often present unique challenges due to correlation within clusters. Lui’s work improved the flexibility and accuracy of the procedure in settings where clusters or groups of participants are allocated together [12,13]. Tang et al. (1989) [14] extended the O’Brien and Fleming method to accommodate multiple correlated endpoints. Their work demonstrated how critical boundaries could be adapted to account for correlations among endpoints, making the procedure more applicable to complex trial designs where multiple outcomes are evaluated simultaneously. Additionally, Weigl et al. (2020) [15] explored the use of the O’Brien and Fleming approach in the context of longitudinal studies, where data are collected repeatedly for each participant. The continued refinement of the O’Brien and Fleming procedure has expanded its utility across a variety of research domains. The ability to incorporate intraclass correlations, handle longitudinal data, and account for multiple endpoints has enhanced the procedure’s relevance and applicability in both clinical and behavioral research settings.
An alpha-spending approach was later introduced to generalize the O’Brien and Fleming and Pocock procedures. This method allows for flexibility in the timing of interim analyses, enabling researchers to conduct analyses at non-predefined intervals while maintaining overall control of the Type I error rate. The alpha-spending function dynamically adjusts critical values based on the number and timing of interim analyses, ensuring statistical rigor and practicality in real-world scenarios.
Moreover, the O’Brien and Fleming procedure has been compared to other group sequential designs, such as the Pocock and Haybittle–Peto methods. These comparisons have shown that the O’Brien–Fleming approach is generally more conservative, with a lower risk of early termination and higher power at later trial stages [16]. In 2013, Hammouri addressed a crucial issue regarding the stopping bounds of the O’Brien and Fleming procedure. Specifically, Hammouri noted an inconsistency in the critical values of the O’Brien and Fleming multiple testing method, namely their non-monotonic behavior. This challenge was effectively resolved by conducting additional simulations to generate critical values with a monotonic pattern. Adhering to this monotonic pattern improves control of the Type I error rate.
Additionally, Hammouri introduced further changes to the O’Brien and Fleming procedure by applying three distinct implementations, which made the method more flexible and adaptable. Since each implementation was performed individually, there were three different procedures. In this sense, optimal, Neyman, and weighted allocations (with weights that decrease sequentially) are three implementations that add advantages within clinical trial design [17]. Then, the original O’Brien and Fleming procedure was modified by combining it with weighted samples and optimal random allocation simultaneously; the optimal allocation aims to maximize resource utilization efficiency by assigning a larger proportion of subjects to the superior treatment. The implementation of weighted allocation, in turn, allows different sample weights to be applied in different trial phases. This allocation procedure assumes that only certain phases or time points in a clinical trial contribute additional data gathering or provide more informative data. Hence, assigning higher weights to the samples collected during such critical phases enables investigators to plan optimal resource deployment and focus on obtaining well-organized data at the most sensitive phase of the trial [18]. The last integration combined the Urn allocation method with the O’Brien and Fleming multiple testing procedure. This innovative approach offers a refined framework combining the dynamic allocation properties of the Urn method with the rigorous control of Type I error provided by the O’Brien and Fleming procedure [19].
A significant benefit of these implementations is that they allow trials to end early when one treatment demonstrates superiority. This capability not only strengthens the ethical dimensions of clinical trials but also minimizes patient exposure to ineffective treatments while enhancing resource efficiency. Importantly, these new approaches maintain the statistical power of the original multiple testing procedure, ensuring robust and reliable results. The efficacy and robustness of all these methods have been demonstrated using experimental results and simulated case studies. These studies highlight their broad applicability across diverse clinical scenarios, making them versatile tools for modern clinical research. The methods’ ability to adapt to varying trial conditions while preserving the integrity of statistical outcomes underscores their potential to become a standard in adaptive clinical trial designs. Previous work thus sets a strong foundation for future research and practical applications in optimizing treatment comparisons in clinical settings.
On the other hand, randomization techniques are critical in clinical trials to ensure unbiased treatment allocation and comparability between groups. Various methods serve unique purposes in enhancing the integrity of trial results. Simple randomization provides a straightforward approach in which each participant has an equal chance of being assigned to any treatment group, minimizing selection bias. In contrast, block randomization helps maintain balance across treatment groups by dividing participants into blocks and randomizing within those blocks, ensuring that each group is evenly represented throughout the trial. Stratified randomization involves categorizing participants based on specific characteristics, such as age or gender, before randomization, which ensures that these characteristics are evenly distributed across treatment groups. Additionally, response-adaptive randomization adjusts treatment probabilities based on participant responses, allowing for more efficient resource allocation by favoring treatments with better outcomes [20]. The foundational work of researchers like Bai et al. (2002) [21] provides insights into the asymptotic properties of adaptive designs, highlighting their relevance in trials with delayed responses. Furthermore, Rosenberger and Lachin (2016) [22] explored the theoretical and practical applications of randomization in clinical trials, building on Rosenberger and Sverdlov (2008) [23], who examined how to handle covariates effectively in trial design. Atkinson et al. (2023) [24] discussed innovative methodologies in randomization, emphasizing their importance in modern clinical trials.
Over the years, researchers have implemented various adaptation procedures to reduce bias in clinical trials, including random allocation or randomization. Random allocation is a strategy designed to unbiasedly assign participants to different treatment groups, which enhances statistical power and ensures that any observed differences in outcomes between treatments can be attributed to the specific interventions being evaluated rather than confounding factors. Random allocation can eliminate biases from various sources, including participant behavior, investigator preferences, and other external factors, all of which can alter results and undermine the validity of the treatment under analysis [25,26,27,28].
One example of random allocation is optimal allocation, which leverages success rates observed in earlier stages of interim analyses. This method dynamically adjusts allocation probabilities to maximize the expected number of successes or minimize failures, making it a commonly used approach in clinical trials [29]. Another notable method is the Neyman allocation, a widely recognized approach in sample allocation. This method considers group variances and sampling costs to achieve optimal precision in estimating population means within a fixed total sample size. By accounting for intra-group variation and associated costs, the Neyman allocation ensures efficient resource distribution and enhances the statistical accuracy of clinical trials [30].
Randomization based on Neyman allocation is a sophisticated method designed to enhance the efficiency and balance of treatment assignments in clinical trials. This approach aims to minimize the variance of treatment effect estimates by allocating participants in proportion to the expected response variability of each treatment. In the binary settings considered here, Neyman allocation tends to assign more participants to treatments that are anticipated to be more effective, thereby optimizing the trial’s statistical power. One study discusses the theoretical foundations of Neyman allocation and its application in real-world scenarios, highlighting its advantages over more straightforward randomization methods. Another analysis provides a comprehensive look at Neyman allocation’s practical applications, emphasizing its role in improving the precision of treatment effect estimates while maintaining ethical considerations in participant allocation. Further research explores the implications of this method in adaptive trial designs, demonstrating how it can lead to more informed decision making as data accumulate. Together, these studies underscore the significance of Neyman allocation in advancing clinical trial methodologies [31,32,33].
The Neyman allocation method continues to demonstrate its value across various domains, from clinical trials to market surveys and political representation. For instance, Sverdlov and Rosenberger (2013) emphasize the importance of evaluating the performance of competing allocation rules under varying experimental scenarios—such as low, medium, and high treatment success probabilities—to optimize patient assignment in clinical trials [34]. Their work highlights how adaptive designs incorporating Neyman allocation can lead to more efficient and ethical trials. In the context of market research, Olayiwola et al. (2013) compared three different allocation procedures for estimating both the average and the variance of Peak Milk (Nigerian made) prices in local markets. Their findings identified Neyman (optimum) allocation as the most effective method, offering superior efficiency compared to the alternatives [35]. This reinforces the practical advantages of Neyman allocation in reducing variability within stratified sampling frameworks. Extending this principle beyond empirical studies, Wright (2012) demonstrates that Neyman’s optimal allocation is equivalent to the method of equal proportions used in apportioning seats in the U.S. House of Representatives—bridging classical sampling theory with political representation [36]. Taken together, these studies demonstrate that the Neyman allocation procedure is not only statistically optimal for reducing variance but also versatile and effective across diverse real-world applications.
Integrating ethical considerations in clinical trials is essential for ensuring participant protection and maintaining research integrity, especially when utilizing Neyman allocation. This allocation method, which optimizes statistical efficiency by assigning more participants to treatments expected to be more effective, must be implemented within a strong ethical framework. Antognini and Giovagnoli (2010) [33] underscore the importance of addressing the treatment allocation problem when employing Neyman allocation. They proposed two steps: first identifying an appropriate target allocation and then applying a sequential approach to implement it. It is crucial to continuously monitor any ethical concerns, such as disparities in treatment assignments, as noted by May and Flournoy (2009) [37]. Furthermore, equity in participant selection is crucial; researchers should stratify participants based on relevant characteristics to prevent bias, as highlighted by Duarte et al. (2024a) [38]. Ethics committees should be involved in designing and overseeing trials using Neyman allocation to assess its ethical implications, ensuring alignment with ethical guidelines. Ultimately, by embedding Neyman allocation within a comprehensive ethical framework, researchers can achieve a balance between statistical efficiency and participant well-being and develop a responsible research environment that prioritizes both effective treatment evaluation and ethical integrity, as discussed by Duarte et al. (2024b) and Metelkina et al. (2017) [39,40].
Rosenberger (1993) [41] further supported the application of adaptive designs, illustrating their potential to enhance statistical power and ethical considerations by ensuring that more participants receive potentially beneficial treatments. Hu et al. (2006) [42] emphasized the advantages of adaptive designs in clinical trials, highlighting their ability to respond to accumulating data and make informed decisions about treatment efficacy and safety. The use of adaptive designs for Neyman allocation enhances the flexibility and efficiency of clinical trials by allowing modifications based on interim results. Duarte and Atkinson (2024b) [39] discussed how adaptive designs can optimize treatment assignments in personalized medicine, particularly when response variance is treatment dependent. This approach allows researchers to adjust the allocation of participants dynamically, improving the likelihood of identifying effective treatments. Together, these works underscore the importance of integrating adaptive designs with Neyman allocation to improve trial outcomes while maintaining ethical standards and participant welfare.
This paper presents the implementation of Neyman allocation with distinct weights assigned to the stage sample sizes within the O’Brien and Fleming testing framework. A novel methodological approach is introduced, named the Neyman Weighted Multiple Testing Procedure (NWMP). This approach provides a robust framework for treatment comparison and enhances the efficiency of study designs in detecting treatment differences. An evaluation of the Type I error rate and statistical power is conducted using Monte Carlo simulations. Monte Carlo simulations are used because they offer a flexible and competent way to investigate the Type I error and power of statistical procedures. While a theoretical approach can offer more rigorous and generalizable insights, it often requires significant effort to execute. In contrast, Monte Carlo simulations allow researchers to quickly assess the performance of various testing methods under different scenarios. They are a valuable complement to theoretical analysis, as they can provide insights that are more readily applicable to real-world situations. Furthermore, Monte Carlo simulations can be easily adapted to integrate a wide range of factors, such as different sample sizes, effect sizes, and distributions, enabling a more comprehensive evaluation of a procedure’s properties [25,43,44,45,46,47,48]. Moreover, this study includes practical examples to illustrate the effective application of the proposed procedure in realistic scenarios. The NWMP enhances the quality of statistical inference under practical constraints by carefully managing sample allocation and weighting during the trial. The NWMP preserves statistical power and Type I error rates similar to those of the well-known O’Brien and Fleming test, while adding greater flexibility to accommodate different stages of data collection and varying treatment effect scales. Flexibility stands out as the main benefit.
2. Methodologies
In the Methodologies section, the foundational methods for clinical trials are described: the O’Brien and Fleming procedure and the Neyman allocation method. The new approach, the NWMP, is also presented, integrating different sample weights with the Neyman allocation technique to improve the precision of treatment effect evaluations while controlling the overall Type I error rate. Finally, the methods used to evaluate the Type I error and statistical power are described.
2.1. The Original O’Brien and Fleming Procedure
The data are reviewed and tested periodically, with $m$ subjects receiving treatment 1 and $m$ subjects receiving treatment 2 at each of the $K$ stages ($K$ refers to the number of interim analyses planned during the trial). Here, the total sample size equals $N = 2mK$, which represents the highest number of subjects utilized in the trial. The usual Pearson chi-square statistic, $\chi_k^2$, is calculated from the cumulative data after each stage $k$, and the critical value of O’Brien and Fleming, $P(K, \alpha)$, is chosen.
For stage $k$, if $(k/K)\,\chi_k^2 \geq P(K, \alpha)$, the study is terminated, and the null hypothesis is rejected. Otherwise, if $(k/K)\,\chi_k^2 < P(K, \alpha)$, the following $2m$ subjects are randomized, their measurements are taken, and the process is repeated for the subsequent stage. If, after $K$ stages, $\chi_K^2 < P(K, \alpha)$, the conclusion is to terminate the study and declare that the null hypothesis cannot be rejected at the significance level α.
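Written compactly, and using $\chi_k^2$ for the Pearson chi-square computed from the cumulative data through stage $k$, the stopping rule described above is:
\[
\text{at stage } k:\qquad
\begin{cases}
\dfrac{k}{K}\,\chi_k^2 \ge P(K,\alpha) & \Rightarrow\ \text{stop and reject } H_0,\\[6pt]
\dfrac{k}{K}\,\chi_k^2 < P(K,\alpha),\ k < K & \Rightarrow\ \text{randomize the next } 2m \text{ subjects and continue},\\[6pt]
\chi_K^2 < P(K,\alpha) & \Rightarrow\ \text{stop and fail to reject } H_0 \text{ at level } \alpha.
\end{cases}
\]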
Table 1 lists the critical values, as originally stated by O’Brien and Fleming, alongside the corrected values provided by Hammouri (2013) [17]. For more information on calculating and assessing the original and adjusted critical points, refer to [8,17].
2.2. Neyman Allocation Procedure
The Neyman allocation procedure is a statistical method aimed at effectively distributing a finite sample across various segments of a population. This approach minimizes the overall variance of population estimates while adhering to any constraints on the total allocation. The Neyman allocation depends on unknown binomial parameters, which are estimated from the accumulating trial data in order to apply it.
Let $\delta_1, \ldots, \delta_n$ be binary responses (with two values, success = 1 and failure = 0). Here, $n$ is the total sample size, and $T_1, \ldots, T_n$ are treatment assignment indicators, which take the value one for treatment 1 and zero for treatment 2.
Then, $N_1 = \sum_{i=1}^{n} T_i$ and $N_2 = n - N_1$, where $N_1$ and $N_2$ are the total numbers of patients assigned to treatments 1 and 2, respectively.
Denoting the accumulated data by $\mathcal{F}_n = \{\delta_1, \ldots, \delta_n, T_1, \ldots, T_n\}$, the sample estimates of the success and failure proportions under the two treatments by $\hat{p}_1, \hat{q}_1, \hat{p}_2, \hat{q}_2$, and conditional expectation by $E(\,\cdot \mid \mathcal{F}_n)$, the following allocation rule is obtained:
\[
\hat{\rho} \;=\; \frac{\sqrt{\hat{p}_1\hat{q}_1}}{\sqrt{\hat{p}_1\hat{q}_1}+\sqrt{\hat{p}_2\hat{q}_2}}.
\]
Therefore, this rule simply substitutes the unknown success and failure probability parameters $p_1$, $q_1$, $p_2$, and $q_2$ in the Neyman allocation rule
\[
\rho \;=\; \frac{\sqrt{p_1 q_1}}{\sqrt{p_1 q_1}+\sqrt{p_2 q_2}}
\]
with the existing estimates of the proportions of successes and failures calculated from the trial sample, $\hat{p}_1$, $\hat{q}_1$, $\hat{p}_2$, and $\hat{q}_2$.
When $\hat{p}_1, \hat{q}_1, \hat{p}_2, \hat{q}_2 \in (0,1)$, the formula for $\hat{\rho}$ above is the Neyman allocation, which is used as a weight for the new subsamples [27].
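As a purely hypothetical numerical illustration, suppose the current cumulative estimates are $\hat{p}_1 = 0.30$ and $\hat{p}_2 = 0.50$. Then
\[
\hat{\rho} = \frac{\sqrt{0.30 \times 0.70}}{\sqrt{0.30 \times 0.70} + \sqrt{0.50 \times 0.50}} = \frac{0.458}{0.458 + 0.500} \approx 0.478,
\]
so roughly 48% of the next subsample would be assigned to treatment 1 and the remaining 52% to treatment 2.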
2.3. Using Different Subsample Sizes
Different weights are intricately linked to resource availability and the size of the sample pool. Adjustments to the weights are made based on these factors, ensuring that data collection is not only strategic but also resource-efficient. When higher weights are employed earlier in the trial, significant benefits can be realized, particularly in facilitating early stopping decisions. By prioritizing more informative data points through increased weighting, the robustness of early analyses is enhanced, allowing trends and treatment effects to be detected more quickly. Consequently, if efficacy or safety concerns meet predefined criteria, the trial can be stopped early. This ensures that decisions to halt the study are made based on compelling evidence, thereby potentially reducing exposure to ineffective or harmful treatments. Ultimately, the strategic application of weights early in the process supports more efficient and ethically responsible study conduct, allowing each data point to contribute optimally to the overall analysis and enhancing the reliability of conclusions under varying resource conditions [42].
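As a brief illustration with hypothetical numbers, front-loading the weights for a trial of $N = 300$ subjects over $K = 4$ stages as $(w_1, w_2, w_3, w_4) = (0.4, 0.3, 0.2, 0.1)$ yields stage sample sizes
\[
(n_1, n_2, n_3, n_4) = (120, 90, 60, 30),
\]
so most of the information is collected before the first two looks at the data, which is what makes early stopping decisions better supported.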
2.4. The Proposed Method: Neyman Weighted Multiple Testing Procedure (NWMP)
2.4.1. The New Methodology
Our framework combines the O’Brien and Fleming method with Neyman allocation and weighted subsamples, allowing researchers to track and evaluate data gathered at each interim analysis stage. It is applicable in clinical trials where both the input and outcome variables are binary. This configuration, which assesses proportions between two groups based on binary outcomes, works especially well with this procedure. Since NWMP aims to manage the Type I error rate while testing various hypothesis endpoints, it is suited for chi-square-based test statistics. Examples of binary outcomes include symptom resolution, the occurrence of adverse events, and treatment success; any of these can be analyzed within a binary treatment allocation (e.g., drug versus placebo). This makes NWMP particularly useful in scenarios involving multiple binary endpoints assessed with chi-square tests.
The O’Brien and Fleming procedure is designed to compare two treatments when the treatment response is binary and occurs immediately. Our methodology follows the O’Brien–Fleming framework. This method allows for adjusting subsample sizes within each stage using Neyman allocation and allows different variable weights to be applied to subsamples between stages. Informed decisions are made based on predefined stopping rules and efficacy boundaries, resulting in a comprehensive and sophisticated methodology that enhances flexibility while maintaining the reliability of studies. By leveraging the strengths of these combined procedures, our procedure leads to more meaningful and accurate results, ultimately providing a valuable contribution to the field.
The new procedure involves periodically reviewing and testing the collected data in stages $k = 1, \ldots, K$:
1. Choose $K$, α, $\{w_k\}$, $P(K, \alpha)$, and $N$. Here, $K$ is the total number of stages, $\{w_k\}$ are the predefined weights for the subsamples, and $N$ is the total sample size (which should be even).
2. The subsample sizes between the stages are determined using the predefined weights $\{w_k\}$, where $0 < w_k \le 1$, $k = 1, \ldots, K$, and $\sum_{k=1}^{K} w_k = 1$.
3. Then, $\{w_1, \ldots, w_K\}$ is used to find each stage sample size: calculate $n_k = \mathrm{round}(w_k N)$ if $\mathrm{round}(w_k N)$ is even; otherwise, $n_k = \mathrm{round}(w_k N) + 1$, for $k = 1, \ldots, K - 1$, and $n_K = N - \sum_{k=1}^{K-1} n_k$ ($n_K$ should be even to work with equal allocation).
4. Split the stage sample size $n_k$ into two portions, $n_{k,1}$ and $n_{k,2}$, which are assigned to treatment 1 and treatment 2, respectively.
   - If $k = 1$, the sample size for the first stage and equal allocation are used to compute the subsamples, where $n_{1,1} = n_{1,2} = n_1/2$.
   - If $k > 1$, the subsamples are given by $n_{k,1} = \mathrm{round}(y_k\, n_k)$ with $n_{k,2} = n_k - n_{k,1}$, where $\hat{p}_1$ and $\hat{p}_2$ are the success rates from the cumulative previous stages for treatments 1 and 2, respectively, $\hat{q}_j = 1 - \hat{p}_j$, and $y_k = \sqrt{\hat{p}_1\hat{q}_1}\,/\,\bigl(\sqrt{\hat{p}_1\hat{q}_1} + \sqrt{\hat{p}_2\hat{q}_2}\bigr)$. In cases where $y_k$ equals zero, equal allocation is used.
5. Subjects are randomized starting from the initial stage, and their measurements are recorded. When no rejection has occurred, the new subsample is appended to the prior subsamples for each treatment.
6. $(k/K)\,\chi_k^2$ is calculated and compared to $P(K, \alpha)$:
   - If $(k/K)\,\chi_k^2 \ge P(K, \alpha)$, the study ends, and the null hypothesis is rejected.
   - If $(k/K)\,\chi_k^2 < P(K, \alpha)$ and $k < K$, the procedure proceeds to the next stage and goes back to step 4.
   - If $(k/K)\,\chi_k^2 < P(K, \alpha)$ and $k = K$, the study is terminated, and the null hypothesis fails to be rejected.
The algorithm for the new procedure is illustrated graphically in the flow chart in Figure 1.
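To make the stage-by-stage logic above concrete, the following is a minimal Python sketch of a single NWMP trial. It is an illustrative re-implementation, not the authors’ SAS code; the helper names (`pearson_chi2`, `neyman_fraction`, `nwmp_trial`), the even-rounding details, and the fallback rules are our assumptions, and `crit` stands for the corrected critical value $P(K, \alpha)$ taken from Table 1.

```python
import numpy as np

def pearson_chi2(s1, f1, s2, f2):
    """Pearson chi-square for the 2x2 table [[s1, f1], [s2, f2]] of cumulative
    successes and failures; returns 0 when a margin of the table is empty."""
    n = s1 + f1 + s2 + f2
    r1, r2, c1, c2 = s1 + f1, s2 + f2, s1 + s2, f1 + f2
    if min(r1, r2, c1, c2) == 0:
        return 0.0
    return n * (s1 * f2 - f1 * s2) ** 2 / (r1 * r2 * c1 * c2)

def neyman_fraction(p1, p2):
    """Estimated Neyman allocation fraction for treatment 1 (Section 2.2)."""
    s1, s2 = (p1 * (1 - p1)) ** 0.5, (p2 * (1 - p2)) ** 0.5
    return 0.5 if s1 + s2 == 0 else s1 / (s1 + s2)

def nwmp_trial(N, K, weights, crit, p1, p2, rng):
    """Simulate one NWMP trial: N total subjects, K stages, stage weights
    `weights` (summing to 1), critical value crit = P(K, alpha), and true
    success probabilities p1, p2. Returns (rejected, stopping stage, subjects used)."""
    # Stage sizes from the weights, forced to be even; the last stage absorbs the remainder.
    sizes = []
    for w in weights[:-1]:
        n_k = round(w * N)
        sizes.append(n_k if n_k % 2 == 0 else n_k + 1)
    sizes.append(N - sum(sizes))

    succ = [0, 0]   # cumulative successes per treatment
    tot = [0, 0]    # cumulative subjects per treatment
    for k, n_k in enumerate(sizes, start=1):
        if k == 1:
            n_k1 = n_k // 2                               # equal allocation at stage 1
        else:
            y = neyman_fraction(succ[0] / tot[0], succ[1] / tot[1])
            n_k1 = round(y * n_k) if y > 0 else n_k // 2  # Neyman split, else equal
        n_k2 = n_k - n_k1
        # Draw the new binary responses and accumulate them.
        succ[0] += rng.binomial(n_k1, p1)
        succ[1] += rng.binomial(n_k2, p2)
        tot[0] += n_k1
        tot[1] += n_k2
        # O'Brien-Fleming-type decision on the scaled cumulative chi-square.
        chi2 = pearson_chi2(succ[0], tot[0] - succ[0], succ[1], tot[1] - succ[1])
        if (k / K) * chi2 >= crit:
            return True, k, sum(tot)
    return False, K, sum(tot)
```

A single call such as `nwmp_trial(360, 5, [0.2] * 5, crit, 0.3, 0.3, np.random.default_rng(1))` returns whether the trial rejected $H_0$, the stage at which it stopped, and the number of subjects used.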
2.4.2. Participant Preferences and Informed Consent for the NWMP
When implementing the NWMP, clinical trial participants need to understand the randomization process. Although the technical details of Neyman allocation and the O’Brien and Fleming procedure might not be extensively explained in simple terms, participants must recognize that they will be randomly assigned to various treatment groups for both ethical and practical reasons. To aid their comprehension, the procedure should be explained in a simpler manner, allowing participants to grasp its importance in preserving the integrity and validity of the trial results. This strategy not only adheres to ethical guidelines but also fosters participant trust and cooperation, which are essential for the successful conduct of a clinical trial.
2.5. Type I Error and Power Testing Methods
This section is designed to demonstrate the accuracy of the procedure by examining Type I error and power. Monte Carlo simulations are employed to assess the effectiveness of a procedure in scenarios where theoretical analysis presents challenges. The Monte Carlo simulation method facilitates the examination of statistical power and the Type I error rate associated with the NWMP, followed by a comparison to the original procedure. Monte Carlo simulations illustrate how effectively the NWMP can detect significant effects while maintaining an equivalent Type I error rate and power, thereby upholding statistical standards.
2.5.1. Type I Error Testing Method
SAS (Statistical Analysis System, version 9.4, SAS Institute Inc., Cary, NC, USA) code was employed to run simulations to calculate the Type I error for the NWMP. The simulations address a range of success probabilities (P = 0.1, 0.2, 0.3, 0.4, and 0.5), with α values of 0.01 and 0.05, for all critical values corresponding to the selected α, and for K values from 1 to 5.
Running these simulations can evaluate the NWMP’s performance in numerous scenarios. The differing success probabilities facilitate an examination of its robustness and adaptability to diverse conditions, where various test sizes and critical values will be considered.
To evaluate how effectively the new procedure retains the null hypothesis when suitable, subsamples will be simulated from a single binomial distribution with equal success rates, ensuring that the null hypothesis holds. The process will be conducted 500,000 times, and the percentage where the null hypothesis is rejected will be computed to assess the Type I error rate.
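As an illustration only, the simulation just described could be reproduced along the following lines, reusing the hypothetical `nwmp_trial` helper sketched in Section 2.4.1 (the critical value and the reduced iteration count below are placeholders, not the values used in the paper):

```python
import numpy as np

# Type I error estimation under H0: both arms share the same success probability.
rng = np.random.default_rng(2024)
K, N = 5, 360
crit = 4.23                      # placeholder for the corrected P(K, alpha) from Table 1
weights = [0.2] * K              # equal stage weights, for illustration
n_iter = 20_000                  # the paper uses 500,000 iterations
rejections = sum(
    nwmp_trial(N, K, weights, crit, p1=0.3, p2=0.3, rng=rng)[0]
    for _ in range(n_iter)
)
print("Estimated Type I error:", rejections / n_iter)
```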
2.5.2. Power Testing Method
First, the sample sizes required to achieve a power value of 0.8 for specific scenarios involving the conventional chi-square test are determined for comparison purposes. This analysis compares the estimated power values for each sample under the NWMP to the target value of 0.8.
The success probability for treatment 1 is set at $p_1 = 0.1$, while the success probability for treatment 2, $p_2$, takes one of the following values: 0.15, 0.2, 0.25, and 0.3. The program runs 500,000 iterations with significance levels α = 0.01 and 0.05.
The analysis includes the corrected O’Brien and Fleming critical values, denoted as $P(K, \alpha)$. After applying conventional chi-square power calculations, sample sizes were determined for each combination of $p_1 = 0.1$ and $p_2$ to achieve the desired power value of 0.8. For α = 0.05, the sample sizes were 1366, 396, 200, and 120, respectively. For α = 0.01, the sizes were 2032, 588, 292, and 182, respectively. To ensure a significant difference, the two subsamples are generated for each case of $p_2$ from distinct binomial distributions with different means (resulting from using different success probabilities). The NWMP procedure is then conducted to test whether the null hypothesis $H_0: p_1 = p_2$ (indicating no difference between the two groups) can be rejected against the alternative hypothesis $H_1: p_1 \neq p_2$. The entire process is repeated 500,000 times to calculate the proportion of rejections of $H_0$, representing the power rate, given that $H_1$ was guaranteed to be true.
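The fixed-sample sizes quoted above come from conventional chi-square power calculations; a rough cross-check can be obtained from the standard normal-approximation formula for comparing two proportions, as in the sketch below. This approximation will not reproduce the paper’s exact values but gives totals of the same order.

```python
import math
from scipy.stats import norm

def approx_total_sample_size(p1, p2, alpha=0.05, power=0.8):
    """Approximate total sample size (both arms, equal allocation) for a
    two-sided comparison of two proportions via the normal approximation."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return 2 * math.ceil(num / (p1 - p2) ** 2)

for p2 in (0.15, 0.20, 0.25, 0.30):
    print(p2, approx_total_sample_size(0.1, p2, alpha=0.05),
          approx_total_sample_size(0.1, p2, alpha=0.01))
```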
2.5.3. Type I Error and Power Estimation Algorithm
Monte Carlo simulations are conducted to evaluate the proposed method’s Type I error rate and power.
These simulations assess the probability of incorrectly rejecting a true null hypothesis across different experimental settings or the probability of correctly rejecting a false null hypothesis.
- (a)
The Type I error analysis considers equal success probabilities $p_1 = p_2 \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$.
- (b)
The power analysis considers an initial probability $p_1 = 0.1$ and $p_2 \in \{0.15, 0.2, 0.25, 0.3\}$.
The simulations are performed for a range of success probabilities ($p_1$ and $p_2$): equal for the Type I error and unequal for the power.
Then, random samples are generated from binomial distributions with the specified success probabilities $p_1$ and $p_2$ and sample size N.
Each simulation is repeated 500,000 times to ensure robust statistical inference, and the NWMP is applied for all values of K.
The test statistics are computed according to the proposed NWMP methodology.
The computed statistics are compared to the predefined critical values P(K, α).
Decision Rule:
If the test statistic exceeds the critical value at any stage, $H_0$ is rejected (a false positive for the Type I error analysis and a true positive for the power analysis).
Otherwise, $H_0$ fails to be rejected.
Computation of Type I Error Rate or Power: The proportion of rejected cases across all iterations is calculated.
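A power estimate is obtained in the same way, except that the two arms are simulated with unequal success probabilities; an illustrative sketch (again reusing the hypothetical `nwmp_trial` helper, with a placeholder critical value and a reduced iteration count) is given below.

```python
import numpy as np

# Power estimation under H1: p1 = 0.10 versus p2 = 0.15, K = 5 stages.
rng = np.random.default_rng(7)
K, N = 5, 1366                  # N taken from the alpha = 0.05, p2 = 0.15 scenario
crit = 4.23                     # placeholder for the corrected P(K, alpha) from Table 1
n_iter = 20_000                 # the paper uses 500,000 iterations
rejections = sum(
    nwmp_trial(N, K, [0.2] * K, crit, p1=0.10, p2=0.15, rng=rng)[0]
    for _ in range(n_iter)
)
print("Estimated power:", rejections / n_iter)
```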
3. Results
This section presents the results of the analysis of Type I error and power for the proposed procedure. The Type I error rate was found to be acceptable, indicating that the null hypothesis was rarely incorrectly rejected when it was true. Additionally, a high power level was observed, indicating that the null hypothesis was correctly rejected when the alternative hypothesis was true. These results support the validity of the proposed procedure, instilling confidence in the accuracy of the findings.
3.1. Type I Error Results
After conducting an extensive analysis of various sample sizes, it was observed that the results remained consistent irrespective of the sample size. Furthermore, when comparing these results to the conventional chi-square test, it was found that the modified procedure reduced the Type I error values. This reduction became more pronounced as the value of K increased.
The rationale behind this trend is that as K increases, a larger chi-square statistic is required for rejection at the early stages. Raising K for interim analyses helps prevent premature conclusions about a treatment’s effectiveness, maintaining scientific integrity. This approach not only upholds the accuracy of the trial results but also enhances ethical standards by responding dynamically to new data. On the other hand, adaptive allocations promote a more efficient trial process by potentially reducing the number of participants needed to reach conclusive results. By focusing on treatments that show promise and adjusting away from those that do not, trials can achieve their objectives faster and with fewer resources. This accelerates the development process of new medical interventions and reduces participants’ exposure to less effective treatments.
Adjusting the critical value, K, and employing adaptive allocations ensures that clinical trials are flexible and precise.
Table 2 summarizes the Type I error values across two distinct sample sizes and significance levels: a sample size of 360 at α = 0.05 and a sample size of 300 at α = 0.01. For the sample size of 360 with a significance level of 0.05, the Type I error ranged from 0.0487 to 0.0502 for K = 1. As the value of K increased, there was a consistent decrease in the Type I error, reaching values between 0.0421 and 0.0428 at K = 5, which remain below the significance level of 0.05. Compared to the standard chi-square procedure, this enhanced control of errors suggests that the proposed method is effective.
When using a significance level of α = 0.01 and a sample size of 300, the Type I error values maintain a monotonic behavior among the stages, with values ranging between 0.0090 and 0.0099 in the second stage. As the value of K increased, a decrease in the Type I error values was observed, ranging between 0.0081 and 0.0091 in the last stage. Importantly, these values remain below 0.0105, which is considered satisfactory.
The Type I error showed similar patterns when considering the same significance level (α = 0.05) and sample sizes of 90 and 720. For the sample size of 90, the Type I error ranges from 0.0374 to 0.0511; for the sample size of 720, it ranges from 0.0428 to 0.0494.
Furthermore, the Type I error values were calculated for sample sizes of 90 and 720 using a significance level of α = 0.01. The results indicated that the new procedure performs appropriately regarding Type I errors. Specifically, for a sample size of 90, all Type I error values fall within the range of 0.0061 to 0.0104. Similarly, for a sample size of 720, the Type I error values range from 0.0086 to 0.0103. The remaining Type I errors are shown in Table 3 and Figure 2.
3.2. Impact of Random Seed Selection on Type I Error Rates
ANOVA analysis was conducted to assess the impact of random seed selection on Type I error rates in Monte Carlo simulations. Each simulation was treated as a sample, with the outcome of each iteration coded as 1 or 0. Five simulations were run under identical parameter settings, with only the random seed varying. The results of the ANOVA indicate that no significant difference was found in the mean Type I error rates across the five simulations. The calculated mean Type I error rates were very close, ranging from 0.0498 to 0.0510. Based on these findings, it can be concluded that the choice of random seed does not materially affect the estimation of Type I error in this Monte Carlo simulation context.
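A seed-sensitivity check of this kind can be sketched as follows (illustrative only, reusing the hypothetical `nwmp_trial` helper with a placeholder critical value and far fewer iterations than the paper):

```python
import numpy as np
from scipy.stats import f_oneway

# Estimate the Type I error under five different random seeds and compare the
# per-iteration rejection indicators (coded 1/0) with a one-way ANOVA.
K, N, crit, n_iter = 5, 360, 4.23, 10_000
samples = []
for seed in (11, 22, 33, 44, 55):
    rng = np.random.default_rng(seed)
    rejections = np.array(
        [nwmp_trial(N, K, [0.2] * K, crit, p1=0.3, p2=0.3, rng=rng)[0]
         for _ in range(n_iter)],
        dtype=float,
    )
    samples.append(rejections)
    print(f"seed {seed}: estimated Type I error = {rejections.mean():.4f}")

stat, p_value = f_oneway(*samples)
print("ANOVA p-value across seeds:", round(p_value, 3))
```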
3.3. Power Analysis and Performance Trends
Statistical power is crucial for evaluating a testing procedure’s effectiveness. Power refers to the likelihood of correctly rejecting a false null hypothesis, thereby significantly decreasing the risk of Type II errors. This section presents power results from Monte Carlo simulations conducted under several scenarios. The fluctuations in power values associated with the NWMP were analyzed in depth. The influence of modifications to the O’Brien and Fleming procedure on its power performance was examined. The findings revealed significant trends in power estimates, offering essential insights into the robustness of the suggested approach. After testing several scenarios, it was observed that the power values with K = 1 ranged between 0.80195 and 0.80717. Additionally, these values exhibited a decreasing trend as K increased. For α = 0.05 at K = 5, the power values ranged from 0.7723 to 0.7819, with a margin of error less than 0.0277 compared to the target power value of 0.8.
Next, the power values for α = 0.01 and a K value of 1 range from 0.80109 to 0.8038. Further analysis confirmed that the power values decline as K rises. When K equals 5, the power values range from 0.7793 to 0.7873. The other power values are shown in Table 4 and illustrated in Figure 3a,b.
3.4. Rejection Rates: Calculations for Each Stage
This section aimed to identify the stage at which rejection occurs by calculating rejection rates and determining the sample size needed to conclude that the null hypothesis was rejected. The objective was to assess whether the proposed methodology requires a smaller sample size to achieve statistical significance, which is the desired outcome.
3.4.1. Calculating Rejection Rates for Each Stage When the Difference Is Present
With a standard power of 0.8, success probabilities of 0.1 and 0.15, and α = 0.01, a total sample size of 2032 was determined. Using 500,000 iterations at each value of K, the results are provided in Table 5 and illustrated in Figure 4.
For K = 2, 43% of the rejections of $H_0$ (acceptances of $H_1$) were observed in the second stage, amounting to 170,396 iterations. In this scenario, most rejections (57%) took place in the first stage, meaning that the entire sample size was necessary to reject the null hypothesis in only 43% of the rejecting instances.
For K = 3, the null hypothesis was rejected in the second stage in 236,537 iterations, corresponding to a rejection rate of 59%, the highest among the three stages. For K = 4, the second stage had the highest rejection rate at 41%, with 162,778 iterations rejecting the null hypothesis at that stage. Finally, for K = 5, the third stage had the highest rejection rate of 38%, with 150,267 iterations.
Based on the findings, the proposed procedure effectively minimized the necessary sample size for statistical significance, significantly decreasing expenses and effort. This outcome indicated the efficiency and practicality of the suggested procedure.
3.4.2. Calculating Rejection Rates for Each Stage When the Difference Is Not Present
Rejections were computed using α = 0.01, based on a total sample size of 360. From 500,000 iterations, the following table and figure illustrate the sample size needed to reach a rejection, along with the number of rejections at each stage for every value of K.
Additionally, it is important to recognize that these percentages are derived from the 5% rejection rate. The decision rules applied in this multiple testing procedure closely resemble those of the standard chi-square one-stage procedure, provided there is no early termination when $H_0$ is true. We notice that most of the rejections happened in the last stage. The results are presented in Table 6 and Figure 5.
4. Examples
4.1. Example 1: Computational Example
We simulated data for a trial of 600 subjects divided into two treatment groups and analyzed over five stages. The two groups had success rates of 0.3 and 0.5, respectively. We analyzed the data and present the results for K = 5 as part of our findings. For K = 5, the NWMP was used with predefined subsample weights, from which the five stage sample sizes were obtained.
In the first stage, where $n_1$ equals 210, the subsamples were evenly distributed, resulting in $n_{1,1}$ and $n_{1,2}$ both being 105. The calculated chi-square statistic was 7.704. When this value is multiplied by one-fifth, it equals 1.54, which does not exceed the critical value, leading us to fail to reject the null hypothesis.
For the second stage, the cumulative subsamples were distributed using Neyman allocation: the success rates estimated from the first stage were used to compute the Neyman weight $y_2$, which determined $n_{2,1}$ and $n_{2,2}$. The successes were counted, and the chi-square statistic was calculated. The chi-square statistic stood at 9.04; after multiplying this value by two-fifths, we obtained 3.616, which does not exceed the critical value. Therefore, we again did not reject the null hypothesis.
For the third stage, the cumulative subsamples were again distributed using Neyman allocation, which determined $n_{3,1}$ and $n_{3,2}$. The chi-square statistic was 7.3; after multiplying it by three-fifths, the value equals 4.382, which is greater than the critical value. Thus, we reject $H_0$.
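Collecting the three stage-wise comparisons of this example, with $P(5, \alpha)$ denoting the critical value from Table 1:
\[
\tfrac{1}{5}(7.704) \approx 1.54 < P(5,\alpha), \qquad
\tfrac{2}{5}(9.04) = 3.616 < P(5,\alpha), \qquad
\tfrac{3}{5}(7.3) \approx 4.38 \ge P(5,\alpha).
\]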
In total, for K = 5, only three stages and 480 of the 600 participants were needed to terminate the experiment and conclude that there was a significant difference between the two treatments. See Table 7 for further details.
4.2. Example 2: Real-Life Example
We collected data for illustrative purposes on the smoking habits of 300 individuals and how their parents’ smoking habits affected them. Participants were chosen based on their smoking status, including 150 smokers and 150 non-smokers, and their parents’ smoking status was recorded.
The question is whether parents’ smoking habits affect their children’s smoking behavior. The order in which participants responded was considered essential and was used to organize the data for analysis. This sequence acted as a ‘bank’, allowing the data to be arranged in various ways for different values of K, with Neyman allocation applied based on the estimated success rates.
Concurrently, the NWMP was applied to analyze these sequences. This statistical method may include adjustments or modifications to cater to different subsets of the data based on the predefined value of K.
After assigning the data for each K value from 1 to 5, the NWMP was applied. The sample size required when the hypothesis was rejected was also noted. To demonstrate the procedure, we include the details of the K = 4 case below:
For K = 4, the trial was concluded after two stages with 195 participants out of 300.
In detail, for the first stage, the subsamples were split equally between the two groups. The chi-square statistic was calculated to be 0.427. Multiplying the chi-square statistic by one-fourth gives 0.1068, which is not greater than the critical value, so $H_0$ is not rejected.
For the second stage, $n_{2,1}$ and $n_{2,2}$ required recalculation using the Neyman allocation. The cumulative subsamples are equal to 100 and 95. The chi-square statistic, after being multiplied by two-fourths, is 10.82, which is greater than the critical value. Thus, $H_0$ is rejected.
A total of only 195 students were needed in the analysis to identify a significant difference between the two groups. For K = 5, the analysis also reached significance with just 165 students out of 300. See Table 8 for additional information.
4.3. Example 3: Real-Life Example
The example assesses two binary outcomes: whether the participant is diagnosed with cancer (yes/no) and whether the participant is a smoker (yes/no). Participants are grouped by cancer diagnosis, and then the smoking status is evaluated. The analysis was based on publicly available lung cancer data from Data World [49].
- I.
For stage 1 (i = 1), with a weight of w1 = 0.40, the sample size is n1 = 56, distributed evenly:
n1,1 = 28 participants with cancer (23 smokers and 5 non-smokers)
n1,2 = 28 participants without cancer (15 smokers and 13 non-smokers)
Next, one-third of the chi-square statistic is calculated as follows: 5.24/3 ≈ 1.747, which is less than 4.0191. Since this statistic does not reach the critical value, we proceed to Stage 2 (a verification of this Stage 1 computation is sketched after this example).
- II.
In stage 2 (i = 2), with a weight of w2 = 0.35, the sample size is n2 = 50. The allocation of the subsample is determined adaptively:
Given p1,1 = 0.4643, q1,1 = 0.5357, and y2 = 0.5657.
Then, n2,1 = round(0.5657 × 50) = 28, and n2,2 = 22.
n2,1 = 28 participants with cancer (9 smokers, 19 non-smokers)
n2,2 = 22 participants without cancer (0 smokers, 22 non-smokers)
Cumulatively (after combining with Stage 1), two-thirds of the chi-square statistic equals approximately 7.953 (calculated as 11.93 × 2/3), which is greater than 4.0191.
Since the adjusted statistic exceeds the boundary, the null hypothesis is rejected at Stage 2, and we conclude that there is a statistically significant association between smoking and lung cancer.
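As a quick check of the Stage 1 computation referenced above, the Pearson chi-square for the Stage 1 table can be reproduced as follows; `scipy` is used here purely for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Stage 1 of Example 3: 28 participants with cancer (23 smokers, 5 non-smokers)
# and 28 without cancer (15 smokers, 13 non-smokers).
table = np.array([[23, 5],
                  [15, 13]])
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(round(chi2, 2))       # 5.24
print(round(chi2 / 3, 3))   # 1.747, below the critical value 4.0191, so continue
```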
5. Discussion
Specific adaptive designs in clinical trials are structured to boost efficiency by incorporating sequential processes that focus on specific critical endpoints. This approach allows for multiple testing, enabling a comprehensive assessment of the treatment under review.
This study introduces an innovative adaptation of the O’Brien and Fleming procedure, first proposed in 1979, by incorporating Neyman allocation and unequal weighted allocation methods into its framework. This enhancement is designed to optimize the procedure’s efficiency and effectiveness across multiple stages of clinical trials.
Neyman’s allocation method, a key component of the revised procedure, strategically assigns subjects in a manner that maximizes the likelihood of detecting effective treatments early. This approach is particularly beneficial as it provides more patients with potentially superior treatment. Unlike traditional methods that apply uniform subsample weights across all phases, this innovative procedure utilizes unequal weighted allocations, enabling a more responsive and effective trial design.
The introduction of weighted allocations complements Neyman’s strategy by ensuring that more subjects receive what preliminary results suggest might be the better treatment option. This methodological enhancement not only streamlines the process but also improves the overall ethical and scientific quality of clinical trials. By focusing on these advanced allocation strategies, the study positions the adapted O’Brien and Fleming procedure as a powerful tool for contemporary clinical research, aiming to deliver timely and reliable healthcare solutions.
The modified procedure was evaluated under several scenarios with varying sample sizes. The findings showed that the innovative approach, combining Neyman allocation with weighted allocation, significantly reduced the Type I error while ensuring adequate power. Lowering the Type I error is vital since it minimizes false positive results, thereby improving the reliability of the outcomes. Additionally, preserving power is crucial to guarantee that the study remains sensitive enough to identify genuine treatment effects.
For Type I error, a simulation study was conducted considering various α values and sample sizes. For example, with an α value of 0.05 and a sample size of 360, the Type I error values ranged between 0.0487 and 0.0502 for K = 1 but decreased monotonically as K increased, reaching 0.0421 to 0.0428 when K = 5. These values were below the 0.05 threshold, indicating acceptable errors that were better than those obtained for the original procedures.
Similarly, when considering an α level of 0.01, the Type I error values demonstrated a monotonic trend, ranging from 0.0091 to 0.0099 for K = 1, and decreasing further as K increased, falling between 0.0081 and 0.0088. All these values remained below the specified α level of 0.01, indicating successful control of Type I error.
The remaining Type I error values demonstrated improved performance compared to the usual chi-square test, with the observed Type I error values falling within acceptable limits.
Conversely, although there was a minor decline in the values, the new implementation effectively preserved acceptable power levels. Because the chi-square statistics are scaled by k/K at the interim analyses, the test statistics are reduced, making rejection of the null hypothesis more demanding.
However, the values remained acceptable, as the power values for an α level of 0.05 ranged between 0.8020 and 0.8072 for K = 1 and between 0.7723 and 0.7819 for K = 5. The maximum difference from the target power of 0.8 was less than 0.0277 for an α level of 0.05.
For an α level of 0.01, the power values ranged from 0.8011 to 0.8038 for K = 1, while they varied between 0.7793 and 0.7873 for K = 5. The maximum difference of less than 0.0124 indicates that adequate power levels were sustained.
The results show that, for both α levels, despite slight reductions, the proposed implementation successfully maintains satisfactory power values. The differences observed between these values are within reasonable margins, highlighting the robustness of the modified procedures.
These characteristics make the new procedure a promising tool for clinical trials and other medical research contexts where efficient and reliable analysis is important [14,15,43].
In conclusion, the findings of this study indicate that the NWMP constitutes a more adaptable procedure compared to the multiple tests utilized by O’Brien and Fleming, as well as the single sample method. The NWMP provides enhanced control over Type I error rates while preserving reasonable levels of statistical power.
A comparative analysis of three multiple testing procedures—the Neyman Weighted Multiple Testing Procedure (NWMP), the Optimal Weighted Multiple Testing Procedure (OWMP), and the Adaptive Multiple Testing Procedure with Urn Allocation (UMP)—reveals key differences in their ability to control Type I error rates and maintain strong statistical power under various significance thresholds. Each method applies a unique allocation mechanism tailored to specific trial dynamics, with the shared objective of enhancing efficiency in statistical decision making.
The UMP exhibits Type I error rates ranging from 0.0428 to 0.0545 at α = 0.05 and from 0.0082 to 0.0113 at α = 0.01, with corresponding power values between 0.7807 and 0.8158 at α = 0.05 and 0.7820 to 0.8131 at α = 0.01. While UMP generally maintains acceptable error control, its slightly elevated upper bounds at α = 0.05 indicate a modest risk of Type I error inflation in some scenarios.
The OWMP demonstrates tighter Type I error control—ranging from 0.0415 to 0.0507 at α = 0.05 and 0.0084 to 0.0104 at α = 0.01—alongside robust statistical power ranging from 0.7726 to 0.8164 at α = 0.05 and 0.7841 to 0.8113 at α = 0.01. This suggests that OWMP delivers a favorable balance between precision in error control and sensitivity in detecting true effects.
The NWMP also performs competitively, with Type I error rates between 0.0421 and 0.0539 at α = 0.05 and 0.0081 to 0.0104 at α = 0.01 and power levels ranging from 0.7723 to 0.8072 at α = 0.05 and 0.7793 to 0.8011 at α = 0.01. Although NWMP provides strong overall performance, its Type I error control is marginally less strict than OWMP’s, particularly at the 5% level.
Overall, all three procedures demonstrate effectiveness in maintaining statistical validity, but none consistently dominates across all criteria. The OWMP procedure offers a slightly superior balance between stringent Type I error control and high statistical power. To further clarify these distinctions, future research should implement the same simulation frameworks while varying sample size, treatment effect variability, and interim analysis frequency to enable direct and equitable comparisons of each method’s performance under controlled conditions [18,19].
Furthermore, we plan to extend these comparisons to include multi-arm trials, reflecting the significant contributions of Dr. Lui’s 1993 [12] expansion of the O’Brien and Fleming group sequential test to multiple treatment groups. This expansion will involve integrating Neyman allocation and weighted sample size techniques, enhancing methodological robustness, and tailoring approaches to the specific challenges of multi-arm clinical trials. Additionally, the comparison will be expanded to include modern adaptive designs that employ Bayesian methods, which are becoming increasingly prevalent due to their flexibility in incorporating prior knowledge and updating trial parameters in real time based on accumulating data. This comprehensive evaluation will not only clarify the strengths and weaknesses of each procedure but also refine statistical methods in clinical trials, ultimately leading to significant advancements in medical research. Such efforts are aligned with our overarching goal of improving the statistical framework of clinical trials, ensuring that they are optimally designed to meet both scientific and ethical standards.