1. Introduction
In the current dynamic research landscape, marked by an exponential increase in accumulated data and a rise in the complexity of research inquiries, efficient statistical analysis plays a pivotal role. Choosing the right test is crucial in statistical data analysis. The selection of an appropriate data analysis method depends on the research goal and the distribution of the data, which determines whether a parametric test can be applied [1].
Parametric tests are more precise than their nonparametric counterparts and provide results that better assess the true significance of observed differences [2]. Every researcher should strive to ensure that the statistical analysis is conducted rigorously and interpreted appropriately. Our concern should not only be to achieve statistically significant results but also to ensure that our interpretation reflects reality.
In the analysis of multiple groups, the analysis of variance (ANOVA) test is a common choice. This test, however, belongs to the parametric methods and thus requires certain assumptions. The two most critical are that the data compared with ANOVA must have (i) a normal distribution and (ii) equal variances [3]. Relying solely on the outcome of ANOVA per se is often insufficient. While this test can determine whether differences exist between groups, it does not identify between which groups these differences occur. To obtain more detailed insights, post hoc tests are routinely employed, which are essential in so-called multiple comparisons analyses [4]. Currently, there is a wide range of post hoc tests, each suitable for specific studies. More liberal tests are often not recommended because they correct the error level inadequately, thereby increasing the risk of committing statistical errors [5].
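As a concrete illustration of the two assumptions above, the sketch below (not taken from the original study; the group means and sizes are arbitrary) checks normality with the Shapiro–Wilk test and variance homogeneity with Levene's test before running one-way ANOVA with SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Three hypothetical groups; normal data with shifted means, for illustration only.
groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (0.0, 0.3, 0.8)]

# Assumption (i): normality within each group (Shapiro-Wilk test).
for i, g in enumerate(groups, 1):
    stat, p = stats.shapiro(g)
    print(f"group {i}: Shapiro-Wilk p = {p:.3f}")

# Assumption (ii): homogeneity of variances (Levene's test).
stat, p_levene = stats.levene(*groups)
print(f"Levene p = {p_levene:.3f}")

# If both assumptions hold, one-way ANOVA is appropriate.
f_stat, p_anova = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")
```

If ANOVA rejects the global null hypothesis, a post hoc procedure would then be applied to locate the pairwise differences.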
Researchers are confronted with the challenge of appropriately managing the risk of statistical errors, a threat that is particularly acute in the realm of multiple comparisons: when numerous groups or conditions are analyzed concurrently, the risk of type I errors escalates unduly [6]. The issue of multiple comparisons has therefore become a focal point of intensive research, leading to the development of diverse methods for effectively controlling the risk of statistical errors. Traditional approaches, such as the Bonferroni correction or false discovery rate (FDR) procedures, have long served as pivotal tools in this domain [7]. However, with advancements in the field of statistics, there is a noticeable surge of interest in more flexible and innovative techniques.
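To make the two classical corrections concrete, the sketch below implements the Bonferroni adjustment and the Benjamini–Hochberg FDR step-up procedure on a set of illustrative p values (the values themselves are arbitrary, chosen only to show the mechanics):

```python
import numpy as np

def bonferroni(pvals):
    """Bonferroni: multiply each p value by the number of tests (capped at 1)."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices of the sorted p values
    ranked = p[order] * m / np.arange(1, m + 1)  # p_(k) * m / k
    # Enforce monotonicity from the largest p value downward.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out

pvals = [0.001, 0.008, 0.039, 0.041, 0.27]
print("Bonferroni:", bonferroni(pvals))
print("BH (FDR):  ", benjamini_hochberg(pvals))
```

As expected, the Bonferroni adjustment is the more conservative of the two, while the FDR procedure retains more power across many tests.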
One advanced method for the statistical analysis of multiple comparisons that has gained recognition is bootstrapping. Proposed by Bradley Efron, bootstrapping is a resampling technique that involves repeated sampling with replacement [8]. It enables the estimation of the sampling distribution without assumptions about the population distribution. The method is flexible and can be tailored to the specifics of a dataset, making it valuable when traditional methods are inadequate.
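The core idea can be sketched in a few lines. The snippet below (a minimal illustration; the skewed exponential sample is an arbitrary stand-in for data that violate normality) resamples with replacement to build a percentile confidence interval for the mean without any distributional assumption:

```python
import numpy as np

rng = np.random.default_rng(42)
# A skewed sample, for which normal-theory intervals may be unreliable.
sample = rng.exponential(scale=2.0, size=40)

B = 10_000                                   # number of bootstrap resamples
boot_means = np.empty(B)
for b in range(B):
    # Resample with replacement, same size as the original sample.
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[b] = resample.mean()

# Percentile 95% confidence interval for the mean, no normality assumption.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The same resampling scheme underlies bootstrap versions of test statistics such as the ANOVA F discussed later in the paper.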
The objective of this article was to explore the versatile role of the bootstrap method in multiple comparisons analyses. We chose the bootstrap method because it helps to better handle the issue of multiple comparisons. The aim was to explain how bootstrapping manages various data distributions and to compare its efficacy with conventional statistical methods such as ANOVA. In our study, we sought to demonstrate that bootstrap is an excellent tool for handling distributions that deviate significantly from the normal distribution required for the application of parametric methods. Additionally, we aimed to show how bootstrap effectively manages data characterized by small sample sizes. By juxtaposing the outcomes of bootstrapping with those of the traditional approach, we sought to underscore the added value that bootstrapping brings to statistical analyses.
Utilizing techniques such as ANOVA and post hoc tests, we analyzed differences between groups, taking into account factors such as data distribution and variance homogeneity. Through this comparative analysis, we sought to illustrate how the bootstrap method can provide insights that may not be readily apparent with traditional methods alone. Acknowledging that the bootstrap method may slightly influence the significance of results, we intended to highlight its potential to enhance the interpretative depth of statistical analyses. By demonstrating the subtle interplay between the bootstrap method and traditional methods, our aim is to promote a more integrated approach to statistical analysis, thereby supporting more robust research practices.
4. Discussion
Our article aimed to illustrate the application of the bootstrap method in the context of atypical distributions. In the case of multiple comparisons, bootstrap proved to be a valuable tool, helping to demonstrate the greater reliability of the outcomes of statistical analysis. The simulation results demonstrated that bootstrap is useful for confirming the credibility of the obtained results, irrespective of the type of distribution of a variable and its departure from normality.
Table 11 presents a summary of the performance of bootstrap and its effectiveness across various distributions, as well as its application in multiple group analyses.
Bootstrap is a method commonly employed to enhance and validate existing analyses. In the study by Jayalath et al., this method was utilized for tests examining the homogeneity of variances in two groups [12]. Their article discusses an approach to improving tests for variance homogeneity across samples of equal and unequal sizes. The authors suggest employing a bootstrap test based on the ratio of mean absolute deviations to enhance assessment accuracy. This proposed bootstrap test is particularly effective when the underlying distributions are symmetric or slightly skewed. The study by Zhang assessed the utility of bootstrap in multiple comparisons following one-way ANOVA [13]. They conducted a comprehensive study of one-way ANOVA under heteroscedastic variances and varying sample sizes, employing a bootstrap approach without data transformation. Simulations indicated convergence of the type I error rate of the multiple comparison procedures to the nominal level of significance. Hill et al. used ANOVA enriched with bootstrap because of the normality requirements on the analyzed data [14]. They concluded that the bootstrap method requires more time and computational resources than traditional ANOVA, yet it does not rely on assumptions about the data distribution.
In our study, we investigated how bootstrap-boosted analyses facilitate multiple comparisons in different scenarios with data of various distributions. We employed ANOVA and post hoc tests to examine differences between individual groups. Original data were used to simulate different variants of distributions either departing or not departing from normality. In addition, we simulated data with deliberately chosen unequal variances in the compared BMI groups. Thus, we prepared data with distributions meeting the assumptions for parametric analysis of variance, as well as data violating at least one of these assumptions (normal distribution, variance homogeneity). We paid special attention to ensuring that all simulated data systems within a comparison category (i.e., leptokurtic vs. platykurtic vs. mesokurtic and homogeneous vs. heterogeneous variances) were characterized by non-different values of the F test statistic, regardless of whether they met the assumptions of parametric ANOVA. Implementing this idea, we calculated the F statistic for distributions that meet the assumptions of parametric ANOVA (producing reliable outcomes) and for those that do not (unreliable outcomes that distort true relationships). Taking advantage of the fact that bootstrap does not require the data to meet the assumptions of parametric tests, we then compared the results obtained using classical analysis of variance with those of the analysis supported by the resampling procedure (bootstrap-boosted ANOVA). In this study, we evaluated the effectiveness of the bootstrap method in statistical data analysis, with a primary focus on cases where the observed data exhibited atypical distributions or did not meet the assumptions required by parametric methods.
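One way to implement such a bootstrap-boosted ANOVA is the residual resampling variant sketched below (a common scheme, not necessarily the exact procedure used in the study): each group is centered at its own mean to impose the null hypothesis of equal means, the centered data are resampled with replacement, and the bootstrap distribution of the F statistic is compared with the observed F. The group parameters here are illustrative only.

```python
import numpy as np
from scipy import stats

def bootstrap_anova_p(groups, n_boot=10_000, seed=0):
    """Bootstrap p value for one-way ANOVA: center each group at its mean
    (imposing the null hypothesis of equal means), resample with replacement,
    and count how often the bootstrap F exceeds the observed F."""
    rng = np.random.default_rng(seed)
    f_obs = stats.f_oneway(*groups).statistic
    centered = [g - g.mean() for g in groups]   # impose the null hypothesis
    exceed = 0
    for _ in range(n_boot):
        resampled = [rng.choice(c, size=c.size, replace=True) for c in centered]
        if stats.f_oneway(*resampled).statistic >= f_obs:
            exceed += 1
    return f_obs, exceed / n_boot

# Illustrative data: two equal groups and one clearly shifted group.
rng = np.random.default_rng(1)
groups = [rng.normal(m, 1.0, 25) for m in (0.0, 0.0, 1.5)]
f_obs, p_boot = bootstrap_anova_p(groups, n_boot=2000, seed=1)  # 2000 for brevity
print(f"F = {f_obs:.2f}, bootstrap p = {p_boot:.4f}")
```

Because the reference distribution is built from the data themselves, the procedure does not rely on normality or variance homogeneity.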
Our goal was to demonstrate that the bootstrap method can serve as a more flexible and adaptable approach to data analysis under diverse conditions compared to traditional parametric approaches. In selecting the data, we adhered to the principle of representativeness and endeavored to incorporate the diversity of observations to ensure the utmost reliability and universality of our results. We prioritized the objectivity of our findings, striving to minimize the impact of subjective interpretations and biases on the data analysis process.
In the first part, we compared two variables with differing homogeneity of variances. We observed that in the case of heterogeneous variances, bootstrap slightly inflates the post hoc probabilities (p values). For the variable with homogeneous variances, the outcomes of the bootstrap-boosted analyses were closer to those obtained with the classical approach, indicating that bootstrap is an effective tool for assessing the reliability of results in data with both homogeneous and heterogeneous variances.
In the second part, we examined the effectiveness of the bootstrap procedure in analyses of data with distributions characterized by different kurtosis and skewness. We found that in tests enriched with bootstrap, rejecting the null hypothesis is more challenging. This indicates that bootstrap provides more robust results and reduces the risk of type I errors. At the same time, the post hoc probabilities were not inflated to an extent that would lead to type II errors compared with classical methods.
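A bootstrap post hoc analysis of this kind can be sketched as pairwise comparisons of group means, each tested against a null distribution built from mean-centered resamples, with a Bonferroni adjustment at the end. This is a generic illustration under assumed data, not the study's exact procedure:

```python
import numpy as np
from itertools import combinations

def bootstrap_pairwise(groups, n_boot=10_000, seed=0):
    """Bootstrap post hoc: for each pair of groups, resample the mean-centered
    data (imposing equal means) and estimate a two-sided p value for the
    observed difference in means; Bonferroni-adjust over all pairs."""
    rng = np.random.default_rng(seed)
    pairs = list(combinations(range(len(groups)), 2))
    raw = []
    for i, j in pairs:
        a, b = groups[i], groups[j]
        d_obs = abs(a.mean() - b.mean())
        a0, b0 = a - a.mean(), b - b.mean()     # impose the null hypothesis
        count = 0
        for _ in range(n_boot):
            d = abs(rng.choice(a0, a.size, replace=True).mean()
                    - rng.choice(b0, b.size, replace=True).mean())
            if d >= d_obs:
                count += 1
        raw.append(count / n_boot)
    adjusted = [min(p * len(pairs), 1.0) for p in raw]
    return pairs, raw, adjusted

# Illustrative data: groups 0 and 1 are close; group 2 is clearly shifted.
rng = np.random.default_rng(3)
groups = [rng.normal(m, 1.0, 20) for m in (0.0, 0.1, 1.2)]
pairs, raw, adjusted = bootstrap_pairwise(groups, n_boot=2000, seed=3)
for (i, j), p in zip(pairs, adjusted):
    print(f"group {i} vs group {j}: adjusted p = {p:.4f}")
```

A stricter adjustment (e.g., Bonferroni in place of a more liberal correction) changes only the final multiplication step, which is consistent with the observation that the conservatism of the post hoc test does not interfere with the resampling itself.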
In other studies, bootstrap has also proven to be a valuable tool for distributions deviating significantly from the Gaussian distribution. In the simulation study by Perez-Melo et al., the bootstrap method was useful for calculating confidence intervals in distributions with substantial skewness [15]. Likewise, Chan et al. demonstrated that the bootstrap method is recommended for correlation tests in non-Gaussian distributions [16].
In the final section of our paper, we tested bootstrap with post hoc tests of varying conservatism. It turned out that the strictness of the test does not affect the performance of bootstrap. The results obtained with this method had slightly inflated p values in all cases compared to the traditional approach. However, the inflated p values did not lead to a loss of statistical significance.
Researchers willingly use bootstrap in increasingly diverse statistical analyses and in situations where classical methods may yield uncertain results. For instance, Xu et al. demonstrated that bootstrap can be useful in two-way ANOVA, even with small sample sizes [17]. Romano et al. utilized bootstrap in conjunction with the Bonferroni test for multiple testing [18]. The use of bootstrap in the context of multiple comparisons was also discussed by Westfall, who concluded that it is not a universal improvement over the classical approach [19].
There is no universal recommendation for the number of repetitions in bootstrap analyses. In our study, we selected 10,000 iterations to ensure the stability and precision of our results. Efron’s early work suggested that even a small number of iterations, such as 25 or 50, could suffice for estimating the standard error, while a larger number is needed for confidence intervals [20]. Efron based his recommendations on the unconditional coefficient of variation. Booth and Sarkar proposed a higher number of iterations, considering the conditional coefficient of variation, which accounts only for the variability arising from resampling [21]. Contemporary studies, such as those by Hesterberg, recommend using 10,000 iterations for more precise estimates [22]. The computational power currently available allows for significantly more iterations, further enhancing the accuracy and reliability of the results.
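The effect of the iteration count can be illustrated directly: the bootstrap estimate of the standard error of the mean fluctuates at small B and stabilizes as B grows (the sample below is arbitrary illustrative data, not from the study):

```python
import numpy as np

rng = np.random.default_rng(7)
sample = rng.normal(10.0, 3.0, size=50)     # illustrative sample

def boot_se(data, n_boot, rng):
    """Bootstrap estimate of the standard error of the mean."""
    means = [rng.choice(data, data.size, replace=True).mean()
             for _ in range(n_boot)]
    return np.std(means, ddof=1)

# The standard-error estimate stabilizes as the number of resamples grows.
for B in (25, 200, 2000, 10_000):
    print(f"B = {B:>6}: bootstrap SE = {boot_se(sample, B, rng):.4f}")
```

At B = 10,000 the estimate is essentially indistinguishable from the analytic standard error s/√n, whereas at B = 25 the run-to-run Monte Carlo variability is still visible.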
Parametric methods are often the primary choice in statistical data analysis. In reality, however, studies rarely involve data in which the variables follow a Gaussian distribution. The application of non-parametric methods carries a higher risk of error, since they have lower statistical power [23]. One should also not underestimate the fact that, while parametric methods provide a huge variety of analysis models, only a few of the most basic ones have equivalents among non-parametric tests. The bootstrap has therefore proven to be a viable alternative to classical methods of analysis. In situations where classical methods fail or yield uncertain results, bootstrap can be a valuable tool for reinforcing the credibility of analyses and their outcomes. Analyses utilizing the bootstrap method demonstrated an inclination to elevate p values, indicating that employing this method may encourage a more prudent approach to null hypothesis rejection. Although bootstrap analysis tends to yield higher p values, these differences are not large enough to obscure genuine, substantial deviations. They represent subtle adjustments that still allow attention to be focused on real statistical differences where they exist.
5. Conclusions
Despite the abundance of available statistical tools, the problem of multiple comparisons is still present in data analysis. Parametric methods, known for their great power, are not applicable when data distributions deviate from normality. While parametric tests are popular, bootstrap appears to be a good alternative in data analysis. In our study, simulations were conducted in various scenarios involving data with extreme distributions and differing homogeneity of variance. In all cases, bootstrap effectively validated the accuracy of the results. Bootstrap-boosted analyses showed that the rejection of the null hypothesis became less hasty, which enhances the credibility of the results.
The results demonstrated that bootstrap is an especially useful tool for analyzing data with small sample sizes. The p values obtained from bootstrap-enhanced analyses help to prevent the premature rejection of the null hypothesis, thereby reducing the risk of type I errors. In the traditional approach, there is a higher chance of obtaining a false-positive result, especially when the group sample sizes are very small.
On the other hand, the bootstrap method does have its limitations. One potential drawback is the duration of the analyses, which can extend to several hours depending on the complexity of the analyses, the sample size, and the computer’s processing power. In our study, the most time-consuming aspect was the resampling of groups with different distributions, which took several hours per distribution. When employing this data analysis method, it is important to consider that less powerful computers may significantly increase the analysis time or even be unable to complete the analyses.
In conclusion, the analyses presented in our study demonstrate the effectiveness of bootstrap in verifying the robustness of research results. When analyzing data with distributions departing significantly from the Gaussian model, an alternative method such as bootstrap should be considered, so that results that would otherwise be uncertain and ambiguous under classical methods become more reliable.