Adaptations on the Use of p-Values for Statistical Inference: An Interpretation of Messages from Recent Public Discussions
Round 1
Reviewer 1 Report
The authors provide a balanced review of the controversy associated with p-values and significance interpretation in the sciences. The strengths of the review are the clear presentation of the concepts, definitions, and underlying assumptions followed by different authors, and the presentation of an example accompanied by a detailed implementation in the R programming language.
The points that would need to be clarified are as follows:
1) "Replicability".
a) The authors state that, in their review, they use "replicability" to mean the "ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected" (line 260).
b) The rri is based on bootstrapped data, so it cannot really be compared to sampling new real data. Associating the rri value with the "replicability of the experiment" and interpreting it as such may therefore be misleading (e.g., lines 193-195); see the sketch following point c) below.
c) The title of the manuscript also does not reflect this definition of experimental replicability.
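To make point b) concrete, here is a minimal sketch of the construction as I understand it, in R (simulated data; the one-sample t-test, the seed, and alpha = 0.05 are my own illustrative assumptions, not the authors' settings):

```r
set.seed(1)                           # illustrative seed
x <- rnorm(30, mean = 0.4)            # the single observed sample (simulated)
alpha <- 0.05
p_boot <- replicate(1000, t.test(sample(x, replace = TRUE))$p.value)
rri <- mean(p_boot < alpha)           # share of resamples rejecting H0: mu = 0
rri
```

Every resample is drawn from the one observed sample, so the rri quantifies the stability of the test decision under resampling of these data, not the outcome of collecting new data.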
2) The proposed bootstrap approach for statistical parameter estimation may also have several limitations, which should be discussed in the review, especially:
a) dependence on the observed data, with the assumption that the observed data are a representative sample of the population,
b) sample size limitations: a sufficiently large sample size is needed to generate accurate estimates. In some cases, the sample size may be too small to support meaningful resampling, leading to unstable estimates. In small samples, the bootstrap estimates may be biased due to the presence of outliers or other unusual observations,
c) sensitivity to model assumptions: The bootstrap relies on the assumption that the observed data are generated from a well-defined statistical model. If the model is mis-specified or inappropriate for the data, the bootstrap estimates may be biased or unreliable.
In that respect, providing simulations of the effects of sample size, or recommendations about the minimum number of observations needed for the approach to be meaningful, would be welcome.
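For instance, a sketch of such a simulation (the test, the effect size, and all other settings are my own illustrative assumptions):

```r
set.seed(1)                                  # illustrative seed
rri_once <- function(n, B = 500, alpha = 0.05) {
  x <- rnorm(n, mean = 0.4)                  # one simulated study of size n
  mean(replicate(B, t.test(sample(x, replace = TRUE))$p.value) < alpha)
}
sizes <- c(10, 20, 50, 100)
instability <- sapply(sizes, function(n) sd(replicate(50, rri_once(n))))
names(instability) <- sizes
instability                                  # larger SD = less stable rri
```

Tabulating how the spread of the rri shrinks with n would support a recommendation for a minimum number of observations.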
3) Interpretation: a) Lines 196-197: why is an rri of 0.44 (i.e., in 44% of the cases H0 is rejected based on the bootstrap distribution of the p-value at a given significance level, so in 56% of the bootstrapped cases H0 is not rejected) a "more robust assessment […]" relative to the "marginally significant" result of p = 0.0534? With both approaches, a similar degree of uncertainty seems to be conveyed, only on different scales. What is meant by "robustness" is not clear here.
The same statement about improved robustness is made again from line 265 onwards, but no justification for it is provided.
b) In Table 1, the CIs of the s-values are mostly overlapping. Does this mean that the estimated values should be interpreted as not different from each other?
4) Generally, statistical tests are constructed to control the Type I error rate. Here, the bootstrap procedure performs multiple tests on the same data. What do the authors recommend to avoid inflating the rate of false positives, especially since the generated p-values are then used directly for further inference?
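To illustrate the concern with a minimal sketch (simulated data; the one-sample t-test and all settings are my assumptions):

```r
set.seed(1)                          # illustrative seed
x <- rnorm(30)                       # data simulated under a true H0 (mean 0)
p_boot <- replicate(1000, t.test(sample(x, replace = TRUE))$p.value)
mean(p_boot < 0.05)                  # share of resamples that falsely reject H0
```

Under a true null, the share of resampled p-values below 0.05 depends on where the single observed p-value happens to fall, so inferences built directly on these bootstrap p-values can inherit an inflated false-positive rate.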
5) The authors chose to use ROC curves to quantify the effects of the biomarkers. Why not use simpler tests, e.g., a t-test, which would be simpler to present in a review paper, and for which both the test statistic (effect size) and the associated p-value would be easier to grasp?
In addition, statistical analyses of biomarker data often rely on multiple t-tests followed by correction for multiple testing; perhaps the authors could explain the rationale behind their choice of ROC in that light.
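For reference, a minimal sketch of that conventional workflow (simulated data; the group labels, dimensions, and the Benjamini-Hochberg method are my own illustrative assumptions):

```r
set.seed(1)                               # illustrative seed
n <- 40; m <- 10                          # 40 subjects, 10 biomarkers (assumed)
group <- factor(rep(c("case", "control"), each = n / 2))
X <- matrix(rnorm(n * m), nrow = n)       # simulated biomarker measurements
p_raw <- apply(X, 2, function(b) t.test(b ~ group)$p.value)
p_adj <- p.adjust(p_raw, method = "BH")   # Benjamini-Hochberg correction
round(cbind(p_raw, p_adj), 3)
```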
Minor points
- Table 1: the "95% CI" label for the s-values is missing (it was probably wrongly attributed to the column one to the left, i.e., the p-values).
- Some sentences could be clarified or modified, as their meaning is not clear:
a) Line 199: "Statistics are uncertain".
b) Line 243: "Editors of renowned journals have been looking for different approaches and reflections of credible researchers in order to cultivate bit by bit an evolution of the p-value topic"
- Using a specific seed (set.seed(xxx)) in the presented R code would be good to guarantee that the randomized data are reproducible as presented in the work, especially because randomization procedures are involved.
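For example (the seed value below is arbitrary; any fixed integer would do):

```r
set.seed(2023)   # arbitrary illustrative value, chosen only for reproducibility
# ... the manuscript's bootstrap / randomization code would follow here ...
```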
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
I found the article quite problematic, and on several counts. Some are listed below, in no special order.
1. On line 82, the authors say, "Overall, however, there is a broad consensus that p-values are useful tools if properly used".
I don't think that is true. For one thing, the authors fail to say what constitutes proper use, other than the very vague statement that researchers should consider other information too. For another, there are many researchers (including myself) who believe there is no proper use. In fact, given that the authors cited a few (a small subset!) of the authors who have rejected p-values, they seem self-contradictory.
2. An important argument against p-values is that they are based not just on the null hypothesis but on the whole statistical model, which includes many assumptions beyond the null hypothesis. For example, there is the ubiquitous assumption that participants were randomly selected from the population, which is practically never true. And there are many more, as Amrhein et al. (whom the authors cited) indicated. Thus, the statistical model is practically always wrong. What, then, is the point of gathering evidence, in the form of a p-value or s-value, against a model known to be wrong? The authors never explain this.
3. Hearkening back to Point 1, and taking Point 2 seriously, where is the authors' vaunted "proper use" of p-values? Given that p-values provide evidence not against the null hypothesis but at best against the whole statistical model (and I don't even buy that part), which is all but certainly wrong, where is the proper use of p-values? Exactly how is the researcher to use them to improve the scientific conclusions drawn, above and beyond the descriptive statistics researchers already have? The authors never say!
4. The authors claim to be interested in precision, but do not cite any of the relevant literature pertaining to the a priori procedure, which is specifically concerned with obtaining a sufficient sample size to meet researcher prescriptions for precision and confidence. This lack is difficult to understand.
5. Although the foregoing problems are already fatal, none of them is the biggest problem. The biggest problem is that the authors try to parlay their discussion of p-values into a way of addressing the replication issue. However, the authors never explain what it means to replicate a finding and fail to review the literature on this. There have been different arguments: (a) successful replication means getting statistical significance twice, (b) successful replication means getting statistical significance the second time, too, but with a higher-powered study, (c) successful replication means getting summary statistics (e.g., means) that are similar in both studies, (d) successful replication means getting summary statistics that are similar to the population value (e.g., the sample mean is similar to the population mean), and (e) there are others not listed here. As the authors fail to explain their conception of what it means to replicate, it is impossible to properly evaluate their suggestion.
6. In addition to Point 5, I would expect the authors to explain why their conceptualization of what it means to successfully replicate is better than the alternatives.
There are other complaints I could make, but this seems sufficient grounds for rejection, so I’ll stop here.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
The manuscript reviews the literature on the limitations and misuse of the p-value, and explores some strategies to mitigate some of these limitations.
Although the topic is not new, the article is very well written, with an extensive list of the most recent related publications. The case study is clearly presented, and the source code of the scripts is made available. I did not find any writing errors.
A small caveat concerns the use of the bootstrap with the Wilcoxon rank-sum test, since resampling with replacement induces a high number of ties and compromises the test's performance (see N. Chlass and J. J. Kruger, Small Sample Properties of the Wilcoxon Signed Rank Test with Discontinuous and Dependent Observations, Jena Economic Research Paper No. 2007-032, 2007). Nonetheless, this issue seems to be properly addressed in lines 278-283.
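For illustration, a minimal sketch of the ties phenomenon (simulated data; all settings are my own assumptions):

```r
set.seed(1)                          # illustrative seed
x <- rnorm(20); y <- rnorm(20, mean = 0.5)
xb <- sample(x, replace = TRUE)      # resampling with replacement
yb <- sample(y, replace = TRUE)      # almost surely duplicates values
any(duplicated(c(xb, yb)))           # TRUE: the resampled data contain ties
wilcox.test(xb, yb)                  # warns: cannot compute exact p-value with ties
```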
Author Response
We kindly thank the reviewer.
Round 2
Reviewer 1 Report
I thank the authors for carefully addressing the open questions. I don't have further questions.
Author Response
We kindly thank the reviewer.
Reviewer 2 Report
The revisions were insufficient to address my concerns. Therefore, I still recommend rejection.
Author Response
Unfortunately, we will have to consider this a deadlock and have decided to make no further changes, as these would alter the essence of our work, which has been positively assessed by Reviewers #1 and #3.