### *3.4. Limitations*

Although much progress has been made in the approximately two years since the APP was invented, limitations nevertheless remain. The most important limitations are conceptual. Unlike many other inferential statistical procedures, the APP does not dictate what hypotheses to accept or reject. For researchers who believe that other inferential statistical procedures, such as NHST, really do validly dictate what hypotheses to accept or reject, this is an important limitation. Hopefully, the remarks in the first major section of the present article have disabused the reader of the notion that any procedure is valid for making decisions about hypotheses. If the reader is convinced, then the limitation remains serious only in the absolute sense that it would be nice to have an inferential procedure that validly dictates what hypotheses to accept or reject; it is not serious in the relative sense, because other inferential procedures that make that promise fail to deliver, so nothing is lost by using the APP.<sup>10</sup>

A second conceptual limitation is suggested by one of the arguments against *p*-values: the model is known to be wrong, so there is no point in using *p*-values to gather evidence against it. It is tempting to apply model wrongness to the APP, where the models again will not be precisely correct. However, as we pointed out earlier, closeness counts heavily with respect to estimation, but it does not count at all for binary decisions, which are either correct or incorrect. Because the APP is for estimation, if the model is reasonably close to being correct, even though it cannot be precisely correct, the estimate should be reasonably good. This point can be explained in more specific terms. Suppose that the model is close but not perfect, resulting in a sample size that is slightly larger or slightly smaller than the precisely correct sample size necessary to meet specifications. In the case of the larger sample size, the researcher will obtain slightly better closeness, and little harm is done, except that the researcher will have put greater than optimal effort into data collection. In the case of the smaller sample size, the researcher's sample statistics of interest will not be quite as close to their corresponding population parameters as desired, but they may nevertheless be close enough to be useful. Therefore, although model wrongness is always important to consider, it need not be fatal for the APP, whereas it generally is fatal for *p*-values.
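
One way to see the continuity claim concretely: under the basic single-mean APP equation (for a normally distributed mean), the achieved closeness is the critical *z*-value divided by the square root of *n*, which varies smoothly with sample size, so a modestly misspecified *n* moves closeness only modestly. Below is a minimal Python sketch assuming that equation and a confidence level of 0.95; it is an illustration, not the only APP equation.

```python
from math import sqrt
from scipy.stats import norm

# Illustration only: the basic single-mean APP equation implies an
# achieved closeness of f = z_{(1+c)/2} / sqrt(n). If a slightly wrong
# model puts n slightly off target, f changes only slightly.
z = norm.ppf(0.975)  # critical value for confidence c = 0.95

for n in (90, 100, 110):
    print(n, round(z / sqrt(n), 3))
# 90 -> 0.207, 100 -> 0.196, 110 -> 0.187
```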

There are also practical limitations. Although work is in progress concerning complex contrasts involving multiple means (assuming a normal distribution) or multiple locations (assuming a skew-normal distribution), the requisite APP equations do not yet exist. Similarly, although work is in progress for researchers interested in correlations, regression weights, and so on, the requisite APP equations do not yet exist there either. Another practical limitation is the lack of an APP computer program that would allow researchers to perform the calculations without having to do their own programming. Work proceeds, however, and we hope to address these practical limitations in the very near future.

### *3.5. APP versus Power Analysis*

To some, the APP may seem merely an advanced way to perform power analysis. However, this is not so, as can be shown both in a general sense and in two specific senses. Speaking generally, the APP and power analysis have very different goals. The goal of power analysis is to find the number of participants needed to have a good chance of obtaining a *p*-value that comes in under threshold (e.g., *p* < 0.05) when the null hypothesis is meaningfully violated.<sup>11</sup> In contrast, the APP goal is to find the number of participants necessary to reach specifications for closeness and confidence.

<sup>10</sup> I thank an anonymous reviewer for pointing out that, due to the lack of cutoff points, this limitation can be considered a strength. According to the reviewer, "There is no cutoff point, so potentially all estimates could be viable."

This general difference results in specific mathematical differences too.<sup>12</sup> First, power analysis depends importantly on the anticipated effect size. If the anticipated effect size is small, a power analysis will indicate that many participants are necessary for adequate power; but if the anticipated effect size is large, a power analysis will indicate that only a small sample size is necessary for adequate power. In contrast, the anticipated effect size plays no part whatsoever in APP calculations. For example, suppose that the anticipated effect size for a single-sample experiment is 0.80. A power analysis would show that only 13 participants are needed for power = 0.80; but an APP calculation would nevertheless demonstrate that the resulting closeness value is a woeful 0.54.<sup>13</sup>
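
For readers who wish to check these numbers, here is a minimal Python sketch. It assumes the standard normal-approximation formula for two-tailed, one-sample power and the basic single-mean APP equation solved for closeness; both choices are illustrative rather than the only possible ones.

```python
from math import ceil, sqrt
from scipy.stats import norm

d, alpha, power, conf = 0.80, 0.05, 0.80, 0.95

# Power analysis via the usual normal approximation for a two-tailed,
# one-sample test: n = ((z_{1-alpha/2} + z_{power}) / d)^2.
n = ceil(((norm.ppf(1 - alpha / 2) + norm.ppf(power)) / d) ** 2)
print(n)  # 13 -- participants needed for power = 0.80 at d = 0.80

# Closeness actually achieved by those 13 participants, from the
# single-mean APP equation n = (z_{(1+c)/2} / f)^2 rearranged for f:
f = norm.ppf((1 + conf) / 2) / sqrt(n)
print(round(f, 2))  # 0.54 -- the sample mean is within 0.54 standard
                    # deviations of the population mean at 95% confidence
```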

A second specific difference is that APP calculations are influenced, importantly, by the desired level of closeness. In contrast, power calculations are completely uninfluenced by the desired level of closeness. Moreover, absent APP thinking, few researchers would even consider the issue of the desired level of closeness. In summary, the APP is very different from power analysis, both with respect to general goals and with respect to specific factors that influence how the calculations are performed.
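
To make this second difference concrete, the following sketch (under the same single-mean APP assumption as above) shows how strongly the desired closeness drives the required sample size, a quantity to which no power calculation responds:

```python
from math import ceil
from scipy.stats import norm

def app_n(f, conf=0.95):
    """Participants needed for the sample mean to fall within f standard
    deviations of the population mean with the stated confidence, using
    the basic single-mean APP equation n = (z_{(1+c)/2} / f)^2."""
    return ceil((norm.ppf((1 + conf) / 2) / f) ** 2)

for f in (0.50, 0.25, 0.10, 0.05):
    print(f, app_n(f))
# 0.5 -> 16, 0.25 -> 62, 0.1 -> 385, 0.05 -> 1537
```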

### *3.6. The Relationship between the APP and Idealized Replication*

Much recent attention concerns replication probabilities across the sciences. For example, the Open Science Collaboration (2015) publication indicates that most findings published in top psychology journals failed to replicate. One of the many disadvantages of both *p*-values and CIs is that they fail to say much about the extent to which experiments would be likely or unlikely to replicate. In contrast, as Trafimow (2018a) explained in detail, the results of the APP strongly relate to reproducibility.

To understand the relationship, it is necessary to make two preliminary points. The first point is philosophical and concerns what we would expect a successful replication to entail. Because of scientists' addiction to NHST, most consider a successful replication to entail statistically significant findings in the same direction in both the original and replication studies. However, once NHST is admitted to be problematic, defining a successful replication in terms of NHST is similarly problematic. But the present argument extends beyond NHST to effect sizes more generally.

Consider the famous Michelson and Morley (1887) experiment that disconfirmed the presence of the luminiferous ether that researchers had supposed to permeate the universe.<sup>14</sup> The surprise was that the effect size was near zero, thereby suggesting that there is no luminiferous ether after all.<sup>15</sup> Suppose a researcher today wished to replicate the experiment. Because larger effect sizes correspond with lower *p*-values, it should be clear that, going by replication conceptions involving *p*-values, it is much more difficult to replicate smaller effect sizes than larger ones.<sup>16</sup> Thus, according to traditional NHST thinking, it should be extremely difficult to replicate Michelson and Morley (1887), though physicists do not find it so. This is one reason it is a mistake to let effect sizes dictate replication probabilities. In contrast, using APP thinking, a straightforward conceptualization of a successful replication is that the descriptive statistics of concern are close to their corresponding population parameters in both the original and replication studies.<sup>17</sup> An advantage of this conceptualization is that it treats large and small effect sizes equally. What matters is not the size of the effect, but rather how close the sample effect is to the population effect, in both the original and replication studies.

<sup>11</sup> For those who prefer CIs, an alternative goal would be to find the number of participants required to obtain sample CIs of desired widths.

<sup>12</sup> For elaborated mathematical discussions of the differences, see Trafimow and Myüz (forthcoming) and Trafimow (2019b).

<sup>13</sup> See Trafimow and Myüz (forthcoming) for details.

<sup>14</sup> Michelson received his Nobel Prize in 1907.

<sup>15</sup> It is interesting that Carver (1993) reanalyzed the data using NHST and obtained a statistically significant effect due to the large number of data points. As Carver pointed out, had Michelson and Morley used NHST, the existence of the luminiferous ether would have been supported, with incalculable consequences for physics (also see Trafimow and Rice 2009).

<sup>16</sup> A counter might be to use equivalence testing; but this is extremely problematic because it involves the computation of at least two *p*-values, whereas we already have seen that even one *p*-value is problematic.

<sup>17</sup> If specifications are not met in one of the two studies, that constitutes a failure to replicate.

The second point is to imagine an idealized universe, where all systematic factors are the same in both the original and replication study. Thus, the only differences between the original and replication study are due to randomness.

Remember that our new conceptualization of a successful replication pertains to the sample statistics of interest being close to their corresponding population parameters in both the original and replication studies. Invoking an idealized universe then suggests a simple way to calculate the probability of replication: the probability of replication in the idealized universe is simply the square of the probability of being close in a single experiment (Trafimow 2018a). We have already seen that all APP equations can be algebraically rearranged to yield closeness given the sample size used in the original study. It is equally possible to fix closeness at some level and algebraically rearrange APP equations to yield the probability of meeting the closeness specification given the sample size used. Once this has been accomplished, the researcher merely squares that probability to obtain the probability of replication in the idealized universe. Trafimow (2018a) described the mathematics in detail and showed how specifications for closeness and sample size influence the probability of replication.
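
As an illustration of the rearrangement and the squaring step, the sketch below again assumes the basic single-mean APP equation; solving it for confidence gives the probability of meeting a closeness specification at a given sample size, which is then squared:

```python
from math import sqrt
from scipy.stats import norm

def p_close(n, f):
    """Probability that the sample mean lands within f standard deviations
    of the population mean: rearranging n = (z/f)^2 gives z = f * sqrt(n),
    so the probability is 2 * Phi(f * sqrt(n)) - 1."""
    return 2 * norm.cdf(f * sqrt(n)) - 1

def p_replication_idealized(n, f):
    """Idealized-universe replication probability: the closeness criterion
    must be met in both the original and the replication study, so the
    single-study probability is squared (Trafimow 2018a)."""
    return p_close(n, f) ** 2

# Even n = 100 with a closeness specification of 0.1 yields a low
# replication probability in the idealized universe:
print(round(p_close(100, 0.1), 3))                  # 0.683
print(round(p_replication_idealized(100, 0.1), 3))  # 0.466
```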

A way to attack the usefulness of the APP conceptualization of a successful replication is to focus on the necessity of invoking an idealized universe to carry through the calculations. But the attack can be countered in both general and specific ways. The general counter is that scientists often have proposed idealized universes, such as the ideal gas law in chemistry, Newton's idealized universe devoid of friction, and so on. The history of science shows that idealized conceptions often have been useful, though not strictly correct (Cartwright 1983). More specifically, however, consider that the difference between the APP idealized universe and the real universe is that only random factors can hinder replication in the idealized universe, whereas both random and systematic factors can hinder replication in the real universe. Because there is more that can go wrong in the real universe than in the idealized universe, it should be clear that the probability of replication in the idealized universe sets an upper limit on the probability of replication in the real universe. Because Trafimow (2018a) showed that most research has a low probability of replication even in the idealized universe, the probability of replication in the real universe must be even lower. In summary, whereas *p*-values and CIs have little to say about the probability of replication, the APP has much to say about it.
