Statistical evidence is the core concept of this approach. It is now examined how this has been addressed by various means and to what extent these have been successful. For this, we need some notation. The basic ingredient of every statistical theory is the sampling model $\{f_\theta : \theta \in \Theta\}$, a collection of probability density functions on some sample space $\mathcal{X}$ with respect to some support measure $\mu$. The variable $\theta$ is the model parameter, which takes values in the parameter space $\Theta$, and it indexes the probability distributions in the model. The idea is that one of these distributions in the model produced the data x, and this is represented as $x \sim f_\theta$. The object of interest is represented as a function $\psi = \Psi(\theta)$, where $\psi$ can be obtained from the distribution $f_\theta$. The questions E and H can be answered categorically if $\theta$ becomes known. Note that there is no reason to restrict $\Theta$ to be finite dimensional, and so nonparametric models are also covered by this formulation.
A simple example illustrates these concepts and will be used throughout this discussion.
3.1. p-Values, E-Values and Confidence Regions
The p-value is the most common statistical approach to measuring evidence associated with H. The p-value is associated with Fisher, see [7], although there are precursors who contributed to the idea, such as John Arbuthnot, Daniel Bernoulli, and Karl Pearson; see [8,9]. Our purpose in this section is to show that the p-value and the associated concepts of e-value and confidence region do not provide adequate measures of statistical evidence.
We suppose that there is a statistic $T$ and consider its distribution under the hypothesis $H_0: \Psi(\theta) = \psi_0$. The idea behind the p-value is to answer the question: is the observed value of $T$ something we would expect to see if $\psi_0$ is the true value of $\psi$? If $T(x)$ is a surprising value, then this is interpreted as evidence against $H_0$. A p-value measures the location of $T(x)$ in the distributions of $T$ under $H_0$, with small values of the p-value being an indication that the observed value is surprising. For example, it is quite common that large values of $T$ are in the tails (regions of low probability content) of each of the distributions of $T$ under $H_0$, and so, given the observed value $T(x)$, the tail probability $\sup_{\{\theta : \Psi(\theta) = \psi_0\}} P_\theta(T \geq T(x))$ is computed as the p-value.
Example 2 (location normal). Suppose that $x = (x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, with $\mu$ unknown and $\sigma_0^2$ known, and that it is required to assess the hypothesis $H_0: \mu = \mu_0$. For this, it is common to use the statistic $T(x) = \sqrt{n}\,|\bar{x} - \mu_0|/\sigma_0$, as this has a fixed distribution (the absolute value of a standard normal variable) under $H_0$, with the p-value given by $2(1 - \Phi(T(x)))$, where Φ is the cdf of the standard normal distribution.
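As a quick illustrative sketch (not from the paper), the p-value of Example 2 can be computed directly; the sample values below are hypothetical.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_value(xbar, mu0, sigma0, n):
    """Two-sided p-value for H0: mu = mu0 in the location normal model,
    based on T(x) = sqrt(n)|xbar - mu0|/sigma0, which is |N(0,1)| under H0."""
    t = sqrt(n) * abs(xbar - mu0) / sigma0
    return 2.0 * (1.0 - phi(t))

# hypothetical numbers: xbar = 2.5 from n = 10 observations, sigma0 = 1, mu0 = 2
print(round(p_value(2.5, 2.0, 1.0, 10), 4))
```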
Several issues arise with the use of p-values in general; for example, see Section 3 in [10] for a very thorough discussion. First, what value $\alpha$ is small enough to warrant the assertion that evidence against $H_0$ has been obtained when the p-value is less than $\alpha$? It is quite common in many fields to use the value $\alpha = 0.05$ as a cut-off, but this is not universal; in particle physics, a much smaller cut-off, corresponding to a five-standard-deviation event, is commonly employed. Recently, due to concerns with the replication of results, it has been proposed that the value $\alpha = 0.005$ be used as a standard, but there are concerns with this as expressed in [11]. The issue of the appropriate cut-off is not resolved.
It is also not the case that a p-value greater than $\alpha$ is to be interpreted as evidence in favor of $H_0$ being true. Note that, in the case that $T$ has a single continuous distribution under $H_0$, the p-value has a uniform$(0,1)$ distribution when $H_0$ is true. This implies that a p-value near 0 has the same probability of occurring as a p-value near 1. It is often claimed that a valid p-value must have this uniformity property. But consider the p-value of Example 2, where $\bar{x} \sim N(\mu, \sigma_0^2/n)$ under $\mu$, and so, as n rises, the distribution of $\bar{x}$ becomes more and more concentrated about the true value of $\mu$. For large enough n, virtually all of the distribution of $\bar{x}$ under $\mu$ will be concentrated in the interval $(\mu - \delta, \mu + \delta)$, where $\delta$ represents a deviation from $\mu$ that is of scientific interest, while a smaller deviation is not, e.g., the measurements are made to an accuracy no greater than $\delta$. Note that in any scientific problem, it is necessary to state the accuracy with which the investigator wants to know the object of interest $\psi$, as this guides the measurement process as well as the choice of sample size. The p-value ignores this fact and, when n is large enough, could record evidence against $H_0: \mu = \mu_0$ when in fact the data are telling us that $H_0$ is effectively true, namely $|\mu - \mu_0| < \delta$, and evidence in favor should be stated. This distinction between statistical significance and scientific significance has long been recognized (see [12]) and needs to be part of statistical methodology (see Example 6). This fact also underlies a recommendation, although not commonly followed, as it does not really address the problem, that the $\alpha$ cut-off should be reduced as the sample size increases. Perhaps the most important take-away from this is that a value of $T$ that lies in the tails of its distributions under $H_0$ is not necessarily evidence against $H_0$. It is commonly stated that a confidence region for $\psi$ should also be provided, but this does not tell us anything about $\delta$, which needs to be given as part of the problem.
There are many other issues that can be raised about the p-value, many of which are covered by the word p-hacking and associated with the choice of $\alpha$. For example, as discussed in [13], suppose an investigator is using a particular $\alpha$ as the cut-off for evidence against $H_0$ and, based on a data set of size $n_1$, obtains the p-value $\alpha + \epsilon$, where $\epsilon > 0$ is small. Since finding evidence against $H_0$ is often regarded as a positive, as it entails a new discovery, the investigator decides to collect an additional $n_2$ data values, computes the p-value based on the full $n_1 + n_2$ data values and obtains a p-value less than $\alpha$. But this ignores the two-stage structure of the investigation, and when this is taken into account, the probability of finding evidence against $H_0$ at level $\alpha$ when $H_0$ is true, and assuming a single distribution under $H_0$, equals $P(A_1) + P(A_1^c \cap A_2)$, where $A_1 = \{$evidence against $H_0$ found at first stage$\}$ and $A_2 = \{$evidence against $H_0$ found at second stage$\}$. If this overall probability is to be no greater than $\alpha$, then, since $P(A_1) = \alpha$ already, it must be that $P(A_1^c \cap A_2) = 0$. So, even though the investigator has done something very natural, the logic of the statistical procedure based on using p-values with a cut-off effectively prevents finding evidence against $H_0$ at the second stage.
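A small simulation (with hypothetical stage sizes and seed) illustrates the inflation: rejecting when either the stage-one p-value or the pooled p-value falls below $\alpha$ pushes the overall probability of falsely finding evidence against $H_0$ above $\alpha$.

```python
import math
import random

def pval(xs, mu0=0.0, sigma0=1.0):
    """Two-sided location normal p-value as in Example 2."""
    n = len(xs)
    t = math.sqrt(n) * abs(sum(xs) / n - mu0) / sigma0
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))

def two_stage_rejects(n1, n2, alpha, rng):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n1)]    # H0 true: mu = 0
    if pval(xs) <= alpha:
        return True                                   # A1: evidence found at stage one
    xs += [rng.gauss(0.0, 1.0) for _ in range(n2)]    # collect additional data
    return pval(xs) <= alpha                          # A2: evidence found on full data

rng = random.Random(1)
reps = 20000
rate = sum(two_stage_rejects(10, 10, 0.05, rng) for _ in range(reps)) / reps
print(rate)  # noticeably above the nominal alpha = 0.05
```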
Royall in [10] provided an excellent discussion of the deficiencies of p-values and rejection trials (p-values with a cut-off $\alpha$) when considering these as measuring evidence. Sometimes, it is recommended that the p-value itself be reported without reference to a cut-off $\alpha$, but we have seen already that a small p-value does not necessarily constitute evidence against $H_0$, and so this is not a solution.
Confidence regions are intimately connected with p-values. Suppose that for each value $\psi_0$ there is a statistic $T_{\psi_0}$ that produces a valid p-value for $H_0: \Psi(\theta) = \psi_0$, as has been described, and an $\alpha$ cut-off is used. Now, put $C_\alpha(x) = \{\psi_0 : \text{the p-value for } H_0: \Psi(\theta) = \psi_0 \text{ exceeds } \alpha\}$. Then, $P_\theta(\Psi(\theta) \in C_\alpha(x)) \geq 1 - \alpha$ for every $\theta$, and so $C_\alpha(x)$ is a $(1-\alpha)$-confidence region for $\psi$. As such, $\psi_0 \in C_\alpha(x)$ is equivalent to $H_0: \Psi(\theta) = \psi_0$ not being rejected at level $\alpha$, and so all the problematical issues with p-values as measures of evidence apply equally to confidence regions. Moreover, it is correctly stated that, when x has been observed, $1 - \alpha$ is not a lower bound on the probability that $\Psi(\theta) \in C_\alpha(x)$. Of course, that is what we want, namely, to state a probability that measures our belief that the true value of $\psi$ is in this set, and so confidence regions are commonly misinterpreted; see the discussion of bias in Section 3.4.
An alternative approach to using p-values is provided by e-values; see [14,15,16]. An e-variable for a hypothesis $H_0$ is a non-negative statistic $E$ that satisfies $E_\theta(E) \leq 1$ whenever $H_0$ is true. The observed value $E(x)$ is called an e-value, where the “e” stands for expectation to contrast it with the “p” in p-value, which stands for probability. Also, a cut-off $c > 1$ needs to be specified such that $H_0$ is rejected whenever $E(x) \geq c$. It is immediate, from Markov’s inequality, that $P_\theta(E \geq c) \leq 1/c$ whenever $H_0$ is true, and so this provides a rejection trial with cut-off $\alpha = 1/c$ for $H_0$.
Example 3 (location normal). For the context of Example 2, define $E_\lambda(x) = \exp\{\lambda \sqrt{n}(\bar{x} - \mu_0)/\sigma_0 - \lambda^2/2\}$ for any $\lambda$, and it is immediately apparent that $E_\lambda$ is an e-variable for $H_0: \mu = \mu_0$, since $\sqrt{n}(\bar{x} - \mu_0)/\sigma_0 \sim N(0,1)$ under $H_0$ and the moment generating function of a standard normal variable equals $\exp\{\lambda^2/2\}$.
Example 3 contains an example of the construction of an e-variable. Also, likelihood ratios can serve as e-variables, and there are a number of other constructions of such variables discussed in the cited literature.
Consider the situation where data are collected sequentially, $E_i$ is an e-variable for $H_0$ based on the independent data value $x_i$, and there is a stopping time $\tau$, e.g., stop when $E_1 E_2 \cdots E_n \geq c$. Also, put $M_n = E_1 E_2 \cdots E_n$. Then, whenever $H_0$ is true, we have that $E(M_n \,|\, x_1, \ldots, x_{n-1}) = M_{n-1} E(E_n) \leq M_{n-1}$, and so the process $(M_n)$ is a discrete time super-martingale. Moreover, it is clear that $M_n$ is an e-variable for $H_0$. This implies that, under conditions, $E(M_\tau) \leq 1$, and so the stopped variable $M_\tau$ is also an e-variable for $H_0$. Assuming the stopping time $\tau$ is finite with probability 1 when $H_0$ holds, then by Ville’s inequality, $P(\sup_n M_n \geq c) \leq 1/c$ under $H_0$. This implies that the problem for p-values when sampling until rejecting $H_0$ at size $\alpha = 1/c$ is avoided when using e-values.
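The following sketch (hypothetical $\lambda$, threshold, and seed) runs the e-process of Example 3 sequentially under $H_0$; the observed frequency of ever crossing the threshold c respects Ville's bound 1/c even though stopping is data-dependent.

```python
import math
import random

def crosses(c, lam, max_n, rng):
    """Track M_n = prod_i exp(lam*z_i - lam^2/2) with z_i ~ N(0,1) under H0;
    report whether the e-process ever reaches the threshold c within max_n steps."""
    m = 1.0
    for _ in range(max_n):
        z = rng.gauss(0.0, 1.0)
        m *= math.exp(lam * z - 0.5 * lam * lam)
        if m >= c:
            return True
    return False

rng = random.Random(7)
reps, c = 20000, 20.0
rate = sum(crosses(c, lam=0.5, max_n=500, rng=rng) for _ in range(reps)) / reps
print(rate)  # stays below 1/c = 0.05, as Ville's inequality guarantees
```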
While e-values have many interesting and useful properties relative to p-values, the relevant question here is whether or not these serve as measures of statistical evidence. The usage of both requires the specification of a cut-off to determine when there are grounds for rejecting $H_0$. Sometimes, this is phrased instead as “evidence against $H_0$ has been found”, but, given the arbitrariness of the choice of cut-off and the failure to properly express when evidence in favor of $H_0$ has been found, neither seems suitable as an expression of statistical evidence. One could argue that the intention behind these approaches is not to characterize statistical evidence but rather to solve the reject/accept problems of decision theory. It is the case, however, at least for p-values, that they are used as if they were proper characterizations of evidence, and this does not seem suitable for purely scientific applications.
Another issue that needs to be addressed in a problem is how to find the statistic $T$ or the e-variable $E$. A common argument is to use likelihood ratios (see Section 3.2 and Section 3.3), but this does not resolve the problems that have been raised here, and there are difficulties with the concept of likelihood that need to be addressed as well.
3.2. Birnbaum on Statistical Evidence
Alan Birnbaum devoted considerable thought to the concept of statistical evidence in the frequentist context. Birnbaum’s paper [17] contained what seemed like a startling result about the implications to be drawn from the concept, and his final paper [18] contained a proposal for a definition of statistical evidence. Also, see [19] for a full list of Birnbaum’s publications, many of which contain considerable discussion concerning statistical evidence.
In [17], Birnbaum considered various relations, as defined by statistical principles, on the set $\mathcal{I}$ of all inference bases, where an inference base is the pair $I = (\{f_\theta : \theta \in \Theta\}, x)$ consisting of a sampling model and data supposedly generated from a distribution in the model. A statistical principle R is then a relation defined on $\mathcal{I}$, namely, R is a subset of $\mathcal{I} \times \mathcal{I}$. There are three basic statistical principles that are commonly invoked as part of evidential inference, namely, the likelihood principle L, the sufficiency principle S, and the conditionality principle C. Inference bases $I_1, I_2$ satisfy L if, for some constant $c > 0$, we have $L(\theta \,|\, I_1) = c\,L(\theta \,|\, I_2)$ for every $\theta$; they satisfy S if the models have equivalent (1-1 functions of each other) minimal sufficient statistics (a sufficient statistic that is a function of every other sufficient statistic) that take equivalent values at the corresponding data values; and they satisfy C if there is an ancillary statistic a (a statistic whose distribution is independent of $\theta$) such that one inference base can be obtained from the other (or conversely) via conditioning on $a$. The basic idea is that, if two inference bases are related by one of these principles, then they contain the same statistical evidence concerning the unknown true value of $\theta$. For this idea to make sense, it must be the case that these principles form equivalence relations on $\mathcal{I}$. In [20], it is shown that L and S do form equivalence relations but C does not, and this latter result is connected with the fact that a unique maximal ancillary (an ancillary which every other ancillary is a function of) generally does not exist.
Birnbaum provided a proof in [17] of the result, known as Birnbaum’s Theorem, that if a statistician accepts the principles S and C, then they must accept L. This is highly paradoxical because frequentist statisticians generally accept both S and C but not L, as L does not permit repeated sampling. Two very different sampling models can lead to proportional likelihood functions, so repeated sampling behavior is irrelevant under L. There has been considerable discussion over the years concerning the validity of this proof, but no specific flaw has been found. A resolution of this is provided in [20], where it is shown that C is not an equivalence relation and that what Birnbaum actually proved is that the smallest equivalence relation on $\mathcal{I}$ that contains S and C is L. This substantially weakens the result, as there is no reason to accept the additional generated equivalences, and in fact it makes more sense to consider the largest equivalence relation on $\mathcal{I}$ that is contained in C. Also, as discussed in [20], an argument similar to the one found in [17] establishes that a statistician who accepts C alone must accept L. Again, however, this only means that the smallest equivalence relation on $\mathcal{I}$ that contains C is L. As shown in [21], issues concerning C can be resolved by restricting conditioning on ancillary statistics to stable ancillaries (those ancillaries such that conditioning on them retains the ancillarity of all other ancillaries and similarly have their ancillarity retained when conditioning on any other ancillary), as there always is a maximal stable ancillary. This provides a conditionality principle that is a proper characterization of statistical evidence, but it does not lead to L.
While Birnbaum did not ultimately resolve the issues concerning statistical evidence, [18] made a suggestion as a possible starting point by proposing the confidence concept in the context of comparing two hypotheses $H_0$ versus $H_1$. The confidence concept is characterized by two error probabilities, the probability of rejecting $H_0$ when it is true and the probability of accepting $H_0$ when $H_1$ is true, and then reporting the outcome with the associated interpretation. This clearly results from a confounding of the decision theoretic (as in Neyman–Pearson) approach to hypothesis testing with the evidential approach, as rejection trials similarly do. Also, this does not give a general definition of what is meant by statistical evidence, as it really only applies in the simple versus simple hypothesis testing context, and it suffers from many of the same issues as discussed concerning p-values. From reading Birnbaum’s papers, it seems he had largely despaired of ever finding a fully satisfactory characterization of statistical evidence in the frequentist context. It will be shown in Section 3.4, however, that the notion of frequentist error probabilities does play a key role in a development of the concept.
3.3. Likelihood
Likelihood is another statistical concept initiated by Fisher; see [22]. While the likelihood function plays a key role in frequentism, a theory of inference, called pure likelihood theory, based solely on the likelihood function, is developed in [10,23]. The basic axiom is thus the likelihood principle L of Section 3.2, which says that the likelihood function $L(\theta \,|\, I) = c f_\theta(x)$ from inference base $I$ completely summarizes the evidence given in I concerning the true value of $\theta$. As such, two inference bases with proportional likelihoods must give identical inferences, and so repeated sampling does not play a role. In particular, the ratio $L(\theta_1 \,|\, I)/L(\theta_2 \,|\, I)$ provides the evidence for $\theta_1$ relative to the evidence for $\theta_2$, and this is independent of the arbitrary constant $c$. While this argument for the ratios seems acceptable, a problem arises when we ask, does the value $L(\theta \,|\, I)$ provide evidence in favor of or against $\theta$ being the true value? To avoid the arbitrary constant, Royall in [10] replaced the likelihood function by the relative likelihood function given by $L(\theta \,|\, I)/L(\hat{\theta} \,|\, I)$, where $\hat{\theta}$ is the maximum likelihood estimate. The relative likelihood always takes values in $[0,1]$. Note that, if $L(\theta \,|\, I)/L(\hat{\theta} \,|\, I) \geq 1/r$, then $L(\theta' \,|\, I)/L(\theta \,|\, I) \leq r$ for all $\theta'$, and so no other value is supported by more than r times the support accorded to $\theta$. Royall then argues, based on a single urn experiment, for one benchmark value of the ratio to represent very strong evidence in favor and for a smaller benchmark to represent quite strong evidence in favor of $\theta$. Certainly, these values seem quite arbitrary and again, as with p-values, we do not have a clear cut-off between evidence in favor and evidence against. Whenever values like this are quoted, there is an implicit assumption that there is a universal scale with which evidence can be measured. Currently, there are no developments that support the existence of such a scale; see the discussion of the Bayes factor in Section 3.4. For estimation, it is natural to quote $\hat{\theta}$ and report a likelihood region such as $\{\theta : L(\theta \,|\, I)/L(\hat{\theta} \,|\, I) \geq 1/r\}$ for some r as an assessment of the accuracy of $\hat{\theta}$.
Another serious objection to pure likelihood theory, and to the use of the likelihood function to determine inferences more generally, arises when nuisance parameters are present, namely, we want to make inference about $\psi = \Psi(\theta)$, where $\Psi$ is not 1-1. In general, there does not appear to be a way to define a likelihood function for the parameter of interest $\psi$, based on the inference base I only, that is consistent with the motivation for using likelihood in the first place, namely, that $L(\theta \,|\, I)$ is proportional to the probability of the observed data as a function of $\theta$. It is common in such contexts to use the profile likelihood $L_{\text{prof}}(\psi \,|\, I) = \sup_{\{\theta : \Psi(\theta) = \psi\}} L(\theta \,|\, I)$ as a likelihood function. There are many examples that show that a profile likelihood is not a likelihood, and so such usage is inconsistent with the basic idea underlying likelihood methods. An alternative to the profile likelihood is the integrated likelihood $L_{\text{int}}(\psi \,|\, I) = \int_{\Psi^{-1}\{\psi\}} L(\theta \,|\, I)\, \Pi_\psi(d\theta)$, where the $\Pi_\psi$ are probability measures on the pre-images $\Psi^{-1}\{\psi\}$; see [24]. The integrated likelihood is a likelihood with respect to the sampling model $\{m_\psi : \psi \in \Psi(\Theta)\}$, where $m_\psi(x) = \int_{\Psi^{-1}\{\psi\}} f_\theta(x)\, \Pi_\psi(d\theta)$, but this requires adding the $\Pi_\psi$ to the inference base.
Example 4 (location normal). Suppose $x = (x_1, \ldots, x_n)$ is a sample from an $N(\mu, 1)$ distribution and we want to estimate $\psi = |\mu|$, so the nuisance parameter is given by the sign of $\mu$. The likelihood function is given by $L(\mu \,|\, x) = \exp\{-n(\bar{x} - \mu)^2/2\}$. (1) Since $\mu \in \{-\psi, \psi\}$, the quantity $(\bar{x} - \mu)^2$ is minimized by $\mu = \psi$ when $\bar{x} \geq 0$ and by $\mu = -\psi$ when $\bar{x} < 0$. Therefore, the profile likelihood for ψ is $L_{\text{prof}}(\psi \,|\, x) = \exp\{-n(|\bar{x}| - \psi)^2/2\}$, and this depends on the data only through $|\bar{x}|$. To see that $L_{\text{prof}}$ is not a likelihood function, observe that the density of $|\bar{x}|$, which is that of a folded normal distribution, is not proportional to $L_{\text{prof}}$. For the integrated likelihood, we need to choose the probability measures $\Pi_\psi$ on the pre-images $\{-\psi, \psi\}$, say $\Pi_\psi(\{\psi\}) = \Pi_\psi(\{-\psi\}) = 1/2$, so $L_{\text{int}}(\psi \,|\, x) = \frac{1}{2}[\exp\{-n(\bar{x} - \psi)^2/2\} + \exp\{-n(\bar{x} + \psi)^2/2\}]$. (2) Note that (2) is not obtained by integrating (1), which is appropriate because $|\bar{x}|$ is not a minimal sufficient statistic.
The profile and integrated likelihoods are to be used just as the original likelihood is used for inferences about θ, even though the profile likelihood is not a likelihood. The profile likelihood leads immediately to the estimate $|\bar{x}|$ for ψ, while (2) needs to be maximized numerically. Although the functional forms look quite different, the profile and integrated likelihoods here give almost identical numerical results. Figure 1 is a plot with data generated from an $N(2,1)$ distribution, and the two estimates of ψ essentially coincide. There is some disagreement in the left tail, so some of the reported intervals will differ, but as n increases, such differences disappear. This agreement is also quite robust against the choice of $\Pi_\psi$. Another issue arises with the profile likelihood, namely, the outcome differs in contexts which naturally should lead to equivalent results.
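A numerical sketch of this comparison, assuming the model $N(\mu, 1)$ with $\psi = |\mu|$ (the sample size, seed, and grid below are illustrative): the profile and integrated likelihood maximizers essentially coincide.

```python
import math
import random

rng = random.Random(0)
n = 10
x = [rng.gauss(2.0, 1.0) for _ in range(n)]   # data generated from N(2, 1)
xbar = sum(x) / n

def profile(psi):
    """Profile likelihood: exp{-n(|xbar| - psi)^2 / 2}."""
    return math.exp(-0.5 * n * (abs(xbar) - psi) ** 2)

def integrated(psi):
    """Integrated likelihood: average of the likelihood over the pre-image {-psi, psi}."""
    return 0.5 * (math.exp(-0.5 * n * (xbar - psi) ** 2)
                  + math.exp(-0.5 * n * (xbar + psi) ** 2))

grid = [i / 200.0 for i in range(801)]        # psi in [0, 4], spacing 0.005
psi_prof = max(grid, key=profile)
psi_int = max(grid, key=integrated)
print(round(psi_prof, 3), round(psi_int, 3), round(abs(xbar), 3))
```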
Example 5 (prediction with scale normal). Suppose $x = (x_1, \ldots, x_n)$ is an independent and identically distributed sample from a distribution in the family $\{N(0, \sigma^2) : \sigma^2 > 0\}$. Based on the observed data, the likelihood equals $L(\sigma^2 \,|\, x) = (\sigma^2)^{-n/2} \exp\{-||x||^2/2\sigma^2\}$, where $||x||^2 = \sum_{i=1}^n x_i^2$, and the MLE of $\sigma^2$ is $||x||^2/n$. Now suppose, however, that the interest is in predicting k future (or occurred but concealed) independent values $y = (y_1, \ldots, y_k)$. Perhaps the logical predictive likelihood to use is $L(\sigma^2, y \,|\, x) = (\sigma^2)^{-(n+k)/2} \exp\{-(||x||^2 + ||y||^2)/2\sigma^2\}$, where $||y||^2 = \sum_{j=1}^k y_j^2$. Profiling out $\sigma^2$ leads to the profile MLE of y equaling 0 for all x, as might be expected. Profiling y out of $L(\sigma^2, y \,|\, x)$, however, leads to the profile likelihood $(\sigma^2)^{-(n+k)/2} \exp\{-||x||^2/2\sigma^2\}$ for $\sigma^2$, and so the profile MLE of $\sigma^2$ equals $||x||^2/(n+k)$. When interest is in $\sigma^2$, the integrated likelihood (integrating y out using its model density, which integrates to 1) produces $L(\sigma^2 \,|\, x)$, with MLE $||x||^2/n$. When interest is in y, then integrating out $\sigma^2$ after placing a probability distribution on it produces the MLE 0 for y, although the form of the integrated likelihood will depend on the particular distribution chosen.
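A small check of these computations (hypothetical data and seed): profiling the future values out deflates the MLE of σ² by the factor n/(n + k), while integrating them out leaves the usual MLE.

```python
import random

rng = random.Random(3)
n, k = 10, 5
x = [rng.gauss(0.0, 2.0) for _ in range(n)]   # sample from N(0, sigma^2), sigma = 2
ss = sum(xi * xi for xi in x)                 # ||x||^2

# Profiling y out of L(sigma^2, y | x) sets y = 0 but keeps the exponent n + k,
# so the profile MLE of sigma^2 is ||x||^2/(n + k): it shrinks as k grows.
profile_mle = ss / (n + k)

# Integrating y out against its model density (which integrates to 1) returns
# the original likelihood L(sigma^2 | x), with MLE ||x||^2/n, independent of k.
integrated_mle = ss / n

print(round(profile_mle, 3), round(integrated_mle, 3))
```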
The profile and integrated likelihoods certainly do not always lead to roughly equivalent results as in Example 4, particularly as the dimension of the nuisance parameters rises. While the profile likelihood has the advantage of not having to specify the measures $\Pi_\psi$, it suffers from a lack of a complete justification, at least in terms of the likelihood principle, and, as Example 5 demonstrates, it can produce unnatural results. In Section 3.4, it is shown that the integrated likelihood arises from a very natural principle characterizing evidence.
The discussion here has been mostly about pure likelihood theory, where repeated sampling does not play a role. The likelihood function is also an important aspect of frequentist inference. Such usage, however, does not lead to a resolution of the evidence problem, namely, providing a proper characterization of statistical evidence. In fact, frequentist likelihood methods use the p-value for H. There is no question that the likelihood contains the evidence in the data about $\theta$, but questions remain as to how to characterize that evidence, whether in favor of or against a particular value, and also how to express the strength of the evidence. More discussion of the likelihood principle can be found in [25].
3.4. Bayes Factors and Bayesian Measures of Statistical Evidence
Bayesian inference adds another ingredient to the inference base, namely, the prior probability measure $\Pi$, with density $\pi$, on $\Theta$, so now $I = (\{f_\theta : \theta \in \Theta\}, x, \Pi)$. The prior represents beliefs about the true value of $\theta$. Note that $I$ is equivalent to a joint distribution for $(\theta, x)$ with density $\pi(\theta) f_\theta(x)$. Once the data x have been observed, a basic principle of inference is then invoked.
Principle of Conditional Probability: for probability model $(\Omega, \mathcal{F}, P)$, if $C$ is observed to be true, where $P(C) > 0$, then the initial belief that $A$ is true, as given by $P(A)$, is replaced by the conditional probability $P(A \,|\, C) = P(A \cap C)/P(C)$.
So, we replace the prior $\pi(\theta)$ by the posterior $\pi(\theta \,|\, x) = \pi(\theta) f_\theta(x)/m(x)$, where $m(x) = \int_\Theta \pi(\theta) f_\theta(x)\, d\theta$ is the prior predictive density of the data, to represent beliefs about $\theta$. While at times the posterior is taken to represent the evidence about $\theta$, this confounds two distinct concepts, namely, beliefs and evidence. It is clear, however, that the evidence in the data is what has changed our beliefs, and it is this change that leads to the proper characterization of statistical evidence through the following principle.
Principle of Evidence: for probability model $(\Omega, \mathcal{F}, P)$, if $C$ is observed to be true, where $P(C) > 0$, then there is evidence in favor of $A$ being true if $P(A \,|\, C) > P(A)$, evidence against $A$ being true if $P(A \,|\, C) < P(A)$, and no evidence either way if $P(A \,|\, C) = P(A)$.
Therefore, in the Bayesian context, we compare the posterior density $\pi(\theta \,|\, x)$ with the prior density $\pi(\theta)$ to determine whether or not there is evidence one way or the other concerning $\theta$ being the true value. This seems immediate in the context where $\Pi$ is a discrete probability measure. It is also relevant in the continuous case, where densities are defined via limits, as in $\pi(\theta) = \lim_{\epsilon \to 0} \Pi(N_\epsilon(\theta))/\nu(N_\epsilon(\theta))$, where $N_\epsilon(\theta)$ is a sequence of neighborhoods of $\theta$ converging nicely to $\{\theta\}$ as $\epsilon \to 0$, and $\Pi$ is absolutely continuous with respect to support measure $\nu$. This leads to the usual expressions for densities; see [26] (Appendix A) for details.
The Bayesian formulation has a very satisfying consistency property. If interest is in the parameter $\psi = \Psi(\theta)$, then the nuisance parameters can be integrated out using the conditional prior $\Pi(\cdot \,|\, \Psi(\theta) = \psi)$, and the inference base $(\{f_\theta : \theta \in \Theta\}, x, \Pi)$ is replaced by $(\{m_\psi : \psi \in \Psi(\Theta)\}, x, \Pi_\Psi)$, where $m_\psi(x) = \int_{\Psi^{-1}\{\psi\}} f_\theta(x)\, \Pi(d\theta \,|\, \Psi(\theta) = \psi)$ and $\Pi_\Psi$ is the marginal prior on $\psi$, with density $\pi_\Psi$. Applying the principle of evidence here means to compare the posterior density $\pi_\Psi(\psi \,|\, x)$ to the prior density $\pi_\Psi(\psi)$. So, there is evidence in favor of $\psi$ being the true value whenever $\pi_\Psi(\psi \,|\, x) > \pi_\Psi(\psi)$, etc.
In many applications, we need more than the simple characterization of evidence that the principle of evidence gives us. A valid measure of evidence is then any real-valued function of $(\pi_\Psi(\psi \,|\, x), \pi_\Psi(\psi))$ that satisfies the existence of a cut-off c such that the function taking a value greater than c corresponds to evidence in favor, etc. One very simple function satisfying this is the relative belief ratio $RB_\Psi(\psi \,|\, x) = \pi_\Psi(\psi \,|\, x)/\pi_\Psi(\psi) = m_\psi(x)/m(x)$, where the second equality follows from (3) and is sometimes referred to as the Savage–Dickey ratio; see [27]. Using $RB_\Psi$, the values of $\psi$ are now totally ordered with respect to evidence, as when $RB_\Psi(\psi_1 \,|\, x) > RB_\Psi(\psi_2 \,|\, x)$, there is more evidence in favor of $\psi_1$ than for $\psi_2$, etc.
This suggests an immediate answer to E, namely, record the relative belief estimate $\psi(x) = \arg\sup_\psi RB_\Psi(\psi \,|\, x)$, as this value has the maximum evidence in its favor. Also, to assess the accuracy of $\psi(x)$, record the plausible region $Pl_\Psi(x) = \{\psi : RB_\Psi(\psi \,|\, x) > 1\}$, the set of values with evidence in their favor, together with its posterior content $\Pi_\Psi(Pl_\Psi(x) \,|\, x)$, as this measures the belief that the true value is in $Pl_\Psi(x)$. Note that while $\psi(x)$ depends on the choice of the relative belief ratio to measure evidence, the plausible region only depends on the principle of evidence. This suggests that any other valid measure of evidence can be used instead to determine an estimate, as it will lie in $Pl_\Psi(x)$. A $\gamma$-credible region for $\psi$ can also be quoted based on any valid measure of evidence provided $\gamma \leq \Pi_\Psi(Pl_\Psi(x) \,|\, x)$, as, otherwise, the region would contain a value of $\psi$ for which there is evidence against being the true value. For example, a relative belief $\gamma$-credible region takes the form $C_\gamma(x) = \{\psi : RB_\Psi(\psi \,|\, x) \geq c_\gamma(x)\}$, where $c_\gamma(x) = \sup\{c : \Pi_\Psi(\{\psi : RB_\Psi(\psi \,|\, x) \geq c\} \,|\, x) \geq \gamma\}$.
For H, the value $RB_\Psi(\psi_0 \,|\, x)$ tells us immediately whether there is evidence in favor of or against $H_0: \Psi(\theta) = \psi_0$. To measure the strength of the evidence concerning $H_0$, there is the posterior probability $\Pi_\Psi(\{\psi_0\} \,|\, x)$, as this measures our belief in what the evidence says. If $RB_\Psi(\psi_0 \,|\, x) > 1$ and $\Pi_\Psi(\{\psi_0\} \,|\, x)$ is large, then there is strong evidence that $H_0$ is true, while when $RB_\Psi(\psi_0 \,|\, x) < 1$ and $\Pi_\Psi(\{\psi_0\} \,|\, x)$ is small, then there is strong evidence that $H_0$ is false. Often, however, $\Pi_\Psi(\{\psi_0\} \,|\, x)$ will be small, even 0 in the continuous case, so it makes more sense to measure the strength of the evidence in such a case by $\Pi_\Psi(\{\psi : RB_\Psi(\psi \,|\, x) \leq RB_\Psi(\psi_0 \,|\, x)\} \,|\, x)$, as when $RB_\Psi(\psi_0 \,|\, x) > 1$ and this probability is large, then there is small belief that the true value of $\psi$ has more evidence in its favor than $\psi_0$, etc.
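A minimal discrete sketch of these quantities (the model and prior here are hypothetical numbers, not from the paper):

```python
# Three parameter values, a prior, and sampling probabilities for one
# observation x in {0, 1}.
prior = {1: 0.2, 2: 0.5, 3: 0.3}
f = {1: {0: 0.9, 1: 0.1}, 2: {0: 0.5, 1: 0.5}, 3: {0: 0.2, 1: 0.8}}

x = 1
m = sum(prior[t] * f[t][x] for t in prior)                # prior predictive of x
posterior = {t: prior[t] * f[t][x] / m for t in prior}
rb = {t: posterior[t] / prior[t] for t in prior}          # relative belief ratios

estimate = max(rb, key=rb.get)                            # maximizes the evidence
plausible = [t for t in rb if rb[t] > 1]                  # values with evidence in favor
content = sum(posterior[t] for t in plausible)            # belief the true value is there
# strength of the evidence concerning psi0 = 2: posterior probability of the
# values with no more evidence in their favor than psi0
strength = sum(posterior[t] for t in rb if rb[t] <= rb[2])
print(estimate, plausible, round(content, 3), round(strength, 3))
```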
Recalling the discussion in Section 3.1 about the difference $\delta$ that matters, it is relatively easy to take this into account in this context, at least when $\psi$ is real valued. For this, we consider a grid of values $\psi_0 \pm i\delta$, for $i = 0, 1, 2, \ldots$, separated by $\delta$, and then conduct the inferences using the relative belief ratios of the intervals $[\psi_0 + (i - 1/2)\delta, \psi_0 + (i + 1/2)\delta)$. In effect, $H_0: \Psi(\theta) = \psi_0$ is now $H_0: |\Psi(\theta) - \psi_0| \leq \delta/2$.
Example 6 (location normal). Suppose that we add the prior $\mu \sim N(\mu_0, \tau_0^2)$ to form a Bayesian inference base. The posterior distribution of $\mu$ is then $N((\mu_0/\tau_0^2 + n\bar{x})(1/\tau_0^2 + n)^{-1}, (1/\tau_0^2 + n)^{-1})$. Suppose our interest is inference about $\psi = |\mu|$. Figure 2 is a plot of the relative belief ratio $RB_\Psi(\psi \,|\, x)$ based on the data in Example 4 (generated from an $N(2,1)$ distribution) and using the prior with $\mu_0 = 0$. From (4), it is seen that maximizing $RB_\Psi(\psi \,|\, x)$ is equivalent to maximizing the integrated likelihood, and in this case, the prior is such that each point of the pre-image $\{-\psi, \psi\}$ has conditional prior probability 0.5, so indeed the relative belief estimate is the maximizer of the integrated likelihood (2). To assess the accuracy of the estimate, we record the plausible interval together with its posterior content. There is evidence in favor of $\psi = 2$, since $RB_\Psi(2 \,|\, x) > 1$, but the strength indicates only moderate evidence in favor of 2 being the true value of ψ. Of course, these results improve with sample size n: for larger n, the plausible interval shortens considerably, and its posterior content substantially increases, but the strength of the evidence in favor of $\psi = 2$ does not increase by much. It is important to note that, because we are employing a discretization with spacing $\delta$, and since the posterior is inevitably virtually completely concentrated in an interval of length $\delta$ about the true value as $n \to \infty$, the strength converges to 1 at any of the values in this interval and to 0 at values outside of it.
There are a number of optimal properties for the relative belief inferences, as discussed in [26]. While this is perhaps the first attempt to build a theory of inference based on the principle of evidence, there is considerable literature on this idea in the philosophy of science, where it underlies what is known as confirmation theory; see [28]. For example, the philosopher Popper in [29] (Appendix ix) writes
If we are asked to give a criterion of the fact that the evidence y supports or corroborates a statement x, the most obvious reply is: that y increases the probability of x.
One issue with philosophers’ discussions is that these are never cast in a statistical context. Many of the anomalies raised in those discussions, such as Hempel’s Raven Paradox, can be resolved when formulated as statistical problems. Also, the relative belief ratio itself has appeared elsewhere, although under different names, as a natural measure of evidence.
There are indeed other valid measures of evidence besides the relative belief ratio; for example, the Bayes factor originated with Jeffreys, as in [30]; see [6,31] for discussion. For probability model $(\Omega, \mathcal{F}, P)$ with $0 < P(A) < 1$ and $P(C) > 0$, the Bayes factor in favor of A being true, after observing that C is true, is given by $BF(A \,|\, C) = \frac{P(A \,|\, C)/P(A^c \,|\, C)}{P(A)/P(A^c)}$, the ratio of the posterior odds in favor of A to the prior odds in favor of A. It is immediate that $BF(A \,|\, C)$ is a valid measure of evidence because $BF(A \,|\, C) > 1$ if and only if $P(A \,|\, C) > P(A)$. Also, $BF(A \,|\, C) = RB(A \,|\, C)/RB(A^c \,|\, C)$, and so, because $RB(A \,|\, C) > 1$ if and only if $RB(A^c \,|\, C) < 1$, this implies $BF(A \,|\, C) \geq RB(A \,|\, C)$ when there is evidence in favor and $BF(A \,|\, C) \leq RB(A \,|\, C)$ when there is evidence against. These inequalities are important because it is sometimes asserted that the Bayes factor is not only a measure of evidence but also that its value is a measure of the strength of that evidence.
Table 1 gives a scale, due to Jeffreys (see [32] (Appendix B)), which supposedly can be used to assess the strength of evidence as given by the Bayes factor. Again, this is an attempt to establish a universal scale on which evidence can be measured, and currently no grounds exist for claiming that such a scale exists. If we consider that the strength of the evidence is to be measured by how strongly we believe what the evidence says, then simple examples can be constructed to show that such a scale is inappropriate. Effectively, the strength of the evidence is context dependent and needs to be calibrated.
Example 7. Suppose Ω is a finite set with $|\Omega| = m$. A value ω is generated uniformly from Ω and partially concealed, but it is noted that $\omega \in C$, where $C \subset \Omega$. It is desired to know if $\omega \in A$, where $A \subset \Omega$ and $A \cap C \neq \emptyset$. Then, we have that $BF(A \,|\, C) = \frac{P(A \,|\, C)/(1 - P(A \,|\, C))}{P(A)/(1 - P(A))}$, and this is large whenever $P(A \,|\, C) = |A \cap C|/|C|$ is large relative to $P(A) = |A|/m$. The posterior belief in what the evidence is saying is, however, $P(A \,|\, C)$, which can be very small, say when $|A \cap C|/|C|$ is small. But, if $P(A)$ is small enough relative to $P(A \,|\, C)$, then $BF(A \,|\, C)$ is well into the range where Jeffreys' scale says it is decisive evidence in favor of A. Clearly, there is only very weak evidence in favor of A being true here. Also, note that the posterior probability by itself does not indicate evidence in favor, although the observation that C is true must be evidence in favor of A because $P(A \,|\, C) > P(A)$. The example can obviously be modified to show that any such scale is not appropriate.
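A concrete instance of this phenomenon with hypothetical numbers: a uniform draw from a large finite set where the Bayes factor is huge while the posterior probability of A stays tiny.

```python
m = 10**6              # |Omega|
card_A = 2             # |A|, so the prior P(A) = 2e-6
card_C = 1000          # |C|, the observed event
card_A_and_C = 1       # |A ∩ C|

p_A = card_A / m
post_A = card_A_and_C / card_C     # P(A | C) = 0.001: weak belief that A is true
bf = (post_A / (1 - post_A)) / (p_A / (1 - p_A))
print(round(bf), post_A)           # enormous Bayes factor, negligible posterior
```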
There is another issue with current usage of the Bayes factor that needs to be addressed. This arises when interest is in assessing the evidence for or against $H_0: \Psi(\theta) = \psi_0$ when $\Pi_\Psi(\{\psi_0\}) = 0$, as in the continuous case. Clearly, the Bayes factor is not defined in such a context. An apparent solution is provided by choosing a prior of the form $p\,\Pi_0 + (1 - p)\,\Pi$, where $p$ is a prior probability assigned to $H_0$, and $\Pi_0$ is a prior probability measure on the set $\Psi^{-1}\{\psi_0\}$. Since the mixture prior assigns probability $p > 0$ to $H_0$, the Bayes factor is now defined. Of some concern now is how $\Pi_0$ should be chosen. A natural choice is to take $\Pi_0 = \Pi(\cdot \,|\, \Psi(\theta) = \psi_0)$, as otherwise there is a contradiction between beliefs as expressed by $\Pi_0$ and $\Pi$. It follows very simply, however, that in the continuous case, when $\Pi_0 = \Pi(\cdot \,|\, \Psi(\theta) = \psi_0)$, then the Bayes factor of $H_0$ equals the relative belief ratio $RB_\Psi(\psi_0 \,|\, x)$ obtained from the original prior $\Pi$. Moreover, if instead of using such a mixture prior, we define the Bayes factor via $BF(\psi_0 \,|\, x) = \lim_{\epsilon \to 0} BF(N_\epsilon(\psi_0) \,|\, x)$, where $N_\epsilon(\psi_0)$ is a sequence of neighborhoods of $\psi_0$ converging nicely to $\{\psi_0\}$, then, under weak conditions ($\pi_\Psi$ is positive and continuous at $\psi_0$), $BF(\psi_0 \,|\, x) = RB_\Psi(\psi_0 \,|\, x)$. So, there is no need to introduce the mixture prior to get a valid measure of evidence. Furthermore, now the Bayes factor is available for E as well as H, as surely any proper characterization of statistical evidence must be, and the relevant posterior for $\psi$ is $\pi_\Psi(\cdot \,|\, x)$, as obtained from $\Pi$, rather than from the mixture prior.
Other Bayesian measures of evidence have been proposed. For example, it is proposed in [33] to measure the evidence concerning $H_0: \Psi(\theta) = \psi_0$ by computing the supremum of the posterior density over $H_0$ and then using the posterior probability of the set of values whose posterior density exceeds this supremum. If this tail probability is large, this is evidence against $H_0$, while, if it is small, it is evidence in favor of $H_0$. Note that the resulting measure is sometimes referred to as an e-value, but this is different from the e-values discussed in Section 3.1. Further discussion and development of this concept can be found in [34]. Clearly, this is building on the idea that underlies the p-value, namely, providing a measure that locates a point in a distribution and using this to assess evidence. It does not, however, conform to the principle of evidence.
Bias and Error Probabilities
One criticism that is made of Bayesian inference is that there are no measures of the reliability of the inferences, which is an inherent part of frequentism. It is natural to add such measures, however, to assess whether or not the specified model and prior could potentially lead to misleading inferences. For example, suppose it could be shown that evidence in favor of $H_0$ would be obtained with prior probability near 1, for a data set of given size. It seems obvious that, if we did collect this amount of data and obtained evidence in favor of $H_0$, then this fact would undermine our confidence in the reliability of the reported inference.
Example 8 (Jeffreys–Lindley paradox). Suppose we have the location normal model, with a sample of size n from an N(μ, σ₀²) distribution where σ₀² is known, and a value of the sample mean is obtained which leads to a very small p-value for H: μ = μ₀, so there would appear to be definitive evidence that H is false. Suppose the prior μ ∼ N(μ₀, τ₀²) is used, and the analyst chooses τ₀² to be very large to reflect the fact that little is known about the true value of μ. It can be shown (see [26]), however, that the Bayes factor in favor of H converges to ∞ as τ₀² → ∞. Therefore, for a very diffuse prior, evidence in favor of H will be obtained, and so the frequentist and the Bayesian will disagree. Note that the Bayes factor equals the relative belief ratio in this context. A partial resolution of this contradiction is obtained by noting that the strength of this evidence, as measured by the relevant posterior probability, converges to 0 as τ₀² → ∞, and so the Bayesian measure of evidence is only providing very weak evidence in favor when the p-value is small. This anomaly occurs even when the true value of μ is indeed far away from μ₀, and so the fault here does not lie with the p-value. Prior error probabilities associated with Bayesian measures of evidence can, however, be computed, and these lead to a general resolution of the apparent paradox.

There are two error probabilities for H that we refer to as the bias against H and the bias in favor of H, given by the following two prior probabilities: bias against = M(RB(ψ₀ | x) ≤ 1 | ψ₀) and bias in favor = M(RB(ψ₀ | x) ≥ 1 | ψ∗), for a value ψ∗ with d(ψ∗, ψ₀) ≥ δ, where M(· | ψ) denotes the conditional prior predictive distribution of the data given Ψ(θ) = ψ.
Both of these biases are independent of the valid measure of evidence used, as they only depend on the principle of evidence as applied to the model and prior chosen. The bias against is the prior probability of not getting evidence in favor of H when it is true and plays a role similar to that of the type I error. The bias in favor is the prior probability of not obtaining evidence against H when it is meaningfully false and plays a role similar to that of the type II error. Here, δ is the deviation from ψ₀ which is of scientific interest, as determined by some measure of distance d on the space of values of Ψ. As discussed in [26], these biases cannot be controlled by, for example, the choice of prior, as a prior that reduces the bias against causes the bias in favor to increase and vice versa. The proper control of these quantities is through sample size, as both converge to 0 as the amount of data increases.
Example 9 (Jeffreys–Lindley paradox). In this case, the bias against H converges to 0 and the bias in favor of H converges to 1 as τ₀² → ∞. This leads to a resolution of the issue: do not choose the prior to be arbitrarily diffuse to reflect noninformativeness; rather, choose a prior that is sufficiently diffuse to cover the interval where it is known μ must lie, e.g., the interval where the measurements lie, and then choose n so that both biases are suitably small. While the Jeffreys–Lindley paradox arises because diffuse priors induce bias in favor, an overly concentrated prior induces bias against, but again this bias can be controlled via the amount of data collected.
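The behavior described in Examples 8 and 9 can be illustrated by simulating the two biases in the location normal setting, with evidence assessed via the relative belief ratio. All the numbers below (sample size, the meaningful deviation delta = 1, and the two prior scales) are illustrative choices, not values from the text.

```python
import math, random

random.seed(2)

def rb_at_zero(xbar, n, tau0, sigma0=1.0):
    """Relative belief ratio of mu = 0: posterior over prior density at 0,
    for an N(mu, sigma0^2) sample of size n and an N(0, tau0^2) prior."""
    post_var = 1 / (1 / tau0**2 + n / sigma0**2)
    post_mean = post_var * n * xbar / sigma0**2
    post_pdf = math.exp(-0.5 * post_mean**2 / post_var) / math.sqrt(2 * math.pi * post_var)
    prior_pdf = 1 / math.sqrt(2 * math.pi * tau0**2)
    return post_pdf / prior_pdf

def biases(n, tau0, delta=1.0, reps=20_000):
    # Bias against: prob. of not getting evidence in favor (RB <= 1) when mu = 0 is true.
    against = sum(rb_at_zero(random.gauss(0, 1 / math.sqrt(n)), n, tau0) <= 1
                  for _ in range(reps)) / reps
    # Bias in favor: prob. of not getting evidence against (RB >= 1) when mu = delta.
    favor = sum(rb_at_zero(random.gauss(delta, 1 / math.sqrt(n)), n, tau0) >= 1
                for _ in range(reps)) / reps
    return against, favor

b_tight = biases(n=10, tau0=1.0)      # moderately informative prior
b_diffuse = biases(n=10, tau0=100.0)  # very diffuse prior
print(b_tight, b_diffuse)
```

In this run the diffuse prior drives the bias against toward 0 while inflating the bias in favor, which is the trade-off the text describes: the prior cannot reduce both biases at once, and only a larger sample size can.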
There are also two biases associated with E, obtained by averaging the biases for H using the prior. These can also be expressed in terms of coverage probabilities of the plausible region Pl(x) = {ψ : RB(ψ | x) > 1}, the set of values for which evidence in favor has been obtained. The biases for E are the prior probability that the true value is not in Pl(x) and the prior probability that a meaningfully false value is not in the implausible region Im(x) = {ψ : RB(ψ | x) < 1}, namely, the set of values for which evidence against has been obtained. So, the bias against for E is the prior probability that the true value is not in the plausible region Pl(x), and so its complement is the prior coverage probability (Bayesian confidence) of Pl(x). It is of interest that there is typically a value ψ∗ such that this Bayesian confidence gives a lower bound on the frequentist confidence of Pl(x) at ψ∗, with respect to the model obtained from the original model by integrating out the nuisance parameters. In both cases, the Bayesian confidences are average frequentist confidences with respect to the original model. The bias in favor for E is the prior probability that a meaningfully false value is not in the implausible region. Again, both of these biases do not depend on the valid measure of statistical evidence used, and both converge to 0 with increasing sample size.
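The coverage interpretation can be illustrated by simulation: draw the true value from the prior, generate data under it, and record how often the true value lands in the plausible region, i.e., receives evidence in favor. The setup below (normal model and prior, sample size) is an illustrative choice, not taken from the text.

```python
import math, random

random.seed(3)

# Illustrative setup: mu ~ N(0, tau0^2) prior, xbar | mu ~ N(mu, sigma0^2 / n).
tau0, sigma0, n, reps = 1.0, 1.0, 20, 50_000

def rb(mu, xbar):
    """Relative belief ratio of mu: posterior density over prior density."""
    post_var = 1 / (1 / tau0**2 + n / sigma0**2)
    post_mean = post_var * n * xbar / sigma0**2
    post_pdf = math.exp(-0.5 * (mu - post_mean)**2 / post_var) / math.sqrt(2 * math.pi * post_var)
    prior_pdf = math.exp(-0.5 * (mu / tau0)**2) / math.sqrt(2 * math.pi * tau0**2)
    return post_pdf / prior_pdf

# Prior coverage (Bayesian confidence) of the plausible region
# Pl(x) = {mu : RB(mu | x) > 1}: how often the true value gets evidence in favor.
hits = 0
for _ in range(reps):
    mu = random.gauss(0, tau0)                       # true value drawn from the prior
    xbar = random.gauss(mu, sigma0 / math.sqrt(n))   # data generated under mu
    hits += rb(mu, xbar) > 1
coverage = hits / reps
print(coverage)
```

One minus this estimate is the bias against for E; increasing n in the sketch pushes the coverage toward 1, matching the claim that both biases for E vanish with growing sample size.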
With the addition of the biases, a link is established between Bayesianism and frequentism: the inferences are Bayesian, but the reliability of the inferences is assessed via frequentist criteria. More discussion on bias in this context can be found in [35].