Statistical evidence is the core concept of this approach. It is now examined how this has been addressed by various means and to what extent these have been successful. For this, we need some notation. The basic ingredient of every statistical theory is the sampling model $\{f_\theta : \theta \in \Theta\}$, a collection of probability density functions on some sample space $\mathcal{X}$ with respect to some support measure $\mu$. The variable $\theta$ is the model parameter, which takes values in the parameter space $\Theta$, and it indexes the probability distributions in the model. The idea is that one of these distributions in the model produced the data x, and this is represented as $x \sim f_\theta$. The object of interest is represented as a function $\psi = \Psi(\theta)$, where $\psi$ can be obtained from the distribution $f_\theta$. The questions E and H can be answered categorically if $\theta$ becomes known. Note that there is no reason to restrict $\Theta$ to be finite dimensional, and so nonparametric models are also covered by this formulation.
A simple example illustrates these concepts and will be used throughout this discussion.
3.1. p-Values, E-Values and Confidence Regions
The p-value is the most common statistical approach to measuring evidence associated with H. The p-value is associated with Fisher, see [7], although there are precursors who contributed to the idea, such as John Arbuthnot, Daniel Bernoulli, and Karl Pearson; see [8,9]. Our purpose in this section is to show that the p-value and the associated concepts of e-value and confidence region do not provide adequate measures of statistical evidence.
We suppose that there is a statistic $T$ and consider its distribution under the hypothesis $H_0: \Psi(\theta) = \psi_0$. The idea behind the p-value is to answer the question: is the observed value of $T$ something we would expect to see if $\psi_0$ is the true value of $\psi$? If $T(x)$ is a surprising value, then this is interpreted as evidence against $H_0$. A p-value measures the location of $T(x)$ in the distributions of $T$ under $H_0$, with small values of the p-value being an indication that the observed value is surprising. For example, it is quite common that large values of $T$ are in the tails (regions of low probability content) of each of the distributions of $T$ under $H_0$, and so, given the observed value $T(x)$, the tail probability $\sup_{\{\theta : \Psi(\theta) = \psi_0\}} P_\theta(T \geq T(x))$ is computed as the p-value.
Example 2 (location normal). Suppose that $x = (x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, with $\mu$ unknown and $\sigma_0^2$ known, and that it is required to assess the hypothesis $H_0: \mu = \mu_0$. For this, it is common to use the statistic $T(x) = \sqrt{n}\,|\bar{x} - \mu_0|/\sigma_0$, as this has a fixed distribution (the absolute value of a standard normal variable) under $H_0$, with the p-value given by $2(1 - \Phi(T(x)))$, where Φ is the cdf of the standard normal distribution.
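As a quick illustrative sketch (not from the paper), the p-value of Example 2 can be computed directly; the sample values below are hypothetical.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_value(xbar, mu0, sigma0, n):
    """Two-sided p-value for H0: mu = mu0 in the location normal model,
    based on T(x) = sqrt(n)|xbar - mu0|/sigma0, which is |N(0,1)| under H0."""
    t = sqrt(n) * abs(xbar - mu0) / sigma0
    return 2.0 * (1.0 - phi(t))

# hypothetical numbers: xbar = 2.5 from n = 10 observations, sigma0 = 1, mu0 = 2
print(round(p_value(2.5, 2.0, 1.0, 10), 4))
```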
Several issues arise with the use of p-values in general; for example, see Section 3 in [10] for a very thorough discussion. First, what value $\alpha$ is small enough to warrant the assertion that evidence against $H_0$ has been obtained when the p-value is less than $\alpha$? It is quite common in many fields to use the value $\alpha = 0.05$ as a cut-off, but this is not universal; in particle physics, a much smaller cut-off, corresponding to a five-standard-deviation event, is commonly employed. Recently, due to concerns with the replication of results, it has been proposed that the value $\alpha = 0.005$ be used as a standard, but there are concerns with this as expressed in [11]. The issue of the appropriate cut-off is not resolved.
It is also not the case that a p-value greater than $\alpha$ is to be interpreted as evidence in favor of $H_0$ being true. Note that, in the case that $T$ has a single continuous distribution under $H_0$, the p-value has a uniform$(0,1)$ distribution when $H_0$ is true. This implies that a p-value near 0 has the same probability of occurring as a p-value near 1. It is often claimed that a valid p-value must have this uniformity property. But consider the p-value of Example 2, where $\bar{x} \sim N(\mu, \sigma_0^2/n)$ under $\mu$, and so, as n rises, the distribution of $\bar{x}$ becomes more and more concentrated about the true value of $\mu$. For large enough n, virtually all of the distribution of $\bar{x}$ under $\mu$ will be concentrated in the interval $(\mu - \delta, \mu + \delta)$, where $\delta$ represents a deviation from $\mu$ that is of scientific interest, while a smaller deviation is not, e.g., the measurements are made to an accuracy no greater than $\delta$. Note that in any scientific problem, it is necessary to state the accuracy with which the investigator wants to know the object of interest $\psi$, as this guides the measurement process as well as the choice of sample size. The p-value ignores this fact and, when n is large enough, could record evidence against $H_0: \mu = \mu_0$ when in fact the data are telling us that $H_0$ is effectively true, namely $|\mu - \mu_0| < \delta$, and evidence in favor should be stated. This distinction between statistical significance and scientific significance has long been recognized (see [12]) and needs to be part of statistical methodology (see Example 6). This fact also underlies a recommendation, although not commonly followed, as it does not really address the problem, that the $\alpha$ cut-off should be reduced as the sample size increases. Perhaps the most important take-away from this is that a value of $T$ that lies in the tails of its distributions under $H_0$ is not necessarily evidence against $H_0$. It is commonly stated that a confidence region for $\psi$ should also be provided, but this does not tell us anything about $\delta$, which needs to be given as part of the problem.
There are many other issues that can be raised about the p-value, many of which are covered by the word p-hacking and associated with the choice of $\alpha$. For example, as discussed in [13], suppose an investigator is using a particular $\alpha$ as the cut-off for evidence against $H_0$ and, based on a data set of size $n_1$, obtains the p-value $\alpha + \epsilon$, where $\epsilon > 0$ is small. Since finding evidence against $H_0$ is often regarded as a positive, as it entails a new discovery, the investigator decides to collect an additional $n_2$ data values, computes the p-value based on the full $n_1 + n_2$ data values and obtains a p-value less than $\alpha$. But this ignores the two-stage structure of the investigation, and when this is taken into account, the probability of finding evidence against $H_0$ at level $\alpha$ when $H_0$ is true, and assuming a single distribution under $H_0$, equals $P(A_1) + P(A_1^c \cap A_2)$, where $A_1 = \{$evidence against $H_0$ found at first stage$\}$ and $A_2 = \{$evidence against $H_0$ found at second stage$\}$. If this overall probability is to be no greater than $\alpha$, then, since $P(A_1) = \alpha$ already, it must be that $P(A_1^c \cap A_2) = 0$. So, even though the investigator has done something very natural, the logic of the statistical procedure based on using p-values with a cut-off effectively prevents finding evidence against $H_0$ at the second stage.
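A small simulation (with hypothetical stage sizes and seed) illustrates the inflation: rejecting when either the stage-one p-value or the pooled p-value falls below $\alpha$ pushes the overall probability of falsely finding evidence against $H_0$ above $\alpha$.

```python
import math
import random

def pval(xs, mu0=0.0, sigma0=1.0):
    """Two-sided location normal p-value as in Example 2."""
    n = len(xs)
    t = math.sqrt(n) * abs(sum(xs) / n - mu0) / sigma0
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))

def two_stage_rejects(n1, n2, alpha, rng):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n1)]    # H0 true: mu = 0
    if pval(xs) <= alpha:
        return True                                   # A1: evidence found at stage one
    xs += [rng.gauss(0.0, 1.0) for _ in range(n2)]    # collect additional data
    return pval(xs) <= alpha                          # A2: evidence found on full data

rng = random.Random(1)
reps = 20000
rate = sum(two_stage_rejects(10, 10, 0.05, rng) for _ in range(reps)) / reps
print(rate)  # noticeably above the nominal alpha = 0.05
```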
Royall in [10] provided an excellent discussion of the deficiencies of p-values and rejection trials (p-values with a cut-off $\alpha$) when considering these as measuring evidence. Sometimes, it is recommended that the p-value itself be reported without reference to a cut-off $\alpha$, but we have seen already that a small p-value does not necessarily constitute evidence against $H_0$, and so this is not a solution.
Confidence regions are intimately connected with p-values. Suppose that for each value $\psi_0$ there is a statistic $T_{\psi_0}$ that produces a valid p-value for $H_0: \Psi(\theta) = \psi_0$, as has been described, and an $\alpha$ cut-off is used. Now, put $C_\alpha(x) = \{\psi_0 : \text{the p-value for } H_0: \Psi(\theta) = \psi_0 \text{ exceeds } \alpha\}$. Then, $P_\theta(\Psi(\theta) \in C_\alpha(x)) \geq 1 - \alpha$ for every $\theta$, and so $C_\alpha(x)$ is a $(1-\alpha)$-confidence region for $\psi$. As such, $\psi_0 \in C_\alpha(x)$ is equivalent to $H_0: \Psi(\theta) = \psi_0$ not being rejected at level $\alpha$, and so all the problematical issues with p-values as measures of evidence apply equally to confidence regions. Moreover, it is correctly stated that, when x has been observed, $1 - \alpha$ is not a lower bound on the probability that $\Psi(\theta) \in C_\alpha(x)$. Of course, that is what we want, namely, to state a probability that measures our belief that the true value of $\psi$ is in this set, and so confidence regions are commonly misinterpreted; see the discussion of bias in Section 3.4.
An alternative approach to using p-values is provided by e-values; see [14,15,16]. An e-variable for a hypothesis $H_0$ is a non-negative statistic $E$ that satisfies $E_\theta(E) \leq 1$ whenever $H_0$ is true. The observed value $E(x)$ is called an e-value, where the “e” stands for expectation to contrast it with the “p” in p-value, which stands for probability. Also, a cut-off $c > 1$ needs to be specified such that $H_0$ is rejected whenever $E(x) \geq c$. It is immediate, from Markov’s inequality, that $P_\theta(E \geq c) \leq 1/c$ whenever $H_0$ is true, and so this provides a rejection trial with cut-off $\alpha = 1/c$ for $H_0$.
Example 3 (location normal). For the context of Example 2, define $E_\lambda(x) = \exp\{\lambda \sqrt{n}(\bar{x} - \mu_0)/\sigma_0 - \lambda^2/2\}$ for any $\lambda$, and it is immediately apparent that $E_\lambda$ is an e-variable for $H_0: \mu = \mu_0$, since $\sqrt{n}(\bar{x} - \mu_0)/\sigma_0 \sim N(0,1)$ under $H_0$ and the moment generating function of a standard normal variable equals $\exp\{\lambda^2/2\}$.
Example 3 contains an example of the construction of an e-variable. Also, likelihood ratios can serve as e-variables, and there are a number of other constructions of such variables discussed in the cited literature.
Consider the situation where data are collected sequentially, $E_i$ is an e-variable for $H_0$ based on the independent data value $x_i$, and there is a stopping time $\tau$, e.g., stop when $E_1 E_2 \cdots E_n \geq c$. Also, put $M_n = E_1 E_2 \cdots E_n$. Then, whenever $H_0$ is true, we have that $E(M_n \,|\, x_1, \ldots, x_{n-1}) = M_{n-1} E(E_n) \leq M_{n-1}$, and so the process $(M_n)$ is a discrete time super-martingale. Moreover, it is clear that $M_n$ is an e-variable for $H_0$. This implies that, under conditions, $E(M_\tau) \leq 1$, and so the stopped variable $M_\tau$ is also an e-variable for $H_0$. Assuming the stopping time $\tau$ is finite with probability 1 when $H_0$ holds, then by Ville’s inequality, $P(\sup_n M_n \geq c) \leq 1/c$ under $H_0$. This implies that the problem for p-values when sampling until rejecting $H_0$ at size $\alpha = 1/c$ is avoided when using e-values.
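The following sketch (hypothetical $\lambda$, threshold, and seed) runs the e-process of Example 3 sequentially under $H_0$; the observed frequency of ever crossing the threshold c respects Ville's bound 1/c even though stopping is data-dependent.

```python
import math
import random

def crosses(c, lam, max_n, rng):
    """Track M_n = prod_i exp(lam*z_i - lam^2/2) with z_i ~ N(0,1) under H0;
    report whether the e-process ever reaches the threshold c within max_n steps."""
    m = 1.0
    for _ in range(max_n):
        z = rng.gauss(0.0, 1.0)
        m *= math.exp(lam * z - 0.5 * lam * lam)
        if m >= c:
            return True
    return False

rng = random.Random(7)
reps, c = 20000, 20.0
rate = sum(crosses(c, lam=0.5, max_n=500, rng=rng) for _ in range(reps)) / reps
print(rate)  # stays below 1/c = 0.05, as Ville's inequality guarantees
```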
While e-values have many interesting and useful properties relative to p-values, the relevant question here is whether or not these serve as measures of statistical evidence. The usage of both requires the specification of a cut-off to determine when there are grounds for rejecting $H_0$. Sometimes, this is phrased instead as “evidence against $H_0$ has been found”, but, given the arbitrariness of the choice of cut-off and the failure to properly express when evidence in favor of $H_0$ has been found, neither seems suitable as an expression of statistical evidence. One could argue that the intention behind these approaches is not to characterize statistical evidence but rather to solve the reject/accept problems of decision theory. It is the case, however, at least for p-values, that they are used as if they were proper characterizations of evidence, and this does not seem suitable for purely scientific applications.
Another issue that needs to be addressed in a problem is how to find the statistic $T$ or the e-variable $E$. A common argument is to use likelihood ratios (see Section 3.2 and Section 3.3), but this does not resolve the problems that have been raised here, and there are difficulties with the concept of likelihood that need to be addressed as well.
3.2. Birnbaum on Statistical Evidence
Alan Birnbaum devoted considerable thought to the concept of statistical evidence in the frequentist context. Birnbaum’s paper [17] contained what seemed like a startling result about the implications to be drawn from the concept, and his final paper [18] contained a proposal for a definition of statistical evidence. Also, see [19] for a full list of Birnbaum’s publications, many of which contain considerable discussion concerning statistical evidence.
In [17], Birnbaum considered various relations, as defined by statistical principles, on the set $\mathcal{I}$ of all inference bases, where an inference base is the pair $I = (\{f_\theta : \theta \in \Theta\}, x)$ consisting of a sampling model and data supposedly generated from a distribution in the model. A statistical principle R is then a relation defined on $\mathcal{I}$, namely, R is a subset of $\mathcal{I} \times \mathcal{I}$. There are three basic statistical principles that are commonly invoked as part of evidential inference, namely, the likelihood principle L, the sufficiency principle S, and the conditionality principle C. Inference bases $I_1, I_2$ satisfy L if, for some constant $c > 0$, we have $L(\theta \,|\, I_1) = c\,L(\theta \,|\, I_2)$ for every $\theta$; they satisfy S if the models have equivalent (1-1 functions of each other) minimal sufficient statistics (a sufficient statistic that is a function of every other sufficient statistic) that take equivalent values at the corresponding data values; and they satisfy C if there is an ancillary statistic a (a statistic whose distribution is independent of $\theta$) such that one inference base can be obtained from the other (or conversely) via conditioning on $a$. The basic idea is that, if two inference bases are related by one of these principles, then they contain the same statistical evidence concerning the unknown true value of $\theta$. For this idea to make sense, it must be the case that these principles form equivalence relations on $\mathcal{I}$. In [20], it is shown that L and S do form equivalence relations but C does not, and this latter result is connected with the fact that a unique maximal ancillary (an ancillary which every other ancillary is a function of) generally does not exist.
Birnbaum provided a proof in [17] of the result, known as Birnbaum’s Theorem, that if a statistician accepts the principles S and C, then they must accept L. This is highly paradoxical because frequentist statisticians generally accept both S and C but not L, as L does not permit repeated sampling. Two very different sampling models can lead to proportional likelihood functions, so repeated sampling behavior is irrelevant under L. There has been considerable discussion over the years concerning the validity of this proof, but no specific flaw has been found. A resolution of this is provided in [20], where it is shown that C is not an equivalence relation and that what Birnbaum actually proved is that the smallest equivalence relation on $\mathcal{I}$ that contains S and C is L. This substantially weakens the result, as there is no reason to accept the additional generated equivalences, and in fact it makes more sense to consider the largest equivalence relation on $\mathcal{I}$ that is contained in C. Also, as discussed in [20], an argument similar to the one found in [17] establishes that a statistician who accepts C alone must accept L. Again, however, this only means that the smallest equivalence relation on $\mathcal{I}$ that contains C is L. As shown in [21], issues concerning C can be resolved by restricting conditioning on ancillary statistics to stable ancillaries (those ancillaries such that conditioning on them retains the ancillarity of all other ancillaries and similarly have their ancillarity retained when conditioning on any other ancillary), as there always is a maximal stable ancillary. This provides a conditionality principle that is a proper characterization of statistical evidence, but it does not lead to L.
While Birnbaum did not ultimately resolve the issues concerning statistical evidence, [18] made a suggestion as a possible starting point by proposing the confidence concept in the context of comparing two hypotheses $H_0$ versus $H_1$. The confidence concept is characterized by two error probabilities, the probability of rejecting $H_0$ when it is true and the probability of accepting $H_0$ when $H_1$ is true, and then reporting the outcome with the associated interpretation. This clearly results from a confounding of the decision theoretic (as in Neyman–Pearson) approach to hypothesis testing with the evidential approach, as rejection trials similarly do. Also, this does not give a general definition of what is meant by statistical evidence, as it really only applies in the simple versus simple hypothesis testing context, and it suffers from many of the same issues as discussed concerning p-values. From reading Birnbaum’s papers, it seems he had largely despaired of ever finding a fully satisfactory characterization of statistical evidence in the frequentist context. It will be shown in Section 3.4, however, that the notion of frequentist error probabilities does play a key role in a development of the concept.
3.3. Likelihood
Likelihood is another statistical concept initiated by Fisher; see [22]. While the likelihood function plays a key role in frequentism, a theory of inference, called pure likelihood theory, based solely on the likelihood function, is developed in [10,23]. The basic axiom is thus the likelihood principle L of Section 3.2, which says that the likelihood function $L(\theta \,|\, I) = c f_\theta(x)$ from inference base $I$ completely summarizes the evidence given in I concerning the true value of $\theta$. As such, two inference bases with proportional likelihoods must give identical inferences, and so repeated sampling does not play a role. In particular, the ratio $L(\theta_1 \,|\, I)/L(\theta_2 \,|\, I)$ provides the evidence for $\theta_1$ relative to the evidence for $\theta_2$, and this is independent of the arbitrary constant $c$. While this argument for the ratios seems acceptable, a problem arises when we ask, does the value $L(\theta \,|\, I)$ provide evidence in favor of or against $\theta$ being the true value? To avoid the arbitrary constant, Royall in [10] replaced the likelihood function by the relative likelihood function given by $L(\theta \,|\, I)/L(\hat{\theta} \,|\, I)$, where $\hat{\theta}$ is the maximum likelihood estimate. The relative likelihood always takes values in $[0,1]$. Note that, if $L(\theta \,|\, I)/L(\hat{\theta} \,|\, I) \geq 1/r$, then $L(\theta' \,|\, I)/L(\theta \,|\, I) \leq r$ for all $\theta'$, and so no other value is supported by more than r times the support accorded to $\theta$. Royall then argues, based on a single urn experiment, for one benchmark value of the ratio to represent very strong evidence in favor and for a smaller benchmark to represent quite strong evidence in favor of $\theta$. Certainly, these values seem quite arbitrary and again, as with p-values, we do not have a clear cut-off between evidence in favor and evidence against. Whenever values like this are quoted, there is an implicit assumption that there is a universal scale with which evidence can be measured. Currently, there are no developments that support the existence of such a scale; see the discussion of the Bayes factor in Section 3.4. For estimation, it is natural to quote $\hat{\theta}$ and report a likelihood region such as $\{\theta : L(\theta \,|\, I)/L(\hat{\theta} \,|\, I) \geq 1/r\}$ for some r as an assessment of the accuracy of $\hat{\theta}$.
Another serious objection to pure likelihood theory, and to the use of the likelihood function to determine inferences more generally, arises when nuisance parameters are present, namely, we want to make inference about $\psi = \Psi(\theta)$, where $\Psi$ is not 1-1. In general, there does not appear to be a way to define a likelihood function for the parameter of interest $\psi$, based on the inference base I only, that is consistent with the motivation for using likelihood in the first place, namely, that $L(\theta \,|\, I)$ is proportional to the probability of the observed data as a function of $\theta$. It is common in such contexts to use the profile likelihood $L_{\text{prof}}(\psi \,|\, I) = \sup_{\{\theta : \Psi(\theta) = \psi\}} L(\theta \,|\, I)$ as a likelihood function. There are many examples that show that a profile likelihood is not a likelihood, and so such usage is inconsistent with the basic idea underlying likelihood methods. An alternative to the profile likelihood is the integrated likelihood $L_{\text{int}}(\psi \,|\, I) = \int_{\Psi^{-1}\{\psi\}} L(\theta \,|\, I)\, \Pi_\psi(d\theta)$, where the $\Pi_\psi$ are probability measures on the pre-images $\Psi^{-1}\{\psi\}$; see [24]. The integrated likelihood is a likelihood with respect to the sampling model $\{m_\psi : \psi \in \Psi(\Theta)\}$, where $m_\psi(x) = \int_{\Psi^{-1}\{\psi\}} f_\theta(x)\, \Pi_\psi(d\theta)$, but this requires adding the $\Pi_\psi$ to the inference base.
Example 4 (location normal). Suppose $x = (x_1, \ldots, x_n)$ is a sample from an $N(\mu, 1)$ distribution and we want to estimate $\psi = |\mu|$, so the nuisance parameter is given by the sign of $\mu$. The likelihood function is given by $L(\mu \,|\, x) = \exp\{-n(\bar{x} - \mu)^2/2\}$. (1) Since $\mu \in \{-\psi, \psi\}$, the quantity $(\bar{x} - \mu)^2$ is minimized by $\mu = \psi$ when $\bar{x} \geq 0$ and by $\mu = -\psi$ when $\bar{x} < 0$. Therefore, the profile likelihood for ψ is $L_{\text{prof}}(\psi \,|\, x) = \exp\{-n(|\bar{x}| - \psi)^2/2\}$, and this depends on the data only through $|\bar{x}|$. To see that $L_{\text{prof}}$ is not a likelihood function, observe that the density of $|\bar{x}|$, which is that of a folded normal distribution, is not proportional to $L_{\text{prof}}$. For the integrated likelihood, we need to choose the probability measures $\Pi_\psi$ on the pre-images $\{-\psi, \psi\}$, say $\Pi_\psi(\{\psi\}) = \Pi_\psi(\{-\psi\}) = 1/2$, so $L_{\text{int}}(\psi \,|\, x) = \frac{1}{2}[\exp\{-n(\bar{x} - \psi)^2/2\} + \exp\{-n(\bar{x} + \psi)^2/2\}]$. (2) Note that (2) is not obtained by integrating (1), which is appropriate because $|\bar{x}|$ is not a minimal sufficient statistic.
The profile and integrated likelihoods are to be used just as the original likelihood is used for inferences about θ, even though the profile likelihood is not a likelihood. The profile likelihood leads immediately to the estimate $|\bar{x}|$ for ψ, while (2) needs to be maximized numerically. Although the functional forms look quite different, the profile and integrated likelihoods here give almost identical numerical results. Figure 1 is a plot with data generated from an $N(2,1)$ distribution, and the two estimates of ψ essentially coincide. There is some disagreement in the left tail, so some of the reported intervals will differ, but as n increases, such differences disappear. This agreement is also quite robust against the choice of $\Pi_\psi$. Another issue arises with the profile likelihood, namely, the outcome differs in contexts which naturally should lead to equivalent results.
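A numerical sketch of this comparison, assuming the model $N(\mu, 1)$ with $\psi = |\mu|$ (the sample size, seed, and grid below are illustrative): the profile and integrated likelihood maximizers essentially coincide.

```python
import math
import random

rng = random.Random(0)
n = 10
x = [rng.gauss(2.0, 1.0) for _ in range(n)]   # data generated from N(2, 1)
xbar = sum(x) / n

def profile(psi):
    """Profile likelihood: exp{-n(|xbar| - psi)^2 / 2}."""
    return math.exp(-0.5 * n * (abs(xbar) - psi) ** 2)

def integrated(psi):
    """Integrated likelihood: average of the likelihood over the pre-image {-psi, psi}."""
    return 0.5 * (math.exp(-0.5 * n * (xbar - psi) ** 2)
                  + math.exp(-0.5 * n * (xbar + psi) ** 2))

grid = [i / 200.0 for i in range(801)]        # psi in [0, 4], spacing 0.005
psi_prof = max(grid, key=profile)
psi_int = max(grid, key=integrated)
print(round(psi_prof, 3), round(psi_int, 3), round(abs(xbar), 3))
```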
Example 5 (prediction with scale normal). Suppose $x = (x_1, \ldots, x_n)$ is an independent and identically distributed sample from a distribution in the family $\{N(0, \sigma^2) : \sigma^2 > 0\}$. Based on the observed data, the likelihood equals $L(\sigma^2 \,|\, x) = (\sigma^2)^{-n/2} \exp\{-||x||^2/2\sigma^2\}$, where $||x||^2 = \sum_{i=1}^n x_i^2$, and the MLE of $\sigma^2$ is $||x||^2/n$. Now suppose, however, that the interest is in predicting k future (or occurred but concealed) independent values $y = (y_1, \ldots, y_k)$. Perhaps the logical predictive likelihood to use is $L(\sigma^2, y \,|\, x) = (\sigma^2)^{-(n+k)/2} \exp\{-(||x||^2 + ||y||^2)/2\sigma^2\}$, where $||y||^2 = \sum_{j=1}^k y_j^2$. Profiling out $\sigma^2$ leads to the profile MLE of y equaling 0 for all x, as might be expected. Profiling y out of $L(\sigma^2, y \,|\, x)$, however, leads to the profile likelihood $(\sigma^2)^{-(n+k)/2} \exp\{-||x||^2/2\sigma^2\}$ for $\sigma^2$, and so the profile MLE of $\sigma^2$ equals $||x||^2/(n+k)$. When interest is in $\sigma^2$, the integrated likelihood (integrating y out using its model density, which integrates to 1) produces $L(\sigma^2 \,|\, x)$, with MLE $||x||^2/n$. When interest is in y, then integrating out $\sigma^2$ after placing a probability distribution on it produces the MLE 0 for y, although the form of the integrated likelihood will depend on the particular distribution chosen.
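A small check of these computations (hypothetical data and seed): profiling the future values out deflates the MLE of σ² by the factor n/(n + k), while integrating them out leaves the usual MLE.

```python
import random

rng = random.Random(3)
n, k = 10, 5
x = [rng.gauss(0.0, 2.0) for _ in range(n)]   # sample from N(0, sigma^2), sigma = 2
ss = sum(xi * xi for xi in x)                 # ||x||^2

# Profiling y out of L(sigma^2, y | x) sets y = 0 but keeps the exponent n + k,
# so the profile MLE of sigma^2 is ||x||^2/(n + k): it shrinks as k grows.
profile_mle = ss / (n + k)

# Integrating y out against its model density (which integrates to 1) returns
# the original likelihood L(sigma^2 | x), with MLE ||x||^2/n, independent of k.
integrated_mle = ss / n

print(round(profile_mle, 3), round(integrated_mle, 3))
```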
The profile and integrated likelihoods certainly do not always lead to roughly equivalent results as in Example 4, particularly as the dimension of the nuisance parameters rises. While the profile likelihood has the advantage of not having to specify the measures $\Pi_\psi$, it suffers from a lack of a complete justification, at least in terms of the likelihood principle, and, as Example 5 demonstrates, it can produce unnatural results. In Section 3.4, it is shown that the integrated likelihood arises from a very natural principle characterizing evidence.
The discussion here has been mostly about pure likelihood theory, where repeated sampling does not play a role. The likelihood function is also an important aspect of frequentist inference. Such usage, however, does not lead to a resolution of the evidence problem, namely, providing a proper characterization of statistical evidence. In fact, frequentist likelihood methods use the p-value for H. There is no question that the likelihood contains the evidence in the data about $\theta$, but questions remain as to how to characterize that evidence, whether in favor of or against a particular value, and also how to express the strength of the evidence. More discussion of the likelihood principle can be found in [25].
3.4. Bayes Factors and Bayesian Measures of Statistical Evidence
Bayesian inference adds another ingredient to the inference base, namely, the prior probability measure $\Pi$, with density $\pi$, on $\Theta$, so now $I = (\{f_\theta : \theta \in \Theta\}, x, \Pi)$. The prior represents beliefs about the true value of $\theta$. Note that $I$ is equivalent to a joint distribution for $(\theta, x)$ with density $\pi(\theta) f_\theta(x)$. Once the data x have been observed, a basic principle of inference is then invoked.
Principle of Conditional Probability: for probability model $(\Omega, \mathcal{F}, P)$, if $C$ is observed to be true, where $P(C) > 0$, then the initial belief that $A$ is true, as given by $P(A)$, is replaced by the conditional probability $P(A \,|\, C) = P(A \cap C)/P(C)$.
So, we replace the prior $\pi(\theta)$ by the posterior $\pi(\theta \,|\, x) = \pi(\theta) f_\theta(x)/m(x)$, where $m(x) = \int_\Theta \pi(\theta) f_\theta(x)\, d\theta$ is the prior predictive density of the data, to represent beliefs about $\theta$. While at times the posterior is taken to represent the evidence about $\theta$, this confounds two distinct concepts, namely, beliefs and evidence. It is clear, however, that the evidence in the data is what has changed our beliefs, and it is this change that leads to the proper characterization of statistical evidence through the following principle.
Principle of Evidence: for probability model $(\Omega, \mathcal{F}, P)$, if $C$ is observed to be true, where $P(C) > 0$, then there is evidence in favor of $A$ being true if $P(A \,|\, C) > P(A)$, evidence against $A$ being true if $P(A \,|\, C) < P(A)$, and no evidence either way if $P(A \,|\, C) = P(A)$.
Therefore, in the Bayesian context, we compare the posterior density $\pi(\theta \,|\, x)$ with the prior density $\pi(\theta)$ to determine whether or not there is evidence one way or the other concerning $\theta$ being the true value. This seems immediate in the context where $\Pi$ is a discrete probability measure. It is also relevant in the continuous case, where densities are defined via limits, as in $\pi(\theta) = \lim_{\epsilon \to 0} \Pi(N_\epsilon(\theta))/\nu(N_\epsilon(\theta))$, where $N_\epsilon(\theta)$ is a sequence of neighborhoods of $\theta$ converging nicely to $\{\theta\}$ as $\epsilon \to 0$, and $\Pi$ is absolutely continuous with respect to support measure $\nu$. This leads to the usual expressions for densities; see [26] (Appendix A) for details.
The Bayesian formulation has a very satisfying consistency property. If interest is in the parameter $\psi = \Psi(\theta)$, then the nuisance parameters can be integrated out using the conditional prior $\Pi(\cdot \,|\, \Psi(\theta) = \psi)$, and the inference base $(\{f_\theta : \theta \in \Theta\}, x, \Pi)$ is replaced by $(\{m_\psi : \psi \in \Psi(\Theta)\}, x, \Pi_\Psi)$, where $m_\psi(x) = \int_{\Psi^{-1}\{\psi\}} f_\theta(x)\, \Pi(d\theta \,|\, \Psi(\theta) = \psi)$ and $\Pi_\Psi$ is the marginal prior on $\psi$, with density $\pi_\Psi$. Applying the principle of evidence here means to compare the posterior density $\pi_\Psi(\psi \,|\, x)$ to the prior density $\pi_\Psi(\psi)$. So, there is evidence in favor of $\psi$ being the true value whenever $\pi_\Psi(\psi \,|\, x) > \pi_\Psi(\psi)$, etc.
In many applications, we need more than the simple characterization of evidence that the principle of evidence gives us. A valid measure of evidence is then any real-valued function of $(\pi_\Psi(\psi \,|\, x), \pi_\Psi(\psi))$ that satisfies the existence of a cut-off c such that the function taking a value greater than c corresponds to evidence in favor, etc. One very simple function satisfying this is the relative belief ratio $RB_\Psi(\psi \,|\, x) = \pi_\Psi(\psi \,|\, x)/\pi_\Psi(\psi) = m_\psi(x)/m(x)$, where the second equality follows from (3) and is sometimes referred to as the Savage–Dickey ratio; see [27]. Using $RB_\Psi$, the values of $\psi$ are now totally ordered with respect to evidence, as when $RB_\Psi(\psi_1 \,|\, x) > RB_\Psi(\psi_2 \,|\, x)$, there is more evidence in favor of $\psi_1$ than for $\psi_2$, etc.
This suggests an immediate answer to E, namely, record the relative belief estimate $\psi(x) = \arg\sup_\psi RB_\Psi(\psi \,|\, x)$, as this value has the maximum evidence in its favor. Also, to assess the accuracy of $\psi(x)$, record the plausible region $Pl_\Psi(x) = \{\psi : RB_\Psi(\psi \,|\, x) > 1\}$, the set of values with evidence in their favor, together with its posterior content $\Pi_\Psi(Pl_\Psi(x) \,|\, x)$, as this measures the belief that the true value is in $Pl_\Psi(x)$. Note that while $\psi(x)$ depends on the choice of the relative belief ratio to measure evidence, the plausible region only depends on the principle of evidence. This suggests that any other valid measure of evidence can be used instead to determine an estimate, as it will lie in $Pl_\Psi(x)$. A $\gamma$-credible region for $\psi$ can also be quoted based on any valid measure of evidence provided $\gamma \leq \Pi_\Psi(Pl_\Psi(x) \,|\, x)$, as, otherwise, the region would contain a value of $\psi$ for which there is evidence against being the true value. For example, a relative belief $\gamma$-credible region takes the form $C_\gamma(x) = \{\psi : RB_\Psi(\psi \,|\, x) \geq c_\gamma(x)\}$, where $c_\gamma(x) = \sup\{c : \Pi_\Psi(\{\psi : RB_\Psi(\psi \,|\, x) \geq c\} \,|\, x) \geq \gamma\}$.
For H, the value $RB_\Psi(\psi_0 \,|\, x)$ tells us immediately whether there is evidence in favor of or against $H_0: \Psi(\theta) = \psi_0$. To measure the strength of the evidence concerning $H_0$, there is the posterior probability $\Pi_\Psi(\{\psi_0\} \,|\, x)$, as this measures our belief in what the evidence says. If $RB_\Psi(\psi_0 \,|\, x) > 1$ and $\Pi_\Psi(\{\psi_0\} \,|\, x)$ is large, then there is strong evidence that $H_0$ is true, while when $RB_\Psi(\psi_0 \,|\, x) < 1$ and $\Pi_\Psi(\{\psi_0\} \,|\, x)$ is small, then there is strong evidence that $H_0$ is false. Often, however, $\Pi_\Psi(\{\psi_0\} \,|\, x)$ will be small, even 0 in the continuous case, so it makes more sense to measure the strength of the evidence in such a case by $\Pi_\Psi(\{\psi : RB_\Psi(\psi \,|\, x) \leq RB_\Psi(\psi_0 \,|\, x)\} \,|\, x)$, as when $RB_\Psi(\psi_0 \,|\, x) > 1$ and this probability is large, then there is small belief that the true value of $\psi$ has more evidence in its favor than $\psi_0$, etc.
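A minimal discrete sketch of these quantities (the model and prior here are hypothetical numbers, not from the paper):

```python
# Three parameter values, a prior, and sampling probabilities for one
# observation x in {0, 1}.
prior = {1: 0.2, 2: 0.5, 3: 0.3}
f = {1: {0: 0.9, 1: 0.1}, 2: {0: 0.5, 1: 0.5}, 3: {0: 0.2, 1: 0.8}}

x = 1
m = sum(prior[t] * f[t][x] for t in prior)                # prior predictive of x
posterior = {t: prior[t] * f[t][x] / m for t in prior}
rb = {t: posterior[t] / prior[t] for t in prior}          # relative belief ratios

estimate = max(rb, key=rb.get)                            # maximizes the evidence
plausible = [t for t in rb if rb[t] > 1]                  # values with evidence in favor
content = sum(posterior[t] for t in plausible)            # belief the true value is there
# strength of the evidence concerning psi0 = 2: posterior probability of the
# values with no more evidence in their favor than psi0
strength = sum(posterior[t] for t in rb if rb[t] <= rb[2])
print(estimate, plausible, round(content, 3), round(strength, 3))
```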
Recalling the discussion in Section 3.1 about the difference $\delta$ that matters, it is relatively easy to take this into account in this context, at least when $\psi$ is real valued. For this, we consider a grid of values $\psi_0 \pm i\delta$, for $i = 0, 1, 2, \ldots$, separated by $\delta$, and then conduct the inferences using the relative belief ratios of the intervals $[\psi_0 + (i - 1/2)\delta, \psi_0 + (i + 1/2)\delta)$. In effect, $H_0: \Psi(\theta) = \psi_0$ is now $H_0: |\Psi(\theta) - \psi_0| \leq \delta/2$.
Example 6 (location normal). Suppose that we add the prior $\mu \sim N(\mu_0, \tau_0^2)$ to form a Bayesian inference base. The posterior distribution of $\mu$ is then $N((\mu_0/\tau_0^2 + n\bar{x})(1/\tau_0^2 + n)^{-1}, (1/\tau_0^2 + n)^{-1})$. Suppose our interest is inference about $\psi = |\mu|$. Figure 2 is a plot of the relative belief ratio $RB_\Psi(\psi \,|\, x)$ based on the data in Example 4 (generated from an $N(2,1)$ distribution) and using the prior with $\mu_0 = 0$. From (4), it is seen that maximizing $RB_\Psi(\psi \,|\, x)$ is equivalent to maximizing the integrated likelihood, and in this case, the prior is such that each point of the pre-image $\{-\psi, \psi\}$ has conditional prior probability 0.5, so indeed the relative belief estimate is the maximizer of the integrated likelihood (2). To assess the accuracy of the estimate, we record the plausible interval together with its posterior content. There is evidence in favor of $\psi = 2$, since $RB_\Psi(2 \,|\, x) > 1$, but the strength indicates only moderate evidence in favor of 2 being the true value of ψ. Of course, these results improve with sample size n: for larger n, the plausible interval shortens considerably, and its posterior content substantially increases, but the strength of the evidence in favor of $\psi = 2$ does not increase by much. It is important to note that, because we are employing a discretization with spacing $\delta$, and since the posterior is inevitably virtually completely concentrated in an interval of length $\delta$ about the true value as $n \to \infty$, the strength converges to 1 at any of the values in this interval and to 0 at values outside of it.
There are a number of optimal properties for the relative belief inferences, as discussed in [26]. While this is perhaps the first attempt to build a theory of inference based on the principle of evidence, there is considerable literature on this idea in the philosophy of science, where it underlies what is known as confirmation theory; see [28]. For example, the philosopher Popper in [29] (Appendix ix) writes
If we are asked to give a criterion of the fact that the evidence y supports or corroborates a statement x, the most obvious reply is: that y increases the probability of x.
One issue with philosophers’ discussions is that these are never cast in a statistical context. Many of the anomalies raised in those discussions, such as Hempel’s Raven Paradox, can be resolved when formulated as statistical problems. Also, the relative belief ratio itself has appeared elsewhere, although under different names, as a natural measure of evidence.
There are indeed other valid measures of evidence besides the relative belief ratio; for example, the Bayes factor originated with Jeffreys, as in [30]; see [6,31] for discussion. For probability model $(\Omega, \mathcal{F}, P)$ with $0 < P(A) < 1$ and $P(C) > 0$, the Bayes factor in favor of A being true, after observing that C is true, is given by $BF(A \,|\, C) = \frac{P(A \,|\, C)/P(A^c \,|\, C)}{P(A)/P(A^c)}$, the ratio of the posterior odds in favor of A to the prior odds in favor of A. It is immediate that $BF(A \,|\, C)$ is a valid measure of evidence because $BF(A \,|\, C) > 1$ if and only if $P(A \,|\, C) > P(A)$. Also, $BF(A \,|\, C) = RB(A \,|\, C)/RB(A^c \,|\, C)$, and so, because $RB(A \,|\, C) > 1$ if and only if $RB(A^c \,|\, C) < 1$, this implies $BF(A \,|\, C) \geq RB(A \,|\, C)$ when there is evidence in favor and $BF(A \,|\, C) \leq RB(A \,|\, C)$ when there is evidence against. These inequalities are important because it is sometimes asserted that the Bayes factor is not only a measure of evidence but also that its value is a measure of the strength of that evidence.
Table 1 gives a scale, due to Jeffreys (see [32] (Appendix B)), which supposedly can be used to assess the strength of evidence as given by the Bayes factor. Again, this is an attempt to establish a universal scale on which evidence can be measured, and currently no grounds exist for claiming that such a scale exists. If we consider that the strength of the evidence is to be measured by how strongly we believe what the evidence says, then simple examples can be constructed to show that such a scale is inappropriate. Effectively, the strength of the evidence is context dependent and needs to be calibrated.
Example 7. Suppose Ω is a finite set with $|\Omega| = m$. A value ω is generated uniformly from Ω and partially concealed, but it is noted that $\omega \in C$, where $C \subset \Omega$. It is desired to know if $\omega \in A$, where $A \subset \Omega$ and $A \cap C \neq \emptyset$. Then, we have that $BF(A \,|\, C) = \frac{P(A \,|\, C)/(1 - P(A \,|\, C))}{P(A)/(1 - P(A))}$, and this is large whenever $P(A \,|\, C) = |A \cap C|/|C|$ is large relative to $P(A) = |A|/m$. The posterior belief in what the evidence is saying is, however, $P(A \,|\, C)$, which can be very small, say when $|A \cap C|/|C|$ is small. But, if $P(A)$ is small enough relative to $P(A \,|\, C)$, then $BF(A \,|\, C)$ is well into the range where Jeffreys' scale says it is decisive evidence in favor of A. Clearly, there is only very weak evidence in favor of A being true here. Also, note that the posterior probability by itself does not indicate evidence in favor, although the observation that C is true must be evidence in favor of A because $P(A \,|\, C) > P(A)$. The example can obviously be modified to show that any such scale is not appropriate.
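A concrete instance of this phenomenon with hypothetical numbers: a uniform draw from a large finite set where the Bayes factor is huge while the posterior probability of A stays tiny.

```python
m = 10**6              # |Omega|
card_A = 2             # |A|, so the prior P(A) = 2e-6
card_C = 1000          # |C|, the observed event
card_A_and_C = 1       # |A ∩ C|

p_A = card_A / m
post_A = card_A_and_C / card_C     # P(A | C) = 0.001: weak belief that A is true
bf = (post_A / (1 - post_A)) / (p_A / (1 - p_A))
print(round(bf), post_A)           # enormous Bayes factor, negligible posterior
```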
There is another issue with current usage of the Bayes factor that needs to be addressed. This arises when interest is in assessing the evidence for or against $H_0: \Psi(\theta) = \psi_0$ when $\Pi_\Psi(\{\psi_0\}) = 0$, as in the continuous case. Clearly, the Bayes factor is not defined in such a context. An apparent solution is provided by choosing a prior of the form $p\,\Pi_0 + (1 - p)\,\Pi$, where $p$ is a prior probability assigned to $H_0$, and $\Pi_0$ is a prior probability measure on the set $\Psi^{-1}\{\psi_0\}$. Since the mixture prior assigns probability $p > 0$ to $H_0$, the Bayes factor is now defined. Of some concern now is how $\Pi_0$ should be chosen. A natural choice is to take $\Pi_0 = \Pi(\cdot \,|\, \Psi(\theta) = \psi_0)$, as otherwise there is a contradiction between beliefs as expressed by $\Pi_0$ and $\Pi$. It follows very simply, however, that in the continuous case, when $\Pi_0 = \Pi(\cdot \,|\, \Psi(\theta) = \psi_0)$, then the Bayes factor of $H_0$ equals the relative belief ratio $RB_\Psi(\psi_0 \,|\, x)$ obtained from the original prior $\Pi$. Moreover, if instead of using such a mixture prior, we define the Bayes factor via $BF(\psi_0 \,|\, x) = \lim_{\epsilon \to 0} BF(N_\epsilon(\psi_0) \,|\, x)$, where $N_\epsilon(\psi_0)$ is a sequence of neighborhoods of $\psi_0$ converging nicely to $\{\psi_0\}$, then, under weak conditions ($\pi_\Psi$ is positive and continuous at $\psi_0$), $BF(\psi_0 \,|\, x) = RB_\Psi(\psi_0 \,|\, x)$. So, there is no need to introduce the mixture prior to get a valid measure of evidence. Furthermore, now the Bayes factor is available for E as well as H, as surely any proper characterization of statistical evidence must be, and the relevant posterior for $\psi$ is $\pi_\Psi(\cdot \,|\, x)$, as obtained from $\Pi$, rather than from the mixture prior.
Other Bayesian measures of evidence have been proposed. For example, it is proposed in [33] to measure the evidence concerning $H_0: \Psi(\theta) = \psi_0$ by computing the supremum of the posterior density over $H_0$ and then using the posterior probability of the set of values whose posterior density exceeds this supremum. If this tail probability is large, this is evidence against $H_0$, while, if it is small, it is evidence in favor of $H_0$. Note that the resulting measure is sometimes referred to as an e-value, but this is different from the e-values discussed in Section 3.1. Further discussion and development of this concept can be found in [34]. Clearly, this is building on the idea that underlies the p-value, namely, providing a measure that locates a point in a distribution and using this to assess evidence. It does not, however, conform to the principle of evidence.
Bias and Error Probabilities
One criticism that is made of Bayesian inference is that there are no measures of the reliability of the inferences, which is an inherent part of frequentism. It is natural to add such measures, however, to assess whether or not the specified model and prior could potentially lead to misleading inferences. For example, suppose it could be shown that evidence in favor of $H_0$ would be obtained with prior probability near 1, for a data set of given size. It seems obvious that, if we did collect this amount of data and obtained evidence in favor of $H_0$, then this fact would undermine our confidence in the reliability of the reported inference.
Example 8 (Jeffreys–Lindley paradox). Suppose we have the location normal model, with a sample of size n from an N(μ, σ₀²) distribution where σ₀² is known, and a value of the sample mean is obtained which leads to a very small p-value for H: μ = μ₀, so there would appear to be definitive evidence that H is false. Suppose the prior μ ∼ N(μ₀, τ₀²) is used, and the analyst chooses τ₀² to be very large to reflect the fact that little is known about the true value of μ. It can be shown (see [26]), however, that the Bayes factor in favor of H converges to ∞ as τ₀² → ∞. Therefore, for a very diffuse prior, evidence in favor of H will be obtained, and so the frequentist and the Bayesian will disagree. Note that the Bayes factor equals the relative belief ratio in this context. A partial resolution of this contradiction is obtained by noting that the strength of this evidence, as measured by the relevant posterior probability, converges to 0 as τ₀² → ∞, and so the Bayesian measure of evidence is only providing very weak evidence in favor when the p-value is small. This anomaly occurs even when the true value of μ is indeed far away from μ₀, and so the fault here does not lie with the p-value. Prior error probabilities associated with Bayesian measures of evidence can, however, be computed, and these lead to a general resolution of the apparent paradox.

There are two error probabilities for H that we refer to as the bias against H and the bias in favor of H, given by the following two prior probabilities: bias against = M(RB(ψ₀ | x) ≤ 1 | ψ₀) and bias in favor = M(RB(ψ₀ | x) ≥ 1 | ψ∗), for a value ψ∗ with d(ψ∗, ψ₀) ≥ δ, where M(· | ψ) denotes the conditional prior predictive distribution of the data given Ψ(θ) = ψ.
Both of these biases are independent of the valid measure of evidence used, as they only depend on the principle of evidence as applied to the model and prior chosen. The bias against is the prior probability of not getting evidence in favor of H when it is true and plays a role similar to that of the type I error. The bias in favor is the prior probability of not obtaining evidence against H when it is meaningfully false and plays a role similar to that of the type II error. Here, δ is the deviation from ψ₀ which is of scientific interest, as determined by some measure of distance d on the space of values of Ψ. As discussed in [26], these biases cannot be controlled by, for example, the choice of prior, as a prior that reduces the bias against causes the bias in favor to increase and vice versa. The proper control of these quantities is through sample size, as both converge to 0 as the amount of data increases.
Example 9 (Jeffreys–Lindley paradox). In this case, the bias against H converges to 0 and the bias in favor of H converges to 1 as τ₀² → ∞. This leads to a resolution of the issue: do not choose the prior to be arbitrarily diffuse to reflect noninformativeness; rather, choose a prior that is sufficiently diffuse to cover the interval where it is known μ must lie, e.g., the interval where the measurements lie, and then choose n so that both biases are suitably small. While the Jeffreys–Lindley paradox arises because diffuse priors induce bias in favor, an overly concentrated prior induces bias against, but again this bias can be controlled via the amount of data collected.
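The behavior described in Examples 8 and 9 can be illustrated by simulating the two biases in the location normal setting, with evidence assessed via the relative belief ratio. All the numbers below (sample size, the meaningful deviation delta = 1, and the two prior scales) are illustrative choices, not values from the text.

```python
import math, random

random.seed(2)

def rb_at_zero(xbar, n, tau0, sigma0=1.0):
    """Relative belief ratio of mu = 0: posterior over prior density at 0,
    for an N(mu, sigma0^2) sample of size n and an N(0, tau0^2) prior."""
    post_var = 1 / (1 / tau0**2 + n / sigma0**2)
    post_mean = post_var * n * xbar / sigma0**2
    post_pdf = math.exp(-0.5 * post_mean**2 / post_var) / math.sqrt(2 * math.pi * post_var)
    prior_pdf = 1 / math.sqrt(2 * math.pi * tau0**2)
    return post_pdf / prior_pdf

def biases(n, tau0, delta=1.0, reps=20_000):
    # Bias against: prob. of not getting evidence in favor (RB <= 1) when mu = 0 is true.
    against = sum(rb_at_zero(random.gauss(0, 1 / math.sqrt(n)), n, tau0) <= 1
                  for _ in range(reps)) / reps
    # Bias in favor: prob. of not getting evidence against (RB >= 1) when mu = delta.
    favor = sum(rb_at_zero(random.gauss(delta, 1 / math.sqrt(n)), n, tau0) >= 1
                for _ in range(reps)) / reps
    return against, favor

b_tight = biases(n=10, tau0=1.0)      # moderately informative prior
b_diffuse = biases(n=10, tau0=100.0)  # very diffuse prior
print(b_tight, b_diffuse)
```

In this run the diffuse prior drives the bias against toward 0 while inflating the bias in favor, which is the trade-off the text describes: the prior cannot reduce both biases at once, and only a larger sample size can.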
There are also two biases associated with E, obtained by averaging the biases for H using the prior. These can also be expressed in terms of coverage probabilities of the plausible region Pl(x) = {ψ : RB(ψ | x) > 1}, the set of values for which evidence in favor has been obtained. The biases for E are the prior probability that the true value is not in Pl(x) and the prior probability that a meaningfully false value is not in the implausible region Im(x) = {ψ : RB(ψ | x) < 1}, namely, the set of values for which evidence against has been obtained. So, the bias against for E is the prior probability that the true value is not in the plausible region Pl(x), and so its complement is the prior coverage probability (Bayesian confidence) of Pl(x). It is of interest that there is typically a value ψ∗ such that this Bayesian confidence gives a lower bound on the frequentist confidence of Pl(x) at ψ∗, with respect to the model obtained from the original model by integrating out the nuisance parameters. In both cases, the Bayesian confidences are average frequentist confidences with respect to the original model. The bias in favor for E is the prior probability that a meaningfully false value is not in the implausible region. Again, both of these biases do not depend on the valid measure of statistical evidence used, and both converge to 0 with increasing sample size.
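The coverage interpretation can be illustrated by simulation: draw the true value from the prior, generate data under it, and record how often the true value lands in the plausible region, i.e., receives evidence in favor. The setup below (normal model and prior, sample size) is an illustrative choice, not taken from the text.

```python
import math, random

random.seed(3)

# Illustrative setup: mu ~ N(0, tau0^2) prior, xbar | mu ~ N(mu, sigma0^2 / n).
tau0, sigma0, n, reps = 1.0, 1.0, 20, 50_000

def rb(mu, xbar):
    """Relative belief ratio of mu: posterior density over prior density."""
    post_var = 1 / (1 / tau0**2 + n / sigma0**2)
    post_mean = post_var * n * xbar / sigma0**2
    post_pdf = math.exp(-0.5 * (mu - post_mean)**2 / post_var) / math.sqrt(2 * math.pi * post_var)
    prior_pdf = math.exp(-0.5 * (mu / tau0)**2) / math.sqrt(2 * math.pi * tau0**2)
    return post_pdf / prior_pdf

# Prior coverage (Bayesian confidence) of the plausible region
# Pl(x) = {mu : RB(mu | x) > 1}: how often the true value gets evidence in favor.
hits = 0
for _ in range(reps):
    mu = random.gauss(0, tau0)                       # true value drawn from the prior
    xbar = random.gauss(mu, sigma0 / math.sqrt(n))   # data generated under mu
    hits += rb(mu, xbar) > 1
coverage = hits / reps
print(coverage)
```

One minus this estimate is the bias against for E; increasing n in the sketch pushes the coverage toward 1, matching the claim that both biases for E vanish with growing sample size.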
With the addition of the biases, a link is established between Bayesianism and frequentism: the inferences are Bayesian, but the reliability of the inferences is assessed via frequentist criteria. More discussion on bias in this context can be found in [35].