Choosing and checking the ingredients logically comes before inference, but it is convenient to discuss these in reverse order.
  3.1. Relative Belief Inferences
Consider now a statistical problem with ingredients the data $x$, a model $\{f_\theta : \theta \in \Theta\}$ and a prior $\pi$, where interest is in making inference about $\psi = \Psi(\theta)$ for a function $\Psi$ defined on $\Theta$, where no distinction is made between the function and its range to save notation. Initially, suppose that all the probability distributions are discrete. This is not really a restriction on the discussion, however: if something works for inference in the discrete case but does not work in the continuous case, then it is our view that either the concept is not being applied correctly or the mathematical context is just too general. For us, the continuous case is always to be thought of as an approximation to a fundamentally discrete context, as measurements are always made to finite accuracy, and the approximation arises via taking limits. Some additional comments on the continuous case are made subsequently.
As discussed in Section 2, the basic object of inference is the measure of evidence, and what is wanted is a measure of the evidence that any particular value $\psi_0$ is true. Based on the ingredients specified, there are two probabilities associated with this value, namely, the prior probability $\pi_\Psi(\psi_0)$, as given here by the marginal prior density of $\psi$ evaluated at $\psi_0$, and the posterior probability $\pi_\Psi(\psi_0 \mid x)$, as given here by the marginal posterior density of $\psi$ evaluated at $\psi_0$. In certain treatments of inference, $\pi_\Psi(\psi_0 \mid x)$ is taken as a measure of the evidence that $\psi_0$ is the true value but, for a wide variety of reasons, this is not felt to be correct, and Example 1 provides a specific case where this fails. In addition, this measure suffers from the same basic problem as p-values, namely, there is no obvious dividing line between evidence for and evidence against. Moreover, it is to be noted that probabilities measure belief and not evidence. If we start with a large prior belief in $\psi_0$, then, unless there is a large amount of data, there will still be a large posterior belief even if $\psi_0$ is false and, similarly, if we started with a small amount of belief. There is agreement, however, to use the principle of conditional probability to update beliefs after receiving information or data, and this is to be regarded as the first principle of the theory of relative belief.
Thus, how is the evidence that $\psi_0$ is the true value to be measured? Basic to this is the principle of evidence: if $\pi_\Psi(\psi_0 \mid x) > \pi_\Psi(\psi_0)$, there is evidence in favor, as belief has increased; if $\pi_\Psi(\psi_0 \mid x) < \pi_\Psi(\psi_0)$, there is evidence against, as belief has decreased; and if $\pi_\Psi(\psi_0 \mid x) = \pi_\Psi(\psi_0)$, there is no evidence either way. This principle has a long history in the philosophical literature concerning evidence. The principle does not provide a specific measure of evidence, but at least it indicates clearly when there is evidence for or against, independent of the size of initial beliefs, and it does suggest that any reasonable measure of the evidence depends on the difference, in some sense, between $\pi_\Psi(\psi_0 \mid x)$ and $\pi_\Psi(\psi_0)$, namely, evidence is measured by change in belief rather than belief. A number of measures of this change have been proposed (see [1] for a discussion) but, by far, the simplest and the one that has the nicest properties is the relative belief ratio

        $RB_\Psi(\psi_0 \mid x) = \pi_\Psi(\psi_0 \mid x) / \pi_\Psi(\psi_0).$        (2)
        Thus, if $RB_\Psi(\psi_0 \mid x) > 1$, there is evidence for $\psi_0$ being the true value; if $RB_\Psi(\psi_0 \mid x) < 1$, there is evidence against $\psi_0$ being the true value; and there is no evidence either way if $RB_\Psi(\psi_0 \mid x) = 1$. The use of the relative belief ratio to measure the evidence is the third and final principle of the theory, which we call the principle of relative belief. The relative belief ratio can also be written as $RB_\Psi(\psi_0 \mid x) = m(x \mid \psi_0)/m(x)$, where $m$ is the prior predictive density of the data and $m(\cdot \mid \psi_0)$ is the conditional prior predictive density of the data given $\psi_0$.
Another natural candidate for a measure of evidence is the Bayes factor $BF_\Psi(\psi_0 \mid x)$, as this satisfies the principle of evidence, namely, $BF_\Psi(\psi_0 \mid x) > 1$ ($< 1$, $= 1$) when there is evidence for (against, neither) $\psi_0$ being the true value. The Bayes factor can be defined in terms of the relative belief ratio as $BF_\Psi(\psi_0 \mid x) = RB_\Psi(\psi_0 \mid x)/RB_\Psi(\{\psi_0\}^c \mid x)$, but not conversely. Note that the relative belief ratio of a set $A$ is $RB_\Psi(A \mid x) = \Pi_\Psi(A \mid x)/\Pi_\Psi(A)$, where $\Pi_\Psi, \Pi_\Psi(\cdot \mid x)$ are the prior and posterior probability measures of $\psi$, respectively. It might appear that $BF_\Psi(\psi_0 \mid x)$ is a comparison between the evidence for $\psi_0$ being true with the evidence for $\psi_0$ being false, but it is provable that $RB_\Psi(\psi_0 \mid x) > 1$ implies $RB_\Psi(\{\psi_0\}^c \mid x) < 1$, and conversely, so this is not the case. In addition, as subsequently discussed, in the continuous case, it is natural to take $BF_\Psi(\psi_0 \mid x) = RB_\Psi(\psi_0 \mid x)$.
The principle of relative belief leads to an ordering of the possible values of $\psi$, as $\psi_1$ is preferred to $\psi_2$ whenever $RB_\Psi(\psi_1 \mid x) > RB_\Psi(\psi_2 \mid x)$, since there is more evidence for $\psi_1$ than for $\psi_2$. When $\psi = \theta$, this agrees with the likelihood ordering, but likelihood fails to provide such an ordering for general $\psi$. It is common to use the profile likelihood ordering, even though this is not a likelihood ordering, and it does not agree with the relative belief ordering. It is noteworthy that the relative belief idea is consistent in the sense that inferences for all $\psi$ are based on a measure of the change in prior to posterior probabilities.
The relative belief ordering leads immediately to a theory of estimation. Basing inferences on the evidence requires that the relative belief estimate be a value $\psi(x)$ that maximizes $RB_\Psi(\psi \mid x)$, and typically such a value is unique, so $\psi(x) = \arg \max_\psi RB_\Psi(\psi \mid x)$. It is also necessary to say something about the accuracy of this estimate in an application. For this, a set of values containing $\psi(x)$ is quoted, and the “size” of the set is taken as the measure of accuracy. Again, following the ordering based on the evidence, it is necessary that the set take the form $C_c(x) = \{\psi : RB_\Psi(\psi \mid x) \geq c\}$ for some constant $c$ since, if $RB_\Psi(\psi_2 \mid x) \geq RB_\Psi(\psi_1 \mid x)$, then $\psi_2$ must be included whenever $\psi_1$ is. However, what $c$ should be used? It is perhaps natural to choose $c$ so that $C_c(x)$ contains some prescribed amount of posterior probability, so the set is a $\gamma$-credible region. However, there are several problems with this approach. For one, what $\gamma$ should be chosen? Even if one is content with some particular $\gamma$, say $\gamma = 0.95$, there is the problem that the set may contain values $\psi$ with $RB_\Psi(\psi \mid x) < 1$, and such a value has been ruled out, since there is evidence against such a $\psi$ being true. It is argued in [1] that the plausibility set $Pl_\Psi(x) = \{\psi : RB_\Psi(\psi \mid x) > 1\}$ be used, as $Pl_\Psi(x)$ contains all the values for which there is evidence in favor of being the true value. In general circumstances, it is provable that $\Pi_\Psi(Pl_\Psi(x) \mid x) \geq \Pi_\Psi(Pl_\Psi(x))$, so the posterior probability that the true value lies in $Pl_\Psi(x)$ is at least as large as its prior probability. There are several possible measures of size and certainly the posterior content $\Pi_\Psi(Pl_\Psi(x) \mid x)$ is one, as this measures the belief that the true value is in $Pl_\Psi(x)$, but also some measure such as length or cardinality is relevant. If $Pl_\Psi(x)$ is small and $\Pi_\Psi(Pl_\Psi(x) \mid x)$ large, then $\psi(x)$ can be judged to be an accurate estimate of $\psi$.
It is immediate that $RB_\Psi(\psi_0 \mid x)$ is the evidence concerning $H_0 : \Psi(\theta) = \psi_0$. The evidential ordering implies that the smaller $RB_\Psi(\psi_0 \mid x)$ is than 1, the stronger is the evidence against $\psi_0$, and the bigger it is than 1, the stronger is the evidence in favor of $\psi_0$, but how is one to measure this strength? In [16], it is proposed to measure the strength of the evidence via

        $\Pi_\Psi(RB_\Psi(\psi \mid x) \leq RB_\Psi(\psi_0 \mid x) \mid x),$        (3)

        which is the posterior probability that the true value of $\psi$ has evidence no greater than that obtained for the hypothesized value $\psi_0$. When $RB_\Psi(\psi_0 \mid x) < 1$ and (3) is small, then there is strong evidence against $\psi_0$, since there is a large posterior probability that the true value of $\psi$ has a larger relative belief ratio. Similarly, if $RB_\Psi(\psi_0 \mid x) > 1$ and (3) is large, then there is strong evidence that the true value of $\psi$ is given by $\psi_0$, since there is a large posterior probability that the true value is in $\{\psi : RB_\Psi(\psi \mid x) \leq RB_\Psi(\psi_0 \mid x)\}$ and $\psi_0$ maximizes the evidence in this set. Additional results concerning $RB_\Psi(\psi_0 \mid x)$ as a measure of evidence and (3) can be found in [1,16].
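To make these definitions concrete, the following minimal sketch computes the relative belief ratios (2), the relative belief estimate, the plausibility set and the strength (3) for a discrete parameter. The three-point parameter space, the uniform prior and the data are invented purely for illustration; nothing here is code from [1] or [16].

```python
from math import comb

# Invented illustration: theta takes one of three values with a uniform prior,
# and the data is s successes in n Bernoulli(theta) trials.
thetas = [0.25, 0.5, 0.75]
prior = {t: 1 / 3 for t in thetas}

def likelihood(t, s, n):
    return comb(n, s) * t ** s * (1 - t) ** (n - s)

def posterior_and_rb(s, n):
    joint = {t: prior[t] * likelihood(t, s, n) for t in thetas}
    m = sum(joint.values())                       # prior predictive of the data
    post = {t: joint[t] / m for t in thetas}      # posterior probabilities
    rb = {t: post[t] / prior[t] for t in thetas}  # relative belief ratios, as in (2)
    return post, rb

post, rb = posterior_and_rb(s=7, n=10)

# relative belief estimate: the value maximizing the relative belief ratio
estimate = max(thetas, key=lambda t: rb[t])

# plausibility set: the values with evidence in their favor (RB > 1)
plaus = [t for t in thetas if rb[t] > 1]

# strength of the evidence for a hypothesized value, as in (3): the posterior
# probability of the values with no more evidence than the hypothesized one
theta0 = 0.5
strength = sum(post[t] for t in thetas if rb[t] <= rb[theta0])

print(estimate, plaus, round(strength, 3))
```

Note that the estimate maximizes the change in belief from prior to posterior, not the posterior itself, which is what distinguishes it from MAP.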
For continuous parameters, it is natural to define $RB_\Psi(\psi \mid x) = \lim_{\delta \rightarrow 0} RB_\Psi(N_\delta(\psi) \mid x)$, where $\{N_\delta(\psi)\}$ is a sequence of sets converging nicely to $\{\psi\}$ as $\delta \rightarrow 0$. When the densities are continuous at $\psi$, then this limit equals (2), so this is a measure of evidence in general circumstances. In addition, it is natural to define the Bayes factor by $BF_\Psi(\psi \mid x) = \lim_{\delta \rightarrow 0} BF_\Psi(N_\delta(\psi) \mid x)$ and, when the densities are continuous at $\psi$, then $BF_\Psi(\psi \mid x) = RB_\Psi(\psi \mid x)$.
A variety of consistency results, as the amount of data increases, are proved in [1] concerning the estimation and hypothesis assessment procedures. In particular, when $H_0 : \Psi(\theta) = \psi_0$ is false, then (2) converges to 0, as does (3). When $H_0$ is true, then (2) converges to its largest possible value (greater than 1 and often $\infty$) and, in the discrete case, (3) converges to 1. In the continuous case, however, when $H_0$ is true, then (3) typically converges to a $U(0,1)$ random variable. This simply reflects the approximate nature of the inferences and is easily resolved by requiring that a deviation $\delta$ be specified such that, if dist$(\psi, \psi_0) \leq \delta$, where dist is some measure of distance determined by the application, then this difference is to be regarded as immaterial. This leads to redefining $H_0$ as $\{\psi : \text{dist}(\psi, \psi_0) \leq \delta\}$, and typically a natural discretization of the range of $\Psi$ exists with $H_0$ as one of its elements. With this modification, (3) converges to 1 as the amount of data increases when $H_0$ is true. Given that data is always measured to finite accuracy, the value of a typical continuous-valued parameter can only be known to a certain finite accuracy no matter how much data is collected. Thus, such a $\delta$ always exists, and it is part of an application to determine the relevant value (see Example 7 here, Al-Labadi, Baskurt and Evans [7] and Evans, Guttman and Li [17] for developments on determining $\delta$). These results establish that, as the amount of data increases, relative belief inferences will inevitably produce the correct answers to estimation and hypothesis assessment problems.
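The limiting definition for continuous parameters can be illustrated numerically. In the sketch below (an invented Bernoulli setup with a uniform prior on [0, 1]), the relative belief ratio of a shrinking cell $N_\delta(\theta_0)$, computed as the ratio of its posterior to its prior probability via a crude Riemann sum, stabilizes at the ratio of the posterior to prior densities at $\theta_0$.

```python
# Invented illustration of the limiting definition: Bernoulli(theta) model,
# uniform prior on [0, 1], s successes observed in n trials. The relative
# belief ratio of the cell N_delta(theta0) is its posterior probability
# divided by its prior probability; as delta shrinks, this approaches the
# ratio of the posterior density to the prior density at theta0.

def rb_cell(theta0, s, n, delta, grid=4000):
    lo, hi = max(0.0, theta0 - delta / 2), min(1.0, theta0 + delta / 2)
    step = 1.0 / grid
    # posterior density is proportional to theta^s (1 - theta)^(n - s)
    dens = [(i * step) ** s * (1 - i * step) ** (n - s) for i in range(1, grid)]
    norm = sum(dens) * step                     # normalizing constant (Riemann sum)
    mass = sum(d * step for i, d in enumerate(dens, start=1) if lo < i * step < hi)
    post_prob = mass / norm                     # posterior probability of the cell
    prior_prob = hi - lo                        # uniform prior probability of the cell
    return post_prob / prior_prob

# the cell relative belief ratios stabilize near the density ratio at theta0
for delta in (0.2, 0.1, 0.05):
    print(delta, round(rb_cell(0.7, s=7, n=10, delta=delta), 3))
```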
It is immediate that relative belief inferences have some excellent properties. For example, any 1-1 increasing function of $RB_\Psi(\cdot \mid x)$, such as $\log RB_\Psi(\cdot \mid x)$, can be used to measure evidence, as the inferences are invariant to this choice. In addition, $RB_\Psi(\psi \mid x)$ is invariant under smooth reparameterizations, and so all relative belief inferences possess this invariance property. By contrast, MAP (maximum a posteriori) inferences are not invariant, and this leads to considerable doubt about their validity (see also Example 1). In [1], results from a number of papers are summarized establishing optimality results for relative belief inferences within the collection of all Bayesian inferences. For example, Al-Labadi and Evans [18] establish that relative belief inferences for $\psi$ have optimal robustness-to-the-prior properties. In addition, as discussed in Section 3.2, since the inferences are based on a measure of evidence, a key criticism of Bayesian methodology can be addressed, namely, the extent to which the inferences are biased can be measured.
Relative belief prediction inferences for future data are naturally produced by using the ratio of the posterior to prior predictive densities for the quantity in question. The following example illustrates this and demonstrates significant advantages for relative belief.
Example 1. Prediction for Bernoulli sampling.
Consider an example discussed in Chapter 6 of [19], which further references [9]. A tack is flipped with $x_i = 1$ indicating the tack finishes point up and $x_i = 0$ otherwise, so $x_i \sim$ Bernoulli$(\theta)$. Suppose the prior is $\theta \sim U(0,1)$ and the goal is to predict $f$ future observations $y = (x_{n+1}, \ldots, x_{n+f})$ having observed $n$ independent tosses $x = (x_1, \ldots, x_n)$. The posterior of $\theta$ is beta$(n\bar{x} + 1, n(1 - \bar{x}) + 1)$, the prior predictive density of $y$ is $B(f\bar{y} + 1, f(1 - \bar{y}) + 1)$, where $B(\cdot, \cdot)$ denotes the beta function, and the posterior predictive density for $y$ is

        $q(y \mid x) = B(n\bar{x} + f\bar{y} + 1, n(1 - \bar{x}) + f(1 - \bar{y}) + 1)/B(n\bar{x} + 1, n(1 - \bar{x}) + 1),$        (4)

        which is constant for all $y$ with the same value of $\bar{y}$.
 Maximizing (4) gives the MAP predictor of $y$. If $\bar{x} < 1/2$, then the maximum occurs at $\bar{y} = 0$, namely, $y = (0, \ldots, 0)$. If $\bar{x} > 1/2$, then the maximum occurs at $\bar{y} = 1$, namely, $y = (1, \ldots, 1)$. If $\bar{x} = 1/2$, then a maximum occurs at both $y = (0, \ldots, 0)$ and $y = (1, \ldots, 1)$. Thus, using MAP gives the absurd result that $y$ is always predicted to be either all 0s or all 1s. Clearly, there is a problem here with using MAP.
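The claim that maximizing the posterior predictive probability of the full sample always yields a constant sequence is easy to check numerically. The sketch below (with invented values of $n$, $n\bar{x}$ and $f$) evaluates the logarithm of the posterior predictive probability of a future sequence for each possible number of future 1s and locates the maximum.

```python
from math import lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_post_pred(s, s_obs, n, f):
    # log posterior predictive probability of one particular future sequence
    # containing s ones, after observing s_obs ones in n tosses, under the
    # uniform prior on theta (the values below are invented for illustration)
    a, b = s_obs + 1, n - s_obs + 1
    return log_beta(a + s, b + f - s) - log_beta(a, b)

n, s_obs, f = 10, 4, 10   # xbar = 0.4 < 1/2
log_probs = [log_post_pred(s, s_obs, n, f) for s in range(f + 1)]
map_s = max(range(f + 1), key=lambda s: log_probs[s])
print(map_s)              # 0: the MAP predicted sample is all 0s
```

The per-sequence probability is log-convex in the number of future 1s, so its maximum always sits at one of the two endpoints.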
Now suppose $\bar{x} < 1/2$, so the prediction is all 0s, and

        $q((0, \ldots, 0) \mid x) = B(n\bar{x} + 1, n(1 - \bar{x}) + f + 1)/B(n\bar{x} + 1, n(1 - \bar{x}) + 1).$

For fixed $f$, then $q((0, \ldots, 0) \mid x) \rightarrow (1 - \bar{x})^f$ as $n \rightarrow \infty$, which is less than 1 whenever $\bar{x} > 0$ and converges to 1 when $\bar{x} = 0$. Diaconis and Skyrms [19] note, however, that when $\bar{x} = 0$, then $q((0, \ldots, 0) \mid x) = (n + 1)/(n + f + 1) \rightarrow 0$ as $f \rightarrow \infty$ and make the comment “If this is an unwelcome surprise, then perhaps the uniform prior is suspect.” They also refer to some attempts to modify the prior to avoid this phenomenon, which clearly violates an essential component of the Bayesian approach. In our view, there is nothing wrong with the uniform prior; rather, the problem lies with using posterior probabilities implicitly as measures of evidence, both to determine the predictor and to assess its reliability.
The relative belief ratio for $y$ is

        $RB(y \mid x) = \dfrac{B(n\bar{x} + f\bar{y} + 1, n(1 - \bar{x}) + f(1 - \bar{y}) + 1)}{B(n\bar{x} + 1, n(1 - \bar{x}) + 1)\, B(f\bar{y} + 1, f(1 - \bar{y}) + 1)}.$        (5)

        With $n$ and $\bar{x}$ fixed, Figure 1 gives the plot of (5) as a function of $\bar{y}$. The best relative belief predictor of $y$ is any sample with $\bar{y}$ maximizing (5), which occurs at a value of $\bar{y}$ close to $\bar{x}$, and the plausibility set $Pl(x) = \{y : RB(y \mid x) > 1\}$ has an appreciable posterior content. Thus, there is reasonable belief that the plausibility set contains the “true” future sample but certainly there are many such samples. By contrast with MAP, a sensible prediction is made using relative belief.
For the case when $\bar{x} = 0$, so the posterior of $\theta$ is beta$(1, n + 1)$, then

        $RB(y \mid x) = \dfrac{B(f\bar{y} + 1, n + f(1 - \bar{y}) + 1)}{B(1, n + 1)\, B(f\bar{y} + 1, f(1 - \bar{y}) + 1)},$

        which is decreasing in $f\bar{y}$ and so is maximized for the sample with $\bar{y} = 0$. Similarly, when $\bar{x} = 1$, the predictor is the sample with $\bar{y} = 1$. Thus, at the extremes, the predictions based on MAP and relative belief are the same, but otherwise there is a sharp disagreement. In addition,

        $Pl(x) = \{y : RB(y \mid x) > 1\}$

        always contains $y = (0, \ldots, 0)$ and, for any $y$ with $\bar{y} > 0$,

        $RB(y \mid x) \rightarrow 0$

        as $n \rightarrow \infty$. Therefore, for any $c \in (0, 1]$, there is an $n_c$ such that, for all $n \geq n_c$, $Pl(x)$ contains no $y$ having a proportion of 1s that is $c$ or greater. Thus, $Pl(x)$ is shrinking as $n$ increases, in the sense that it contains only samples with smaller and smaller proportions of 1s as $n$ increases.
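The disagreement between the MAP and relative belief predictors, and the shrinking of the plausibility set when $\bar{x} = 0$, can likewise be checked numerically. The values of $n$, $n\bar{x}$ and $f$ below are invented for illustration.

```python
from math import lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_rb(s, s_obs, n, f):
    # log relative belief ratio of a future sequence with s ones: its posterior
    # predictive probability divided by its prior predictive probability,
    # under the uniform prior on theta (invented numbers below)
    a, b = s_obs + 1, n - s_obs + 1
    log_post = log_beta(a + s, b + f - s) - log_beta(a, b)
    log_prior = log_beta(s + 1, f - s + 1)
    return log_post - log_prior

n, s_obs, f = 10, 4, 10   # xbar = 0.4
rb_pred = max(range(f + 1), key=lambda s: log_rb(s, s_obs, n, f))
print(rb_pred)            # 4: relative belief predicts ybar = xbar, not all 0s

def plausibility(n):
    # numbers of future 1s with evidence in their favor when xbar = 0
    return [s for s in range(f + 1) if log_rb(s, 0, n, f) > 0]

# the plausibility set shrinks toward the all-0s sample as n grows
print(plausibility(10), plausibility(1000))
```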
The posterior content of the plausibility region equals

        $\Pi(\{y : RB(y \mid x) > 1\} \mid x) = \sum_{\{s : RB > 1\}} \binom{f}{s} q_s(x),$        (6)

        where $q_s(x)$ denotes the common value of (4) for the samples $y$ with $f\bar{y} = s$. Since, under the uniform prior, the prior predictive probability that $f\bar{y} = s$ equals $(f + 1)^{-1}$ for every $s$, (6) equals the sum over all the summands that are greater than $(f + 1)^{-1}$, and, when $\bar{x} = 0$,

        $\binom{f}{s} q_s(x) = (n + 1)\binom{f}{s}\, s!\,(n + f - s)!\,/\,(n + f + 1)!.$

        When $s = 0$, the corresponding term equals $(n + 1)/(n + f + 1)$, which converges to 1 as $n \rightarrow \infty$. Thus, for all $n$ large enough, the sum (6) contains the term for $s = 0$. Therefore, for any $\epsilon > 0$ and all $n$ large enough, (6) is greater than $1 - \epsilon$, and the posterior content of $Pl(x)$ converges to 1.
Thus, relative belief also behaves appropriately when $\bar{x} = 0$ and $n$ increases, while MAP does not. The failure of MAP might be attributed to the requirement that the entire sample $y$ be predicted. If, instead, it was required only to predict the value $f\bar{y}$, then the prior predictive of this quantity is uniform on $\{0, 1, \ldots, f\}$, the posterior of $f\bar{y}$ equals $\binom{f}{f\bar{y}}$ times the posterior predictive probability (4) of any single sample with that value of $f\bar{y}$, and the relative belief ratio for $f\bar{y}$ equals $(f + 1)$ times this posterior probability. Thus, as is often the case when the quantity in question has a uniform prior, MAP and relative belief estimates are the same. However, even in this case, there is no natural cut-off for MAP inferences to say when there is evidence for or against a particular value. The fact that it is necessary to modify the problem in this way to get a reasonable inference is, in our view, a substantial failing of MAP. It seems reasonable to suggest that, when an inference approach is shown to perform poorly on such examples, it not be generally recommended. Additional examples of poor performance of MAP are discussed in [1].
It is notable that, while the relative belief approach to inference has been described here using statistical models and priors, in reality, everything can be cast in terms of a single probability model, however such an object arises. Thus, if $P$ is a probability measure on a sample space $\Omega$ and $A \subset \Omega$ is an event whose truth value is unknown but the event $C \subset \Omega$ is known to be true, then the evidence concerning the truth of $A$ is given by $RB(A \mid C) = P(A \mid C)/P(A)$, with this defined by the appropriate limit when either $A$ or $C$ is a null event. As discussed in [1], the relative belief approach to inference can be seen as essentially probability theory together with the principles of evidence and relative belief.
  3.2. Choosing and Checking the Ingredients
The first choice that must be made is the model, and there are a number of standard models used in practice. Not a lot has been written about this step, however, and yet it is perhaps the most important step in solving a statistical problem. It is generally accepted that the correct way to choose a prior is through elicitation. This means that a methodology is prescribed that directs an expert in the application area on how to translate their knowledge into a prior. There are various default priors in use that avoid this elicitation step, but it is far better to recommend that sufficient time and energy be allocated for the elicitation of a proper prior. Staying within the context of proper probability distributions also ensures that a variety of paradoxes and illogicalities are avoided.
Given the ingredients, the relative belief inferences may be applied correctly, but it is still reasonable to ask if these ingredients are appropriate for the particular application. If not, then the inferences drawn cannot be considered valid. There are at least two questions about the ingredients that need to be answered, namely, is there bias inherent in the choice of ingredients and are the ingredients contradicted by the data?
The concern for bias is best understood in terms of assessing the hypothesis $H_0 : \Psi(\theta) = \psi_0$. Let $M(\cdot \mid \psi_0)$ denote the prior predictive distribution of the data given that $\Psi(\theta) = \psi_0$. Bias against $H_0$ means that the ingredients are such that, with high probability, evidence will not be obtained in favor of $H_0$ even when it is true. Bias against is thus measured by

        $M(RB_\Psi(\psi_0 \mid x) \leq 1 \mid \psi_0).$        (7)

        If (7) is large, then obtaining evidence against $\psi_0$ seems like a foregone conclusion. For bias in favor of $\psi_0$, consider a value $\psi_*$ where dist$(\psi_*, \psi_0) = \delta$, so $\psi_*$ is a value that just differs from the hypothesized value by a meaningful amount. Bias in favor of $\psi_0$ is then measured by

        $M(RB_\Psi(\psi_0 \mid x) \geq 1 \mid \psi_*).$        (8)

        If (8) is large, then obtaining evidence in favor of $\psi_0$ seems like a foregone conclusion. Typically, the probability of obtaining evidence against $\psi_0$ increases as dist$(\psi_*, \psi_0)$ increases, so (8) is an appropriate measure of bias in favor of $\psi_0$. The choice of the prior can be used somewhat to control bias but, typically, a prior that makes one bias lower just results in making the other bias higher. It is established in [1] that, under quite general circumstances, both biases converge to 0 as the amount of data increases. Thus, bias can be controlled by design a priori.
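Since (7) and (8) are prior predictive probabilities, they can be estimated by straightforward simulation. The following sketch estimates the bias against (7) by Monte Carlo in an invented discrete model (θ restricted to three values with a uniform prior); the specific numbers have no significance.

```python
import random
from math import comb

random.seed(1)

# invented discrete model: theta is one of three values, uniform prior
thetas = [0.25, 0.5, 0.75]
prior = {t: 1 / 3 for t in thetas}

def rb(theta0, s, n):
    # relative belief ratio of theta0 after observing s successes in n trials
    joint = {t: prior[t] * comb(n, s) * t ** s * (1 - t) ** (n - s) for t in thetas}
    m = sum(joint.values())
    return (joint[theta0] / m) / prior[theta0]

def bias_against(theta0, n, reps=2000):
    # Monte Carlo estimate of (7): the probability, under data generated from
    # theta0, that evidence in favor of theta0 is NOT obtained
    count = 0
    for _ in range(reps):
        s = sum(random.random() < theta0 for _ in range(n))
        if rb(theta0, s, n) <= 1:
            count += 1
    return count / reps

# bias against theta0 = 0.5 decreases as the amount of data increases
for n in (5, 20, 80):
    print(n, bias_against(0.5, n))
```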
The model needs to be checked against the data for, if the observed data lies in the tails of every distribution in the model, then this suggests model failure. There are a wide variety of approaches to assessing this, and these are not reviewed here. One relevant comment is that, at this time, there do not seem to exist general methodologies for modifying a model when model failure is encountered.
The prior can also be checked for conflict with the data. A conflict means that the observed data are in the tails of all those distributions in the model where the prior primarily places its mass. For a minimal sufficient statistic $T$ for the model, Evans and Moshonov [20] used the tail probability

        $M_T(m_T(t) \leq m_T(T(x))),$        (9)

        where $m_T$ is the prior predictive density of $T$, to assess prior-data conflict, where a small value of (9) indicates prior-data conflict. In [21], it is shown that, under general circumstances, (9) converges to $\Pi(\pi(\theta) \leq \pi(\theta_{true}))$ as the amount of data increases. There are a variety of refinements of (9) that allow for looking at particular components of a prior to isolate where a problem with the prior may be. In [22], a method is developed for replacing a prior when a prior-data conflict has been detected. This does not mean simply replacing a prior by one that is more diffuse, however, as is demonstrated in Section 4.1.
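In small discrete problems, the tail probability (9) can be computed exactly by enumeration. As an invented illustration (only the tail probability itself is taken from [20]; the model and numbers are made up), consider a Bernoulli(θ) model with a beta(a, b) prior, for which $T$, the number of successes, is minimal sufficient and $m_T$ is the beta-binomial probability function.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def m_T(t, n, a, b):
    # prior predictive (beta-binomial) probability of t successes in n trials
    return comb(n, t) * exp(log_beta(a + t, b + n - t) - log_beta(a, b))

def conflict_tail_prob(t_obs, n, a, b):
    # the tail probability (9): the prior predictive probability of obtaining
    # a value of T whose prior predictive probability is no larger than that
    # of the observed value
    m_obs = m_T(t_obs, n, a, b)
    return sum(m_T(t, n, a, b) for t in range(n + 1) if m_T(t, n, a, b) <= m_obs)

# invented numbers: a prior concentrated near theta = 0.8 versus data far from it
print(round(conflict_tail_prob(2, 20, 16, 4), 4))    # small: prior-data conflict
print(round(conflict_tail_prob(16, 20, 16, 4), 4))   # large: no conflict
```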