Article

Relative Belief Inferences from Decision Theory

by
Michael Evans
1,*,† and
Gun Ho Jang
2,†
1
Department of Statistical Sciences, University of Toronto, Toronto, ON M5G 1Z5, Canada
2
Ontario Institute for Cancer Research, MaRS Centre, 661 University Avenue, Suite 510, Toronto, ON M5G 0A3, Canada
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Entropy 2024, 26(9), 786; https://doi.org/10.3390/e26090786
Submission received: 13 June 2024 / Revised: 30 August 2024 / Accepted: 10 September 2024 / Published: 14 September 2024
(This article belongs to the Special Issue Bayesianism)

Abstract: Relative belief inferences are shown to arise as Bayes rules or limiting Bayes rules. These inferences are invariant under reparameterizations and possess a number of optimal properties. In particular, relative belief inferences are based on a direct measure of statistical evidence.

1. Introduction

We consider a sampling model for data x, as given by a collection of densities {f(·|θ) : θ ∈ Θ} with respect to a support measure μ on a sample space X, and a proper prior, given by density π with respect to support measure ν on Θ. When the data x ∈ X are observed, these ingredients lead to the posterior distribution on Θ with density given by π(θ|x) = π(θ)f(x|θ)/m(x) with respect to support measure ν, where
m(x) = ∫_Θ π(θ)f(x|θ) ν(dθ)
is the prior predictive density of the data. In addition, there is a quantity of interest ψ = Ψ(θ), where
Ψ : Θ → Ψ(Θ) = {ψ : ψ = Ψ(θ) for some θ ∈ Θ},
for which inferences, such as an estimate ψ(x) or a hypothesis assessment H_0 : Ψ(θ) = ψ_0, are required. Let π_Ψ denote the marginal prior density of ψ and
m(x|ψ) = ∫_Θ f(x|θ) Π(dθ|ψ)
be the conditional prior predictive of the data after integrating out the nuisance parameters via the conditional prior distribution of θ given Ψ(θ) = ψ. Bayesian inferences for ψ are then based on the ingredients ({m(·|ψ) : ψ ∈ Ψ(Θ)}, π_Ψ, x) alone or by adding a loss function L. Note that the probability measure associated with m will be denoted by M and the probability measure associated with m(·|ψ) will be denoted by M(·|ψ) when these are used in the paper.
A natural question arises: how are we to determine the inferences for ψ, namely, an estimate ψ(x), or an assessment of the hypothesis H_0 : Ψ(θ) = ψ_0, based upon these ingredients? Several approaches have been put forward to answer this question. Two broad categories can be described, namely, the evidential/inferential approach and the behavioristic/decision-theoretic approach.
The evidential approach can be characterized as having the goal of letting the evidence in the data x determine the inferences and can be subdivided into frequentist, pure likelihood, and Bayesian theories. Central to this is the need to somehow characterize the concept of statistical evidence. The frequentist theory only uses the ingredients ({f(·|θ) : θ ∈ Θ}, x) together with the idea that inferences are to be graded based on their behavior in hypothetical repeated sampling experiments. Despite the impressive efforts of Allan Birnbaum to formulate a definition of statistical evidence (see [1]), it is fair to say that there is still no generally acceptable definition within the frequentist context. The pure likelihood theory is also based on the ingredients ({f(·|θ) : θ ∈ Θ}, x), but the idea of using repeated sampling characteristics to determine the inferences is dropped and the likelihood function L(·|x) : Θ → [0, ∞), defined by L(θ|x) = c f(x|θ) for an arbitrary positive constant c, is taken to be the proper characterization of statistical evidence. All inferences are then determined by the likelihood; for example, see the discussion in [2]. Again, there are gaps in this treatment, as it is unclear when the likelihood function provides evidence in favor of or against a particular value of θ being the true value, and it is unclear how the likelihood is to be used for marginal parameters ψ = Ψ(θ). The Bayesian approach based on the ingredients ({m(·|ψ) : ψ ∈ Ψ(Θ)}, π_Ψ, x) is more successful at characterizing statistical evidence concerning ψ through the principle of evidence, which, loosely speaking, says that if the data lead to the posterior probability of an event being greater than (less than) the corresponding prior probability, then there is evidence in favor of (against) the event being true. A precise statement of the principle of evidence is provided in Section 2.3.
A full theory of inference based on this idea, called relative belief, has been developed over a number of years (see [3]) and is sketched in Section 2.3. A much fuller discussion of the issues and developments within the context of the evidential approach to developing statistical theory can be found in [4].
The decision-theoretic approach can also be divided into frequentist and Bayesian theories. The frequentist approach is based on the ingredients ({f(·|θ) : θ ∈ Θ}, x) together with a loss function L : Θ × Ψ(Θ) → [0, ∞), where L(θ, Ψ(θ)) = 0 for all θ and, generally, L(θ, ψ) represents the loss or penalty incurred when ψ is chosen as the true value of Ψ(θ). The idea then is to look for a decision procedure δ(x) that performs well with respect to the average loss or risk R(θ, δ) = ∫_X L(θ, δ(x)) f(x|θ) μ(dx), namely, choose a δ that makes R(θ, δ) small uniformly in θ. The frequentist decision theory, however, is not always successful in determining a suitable δ. The Bayesian theory of decision considers the prior risk r(δ) = ∫_Θ R(θ, δ) π(θ) ν(dθ) and is generally successful in determining a δ that minimizes r(δ); such a δ is referred to as a Bayes rule. The Bayesian theory of decision has been axiomatized (see [5]), which provides considerable support for this approach.
If the various approaches to determining inferences all lead to more or less the same answers, then there would be little controversy, but unfortunately, this is not the case. The goal of this paper is to show that relative belief inferences can also be developed within the context of decision theory even though their primary motivation is through the characterization of statistical evidence. It is of historical relevance and interest that two of the founders of the statistical discipline, Fisher and Neyman, disagreed profoundly on the purpose of statistical analyses. Fisher saw the purpose of statistics as summarizing what the evidence in the observed data says about questions of scientific interest, while Neyman described the purpose in behavioristic or decision-theoretic terms, where the goal is to minimize average losses in repeated performances. Ref. [6] described this debate, which continues to be part of the statistical profession. The significance of the results of this paper is that it demonstrates that relative belief inference allows for a possible resolution of this conflict and, as we will now discuss, also resolves a general criticism of decision theory.
A natural requirement for statistical analysis is that all the ingredients chosen by the statistician need to be checkable against the observed data to ensure they align with the objective data (or, at least, are correctly collected). The choice of the model, prior, and loss functions are typically subjective decisions made by the analyst, and many consider such subjectivity to be at odds with the demands of science. However, both the model and the prior can be checked against the observed data to determine whether the choices made are contradicted by the data. Model checking has long been an acceptable, even necessary, part of statistical practice. In recent years, methods have been developed to check for a conflict between the prior and data. These methods determine if the prior placed the bulk of its mass in a region of the parameter space unsupported by the data as containing the true value of the parameter. Ref. [7] contained a discussion about checking for prior-data conflicts and also on what to do when a prior fails its checks. While such checking does not establish the objectivity of these elements, it at least allows the objective data to comment on the relevance of the choices made. However, it is unclear how one can check the loss function L using the data, and this ambiguity may be considered a flaw in decision theory, particularly for scientific applications.
There are, however, loss functions that are considered intrinsic and that avoid this criticism. For example, Ref. [8] proposed using an intrinsic loss function based on a measure of distance between sampling distributions. Ref. [9] proposed using the intrinsic loss function based on the Kullback–Leibler divergence KL(f(·|θ) ‖ f(·|θ′)) between f(·|θ) and f(·|θ′). When ψ = θ, the intrinsic loss function is given by
L(θ, θ′) = min(KL(f(·|θ) ‖ f(·|θ′)), KL(f(·|θ′) ‖ f(·|θ))).
For a marginal parameter ψ, the intrinsic loss function is L(θ, ψ) = inf_{θ′ ∈ Ψ^{−1}{ψ}} L(θ, θ′). These loss functions are intrinsic because they are based on the sampling model, allowing their suitability to be verified through model checking. The loss functions used to derive relative belief inferences for ψ = Ψ(θ) are based upon the prior π_Ψ (see Section 3 for the definitions) and so are also intrinsic and checkable against the data while checking for prior-data conflict.
In some contexts, relative belief inferences are Bayes rules, but in a general context, they are seen to arise as the limits of Bayes rules. This approach has some historical antecedents. For example, in [10], it is shown that the MLE is asymptotically a Bayes rule, but this conclusion is drawn under a fixed loss function, with increasing amounts of data and a sequence of priors. In the context discussed here, however, the amount of data is fixed, as are the model and prior, but there is a sequence of loss functions, all based on a single fixed prior. The loss functions relevant for deriving relative belief inferences are similar to those used to justify maximum a posteriori (MAP) inferences. It can be demonstrated that, under certain conditions, MAP inferences emerge as the limits of Bayes rules through a sequence of loss functions,
L_λ(θ, ψ) = I_{B_λ^c(ψ)}(Ψ(θ)),   (1)
where λ > 0, B_λ(ψ) is the ball of radius λ centered at ψ, I_A denotes the indicator function for the set A and, in the continuous case, the support measure on Ψ(Θ) is volume measure (see [11]). MAP inferences are not invariant under reparameterizations, and such invariance can be considered a desirable property of any inference method. Relative belief inferences, however, are invariant under reparameterizations.
Section 2 is concerned with describing the general characteristics of three approaches to deriving Bayesian inferences. Section 3 and Section 4 show how relative belief estimation and prediction inferences can be seen to arise from decision theory and Section 5 does this for credible regions and hypothesis assessment. In particular, it is shown here that relative belief estimators, as used in practice, are admissible. The contents of Section 3, Section 4 and Section 5 are original contributions by the authors that were derived some years ago but not published. Some of this discussion has appeared in Ref. [3] and is included here to provide a complete exposition of the relationship between relative belief and decision theory. All proofs of theorems and corollaries are in Appendix A, except for cases where Ψ ( Θ ) is finite, as these are quite straightforward and provide motivation for the more complicated contexts.
It should be emphasized that the authors do not consider the fact that relative belief inferences can be derived within the context of decision theory as the primary justification for the approach. Rather, the justification lies within the Bayesian context, which leads—via the principle of evidence—to a clear characterization of statistical evidence. The specific loss functions used, while appealing, are not essential to this characterization. The fact that relative belief inferences are consistent with two of the major themes of statistical research pursued over the years, in our view, provides substantial support for their appropriateness.

2. Bayesian Inference

Some approaches to deriving Bayesian inferences will now be described in detail.

2.1. Bayesian Decision Theory

An ingredient that is commonly added to ({f(·|θ) : θ ∈ Θ}, π, x) is a loss function, namely, L : Θ × Ψ(Θ) → [0, ∞), satisfying L(θ, ψ) = L(θ′, ψ) whenever Ψ(θ) = Ψ(θ′) and L(θ, ψ) = 0 only when ψ = Ψ(θ). The goal is to find a procedure, say δ(x) ∈ Ψ(Θ), which in some sense minimizes the loss L(θ, δ(x)) based on the joint distribution of (θ, x). Given the assumptions on L, the loss function can instead be thought of as a map L : Ψ(Θ) × Ψ(Θ) → [0, ∞) with L(ψ, ψ′) = 0 iff ψ = ψ′, and the ingredients can be represented as ({m(·|ψ) : ψ ∈ Ψ(Θ)}, π_Ψ, L, x).
The goal of a decision analysis is then to find a decision function δ : X Ψ ( Θ ) that minimizes the prior risk,
r(δ) = ∫_{Ψ(Θ)} ∫_X L(ψ, δ(x)) M(dx|ψ) Π_Ψ(dψ) = ∫_X r(δ|x) M(dx),
where r(δ|x) = ∫_{Ψ(Θ)} L(ψ, δ(x)) Π_Ψ(dψ|x) is the posterior risk. Such a δ is called a Bayes rule and, clearly, a δ that minimizes r(δ|x) for each x is a Bayes rule. Further discussion of Bayesian decision theory can be found in [12].
As noted in [9], a decision formulation also leads to credible regions for ψ , namely, a γ -lowest posterior loss-credible region is defined by
D_γ(x) = {ψ : r(ψ|x) ≤ d_γ(x)},   (2)
where d_γ(x) = inf{k : Π_Ψ({ψ : r(ψ|x) ≤ k} | x) ≥ γ}. Note that ψ in (2) is interpreted as the decision function that takes the value ψ constantly in x. Clearly, as γ → 0, the set D_γ(x) converges to the value of a Bayes rule at x. For example, with quadratic loss, the Bayes rule is given by the posterior mean and a γ-lowest posterior loss region is the smallest sphere centered at the mean containing (at least) γ of the posterior probability.
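The quadratic loss case can be checked numerically. The following sketch uses a purely illustrative posterior sample (none of these numbers come from the paper) and confirms that the posterior risk under quadratic loss is minimized near the posterior mean:

```python
import random

# Illustrative sketch: under quadratic loss L(psi, a) = (psi - a)^2, the
# posterior risk r(a|x) = E[(psi - a)^2 | x] is minimized by the posterior
# mean, so the Bayes rule is the mean and the gamma-lowest posterior loss
# region is an interval centered there. The posterior draws are hypothetical.
random.seed(1)
post = [random.gauss(2.0, 1.0) for _ in range(20000)]   # posterior sample

def posterior_risk(a):
    """Monte Carlo estimate of r(a|x) under quadratic loss."""
    return sum((p - a) ** 2 for p in post) / len(post)

grid = [i * 0.05 for i in range(81)]                    # candidate estimates in [0, 4]
a_bayes = min(grid, key=posterior_risk)                 # lands near the posterior mean
```

Because the empirical risk is exactly quadratic in the decision a, the grid minimizer is the grid point nearest the sample mean of the posterior draws.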

2.2. MAP Inferences

The highest posterior density (HPD) or MAP-based approach to determining inferences constructs credible regions of the following form
H_γ(x) = {ψ : π_Ψ(ψ|x) ≥ h_γ(x)},   (3)
where π_Ψ(·|x) is the marginal posterior density with respect to a support measure ν_Ψ on Ψ(Θ), and h_γ(x) is chosen so that h_γ(x) = sup{k : Π_Ψ({ψ : π_Ψ(ψ|x) ≥ k} | x) ≥ γ}. It follows from (3) that, to assess the hypothesis H_0 : Ψ(θ) = ψ_0, we can use the tail probability given by 1 − inf{γ : ψ_0 ∈ H_γ(x)}. Furthermore, the class of sets H_γ(x) is naturally “centered” at the posterior mode (when it exists uniquely), as H_γ(x) converges to this point as γ → 0. The use of the posterior mode as an estimator is commonly referred to as MAP estimation. We can then think of the size of the set H_γ(x), say for γ = 0.95, as a measure of how accurate the MAP estimator is in a given context. Furthermore, when Ψ(Θ) is an open subset of Euclidean space, H_γ(x) minimizes the volume among all γ-credible regions.
It is well known that HPD inferences suffer from a defect. In particular, in the continuous case, MAP inferences are not invariant under reparameterization. For example, this means that, if ψ_MAP(x) is the MAP estimate of ψ, then it is not necessarily true that Υ(ψ_MAP(x)) is the MAP estimate of τ = Υ(ψ) when Υ is a 1-1 smooth transformation. The non-invariance of a statistical procedure seems very unnatural, as it implies that the statistical analysis depends on the parameterization, and typically there does not seem to be a good reason for this. Note, too, that estimates based upon taking posterior expectations also suffer from this lack of invariance. It is also the case that MAP inferences are not based on a direct characterization of statistical evidence. Both of these issues motivate the development of relative belief inferences.
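The non-invariance of MAP under a smooth reparameterization is easy to exhibit numerically. The sketch below uses illustrative densities that are not taken from the paper: a Gamma(2, 1) "posterior" for ψ, whose mode is 1, transformed by τ = Υ(ψ) = ψ²:

```python
import math

# Sketch of MAP non-invariance (illustrative densities, assumptions only):
# posterior psi|x with density psi * exp(-psi) has mode 1. Under tau = psi^2
# the transformed density is exp(-sqrt(tau))/2 (Jacobian 1/(2*sqrt(tau))),
# which is strictly decreasing, so its mode is at 0: Upsilon(psi_MAP) = 1
# is not the MAP estimate of tau.
psi_grid = [i / 1000 for i in range(1, 10000)]
post_psi = [p * math.exp(-p) for p in psi_grid]
psi_map = psi_grid[post_psi.index(max(post_psi))]           # mode of psi

tau_grid = [p * p for p in psi_grid]
post_tau = [math.exp(-math.sqrt(t)) / 2 for t in tau_grid]  # density of tau
tau_map = tau_grid[post_tau.index(max(post_tau))]           # near 0, not 1
```

The relative belief ratio avoids this, since the Jacobian appears in both the prior and posterior densities and cancels in the ratio.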
One justification for MAP inference is decision-theoretic via the loss functions defined in (1). It is common, however, to also consider posterior probabilities of events as expressions of evidence and so think of this approach as evidential in nature. Posterior probabilities, however, express beliefs rather than evidence. For instance, the posterior probability of an event may be very small yet larger than its prior probability, indicating that the data have increased belief in the event’s occurrence. This would suggest that the data provide evidence in favor of the event being true, rather than evidence against it, even though the posterior probability remains small. It appears that evidence is better characterized by how the data change beliefs, rather than by the beliefs themselves.

2.3. Relative Belief Inferences

Relative belief inferences, like MAP inferences, are based on the ingredients ({m(·|ψ) : ψ ∈ Ψ(Θ)}, π_Ψ, x). Note that underlying both approaches is the principle (axiom) of conditional probability, which says that initial beliefs about ψ, as expressed by the prior π_Ψ, must be replaced by conditional beliefs, as expressed by the posterior π_Ψ(·|x). In this approach, however, a measure of statistical evidence is used, given by the relative belief ratio,
RB_Ψ(ψ|x) = π_Ψ(ψ|x) / π_Ψ(ψ) = m(x|ψ) / m(x).   (4)
The relative belief ratio produces the following conclusions: if RB_Ψ(ψ|x) > 1, then there is evidence in favor of ψ being the true value; if RB_Ψ(ψ|x) < 1, there is evidence against ψ being the true value; and if RB_Ψ(ψ|x) = 1, there is no evidence either way. These implications follow from a very simple principle of inference.
Principle of evidence: for the probability model (Ω, F, P), if C ∈ F with P(C) > 0 is observed to be true, then there is evidence in favor of A ∈ F being true if P(A|C) > P(A), evidence against A ∈ F being true if P(A|C) < P(A), and no evidence either way if P(A|C) = P(A).
This principle seems obvious when Π_Ψ is a discrete probability measure. For the continuous case, where Π_Ψ({ψ}) = 0, let N_ϵ(ψ) be a sequence of neighborhoods of ψ converging nicely to ψ as ϵ → 0 (see [13]); then, under weak conditions, e.g., π_Ψ continuous and positive at ψ,
lim_{ϵ→0} RB_Ψ(N_ϵ(ψ)|x) = lim_{ϵ→0} Π_Ψ(N_ϵ(ψ)|x) / Π_Ψ(N_ϵ(ψ)) = π_Ψ(ψ|x) / π_Ψ(ψ) = RB_Ψ(ψ|x)
and this justifies the general interpretation of RB_Ψ(ψ|x) as a measure of evidence. The relative belief ratio determines the inferences.
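To make the principle of evidence and the relative belief ratio concrete, here is a minimal discrete sketch in Python; the prior and likelihood values are purely hypothetical:

```python
# Minimal discrete sketch (hypothetical numbers): the relative belief ratio
# RB(psi|x) = posterior(psi|x) / prior(psi) = m(x|psi) / m(x).
prior = {"psi0": 0.7, "psi1": 0.2, "psi2": 0.1}       # prior pi_Psi
like = {"psi0": 0.05, "psi1": 0.40, "psi2": 0.30}     # m(x|psi) at the observed x

m_x = sum(prior[p] * like[p] for p in prior)          # prior predictive m(x)
posterior = {p: prior[p] * like[p] / m_x for p in prior}
rb = {p: posterior[p] / prior[p] for p in prior}      # equals like[p] / m_x

# RB > 1: data increased belief, evidence in favor; RB < 1: evidence against.
psi_rb = max(rb, key=rb.get)                          # relative belief estimate
```

Here the data move belief away from psi0 (RB < 1) and toward psi1 and psi2 (RB > 1), so the estimate with maximum evidence in its favor is psi1.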
A natural estimate of ψ is the relative belief estimate,
ψ_RB(x) = arg sup_ψ RB_Ψ(ψ|x)
as it has the maximum evidence in its favor. To assess the accuracy of ψ_RB(x), consider the plausible region Pl_Ψ(x) = {ψ : RB_Ψ(ψ|x) > 1}, which is the set of ψ values with evidence supporting them as the true value. The size of Pl_Ψ(x), together with its posterior content Π_Ψ(Pl_Ψ(x)|x), which measures the belief that the true value is in Pl_Ψ(x), provides an assessment of accuracy. So, if Pl_Ψ(x) is “small” and Π_Ψ(Pl_Ψ(x)|x) ≈ 1, then ψ_RB(x) is to be considered an accurate estimate of ψ, but not otherwise. A relative belief γ-credible region is given by
C_Ψ,γ(x) = {ψ : RB_Ψ(ψ|x) ≥ c_γ(x)},
where c_γ(x) = sup{c : Π_Ψ(RB_Ψ(ψ|x) ≥ c | x) ≥ γ}. Such a region for ψ can also be quoted provided that C_Ψ,γ(x) ⊆ Pl_Ψ(x). The containment is necessary, as otherwise, C_Ψ,γ(x) would contain a value ψ for which there is evidence against ψ being the true value.
To assess the hypothesis H_0 : Ψ(θ) = ψ_0, the value RB_Ψ(ψ_0|x) indicates whether there is evidence in favor of or against H_0. The strength of this evidence can be measured by the posterior probability Π_Ψ({ψ_0}|x), as this measures the belief in what the evidence says. So, if RB_Ψ(ψ_0|x) > 1 and Π_Ψ({ψ_0}|x) ≈ 1, then there is strong evidence that H_0 is true, while when RB_Ψ(ψ_0|x) < 1 and Π_Ψ({ψ_0}|x) ≈ 0, there is strong evidence that H_0 is false. Since Π_Ψ({ψ_0}) can be small, even 0 in the continuous case, it makes more sense to measure the strength of the evidence in such a case by
Str_Ψ(ψ_0|x) = Π_Ψ(RB_Ψ(ψ|x) ≤ RB_Ψ(ψ_0|x) | x).
If RB_Ψ(ψ_0|x) > 1 and Str_Ψ(ψ_0|x) ≈ 1, then the evidence is strong that ψ_0 is the true value, as there is little belief that the true value of ψ has more evidence in its favor than ψ_0. If RB_Ψ(ψ_0|x) < 1 and Str_Ψ(ψ_0|x) ≈ 0, then the evidence is strong that ψ_0 is not the true value, as there is widespread belief that the true value of ψ has more evidence in its favor than ψ_0. There is no reason to quote a single number to measure the strength; both Π_Ψ({ψ_0}|x) and Str_Ψ(ψ_0|x) can be quoted when relevant.
An important aspect of both Str_Ψ(ψ_0|x) and Π_Ψ(Pl_Ψ(x)|x) is what happens as the amount of data increases. To ensure that these behave appropriately, namely, Str_Ψ(ψ_0|x) → 0 (or 1) when H_0 is false (or true) and Π_Ψ(Pl_Ψ(x)|x) → 1, it is necessary to take into account the difference that matters, δ. By this, we mean that there is a distance measure d_Ψ on Ψ(Θ) × Ψ(Θ) such that, if d_Ψ(ψ, ψ′) ≤ δ, then in terms of the application, these values are considered equivalent. Such a δ always exists because measurements are always taken to finite accuracy. For example, if ψ is real-valued, then there is a grid of values …, ψ_{−2}, ψ_{−1}, ψ_0, ψ_1, ψ_2, …, separated by δ, and inferences are determined using the relative belief ratios of the intervals [ψ_i − δ/2, ψ_i + δ/2). In effect, H_0 is now H_0 : Ψ(θ) ∈ [ψ_0 − δ/2, ψ_0 + δ/2). When the computations are carried out in this way, Str_Ψ(ψ_0|x) and Π_Ψ(Pl_Ψ(x)|x) behave as required. As a particular instance of this, see the results in Section 4, where such discretization plays a key role.
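A small sketch of this discretization, with an assumed normal prior and posterior for ψ (values purely illustrative): the relative belief ratio of each interval [ψ_i − δ/2, ψ_i + δ/2) is the ratio of its posterior to its prior probability, computed from the two cdfs:

```python
import math

# Illustrative discretization by the difference that matters, delta.
# The prior Pi_Psi = N(0, 2^2) and posterior Pi_Psi(.|x) = N(0.3, 0.2^2)
# are assumptions for the sketch, not quantities from the paper.
delta = 0.1

def ncdf(z, m, s):
    """N(m, s^2) cdf via the error function."""
    return 0.5 * (1 + math.erf((z - m) / (s * math.sqrt(2))))

def rb_interval(i):
    """Relative belief ratio of [i*delta - delta/2, i*delta + delta/2)."""
    lo, hi = i * delta - delta / 2, i * delta + delta / 2
    post = ncdf(hi, 0.3, 0.2) - ncdf(lo, 0.3, 0.2)   # posterior probability
    prior = ncdf(hi, 0.0, 2.0) - ncdf(lo, 0.0, 2.0)  # prior probability
    return post / prior

# Intervals near where the posterior concentrates carry evidence in favor
# (RB > 1); intervals far from the data carry evidence against (RB < 1).
```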
It is easy to see that the class of relative belief credible regions {C_Ψ,γ(x) : γ ∈ [0, 1]} for ψ is independent of the marginal prior π_Ψ. When a value γ ∈ [0, 1] is specified, however, the set C_Ψ,γ(x) depends on π_Ψ through c_γ(x). So, the form of relative belief inferences about ψ is completely robust to the choice of π_Ψ, but the quantification of the uncertainty in the inferences is not. For example, when ψ = Ψ(θ) = θ, then θ_RB(x) is the MLE; more generally, ψ_RB(x) is the maximizer of the integrated likelihood m(x|ψ). Similarly, relative belief regions are likelihood regions in the case of the full parameter and integrated likelihood regions otherwise. As such, likelihood regions can be seen as essentially Bayesian in character, with a clear and precise characterization of evidence through the relative belief ratio, and they now have probability assignments through the posterior. A relative belief ratio RB_Ψ(ψ|x), while proportional to an integrated likelihood, cannot be multiplied by an arbitrary positive constant, as a likelihood can, without losing its interpretation as a measure of statistical evidence. It has been established in [14] that relative belief inferences for ψ are optimally robust to the prior π_Ψ.
As can be seen from (4), relative belief inferences are always invariant under smooth reparameterizations; this is at least one reason why they are preferable to MAP inferences. Any rule for measuring evidence that satisfies the principle of evidence also produces valid estimates, as these lie in Pl_Ψ(x) and so will have the same “accuracy” as ψ_RB(x). For example, if, instead of the relative belief ratio, the difference π_Ψ(ψ|x) − π_Ψ(ψ) is used as the measure of evidence with a cut-off of 0, then this satisfies the principle of evidence, but the estimate is no longer necessarily invariant under reparameterizations. The Bayes factor with a cut-off of 1 is also a valid measure of evidence, but there are a number of reasons why the relative belief ratio is to be preferred to the Bayes factor for general inferences (see [15]).
We will now consider a simple example that illustrates the various concepts just discussed.
Example 1. 
Location normal
Suppose that we have a sample x = (x_1, …, x_n) from an N(θ, σ_0²) distribution, where the mean θ ∈ Θ = R¹ is unknown and the variance σ_0² is assumed known. Suppose interest lies in making inferences about ψ = Ψ(θ) = θ, and the prior π on θ is given by an N(θ_0, τ_0²) distribution. In this context, x̄ serves as a minimal sufficient statistic, allowing the focus to be restricted to the N(θ, σ_0²/n) model while ignoring the remaining aspects of the data, at least for inference; certainly, the residuals (x_1 − x̄, …, x_n − x̄) are relevant for model checking. The prior predictive density m of x̄ is then given by the density of an N(θ_0, τ_0² + σ_0²/n) distribution and, as discussed in [7], this is relevant for checking for prior-data conflict via the tail probability M(m(X̄) ≤ m(x̄)), with small values indicating the existence of a conflict.
The posterior Π ( · | x ) of θ is given by
θ | x̄ ~ N((n/σ_0² + 1/τ_0²)^{−1}(n x̄/σ_0² + θ_0/τ_0²), (n/σ_0² + 1/τ_0²)^{−1}).
As such, θ_MAP(x̄) = (n/σ_0² + 1/τ_0²)^{−1}(n x̄/σ_0² + θ_0/τ_0²). If, as is common, squared error loss is employed, then the Bayes rule for estimating θ is given by θ_MAP(x̄), as this is also the posterior mean. On the other hand, the relative belief ratio is given by
RB(θ|x̄) = π(θ|x̄)/π(θ) = m(x̄|θ)/m(x̄) = ((τ_0² + σ_0²/n)/(σ_0²/n))^{1/2} exp(−n(x̄ − θ)²/2σ_0²) / exp(−(τ_0² + σ_0²/n)^{−1}(x̄ − θ_0)²/2)
as, since there are no nuisance parameters, m ( · | θ ) equals the sampling density of x ¯ .
From this, it is immediate that θ_RB(x̄) = x̄, which is the MLE, a result that is generally true for relative belief when estimating the full model parameter. The plausible interval for θ is then, putting r_0 = (τ_0² + σ_0²/n)/(σ_0²/n), given by
Pl(x̄) = {θ : RB(θ|x̄) > 1} = x̄ ± σ_0 (((τ_0² + σ_0²/n)^{−1}(x̄ − θ_0)² + log(r_0))/n)^{1/2}
and note that log(r_0) > 0, so this interval is always defined. The length of Pl(x̄) and its posterior content, computed using Π(·|x), provide a measure of the accuracy of θ_RB(x̄). Notice that Pl(x̄) converges almost surely to the degenerate interval consisting of the true value of θ, and the posterior content of the interval converges to 1, as n → ∞.
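These closed forms can be checked numerically. The following sketch (σ_0² = 1, τ_0² = 4, θ_0 = 0, n = 25, x̄ = 0.3 are illustrative choices, not values from the paper) evaluates RB(θ|x̄) and the plausible interval, and confirms that RB equals 1 at the interval's endpoints:

```python
import math

# Location-normal sketch with illustrative settings (assumptions, not from
# the paper): known variance sigma0_sq, prior N(theta0, tau0_sq).
sigma0_sq, tau0_sq, theta0 = 1.0, 4.0, 0.0
n, xbar = 25, 0.3

v = tau0_sq + sigma0_sq / n               # prior predictive variance of xbar
r0 = v / (sigma0_sq / n)                  # r0 > 1, so log(r0) > 0

def rb(theta):
    """RB(theta|xbar) = m(xbar|theta)/m(xbar) for the location-normal model."""
    num = math.exp(-n * (xbar - theta) ** 2 / (2 * sigma0_sq))
    den = math.exp(-((xbar - theta0) ** 2) / (2 * v))
    return math.sqrt(r0) * num / den

# Plausible interval Pl(xbar) = {theta : RB(theta|xbar) > 1}.
half = math.sqrt(sigma0_sq * ((xbar - theta0) ** 2 / v + math.log(r0)) / n)
lo, hi = xbar - half, xbar + half
```

By construction, rb(lo) = rb(hi) = 1, and rb attains its maximum at x̄, consistent with θ_RB(x̄) = x̄.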
To assess a hypothesis, say H 0 : θ = θ 0 , the relevant relative belief ratio is as follows:
RB(θ_0|x̄) = r_0^{1/2} exp(−((r_0 − 1)/r_0)(n(x̄ − θ_0)²/2σ_0²)).
This gives evidence in favor (or against) H 0 when
n(x̄ − θ_0)²/σ_0² < (>) r_0 log(r_0)/(r_0 − 1)
and note that the right-hand side is always positive. The strength of this evidence is given by
Str(θ_0|x) = Π(RB(θ|x) ≤ RB(θ_0|x) | x) = Π(|x̄ − θ| ≥ |x̄ − θ_0| | x) = 1 − F(x̄ + |x̄ − θ_0|; (n/σ_0² + 1/τ_0²)^{−1}(n x̄/σ_0² + θ_0/τ_0²), (n/σ_0² + 1/τ_0²)^{−1}) + F(x̄ − |x̄ − θ_0|; (n/σ_0² + 1/τ_0²)^{−1}(n x̄/σ_0² + θ_0/τ_0²), (n/σ_0² + 1/τ_0²)^{−1}),
where F(·; μ, λ) denotes the N(μ, λ) cdf. As n → ∞, the strength converges to 0 (the strongest possible evidence against) when H_0 is false and converges to 1 (the strongest possible evidence in favor) when H_0 is true.
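With illustrative settings σ_0² = 1, τ_0² = 4, θ_0 = 0, n = 25, x̄ = 0.3 (assumptions, not values from the paper), this sketch evaluates RB(θ_0|x̄), the evidence cut-off, and the strength of the evidence:

```python
import math

# Sketch of the hypothesis assessment H0: theta = theta0 in the
# location-normal example; the numerical settings are illustrative.
sigma0_sq, tau0_sq, theta0 = 1.0, 4.0, 0.0
n, xbar = 25, 0.3

v = tau0_sq + sigma0_sq / n
r0 = v / (sigma0_sq / n)                      # r0 > 1, so log(r0) > 0

# RB(theta0|xbar) = r0^(1/2) * exp(-((r0-1)/r0) * n*(xbar-theta0)^2/(2*sigma0^2))
rb0 = math.sqrt(r0) * math.exp(-((r0 - 1) / r0) * n * (xbar - theta0) ** 2
                               / (2 * sigma0_sq))

# Evidence in favor iff n*(xbar-theta0)^2/sigma0^2 < r0*log(r0)/(r0-1).
lhs = n * (xbar - theta0) ** 2 / sigma0_sq
rhs = r0 * math.log(r0) / (r0 - 1)

# Strength: posterior probability that |xbar - theta| >= |xbar - theta0|.
prec = n / sigma0_sq + 1 / tau0_sq            # posterior precision
mu, sd = (n * xbar / sigma0_sq + theta0 / tau0_sq) / prec, prec ** -0.5

def F(z):
    """Posterior N(mu, sd^2) cdf."""
    return 0.5 * (1 + math.erf((z - mu) / (sd * math.sqrt(2))))

d = abs(xbar - theta0)
strength = 1 - (F(xbar + d) - F(xbar - d))
```

With these settings, RB(θ_0|x̄) ≈ 3.3 > 1 (evidence in favor), while the strength is modest, so a single data set of this size provides only weak support for H_0.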

3. Estimation: Discrete Parameter Space

The following theorem presents the basic definition of the loss function for the parameter of interest ψ = Ψ(θ) when the set of possible values of ψ, namely, Ψ(Θ) = {ψ : ψ = Ψ(θ) for some θ ∈ Θ}, is finite, and it establishes an important optimality result. The indicator function for a set A is denoted by I_A.
Theorem 1. 
Suppose that π_Ψ(ψ) > 0 for every ψ ∈ Ψ(Θ) and that Ψ(Θ) is finite with ν_Ψ equal to counting measure on Ψ(Θ). Then, for the loss function
L_RB(θ, ψ) = I_{{ψ}^c}(Ψ(θ)) / π_Ψ(Ψ(θ)),   (5)
the relative belief estimator ψ_RB is a Bayes rule.
Proof. 
We have that
r(δ|x) = ∫_{Ψ(Θ)} (I_{{δ(x)}^c}(ψ)/π_Ψ(ψ)) Π_Ψ(dψ|x) = ∫_{Ψ(Θ)} RB_Ψ(ψ|x) ν_Ψ(dψ) − RB_Ψ(δ(x)|x).   (6)
Since Ψ(Θ) is finite, the first term in (6) is finite and a Bayes rule at x is given by the value δ(x) that maximizes the second term. Therefore, ψ_RB is a Bayes rule. □
The loss function L_RB seems very natural. Beliefs about the true value of ψ are expressed by the prior π_Ψ. As such, consider values ψ where π_Ψ(ψ) is very low and that are indeed false. It would be misleading if the inferences suggested such a value as being true, so it is appropriate for such values to bear large losses. In a sense, the statistician acknowledges what such values are through the choice of the prior. Of course, the prior may be wrong in the sense that the bulk of its mass is placed in a region where the true value of ψ does not lie. This is why checking for prior-data conflict, before conducting inference, is always recommended. Procedures for checking priors were discussed in [16,17], and an approach to replacing priors found to be at fault was developed in [7]. The loss function L_RB motivates the other losses for relative belief discussed here, making this comment relevant to those losses as well.
The prior risk of δ satisfies
r(δ) = ∫_{Ψ(Θ)} ∫_X (I_{{δ(x)}^c}(ψ)/π_Ψ(ψ)) M(dx|ψ) Π_Ψ(dψ) = Σ_{ψ∈Ψ(Θ)} M(δ(x) ≠ ψ | ψ),   (7)
where M(·|ψ) is the conditional prior predictive probability measure of the data given ψ, so (7) is the sum of the conditional prior error probabilities over all ψ values. If instead the loss function is taken to be L_MAP(θ, ψ) = I_{{ψ}^c}(Ψ(θ)), as in (1), then the same proof used in Theorem 1 establishes that ψ_MAP is a Bayes rule with respect to this loss, and the prior risk is given by
Σ_{ψ∈Ψ(Θ)} M(δ(x) ≠ ψ | ψ) π_Ψ(ψ),   (8)
which represents the prior probability of making an error. Both L_MAP and L_RB are two-valued loss functions but, when an incorrect decision is made, the loss is constant in Ψ(θ) for L_MAP, while it equals the reciprocal of the prior probability of Ψ(θ) for L_RB. So, L_RB penalizes an incorrect decision much more severely when the true value of Ψ(θ) is in the tails of the prior. Note that ψ_MAP = ψ_RB when Π_Ψ is uniform. It is evident that (7) serves as an upper bound to (8), indicating that controlling losses based on L_RB automatically controls the losses based on L_MAP.
As already noted, RB_Ψ(ψ|x) is proportional to the integrated likelihood of ψ. So, under the conditions of Theorem 1, the maximum integrated likelihood estimator is a Bayes rule. Furthermore, the Bayes rule is the same for every choice of π_Ψ and depends on the full prior only through the conditional prior Π(·|ψ) placed on the nuisance parameters. When ψ = θ, then θ_RB(x) is the MLE of θ, and so the MLE of θ is a Bayes rule for every prior π.
Note that, when Ψ(Θ) = {ψ_0, ψ_1}, then RB_Ψ(ψ_0|x) > (<) 1 iff RB_Ψ(ψ_1|x) < (>) 1, so ψ_RB(x) = ψ_0 when RB_Ψ(ψ_0|x) > 1, and ψ_RB(x) = ψ_1 otherwise. This is the classical context for hypothesis testing, where ψ_RB(x) = ψ_0 can be viewed as acceptance of the hypothesis H_0 : θ ∈ Ψ^{−1}{ψ_0}, and ψ_RB(x) = ψ_1 as rejection of H_0. Theorem 1 establishes that relative belief offers a Bayes rule for the hypothesis testing problem.
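In the finite case of Theorem 1, the optimality of ψ_RB can be checked directly by enumeration. The sketch below (hypothetical prior and likelihood values) computes the posterior risk under L_RB for each possible decision and confirms that maximizing the relative belief ratio minimizes it:

```python
# Finite-parameter sketch (hypothetical numbers): check that the relative
# belief estimate maximizing RB(psi|x) minimizes the posterior risk under
# the loss L_RB, whose penalty for a wrong decision is 1/prior(true value).
prior = {"a": 0.5, "b": 0.3, "c": 0.2}
like = {"a": 0.10, "b": 0.60, "c": 0.25}          # m(x|psi) at the observed x

m_x = sum(prior[p] * like[p] for p in prior)
post = {p: prior[p] * like[p] / m_x for p in prior}
rb = {p: post[p] / prior[p] for p in prior}       # relative belief ratios

def risk(decision):
    """Posterior risk r(decision|x) under L_RB: sum of RB over wrong values."""
    return sum(post[p] / prior[p] for p in prior if p != decision)

psi_rb = max(rb, key=rb.get)                      # relative belief estimate
bayes = min(prior, key=risk)                      # Bayes rule at x
```

Since the posterior risk equals the total of the relative belief ratios minus the ratio of the chosen value, minimizing the risk and maximizing RB pick the same value, as in the proof of Theorem 1.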
The loss function (5) does not provide meaningful results when Ψ ( Θ ) is infinite as (7) shows that r ( δ ) will be infinite. So, we modify (5) via a parameter η > 0 and define the loss function as follows,
$$L_{RB,\eta}(\theta,\psi)=\frac{I_{\{\psi\}^{c}}(\Psi(\theta))}{\max(\eta,\pi_{\Psi}(\Psi(\theta)))}.$$
Note that L R B , η is bounded by 1 / η . This loss function is like (5) but does not allow for arbitrarily large losses. The following result shows that we can restrict attention to values of η that are sufficiently small.
Theorem 2. 
Suppose that π_Ψ(ψ) > 0 for every ψ ∈ Ψ(Θ), where Ψ(Θ) is countable with ν_Ψ equal to counting measure, and that ψ_RB(x) is the unique maximizer of RB_Ψ(ψ|x) for all x. For the loss function (9) and Bayes rule δ_η, then δ_η(x) → ψ_RB(x) as η → 0, for every x ∈ X.
The proof of Theorem 2 also establishes the following result.
Corollary 1. 
For all sufficiently small η, the value of a Bayes rule at x is given by ψ R B ( x ) .
The following is an immediate consequence of Theorem 1 and Corollary 1 as ψ R B is a Bayes rule.
Corollary 2. 
ψ R B is an admissible estimator with respect to the loss function L R B when Ψ ( Θ ) is finite, and with respect to loss L R B , η , when η is sufficiently small, and Ψ ( Θ ) is countable.
In a general estimation problem, δ is risk-unbiased with respect to a loss function L if E_θ(L(θ, δ(x))) ≤ E_θ(L(θ′, δ(x))) for all θ, θ′ ∈ Θ. This says that, on average, δ(x) is closer to the true value than to any false value when we interpret L(θ′, δ(x)) as a measure of distance between δ(x) and Ψ(θ′). A definition of Bayesian unbiasedness for δ with respect to L is given by the inequality
$$\int_{\Theta}\int_{\Theta}E_{\theta}(L(\theta',\delta(x)))\,\Pi(d\theta')\,\Pi(d\theta)\ \geq\ \int_{\Theta}E_{\theta}(L(\theta,\delta(x)))\,\Pi(d\theta)=r(\delta),$$
as this retains the idea of being closer, on average, to the true value than a false value. We will now consider a family of loss functions defined as follows,
$$L(\theta,\psi)=I_{\{\psi\}^{c}}(\Psi(\theta))\,h(\Psi(\theta)),$$
where h is a nonnegative function satisfying ∫_Θ h(Ψ(θ)) Π(dθ) < ∞. This includes L_RB and L_MAP when Ψ(Θ) is finite, as well as L_{RB,η}.
Theorem 3. 
If Ψ ( Θ ) is finite or countable, then ψ R B ( x ) is Bayesian-unbiased under the loss function (10).
Suppose that, after observing x, there is a need to predict a future (or concealed) value y ∈ Y, where y ∼ g_{δ(θ)}(·|x), a density with respect to the support measure μ_Y on Y, and it is assumed that the true value of θ in the model for x gives the true value of δ(θ). The prior predictive density of y is given by q(y) = ∫_Θ ∫_X π(θ) f(x|θ) g_{δ(θ)}(y|x) μ(dx) ν(dθ), while the posterior predictive density is q(y|x) = ∫_Θ π(θ|x) g_{δ(θ)}(y|x) ν(dθ). The relative belief ratio for a future value y is, thus, RB_Y(y|x) = q(y|x)/q(y), and the relative belief prediction, namely, the value that maximizes RB_Y(·|x), is denoted by y_RB(x). When Y is finite, using the same argument as in Theorem 1, y_RB is a Bayes rule under the loss function L_RB(y, y′) = I_{{y′}^c}(y)/q(y). Also, it can be demonstrated that y_RB is a limit of Bayes rules when Y is countable.
We will now consider a common application where Ψ ( Θ ) is finite.
Example 2. 
Classification
For a classification problem, there are k categories { ψ 1 , , ψ k } , prescribed by a function Ψ , where π Ψ ( ψ i ) > 0 for each i . Estimating ψ is then equivalent to classifying the data as having come from one of the distributions in the classes specified by Ψ 1 { ψ i } . The standard Bayesian solution to this problem is to use ψ M A P ( x ) as the classifier. From (8), we have that ψ M A P ( x ) minimizes the prior probability of misclassification, while from (7), ψ R B ( x ) minimizes the sum of the probabilities of misclassification. The essence of the difference is that ψ R B ( x ) treats the misclassification errors equally while ψ M A P ( x ) weights the errors by their prior probabilities.
The following shows that minimizing the sum of error probabilities is often more appropriate than minimizing the weighted sum. Suppose that k = 2 and x ∼ Bernoulli(ψ_0) or x ∼ Bernoulli(ψ_1), with π(ψ_0) = 1 − ϵ and π(ψ_1) = ϵ representing the known proportions of individuals coming from population 0 or 1. For example, consider ψ_0 as the probability of a positive diagnostic test for a disease in the non-diseased population, while ψ_1 is this probability for the diseased population. Suppose that ψ_0/ψ_1 is very small, indicating that the test is successful at identifying the disease while not yielding many false positives, and that ϵ is very small, so the disease is rare. The problem is then to assign a randomly chosen individual to one of the populations based on their test result.
The posterior is given by π(ψ_0|1) = ψ_0(1 − ϵ)/(ψ_0(1 − ϵ) + ψ_1 ϵ) and π(ψ_0|0) = (1 − ψ_0)(1 − ϵ)/((1 − ψ_0)(1 − ϵ) + (1 − ψ_1)ϵ). Therefore,
$$\psi_{MAP}(1)=\begin{cases}\psi_{0} & \text{if }\psi_{0}/\psi_{1}>\epsilon/(1-\epsilon)\\ \psi_{1} & \text{otherwise,}\end{cases}\qquad \psi_{MAP}(0)=\begin{cases}\psi_{0} & \text{if }(1-\psi_{0})/(1-\psi_{1})>\epsilon/(1-\epsilon)\\ \psi_{1} & \text{otherwise.}\end{cases}$$
This implies that ψ_MAP will always classify a person to the non-diseased population when ϵ is small enough, e.g., when ψ_0 = 0.05, ψ_1 = 0.80, and ϵ/(1 − ϵ) < 0.0625, i.e., ϵ < 1/17 ≈ 0.059. In contrast, in this situation, ψ_RB always classifies an individual with a positive test to the diseased population and one with a negative test to the non-diseased population. Since M(·|ψ_i) is the Bernoulli(ψ_i) distribution, when ψ_0 < ψ_1 and ϵ is small enough, we have the following:
$$M(\psi_{MAP}\neq\psi_{0}\,|\,\psi_{0})+M(\psi_{MAP}\neq\psi_{1}\,|\,\psi_{1})=0+1=1,\qquad M(\psi_{RB}\neq\psi_{0}\,|\,\psi_{0})+M(\psi_{RB}\neq\psi_{1}\,|\,\psi_{1})=\psi_{0}+(1-\psi_{1})=0.25.$$
This clearly illustrates the difference between these two procedures as ψ R B does better than ψ M A P on the diseased population when ψ 0 is small and ψ 1 is large, as would be the case for a good diagnostic. Of course, ψ M A P minimizes the overall error rate, but at the price of ignoring the most important class in this problem, namely, those who have the disease. Note that this example can be extended to the situation where we need to estimate the ψ i based on samples from the respective populations, but this will not materially affect the overall conclusions.
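The error calculation above is easy to verify directly. The following sketch (all names hypothetical, with the assumed values ψ_0 = 0.05, ψ_1 = 0.80, and ϵ = 0.05) computes both classifiers for a single Bernoulli observation and then the sums of the conditional misclassification probabilities:

```python
# Illustrative check of Example 2 with the assumed values
# psi0 = 0.05, psi1 = 0.80, eps = 0.05 (all names hypothetical).
psi = [0.05, 0.80]        # P(positive test | population i)
eps = 0.05                # prior probability of the diseased population
prior = [1 - eps, eps]

def classify(x):
    """Return (psi_MAP, psi_RB) classifications of a single test result x."""
    like = [p ** x * (1 - p) ** (1 - x) for p in psi]
    post = [prior[i] * like[i] for i in range(2)]   # unnormalized posterior
    rb = [post[i] / prior[i] for i in range(2)]     # proportional to the RB ratio
    return max(range(2), key=lambda i: post[i]), max(range(2), key=lambda i: rb[i])

def error_sum(idx):
    """Sum over i of the conditional error probabilities M(delta != psi_i | psi_i)."""
    total = 0.0
    for i in range(2):
        for x, p in [(1, psi[i]), (0, 1 - psi[i])]:
            if classify(x)[idx] != i:
                total += p
    return total

print(error_sum(0))   # MAP classifier: 0 + 1, it always chooses non-diseased
print(error_sum(1))   # RB classifier: approximately psi0 + (1 - psi1) = 0.25
```

Here ϵ = 0.05 < 1/17, so the MAP rule ignores the diseased class entirely, while the RB rule follows the test result.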
We will now consider a situation where (x, c) is such that x|c ∼ f_c and c|ϵ ∼ Bernoulli(ϵ), where f_0 and f_1 are known but ϵ is unknown with prior π. This is a generalization of the previous discussion, where ϵ was assumed known. Then, based on a sample (x_1, c_1), …, (x_n, c_n) from the joint distribution, the goal is to predict the value c_{n+1} for a newly observed x_{n+1}.
The prior of c is q(c) = ∫_0^1 (1 − ϵ)^{1−c} ϵ^c π(ϵ) dϵ, and if ϵ ∼ beta(α, β), the prior predictive of c_{n+1} is Bernoulli(α/(α + β)). With c̄ = n^{-1} Σ_{i=1}^n c_i, the posterior predictive density of c_{n+1} satisfies
$$q(c\,|\,(x_{1},c_{1}),\dots,(x_{n},c_{n}),x_{n+1})\propto(f_{0}(x_{n+1}))^{1-c}(f_{1}(x_{n+1}))^{c}\int_{0}^{1}\epsilon^{n\bar{c}+c}(1-\epsilon)^{n(1-\bar{c})+(1-c)}\pi(\epsilon)\,d\epsilon\propto f_{c}(x_{n+1})\,\Gamma(\alpha+n\bar{c}+c)\,\Gamma(\beta+n(1-\bar{c})+1-c).$$
It follows that, suppressing the dependence on the data, we have the following,
$$c_{MAP}=\begin{cases}1 & \text{if }\dfrac{f_{1}(x_{n+1})}{f_{0}(x_{n+1})}\cdot\dfrac{\alpha+n\bar{c}}{\beta+n(1-\bar{c})}>1\\ 0 & \text{otherwise,}\end{cases}\qquad c_{RB}=\begin{cases}1 & \text{if }\dfrac{f_{1}(x_{n+1})}{f_{0}(x_{n+1})}\cdot\dfrac{\beta\,(\alpha+n\bar{c})}{\alpha\,(\beta+n(1-\bar{c}))}>1\\ 0 & \text{otherwise.}\end{cases}$$
Note that c M A P and c R B are identical whenever α = β .
From these formulas, it is apparent that a substantial difference will arise between c_MAP and c_RB when either α or β is much bigger than the other. As in Example 2, these correspond to situations where we believe that ϵ or 1 − ϵ is very small. Suppose we take α = 1 and let β be relatively large, as this corresponds to knowing a priori that ϵ is very small. Then (11) implies that c_MAP ≤ c_RB, and so c_RB = 1 whenever c_MAP = 1. A similar conclusion, with the roles of the two classes reversed, arises when we take β = 1 and α relatively large.
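The two decision rules are straightforward to implement. The sketch below (illustrative names; it assumes the forms in (11) with f_0 a N(0,1) density and f_1 a N(μ,1) density, as in the simulation study described next) returns both classifications:

```python
import math

def classifiers(x_new, cbar, n, alpha, beta, mu=1.0):
    """Sketch of c_MAP and c_RB from (11), assuming f0 = N(0,1), f1 = N(mu,1).

    cbar is the observed proportion of c_i = 1 among the first n cases.
    """
    # Likelihood ratio f1(x_new) / f0(x_new); the 1/sqrt(2*pi) factors cancel.
    lr = math.exp(-0.5 * (x_new - mu) ** 2 + 0.5 * x_new ** 2)
    post_odds = lr * (alpha + n * cbar) / (beta + n * (1 - cbar))
    rb_odds = post_odds * beta / alpha   # divide out the prior predictive odds
    return int(post_odds > 1), int(rb_odds > 1)
```

For example, with α = 1, β = 32, c̄ = 0, and n = 10, a large observation such as x_{n+1} = 3 gives c_MAP = 0 but c_RB = 1: the RB rule can still flag the rare class, while the MAP rule cannot.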
To see what kind of improvement is possible, we consider a simulation study. Let f_0 be a N(0,1) density, let f_1 be a N(μ,1) density, let n = 10, and take the prior on ϵ to be beta(1, β). Table 1 presents the Bayes risks for c_MAP and c_RB for various choices of β when μ = 1. When β = 1, they are equivalent, but we see that as β rises, the performance of c_MAP deteriorates while that of c_RB improves. Large values of β correspond to having information that ϵ is small. When β = 14, about 0.50 of the prior probability is to the left of 0.05; with β = 32, about 0.80 of the prior probability is to the left of 0.05; and with β = 100, about 0.99 of the prior probability is to the left of 0.05. We see that the misclassification rates for the small group (c = 1) stay about the same for c_RB as β increases, while they deteriorate markedly for c_MAP, as the MAP procedure basically ignores the small group.
We also investigated other choices for n and μ . There is very little change as n increases. When μ moves toward 0 and μ moves away from 0, the error rates go up and go down, as one would expect; c R B always dominates c M A P .
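A study of this kind can be sketched as follows. This is an illustrative Monte Carlo along the lines described in the text, not a reproduction of Table 1; the replication count, seed, and function names are assumptions.

```python
import math
import random

def mc_error_rates(beta_par, n=10, mu=1.0, reps=20000, seed=7):
    """Estimate misclassification rates of c_MAP and c_RB when alpha = 1.

    Assumed setup, following the text: eps ~ beta(1, beta_par),
    c_i | eps ~ Bernoulli(eps), x_i | c_i ~ N(c_i * mu, 1);
    predict c_{n+1} from the first n pairs and x_{n+1}.
    """
    rng = random.Random(seed)
    err_map = err_rb = 0
    for _ in range(reps):
        eps = rng.betavariate(1.0, beta_par)
        c = [1 if rng.random() < eps else 0 for _ in range(n + 1)]
        x_new = rng.gauss(mu * c[n], 1.0)
        cbar = sum(c[:n]) / n
        lr = math.exp(-0.5 * (x_new - mu) ** 2 + 0.5 * x_new ** 2)
        post_odds = lr * (1 + n * cbar) / (beta_par + n * (1 - cbar))
        c_map = int(post_odds > 1)
        c_rb = int(post_odds * beta_par > 1)   # alpha = 1
        err_map += (c_map != c[n])
        err_rb += (c_rb != c[n])
    return err_map / reps, err_rb / reps
```

With beta_par = 1 the two rules coincide exactly; raising beta_par separates them, with the RB rule retaining its ability to detect the rare class.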

4. Estimation: Continuous Parameter Space

When ψ has a continuous prior distribution, the argument in Theorem 2 does not work, as Π_Ψ({δ(x)}|x) = 0. There are several possible ways to proceed, but one approach is to discretize the problem and then apply Theorem 2. For this, we will assume that the spaces involved are locally Euclidean, that the mappings are sufficiently smooth, and take the support measures to be the analogs of Euclidean volume on the respective spaces. While the argument presented is broadly applicable, it is simplified here by assuming that all spaces are open subsets of Euclidean spaces, with the support measures being Euclidean volume on these sets.
For each λ > 0, suppose there is a discretization {B_λ(ψ) : ψ ∈ Ψ(Θ)} of Ψ(Θ) into a countable number of subsets with the following properties: ψ ∈ B_λ(ψ), Π_Ψ(B_λ(ψ)) > 0, and sup_{ψ∈Ψ} diam(B_λ(ψ)) → 0 as λ → 0. So, if ψ′ ∈ B_λ(ψ), then B_λ(ψ′) = B_λ(ψ). For example, the B_λ(ψ) could be equal-volume rectangles in R^k. Further, we assume that Π_Ψ(B_λ(ψ))/ν_Ψ(B_λ(ψ)) → π_Ψ(ψ) as λ → 0 for every ψ. This will hold whenever π_Ψ is continuous everywhere and B_λ(ψ) converges nicely to {ψ} as λ → 0. Let ψ_λ(ψ) denote a point in B_λ(ψ) such that ψ_λ(ψ′) = ψ_λ(ψ) whenever ψ′ ∈ B_λ(ψ), and put Ψ_λ = {ψ_λ(ψ) : ψ ∈ Ψ(Θ)}. So, Ψ_λ is a discretized version of Ψ(Θ). We will call this a regular discretization of Ψ(Θ). The discretized prior on Ψ_λ is π_{Ψ,λ}(ψ_λ(ψ)) = Π_Ψ(B_λ(ψ)) and the discretized posterior is π_{Ψ,λ}(ψ_λ(ψ)|x) = Π_Ψ(B_λ(ψ)|x).
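A small sketch makes the construction concrete. The example below is an assumption (not from the paper): ψ has a N(0,1) prior and a N(1, 0.5²) posterior, the B_λ(ψ) are intervals of length λ, and the discretized relative belief ratios Π_Ψ(B_λ(ψ)|x)/Π_Ψ(B_λ(ψ)) are maximized over the grid. As λ → 0, the maximizing cell midpoint approaches the continuous maximizer, which here is m/(1 − s²) = 4/3.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def discretized_rb_estimate(lam, m=1.0, s=0.5, lo=-6.0, hi=6.0):
    """Maximize Pi(B_lam(psi) | x) / Pi(B_lam(psi)) over cells of width lam.

    Assumed example: prior N(0,1) for psi, posterior N(m, s^2).
    """
    best_psi, best_rb = None, -1.0
    k = 0
    while lo + k * lam < hi:
        a = lo + k * lam
        prior_mass = Phi(a + lam) - Phi(a)
        post_mass = Phi((a + lam - m) / s) - Phi((a - m) / s)
        if post_mass / prior_mass > best_rb:
            best_rb = post_mass / prior_mass
            best_psi = a + lam / 2.0       # representative point psi_lambda(psi)
        k += 1
    return best_psi
```

For instance, discretized_rb_estimate(0.01) is within about 0.01 of 4/3, while a coarse grid such as lam = 0.5 gives a rougher approximation.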
The loss function for the discretized problem is defined, as in Theorem 2, by
$$L_{RB,\lambda,\eta}(\theta,\psi_{\lambda}(\psi))=\frac{I_{\{\psi_{\lambda}(\psi)\}^{c}}(\psi_{\lambda}(\Psi(\theta)))}{\max(\eta,\pi_{\Psi,\lambda}(\psi_{\lambda}(\Psi(\theta))))},$$
and let δ λ , η ( x ) denote a Bayes rule for this problem.
Theorem 4. 
Suppose that π Ψ is positive and continuous and we have a regular discretization of Ψ . Furthermore, suppose that ψ R B ( x ) is the unique maximizer of R B Ψ ( ψ | x ) and for any ϵ > 0 ,
$$\sup_{\{\psi:\|\psi-\psi_{RB}(x)\|\geq\epsilon\}}RB_{\Psi}(\psi\,|\,x)<RB_{\Psi}(\psi_{RB}(x)\,|\,x).$$
Then, there exists η(λ) → 0 as λ → 0 such that a Bayes rule δ_{λ,η(λ)}(x), under the loss L_{RB,λ,η(λ)}, converges to ψ_RB(x) as λ → 0 for all x.
Theorem 4 states that ψ_RB(x) is a limit of Bayes rules. So, when Ψ(θ) = θ, we have the result that the MLE is a limit of Bayes rules, and more generally, the MLE from an integrated likelihood is a limit of Bayes rules. The regularity conditions stated in Theorem 4 hold in many common statistical problems.
Now let ψ ^ λ ( x ) be the relative belief estimate from the discretized problem, i.e., ψ ^ λ ( x ) maximizes R B Ψ ( B λ ( ψ ) | x ) as a function of ψ Ψ λ . The following is immediate from the proof of Theorem 4, Theorem 3, and Corollary 2:
Corollary 3. 
ψ ^ λ is admissible and Bayesian-unbiased for the discretized problem, and ψ ^ λ ( x ) ψ R B ( x ) as λ 0 for every x .
By similar arguments, an analog of Theorem 4 for ψ_MAP can be established. In this case, a simpler development can be followed in certain situations by using the loss function I_{B_λ^c(ψ)}(Ψ(θ)). For this, the posterior risk of δ in the discretized problem is given by 1 − Π_Ψ(B_λ(δ(x))|x) = 1 − π_Ψ(δ′(x)|x) ν_Ψ(B_λ(δ(x))) for some δ′(x) ∈ B_λ(δ(x)). Now suppose B_λ(ψ) is a cube centered at ψ of fixed edge length. Suppose that for each ϵ > 0, there exists λ(ϵ) > 0 such that, when ‖ψ − ψ_MAP(x)‖ > λ(ϵ), then
$$\pi_{\Psi}(\psi\,|\,x)<\inf_{\psi'\in B_{\lambda(\epsilon)}(\psi_{MAP}(x))}\pi_{\Psi}(\psi'\,|\,x).$$
Since ν_Ψ(B_λ(ψ)) is constant, a Bayes rule δ_{λ(ϵ)} must then satisfy ‖δ_{λ(ϵ)}(x) − ψ_MAP(x)‖ < ϵ. This proves that ψ_MAP is a limit of Bayes rules. By contrast, for the loss
I B λ c ( ψ ) ( Ψ ( θ ) ) / Π Ψ ( B λ ( Ψ ( θ ) ) ) ,
the posterior risk of δ is given by,
$$\int_{\Psi}\{\Pi_{\Psi}(B_{\lambda}(\psi))\}^{-1}\,\Pi_{\Psi}(d\psi\,|\,x)-\int_{B_{\lambda}(\delta(x))}\{\Pi_{\Psi}(B_{\lambda}(\psi))\}^{-1}\,\Pi_{\Psi}(d\psi\,|\,x),$$
and the first term is generally unbounded unless Ψ ( Θ ) is compact.
We will now consider an important example.
Example 3. 
Regression
Suppose that y = Xβ + e, where y ∈ R^n, X ∈ R^{n×k} is fixed of rank k, β ∈ R^k, and e ∼ N_n(0, σ²I). To simplify the discussion, we will assume that σ² is known, but this is not necessary. Let π be a prior density for β. For every π, having observed (X, y), then β_RB(y) = b = (XᵀX)^{-1}Xᵀy, the MLE of β.
It is interesting to contrast this result with more standard Bayesian estimates such as MAP or the posterior mean. For example, suppose that β ∼ N_k(0, τ²I). Then the posterior distribution of β is N_k(β_post(y), Σ_post), where
$$\beta_{post}(y)=\Sigma_{post}(\sigma^{-2}X^{\top}Xb),\qquad \Sigma_{post}=(\tau^{-2}I+\sigma^{-2}X^{\top}X)^{-1},$$
and note that β_MAP(y) = β_post(y). Writing the spectral decomposition of XᵀX as XᵀX = QΛQᵀ, we have that
$$\|\beta_{MAP}(y)\|=\|(I+(\sigma^{2}/\tau^{2})\Lambda^{-1})^{-1}Q^{\top}b\|.$$
Since ‖b‖ = ‖Qᵀb‖ and (1 + (σ²/τ²)λ_i^{-1})^{-1} < 1 for each i, this implies that β_MAP(y) shrinks the MLE toward the prior mean of 0. When the columns of X are orthonormal, then β_MAP(y) = r(1 + r)^{-1}b, where r = τ²/σ², and so the shrinkage is substantial unless τ² is much larger than σ². This shrinkage is often cited as a positive attribute of these estimates. Consider, however, the situation where the true value of β is some distance from the prior mean. In this case, it seems wrong to move the estimate toward the prior mean; thus, it is not clear that shrinking the MLE is necessarily a good thing, particularly as this requires giving up invariance.
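The shrinkage identity is easy to confirm numerically. The following sketch uses assumed values (k = 3, τ² = 4, σ² = 1, an orthonormal design obtained via a QR decomposition; all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 3
sigma2, tau2 = 1.0, 4.0                             # assumed known variances
X, _ = np.linalg.qr(rng.standard_normal((n, k)))    # orthonormal columns
y = X @ np.array([2.0, -1.0, 0.5]) + np.sqrt(sigma2) * rng.standard_normal(n)

b = np.linalg.solve(X.T @ X, X.T @ y)               # MLE, equal to beta_RB(y)
Sigma_post = np.linalg.inv(np.eye(k) / tau2 + (X.T @ X) / sigma2)
beta_map = Sigma_post @ (X.T @ y) / sigma2          # posterior mean = beta_MAP

# For orthonormal columns, beta_MAP = r (1 + r)^{-1} b with r = tau2 / sigma2.
r = tau2 / sigma2
print(np.allclose(beta_map, (r / (1 + r)) * b))     # True
```

With r = 4, the MAP estimate is 0.8 b, a 20% shrinkage toward the prior mean even with a fairly diffuse prior.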
Suppose that estimating the mean response ψ = Ψ(β) = wᵀβ at the predictor value w is required. The prior distribution of ψ is N(0, σ_ψ²) = N(0, τ²wᵀw) and the posterior distribution is N(ψ_MAP(y), σ_{ψ,post}²) = N(wᵀβ_MAP(y), wᵀΣ_post w). Note the following relationships,
$$\sigma_{\psi}^{2}-\sigma_{\psi,post}^{2}=w^{\top}(\tau^{2}I-\Sigma_{post})w=\tau^{2}w^{\top}Q(I-(I+(\tau^{2}/\sigma^{2})\Lambda)^{-1})Q^{\top}w>0,$$
since 1/(1 + τ²λ_i/σ²) < 1 for each i. Therefore, maximizing the ratio of the posterior to the prior densities leads to
$$\psi_{RB}(y)=(1-\sigma_{\psi,post}^{2}/\sigma_{\psi}^{2})^{-1}\,\psi_{MAP}(y).$$
Then σ_ψ² > σ_{ψ,post}² implies |ψ_RB(y)| > |ψ_MAP(y)|. Note that when σ_{ψ,post}² is much smaller than σ_ψ², in other words, when the posterior is much more concentrated than the prior, then ψ_RB(y) and ψ_MAP(y) are very similar. In general, ψ_RB(y) is not equal to wᵀb, the plug-in MLE of ψ, although it is the MLE from the integrated likelihood. Moreover, ψ_RB(y) → wᵀb as τ² → ∞, and when X has orthonormal columns, ψ_RB(y) = wᵀb.
Suppose that predicting a response z at the predictor value w ∈ R^k is required. When β ∼ N_k(0, τ²I), the prior distribution of z is N(0, σ² + τ²wᵀw) = N(0, σ_z²) and the posterior distribution is N(μ_post(z), σ_post²(z)), where we have that
$$\mu_{post}(z)=w^{\top}\beta_{post}(y),\qquad \sigma_{post}^{2}(z)=\sigma^{2}+w^{\top}\Sigma_{post}w.$$
To obtain z R B ( y ) , it is necessary to maximize the ratio of the posterior to the prior densities of z; this leads to
$$z_{RB}(y)=(1-\sigma_{post}^{2}(z)/\sigma_{z}^{2})^{-1}\,\mu_{post}(z).$$
Note that σ_z² − σ_post²(z) = σ_ψ² − σ_{ψ,post}² > 0; thus, |z_RB(y)| > |μ_post(z)|, and z_RB(y) is farther from the prior mean than z_MAP(y) = μ_post(z). Also, when σ_post²(z) is much smaller than σ_z², then z_RB(y) and z_MAP(y) are very similar. Finally, comparing (13) and (14), we have that
$$z_{RB}(y)=(\sigma_{z}^{2}/\sigma_{\psi}^{2})\,\psi_{RB}(y)=(1+\sigma^{2}/(\tau^{2}w^{\top}w))\,\psi_{RB}(y),$$
so the prediction z_RB(y) at w is more dispersed than the estimate ψ_RB(y) of the mean at w; this makes good sense, as the additional variation due to prediction must be taken into account. By contrast, z_MAP(y) = ψ_MAP(y).
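These identities can be confirmed numerically. The following sketch (assumed values and names, general design) checks ψ_RB, z_RB, and the relation between them:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 15, 2
sigma2, tau2 = 1.0, 3.0                    # assumed known variances
X = rng.standard_normal((n, k))            # general (not orthonormal) design
y = X @ np.array([1.0, -2.0]) + rng.standard_normal(n)
w = np.array([0.7, 0.3])                   # predictor value of interest

Sigma_post = np.linalg.inv(np.eye(k) / tau2 + (X.T @ X) / sigma2)
beta_map = Sigma_post @ (X.T @ y) / sigma2

psi_map = w @ beta_map
s_psi2 = tau2 * (w @ w)                    # prior variance of psi = w'beta
s_psi_post2 = w @ Sigma_post @ w           # posterior variance of psi
psi_rb = psi_map / (1.0 - s_psi_post2 / s_psi2)

s_z2 = sigma2 + s_psi2                     # prior predictive variance of z
s_z_post2 = sigma2 + s_psi_post2           # posterior predictive variance of z
z_rb = psi_map / (1.0 - s_z_post2 / s_z2)  # mu_post(z) = psi_MAP here

print(np.isclose(z_rb, (s_z2 / s_psi2) * psi_rb))   # True
print(abs(psi_rb) > abs(psi_map))                   # True: RB moves away from 0
```

The sigma² terms cancel in σ_z² − σ_post²(z), which is why the ratio z_RB/ψ_RB reduces to σ_z²/σ_ψ².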

5. Credible Regions and Hypothesis Assessment

Recall that a γ-relative-belief credible region for ψ = Ψ(θ) is given by C_{Ψ,γ}(x) = {ψ : RB_Ψ(ψ|x) ≥ c_γ(x)}, where c_γ(x) = sup{c : Π_Ψ(RB_Ψ(ψ|x) ≥ c | x) ≥ γ}. There is some arbitrariness in the choice of the greater-than-or-equal sign to define the credible region, as it could instead have been defined as C_{Ψ,γ}(x) = {ψ : RB_Ψ(ψ|x) > c_γ(x)}, where c_γ(x) = inf{c : Π_Ψ(RB_Ψ(ψ|x) ≤ c | x) ≥ 1 − γ}. In the latter case, c_γ(x) is the (1 − γ)-th quantile of the posterior distribution of the relative belief ratio. This definition has some advantages, as using it implies that the plausible region satisfies Pl_Ψ(x) = C_{Ψ,γ}(x), where γ = Π_Ψ(Pl_Ψ(x)|x). Also, the strength of the evidence concerning the hypothesis H_0 : Ψ(θ) = ψ_0 satisfies Str_Ψ(ψ_0|x) = 1 − Π_Ψ(C_{Ψ,γ}(x)|x), where γ = 1 − Str_Ψ(ψ_0|x). The key point here is the close relationship between relative-belief credible regions, the plausible region, and the strength calculation. Thus, any decision-theoretic interpretation applicable to relative-belief credible regions also pertains to the plausible region and the strength of the evidence. Throughout this section, we will retain the definition of C_{Ψ,γ}(x) provided in Section 2.3.
Now consider the lowest posterior loss γ -credible regions that arise from the prior-based loss functions considered here.
Theorem 5. 
Suppose that π_Ψ(ψ) > 0 for every ψ ∈ Ψ(Θ), where Ψ(Θ) is finite with ν_Ψ equal to counting measure. Then C_{Ψ,γ}(x) is a γ-lowest posterior loss credible region for the loss function L_RB.
Proof. 
From (2) and (6), the γ -lowest posterior loss-credible region is given by
$$D_{\gamma}(x)=\left\{\psi:RB_{\Psi}(\psi\,|\,x)\geq\int_{\Psi}RB_{\Psi}(\zeta\,|\,x)\,\nu_{\Psi}(d\zeta)-d_{\gamma}(x)\right\},$$
where d_γ(x) = inf{d : Π_Ψ({ψ : r(ψ|x) ≤ d}|x) ≥ γ} and r(ψ|x) denotes the posterior risk of ψ. As ∫_Ψ RB_Ψ(ζ|x) ν_Ψ(dζ) does not depend on ψ, it is clearly equivalent to define this region via C_{Ψ,γ}(x) = {ψ : RB_Ψ(ψ|x) ≥ c_γ(x)}, namely, D_γ(x) = C_{Ψ,γ}(x). □
Now consider the case where Ψ is countable and we use the loss function L R B , η . Following the proof of Theorem 5, we see that a γ -lowest posterior loss region takes the form,
$$D_{\eta,\gamma}(x)=\{\psi:\pi_{\Psi}(\psi\,|\,x)/\max(\eta,\pi_{\Psi}(\psi))\geq d_{\eta,\gamma}(x)\},$$
where d_{η,γ}(x) = sup{d : Π_Ψ({ψ : π_Ψ(ψ|x)/max(η, π_Ψ(ψ)) ≥ d}|x) ≥ γ}.
Theorem 6. 
Suppose that π_Ψ(ψ) > 0 for every ψ ∈ Ψ(Θ), where Ψ(Θ) is countable with ν_Ψ equal to counting measure. For the loss function L_{RB,η}, C_{Ψ,γ}(x) ⊆ lim inf_{η→0} D_{η,γ}(x) whenever γ is such that Π_Ψ(C_{Ψ,γ}(x)|x) = γ, and lim sup_{η→0} D_{η,γ}(x) ⊆ C_{Ψ,γ′}(x) whenever γ′ > γ and Π_Ψ(C_{Ψ,γ′}(x)|x) = γ′.
While Theorem 6 does not establish the exact convergence lim_{η→0} D_{η,γ}(x) = C_{Ψ,γ}(x), it is likely that this does hold under quite general circumstances due to the discreteness. Theorem 6 shows that limit points of the class of sets D_{η,γ}(x) always contain C_{Ψ,γ}(x) and that their posterior probability content differs from γ by at most γ′ − γ, where γ′ > γ is the next largest value for which we have exact content.
Now, consider the continuous case with a regular discretization. For S* ⊆ Ψ_λ = {ψ_λ(ψ) : ψ ∈ Ψ(Θ)}, namely, S* a subset of a discretized version of Ψ(Θ), we define the un-discretized version of S* to be S = ∪_{ψ∈S*} B_λ(ψ). Now, let C*_{Ψ,λ,γ}(x) be the γ-relative belief region for the discretized problem and let C_{Ψ,λ,γ}(x) be its un-discretized version. Note that, in a continuous context, we will consider two sets as equal if they differ only by a set of measure 0 with respect to Π_Ψ. The following result says that a γ-relative belief credible region for the discretized problem, after un-discretizing, converges to the γ-relative belief region for the original problem.
Theorem 7. 
Suppose that π_Ψ is positive and continuous, there is a regular discretization of Ψ(Θ), and RB_Ψ(ψ|x) has a continuous posterior distribution. Then, lim_{λ→0} C_{Ψ,λ,γ}(x) = C_{Ψ,γ}(x).
While Theorem 7 is interesting in its own right, it can also be used to prove that relative belief regions are limits of the lowest posterior loss regions.
Let D η , λ , γ * ( x ) be the γ -lowest posterior loss region obtained for the discretized problem using loss function (12), and let D η , λ , γ ( x ) be the un-discretized version.
Theorem 8. 
Suppose that π Ψ is positive and continuous, we have a regular discretization of Ψ, and R B Ψ ( ψ | x ) has a continuous posterior distribution. Then, we have that
$$C_{\Psi,\gamma}(x)=\lim_{\lambda\to 0}\liminf_{\eta\to 0}D_{\eta,\lambda,\gamma}(x)=\lim_{\lambda\to 0}\limsup_{\eta\to 0}D_{\eta,\lambda,\gamma}(x).$$
In [18,19], additional properties of relative belief regions are developed. For example, it has been shown that a γ-relative belief region C_{Ψ,γ}(x) for ψ, satisfying Π_Ψ(C_{Ψ,γ}(x)|x) = γ, minimizes Π_Ψ(B) among all (measurable) subsets B of Ψ(Θ) satisfying Π_Ψ(B|x) ≥ γ. So, a γ-relative belief region is the smallest among all γ-credible regions for ψ, where size is measured using the prior measure. This property has several consequences. For example, the prior probability that a region B(x) ⊆ Ψ(Θ) contains a false value from the prior is given by ∫_Θ ∫_Ψ F_θ(ψ ∈ B(x)) Π_Ψ(dψ) Π(dθ), where a false value is a value ψ ∼ Π_Ψ generated independently of (θ, x) ∼ Π × F_θ. It can be demonstrated that a γ-relative belief region minimizes this probability among all γ-credible regions for ψ and is always unbiased in the sense that the probability of covering a false value is bounded above by γ. Furthermore, a γ-relative belief region maximizes the relative belief ratio Π_Ψ(B|x)/Π_Ψ(B) and the Bayes factor Π_Ψ(B|x)Π_Ψ(B^c)/(Π_Ψ(B^c|x)Π_Ψ(B)) among all regions B ⊆ Ψ(Θ) with Π_Ψ(B) = Π_Ψ(C_{Ψ,γ}(x)).
While the results in this section focus on obtaining credible regions for parameters, similar results can be proven for the construction of prediction regions.

6. Conclusions

Relative belief inferences are based on a clear characterization of statistical evidence and are closely related to likelihood inferences. This, along with their invariance and optimality properties, positions them as prime candidates for appropriate inferences in Bayesian contexts. This paper shows that relative belief inferences also arise naturally in a decision-theoretic formulation using loss functions based on the prior. So, relative belief inferences represent a degree of unification between the evidential and decision-theoretic approaches to deriving statistical inferences.

Author Contributions

Conceptualization, M.E. and G.H.J.; Formal analysis, M.E. and G.H.J.; Writing—original draft, M.E.; Writing—review & editing, G.H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Sciences and Engineering Research Council of Canada, grant RGPIN-2024-03839.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Proof of Theorem 2 and Corollary 1. 
We have the following,
$$r_{\eta}(\delta\,|\,x)=\int_{\Psi}L_{RB,\eta}(\theta,\delta(x))\,\pi_{\Psi}(\psi\,|\,x)\,\nu_{\Psi}(d\psi)=\int_{\Psi}\frac{\pi_{\Psi}(\psi\,|\,x)}{\max(\eta,\pi_{\Psi}(\psi))}\,\nu_{\Psi}(d\psi)-\frac{\pi_{\Psi}(\delta(x)\,|\,x)}{\max(\eta,\pi_{\Psi}(\delta(x)))}.$$
The first term in (A1) is bounded above by 1 / η and does not depend on δ ( x ) , so the value of the Bayes rule at x is obtained by finding δ ( x ) , which maximizes the second term. Note that,
$$\frac{\pi_{\Psi}(\delta(x)\,|\,x)}{\max(\eta,\pi_{\Psi}(\delta(x)))}=\begin{cases}\pi_{\Psi}(\delta(x)\,|\,x)/\eta & \text{if }\eta>\pi_{\Psi}(\delta(x)),\\ RB_{\Psi}(\delta(x)\,|\,x) & \text{if }\eta\leq\pi_{\Psi}(\delta(x)).\end{cases}$$
There are, at most, finitely many values of ψ satisfying η ≤ π_Ψ(ψ), and so RB_Ψ(ψ|x) assumes a maximum on this set, say at ψ_η(x), and ψ_η(x) = ψ_RB(x) when η ≤ π_Ψ(ψ_RB(x)). If η > π_Ψ(δ(x)), then π_Ψ(δ(x)|x)/η < RB_Ψ(δ(x)|x) ≤ RB_Ψ(ψ_RB(x)|x). This proves that, for all η ≤ η_0 = π_Ψ(ψ_RB(x)) > 0, the maximizer of (A2) is given by δ(x) = ψ_RB(x), and the results are established. □
Proof of Theorem 3. 
The prior risk of δ is given by
$$r(\delta)=\int_{\Theta}\int_{\mathcal{X}}L(\theta,\delta(x))\,F(dx\,|\,\theta)\,\Pi(d\theta)=\int_{\Theta}\int_{\mathcal{X}}[h(\Psi(\theta))-I_{\{\delta(x)\}}(\Psi(\theta))\,h(\Psi(\theta))]\,F(dx\,|\,\theta)\,\Pi(d\theta)=\int_{\Theta}h(\Psi(\theta))\,\Pi(d\theta)-\int_{\mathcal{X}}\int_{\Theta}I_{\{\delta(x)\}}(\Psi(\theta))\,h(\Psi(\theta))\,\Pi(d\theta\,|\,x)\,M(dx)=\int_{\Theta}h(\Psi(\theta))\,\Pi(d\theta)-\int_{\mathcal{X}}h(\delta(x))\,\pi_{\Psi}(\delta(x)\,|\,x)\,M(dx)$$
and
$$\int_{\Theta}\int_{\Theta}\int_{\mathcal{X}}L(\theta',\delta(x))\,F(dx\,|\,\theta)\,\Pi(d\theta)\,\Pi(d\theta')=\int_{\Theta}\int_{\Theta}\int_{\mathcal{X}}[h(\Psi(\theta'))-I_{\{\delta(x)\}}(\Psi(\theta'))\,h(\Psi(\theta'))]\,F(dx\,|\,\theta)\,\Pi(d\theta)\,\Pi(d\theta')=\int_{\Theta}h(\Psi(\theta))\,\Pi(d\theta)-\int_{\mathcal{X}}h(\delta(x))\,\pi_{\Psi}(\delta(x))\,M(dx).$$
Therefore, δ is Bayesian-unbiased if and only if
$$\int_{\mathcal{X}}h(\delta(x))\,[\pi_{\Psi}(\delta(x)\,|\,x)-\pi_{\Psi}(\delta(x))]\,M(dx)\geq 0.$$
This inequality holds when δ ( x ) = ψ R B ( x ) because π Ψ ( · | x ) / π Ψ ( · ) is the density of Π Ψ ( · | x ) with respect to Π Ψ , which implies that the maximum of this density is greater than or equal to 1. □
Proof of Theorem 4 and Corollary 3. 
Just as in Theorem 2, a Bayes rule δ_{λ,η}(x) maximizes π_{Ψ,λ}(δ(x)|x)/max(η, π_{Ψ,λ}(δ(x))) for δ(x) ∈ Ψ_λ. Furthermore, as in Theorem 2, such a rule exists. Now, we define η(λ) so that 0 < η(λ) < Π_Ψ(B_λ(ψ_RB(x))), and note that η(λ) → 0 as λ → 0. As λ → 0, we have that
$$\frac{\pi_{\Psi,\lambda}(\psi_{\lambda}(\psi_{RB}(x))\,|\,x)}{\max(\eta(\lambda),\pi_{\Psi,\lambda}(\psi_{\lambda}(\psi_{RB}(x))))}=\frac{\pi_{\Psi,\lambda}(\psi_{\lambda}(\psi_{RB}(x))\,|\,x)}{\pi_{\Psi,\lambda}(\psi_{\lambda}(\psi_{RB}(x)))}\to RB_{\Psi}(\psi_{RB}(x)\,|\,x).$$
Let ϵ > 0 and let λ_0 be such that sup_{ψ∈Ψ} diam(B_λ(ψ)) < ϵ/2 for all λ < λ_0. Then, for λ < λ_0 and any δ(x) satisfying ‖δ(x) − ψ_RB(x)‖ ≥ ϵ, we have
$$\frac{\pi_{\Psi,\lambda}(\psi_{\lambda}(\delta(x))\,|\,x)}{\pi_{\Psi,\lambda}(\psi_{\lambda}(\delta(x)))}=\frac{\int_{B_{\lambda}(\psi_{\lambda}(\delta(x)))}\pi_{\Psi}(\psi\,|\,x)\,\nu_{\Psi}(d\psi)}{\int_{B_{\lambda}(\psi_{\lambda}(\delta(x)))}\pi_{\Psi}(\psi)\,\nu_{\Psi}(d\psi)}=\frac{\int_{B_{\lambda}(\psi_{\lambda}(\delta(x)))}RB_{\Psi}(\psi\,|\,x)\,\pi_{\Psi}(\psi)\,\nu_{\Psi}(d\psi)}{\int_{B_{\lambda}(\psi_{\lambda}(\delta(x)))}\pi_{\Psi}(\psi)\,\nu_{\Psi}(d\psi)}\leq\sup_{\{\psi:\|\psi-\psi_{RB}(x)\|>\epsilon/2\}}RB_{\Psi}(\psi\,|\,x)<RB_{\Psi}(\psi_{RB}(x)\,|\,x).$$
By (A4) and (A5), there exists λ_1 < λ_0 such that, for all λ < λ_1,
$$\frac{\pi_{\Psi,\lambda}(\psi_{\lambda}(\psi_{RB}(x))\,|\,x)}{\pi_{\Psi,\lambda}(\psi_{\lambda}(\psi_{RB}(x)))}>\sup_{\{\psi:\|\psi-\psi_{RB}(x)\|>\epsilon/2\}}RB_{\Psi}(\psi\,|\,x).$$
Therefore, when λ < λ_1, a Bayes rule δ_{λ,η(λ)}(x) satisfies
$$\frac{\pi_{\Psi,\lambda}(\delta_{\lambda,\eta(\lambda)}(x)\,|\,x)}{\pi_{\Psi,\lambda}(\delta_{\lambda,\eta(\lambda)}(x))}\geq\frac{\pi_{\Psi,\lambda}(\delta_{\lambda,\eta(\lambda)}(x)\,|\,x)}{\max(\eta(\lambda),\pi_{\Psi,\lambda}(\delta_{\lambda,\eta(\lambda)}(x)))}\geq\frac{\pi_{\Psi,\lambda}(\psi_{\lambda}(\psi_{RB}(x))\,|\,x)}{\max(\eta(\lambda),\pi_{\Psi,\lambda}(\psi_{\lambda}(\psi_{RB}(x))))}=\frac{\pi_{\Psi,\lambda}(\psi_{\lambda}(\psi_{RB}(x))\,|\,x)}{\pi_{\Psi,\lambda}(\psi_{\lambda}(\psi_{RB}(x)))}.$$
By (A5), (A6), and (A7), this implies that ‖δ_{λ,η(λ)}(x) − ψ_RB(x)‖ < ϵ, and the convergence is established.
Now, π_{Ψ,λ}(ψ̂_λ(x)|x)/π_{Ψ,λ}(ψ̂_λ(x)) ≥ π_{Ψ,λ}(δ_{λ,η(λ)}(x)|x)/π_{Ψ,λ}(δ_{λ,η(λ)}(x)), and so (A5), (A6), and (A7) imply that ‖ψ̂_λ(x) − ψ_RB(x)‖ < ϵ, and the convergence of ψ̂_λ(x) to ψ_RB(x) is established. □
Proof of Theorem 6. 
For c > 0, let S_c(x) = {ψ : RB_Ψ(ψ|x) ≥ c} and S_{η,c}(x) = {ψ : π_Ψ(ψ|x)/max(η, π_Ψ(ψ)) ≥ c}. Note that S_{η,c}(x) ↑ S_c(x) as η → 0.
Suppose c is such that Π_Ψ(S_c(x)|x) ≤ γ. Then, Π_Ψ(S_{η,c}(x)|x) ≤ γ for all η, and so S_{η,c}(x) ⊆ D_{η,γ}(x). This implies that S_c(x) ⊆ lim inf_{η→0} D_{η,γ}(x), and since Π_Ψ(C_{Ψ,γ}(x)|x) = γ, this implies that C_{Ψ,γ}(x) ⊆ lim inf_{η→0} D_{η,γ}(x).
Now suppose c is such that Π_Ψ(S_c(x)|x) > γ. Then there exists η_0 such that, for all η < η_0, we have Π_Ψ(S_{η,c}(x)|x) > γ. Since D_{η,γ}(x) ⊆ S_{η,c}(x) when η < η_0, then lim sup_{η→0} D_{η,γ}(x) ⊆ S_c(x). Choosing c = c_{γ′}(x) for γ′ > γ implies lim sup_{η→0} D_{η,γ}(x) ⊆ C_{Ψ,γ′}(x). □
Proof of Theorem 7. 
Let S_c(x) = {ψ : RB_Ψ(ψ|x) ≥ c} and
$$S_{\lambda,c}(x)=\{\psi:\Pi_{\Psi}(B_{\lambda}(\psi)\,|\,x)/\Pi_{\Psi}(B_{\lambda}(\psi))\geq c\}.$$
Recall that lim_{λ→0} Π_Ψ(B_λ(ψ)|x)/Π_Ψ(B_λ(ψ)) = lim_{λ→0} RB_Ψ(B_λ(ψ)|x) = RB_Ψ(ψ|x) for every ψ. If RB_Ψ(ψ|x) > c, there exists λ_0 such that, for all λ < λ_0, RB_Ψ(B_λ(ψ)|x) > c, and this implies that ψ ∈ lim inf_{λ→0} S_{λ,c}(x). Now, Π_Ψ(RB_Ψ(ψ|x) = c | x) = 0, and so S_c(x) ⊆ lim inf_{λ→0} S_{λ,c}(x) (after possibly deleting a set of Π_Ψ-measure 0 from S_c(x)). If ψ ∈ lim sup_{λ→0} S_{λ,c}(x), then RB_Ψ(B_λ(ψ)|x) ≥ c for infinitely many λ → 0, which implies that RB_Ψ(ψ|x) ≥ c and, therefore, ψ ∈ S_c(x). This proves that S_c(x) = lim_{λ→0} S_{λ,c}(x) (up to a set of Π_Ψ-measure 0), so that lim_{λ→0} Π_Ψ(S_{λ,c}(x) Δ S_c(x)|x) = 0 for any c.
Let c_{λ,γ}(x) = sup{c ≥ 0 : Π_Ψ(S_{λ,c}(x)|x) ≥ γ}, so that S_{c_γ(x)}(x) = C_{Ψ,γ}(x), S_{λ,c_{λ,γ}(x)}(x) = C_{Ψ,λ,γ}(x), and
$$\Pi_{\Psi}(C_{\Psi,\gamma}(x)\,\Delta\,C_{\Psi,\lambda,\gamma}(x)\,|\,x)=\Pi_{\Psi}(S_{c_{\gamma}(x)}(x)\,\Delta\,S_{\lambda,c_{\lambda,\gamma}(x)}(x)\,|\,x)\leq\Pi_{\Psi}(S_{c_{\gamma}(x)}(x)\,\Delta\,S_{\lambda,c_{\gamma}(x)}(x)\,|\,x)+\Pi_{\Psi}(S_{\lambda,c_{\lambda,\gamma}(x)}(x)\,\Delta\,S_{\lambda,c_{\gamma}(x)}(x)\,|\,x).$$
Since S_{c_γ(x)}(x) = lim_{λ→0} S_{λ,c_γ(x)}(x), then
$$\Pi_{\Psi}(S_{c_{\gamma}(x)}(x)\,\Delta\,S_{\lambda,c_{\gamma}(x)}(x)\,|\,x)\to 0$$
and Π_Ψ(S_{λ,c_γ(x)}(x)|x) → Π_Ψ(S_{c_γ(x)}(x)|x) = γ as λ → 0. Now consider the second term in (A8). Since RB_Ψ(ψ|x) has a continuous posterior distribution, Π_Ψ(RB_Ψ(ψ|x) ≤ c | x) is continuous in c. Let ϵ > 0 and note that, for all λ small enough, Π_Ψ(S_{λ,c_{γ−ϵ}(x)}(x)|x) < γ and Π_Ψ(S_{λ,c_{γ+ϵ}(x)}(x)|x) > γ, which implies that c_{γ+ϵ}(x) ≤ c_{λ,γ}(x) ≤ c_{γ−ϵ}(x) and, therefore, S_{λ,c_{γ+ϵ}(x)}(x) ⊇ S_{λ,c_{λ,γ}(x)}(x) ⊇ S_{λ,c_{γ−ϵ}(x)}(x). As S_{λ,c_{λ,γ}(x)}(x) ⊆ S_{λ,c_γ(x)}(x) or S_{λ,c_{λ,γ}(x)}(x) ⊇ S_{λ,c_γ(x)}(x), then
$$\Pi_{\Psi}(S_{\lambda,c_{\lambda,\gamma}(x)}(x)\,\Delta\,S_{\lambda,c_{\gamma}(x)}(x)\,|\,x)=|\Pi_{\Psi}(S_{\lambda,c_{\lambda,\gamma}(x)}(x)\,|\,x)-\Pi_{\Psi}(S_{\lambda,c_{\gamma}(x)}(x)\,|\,x)|.$$
For all λ small, |Π_Ψ(S_{λ,c_{λ,γ}(x)}(x)|x) − Π_Ψ(S_{λ,c_γ(x)}(x)|x)| is bounded above by
$$\max\{|\Pi_{\Psi}(S_{\lambda,c_{\gamma+\epsilon}(x)}(x)\,|\,x)-\Pi_{\Psi}(S_{\lambda,c_{\gamma}(x)}(x)\,|\,x)|,\ |\Pi_{\Psi}(S_{\lambda,c_{\gamma-\epsilon}(x)}(x)\,|\,x)-\Pi_{\Psi}(S_{\lambda,c_{\gamma}(x)}(x)\,|\,x)|\}$$
and this upper bound converges to ϵ as λ → 0. Since ϵ is arbitrary, this implies that the second term in (A8) goes to 0 as λ → 0, and this proves the result. □
Proof of Theorem 8. 
Without loss of generality, suppose that 0 < γ < 1. Let ϵ > 0 and δ > 0 satisfy γ + δ ≤ 1. Put γ(λ,γ) = Π_Ψ(C_{Ψ,λ,γ}(x)|x) and γ′(λ,γ) = Π_Ψ(C_{Ψ,λ,γ+δ}(x)|x), and note that γ(λ,γ) ≥ γ and γ′(λ,γ) ≥ γ + δ. By Theorem 7, we have C_{Ψ,λ,γ}(x) → C_{Ψ,γ}(x) and C_{Ψ,λ,γ+δ}(x) → C_{Ψ,γ+δ}(x) as λ → 0, so γ(λ,γ) → γ and γ′(λ,γ) → γ + δ as λ → 0. This implies that there is a λ_0(δ) such that, for all λ < λ_0(δ), γ(λ,γ) < γ′(λ,γ). Therefore, by Theorem 6, we have that, for all λ < λ_0(δ),
$$C_{\Psi,\lambda,\gamma}(x)\subseteq\liminf_{\eta\to 0}D_{\eta,\lambda,\gamma(\lambda,\gamma)}(x)\subseteq\limsup_{\eta\to 0}D_{\eta,\lambda,\gamma(\lambda,\gamma)}(x)\subseteq C_{\Psi,\lambda,\gamma+\delta}(x).$$
From (A9) and Theorem 7, we conclude that
$$C_{\Psi,\gamma}(x)\subseteq\liminf_{\lambda\to 0}\liminf_{\eta\to 0}D_{\eta,\lambda,\gamma(\lambda,\gamma)}(x)\subseteq\limsup_{\lambda\to 0}\limsup_{\eta\to 0}D_{\eta,\lambda,\gamma(\lambda,\gamma)}(x)\subseteq C_{\Psi,\gamma+\delta}(x).$$
Since lim_{δ→0} C_{Ψ,γ+δ}(x) = C_{Ψ,γ}(x), this establishes the result. □

Table 1. Conditional prior probabilities of misclassification for $c_{MAP}$ and $c_{RB}$ for various values of $\beta$ in Example 2 when $\alpha = 1$, $\mu = 1$, and $n = 10$.
  β     M_0(c_MAP^0) + M_1(c_MAP^1)     M_0(c_RB^0) + M_1(c_RB^1)
  1     0.386 + 0.390 = 0.776           0.386 + 0.390 = 0.776
  14    0.002 + 0.975 = 0.977           0.285 + 0.380 = 0.665
  32    0.000 + 0.997 = 0.997           0.292 + 0.349 = 0.641
  100   0.000 + 1.000 = 1.000           0.300 + 0.324 = 0.624
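The totals in Table 1 are the sums of the two conditional misclassification probabilities for each classifier. A minimal sketch tabulating the comparison (using the reported values from the table, not a recomputation from the model of Example 2):

```python
# Conditional prior misclassification probabilities from Table 1
# (Example 2, alpha = 1, mu = 1, n = 10). Each entry maps
# beta -> ((M0, M1) for c_MAP, (M0, M1) for c_RB).
table = {
    1:   ((0.386, 0.390), (0.386, 0.390)),
    14:  ((0.002, 0.975), (0.285, 0.380)),
    32:  ((0.000, 0.997), (0.292, 0.349)),
    100: ((0.000, 1.000), (0.300, 0.324)),
}

# Total misclassification probability = M0 + M1 for each classifier.
for beta, (map_errs, rb_errs) in table.items():
    print(f"beta={beta:3d}  MAP total={sum(map_errs):.3f}  RB total={sum(rb_errs):.3f}")
```

As $\beta$ grows, the MAP total approaches 1 while the relative belief total decreases, which is the comparison the table is making.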