1. Introduction
A major drawback of objective priors, such as the Jeffreys prior [1] and the reference prior [2], is that in many cases they are improper. The case for objective priors has been made, see for example [3]; yet, while for a parameter defined over a bounded interval, such as (0, 1), it is generally possible to derive objective prior distributions that are proper, this is not the case for parameters on ℝ or ℝ⁺. The literature provides many examples where improper prior distributions cannot suitably be employed, such as Bayes factors, mixture models and hierarchical models, to name but a few. Methods have been proposed to overcome these obstacles, for example, intrinsic Bayes factors [4] and fractional Bayes factors [5], or reparametrizing mixture models [6]. However, these types of results are generally valid only under a limited number of specific conditions. Additionally, improper prior distributions are not well suited to problems involving large numbers of parameters, as it becomes difficult to establish propriety of the full posterior distribution.
The idea of this paper is to present a novel objective prior distribution for continuous parameter spaces by considering the connection between information, divergence and scoring rules. In particular, the proposed prior can be defined over ℝ⁺ and ℝ, the latter by extending the former, and it has the appealing property of being proper.
Recently, Ref. [7] introduced a new class of objective priors obtained by solving a differential equation of the form S(q) = 0, where S is a score function and the solution q acts as the prior distribution; here q′ and q″ denote, respectively, the first and second derivative of q. The solution is also shown to minimize an information criterion.
There are two well-known relations that connect information, proper local scoring rules and divergences. The most famous of these links Shannon information, the Kullback–Leibler divergence and the log-score, and is given by
\[
\int p(x)\,\log\frac{p(x)}{q(x)}\,dx \;=\; \int p(x)\,\log p(x)\,dx \;-\; \int p(x)\,\log q(x)\,dx, \tag{1}
\]
where p and q are two densities, and integrals will be generally defined with respect to the Lebesgue measure. The term on the left-hand side of (1) is the Kullback–Leibler divergence [8] between p and q, the first term on the right-hand side is the Shannon information associated with the density p, and the second term is the expectation under p of the log-score function −log q.
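As a quick numerical check (a sketch, not part of the paper; the two normal densities and the midpoint quadrature grid are arbitrary choices), the identity in (1) can be verified directly:

```python
import math

# Two normal densities p = N(0, 1) and q = N(1, 1.5^2): illustrative choices.
def norm_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

p = lambda x: norm_pdf(x, 0.0, 1.0)
q = lambda x: norm_pdf(x, 1.0, 1.5)

def integrate(f, lo=-20.0, hi=20.0, n=40000):
    # crude midpoint rule; adequate for these smooth, light-tailed integrands
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

kl    = integrate(lambda x: p(x) * math.log(p(x) / q(x)))  # left-hand side of (1)
info  = integrate(lambda x: p(x) * math.log(p(x)))         # Shannon information of p
score = integrate(lambda x: p(x) * (-math.log(q(x))))      # E_p of the log-score
assert abs(kl - (info + score)) < 1e-6
```

The three integrals agree to within quadrature precision, and the Kullback–Leibler divergence is strictly positive since p ≠ q.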
Another way to connect information, divergence and proper local scoring rules involves the Fisher divergence, the Fisher information, and the Hyvärinen score function ([9]):
\[
\int p\left(\frac{p'}{p}-\frac{q'}{q}\right)^{2}dx \;=\; \int p\left(\frac{p'}{p}\right)^{2}dx \;+\; \int p\left\{2\,\frac{q''}{q}-\left(\frac{q'}{q}\right)^{2}\right\}dx, \tag{2}
\]
where the final term has been obtained using an integration by parts, requiring p q′/q to vanish at the boundary values. In general, these relationships can be expressed as
\[
D(p, q) \;=\; I(p) + \mathbb{E}_{p}\,S(q),
\]
where D denotes the divergence, I the measure of information and S the score, and clearly I(p) = −E_p S(p).
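The decomposition in (2) can also be checked numerically; in the following sketch the densities are again two normals (an arbitrary choice, with derivatives written in closed form so that the Gaussian tails make the boundary term in the integration by parts vanish):

```python
import math

# p = N(0, 1) and q = N(m, s^2): illustrative choices with closed-form derivatives.
m, s = 1.0, 1.5
def p(x):       return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
def dlogp(x):   return -x                       # p'/p for the standard normal
def dlogq(x):   return -(x - m) / s ** 2        # q'/q
def d2q_over_q(x):                              # q''/q
    return ((x - m) / s ** 2) ** 2 - 1.0 / s ** 2

def integrate(f, lo=-20.0, hi=20.0, n=40000):
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

fisher_div = integrate(lambda x: p(x) * (dlogp(x) - dlogq(x)) ** 2)  # divergence D
fisher_inf = integrate(lambda x: p(x) * dlogp(x) ** 2)               # information I(p)
hyvarinen  = integrate(lambda x: p(x) * (2.0 * d2q_over_q(x) - dlogq(x) ** 2))
assert abs(fisher_div - (fisher_inf + hyvarinen)) < 1e-5             # D = I + E_p S
```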
Recently, in [10], a new class of score functions was introduced, the starting point being the defining property of a score function, namely that
\[
\mathbb{E}_{p}\,S(q) \;\geq\; \mathbb{E}_{p}\,S(p)
\]
for all densities p. In other words, a score is said to be proper if the expected score is minimized by the choice q = p. Consider the well-known log-score, S(q) = −log q. It satisfies the above property, since for any density p we have ∫ p log(p/q) ≥ 0, with equality only when q = p. As such, the log-score is a proper score. Furthermore, a score is said to be local if it depends on q only through the density value q(x); see [10,11]. It must be noted that the log-score is the only proper score that is local.
Consider now the Hyvärinen score function [9], which is given by
\[
S(x, q) \;=\; 2\,\frac{q''(x)}{q(x)} - \left(\frac{q'(x)}{q(x)}\right)^{2},
\]
which we note depends on (q, q′, q″), i.e., q and its first two derivatives; as such, it is not local in the above sense. However, the locality condition can be weakened [10] by allowing the score to depend on a finite number m of derivatives. Therefore, the Hyvärinen score will be an order-2 proper local scoring rule.
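As a concrete illustration of an order-2 score: for the standard normal density, q′/q = −x and q″/q = x² − 1, so the Hyvärinen score reduces to x² − 2. A minimal sketch confirming this with finite-difference derivatives (the test points are arbitrary choices):

```python
import math

def q(x):
    # standard normal density, used as an illustrative example
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def hyvarinen_score(x, f, h=1e-5):
    # S(x, q) = 2 q''(x)/q(x) - (q'(x)/q(x))^2, via central finite differences
    f0, fp, fm = f(x), f(x + h), f(x - h)
    d1 = (fp - fm) / (2.0 * h)
    d2 = (fp - 2.0 * f0 + fm) / (h * h)
    return 2.0 * d2 / f0 - (d1 / f0) ** 2

# For N(0, 1): q'/q = -x and q''/q = x^2 - 1, so S(x, q) = x^2 - 2.
for x in (-1.3, 0.0, 0.7, 2.0):
    assert abs(hyvarinen_score(x, q) - (x * x - 2.0)) < 1e-4
```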
More generally, if a proper score depends on m derivatives, then it will be defined as an order-m local scoring rule. The theory in support of this is based on the fact that the minimizer of E_p S(q) is q = p, and this can be investigated using variational analysis; the relevant Euler–Lagrange equation of order two is
\[
p\,S_{q} - \frac{d}{dx}\left(p\,S_{q'}\right) + \frac{d^{2}}{dx^{2}}\left(p\,S_{q''}\right) \;=\; \lambda, \tag{3}
\]
where S_q, S_{q′} and S_{q″} denote the partial derivatives of S with respect to q and its first two derivatives, and λ is the Lagrange multiplier associated with the constraint that q integrates to one. The corresponding general case of (3) is given as Equation (18) in [10]. Throughout this paper we will focus on the case m = 2, since this is where we draw our prior from; Appendix A provides the expression for a general m.
In [10], the solution to Equation (3) is obtained using properties of differential operators and 1-homogeneous functions. Recall that a 1-homogeneous function f is such that f(t x) = t f(x) for any t > 0. In particular, the Hyvärinen score arises with the 1-homogeneous choice
\[
\phi(u, v) \;=\; \frac{v^{2}}{u}.
\]
Furthermore, Refs. [10,11] characterize all local and proper scoring rules of order m. In this respect, as an additional result of interest, in Appendix A we present the characterization using measures of information and the Bregman divergence [12]. The benefits of the proposed approach are that complicated mathematical analysis is avoided and the derivation of the local rule is made explicit.
Following [10,11] and the novel derivation of their results using the Bregman divergence, which is the focus of Appendix A, information, divergence and scores can be obtained as follows, for some convex function φ(u, v), with partial derivatives φ_u and φ_v.
Divergence: Given the result (A10), we obtain
\[
D_{\phi}(p, q) \;=\; \int \Big\{ \phi(p, p') - \phi(q, q') - (p - q)\,\phi_{u}(q, q') - (p' - q')\,\phi_{v}(q, q') \Big\}\,dx, \tag{4}
\]
where the convexity of φ guarantees that the integrand is non-negative. Using integration by parts on the rightmost integral, and assuming that (p − q) φ_v(q, q′) vanishes at the extremes of the integral,
\[
D_{\phi}(p, q) \;=\; \int \phi(p, p')\,dx - \int \phi(q, q')\,dx - \int (p - q)\left\{\phi_{u}(q, q') - \frac{d}{dx}\,\phi_{v}(q, q')\right\}dx.
\]
Information: This follows from the divergence, and from (A4), and is given by
\[
I(p) \;=\; \int \phi(p, p')\,dx.
\]
Score: Again, from the form of the divergence and (A8), this is given by
\[
S(q) \;=\; -\,\phi_{u}(q, q') + \frac{d}{dx}\,\phi_{v}(q, q'). \tag{5}
\]
The score S(q) in (5) generalizes the Hyvärinen score, which arises when φ(u, v) = v²/u.
The paper is organized as follows. Section 2 introduces the proposed objective prior. Section 3 includes a thorough simulation study, and an application to mixture models that involves both simulated and real data. In Section 4 we discuss another critical scenario where improper priors may result in improper posteriors, i.e., assigning an objective prior to the variance parameter in a hierarchical model. The supporting theory is presented in Appendix A. In Appendix A.1 we use the Bregman divergence to obtain general forms for score functions and associated divergences; following on from this, in Appendix A.2 we detail how we use Bregman divergences to obtain a divergence between probability density functions using their first derivatives, and show how to obtain score functions from these divergences. Appendix A.3 provides the general case using m derivatives. Finally, in Appendix A.4 we make the connection between our derivation of scores and that of [10].
2. New Objective Prior
Ref. [7] proposed constructing objective prior distributions on parameter spaces by solving equations of the kind S(q) = 0. Specifically, they used a weighted mixture of the log-score and the Hyvärinen score functions. Please note that the sole use of the log-score function would result in the uniform prior, which is not appropriate in many cases and may yield improper posterior distributions. On the other hand, a weighted combination of the two score functions yields a differential equation given by
\[
w\left\{2\,\frac{q''}{q} - \left(\frac{q'}{q}\right)^{2}\right\} - (1 - w)\,\log q \;=\; 0, \tag{6}
\]
where q denotes the prior density and w ∈ (0, 1) the weight balancing the two score functions.
Solutions to the differential Equation (6) can be found for different spaces, and constraints on the shape of q can be considered, so as to have a prior density with desirable behavior, such as being monotone, convex, log-concave and more.
We have already seen that the Hyvärinen score arises with φ(u, v) = v²/u; see (5). An important property an objective prior distribution may be required to have is a heavy tail, and it is this type of prior we are seeking to obtain in a formal way through an appropriate choice of φ. We will consider such a prior on ℝ⁺. Mirroring the Hyvärinen score, we adopt
\[
\phi(u, v) \;=\; \frac{u^{3}}{v^{2}},
\]
with v = q′, and q a decreasing density on ℝ⁺, so that v < 0. In this case, Equation (5) becomes
\[
S(q) \;=\; 6\,\frac{q^{3}\,q''}{(q')^{4}} - 9\,\frac{q^{2}}{(q')^{2}},
\]
which, by setting S(q) to 0, becomes
\[
2\,q\,q'' \;=\; 3\,(q')^{2}.
\]
The solution is easily seen to be
\[
q(x) \;\propto\; (x + a)^{-2},
\]
for some constant a. In this case, the prior on the parameter space Θ = ℝ⁺ is
\[
\pi(\theta) \;=\; \frac{a}{(a + \theta)^{2}}, \qquad \theta > 0. \tag{7}
\]
To obtain a density function on ℝ⁺, i.e., with q non-negative and integrating to one, we choose a > 0, as we are permitted to do through the constant of integration from the differential equation.
Interestingly, the prior in (7) is a Lomax distribution [13] with scale parameter a and shape parameter equal to 1. Recalling that the Lomax distribution can be directly connected to the Pareto Type I and Pareto Type II distributions, its heavy-tailed nature is immediately obvious.
Making the connection more directly with the theory set out in the paper, with
\[
\phi(u, v) \;=\; \frac{u^{3}}{v^{2}},
\]
we have a convex function which is easy to show satisfies the 1-homogeneity condition φ(t u, t v) = t φ(u, v) for t > 0. Then using (A8) we obtain
\[
S(q) \;=\; -3\,\frac{q^{2}}{(q')^{2}} + \frac{d}{dx}\left\{-2\,\frac{q^{3}}{(q')^{3}}\right\} \;=\; 6\,\frac{q^{3}\,q''}{(q')^{4}} - 9\,\frac{q^{2}}{(q')^{2}}. \tag{8}
\]
Setting this to zero, i.e., 2 q q″ = 3 (q′)², this can be solved and the solution is precisely of the form q(x) ∝ (x + a)^{-2}. We now write this all out in a theorem.
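The claim that the Lomax-type density solves the underlying differential equation can be verified directly: for q(x) = a/(a + x)², the relation 2 q q″ = 3 (q′)² holds identically in x. A minimal sketch of this check with closed-form derivatives (the evaluation points and a = 1 are arbitrary choices, and the form 2 q q″ = 3 (q′)² is our reading of the score-equals-zero condition):

```python
# q(x) = a/(a+x)^2 and its first two derivatives, in closed form.
def q(x, a=1.0):   return a / (a + x) ** 2
def dq(x, a=1.0):  return -2.0 * a / (a + x) ** 3
def d2q(x, a=1.0): return 6.0 * a / (a + x) ** 4

# Both sides reduce to 12 a^2 / (a + x)^6, so the equation holds exactly.
for x in (0.1, 1.0, 5.0, 50.0):
    lhs = 2.0 * q(x) * d2q(x)
    rhs = 3.0 * dq(x) ** 2
    assert abs(lhs - rhs) < 1e-12 * max(1.0, abs(rhs))
```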
Theorem 1. Let φ be the convex function appearing in (A3); i.e., φ(u, v) = u³/v² with v = u′. Then φ is convex for either v < 0 or v > 0. The Euler equation associated with this φ, i.e., S(q) = 0, yields
\[
2\,q\,q'' \;=\; 3\,(q')^{2},
\]
the solution to which can be written as
\[
q(x) \;=\; a\,(a + x)^{-2},
\]
where S is the corresponding score function (8).
To obtain the corresponding prior on the whole of ℝ, through symmetry about 0, we obtain
\[
\pi(\theta) \;=\; \frac{a}{2\,(a + |\theta|)^{2}}, \qquad \theta \in \mathbb{R}. \tag{9}
\]
Here we motivate the natural objective choice for the constant a as 1. We note that a is a scale parameter in the above, and as such the most fundamental transformation to be considered here is ψ = 1/θ; this, for example, would take a variance to a precision (and vice versa). For the prior in (7) to be invariant, i.e., for the density of ψ = 1/θ to coincide with (7), we need to have a = 1, since
\[
\pi_{\psi}(\psi) \;=\; \pi(1/\psi)\,\psi^{-2} \;=\; \frac{a}{(a\,\psi + 1)^{2}},
\]
which yields π_ψ = π if and only if a = 1. All the illustrations that follow have been made taking this choice for a. Although other transformations are clearly available, the dominance of the reciprocal transform is sufficient motivation to select a = 1.
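A minimal numerical sketch of this invariance argument (the evaluation points are arbitrary choices): the density of ψ = 1/θ, obtained by the change of variables, coincides with (7) exactly when a = 1, and differs otherwise.

```python
def prior(t, a):
    # prior (7): Lomax density with scale a and shape 1
    return a / (a + t) ** 2

def prior_of_reciprocal(t, a):
    # density of psi = 1/theta by change of variables (Jacobian 1/psi^2)
    return prior(1.0 / t, a) / t ** 2

# With a = 1 the two densities agree at every point...
for t in (0.2, 1.0, 3.7, 25.0):
    assert abs(prior(t, 1.0) - prior_of_reciprocal(t, 1.0)) < 1e-12

# ...while with a != 1 they do not.
assert abs(prior(2.0, 3.0) - prior_of_reciprocal(2.0, 3.0)) > 1e-3
```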
First Examples
The first simulation study was to make inference on a scale parameter; specifically, the standard deviation σ of a normal density with known mean and unknown standard deviation. We compare the prior in (7) with the Jeffreys prior, i.e., π(σ) ∝ 1/σ. For each combination of sample size and true value of σ considered, we took 250 samples, obtained the posterior distributions using standard MCMC methods (Metropolis–Hastings with a normal random-walk proposal, 6000 iterations, with a burn-in of 1000 and a thinning of 10), and computed the following two indexes: the root of the mean squared error (MSE) of the posterior mean, divided by the true parameter value,
\[
\frac{1}{\sigma}\sqrt{\frac{1}{250}\sum_{i=1}^{250}\left(\hat{\sigma}_{i} - \sigma\right)^{2}},
\]
where σ̂_i is the posterior mean for the i-th sample, and the coverage of the 95% posterior credible interval for σ.
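One replication of this experiment can be sketched as follows. The true value σ = 2, the sample size n = 100, the zero mean and the proposal scale are illustrative assumptions, not the paper's exact settings; the prior is (7) with a = 1, entering the posterior through its unnormalized log density.

```python
import math, random

random.seed(1)
sigma_true, n = 2.0, 100
data = [random.gauss(0.0, sigma_true) for _ in range(n)]
ss = sum(x * x for x in data)                      # sufficient statistic

def log_post(sigma):
    # log posterior of sigma (up to a constant) under prior 1/(1+sigma)^2
    if sigma <= 0.0:
        return -math.inf
    loglik = -n * math.log(sigma) - ss / (2.0 * sigma ** 2)
    logprior = -2.0 * math.log(1.0 + sigma)
    return loglik + logprior

# Random-walk Metropolis: 6000 iterations, burn-in 1000, thinning 10.
draws, cur = [], 1.0
for it in range(6000):
    prop = cur + random.gauss(0.0, 0.3)            # normal random-walk proposal
    if math.log(random.random()) < log_post(prop) - log_post(cur):
        cur = prop
    if it >= 1000 and it % 10 == 0:
        draws.append(cur)

post_mean = sum(draws) / len(draws)                # should sit near sigma_true
```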
Table 1 shows the results for the MSE, where we see little difference between the performance of the two priors. As one would expect, the MSE is lower for the largest sample size. The important point, however, is that the score-based prior is proper. The coverage of the 95% posterior credible interval is shown in Table 2, where we can also see very similar behavior between the two priors.
To illustrate the frequentist properties of the prior in (9), we have compared it to a flat prior, π(μ) ∝ 1, in making inference for the location parameter μ of a log-normal density with unknown μ and known scale parameter. Similarly to the previous case, we have drawn 250 samples for each of the sample sizes considered, and computed the MSE and the coverage of the 95% posterior credible interval for a range of true values of μ.
Table 3 shows the MSE for the two priors, where we see that, apart from a small difference for one of the values considered, the two priors appear to perform in a very similar fashion. The coverage of the 95% posterior credible interval for μ is shown in Table 4, where we can see very similar behavior for the two priors, with one exception, although in that case the two coverage levels are still perfectly acceptable.
The general conclusion from the two experiments above is that the prior obtained via the score function exhibits tails which are sufficiently heavy to yield good frequentist performance, even for large parameter values. These properties are comparable to those obtained with the Jeffreys prior, which is well-known for being the objective prior yielding good frequentist properties of the posterior. The advantage of our prior is that it is always proper.
4. Variance Parameters in Hierarchical Models
In this section, we discuss a well-known implementation of a hierarchical model that is proposed, for example, in [17]. The basic two-level hierarchical model is as follows:
\[
y_{j} \mid \theta_{j} \sim N(\theta_{j}, \sigma_{j}^{2}), \qquad \theta_{j} \mid \mu, \tau \sim N(\mu, \tau^{2}), \qquad j = 1, \ldots, J,
\]
with the σ_j² assumed known. This model has three (sets of) parameters, namely θ = (θ₁, …, θ_J), μ and τ. However, our interest in this paper is in τ only, noting that "regular" objective priors can be used on the remaining parameters, such as π(μ) ∝ 1, for example. Although improper, this prior yields a proper posterior on the parameters.
The actual concern is the variance (scale) parameter τ: if we were to put an improper prior on it, then the corresponding posterior would, most likely, be improper as well. To compare the proposed prior, we assign an inverse-gamma prior to the variance, with parameters set so as to define a very sparse probability distribution. This is recommended, for example, in [18], where the prior is τ² ~ IG(ε, ε), with ε sufficiently small. We do not discuss in detail the appropriateness of the above choice, or of other alternatives; the reader can refer to [17], for example, for a thorough discussion.
The method used to obtain the posterior samples is a Metropolis-within-Gibbs in both cases, with 40,000 iterations, a burn-in of 20,000 and a thinning of 10.
The data consist of J = 8 educational testing experiments, where the parameters θ_j represent the relative effects of Scholastic Aptitude Test coaching programs in different schools. In this example, the parameter τ represents the between-schools variability (standard deviation) of the effects. Table 8 shows the data.
We have compared the marginal posteriors π(τ | y) obtained using the inverse-gamma prior, with a small value of ε, and the proposed prior in (7) with a = 1. The histograms of the marginal posteriors are in Figure 6, where we note similar results. The statistics of the posteriors are reported in Table 9, where we note a less informative distribution when the proposed prior is employed. This is expected, as the inverse-gamma distribution is considered a relatively informative one [17].
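The shape of π(τ | y) under the proposed prior can be sketched without MCMC by integrating μ out analytically under the flat prior (a standard result for the normal hierarchical model, used here in place of the paper's Metropolis-within-Gibbs). The data below are the well-known eight-schools values from [17]; the grid for τ is an arbitrary choice.

```python
import math

# Eight-schools data from [17]: estimated effects and their standard errors.
y     = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]

def log_marglik(tau):
    # log p(y | tau) up to a constant, with mu integrated out under a flat prior:
    # y_j | mu, tau ~ N(mu, sigma_j^2 + tau^2)
    V = [s * s + tau * tau for s in sigma]
    precision = sum(1.0 / v for v in V)
    mu_hat = sum(yj / v for yj, v in zip(y, V)) / precision
    return (-0.5 * sum(math.log(v) for v in V)
            - 0.5 * math.log(precision)
            - 0.5 * sum((yj - mu_hat) ** 2 / v for yj, v in zip(y, V)))

def log_prior(tau):
    # proposed prior (7) with a = 1, up to the normalizing constant
    return -2.0 * math.log(1.0 + tau)

# Unnormalized posterior on a grid, and its mean.
grid = [0.01 * i for i in range(1, 5001)]          # tau in (0, 50]
w = [math.exp(log_marglik(t) + log_prior(t)) for t in grid]
Z = sum(w)
post_mean_tau = sum(t * wi for t, wi in zip(grid, w)) / Z
```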
5. Discussion
In this paper, we have derived a class of objective prior distributions that have the appealing properties of being proper and heavy-tailed. These have been obtained by exploiting a straightforward approach (proposed here) to the construction of score functions. In detail, using a convex function φ(u, v), we can find the score function involving the first two derivatives using (5). The Hyvärinen score arises with φ(u, v) = v²/u, whereas we have used φ(u, v) = u³/v², and we have used it to construct objective prior distributions using the methodology introduced in [7].
The class of priors is heavy-tailed, behaving as θ⁻² for large θ; this is immediately obvious since the prior on ℝ⁺ is a Lomax distribution with shape parameter 1. In this respect, it behaves similarly to standard objective priors, but comes without the problem of being improper. The benefit of using a proper prior is that the posterior is automatically proper and so does not need to be checked.
We have shown that, when compared to the Jeffreys prior on simulated data, the frequentist performance of the prior distributions derived from score functions is nearly equivalent. In addition, we have shown that, on both simulated and real data, the proposed prior is suitable for a key scenario where improper priors (e.g., Jeffreys and reference priors) are not suitable, or are yet to be found. We have also illustrated the prior on a common problem for hierarchical models, i.e., assigning an objective prior to the variance parameter.
As a final point, we briefly discuss the case where a prior is needed on a multidimensional parameter space. Say we have a model with k parameters, i.e., θ = (θ₁, …, θ_k), where θ_i ∈ Θ_i, for i = 1, …, k. We also assume that the uni-dimensional space for each parameter is either ℝ or ℝ⁺. Assuming k relatively large, and leaving aside some specific statistical models such as regression models or graphical models, a common practice to assign objective priors on θ is as follows:
\[
\pi(\theta) \;=\; \prod_{i=1}^{k} \pi(\theta_{i}).
\]
In other words, the parameters are assumed to be independent a priori, so the joint prior distribution is represented by the product of the marginal priors on each parameter. We can then set π(θ_i) to be either (7) or (9), for i = 1, …, k, depending on whether Θ_i is ℝ⁺ or ℝ.
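A sketch of this product construction, assuming a = 1 throughout and taking (7) to be the Lomax-type density 1/(1 + θ)² on ℝ⁺ and (9) its symmetrized version 1/(2(1 + |θ|)²) on ℝ; the helper names and the support labels are hypothetical conveniences.

```python
import math

def log_prior_positive(t):
    # prior (7) with a = 1: density 1/(1+t)^2 on the positive reals
    return -math.inf if t <= 0.0 else -2.0 * math.log(1.0 + t)

def log_prior_real(t):
    # prior (9) with a = 1: density 1/(2(1+|t|)^2) on the whole real line
    return -math.log(2.0) - 2.0 * math.log(1.0 + abs(t))

def log_joint_prior(theta, supports):
    # supports[i] is "R+" or "R", one entry per parameter; independence a
    # priori means the joint log prior is the sum of the marginal log priors
    comps = {"R+": log_prior_positive, "R": log_prior_real}
    return sum(comps[s](t) for t, s in zip(theta, supports))

# Example: theta_1 = 1 on R+ gives 1/4; theta_2 = -2 on R gives 1/18.
lp = log_joint_prior([1.0, -2.0], ["R+", "R"])
```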