1. Introduction
Bayesian inference requires the specification of a prior, which encodes a priori knowledge about the parameter(s). If the selected prior is flawed, inferences may be erroneous. The goal of this paper is to measure the sensitivity of inferences to a chosen prior (known as robustness). Since, in most cases, it is very challenging to settle on a single prior distribution, we consider a class $\Gamma$ of possible priors over the parameter space. To construct $\Gamma$, a preliminary prior $\pi_0$ is elicited; robustness is then sought for all priors $\pi$ in a neighborhood of $\pi_0$. A commonly accepted way to construct neighborhoods around $\pi_0$ is through contamination. Specifically, we will consider two different classes of contaminated (mixture) priors, given by

$$\Gamma_1 = \big\{\pi : \pi(\theta) = (1-\epsilon)\,\pi_0(\theta) + \epsilon\, q(\theta),\ q \in Q\big\} \quad (1)$$

and

$$\Gamma_2 = \big\{\pi : \pi(\theta) = C(\epsilon)\,\pi_0^{1-\epsilon}(\theta)\,q^{\epsilon}(\theta),\ q \in Q\big\}, \quad (2)$$

where $\pi_0$ is the elicited prior, $Q$ is a class of distributions, $C(\epsilon)$ is a normalizing constant, and $\epsilon$ is a small given number denoting the amount of contamination. For other possible classes of priors see, for instance, De Robertis and Hartigan (1981) [1] and Das Gupta and Studden (1988a, 1988b) [2,3].
The class (1) is known as the $\epsilon$-contaminated class of priors, and many papers about it are found in the literature. For instance, Berger (1984, 1990) [4,5], Berger and Berliner (1986) [6], and Sivaganesan and Berger (1989) [7] used various choices of $Q$. Wasserman (1989) [8] used (1) to study robustness of likelihood regions. Dey and Birmiwal (1994) [9] studied robustness based on the curvature. Al-Labadi and Evans (2017) [10] studied robustness of relative belief ratios (Evans, 2015 [11]) under class (1).
On the other hand, the class (2) will be referred to as the geometric contamination or mixture class. This class was first studied, in the context of Bayesian robustness, by Gelfand and Dey (1991) [12], where posterior robustness was measured using the Kullback-Leibler divergence. Dey and Birmiwal (1994) [9] generalized the results of Gelfand and Dey (1991) [12] under (1) and (2) by using the $\phi$-divergence defined by

$$D_\phi(\pi_1, \pi_2) = E_{\pi_2}\!\left[\phi\!\left(\frac{\pi_1(\theta\,|\,x)}{\pi_2(\theta\,|\,x)}\right)\right] \quad (3)$$

for a smooth convex function $\phi$. For example, $\phi(t) = t\log t$ gives the Kullback-Leibler divergence.
In this paper, we extend the results of Gelfand and Dey (1991) [12] and Dey and Birmiwal (1994) [9] by applying Rényi divergence to both classes (1) and (2). This yields a local sensitivity analysis of the effect of small perturbations of the prior. Rényi entropy, developed by the Hungarian mathematician Alfréd Rényi in 1961, generalizes the Shannon entropy and includes other entropy measures as special cases. It finds applications, for instance, in statistics [13], pattern recognition [14], economics [15], and biomedicine [16].
Although the focus of this paper is on Rényi divergence, it also covers the $(h, \phi)$ family of divergence measures (Menéndez et al., 1995 [17]). Examples of $(h, \phi)$-divergences include the Rényi divergence, the Sharma–Mittal divergence, and the Bhattacharyya divergence. We refer the reader to Pardo (2006) [18] for more details about the $(h, \phi)$-divergence.
An outline of this paper is as follows. In Section 2, we give definitions, notation, and some properties of Rényi divergence. In Section 3, we develop curvature formulas for measuring robustness based on Rényi divergence and the $(h, \phi)$-divergence. In Section 4, three examples are studied to illustrate the results numerically. Section 5 ends with a brief summary of the results.
2. Definitions and Notations
Suppose we have a statistical model given by the density function $f(x\,|\,\theta)$ (with respect to some measure), where $\theta$ is an unknown parameter belonging to the parameter space $\Theta$. Let $\pi(\theta)$ be the prior distribution of $\theta$. After observing the data $x$, by Bayes’ theorem, the posterior distribution of $\theta$ is given by the density

$$\pi(\theta\,|\,x) = \frac{f(x\,|\,\theta)\,\pi(\theta)}{m(x)},$$

where $m(x) = \int_\Theta f(x\,|\,\theta)\,\pi(\theta)\,d\theta$ is the prior predictive density of the data.
To measure the divergence between two posterior distributions, we consider Rényi divergence (Rényi, 1961 [19]). The Rényi divergence of order $a$ between two posterior densities $\pi_1(\theta\,|\,x)$ and $\pi_2(\theta\,|\,x)$ is defined as

$$D_a(\pi_1, \pi_2) = \frac{1}{a-1}\,\log E_{\pi_2}\!\left[\left(\frac{\pi_1(\theta\,|\,x)}{\pi_2(\theta\,|\,x)}\right)^{a}\right], \quad (4)$$

where $a > 0$, $a \neq 1$, and $E_{\pi_2}$ denotes the expectation with respect to the density $\pi_2(\theta\,|\,x)$. It is known that $D_a(\pi_1, \pi_2) \geq 0$ for all $a$, and $D_a(\pi_1, \pi_2) = 0$ if and only if $\pi_1 = \pi_2$. Please note that the case $a = 1$ is defined by letting $a \to 1$, which recovers the Kullback-Leibler divergence. Other values of $a$ of particular interest are $a = 1/2$ and $a \to \infty$ (van Erven and Harremoës, 2014 [20]). For further properties of Rényi divergence consult, for example, Li and Turner (2016) [21].
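As a quick numerical illustration of these properties, the following sketch evaluates the Rényi divergence (4) between two discrete distributions and checks nonnegativity and the $a \to 1$ Kullback-Leibler limit; the two distributions are arbitrary toy examples, not taken from the paper.

```python
import numpy as np

def renyi_divergence(p, q, a):
    """Rényi divergence of order a (a > 0, a != 1) between two discrete
    distributions: D_a(p, q) = log(sum_i p_i^a * q_i^(1-a)) / (a - 1)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.log(np.sum(p**a * q**(1.0 - a))) / (a - 1.0)

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.25, 0.25, 0.5])

d_half = renyi_divergence(p, q, 0.5)        # order a = 1/2
d_near_one = renyi_divergence(p, q, 1.0001) # a -> 1 approaches the KL divergence
kl = float(np.sum(p * np.log(p / q)))       # Kullback-Leibler divergence
```

Note that $D_a$ is nondecreasing in $a$, so the order-$1/2$ value never exceeds the Kullback-Leibler value.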
Rényi divergence belongs to the following general family of divergence measures, called the $(h, \phi)$-divergence (Menéndez et al., 1995 [17]).

Definition 1. Let $h$ be a differentiable increasing real function mapping from $[0, \infty)$ to $[0, \infty)$. The $(h, \phi)$-divergence measure between two posterior distributions $\pi_1(\theta\,|\,x)$ and $\pi_2(\theta\,|\,x)$ is defined as

$$D_\phi^h(\pi_1, \pi_2) = h\big(D_\phi(\pi_1, \pi_2)\big),$$

where $D_\phi$ is the $\phi$-divergence defined in (3).

Please note that Rényi divergence is an $(h, \phi)$-divergence measure with

$$h(t) = \frac{1}{a-1}\log\big(1 + a(a-1)t\big), \qquad \phi(t) = \frac{t^a - a(t-1) - 1}{a(a-1)},$$

for $a \neq 0, 1$. To see this, from Definition 1, we have

$$D_\phi(\pi_1, \pi_2) = \frac{E_{\pi_2}\!\big[\big(\pi_1(\theta\,|\,x)/\pi_2(\theta\,|\,x)\big)^{a}\big] - 1}{a(a-1)},$$

so that

$$h\big(D_\phi(\pi_1, \pi_2)\big) = \frac{1}{a-1}\log E_{\pi_2}\!\left[\left(\frac{\pi_1(\theta\,|\,x)}{\pi_2(\theta\,|\,x)}\right)^{a}\right],$$

which is Rényi divergence as defined in (4).
Similar to McCulloch (1989) [22] and Dey and Birmiwal (1994) [9], who calibrate, respectively, the Kullback-Leibler divergence and the $\phi$-divergence, it is also possible to calibrate Rényi divergence as follows. Consider a biased coin where $X = 1$ (heads) occurs with probability $p$. Then the Rényi divergence between an unbiased and a biased coin is

$$D_a = \frac{1}{a-1}\log\Big(2^{-a}\big(p^{1-a} + (1-p)^{1-a}\big)\Big), \quad (5)$$

where, for $a \neq 1$, the unbiased coin plays the role of $\pi_1$ and the biased coin that of $\pi_2$. Now, setting $D_a = d$ gives

$$2^{-a}\big(p^{1-a} + (1-p)^{1-a}\big) = e^{(a-1)d}. \quad (6)$$

Then the number $p$ is the calibration of $d$. In general, Equation (6) needs to be solved numerically for $p$. Please note that for the case $a \to 1$ (i.e., the Kullback-Leibler divergence) one may use the following explicit formula for $p$, due to McCulloch (1989) [22]:

$$p = \frac{1}{2}\Big(1 + \sqrt{1 - e^{-2d}}\Big). \quad (7)$$

Values of $p$ close to 1 indicate that $\pi_1$ and $\pi_2$ are quite different, while values of $p$ close to 0.5 imply that they are similar. Restricting $p$ to lie between 0.5 and 1 ensures a one-to-one correspondence between $p$ and $d$.
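A minimal sketch of this calibration, assuming the divergence in (5) is taken from the unbiased to the biased coin (the direction consistent with McCulloch's explicit formula (7)). The divergence is monotone in $p$ on $[0.5, 1)$, so a simple bisection solves the analogue of Equation (6); the function names are ours.

```python
import math

def renyi_coin(p, a):
    """Rényi divergence of order a between an unbiased coin and a coin with
    heads probability p; increasing in p on [0.5, 1)."""
    return math.log(2.0**(-a) * (p**(1 - a) + (1 - p)**(1 - a))) / (a - 1)

def calibrate(d, a, tol=1e-12):
    """Solve renyi_coin(p, a) = d for p in [0.5, 1) by bisection,
    a numerical analogue of Equation (6)."""
    lo, hi = 0.5, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if renyi_coin(mid, a) < d:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# For a -> 1 the numerical solution matches McCulloch's explicit formula (7):
#   p = (1 + sqrt(1 - exp(-2 d))) / 2
d = 0.1
p_numeric = calibrate(d, a=1.0001)
p_explicit = 0.5 * (1.0 + math.sqrt(1.0 - math.exp(-2.0 * d)))
```

Bisection is used rather than a closed form because (6) has no explicit solution for general $a$.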
A motivating key fact about Rényi divergence follows from its Taylor expansion. Let $D_a(\epsilon)$ denote the Rényi divergence between $\pi_\epsilon(\cdot\,|\,x)$ and $\pi_0(\cdot\,|\,x)$, where $\pi_\epsilon(\theta\,|\,x)$ is the posterior distribution of $\theta$ given the data $x$ under a prior $\pi_\epsilon$ of the form (1) or (2). Assuming differentiability with respect to $\epsilon$, the Taylor expansion of $D_a(\epsilon)$ about $\epsilon = 0$ is given by

$$D_a(\epsilon) = D_a(0) + \epsilon\,D_a'(0) + \frac{\epsilon^2}{2}\,D_a''(0) + o(\epsilon^2).$$

Clearly, $D_a(0) = 0$. If integration and differentiation are interchangeable, we have $D_a'(0) = 0$. On the other hand, differentiating a second time and evaluating at $\epsilon = 0$ gives $D_a''(0) = a\,I(0)$, where $I(\epsilon)$ is the Fisher information function of the family $\{\pi_\epsilon(\cdot\,|\,x)\}$ (Lehmann and Casella, 1998 [23]). Thus, for small $\epsilon$, we have

$$D_a(\epsilon) = \frac{a\,\epsilon^2}{2}\,I(0) + o(\epsilon^2). \quad (8)$$

Please note that $a\,I(0)$ is known as the local curvature of Rényi divergence at $\epsilon = 0$. Formula (8) justifies the use of the curvature to measure the Bayesian robustness of the two classes of priors defined in (1) and (2), respectively. Also, this formula provides a direct relationship between Fisher information and the curvature of Rényi divergence.
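As a sanity check on the quadratic approximation (8), the following sketch contaminates a discrete baseline distribution linearly, as in class (1), and compares the exact Rényi divergence with $(a\,\epsilon^2/2)\,I(0)$, where for this linear contamination $I(0)$ equals the variance of $q/\pi_0$ under $\pi_0$. The distributions are arbitrary toy choices, not from the paper.

```python
import numpy as np

a, eps = 0.5, 1e-3
p0 = np.array([0.2, 0.3, 0.5])    # baseline distribution (plays the role of pi_0)
q = np.array([0.4, 0.4, 0.2])     # contaminant
p_eps = (1 - eps) * p0 + eps * q  # linear contamination, as in class (1)

# Exact Rényi divergence of order a between p_eps and p0
exact = np.log(np.sum(p_eps**a * p0**(1 - a))) / (a - 1)

# Curvature approximation (8): D_a(eps) ~ (a * eps^2 / 2) * I(0),
# with I(0) = Var_{p0}[q / p0] (the mean of q/p0 under p0 is 1)
i0 = np.sum(p0 * (q / p0 - 1.0)**2)
approx = 0.5 * a * eps**2 * i0
```

For $\epsilon$ this small, the exact divergence and the curvature approximation agree to within a couple of percent.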
4. Examples
In this section, the derived results are illustrated through three examples: the Bernoulli model, the multinomial model, and the location normal model. In each example, the curvature values for the two classes (1) and (2) are reported. Additionally, in Example 1, we compute the Rényi divergence between $\pi_0(\theta\,|\,x)$ and $\pi_\epsilon(\theta\,|\,x)$ and report the calibrated value $p$ as described in (6) and (7). Recall that curvature values close to zero indicate robustness of the used prior, whereas larger values suggest lack of robustness. On the other hand, values of $p$ close to 0.5 suggest robustness, whereas values of $p$ close to 1 indicate absence of robustness.
Example 1 (Bernoulli Model). Suppose $x = (x_1, \ldots, x_n)$ is a sample from a Bernoulli distribution with parameter $\theta$. Let the prior $\pi_0$ be Beta$(\alpha, \beta)$, i.e.,

$$\pi_0(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}, \quad 0 < \theta < 1,$$

where $B(\cdot,\cdot)$ denotes the beta function. Thus, $\pi_0(\theta\,|\,x)$ is Beta$\big(\alpha + n\bar{x},\ \beta + n(1-\bar{x})\big)$, where $\bar{x} = n^{-1}\sum_{i=1}^n x_i$. Let $q$ be Beta$(c\alpha, c\beta)$ for $c > 0$. Now consider two samples of sizes $n_1$ and $n_2$ generated from a Bernoulli$(\theta_0)$ distribution. For comparison purposes, we consider several values of $(\alpha, \beta)$ and $c$. Although it is possible to find exact formulas for the curvature by some algebraic manipulation, it is more convenient to use a Monte Carlo approach in this example. The computational steps are summarized in Algorithm 1.
Algorithm 1 Computing the curvature based on a Monte Carlo approach

1. For $i = 1, \ldots, N$, generate $\theta_i$ from the posterior $\pi_0(\theta\,|\,x)$.
2. For each $\theta_i$, compute $r_i = q(\theta_i)/\pi_0(\theta_i)$ and $\log r_i$.
3. Compute the sample variance of the values $r_1, \ldots, r_N$ and denote it by $s_1^2$. Return $a\,s_1^2$ as the curvature value under class (1).
4. Compute the sample variance of the values $\log r_1, \ldots, \log r_N$ and denote it by $s_2^2$. Return $a\,s_2^2$ as the curvature value under class (2).
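A minimal Python sketch of this Monte Carlo recipe for the Beta–Bernoulli setting of Example 1. It assumes the contaminant $q =$ Beta$(c\alpha, c\beta)$ and returns $a$ times the sample variance of $q(\theta_i)/\pi_0(\theta_i)$ and of its logarithm as the class (1) and class (2) curvature values; the data and hyperparameter values are illustrative, not those of the paper.

```python
import math
import numpy as np

def beta_logpdf(t, a1, b1):
    """Log density of Beta(a1, b1) evaluated at the points in array t."""
    c = math.lgamma(a1 + b1) - math.lgamma(a1) - math.lgamma(b1)
    return c + (a1 - 1) * np.log(t) + (b1 - 1) * np.log(1 - t)

def curvatures(x, alpha, beta, c, a, N=100_000, seed=0):
    """Monte Carlo curvature values at eps = 0 for the Beta-Bernoulli model:
    prior pi0 = Beta(alpha, beta), contaminant q = Beta(c*alpha, c*beta)."""
    rng = np.random.default_rng(seed)
    s, n = int(np.sum(x)), len(x)
    theta = rng.beta(alpha + s, beta + n - s, size=N)  # posterior draws
    log_r = (beta_logpdf(theta, c * alpha, c * beta)
             - beta_logpdf(theta, alpha, beta))        # log(q / pi0)
    r = np.exp(log_r)
    # Classes (1) and (2): a * sample variance of r and of log r
    return a * np.var(r, ddof=1), a * np.var(log_r, ddof=1)

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=50)
c1, c2 = curvatures(x, alpha=1.0, beta=1.0, c=2.0, a=0.5)
c1_null, c2_null = curvatures(x, alpha=1.0, beta=1.0, c=1.0, a=0.5)  # q = pi0
```

When $c = 1$ the contaminant coincides with $\pi_0$ and both curvature values vanish, matching the behavior reported for Table 1.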
The values of the curvature for both classes (1) and (2) are reported in Table 1. Remarkably, for the cases of the uniform prior on $[0,1]$ ($\alpha = \beta = 1$) and Jeffreys’ prior ($\alpha = \beta = 1/2$), the curvature values are prominently small for all values of $c$. Also, it is clear that when $c = 1$, the curvature values are 0. It is worth noticing that, when fixing the parameters $(\alpha, \beta)$ and $c$, the curvature decreases as the sample size increases. This supports the fact that the effect of the prior dissipates with increasing sample size.
While it is easier to quantify the curvature based on Theorems 1 and 2, in this example, for comparison purposes, we also computed the Rényi divergence between $\pi_0(\theta\,|\,x)$ and $\pi_\epsilon(\theta\,|\,x)$ under classes (1) and (2). It can be shown that under class (1) in (9),

$$\pi_\epsilon(\theta\,|\,x) = \lambda(x)\,\pi_0(\theta\,|\,x) + \big(1 - \lambda(x)\big)\,q(\theta\,|\,x),$$

where

$$\lambda(x) = \frac{(1-\epsilon)\,m_{\pi_0}(x)}{(1-\epsilon)\,m_{\pi_0}(x) + \epsilon\,m_q(x)},$$

and $m_{\pi_0}(x)$ and $m_q(x)$ denote the prior predictive densities under $\pi_0$ and $q$, respectively. Also, from (17), it can easily be concluded that the posterior $\pi_\epsilon(\theta\,|\,x)$ under class (2) is Beta$\big((1-\epsilon)\alpha + \epsilon c\alpha + n\bar{x},\ (1-\epsilon)\beta + \epsilon c\beta + n(1-\bar{x})\big)$. Please note that since these posteriors are available in closed form, it is possible to compute the distance based on a Monte Carlo approach. When $a \to 1$, $D_a$ reduces to the Kullback-Leibler divergence. We also calibrated the Rényi divergence values as described in (6) and (7). To save space, the results based on classes (1) and (2) for the first sample are reported in Table 2 and Table 3, respectively.
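Since both posteriors are Beta densities under the assumed contaminant $q =$ Beta$(c\alpha, c\beta)$, the Rényi distance between them can be estimated by Monte Carlo, sampling from the baseline posterior. The sketch below does this for the class (2) posterior; all parameter values are illustrative, not those used for Tables 2 and 3.

```python
import math
import numpy as np

def renyi_beta_mc(a1, b1, a2, b2, a, N=400_000, seed=0):
    """Monte Carlo Rényi divergence of order a between Beta(a1, b1) and
    Beta(a2, b2): D_a = log(E_2[(pi_1/pi_2)^a]) / (a - 1), sampling from pi_2."""
    rng = np.random.default_rng(seed)
    t = rng.beta(a2, b2, size=N)
    def logpdf(t, p, q):
        c = math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)
        return c + (p - 1) * np.log(t) + (q - 1) * np.log(1 - t)
    log_ratio = logpdf(t, a1, b1) - logpdf(t, a2, b2)
    return math.log(np.mean(np.exp(a * log_ratio))) / (a - 1)

# Baseline posterior Beta(alpha + s, beta + n - s) versus the class (2)
# posterior, whose prior is again a Beta under the assumed geometric mixture.
alpha, beta, c, epsilon, s, n = 1.0, 1.0, 5.0, 0.3, 14, 50  # illustrative values
a_eps = (1 - epsilon) * alpha + epsilon * c * alpha
b_eps = (1 - epsilon) * beta + epsilon * c * beta
d = renyi_beta_mc(a_eps + s, b_eps + n - s, alpha + s, beta + n - s, a=0.5)
```

The resulting distance can then be calibrated by solving (6) for $p$, as described above.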
Please note that, from (8), by multiplying the curvature value in Table 1 by $\epsilon^2/2$, one may obtain the value of the corresponding distance in Table 2 and Table 3. For instance, taking the curvature value from Table 1 for a given choice of $(\alpha, \beta)$ and $c$ and multiplying it by $\epsilon^2/2$ gives a value close to the corresponding distance reported in Table 2.
Now we consider the Australian AIDS survival data, available in the R package “MASS”. There are 2843 patients diagnosed with AIDS in Australia before 1 July 1991. The data frame contains the following columns: state, sex, date of diagnosis, date of death or end of observation, status (“A” (alive) or “D” (dead) at end of observation), reported transmission category, and age at diagnosis. There are 1082 alive and 1761 dead cases. We consider the values of the column status. Under the prior distribution given above, the values of the curvature for the two classes (1) and (2) are summarized in Table 4 for a random subsample and for the whole data. It is interesting to notice that, unlike the subsample, for the whole data set (i.e., $n = 2843$), the value of the curvature is small for all cases of $(\alpha, \beta)$ and $c$, demonstrating the reduced effect of the prior in the presence of a large sample size.
Example 2 (Multinomial Model). Suppose that $x = (x_1, \ldots, x_k)$ is an observation from a multinomial distribution with parameters $(n, \theta_1, \ldots, \theta_k)$, where $\sum_{i=1}^k x_i = n$ and $\sum_{i=1}^k \theta_i = 1$. Let the prior $\pi_0$ be Dirichlet$(\alpha_1, \ldots, \alpha_k)$. Then the posterior $\pi_0(\theta\,|\,x)$ is Dirichlet$(\alpha_1 + x_1, \ldots, \alpha_k + x_k)$. Let $q$ be Dirichlet$(c\alpha_1, \ldots, c\alpha_k)$. We consider an observation generated from a multinomial distribution. As in Example 1, we use a Monte Carlo approach to compute the curvature values. Table 5 reports values of the curvature for different values of $(\alpha_1, \ldots, \alpha_k)$ and $c$. For the cases of the uniform prior over the simplex ($\alpha_i = 1$) and Jeffreys’ prior ($\alpha_i = 1/2$), the curvature values are prominently small.
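The Monte Carlo recipe of Algorithm 1 carries over directly, sampling from the Dirichlet posterior. The sketch below assumes the contaminant $q =$ Dirichlet$(c\alpha_1, \ldots, c\alpha_k)$; the counts and hyperparameters are illustrative.

```python
import math
import numpy as np

def dirichlet_logpdf(t, alpha):
    """Log density of Dirichlet(alpha) at the rows of t (each row sums to 1)."""
    c = math.lgamma(sum(alpha)) - sum(math.lgamma(a_i) for a_i in alpha)
    return c + np.sum((np.asarray(alpha) - 1) * np.log(t), axis=1)

def curvatures_multinomial(x, alpha, c, a, N=100_000, seed=0):
    """Monte Carlo curvature values at eps = 0 for the multinomial-Dirichlet
    model, with pi0 = Dirichlet(alpha) and contaminant q = Dirichlet(c*alpha)."""
    rng = np.random.default_rng(seed)
    post = np.asarray(alpha, float) + np.asarray(x, float)  # Dirichlet posterior
    theta = rng.dirichlet(post, size=N)
    log_r = (dirichlet_logpdf(theta, [c * a_i for a_i in alpha])
             - dirichlet_logpdf(theta, alpha))              # log(q / pi0)
    return a * np.var(np.exp(log_r), ddof=1), a * np.var(log_r, ddof=1)

x = [12, 18, 20]  # illustrative multinomial counts
k1, k2 = curvatures_multinomial(x, alpha=[1.0, 1.0, 1.0], c=2.0, a=0.5)
k1_null, k2_null = curvatures_multinomial(x, alpha=[1.0, 1.0, 1.0], c=1.0, a=0.5)
```

As in the Bernoulli case, $c = 1$ makes the contaminant coincide with $\pi_0$ and both curvature values vanish.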
Example 3 (Location Normal Model). Suppose that $x = (x_1, \ldots, x_n)$ is a sample from an $N(\theta, \sigma_0^2)$ distribution with $\sigma_0^2$ known. Let the prior of $\theta$ be $N(\mu_0, \tau_0^2)$, so that the posterior $\pi_0(\theta\,|\,x)$ is also normal, and let the contaminant $q$ be normal as well. Due to some interesting theoretical properties in this example, we present the exact formulas of the curvature for class (1) and class (2). The ratio $q(\theta)/\pi_0(\theta)$ is the exponential of a polynomial in $\theta$. Therefore, for class (1), the curvature

$$a\,\mathrm{Var}_{\pi_0(\cdot\,|\,x)}\!\left[\frac{q(\theta)}{\pi_0(\theta)}\right]$$

can be evaluated in closed form through $M_{\pi_0(\cdot\,|\,x)}$, the moment generating function with respect to the posterior density $\pi_0(\theta\,|\,x)$. On the other hand, for the geometric contaminated class, the curvature is

$$a\,\mathrm{Var}_{\pi_0(\cdot\,|\,x)}\!\left[\log\frac{q(\theta)}{\pi_0(\theta)}\right].$$
Interestingly, the curvature for the geometric contaminated class depends on the sample only through its size $n$. For fixed values of the prior hyperparameters and $c$, the curvature tends to 0 as $n \to \infty$, which indicates robustness. Also, for fixed values of the hyperparameters and $n$, the curvature grows without bound as the contamination becomes extreme, and no robustness will be found.
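To make the $n$-only dependence concrete, here is a small sketch under the illustrative assumption of a mean-shifted contaminant $q = N(\mu_1, \tau_0^2)$ (this particular choice is ours, not necessarily the paper's): $\log\big(q(\theta)/\pi_0(\theta)\big)$ is then linear in $\theta$, so its posterior variance involves only the posterior variance of $\theta$, which depends on the data only through $n$.

```python
import numpy as np

def curvature_class2_normal(x, sigma, mu0, tau, mu1, a):
    """Exact class (2) curvature a * Var_post[log(q/pi0)] for the location
    normal model, with pi0 = N(mu0, tau^2) and a mean-shifted contaminant
    q = N(mu1, tau^2) (illustrative choice).  log(q/pi0) is linear in theta,
    so the result depends on the data only through the sample size n."""
    n = len(x)
    v_post = 1.0 / (n / sigma**2 + 1.0 / tau**2)  # posterior variance of theta
    slope = (mu1 - mu0) / tau**2                  # coefficient of theta in log(q/pi0)
    return a * slope**2 * v_post

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, size=30)
x2 = rng.normal(5.0, 1.0, size=30)  # very different sample, same size n
c_a = curvature_class2_normal(x1, 1.0, 0.0, 1.0, 1.0, a=0.5)
c_b = curvature_class2_normal(x2, 1.0, 0.0, 1.0, 1.0, a=0.5)
```

The two samples give identical curvature values despite having very different observations, and the value shrinks as $n$ grows, in line with the robustness discussion above.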
Now we consider a numerical example by generating a sample of size $n$ from a normal distribution. Table 6 reports the values of the curvature for different values of the hyperparameters and $c$. Clearly, for a diffuse prior (large $\tau_0^2$), the value of the curvature is small, which is an indication of robustness. For instance, for a fixed $c$ in Table 6, the value of the curvature under a large prior variance is much smaller than under a small prior variance.