Bayesian Testing of a Point Null Hypothesis Based on the Latent Information Prior

Komaki, Fumiyasu

doi:10.3390/e15104416

by Fumiyasu Komaki ¹,²

¹ Department of Mathematical Informatics, Graduate School of Information Science and Technology, the University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan

² RIKEN Brain Science Institute, 2-1 Hirosawa, Wako City, Saitama 351-0198, Japan

Entropy 2013, 15(10), 4416-4431; https://doi.org/10.3390/e15104416

Submission received: 9 August 2013 / Revised: 16 September 2013 / Accepted: 10 October 2013 / Published: 17 October 2013

Abstract: Bayesian testing of a point null hypothesis is considered. The null hypothesis is that an observation, x, is distributed according to the normal distribution with a mean of zero and known variance $σ^{2}$. The alternative hypothesis is that x is distributed according to a normal distribution with an unknown nonzero mean, μ, and variance $σ^{2}$. The testing problem is formulated as a prediction problem. Bayesian testing based on priors constructed by using conditional mutual information is investigated.

Keywords: conditional mutual information; discrete prior; Kullback-Leibler divergence; prediction; reference prior; Jeffreys-Lindley paradox

1. Introduction

We investigate a problem of testing a point null hypothesis from the viewpoint of prediction. The null hypothesis,

H_{0}

, is that an observation, x, is distributed according to the normal distribution,

N (0, σ^{2})

, with a mean of zero and variance

σ^{2}

, and the alternative hypothesis,

H_{1}

, is that x is distributed according to a normal distribution

N (μ, σ^{2})

with unknown nonzero mean μ and variance

σ^{2}

. The variance,

σ^{2}

, is assumed to be known. This simple testing problem has various essential aspects in common with more general testing problems and has been discussed by many researchers. An essential part of our discussion in the present paper holds for other testing problems based on more general models.

The assumption that the sample size is one is not essential. When we have N observations $x_{1}, x_{2}, \dots, x_{N}$ from $N (0, σ^{2})$ or $N (μ, σ^{2})$, the sufficient statistic $\bar{x} = \sum_{i = 1}^{N} x_{i} / N$ is distributed according to $N (0, σ^{2} / N)$ under $H_{0}$ or $N (μ, σ^{2} / N)$ under $H_{1}$. Then, the null hypothesis is that $\bar{x}$ is distributed according to $N (0, {\tilde{σ}}^{2})$, and the alternative hypothesis is that $\bar{x}$ is distributed according to $N (\tilde{μ}, {\tilde{σ}}^{2})$ $(\tilde{μ} \neq 0)$, where ${\tilde{σ}}^{2} := σ^{2} / N$ and $\tilde{μ} := μ$. Thus, the testing problem with sample size N is essentially the same as that with sample size one. From now on, the variance, $σ^{2}$, is set to one without loss of generality.

We formulate the testing problem as a prediction problem. Let

m = 0

if

H_{0}

is true and

m = 1

if

H_{1}

is true. Let w be the probability that

m = 0

, and let

π (d μ)

be the prior probability measure of μ. The probability, w, is set to be

1 / 2

in many previous studies, and the choice of

π (d μ)

is discussed; see, e.g., [1] and the references therein. The objective is to predict m by using a Bayesian predictive distribution,

p_{w, π} (m ∣ x)

, depending on the prior

π (d μ)

and the observation, x.

Common choices of π are the Normal prior

(1 / \sqrt{2 π τ^{2}}) exp (- μ^{2} / 2 τ^{2}) d μ

and the Cauchy prior

1 / {π γ (1 + μ^{2} / γ^{2})} d μ

, recommended by Jeffreys [2]. Sometimes, it is considered that large values of scale parameters τ and γ represent “ignorance” about μ. However, such a naive choice of scale parameter values could cause a serious problem known as the Jeffreys–Lindley paradox [3].
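The effect of a very large scale parameter can be illustrated numerically. The following is a minimal sketch, assuming the normal prior with scale τ, unit observation variance and w = 1/2; the function name `posterior_null_normal_prior` is introduced here for illustration only. Under these assumptions the marginal of x under the alternative is N(0, 1 + τ²), so the posterior probability of the null hypothesis has a closed form.

```python
import math


def posterior_null_normal_prior(x, tau, w=0.5):
    """P(m = 0 | x) when mu ~ N(0, tau^2) under H1 and x | mu ~ N(mu, 1)."""
    var1 = 1.0 + tau * tau  # marginal variance of x under H1
    # log of the Bayes factor p_0(x) / p_pi(x), with p_pi the N(0, 1 + tau^2) density
    log_bf01 = 0.5 * math.log(var1) - 0.5 * x * x * (tau * tau / var1)
    odds = (w / (1.0 - w)) * math.exp(log_bf01)
    return odds / (1.0 + odds)


# For a fixed observation, the posterior probability of H0 grows with tau.
for tau in (1.0, 10.0, 100.0, 1e4, 1e6):
    prob = posterior_null_normal_prior(3.0, tau)
    print(f"tau = {tau:>9g}   P(H0 | x = 3) = {prob:.5f}")
```

For a fixed observation such as x = 3, the printed probabilities approach one as τ increases, which is the behaviour usually referred to as the Jeffreys–Lindley paradox.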

We choose

π (d μ)

from the viewpoint of prediction and construct a Bayesian predictive distribution to predict m based on an objectively chosen prior. In the testing problem, the variable, m, is predicted, the variable, x, is observed and the parameter, μ, is neither observed nor predicted. The latent information prior

π^{*}

[4] is defined as a prior maximizing the conditional mutual information:

\begin{matrix} I_{m; μ ∣ x} (w, π) & = \sum_{m = 0}^{1} \int \int p_{w, π} (x, μ, m ∣ w) log \frac{p_{w, π} (m, μ ∣ x)}{p_{w, π} (m ∣ x) p_{w, π} (μ ∣ x)} d x d μ \end{matrix}

(1)

between m and μ given x.

The latent information prior introduced in [4] is an objective Bayes prior. An outline of the method based on it is as follows. First, a statistical problem is formulated as a prediction problem, in which x is the observed random variable, y is the random variable to be predicted and θ is the unknown parameter. Then, a prior

π (d θ)

that maximizes the conditional mutual information

I_{y; θ ∣ x} (π)

between y and θ given x is adopted.

In Section 2, we consider the Kullback-Leibler loss for prediction corresponding to Bayesian testing. In Section 3, we obtain the latent information prior and discuss properties of Bayesian testing based on it. In Section 4, we compare the proposed testing based on the latent information prior with Bayesian testing based on the normal prior and the Cauchy prior.

2. Kullback-Leibler Loss of Predictive Densities

We consider Kullback-Leibler loss of prediction corresponding to Bayesian testing. The Bayesian predictive density with respect to w and π is given by:

\begin{matrix} p_{w, π} (m = 0 ∣ x) = & \frac{w p_{0} (x)}{w p_{0} (x) + (1 - w) p_{π} (x)} \end{matrix}

(2)

and

\begin{matrix} p_{w, π} (m = 1 ∣ x) = \frac{(1 - w) p_{π} (x)}{w p_{0} (x) + (1 - w) p_{π} (x)} \end{matrix}

(3)

where:

\begin{matrix} p_{0} (x) = ϕ (x; 0, 1) \quad \text{and} \quad p_{π} (x) = \int ϕ (x; μ, 1) π (d μ) \end{matrix}

(4)

and

ϕ (x; μ, σ^{2})

is the density function of the normal distribution,

N (μ, σ^{2})

.
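Equations (2)–(4) are straightforward to evaluate when π is a discrete measure. The following is a minimal sketch under that assumption; the helper names `normal_pdf` and `posterior_null` and the example atoms are hypothetical and chosen only for illustration.

```python
import math


def normal_pdf(x, mean=0.0):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def posterior_null(x, atoms, weights, w=0.5):
    """Equation (2) for the discrete prior sum_i weights[i] * delta_{atoms[i]}."""
    p0 = normal_pdf(x)
    # Equation (4): the marginal density of x under H1 is a finite mixture.
    p_pi = sum(q * normal_pdf(x, mu) for mu, q in zip(atoms, weights))
    return w * p0 / (w * p0 + (1.0 - w) * p_pi)


# Hypothetical symmetric two-point prior, used only as an example.
for x in (0.0, 1.0, 2.0, 3.0):
    print(x, posterior_null(x, atoms=[-2.0, 2.0], weights=[0.5, 0.5]))
```

If the prior is concentrated at μ = 0, the two hypotheses coincide and the function returns w for every x, which is a convenient sanity check.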

If the value of μ is known, then the alternative hypothesis,

H_{1}

:

N (μ, 1)

, becomes a simple hypothesis, and the predictive distribution is given by the posterior:

\begin{matrix} p_{w} (m = 0 ∣ x, μ) = & \frac{w ϕ (x; 0, 1)}{w ϕ (x; 0, 1) + (1 - w) ϕ (x; μ, 1)} \end{matrix}

(5)

and:

\begin{matrix} p_{w} (m = 1 ∣ x, μ) = & \frac{(1 - w) ϕ (x; μ, 1)}{w ϕ (x; 0, 1) + (1 - w) ϕ (x; μ, 1)} \end{matrix}

(6)

To evaluate the performance of predictive densities, we adopt the Kullback-Leibler divergence:

\sum_{m = 0}^{1} p_{w} (m ∣ x, μ) log \frac{p_{w} (m ∣ x, μ)}{p_{w, π} (m ∣ x)}

(7)

from

p_{w} (m ∣ x, μ)

to

p_{w, π} (m ∣ x)

as a loss function.

The risk function is given by:

\begin{matrix} r_{w} (μ, π) & = \int p_{w} (x ∣ μ) \sum_{m = 0}^{1} p_{w} (m ∣ x, μ) log \frac{p_{w} (m ∣ x, μ)}{p_{w, π} (m ∣ x)} d x \\ & = \sum_{m = 0}^{1} w (m) \int p (x ∣ m, μ) log \frac{p_{w} (m ∣ x, μ)}{p_{w, π} (m ∣ x)} d x \end{matrix}

(8)

where

w (0) = w

and

w (1) = 1 - w

. Here,

p_{w, π} (m ∣ x, μ)

and

p_{w, π} (x ∣ μ)

are denoted by

p_{w} (m ∣ x, μ)

and

p_{w} (x ∣ μ)

, respectively, because they do not depend on π. The distribution of x does not depend on μ if

m = 0

, because

p (x ∣ m = 0, μ) = ϕ (x; 0, 1)

.

It is not fruitful to discuss decision theoretic properties, such as the minimaxity of the risk defined by:

\begin{matrix} - \sum_{m = 0}^{1} w (m) \int p (x ∣ m, μ) log p_{w, π} (m ∣ x) d x \end{matrix}

(9)

because it is easy to distinguish between

H_{0}

and

H_{1}

when

| μ |

is very large.

The Kullback-Leibler risk in Equation (8) corresponds to the regret type quantity:

\begin{matrix} - log p_{w, π} (m ∣ x) + log p_{w} (m ∣ x, μ) \end{matrix}

(10)

which means the loss by not knowing the value of μ. By considering the minimaxity of the regret type risk in Equation (8), several reasonable results are obtained.

Lemma 1.

The risk of a Bayesian predictive density,

p_{w, π} (m ∣ x)

, is given by:

\begin{matrix} r_{w} (μ; π) & = w \int p_{0} (x) log \frac{1 + \frac{1 - w}{w} \frac{p_{π} (x)}{p_{0} (x)}}{1 + \frac{1 - w}{w} \frac{p_{0} (x - μ)}{p_{0} (x)}} d x \\ & + (1 - w) \int p_{0} (x) log \frac{1 + \frac{w}{1 - w} \frac{p_{0} (x + μ)}{p_{π} (x + μ)}}{1 + \frac{w}{1 - w} \frac{p_{0} (x + μ)}{p_{0} (x)}} d x \end{matrix}

(11)

Proof. See the Appendix. ☐

The risk function in Equation (11) is a continuous function of μ for every w and π.
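The risk can also be approximated directly from Equation (8) by numerical integration. The following is a minimal sketch, assuming a discrete prior and a plain trapezoidal rule on a finite interval; the function names, the grid settings and the example prior are assumptions made for illustration and are not taken from the paper.

```python
import math


def phi(x, mean=0.0):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def risk(mu, atoms, weights, w=0.5, half_width=10.0, n=4000):
    """Trapezoidal approximation of r_w(mu, pi) in Equation (8).

    pi is the discrete prior sum_i weights[i] * delta_{atoms[i]}; the atoms
    are assumed to lie within a moderate range so that no density underflows.
    """
    lo, hi = min(0.0, mu) - half_width, max(0.0, mu) + half_width
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * h
        p0, p1 = phi(x), phi(x, mu)
        p_pi = sum(q * phi(x, a) for a, q in zip(atoms, weights))
        den_true = w * p0 + (1.0 - w) * p1    # p_w(x | mu)
        den_pred = w * p0 + (1.0 - w) * p_pi  # marginal under the prior
        # log of p_w(m | x, mu) / p_{w, pi}(m | x) for m = 0 and m = 1
        log_ratio0 = math.log(den_pred / den_true)
        log_ratio1 = math.log(p1 / p_pi) + log_ratio0
        term = w * p0 * log_ratio0 + (1.0 - w) * p1 * log_ratio1
        total += (0.5 if i in (0, n) else 1.0) * term
    return total * h


# Hypothetical symmetric two-point prior, used only as an example.
for mu in (0.5, 1.0, 2.0, 4.0):
    print(mu, risk(mu, atoms=[-1.0, 1.0], weights=[0.5, 0.5]))
```

When the prior is a point mass at the true μ, the predictive distribution coincides with the posterior in Equations (5) and (6) and the computed risk is zero; for any other prior it is positive.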

The Bayes risk with respect to a prior π of a Bayesian predictive density based on

\bar{π}

is:

\begin{matrix} R_{w} (π; \bar{π}) & = \int r_{w} (μ, \bar{π}) π (d μ) \\ & = \sum_{m = 0}^{1} \int \int w (m) p (x ∣ m, μ) log \frac{p_{w} (m ∣ x, μ)}{p_{w, \bar{π}} (m ∣ x)} π (d μ) d x \\ & = \sum_{m = 0}^{1} \int \int w (m) p (x ∣ m, μ) log \frac{p_{w, \bar{π}} (m ∣ x, μ) p_{w, \bar{π}} (μ ∣ x)}{p_{w, \bar{π}} (m ∣ x) p_{w, \bar{π}} (μ ∣ x)} π (d μ) d x \\ & = \sum_{m = 0}^{1} \int \int w (m) p (x ∣ m, μ) log \frac{p_{\bar{π}} (μ ∣ m, x)}{p_{w, \bar{π}} (μ ∣ x)} π (d μ) d x \end{matrix}

(12)

It is known that an important relation:

inf_{\bar{π}} R_{w} (π; \bar{π}) = R_{w} (π; π)

(13)

holds; see [5]. Here,

R_{w} (π; π)

coincides with the conditional mutual information,

I_{m; μ ∣ x} (w, π)

, defined by Equation (1) between m and μ given x.
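Relation (13) can be checked numerically for discrete priors, for which the Bayes risk in the first line of Equation (12) reduces to a finite sum. The following is a minimal sketch under that assumption; the trapezoidal integration, the helper names and both example priors are hypothetical choices made for illustration.

```python
import math


def phi(x, mean=0.0):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def mixture(x, atoms, weights):
    """Marginal density of x under H1 for a discrete prior."""
    return sum(q * phi(x, a) for a, q in zip(atoms, weights))


def risk(mu, atoms, weights, w, half_width=10.0, n=2000):
    """Trapezoidal approximation of r_w(mu, pi_bar) in Equation (8)."""
    lo, hi = min(0.0, mu) - half_width, max(0.0, mu) + half_width
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * h
        p0, p1, p_mix = phi(x), phi(x, mu), mixture(x, atoms, weights)
        log_r0 = math.log((w * p0 + (1 - w) * p_mix) / (w * p0 + (1 - w) * p1))
        log_r1 = math.log(p1 / p_mix) + log_r0
        term = w * p0 * log_r0 + (1 - w) * p1 * log_r1
        total += (0.5 if i in (0, n) else 1.0) * term
    return total * h


def bayes_risk(atoms, weights, atoms_bar, weights_bar, w):
    """R_w(pi; pi_bar) = sum_i weights[i] * r_w(atoms[i], pi_bar)."""
    return sum(q * risk(mu, atoms_bar, weights_bar, w)
               for mu, q in zip(atoms, weights))


# Hypothetical priors, used only as an example.
atoms, weights = [-1.0, 1.0, -3.0, 3.0], [0.3, 0.3, 0.2, 0.2]
info = bayes_risk(atoms, weights, atoms, weights, 0.5)            # R_w(pi; pi)
other = bayes_risk(atoms, weights, [-2.0, 2.0], [0.5, 0.5], 0.5)  # R_w(pi; pi_bar)
print("R(pi; pi)     =", info)
print("R(pi; pi_bar) =", other)
```

In this sketch the value R_w(π; π), which is the conditional mutual information, should not exceed R_w(π; π̄) for the mismatched prior, and it cannot exceed the entropy of m, which is log 2 when w = 1/2.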

3. Latent Information Priors

We obtain the latent information prior defined as a prior maximizing the conditional mutual information,

I_{m; μ ∣ x} (w, π)

. We restrict the original parameter space,

R

, of μ to a compact subset,

K \subset R

, for mathematical convenience. A typical choice is a bounded closed interval

K = [- b, b]

. If b is large enough, the testing problem

H_{0} : N (0, σ^{2})

versus

H_{1} : N (μ, σ^{2})

,

μ \in [- b, b]

is close to the original problem.

Let

P (K)

and

P (R)

be the spaces of all probability measures on K and

R

, respectively, endowed with the weak convergence topology. Then,

P (K)

is compact, since K is compact. It is easy to verify that the conditional mutual information,

I_{m; μ ∣ x} (w, π)

, is a continuous function of

w \in [0, 1]

and

π \in P (K)

. Therefore, there exists

π_{w}^{*}

that attains the maximum of Equation (1) for fixed

w \in (0, 1)

, since

P (K)

is compact. In the following,

π_{w}^{*}

is denoted as

π^{*}

by omitting the subscript, w, when there is no confusion.

The Bayesian testing based on the latent information prior,

π^{*} \in P (K)

, has the following minimax property.

Theorem 1.

Let

π^{*} \in P (K)

be the latent information prior. Then:

inf_{π \in P (R)} sup_{μ \in K} r_{w} (μ, π) = sup_{μ \in K} r_{w} (μ, π^{*}) = I_{μ; m ∣ x} (w, π^{*})

(14)

Proof.

It is sufficient to show the relations:

\begin{matrix} I_{μ; m ∣ x} (w, π^{*}) & = R_{w} (π^{*}, π^{*}) = inf_{π \in P (R)} R_{w} (π^{*}, π) \leq sup_{π^{'} \in P (K)} inf_{π \in P (R)} R_{w} (π^{'}, π) \\ & \leq inf_{π \in P (R)} sup_{π^{'} \in P (K)} R_{w} (π^{'}, π) = inf_{π \in P (R)} sup_{μ \in K} r_{w} (μ, π) \leq sup_{μ \in K} r_{w} (μ, π^{*}) \leq R_{w} (π^{*}, π^{*}) \end{matrix}

(15)

In the previous section, we have seen the equalities

I_{μ; m ∣ x} (w, π) = R_{w} (π, π)

and

R_{w} (π^{'}, π^{'}) = {inf}_{π} R_{w} (π^{'}, π)

, corresponding to the first and second equalities in Equation (15). Thus, it is enough to show the last inequality,

{sup}_{μ} r_{w} (μ, π^{*}) \leq R_{w} (π^{*}, π^{*})

, since the relations, except for the first and second equalities and the last inequality, are obvious.

We prove the inequality by contradiction. Assume that there exists a value,

ξ \in K

, such that:

\begin{matrix} r_{w} (ξ, π^{*}) > R_{w} (π^{*}, π^{*}) \end{matrix}

(16)

Let

π_{t} = (1 - t) π^{*} + t δ_{ξ}

(0 \leq t \leq 1)

, where

δ_{ξ}

is the delta measure concentrated at ξ. Then,

π_{t} \in P (K)

. From Equations (12) and (16):

\begin{matrix} \frac{\partial}{\partial t} & R_{w} (π_{t}; π_{t}) |_{t = 0} = \frac{\partial}{\partial t} \{w \int p_{0} (x) log \frac{\frac{w p_{0} (x)}{w p_{0} (x) + (1 - w) p_{1} (x ∣ μ)}}{\frac{w p_{0} (x)}{w p_{0} (x) + (1 - w) p_{π_{t}} (x)}} d x π_{t} (d μ) \\ + (1 - w) \int p_{1} (x ∣ μ) log \frac{\frac{(1 - w) p_{1} (x ∣ μ)}{w p_{0} (x) + (1 - w) p_{1} (x ∣ μ)}}{\frac{(1 - w) p_{π_{t}} (x)}{w p_{0} (x) + (1 - w) p_{π_{t}} (x)}} d x π_{t} (d μ)\} |_{t = 0} \\ = & w \int p_{0} (x) \frac{(1 - w) {p_{δ_{ξ}} (x) - p_{π^{*}} (x)}}{w p_{0} (x) + (1 - w) p_{π_{t}} (x)} d x π_{t} (d μ) |_{t = 0} \\ + w \int p_{0} (x) log \frac{\frac{w p_{0} (x)}{w p_{0} (x) + (1 - w) p_{1} (x ∣ μ)}}{\frac{w p_{0} (x)}{w p_{0} (x) + (1 - w) p_{π_{t}} (x)}} d x {- π^{*} (d μ) + δ_{ξ} (d μ)} |_{t = 0} \\ - (1 - w) \int p_{1} (x ∣ μ) \frac{(1 - w) {p_{δ_{ξ}} (x) - p_{π^{*}} (x)}}{(1 - w) p_{π_{t}} (x)} d x π_{t} (d μ) |_{t = 0} \\ + (1 - w) \int p_{1} (x ∣ μ) \frac{(1 - w) {p_{δ_{ξ}} (x) - p_{π^{*}} (x)}}{w p_{0} (x) + (1 - w) p_{π_{t}} (x)} d x π_{t} (d μ) |_{t = 0} \\ + (1 - w) \int p_{1} (x ∣ μ) log \frac{\frac{(1 - w) p_{1} (x ∣ μ)}{w p_{0} (x) + (1 - w) p_{1} (x ∣ μ)}}{\frac{(1 - w) p_{π_{t}} (x)}{w p_{0} (x) + (1 - w) p_{π_{t}} (x)}} d x {- π^{*} (d μ) + δ_{ξ} (d μ)} |_{t = 0} \\ = & w \int p_{0} (x) log \frac{\frac{w p_{0} (x)}{w p_{0} (x) + (1 - w) p_{1} (x ∣ μ)}}{\frac{w p_{0} (x)}{w p_{0} (x) + (1 - w) p_{π^{*}} (x)}} d x {- π^{*} (d μ) + δ_{ξ} (d μ)} . \\ + (1 - w) \int p_{1} (x ∣ μ) log \frac{\frac{(1 - w) p_{1} (x ∣ μ)}{w p_{0} (x) + (1 - w) p_{1} (x ∣ μ)}}{\frac{(1 - w) p_{π_{t}} (x)}{w p_{0} (x) + (1 - w) p_{π^{*}} (x)}} d x {- π^{*} (d μ) + δ_{ξ} (d μ)} \\ (17) & = & - R_{w} (π^{*}, π^{*}) + r_{w} (ξ, π^{*}) > 0 \end{matrix}

where we put

p_{1} (x ∣ μ) : = p (x ∣ m = 1, μ)

. However,

{max}_{t \in [0, 1]} R_{w} (π_{t}; π_{t}) = R_{w} (π_{0}; π_{0}) = R_{w} (π^{*}; π^{*})

, because of the definition of

π^{*}

and the fact that

π_{t} \in P (K)

. This is a contradiction. Thus, we have proven the desired result. ☐

The discussion in the proof is parallel to that for submodels of multinomial models in [4], although the testing problem is not included in the class considered there. A closely related discussion of the unconditional mutual information is given in Csiszár [6]. See also [7,8].

We set

K = [- b, b]

with

b = 7

and consider two values,

0.5

and

0.355

, of w. The latent information priors,

π_{w}^{*}

, for two values

w = 0.5

and

w = 0.355

are numerically obtained by using a generalized Arimoto-Blahut algorithm, the details of which will be discussed in another place. Here,

w = 0.5

is the setting adopted in many previous studies, and

w = 0.355

is the value maximizing

I_{m; μ ∣ x} (w, π_{w}^{*})

.

The Arimoto-Blahut algorithm [9,10] is widely used in information theory to obtain the capacity of channels. A channel is defined to be a conditional distribution,

p (y ∣ θ)

, of y given θ, where y and θ are random variables taking values in finite sets,

Y

and Θ, respectively. If a channel,

p (y ∣ θ)

, is given, then the mutual information,

I_{y; θ} (π)

, between y and θ is a function of the distribution,

π (θ)

, of θ. The maximum value,

{max}_{π} I_{y; θ} (π)

, of the mutual information as a function of π is called the capacity of the channel

p (y ∣ θ)

. The Arimoto-Blahut algorithm is an iterative algorithm to obtain the capacity

{max}_{π} I_{y; θ} (π)

and the corresponding distribution

π (θ)

, attaining the maximum value. The original Arimoto-Blahut algorithm cannot be directly applied to our problem, since we need to maximize the conditional mutual information,

I_{m; θ ∣ x}

, where x and θ are not discrete random variables, to obtain the latent information prior.
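For reference, the original Arimoto-Blahut iteration for a finite channel can be sketched as follows. This is a minimal illustration of the classical algorithm for the unconditional mutual information only, not the generalized algorithm used to obtain the latent information priors; the function names, the stopping rule based on the standard upper and lower capacity bounds, and the example channels are choices made here for illustration.

```python
import math


def kl_rows(channel, prior):
    """d[i] = KL divergence from p(. | theta_i) to the output marginal."""
    n_out = len(channel[0])
    marginal = [sum(q * row[j] for q, row in zip(prior, channel))
                for j in range(n_out)]
    return [sum(p * math.log(p / marginal[j])
                for j, p in enumerate(row) if p > 0.0)
            for row in channel]


def blahut_arimoto(channel, tol=1e-10, max_iter=100000):
    """Capacity (in nats) and maximizing input distribution.

    channel[i][j] = p(y = j | theta = i) for finite input and output sets.
    """
    prior = [1.0 / len(channel)] * len(channel)
    for _ in range(max_iter):
        d = kl_rows(channel, prior)
        z = sum(q * math.exp(di) for q, di in zip(prior, d))
        # log z is a lower bound and max(d) an upper bound on the capacity
        if max(d) - math.log(z) < tol:
            break
        prior = [q * math.exp(di) / z for q, di in zip(prior, d)]
    d = kl_rows(channel, prior)
    capacity = sum(q * di for q, di in zip(prior, d))  # I_{y; theta}(prior)
    return capacity, prior


# Example channels: binary symmetric (crossover 0.1) and a Z-channel.
bsc = [[0.9, 0.1], [0.1, 0.9]]
z_channel = [[1.0, 0.0], [0.5, 0.5]]
print(blahut_arimoto(bsc))
print(blahut_arimoto(z_channel))
```

For the binary symmetric channel the uniform input distribution is already optimal, so the loop stops at once; for the Z-channel the iteration moves the input distribution away from uniform.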

Figure 1. Latent information priors for (a)

w = 0.5

and for (b)

w = 0.355

.

Figure 1 shows the numerically-obtained latent information priors. The priors have the form:

\begin{matrix} π_{w}^{*} = \frac{u}{2} (δ_{- a} + δ_{a}) + \frac{1 - u}{2} (δ_{- b} + δ_{b}) \end{matrix}

(18)

The parameter values are

a = 1.21

,

b = 7

and

u = 0.440

, when

w = 0.5

, and

a = 1.10

,

b = 7

and

u = 0.393

, when

w = 0.355

.

Lemma 2 below gives the risk of Bayesian testing based on the prior in Equation (18).

Lemma 2.

Let:

π_{a, b, u} = \frac{u}{2} (δ_{- a} + δ_{a}) + \frac{1 - u}{2} (δ_{- b} + δ_{b})

(19)

where

a, b > 0

and

0 \leq u \leq 1

. Then, the risk in Equation (8) is given by:

\begin{matrix} r_{w} (μ; π_{a, b, u}) & = - w \int ϕ (x) log \{1 + \frac{1 - w}{w} exp (- \frac{1}{2} μ^{2} + μ x)\} d x \\ & - (1 - w) \int ϕ (x) log \{1 + \frac{w}{1 - w} exp (- \frac{1}{2} μ^{2} - μ x)\} d x \\ & + w \int ϕ (x) log \{1 + \frac{1 - w}{w} (1 - u) exp (- \frac{1}{2} b^{2}) cosh (b x) + \frac{1 - w}{w} u exp (- \frac{1}{2} a^{2}) cosh (a x)\} d x \\ & + (1 - w) \int ϕ (x - μ) log \{1 + \frac{w}{1 - w} \frac{1}{(1 - u) exp (- \frac{1}{2} b^{2}) cosh (b x) + u exp (- \frac{1}{2} a^{2}) cosh (a x)}\} d x \end{matrix}

(20)

and the conditional mutual information in Equation (1) is given by:

\begin{matrix} I_{m; μ ∣ x} (w, π_{a, b, u}) & = u [- w \int ϕ (x) log \{1 + \frac{1 - w}{w} exp (- \frac{1}{2} a^{2} - a x)\} d x \\ & - (1 - w) \int ϕ (x) log \{1 + \frac{w}{1 - w} exp (- \frac{1}{2} a^{2} - a x)\} d x] \\ & + (1 - u) [- w \int ϕ (x) log \{1 + \frac{1 - w}{w} exp (- \frac{1}{2} b^{2} - b x)\} d x \\ & - (1 - w) \int ϕ (x) log \{1 + \frac{w}{1 - w} exp (- \frac{1}{2} b^{2} - b x)\} d x] \\ & + w \int ϕ (x) log \{1 + \frac{1 - w}{w} u exp (- \frac{1}{2} a^{2}) cosh (a x) + \frac{1 - w}{w} (1 - u) exp (- \frac{1}{2} b^{2}) cosh (b x)\} d x \\ & + (1 - w) u \int ϕ (x - a) log \{1 + \frac{w}{1 - w} \frac{1}{(1 - u) exp (- \frac{b^{2}}{2}) cosh (b x) + u exp (- \frac{a^{2}}{2}) cosh (a x)}\} d x \\ & + (1 - w) (1 - u) \int ϕ (x - b) log \{1 + \frac{w}{1 - w} \frac{1}{(1 - u) exp (- \frac{b^{2}}{2}) cosh (b x) + u exp (- \frac{a^{2}}{2}) cosh (a x)}\} d x \end{matrix}

(21)

The first and second terms in Equation (20) do not depend on π. The third term in Equation (20) does not depend on μ.
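The cosh terms in Equations (20) and (21) come from the likelihood ratio between the marginal density under the four-point prior in Equation (19) and the null density. The following is a minimal sketch that checks this identity numerically; the function names and the parameter values a = 1, b = 3, u = 0.4 are hypothetical and chosen only for illustration.

```python
import math


def phi(x, mean=0.0):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def ratio_direct(x, a, b, u):
    """p_pi(x) / p_0(x) computed from the mixture in Equation (19)."""
    mix = (0.5 * u * (phi(x, -a) + phi(x, a))
           + 0.5 * (1.0 - u) * (phi(x, -b) + phi(x, b)))
    return mix / phi(x)


def ratio_closed_form(x, a, b, u):
    """The same ratio written with cosh, as it appears in Equation (20)."""
    return (u * math.exp(-0.5 * a * a) * math.cosh(a * x)
            + (1.0 - u) * math.exp(-0.5 * b * b) * math.cosh(b * x))


# Hypothetical parameter values, used only as an example.
for x in (-2.0, 0.0, 0.7, 3.5):
    print(x, ratio_direct(x, 1.0, 3.0, 0.4), ratio_closed_form(x, 1.0, 3.0, 0.4))
```

The two columns agree up to rounding because ϕ(x − a) / ϕ(x) = exp(−a²/2 + a x), so the symmetric pair of atoms at ±a contributes exp(−a²/2) cosh(a x).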

Figure 2 shows the risk functions of the latent information priors when

w = 0.5

and

w = 0.355

, respectively. Note that

{max}_{μ \in [- b, b]} r_{w} (μ, π^{*})

is attained at

μ = a

and b in both examples. This is consistent with the proof of Theorem 1, and it is numerically verified that the prior maximizes the conditional mutual information. Furthermore, we observe that the supremum value,

{sup}_{μ \in R} r_{w} (μ, π^{*})

, of the risk without restriction

μ \in [- b, b]

is only slightly larger than the maximum value,

{max}_{μ \in [- b, b]} r_{w} (μ, π^{*})

, with the restriction

μ \in [- b, b]

. The risk functions rapidly converge as μ exceeds seven.

Figure 2. Risk functions of Bayesian testing based on latent information priors for (a)

w = 0.5

and for (b)

w = 0.355

. When

w = 0.5

,

a = 1.21

and

b = 7

. When

w = 0.355

,

a = 1.10

and

b = 7

. The vertical dotted lines indicate the locations of a and b.

Since:

\begin{matrix} sup_{μ \in K} & r (μ, π^{*}) = sup_{π^{'} \in P (K)} inf_{π \in P (K)} R (π^{'}, π) = sup_{π^{'} \in P (K)} inf_{π \in P (R)} R (π^{'}, π) \end{matrix}

\begin{matrix} \leq & sup_{π^{'} \in P (R)} inf_{π \in P (R)} R (π^{'}, π) \leq inf_{π \in P (R)} sup_{π^{'} \in P (R)} R (π^{'}, π) = inf_{π \in P (R)} sup_{μ \in R} r (μ, π) \leq sup_{μ \in R} r (μ, π^{*}) \end{matrix}

(22)

and

{sup}_{μ \in R} r (μ, π^{*}) - {sup}_{μ \in K} r (μ, π^{*})

is small in our problem when

K = [- b, b]

(b = 7)

, the supremum value,

{sup}_{μ \in R} r (μ, π^{*})

, of the risk function of the latent information prior,

π^{*}

, under the parameter restriction,

μ \in [- 7, 7]

, is only slightly larger than the minimax value,

{inf}_{π \in P (R)} {sup}_{μ \in R} r (μ, π)

without the restriction. We see in the next section that the suprema, ${sup}_{μ \in R} r (μ, π)$, of the risk functions of commonly used priors are much larger than that of $π^{*}$.

The discreteness of latent information priors shown in Figure 1 is a remarkable feature. In Bayesian statistics, k-reference priors have been known to be discrete measures in many examples; see [11,12,13]. The k-reference prior is defined to be a prior maximizing the mutual information between

x^{k}

and θ when we have a set,

x^{k}

, of k independent observations,

x_{1}, \dots, x_{k}

, from

p (x ∣ θ)

in a parametric model,

{p (x ∣ θ) ∣ θ \in Θ \subset R^{d}}

. However, such discrete priors have not been widely used. Instead of k-reference priors, reference priors introduced by Bernardo [14] have been used for many problems. Reference priors are not discrete and are defined by considering the limit as the sample size k goes to infinity. One main reason why discrete priors are not popular is that they are totally unacceptable from the viewpoint of subjective Bayes, in which priors are considered to represent prior beliefs about parameters.

Although they have not been widely used, discrete priors, such as latent information priors, are reasonable from the viewpoint of prediction and objective Bayes. Various statistical problems, including estimation and testing, can be formulated from the viewpoint of prediction, and priors can be constructed by considering the conditional mutual information. Thus, latent information priors depending on the choice of variables to be predicted could play important roles in many statistical applications. Conditional mutual information is essential in information theory and naturally appeared in several studies in statistics; see e.g., [15,16]. Priors based on conditional mutual information and those based on unconditional mutual information are often quite different; see [4].

Bayesian testing based on latent information priors is free from the Jeffreys-Lindley paradox [3], since the priors are constructed by using conditional mutual information and depend properly on sample sizes. Posterior probabilities,

p_{w, π^{*}} (m = 0 ∣ x)

, are shown in Figure 3 and are compared with p-values of the two-sided test in Table 1. When

x = 2, 3

and 4, posterior probabilities are much larger than p-values of the two-sided test. Large differences between posterior probabilities and p-values have been widely observed and discussed in [1,17,18].

Figure 3. Posterior probabilities

p_{w, π^{*}} (m = 0 ∣ x)

based on latent information priors for (a)

w = 0.5

and for (b)

w = 0.355

.

**Table 1.** Comparison of posterior probabilities and p-values.

| x | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| $p_{w = 0.5} (m = 0 ∣ x)$ | 0.702 | 0.564 | 0.295 | 0.112 | 0.0217 |
| $p_{w = 0.355} (m = 0 ∣ x)$ | 0.560 | 0.434 | 0.220 | 0.0867 | 0.0145 |
| p-value (two-sided test) | 1 | 0.317 | 0.0455 | 0.00270 | $6.33 \times 10^{- 5}$ |
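The p-value row of Table 1 can be recomputed from the standard normal distribution. The following is a minimal sketch, assuming the usual two-sided test based on |x| with unit variance; the function name `two_sided_p_value` is introduced here for illustration.

```python
import math


def two_sided_p_value(x):
    """P(|Z| >= |x|) for Z ~ N(0, 1), which equals erfc(|x| / sqrt(2))."""
    return math.erfc(abs(x) / math.sqrt(2.0))


for x in (0, 1, 2, 3, 4):
    print(f"x = {x}   p-value = {two_sided_p_value(x):.3g}")
```

Using the complementary error function avoids the loss of precision that would occur when computing 2 (1 − Φ(|x|)) for large |x|.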

4. Other Common Priors

Discrete priors, including the latent information priors discussed in the previous section, have not been widely used in Bayesian statistics. Common priors for the testing problem are the normal prior and the Cauchy prior. Many statisticians seem to have believed that the Cauchy prior is slightly better than the normal prior; see, e.g., [1,2]. In this section, we evaluate the conditional mutual information for these priors and compare their performance with that of the latent information prior.

4.1. The Normal Prior

The normal prior,

ϕ (μ; 0, τ^{2})

, is denoted by

N_{τ}

. From Lemma 1, we have:

\begin{matrix} r_{w} (μ, N_{τ}) & = - w \int ϕ (x; 0, 1) log \{1 + \frac{1 - w}{w} exp (- \frac{1}{2} μ^{2} + μ x)\} d x \\ & - (1 - w) \int ϕ (x; 0, 1) log \{1 + \frac{w}{1 - w} exp (- \frac{1}{2} μ^{2} - μ x)\} d x \\ & + w \int ϕ (x; 0, 1) log \{1 + \frac{1 - w}{w} \frac{ϕ (x; 0, τ^{2} + 1)}{ϕ (x; 0, 1)}\} d x \\ & + (1 - w) \int ϕ (x; μ, 1) log \{1 + \frac{w}{1 - w} \frac{ϕ (x; 0, 1)}{ϕ (x; 0, τ^{2} + 1)}\} d x \end{matrix}

(23)

Thus, the conditional mutual information is given by:

\begin{matrix} I_{m; μ ∣ x} (w, N_{τ}) & = \int r_{w} (μ, N_{τ}) ϕ (μ; 0, τ^{2}) d μ \\ & = - w \int ϕ (x; 0, 1) ϕ (μ; 0, τ^{2}) log \{1 + \frac{1 - w}{w} exp (- \frac{1}{2} μ^{2} + μ x)\} d μ d x \\ & - (1 - w) \int ϕ (x; 0, 1) ϕ (μ; 0, τ^{2}) log \{1 + \frac{w}{1 - w} exp (- \frac{1}{2} μ^{2} - μ x)\} d μ d x \\ & + w \int ϕ (x; 0, 1) log \{1 + \frac{1 - w}{w} \frac{ϕ (x; 0, τ^{2} + 1)}{ϕ (x; 0, 1)}\} d x \\ & + (1 - w) \int ϕ (x; 0, τ^{2} + 1) log \{1 + \frac{w}{1 - w} \frac{ϕ (x; 0, 1)}{ϕ (x; 0, τ^{2} + 1)}\} d x \end{matrix}

(24)

The conditional mutual information is evaluated by numerical integration. When

w = 0.5

and

w = 0.355

, the maximum values:

max_{τ} I_{m; μ ∣ x} (w = 0.5, N_{τ}) = 0.156 and max_{τ} I_{m; μ ∣ x} (w = 0.355, N_{τ}) = 0.166

(25)

of Equation (24) are attained at

τ = 4.92

and

τ = 5.36

, respectively. The variations of the risk functions,

r_{w = 0.5} (μ, N_{τ = 4.92})

and

r_{w = 0.355} (μ, N_{τ = 5.36})

, shown in Figure 4 are much larger than those of the risk functions of the latent information priors shown in Figure 2. Thus, the performance of the Bayesian testing based on the normal prior is worse than that based on the latent information prior if we adopt the Kullback-Leibler loss.

Figure 4. Risk functions of Bayesian testing based on normal priors for (a)

w = 0.5

and

τ = 4.92

; and for (b)

w = 0.355

and

τ = 5.36

. The functions have symmetry

r_{w} (- μ, N_{τ}) = r_{w} (μ, N_{τ})

about the origin.

4.2. The Cauchy Prior

The Cauchy prior,

1 / {γ π (μ^{2} / γ^{2} + 1)}

, is denoted by

C_{γ}

. Since the characteristic functions of

N (0, σ^{2})

and

C_{γ}

are

exp (- \frac{1}{2} σ^{2} t^{2})

and

exp (- γ | t |)

, respectively, the characteristic function of the marginal density:

\begin{matrix} p_{C} (x ∣ γ) = \int \frac{1}{\sqrt{2 π σ^{2}}} exp \{- \frac{1}{2 σ^{2}} {(x - μ)}^{2}\} \frac{1}{π (μ^{2} / γ^{2} + 1)} \frac{1}{γ} d μ \end{matrix}

(26)

with respect to the Cauchy prior,

C_{γ}

, is given by:

\begin{matrix} exp (- γ | t | - \frac{1}{2} t^{2}) \end{matrix}

(27)

The expression:

\begin{matrix} p_{C} (x ∣ γ) & = \frac{1}{2 π} \int_{- \infty}^{\infty} e^{- i x t} exp (- γ | t | - \frac{1}{2} t^{2}) d t \\ & = \frac{1}{\sqrt{2 π} σ} Re [exp \{\frac{{(i x - γ)}^{2}}{2}\} erfc (- i \frac{x + i γ}{\sqrt{2}})] \end{matrix}

(28)

where

erfc

is the complementary error function defined by:

\begin{matrix} erfc (z) = & \frac{2}{\sqrt{π}} \int_{z}^{\infty} e^{- t^{2}} d t \end{matrix}

(29)

obtained by the inverse transform of Equation (27) is useful for numerical computation; see [19] (p. 183) and [20]. From Lemma 1, we have:

\begin{matrix} r_{w} (μ; C_{γ}) & = - w \int ϕ (x; 0, 1) log \{1 + \frac{1 - w}{w} exp (- \frac{1}{2} μ^{2} + μ x)\} d x \\ & - (1 - w) \int ϕ (x; 0, 1) log \{1 + \frac{w}{1 - w} exp (- \frac{1}{2} μ^{2} - μ x)\} d x \\ & + w \int ϕ (x; 0, 1) log \{1 + \frac{1 - w}{w} \frac{p_{C} (x ∣ γ)}{ϕ (x; 0, 1)}\} d x \\ & + (1 - w) \int ϕ (x; 0, 1) log \{1 + \frac{w}{1 - w} \frac{ϕ (x + μ; 0, 1)}{p_{C} (x + μ ∣ γ)}\} d x \end{matrix}

(30)
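The marginal density p_C(x ∣ γ) entering Equation (30) can also be evaluated without a complex error function, by inverting the characteristic function in Equation (27) numerically. The following is a minimal sketch, assuming unit variance and composite Simpson integration on a truncated interval; the function names and the truncation and grid settings are assumptions made for illustration, not the method used in the paper.

```python
import math


def cauchy_normal_marginal(x, gamma, t_max=12.0, n=6000):
    """(1 / pi) * integral over t >= 0 of cos(x t) exp(-gamma t - t^2 / 2).

    This is the real form of the inversion integral of Equation (27).
    The integrand is below exp(-72) beyond t_max = 12; n must be even.
    """
    h = t_max / n

    def integrand(t):
        return math.cos(x * t) * math.exp(-gamma * t - 0.5 * t * t)

    total = integrand(0.0) + integrand(t_max)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * integrand(k * h)
    return total * h / (3.0 * math.pi)


def marginal_at_zero(gamma):
    """Closed form at x = 0, where only the real erfc is needed."""
    return (math.exp(0.5 * gamma * gamma)
            * math.erfc(gamma / math.sqrt(2.0)) / math.sqrt(2.0 * math.pi))


print(cauchy_normal_marginal(0.0, 3.31), marginal_at_zero(3.31))
print(cauchy_normal_marginal(2.0, 3.31))
```

At x = 0 the quadrature can be compared with the closed form, and for γ = 0 the prior degenerates to a point mass at zero so that the marginal reduces to the standard normal density; both give simple checks of the numerical inversion.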

We numerically evaluate the conditional mutual information:

\begin{matrix} I_{m; μ ∣ x} (w, C_{γ}) = & \int r_{w} (μ; C_{γ}) \frac{1}{π (μ^{2} / γ^{2} + 1)} \frac{1}{γ} d μ \end{matrix}

(31)

by the Monte-Carlo method. When

w = 0.5

and

w = 0.355

, the maximum values:

max_{γ} I_{m; μ ∣ x} (w = 0.5, C_{γ}) = 0.161 and max_{γ} I_{m; μ ∣ x} (w = 0.355, C_{γ}) = 0.170

of Equation (31) are attained at

γ = 3.31

and

γ = 3.63

, respectively. The risk functions

r_{w = 0.5} (μ, C_{γ = 3.31})

and

r_{w = 0.355} (μ, C_{γ = 3.63})

are shown in Figure 5. The variation of the risk function

r_{w = 0.5} (μ, C_{γ = 3.31})

is milder than that of the risk function

r_{w = 0.5} (μ, N_{τ = 4.92})

based on the normal prior, and the inequality

{sup}_{μ} r_{w = 0.5} (μ, C_{γ = 3.31}) < {sup}_{μ} r_{w = 0.5} (μ, N_{τ = 4.92})

holds. Thus, the Cauchy prior is preferable to the normal prior from the viewpoint of the Kullback-Leibler loss. However, the variation of the risk function shown in Figure 2 based on the latent information prior is much smaller than that of

r_{w = 0.5} (μ, C_{γ = 3.31})

. Similar relations also hold when

w = 0.355

.

Figure 5. Risk functions of Bayesian testing based on Cauchy priors for (a)

w = 0.5

and

γ = 3.31

; and for (b)

w = 0.355

and

γ = 3.63

. The functions have symmetry

r_{w} (- μ, C_{γ}) = r_{w} (μ, C_{γ})

about the origin.

5. Conclusions

We discussed the use of latent information priors for Bayesian testing of a point null hypothesis. The testing problem was formulated as a prediction problem, and latent information priors were numerically obtained. The variations of the risk functions of latent information priors are much smaller than those of normal and Cauchy priors. Although the testing problem treated in the present paper is simple, the results may indicate that latent information priors could be useful for various problems, since many statistical problems can be formulated from the viewpoint of prediction.

When the parameter space is multidimensional, it becomes difficult to numerically obtain latent information priors, and some approximations need to be used. One possible approach is to use asymptotic methods, and another possible approach is to choose an approximating prior from a tractable subset of the set of all probability measures on the parameter space. These approaches require further investigation.

Acknowledgments

This research was partially supported by a Grant-in-Aid for Scientific Research (23300104, 23650144) and by the Aihara Innovative Mathematical Modelling Project, the Japan Society for the Promotion of Science (JSPS) through the “Funding Program for World-Leading Innovative R&D on Science and Technology (FIRST Program),” initiated by the Council for Science and Technology Policy (CSTP).

Conflicts of Interest

The author declares no conflict of interest.

References

1. Berger, J.O.; Sellke, T. Testing a point null hypothesis: The irreconcilability of p values and evidence. J. Am. Stat. Assoc. 1987, 82, 112–122.
2. Jeffreys, H. Theory of Probability, 3rd ed.; Oxford University Press: Oxford, UK, 1961.
3. Lindley, D.V. A statistical paradox. Biometrika 1957, 44, 187–192.
4. Komaki, F. Bayesian predictive densities based on latent information priors. J. Stat. Plan. Inference 2011, 141, 3705–3715.
5. Aitchison, J. Goodness of prediction fit. Biometrika 1975, 62, 547–554.
6. Csiszár, I. I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 1975, 3, 146–158.
7. Haussler, D. A general minimax result for relative entropy. IEEE Trans. Inf. Theory 1997, 43, 1276–1280.
8. Grünwald, P.D.; Dawid, A.P. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Ann. Stat. 2004, 32, 1367–1433.
9. Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inf. Theory 1972, 18, 14–20.
10. Blahut, R. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory 1972, 18, 460–473.
11. Hartigan, J.A. Bayes Theory; Springer: New York, NY, USA, 1983.
12. Berger, J.; Bernardo, J.M.; Mendoza, M. On Priors that Maximize Expected Information. In Recent Developments in Statistics and Their Applications; Klein, J.P., Lee, J.C., Eds.; Freedom Press: Seoul, Korea, 1989; pp. 1–20.
13. Zhang, Z. Discrete Noninformative Priors. Ph.D. Dissertation, Department of Statistics, Yale University, New Haven, CT, USA, 1994.
14. Bernardo, J.M. Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. B 1979, 41, 113–147.
15. Clarke, B.; Yuan, A. Partial information reference priors: Derivation and interpretations. J. Stat. Plan. Inference 2004, 123, 313–345.
16. Ebrahimi, N.; Soofi, E.S.; Soyer, R. On the sample information about parameter and prediction. Stat. Sci. 2010, 25, 348–367.
17. Edwards, W.; Lindman, H.; Savage, L.J. Bayesian statistical inference for psychological research. Psychol. Rev. 1963, 70, 193–242.
18. Dickey, J.M. Is the tail area useful as an approximate Bayes factor? J. Am. Stat. Assoc. 1977, 72, 138–142.
19. Temme, N.M. Error Functions, Dawson’s and Fresnel Integrals. In NIST Handbook of Mathematical Functions; Olver, F.W.J., Lozier, D.W., Boisvert, R.F., Clark, C.W., Eds.; Cambridge University Press: Cambridge, UK, 2010; pp. 159–171.
20. Poppe, G.P.M.; Wijers, C.M.J. Algorithm 680: Evaluation of the complex error function. ACM Trans. Math. Softw. (TOMS) 1990, 16.

Appendix. Proofs of Lemmas

Proof of Lemma 1. From Equation (8), we have:

\begin{aligned} r_{w} (μ; π) = {} & w \int p (x ∣ m = 0) \log \frac{p_{w} (m = 0 ∣ μ, x)}{p_{w, π} (m = 0 ∣ x)} \, d x \\ & + (1 - w) \int p (x ∣ m = 1, μ) \log \frac{p_{w} (m = 1 ∣ μ, x)}{p_{w, π} (m = 1 ∣ x)} \, d x \end{aligned}

(32)

because m and μ are independent. Since:

\frac{p_{w} (m = 0 ∣ μ, x)}{p_{w, π} (m = 0 ∣ x)} = \frac{\dfrac{w p (x ∣ m = 0)}{w p (x ∣ m = 0) + (1 - w) p (x ∣ m = 1, μ)}}{\dfrac{w p (x ∣ m = 0)}{w p (x ∣ m = 0) + (1 - w) p_{π} (x ∣ m = 1)}} = \frac{1 + \frac{1 - w}{w} \frac{p_{π} (x ∣ m = 1)}{p (x ∣ m = 0)}}{1 + \frac{1 - w}{w} \frac{p (x ∣ m = 1, μ)}{p (x ∣ m = 0)}}

(33)

and:

\frac{p_{w} (m = 1 ∣ μ, x)}{p_{w, π} (m = 1 ∣ x)} = \frac{\dfrac{(1 - w) p (x ∣ m = 1, μ)}{w p (x ∣ m = 0) + (1 - w) p (x ∣ m = 1, μ)}}{\dfrac{(1 - w) p_{π} (x ∣ m = 1)}{w p (x ∣ m = 0) + (1 - w) p_{π} (x ∣ m = 1)}} = \frac{\frac{w}{1 - w} \frac{p (x ∣ m = 0)}{p_{π} (x ∣ m = 1)} + 1}{\frac{w}{1 - w} \frac{p (x ∣ m = 0)}{p (x ∣ m = 1, μ)} + 1}

(34)

we have:

\begin{aligned} r_{w} (μ; π) = {} & w \int p (x ∣ m = 0) \log \frac{1 + \frac{1 - w}{w} \frac{p_{π} (x ∣ m = 1)}{p (x ∣ m = 0)}}{1 + \frac{1 - w}{w} \frac{p (x ∣ m = 1, μ)}{p (x ∣ m = 0)}} \, d x \\ & + (1 - w) \int p (x ∣ m = 1, μ) \log \frac{1 + \frac{w}{1 - w} \frac{p (x ∣ m = 0)}{p_{π} (x ∣ m = 1)}}{1 + \frac{w}{1 - w} \frac{p (x ∣ m = 0)}{p (x ∣ m = 1, μ)}} \, d x \\ = {} & w \int p_{0} (x) \log \frac{1 + \frac{1 - w}{w} \frac{p_{π} (x)}{p_{0} (x)}}{1 + \frac{1 - w}{w} \frac{p_{0} (x - μ)}{p_{0} (x)}} \, d x \\ & + (1 - w) \int p_{0} (x) \log \frac{1 + \frac{w}{1 - w} \frac{p_{0} (x + μ)}{p_{π} (x + μ)}}{1 + \frac{w}{1 - w} \frac{p_{0} (x + μ)}{p_{0} (x)}} \, d x \end{aligned}

(35)
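The equality of the posterior-probability form (32) and the ratio form (35) can be checked numerically. The following is a minimal sketch, not part of the original proof. It assumes σ = 1, takes an illustrative prior π with mass 1/2 at each of ±a, and uses a composite Simpson rule on a truncated interval; the function names, the prior, and the quadrature settings are all choices made for this example.

```python
import math

def phi(x):
    """Standard normal density p_0 (sigma = 1 is assumed)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def simpson(f, lo, hi, n=4000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3.0

def p_pi(x, a):
    """Marginal density under H1 for the illustrative prior (mass 1/2 at +a and -a)."""
    return 0.5 * (phi(x - a) + phi(x + a))

def r_posterior_form(mu, w, a, L=14.0):
    """r_w(mu; pi) computed from Equation (32) via posterior model probabilities."""
    def post(x, alt):
        # (P(m=0 | x), P(m=1 | x)) when the H1 density at x equals alt
        n0, n1 = w * phi(x), (1.0 - w) * alt
        return n0 / (n0 + n1), n1 / (n0 + n1)
    def f0(x):
        return phi(x) * math.log(post(x, phi(x - mu))[0] / post(x, p_pi(x, a))[0])
    def f1(x):
        return phi(x - mu) * math.log(post(x, phi(x - mu))[1] / post(x, p_pi(x, a))[1])
    return w * simpson(f0, -L, L) + (1.0 - w) * simpson(f1, -L, L)

def r_ratio_form(mu, w, a, L=14.0):
    """r_w(mu; pi) computed from the last expression in Equation (35)."""
    c = (1.0 - w) / w
    def g0(x):
        return phi(x) * math.log((1.0 + c * p_pi(x, a) / phi(x))
                                 / (1.0 + c * phi(x - mu) / phi(x)))
    def g1(x):
        y = x + mu  # shifted argument after the change of variable
        return phi(x) * math.log((1.0 + phi(y) / (c * p_pi(y, a)))
                                 / (1.0 + phi(y) / (c * phi(x))))
    return w * simpson(g0, -L, L) + (1.0 - w) * simpson(g1, -L, L)
```

For example, `r_posterior_form(2.0, 0.5, 1.5)` and `r_ratio_form(2.0, 0.5, 1.5)` should agree up to quadrature error, since the second is obtained from the first only by algebra and the substitution x → x + μ.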

Proof of Lemma 2. Since:

\begin{aligned} ϕ (x - a) + ϕ (x + a) & = \frac{1}{\sqrt{2 π}} \exp \left\{- \frac{1}{2} (x^{2} - 2 a x + a^{2})\right\} + \frac{1}{\sqrt{2 π}} \exp \left\{- \frac{1}{2} (x^{2} + 2 a x + a^{2})\right\} \\ & = 2 ϕ (x) \exp \left(- \frac{1}{2} a^{2}\right) \cosh (a x) \end{aligned}

(36)

we have:

\begin{aligned} \frac{p_{π} (x)}{p_{0} (x)} & = \frac{\frac{1}{2} u \{ϕ (x + a) + ϕ (x - a)\} + \frac{1}{2} (1 - u) \{ϕ (x + b) + ϕ (x - b)\}}{ϕ (x)} \\ & = u \exp \left(- \frac{1}{2} a^{2}\right) \cosh (a x) + (1 - u) \exp \left(- \frac{1}{2} b^{2}\right) \cosh (b x) \end{aligned}

(37)
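As a quick sanity check of Equations (36) and (37), the density ratio can be evaluated both from the mixture definition and from the closed form. This is only an illustrative sketch: the values of a, b, and u and the grid of x values are arbitrary choices, and the function names are hypothetical.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def ratio_direct(x, a, b, u):
    """p_pi(x) / p_0(x) from the mixture with mass u/2 at +-a and (1-u)/2 at +-b."""
    mix = (0.5 * u * (phi(x + a) + phi(x - a))
           + 0.5 * (1.0 - u) * (phi(x + b) + phi(x - b)))
    return mix / phi(x)

def ratio_closed(x, a, b, u):
    """Closed form on the right-hand side of Equation (37)."""
    return (u * math.exp(-0.5 * a * a) * math.cosh(a * x)
            + (1.0 - u) * math.exp(-0.5 * b * b) * math.cosh(b * x))

def max_rel_gap(a=1.5, b=3.0, u=0.4):
    """Largest relative difference between the two forms on a grid in [-4, 4]."""
    xs = [-4.0 + 0.5 * k for k in range(17)]
    return max(abs(ratio_direct(x, a, b, u) / ratio_closed(x, a, b, u) - 1.0)
               for x in xs)
```

The value returned by `max_rel_gap()` should be at the level of floating-point rounding, because the two functions differ only by the identity in Equation (36).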

From Lemma 1, we have:

\begin{aligned} r_{w} (μ; π) = {} & - w \int p_{0} (x) \log \left\{1 + \frac{1 - w}{w} \frac{p_{0} (x - μ)}{p_{0} (x)}\right\} d x - (1 - w) \int p_{0} (x) \log \left\{1 + \frac{w}{1 - w} \frac{p_{0} (x + μ)}{p_{0} (x)}\right\} d x \\ & + w \int p_{0} (x) \log \left\{1 + \frac{1 - w}{w} \frac{p_{π} (x)}{p_{0} (x)}\right\} d x + (1 - w) \int p_{0} (x) \log \left\{1 + \frac{w}{1 - w} \frac{p_{0} (x + μ)}{p_{π} (x + μ)}\right\} d x \\ = {} & - w \int ϕ (x) \log \left\{1 + \frac{1 - w}{w} \exp \left(- \frac{1}{2} μ^{2} + μ x\right)\right\} d x \\ & - (1 - w) \int ϕ (x) \log \left\{1 + \frac{w}{1 - w} \exp \left(- \frac{1}{2} μ^{2} - μ x\right)\right\} d x \\ & + w \int ϕ (x) \log \left\{1 + \frac{1 - w}{w} u \exp \left(- \frac{1}{2} a^{2}\right) \cosh (a x) + \frac{1 - w}{w} (1 - u) \exp \left(- \frac{1}{2} b^{2}\right) \cosh (b x)\right\} d x \\ & + (1 - w) \int ϕ (x - μ) \log \left\{1 + \frac{w}{1 - w} \frac{1}{u \exp \left(- \frac{1}{2} a^{2}\right) \cosh (a x) + (1 - u) \exp \left(- \frac{1}{2} b^{2}\right) \cosh (b x)}\right\} d x \end{aligned}

(38)

The conditional mutual information is:

\begin{aligned} I_{m; μ ∣ x} (w, π) & = \frac{u}{2} \{r_{w} (- a; π) + r_{w} (a; π)\} + \frac{1 - u}{2} \{r_{w} (- b; π) + r_{w} (b; π)\} \\ & = u r_{w} (a; π) + (1 - u) r_{w} (b; π) \end{aligned}

(39)

From Equations (38) and (39), we obtain the desired result. ☐
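Equations (38) and (39) can be turned into a small numerical routine. The sketch below is one possible way to do this, not the computation used in the paper: it assumes σ = 1, integrates the four terms of Equation (38) with a composite Simpson rule on [−L, L], and uses hypothetical function names and illustrative parameter values.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def simpson(f, lo, hi, n=4000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3.0

def r_w(mu, w, a, b, u, L=14.0):
    """r_w(mu; pi) from Equation (38); pi has mass u/2 at +-a and (1-u)/2 at +-b."""
    c = (1.0 - w) / w
    def rho(x):
        # p_pi(x) / p_0(x), Equation (37)
        return (u * math.exp(-0.5 * a * a) * math.cosh(a * x)
                + (1.0 - u) * math.exp(-0.5 * b * b) * math.cosh(b * x))
    def integrand(x):
        t1 = -w * phi(x) * math.log1p(c * math.exp(-0.5 * mu * mu + mu * x))
        t2 = -(1.0 - w) * phi(x) * math.log1p(math.exp(-0.5 * mu * mu - mu * x) / c)
        t3 = w * phi(x) * math.log1p(c * rho(x))
        t4 = (1.0 - w) * phi(x - mu) * math.log1p(1.0 / (c * rho(x)))
        return t1 + t2 + t3 + t4
    return simpson(integrand, -L, L)

def cond_mutual_info(w, a, b, u):
    """I_{m; mu | x}(w, pi) = u r_w(a; pi) + (1 - u) r_w(b; pi), Equation (39)."""
    return u * r_w(a, w, a, b, u) + (1.0 - u) * r_w(b, w, a, b, u)
```

The second equality in Equation (39) relies on r_w(−μ; π) = r_w(μ; π) for this symmetric prior; comparing `r_w(1.0, 0.5, 1.0, 3.0, 0.4)` with `r_w(-1.0, 0.5, 1.0, 3.0, 0.4)` gives a direct numerical check of that symmetry.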

© 2013 by the author; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
