
Learning Functions and Approximate Bayesian Computation Design: ABCD

Markus Hainy 1,*, Werner G. Müller 1 and Henry P. Wynn 2
1 Department of Applied Statistics, Johannes Kepler University, 4040 Linz, Austria
2 Department of Statistics, London School of Economics, Houghton Street, London WC2A 2AE, UK
* Author to whom correspondence should be addressed.
Entropy 2014, 16(8), 4353-4374; https://doi.org/10.3390/e16084353
Submission received: 25 April 2014 / Revised: 18 July 2014 / Accepted: 28 July 2014 / Published: 4 August 2014
(This article belongs to the Special Issue Entropy in Experimental Design, Sensor Placement, Inquiry and Search)

Abstract

A general approach to Bayesian learning revisits some classical results, which study which functionals on a prior distribution are expected to increase, in a preposterior sense. The results are applied to information functionals of the Shannon type and to a class of functionals based on expected distance. A close connection is made between the latter and a metric embedding theory due to Schoenberg and others. For the Shannon type, there is a connection to majorization theory for distributions. A computational method is described to solve generalized optimal experimental design problems arising from the learning framework based on a version of the well-known approximate Bayesian computation (ABC) method for carrying out the Bayesian analysis based on Monte Carlo simulation. Some simple examples are given.

1. Introduction

A Bayesian approach to the optimal design of experiments uses some measure of preposterior utility, or information, to assess the efficacy of an experimental design or, more generally, the choice of sampling distribution. Various versions of this approach have been developed by Blackwell [1], and Torgersen [2] gives a clear account. Rényi [3], Lindley [4] and Goel and DeGroot [5] use information-theoretic approaches to measure the value of an experiment; see also the review paper by Ginebra [6]. Chaloner and Verdinelli [7] give a broad discussion of the Bayesian design of experiments, and Sebastiani and Wynn [8] also discuss the Bayes information-theoretic approach. There is wider interest in these issues in cognitive science and epistemology; see Chater and Oaksford [9].
When new data arrive, one can expect to improve the information about an unknown parameter θ. The key theorem, which is Theorem 2 here, gives conditions on information functionals for this to be the case; such functionals will be called learning functionals. This class includes many types of information, with Shannon information as a special case.
Section 2 gives the main theorems on learning functionals. We give our own simple proofs for completeness, and the material can be considered a compressed summary of what can be found in a quite scattered literature. We study two types of learning function: those we shall call the Shannon type and, in Section 3, those based on distances. For the latter, we shall make a new connection to the metric embedding theory contained in the work of Schoenberg, with a link to Bernstein functions [10,11]. This yields a wide class of new learning functions. Following two somewhat provocative counterexamples and a short discussion of surprise in Section 4, we relate learning functions of the Shannon type to the theory of majorization in Section 5. Section 6 specializes learning functions to covariance matrices.
We shall use the classical Bayes formulation with θ as an unknown parameter with a prior density π(θ) on a parameter space Θ and a sampling density f(x|θ) on an appropriate sample space. We denote by $f_{X,\theta}(x, \theta) = f(x|\theta)\pi(\theta)$ the joint density of X and θ, and use $f_X(x)$ for the marginal density of X. The nature of expectations will be clear from the notation. To make the development straightforward, we shall look at the case of distributions with densities (with respect to Lebesgue measure) or, occasionally, discrete distributions with finite support. All necessary conditions for conditional densities, integration and differentiation will be implicitly assumed.
In Section 7, approximate Bayesian computation (ABC) is applied to problems in optimal experimental design (hence, ABCD). We believe that an understanding of modern optimal experimental design and its computational aspects needs to be grounded in some understanding of learning. At the same time, there is added value in taking a wide interpretation of optimal design as a choice, with constraints, of the sampling distribution f(x|θ). Thus, one may index f(x|θ) by a control variable z and write f(x|θ, z) or f(x(z)|θ). Certain aspects of the distribution may depend on z, others not. An experimental design can be taken as the choice of a set of z, at each of which we take one or more observations, giving a multivariate distribution. In areas such as search theory and optimization, z may be a site at which one measures or observes with error. In spatial sampling, one may also use the term "site" for z. However, z could be a simple flag, which indicates one or another of somewhat unrelated experiments to estimate a common θ. In medicine, for example, one discusses different types of "intervention" for the same patient.

2. Information-Based Learning

The classical formulation proceeds as follows. Let U be a random variable with density fU(u). Let g(·) be a function on R+ → R and define a measure of information of the Shannon type for U with respect to g as
$$I_g(U) = E_U\big(g(f_U(U))\big).$$
When g(u) = log(u), we have Shannon information. When $g(u) = \frac{u^\gamma - 1}{\gamma}$ (γ > −1), we have a version similar to Rényi information, which is sometimes called Tsallis information [12].
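A minimal numerical sketch (mine, not from the paper) of this definition for a discrete distribution, with g = log (Shannon) and the Tsallis-type g above; the distributions and γ are illustrative choices:

```python
# I_g(U) = E_U[g(f_U(U))] for a discrete distribution: sum_i p_i * g(p_i).
import numpy as np

def information(p, g):
    """Shannon-type information of a probability vector p under g."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * g(p)))

g_shannon = np.log
g_tsallis = lambda u, gamma=0.5: (u**gamma - 1.0) / gamma

p_peaked = [0.85, 0.05, 0.05, 0.05]
p_flat = [0.25, 0.25, 0.25, 0.25]
# A more peaked distribution carries more information under either g:
print(information(p_peaked, g_shannon), information(p_flat, g_shannon))
print(information(p_peaked, g_tsallis), information(p_flat, g_tsallis))
```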
If X represents the future observation, we can measure the preposterior information of the experiment (query, etc.), which generates a realization of X, by the prior expectation of the posterior information, which we define as:
$$I_g(\theta; X) = E_X E_{\theta|X}\big(g(\pi(\theta|X))\big) = E_{X,\theta}\big(g(\pi(\theta|X))\big).$$
In the second term, the inner expectation is with respect to the posterior (conditional) distribution of θ given X, namely π(θ|X), and the outside expectation is with respect to the marginal distribution of X. In the last term, the expectation is with respect to the full joint distribution of X and θ. We wish to compare Ig(θ;X) with the prior information:
$$I_g(\theta) = E_\theta\big(g(\pi(\theta))\big).$$

Theorem 1

For fixed g(u) and the standard Bayesian set-up, the preposterior quantity $I_g(\theta; X)$ and the prior value $I_g(\theta)$ satisfy:
$$I_g(\theta; X) \ge I_g(\theta) = E_\theta\big(g(\pi(\theta))\big),$$
for all joint distributions $f_{X,\theta}(x, \theta)$ if and only if $h(u) = ug(u)$ is convex on $\mathbb{R}^+$.
We shall postpone the proof of Theorem 1 until after a more general result for functionals on densities:
$$\varphi : \pi(\theta) \mapsto \mathbb{R}.$$

Theorem 2

For the standard Bayesian set-up and a functional φ (·),
$$\varphi(\pi(\theta)) \le E_X \varphi(\pi(\theta|X))$$
for all joint distributions fX,θ (x, θ) if and only if φ is convex as a functional:
$$\varphi\big((1-\alpha)\pi_1 + \alpha\pi_2\big) \le (1-\alpha)\varphi(\pi_1) + \alpha\varphi(\pi_2),$$
for 0 ≤ α ≤ 1 and all π1, π2.

Proof

Note that taking expectations with respect to the marginal distribution of X amounts to a convex mixing, not dependent on θ. Thus, using Jensen’s inequality:
$$E_X\big(\varphi(\pi(\theta|X))\big) \ge \varphi\big(E_X(\pi(\theta|X))\big) = \varphi(\pi(\theta)).$$
The necessity comes from a special construction. We show that given a functional φ(·) and a triple {π1, π2, α}, such that:
$$\varphi\big((1-\alpha)\pi_1 + \alpha\pi_2\big) > (1-\alpha)\varphi(\pi_1) + \alpha\varphi(\pi_2),$$
we can find a pair {f(x, θ), π(θ)}, such that
$$\varphi(\pi(\theta)) > E_X \varphi(\pi(\theta|x)).$$
Thus, let X be a Bernoulli random variable with marginal distribution (prob{X = 0}, prob{X = 1}) = (1 − α, α). Then, it is straightforward to choose a joint distribution of θ and X, such that:
$$\pi(\theta|X = 0) = \pi_1(\theta), \qquad \pi(\theta|X = 1) = \pi_2(\theta),$$
from which we obtain (1).

Proof

(of Theorem 1). We now show that Theorem 1 is a special case of Theorem 2.
Write πα(θ) = (1 − α)π1(θ) + απ2(θ). If h(u) = ug(u) is convex as a function of its argument u:
$$\int h(\pi_\alpha(\theta))\, d\theta \le \int \big((1-\alpha) h(\pi_1(\theta)) + \alpha h(\pi_2(\theta))\big)\, d\theta = (1-\alpha)\int h(\pi_1(\theta))\, d\theta + \alpha \int h(\pi_2(\theta))\, d\theta,$$
proving one direction.
The reverse is to show that if $I_g$ is convex for all π, then h is convex. For this, again, we need a special construction. We carry this out in one dimension, the extension to more than one dimension being straightforward. For ease of exposition, we also assume the necessary differentiability conditions. The second directional derivative of $I_g(\theta)$ in the (convex) space of distributions, at π1 towards π2, is:
$$\frac{\partial^2}{\partial \alpha^2} \int g(\pi_\alpha(\theta))\, \pi_\alpha(\theta)\, d\theta \,\Big|_{\alpha=0} = \int (\pi_1 - \pi_2)^2 \big(g''(\pi_1)\pi_1 + 2g'(\pi_1)\big)\, d\theta.$$
Let π1 represent a uniform distribution on $[0, 1/z]$, for some $z > 0$, and let π2 be a distribution with support contained in $[0, 1/z]$. Then, the above becomes:
$$\int_0^{1/z} (z - \pi_2(\theta))^2 \big(g''(z)z + 2g'(z)\big)\, d\theta = \big(g''(z)z + 2g'(z)\big) \int_0^{1/z} (z - \pi_2(\theta))^2\, d\theta.$$
Now, assume that $h(z) = zg(z)$ is not convex at z; then $h''(z) = g''(z)z + 2g'(z) < 0$, and any choice of π2 that makes the integral on the right-hand side positive shows that $I_g(\theta)$ is not convex at π1. This completes the proof.
Theorem 2 has a considerable history of discovery and rediscovery and, in its full version, should probably be attributed to DeGroot [13]; see Ginebra [6]. The early results concentrated on functionals of the Shannon type, basically yielding Theorem 1. Note that the condition that $h(u) = ug(u)$ is convex on $\mathbb{R}^+$ is equivalent to $g(1/u)$ being convex, which is referred to as g(u) being "reciprocally convex" by Goldman and Shaked [14]; see also Fallis and Liddell [15].

3. Distance-Based Information Functions

Shannon-type information functionals take no account of metrics: intuitively, if mass is moved around, the information stays the same. Let Z1, Z2 be independent copies from π(z), and let d(z1, z2) be a distance or metric. Define d-information as:
$$\varphi(\pi) = -E_{Z_1,Z_2}\big(d(Z_1, Z_2)^2\big).$$
Now, with πα(z) = (1 − α)π1(z) + απ2(z),
$$\varphi(\pi_\alpha) = -\int\!\!\int d(z_1, z_2)^2 \big((1-\alpha)\pi_1(z_1) + \alpha\pi_2(z_1)\big)\big((1-\alpha)\pi_1(z_2) + \alpha\pi_2(z_2)\big)\, dz_1\, dz_2.$$
The condition for convexity, again using the second directional derivative with respect to α, is
$$-\int\!\!\int d(z_1, z_2)^2 \big(\pi_1(z_1) - \pi_2(z_1)\big)\big(\pi_1(z_2) - \pi_2(z_2)\big)\, dz_1\, dz_2 \ge 0.$$
Noting that $\int \big(\pi_1(z_1) - \pi_2(z_1)\big)\, dz_1 = 0$, (5) is a generalized version of the following condition:
$$-\sum_i \sum_j d(z_i, z_j)\, z_i z_j \ge 0, \quad \text{for all } z \text{ with } \sum_i z_i = 0.$$
Condition (6), considered as a condition on a distance matrix dij = d(zi, zj), is called almost positive and is the necessary and sufficient condition for an abstract set of points P1, . . . , Pk, with interpoint distances {dij}, to be embedded in Euclidean space.
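A small sketch (mine) of how the embedding test can be checked numerically, assuming the matrix contains squared Euclidean distances: on the centered subspace, the almost-positive condition is exactly positive semi-definiteness of the classical MDS matrix $B = -\frac{1}{2} J D J$:

```python
# Check the almost-positive (conditionally negative type) condition by double
# centering; for squared Euclidean distances the test must pass.
import numpy as np

def is_almost_positive(D2, tol=1e-10):
    """True iff -sum_ij D2_ij z_i z_j >= 0 for all z with sum_i z_i = 0."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering projector
    B = -0.5 * J @ D2 @ J
    return bool(np.min(np.linalg.eigvalsh(B)) >= -tol)

pts = np.random.default_rng(0).normal(size=(5, 2))
D2 = ((pts[:, None, :] - pts[None, :, :])**2).sum(-1)
print(is_almost_positive(D2))   # True: genuinely Euclidean squared distances
```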

Theorem 3

If $d_{ij} = d_{ji}$, $1 \le i < j \le n$, are $\frac{1}{2}n(n-1)$ positive quantities, then a necessary and sufficient condition that the $d_{ij}$ are the interpoint distances between points $P_i$, $i = 1, \ldots, n$, in $\mathbb{R}^n$ is that the distance matrix $D = -\{d_{ij}\}$ is an almost positive matrix.
This is a special case of metric embedding, sometimes called metric multi-dimensional scaling, in statistics; see, for example, Torgerson [16], Gower [17,18]. A more general result is:

Theorem 4

Let S be a separable metric space with metric d(x, y); then S can be isometrically embedded in $l_2$ if and only if $A(x, y) = -d(x, y)$ is an almost positive matrix.
It remains to identify the functions B(·) such that, when d(x, y) is a Euclidean or Hilbert space metric, the space with the new squared distance $B(d(x, y)^2)$ can still be embedded into Hilbert space. Schoenberg [10] gives the major result that such B(·) comprise the Bernstein functions, defined as follows (see Theorem 12.14 in [11]):

Definition 1

A function $B : (0, \infty) \mapsto \mathbb{R}$ is a Bernstein function if it is $C^\infty$, $B(\lambda) \ge 0$ for all λ > 0 and the derivatives satisfy $(-1)^{n-1} B^{(n)}(\lambda) \ge 0$ for all positive integers n and all λ > 0.
Note that this says that B′ is a completely monotone function.

Theorem 5

(Schoenberg) The following are equivalent:
(1)
B(||xy||2) (x, yH) is the square of a distance function, which isometrically embeds into Hilbert space H, i.e., there exists a φ : HH, such that:
$$B(\|x - y\|^2) = \|\varphi(x) - \varphi(y)\|^2.$$
(2)
B is a Bernstein function.
(3)
eB(t) is the Laplace transform of an infinitely divisible distribution, i.e.,
$$B(t) = -\log \int_0^\infty e^{-tu}\, d\gamma(u),$$
where γ is an infinitely divisible distribution.
(4)
B has the Lévy-Khintchine representation:
$$B(t) = B_{\mu,b}(t) = bt + \int_0^\infty \big(1 - e^{-tu}\big)\, d\mu(u)$$
for some $b \ge 0$ and a measure μ, such that $\int_0^\infty \min(1, t)\, d\mu(t) < \infty$, with the condition that $B_{\mu,b}(t) > 0$ for t > 0.
We now combine the above discussion with Schoenberg’s theorem.

Theorem 6

If B(·) is a Bernstein function with B(0) = 0 and $d(z_1, z_2)$ is a Euclidean distance, then $\varphi(\pi) = -E_{Z_1,Z_2}\big(B(d(Z_1, Z_2)^2)\big)$ is a learning function.
In the univariate case the negative of the variance of the distribution is a learning function since:
$$\mathrm{var}(Z) = \tfrac{1}{2} E_{Z_1,Z_2}(Z_1 - Z_2)^2.$$
When Z is multivariate, we again take independent copies Z1,Z2 of Z and use Euclidean distance, and we have that minus the trace of the covariance matrix of Z, Γ, is a learning function:
$$\tfrac{1}{2} E_{Z_1,Z_2}\big(\|Z_1 - Z_2\|^2\big) = \mathrm{trace}(\Gamma).$$
Schilling et al. [11] (Chapter 15) list 138 Bernstein functions, each of which leads to a learning functional of the distance type. We give a small selection of Bernstein functions B(λ), which, applied with $\lambda = d(z_1, z_2)^2$, give a learning function:
$$\lambda^\alpha,\ 0 < \alpha < 1; \qquad (1+\lambda)^\alpha - 1,\ 0 < \alpha < 1; \qquad 1 - (1+\lambda)^{\alpha-1},\ 0 < \alpha < 1; \qquad \frac{\lambda}{\lambda + \alpha},\ \alpha > 0.$$
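An illustrative Monte Carlo sketch (the function names and distributions are my own choices): the distance-based functional $\varphi(\pi) = -E\big(B(d(Z_1, Z_2)^2)\big)$ with the Bernstein function $B(\lambda) = \lambda^{1/2}$, so that $B(d^2) = |z_1 - z_2|$; a more concentrated distribution scores higher:

```python
# Estimate phi(pi) = -E[B(d(Z1, Z2)^2)] by resampling independent pairs.
import numpy as np

def phi_hat(sample, B, n_pairs=200_000, seed=1):
    rng = np.random.default_rng(seed)
    z1 = rng.choice(sample, n_pairs)
    z2 = rng.choice(sample, n_pairs)
    return -np.mean(B((z1 - z2)**2))

B = lambda lam: np.sqrt(lam)            # Bernstein function lambda^0.5
rng = np.random.default_rng(1)
wide = rng.normal(0, 2.0, 50_000)       # spread out: lower phi
narrow = rng.normal(0, 0.5, 50_000)     # concentrated: higher phi
print(phi_hat(wide, B), phi_hat(narrow, B))
```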

4. Counterexamples

We show first that it is not true that information always increases. That is, it is not true that the posterior information is always more than the prior information:
$$I_g(\theta) \le E_{\theta|X}\big(g(\pi(\theta|X))\big).$$
A simple discrete example runs as follows. I have lost my keys. With high prior probability, p, I think they are on my desk. Suppose I have a uniform prior over all k likely other locations. However, suppose when I look on the desk that my keys are not there. My posterior distribution is now uniform on the other locations. Under certain conditions on p and k, Shannon information has gone down. For fixed p, the condition is k > k* where:
$$k^* = \frac{(1-p)^{1 - \frac{1}{p}}}{p} = e\left(\frac{1}{p} - \frac{1}{2} + O(p)\right),$$
by expanding $pk^*$ in a Taylor expansion. When $p = \frac{1}{2}$, $k^* = 4$; and $pk^* \to e$ as $p \to 0$, while $pk^* \to 1$ as $p \to 1$. This example is captured by the somewhat self-doubting phrase "if my keys are not on my desk, I don't know where they are". Note, however, that something has improved: the support size is reduced from k + 1 to k.
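A quick numerical check (mine) of this example: prior mass p on the desk and (1 − p)/k on each of k other locations; after the desk search fails, the posterior is uniform on the k locations, and Shannon information drops when k > k*:

```python
import numpy as np

def shannon(p_vec):
    p_vec = np.asarray(p_vec)
    return float(np.sum(p_vec * np.log(p_vec)))

p, k = 0.5, 5                         # k = 5 > k* = 4 for p = 1/2
prior = [p] + [(1 - p) / k] * k
posterior = [1.0 / k] * k
print((1 - p)**(1 - 1/p) / p)         # k* = 4.0
print(shannon(prior), shannon(posterior))   # posterior information is lower
```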
There is a simple way of obtaining a large class of examples, namely to arrange that there are x-values for which the posterior distribution is approximately uniform. Then, because the uniform distribution typically has low information, for such x, we can have a decrease in information. Thus, we construct examples in which f(x|θ)π(θ) happens to be approximately constant for some x. This motivates the following example.
Let $\Theta \times \mathcal{X} = [0, 1]^2$ with the joint distribution having support on $[0, 1]^2$. Let π(θ) be the prior distribution and define a sampling distribution:
$$f(x|\theta) = a(\theta)(1-x) + \frac{x}{\pi(\theta)}.$$
Note that we include the prior distribution in the sampling distribution as a constructive device, not as some strange new general principle. We have in mind, in giving this construction, that when x → 1, the first term should approach zero and the second term, after multiplying by π(θ), should approach unity. Solving for a(θ) by setting $\int_0^1 f(x|\theta)\, dx = 1$, we have $a(\theta) = \frac{2\pi(\theta) - 1}{\pi(\theta)}$, so that:
$$f(x|\theta) = \frac{(2\pi(\theta) - 1)(1-x) + x}{\pi(\theta)}.$$
The joint distribution is then:
$$f(x|\theta)\,\pi(\theta) = (2\pi(\theta) - 1)(1-x) + x.$$
The marginal distribution of X is $f_X(x) = 1$ on [0, 1], since the integral of (9) is unity, so that (9) is also the posterior distribution π(θ|x). Note that, in order for (9) to be a proper density, we require that $\pi(\theta) \ge \frac{1}{2}$ for 0 ≤ θ ≤ 1.
The Shannon information of the prior is:
$$I_0 = \int_0^1 \pi(\theta) \log \pi(\theta)\, d\theta,$$
and of the posterior is
$$I_1 = \int_0^1 \big((2\pi(\theta)-1)(1-x) + x\big) \log\big((2\pi(\theta)-1)(1-x) + x\big)\, d\theta.$$
When $x = \frac{1}{2}$, the integrands of $I_1$ and $I_0$ are equal and $I_0 = I_1$. When x = 1, the integrand of $I_1$ is zero, as expected. Thus, for a non-uniform prior, we have less posterior information in a neighborhood of x = 1, as we aimed to achieve.
Specializing to $\pi(\theta) = \frac{1}{2} + \theta$ on $[0, 1]$ gives:
$$I_0 = \frac{9}{8}\log 3 - \log 2 - \frac{1}{2}, \qquad I_1 = \frac{1}{4(1-x)}\left((2-x)^2 \log(2-x) - x^2 \log(x) + 2x - 2\right).$$
Information $I_1$ decreases from a maximum of $\log(2) - \frac{1}{2}$ at x = 0, through the value $I_0$ at $x = \frac{1}{2}$, to the value zero at x = 1; see also Figure 1. Thus, $I_0 > I_1$ for $\frac{1}{2} < x \le 1$. Since the marginal distribution of X is uniform on [0, 1], we have the challenging fact that:
$$\mathrm{prob}_X\{I_1 < I_0\} = \frac{1}{2}.$$
Namely, with prior probability equal to one half, there is less Shannon information in the posterior than in the prior. The Rényi entropy exhibits the same phenomenon, but we omit the calculations. We might say that f(x|θ) is not a good choice of sampling distribution for learning about θ.
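A numerical verification (mine) of the closed forms above: $I_1(x)$ exceeds $I_0$ below x = 1/2, matches it at x = 1/2 and falls below it on (1/2, 1]:

```python
import numpy as np

I0 = 9/8 * np.log(3) - np.log(2) - 0.5

def I1(x):
    return ((2 - x)**2 * np.log(2 - x) - x**2 * np.log(x) + 2*x - 2) / (4 * (1 - x))

for x in [0.25, 0.5, 0.75, 0.99]:
    print(x, I1(x) - I0)   # positive, ~0, negative, negative
```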

4.1. Surprise and Ignorance

The conflict between prior beliefs and empirical data, demonstrated by these examples, lies at the heart of debates about inference and learning, that is to say epistemology. This has given rise to formal theories of surprise, which seek to take account of the conflict. Some Bayesian theories are closely related to the learning theory discussed here and measure surprise quantities, such as the difference:
$$S(\pi, f) = I_g(\theta) - E_{\theta|X}\, g(\pi(\theta|X)).$$
Since, under the conditions of Theorem 1, S is expected to be negative, a positive value is taken to measure surprise; see Itti and Baldi [19].
Taking a subjective view of these issues, we may stray into cognitive science, where there is evidence that the human brain may react in a more focused way than normal when there is surprise. This is related to wider computational models of learning: given the finite computational capacity of the brain, we need to use our sensing resources carefully in situations of risk or utility. One such body of work emanates from the so-called “cocktail party effect”: if the subject matter is of sufficient interest, such as the mention of one’s own name across a crowded room, then one’s attention is directed towards the conversation. Discussions about how the attention is first captured are closely related to surprise; see Haykin and Chen [20].

4.2. Minimal Information Prior Distributions

It is clear that if the prior distribution has minimal information (maximum entropy), then there is no surprise, because S, as defined above, is never positive. The use of such prior distributions has been advocated for many years and is incorporated into objective Bayesian analysis by some researchers. One key idea is to use Jeffreys prior distributions, that is, those which are invariant under a suitable group (Haar measure); for a discussion, see Berger [21].
An unresolved issue is that the minimal information distribution depends on the learning function. A simple example: for Shannon information, the minimal information distribution with support on [0, 1] is the uniform distribution, whereas the maximum variance distribution has mass $\frac{1}{2}$ at each of {0, 1} and variance $\frac{1}{4}$, which is approached by the Beta(α, β) distribution as α, β → 0. The variance of the uniform distribution, on the other hand, is $\frac{1}{12} < \frac{1}{4}$.
Consider the standard beta-binomial Bayesian set-up, where the sampling distribution is Bin(n, θ) and the (conjugate) prior is Beta(α, β). If x is the data, the posterior distribution is Beta(α + x, β + n − x), and the posterior mean, which is the Bayes estimator with respect to quadratic loss, is $\hat\theta = \frac{\alpha + x}{\alpha + \beta + n}$. The minimal Shannon information is achieved for the uniform distribution, with α = β = 1, in which case we have $\hat\theta = \frac{1 + x}{2 + n}$. However, if we take α, β → 0, giving, as mentioned, the minimal information with respect to the variance, we obtain in the limit the maximum likelihood estimator $\frac{x}{n}$. The same feature arises in the Dirichlet-multinomial case, with the Dirichlet prior distribution $\pi(\theta_1, \ldots, \theta_k) = \frac{\prod_i \theta_i^{\alpha_i - 1}}{B(\alpha_1, \ldots, \alpha_k)}$. The minimal Shannon information prior is uniform, when all $\alpha_i = 1$, but the minimal information with respect to the trace of the covariance matrix is attained by mass $\frac{1}{k}$ at each corner of the simplex $\sum \theta_i = 1$.
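A small numerical sketch (mine) of this dependence, using the symmetric Beta(a, a) prior: differential Shannon entropy is maximal at a = 1 (the uniform), while the variance, $1/(4(2a+1))$, grows toward 1/4 as a → 0:

```python
from scipy.stats import beta

for a in [0.05, 0.5, 1.0, 2.0]:
    d = beta(a, a)
    print(a, d.entropy(), d.var())
# entropy peaks at a = 1 (value 0); variance grows toward 1/4 as a -> 0
```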

5. The Role of Majorization

We concentrate here on Shannon-type learning functions. The analysis of the last section leads to the notion that for two distributions π1(θ) and π2(θ), the second is more peaked than the first if and only if:
$$\int_\Theta h(\pi_1(\theta))\, d\theta \le \int_\Theta h(\pi_2(\theta))\, d\theta \quad \text{for all convex } h(u) = ug(u) \text{ on } \mathbb{R}^+.$$
The statement (10) defines a partial ordering between π1 and π2.
For Bayesian learning, we may hope that the ordering holds when π1 is the prior distribution and π2 is the posterior distribution. We have seen from the counterexamples that it does not hold in general, but, loosely speaking, always holds in expectation, by Theorem 1. However, it is natural to try to understand the partial ordering, and we shall now indicate that the ordering is equivalent to a well-known majorization ordering for distributions.
Consider two discrete distributions with n-vectors of probabilities $\pi_1 = (\pi_1^{(1)}, \ldots, \pi_n^{(1)})$ and $\pi_2 = (\pi_1^{(2)}, \ldots, \pi_n^{(2)})$, where $\sum_i \pi_i^{(1)} = \sum_i \pi_i^{(2)} = 1$. First, order the probabilities:
$$\tilde\pi_1^{(1)} \ge \cdots \ge \tilde\pi_n^{(1)}, \qquad \tilde\pi_1^{(2)} \ge \cdots \ge \tilde\pi_n^{(2)}.$$
Then, π2 is said to majorize π1, written π1π2, when:
$$\sum_{i=1}^{j} \tilde\pi_i^{(1)} \le \sum_{i=1}^{j} \tilde\pi_i^{(2)}$$
for j = 1, . . . , n (equality for j = n). The standard reference is Marshall and Olkin [22], where one can find several equivalent conditions. Two of the best known are:
  • A1. there is a doubly stochastic matrix $P_{n \times n}$, such that $\pi_1 = P\pi_2$;
  • A2. $\sum_{i=1}^n h(\pi_i^{(1)}) \le \sum_{i=1}^n h(\pi_i^{(2)})$ for all continuous convex functions h(x).
Condition A2 shows that, in the discrete case, the partial ordering (10) is equivalent to the majorization of the raw probabilities.
We now extend this to the continuous case. This generalization, which we shall also call ≼, to save notation, has a long history, and the area is historically referred to as the theory of the “rearrangements of functions” to respect the terminology of Hardy et al. [23]. It is particularly well-suited to probability density functions, because ∫ π1(θ) = ∫π2(θ) = 1. The natural analogue of the ordered values in the discrete case is that every density π has a unique density π̃ , called a “decreasing rearrangement”, obtained by a reordering of the probability mass to be non-increasing, by direct analogy with the discrete case above. In the theory, π and π̃ are then referred to as being equimeasurable, in the sense that the supports are transformed in a measure-preserving way.
There are short sections on the topic in Marshall and Olkin [22] and in Müller and Stoyan [24]. A key paper in the development is Ryff [25]. The next paragraph is a brief summary.

Definition 2

Let π(z) be a probability density and define m(y) = μ{z : π(z) ≥ y}. Then:
$$\tilde\pi(t) = \sup\{y : m(y) > t\}, \quad t > 0$$
is called the decreasing rearrangement of π(z).
The picture is that the probability mass (in infinitely small intervals) is moved, so that a given mass is to the left of any smaller mass. For example, for the triangular distribution:
$$\pi(\theta) = \begin{cases} 4\theta, & 0 \le \theta < \frac{1}{2} \\ 4(1-\theta), & \frac{1}{2} \le \theta \le 1 \end{cases}$$
we have:
$$\tilde\pi(\theta) = 2(1 - \theta), \quad 0 \le \theta \le 1.$$
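A numerical sketch (mine) of the decreasing rearrangement: sorting density values on an equally spaced grid in decreasing order is measure-preserving up to discretization, and recovers the closed form above for the triangular density:

```python
import numpy as np

def decreasing_rearrangement(density, grid):
    # equal grid weights, so sorting the values reorders mass toward the origin
    return np.sort(density(grid))[::-1]

tri = lambda t: np.where(t < 0.5, 4*t, 4*(1 - t))   # the triangular density
grid = np.linspace(0, 1, 1001)
approx = decreasing_rearrangement(tri, grid)
exact = 2 * (1 - grid)
print(np.max(np.abs(approx - exact)))   # small discretization error
```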

Definition 3

We say that π2 majorizes π1, written π1 ≼ π2, if and only if, for the decreasing rearrangements,
$$\int_0^c \tilde\pi_1(z)\, dz \le \int_0^c \tilde\pi_2(z)\, dz$$
for all c > 0.
Define a doubly stochastic kernel P(x, y) ≥ 0 on (0,∞), that is:
$$\int P(x, y)\, dx = \int P(x, y)\, dy = 1.$$
There is a list of key equivalent conditions to ≼, which are the continuous counterparts of the discrete majorization conditions. The first two generalize A1 and A2 above.
  • B1. $\pi_1(\theta) = \int_\Theta P(\theta, z)\pi_2(z)\, dz$ for some non-negative doubly stochastic kernel P(x, y).
  • B2. $\int_\Theta h(\pi_1(z))\, dz \le \int_\Theta h(\pi_2(z))\, dz$ for all continuous convex functions h.
  • B3. $\int_\Theta (\pi_1(z) - c)^+\, dz \le \int_\Theta (\pi_2(z) - c)^+\, dz$ for all c > 0.
Condition B2 is the key, for it shows that in the univariate case, if we assume that h(u) = ug(u) is continuous and convex, (10) is equivalent to π1(θ) ≼ π2(θ). We also see that ≼ is equivalent to standard first-order stochastic dominance of the decreasing rearrangements, since $\tilde F(\theta) = \int_0^\theta \tilde\pi(z)\, dz$ is the cdf corresponding to π̃(θ). Condition B3 says that the probability mass under the density above a "slice" at height c is greater for π2 than for π1.
We can summarize this discussion by the following.

Proposition 1

A functional is a learning functional of the Shannon type (under mild conditions) if and only if it is an order-preserving functional with respect to the majorization ordering on distributions.
The role of majorization has been noticed by DeGroot and Fienberg [26] in the related area of proper scoring rules.
The classic theory of rearrangements is for univariate distributions, whereas, as stated, we are interested in θ of arbitrary dimension. In the present paper, we will simply make the claim that the interpretation of our partial ordering in terms of decreasing rearrangements can indeed be extended to the multivariate case. Heuristically, this is done as follows. For a multivariate distribution, we may create a univariate rearrangement by considering a decreasing threshold and “squashing” all of the multivariate mass for which the density is above the threshold to a univariate mass adjacent to the origin. Since we are transforming multivariate volume to area, care is needed with Jacobians. We can then use the univariate development above. It is an instructive exercise to consider the univariate decreasing rearrangement of the multivariate normal distribution, but we omit the computations here.

6. Learning Based on Covariance Functions

If we restrict our functionals to those which are functionals of covariance matrices only, then we can prove wider results than just for the trace. Dawid and Sebastiani [27] (Section 4) refer to dispersion-coherent uncertainty functions; where their results are close to ours, we differ only in assumptions.
We use the notation A ≥ 0 to mean that a symmetric matrix is non-negative definite.

Definition 4

For two n × n symmetric non-negative definite matrices A and B, the Loewner ordering $A \succeq B$ holds when $A - B \ge 0$.

Definition 5

A function $\varphi : A \mapsto \mathbb{R}$ on the class of non-negative definite matrices A is said to be Loewner increasing (also called matrix monotone) if $A \succeq B \Rightarrow \varphi(A) \ge \varphi(B)$.

Theorem 7

A function φ is Loewner increasing and concave on the class of covariance matrices Γ(π) if and only if −φ is a learning function on the corresponding distributions.

Proof

Assume φ is Loewner increasing. To simplify the notation, we call μ(π) and Γ(π) the mean vector and covariance matrix, respectively, of the random variable Z with distribution π. Now, consider a mixed density πα = (1 − α)π1 + απ2. Then, with obvious notation,
$$\begin{aligned} \Gamma(\pi_\alpha) &= E_\alpha(ZZ^T) - \mu_\alpha \mu_\alpha^T \\ &= (1-\alpha)\Gamma_1 + \alpha\Gamma_2 + (1-\alpha)\mu_1\mu_1^T + \alpha\mu_2\mu_2^T - \big((1-\alpha)\mu_1 + \alpha\mu_2\big)\big((1-\alpha)\mu_1 + \alpha\mu_2\big)^T \\ &= (1-\alpha)\Gamma_1 + \alpha\Gamma_2 + \alpha(1-\alpha)(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \\ &\succeq (1-\alpha)\Gamma_1 + \alpha\Gamma_2, \end{aligned}$$
for 0 ≤ α ≤ 1, since $(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is non-negative definite. Then, since φ is Loewner increasing and concave, $\varphi(\Gamma(\pi_\alpha)) \ge \varphi\big((1-\alpha)\Gamma(\pi_1) + \alpha\Gamma(\pi_2)\big) \ge (1-\alpha)\varphi(\Gamma(\pi_1)) + \alpha\varphi(\Gamma(\pi_2))$, and by Theorem 2, −φ is a learning function.
We first prove the converse for matrices Γ and $\tilde\Gamma = \Gamma + zz^T$, for some vector z. Take two distributions with equal covariance matrices Γ, but with means satisfying $\mu_1 - \mu_2 = 2z$. Then,
$$\Gamma(\pi_{1/2}) = \Gamma + \tfrac{1}{4}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \Gamma + zz^T = \tilde\Gamma.$$
Now assume −φ is a learning function. Then, by concavity,
$$\varphi(\tilde\Gamma) = \varphi(\Gamma(\pi_{1/2})) \ge \tfrac{1}{2}\varphi(\Gamma) + \tfrac{1}{2}\varphi(\Gamma) = \varphi(\Gamma).$$
In general, we can write any $\tilde\Gamma \succeq \Gamma$ as $\tilde\Gamma = \Gamma + \sum_{i=1}^m z^{(i)} z^{(i)T}$, for a sequence of vectors $\{z^{(i)}\}$, $i = 1, \ldots, m$, and the result follows by induction from the last result.
Most criteria used in classical optimum design theory (in the linear regression setting), when applied to covariance matrices, are Loewner increasing. If, in addition, we can claim concavity, then by Theorem 7, the negative of any such function is a learning function. We have seen in Section 3 that −trace(Γ) is a learning function, while −log det(Γ), corresponding to D-optimality, is another example.
For the normal distribution, we can show that for two normal density functions, π1 and π2, with covariances Γ1 and Γ2, respectively, and any Shannon-type learning function, $I_g(\theta_1) \le I_g(\theta_2)$ if and only if $\det(\Gamma_1) \ge \det(\Gamma_2)$. We should note that in many Bayesian set-ups, such as regression and Gaussian process prediction, we have a joint multivariate distribution between x and θ. Suppose that, with obvious notation, the joint covariance matrix is:
$$\Gamma_{\theta,X} = \begin{pmatrix} \Gamma_\theta & \gamma_{\theta,X} \\ \gamma_{\theta,X}^T & \Gamma_X \end{pmatrix}.$$
Then, the posterior distribution for θ has covariance $\Gamma_\theta - \gamma_{\theta,X}\Gamma_X^{-1}\gamma_{\theta,X}^T \preceq \Gamma_\theta$. Thus, for any Loewner increasing φ, it holds that $-\varphi(\pi(\theta)) \le -E_X(\varphi(\pi(\theta|X)))$, by Theorem 7. However, as the conditional covariance matrix does not depend on X, we have learning in the strong sense: $-\varphi(\pi(\theta)) \le -\varphi(\pi(\theta|X))$. Classifying learning functions for θ and $\Gamma_{\theta,X}$ in the case where they are both unknown is not yet fully developed.
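A sketch (mine) of this strong-sense Gaussian claim: the conditional covariance is Loewner-dominated by the prior covariance, so Loewner-increasing criteria such as log det and trace cannot increase after conditioning. The joint covariance below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
S = A @ A.T + 5 * np.eye(5)                  # a joint covariance of (theta, X)
G_th, g, G_X = S[:2, :2], S[:2, 2:], S[2:, 2:]

post = G_th - g @ np.linalg.solve(G_X, g.T)  # posterior covariance of theta
print(np.min(np.linalg.eigvalsh(G_th - post)) >= -1e-10)         # Loewner <=
print(np.linalg.slogdet(post)[1] <= np.linalg.slogdet(G_th)[1])  # log det drops
```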

7. Approximate Bayesian Computation Designs

We now present a general method for performing optimum experimental design calculations, which, combined with the theory of learning outlined above, may provide a comprehensive approach. Recall that in our general setting, a decision about experimentation or observation is essentially a choice of the sampling distribution. In the statistical theory of the design of experiments, this choice typically means a choice of observation sites indexed by a control or independent variable z.
Indeed, we will have examples below in this category. However, the general formulation is that we want to maximize ψ over some restricted set of sampling distributions $f(x|\theta) \in \mathcal{F}$. A choice of f we call a generalized design. Below, we will have one non-standard example based on selective sampling. Note that we shall always assume that the prior distribution π(θ) is fixed and independent of the choice of f. Then, recalling our general information functional φ(π), the design optimization problem is (for fixed π):
$$\max_{f \in \mathcal{F}} \psi(f) = E_{X_D}\, \varphi(\pi(\theta|X_D)),$$
where we stress the dependence of the random variable X on the design and, thereby, on the sampling distribution f, by adding the subscript D.
If the set of sampling distributions is specified by the control variable z, that is, the choice of the sampling distribution f(x|θ, z) amounts to selecting $z \in Z$, then the maximization problem is:
$$\max_{z \in Z} \psi(f) = E_{X_D}\, \varphi(\pi(\theta|X_D, z)).$$
In the examples that we consider below, the sampling distribution will be indexed by a control variable z.
An important distinction should be made between what we shall here call linear and non-linear criteria. By a utility problem being linear, we mean that there is a utility function U(θ, x), such that we seek to optimize, again over the choice of f,
$$E_{X_D} E_{\theta|X_D} U(X_D, \theta) = E_{X_D, \theta} U(X_D, \theta),$$
where the last expectation is with respect to the joint distribution of $X_D$ and θ. In terms of integration, this only requires a single double integral. The non-linear case requires the evaluation of an "internal" integral for $E_{\theta|X_D} U(X_D, \theta)$ and an external integral for $E_{X_D}$. It is important to note that Shannon-type functionals are special types of linear functionals, where $U(\theta, X_D) = g(\pi(\theta|X_D))$. The distance-based functionals are non-linear in that they require a repeated single integral.
This distinction is important when other costs or utilities are included in addition to those coming from learning. Most obvious is a cost for the experiment. This could be fixed, so that no preposterior analysis is required, or it might be random in that it depends on the actual observation. For example, one might add an additional utility $U(X_D)$ solely dependent on the outcome of the experiment: if it really does snow, then snow plows may need to be deployed. The overall (preposterior) expected value of the experiment might be:
$$E_{X_D} E_{\theta|X_D} U(X_D, \theta) + E_{X_D} U(X_D).$$
In this way, one can study the exploration-exploitation problem, often referred to in search and optimization.
We now give a procedure to compute ψ for a particular choice of sampling distribution f. We assume that f(x|θ) and π(θ) are known. If the functional φ is non-linear, we have to obtain the posterior distribution $\pi(\theta|X_D)$ before evaluating φ. For simplicity, we use ABC rejection sampling (see Marjoram et al. [28]) to obtain an approximate sample from $\pi(\theta|X_D)$, which allows us to estimate the functional $\varphi(\pi(\theta|X_D))$. In many cases, it is hard to find an analytical solution for $\pi(\theta|X_D)$, especially if f(x|θ) is intractable; these are the cases where ABC methods are most useful. Furthermore, ABC rejection sampling has the advantage that it is easy to recompute $\hat\varphi(\pi(\theta|X_D))$ for different values of $X_D$, which is an important feature, because we have to integrate over the marginal distribution of $X_D$ in order to obtain $\psi(f) = E_{X_D}\varphi(\pi(\theta|X_D))$.
For a given f, we find the estimate ψ̂ by integrating over φ̂(π(θ|XD)) with respect to the marginal distribution fX. We can achieve this using Monte Carlo integration:
$$\psi(f) \approx \hat\psi = \frac{1}{G} \sum_{i=1}^{G} \hat\varphi\big(\pi(\theta|x_D^{(i)})\big)$$
for $x_D^{(i)} \sim f_X$. The ABC procedure to obtain the estimate $\hat\varphi(\pi(\theta|x_D))$ given $x_D$ is as follows.
(1) Sample from the prior π(θ): $\{\theta_1, \ldots, \theta_H\}$.
(2) For each $\theta_i$, sample from $f(x|\theta_i)$ to obtain a sample $x^{(i)} = (x_1^{(i)}, \ldots, x_n^{(i)})$. This gives a sample from the joint distribution $f_{X,\theta}$.
(3) For each $\theta_i$, compute a vector of summary statistics $T(x^{(i)}) = (T_1(x^{(i)}), \ldots, T_m(x^{(i)}))$.
(4) Split T-space into disjoint neighborhoods $\{\mathcal{N}_k\}$.
(5) Find the neighborhood $\mathcal{N}_k$ for which $T(x_D) \in \mathcal{N}_k$ and collect the $\theta_i$ for which $T(x^{(i)}) \in \mathcal{N}_k$, forming an approximate posterior distribution $\tilde\pi(\theta|T)$, which, if T is approximately sufficient, should be close to $\pi(\theta|x_D)$. If T is sufficient, we have that $\tilde\pi(\theta|T) \to \pi(\theta|x_D)$ as $|\mathcal{N}_k| \to 0$.
(6) Approximate $\pi(\theta|x_D)$ by $\tilde\pi(\theta|T)$.
(7) Evaluate $\varphi(\pi(\theta|x_D))$ by integration (internal integration).
Steps 1–4 need to be conducted only once at the outset for each f; only Steps 5–7 have to be repeated for each xD ~ fX.
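A minimal runnable sketch of Steps 1–7 (my own toy example, not the authors' code): X|θ ~ N(θ, 1) with n = 1, summary statistic T(x) = x, and the learning functional φ(π) = −var(π), so that ψ(f) = −E_X var(θ|X):

```python
import numpy as np

rng = np.random.default_rng(3)
H, G, eps = 100_000, 1_000, 0.05

theta = rng.uniform(-1, 1, H)        # Step 1: sample from the prior
x = rng.normal(theta, 1.0)           # Step 2: joint sample; T(x) = x (Step 3)

def phi_hat(x_obs):
    # Steps 4-6: prior draws whose data landed near x_obs approximate the posterior
    keep = theta[np.abs(x - x_obs) < eps]
    return -np.var(keep)             # Step 7: internal integration

x_marg = rng.choice(x, G)            # draws from the marginal f_X
print(np.mean([phi_hat(xd) for xd in x_marg]))   # estimate of psi(f)
```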
For the linear functional, explained above, we do not even need to compute the posterior distribution, π(θ|xD), if we are happy to use the naive approximation to the double integral:
$$\psi(f) \approx \frac{1}{G} \sum_{i=1}^{G} U(x_i, \theta_i),$$
where $\{x_i, \theta_i\}_{i=1}^{G}$ are independent draws from the joint distribution $f(x, \theta) = f(x|\theta)\pi(\theta)$.
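A one-line sketch (mine) of this naive estimate, with an illustrative utility that is not from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
theta = rng.uniform(-1, 1, 100_000)
x = rng.normal(theta, 1.0)
U = lambda x, th: -(x - th)**2       # hypothetical utility
print(np.mean(U(x, theta)))          # approximates E_{X,theta} U; here exactly -1
```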
The optimum ψ(f) for f may be found by employing any suitable optimization method. In this paper, we intend to focus on the computation of ψ̂(f). Therefore, in the illustrative examples below, we take a “crude” optimization approach, that is we estimate ψ(f) for a fixed set of possible choices for f and compare the estimates.
The basic technique of ABCD was introduced in Hainy et al. [29], but here, we present it fully embedded into statistical learning theory. Note that related, though different, procedures utilizing MCMC chains were independently developed in Drovandi and Pettitt [30] and Hainy et al. [31].
We now present two examples that are meant to illustrate the applicability of ABCD to very general design problems using non-linear design criteria. Although these examples are rather simple and may also be solved by analytical or numerical methods, their generalizations become intractable using traditional methods.

7.1. Selective Sampling

When the background sampling distribution is f(x|θ), we may impose prior constraints on which data we accept for use. Such models, in greater generality, may occur when observation is cheap but the use of observations is expensive, for example computationally. We call this "selective sampling", and we present a simple example.
Suppose in a one-dimensional problem that we are only allowed to accept observations from two slits of equal width at z1 and z2. Here, the model is equivalent (in the limit as the slit widths become small) to replacing f(x|θ) by the discrete distribution:
$$f(x = i|\theta, z_1, z_2) = \frac{f(z_i|\theta)}{f(z_1|\theta) + f(z_2|\theta)}, \quad i = 1, 2.$$
If we have a prior distribution π(θ) and $f(x|z_1, z_2) = \int f(x|\theta, z_1, z_2)\pi(\theta)\, d\theta$ denotes the marginal distribution of x, the posterior distribution is given by:
$$\pi(\theta|x = i, z_1, z_2) = \frac{f(x = i|\theta, z_1, z_2)\, \pi(\theta)}{f(x = i|z_1, z_2)}, \quad i = 1, 2.$$
To simplify even further, we take as a criterion:
$$\varphi(\pi(\theta|x, z_1, z_2)) = \max_\theta \pi(\theta|x, z_1, z_2).$$
The maximum is a limiting version of Tsallis entropy and is a learning functional.
Now consider a special case:
$$z|\theta \sim N(\theta, 1), \qquad \theta \sim U[-1, 1].$$
The preposterior:
$$\psi(z_1, z_2) = \sum_{i=1}^{2} \varphi(\pi(\theta|x = i, z_1, z_2))\, f(x = i|z_1, z_2)$$
can be calculated explicitly. If $z_2 \ge z_1$ and $z_i \in [-a, a]$, then:
$$\max_{z_1, z_2} \psi(z_1, z_2) = \psi(-a, a) = \frac{1}{1 + \exp(-2a)} \to \begin{cases} \frac{1}{2}, & a \to 0 \\ 1, & a \to \infty. \end{cases}$$
Next, we show how this example can be solved using ABCD. Due to the special structure of the sampling distribution in this example, we modified our ABC sampling strategy slightly.
(1) For fixed $z_1$ and $z_2$, sample H values $\{\theta^{(j)}, j = 1, \ldots, H\}$ from the prior.
(2) For each $\theta^{(j)}$, repeat:
 (a) sample $z^{(k)} \sim \pi(z|\theta^{(j)})$ until $\#\{z^{(k)} \in N_\varepsilon(z_1) \cup N_\varepsilon(z_2)\} = K_z$, where $N_\varepsilon(z) = [z - \varepsilon/2, z + \varepsilon/2]$;
 (b) drop all $z^{(k)} \notin N_\varepsilon(z_1) \cup N_\varepsilon(z_2)$;
 (c) sample $x^{(j)}$ from the discrete distribution with probabilities $\Pr(x^{(j)} = i) = \frac{\#\{z^{(k)} \in N_\varepsilon(z_i)\}}{K_z}$, i = 1, 2.
(3) For i = 1, 2, select all $\theta^{(j)}$ for which $x^{(j)} = i$, compute a kernel density estimate from these $\theta^{(j)}$ and obtain its maximum → $\hat\varphi(\hat\pi(\theta|x = i, z_1, z_2))$.
(4) $\hat\psi(z_1, z_2) = \sum_{i=1}^{2} \hat\varphi(\hat\pi(\theta|x = i, z_1, z_2))\, \frac{\#\{x^{(j)} = i\}}{H}$.
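A sketch (mine) of these steps, batched for speed; the slit width, grid and sample sizes are illustrative, and the batching can overshoot $K_z$ slightly:

```python
import numpy as np
from scipy.stats import gaussian_kde

def psi_hat(z1, z2, H=2000, Kz=100, eps=0.05, seed=5):
    rng = np.random.default_rng(seed)
    thetas = rng.uniform(-1, 1, H)        # Step 1: prior sample
    xs = np.empty(H, dtype=int)
    for j, th in enumerate(thetas):
        n1 = n2 = 0
        while n1 + n2 < Kz:               # Step 2(a): count hits in the two slits
            z = rng.normal(th, 1.0, 20_000)
            n1 += int(np.sum(np.abs(z - z1) < eps / 2))
            n2 += int(np.sum(np.abs(z - z2) < eps / 2))
        xs[j] = 1 if rng.random() < n1 / (n1 + n2) else 2   # Step 2(c)
    grid = np.linspace(-1, 1, 401)
    psi = 0.0
    for i in (1, 2):                      # Steps 3-4: KDE maximum per outcome
        group = thetas[xs == i]
        psi += gaussian_kde(group)(grid).max() * len(group) / H
    return psi

a = 1.5
print(psi_hat(-a, a), 1 / (1 + np.exp(-2 * a)))   # ABC estimate vs exact value
```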
We performed our ABC sampling strategy for this example for a range of parameters for the slit neighborhood length ε (ε = 0.005, 0.01, 0.05), H (H = 100, 1,000, 10,000) and Kz (Kz = 50, 100, 200) in order to assess the effect of these parameters on the accuracy of the ABC estimates of the criterion ψ. The most notable effect was found for the ABC sample size H.
Figure 2 shows the estimated values of the criterion, ψ̂, for the special case where z2 = −z1 and a = 1.5. We set ε = 0.01 and Kz = 100. The ABC sample size is set to H = 100 (left), H = 1,000 (center) and H = 10,000 (right). The criterion was evaluated at the eight points z1 = 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5. The theoretical criterion function ψ(z1) is plotted as a solid line.

7.2. Spatial Sampling for Prediction

This example is also a simple version of an important paradigm, namely optimal sampling of a spatial stochastic process for good prediction. Here, the stochastic process labeled X is indexed by a space variable z, and we write Xi = Xi(zi), i = 1, . . . , n to indicate sampling at sites (the design) Dn = {z1, . . . , zn}. We would typically take the design space, Z, to be a compact region.
We wish to compute the predictive distribution at a new site zn+1, namely xn+1(zn+1), given xD = x(Dn) = (x1(z1), . . . , xn(zn)). In the Gaussian case, the background parameter θ could be related to a fixed effect (drift) or the covariance function of the process, or both. In the analysis, xn+1 is regarded as an additional parameter, and we need its (marginal) conditional distribution.
The criterion of interest is the maximum variance of the (posterior) predictive distribution over the design space:
$$-\varphi(x(D_n)) = \max_{z_{n+1} \in Z} \mathrm{var}\big(X_{n+1}(z_{n+1}) \mid x(D_n)\big) = \max_{z_{n+1} \in Z} \int \big(x_{n+1} - \mu_{x_{n+1}}\big)^2\, \pi\big(x_{n+1} \mid x(D_n), z_{n+1}\big)\, dx_{n+1}.$$
This functional is learnable, since it is a maximum of a set of variances, each one of which is learnable.
Referring back to the general design optimization problem stated in (11), the posterior predictive distribution of $x_{n+1}$ may be interpreted as the posterior distribution in (11). The optimality criterion ψ is found by integrating φ with respect to $X_1, \ldots, X_n$.
The strategy is to select a design $D_n$ and then perform ABC at each test point $z_{n+1}$. The learning functional $\varphi(x_D)$ is estimated by generating the sample $I = \{x_D^{(j)}, x_{n+1}^{(j)}\}_{j=1}^{H} = \{x_1^{(j)}, x_2^{(j)}, \ldots, x_n^{(j)}, x_{n+1}^{(j)}\}_{j=1}^{H}$ at the sites $\{z_1, z_2, \ldots, z_n, z_{n+1}\}$ and calculating:
$$-\hat\varphi(x_D) = \max_{z_{n+1} \in Z} \frac{1}{|J_\varepsilon(x_D)|} \sum_{j \in J_\varepsilon(x_D)} \big(x_{n+1}^{(j)} - \bar x_{n+1}\big)^2,$$
where $J_\varepsilon(x_D) = \{j \in \{1, \ldots, H\} : x_D^{(j)} \in N_\varepsilon(x_D)\}$, with $x_D^{(j)} \in N_\varepsilon(x_D)$ if $|x_i^{(j)} - x_i| \le \varepsilon$ for all $i = 1, \ldots, n$, and $\bar x_{n+1} = (1/|J_\varepsilon(x_D)|) \sum_{j \in J_\varepsilon(x_D)} x_{n+1}^{(j)}$.
In order to estimate $\psi(D_n) = E_{X_D}(\varphi(X_D))$, we obtain a sample $O = \{x_D^{(i)}\}_{i=1}^{G}$ from the marginal distribution of the random field at the design $D_n$ and perform Monte Carlo integration:
$$\hat\psi(D_n) = \frac{1}{G} \sum_{i=1}^{G} \hat\varphi\big(x_D^{(i)}\big).$$
For each $x_D^{(i)} \in O$ from the marginal sample, we use the sample I to compute $\hat\varphi(x_D^{(i)})$ in order to save computing time. We then vary the design using some optimization algorithm.
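A condensed sketch (mine) of this loop for the Ornstein–Uhlenbeck example that follows, with a point prior θ = log(100); the design, grid, tolerance and sample sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
H, G, eps, theta = 50_000, 200, 0.1, np.log(100)
design = np.array([0.1, 0.5, 0.9])           # a candidate design D3
z4_grid = np.linspace(0, 1, 11)              # candidate prediction sites z4
sites = np.concatenate([design, z4_grid])

C = np.exp(-theta * np.abs(sites[:, None] - sites[None, :]))   # OU correlation
X = rng.multivariate_normal(np.zeros(len(sites)), C, size=H)   # sample I
XD, X4 = X[:, :3], X[:, 3:]

def neg_phi_hat(xd):
    J = np.all(np.abs(XD - xd) < eps, axis=1)   # ABC neighborhood J_eps(x_D)
    return np.max(np.var(X4[J], axis=0))        # max predictive variance over z4

idx = rng.choice(H, G, replace=False)           # marginal sample O, reusing I
print(np.mean([neg_phi_hat(XD[i]) for i in idx]))   # estimate of -psi(D3)
```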
A simple example is adopted from Müller et al. [32]. The observations (x1(z1), x2(z2), x3(z3), x4(z4)) are assumed to be distributed according to a one-dimensional Gaussian random field with mean zero, a marginal variance of one and zi ∈ [0, 1]. We want to select an optimal design D3 = (z1, z2, z3), such that:
$$-\psi(D_3) = E_{X_{1:3}(D_3)}\left[\max_{z_4 \in [0, 1]} \mathrm{var}\big(X_4(z_4) \mid X_{1:3}(D_3)\big)\right]$$
is minimal.
We assume the Ornstein–Uhlenbeck process with correlation function $\rho(|s-t|; \theta) = e^{-\theta|s-t|}$. Two prior distributions for the parameter θ are considered. The first one is a point prior at θ = log(100), so that $\rho(h) = \rho(h; \log(100)) = 0.01^h$. This is the correlation function used by Müller et al. [32] in their study of empirical kriging optimal designs. The second prior distribution is an exponential prior for θ with scale parameter λ = 10 (i.e., θ ~ Exp(10)). The scale parameter λ was chosen such that the average correlation functions of the point and exponential priors are similar. By that, we mean that the average of the mean correlation function for the exponential prior over all pairs of sites s and t, $E_{s,t}[E_\theta\{\rho(|s-t|; \theta) \mid \theta \sim \mathrm{Exp}(\lambda)\}] = E_{s,t}[1/(1 + \lambda|s-t|)]$, matches the average of the fixed correlation function $\rho(|s-t|; \log(100)) = 0.01^{|s-t|}$ over all pairs of sites s and t, $E_{s,t}[0.01^{|s-t|}]$. The sites are assumed to be uniformly distributed over the coordinate space.
To be more specific, first, for each site s, the average correlation to all other sites t is computed. Then, these average correlations are averaged over all sites s. For the point prior, the average correlation is $E_{s,t}[\rho(|s-t|; \log(100))] = \frac{2}{\log(100)^2}\left(\log(100) - \left(1 - \frac{1}{100}\right)\right) = 0.3409$, and for the exponential prior, the value is $E_{s,t}[E_\theta\{\rho(|s-t|; \theta) \mid \theta \sim \mathrm{Exp}(\lambda)\}] = \frac{2}{\lambda^2}\big[(1+\lambda)\log(1+\lambda) - \lambda\big]$. If λ = 10, we have $E_{s,t}[E_\theta\{\rho(|s-t|; \theta) \mid \theta \sim \mathrm{Exp}(10)\}] = 0.3275$.
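A quick Monte Carlo check (mine) of these two averaged-correlation values for uniformly distributed sites s, t on [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(7)
s, t = rng.uniform(size=(2, 2_000_000))
h = np.abs(s - t)
print(np.mean(0.01**h))            # ~0.3409 (point prior, theta = log(100))
print(np.mean(1 / (1 + 10 * h)))   # ~0.3275 (exponential prior, lambda = 10)
```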
Figure 3 depicts the distributions of the correlation function ρ(h; θ) = exp(−θh) under the two prior distributions. The solid line corresponds to the fixed correlation function ρ(h; θ = log(100)) = 0.01h. The dotted line and the two dashed lines represent the mean correlation function and the 0.025- and 0.975-quantile functions for ρ(h; θ) under the prior θ ~ Exp(10).
We estimated the criterion on a grid with spacing 0.05 for the design points $z_1$ and $z_3$ ($z_2$ is fixed at $z_2 = 0.5$). We set G = 1,000, $H = 5 \cdot 10^6$ and ε = 0.01 for each design point. The sample $\{x_j(z) : z \in Z\}_{j=1}^{H}$ is simulated at all points z of the grid prior to the actual ABC algorithm. In order to accelerate the computations, it is then reused for all possible designs $D_3$ to estimate each $\hat\varphi(x_D^{(i)})$, $i = 1, \ldots, G$, in (12). The sample size $H = 5 \cdot 10^6$ was deemed to provide a sufficiently exhaustive sample from the four-dimensional normal vector $(x_1(z_1), x_2(z_2), x_3(z_3), x_4(z_4))$ for any $z_i \in Z$, so that the distortive effect of using the same sample for the computations of all $\hat\varphi(x_D^{(i)})$ is of only negligible concern for our purposes of ranking the designs.
Figure 4 (left) shows the map of estimated criterion values, −ψ̂(D3), when the prior distribution of θ is the point prior at θ = log(100). It can be seen that the minimum of the criterion is attained at about $(z_1, z_3) = (0.9, 0.1)$ or $(z_1, z_3) = (0.1, 0.9)$, which is comparable to the results obtained in Müller et al. [32] for empirical kriging optimal designs. Note that the diverging criterion values at the diagonal and at $z_1 = 0.5$ and $z_3 = 0.5$ are attributable to a specific feature of the ABC method used. At these designs, the actual dimension of the design is lower than three, so for a given ε, there are more elements in the neighborhood than for the other designs with three distinct design points. Hence, a much larger fraction of the total sample, $\{x_{n+1}^{(j)}\}_{j=1}^{H}$, is included in the ABC sample, $\{x_{n+1}^{(j)} : j \in J_\varepsilon(x_D)\}$. Therefore, the values of the criterion get closer to the marginal variance of one. In order to avoid this effect, the parameter ε would have to be adapted in these cases. Alternatively, one could use other variants of ABC rejection, where the N elements of $I = \{x_D^{(j)}, x_{n+1}^{(j)}\}_{j=1}^{H}$ with the smallest distance to the draw $x_D^{(i)} \in O$ constitute the ABC posterior sample, making it necessary to compute and sort the distances for each $x_D^{(i)} \in O$.
Figure 4 (right) gives the estimated criterion values, −ψ̂(D3), when the prior of θ is θ ~ Exp(10). Due to the uncertainty about the prior parameter θ, the optimal design points for $z_1$ and $z_3$ move slightly toward the edges, which is also in accordance with the findings of Müller et al. [32].

8. Conclusions

There are some fundamental results in Bayesian learning which provide important background to fields like the optimal design of experiments. Functionals of prior distributions which are learnable, via observation, in a wide sense, are convex. Shannon information is an example, but there are many others, and the paper points to some wide classes with connections to other fields. The paper combines the theory of learning with an effective simulation-based method for the optimal design of experiments: ABCD. It is suggested that the method should prove useful in non-standard situations, such as non-linear, non-Gaussian models, and for complex problems where the sampling distribution is intractable but one can still draw samples from it for given parameter values. A simple message is that the learning theory and the simulation method apply to a generalized notion of an experiment as a choice of sampling distribution, under restrictions.

Acknowledgments

The research of the first author has been partially supported by the French Science Fund (ANR) and Austrian Science Fund (FWF) bilateral grant I-833-N18. The last author is grateful for the award of an Exzellenzstipendium des Landes Oberösterreich by the governor of Upper Austria in 2012.

Author Contributions

The background sections were mainly authored by Henry P. Wynn; ABCD was jointly conceived by Werner G. Müller and Henry P. Wynn; All computations for the examples were performed by Markus Hainy. All authors have read and approved the final published manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Blackwell, D. Comparison of Experiments. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 31 July–12 August 1950; University of California Press: Berkeley, CA, USA, 1951; pp. 93–102.
  2. Torgersen, E. Comparison of Statistical Experiments; Encyclopedia of Mathematics and its Applications 36; Cambridge University Press: Cambridge, UK, 1991.
  3. Rényi, A. On Measures of Entropy and Information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
  4. Lindley, D.V. On a Measure of the Information Provided by an Experiment. Ann. Math. Stat. 1956, 27, 986–1005.
  5. Goel, P.K.; DeGroot, M.H. Comparison of Experiments and Information Measures. Ann. Stat. 1979, 7, 1066–1077.
  6. Ginebra, J. On the measure of the information in a statistical experiment. Bayesian Anal. 2007, 2, 167–211.
  7. Chaloner, K.; Verdinelli, I. Bayesian Experimental Design: A Review. Stat. Sci. 1995, 10, 273–304.
  8. Sebastiani, P.; Wynn, H.P. Maximum entropy sampling and optimal Bayesian experimental design. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2000, 62, 145–157.
  9. Chater, N.; Oaksford, M. The Probability Heuristics Model of Syllogistic Reasoning. Cogn. Psychol. 1999, 38, 191–258.
  10. Schoenberg, I.J. Metric Spaces and Positive Definite Functions. Trans. Am. Math. Soc. 1938, 44, 522–536.
  11. Schilling, R.L.; Song, R.; Vondracek, Z. Bernstein Functions: Theory and Applications; De Gruyter Studies in Mathematics 37; De Gruyter: Berlin, Germany, 2012.
  12. Tsallis, C. Possible generalization of Boltzmann–Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
  13. DeGroot, M.H. Optimal Statistical Decisions, WCL ed.; Wiley-Interscience: Hoboken, NJ, USA, 2004.
  14. Goldman, A.I.; Shaked, M. Results on inquiry and truth possession. Stat. Probab. Lett. 1991, 12, 415–420.
  15. Fallis, D.; Liddell, G. Further results on inquiry and truth possession. Stat. Probab. Lett. 2002, 60, 169–182.
  16. Torgerson, W.S. Theory and Methods of Scaling; John Wiley and Sons: New York, NY, USA, 1958.
  17. Gower, J.C. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 1966, 53, 325–338.
  18. Gower, J.C. Euclidean distance geometry. Math. Sci. 1982, 7, 1–14.
  19. Itti, L.; Baldi, P. Bayesian surprise attracts human attention. Vis. Res. 2009, 49, 1295–1306.
  20. Haykin, S.; Chen, Z. The Cocktail Party Problem. Neural Comput. 2005, 17, 1875–1902.
  21. Berger, J. The case for objective Bayesian analysis. Bayesian Anal. 2006, 1, 385–402.
  22. Marshall, A.W.; Olkin, I.; Arnold, B.C. Inequalities: Theory of Majorization and Its Applications, 2nd ed.; Springer Series in Statistics; Springer: Berlin, Germany, 2009.
  23. Hardy, G.H.; Littlewood, J.E.; Pólya, G. Inequalities, 2nd ed.; Cambridge Mathematical Library; Cambridge University Press: Cambridge, UK, 1988.
  24. Müller, A.; Stoyan, D. Comparison Methods for Stochastic Models and Risks, 1st ed.; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2002.
  25. Ryff, J.V. Orbits of L1-functions under doubly stochastic transformations. Trans. Am. Math. Soc. 1965, 117, 92–100.
  26. DeGroot, M.H.; Fienberg, S. Comparing probability forecasters: Basic binary concepts and multivariate extensions. In Bayesian Inference and Decision Techniques; Goel, P., Zellner, A., Eds.; North-Holland: Amsterdam, The Netherlands, 1986; pp. 247–264.
  27. Dawid, A.P.; Sebastiani, P. Coherent dispersion criteria for optimal experimental design. Ann. Stat. 1999, 27, 65–81.
  28. Marjoram, P.; Molitor, J.; Plagnol, V.; Tavaré, S. Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA 2003, 100, 15324–15328.
  29. Hainy, M.; Müller, W.G.; Wynn, H.P. Approximate Bayesian Computation Design (ABCD), an Introduction. In mODa 10—Advances in Model-Oriented Design and Analysis; Ucinski, D., Atkinson, A.C., Patan, M., Eds.; Contributions to Statistics; Springer International Publishing: Heidelberg/Berlin, Germany, 2013; pp. 135–143.
  30. Drovandi, C.C.; Pettitt, A.N. Bayesian Experimental Design for Models with Intractable Likelihoods. Biometrics 2013, 69, 937–948.
  31. Hainy, M.; Müller, W.G.; Wagner, H. Likelihood-free Simulation-based Optimal Design; Technical Report; Johannes Kepler University: Linz, Austria, 2013.
  32. Müller, W.G.; Pronzato, L.; Waldl, H. Beyond space-filling: An illustrative case. Procedia Environ. Sci. 2011, 7, 14–19.
Figure 1. Shannon information of the prior, I0, and of the posterior, I1, depending on x.
Figure 2. Estimated values of the criterion ψ̂(z1) (points) and theoretical criterion function ψ(z1) (solid line) for ε = 0.01, Kz = 100, and H = 100 (a), H = 1,000 (b), H = 10,000 (c).
Figure 3. Prior distributions of the correlation function ρ(h; θ): correlation function ρ(h) = 0.01^h under the point prior θ = log(100) (solid line); mean correlation function (dotted line) and 0.025- and 0.975-quantile functions (dashed lines) for ρ(h; θ) under the prior θ ~ Exp(10).
Figure 4. Spatial prediction criterion map for the point prior at θ = log(100) (left) and for the exponential prior θ ~ Exp(10) (right).
