1. Introduction
Jeffrey conditioning is a method of update (first recommended by Richard Jeffrey in [1]) which generalizes standard conditioning and operates in probability kinematics where evidence is uncertain (P(E) ≠ 1). Sometimes, when we reason inductively, observed outcomes have entailment relationships with partitions of the possibility space that pose challenges which Jeffrey conditioning cannot meet. As we will see, it is not difficult to resolve these challenges by generalizing Jeffrey conditioning. There are claims in the literature that the principle of maximum entropy, from now on pme, conflicts with this generalization. I will show under which conditions this conflict obtains. Since proponents of pme are unlikely to subscribe to these conditions, the position of pme in the larger debate over inductive logic and reasoning is not undermined.
In Section 2, I will introduce the obverse Majerník problem and sketch how it ties in with two natural generalizations of Jeffrey conditioning: Wagner conditioning and pme. In Section 3, I will introduce Jeffrey conditioning in a notation that will later help us to solve the obverse Majerník problem. In Section 4, I will introduce Wagner conditioning and show how it naturally generalizes Jeffrey conditioning. In Section 5, I will show that pme does so as well under conditions that are straightforward to accept for proponents of pme. This solves the obverse Majerník problem and makes Wagner conditioning unnecessary as a separate generalization of Jeffrey conditioning, since pme seamlessly incorporates it. The conclusion in Section 6 summarizes my claims and briefly refers to epistemological consequences. An appendix proves that pme generalizes standard conditioning and Jeffrey conditioning, providing a template for a simplified proof of the claim in the body of the paper.
2. Jeffrey’s Updating Principle and the Principle of Maximum Entropy
In his paper “Marginal Probability Distribution Determined by the Maximum Entropy Method” (see [2]), Vladimír Majerník asks the following question: If we had two partitions of an event space and knew all the conditional probabilities (the probability of each event in the first partition conditional on each event in the second partition), would we be able to calculate the marginal probabilities for the two partitions? The answer is yes, if we commit ourselves to pme:
[pme] Keep the information entropy of your probability distribution maximal within the constraints that the evidence provides (in the synchronic case), or your cross-entropy minimal (in the diachronic case).
For Majerník’s question, pme provides us with a unique and plausible answer (see Majerník’s paper). We may also be interested in the obverse question: if the marginal probabilities of the two partitions were given, would we similarly be able to calculate the conditional probabilities? The answer is yes: given pme, Theorems 2.2.1 and 2.6.5 in [3] reveal that the joint probabilities are the product of the marginal probabilities (see also [4]). Once the joint probabilities and the marginal probabilities are available, it is trivial to calculate the conditional probabilities.
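The claim about the obverse question can be checked numerically on a small case. The following is only a sketch under stated assumptions: the marginals `alpha` and `beta` are hypothetical values chosen for illustration, and the helper names (`entropy`, `product_joint`, `perturb`) are not taken from the cited sources. It builds the joint distribution as the product of the marginals, confirms that a few other joints with the same marginals have lower Shannon entropy, and then reads off the conditional probabilities.

```python
import math

def entropy(joint):
    """Shannon entropy (in nats) of a joint distribution given as a list of rows."""
    return -sum(p * math.log(p) for row in joint for p in row if p > 0)

def product_joint(alpha, beta):
    """Joint distribution whose cells are the products of the two marginals."""
    return [[a * b for b in beta] for a in alpha]

def perturb(joint, eps):
    """Move mass eps around a 2x2 block; all row and column sums stay unchanged."""
    shifted = [row[:] for row in joint]
    shifted[0][0] += eps
    shifted[1][1] += eps
    shifted[0][1] -= eps
    shifted[1][0] -= eps
    return shifted

alpha = [0.7, 0.3]       # hypothetical marginals of the first partition
beta = [0.2, 0.5, 0.3]   # hypothetical marginals of the second partition

joint = product_joint(alpha, beta)
h_product = entropy(joint)

# Entropies of some competing joints with the same marginals.
h_perturbed = [entropy(perturb(joint, eps)) for eps in (-0.05, -0.01, 0.01, 0.05)]

# Conditional probabilities P(omega_i | theta_j) = joint cell / column marginal.
conditionals = [[p / b for p, b in zip(row, beta)] for row in joint]

print(h_product, h_perturbed)
print(conditionals)
```

The comparison covers only a handful of perturbations, so it illustrates the theorem rather than proving it.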
It is important to note that these joint probabilities do not legislate independence, even though they allow it [4] (p.1670). Mérouane Debbah and Ralf Müller correctly describe these joint probabilities as a model with as many degrees of freedom as possible, which leaves free degrees for correlation to exist or not [4] (p.1674). This avoids the introduction of unjustified information [4] (p.1672), corresponding to the simple intuition behind pme: when updating your probabilities, waste no useful information and do not gain information unless the evidence compels you to gain it (see [4] (p.1685f), [5] (p.376), [6,7], [8] (p.186)). The principle comes with its own formal apparatus, not unlike probability theory itself: Shannon’s information entropy [9], the Kullback-Leibler divergence (see [10,11], [12] (p.308ff), [13] (p.262ff)), the use of Lagrange multipliers (see [3] (p.409ff), [12] (p.327f), [13] (p.281)), and the log-inverse relationship between information and probability (see [14–17]).
There is an older problem by Carl Wagner [18] which can be cast in similar terms as Majerník’s. If we were given some of the marginal probabilities in an updating problem as well as some logical relationships between the two partitions, would we be able to calculate the remaining marginal probabilities? This problem is best understood by example (see Wagner’s Linguist problem in Section 4). Wagner solves it using a natural generalization of Jeffrey conditioning, which I will call Wagner conditioning. It is not based on pme, but on what I call Jeffrey’s updating principle, or jup for short:
[jup] In a diachronic updating process, keep the ratio of probabilities constant as long as they are unaffected by the constraints that the evidence poses.
As is the case for pme, there is a debate whether updating on evidence by rational agents is bound by jup (for a defence see [19]; for detractors see [20]). Our interest in this paper is the relationship between pme and jup, both of which are updating principles. Wagner contends that his natural generalization of Jeffrey conditioning, based on jup, contradicts pme. Among formal epistemologists, there is a widespread view that, while pme is a generalization of Jeffrey conditioning, it is an inappropriate updating method in certain cases and does not enjoy the generality of Jeffrey conditioning. Wagner’s claims support this view inasmuch as Wagner conditioning is based on the relatively plausible jup and naturally generalizes Jeffrey conditioning, but according to Wagner it contradicts pme, which gives wrong results in these cases.
This paper resists Wagner’s conclusions and shows that pme generalizes both Jeffrey conditioning and Wagner conditioning, providing a much more integrated approach to probability updating. This integrated approach also gives a coherent answer to the obverse Majerník problem posed above.
3. Jeffrey Conditioning
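Richard Jeffrey proposes an updating method for cases in which the evidence is uncertain, generalizing standard probabilistic conditioning. I will present this method in unusual notation, anticipating its use in solving Wagner’s Linguist problem and in giving a general solution for the obverse Majerník problem. Let $\Omega$ be a finite event space and $\{\theta_j\}_{j=1,\ldots,n}$ a partition of $\Omega$. Let $\kappa$ be an $m \times n$ matrix for which each column contains exactly one 1, otherwise 0. Let $P = P_{\text{prior}}$ and $\hat{P} = P_{\text{posterior}}$. Then $\{\omega_i\}_{i=1,\ldots,m}$, for which
$$\omega_i = \bigcup_{j=1}^{n} \theta^{*}_{ij}, \tag{1}$$
is likewise a partition of $\Omega$ (the $\omega$ are basically a more coarsely grained partition than the $\theta$). Here $\theta^{*}_{ij} = \emptyset$ if $\kappa_{ij} = 0$, and $\theta^{*}_{ij} = \theta_j$ otherwise. Let $\beta$ be the vector of prior probabilities for $\{\theta_j\}_{j=1,\ldots,n}$ ($P(\theta_j) = \beta_j$) and $\hat{\beta}$ the vector of posterior probabilities ($\hat{P}(\theta_j) = \hat{\beta}_j$); likewise for $\alpha$ and $\hat{\alpha}$, corresponding to the prior and posterior probabilities for $\{\omega_i\}_{i=1,\ldots,m}$, respectively.
A Jeffrey-type problem is one in which $\beta$ and $\hat{\alpha}$ are given and we are looking for $\hat{\beta}$. A mathematically more concise characterization of a Jeffrey-type problem is the triple $(\kappa, \beta, \hat{\alpha})$. The solution, using Jeffrey conditioning, is
$$\hat{\beta}_j = \beta_j \sum_{i=1}^{m} \frac{\kappa_{ij}\,\hat{\alpha}_i}{\sum_{l=1}^{n} \kappa_{il}\,\beta_l} \quad \text{for all } j = 1, \ldots, n. \tag{2}$$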
The notation is more complicated than it needs to be for Jeffrey conditioning. In Section 5, however, I will take full advantage of it to present a generalization where the ωi do not range over the θj. In the meantime, here is an example to illustrate (2).
A token is pulled from a bag containing 3 yellow tokens, 2 blue tokens, and 1 purple token. You are colour blind and cannot distinguish between the blue and the purple token when you see it. When the token is pulled, it is shown to you in poor lighting and then obscured again. You come to the conclusion based on your observation that the probability that the pulled token is yellow is 1/3 and that the probability that the pulled token is blue or purple is 2/3. What is your updated probability that the pulled token is blue?
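Let $P(\text{blue})$ be the prior subjective probability that the pulled token is blue and $\hat{P}(\text{blue})$ the respective posterior subjective probability. Jeffrey conditioning, based on jup (which mandates, for example, that $\hat{P}(\text{blue} \mid \text{blue or purple}) = P(\text{blue} \mid \text{blue or purple})$), recommends
$$\hat{P}(\text{blue}) = P(\text{blue} \mid \text{blue or purple})\,\hat{P}(\text{blue or purple}) = \frac{2}{3} \cdot \frac{2}{3} = \frac{4}{9}. \tag{3}$$
In the notation of (2), the example is calculated with $\beta = (1/2, 1/3, 1/6)^{\top}$,
$$\hat{\alpha} = (1/3, 2/3)^{\top}, \quad \kappa = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \end{pmatrix}, \tag{4}$$
and yields the same result as (3) with $\hat{\beta}_2 = 4/9$.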
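The token example can be reproduced in a few lines. The sketch below assumes that the Jeffrey update takes its usual form in this notation, namely that each posterior β-entry is the prior entry rescaled by the ratio of posterior to prior probability of the coarse cell ω_i containing it; the function name `jeffrey_update` is hypothetical. Exact fractions are used so that the result can be compared directly with 4/9.

```python
from fractions import Fraction as F

def jeffrey_update(kappa, beta, alpha_hat):
    """Jeffrey conditioning for the triple (kappa, beta, alpha_hat).

    kappa[i][j] is 1 if theta_j lies inside omega_i and 0 otherwise;
    every column is assumed to contain exactly one 1.
    """
    m, n = len(kappa), len(beta)
    # Prior probability of each coarse cell omega_i.
    alpha = [sum(kappa[i][l] * beta[l] for l in range(n)) for i in range(m)]
    # Rescale each beta_j by alpha_hat_i / alpha_i for the cell containing theta_j.
    return [
        beta[j] * sum(kappa[i][j] * alpha_hat[i] / alpha[i] for i in range(m))
        for j in range(n)
    ]

# Token example: theta = (yellow, blue, purple), omega = (yellow, blue-or-purple).
kappa = [[1, 0, 0],
         [0, 1, 1]]
beta = [F(1, 2), F(1, 3), F(1, 6)]
alpha_hat = [F(1, 3), F(2, 3)]

beta_hat = jeffrey_update(kappa, beta, alpha_hat)
print(beta_hat)  # posterior probabilities for yellow, blue, purple
```

The second entry of `beta_hat` is the updated probability that the token is blue, and the ratio of blue to purple stays at 2:1, as jup requires.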
4. Wagner Conditioning
Carl Wagner uses jup (explained in more detail in [21]) to solve a problem which cannot be solved by Jeffrey conditioning. Here is the narrative (call this the Linguist problem):
You encounter the native of a certain foreign country and wonder whether he is a Catholic northerner (θ1), a Catholic southerner (θ2), a Protestant northerner (θ3), or a Protestant southerner (θ4). Your prior probability p over these possibilities (based, say, on population statistics and the judgment that it is reasonable to regard this individual as a random representative of his country) is given by p(θ1) = 0.2, p(θ2) = 0.3, p(θ3) = 0.4, and p(θ4) = 0.1. The individual now utters a phrase in his native tongue which, due to the aural similarity of the phrases in question, might be a traditional Catholic piety (ω1), an epithet uncomplimentary to Protestants (ω2), an innocuous southern regionalism (ω3), or a slang expression used throughout the country in question (ω4). After reflecting on the matter you assign subjective probabilities u(ω1) = 0.4, u(ω2) = 0.3, u(ω3) = 0.2, and u(ω4) = 0.1 to these alternatives. In the light of this new evidence how should you revise p? (See [18] (p.252) and [22] (p.197).)
Let us call a problem of this type a Wagner-type problem. It is an instance of the more general obverse Majerník problem where partitions are given with logical relationships between them as well as some marginal probabilities. Wagner-type problems seek as a solution missing marginals, while obverse Majerník problems seek the conditional probabilities as well, both of which I will eventually provide using pme.
Wagner’s solution for such problems (from now on Wagner conditioning) rests on jup and a formal apparatus established by Arthur Dempster in [23], which is quite different from our notational approach. Wagner legitimately calls his solution a “natural generalization of Jeffrey conditioning” [18] (p.250). There is, however, another natural generalization of Jeffrey conditioning, E.T. Jaynes’ principle of maximum entropy in [24]. pme does not rest on jup, but rather claims that one should keep one’s entropy maximal within the constraints that the evidence provides (in the synchronic case) and one’s cross-entropy minimal (in the diachronic case).
It is important to distinguish between type I and type II prior probabilities. The former precede any information at all (so-called ignorance priors). The latter are simply prior relative to posterior probabilities in probability kinematics. They may themselves be posterior probabilities with respect to an earlier instance of probability kinematics. Although Jaynes’ original claims are concerned with type I prior probabilities, this paper works on the assumptions of Jaynes’ later work focusing on type II prior probabilities. Some distinguish between MAXENT, the synchronic rule, and Infomin, the diachronic rule. The understanding here is that both operate on type II prior probabilities: MAXENT considers uniform prior probabilities (however this uniformity may have arisen) and a set of synchronic constraints on them; Infomin, in a more standard sense of updating, considers type II prior probabilities that are not necessarily uniform and updates them given evidence represented as new (diachronic) constraints on acceptable posterior probability distributions. Some say that MAXENT and Infomin contradict each other, but I disagree and maintain that they are compatible. I will have to defer this problem to future work, but a core argument for compatibility is already accessible in [21].
One advantage of pme is that it works on the wide domain of updating problems where the evidence corresponds to an affine constraint (for affine constraints see [25]; for problems with evidence not in the form of affine constraints see [26]). Updating problems where standard conditioning and Jeffrey conditioning are applicable are a subset of this domain. Some partial information cases (using the moment(s) of a distribution as evidence), such as Bas van Fraassen’s Judy Benjamin problem and Jaynes’ Brandeis Dice problem, are not amenable to either standard conditioning or Jeffrey conditioning. pme generalizes Jeffrey conditioning (and, a fortiori, standard conditioning) and therefore absorbs jup on the more narrow domain of problems that we can solve using Jeffrey conditioning (for a proof see the appendix, although it can also be gleaned from [27]).
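The agreement between pme and Jeffrey conditioning on this narrow domain can be illustrated with the token example from Section 3. The following sketch is a numerical check, not the proof given in the appendix: it searches a grid of candidate posteriors that satisfy the evidential constraint (yellow at 1/3, blue or purple at 2/3) and picks the one with minimal Kullback-Leibler divergence from the prior. The function name `kl` and the grid size `steps` are choices made for this illustration only.

```python
import math

def kl(q, p):
    """Kullback-Leibler divergence D(q || p) in nats; cells with q = 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Prior over (yellow, blue, purple) from the token example.
prior = [1 / 2, 1 / 3, 1 / 6]

# Admissible posteriors have the form (1/3, t, 2/3 - t) with 0 < t < 2/3.
steps = 6000  # grid resolution, chosen for illustration
candidates = [k * (2 / 3) / steps for k in range(1, steps)]

# Grid search for the admissible posterior closest to the prior in cross-entropy.
best_t = min(candidates, key=lambda t: kl([1 / 3, t, 2 / 3 - t], prior))

print(best_t)  # approximate posterior probability of blue under pme
print(4 / 9)   # value recommended by Jeffrey conditioning
```

A grid search only approximates the minimizer, but within the grid resolution it lands on the value that Jeffrey conditioning recommends.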
Wagner’s contention is that on the wider domain of problems where we must use Wagner conditioning (and which he does not cast in terms of affine constraints), jup and pme contradict each other. We are now in the awkward position of being confronted with two plausible intuitions, jup and pme, and it appears that we have to let one of them go. Wagner adduces other conceptual problems for pme (see [13,28–30], [31] (p.270), [32] (p.107)) to reinforce his conclusion that pme is not a principle on which we should rely in general.
5. A Natural Generalization of Jeffrey and Wagner Conditioning
In order to show how pme generalizes Jeffrey conditioning (in the appendix) and Wagner conditioning to boot, I use the notation that I have already introduced for Jeffrey conditioning. We can characterize Wagner-type problems analogously to Jeffrey-type problems by a triple $(\kappa, \beta, \hat{\alpha})$. The partitions $\{\theta_j\}_{j=1,\ldots,n}$ and $\{\omega_i\}_{i=1,\ldots,m}$ now refer to independent partitions of $\Omega$, i.e., (1) need not be true. Besides the marginal probabilities $P(\theta_j) = \beta_j$, $\hat{P}(\theta_j) = \hat{\beta}_j$, $P(\omega_i) = \alpha_i$, $\hat{P}(\omega_i) = \hat{\alpha}_i$, we therefore also have joint probabilities $\mu_{ij} = P(\omega_i \cap \theta_j)$ and $\hat{\mu}_{ij} = \hat{P}(\omega_i \cap \theta_j)$.
Given the specific nature of Wagner-type problems, there are a few constraints on the triple $(\kappa, \beta, \hat{\alpha})$. The last row $(\mu_{mj})_{j=1,\ldots,n}$ is special because it represents the probability of $\omega_m$, which is the negation of the events deemed possible after the observation. In the Linguist problem, for example, $\omega_5$ is the event (initially highly likely, but impossible after the observation of the native’s utterance) that the native does not make any of the four utterances. The native may have, after all, uttered a typical Buddhist phrase, asked where the nearest bathroom was, complimented your fedora, or chosen to be silent. $\kappa$ will have all 1s in the last row. Let $\hat{\kappa}_{ij} = \kappa_{ij}$ for $i = 1, \ldots, m-1$ and $j = 1, \ldots, n$; and $\hat{\kappa}_{mj} = 0$ for $j = 1, \ldots, n$. $\hat{\kappa}$ equals $\kappa$ except that its last row is all 0s, and $\hat{\alpha}_m = 0$. Otherwise the 0s are distributed over $\kappa$ (and equally over $\hat{\kappa}$) so that no row and no column has all 0s, representing the logical relationships between the $\omega_i$s and the $\theta_j$s ($\kappa_{ij} = 0$ if and only if $P(\omega_i \cap \theta_j) = \mu_{ij} = 0$). We set $P(\omega_m) = x$, where $x$ depends on the specific prior knowledge. Fortunately, the value of $x$ cancels out nicely and will play no further role. For convenience, we define
$$\zeta = (0, \ldots, 0, 1)^{\top} \tag{5}$$
with $\zeta_m = 1$ and $\zeta_i = 0$ for $i \neq m$. The best way to visualize such a problem is by providing the joint probability matrix $M = (\mu_{ij})$ together with the marginals $\alpha$ and $\beta$ in the last column/row, here for example as for the Linguist problem with $m = 5$ and $n = 4$ (note that this is not the matrix $M$, which is $m \times n$, but $M$ expanded with the marginals in improper matrix notation):
$$\begin{pmatrix} \mu_{11} & \mu_{12} & 0 & 0 & \alpha_1 \\ \mu_{21} & \mu_{22} & 0 & 0 & \alpha_2 \\ 0 & \mu_{32} & 0 & \mu_{34} & \alpha_3 \\ \mu_{41} & \mu_{42} & \mu_{43} & \mu_{44} & \alpha_4 \\ \mu_{51} & \mu_{52} & \mu_{53} & \mu_{54} & x \\ \beta_1 & \beta_2 & \beta_3 & \beta_4 & 1 \end{pmatrix} \tag{6}$$
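The $\mu_{ij} \neq 0$ where $\kappa_{ij} = 1$. Ditto, mutatis mutandis, for $\hat{M}$, $\hat{\alpha}$, $\hat{\beta}$. To make this a little less abstract, Wagner’s Linguist problem is characterized by the triple $(\kappa, \beta, \hat{\alpha})$,
$$\kappa = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix}, \quad \hat{\kappa} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \tag{7}$$
$$\beta = (0.2, 0.3, 0.4, 0.1)^{\top}, \quad \hat{\alpha} = (0.4, 0.3, 0.2, 0.1, 0)^{\top}. \tag{8}$$
Wagner’s solution, based on jup, is
$$\hat{\beta}_j = \beta_j \sum_{i=1}^{m-1} \frac{\hat{\kappa}_{ij}\,\hat{\alpha}_i}{\sum_{l=1}^{n} \hat{\kappa}_{il}\,\beta_l} \quad \text{for all } j = 1, \ldots, n. \tag{9}$$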
The posterior probability that the native encountered by the linguist is a northerner, for example, is 34%. Wagner’s notation is completely different and never specifies or provides the joint probabilities, but I hope the reader appreciates both the analogy to (2) underlined by this notation as well as its efficiency in delivering a correct pme solution for us. The solution that Wagner attributes to pme is misleading because of Wagner’s Dempsterian setup, which does not take into account that proponents of pme are likely to be proponents of the classical Bayesian position that type II prior probabilities are specified and determinate once the agent attends to the events in question. Some Bayesians in the current discussion explicitly disavow this requirement for (possibly retrospective) determinacy (especially James Joyce in [33] and other papers). Proponents of pme (a proper subset of Bayesians), however, are unlikely to follow Joyce; if they did, they would indeed have to address Wagner’s example to show that their allegiances to pme and to indeterminacy are compatible.
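The 34% figure can be recomputed from the data of the Linguist problem. The sketch below rests on two assumptions that should be read as such: first, that Wagner conditioning has the same form as Jeffrey conditioning with the compatibility matrix in place of the partition matrix (each prior entry is multiplied by the sum, over all compatible utterances, of the posterior utterance probability divided by the prior mass of the hypotheses compatible with that utterance); second, that the 0/1 pattern in `kappa_hat` is the one suggested by the narrative (the piety and the epithet are compatible only with Catholics, the regionalism only with southerners, the slang with everyone). The function name `wagner_update` is hypothetical.

```python
from fractions import Fraction as F

def wagner_update(kappa_hat, beta, alpha_hat):
    """Wagner-style update for the triple (kappa_hat, beta, alpha_hat).

    kappa_hat[i][j] is 1 if utterance omega_i is compatible with theta_j.
    Rows consisting only of 0s (events ruled out by the observation) never
    match any column and are therefore skipped automatically.
    """
    beta_hat = []
    for j, b_j in enumerate(beta):
        total = F(0)
        for row, a_hat in zip(kappa_hat, alpha_hat):
            if row[j] == 1:
                # Prior mass of all hypotheses compatible with this utterance.
                row_prior = sum(k * b for k, b in zip(row, beta))
                total += a_hat / row_prior
        beta_hat.append(b_j * total)
    return beta_hat

# Columns: Catholic north, Catholic south, Protestant north, Protestant south.
kappa_hat = [[1, 1, 0, 0],   # Catholic piety (assumed pattern)
             [1, 1, 0, 0],   # epithet uncomplimentary to Protestants (assumed pattern)
             [0, 1, 0, 1],   # southern regionalism (assumed pattern)
             [1, 1, 1, 1],   # slang used throughout the country
             [0, 0, 0, 0]]   # none of the four utterances (ruled out)
beta = [F(1, 5), F(3, 10), F(2, 5), F(1, 10)]
alpha_hat = [F(2, 5), F(3, 10), F(1, 5), F(1, 10), F(0)]

beta_hat = wagner_update(kappa_hat, beta, alpha_hat)
northerner = beta_hat[0] + beta_hat[2]
print(beta_hat, float(northerner))
```

Under these assumptions the posterior probability of a northerner comes out at 17/50, matching the 34% stated above.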
That (9) follows from jup is well-documented in Wagner’s paper. For the pme solution for this problem, I will not use (9) or jup, but maximize the entropy for the joint probability matrix $M$ and then minimize the cross-entropy between the prior probability matrix $M$ and the posterior probability matrix $\hat{M}$. The pme solution, despite its seemingly different ancestry in principle, formal method, and assumptions, agrees with (9). This completes our argument.
What follows may only be accessible to pme cognoscenti, since it involves the Lagrange multiplier method (see [12] (p.327ff) and [34] (p.244)). Others may read the conclusion and find a sketch for an easier, but much less rigorous proof in the appendix. To maximize the Shannon entropy of $M$ and minimize the Kullback-Leibler divergence between $\hat{M}$ and $M$, consider the Lagrangian functions:
and
For the optimization, we set the partial derivatives to 0, which results in
where
$r_i = e^{\zeta_i \lambda_m}$, $s_j = e^{-1-\xi_j}$,
represent factors arising from the Lagrange multiplier method ($\zeta$ was defined in (5)). The operator ∘ is the entry-wise Hadamard product in linear algebra.
r, s,
are the vectors containing the
ri, sj,
, respectively.
R, S,
are the diagonal matrices with
Ril =
riδil, Skj =
sjδkj,
(
δ is Kronecker delta).
Consequently,
(19) gives us the same solution as (9), taking into account (17). Therefore, Wagner conditioning and pme agree.