Abstract
Objective Bayesianism says that the strengths of one’s beliefs ought to be probabilities, calibrated to physical probabilities insofar as one has evidence of them, and otherwise sufficiently equivocal. These norms of belief are often explicated using the maximum entropy principle. In this paper we investigate the extent to which one can provide a unified justification of the objective Bayesian norms in the case in which the background language is a first-order predicate language, with a view to applying the resulting formalism to inductive logic. We show that the maximum entropy principle can be motivated largely in terms of minimising worst-case expected loss.
1. Introduction
Objective Bayesianism holds that the strengths of one’s beliefs should satisfy three norms [1,2]:
- Probability. The strengths of one’s beliefs should satisfy the axioms of probability: if bel is one’s belief function, which assigns a degree of belief to each sentence of one’s language, then bel ∈ ℙ, the set of probability functions defined on the sentences of one’s language.
- Calibration. The strengths of one’s beliefs should fit one’s evidence: bel ∈ 𝔼, the set of belief functions compatible with one’s evidence. In particular, the strengths of one’s beliefs should be calibrated with physical probabilities, insofar as one has evidence as to what the physical probabilities are: if one’s evidence determines just that the physical probability function P* lies in some non-empty set ℙ* of probability functions, then bel ∈ 𝔼 = ⟨ℙ*⟩, where ⟨ℙ*⟩ is the convex hull of ℙ* [3].
- Equivocation. The strengths of one’s beliefs should otherwise equivocate sufficiently between the basic possibilities that one can express: bel is some function in 𝔼 that is sufficiently equivocal. Note that entropy is often used as a measure of the extent to which a probability function equivocates.
These three norms are usually justified in rather different ways. The Probability norm is usually justified as being required if one is to avoid sure loss—the Dutch book argument. The Calibration norm needs to hold if one is to avoid loss in the long run when one repeatedly bets on similar events. It has also been argued that the Equivocation norm should hold if one is to minimise worst-case expected loss. See Williamson [1] (Chapter 3) for discussion of these justifications. Unfortunately, these justifications do not cohere particularly well, because the betting set-up and the notion of loss differ in each case—for the Probability norm, the notion of loss is sure single-case loss, where losses may be positive or negative; for the Calibration norm it is almost-sure (i.e., probability 1) long-run loss, positive or negative; for the Equivocation norm, it is worst-case expected loss, where the loss is positive and logarithmic. Furthermore, a justification for the order in which the norms are applied is missing. In particular, the justification of the Equivocation norm presumes that belief is probabilistic; for this justification to work, some argument is needed for the claim that avoiding sure loss should be prioritised over minimising worst-case expected loss; but there is as yet no such argument. The question thus arises as to whether a single, unified justification can be given for the three norms, in order to circumvent the above problems.
Landes and Williamson [4] provided a single, unified justification for the situation in which one’s beliefs are defined over propositions, construed as subsets of a finite set Ω of outcomes. It turns out that all three norms must hold if one is to minimise worst-case expected loss: one’s belief function should be a probability function in 𝔼 = ⟨ℙ*⟩ that has sufficiently high entropy. This line of argument will be described in Section 2. Landes and Williamson [4] went on to extend this unified justification to the situation in which beliefs are defined over sentences of a propositional language, formed by recursively applying the usual propositional connectives ¬, ˄, ˅, →, ↔ to a finite set of propositional variables.
In this paper we shall show that a similar justification goes through for the situation in which beliefs are defined over sentences of a first-order predicate language, with the use of predicate, constant and variable symbols as well as the quantifiers ∀, ∃. In Section 3 we shall formulate the norms of objective Bayesianism in the context of a predicate language. In Section 4 we shall provide a justification for maximising entropy when the language in question is a predicate language without quantifier symbols and when the evidence set is finitely generated. In Section 5, we shall extend this line of argument to predicate languages that contain quantifier symbols. In Section 6 we shall investigate the case of evidence which is not finitely generated. Key concepts and notation are collected in Appendix C for ease of reference.
The key technical results in this paper are Theorem 3, Theorem 6, Theorem 7, and Theorem 8. These results all suppose that the available evidence is finitely generated (in the sense of Definition 5). The first two jointly show that, on a quantifier-free predicate language, the belief function with the best loss profile is the calibrated probability function which has maximal entropy. Theorem 7 implies that adding new constant or predicate symbols to the language does not change the inferences one draws which are expressible in the original language. Theorem 8 extends Theorem 3 and Theorem 6 to predicate languages with quantifiers. En route to proving Theorem 8, we improve on Gaifman’s Unique Extension Theorem [5] (Theorem 1) in Proposition 24.
The case of evidence which cannot be finitely generated is more involved. We consider a case in which no belief function has an optimal loss profile in Proposition 28 and Proposition 30. While there are no functions with the best loss profile in that case, we show in Proposition 29 and Proposition 31 that probability functions in a neighbourhood of the calibrated function with maximal entropy have arbitrarily good loss profiles. We also discuss a case in which the belief function with best loss profile does indeed turn out to be the calibrated probability function which has maximal entropy, see Theorem 9.
2. Beliefs over Propositions
Here we will recap the relevant results of Landes and Williamson [4], to which the reader is referred for further details and motivation. In this section we will be concerned solely with a finite set Ω of possible outcomes. We shall suppose that each member ω of Ω is a state ±A1 ˄ … ˄ ±An of a finite propositional language ℒ = {A1, …, An}. A proposition F is a subset of Ω. Let Π be the set of all partitions of Ω. We take {∅, Ω}, {Ω} ∈ Π. In order to limit the proliferation of partitions, we suppose that the only partition in which ∅ occurs is {∅, Ω}.
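Given a belief function bel : 𝒫Ω → ℝ≥0 that is not zero everywhere, we normalise by dividing each degree of belief by max_{π∈Π} Σ_{F∈π} bel(F) to form a belief function, B : 𝒫Ω → [0, 1], with degrees of belief in the unit interval. The set of normalised belief functions is
$$\mathbb{B} := \Big\{B : \mathcal{P}\Omega \to [0,1] \;:\; \sum_{F\in\pi} B(F) \le 1 \text{ for all } \pi\in\Pi \text{ and } \sum_{F\in\pi} B(F) = 1 \text{ for some } \pi\in\Pi\Big\}.$$
On the other hand, the set of probability functions is
$$\mathbb{P} := \Big\{B : \mathcal{P}\Omega \to [0,1] \;:\; \sum_{F\in\pi} B(F) = 1 \text{ for all } \pi\in\Pi\Big\} \subset \mathbb{B},$$
where ⊂ denotes strict subset inclusion. The inclusion is strict since the following normalised belief function B is not in ℙ: B(∅) = 1 and B(F) = 0 for all ∅ ⊂ F ⊆ Ω. Since {Ω} is a partition we have P(Ω) = 1 and since {Ω, ∅} is a partition it holds that P(∅) = 0 for all P ∈ ℙ.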
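The normalisation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`set_partitions`, `all_partitions`, `normalise`, `is_probability`) and the two-state example are hypothetical and not part of the original presentation. The sketch enumerates Π for a small Ω (including the special partition {∅, Ω}), divides a belief function by its largest partition sum, and tests whether every partition sums to 1.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty blocks."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # `first` forms a block of its own ...
        yield [frozenset([first])] + partition
        # ... or joins one of the existing blocks.
        for i, block in enumerate(partition):
            yield partition[:i] + [block | {first}] + partition[i + 1:]


def all_partitions(omega):
    """Partitions of omega, plus the single partition {∅, Ω} containing ∅."""
    parts = [frozenset(p) for p in set_partitions(omega)]
    parts.append(frozenset([frozenset(), frozenset(omega)]))
    return parts


def normalise(bel, partitions):
    """Divide each degree of belief by the largest sum over any partition."""
    m = max(sum(bel.get(F, 0.0) for F in pi) for pi in partitions)
    return {F: value / m for F, value in bel.items()}


def is_probability(B, partitions, tol=1e-12):
    """A normalised belief function is in ℙ iff every partition sums to 1."""
    return all(abs(sum(B.get(F, 0.0) for F in pi) - 1.0) <= tol
               for pi in partitions)


# Hypothetical two-state example.
omega = ["w1", "w2"]
parts = all_partitions(omega)
empty, full = frozenset(), frozenset(omega)
bel = {empty: 0.0, frozenset(["w1"]): 3.0, frozenset(["w2"]): 1.0, full: 4.0}
B = normalise(bel, parts)

# The function from the text with B(∅) = 1 and B(F) = 0 otherwise.
counterexample = {empty: 1.0, frozenset(["w1"]): 0.0,
                  frozenset(["w2"]): 0.0, full: 0.0}
```

Here `B` passes the probability check, while `counterexample` is already normalised (its largest partition sum is 1, attained on {∅, Ω}) but fails it.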
Let L(F, B) be the loss incurred by adopting belief function B when proposition F turns out to be true. Arguably, in the absence of knowledge of the true loss function, the loss function L should be taken to be logarithmic, as we shall now see. Consider the following four conditions on a default loss function L:
- L1. L(F, B) = 0 if B(F ) = 1.
- L2. L(F, B) strictly increases as B(F) decreases from 1 towards 0.
- L3. L(F, B) depends only on B(F), not on B(F′) for F′ ≠ F.
To express the next condition we need some notation. Suppose ℒ = ℒ1 ∪ ℒ2: say that ℒ1 = {A1, …, Am} and ℒ2 = {A_{m+1}, …, An} for some 1 ≤ m < n. Then ω ∈ Ω takes the form ω1 ˄ ω2 where ω1 ∈ Ω1 is a state of ℒ1, and ω2 ∈ Ω2 is a state of ℒ2. Given propositions F1 ⊆ Ω1 and F2 ⊆ Ω2 we can define F1 × F2 := {ω = ω1 ˄ ω2 : ω1 ∈ F1, ω2 ∈ F2}, a proposition of ℒ. Given a fixed belief function B such that B(Ω) = 1, ℒ1 and ℒ2 are independent sublanguages, written ℒ1 ⫫_B ℒ2, if B(F1 × F2) = B(F1) · B(F2) for all F1 ⊆ Ω1 and F2 ⊆ Ω2, where B(F1) := B(F1 × Ω2) and B(F2) := B(Ω1 × F2). The restriction of B to ℒ1 is a belief function on ℒ1 defined by B⇂ℒ1(F1) := B(F1), and similarly for ℒ2.
- L4. Losses are additive when the language is composed of independent sublanguages: if ℒ = ℒ1 ∪ ℒ2 for ℒ1 ⫫_B ℒ2, then L(F1 × F2, B) = L1(F1, B⇂ℒ1) + L2(F2, B⇂ℒ2), where L1, L2 are loss functions defined on ℒ1, ℒ2 respectively.
Theorem 1. If a loss function L satisfies L1–4 then L(F, B) = −k log B(F) for some constant k > 0 that does not depend on the language ℒ.
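As a quick sanity check in the easy direction only (that logarithmic loss satisfies the conditions, not that it is the only loss to do so), the following Python sketch evaluates −k log B(F) on a few hypothetical degrees of belief. The function name `log_loss` and the numbers are illustrative assumptions.

```python
import math


def log_loss(belief, k=1.0):
    """Logarithmic loss -k * log B(F) for a degree of belief B(F) > 0."""
    return -k * math.log(belief)


# L1: no loss when the true proposition was fully believed.
full_belief_loss = log_loss(1.0)

# L2: loss strictly increases as B(F) decreases from 1 towards 0.
losses = [log_loss(b) for b in (0.9, 0.5, 0.1)]

# L4: for independent sublanguages B(F1 x F2) = B(F1) * B(F2),
# so the loss on the product proposition is the sum of the two losses.
b1, b2 = 0.3, 0.5
joint_loss = log_loss(b1 * b2)
sum_of_losses = log_loss(b1) + log_loss(b2)
```

L3 holds by construction, since `log_loss` takes only the single value B(F) as input.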
When we consider the notion of expected loss, we see that this concept depends on the weight given to the various partitions under consideration. Let g : Π → ℝ≥0 be a function that assigns a weight to each partition. Then the g-expected loss or g-score of a belief function B ∈ 𝔹 with respect to a probability function P ∈ ℙ is defined by
$$S_g^L(P,B) := \sum_{\pi\in\Pi} g(\pi) \sum_{F\in\pi} P(F)\, L(F,B)$$
for any weighting function g that is inclusive in the sense that for any proposition F, some partition π containing F is given positive weight. We adopt the usual convention that 0 log 0 = 0. This ensures that the score is well-defined. Theorem 1 allows us to focus attention on logarithmic g-score,
$$S_g(P,B) := -\sum_{\pi\in\Pi} g(\pi) \sum_{F\in\pi} P(F) \log B(F).$$
An important property of a scoring rule is that
$$\arg\inf_{B\in\mathbb{B}} S_g(P,B) = \{P\}$$
for all P ∈ ℙ. That is, for fixed P ∈ ℙ, S_g(P, ·) is uniquely minimised by B = P. This property is known as strict propriety.
Proposition 1 (Strict Propriety). Sg is strictly proper.
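Strict propriety can be illustrated numerically on a two-state Ω. The sketch below is a grid search, not a proof, and rests on assumptions not in the original: every partition ({{ω1}, {ω2}}, {Ω} and {∅, Ω}) gets weight 1, B(∅) is fixed at 0, and candidate belief functions are restricted to a grid of step 0.05. The names `xlogy` and `g_score` are hypothetical.

```python
import math


def xlogy(x, y):
    """x * log(y), with the convention 0 log 0 = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def g_score(p, b0, b1, b_omega):
    """Logarithmic score of B against P = (p, 1 - p), weight 1 per partition.

    b0, b1 are the degrees of belief in the two states and b_omega is B(Ω).
    B(∅) is assumed to be 0, so the ∅ term of {∅, Ω} vanishes (P(∅) = 0).
    """
    state_partition = -(xlogy(p, b0) + xlogy(1 - p, b1))
    omega_partitions = -2 * xlogy(1.0, b_omega)  # from {Ω} and {∅, Ω}
    return state_partition + omega_partitions


p = 0.3
grid = range(1, 21)
# Candidates respect the partition-sum bound b0 + b1 <= 1 and b_omega <= 1.
candidates = [(i / 20, j / 20, k / 20)
              for i in grid for j in grid for k in grid if i + j <= 20]
best = min(candidates, key=lambda b: g_score(p, *b))
```

On this grid the minimiser `best` coincides with P itself, i.e., degrees of belief 0.3 and 0.7 in the two states and 1 in Ω.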
By analogy with the generalised notion of scoring rule, we get a similar generalisation of entropy, g-entropy:
$$H_g(P) := S_g(P,P) = -\sum_{\pi\in\Pi} g(\pi) \sum_{F\in\pi} P(F) \log P(F).$$
The standard entropy function corresponds to the special case in which g = gΩ, the (non-inclusive) weighting function that gives weight 1 to the partition {{ω} : ω ∈ Ω} of states and weight 0 to all other partitions.
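To make the relation between g-entropy and standard entropy concrete, here is a small Python sketch on a four-state Ω (two propositional variables). The helper names and the choice of the uniform distribution are assumptions made for illustration. With the standard weighting only the partition of states contributes, so the result is Shannon entropy; with weight 1 on every partition, coarser partitions contribute as well.

```python
import math


def set_partitions(items):
    """Yield every partition of `items` into non-empty blocks."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        yield [frozenset([first])] + partition
        for i, block in enumerate(partition):
            yield partition[:i] + [block | {first}] + partition[i + 1:]


def g_entropy(P, partitions, g):
    """-sum_pi g(pi) sum_{F in pi} P(F) log P(F), with 0 log 0 = 0."""
    total = 0.0
    for pi in partitions:
        weight = g(pi)
        if weight == 0:
            continue
        for F in pi:
            p_F = sum(P[state] for state in F)
            if p_F > 0:
                total -= weight * p_F * math.log(p_F)
    return total


omega = ["00", "01", "10", "11"]
partitions = [frozenset(p) for p in set_partitions(omega)]
partitions.append(frozenset([frozenset(), frozenset(omega)]))
state_partition = frozenset(frozenset([w]) for w in omega)


def standard_weighting(pi):
    return 1.0 if pi == state_partition else 0.0


def partition_weighting(pi):
    return 1.0


uniform = {w: 0.25 for w in omega}
point_mass = {"00": 1.0, "01": 0.0, "10": 0.0, "11": 0.0}

h_standard = g_entropy(uniform, partitions, standard_weighting)
h_partition = g_entropy(uniform, partitions, partition_weighting)
```

For the uniform distribution `h_standard` equals log 4, and `h_partition` is strictly larger because every non-trivial partition adds a non-negative term.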
It turns out that, if there is such a function, the probability function that minimises worst-case g-score, where the worst case is taken over physical probability functions in the set 𝔼, is the probability function in 𝔼 that has maximum g-entropy:
Theorem 2. As noted above, 𝔼 is taken to be convex and g inclusive. There is a unique member of arg sup_{P∈𝔼} H_g(P), which we shall denote by P†. Furthermore,
$$\arg\inf_{B\in\mathbb{B}} \sup_{P\in\mathbb{E}} S_g(P,B) = \{P^{\dagger}\}.$$
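The minimax idea behind Theorem 2 can be illustrated on a deliberately simplified case. The sketch below makes several assumptions that are not in the original: Ω has two states, the evidence set is {P : P(ω1) ∈ [0.6, 0.9]} (discretised to steps of 0.01), only the partition of states is scored (the standard weighting, which is not inclusive), and candidate belief functions are restricted to probability functions. It is a grid search rather than a proof.

```python
import math


def score(p, b):
    """Logarithmic score of B = (b, 1 - b) against P = (p, 1 - p)."""
    return -(p * math.log(b) + (1 - p) * math.log(1 - b))


def entropy(p):
    """Shannon entropy of (p, 1 - p), i.e. the score of P against itself."""
    return score(p, p)


# Hypothetical evidence set: P(w1) in [0.6, 0.9], on a grid of step 0.01.
evidence = [i / 100 for i in range(60, 91)]
# Candidate belief functions: probability functions with b in (0, 1).
candidates = [i / 100 for i in range(1, 100)]


def worst_case(b):
    """Worst-case expected loss of believing b, over the evidence set."""
    return max(score(p, b) for p in evidence)


maxent = max(evidence, key=entropy)
minimax = min(candidates, key=worst_case)
```

In this toy case the entropy maximiser in the evidence set and the minimiser of worst-case score are the same function, with P(ω1) = 0.6, the calibrated value closest to equivocation.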
Throughout this paper we use arg sup_{P∈𝔼} (and arg inf_{P∈𝔼}) to refer to the points in the closure [𝔼] of 𝔼 that achieve the supremum (respectively infimum) whether or not these points are in 𝔼. (This convention shall also apply mutatis mutandis to suprema and infima over sets of belief functions defined on predicate languages later in this paper.)
The above theorem concerns the minimisation of worst-case g-score. If one replaces the minimisation of worst-case g-score by a more fine-grained criterion (which breaks ties between belief functions with the same worst-case g-score), then an analogue of the above theorem holds: there exists a unique belief function which is best with respect to this criterion and this function is P†, which maximises g-entropy in [𝔼]. When we move to predicate languages we will consider such a refinement in Definition 21.
3. Beliefs over Sentences of a Predicate Language
3.1. Norms
In this section we introduce the norms of objective Bayesianism as they apply to strength of belief in sentences formulated in a predicate language. This framework is presented in more detail in Williamson [1] (Chapter 5). It is this set of norms that we seek to justify in terms of the loss that a belief function exposes one to.
We shall take ℒ to be a first-order predicate language with finitely many relation symbols U1, …, Us, countably many constant symbols t1, t2, …, but no function or equality symbols. We will consider languages with and without the existential quantifier symbol, using the notation ℒ^∃ and ℒ^∄ respectively to disambiguate where needed. We shall assume, as is usual in this setting, that each individual in the domain of discourse is picked out by some constant symbol. The sentences Sℒ of ℒ are formed by recursively applying the usual connectives and the existential quantifier, if present. In ℒ^∃, universally quantified sentences may be defined in terms of existentially quantified sentences as usual via ∀xθ(x) := ¬∃x¬θ(x). Note that Sℒ^∄ coincides with the set of quantifier-free sentences of ℒ^∃. We shall also be interested in the finite sublanguages ℒn, for n ≥ 1, which are identical to ℒ^∄ except that they have only finitely many constant symbols t1, …, tn.
We shall list the atomic sentences of ℒ, i.e., sentences of the form Ut where U is a relation symbol and t is a tuple of constant symbols of the corresponding arity, as A1, A2, …, ordered in such a way that atomic sentences that can be expressed in ℒ_{n+1} but not in ℒn occur after the atomic sentences A1, …, Arn of ℒn, for each n. Ωn will denote the set of n-states, i.e., sentences of the form ±A1 ˄ … ˄ ±Arn. We shall use Greek letters, such as θ, to denote sentences of ℒ, and Roman letters, e.g., F, to denote propositions expressed by such sentences. We shall construe propositions as sets of n-states, F ⊆ Ωn for some n (see Section 2).
The norms of objective Bayesianism can then be explicated thus:
Probability. The strengths of one’s beliefs should be representable by a probability function, i.e., a function P : Sℒ → ℝ≥0 that satisfies the properties:
- P1. P (τ) = 1 for all tautologies τ.
- P2. If ⊨¬(φ ˄ ψ) then P (φ ˅ ψ) = P (φ) + P (ψ).
- P3. P (∃xθ(x)) = sup_m P (θ(t1) ˅ … ˅ θ(tm)).
(Clearly P3 is only applicable in the case ℒ = ℒ^∃.)
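A small numerical illustration of the supremum condition on quantified sentences, under assumptions that are not in the original: a single unary predicate U, and a probability function that treats U t1, U t2, … as independent, each with the same hypothetical probability q. The disjunction U t1 ˅ … ˅ U tm then has probability 1 − (1 − q)^m, and the value assigned to ∃xUx is the supremum of this increasing sequence. The function names are illustrative, and the supremum is only approximated by truncating at `max_m`.

```python
def prob_disjunction(q, m):
    """P(U t1 v ... v U tm) when the U ti are independent with chance q."""
    return 1.0 - (1.0 - q) ** m


def prob_exists(q, max_m=200):
    """Approximate sup_m P(U t1 v ... v U tm) by truncating at max_m."""
    return max(prob_disjunction(q, m) for m in range(1, max_m + 1))


# The finite disjunctions form a (weakly) increasing sequence.
sequence = [prob_disjunction(0.5, m) for m in range(1, 11)]
```

For any q > 0 the sequence approaches 1, so such a probability function must give ∃xUx probability 1; for q = 0 every disjunction, and hence the existential sentence, gets probability 0.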
Calibration. One’s degrees of belief should satisfy constraints imposed by one’s evidence. Assuming all evidence is evidence of physical probabilities, P should lie in the set 𝔼 = ⟨ℙ*⟩, the convex hull of the set of epistemically possible physical probability functions.
Equivocation. One’s degrees of belief should otherwise be sufficiently equivocal. Again, one can explicate this by saying that one’s belief function should have sufficiently high entropy. Here P has higher entropy than Q if there is some N such that for all n ≥ N, H_{Ω,n}(P) > H_{Ω,n}(Q), where H_{Ω,n} is standard entropy on ℒn,
$$H_{\Omega,n}(P) := -\sum_{\omega\in\Omega_n} P(\omega)\log P(\omega).$$
The key question we attempt to answer here is: can these norms be given a unified justification in terms of avoiding avoidable loss?
3.2. Belief and Probability
A (non-normalised) belief function bel : Sℒ → ℝ≥0 is a function that maps any sentence of the language to a non-negative real number. For technical convenience we shall focus our attention on normalised belief functions, which are defined below.
A (countable) set of mutually exclusive sentences π ⊆ Sℒ is called exhaustive if, for all interpretations 𝔄 under which the constants exhaust the universe of 𝔄, there exists a sentence θ ∈ π such that 𝔄 ⊨ θ. This means that it is not possible for all θ ∈ π to be false at the same time. Such an exhaustive set of mutually exclusive sentences is a partition. In order to control the number of partitions, we shall assume that the only partitions in which contradictions κ occur are the partitions of the form {τ, κ}, for some tautology τ. Let Πℒ denote the set of partitions of ℒ.
Example 1 (Infinite partitions). Even though ℒ^∃ does not contain a symbol for equality and every element of a partition is a sentence of ℒ^∃, which is of finite length, infinite partitions such as the following do exist:
{∀x¬U1x, U1t1, ¬U1t1 ˄ U1t2, ¬U1t1 ˄ ¬U1t2 ˄ U1t3, …}.
(Here it is presupposed that ℒ^∃ contains a unary predicate symbol U1.) On the other hand, it turns out that there are no infinite partitions in ℒ^∄ [6] (§2.5).
We take it that it is a matter of convention on which scale beliefs are measured. For convenience, we want to normalise this scale to the unit interval, [0, 1], so that all belief functions are considered on the same scale.
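Definition 1 (Normalised belief function). Let bel : Sℒ → ℝ≥0 and M := sup_{π∈Πℒ} Σ_{φ∈π} bel(φ). Then define the normalisation of bel as B := bel/M, if M > 0. For a function f assigning every φ ∈ Sℒ the same value v ∈ ℝ≥0 we write f ≡ v. We shall consider bel ≡ 0 as normalised. The set of normalised belief functions on ℒ then is
$$\mathbb{B} := \Big\{B : S\mathcal{L} \to [0,1] \;:\; \sup_{\pi\in\Pi_{\mathcal{L}}} \sum_{\varphi\in\pi} B(\varphi) = 1 \ \text{ or } \ B \equiv 0\Big\}.$$
For the normalisation of bel, B, it holds that B ≡ 0, if and only if M = +∞ or bel ≡ 0.
We will be particularly interested in the following subset of functions:
$$\mathbb{P} := \Big\{P : S\mathcal{L} \to [0,1] \;:\; \sum_{\varphi\in\pi} P(\varphi) = 1 \text{ for all } \pi\in\Pi_{\mathcal{L}}\Big\}.$$
These are the probability functions: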
Proposition 2. P ∈ ℙ, if and only if P : Sℒ → [0, 1] satisfies the axioms of probability:
- P1. P (τ) = 1 for all tautologies τ.
- P2. If ⊨ ¬(φ ˄ ψ) then P (φ ˅ ψ) = P (φ) + P (ψ).
- P3. P (∃xθ(x)) = sup_m P (θ(t1) ˅ … ˅ θ(tm)).
Proof. First we shall see that P ∈ ℙ satisfies the axioms of probability.
- P1. For any tautology τ ∈ Sℒ it holds that P (τ) = 1 because {τ} is a partition in Πℒ. P (κ) = 0 for all contradictions κ because {τ, κ} is a partition in Πℒ and P (τ) = 1.
- P2. Suppose that φ, ψ ∈ Sℒ are such that ⊨ ¬(φ ˄ ψ). We shall proceed by cases to show that P (φ ˅ ψ) = P (φ) + P (ψ). In the first three cases one of the sentences is a contradiction, in the last two cases there are no contradictions.
- If ⊨ φ and ⊨ ¬ψ, then ⊨ φ ˅ ψ. Thus by the above P (φ) = 1 and P (ψ) = 0 and hence P (φ ˅ ψ) = 1 = P (φ) + P (ψ).
- If ⊨ ¬φ and ⊨ ¬ψ, then ⊨ ¬φ ˄ ¬ψ. Thus P (φ ˅ ψ) = 0 = P (φ) + P (ψ).
- If ⊭ ¬φ, ⊭ φ, and ⊨ ¬ψ, then {φ ˅ ψ, ¬φ ˅ ψ} and {φ, ¬φ ˅ ψ} are both partitions in Πℒ. Thus P (φ ˅ ψ) + P (¬φ ˅ ψ) = 1 = P (φ) + P (¬φ ˅ ψ). Putting these observations together we now find P (φ ˅ ψ) = P (φ) = P (φ) + P (ψ).
- If ⊭ ¬φ, ⊭ ¬ψ and ⊨ φ ↔ ¬ψ, then {φ, ψ} is a partition and φ ˅ ψ is a tautology. Hence, P (φ) + P (ψ) = 1 and P (φ ˅ ψ) = 1. This now yields P (φ) + P (ψ) = P (φ ˅ ψ).
- If ⊭ ¬φ, ⊭ ¬ψ and ⊭ φ ↔ ¬ψ, then none of the following sentences is a tautology or a contradiction: φ, ψ, φ ˅ ψ, ¬(φ ˅ ψ). Since {φ, ψ, ¬(φ ˅ ψ)} and {φ ˅ ψ, ¬(φ ˅ ψ)} are both partitions in Πℒ we obtain P (φ) + P (ψ) = 1 − P (¬(φ ˅ ψ)) = P (φ ˅ ψ). So P (φ) + P (ψ) = P (φ ˅ ψ).
- P3. For the rest of this proof we only have to consider ℒ = ℒ^∃.
If ⊨ ∃xθ(x), then P (∃xθ(x)) = 1.
Furthermore, the set {θn : n ∈ ℕ} with
is exhaustive. Note that
. P1 and P2 are well-known to imply that logically equivalent sentences are assigned the same probability; see [7] (Proposition 2.1.c). Hence,
.
The θi are mutually exclusive. We obtain from P2 that
. Next, define a set Θ := {θn : θn satisfiable} which consists of exhaustive, satisfiable and mutually exclusive sentences. Hence Θ is a partition in
. We finally obtain
P1 and P2 are also well-known to imply that if ⊨ χ → ψ then P (χ) ≤ P (ψ), see [7] (Proposition 2.1.c). Since
we obtain
.
is a (not necessarily strictly) increasing sequence. Then
The second equality holds also when
.
If neither ⊨ ∃xθ(x) nor ⊨ ¬∃xθ(x), then {∀x¬θ(x),∃xθ(x)} is a partition. We consider two cases.
In the first case the set
is not a partition.
For example, this set fails to be a partition for θ(x) = ¬Ut2 ˄ Ux: the sentence θ(t2) ˄ ¬θ(t1) = ¬Ut2˄Ut2˄¬(¬Ut2˄Ut1) is a contradiction and hence it cannot be contained in a partition π consisting of infinitely many sentences.
cannot be a contradiction since ¬∃xθ(x) is satisfiable and
. If
is a tautology, then all θn with n ≤ m are contradictions. Hence, for all m ∈ ℕ the set
is a partition, as is
. Furthermore, {∀x¬θ(x)} ∪ {θn : θn is satisfiable} is a partition.
Recalling that P(κ) = 0 for all contradictions κ we obtain
and
It remains to show that
This follows as we saw above in (3).
In the second case the set {∀x¬θ(x), θ(t1), θ(t2) ˄ ¬θ(t1), …, θ(tk) ˄ ¬ θ(tj),…} is a partition. Recall that {∀x¬θ(x), ∃xθ(x)} is also a partition. We obtain as in the first case that
For the converse, note that P1–3 imply that P is a probability measure on
, and so additive over countable partitions (§2 in [8]; §2.5 in [6]). Hence
. □
Another key feature of probability functions is that they respect logical equivalence:
Definition 2 (Respecting logical equivalence). For a sublanguage ℒ′ of ℒ we say that a function f : Sℒ → ℝ respects logical equivalence on ℒ′, if and only if for all φ, ψ ∈ Sℒ′ with ⊨ φ ↔ ψ it holds that f(φ) = f(ψ). For ℒ′ = ℒ we simply say that f respects logical equivalence.
Proposition 3. The probability functions respect logical equivalence.
Proof. Suppose P ∈ ℙ and assume that φ, ψ ∈ Sℒ are logically equivalent. Observe that {φ, ¬φ} and {ψ, ¬φ} are partitions in Πℒ. Hence,
P (φ) + P (¬φ) = 1 = P (ψ) + P (¬φ).
But then P (φ) = P (ψ).
Thus, the P ∈ ℙ assign logically equivalent sentences the same probability. □
If a belief function B : Sℒ → [0, 1] respects logical equivalence, it gives sentences which express the same proposition the same degree of belief. Hence, for any n ∈ ℕ, B induces a function °B defined over the propositions F ⊆ Ωn (cf. Section 2). °B is defined by:
$${}^{\circ}B(F) := B\Big(\bigvee_{\omega\in F} \omega\Big).$$
We will use the notation °nB to avoid ambiguity in cases where n varies.
The notion of a dominated belief function will prove useful in what follows:
Definition 3 (Dominated belief function). B ∈ 𝔹 \ ℙ is dominated by a probability function P ∈ ℙ, if and only if for all φ ∈ Sℒ it holds that B(φ) ≤ P (φ).
Note that if B is dominated by P, then B ≠ P, and thus B(φ) < P (φ) has to hold at least for one sentence φ.
Proposition 4. There exist B ∈ 𝔹 \ ℙ which are not dominated.
Proof. Let U be a relation symbol in ℒ of arity a ≥ 1, say. Let Ut1t be a well-formed formula of ℒ, i.e., t is an (a − 1)-tuple consisting only of t1 and t2. Let O4 := {Ut1t ˄ Ut2t, Ut1t ˄ ¬Ut2t, ¬Ut1t ˄ Ut2t, ¬Ut1t ˄ ¬Ut2t}.
Let
be such that
Clearly,
. We now show that there does not exist a P ∈ ℙ such that B(φ) ≤ P (φ) for all φ ∈ Sℒ.
Note that
and that for all
it holds that
Note for later reference that for all n ≥ 3 and ω ∈ O4, {¬ω} ∪ {ν ∈ Ωn : ν⊨ ω} is a partition. So,
has to hold. Hence,
.
Thus far we have considered partitions of sentences. We shall also need to consider partitions of propositions:
Definition 4 (Partitions of propositions). Let Πn be the set of partitions on Ωn. As in Section 2, we take {Ωn} and {Ωn, ∅} to be partitions and we suppose that there is no further partition containing ∅.
We then define the set of partitions: Π := ⋃_{n∈ℕ} Πn.
We use πn to denote the partition of n-states {{ω} : ω ∈ Ωn}.
Note that F1 := {ω ∈ Ω1 : ω ⊨ U1t1} and F2 := {ω ∈ Ω2 : ω ⊨ U1t1} are different propositions, where U1 is a unary predicate symbol. F1 is a member of {F1, Ω1 \ F1} ∈ Π1 and F2 is a member of {F2, Ω2 \ F2} ∈ Π2, but not vice versa. So {F1, Ω1 \ F1} and {F2, Ω2 \ F2} are different partitions, even if these partitions are intuitively equivalent.
3.3. Application to Inductive Logic
We shall be particularly interested in the use of objective Bayesianism over predicate languages to provide semantics for inductive logic.
Inductive logic typically seeks to answer questions of the following form [9] (§1.1):
$$\varphi_1^{X_1}, \ldots, \varphi_k^{X_k} \mathrel{|\!\approx} \psi^{?}$$
This asks, if premiss sentences φ1, …, φk of ℒ have probabilities in sets X1, …, Xk ⊆ [0, 1] respectively, which probability or set of probabilities should attach to the conclusion sentence ψ?
The answer to this question will depend on the semantics given to the inductive entailment relation |≈ [9] (Part I). One natural option is to give the entailment relation objective Bayesian semantics, denoted by |≈°. Here the premisses are construed as statements about chance, i.e., P*(φ1) ∈ X1, …, P*(φk) ∈ Xk, and the question concerns rational belief: if one’s total evidence is captured by the premisses, to what extent should one believe the conclusion sentence ψ? Applying the norms of objective Bayesianism,
$$\varphi_1^{X_1}, \ldots, \varphi_k^{X_k} \mathrel{|\!\approx^{\circ}} \psi^{Y}$$
holds just in case P (ψ) ∈ Y for every P ∈ 𝔼 that has maximal entropy, where
$$\mathbb{E} = \langle\mathbb{P}^*\rangle, \qquad \mathbb{P}^* = \{P \in \mathbb{P} : P(\varphi_1) \in X_1, \ldots, P(\varphi_k) \in X_k\}.$$
This application of objective Bayesian epistemology to inductive logic is an example in which 𝔼 is generated by constraints involving only sentences of some finite sublanguage ℒn. We will be particularly interested in the case where φ1, …, φk are quantifier-free sentences, i.e., sentences of ℒn for some n.
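A minimal worked instance of this kind of inference can be sketched in Python. The setting is an assumption chosen for illustration: one unary predicate U, the finite sublanguage with constants t1 and t2, and a single point-valued premiss stating that U t1 has chance 0.8. The sketch relies on a standard fact about Shannon entropy: under the single constraint P(φ) = x, the maximiser spreads x evenly over the states satisfying φ and 1 − x evenly over the remaining states. The names `maxent_single_premise` and `prob` are hypothetical.

```python
import math
from itertools import product

ATOMS = ["Ut1", "Ut2"]
# The states of the finite sublanguage: every truth assignment to the atoms.
STATES = [dict(zip(ATOMS, values))
          for values in product([True, False], repeat=len(ATOMS))]


def maxent_single_premise(premise, x):
    """Maximum-entropy state probabilities subject to P(premise) = x.

    Assumes the premise is neither a tautology nor a contradiction.
    """
    satisfied = [premise(state) for state in STATES]
    n_true = sum(satisfied)
    n_false = len(STATES) - n_true
    return [x / n_true if sat else (1 - x) / n_false for sat in satisfied]


def prob(P, sentence):
    """Probability of a sentence: the sum over the states that satisfy it."""
    return sum(p for p, state in zip(P, STATES) if sentence(state))


# Premiss: U t1 has chance 0.8.
P = maxent_single_premise(lambda s: s["Ut1"], 0.8)

p_ut2 = prob(P, lambda s: s["Ut2"])                 # conclusion U t2
p_both = prob(P, lambda s: s["Ut1"] and s["Ut2"])   # conclusion U t1 ˄ U t2
p_either = prob(P, lambda s: s["Ut1"] or s["Ut2"])  # conclusion U t1 ˅ U t2
```

In this toy case the premiss about t1 leaves the degree of belief in U t2 at 1/2, while U t1 ˄ U t2 receives 0.4 and U t1 ˅ U t2 receives 0.9 (up to floating-point rounding).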
Let ℙn be the set of probability functions on ℒn, and let
$$\mathbb{E}_n := \{P{\restriction}_n : P \in \mathbb{E}\},$$
where P⇂n is the restriction of P to ℒn. Note that,
for all
.
To ease the reading we also let
.
Definition 5 (Finitely generated evidence set). 𝔼 is finitely generated if it takes the form 𝔼 = {P ∈ ℙ : P⇂n ∈ 𝔼n} for some n ∈ ℕ, where 𝔼n ⊆ ℙn. Thus, 𝔼 is generated by constraints involving only some ℒn and no other sentences.
From now on, for finitely generated 𝔼, the letter K is used to denote the smallest number n such that 𝔼 is generated by constraints on ℒn.
Note that an evidence set
which is not finitely generated may not be recapturable from
. For instance, for
the following two facts hold simultaneously:
- for all n ∈ N.
4. Quantifier-Free Languages
We would like to develop an analogue of Theorem 2 for beliefs defined over the sentences of a predicate language: we would like to show that belief functions which minimise worst-case expected loss are probability functions in 𝔼 that maximise entropy. The main difficulty in moving from the finite domain of propositions to countably many sentences of a predicate language is to ensure that worst-case expected loss is finite where possible, so that these losses can be compared and a belief function can be chosen that minimises worst-case expected loss. For this reason we proceed in two steps. First, in this section, we shall consider the case in which the predicate language has no quantifier symbol, i.e., ℒ = ℒ^∄; comparing worst-case expected loss is more straightforward in this case. Then, in Section 5, we shall examine how far our approach can be extended to handle predicate languages with quantifiers.
First, in Section 4.1 we define the notion of a weighting function. This allows us to define and analyse the concept of entropy of a probability function on ℒ in Section 4.2. In Section 4.3 we introduce the idea of the loss profile of a belief function. Finally in Section 4.4 we show that, in various natural scenarios, the belief function that has the best loss profile is the probability function, from all those calibrated with evidence, that has maximal standard entropy.
4.1. Weighting Functions
Definition 6 (Weighting function). A weighting function on ℒn, g_n : Πn → ℝ≥0, maps partitions π ∈ Πn to non-negative real numbers. A weighting function on ℒ, g^ℒ : Π → ℝ≥0, is defined over partitions of propositions of all finite sublanguages. A weighting function on ℒ can be thought of as a family of weighting functions g_n on ℒn, where n ranges over the natural numbers. Given a fixed weighting function g^ℒ on ℒ, we shall take g_n := g^ℒ⇂Πn for each n ∈ ℕ. A (general) weighting function g is taken to be defined over each predicate language. Different languages, ℒ, ℒ′, have different sets of relation symbols.
A weighting function g is atomic if for each ℒ and each n, g^ℒ_n depends only on the number of atomic propositions in ℒn, not on the structure of those atomic propositions. Thus if ℒ and ℒ′ are such that ℒn and ℒ′m have the same number of atomic propositions, then g^ℒ_n = g^{ℒ′}_m. In this paper we shall suppose that all weighting functions are atomic; hence there will be no need to superscript a weighting function on ℒ or ℒn by the particular language ℒ.
We call g inclusive, if and only if it attaches positive weight to each proposition, i.e., if and only if for all n and all F ⊆ Ωn it holds that
$$\sum_{\pi\in\Pi_n,\; F\in\pi} g(\pi) > 0.$$
As in Section 2, g is symmetric if for each n it is invariant under permutations of the states of ℒn. It is refined if for each n it gives no less weight to a refinement π′ ∈ Πn of a partition π ∈ Πn than to π itself. For example, the partition weighting gΠ gives weight 1 to each partition, gΠ(π) = 1 for all π ∈ Π. The proposition weighting gives weight 1 to each partition of size 2 and weight 0 to all other partitions; this amounts to giving weight 1 to each proposition. The standard weighting gΩ gives weight 1 to the partition πn of n-states, for each n, and weight 0 to all other partitions. These weighting functions are all symmetric. The partition and proposition weightings are inclusive, but the standard weighting is not. The partition and standard weightings are refined, but the proposition weighting is not.
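The inclusiveness claims for the three example weightings can be checked mechanically on a single small Ωn. The Python sketch below uses a four-state set of n-states and hypothetical helper names; it tests only inclusiveness on this one sublanguage, not symmetry or refinement, and is not a general proof.

```python
from itertools import combinations


def set_partitions(items):
    """Yield every partition of `items` into non-empty blocks."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        yield [frozenset([first])] + partition
        for i, block in enumerate(partition):
            yield partition[:i] + [block | {first}] + partition[i + 1:]


omega = ["00", "01", "10", "11"]  # a hypothetical set of n-states
partitions = [frozenset(p) for p in set_partitions(omega)]
partitions.append(frozenset([frozenset(), frozenset(omega)]))  # {∅, Ω}
propositions = [frozenset(c) for r in range(len(omega) + 1)
                for c in combinations(omega, r)]
state_partition = frozenset(frozenset([w]) for w in omega)


def partition_weighting(pi):
    return 1.0  # weight 1 for every partition


def proposition_weighting(pi):
    return 1.0 if len(pi) == 2 else 0.0  # weight 1 for partitions of size 2


def standard_weighting(pi):
    return 1.0 if pi == state_partition else 0.0  # only the state partition


def is_inclusive(g):
    """Every proposition lies in some partition with positive weight."""
    return all(any(g(pi) > 0 for pi in partitions if F in pi)
               for F in propositions)
```

On this example `is_inclusive` holds for the partition and proposition weightings and fails for the standard weighting, which gives positive weight only to partitions containing single states.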
Definition 7 (Strongly refined weighting function). g is strongly refined if and only if it satisfies the following properties:
- g is refined: in each finite sublanguage, if partition π′ is a refinement of partition π, then g(π′) ≥ g(π).
- Each finite sublanguage receives the same total weight: for all n, is constant.
- A state partition on a richer language should not receive less weight than one on a less rich language: if m < n then g(πm) ≤ g(πn).
- Non-state-partitions receive finite total weight: the following limit exists (i.e., is finite),
Throughout this paper we will be particularly interested in the following weighting functions:
Definition 8 (Regular weighting function). g is regular if it is atomic, inclusive, symmetric and strongly refined.
4.2. Entropy
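Definition 9 (n-entropy). Given a weighting function g and n ∈ ℕ, we define the n-entropy H_{g,n}(P) of a probability function P ∈ ℙ by:
$$H_{g,n}(P) := -\sum_{\pi\in\Pi_n} g(\pi) \sum_{F\in\pi} {}^{\circ}P(F) \log {}^{\circ}P(F).$$
Recall that, for a probability function P (or indeed any belief function that respects logical equivalence) defined on sentences, °P is the function induced by P over the domain of propositions. Note that by our convention, −0 log 0 = 0 = −1 log 1. Thus, for all n ∈ ℕ,
$$H_{g,n}(P) = -\sum_{\pi\in\Pi_n,\; \Omega_n\notin\pi} g(\pi) \sum_{F\in\pi} {}^{\circ}P(F) \log {}^{\circ}P(F).$$
In calculating n-entropy we may thus ignore all partitions which contain Ωn.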
Definition 10 (Standard entropy). For the standard weighting gΩ we denote the corresponding n-entropy by H_{Ω,n}. We refer to H_{Ω,n} as standard entropy (on ℒn). H_{Ω,n} is the well-known Shannon entropy of the n-states of P:
$$H_{\Omega,n}(P) = -\sum_{\omega\in\Omega_n} P(\omega) \log P(\omega).$$
For a fixed weighting function g, we say that P ∈ ℙ has greater entropy than Q ∈ ℙ, written P ≫ Q, if the n-entropy of P eventually dominates that of Q, i.e., if there is some N ∈ ℕ such that for all n ≥ N, H_{g,n}(P) > H_{g,n}(Q).
This relation ≫ for comparing entropy is preferable to an alternative notion posed in terms of the limiting behaviour of the n-entropy of P and Q, which says that P has greater entropy than Q just when lim_{n→∞} H_{g,n}(P) > lim_{n→∞} H_{g,n}(Q). This is because the limiting behaviour is not fine-grained enough to distinguish greater from lesser entropy: n-entropy will often tend to infinity for both P and Q, and, even where the limiting n-entropy of P and Q are both finite, these limits may be equal even though the entropy of P is intuitively greater than that of Q, insofar as the n-entropy of P eventually dominates that of Q. See Williamson [1] (§5.5) for further discussion of these comparative notions of entropy.
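The point about limits can be illustrated with a short Python sketch. The setting is an assumption made for illustration: a single unary predicate U, so that the n-states are the truth assignments to U t1, …, U tn, and two probability functions that treat these atomic sentences as independent with chance 0.5 (the equivocator) and 0.7 respectively. The function name `standard_n_entropy` is hypothetical, and the entropy is computed by brute-force enumeration of the n-states.

```python
import math
from itertools import product


def standard_n_entropy(q, n):
    """Shannon entropy over n-states when U t1, ..., U tn are independent,
    each true with probability q."""
    total = 0.0
    for state in product([True, False], repeat=n):
        p = math.prod(q if value else 1 - q for value in state)
        if p > 0:
            total -= p * math.log(p)
    return total


ns = range(1, 9)
equivocator = [standard_n_entropy(0.5, n) for n in ns]
biased = [standard_n_entropy(0.7, n) for n in ns]
```

Both sequences grow without bound as n increases, so comparing limits cannot separate the two functions, yet the first exceeds the second at every n computed here, which is the sense in which the n-entropy of one function eventually dominates that of the other.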
We will be particularly interested in the probability functions in [
] with maximal entropy:
We shall also consider entropy maximisers on finite sublanguages. We shall use the notation:
(The members of this set are defined only on the sentences of
, not on the sentences of the language
as a whole.) Note that for convex
,
is convex for all n ∈ ℕ and that
is a strictly concave function on
for inclusive g. If g is inclusive, then
is strictly concave on
. Hence
contains a unique element, which we will denote by
.
Let us consider the set of limit points of the entropy maximisers on finite sublanguages:
Definition 11 (Entropy limit). A probability function is a limit point of the entropy maximisers on finite sublanguages if it is arbitrarily close to infinitely many such maximisers. We denote the set of such limit points by:
Whenever ℙ† consists only of a single function we shall denote that function by P† and refer to P† as the entropy limit.
One important desideratum for a procedure for choosing a rational belief function, particularly in the context of inductive logic, is language invariance. We shall consider two notions of language invariance: the following notion defined in terms of finite sublanguages, and a second form of language invariance, introduced in Definition 23, which we term infinite-language invariance.
Definition 12 (Finite-language invariant weighting function). A weighting function g : Π → ℝ≥0 is finite-language invariant, if and only if the following holds: for all finitely generated by constraints on, if and are such that, then for all there exists some such that Q⇂n =R⇂n
4.2.1. The Standard Entropy Limit
Standard entropy, i.e., entropy with respect to the standard weighting gΩ, is the subject of a substantial literature. We here collect the features of standard entropy most relevant for our purposes.
Firstly, gΩ is finite-language invariant; see, e.g., [7]. If
is finitely generated and g = gΩ, then
contains a unique element. Furthermore, there exists a unique function P ∈ [
] such that for all n ≥ K P⇂n ∈
holds. This function P is the entropy limit with respect to the standard weighting gΩ; it will be called the standard entropy limit and denoted by
. Henceforth we use
to denote the standard entropy limit on
, rather than on Ω as in Section 2.
Definition 13 (Open-minded belief function). We say that a belief function is open-minded on
, if and only if for all for which there exists some such that P (φ) > 0 it holds that B(φ) > 0. For we say that the belief function is open-minded.
The following proposition lists further important properties of
which we shall make frequent use of in what follows; see [7] (p. 95) for a proof of the first property.
Proposition 5. satisfies the following properties:
- is open-minded.
- For a finitely generated, for all n ≥ K and all ν ∈ Ωn, ω ∈ ΩK with ν ⊨ ω it holds that.
The second property will follow from Proposition 9 and from the fact that gΩ is language invariant. Let ν be a consistent conjunction of pairwise different literals such that ν ⊨ ω for some n-state ω with n ≥ K. Denoting by |ν|, |ω| the number of literals in ν, respectively, ω, it follows from the second property in Proposition 5 that
.
4.2.2. General Entropies
The question remains as to how the functions on
with maximal entropy, i.e., the members of maxent
, relate to the entropy maximisers
on the finite sublanguages
. We shall explore this question here.
Proposition 6..
Proof. Let P † ∈ ℙ†. Thus, for all sentences
, P†(φ) is the limit of a sequence
such that
and I ⊆ ℕ is infinite. Since [
] and all the [
] are closed, P † ∈ [
].
Of particular interest is the most equivocal probability function of
, which is called the equivocator and denoted by P=. P= is uniquely defined by the requirement that for all n ∈ ℕ it assigns all n-states ω ∈ Ωn the same probability, P=(ω) = 1/|Ωn|.
The restriction of P= to ℙn is denoted by P= ⇂n.
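As a small illustration of the equivocator, the following sketch builds the uniform distribution over n-states and checks that restricting it to a smaller sublanguage again yields the uniform distribution. Representing an n-state as a tuple of truth values for `num_atoms` atomic sentences, and the helper names `n_states`, `equivocator` and `marginalise`, are assumptions made only for this example.

```python
from itertools import product


def n_states(num_atoms):
    """All truth assignments to `num_atoms` atomic sentences (hypothetical encoding of n-states)."""
    return list(product([False, True], repeat=num_atoms))


def equivocator(num_atoms):
    """Uniform probability over the states: every state gets 1/|Omega_n|."""
    states = n_states(num_atoms)
    return {omega: 1.0 / len(states) for omega in states}


def marginalise(p, k):
    """Restrict a distribution over states to the first k atomic sentences."""
    out = {}
    for omega, mass in p.items():
        out[omega[:k]] = out.get(omega[:k], 0.0) + mass
    return out


# Restricting the equivocator on 3 atoms to 2 atoms gives the equivocator on 2 atoms.
restricted = marginalise(equivocator(3), 2)
```

Here `restricted` assigns 1/4 to each of the four states on two atoms, which is why a single function P= can be specified consistently across all finite sublanguages.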
In certain cases ℙ† will only contain a single limit point P†.
Definition 14. [4] (Definition 16, p. 3573.) A weighting function gn on is called equivocator-preserving, if and only if
g is called equivocator-preserving, if and only if gn is equivocator-preserving for all n ∈ ℕ.
Proposition 7. If P= ∈ [
] and if g is symmetric and inclusive, then ℙ† = {P=}.
Proof. By Landes and Williamson [4] (Corollary 6, p. 3574) we have
It follows that
and hence ℙ† = {P=}. □
So, if g is symmetric and inclusive, then g is equivocator-preserving. In Appendix B we shall show that there exist non-symmetric g which are equivocator-preserving.
Definition 15 (State-inclusive weighting function). Given, we call a weighting function g : Π → [0, 1] state-inclusive on
, if and only if for each state ω ∈ Ωn there exists a π ∈ Πn such that {ω} ∈ π and g(π) > 0. A weighting function g : Π → [0, 1] is state-inclusive, if and only if it is state-inclusive on each. It is eventually state-inclusive, if and only if there exists a J ∈ ℕ such that for all n ≥ J, g is state-inclusive on.
For example, if g(πn) > 0 for all n ∈ ℕ, then g is state-inclusive. Moreover, inclusive implies state-inclusive.
Lemma 1. If g is state-inclusive on, then is strictly concave on ℙn.
Proof. Let P, Q ∈ ℙn be different and λ ∈ (0, 1). Since for all π ∈ Πn we have
we find using the strict concavity of −x · log x on [0, 1]
The inequality is strict, if and only if there exists some π ∈ Πn with g(π) > 0 such that there is some F ∈ π with °P (F ) ≠ °Q(F). Since P, Q are different probability functions, there exists some ω ∈ Ωn such that P (ω) ≠ Q(ω). Since g is state-inclusive, g(π) > 0 for some π ∈ Πn with {ω} ∈ π. Hence, the inequality is strict. □
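The role of state-inclusiveness in Lemma 1 can be checked numerically on a toy state space. The sketch below assumes the g-entropy has the form −∑π g(π) ∑F∈π P(F) log P(F) (the displayed definition is not reproduced in this section, so this form is an assumption), and compares a state-inclusive weighting with one that only weights the trivial partition. All function and variable names are hypothetical.

```python
import math


def partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # put `first` into an existing block ...
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        # ... or into a block of its own
        yield [[first]] + smaller


def g_entropy(p, g, parts):
    """Assumed form of g-entropy: -sum_pi g(pi) sum_{F in pi} P(F) log P(F)."""
    total = 0.0
    for pi in parts:
        weight = g(pi)
        if weight == 0:
            continue
        for block in pi:
            mass = sum(p[s] for s in block)
            if mass > 0:  # convention: 0 * log 0 = 0
                total -= weight * mass * math.log(mass)
    return total


states = ["w1", "w2", "w3"]
parts = list(partitions(states))


def standard(pi):
    """Weight 1 on the partition into singletons (state-inclusive), 0 elsewhere."""
    return 1.0 if all(len(block) == 1 for block in pi) else 0.0


def trivial_only(pi):
    """Weight 1 on the one-block partition only (not state-inclusive)."""
    return 1.0 if len(pi) == 1 else 0.0


P = {"w1": 0.7, "w2": 0.2, "w3": 0.1}
Q = {"w1": 0.1, "w2": 0.3, "w3": 0.6}
mid = {s: 0.5 * P[s] + 0.5 * Q[s] for s in states}

gap_standard = g_entropy(mid, standard, parts) - 0.5 * (
    g_entropy(P, standard, parts) + g_entropy(Q, standard, parts))
gap_trivial = g_entropy(mid, trivial_only, parts) - 0.5 * (
    g_entropy(P, trivial_only, parts) + g_entropy(Q, trivial_only, parts))
```

With the state-inclusive weighting the midpoint has strictly higher entropy than the average of the endpoints (`gap_standard > 0`), whereas the weighting that ignores all singleton blocks gives a constant entropy (`gap_trivial` is zero up to rounding), so strict concavity fails.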
Proposition 8. If is finitely generated, and g is eventually state-inclusive and language invariant, then ℙ† consists of a single probability function P† and for all it holds that.
Proof. Recall that
is expressible by constraints in
and let J as in Definition 15. Let n ≥ max{J, K}.
By the above Lemma 1,
is strictly concave on ℙn. Since
is convex,
contains a single element. Hence, Q, R ∈
agree on
.
Since g is language invariant, we have
for all n ≤ l ≤ m.
For all
, there exists an s ∈ ℕ such that
. Hence, for l, m ≥ max{J, K,s} it holds that for
and
that R(φ) = Q(φ). □
For instance, standard entropy [4] (Equation 80), the substate weighting and other examples generated by Landes and Williamson [4] (Lemma 8) are eventually state-inclusive and language invariant. Note that these weighting functions are not inclusive.
Definition 16. We say that Hg is strictly concave, if and only if for all n ∈ ℕ,
is strictly concave on ℙn.
Proposition 9 (Equivocation beyond). Let be finitely generated and let g be symmetric. If Hg is strictly concave, then for all n ≥ K and all ν, μ ∈ Ωn such that there exists an ω ∈ ΩK with ν ⊨ ω and μ ⊨ ω it holds that
for all.
We call such ν, μ ∈ Ωn extensions of ω ∈ ΩK and say that
equivocates beyond. In particular,
equivocates beyond
up to
.
Proof. Let n > K and let P ∈ [
] be such that there exist ν, μ ∈ Ωn with P (ν) ≠ P(μ) such that there exists an ω ∈ ΩK with ν ⊨ ω and μ ⊨ ω. Assume for contradiction that
.
Now define a probability function
by first specifying Q on the n-states. Let
For a λ ∈ Ωr with r ≥ n we let
where ξ ∈ Ωr is the unique r-state such that λ ⊨ ξ.
By construction, Q and P agree on
. Since
is finitely generated, it follows that Q ∈ [
]. Furthermore, Q⇂n can be obtained from P⇂n by a renaming of n-states and it holds that Q⇂n ≠ P⇂n. Since gn is symmetric it holds that
. Since [
] is convex and
is strictly concave, neither P⇂n nor Q⇂n can maximise
over [
].
This contradicts P maximising
over [
].
Corollary 1. Let be finitely generated. If is strictly concave on ℙn for n ≥ K and if g is symmetric, then for n ≥ K the following maximisation problem
can be understood as an optimisation problem in the variables P (ω) with ω ∈ ΩK. In particular, the number of variables does not grow as n tends to infinity.
Proof. Follows immediately from the above proposition by noting that
equivocates beyond
up to
.
This corollary shows that in order to compute
for n ≥ K one needs to solve an optimisation problem on ΩK. If g is not language invariant, then, in general, the objective function of the optimisation problem changes as n changes. So, in general, (
)⇂K varies with n.
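The reduction described in Corollary 1 can be pictured as follows: once a distribution over the K-states is fixed, a function that equivocates beyond K is obtained by spreading each K-state's mass evenly over its extensions. The sketch below assumes, purely for illustration, that every K-state has the same number of n-state extensions and labels the extensions by an integer index; the name `equivocate_beyond` is hypothetical.

```python
def equivocate_beyond(p_K, extensions_per_state):
    """Spread the mass of each K-state evenly over its n-state extensions.

    p_K: dict mapping each K-state to its probability.
    extensions_per_state: number of n-states extending each K-state
        (assumed equal for all K-states in this sketch).
    """
    p_n = {}
    for omega, mass in p_K.items():
        for j in range(extensions_per_state):
            # (omega, j) stands for the j-th n-state extending omega
            p_n[(omega, j)] = mass / extensions_per_state
    return p_n


p_K = {"w1": 0.5, "w2": 0.25, "w3": 0.25}
p_n = equivocate_beyond(p_K, 4)
```

The only free parameters are the three values in `p_K`, however many extensions there are, which is the sense in which the number of optimisation variables does not grow with n.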
Corollary 2. Under the assumptions of Proposition 9 it holds that for F ⊆ Ωn and ν, μ ∈ Ωn, °(F) =°(Fν,μ), where Fν,μ is the result we obtain by replacing ν by μ and vice versa in F.
Proof. For an η ∈ Ωn denote by ωη ∈ ΩK the unique K-state such that η ⊨ ωη. Now simply note that by Proposition 9
Corollary 3. Let be finitely generated. For all n ≥ K and all equivocating beyond up to it holds for all K ≤ k ≤ n − 1 that
If g is symmetric and Hg is strictly concave, then
Proof. For ν ∈ Ωk+1 let ων ∈ Ωk be the unique k-state such that ν ⊨ ων. For K ≤ k ≤ n − 1 we now find for a
equivocating beyond
up to
The second part of the proof follows directly by observing that
and
equivocate beyond
up to
by Proposition 9. □
Corollary 4. Let be finitely generated. For all n ≥ K and all not equivocating beyond up to it holds that.
Proof. There has to exist at least one ξ ∈ ΩK such that there exist ν, λ ∈ Ωn with ν ⊨ ξ and λ ⊨ ξ such that P (ν) ≠ P (λ). Since P is a probability function it holds that
. We thus find using the log-sum inequality (see, e.g., Theorem 2.7.1 in [10])
If ξ ∈ ΩK is such that for all ν, λ ∈ Ωn with ν⊨ ξ and λ⊨ ξ it holds that P (ν) = P (λ), then the above calculation holds with the exception that the inequality is in fact an equality.
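For readers who want to see the log-sum inequality at work, the following sketch evaluates both sides of ∑ aᵢ log(aᵢ/bᵢ) ≥ (∑ aᵢ) log(∑ aᵢ / ∑ bᵢ) for the choice bᵢ = 1, which is the form relevant when comparing an uneven split of a K-state's mass with an even one. The numbers and the helper name `log_sum_sides` are illustrative assumptions, not taken from the paper.

```python
import math


def log_sum_sides(a, b):
    """Return (lhs, rhs) of the log-sum inequality for positive b and non-negative a."""
    lhs = sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    rhs = sum(a) * math.log(sum(a) / sum(b))
    return lhs, rhs


# Uneven split of mass 0.4 over two extensions: the inequality is strict.
uneven_lhs, uneven_rhs = log_sum_sides([0.1, 0.3], [1.0, 1.0])

# Even split of the same mass: the inequality holds with equality.
even_lhs, even_rhs = log_sum_sides([0.2, 0.2], [1.0, 1.0])
```

Since the left-hand side is the negative of the entropy contribution of the two extensions, the strict inequality in the uneven case says that the even split contributes strictly more entropy.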
We hence find by summing over all ω ∈ ΩK
Corollary 5. Let EL be finitely generated. If g is symmetric and if for all n ≥ K is strictly concave on ℙn, then
Proof. By Corollary 1,
is uniquely determined by
for ω ∈ ΩK. That is, we can understand
as a sequence taking values in
and
is compact. Hence, the sequence
has a point of accumulation, Q, with Q ∈ [
]. Let I ⊆ ℕ be infinite such that
for all ω ∈ ΩK.
Recall that for n > K that
equivocates beyond
up to
. We now extend Q to a probability function in [
] by defining it on the n-states ν ∈ Ωn for n > K as follows:
. Hence, Q equivocates beyond
.
Consider some
. It follows that there is some r ≥ K such that
. For ν ∈ Ωr denote by ων the unique element of ΩK such that ν ⊨ ων.
We thus find
We now turn our attention to the calibrated functions with maximal entropy, maxent
. Our aim is to show that maxent
holds for regular g.
Lemma 2. If g is regular, then
Proof. Since g is total, it is in particular defined for the language
which only contains a single relation symbol which is unary. When needed, we shall add a superscript U to express that we consider
.
Now define a sequence (an)n∈ℕ by
By the Cauchy condensation test [11] (p. 61, Theorem 3.27) for (not necessarily strictly) decreasing sequences we have that
Since the series on the left converges by the assumption on finite weights, so does the right, and that implies that
.
For n ∈ ℕ let k ∈ ℕ be such that 2^k ≤ n < 2^{k+1}. Since an is (not necessarily strictly) decreasing
. Hence,
The right hand side converges to 0 by Cauchy’s condensation test (6). Thus,
Now if
is some other language in our sense different from
, then for all n ∈ ℕ there exists an mn > n such that
. This in turn implies the existence of canonical bijections fn identifying Πn with
which respect the structure of partitions.
Because g is atomic it follows that for all π ∈ Πn that g(π) = g(fn(π)) holds. Thus,
We then observe that the sequence
is a subsequence of
Hence,
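The Cauchy condensation step in the proof of Lemma 2 can be illustrated numerically. The sketch below uses the stand-in sequence 1/n², which is an assumption chosen only because it is decreasing and summable; it is not the sequence (an) of the proof. It checks the finite sandwich bounds behind the condensation test and shows that n·a(n) becomes small.

```python
def a(n):
    """Illustrative decreasing, summable sequence (not the paper's a_n)."""
    return 1.0 / n**2


def partial_sum(seq, n_terms):
    """Sum of seq(1) + ... + seq(n_terms)."""
    return sum(seq(n) for n in range(1, n_terms + 1))


def condensed_partial_sum(seq, k_terms):
    """Sum of 2^k * seq(2^k) for k = 0, ..., k_terms - 1."""
    return sum(2**k * seq(2**k) for k in range(k_terms))


K = 10
lower = partial_sum(a, 2**K - 1)        # sum_{n < 2^K} a_n
condensed = condensed_partial_sum(a, K)  # sum_{k < K} 2^k a_{2^k}
upper = 2 * partial_sum(a, 2**(K - 1))   # 2 * sum_{n <= 2^(K-1)} a_n
tail_term = 2**K * a(2**K)               # n * a_n at n = 2^K
```

For a decreasing non-negative sequence one has `lower <= condensed <= upper`, so the original and the condensed series converge or diverge together; convergence of the condensed series forces its terms 2^k·a(2^k) to zero, which is the fact the proof exploits.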
Lemma 3. If g is strongly refined and state-inclusive, then there exist 0 < a ≤ b < +∞ such that for all n ∈ ℕ, g(πn) ∈ [a, b].
Proof. For every ω ∈ Ω1 there exists some π ∈ Π1 which contains {ω} with g(π) > 0. π1 refines all these partitions (or π1 is that partition). Hence, g(π1) > 0.
Since state partitions on richer languages are assigned more weight it follows that g(πn) ≥ g(π1) > 0 for all n ∈ ℕ.
Trivially,
. The latter is constant for all n. Hence, the sequence g(πn) is bounded from above by
.
We can thus choose a, b as follows: a := g(π1) and
.
Following [4] (p. 3556) we define:
Definition 17 (Spectrum of π). The spectrum of a partition π is defined as the multi-set of sizes of the members of π. We write σ(π) to denote the spectrum of π.
In other words, if π′ can be obtained from π by permuting the states in the members of π, then σ(π) = σ(π′). If g is symmetric, then g(π) only depends on the spectrum of π.
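A minimal sketch of Definition 17, representing a partition as a list of sets of states and the spectrum as a multiset of block sizes via `collections.Counter`. The state labels and the function name `spectrum` are illustrative assumptions.

```python
from collections import Counter


def spectrum(partition):
    """Multiset of block sizes of a partition, as a Counter {size: multiplicity}."""
    return Counter(len(block) for block in partition)


pi = [{"w1", "w2"}, {"w3"}]
pi_permuted = [{"w1", "w3"}, {"w2"}]   # obtained from pi by swapping w2 and w3
pi_states = [{"w1"}, {"w2"}, {"w3"}]   # the partition into single states
```

Here `spectrum(pi)` and `spectrum(pi_permuted)` coincide, so a symmetric weighting function must give both partitions the same weight, while `pi_states` has a different spectrum and may be weighted differently.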
Lemma 4. If g is symmetric, then for all n and all spectra s
Proof. First note that
is a concave function, since −x log x is a concave function for x ∈ [0, 1].
If P, P′ ∈
are such that one can be obtained from the other by a permutation of n-states, then for all spectra s
Hence, for all fixed spectra s P= ⇂n lies inside the contour lines of the function
in ℙn. It follows that
Corollary 6. If g is symmetric and such that
then for all P ∈ PL
Proof. For a fixed spectrum s we have
Thus,
Summing over all spectra now yields for all
The claimed result follows.
In particular, if g is regular then the above Corollary applies, by Lemma 2.
Let us consider the application of objective Bayesianism to inductive logic (Section 3.3). It turns out that if g is regular and
is finitely generated then the functions in
with maximal entropy coincide with the entropy limits (Definition 11), and moreover there is a unique such function, the standard entropy limit:
Theorem 3. Let g be symmetric, atomic, state-inclusive and strongly refined, and be finitely generated. Then
Note that if g is also inclusive, then g is regular.
Proof. By Lemma 3 there exist 0 < a ≤ b < +∞ such that g(πn) ∈ [a, b] for all n ∈ ℕ and by Corollary 6 the combined weight given to all other partitions on Πn tends to zero, as n increases, fast enough that, for all
,
For
there exists a minimal
with n ≥ K such that
. Since
is strictly convex on
and
maximises
over
it holds that
. Using Corollary 3 and Corollary 4 we obtain
for r ≥ n. Thus,
For large enough r the sums over the π ≠ πr become negligible. Since g(πr) is bounded there has to exist some R ∈ ℕ with R ≥ max{K, n} such that for all r ≥ R it holds that
Hence, for all large enough r it holds that
.
Thus, maxent
.
For the second part of the proof we show that for all r ∈ ℕ and all F ⊆ Ωr it holds that
Observe that for all n ∈ ℕ
The first sum tends to zero as n goes to infinity by our assumptions on g.
For the second sum observe that for all ϵ > 0 there exists an N ∈ ℕ such that for all n ≥ max{N, K} and all P ∈ [
] it holds that
< ϵ. Hence, ϵ >
. So,
For all n ≥ K,
and
equivocate beyond K up to n (Proposition 9). Hence, it holds that
(Corollary 3). So,
is a strictly concave and continuous function on ℙK. Hence, limn→∞(ω) =
(ω) for all ω ∈ ΩK. So, limn→∞(
)⇂K = (
)⇂K.
For an arbitrary n ≥ K and an F ⊆ Ωn we find using that
equivocates beyond K
The result for F ⊆ Ωr with r < K follows similarly. □
4.3. Loss and Expected Loss
We shall now analyse the notion of the loss incurred by an agent with belief function B ∈
. In Section 5 we shall be interested in how degrees of belief in quantified sentences affect losses. The following definition, axioms L1–4, Theorem 4 and Proposition 12 apply within our current, quantifier-free framework, i.e., = ∄, but they also apply to quantified sentences, i.e., = ∃.
Definition 18 (Independent Sublanguages). Let B ∈
be a fixed belief function such that B(τ) = 1 for any tautology τ, and = 1 ∪ 2 where 1 and 2 are disjoint: 1 and 2 contain the same constants, they do not have a relation symbol in common and the union of the relation symbols in 1 and 2 equals {U1,…, Us}, the set of relation symbols in . We say that 1 and 2 are independent sublanguages, written 1⫫B2, if and only if B(ϕ1 ˄ ϕ2) = B(ϕ1) · B(ϕ2) for all ϕ1 ∈ S1 and ϕ2 ∈ S2. Let B⇂1(ϕ1) := B(ϕ1), B⇂2 (ϕ2) := B(ϕ2).
By analogy with the line of argument of Section 2, we shall suppose that a default loss function L : S ×→ (− ∞, ∞] satisfies the following requirements. Here L(φ, B) is to be interpreted as the loss specific to φ turning out to be true, when one adopts belief function B:
- L1. L(φ, B) = 0, if B(φ) = 1.
- L2. L(φ, B) strictly increases as B(φ) decreases from 1 towards 0.
- L3. L(φ, B) only depends on B(φ).
- L4. Losses are additive when the language is composed of independent sublanguages: if = 1 ∪ 2 for 1⫫B2, then L(ϕ1 ˄ ϕ2, B) = L1(ϕ1, B⇂1) + L2(ϕ2, B⇂2), where L1, L2 are loss functions defined on 1, 2 respectively.
Theorem 4. If a loss function L on S × satisfies L1–4, then L(φ, B) = −k log B(φ), where the constant k > 0 does not depend on the language .
Proof. The proof is exactly analogous to that of Landes and Williamson [4] (Theorem 4), which gives the result in the case in which is a finite propositional language. □
Since multiplication by a constant is equivalent to change of base, we can take log to be the natural logarithm. Since we will be interested in the belief functions that minimise loss, rather than in the absolute value of any particular losses, we can take k = 1 without loss of generality. Theorem 4 thus allows us to focus on the logarithmic loss function: L(φ, B) = −log B(φ).
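A minimal numerical sketch of the logarithmic loss −log B(φ) with k = 1, checking the requirements L1, L2 and the additivity in L4 on sample degrees of belief. The function name `log_loss` and the sample values are assumptions for illustration; treating a zero degree of belief as infinite loss matches the codomain (−∞, ∞] of L.

```python
import math


def log_loss(belief):
    """Logarithmic loss -log B(phi) for a degree of belief in [0, 1] (k = 1)."""
    if not 0.0 <= belief <= 1.0:
        raise ValueError("degree of belief must lie in [0, 1]")
    return math.inf if belief == 0.0 else -math.log(belief)


# L1: full belief in a sentence that turns out true incurs no loss.
loss_certain = log_loss(1.0)

# L2: loss increases as the degree of belief decreases.
loss_half, loss_fifth = log_loss(0.5), log_loss(0.2)

# L4: for independent sublanguages B(phi1 & phi2) = B(phi1) * B(phi2),
# so the loss of the conjunction is the sum of the separate losses.
loss_joint = log_loss(0.5 * 0.25)
loss_sum = log_loss(0.5) + log_loss(0.25)
```

L3 holds by construction, since `log_loss` takes only the degree of belief in the sentence as its argument.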
Next we define our notion of expected loss. The expectation is taken with respect to a probability function P, and we consider the expectation taken over each partition of propositions. Each partition is weighted by the given weighting function g. Attention is restricted to inclusive weighting functions, so that each belief is evaluated; if the weighting function were not inclusive then degrees of belief in some propositions would fail to contribute to the expectation.
Definition 19 (n-representation). A sentence θ ∈ Sn n-represents a proposition F ⊆ Ωn, if and only if F = {ω ∈ Ωn: ω ⊨ θ}. Let ⊆
Ωn be a set of pairwise distinct propositions. We say that Θ ⊆ Sn is a set of n-representatives of , if and only if each sentence θ ∈ Θ n-represents a unique proposition in and each proposition in is n-represented by a unique sentence θ ∈ Θ.
A set ρ of n-representatives ofΩn will be called an n-representation. We shall use ρF to denote the sentence in ρ which n-represents F. We denote by ϱn the set of all n-representations.
Note that if belief function B respects logical equivalence, then for all n ∈ ℕ, all F ⊆ Ωn and all l-representations ρ with l ≥ n it holds that B(ρF ) = °B(F ). Otherwise there exist an n ∈ ℕ, a proposition F ⊆ Ωn and n-representations ρ, ρ′ such that B(ρF) ≠ B(ρ′F).
Definition 20 (n-score). Given a loss function L, an inclusive weighting function g: Π → ℝ≥0, n ∈ ℕ, and an n-representation ρ ∈ ϱn we define the representation-relative n-score
: ℙ × → [−∞, ∞] by:
Define the (representation-independent) n-score
by
(As a technical convenience, we shall consider loss functions and n-scores to be defined more generally, taking arguments P, B: S → [0, 1], although we will primarily be concerned with the case above where P is a probability function and B is a belief function.)
In the light of Theorem 4, we will focus exclusively on the logarithmic loss function in this paper:
For P ∈ ℙ we have that P (ρF) = P (ρ′F ) for all ρ, ρ′ ∈ ϱn, since P respects logical equivalence. Hence for P, Q ∈ ℙ we have
where Sg is the propositional scoring rule introduced in Section 2, in the case Ω = Ωn. There are also connections with g-entropy
, defined in (5), and the propositional notion of entropy Hg, defined in Section 2:
If g = gΩ, we call the resulting function the standard logarithmic n-score:
where the latter equality applies if B respects logical equivalence.
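To make the notion of a weighted expected logarithmic loss concrete, the sketch below computes a score of the assumed form −∑π g(π) ∑F∈π P(F) log B(F) for a belief function that respects logical equivalence, so that it can be given directly on propositions. The displayed definitions are not reproduced in this section, so this form, the convention 0 · log 0 = 0, and all names (`g_score`, `weighted_partitions`, the two-state example) are assumptions.

```python
import math


def g_score(p, belief, weighted_partitions):
    """Assumed expected log loss: -sum_pi g(pi) sum_{F in pi} P(F) log B(F).

    p: dict state -> probability.
    belief: dict frozenset(states) -> degree of belief in that proposition.
    weighted_partitions: list of (weight, partition) pairs, where a partition
        is a list of frozensets of states.
    """
    total = 0.0
    for weight, partition in weighted_partitions:
        if weight == 0:
            continue  # unweighted partitions do not contribute
        for block in partition:
            mass = sum(p[s] for s in block)
            if mass == 0.0:
                continue  # convention: 0 * log(...) = 0
            if belief[block] == 0.0:
                return math.inf  # zero belief in a possible proposition
            total -= weight * mass * math.log(belief[block])
    return total


A, B_, AB = frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})
wp = [(1.0, [A, B_]), (1.0, [AB])]
p = {"a": 0.25, "b": 0.75}

calibrated = {A: 0.25, B_: 0.75, AB: 1.0}   # belief function induced by p
uniform = {A: 0.5, B_: 0.5, AB: 1.0}
dogmatic = {A: 0.0, B_: 1.0, AB: 1.0}

score_calibrated = g_score(p, calibrated, wp)
score_uniform = g_score(p, uniform, wp)
score_dogmatic = g_score(p, dogmatic, wp)
```

In this example the score of the belief function induced by `p` equals the corresponding entropy of `p`, the uniform belief function scores worse, and the belief function that assigns zero to a proposition with positive probability has infinite expected loss.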
The question arises as to how
, the notion of expected loss defined on a finite sublanguage n, relates to loss on , the language as a whole. One particularly natural suggestion is that B has a better overall loss profile than B′ if the latter’s n-scores eventually dominate those of B or if the worst-case n-score incurred by B′ is eventually greater than that of B:
- If B has lower worst-case expected loss than B′ for all sufficiently large n, then B has a better loss profile than B′.
- If for all P ∈ ℙ, B has an expected loss which is less than or equal to that of B′, and if for some P ∈ [ ], B has strictly lower expected loss than B′ for sufficiently large n, then B has a better loss profile than B′.
We make this precise as follows:
Definition 21 (Better loss profile). B has a better loss profile than B′ if and only if:
- There exists some N ∈ ℕ such that for all n ≥ N, (P, B) < (P, B′), or
- (P, B) ≤ (P, B′) < +∞ for all P ∈ ℙ and all n ∈ ℕ, and there exist at least one function Q ∈ [ ] and some NQ ∈ ℕ such that (Q, B) < (Q, B′) for all n ≥ NQ.
We write B ≺ B′ to denote that B has a better loss profile than B′. We will be interested in those belief functions that have the best loss profile, i.e., the minimal elements of ≺, and define:
Proposition 10 (Properties of ≺). The binary relation ≺ is asymmetric, partial, irreflexive and transitive.
Proof. Note that if for all P ∈ ℙ and all n ∈ ℕ it holds that
(P, B) ≤
(P, B′), then
(P, B) ≤
(P, B′) follows trivially. Hence, conditions 1 and 2 of Definition 21 are consistent, in the sense that the induced relation ≺ is asymmetric.
There exist different B, B′ ∈
which are not open-minded on 1 and thus have infinite loss on n for all n ≥ 1 (cf., Proposition 13). For example, if B(τ′) = B′(τ′) = 0 where τ′ is a tautology in S1, then B and B′ have infinite expected loss for all n ∈ ℕ and all P ∈ ℙ. Thus, ≺ is only partial.
That ≺ is irreflexive follows directly from the definition.
Now consider B1, B2, B3 ∈
such that B1 ≺ B2 ≺ B3. We will consider cases to prove that B1 ≺ B3.
If there exist N1,2, N2,3 such that
then
Thus, B1 ≺ B3.
Now assume that there exists a number N1,2 such that
(P, B1) <
(P, B2) for all n ≥ N1,2 and assume that the pair (B2, B3) satisfies the second condition of Definition 21. Then,
(P, B1) <
(P, B3) for all n ≥ N1,2. Thus, B1 ≺ B3.
The same argument shows that if the pair (B1, B2) satisfies the second condition of Definition 21 and the pair (B2, B3) satisfies the first condition, then B1 ≺ B3.
Finally, suppose that the pairs (B1, B2) and (B2, B3) both satisfy the second condition of Definition 21. Then for all P ∈ ℙ and all n ∈ ℕ it holds that
(P, B1) ≤
(P, B3). Furthermore, there has to exist a Q ∈ [
] and an NQ ∈ ℕ such that for all n ≥ NQ it holds that
(Q, B1) <
(Q, B2). But then
(Q, B1) <
(Q, B3) for all n ≥ NQ. Thus, B1 ≺ B3.
Since ≺ is irreflexive and transitive it cannot contain a cycle.
One main theme of the rest of this paper will be the search for belief functions with the best loss profile. Since the loss function L we are interested in is − log B(φ), and these values monotonically decrease as B(φ) increases from 0 to 1, it follows that, ceteris paribus, the belief functions with better loss profiles assign greater degrees of belief to sentences.
It might appear then that the normalisation (see Definition 1) would directly imply that no B ∈
\ℙ could have the best loss profile. Intuitively, this might be thought to hold since the belief functions B ∈
\ℙ assign smaller degrees of belief than the probability functions P ∈ ℙ. However, Equation (4) shows that some B ∈
\ℙ assign greater degrees of belief than a probability function P ∈ ℙ to certain sentences in the following sense: there exists a set of sentences Φ ⊂ S such that for all P ∈ ℙ it holds that ∑φ∈Φ B(φ)> ∑φ∈Φ P(φ).
While Condition 1 of Definition 21 deals with worst-case expected loss, Condition 2 deals with dominance of expected loss. Now, dominance is often used on its own to justify the Probability norm; see, e.g., de Finetti [12] (Chapter 3) and, more recently, Joyce [13,14]. So, one might think that Condition 2 is strong enough on its own to imply the Probability norm. However, this is not the case:
Proposition 11. For = ℙ there exist a weighting function g and a non-probabilistic belief function B ∈
\ℙ such that no probability function P ∈ ℙ has a loss which dominates that of B in the sense of Condition 2.
Proof. It suffices to show that there exist a weighting g and a B ∈
\ℙ such that for all Q ∈ ℙ there exist a P ∈ ℙ and infinitely many n ∈ ℕ such that
(P, B) <
(P, Q).
Consider a B ∈
\ℙ from Proposition 4 and consider an arbitrary Q ∈ ℙ. Then there has to exist a ν ∈ O4 such that Q(ν) ≠ B(ν). Next note that Q(¬ν) ≠ B(¬ν) follows. Then, −
log
log
log Q(ν) −
log Q(ν) since the logarithmic scoring rule is strictly proper.
So, for P ∈ ℙ with P (ν) =
and g({ν, ¬ν}) > 0 it holds that
Next let ν1 := ¬Ut1t ˄ ¬Ut2t, ν2 := Ut1t ˄ ¬Ut2t, ν3 := ¬Ut1t ˄ Ut2t, and ν4 := Ut1t ˄ Ut2t. For n ≥ 4 let
⊂ Ωn be the unique proposition which is equivalent to νi,
= {ω ∈ Ωn : ω ⊨ νi}.
Now define gn for n ≥ 4 as follows:
So, for this B and this g we have found that for all Q ∈ ℙ there exist a P ∈ ℙ and infinitely many n ∈ ℕ (every fourth n) such that
□
In general, determining the functions comprising minloss
is a challenging problem, which we shall tackle in due course. However, there is one general property we can prove directly: assigning zero degree of belief to an epistemically possible sentence is irrational, in the sense that it exposes one to avoidable losses. To see this, first note that:
Proposition 12. For any, there exists a probability function P ∈
which is open-minded.
Proof. The set of consistent sentences in is countable. The set
is a subset of the set of consistent sentences and is thus countable, too. We can hence enumerate Φ by some countable index set, I, say. Note that |I| ≥ 2 since P (τ) = 1 for all P ∈ ℙ and all tautologies τ.
For all φ ∈ Φ choose some Pφ ∈
such that Pφ(φ) > 0. Next, for all i ∈ I pick an αi ∈ (0, 1) ⊂ ℝ such that ∑i∈I αi = 1. Since |I| ≥ 2 such αi exist.
We shall now define an open-minded function P ∈
by putting
Note that P is in
since it is a convex combination of probability functions in the convex set
.
We next show that P is indeed open-minded. Let φ ∈ Φ be at the j-th position in the enumeration I of Φ. We now obtain P (φ) ≥ αjPφ(φ) > 0. So, P (φ) > 0 for all φ ∈ Φ. □
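The mixing construction in the proof of Proposition 12 can be sketched on a finite toy example. The proof allows a countably infinite index set; the sketch below assumes a finite family of probability functions over three states, with hypothetical names `mix`, `P_a` and `P_b`.

```python
def mix(distributions, weights):
    """Convex combination sum_i alpha_i * P_i of distributions over the same states."""
    if abs(sum(weights) - 1.0) > 1e-12 or any(w <= 0 for w in weights):
        raise ValueError("weights must be strictly positive and sum to 1")
    states = distributions[0].keys()
    return {s: sum(w * d[s] for w, d in zip(weights, distributions)) for s in states}


# Two members of a (hypothetical) evidence set, each dogmatic about one state.
P_a = {"a": 1.0, "b": 0.0, "c": 0.0}
P_b = {"a": 0.0, "b": 1.0, "c": 0.0}

P_mixed = mix([P_a, P_b], [0.5, 0.5])
```

The mixture gives positive probability to every state that some member of the family considers possible (`a` and `b`), while still assigning zero to `c`, which no member considers possible. This is all that open-mindedness requires.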
Proposition 13. B ∈ minloss
implies that B is open-minded.
Proof. If B is not open-minded, then there exists a k ∈ ℕ and a φ ∈ Sk such that B(φ) = 0 and there exists a P ∈ [
] such that P (φ) > 0. Since φ ∈ Sr for all r ≥ k, it holds for all r ≥ k that
(P,B)=+∞.
By Proposition 12 there exists an open-minded Q ∈ [
]. Thus,
(P, Q) < ∞ for all r. □
Note that the above proposition does not imply that minloss
is non-empty.
4.4. Minimax Theorems
In this section we shall relate the belief functions that have best loss profile to the probability functions that have maximal g-entropy.
It turns out that an improvement in loss profile is not necessarily accompanied by an increase in entropy (Appendix A). Nevertheless, we shall see that given appropriate conditions on g, there is a close relationship between the belief function that has the best loss profile and the probability function which has maximum entropy. On a finite sublanguage, the unique belief function with minimum worst-case expected loss is the probability function with maximum entropy (Section 4.4.1). Moreover, on the language as a whole, if the evidence set
is finitely generated then the unique belief function with the best loss profile (i.e., the belief function that is minimal with respect to ≺) is the probability function in EL with maximal entropy (Section 4.4.2). However, this is not necessarily so when
is not finitely generated (Section 6.1).
4.4.1. Minimax on Finite Sublanguages
Lemma 5. For all n ∈ ℕ, all P ∈ ℙ and all B ∈
respecting logical equivalence on n it holds that (P, B) =
(P, B) for all ρ ∈ ϱn.
Proof. Simply note that
(P, B) = −log B(ρF) does not depend on ρ ∈ ϱn. □
Lemma 6. For all inclusive g, for all n ∈ ℕ and each belief function
B† respects logical equivalence on n. Furthermore, for all such B† there exists a partition π ∈ Πn such that ∑F∈π B†(ρF)=1 for all ρ ∈ϱn.
Proof. Firstly, B† cannot assign all φ ∈ Sn degree of belief 0, since this would incur an infinite worst-case expected loss; and as we saw in Proposition 13, there are functions which have finite worst-case expected loss.
Assume for contradiction that a B† ∈
does not respect logical equivalence on n. Then define a function Binf : S → [0, 1] which respects logical equivalence on n by
The next step in this proof is to show that
In the second part of the proof we shall see that there is a belief function which has a strictly better worst-case expected loss than Binf. This then contradicts the assumption that the belief function B† has the best worst-case expected loss, i.e., B† ∈ arg
.
Since B† does not respect logical equivalence on
, there are logically equivalent φ, ψ ∈
such that B†(φ) ≠ B†(ψ). Thus, Binf(φ) < max{B†(φ), B†(ψ)} and hence Binf(φ) + Binf(¬φ) < max{B†(φ), B†(ψ)} + B†(¬φ) ≤ 1. The last inequality holds since B† ∈
. So,
.
Recall that we extended the definition of scoring rules allowing the belief function to be any function defined on
taking values in [0, 1]. We shall be careful not to appeal to results that assume a normalised belief function in this situation.
We now find for P ∈
Hence
, as claimed above.
Let us now consider cases to derive a contradiction.
Case i There exists a π ∈
such that ∑F∈π Binf(ρF)=1.
Since Binf respects logical equivalence this fact is independent of the particular ρ ∈ ϱn. Recall that we use the notation °Binf = °nBinf to denote the function that Binf induces over propositions in Ωn, defined by °Binf(F) = Binf(∨F).
With this convention we then note that °Binf ∈
\. Let
be the set of probability functions on Ωn which are in the canonical one-to-one correspondence with the probability functions on
, i.e.,
. We thus find, using Theorem 2 to obtain the strict inequality, that:
Case ii For all π ∈ Πn and all ρ ∈ ϱn it holds that ∑F∈π Binf(ρF) < 1.
Since Binf respects logical equivalence on
we may consider the induced function °Binf defined over propositions of Ωn. Since Πn is finite, so is the set {∑F∈π °Binf(F)}. Thus, supπ∈Πn ∑F∈π °Binf (F) = 1 − ϵ for some ϵ ∈ (0, 1].
Let us now define a function
. Denote by μ ∈ (0, 1] the unique number such that for all π ∈ Πn and all ρ ∈ ϱn it holds that ∑F∈π μ+Binf (ρF) = ∑F∈π μ + °Binf(F) ≤ 1 and for at least one π ∈ Πn and one ρ ∈ ϱn we have ∑F∈π μ + Binf (ρF) = ∑F∈π μ + °Binf(F) = 1
Put B′(φ) := μ + Binf(φ) > Binf(φ) for all φ ∈
and B′(φ) := 0 otherwise. Observe that B′ ∈
and that B′(¬τ) ≥ μ > 0 for the tautologies τ of
. But then °B′ ∈
. Then for all π ∈ Πn and all P ∈
we have −∑ F∈π P (ρF) log B′ (ρF) < −∑ F∈π P(ρF) log Binf (ρF). We now apply Theorem 2 to find the strict inequality below
So, in Case i and in Case ii we have found that
has strictly better worst-case expected loss than B† contradicting B† ∈ arg
.
Finally, we need to show that for all such belief functions B† there exists a π ∈ Πn such that ∑F∈π °B†(F) = 1. Suppose for contradiction that this is not the case. Note that B† respects logical equivalence on
. Hence, we can define a belief function B′ ∈
by adding a strictly positive number μ as in Case ii. B′ has a worst-case expected loss that is less than or equal to the worst-case expected loss of B†. Again, we find that °B′ ∈
and hence B′ does not have minimal worst-case expected loss. Clearly then, B† cannot have minimal worst-case expected loss. Contradiction. □
Theorem 5 (Finite sublanguage minimax). For all inclusive, all n ∈
, all C ∈ arg
and all Q ∈ arg
it holds that
Proof. From Lemma 6 we know that for every C ∈ arg
it holds that C⇂n respects logical equivalence on
and that °C := °nC ∈
(since C is normalised). Every probability function P ∈
respects logical equivalence (Proposition 3).
Thus,
and
collapse to
, respectively
, the logarithmic scoring rule for propositions (1).
However, for the propositional case we know from Theorem 2 that the unique
-entropy maximiser on
is the unique worst-case expected loss minimiser on
,
. arg
.
Thus, for all F ⊆ Ωn it holds that
for all ρ ∈ ϱn. Hence,
. □
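The content of Theorem 5 can be illustrated by brute force in the simplest possible setting. The sketch below makes several assumptions that go beyond the text: there are only two states, the weighting is the standard one so that expected loss is −∑ω P(ω) log B(ω), the evidence set is the interval of chances P(ω1) ∈ [0.6, 0.9], and the search ranges only over probabilistic belief functions on a grid (the theorem itself also covers non-probabilistic normalised belief functions). All names are hypothetical.

```python
import math


def expected_log_loss(p, b):
    """Expected log loss of believing omega_1 to degree b when its chance is p."""
    return -(p * math.log(b) + (1 - p) * math.log(1 - b))


def worst_case_loss(b, lo, hi):
    """Supremum of expected loss over chances in [lo, hi].

    The expected loss is linear in p, so the supremum is attained at an endpoint.
    """
    return max(expected_log_loss(lo, b), expected_log_loss(hi, b))


def entropy(p):
    """Entropy of the two-state distribution (p, 1 - p)."""
    return expected_log_loss(p, p)


lo, hi = 0.6, 0.9
grid = [i / 100 for i in range(1, 100)]

# Belief function on the grid with the smallest worst-case expected loss.
minimax_belief = min(grid, key=lambda b: worst_case_loss(b, lo, hi))

# Chance function in the evidence set with the largest entropy.
maxent_chance = max((p for p in grid if lo <= p <= hi), key=entropy)
```

In this toy case both searches single out the same function, the member of the evidence set closest to the equivocator.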
4.4.2. Minimax for Inductive Logic
We shall now consider the language
as a whole. We shall assume in this section that EL is finitely generated by constraints on
. As noted in Section 3.3, this is the scenario that is of key relevance to inductive logic. Our goal is to justify the norms of objective Bayesianism by showing that the belief functions with the best loss profile are the probability functions in
with maximum entropy.
First we shall see that this is the case if
is language invariant:
Proposition 14 (Language invariance minimax). If is inclusive and language invariant and if is finitely generated, then
Proof. Note that we have
from Proposition 8, in particular
for all n ≥ K.
Since
is inclusive,
is strictly concave on
(Lemma 1). Hence,
is uniquely determined. By language invariance we obtain P† ∈ arg
for all n ≥ K. Thus, P† ∈ maxent
.
For Q ∈
\ {P†} there has to exist some N ∈
such that Q⇂n ≠ P†⇂n for all n ≥ N. Since
is a strictly concave function on
and since P† maximises
for all n ≥ K it follows that
for all n ≥ max{K, N}. Thus, Q ∉ maxent
.
From Theorem 5 we have that
∈ arg
for all n ≥ K. Since
is finitely generated and g is language invariant we have that P† ∈ arg
for all n ≥ K. Thus, P† ∈ minloss
.
For every C ∈
\{P†} there has to exist an N ∈
such that for all n ≥ N it holds that
For all n ≥ max{K, N} we now apply Theorem 5 to obtain
. Hence, C ∉ minloss
. □
This result is not entirely satisfactory, because we cannot say anything yet about whether such weighting functions exist. Indeed, it was conjectured in Landes and Williamson [4] (p. 3564) that no inclusive, symmetric and refined weighting function
is language invariant. This conjecture remains open.
Our next result says that, for the standard weighting
, the probability function with the best loss profile is the standard entropy maximiser:
Proposition 15 (Standard entropy minimax). If is finitely generated and, then
Proof. follows directly, since
is language-invariant and state-inclusive, Proposition 8.
It is well-known that
see for instance [15]. Hence,
□
Because it only identifies probability functions with the best loss profile, rather than normalised belief functions with the best loss profile, Proposition 15 provides a justification for only two norms of objective Bayesianism, the Calibration Norm and the Equivocation Norm, under the supposition that
. This is a useful result if there is some independent reason—such as the Dutch book argument—for taking belief functions to be probability functions. But our goal in this paper is to investigate the extent to which the notion of loss profile developed above can be used to justify all three norms at once.
We know that there are weighting functions that are regular, i.e., which are atomic, inclusive, symmetric and strongly refined. The plan of the rest of this section is to prove the following analogous minimax theorem for regular weighting functions. This says that, for any regular weighting function, the belief function with the best loss profile is the probability function in
which has maximal standard entropy. This theorem thus justifies all three norms at once.
Theorem 6 (Regularity minimax). If is regular and is finitely generated, then
In order to prove this theorem we give a number of lemmata. We shall state these lemmata under weaker conditions on
. The reader not interested in the details might always replace the stated conditions on
by: “
is regular”.
To begin with, we shall consider only belief functions B which respect logical equivalence. (Later we shall relax this restriction.) Hence,
does not depend on ρ and we can ignore the particular representation ρ. This will allow us to focus on propositions.
Lemma 7. If n ≥ K, Q ∈
and if is finite, then it holds that
Proof. Let P′ ∈ arg
. Then define P″ on Ωn+1 by
for all ν ∈ Ωn+1 and ων ∈ Ωn with
. Now extend P″ arbitrarily to a function in
. Note that
since
is finitely generated and n ≥ K.
Since − log(x) is a strictly convex function on (0, 1] and since
for all ω ∈ Ωn it holds for all fixed ω ∈ Ωn that
. We now find
Definition 22 (γ-weighting). To simplify notation we define for n ∈ ℕ and F ⊆ Ωn
If g is symmetric, then γn(F) only depends on |F| := |{ω ∈ Ωn : ω ∈ F}| and we write γn(|F|).
In particular, since the belief function B is assumed to respect logical equivalence, we can write
Furthermore, we can easily characterise the set of inclusive g. g is inclusive, if and only if for all
and all F ⊆ Ωn γn(F) > 0.
Lemma 8. Let g be inclusive and such that there exist 0 < a ≤ b < +∞ such that g(πn) ∈ [a, b] for all and such that
Then
Proof. Let us first note that
Recall that
is open-minded (Proposition 5). Thus,
, F ⊆ Ωn and
imply
. Let
Then, for F ⊆ Ωn such that
it holds that
since
equivocates beyond
.
Hence,
, F ⊆ Ωn and
imply that
. Since
we now find
To complete the proof, it suffices to note that this sum is eventually positive and converges in
to zero by our assumption on g and the fact that m is constant.
Proposition 16. Let g be inclusive and such that there exist 0 < a ≤ b < +∞ such that g(πn) ∈ [a, b] for all and such that
Then for all that respect logical equivalence,
.
Proof. We shall proceed by considering cases.
Case 1.
There exists an N ≥ K such that for all n ≥ N it holds that
. It is well-known that for all
That is, the usual logarithmic scoring rule, when applied to probability functions
and
, is strictly proper. Savage [16] showed that this scoring rule is not only strictly proper but also unique under the further assumption of locality, which is requirement L3 in our framework. Thus,
.
We then find by the first part of Corollary 3 and Lemma 7 for all n ≥ N that
Recall from Lemma 8 that Restn converges to zero. Furthermore, the sequence
is bounded in [a, b] with a > 0. Thus, for all large enough n ∈ ℕ it holds that
Case 2.
Case 2A There exists a
such that for all
and all F ⊆ Ωn it holds that
, i.e., PB dominates B.
Case 2Ai and no other
is such that
for all n and all F ⊆ Ωn. Then for all
and all propositions F it holds that
Thus, for all
and
it holds that
.
Since
there exists some
and a ∅ ⊂ F ⊆ ΩN such that
. For n > N let ∅ ⊂ Fn ⊆ Ωn be such that Fn = {ω ∈ Ωn : ω ∈ F }. Hence, for all n > N it holds that
. Thus,
. Since g is inclusive (γn(F ) > 0 for all
and all F ⊆ Ωn) it holds that
for all n ≥ N.
Applying the second condition of Definition 21 yields
.
Case 2Aii There exists a
dominating B such that
.
Then for all n ≥ K and all
it holds that
. For all large enough
it holds by Case 1 that
. Thus, we find for all large enough n
Case 2B There does not exist a
such that for all
and all F ⊆ Ωn it holds that
.
For example, the belief functions constructed in Proposition 4 are of this form, i.e., not dominated by a probability function.
Let us assume for contradiction that there exists an infinite set
such that
. Now define a function Q on
by requiring that Q respects logical equivalence and that
Next we show
and
for all F which will allow us to derive the required contradiction.
First note that for all
it holds that
Furthermore, we have for all
and all F ⊆ Ωn
So,
.
Now assume that there exists a proposition F ⊆ Ωn such that
. Since
it holds that
. Note that
is a partition in
. Since we assumed that B respects logical equivalence it holds that
. Thus,
has to hold for all large i. We now obtain the required contradiction as follows:
Thus, there has to exist an α > 0 and an
with N ≥ K such that for all n ≥ N it holds that
. We have for n ≥ N that
To complete the proof we will now show that there exists some β > 0, which depends on
and g but does not depend on the particular n ≥ N, such that
. Since g(πn) is bounded, we then obtain that
for all large enough n.
We need to show that for all large enough n,
for all functions f : Ωn → [0, 1] such that
.
Suppose
. If
and f′(ω) = 0, then
. Hence, the minimum cannot obtain for such an f′. On the other hand, if f′(ω) > 0 and
, then there has to exist a μ ∈ Ωn \ {ω} such that
. Then define a function f″ such that f″(ω) := 0, f″(μ) := f′(μ) + f′(ω) > f′(μ) and f″(λ) := f′(λ) for all λ ∈ Ωn \ {ω, μ}. Then
. Again, the minimum cannot obtain for such an f′.
We may thus assume in the following that any f′ minimising the above sum satisfies:
, if and only if f′(ω) > 0. In particular, the function f′(ω) = 0 for all ω ∈ Ωn cannot be optimal.
Let
. Then
By definition,
. The sum in the above equation is thus the standard logarithmic scoring rule on
,
. For fixed P ∈ ℙ the minimum under this scoring rule obtains for a function which agrees with P on the states ω ∈ Ωn.
Thus, for fixed af the function f minimising
is the af multiple of
. In order to minimise
, −log af has to be minimal. This minimum obtains for af = 1 − α. We hence find the value of the minimum as
β may thus be chosen as β = − log(1 − α) > 0. □
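The last step of this proof can be checked numerically. The following is a minimal sketch under stated assumptions: the three-state distribution `p` is a toy stand-in for the restriction of a probability function to the n-states, the helper `expected_log_loss` is hypothetical, and the weights g(πn) are ignored. It illustrates that, among functions f with total mass af = 1 − α, the af-multiple of `p` incurs the smallest expected logarithmic loss, and that this loss exceeds the loss of `p` itself by −log(1 − α).

```python
import math

def expected_log_loss(p, f):
    """Expected logarithmic loss sum_w p[w] * (-log f[w]); terms with p[w] == 0 contribute 0."""
    total = 0.0
    for pw, fw in zip(p, f):
        if pw == 0.0:
            continue
        if fw == 0.0:
            return math.inf  # positive probability on a state with zero belief
        total -= pw * math.log(fw)
    return total

# Toy distribution over three states (an assumption made for illustration only).
p = [0.5, 0.3, 0.2]
alpha = 0.1
a_f = 1 - alpha

# Candidate 1: the a_f-multiple of p. Candidate 2: a different shape with the same total mass a_f.
f_scaled = [a_f * pw for pw in p]
f_other = [a_f * qw for qw in [0.4, 0.4, 0.2]]

# Extra loss of the scaled function relative to p itself.
gap = expected_log_loss(p, f_scaled) - expected_log_loss(p, p)
beta = -math.log(1 - alpha)
```

Here `gap` coincides with `beta` up to floating-point error, while `f_other` has a strictly larger expected loss than `f_scaled`, in line with the strict propriety of the logarithmic scoring rule.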
We now drop the assumption that belief functions respect logical equivalence.
Proposition 17. If g is inclusive and such that there exist 0 < a ≤ b < +∞ such that g(πn) ∈ [a, b] for all n ∈ ℕ and such that
then
Proof. We shall consider cases for
. We will show that
holds for all cases. Then minloss
follows.
Case 1 B respects logical equivalence.
By Proposition 16 we obtain
.
Case 2 B does not respect logical equivalence.
Since B does not respect logical equivalence, there exists a minimal N ∈ ℕ such that two different logically equivalent sentences φ, ψ ∈ SN are assigned different degrees of belief, i.e., B(φ) ≠ B(ψ).
We now inductively define functions Bn : S→ [0, 1] for n ≥ N. First, let
Now assume n > N. For all χ ∈ Sn such that no θ ∈ Sn−1 is logically equivalent to χ let
and otherwise let
Note that Bn is well-defined, Bn−1 respects logical equivalence on n−1 and thus Bn−1(θ) does not depend on the particular sentence θ ∈ S n−1 which is logically equivalent to χ.
By construction, Bn+1 agrees with Bn on S n.
Finally, let BI(χ) := limn→∞ Bn(χ). Trivially, BI⇂N = BN⇂N.
Since for all n ≥ N the Bn respect logical equivalence on n, BI respects logical equivalence on .
Furthermore, BI agrees with Bn on the sentences of n.
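The first step of this construction can be pictured on a toy example. The sketch below is only an illustration under stated assumptions: the four sentences, their degrees of belief and the hand-written table of equivalence classes are hypothetical, and a finite minimum stands in for the infimum (each sentence of the language in fact has infinitely many logical equivalents). It shows how passing to the least degree of belief among logical equivalents yields a function that respects logical equivalence and never exceeds B.

```python
def infimum_over_equivalents(belief, equivalence_class):
    """Give each sentence the least degree of belief found among its listed logical equivalents."""
    least = {}
    for sentence, degree in belief.items():
        cls = equivalence_class[sentence]
        least[cls] = min(degree, least.get(cls, 1.0))
    return {s: least[equivalence_class[s]] for s in belief}

# Hypothetical belief function that does not respect logical equivalence:
# "Ut1" and "Ut1 & Ut1" are equivalent but receive different degrees of belief.
B = {"Ut1": 0.6, "Ut1 & Ut1": 0.4, "~Ut1": 0.3, "~~~Ut1": 0.3}

# Equivalence classes supplied by hand (an assumption; no theorem prover is involved).
classes = {"Ut1": "c1", "Ut1 & Ut1": "c1", "~Ut1": "c2", "~~~Ut1": "c2"}

B_N = infimum_over_equivalents(B, classes)
```

In this toy case the resulting degrees of belief in `Ut1` and `~Ut1` sum to less than 1, which mirrors the later observation in the proof that BI is not a probability function.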
Now consider a χ ∈ S and let k ∈ ℕ be minimal such that χ ∈ Sk and consider the corresponding proposition F ⊆ Ωk. For all n ≥ max{N, k} we shall show that
If k ≤ N, then for all n ≥ N it holds that Bn(χ) = inf{B(θ) : θ ∈ SN & ⊨χ ↔ θ} = BN(χ). Hence, BI(χ) = BN(χ). For n ≥ N there exist ρ ∈ ϱn such that ρF = χ. Thus,
.
If k > N, then there are two cases. If no θ ∈ Sk−1 is logically equivalent to χ, then Bk(χ) = inf{B(θ) : θ ∈ Sk \ Sk−1 & ⊨ χ ↔ θ}. In this case, we find for all n ≥ k > N
In the other case there does exist some θ ∈ Sk−1 which is logically equivalent to χ. Then Bn(χ) = Bk−1(θ) for all n ≥ k. So BI(χ) = Bk−1(θ). Thus, for all n ≥ max{N, k} ≥ k − 1 it is true that
It thus follows for all P ∈ ℙ and all n ≥ N that
Let us now note that BI(φ) < max{B(φ), B(ψ)}. Thus, BI(φ) + BI(¬φ) < max{B(φ), B(ψ)} + BI(¬φ). Also observe that BI(χ) ≤ B(χ) for all χ ∈ SN. Thus, BI(¬φ) ≤ B(¬φ). Hence,
We infer BI(φ) + BI(¬φ) < 1 and thus BI ∉ ℙ.
Case 2A.
Since BI respects logical equivalence, we obtain by Proposition 16 that
. Applying (13) we obtain
.
Case 2B.
We shall now define a function BJ assigning every proposition a value in [0, 1] as follows. Let τ ∈ S be some tautology. {τ} is a partition. Since
it follows that BI(τ) < 1. Now put BJ(κ) := 1 − BI(τ) for all contradictions κ ∈ S. Clearly, BJ(κ) > 0. For all satisfiable χ ∈ S let BJ (χ) := BI(χ).
Note that
and since BJ(¬τ) > 0 it follows that
. Also note that for all n ∈ ℕ and all P ∈ ℙ it holds that
and so
Since BJ respects logical equivalence we can apply Case 2A to obtain
. But then
. □
Our main minimax theorem (already stated above) then follows immediately from Proposition 17 by applying Lemma 2 and Theorem 3:
Theorem 6 (Regularity minimax). If g is regular and is finitely generated, then
If
, then the unique function with greatest entropy is the equivocator (Proposition 7). Thus by Theorem 6,
Recall that P= assigns all n-states ω ∈ Ωn the same probability,
. So, if the agent does not possess any evidence then all n-states ω ∈ Ωn are believed to the same degree. Absence of evidence entails symmetric degrees of belief. In other words, the three norms of objective Bayesianism entail an instance of the Principle of Indifference.
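A small numerical check may help to fix ideas. The sketch below assumes a toy space of four n-states (for instance, one unary relation symbol and two constants) and a hypothetical helper `entropy` computing −Σ P(ω) log P(ω); the skewed distribution is invented for comparison. It illustrates that the uniform assignment attains entropy log 4, while a non-uniform assignment on the same states has strictly lower entropy.

```python
import math

def entropy(p):
    """Entropy -sum_w p[w] * log p[w], with the convention 0 * log 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Toy state space with four n-states (an assumption made for illustration).
num_states = 4
equivocator = [1 / num_states] * num_states

# An arbitrary non-uniform distribution on the same states.
skewed = [0.4, 0.3, 0.2, 0.1]

entropy_equivocator = entropy(equivocator)
entropy_skewed = entropy(skewed)
```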
Surprisingly, perhaps, symmetry of the weighting function is not necessary to guarantee this instance of the Principle of Indifference on finite sublanguages—see Appendix B.
4.5. Infinite-Language Invariance
So far, we have been working over a fixed predicate language (without quantifiers). One might wonder what would have happened if one had started out with a different such language.
We will investigate this question by considering predicate languages which contain finitely many further relation symbols and/or finitely many further constant symbols than does .
For all languages we consider here, we shall suppose that the ways the constant symbols are ordered are consistent. Furthermore, we suppose that the order types of the constant symbols are ω, the first infinite ordinal. That is, for ⊂ 1 let t1, t2, … be the constant symbols in and let
be the set of constant symbols in 1 which are not in . Then we require that the constant symbols of 1 are ordered such that
- for all n ∈ ℕ, tn appears before tn+1 (consistency),
- for all t ∈ T new there exists some n ∈ ℕ such that t appears before tn (order type ω).
The way the constant symbols of 1 are ordered can be thought of as inserting the t ∈ Tnew into the ordering of the constant symbols of .
From now on, superscripts are used to refer to such predicate languages, while subscripts continue to refer to their respective finite sublanguages. For example,
is the finite sublanguage of 1 which contains only the first n constants of 1. For ⊂ 1, in general, the set of the first n constants of may be different from the set of the first n constants of 1.
Definition 23 (Infinite-Language Invariance). A weighting function g is infinite-language invariant, if and only if the following holds: for all and for all finitely generated by constraints on the finite sublanguage K of , if 1 and 2 are such that ⊆ 1 ⊆ 2, then for all B ∈ minloss
there exists a C ∈ minloss
such that.
Infinite-language invariance is motivated by the thought that simply adding new constant or predicate symbols to the language should not change the inferences which are expressible in the original language . Note the following qualification: since each element of the domain is picked out by some member of , one can infer that in ′ formed by adding constants to , there must be some constants which name the same individual.
We shall now proceed to show that the weighting functions which we focus on in this paper—the regular weighting functions—are infinite-language invariant.
Lemma 9. If ε, ε′ are non-empty and convex sets of the following form
then for
it holds that for all 1 ≤ i ≤ n.
Proof. That the suprema are unique follows from the convexity of the sets ε, ε′ and the fact that
are strictly concave functions on ℙn, respectively, ℙ2n.
Recall that U is the language introduced in Lemma 2.
is a direct consequence of
equivocating beyond
(Proposition 9). □
Theorem 7. If g is regular, then g is infinite-language invariant.
Proof. Let
be finitely generated by constraints expressible in K. Let ⊆ 1 ⊆ 2. By Theorem 6 we obtain minloss
and minloss
, where
and
are the standard entropy limits on 1, respectively, 2.
Let K2 ∈ ℕ be minimal such that
, i.e., the set of the first K2 constant symbols of 2 contains the constant symbols {t1, …, tK} of . It suffices to show that for all n ≥ K2 and all
it holds that
, where
is the set of n-states of 1. Note that the constants in t1, …, tK are in
.
Since the standard entropy limit is finite-language invariant (Section 4.2.1) it follows for n ≥ K2 that
, where
, and
, where
.
We now obtain from Lemma 9 and Proposition 5 that
where ων is the unique maximal state of such that ν ⊨ ων. Thus,
. □
So, neither adding new redundant names for individuals in the domain to nor adding relation symbols which are not constrained by the agent’s evidence on changes one’s rational beliefs in the sentences φ ∈ S.
Language invariance is an important desideratum for reasoning under uncertainty. We have seen that focussing on regular weighting functions ensures language invariance. We conjecture that, if one imposes the desiderata that g be atomic, inclusive, symmetric, refined and infinite-language invariant, then the standard entropy maximiser will be the belief function with the best loss profile. If this is the case then our results for regular weighting functions, which are strongly refined, are symptomatic of a more general phenomenon.
5. Handling Quantifiers
Thus far, we have shown that, on a language ∄ without quantifiers, if the evidence is finitely generated and the weighting function is regular, then the belief function that has the best loss profile is the probability function in
that maximises standard entropy. This provides a justification for all the norms of objective Bayesianism on a language without quantifiers.
As we shall see in Section 5.1, that the language is quantifier free was key here: on a language ∄ with quantifiers, the n-scores become infinite, which makes the comparison of loss profiles impossible. That the evidence is finitely generated is also key: we shall see in Section 6.1 that the minimax result need not hold true if the evidence is not finitely generated.
While the use of scoring rules cannot be readily adapted to a quantified language ∄, we shall see in Section 5.2 that we can nevertheless justify the norms of objective Bayesianism on ∄ if we extend our notion of loss profile and add two further desiderata motivated by the application of objective Bayesianism to inductive logic: that inferences should be language invariant, and that, ceteris paribus, universal hypotheses should be afforded substantial credence.
5.1. Limits to the Minimax Approach
Here we explain why the minimax analysis adopted in Section 4 cannot be applied to the case of a language with quantifier symbols. The problem is that the n-score becomes infinite, making it impossible to compare the scores of different belief functions.
There are two ways in which the n-score becomes infinite. The first is through a failure of super-regularity. A probability function is super-regular if it gives every contingent sentence positive probability. Now, many probability functions that seem eminently rational are not super-regular. For example, if one has no evidence,
, then it is plausible that one is rationally entitled (even if not rationally compelled) to adopt the equivocator function P=, which gives each n-state the same probability, as one’s belief function. However, this probability function will give zero probability to a universally quantified sentence such as ∀xUx. More generally, if evidence is finitely generated then no inclusive, symmetric entropy maximiser will be super-regular:
Proposition 18. Let be finitely generated and let g be symmetric and inclusive. If the sequence has a point of accumulation Q ∈ ℙ, then Q is not super-regular.
Proof. Let U be a relation symbol in of arity r, say. For all n ∈ ℕ let
where ti denotes the tuple of r repetitions of ti.
If
, then by the open-mindedness of entropy maximisers
for all n ≥ K. Thus, for all points of accumulation Q ∈ ℙ it holds that Q(φK) = 0. Hence, Q is not super-regular.
If
, then we apply Proposition 9 to find that for all l ≥ n
Let Q be a point of accumulation of
and let
be a subsequence which converges to Q. Since K is fixed we now find
Q is not super-regular. □
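The simplest instance of this phenomenon, the equivocator in the absence of evidence, can be computed directly. The sketch below is an illustration under stated assumptions: a single unary relation symbol U, no evidence, and a hypothetical helper `equivocator_prob_all_true` that enumerates the truth assignments to the atoms Ut1, …, Utn. Since ∀xUx entails Ut1 ∧ … ∧ Utn for every n, each computed value is an upper bound on the probability of ∀xUx, and the bounds shrink towards zero.

```python
from fractions import Fraction
from itertools import product

def equivocator_prob_all_true(n):
    """Probability of Ut1 & ... & Utn when all truth assignments to the n atoms are equally likely."""
    states = list(product([True, False], repeat=n))
    satisfying = [s for s in states if all(s)]
    return Fraction(len(satisfying), len(states))

# Upper bounds on the probability of the universal sentence, for n = 1, ..., 10.
upper_bounds = [equivocator_prob_all_true(n) for n in range(1, 11)]
```

The bound for n constants is 2^(−n), so no positive probability for ∀xUx is compatible with all of them.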
Now, a failure of super-regularity is not normally problematic—it is simply a well accepted fact that probability theory forces probability 0 (respectively 1) on many sentences which might be true (respectively false). For example, the strong law of large numbers and the various zero-one laws force extreme probabilities. Moreover, the issue of super-regularity did not arise on ∄, where no contingent sentences are given probability 0 by the entropy maximisers considered above. However, a problem does emerge if we try to apply the scoring rule approach to ∄, where super-regularity becomes pertinent. If θ is possible yet is given zero belief by belief function B then the logarithmic loss, −log B(θ), is infinite if θ turns out to be true. Hence, as long as some epistemically possible physical probability function gives positive probability to θ, belief function B will have infinite score. When scores become infinite, they cannot be readily used to compare belief functions. It is clear, for example, that some non-super-regular belief functions will have better loss profiles than others, but this will not be apparent if we define loss profiles in terms of scores. This problem appears to limit the scope of scoring rules to languages without quantifiers.
One might suggest here that the fact that non-super-regular functions lead to infinite scores merely serves to show that one should adopt a super-regular function as one’s belief function. However, there are good grounds for questioning such a conclusion. In particular, consider again the case of a total absence of evidence. As mentioned above, imposing super-regularity rules out the equivocator function P= as a viable belief function. This means that any super-regular function must, in the total absence of evidence, force a skewed distribution on the n-states, for some n. Thus, one is forced to believe some states to a greater degree than others, despite the fact that one has no evidence to distinguish any such state from any other. So super-regularity leads to very counter-intuitive consequences and the infinite score problem suggests that the scoring rule approach breaks down on languages with quantifiers.
There is a second way in which the scores become infinite when quantifiers are admitted into the language. When one admits quantifiers into the language, one introduces the possibility of infinite partitions (Example 1) and it is natural, when defining a scoring rule on such a language, to consider scores on these infinite partitions. If a weighting function is inclusive then for any sentence
, some partition containing θ will be given positive weight. If it is refined, then any partition that refines this partition will be given positive weight, including any infinite partition which refines this partition. The problem is that, even in the total absence of evidence, every belief function has infinite worst-case expected loss over such a partition:
Proposition 19. If there exists a partition consisting of infinitely many sentences such that g(π∞) > 0, then for all it holds that
Proof. Let π∞ = {φ1, φ2, … }. Let
be arbitrary but fixed.
If there exists a φ ∈ π∞ such that B(φ) = 0, then any
with P (φ) > 0 satisfies
.
Now assume that B(φn) > 0 for all n ∈ ℕ.
Since
it holds that
. Thus, there has to exist an infinite set ℕB ⊆ ℕ \ {1} such that n ∈ ℕB implies
. Let
be an enumeration of ℕB. Let
be an enumeration of an infinite subset of ℕB such that
and
for all k ∈ ℕ \ {1}. Since the
tend to infinity, such a sequence
has to exist.
Recall that
. Let
be such that for k ≥ 2 it holds that
We now explain why such a probability function
exists.
The idea is to define a measure which assigns the set of term structures which are a model of
the value
and assigns value zero to all other term structures which do not model any of the
. The probability of an arbitrary sentence
is then the measure assigned to all term structures in which χ holds. One has to be careful about how to set up this measure. Fortunately, the recipe for doing so is well-known.
We follow [7] (p. 164) and define a term structure
of
as a structure with domain {tn : n ∈ ℕ} and each constant symbol tn of
is interpreted in
as itself. We use T to denote the set of term structures of
.
Now let
denote the power set of
and put
For a quantified sentence θ = ∃xθ(x) let T(θ) := ∪i∈ℕT(θ(ti)), similarly for the universal quantifier ∀.
Now let μ* be any (finitely additive and normalised to one) outer measure on
such that
. Particularly simple such outer measures μ* are measures which for all mk assign a single particular term structure
in which
holds the value
.
Next, define R∞ to be the smallest subset of
which contains R and is closed under complements and countable unions. We now define a countably additive measure μ∞ on R∞ as follows: μ∞ : R∞ → [0, 1] such that μ∞(A) = μ*(A) for all A ∈ R∞.
Letting P(θ) := μ∞(T(θ)) defines a probability function as shown in [7] (pp. 168–171). Furthermore, by construction
.
Having demonstrated the existence of the required probability function P, we now show that, for this function P, B incurs an infinite loss. Intuitively, P(φn) can be obtained from the sequence
by inserting zeros and normalising by multiplying with
. The idea behind this definition is to ensure that for all k ∈ ℕ there exists a unique n ∈ ℕB such that
. Furthermore, for these n ∈ ℕB it holds that
. For all other n > 1 we ensure that P(φn) vanishes; P(φ1) is defined such that Σφ∈π P(φ) = 1 holds.
So, when P(φn) > 0 and
we have
Finally, we obtain
In particular, even the super-regular belief functions have infinite score on any such partition, so one cannot say that any super-regular function has lower overall score than a non-super-regular function. This result, then, casts further doubt on the suggestion that it might be preferable to adopt a super-regular function as one’s belief function. Moreover, it clearly suggests that an attempt to extend the minimax approach, which is based on scoring rules, to languages with quantifiers will be fraught with difficulty.
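A concrete instance may make the divergence tangible. The sketch below is not the construction used in the proof but a simplified example under stated assumptions: the belief function gives the n-th cell of the infinite partition degree of belief 2^(−n), the probability function gives it probability proportional to 1/n², and the helpers `partial_expected_loss` and `harmonic` are hypothetical. The expected logarithmic loss over the first N cells is then a constant multiple of the N-th harmonic number, which grows without bound.

```python
import math

def partial_expected_loss(num_terms):
    """Sum over n <= num_terms of P(phi_n) * (-log B(phi_n)) for the toy P and B described above."""
    norm = 6 / math.pi ** 2  # normalising constant, since sum of 1/n^2 equals pi^2 / 6
    total = 0.0
    for n in range(1, num_terms + 1):
        p_n = norm / n ** 2          # assumed probability of the n-th cell
        loss_n = n * math.log(2)     # -log B(phi_n) with assumed B(phi_n) = 2**-n
        total += p_n * loss_n
    return total

def harmonic(num_terms):
    """The num_terms-th harmonic number 1 + 1/2 + ... + 1/num_terms."""
    return sum(1 / n for n in range(1, num_terms + 1))

loss_10 = partial_expected_loss(10)
loss_1000 = partial_expected_loss(1000)
loss_100000 = partial_expected_loss(100000)
```

Each tenfold increase in the number of cells adds roughly the same amount to the partial sum, which is the signature of logarithmic divergence.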
5.2. The Probability Norm
We have argued that there is little scope for straightforwardly extending the minimax analysis to languages with quantifiers because of the problem that scores will quickly become infinite and thus incomparable. So we need another approach if we are to show that the Probability axioms P1-P3, as well as the Calibration and Equivocation norms, apply to languages with quantifiers.
Our plan of attack is as follows. First, as noted in Section 4.5, language invariance is an important desideratum. In particular, one would not want one’s degrees of belief on the sentences of a quantifier-free language
to change if one were to introduce quantifiers into the language. That is, if evidence determines that one should adopt B1 as one’s belief function on
and B2 as one’s belief function on
, where both languages contain the same individuals and relation symbols, then one would want B1 and B2 to agree on quantifier-free sentences of
, i.e., one would want that B1(θ) = B2(θ) for each
.
Thus far, we have argued that a belief function on
, given finitely generated
, ought to satisfy the axioms of probability P1 and P2 on
, as well as the Calibration and Equivocation norms. Given the language invariance desideratum, this implies that the appropriate belief function on
, should, when restricted to quantifier-free sentences, satisfy P1, P2 and the Calibration and Equivocation norms. If we can show that the probability axioms P1-3 should also be satisfied on the language
as a whole, then degrees of belief in the quantified sentences are uniquely determined by those on the quantifier-free sentences [7] (Theorem 11.2): there is no further role that Calibration or Equivocation can play on the quantified sentences. Thus it suffices to argue for the probability axioms on
. As usual, we restrict attention to evidence sets that are finitely generated in the sense of Definition 5, i.e.,
generated by constraints involving sentences of some
and regular weighting functions g.
In Theorem 4 we showed that the default loss incurred by adopting belief function B when φ is true is such that L(φ, B) = − log B(φ), modulo some multiplicative constant. This penalises smaller degrees of belief more than larger degrees of belief. As discussed above, there is little scope for using this to measure the overall expected loss incurred by B on
, and so we cannot directly extend the notion of loss profile developed in Definition 21 to
. However, this default loss function does suggest the following constraint:
(*) Suppose that for all
, B(θ) ≥ B′(θ), and there is some
such that B(φ) > B′(φ). Then B has a better loss profile than B′.
In other words, if the default loss incurred by B′ dominates that incurred by B then B has a better loss profile than B′. We can use (*) to extend our notion of loss profile: the two conditions in Definition 21 apply to quantifier-free sentences in
, and we add the further condition (*) to constrain the quantified sentences. We shall show that the addition of (*) goes some way towards demonstrating P1-3 on
, although we shall have to add a further desideratum in order to complete the derivation.
Definition 24 (Better loss profile on). B has a better loss profile on
than B′ if and only if:
- B ≺ B′ (as defined in Definition 21), or
- B dominates B′ on and there exists some such that B(φ) > B′(φ).
We write B ≺* B′ to denote that B has a better loss profile on than B′. Clearly, ≺* is asymmetric. We will be interested in those belief functions on that have the best loss profile on
, i.e., the minimal elements of ≺*, and define:
Note that if B dominates B′ on
, then B ≺ B′ cannot hold. ≺ and ≺* are thus consistent.
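The dominance clause of Definition 24 can be sketched on a finite sample of sentences. The following is only an illustration under stated assumptions: belief functions are represented as dictionaries over a handful of hypothetical sentence labels, the helper `dominates_strictly` is not from the paper, and the first clause of the definition (B ≺ B′) is not modelled.

```python
def dominates_strictly(b, b_prime):
    """True if b is at least b_prime on every listed sentence and strictly greater on at least one."""
    at_least = all(b[s] >= b_prime[s] for s in b)
    strictly_somewhere = any(b[s] > b_prime[s] for s in b)
    return at_least and strictly_somewhere

# Hypothetical degrees of belief in two quantified sentences.
B = {"forall x Ux": 0.2, "exists x ~Ux": 0.8}
B_prime = {"forall x Ux": 0.1, "exists x ~Ux": 0.8}

b_beats_b_prime = dominates_strictly(B, B_prime)
b_prime_beats_b = dominates_strictly(B_prime, B)
```

On this sample the relation holds in one direction only, and no function strictly dominates itself, which reflects the asymmetry noted above.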
Proposition 20. All B ∈ minloss*
agree with on.
Proof. Since we assume that g is regular and that
is finitely generated we can apply Theorem 6 to obtain that all B ∈ minloss
agree with
on
.
The claim now follows, since B ≺ B′ implies B ≺* B′. □
Proposition 21. If minloss
, then minloss*
.
Proof. By Proposition 10, ≺ is asymmetric, irreflexive and transitive, and thus free of cycles. Hence, for all fixed
there exists some
such that B ≺ B′. This implies B ≺* B′.
Hence, for all
there exists some
such that B ≺* B′. We obtain minloss*
. □
We shall use
to denote an arbitrary but fixed belief function in minloss*
. A priori, it is not clear that such a function B† exists.
The rest of this section does not depend on
, the weighting function g nor the particular probability function the B ∈ minloss
agree with on
. All that matters is that there exists some probability function
the B ∈ minloss
agree with on
. As we know, this is the case if
is finitely generated and g is regular.
Definition 25. A sentence φ is called contingent if and only if both φ and ¬φ are satisfiable.
Lemma 10. For all θ,
such that θ |= φ it holds that B†(φ) ≥ B†(θ). In particular, B†(ψ) = 0 for all contradictions and B†(χ) = 1 for all tautologies.
For θ,
we have already seen that B†(φ) ≥ B†(θ), this followed from B† satisfying P1 and P2 on
.
Proof. Case 1. θ is a contradiction.
For a tautology
, {τ, θ} is a partition. Since B†(τ) = 1 and B†(τ) + B†(θ) ≤ 1 it follows that B†(θ) = 0. Hence, B†(φ) ≥ 0 = B†(θ).
Case 2. θ is a tautology.
Let
be a contradiction. We just proved that B†(χ) = 0. The only constraints applying to B†(θ) are of the form B†(θ) + B†(χ) ≤ 1 where χ is a contradiction and of the form B†(θ) ≤ 1. Thus, the only meaningful constraint on B†(θ) is B†(θ) ≤ 1. By (*) we have B†(θ) = 1.
Since θ implies φ, φ has to be a tautology, too. Hence, B†(φ) = 1 = B†(θ).
Case 3. θ is contingent.
If φ is a tautology, then B†(φ) = 1 by the above and we are done.
Note that φ cannot be a contradiction since θ is satisfiable.
Assume from now on that φ is contingent.
Case 3A |= θ ↔ φ.
For all index sets I and all sentences
the following are equivalent
- ,
(*) implies that B†(φ) = B†(θ).
Case 3B θ, φ and φ ∧ ¬θ are contingent.
Let I be any countable index set and let
for i ∈ I be contingent such that
Then by the consistency of θ and φ ∧ ¬θ
And since θ |= φ
From normalisation (Definition 1) we now obtain
Note that the equations in (15) are the only constraints which constrain B†(φ). In particular, B†(φ) = B†(θ) will not violate any constraint in (15).
The question arises whether B†(φ) = B†(θ) imposes any further constraints.
B†(φ) only imposes constraints on the B†(φi) for i ∈ I. Let i ∈ I be fixed and let J be an index set and
be such that
. Then
. Thus, B†(φ) = B†(θ) does not impose any further constraint on B†(φi) which is not already imposed by B†(θ).
By (*) we now find B†(θ) ≤ B†(φ). □
Corollary 7. B† respects logical equivalence on.
Proof. If φ,
are logically equivalent, then B†(φ) ≤ B†(θ) ≤ B†(φ) and thus B†(φ) = B†(θ). □
Corollary 8. For all it holds that
Proof. First note that
implies
. Thus,
is a (not necessarily strictly) increasing sequence in [0, 1] which has a limit. Finally, note that for all
implies ∃xθ(x). Hence, B†(∃xθ(x)) has to be greater than or equal to the limit. □
Corollary 9 (Superadditivity of B† on
). If |= ¬(θ ∧ φ), then B†(θ) + B†(φ) ≤ B†(θ ∨ φ).
Proof. If either θ or φ is a contradiction or a tautology, then the corollary follows trivially.
If θ ∨ φ is a tautology, then the corollary follows trivially, too.
It remains to consider the case of contingent θ ∨ φ. By the above we may assume that θ and φ are contingent. Let I be any countable index set and let
for i ∈ I be satisfiable such that
Then,
From normalisation (Definition 1) we now obtain
The same reasoning as in Lemma 10 about constraints now yields: B†(θ) + B†(φ) ≤ B†(θ ∨ φ). □
Lemma 11. For all it holds that B†(θ) + B†(¬θ) = 1.
In particular, this means that
.
Proof. If θ is not contingent, then the lemma holds trivially.
Now assume that θ is contingent and B†(θ) + B†(¬θ) < 1.
Case 1 There exist contingent
such that
with
Note that
and thus
. Adding the above equations we now obtain
B†(θ) + B†(¬θ) ≥ 1 follows. Contradiction.
Case 2 For all
with θ ∈ π and all
with
it holds that
and
.
Applying (*) we obtain a contradiction since B†(θ) or B†(¬θ) could have been set to a greater number.
Case 3 For all
with θ ∈ π it holds that
and there exists a partition
with
such that ∑φ∈π′ B†(φ) = 1.
Let π′ consist of contingent (φi)i∈I and ¬θ. For
with θ ∈ π we have for all finite J ⊆ I that
In the same manner as in the proof of Lemma 10 it follows that B†(θ) ≥ ∑j∈J B†(φj). Since this holds for all finite J ⊆ I and I can be at most countable, it follows that B†(θ) ≥ ∑i∈I B†(φi).
From B†(¬θ) + ∑i∈I B†(φi) = ∑φ∈π′ B†(φ) = 1 the required contradiction follows:
□
(*) is not strong enough to uniquely determine B† on
. We invoke the following further desideratum to pin down B†: ceteris paribus, prefer belief function B to belief function B′ if B gives greater degree of belief to some universally quantified sentence than does B′. One has to be a bit careful about how one formulates such a principle, in order to specify it in such a way that it can be applied consistently. One can appeal to the concept of prenex normal form in order to formulate this desideratum:
(∀*) Suppose that neither of B, B′ has a better loss profile on
than the other. Furthermore, suppose there exists a minimal quantifier rank q such that the following hold: For all
in prenex normal form with a quantifier rank of q−1 or less it holds that B(φ) = B′(φ) and for all universally quantified
in prenex normal form of quantifier rank q it holds that B(θ) ≥ B′(θ) and the inequality is strict at least once. Then B is to be preferred to B′.
The motivation behind (∀*) is not in terms of loss. Rather, the motivation stems from the application to inductive logic (see Section 3.3). The use of probability in inductive logic has been roundly criticised for tending to give non-tautological universal laws probability zero, when such laws are widely (and seemingly rationally) believed in science and beyond; see, e.g., Popper [17] (Appendix *vii). Thus there seems good reason to prefer, ceteris paribus, those probability functions which give more credence to universal hypotheses. (There is a flip-side to (∀*). The more credence one gives to a universal statement ∀xθ(x), the less credence one must give to ∃x¬θ(x). One might motivate the latter policy by appeal to Ockham’s Razor, which demands scepticism with respect to the existence of entities, particularly new kinds of entity.)
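The zero-probability worry can be made concrete with a minimal sketch. It assumes a language with a single unary predicate U and the equivocator, which gives each of the 2^n n-states the same probability; the function name is purely illustrative. The conjunction Ut1 ∧ … ∧ Utn is a single n-state, so its probability is 2^(−n), and Gaifman's condition identifies the probability of ∀xUx with the limit of these values, which is zero.

```python
from fractions import Fraction


def equivocator_prob_all_U(n: int) -> Fraction:
    """Probability the equivocator gives to U(t1) & ... & U(tn).

    Illustrative assumption: one unary predicate U, so there are 2**n
    n-states and the conjunction is exactly one of them.
    """
    return Fraction(1, 2 ** n)


# Finite approximations to the probability of the universal sentence.
finite_approximations = [equivocator_prob_all_U(n) for n in range(1, 11)]
for n, p in enumerate(finite_approximations, start=1):
    print(n, p)
```

The values are strictly decreasing and halve at each step, so a function of this kind cannot give ∀xUx positive credence. (∀*) is aimed at preferring functions that do better where the evidence permits.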
This leaves us with some desiderata that stem from considerations to do with loss, namely the criteria that make up Definition 21—appealing to dominance of loss, dominance of expected loss, and worst-case expected loss—and some desiderata that stem from the application to inductive logic, namely language invariance and (∀*). These desiderata taken together are enough to justify the norms of objective Bayesianism on
, as we shall proceed to show in the remainder of this section.
We shall see first that (∀*) is responsible for ensuring that the degree of belief B(∀xθ(x)), which is already constrained to
, is equal to the upper bound. On the other hand, B(∃xθ(x)) comes out to be
. An arbitrary belief function B† ∈ minloss*
which is also optimal according to (∀*) will be denoted by
.
Proposition 22. For all universally quantified sentences it holds that.
Proof. First note that
for all n ∈ ℕ and we thus obtain from Lemma 10 that
.
We now prove by an argument on quantifier ranks that
Assume for contradiction that there exists a minimal quantifier rank q ≥ 1 and a sentence ∀xψ(x) in prenex normal form of quantifier rank q such that
.
We now define a function B′ which will be preferred to
which contradicts our standing assumption that no function is preferred to
. Let
for all sentences
which are in prenex normal form and have a quantifier rank of q − 1 or less. In particular,
and B′ agree on
.
For all
in prenex normal form of quantifier rank q − 1 we let
and
Now arbitrarily extend B′ to a function in
.
Note that
and
. So, (*) does not discriminate between
and B′. Hence,
and B′ are equally preferable according to ≺*.
and B′ agree on all sentences in prenex normal form of quantifier rank q−1. Since
has to hold for all
it follows for φ(x) in prenex normal form of quantifier rank q − 1 that
and for ∀xφ(x) = ψ the inequality is sharp. (∀*) now implies that B′ is preferred to
.
Finally, every sentence of the form ∀xθ(x) is logically equivalent to a universally quantified sentence φ = ∀xφ(x) in prenex normal form. Note that θ(t) is logically equivalent to φ(t) for all constants t. Hence,
□
Proposition 23. satisfies the axiom P3.
Proof. Applying Lemma 11, Proposition 22 and applying Lemma 11 a second time we find
□
The following might be of interest outside the context of this paper since it generalises Gaifman’s Theorem, [5] (Theorem 1).
Proposition 24. If satisfies
- f(θ) = 1 for all tautologies,
- for all mutually exclusive θ, it holds that,
- for all and – [P3]
- f respects logical equivalence on − [P4],
then f is a probability function, i.e.,
.
Clearly, P1 on
and P4 jointly imply P1.
Proof. First note that f agrees with some probability function on the quantifier free sentences of
. By Gaifman’s Theorem, this probability function is unique on
; it shall be denoted by Pf.
We now show that f = Pf. We need to show for all
that f(φ) = Pf(φ).
First, write φ in prenex normal form, φpre. Note that f(φ) = f(φpre).
Next, we do a proof by induction on the quantifier-block rank of φpre to show that f(φpre) = Pf(φpre). The quantifier-block rank of φpre is the number of alternating quantifier blocks in φpre.
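To make the induction parameter concrete, here is a minimal sketch that computes the quantifier-block rank from the quantifier prefix of a sentence in prenex normal form. The string encoding ('A' for ∀, 'E' for ∃) and the function names are assumptions made for illustration only.

```python
from itertools import groupby


def quantifier_block_rank(prefix: str) -> int:
    """Count the maximal runs of identical quantifiers in a prenex prefix.

    The prefix is encoded as a string over {'A', 'E'}, e.g. "AAE" stands
    for the prefix of a sentence of the form ∀x∀y∃z χ(x, y, z).
    """
    if any(q not in "AE" for q in prefix):
        raise ValueError("prefix may only contain 'A' and 'E'")
    # groupby yields one group per maximal run of equal symbols.
    return sum(1 for _ in groupby(prefix))


def quantifier_count(prefix: str) -> int:
    """Number of quantifiers in the prefix, for comparison with the block rank."""
    return len(prefix)


for prefix in ["", "AAA", "EA", "AAEEA"]:
    print(repr(prefix), quantifier_block_rank(prefix), quantifier_count(prefix))
```

A quantifier-free sentence has block rank zero, which is the base case of the induction. Enlarging a block increases the number of quantifiers but leaves the block rank unchanged.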
Base case φpre is of quantifier block rank zero, i.e., φpre does not contain quantifiers. Then
where the second equation holds since f and Pf agree on all sentences of
. The first and the last equation hold since f and Pf respect logical equivalence on
. This fact will be used without further mention.
Inductive step φpre is of quantifier block rank q ≥ 1.
Let us first suppose that
For q ≥ 2 the first symbol of χ is a universal quantifier, ∀; for q = 1, the first symbol of χ is a relation symbol, a negation symbol or an opening bracket. We find for q = 1
where we may substitute Pf for f since χ is quantifier-free and we can thus apply the induction hypothesis.
For q ≥ 2
, where Q = ∃ for odd q and Q = ∀ for even q.
First, here is an example of two logically equivalent sentences:
Note that the quantifier block rank of the sentence on the right of “↔” is two. The quantifier block rank has been kept low at the price of larger blocks of quantifiers. Since we are giving a proof by induction on the quantifier block rank, we do not have to worry about paying this price. To denote the larger blocks we will use
. In general, the greater the number of variables and on the left of an
, the greater the number of variables in
.
Now let us compute
“I H” indicates that we used the induction hypothesis on a sentence of quantifier rank q − 1.
The case of
= ∀xχ(x) is analogous; simply replace the disjunctions by conjunctions.
Theorem 8. If is finitely generated and g is regular, then
Proof. By Proposition 24 we only need to convince ourselves that
satisfies P1 on
, P2 on
, P3 and P4 in order to conclude that
. Note that we have done so in Theorem 6, Proposition 23 and Corollary 7. So all
are probability functions.
All
agree on
with
. Two different probability functions have to disagree on a quantifier-free sentence (Gaifman’s theorem). Hence,
is unique and equal to
.
We should point out that (∀*) was only used in Proposition 23. We showed that (*) alone is enough to force that
satisfies P1, P2 on
,
and P4.
In sum, then, by invoking two new considerations, (*) and (∀*), one can show that the Probability norm must hold on a predicate language with quantifiers. Since the Calibration and Equivocation norms are already forced on the quantifier-free sentences, and probabilities on these quantifier-free sentences determine those of the quantified sentences, all the norms of objective Bayesianism hold on
, assuming that the weighting function is regular and the evidence is finitely generated.
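For finitely generated evidence, the recommended function on the quantifier-free sentences can be found by maximising standard entropy over the calibrated probability functions. The following is a minimal sketch under assumptions that are not taken from the paper: a toy language with two atomic sentences Ut1 and Ut2, hypothetical evidence fixing P(Ut1) = 0.8, and a coarse grid search in place of a proper optimiser.

```python
import math


def entropy(p):
    """Standard entropy -sum p log p, reading 0 * log 0 as 0."""
    return -sum(x * math.log(x) for x in p if x > 0)


def candidates(c=0.8, steps=100):
    """Grid of probability functions on the four 2-states with P(Ut1) = c.

    State order: Ut1 & Ut2, Ut1 & ~Ut2, ~Ut1 & Ut2, ~Ut1 & ~Ut2.
    The evidence P(Ut1) = c is a hypothetical example.
    """
    for i in range(steps + 1):
        for j in range(steps + 1):
            a, b = i / steps, j / steps
            yield (c * a, c * (1 - a), (1 - c) * b, (1 - c) * (1 - b))


# The entropy maximiser on the grid splits each constrained block evenly.
best = max(candidates(), key=entropy)
print(best, entropy(best))
```

Up to floating-point rounding, the maximiser found on the grid is (0.4, 0.4, 0.1, 0.1): it satisfies the constraint and is otherwise as equivocal as possible.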
6. More Complex Evidence
The question arises as to which functions have an optimal loss profile when
is not finitely generated. In Section 6.2 we shall present a tractable case and show that in that example the function with maximal standard entropy has the best loss profile. First, in Section 6.1, we shall see that not all examples admit of such an analysis. In particular, we shall analyse an example in some depth in which
. Thus, when evidence is not finitely generated, the optimal loss profile may not be achievable by maximising entropy.
6.1. When Losses Cannot Be Minimised
We shall now develop an example in which the minimax theorem fails:
as we shall see in Proposition 27. However, the entropy identity,
, does hold (Proposition 25 and Proposition 26). The connection with optimal loss fails to obtain since minloss
(Proposition 30). Thus, there is no belief function with an optimal loss profile in this sort of example. Nevertheless, certain equivocal functions
derived from the maximal entropy function come arbitrarily close to having the best loss profile (Proposition 29 and Proposition 31). So, while there is no unique function with the best loss profile, the functions
have a very good loss profile.
In the following discussion we shall focus on the simplest possible language,
, which contains only one relation symbol, U, which is unary. We focus on this simple language since the minimax results already fail here and considering more expressive languages does not lead to new insights while creating more notational issues. As a technical convenience, we extend the notion of a loss profile to arbitrary functions
, not merely normalised belief functions.
The example that we shall consider is generated by the following evidence:
Let
be the k-th n-state of
, i.e.,
. The set of calibrated probability functions can be characterised in various ways:
The last two characterisations employ quantifiers; adding quantifiers to the language enables a finite representation of what is essentially an infinitely generated evidence set. Hence in Definition 5, we specified that an evidence set is finitely generated just if it is generated by quantifier-free sentences of some finite sublanguage.
We now begin our analysis of this example:
Proposition 25. If g = gΩ or if g is symmetric and inclusive, then and is not open-minded.
Proof. For all
Then, by Landes and Williamson [4] (Corollary 6, p. 3574) for symmetric and inclusive g
and so for all
and all 1 ≤ i ≤ 2^{n−1}
For all
and i ∈ {1, 2^{n−1}+1, …, 2^n}
The result for g = gΩ follows in the same way as above. □
We shall note for later reference that for all n ≥ 2
Proposition 26. If g = gΩ or if g is regular, then
Proof. First note that
.
We shall show that for all
there exists an
such that for all n ≥ N we have
and
.
Since
there exists a minimal
and a k-state ν ∈ Ωk such that Q(ν) > PΩ†(ν) ≥ 0.
Case 1.
To simplify notation let α := Pk†(ν) = Pk†()
Let us now define a function
. Note that since we want
to be a member of
we need to let
. Now let for all
The restriction operator
applied to some belief function B continues to refer to the restriction of B to
, rather than to the restriction to
.
Note that for all n ≥ 1
since entropy maximisers assign n-states the same degree of belief whenever possible [4] (Corollary 7, p. 3577). Thus,
.
Let us compute for n ≥ k
It follows for all large enough n ∈ ℕ that
.
For regular g we now find
So, as long as
goes to zero quickly enough it follows that
for large enough n. Corollary 6 shows that this is indeed the case for regular g.
Case 2 Since Q is assumed to be calibrated,
, this case cannot occur.
Case 3.
Case 3A.
Then
But for all n ∈ ℕ
Since Q ≠ PΩ† it follows that there exists some N ∈ ℕ such that
. But then
.
Case 3B.
Then
. Proceed as in Case 1. □
Proposition 27. If g = gΩ or if g is regular, then.
Proof. We here show that there exists an
such that for all n ∈ ℕ it holds that
and that there exists an open-minded
such that for all n ∈ ℕ we have
.
Note that the probability function
with
. Then
.
We shall now construct an open-minded
as advertised. For all n ∈ ℕ let
Thus, Q is open-minded and hence
for all n ∈ ℕ. □
Note that Condition 1 of Definition 21 is solely responsible for the fact that
. Condition 2 has played no role here.
So far, we have established that
does not have the best loss profile. The question arises whether there exists a belief function
which is a minimal element of ≺, i.e.,
.
Proposition 28. If g = gΩ, then
Initially, one might suspect that
would be somehow due to the fact that the
do not take beliefs in all sentences into account. This is not the case. As we will see,
holds. That is, even when restricting attention to probability functions, whose values on the n-states completely determine degrees of beliefs in all other sentences, we cannot find a function with an optimal loss profile.
Proof. Suppose for contradiction that
.
If Q is not open-minded, then there exists an N ∈ ℕ, an F ⊆ ΩN and an
such that °P(F) > 0 and °Q(F) = 0. But then there has to exist some ω ∈ ΩN with ω ∈ F such that P(ω) > 0 = Q(ω) since Q and P are probability functions. Thus, for all n ≥ N there exists some ν ∈ Ωn such that ν ⊨ ω with P(ν) > 0 = Q(ν). But then
for all n ≥ N.
In the proof of Proposition 27 we constructed an open-minded function Q+ ∈ E. For Q+ we have for all n that
. So, any
has to be open-minded.
Case 1and Q ∉ E
Since
there has to exist a minimal k ≥ 2 such that
.
We next define a probability function
with the following construction for all n ≥ 2
It follows that for all n ≥ k and all
and all
such that P (ω) > 0 it holds that.
For all large enough n ∈ ℕ we then find
Hence, there has to exist a
.
Case 2 Q.
Thus,
Let N ≥ 3 be such that
For n ≥ N let
We now find for all fixed n ≥ N that
We shall now define a function
by letting for all n ≥ 2:
That is,
.
For large enough M ∈ ℕ it holds for all n ≥ M that
Furthermore, for all n ≥ max{M, N} it holds that
and hence for all large enough fixed n ∈ ℕ and all
Thus, R has a better loss profile than Q. Hence,
.
Finally, let us consider loss profiles for
.
Case 3..
For all
, the expression
only depends on the degrees of belief B assigns to sentences which represent an n-state. So, the degrees of belief in sentences φ ∈ S which do not n-represent an n-state are ignored by
for all n and all
. If B agrees with some probability function
on all sentences of S∄ which n-represent an n-state, then B and P are equally preferable according to ≺. As we saw above, for all
there exists some
with Q ≺ P. Thus, B cannot be a minimal element of ≺.
We can hence assume that for all
there exists some sentence
which n-represents an n-state such that B(φ) ≠ P(φ). Since no
is dominated, it follows that B(φ) < P(φ).
First define a function B0 as follows:
B0, which does not agree with any probability function on
has been constructed in such a way that B and B0 are equally preferred according to ≺.
Next define a function B+ by first letting for all fixed N ∈ ℕ
for all sentences
which are logically equivalent to an N-state. Put B+(ψ) := 0 for all other
.
Since B+ dominates B0 the loss profile of B+ cannot be worse than that of B0. Furthermore, note that for all N ∈ ℕ, all ω ∈ ΩN and all n > N it holds that
Let
. For α = 0 it follows by the usual reasoning that B+ cannot have an ideal loss profile. This leads to a contradiction in the usual way.
For 1 ≥ α > 0 define a function B∞ by first letting for all sentences
which are logically equivalent to some n-state ω
For all other sentences
let B∞(φ) := 0.
Observe that for all k ∈ ℕ and all ω ∈ Ωk
Finally, we note that B∞ agrees with some
on all sentences in
which represent a state. Then B cannot have a better loss profile than P. As we saw in Case 1 and Case 2, for all
there exists a
which has a strictly better loss profile than P. This contradicts B ∈ minloss
. □
Denote by
the unique probability function in
satisfying for all n ∈ ℕ
That is,
agrees with
on
and equivocates beyond
as much as possible while satisfying
Proposition 29. For all ϵ > 0 there exists an N ∈ ℕ such that for all n ≥ N
Proof. For all large enough N ∈ ℕ and even larger n ∈ ℕ we find
For ϵ > 0 let N > 2 be such that
. Then for all n ≥ N it holds that
. For n ≥ N large enough we now obtain
□
Having considered loss for
we now investigate loss for regular
.
Proposition 30. If is regular, then minloss
.
Proof. We will show that ≺ has no minimal element. Suppose for contradiction that B ∈
is such a minimal element.
Define a function
by
B′ and B are equally preferable according to ≺ since P (φ) = 0 for all P ∈
and all such φ.
For all φ ∈
let nφ be the minimal n such that φ ∈
. Now define a function Binf by first letting
Put Binf(φ) := B′(φ) for all other φ ∈
. For all φ ∈
it holds that Binf(φ) ≤ B(φ). Furthermore, Binf is equally preferable to B′ according to ≺. We now consider cases to show that there is a function with a strictly better loss profile than Binf, which contradicts our assumption that B ∈ minloss
.
Case A There exists some N ∈ ℕ such that for all n ≥ N, Binf and
agree on all n-states. Since
it holds that
and hence
. Thus, for all n ≥ Nand
agree on all n-states. But then for all n ≥ N all F ⊆ Ωn and all ρ ∈ ϱn. Hence, for all P ∈
it holds that
.
From the above we have that for all n ≥ N there exists an F ⊆ Ωn such that
and such that
for some ρ. Thus, there exists some P ∈
with °P(F) > 0. Then
for this P ∈
and all n ≥ N.
Thus,
by Condition 2 of Definition 21.
Case B There exist infinitely many n ∈ ℕ where Binf and
agree on all n-states and infinitely many n ∈ ℕ where they do not agree on all n-states.
Since
is a probability function it follows that for all n ∈ ℕ, all F ⊆ Ωn and all ρ ∈ ϱn has to hold. Now proceed as in Case A.
Case C The number of n ∈ ℕ for which Binf and
agree on all n-states is finite (possibly zero).
Case C1 There exists an infinite set J ⊆ ℕ, J = {j1, j2, … }, such that limi→∞.
If
dominates Binf, we are done.
If
does not dominate Binf, then define a function B1 ∈
by letting for all n ∈ ℕ and all F ⊆ Ωn
and requiring that B1 satisfies logical equivalence on L∄. For all φ ∈
use Gaifman’s condition to ensure that B1 is a probability function.
Since we assumed that
does not dominate Binf holds. Furthermore, B1 dominates Binf. So, the loss profile of B1 ∈
is at least equally good as that of B.
We complete this proof by showing that
∩ minloss
.
Now suppose for contradiction that there exists a function Q ∈
∩ minloss
such that
for some n ≥ 2, i.e., Q ∉
. It needs to hold that
for all n ∈ ℕ (open-mindedness).
Let k ≥ 2 be minimal such that
. Now define a function R ∈
by letting for all n > k
That is, R is the arithmetic mean of Q and
on
. Beyond
, R equivocates under the k-states which imply Ut1. For such n-states
holds. Beyond
, there are only two n-states which imply ¬Ut1 which are assigned non-zero probability,
and
.
We now show that R has a strictly better loss profile than Q, which contradicts Q ∈ minloss
.
Let
∈ arg
. Trivially,
. Next note that for all n ≥ k which are large enough it holds that
and that
We now find for all large enough n > k that
Whenever °P (F) > 0 with F ⊆ Ωn, then °R(F) is bounded from below by
. Hence, the last term in the above sum converges to zero, since g is regular.
We now obtain the contradiction as follows: there exists some ϵ > 0 such that for all large enough n ≥ k it holds that
We have thus shown that if
∩ minloss
, then there exists some Q ∈
∩ minloss
.
Case C1A. Then Q has infinite worst-case expected loss for all n ∈ ℕ and we are done.
Case C1B.
By open-mindedness,
has to hold.
For all n ∈ ℕ let
∈ arg
From Q ∈
we now obtain that for all large enough n there exists a probability function R ∈ arg
such that
.
Next, define a probability function Q′ ∈
where
and Q′ equivocates over Ut1,
for all n ∈ ℕ and for all 2^{n−1} + 1 ≤ i ≤ 2^n. Assume for contradiction that Q ≠ Q′.
We next show that Q′ ≺ Q. This contradicts Q ∈ minloss
. To this end let us note that for all large enough n
Since whenever °P (F) > 0, then °Q′(F) is bounded from below by
.
Thus, for all large enough n we have
g is regular, hence, this last term converges to zero. We thus obtain
Since Q ≠ Q′, Q, Q′ ∈
and
, there has to exist some minimal k ∈ ℕ and a minimal
i ≥ 2^{k−1} + 1 such that
. We now find for all large enough n that
Recall that there exists 0 < a ≤ b such that for all n ∈ ℕ a ≤ g(πn) ≤ b holds. Hence, there exists some constant c > 0 such that
. From (17) we conclude that for all large enough n
holds. Thus, Q′ ≺ Q. So, Q ∉ minloss
.
To complete the proof of Case C1B we show that there exists some N ∈ ℕ such that
has a strictly better loss profile than Q′.
Let N ∈ ℕ be such that
. Analogous to the above it holds that
It hence suffices to show that there exists some ε > 0 such that for large enough N ∈ ℕ and all n ≥ N
We now recall that
. The required inequality follows for large enough n ∈ ℕ
Hence,
.
Case C2 There exist an α > 0 and a minimal N1 such that for all
holds.
We may assume that Binf is open-minded on ∄. Thus there has to exist some minimal N ≥ N1 such that
for all n ≥ N. For all large enough n ≥ N we now find
□
Proposition 31. For all regular g and all ϵ > 0 there exists an N ∈ ℕ such that for all n ≥ N
Proof. Let ϵ > 0 be fixed. By (18) it suffices to show that there exists some N ∈ ℕ such that for all n ≥ N it holds that
Now simply note that we have proved this already in Proposition 29. □
Hence, for all ϵ > 0 there exists some N ∈ ℕ such that for all n ≥ N and all Q ∈
Although
is not a minimal element of ≺, the losses incurred by adopting any other B ∈
can only be marginally better, eventually.
Thus, for fixed k and δ > 0 there exists an N ∈ ℕ such that for all
. Hence, belief functions with an arbitrarily good loss can be found within a (Euclidean) neighbourhood of
.
Since the
are probability functions, there does not exist a B ∈
which dominates
on ∃ or on ∄. Furthermore, the
are optimal according to (∀*). The
thus are almost optimal in all the senses we here considered.
In essence, the phenomenon of minloss
arises from
having a strictly better loss profile than
but the limit of the sequence
is
, which is not open-minded. This phenomenon is reminiscent of min{x ∈ ℝ : 0 < x < 1} = ∅, where it is possible to get ever closer to zero but it is impossible to reach it.
6.2. When Losses Can Be Minimised
The analysis of Section 6.1 shows that there can be no general minimax theorem which covers any evidence that is not finitely generated. On the other hand, we shall see in this section that for certain natural cases of evidence which cannot be finitely generated, minimax theorems do obtain.
Let contain only one m-ary relation symbol, U, and c ∈ [0, 1]. Let
and let
be an enumeration of the remaining n-states. We shall consider the following example:
Slightly less general versions of
have attracted recent interest in the literature [18] (Example 3, p. 95), [19] (Example 3.5, p. 172) and [1] (Example 5.7, p. 99). We here consider relation symbols U of arbitrary arity, while previously U was taken to be unary.
First of all, if c = 0 and g is symmetric and inclusive, then P= ∈
and we immediately obtain that
and
.
We shall assume from now on that c > 0.
Proposition 32. For symmetric and inclusive g it holds that and and for all n ∈ ℕ and all 1 ≤ i ≤ |Ωn|.
Proof. For all n ≥ 2 and symmetric and inclusive gn it holds that
for all 1 ≤ i ≤ |Ωn| − 2 by [4] (Corollary 7, p. 3577). Thus, there exists some λn ≥ 0 such that
and
for all 2 ≤ k ≤ |Ωn|.
For all n ∈ ℕ, now define a function P1 ∈
by
. Then, define a convex combination of the equivocator on
and P1 by
. Recall that gn is equivocator-preserving (Proposition 7) and that
is strictly concave on ℙn (Lemma 1). Thus,
for all
.
On the one hand, g-entropy strictly increases with decreasing λn; on the other hand,
imposes the constraint
. Let N ∈ ℕ be minimal with
Then for all n ≥ N it holds that
and
for all 1 ≤ i ≤ |Ωn| −2.
For all r ≥ N it follows that
Thus, for all r ≥ N we find
Thus, for all n ∈ ℕ
and
.
We now show that P† is indeed a probability function. We need to show that
for all n ∈ ℕ and all ω ∈ Ωn:
Finally, observe that
. Hence,
. □
Proposition 33. If g = gΩ or if g is regular, then maxent
.
Proof. Let Q ∈
. For regular g, it suffices to show that there exists an N ∈ ℕ such that for all
holds.
Since
there has to exist a minimal N ∈ ℕ and an N-state ω′ ∈
such that
.
Now define a function Q′: S → [0, 1] by requiring that Q′ respects logical equivalence, Q and Q′ agree on SN,
- Q(ω0) for all n > N all ν′ ∈ Ωn with ν′ ⊨ ω′,
- for all n > N and
- for all n > N and all ν ∈ Ωn \ with ν ⊭ ω′
In general, Q′ is not a probability function because
Note that for all
holds.
We now show that for all large enough n holds. Let us first compute
Since
we now find with
that
Since
there exists some ϵ > 0 such that for all large enough n
This establishes the result for g = gΩ.
We now turn to regular g.
The last sum goes to zero since g is regular (Corollary 6). Eventually,
is greater than some ϵ > 0, as we established in the first part of the proof. Thus, for all large enough n ∈ ℕ and all
we have
□
Lemma 12. The following three conditions are equivalent for all large enough n ∈ ℕ and inclusive and symmetric g
Proof. Note that for all P ∈ ℙ
The term between the last set of brackets () does not depend on i. So,
only depends on P(ν1) but not on how P distributes probabilities among the other n-states.
For large enough N ∈ ℕ it holds that
for all 3 ≤ i ≤ |Ωn|.
Since g is symmetric, γn(F) is only a function of the size of F, |F|, it follows that every
assigns as little probability as possible to ν1. Since we require that
it follows that P′(ν1) = c.
The result for
follows as above by noting that for g = gΩ it holds that γn(ν) = 1 for all n-states ν ∈ Ωn and γn(F) = 0 otherwise. □
Adapting Joyce’s notion of truth-directedness [14] we define:
Definition 26 (Chance-directed scoring rule). A function Ff: [0, 1] × [0, 1] → [0, +∞] of the form Ff (x, y) = x · f(y) + (1 − x) · f(1 − y) is called chance-directed, if and only if for all x ∈ [0, 1], all 0 ≤ λ < 1 and all y ∈ [0, 1] \ {x}
holds. For a scoring rule Ff this formalises the idea that beliefs which are closer to the chances on two mutually exclusive and exhaustive events are strictly better scored.
In particular, Ff(x, y) = −x log y − (1 − x) log(1 − y) is chance-directed. The score improves by simultaneously moving y closer to x and 1 − y closer to 1 − x.
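The behaviour of the logarithmic rule can be checked numerically. The sketch below is only an illustration: it assumes that chance-directedness requires the score at a point strictly between x and y (or at x itself) to be strictly lower than the score at y, and the helper names are hypothetical.

```python
import math


def log_score(x, y):
    """F(x, y) = -x log y - (1 - x) log(1 - y), reading 0 * log 0 as 0."""
    def term(weight, belief):
        if weight == 0:
            return 0.0
        if belief == 0:
            return math.inf
        return -weight * math.log(belief)

    return term(x, y) + term(1 - x, 1 - y)


def move_towards(x, y, lam):
    """Point between the chance x and the belief y; lam = 0 gives x itself."""
    return x + lam * (y - x)


def closer_is_better(x, y, lam):
    """Assumed form of the condition: moving y towards x strictly lowers the score."""
    return log_score(x, move_towards(x, y, lam)) < log_score(x, y)


x, y = 0.3, 0.9
for lam in (0.0, 0.25, 0.5, 0.75):
    print(lam, log_score(x, move_towards(x, y, lam)), closer_is_better(x, y, lam))
```

For fixed x the score is strictly convex in y with its minimum at y = x, which is why every intermediate point scores strictly better than y.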
Proposition 34. If g is regular, then all B ∈ minloss
agree with on ∄.
Proof. If c = 1, then
and maxent
follows trivially. By Theorem 5 we have that for every function
it holds that
. Thus, all B ∈ minloss
agree with
on ∄.
We now focus on 0 < c < 1.
From the above lemma we obtain
We now follow the structure of the proof of Proposition 16 for fixed 0 < c < 1. Let B ∈ minloss
.
Case1.
Case1A.
If there exists an n ∈ ℕ such that
, then
. If there exists an m ∈ ℕ such that
, then there has to exist some k > m such that
Since
either such an n ∈ ℕ or such a k ∈ ℕ has to exist, possibly both exist. Overall, there has to exist some N ∈ ℕ, a
and an ϵ > 0 such that
.
For large enough n ∈ ℕ, depending on B, and c, it holds that
Since we may assume that
converges in n to c we now find
Whether this limit exists or not, we have thus established that for large enough n ∈ ℕ there exists a lower bound of the sequence
which is strictly positive, since we take N ∈ ℕ to be fixed here.
For all fixed n ∈ ℕ let
be such that
and
. Note that
for all large enough n and
for all large enough n, Lemma 12.
To simplify notation let
. With this notation we have for all large enough n ∈ ℕ
By our standing assumption on g (regularity), we obtain that Rn converges to zero. We now find
Because g(πn) is bounded and Rn converges to zero, we obtain for all large enough n ∈ ℕ that
Case1B.
Case1Bi
.
Let us first note that this limit has to exist, because
is a (not necessarily strictly) decreasing sequence bounded from below by c. Let
.
Note that there has to exist some N ∈ ℕ such that for all n ≥ N it holds that
. For all n ≥ N there has to exist some
such that
. Then, for all n ≥ N
where the strict inequality follows from chance-directedness. We now find
where the last line follows from the fact that the standard logarithmic scoring rule is strictly proper, i.e., Equation (11) holds.
Case1Bii.
Let
, b2 exists for the same reasons b1 exists. Note that there has to exist some N ∈ ℕ such that for all n ≥ N it holds that
. Using chance-directedness we find for all n ≥ N
Now proceed as in Case1Bi.
Case2\ and B respects logical equivalence on
.
Case2A There exists a
such that for all n ∈ ℕ and all F ⊆ Ωn it holds that °B(F) ≤ °PB(F).
Since
there has to exist an N ∈ ℕ and an F′ ∈ ΩN such that °B(F′) < °PB(F′).
Case2Ai and no other
is such that °B(F) ≤ °P (F) for all n and all F ⊆ Ωn. Follows as does Case2Ai in Proposition 16.
Case2Aii There exists a
such that
.
Then for all n ≥ N and all P ∈ [
] it holds that
. For all large enough n ∈ ℕ it holds by Case1 that
Thus,
Case2B There does not exist a
such that for all n ∈ ℕ and all F ⊆ Ωn it holds that °B(F) ≤ °PB(F).
As in Case2B in Proposition 16 we obtain that there has to exist an α > 0 and an N ∈ ℕ such that for all n ≥ N it holds that
.
We have for n ≥ N that
To complete the proof we will now show that there exists some β > 0, which depends on
and g but does not depend on the particular n ≥ N, such that
. Since g(πn) is bounded, we then obtain that
for all large enough n ∈ ℕ.
We show for all large enough n ∈ ℕ that
for all functions f : Ωn→ [0, 1] such that
.
The minimum obtains, if and only if
for all ω ∈ Ωn as we saw in Proposition 16. Thus, the minimum obtains for
and
for all other
. Let us now compute
For n approaching infinity we find
which is strictly greater than some β > 0 as required.
Case3 and B does not respect logical equivalence on
.
Simply proceed as in Case2 in Theorem 17. □
Theorem 9.
Proof. Since all
agree with
on
, all
agree with
on
; as we noted in Proposition 20.
Recall that Theorem 8 does not depend on the particular probability function, as we stated on Page 2508. We can thus apply Theorem 8 to infer that
□
7. Conclusion
In this paper we have set out to provide a unified justification of the three norms of objective Bayesianism in the setting in which the underlying language is a first-order predicate language. We have seen that an approach based on scoring rules can be used to justify the norms on sentences without quantifiers: if the evidence is finitely generated, then the belief function with the best loss profile is a probability function in the set of those calibrated with evidence which has maximum standard entropy, as long as the scoring rule used in the definition of loss profile is defined in terms of a regular weighting function. One can extend this line of argument to handle sentences with quantifiers if one extends the notion of loss profile and imposes two extra desiderata: (i) language invariance and (ii) that one should not give universal hypotheses less credence than the maximum forced by the evidence.
Finally, we saw that this line of justification also applies in some cases in which evidence is not finitely generated. However, we investigated another case in which the justification does not apply because the evidence is such that there is no belief function with the best loss profile. The most one can ask in such a situation is for a belief function that has a sufficiently good loss profile. We saw that in this case one can use standard entropy maximisers to determine belief functions which are arbitrarily close to optimal.
We would identify two main questions for further research. First, it remains an open question as to whether, when the evidence is not finitely generated, a construction appealing to standard entropy maximisers always leads to belief functions that are arbitrarily close to optimal. Second, it would be interesting to investigate the extent to which one can relax the condition that a weighting function should be regular. We speculated that it may be the case that language invariance can be used in place of the condition that the weighting function be strongly refined, but we have little evidence, at this stage, to warrant apportioning a high degree of belief to this claim.
Appendix
A. Non-maximal entropies and non-minimal losses
In Section 4.4 we gave a number of minimax theorems for finitely generated evidence. As we saw in Section 6.1, the case of evidence which is not finitely generated is more complex. Entropy limits incur, in certain cases, infinite worst-case expected loss.
While the minimax theorems relate entropy maximisers (respectively entropy limits) to loss minimisers (respectively belief functions with the best loss profile), these theorems do not tell us much about the general relation between entropy and loss. In particular, the minimax theorems leave open the question as to whether an improvement in loss profile is always accompanied by greater entropy. In this section we will show that this is not the case, by appealing to an example involving a set of calibrated probability functions
which is finitely generated and two probability functions, Q,
, such that Q has a better loss profile than R but has lower entropy than R.
In contrast to Section 6.1, our functions Q and R are open-minded. So, all losses we consider are finite. The fact that R has greater entropy than Q but also incurs a greater loss is thus not due to taking logarithms of zero.
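The role of open-mindedness can be illustrated with a minimal sketch. It uses the plain expected logarithmic loss over a finite set of states, not the g-weighted loss over partitions used in the paper, and the function name and the numbers are illustrative assumptions.

```python
import math


def expected_log_loss(p, b):
    """Expected logarithmic loss -sum_w p(w) log b(w) over finitely many states.

    p is the probability function used to take expectations and b is the
    belief function being scored. States with p(w) = 0 contribute nothing.
    """
    total = 0.0
    for p_w, b_w in zip(p, b):
        if p_w == 0:
            continue
        if b_w == 0:
            # b rules out a state that p considers possible: infinite loss.
            return math.inf
        total -= p_w * math.log(b_w)
    return total


chance = [0.5, 0.5]
print(expected_log_loss(chance, [1.0, 0.0]))  # belief that is not open-minded
print(expected_log_loss(chance, [0.9, 0.1]))  # open-minded belief: finite loss
```

A belief function that gives zero to a state with positive chance incurs infinite expected loss, whereas any belief function that is positive on every state incurs a finite one.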
For the sake of simplicity, we shall consider
.
Proposition 35. There exist regular weightings g, a finitely generated set and probability functions Q,
such that for all
The standard weighting gΩ is another such weighting.
Thus, Q has a better loss profile than R, Q ≺ R, but Q also has lower entropy than R, R ≫ Q.
Proof. Let
We now define R and Q as follows for n ≥ 3:
That is, Q and R equivocate beyond
.
We find for n = 1
and
Having established the result for n = 1, we shall now turn to the general case for n ≥ 2.
For n = 2 note that
For n ≥ 3 we have
and in the same way we find
Furthermore,
This establishes the result for g = gΩ.
The result also follows for general weightings g that converge to the standard weighting gΩ quickly enough for all further terms to be negligible.
B. Symmetry and equivocator preservation
Recall Definition 14: g is called equivocator-preserving, if and only if gn is equivocator-preserving for all
, i.e., if and only if
So, if g is equivocator-preserving and if
, then
and thus
. We know from Proposition 7 that inclusive and symmetric g are equivocator-preserving.
Interestingly, we shall see that there exist non-symmetric gn which are equivocator-preserving. This answers the question posed at the bottom of Landes and Williamson [4] (p. 3574) in the negative.
Proposition 36 (Non-symmetric equivocator preservation). For all n such that |Ωn| ≥ 4 there exist inclusive, equivocator-preserving and non-symmetric weighting functions gn. The set of such weighting functions gn is convex.
Proof. By Landes and Williamson [4] (Lemma 9 p. 3573) it holds that gn is inclusive and equivocator-preserving, if and only if
for some constant, c.
Note that we can simplify this expression as follows
The first sum does not depend on ω. Thus, y(ω) is constant, if and only if
is constant.
Let us now define an inclusive and non-symmetric weighting function
which satisfies this condition. Let k be such that
and put
Clearly,
is inclusive (gn(π) > 0 for all π ∈ Πn), non-symmetric (there are two partitions π, π′ such that the classes of π and π′ have the same number of elements but
) and z(ω) is constant (since
is invariant under permutations of n-states).
Addressing the second part of the proof: For inclusive and equivocator-preserving gn it holds that
is a strictly concave function on
and
always obtains for P=⇂n.
is convex. Hence, the unique maximum of every convex combination of such gn obtains for P=⇂n.
In general, computing a function which maximises
for
is a non-trivial computational problem, even for g = gΩ. The only widely shared intuition is that P= ought to be the function in
which has greatest entropy. Imposing symmetry is sufficient—but, as we have just seen, not necessary—to ensure that this constraint is satisfied. Imposing symmetry has further structural consequences such as: if
is invariant under renaming of states, then so is
; see Landes and Williamson [4] (Appendix B.3) for details.
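For very small state spaces, the intuition that the equivocator has greatest entropy can be checked by brute force. The sketch below assumes that n-entropy takes the form −Σ_π g(π) Σ_{F∈π} P(F) log P(F), with π ranging over the partitions of the state space, as in Landes and Williamson [4]. It enumerates all partitions of a four-element state space and uses a hypothetical constant weighting, which is inclusive and symmetric. It does not reproduce the non-symmetric weighting constructed in the proof of Proposition 36, and the random search is only a sanity check, not a proof.

```python
import math
import random


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition


def g_entropy(p, weighting, partitions):
    """Assumed form: -sum_pi g(pi) * sum_{F in pi} P(F) * log P(F)."""
    total = 0.0
    for partition in partitions:
        weight = weighting(partition)
        for block in partition:
            mass = sum(p[state] for state in block)
            if mass > 0.0:
                total -= weight * mass * math.log(mass)
    return total


def constant_weighting(partition):
    """Hypothetical weighting: every partition gets weight 1."""
    return 1.0


states = list(range(4))
partitions = list(set_partitions(states))
equivocator = [1.0 / len(states)] * len(states)
h_equivocator = g_entropy(equivocator, constant_weighting, partitions)

rng = random.Random(0)  # fixed seed so that the check is reproducible


def random_probability(n):
    """Draw a random probability function on n states by normalising."""
    raw = [rng.random() for _ in range(n)]
    norm = sum(raw)
    return [x / norm for x in raw]


samples = [random_probability(len(states)) for _ in range(1000)]
best_sample = max(g_entropy(p, constant_weighting, partitions) for p in samples)

print("number of partitions:", len(partitions))
print("entropy of equivocator:", h_equivocator)
print("best sampled entropy:  ", best_sample)
```

Replacing `constant_weighting` by another weighting allows one to test, on a small scale, whether a candidate weighting appears to be equivocator-preserving. The enumeration grows with the Bell numbers, so this approach is only practical for a handful of states.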
C. Key notation
Here we summarise key notation, for ease of reference.
| Symbol | Reference | Meaning |
| ⟨ ⟩ | Page 2460 | Convex hull |
| [ ] | Page 2463 | Closure |
|  | Page 2463 | Predicate language with quantifiers |
|  | Page 2463 | Predicate language without quantifiers |
|  | Definition 1 | (Normalised) belief functions on sentences |
|  | Page 2465 | Probability functions on |
|  | Page 2469 | Probability functions on |
| Ωn | Page 2463 | n-states |
| P= | Page 2473 | Equivocator function in , P=(ω) = 1/|Ωn| for each ω ∈ Ωn |
|  | Page 2464 | Partitions of sentences |
| Πn | Definition 4 | Partitions of propositions |
| Π | Definition 4 | All partitions of propositions, |
| πn | Page 2470 | {{ω} : ω ∈ Ωn}, the finest partition of Ωn |
|  | Page 2464 | Calibrated belief functions on |
|  | Page 2469 | Restrictions of these functions to |
| °B | Page 2467 | Belief function on propositions induced by B defined on sentences |
| g | Definition 6 | Weighting function |
| gΩ | Page 2470 | Standard weighting function |
|  | Definition 9 | n-entropy |
|  | Definition 10 | Standard (Shannon) entropy |
| maxent | Page 2472 | Calibrated functions on with maximal entropy |
|  | Page 2472 | Calibrated functions on with maximum n-entropy |
|  | Page 2472 | Unique such function |
|  | Definition 11 | Limit points of maximum n-entropy functions |
| P† | Page 2472 | Unique such entropy limit |
|  | Page 2472 | Standard entropy limit |
| ϱn | Definition 19 | n-representations |
|  | Page 2484 | Logarithmic n-score of a belief function wrt a probability function |
|  | Page 2484 | Representation-relative n-score |
| minloss | Definition 21 | Belief functions on with the best loss profile |
| Minloss* | Definition 24 | Belief functions on with the best loss profile |
Acknowledgments
This research was conducted as a part of the project, From objective Bayesian epistemology to inductive logic. We are grateful to the UK Arts and Humanities Research Council for funding this research and to Bas Lemmens for helpful comments.
Author Contributions
Both authors conceived the idea, did the analysis and wrote the paper. Both authors have read and approved the final manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
References and Notes
- Williamson, J. In defence of objective Bayesianism; Oxford University Press: Oxford, UK, 2010.
- There are several alternatives to the objective Bayesian account of strength of belief, including subjective Bayesianism, imprecise probability, the theory of Dempster-Shafer belief functions and related theories. Here we only have the space to motivate objective Bayesianism, not to assess these other views.
- Taking the convex hull may mean that a calibrated belief function does not satisfy the known constraints on physical probability. For example, if θ is known to be a statement about the past then it is known that its physical probability is 0 or 1; bel is not constrained to be 0 or 1, however, unless it is also known whether or not θ is true. Similarly, it may be known that two propositions are probabilistically independent with respect to physical probability; this need not imply that they are probabilistically independent with respect to epistemic probability. See Williamson [1] (pp. 44–45) for further discussion of this point.
- Landes, J.; Williamson, J. Objective Bayesianism and the Maximum Entropy Principle. Entropy 2013, 15, 3528–3591.
- Gaifman, H. Concerning Measures in First Order Calculi. Isr. J. Math. 1964, 2, 1–18.
- Williamson, J. Lectures on Inductive Logic; Oxford University Press: Oxford, UK, 2015.
- Paris, J.B. The Uncertain Reasoner’s Companion; Cambridge University Press: Cambridge, UK, 1994.
- Williamson, J. Probability logic. In Handbook of the Logic of Argument and Inference: the Turn toward the Practical; Gabbay, D., Johnson, R., Ohlbach, H.J., Woods, J., Eds.; Elsevier: Amsterdam, The Netherlands, 2002; pp. 397–424.
- Haenni, R.; Romeijn, J.W.; Wheeler, G.; Williamson, J. Probabilistic Logics and Probabilistic Networks; Synthese Library, Springer: Dordrecht, The Netherlands, 2011.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley and Sons: New York, NY, USA, 1991.
- Rudin, W. Principles of Mathematical Analysis, 3 ed.; McGraw-Hill: New York, USA, 1973.
- de Finetti, B. Theory of Probability; Wiley: London, UK, 1974.
- Joyce, J.M. A Nonpragmatic Vindication of Probabilism. Philos. Sci. 1998, 65, 575–603.
- Joyce, J.M. Accuracy and Coherence: Prospects for an Alethic Epistemology of Partial Belief. In Degrees of Belief; Huber, F., Schmidt-Petri, C., Eds.; Synthese Library 342; Springer: New York, NY, USA, 2009.
- Grünwald, P.; Dawid, A.P. Game Theory, Maximum Entropy, Minimum Discrepancy, and Robust Bayesian Decision Theory. Ann. Stat. 2004, 32, 1367–1433.
- Savage, L.J. Elicitation of Personal Probabilities and Expectations. J. Am. Stat. Assoc. 1971, 66, 783–801.
- Popper, K.R. The Logic of Scientific Discovery; Routledge: London, UK, 1999.
- Barnett, O.; Paris, J.B. Maximum Entropy Inference with Quantified Knowledge. Logic J. IGPL 2008, 16, 85–98.
- Williamson, J. Objective Bayesian Probabilistic Logic. J. Algorithm. 2008, 63, 167–183.
© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).