**Principle of Minimal Updating (PMU)**: *Beliefs should be updated only to the minimal extent required by the new information.*

The special case of updating in the absence of new information deserves a comment. The PMU states that when there is no new information, ideally, rational agents should not change their minds. In fact, it is difficult to imagine any notion of rationality that would allow the possibility of changing one's mind for no apparent reason.

Minimal updating offers yet another pragmatic advantage. As we shall see below, rather than identifying what features of a distribution are singled out for updating and then specifying the detailed nature of the update, we will adopt design criteria that stipulate what is not to be updated. The practical advantage of this approach is that it enhances objectivity—there are many ways to change something but only one way to keep it the same. The analogy with mechanics can be pursued even further: if updating is a form of dynamics, then minimal updating is the analogue of inertia. Rationality and objectivity demand a considerable amount of inertia.

#### 3.1.3. Independence

The next general requirement turns out to be crucially important because without it, the very possibility of scientific theories would be compromised. The point is that every scientific model, whatever the topic, if it is to be useful at all, must assume that all relevant variables have been taken into account and that whatever was left out—the rest of the universe—should not matter. To put it another way, in order to do scientific work, we must be able to understand parts of the universe without having to understand the universe as a whole. Granted, a pragmatic understanding need not be complete and exact; it must be merely adequate for our purposes.

The assumption, then, is that it is possible to focus our attention on a suitably chosen system of interest and neglect the rest of the universe because the system and the rest of the universe are "sufficiently independent". Thus, in any form of science, the notion of statistical independence must play a central and privileged role. This idea—that some things can be neglected and that not everything matters—is implemented by imposing a criterion that tells us how to handle independent systems. The chosen criterion is quite natural: *whenever two systems are a priori believed to be independent and we receive information about just one, it should not matter if the other is included in the analysis or not.* This is an example of the PMU in action; it amounts to requiring that independence be preserved unless information about correlations is explicitly introduced.
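As a concrete illustration, here is a minimal numerical sketch (all probabilities are made up, and the update used is simple conditioning, one special case of the updating the criterion is meant to govern): the prior factorizes, the information bears only on the first subsystem, and whether or not the second subsystem is included in the analysis makes no difference.

```python
import numpy as np

# Minimal sketch of the independence criterion (illustrative numbers).
q1 = np.array([0.2, 0.3, 0.5])        # prior over x1
q2 = np.array([0.6, 0.4])             # prior over x2
q_joint = np.outer(q1, q2)            # a priori independence: q(x1, x2) = q1(x1) q2(x2)

# Information about subsystem 1 only: rule out the first value of x1.
mask1 = np.array([0.0, 1.0, 1.0])

# Update subsystem 1 on its own ...
p1_alone = q1 * mask1
p1_alone /= p1_alone.sum()

# ... and update the joint distribution with the same information.
p_joint = q_joint * mask1[:, None]
p_joint /= p_joint.sum()

# Subsystem 1 is updated identically either way, subsystem 2 is untouched,
# and the a priori independence is preserved.
assert np.allclose(p_joint.sum(axis=1), p1_alone)
assert np.allclose(p_joint.sum(axis=0), q2)
assert np.allclose(p_joint, np.outer(p1_alone, q2))
```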

Again, we emphasize that none of these criteria are imposed by nature. They are desirable for pragmatic reasons; they are imposed by design.

#### *3.2. Entropy as a Tool for Updating Probabilities*

Consider a set of propositions {*x*} about which we are uncertain. The proposition *x* can be discrete or continuous, in one or in several dimensions. It could, for example, represent the microstate of a physical system, a point in phase space, or an appropriate set of quantum numbers. The uncertainty about *x* is described by a probability distribution *q*(*x*). The goal is to update from the prior distribution *q*(*x*) to a posterior distribution *p*(*x*) when new information—by which we mean a set of constraints—becomes available. The question is, which distribution among all those that satisfy the constraints should we select?

Our goal is to design a method that allows a systematic search for the preferred posterior distribution. The central idea, first proposed by Skilling [16], is disarmingly simple: to select the posterior, first rank all candidate distributions in increasing order of "preference" and then pick the distribution that ranks the highest. Irrespective of what it is that makes one distribution "preferable" over another (we will get to that soon enough), it is clear that any such ranking must be transitive: if distribution *p*<sub>1</sub> is preferred over distribution *p*<sub>2</sub>, and *p*<sub>2</sub> is preferred over *p*<sub>3</sub>, then *p*<sub>1</sub> is preferred over *p*<sub>3</sub>. Transitive rankings are implemented by assigning to each *p* a real number *S*[*p*], which is called the entropy of *p*, in such a way that if *p*<sub>1</sub> is preferred over *p*<sub>2</sub>, then *S*[*p*<sub>1</sub>] > *S*[*p*<sub>2</sub>]. The selected distribution (one or possibly many, for *there may be several equally preferred distributions*) is that which maximizes the entropy functional.
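The strategy can be sketched independently of the eventual form of the entropy. In the sketch below, the functional `S` is a deliberate placeholder (its actual form is what the design criteria below will determine), and the candidate distributions are illustrative and assumed to satisfy the constraints.

```python
import numpy as np

# Sketch of Skilling's ranking strategy with a placeholder functional.
def S(p):
    return -np.sum(p ** 2)            # placeholder only, NOT the designed entropy

candidates = [np.array([0.50, 0.30, 0.20]),
              np.array([0.40, 0.40, 0.20]),
              np.array([0.45, 0.35, 0.20])]

# Because S assigns a real number to each p, the induced ranking is
# automatically transitive, and selection reduces to picking the maximizer.
# (Ties would correspond to several equally preferred distributions.)
posterior = max(candidates, key=S)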

The importance of Skilling's strategy of ranking distributions cannot be overestimated: it answers the questions "why entropy?" and "why a maximum?". The strategy implies that the updating method will take the form of a variational principle—the method of maximum entropy (ME)—involving a certain functional that maps distributions to real numbers. These features are not imposed by nature; they are all imposed by design. They are dictated by the function that the ME method is supposed to perform. (Thus, it makes no sense to seek a generalization in which entropy is a complex number or a vector; such generalized entropies would just not perform the desired function.)
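In practice, the variational principle takes the form of a constrained maximization. The sketch below anticipates, purely for concreteness, the relative-entropy functional that the design process singles out (as derived below); the prior `q`, constraint function `f`, and value `F` are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# ME as a variational principle: among all distributions p satisfying the
# constraints, select the one of maximum entropy.
q = np.array([0.25, 0.25, 0.25, 0.25])   # prior
f = np.array([0.0, 1.0, 2.0, 3.0])       # constraint function f(i)
F = 2.0                                  # required expected value <f> = F

def neg_entropy(p):
    # -S[p, q] with S[p, q] = -sum_i p_i log(p_i / q_i); minimized to maximize S.
    return np.sum(p * np.log(p / q))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},  # normalization
    {"type": "eq", "fun": lambda p: p @ f - F},      # <f> = F
]
result = minimize(neg_entropy, x0=q, bounds=[(1e-9, 1.0)] * len(q),
                  constraints=constraints)
p_star = result.x    # the highest-ranked (maximum-entropy) posterior
```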

Next, we specify the ranking scheme, that is, we choose a specific functional form for the entropy *S*[*p*]. Note that the purpose of the method is to update *from priors to posteriors*, so the ranking scheme must depend on the particular prior *q*; therefore, the entropy *S* must be a functional of both *p* and *q*. The entropy *S*[*p*, *q*] describes a ranking of the distributions *p* relative to the given prior *q*: *S*[*p*, *q*] is the entropy of *p* relative to *q*, and accordingly, it is commonly called *relative entropy*. This is appropriate, and sometimes we will follow this practice. However, since all entropies are relative, even when relative to a uniform distribution, the qualifier "relative" is redundant and can be dropped.
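The remark that all entropies are relative can be made concrete. Anticipating the functional derived below, the following sketch checks that, relative to a uniform prior, *S*[*p*, *q*] reduces to the familiar Shannon form up to an additive constant, so the uniform case enjoys no special status; the distribution `p` is illustrative.

```python
import numpy as np

# Relative to a uniform prior, the relative entropy (anticipating the form
# derived below) equals the Shannon entropy minus log n: the uniform case
# is "relative" too, just relative to a particularly bland prior.
p = np.array([0.1, 0.2, 0.3, 0.4])       # illustrative distribution
n = len(p)
q_uniform = np.full(n, 1.0 / n)

S_rel = -np.sum(p * np.log(p / q_uniform))      # S[p, q] for uniform q
H_shannon = -np.sum(p * np.log(p))              # Shannon entropy of p
assert np.isclose(S_rel, H_shannon - np.log(n))
```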

The functional *S*[*p*, *q*] is designed by a process of elimination, a form of *eliminative induction*. First, we state the desired design criteria; this is the crucial step that defines what makes one distribution preferable over another. Candidate functionals that fail to satisfy the criteria are discarded—hence, the qualifier "eliminative". As we shall see, the criteria adopted below are so constraining that there is a single entropy functional *S*[*p*, *q*] that survives the process of elimination.

This approach has a number of virtues. First, to the extent that the design criteria are universally desirable, the single surviving entropy functional will also be of universal applicability. Second, the reason why alternative entropy candidates are eliminated is quite explicit—at least one of the design criteria is violated. Thus, *the justification behind the single surviving entropy is not that it leads to demonstrably correct inferences, but rather, that all other candidates demonstrably fail to perform as desired.*

#### *3.3. Specific Design Criteria*

Consider a lattice of propositions generated by a set *X* of atomic propositions that are mutually exclusive and exhaustive and are labeled by a discrete index *i* = 1, 2, ... , *n*. The extension to infinite sets and to continuous labels turns out to be straightforward. The index *i* might, for example, label the microstates of a physical system but, since the argument below is supposed to be of general validity, we shall not assume that the labels themselves carry any particular significance. We can always permute labels; this should have no effect on the updating of probabilities.
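A quick sketch of this relabeling requirement (with made-up numbers, using simple conditioning as the update): permuting the labels and then updating gives the same result as updating and then permuting.

```python
import numpy as np

# Label-permutation check: updating commutes with relabeling, so the labels
# themselves carry no significance for the update. Numbers are illustrative;
# the update here is simple conditioning (ruling out one proposition).
rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(5))                 # illustrative prior over i = 1..5
keep = np.array([1.0, 1.0, 0.0, 1.0, 1.0])    # information: rule out i = 3

def update(prior, keep):
    p = prior * keep
    return p / p.sum()

perm = rng.permutation(5)
assert np.allclose(update(q, keep)[perm], update(q[perm], keep[perm]))
```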

We adopt design criteria that reflect the structure of the lattice of propositions—the propositions are related to each other by disjunctions (OR) and conjunctions (AND), and the consistency of the web of beliefs is implemented through the sum and product rules of probability theory. Our criteria refer to the two extreme situations of propositions that are mutually exclusive and of propositions that are mutually independent. At one end, we deal with the probabilities of propositions that are highly correlated (if one proposition is true, the other is false and vice versa); at the other end, we deal with the probabilities of propositions that are totally uncorrelated (the truth or falsity of one proposition has no effect on the truth or falsity of the other). One extreme is described by the simplified sum rule, *p*(*i* ∨ *j*) = *p*(*i*) + *p*(*j*), and the other extreme by the simplified product rule, *p*(*i* ∧ *j*) = *p*(*i*)*p*(*j*). (For an alternative approach to the foundations of inference that exploits the various symmetries of the lattice of propositions, see [40,41].)
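For completeness, the word "simplified" refers to the general rules these extremes collapse: the general sum rule *p*(*i* ∨ *j*) = *p*(*i*) + *p*(*j*) − *p*(*i* ∧ *j*) loses its last term under mutual exclusivity, and the general product rule *p*(*i* ∧ *j*) = *p*(*i*)*p*(*j*|*i*) collapses when *p*(*j*|*i*) = *p*(*j*). A tiny numerical illustration (made-up values):

```python
# How the two extremes simplify the general rules (illustrative numbers).
p_i, p_j = 0.2, 0.3
p_i_and_j = 0.0                        # mutual exclusivity: p(i AND j) = 0
p_i_or_j = p_i + p_j - p_i_and_j       # general sum rule -> p(i) + p(j)

p_a, p_b = 0.4, 0.5
p_b_given_a = p_b                      # independence: p(b|a) = p(b)
p_a_and_b = p_a * p_b_given_a          # general product rule -> p(a) p(b)
```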

The two design criteria and their consequences for the functional form of entropy are given below. Detailed proofs are deferred to the appendices.

#### 3.3.1. Mutually Exclusive Subdomains
