*3.4. The ME Method*

We can now summarize the overall conclusion.

**The ME method**: *The goal is to update from a prior distribution q to a posterior distribution when there is new information in the form of constraints* C *that specify a family* {*p*} *of candidate posteriors. The preferred posterior P is that which maximizes the relative entropy,*

$$S[p,q] = -\sum\_{i} p\_i \log \frac{p\_i}{q\_i} \quad \text{or} \quad S[p,q] = -\int d\mathbf{x} \, p(\mathbf{x}) \, \log \left[\frac{p(\mathbf{x})}{q(\mathbf{x})}\right],\tag{9}$$

*within the family* {*p*} *specified by the constraints* C*.*

This extends the method of maximum entropy beyond its original purpose as a rule to assign probabilities from a given underlying measure (MaxEnt) to a method for updating probabilities from any arbitrary prior (ME). Furthermore, the logic behind the updating procedure does not rely on any particular meaning assigned to the entropy whether in terms of information, or heat, or disorder. Entropy is merely a tool for inductive inference. *No interpretation for S*[*p*, *q*] *is given and none is needed.*

The derivation above has singled out *a unique S*[*p*, *q*] *to be used in inductive inference*. Other "entropies" (such as the one-parameter families of entropies proposed in [12–14]) might turn out to be useful for other purposes—perhaps as measures of some kind of "information", as measures of discrimination or distinguishability among distributions, of ecological diversity, or for some altogether different function—but they are unsatisfactory for the purpose of updating because they fail to perform the functions stipulated by the design criteria DC1 and DC2. They induce correlations that are unwarranted by the information in the priors or the constraints.
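As a concrete illustration of the ME method, the following minimal Python sketch updates a uniform prior over die faces subject to a mean-value constraint. The maximizer of Equation (9) under a constraint on ⟨*f*⟩ takes the familiar canonical form *p<sub>i</sub>* ∝ *q<sub>i</sub>* exp(−*λ f<sub>i</sub>*), with the multiplier *λ* fixed by the constraint; the function name `me_update` is illustrative, not part of the formalism.

```python
import numpy as np
from scipy.optimize import brentq

def me_update(q, f, F):
    """Update prior q to the distribution that maximizes the relative
    entropy S[p, q] subject to sum(p) = 1 and sum(p * f) = F.
    The maximizer has the canonical form p_i = q_i exp(-lam * f_i) / Z."""
    def constraint_gap(lam):
        w = q * np.exp(-lam * f)
        return np.sum(w * f) / np.sum(w) - F   # <f>_p minus the target F
    lam = brentq(constraint_gap, -50.0, 50.0)  # solve the constraint for lambda
    p = q * np.exp(-lam * f)
    return p / p.sum()

# Example: uniform prior over the six faces of a die, mean constrained to 4.5
q = np.ones(6) / 6
f = np.arange(1, 7)
p = me_update(q, f, 4.5)
```

The tilt toward high faces is exactly the minimal adjustment compatible with the constraint: with the target mean at the prior value 3.5, `me_update` returns the prior unchanged, as the PMU demands.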

#### **4. Bayes' Rule as a Special Case of ME**

Back in Section 3.3.1, we saw that ME is designed to include Bayes' rule as a special case. Here, we wish to verify this explicitly [2]. The goal is to update our beliefs about *θ* ∈ Θ (*θ* represents one or many parameters) on the basis of three pieces of information: (1) the prior information codified into a prior distribution *q*(*θ*); (2) the new information conveyed by data *x* ∈ X (obtained in one or many experiments); and (3) the known relation between *θ* and *x* given by a model defined by the sampling distribution or likelihood, *q*(*x*|*θ*). The updating will result in replacing the *prior* probability distribution *q*(*θ*) by a *posterior* distribution *P*(*θ*) that applies after the data information has been processed.

The crucial element that will allow the Bayes' rule to be smoothly integrated into the ME scheme is the realization that before the data are collected, not only do we not know *θ*, but we do not know *x* either. Thus, the relevant space for inference is not the space Θ but the product space Θ × X , and the relevant joint *prior* is *q*(*x*, *θ*) = *q*(*θ*)*q*(*x*|*θ*). Let us emphasize two points: first, the likelihood function is an integral part of the *prior* distribution; second, the prior information about how *x* is related to *θ* is contained in the *functional form* of the distribution *q*(*x*|*θ*) and not in the numerical values of the arguments *x* and *θ*, which, at this point, are still unknown.

Next, data are collected and the observed values turn out to be *x*′. We must update to a posterior that lies within the family of distributions *p*(*x*, *θ*) that reflect the fact that the previously unknown *x* is now known to be *x*′, that is,

$$p(\mathbf{x}) = \int d\theta \, p(\theta, \mathbf{x}) = \delta(\mathbf{x} - \mathbf{x}')\,. \tag{10}$$

The information in these data constrains, but is not sufficient to fully determine, the joint distribution,

$$p(\mathbf{x}, \theta) = p(\mathbf{x})p(\theta|\mathbf{x}) = \delta(\mathbf{x} - \mathbf{x}')p(\theta|\mathbf{x}')\,. \tag{11}$$

Any choice of *p*(*θ*|*x*′) is, in principle, possible. So far, the formulation of the problem parallels Section 3.3.1 exactly. We are, after all, solving the same problem. The next step is to apply the ME method.

According to the ME method, the selected joint posterior *P*(*x*, *θ*) is that which maximizes the entropy,

$$S[p,q] = -\int d\mathbf{x}\,d\theta \, p(\mathbf{x},\theta) \log \frac{p(\mathbf{x},\theta)}{q(\mathbf{x},\theta)}\,,\tag{12}$$

subject to the data constraints. Note that Equation (10) represents an *infinite* number of constraints on the family *p*(*x*, *θ*): there is one constraint and one Lagrange multiplier *λ*(*x*) for each value of *x*. Maximizing *S*, (12), subject to (10) and normalization,

$$\delta \left\{ S + \alpha \left[ \int d\mathbf{x}\, d\theta \; p(\mathbf{x}, \theta) - 1 \right] + \int d\mathbf{x} \, \lambda(\mathbf{x}) \left[ \int d\theta \; p(\mathbf{x}, \theta) - \delta(\mathbf{x} - \mathbf{x}') \right] \right\} = 0 \,, \tag{13}$$

yields the joint posterior

$$P(\mathbf{x}, \theta) = q(\mathbf{x}, \theta) \frac{e^{\lambda(\mathbf{x})}}{Z} \, , \tag{14}$$

where *Z* is a normalization constant, and the multiplier *λ*(*x*) is determined from (10) as follows:

$$\int d\theta \, q(\mathbf{x}, \theta) \frac{e^{\lambda(\mathbf{x})}}{Z} = q(\mathbf{x}) \frac{e^{\lambda(\mathbf{x})}}{Z} = \delta(\mathbf{x} - \mathbf{x}') \,, \tag{15}$$

so that the joint posterior is

$$P(\mathbf{x}, \theta) = q(\mathbf{x}, \theta) \frac{\delta(\mathbf{x} - \mathbf{x}')}{q(\mathbf{x})} = \delta(\mathbf{x} - \mathbf{x}') q(\theta | \mathbf{x}) \,. \tag{16}$$

The corresponding marginal posterior probability *P*(*θ*) is

$$P(\theta) = \int d\mathbf{x} \, P(\theta, \mathbf{x}) = q(\theta | \mathbf{x}') = q(\theta) \frac{q(\mathbf{x}' | \theta)}{q(\mathbf{x}')} \, , \tag{17}$$

which is Bayes' rule. Thus, Bayes' rule is derivable from, and therefore consistent with, the ME method.

To summarize, the prior *q*(*x*, *θ*) = *q*(*x*)*q*(*θ*|*x*) is updated to the posterior *P*(*x*, *θ*) = *P*(*x*)*P*(*θ*|*x*), where *P*(*x*) = *δ*(*x* − *x*′) is fixed by the observed data while *P*(*θ*|*x*′) = *q*(*θ*|*x*′) remains unchanged. Note that in accordance with the PMU philosophy that drives the ME method, *one only updates those aspects of one's beliefs for which corrective new evidence has been supplied*. In [2,8,42], further examples are given that show how ME allows generalizations of Bayes' rule to situations where the data themselves are uncertain, there is information about moments of *x* or moments of *θ*, or even in situations where the likelihood function is unknown. In conclusion, the ME method can fully reproduce and then go beyond the results obtained by the standard Bayesian methods.
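The verification above can be replayed numerically. In the sketch below (a toy discrete model, with all numbers chosen arbitrarily for illustration), the ME route—fix the marginal of the joint prior at the observed *x*′ and leave the conditional untouched—reproduces Bayes' rule exactly:

```python
import numpy as np

# Toy discrete model: theta in {0, 1}, x in {0, 1, 2} (numbers are arbitrary)
q_theta = np.array([0.3, 0.7])                 # prior q(theta)
q_x_given_theta = np.array([[0.5, 0.3, 0.2],   # likelihood q(x|theta), rows sum to 1
                            [0.1, 0.4, 0.5]])

joint_prior = q_theta[:, None] * q_x_given_theta   # q(x, theta) = q(theta) q(x|theta)

x_obs = 2
# ME route: the data constraint (10) fixes p(x) at x_obs while q(theta|x')
# is left unchanged, so the marginal posterior is the x_obs column of the
# joint prior, renormalized.
P_theta_me = joint_prior[:, x_obs] / joint_prior[:, x_obs].sum()

# Bayes' rule directly: P(theta) = q(theta) q(x'|theta) / q(x')
evidence = (q_theta * q_x_given_theta[:, x_obs]).sum()
P_theta_bayes = q_theta * q_x_given_theta[:, x_obs] / evidence
```

The two routes agree to machine precision, as Equation (17) guarantees.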

#### **5. Deviations from Maximum Entropy**

The basic ME problem is to update from a prior *q*(*x*) given information specified by certain constraints. The constraints specify a family of candidate distributions as follows:

$$p\_{\theta}(\mathbf{x}) = p(\mathbf{x}|\theta) \tag{18}$$

which can be conveniently labeled with a finite number of parameters *θ<sup>a</sup>*, *a* = 1, ..., *n*. (The generalization to an infinite number of parameters poses technical but not insurmountable difficulties.) Thus, the parameters *θ* are coordinates on the statistical manifold specified by the constraints. The distributions in this manifold are ranked according to their entropy,

$$S[p\_{\theta},q] = -\int d\mathbf{x} \, p(\mathbf{x}|\theta) \, \log \frac{p(\mathbf{x}|\theta)}{q(\mathbf{x})} = S(\theta)\,,\tag{19}$$

and the selected posterior is the distribution *p*(*x*|*θ*<sub>0</sub>) that maximizes the entropy *S*(*θ*). (The notation indicates that *S*[*p<sub>θ</sub>*, *q*] is a functional of *p<sub>θ</sub>* while *S*(*θ*) is a function of *θ*.)

The question we now address concerns the extent to which *p*(*x*|*θ*<sub>0</sub>) should be preferred over other distributions with lower entropy or, to put it differently, to what extent is it rational to believe that the selected value ought to be the entropy maximum *θ*<sub>0</sub> rather than any other value *θ* [1]? This is a question about the probability *p*(*θ*) of various values of *θ*. The original problem which led us to design the maximum entropy method was to assign a probability to the quantity *x*; we now see that the full problem is to assign probabilities to both *x* and *θ*. We are concerned not just with *p*(*x*), but rather with the joint distributions which we denote as *π*(*x*, *θ*); the universe of discourse has been expanded from X (the space of *x*s) to the product space X × Θ (Θ is the space of parameters *θ*).

To determine the joint distribution *π*(*x*, *θ*), we make use of essentially the only (universal) method at our disposal—the ME method itself—but this requires that we address the standard two preliminary questions: First, what is the prior distribution? What do we know about *x* and *θ* before we receive information about the constraints? Second, what is the new information that constrains the allowed joint distributions *π*(*x*, *θ*)?

The first question is the more subtle one: when we know absolutely nothing about the *θ*s, we know neither their physical meaning nor whether there is any relation to the *x*s. A joint prior that reflects this lack of correlations is a product, *q*(*x*, *θ*) = *q*(*x*)*q*(*θ*). We will assume that the prior *q*(*x*) is known—it is the same prior we had used when we updated from *q*(*x*) to *p*(*x*|*θ*<sub>0</sub>) using (19).

However, we are not totally ignorant about the *θ*s: we know that they label distributions *π*(*x*|*θ*) on some as yet unspecified statistical manifold Θ. Then there exists a natural measure of distance in the space Θ. It is given by the information metric *d*ℓ<sup>2</sup> = *g<sub>ab</sub>dθ<sup>a</sup>dθ<sup>b</sup>* [8,43], where

$$g\_{ab} = \int dx \, p(x|\theta) \frac{\partial \log p(x|\theta)}{\partial \theta^a} \frac{\partial \log p(x|\theta)}{\partial \theta^b} \,, \tag{20}$$

and the corresponding volume elements are given by *g*<sup>1/2</sup>(*θ*)*d<sup>n</sup>θ*, where *g*(*θ*) is the determinant of the metric. The uniform prior for *θ*, which assigns equal probabilities to equal volumes, is proportional to *g*<sup>1/2</sup>(*θ*), and we therefore choose *q*(*θ*) = *g*<sup>1/2</sup>(*θ*). The joint prior is then *q*(*x*, *θ*) = *q*(*x*)*g*<sup>1/2</sup>(*θ*).
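As a sanity check on Equation (20), the sketch below estimates the information metric by Monte Carlo for a toy one-parameter family—a Gaussian with unknown mean *μ* and known *σ*—for which the analytic answer is *g* = 1/*σ*<sup>2</sup>, a constant, so that the prior *g*<sup>1/2</sup>(*θ*) is uniform in *μ*. The function name and sample size are illustrative assumptions.

```python
import numpy as np

def fisher_metric_gaussian_mean(mu, sigma, n=200_000, seed=0):
    """Monte Carlo estimate of g(mu) = E[(d log p / d mu)^2] for
    p(x|mu) = N(mu, sigma^2) with sigma known; analytically g = 1/sigma^2."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=n)
    score = (x - mu) / sigma**2          # d/dmu of log N(x; mu, sigma^2)
    return np.mean(score**2)             # expectation of the squared score

g = fisher_metric_gaussian_mean(mu=1.0, sigma=2.0)
# analytic value: 1/sigma^2 = 0.25, independent of mu
```

Because *g* does not depend on *μ* here, equal intervals of *μ* carry equal prior volume, recovering the familiar flat prior for a location parameter.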

Next, we tackle the second question: what are the constraints on the allowed joint distributions *π*(*x*, *θ*)? Consider the space of all joint distributions. To each choice of the functional form of *π*(*x*|*θ*) (for example, whether we talk about Gaussians, Boltzmann–Gibbs distributions, or something else), there corresponds a different subspace defined by distributions of the form *π*(*x*, *θ*) = *π*(*θ*)*π*(*x*|*θ*). The crucial constraint is that which specifies the subspace by imposing that *π*(*x*|*θ*) takes the particular functional form given by the constraint (18), *π*(*x*|*θ*) = *p*(*x*|*θ*). This defines the meaning of the *θ*s and also fixes the prior *g*<sup>1/2</sup>(*θ*) on the relevant subspace.

The preferred joint distribution, *P*(*x*, *θ*) = *P*(*θ*)*p*(*x*|*θ*), is the distribution, *π*(*x*, *θ*) = *π*(*θ*)*p*(*x*|*θ*), that maximizes the joint entropy,

$$S[\pi, q] = -\int d\mathbf{x} \, d\theta \, \pi(\theta) p(\mathbf{x}|\theta) \, \log \frac{\pi(\theta) p(\mathbf{x}|\theta)}{g^{1/2}(\theta) q(\mathbf{x})}$$

$$= -\int d\theta \, \pi(\theta) \log \frac{\pi(\theta)}{g^{1/2}(\theta)} + \int d\theta \, \pi(\theta) S(\theta) \, , \tag{21}$$

where *S*(*θ*) is given in (19). Varying (21) with respect to *π*(*θ*) with ∫*dθ π*(*θ*) = 1 and *p*(*x*|*θ*) fixed yields the posterior probability that the value of *θ* lies within the small volume *g*<sup>1/2</sup>(*θ*)*d<sup>n</sup>θ*,

$$P(\theta)\,d^{n}\theta = \frac{e^{S(\theta)}}{\zeta}\, g^{1/2}(\theta)\,d^{n}\theta \quad \text{with} \quad \zeta = \int d^{n}\theta \, g^{1/2}(\theta)\, e^{S(\theta)}\,. \tag{22}$$

Equation (22) is the result we seek. It tells us that, as expected, the preferred value of *θ* is the value *θ*<sub>0</sub> that maximizes the entropy *S*(*θ*), Equation (19), because this maximizes the scalar density exp *S*(*θ*). However, it also tells us the degree to which values of *θ* away from the maximum are ruled out. (Note that the density exp *S*(*θ*) is a scalar function and the presence of the Jacobian factor *g*<sup>1/2</sup>(*θ*) makes Equation (22) manifestly invariant under changes of the coordinates *θ* in the space Θ.)

This discussion allows us to refine our understanding of the ME method. ME is not an all-or-nothing recommendation to pick the single distribution that maximizes entropy and reject all others. The ME method is more nuanced: in principle, all distributions within the constraint manifold ought to be included in the analysis; they contribute in proportion to the exponential of their entropy and this turns out to be significant in situations where the entropy maximum is not particularly sharp.

Going back to the original problem of updating from the prior *q*(*x*), given information that specifies the manifold {*p*(*x*|*θ*)}, the preferred update within the family {*p*(*x*|*θ*)} is *p*(*x*|*θ*<sub>0</sub>), but to the extent that other values of *θ* are not totally ruled out, a better update is obtained by marginalizing the joint posterior *P*(*x*, *θ*) = *P*(*θ*)*p*(*x*|*θ*) over *θ*,

$$P(\mathbf{x}) = \int d^\mathbf{u} \theta \, P(\theta) p(\mathbf{x}|\theta) = \int d^\mathbf{u} \theta \, g^{1/2}(\theta) \frac{e^{\mathcal{S}(\theta)}}{\mathcal{J}} p(\mathbf{x}|\theta) \,. \tag{23}$$

In situations where the entropy maximum at *θ*<sub>0</sub> is very sharp, we recover the old result,

$$P(x) \approx p(x|\theta\_0)\,. \tag{24}$$

When the entropy maximum is not very sharp, a more honest update is Equation (23), which, incidentally, is a form of superstatistics.
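A minimal sketch of Equation (23), under illustrative assumptions: take the prior *q*(*x*) = *N*(0, 1) and the family *p*(*x*|*θ*) = *N*(*θ*, 1), for which *S*(*θ*) = −*θ*<sup>2</sup>/2 (so the entropy maximum is *θ*<sub>0</sub> = 0) and *g*(*θ*) = 1. The marginalized update *P*(*x*) is then the mixture *N*(0, 2), visibly broader than *p*(*x*|*θ*<sub>0</sub>) = *N*(0, 1):

```python
import numpy as np

# Family p(x|theta) = N(theta, 1) with prior q(x) = N(0, 1), so that
# S(theta) = -KL(N(theta,1) || N(0,1)) = -theta^2 / 2 and g(theta) = 1.
# P(theta) is proportional to exp(S(theta)), i.e., N(0, 1), and the
# honest update P(x) = int dtheta P(theta) p(x|theta) = N(0, 2).

theta = np.linspace(-10, 10, 4001)   # grid over the parameter manifold
dtheta = theta[1] - theta[0]
P_theta = np.exp(-theta**2 / 2)      # exp(S(theta)) with g^(1/2) = 1
P_theta /= P_theta.sum() * dtheta    # normalize (plays the role of 1/zeta)

def gauss(x, mu, var):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = 1.5
P_x = np.sum(P_theta * gauss(x, theta, 1.0)) * dtheta   # Equation (23) on the grid
```

Here the entropy maximum is broad, so the superstatistical mixture differs appreciably from *p*(*x*|*θ*<sub>0</sub>); a sharper *S*(*θ*) would collapse `P_x` back onto Equation (24).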

One of the limitations of the standard MaxEnt method is that it selects a single "posterior" *p*(*x*|*θ*<sub>0</sub>) and strictly rules out all other distributions. The result (22) overcomes this limitation and finds many applications. For example, it extends the Einstein theory of thermodynamic fluctuations beyond the regime of small fluctuations; it provides a bridge to the theory of large deviations; and, suitably adapted for Bayesian data analysis, it leads to the notion of entropic priors [44].

#### **6. Discussion**

#### Consistency with the law of large numbers.

Entropic methods of inference are of general applicability but there exist special situations—for example, those involving large numbers of independent subsystems—where inferences can be made by purely probabilistic methods without ever invoking the concept of entropy. In such cases, one can check (see, for example, [6,45]) that the two methods of calculation are consistent with each other. It is significant, however, that alternative entropies, such as those proposed in [12–14], do not pass this test [46,47], which rules them out as tools for updating. Some probability distributions obtained by maximizing the alternative entropies have, however, turned out to be physically relevant. It is, therefore, noteworthy that those successful distributions can also be derived through a more standard application of MaxEnt or ME, as advocated in this review [8,48–51]. In other words, what is being ruled out are not the distributions themselves, but the alternative entropies from which they were inferred.

#### On priors.

Choosing the prior density *q*(*x*) can be tricky. Sometimes, symmetry considerations can be useful but otherwise, there is no fixed set of rules to translate information into a probability distribution except, of course, for Bayes' rule and the ME method themselves.

What if the prior *q*(*x*) vanishes for some values of *x*? *S*[*p*, *q*] can be infinitely negative when *q*(*x*) vanishes within some region D. This means that the ME method confers an infinite preference on those distributions *p*(*x*) that vanish whenever *q*(*x*) does. One must emphasize that this is as it should be. A similar situation also arises in the context of Bayes' theorem, where assigning a vanishing prior represents a tremendously serious commitment because no amount of data to the contrary would allow us to revise it. In both ME and Bayes updating, we should recognize the implications of assigning a vanishing prior. Assigning a very low but non-zero prior represents a safer and possibly less prejudiced representation of one's prior beliefs.
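The infinite penalty is easy to exhibit numerically. In the sketch below (the function name is illustrative), any *p* that puts weight where *q* vanishes scores *S*[*p*, *q*] = −∞ and is therefore ruled out, while distributions confined to the prior's support receive a finite score:

```python
import numpy as np

def rel_entropy(p, q):
    """S[p, q] = -sum_i p_i log(p_i / q_i); terms with p_i = 0 contribute 0,
    while any p_i > 0 where q_i = 0 drives the sum to -infinity."""
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = np.where(p > 0, -p * np.log(p / q), 0.0)
    return terms.sum()

q = np.array([0.5, 0.5, 0.0])        # prior assigns zero to the third outcome
p_ok = np.array([0.2, 0.8, 0.0])     # stays inside the prior's support: finite S
p_bad = np.array([0.2, 0.7, 0.1])    # puts weight where q vanishes: S = -inf
```

Maximizing *S*[*p*, *q*] therefore automatically discards every candidate like `p_bad`, which is the infinite preference described above.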

#### Commuting and non-commuting constraints.

The ME method allows one to process information in the form of constraints. When we are confronted with several constraints, we must be particularly cautious. Should they be processed simultaneously or sequentially? And, if the latter, in what order? The answer depends on the problem at hand [42].

We refer to constraints as *commuting* when it makes no difference whether they are handled simultaneously or sequentially. The most common example is that of Bayesian updating on the basis of data collected in several independent experiments. In this case, the order in which the observed data *x*′ = {*x*′<sub>1</sub>, *x*′<sub>2</sub>, ...} are processed does not matter for the purpose of inferring *θ*. In general, however, constraints need not commute and when this is the case, the order in which they are processed is critical.

To decide whether constraints are to be handled sequentially or simultaneously, one must be clear about how the ME method handles constraints. The ME machinery interprets a constraint in a very mechanical way: all distributions satisfying the constraint are, in principle, allowed, while all distributions violating it are ruled out. Therefore, sequential updating is appropriate when old constraints become obsolete and are superseded by new information, while simultaneous updating is appropriate when old constraints remain valid. The two cases refer to different states of information, and therefore, it is to be expected that they will result in different inferences. These comments are meant to underscore the importance of understanding what information is and how it is processed by the ME method; failure to do so will lead to errors that do not reflect a shortcoming of the ME method but rather a misapplication of it.
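The commuting case is easy to check numerically with the standard Bayesian example: for two independent batches of coin-flip data, sequential updates in either order agree with a single simultaneous update. The grid discretization and batch counts below are illustrative.

```python
import numpy as np

# Unknown coin bias theta on a grid, with a flat prior (illustrative setup).
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta) / theta.size

def bayes_update(p, heads, tails):
    """Multiply in a binomial likelihood and renormalize on the grid."""
    post = p * theta**heads * (1 - theta)**tails
    return post / post.sum()

# Two independent data batches: (3 heads, 1 tail) and (2 heads, 4 tails).
seq_12 = bayes_update(bayes_update(prior, 3, 1), 2, 4)   # batch 1, then batch 2
seq_21 = bayes_update(bayes_update(prior, 2, 4), 3, 1)   # batch 2, then batch 1
simul = bayes_update(prior, 5, 5)                        # both batches at once
```

All three posteriors coincide because independent likelihoods multiply; non-commuting constraints, by contrast, would leave a residue of the processing order.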
