**1. Introduction**

Inductive inference is a framework for coping with uncertainty, for reasoning with incomplete information. The framework must include a means to represent a state of partial knowledge—this is handled through the introduction of probabilities—and it must allow us to change from one state of partial knowledge to another when new information becomes available. Indeed, any inductive method that recognizes that a situation of incomplete information is in some way unfortunate—by which we mean that it constitutes a problem in need of a solution—would be severely deficient if it failed to address the question of how to proceed in those fortunate circumstances when new information becomes available. The theory of probability, if it is to be useful at all, demands a method for assigning and updating probabilities.

The challenge is to develop updating methods that are systematic, objective, and practical. When the information consists of data and a likelihood function, Bayesian updating is the natural method of choice. Its foundation lies in recognizing the value of prior information: whatever was learned in the past is valuable and should not be disregarded, which amounts to requiring that beliefs be revised, but only to the extent required by the new data. This immediately raises a number of questions: How do we update when the information is not in the form of data? If the information is not data, what else could it possibly be? Indeed, what, after all, is "information"? On a separate line of development, the method of Maximum Gibbs–Shannon Entropy (MaxEnt) allows one to process information in the form of constraints on the allowed probability distributions. This provides a partial answer to one of our questions: in addition to data, information can also take the form of constraints. However, it immediately raises several other questions: What is the interpretation of entropy? Is there a unique entropy? Are Bayesian and entropic methods mutually compatible?
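As a familiar illustration of constraint information (offered here only for orientation, not as part of the development below), suppose we must assign a distribution $p_i$ over discrete outcomes and the only available information is the expected value $F$ of some quantity $f$. Maximizing the Gibbs–Shannon entropy subject to that constraint,

$$
\max_{p}\;\Big[-\sum_i p_i \log p_i\Big]
\quad\text{subject to}\quad
\sum_i p_i = 1,
\qquad
\sum_i p_i f_i = F,
$$

yields the canonical (Gibbs) distribution

$$
p_i = \frac{e^{-\lambda f_i}}{Z},
\qquad
Z = \sum_i e^{-\lambda f_i},
\qquad
-\frac{\partial \log Z}{\partial \lambda} = F,
$$

where the Lagrange multiplier $\lambda$ is fixed by the constraint. The example shows the sense in which a constraint, rather than a set of data, can carry the information to be processed.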

The purpose of this paper is to review one particular approach to entropic updating. The presentation below, which is meant to be pedagogical and self-contained, is based on work presented in a sequence of papers [1–5] and in the sets of lectures [6–8]. As we shall see below, we adopt a pragmatic approach in which entropy is a tool designed for the specific purpose of updating probabilities.

Historically, the method of maximum relative entropy (ME) is a direct descendant of the MaxEnt method, pioneered by Jaynes [9,10]. In the MaxEnt framework, entropy is interpreted through the Shannon axioms as a measure of the amount of information that is missing from a probability distribution. This approach has its limitations. The Shannon axioms refer to probabilities of discrete variables; for continuous variables, the Shannon entropy is not defined. A more serious objection is that even if we grant that the Shannon axioms do lead to a reasonable expression for entropy, to what extent do we believe the axioms themselves? Shannon's third axiom, the grouping property, is reasonable enough, but is it necessary? Is entropy the only consistent measure of uncertainty or of information? Indeed, there exist examples in which the Shannon entropy does not seem to reflect one's intuitive notion of information [8,11]. One could introduce other entropies justified by different choices of axioms (e.g., [12–14]), but this move raises problems of its own: Which entropy should one adopt? If different systems are handled using different entropies, how does one handle composite systems?

From our perspective, the problem can be traced to the fact that neither Shannon nor Jaynes was concerned with the task of updating probabilities. Shannon's communication theory aimed to characterize the sources of information, to measure the capacity of communication channels, and to learn how to control the degrading effects of noise. Jaynes, on the other hand, conceived MaxEnt as a method to assign probabilities on the basis of constraint information and a fixed underlying measure, not to update them from an arbitrary prior distribution.

Considerations such as these motivated several attempts to develop ME directly as a method for updating probabilities without invoking questionable measures of information [1,5,15–17]. The important contribution by Shore and Johnson was the realization that one could axiomatize the updating method itself rather than the information measure. Their axioms have, however, raised criticisms [11,18–20] and counter-criticisms [2,6,8,21,22]. Despite the controversies, Shore and Johnson's pioneering papers have had an enormous influence: they identified the correct goal to be achieved.

The concept of relative entropy is introduced as a tool for updating probabilities. Hereinafter, we drop the qualifier "relative" and adopt the simpler term "entropy". The reasons for the improved nomenclature are the following: (1) The general concept should receive the general name "entropy", while the more specialized concepts should be the ones receiving a qualifier, such as "thermodynamic" or "Clausius" entropy and "Gibbs–Shannon" entropy. (2) All entropies are relative, even if they happen to be relative to an implicit uniform prior. Making this fact explicit has tremendous pedagogical value. (3) The practice is already in use with the concept of energy: all energies are relative too, but there is no advantage in constantly referring to a "relative energy". Accordingly, ME will be read as "maximum entropy"; additional qualifiers are redundant.
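To make point (2) concrete, it may help to anticipate the standard form of the relative entropy of a distribution $p$ with respect to a prior $q$ (the design derivation in Section 3 leads to an expression of this type):

$$
S[p,q] = -\sum_i p_i \log \frac{p_i}{q_i}.
$$

For a uniform prior over $n$ outcomes, $q_i = 1/n$, this reduces to the Gibbs–Shannon entropy up to an additive constant, $S[p,q] = -\sum_i p_i \log p_i - \log n$, so even the familiar "absolute" entropy is relative to an implicitly assumed uniform measure.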

As with all tools, entropy too is *designed* to perform a certain function, and its performance must meet certain *design criteria* or *specifications*. There is no implication that the method is "true", or that it succeeds because it achieves some special contact with reality. Instead, the claim is that the method succeeds in the pragmatic sense that it works as designed—this is satisfactory because when properly deployed, it leads to empirically adequate models. In this approach, *entropy needs no interpretation* whether it be in terms of heat, multiplicity of states, disorder, uncertainty, or even in terms of an amount of information. Incidentally, this may explain why the search for the meaning of entropy has proved so elusive: we need not know what "entropy" means—we only need to know how to use it.

Since our topic is the updating of probabilities when confronted with new information, our starting point is to address the question, "what is information?". In Section 2, we develop a concept of information that is both pragmatic and Bayesian. "Information" is defined in terms of its effects on the beliefs of rational agents. The design of entropy as a tool for updating is the topic of Section 3. There, we state the design specifications that define what function entropy is supposed to perform, and we derive its functional form. To streamline the presentation, some of the mathematical derivations are left to the appendices.

To conclude, we present two further developments. In Section 4, we show that Bayes' rule can be derived as a special case of the ME method. An earlier derivation of this important result following a different line of argument was given by Williams [23] before a sufficient understanding of entropy as an updating tool had been achieved. It is not, therefore, surprising that Williams' achievement has not received the widespread appreciation it deserves. Thus, within the ME framework, entropic and Bayesian methods are unified into a single consistent theory of inference. One advantage of this insight is that it allows a number of generalizations of Bayes' rule [2,8]. Another is that it provides an important missing piece for the old puzzles of quantum mechanics concerning the so-called collapse of the wave function and the quantum measurement problem [24,25].
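For reference, the rule in question is the familiar

$$
p(\theta\,|\,x) \;=\; p(\theta)\,\frac{p(x\,|\,\theta)}{p(x)},
$$

where $\theta$ labels the hypotheses and $x$ the data. Roughly speaking, the derivation in Section 4 treats the observed data as a constraint on the joint distribution of $\theta$ and $x$, and maximizing entropy relative to the joint prior reproduces this rule.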

There is yet another function that the ME method must perform in order to fully qualify as a method of inductive inference. Once we have decided that the distribution of maximum entropy is to be preferred over all others, the following question arises immediately: since the maximum of the entropy functional is never infinitely sharp, can we really be confident that distributions lying very close to the maximum are completely ruled out? In Section 5, the ME method is deployed to assess quantitatively the extent to which distributions with lower entropy are ruled out. The significance of this result is that it provides a direct link to the theories of fluctuations and large deviations. Concluding remarks are given in Section 6.
