Paradigms of Cognition

Topsøe, Flemming

doi:10.3390/e19040143

Paradigms of Cognition^†

by

Flemming Topsøe

Department of Mathematical Sciences, University of Copenhagen, Universitetsparken 5, 2100 Copenhagen, Denmark

^† This paper is an extended version of our paper published in the 36th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Ghent, Belgium, 10–15 July 2016.

Entropy 2017, 19(4), 143; https://doi.org/10.3390/e19040143

Submission received: 19 December 2016 / Revised: 23 February 2017 / Accepted: 10 March 2017 / Published: 27 March 2017

(This article belongs to the Special Issue Selected Papers from MaxEnt 2016)

Abstract

An abstract, quantitative theory which connects elements of information (key ingredients in the cognitive process) is developed. Seemingly unrelated results are thereby unified. As an indication of this, consider results in classical probabilistic information theory involving information projections and so-called Pythagorean inequalities. These have a certain resemblance to classical results in geometry bearing Pythagoras’ name. By appealing to the abstract theory presented here, you have a common point of reference for these results. In fact, the new theory provides a general framework for the treatment of a multitude of global optimization problems across a range of disciplines such as geometry, statistics and statistical physics. Several applications are given; among them, an “explanation” of Tsallis entropy is suggested. For this, as well as for the general development of the abstract underlying theory, emphasis is placed on interpretations and associated philosophical considerations. Technically, game theory is the key tool.

Keywords:

entropy; divergence; redundancy; information triples; proper effort functions; fundamental inequality; Jensen-Shannon divergence; core; Bregman construction; Tsallis entropy

1	Introduction

2	Information without Probability
	2.1	The World and You
	2.2	Truth and Belief
	2.3	A Tendency to Act, a Wish to Control
	2.4	Atomic Situations, Controllability and Visibility
	2.5	Knowledge, Perception and Deformation
	2.6	Effort and Description
	2.7	Information Triples
	2.8	Relativization, Updating
	2.9	Feasible Preparations, Core and Robustness
	2.10	Inference via Games, Some Basic Concepts
	2.11	Refined Notions of Properness
	2.12	Inference via Games, Some Basic Results
	2.13	Games Based on Utility, Updating
	2.14	Formulating Results with a Geometric Flavour
	2.15	Adding Convexity
	2.16	Jensen-Shannon Divergence at Work

3	Examples, towards Applications
	3.1	Primitive Triples and Generation by Integration
	3.2	A Geometric Model
	3.3	Universal Coding and Prediction
	3.4	Sylvester’s Problem from Location Theory
	3.5	Capacity Problems, an Indication
	3.6	Tsallis Worlds
	3.7	Maximum Entropy Problems of Classical Shannon Theory
	3.8	Determining D-Projections

4	Conclusions

A	Notions of Properness

B	Protection against Misinformation

C	Cause and Effect

D	Negative Definite Kernels and Squared Metrics

1. Introduction

Originally, the driving force behind this study was to extend the clear and convincing operational interpretations associated with classical information theory as developed by Shannon [1] and followers, to the theory promoted by Tsallis for statistical physics and thermodynamics, cf., [2,3]. That there are difficulties is witnessed by the fact that, despite its apparent success, some well-known physicists still find grounds for criticism. Evidence of this attitude may be found in Gross [4].

A possible solution to the problem is presented towards the end of our study, in Theorem 18. It is based on the idea that, possibly, what the physicist perceives as the essence in a particular situation could be a result of both the true state of the situation and the physicist’s preconceptions as expressed by his beliefs. In case there is no deformation from truth and belief to perception, i.e., if “what you see is what is true”, you regain the classical notions of Boltzmann, Gibbs and Shannon.

The approach indicated in Theorem 18 rests on philosophical considerations and associated interpretations. As it turns out, this approach is applicable in a far more abstract setting than needed for the discussion of the particular problem. As a result, a general abstract, quantitative theory is developed. This theory, presented in Section 2 with its many subsections, is the main contribution of our research. A number of possible applications, including Theorem 18, are listed in Section 3, whose subsections cover applications from different areas. They serve as justification for the work which has gone into the development of the general abstract theory. The conclusions are collected in Section 4.

The theory may be seen as an extension of classical Shannon theory. One does not achieve the same degree of clarity as in the classical theory, where coding provides a solid reference. However, the results developed in Section 2 and Section 3 demonstrate that the extension to a more abstract framework is meaningful and opens up for new areas of research. In addition, previous results are consolidated and unified.

The theory of Section 2 is an abstract theory of information without probability. Inspiration from Shannon Theory and from the theory of inference within statistics and statistical physics is apparent. However, the ideas are presented here as an independent theory.

Previous endeavours in the direction taken include research by Ingarden and Urbanik [5] who wrote “... information seems intuitively a much simpler and more elementary notion than that of probability ... [it] represents a more primary step of knowledge than that of cognition of probability ...”. We also point to Kolmogorov, cf., [6,7] who in the latter reference (but going back to 1970, it seems) stated “Information theory must precede probability theory and not be based on it”. The ideas by Ingarden and Urbanik were taken up by Kampé de Fériet, see the survey in [8]. The work of Kampé de Fériet is rooted in logic. Logic is also a key ingredient in comprehensive studies over some 40 years by Jaynes, collected posthumously in [9]. Although many philosophically-oriented discussions are contained in the work of Jaynes, the situations he deals with are limited to probabilistic models and intended mainly for a study of statistical physics.

The work by Amari and Nagaoka in information geometry, cf., [10], may also be viewed as a broad attempt to free oneself from a tie to probability. There are many followers of the theory developed by Amari and Nagaoka. Here we only mention the recent thesis by Anthonis [11] which has a base in physics.

In complexity theory as developed by Solomonoff, Kolmogorov and others, cf., the recent survey [12] by Rathmanner and Hutter, we have a highly theoretical discipline which aims at inference not necessarily tied to probabilistic modeling. The Minimum Description Length Principle may be considered an important spin-off of this theory. It is mainly directed at problems of statistical inference and was developed, primarily, by Rissanen and by Barron and Yu, cf., [13]. We also point to the treatise [14] by Grünwald. In this work you find discussions of many of the issues dealt with here, including a discussion of the work of Jaynes.

Still other areas of research have a bearing on “information without probability”, e.g., semiotics, philosophy of information, pragmatism, symbolic linguistics, placebo research, social information and learning theory. Many areas within psychology are also of relevance. Some specific works of interest include Jumarie [15], Shafer and Vovk [16], Gernert [17], Bundesen and Habekost [18], Benedetti [19] and Brier [20]. The handbook [21] edited by Adriaans and van Benthem and the encyclopaedia article [22] by Adriaans collect views on the very concept of “information”. Over the years, an overwhelming amount of thought has been devoted to that concept in one form or another. Most of this bulk of material is entirely philosophical and not open to quantitative analysis. Part of it is impractical and presently mainly of theoretical interest. Moreover, some is far from Shannon’s theory which we hold as a cornerstone of quantitative information theory. In fact, we consider it a requirement of any quantitative theory of information to be downward compatible with basic parts of Shannon theory. This requirement is largely respected in the present work, but not entirely. For example, we do not know if one can meaningfully lift the concept of coding as known from Shannon theory to a more abstract level.

In many respects, our endeavours go “beyond Shannon”. So does, e.g., Brier in his development of cybersemiotics, cf., [20,23]. Brier goes deeper into some of the philosophical aspects than we do and also attempts a broad coverage by incorporating not only the exact natural sciences but also life science, the humanities and the social sciences. Though not foreign to such a wider scope, our study aims at more concrete results by basing the study more directly on quantitative elements. Both studies emphasize the role of the individual in the cognitive process.

A special feature of our development is the appeal to game theoretical considerations, cf., especially Section 2.10, Section 2.12 and Section 2.13. To illuminate the importance we attach to this aspect we quote from Jaynes’ preface to [9] where he comments on the maximum entropy principle, the central principle of inference promoted by Jaynes:

“... it [maximum entropy] predicts only observable facts (functions of future or past observations) rather than values of parameters which may exist only in our imagination ... it protects us against drawing conclusions not warranted by the data. But when the information is extremely vague, it may be difficult to define any appropriate sample space, and one may wonder whether still more primitive principles than maximum entropy can be found. There is room for much new creative thought here.”

This is one central place where game theory comes in. It represents a main addition, so we claim, to Jaynes’ work. In passing, it is noted that at the conference “Maximum Entropy and Bayesian Methods”, Paris 1992, the author had much hoped to discuss the impact of game theoretical reasoning with Professor Jaynes. Unfortunately, Jaynes, who died in 1998, was too ill at the time to participate. He never incorporated arguments such as those in [24] which can be conceived as supportive of his own theory.

The merits of game theory in relation to information theoretical inference were first indicated in the probabilistic, Shannon-like setting, independently of each other, by Pfaffelhuber [25] and by the author [26]. More recent references include Harremoës and Topsøe [27], Grünwald and Dawid [28], Friedman et al. [29] (a utility-based work) and Dayi [30]. As sources of background material [31,32,33] may be helpful.

The quantitative elements we work with are brought into play via a focus on description effort—or just effort. From this concept, general notions of entropy and redundancy (and the close to equivalent notion of divergence) are derived. The information triples we shall take as the key object of study are expressions of the concepts effort/entropy/redundancy (or effort/entropy/divergence). By a “change of sign”, the triples may, just as well, concern utility/max-utility/divergence.

Apart from introducing game theory into the picture, a main feature of the present work lies in its abstract nature with a focus on interpretations rather than on axiomatics which was the emphasis of many previous authors, including Jaynes.

The set of interpretations we shall emphasize in Section 2 is not the only one possible. Different sets of interpretations are briefly indicated in Appendix B and Appendix C. Though some of this played a role in the development, in statistics, of the notion of properness, we have relegated the material to the appendices, so as not to disturb the flow of presentation and also because we consider this material to be of lesser significance compared with the main line of thought.

Section 3 may be viewed as a justification of the partly speculative deliberations of Sections 2.1–2.16. Also, in view of the rather elaborate theory of Section 2 with many new concepts and unusual notation, it may well be that occasional reference to the material in Section 3 will ease the absorption of the theoretical material.

In Section 3.1, the natural building blocks behind the information triples are presented. This is closely related to the well-known construction associated with Bregman’s name. The construction may be expanded by allowing non-smooth functions as “generators”. Pursuing this leads to situations where the standard notion of properness breaks down and needs a replacement by weaker notions. Such notions are introduced at the end of Section 2.10 but may only be appreciated after acquaintance with the less abstract material in Appendix A.

The applications presented—or indications of potential applications—come from combinatorial geometry, probabilistic information theory, statistics and statistical physics. For most of them, we focus on providing the key notions needed for the theory to work, thus largely leaving concrete applications aside. The aim is to provide enough details in order to demonstrate that our modeling can be applied in quite different contexts. For the case of discrete probabilistic models we do, however, embark on a more thorough analysis. The reason for this is, firstly, that this is what triggered the research reported on and, secondly, with a thorough discussion of modeling in this context, virtually all elements introduced in the many sub-sections of Section 2 have a clear and natural interpretation. In fact, full appreciation of the abstract theory may only be achieved after reading the material in Section 3.6 and Section 3.7.

Our treatment is formally developed independently of previous research. However, unconsciously or not, it depends on earlier studies as referred to above and on the tradition developed over time. More specifically, we mention that our focus on description effort, especially the notion of properness, cf., Section 2.6, is closely related to ideas first developed for areas touching on meteorology, statistics and information theory.

Previous relevant writings of the author include [34,35,36,37,38]. The present study is here published as a substantial expansion of the last of these, [38]. For instance, elements related to control (modeling an observer’s active response to belief) and a detailed discussion of Jensen-Shannon divergence as well as more cumbersome technical details were left out in [38]. Thus, [38] may best serve as an easy-to-read appetizer to the present heavier and more comprehensive theory.

2. Information without Probability

2.1. The World and You

By

Ω

we denote the actual world, perhaps one among several possible worlds. Two fictitious persons play a major role in our modeling, “Nature” and “Observer”. These “persons” behave quite differently and, though stereotypical, the reader may associate opposing sexes to them, say female for Nature, male for Observer. The interplay between the two takes place in relation to studies of situations from the world. Observer’s aim is to gain insight about situations studied. It may be helpful to think of Observer as “you”, say a physicist, psychologist, statistician, information theoretician or whatever the case may be. Nature is seen as an expression of the world itself and reflects the rules of the world. Mostly, such rules may be identified with laws of nature. However, we shall consider models where the rules express an interplay between Nature and Observer and as such may not be absolutes, independent of Observer’s interference.

The insight or knowledge sought by Observer will be focused on inference concerning particular situations under study. A different form of inference not focused on any particular situation may also be of relevance if Observer does not know which world he is placed in. Of course, the actual world is a possible world or it could not exist. So Observer may, based on experience gained from situations encountered, attempt to ascertain which one out of a multitude of possible worlds is actualized.

The notions introduced are left as loose indications. They will take more shape as the modeling progresses. The terminology chosen here and later on is intended to provoke associations to everyday experiences of the cognitive process. In addition, the terminology is largely consistent with usage in philosophy.

2.2. Truth and Belief

Nature is the holder of truth. Observer seeks the truth but is relegated to belief. However, Observer possesses a conscious and creative mind which can be exploited in order to obtain knowledge as effortlessly as possible. In contrast, Nature does not have a mind—and still, the reader may find it helpful to think of Nature as a kind of “person”!

We introduce a set X, the state space, and a set Y, the belief reservoir. Elements of X, generically denoted by x, are truth instances or states of truth or just states, whereas elements of Y, generically denoted by y, are belief instances. We assume that

Y \supseteq X

. Therefore, in any situation, it is conceivable that Observer actually believes what is true. Mostly,

Y = X

will hold. Then, whatever Observer believes, could be true.

Typically, in any situation, we imagine that Nature chooses a state and that Observer chooses a belief instance. This leads to the introduction of certain games which will be studied systematically later on, starting with Section 2.10.

Though there may be no such thing as absolute truth, it is tempting to imagine that there is and to think of Nature’s choice as an expression of just that. This then helps to maintain a distinction between Nature and Observer. However, a closer analysis reveals that what goes on at Nature’s side is most correctly thought of as another manifestation of Observer. Thus the two sides cannot be separated. Rather, a key to our modeling is the interplay between the two.

For some models it may be appropriate to introduce a set

X_{0} \subseteq X

of realistic states. States not in

X_{0}

are considered unrealistic, out of reach for Observer, typically because they would involve availability of unlimited resources. Moreover, some models involve a set

Y_{\det} \subseteq Y

of certain beliefs. Beliefs from

Y_{\det}

are chosen by Observer if he is quite determined on what is going on—but of course, he could be wrong. If nothing is said to the contrary, you can take

X_{0} = X

and

Y_{\det} = \emptyset

.

In a specific situation, Nature’s choice may not be free within all of X. Rather, it may be restricted to a non-empty subset

𝒫

of X, the preparation. The idea is that Observer, perhaps a physicist, can “prepare” a situation, thereby forcing Nature to restrict the choice of state accordingly. For instance, by placing a gas in a heat bath, Nature is restricted to states which have a mean energy consistent with the prescribed temperature.

A situation is normally characterized by specifying a preparation

P

. A state x is consistent—viz., consistent with the preparation

P

of the situation—if

x \in P

. Later on, we shall consider preparation families which are sets, generically denoted by

P

, whose members are preparations.

Faced with a specific situation with preparation

P

, Observer speculates about the state of truth chosen by Nature. Observer may express his opinion by assigning a belief instance to the situation. If this is always chosen from the preparation

P

, Observer will only believe what could be true. Sometimes, Observer may prefer to assign a belief instance in Y \setminus P to the situation. Then this instance cannot possibly be one chosen by Nature. Nevertheless, it may be an adequate choice if an instance inside

P

would contradict Observer’s subjective beliefs. Therefore, the chosen instance may be the “closest” to the actual truth instance in a subjective sense. Anyhow, Observer’s choice of belief instance is considered a subjective choice which takes available information into account such as general insight and any prior information. Qualitatively, these thoughts agree with Bayesian thinking, and as such enjoy the merits, but are also subject to the standard criticism, which applies to this line of thought, cf., [12,39].

2.3. A Tendency to Act, a Wish to Control

Two considerations will lead us to new and important structural elements.

First, we point to the mantra that belief is a tendency to act. This is a rewording taken from Good [40] who suggested this point of view as a possible interpretation of the notion of belief. In daily life, action appears more often than not to be a spontaneous reaction in situations man is faced with, rather than a result of rational considerations. Alternatively, the reaction depends on psychological factors or brain activity largely outside conscious control. In contrast, we shall rely on rational thinking based on quantitative considerations. As a preparation we introduce a set

\hat{Y}

, the action space, and a map from Y into

\hat{Y}

, referred to as response. Elements of

\hat{Y}

are called actions. We use the notation

\hat{y}

to indicate the action which is Observer’s response in situations where Observer’s belief is represented by the belief instance y. Note that as we have assumed that

X \subseteq Y

,

\hat{x}

is well defined for every state x.

Response need not be injective; thus it is in general not possible to infer Observer’s belief from Observer’s action. Nor need response be surjective, though for most applications it will be so. Actions not in the range are idle for the actual model under discussion but may become relevant if the setting is later expanded.

Belief instances, say

y_{1}

and

y_{2}

, with the same response are response-equivalent, notationally written

y_{1} \hat{\sim} y_{2}

.
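As a hedged illustration of response and response-equivalence, the following minimal Python sketch works with a small finite belief reservoir. The element names (`y1`, `a`, ...) and the particular map are hypothetical choices made only for this example; nothing in the theory singles them out.

```python
# Hypothetical toy model: all names and values are illustrative assumptions.
BELIEFS = ["y1", "y2", "y3"]                    # belief reservoir Y
ACTIONS = {"a", "b", "c"}                       # action space (Y-hat)
RESPONSE = {"y1": "a", "y2": "a", "y3": "b"}    # response: y -> y-hat


def response_equivalent(y1, y2):
    """True when the two belief instances lead to the same action."""
    return RESPONSE[y1] == RESPONSE[y2]


def equivalence_classes(beliefs):
    """Group belief instances by their common response."""
    classes = {}
    for y in beliefs:
        classes.setdefault(RESPONSE[y], []).append(y)
    return classes


def idle_actions():
    """Actions outside the range of response (response is not surjective here)."""
    return ACTIONS - set(RESPONSE.values())
```

In this sketch response is neither injective (`y1` and `y2` share the action `a`) nor surjective (`c` is idle), which are exactly the two possibilities the text allows for.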

If the model contains certain beliefs, i.e., if

Y_{\det} \neq \emptyset

, we assume that

\hat{Y}

contains a special element, the empty action, and that this action is chosen by Observer in response to any certain belief instance. In such cases, Observer sees no reason to take any action. If Observer finds several actions equally attractive, one could allow response to be a set-valued map. However, for the present study we insist that response is an ordinary map defined on all of Y. This will actually be quite important.

For a preparation

P

,

\hat{P}

denotes the set of

\hat{x}

with

x \in P

.

Let us turn to another tendency of man, the wish to control. This makes us introduce a set W, the control space. The elements of W are referred to as controls. For the present modeling, this will not, formally, lead to further complications as we shall take W and

\hat{Y}

to be identical:

W = \hat{Y}

. This simplification may be defended by taking the point of view that in order to exercise control, you have to act, typically by setting up appropriate experiments. Moreover, you may consider it the purpose of Observer’s action to exercise control. Thus, in an idealized and simplified model as here presented, we simply identify the two aspects, action and control. Later elaborations of the modeling may lead to a clear distinction between action and the more passive concept of control. As

\hat{Y}

and W are identified, we shall often use w as a generic element of

\hat{Y} = W

and we shall denote the empty action—the same as the empty control—by

w_{\emptyset}

.

The simplest models are obtained when response is an injection or even a bijection. Moreover, simplest among these models are the cases when

Y = \hat{Y} = W

and response is the identity map. This corresponds to a further identification of belief with action or control. Even then it makes a difference if you think about an element as an expression of belief, as an expression of action or as an expression of control.

Although many models do not need the introduction of

\hat{Y}

(or W), the further development will to a large extent refer first and foremost to

\hat{Y}

-related concepts. Technically, this results in greater generality, as response need not be injective. Belief-type concepts, often indicated by referring to the “Y-domain”, will then be derived from action- or control-based concepts, often indicated by pointing to the “

\hat{Y}

-domain”. The qualifying indication may be omitted if it is clear from the context whether we work in the one domain or the other.

2.4. Atomic Situations, Controllability and Visibility

Two relations will be introduced. Controllability is the primary one from which the other one, visibility, will be derived. These relations constitute refinements which may be disregarded at a first reading. This can be done by taking the relations to be the diffuse relations, in the notation introduced below,

X \otimes \hat{Y} = X \times \hat{Y}

and

X \otimes Y = X \times Y

. The reader may recall that in general mathematical jargon, a diffuse relation is one without restrictions, i.e., one for which any element is in relation to any other element.

Pairs of states and belief instances or pairs of states and controls are key ingredients in situations from the world. However, not all such pairs will be allowed. Instead, we imagine that offhand, Observer has some limited insight into Nature’s behaviour and therefore, Observer takes care not to choose “completely stupid” belief instances or controls, as the case may be.

We express these ideas in the

\hat{Y}

-domain by introducing a relation from X to

\hat{Y}

, called controllability and denoted

X \otimes \hat{Y}

. Thus

X \otimes \hat{Y}

is a subset of the product set

X \times \hat{Y}

. Elements of

X \otimes \hat{Y}

are atomic situations (in the

\hat{Y}

-domain). If a preparation

P

is given, it may suffice to consider the restriction

P \otimes \hat{Y}

which consists of all atomic situations

(x, w)

with

x \in P

.

For an atomic situation

(x, w)

, we write

w ≻ x

and say that w controls x or that x can be controlled by w. An atomic situation

(x, w)

is an adapted pair if w is adapted to x in the sense that

w = \hat{x}

.

For a preparation

P

we write

w ≻ P

, and call w a control of

P

, if w controls every state in

P

(

\forall x \in P : w ≻ x

). We also express this by saying that w controls

P

. By

\hat{[P]}

we denote the set of all controls of

P

. We write

\hat{[x]}

if

P

is the singleton set

{x}

. In case

X \otimes \hat{Y}

is the diffuse relation,

\hat{[P]} = \hat{Y}

for any preparation

P

.

For

Q \subseteq \hat{Y}

,

] Q [

denotes the control region of

Q

, the set of

x \in X

for which

w ≻ x

for some

w \in Q

. We write

] w [

if

Q

is the singleton set

{w}

. Clearly, the statements

w \in \hat{[P]}

,

w ≻ P

and

P \subseteq ] w [

are equivalent.
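The notions of this subsection can be checked mechanically on a finite example. The sketch below is only an illustration under stated assumptions: the three states, the three controls and the particular controllability relation are hypothetical, and the function names (`controls_of`, `control_region`) are ours. It computes the set of all controls of a preparation and the control region of a set of controls, and verifies by brute force that "w is a control of P" and "P is contained in the control region of w" agree.

```python
from itertools import combinations

# Hypothetical finite model; every name and pair below is an assumption.
X = {"x1", "x2", "x3"}           # state space
Y_HAT = {"w1", "w2", "w3"}       # action space, identified with the control space W
# Controllability as a set of atomic situations (x, w), read "w controls x".
CONTROLLABILITY = {
    ("x1", "w1"), ("x2", "w1"),
    ("x2", "w2"),
    ("x1", "w3"), ("x3", "w3"),
}


def controls(w, x):
    """True when (x, w) is an atomic situation, i.e., w controls x."""
    return (x, w) in CONTROLLABILITY


def controls_of(P):
    """The set of all controls of the preparation P."""
    return {w for w in Y_HAT if all(controls(w, x) for x in P)}


def control_region(Q):
    """States controlled by at least one control in Q."""
    return {x for x in X if any(controls(w, x) for w in Q)}


def nonempty_subsets(S):
    items = sorted(S)
    return [set(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def equivalence_holds():
    """Check: w is a control of P  <=>  P lies in the control region of w."""
    return all((w in controls_of(P)) == (P <= control_region({w}))
               for P in nonempty_subsets(X) for w in Y_HAT)
```

Replacing `CONTROLLABILITY` by the full product of `X` and `Y_HAT` gives the diffuse relation, for which `controls_of(P)` returns all of `Y_HAT` for every preparation.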

We assume that the following conditions hold:

\forall x \in X : \hat{x} ≻ x,

(1)

\forall w \in \hat{Y} : ] w [ \neq \emptyset,

(2)

and normally also that

\exists y \in Y : \hat{y} ≻ X .

(3)

The first condition is essential and the second one is rather innocent. The third condition is introduced when we want to ensure that X (or Y) is not “too large”. Models where (3) does not hold are considered unrealistic, beyond what man (Observer) can grasp. If response is surjective, it amounts to the condition

\hat{[X]} \neq \emptyset

. It is illuminating to have models of classical Shannon theory in mind, cf., Section 3.7.

For a preparation

P

, we define the centre of

P

(

\hat{Y}

-domain) as the set of controls in

\hat{P}

which control

P

:

\hat{ctr} (P) = \hat{P} \cap \hat{[P]} .

(4)

From controllability we derive the relation of visibility for the Y-domain, denoted

X \otimes Y

, and given by

X \otimes Y = {(x, y) \in X \times Y | \hat{y} ≻ x} .

(5)

Restrictions

P \otimes Y = {(x, y) \in X \otimes Y | x \in P}

are at times of relevance.

If

(x, y) \in X \otimes Y

, we say that

(x, y)

is an atomic situation (in the Y-domain) and write

y ≻ x

. Such a situation is an adapted pair if

(x, \hat{y})

is so in the

\hat{Y}

-domain, i.e., if

y \hat{\sim} x

and

(x, y)

is a perfect match if

y = x

. The two notions coincide if response is injective. An atomic situation

(x, y)

is certain if

y \in Y_{\det}

.

Note that we use the same sign, ≻, for visibility and for controllability. The context will have to show if we work in the Y- or in the

\hat{Y}

-domain. We see that

y ≻ x

if and only if

\hat{y} ≻ x

. If this is so, we also say that y covers x or that x is visible from y.

By (1) and by the defining relation (5),

x ≻ x

for all

x \in X

, thus

X \otimes Y

contains the diagonal of

X \times X

. The outlook (or view) from

y \in Y

is the set

] y [ = {x | y ≻ x}

. Clearly,

] y [ = ] \hat{y} [

. By (2) and (5), this set is non-empty and, when (3) holds, for at least one belief instance, the outlook is all of X.

For a preparation

P

we write

y ≻ P

, and call y a viewpoint of

P

, if

y ≻ x

for every

x \in P

. The set of all viewpoints of

P

is denoted

[P]

. We write

[x]

if

P

is the singleton

P = {x}

. By

ctr (P)

, the centre of

𝒫

(Y-domain), we denote the set of viewpoints in the preparation:

ctr (P) = P \cap [P] .

(6)

Note that

\hat{ctr} (P) = {\hat{x} | x \in ctr (P)}

.
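To see how visibility is derived from controllability, and how the two centres are related, the following Python sketch may help. It is a toy model under explicit assumptions: the states, the extra belief instance `y0`, the response map and the controllability relation are all hypothetical, chosen so that conditions (1), (2) and (3) hold.

```python
# Hypothetical finite model; all names and pairs are illustrative assumptions.
X = {"x1", "x2", "x3"}                      # state space
Y = X | {"y0"}                              # belief reservoir, contains X
RESPONSE = {"x1": "w1", "x2": "w2", "x3": "w3", "y0": "w1"}
Y_HAT = set(RESPONSE.values())              # response is surjective here
# Controllability: pairs (x, w) with "w controls x"; (x, RESPONSE[x]) is
# included for every state x, so condition (1) holds.
CONTROLLABILITY = {
    ("x1", "w1"), ("x2", "w1"), ("x3", "w1"),
    ("x1", "w2"), ("x2", "w2"),
    ("x3", "w3"),
}

# Visibility as in (5): y covers x exactly when the response to y controls x.
VISIBILITY = {(x, y) for x in X for y in Y
              if (x, RESPONSE[y]) in CONTROLLABILITY}


def outlook(y):
    """States visible from the belief instance y."""
    return {x for x in X if (x, y) in VISIBILITY}


def viewpoints(P):
    """Belief instances from which every state of P is visible."""
    return {y for y in Y if all((x, y) in VISIBILITY for x in P)}


def centre(P):
    """Centre in the Y-domain, as in (6): viewpoints lying in P."""
    return set(P) & viewpoints(P)


def controls_of(P):
    return {w for w in Y_HAT if all((x, w) in CONTROLLABILITY for x in P)}


def centre_hat(P):
    """Centre in the action domain, as in (4)."""
    return {RESPONSE[x] for x in P} & controls_of(P)
```

For the whole state space, `centre(X)` is `{"x1"}` and `centre_hat(X)` is `{"w1"}`, in agreement with the remark that the action-domain centre consists of the responses to the elements of the Y-domain centre.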

In any situation, Observer should ensure that from his chosen belief instance, every state which could conceivably be chosen by Nature is visible. Therefore, in a situation where the preparation

P

is known to Observer, Observer should only consider belief instances in

[P]

.

In the sequel we shall often consider bivariate functions, generically denoted by either

\hat{f}

(

\hat{Y}

-domain) or by f (Y-domain). The

\hat{f}

-type functions are defined either on

X \otimes \hat{Y}

or on some subset of the form

P \times \hat{[P]}

for some preparation

P

. The range of

\hat{f}

may be any abstract set but will often be a subset of the extended real line. Given

\hat{f}

, it is understood that f without the hat denotes the derived function defined by

f (x, y) = \hat{f} (x, \hat{y})

for pairs

(x, y)

for which

(x, \hat{y})

is in the domain of definition of

\hat{f}

. The domain of definition of the derived function is either

X \otimes Y

or the set

P \times [P]

if

\hat{f}

is defined on

P \times \hat{[P]}

.

Every derived function depends only on response in the sense that

f (x, y_{1}) = f (x, y_{2})

if only

y_{1} \hat{\sim} y_{2}

. If response is a surjection, there is a natural one-to-one relation between

\hat{Y}

-type functions and Y-type functions which depend only on response.

Consider an f-type function defined on all of

X \otimes Y

. For

y \in Y

,

f^{y}

denotes the marginal function given y, defined on

] y [

by

f^{y} (x) = f (x, y)

. The marginal function given

x \in X

is the function

f_{x}

defined by

f_{x} (y) = f (x, y)

for

y \in [x]

. We write

f^{y} < \infty

on

𝒫

to express, firstly, that

y ≻ P

so that

f^{y}

is well defined on all of

𝒫

and, secondly, that this marginal function is finite on

𝒫

. We write

f^{y} < \infty

if

f^{y} < \infty

on X.

2.5. Knowledge, Perception and Deformation

Observer strives for knowledge, conceived as the synthesis of extensive experience. Referring to probabilistic thinking, we could point to situations where accidental experimental data are smoothed out over time as you enter the regime of the law of large numbers. However, Observer’s endeavours may result in less definitive insight, a more immediate reaction which we refer to as perception. It reflects how Observer perceives situations from the world or, with a different focus, how situations from the world are presented to Observer.

In the same way as we have introduced truth- and belief instances, we consider knowledge instances, also referred to as perceptions. Typically, they are denoted by z and taken from a set denoted Z, the knowledge base or perception base.

A simplifying assumption for our modeling is that the rules of the world

Ω

contain a special function,

\hat{Π}

, which maps

X \otimes \hat{Y}

into Z, generically,

z = \hat{Π} (x, w) .

(7)

The derived function,

Π

, then maps

X \otimes Y

into Z. Both functions are referred to as the deformation. The context will show which one we have in mind,

\hat{Π}

or

Π

.

Thus knowledge can be derived deterministically from truth and belief alone, and as far as belief is concerned, we only have to know the associated response. In terms of perception, Observer’s perception z of an atomic situation

(x, y)

is given by

z = Π (x, y) = \hat{Π} (x, \hat{y})

.

In our modeling, the world is characterized by the deformation. We may thus talk about the world with deformation

Π

,

Ω = Ω_{Π}

. The rules of the world may contain other structural elements, but such elements are not specified in the present study. Possibilities which could be considered in future developments include context, noise from the environment, and dynamics. To some extent, such features can be expressed in the present modeling by defining

X, Y

and Z appropriately and by introducing suitable interpretations.

In case response is a bijection and Z contains X as well as Y we may consider the deformations

Π_{1}

and

Π_{0}

defined by

Π_{1} (x, y) = x

, respectively

Π_{0} (x, y) = y

. The associated worlds are

Ω_{1} = Ω_{Π_{1}}

and

Ω_{0} = Ω_{Π_{0}}

. In

Ω_{1}

, “what you see is what is true”, whereas in

Ω_{0}

, “you only see what you believe”—or, in some interpretations, you only see what you want to see. The world

Ω_{1}

is the classical world where, optimistically, truth can be learned, whereas, in

Ω_{0}

, you cannot learn anything about truth. We refer to

Ω_{0}

as a black hole. It is a narcissistic world, a world of extreme scepticism, only reflecting Observer’s beliefs and bearing no trace of Nature. If Z is provided with a linear structure, we can consider further deformations

Π_{q}

depending on a parameter q by putting

Π_{q} (x, y) = q x + (1 - q) y

. Worlds associated with deformations of this type are denoted

Ω_{q}

. These are the worlds we find of relevance for the discussion of Tsallis entropy, cf., Section 3.6.
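The family of deformations just described can be illustrated by a minimal sketch. It assumes, purely for illustration, that Z = X = Y consists of finite probability vectors represented as tuples of floats; the names `deformation`, `truth` and `belief` are hypothetical and not part of the theory.

```python
def deformation(q):
    """Return the map Pi_q(x, y) = q*x + (1 - q)*y, applied coordinate-wise."""
    def pi_q(x, y):
        return tuple(q * xi + (1.0 - q) * yi for xi, yi in zip(x, y))
    return pi_q


truth = (0.7, 0.2, 0.1)    # hypothetical truth instance x
belief = (0.4, 0.4, 0.2)   # hypothetical belief instance y

classical = deformation(1.0)(truth, belief)    # world Omega_1: perception is the truth
black_hole = deformation(0.0)(truth, belief)   # world Omega_0: perception is the belief
mixed = deformation(0.5)(truth, belief)        # an intermediate world Omega_q

print(classical, black_hole, mixed)
```

With q = 1 the perception reproduces the truth instance and with q = 0 it reproduces the belief instance, in line with the descriptions of the classical world and the black hole.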

The simplest world to grasp is the classical world, but also the worlds

Ω_{q}

and even a black hole contain elements which are familiar to us from daily experience, especially in relation to certain psychological phenomena. In this connection we point to placebo effects, cf., Benedetti [19], and to visual attention, cf., Bundesen and Habekost [18]. Presently, the relevance of our modeling in relation to these phenomena is purely qualitative.

Considering examples as indicated above, it is natural to expect that knowledge is of a nature closely related to the nature of truth and of belief. A key case to look into is that

Z = X = Y

. However, we shall not make any general assumption in this direction. Instead, we follow the advice of Shannon and, as far as possible, avoid assumptions which depend on concrete semantic interpretations. As a consequence, we introduce more specific assumptions about the representation of knowledge only in Section 3.6.

2.6. Effort and Description

We turn to the introduction of the key quantitative tool we shall work with. In so doing, we will be guided by the view that perception requires effort. Expressed differently, knowledge is obtained at a cost. Since, according to the previous section, knowledge can be derived from truth and belief alone, or from truth and action, no explicit reference to knowledge is necessary. Instead, we model effort (in the

\hat{Y}

-domain) by a certain bivariate function, the effort function, defined on

X \otimes \hat{Y}

.

The rules of the world

Ω

may not point directly to an effort function which Observer can favorably work with. Or there may be several sensible functions to choose from. The actual selection is considered a task for Observer.

Effort, description, experiment and measurement are related concepts. We put emphasis on the notion of description, which is intended to aid Observer in his encounters with situations from the world. Logically, description comes before effort. Effort arises when specific ideas about description are developed into a method of description, which you may here identify with an experiment. The implementation of such a method or the performance of the associated experiment involves a cost and this is what we conceive as specified quantitatively by the effort function.

Description depends on semantic interpretations and is often thought of in loose qualitative terms. However, in order to develop precise concepts which can be communicated among humans, quantitative elements will inevitably appear, typically through a finite set of certain real-valued functions, descriptors. The descriptors of Section 3.6 give an indication of what could be involved.

Imagine now that somehow Observer has chosen all elements needed—response, actions, experiments—and settled for an effort function,

\hat{Φ} = \hat{Φ} (x, w)

defined on

X \otimes \hat{Y}

. Let us agree on what a “good” effort function should mean. Generally speaking, Observer should aim at experiments with low associated effort. Consider a fixed truth instance x and the various possible actions, in principle free to be any action which controls x. It appears desirable that the action adapted to x should be the one preferred by Observer. Thus effort should be minimal in this case, i.e.,

\hat{Φ} (x, w) \geq \hat{Φ} (x, \hat{x})

should hold. Further, if the inequality is strict except for the adapted action, this will have a training effect which, over time, will encourage Observer to choose the optimal action,

\hat{x}

.

Formally, we define an effort function (in the

\hat{Y}

-domain) as a function

\hat{Φ}

on

X \otimes \hat{Y}

with values in

] - \infty, + \infty]

such that, for all

x \in X

and all

w ≻ x

,

\hat{Φ} (x, w) \geq \hat{Φ} (x, \hat{x}) .

(8)

Thus, for all

x \in X

,

\hat{x} \in \arg \min {\hat{Φ}}_{x}

. The minimal value of

{\hat{Φ}}_{x}

is the entropy of x for which we use the notation

H (x)

:

H (x) = \hat{Φ} (x, \hat{x}) .

(9)

This quantity will be discussed more thoroughly in the sequel. If

w_{\emptyset} \in \hat{Y}

, it is to be expected that

\hat{Φ} (x, w_{\emptyset}) = 0

when

w_{\emptyset} ≻ x

.

The effort function is proper, if, for any

x \in X

with

H (x) < \infty

, the minimum of

{\hat{Φ}}_{x}

is only achieved for the control

\hat{x}

adapted to x. As opposed to this notion we have the notion of a degenerate effort function which is one which only depends on the first argument x, i.e., for all

x \in X

,

{\hat{Φ}}_{x}

is a constant function.

Note that effort may be negative (but not

- \infty

). This flexibility will later be convenient as it will allow us to pass freely from notions of effort to notions of utility by a simple change of sign. However, for more standard applications, effort functions will be non-negative.

The set of effort functions and the set of proper effort functions over

X \otimes \hat{Y}

are ordered positive cones in a natural way. You may note that if, in a sum of effort functions, one of the summands is proper, so is the sum. Two effort functions

{\hat{Φ}}_{1}

and

{\hat{Φ}}_{2}

, which only differ from each other by a positive finite factor are scalarly equivalent. If an effort function is proper, so is every scalarly equivalent one. There may be many non-scalarly equivalent effort functions. The choice among scalarly equivalent ones amounts to a choice of unit.

Proper effort functions could have been taken as the key primitive concept on which other concepts, especially response, can be based. To illustrate this, assume that

Y = X

and consider a function

\hat{Φ} : X \otimes \hat{Y} \to ] - \infty, \infty]

such that, for every state x for which

{\hat{Φ}}_{x}

is not identically

+ \infty

,

\arg \min {\hat{Φ}}_{x}

is a singleton. The minimal value of

{\hat{Φ}}_{x}

is again the entropy

H (x)

and we may define the set of realistic states by

X_{0} = {H < \infty}

and, more importantly, response

x \mapsto \hat{x}

by the requirement that

\hat{Φ} (x, \hat{x}) = \min {\hat{Φ}}_{x}

. This defines response uniquely on

X_{0}

and for

x \notin X_{0}

, the definition of

\hat{x}

is really immaterial and any element in

\hat{Y}

which controls x will do.

Turning to the Y-domain, we define an effort function (Y-domain), as a function

Φ : X \otimes Y \to ] - \infty, \infty]

such that

Φ (x, y) \geq Φ (x, x) for all (x, y) \in X \otimes Y .

(10)

Entropy is given by

H (x) = Φ (x, x)

. If there are certain atomic situations, it is natural to expect that effort vanishes for such situations. The effort function is proper if equality in (10) only holds if either

H (x) = \infty

or else

y = x

. We also express this by saying that

Φ

satisfies the perfect match principle. An effort function is degenerate if, for every

(x, y) \in X \otimes Y

,

Φ (x, y) = H (x)

.
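As a hedged illustration of the Y-domain definition, the sketch below takes X = Y to be probability vectors on a finite alphabet and uses Kerridge's inaccuracy as the effort function. This probabilistic setting and the function names are assumptions made for the example; the abstract theory does not depend on them. The check covers inequality (10) on a handful of sample beliefs only and is not a proof.

```python
import math


def effort(x, y):
    """Kerridge inaccuracy: Phi(x, y) = -sum_i x_i * log(y_i), in nats.

    Returns +inf when y gives zero weight to an outcome that x allows.
    """
    total = 0.0
    for xi, yi in zip(x, y):
        if xi == 0.0:
            continue              # convention: 0 * log(.) = 0
        if yi == 0.0:
            return math.inf
        total -= xi * math.log(yi)
    return total


def entropy(x):
    """H(x) = Phi(x, x), the minimal effort at truth instance x."""
    return effort(x, x)


def is_minimal_at_truth(x, beliefs):
    """Check inequality (10), Phi(x, y) >= Phi(x, x), over the given beliefs."""
    h = entropy(x)
    return all(effort(x, y) >= h for y in beliefs)


x = (0.5, 0.25, 0.25)
beliefs = [
    (0.5, 0.25, 0.25),
    (1 / 3, 1 / 3, 1 / 3),
    (0.7, 0.2, 0.1),
    (1.0, 0.0, 0.0),
]
print(is_minimal_at_truth(x, beliefs))
```

In this example, effort may take the value +∞ but never −∞, as the definition requires.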

The notions just introduced were defined directly with reference to the Y-domain. However, it is also natural to consider functions which can be derived from

\hat{Y}

-effort functions

\hat{Φ}

. They are derived effort functions and, in case

\hat{Φ}

is proper, proper derived effort functions. The two strategies for definitions, intrinsic and via derivation, give slightly different concepts. In case response is injective, the resulting notions are equivalent. In general, derived effort functions depend only on response, i.e., if

y_{1} ≻ x

and

y_{2} ≻ x

and if

y_{1} \hat{\sim} y_{2}

then

Φ (x, y_{1}) = Φ (x, y_{2})

. In the other direction, for a proper derived effort function, you can only conclude response-equivalence,

y \hat{\sim} x

, if

Φ (x, y) = H (x)

and

H (x) < \infty

.

Formally, the definitions related to Y-effort functions may be conceived as a special case of the definitions pertaining to the

\hat{Y}

-domain (put

\hat{Y} = Y

and take the identity map as response).

We shall talk about effort functions without a qualifying prefix,

\hat{Y}

or Y, if it is clear from the context what we have in mind. We shall always point out if we have derived functions in mind.

The effort functions introduced determine net effort. However, the implementation of the method of description—which we imagine lies behind—may, in addition to a specific cost, entail a certain overhead and, occasionally, it is appropriate to include this overhead in the effort. We refer to Section 3.6 for instances of this.

We imagine that the choice of effort function involves considerations related to knowledge and to the rules of the world. However, once

\hat{Φ}

, hence also

Φ

are fixed, these other elements are only present indirectly. The ideas of Section 2.5 have thus mainly served as motivation for the further abstract development. The ideas will be taken up again when in Section 3.6 we turn to a study of probabilistic models.

The author was led to consider proper effort functions in order to illuminate certain aspects of statistical physics, cf., [34,37]. However, the ideas have been around for quite some time, especially among statisticians. For them it has been more natural to work with functions taken with the reverse sign by looking at “score” rather than effort. Our notion of proper effort functions, when specialized to a probabilistic setting, matches the notion of proper scoring rules as you find it in the statistical literature. As to the literature, Csiszár [41] comments on the early sources, including Brier [42], a forerunner of research which followed, cf., Good [40], Savage [43] (see e.g., Section 9.4) and Fischer [44]. See also the reference work [45] by Gneiting and Raftery. For research of Dawid and collaborators—partly in line with what you find here—see [28,46,47,48].

2.7. Information Triples

As advocated in the last section, effort is a notion of central importance. However, this notion should not stand alone but be discussed together with other fundamental concepts of information. This point of view will be emphasized by the introduction of a notion of information triples, the main notion of the present study. We start by philosophizing over the very concept of information.

Information in any particular situation concerns truth. If

𝒫

is a preparation, “

x \in P

” signifies that the true state is to be found among the states in

𝒫

. If

𝒫

is a singleton, we talk about full information and use the notation “x” rather than “

x \in {x}

”; otherwise, we talk about partial information.

We shall not be concerned with how information can be obtained—if at all. Perhaps, Observer only speculates about the potential possibility of acquiring information, either through his own activity or otherwise, e.g., via the involvement of an aid or a third party, an informer.

Information will be related to quantitatively defined concepts. As our basis we take a proper effort function

\hat{Φ}

. Following Shannon we disregard semantic content. Instead, we focus on the possibility for Observer to benefit from information by a saving of effort. Accordingly, we view

\hat{Φ} (x, w)

as the information content of “x” in an atomic situation with x as truth instance and w as action or control—indeed, if you are told that x is the true state, you need not allocate the effort

\hat{Φ} (x, w)

to the situation which you were otherwise prepared to do. The somewhat intangible and elusive concept of “information” is, therefore, measured by the more concrete and physical notion of effort, hence the unit of information is the same as the unit used for effort.

There is a huge literature elucidating what information really “is”. Suffice it here to refer to [21] and, as an example of a discussion more closely targeted on our main themes, we refer to Caticha [49] who maintains that “Just as a force is defined as that which induces a change in motion, so information is that which induces a change in beliefs”. One may just as well—or even better—focus on action. Then we can claim that “information” is that which induces a change of action.

The central concept of the theory developed by Shannon is that of entropy. This concept was already introduced in the preceding section. Here, we elaborate on possible interpretations. One view is that entropy is guaranteed saving of effort. With effort given by

\hat{Φ}

we are led to define the entropy

H (x)

associated with the information “x” as the minimum over w of

\hat{Φ} (x, w)

. Thus, by (8), (9) holds.

The considerations above make most sense if, one way or another, Observer eventually obtains full information about the true state. However, if, instead, you view entropy as necessary allocation of effort, understood as the effort you have to allocate in order to have a chance to obtain full information, it does not appear important actually to obtain that information. In passing, one may think that a more neutral terminology such as “necessity” could have been chosen in place of “entropy”. That could be less awkward when you turn to other applications of the abstract theory than classical Shannon theory or statistical physics.

As a third route to entropy, we suggest viewing it as a quantitative expression of the complexity of the various states, maintaining that to evaluate complexity, Observer may use minimal accepted effort, the effort he is willing to allocate to the various states in order to obtain the information in question.

Entropy may also be obtained with reference only to the Y-domain. Indeed, with

Φ

the derived effort function, for each state x,

H (x) = Φ (x, x)

.

Whichever route to entropy you take—including the game theoretical route of Section 2.10—it appears that subjective elements are involved, typically through Observer’s choice of description and associated experiments. If, modulo scalar equivalence, the actual world only allows one proper effort function, then entropy and notions related to entropy are of a more objective nature. We shall later see examples of such worlds but also for such worlds subjective elements may enter if Observer is considering which world is the actual one.

Apart from effort itself, and the derived notion of entropy, we turn to the introduction of two other basic concepts which make sense in our abstract setting, viz., redundancy for the

\hat{Y}

-domain and its counterpart, divergence, for the Y-domain.

To define redundancy, consider an atomic situation

(x, w) \in X \otimes \hat{Y}

. Then redundancy

\hat{D}

between x and w is measured by the difference between actual and minimal effort, i.e., ideally, as

\hat{D} (x, w) = \hat{Φ} (x, w) - H (x) .

(11)

Assume, for a moment, that entropy is finite-valued. Then redundancy in (11) is well defined. Furthermore, redundancy is non-negative and only vanishes if

(x, w)

is an adapted pair.

However, we find it important to be able to deal with models for which entropy may be infinite. We do that by simply assuming that appropriate versions of redundancy and divergence exist with desirable properties. The simple device we shall apply in order to reach a sensible definition is to rewrite the defining relation (11), isolating effort on the left hand side.

With the above preparations, we are ready to introduce the key concepts of our study. We start with concepts for the

\hat{Y}

-domain and follow up after that by parallel concepts for the Y-domain.

We consider certain triples

(\hat{Φ}, H, \hat{D})

of functions taking values in

] - \infty, \infty]

with

\hat{Φ}

and

\hat{D}

defined on

X \otimes \hat{Y}

and H defined on X. If need be we may talk about triples over

X \otimes \hat{Y}

or we may point to the

\hat{Y}

-domain. Such triples must satisfy special conditions in order to be of interest. The most important properties to consider are the following four:

\begin{matrix} \hat{Φ} (x, w) = H (x) + \hat{D} (x, w) (linking identity, L); \end{matrix}

(12)

\begin{matrix} \hat{D} (x, w) \geq 0 (fundamental inequality, F); \end{matrix}

(13)

\begin{matrix} \hat{D} (x, \hat{x}) = 0 (soundness, S); \end{matrix}

(14)

\begin{matrix} w \neq \hat{x} \Rightarrow \hat{D} (x, w) > 0 (properness, P) . \end{matrix}

(15)

The properties (12), (13) and (15) are considered for all

(x, w) \in X \otimes \hat{Y}

and (14) for all

x \in X

. The linking identity (12) may be written shortly as

\hat{Φ} = H + \hat{D}

or, formally correct with

\hat{pr}

the projection of

X \otimes \hat{Y}

onto X, as

\hat{Φ} = H \circ \hat{pr} + \hat{D}

.

An information triple is a triple

(\hat{Φ}, H, \hat{D})

which satisfies the first three conditions (L, F and S). For such triples the function

\hat{Φ}

is the associated effort function, H the associated entropy and

\hat{D}

the associated redundancy. This does not conflict with previous terminology. In particular, the associated effort function is indeed an effort function in the sense of Section 2.6.

Information triples with the same redundancy are said to be equivalent. Equivalent triples may have quite different properties and one may search for representatives with good properties.

A proper information triple is an information triple for which redundancy is proper, i.e., (15) holds. Clearly, the effort function of a proper information triple is proper in the sense of Section 2.6. Moreover, if a triple is proper, so is any equivalent one.

An information triple is degenerate if redundancy vanishes:

\hat{D} (x, w) = 0

for all

(x, w) \in X \otimes \hat{Y}

. The effort function of a degenerate information triple is degenerate.

Among the four defining properties, the last three (FSP) only involve redundancy. Accordingly, a function

\hat{D}

defined on

X \otimes \hat{Y}

is a general redundancy function if it satisfies the fundamental inequality as well as the requirements of soundness and properness. Note that for such a redundancy function,

(\hat{D}, 0, \hat{D})

is a proper information triple and that any equivalent information triple may be obtained from

(\hat{D}, 0, \hat{D})

by a natural process of addition related to any function on X with values in

] - \infty, \infty]

, taking this function as the entropy function. To be precise, what is involved structurally is that you add information triples, one of which is proper and the other degenerate, viz., you add

(\hat{D}, 0, \hat{D})

and

(H, H, 0)

. For further details on this theme, see Section 3.1.
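The process of addition can be sketched in a few lines. The example assumes, for illustration only, that states and controls are real numbers with the identity as response, that redundancy is squared distance, and that the added entropy function is the absolute value; none of these choices comes from the text.

```python
def add_entropy(redundancy, entropy):
    """Return the effort function of the sum (D, 0, D) + (H, H, 0), i.e. Phi = H + D."""
    def effort(x, w):
        return entropy(x) + redundancy(x, w)
    return effort


def squared_distance(x, w):
    """A hypothetical general redundancy function on the real line."""
    return (x - w) ** 2


def sample_entropy(x):
    """A hypothetical entropy function; any function on X with values in ]-inf, inf] would do."""
    return abs(x)


effort = add_entropy(squared_distance, sample_entropy)

print(effort(2.0, 2.0))   # adapted pair: effort equals entropy
print(effort(2.0, 3.0))   # non-adapted pair: effort exceeds entropy
```

The resulting triple has the same redundancy as (D, 0, D) and is therefore equivalent to it, whatever entropy function is chosen.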

Normally, given a proper effort function

\hat{Φ}

, there is a natural way to extend the redundancy function as defined by (11) when

H (x) < \infty

, so that a proper information triple emerges. For this reason, we may talk about the information triple generated by

\hat{Φ}

. Then, the problem of indeterminacy of redundancy disappears. The slightly strengthened assumption that redundancy can be defined “appropriately” on all of

X \otimes \hat{Y}

will, as it turns out, present no limitation in concrete cases of interest.

We turn briefly to Y-type triples. They are triples

(Φ, H, D)

with

Φ

and D defined on

X \otimes Y

and H defined on X. Key properties to consider are quite parallel to what we have discussed for the

\hat{Y}

-domain:

\begin{matrix} Φ (x, y) = H (x) + D (x, y) (linking identity, L); \end{matrix}

(16)

\begin{matrix} D (x, y) \geq 0 (fundamental inequality, F); \end{matrix}

(17)

\begin{matrix} D (x, x) = 0 (soundness, S); \end{matrix}

(18)

\begin{matrix} y \neq x \Rightarrow D (x, y) > 0 (properness, P) . \end{matrix}

(19)

An information triple in the Y-domain is a triple which satisfies the conditions L, F and S. For such triples,

Φ

is the associated effort, H the associated entropy and D the associated divergence.

A proper information triple is one for which divergence is proper. Such triples are intrinsically defined in the sense that they do not depend on any action space or response function. If divergence vanishes, the triple is degenerate. The effort function of a proper information triple is proper in the sense of Section 2.6 and the effort function of a degenerate triple is degenerate.
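To make the four properties concrete, the following sketch checks them numerically for the classical probabilistic triple consisting of Kerridge inaccuracy, Shannon entropy and Kullback divergence. It assumes X = Y is the set of probability vectors on a finite alphabet; the helper names are hypothetical, and the check covers only the sample points supplied.

```python
import math


def kerridge_effort(x, y):
    """Phi(x, y) = -sum_i x_i * log(y_i), with +inf if y misses support of x."""
    total = 0.0
    for xi, yi in zip(x, y):
        if xi == 0.0:
            continue
        if yi == 0.0:
            return math.inf
        total -= xi * math.log(yi)
    return total


def shannon_entropy(x):
    """H(x) = -sum_i x_i * log(x_i)."""
    return -sum(xi * math.log(xi) for xi in x if xi > 0.0)


def kullback_divergence(x, y):
    """D(x, y) = sum_i x_i * log(x_i / y_i), with +inf if y misses support of x."""
    total = 0.0
    for xi, yi in zip(x, y):
        if xi == 0.0:
            continue
        if yi == 0.0:
            return math.inf
        total += xi * math.log(xi / yi)
    return total


def check_triple(x, y, tol=1e-12):
    """Check L, F, S and P of (16)-(19) at the single pair (x, y)."""
    phi = kerridge_effort(x, y)
    h = shannon_entropy(x)
    d = kullback_divergence(x, y)
    return {
        "L": math.isclose(phi, h + d, rel_tol=1e-9, abs_tol=tol),
        "F": d >= -tol,
        "S": abs(kullback_divergence(x, x)) <= tol,
        "P": x == y or d > 0.0,
    }


x = (0.5, 0.25, 0.25)
y = (0.25, 0.5, 0.25)
print(check_triple(x, y))
```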

A triple

(Φ, H, D)

is a derived information triple, respectively a derived proper information triple, if there exists a triple

(\hat{Φ}, H, \hat{D})

satisfying the corresponding properties for the

\hat{Y}

-domain such that

Φ

is derived from

\hat{Φ}

and D from

\hat{D}

. Note that a derived proper information triple need not be a proper information triple according to the intrinsic definition. Indeed, from

D (x, y) = 0

you can only conclude that x and y are response equivalent. Of course, if response is injective, the two types of proper information triples for the Y-domain—intrinsically defined or defined via derivation—are equivalent concepts.

A general divergence function D on

X \otimes Y

is a function on

X \otimes Y

which satisfies the F, S and P-requirements. Note that we include the property of properness in the definition. A general derived divergence function is one which can be derived from a general redundancy function.

For the Y-domain, notions of equivalence (same divergence!) and of addition of information triples are defined in the obvious manner.

Instead of taking triples as introduced above as the basis, it is quite often more natural to focus on triples of the “opposite nature”. This refers to situations where it is appropriate to focus on a positively oriented quantity such as utility or pay-off rather than on effort. Typically, this is the case for studies of economy, meteorology and statistics where one also meets the notion of “score” as previously indicated. In order to distinguish the two types of triples from each other, we may refer to them as being effort-based, respectively utility-based.

For the

\hat{Y}

-domain,

(\hat{U}, M, \hat{D})

is a utility-based information triple if

(- \hat{U}, - M, \hat{D})

is so as an effort-based triple and, for the Y-domain,

(U, M, D)

is a utility-based information triple if

(- U, - M, D)

is so as an effort-based triple. Properness and other concepts introduced for effort-based triples carry over in the obvious way to utility-based triples.

For utility-based triples,

\hat{U}

and U are called utility, M is called max-utility. As for effort-based triples,

\hat{D}

is redundancy and D divergence. The linking identity takes the form

\hat{U} = M - \hat{D}

(

U = M - D

) which can never result in the indeterminate form

\infty - \infty

since, by definition,

\hat{U}

and U, hence also M, can never assume the value

+ \infty

.

In view of the main examples we have in mind, we have found it most illuminating to take effort rather than utility as the basic concept to work with, and hence to develop the main results for effort-based quantities. Anyhow, even if you are primarily interested in considerations based on effort, you are easily led to consider also utility-based quantities as we shall see right away in the next section.

The concept of proper information triples is, except for minor technical details, equivalent to the concept of proper effort functions. Apart from a slight technical advantage, the triples constitute a preferable base for information theoretical investigations as the three truly basic notions of information are all emphasized together with their basic interrelationship—the linking identity. Historically, the notions arose for classical probabilistic information theoretical models, cf., Section 3.7. Effort functions go back to Kerridge [50] who coined the term inaccuracy, entropy goes back to Shannon [1] and divergence to Kullback [51]. The term “redundancy” which we have used for another side of divergence, corresponds to one usage in information theory, though there the term is used in several other ways which are not expressed in our abstract setting.

As an aside, it is tempting for the author to point to the pioneering work of Edgar Rubin going back to the twenties. Unfortunately, this was only published posthumously in 1956, cf., [52,53,54]. Rubin conducted experiments on human speech and focused on what he called the reserve of understanding. This is a quantitative measure of the amount you can cut out of a person's speech without seriously disrupting a listener's ability to understand what has been said. It can be conceived as a forerunner of the notion of redundancy.

Our way to information triples was through effort and one may ask why we did not go directly to the triples. For one thing, triples lead to a smooth axiomatic theory, as will be demonstrated in the present research, compare also with our previous contribution [55]. However, though axiomatization can be technically attractive, we find that a focus on interpretation as in our more philosophical and speculative approach, is of primary importance and contributes best to an understanding of central concepts of information. Axiomatics only comes in after basic interpretations are in place.

A comment on the choice of terminology in relation to the concept of properness is in order. This concept is at times considered to be unnecessarily strong and we shall later, at the end of Section 2.10 and in Appendix A, develop weaker notions. When only a redundancy function or a divergence function is given and not a full information triple, we have chosen to incorporate the requirement of properness in its usual form in the definition of what we understand by a general redundancy function or a general divergence function.

2.8. Relativization, Updating

In this section we shall work entirely in the Y-domain. We start by considering a proper effort-based information triple

(Φ, H, D)

over

X \otimes Y

. Often, it is natural to measure effort relative to some standard performance rather than by

Φ

itself. An especially important instance of this kind of relativization concerns situations where Observer originally fixed a prior, say

y_{0} \in Y

, but now wants to update his belief by replacing

y_{0}

with a posterior y. Perhaps Observer—through his own actions or via an informer—has obtained the information “

x \in P

” for some preparation

𝒫

. If

y_{0} \notin P

, Observer may want to replace

y_{0}

by a posterior

y \in P

. As a first attempt at a reasonable definition, the associated updating gain is given by the quantity

U_{| y_{0}}

obtained by comparing performance under the posterior with performance under the prior:

U_{| y_{0}} (x, y) = Φ (x, y_{0}) - Φ (x, y) .

(20)

A difficulty with (20) concerns the possible indeterminate form

\infty - \infty

. If we ignore the difficulty and apply the linking identity (16) to both terms in (20), entropy

H (x)

cancels out and we find the expression

U_{| y_{0}} (x, y) = D (x, y_{0}) - D (x, y) .

(21)

This is less likely to be indeterminate. When not of the indeterminate form

\infty - \infty

, we therefore agree to use (21) as the formal definition of updating gain, more precisely of relative updating gain with

y_{0}

as prior. For the present study, we shall only work with updating gain when the marginal function

D^{y_{0}}

(defined in accordance with concepts and notation introduced in Section 2.4) is finite on some preparation

𝒫

under consideration. Assuming that this is the case, we realize that

(U_{| y_{0}}, D^{y_{0}}, D)

(22)

is a proper utility-based information triple over

P \otimes Y

. For such triples we put

Y_{\det} = {y_{0}}

, i.e., we take

y_{0}

as the only certain belief instance. Max-utility is identified as the marginal function

D^{y_{0}}

on

𝒫

and divergence is the original divergence function restricted to

P \otimes Y

.

It is important to note that the triples which occur in this way by varying

y_{0}

and

𝒫

do not require the full effort function

Φ

in order to make sense. It suffices to start out with a general divergence function on

X \otimes Y

. When the construction is based on a general divergence function D, we refer to (22) as the updating triple generated by D and with

y_{0}

as prior.

Though rather trivial, the observations regarding updating gain are important as they show that results in that setting may be obtained from results based on effort. To emphasize this, we introduce—based only on a general divergence function D—the effort-based information triple associated with (22) as the triple

(Φ_{| y_{0}}, - D^{y_{0}}, D)

(23)

with

Φ_{| y_{0}}

given by

Φ_{| y_{0}} (x, y) = D (x, y) - D (x, y_{0}) .

(24)

This is a perfectly feasible effort-based triple over

P \otimes Y

whenever

D^{y_{0}}

is finite on

𝒫

. Clearly, it is proper.
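To see why (23) qualifies as an effort-based triple, it may help to spell out the linking identity and the resulting entropy. The lines below are only a sketch. They use (24), the finiteness of D^{y_{0}} on the preparation, and the assumption that D is non-negative and vanishes when y is adapted to x.

```latex
\begin{align*}
Φ_{| y_{0}} (x, y) &= - D (x, y_{0}) + D (x, y) = - D^{y_{0}} (x) + D (x, y) && \text{(linking identity for (23))} \\
\inf_{y ≻ x} Φ_{| y_{0}} (x, y) &= - D^{y_{0}} (x) + \inf_{y ≻ x} D (x, y) = - D^{y_{0}} (x) && \text{(entropy of (23))}
\end{align*}
```

Comparing (24) with (21) also shows that Φ_{| y_{0}} = - U_{| y_{0}}, so minimizing effort in (23) amounts to maximizing updating gain in (22).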

In Section 2.13 and Section 2.15 we shall derive results about minimum divergence (information projections) from results about maximum entropy by exploiting the simple facts here uncovered.

As we have seen, natural information triples may be derived from a general divergence function by a simple process of relativization. While we are at it, we note that in case

Y = X

, also reverse divergence

(x, y) \mapsto D (y, x)

defines a genuine divergence function on

X \otimes Y

(in contrast, reverse description effort need not define a genuine effort function). Therefore, if

D_{y_{0}} < \infty

and we put

Φ_{| y_{0}}^{r} (x, y) = D (y, x) - D (y_{0}, x)

,

(Φ_{| y_{0}}^{r} (x, y), - D (y_{0}, x), D (y, x))

(25)

defines a genuine proper information triple (when restricting the variables x and y appropriately). However, these triples are not found to be that significant.

2.9. Feasible Preparations, Core and Robustness

We claim that description is a key to obtainable information, to what can be known. Not every possible information “

x \in P

” for any odd preparation

𝒫

can be expected to reflect a realistic situation. The question we ask is “what can Observer know?” or “what kind of information can Observer hope to obtain?”. We thus want to investigate “limits to knowledge” and “limits to information”. In order to provide an answer, we shall identify classes of preparations which represent feasible information. These classes will be defined with reference to an effort function

\hat{Φ}

. For this section,

\hat{Φ}

need not be proper.

Given

w \in \hat{Y}

and a level

h < \infty

, we define the level set

P^{w} (h)

and the sub level set

P^{w} (h^{↓})

by

P^{w} (h) = {{\hat{Φ}}^{w} = h}; P^{w} (h^{↓}) = {{\hat{Φ}}^{w} \leq h},

(26)

i.e., as the sets of states which are controlled by w, either exactly at the level h or at a level of at most h. These sets are genuine preparations whenever they are non-empty. When w is the response of a state

x \in X

,

P^{w} (h^{↓})

is non-empty whenever

h \geq H (x)

. As level- and sub level sets for other functions will appear later on, cf., Section 2.14, we may for clarity refer to

P^{w} (h)

and to

P^{w} (h^{↓})

as, respectively,

{\hat{Φ}}^{w}

-level sets and

{\hat{Φ}}^{w}

-sub level sets.

The preparations in (26) we call primitive strict, respectively primitive slack preparations. A general strict, respectively a general slack preparation is a finite non-empty intersection of primitive strict, respectively primitive slack preparations. The genus of these preparations is the smallest number of primitive preparations (either strict or slack as the case may be) which can enter into the definition just given. Thus primitive preparations are of genus 1.

If

w = (w_{1}, \dots, w_{n})

are elements of

\hat{Y}

and

h = (h_{1}, \dots, h_{n})

are real numbers, the sets

P^{w} (h) = ⋂_{i \leq n} P^{w_{i}} (h_{i}) and P^{w} (h^{↓}) = ⋂_{i \leq n} P^{w_{i}} (h_{i}^{↓})

(27)

define strict, respectively slack preparations of genus at most n whenever they are non-empty. The set

P^{w} (h)

is the corona of

P^{w} (h^{↓})

whenever it is non-empty.
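The definitions (26) and (27) can be tried out on a small finite model. The sketch below rests on assumptions that are not part of the text: the state space is a grid of distributions on three points, a control is a vector of three numbers, and the effort is the average of the control under the state. Exact rational arithmetic is used so that the level sets can be tested with equality.

```python
from fractions import Fraction
from itertools import product

N = 10  # grid resolution (an arbitrary illustrative choice)

# Toy state space: all distributions on {0, 1, 2} with entries in multiples of 1/N.
STATES = [
    (Fraction(a, N), Fraction(b, N), Fraction(N - a - b, N))
    for a, b in product(range(N + 1), repeat=2)
    if a + b <= N
]

def effort(x, w):
    """Hypothetical effort: the average of the control w under the state x."""
    return sum(xi * wi for xi, wi in zip(x, w))

def level_set(w, h):
    """Primitive strict preparation: states controlled by w exactly at level h."""
    return {x for x in STATES if effort(x, w) == h}

def sub_level_set(w, h):
    """Primitive slack preparation: states controlled by w at level at most h."""
    return {x for x in STATES if effort(x, w) <= h}

def strict_preparation(ws, hs):
    """Intersection of primitive strict preparations, as in (27)."""
    return set.intersection(*[level_set(w, h) for w, h in zip(ws, hs)])

def slack_preparation(ws, hs):
    """Intersection of primitive slack preparations, as in (27)."""
    return set.intersection(*[sub_level_set(w, h) for w, h in zip(ws, hs)])

controls = [(0, 1, 2), (1, 0, 1)]
levels = [Fraction(1), Fraction(3, 5)]
strict = strict_preparation(controls, levels)
slack = slack_preparation(controls, levels)
```

With the levels chosen here, the strict preparation consists of the single state (3/10, 2/5, 3/10), and it is contained in the corresponding slack preparation.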

The preparations introduced above via the representation (27) are those we consider to be feasible and we formally refer to them as the feasible preparations. They provide the answer to the question about what can be known. They are the key ingredients in situations which Observer can be faced with. In any such situation a main problem concerns inference, an issue we shall take up in the next section.

Often, families of feasible preparations are of interest. Given

w = (w_{1}, \dots, w_{n})

, we denote by

P^{w}

, respectively

P^{w ↓}

, the families which consist of all preparations

P^{w} (h)

, respectively

P^{w} (h^{↓})

, which can be obtained by varying

h

.

Clearly, the feasible preparations can also be expressed by reference to the derived effort function

Φ

rather than

\hat{Φ}

. We use the notation

P^{y} (h)

and

P^{y} (h^{↓})

for, respectively, the

Φ^{y}

-level set

{Φ^{y} = h}

and the

Φ^{y}

-sub level set

{Φ^{y} \leq h}

. If

\hat{y} = w

,

P^{y} (h) = P^{w} (h)

and

P^{y} (h^{↓}) = P^{w} (h^{↓})

(note that for an expression such as

P^{q} (h)

, the nature of q determines if this is a

\hat{Φ}

- or a

Φ

-level set). For finite sequences

y = (y_{1}, \dots, y_{n})

of elements of Y and

h = (h_{1}, \dots, h_{n})

of real numbers, the sets

P^{y} (h)

and

P^{y} (h^{↓})

are defined in the obvious manner as are the families of preparations

P^{y}

, respectively

P^{y ↓}

.

The level sets may be used to define certain special belief instances or controls which will later, theoretically as well as for applications, play a significant role. Given is a certain preparation

P

. Then, the core of

P

consists of all belief instances y for which the effort

Φ (x, y)

is finite and independent of x as long as x is consistent. This notion, appropriately adjusted, also makes sense for the

\hat{Y}

-domain. Notation and defining requirements are given as follows:

\begin{matrix} core (P) & = {y ≻ P | \exists h < \infty : P \subseteq P^{y} (h)}, \end{matrix}

(28)

\begin{matrix} \widehat{core} (P) & = {w ≻ P | \exists h < \infty : P \subseteq P^{w} (h)} . \end{matrix}

(29)

If

y \in core (P)

, respectively

w \in \widehat{core} (P)

, we also say that y, respectively w, is robust.

We shall refine the notions above in two ways. Firstly, for a family

P

of preparations—such as a family of the form

P^{w}

defined above—the core is defined as the intersection of the individual cores:

\begin{matrix} core (P) & = ⋂_{P \in P} core (P), \end{matrix}

(30)

\begin{matrix} \widehat{core} (P) & = ⋂_{P \in P} \widehat{core} (P) . \end{matrix}

(31)

The second refinement we have in mind depends on an auxiliary preparation

E

, assumed to be a subset of the given preparation

P

. For the

\hat{Y}

-domain, a control

w^{*} ≻ P

is a

(E, P)

-robust strategy for Observer if there exists a finite constant h, such that the following two conditions hold:

\begin{matrix} \hat{Φ} (x, w^{*}) = h for all x \in E, \end{matrix}

(32)

\begin{matrix} \hat{Φ} (x, w^{*}) \leq h for all x \in P . \end{matrix}

(33)

When

E = P

we recover the original notion of robustness. The similar notion for belief instances is defined in the obvious way. Notation and defining relations for the corresponding adjustments of the notion of core are as follows:

\begin{matrix} core (E | P) & = {y ≻ P | \exists h < \infty : E \subseteq P^{y} (h), P \subseteq P^{y} (h^{↓})}, \end{matrix}

(34)

\begin{matrix} \widehat{core} (E | P) & = {w ≻ P | \exists h < \infty : E \subseteq P^{w} (h), P \subseteq P^{w} (h^{↓})} . \end{matrix}

(35)
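For finite preparations, the requirements behind (28), (29), (34) and (35) reduce to simple checks. The following sketch is only an illustration under assumed conventions: states are distributions on {0, 1, 2}, a control is a vector of three numbers, the effort is the average of the control under the state, and every control is taken to control every state. The helper names are hypothetical.

```python
INF = float("inf")

def effort(x, w):
    """Hypothetical effort: the average of the control w under the state x."""
    return sum(xi * wi for xi, wi in zip(x, w))

def is_robust(preparation, w, tol=1e-9):
    """Robustness: the effort is finite and the same for every state in the preparation."""
    values = [effort(x, w) for x in preparation]
    return max(values) < INF and max(values) - min(values) <= tol

def is_relatively_robust(E, P, w, tol=1e-9):
    """(E, P)-robustness, cf. (32) and (33): level h on all of E, at most h on all of P."""
    h = effort(E[0], w)
    constant_on_E = all(abs(effort(x, w) - h) <= tol for x in E)
    bounded_on_P = all(effort(x, w) <= h + tol for x in P)
    return h < INF and constant_on_E and bounded_on_P

# States with mean value 1, and a larger preparation with mean value at most 1.
E = [(0.0, 1.0, 0.0), (0.5, 0.0, 0.5), (0.25, 0.5, 0.25)]
P = E + [(1.0, 0.0, 0.0), (0.5, 0.5, 0.0)]

w_affine = (1.0, 1.5, 2.0)  # depends on the state only through its mean value
w_other = (1.0, 1.0, 3.0)   # does not

robust_flags = (is_robust(E, w_affine), is_robust(E, w_other))
relative_flags = (is_relatively_robust(E, P, w_affine), is_relatively_robust(E, P, w_other))
```

The control that is affine in the quantity defining the preparation is robust on E and (E, P)-robust, whereas the other control is neither.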

From a formal point of view, it does not matter if we use

P^{w}

-type sets or

P^{y}

-type sets as the basis for the definition of feasible preparations. However, entering into more speculative interpretations, the

P^{w}

-type sets which emphasize control seem preferable. Individual controls

w \in \hat{Y}

or a collection of such controls point to experiments which Observer may perform. An experimental setup identifies a certain preparation, and thus determines what is known to Observer. Determining all preparations which can arise in this way, we are led to the class of feasible preparations as defined above.

As to the nature of the various controls, we imagine that they are derived from description. To control a situation, you must be able to describe it, and with a description you have the key to control. We may imagine that, corresponding to a control w, Observer can realize a certain experimental setup consisting of various parts – measuring instruments and the like. In particular, there is a special handle which is used to fix the level of effort. If the level, perhaps best thought of as a kind of temperature, is fixed to be h, the states available to Nature are those in the appropriate feasible preparation. Several experiments can be carried out with the same equipment by adjusting the setting of the handle. If Observer wants to constrain the states by other means, he can add equipment corresponding to another control

w^{'}

and choose a level

h^{'}

for the experimental setup constructed based on

w^{'}

. The result is a restriction of the available states to the intersection of the two preparations involved. If the preparation is

P^{w} (h^{↓})

and the actual state is not inside this preparation, you may imagine that the result is overheating and breakdown of the experimental setup! Thus you must keep the state inside the preparation and this may well be what requires an effort as specified by

\hat{Φ}

.

2.10. Inference via Games, Some Basic Concepts

For this section,

(\hat{Φ}, H, \hat{D})

is an effort-based information triple over

X \otimes \hat{Y}

and

(Φ, H, D)

the derived triple over

X \otimes Y

. Further, a preparation

P

is given, conceived as the partial information “

x \in P

”. In practice,

𝒫

will be a feasible preparation, but we need not assume so for this section.

The process of inference concerns the identification of “sensible” states in

𝒫

—ideally only one such state, the inferred state. In many cases, this can be achieved by game theoretical methods involving a two-person zero-sum game. As it turns out, this will result in double inference where also either control instances or belief instances will be identified—ideally, only one such instance, the inferred control or the inferred belief instance as the case may be.

An inferred state, say

x^{*}

, brings Observer as close as possible to the truth in a way specified by the method applied. The same may be said about an inferred belief instance, or you may find it more appropriate to view an inferred belief instance as a final representation of Observer's subjective views and convictions. Turning to controls, an inferred control is conceived as an invitation to Observer to act, say regarding the setup of experiments and performance of subsequent observations. In this way, actions by Observer as dictated by an inferred control

w^{*}

are conceived as that which Observer needs in order to justify the inference

x^{*}

about truth. In short, double inference gives Observer information both about what can be inferred about truth and how.

Given

𝒫

, we shall study two closely related two-person zero-sum games, the control game

\hat{γ} (P)

, and the belief game

γ (P)

, also referred to as the derived game. If need be, we may write

\hat{γ} (P | \hat{Φ})

and

γ (P | Φ)

. The games have Nature and Observer as players and

\hat{Φ}

, respectively

Φ

as objective function. Nature is understood to be a maximizer, Observer a minimizer. For both games, strategies for Nature involve the choice of a consistent state. Observer strategies for

\hat{γ} (P)

are controls from which every state in

𝒫

can be controlled. For

γ (P)

, Observer strategies are belief instances from which every state in

P

is visible, in other words, they are viewpoints of

P

. Thus pairs of permissible strategies for the two games are either pairs

(x, w)

with

x \in P

and

w ≻ P

(with the understanding that

w \in \hat{Y}

) or pairs

(x, y)

with

x \in P

and

y ≻ P

(with the understanding that

y \in Y

). Consistent with the discussion in Section 2.4, an Observer strategy may be thought of as a strategy which is not “completely stupid” whatever the strategy of Nature, as long as that strategy is consistent. The choice of strategy for Observer may be a real choice, whereas, for Nature, it is often more appropriate to have a fictive choice in mind which reflects Observer’s speculations about what the truth could be.

A remark is in order regarding models where it is unnatural to work with controls and only belief is involved. Then the basis will be an effort-based information triple

(Φ, H, D)

over

X \otimes Y

and only one type of game,

γ (P)

will be involved. Formally, this may be considered a derived game by artificially introducing

\hat{Y} = Y

,

X \otimes \hat{Y} = X \otimes Y

, by taking response to be the identity map and by taking

(\hat{Φ}, H, \hat{D})

to be identical with

(Φ, H, D)

. Thus the approach we shall take, with a primary focus on the control games based on objects for the

\hat{Y}

-domain is, formally, the more general one.

Following standard philosophy of game theory, Observer should always be prepared for a choice by Nature which is least favourable to him. One can argue that in our setting anything else would mean that Observer would not have used all available information. The line of thought goes well with Jaynes' thinking as collected in [9], though there you find no reference to game theory.

In order for our exposition to be self-contained and also because our games are slightly at variance with what is normally considered, we shall here give full details regarding definitions and proofs. As references to game theory and applications to the physical sciences, refs. [32,56,57] may be useful.

Let us introduce basic notions for the control game and then comment more briefly on the derived game. The two values of

\hat{γ} (P)

are, for Nature,

\sup_{x \in P} \inf_{w ≻ x} \hat{Φ} (x, w)

(36)

and, for Observer,

\inf_{w ≻ P} \sup_{x \in P} \hat{Φ} (x, w) .

(37)

Note the slight deviation from usual practice in that w in the infimum in (36) varies over

\widehat{[x]}

and not just over

\widehat{[P]}

or some other set independent of x. Philosophically, one may argue that Nature does not know of the restriction to

𝒫

—this is something Observer has arranged—and hence cannot know of any restriction besides the natural one

w ≻ x

. As the infimum in (36) is nothing but the entropy

H (x)

, the value for Nature is the maximum entropy value, also referred to as the MaxEnt-value:

H_{\max} (P) = \sup_{x \in P} H (x) .

(38)

Problems on the determination of

H_{\max} (P)

and associated strategies are classical problems known from information theory or statistical physics. If

x^{*} \in P

and

H (x^{*}) = H_{\max} (P)

,

x^{*}

is an optimal strategy for Nature, also referred to as a MaxEnt-state or MaxEnt-strategy. The archetypal concrete problems of this nature are discussed in Section 3.7.

As to the value for Observer, we identify the supremum in (37) with the risk associated with the strategy w and denote it by

\hat{Ri} (w | P)

:

\hat{Ri} (w | P) = \sup_{x \in P} \hat{Φ} (x, w) .

(39)

The value for Observer then is the minimal risk of the game, also referred to as the MinRisk-value:

{\hat{Ri}}_{\min} (P) = \inf_{w ≻ P} \hat{Ri} (w | P) .

(40)

An optimal strategy for Observer is a control

w^{*} ≻ P

with

\hat{Ri} (w^{*} | P) = {\hat{Ri}}_{\min} (P)

, also referred to as a MinRisk-control or a MinRisk-strategy. Note the general validity of the minimax inequality:

H_{\max} (P) \leq {\hat{Ri}}_{\min} (P) .

(41)

Indeed, for arbitrary

x \in P

and arbitrary

w ≻ P

,

H (x) = \hat{Φ} (x, \hat{x}) \leq \hat{Φ} (x, w) \leq \hat{Ri} (w | P)

and taking supremum over x and infimum over w, (41) follows. If (41) holds with equality and defines a finite quantity, the game is said to be in game theoretical equilibrium, or just in equilibrium, and the common value of

H_{\max} (P)

and

{\hat{Ri}}_{\min} (P)

is the value of the game.
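For a finite game, the quantities in (36) to (41) can be read off an effort matrix. The sketch below uses invented numbers and makes two simplifying assumptions: a control is taken to control a state exactly when the corresponding effort is finite, and the entropy of a state is the smallest effort over the controls that control it.

```python
INF = float("inf")

# Rows are the states of the preparation, columns are controls; entries are efforts.
EFFORT = [
    [1.0, 2.0, 1.5],
    [3.0, 1.0, 1.5],
]

def entropy(i):
    """H(x): the inner infimum in (36), here a minimum over finite entries of row i."""
    return min(v for v in EFFORT[i] if v < INF)

def risk(j):
    """Risk (39) of control j: the largest effort over the states of the preparation."""
    return max(row[j] for row in EFFORT)

states = range(len(EFFORT))
# Controls from which every state in the preparation can be controlled.
controls = [j for j in range(len(EFFORT[0])) if risk(j) < INF]

h_max = max(entropy(i) for i in states)   # value for Nature, (38)
ri_min = min(risk(j) for j in controls)   # value for Observer, (40)
in_equilibrium = h_max == ri_min and h_max < INF
```

Here the minimax inequality (41) holds strictly, with MaxEnt-value 1.0 and MinRisk-value 1.5, so this particular game is not in equilibrium.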

A further notion of equilibrium is attached to Nash’s name. It should, however, be said that for the relatively simple case here considered (two players, zero sum), the ideas we need originated with von Neumann, see [58,59] and, for a historical study, Kjeldsen [60]. A pair of permissible strategies

(x^{*}, w^{*})

is a Nash equilibrium pair for

\hat{γ} (P)

if, with these strategies, none of the players have an incentive to change strategy—provided the opponent does not do so either. This means, for Nature, that

\forall x \in P : \hat{Φ} (x, w^{*}) \leq \hat{Φ} (x^{*}, w^{*}),

(42)

and, for Observer, that

\forall w ≻ P : \hat{Φ} (x^{*}, w) \geq \hat{Φ} (x^{*}, w^{*}) .

(43)

The inequalities (42) and (43) constitute a special case of the celebrated saddle-value inequalities of game theory. Note that, in our case, one of these inequalities, namely (43), is automatic if

(x^{*}, w^{*})

is an adapted pair. This implies that

x^{*} \in ctr (P)

and that

w^{*} \in \hat{P}

as follows from the following trivial observation:

Proposition 1.

If

x^{*}

and

w^{*}

are permissible strategies for the two players in

\hat{γ} (P)

and if

w^{*}

is adapted to

x^{*}

, then

x^{*} \in ctr (P)

and

w^{*} \in \widehat{ctr} (P)

.

Proof.

By hypothesis,

x^{*} \in P

,

w^{*} ≻ P

and

w^{*} = \hat{x^{*}}

, hence

w^{*} \in \hat{P} \cap \widehat{[P]} = \widehat{ctr} (P)

, equivalent to the statement

x^{*} \in ctr (P)

. ☐
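The saddle-value inequalities (42) and (43) are easy to test for a finite effort matrix. The following sketch uses invented numbers, and the function name is hypothetical. Rows are states in the preparation and columns are controls, all of which are assumed to control every state.

```python
def is_nash_pair(effort, i_star, j_star):
    """Check the saddle-value inequalities (42) and (43) for the pair (i_star, j_star)."""
    value = effort[i_star][j_star]
    # (42): Nature cannot increase the effort by changing state.
    nature_ok = all(row[j_star] <= value for row in effort)
    # (43): Observer cannot decrease the effort by changing control.
    observer_ok = all(v >= value for v in effort[i_star])
    return nature_ok and observer_ok

EFFORT = [
    [2.0, 3.0],
    [1.0, 4.0],
]

nash_pairs = [
    (i, j)
    for i in range(len(EFFORT))
    for j in range(len(EFFORT[0]))
    if is_nash_pair(EFFORT, i, j)
]
```

In this example the only Nash equilibrium pair is the first state together with the first control. The first control is adapted to the first state in the sense that it minimizes the effort in that row, so (43) holds automatically there.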

Key notions and definitions for the belief game

γ (P)

are quite parallel to what we have discussed for the control game. Briefly, the values of

γ (P)

are

\sup_{x \in P} \inf_{y ≻ x} Φ (x, y)

(for Nature) and

\inf_{y ≻ P} \sup_{x \in P} Φ (x, y)

(for Observer) and notions of strategies and optimal strategies are defined in an obvious manner. We notice that the value for Nature in

γ (P)

is

H_{\max} (P)

, the same as the value for Nature in

\hat{γ} (P)

and that the notions of optimal strategy for Nature in the two games are equivalent. We use Ri as notation for risk in

γ (P)

, i.e., for

y ≻ P

Ri (y | P) = \sup_{x \in P} Φ (x, y) .

(44)

Clearly, for any

y ≻ P

,

Ri (y | P) = \hat{Ri} (\hat{y} | P) .

(45)

Therefore, if

y_{1} \hat{\sim} y_{2}

and one of these belief instances is a viewpoint of

P

, then so is the other and the associated risks are the same. The value for Observer in

γ (P)

is

{Ri}_{\min} (P) = \inf_{y ≻ P} Ri (y | P) .

(46)

The game

γ (P)

is in equilibrium if the two values of the game coincide and are finite. A pair

(x^{*}, y^{*})

of permissible strategies is a Nash equilibrium pair for

γ (P)

if the two saddle-value inequalities hold:

\begin{matrix} \forall x \in P : Φ (x, y^{*}) \leq Φ (x^{*}, y^{*}), \end{matrix}

(47)

\begin{matrix} \forall y ≻ P : Φ (x^{*}, y) \geq Φ (x^{*}, y^{*}) . \end{matrix}

(48)

Basic relationships between the values for the players in the belief game and the control game may be summarized as follows.

Proposition 2.

The values for Nature in

γ (P)

and in

\hat{γ} (P)

coincide and are equal to the MaxEnt value

H_{\max} (P)

. The corresponding values for Observer in the two games are

{Ri}_{\min} (P)

, respectively

{\hat{Ri}}_{\min} (P)

. In general,

{Ri}_{\min} (P) \geq {\hat{Ri}}_{\min} (P) .

(49)

If response is surjective, equality holds in (49). Equality also holds if

γ (P)

is in equilibrium. In that case also

\hat{γ} (P)

is in equilibrium and the values for the two games coincide:

{Ri}_{\min} (P) = {\hat{Ri}}_{\min} (P) = H_{\max} (P)

.

Proof.

The first statement regarding the values for Nature is trivial and also noted above. The inequality (49) follows by (45), which also implies that equality holds in case response is surjective. If

γ (P)

is in equilibrium, apply the minimax inequality to

\hat{γ} (P)

, exploit equilibrium of

γ (P)

as well as the inequality (49) and you find that

{\hat{Ri}}_{\min} (P) \geq H_{\max} (P) = {Ri}_{\min} (P) \geq {\hat{Ri}}_{\min} (P) .

It follows that also

\hat{γ} (P)

is in equilibrium. Clearly, the values for the two games coincide. ☐
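The role of surjectivity in Proposition 2 can be illustrated with a small finite model. Everything in the sketch below is an assumption made for illustration: the effort matrix, the two belief instances (identified with the two states) and a response map whose range misses the third control.

```python
# Effort over controls: rows are states, columns are controls w0, w1, w2.
EFFORT_HAT = [
    [1.0, 2.0, 1.5],
    [3.0, 1.0, 1.5],
]

# Hypothetical response map from belief instances to controls; w2 is not a response.
RESPONSE = {"y0": 0, "y1": 1}

def risk_hat(j):
    """Risk of control j in the control game, cf. (39)."""
    return max(row[j] for row in EFFORT_HAT)

def risk(y):
    """Risk of belief instance y in the belief game, using the derived effort."""
    return max(row[RESPONSE[y]] for row in EFFORT_HAT)

ri_min = min(risk(y) for y in RESPONSE)                             # (46)
ri_hat_min = min(risk_hat(j) for j in range(len(EFFORT_HAT[0])))    # (40)
```

The identity (45) holds for each belief instance, and the inequality (49) is strict here (2.0 against 1.5) because the control with the smallest risk is not the response of any belief instance.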

As it will turn out, in a great many cases of relevance for the applications, it is possible rather directly to identify optimal strategies for the players and to show that the games considered are in equilibrium. Furthermore, in many cases there is a natural relationship between the

\hat{γ}

- and the

γ

-type games with the effect that, typically, there is a unique optimal strategy for Observer in

\hat{γ} (P)

and this strategy, a certain control, is adapted to any optimal strategy for Nature in the games

\hat{γ} (P)

and

γ (P)

. Even more so, there is a tendency for the unique optimal control to be robust.

Results to support these claims will be taken up in Section 2.12. The results require that somehow you have good candidates for the hoped-for optimal strategies. For this, the indicated tendency towards robustness is a clue to how such candidates can actually be found in concrete cases of interest. In fact, a search for optimal objects via robustness is very efficient and more natural than the usual approach via the differential calculus as we shall also comment on in Section 2.12.

2.11. Refined Notions of Properness

The discussion to follow may appear unnecessary since normally, the standard notion of properness will apply. However, there are interesting cases where this is not so. Therefore, there is a need to look for suitable weaker notions which are still strong enough to have desirable consequences, especially regarding properties of optimal strategies. To justify considering the weaker notions of properness presented below, we point to the general results of Section 2.12 and to the extended applicability of a well-known construction due to Bregman, cf., Section 3.1 and Appendix A.

With assumptions as in Section 2.10, let us assume that

\hat{γ} (P)

is in equilibrium and, for simplicity, that there is a unique MaxEnt-state

x^{*}

. Let us think of the system which Observer is studying as a physical system subject to the laws of statistical physics. Then Observer will expect that after some lead-in time, the system will stabilize and

x^{*}

will represent the true state of the system. Observer aims at choosing a control which is optimal and at the same time adapted to Nature's choice,

x^{*}

. Unfortunately, Observer does not know which state this is among the consistent states. So Observer cannot just choose the control

w^{*} = \hat{x^{*}}

adapted to

x^{*}

, but has to somehow choose some control of

P

, say w.

At this point we introduce a built-in learning mechanism operating over time which may lead Observer in the right direction. The idea is illuminated by introducing an all-knowing being, Guru. Guru will not reveal the truth to Observer directly but may respond to specific questions. With this option, Observer may eventually end up with a choice of just the right control.

The three questions we shall consider all concern the entropy

H (x^{*})

which Observer expects to be the MaxEnt-value. The questions are all related to the inequality

\hat{Φ} (x, w) \leq H (x^{*}) .

(50)

The questions put higher and higher demands on the chosen control w and are as follows:

$Q_{1}$: Does (50) hold for $x = x^{*}$?
$Q_{2}$: Does (50) hold for all consistent x?
$Q_{3}$: Does (50) even hold with equality for all consistent x?

With Question

Q_{1}

, Observer wants to know if the effort he applies is minimal. Clearly, in view of the linking identity and the fundamental inequality—and as

H (x^{*}) < \infty

by the assumed equilibrium of

\hat{γ} (P)

—the question is equivalent to asking if

\hat{D} (x^{*}, w) = 0

. If the reply is negative, Observer knows that his choice cannot be optimal and he will then choose another control. But even with an affirmative answer, i.e., when

\hat{D} (x^{*}, w) = 0

, Observer may not be satisfied and may, therefore, continue the questioning. If the information triple is proper, an affirmative answer to

Q_{1}

will tell Observer that

w = \hat{x^{*}}

and he may be satisfied—even though it could still happen, as examples will show, that w is not optimal. Further questioning may thus only be needed if the information triple is not proper—or not known to be proper.

For the second question,

Q_{2}

, Observer is worried about his risk in case the state should somehow change. The question is equivalent to asking if

\hat{Ri} (w | P) \leq H (x^{*})

. With a negative reply, Observer will dismiss the choice of w, if for no other reason than that w cannot then be optimal. If the reply is positive, w is optimal and one may wonder if Observer will still find any further checking necessary. The suggested third question

Q_{3}

reflects the ambition of Observer that he wants the control to be robust at the level

H (x^{*})

.
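The three questions can be phrased as simple tests on a finite effort matrix. The sketch below is only an illustration with invented numbers. The triple is deliberately chosen not to be proper, since every control has zero divergence from the MaxEnt-state, which is exactly the situation where the questions beyond the first carry information.

```python
def q1(effort, i_star, j, h_star, tol=1e-9):
    """Q1: does (50) hold for x = x*?"""
    return effort[i_star][j] <= h_star + tol

def q2(effort, j, h_star, tol=1e-9):
    """Q2: does (50) hold for all consistent x?"""
    return all(row[j] <= h_star + tol for row in effort)

def q3(effort, j, h_star, tol=1e-9):
    """Q3: does (50) hold with equality for all consistent x?"""
    return all(abs(row[j] - h_star) <= tol for row in effort)

# Rows are the consistent states, columns are candidate controls.
EFFORT = [
    [2.0, 2.0, 2.0],  # x*, the MaxEnt-state, with H(x*) = 2
    [1.0, 3.0, 2.0],
]
I_STAR, H_STAR = 0, 2.0

answers = {
    j: (q1(EFFORT, I_STAR, j, H_STAR), q2(EFFORT, j, H_STAR), q3(EFFORT, j, H_STAR))
    for j in range(len(EFFORT[0]))
}
```

All three controls pass the first question, the second control is dismissed by the second question, and only the third control, which is robust at the level H(x^{*}), passes the third.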

Motivated by our considerations, we shall say that the information triple

(\hat{Φ}, H, \hat{D})

is

Q_{1}

,

Q_{2}

or

Q_{3}

-proper over

P

if, with

x^{*} \in P

, we can conclude that w is adapted to

x^{*}

from affirmative answers to, respectively, question

Q_{1}

,

Q_{2}

or

Q_{3}

. If we just talk about, say

Q_{2}

-properness, it is understood that the conditions hold with

P = X

. If the entropy function is finite-valued,

Q_{1}

-properness is equivalent to (standard) properness.

Concerning the questions put to Guru, one may wonder why Observer does not simply ask directly either if the chosen control is optimal or if it is adapted to the truth. In this connection, we remark that questions which can be put to Guru must depend on the possibilities for Observer’s communication with the system. For a further discussion of this, one should replace Guru with some mathematically defined rules for this communication. Such rules may reflect the kind of experiments and associated measurements which Observer can perform on the system.

2.12. Inference via Games, Some Basic Results

We shall investigate the possibility of identifying optimal strategies based on a suggestion of possible candidates. Moreover, when optimal strategies exist, we shall look at the ensuing consequences. This approach will involve problems which are easy to handle technically and yet, it may be argued that from an applied point of view the results obtained are of greater significance than theoretically more sophisticated results, such as those developed in Section 2.16. Several examples illustrating this point of view are listed in Section 3.

As in the previous section, an effort-based information triple

(\hat{Φ}, H, \hat{D})

over

X \otimes \hat{Y}

, the underlying triple, is given together with a preparation

P

.

When we speak about an optimal state without any further specification it is understood that we have an optimal strategy for Nature in one of the games

\hat{γ} (P)

or

γ (P)

in mind. As we observed in the previous section it does not matter which game we think of. Moreover, when we speak of an optimal belief instance, respectively an optimal control it is also clear what we have in mind, viz., an optimal strategy for Observer in

γ (P)

, respectively in

\hat{γ} (P)

.

In our first result we investigate situations where, in addition to a requirement of equilibrium, there exist optimal strategies for both players.

Theorem 1 (Optimal strategies, basics).

(i): If

γ (P)

is in equilibrium and both players have optimal strategies in this game, then also

\hat{γ} (P)

is in equilibrium and optimal strategies for both players in that game exist. Further, the values of the two games agree and, if

(x^{*}, y^{*})

are optimal strategies in

γ (P)

, then

(x^{*}, \hat{y^{*}})

are optimal strategies in

\hat{γ} (P)

(but there may be many other optimal strategies).

(ii): Now assume that

(\hat{Φ}, H, \hat{D})

is proper. Then, if

\hat{γ} (P)

is in equilibrium and both players have optimal strategies, say

x^{*}

and

w^{*}

, then

x^{*} \in ctr (P)

,

w^{*} \in \widehat{ctr} (P)

and

w^{*} = \hat{x^{*}}

. It follows that the optimal control is unique. Furthermore, also

γ (P)

is in equilibrium and both players have optimal strategies. A belief instance is optimal in

γ (P)

if and only if it has

w^{*}

as response. If response is injective, each of the three optimal strategies associated with

γ (P)

and

\hat{γ} (P)

—the optimal state

x^{*}

, the optimal belief instance

y^{*}

and the optimal control

w^{*}

—are unique and

x^{*} = y^{*}

.

Proof.

(i): Assume that

γ (P)

is in equilibrium and that

(x^{*}, y^{*})

are optimal strategies for this game. Much as in the proof of Proposition 2, we find that under the stated conditions

H_{\max} (P) = H (x^{*}) = Ri (y^{*} | P) = \hat{Ri} (\hat{y^{*}} | P)

and the claimed assertions follow readily.

(ii): Now assume that

\hat{γ} (P)

is in equilibrium and that

(x^{*}, w^{*})

are optimal strategies for this game. By the defining relations (8) and (9), by the assumed equilibrium, by optimality of

x^{*}

and of

w^{*}

and by the definition (39) of risk, we find that

\hat{Φ} (x^{*}, w^{*}) \geq \hat{Φ} (x^{*}, \hat{x^{*}}) = H (x^{*}) = H_{\max} (P) = {\hat{Ri}}_{\min} (P) = \hat{Ri} (w^{*} | P) \geq \hat{Φ} (x^{*}, w^{*}),

(51)

hence equality must hold throughout. Further, as

H (x^{*}) < \infty

, we conclude that

\hat{D} (x^{*}, w^{*}) = 0

, hence by properness that

w^{*} = \hat{x^{*}}

. Then, by Proposition 1,

x^{*} \in ctr (P)

and

w^{*} \in \widehat{ctr} (P)

.

Since

x^{*}

above was an arbitrary optimal strategy for Nature and

w^{*}

an arbitrary optimal strategy for Observer, and by the fact

w^{*} = \hat{x^{*}}

just established, we conclude that the optimal Observer strategy is unique and further, that all optimal strategies for Nature are response-equivalent, lie in

ctr (P)

and have the optimal control as response.

We leave it to the reader to establish the stated results for

γ (P)

, say by noting that

y ≻ P

is equivalent with

\hat{y} ≻ P

and that

Ri (y | P) = \hat{Ri} (\hat{y} | P)

and by using the first facts established.

In case response is injective, the uniqueness assertions are easily established and the identity of

x^{*}

and

y^{*}

follows as these belief instances are response-equivalent. ☐
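The step from (51) to the vanishing of divergence in the proof of (ii) rests on the linking identity for the underlying triple. Written out as a short sketch, assuming as in the proof that H (x^{*}) is finite by the equilibrium hypothesis:

```latex
\begin{align*}
\hat{Φ} (x^{*}, w^{*}) &= H (x^{*}) + \hat{D} (x^{*}, w^{*}) && \text{(linking identity)} \\
\hat{Φ} (x^{*}, w^{*}) &= H (x^{*}) && \text{(equality throughout (51))} \\
\hat{D} (x^{*}, w^{*}) &= 0 && \text{(subtract the finite quantity } H (x^{*}) \text{)}
\end{align*}
```

Properness then identifies w^{*} as the control adapted to x^{*}.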

Some remarks are in order.

Remark 1.

Simple and very concrete “toy examples” over discrete sets (either finite or countably infinite) may be constructed to illuminate various assumptions and to investigate the limits of the conclusions. This involves matrix games which are easy to visualize. In this way one realizes that the games may be in equilibrium and yet there may be no optimal strategy for either of the players, or there may be one or several optimal strategies for one of the players and none for the other. Three such examples for games in equilibrium and with an underlying proper information triple are indicated in Figure 1 where the rows are states and the columns controls (or belief instances). In case (a) there is a unique optimal control but no optimal state, in case (b) there is a unique optimal state but no optimal control and in case (c) all controls are optimal but there is no optimal state. It is also easy to construct an example where all states are optimal but no control is so.

Remark 2.

Regarding the necessity of injectivity of response in the last part of the theorem, note that if this condition does not hold, there may be strategies for Nature with the optimal control

w^{*}

as response which are not optimal. Simple examples, say with “collapse of response”, i.e., with

\hat{Y}

a singleton, will demonstrate that.

Remark 3.

Several remarks on the assumption of properness are in order. First note that we did not have to assume that response is surjective in order to prove that the optimal strategy

w^{*}

in the second part of the theorem is in the range of this map. The assumed properness takes care of that. However, we need not assume that properness in its strongest form holds but may work with the weaker forms introduced in Section 2.11.

To make this more precise, first note that all assertions of the second part of Theorem 1 continue to hold if properness is replaced by

Q_{2}

-properness. This follows from the discussion in Section 2.11 by noting that from the relations in (51) one can conclude, not only that

\hat{D} (x^{*}, w) = 0

, but also that

\hat{Ri} (w | P) \leq H (x^{*})

. In this way some, but not all, of the concrete models discussed in Appendix A can be handled.

We add, without going through the details, that if we assume that the weaker

Q_{3}

-properness holds in conjunction with an assumption of robustness, viz., that all controls which are adapted to an optimal state are robust, then uniqueness of a robust optimal control is secured. The robustness condition appears to be related to a requirement that response be defined “appropriately”. For the models of Appendix A this requires that special care is taken when defining response at boundary points of the state space.

In the sequel some results are proved under the assumption of

Q_{2}

-properness. This is, so we claim, a simple, worthwhile and natural extension of results proved only under an assumption of standard

Q_{1}

-properness. Even more general results involving also robustness as just indicated may well be possible. However, it seems that before that will make much sense, one should develop results and constructions going beyond what is indicated in Appendix A and in Corollary 6 further on.

Inspired by Theorem 1, a pair

(x^{*}, w^{*})

of permissible strategies is said to be a bi-optimal pair, if

H (x^{*}) = \hat{Ri} (w^{*} | P) < \infty

and if

w^{*}

is the only optimal control. As follows from the theorem and from Remark 3, the required uniqueness property of

w^{*}

is automatic under an assumption of

Q_{2}

-properness of the underlying information triple and further,

w^{*}

must be adapted to

x^{*}

.

If only a state

x^{*}

is given, we say that the state is bi-optimal if

(x^{*}, w^{*})

is a bi-optimal pair with

w^{*}

adapted to

x^{*}

.

Whereas it may be difficult to find optimal strategies, it is often easy to check if given candidates are in fact optimal:

Theorem 2.

[Identification] Under the assumptions of

Q_{2}

-properness, let

x^{*}

be a state in

ctr (P)

with finite entropy and let

w^{*}

be a control of

P

.

Then a necessary and sufficient condition that the pair

(x^{*}, w^{*})

is bi-optimal is that it is a Nash equilibrium pair. If this is so,

w^{*}

is adapted to

x^{*}

.

Proof.

First note that (42) is equivalent with the requirement

\hat{Ri} (w^{*} | P) \leq \hat{Φ} (x^{*}, w^{*})

and that, because

\hat{x^{*}} ≻ P

is known to hold (as

x^{*} ≻ P

), (43) is equivalent with the requirement

\hat{Φ} (x^{*}, w^{*}) \leq H (x^{*})

.

Thus, when (42) and (43) hold, we find, also invoking the minimax inequality, that

\hat{Ri} (w^{*} | P) \leq \hat{Φ} (x^{*}, w^{*}) \leq H (x^{*}) \leq \hat{Ri} (w^{*} | P)

hence, recalling that

H (x^{*}) < \infty

, both

\hat{D} (x^{*}, w^{*}) = 0

and

\hat{Ri} (w^{*} | P) = H (x^{*})

follow. By

Q_{2}

-properness, we then realize that

w^{*}

is adapted to

x^{*}

. Collecting facts established, we conclude that

(x^{*}, w^{*})

is a bi-optimal pair. This proves sufficiency.

The necessity and the last part of the theorem follow from Theorem 1 and the above noticed equivalent forms of the saddle-value inequalities. ☐

Elaborating slightly, we obtain the following corollary:

Corollary 1.

Under the assumption of

Q_{2}

-properness, if

(x^{*}, w^{*})

are permissible strategies for

\hat{γ} (P)

with

x^{*} \in ctr (P)

,

H (x^{*}) < \infty

and with

w^{*}

adapted to

x^{*}

, then a necessary and sufficient condition that

\hat{γ} (P)

and

γ (P)

are in equilibrium with

x^{*}

as bi-optimal state is that

\hat{Ri} (w^{*} | P) \leq H (x^{*})

, i.e., that

\forall x \in P : \hat{Φ} (x, w^{*}) \leq H (x^{*}) .

(52)

Proof.

Under the conditions stated, (43) is automatic and (52) is a reformulation of (42). Thus (52) implies that

(x^{*}, w^{*})

is a Nash equilibrium pair and the result follows from Theorem 2. ☐

An important and trivial consequence of the existence of a bi-optimal state is the validity of the Pythagorean inequalities. Let

x^{*}

be a bi-optimal state and

w^{*}

its response. The direct Pythagorean inequality, or just the Pythagorean inequality, is the inequality

H (x) + \hat{D} (x, w^{*}) \leq H (x^{*})

, typically considered for

x \in P

. This is nothing but a trivial rewriting of (52). When it holds,

H (x^{*}) = H_{\max} (P)

and the inequality for an individual state

x \in P

is, therefore, a sharper form of the trivial inequality

H (x) \leq H_{\max} (P)

. The dual Pythagorean inequality is the inequality

\hat{Ri} (w^{*} | P) + \hat{D} (x^{*}, w) \leq \hat{Ri} (w | P)

, typically considered for

w ≻ P

. When it holds,

\hat{Ri} (w^{*} | P) = {\hat{Ri}}_{\min} (P)

, and the inequality for an individual strategy

w ≻ P

is, therefore, a sharper form of the trivial inequality

{\hat{Ri}}_{\min} (P) \leq \hat{Ri} (w | P)

.
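The rewriting behind the direct inequality can be spelled out in one line. The sketch below assumes the linking identity of the information triple, $\hat{Φ} (x, w) = H (x) + \hat{D} (x, w)$, in the form in which it is used in the proof of Theorem 3, together with non-negativity of divergence.

```latex
% (52) rewritten by means of the linking identity (assumed as stated above)
\forall x \in P : \quad
\hat{\Phi}(x, w^{*}) \leq H(x^{*})
\;\Longleftrightarrow\;
H(x) + \hat{D}(x, w^{*}) \leq H(x^{*}) .

% Since \hat{D} \geq 0 and x^{*} \in P, taking the supremum over x \in P gives
H_{\max}(P) = \sup_{x \in P} H(x) \leq H(x^{*}) \leq H_{\max}(P) ,
\quad \text{hence} \quad H(x^{*}) = H_{\max}(P) .
```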

Theorem 3.

[Pythagorean inequalities] Under the assumption of

Q_{2}

-properness, if

γ (P)

and

\hat{γ} (P)

are in equilibrium with

x^{*}

as bi-optimal state then, with

w^{*} = \hat{x^{*}}

, the direct as well as the dual Pythagorean inequalities hold:

\begin{matrix} \forall x \in P : & H (x) + \hat{D} (x, w^{*}) \leq H (x^{*}), \end{matrix}

(53)

\begin{matrix} \forall w ≻ P : & \hat{Ri} (w^{*} | P) + \hat{D} (x^{*}, w) \leq \hat{Ri} (w | P) . \end{matrix}

(54)

Proof.

As to (53), this follows from Corollary 1. Also (54) must hold since, for

w ≻ P

,

\hat{Ri} (w^{*} | P) + \hat{D} (x^{*}, w) = H (x^{*}) + \hat{D} (x^{*}, w) = \hat{Φ} (x^{*}, w) \leq \hat{Ri} (w | P) .

☐

The Pythagorean flavour of (53) is more pronounced when one turns to models of updating, cf., Section 2.13 and Section 3.2.

Let us elaborate on the direct Pythagorean inequality. First, let us agree that a control w of

P

is a Pythagorean control for

\hat{γ} (P)

if, for every

x \in P

,

H (x) + \hat{D} (x, w) \leq H_{\max} (P) .

(55)

This notion will be used whether or not

\hat{γ} (P)

is in equilibrium and whether or not this game has optimal strategies. In particular, it applies in cases when no MaxEnt-state exists. Of course, the notion is only of interest if

H_{\max} (P) < \infty

.

Translating to the Y-domain, we say that y is a Pythagorean belief instance for

γ (P)

if

y ≻ P

and if, for every

x \in P

,

H (x) + D (x, y) \leq H_{\max} (P) .

(56)

Theorem 4.

Under the assumption of

Q_{2}

-properness, assume that a MaxEnt-state exists for the preparation

P

and that

H_{\max} (P) < \infty

. Then the following three conditions are equivalent:

a Pythagorean control for $\hat{γ} (P)$ exists;
a Pythagorean belief instance for $γ (P)$ exists;
the games $\hat{γ} (P)$ and $γ (P)$ are in equilibrium and a bi-optimal state for these games exists.

If these conditions are fulfilled, the Pythagorean control,

w^{*}

, is unique and identical to the optimal strategy for Observer in

\hat{γ} (P)

. Further, a belief instance,

y^{*}

with

y^{*} ≻ P

is a Pythagorean belief instance if and only if it has

w^{*}

as response.

Proof.

Assume that w is a Pythagorean control. Then, by (55),

\hat{Ri} (w | P) \leq H_{\max} (P)

. Choose a MaxEnt-state

x^{*}

. Then

H_{\max} (P) = H (x^{*}) < \infty

and (55) with

x = x^{*}

implies that

\hat{D} (x^{*}, w) = 0

. As

\hat{Ri} (w | P) \leq H (x^{*})

also holds,

Q_{2}

-properness shows that

w = \hat{x^{*}}

. Then, by Corollary 1,

\hat{γ} (P)

and

γ (P)

are in equilibrium with

x^{*}

as bi-optimal state. Appealing also to previous results, all statements of the theorem follow. ☐

The three results to follow are often useful in applications.

Theorem 5.

[Robustness theorem] Under the assumption of

Q_{2}

-properness, let

(x^{*}, w^{*})

be an adapted pair and assume that

w^{*}

is robust for

\hat{γ} (P)

, say at the level h of robustness, and that

x^{*}

is consistent. Then

\hat{γ} (P)

is in equilibrium with h as value and with

x^{*}

as bi-optimal state. Furthermore, for any

x \in P

, the Pythagorean inequality holds with equality:

H (x) + \hat{D} (x, w^{*}) = H_{\max} (P) .

(57)

Similarly, if

x^{*}

and

y^{*}

are response-equivalent, if

y^{*}

is robust for

γ (P)

and if

x^{*}

is consistent, then

γ (P)

is in equilibrium with

x^{*}

as bi-optimal state and, for

x \in P

,

H (x) + D (x, y^{*}) = H_{\max} (P) .

(58)

The equality (57) or (58) for

x \in P

is the Pythagorean equality, here in an abstract version. A more compact, geometrically flavoured formulation of the first part of Theorem 5, in the direction of Corollary 1, runs as follows:

Corollary 2.

Under the assumption of

Q_{2}

-properness, if h is finite and

x^{*} \in P \subseteq P^{\hat{x^{*}}} (h)

, then

h = H (x^{*})

and

\hat{γ} (P)

is in equilibrium with

x^{*}

as bi-optimal state.
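As a numerical illustration of the Pythagorean equality (57), the following minimal sketch works in the classical probabilistic model, which is an assumption made here for illustration only: states are probability vectors on {0, 1, 2}, H is Shannon entropy, D is Kullback-Leibler divergence, and the preparation consists of all states with a prescribed mean value. The candidate `x_star` is a Gibbs-type state; the helper names `gibbs_state` and `state_with_mean` and the parameter value 0.5 are hypothetical choices. In this model the effort against `x_star` is constant on the preparation, which is the robustness assumed in Theorem 5.

```python
import math

def entropy(x):
    """Shannon entropy H(x) in nats (assumed model, not the abstract H)."""
    return -sum(p * math.log(p) for p in x if p > 0)

def divergence(x, y):
    """Kullback-Leibler divergence D(x, y) in nats; y must be positive where x is."""
    return sum(p * math.log(p / q) for p, q in zip(x, y) if p > 0)

def gibbs_state(beta, values):
    """State with weights proportional to exp(-beta * value)."""
    weights = [math.exp(-beta * v) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

values = [0, 1, 2]
x_star = gibbs_state(0.5, values)                     # candidate bi-optimal state
mean = sum(p * v for p, v in zip(x_star, values))     # defines the preparation

def state_with_mean(t):
    """State on {0, 1, 2} with the prescribed mean and probability t on the value 1."""
    p2 = (mean - t) / 2
    return [1 - t - p2, t, p2]

# A few states of the preparation, all consistent with the mean constraint.
states = [state_with_mean(t) for t in (0.05, 0.2, 0.4, 0.6)]

# Deviation from the Pythagorean equality H(x) + D(x, x*) = H(x*).
gaps = [entropy(x) + divergence(x, x_star) - entropy(x_star) for x in states]
for x, gap in zip(states, gaps):
    print([round(p, 4) for p in x], f"gap = {gap:.2e}")
```

Up to rounding, every gap vanishes, in agreement with (57); the calculus-free character of the argument is visible in that no Lagrange multiplier is solved for.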

In case response is injective, the second part of Theorem 5 really only involves one element,

x^{*}

, as the other element,

y^{*}

, has to be identical to

x^{*}

. The two essential conditions are one on

x^{*}

as a strategy for Nature, viz., that it is consistent, and one on

x^{*}

as a strategy for Observer, viz., that it is robust. There can only be one such element. If we drop the condition of consistency, there may be many more such elements. They form the previously defined core of

γ (P)

.

For preparation families we find the following result:

Theorem 6.

Under the standard assumption on properness, consider a preparation family

P^{y}

with

y = (y_{1}, \dots, y_{n})

. Let

x^{*}

be a state, put

y^{*} = x^{*}

and assume that

y^{*} \in core (P^{y} | Φ)

. Further, put

h = (h_{1}, \dots, h_{n})

with

h_{i} = Φ (x^{*}, y_{i})

for

i = 1, \dots, n

and assume that these constants are finite. Then

P^{y} (h) \in P^{y}

and

γ (P^{y} (h))

is in equilibrium and has

x^{*}

as bi-optimal state. In particular,

x^{*}

is the MaxEnt strategy for

P^{y} (h)

.

This follows directly from the involved definitions and from Theorem 5. The reader will easily establish an analogous result for the

\hat{Y}

-domain.

The notions of robustness and core also make sense for games defined in terms of proper or just

Q_{2}

-proper utility-based information triples. If

(U, M, D)

is such a triple, we simply apply the above definitions to the associated effort-based triple

(- U, - M, D)

.

Theorem 2 points to a strategy which is often fruitful in the search for a MaxEnt-strategy, viz., first to determine the core of the given preparation and then to select that element (if any) in the core which is consistent. This route to determining MaxEnt strategies does not involve the infinitesimal calculus; in particular, it does not require the use of Lagrange multipliers. Researchers in statistical physics may claim that you need the Lagrange multipliers as they are of special physical significance, see e.g., Kuic [61]. In that connection, one will find that these quantities turn up anyhow, and in a more natural way, if you follow the approach via robustness, cf., [62].

The notion of robustness has not received much attention in a game theoretical setting. It is implicit in [26,63] and perhaps first formulated in [24]. Apparently, the existence of suitable robust strategies is a strong assumption. However, for typical models appearing in applications, the assumption is often fulfilled when optimal strategies exist. Results from [27] point in that direction.

Dual versions of the notions and results indicated above could be introduced, depending on (54) rather than on (53). However, it seems that the notions related to the direct Pythagorean inequality are the more useful ones.

For the result to follow we need an abstract version of Jeffrey’s divergence given, for two states

x_{1}

and

x_{2}

, by

J (x_{1}, x_{2}) = D (x_{1}, x_{2}) + D (x_{2}, x_{1}) .

(59)

Corollary 3.

[transitivity inequality] Assume that

(Φ, H, D)

is a

Q_{2}

-proper information triple. If

γ (P)

is in equilibrium with

x^{*}

as a bi-optimal state, then, for every state

x \in P

and every belief instance

y ≻ P

, the inequality

H (x) + D (x, x^{*}) + D (x^{*}, y) \leq Ri (y | P)

(60)

holds. In particular, for every

x \in ctr (P)

,

H (x) + J (x, x^{*}) \leq Ri (x | P) .

(61)

Proof.

First note that also

\hat{γ} (P)

is in equilibrium with

x^{*}

as bi-optimal state. Then, putting

w^{*} = \hat{x^{*}}

, (53) and (54) hold. Therefore, and as

H (x^{*}) = \hat{Ri} (w^{*} | P)

, for

x \in P

and

w ≻ P

,

H (x) + \hat{D} (x, w^{*}) + \hat{D} (x^{*}, w) \leq \hat{Ri} (w | P) .

(62)

To a given belief instance y with

y ≻ P

we then apply (62) with

w = \hat{y}

. As

\hat{D} (x, w^{*}) = D (x, x^{*})

,

\hat{D} (x^{*}, w) = D (x^{*}, y)

and

\hat{Ri} (w | P) = Ri (y | P)

, (60) follows. ☐

We refer to (60) as the transitivity inequality. It is a sharper version of the minimax inequality

H (x) \leq Ri (y | P)

. It combines both Pythagorean inequalities and these are easily derived from it. If

Ri (y | P) < \infty

, the inequality holds with equality if and only if both Pythagorean inequalities (53) and (54) hold with equality.
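The two directions mentioned here can be written out explicitly. The sketch assumes, as in the proof of Corollary 3, that $H (x^{*}) = \hat{Ri} (w^{*} | P)$, and for the second part that $D (x^{*}, x^{*}) = 0$ and $Ri (x^{*} | P) = H (x^{*})$ for the bi-optimal state.

```latex
% From (53) and (54) to (62): add \hat{D}(x^{*}, w) to (53), then apply (54).
H(x) + \hat{D}(x, w^{*}) + \hat{D}(x^{*}, w)
  \leq H(x^{*}) + \hat{D}(x^{*}, w)
  = \widehat{\mathrm{Ri}}(w^{*} | P) + \hat{D}(x^{*}, w)
  \leq \widehat{\mathrm{Ri}}(w | P) .

% From (60) back to the direct inequality: put y = x^{*}.
H(x) + D(x, x^{*}) \leq \mathrm{Ri}(x^{*} | P) = H(x^{*}) .

% From (60) back to the dual inequality: put x = x^{*}.
\mathrm{Ri}(x^{*} | P) + D(x^{*}, y) \leq \mathrm{Ri}(y | P) .
```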

As to the last part of Corollary 3, we note that if you put

r = Ri (x | P) - H (x)

, then the bi-optimal state has Jeffrey divergence at most r from x.

For the final result of this section we shall work in the Y-domain based on the derived triple

(Φ, H, D)

.

First we point to an extra property of bi-optimal states which follows from (53). In order to formulate this in a convenient way we need some definitions. A sequence

(x_{n})

of states converges in divergence to the state x, written

x_{n} \overset{D}{\to} x

, if

\lim_{n \to \infty} D (x_{n}, x) = 0

. This requires that

(x_{n}, x) \in X \otimes Y

for all n (or for all n sufficiently large). If

x_{n} \in P

for all n, we say that

(x_{n})

is asymptotically optimal, more precisely asymptotically optimal for Nature in the game

γ (P)

, if

H (x_{n}) \to H_{\max} (P)

as

n \to \infty

. Finally, a state x (not necessarily in

𝒫

) is a maximum entropy-attractor for

𝒫

with respect to convergence in divergence, more briefly, a

H_{\max}

-attractor for

P

wrt D-convergence, if

x_{n} \overset{D}{\to} x

for every asymptotically optimal sequence

(x_{n})

.

We can now state a trivial corollary to Theorem 3 (transformed to the Y-domain):

Corollary 4.

Any bi-optimal state

x^{*}

for a game

γ (P)

in equilibrium, is a

H_{\max}

-attractor for

𝒫

wrt D-convergence.

We shall later demonstrate the existence of attractors in certain cases when the bi-optimal state may not exist. However, that will also involve a variant of the notion of attractor which relates to a different kind of convergence, convergence in Jensen-Shannon divergence, rather than convergence in divergence. The two concepts are identical in key cases as we shall later demonstrate (discussion after the proof of Theorem 11).

2.13. Games Based on Utility, Updating

In the previous section we investigated games related to an effort-based information triple. Similar notions and results apply when we start out with a utility-based triple. Let us work in the Y-domain and base the first part of our discussion on a proper utility-based information triple

(U, M, D)

over

X \otimes Y

. Then, given a preparation

𝒫

, the associated game

γ (P) = γ (P | U)

has Observer as maximizer and Nature as minimizer and the two values of the game are, for Nature, the minimax utility

M_{\min} (P)

:

M_{\min} (P) = \inf_{x \in P} \sup_{y ≻ x} U (x, y) = \inf_{x \in P} U (x, x) = \inf_{x \in P} M (x)

(63)

and, for Observer, the corresponding maximin value

\sup_{y ≻ P} \inf_{x \in P} U (x, y) .

(64)

For

y ≻ P

, the infimum occurring here is the guaranteed utility associated with the strategy y. We denote it

Gtu (y | P)

. The maximin value (64) is also referred to as the maximal guaranteed utility. We denote it

{Gtu}_{\max} (P)

:

{Gtu}_{\max} (P) = \sup_{y ≻ P} Gtu (y | P) = \sup_{y ≻ P} \inf_{x \in P} U (x, y) .

(65)

Notions and results, e.g., related to equilibrium, to optimal or bi-optimal states etc. are developed in an obvious manner, either by following Section 2.12 in parallel or by applying the results of Section 2.12 to the effort-based triple

(- U, - M, D)

. The reader who wishes so will also be able to relax the assumption of properness to

Q_{2}

-properness.

Here, we limit the discussion to an elaboration of the important case of updating, cf., Section 2.8. For updating, according to Section 2.8, we do not need a full information triple. Therefore, for the remainder of the section we take as our basis a general divergence function D on

X \otimes Y

, a preparation

𝒫

and a prior

y_{0}

with

D^{y_{0}} < \infty

on

𝒫

. The game associated with the utility-based information triple

(U_{| y_{0}}, D^{y_{0}}, D)

we denote

γ (P; y_{0})

. According to (63), the value for Nature in this game is

\inf_{x \in P} D^{y_{0}} (x)

, also denoted

D_{\min} (P; y_{0})

and referred to as the minimum divergence value or the MinDiv-value:

D_{\min} (P; y_{0}) = \inf_{x \in P} D (x, y_{0}) .

(66)

An optimal strategy for Nature is here called a D-projection of

y_{0}

on

𝒫

. Consider an Observer strategy

y ≻ P

, i.e., a possible posterior. We use the same notation as in the general case, “Gtu”, to indicate Observer’s evaluation of the performance of the posterior. Incidentally, the letters can here be taken to stand for “guaranteed updating (gain)”. Thus

Gtu (y | P; y_{0}) = \inf_{x \in P} U_{| y_{0}} (x, y) = \inf_{x \in P} (D (x, y_{0}) - D (x, y))

(67)

is the guaranteed updating gain associated with the choice of y as posterior, and

{Gtu}_{\max} (P; y_{0}) = \sup_{y ≻ P} Gtu (y | P; y_{0})

(68)

is Observer’s value of the game, the maximum guaranteed updating gain, or the MaxGtu-value of

γ (P; y_{0})

.
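The quantities (66)–(68) can be computed directly for a small finite model. The sketch below assumes, for illustration only, that D is Kullback-Leibler divergence on probability vectors, that the preparation is a finite list of states and that the prior is uniform; the names `updating_gain` and `guaranteed_gain` are hypothetical. Since the supremum in (68) is taken here over a finite list of candidate posteriors only, the computed value is merely a lower bound for the MaxGtu-value.

```python
import math

def kl(x, y):
    """Kullback-Leibler divergence D(x, y) in nats (assumed choice of divergence)."""
    return sum(p * math.log(p / q) for p, q in zip(x, y) if p > 0)

prior = [1 / 3, 1 / 3, 1 / 3]                      # y0, with D(x, y0) finite on the preparation
preparation = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2], [0.7, 0.2, 0.1]]
candidates = preparation + [prior, [0.55, 0.3, 0.15]]   # finitely many possible posteriors

def updating_gain(x, y, y0):
    """U_{|y0}(x, y) = D(x, y0) - D(x, y)."""
    return kl(x, y0) - kl(x, y)

def guaranteed_gain(y, states, y0):
    """Gtu(y | P; y0): the worst-case updating gain over the preparation, cf. (67)."""
    return min(updating_gain(x, y, y0) for x in states)

# MinDiv-value (66): Nature's value of the updating game.
d_min = min(kl(x, prior) for x in preparation)

# Lower bound for the MaxGtu-value (68), restricted to the candidate list.
gtu_best = max(guaranteed_gain(y, preparation, prior) for y in candidates)

print(f"D_min = {d_min:.6f}, best guaranteed gain over candidates = {gtu_best:.6f}")
```

Whatever the candidate list, the computed gain cannot exceed the MinDiv-value, which is the minimax inequality for the updating game; keeping the prior as posterior always guarantees the gain zero.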

The basic results for the updating game may be summarized as follows:

Theorem 7.

Let D be a general divergence function on

X \otimes Y

,

𝒫

a preparation and

y_{0}

a belief instance with

D^{y_{0}} < \infty

on

𝒫

. Consider the updating game

γ = γ (P; y_{0})

.

If

x^{*} \in ctr (P)

, then γ is in equilibrium with

x^{*}

as bi-optimal state if and only if the Pythagorean inequality

D (x, y_{0}) \geq D (x, x^{*}) + D (x^{*}, y_{0})

(69)

holds for every

x \in P

. Moreover, if this condition is satisfied,

x^{*}

is the D-projection of

y_{0}

on

𝒫

. Furthermore, the dual Pythagorean inequality

Gtu (y | P; y_{0}) + D (x^{*}, y) \leq Gtu (x^{*} | P; y_{0})

(70)

holds for every

y ≻ P

.

The proof can be carried out by applying Corollary 1 and Theorem 3 to the effort function

Φ_{| y_{0}}

associated with the updating game considered, cf., (24). Details are left to the reader.
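For a geometric illustration of (69), one may take squared Euclidean distance as divergence. This choice, the unit disc as preparation and the prior (2, 0) are assumptions made for the sketch; the D-projection is then the nearest point of the disc, and (69) is the classical obtuse-angle inequality. The check below is numerical and runs over grid points of the disc only.

```python
def sq_dist(x, y):
    """Squared Euclidean distance, used here as the divergence D(x, y)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

y0 = (2.0, 0.0)        # prior, outside the preparation
x_star = (1.0, 0.0)    # nearest point of the closed unit disc to y0

# Grid points of the closed unit disc (the condition is exact integer arithmetic).
points = [(i / 10, j / 10)
          for i in range(-10, 11)
          for j in range(-10, 11)
          if i * i + j * j <= 100]

# Slack in the Pythagorean inequality D(x, y0) >= D(x, x*) + D(x*, y0).
slacks = [sq_dist(x, y0) - sq_dist(x, x_star) - sq_dist(x_star, y0) for x in points]

# MinDiv-value over the sampled points, to compare with D(x*, y0).
d_min = min(sq_dist(x, y0) for x in points)

print(f"smallest slack = {min(slacks):.3e}, D_min = {d_min}, D(x*, y0) = {sq_dist(x_star, y0)}")
```

In this model the slack equals 2 − 2x₁ for a point x = (x₁, x₂), so it is non-negative on the disc and vanishes only where x₁ = 1, i.e., at the projection itself.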

The concept of attractors also makes sense for updating games. Then the relevant notion is that of a relative attractor given

y_{0}

, also referred to as the

D_{\min}^{y_{0}}

-attractor, which is defined as a state

x^{*}

such that, for every sequence

(x_{n})

in

𝒫

with

D (x_{n}, y_{0}) \to D_{\min} (P; y_{0})

it holds that

x_{n} \overset{D}{\to} x^{*}

. In the situation covered by Theorem 7—assuming also that limit states for convergence in divergence are unique—the relative attractor exists and coincides with the bi-optimal state.

The Pythagorean inequality originated with Chentsov [64] and Csiszár [63] where updating in a probabilistic setting was considered. Further versions, still probabilistic in nature, can be found in Csiszár [65] and in Csiszár and Matúš [66]. In [67] these authors present a general abstract study, adopting a functional analytical approach building technically on meticulous exploitation of tools of convex analysis, partly developed by the authors. This source may also be consulted for information about the historical development and related works. As a work depending on a reversed Pythagorean inequality related to the triple (25), we mention Glonti et al. [68].

The reader should be aware that our notation deviates from what is most commonly found in the literature and promoted by Csiszár, mainly for classical Shannon Theory. Thus a relative attractor is mostly called a generalized I-projection (information projection). We have chosen to stick to the terminology with attractors, partly as their discussion is based on the primary results involving MaxEnt-analysis for which a terminology of projection is less natural.

2.14. Formulating Results with a Geometric Flavour

The results of Section 2.12 are formulated analytically. In this section we make a translation to results which have a certain geometric flavour. We shall work entirely in the Y-domain. No mention of controls or response will occur. This corresponds to a model with

\hat{Y} = Y = X

and where response is the identity map. Throughout the section results are based on a proper effort-based information triple

(Φ, H, D)

.

In the previous sections, we had a fixed preparation in mind. Here, we shall also discuss to which extent you can change a preparation without changing the optimal strategy.

Sub level sets of the form

{Φ^{y} \leq a}

play a key role. These sets appeared before as primitive feasible preparations. Here they have a different role and we prefer to use the bracket notation as above.

Proposition 3.

Let

x^{*}

be a state with finite entropy

h = H (x^{*})

. Then, given a preparation

𝒫

, the necessary and sufficient condition that the game

γ (P)

is in equilibrium with

x^{*}

as bi-optimal state is that

𝒫

is squeezed in between

{x^{*}}

and

{Φ^{x^{*}} \leq h}

, i.e., that

x^{*} \in P \subseteq {Φ^{x^{*}} \leq h}

. In particular,

{Φ^{x^{*}} \leq h}

is the largest such preparation.

This follows directly from Theorem 2 and Corollary 1.

For a fixed preparation

𝒫

, we can express the two values of

γ (P)

,

H_{\max} (P)

and

{Ri}_{\min} (P)

, in a geometrically flavoured way. This can be done whether or not the game is in equilibrium and the result can thus be used to check if the game is in fact in equilibrium. It is convenient to introduce some preparatory terminology.

Firstly, a subset of X is an entropy sub level set if it is a (non-empty) set of the form

{H \leq a}

. The size of such a set is the smallest number a which can occur in this representation, clearly equal to the MaxEnt-value associated with the preparation

{H \leq a}

. Given a preparation

𝒫

, the associated enveloping entropy sub level set is the smallest entropy sub level set containing

𝒫

.

Secondly, and quite analogously in view of (38) and (39), we introduce the size of the

Φ^{y}

-sub level set

{Φ^{y} \leq a}

as the smallest number a which can occur in this representation. And we define the enveloping

Φ^{y}

-sub level set associated with

𝒫

to be the smallest

Φ^{y}

-sub level set containing

𝒫

.

Proposition 4.

Consider the game

γ (P)

associated with a preparation

𝒫

. Then:

(i): The MaxEnt-value $H_{\max} (P)$ is the size of the enveloping entropy sub level set associated with $𝒫$ ;
(ii): For fixed $y ≻ P$ , $Ri (y | P)$ is the size of the enveloping $Φ^{y}$ -sub level set associated with $𝒫$ .
(iii): The MinRisk-value ${Ri}_{\min} (P)$ is the infimum over $y ≻ P$ of the sizes of the enveloping $Φ^{y}$ -sub level sets associated with $𝒫$ .

In view of (38)–(40), this is obvious. Some comments on the result are in order. In (i) it is understood that the size is infinite if no entropy sub level set exists which contains

𝒫

. A similar convention applies to (ii). Also note that the result gives rise to a simple geometrically flavoured proof of the minimax inequality (41) by noting that for each

y ≻ P

and each h,

{Φ^{y} \leq h} \subseteq {H \leq h}

.
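Spelled out, the geometric argument for the minimax inequality runs as follows; it uses only the inclusion just stated and parts (i)–(iii) of Proposition 4.

```latex
% If the preparation fits inside a \Phi^{y}-sub level set, it fits inside
% the entropy sub level set of the same level:
P \subseteq \{\Phi^{y} \leq h\} \subseteq \{H \leq h\}
\;\Longrightarrow\; H_{\max}(P) \leq h .

% Taking the infimum over all such h gives, by (i) and (ii),
H_{\max}(P) \leq \mathrm{Ri}(y | P) \qquad (y \succ P) ,

% and taking the infimum over y \succ P gives, by (iii),
H_{\max}(P) \leq \inf_{y \succ P} \mathrm{Ri}(y | P) = \mathrm{Ri}_{\min}(P) .
```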

There are two families of sets involved in Proposition 4, the entropy sub level sets and the

Φ^{y}

-sub level sets. As the proposition shows, both families give valuable information about the games we are interested in. From the second family alone, one can in fact obtain rather complete information. Indeed, if

{Φ^{y} \leq a}

contains a given preparation for appropriately chosen y and a, the associated game is well behaved:

Proposition 5.

Given a preparation

𝒫

, a necessary and sufficient condition that

γ (P)

is in equilibrium and has a bi-optimal state is that

{Φ^{y} \leq a} \supseteq P

for some

(y, a)

with

y \in P

and

a = H (y)

. When the condition is fulfilled, a is the value of the game and y the bi-optimal state.

The simple proof is left to the reader. It is the sufficiency which is most useful in practical applications.
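The sufficiency part lends itself to a direct check when the preparation is finite. The sketch below assumes the classical probabilistic model for illustration: the effort Φ(x, y) is taken to be −Σ xᵢ log yᵢ and H is Shannon entropy. The function `is_bi_optimal` is a hypothetical helper which tests whether the preparation is contained in the sub level set {Φ^y ≤ H(y)} for a given y in the preparation.

```python
import math

def entropy(x):
    """Shannon entropy H(x) in nats (assumed model)."""
    return -sum(p * math.log(p) for p in x if p > 0)

def effort(x, y):
    """Assumed effort function Phi(x, y) = -sum_i x_i log y_i; y must be positive where x is."""
    return -sum(p * math.log(q) for p, q in zip(x, y) if p > 0)

def is_bi_optimal(y, preparation, tol=1e-12):
    """Check the condition of Proposition 5 with a = H(y) for a finite preparation."""
    a = entropy(y)
    return y in preparation and all(effort(x, y) <= a + tol for x in preparation)

uniform = [1 / 3, 1 / 3, 1 / 3]
skewed = [0.5, 0.3, 0.2]
preparation = [uniform, skewed, [0.6, 0.3, 0.1]]

print("uniform passes the test:", is_bi_optimal(uniform, preparation))
print("skewed passes the test:", is_bi_optimal(skewed, preparation))
```

Here the uniform state passes, since its effort marginal is constant and equal to log 3, whereas the skewed state fails because the uniform state lies outside its sub level set.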

The results above translate without difficulty to results about games associated with a utility-based information triple

(U, M, D)

. For this, superlevel sets of the form

{U^{y} \geq k}

as well as strict sub level sets of the form either

{M < a}

or

{U^{y} < a}

play an important role. The notion of size of these latter sets, those defined by strict inequality, is defined as the largest value of a which can occur in the representations given.

We shall consider the largest sets of the form

{M < a}

, respectively

{U^{y} < a}

, which are contained in the complement

∁ P

or, as we shall consistently prefer to say below, which are external to

𝒫

.

Either directly—or as corollaries to Propositions 3–5 applied to the effort-based triple

(- U, - M, D)

—one derives the following results:

Proposition 6.

Let

(U, M, D)

be a utility-based information triple and consider a state

x^{*}

with

k = M (x^{*}) > - \infty

. Then, for any preparation

𝒫

, the game

γ (P | U)

is in equilibrium with

x^{*}

as bi-optimal state if and only if

x^{*} \in P \subseteq {U^{x^{*}} \geq k}

. In particular, the largest such preparation is the superlevel set

{U^{x^{*}} \geq k}

.

Proposition 7.

Let

(U, M, D)

be a utility-based information triple and consider a preparation

𝒫

and the associated game

γ (P | U)

. Then:

(i): The value $M_{\min} (P)$ is the size of the largest strict sub level set ${M < a}$ which is external to $𝒫$ .
(ii): For fixed $y ≻ P$ , $Gtu (y | P)$ is the size of the largest strict sub level set ${U^{y} < a}$ which is external to $𝒫$ .
(iii): The value ${Gtu}_{\max} (P)$ , as the supremum of $Gtu (y | P)$ , is the supremum of all sizes of sets of the form ${U^{y} < a}$ with $y ≻ P$ which are external to $𝒫$ .

Proposition 8.

Let

(U, M, D)

be a utility-based information triple and consider a preparation

𝒫

. Then a necessary and sufficient condition that

γ (P | U)

is in equilibrium and has a bi-optimal state is that

{U^{y} < a}

is external to

𝒫

for some

(y, a)

with

y \in P

and

a = M (y)

. When the condition is fulfilled, a is the value of the game and y the bi-optimal state.

We also note that the minimax inequality

{Gtu}_{\max} (P) \leq M_{\min} (P)

follows from Proposition 7 by applying the fact that, generally,

{M < a} \subseteq {U^{y} < a}

.

Let us look specifically at models of updating, cf., Section 2.13.

Given is a general divergence function D on

X \otimes Y

and we consider preparations

𝒫

and priors

y_{0}

for which

D^{y_{0}} < \infty

on

𝒫

. The sets we shall focus on related to the games

γ (P; y_{0})

are of two types, which we associate with, respectively “balls” and “half-spaces”. Firstly, for

r > 0

, consider the open divergence ball with radius r and centre

y_{0}

, defined as the

D^{y_{0}}

-sub level set

B (r | y_{0}) = {D^{y_{0}} < r} .

(71)

In case

r = D (x^{*}, y_{0})

for some state

x^{*}

, we write this set as

B (x^{*} | y_{0})

:

B (x^{*} | y_{0}) = B (D (x^{*}, y_{0}) | y_{0}) = {x | D (x, y_{0}) < D (x^{*}, y_{0})} .

(72)

And, secondly, we consider sets—all referred to as half-spaces—of one of the following forms

\begin{matrix} σ^{+} (y, a | y_{0}) = {x | U_{| y_{0}} < a} = {x | D (x, y_{0}) - D (x, y) < a} \end{matrix}

(73)

\begin{matrix} σ^{-} (y, a | y_{0}) = {x | U_{| y_{0}} \geq a} = {x | D (x, y_{0}) - D (x, y) \geq a} \end{matrix}

(74)

\begin{matrix} σ^{+} (y | y_{0}) = {x | U_{| y_{0}} < D (y, y_{0})} = {x | D (x, y_{0}) - D (x, y) < D (y, y_{0})} \end{matrix}

(75)

\begin{matrix} σ^{-} (y | y_{0}) = {x | U_{| y_{0}} \geq D (y, y_{0})} = {x | D (x, y_{0}) - D (x, y) \geq D (y, y_{0})} \end{matrix}

(76)

Associated with the sets introduced we define certain “boundary sets”, respectively peripheries and hyper-spaces. Notation and definitions for the former type of sets are given by

\begin{matrix} \partial B (r | y_{0}) & = {x | D (x, y_{0}) = r} \text{ and} \\ \partial B (x^{*} | y_{0}) & = {x | D (x, y_{0}) = D (x^{*}, y_{0})} \end{matrix}

and for the latter type we use

\begin{matrix} \partial σ (y, a | y_{0}) & = {x | D (x, y_{0}) - D (x, y) = a} \text{ and} \\ \partial σ (y | y_{0}) & = {x | D (x, y_{0}) - D (x, y) = D (y, y_{0})} . \end{matrix}

When translating basic parts of Propositions 6–8 to the setting we are now considering, we find the following result:

Proposition 9.

Let D be a general divergence function on

X \otimes Y

and consider a belief instance

y_{0} ≻ X

such that

D^{y_{0}} < \infty

. Then the following results hold for the associated updating games with

y_{0}

as prior:

(i): For any $x^{*} \in X$ , the largest preparation $𝒫$ for which $γ (P; y_{0})$ is in equilibrium with $x^{*}$ as bi-optimal state, hence with $x^{*}$ as the D-projection of $y_{0}$ on $𝒫$ , is the half-space $σ^{-} (x^{*} | y_{0})$ .
(ii): For a fixed updating game $γ (P; y_{0})$ , the MinDiv-value $D_{\min} (P; y_{0})$ is the size of the largest strict divergence ball $B (r | y_{0})$ which is external to $𝒫$ , and the maximal guaranteed updating gain ${Gtu}_{\max} (P; y_{0})$ is the supremum of a for which there exists $y ≻ P$ such that the half-space $σ^{+} (y, a | y_{0})$ is external to $𝒫$ .
(iii): An updating game $γ (P; y_{0})$ is in equilibrium and has a bi-optimal state if and only if, for some $y \in P$ , the half-space $σ^{+} (y | y_{0})$ is external to $𝒫$ . When this condition holds, y is the bi-optimal state, hence the D-projection of $y_{0}$ on $𝒫$ .

For illustrations see cases (a) and (b) shown in the figure in Section 3.2.
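To see why the term “half-space” is appropriate, consider as an illustrative assumption the case where D is squared Euclidean distance, $D (x, y) = \| x - y \|^{2}$, on a Euclidean space. A short computation then identifies the sets (75) and (76) and the associated boundary set with genuine half-spaces and a hyperplane.

```latex
% Assumed model: D(x, y) = \|x - y\|^{2}.
D(x, y_{0}) - D(x, y)
  = 2 \langle x, y - y_{0} \rangle + \|y_{0}\|^{2} - \|y\|^{2} ,
\qquad
D(y, y_{0}) = \|y\|^{2} - 2 \langle y, y_{0} \rangle + \|y_{0}\|^{2} .

% Hence D(x, y_{0}) - D(x, y) < D(y, y_{0}) reduces to
% \langle x, y - y_{0} \rangle < \langle y, y - y_{0} \rangle, so that
\sigma^{+}(y | y_{0}) = \{ x : \langle x - y, \, y - y_{0} \rangle < 0 \} ,
\qquad
\sigma^{-}(y | y_{0}) = \{ x : \langle x - y, \, y - y_{0} \rangle \geq 0 \} ,

\partial \sigma(y | y_{0}) = \{ x : \langle x - y, \, y - y_{0} \rangle = 0 \} .
```

In this model $σ^{+} (y | y_{0})$ is the open half-space containing the prior $y_{0}$, bounded by the hyperplane through y orthogonal to $y - y_{0}$, and condition (iii) of Proposition 9 becomes the familiar requirement that the preparation lies entirely on the other side of that hyperplane.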

2.15. Adding Convexity

It has long been recognized that notions of convexity play an important role for basic properties of Shannon theory and for optimization theory in general, cf. in particular Boyd and Vandenberghe [54], which also has a bearing on many of the concrete problems treated later on. We have deliberately postponed the introduction of this element until this late moment, thereby demonstrating that a large number of concepts and results can be formulated quite abstractly and do not require convexity considerations. It will also become clearer exactly where convexity is needed.

We shall study results which can be obtained under added algebraic assumptions related to convexity considerations.

We assume that X is a convex set. The convex hull of a preparation

𝒫

is denoted

co (P)

. We assume that controllability is adapted to the convex structure in the sense that a control w controls a convex combination, say

w ≻ \bar{x} = \sum α_{i} x_{i}

, if and only if w controls every

x_{i}

with

α_{i} > 0

. It follows, that all control regions

] w [

are convex. Also note that, for every convex combination

\bar{x} = \sum α_{i} x_{i}

, we conclude from

\bar{x} ≻ \bar{x}

that

\hat{\bar{x}} ≻ x_{i}

for all i with

α_{i} > 0

and hence, if we switch to the Y-domain,

\bar{x} ≻ x_{i}

for every i with

α_{i} > 0

.

Regarding convex combinations, they are understood to be finite convex combinations, often written as above without introducing any special notation for the relevant index set.

Properties of concavity, convexity and affinity of real-valued functions f defined on X or on a convex subset of X are largely defined in the usual way. Thus, for concavity, the condition is that if

\sum α_{i} x_{i}

is a convex combination of elements in the domain of definition of f, then

f (\sum α_{i} x_{i}) \geq \sum α_{i} f (x_{i})

. For convexity the inequality sign is turned around and for affinity it is replaced by equality. The notions make sense and will also be applied to extended real-valued functions provided they do not assume both values

+ \infty

and

- \infty

. One comment has to be made, though. We only require that X is a convex set. However, X could be affine, i.e., combinations

\sum α_{i} x_{i}

could be defined whenever the coefficients

α_{i}

are arbitrary real numbers which sum up to 1. This will be the case for some models. We shall then point out if stated results hold for arbitrary affine combinations, not just for convex combinations.

The above definitions and concepts along with associated assumptions will always be understood to apply when, in the sequel, we work with a convex state space.

The basis in this section, except for the last part (Example 1 and Proposition 10), is a proper effort-based information triple

(\hat{Φ}, H, \hat{D})

over

X \otimes \hat{Y}

. The derived information triple over

X \otimes Y

is denoted

(Φ, H, D)

. When there is also given a preparation

P

, the results developed continue to hold under

Q_{2}

-properness.

Emphasis will be on concavity, convexity or affinity for the w-marginals

{\hat{Φ}}^{w}

—either all of them or only those with a control in the range of the response function. Note that, say affinity for

{\hat{Φ}}^{w}

with w of the form

\hat{x}

for some

x \in X

amounts to the same as affinity of

Φ^{x}

.

Basic properties of entropy and redundancy (hence also divergence) under added conditions about the marginals

{\hat{Φ}}^{w}

or

Φ^{y}

are contained in the following result:

Theorem 8 (Deviation from affinity).

(i): If the marginals $Φ^{x}$ with $x \in X$ are concave, then, for every convex combination $\bar{x} = \sum α_{i} x_{i}$ of elements in X,

$H (\sum α_{i} x_{i}) \geq \sum α_{i} H (x_{i}) + \sum α_{i} D (x_{i}, \bar{x}) .$

(77)

In particular, H is concave and if $H (\bar{x}) = \sum α_{i} H (x_{i})$ and this quantity is finite, then all $x_{i}$ with $α_{i} > 0$ are response equivalent, in fact $x_{i} \hat{\sim} \bar{x}$ for these indices. If response is injective, the entropy function is strictly concave.
(ii): If the marginals $Φ^{x}$ with $x \in X$ are even affine, equality holds in (77):

$H (\sum α_{i} x_{i}) = \sum α_{i} H (x_{i}) + \sum α_{i} D (x_{i}, \bar{x}) .$

(78)
(iii): If the marginals ${\hat{Φ}}^{w}$ with $w \in \hat{Y}$ are affine and if $H (\bar{x}) < \infty$ for a convex combination $\bar{x} = \sum_{i} α_{i} x_{i}$ then, for every control w with $w ≻ \bar{x}$ ,

$\sum α_{i} \hat{D} (x_{i}, w) = \hat{D} (\sum α_{i} x_{i}, w) + \sum α_{i} D (x_{i}, \bar{x}) .$

(79)
(iv): If the marginals ${\hat{Φ}}^{w}$ with $w \in \hat{Y}$ are affine, if $P$ is a convex preparation with $H_{\max} (P) < \infty$ and if $w \in \hat{[P]}$ , then the restriction of ${\hat{D}}^{w}$ to $P$ is convex and if $\sum α_{i} \hat{D} (x_{i}, w) = \hat{D} (\bar{x}, w)$ for a convex combination $\bar{x} = \sum α_{i} x_{i}$ of states in $P$ , then all $x_{i}$ with $α_{i} > 0$ are response equivalent, in fact $x_{i} \hat{\sim} \bar{x}$ for these indices. If response is injective, the restriction of ${\hat{D}}^{w}$ to $P$ is strictly convex.

Proof.

The result is a natural extension of (the main parts of) Theorem 1 of [55] and the proof is similar: For (i), apply linking to rewrite the right hand side, then upper bound the expression you get by the assumed concavity and you end with the upper bound

Φ (\bar{x}, \bar{x}) = H (\bar{x})

. The results about concavity of H are easy consequences and property (ii) is proved similarly. For the basic assertion of (iii), add

\sum α_{i} \hat{D} (x_{i}, w)

to both sides of (78), and use linking to rewrite the right hand side. Then apply the assumed affinity and the term

\hat{Φ} (\bar{x}, w)

appears to which you once more apply linking. Finally subtract

H (\bar{x})

from both sides. The assertions of (iv) are easy consequences. ☐
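The identities (78) and (79) can be checked numerically in the classical Shannon setting. The sketch below assumes, purely for illustration, that states are probability vectors on a three-point alphabet, that H is Shannon entropy and that D is Kullback-Leibler divergence, so that the relevant marginals are affine. The helper names `entropy`, `kl` and `mix` and all numbers are hypothetical choices, not part of the abstract theory.

```python
import math

def entropy(p):
    """Shannon entropy in nats, with the convention 0 log 0 = 0."""
    return -sum(a * math.log(a) for a in p if a > 0)

def kl(p, q):
    """Kullback-Leibler divergence D(p, q); +inf if q gives mass 0 where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log(a / b)
    return total

def mix(weights, states):
    """Convex combination sum_i weights[i] * states[i], coordinate by coordinate."""
    return [sum(w * s[j] for w, s in zip(weights, states))
            for j in range(len(states[0]))]

# Illustrative data (assumed): three states, their weights and one control w.
alphas = [0.2, 0.5, 0.3]
xs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
w = [0.25, 0.25, 0.5]
x_bar = mix(alphas, xs)

# Compensation term: sum_i alpha_i D(x_i, x_bar).
compensation = sum(a * kl(x, x_bar) for a, x in zip(alphas, xs))

# (78): entropy of the mixture = mean entropy + compensation term.
lhs_78 = entropy(x_bar)
rhs_78 = sum(a * entropy(x) for a, x in zip(alphas, xs)) + compensation

# (79): mean divergence from w = divergence of the mixture from w + compensation term.
lhs_79 = sum(a * kl(x, w) for a, x in zip(alphas, xs))
rhs_79 = kl(x_bar, w) + compensation

print(lhs_78, rhs_78)
print(lhs_79, rhs_79)
```

In this model the two sides of each identity agree up to rounding, and the compensation term is non-negative, which is the concavity statement of part (i).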

Several comments are in order. First, as a simple corollary to (i) of Theorem 8, we note the following:

Corollary 5.

Assume that the marginals ${\hat{Φ}}^{x}$ with $x \in X$ are concave and consider the game $\hat{γ} (P)$ for a convex preparation $P$. Then the set of optimal strategies for Nature in this game is convex and, in case response is injective and $H_{\max} (P) < \infty$, there can be at most one optimal strategy for Nature.

Conditions of affinity will play a main role for many results to follow. Notions of affine equivalence apply in various contexts (

\hat{Y}

-domain, Y-domain, effort-based or utility-based). Some examples will suffice: The effort functions

{\hat{Φ}}_{1}

and

{\hat{Φ}}_{2}

over

X \otimes \hat{Y}

are affinely equivalent if there exists a finite-valued affine function f on X such that, for

(x, w) \in X \otimes \hat{Y}

,

{\hat{Φ}}_{2} (x, w) = {\hat{Φ}}_{1} (x, w) + f (x)

. If so,

{\hat{Φ}}_{1}

and

{\hat{Φ}}_{2}

are equivalent (

D_{1} = D_{2}

). Moreover, two effort-based information triples

(Φ_{1}, H_{1}, D_{1})

and

(Φ_{2}, H_{2}, D_{2})

are affinely equivalent if they are equivalent and there exists a finite-valued affine function f on X such that, for

(x, y) \in X \otimes Y

,

H_{2} (x) = H_{1} (x) + f (x)

. Then of course, also

Φ_{2} (x, y) = Φ_{1} (x, y) + f (x)

.

A simple and practically important result which follows readily from affinity conditions exploits the notion of robustness in its weakened form introduced in Section 2.9, cf., (32) and (33). The result is an extension of Theorem 5.

Theorem 9.

Let X be a convex state space and let $(\hat{Φ}, H, \hat{D})$ be a proper information triple over $X \otimes \hat{Y}$ for which the marginals ${\hat{Φ}}^{w}$ with $w \in \hat{Y}$ are all affine. Let $(x^{*}, w^{*})$ be a pair of permissible strategies for $\hat{γ} (P)$ with $w^{*}$ adapted to $x^{*}$. Assume that $x^{*} \in co (E)$ and that $w^{*}$ is $(E, P)$-robust. Then $\hat{γ} (P)$ is in equilibrium with $x^{*}$ as bi-optimal strategy.

Proof.

Let

h < \infty

be the constant for which (32) and (33) hold. By affinity, (32) extends to states in

co (E)

, hence

H (x^{*}) = \hat{Φ} (x^{*}, w^{*}) = h

. The result now follows from Theorem 2. ☐

Then some comments on (79). In the terminology of [69], this is the compensation identity with the last term as compensation term. This term appears as a measure of deviation from affinity, both in relation to entropy, cf., (78), and in relation to redundancy (hence also to divergence), cf., (79). The significance of such terms is being more widely recognized. This applies in particular to the case of an even mixture

\bar{x} = \frac{1}{2} x_{1} + \frac{1}{2} x_{2}

, for which the term is called Jensen-Shannon divergence, briefly just JSD-divergence, between

x_{1}

and

x_{2}

. We shall use the notation

JSD (x_{1}, x_{2}) = \frac{1}{2} D (x_{1}, \bar{x_{1} x_{2}}) + \frac{1}{2} D (x_{2}, \bar{x_{1} x_{2}}),

(80)

where a “bar” signals “midpoint of”, a notation to be used often in the sequel:

\bar{x y} = \frac{1}{2} x + \frac{1}{2} y .

(81)

For even mixtures of two states, the compensation identity states that

\frac{1}{2} (D (x_{1}, y) + D (x_{2}, y)) = D (\bar{x_{1} x_{2}}, y) + JSD (x_{1}, x_{2}),

(82)

which, for classical Shannon theory, is sometimes called the parallelogram identity. The identity makes sense for an arbitrary general divergence function but one should note the requirement of finiteness in (79), expressed somewhat indirectly via the entropy function. That some restriction is important will be seen from Example 1 below. When (82) holds, you may apply it with

y = x_{1}

and with

y = x_{2}

, and derive the identity

D (\bar{x_{1} x_{2}}, x_{1}) - D (\bar{x_{1} x_{2}}, x_{2}) = \frac{1}{2} (D (x_{2}, x_{1}) - D (x_{1}, x_{2})) .

(83)

Previously, JSD-divergence has mainly been studied in the context of classical Shannon theory. For our more abstract theory, we have chosen to put emphasis on it, especially in the formulation of technical assumptions which are needed for the proofs of some basic results to follow. Note that JSD-divergence is everywhere defined on

X \times X

which D-divergence need not be. In the next section we take up a closer study of Jensen-Shannon divergence.
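As a hedged illustration of (80), (82) and (83), the following sketch again assumes the classical model in which D is Kullback-Leibler divergence on probability vectors. The function names `kl`, `midpoint` and `jsd` and the test vectors are hypothetical. The last two lines illustrate the remark that JSD remains finite for a pair of states for which D itself is infinite.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p, q); +inf if q gives mass 0 where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log(a / b)
    return total

def midpoint(x, y):
    """The even mixture (1/2) x + (1/2) y, cf. (81)."""
    return [(a + b) / 2 for a, b in zip(x, y)]

def jsd(x1, x2):
    """Jensen-Shannon divergence as defined in (80)."""
    m = midpoint(x1, x2)
    return 0.5 * kl(x1, m) + 0.5 * kl(x2, m)

# Illustrative data (assumed).
x1 = [0.6, 0.3, 0.1]
x2 = [0.1, 0.4, 0.5]
y = [0.2, 0.2, 0.6]
m = midpoint(x1, x2)

# (82): the compensation identity for an even mixture of two states.
lhs_82 = 0.5 * (kl(x1, y) + kl(x2, y))
rhs_82 = kl(m, y) + jsd(x1, x2)

# (83): obtained from (82) with y = x1 and with y = x2.
lhs_83 = kl(m, x1) - kl(m, x2)
rhs_83 = 0.5 * (kl(x2, x1) - kl(x1, x2))

# Two point masses: D is infinite, JSD is finite.
d_points = kl([1.0, 0.0], [0.0, 1.0])
jsd_points = jsd([1.0, 0.0], [0.0, 1.0])

print(lhs_82, rhs_82)
print(lhs_83, rhs_83)
print(d_points, jsd_points)
```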

The purpose of the next result is to indicate that it is conceivable that for many concrete situations, a bi-optimal state will be robust, i.e., lie in the core of the preparation concerned. This result, in a more concrete set-up goes back to Csiszár, cf., [63]. It depends on the following notion: A state x is an algebraic inner point of

P

(typically assumed convex) if, for every

x_{1} \in P

distinct from x, there exists

x_{2} \in P

such that x is a genuine convex combination of

x_{1}

and

x_{2}

.

Corollary 6.

Assume that $Φ^{x}$ is affine for all $x \in X$ and let $P$ be a convex preparation. If $γ (P)$ is in equilibrium and has a bi-optimal state $x^{*}$ and if this state is algebraic inner in $P$, then $x^{*}$ is robust for $γ (P)$ at the robustness level $H_{\max} (P)$. In particular, $x^{*} \in core (P)$.

Proof.

With assumptions as stated, consider any

x \in P

distinct from

x^{*}

and determine

x^{'} \in P

such that

x^{*}

is a genuine convex combination of x and

x^{'}

, say

x^{*} = α x + β x^{'}

. We find that

Φ (x, x^{*}) \leq Ri (x^{*} | P) = H_{\max} (P)

. Similarly,

Φ (x^{'}, x^{*}) \leq H_{\max} (P)

. As the convex combination

α Φ (x, x^{*}) + β Φ (x^{'}, x^{*})

equals

Φ (α x + β x^{'}, x^{*}) = Φ (x^{*}, x^{*}) = H (x^{*}) = H_{\max} (P)

, we conclude that

Φ (x, x^{*}) = H_{\max} (P)

. As this holds for every

x \in P

, the result follows. ☐

An example will illuminate the importance of the finiteness condition in relation to the compensation identity. We shall work in the Y-domain, for which the identity takes the following form:

\sum α_{i} D (x_{i}, y) = D (\sum α_{i} x_{i}, y) + \sum α_{i} D (x_{i}, \bar{x}) .

(84)

The identity can be considered for more or less any bivariate function D on

X \otimes Y

. As before let X be convex and assume that

Y = X

. We further assume that D is a general divergence function on

X \otimes Y

. It may be that D is derived from an information triple over

X \otimes \hat{Y}

, but we do not assume so. In particular, no response function is involved.

In order to check if the compensation identity holds for D, you may check if the difference

\sum α_{i} D (x_{i}, y) - D (\sum α_{i} x_{i}, y)

is well defined and independent of y. Alternatively, you may inspect the expression for D more closely. If, apart from terms depending on x only, this expression contains only terms which, for fixed y, are linear in x, a suitable entropy can be identified and the compensation identity (84) will hold (when

H (\bar{x}) < \infty

). The procedure is demonstrated in the following example which, at the same time, also illustrates the role of the two assumptions made in part (iii) of Theorem 8 in order for (79) or (84) to hold.

Example 1.

Let

X = Y = \hat{Y}

be copies of the real line

] - \infty, \infty [

provided with the standard structure, let response be the identity map and let visibility be the diffuse relation. Further, let α be a positive parameter and consider the bivariate function D given by

D (x, y) = {| x - y |}^{α} .

(85)

Clearly, this is a genuine general divergence function.

If

α \neq 2

, (84) does not hold. Indeed, if you consider the mixture

\bar{x} = \frac{1}{2} \cdot 0 + \frac{1}{2} \cdot 1

and as y take

y = 1

, then the left hand side of (84) equals

\frac{1}{2}

whereas the right hand side equals

2^{1 - α}

. Thus, when

α \neq 2

, there is no information triple

(Φ, H, D)

equivalent to

(D, 0, D)

for which (84) holds generally. So you cannot add a finite entropy function to

(D, 0, D)

and obtain an effort function with affine marginals.

If

α = 2

, the matter is quite different. Then

D (x, y) = x^{2} + y^{2} - 2 x y

and you can subtract

x^{2}

to obtain a function with linear dependency on x for a given value of y. In other words, if you consider the triple equivalent to

(D, 0, D)

for which entropy is given by

H (x) = - x^{2}

, all conditions of Theorem 8, (iii) are fulfilled, thus (84) must hold. Further material on this and similar examples can be found in Section 3.1.
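The computation in Example 1 is easy to reproduce. The sketch below evaluates the difference between the two sides of (84) for $D (x, y) = {| x - y |}^{α}$. The function names `D` and `compensation_gap` are hypothetical, and the second data set is an arbitrary illustrative choice.

```python
def D(x, y, alpha):
    """The divergence (85) on the real line."""
    return abs(x - y) ** alpha

def compensation_gap(alphas, xs, y, alpha):
    """Left hand side minus right hand side of (84) for the divergence (85)."""
    x_bar = sum(a * x for a, x in zip(alphas, xs))
    lhs = sum(a * D(x, y, alpha) for a, x in zip(alphas, xs))
    rhs = D(x_bar, y, alpha) + sum(a * D(x, x_bar, alpha)
                                   for a, x in zip(alphas, xs))
    return lhs - rhs

# The mixture from Example 1: x_bar = (1/2)*0 + (1/2)*1 with y = 1.
gaps = {alpha: compensation_gap([0.5, 0.5], [0.0, 1.0], 1.0, alpha)
        for alpha in (1.0, 2.0, 3.0)}
for alpha, gap in gaps.items():
    # The example predicts the gap 1/2 - 2**(1 - alpha).
    print(alpha, gap, 0.5 - 2 ** (1 - alpha))

# For alpha = 2 the identity holds for other mixtures as well (illustrative data).
general_gap = compensation_gap([0.2, 0.3, 0.5], [-1.0, 2.0, 4.5], 0.7, 2.0)
print(general_gap)
```

The gap vanishes only for α = 2, in agreement with the discussion above. For α = 2 the identity is the familiar decomposition of a mean squared deviation into a variance term and a squared bias term.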

For our last observation of this section we return to an updating triple

(U_{| y_{0}}, D^{y_{0}}, D)

as introduced in Section 2.8, cf. (22). Here, D is a general divergence and

y_{0}

a prior. A certain preparation

P

is also given and it is assumed that

D^{y_{0}} < \infty

on

P

. The triple

(U_{| y_{0}}, D^{y_{0}}, D)

is a genuine proper utility-based information triple over

P \otimes Y

. It is still assumed that X is convex and that

Y = X

. The observation we want to point out is the following:

Lemma 1.

If, in addition to the assumptions above, the compensation identity (84) holds for all convex combinations of states in $P$ and all $y \in Y$, then all marginal functions of the utility function $U_{| y_{0}}$ obtained by fixing an element $y \in Y$ are affine.

Proof.

Consider any

y \in Y

and any convex combination

\bar{x} = \sum α_{i} x_{i}

of states in

𝒫

. As

D^{y_{0}} < \infty

on

𝒫

, the sum

\sum α_{i} D (x_{i}, y_{0})

is finite. By the compensation identity, so is the sum

\sum α_{i} D (x_{i}, \bar{x})

. For

y \in Y

, we find that

\begin{aligned} U_{| y_{0}} (\bar{x}, y) & = D (\bar{x}, y_{0}) - D (\bar{x}, y) \\ & = (\sum α_{i} D (x_{i}, y_{0}) - \sum α_{i} D (x_{i}, \bar{x})) - (\sum α_{i} D (x_{i}, y) - \sum α_{i} D (x_{i}, \bar{x})) \\ & = \sum α_{i} D (x_{i}, y_{0}) - \sum α_{i} D (x_{i}, y) \\ & = \sum α_{i} U_{| y_{0}} (x_{i}, y) . \end{aligned}

This is the affinity relation sought. ☐

The significance of this result is that it will later allow us to apply results for the updating games under convexity assumptions, cf., Theorem 15.
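The affinity asserted in Lemma 1 can be illustrated numerically. The sketch assumes the classical model with Kullback-Leibler divergence as D, for which the compensation identity holds. The names `kl`, `mix` and `utility` and the data are hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p, q); +inf if q gives mass 0 where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log(a / b)
    return total

def mix(weights, states):
    """Convex combination sum_i weights[i] * states[i], coordinate by coordinate."""
    return [sum(w * s[j] for w, s in zip(weights, states))
            for j in range(len(states[0]))]

def utility(x, y, y0):
    """Updating gain U(x, y) = D(x, y0) - D(x, y) for the prior y0."""
    return kl(x, y0) - kl(x, y)

# Illustrative data (assumed): prior, belief instance, states and weights.
y0 = [0.5, 0.25, 0.25]
y = [0.2, 0.5, 0.3]
alphas = [0.4, 0.6]
xs = [[0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
x_bar = mix(alphas, xs)

# Affinity of the marginal: utility of the mixture equals the mixture of utilities.
lhs = utility(x_bar, y, y0)
rhs = sum(a * utility(x, y, y0) for a, x in zip(alphas, xs))
print(lhs, rhs)
```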

2.16. Jensen-Shannon Divergence at Work

As in the previous section, X is a convex set. We assume now that

Y = X

. For the first part of the section we take as base a general divergence function D over

X \otimes Y

. No preparation, effort function or entropy function will appear until later in the section. We work entirely in the Y-domain.

As is no surprise, not all results of information theory are constructive and in order to be able to handle situations where constructive methods are not available, we shall introduce topologically flavored notions and methods. Previously, as in [55], we introduced topology into the picture by referring to a “reference topology” which could be a topology with no very direct relation to the theory developed. Now we apply a different approach and insist that everything topological can be expressed in terms of quantities of direct interest for the theory dealt with. In fact, the previously defined Jensen-Shannon divergence (JSD), cf., (80), will now be the central quantity to work with. This notion of divergence is an everywhere defined, smoothed and symmetrized version of standard divergence. It may take the value

+ \infty

. The following properties are obvious in view of the definition:

JSD (x, y) \geq 0 \quad \text{(JSD is non-negative)},

(86)

JSD (y, x) = JSD (x, y) \quad \text{(JSD is symmetric)},

(87)

JSD (x, x) = 0 \quad \text{(JSD is sound)},

(88)

JSD (x, y) > 0 \text{ if } y \neq x \quad \text{(JSD is proper)} .

(89)

These properties hold for all

x, y \in X

. The same properties hold for any bivariate function on

X \otimes Y

which is the composition of some metric with a function defined on

[0, \infty [

which vanishes at 0 and nowhere else. In several concrete cases, Jensen-Shannon divergence is of this type, in some central cases even in a very simple way as JSD will be a squared metric in the cases we have in mind. For research in this direction, we refer to Endres and Schindelin [70], Fuglede and Topsøe [71] and Briët and Harremoës [72]. The present study is a further indication of the significance of Jensen-Shannon divergence.
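For the classical case, the statement that JSD is a squared metric can be probed numerically. The sketch below is only an experiment, not a proof: it assumes Kullback-Leibler divergence as D and tests the triangle inequality for the square root of JSD on randomly generated probability vectors. All names (`jsd_distance`, `random_state`) and the sample size are hypothetical.

```python
import math
import random

def kl(p, q):
    """Kullback-Leibler divergence D(p, q); +inf if q gives mass 0 where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log(a / b)
    return total

def jsd(x, y):
    """Jensen-Shannon divergence as defined in (80)."""
    m = [(a + b) / 2 for a, b in zip(x, y)]
    return 0.5 * kl(x, m) + 0.5 * kl(y, m)

def jsd_distance(x, y):
    """Square root of JSD; max(...) guards against tiny negative rounding errors."""
    return math.sqrt(max(jsd(x, y), 0.0))

def random_state(rng, size):
    """A random probability vector obtained by normalising uniform weights."""
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)  # fixed seed so that the experiment is reproducible
min_slack = math.inf
for _ in range(2000):
    p, q, r = (random_state(rng, 4) for _ in range(3))
    # Slack in the triangle inequality; it should never be negative.
    slack = jsd_distance(p, q) + jsd_distance(q, r) - jsd_distance(p, r)
    min_slack = min(min_slack, slack)
print(min_slack)
```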

Jensen-Shannon divergence defines a natural sequential notion of convergence in X. To be precise, a sequence

(x_{n}) = {(x_{n})}_{n \geq 1}

converges in Jensen-Shannon divergence to x and we write

x_{n} \overset{JSD}{⟶} x

, if

JSD (x_{n}, x) \to 0

as

n \to \infty

. We shall only pay attention to convergence of ordinary sequences. Convergence in Jensen-Shannon divergence is also referred to as JSD-convergence.

A sequence

{(x_{n})}_{n \in N}

is a JSD-Cauchy sequence if

\lim_{n, m \to \infty} JSD (x_{n}, x_{m}) = 0 .

(90)

We shall consider the following five properties:

C1 \text{ (soundness)}: \quad x_{n} \equiv x \Rightarrow x_{n} \overset{JSD}{⟶} x;

(91)

C2 \text{ (subsequence consistency)}: \quad x_{n} \overset{JSD}{⟶} x \Rightarrow x_{n_{k}} \overset{JSD}{⟶} x \text{ for any subsequence};

(92)

C3 \text{ (unique limits)}: \quad x_{n} \overset{JSD}{⟶} x \land x_{n} \overset{JSD}{⟶} y \Rightarrow y = x;

(93)

C4 \text{ (subsubsequence principle)}: \quad \text{if } \forall (x_{n_{k}}) \exists (x_{n_{k_{l}}}) : x_{n_{k_{l}}} \overset{JSD}{⟶} x \text{ then } x_{n} \overset{JSD}{⟶} x;

(94)

C5 \text{ (completeness)}: \quad \text{any JSD-Cauchy sequence is JSD-convergent} .

(95)

We may use terminology such as “JSD-convergence has unique limits” or “JSD-convergence is complete”. Clearly, C1, C2 and C4 hold generally. Completeness (C5) will be taken as an independent axiom. Adding two relatively innocent technical axioms, we shall also establish C3.

The axiom ASC of algebraic sequential continuity wrt JSD-convergence is the requirement that, for convex combinations

z_{n} = α_{n} x_{n} + β_{n} y_{n}

and for a convex combination

z = α x + β y

such that

x_{n} \overset{JSD}{⟶} x

,

y_{n} \overset{JSD}{⟶} y

and

α_{n} \to α

(hence also

β_{n} \to β

) it holds that

z_{n} \overset{JSD}{⟶} z

.

The axiom JSC of joint sequential lower semi-continuity of divergence is the requirement that, for

x_{n} \overset{JSD}{⟶} x

and

y_{n} \overset{JSD}{⟶} y

, it holds that, properly interpreted,

D (x, y) \leq \liminf_{n \to \infty} D (x_{n}, y_{n}) .

(96)

Regarding the proper interpretation of (96), we shall agree to define

D (x, y) = \infty

whenever

y ⊁ x

. Thus the axiom implies that if the right hand side of (96) is finite, then

y ≻ x

must hold.

The significance of the properties C1-4 lies in a general result due to Kisynski [73], see also Dudley [74], according to which these conditions ensure that the notion of convergence studied is topological, i.e., that there exists a topology on X for which sequential convergence coincides with the given notion of convergence. When this is so, there exists a unique strongest such topology, which we refer to as the associated topology. For this topology, a set is open if and only if any sequence which converges in the given notion of convergence to a point in the set eventually lies in the set. Note that, typically, there are many topologies for which sequential convergence coincides with a given notion of convergence. As a concrete example consider $X = N$ and note that the class of convergent sequences for the discrete topology (the eventually constant sequences) coincides with the class of convergent sequences for the strictly weaker topology specified by taking $G \subseteq N$ to be open if either $1 \notin G$ or else $\lim_{n \to \infty} μ_{n} (G \cap [1, n]) = 1$ with $μ_{n}$ the uniform probability measure over $[1, n]$ (this is, essentially, “Appert space” of [75]).

We are now ready to prove the following result:

Theorem 10.

Under the added axioms ASC and JSC, the convergence properties C1-4 hold, hence JSD-convergence is topological and the associated topology is well defined. Further, JSD is a sequentially lower semi-continuous notion, i.e., for

x_{n} \overset{JSD}{⟶} x

and

y_{n} \overset{JSD}{⟶} y

, the following inequality holds:

JSD (x, y) \leq \liminf_{n \to \infty} JSD (x_{n}, y_{n}) .

(97)

Proof.

To establish (97), note that by axiom ASC the convergence

\bar{x_{n} y_{n}} \overset{JSD}{⟶} \bar{x y}

is ensured. Then, by axiom JSC,

D (x, \bar{x y}) \leq \liminf_{n \to \infty} D (x_{n}, \bar{x_{n} y_{n}}) .

(98)

Similarly,

D (y, \bar{x y}) \leq \liminf_{n \to \infty} D (y_{n}, \bar{x_{n} y_{n}}) .

(99)

As the left hand side of (97) is half the sum of the left hand sides of (98) and (99), and as half the sum of the two right hand sides is dominated by the right hand side of (97), (97) must hold.

As to property C3, assume that

x_{n} \overset{JSD}{⟶} x

and that

x_{n} \overset{JSD}{⟶} y

. Then, by (97),

JSD (x, y) \leq \liminf_{n \to \infty} JSD (x_{n}, x_{n}) = 0

and hence

D (x, \bar{x y}) = 0

follows. By properness,

x = \bar{x y}

and then

x = y

follows. ☐

Under the discussion of properties (86)–(89) we indicated that often JSD is directly related to a metric in that a relation of the form

JSD (x, y) = f (ρ (x, y))

(100)

holds for some metric

ρ

. In such cases it is mostly easy to identify the associated topology (without relying on any extra axioms). We leave it to the reader to prove the following simple result.

Proposition 10.

Assume that, for some metric ρ on X and some continuous and strictly increasing function f on

[0, \infty [

with

f (0) = 0

, Equation (100) holds for all

(x, y) \in X \times X

. Then the associated topology for JSD-convergence exists and can be identified as the metric topology defined by ρ. Further, JSD is jointly lower semi-continuous. If the metric ρ is complete, so is JSD.
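As a simple worked instance of (100), consider again the quadratic divergence on the real line from Example 1 with α = 2. The choice of this model is ours and serves only as an illustration; the midpoint notation is that of (81).

```latex
% Assumption for illustration: X = Y = real line, D(x, y) = |x - y|^2 (Example 1 with alpha = 2).
\begin{aligned}
JSD (x, y) &= \tfrac{1}{2} D (x, \bar{x y}) + \tfrac{1}{2} D (y, \bar{x y}) \\
           &= \tfrac{1}{2} \left| \tfrac{x - y}{2} \right|^{2} + \tfrac{1}{2} \left| \tfrac{y - x}{2} \right|^{2} \\
           &= \tfrac{1}{4} {| x - y |}^{2} = f (ρ (x, y)), \qquad ρ (x, y) = | x - y |, \quad f (t) = \tfrac{1}{4} t^{2} .
\end{aligned}
```

Here f is continuous and strictly increasing on $[0, \infty [$ with $f (0) = 0$ and ρ is a complete metric, so in this model the proposition identifies the associated topology as the usual topology of the real line and shows that JSD is complete.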

Under suitable conditions we now aim at establishing existence of optimal strategies for the players in the games

γ (P)

. However, in certain important cases Nature does not have an optimal strategy. Instead, we aim at showing that rather generally replacements in the form of

H_{\max}

-attractors exist. We shall aim at attractors for JSD-convergence but, as it will turn out, under conditions stated, that will amount to the same thing as attractors for D-convergence. The result below, stated in rather full detail for reference purposes, is a main technical result of the present contribution.

Theorem 11.

Consider a convex state space X, let $Y = X$ and let $(Φ, H, D)$ be a proper information triple over $X \otimes Y$ with affine marginals $Φ^{y}$ for all $y \in Y$. Assume that the axioms ASC and JSC and the axiom of JSD-completeness, which all relate to the divergence function D, hold.

Then, for every convex preparation $P$ with $H_{\max} (P) < \infty$, $γ (P)$ is in equilibrium and there exists a unique optimal strategy $y^{*}$ for Observer and a unique $H_{\max}$-attractor $x^{*}$ wrt JSD-convergence. Furthermore, $y^{*} = x^{*}$ and the direct as well as the dual Pythagorean inequalities hold, i.e., for $x \in P$ and $y ≻ P$,

\begin{aligned} H (x) + D (x, y^{*}) & \leq H_{\max} (P); \\ {Ri}_{\min} (P) + D (x^{*}, y) & \leq Ri (y | P) . \end{aligned}

Proof.

First we prove an auxiliary result, viz., that if, for a sequence

(x_{n})

of states in

P

and for a state

x \in X

,

x_{n} \overset{D}{\to} x

holds, then also

x_{n} \overset{JSD}{⟶} x

must hold.

To see this, note that by assumptions made, we conclude from (iii) of Theorem 8 that, for all n, m and all

y \in Y

,

\frac{1}{2} D (x_{n}, y) + \frac{1}{2} D (x_{m}, y) = D (\bar{x_{n} x_{m}}, y) + JSD (x_{n}, x_{m}) .

(101)

Applying this with

y = x

, we see that

(x_{n})

is a JSD-Cauchy sequence. By completeness, there exists

x^{'} \in X

such that

x_{n} \overset{JSD}{⟶} x^{'}

. By axiom JSC,

D (x^{'}, x) \leq \liminf_{n \to \infty} D (x_{n}, x) = 0

, hence

x^{'} = x

and

x_{n} \overset{JSD}{⟶} x

follows.

Now, let

(x_{n})

be an asymptotically optimal sequence for

γ (P)

. Then (ii) of Theorem 8 applied to

\bar{x_{n} x_{m}}

shows that

H_{\max} (P) \geq H (\bar{x_{n} x_{m}}) = \frac{1}{2} H (x_{n}) + \frac{1}{2} H (x_{m}) + JSD (x_{n}, x_{m})

and we realize that

(x_{n})

is a JSD-Cauchy sequence. Therefore the sequence is JSD-convergent, say

x_{n} \overset{JSD}{⟶} x

. If also

(z_{n})

is an asymptotically optimal sequence, there must, likewise, exist

z \in X

such that

z_{n} \overset{JSD}{⟶} z

. As the alternating sequence

x_{1}, z_{1}, x_{2}, z_{2}, \dots

is also asymptotically optimal, that sequence too JSD-converges, say with

a \in X

as limit state. By properties C2 and C3 we find that

x = a = z

. This shows that there exists a unique

H_{\max}

-attractor wrt JSD-convergence. Let

x^{*}

be this unique attractor.

Then we remark that if there exists an optimal strategy

y^{*}

for Observer in

γ (P)

, there can only be one such strategy and it must coincide with

x^{*}

. To see this, note that if

y^{*}

is optimal,

Ri (y^{*} | P) \leq H_{\max} (P)

, hence, for every

x \in P

,

H (x) + D (x, y^{*}) \leq H_{\max} (P)

and hence

y^{*}

is also an

H_{\max}

-attractor wrt convergence in D (cf., also Corollary 4). By the auxiliary fact established in the beginning of the proof,

y^{*}

is also an

H_{\max}

-attractor wrt JSD-convergence, hence must coincide with

x^{*}

as claimed.

Now fix an asymptotically optimal sequence, say

(x_{n})

. Then, for

x \in P

consider “suitable” convex combinations

ξ_{n} = α_{n} x_{n} + β_{n} x

with

β_{n} \to 0

and all

β_{n}

positive (in fact,

β_{n} = \frac{1}{n}

if the difference

δ_{n} = H_{\max} (P) - H (x_{n})

either vanishes or is larger than 1 and otherwise

β_{n} = \sqrt{δ_{n}}

will do). Then

\begin{aligned} H_{\max} (P) & \geq H (ξ_{n}) = α_{n} H (x_{n}) + β_{n} H (x) + α_{n} D (x_{n}, ξ_{n}) + β_{n} D (x, ξ_{n}) \\ & \geq α_{n} H (x_{n}) + β_{n} (H (x) + D (x, ξ_{n})), \end{aligned}

hence

H (x) + D (x, ξ_{n}) \leq \frac{1}{β_{n}} (H_{\max} (P) - H (x_{n})) + H (x_{n}) .

Clearly, we can select the

β_{n}

’s such that this quantity converges to

H_{\max} (P)

as

n \to \infty

. By axiom ASC,

ξ_{n}

converges in JSD-divergence to

x^{*}

and then, by axiom JSC, we conclude that

H (x) + D (x, x^{*}) \leq H_{\max} (P)

. Since this holds for every consistent state x,

Ri (x^{*} | P) \leq H_{\max} (P)

, from which we conclude that

γ (P)

is in equilibrium, that the direct Pythagorean inequality holds and that

x^{*}

is an optimal strategy for Observer. As we have seen before, this strategy is unique.

As, for any

y \in X

,

\begin{aligned} {Ri}_{\min} (P) + D (x^{*}, y) & = \lim_{n \to \infty} H (x_{n}) + D (x^{*}, y) \\ & \leq \lim_{n \to \infty} H (x_{n}) + \liminf_{n \to \infty} D (x_{n}, y) \\ & = \liminf_{n \to \infty} Φ (x_{n}, y) \leq Ri (y | P), \end{aligned}

also the dual Pythagorean inequality holds. ☐
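A minimal sketch of the conclusions of Theorem 11 in a classical model may be helpful. We assume, for illustration only, Shannon entropy and Kullback-Leibler divergence on the alphabet {0, 1, 2} and the convex preparation of all distributions with mean m = 0.5. For a preparation given by a linear constraint of this kind the attractor lies in the preparation and has the exponential form used below, and the direct Pythagorean inequality is expected to hold with equality. The names `gibbs_on_012` and `member` are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats, with the convention 0 log 0 = 0."""
    return -sum(a * math.log(a) for a in p if a > 0)

def kl(p, q):
    """Kullback-Leibler divergence D(p, q); +inf if q gives mass 0 where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log(a / b)
    return total

def gibbs_on_012(m):
    """Distribution on {0, 1, 2} with x_j proportional to r**j and mean m, 0 < m < 2.

    The mean constraint reduces to (2 - m) r^2 + (1 - m) r - m = 0.
    """
    a, b, c = 2.0 - m, 1.0 - m, -m
    r = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)  # the positive root
    z = 1 + r + r * r
    return [1 / z, r / z, r * r / z]

def member(m, t):
    """State with mean m on {0, 1, 2}, parametrised by t = probability of the point 2."""
    return [1 - m + t, m - 2 * t, t]

m = 0.5
x_star = gibbs_on_012(m)
h_max = entropy(x_star)

# For m = 0.5 the admissible parameters are 0 <= t <= 0.25.
ts = [0.25 * k / 10 for k in range(11)]
# Gap in the direct Pythagorean inequality: H_max - (H(x) + D(x, x*)).
gaps = [h_max - (entropy(member(m, t)) + kl(member(m, t), x_star)) for t in ts]
print(x_star, h_max)
print(max(abs(g) for g in gaps))
```

In this model the gaps vanish up to rounding, so every state of the preparation satisfies $H (x) + D (x, x^{*}) = H_{\max} (P)$. For preparations defined by inequalities rather than equalities, strict inequality can be expected for some states.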

Several remarks concerning this theorem are in order.

Firstly, note that for the auxiliary result we started out to prove, we had to appeal (implicitly) to the finiteness condition

H_{\max} (P) < \infty

in view of the condition

H (\bar{x}) < \infty

in (iii) of Theorem 8. Alternatively, we could instead demand that the compensation identity holds unconditionally.

Secondly, in general, the D-notion and the JSD-notion of convergence may differ from each other (with D-convergence the stronger of the two). However, it follows from the theorem that under the conditions stated, it does not matter whether we define

H_{\max}

-attractors wrt D-convergence or wrt JSD-convergence. We may, therefore, simply talk about an

H_{\max}

-attractor, or even just an attractor, without specifying the mode of convergence we have in mind.

Further, it is natural to ask whether the inequality

H (x^{*}) \leq H_{\max} (P)

can be added to the conclusions in Theorem 11. If H is sequentially lower semi-continuous wrt D-convergence (or wrt JSD-convergence)—as will normally (always?) be the case—the inequality obviously holds. Assume now that this is the case. Then there are two possibilities why an attractor

x^{*}

may fail to be an optimal strategy for Nature, either because

x^{*} \notin P

or, more interestingly, because there is an entropy loss in that

H (x^{*}) < H_{\max} (P)

. In Harremoës and Topsøe [76], the authors speculate that the phenomenon of entropy loss could be important in computational linguistics and could provide a partial explanation of Zipf’s law.

Following up on the remark above, we may investigate what can be accomplished if we work with a state

x^{*}

which is known to be consistent and apply the same technique of proof as for Theorem 11. What we find is that in the presence of convexity (and with technical axioms added), the essential inequality

Ri (y^{*} | P) \leq H (x^{*})

is not needed in full strength. It suffices to assume one of the facts which flow from that inequality, viz., that

H (x^{*}) = H_{\max} (P)

. To be precise:

Theorem 12.

With assumptions as in Theorem 11, let $𝒫$ be a convex preparation and $x^{*}$ a consistent state with finite entropy which is also a possible strategy for Observer, i.e., $x^{*} ≻ P$. Then the condition $H (x^{*}) = H_{\max} (P)$ is not only necessary, but also sufficient for $Ri (x^{*} | P) \leq H (x^{*})$ to hold, hence for $γ (P)$ to be in equilibrium with $x^{*}$ as bi-optimal state.

Proof.

Consider a state

x \in P

and apply (77) to a convex combination of the form

y_{n} = (1 - \frac{1}{n}) x^{*} + \frac{1}{n} x

. We find that

H (x^{*}) \geq H (y_{n}) \geq (1 - \frac{1}{n}) H (x^{*}) + \frac{1}{n} H (x) + \frac{1}{n} D (x, y_{n})

from which we conclude that

H (x) + D (x, y_{n}) \leq H (x^{*})

. By axiom JSC,

x^{*} ≻ x

and

H (x) + D (x, x^{*}) \leq H (x^{*})

follows. As

x \in P

was arbitrary, the desired inequality follows. Apply Corollary 1 and the result follows. ☐

After these remarks let us turn to another key result:

Theorem 13.

Let $P$ be any preparation, convex or not, such that $H_{\max} (P) < \infty$. Keeping the other assumptions of Theorem 11 as they are, the game $γ (P)$ is in equilibrium if and only if entropy is not increased by taking convex mixtures in the sense that

H_{\max} (co (P)) = H_{\max} (P) .

(102)

When (102) holds, $γ (P)$ and $γ (co (P))$ have the same unique optimal strategy $y^{*}$ for Observer and the same $H_{\max}$-attractor $x^{*}$ for Nature, and the two agree: $x^{*} = y^{*}$.

Proof.

First remark that if (102) holds,

H_{\max} (co (P)) < \infty

and Theorem 11 applies. All claimed properties then follow easily from that result.

To prove necessity, note that quite generally,

{Ri}_{\min} (co (P)) = {Ri}_{\min} (P) .

(103)

In more detail, the condition

y ≻ co (P)

is equivalent to

y ≻ P

, and, for each belief instance

y \in Y

,

Ri (y | co (P)) = Ri (y | P) .

(104)

This follows by standard assumptions made in the beginning of Section 2.15 according to which visibility is adapted to the convex structure and by affinity of the marginals

Φ^{y}

(convexity would do). Then, if

γ (P)

is in equilibrium, we can argue that

H_{\max} (co (P)) \leq {Ri}_{\min} (co (P)) = {Ri}_{\min} (P) = H_{\max} (P)

and (102) follows. ☐

As we saw, the result is essentially a corollary to Theorem 11. The proof above is modeled after the proof of a less abstract result in [55].
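A small example shows how condition (102) can fail for a preparation which is not convex. The sketch assumes the classical model on a two-point alphabet with description effort $Φ (x, y) = \sum x_{j} \log (1 / y_{j})$ and takes as preparation the two deterministic states. The minimum over Observer strategies and the maximum over the convex hull are approximated on a grid, so the printed values are approximations under these assumptions. All names are hypothetical.

```python
import math

def effort(x, y):
    """Description effort Phi(x, y) = sum_j x_j log(1 / y_j); +inf if y excludes x."""
    total = 0.0
    for a, b in zip(x, y):
        if a > 0:
            if b == 0:
                return math.inf
            total -= a * math.log(b)
    return total

def entropy(x):
    """Entropy as minimal effort: H(x) = Phi(x, x)."""
    return effort(x, x)

def risk(y, preparation):
    """Ri(y | P): the largest effort over the states of the preparation."""
    return max(effort(x, y) for x in preparation)

# A non-convex preparation: the two point masses.
P = [[1.0, 0.0], [0.0, 1.0]]
h_max_P = max(entropy(x) for x in P)

# Grid over the interior of the simplex; its points also belong to co(P).
grid = [[k / 1000, 1 - k / 1000] for k in range(1, 1000)]
ri_min = min(risk(y, P) for y in grid)
h_max_hull = max(entropy(x) for x in grid + P)

print(h_max_P, ri_min, h_max_hull)
```

Here $H_{\max} (P) = 0$ whereas both ${Ri}_{\min} (P)$ and $H_{\max} (co (P))$ equal $\log 2$, so (102) fails and the game for this preparation is not in equilibrium, while (103) is respected.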

We have formulated results for the Y-domain which appear less involved. We leave it to the reader to formulate and prove versions of the two key theorems above for the

\hat{Y}

-domain.

Translating Theorems 11 and 12 to a setting based on utility—this requires an obvious dual notion of attractors aiming at minimax utility rather than at maximin effort (i.e., maximal entropy)—one finds the following result:

Theorem 14.

Again with X a convex state space, let

(U, M, D)

be a proper utility-based information triple with affine marginals

U^{y}

for

y \in X

. Assume that the technical axioms ASC and JSC hold. Further assume that JSD-divergence is complete. Let

𝒫

be a convex preparation for which

M_{\min} (P) > - \infty

. Then:

(i): Without further assumptions, the utility game $γ (P | U)$ is in equilibrium and there exists a unique optimal strategy $y^{*}$ for Observer and a unique $M_{\min}$ -attractor $x^{*}$ . Furthermore, $y^{*} = x^{*}$ and the direct as well as the dual Pythagorean inequalities hold, i.e., for $x \in P$ and $y \in Y$ ,

$\begin{matrix} M_{\min} (P) + D (x, y^{*}) & \leq M (x); \end{matrix}$

(105)

$\begin{matrix} Gtu (y | P) + D (x^{*}, y) & \leq {Gtu}_{\max} (P) . \end{matrix}$

(106)
(ii): In case $x^{*}$ is a consistent state with finite max-utility, i.e., $M (x^{*}) > - \infty$ for which $M (x^{*}) = M_{\min} (P)$ , $x^{*} \in ctr (P)$ and the game $γ (P | U)$ is in equilibrium and has $x^{*}$ as bi-optimal state. In particular, the Pythagorean inequality

$M (x^{*}) + D (x, y^{*}) \leq M (x)$

(107)

holds for every $x \in P$ .

Let us collect the key results about updating games in one theorem:

Theorem 15.

Let X be convex, let

P

be any preparation and let D be a general divergence on

X \otimes Y

with

Y = X

for which the compensation identity holds. Assume that the technical axioms ASC and JSC hold and that JSD-divergence is complete. Consider a prior

y_{0} \in Y

and assume that

D^{y_{0}} < \infty

on

P

and that

M_{\min} (P; y_{0}) > - \infty

. Then:

(i): Without adding extra conditions, Observer has a unique optimal strategy, $y^{*}$ , in the game $γ (P; y_{0})$ .
(ii): Observer strategies for $γ (co (P); y_{0})$ and for $γ (P; y_{0})$ coincide, i.e., $[co (P)] = [P]$ and, for every such strategy y, $Gtu (y | co (P); y_{0}) = Gtu (y | P; y_{0})$ , hence

${Gtu}_{\max} (co (P); y_{0}) = {Gtu}_{\max} (P; y_{0}) .$

(108)
(iii): If $𝒫$ is convex, the game $γ (P; y_{0})$ is in equilibrium and the $D_{\min}^{y_{0}}$ -attractor exists. This attractor, say $x^{*}$ , is identical to the optimal Observer strategy $y^{*}$ from (i); it is the D-projection of $y_{0}$ on $𝒫$ if and only if $x^{*} \in P$ .
(iv): The game $γ (P; y_{0})$ is in equilibrium if and only if

$D_{\min} (co (P); y_{0}) = D_{\min} (P; y_{0}) .$

(109)

Proof.

This may be proved by applying the key results of this section, also recalling Lemma 1. Details are left to the reader. ☐

Further properties of Jensen-Shannon divergence are worth investigating. This concerns in particular the notion of negative definiteness, cf., [71,72]. Some indications are in place. When the property holds, JSD is the square of a Hilbert metric in a natural sense (loc. cit.). Investigating this property, one will quickly realize that, modulo finiteness conditions on the entropy function (say

H_{\max} (X) < \infty

), JSD is negative definite if and only if the entropy function is midpoint-negative definite, i.e., for any finite sequence of states

{(x_{i})}_{i \leq n}

and any associated sequence of real numbers

{(c_{i})}_{i \leq n}

with

\sum_{i} c_{i} = 0

, it holds that

\sum_{i, j} c_{i} c_{j} H (\bar{x_{i} x_{j}}) \leq 0

. If this property holds with a restriction on n we express the property by saying that H is

M P (n)

-negative definite. Clearly, MP(2)-negative definiteness is equivalent to midpoint concavity of H. In the same way as we introduced the notion of

M P (n)

-negative definiteness for H, we may introduce a notion of n-negative definiteness of JSD.

Whereas the results about embeddability in a Hilbert space are rather deep, if we just ask for the property to be a squared metric, the matter is much simpler:

Proposition 11.

Assume that JSD is everywhere finite. Then the following conditions are equivalent:

JSD is the square of a metric;
JSD is 3-negative definite;
H is $M P (3)$ -negative definite.

This result depends on the properties (86)–(89). The key argument is not specific to JSD. For the sake of good order, we provide a proof of the basic general result in Appendix D.

3. Examples, towards Applications

3.1. Primitive Triples and Generation by Integration

Natural building blocks for information triples will be defined. We shall here concentrate on a simple, important and easy-to-apply approach.

A possible expansion of the considerations in the present section is dealt with in Appendix A. This is related to our introduction of weaker concepts of properness and will allow you to work more generally with non-smooth “generators” (see below). An introduction of an action space and of the notion of response is also desirable; how this can be done is indicated in Appendix A. We have chosen not to deal with the possible refinements in the main text, partly to keep the exposition simple, partly because a few technical issues may still need closer investigation.

Let I be a subinterval of

[- \infty, \infty [

with endpoints a and b (

- \infty \leq a < b \leq \infty

). Either none, one or both endpoints belong to I but neither

+ \infty

nor

- \infty

are members of I. Provide I with its usual algebraic and topological structure. We take I as state space as well as belief reservoir. Thus

X = Y = I

. Visibility is normally taken to be the diffuse relation so that any state

s \in I

is visible from any belief instance. However, at times a more restricted notion of visibility is relevant, especially for

I = [0, 1]

or

I = [0, \infty [

. Then

I \otimes I = I^{2} \setminus \{(s, u) \mid s > 0, u = 0\}

(110)

is a better choice.

We agree that in this section, visibility

I \otimes I

is either the diffuse relation

I \times I

or else given by (110) in certain cases when

0 \in I

is a left endpoint of I.

An effort-based information triple over

I \otimes I

is said to be primitive. The “primitivity” lies in the fact that the state space and belief reservoir appear to be as simple as one can think of—if you do not want to enter into discrete structures with a finite or countably infinite state space. We use lower case letters as in

(ϕ, h, d)

for such triples. Upper case letters will then occur for constructions via a process of summation or integration, starting with primitive triples.

We are especially interested in proper primitive triples. The conditions they must satisfy are as follows (linking, fundamental inequality, soundness and properness):

\begin{matrix} ϕ (s, u) & = h (s) + d (s, u), \end{matrix}

(111)

\begin{matrix} d (s, u) & \geq 0, \end{matrix}

(112)

\begin{matrix} d (s, s) & = 0, \end{matrix}

(113)

\begin{matrix} d (s, u) & = 0 \Rightarrow u = s . \end{matrix}

(114)

It is understood here and later on that such requirements are to hold for all

s \in I

(for (113)) or for all

(s, u) \in I \otimes I

(for (111), (112) and (114)). From Section 2.15 we know that it is desirable for the effort function to have affine marginals

ϕ^{u}

. For this to be the case, there must exist functions on I,

η

and

ξ

say, such that

ϕ (s, u) = s η (u) + ξ (u)

(115)

for

(s, u) \in I \otimes I

. There is a simple way to generate a multitude of such information triples. The method is inspired by Bregman, [77], who used the construction for other purposes. Given is a Bregman generator h which is here understood to be a continuous, real-valued, strictly concave function on I which is sufficiently smooth on the interior of the interval, say continuously differentiable. We take this function as the entropy function, h. Defining effort and divergence by

\begin{matrix} ϕ (s, u) & = h (u) + (s - u) h^{'} (u), \end{matrix}

(116)

\begin{matrix} d (s, u) & = h (u) - h (s) + (s - u) h^{'} (u), \end{matrix}

(117)

the triple

(ϕ, h, d)

is indeed a proper primitive information triple with affine marginals,

ϕ^{u}

. Figure 2 illustrates what is involved.
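The construction (116)–(117) can be sketched numerically. The following is a minimal illustration, not part of the original development: the helper name `bregman_triple` and the sample generator h(s) = √s on ]0, ∞[ are assumptions chosen only for this sketch. It checks linking (111), the fundamental inequality (112), soundness (113), properness (114) and affinity of the marginals on a few sample points.

```python
import math

def bregman_triple(h, dh):
    """Build (phi, h, d) from a generator h and its derivative dh, as in (116)-(117)."""
    def phi(s, u):
        return h(u) + (s - u) * dh(u)
    def d(s, u):
        return h(u) - h(s) + (s - u) * dh(u)
    return phi, h, d

# Illustrative generator (an assumption for this sketch): h(s) = sqrt(s),
# which is strictly concave and continuously differentiable on ]0, inf[.
phi, h, d = bregman_triple(math.sqrt, lambda s: 0.5 / math.sqrt(s))

points = [0.25, 1.0, 2.0, 5.0]
for s in points:
    for u in points:
        assert abs(phi(s, u) - (h(s) + d(s, u))) < 1e-12   # linking (111)
        assert d(s, u) >= -1e-12                            # fundamental inequality (112)
        assert (abs(d(s, u)) < 1e-12) == (s == u)           # soundness (113), properness (114)

# Affine marginals: for fixed u, phi(., u) maps midpoints to midpoints.
for u in points:
    mid = phi(0.5 * (0.25 + 5.0), u)
    assert abs(mid - 0.5 * (phi(0.25, u) + phi(5.0, u))) < 1e-12
```

For this particular generator the divergence has the closed form d(s, u) = (√s − √u)² / (2√u), which makes the non-negativity visible directly.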

It is also easy to illustrate geometrically what Jensen-Shannon divergence amounts to. Referring to Figure 3, we find that the Jensen-Shannon divergence between

s_{1}

and

s_{2}

, for primitive triples denoted by jsd, is given by

jsd (s_{1}, s_{2}) = \frac{1}{2} d_{1} + \frac{1}{2} d_{2} .

(118)

It follows geometrically that

jsd (s_{1}, s_{2}) \leq \frac{1}{2} d (s_{1}, s_{2}) + \frac{1}{2} d (s_{2}, s_{1}) .

(119)
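A small numerical sketch of (118) and (119) follows. It rests on our reading of the notation, namely that d_1 and d_2 denote the divergences d(s_1, m) and d(s_2, m) from the two states to their midpoint m; this reading, the helper names and the sample generator h(s) = −s² are assumptions made for illustration only.

```python
def make_d(h, dh):
    """Bregman divergence (117) for a generator h with derivative dh."""
    return lambda s, u: h(u) - h(s) + (s - u) * dh(u)

def jsd(d, s1, s2):
    """Average divergence to the midpoint (our reading of (118))."""
    m = 0.5 * (s1 + s2)
    return 0.5 * d(s1, m) + 0.5 * d(s2, m)

# Illustrative generator (an assumption): h(s) = -s**2, strictly concave on the real line.
h = lambda s: -s * s
dh = lambda s: -2.0 * s
d = make_d(h, dh)

for s1, s2 in [(0.0, 1.0), (-2.0, 3.0), (1.5, 1.5)]:
    m = 0.5 * (s1 + s2)
    left = jsd(d, s1, s2)
    # Equivalent form as an entropy gap: h(m) - (h(s1) + h(s2)) / 2
    assert abs(left - (h(m) - 0.5 * h(s1) - 0.5 * h(s2))) < 1e-12
    # Inequality (119)
    assert left <= 0.5 * d(s1, s2) + 0.5 * d(s2, s1) + 1e-12
```

The entropy-gap form shows that, under this reading, jsd depends only on the generator and not on the tangent construction.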

We also find that for a bounded interval I, JSD-convergence and D-convergence are equivalent concepts and that the associated topology is the standard topology on I.

The utility-based analogues of notions introduced are defined in an obvious manner (see also examples below). We shall use

(u, m, d)

as generic notation for primitive utility-based triples.

As two examples of effort-based Bregman generated primitive triples, we point to the standard algebraic triple given by

\begin{matrix} ϕ (s, u) & = u^{2} - 2 s u, \end{matrix}

(120)

\begin{matrix} h (s) & = - s^{2}, \end{matrix}

(121)

\begin{matrix} d (s, u) & = {(s - u)}^{2} \end{matrix}

(122)

over

] - \infty, + \infty [

and to the standard logarithmic triple

\begin{matrix} ϕ (s, u) & = u - s + s \ln \frac{1}{u}, \end{matrix}

(123)

\begin{matrix} h (s) & = s \ln \frac{1}{s}, \end{matrix}

(124)

\begin{matrix} d (s, u) & = u - s + s \ln \frac{s}{u} . \end{matrix}

(125)

over

[0, \infty [

. Both triples are given in their effort-based versions. If need be, we refer to these triples as standard primitive effort-based triples.
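As a check, both standard triples can be recovered from (116) and (117) by direct substitution. The short derivation below is a worked verification added for convenience; it uses only the generators (121) and (124).

```latex
\begin{align*}
% Generator h(s) = -s^2, so h'(u) = -2u:
\phi(s,u) &= -u^{2} + (s-u)(-2u) = u^{2} - 2su, \\
d(s,u)    &= \phi(s,u) - h(s) = u^{2} - 2su + s^{2} = (s-u)^{2}; \\[4pt]
% Generator h(s) = s \ln(1/s), so h'(u) = \ln(1/u) - 1:
\phi(s,u) &= u\ln\tfrac{1}{u} + (s-u)\bigl(\ln\tfrac{1}{u} - 1\bigr) = u - s + s\ln\tfrac{1}{u}, \\
d(s,u)    &= \phi(s,u) - h(s) = u - s + s\ln\tfrac{1}{u} - s\ln\tfrac{1}{s} = u - s + s\ln\tfrac{s}{u}.
\end{align*}
```

The first pair of lines reproduces (120) and (122), the second pair (123) and (125).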

The first triple is equivalent to a triple we met in Example 1. It leads to basic concepts of real Hilbert space theory by a natural process of summation or, more generally, integration. By a similar process, the second triple leads to basic concepts of Shannon information theory. Before elaborating on that, we shall generalize both examples by the introduction of a parameter q. In fact, we shall see that, modulo affine equivalence, both examples can be conceived as belonging to the same family of triples.

In order to modify the standard algebraic triple, it is natural to consider generators of the form

h_{q} (s) = α (q) s^{q} + β (q) s + γ (q)

(126)

with

α, β

and

γ

functions depending on a real parameter q. Let us agree to work mainly with

I = [0, \infty [

as state space. Then q could in principle be any real parameter. For each fixed q,

h_{q}

is either strictly concave (an effort-based Bregman generator), strictly convex (a utility-based Bregman generator), or degenerate. Applications of (116) and (117) give the formulas

\begin{matrix} ϕ_{q} (s, u) & = α (q) (1 - q) u^{q} + α (q) q s u^{q - 1} + β (q) s + γ (q) \end{matrix}

(127)

\begin{matrix} d_{q} (s, u) & = α (q) (1 - q) u^{q} + α (q) q s u^{q - 1} - α (q) s^{q} . \end{matrix}

(128)

When

α (q) q (q - 1)

is negative,

h_{q}

is a genuine effort-based Bregman generator and the triple

(ϕ_{q}, h_{q}, d_{q})

is a proper primitive effort-based information triple. When

α (q) q (q - 1)

is positive,

h_{q}

is strictly convex and the triple

(ϕ_{q}, h_{q}, - d_{q})

is a proper primitive utility-based information triple (which should then rather be denoted

(u_{q}, m_{q}, d_{q})

). Thus, if you consider the triple

(ϕ_{q}, h_{q}, | d_{q} |)

you are certain to obtain a primitive triple, either effort-based or utility-based (or degenerate). It also follows from (126)–(128) that modulo affine equivalence, the triples you obtain from different choices of

α

,

β

and

γ

are scalarly equivalent. For some choices you may prefer to restrict the parameter so that only effort-based triples emerge, for others you may find it interesting to focus on triples where there is a smooth variation from effort-based to utility-based triples. In applications—purely speculative at the moment—this could reflect situations in economic or physical or chemical systems where e.g., a change from positive to negative rent or from exothermic to endothermic reaction can take place.

If you choose

α = 1

and

β = γ = 0

, then

(ϕ_{q}, h_{q}, | d_{q} |)

equals

((1 - q) u^{q} + q s u^{q - 1}, s^{q}, | (1 - q) u^{q} + q s u^{q - 1} - s^{q} |) .

(129)

As you go from large to small values of q this primitive triple starts out as utility-based, then, for

q = 1

, becomes degenerate, after which it switches to the effort-based mode until, for

q = 0

, it again becomes degenerate, after which it switches back to the utility-based mode. For

q = 2

, the triple is the utility-based standard algebraic triple, the utility-based version of the triple given in (120)–(122). That triple is most naturally considered over

I \otimes I

with

I =] - \infty, \infty [

.
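The switching between modes described above can be traced with a few lines of code. The sketch below is illustrative only; the function names `mode` and `triple_129` are hypothetical. It uses the fact that the second derivative of α s^q is α q (q − 1) s^{q−2}, so that the sign of α q (q − 1) decides between concavity and convexity, and it evaluates (129) at q = 2.

```python
def mode(q, alpha=1.0):
    """Classify h_q(s) = alpha * s**q (plus affine terms) by the sign of alpha*q*(q-1)."""
    c = alpha * q * (q - 1)
    if c < 0:
        return "effort-based"    # strictly concave generator
    if c > 0:
        return "utility-based"   # strictly convex generator
    return "degenerate"

def triple_129(q, s, u):
    """The triple (129), i.e. alpha = 1 and beta = gamma = 0; assumes u > 0."""
    first = (1 - q) * u**q + q * s * u**(q - 1)
    second = s**q
    return first, second, abs(first - second)

# Going from large to small q: utility, degenerate, effort, degenerate, utility.
assert [mode(q) for q in (2, 1, 0.5, 0, -1)] == [
    "utility-based", "degenerate", "effort-based", "degenerate", "utility-based"]

# For q = 2 the triple is (-u^2 + 2su, s^2, (s-u)^2).
s, u = 3.0, 1.0
assert triple_129(2, s, u) == (-u**2 + 2*s*u, s**2, (s - u)**2)
```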

We can remove the “singularity” of the system at

q = 1

by blowing up the generator near

q = 1

. Let us choose

α, β

and

γ

as follows:

α (q) = \frac{1}{1 - q}, β (q) = \frac{- 1}{1 - q}, γ (q) = γ_{0} .

(130)

Here, the constant

γ_{0}

represents a possible overhead. With choices as specified, we obtain the triples

(ϕ_{q}, h_{q}, d_{q})

with

\begin{matrix} ϕ_{q} (s, u) & = u^{q} + \frac{1}{1 - q} (q u^{q - 1} - 1) s + γ_{0}, \end{matrix}

(131)

\begin{matrix} h_{q} (s) & = s \frac{s^{q - 1} - 1}{1 - q} + γ_{0}, \end{matrix}

(132)

\begin{matrix} d_{q} (s, u) & = u^{q} + \frac{q}{1 - q} u^{q - 1} s - \frac{1}{1 - q} s^{q} . \end{matrix}

(133)

The Equation (131) gives you gross effort with net effort obtained by putting

γ_{0} = 0

. Similarly, (132) is gross entropy and the same formula with

γ_{0} = 0

gives you net entropy.

The family of triples (131)–(133) is well defined for all

q \geq 0

if we allow for an interpretation by continuity for

q = 1

. For

q = 0

the triple is degenerate, for

q > 0

it determines a proper primitive effort-based information triple. For

q = 1

continuity considerations show that

(ϕ_{1}, h_{1}, d_{1})

is identical to the standard logarithmic triple given in (123)–(125) (assuming that the overhead is neglected,

γ_{0} = 0

).

The triples we have identified may all be conceived to be of the same structure as the standard logarithmic triple. What is meant by this, is that if we, following Tsallis [78], introduce the deformed logarithms,

\ln_{q}

, defined by the formula

\ln_{q} t = \begin{cases} \ln t & \text{if } q = 1, \\ \frac{1}{1 - q} (t^{1 - q} - 1) & \text{otherwise,} \end{cases}

(134)

then the Formulas (131)–(133) may be expressed as follows in terms of the deformed logarithms:

\begin{matrix} ϕ_{q} (s, u) & = u^{q} - s + q s \ln_{q} (\frac{1}{u}) + γ_{0}, \end{matrix}

(135)

\begin{matrix} h_{q} (s) & = s \ln_{q} (\frac{1}{s}) + γ_{0}, \end{matrix}

(136)

\begin{matrix} d_{q} (s, u) & = u^{q} - s + q s \ln_{q} (\frac{1}{u}) - s \ln_{q} (\frac{1}{s}) . \end{matrix}

(137)

These formulas are used for

s, u \geq 0

and

q \geq 0

(for negative q you do not obtain effort-based quantities). Note that if

q \leq 1

, then

\ln_{q} \frac{1}{t} = \infty

for

t = 0

. The formulas indicate that it is not so much the logarithmic function

t \mapsto \ln_{q} t

which is of importance but more so the function

t \mapsto \ln_{q} \frac{1}{t}

. This is no surprise to information theorists as the latter expression has a well known interpretation in terms of coding when

q = 1

, provided t represents a probability. No convincing interpretation of

\ln_{q} \frac{1}{t}

appears to be known for other values of q. For

q = 1

, (135)–(137) reduce to (123)–(125) pertaining to the standard logarithmic triple.
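The agreement between (131)–(133) and (135)–(137), and the reduction to the standard logarithmic triple at q = 1, can be checked numerically. The following is a minimal sketch under the assumption s, u > 0; the function names are hypothetical and the sample points are arbitrary.

```python
import math

def ln_q(t, q):
    """Deformed logarithm (134); assumes t > 0."""
    if q == 1:
        return math.log(t)
    return (t ** (1 - q) - 1) / (1 - q)

def triple_direct(s, u, q, gamma0=0.0):
    """The triple (131)-(133); requires q != 1 and s, u > 0."""
    phi = u**q + (q * u**(q - 1) - 1) * s / (1 - q) + gamma0
    h = s * (s**(q - 1) - 1) / (1 - q) + gamma0
    d = u**q + q * u**(q - 1) * s / (1 - q) - s**q / (1 - q)
    return phi, h, d

def triple_deformed(s, u, q, gamma0=0.0):
    """The triple (135)-(137), written with deformed logarithms."""
    phi = u**q - s + q * s * ln_q(1 / u, q) + gamma0
    h = s * ln_q(1 / s, q) + gamma0
    d = u**q - s + q * s * ln_q(1 / u, q) - s * ln_q(1 / s, q)
    return phi, h, d

# The two forms agree for q != 1.
for q in (0.5, 2.0, 3.0):
    for s, u in [(0.2, 0.7), (1.0, 0.3), (2.5, 2.5)]:
        a, b = triple_direct(s, u, q), triple_deformed(s, u, q)
        assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))

# Interpretation by continuity at q = 1: the direct form near q = 1
# approaches the standard logarithmic triple (123)-(125).
s, u = 0.2, 0.7
near = triple_direct(s, u, 1 + 1e-6)
at_one = triple_deformed(s, u, 1.0)
assert all(abs(x - y) < 1e-4 for x, y in zip(near, at_one))
assert abs(at_one[0] - (u - s + s * math.log(1 / u))) < 1e-12
assert abs(at_one[2] - (u - s + s * math.log(s / u))) < 1e-12
```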

The family of triples (135)–(137),

q \geq 0

, is referred to as the family of deformed primitive triples—adding a qualifying “effort-based” if need be. The analogous utility-based primitive is the family of triples

(u_{q}, m_{q}, d_{q}) = (- ϕ_{q}, - h_{q}, d_{q})

, i.e., for

q \geq 0

,

\begin{matrix} u_{q} (s, u) & = - u^{q} + s - q s \ln_{q} (\frac{1}{u}) - γ_{0}, \end{matrix}

(138)

\begin{matrix} m_{q} (s) & = - s \ln_{q} (\frac{1}{s}) - γ_{0}, \end{matrix}

(139)

\begin{matrix} d_{q} (s, u) & = u^{q} - s + q s \ln_{q} (\frac{1}{u}) - s \ln_{q} (\frac{1}{s}) . \end{matrix}

(140)

Let us return to the process of integration hinted at in the beginning of the section. A substantial number of concrete triples which illustrate the theory developed can be constructed by combining the Bregman construction with a process of integration.

Integration may be applied to any family of information triples and gives us new triples to work with. Note that by linearity of integration, the important property of affinity of marginals is preserved.

We comment mainly on integration of effort-based triples with a view towards applications in information theory and in statistical physics. Consider integration of one and the same primitive triple

(ϕ, h, d)

over

I \otimes I

with Bregman generator h. Partly for technical convenience we assume that h is non-negative. Then effort, entropy and divergence, will all be non-negative, also in the integrated version. Considering the intended applications, this is only natural.

Let T be a set provided with a Borel structure and with an associated measure

μ

. Let

X = Y

be the function space consisting of all measurable functions

x : T \mapsto I

. Functions in X are identified if they agree

μ

-almost everywhere. Note that X is a convex cone. Consider the integrated triple

(Φ, H, D) = \int_{T} (ϕ, h, d) d μ (t)

(141)

by which we express that the following equations hold:

\begin{matrix} Φ (x, y) & = \int_{T} ϕ (x (t), y (t)) d μ (t), \end{matrix}

(142)

\begin{matrix} H (x) & = \int_{T} h (x (t)) d μ (t), \end{matrix}

(143)

\begin{matrix} D (x, y) & = \int_{T} d (x (t), y (t)) d μ (t) . \end{matrix}

(144)

As

h \geq 0

, H is well defined and

0 \leq H (x) \leq \infty

for all

x \in X

and as

t \mapsto (x (t), y (t))

is measurable and

(s, u) \mapsto d (s, u)

non-negative and measurable, cf., (117), D is well-defined by (144). By linking, also

Φ

is well defined. Thus,

(Φ, H, D)

is a well defined triple over

X \times Y

. We leave it to the reader to verify that

(Φ, H, D)

is a proper information triple. Moreover, if

ϕ

has affine marginals

ϕ^{u}

for all

u \in I

, then

Φ

has affine marginals

Φ^{y}

for all

y \in Y

. The divergence functions which can be obtained in this way are Bregman divergences. Note that with this construction, the essential fundamental inequality

D \geq 0

even holds pointwise as

d \geq 0

. For this reason, when we discuss the integrated triple, we refer to (112) as the pointwise fundamental inequality.

Bregman divergence may be used to modify visibility by taking

X \otimes Y

to consist of all pairs

(x, y) \in X \times Y

with

D (x, y) < \infty

.

For the standard logarithmic triple (123)–(125), one may construct discrete models, say over a finite or countably infinite alphabet T, by a process of summation related to the interval

I = [0, \infty [

rather than the traditional choice

I = [0, 1]

. States will then be certain sequences

{(x_{i})}_{i \in T}

, which may be conceived as intensity sequences consisting of point intensities rather than the usual probability sequences of point probabilities. As regularity conditions one could take sequences with bounded intensities or sequences for which the primitive entropy function h of (124) satisfies the requirement

\sum_{i \in T} h (x_{i}) > - \infty

. For this to work technically, we realize the importance of the pointwise fundamental inequality for d of (125) and note that this requires the inclusion of the term

u - s

in d. Thus one may suggest replacing classical probability spaces with certain intensity spaces.

Returning to the classical choice with discrete probability distributions over a discrete alphabet T,

Φ

becomes discrete Kerridge inaccuracy, H classical Shannon entropy and D discrete Kullback-Leibler divergence. If we generalize to cover non-discrete settings, entropy can only be finite for distributions with countable support, whereas the generalization of divergence makes sense more generally. For instance, we may consider the generator

s \mapsto s \ln \frac{1}{s}

on the entire half-line

I = [0, \infty [

and for

(T, μ)

take an arbitrary measure space, provided with some measure

μ

. As state space we can then, as one possibility, take the set of measures absolutely continuous with respect to

μ

and with finite-valued Radon-Nikodym derivatives with respect to

μ

. For two such measures, say

P = p d μ

and

Q = q d μ

we find that

D (P, Q) = \int (p (t) \ln \frac{p (t)}{q (t)} + q (t) - p (t)) d μ (t) .

(145)

This may be called generalized Kullback-Leibler divergence. It is the more natural divergence to consider. For one thing, the integrand is non-negative by the pointwise fundamental inequality. If we restrict attention to finite measures P and Q with the same total mass, this reduces to the standard expression

\int p \ln \frac{p}{q} d μ

. The standard expression also gives a divergence measure if the two measures are finite and

Q (T) \leq P (T)

and, moreover, the important compensation identity also holds in this case since the additional terms (stemming from

u - s

in (125)) are integrable and affine.
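For a finite alphabet with counting measure, (145) becomes a finite sum of the primitive divergences (125). The sketch below is an illustration under that assumption; the function names and the sample intensity sequences are hypothetical. It shows that every summand is non-negative, that the sum reduces to the standard expression when the total masses agree, and that otherwise the two differ by Q(T) − P(T).

```python
import math

def d_log(s, u):
    """Primitive divergence (125), with 0 ln 0 = 0 and the value inf when s > 0 = u."""
    if s == 0:
        return u
    if u == 0:
        return math.inf
    return u - s + s * math.log(s / u)

def generalized_kl(p, q):
    """Discrete version of (145): a sum of pointwise non-negative terms."""
    return sum(d_log(a, b) for a, b in zip(p, q))

def standard_kl(p, q):
    """The standard expression sum p ln(p/q); assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p = [0.5, 1.5, 2.0]          # intensities, total mass 4
q_same = [1.0, 1.0, 2.0]     # total mass 4
q_less = [1.0, 1.0, 1.0]     # total mass 3

assert all(d_log(a, b) >= 0 for a, b in zip(p, q_less))
assert abs(generalized_kl(p, q_same) - standard_kl(p, q_same)) < 1e-12
assert abs(generalized_kl(p, q_less) - (standard_kl(p, q_less) + sum(q_less) - sum(p))) < 1e-12
```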

Now consider extensions to cover also integration of the family

(ϕ_{q}, h_{q}, d_{q})

. It is natural to consider these triples over

I \otimes I

with

I = [0, 1]

in order to ensure that

h_{q} \geq 0

. By integration we obtain the triples

(Φ_{q}, H_{q}, D_{q})

(146)

defined over appropriate function spaces, typically representing probability distributions. For

q > 0

these triples are proper effort-based information triples. For

q = 0

you obtain degenerate triples. The quantity

H_{q}

, is meaningful in discrete cases with T finite or countably infinite, and defines Tsallis entropy. For the continuous case, Tsallis entropy does not make much sense, but the divergence function

D_{q}

does.
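In the discrete case the quantity H_q is simply the sum of the primitive entropies (136) over the point probabilities. The sketch below illustrates this for a finite alphabet, with net entropy (γ_0 = 0); the function names are hypothetical. It also checks that values of q close to 1 give values close to Shannon entropy in natural units.

```python
import math

def h_q(s, q):
    """Net primitive entropy (132)/(136) with gamma0 = 0, for 0 <= s <= 1 and q > 0."""
    if s == 0:
        return 0.0
    if q == 1:
        return -s * math.log(s)
    return (s**q - s) / (1 - q)

def tsallis_entropy(p, q):
    """Discrete H_q: sum of the primitive entropies over a probability vector p."""
    return sum(h_q(x, q) for x in p)

p = [0.5, 0.25, 0.25]

# For q = 2 the entropy equals 1 - sum of squared probabilities.
assert abs(tsallis_entropy(p, 2) - (1 - sum(x * x for x in p))) < 1e-12

# q = 1 is Shannon entropy, and nearby q give nearby values.
shannon = -sum(x * math.log(x) for x in p)
assert abs(tsallis_entropy(p, 1) - shannon) < 1e-12
assert abs(tsallis_entropy(p, 1 + 1e-6) - shannon) < 1e-4
```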

So far, we have discussed integration of primitive triples. This concerns a process where the original state space (the interval I) is changed to a new state space and then, an information triple over the new state space is constructed. A similar process applies if we start out with a family

{((Φ_{t}, H_{t}, D_{t}))}_{t \in T}

of proper information triples over the same state space X (formally, over

X \times Y

or

X \otimes Y

with structures as usual and, typically,

Y = X

). Then we may consider the integrated triple

(Φ, H, D) = \int_{T} (Φ_{t}, H_{t}, D_{t}) d μ (t)

(147)

defined by

\begin{matrix} Φ (x, y) & = \int_{T} Φ_{t} (x, y) d μ (t), \end{matrix}

(148)

\begin{matrix} H (x) & = \int_{T} H_{t} (x) d μ (t), \end{matrix}

(149)

\begin{matrix} D (x, y) & = \int_{T} D_{t} (x, y) d μ (t) . \end{matrix}

(150)

With suitable measurability conditions,

(Φ, H, D)

is a well-defined proper information triple. Also, the standard restriction of affinity is preserved by this process. As a useful but trivial remark, we note that properness of the integrated triple only needs properness of

(Φ_{t}, H_{t}, D_{t})

for a set of positive

μ

-measure. An instance of this feature with T a two-element set was already discussed in Section 2.7.

The most obvious application of the process of integration probably is to integrate the utility-based standard algebraic triple

(u, m, d) = (- u^{2} + 2 s u, s^{2}, {(s - u)}^{2})

, cf., (129). This triple is considered over

I \otimes I

with

I =] - \infty, \infty [

. Integrating over a measure space

(T, μ)

, you are led to take as state space the

L^{2}

-space over

(T, μ)

. In standard notation, the integrated triple

(U, M, D)

is given by

\begin{matrix} U (x, y) & = - {∥ y ∥}^{2} + 2 〈 x, y 〉, \end{matrix}

(151)

\begin{matrix} M (x) & = {∥ x ∥}^{2} \end{matrix}

(152)

\begin{matrix} D (x, y) & = {∥ x - y ∥}^{2} . \end{matrix}

(153)

We collect in Section 3.2 comments on these classical concepts, seen in the light of the theory here developed.

Some comments on the generation of information triples by the method inspired by Bregman [77] are in order. The focus of Bregman’s method has often been on the divergence measures it generates. Before Bregman’s work one mainly studied f-divergences, introduced independently by Csiszár [79], Morimoto [80] and by Ali and Silvey [81]. We find that often, Bregman divergences occur more naturally and have more convincing interpretations.

As we have seen, the widely studied entropies bearing Tsallis’ name can be derived via a Bregman-type construction. In Section 3.6 we shall have a closer look at these entropies. They have received a good deal of attention, especially within statistical physics. Some comments on the origin of these measures of entropy are in order. Tsallis’ trend-setting paper [2] is from 1988 but, originally, the entropies go back to Havrda and Charvát [82], to Daróczy [83] and to Lindhard and Nielsen [84,85] who all, independently of each other, found the notion of interest. Characterizations via functional equations were derived in Aczél and Daróczy [86], see also the reference work [87] as well as [41]. Regarding the physical literature, there is a casual reference to Lindhard’s work in one of Jaynes’ papers [88]. However, only after the publication of Tsallis’ 1988 paper did mathematicians and, especially, physicists take an interest in the “new” entropy measures. We refer to the database maintained by Tsallis with more than 2000 references. From the recent literature we only point to Naudts [89], who also emphasized the convenient approach via Bregman generators.

3.2. A Geometric Model

Let us return to the model

(U, M, D)

given by (151)–(153) of Section 3.1. This is the utility-based information triple

(- {∥ y ∥}^{2} + 2 〈 x, y 〉, {∥ x ∥}^{2}, {∥ x - y ∥}^{2})

pertaining to the Hilbert space

X = Y = L^{2} (T, μ)

. The triple is proper and has affine marginals

U^{y}

given y.

In this case, the linking identity (after rearrangement of terms) is identical to the cosine relation. Other well-known basic facts of inner-product spaces can be derived by combining the linear structure of such spaces with the basic properties of information triples. Thus, the identity you obtain from the compensation identity (79) applied to D is of central importance for classical least squares analysis (apparently, the identity has no special name in this setting—it goes back at least to Gauss).

Games directly associated with the information triple

(U, M, D)

involve minimization of M over various preparations, in other words, the search for elements closest to the origin subject to certain restrictions. Let us instead comment on relative games, which are games depending on the specification of a preparation and a prior

y_{0} \in Y

, cf., Section 2.8. If the preparation

𝒫

is convex and closed, the D-projection

x^{*}

of

y_{0}

on

𝒫

exists; it is the unique point in

𝒫

which is closest in norm to

y_{0}

(though the result is classical, the reader may appreciate that it is derived with ease from the compensation identity and completeness of Hilbert space). As standard convexity and continuity assumptions are also in place, Theorem 15 applies. It follows that the game

γ (P; y_{0})

is in equilibrium with the D-projection

x^{*}

as bi-optimal state. The updating gain for this game is given by (21), i.e.,

U_{| y_{0}} (x, y) = ∥ x - y_{0} ∥^{2} - {∥ x - y ∥}^{2} .

(154)

In this case the Pythagorean inequality reduces to the classical inequality

∥ x - y_{0} ∥^{2} \geq ∥ x - x^{*} ∥^{2} + {∥ x^{*} - y_{0} ∥}^{2},

(155)

valid for every

x \in P

.
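The inequality (155) is easy to test in a finite-dimensional instance. In the sketch below the Hilbert space is taken to be R^3 and the preparation is the unit cube, for which the D-projection is obtained by clamping coordinates; the choice of preparation, the prior and the helper names are assumptions made only for this illustration.

```python
import random

def sqdist(a, b):
    """Squared norm distance, i.e. D(a, b) of (153) in R^n."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def project_on_box(y0, lo, hi):
    """Nearest point to y0 in the box [lo, hi]^n, a closed convex preparation."""
    return [min(max(c, lo), hi) for c in y0]

random.seed(0)
y0 = [2.0, -1.0, 0.5]                       # prior
x_star = project_on_box(y0, 0.0, 1.0)       # D-projection of y0 on the unit cube
assert x_star == [1.0, 0.0, 0.5]

# Pythagorean inequality (155) for randomly sampled states x in the preparation.
for _ in range(1000):
    x = [random.uniform(0.0, 1.0) for _ in y0]
    assert sqdist(x, y0) >= sqdist(x, x_star) + sqdist(x_star, y0) - 1e-12
```

By (154), the left-hand side minus the first term on the right is the updating gain of the pair (x, x*), so the loop also confirms that this gain is at least ∥x* − y_0∥² on the sampled states.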

Combining Proposition 9 and Theorem 15 we obtain rather complete information about the updating games, also for preparations which are not necessarily convex. For instance, Figure 4, case (a) illustrates a case with unique optimal strategies for both players and yet, the game is not in equilibrium. Case (b) illustrates a typical case with a game in equilibrium. For both figures,

x^{*}

denotes the optimal strategy for Nature and

y^{*}

the optimal strategy for Observer. Indicated on the figures you also find the largest strict divergence ball

B (x^{*} | y_{0})

and the largest half-space

σ^{+} (y^{*} | y_{0})

which is external to

𝒫

. The two values of the game can then be determined from the figures,

∥ x^{*} - y_{0} ∥^{2}

for Nature, respectively

∥ y^{*} - y_{0} ∥^{2}

for Observer.

Lastly, some words on the typical preparations you meet in practice. Consistent with the philosophy expressed in Section 2.9, these are the feasible preparations. The strict ones are affine subspaces and the slack ones are convex polyhedral subsets. We shall determine the core of families of strict preparations:

Proposition 12.

Consider a family

P = P^{y}

of strict feasible preparations determined by finitely many points

y = (y_{1}, \dots, y_{n})

in X. The core of this family consists of all points in the affine subspace through

y_{0}

generated by the vectors

y_{i} - y_{0}; i = 1, \dots, n

, i.e.,

core (P) = \{y_{0} + \sum α_{i} (y_{i} - y_{0}) \mid (α_{1}, \dots, α_{n}) \in \mathbb{R}^{n}\} .

(156)

Proof.

An individual member

𝒫

of

P

is determined by considering all

x \in X

for which the values of

U_{| y_{0}} (x, y_{i}); i = 1, \dots, n

have been fixed. Note that fixing these values is the same as fixing the inner products

〈 x - y_{0}, y_{i} - y_{0} 〉

or, equivalently, the inner products

〈 x, y_{i} - y_{0} 〉

. If

y^{*}

is of the form given by (156),

y^{*} = y_{0} + \sum α_{i} (y_{i} - y_{0})

, then

〈 x, y^{*} - y_{0} 〉 = \sum α_{i} 〈 x, y_{i} - y_{0} 〉

and we realize that this is independent of x if x is restricted to run over some preparation in

P

. Then also

U_{| y_{0}} (x, y^{*})

is independent of x when x is so restricted. We conclude that

y^{*} \in core (P)

. This proves the inclusion “⊇” of (156).

To prove the other inclusion, assume, as we may, that

y_{0} = 0

and that the

y_{i}

form an orthonormal system. Consider a point

y^{*} \in core (P)

. Determine

P \in P

such that

y^{*} \in P

. By Theorem 5,

y^{*}

is the bi-optimal state of

γ (P; y_{0})

. Let

c_{i}; i = 1, \dots, n

denote the common values of

〈 x, y_{i} 〉

for

x \in P

. Then

x^{*} = \sum c_{i} y_{i}

is the orthogonal projection of

y_{0} = 0

on

𝒫

, hence

y^{*} = x^{*}

. This argument shows that the core is contained in the subspace generated by the

y_{i}

. This is the result we want as we assumed that

y_{0} = 0

. ☐

In order to determine the projection of

y_{0}

on a specific preparation

P = P^{y} (h) \in P

, we simply intersect

core (P)

with

𝒫

. Doing this analytically, one may avoid trivial cases and assume that

y_{i} - y_{0}; i = 1, \dots, n

are linearly independent. In Figure 4, case (c) we have illustrated the situation in the simple case when

n = 1

.
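Intersecting the core with a specific preparation amounts to solving a small linear system. The sketch below works in R^m and assumes that the preparation is specified by the values c_i of the inner products 〈x, y_i − y_0〉, which by the proof above is equivalent to fixing the values of the updating gains. This parametrization and the helper names are hypothetical, and the vectors y_i − y_0 are assumed linearly independent.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def solve(G, r):
    """Solve G a = r by Gaussian elimination with partial pivoting (G nonsingular)."""
    n = len(r)
    A = [row[:] + [ri] for row, ri in zip(G, r)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[piv] = A[piv], A[k]
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= f * A[k][j]
    a = [0.0] * n
    for k in reversed(range(n)):
        a[k] = (A[k][n] - sum(A[k][j] * a[j] for j in range(k + 1, n))) / A[k][k]
    return a

def project_via_core(y0, ys, c):
    """Point of the form (156) satisfying <x, y_i - y0> = c_i for all i."""
    v = [[a - b for a, b in zip(yi, y0)] for yi in ys]
    G = [[dot(vi, vj) for vi in v] for vj in v]          # Gram matrix of the y_i - y0
    r = [cj - dot(y0, vj) for cj, vj in zip(c, v)]
    alpha = solve(G, r)
    return [b + sum(a * vi[k] for a, vi in zip(alpha, v)) for k, b in enumerate(y0)]

y0 = [0.0, 0.0, 0.0]
ys = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
x_star = project_via_core(y0, ys, [2.0, 3.0])
# The preparation is {(2, 1, t)}; its point closest to y0 = 0 is (2, 1, 0).
assert all(abs(a - b) < 1e-12 for a, b in zip(x_star, [2.0, 1.0, 0.0]))
```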

3.3. Universal Coding and Prediction

In this and in the next two sections we present problems where randomization plays a role. It will be realized that apart from this, the discussion of the three problems treated, though different in nature, relies on the same type of considerations (Kuhn-Tucker type results).

We start by discussing a problem of universal coding and prediction.

Let

A

be a discrete finite set, the common alphabet, and consider languages whose written representation uses letters from

A

. Let

P

be a finite set of such languages, referred to as the selection, e.g., the selection could be English, German and French. Assume that for each individual language from

P

we know the distribution of single letters in a typical text from that language, and let us identify a language with the corresponding distribution over

A

. In this way, the selection is identified with a certain finite subset

P

of

M_{+}^{1} (A)

, the set of all distributions over

A

.

When we observe letters from

A

generated by a typical text from just one of the languages, say with associated single-letter distribution

P \in M_{+}^{1} (A)

, information theory tells us how to encode letters from

A

in strings of letters from a reference alphabet, say the binary alphabet consisting of the two elements 0 and 1, so as to minimize the expected length of the encoded binary strings. The encoded string corresponding to the letter

x \in A

, will then have a length

κ (x)

which is given roughly as

κ (x) \approx \log \frac{1}{P (x)}

(157)

with log denoting binary logarithms. This choice ensures that the average code length

〈 κ, P 〉 = \sum_{x \in A} P (x) κ (x)

(158)

is minimal.

The precise sense in which (157)—even with exact equality—is the undisputed right choice will not be discussed here. It is a cornerstone of information theory for which you may consult standard text books on information theory such as [90] or an introductory text such as Topsøe [91]. Note that (157) with equality implies that

\sum_{x \in A} 2^{- κ (x)} = 1

(Kraft’s equality).
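As a small illustration of (157) and (158), the following sketch computes idealized binary code lengths and checks Kraft's equality. The four-letter distribution and all function names are assumptions chosen for illustration; they do not come from the text.

```python
import math

def ideal_code_lengths(P):
    """Idealized binary code lengths kappa(x) = log2(1/P(x)), i.e. (157) with equality."""
    return {x: math.log2(1.0 / p) for x, p in P.items() if p > 0}

def average_length(kappa, P):
    """Average code length <kappa, P> as in (158)."""
    return sum(P[x] * kappa[x] for x in kappa)

# Hypothetical single-letter distribution over a four-letter alphabet.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
kappa = ideal_code_lengths(P)

# Kraft's equality: the numbers 2^(-kappa(x)) sum to 1.
kraft_sum = sum(2.0 ** (-k) for k in kappa.values())
avg = average_length(kappa, P)
```

For dyadic probabilities such as these, the idealized lengths are natural numbers, so the idealization discussed next is not needed; for general distributions it is.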

Let us change to a more theoretical concept of encoding by idealization, forgetting that the length of a binary sequence is a natural number and by a change to natural units rather than binary units. This leads us to redefine a code over

A

to be a map

κ : A \to [0, \infty]

such that

\sum_{x \in A} \exp (- κ (x)) = 1,

(159)

i.e., such that Kraft’s equality with natural units holds. Denote by

K (A)

the set of all such codes. The requirement (159) amounts to the requirement that the correspondence

κ \leftrightarrow P

given by

κ (x) = \ln \frac{1}{P (x)}; x \in A

(160)

is a one-to-one correspondence between

M_{+}^{1} (A)

and

K (A)

. We also express (160) by saying that

κ

is adapted to P and we write

κ = \hat{P}

. As is easily seen, either directly or referring to previous material from Section 3.1,

κ = \hat{P}

is the unique code for which the average code length

〈 κ, P 〉

is minimal.

With this property in mind, we define the redundancy of a pair

(P, κ) \in M_{+}^{1} (A) \times K (A)

as the quantity

\hat{D} (P, κ) = \sum_{x \in A} P (x) (κ (x) - \ln \frac{1}{P (x)}) .

(161)
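The definitions (159)–(161) can be tried out numerically. The sketch below is only an illustration under assumed inputs: the distributions P and Q and the helper names are hypothetical. It builds the code adapted to Q and evaluates the redundancy when the true source is P, which is the Kullback-Leibler divergence between P and Q.

```python
import math

def adapted_code(P):
    """Code adapted to P in natural units: kappa(x) = ln(1/P(x)), cf. (160)."""
    return [math.inf if p == 0 else -math.log(p) for p in P]

def redundancy(P, kappa):
    """Redundancy (161); terms with P(x) = 0 are read as 0."""
    return sum(p * (k + math.log(p)) for p, k in zip(P, kappa) if p > 0)

P = [0.5, 0.3, 0.2]   # hypothetical source distribution
Q = [0.4, 0.4, 0.2]   # hypothetical belief used to build the code
kappa_Q = adapted_code(Q)

kraft = sum(math.exp(-k) for k in kappa_Q)   # Kraft's equality (159)
red_self = redundancy(P, adapted_code(P))    # vanishes for the adapted code
red_other = redundancy(P, kappa_Q)           # strictly positive since Q differs from P
```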

From our discussion we know—in a theoretical idealized way at least—how to encode letters from

A

if we want to process letters from a text source generated by a single language in an optimal manner. We shall investigate what can be done if we receive text from an unknown language, except that we know that the language is one from the given selection.

We agree to call a code

κ \in K (A)

universal for the language selection

P

if the risk, here defined as

{\hat{Ri}}_{0} (κ | P) = \max_{P \in P} \hat{D} (P, κ)

(162)

is minimal. The associated distribution under the correspondence

κ \leftrightarrow P

is then said to be a universal predictor. Note that the risk

{\hat{Ri}}_{0}

is associated with the information triple

(\hat{D}, 0, \hat{D})

and that a universal code is the same as an optimal strategy for Observer in the game associated with this triple. Clearly, the game in question is not in equilibrium, hence equilibrium type results as developed previously are not of much use. Instead it turns out that a very direct approach will lead to an identification of universal objects.

Theorem 16.

Let

(P^{*}, κ^{*}) \in M_{+}^{1} (A) \times K (A)

with

κ^{*}

adapted to

P^{*}

. Assume further that for some finite constant R,

\hat{D} (P, κ^{*}) \leq R \quad \text{for all } P \in P

(163)

and that

P^{*}

can be written as a convex combination of a set of distributions in

P

for which equality holds in (163). Then

κ^{*}

is the unique universal code and

P^{*}

the unique universal predictor.

Proof.

Clearly,

{\hat{Ri}}_{0} (κ^{*} | P) = R

.

Then consider any code

κ

different from

κ^{*}

. Write

P^{*}

as a convex combination

P^{*} = \sum_{i} α_{i} P_{i}

of distributions in

P

all of which satisfy the relation

\hat{D} (P_{i}, κ^{*}) = R

. Then the compensation identity tells us that

\begin{matrix} {\hat{Ri}}_{0} (κ | P) & = \sum_{i} α_{i} {\hat{Ri}}_{0} (κ | P) \geq \sum_{i} α_{i} \hat{D} (P_{i}, κ) \\ & = \hat{D} (P^{*}, κ) + \sum_{i} α_{i} \hat{D} (P_{i}, κ^{*}) = \hat{D} (P^{*}, κ) + R . \end{matrix}

Thus, as

\hat{D}

is proper,

{\hat{Ri}}_{0} (κ | P) > R

. As this holds for all

κ \neq κ^{*}

, the result follows. ☐

Note the essential point that

\hat{D}

satisfies the compensation identity. That this is so follows either by direct calculation or, more systematically, by applying (iii) of Theorem 8 to the triple you obtain by adding entropy to

(\hat{D}, 0, \hat{D})

. For the derived domain you then work with the typical Shannon triple, listed explicitly in (185)–(187). So, after all, the information triples are also useful for the above problem.

It can be shown that the result always applies in the sense that the unique optimal code and the unique optimal predictor exist and that they satisfy the conditions stated in the theorem. Note that the representation of the optimal predictor as given in the theorem may not be unique.
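For a selection of just two languages, the conditions of Theorem 16 can be met by a one-dimensional search: mix the two distributions and adjust the weight until both redundancies agree. The sketch below does this by bisection. It is a minimal illustration, assuming two hypothetical distributions with full support; it is not a general algorithm for larger selections.

```python
import math

def kl(P, Q):
    """Redundancy (161) of the code adapted to Q when the source is P, in nats."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def mix(a, P1, P2):
    """Convex combination a*P1 + (1-a)*P2."""
    return [a * p1 + (1 - a) * p2 for p1, p2 in zip(P1, P2)]

def universal_predictor(P1, P2, tol=1e-12):
    """Bisection on the weight a until both languages have equal redundancy."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        a = 0.5 * (lo + hi)
        M = mix(a, P1, P2)
        if kl(P1, M) > kl(P2, M):
            lo = a   # P1 is served worse, so it needs more weight
        else:
            hi = a
    a = 0.5 * (lo + hi)
    return a, mix(a, P1, P2)

def risk(Q, selection):
    """Risk (162) of the code adapted to Q."""
    return max(kl(P, Q) for P in selection)

# Two hypothetical single-letter distributions over a three-letter alphabet.
P1 = [0.7, 0.2, 0.1]
P2 = [0.1, 0.3, 0.6]
a_star, P_star = universal_predictor(P1, P2)
R = risk(P_star, [P1, P2])
risk_uniform = risk([1 / 3, 1 / 3, 1 / 3], [P1, P2])
```

The mixture found this way is a convex combination of distributions for which equality holds in (163), so by the theorem its adapted code is universal; any other predictor, such as the uniform one, has strictly larger risk.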

3.4. Sylvester’s Problem from Location Theory

As starting point we take a simple Y-domain model with

Y = X

, a convex set. For visibility we take the diffuse relation

X \times Y

. Given is a finite-valued general divergence function over

X \times Y

for which the compensation identity (79) holds.

As a concrete example to have in mind, take a Euclidean space X provided with norm-squared distance,

D (x, y) = {∥ x - y ∥}^{2}

. Moreover, as the motivating problem, consider Sylvester’s problem, to determine the point with the least maximal distance to a given finite set

P

of points in X, cf., [92] or the monograph [93]. For the original problem, X was the Euclidean plane. However, the problem makes good sense in the general setting with X any convex set provided with a suitable replacement for classical squared distance.

The problem is a minimax problem and may formally be conceived as related to the special proper information triple

(D, 0, D)

. Indeed, the problem is to find optimal Observer strategies for the associated game

γ (P)

and to calculate Observer’s value of the game, the MinRisk-value

{Ri}_{\min} (P)

. However, this game is rather trivial as Nature’s value in the game is 0. Thus no equilibrium-type results are available.

To find a remedy, we apply a process of randomization. For that, we no longer consider X as the state space but take the convex space

\tilde{X} = MOL (X)

of molecular probability measures as a new state space. An element

α \in \tilde{X}

is represented as a family

α = {(α_{x})}_{x \in X}

of non-negative numbers such that

\sum_{x \in X} α_{x} = 1

and such that the support of

α

, i.e., the set

supp (α) = {x | α_{x} > 0}

, is finite.

The new model we shall construct is conceived as a

\hat{Y}

-type model. As state space we take

\tilde{X}

. Just as X, this is a convex set. For formal reasons—so that the modeling fits the general abstract theory—we may also take

\tilde{X}

as belief reservoir, though we will have no need really to consider belief instances. Instead, control will be in the focus, and for the set of control instances we shall take

Y = X

. Once more for formal reasons, we consider the barycentric map which maps an (artificial) belief instance into its barycenter as response. This map will play an important role for the modeling. Let the map be

α \mapsto b (α)

with

α \in \tilde{X}

and barycenter of

α

given by

b (α) = \sum_{x \in X} α_{x} x .

(164)

The good sense in considering elements of X as controls is the idea from location theory, that from a point in X, conceived as a location, you should try to control the given points in the set

P

as best you can.

With these preparations, we may consider the triple

(\tilde{Φ}, \tilde{H}, \tilde{D})

over

\tilde{X} \times Y

given by

\begin{matrix} \tilde{Φ} (α, y) = \sum_{x \in X} α_{x} D (x, y), \end{matrix}

(165)

\begin{matrix} \tilde{H} (α) = \sum_{x \in X} α_{x} D (x, b (α)), \end{matrix}

(166)

\begin{matrix} \tilde{D} (α, y) = D (b (α), y) . \end{matrix}

(167)

For

P \subseteq X

, denote by

\tilde{P}

the set of

α \in \tilde{X}

which are supported by

P

, i.e.,

\sum_{x \in P} α_{x} = 1

. By

\tilde{γ} (\tilde{P})

we denote the game corresponding to the triple

(\tilde{Φ}, \tilde{H}, \tilde{D})

with

\tilde{P}

as preparation. A basic fact which contributes to the significance of games of this type is that, as easily seen, risk does not increase when you replace the game

γ (P)

with

\tilde{γ} (\tilde{P})

, in particular, with self-explanatory notation,

{\tilde{Ri}}_{\min} (\tilde{P}) = {Ri}_{\min} (P) .

(168)

This fact relies on the affinity of the marginals of

\tilde{Φ}

for fixed y.

Theorem 17.

The triple

(\tilde{Φ}, \tilde{H}, \tilde{D})

over

\tilde{X} \times Y

is a proper information triple over

\tilde{X} \times Y

and the triple has affine marginals.

Let

P

be a subset of X and consider the game

\tilde{γ} (\tilde{P})

. Consider a pair

(α^{*}, y^{*}) \in \tilde{P} \times Y

of strategies in the game

\tilde{γ} (\tilde{P})

with

y^{*}

adapted to

α^{*}

, i.e.,

y^{*} = b (α^{*})

. Then if, for some constant R,

\begin{matrix} \forall x \in X : & D (x, y^{*}) \leq R, \end{matrix}

(169)

\begin{matrix} \forall x \in supp (α^{*}) : & D (x, y^{*}) = R, \end{matrix}

(170)

y^{*}

is the unique optimal strategy for Observer in

\tilde{γ} (\tilde{P})

as well as in

γ (P)

. Further,

{Ri}_{\min} (P) = R

and

α^{*}

is a bi-optimal strategy for

\tilde{γ} (\tilde{P})

.

Proof.

With the preparations done, the first part is trivial, and the second follows as an application of Corollary 1. ☐

Note that the linking identity is just another way of formulating the compensation identity and that the entropy function is the compensation term in that identity.

With Theorem 17 we have a solution to Sylvester’s problem for an abstract model provided you can somehow point to a possible solution. It can be shown, modulo technical assumptions to ensure existence of optimal strategies, that the sought optimal Observer strategy must be of the form as stated in the theorem.
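To see the conditions of Theorem 17 at work, consider three points in the plane with norm-squared distance. The sketch below is an illustration only: the points, the weights and the candidate radius are hypothetical choices, and condition (169) is checked over the given points. It also checks the linking identity for the triple (165)–(167) at an arbitrary weight vector and control.

```python
def sq_dist(x, y):
    """Norm-squared distance D(x, y) = ||x - y||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def barycenter(alpha, points):
    """Barycenter b(alpha) as in (164)."""
    dim = len(points[0])
    return tuple(sum(w * p[k] for w, p in zip(alpha, points)) for k in range(dim))

def risk(y, points):
    """Maximal squared distance from the location y to the given points."""
    return max(sq_dist(p, y) for p in points)

# Hypothetical data: three points, with a candidate alpha* supported by two of them.
points = [(-1.0, 0.0), (1.0, 0.0), (0.0, 0.5)]
alpha_star = [0.5, 0.5, 0.0]
y_star = barycenter(alpha_star, points)   # y* adapted to alpha*
R = 1.0

dists = [sq_dist(p, y_star) for p in points]
cond_169 = all(d <= R + 1e-12 for d in dists)
cond_170 = all(abs(d - R) < 1e-12 for d, w in zip(dists, alpha_star) if w > 0)

# Linking identity: Phi~(alpha, y) = H~(alpha) + D~(alpha, y).
alpha = [0.2, 0.3, 0.5]
y = (0.3, -0.2)
b = barycenter(alpha, points)
phi = sum(w * sq_dist(p, y) for w, p in zip(alpha, points))
h = sum(w * sq_dist(p, b) for w, p in zip(alpha, points))
d = sq_dist(b, y)
```

Here the origin is the centre of the smallest enclosing disc and R is its squared radius; moving the location in any direction increases the maximal squared distance.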

3.5. Capacity Problems, an Indication

Problems concerning capacity are among the best known problems of information theory. They concern the determination of capacity, defined as the maximal information transmission rate under various conditions, and the associated optimal ways of coding. We shall only define one of the basic concepts and derive a key relation, leaving it to the reader to consult the literature for more concrete results.

We first elaborate on the information triple given in the previous section by (165)–(167). The entropy function of that triple we may think of as related to information transmission rate of information theory (then also related to the notion of mutual information which is, however, not investigated further in the present study). This refers to the map

x \mapsto y

as a map from an input letter to an output letter. Then an element

α \in \tilde{X}

represents a distribution over the input letters, a source, and response tells you what is happening on the output side. It is important to study how the rate behaves under mixtures. Thus we have a need to study elements in

\tilde{\tilde{X}} = MOL (\tilde{X})

. The result one needs exploits the flexibility of the modeling, especially related to Theorem 8.

First, define information transmission rate related to

α \in \tilde{X}

simply as

I (α) = \tilde{H} (α) .

(171)

We wish to emphasize the following result:

Lemma 2.

With the setting as above, consider any

w = {(w_{α})}_{α \in \tilde{X}} \in \tilde{\tilde{X}}

and put

α_{0} = \sum_{α \in \tilde{X}} w_{α} α

. Then, for every

w \in \tilde{\tilde{X}}

,

I (\sum_{α \in \tilde{X}} w_{α} α) = \sum_{α \in \tilde{X}} w_{α} I (α) + \sum_{α \in \tilde{X}} w_{α} D (b (α), b (α_{0})) .

(172)

Proof.

If you write

\tilde{H}

in place of I, this follows from the identity (77) of Theorem 8 with

\tilde{H}

in place of H. ☐
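The identity (172) is easy to test numerically when X is the real line with squared distance, in which case the rate I is a variance and (172) is the familiar decomposition of the variance of a mixture. The sketch below uses hypothetical molecular measures, represented as dictionaries from points to weights; the names are illustrative only.

```python
def barycenter(alpha):
    """Barycenter b(alpha) of a molecular measure given as {point: weight}."""
    return sum(w * x for x, w in alpha.items())

def rate(alpha):
    """I(alpha) = H~(alpha) from (166) and (171), with D(x, y) = (x - y)**2."""
    b = barycenter(alpha)
    return sum(w * (x - b) ** 2 for x, w in alpha.items())

def mixture(weights, alphas):
    """The mixture alpha_0 = sum of w_alpha * alpha."""
    out = {}
    for w, alpha in zip(weights, alphas):
        for x, a in alpha.items():
            out[x] = out.get(x, 0.0) + w * a
    return out

# Hypothetical sources and mixing weights.
alphas = [{0.0: 0.5, 2.0: 0.5}, {1.0: 0.25, 5.0: 0.75}]
weights = [0.4, 0.6]

alpha0 = mixture(weights, alphas)
b0 = barycenter(alpha0)
lhs = rate(alpha0)
rhs = (sum(w * rate(a) for w, a in zip(weights, alphas))
       + sum(w * (barycenter(a) - b0) ** 2 for w, a in zip(weights, alphas)))
```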

With the technical lemma in place, a study of abstract models of information transmission systems runs smoothly and you can derive operational necessary and sufficient conditions for the requirements of optimal strategies. On Nature’s side, an optimal strategy is an input distribution for which the transmission rate reaches the maximum, the capacity of the system. The result is a Kuhn-Tucker type result, well known from general convexity theory and from information theory, and much resembles the results of the previous two sections. We refer to Topsøe [94] for an exposition of a result which exploits the lemma just proved.

3.6. Tsallis Worlds

Recall the introduction in Section 3.1 of the family of Tsallis entropies. In this section we present arguments which may help to appreciate the significance of these measures of entropy.

The main result, Theorem 18, was presented in a different form in [36] and, less formally, in [35]. Here we present detailed proofs which were not provided in these sources.

The introduction in Section 3.1 of the Bregman generators

h_{q}

and thereby, via a process of integration, of Tsallis entropy, cf., (146), does not in itself constitute an acceptable interpretation. Via coding considerations, the significance of the Bregman generator

h_{1}

, leading to the notion of Shannon entropy is well understood. Despite some attempts to extend this to more general entropy measures, cf., [95,96,97], a general approach via coding has not yet been fully convincing. In [98] you find a previous attempt of the author centred on a certain property of factorization.

The results presented here indicate that possibly, a convincing and generally acceptable physical justification of Tsallis entropy can be provided by involving deformation between the physical system studied and the physicist. Previous endeavours to find physical justification for Tsallis entropy are discussed in detail in Tsallis, [99]. We share the view that though the “Tsallis-q” can be viewed just as a parameter introduced simply to fit data, this is not satisfactory and operational justification is needed. Deformation as here emphasized in combination with a notion of description may offer a common ground on the way to more insight.

To set the scene for our study, introduce the alphabet

A

, a discrete set of basic events which are identified by an index, typically denoted by i. Sensible indexing is often of importance and depends on the concrete physical application. The semiotic assignment of indices shall facilitate technical handling and catalyze semantic awareness. As we have no concrete application in mind, no extra structure is introduced which could justify a specific choice of indices.

The state space X is taken to be identical to the belief reservoir Y and, for simplicity, equal to

M_{+}^{1} (A)

, the set of probability distributions over

A

(you could have worked, instead and more generally, with sets involving intensity as suggested in Section 3.1). Generically,

x = {(x_{i})}_{i \in A}

will denote a state and

y = {(y_{i})}_{i \in A}

a belief instance. Thus x and y are characterized by their point probabilities. As

Y_{\det}

, the set of certain belief instances, we take the set of deterministic distributions over

A

. Visibility

y ≻ x

shall mean that x is absolutely continuous wrt y. Thus

X \otimes Y

consists of all pairs

(x, y) \in M_{+}^{1} (A) \times M_{+}^{1} (A)

with

supp (x) \subseteq supp (y)

, with supp denoting support. We shall not need a control space or a response function.

A knowledge instance will be a family

z = {(z_{i})}_{i \in A}

over

A

of real numbers, not necessarily a probability distribution. The interpretation of

z_{i}

is as the intensity with which the basic event indexed by i is presented to Observer. For this reason, z is referred to as the intensity function. The individual elements

z_{i}

are the local intensities.

The deformation between x, y and z is given by a deformation

Π

, cf., Section 2.5. We assume that

Π

acts locally, i.e., that there exists a real-valued function

π

, the local deformation, defined on

{[0, 1]}^{2} = [0, 1] \times [0, 1]

such that, when

z = Π (x, y)

, then

z_{i} = π (x_{i}, y_{i})

for all

i \in A

. The world defined in this way by a local deformation is denoted

Ω_{π}

or, if need be,

Ω_{π} (A)

. From now on, when we talk about a “deformation”, we have a local deformation in mind.

Regarding regularity conditions, we assume that

π

is finite on

[0, 1] \times ] 0, 1]

, continuous on

{[0, 1]}^{2} ∖ {(0, 0)}

and continuously differentiable on

] 0, 1 [ \times ] 0, 1 [

. The deformation is weakly consistent if

\sum_{i \in A} z_{i} = 1

whenever

(x, y) \in X \otimes Y

and

{(z_{i})}_{i \in A} = Π (x, y)

. If you can even conclude that

z = {(z_{i})}_{i \in A}

is a probability distribution,

π

is strongly consistent. The deformation

π

is sound if

π (s, s) = s

for every

s \in [0, 1]

.

For

q \in \mathbb{R}

, the algebraic deformation

π_{q}

is given on

{[0, 1]}^{2}

by

π_{q} (s, t) = q s + (1 - q) t .

(173)

These deformations are all sound and weakly consistent and, for

0 \leq q \leq 1

, even strongly consistent. The corresponding worlds are denoted

Ω_{q} = Ω_{q} (A)

. The notation is consistent with the notation introduced in Section 2.5. The significance of the algebraic deformations is derived from the following result.

Lemma 3.

Assume that the alphabet

A

is countably infinite. Then only the algebraic deformations are weakly consistent.

Proof.

Let

π

be weakly consistent and put

q = π (1, 0)

. Consider a deterministic distribution

δ

over

A

and apply weak consistency with

x = y = δ

to find that

π (0, 0) = 0

. Thus, if x and y both have support in a subset

A_{0} \subseteq A

, you can neglect contributions stemming from

(x_{i}, y_{i})

with

i \notin A_{0}

and conclude consistency over

A_{0}

, i.e., that

\sum_{i \in A_{0}} π (x_{i}, y_{i}) = 1

. By weak consistency (in the extended form just established),

π (s, t) + π (1 - s, 1 - t) = 1

for all

(s, t) \in [0, 1] \times [0, 1]

, in particular,

π (0, 1) = 1 - q

. Consider

(x_{0}, y_{0}) = (0, 1)

and

(x_{i}, y_{i}) = (\frac{1}{n}, 0)

for

i = 1, \dots, n

, apply weak consistency and conclude that

π (\frac{1}{n}, 0) = \frac{1}{n} q

. Then, for

p \in \mathbb{N}

, consider vectors

(x_{i}, y_{i})

of the form

(0, 1), (\frac{1}{n}, 0), \dots, (\frac{1}{n}, 0), (\frac{p}{n}, 0)

. By weak consistency and previous findings, conclude that

π (s, 0) = s q

for all rational

s \in [0, 1]

. By continuity, this formula holds for all

s \in [0, 1]

. Quite analogously,

π (0, t) = t (1 - q)

for all

t \in [0, 1]

. Finally,

π = π_{q}

follows by weak consistency applied to

(s, t), (1 - s, 0), (0, 1 - t)

. ☐

In particular, if

A

is infinite then, automatically, a weakly consistent deformation is sound. In fact, all concrete deformations we shall deal with will be sound.
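The consistency properties of the algebraic deformations (173) can be checked on a small example. The sketch below assumes a hypothetical pair (x, y) with the support of x contained in that of y; it shows that the local intensities always sum to 1, while for q outside the unit interval some of them may become negative.

```python
def pi_q(q, s, t):
    """Algebraic deformation (173)."""
    return q * s + (1 - q) * t

def deform(q, x, y):
    """Intensity function z with z_i = pi_q(x_i, y_i)."""
    return [pi_q(q, s, t) for s, t in zip(x, y)]

# Hypothetical state x and belief instance y with supp(x) contained in supp(y).
x = [0.7, 0.2, 0.1, 0.0]
y = [0.1, 0.2, 0.3, 0.4]

# Weak consistency: the intensities sum to 1 for every real q.
sums = {q: sum(deform(q, x, y)) for q in (-1.0, 0.0, 0.5, 1.0, 2.0)}

# Strong consistency holds for q = 0.5 but fails for q = 2 in this example.
z_half = deform(0.5, x, y)
z_two = deform(2.0, x, y)

# Soundness: pi_q(s, s) = s.
sound = all(abs(pi_q(q, s, s) - s) < 1e-12 for q in sums for s in (0.0, 0.3, 1.0))
```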

Instead of searching only for a suitable entropy function for the world

Ω_{π}

, we find it more rewarding to search for a suitable full information triple for this world. Let us analyze what such a triple, say

(Φ, H, D)

, could be. A natural demand is that

Φ, H

and D should all act locally. Therefore, according to Section 3.1 what we are really searching for is a primitive information triple

(ϕ, h, d)

over

[0, 1] \times [0, 1] ∖ {(s, u) | s > 0, u = 0}

, cf., (110), such that

(Φ, H, D)

is obtained from this triple by integration over

A

equipped with counting measure. In particular, the requirements (111)–(114) must be satisfied. Obvious names for the sought functions

ϕ, h

and d are, respectively, local effort, local entropy and local divergence.

Let us suggest a suitable form of local effort. It will depend on the notion of a descriptor, defined as any continuous, strictly decreasing function on

[0, 1]

which is finite-valued and continuously differentiable on

] 0, 1]

, vanishes at

t = 1

and satisfies the condition that

κ^{'} (1) = - 1 .

(174)

The value

κ (u)

is conceived as the effort you have to allocate to any basic event in which you have a belief expressed by u. The condition

κ (1) = 0

reflects the fact that if you feel certain that a basic event will occur, there is no reason why you should allocate any effort at all to that event. Also, it is to be expected that events you do not have much belief in are more difficult to describe than those you believe in with a higher degree of confidence. Therefore, we may just as well assume from the outset that

κ

is decreasing. The norming requirement (174) will enable comparisons of effort, entropy and divergence across different descriptors or even different worlds. The unit defined implicitly by (174) is the natural information unit, the “nat”.

An important class of descriptors is the class

{(κ_{q})}_{q \geq 0}

given on

[0, 1]

by

κ_{q} (s) = \ln_{q} \frac{1}{s} .

(175)
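The defining properties of a descriptor can be checked numerically for this class. The q-logarithm is introduced in Section 3.1 and is not repeated here; the sketch below assumes the usual convention, ln_q(x) = (x^{1-q} - 1)/(1 - q) for q different from 1 and the natural logarithm for q = 1. This convention agrees with the formula for the descriptor with q = 0 quoted later in this section. The function names are illustrative.

```python
import math

def ln_q(q, x):
    """q-logarithm under the assumed convention (natural logarithm for q = 1)."""
    if q == 1:
        return math.log(x)
    return (x ** (1 - q) - 1) / (1 - q)

def kappa_q(q, s):
    """Descriptor (175): kappa_q(s) = ln_q(1/s), for 0 < s <= 1."""
    return ln_q(q, 1.0 / s)

def derivative_at_one(q, h=1e-6):
    """One-sided difference quotient approximating kappa_q'(1), cf. (174)."""
    return (kappa_q(q, 1.0) - kappa_q(q, 1.0 - h)) / h

qs = (0.5, 1.0, 2.0)
values_at_one = [kappa_q(q, 1.0) for q in qs]      # should all vanish
slopes_at_one = [derivative_at_one(q) for q in qs] # should all be close to -1

# Strictly decreasing on a grid of belief values.
grid = [0.1 * k for k in range(1, 11)]
decreasing = all(kappa_q(q, a) > kappa_q(q, b)
                 for q in qs for a, b in zip(grid, grid[1:]))
```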

With access to a descriptor you may suggest to assign the effort

κ (u)

to an event with belief instance u, but you should multiply this effort with the intensity with which the event is presented to you. This gives the suggestion

ϕ (s, u) = π (s, u) κ (u)

for local effort. Then local divergence should be the function

d (s, u) = π (s, u) κ (u) - π (s, s) κ (s)

. However, this is not going to work as the fundamental inequality (112) is bound to fail (consider

(s, u)

with u close to 1). Fortunately, insight gained in Section 3.1 indicates how one may modify the suggestion in order to have a chance that the fundamental inequality could hold, viz., by adding an overhead term. Therefore, given a descriptor, we now suggest to define the local functions as follows:

\begin{matrix} ϕ_{π} (s, u | κ) & = π (s, u) κ (u) + u \end{matrix}

(176)

\begin{matrix} h_{π} (s | κ) & = π (s, s) κ (s) + s \end{matrix}

(177)

\begin{matrix} d_{π} (s, u | κ) & = (π (s, u) κ (u) + u) - (π (s, s) κ (s) + s) . \end{matrix}

(178)

One may study modifications with more general overhead terms, but we shall not do so. The important thing is to realize that something has to be done. Moreover, we are guided by the fact that, for the important cases with descriptors of the form

κ_{q}

, adding a simple linear overhead as suggested above works. This is stated explicitly in Corollary 7 below.

Lemma 4.

Let π be a deformation and κ a descriptor. Assume that

d_{π} (\cdot, \cdot | κ)

given by (178) is a genuine primitive divergence function, i.e., that (112) (the pointwise fundamental inequality) and (114) (pointwise properness) hold. Then

(Φ_{π} (\cdot, \cdot | κ), H_{π} (\cdot | κ), D_{π} (\cdot, \cdot | κ))

obtained by integration of the local quantities given in (176)–(178) over

A

is a proper information triple over

X \otimes Y

.

The proof follows directly from the discussion in Section 3.1.

Note that for sound deformations, the measures of entropy constructed this way only depend on the descriptor, not on the deformation.

Also note that the quantities defined really give gross effort and gross entropy. In particular, minimal entropy is not 0 as usual, but 1. This may appear odd but, on the other hand, the way to these quantities was very natural and one may ask if it is not advantageous in many situations to incorporate an overhead. Moreover, why not use the overhead to fix the unit of effort?

We also remark that if we allow incomplete probability measures Q as belief instances, then this change of the space

X \otimes Y

will not change the conclusion above. However, sticking to probability measures also for belief instances, we may subtract the number 1 from gross effort and from gross entropy and obtain the more familiar net-quantities.

Corollary 7.

For

0 < q \leq \infty

the deformation

π_{q}

and the descriptor

κ_{q}

satisfy the conditions of Lemma 4. Accordingly, the information triple generated by integration over

A

is a proper information triple. Furthermore, the effort function has affine marginals.

The obtained effort- and entropy functions are gross-quantities. The corresponding net-quantities give the information triple

(Φ_{q}, H_{q}, D_{q})

in (146) of Section 3.1. In particular,

H_{q}

is standard Tsallis entropy with q as parameter.

The simple checking is left to the reader.
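Part of this checking can be mirrored numerically. The sketch below integrates the local quantities (176)–(178) over a three-letter alphabet for the pair consisting of the algebraic deformation and the descriptor (175), and compares the net entropy with the standard Tsallis formula. The distributions are hypothetical, the q-logarithm convention is the usual one (an assumption, since its definition lies outside this section), and q = 1 is excluded only to keep the code short.

```python
def kappa_q(q, s):
    """Descriptor (175) for q != 1: ln_q(1/s) = (s**(q-1) - 1)/(1-q)."""
    return (s ** (q - 1) - 1) / (1 - q)

def gross_effort(q, x, y):
    """Sum over the alphabet of (176) with pi = pi_q and kappa = kappa_q."""
    return sum((q * s + (1 - q) * u) * kappa_q(q, u) + u for s, u in zip(x, y))

def gross_entropy(q, x):
    """Sum over the alphabet of (177); equals gross effort at y = x."""
    return gross_effort(q, x, x)

def divergence(q, x, y):
    """Sum over the alphabet of (178)."""
    return gross_effort(q, x, y) - gross_entropy(q, x)

def tsallis_entropy(q, x):
    """Standard Tsallis entropy (1 - sum x_i^q)/(q - 1)."""
    return (1 - sum(s ** q for s in x)) / (q - 1)

# Hypothetical state and belief instance with positive point probabilities.
x = [0.6, 0.3, 0.1]
y = [0.2, 0.5, 0.3]
qs = (0.5, 2.0, 3.0)

net_entropy_gap = [gross_entropy(q, x) - 1 - tsallis_entropy(q, x) for q in qs]
divergences = [divergence(q, x, y) for q in qs]
```

Subtracting the overhead 1 from gross entropy gives the Tsallis entropy, and the divergence is positive for the belief instance y different from x, as properness requires.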

We turn to problems of another nature, viz., if, given a deformation, one can find an appropriate descriptor such that the generated global description effort is proper.

Lemma 5.

Assume that the alphabet

A

has at least three elements. Let π be a sound deformation and denote by χ the function on

] 0, 1 [

defined by

χ (t) = \frac{\partial π}{\partial t} (t, t) .

(179)

Under the assumption that χ is bounded in the vicinity of

t = 1

, there can only exist one descriptor κ such that the net-effort function generated by π and κ, i.e., the function Φ given by

Φ (x, y) = \sum_{i \in A} π (x_{i}, y_{i}) κ (y_{i})

(180)

is a proper effort function over

X \otimes Y

. Indeed, κ must be the unique solution in

[0, 1]

to the differential equation

χ (t) κ (t) + t κ^{'} (t) = - 1

(181)

for which

κ (1) = \lim_{t \to 1} κ (t) = 0

.

Proof.

Assume that

κ

exists with

Φ_{π} (\cdot, \cdot | κ)

proper. For

0 < t < 1

put

f (t) = χ (t) κ (t) + t κ^{'} (t) .

Consider a, for the time, fixed probability vector

x = (x_{1}, x_{2}, x_{3})

with positive point probabilities. Then the function F given by

F (y) = F (y_{1}, y_{2}, y_{3}) = \sum_{1}^{3} π (x_{i}, y_{i}) κ (y_{i})

on

] 0, 1 [ \times ] 0, 1 [ \times ] 0, 1 [

assumes its minimal value at the interior point

y = x

when restricted to probability distributions. As standard regularity conditions are fulfilled, there exists a Lagrange multiplier

λ

such that, for

i = 1, 2, 3

,

\frac{\partial}{\partial y_{i}} (F (y) - λ \sum_{1}^{3} y_{i}) = 0

when

y = x

. This shows that

f (x_{1}) = f (x_{2}) = f (x_{3})

.

Using this with

(x_{1}, x_{2}, x_{3}) = (\frac{1}{2}, x, \frac{1}{2} - x)

for a value of x in

] 0, \frac{1}{2} [

, we conclude that f is constant on

] 0, \frac{1}{2} [

. Then consider a value

x \in [\frac{1}{2}, 1 [

and the probability vector

(x, \frac{1}{2} (1 - x), \frac{1}{2} (1 - x))

and conclude from the first part of the proof that

f (x) = f (\frac{1}{2} (1 - x))

. As

0 < \frac{1}{2} (1 - x) < \frac{1}{2}

, we conclude that

f (x) = f (\frac{1}{2})

. Thus f is constant on

] 0, 1 [

. By letting

t \to 1

in (181) and appealing to the technical boundedness assumption, we conclude that the value of the constant is

- 1

. ☐

Note the use in the above proof of Lagrange multipliers in the study of properties that hold under the realization of an extremum. This is quite different from the usage we have opted against where the technique is used as a tool to verify that an extremum has been found. In the latter case, we claim that, typically, more adequate intrinsic methods apply.
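For the algebraic deformations, the function in (179) is the constant 1 - q, and one can check numerically that the descriptor (175) solves the differential equation (181). The sketch below uses a central difference quotient for the derivative and the usual q-logarithm convention; both the convention and the helper names are assumptions made for illustration.

```python
import math

def kappa_q(q, t):
    """Descriptor (175): ln_q(1/t), with the natural logarithm for q = 1."""
    if q == 1:
        return -math.log(t)
    return (t ** (q - 1) - 1) / (1 - q)

def chi_q(q, t):
    """(179) for pi_q(s, t) = q*s + (1-q)*t: derivative in the second variable."""
    return 1 - q

def ode_lhs(q, t, h=1e-6):
    """Left-hand side of (181), with kappa' replaced by a central difference."""
    dk = (kappa_q(q, t + h) - kappa_q(q, t - h)) / (2 * h)
    return chi_q(q, t) * kappa_q(q, t) + t * dk

# The left-hand side should be close to -1 at interior points.
residuals = [ode_lhs(q, t) + 1.0
             for q in (0.5, 1.0, 2.0)
             for t in (0.1, 0.5, 0.9)]
```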

We can now formulate one of the main results:

Theorem 18.

Assume that the alphabet has at least three elements.

(i): If $q \leq 0$, there is no descriptor which, together with $π_{q}$, generates a proper effort function.
(ii): If $q > 0$, there exists a unique descriptor, $κ_{q}$ defined by (175), which, together with $π_{q}$, generates a proper effort function. The generated information triple $(Φ_{q}, H_{q}, D_{q})$ is proper.

Proof.

By Lemma 5 we see that

κ_{q}

given by (175) is the only descriptor which, together with

π_{q}

, could possibly generate a proper effort function. That it does so for

q > 0

, follows by Lemma 4. For

q \leq 0

, this is not the case as the reader can verify by considering atomic situations with

x = (1 - ε, ε)

and

y = (\frac{1}{2}, \frac{1}{2})

and letting

ε

tend to 0. ☐
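The atomic situation used in the proof can be explored numerically. The sketch below evaluates the net effort (180) with the descriptor (175) and compares the effort at the uniform belief with the effort at the belief equal to the state. It is an illustration under the usual q-logarithm convention (an assumption), with a hypothetical small value of ε and q = 1 omitted for brevity.

```python
def kappa_q(q, s):
    """Descriptor (175) for q != 1: (s**(q-1) - 1)/(1-q)."""
    return (s ** (q - 1) - 1) / (1 - q)

def net_effort(q, x, y):
    """Net effort (180) generated by pi_q and kappa_q."""
    return sum((q * s + (1 - q) * u) * kappa_q(q, u) for s, u in zip(x, y))

def gap(q, eps):
    """Effort at y = (1/2, 1/2) minus effort at y = x, for x = (1 - eps, eps)."""
    x = [1 - eps, eps]
    y = [0.5, 0.5]
    return net_effort(q, x, y) - net_effort(q, x, x)

eps = 1e-3
gap_negative_q = gap(-0.5, eps)   # negative: the fundamental inequality fails
gap_zero_q = gap(0.0, eps)        # zero although y differs from x: not proper
gap_positive_q = gap(0.5, eps)    # positive, consistent with part (ii)
```

For negative q the gap tends to minus infinity as ε tends to 0, while for q = 0 it vanishes identically, so in neither case is the effort function proper.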

We may add that for the case of a black hole,

q = 0

, the descriptor is given by

κ_{0} (s) = \frac{1}{s} - 1

and, using

| \cdot |

for “number of elements in ⋯”, the generated information triple

(Φ_{0}, H_{0}, D_{0})

is given by

\begin{matrix} Φ_{0} (x, y) & = | supp (y) |, \end{matrix}

(182)

\begin{matrix} H_{0} (x) & = | supp (x) |, \end{matrix}

(183)

\begin{matrix} D_{0} (x, y) & = | supp (y) ∖ supp (x) | \end{matrix}

(184)

for all

y ≻ x

. Note that if terms of the form

π (x_{i}, y_{i}) κ (y_{i})

were to be interpreted by continuity, the resulting triple would be discrete.

We have noted that the descriptor is uniquely determined from the deformation. Therefore, in principle, only the deformation needs to be known. Examples will show that different deformations may well determine the same descriptor. For instance, deformation defined as a geometric average rather than an arithmetic average as in the definition of

π_{q}

will lead to the same descriptor. Thus, knowing only the descriptor, you cannot know which world you operate in, in particular, you cannot determine divergence or description effort. But you can determine the entropy function. This emphasizes again the general thesis, that entropy should never be considered alone.
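The remark on geometric averages can be made concrete under the assumption that the geometric deformation is π(s, t) = s^q t^{1-q}; this specific form is our reading of the remark, not a formula given in the text. The sketch below estimates the function in (179) for both deformations by a central difference and finds the same constant 1 - q, so that the differential equation (181), and hence the descriptor, is the same.

```python
def pi_arith(q, s, t):
    """Algebraic deformation (173): weighted arithmetic average."""
    return q * s + (1 - q) * t

def pi_geom(q, s, t):
    """Assumed geometric counterpart: weighted geometric average."""
    return s ** q * t ** (1 - q)

def chi(pi, q, t, h=1e-6):
    """Central difference for the partial derivative in the second variable at (t, t)."""
    return (pi(q, t, t + h) - pi(q, t, t - h)) / (2 * h)

checks = [(q, t, chi(pi_arith, q, t), chi(pi_geom, q, t))
          for q in (0.3, 2.0) for t in (0.2, 0.7)]

# Both deformations are sound, yet they differ away from the diagonal.
sound = all(abs(pi_geom(q, s, s) - s) < 1e-12 for q in (0.3, 2.0) for s in (0.2, 0.7))
differ = abs(pi_geom(0.3, 0.2, 0.7) - pi_arith(0.3, 0.2, 0.7)) > 1e-3
```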

Finally a comment on the descriptors

κ_{q}

. A focus on their inverses is also in order. They may be interpreted as probability checkers: Indeed, if, in a Tsallis world with parameter q, you have access to a nats and ask how complex an event this will allow you to describe, the appropriate answer is “you can describe any event with a probability as low as

κ^{- 1} (a)

”. Thus, when

q \leq 1

, however large your resources of nats are, there are events so complex that you cannot describe them, whereas, if

q > 1

you can describe any event if you have access to K nats if only K is sufficiently large (

K \geq \frac{1}{q - 1}

).
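The probability checker can be written in closed form by inverting (175). The sketch below assumes the usual q-logarithm convention and uses illustrative names; it returns the lowest probability that can be described with a given number of nats, and returns 0 once the resources suffice to describe every event.

```python
import math

def kappa_q(q, s):
    """Descriptor (175), with the natural logarithm for q = 1."""
    if q == 1:
        return -math.log(s)
    return (s ** (q - 1) - 1) / (1 - q)

def kappa_q_inverse(q, a):
    """Lowest probability describable with a nats, i.e. the s with kappa_q(s) = a."""
    if q == 1:
        return math.exp(-a)
    base = 1 + (1 - q) * a
    if base <= 0:
        # Only possible for q > 1 and a >= 1/(q - 1): every event is describable.
        return 0.0
    return base ** (1.0 / (q - 1))

round_trip = kappa_q(0.5, kappa_q_inverse(0.5, 3.0))   # should recover 3.0
threshold_q2 = kappa_q_inverse(2.0, 1.0)               # K = 1/(q - 1) = 1 for q = 2
still_positive = kappa_q_inverse(0.5, 1e6)             # q <= 1: never reaches 0
```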

3.7. Maximum Entropy Problems of Classical Shannon Theory

Terminology and results as developed in Section 2 are evidently inspired by maximum entropy problems of classical information theory. The classical problems concern inference of probability distributions over some finite or countably infinite alphabet

A

, typically with preparations given in terms of certain constraints, often interpreted as “moment constraints” related to random variables of interest. Such preparations will, modulo technical conditions, be feasible in the sense as defined in Section 2.9. Examples are numerous, from information theory proper, from statistics, from statistical physics or elsewhere. The variety of possibilities may be grasped from the collection of examples in Kapur’s monograph [100]. The abstract results developed in Section 2 can favorably be applied to all such examples. This then has a unifying effect. However, for many concrete examples, it may involve a considerable amount of effort actually to verify the requirements needed for the abstract results to apply. This may involve the verification of Nash’s inequality (52) or the determination of the core of models under study, cf., Theorems 5 and 6. No detailed calculations for specific examples will be carried out here.

A very large number of researchers have worked with these problems. The related publications of the present author comprise [26,101]. We shall focus on applications of the general theory from Section 2.

The basic model we shall discuss is the same as in Section 3.6 based on a finite or countably infinite alphabet

A

. Note that, in principle, discrete alphabets with more than enumerably many elements could be allowed. However, that would contradict the sensible requirement (3).

The relevant information triple is the proper information triple composed of Kerridge inaccuracy, Shannon entropy and Kullback-Leibler divergence:

\begin{matrix} Φ (P, Q) & = \sum_{a \in A} P (a) \ln \frac{1}{Q (a)}; \end{matrix}

(185)

\begin{matrix} H (P) & = \sum_{a \in A} P (a) \ln \frac{1}{P (a)}; \end{matrix}

(186)

\begin{matrix} D (P, Q) & = \sum_{a \in A} P (a) \ln \frac{P (a)}{Q (a)} . \end{matrix}

(187)

We shall also work with the action space

\hat{Y} = K (A)

introduced in Section 3.3 and as response we take the bijection

Q \mapsto \hat{Q}

from Y to

\hat{Y}

given, for

a \in A

, by

\hat{Q} (a) = \ln \frac{1}{Q (a)} .

(188)

Controllability is the relation for which control

κ ≻ P

means that

P (a) = 0

whenever

κ (a) = \infty

. The information triple to work with in the

\hat{Y}

-domain is

(\hat{Φ}, H, \hat{D})

with entropy as above and with

\begin{matrix} \hat{Φ} (P, κ) & = \sum_{a \in A} P (a) κ (a), \end{matrix}

(189)

\begin{matrix} \hat{D} (P, κ) & = \sum_{a \in A} P (a) (κ (a) - \hat{P} (a)) . \end{matrix}

(190)

The triples

(Φ, H, D)

and

(\hat{Φ}, H, \hat{D})

are genuine proper information triples with affine marginals. Thus all parts of the abstract results developed are available and ready to apply. However, we limit the discussion by focusing only on the role of the feasible preparations, leaving elaborations in concrete examples to those interested.

Thinking of states P as determining the distribution of a random element

ξ

over

A

, it is often desirable to consider preparations corresponding to the prescription of one or more mean values of

ξ

. A typical preparation consists of all

P \in X

such that

\sum_{a \in A} P (a) λ (a) = c

(191)

with c a given constant and

λ = {(λ (a))}_{a \in A}

a given function on

A

. This is a strict feasible preparation if and only if the partition function (a special Dirichlet series),

Z (β) = \sum_{a \in A} \exp (- β λ (a))

(192)

has a finite abscissa of convergence, i.e., converges for some finite constant

β

, cf., [26] (or monographs on Dirichlet series). However, for the most important part, having concrete applications in mind, viz., the “if”-part, this is clear. Indeed, if the condition is fulfilled, there exist constants

α_{0}

and

β_{0}

such that the function

κ_{0}

given for

a \in A

by

κ_{0} (a) = α_{0} + β_{0} λ (a)

(193)

defines a code. Then

P = P^{κ_{0}} (k)

for some constant k, hence it is a strict feasible preparation of genus 1. It is a member of the preparation family

P = P^{κ_{0}}

. Consider, for any

β

with

Z (β) < \infty

, the code

κ_{β}

given for

a \in A

by

κ_{β} (a) = \ln Z (β) + β λ (a) .

(194)

Then this code is a member of

core^(P^{κ_{0}})

as is easily seen. In fact all members of the core are of this form (this fact can be proved as a kind of exercise in linear algebra, but more elegant proofs using the structure of the problem should be possible). If we can adjust the parameter

β

such that the corresponding distribution

P_{β}

given by

P_{β} (a) = \frac{\exp (- β λ (a))}{Z (β)} for a \in A

(195)

is a member of the original preparation

P

, this must be the maximum entropy distribution of

P

, as follows from Theorem 6, translated to the

\hat{Y}

-domain.

Schematically then: In searching for the MaxEnt distribution of a given preparation, first identify the preparation as a feasible preparation (of genus 1 or higher), then calculate if possible the appropriate partition function and finally adjust parameters to fit the original constraint(s). This gives you the MaxEnt distribution searched for. If calculations are prohibitive, you may resort to numerical, algorithmic or graphical methods instead.
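The scheme can be illustrated numerically. The sketch below is a hedged illustration, assuming a finite alphabet, a single constraint of the form (191) and a bisection search for β over a fixed bracket; the helper names and the dice-like numbers are hypothetical and not taken from the paper. It computes the partition function (192), the distribution (195), and adjusts β to fit the constraint.

```python
import math

def partition(beta, lam):
    """Partition function (192) on a finite alphabet."""
    return sum(math.exp(-beta * l) for l in lam)

def gibbs(beta, lam):
    """Distribution P_beta of (195)."""
    z = partition(beta, lam)
    return [math.exp(-beta * l) / z for l in lam]

def mean(p, lam):
    """Left-hand side of the constraint (191)."""
    return sum(pa * la for pa, la in zip(p, lam))

def entropy(p):
    return sum(-pa * math.log(pa) for pa in p if pa > 0)

def solve_beta(lam, c, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection for beta with mean(P_beta) = c.

    Assumptions: lam is not constant, c lies strictly between min(lam) and
    max(lam), and the solution lies in [lo, hi]. The mean of P_beta is
    strictly decreasing in beta, which makes bisection applicable.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(gibbs(mid, lam), lam) > c:
            lo = mid  # mean too large: increase beta
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Hypothetical example: six outcomes with prescribed mean value 4.5.
lam = [1, 2, 3, 4, 5, 6]
c = 4.5
beta = solve_beta(lam, c)
p_star = gibbs(beta, lam)

print("beta        =", beta)
print("mean        =", mean(p_star, lam))
print("entropy     =", entropy(p_star))
print("ln Z + beta*c =", math.log(partition(beta, lam)) + beta * c)
```

The last two printed values agree because the code κ_β of (194) has the same expected length ln Z(β) + βc for every distribution in the preparation. This constancy is what makes the code robust and what identifies P_β as the MaxEnt distribution without any appeal to differentiation.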

As already mentioned, the literature very often solves MaxEnt-problems by the introduction of Lagrange multipliers. As shown, this is not necessary. An intrinsic approach building on the abstract theory of Section 2 appears preferable. For one thing, the fact that you obtain a maximum for the entropy function (and not just a stationary point) is automatic—it is all hidden in the fundamental inequality. For another, the quantities you work with when appealing to the abstract theory, have natural interpretations.

3.8. Determining D-Projections

The setting is basically the same as in the previous section; in particular, we again consider a preparation

P

given by (191). The problem we shall consider is how to update a given prior

Q_{0} \in M_{+}^{1} (A)

. Then, the triple

(Φ, H, D)

given by (185) is no longer relevant but should be replaced by the triple

(U_{| Q_{0}}, D^{Q_{0}}, D)

as defined in Section 2.8, cf., (21). This makes good sense if

D^{Q_{0}}

is finite on

P

. The update we seek is the D-projection of

Q_{0}

on

P

as defined in Section 2.13 in connection with (66).

We shall apply much the same strategy as in the previous section. However, we choose not to introduce response and an action space in this setting (this can be done with controls consisting of code improvements which are code length functions measured relative to the code

κ_{0}

associated with

Q_{0}

). Instead, we work directly in the Y-domain and seek a representation of

P

as a strict feasible preparation of genus 1, now to be understood with respect to

U_{| Q_{0}}

. Analyzing what this amounts to, we find that if the partition function, now defined by

Z (β) = \sum_{a \in A} Q_{0} (a) \exp (- β λ (a)),

(196)

converges for some

β < \infty

, a representation as required is indeed possible. Assuming that this is the case we realize that for each

β

with

Z (β) < \infty

, the distribution

Q_{β}

defined by

Q_{β} (a) = \frac{Q_{0} (a) \exp (- β λ (a))}{Z (β)} for a \in A

(197)

is a member of the core of

P

. Then it is a matter of adjusting

β

such that

Q_{β}

is consistent, and we have found the sought update.
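The updating strategy can be sketched in the same way as the MaxEnt scheme. The following is a minimal illustration, assuming a finite alphabet, a strictly positive prior and a bisection search for β; the names and numbers are hypothetical. It tilts the prior according to (196) and (197) and checks numerically that the resulting update behaves as a D-projection.

```python
import math

def partition(q0, beta, lam):
    """Partition function (196) relative to the prior Q_0."""
    return sum(qa * math.exp(-beta * la) for qa, la in zip(q0, lam))

def tilt(q0, beta, lam):
    """Distribution Q_beta of (197)."""
    z = partition(q0, beta, lam)
    return [qa * math.exp(-beta * la) / z for qa, la in zip(q0, lam)]

def mean(p, lam):
    return sum(pa * la for pa, la in zip(p, lam))

def kl(p, q):
    """Kullback-Leibler divergence D(P, Q); assumes Q(a) > 0 where P(a) > 0."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def solve_beta(q0, lam, c, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection for beta such that Q_beta satisfies the constraint (191).

    Assumption: the solution lies in [lo, hi]; the mean of Q_beta is
    strictly decreasing in beta when lam is not constant.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(tilt(q0, mid, lam), lam) > c:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Hypothetical prior and constraint.
q0 = [0.1, 0.2, 0.3, 0.4]
lam = [0.0, 1.0, 2.0, 3.0]
c = 1.0

beta = solve_beta(q0, lam, c)
update = tilt(q0, beta, lam)

# Another member of the preparation, used for comparison.
p = [0.25, 0.5, 0.25, 0.0]
print("D(update, prior)         =", kl(update, q0))
print("D(p, prior)              =", kl(p, q0))
print("D(p, update) + D(update, prior) =", kl(p, update) + kl(update, q0))
```

For a preparation given by a linear constraint such as (191), the last two printed values coincide, which is the Pythagorean identity in the case of equality.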

The cancellation that takes place from (20) to (21) allows an extension of the discussion of updating from the discrete setting to a setting based on a general measurable space. For instance, one may consider a measurable space provided with a

σ

-finite reference measure

μ

and then work with distributions that have densities with respect to

μ

. As is well known, cf., also Section 3.1, the definition of Kullback-Leibler divergence makes good sense in the more general setting. Thus updating problems can be formulated quite generally. If the prior has density

q_{0}

, the partition function one should work with is given by

Z (β) = \int \exp (- β λ) q_{0} d μ

. Strategies for updating may be formulated much in analogy with the strategies of Section 3.7. Further details and consideration of concrete examples are left to the interested reader.

4. Conclusions

The theory presented provides a general abstract framework for the treatment of a wide range of optimization problems within geometry, statistics, statistical physics and other disciplines. Looking back, considering the methods applied and the demonstrated wide applicability, two factors seem to be essential, the type of modeling and affinity. Regarding the modeling, the key focus was on our information triples involving three interrelated quantities, effort, entropy and divergence—dually, utility, max-utility and divergence—each one being in itself of great significance and seen together playing distinct well-defined roles.

Regarding the focus on affinity, it is true that for the basic theoretical results this is not necessary. However, for almost every successful concrete application, affinity seems to pop up and appears both as a necessity and as a guarantee of success. There is something fundamental about this—possibly rooted in deep facts concerning the essential nature of observation, description and measuring.

On the theoretical side, one should note the emphasis placed on Jensen-Shannon divergence.

The game theoretical approach expressing the “man/system” or, as here, the “Observer/Nature” interface has played a major role. It has led to minimax and maximin problems. Adding convexity, it is an empirical fact that interesting and tractable optimization problems of this nature concern either a minimax or a maximin problem for which the first optimization is easy to solve. This aspect is also present in our modeling through the linking identity and the fundamental inequality. Thus, for fixed second argument, minimal effort in our basic models is a quantity given by assumptions made and called entropy.

The extensive appeal to loose, sometimes speculative philosophical considerations is another pronounced feature of the exposition. This is intended as a guide to sensible model building and may also catalyze the consideration of meaningful applications to look into.

Other attempts to build quite general theories in this area of science include Jaynes [9], Csiszár and Matús [67], Amari and Nagaoka [10] and then the recent work of Pavon and Ferrante [102]. In the latter we find a focus on the same kind of issues as we have promoted, simplicity of modeling and affinity. With simplicity also, as here, pointing to the unnecessary appeal to techniques involving Lagrange multipliers. The base for the modeling of Pavon and Ferrante is geometry via a lemma of geometric orthogonality. So, as “models of the world” these authors, as well as Amari and Nagaoka and their followers take geometry, whereas we take a more “social” approach via game theory, emphasizing man’s role in the world.

We believe that the approach presented here is technically the more elementary one.

Along the way, our approach gave rise to a few points worth emphasizing. A modeling of what can be known (Section 2.9) appears to be a useful concept. The suggested weak notions of properness in Section 2.11 are new whereas the material in Appendix A, which serves as a partial justification, may well be common knowledge. The notion of deformation introduced in Section 2.5 and its role in the discussion of Tsallis entropy in Section 3.6 has been announced before but is here given a fuller treatment, also incorporating a Bregman construction in Section 3.1. Regarding the discussion of Tsallis entropy also note the emphasis on the descriptors

κ_{q}

.

Many issues are left for further discussion and consolidation of the theory. Some of the possibilities are indicated in the text. Others involve a look at sufficiency, duality, mutual information, learning theory and more. Much of this appears feasible. However, there is an important area where we do not see that our approach and results provide any clue, viz., quantum information theory. Let this challenge to the reader be the last word for now.

Acknowledgments

The author has worked with problems related to the material here presented for many years. However, the realization that many of the methods applied work in far more general situations than intended originally only matured slowly from around 2006. The author is thankful to organizers of workshops and conferences where he has presented aspects of the ideas. Thanks are due to Ardon Lyon for discussions of many of the philosophical considerations in Section 2 of the manuscript, to Bjarne Andresen for introducing me to Tsallis entropy, to Philip Dawid for discussions at workshops and assistance regarding references, further to Peter Harremoës for collaboration and discussions at workshops and elsewhere over many years. Jop Briët provided me with the reference to [103]. The guidelines received from both reviewers led to significant improvements, especially concerning the presentation of the material. We also acknowledge advice given and work done by the guest editor, Geert Verdoolaege. Jan Caesar helped with all technical issues, including production of the figures. Finally, a stipend from the San Cataldo Foundation, December 2012, allowed the author to start collecting the material in a comprehensive and coherent form under ideal conditions at the former nunnery at the Amalfi coast, now owned by the Foundation.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Notions of Properness

This appendix serves as motivation for the introduction of weak notions of properness. Arguments presented are elementary and there may well be references to previous relevant work.

Three considerations underlie the refinements of this section.

Firstly, as already noted, MaxEnt problems can often be tackled without recourse to techniques involving differentiation. This is not a new observation, see e.g., Csiszár [63], Topsøe [26], Campenhout and Cover [104] and the recent work by Pavon and Ferrante [102]. In contrast to this, the many examples contained in Kapur [100] build excessively on differentiation techniques.

Secondly, have a look at Figure 2. Really, what is dominating is the curve h and the straight line, w. The belief instance u is not that prominent. More so the line it determines. True, this is the tangent at

(u, h (u))

, determined by differentiation but what is essential is that it dominates h, a feature ensured by concavity. Domination by control appears as the right focus.

Thirdly, also non-concave functions can of course have maxima. Therefore, avoiding differentiation, there may be no need for the convenience of the assumed concavity of the generator.

Motivated by these considerations we embark on the intended refinement. We shall work in the subset

I \times R

of the plane

R^{2}

with I an interval. It simplifies matters if I is open and this will be assumed until further notice. For linear functions on I we use the bracket notation as in

〈 s, w 〉 = α + β s; s \in I .

(A1)

A linear function w is identified with its graph, which could be any non-vertical line. For a point

Q \in I \times R

we talk about points to the left of Q (to the right of Q) as points left of (right of) the vertical line through Q.

We shall work with some special sets, called butterfly sets. Such a set is characterized by two linear functions

w^{-}

and

w^{+}

, the boundary lines, and a point

Q \in w^{-} \cap w^{+}

, the crossing point. This terminology is also applied if

w^{-}

and

w^{+}

coincide. The butterfly set determined by

(w^{-}, w^{+})

and with crossing point Q is the set

B (w^{-}, w^{+} | Q)

of points

(s, t) \in I \times R

, “squeezed in” between the boundary lines:

B (w^{-}, w^{+} | Q) = {(s, t) | \min (〈 s, w^{-} 〉, 〈 s, w^{+} 〉) \leq t \leq \max (〈 s, w^{-} 〉, 〈 s, w^{+} 〉)} .

(A2)

In the notation for butterfly sets it is assumed that either

w^{-} = w^{+}

or else

w^{-}

is below

w^{+}

to the left of Q and above

w^{+}

to the right of Q. If

w^{-} = w^{+}

, the butterfly set is thin. Otherwise it is fat.

We shall consider a generalized generator which is just any real-valued function h defined on I. For our standard modeling, I will be the state space as well as the belief reservoir:

X = Y = I

. Moreover, a control, here a control line, is any linear function w which dominates h, i.e.,

h (s) \leq 〈 s, w 〉

for

s \in I

. The set of controls is denoted W (rather than

\hat{Y}

). We assume that

W \neq \emptyset

. Visibility and controllability are the diffuse relations on

X \times Y

, respectively

X \times \hat{Y}

.

The key lemma is the following geometry-based result. We shall not write out all details of the proof. This is standard routine. You should observe that both parts of the result are existence statements which do not have purely constructive proofs. The proof is based only on the most basic elements of the infinitesimal calculus via appeal to statements about existence of suprema and infima of sets of real numbers.

Lemma A1.

(i) With assumptions as stated (I open,

W \neq \emptyset

), there exists a function

\bar{h}

on I such that every point on the graph of

\bar{h}

lies on some control line and such that this property applies to no point below this graph.

(ii) Further, for every

u \in I

there exists a butterfly set

B_{u} = B (w_{u}^{-}, w_{u}^{+} | Q_{u})

with

Q_{u} = (u, \bar{h} (u))

as crossing point such that the set of control lines which pass through

Q_{u}

is identical with the set of control lines contained in

B_{u}

.

Proof.

Property (i) is trivial. One simply defines

\bar{h}

by

\bar{h} (s) = \inf {t | \exists w \in W : 〈 s, w 〉 = t}

(A3)

for

s \in I

. As you will realize,

\bar{h}

is the concave envelope of h. Automatically, this function is upper semi-continuous.

As to (ii), we shall outline one way to the proof. Let

u \in I

and have a look at Figure A1. There,

P = (u, h (u))

and

Q = (u, \bar{h} (u))

. For every pair of points

(P^{-}, P^{+})

on the graph of h, with

P^{-}

to the left and

P^{+}

to the right of Q, the set T, understood to be open, which lies above the butterfly set in the figure, does not contain any point from the graph of h. Clearly, the union of all sets T which can be constructed in this way, call it

T_{u}

, is the set above two, possibly coinciding control lines

w_{u}^{-}

and

w_{u}^{+}

which constitute the boundary of

T_{u}

. The set

B (w_{u}^{-}, w_{u}^{+} | Q)

is the butterfly set

B_{u}

we were looking for. ☐

Figure A1. For the proof of Lemma A1.
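The definition (A3) can be explored numerically. The sketch below is only a discretized approximation and rests on assumptions not made in the text: the generator is sampled on a finite grid (so, effectively, on a closed interval), and the envelope is obtained as the upper convex hull of the sampled graph by a standard monotone-chain scan, which is a swapped-in technique and not the construction used in the proof. All function names are hypothetical.

```python
def upper_hull(points):
    """Upper convex hull of points sorted by increasing first coordinate."""
    hull = []
    for p in points:
        while len(hull) >= 2:
            (ox, oy), (ax, ay) = hull[-2], hull[-1]
            cross = (ax - ox) * (p[1] - oy) - (ay - oy) * (p[0] - ox)
            if cross >= 0:
                hull.pop()  # the middle point lies on or below the chord
            else:
                break
        hull.append(p)
    return hull

def concave_envelope(xs, hs):
    """Values on the grid xs of the least concave majorant of the samples hs.

    Assumption: xs is strictly increasing and has at least two points.
    """
    hull = upper_hull(list(zip(xs, hs)))
    env, k = [], 0
    for x in xs:
        # Move to the hull segment [hull[k], hull[k + 1]] that contains x.
        while k < len(hull) - 2 and hull[k + 1][0] <= x:
            k += 1
        (x0, y0), (x1, y1) = hull[k], hull[k + 1]
        env.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return env

# Hypothetical non-concave generator h(s) = |s| sampled on [-1, 1].
grid = [i / 10 - 1 for i in range(21)]
h_vals = [abs(s) for s in grid]
h_bar = concave_envelope(grid, h_vals)

for s, h, hb in zip(grid, h_vals, h_bar):
    print(f"s = {s:+.1f}   h = {h:.3f}   h_bar = {hb:.3f}")
```

Each segment of the computed hull extends to a line dominating all samples, so it plays the role of a control line for the sampled generator; at a vertex of the hull, two such lines meet and bound a fat butterfly set.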

With the lemma in place, we can define response as a point map from I into W. The map will not be surjective and, depending on h, possibly not injective either. To define the map, let

u \in Y

be a belief instance and consider the butterfly set

B_{u} = B (w_{u}^{-}, w_{u}^{+} | Q_{u})

. As response of u we take

w_{u}^{+} = w_{u}^{-}

if these control lines coincide. If the horizontal line through

Q_{u}

is contained in

B_{u}

, we take this control line as response. In the remaining cases we take as response that control line

\hat{u}

among

w_{u}^{+}

and

w_{u}^{-}

which, numerically, has the smallest slope.

The above construction defines

\hat{u} \in W

uniquely. When

B_{u}

is thin, there is only one control line to choose from, whereas when

B_{u}

is fat, we made a specific choice so as to minimize the risk. The control lines constructed this way are called minimal-risk controls. As to the nature of the result, one may note that it involves global rather than local considerations as would be involved in an approach via differentiation.

The following obvious corollary is a replacement of a classical basic result on maxima of functions based on differentiation.

Corollary A1.

Let

I \subseteq R

be an open interval and h a real function defined on I which is dominated by a real line.

A necessary and sufficient condition that h has a maximum in I is that for some point

u \in I

, the butterfly set

B_{u}

contains a horizontal line, necessarily

w_{u}

, and that

h (u) = \bar{h} (u)

. Assume that these conditions are fulfilled for some point

u \in I

. Then u is a maximum point of h and a necessary and sufficient condition that u is the unique maximum point of h in I is that

w = \hat{u}

intersects no other point on the graph of h than the point

Q_{u} = (u, h (u))

.

As to the various possibilities for the type of

B_{u}

—fat or thin—and for

\bar{h}

in relation to h, we note the following:

Lemma A2.

Let

u \in I

.

(i): If $B_{u}$ is fat, then $\bar{h} (u) = \limsup_{v \to u} h (v)$.
(ii): If h is upper semi-continuous, in particular if h is continuous, and if $\bar{h} (u) > h (u)$, then $B_{u}$ is thin.

Proof.

(i) follows by noting that if

w_{u}^{-} \neq w_{u}^{+}

, then no line segment connecting a point on

w_{u}^{-}

to the left of

Q_{u}

with a point on

w_{u}^{+}

to the right of

Q_{u}

can dominate the relevant part of h since then the prolongation of the line segment would dominate h for all arguments in I, clearly contradicting the definition of

\bar{h} (u)

.

Part (ii) is an easy consequence. ☐

The cases depicted in Figure A2 illustrate some possibilities for the location of the possible butterfly sets in relation to

\bar{h}

.

Figure A2. Examples of generators and butterfly sets; control lines as given by response are shown in red.

Our construction allows us to define a pretty natural information triple associated with any generalized generator. We simply define

\hat{ϕ}

and

\hat{d}

for

(s, w) \in I \times W

by

\begin{matrix} \hat{ϕ} (s, w) & = 〈 s, w 〉, \end{matrix}

(A4)

\begin{matrix} \hat{d} (s, w) & = 〈 s, w 〉 - \bar{h} (s) \end{matrix}

(A5)

and can then assert as follows:

Theorem A1.

With the definitions (A4) and (A5),

(\hat{ϕ}, \bar{h}, \hat{d})

is a

Q_{2}

-proper effort-based information triple over

I \times W

. The triple has affine marginals

{\hat{ϕ}}^{w}

.

With the thorough preparations, this is evident.

If

\bar{h} = h

, i.e., if h is concave, our construction has some merits over the standard Bregman construction as smoothness is not required.

Regarding the assumption that I is open, this can be dispensed with at the cost of some comments on degenerate control lines, lines which really only give control at one of the endpoints. This may be formulated by allowing infinite values for the controls or one may focus on decompositions of

I \times R

into two convex sets. We leave it to the reader to work this out (and to modify the proof of Lemma A1 accordingly, working separately to the left of Q and to the right of Q).

As a trivial but illuminating example when working with a closed rather than an open interval we take

I = [0, 1]

and as generator consider the identity map h on I. Then h itself is a control and we realize that

h = \hat{u}

for all u with

0 \leq u < 1

whereas the constant control

w_{1}

given by

〈 s, w_{1} 〉 = 1

for all

s \in I

is the response to

u = 1

. You realize that with this generator, the associated information triple is not

Q_{2}

-proper, but it is

Q_{3}

-proper and it also satisfies the other property demanded of what we called standard properness, viz., that the optimal control is robust.

Among issues and further possibilities depending on the construction in this appendix we point to a few:

Clearly, one may “change sign” and discuss utility-based systems. This involves notions of support lines and minimal-risk supports.

Then, just as with standard Bregman constructions, one should deal with the more involved geometric complications when functions over (convex) areas in finite dimensional Euclidean spaces are involved.

One may replace h, first with its graph (in fact done), but further with any subset of

I \times R

. More generally, you may consider subsets G of a separable Hilbert space provided with a hyperplane

π

and a choice of direction orthogonal to

π

. The hyperplane is a replacement for the abscissa axis of our discussion and the direction a replacement for the ordinate axis. For such systems, height over the hyperplane will be a replacement for function values.

It does appear quite natural to allow continuous and concave, but not necessarily smooth generators. For instance, you may consider the generator

h (s) = 1 - | s |

on

I = [- 1, 1]

. In that case, it is easy to find examples to demonstrate that this generator is not

M P (3)

-negative definite. Then, according to Proposition 11, the Jensen-Shannon divergence jsd associated with this generator is not the square of a metric. Elaborating a bit on this in a pretty natural manner, one finds that:

Proposition A1.

No Jensen-Shannon divergence constructed from a generator with bends can be the square of a metric.

Though not that surprising, this result supports the view that the attractive cases when Jensen-Shannon divergence is in fact a squared metric (perhaps even related via embedding to a squared Hilbert metric) require a strong degree of smoothness for an underlying generator.

Appendix B. Protection against Misinformation

We present a possible variation of the interpretations emphasized in Section 2 of our study. This involves a theme which has been important for the development of the notion of proper score functions. For this appendix,

X = Y

is assumed.

In a sense, what we shall discuss here is what happens if Nature can communicate. Then we speak instead about Expert. Moreover, Observer becomes Customer. Expert holds the truth, x, or rather, x represents Expert’s best evaluation of what the truth is. Customer wants to know what Expert thinks about a certain situation and asks Expert for advice, against a payment to be agreed upon. For despicable reasons, Expert may be tempted to advise against his better knowledge, i.e., to give as advice y, instead of the honest advice x. Misinformation could either be due to the difficulty Expert may have in reaching a true expert opinion or it could be out of self-interest, with Expert taking advantage of false information given to Customer. Or Expert may try to mislead Customer in order to hide a business secret.

We assume that truth will be revealed to both Expert and Customer soon after Expert has given advice to Customer and further, that a proper effort function

Φ = Φ (x, y)

is known to both Expert and Customer. We shall devise a payment scheme which will protect Customer against misinformation. The idea is simple. At the time of signing a contract (before advice is given) Customer pays a flat sum to Expert and further, Expert and Customer agree on an insurance scheme stipulating a penalty to be paid by Expert to Customer proportional to

Φ (x^{*}, y)

where

x^{*}

represents what really happened and y is the advice given. If Expert is confident that he knows what will happen, he will assume that

x^{*} = x

will hold and it will be in his own interest to give to Customer the honest advice

y = x

.
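The incentive built into the scheme can be illustrated with a small simulation. This is a sketch under stated assumptions: states and advice are probability distributions over a three-element alphabet, the proper effort function is taken to be Kerridge inaccuracy (185), the penalty is computed in expectation under Expert's own belief x, and the proportionality constant is set to 1. None of these concrete choices is prescribed by the text.

```python
import math
import random

def penalty(x, y):
    """Penalty Phi(x, y) = sum_a x(a) ln(1/y(a)), with x standing for x*.

    Assumption: y(a) > 0 wherever x(a) > 0.
    """
    return sum(xa * math.log(1.0 / ya) for xa, ya in zip(x, y) if xa > 0)

def random_distribution(n, rng):
    """A random strictly positive probability vector of length n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)   # fixed seed, for reproducibility only
x = [0.7, 0.2, 0.1]      # Expert's honest evaluation (hypothetical values)

honest = penalty(x, x)   # equals the entropy of x
dishonest = [penalty(x, random_distribution(len(x), rng)) for _ in range(1000)]

print("penalty for honest advice      :", honest)
print("smallest penalty otherwise seen:", min(dishonest))
```

Properness of the effort function is exactly the statement that no advice y other than x yields a smaller penalty, so the simulation should never produce a dishonest advice that beats the honest one.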

In the literature this scheme is mainly considered based on a proper score function, the same as a proper utility function. This gives an obvious variation of the payment scheme with the score function determining payment from Customer to Expert. The most often treated situation is probably that of weather forecasting with Brier [42] the first and Weijs and Giesen [105] a recent contribution. However, also situations from economics and statistics have been studied frequently. Apart from sources just cited we refer to the sources pointed to in Section 2.6 and to McCarthy [106] as well as to Chambers [107]. As a final reference we point to Hilden [108] where applications to diagnostics are discussed.

Works cited and their references will reveal a rich literature. With access to our abstract modeling, further meaningful applications, not necessarily tied to probabilistic modeling may emerge.

Appendix C. Cause and Effect

We present one further possible variation of the interpretations emphasized in Section 2 of our study. We assume that

Y = X

and put

W = \hat{X}

. Elements of X are now interpreted as causes and response, considered as a map defined on X, as the transformation of a cause into its associated consequence. This change moves the focus from Observer’s thoughts as discussed in Section 2.3 to a reflection of causality in Nature. The set-up is in this way conceived as a model of cause and effect.

Previously we considered possible choices of Observer in

γ

- or

\hat{γ}

-type games. Now it is more pertinent to focus on consequences—elements of W—as possible observations by Observer of the effect of the actual cause. For

x \in X

and

w \in W

,

\hat{Φ} (x, w)

is now interpreted as the cost to Observer if he has observed (or believes he has observed) the effect w when the actual cause is x.

Consider the game

\hat{γ}

, say with preparation

P = X

. With the new interpretation in mind it appears particularly pertinent to consider Observer’s risk associated with the various possible observations.

Concrete situations where the change of interpretation makes sense, involve information theoretical problems of capacity.

Appendix D. Negative Definite Kernels and Squared Metrics

The result needed for the proof of Proposition 11 is a simple fact leading up to a group of rather deep results, see e.g., Chapter 6 of Deza and Laurent [103] (note that there, negative definiteness is referred to as being of negative type). For the convenience of the reader we present a simple direct proof of the needed more primitive result:

Proposition A2.

Let X be an abstract set and

D : X \times X \to R

a “kernel” which is sound (

D (x, x) = 0

), proper (

D (x, y) > 0

if

y \neq x

) and symmetric (

D (x, y) = D (y, x)

). Then D is a squared metric if and only if D is negative definite over three-element sets, i.e., if and only if, for any scalars

c_{1}, c_{2}, c_{3}

with

c_{1} + c_{2} + c_{3} = 0

and any set

x_{1}, x_{2}, x_{3},

of elements in X, the sum

S = \sum_{i, j} c_{i} c_{j} D (x_{i}, x_{j})

is non-positive.

Proof.

“only if”: Assume that

D = d^{2}

with d a metric on X. With

(c_{1}, c_{2}, c_{3}) = (- 1, t, 1 - t)

and

α = d (x_{1}, x_{2}), β = d (x_{2}, x_{3})

and

γ = d (x_{3}, x_{1})

, one finds that

S = - 2 (β^{2} t^{2} + (α^{2} - β^{2} - γ^{2}) t + γ^{2})

. The second order polynomial in the parenthesis has the discriminant

{(α^{2} - β^{2} - γ^{2})}^{2} - 4 β^{2} γ^{2}

which is non-positive as

| α^{2} - β^{2} - γ^{2} | \leq 2 β γ

(consider separately the cases

α^{2} - β^{2} - γ^{2} \geq 0

and

α^{2} - β^{2} - γ^{2} < 0

). Thus S is non-positive.

“if”: With

x_{1}, x_{2}, x_{3}

given, put

α = \sqrt{D (x_{1}, x_{2})}, β = \sqrt{D (x_{2}, x_{3})}

and

γ = \sqrt{D (x_{3}, x_{1})}

. As the sum S is non-positive with scalars of the form

- 1, t, 1 - t

we find from previous calculations that

| α^{2} - β^{2} - γ^{2} | \leq 2 β γ

from which the desired triangle inequality

α \leq β + γ

follows. ☐
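The calculation in the proof is easy to check numerically. The following sketch, with hypothetical helper names and test values, evaluates the sum S directly from a kernel and compares it with the closed form used above for scalars (-1, t, 1 - t). It uses one kernel which is a squared Euclidean metric and one which violates the triangle inequality.

```python
import math

def quadratic_form(D, c):
    """S = sum_{i,j} c_i c_j D(x_i, x_j) for a kernel given as a matrix."""
    n = len(c)
    return sum(c[i] * c[j] * D[i][j] for i in range(n) for j in range(n))

def closed_form(alpha, beta, gamma, t):
    """S for scalars (-1, t, 1 - t), as computed in the proof."""
    return -2.0 * (beta**2 * t**2
                   + (alpha**2 - beta**2 - gamma**2) * t
                   + gamma**2)

def squared_distance_kernel(points):
    """Kernel D = d^2 with d the Euclidean metric on the given points."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]

# A genuine squared metric: three points in the plane.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
D_metric = squared_distance_kernel(points)
alpha = math.sqrt(D_metric[0][1])
beta = math.sqrt(D_metric[1][2])
gamma = math.sqrt(D_metric[2][0])

for t in [-2.0, -1.0, -0.5, 0.0, 0.3, 1.0, 2.0]:
    s = quadratic_form(D_metric, (-1.0, t, 1.0 - t))
    print(f"t = {t:+.1f}   S = {s:.4f}   closed form = {closed_form(alpha, beta, gamma, t):.4f}")

# A sound, proper and symmetric kernel whose square root violates the
# triangle inequality (10 > 1 + 1): here S becomes positive for t = -1.
D_bad = [[0.0, 100.0, 1.0],
         [100.0, 0.0, 1.0],
         [1.0, 1.0, 0.0]]
print("S for the non-metric kernel:", quadratic_form(D_bad, (-1.0, -1.0, 2.0)))
```

Scalars summing to zero with a non-zero first entry can always be rescaled to the form (-1, t, 1 - t), and the remaining cases follow by relabeling the points, which is why the one-parameter family suffices in the proof.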

References

Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656. [Google Scholar] [CrossRef]
Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. [Google Scholar] [CrossRef]
Tsallis, C. Introduction to Nonextensive Statistical Mechanics; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
Gross, D. Comment on: “Nonextensivity: From low-dimensional maps to Hamiltonian systems” by Tsallis et al. arXiv, 2002; arXiv:cond-mat/0210448. [Google Scholar]
Ingarden, R.S.; Urbanik, K. Information without probability. Colloq. Math. 1962, 9, 131–150. [Google Scholar]
Kolmogorov, A.N. Logical basis for information theory and probability theory. IEEE Trans. Inf. Theory 1968, 14, 662–664. [Google Scholar] [CrossRef]
Kolmogorov, A.N. Combinatorial foundations of information theory and the calculus of probabilities. Russ. Math. Surv. 1983, 38, 29–40. [Google Scholar] [CrossRef]
de Fériet, K. La theorie génerélisée de l’information. In Théories de L’information (Colloq. Iiformation et Questionnaires, Marseille-Luminy, 1973); Springer: Berlin, Germany, 1974; pp. 1–35. (In French) [Google Scholar]
Jaynes, E.T. Probability Theory—The Logic of Science; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
Amari, S.; Nagaoka, H. Methods of Information Geometry; Translations of Mathematical Monographs. 191; American Mathematical Society, Oxford University Press: New York, NY, USA, 1985. [Google Scholar]
Anthonis, B. Extension of Information Geometry for Modelling Non-Statistical Systems. Ph.D. Thesis, Universiteit Antwerpen, Antwerp, Belgium, 2014. [Google Scholar]
Rathmanner, S.; Hutter, M. A Philosophical Treatise of Universal Induction. Entropy 2011, 13, 1076–1136. [Google Scholar] [CrossRef]
Barron, A.; Rissanen, J.; Yu, B. The Minimum Description Length Principle in Coding and Modeling. IEEE Trans. Inf. Theory 1998, 44, 2743–2760. [Google Scholar] [CrossRef]
Grünwald, P.D. The Minimum Description Length Principle; MIT Press: Cambridge, MA, USA, 2007. [Google Scholar]
Jumarie, G. Maximum Entropy, Information without Probability and Complex Fractals—Classical and Quantum Approach; Kluwer: Dordrecht, The Netherlands, 2000. [Google Scholar]
Shafer, G.; Vovk, V. Probability and Finance. It’s Only a Game! Wiley: Chichester, UK, 2001. [Google Scholar]
Gernert, D. Pragmatic Information: Historical Exposition and General Overview. Mind Matter 2006, 4, 141–167. [Google Scholar]
Bundesen, C.; Habekost, T. Principles of Visual Attention; Oxford University Press: Oxford, UK, 2008. [Google Scholar]
Benedetti, F. Placebo Effects. Understanding the Mechanisms in Health and Disease; Oxford University Press: Oxford, UK, 2009. [Google Scholar]
Brier, S. Cybersemiotics: An Evolutionary World View Going Beyond Entropy and Information into the Question of Meaning. Entropy 2010, 12, 1902–1920. [Google Scholar] [CrossRef]
Van Benthem, J.; Adriaans, P. (Eds.) Handbook on the Philosophy of Information; Handbook of the Philosophy of Science; Elsivier: Amsterdam, The Netherlands, 2007; Volume 8. [Google Scholar]
Adriaans, P. Information. Stanford Encyclopedia of Philosophy, 2012; p. 43. Available online: http://plato.stanford.edu/archives/fall2013/entries/information/ (accessed on 26 March 2017).
Brier, S. Cybersemiotics: Why Information Is Not Enough; Toronto University Press: Toronto, ON, Canada, 2008. [Google Scholar]
Topsøe, F. Game Theoretical Equilibrium, Maximum Entropy and Minimum Information Discrimination. In Maximum Entropy and Bayesian Methods; Mohammad-Djafari, A., Demoments, G., Eds.; Kluwer Academic Publishers: Dordrecht, The Netherlands; Boston, MA, USA; London, UK, 1993; pp. 15–23. [Google Scholar]
Pfaffelhuber, E. Minimax Information Gain and Minimum Discrimination Principle. In Topics in Information Theory, Proceedings of the Colloquia Mathematica Societatis János Bolyai, Oberwolfach, Germany, 13–23 April 1977; Csiszár, I., Elias, P., Eds.; János Bolyai Mathematical Society and North-Holland: Amsterdam, The Netherlands; Oxford, UK; New York, NY, USA, 1977; Volume 16, pp. 493–519. [Google Scholar]
Topsøe, F. Information Theoretical Optimization Techniques. Kybernetika 1979, 15, 8–27. [Google Scholar]
Harremoës, P.; Topsøe, F. Maximum Entropy Fundamentals. Entropy 2001, 3, 191–226. [Google Scholar] [CrossRef]
Grünwald, P.D.; Dawid, A.P. Game Theory, Maximum Entropy, Minimum Discrepancy, and Robust Bayesian Decision Theory. Ann. Stat. 2004, 32, 1367–1433. [Google Scholar]
Friedman, C.; Huang, J.; Sandow, S. A Utility-Based Approach to Some Information Measures. Entropy 2007, 9, 1–26. [Google Scholar] [CrossRef]
Dayi, H. Game Analyzing based on Strategic Entropy. Chin. J. Manag. Sci. 2009, 17, 133–138. (In Chinese) [Google Scholar]
Harremoës, P.; Topsøe, F. The Quantitative Theory of Information. In Handbook on the Philosophy of Information; van Benthem, J., Adriaans, P., Eds.; Handbook of the Philosophy of Science; Elsevier: Amsterdam, The Netherlands, 2008; Volume 8, pp. 171–216. [Google Scholar]
Aubin, J.P. Optima and Equilibria. An Introduction to Nonlinear Analysis; Springer: Berlin, Germany, 1993. [Google Scholar]
Cesa-Bianchi, N.; Lugosi, G. Prediction, Learning and Games; Cambridge University Press: Cambridge, UK, 2006. [Google Scholar]
Topsøe, F. Interaction between Truth and Belief as the key to entropy and other quantities of statistical physics. arXiv, 2008; arXiv:0807.4337v1. [Google Scholar]
Topsøe, F. Truth, Belief and Experience—A route to Information. J. Contemp. Math. Anal. Armen. Acad. Sci. 2009, 44, 105–110. [Google Scholar] [CrossRef]
Topsøe, F. On truth, belief and knowledge. In Proceedings of the 2009 IEEE International Symposium on Information Theory, Seoul, Korea, 28 June–3 July 2009; pp. 139–143. [Google Scholar]
Topsøe, F. Towards operational interpretations of generalized entropies. J. Phys. Conf. Ser. 2010, 201, 15. [Google Scholar] [CrossRef]
Topsøe, F. Elements of the Cognitive Universe. Available online: http://www.math.ku.dk/~topsoe/isit2011.pdf (accessed on 31 March 2017).
Wikipedia. Bayesian Probability—Wikipedia, The Free Encyclopedia. 2009. Available online: https://en.wikipedia.org/wiki/Bayesian_Probability (accessed on 31 January 2011).
Good, I.J. Rational Decisions. J. R. Stat. Soc. Ser. B 1952, 14, 107–114. [Google Scholar]
Csiszár, I. Axiomatic Characterizations of Information Measures. Entropy 2008, 10, 261–273. [Google Scholar] [CrossRef]
Brier, G.W. Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 1950, 78, 1–3. [Google Scholar] [CrossRef]
Savage, L.J. Elicitation of Personal Probabilities and Expectations. J. Am. Stat. Assoc. 1971, 66, 783–801. [Google Scholar] [CrossRef]
Fischer, P. On the Inequality ∑ p_if(p_i) ≥ ∑ p_if(q_i). Metrika 1972, 18, 199–208. [Google Scholar] [CrossRef]
Gneiting, T.; Raftery, A.E. Strictly Proper Scoring Rules, Prediction, and Estimation. J. Am. Stat. Assoc. 2007, 102, 359–378. [Google Scholar] [CrossRef]
Dawid, A.P.; Lauritzen, S.L. The geometry of decision theory. In Proceedings of the Second International Symposium on Information Geometry and its Applications, Tokyo, Japan, 12–16 December 2006; pp. 22–28. [Google Scholar]
Dawid, A.P.; Musio, M. Theory and Applications of Proper Scoring Rules. Metron 2014, 72, 169–183. [Google Scholar] [CrossRef]
Dawid, A.P.; Musio, M.; Ventura, L. Minimum Scoring Rule Inference. Scand. J. Stat. 2016, 43, 123–138. [Google Scholar]
Caticha, A. Information and Entropy. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering: 27th International Workshop on Bayesian Inference and Maximum Entropy Methods; American Institute of Physics Inc.: Woodbury, NY, USA, 2007; Volume 954, pp. 11–22. [Google Scholar]
Kerridge, D.F. Inaccuracy and inference. J. R. Stat. Soc. B 1961, 23, 184–194. [Google Scholar]
Kullback, S. Information Theory and Statistics; Wiley: New York, NY, USA, 1959. [Google Scholar]
Rubin, E. Om Forstaaelighedsreserven og om Overbestemthed. In Til Minde om Edgar Rubin; Nordisk Psykologisk Monografiserie NR. 8: Copenhagen, Denmark, 1956; pp. 28–37. (In Danish) [Google Scholar]
Rasmussen, E.T. Bemærkninger om E. Rubin’s “reserve-begreb”. In Til Minde om Edgar Rubin; Nordisk Psykologisk Monografiserie NR. 8: Copenhagen, Denmark, 1956; pp. 38–42. [Google Scholar]
Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
Topsøe, F. Game theoretical optimization inspired by information theory. J. Glob. Optim. 2009, 43, 553–564. [Google Scholar] [CrossRef]
Zeidler, E. Applied Functional Analysis: Applications to Mathematical Physics; Applied Mathematical Sciences; Springer: New York, NY, USA, 1995; Volume 108. [Google Scholar]
Zeidler, E. Applied Functional Analysis: Main Principles and Their Applications; Applied Mathematical Sciences; Springer: Berlin, Germany, 1995; Volume 109. [Google Scholar]
Von Neumann, J. Zur Theorie der Gesellschaftsspiele. Math. Ann. 1928, 100, 295–320. [Google Scholar] [CrossRef]
Von Neumann, J. Über ein ökonomisches Gleichungssystem und eine Verallgemeinerung des Brouwerschen Fixpunktsatzes. Ergeb. Math. Kolloqu. 1937, 8, 73–83. (In German) [Google Scholar]
Kjeldsen, T.H. John von Neumann’s Conception of the Minimax Theorem: A Journey Through Different Mathematical Contexts. Arch. Hist. Exact Sci. 2001, 56, 39–68. [Google Scholar] [CrossRef]
Kuic, D. Maximum information entropy principle and the interpretation of probabilities in statistical mechanics—A short review. Eur. Phys. J. B 2016, 89, 1–7. [Google Scholar] [CrossRef]
Topsøe, F. Exponential Families and MaxEnt Calculations for Entropy Measures of Statistical Physics. In Complexity, Metastability, and Non-Extensivity, CTNEXT07; AIP Conference Proceedings; American Institute of Physics: New York, NY, USA, 2007; Volume 965, pp. 104–113. [Google Scholar]
Csiszár, I. I-Divergence Geometry of Probability Distributions and Minimization Problems. Ann. Probab. 1975, 3, 146–158. [Google Scholar] [CrossRef]
Čencov, N.N. Statistical Decision Rules and Optimal Inference; Nauka: Moscow, Russia, 1972. (In Russian; English translation in Translations of Mathematical Monographs; American Mathematical Society: Providence, RI, USA, 1982) [Google Scholar]
Csiszár, I. Generalized projections for non-negative functions. Acta Math. Hung. 1995, 68, 161–185. [Google Scholar] [CrossRef]
Csiszár, I.; Matúš, F. Information projections revisited. IEEE Trans. Inf. Theory 2003, 49, 1474–1490. [Google Scholar] [CrossRef]
Csiszár, I.; Matúš, F. Generalized minimizers of convex integral functionals, Bregman distance, Pythagorean identities. Kybernetika 2012, 48, 637–689. [Google Scholar]
Glonti, O.; Harremoës, P.; Khechinashili, Z.; Topsøe, F. Nash Equilibrium in a Game of Calibration. Theory Probab. Appl. 2007, 51, 415–426. [Google Scholar] [CrossRef]
Topsøe, F. Basic Concepts, Identities and Inequalities—The Toolkit of Information Theory. Entropy 2001, 3, 162–190. [Google Scholar] [CrossRef]
Endres, D.M.; Schindelin, J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858–1860. [Google Scholar] [CrossRef]
Fuglede, B.; Topsøe, F. Jensen-Shannon Divergence and Hilbert space Embedding. In Proceedings of the 2004 International Symposium on Information Theory, Chicago, IL, USA, 27 June–2 July 2004; p. 31. [Google Scholar]
Briët, J.; Harremoës, P. Properties of Classical and Quantum Jensen-Shannon Divergence. Phys. Rev. A 2009, 79, 11. [Google Scholar] [CrossRef]
Kisyński, J. Convergence du type L. Colloq. Math. 1960, 7, 205–211. [Google Scholar]
Dudley, R. On Sequential Convergence. Trans. Am. Math. Soc. 1964, 112, 483–507. [Google Scholar] [CrossRef]
Steen, L.; Seebach, J. Counterexamples in Topology; Springer: Berlin, Germany, 1978. [Google Scholar]
Harremoës, P.; Topsøe, F. Zipf’s law, hyperbolic distributions and entropy loss. In Proceedings of the IEEE International Symposium on Information Theory, Lausanne, Switzerland, 30 June–5 July 2002; p. 207. [Google Scholar]
Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
Tsallis, C. What are the numbers that experiments provide? Quim. Nova 1994, 17, 468. [Google Scholar]
Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hung. Acad. 1963, 8, 95–108. [Google Scholar]
Morimoto, T. Markov processes and the H-theorem. J. Phys. Soc. Jpn. 1963, 18, 328–331. [Google Scholar] [CrossRef]
Ali, S.M.; Silvey, S.D. A General Class of Coefficients of Divergence of One Distribution from Another. J. R. Stat. Soc. Ser. B 1966, 28, 131–142. [Google Scholar]
Havrda, J.; Charvát, F. Quantification method of classification processes. Concept of structural α-entropy. Kybernetika 1967, 3, 30–35. [Google Scholar]
Daróczy, Z. Generalized Information Functions. Inf. Control 1970, 16, 36–51. [Google Scholar] [CrossRef]
Lindhard, J.; Nielsen, V. Studies in Statistical Dynamics. Det Kongelige Danske Videnskabernes Selskab Matematisk-Fysiske Meddelelser 1971, 38, 1–42. [Google Scholar]
Lindhard, J. On the Theory of Measurement and its Consequences in Statistical Dynamics. Det Kongelige Danske Videnskabernes Selskab Matematisk-Fysiske Meddelelser 1974, 39, 1–39. [Google Scholar]
Aczél, J.; Daróczy, Z. On Measures of Information and Their Characterizations; Academic Press: New York, NY, USA, 1975. [Google Scholar]
Ebanks, B.; Sahoo, P.; Sander, W. Characterizations of Information Measures; World Scientific: Singapore, 1998. [Google Scholar]
Jaynes, E.T. Where do we Stand on Maximum Entropy? In The Maximum Entropy Formalism; Levine, R., Tribus, M., Eds.; MIT Press: Cambridge, MA, USA, 1979; pp. 1–104. [Google Scholar]
Naudts, J. Generalised exponential families and associated entropy functions. Entropy 2008, 10, 131–149. [Google Scholar] [CrossRef]
Gallager, R. Information Theory and Reliable Communication; Wiley: New York, NY, USA, 1968. [Google Scholar]
Topsøe, F. Informationstheorie, eine Einführung; Teubner: Stuttgart, Germany, 1974. [Google Scholar]
Sylvester, J.J. A Question in the Geometry of Situation. Q. J. Pure Appl. Math. 1857, 1, 79. [Google Scholar]
Drezner, Z.; Hamacher, H. (Eds.) Facility Location. Applications and Theory; Springer: Berlin, Germany, 2002. [Google Scholar]
Topsøe, F. A New Proof of a Result Concerning Computation of the Capacity for a Discrete Channel. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 1972, 22, 166–168. [Google Scholar] [CrossRef]
Van der Lubbe, J.C.A. On Certain Coding Theorems for the Information of Order α and of Type β. In Transactions of the Prague Conferences on Information Theory; Springer: Dordrecht, The Netherlands, 1979. [Google Scholar]
Ahlswede, R. Identification Entropy. In General Theory of Information Transfer and Combinatorics; Lecture Notes in Computer Science; Springer: Berlin, Germany, 2006; Volume 4123, pp. 595–613. [Google Scholar]
Suyari, H. Tsallis entropy as a lower bound of average description length for the q-generalized code tree. In Proceedings of the IEEE International Symposium on Information Theory (ISIT 2007), Nice, France, 24–29 June 2007; pp. 901–905. [Google Scholar]
Topsøe, F. Factorization and escorting in the game-theoretical approach to non-extensive entropy measures. Physica A 2006, 365, 91–95. [Google Scholar] [CrossRef]
Tsallis, C. Conceptual Inadequacy of the Shore and Johnson Axioms for Wide Classes of Complex Systems. Entropy 2015, 17, 2853–2861. [Google Scholar] [CrossRef]
Kapur, J.N. Maximum Entropy Models in Science and Engineering; First Edition 1989; Wiley: New York, NY, USA, 1993. [Google Scholar]
Topsøe, F. Maximum Entropy versus Minimum Risk and Applications to some classical discrete Distributions. IEEE Trans. Inf. Theory 2002, 48, 2368–2376. [Google Scholar] [CrossRef]
Pavon, M.; Ferrante, A. On the Geometry of Maximum Entropy Problems. SIAM Rev. 2013, 55, 415–439. [Google Scholar] [CrossRef]
Deza, M.M.; Laurent, M. Geometry of Cuts and Metrics; Springer: Berlin, Germany, 1997. [Google Scholar]
Van Campenhout, J.M.; Cover, T.M. Maximum Entropy and Conditional Probability. IEEE Trans. Inf. Theory 1981, IT-27, 483–489. [Google Scholar] [CrossRef]
Weijs, S.V.; van de Giesen, N. Accounting for Observational Uncertainty in Forecast Verification: An Information-Theoretical View on Forecasts, Observations, and Truth. Mon. Weather Rev. 2011, 139, 2156–2162. [Google Scholar] [CrossRef]
McCarthy, J. Measures of the Value of Information. Proc. Natl. Acad. Sci. USA 1956, 42, 654–655. [Google Scholar] [CrossRef] [PubMed]
Chambers, C.P. Proper scoring rules for general decision models. Games Econ. Behav. 2008, 63, 32–40. [Google Scholar] [CrossRef]
Hilden, J. Scoring Rules for Evaluation of Prognosticians and Prognostic Rules, First Version 1999; 2008, unpublished. Available online: http://publicifsv.sund.ku.dk/~jh/ (accessed on 26 March 2017).

Figure 1. Matrix games where one of the players does not have an optimal strategy.

Figure 2. Bregman generator and primitive effort-based information triple.

Figure 3. Jensen-Shannon divergence (d_{1} + d_{2}) for the Bregman construction.

Figure 4. Optimal strategies, typical equilibrium via core.

© 2017 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
