Communication

Khinchin’s Fourth Axiom of Entropy Revisited

1 Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
2 Wells Fargo Bank, Charlotte, NC 28282, USA
* Author to whom correspondence should be addressed.
Stats 2023, 6(3), 763-772; https://doi.org/10.3390/stats6030049
Submission received: 11 July 2023 / Revised: 25 July 2023 / Accepted: 26 July 2023 / Published: 27 July 2023
(This article belongs to the Section Data Science)

Abstract

The Boltzmann–Gibbs–Shannon (BGS) entropy is the only entropy form satisfying four conditions known as Khinchin’s axioms. The uniqueness theorem of the BGS entropy, plus the fact that Shannon’s mutual information completely characterizes independence between the two underlying random elements, puts the BGS entropy in a special place in many fields of study. In this article, the fourth axiom is replaced by a slightly weakened condition: the mutual information associated with the entropy is zero if and only if the two underlying random elements are independent. Under the weaker fourth axiom, other forms of entropy are sought by way of escort transformations. Two main results are reported in this article. First, there are many entropies other than the BGS entropy satisfying the weaker condition, yet retaining all the desirable utilities of the BGS entropy. Second, by way of escort transformations, the newly identified entropies are the only ones satisfying the weaker axioms.

1. Introduction and Summary

Let $X$ be a random element in a countable alphabet $\mathcal{X} = \{x_k; k \ge 1\}$, where the $x_k$, $k \ge 1$, are distinct letters or labels, with a probability distribution $\mathbf{p} = \{p_k; k \ge 1\} \in \mathcal{P}$, where $\mathcal{P}$ is the collection of all possible probability distributions on $\mathcal{X}$. Many random system properties of interest may be described by entropic quantities or entropies, that is, functions of $\mathbf{p}$ that are label-independent. For generality of discussion, letting $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$ be a non-increasingly ordered $\mathbf{p}$, a function of $\mathbf{p}$ that satisfies $H = H(\mathbf{p}) = H(\mathbf{p}^{\downarrow})$ for all $\mathbf{p} \in \mathcal{P}$ is referred to as an entropy. An entropy may also be denoted as $H(X)$ for notational simplicity. For example, the Boltzmann–Gibbs–Shannon entropy $H_{\mathrm{BGS}} = -\sum_{k \ge 1} p_k \ln p_k$, the Rényi entropy $H_R = (1-\alpha)^{-1} \ln\big(\sum_{k \ge 1} p_k^{\alpha}\big)$, where $\alpha > 0$ and $\alpha \ne 1$ is a constant, and the Tsallis entropy $H_T = (\alpha - 1)^{-1}\big(1 - \sum_{k \ge 1} p_k^{\alpha}\big)$ for any $\alpha \ge 0$ are of high utility across many fields of study, such as statistical mechanics and information theory.
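To make the three definitions concrete, here is a minimal sketch in Python (the helper names are illustrative, not from the paper) that evaluates each entropy for a finite distribution, using natural logarithms as in the formulas above.

```python
import math

def h_bgs(p):
    """Boltzmann-Gibbs-Shannon entropy: -sum_k p_k ln p_k (with 0 ln 0 := 0)."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def h_renyi(p, alpha):
    """Renyi entropy: (1 - alpha)^(-1) ln(sum_k p_k^alpha), alpha > 0, alpha != 1."""
    return math.log(sum(pk ** alpha for pk in p if pk > 0)) / (1.0 - alpha)

def h_tsallis(p, alpha):
    """Tsallis entropy: (alpha - 1)^(-1) (1 - sum_k p_k^alpha)."""
    return (1.0 - sum(pk ** alpha for pk in p if pk > 0)) / (alpha - 1.0)

p = [0.5, 0.25, 0.125, 0.125]
print(h_bgs(p))           # ~1.2130
print(h_renyi(p, 2.0))    # ~1.0678
print(h_tsallis(p, 2.0))  # 0.65625
```

Both the Rényi and the Tsallis entropies recover the BGS entropy in the limit as $\alpha \to 1$.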
Ever since the information-theoretic utility of $H_{\mathrm{BGS}}$ was unlocked in [1], a large volume of research has been amassed in relation to $H_{\mathrm{BGS}}$. A considerable amount of the research effort has been placed on framing $H_{\mathrm{BGS}}$ in general forms. Many articles have been published under the theme of generalized entropy, but from different perspectives. One particular perspective is the axiomatic system studied by Khinchin in [2].
Let $(X, Y)$ be a pair of random elements on a joint countable alphabet, $\mathcal{X} \times \mathcal{Y} = \{(x_i, y_j); i \ge 1, j \ge 1\}$, where $(x_i, y_j)$ for $i \ge 1$ and $j \ge 1$ are distinct labels, with a joint probability distribution, $\mathbf{p}_{X,Y} = \{p_{i,j}; i \ge 1, j \ge 1\}$, and the two marginal distributions are $\mathbf{p}_X = \{p_{i,\cdot}; i \ge 1\}$ and $\mathbf{p}_Y = \{p_{\cdot,j}; j \ge 1\}$, where $p_{i,\cdot} = \sum_{j \ge 1} p_{i,j}$ and $p_{\cdot,j} = \sum_{i \ge 1} p_{i,j}$, for $X$ and $Y$, respectively. Let $H(\mathbf{p})$ be an entropy. The four axioms of Khinchin are given below.
K1 (Continuity): $H(\mathbf{p})$ is continuous with respect to all elements of $\mathbf{p}$.
K2 (Maximality): Given $K = \sum_{k \ge 1} \mathbf{1}[p_k > 0] < \infty$, $H(\mathbf{p})$ is maximized in $\mathcal{P}$ at $p_k = 1/K$ for $k = 1, \dots, K$.
K3 (Expansibility): Letting $\mathbf{p}_0 = \{0, p_1, p_2, \dots\}$, $H(\mathbf{p}) = H(\mathbf{p}_0)$.
K4 (Separability): For any pair of random elements $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with a joint probability distribution $\mathbf{p}_{X,Y}$,
$H(X, Y) = H(X) + H(Y \mid X)$,   (1)
where $H(Y \mid X) = \sum_{i \ge 1} p_{i,\cdot} H(Y \mid X = x_i)$ and $H(Y \mid X = x_i)$ is the entropy of the conditional distribution of $Y$ given $X = x_i$.
K4 is sometimes also known as Strong Additivity.
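As a quick sanity check of the separability axiom for the BGS entropy (a sketch with arbitrary made-up numbers, not taken from the paper), the snippet below verifies that $H_{\mathrm{BGS}}(X, Y) = H_{\mathrm{BGS}}(X) + H_{\mathrm{BGS}}(Y \mid X)$ holds exactly for a small joint distribution.

```python
import math

def h(p):
    # BGS entropy of a distribution given as a list of probabilities
    return -sum(x * math.log(x) for x in p if x > 0)

# an arbitrary 2 x 3 joint distribution p_{i,j} (rows index X, columns index Y)
joint = [[0.10, 0.20, 0.05],
         [0.25, 0.10, 0.30]]
px = [sum(row) for row in joint]                      # marginal distribution of X
flat = [p for row in joint for p in row]              # joint as a flat list

# H(Y | X) = sum_i p_{i,.} H(Y | X = x_i), the conditional entropy in (1)
h_y_given_x = sum(px[i] * h([p / px[i] for p in joint[i]]) for i in range(len(joint)))

print(abs(h(flat) - (h(px) + h_y_given_x)) < 1e-12)   # True: K4 holds for the BGS entropy
```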
Khinchin famously proved the following fact in [2].
Fact 1
(The uniqueness theorem of entropy). For any $\mathbf{p} \in \mathcal{P}$, if an entropy $H(\mathbf{p})$, such that $H(\mathbf{p}) < \infty$, satisfies all four axioms, K1–K4, then $H(\mathbf{p})$ must be uniquely of the form $H_{\mathrm{BGS}} = -\sum_{k \ge 1} p_k \ln p_k$ up to a multiplicative constant.
Let
$MI = MI(X, Y) = H_{\mathrm{BGS}}(X) + H_{\mathrm{BGS}}(Y) - H_{\mathrm{BGS}}(X, Y)$,   (2)
which is often referred to as Shannon’s mutual information. The following fact is due to Shannon.
Fact 2.
Let $\mathbf{p}_{X,Y}$ be a joint probability distribution of $(X, Y)$ satisfying $H(X) < \infty$ and $H(Y) < \infty$. Then, $X$ and $Y$ are independent if and only if $MI(X, Y) = 0$.
The fact that the independence between $X$ and $Y$ may be characterized by a single-valued index $MI$ under a general joint distribution on $\mathcal{X} \times \mathcal{Y}$ puts $MI$ in a very important place in information theory. Furthermore, the uniqueness theorem of entropy adds a special aura to $H_{\mathrm{BGS}}$.
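For illustration only (the helper names and the example joint distributions below are hypothetical), Shannon's mutual information of (2) can be computed directly from a joint probability table; it vanishes, up to floating-point error, exactly when the joint is the product of its marginals, as Fact 2 asserts.

```python
import math

def h(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def mutual_information(joint):
    """Shannon's MI of (2): H(X) + H(Y) - H(X, Y), joint given as a 2-D list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    pxy = [p for row in joint for p in row]
    return h(px) + h(py) - h(pxy)

independent = [[0.5 * 0.3, 0.5 * 0.7],
               [0.5 * 0.3, 0.5 * 0.7]]   # p_{i,j} = p_{i,.} p_{.,j}
dependent = [[0.40, 0.10],
             [0.10, 0.40]]

print(mutual_information(independent))   # ~0.0
print(mutual_information(dependent))     # > 0, so X and Y are not independent
```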
However, $H_{\mathrm{BGS}}$, which satisfies
$H_{\mathrm{BGS}}(X, Y) = H_{\mathrm{BGS}}(X) + H_{\mathrm{BGS}}(Y)$   (3)
under independence of $X$ and $Y$, is considered by many physicists to be overly rigid. In search of more general forms of entropy, Khinchin’s axiom K4 has been weakened in various ways, and a large number of research articles have been published under the weaker conditions. Many of these articles may be found in two excellent review articles; see [3,4].
The fourth axiom, K4, may be weakened to different degrees across a spectrum. At one end of it, K4 is ignored and generalized entropy forms are sought only under {K1, K2, K3}. Other versions of the weakened K4 are mostly given in more general forms of (3), under independence of $X$ and $Y$. For example, an entropy $H(\mathbf{p})$ may be required to satisfy
$H(X, Y) = \Phi(H(X), H(Y))$   (4)
for some symmetric function of two variables, $\Phi$, of which a special case is
$H(X, Y) = H(X) + H(Y) + (1 - \alpha) H(X) H(Y)$,   (5)
where $\alpha \ne 1$ is a real number. The condition in (5) is a centerpiece of non-extensive statistical mechanics. The Tsallis entropy satisfies (4) in general and (5) in particular. It is to be particularly noted that the conditions in (4) and (5) are necessary conditions of $X$ and $Y$ being independent, but are not sufficient conditions.
Non-extensive physics aside, the rigidity of $H_{\mathrm{BGS}}$ has its remarkable utility, namely Fact 2: a single-valued index characterizes the stochastic association between two random elements on a non-meterized joint alphabet under general distributions. (Also see standardized mutual information in Chapter 5 of [5].)
It may be interesting to ask whether there exist other entropies, $H(\mathbf{p})$, such that, in addition to satisfying {K1, K2, K3}, $H(\mathbf{p})$ also satisfies
K4′: $H(X) + H(Y) - H(X, Y) = 0$ if and only if $X$ and $Y$ are independent.   (6)
Let it be noted that K4′ is weaker than K4 in the sense that {K1, K2, K3, K4} implies {K1, K2, K3, K4′}. In the same sense, K4′ is a stronger condition than
K4″: $H(X) + H(Y) - H(X, Y) = 0$ if $X$ and $Y$ are independent,   (7)
since K4″ is a necessary condition of $X$ and $Y$ being independent, but not a sufficient condition. The condition in (7) is a special case of (4).
Example 1.
The Rényi entropy, $H_R$, satisfies K4″ but not K4′.
Example 2.
The no-name entropy, $H_N = -\sum_{k \ge 1} p_k^{\alpha} \ln p_k^{\alpha}$ for any $\alpha > 1$, satisfies neither K4′ nor K4″, as it is not additive under independence of $X$ and $Y$. By the way, it may be interesting to note that $\lim_{\alpha \to 1} H_N = H_{\mathrm{BGS}}$.
As it turns out, there are many entropies satisfying {K1, K2, K3, K4′}. Consider the following family of entropies,
$H_{\alpha}(\mathbf{p}) = -\sum_{k \ge 1} \frac{p_k^{\alpha}}{\sum_{i \ge 1} p_i^{\alpha}} \ln \frac{p_k^{\alpha}}{\sum_{i \ge 1} p_i^{\alpha}}$,   (8)
for $\alpha > 1$, and its implied mutual information,
$MI_{\alpha}(X, Y) = H_{\alpha}(X) + H_{\alpha}(Y) - H_{\alpha}(X, Y)$.   (9)
Obviously the family of entropies in (8) also satisfies $\lim_{\alpha \to 1} H_{\alpha}(\mathbf{p}) = H_{\mathrm{BGS}}$, which, however, is only finitely defined for some of the distributions in $\mathcal{P}$. A significant advantage of (8) is that every member is finitely defined for each and every $\mathbf{p} \in \mathcal{P}$. The first main result established in this article is the fact that $X$ and $Y$ are independent if and only if $MI_{\alpha}(X, Y) = 0$ for any $\alpha \in (1, \infty)$, which immediately implies that there are many entropies satisfying {K1, K2, K3, K4′}.
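A minimal computational sketch of (8) and (9) follows (illustrative helper names, assuming a finite support): $H_\alpha$ is obtained by normalizing the powered probabilities $p_k^{\alpha}$ and taking the BGS entropy of the resulting weights, and $MI_\alpha$ follows directly from (9).

```python
import math

def h_alpha(p, alpha):
    """Entropy of (8): BGS entropy of the normalized weights p_k^alpha / sum_i p_i^alpha."""
    w = [pk ** alpha for pk in p if pk > 0]
    z = sum(w)
    return -sum((wk / z) * math.log(wk / z) for wk in w)

def mi_alpha(joint, alpha):
    """Mutual information of (9): H_alpha(X) + H_alpha(Y) - H_alpha(X, Y)."""
    px = [sum(row) for row in joint]                 # marginal of X
    py = [sum(col) for col in zip(*joint)]           # marginal of Y
    pxy = [p for row in joint for p in row]          # joint, flattened
    return h_alpha(px, alpha) + h_alpha(py, alpha) - h_alpha(pxy, alpha)

print(h_alpha([0.5, 0.25, 0.125, 0.125], 2.0))       # finite for every alpha > 1
print(mi_alpha([[0.12, 0.28], [0.18, 0.42]], 2.0))   # ~0 for this independent joint
```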
A second question of interest is whether there exist other forms of entropy satisfying {K1, K2, K3, K4′} beyond those in (8). The answer to this question is unknown in general. However, under certain restrictions on the functional forms of $H(\mathbf{p})$, uniqueness of (8) may be established. This, in fact, is the second main result of this article.
In Section 2, the statements of the two main results are made more precise and are established in two separate subsections. The article ends with Section 3, where several related minor results are summarized.

2. Main Results

The path leading to both main results of this article goes through a mapping, $\Phi: \mathcal{P} \to \mathcal{P}^* \subseteq \mathcal{P}$, denoted $\mathbf{p}^* = \Phi(\mathbf{p})$ and referred to as the escort distribution of $\mathbf{p}$ on the same alphabet $\mathcal{X}$. $\mathbf{p}^*$ is constructed in a special way, according to the concept of escort distributions introduced in [6]. For a given function $\phi(p) \ge 0$ on $[0, 1]$ and a distribution $\mathbf{p} \in \mathcal{P}$, provided that $0 < \sum_{i \ge 1} \phi(p_i) = C(\mathbf{p}, \phi(\cdot)) < \infty$, $\mathbf{p}^* = \{p_k^*; k \ge 1\}$, where
$p_k^* = \frac{\phi(p_k)}{\sum_{i \ge 1} \phi(p_i)}$,   (10)
is referred to as the escort distribution of $\mathbf{p}$ associated with the deformation function (also known as the escort function), $\phi(p)$. The utility of escort distributions is discussed by many researchers, most notably [7] in the context of statistical mechanics and [8] regarding information geometry. Escort distributions, originally introduced as natural scanners of a single underpinning probability distribution, $\mathbf{p}$, in a multifractal structure, have been shown to be useful in a great variety of places and ways, ranging from information theory and coding theory to multifractal neural networks. For example, many interesting results and applications may be found in [9,10] and the references therein.
Consider a special family of power escort functions,
$\{\phi(p) = \lambda p^{\alpha} : \alpha > 1, \lambda > 0\}$.   (11)
When $\phi(p)$ is a member of (11), (10) becomes
$p_k^* = \frac{p_k^{\alpha}}{\sum_{i \ge 1} p_i^{\alpha}}$.   (12)
The Boltzmann–Gibbs–Shannon entropy of the escort distribution, $\mathbf{p}^* = \{p_k^*; k \ge 1\}$, becomes $H_{\alpha}(\mathbf{p})$ of (8), and $MI_{\alpha}$ of (9) becomes its implied Shannon’s mutual information.
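This connection is easy to confirm numerically; the sketch below (hypothetical helper names) builds the escort distribution of (10) for a power deformation from (11) and checks that its BGS entropy equals $H_\alpha(\mathbf{p})$ of (8). The constant $\lambda$ cancels in the normalization, as expected.

```python
import math

def escort(p, phi):
    """Escort distribution of (10): p_k* = phi(p_k) / sum_i phi(p_i)."""
    w = [phi(pk) for pk in p]
    z = sum(w)
    return [wk / z for wk in w]

def h_bgs(p):
    return -sum(x * math.log(x) for x in p if x > 0)

alpha, lam = 2.5, 3.0
power_phi = lambda x: lam * x ** alpha          # a member of the family (11)

p = [0.4, 0.3, 0.2, 0.1]
p_star = escort(p, power_phi)                   # same as (12); lambda cancels
print(p_star)
print(h_bgs(p_star))                            # equals H_alpha(p) of (8)
```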

2.1. Characterization of Independence

Theorem 1.
$X$ and $Y$ are independent if and only if $MI_{\alpha}(X, Y) = 0$, for any fixed $\alpha > 1$.
Let $(X, Y)$ be a pair of random elements on $\mathcal{X} \times \mathcal{Y}$, with a joint probability distribution, $\mathbf{p}_{X,Y} = \{p_{i,j}; i \ge 1, j \ge 1\}$, and two marginal distributions, $\mathbf{p}_X$ and $\mathbf{p}_Y$. For a fixed $\alpha > 1$, consider another pair of random elements $(X^*, Y^*)$ in the same alphabet $\mathcal{X} \times \mathcal{Y}$, but with an induced joint probability distribution, $\mathbf{p}_{X,Y}^* = \{p_{i,j}^*; i \ge 1, j \ge 1\}$, and two marginal distributions, $\mathbf{p}_X^* = \{p_{i,\cdot}^*; i \ge 1\}$ and $\mathbf{p}_Y^* = \{p_{\cdot,j}^*; j \ge 1\}$, where
$p_{i,j}^* = \frac{p_{i,j}^{\alpha}}{\sum_{s \ge 1, t \ge 1} p_{s,t}^{\alpha}}$, $\quad p_{i,\cdot}^* = \frac{\sum_{j \ge 1} p_{i,j}^{\alpha}}{\sum_{s \ge 1, t \ge 1} p_{s,t}^{\alpha}}$, and $\quad p_{\cdot,j}^* = \frac{\sum_{i \ge 1} p_{i,j}^{\alpha}}{\sum_{s \ge 1, t \ge 1} p_{s,t}^{\alpha}}$.   (13)
Since (8) is the Boltzmann–Gibbs–Shannon entropy of the escort distribution, $\mathbf{p}^* = \{p_k^*; k \ge 1\}$, that is, $H_{\alpha}(X) = H_{\mathrm{BGS}}(X^*)$, and $MI_{\alpha}(X, Y) = MI(X^*, Y^*)$, by Fact 2, Theorem 1 is an immediate consequence of the following lemma.
Lemma 1.
$X$ and $Y$ are independent if and only if $X^*$ and $Y^*$ are independent.
Proof. 
For each $\alpha > 1$, if $p_{i,j} = p_{i,\cdot} \times p_{\cdot,j}$ for all pairs $(i, j)$, $i \ge 1$ and $j \ge 1$, then
$p_{i,j}^* = \frac{p_{i,j}^{\alpha}}{\sum_{s \ge 1, t \ge 1} p_{s,t}^{\alpha}} = \frac{p_{i,\cdot}^{\alpha}}{\sum_{s \ge 1} p_{s,\cdot}^{\alpha}} \times \frac{p_{\cdot,j}^{\alpha}}{\sum_{t \ge 1} p_{\cdot,t}^{\alpha}} = p_{i,\cdot}^* \times p_{\cdot,j}^*$.   (14)
Therefore, the fact that $H_{\alpha}(X, Y)$ is the Boltzmann–Gibbs–Shannon entropy of $(X^*, Y^*)$ implies $MI_{\alpha}(X, Y) = 0$.
Conversely, if $MI_{\alpha}(X, Y) = 0$, then (14) holds, which implies, for each pair $(i, j)$, $i \ge 1$ and $j \ge 1$,
$p_{i,j} = \left(\frac{p_{i,\cdot}^{\alpha}}{\sum_{s \ge 1} p_{s,\cdot}^{\alpha}}\right)^{1/\alpha} \times \left(\frac{p_{\cdot,j}^{\alpha}}{\sum_{t \ge 1} p_{\cdot,t}^{\alpha}}\right)^{1/\alpha} \times \left(\sum_{s \ge 1, t \ge 1} p_{s,t}^{\alpha}\right)^{1/\alpha} = p_{i,\cdot} \times p_{\cdot,j} \times \left(\frac{\sum_{s \ge 1, t \ge 1} p_{s,t}^{\alpha}}{\sum_{s \ge 1} p_{s,\cdot}^{\alpha} \times \sum_{t \ge 1} p_{\cdot,t}^{\alpha}}\right)^{1/\alpha}$.   (15)
Noting that the third factor in the expression above does not depend on $i$ or $j$, the lemma immediately follows from the factorization theorem.    □
Remark 1.
It is acknowledged that the proof of Lemma 1 above is inspired by a similar proof in [11], where a similar result with a more restrictive family of escort functions is established.
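As a purely numerical illustration of Lemma 1 (a sketch with made-up distributions, not part of the argument): for an independent joint distribution, the escort joint of (13) factorizes exactly into the product of its escort marginals, while a dependent joint does not.

```python
def escort_joint(joint, alpha):
    """Joint escort of (13): p*_{i,j} = p_{i,j}^alpha / sum_{s,t} p_{s,t}^alpha."""
    z = sum(p ** alpha for row in joint for p in row)
    return [[p ** alpha / z for p in row] for row in joint]

def max_factorization_gap(joint, alpha):
    """max_{i,j} |p*_{i,j} - p*_{i,.} p*_{.,j}| for the escort joint."""
    star = escort_joint(joint, alpha)
    px = [sum(row) for row in star]
    py = [sum(col) for col in zip(*star)]
    return max(abs(star[i][j] - px[i] * py[j])
               for i in range(len(star)) for j in range(len(star[0])))

px, py = [0.6, 0.4], [0.2, 0.3, 0.5]
independent = [[a * b for b in py] for a in px]
dependent = [[0.30, 0.05, 0.25],
             [0.05, 0.25, 0.10]]

print(max_factorization_gap(independent, 2.0))   # ~0: escort joint factorizes
print(max_factorization_gap(dependent, 2.0))     # clearly nonzero
```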

2.2. A Uniqueness Theorem

In Theorem 1, it is established that $X$ and $Y$ are independent if and only if $X^*$ and $Y^*$ are independent, where $(X, Y)$ and $(X^*, Y^*)$ are linked by a power escort transformation, $\phi(p)$, through the mapping, $\Phi$, between their respective joint distributions. Such an escort function may be reasonably referred to as an independence-dependence preserving deformation function. In concept, there may exist other such functions outside of the power family in (11); however, Theorem 2 below says otherwise.
Definition 1.
A deformation function $\phi(p)$ on $[0, 1]$ is said to be independence-dependence preserving with respect to a subclass $\mathcal{P}_{X,Y}^{0} \subseteq \mathcal{P}_{X,Y}$ if, for each and every $\mathbf{p}_{X,Y} \in \mathcal{P}_{X,Y}^{0}$ and its associated escort distribution $\mathbf{p}_{X,Y}^*$, $X$ and $Y$ are independent if and only if $X^*$ and $Y^*$ are independent. More specifically, a deformation function $\phi(p)$ on $[0, 1]$ is said to be independence-dependence preserving if it is independence-dependence preserving with respect to $\mathcal{P}_{X,Y}$.
Theorem 2.
A measurable and integrable deformation function, $\phi(p)$ for $p \in (0, 1)$, is independence-dependence preserving if and only if it is a member of the family of power functions in (11).
Lemma 2.
Suppose $f(x) > 0$ is a Lebesgue measurable function on $\mathbb{R}$ such that $f(x + y) = f(x) f(y)$ for all $x, y \in \mathbb{R}$. Then, $f(x) = e^{\alpha x}$ for all $x \in \mathbb{R}$ and some constant $\alpha \in \mathbb{R}$.
Lemma 2 above is due to [12]. In fact, it is also established in [12] that if the condition of Lebesgue measurability is not imposed, a nowhere-continuous $f(x)$, satisfying $f(x + y) = f(x) f(y)$ for all $x, y \in \mathbb{R}$, exists and is therefore not of the form $f(x) = e^{\alpha x}$.
Lemma 3.
Suppose $g(x) > 0$ is a Lebesgue measurable function on $(0, \infty)$ such that $g(xy) = g(x) g(y)$ for all $x, y \in (0, \infty)$. Then, $g(x) = x^{\alpha}$ for all $x > 0$ and some constant $\alpha \in \mathbb{R}$.
Proof. 
For any $x, y \in (0, \infty)$, consider the variable transformation $s = \ln x$ and $t = \ln y$, and hence $x = e^s$ and $y = e^t$. Let $f(s) = g(e^s)$. It follows that, for all $x, y \in (0, \infty)$,
$f(s + t) = g(e^{s + t}) = g(e^s e^t) = g(e^s) g(e^t) = f(s) f(t)$.
Since $g(x)$ is measurable, $f(s)$ is measurable, and therefore, by Lemma 2, $f(s) = e^{\alpha s}$ for some constant $\alpha \in \mathbb{R}$, or equivalently $g(x) = x^{\alpha}$.    □
Lemma 4.
If ϕ ( p ) is a member of (11), then ϕ ( p ) is independence-dependence preserving.
Proof. 
Let $\{p_{i,j}; i \ge 1, j \ge 1\}$ be any joint distribution on $\mathcal{X} \times \mathcal{Y}$ and denote its marginal distributions as $\{p_{i,\cdot}; i \ge 1\}$ and $\{p_{\cdot,j}; j \ge 1\}$. Suppose $p_{i,j} = p_{i,\cdot} p_{\cdot,j}$ for every pair $(i, j)$. It follows that
$\frac{\phi(p_{i,j})}{\sum_{s \ge 1, t \ge 1} \phi(p_{s,t})} = \frac{\lambda^2 p_{i,\cdot}^{\alpha} p_{\cdot,j}^{\alpha}}{\lambda^2 \sum_{s \ge 1, t \ge 1} p_{s,\cdot}^{\alpha} p_{\cdot,t}^{\alpha}} = \frac{\phi(p_{i,\cdot})}{\sum_{s \ge 1} \phi(p_{s,\cdot})} \times \frac{\phi(p_{\cdot,j})}{\sum_{t \ge 1} \phi(p_{\cdot,t})}$.
Conversely, suppose
$\frac{\phi(p_{i,j})}{\sum_{s \ge 1, t \ge 1} \phi(p_{s,t})} = \frac{\phi(p_{i,\cdot})}{\sum_{s \ge 1} \phi(p_{s,\cdot})} \times \frac{\phi(p_{\cdot,j})}{\sum_{t \ge 1} \phi(p_{\cdot,t})}$
holds for every pair $(i, j)$. Noting $\lambda \sum_{s \ge 1, t \ge 1} \phi(p_{s,t}) = \sum_{s \ge 1} \phi(p_{s,\cdot}) \times \sum_{t \ge 1} \phi(p_{\cdot,t})$ (which follows by raising the supposed equality to the power $1/\alpha$ and summing over all $(i, j)$), it follows that $\lambda \phi(p_{i,j}) = \phi(p_{i,\cdot}) \phi(p_{\cdot,j})$, $\lambda^2 p_{i,j}^{\alpha} = \lambda p_{i,\cdot}^{\alpha} \lambda p_{\cdot,j}^{\alpha}$, and $p_{i,j} = p_{i,\cdot} p_{\cdot,j}$ for every pair $(i, j)$.    □
Lemma 5.
If $\phi(p)$ is Lebesgue measurable and integrable on $(0, 1)$ and is independence-dependence preserving on $\mathcal{P}_{X,Y}$, then $\phi(p)$ is a member of (11).
Proof. 
Suppose $X$ and $Y$ are independent, that is, $p_{i,j} = p_{i,\cdot} \times p_{\cdot,j}$ for all $(i, j)$. For $\phi(\cdot)$ to preserve independence, it must satisfy, for all $(i, j)$,
$p_{i,j}^* = p_{i,\cdot}^* \times p_{\cdot,j}^*$,   (16)
or to satisfy, for any two pairs of indices $(k, l)$ and $(s, t)$,
$\frac{\phi(p_{k,l})}{\phi(p_{s,t})} = \frac{\sum_j \phi(p_{k,j}) \times \sum_i \phi(p_{i,l})}{\sum_j \phi(p_{s,j}) \times \sum_i \phi(p_{i,t})}$.   (17)
More specifically, letting $k = s$, Equation (17) is reduced to
$\frac{\phi(p_{k,l})}{\phi(p_{k,t})} = \frac{\sum_i \phi(p_{i,l})}{\sum_i \phi(p_{i,t})}$.   (18)
Noting $p_{i,j} = p_{i,\cdot} \times p_{\cdot,j}$ for all pairs $(i, j)$, it follows from (18) that
$\frac{\phi(p_{k,\cdot} \times p_{\cdot,l})}{\phi(p_{k,\cdot} \times p_{\cdot,t})} = \frac{\sum_i \phi(p_{i,\cdot} \times p_{\cdot,l})}{\sum_i \phi(p_{i,\cdot} \times p_{\cdot,t})}$.   (19)
The right-hand side of (19) is independent of $k$, and so is the left-hand side of (19). It follows that
$\frac{\phi(p_{k,\cdot} \times q_{\cdot,l})}{\phi(p_{k,\cdot} \times q_{\cdot,t})} = \frac{\phi(p_{s,\cdot} \times q_{\cdot,l})}{\phi(p_{s,\cdot} \times q_{\cdot,t})}$   (20)
regardless of whether $k = s$ or $k \ne s$. It is to be noted that (20) implies
$\frac{\phi(p \cdot q)}{\phi(p \cdot q')} = \frac{\phi(p' \cdot q)}{\phi(p' \cdot q')}$   (21)
for any $p \in (0, 1)$, $p' \in (0, 1)$, $q \in (0, 1)$, and $q' \in (0, 1)$, but subject to the constraints $p + p' \le 1$ and $q + q' \le 1$.
Now, it is desired to show that, without the constraints $p + p' \le 1$ and $q + q' \le 1$, (21) still holds. Toward that end, the proof is given in two steps: (1) $q + q' \le 1$ and (2) $q + q' > 1$.
Step 1: Suppose $q + q' \le 1$. Let
$\mathbf{P}_1 = \{p, \tfrac{1}{2}(1 - \max(p, p')), \dots\}$, $\quad \mathbf{P}_2 = \{p', \tfrac{1}{2}(1 - \max(p, p')), \dots\}$, $\quad \mathbf{Q} = \{q, q', \dots\}$.
Applying $\mathbf{P}_1$ and $\mathbf{Q}$ to Equation (20), it follows that
$\frac{\phi(p \cdot q)}{\phi(p \cdot q')} = \frac{\phi\big(\tfrac{1}{2}(1 - \max(p, p')) \cdot q\big)}{\phi\big(\tfrac{1}{2}(1 - \max(p, p')) \cdot q'\big)}$.   (22)
Applying $\mathbf{P}_2$ and $\mathbf{Q}$ to Equation (20), it follows that
$\frac{\phi(p' \cdot q)}{\phi(p' \cdot q')} = \frac{\phi\big(\tfrac{1}{2}(1 - \max(p, p')) \cdot q\big)}{\phi\big(\tfrac{1}{2}(1 - \max(p, p')) \cdot q'\big)}$   (23)
and that
$\frac{\phi(p \cdot q)}{\phi(p \cdot q')} = \frac{\phi(p' \cdot q)}{\phi(p' \cdot q')}$.   (24)
Before moving on to Step 2, let it be noted that the constraint $p + p' \le 1$ is not used in the above proof of Step 1. That is to say that (24) holds under the condition $q + q' \le 1$, regardless of whether $p + p' \le 1$ or $p + p' > 1$.
Step 2: Suppose $q + q' > 1$. Let $q'' = \tfrac{1}{2}(1 - \max(q, q'))$. Noting $q + q'' \le 1$, evaluating (24) with $q''$ in place of $q'$ gives
$\frac{\phi(p \cdot q)}{\phi\big(p \cdot \tfrac{1}{2}(1 - \max(q, q'))\big)} = \frac{\phi(p' \cdot q)}{\phi\big(p' \cdot \tfrac{1}{2}(1 - \max(q, q'))\big)}$.   (25)
Noting $q' + q'' \le 1$, evaluating (24) with $q'$ in place of $q$ and $q''$ in place of $q'$ gives
$\frac{\phi(p \cdot q')}{\phi\big(p \cdot \tfrac{1}{2}(1 - \max(q, q'))\big)} = \frac{\phi(p' \cdot q')}{\phi\big(p' \cdot \tfrac{1}{2}(1 - \max(q, q'))\big)}$.   (26)
Combining (25) and (26) again gives (24), but for all $p, p', q, q' \in (0, 1)$ without any constraints.
Moreover, (24) may be further simplified. Let $\epsilon$ be such that $0 < \epsilon < \min\{1/q, 1/q'\} - 1$. Then, for any $p \in (0, 1)$, $q \in (0, 1)$, $q' \in (0, 1)$,
$\frac{\phi(p \cdot q)}{\phi(p \cdot q')} = \frac{\phi(q)}{\phi(q')}$.   (27)
This is so because
$\frac{\phi(p \cdot q)}{\phi(p \cdot q')} = \frac{\phi\big(\frac{p}{1 + \epsilon} \cdot (1 + \epsilon) q\big)}{\phi\big(\frac{p}{1 + \epsilon} \cdot (1 + \epsilon) q'\big)} = \frac{\phi\big(\frac{1}{1 + \epsilon} \cdot (1 + \epsilon) q\big)}{\phi\big(\frac{1}{1 + \epsilon} \cdot (1 + \epsilon) q'\big)} = \frac{\phi(q)}{\phi(q')}$.
Next, it is to be established that $\phi(p)$ is continuous on $(0, 1)$. Since $\phi(\cdot)$ is measurable and integrable on $(0, 1)$ by assumption, substituting $u = xv$, $\Phi(x) = \int_0^x \phi(u)\,du = x \int_0^1 \phi(xv)\,dv$. By (27), for a fixed $s \in (0, 1)$,
$\frac{\phi(xv)}{\phi(sv)} = \frac{\phi(x)}{\phi(s)}$, or $\phi(xv) = \frac{\phi(x)\,\phi(sv)}{\phi(s)}$.
It follows that
$\Phi(x) = x\,\phi(x) \int_0^1 \frac{\phi(sv)}{\phi(s)}\,dv$.
Since $\Phi$ is continuous in $x$, and $x \int_0^1 \frac{\phi(sv)}{\phi(s)}\,dv$ is a linear function of $x$, it follows that $\phi(x)$ is continuous in $x$ on $(0, 1)$. Denoting $\phi(1) = \lim_{x \to 1} \phi(x)$, by (27),
$\phi(p \cdot q) = \frac{\phi(p) \cdot \phi(q)}{\phi(1)}$,   (28)
which, by Lemma 3, implies that $\phi(p)$ is a member of (11).    □
Lemmas 4 and 5 immediately give Theorem 2.
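A small numerical illustration of the contrast behind Theorem 2 (a sketch with arbitrary example numbers, not from the paper): applied to an independent joint distribution, a power deformation leaves the escort joint a product of its marginals, whereas a non-power deformation such as $\phi(p) = e^{p}$ generally does not.

```python
import math

def escort_joint(joint, phi):
    """Apply the deformation phi cell-wise and renormalize, as in (10)."""
    z = sum(phi(p) for row in joint for p in row)
    return [[phi(p) / z for p in row] for row in joint]

def factorization_gap(star):
    px = [sum(row) for row in star]
    py = [sum(col) for col in zip(*star)]
    return max(abs(star[i][j] - px[i] * py[j])
               for i in range(len(star)) for j in range(len(star[0])))

px, py = [0.6, 0.4], [0.2, 0.3, 0.5]
independent = [[a * b for b in py] for a in px]

print(factorization_gap(escort_joint(independent, lambda p: p ** 2.0)))  # ~0: power deformation preserves independence
print(factorization_gap(escort_joint(independent, math.exp)))            # nonzero: exp(p) does not
```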

3. Other Results

An independence-dependence preserving deformation function, $\phi(p)$, with respect to a subspace, $\mathcal{P}_{X,Y}^{0} \subseteq \mathcal{P}_{X,Y}$, need not necessarily be of the power form in (11). Let $\mathcal{P}_{K \times K}$ be the collection of all distributions of $(X, Y)$ on a $K \times K$ joint alphabet, where $K \ge 2$ is a finite integer.
Proposition 1.
A Lebesgue measurable and integrable deformation function $\phi(p)$ is independence-dependence preserving on a subspace $\mathcal{P}_{X,Y}^{0}$ such that $\mathcal{P}_{3 \times 3} \subseteq \mathcal{P}_{X,Y}^{0} \subseteq \mathcal{P}_{X,Y}$, if and only if $\phi(p)$ is a member of the power functions in (11).
The proof of Proposition 1 is trivial because the proof of Lemma 5 directly applies in this case, noting that the existence of the constructed distributions, $\mathbf{P}_1$, $\mathbf{P}_2$, $\mathbf{Q}$, in the proof of Lemma 5 only requires the joint alphabet to be $3 \times 3$ or larger.
It may be interesting to note that Proposition 1 is a stronger statement than Theorem 2 in the sense that Theorem 2 may be considered a corollary of Proposition 1. This is due to the fact that, for two sub-classes of distributions $\mathcal{P}_{X,Y}^{(1)} \subseteq \mathcal{P}_{X,Y}^{(2)}$, an independence-dependence preserving $\phi(p)$ with respect to $\mathcal{P}_{X,Y}^{(2)}$ is independence-dependence preserving with respect to $\mathcal{P}_{X,Y}^{(1)}$.
Example 3 below illustrates the fact that, on some restricted class of probability distributions, an independence-dependence preserving deformation function, $\phi(p)$, need not be of the power form of (11).
Example 3.
Consider the collection of all distributions on a $2 \times 2$ joint alphabet, denoted $\mathcal{P}_{2 \times 2}$. The function
$\phi(x) = \exp\big((x - 1/4) - 2 (x - 1/4)^2\big)$   (29)
is independence-dependence preserving but is not in the form of a power function. To see this, let it be first noted what being independence-dependence preserving entails on a $2 \times 2$ alphabet. Write the joint distribution as $\{p_{i,j}; i = 1, 2, j = 1, 2\}$ and the two marginal distributions as $\{p_{1,\cdot}, p_{2,\cdot}\} = \{p, 1 - p\}$ and $\{p_{\cdot,1}, p_{\cdot,2}\} = \{q, 1 - q\}$. When the two underlying random elements $X$ and $Y$ on the $2 \times 2$ alphabet are independent, a qualified deformation function $\phi(x)$ must satisfy
$\frac{\phi(p_{1,1})}{\sum_{s = 1, 2;\, t = 1, 2} \phi(p_{s,t})} = \frac{\phi(p_{1,1}) + \phi(p_{1,2})}{\sum_{s = 1, 2;\, t = 1, 2} \phi(p_{s,t})} \times \frac{\phi(p_{1,1}) + \phi(p_{2,1})}{\sum_{s = 1, 2;\, t = 1, 2} \phi(p_{s,t})}$,
or, letting $\Sigma = \phi(pq) + \phi(p(1 - q)) + \phi(q(1 - p)) + \phi((1 - p)(1 - q))$,
$\Sigma\, \phi(pq) = \big(\phi(pq) + \phi(p(1 - q))\big) \times \big(\phi(pq) + \phi(q(1 - p))\big)$,
which, after a few algebraic steps, is reduced to
$\phi(pq) \times \phi\big((1 - p)(1 - q)\big) = \phi\big(p(1 - q)\big)\, \phi\big(q(1 - p)\big)$.   (30)
It is easily verified that (29) satisfies (30).
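The claim that (29) satisfies (30) can also be checked numerically; the sketch below evaluates both sides of (30) over a grid of $(p, q)$ values, and additionally confirms that $\phi$ is not of the power form in (11), since the ratio $\phi(2x)/\phi(x)$ is not constant in $x$ as it would be for $\lambda x^{\alpha}$.

```python
import math

def phi(x):
    # the deformation function of (29)
    return math.exp((x - 0.25) - 2.0 * (x - 0.25) ** 2)

def identity_gap(p, q):
    # |left - right| of the identity (30)
    lhs = phi(p * q) * phi((1 - p) * (1 - q))
    rhs = phi(p * (1 - q)) * phi(q * (1 - p))
    return abs(lhs - rhs)

grid = [k / 10 for k in range(1, 10)]
print(max(identity_gap(p, q) for p in grid for q in grid))   # ~0: (30) holds
print(phi(0.2) / phi(0.1), phi(0.4) / phi(0.2))              # unequal ratios: not a power function
```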
The independence-dependence preserving property of a chosen $\phi(p)$ may be important when dependence between two random elements is of importance in a study. However, when only a one-to-one escort deformation mapping $\Phi$ is desired, the deformation function $\phi(p)$ need not be a member of (11). The following proposition provides a sufficient condition.
Proposition 2.
If $\phi(p)$ is strictly increasing for $p \in [0, 1]$, then the $\phi$-induced mapping $\Phi: \mathcal{P} \to \mathcal{P}^*$ is injective.
Proof. 
For every strictly increasing $\phi(p)$ on $p \in [0, 1]$ for which $\mathbf{p}^*$ is well-defined for each and every $\mathbf{p} \in \mathcal{P}$, it suffices to show that, given any $\mathbf{p}^* \in \mathcal{P}^*$, $\Phi^{-1}(\mathbf{p}^*)$ is unique. Toward that end, suppose there are two distinct distributions, $\mathbf{p}_1 = \{p_{1,k}; k \ge 1\} \in \mathcal{P}$ and $\mathbf{p}_2 = \{p_{2,k}; k \ge 1\} \in \mathcal{P}$, such that $\mathbf{p}_1^* = \mathbf{p}_2^* = \mathbf{p}^*$. It follows that there exists an index $k$ such that $p_{1,k} \ne p_{2,k}$. Without loss of generality, let it be supposed that
$p_{1,k} < p_{2,k}$.   (31)
Then, it follows further that
$\frac{\phi(p_{1,k})}{\sum_{i \ge 1} \phi(p_{1,i})} = \frac{\phi(p_{2,k})}{\sum_{i \ge 1} \phi(p_{2,i})} = p_k^*$, and hence $\phi(p_{1,k}) = \phi(p_{2,k}) \frac{\sum_{i \ge 1} \phi(p_{1,i})}{\sum_{i \ge 1} \phi(p_{2,i})}$.
By (31) and the condition that $\phi(p)$ is strictly increasing on $[0, 1]$, it follows that $\phi(p_{1,k}) < \phi(p_{2,k})$, and hence
$r = \frac{\sum_{i \ge 1} \phi(p_{1,i})}{\sum_{i \ge 1} \phi(p_{2,i})} < 1$.
However, since $\phi(p_{1,k}) = \phi(p_{2,k})\, r$ for every $k \ge 1$, it follows that $\phi(p_{1,k}) < \phi(p_{2,k})$ for every $k \ge 1$. Since $\phi(p)$ is strictly increasing, it follows that $p_{1,k} < p_{2,k}$ for every $k \ge 1$, which contradicts the supposition that both $\mathbf{p}_1$ and $\mathbf{p}_2$ are probability distributions, that is, $\sum_{i \ge 1} p_{1,i} = \sum_{i \ge 1} p_{2,i} = 1$. The said contradiction implies that $\mathbf{p}_1 = \mathbf{p}_2$, and the proposition follows.    □
Example 4.
Consider a special family of $\phi(p)$, $\phi(p) = (1 + \lambda p)^{1/\lambda}$ for $p \in [0, 1]$, where $\lambda > 0$ is a parameter. The escort distribution based on this deformation function is one of several well-studied families of escort distributions known as the Tsallis distributions. Every such $\phi(p)$ is strictly increasing on $[0, 1]$ and therefore, by Proposition 2, every member of the family induces an injective mapping. However, by Theorem 2, the induced mapping is not independence-dependence preserving on a countable joint alphabet $\mathcal{X} \times \mathcal{Y}$.
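The two properties discussed in Example 4 can be seen numerically in the following sketch (example values only; the helper names are not from the paper): the deformation $\phi(p) = (1 + \lambda p)^{1/\lambda}$ is strictly increasing, so distinct distributions map to distinct escort distributions, yet an independent joint distribution is mapped to an escort joint that no longer factorizes.

```python
def phi(p, lam=0.5):
    # Tsallis-type deformation of Example 4, strictly increasing on [0, 1]
    return (1.0 + lam * p) ** (1.0 / lam)

def escort(p, lam=0.5):
    z = sum(phi(x, lam) for x in p)
    return [phi(x, lam) / z for x in p]

# injectivity (Proposition 2): distinct distributions map to distinct escorts
print(escort([0.5, 0.3, 0.2]))
print(escort([0.4, 0.4, 0.2]))

# an independent 2 x 2 joint, flattened as (p11, p12, p21, p22)
px, py = [0.7, 0.3], [0.4, 0.6]
joint = [a * b for a in px for b in py]
s11, s12, s21, s22 = escort(joint)
print(s11 * s22 - s12 * s21)   # nonzero: the escort joint no longer factorizes
```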

4. Concluding Remarks

Two main results are established in this article. First, there exist many entropies other than the BGS entropy satisfying the weaker axiomatic conditions, more specifically {K1, K2, K3, K4′} rather than {K1, K2, K3, K4}, yet retaining the key utility that the associated mutual information characterizes independence and dependence on a countable joint alphabet as the BGS entropy does. Second, by way of escort transformations, the newly identified entropies are the only ones satisfying the weaker axioms on a general countable joint alphabet.
The significance of the established results comes into better focus in a broader perspective. Inspired by the development of modern data science, a shift is increasingly visible in the foundation of statistical inference, away from a real space, where random variables reside, toward a non-meterized and non-ordinal alphabet, where more general random elements reside. While statistical inferences based on random variables are theoretically well supported in the rich literature of probability and statistics, inferences on alphabets, mostly by way of various entropies and their estimation, are less systematically supported in theory. Without the familiar notions of neighborhood, real or complex moments, tails, etc., associated with random variables, probability and statistics based on random elements on alphabets need more attention to foster a clearer framework for the rigorous development of entropy-based statistical exercises, which may be more concisely termed entropic statistics. While a considerable volume of published work has been accumulated over several decades on entropy estimation, it is fair to say that the current research activities in the existing literature are sporadic in nature and the implied theoretical framework is porous. The said porosity permeates across the board, from basic axioms to a general definition of entropy, and from model interpretability to statistical inference, although some recent effort has been made to alleviate it. See, for example, [13,14], where a general definition of entropy and several fundamental results are given.
The exploration of this article is on the axiomatic foundation of entropy. If the primary study interest lies with the independence-dependence between two sets of random elements on a joint countable alphabet, as is usually the case in practice, then the main results of this article suggest that there is flexibility in choosing an entropy from (8) to serve the interest. In addition, the uniqueness of (8) by way of escort distributions is of interest not only in its own right theoretically but also as it provides support in practice. For example, in artificial neural networks with multi-layer fractal structures, the links between nodes in two layers are often modeled by power escort distributions propagating through the entire network. In such a context, it is of fundamental importance to know that the power escort distributions are the only type that would preserve dependence and independence. If the dependence-independence preserving property is desired, then the power escort is appropriate. If not, then some other escort is needed.

Author Contributions

All authors contributed equally to this article. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

No data were used in this research.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423; 623–656. [Google Scholar] [CrossRef] [Green Version]
  2. Khinchin, A.I. Mathematical Foundations of Information Theory; Dover Publications: New York, NY, USA, 1957. [Google Scholar]
  3. Amigó, J.M.; Balogh, S.G.; Hernández, S. A Brief Review of Generalized Entropies. Entropy 2018, 20, 813. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. Ilić, V.M.; Korbel, J.; Gupta, S.; Scarfone, A.M. An overview of generalized entropic forms. Europhys. Lett. 2021, 133, 50005. [Google Scholar] [CrossRef]
  5. Zhang, Z. Statistical Implications of Turing’s Formula; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2017. [Google Scholar]
  6. Beck, C.; Schlögl, F. Thermodynamics of Chaotic Systems; Cambridge University Press: Cambridge, UK, 1993. [Google Scholar]
  7. Tsallis, C. Nonadditive Entropy and Nonextensive Statistical Mechanics. An Overview after 20 Years. Braz. J. Phys. 2009, 39, 337–357. [Google Scholar] [CrossRef]
  8. Amari, S. Information Geometry and Its Applications; Springer: Tokyo, Japan, 2016. [Google Scholar]
  9. Matsuzoe, H. A Sequence of Escort Distributions and Generalizations of Expectations on q-Exponential Family. Entropy 2017, 19, 7. [Google Scholar] [CrossRef] [Green Version]
  10. Ampilova, N.; Soloviev, I.; Sergeev, V. On using escort distributions in digital image analysis. J. Meas. Eng. 2012, 9, 58–70. [Google Scholar] [CrossRef]
  11. Zhang, Z. Generalized Mutual Information. Stats 2020, 3, 13. [Google Scholar] [CrossRef]
  12. Hewitt, E.; Stromberg, K. Real and Abstract Analysis: A Modern Treatment of the Theory of Functions of a Real Variable; Springer: Berlin/Heidelberg, Germany, 1965. [Google Scholar]
  13. Zhang, Z. Entropy-Based Statistics and Their Applications. Entropy 2023, 25, 936. [Google Scholar] [CrossRef] [PubMed]
  14. Zhang, Z. Several Basic Elements of Entropic Statistics. Entropy 2023, 25, 1060. [Google Scholar] [CrossRef]