On a Variational Definition for the Jensen-Shannon Symmetrization of Distances Based on the Information Radius

Nielsen, Frank

doi:10.3390/e23040464

by

Frank Nielsen

Sony Computer Science Laboratories, Tokyo 141-0022, Japan

Entropy 2021, 23(4), 464; https://doi.org/10.3390/e23040464

Submission received: 12 March 2021 / Revised: 9 April 2021 / Accepted: 9 April 2021 / Published: 14 April 2021

(This article belongs to the Special Issue Selected Papers from the 5th conference on Geometric Science of Information)

Abstract

We generalize the Jensen-Shannon divergence and the Jensen-Shannon diversity index by considering a variational definition with respect to a generic mean, thereby extending the notion of Sibson’s information radius. The variational definition applies to any arbitrary distance and yields a new way to define a Jensen-Shannon symmetrization of distances. When the variational optimization is further constrained to belong to prescribed families of probability measures, we get relative Jensen-Shannon divergences and their equivalent Jensen-Shannon symmetrizations of distances that generalize the concept of information projections. Finally, we touch upon applications of these variational Jensen-Shannon divergences and diversity indices to clustering and quantization tasks of probability measures, including statistical mixtures.

Keywords:

Jensen-Shannon divergence; diversity index; Rényi entropy; information radius; information projection; exponential family; Bregman divergence; Fenchel–Young divergence; Bregman information; q-exponential family; q-divergence; Bhattacharyya distance; centroid; clustering

1. Introduction: Background and Motivations

This work contributes methodologically to an extension of Sibson’s information radius [1], with a particular focus on the analysis of the families of distributions called exponential families [2].

Let

(X, F)

denote a measurable space [3] with sample space

X

and

σ

-algebra

F

on the set

X

. The Jensen-Shannon divergence [4] (JSD) between two probability measures P and Q (or probability distributions) on

(X, F)

is defined by:

D_{JS} [P, Q] : = \frac{1}{2} (D_{KL} [P : \frac{P + Q}{2}] + D_{KL} [Q : \frac{P + Q}{2}]),

(1)

where

D_{KL}

denotes the Kullback–Leibler divergence [5,6] (KLD):

D_{KL} [P : Q] : = \{\begin{matrix} \int_{X} log (\frac{d P (x)}{d Q (x)}) d P (x), & P ≪ Q \\ + \infty, & P \not\ll Q \end{matrix}

(2)

where

P ≪ Q

means that P is absolutely continuous with respect to Q [3], and

\frac{d P}{d Q}

is the Radon–Nikodym derivative of P with respect to Q. Equation (2) can be rewritten using the chain rule as:

D_{KL} [P : Q] : = \{\begin{matrix} \int_{X} \frac{d P (x)}{d Q (x)} log (\frac{d P (x)}{d Q (x)}) d Q (x), & P ≪ Q \\ + \infty, & P \not\ll Q \end{matrix}

(3)

Consider a measure

μ

for which both the Radon–Nikodym derivatives

p : = \frac{d P}{d μ}

and

q : = \frac{d Q}{d μ}

exist (e.g.,

μ = \frac{P + Q}{2}

). Then, the Kullback–Leibler divergence can be rewritten as (see Equation (2.5), page 5 of [5], and page 251 of Cover & Thomas’ textbook [6]):

D_{KL} [p : q] : = \int_{X} p (x) log (\frac{p (x)}{q (x)}) d μ (x) .

(4)

Denote by

D = D (X)

the set of all densities with full support

X

(Radon–Nikodym derivatives of probability measures with respect to

μ

):

D (X) : = \{p : X \to R : p (x) > 0 \; μ\text{-almost everywhere}, \int_{X} p (x) d μ (x) = 1\} .

Subsequently, the Jensen-Shannon divergence [4] between two densities p and q of

D

is defined by:

D_{JS} [p, q] : = \frac{1}{2} (D_{KL} [p : \frac{p + q}{2}] + D_{KL} [q : \frac{p + q}{2}]) .

(5)

Often, one considers the Lebesgue measure [3]

μ = μ_{L}

on

(R^{d}, B (R^{d}))

, where

B (R^{d})

is the Borel

σ

-algebra, or the counting measure [3]

μ = μ_{#}

on

(X, 2^{X})

where

X

is a countable set, for defining the measure space

(X, F, μ)

.

The JSD belongs to the class of f-divergences [7,8,9] which are known as the invariant decomposable divergences of information geometry (see [10], pp. 52–57). Although the KLD is asymmetric (i.e.,

D_{KL} [p : q] \neq D_{KL} [q : p]

), the JSD is symmetric (i.e.,

D_{JS} [p, q] = D_{JS} [q, p]

). The notation ‘:’ is used as a parameter separator to indicate that the parameters are not permutation invariant, and that the order of parameters is important.

In this work, a distance

D (O_{1} : O_{2})

is a measure of dissimilarity between two objects

O_{1}

and

O_{2}

, which need not be symmetric or satisfy the triangle inequality of metric distances. A distance is only required to satisfy the identity of indiscernibles:

D (O_{1} : O_{2}) = 0

if and only if

O_{1} = O_{2}

. When the objects

O_{1}

and

O_{2}

are probability densities with respect to

μ

, we call this distance a statistical distance, use the brackets to enclose the arguments of the statistical distance (i.e.,

D [O_{1} : O_{2}]

), and we have

D [O_{1} : O_{2}] = 0

if and only if

O_{1} (x) = O_{2} (x)

μ

-almost everywhere.

The 2-point JSD of Equation (5) can be extended to a weighted set of n densities

P : = {(w_{1}, p_{1}), \dots, (w_{n}, p_{n})}

(with positive

w_{i}

’s normalized to sum up to unity, i.e.,

\sum_{i = 1}^{n} w_{i} = 1

) thus providing a diversity index, i.e., an n-point JSD for

P

:

D_{JS} (P) : = \sum_{i = 1}^{n} w_{i} D_{KL} [p_{i} : \bar{p}],

(6)

where

\bar{p} : = \sum_{i = 1}^{n} w_{i} p_{i}

denotes the statistical mixture [11] of the densities of

P

. We have

D_{JS} [p, q] = D_{JS} ({(\frac{1}{2}, p), (\frac{1}{2}, q)})

. We call

D_{JS} (P)

the Jensen-Shannon diversity index.
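
As an illustration of Equations (5) and (6), the following minimal sketch evaluates the KLD, the 2-point JSD, and the n-point Jensen-Shannon diversity index on a finite alphabet, i.e., assuming the counting measure and natural logarithms. The function names and the example densities are illustrative choices and do not come from the paper.

```python
import math

def kl(p, q):
    """Discrete D_KL[p : q] in nats; +inf when p is not absolutely continuous w.r.t. q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def mixture(weights, densities):
    """Arithmetic mixture sum_i w_i p_i, computed coordinate-wise."""
    return [sum(w * d[k] for w, d in zip(weights, densities))
            for k in range(len(densities[0]))]

def js_diversity(weights, densities):
    """n-point Jensen-Shannon diversity index of Equation (6)."""
    pbar = mixture(weights, densities)
    return sum(w * kl(d, pbar) for w, d in zip(weights, densities))

def jsd(p, q):
    """2-point JSD of Equation (5): the diversity index with weights (1/2, 1/2)."""
    return js_diversity([0.5, 0.5], [p, q])

# Hypothetical example densities on a 3-letter alphabet.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
r = [1 / 3, 1 / 3, 1 / 3]
print(kl(p, q), kl(q, p))    # the KLD is asymmetric
print(jsd(p, q), jsd(q, p))  # the JSD is symmetric
print(js_diversity([0.2, 0.3, 0.5], [p, q, r]))
```

The mixture always dominates each weighted component, so the divergences to the mixture stay finite even when the input densities have disjoint supports.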

The KLD is also called the relative entropy since it can be expressed as the difference between the cross entropy

h [p : q]

and the entropy

h [p]

:

\begin{matrix} D_{KL} [p : q] & : = & \int_{X} p (x) log (\frac{p (x)}{q (x)}) d μ (x) \end{matrix}

(7)

\begin{matrix} = & \int_{X} p (x) log p (x) d μ (x) - \int_{X} p (x) log q (x) d μ (x), \end{matrix}

(8)

\begin{matrix} = & h [p : q] - h [p], \end{matrix}

(9)

with the cross-entropy and entropy defined, respectively, by

\begin{matrix} h [p : q] & : = & - \int_{X} p (x) log q (x) d μ (x), \end{matrix}

(10)

\begin{matrix} h [p] & : = & - \int_{X} p (x) log p (x) d μ (x) . \end{matrix}

(11)

Because

h [p] = h [p : p]

, we may say that the entropy is the self-cross-entropy.

When

μ

is the Lebesgue measure, the Shannon entropy is also called the differential entropy [6]. Although the discrete entropy

H [p] = - \sum_{i} p_{i} log p_{i}

(i.e., entropy with respect to the counting measure) is always non-negative and bounded by

log | X |

, the differential entropy may be negative (e.g., entropy of a Gaussian distribution with small variance).

The Jensen-Shannon diversity index of Equation (6) can be rewritten as:

D_{JS} (P) = h [\bar{p}] - \sum_{i = 1}^{n} w_{i} h [p_{i}] = : J_{- h} (P) .

(12)

The JSD representation of Equation (12) is a Jensen divergence [12] for the strictly convex negentropy

F (p) = - h [p]

, since the entropy function

h [.]

is strictly concave. Therefore, it is appropriate to call this divergence the Jensen-Shannon divergence.

Because

\frac{p_{i} (x)}{\bar{p} (x)} \leq \frac{p_{i} (x)}{w_{i} p_{i} (x)} = \frac{1}{w_{i}}

, it can be shown that the Jensen-Shannon diversity index is upper bounded by

H (w) : = - \sum_{i = 1}^{n} w_{i} log w_{i}

, the discrete Shannon entropy. Thus, the Jensen-Shannon diversity index is bounded by

log n

, and the 2-point JSD is bounded by

log 2

, although the KLD is unbounded and it may even be equal to

+ \infty

when the definite integral diverges (e.g., KLD between the standard Cauchy distribution and the standard Gaussian distribution). Another nice property of the JSD is that its square root yields a metric distance [13,14]. This property further holds for the quantum JSD [15]. The JSD has gained interest in machine learning. See, for example, the Generative Adversarial Networks [16] (GANs) in deep learning [17], where it was proven that minimizing the GAN objective function by adversarial training is equivalent to minimizing a JSD.
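
The entropy form of Equation (12) and the upper bound by the discrete Shannon entropy of the weights can be checked numerically. The sketch below assumes a finite alphabet, densities with full support, and natural logarithms; all names and numerical values are illustrative.

```python
import math

def entropy(p):
    """Discrete Shannon entropy h[p] in nats."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def kl(p, q):
    """Discrete D_KL[p : q] in nats; assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0.0)

def mixture(weights, densities):
    return [sum(w * d[k] for w, d in zip(weights, densities))
            for k in range(len(densities[0]))]

def js_diversity_kl(weights, densities):
    """Equation (6): weighted average of the KLDs to the mixture."""
    pbar = mixture(weights, densities)
    return sum(w * kl(d, pbar) for w, d in zip(weights, densities))

def js_diversity_entropy(weights, densities):
    """Equation (12): entropy of the mixture minus the average entropy."""
    pbar = mixture(weights, densities)
    return entropy(pbar) - sum(w * entropy(d) for w, d in zip(weights, densities))

# Hypothetical weighted set of three densities.
weights = [0.2, 0.3, 0.5]
densities = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.25, 0.5, 0.25]]
via_kl = js_diversity_kl(weights, densities)
via_entropy = js_diversity_entropy(weights, densities)
print(via_kl, via_entropy, entropy(weights), math.log(len(weights)))
```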

To delineate the different roles that are played by the factor

\frac{1}{2}

in the ordinary Jensen-Shannon divergence (i.e., in weighting the two KLDs and in weighting the two densities), let us introduce two scalars

α, β \in (0, 1)

, and define a generic

(α, β)

-skewed Jensen-Shannon divergence, as follows:

\begin{matrix} D_{JS, α, β} [p : q] & : = & (1 - β) D_{KL} [p : m_{α}] + β D_{KL} [q : m_{α}], \end{matrix}

(13)

\begin{matrix} = & (1 - β) h [p : m_{α}] + β h [q : m_{α}] - (1 - β) h [p] - β h [q], \end{matrix}

(14)

\begin{matrix} = & h [m_{β} : m_{α}] - ((1 - β) h [p] + β h [q]), \end{matrix}

(15)

where

m_{α} : = (1 - α) p + α q

and

m_{β} : = (1 - β) p + β q

. This identity holds, because

D_{JS, α, β}

is bounded by

(1 - β) log \frac{1}{1 - α} + β log \frac{1}{α}

, see [18]. Thus, when

β = α

, we have

D_{JS, α} [p, q] = D_{JS, α, α} [p, q] = h [m_{α}] - ((1 - α) h [p] + α h [q])

, since the self-cross entropy corresponds to the entropy:

h [m_{α} : m_{α}] = h [m_{α}]

.
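
The equality between Equations (13) and (15), as well as the stated upper bound, can be verified on a small discrete example. This is a sketch under the assumption of a finite alphabet with full-support densities; the helper names and the values of α and β are arbitrary illustrations.

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p)

def cross_entropy(p, q):
    return -sum(x * math.log(y) for x, y in zip(p, q))

def kl(p, q):
    return cross_entropy(p, q) - entropy(p)

def skew_mix(p, q, t):
    """m_t = (1 - t) p + t q."""
    return [(1 - t) * a + t * b for a, b in zip(p, q)]

def skew_jsd(p, q, alpha, beta):
    """Equation (13): beta-weighted KLDs to the alpha-skewed mixture."""
    m_alpha = skew_mix(p, q, alpha)
    return (1 - beta) * kl(p, m_alpha) + beta * kl(q, m_alpha)

def skew_jsd_via_entropies(p, q, alpha, beta):
    """Equation (15): h[m_beta : m_alpha] - ((1 - beta) h[p] + beta h[q])."""
    m_alpha, m_beta = skew_mix(p, q, alpha), skew_mix(p, q, beta)
    return cross_entropy(m_beta, m_alpha) - ((1 - beta) * entropy(p) + beta * entropy(q))

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
alpha, beta = 0.3, 0.6
lhs = skew_jsd(p, q, alpha, beta)
rhs = skew_jsd_via_entropies(p, q, alpha, beta)
bound = (1 - beta) * math.log(1 / (1 - alpha)) + beta * math.log(1 / alpha)
print(lhs, rhs, bound)
```

The identity relies on the linearity of the cross-entropy in its first argument, which is why the two weighted cross-entropies collapse into the single term with the β-mixture.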

An f-divergence [9,19,20] is defined for a convex generator f that is strictly convex at 1 (to satisfy the identity of the indiscernibles) and satisfies

f (1) = 0

, by

I_{f} [p : q] : = \int p (x) f (\frac{q (x)}{p (x)}) d μ (x) \geq f (1) = 0,

(16)

where the right-hand-side follows from Jensen’s inequality [20]. For example, the total variation distance

D_{TV} [p : q] = \frac{1}{2} \int_{X} | p (x) - q (x) | d μ (x)

is a f-divergence for the generator

f_{TV} (u) = \frac{1}{2} | u - 1 |

:

D_{TV} [p : q] = I_{f_{TV}} [p : q]

. The generator

f_{TV} (u)

is convex on

R

, strictly convex at 1, and it satisfies

f_{TV} (1) = 0

.

The

D_{JS, α, β}

divergence is a f-divergence

D_{JS, α, β} [p : q] = I_{f_{JS, α, β}} [p : q],

(17)

for the generator:

f_{JS, α, β} (u) = - ((1 - β) log (α u + (1 - α)) + β u log (\frac{1 - α}{u} + α)) .

(18)

We check that the generator

f_{JS, α, β}

is strictly convex, since, for any

α \in (0, 1)

and

β \in (0, 1)

, we have

f_{JS, α, β}^{″} (u) = \frac{α^{2} (1 - β) u + {(α - 1)}^{2} β}{α^{2} u^{3} + 2 α (1 - α) u^{2} + {(α - 1)}^{2} u} > 0,

(19)

when

u > 0

.
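
The following sketch checks, on a finite alphabet, that the generator of Equation (18) reproduces the skewed divergence of Equation (13) through the f-divergence formula of Equation (16), and compares the closed-form second derivative of Equation (19) with a central finite difference. The function names, the test point, and the step size are illustrative assumptions.

```python
import math

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def skew_jsd(p, q, alpha, beta):
    """Equation (13) with m_alpha = (1 - alpha) p + alpha q."""
    m = [(1 - alpha) * a + alpha * b for a, b in zip(p, q)]
    return (1 - beta) * kl(p, m) + beta * kl(q, m)

def f_divergence(p, q, f):
    """Equation (16) for the counting measure: sum_x p(x) f(q(x) / p(x))."""
    return sum(a * f(b / a) for a, b in zip(p, q))

def f_js(u, alpha, beta):
    """Generator of Equation (18)."""
    return -((1 - beta) * math.log(alpha * u + (1 - alpha))
             + beta * u * math.log((1 - alpha) / u + alpha))

def f_js_second(u, alpha, beta):
    """Closed-form second derivative of Equation (19)."""
    num = alpha**2 * (1 - beta) * u + (alpha - 1) ** 2 * beta
    den = alpha**2 * u**3 + 2 * alpha * (1 - alpha) * u**2 + (alpha - 1) ** 2 * u
    return num / den

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
alpha, beta = 0.3, 0.6
via_generator = f_divergence(p, q, lambda u: f_js(u, alpha, beta))
direct = skew_jsd(p, q, alpha, beta)

# Central finite difference of the generator at an arbitrary point u0 > 0.
u0, step = 1.5, 1e-3
finite_diff = (f_js(u0 + step, alpha, beta) - 2 * f_js(u0, alpha, beta)
               + f_js(u0 - step, alpha, beta)) / step**2
print(via_generator, direct, finite_diff, f_js_second(u0, alpha, beta))
```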

The Jensen-Shannon principle of taking the average of the (Kullback–Leibler) divergences between the source parameters to the mid-parameter can be applied to other distances. For example, the Jensen–Bregman divergence is a Jensen-Shannon symmetrization of the Bregman divergence

B_{F}

[12]:

B_{F}^{JS} (θ_{1} : θ_{2}) : = \frac{1}{2} (B_{F} (θ_{1} : \frac{θ_{1} + θ_{2}}{2}) + B_{F} (θ_{2} : \frac{θ_{1} + θ_{2}}{2})),

(20)

where the Bregman divergence [21]

B_{F}

is defined by

B_{F} (θ : θ^{'}) : = F (θ) - F (θ^{'}) - {(θ - θ^{'})}^{⊤} \nabla F (θ^{'}) .

(21)

The Jensen–Bregman divergence

B_{F}^{JS}

can also be written as an equivalent Jensen divergence

J_{F}

:

B_{F}^{JS} (θ_{1} : θ_{2}) = J_{F} (θ_{1} : θ_{2}) : = \frac{F (θ_{1}) + F (θ_{2})}{2} - F (\frac{θ_{1} + θ_{2}}{2}),

(22)

where F is a strictly convex function ensuring

J_{F} (θ_{1} : θ_{2}) \geq 0

with equality if and only if

θ_{1} = θ_{2}

.
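
The equivalence of Equations (20) and (22) can be illustrated numerically. The sketch below uses, as an assumed example generator, the strictly convex function F(θ) = Σ θ_i log θ_i on the positive orthant; the names and parameter values are illustrative.

```python
import math

def F(theta):
    """Example strictly convex generator on the positive orthant."""
    return sum(t * math.log(t) for t in theta)

def grad_F(theta):
    return [math.log(t) + 1.0 for t in theta]

def bregman(t1, t2):
    """Bregman divergence B_F(t1 : t2) of Equation (21)."""
    g = grad_F(t2)
    return F(t1) - F(t2) - sum((a - b) * gi for a, b, gi in zip(t1, t2, g))

def midpoint(t1, t2):
    return [(a + b) / 2 for a, b in zip(t1, t2)]

def jensen_bregman(t1, t2):
    """Equation (20): average Bregman divergence to the mid-parameter."""
    mid = midpoint(t1, t2)
    return 0.5 * (bregman(t1, mid) + bregman(t2, mid))

def jensen(t1, t2):
    """Equation (22): Jensen divergence J_F."""
    return 0.5 * (F(t1) + F(t2)) - F(midpoint(t1, t2))

theta1 = [0.5, 2.0, 1.5]
theta2 = [1.0, 0.25, 3.0]
print(jensen_bregman(theta1, theta2), jensen(theta1, theta2))
```

The gradient terms cancel because the two displacement vectors from the midpoint sum to zero, which is why no gradient appears in Equation (22).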

Because of its use in various fields of information sciences [22], various generalizations of the JSD have been proposed: These generalizations are either based on Equation (5) [23] or Equation (12) [18,24,25]. For example, the (arithmetic) mixture

\bar{p} = \sum_{i} w_{i} p_{i}

in Equation (6) was replaced by an abstract statistical mixture with respect to a generic mean M in [23] (e.g., the geometric mixture induced by the geometric mean), and the two KLDs defining the JSD in Equation (5) were further averaged using another abstract mean N, thus yielding the following generic

(M, N)

-Jensen-Shannon divergence [23] (abbreviated as

(M, N)

-JSD):

D_{JS}^{M, N} [p : q] : = N (D_{KL} [p : {(p q)}_{\frac{1}{2}}^{M}], D_{KL} [q : {(p q)}_{\frac{1}{2}}^{M}]),

(23)

where

{(p q)}_{α}^{M}

denotes the statistical weighted M-mixture:

{(p q)}_{α}^{M} : = \frac{M_{α} (p (x), q (x))}{\int_{X} M_{α} (p (x), q (x)) d μ (x)} .

(24)

Notice that, when

M = N = A

(the arithmetic mean), Equation (23) of the

(A, A)

-JSD reduces to the ordinary JSD of Equation (5). When the means M and N are symmetric, the

(M, N)

-JSD is symmetric.

In general, a weighted mean

M_{α} (a, b)

for any

α \in [0, 1]

shall satisfy the in-betweeness property [26] (i.e., a mean should be contained inside its extrema):

min {a, b} \leq M_{α} (a, b) \leq max {a, b} .

(25)

The three Pythagorean means defined for positive scalars

a > 0

and

b > 0

are classic examples of means:

The arithmetic mean $A (a, b) = \frac{a + b}{2}$ ,
the geometric mean $G (a, b) = \sqrt{a b}$ , and
the harmonic mean $H (a, b) = \frac{2 a b}{a + b}$ .

These Pythagorean means may be interpreted as special instances of another parametric family of means: The power means

P_{α} (a, b) : = {(\frac{a^{α} + b^{α}}{2})}^{\frac{1}{α}},

(26)

defined for

α \in R \ {0}

(also called Hölder means). The power means can be extended to the full range

α \in R

by using the property that

{lim}_{α \to 0} P_{α} (a, b) = G (a, b)

. The power means are homogeneous means:

P_{α} (λ a, λ b) = λ P_{α} (a, b)

for any

λ > 0

. We refer to the handbook of means [27] to obtain definitions and principles of other means beyond these power means.
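
A short numerical sketch of the power means of Equation (26) shows how the three Pythagorean means arise for α = 1, α → 0, and α = −1, and checks the homogeneity and in-betweenness properties. The function name and the sample values are illustrative.

```python
import math

def power_mean(a, b, alpha):
    """Power (Hoelder) mean P_alpha(a, b) of Equation (26); alpha = 0 is the geometric limit."""
    if alpha == 0:
        return math.sqrt(a * b)
    return ((a**alpha + b**alpha) / 2) ** (1 / alpha)

a, b = 2.0, 8.0
arithmetic = power_mean(a, b, 1)    # (a + b) / 2
geometric = power_mean(a, b, 0)     # sqrt(a b)
harmonic = power_mean(a, b, -1)     # 2 a b / (a + b)
near_zero = power_mean(a, b, 1e-6)  # approaches the geometric mean
print(arithmetic, geometric, harmonic, near_zero)
```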

A weighted mean (also called barycenter) can be built from a non-weighted mean

M (a, b)

(i.e.,

α = \frac{1}{2}

) by using the dyadic expansion of the real weight

α \in [0, 1]

, see [28]. That is, we can define the weighted mean

M (p, q; w, 1 - w)

for

w = \frac{i}{2^{k}}

with

i \in {0, \dots, 2^{k}}

and k an integer. For example, consider a symmetric mean

M (p, q) = M (q, p)

. Subsequently, we get the following weighted means when

k = 3

:

\begin{matrix} M (p, q; \frac{0}{8} = 0, \frac{8}{8} = 1) & = & q \\ M (p, q; \frac{1}{8}, \frac{7}{8}) & = & M (M (M (p, q), q), q) \\ M (p, q; \frac{2}{8} = \frac{1}{4}, \frac{6}{8} = \frac{3}{4}) & = & M (M (p, q), q) \\ M (p, q; \frac{3}{8}, \frac{5}{8}) & = & M (M (M (p, q), p), q) \\ M (p, q; \frac{4}{8} = \frac{1}{2}, \frac{4}{8} = \frac{1}{2}) & = & M (p, q) \\ M (p, q; \frac{5}{8}, \frac{3}{8}) & = & M (M (M (p, q), q), p) \\ M (p, q; \frac{6}{8} = \frac{3}{4}, \frac{2}{8} = \frac{1}{4}) & = & M (M (p, q), p) \\ M (p, q; \frac{7}{8}, \frac{1}{8}) & = & M (M (M (p, q), p), p) \\ M (p, q; \frac{8}{8} = 1, \frac{0}{8} = 0) & = & p \end{matrix}

Let

w = \sum_{i = 1}^{\infty} \frac{d_{i}}{2^{i}}

be the unique dyadic expansion of the real number

w \in (0, 1)

, where the

d_{i}

’s are binary digits (i.e.,

d_{i} \in {0, 1}

). We define the weighted mean

M (x, y; w, 1 - w)

of two positive reals x and y for a real weight

w \in (0, 1)

as

M (x, y; w, 1 - w) : = lim_{n \to \infty} M (x, y; \sum_{i = 1}^{n} \frac{d_{i}}{2^{i}}, 1 - \sum_{i = 1}^{n} \frac{d_{i}}{2^{i}}) .

(27)
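
The dyadic construction can be sketched recursively: doubling the weight exposes one binary digit at a time, and each digit decides whether the partial result is averaged with p or with q. The recursion below is our reading of the table given for k = 3, not code from the paper or from [28]; the function name, the depth cutoff, and the example means are assumptions.

```python
import math

def dyadic_weighted_mean(mean, p, q, w, depth=64):
    """Weighted mean M(p, q; w, 1 - w) built only from the symmetric mean `mean`.

    The weight w in [0, 1] applies to p. If the expansion of w is longer than
    `depth` digits, the recursion is cut off by returning mean(p, q), which is
    a truncation of the limit in Equation (27).
    """
    if w <= 0.0:
        return q
    if w >= 1.0:
        return p
    if w == 0.5 or depth == 0:
        return mean(p, q)
    if w < 0.5:
        # M(p, q; w) = M(M(p, q; 2w), q)
        return mean(dyadic_weighted_mean(mean, p, q, 2 * w, depth - 1), q)
    # M(p, q; w) = M(M(p, q; 2w - 1), p)
    return mean(dyadic_weighted_mean(mean, p, q, 2 * w - 1, depth - 1), p)

def arithmetic(a, b):
    return (a + b) / 2

def geometric(a, b):
    return math.sqrt(a * b)

p, q = 2.0, 8.0
print(dyadic_weighted_mean(arithmetic, p, q, 3 / 8))  # M(M(M(p, q), p), q)
print(dyadic_weighted_mean(geometric, p, q, 1 / 4))   # M(M(p, q), q)
print(dyadic_weighted_mean(arithmetic, p, q, 0.3))    # non-dyadic weight, via its binary digits
```

For the arithmetic and geometric means, the construction recovers the familiar weighted forms w p + (1 − w) q and p^w q^(1 − w), which gives a simple way to test the recursion.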

Choosing the abstract mean M in accordance with the family

R = {p_{θ} : θ \in Θ}

of the densities allows one to obtain closed-form formula for the

(M, N)

-JSDs that rely on definite integral calculations [23]. For example, the JSD between two Gaussian densities does not admit a closed-form formula because of the log-sum integral, but the

(G, N)

-JSD admits a closed-form formula when using geometric statistical mixtures (i.e., when

M = G

). The calculus trick is to find a weighted mean

M_{α}

, such that, for two densities

p_{θ_{1}}

and

p_{θ_{2}}

, the weighted mean satisfies

M_{α} (p_{θ_{1}} (x), p_{θ_{2}} (x)) = \frac{p_{θ_{1, 2, α}} (x)}{Z_{M_{α}} (θ_{1}, θ_{2})}

, where

Z_{M_{α}} (θ_{1}, θ_{2})

is the normalizing coefficient and

p_{θ_{1, 2, α}} \in R

. Thus, the integral can be calculated simply as

\int M_{α} (p_{θ_{1}} (x), p_{θ_{2}} (x)) d μ (x) = \frac{1}{Z_{M_{α}} (θ_{1}, θ_{2})}

since

p_{θ_{1, 2, α}}

is a density of the family and, therefore,

\int p_{θ_{1, 2, α}} (x) d μ (x) = 1

. This trick has also been used in Bayesian hypothesis testing for upper bounding the probability of error between two densities of a parametric family of distributions by replacing the usual geometric mean (Section 11.7 of [6], page 375) by a more general quasi-arithmetic mean [29]. For example, the harmonic mean is well-suited to Cauchy distributions, and the power means to Student t-distributions [29].

As an application of these generalized JSDs, Deasy et al. [30] used the skewed geometric JSD (namely, the

(G_{α}, A_{1 - α})

-JSD for

α \in (0, 1)

), which admits a closed-form formula between normal densities [23], and showed how regularizing an optimization task with this G-JSD divergence improved reconstruction and generation of Variational AutoEncoders (VAEs).

More generally, instead of using the KLD, one can also use any arbitrary distance D to define its JS-symmetrization, as follows:

D_{M, N}^{JS} [p : q] : = N (D [p : {(p q)}_{\frac{1}{2}}^{M}], D [q : {(p q)}_{\frac{1}{2}}^{M}]) .

(28)

These symmetrizations may further be skewed by using

M_{α}

and/or

N_{β}

for

α \in (0, 1)

and

β \in (0, 1)

, yielding the definition [23]:

D_{M_{α}, N_{β}}^{JS} [p : q] : = N_{β} (D [p : {(p q)}_{α}^{M}], D [q : {(p q)}_{α}^{M}]) .

(29)

With these notations, the ordinary JSD is

D_{JS} = {D_{KL}}_{A, A}^{JS}

, the

(A, A)

JS-symmetrization of the KLD with respect to the arithmetic means

M = A

and

N = A

.

The JS-symmetrization can be interpreted as the

N_{β}

-Jeffreys’ symmetrization of a generalization of Lin’s

α

-skewed K-divergence [4]

D_{M_{α}}^{K} [p : q]

:

\begin{matrix} D_{M_{α}, N_{β}}^{JS} [p : q] & = & N_{β} (D_{M_{α}}^{K} [p : q], D_{M_{1 - α}}^{K} [q : p]), \end{matrix}

(30)

\begin{matrix} D_{M_{α}}^{K} [p : q] & : = & D [p : {(p q)}_{α}^{M}] . \end{matrix}

(31)

In this work, we consider symmetrizing an arbitrary distance D (including the KLD), generalizing the Jensen-Shannon divergence by using a variational formula for the JSD. Namely, we observe that the Jensen-Shannon divergence can also be defined as the following minimization problem:

D_{JS} [p, q] : = min_{c \in D} \frac{1}{2} (D_{KL} [p : c] + D_{KL} [q : c]),

(32)

since the optimal density c is proven unique using the calculus of variation [1,31,32] and it corresponds to the mid density

\frac{p + q}{2}

, a statistical (arithmetic) mixture.

Proof.

Let

S (c) = D_{KL} [p : c] + D_{KL} [q : c] \geq 0

. We use the method of the Lagrange multipliers for the constrained optimization problem

{min}_{c} S (c)

such that

\int c (x) d μ (x) = 1

. Let us minimize

S (c) + λ (\int c (x) d μ (x) - 1)

. The density c realizing the minimum

S (c)

satisfies the Euler–Lagrange equation

\frac{\partial L}{\partial c} = 0

, where

L (c) : = p log \frac{p}{c} + q log \frac{q}{c} + λ c

is the Lagrangian. That is,

- \frac{p}{c} - \frac{q}{c} + λ = 0

or, equivalently,

c = \frac{1}{λ} (p + q)

. Parameter

λ

is then evaluated from the constraint

\int_{X} c (x) d μ (x) = 1

: we get

λ = 2

since

\int_{X} (p (x) + q (x)) d μ (x) = 2

. Therefore, we find that

c (x) = \frac{p (x) + q (x)}{2}

, the mid density of

p (x)

and

q (x)

. □
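
The variational characterization of Equation (32) can also be checked numerically on a finite alphabet. The sketch below draws candidate densities c at random and verifies that none achieves a smaller objective than the mid density; it also checks that the excess of the objective over its minimum equals D_KL[(p + q)/2 : c]. The sampling scheme, seed, and names are illustrative assumptions.

```python
import math
import random

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def objective(p, q, c):
    """Objective of Equation (32): average KLD from p and q to the candidate c."""
    return 0.5 * (kl(p, c) + kl(q, c))

def random_density(rng, n):
    """Random full-support density on n letters (illustrative sampling scheme)."""
    xs = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(xs)
    return [x / total for x in xs]

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
mid = [(a + b) / 2 for a, b in zip(p, q)]

rng = random.Random(0)
candidates = [random_density(rng, 3) for _ in range(1000)]
best_random = min(objective(p, q, c) for c in candidates)
# Excess of the objective over its value at the mid density, minus KL[mid : c].
residuals = [objective(p, q, c) - objective(p, q, mid) - kl(mid, c) for c in candidates]
print(objective(p, q, mid), best_random, max(abs(r) for r in residuals))
```

Since the excess is itself a Kullback–Leibler divergence, it vanishes only at c = (p + q)/2, which gives an alternative view of the uniqueness of the minimizer.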

Considering Equation (32) instead of Equation (5) for defining the Jensen-Shannon divergence is interesting, because it allows one to consider a novel approach for generalizing the Jensen-Shannon divergence. This variational approach was first considered by Sibson [1] to define the

α

-information radius of a set of weighted distributions while using Rényi

α

-entropies that are based on Rényi principled

α

-means [33]. The

α

-information radius includes the Jensen-Shannon diversity index when

α = 1

. Sibson’s work is our point of departure for generalizing the Jensen-Shannon divergence and proposing the Jensen-Shannon symmetrizations of arbitrary distances.

The paper is organized as follows: In Section 2, we recall the rationale and definitions of the Rényi

α

-entropy and the Rényi

α

-divergence [33], and explain the information radius of Sibson [1], which includes, as a special case, the ordinary Jensen-Shannon divergence and that can be interpreted as generalized skew Bhattacharyya distances. We report, in Theorem 2, a closed-form formula for calculating the information radius of order

α

between two densities of an exponential family when

\frac{1}{α}

is an integer. It is noteworthy to point out that Sibson’s work (1969) includes, as a particular case of the information radius, a definition of the JSD, prior to the well-known reference paper of Lin [4] (1991). In Section 3, we present the JS-symmetrization variational definition that is based on a generalization of the information radius with a generic mean (Equation (88) and Definition 3). In Section 4, we constrain the mixture density to belong to a prescribed class of (parametric) probability densities, like an exponential family [2], and obtain a relative information radius generalizing information radius and related to the concept of information projections. Our Definition 5 generalizes the (relative) normal information radius of Sibson [1], who considered the multivariate normal family (Proposition 4). We illustrate this notion of relative information radius by calculating the density of an exponential family minimizing the reverse Kullback–Leibler divergence between a mixture of densities of that exponential family (Proposition 6). Moreover, we get a semi-closed-form formula for the Kullback–Leibler divergence between the densities of two different exponential families (Proposition 5), generalizing the Fenchel–Young divergence [34]. As an application of these relative variational JSDs, we touch upon the problems of clustering and quantization of probability densities in Section 4.2. Finally, we conclude by summarizing our contributions and discussing related works in Section 5.

2. Rényi Entropy and Divergence, and Sibson Information Radius

Rényi [33] investigated a generalization of the four axioms of Fadeev [35], yielding the unique Shannon entropy [20]. In doing so, Rényi replaced the ordinary weighted arithmetic mean by a more general class of averaging schemes. Namely, Rényi considered the weighted quasi-arithmetic means [36]. A weighted quasi-arithmetic mean can be induced by a strictly monotone and continuous function g, as follows:

M_{g} (x_{1}, \dots, x_{n}; w_{1}, \dots, w_{n}) : = g^{- 1} (\sum_{i = 1}^{n} w_{i} g (x_{i})),

(33)

where the

x_{i}

’s and the

w_{i}

’s are positive (the weights are normalized, so that

\sum_{i = 1}^{n} w_{i} = 1

). Because

M_{g} = M_{- g}

, we may assume without loss of generality that g is a strictly increasing and continuous function. The quasi-arithmetic means were investigated independently by Kolmogorov [36], Nagumo [37], and de Finetti [38].

For example, the power means

P_{α} (a, b) = {(\frac{a^{α} + b^{α}}{2})}^{\frac{1}{α}}

introduced earlier are quasi-arithmetic means for the generator

g_{α}^{P} (u) : = u^{α}

:

P_{α} (a, b) = M_{g_{α}^{P}} (a, b; \frac{1}{2}, \frac{1}{2}) .

(34)

Rényi proved that, among the class of weighted quasi-arithmetic means, only the means induced by the family of functions

\begin{matrix} g_{α} (u) & : = & 2^{(α - 1) u}, \end{matrix}

(35)

\begin{matrix} g_{α}^{- 1} (v) & : = & \frac{1}{α - 1} {log}_{2} v, \end{matrix}

(36)

for

α > 0

and

α \neq 1

yield a proper generalization of Shannon entropy, nowadays called the Rényi

α

-entropy. The Rényi

α

-mean is

\begin{matrix} M_{α}^{R} (x_{1}, \dots, x_{n}; w_{1}, \dots, w_{n}) & = & M_{g_{α}} (x_{1}, \dots, x_{n}; w_{1}, \dots, w_{n}), \end{matrix}

(37)

\begin{matrix} = & \frac{1}{α - 1} {log}_{2} (\sum_{i = 1}^{n} w_{i} 2^{(α - 1) x_{i}}) . \end{matrix}

(38)

The Rényi

α

-means

M_{α}^{R}

are not power means: They are not homogeneous means [31]. Let

M_{α}^{R} (p, q) = M_{α}^{R} (p, q; \frac{1}{2}, \frac{1}{2}) = \frac{1}{α - 1} {log}_{2} \frac{2^{(α - 1) p} + 2^{(α - 1) q}}{2}

. Subsequently, we have

{lim}_{α \to \infty} M_{α}^{R} (p, q) = max {p, q}

and

{lim}_{α \to 1} M_{α}^{R} (p, q) = A (p, q) = \frac{p + q}{2}

. Indeed, we have

\begin{matrix} M_{α}^{R} (p, q) & = & \frac{1}{α - 1} {log}_{2} \frac{2^{(α - 1) p} + 2^{(α - 1) q}}{2}, \\ = & \frac{1}{α - 1} {log}_{2} \frac{e^{(α - 1) p log 2} + e^{(α - 1) q log 2}}{2}, \\ \approx_{α \to 1} & \frac{1}{α - 1} {log}_{2} (1 + (α - 1) \frac{p + q}{2} log 2), \\ \approx_{α \to 1} & \frac{1}{α - 1} \frac{1}{log 2} (α - 1) \frac{p + q}{2} log 2, \\ \approx_{α \to 1} & \frac{p + q}{2} = A (p, q), \end{matrix}

using the following first-order approximations:

e^{x} \approx_{x \to 0} 1 + x

and

log (1 + x) \approx_{x \to 0} x

.
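
The Rényi α-mean of Equation (38) and its two limits can be illustrated with a few lines of code. The sketch below is only a numerical illustration; the function name and the sample values are assumptions.

```python
import math

def renyi_mean(xs, ws, alpha):
    """Renyi alpha-mean of Equation (38); alpha = 1 is treated as the arithmetic limit."""
    if alpha == 1:
        return sum(w * x for x, w in zip(xs, ws))
    total = sum(w * 2.0 ** ((alpha - 1) * x) for x, w in zip(xs, ws))
    return math.log2(total) / (alpha - 1)

xs, ws = [1.0, 3.0], [0.5, 0.5]
near_one = renyi_mean(xs, ws, 1 + 1e-6)  # close to the arithmetic mean 2
large = renyi_mean(xs, ws, 200)          # close to max{1, 3} = 3
order_two = renyi_mean(xs, ws, 2)        # log2((2^1 + 2^3) / 2) = log2(5)
doubled = renyi_mean([2.0, 6.0], ws, 2)  # differs from 2 * order_two: not homogeneous
print(near_one, large, order_two, doubled)
```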

To obtain an intuition of the Rényi entropy, we may consider generalized entropies derived from quasi-arithmetic means, as follows:

h_{g} [p] : = - M_{g} ({log}_{2} p_{1}, \dots, {log}_{2} p_{n}; p_{1}, \dots, p_{n}) .

(39)

When

g (u) = u

, we recover Shannon entropy. When

g_{2} (u) = 2^{u}

, we get

h_{g_{2}} [p] = - {log}_{2} \sum_{i} p_{i}^{2}

, called the collision entropy, since

- {log}_{2} \Pr [X_{1} = X_{2}] = h_{g_{2}} [p]

, when

X_{1}

and

X_{2}

are independent and identically distributed random variables with

X_{1} \sim p

and

X_{2} \sim p

. When

g (u) = g_{α} (u) = 2^{(α - 1) u}

, we get

\begin{matrix} h_{g_{α}} [p] & = & - \frac{1}{α - 1} {log}_{2} (\sum_{i} p_{i} 2^{(α - 1) {log}_{2} p_{i}}), \end{matrix}

(40)

\begin{matrix} = & \frac{1}{1 - α} {log}_{2} \sum_{i} p_{i} p_{i}^{α - 1} = \frac{1}{1 - α} {log}_{2} \sum_{i} p_{i}^{α} . \end{matrix}

(41)

The formula of Equation (41) is the discrete Rényi

α

-entropy [33], which can be defined more generally on a measure space

(X, F, μ)

, as follows:

h_{α}^{R} [p] : = \frac{1}{1 - α} log (\int_{X} p^{α} (x) d μ (x)), α \in (0, 1) \cup (1, \infty) .

(42)

In the limit case

α \to 1

, the Rényi

α

-entropy converges to Shannon entropy:

{lim}_{α \to 1} h_{α}^{R} [p] = h [p]

. Rényi

α

-entropies are non-increasing with respect to increasing

α

:

h_{α}^{R} [p] \geq h_{α^{'}}^{R} [p]

for

α < α^{'}

. In the discrete case (i.e., counting measure

μ

on a finite alphabet

X

), we can further define

h_{0} [p] = log | X |

for

α = 0

(also called max-entropy or Hartley entropy). The Rényi

+ \infty

-entropy

h_{+ \infty} [p] = - log max_{x \in X} p (x)

is also called the min-entropy, since the sequence

h_{α}

is non-increasing with respect to increasing

α

.
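
The special cases and the monotonicity of the discrete Rényi α-entropy can be illustrated as follows. This sketch assumes a finite alphabet, a density with full support, and logarithms in base 2; the function name and the example density are illustrative.

```python
import math

def renyi_entropy(p, alpha):
    """Discrete Renyi alpha-entropy in bits for a full-support density p."""
    if alpha == 1:
        return -sum(x * math.log2(x) for x in p)  # Shannon entropy (limit case)
    if alpha == math.inf:
        return -math.log2(max(p))                 # min-entropy
    return math.log2(sum(x**alpha for x in p)) / (1 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
hartley = renyi_entropy(p, 0)          # log2 |X|
shannon = renyi_entropy(p, 1)
collision = renyi_entropy(p, 2)        # -log2 sum_i p_i^2
min_entropy = renyi_entropy(p, math.inf)
near_one = renyi_entropy(p, 1 + 1e-6)  # close to the Shannon entropy
print(hartley, shannon, collision, min_entropy, near_one)
```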

Similarly, Rényi obtained the

α

-divergences for

α > 0

and

α \neq 1

(originally called information gain of order

α

):

D_{α}^{R} [p : q] : = \frac{1}{α - 1} {log}_{2} (\int_{X} p {(x)}^{α} q {(x)}^{1 - α} d μ (x)),

(43)

generalizing the Kullback–Leibler divergence, since

{lim}_{α \to 1} D_{α}^{R} [p : q] = D_{KL} [p : q]

. Rényi

α

-divergences are non-decreasing with respect to increasing

α

[39]:

D_{α}^{R} [p : q] \leq D_{α^{'}}^{R} [p : q]

for

α^{'} \geq α

.

Sibson [1] (Robin Sibson (1944–2017), who is also renowned for inventing natural neighbour interpolation [40]) considered both the Rényi

α

-divergence [33]

D_{α}^{R}

and the Rényi

α

-weighted mean

M_{α}^{R} : = M_{g_{α}}

to define the information radius

R_{α}

of order

α

of a weighted set

P = {(w_{i}, p_{i})}_{i = 1}^{n}

of densities

p_{i}

’s as the following minimization problem:

R_{α} (P) : = min_{c \in D} R_{α} (P, c),

(44)

where

R_{α} (P, c) : = M_{α}^{R} (D_{α}^{R} [p_{1} : c], \dots, D_{α}^{R} [p_{n} : c]; w_{1}, \dots, w_{n}) .

(45)

The Rényi

α

-weighted mean

M_{α}^{R}

can be rewritten as

\begin{matrix} M_{α}^{R} (x_{1}, \dots, x_{n}; w_{1}, \dots, w_{n}) & = & \frac{1}{(α - 1) log 2} LSE ((α - 1) x_{1} log 2 + log w_{1}, \dots, (α - 1) x_{n} log 2 + log w_{n}), \end{matrix}

(46)

where function

LSE (a_{1}, \dots, a_{n}) : = log (\sum_{i = 1}^{n} e^{a_{i}})

denotes the log-sum-exp (convex) function [41,42].

Notice that

2^{(α - 1) D_{α}^{R} [p : q]} = \int_{X} p {(x)}^{α} q {(x)}^{1 - α} d μ (x)

, the Bhattacharyya

α

-coefficient [12] (also called Chernoff

α

-coefficient [43,44]):

C_{Bhat, α} [p : q] : = \int_{X} p {(x)}^{α} q {(x)}^{1 - α} d μ (x) .

(47)

Thus, we have

R_{α} (P, c) = \frac{1}{α - 1} {log}_{2} (\sum w_{i} C_{Bhat, α} [p_{i} : c]) .

(48)

The ordinary Bhattacharyya coefficient is obtained for

α = \frac{1}{2}

:

C_{Bhat} [p : q] : = \int_{X} \sqrt{p (x)} \sqrt{q (x)} d μ (x)

.

Sibson [1] also considered the limit case

α \to \infty

when defining the information radius:

D_{\infty}^{R} [p : q] : = {log}_{2} sup_{x \in X} \frac{p (x)}{q (x)} .

(49)

Sibson reported the following theorem in his information radius study [1]:

Theorem 1

(Theorem 2.2 and Corollary 2.3 of [1]). The optimal density

c_{α}^{*} = arg {min}_{c \in D} R_{α}

(P, c)

is unique, and we have:

\begin{matrix} c_{1}^{*} (x) = \sum_{i} w_{i} p_{i} (x), & R_{1} (P) = R_{1} (P, c_{1}^{*}) = \int_{X} \sum_{i} w_{i} p_{i} (x) {log}_{2} \frac{p_{i} (x)}{\sum_{j} w_{j} p_{j} (x)} d μ (x), & α = 1 \\ c_{α}^{*} (x) = \frac{{(\sum_{i} w_{i} p_{i} {(x)}^{α})}^{\frac{1}{α}}}{\int_{X} {(\sum_{i} w_{i} p_{i} {(x)}^{α})}^{\frac{1}{α}} d μ (x)}, & R_{α} (P) = R_{α} (P, c_{α}^{*}) = \frac{1}{α - 1} {log}_{2} {(\int_{X} {(\sum_{i} w_{i} p_{i} {(x)}^{α})}^{\frac{1}{α}} d μ (x))}^{α}, & α \in (0, 1) \cup (1, \infty) \\ c_{\infty}^{*} (x) = \frac{{max}_{i} p_{i} (x)}{\int_{X} ({max}_{i} p_{i} (x)) d μ (x)}, & R_{\infty} (P) = R_{\infty} (P, c_{\infty}^{*}) = {log}_{2} \int_{X} (max_{i} p_{i} (x)) d μ (x), & α = \infty . \end{matrix}

Observe that

R_{\infty} (P)

does not depend on the (positive) weights.

The proof follows from the following decomposition of the information radius:

Proposition 1.

We have:

R_{α} (P, c) - R_{α} (P, c_{α}^{*}) = D_{α}^{R} (c_{α}^{*}, c) \geq 0 .

(50)

Because the proof is omitted in [1], we report it here:

Proof.

Let

Δ (c, c_{α}^{*}) : = R_{α} (P, c) - R_{α} (P, c_{α}^{*})

. We handle the three cases, depending on the

α

values:

Case $α \in (0, 1) \cup (1, \infty)$ : Let $P_{α} (P) (x) : = {(\sum_{i} w_{i} p_{i} {(x)}^{α})}^{\frac{1}{α}}$ . We have ${(c_{α}^{*} (x))}^{α} = \frac{\sum_{i} w_{i} p_{i} {(x)}^{α}}{{(\int P_{α} (P) (x) d μ (x))}^{α}}$ . We obtain

$\begin{matrix} Δ (c, c_{α}^{*}) & = & \frac{1}{α - 1} {log}_{2} (\sum_{i} w_{i} \int p_{i} {(x)}^{α} c {(x)}^{1 - α} d μ (x)) - \frac{1}{α - 1} {log}_{2} {(\int P_{α} (P) (x) d μ (x))}^{α}, \end{matrix}$

(51)

$\begin{matrix} = & \frac{1}{α - 1} {log}_{2} \frac{\sum_{i} w_{i} \int p_{i} {(x)}^{α} c {(x)}^{1 - α} d μ}{{(\int P_{α} (P) (x) d μ (x))}^{α}}, \end{matrix}$

(52)

$\begin{matrix} = & \frac{1}{α - 1} {log}_{2} \int \frac{(\sum_{i} w_{i} p_{i} {(x)}^{α}) c {(x)}^{1 - α}}{{(\int P_{α} (P) (x) d μ (x))}^{α}} d μ (x), \end{matrix}$

(53)

$\begin{matrix} = & \frac{1}{α - 1} {log}_{2} \int {(c_{α}^{*} (x))}^{α} c {(x)}^{1 - α} d μ (x), \end{matrix}$

(54)

$\begin{matrix} = & D_{α}^{R} (c_{α}^{*}, c) . \end{matrix}$

(55)
Case $α = 1$ : we have $Δ (c, c_{1}^{*}) : = R_{1} (P, c) - R_{1} (P, c_{1}^{*})$ with $c_{1}^{*} = \sum_{i} w_{i} p_{i}$ . Because $R_{1} (P, c) = \sum_{i} w_{i} D_{KL} [p_{i} : c]$ , we have

$\begin{matrix} R_{1} (P, c) & = & \sum_{i} w_{i} (h [p_{i} : c] - h [p_{i}]), \end{matrix}$

(56)

$\begin{matrix} = & h [\sum_{i} w_{i} p_{i} : c] - \sum_{i} w_{i} h [p_{i}], \end{matrix}$

(57)

$\begin{matrix} = & h [c_{1}^{*} : c] - \sum_{i} w_{i} h [p_{i}] . \end{matrix}$

(58)

It follows that

$\begin{matrix} Δ (c, c_{1}^{*}) & = & h [c_{1}^{*} : c] - \sum_{i} w_{i} h [p_{i}] - (h [c_{1}^{*} : c_{1}^{*}] - \sum_{i} w_{i} h [p_{i}]), \end{matrix}$

(59)

$\begin{matrix} = & h [c_{1}^{*} : c] - h [c_{1}^{*}], \end{matrix}$

(60)

$\begin{matrix} = & D_{KL} [c_{1}^{*} : c] = D_{1}^{R} [c_{1}^{*} : c] . \end{matrix}$

(61)
Case $α = \infty$ : we have $c_{\infty}^{*} (x) = \frac{{max}_{i} p_{i} (x)}{\int ({max}_{i} p_{i} (x)) d μ (x)}$ , $R_{\infty} (P, c_{\infty}^{*}) = {log}_{2} \int ({max}_{i} p_{i} (x)) d μ (x)$ , and $D_{\infty}^{R} [p : q] = {log}_{2} {sup}_{x} \frac{p (x)}{q (x)}$ . We have $R_{\infty} (P, c) = {max}_{i} D_{\infty}^{R} [p_{i} : c] = {log}_{2} {sup}_{x} \frac{{max}_{i} p_{i} (x)}{c (x)}$ . Thus, $Δ (c, c_{\infty}^{*}) : = R_{\infty} (P, c) - R_{\infty} (P, c_{\infty}^{*}) = {log}_{2} {sup}_{x} \frac{c_{\infty}^{*} (x)}{c (x)} = D_{\infty}^{R} [c_{\infty}^{*} : c]$ .

□

It follows that

min_{c} R_{α} (P, c) = min_{c} R_{α} (P, c_{α}^{*}) + D_{α}^{R} (c_{α}^{*}, c) \equiv min_{c} D_{α}^{R} (c_{α}^{*}, c) \geq 0 .

Thus we have

c = c_{α}^{*}

since

D_{α}^{R} (c_{α}^{*}, c)

is minimized for

c = c_{α}^{*}

.
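The decomposition used in the proof can be checked numerically. The following is a minimal sketch, assuming discrete densities with respect to the counting measure (so integrals become sums) and an order α different from 1 and ∞; the function names `renyi_divergence`, `radius_at`, and `optimal_center` are ours and purely illustrative.

```python
import math

def renyi_divergence(p, q, alpha):
    """Renyi divergence D_alpha^R[p:q] in bits for discrete densities (alpha not in {1, inf})."""
    return math.log2(sum(a**alpha * b**(1.0 - alpha) for a, b in zip(p, q))) / (alpha - 1.0)

def radius_at(weights, densities, c, alpha):
    """R_alpha(P, c) = 1/(alpha-1) log2 sum_i w_i sum_x p_i(x)^alpha c(x)^(1-alpha)."""
    total = sum(w * sum(a**alpha * b**(1.0 - alpha) for a, b in zip(p, c))
                for w, p in zip(weights, densities))
    return math.log2(total) / (alpha - 1.0)

def optimal_center(weights, densities, alpha):
    """c_alpha^*: weighted power mean of order alpha of the densities, normalized to sum to 1."""
    n_points = len(densities[0])
    unnormalized = [sum(w * p[x]**alpha for w, p in zip(weights, densities))**(1.0 / alpha)
                    for x in range(n_points)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

# Illustrative data: three densities on a 3-point sample space and an arbitrary center c.
weights = [0.2, 0.5, 0.3]
densities = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
c = [0.25, 0.25, 0.5]
alpha = 2.5

c_star = optimal_center(weights, densities, alpha)
gap = radius_at(weights, densities, c, alpha) - radius_at(weights, densities, c_star, alpha)
# The gap should coincide with the Renyi divergence from c_star to c.
print(gap, renyi_divergence(c_star, c, alpha))
```

Since the gap is a Rényi divergence, it is non-negative and vanishes only at the optimal center, which is the argument made above.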

Notice that

c_{\infty}^{*} (x) = \frac{max {p_{1} (x), \dots, p_{n} (x)}}{\int_{X} ({max}_{i} p_{i} (x)) d μ (x)}

is the upper envelope of the densities

p_{i} (x)

’s normalized to be a density. Provided that the densities

p_{i}

’s intersect pairwise in at most s locations (i.e.,

| {p_{i} (x) \cap p_{j} (x)} | \leq s

for

i \neq j

), we can efficiently compute this upper envelope using an output-sensitive algorithm [45] of computational geometry.

When the point set is

P = \{(\frac{1}{2}, p), (\frac{1}{2}, q)\}

with

w_{1} = w_{2} = \frac{1}{2}

, the information radius defines a (2-point) symmetric distance, as follows:

\begin{matrix} R_{1} (p, q) = \frac{1}{2} \int_{X} p (x) {log}_{2} \frac{2 p (x)}{p (x) + q (x)} d μ (x) + \frac{1}{2} \int_{X} q (x) {log}_{2} \frac{2 q (x)}{p (x) + q (x)} d μ (x), & α = 1 \\ R_{α} (p, q) = \frac{α}{α - 1} {log}_{2} \int_{X} {(\frac{p {(x)}^{α} + q {(x)}^{α}}{2})}^{\frac{1}{α}} d μ (x) = \frac{α}{α - 1} {log}_{2} \int_{X} P_{α} (p (x), q (x)) d μ (x), & α \in (0, 1) \cup (1, \infty) \\ R_{\infty} (p, q) = {log}_{2} \int_{X} max \{p (x), q (x)\} d μ (x), & α = \infty . \end{matrix}

This family of symmetric divergences may be called the Sibson’s

α

-divergences, and the Jensen-Shannon divergence is interpreted as a limit case when

α \to 1

. Notice that, since we have

{lim}_{α \to \infty} P_{α} (p, q) = max {p, q}

and

{lim}_{α \to \infty} \frac{α}{α - 1} = 1

, we have

{lim}_{α \to \infty} R_{α} (p, q) = R_{\infty} (p, q)

. Notice that, for

α = 1

, the integral and logarithm operations are swapped as compared to

R_{α}

for

α \in (0, 1) \cup (1, \infty)

.
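As an illustration of the three regimes, the sketch below evaluates the 2-point information radius for discrete densities (counting measure assumed). The name `sibson_divergence` is ours; the branch for α = 1 is the ordinary Jensen-Shannon divergence in bits.

```python
import math

def sibson_divergence(p, q, alpha):
    """2-point information radius R_alpha(p, q) in bits between two discrete densities."""
    if alpha == 1.0:
        # Jensen-Shannon divergence: the logarithm sits inside the sum.
        total = 0.0
        for a, b in zip(p, q):
            m = 0.5 * (a + b)
            if a > 0.0:
                total += 0.5 * a * math.log2(a / m)
            if b > 0.0:
                total += 0.5 * b * math.log2(b / m)
        return total
    if alpha == math.inf:
        # Logarithm of the total mass of the upper envelope max{p, q}.
        return math.log2(sum(max(a, b) for a, b in zip(p, q)))
    # Generic order: logarithm of the total mass of the power mean P_alpha(p, q).
    mass = sum((0.5 * (a**alpha + b**alpha))**(1.0 / alpha) for a, b in zip(p, q))
    return alpha / (alpha - 1.0) * math.log2(mass)

p = [0.6, 0.3, 0.1]
q = [0.2, 0.2, 0.6]
for order in (0.5, 1.0, 2.0, math.inf):
    print(order, sibson_divergence(p, q, order))
```

Evaluating the generic branch at orders close to 1, or at large orders, gives values close to the α = 1 and α = ∞ branches, in agreement with the limits stated above.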

Theorem 2.

When

α = \frac{1}{k}

for an integer

k \geq 2

, the Sibson α-divergence between two densities

p_{θ_{1}}

and

p_{θ_{2}}

of an exponential family

{p_{θ} : θ \in Θ}

with cumulant function

F (θ)

is available in closed form:

R_{α} (p_{θ_{1}}, p_{θ_{2}}) = - \frac{1}{k - 1} {log}_{2} (\frac{1}{2^{k}} \sum_{i = 0}^{k} (\binom{k}{i}) exp (F (\frac{i}{k} θ_{1} + (1 - \frac{i}{k}) θ_{2}) - (\frac{i}{k} F (θ_{1}) + (1 - \frac{i}{k}) F (θ_{2})))) .

Proof.

Let

p = p_{θ_{1}}

and

q = p_{θ_{2}}

be two densities of an exponential family [2] with cumulant function

F (θ)

and natural parameter space

Θ

. Without loss of generality, we may consider a natural exponential family [2] with densities written canonically as

p_{θ} (x) = exp (x^{⊤} θ - F (θ))

for

θ \in Θ

. It can be shown that the cumulant function

F (θ) = log \int_{X} exp (x^{⊤} θ) d μ (x)

is strictly convex and analytic on the open convex natural parameter space

Θ

[2].

When

α = \frac{1}{2}

(i.e.,

k = 2

), we have:

\begin{matrix} R_{\frac{1}{2}} (p, q) & = & - {log}_{2} \int_{X} {(\frac{\sqrt{p (x)} + \sqrt{q (x)}}{2})}^{2} d μ (x), \end{matrix}

(62)

\begin{matrix} = & - {log}_{2} (\frac{1}{2} + \frac{1}{2} \int_{X} \sqrt{p (x)} \sqrt{q (x)} d μ (x)), \end{matrix}

(63)

\begin{matrix} = & - {log}_{2} (\frac{1}{2} + \frac{1}{2} C_{Bhat} [p : q]) \geq 0, \end{matrix}

(64)

where

C_{Bhat} [p : q] : = \int_{X} \sqrt{p (x)} \sqrt{q (x)} d μ (x)

is the Bhattacharyya coefficient (with

0 \leq C_{Bhat} [p : q] \leq 1

). Using Theorem 3 of [12], we have

C_{Bhat} [p_{θ_{1}}, p_{θ_{2}}] = exp (F (\frac{θ_{1} + θ_{2}}{2}) - \frac{F (θ_{1}) + F (θ_{2})}{2}),

so that we obtain the following closed-form formula:

R_{\frac{1}{2}} (p_{θ_{1}}, p_{θ_{2}}) = - {log}_{2} (\frac{1}{2} + \frac{1}{2} exp (F (\frac{θ_{1} + θ_{2}}{2}) - \frac{F (θ_{1}) + F (θ_{2})}{2})) \geq 0 .

Now, assume that

k = \frac{1}{α} \geq 2

is an arbitrary integer, and let us apply the binomial expansion for

P_{α} (p_{θ_{1}}, p_{θ_{2}})

in the spirit of [46,47]:

\begin{matrix} \int_{X} P_{α} (p_{θ_{1}} (x), p_{θ_{2}} (x)) d μ (x) & = & \int_{X} {(\frac{p_{θ_{1}} {(x)}^{\frac{1}{k}} + p_{θ_{2}} {(x)}^{\frac{1}{k}}}{2})}^{k} d μ (x), \end{matrix}

(65)

\begin{matrix} = & \frac{1}{2^{k}} \sum_{i = 0}^{k} (\binom{k}{i}) \int_{X} {(p_{θ_{1}} {(x)}^{\frac{1}{k}})}^{i} {(p_{θ_{2}} {(x)}^{\frac{1}{k}})}^{k - i} d μ (x) . \end{matrix}

(66)

Let

I_{k, i} (θ_{1}, θ_{2}) : = \int_{X} {(p_{θ_{1}} {(x)}^{\frac{1}{k}})}^{i} {(p_{θ_{2}} {(x)}^{\frac{1}{k}})}^{k - i} d μ (x)

. Because

\frac{i}{k} θ_{1} + \frac{k - i}{k} θ_{2} = θ_{2} + \frac{i}{k} (θ_{1} - θ_{2}) \in Θ

for

i \in {0, \dots, k}

, we get by following the calculation steps in [12]:

I_{k, i} (θ_{1}, θ_{2}) = exp (F (\frac{i}{k} θ_{1} + (1 - \frac{i}{k}) θ_{2}) - (\frac{i}{k} F (θ_{1}) + (1 - \frac{i}{k}) F (θ_{2}))) < \infty .

Notice that

I_{2, 1} = C_{Bhat} [p_{θ_{1}}, p_{θ_{2}}]

, and

I_{k, 0} = I_{k, k} = 1

.

Thus, we get the following closed-form formula:

\begin{matrix} R_{α} (p_{θ_{1}}, p_{θ_{2}}) & = & - \frac{1}{k - 1} {log}_{2} (\frac{1}{2^{k}} \sum_{i = 0}^{k} (\binom{k}{i}) exp (F (\frac{i}{k} θ_{1} + (1 - \frac{i}{k}) θ_{2}) - (\frac{i}{k} F (θ_{1}) + (1 - \frac{i}{k}) F (θ_{2})))) . \end{matrix}

(67)

□
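The closed form can be compared against a direct evaluation of the defining sum. The sketch below assumes, for illustration only, the Poisson family (counting measure with carrier term, natural parameter θ = log λ, cumulant function F(θ) = exp(θ)); the function names are ours.

```python
import math

def cumulant(theta):
    """Cumulant function F(theta) = exp(theta) of the Poisson family (theta = log rate)."""
    return math.exp(theta)

def sibson_closed_form(theta1, theta2, k):
    """Closed form of R_{1/k}(p_theta1, p_theta2) in bits via the binomial expansion."""
    total = 0.0
    for i in range(k + 1):
        a = i / k
        # Exponent: F at the interpolated parameter minus the interpolated F values.
        gap = cumulant(a * theta1 + (1.0 - a) * theta2) \
            - (a * cumulant(theta1) + (1.0 - a) * cumulant(theta2))
        total += math.comb(k, i) * math.exp(gap)
    return -math.log2(total / 2**k) / (k - 1)

def poisson_pmf(x, rate):
    """Poisson probability mass, computed through logarithms for numerical stability."""
    return math.exp(x * math.log(rate) - rate - math.lgamma(x + 1))

def sibson_direct(rate1, rate2, k, x_max=400):
    """alpha/(alpha-1) log2 sum_x P_alpha(p1(x), p2(x)) with alpha = 1/k (sum truncated at x_max)."""
    alpha = 1.0 / k
    mass = sum((0.5 * (poisson_pmf(x, rate1)**alpha + poisson_pmf(x, rate2)**alpha))**k
               for x in range(x_max))
    return alpha / (alpha - 1.0) * math.log2(mass)

rate1, rate2 = 2.0, 5.0
for k in (2, 3, 4):
    print(k, sibson_closed_form(math.log(rate1), math.log(rate2), k),
          sibson_direct(rate1, rate2, k))
```

The truncation level `x_max` is an assumption that is harmless here because the Poisson tails are negligible far beyond the rates used.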

This closed-form formula applies, in particular, to the family

{N (μ, Σ)}

of (multivariate) normal distributions: In this case, the natural parameters

θ

are expressed using both a vector parameter component v and a matrix parameter component M:

θ = (v, M) = (Σ^{- 1} μ, - \frac{1}{2} Σ^{- 1}),

(68)

and the cumulant function is:

F_{N} (θ) = \frac{d}{2} log (2 π) - \frac{1}{2} log | - 2 M | - \frac{1}{4} v^{⊤} M^{- 1} v,

(69)

where

| \cdot |

denotes the matrix determinant.

In general, the optimal density

c_{α}^{*} = arg {min}_{c \in D} R_{α} (P, c)

yielding the information radius

R_{α} (P)

can be interpreted as a generalized centroid (extending the notion of Fréchet means [48]) with respect to

(M_{α}^{R}, D_{α}^{R})

, where a

(M, D)

-centroid is defined by:

Definition 1

(

(M, D)

-centroid). Let

P = {(w_{1}, p_{1}), \dots, (w_{n}, p_{n})}

be a normalized weighted parameter set, M a mean, and D a distance. Subsequently, the

(M, D)

-centroid is defined as

c_{M, D} (P) = arg min_{c} M (D (p_{1} : c), \dots, D (p_{n} : c); w_{1}, \dots, w_{n}) .

Here, we give a general definition of the

(M, D)

-centroid for an arbitrary distance (not necessarily symmetric, nor a metric). The parameter set can be either a set of probability measures having densities with respect to a given measure

μ

or a set of vectors. In the first case, the distance D is called a statistical distance. When the densities belong to a parametric family of densities

P = {p_{θ} : θ \in Θ}

, the statistical distance

D [p_{θ_{1}} : p_{θ_{2}}]

amounts to a parameter distance:

D_{P} (θ_{1} : θ_{2}) : = D [p_{θ_{1}} : p_{θ_{2}}]

. For example, when all of the densities

p_{i}

’s belong to the same natural exponential family [2]

P = {p_{θ} (x) = exp (θ^{⊤} t (x) - F (θ)) : θ \in Θ}

with cumulant function

F (θ) = log \int exp (θ^{⊤} t (x)) d μ (x)

(i.e.,

p_{i} = p_{θ_{i}}

) and sufficient statistic vector

t (x)

, we have

D_{KL} [p_{θ} : p_{θ_{i}}] = B_{F}^{*} (θ : θ_{i}) : = B_{F} (θ_{i} : θ)

, where

B_{F}^{*}

denotes the reverse Bregman divergence (obtained by swapping the parameter order) of the Bregman divergence [21]

B_{F}

defined by

B_{F} (θ : θ^{'}) : = F (θ) - F (θ^{'}) - {(θ - θ^{'})}^{⊤} \nabla F (θ^{'}) .

(70)

Thus, we have

D_{P} (θ_{1} : θ_{2}) : = B_{F}^{*} (θ_{1} : θ_{2}) = D_{KL} [p_{θ_{1}} : p_{θ_{2}}]

.

Let

V = {(w_{1}, θ_{1}), \dots, (w_{n}, θ_{n})}

be the parameter set corresponding to

P

. Define

R_{F} (V, θ) : = \sum_{i = 1}^{n} w_{i} B_{F} (θ_{i} : θ) .

(71)

Subsequently, we have the equivalent decomposition of Proposition 1:

R_{F} (V, θ) - R_{F} (V, θ^{*}) = B_{F} (θ^{*} : θ),

(72)

with

θ^{*} = \bar{θ} : = \sum_{i = 1}^{n} w_{i} θ_{i}

(this decomposition is used to prove Proposition 1 of [21]). The quantity

R_{F} (V) = R_{F} (V, θ^{*})

was termed the Bregman information [21,49]. The Bregman information generalizes the variance, which is recovered when the Bregman divergence is the squared Euclidean distance.

R_{F} (V)

could also be called Bregman information radius according to Sibson. Because

R_{F} (V) = \sum_{i = 1}^{n} w_{i} D_{KL} [p_{\bar{θ}} : p_{θ_{i}}]

, we can interpret the Bregman information as a Sibson’s information radius for densities of an exponential family with respect to the arithmetic mean

M_{1}^{R} = A

and the reverse Kullback–Leibler divergence:

D_{KL}^{*} [p : q] : = D_{KL} [q : p]

. This observation leads us to the JS-symmetrization of distances based on generalized information radii in Section 3.
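The decomposition of Equation (72) is easy to verify numerically. The sketch below assumes scalar parameters and, as an illustrative generator, F(θ) = exp(θ); the helper names `bregman` and `weighted_bregman_sum` are ours.

```python
import math

def bregman(F, grad_F, theta, theta_prime):
    """B_F(theta : theta') = F(theta) - F(theta') - (theta - theta') F'(theta'), scalar case."""
    return F(theta) - F(theta_prime) - (theta - theta_prime) * grad_F(theta_prime)

def weighted_bregman_sum(F, grad_F, weights, thetas, center):
    """R_F(V, theta) = sum_i w_i B_F(theta_i : theta)."""
    return sum(w * bregman(F, grad_F, t, center) for w, t in zip(weights, thetas))

F = math.exp        # illustrative strictly convex generator
grad_F = math.exp   # its derivative
weights = [0.2, 0.5, 0.3]
thetas = [-1.0, 0.3, 1.2]

# The minimizer is the weighted arithmetic mean of the parameters.
theta_bar = sum(w * t for w, t in zip(weights, thetas))
bregman_information = weighted_bregman_sum(F, grad_F, weights, thetas, theta_bar)

theta = 0.9  # any other candidate center
lhs = weighted_bregman_sum(F, grad_F, weights, thetas, theta) - bregman_information
rhs = bregman(F, grad_F, theta_bar, theta)
print(lhs, rhs)
```

With the generator F(θ) = θ², the Bregman divergence is the squared Euclidean distance and the same code returns the weighted variance of the parameters.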

More generally, we may consider the densities belonging to a deformed q-exponential family (see [10], pages 85–89, and the monograph [50]). Deformed q-exponential families generalize the exponential families, and include the q-Gaussians [10]. A common way to measure the statistical distance between two densities of a q-exponential family is the q-divergence [10], which is related to Tsallis’ entropy [51]. We may also define another statistical divergence between two densities of a q-exponential family which amounts to a Bregman divergence. For example, we refer to [52] for details concerning the family of Cauchy distributions, which are q-Gaussians for

q = 2

.

Sibson proved that the information radii of any order are all upper bounded (Theorem 2.8 and Theorem 2.9 of [1]) as follows:

\begin{matrix} R_{1} (P) & \leq & \sum_{i} w_{i} {log}_{2} \frac{1}{w_{i}} \leq {log}_{2} n < \infty, \end{matrix}

(73)

\begin{matrix} R_{α} (P) & \leq & \frac{α}{α - 1} {log}_{2} (\sum_{i} w_{i}^{\frac{1}{α}}) \leq {log}_{2} n < \infty, α \in (0, 1) \cup (1, \infty) \end{matrix}

(74)

\begin{matrix} R_{\infty} (P) & \leq & {log}_{2} n < \infty . \end{matrix}

(75)

We interpret Sibson’s upper bounds of Equations (73)–(75), as follows:

Proposition 2

(Information radius upper bound). The information radius of order α of a weighted set of distributions is upper bounded by the discrete Rényi entropy of order

\frac{1}{α}

of the weight distribution:

R_{α} (P) \leq H_{\frac{1}{α}}^{R} [w]

, where

H_{α}^{R} [w] : = \frac{1}{1 - α} {log}_{2} (\sum_{i} w_{i}^{α})

.
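The bound can be illustrated on discrete densities. The sketch below is a minimal check under the counting-measure assumption, with both sides measured in bits; the function names are ours, and the radius is evaluated through the closed-form optimal centers recalled earlier.

```python
import math

def information_radius(weights, densities, alpha):
    """R_alpha(P) in bits for discrete densities, using the closed-form optimal center."""
    n_points = len(densities[0])
    if alpha == 1.0:
        mixture = [sum(w * p[x] for w, p in zip(weights, densities)) for x in range(n_points)]
        return sum(w * sum(a * math.log2(a / m) for a, m in zip(p, mixture) if a > 0.0)
                   for w, p in zip(weights, densities))
    if alpha == math.inf:
        return math.log2(sum(max(p[x] for p in densities) for x in range(n_points)))
    mass = sum(sum(w * p[x]**alpha for w, p in zip(weights, densities))**(1.0 / alpha)
               for x in range(n_points))
    return alpha / (alpha - 1.0) * math.log2(mass)

def renyi_entropy(weights, order):
    """Discrete Renyi entropy of the weight vector in bits (Shannon entropy for order 1)."""
    if order == 1.0:
        return -sum(w * math.log2(w) for w in weights if w > 0.0)
    return math.log2(sum(w**order for w in weights if w > 0.0)) / (1.0 - order)

weights = [0.2, 0.5, 0.3]
densities = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
for alpha in (0.5, 1.0, 2.0, math.inf):
    # The entropy order is 1/alpha, which is 0 when alpha is infinite.
    print(alpha, information_radius(weights, densities, alpha),
          renyi_entropy(weights, 1.0 / alpha))
```

When the densities have pairwise disjoint supports, the two sides coincide, so the bound cannot be improved in general.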

3. JS-Symmetrization of Distances Based on Generalized Information Radius

Let us give the following definitions generalizing the information radius (i.e., Jensen-Shannon symmetrization of the distance when

| P | = 2

) and the ordinary Jensen-Shannon divergence:

Definition 2

(

(M, D)

-information radius). Let M be a weighted mean and D a distance. Subsequently, the generalized information radius for a weighted set of points (e.g., vectors or densities)

(w_{1}, p_{1}), \dots, (w_{n}, p_{n})

is:

R_{M, D} (P) : = min_{c \in D} M (D (p_{1} : c), \dots, D (p_{n} : c); w_{1}, \dots, w_{n}) .

Recall that we also defined the

(M, D)

-centroid in Definition 1 as follows:

c_{M, D} (P) : = arg min_{c \in D} M (D (p_{1} : c), \dots, D (p_{n} : c); w_{1}, \dots, w_{n}) .

When

M = A

, we recover the notion of Fréchet mean [48]. Notice that, although the minimum

R_{M, D} (P)

is unique, several generalized centroids

c_{M, D} (P)

may potentially exist, depending on

(M, D)

. In particular, Definition 2 and Definition 1 apply when D is a statistical distance, i.e., a distance between densities (Radon–Nikodym derivatives of corresponding probability measures with respect to a dominating measure

μ

).

The generalized information radius can be interpreted as a diversity index or an n-point distance. When

n = 2

, we get the following (2-point) distances, which are considered as a generalization of the Jensen-Shannon divergence or Jensen-Shannon symmetrization:

Definition 3

(M-vJS symmetrization of D). Let M be a mean and D a statistical distance. Subsequently, the variational Jensen-Shannon symmetrization of D is defined by the formula of a generalized information radius:

D_{M}^{vJS} [p : q] : = min_{c \in D} M (D [p : c], D [q : c]) .

We use the acronym

vJS

to distinguish it from the JS-symmetrization reported in [23]:

D_{M}^{JS} [p : q] = D_{M, A}^{JS} [p : q] : = \frac{1}{2} (D [p : {(p q)}_{\frac{1}{2}}^{M}] + D [q : {(p q)}_{\frac{1}{2}}^{M}]) .

We recover Sibson’s information radius

R_{α} [p : q]

induced by two densities p and q from Definition 3 as the

M_{α}^{R}

-vJS symmetrization of the Rényi divergence

D_{α}^{R}

. In particular,

{B_{F}}_{A}^{vJS}

is the Bregman information [21]. Notice that we may skew these generalized JSDs by taking a weighted mean

M_{β}

instead of M for

β \in (0, 1)

, yielding the general definition:

Definition 4

(Skew

M_{β}

-vJS symmetrization of D). Let

M_{β}

be a weighted mean and D a statistical distance. Subsequently, the variational skewed Jensen-Shannon symmetrization of D is defined by the formula of a generalized information radius:

\begin{matrix} D_{M_{β}}^{vJS} [p : q] : = min_{c \in D} M_{β} (D [p : c], D [q : c]) \end{matrix}

Example 1.

For example, the skewed Jensen–Bregman divergence of Equation (20) can be interpreted as a Jensen-Shannon symmetrization of the Bregman divergence

B_{F}

[12] since we have:

\begin{matrix} {B_{F}}_{A_{β}}^{vJS} (θ_{1} : θ_{2}) & = & min_{θ \in Θ} A_{β} (B_{F} (θ_{1} : θ), B_{F} (θ_{2} : θ)), \end{matrix}

(76)

\begin{matrix} = & min_{θ \in Θ} (1 - β) B_{F} (θ_{1} : θ) + β B_{F} (θ_{2} : θ), \end{matrix}

(77)

\begin{matrix} = & (1 - β) B_{F} (θ_{1} : (1 - β) θ_{1} + β θ_{2}) + β B_{F} (θ_{2} : (1 - β) θ_{1} + β θ_{2}), \end{matrix}

(78)

\begin{matrix} = : & {JB}_{F, β} (θ_{1} : θ_{2}) . \end{matrix}

(79)

Indeed, the Bregman barycenter

arg {min}_{θ \in Θ} (1 - β) B_{F} (θ_{1} : θ) + β B_{F} (θ_{2} : θ)

is unique and it corresponds to

θ = (1 - β) θ_{1} + β θ_{2}

, see [21]. The skewed Jensen–Bregman divergence

{JB}_{F, β} (θ_{1} : θ_{2})

can also be rewritten as an equivalent skewed Jensen divergence (see Equation (22)):

\begin{matrix} {JB}_{F, β} (θ_{1} : θ_{2}) & = & (1 - β) B_{F} (θ_{1} : (1 - β) θ_{1} + β θ_{2}) + β B_{F} (θ_{2} : (1 - β) θ_{1} + β θ_{2}), \end{matrix}

(80)

\begin{matrix} = & (1 - β) F (θ_{1}) + β F (θ_{2}) - F ((1 - β) θ_{1} + β θ_{2}), \end{matrix}

(81)

\begin{matrix} = : & J_{F, β} (θ_{1} : θ_{2}) . \end{matrix}

(82)
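A small numerical check of Example 1 follows. It is a sketch under illustrative assumptions: scalar parameters, the generator F(θ) = exp(θ), and a brute-force grid search standing in for the variational minimization; all function names are ours.

```python
import math

def F(theta):
    return math.exp(theta)       # illustrative strictly convex generator

def grad_F(theta):
    return math.exp(theta)

def bregman(a, b):
    """B_F(a : b) for scalar parameters."""
    return F(a) - F(b) - (a - b) * grad_F(b)

def objective(theta1, theta2, beta, theta):
    """Weighted arithmetic mean A_beta of the two Bregman divergences to the candidate theta."""
    return (1.0 - beta) * bregman(theta1, theta) + beta * bregman(theta2, theta)

def skew_jensen_bregman(theta1, theta2, beta):
    """JB_{F,beta}: the objective evaluated at the barycenter (1 - beta) theta1 + beta theta2."""
    return objective(theta1, theta2, beta, (1.0 - beta) * theta1 + beta * theta2)

def skew_jensen(theta1, theta2, beta):
    """J_{F,beta}: the gap in Jensen's inequality for F."""
    return (1.0 - beta) * F(theta1) + beta * F(theta2) - F((1.0 - beta) * theta1 + beta * theta2)

theta1, theta2, beta = -0.5, 1.5, 0.3
grid = [-2.0 + 0.001 * i for i in range(4001)]
grid_minimum = min(objective(theta1, theta2, beta, t) for t in grid)
print(grid_minimum, skew_jensen_bregman(theta1, theta2, beta), skew_jensen(theta1, theta2, beta))
```

The three printed values agree up to the grid resolution, which illustrates that the variational definition, the Jensen–Bregman form, and the Jensen form describe the same quantity.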

Example 2.

Consider a conformal Bregman divergence [53] that is defined by

B_{F, ρ} (θ_{1} : θ_{2}) = ρ (θ_{1}) B_{F} (θ_{1} : θ_{2}),

(83)

where

ρ (θ) > 0

is a conformal factor. Subsequently, we have

\begin{matrix} {B_{F, ρ}}_{A_{β}}^{vJS} (θ_{1} : θ_{2}) & = & min_{θ \in Θ} A_{β} (B_{F, ρ} (θ_{1} : θ), B_{F, ρ} (θ_{2} : θ)), \end{matrix}

(84)

\begin{matrix} = & min_{θ \in Θ} (1 - β) B_{F, ρ} (θ_{1} : θ) + β B_{F, ρ} (θ_{2} : θ), \end{matrix}

(85)

\begin{matrix} = & (1 - β) B_{F, ρ} (θ_{1} : γ_{1} θ_{1} + γ_{2} θ_{2}) + β B_{F, ρ} (θ_{2} : γ_{1} θ_{1} + γ_{2} θ_{2}), \end{matrix}

(86)

where

γ_{1} = \frac{(1 - β) ρ (θ_{1})}{(1 - β) ρ (θ_{1}) + β ρ (θ_{2})}

and

γ_{2} = \frac{β ρ (θ_{2})}{(1 - β) ρ (θ_{1}) + β ρ (θ_{2})} = 1 - γ_{1}

.

Notice that this definition is implicit and it can be made explicit when the centroid

c^{*} (p, q)

is unique:

D_{M_{β}}^{vJS} [p : q] = M_{β} (D [p : c^{*} (p, q)], D [q : c^{*} (p, q)]) .

(87)

In particular, when

D = D_{KL}

, the KLD, we obtain generalized skewed Jensen-Shannon divergences for

M_{β}

a weighted mean with

β \in (0, 1)

:

D_{vJS}^{M_{β}} [p : q] : = min_{c \in D} M_{β} (D_{KL} [p : c], D_{KL} [q : c]) .

(88)

Example 3.

Amari [31] obtained the

(A, D_{α})

-information radius and its corresponding unique centroid for

D_{α}

, the α-divergence of information geometry [10] (page 67).

Example 4.

Brekelmans et al. [54] studied the geometric path

{(p_{1} p_{2})}_{β}^{G} (x) \propto p_{1}^{1 - β} (x) p_{2}^{β} (x)

between two distributions

p_{1}

and

p_{2}

of

D

, where

G_{β} (a, b) = a^{1 - β} b^{β}

(with

a, b > 0

) is the weighted geometric mean. They proved the variational formula:

{(p_{1} p_{2})}_{β}^{G} = arg min_{c \in D} (1 - β) D_{KL} [c : p_{1}] + β D_{KL} [c : p_{2}] .

(89)

That is,

{(p_{1} p_{2})}_{β}^{G}

is a

G_{β}

-

D_{KL}^{*}

centroid, where

D_{KL}^{*}

is the reverse KLD. The corresponding

(G_{β}, D_{KL}^{*})

-vJSD is studied in [23] and it is used in deep learning in [30].
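The variational characterization of the geometric path in Example 4 can be illustrated on discrete densities. The sketch below relies on the elementary identity (1 − β) D_KL[c : p_1] + β D_KL[c : p_2] = D_KL[c : g] − log Z, where g is the normalized geometric mixture and Z its normalizer; this identity is our own added reasoning step, and the function names are illustrative.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL[p:q] in nats between discrete densities."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def geometric_mixture(p1, p2, beta):
    """Normalized weighted geometric mean p1^(1-beta) p2^beta, returned with its normalizer Z."""
    unnormalized = [a**(1.0 - beta) * b**beta for a, b in zip(p1, p2)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized], z

def reverse_kl_objective(c, p1, p2, beta):
    """(1 - beta) D_KL[c:p1] + beta D_KL[c:p2], the quantity minimized over c."""
    return (1.0 - beta) * kl(c, p1) + beta * kl(c, p2)

p1 = [0.6, 0.3, 0.1]
p2 = [0.2, 0.2, 0.6]
beta = 0.4
g, z = geometric_mixture(p1, p2, beta)

# At the geometric mixture the objective equals -log Z (a skewed Bhattacharyya distance).
print(reverse_kl_objective(g, p1, p2, beta), -math.log(z))

# Any other candidate c pays an extra D_KL[c:g] >= 0.
c = [0.3, 0.3, 0.4]
print(reverse_kl_objective(c, p1, p2, beta) - reverse_kl_objective(g, p1, p2, beta), kl(c, g))
```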

It is interesting to study the link between

(M_{β}, D)

-variational Jensen-Shannon symmetrization of D and the

(M_{α}^{'}, N_{β}^{'})

-JS symmetrization of [23], and in particular the link between

M_{β}

for averaging in the minimization and

M_{α}^{'}

the mean for generating abstract mixtures.

More generally, Brekelmans et al. [55] considered the α-divergences extended to positive measures (i.e., a separable divergence built as the difference between a weighted arithmetic mean and a geometric mean [56]):

D_{α}^{e} [p : q] : = \frac{4}{1 - α^{2}} \int_{X} (\frac{1 - α}{2} p (x) + \frac{1 + α}{2} q (x) - p^{\frac{1 - α}{2}} (x) q^{\frac{1 + α}{2}} (x)) d μ (x)

(90)

and proved that

c_{β}^{*} = arg min_{c \in D} {(1 - β) D_{α}^{e} [p_{1} : c] + β D_{α}^{e} [p_{2} : c]}

(91)

is a density of a likelihood ratio q-exponential family:

c_{β}^{*} = \frac{p_{1} (x)}{Z_{β, q}} {exp}_{q} (β {log}_{q} \frac{p_{2} (x)}{p_{1} (x)})

for

q = \frac{1 + α}{2}

. That is,

c_{β}^{*}

is the

(A_{β}, D_{α}^{e})

-generalized centroid, and the corresponding information radius is the variational JS symmetrization:

{D_{α}^{e}}^{vJS} [p_{1} : p_{2}] = (1 - β) D_{α}^{e} [p_{1} : c_{β}^{*}] + β D_{α}^{e} [p_{2} : c_{β}^{*}]

(92)

Example 5.

The q-divergence [57]

D_{q}

between two densities of a q-exponential family amounts to a Bregman divergence [10,57]. Thus,

D_{q}^{vJS}

for

M = A

is a generalized information radius that amounts to a Bregman information.

For the case

α = \infty

in Sibson’s information radius, we find that the information radius is related to the total variation:

Proposition 3

(Lemma 2.4 [1]).

D_{\infty}^{vJS, R} [p : q] = {log}_{2} (1 + D_{TV} [p : q]),

(93)

where

D_{TV}

denotes the total variation

D_{TV} [p : q] = \frac{1}{2} \int_{X} | p (x) - q (x) | d μ (x) .

(94)

Proof.

Because

max {p (x), q (x)} = \frac{p (x) + q (x)}{2} + \frac{1}{2} | q (x) - p (x) |

, it follows that we have:

\int_{X} max {p (x), q (x)} d μ (x) = 1 + D_{TV} [p : q] .

From Theorem 1, we have

R_{\infty} (\{(\frac{1}{2}, p), (\frac{1}{2}, q)\}) = {log}_{2} \int_{X} max \{p (x), q (x)\} d μ (x)

and, therefore,

R_{\infty} (\{(\frac{1}{2}, p), (\frac{1}{2}, q)\}) = {log}_{2} (1 + D_{TV} [p : q])

. □
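For discrete densities, the identity of Proposition 3 reduces to a one-line check. The sketch below assumes the counting measure; the function names are ours.

```python
import math

def total_variation(p, q):
    """D_TV[p:q] = (1/2) sum_x |p(x) - q(x)|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def radius_infinity(p, q):
    """R_infinity(p, q) = log2 of the total mass of the upper envelope max{p, q}."""
    return math.log2(sum(max(a, b) for a, b in zip(p, q)))

p = [0.6, 0.3, 0.1]
q = [0.2, 0.2, 0.6]
print(radius_infinity(p, q), math.log2(1.0 + total_variation(p, q)))
```

Since the total variation lies in [0, 1], the order-∞ radius lies in [0, 1] bit, with the upper value reached for densities with disjoint supports.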

Notice that, when

M = M_{g}

is a quasi-arithmetic mean, we may consider the divergence

D_{g} [p : q] = g^{- 1} (D [p : q])

, so that the centroid of the

(M_{g}, D_{g})

-JS symmetrization is:

arg min_{c} g^{- 1} (\sum_{i = 1}^{n} w_{i} D [p_{i} : c]) \equiv arg min_{c} \sum_{i = 1}^{n} w_{i} D [p_{i} : c] .

(95)

The generalized

α

-skewed Bhattacharyya divergence [29] can also be considered with respect to a weighted mean

M_{α}

:

D_{Bhat, M_{α}} [p : q] = - log \int_{X} M_{α} (p (x), q (x)) d μ (x) .

In particular, when

M_{α}

is a quasi-arithmetic weighted mean that is induced by a continuous and strictly monotone function g, we have

D_{Bhat, g, α} [p : q] : = - log \int_{X} M_{g} (p (x), q (x); α) d μ (x) = : D_{Bhat, {(M_{g})}_{α}} [p : q] .

Because

min {p (x), q (x)} \leq M_{g} (p (x), q (x); α) \leq max {p (x), q (x)}

,

min {a, b} = \frac{a + b}{2} - \frac{| b - a |}{2}

and

max {a, b} = \frac{a + b}{2} + \frac{| b - a |}{2}

, we deduce that we have:

0 \leq 1 - D_{TV} [p, q] \leq \int_{X} M_{g} (p (x), q (x); α) d μ (x) \leq 1 + D_{TV} [p, q] \leq 2 .

(96)
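The sandwich of Equation (96) can be observed on discrete densities for several quasi-arithmetic means. In this sketch, the weighting convention M_g(a, b; α) = g^{-1}((1 − α) g(a) + α g(b)) is an assumption consistent with the weighted arithmetic mean used in this paper, and strictly positive densities are assumed so that the geometric and harmonic means are well defined.

```python
import math

def quasi_arithmetic_mean(g, g_inverse, a, b, alpha):
    """Weighted quasi-arithmetic mean g^{-1}((1 - alpha) g(a) + alpha g(b)) of two positive reals."""
    return g_inverse((1.0 - alpha) * g(a) + alpha * g(b))

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def mean_mass(g, g_inverse, p, q, alpha):
    """Total mass of the pointwise mean M_g(p(x), q(x); alpha) under the counting measure."""
    return sum(quasi_arithmetic_mean(g, g_inverse, a, b, alpha) for a, b in zip(p, q))

generators = {
    "arithmetic": (lambda u: u, lambda v: v),
    "geometric": (math.log, math.exp),
    "harmonic": (lambda u: 1.0 / u, lambda v: 1.0 / v),
}

p = [0.6, 0.3, 0.1]
q = [0.2, 0.2, 0.6]
tv = total_variation(p, q)
for name, (g, g_inverse) in generators.items():
    mass = mean_mass(g, g_inverse, p, q, 0.3)
    # Generalized skewed Bhattacharyya divergence: minus the logarithm of the mass.
    print(name, 1.0 - tv, mass, 1.0 + tv, -math.log(mass))
```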

The information radius of Sibson for

α \in (0, 1) \cup (1, \infty)

may be interpreted as generalized scaled

α

-skewed Bhattacharyya divergences with respect to the power means

P_{α}

, since we have

R_{α} (p, q) = \frac{α}{α - 1} {log}_{2} \int_{X} P_{α} (p (x), q (x); α) d μ (x) = \frac{α}{1 - α} D_{Bhat, P_{α}} [p : q]

.

4. Relative Information Radius and Relative Jensen-Shannon Symmetrizations of Distances

4.1. Relative Information Radius

In this section, instead of considering the full space of densities

D

on

(X, F, μ)

for performing the variational optimization of the information radius, we rather consider a subfamily of (parametric) densities

R \subset D

. Subsequently, we define accordingly the

R

-relative Jensen-Shannon divergence (

R

-JSD for short) as

D_{vJS}^{R} [p : q] : = min_{c \in R} \{\frac{1}{2} D_{KL} [p : c] + \frac{1}{2} D_{KL} [q : c]\} .

(97)

In particular, Sibson [1] considered the normal information radius, i.e., the

R

-relative Jensen-Shannon divergence with

R = {N (μ, Σ) : (μ, Σ) \in R^{d} \times P_{+ +}^{d}}

, where

P_{+ +}^{d}

denotes the open cone of

d \times d

positive-definite matrices (positive-definite covariance matrices of Gaussian distributions). More generally, we may consider any exponential family

E

[2].

Definition 5

(Relative

(R, M)

-JS symmetrization of D). Let M be a mean and D a statistical distance. Subsequently, the relative

(R, M)

-JS symmetrization of D is:

D_{M, R}^{vJS} [p : q] : = min_{c \in R} M (D [p : c], D [q : c]) .

We obtain the relative Jensen-Shannon divergences when

D = D_{KL}

.

Example 6.

Grosse et al. [58] considered geometric and moment average paths for annealing. They proved that, when

p_{1} = p_{θ_{1}}

and

p_{2} = p_{θ_{2}}

belong to an exponential family [2]

E_{F}

with cumulant function F, we have

{(p_{1} p_{2})}_{β}^{G} = \frac{p_{1} {(x)}^{1 - β} p_{2} {(x)}^{β}}{\int p_{1} {(x)}^{1 - β} p_{2} {(x)}^{β} d μ (x)} = arg min_{c \in E_{F}} \{(1 - β) D_{KL} [c : p_{1}] + β D_{KL} [c : p_{2}]\},

(98)

and

p_{\bar{η}} = arg min_{c \in E_{F}} \{(1 - β) D_{KL} [p_{1} : c] + β D_{KL} [p_{2} : c]\},

(99)

where

\bar{η} = (1 - β) η_{1} + β η_{2}

,

η_{i} = E_{p_{θ_{i}}} [t (x)]

(this is not an arithmetic mixture, but an exponential family density moment parameter that is a mixture of the parameters).

The corresponding minima can be interpreted as relative skewed Jensen-Shannon symmetrization for the reverse KLD

D_{KL}^{*}

(Equation (98)) and the relative skewed Jensen-Shannon divergence (Equation (99)):

\begin{matrix} {D_{KL}^{*}}_{A_{β}, E_{F}}^{vJS} [p_{1} : p_{2}] & = & min_{c \in E_{F}} \{(1 - β) D_{KL}^{*} [p_{1} : c] + β D_{KL}^{*} [p_{2} : c]\}, \end{matrix}

(100)

\begin{matrix} D_{A_{β}, E_{F}}^{vJS} [p_{1} : p_{2}] & = & min_{c \in E_{F}} \{(1 - β) D_{KL} [p_{1} : c] + β D_{KL} [p_{2} : c]\}, \end{matrix}

(101)

where

A_{β} (a, b) : = (1 - β) a + β b

is the weighted arithmetic mean for

β \in (0, 1)

.

Notice that, when

p = q

, we have

D_{M, R}^{vJS} [p : p] = {min}_{c \in R} D [p : c]

, which is the information projection [59] with respect to D of density p to the submanifold

R

. Thus, when

p \notin R

, we have

D_{M, R}^{vJS} [p : p] > 0

, i.e., the relative JSDs are not proper divergences, since a proper divergence ensures that

D [p : q] \geq 0

with equality if and only if

p = q

. Figure 1 illustrates the main cases of the relative Jensen-Shannon divergence between p and q: either p and q are both inside or both outside

R

, or one point is inside

R

, while the other point is outside

R

. When

p = q

, we get an information projection when both of the points are outside

R

, and

D_{vJS}^{R} [p : p] = 0

when

p \in R

. When

p, q \in R

with

p \neq q

, the value

D_{vJS}^{R} [p : q]

corresponds to the information radius (and the arg min to the right-sided Kullback–Leibler centroid).
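For two univariate Gaussians lying inside the family of Gaussians, the relative Jensen-Shannon divergence can be sketched in closed form. The code below assumes that the minimizing Gaussian is obtained by averaging the moment parameters (E[x], E[x²]), in line with the moment-averaging characterization of Example 6 with β = 1/2; the univariate Gaussian KLD formula is the standard one, and the function names are ours.

```python
import math

def kl_gauss(mu1, var1, mu2, var2):
    """D_KL[N(mu1, var1) : N(mu2, var2)] in nats for univariate Gaussians."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def moment_matched_center(mu1, var1, mu2, var2):
    """Gaussian whose moment parameters (E[x], E[x^2]) are the averages of those of the inputs."""
    mu = 0.5 * (mu1 + mu2)
    second_moment = 0.5 * (var1 + mu1 ** 2) + 0.5 * (var2 + mu2 ** 2)
    return mu, second_moment - mu ** 2

def relative_jsd(mu1, var1, mu2, var2):
    """Average KLD from each Gaussian to the moment-matched Gaussian center."""
    mu, var = moment_matched_center(mu1, var1, mu2, var2)
    return 0.5 * kl_gauss(mu1, var1, mu, var) + 0.5 * kl_gauss(mu2, var2, mu, var)

print(moment_matched_center(0.0, 1.0, 3.0, 4.0))
print(relative_jsd(0.0, 1.0, 3.0, 4.0))
print(relative_jsd(1.0, 2.0, 1.0, 2.0))  # identical inputs inside the family
```

The last call illustrates the case p = q inside the family, where the divergence vanishes.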

4.2. Relative Jensen-Shannon Divergences: Applications to Density Clustering and Quantization

Let

D_{KL} [p : q_{θ}]

be the Kullback–Leibler divergence between an arbitrary density p and a density

q_{θ}

of an exponential family

Q = {q_{θ} : θ \in Θ}

. Let us canonically express [2,60] the density

q_{θ} (x)

, as

q_{θ} (x) = exp (θ^{⊤} t_{Q} (x) - F_{Q} (θ) + k_{Q} (x)),

where

t_{Q} (x)

denotes the sufficient statistics,

k_{Q} (x)

is an auxiliary carrier measure term (e.g.,

k (x) = 0

for the Gaussian family and

k (x) = log (x)

for the Rayleigh family [60]), and

F_{Q} (θ)

the cumulant function. Assume that we know in closed-form the following quantities:

$m_{p} : = E_{p} [t_{Q} (x)] = \int p (x) t_{Q} (x) d μ (x)$ and
the Shannon entropy $h [p] = - \int p (x) log p (x) d μ (x)$ of p.

Subsequently, we can express the KLD using a semi-closed-form formula.

Proposition 4.

Let

q_{θ} \in Q

be a density of an exponential family and p an arbitrary density with

m_{p} = E_{p} [t_{Q} (x)]

. Subsequently, the Kullback–Leibler divergence between p and

q_{θ}

is expressed as:

D_{KL} [p : q_{θ}] = F_{Q} (θ) - m_{p}^{⊤} θ - E_{p} [k_{Q} (x)] - h [p],

(102)

where

h [p : q_{θ}] = F_{Q} (θ) - m_{p}^{⊤} θ - E_{p} [k_{Q} (x)]

is the cross-entropy between p and

q_{θ}

.

Proof.

The proof is straightforward since

log q_{θ} (x) = θ^{⊤} t_{Q} (x) - F_{Q} (θ) + k_{Q} (x)

. Therefore, we have:

\begin{matrix} D_{KL} [p : q_{θ}] & = & h [p : q_{θ}] - h [p], \end{matrix}

(103)

\begin{matrix} = & - \int_{X} p (x) log q_{θ} (x) d μ (x) - h [p], \end{matrix}

(104)

\begin{matrix} = & F_{Q} (θ) - m_{p}^{⊤} θ - E_{p} [k_{Q} (x)] - h [p] . \end{matrix}

(105)

□

Example 7.

For example, when

q_{θ} = q_{μ, Σ}

is the density of a multivariate Gaussian distribution

N (μ, Σ)

(with

k_{N} (x) = 0

), we have

D_{KL} [p : q_{μ, Σ}] = \frac{1}{2} (log | 2 π Σ | + {(μ - m)}^{⊤} Σ^{- 1} (μ - m) + tr (Σ^{- 1} S)) - h [p],

(106)

where

m = μ (p) = E_{p} [X]

and

S = Cov (p) : = E_{p} [X X^{⊤}] - E_{p} [X] E_{p} {[X]}^{⊤}

.

The formula of Proposition 4 is said to be in semi-closed form, because it relies on knowing both the entropy h of p and the sufficient statistic moments

E_{p} [t_{Q} (x)]

. Yet, this semi-closed formula may prove to be useful in practice: For example, we can answer the comparison predicate

“Is

D_{KL} [p : q_{θ_{1}}] \geq D_{KL} [p : q_{θ_{2}}]

or not?”

by checking whether

F_{Q} (θ_{1}) - F_{Q} (θ_{2}) - m_{p}^{⊤} (θ_{1} - θ_{2}) \geq 0

or not (i.e., the terms

- E_{p} [k_{Q} (x)] - h [p]

in Equation (102) cancel out). Thus, we get a closed-form predicate, although

D_{KL}

is only known in semi-closed-form. This KLD comparison predicate shall be used later on when clustering densities with respect to centroids in Section 4.2.
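The comparison predicate only needs the cumulant function and the moment m_p. The sketch below is a scalar-parameter illustration in which Q is assumed to be the family of zero-centered Gaussians (sufficient statistic x², natural parameter −1/(2σ²), cumulant function ½ log(π/(−θ))) and p a Laplacian density of scale 1, for which E_p[x²] = 2; the function names are ours.

```python
import math

def kl_is_at_least(cumulant, theta1, theta2, m_p):
    """True when D_KL[p:q_theta1] >= D_KL[p:q_theta2], using only F_Q and m_p = E_p[t_Q(x)]."""
    return cumulant(theta1) - cumulant(theta2) - m_p * (theta1 - theta2) >= 0.0

def cumulant_centered_gaussian(theta):
    """F_Q(theta) = (1/2) log(pi / (-theta)) for zero-centered Gaussians."""
    return 0.5 * math.log(math.pi / -theta)

def natural_parameter(sigma_prime):
    """theta = -1 / (2 sigma'^2) for the zero-centered Gaussian of standard deviation sigma'."""
    return -1.0 / (2.0 * sigma_prime ** 2)

m_p = 2.0  # E_p[x^2] for a Laplacian density of scale 1
# Is the Gaussian with sigma' = 3 at least as far from p as the one with sigma' = 1.4?
print(kl_is_at_least(cumulant_centered_gaussian,
                     natural_parameter(3.0), natural_parameter(1.4), m_p))
```

Neither the entropy of p nor the carrier term enters the test, which is what makes the predicate usable when the divergence itself is only available in semi-closed form.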

Remark 1.

Note that when

Y = f (X)

for an invertible and differentiable transformation f then we have

h [Y] = h [X] + E_{X} [log | det J_{f} (X) |]

where

J_{f}

denotes the Jacobian matrix. For example, when

Y = f (X) = A X

, we have

h [Y] = h [X] + log | det A |

.

When p belongs to an exponential family

P

(

P

may be different from

Q

) with cumulant function

F_{P}

, sufficient statistics

t_{P} (x)

, auxiliary carrier term

k_{P} (x)

, and natural parameter

θ

, we have the entropy [61] expressed, as follows:

\begin{matrix} h [p] & = & F_{P} (θ) - θ^{⊤} \nabla F_{P} (θ) - E_{p} [k_{P} (x)], \end{matrix}

(107)

\begin{matrix} = & - F_{P}^{*} (η) - E_{p} [k_{P} (x)], \end{matrix}

(108)

where

F_{P}^{*} (η) = θ^{⊤} \nabla F_{P} (θ) - F_{P} (θ)

is the Legendre transform of

F_{P} (θ)

and

η = η (θ) = \nabla F_{P} (θ)

is called the moment parameter since we have

η (θ) = E_{p} [t_{P} (x)]

[2,60].

The following proposition then refines Proposition 4 when

p = p_{θ} \in P

:

Proposition 5.

Let

p_{θ}

be a density of an exponential family

P

and

q_{θ^{'}}

be a density of an exponential family

Q

. Subsequently, the Kullback–Leibler divergence between

p_{θ}

and

q_{θ^{'}}

is expressed as:

D_{KL} [p_{θ} : q_{θ^{'}}] = F_{Q} (θ^{'}) + F_{P}^{*} (η) - E_{p_{θ}} {[t_{Q} (x)]}^{⊤} θ^{'} + E_{p_{θ}} [k_{P} (x) - k_{Q} (x)] .

(109)

Proof.

We have

\begin{matrix} D_{KL} [p_{θ} : q_{θ^{'}}] & = & h [p_{θ} : q_{θ^{'}}] - h [p_{θ}], \end{matrix}

(110)

\begin{matrix} = & F_{Q} (θ^{'}) - m_{p_{θ}}^{⊤} θ^{'} - E_{p_{θ}} [k_{Q} (x)] + F_{P}^{*} (η) + E_{p_{θ}} [k_{P} (x)], \end{matrix}

(111)

\begin{matrix} = & F_{Q} (θ^{'}) + F_{P}^{*} (η) - E_{p_{θ}} {[t_{Q} (x)]}^{⊤} θ^{'} + E_{p_{θ}} [k_{P} (x) - k_{Q} (x)] . \end{matrix}

(112)

□

In particular, when p and q both belong to the same exponential family (i.e.,

P = Q

with

k_{P} (x) = k_{Q} (x)

), we have

F (θ) : = F_{P} (θ) = F_{Q} (θ)

and

E_{p_{θ}} [t_{Q} (x)] = \nabla F (θ) = : η

, and

D_{KL} [p_{θ} : q_{θ^{'}}] = F (θ^{'}) + F^{*} (η) - θ^{' ⊤} η .

This last equation is the Fenchel–Young divergence in Bregman manifolds [34,62] (called dually flat spaces in information geometry [10]). Thus the divergence can be rewritten as equivalent dual Bregman divergences:

\begin{matrix} D_{KL} [p_{θ} : q_{θ^{'}}] & = & F (θ^{'}) + F^{*} (η) - η^{⊤} θ^{'}, \end{matrix}

(113)

\begin{matrix} = & B_{F} (θ^{'} : θ), \end{matrix}

(114)

\begin{matrix} = & B_{F^{*}} (η : η^{'}), \end{matrix}

(115)

where

η^{'} = \nabla F (θ^{'})

.
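The equivalence of Equations (113)–(115) can be checked on a concrete family. The sketch below assumes the Poisson family, for which F(θ) = exp(θ) with θ = log λ, η = λ, and F*(η) = η log η − η (standard facts that are not taken from this paper), and compares the three expressions with a direct evaluation of the KLD by summation; the function names are ours.

```python
import math

def F(theta):
    return math.exp(theta)               # cumulant function of the Poisson family

def F_star(eta):
    return eta * math.log(eta) - eta     # Legendre transform of F

def fenchel_young(theta_prime, eta):
    """F(theta') + F*(eta) - eta * theta'."""
    return F(theta_prime) + F_star(eta) - eta * theta_prime

def bregman(G, grad_G, a, b):
    """B_G(a : b) for scalar arguments."""
    return G(a) - G(b) - (a - b) * grad_G(b)

def kl_poisson_by_summation(rate1, rate2, x_max=400):
    """Direct evaluation of sum_x p1(x) log(p1(x) / p2(x)) in nats (truncated at x_max)."""
    total = 0.0
    for x in range(x_max):
        log_p1 = x * math.log(rate1) - rate1 - math.lgamma(x + 1)
        log_p2 = x * math.log(rate2) - rate2 - math.lgamma(x + 1)
        total += math.exp(log_p1) * (log_p1 - log_p2)
    return total

rate, rate_prime = 2.0, 5.0
theta, theta_prime = math.log(rate), math.log(rate_prime)
eta, eta_prime = rate, rate_prime

print(kl_poisson_by_summation(rate, rate_prime))
print(fenchel_young(theta_prime, eta))                  # Equation (113)
print(bregman(F, math.exp, theta_prime, theta))         # Equation (114)
print(bregman(F_star, math.log, eta, eta_prime))        # Equation (115)
```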

Notice that

D_{KL} [p_{θ} : Q] : = {min}_{θ^{'} \in Θ^{'}} D_{KL} [p_{θ} : q_{θ^{'}}]

is attained at a unique minimizer, which is characterized by the moment-matching condition

η^{'} = \nabla F_{Q} (θ^{'}) = E_{p_{θ}} [t_{Q} (x)]

.

Let us report two examples of calculations of the KLD between two densities of two exponential families.

Example 8.

For the first exponential family, consider the family of Laplacian distributions:

P = L = \{p_{σ} (x) : = \frac{1}{2 σ} exp (- \frac{| x |}{σ}) : σ > 0\} .

The canonical decomposition of the density yields

t_{L} (x) = | x |

,

θ = - \frac{1}{σ}

,

k_{L} (x) = 0

, and

F_{L} (θ) = log \frac{2}{- θ}

(i.e.,

F_{L} (θ (σ)) = log 2 σ

). It follows that

η (θ) = F_{L}^{'} (θ) = - \frac{1}{θ}

(

η (σ) = σ = E [| x |]

),

θ (η) = - \frac{1}{η}

, and

F_{L}^{*} (η) = - 1 - log (2 η)

and, therefore,

F_{L}^{*} (η (σ)) = - 1 - log (2 σ)

.

For the second family, consider the exponential family of zero-centered Gaussian distributions:

Q = N_{0} = \{q_{σ^{'}} (x) = \frac{1}{\sqrt{2 π {(σ^{'})}^{2}}} exp (- \frac{x^{2}}{2 {(σ^{'})}^{2}}) : σ^{'} > 0\} .

We have

t_{N_{0}} (x) = x^{2}

,

k_{N_{0}} (x) = 0

,

θ^{'} = - \frac{1}{2 {(σ^{'})}^{2}}

, and

F_{N_{0}} (σ^{'}) = \frac{1}{2} log (2 π {(σ^{'})}^{2})

.

Moreover, let us calculate

E_{p_{σ}} [t_{N_{0}} (x)] = E_{p_{σ}} [x^{2}] = 2 σ^{2}

. Subsequently, we can calculate the Kullback–Leibler divergence between

p_{σ} \sim L (σ)

and

q_{σ^{'}} \sim N_{0} (σ^{'})

, as follows:

\begin{matrix} D_{KL} [p_{σ} : q_{σ^{'}}] & = & F_{Q} (θ^{'} (σ^{'})) + F_{P}^{*} (η (σ)) - E_{p_{σ}} {[t_{Q} (x)]}^{⊤} θ^{'} (σ^{'}) + E_{p_{σ}} [k_{P} (x) - k_{Q} (x)], \end{matrix}

(116)

\begin{matrix} = & \frac{1}{2} log (2 π {(σ^{'})}^{2}) - 1 - log (2 σ) - 2 σ^{2} (- \frac{1}{2 {(σ^{'})}^{2}}), \end{matrix}

(117)

\begin{matrix} = & log (\frac{σ^{'}}{σ}) + {(\frac{σ}{σ^{'}})}^{2} + \frac{1}{2} log (\frac{π}{2}) - 1 . \end{matrix}

(118)

Notice that

D_{KL} [p_{σ} : q_{σ^{'}}] \geq 0

, but never 0 since

P \cap Q = \emptyset

.

Let us now compute the reverse Kullback–Leibler divergence

D_{KL} [q_{σ^{'}} : p_{σ}]

. We first calculate

E_{q_{σ^{'}}} [t_{L} (x)] = E_{q_{σ^{'}}} [| x |] = \sqrt{\frac{2}{π}} σ^{'}

. Since

F_{Q} (θ^{'}) = \frac{1}{2} log (\frac{π}{- θ^{'}})

, we have

η^{'} (θ^{'}) = F_{Q}^{'} (θ^{'}) = - \frac{1}{2 θ^{'}}

. Thus

η^{'} (σ^{'}) = {(σ^{'})}^{2}

and

F_{Q}^{*} (η^{'}) = - \frac{1}{2} - \frac{1}{2} log (2 π η^{'})

. Therefore, we get

F_{Q}^{*} (η^{'} (σ^{'})) = - h [q_{σ^{'}}] = - \frac{1}{2} log (2 π e {(σ^{'})}^{2})

.

It follows that

\begin{matrix} D_{KL} [q_{σ^{'}} : p_{σ}] & = & F_{P} (θ (σ)) + F_{Q}^{*} (η^{'} (σ^{'})) - E_{q_{σ^{'}}} {[t_{P} (x)]}^{⊤} θ (σ) + E_{q_{σ^{'}}} [k_{Q} (x) - k_{P} (x)], \end{matrix}

(119)

\begin{matrix} = & log (2 σ) - \frac{1}{2} log (2 π e {(σ^{'})}^{2}) - \sqrt{\frac{2}{π}} σ^{'} \times (- \frac{1}{σ}), \end{matrix}

(120)

\begin{matrix} = & \sqrt{\frac{2}{π}} \frac{σ^{'}}{σ} + log (\frac{σ}{σ^{'}}) - \frac{1}{2} log (\frac{π}{2} e) . \end{matrix}

(121)

Again, we have

D_{KL} [q_{σ^{'}} : p_{σ}] \geq 0

, but never 0, because

P \cap Q = \emptyset

.
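Both closed-form expressions of Example 8 can be compared with numerical integration. The sketch below uses a composite Simpson rule on [0, 60] and doubles the result by symmetry; the integration bounds, the step count, and the function names are our illustrative assumptions.

```python
import math

def log_laplace(x, sigma):
    """Log-density of the Laplacian distribution of scale sigma."""
    return -math.log(2.0 * sigma) - abs(x) / sigma

def log_gauss0(x, sigma_prime):
    """Log-density of the zero-centered Gaussian of standard deviation sigma_prime."""
    return -0.5 * math.log(2.0 * math.pi * sigma_prime ** 2) - x ** 2 / (2.0 * sigma_prime ** 2)

def kl_numeric(log_p, log_q, upper=60.0, steps=20000):
    """Simpson estimate of D_KL[p:q] for densities symmetric about 0 (steps must be even)."""
    h = upper / steps
    def integrand(x):
        lp = log_p(x)
        return math.exp(lp) * (lp - log_q(x))
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return 2.0 * total * h / 3.0  # factor 2: both half-lines contribute equally

def kl_laplace_to_gauss(sigma, sigma_prime):
    """Closed form of Equation (118)."""
    return (math.log(sigma_prime / sigma) + (sigma / sigma_prime) ** 2
            + 0.5 * math.log(math.pi / 2.0) - 1.0)

def kl_gauss_to_laplace(sigma_prime, sigma):
    """Closed form of Equation (121)."""
    return (math.sqrt(2.0 / math.pi) * sigma_prime / sigma + math.log(sigma / sigma_prime)
            - 0.5 * math.log(math.pi * math.e / 2.0))

sigma, sigma_prime = 1.0, 1.5
p = lambda x: log_laplace(x, sigma)
q = lambda x: log_gauss0(x, sigma_prime)
print(kl_numeric(p, q), kl_laplace_to_gauss(sigma, sigma_prime))
print(kl_numeric(q, p), kl_gauss_to_laplace(sigma_prime, sigma))
```

Working with log-densities avoids taking the logarithm of an underflowed Gaussian tail.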

Example 9.

Let us use the formula of Equation (109) to calculate the KLD between two Weibull distributions [63]. A Weibull distribution of shape

κ > 0

and scale

σ > 0

has a density defined on

X = [0, \infty)

, as follows:

p_{κ, σ}^{Wei} (x) : = \frac{κ}{σ} {(\frac{x}{σ})}^{κ - 1} exp (- {(\frac{x}{σ})}^{κ}) .

For a fixed shape κ, the set of Weibull distributions

W_{κ} : = \{p_{κ, σ}^{Wei} : σ > 0\}

forms an exponential family with natural parameter

θ = - \frac{1}{σ^{κ}}

, sufficient statistic

t_{κ} (x) = x^{κ}

, auxiliary carrier term

k_{κ} (x) = (κ - 1) log x + log κ

, and cumulant function

F_{κ} (θ) = - log (- θ)

(so that

F_{κ} (θ (σ)) = F_{κ} (σ) = κ log σ

):

p_{κ, σ}^{Wei} (x) = exp (- \frac{1}{σ^{κ}} x^{κ} + log \frac{1}{σ^{κ}} + k_{κ} (x)) .

We recover the exponential family of exponential distributions of rate parameter

λ = \frac{1}{σ}

when

κ = 1

:

\begin{matrix} p_{λ}^{Exp} (x) & = & p_{1, σ}^{Wei} (x) = \frac{1}{σ} exp (- \frac{x}{σ}), \\ & = & λ exp (- λ x), \end{matrix}

and the exponential family of Rayleigh distributions when

κ = 2

with scale parameter

σ_{Ray} = \frac{σ}{\sqrt{2}}

:

\begin{matrix} p_{σ_{Ray}}^{Ray} (x) & = & p_{2, σ}^{Wei} (x) = \frac{2 x}{σ^{2}} exp (- \frac{x^{2}}{σ^{2}}), \\ & = & \frac{x}{σ_{Ray}^{2}} exp (- \frac{x^{2}}{2 σ_{Ray}^{2}}) . \end{matrix}

Now, assume that we are given the differential entropy of the Weibull distributions [64] (pp. 155–156):

h [p_{κ_{1}, σ_{1}}^{Wei}] = γ (1 - \frac{1}{κ_{1}}) + log \frac{σ_{1}}{κ_{1}} + 1,

where

γ \approx 0.5772156649

is the Euler–Mascheroni constant, and the Weibull raw moments [64] (p. 155):

m = E_{p_{κ_{1}, σ_{1}}^{Wei}} [x^{κ_{2}}] = σ_{1}^{κ_{2}} Γ (1 + \frac{κ_{2}}{κ_{1}}),

where

Γ (x) = \int_{0}^{\infty} t^{x - 1} e^{- t} d t

is the gamma function (with

Γ (n) = (n - 1)!

for integers n). Because

h [p_{κ, σ}^{Wei}] = F_{κ} (θ) - θ^{⊤} \nabla F_{κ} (θ) - E_{p_{κ, σ}^{Wei}} [k_{κ} (x)] = - F_{κ}^{*} (η) - E_{p_{κ, σ}^{Wei}} [k_{κ} (x)]

, we deduce that

E_{p_{κ, σ}^{Wei}} [k_{κ} (x)] = - F_{κ}^{*} (η) - h [p_{κ, σ}^{Wei}],

where

F_{κ}^{*} (η)

is the Legendre transform of

F_{κ} (θ)

and

η (θ) = \nabla F_{κ} (θ) = - \frac{1}{θ} = E [t (x)] = E [x^{κ}]

. We have

θ (η) = \nabla F_{κ}^{*} (η) = - \frac{1}{η}

and

F_{κ}^{*} (η) = η^{⊤} \nabla F_{κ}^{*} (η) - F_{κ} (\nabla F_{κ}^{*} (η)) = - 1 - log η

. It follows that

E_{p_{κ, σ}^{Wei}} [k_{κ} (x)] = 1 + log (σ^{κ}) - γ (1 - \frac{1}{κ}) - log \frac{σ}{κ} - 1 = (κ - 1) (log σ - \frac{γ}{κ}) + log κ .

Therefore, we deduce that the logarithmic moment of

p_{κ_{1}, σ_{1}}^{Wei}

is:

E_{p_{κ_{1}, σ_{1}}^{Wei}} [log x] = - \frac{γ}{κ_{1}} + log σ_{1} .

This coincides with the explicit definite integral calculation reported in [63].

Subsequently, we calculate the KLD between two Weibull distributions using Equation (109), as follows:

\begin{matrix} D_{KL} [p_{κ_{1}, σ_{1}}^{Wei} : p_{κ_{2}, σ_{2}}^{Wei}] & = & F_{κ_{2}} (θ^{'}) + F_{κ_{1}}^{*} (η) - E_{p_{κ_{1}, σ_{1}}} {[x^{κ_{2}}]}^{⊤} θ^{'} + E_{p_{κ_{1}, σ_{1}}} [k_{κ_{1}} (x) - k_{κ_{2}} (x)] \end{matrix}

(122)

\begin{matrix} = & log \frac{κ_{1}}{σ_{1}^{κ_{1}}} - log \frac{κ_{2}}{σ_{2}^{κ_{2}}} + (κ_{1} - κ_{2}) [log σ_{1} - \frac{γ}{κ_{1}}] + {(\frac{σ_{1}}{σ_{2}})}^{κ_{2}} Γ (\frac{κ_{2}}{κ_{1}} + 1) - 1 \end{matrix}

(123)

since we have the following terms:

\begin{matrix} F_{κ_{2}} (θ^{'}) & = & log σ_{2}^{κ_{2}}, \\ F_{κ_{1}}^{*} (η) & = & - 1 - log σ_{1}^{κ_{1}}, \\ - E_{p_{κ_{1}, σ_{1}}} {[x^{κ_{2}}]}^{⊤} θ^{'} & = & \frac{1}{σ_{2}^{κ_{2}}} σ_{1}^{κ_{2}} Γ (1 + \frac{κ_{2}}{κ_{1}}), \\ E_{p_{κ_{1}, σ_{1}}} [k_{κ_{1}} (x) - k_{κ_{2}} (x)] & = & (κ_{1} - κ_{2}) E_{p_{κ_{1}, σ_{1}}} [log x] + log \frac{κ_{1}}{κ_{2}}, \\ & = & log \frac{κ_{1}}{κ_{2}} + (κ_{1} - κ_{2}) (log σ_{1} - \frac{γ}{κ_{1}}) . \end{matrix}

This formula matches the formula reported in [63].
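
The closed form of Equation (123) is easy to evaluate and to check numerically. The sketch below is a hedged illustration rather than a reference implementation: the function names, the midpoint quadrature on (0, 12], and the parameter values are our own assumptions, and the constant γ is hard-coded to double precision.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def kl_weibull(k1, s1, k2, s2):
    """Closed-form KLD of Equation (123) between Weibull(k1, s1) and Weibull(k2, s2)."""
    return (math.log(k1 / s1 ** k1) - math.log(k2 / s2 ** k2)
            + (k1 - k2) * (math.log(s1) - EULER_GAMMA / k1)
            + (s1 / s2) ** k2 * math.gamma(k2 / k1 + 1.0) - 1.0)

def log_weibull(x, k, s):
    """Log-density of the Weibull distribution of shape k and scale s, for x > 0."""
    return math.log(k / s) + (k - 1.0) * math.log(x / s) - (x / s) ** k

def kl_weibull_numeric(k1, s1, k2, s2, hi=12.0, n=200_000):
    """Midpoint-rule estimate of the KLD integral on (0, hi]."""
    h = hi / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        lp = log_weibull(x, k1, s1)
        total += math.exp(lp) * (lp - log_weibull(x, k2, s2))
    return total * h

# Different shapes: closed form against quadrature.
print(kl_weibull(2.0, 1.0, 3.0, 1.5), kl_weibull_numeric(2.0, 1.0, 3.0, 1.5))

# Equal shapes 1 and 2: the exponential and Rayleigh special cases.
s1, s2 = 2.0, 0.5
print(kl_weibull(1.0, s1, 1.0, s2), math.log(s2 / s1) + s1 / s2 - 1.0)
print(kl_weibull(2.0, s1, 2.0, s2),
      math.log(s2 ** 2 / s1 ** 2) + s1 ** 2 / s2 ** 2 - 1.0)
```

The quadrature range is adequate for the shapes used here; a shape below 1 makes the density singular at 0 and would call for a more careful integration scheme.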

When

κ_{1} = κ_{2} = 1

, we recover the ordinary KLD formula between two exponential distributions [60] with

λ_{i} = \frac{1}{σ_{i}}

since

Γ (2) = (2 - 1)! = 1

:

\begin{matrix} D_{KL} [p_{1, σ_{1}}^{Wei} : p_{1, σ_{2}}^{Wei}] & = & log \frac{σ_{2}}{σ_{1}} + \frac{σ_{1}}{σ_{2}} - 1, \end{matrix}

(124)

\begin{matrix} = & \frac{λ_{2}}{λ_{1}} - log \frac{λ_{2}}{λ_{1}} - 1 . \end{matrix}

(125)

When

κ_{1} = κ_{2} = 2

, we recover the ordinary KLD formula between two Rayleigh distributions [60], with

σ_{Ray} = \frac{σ}{\sqrt{2}}

:

\begin{matrix} D_{KL} [p_{2, σ_{1}}^{Wei} : p_{2, σ_{2}}^{Wei}] & = & log (\frac{σ_{2}^{2}}{σ_{1}^{2}}) + \frac{σ_{1}^{2}}{σ_{2}^{2}} - 1, \end{matrix}

(126)

\begin{matrix} = & log (\frac{{σ_{Ray}}_{2}^{2}}{{σ_{Ray}}_{1}^{2}}) + \frac{{σ_{Ray}}_{1}^{2}}{{σ_{Ray}}_{2}^{2}} - 1 . \end{matrix}

(127)

The formulae of Equations (125) and (127) are linked by the fact that if

X \sim Exp (λ)

and

Y = \sqrt{X}

then

Y \sim Ray (\frac{1}{\sqrt{2 λ}})

, and f-divergences [65], including the Kullback–Leibler divergence, are invariant under differentiable and invertible transformations [66].

Jeffreys’ divergence symmetrizes the KLD, as follows:

D_{J} [p : q] : = D_{KL} [p : q] + D_{KL} [q : p] = 2 A (D_{KL} [p : q], D_{KL} [q : p]) .

(128)

The Jeffreys divergence between two densities of different exponential families

P

and

Q

is

D_{J} [p_{θ} : q_{θ^{'}}] = θ^{' ⊤} (η^{'} - E_{p_{θ}} [t_{Q} (x)]) + θ^{⊤} (η - E_{q_{θ^{'}}} [t_{P} (x)]) + E_{p_{θ}} [k_{P} (x) - k_{Q} (x)] + E_{q_{θ^{'}}} [k_{Q} (x) - k_{P} (x)] .

(129)

When

P = Q

, we have

E_{p_{θ}} [t_{Q} (x)] = η

and

E_{q_{θ^{'}}} [t_{P} (x)] = η^{'}

, so that we find the usual expression of the Jeffreys divergence between two densities of an exponential family:

D_{J} [p_{θ} : p_{θ^{'}}] = {(θ^{'} - θ)}^{⊤} (η^{'} - η) .

(130)
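
As a small illustration of Equation (130), consider the family of exponential distributions, for which θ = -λ and η = 1/λ (the Weibull case κ = 1 above). The sketch below, with illustrative function names and rates of our own choosing, checks that the product form agrees with the sum of the two sided KLDs of Equation (125).

```python
import math

def kl_exponential(lam1, lam2):
    """KLD between exponential distributions of rates lam1 and lam2, Equation (125)."""
    return lam2 / lam1 - math.log(lam2 / lam1) - 1.0

def jeffreys_exponential(lam1, lam2):
    """Equation (130) with natural parameter theta = -lam and moment parameter eta = 1 / lam."""
    theta1, theta2 = -lam1, -lam2
    eta1, eta2 = 1.0 / lam1, 1.0 / lam2
    return (theta2 - theta1) * (eta2 - eta1)

lam1, lam2 = 0.5, 2.0  # illustrative rates
two_sided = kl_exponential(lam1, lam2) + kl_exponential(lam2, lam1)
print(jeffreys_exponential(lam1, lam2), two_sided)
```

The logarithmic terms of the two sided divergences cancel, which is why the Jeffreys divergence reduces here to (λ_1 - λ_2)^2 / (λ_1 λ_2).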

To find the best density

q_{θ}

approximating p, that is, the minimizer of

D_{KL} [p : q_{θ}]

with respect to θ, we solve

\nabla F (θ) = η = m

and, therefore,

θ = \nabla F^{*} (m) = {(\nabla F)}^{- 1} (m)

, where

F^{*} (η) = E_{q_{η}} [log q_{η} (x)]

, with

F^{*}

denoting the Legendre–Fenchel convex conjugate [2]. In particular, when

p = \sum w_{i} p_{θ_{i}}

is a mixture of EFs (with

m = E_{p} [t (x)] = \sum w_{i} η_{i}

with

η_{i} = E_{p_{θ_{i}}} [t (x)]

thanks to the linearity of the expectation), then the best density of the EF simplifying p is obtained by solving the following minimization, in which the terms of the KLD that do not depend on θ have been dropped:

\begin{matrix} min_{θ} D_{KL} [p : q_{θ}] & \equiv & min_{θ} F (θ) - m^{⊤} θ, \end{matrix}

(131)

\begin{matrix} = & min_{θ} F (θ) - \sum w_{i} η_{i}^{⊤} θ . \end{matrix}

(132)

Taking the gradient with respect to

θ

, we have

\nabla F (θ) = η = \sum w_{i} η_{i}

. This yields another proof without the Pythagoras theorem [67,68].

Proposition 6.

Let

m (x) = \sum w_{i} p_{θ_{i}} (x)

be a mixture with components that belong to an exponential family with cumulant function F, and let q_θ denote a density of that family. Then,

θ^{*} = arg {min}_{θ} D_{KL} [m : q_{θ}]

is

\nabla F^{*} (\sum_{i = 1}^{n} w_{i} η_{i})

, where the

η_{i} = \nabla F (θ_{i})

are the moment parameters of the mixture components.
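
To make Proposition 6 concrete, the following sketch simplifies a univariate Gaussian mixture into a single Gaussian by averaging the moment parameters. The choice of the Gaussian family with sufficient statistic t(x) = (x, x^2), the function names, and the mixture values are our own illustrative assumptions. For readability, the result is reported as (μ, σ) rather than as the natural parameter θ^{*}.

```python
import math

def mixture_moments(weights, mus, sigmas):
    """Weighted average of eta_i = (mu_i, mu_i^2 + sigma_i^2) for t(x) = (x, x^2)."""
    eta1 = sum(w * mu for w, mu in zip(weights, mus))
    eta2 = sum(w * (mu * mu + s * s) for w, mu, s in zip(weights, mus, sigmas))
    return eta1, eta2

def simplify_gaussian_mixture(weights, mus, sigmas):
    """Gaussian whose moment parameter equals the averaged moment parameter."""
    eta1, eta2 = mixture_moments(weights, mus, sigmas)
    return eta1, math.sqrt(eta2 - eta1 * eta1)

def cross_entropy(weights, mus, sigmas, mu, sigma):
    """-E_m[log q] for q = N(mu, sigma^2); differs from D_KL[m : q] by the constant h[m]."""
    eta1, eta2 = mixture_moments(weights, mus, sigmas)
    spread = eta2 - 2.0 * mu * eta1 + mu * mu  # E_m[(x - mu)^2]
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + spread / (2.0 * sigma ** 2)

weights, mus, sigmas = [0.3, 0.7], [-1.0, 2.0], [0.5, 1.5]  # illustrative mixture
mu_star, sigma_star = simplify_gaussian_mixture(weights, mus, sigmas)
best = cross_entropy(weights, mus, sigmas, mu_star, sigma_star)
print(mu_star, sigma_star ** 2, best)
```

Since the entropy of the mixture does not depend on the candidate Gaussian, comparing cross-entropies is enough to confirm that perturbing μ or σ away from the moment-matched values increases the KLD.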

Consider the following two problems:

Problem 1

(Density clustering). Given a set of n weighted densities

(w_{1}, p_{1}), \dots, (w_{n}, p_{n})

, partition them into k clusters

C_{1}, \dots, C_{k}

in order to minimize the k-centroid objective function with respect to a statistical divergence D:

\sum_{i = 1}^{n} w_{i} {min}_{l \in \{1, \dots, k\}} D [p_{i} : c_{l}]

, where

c_{l}

denotes the centroid of cluster

C_{l}

for

l \in \{1, \dots, k\}

.

For example, when all the densities

p_{i}

are isotropic Gaussians, we recover the k-means objective function [69].

Problem 2

(Mixture component quantization). Given a statistical mixture

m (x) = \sum_{i = 1}^{n} w_{i} p_{i} (x)

, quantize the mixture components into k densities

q_{1}, \dots, q_{k}

in order to minimize

\sum_{i} w_{i} {min}_{l \in \{1, \dots, k\}} D [p_{i} : q_{l}]

.

Notice that, in Problem 1, the input densities

p_{i}

may be mixtures, i.e.,

p_{i} (x) = \sum_{j = 1}^{n_{i}} w_{i, j} p_{i, j} (x)

. Using the relative information radius, we can cluster a set of distributions (potentially mixtures) into an exponential family mixture, or quantize an exponential family mixture. Indeed, we can implement an extension of k-means [69] with k centers

q_{θ_{1}}, \dots, q_{θ_{k}}

: to assign density

p_{i}

to cluster

C_{j}

(with center

q_{θ_{j}}

), we only need to perform the basic comparison tests

D_{KL} [p_{i} : q_{θ_{l}}] \geq D_{KL} [p_{i} : q_{θ_{j}}]

. Provided that the cumulant F of the exponential family is available in closed form, we do not need a formula for the entropies

h (p_{i})

.
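
The assignment and update steps described above can be sketched as a Lloyd-type iteration. The example below is only a toy illustration under assumptions of our own: the centers are exponential distributions (θ = -λ, F(θ) = -log(-θ), k(x) = 0), each input density p_i is supported on the positive half-line and is summarized by its first moment m_i = E_{p_i}[x], and all function names and data are hypothetical. The comparison F(θ_l) - m_i θ_l differs from D_KL[p_i : q_{θ_l}] only by -h(p_i), so no entropy is ever evaluated.

```python
import math

def cumulant(theta):
    """F(theta) = -log(-theta) for exponential distributions (theta = -rate < 0)."""
    return -math.log(-theta)

def assign(moments, thetas):
    """Label each input with the center minimizing F(theta_l) - m_i * theta_l."""
    return [min(range(len(thetas)),
                key=lambda l: cumulant(thetas[l]) - m * thetas[l])
            for m in moments]

def update(weights, moments, labels, thetas):
    """Relative centroid of each cluster: eta_l is the weighted mean of the m_i."""
    new_thetas = list(thetas)
    for l in range(len(thetas)):
        members = [(w, m) for w, m, lab in zip(weights, moments, labels) if lab == l]
        total = sum(w for w, _ in members)
        if total > 0.0:  # an empty cluster keeps its previous center
            eta = sum(w * m for w, m in members) / total
            new_thetas[l] = -1.0 / eta  # theta = grad F^*(eta) = -1 / eta
    return new_thetas

def cluster(weights, moments, thetas, max_iter=50):
    """Alternate assignments and centroid updates until the labels stop changing."""
    labels = assign(moments, thetas)
    for _ in range(max_iter):
        thetas = update(weights, moments, labels, thetas)
        new_labels = assign(moments, thetas)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, thetas

weights = [1.0 / 6.0] * 6
moments = [0.9, 1.0, 1.1, 9.0, 10.0, 11.0]  # hypothetical m_i = E_{p_i}[x]
labels, thetas = cluster(weights, moments, thetas=[-2.0, -0.05])
print(labels, [-1.0 / t for t in thetas])
```

Like ordinary k-means, such an iteration only reaches a local minimum of the objective and depends on the initial centers.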

Clustering and quantization of densities/mixtures have been widely studied in the literature, see, for example, [70,71,72,73,74,75,76].

5. Conclusions

To summarize, the ordinary Jensen-Shannon divergence has been defined in three equivalent ways in the literature:

\begin{matrix} D_{JS} [p, q] & : = & min_{c \in D} \frac{1}{2} (D_{KL} [p : c] + D_{KL} [q : c]), \end{matrix}

(133)

\begin{matrix} = & \frac{1}{2} (D_{KL} [p : \frac{p + q}{2}] + D_{KL} [q : \frac{p + q}{2}]), \end{matrix}

(134)

\begin{matrix} = & h [\frac{p + q}{2}] - \frac{h [p] + h [q]}{2} . \end{matrix}

(135)
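
The equivalence of the three definitions can be illustrated on discrete distributions. In the minimal sketch below, the function names and the two probability vectors are illustrative assumptions. It evaluates Equations (134) and (135), and compares the objective of Equation (133) at the midpoint with its value at a few other candidate centroids.

```python
import math

def kl(p, q):
    """Discrete Kullback-Leibler divergence, assuming q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def objective(p, q, c):
    """Objective of Equation (133) for a candidate centroid c."""
    return 0.5 * (kl(p, c) + kl(q, c))

def mix(p, q, w):
    """Mixture w * p + (1 - w) * q."""
    return [w * a + (1.0 - w) * b for a, b in zip(p, q)]

def jsd_mixture(p, q):
    """Equation (134): objective evaluated at the midpoint (p + q) / 2."""
    return objective(p, q, mix(p, q, 0.5))

def jsd_entropy(p, q):
    """Equation (135): entropy of the midpoint minus the average entropy."""
    return entropy(mix(p, q, 0.5)) - 0.5 * (entropy(p) + entropy(q))

p = [0.1, 0.4, 0.5]
q = [0.8, 0.15, 0.05]
candidates = [p, q, [1.0 / 3.0] * 3, mix(p, q, 0.3)]
print(jsd_mixture(p, q), jsd_entropy(p, q))
print([objective(p, q, c) for c in candidates])
```

Every candidate other than the midpoint should yield a larger objective value, consistent with the midpoint being the unique minimizer in Equation (133).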

The JSD of Equation (133) was studied by Sibson in 1969 within the wider scope of information radius [1]: Sibson relied on the Rényi

α

-divergences (relative Rényi

α

-entropies [77]) and recovered the ordinary Jensen-Shannon divergence as a particular case of the

α

-information radius when

α = 1

and

n = 2

points. The

α

-information radii are related to generalized Bhattacharyya distances with respect to power means and the total variation distance in the limit case of

α = \infty

.

Lin [4] investigated the JSD of Equation (135) in 1991, together with its connection to the JSD defined in Equation (134). In Lin [4], the JSD is interpreted as the arithmetic symmetrization of the K-divergence [24]. Generalizations of the JSD based on Equation (134) were proposed in [23] using a generic mean instead of the arithmetic mean. One motivation was to obtain a closed-form formula for the geometric JSD between multivariate Gaussian distributions, which relies on the geometric mixture (see [30] for a use case of that formula in deep learning). Indeed, the ordinary JSD between Gaussians is not available in closed form (not analytic). However, the JSD between Cauchy distributions admits a closed-form formula [78], even though it requires calculating a definite integral of a log-sum term. Instead of using an abstract mean to define a mid-distribution of two densities, one may also consider the midpoint of a geodesic linking these two densities (the arithmetic mean

\frac{p + q}{2}

is interpreted as a geodesic midpoint). Recently, Li [79] investigated the transport Jensen-Shannon divergence as a symmetrization of the Kullback–Leibler divergence in the

L^{2}

-Wasserstein space. See Section 5.4 of [79] and the closed-form formula of Equation (18) obtained for the transport Jensen-Shannon divergence between two multivariate Gaussian distributions.

The generalization of the identity between the JSD of Equation (134) and the JSD of Equation (135) was studied using a skewing vector in [18]. Although the JSD is an f-divergence [8,18], the Sibson-M Jensen-Shannon symmetrization of a distance does not belong, in general, to the class of f-divergences. The variational JSD definition of Equation (133) is implicit, while the definitions of Equations (134) and (135) are explicit because the unique optimal centroid

c^{*} = \frac{p + q}{2}

has been plugged into the objective function that was minimized by Equation (133).

In this paper, we proposed a generalization of the Jensen-Shannon divergence based on the variational JSD definition of Equation (133):

D_{vJS} [p : q] = {min}_{c} \frac{1}{2} (D_{KL} [p : c] + D_{KL} [q : c])

. We introduced the Jensen-Shannon symmetrization of an arbitrary divergence D by considering a generalization of the information radius with respect to an abstract weighted mean

M_{β}

:

D_{M}^{vJS} [p : q] : = {min}_{c} M_{β} (D [p : c], D [q : c])

. Notice that, in the variational JSD, the mean

M_{β}

is used for averaging divergence values, while the mean

M_{α}

in the

(M_{α}, N_{β})

JSD is used to define generic statistical mixtures. We also considered relative variational JS symmetrizations when the centroid has to belong to a prescribed family of densities. For the case of exponential families, we showed how to compute the relative centroid in closed form, thus extending the pioneering work of Sibson, who considered the relative normal centroid used to calculate the relative normal information radius. Figure 2 illustrates the three generalizations of the ordinary skewed Jensen-Shannon divergence. Notice that, in general, the

(M, N)

-JSDs and the variational JSDs are not f-divergences (except in the ordinary case).

In a similar vein, Chen et al. [80] considered the following minimax symmetrization of the scalar Bregman divergence [81]:

\begin{matrix} B_{f}^{minmax} (p, q) & : = & min_{c} max_{λ \in [0, 1]} λ B_{f} (p : c) + (1 - λ) B_{f} (q : c), \end{matrix}

(136)

\begin{matrix} = & max_{λ \in [0, 1]} λ B_{f} (p : λ p + (1 - λ) q) + (1 - λ) B_{f} (q : λ p + (1 - λ) q), \end{matrix}

(137)

\begin{matrix} = & max_{λ \in [0, 1]} λ f (p) + (1 - λ) f (q) - f (λ p + (1 - λ) q), \end{matrix}

(138)

where

B_{f}

denotes the scalar Bregman divergence induced by a strictly convex and smooth function f:

B_{f} (p : q) = f (p) - f (q) - (p - q) f^{'} (q) .

(139)
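
The step from Equation (137) to Equation (138) rests on the identity λ B_f(p : c) + (1 - λ) B_f(q : c) = λ f(p) + (1 - λ) f(q) - f(c) for c = λ p + (1 - λ) q, since the terms in f'(c) cancel. The sketch below checks this identity numerically and approximates the outer maximization by a grid search over λ. The generators f, the grid resolution, and the function names are illustrative assumptions of ours and are not taken from [80].

```python
import math

def bregman(f, df, p, q):
    """Scalar Bregman divergence of Equation (139)."""
    return f(p) - f(q) - (p - q) * df(q)

def weighted_bregman(f, df, p, q, lam):
    """Objective of Equation (137) at c = lam * p + (1 - lam) * q."""
    c = lam * p + (1.0 - lam) * q
    return lam * bregman(f, df, p, c) + (1.0 - lam) * bregman(f, df, q, c)

def jensen_gap(f, p, q, lam):
    """Objective of Equation (138)."""
    return lam * f(p) + (1.0 - lam) * f(q) - f(lam * p + (1.0 - lam) * q)

def minmax_bregman(f, p, q, steps=10_000):
    """Grid approximation of the maximum of the Jensen gap over lam in [0, 1]."""
    return max(jensen_gap(f, p, q, i / steps) for i in range(steps + 1))

def neg_entropy(x):
    """Generator f(x) = x log x - x, strictly convex for x > 0."""
    return x * math.log(x) - x

p, q = 1.0, 3.0
for lam in (0.1, 0.5, 0.9):
    print(weighted_bregman(neg_entropy, math.log, p, q, lam),
          jensen_gap(neg_entropy, p, q, lam))

# For f(x) = x^2 the gap is lam (1 - lam) (p - q)^2, maximized at lam = 1/2.
print(minmax_bregman(lambda x: x * x, p, q))
```

For the quadratic generator the maximum is (p - q)^2 / 4, so its square root is half of |p - q|, which matches the radius interpretation discussed in this section.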

They proved that

\sqrt{B_{f}^{minmax} (p, q)}

yields a metric when

3 {(log f^{''})}^{''} \geq {({(log f^{''})}^{'})}^{2}

, and extended the definition to the vector case, conjecturing that the square-root metrization still holds in the multivariate case. In a sense, this definition geometrically highlights the notion of radius, since the minmax optimization amounts to finding a smallest enclosing ball [82] of the source distributions. The circumcenter, also called the Chebyshev center [83], is then the mid-distribution instead of the centroid for the information radius. The term “information radius” is well-suited to measure the distance between two points for an arbitrary distance D. Indeed, the JS-symmetrization of D is defined by

D^{JS} [p : q] : = {min}_{c} \{\frac{1}{2} D [p : c] + \frac{1}{2} D [q : c]\}

. When

D [p : q] = D_{E} [p : q] = ∥ p - q ∥

is the Euclidean distance, we have

c = \frac{p + q}{2}

, and

D [p : c] = D [q : c] = \frac{1}{2} ∥ p - q ∥ = : r

(i.e., the radius being half of the diameter

∥ p - q ∥

). Thus,

D_{E}^{JS} [p : q] = r

; hence, the term chosen by Sibson [1] for

D^{JS}

: information radius. Besides providing another viewpoint, variational definitions of divergences have proven to be useful in practice (e.g., for estimation). For example, a variational definition of the Rényi divergence generalizing the Donsker–Varadhan variational formula of the KLD is given in [84], where it is used to estimate Rényi divergences.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable.

Acknowledgments

We warmly thank Rob Brekelmans (Information Sciences Institute, University of Southern California, USA) for discussions and feedback related to the contents of this work. The author thanks the reviewers for valuable feedback, comments, and suggestions, and Gaëtan Hadjeres (Sony CSL Paris) for his careful reading of the manuscript.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Sibson, R. Information radius. Z. Wahrscheinlichkeitstheorie Verwandte Geb. 1969, 14, 149–160.
2. Barndorff-Nielsen, O. Information and Exponential Families: In Statistical Theory; John Wiley & Sons: Hoboken, NJ, USA, 2014.
3. Billingsley, P. Probability and Measure; John Wiley & Sons: Hoboken, NJ, USA, 2008.
4. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151.
5. Kullback, S. Information Theory and Statistics; Courier Corporation: Chelmsford, MA, USA, 1997.
6. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
7. Morimoto, T. Markov processes and the H-theorem. J. Phys. Soc. Jpn. 1963, 18, 328–331.
8. Csiszár, I. Eine informationstheoretische ungleichung und ihre anwendung auf beweis der ergodizitaet von markoffschen ketten. Magyar Tud. Akad. Mat. Kut. Int. Koezl. 1964, 8, 85–108.
9. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B (Methodological) 1966, 28, 131–142.
10. Amari, S.i. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016.
11. McLachlan, G.J.; Peel, D. Finite Mixture Models; John Wiley & Sons: Hoboken, NJ, USA, 2004.
12. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466.
13. Endres, D.M.; Schindelin, J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858–1860.
14. Fuglede, B.; Topsoe, F. Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the International Symposium on Information Theory (ISIT 2004), Chicago, IL, USA, 27 June–2 July 2004; IEEE: Piscataway, NJ, USA, 2004; p. 31.
15. Virosztek, D. The metric property of the quantum Jensen-Shannon divergence. Adv. Math. 2021, 380, 107595.
16. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. arXiv 2014, arXiv:1406.2661.
17. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
18. Nielsen, F. On a generalization of the Jensen-Shannon divergence and the Jensen-Shannon centroid. Entropy 2020, 22, 221.
19. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318.
20. Csiszár, I. Axiomatic characterizations of information measures. Entropy 2008, 10, 261–273.
21. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
22. Antolín, J.; Angulo, J.; López-Rosa, S. Fisher and Jensen-Shannon divergences: Quantitative comparisons among distributions. Application to position and momentum atomic densities. J. Chem. Phys. 2009, 130, 074110.
23. Nielsen, F. On the Jensen-Shannon symmetrization of distances relying on abstract means. Entropy 2019, 21, 485.
24. Nielsen, F. A family of statistical symmetric divergences based on Jensen’s inequality. arXiv 2010, arXiv:1009.4004.
25. Nielsen, F.; Nock, R. Generalizing skew Jensen divergences and Bregman divergences with comparative convexity. IEEE Signal Process. Lett. 2017, 24, 1123–1127.
26. De Carvalho, M. Mean, what do you Mean? Am. Stat. 2016, 70, 270–274.
27. Bullen, P.S. Handbook of Means and Their Inequalities; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 560.
28. Niculescu, C.P.; Persson, L.E. Convex Functions and Their Applications: A Contemporary Approach; Springer: Berlin/Heidelberg, Germany, 2018.
29. Nielsen, F. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognit. Lett. 2014, 42, 25–34.
30. Deasy, J.; Simidjievski, N.; Liò, P. Constraining Variational Inference with Geometric Jensen-Shannon Divergence. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020.
31. Amari, S.I. Integration of stochastic models by minimizing α-divergence. Neural Comput. 2007, 19, 2780–2796.
32. Calin, O.; Udriste, C. Geometric Modeling in Probability and Statistics; Mathematics and Statistics; Springer International Publishing: Berlin/Heidelberg, Germany, 2014.
33. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1961; Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California: Oakland, CA, USA, 1961.
34. Blondel, M.; Martins, A.F.; Niculae, V. Learning with Fenchel-Young losses. J. Mach. Learn. Res. 2020, 21, 1–69.
35. Faddeev, D.K. Zum Begriff der Entropie einer endlichen Wahrscheinlichkeitsschemas. In Arbeiten zur Informationstheorie I; Deutscher Verlag der Wissenschaften: Berlin, Germany, 1957; pp. 85–90.
36. Kolmogorov, A.N.; Castelnuovo, G. Sur la Notion de la Moyenne; Bardi, G., Ed.; Atti della Academia Nazionale dei Lincei: Rome, Italy, 1930; Volume 12, pp. 323–343.
37. Nagumo, M. Über eine klasse der mittelwerte. In Japanese Journal of Mathematics: Transactions and Abstracts; The Mathematical Society of Japan: Tokyo, Japan, 1930; Volume 7, pp. 71–79.
38. De Finetti, B. Sul Concetto di Media; Istituto Italiano Degli Attuari: Roma, Italy, 1931.
39. Van Erven, T.; Harremoës, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
40. Sibson, R. A brief description of natural neighbour interpolation. In Interpreting Multivariate Data; Barnett, V., Ed.; John Wiley & Sons: Hoboken, NJ, USA, 1981; pp. 21–36.
41. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
42. Nielsen, F.; Sun, K. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy 2016, 18, 442.
43. Nielsen, F. Chernoff information of exponential families. arXiv 2011, arXiv:1102.2684.
44. Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett. 2013, 20, 269–272.
45. Nielsen, F.; Yvinec, M. An output-sensitive convex hull algorithm for planar objects. Int. J. Comput. Geom. Appl. 1998, 8, 39–65.
46. Nielsen, F.; Nock, R. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Process. Lett. 2013, 21, 10–13.
47. Nielsen, F. The statistical Minkowski distances: Closed-form formula for Gaussian mixture models. In International Conference on Geometric Science of Information; Springer: Berlin/Heidelberg, Germany, 2019; pp. 359–367.
48. Fréchet, M. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. L’Institut Henri Poincaré 1948, 10, 215–310.
49. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904.
50. Naudts, J. Generalised Thermostatistics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.
51. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
52. Nielsen, F. On Voronoi diagrams on the information-geometric Cauchy manifolds. Entropy 2020, 22, 713.
53. Nock, R.; Nielsen, F.; Amari, S.i. On conformal divergences and their population minimizers. IEEE Trans. Inf. Theory 2015, 62, 527–538.
54. Brekelmans, R.; Nielsen, F.; Makhzani, A.; Galstyan, A.; Steeg, G.V. Likelihood Ratio Exponential Families. arXiv 2020, arXiv:2012.15480.
55. Brekelmans, R.; Masrani, V.; Bui, T.; Wood, F.; Galstyan, A.; Steeg, G.V.; Nielsen, F. Annealed Importance Sampling with q-Paths. arXiv 2020, arXiv:2012.07823.
56. Nielsen, F. A generalization of the α-divergences based on comparable and distinct weighted means. arXiv 2020, arXiv:2001.09660.
57. Amari, S.i.; Ohara, A. Geometry of q-exponential family of probability distributions. Entropy 2011, 13, 1170–1185.
58. Grosse, R.; Maddison, C.J.; Salakhutdinov, R. Annealing between distributions by averaging moments. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 2769–2777.
59. Nielsen, F. What is an information projection? Not. AMS 2018, 65, 321–324.
60. Nielsen, F.; Garcia, V. Statistical exponential families: A digest with flash cards. arXiv 2009, arXiv:0911.4863.
61. Nielsen, F.; Nock, R. Entropies and cross-entropies of exponential families. In Proceedings of the 2010 IEEE International Conference on Image Processing, Hong Kong, China, 26–29 September 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 3621–3624.
62. Nielsen, F. On Geodesic Triangles with Right Angles in a Dually Flat Space. In Progress in Information Geometry: Theory and Applications; Springer: Berlin/Heidelberg, Germany, 2021; pp. 153–190.
63. Bauckhage, C. Computing the Kullback-Leibler divergence between two Weibull distributions. arXiv 2013, arXiv:1310.3713.
64. Michalowicz, J.V.; Nichols, J.M.; Bucholtz, F. Handbook of Differential Entropy; CRC Press: Boca Raton, FL, USA, 2013.
65. Csiszár, I. On topological properties of f-divergences. Stud. Math. Hungar. 1967, 2, 329–339.
66. Nielsen, F. On information projections between multivariate elliptical and location-scale families. arXiv 2021, arXiv:2101.03839.
67. Pelletier, B. Informative barycentres in statistics. Ann. Inst. Stat. Math. 2005, 57, 767–780.
68. Schwander, O.; Nielsen, F. Learning mixtures by simplifying kernel density estimators. In Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013; pp. 403–426.
69. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
70. Davis, J.V.; Dhillon, I. Differential entropic clustering of multivariate Gaussians. In Proceedings of the 19th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 4–7 December 2006; pp. 337–344.
71. Nielsen, F.; Nock, R. Clustering multivariate normal distributions. In Emerging Trends in Visual Computing; Springer: Berlin/Heidelberg, Germany, 2008; pp. 164–174.
72. Fischer, A. Quantization and clustering with Bregman divergences. J. Multivar. Anal. 2010, 101, 2207–2221.
73. Zhang, K.; Kwok, J.T. Simplifying mixture models through function approximation. IEEE Trans. Neural Netw. 2010, 21, 644–658.
74. Duan, J.; Wang, Y. Information-Theoretic Clustering for Gaussian Mixture Model via Divergence Factorization. In Proceedings of the 2013 Chinese Intelligent Automation Conference, Yangzhou, China, 23–25 August 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 565–573.
75. Wang, J.C.; Yang, Y.H.; Wang, H.M.; Jeng, S.K. Modeling the affective content of music with a Gaussian mixture model. IEEE Trans. Affect. Comput. 2015, 6, 56–68.
76. Spurek, P.; Pałka, W. Clustering of Gaussian distributions. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3346–3353.
77. Esteban, M.D.; Morales, D. A summary on entropy statistics. Kybernetika 1995, 31, 337–346.
78. Nielsen, F.; Okamura, K. On f-divergences between Cauchy distributions. arXiv 2021, arXiv:2101.12459.
79. Li, W. Transport information Bregman divergences. arXiv 2021, arXiv:2101.01162.
80. Chen, P.; Chen, Y.; Rao, M. Metrics defined by Bregman divergences: Part 2. Commun. Math. Sci. 2008, 6, 927–948.
81. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217.
82. Arnaudon, M.; Nielsen, F. On approximating the Riemannian 1-center. Comput. Geom. 2013, 46, 93–104.
83. Candan, Ç. Chebyshev Center Computation on Probability Simplex With α-Divergence Measure. IEEE Signal Process. Lett. 2020, 27, 1515–1519.
84. Birrell, J.; Dupuis, P.; Katsoulakis, M.A.; Rey-Bellet, L.; Wang, J. Variational Representations and Neural Network Estimation for Rényi Divergences. arXiv 2020, arXiv:2007.03814.

Figure 1. Illustrating several cases of the relative Jensen-Shannon divergence based on whether

p \in R

and

q \in R

or not.

Figure 2. Three equivalent expressions of the ordinary (skewed) Jensen-Shannon divergence which yield three different generalizations.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Nielsen, F. On a Variational Definition for the Jensen-Shannon Symmetrization of Distances Based on the Information Radius. Entropy 2021, 23, 464. https://doi.org/10.3390/e23040464
