Article

On a Variational Definition for the Jensen-Shannon Symmetrization of Distances Based on the Information Radius

Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Entropy 2021, 23(4), 464; https://doi.org/10.3390/e23040464
Submission received: 12 March 2021 / Revised: 9 April 2021 / Accepted: 9 April 2021 / Published: 14 April 2021

Abstract

We generalize the Jensen-Shannon divergence and the Jensen-Shannon diversity index by considering a variational definition with respect to a generic mean, thereby extending the notion of Sibson’s information radius. The variational definition applies to any arbitrary distance and yields a new way to define a Jensen-Shannon symmetrization of distances. When the variational optimization is further constrained to belong to prescribed families of probability measures, we get relative Jensen-Shannon divergences and their equivalent Jensen-Shannon symmetrizations of distances that generalize the concept of information projections. Finally, we touch upon applications of these variational Jensen-Shannon divergences and diversity indices to clustering and quantization tasks of probability measures, including statistical mixtures.


1. Introduction: Background and Motivations

This work contributes a methodological extension of Sibson’s information radius [1], with particular attention to the families of distributions called exponential families [2].
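Let $(\mathcal{X}, \mathcal{F})$ denote a measurable space [3] with sample space $\mathcal{X}$ and $\sigma$-algebra $\mathcal{F}$ on the set $\mathcal{X}$. The Jensen-Shannon divergence [4] (JSD) between two probability measures $P$ and $Q$ (or probability distributions) on $(\mathcal{X}, \mathcal{F})$ is defined by:
$$D_{\mathrm{JS}}[P, Q] := \frac{1}{2}\left( D_{\mathrm{KL}}\left[P : \frac{P+Q}{2}\right] + D_{\mathrm{KL}}\left[Q : \frac{P+Q}{2}\right] \right),$$
where $D_{\mathrm{KL}}$ denotes the Kullback–Leibler divergence [5,6] (KLD):
$$D_{\mathrm{KL}}[P : Q] := \begin{cases} \int_{\mathcal{X}} \log\left(\frac{\mathrm{d}P(x)}{\mathrm{d}Q(x)}\right) \mathrm{d}P, & P \ll Q, \\ +\infty, & P \not\ll Q, \end{cases}$$
where $P \ll Q$ means that $P$ is absolutely continuous with respect to $Q$ [3], and $\frac{\mathrm{d}P}{\mathrm{d}Q}$ is the Radon–Nikodym derivative of $P$ with respect to $Q$. Equation (2) can be rewritten using the chain rule as:
$$D_{\mathrm{KL}}[P : Q] = \begin{cases} \int_{\mathcal{X}} \frac{\mathrm{d}P(x)}{\mathrm{d}Q(x)} \log\left(\frac{\mathrm{d}P(x)}{\mathrm{d}Q(x)}\right) \mathrm{d}Q, & P \ll Q, \\ +\infty, & P \not\ll Q. \end{cases}$$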
Consider a measure $\mu$ for which both Radon–Nikodym derivatives $p := \frac{\mathrm{d}P}{\mathrm{d}\mu}$ and $q := \frac{\mathrm{d}Q}{\mathrm{d}\mu}$ exist (e.g., $\mu = \frac{P+Q}{2}$). Then the Kullback–Leibler divergence can be rewritten as (see Equation (2.5), page 5 of [5], and page 251 of Cover & Thomas’ textbook [6]):
$$D_{\mathrm{KL}}[p : q] := \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, \mathrm{d}\mu(x).$$
Denote by $\mathcal{D} = \mathcal{D}(\mathcal{X})$ the set of all densities with full support $\mathcal{X}$ (Radon–Nikodym derivatives of probability measures with respect to $\mu$):
$$\mathcal{D}(\mathcal{X}) := \left\{ p : \mathcal{X} \to \mathbb{R} \ : \ p(x) > 0 \ \ \mu\text{-almost everywhere}, \ \int_{\mathcal{X}} p(x)\, \mathrm{d}\mu(x) = 1 \right\}.$$
Then the Jensen-Shannon divergence [4] between two densities $p$ and $q$ of $\mathcal{D}$ is defined by:
$$D_{\mathrm{JS}}[p, q] := \frac{1}{2}\left( D_{\mathrm{KL}}\left[p : \frac{p+q}{2}\right] + D_{\mathrm{KL}}\left[q : \frac{p+q}{2}\right] \right).$$
Often, one considers the Lebesgue measure [3] $\mu = \mu_L$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, where $\mathcal{B}(\mathbb{R}^d)$ is the Borel $\sigma$-algebra, or the counting measure [3] $\mu = \mu_\#$ on $(\mathcal{X}, 2^{\mathcal{X}})$, where $\mathcal{X}$ is a countable set, for defining the measure space $(\mathcal{X}, \mathcal{F}, \mu)$.
The JSD belongs to the class of $f$-divergences [7,8,9], which are known as the invariant decomposable divergences of information geometry (see [10], pp. 52–57). Although the KLD is asymmetric (i.e., $D_{\mathrm{KL}}[p : q] \neq D_{\mathrm{KL}}[q : p]$ in general), the JSD is symmetric (i.e., $D_{\mathrm{JS}}[p, q] = D_{\mathrm{JS}}[q, p]$). The notation ‘:’ is used as a parameter separator to indicate that the parameters are not permutation invariant, and that the order of parameters is important.
In this work, a distance $D(O_1 : O_2)$ is a measure of dissimilarity between two objects $O_1$ and $O_2$, which need not be symmetric nor satisfy the triangle inequality of metric distances. A distance only satisfies the identity of indiscernibles: $D(O_1 : O_2) = 0$ if and only if $O_1 = O_2$. When the objects $O_1$ and $O_2$ are probability densities with respect to $\mu$, we call this distance a statistical distance, use brackets to enclose the arguments of the statistical distance (i.e., $D[O_1 : O_2]$), and we have $D[O_1 : O_2] = 0$ if and only if $O_1(x) = O_2(x)$ $\mu$-almost everywhere.
The 2-point JSD of Equation (4) can be extended to a weighted set of $n$ densities $\mathcal{P} := \{(w_1, p_1), \ldots, (w_n, p_n)\}$ (with positive $w_i$’s normalized to sum up to unity, i.e., $\sum_{i=1}^n w_i = 1$), thus providing a diversity index, i.e., an $n$-point JSD for $\mathcal{P}$:
$$D_{\mathrm{JS}}(\mathcal{P}) := \sum_{i=1}^n w_i D_{\mathrm{KL}}[p_i : \bar{p}],$$
where $\bar{p} := \sum_{i=1}^n w_i p_i$ denotes the statistical mixture [11] of the densities of $\mathcal{P}$. We have $D_{\mathrm{JS}}[p, q] = D_{\mathrm{JS}}\left(\left\{\left(\frac{1}{2}, p\right), \left(\frac{1}{2}, q\right)\right\}\right)$. We call $D_{\mathrm{JS}}(\mathcal{P})$ the Jensen-Shannon diversity index.
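To make the diversity index concrete, here is a minimal sketch for the discrete case (counting measure), where densities are plain lists of probabilities. The function names `kl_divergence`, `js_diversity`, and `js_divergence` are illustrative choices, not taken from the paper, and natural logarithms are assumed.

```python
import math

def kl_divergence(p, q):
    """D_KL[p:q] in nats for two discrete densities given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # p is not absolutely continuous w.r.t. q
            total += pi * math.log(pi / qi)
    return total

def js_diversity(weights, densities):
    """n-point JSD: sum_i w_i * D_KL[p_i : mixture], with mixture = sum_i w_i p_i."""
    size = len(densities[0])
    mixture = [sum(w * p[x] for w, p in zip(weights, densities)) for x in range(size)]
    return sum(w * kl_divergence(p, mixture) for w, p in zip(weights, densities))

def js_divergence(p, q):
    """2-point JSD, i.e., the diversity index with weights (1/2, 1/2)."""
    return js_diversity([0.5, 0.5], [p, q])
```

For two densities with disjoint supports, such as `[1.0, 0.0]` and `[0.0, 1.0]`, the sketch returns $\log 2$, the largest value the 2-point JSD can take.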
The KLD is also called the relative entropy since it can be expressed as the difference between the cross-entropy $h[p : q]$ and the entropy $h[p]$:
$$\begin{aligned}
D_{\mathrm{KL}}[p : q] &:= \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, \mathrm{d}\mu(x) \\
&= \int_{\mathcal{X}} p(x) \log p(x)\, \mathrm{d}\mu(x) - \int_{\mathcal{X}} p(x) \log q(x)\, \mathrm{d}\mu(x), \\
&= h[p : q] - h[p],
\end{aligned}$$
with the cross-entropy and entropy defined, respectively, by
$$h[p : q] := -\int_{\mathcal{X}} p(x) \log q(x)\, \mathrm{d}\mu(x),$$
$$h[p] := -\int_{\mathcal{X}} p(x) \log p(x)\, \mathrm{d}\mu(x).$$
Because $h[p] = h[p : p]$, we may say that the entropy is the self-cross-entropy.
When $\mu$ is the Lebesgue measure, the Shannon entropy is also called the differential entropy [6]. Although the discrete entropy $H[p] = -\sum_i p_i \log p_i$ (i.e., entropy with respect to the counting measure) is always non-negative and bounded above by $\log |\mathcal{X}|$, the differential entropy may be negative (e.g., entropy of a Gaussian distribution with small variance).
The Jensen-Shannon divergence of Equation (6) can be rewritten as:
$$D_{\mathrm{JS}}[p, q] = h[\bar{p}] - \sum_{i=1}^n w_i h[p_i] =: J_{-h}[p, q].$$
The JSD representation of Equation (12) is a Jensen divergence [12] for the strictly convex negentropy $F(p) = -h[p]$, since the entropy function $h[\cdot]$ is strictly concave. Therefore, it is appropriate to call this divergence the Jensen-Shannon divergence.
Because $\frac{p_i(x)}{\bar{p}(x)} \leq \frac{p_i(x)}{w_i p_i(x)} = \frac{1}{w_i}$, it can be shown that the Jensen-Shannon diversity index is upper bounded by $H(w) := -\sum_{i=1}^n w_i \log w_i$, the discrete Shannon entropy. Thus, the Jensen-Shannon diversity index is bounded by $\log n$, and the 2-point JSD is bounded by $\log 2$, although the KLD is unbounded and may even be equal to $+\infty$ when the definite integral diverges (e.g., KLD between the standard Cauchy distribution and the standard Gaussian distribution). Another nice property of the JSD is that its square root yields a metric distance [13,14]. This property further holds for the quantum JSD [15]. The JSD has gained interest in machine learning. See, for example, the Generative Adversarial Networks [16] (GANs) in deep learning [17], where it was proven that minimizing the GAN objective function by adversarial training is equivalent to minimizing a JSD.
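The entropy form of the diversity index and its upper bound $H(w) \leq \log n$ can be checked numerically. The following sketch assumes discrete densities and natural logarithms; the weights and densities are arbitrary illustrative values, and the function names are hypothetical.

```python
import math

def shannon_entropy(p):
    """Discrete Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def js_diversity_entropy_form(weights, densities):
    """h[mixture] - sum_i w_i h[p_i], the Jensen divergence for the negentropy."""
    size = len(densities[0])
    mixture = [sum(w * p[x] for w, p in zip(weights, densities)) for x in range(size)]
    return shannon_entropy(mixture) - sum(
        w * shannon_entropy(p) for w, p in zip(weights, densities))

# Illustrative (made-up) weighted set of three densities on a 3-letter alphabet.
weights = [0.2, 0.3, 0.5]
densities = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5]]
index = js_diversity_entropy_form(weights, densities)
bound = shannon_entropy(weights)  # H(w), itself at most log(n)
```

Under these assumptions, `index` lies between 0 and `bound`, and `bound` is at most `math.log(3)`.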
To delineate the different roles that are played by the factor $\frac{1}{2}$ in the ordinary Jensen-Shannon divergence (i.e., in weighting the two KLDs and in weighting the two densities), let us introduce two scalars $\alpha, \beta \in (0, 1)$, and define a generic $(\alpha, \beta)$-skewed Jensen-Shannon divergence, as follows:
$$\begin{aligned}
D_{\mathrm{JS},\alpha,\beta}[p : q] &:= (1-\beta) D_{\mathrm{KL}}[p : m_\alpha] + \beta D_{\mathrm{KL}}[q : m_\alpha], \\
&= (1-\beta) h[p : m_\alpha] + \beta h[q : m_\alpha] - (1-\beta) h[p] - \beta h[q], \\
&= h[m_\beta : m_\alpha] - \left( (1-\beta) h[p] + \beta h[q] \right),
\end{aligned}$$
where $m_\alpha := (1-\alpha) p + \alpha q$ and $m_\beta := (1-\beta) p + \beta q$. This identity holds because $D_{\mathrm{JS},\alpha,\beta}$ is bounded by $(1-\beta) \log \frac{1}{1-\alpha} + \beta \log \frac{1}{\alpha}$, see [18]. Thus, when $\beta = \alpha$, we have $D_{\mathrm{JS},\alpha}[p, q] = D_{\mathrm{JS},\alpha,\alpha}[p, q] = h[m_\alpha] - \left( (1-\alpha) h[p] + \alpha h[q] \right)$, since the self-cross-entropy corresponds to the entropy: $h[m_\alpha : m_\alpha] = h[m_\alpha]$.
An $f$-divergence [9,19,20] is defined for a convex generator $f$, which is strictly convex at 1 (to satisfy the identity of the indiscernibles) and satisfies $f(1) = 0$, by
$$I_f[p : q] := \int_{\mathcal{X}} p(x) f\left(\frac{q(x)}{p(x)}\right) \mathrm{d}\mu(x) \geq f(1) = 0,$$
where the right-hand side follows from Jensen’s inequality [20]. For example, the total variation distance $D_{\mathrm{TV}}[p : q] = \frac{1}{2} \int_{\mathcal{X}} |p(x) - q(x)|\, \mathrm{d}\mu(x)$ is an $f$-divergence for the generator $f_{\mathrm{TV}}(u) = \frac{1}{2}|u - 1|$: $D_{\mathrm{TV}}[p : q] = I_{f_{\mathrm{TV}}}[p : q]$. The generator $f_{\mathrm{TV}}(u)$ is convex on $\mathbb{R}$, strictly convex at 1, and satisfies $f_{\mathrm{TV}}(1) = 0$.
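The $D_{\mathrm{JS},\alpha,\beta}$ divergence is an $f$-divergence
$$D_{\mathrm{JS},\alpha,\beta}[p : q] = I_{f_{\mathrm{JS},\alpha,\beta}}[p : q],$$
for the generator:
$$f_{\mathrm{JS},\alpha,\beta}(u) = -\left( (1-\beta) \log\left(\alpha u + (1-\alpha)\right) + \beta u \log\left(\frac{1-\alpha}{u} + \alpha\right) \right).$$
We check that the generator $f_{\mathrm{JS},\alpha,\beta}$ is strictly convex, since, for any $\alpha \in (0, 1)$ and $\beta \in (0, 1)$, we have
$$f''_{\mathrm{JS},\alpha,\beta}(u) = \frac{\alpha^2 (1-\beta) u + (\alpha-1)^2 \beta}{\alpha^2 u^3 + 2\alpha(1-\alpha) u^2 + (\alpha-1)^2 u} > 0,$$
when $u > 0$.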
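The claim that the skewed divergence is an $f$-divergence can be verified numerically in the discrete case. The sketch below assumes full-support discrete densities and natural logarithms; `skew_js`, `f_js`, and `f_divergence` are hypothetical helper names.

```python
import math

def kl_divergence(p, q):
    """D_KL[p:q] in nats for full-support discrete densities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def skew_js(p, q, alpha, beta):
    """(1-beta) D_KL[p:m_alpha] + beta D_KL[q:m_alpha], m_alpha = (1-alpha)p + alpha q."""
    m_alpha = [(1 - alpha) * pi + alpha * qi for pi, qi in zip(p, q)]
    return (1 - beta) * kl_divergence(p, m_alpha) + beta * kl_divergence(q, m_alpha)

def f_js(u, alpha, beta):
    """Generator of the skewed JSD viewed as an f-divergence (u > 0)."""
    return -((1 - beta) * math.log(alpha * u + (1 - alpha))
             + beta * u * math.log((1 - alpha) / u + alpha))

def f_divergence(p, q, f):
    """I_f[p:q] = sum_x p(x) f(q(x)/p(x)) for full-support discrete densities."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))
```

Both routes should agree up to rounding, and the value should stay below the bound $(1-\beta)\log\frac{1}{1-\alpha} + \beta\log\frac{1}{\alpha}$ quoted above.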
The Jensen-Shannon principle of taking the average of the (Kullback–Leibler) divergences from the source parameters to the mid-parameter can be applied to other distances. For example, the Jensen–Bregman divergence is a Jensen-Shannon symmetrization of the Bregman divergence $B_F$ [12]:
$$B_F^{\mathrm{JS}}(\theta_1 : \theta_2) := \frac{1}{2}\left( B_F\left(\theta_1 : \frac{\theta_1 + \theta_2}{2}\right) + B_F\left(\theta_2 : \frac{\theta_1 + \theta_2}{2}\right) \right),$$
where the Bregman divergence [21] $B_F$ is defined by
$$B_F(\theta : \theta') := F(\theta) - F(\theta') - (\theta - \theta')^\top \nabla F(\theta').$$
The Jensen–Bregman divergence $B_F^{\mathrm{JS}}$ can also be written as an equivalent Jensen divergence $J_F$:
$$B_F^{\mathrm{JS}}(\theta_1 : \theta_2) = J_F(\theta_1 : \theta_2) := \frac{F(\theta_1) + F(\theta_2)}{2} - F\left(\frac{\theta_1 + \theta_2}{2}\right),$$
where $F$ is a strictly convex function ensuring $J_F(\theta_1 : \theta_2) \geq 0$ with equality if and only if $\theta_1 = \theta_2$.
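The equality between the Jensen–Bregman divergence and the Jensen divergence follows because the gradient terms cancel at the midpoint. A small numerical sketch, assuming the illustrative generator $F(\theta) = \sum_j \theta_j \log \theta_j$ on positive vectors (an arbitrary choice made here, not prescribed by the text):

```python
import math

def F(theta):
    """Illustrative strictly convex generator on positive vectors."""
    return sum(t * math.log(t) for t in theta)

def grad_F(theta):
    """Gradient of F: d/dt (t log t) = log t + 1."""
    return [math.log(t) + 1.0 for t in theta]

def bregman(theta1, theta2):
    """B_F(theta1 : theta2) = F(theta1) - F(theta2) - <theta1 - theta2, grad F(theta2)>."""
    inner = sum((a - b) * g for a, b, g in zip(theta1, theta2, grad_F(theta2)))
    return F(theta1) - F(theta2) - inner

def jensen_bregman(theta1, theta2):
    """Average Bregman divergence from each parameter to the mid-parameter."""
    mid = [(a + b) / 2 for a, b in zip(theta1, theta2)]
    return 0.5 * (bregman(theta1, mid) + bregman(theta2, mid))

def jensen(theta1, theta2):
    """J_F = (F(theta1) + F(theta2)) / 2 - F(midpoint)."""
    mid = [(a + b) / 2 for a, b in zip(theta1, theta2)]
    return (F(theta1) + F(theta2)) / 2 - F(mid)
```

The two functions `jensen_bregman` and `jensen` should return the same value up to floating-point rounding for any pair of positive vectors.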
Because of its use in various fields of information sciences [22], various generalizations of the JSD have been proposed. These generalizations are based either on Equation (5) [23] or on Equation (12) [18,24,25]. For example, the (arithmetic) mixture $\bar{p} = \sum_i w_i p_i$ in Equation (6) was replaced by an abstract statistical mixture with respect to a generic mean $M$ in [23] (e.g., the geometric mixture induced by the geometric mean), and the two KLDs defining the JSD in Equation (5) were further averaged using another abstract mean $N$, thus yielding the following generic $(M, N)$-Jensen-Shannon divergence [23] (abbreviated as $(M, N)$-JSD):
$$D_{\mathrm{JS}}^{M,N}[p : q] := N\left( D_{\mathrm{KL}}\left[p : (pq)^M_{\frac{1}{2}}\right], D_{\mathrm{KL}}\left[q : (pq)^M_{\frac{1}{2}}\right] \right),$$
where $(pq)^M_\alpha$ denotes the statistical weighted $M$-mixture:
$$(pq)^M_\alpha(x) := \frac{M_\alpha(p(x), q(x))}{\int_{\mathcal{X}} M_\alpha(p(x), q(x))\, \mathrm{d}\mu(x)}.$$
Notice that, when $M = N = A$ (the arithmetic mean), Equation (23) of the $(A, A)$-JSD reduces to the ordinary JSD of Equation (5). When the means $M$ and $N$ are symmetric, the $(M, N)$-JSD is symmetric.
In general, a weighted mean $M_\alpha(a, b)$ for any $\alpha \in [0, 1]$ shall satisfy the in-betweenness property [26] (i.e., a mean should be contained inside its extrema):
$$\min\{a, b\} \leq M_\alpha(a, b) \leq \max\{a, b\}.$$
The three Pythagorean means defined for positive scalars $a > 0$ and $b > 0$ are classic examples of means:
  • the arithmetic mean $A(a, b) = \frac{a+b}{2}$,
  • the geometric mean $G(a, b) = \sqrt{ab}$, and
  • the harmonic mean $H(a, b) = \frac{2ab}{a+b}$.
These Pythagorean means may be interpreted as special instances of another parametric family of means, the power means
$$P_\alpha(a, b) := \left(\frac{a^\alpha + b^\alpha}{2}\right)^{\frac{1}{\alpha}},$$
defined for $\alpha \in \mathbb{R} \setminus \{0\}$ (also called Hölder means). The power means can be extended to the full range $\alpha \in \mathbb{R}$ by using the property that $\lim_{\alpha \to 0} P_\alpha(a, b) = G(a, b)$. The power means are homogeneous means: $P_\alpha(\lambda a, \lambda b) = \lambda P_\alpha(a, b)$ for any $\lambda > 0$. We refer to the handbook of means [27] for definitions and principles of other means beyond these power means.
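As a quick illustration, the power mean and its Pythagorean special cases ($\alpha = 1$ arithmetic, $\alpha \to 0$ geometric, $\alpha = -1$ harmonic) can be sketched as follows. The function name `power_mean` is an illustrative choice.

```python
import math

def power_mean(a, b, alpha):
    """Power (Hoelder) mean of two positive scalars; alpha = 0 is the geometric limit."""
    if alpha == 0:
        return math.sqrt(a * b)
    return ((a ** alpha + b ** alpha) / 2) ** (1 / alpha)

arithmetic = power_mean(2.0, 8.0, 1)   # (2 + 8) / 2
geometric = power_mean(2.0, 8.0, 0)    # sqrt(2 * 8)
harmonic = power_mean(2.0, 8.0, -1)    # 2 * 2 * 8 / (2 + 8)
```

Each value stays between `min(a, b)` and `max(a, b)`, in line with the in-betweenness property.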
A weighted mean (also called barycenter) can be built from a non-weighted mean $M(a, b)$ (i.e., $\alpha = \frac{1}{2}$) by using the dyadic expansion of the real weight $\alpha \in [0, 1]$, see [28]. That is, we can define the weighted mean $M(p, q; w, 1-w)$ for $w = \frac{i}{2^k}$ with $i \in \{0, \ldots, 2^k\}$ and $k$ an integer. For example, consider a symmetric mean $M(p, q) = M(q, p)$. Then we get the following weighted means when $k = 3$:
$$\begin{aligned}
M\left(p, q; \tfrac{0}{8} = 0, \tfrac{8}{8} = 1\right) &= q \\
M\left(p, q; \tfrac{1}{8}, \tfrac{7}{8}\right) &= M(M(M(p, q), q), q) \\
M\left(p, q; \tfrac{2}{8} = \tfrac{1}{4}, \tfrac{6}{8} = \tfrac{3}{4}\right) &= M(M(p, q), q) \\
M\left(p, q; \tfrac{3}{8}, \tfrac{5}{8}\right) &= M(M(M(p, q), p), q) \\
M\left(p, q; \tfrac{4}{8} = \tfrac{1}{2}, \tfrac{4}{8} = \tfrac{1}{2}\right) &= M(p, q) \\
M\left(p, q; \tfrac{5}{8}, \tfrac{3}{8}\right) &= M(M(M(p, q), q), p) \\
M\left(p, q; \tfrac{6}{8} = \tfrac{3}{4}, \tfrac{2}{8} = \tfrac{1}{4}\right) &= M(M(p, q), p) \\
M\left(p, q; \tfrac{7}{8}, \tfrac{1}{8}\right) &= M(M(M(p, q), p), p) \\
M\left(p, q; \tfrac{8}{8} = 1, \tfrac{0}{8} = 0\right) &= p
\end{aligned}$$
Let $w = \sum_{i=1}^{\infty} \frac{d_i}{2^i}$ be the unique dyadic expansion of the real number $w \in (0, 1)$, where the $d_i$’s are binary digits (i.e., $d_i \in \{0, 1\}$). We define the weighted mean $M(x, y; w, 1-w)$ of two positive reals $x$ and $y$ for a real weight $w \in (0, 1)$ as
$$M(x, y; w, 1-w) := \lim_{n \to \infty} M\left(x, y; \sum_{i=1}^n \frac{d_i}{2^i}, 1 - \sum_{i=1}^n \frac{d_i}{2^i}\right).$$
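The dyadic construction can be sketched as a loop over the binary digits of $w = i/2^k$, consumed from the least significant digit: each digit decides whether the running value is averaged with $p$ (digit 1) or with $q$ (digit 0). The function name `dyadic_weighted_mean` and the digit-processing order are assumptions of this sketch; it relies on the mean being symmetric and satisfying $M(a, a) = a$.

```python
import math

def dyadic_weighted_mean(mean, p, q, i, k):
    """M(p, q; w, 1-w) for w = i / 2**k, built only from a symmetric 2-point mean."""
    if not 0 <= i <= 2 ** k:
        raise ValueError("i must lie in {0, ..., 2**k}")
    if i == 2 ** k:
        return p  # all the weight is on p
    x = q  # start from weight 0 on p
    for _ in range(k):
        digit = i & 1  # next binary digit of w, least significant first
        i >>= 1
        x = mean(p if digit else q, x)
    return x

def arithmetic(a, b):
    return (a + b) / 2

def geometric(a, b):
    return math.sqrt(a * b)
```

With the arithmetic mean this reproduces $w p + (1-w) q$, and with the geometric mean it reproduces $p^w q^{1-w}$, for every dyadic weight $w$.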
Choosing the abstract mean $M$ in accordance with the family $\mathcal{R} = \{p_\theta : \theta \in \Theta\}$ of the densities allows one to obtain closed-form formulas for the $(M, N)$-JSDs that rely on definite integral calculations [23]. For example, the JSD between two Gaussian densities does not admit a closed-form formula because of the log-sum integral, but the $(G, N)$-JSD admits a closed-form formula when using geometric statistical mixtures (i.e., when $M = G$). The calculus trick is to find a weighted mean $M_\alpha$ such that, for two densities $p_{\theta_1}$ and $p_{\theta_2}$, the weighted mean satisfies $M_\alpha(p_{\theta_1}(x), p_{\theta_2}(x)) = \frac{p_{\theta_{1,2,\alpha}}(x)}{Z_{M_\alpha}(\theta_1, \theta_2)}$, where $Z_{M_\alpha}(\theta_1, \theta_2)$ is the normalizing coefficient and $p_{\theta_{1,2,\alpha}} \in \mathcal{R}$. Thus, the integral evaluates simply to $\int M_\alpha(p_{\theta_1}(x), p_{\theta_2}(x))\, \mathrm{d}\mu(x) = \frac{1}{Z_{M_\alpha}(\theta_1, \theta_2)}$, since $p_{\theta_{1,2,\alpha}}$ is a density of $\mathcal{R}$ and, therefore, $\int p_{\theta_{1,2,\alpha}}(x)\, \mathrm{d}\mu(x) = 1$. This trick has also been used in Bayesian hypothesis testing for upper bounding the probability of error between two densities of a parametric family of distributions by replacing the usual geometric mean (Section 11.7 of [6], page 375) by a more general quasi-arithmetic mean [29]. For example, the harmonic mean is well-suited to Cauchy distributions, and the power means to Student t-distributions [29].
As an application of these generalized JSDs, Deasy et al. [30] used the skewed geometric JSD (namely, the $(G_\alpha, A_{1-\alpha})$-JSD for $\alpha \in (0, 1)$), which admits a closed-form formula between normal densities [23], and showed how regularizing an optimization task with this G-JSD divergence improved reconstruction and generation of Variational AutoEncoders (VAEs).
More generally, instead of using the KLD, one can also use any arbitrary distance $D$ to define its JS-symmetrization, as follows:
$$D^{\mathrm{JS}}_{M,N}[p : q] := N\left( D\left[p : (pq)^M_{\frac{1}{2}}\right], D\left[q : (pq)^M_{\frac{1}{2}}\right] \right).$$
These symmetrizations may further be skewed by using $M_\alpha$ and/or $N_\beta$ for $\alpha \in (0, 1)$ and $\beta \in (0, 1)$, yielding the definition [23]:
$$D^{\mathrm{JS}}_{M_\alpha, N_\beta}[p : q] := N_\beta\left( D\left[p : (pq)^M_\alpha\right], D\left[q : (pq)^M_\alpha\right] \right).$$
With these notations, the ordinary JSD is $D_{\mathrm{JS}} = {D_{\mathrm{KL}}}^{\mathrm{JS}}_{A,A}$, the $(A, A)$ JS-symmetrization of the KLD with respect to the arithmetic means $M = A$ and $N = A$.
The JS-symmetrization can be interpreted as the $N_\beta$-Jeffreys’ symmetrization of a generalization of Lin’s $\alpha$-skewed K-divergence [4] $D^K_{M_\alpha}[p : q]$:
$$D^{\mathrm{JS}}_{M_\alpha, N_\beta}[p : q] = N_\beta\left( D^K_{M_\alpha}[p : q], D^K_{M_\alpha}[q : p] \right),$$
$$D^K_{M_\alpha}[p : q] := D\left[p : (pq)^{M_\alpha}_\alpha\right].$$
In this work, we consider symmetrizing an arbitrary distance $D$ (including the KLD), generalizing the Jensen-Shannon divergence by using a variational formula for the JSD. Namely, we observe that the Jensen-Shannon divergence can also be defined as the following minimization problem:
$$D_{\mathrm{JS}}[p, q] := \min_{c \in \mathcal{D}} \frac{1}{2}\left( D_{\mathrm{KL}}[p : c] + D_{\mathrm{KL}}[q : c] \right),$$
since the optimal density $c$ is proven unique using the calculus of variations [1,31,32] and corresponds to the mid density $\frac{p+q}{2}$, a statistical (arithmetic) mixture.
Proof. 
Let $S(c) = D_{\mathrm{KL}}[p : c] + D_{\mathrm{KL}}[q : c] \geq 0$. We use the method of Lagrange multipliers for the constrained optimization problem $\min_c S(c)$ such that $\int c(x)\, \mathrm{d}\mu(x) = 1$. Let us minimize $S(c) + \lambda \left( \int c(x)\, \mathrm{d}\mu(x) - 1 \right)$. The density $c$ realizing the minimum of $S(c)$ satisfies the Euler–Lagrange equation $\frac{\partial L}{\partial c} = 0$, where $L(c) := p \log \frac{p}{c} + q \log \frac{q}{c} + \lambda c$ is the Lagrangian. That is, $-\frac{p}{c} - \frac{q}{c} + \lambda = 0$ or, equivalently, $c = \frac{1}{\lambda}(p + q)$. Parameter $\lambda$ is then evaluated from the constraint $\int_{\mathcal{X}} c(x)\, \mathrm{d}\mu(x) = 1$: we get $\lambda = 2$ since $\int_{\mathcal{X}} (p(x) + q(x))\, \mathrm{d}\mu(x) = 2$. Therefore, we find that $c(x) = \frac{p(x) + q(x)}{2}$, the mid density of $p(x)$ and $q(x)$. □
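The variational characterization can be probed numerically on a finite alphabet: among randomly drawn candidate densities $c$, none should achieve a smaller value of $\frac{1}{2}(D_{\mathrm{KL}}[p:c] + D_{\mathrm{KL}}[q:c])$ than the mid density. This is only a sanity check under the assumption of discrete full-support densities, not a proof; the test densities and helper names are made up for illustration.

```python
import math
import random

def kl_divergence(p, q):
    """D_KL[p:q] in nats; q is assumed to have full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def objective(p, q, c):
    """The functional minimized in the variational definition of the JSD."""
    return 0.5 * (kl_divergence(p, c) + kl_divergence(q, c))

def random_density(size, rng):
    """A random full-support density on `size` atoms."""
    raw = [rng.random() + 1e-6 for _ in range(size)]
    total = sum(raw)
    return [r / total for r in raw]

p = [0.1, 0.2, 0.7]
q = [0.5, 0.4, 0.1]
mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
rng = random.Random(0)  # fixed seed so the check is reproducible
candidates = [random_density(3, rng) for _ in range(1000)]
gaps = [objective(p, q, c) - objective(p, q, mid) for c in candidates]
```

Every entry of `gaps` should be non-negative. In fact each gap equals $D_{\mathrm{KL}}[\frac{p+q}{2} : c]$, because $\frac{1}{2}\sum_x (p + q)\log\frac{(p+q)/2}{c}$ is what remains after subtracting the two objectives.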
Considering Equation (32) instead of Equation (5) for defining the Jensen-Shannon divergence is interesting, because it opens a novel approach for generalizing the Jensen-Shannon divergence. This variational approach was first considered by Sibson [1] to define the $\alpha$-information radius of a set of weighted distributions while using Rényi $\alpha$-entropies that are based on Rényi principled $\alpha$-means [33]. The $\alpha$-information radius includes the Jensen-Shannon diversity index when $\alpha = 1$. Sibson’s work is our point of departure for generalizing the Jensen-Shannon divergence and proposing the Jensen-Shannon symmetrizations of arbitrary distances.
The paper is organized as follows: in Section 2, we recall the rationale and definitions of the Rényi $\alpha$-entropy and the Rényi $\alpha$-divergence [33], and explain the information radius of Sibson [1], which includes, as a special case, the ordinary Jensen-Shannon divergence and which can be interpreted as generalized skew Bhattacharyya distances. We report, in Theorem 2, a closed-form formula for calculating the information radius of order $\alpha$ between two densities of an exponential family when $\frac{1}{\alpha}$ is an integer. It is worth pointing out that Sibson’s work (1969) includes, as a particular case of the information radius, a definition of the JSD, prior to the well-known reference paper of Lin [4] (1991). In Section 3, we present the JS-symmetrization variational definition that is based on a generalization of the information radius with a generic mean (Equation (88) and Definition 3). In Section 4, we constrain the mixture density to belong to a prescribed class of (parametric) probability densities, like an exponential family [2], and obtain a relative information radius generalizing the information radius and related to the concept of information projections. Our Definition 5 generalizes the (relative) normal information radius of Sibson [1], who considered the multivariate normal family (Proposition 4). We illustrate this notion of relative information radius by calculating the density of an exponential family minimizing the reverse Kullback–Leibler divergence between a mixture of densities of that exponential family (Proposition 6). Moreover, we get a semi-closed-form formula for the Kullback–Leibler divergence between the densities of two different exponential families (Proposition 5), generalizing the Fenchel–Young divergence [34]. As an application of these relative variational JSDs, we touch upon the problems of clustering and quantization of probability densities in Section 4.2.
Finally, we conclude by summarizing our contributions and discussing related works in Section 5.

2. Rényi Entropy and Divergence, and Sibson Information Radius

Rényi [33] investigated a generalization of the four axioms of Fadeev [35], yielding the unique Shannon entropy [20]. In doing so, Rényi replaced the ordinary weighted arithmetic mean by a more general class of averaging schemes. Namely, Rényi considered the weighted quasi-arithmetic means [36]. A weighted quasi-arithmetic mean can be induced by a strictly monotone and continuous function $g$, as follows:
$$M_g(x_1, \ldots, x_n; w_1, \ldots, w_n) := g^{-1}\left( \sum_{i=1}^n w_i g(x_i) \right),$$
where the $x_i$’s and the $w_i$’s are positive (the weights are normalized, so that $\sum_{i=1}^n w_i = 1$). Because $M_g = M_{-g}$, we may assume without loss of generality that $g$ is a strictly increasing and continuous function. The quasi-arithmetic means were investigated independently by Kolmogorov [36], Nagumo [37], and de Finetti [38].
For example, the power means $P_\alpha(a, b) = \left(\frac{a^\alpha + b^\alpha}{2}\right)^{\frac{1}{\alpha}}$ introduced earlier are quasi-arithmetic means for the generator $g^P_\alpha(u) := u^\alpha$:
$$P_\alpha(a, b) = M_{g^P_\alpha}\left(a, b; \frac{1}{2}, \frac{1}{2}\right).$$
Rényi proved that, among the class of weighted quasi-arithmetic means, only the means induced by the family of functions
$$g_\alpha(u) := 2^{(\alpha-1)u},$$
$$g_\alpha^{-1}(v) := \frac{1}{\alpha-1} \log_2 v,$$
for $\alpha > 0$ and $\alpha \neq 1$ yield a proper generalization of Shannon entropy, nowadays called the Rényi $\alpha$-entropy. The Rényi $\alpha$-mean is
$$\begin{aligned}
M^R_\alpha(x_1, \ldots, x_n; w_1, \ldots, w_n) &= M_{g_\alpha}(x_1, \ldots, x_n; w_1, \ldots, w_n), \\
&= \frac{1}{\alpha-1} \log_2\left( \sum_{i=1}^n w_i 2^{(\alpha-1) x_i} \right).
\end{aligned}$$
The Rényi $\alpha$-means $M^R_\alpha$ are not power means: they are not homogeneous means [31]. Let $M^R_\alpha(p, q) = M^R_\alpha\left(p, q; \frac{1}{2}, \frac{1}{2}\right) = \frac{1}{\alpha-1} \log_2 \frac{2^{(\alpha-1)p} + 2^{(\alpha-1)q}}{2}$. Then we have $\lim_{\alpha \to \infty} M^R_\alpha(p, q) = \max\{p, q\}$ and $\lim_{\alpha \to 1} M^R_\alpha(p, q) = A(p, q) = \frac{p+q}{2}$. Indeed, we have
$$\begin{aligned}
M^R_\alpha(p, q) &= \frac{1}{\alpha-1} \log_2 \frac{2^{(\alpha-1)p} + 2^{(\alpha-1)q}}{2}, \\
&= \frac{1}{\alpha-1} \log_2 \frac{e^{(\alpha-1) p \log 2} + e^{(\alpha-1) q \log 2}}{2}, \\
&\approx_{\alpha \to 1} \frac{1}{\alpha-1} \log_2\left( 1 + (\alpha-1) \frac{p+q}{2} \log 2 \right), \\
&\approx_{\alpha \to 1} \frac{1}{\alpha-1} \frac{1}{\log 2} (\alpha-1) \frac{p+q}{2} \log 2, \\
&\approx_{\alpha \to 1} \frac{p+q}{2} = A(p, q),
\end{aligned}$$
using the following first-order approximations: $e^x \approx_{x \to 0} 1 + x$ and $\log(1 + x) \approx_{x \to 0} x$.
To obtain an intuition of the Rényi entropy, we may consider generalized entropies derived from quasi-arithmetic means, as follows:
$$h_g[p] := -M_g(\log_2 p_1, \ldots, \log_2 p_n; p_1, \ldots, p_n).$$
When $g(u) = u$, we recover Shannon entropy. When $g_2(u) = 2^u$, we get $h_{g_2}[p] = -\log_2 \sum_i p_i^2$, called the collision entropy, since $-\log_2 \Pr[X_1 = X_2] = h_{g_2}[p]$ when $X_1$ and $X_2$ are independent and identically distributed random variables with $X_1 \sim p$ and $X_2 \sim p$. When $g(u) = g_\alpha(u) = 2^{(\alpha-1)u}$, we get
$$\begin{aligned}
h_{g_\alpha}[p] &= -\frac{1}{\alpha-1} \log_2\left( \sum_i p_i 2^{(\alpha-1) \log_2 p_i} \right), \\
&= \frac{1}{1-\alpha} \log_2 \sum_i p_i\, p_i^{\alpha-1} = \frac{1}{1-\alpha} \log_2 \sum_i p_i^\alpha.
\end{aligned}$$
The formula of Equation (41) is the discrete Rényi $\alpha$-entropy [33], which can be defined more generally on a measure space $(\mathcal{X}, \mathcal{F}, \mu)$, as follows:
$$h^R_\alpha[p] := \frac{1}{1-\alpha} \log \int_{\mathcal{X}} p^\alpha(x)\, \mathrm{d}\mu(x), \quad \alpha \in (0, 1) \cup (1, \infty).$$
In the limit case $\alpha \to 1$, the Rényi $\alpha$-entropy converges to Shannon entropy: $\lim_{\alpha \to 1} h^R_\alpha[p] = h[p]$. Rényi $\alpha$-entropies are non-increasing with respect to increasing $\alpha$: $h^R_\alpha[p] \geq h^R_{\alpha'}[p]$ for $\alpha < \alpha'$. In the discrete case (i.e., counting measure $\mu$ on a finite alphabet $\mathcal{X}$), we can further define $h_0[p] = \log |\mathcal{X}|$ for $\alpha = 0$ (also called max-entropy or Hartley entropy). The Rényi $+\infty$-entropy
$$h_{+\infty}[p] = -\log \max_{x \in \mathcal{X}} p(x)$$
is also called the min-entropy, since the sequence $h_\alpha$ is non-increasing with respect to increasing $\alpha$.
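The monotonicity in $\alpha$ and the special cases (Hartley, Shannon, collision, and min-entropy) are easy to observe on a small example. The sketch below assumes a full-support discrete density and reports entropies in bits; `renyi_entropy` is an illustrative name.

```python
import math

def renyi_entropy(p, alpha):
    """Discrete Renyi alpha-entropy in bits; p is assumed to have full support."""
    if alpha == 0:
        return math.log2(len(p))                     # Hartley (max-)entropy
    if alpha == 1:
        return -sum(pi * math.log2(pi) for pi in p)  # Shannon limit
    if alpha == math.inf:
        return -math.log2(max(p))                    # min-entropy
    return math.log2(sum(pi ** alpha for pi in p)) / (1 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
orders = [0, 0.5, 1, 2, 5, math.inf]
values = [renyi_entropy(p, a) for a in orders]
```

For this density the list `values` decreases from 2 bits at $\alpha = 0$ to 1 bit at $\alpha = +\infty$, passing through the Shannon entropy of 1.75 bits at $\alpha = 1$.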
Similarly, Rényi obtained the $\alpha$-divergences for $\alpha > 0$ and $\alpha \neq 1$ (originally called information gain of order $\alpha$):
$$D^R_\alpha[p : q] := \frac{1}{\alpha-1} \log_2 \int_{\mathcal{X}} p(x)^\alpha q(x)^{1-\alpha}\, \mathrm{d}\mu(x),$$
generalizing the Kullback–Leibler divergence, since $\lim_{\alpha \to 1} D^R_\alpha[p : q] = D_{\mathrm{KL}}[p : q]$. Rényi $\alpha$-divergences are non-decreasing with respect to increasing $\alpha$ [39]: $D^R_\alpha[p : q] \leq D^R_{\alpha'}[p : q]$ for $\alpha' \geq \alpha$.
Sibson [1] (Robin Sibson (1944–2017) is also renowned for inventing natural neighbour interpolation [40]) considered both the Rényi $\alpha$-divergence [33] $D^R_\alpha$ and the Rényi $\alpha$-weighted mean $M^R_\alpha := M_{g_\alpha}$ to define the information radius $R_\alpha$ of order $\alpha$ of a weighted set $\mathcal{P} = \{(w_i, p_i)\}_{i=1}^n$ of densities $p_i$’s as the following minimization problem:
$$R_\alpha(\mathcal{P}) := \min_{c \in \mathcal{D}} R_\alpha(\mathcal{P}, c),$$
where
$$R_\alpha(\mathcal{P}, c) := M^R_\alpha\left( D^R_\alpha[p_1 : c], \ldots, D^R_\alpha[p_n : c]; w_1, \ldots, w_n \right).$$
The Rényi $\alpha$-weighted mean $M^R_\alpha$ can be rewritten as
$$M^R_\alpha(x_1, \ldots, x_n; w_1, \ldots, w_n) = \frac{1}{(\alpha-1) \log 2}\, \mathrm{LSE}\left( (\alpha-1) x_1 \log 2 + \log w_1, \ldots, (\alpha-1) x_n \log 2 + \log w_n \right),$$
where the function $\mathrm{LSE}(a_1, \ldots, a_n) := \log\left( \sum_{i=1}^n e^{a_i} \right)$ denotes the log-sum-exp (convex) function [41,42], with $\log$ the natural logarithm.
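Writing the Rényi mean through log-sum-exp is mainly useful numerically, because the largest exponent can be subtracted before exponentiating. The sketch below assumes natural logarithms inside LSE and therefore divides by $(\alpha - 1)\log 2$ to return a value on the base-2 scale; the function names are illustrative.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) obtained by shifting by the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def renyi_mean(xs, ws, alpha):
    """Renyi alpha-weighted mean of xs with normalized positive weights ws."""
    if alpha == 1:
        return sum(w * x for w, x in zip(ws, xs))  # arithmetic-mean limit
    ln2 = math.log(2)
    terms = [(alpha - 1) * x * ln2 + math.log(w) for x, w in zip(xs, ws)]
    return log_sum_exp(terms) / ((alpha - 1) * ln2)
```

As $\alpha$ grows the value approaches `max(xs)`, and as $\alpha \to 1$ it approaches the weighted arithmetic mean, matching the two limits stated earlier. A direct evaluation of `2 ** ((alpha - 1) * x)` would overflow for large $\alpha$, which the shifted form avoids.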
Notice that $2^{(\alpha-1) D^R_\alpha[p : q]} = \int_{\mathcal{X}} p(x)^\alpha q(x)^{1-\alpha}\, \mathrm{d}\mu(x)$, the Bhattacharyya $\alpha$-coefficient [12] (also called Chernoff $\alpha$-coefficient [43,44]):
$$C_{\mathrm{Bhat},\alpha}[p : q] := \int_{\mathcal{X}} p(x)^\alpha q(x)^{1-\alpha}\, \mathrm{d}\mu(x).$$
Thus, we have
$$R_\alpha(\mathcal{P}, c) = \frac{1}{\alpha-1} \log_2 \sum_i w_i C_{\mathrm{Bhat},\alpha}[p_i : c].$$
The ordinary Bhattacharyya coefficient is obtained for $\alpha = \frac{1}{2}$: $C_{\mathrm{Bhat}}[p : q] := \int_{\mathcal{X}} \sqrt{p(x) q(x)}\, \mathrm{d}\mu(x)$.
Sibson [1] also considered the limit case $\alpha \to \infty$ when defining the information radius:
$$D^R_\infty[p : q] := \log_2 \sup_{x \in \mathcal{X}} \frac{p(x)}{q(x)}.$$
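Sibson reported the following theorem in his information radius study [1]:
Theorem 1
(Theorem 2.2 and Corollary 2.3 of [1]). The optimal density $c^*_\alpha = \arg\min_{c \in \mathcal{D}} R_\alpha(\mathcal{P}, c)$ is unique, and we have:
$$\begin{aligned}
c^*_1(x) &= \sum_i w_i p_i(x), & R_1(\mathcal{P}) &= R_1(\mathcal{P}, c^*_1) = \int_{\mathcal{X}} \sum_i w_i p_i(x) \log_2 \frac{p_i(x)}{\sum_j w_j p_j(x)}\, \mathrm{d}\mu(x), \\
c^*_\alpha(x) &= \frac{\left(\sum_i w_i p_i(x)^\alpha\right)^{\frac{1}{\alpha}}}{\int_{\mathcal{X}} \left(\sum_i w_i p_i(x)^\alpha\right)^{\frac{1}{\alpha}} \mathrm{d}\mu(x)}, & R_\alpha(\mathcal{P}) &= R_\alpha(\mathcal{P}, c^*_\alpha) = \frac{1}{\alpha-1} \log_2\left( \int_{\mathcal{X}} \left(\sum_i w_i p_i(x)^\alpha\right)^{\frac{1}{\alpha}} \mathrm{d}\mu(x) \right)^{\alpha}, \quad \alpha \in (0, 1) \cup (1, \infty), \\
c^*_\infty(x) &= \frac{\max_i p_i(x)}{\int_{\mathcal{X}} \left(\max_i p_i(x)\right) \mathrm{d}\mu(x)}, & R_\infty(\mathcal{P}) &= R_\infty(\mathcal{P}, c^*_\infty) = \log_2 \int_{\mathcal{X}} \max_i p_i(x)\, \mathrm{d}\mu(x).
\end{aligned}$$
Observe that $R_\infty(\mathcal{P})$ does not depend on the (positive) weights.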
The proof follows from the following decomposition of the information radius:
Proposition 1.
We have:
$$R_\alpha(\mathcal{P}, c) - R_\alpha(\mathcal{P}, c^*_\alpha) = D^R_\alpha[c^*_\alpha : c] \geq 0.$$
Because the proof is omitted in [1], we report it here:
Proof. 
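Let $\Delta(c, c^*_\alpha) := R_\alpha(\mathcal{P}, c) - R_\alpha(\mathcal{P}, c^*_\alpha)$. We handle the three cases, depending on the value of $\alpha$:
  • Case $\alpha \in (0, 1) \cup (1, \infty)$: Let $P_\alpha(\mathcal{P})(x) := \left(\sum_i w_i p_i(x)^\alpha\right)^{\frac{1}{\alpha}}$. We have $(c^*_\alpha(x))^\alpha = \frac{\sum_i w_i p_i(x)^\alpha}{\left(\int P_\alpha(\mathcal{P})(x)\, \mathrm{d}\mu(x)\right)^\alpha}$. We obtain
    $$\begin{aligned}
    \Delta(c, c^*_\alpha) &= \frac{1}{\alpha-1} \log_2 \int \sum_i w_i p_i(x)^\alpha c(x)^{1-\alpha}\, \mathrm{d}\mu(x) - \frac{1}{\alpha-1} \log_2\left( \int P_\alpha(\mathcal{P})(x)\, \mathrm{d}\mu(x) \right)^\alpha, \\
    &= \frac{1}{\alpha-1} \log_2 \frac{\int \sum_i w_i p_i(x)^\alpha c(x)^{1-\alpha}\, \mathrm{d}\mu(x)}{\left(\int P_\alpha(\mathcal{P})(x)\, \mathrm{d}\mu(x)\right)^\alpha}, \\
    &= \frac{1}{\alpha-1} \log_2 \int \frac{\left(\sum_i w_i p_i(x)^\alpha\right) c(x)^{1-\alpha}}{\left(\int P_\alpha(\mathcal{P})(x)\, \mathrm{d}\mu(x)\right)^\alpha}\, \mathrm{d}\mu(x), \\
    &= \frac{1}{\alpha-1} \log_2 \int (c^*_\alpha(x))^\alpha c(x)^{1-\alpha}\, \mathrm{d}\mu(x), \\
    &=: D^R_\alpha[c^*_\alpha : c].
    \end{aligned}$$
  • Case $\alpha = 1$: we have $\Delta(c, c^*_1) := R_1(\mathcal{P}, c) - R_1(\mathcal{P}, c^*_1)$ with $c^*_1 = \sum_i w_i p_i$. Because $R_1(\mathcal{P}, c) = \sum_i w_i D_{\mathrm{KL}}[p_i : c]$, we have
    $$\begin{aligned}
    R_1(\mathcal{P}, c) &= \sum_i w_i \left( h[p_i : c] - h[p_i] \right), \\
    &= h\left[\sum_i w_i p_i : c\right] - \sum_i w_i h[p_i], \\
    &= h[c^*_1 : c] - \sum_i w_i h[p_i].
    \end{aligned}$$
    It follows that
    $$\begin{aligned}
    \Delta(c, c^*_1) &= \left( h[c^*_1 : c] - \sum_i w_i h[p_i] \right) - \left( h[c^*_1 : c^*_1] - \sum_i w_i h[p_i] \right), \\
    &= h[c^*_1 : c] - h[c^*_1], \\
    &= D_{\mathrm{KL}}[c^*_1 : c] = D^R_1[c^*_1 : c].
    \end{aligned}$$
  • Case $\alpha = \infty$: we have $c^*_\infty = \frac{\max_i p_i(x)}{\int \left(\max_i p_i(x)\right) \mathrm{d}\mu(x)}$, $R_\infty(\mathcal{P}, c^*_\infty) = \log_2 \int \left(\max_i p_i(x)\right) \mathrm{d}\mu(x)$, and $D^R_\infty[p : q] = \log_2 \sup_x \frac{p(x)}{q(x)}$. Since the Rényi mean of order $\infty$ is the maximum, we have $R_\infty(\mathcal{P}, c) = \log_2 \sup_x \frac{\max_i p_i(x)}{c(x)}$. Thus, $\Delta(c, c^*_\infty) := R_\infty(\mathcal{P}, c) - R_\infty(\mathcal{P}, c^*_\infty) = \log_2 \sup_x \frac{c^*_\infty(x)}{c(x)} = D^R_\infty[c^*_\infty : c]$.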
 □
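It follows that
$$\min_c R_\alpha(\mathcal{P}, c) = \min_c \left( R_\alpha(\mathcal{P}, c^*_\alpha) + D^R_\alpha[c^*_\alpha : c] \right) \equiv \min_c D^R_\alpha[c^*_\alpha : c] \geq 0.$$
Thus, the minimizer is $c = c^*_\alpha$, since $D^R_\alpha[c^*_\alpha : c]$ is minimized (and vanishes) for $c = c^*_\alpha$.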
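Proposition 1 can be checked numerically on a finite alphabet for $\alpha \in (0, 1) \cup (1, \infty)$. The sketch below assumes full-support discrete densities and base-2 logarithms; the helper names `renyi_divergence`, `radius_at`, and `optimal_center` are illustrative, and the weighted set used afterwards is made up.

```python
import math

def renyi_divergence(p, q, alpha):
    """D_alpha^R[p:q] in bits for alpha > 0, alpha != 1, full-support densities."""
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return math.log2(s) / (alpha - 1)

def radius_at(weights, densities, c, alpha):
    """R_alpha(P, c) = 1/(alpha-1) log2 sum_i w_i C_{Bhat,alpha}[p_i : c]."""
    s = sum(w * sum(pi ** alpha * ci ** (1 - alpha) for pi, ci in zip(p, c))
            for w, p in zip(weights, densities))
    return math.log2(s) / (alpha - 1)

def optimal_center(weights, densities, alpha):
    """Closed-form minimizer c*_alpha: normalized (sum_i w_i p_i^alpha)^(1/alpha)."""
    size = len(densities[0])
    raw = [sum(w * p[x] ** alpha for w, p in zip(weights, densities)) ** (1 / alpha)
           for x in range(size)]
    z = sum(raw)
    return [r / z for r in raw]

weights = [0.3, 0.7]
densities = [[0.1, 0.2, 0.7], [0.5, 0.4, 0.1]]
c = [0.2, 0.5, 0.3]  # an arbitrary competing center
alpha = 2.0
c_star = optimal_center(weights, densities, alpha)
gap = radius_at(weights, densities, c, alpha) - radius_at(weights, densities, c_star, alpha)
```

Under these assumptions `gap` should coincide with `renyi_divergence(c_star, c, alpha)` up to rounding, which is the content of the decomposition.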
Notice that $c^*_\infty(x) = \frac{\max\{p_1(x), \ldots, p_n(x)\}}{\int_{\mathcal{X}} \left(\max_i p_i(x)\right) \mathrm{d}\mu(x)}$ is the upper envelope of the densities $p_i(x)$’s normalized to be a density. Provided that the densities $p_i$’s intersect pairwise in at most $s$ locations (i.e., $|\{x : p_i(x) = p_j(x)\}| \leq s$ for $i \neq j$), we can efficiently compute this upper envelope using an output-sensitive algorithm [45] of computational geometry.
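When the point set is $\mathcal{P} = \left\{\left(\frac{1}{2}, p\right), \left(\frac{1}{2}, q\right)\right\}$ with $w_1 = w_2 = \frac{1}{2}$, the information radius defines a (2-point) symmetric distance, as follows:
$$\begin{aligned}
R_1(p, q) &= \frac{1}{2} \int_{\mathcal{X}} p(x) \log_2 \frac{2 p(x)}{p(x) + q(x)}\, \mathrm{d}\mu(x) + \frac{1}{2} \int_{\mathcal{X}} q(x) \log_2 \frac{2 q(x)}{p(x) + q(x)}\, \mathrm{d}\mu(x), & \alpha &= 1, \\
R_\alpha(p, q) &= \frac{\alpha}{\alpha-1} \log_2 \int_{\mathcal{X}} \left(\frac{p(x)^\alpha + q(x)^\alpha}{2}\right)^{\frac{1}{\alpha}} \mathrm{d}\mu(x) = \frac{\alpha}{\alpha-1} \log_2 \int_{\mathcal{X}} P_\alpha(p(x), q(x))\, \mathrm{d}\mu(x), & \alpha &\in (0, 1) \cup (1, \infty), \\
R_\infty(p, q) &= \log_2 \int_{\mathcal{X}} \max\{p(x), q(x)\}\, \mathrm{d}\mu(x), & \alpha &= \infty.
\end{aligned}$$
This family of symmetric divergences may be called Sibson’s $\alpha$-divergences, and the Jensen-Shannon divergence is interpreted as a limit case when $\alpha \to 1$. Notice that, since we have $\lim_{\alpha \to \infty} P_\alpha(p, q) = \max\{p, q\}$ and $\lim_{\alpha \to \infty} \frac{\alpha}{\alpha-1} = 1$, we have $\lim_{\alpha \to \infty} R_\alpha(p, q) = R_\infty(p, q)$. Notice that, for $\alpha = 1$, the integral and logarithm operations are swapped as compared to $R_\alpha$ for $\alpha \in (0, 1) \cup (1, \infty)$.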
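The three regimes of the 2-point information radius can be compared on a small discrete example. This sketch assumes full-support discrete densities and base-2 logarithms; `sibson_divergence` is an illustrative name, and the two densities are arbitrary.

```python
import math

def sibson_divergence(p, q, alpha):
    """2-point information radius R_alpha(p, q) in bits for discrete densities."""
    if alpha == 1:
        return 0.5 * sum(pi * math.log2(2 * pi / (pi + qi)) for pi, qi in zip(p, q)) \
             + 0.5 * sum(qi * math.log2(2 * qi / (pi + qi)) for pi, qi in zip(p, q))
    if alpha == math.inf:
        return math.log2(sum(max(pi, qi) for pi, qi in zip(p, q)))
    total = sum(((pi ** alpha + qi ** alpha) / 2) ** (1 / alpha) for pi, qi in zip(p, q))
    return alpha / (alpha - 1) * math.log2(total)

p = [0.1, 0.2, 0.7]
q = [0.5, 0.4, 0.1]
near_one = sibson_divergence(p, q, 1 + 1e-5)
at_one = sibson_divergence(p, q, 1)          # the JSD, expressed in bits
large = sibson_divergence(p, q, 100.0)
at_infinity = sibson_divergence(p, q, math.inf)
```

The values `near_one` and `at_one` should nearly coincide, and `large` should already be close to `at_infinity`, illustrating the two limits discussed above.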
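Theorem 2.
When $\alpha = \frac{1}{k}$ for an integer $k \geq 2$, the Sibson $\alpha$-divergence between two densities $p_{\theta_1}$ and $p_{\theta_2}$ of an exponential family $\{p_\theta : \theta \in \Theta\}$ with cumulant function $F(\theta)$ is available in closed form:
$$R_\alpha(p_{\theta_1}, p_{\theta_2}) = -\frac{1}{k-1} \log_2\left( \frac{1}{2^k} \sum_{i=0}^k \binom{k}{i} \exp\left( F\left(\frac{i}{k} \theta_1 + \left(1 - \frac{i}{k}\right) \theta_2\right) - \left( \frac{i}{k} F(\theta_1) + \left(1 - \frac{i}{k}\right) F(\theta_2) \right) \right) \right).$$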
Proof. 
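Let $p = p_{\theta_1}$ and $q = p_{\theta_2}$ be two densities of an exponential family [2] with cumulant function $F(\theta)$ and natural parameter space $\Theta$. Without loss of generality, we may consider a natural exponential family [2] with densities written canonically as $p_\theta(x) = \exp(x^\top \theta - F(\theta))$ for $\theta \in \Theta$. It can be shown that the cumulant function $F(\theta) = \log \int_{\mathcal{X}} \exp(x^\top \theta)\, \mathrm{d}\mu(x)$ is strictly convex and analytic on the open convex natural parameter space $\Theta$ [2].
When $\alpha = \frac{1}{2}$ (i.e., $k = 2$), we have:
$$\begin{aligned}
R_{\frac{1}{2}}(p, q) &= -\log_2 \int_{\mathcal{X}} \left(\frac{\sqrt{p(x)} + \sqrt{q(x)}}{2}\right)^2 \mathrm{d}\mu(x), \\
&= -\log_2\left( \frac{1}{2} + \frac{1}{2} \int_{\mathcal{X}} \sqrt{p(x) q(x)}\, \mathrm{d}\mu(x) \right), \\
&= -\log_2\left( \frac{1}{2} + \frac{1}{2} C_{\mathrm{Bhat}}[p : q] \right) \geq 0,
\end{aligned}$$
where $C_{\mathrm{Bhat}}[p : q] := \int_{\mathcal{X}} \sqrt{p(x) q(x)}\, \mathrm{d}\mu(x)$ is the Bhattacharyya coefficient (with $0 \leq C_{\mathrm{Bhat}}[p : q] \leq 1$). Using Theorem 3 of [12], we have
$$C_{\mathrm{Bhat}}[p_{\theta_1}, p_{\theta_2}] = \exp\left( F\left(\frac{\theta_1 + \theta_2}{2}\right) - \frac{F(\theta_1) + F(\theta_2)}{2} \right),$$
so that we obtain the following closed-form formula:
$$R_{\frac{1}{2}}(p_{\theta_1}, p_{\theta_2}) = -\log_2\left( \frac{1}{2} + \frac{1}{2} \exp\left( F\left(\frac{\theta_1 + \theta_2}{2}\right) - \frac{F(\theta_1) + F(\theta_2)}{2} \right) \right) \geq 0.$$
Now, assume that $k = \frac{1}{\alpha} \geq 2$ is an arbitrary integer, and let us apply the binomial expansion for $P_\alpha(p_{\theta_1}, p_{\theta_2})$ in the spirit of [46,47]:
$$\begin{aligned}
\int_{\mathcal{X}} P_\alpha(p_{\theta_1}(x), p_{\theta_2}(x))\, \mathrm{d}\mu(x) &= \int_{\mathcal{X}} \left(\frac{p_{\theta_1}(x)^{\frac{1}{k}} + p_{\theta_2}(x)^{\frac{1}{k}}}{2}\right)^k \mathrm{d}\mu(x), \\
&= \frac{1}{2^k} \sum_{i=0}^k \binom{k}{i} \int_{\mathcal{X}} \left(p_{\theta_1}(x)^{\frac{1}{k}}\right)^i \left(p_{\theta_2}(x)^{\frac{1}{k}}\right)^{k-i} \mathrm{d}\mu(x).
\end{aligned}$$
Let $I_{k,i}(\theta_1, \theta_2) := \int_{\mathcal{X}} \left(p_{\theta_1}(x)^{\frac{1}{k}}\right)^i \left(p_{\theta_2}(x)^{\frac{1}{k}}\right)^{k-i} \mathrm{d}\mu(x)$. Because $\frac{i}{k} \theta_1 + \frac{k-i}{k} \theta_2 = \theta_2 + \frac{i}{k}(\theta_1 - \theta_2) \in \Theta$ for $i \in \{0, \ldots, k\}$ (the natural parameter space is convex), we get by following the calculation steps in [12]:
$$I_{k,i}(\theta_1, \theta_2) = \exp\left( F\left(\frac{i}{k} \theta_1 + \left(1 - \frac{i}{k}\right) \theta_2\right) - \left( \frac{i}{k} F(\theta_1) + \left(1 - \frac{i}{k}\right) F(\theta_2) \right) \right) < \infty.$$
Notice that $I_{2,1} = C_{\mathrm{Bhat}}[p_{\theta_1}, p_{\theta_2}]$, and $I_{k,0} = I_{k,k} = 1$.
Thus, we get the following closed-form formula:
$$R_\alpha(p_{\theta_1}, p_{\theta_2}) = -\frac{1}{k-1} \log_2\left( \frac{1}{2^k} \sum_{i=0}^k \binom{k}{i} \exp\left( F\left(\frac{i}{k} \theta_1 + \left(1 - \frac{i}{k}\right) \theta_2\right) - \left( \frac{i}{k} F(\theta_1) + \left(1 - \frac{i}{k}\right) F(\theta_2) \right) \right) \right).$$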
 □
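As a sanity check of the closed form, one can compare it with a direct summation for a simple exponential family. The sketch below picks the Poisson family as an illustrative case (this choice is an assumption of the example, not taken from the text): its natural parameter is $\theta = \log \lambda$ and its cumulant function is $F(\theta) = e^\theta$. The truncation point `cutoff` is also an assumption that is adequate for small rates.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability mass, computed in log-space for stability."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def sibson_numeric(lam1, lam2, k, cutoff=200):
    """R_{1/k} by direct summation of P_alpha(p, q) over x = 0..cutoff-1."""
    alpha = 1.0 / k
    total = sum(((poisson_pmf(x, lam1) ** alpha + poisson_pmf(x, lam2) ** alpha) / 2) ** k
                for x in range(cutoff))
    return math.log2(total) / (1 - k)  # alpha / (alpha - 1) = 1 / (1 - k)

def sibson_closed_form(lam1, lam2, k):
    """Closed form of Theorem 2 with F(theta) = exp(theta), theta = log(lambda)."""
    t1, t2 = math.log(lam1), math.log(lam2)
    total = 0.0
    for i in range(k + 1):
        a = i / k
        skew = math.exp(a * t1 + (1 - a) * t2) - (a * math.exp(t1) + (1 - a) * math.exp(t2))
        total += math.comb(k, i) * math.exp(skew)
    return -math.log2(total / 2 ** k) / (k - 1)
```

For instance, `sibson_numeric(2.0, 5.0, 3)` and `sibson_closed_form(2.0, 5.0, 3)` should agree to many digits, since the neglected tail mass beyond the cutoff is negligible for these rates.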
This closed-form formula applies, in particular, to the family $\{N(m, \Sigma)\}$ of (multivariate) normal distributions with mean $m$ and covariance matrix $\Sigma$. In this case, the natural parameters $\theta$ are expressed using both a vector parameter component $v$ and a matrix parameter component $M$:
$$\theta = (v, M) = \left( \Sigma^{-1} m, -\frac{1}{2} \Sigma^{-1} \right),$$
and the cumulant function is:
$$F_N(\theta) = \frac{d}{2} \log(2\pi) - \frac{1}{2} \log |-2M| - \frac{1}{4} v^\top M^{-1} v,$$
where $|\cdot|$ denotes the matrix determinant.
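For the univariate case ($d = 1$), the natural parameters and the cumulant function can be checked against a direct numerical integration of $\exp(v x + M x^2)$. This sketch assumes the sufficient statistic $(x, x^2)$ with the Lebesgue base measure, for which the Gaussian integral gives the constant term $\frac{1}{2}\log(2\pi)$; the integration bounds and step count are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math

def natural_parameters(mean, variance):
    """Univariate case: theta = (v, M) = (m / sigma^2, -1 / (2 sigma^2))."""
    return mean / variance, -0.5 / variance

def cumulant(v, M):
    """F(theta) = 1/2 log(2 pi) - 1/2 log(-2 M) - v^2 / (4 M), for M < 0."""
    return 0.5 * math.log(2 * math.pi) - 0.5 * math.log(-2 * M) - v * v / (4 * M)

def cumulant_numeric(v, M, lo=-60.0, hi=60.0, steps=120000):
    """log of the integral of exp(v x + M x^2) dx, by a midpoint Riemann sum."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        x = lo + (j + 0.5) * h
        total += math.exp(v * x + M * x * x)
    return math.log(total * h)

v, M = natural_parameters(1.0, 2.0)
closed = cumulant(v, M)
numeric = cumulant_numeric(v, M)
```

The two values should match closely. Note that any additive constant in $F$ cancels in the closed-form Sibson divergence, which only involves the difference $F(\frac{i}{k}\theta_1 + (1 - \frac{i}{k})\theta_2) - (\frac{i}{k}F(\theta_1) + (1 - \frac{i}{k})F(\theta_2))$.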
In general, the optimal density $c^*_\alpha = \arg\min_{c \in \mathcal{D}} R_\alpha(\mathcal{P}, c)$ yielding the information radius $R_\alpha(\mathcal{P})$ can be interpreted as a generalized centroid (extending the notion of Fréchet means [48]) with respect to $(M^R_\alpha, D^R_\alpha)$, where an $(M, D)$-centroid is defined by:
Definition 1
($(M, D)$-centroid). Let $\mathcal{P} = \{(w_1, p_1), \ldots, (w_n, p_n)\}$ be a normalized weighted parameter set, $M$ a mean, and $D$ a distance. Then the $(M, D)$-centroid is defined as
$$c_{M,D}(\mathcal{P}) = \arg\min_c M\left( D(p_1 : c), \ldots, D(p_n : c); w_1, \ldots, w_n \right).$$
Here, we give a general definition of the $(M, D)$-centroid for an arbitrary distance (not necessarily symmetric or metric). The parameter set can either be probability measures having densities with respect to a given measure $\mu$ or a set of vectors. In the first case, the distance $D$ is called a statistical distance. When the densities belong to a parametric family of densities $\mathcal{P} = \{p_\theta : \theta \in \Theta\}$, the statistical distance $D[p_{\theta_1} : p_{\theta_2}]$ amounts to a parameter distance: $D_{\mathcal{P}}(\theta_1 : \theta_2) := D[p_{\theta_1} : p_{\theta_2}]$. For example, when all of the densities $p_i$’s belong to the same natural exponential family [2]
$$\mathcal{P} = \left\{ p_\theta(x) = \exp\left( \theta^\top t(x) - F(\theta) \right) : \theta \in \Theta \right\}$$
with cumulant function $F(\theta) = \log \int \exp(\theta^\top t(x))\, \mathrm{d}\mu(x)$ (i.e., $p_i = p_{\theta_i}$) and sufficient statistic vector $t(x)$, we have $D_{\mathrm{KL}}[p_\theta : p_{\theta_i}] = B_F^*(\theta : \theta_i) := B_F(\theta_i : \theta)$, where $B_F^*$ denotes the reverse Bregman divergence (obtained by swapping the parameter order) of the Bregman divergence [21] $B_F$ defined by
$$B_F(\theta : \theta') := F(\theta) - F(\theta') - (\theta - \theta')^\top \nabla F(\theta').$$
Thus, we have $D_{\mathcal{P}}(\theta_1 : \theta_2) := B_F^*(\theta_1 : \theta_2) = D_{\mathrm{KL}}[p_{\theta_1} : p_{\theta_2}]$.
Let $\mathcal{V} = \{(w_1,\theta_1),\ldots,(w_n,\theta_n)\}$ be the parameter set corresponding to $\mathcal{P}$. Define
$$R_F(\mathcal{V},\theta) := \sum_{i=1}^n w_i B_F(\theta_i:\theta).$$
Then, we have the equivalent decomposition of Proposition 1:
$$R_F(\mathcal{V},\theta) - R_F(\mathcal{V},\theta^*) = B_F(\theta^*:\theta),$$
with $\theta^* = \bar\theta := \sum_{i=1}^n w_i\theta_i$ (this decomposition is used to prove Proposition 1 of [21]). The quantity $R_F(\mathcal{V}) = R_F(\mathcal{V},\theta^*)$ was termed the Bregman information [21,49]. The Bregman information generalizes the variance, which is recovered when the Bregman divergence is the squared Euclidean distance. $R_F(\mathcal{V})$ could also be called the Bregman information radius, following Sibson. Because $R_F(\mathcal{V}) = \sum_{i=1}^n w_i D_{\mathrm{KL}}[p_{\bar\theta}:p_{\theta_i}]$, we can interpret the Bregman information as a Sibson’s information radius for densities of an exponential family with respect to the arithmetic mean $M_1^R = A$ and the reverse Kullback–Leibler divergence: $D_{\mathrm{KL}}^*[p:q] := D_{\mathrm{KL}}[q:p]$. This observation leads us to the JS-symmetrization of distances based on generalized information radii in Section 3.
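The decomposition $R_F(\mathcal{V},\theta) - R_F(\mathcal{V},\theta^*) = B_F(\theta^*:\theta)$ can be checked numerically. The following is a minimal sketch for a scalar parameter; the generator $F(\theta) = e^\theta$, the weights, and the parameter values are arbitrary illustrative choices, not taken from the paper.

```python
import math


def F(theta):
    """Illustrative strictly convex generator (an assumption): F(theta) = exp(theta)."""
    return math.exp(theta)


def grad_F(theta):
    """Derivative of the illustrative generator."""
    return math.exp(theta)


def bregman(t1, t2):
    """Scalar Bregman divergence B_F(t1 : t2)."""
    return F(t1) - F(t2) - (t1 - t2) * grad_F(t2)


def weighted_bregman_cost(weights, thetas, center):
    """R_F(V, theta) = sum_i w_i B_F(theta_i : theta)."""
    return sum(w * bregman(t, center) for w, t in zip(weights, thetas))


weights = [0.2, 0.5, 0.3]
thetas = [-1.0, 0.4, 2.0]
# The minimizer is the weighted arithmetic mean of the parameters.
theta_bar = sum(w * t for w, t in zip(weights, thetas))
# Bregman information R_F(V) = R_F(V, theta_bar).
bregman_information = weighted_bregman_cost(weights, thetas, theta_bar)
```

For any other center $\theta$, the excess cost `weighted_bregman_cost(weights, thetas, theta) - bregman_information` should equal `bregman(theta_bar, theta)`, and the Bregman information itself reduces to the Jensen gap $\sum_i w_i F(\theta_i) - F(\bar\theta)$.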
More generally, we may consider the densities belonging to a deformed q-exponential family (see [10], pages 85–89, and the monograph [50]). Deformed q-exponential families generalize the exponential families, and include the q-Gaussians [10]. A common way to measure the statistical distance between two densities of a q-exponential family is the q-divergence [10], which is related to Tsallis’ entropy [51]. We may also define another statistical divergence between two densities of a q-exponential family which amounts to a Bregman divergence. For example, we refer to [52] for details concerning the family of Cauchy distributions, which are q-Gaussians for $q = 2$.
Sibson proved that the information radii of any order are all upper bounded (Theorem 2.8 and Theorem 2.9 of [1]) as follows:
$$R_1(\mathcal{P}) \le \sum_i w_i\log_2\frac{1}{w_i} \le \log_2 n < \infty,$$
$$R_\alpha(\mathcal{P}) \le \frac{\alpha}{\alpha-1}\log_2\sum_i w_i^{\frac{1}{\alpha}} \le \log_2 n < \infty, \quad \alpha\in(0,1)\cup(1,\infty),$$
$$R_\infty(\mathcal{P}) \le \log_2 n < \infty.$$
We interpret Sibson’s upper bounds of Equations (73)–(75) as follows:
Proposition 2
(Information radius upper bound). The information radius of order $\alpha$ of a weighted set of distributions is upper bounded by the discrete Rényi entropy of order $\frac{1}{\alpha}$ of the weight distribution: $R_\alpha(\mathcal{P}) \le H^R_{1/\alpha}[w]$, where $H^R_\alpha[w] := \frac{1}{1-\alpha}\log\sum_i w_i^\alpha$.

3. JS-Symmetrization of Distances Based on Generalized Information Radius

Let us give the following definitions generalizing the information radius (i.e., the Jensen-Shannon symmetrization of the distance when $|\mathcal{P}| = 2$) and the ordinary Jensen-Shannon divergence:
Definition 2
($(M,D)$-information radius). Let $M$ be a weighted mean and $D$ a distance. Then, the generalized information radius for a weighted set of points (e.g., vectors or densities) $(w_1,p_1),\ldots,(w_n,p_n)$ is:
$$R_{M,D}(\mathcal{P}) := \min_{c\in\mathcal{D}} M\left(D(p_1:c),\ldots,D(p_n:c); w_1,\ldots,w_n\right).$$
Recall that we also defined the $(M,D)$-centroid in Definition 1 as follows:
$$c_{M,D}(\mathcal{P}) := \arg\min_{c\in\mathcal{D}} M\left(D(p_1:c),\ldots,D(p_n:c); w_1,\ldots,w_n\right).$$
When $M = A$, we recover the notion of Fréchet mean [48]. Notice that, although the minimum $R_{M,D}(\mathcal{P})$ is unique, several generalized centroids $c_{M,D}(\mathcal{P})$ may potentially exist, depending on $(M,D)$. In particular, Definition 2 and Definition 1 apply when $D$ is a statistical distance, i.e., a distance between densities (Radon–Nikodym derivatives of corresponding probability measures with respect to a dominating measure $\mu$).
The generalized information radius can be interpreted as a diversity index or an n-point distance. When $n = 2$, we get the following (2-point) distances, which are considered as a generalization of the Jensen-Shannon divergence or Jensen-Shannon symmetrization:
Definition 3
(M-vJS symmetrization of D). Let $M$ be a mean and $D$ a statistical distance. Then, the variational Jensen-Shannon symmetrization of $D$ is defined by the formula of a generalized information radius:
$$D^{\mathrm{vJS}}_{M}[p:q] := \min_{c\in\mathcal{D}} M\left(D[p:c], D[q:c]\right).$$
We use the acronym vJS to distinguish it from the JS-symmetrization reported in [23]:
$$D^{\mathrm{JS}}_{M}[p:q] = D^{\mathrm{JS}}_{M,A}[p:q] := \frac{1}{2}\left(D\left[p:(pq)^{M}_{\frac{1}{2}}\right] + D\left[q:(pq)^{M}_{\frac{1}{2}}\right]\right).$$
We recover Sibson’s information radius $R_\alpha[p:q]$ induced by two densities $p$ and $q$ from Definition 3 as the $M^R_\alpha$-vJS symmetrization of the Rényi divergence $D^R_\alpha$. Similarly, ${B_F}^{\mathrm{vJS}}_{A}$ is the Bregman information [21]. Notice that we may skew these generalized JSDs by taking a weighted mean $M_\beta$ instead of $M$ for $\beta\in(0,1)$, yielding the general definition:
Definition 4
(Skew $M_\beta$-vJS symmetrization of D). Let $M_\beta$ be a weighted mean and $D$ a statistical distance. Then, the variational skewed Jensen-Shannon symmetrization of $D$ is defined by the formula of a generalized information radius:
$$D^{\mathrm{vJS}}_{M_\beta}[p:q] := \min_{c\in\mathcal{D}} M_\beta\left(D[p:c], D[q:c]\right).$$
Example 1.
For example, the skewed Jensen–Bregman divergence of Equation (20) can be interpreted as a Jensen-Shannon symmetrization of the Bregman divergence $B_F$ [12] since we have:
$${B_F}^{\mathrm{vJS}}_{A_\beta}(\theta_1:\theta_2) = \min_{\theta\in\Theta} A_\beta\left(B_F(\theta_1:\theta), B_F(\theta_2:\theta)\right),$$
$$= \min_{\theta\in\Theta} (1-\beta)B_F(\theta_1:\theta) + \beta B_F(\theta_2:\theta),$$
$$= (1-\beta)B_F(\theta_1:(1-\beta)\theta_1+\beta\theta_2) + \beta B_F(\theta_2:(1-\beta)\theta_1+\beta\theta_2),$$
$$=: \mathrm{JB}_{F,\beta}(\theta_1:\theta_2).$$
Indeed, the Bregman barycenter $\arg\min_{\theta\in\Theta} (1-\beta)B_F(\theta_1:\theta) + \beta B_F(\theta_2:\theta)$ is unique and it corresponds to $\theta = (1-\beta)\theta_1+\beta\theta_2$, see [21]. The skewed Jensen–Bregman divergence $\mathrm{JB}_{F,\beta}(\theta_1:\theta_2)$ can also be rewritten as an equivalent skewed Jensen divergence (see Equation (22)):
$$\mathrm{JB}_{F,\beta}(\theta_1:\theta_2) = (1-\beta)B_F(\theta_1:(1-\beta)\theta_1+\beta\theta_2) + \beta B_F(\theta_2:(1-\beta)\theta_1+\beta\theta_2),$$
$$= (1-\beta)F(\theta_1) + \beta F(\theta_2) - F((1-\beta)\theta_1+\beta\theta_2),$$
$$=: J_{F,\beta}(\theta_1:\theta_2).$$
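The equality between the skewed Jensen–Bregman divergence and the skewed Jensen divergence, and the fact that the weighted parameter mean minimizes the variational objective, can be illustrated on a scalar example. This is a sketch under assumptions: the generator $F(\theta) = \theta\log\theta - \theta$ on $\theta > 0$ is an arbitrary illustrative choice and the function names are hypothetical.

```python
import math


def F(t):
    """Illustrative strictly convex generator on t > 0 (an assumption)."""
    return t * math.log(t) - t


def grad_F(t):
    """Derivative of the illustrative generator."""
    return math.log(t)


def bregman(a, b):
    """Scalar Bregman divergence B_F(a : b)."""
    return F(a) - F(b) - (a - b) * grad_F(b)


def variational_objective(t1, t2, beta, t):
    """(1 - beta) B_F(t1 : t) + beta B_F(t2 : t), the quantity minimized over t."""
    return (1 - beta) * bregman(t1, t) + beta * bregman(t2, t)


def skew_jensen_bregman(t1, t2, beta):
    """Objective evaluated at the Bregman barycenter (1 - beta) t1 + beta t2."""
    return variational_objective(t1, t2, beta, (1 - beta) * t1 + beta * t2)


def skew_jensen(t1, t2, beta):
    """Skewed Jensen divergence J_{F,beta}(t1 : t2)."""
    return (1 - beta) * F(t1) + beta * F(t2) - F((1 - beta) * t1 + beta * t2)
```

The linear terms of the two Bregman divergences cancel at the barycenter, which is why `skew_jensen_bregman` and `skew_jensen` should return the same value, and no other candidate center should yield a smaller objective.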
Example 2.
Consider a conformal Bregman divergence [53] that is defined by
$$B_{F,\rho}(\theta_1:\theta_2) = \rho(\theta_1)B_F(\theta_1:\theta_2),$$
where $\rho(\theta) > 0$ is a conformal factor. Then, we have
$${B_{F,\rho}}^{\mathrm{vJS}}_{A_\beta}(\theta_1:\theta_2) = \min_{\theta\in\Theta} A_\beta\left(B_{F,\rho}(\theta_1:\theta), B_{F,\rho}(\theta_2:\theta)\right),$$
$$= \min_{\theta\in\Theta} (1-\beta)B_{F,\rho}(\theta_1:\theta) + \beta B_{F,\rho}(\theta_2:\theta),$$
$$= (1-\beta)B_{F,\rho}(\theta_1:\gamma_1\theta_1+\gamma_2\theta_2) + \beta B_{F,\rho}(\theta_2:\gamma_1\theta_1+\gamma_2\theta_2),$$
where $\gamma_1 = \frac{(1-\beta)\rho(\theta_1)}{(1-\beta)\rho(\theta_1)+\beta\rho(\theta_2)}$ and $\gamma_2 = \frac{\beta\rho(\theta_2)}{(1-\beta)\rho(\theta_1)+\beta\rho(\theta_2)} = 1-\gamma_1$.
Notice that this definition is implicit and it can be made explicit when the centroid $c^*(p,q)$ is unique:
$$D^{\mathrm{vJS}}_{M_\beta}[p:q] = M_\beta\left(D[p:c^*(p,q)], D[q:c^*(p,q)]\right).$$
In particular, when $D = D_{\mathrm{KL}}$, the KLD, we obtain generalized skewed Jensen-Shannon divergences for $M_\beta$ a weighted mean with $\beta\in(0,1)$:
$$D^{\mathrm{vJS}}_{M_\beta}[p:q] := \min_{c\in\mathcal{D}} M_\beta\left(D_{\mathrm{KL}}[p:c], D_{\mathrm{KL}}[q:c]\right).$$
Example 3.
Amari [31] obtained the $(A, D_\alpha)$-information radius and its corresponding unique centroid for $D_\alpha$, the α-divergence of information geometry [10] (page 67).
Example 4.
Brekelmans et al. [54] studied the geometric path $(p_1p_2)^G_\beta(x) \propto p_1^{1-\beta}(x)\,p_2^{\beta}(x)$ between two distributions $p_1$ and $p_2$ of $\mathcal{D}$, where $G_\beta(a,b) = a^{1-\beta}b^{\beta}$ (with $a,b>0$) is the weighted geometric mean. They proved the variational formula:
$$(p_1p_2)^G_\beta = \arg\min_{c\in\mathcal{D}} (1-\beta)D_{\mathrm{KL}}[c:p_1] + \beta D_{\mathrm{KL}}[c:p_2].$$
That is, $(p_1p_2)^G_\beta$ is a $G_\beta$-$D^*_{\mathrm{KL}}$ centroid, where $D^*_{\mathrm{KL}}$ is the reverse KLD. The corresponding $(G_\beta, D^*_{\mathrm{KL}})$-vJSD is studied in [23] and it is used in deep learning in [30].
It is interesting to study the link between the $(M_\beta, D)$-variational Jensen-Shannon symmetrization of $D$ and the $(M_\alpha, N_\beta)$-JS symmetrization of [23], and in particular the link between the mean $M_\beta$ used for averaging in the minimization and the mean $M_\alpha$ used for generating abstract mixtures.
More generally, Brekelmans et al. [55] considered the α-divergences extended to positive measures (i.e., a separable divergence built as the difference between a weighted arithmetic mean and a geometric mean [56]):
$$D^e_\alpha[p:q] := \frac{4}{1-\alpha^2}\int_{\mathcal{X}}\left(\frac{1-\alpha}{2}p(x) + \frac{1+\alpha}{2}q(x) - p^{\frac{1-\alpha}{2}}(x)\,q^{\frac{1+\alpha}{2}}(x)\right)\mathrm{d}\mu(x)$$
and proved that
$$c^*_\beta = \arg\min_{c\in\mathcal{D}}\left\{(1-\beta)D^e_\alpha[p_1:c] + \beta D^e_\alpha[p_2:c]\right\}$$
is a density of a likelihood ratio q-exponential family: $c^*_\beta = \frac{p_1(x)}{Z_{\beta,q}}\exp_q\left(\beta\log_q\frac{p_2(x)}{p_1(x)}\right)$ for $q = \frac{1+\alpha}{2}$. That is, $c^*_\beta$ is the $(A_\beta, D^e_\alpha)$-generalized centroid, and the corresponding information radius is the variational JS symmetrization:
$${D^e_\alpha}^{\mathrm{vJS}}[p_1:p_2] = (1-\beta)D^e_\alpha[p_1:c^*_\beta] + \beta D^e_\alpha[p_2:c^*_\beta].$$
Example 5.
The q-divergence [57] $D_q$ between two densities of a q-exponential family amounts to a Bregman divergence [10,57]. Thus, ${D_q}^{\mathrm{vJS}}$ for $M = A$ is a generalized information radius that amounts to a Bregman information.
For the case $\alpha = \infty$ in Sibson’s information radius, we find that the information radius is related to the total variation:
Proposition 3
(Lemma 2.4 [1]).
$$D^{\mathrm{vJS},R}_{\infty}[p:q] = \log_2\left(1 + D_{\mathrm{TV}}[p:q]\right),$$
where $D_{\mathrm{TV}}$ denotes the total variation
$$D_{\mathrm{TV}}[p:q] = \frac{1}{2}\int_{\mathcal{X}} |p(x)-q(x)|\,\mathrm{d}\mu(x).$$
Proof. 
Because $\max\{p(x),q(x)\} = \frac{p(x)+q(x)}{2} + \frac{1}{2}|q(x)-p(x)|$, it follows that we have:
$$\int_{\mathcal{X}}\max\{p(x),q(x)\}\,\mathrm{d}\mu(x) = 1 + D_{\mathrm{TV}}[p:q].$$
From Theorem 1, we have $R_\infty\left(\left\{\left(\frac{1}{2},p\right),\left(\frac{1}{2},q\right)\right\}\right) = \log_2\int_{\mathcal{X}}\max\{p(x),q(x)\}\,\mathrm{d}\mu(x)$ and, therefore, $R_\infty\left(\left\{\left(\frac{1}{2},p\right),\left(\frac{1}{2},q\right)\right\}\right) = \log_2\left(1 + D_{\mathrm{TV}}[p:q]\right)$. □
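The identity used in the proof is easy to check on discrete distributions, where the integral becomes a finite sum. The sketch below is a discrete analogue with hypothetical helper names; the two probability vectors are arbitrary illustrative data.

```python
import math


def total_variation(p, q):
    """Total variation distance: half the L1 distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def information_radius_infinity(p, q):
    """log2 of the sum of pointwise maxima (discrete analogue of the integral of max{p, q})."""
    return math.log2(sum(max(a, b) for a, b in zip(p, q)))


p = [0.5, 0.3, 0.2]
q = [0.1, 0.1, 0.8]
radius = information_radius_infinity(p, q)
bound_check = math.log2(1.0 + total_variation(p, q))
```

Since the total variation lies in $[0,1]$, the value `radius` lies between 0 and 1 bit, with 1 bit reached only for distributions with disjoint supports.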
Notice that, when $M = M_g$ is a quasi-arithmetic mean, we may consider the divergence $D_g[p:q] = g^{-1}(D[p:q])$, so that the centroid of the $(M_g, D_g)$-JS symmetrization is:
$$\arg\min_c g^{-1}\left(\sum_{i=1}^n w_i D[p_i:c]\right) \equiv \arg\min_c \sum_{i=1}^n w_i D[p_i:c].$$
The generalized $\alpha$-skewed Bhattacharyya divergence [29] can also be considered with respect to a weighted mean $M_\alpha$:
$$D_{\mathrm{Bhat},M_\alpha}[p:q] = -\log\int_{\mathcal{X}} M_\alpha(p(x),q(x))\,\mathrm{d}\mu(x).$$
In particular, when $M_\alpha$ is a quasi-arithmetic weighted mean that is induced by a continuous and strictly monotone function $g$, we have
$$D_{\mathrm{Bhat},g,\alpha}[p:q] := -\log\int_{\mathcal{X}} M_g(p(x),q(x);\alpha)\,\mathrm{d}\mu(x) =: D_{\mathrm{Bhat},(M_g)_\alpha}[p:q].$$
Because $\min\{p(x),q(x)\} \le M_g(p(x),q(x);\alpha) \le \max\{p(x),q(x)\}$, $\min\{a,b\} = \frac{a+b}{2} - \frac{|b-a|}{2}$ and $\max\{a,b\} = \frac{a+b}{2} + \frac{|b-a|}{2}$, we deduce that we have:
$$0 \le 1 - D_{\mathrm{TV}}[p,q] \le \int_{\mathcal{X}} M_g(p(x),q(x);\alpha)\,\mathrm{d}\mu(x) \le 1 + D_{\mathrm{TV}}[p,q] \le 2.$$
The information radius of Sibson for $\alpha\in(0,1)\cup(1,\infty)$ may be interpreted as generalized scaled $\alpha$-skewed Bhattacharyya divergences with respect to the power means $P_\alpha$, since we have $R_\alpha(p,q) = \frac{\alpha}{\alpha-1}\log_2\int_{\mathcal{X}} P_\alpha(p(x),q(x);\alpha)\,\mathrm{d}\mu(x) = \frac{\alpha}{1-\alpha}D_{\mathrm{Bhat},P_\alpha}[p:q]$.

4. Relative Information Radius and Relative Jensen-Shannon Symmetrizations of Distances

4.1. Relative Information Radius

In this section, instead of considering the full space of densities $\mathcal{D}$ on $(\mathcal{X},\mathcal{F},\mu)$ for performing the variational optimization of the information radius, we rather consider a subfamily of (parametric) densities $\mathcal{R}\subset\mathcal{D}$. Then, we define accordingly the $\mathcal{R}$-relative Jensen-Shannon divergence ($\mathcal{R}$-JSD for short) as
$$D^{\mathrm{vJS}}_{\mathcal{R}}[p:q] := \min_{c\in\mathcal{R}}\left\{\frac{1}{2}D_{\mathrm{KL}}[p:c] + \frac{1}{2}D_{\mathrm{KL}}[q:c]\right\}.$$
In particular, Sibson [1] considered the normal information radius, i.e., the $\mathcal{R}$-relative Jensen-Shannon divergence with $\mathcal{R} = \{N(\mu,\Sigma) : (\mu,\Sigma)\in\mathbb{R}^d\times\mathbb{P}^d_{++}\}$, where $\mathbb{P}^d_{++}$ denotes the open cone of $d\times d$ positive-definite matrices (positive-definite covariance matrices of Gaussian distributions). More generally, we may consider any exponential family $\mathcal{E}$ [2].
Definition 5
(Relative $(\mathcal{R},M)$-JS symmetrization of D). Let $M$ be a mean and $D$ a statistical distance. Then, the relative $(\mathcal{R},M)$-JS symmetrization of $D$ is:
$$D^{\mathrm{vJS}}_{M,\mathcal{R}}[p:q] := \min_{c\in\mathcal{R}} M\left(D[p:c], D[q:c]\right).$$
We obtain the relative Jensen-Shannon divergences when $D = D_{\mathrm{KL}}$.
Example 6.
Grosse et al. [58] considered geometric and moment average paths for annealing. They proved that, when $p_1 = p_{\theta_1}$ and $p_2 = p_{\theta_2}$ belong to an exponential family [2] $\mathcal{E}_F$ with cumulant function $F$, we have
$$(p_1p_2)^G_\beta = \frac{p_1(x)^{1-\beta}p_2(x)^{\beta}}{\int p_1(x)^{1-\beta}p_2(x)^{\beta}\,\mathrm{d}\mu(x)} = \arg\min_{c\in\mathcal{E}_F}\left\{(1-\beta)D_{\mathrm{KL}}[c:p_1] + \beta D_{\mathrm{KL}}[c:p_2]\right\},$$
and
$$p_{\bar\eta} = \arg\min_{c\in\mathcal{E}_F}\left\{(1-\beta)D_{\mathrm{KL}}[p_1:c] + \beta D_{\mathrm{KL}}[p_2:c]\right\},$$
where $\bar\eta = (1-\beta)\eta_1 + \beta\eta_2$, $\eta_i = E_{p_{\theta_i}}[t(x)]$ (this is not an arithmetic mixture, but an exponential family density whose moment parameter is a mixture of the moment parameters).
The corresponding minima can be interpreted as the relative skewed Jensen-Shannon symmetrization for the reverse KLD $D^*_{\mathrm{KL}}$ (Equation (98)) and the relative skewed Jensen-Shannon divergence (Equation (99)):
$${D^*_{\mathrm{KL}}}^{\mathrm{vJS}}_{A_\beta,\mathcal{E}_F}[p_1:p_2] = \min_{c\in\mathcal{E}_F}\left\{(1-\beta)D^*_{\mathrm{KL}}[p_1:c] + \beta D^*_{\mathrm{KL}}[p_2:c]\right\},$$
$$D^{\mathrm{vJS}}_{A_\beta,\mathcal{E}_F}[p_1:p_2] = \min_{c\in\mathcal{E}_F}\left\{(1-\beta)D_{\mathrm{KL}}[p_1:c] + \beta D_{\mathrm{KL}}[p_2:c]\right\},$$
where $A_\beta(a,b) := (1-\beta)a + \beta b$ is the weighted arithmetic mean for $\beta\in(0,1)$.
Notice that, when $p = q$, we have $D^{\mathrm{vJS}}_{M,\mathcal{R}}[p:p] = \min_{c\in\mathcal{R}} D[p:c]$, which is the information projection [59] with respect to $D$ of density $p$ to the submanifold $\mathcal{R}$. Thus, when $p\notin\mathcal{R}$, we have $D^{\mathrm{vJS}}_{M,\mathcal{R}}[p:p] > 0$, i.e., the relative JSDs are not proper divergences, since a proper divergence ensures that $D[p:q] \ge 0$ with equality if and only if $p = q$. Figure 1 illustrates the main cases of the relative Jensen-Shannon divergence between $p$ and $q$: Either $p$ and $q$ are both inside or outside $\mathcal{R}$, or one point is inside $\mathcal{R}$, while the other point is outside $\mathcal{R}$. When $p = q$ lies outside $\mathcal{R}$, we get an information projection, and $D^{\mathrm{vJS}}_{\mathcal{R}}[p:p] = 0$ when $p\in\mathcal{R}$. When $p,q\in\mathcal{R}$ with $p\ne q$, the value $D^{\mathrm{vJS}}_{\mathcal{R}}[p:q]$ corresponds to the information radius (and the arg min to the right-sided Kullback–Leibler centroid).

4.2. Relative Jensen-Shannon Divergences: Applications to Density Clustering and Quantization

Let $D_{\mathrm{KL}}[p:q_\theta]$ be the Kullback–Leibler divergence between an arbitrary density $p$ and a density $q_\theta$ of an exponential family $\mathcal{Q} = \{q_\theta : \theta\in\Theta\}$. Let us canonically express [2,60] the density $q_\theta(x)$ as
$$q_\theta(x) = \exp\left(\theta^\top t_Q(x) - F_Q(\theta) + k_Q(x)\right),$$
where $t_Q(x)$ denotes the sufficient statistics, $k_Q(x)$ is an auxiliary carrier measure term (e.g., $k(x) = 0$ for the Gaussian family and $k(x) = \log(x)$ for the Rayleigh family [60]), and $F_Q(\theta)$ the cumulant function. Assume that we know in closed-form the following quantities:
  • $m_p := E_p[t_Q(x)] = \int p(x)\,t_Q(x)\,\mathrm{d}\mu(x)$ and
  • the Shannon entropy $h[p] = -\int p(x)\log p(x)\,\mathrm{d}\mu(x)$ of $p$.
Then, we can express the KLD using a semi-closed-form formula.
Proposition 4.
Let $q_\theta\in\mathcal{Q}$ be a density of an exponential family and $p$ an arbitrary density with $m_p = E_p[t_Q(x)]$. Then, the Kullback–Leibler divergence between $p$ and $q_\theta$ is expressed as:
$$D_{\mathrm{KL}}[p:q_\theta] = F_Q(\theta) - m_p^\top\theta - E_p[k_Q(x)] - h[p],$$
where $h[p:q_\theta] = F_Q(\theta) - m_p^\top\theta - E_p[k_Q(x)]$ is the cross-entropy between $p$ and $q_\theta$.
Proof. 
The proof is straightforward since $\log q_\theta(x) = \theta^\top t_Q(x) - F_Q(\theta) + k_Q(x)$. Therefore, we have:
$$D_{\mathrm{KL}}[p:q_\theta] = h[p:q_\theta] - h[p],$$
$$= -\int_{\mathcal{X}} p(x)\log q_\theta(x)\,\mathrm{d}\mu(x) - h[p],$$
$$= F_Q(\theta) - m_p^\top\theta - E_p[k_Q(x)] - h[p].$$
 □
Example 7.
For example, when $q_\theta = q_{\mu,\Sigma}$ is the density of a multivariate Gaussian distribution $N(\mu,\Sigma)$ (with $k_N(x) = 0$), we have
$$D_{\mathrm{KL}}[p:q_{\mu,\Sigma}] = \frac{1}{2}\left(\log|2\pi\Sigma| + (\mu-m)^\top\Sigma^{-1}(\mu-m) + \mathrm{tr}(\Sigma^{-1}S)\right) - h[p],$$
where $m = \mu(p) = E_p[X]$ and $S = \mathrm{Cov}(p) := E_p[XX^\top] - E_p[X]E_p[X]^\top$.
The formula of Proposition 4 is said to be in semi-closed form because it relies on knowing both the entropy $h$ of $p$ and the sufficient statistic moments $E_p[t_Q(x)]$. Yet, this semi-closed formula may prove to be useful in practice: For example, we can answer the comparison predicate
“Is $D_{\mathrm{KL}}[p:q_{\theta_1}] \ge D_{\mathrm{KL}}[p:q_{\theta_2}]$ or not?”
by checking whether $F_Q(\theta_1) - F_Q(\theta_2) - m_p^\top(\theta_1-\theta_2) \ge 0$ or not (i.e., the terms $-E_p[k_Q(x)] - h[p]$ in Equation (102) cancel out). Thus, we get a closed-form predicate, although $D_{\mathrm{KL}}$ is only known in semi-closed-form. This KLD comparison predicate shall be used later on when clustering densities with respect to centroids in Section 4.2.
Remark 1.
Note that when $Y = f(X)$ for an invertible and differentiable transformation $f$, we have $h[Y] = h[X] + E_X[\log|\det J_f(X)|]$, where $J_f$ denotes the Jacobian matrix. For example, when $Y = f(X) = AX$, we have $h[Y] = h[X] + \log|\det A|$.
When $p$ belongs to an exponential family $\mathcal{P}$ ($\mathcal{P}$ may be different from $\mathcal{Q}$) with cumulant function $F_P$, sufficient statistics $t_P(x)$, auxiliary carrier term $k_P(x)$, and natural parameter $\theta$, we have the entropy [61] expressed as follows:
$$h[p] = F_P(\theta) - \theta^\top\nabla F_P(\theta) - E_p[k_P(x)],$$
$$= -F_P^*(\eta) - E_p[k_P(x)],$$
where $F_P^*(\eta) = \theta^\top\nabla F_P(\theta) - F_P(\theta)$ is the Legendre transform of $F_P(\theta)$ and $\eta = \eta(\theta) = \nabla F_P(\theta)$ is called the moment parameter since we have $\eta(\theta) = E_p[t_P(x)]$ [2,60].
We thus obtain the following proposition, which refines Proposition 4 when $p = p_\theta\in\mathcal{P}$:
Proposition 5.
Let $p_\theta$ be a density of an exponential family $\mathcal{P}$ and $q_{\theta'}$ be a density of an exponential family $\mathcal{Q}$. Then, the Kullback–Leibler divergence between $p_\theta$ and $q_{\theta'}$ is expressed as:
$$D_{\mathrm{KL}}[p_\theta:q_{\theta'}] = F_Q(\theta') + F_P^*(\eta) - E_{p_\theta}[t_Q(x)]^\top\theta' + E_{p_\theta}[k_P(x) - k_Q(x)].$$
Proof. 
We have
$$D_{\mathrm{KL}}[p_\theta:q_{\theta'}] = h[p_\theta:q_{\theta'}] - h[p_\theta],$$
$$= F_Q(\theta') - m_{p_\theta}^\top\theta' - E_{p_\theta}[k_Q(x)] + F_P^*(\eta) + E_{p_\theta}[k_P(x)],$$
$$= F_Q(\theta') + F_P^*(\eta) - E_{p_\theta}[t_Q(x)]^\top\theta' + E_{p_\theta}[k_P(x) - k_Q(x)].$$
 □
In particular, when $p$ and $q$ both belong to the same exponential family (i.e., $\mathcal{P} = \mathcal{Q}$ with $k_P(x) = k_Q(x)$), we have $F(\theta) := F_P(\theta) = F_Q(\theta)$ and $E_{p_\theta}[t_Q(x)] = \nabla F(\theta) =: \eta$, and
$$D_{\mathrm{KL}}[p_\theta:q_{\theta'}] = F(\theta') + F^*(\eta) - \theta'^\top\eta.$$
This last equation is the Fenchel–Young divergence in Bregman manifolds [34,62] (called dually flat spaces in information geometry [10]). Thus, the divergence can be rewritten as equivalent dual Bregman divergences:
$$D_{\mathrm{KL}}[p_\theta:q_{\theta'}] = F(\theta') + F^*(\eta) - \eta^\top\theta',$$
$$= B_F(\theta':\theta),$$
$$= B_{F^*}(\eta:\eta'),$$
where $\eta' = \nabla F(\theta')$.
Notice that the minimizer of $D_{\mathrm{KL}}[p_\theta:\mathcal{Q}] := \min_{\theta'\in\Theta'} D_{\mathrm{KL}}[p_\theta:q_{\theta'}]$ is unique and is characterized by $\eta' = \nabla F_Q(\theta') = E_{p_\theta}[t_Q(x)]$.
Let us report two examples of calculations of the KLD between two densities of two exponential families.
Example 8.
For the first exponential family, consider the family of Laplacian distributions:
$$\mathcal{P} = \mathcal{L} = \left\{p_\sigma(x) := \frac{1}{2\sigma}\exp\left(-\frac{|x|}{\sigma}\right) : \sigma > 0\right\}.$$
The canonical decomposition of the density yields $t_L(x) = |x|$, $\theta = -\frac{1}{\sigma}$, $k_L(x) = 0$, and $F_L(\theta) = \log\left(-\frac{2}{\theta}\right)$ (i.e., $F_L(\theta(\sigma)) = \log(2\sigma)$). It follows that $\eta(\theta) = \nabla F_L(\theta) = -\frac{1}{\theta}$ ($\eta(\sigma) = \sigma = E[|x|]$), $\theta(\eta) = -\frac{1}{\eta}$, and $F_L^*(\eta) = -1 - \log(2\eta)$ and, therefore, $F_L^*(\eta(\sigma)) = -1 - \log(2\sigma)$.
For the second family, consider the exponential family of zero-centered Gaussian distributions:
$$\mathcal{Q} = \mathcal{N}_0 = \left\{q_{\sigma'}(x) = \frac{1}{\sqrt{2\pi(\sigma')^2}}\exp\left(-\frac{x^2}{2(\sigma')^2}\right) : \sigma' > 0\right\}.$$
We have $t_{N_0}(x) = x^2$, $k_{N_0}(x) = 0$, $\theta' = -\frac{1}{2(\sigma')^2}$, and $F_{N_0}(\sigma') = \frac{1}{2}\log(2\pi(\sigma')^2)$.
Moreover, let us calculate $E_{p_\sigma}[t_{N_0}(x)] = E_{p_\sigma}[x^2] = 2\sigma^2$. Then, we can calculate the Kullback–Leibler divergence between $p_\sigma\in\mathcal{L}$ and $q_{\sigma'}\in\mathcal{N}_0$ as follows:
$$D_{\mathrm{KL}}[p_\sigma:q_{\sigma'}] = F_Q(\theta'(\sigma')) + F_P^*(\eta(\sigma)) - E_{p_\sigma}[t_Q(x)]\,\theta'(\sigma') + E_{p_\sigma}[k_P(x) - k_Q(x)],$$
$$= \frac{1}{2}\log(2\pi(\sigma')^2) - 1 - \log(2\sigma) + 2\sigma^2\,\frac{1}{2(\sigma')^2},$$
$$= \log\frac{\sigma'}{\sigma} + \left(\frac{\sigma}{\sigma'}\right)^2 + \frac{1}{2}\log\frac{\pi}{2} - 1.$$
Notice that $D_{\mathrm{KL}}[p_\sigma:q_{\sigma'}] \ge 0$, but it is never $0$ since $\mathcal{P}\cap\mathcal{Q} = \emptyset$.
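Let us now compute the reverse Kullback–Leibler divergence $D_{\mathrm{KL}}[q_{\sigma'}:p_\sigma]$. We first calculate $E_{q_{\sigma'}}[t_L(x)] = E_{q_{\sigma'}}[|x|] = \sqrt{\frac{2}{\pi}}\,\sigma'$. Since $F_Q(\theta') = \frac{1}{2}\log\left(-\frac{\pi}{\theta'}\right)$, we have $\eta'(\theta') = \nabla F_Q(\theta') = -\frac{1}{2\theta'}$. Thus $\eta'(\sigma') = (\sigma')^2$ and $F_Q^*(\eta') = -\frac{1}{2} - \frac{1}{2}\log(2\pi\eta')$. Therefore, we get $F_Q^*(\eta'(\sigma')) = -h[q_{\sigma'}] = -\frac{1}{2}\log(2\pi e(\sigma')^2)$.
It follows that
$$D_{\mathrm{KL}}[q_{\sigma'}:p_\sigma] = F_P(\theta(\sigma)) + F_Q^*(\eta'(\sigma')) - E_{q_{\sigma'}}[t_P(x)]\,\theta(\sigma) + E_{q_{\sigma'}}[k_Q(x) - k_P(x)],$$
$$= \log(2\sigma) - \frac{1}{2}\log(2\pi e(\sigma')^2) + \sqrt{\frac{2}{\pi}}\,\sigma'\times\frac{1}{\sigma},$$
$$= \sqrt{\frac{2}{\pi}}\,\frac{\sigma'}{\sigma} + \log\frac{\sigma}{\sigma'} - \frac{1}{2}\log\frac{\pi e}{2}.$$
Again, we have $D_{\mathrm{KL}}[q_{\sigma'}:p_\sigma] \ge 0$, but it is never $0$, because $\mathcal{P}\cap\mathcal{Q} = \emptyset$.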
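Both closed-form expressions of this example can be compared against a direct numerical integration of $\int p\log\frac{p}{q}$. The sketch below is only a sanity check: the helper names are hypothetical, the scale values are arbitrary, and the integrator is a plain midpoint rule on a truncated, symmetric interval (so that the kink of the Laplacian density at 0 falls on a cell boundary).

```python
import math


def log_laplace(x, sigma):
    """Log-density of the zero-centered Laplacian distribution with scale sigma."""
    return -abs(x) / sigma - math.log(2.0 * sigma)


def log_gaussian(x, sigma_prime):
    """Log-density of the zero-centered Gaussian with standard deviation sigma_prime."""
    return -x * x / (2.0 * sigma_prime**2) - 0.5 * math.log(2.0 * math.pi * sigma_prime**2)


def kl_laplace_to_gaussian(sigma, sigma_prime):
    """Closed form of KL[Laplacian(sigma) : Gaussian(sigma_prime)]."""
    r = sigma / sigma_prime
    return -math.log(r) + r * r + 0.5 * math.log(math.pi / 2.0) - 1.0


def kl_gaussian_to_laplace(sigma_prime, sigma):
    """Closed form of KL[Gaussian(sigma_prime) : Laplacian(sigma)]."""
    r = sigma_prime / sigma
    return math.sqrt(2.0 / math.pi) * r - math.log(r) - 0.5 * math.log(math.pi * math.e / 2.0)


def kl_numeric(log_p, log_q, half_width=40.0, steps=80_000):
    """Midpoint-rule approximation of the KL integral on [-half_width, half_width]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps):
        x = -half_width + (j + 0.5) * h
        lp = log_p(x)
        total += math.exp(lp) * (lp - log_q(x)) * h
    return total
```

With these definitions, the first divergence is minimized over the ratio $\sigma'/\sigma$ at $\sigma' = \sqrt{2}\,\sigma$, where it equals $\frac12\log\pi - \frac12 > 0$, consistent with the remark that the divergence never vanishes.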
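Example 9.
Let us use the formula of Equation (109) to calculate the KLD between two Weibull distributions [63]. A Weibull distribution of shape $\kappa > 0$ and scale $\sigma > 0$ has a density defined on $\mathcal{X} = [0,\infty)$ as follows:
$$p^{\mathrm{Wei}}_{\kappa,\sigma}(x) := \frac{\kappa}{\sigma}\left(\frac{x}{\sigma}\right)^{\kappa-1}\exp\left(-\left(\frac{x}{\sigma}\right)^{\kappa}\right).$$
For a fixed shape $\kappa$, the set of Weibull distributions $\mathcal{W}_\kappa := \{p^{\mathrm{Wei}}_{\kappa,\sigma} : \sigma > 0\}$ forms an exponential family with natural parameter $\theta = -\frac{1}{\sigma^\kappa}$, sufficient statistic $t_\kappa(x) = x^\kappa$, auxiliary carrier term $k_\kappa(x) = (\kappa-1)\log x + \log\kappa$, and cumulant function $F_\kappa(\theta) = -\log(-\theta)$ (so that $F_\kappa(\theta(\sigma)) = F_\kappa(\sigma) = \kappa\log\sigma$):
$$p^{\mathrm{Wei}}_{\kappa,\sigma}(x) = \exp\left(-\frac{1}{\sigma^\kappa}x^\kappa + \log\frac{1}{\sigma^\kappa} + k_\kappa(x)\right).$$
We recover the exponential family of exponential distributions of rate parameter $\lambda = \frac{1}{\sigma}$ when $\kappa = 1$:
$$p^{\mathrm{Exp}}_{\lambda}(x) = p^{\mathrm{Wei}}_{1,\sigma}(x) = \frac{1}{\sigma}\exp\left(-\frac{x}{\sigma}\right) = \lambda\exp(-\lambda x),$$
and the exponential family of Rayleigh distributions when $\kappa = 2$ with scale parameter $\sigma_{\mathrm{Ray}} = \frac{\sigma}{\sqrt{2}}$:
$$p^{\mathrm{Ray}}_{\sigma_{\mathrm{Ray}}}(x) = p^{\mathrm{Wei}}_{2,\sigma}(x) = \frac{2x}{\sigma^2}\exp\left(-\frac{x^2}{\sigma^2}\right) = \frac{x}{\sigma_{\mathrm{Ray}}^2}\exp\left(-\frac{x^2}{2\sigma_{\mathrm{Ray}}^2}\right).$$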
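Now, assume that we are given the differential entropy of the Weibull distributions [64] (pp. 155–156):
$$h\left[p^{\mathrm{Wei}}_{\kappa_1,\sigma_1}\right] = \gamma\left(1-\frac{1}{\kappa_1}\right) + \log\frac{\sigma_1}{\kappa_1} + 1,$$
where $\gamma\approx 0.5772156649$ is the Euler–Mascheroni constant, and the Weibull raw moments [64] (p. 155):
$$m = E_{p^{\mathrm{Wei}}_{\kappa_1,\sigma_1}}\left[x^{\kappa_2}\right] = \sigma_1^{\kappa_2}\,\Gamma\left(1+\frac{\kappa_2}{\kappa_1}\right),$$
where $\Gamma(x) = \int_0^\infty t^{x-1}e^{-t}\,\mathrm{d}t$ is the gamma function (with $\Gamma(n) = (n-1)!$ for integers $n$). Because $h[p^{\mathrm{Wei}}_{\kappa,\sigma}] = F_\kappa(\theta) - \theta\nabla F_\kappa(\theta) - E_{p^{\mathrm{Wei}}_{\kappa,\sigma}}[k_\kappa(x)] = -F^*_\kappa(\eta) - E_{p^{\mathrm{Wei}}_{\kappa,\sigma}}[k_\kappa(x)]$, we deduce that
$$E_{p^{\mathrm{Wei}}_{\kappa,\sigma}}[k_\kappa(x)] = -F^*_\kappa(\eta) - h\left[p^{\mathrm{Wei}}_{\kappa,\sigma}\right],$$
where $F^*_\kappa(\eta)$ is the Legendre transform of $F_\kappa(\theta)$ and $\eta(\theta) = \nabla F_\kappa(\theta) = -\frac{1}{\theta} = E[t(x)] = E[x^\kappa] = \sigma^\kappa$. We have $\theta(\eta) = \nabla F^*_\kappa(\eta) = -\frac{1}{\eta}$ and $F^*_\kappa(\eta) = \eta\nabla F^*_\kappa(\eta) - F_\kappa(\nabla F^*_\kappa(\eta)) = -1 - \log\eta$. It follows that
$$E_{p^{\mathrm{Wei}}_{\kappa,\sigma}}[k_\kappa(x)] = 1 + \log\sigma^\kappa - \gamma\left(1-\frac{1}{\kappa}\right) - \log\frac{\sigma}{\kappa} - 1 = (\kappa-1)\log\sigma + \log\kappa - \gamma\left(1-\frac{1}{\kappa}\right).$$
Therefore, since $k_\kappa(x) = (\kappa-1)\log x + \log\kappa$, we deduce that the logarithmic moment of $p^{\mathrm{Wei}}_{\kappa_1,\sigma_1}$ is:
$$E_{p^{\mathrm{Wei}}_{\kappa_1,\sigma_1}}[\log x] = -\frac{\gamma}{\kappa_1} + \log\sigma_1.$$
This coincides with the explicit definite integral calculation reported in [63].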
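Then, we calculate the KLD between two Weibull distributions using Equation (109), as follows:
$$D_{\mathrm{KL}}\left[p^{\mathrm{Wei}}_{\kappa_1,\sigma_1}:p^{\mathrm{Wei}}_{\kappa_2,\sigma_2}\right] = F_{\kappa_2}(\theta') + F^*_{\kappa_1}(\eta) - E_{p_{\kappa_1,\sigma_1}}[x^{\kappa_2}]\,\theta' + E_{p_{\kappa_1,\sigma_1}}[k_{\kappa_1}(x) - k_{\kappa_2}(x)]$$
$$= \log\frac{\kappa_1}{\sigma_1^{\kappa_1}} - \log\frac{\kappa_2}{\sigma_2^{\kappa_2}} + (\kappa_1-\kappa_2)\left(\log\sigma_1 - \frac{\gamma}{\kappa_1}\right) + \left(\frac{\sigma_1}{\sigma_2}\right)^{\kappa_2}\Gamma\left(\frac{\kappa_2}{\kappa_1}+1\right) - 1$$
since we have the following terms:
$$F_{\kappa_2}(\theta') = \log\sigma_2^{\kappa_2}, \qquad F^*_{\kappa_1}(\eta) = -1 - \log\sigma_1^{\kappa_1},$$
$$E_{p_{\kappa_1,\sigma_1}}[x^{\kappa_2}]\,\theta' = -\frac{1}{\sigma_2^{\kappa_2}}\,\sigma_1^{\kappa_2}\,\Gamma\left(1+\frac{\kappa_2}{\kappa_1}\right),$$
$$E_{p_{\kappa_1,\sigma_1}}[k_{\kappa_1}(x) - k_{\kappa_2}(x)] = (\kappa_1-\kappa_2)\,E_{p_{\kappa_1,\sigma_1}}[\log x] + \log\frac{\kappa_1}{\kappa_2} = \log\frac{\kappa_1}{\kappa_2} + (\kappa_1-\kappa_2)\left(\log\sigma_1 - \frac{\gamma}{\kappa_1}\right).$$
This formula matches the formula reported in [63].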
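When $\kappa_1 = \kappa_2 = 1$, we recover the ordinary KLD formula between two exponential distributions [60] with $\lambda_i = \frac{1}{\sigma_i}$ since $\Gamma(2) = (2-1)! = 1$:
$$D_{\mathrm{KL}}\left[p^{\mathrm{Wei}}_{1,\sigma_1}:p^{\mathrm{Wei}}_{1,\sigma_2}\right] = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1}{\sigma_2} - 1,$$
$$= \frac{\lambda_2}{\lambda_1} - \log\frac{\lambda_2}{\lambda_1} - 1.$$
When $\kappa_1 = \kappa_2 = 2$, we recover the ordinary KLD formula between two Rayleigh distributions [60], with $\sigma_{\mathrm{Ray}} = \frac{\sigma}{\sqrt{2}}$:
$$D_{\mathrm{KL}}\left[p^{\mathrm{Wei}}_{2,\sigma_1}:p^{\mathrm{Wei}}_{2,\sigma_2}\right] = \log\frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2}{\sigma_2^2} - 1,$$
$$= \log\frac{\sigma_{\mathrm{Ray}_2}^2}{\sigma_{\mathrm{Ray}_1}^2} + \frac{\sigma_{\mathrm{Ray}_1}^2}{\sigma_{\mathrm{Ray}_2}^2} - 1.$$
The formulae of Equations (127) and (126) are linked by the fact that if $X\sim\mathrm{Exp}(\lambda)$ and $Y = \sqrt{X}$ then $Y\sim\mathrm{Ray}\left(\frac{1}{\sqrt{2\lambda}}\right)$, and f-divergences [65], including the Kullback–Leibler divergence, are invariant under invertible differentiable transformations [66].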
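The Weibull formula and its two special cases can be encoded directly. The following sketch uses a hypothetical function name and arbitrary test values; it only transcribes the closed-form expression of this example and checks the exponential ($\kappa = 1$) and Rayleigh ($\kappa = 2$) reductions.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def kl_weibull(k1, s1, k2, s2):
    """KL divergence between Weibull(shape k1, scale s1) and Weibull(shape k2, scale s2)."""
    return (
        math.log(k1 / s1**k1)
        - math.log(k2 / s2**k2)
        + (k1 - k2) * (math.log(s1) - EULER_GAMMA / k1)
        + (s1 / s2) ** k2 * math.gamma(k2 / k1 + 1.0)
        - 1.0
    )


def kl_exponential(rate1, rate2):
    """Special case kappa = 1, written with rates lambda_i = 1 / sigma_i."""
    return rate2 / rate1 - math.log(rate2 / rate1) - 1.0


def kl_rayleigh_from_weibull_scales(s1, s2):
    """Special case kappa = 2, written with the Weibull scales sigma_1 and sigma_2."""
    return math.log(s2**2 / s1**2) + s1**2 / s2**2 - 1.0
```

For instance, `kl_weibull(1.0, 2.0, 1.0, 3.0)` should coincide with `kl_exponential(0.5, 1.0 / 3.0)`, and the divergence is zero when both shape and scale parameters coincide.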
Jeffreys’ divergence symmetrizes the KLD, as follows:
$$D_J[p:q] := D_{\mathrm{KL}}[p:q] + D_{\mathrm{KL}}[q:p] = 2A\left(D_{\mathrm{KL}}[p:q], D_{\mathrm{KL}}[q:p]\right).$$
The Jeffreys divergence between two densities of different exponential families $\mathcal{P}$ and $\mathcal{Q}$ is
$$D_J[p_\theta:q_{\theta'}] = \theta'^\top\left(\eta' - E_{p_\theta}[t_Q(x)]\right) + \theta^\top\left(\eta - E_{q_{\theta'}}[t_P(x)]\right) + E_{p_\theta}[k_P(x) - k_Q(x)] + E_{q_{\theta'}}[k_Q(x) - k_P(x)].$$
When $\mathcal{P} = \mathcal{Q}$, we have $E_{p_\theta}[t_Q(x)] = \eta$ and $E_{q_{\theta'}}[t_P(x)] = \eta'$, so that we find the usual expression of the Jeffreys divergence between two densities of an exponential family:
$$D_J[p_\theta:p_{\theta'}] = (\theta'-\theta)^\top(\eta'-\eta).$$
To find the best density $q_\theta$ approximating $p$ by minimizing $\min_\theta D_{\mathrm{KL}}[p:q_\theta]$, we solve $\nabla F(\theta) = \eta = m$ and, therefore, $\theta = \nabla F^*(m) = (\nabla F)^{-1}(m)$, where $F^*(\eta) = E_{q_\eta}[\log q_\eta(x)]$, with $F^*$ denoting the Legendre–Fenchel convex conjugate [2]. In particular, when $p = \sum_i w_i p_{\theta_i}$ is a mixture of densities of an exponential family (with $m = E_p[t(x)] = \sum_i w_i\eta_i$ and $\eta_i = E_{p_{\theta_i}}[t(x)]$, thanks to the linearity of the expectation), the best density of the exponential family simplifying $p$ is found by solving
$$\min_\theta D_{\mathrm{KL}}[p:q_\theta] \equiv \min_\theta F(\theta) - m^\top\theta,$$
$$= \min_\theta F(\theta) - \sum_i w_i\eta_i^\top\theta.$$
Setting the gradient with respect to $\theta$ to zero, we get $\nabla F(\theta) = \eta = \sum_i w_i\eta_i$. This yields another proof without the Pythagoras theorem [67,68].
Proposition 6.
Let $m(x) = \sum_i w_i p_{\theta_i}(x)$ be a mixture with components that belong to an exponential family with cumulant function $F$. Then, $\theta^* = \arg\min_\theta D_{\mathrm{KL}}[m:q_\theta]$ is $\nabla F^*\left(\sum_{i=1}^n w_i\eta_i\right)$, where the $\eta_i = \nabla F(\theta_i)$ are the moment parameters of the mixture components.
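For univariate Gaussian components, Proposition 6 reduces to moment matching. The sketch below is an illustration under assumptions: it takes the sufficient statistic $t(x) = (x, x^2)$, so that $\eta = (E[x], E[x^2])$ and $\theta = (\mu/\sigma^2, -1/(2\sigma^2))$; the function names and the example mixture are hypothetical.

```python
import math


def moments_of_gaussian(mean, variance):
    """Moment parameter eta = (E[x], E[x^2]) for the sufficient statistic t(x) = (x, x^2)."""
    return (mean, variance + mean * mean)


def simplify_mixture(weights, means, variances):
    """Single Gaussian minimizing KL[mixture : Gaussian], obtained by averaging moment parameters."""
    etas = [moments_of_gaussian(m, v) for m, v in zip(means, variances)]
    eta1 = sum(w * e[0] for w, e in zip(weights, etas))
    eta2 = sum(w * e[1] for w, e in zip(weights, etas))
    mean = eta1
    variance = eta2 - eta1 * eta1
    # Natural parameter theta = grad F^*(eta) of the simplified density.
    theta = (mean / variance, -0.5 / variance)
    return mean, variance, theta


def cross_entropy_objective(eta, mean, variance):
    """F(theta) - eta . theta for the Gaussian family, written in (mean, variance) coordinates."""
    cumulant = mean * mean / (2.0 * variance) + 0.5 * math.log(2.0 * math.pi * variance)
    return cumulant - (eta[0] * mean / variance - 0.5 * eta[1] / variance)
```

For the balanced mixture of $N(-1,1)$ and $N(1,1)$, `simplify_mixture([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])` returns mean 0 and variance 2, and perturbing either value should increase `cross_entropy_objective`.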
Consider the following two problems:
Problem 1
(Density clustering). Given a set of $n$ weighted densities $(w_1,p_1),\ldots,(w_n,p_n)$, partition them into $k$ clusters $C_1,\ldots,C_k$ in order to minimize the k-centroid objective function with respect to a statistical divergence $D$: $\sum_{i=1}^n w_i\min_{l\in\{1,\ldots,k\}} D[p_i:c_l]$, where $c_l$ denotes the centroid of cluster $C_l$ for $l\in\{1,\ldots,k\}$.
For example, when all the densities $p_i$ are isotropic Gaussians, we recover the k-means objective function [69].
Problem 2
(Mixture component quantization). Given a statistical mixture $m(x) = \sum_{i=1}^n w_i p_i(x)$, quantize the mixture components into $k$ densities $q_1,\ldots,q_k$ in order to minimize $\sum_i w_i\min_{l\in\{1,\ldots,k\}} D[p_i:q_l]$.
Notice that, in Problem 1, the input densities $p_i$ may be mixtures, i.e., $p_i(x) = \sum_{j=1}^{n_i} w_{i,j}\,p_{i,j}(x)$. Using the relative information radius, we can cluster a set of distributions (potentially mixtures) into an exponential family mixture, or quantize an exponential family mixture. Indeed, we can implement an extension of k-means [69] with $k$ centers $q_{\theta_1},\ldots,q_{\theta_k}$: To assign density $p_i$ to cluster $C_j$ (with center $q_{\theta_j}$), we need to perform basic comparison tests $D_{\mathrm{KL}}[p_i:q_{\theta_l}] \ge D_{\mathrm{KL}}[p_i:q_{\theta_j}]$. Provided that the cumulant function $F$ of the exponential family is in closed-form, we do not need formulas for the entropies $h(p_i)$.
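The assignment step of such a k-means extension only needs the center-dependent part $F_Q(\theta) - m_p^\top\theta$ of the divergence. The sketch below illustrates this for univariate Gaussian centers, whose carrier term is zero; each input density is summarized by its moments $m_p = (E_p[x], E_p[x^2])$. The function names and the sample values are hypothetical, and the centroid update step is not shown.

```python
import math


def natural_params(mu, var):
    """Natural parameter of N(mu, var) for the sufficient statistic t(x) = (x, x^2)."""
    return (mu / var, -0.5 / var)


def cumulant(mu, var):
    """Cumulant F_Q(theta) of N(mu, var), written in (mu, var) coordinates."""
    return mu * mu / (2.0 * var) + 0.5 * math.log(2.0 * math.pi * var)


def assignment_score(moments, center):
    """F_Q(theta) - m_p . theta: equals KL[p : q_theta] up to terms independent of the center."""
    mu, var = center
    t1, t2 = natural_params(mu, var)
    return cumulant(mu, var) - (moments[0] * t1 + moments[1] * t2)


def assign(moments_list, centers):
    """Index of the KL-closest center for each density, without computing any entropy h[p_i]."""
    return [
        min(range(len(centers)), key=lambda l: assignment_score(m, centers[l]))
        for m in moments_list
    ]


# Two densities described only by (E[x], E[x^2]); they need not be Gaussian themselves.
sample_moments = [(0.1, 1.2), (4.8, 24.5)]
sample_centers = [(0.0, 1.0), (5.0, 1.0)]
labels = assign(sample_moments, sample_centers)
```

Because the entropy of each input density is a constant with respect to the centers, comparing `assignment_score` values gives the same assignment as comparing the full Kullback–Leibler divergences.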
Clustering and quantization of densities/mixtures have been widely studied in the literature, see, for example, [70,71,72,73,74,75,76].

5. Conclusions

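To summarize, the ordinary Jensen-Shannon divergence has been defined in three equivalent ways in the literature:
$$D_{\mathrm{JS}}[p,q] := \min_{c\in\mathcal{D}}\frac{1}{2}\left(D_{\mathrm{KL}}[p:c] + D_{\mathrm{KL}}[q:c]\right),$$
$$= \frac{1}{2}\left(D_{\mathrm{KL}}\left[p:\frac{p+q}{2}\right] + D_{\mathrm{KL}}\left[q:\frac{p+q}{2}\right]\right),$$
$$= h\left[\frac{p+q}{2}\right] - \frac{h[p]+h[q]}{2}.$$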
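The agreement of the three definitions can be illustrated on discrete distributions. The sketch below uses natural logarithms, hypothetical helper names, and arbitrary probability vectors; the variational definition is probed by evaluating the objective at random candidate centers rather than by running an optimizer.

```python
import math
import random


def kl(p, q):
    """Discrete Kullback-Leibler divergence (natural logarithm); assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def entropy(p):
    """Discrete Shannon entropy (natural logarithm)."""
    return -sum(a * math.log(a) for a in p if a > 0)


def jsd_objective(p, q, c):
    """Variational objective (1/2)(KL[p : c] + KL[q : c]) at a candidate center c."""
    return 0.5 * (kl(p, c) + kl(q, c))


def jsd_explicit(p, q):
    """Objective evaluated at the arithmetic mixture (p + q) / 2."""
    mid = [(a + b) / 2.0 for a, b in zip(p, q)]
    return jsd_objective(p, q, mid)


def jsd_entropy(p, q):
    """Entropy of the mixture minus the average entropy."""
    mid = [(a + b) / 2.0 for a, b in zip(p, q)]
    return entropy(mid) - 0.5 * (entropy(p) + entropy(q))


def random_density(n, rng):
    """Random strictly positive probability vector of length n."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]
```

No candidate center should beat the mixture $\frac{p+q}{2}$, since the objective at any center $c$ equals the JSD plus $D_{\mathrm{KL}}\left[\frac{p+q}{2}:c\right] \ge 0$.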
The JSD Equation (133) was studied by Sibson in 1969 within the wider scope of information radius [1]: Sibson relied on the Rényi $\alpha$-divergences (relative Rényi $\alpha$-entropies [77]) and recovered the ordinary Jensen-Shannon divergence as a particular case of the $\alpha$-information radius when $\alpha = 1$ and $n = 2$ points. The $\alpha$-information radii are related to generalized Bhattacharyya distances with respect to power means and the total variation distance in the limit case of $\alpha = \infty$.
Lin [4] investigated the JSD Equation (134) in 1991 with its connection to the JSD defined in Equation (135). In Lin [4], the JSD is interpreted as the arithmetic symmetrization of the K-divergence [24]. Generalizations of the JSD based on Equation (134) were proposed in [23] using a generic mean instead of the arithmetic mean. One motivation was to obtain a closed-form formula for the geometric JSD between multivariate Gaussian distributions, which relies on the geometric mixture (see [30] for a use case of that formula in deep learning). Indeed, the ordinary JSD between Gaussians is not available in closed-form (not analytic). However, the JSD between Cauchy distributions admits a closed-form formula [78], despite the calculation of a definite integral of a log-sum term. Instead of using an abstract mean to define a mid-distribution of two densities, one may also consider the mid-point of a geodesic linking these two densities (the arithmetic mean $\frac{p+q}{2}$ is interpreted as a geodesic midpoint). Recently, Li [79] investigated the transport Jensen-Shannon divergence as a symmetrization of the Kullback–Leibler divergence in the $L^2$-Wasserstein space. See Section 5.4 of [79] and the closed-form formula of Equation (18) obtained for the transport Jensen-Shannon divergence between two multivariate Gaussian distributions.
The generalization of the identity between the JSD of Equation (134) and the JSD of Equation (135) was studied while using a skewing vector in [18]. Although the JSD is an f-divergence [8,18], the Sibson-M Jensen-Shannon symmetrization of a distance does not belong, in general, to the class of f-divergences. The variational JSD definition of Equation (133) is implicit, while the definitions of Equations (134) and (135) are explicit because the unique optimal centroid $c^* = \frac{p+q}{2}$ has been plugged into the objective function that was minimized by Equation (133).
In this paper, we proposed a generalization of the Jensen-Shannon divergence based on the variational JSD definition of Equation (133): $D^{\mathrm{vJS}}[p:q] = \min_c \frac{1}{2}\left(D_{\mathrm{KL}}[p:c] + D_{\mathrm{KL}}[q:c]\right)$. We introduced the Jensen-Shannon symmetrization of an arbitrary divergence $D$ by considering a generalization of the information radius with respect to an abstract weighted mean $M_\beta$: $D^{\mathrm{vJS}}_{M_\beta}[p:q] := \min_c M_\beta\left(D[p:c], D[q:c]\right)$. Notice that, in the variational JSD, the mean $M_\beta$ is used for averaging divergence values, while the mean $M_\alpha$ in the $(M_\alpha, N_\beta)$ JSD is used to define generic statistical mixtures. We also considered the relative variational JS symmetrization when the centroid has to belong to a prescribed family of densities. For the case of an exponential family, we showed how to compute the relative centroid in closed form, thus extending the pioneering work of Sibson, who considered the relative normal centroid used to calculate the relative normal information radius. Figure 2 illustrates the three generalizations of the ordinary skewed Jensen-Shannon divergence. Notice that, in general, the $(M,N)$-JSDs and the variational JSDs are not f-divergences (except in the ordinary case).
In a similar vein, Chen et al. [80] considered the following minimax symmetrization of the scalar Bregman divergence [81]:
$$B_f^{\mathrm{minmax}}(p,q) := \min_c\max_{\lambda\in[0,1]} \lambda B_f(p:c) + (1-\lambda)B_f(q:c),$$
$$= \max_{\lambda\in[0,1]} \lambda B_f(p:\lambda p+(1-\lambda)q) + (1-\lambda)B_f(q:\lambda p+(1-\lambda)q),$$
$$= \max_{\lambda\in[0,1]} \lambda f(p) + (1-\lambda)f(q) - f(\lambda p+(1-\lambda)q),$$
where $B_f$ denotes the scalar Bregman divergence induced by a strictly convex and smooth function $f$:
$$B_f(p:q) = f(p) - f(q) - (p-q)f'(q).$$
They proved that $\sqrt{B_f^{\mathrm{minmax}}(p,q)}$ yields a metric when $3(\log f'')'' \ge ((\log f'')')^2$, extended the definition to the vector case, and conjectured that the square-root metrization still holds in the multivariate case. In a sense, this definition geometrically highlights the notion of radius, since the minmax optimization amounts to finding a smallest enclosing ball [82] of the source distributions. The circumcenter, also called the Chebyshev center [83], is then the mid-distribution instead of the centroid for the information radius. The term “information radius” is well-suited to measure the distance between two points for an arbitrary distance $D$. Indeed, the JS-symmetrization of $D$ is defined by $D^{\mathrm{JS}}[p:q] := \min_c\left\{\frac{1}{2}D[p:c] + \frac{1}{2}D[q:c]\right\}$. When $D[p:q] = D_E[p:q] = \|p-q\|$ is the Euclidean distance, we have $c = \frac{p+q}{2}$, and $D[p:c] = D[q:c] = \frac{1}{2}\|p-q\| =: r$ (i.e., the radius being half of the diameter $\|p-q\|$). Thus, $D_E^{\mathrm{JS}}[p:q] = r$; hence, the term chosen by Sibson [1] for $D^{\mathrm{JS}}$: information radius. Besides providing another viewpoint, variational definitions of divergences have proven to be useful in practice (e.g., for estimation). For example, a variational definition of the Rényi divergence generalizing the Donsker–Varadhan variational formula of the KLD is given in [84], which is used to estimate the Rényi divergences.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable.

Acknowledgments

The author warmly thanks Rob Brekelmans (Information Sciences Institute, University of Southern California, USA) for discussions and feedback related to the contents of this work. The author also thanks the reviewers for valuable feedback, comments, and suggestions, and Gaëtan Hadjeres (Sony CSL Paris) for his careful reading of the manuscript.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Sibson, R. Information radius. Z. Wahrscheinlichkeitstheorie Verwandte Geb. 1969, 14, 149–160.
  2. Barndorff-Nielsen, O. Information and Exponential Families: In Statistical Theory; John Wiley & Sons: Hoboken, NJ, USA, 2014.
  3. Billingsley, P. Probability and Measure; John Wiley & Sons: Hoboken, NJ, USA, 2008.
  4. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151.
  5. Kullback, S. Information Theory and Statistics; Courier Corporation: Chelmsford, MA, USA, 1997.
  6. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
  7. Morimoto, T. Markov processes and the H-theorem. J. Phys. Soc. Jpn. 1963, 18, 328–331.
  8. Csiszár, I. Eine informationstheoretische ungleichung und ihre anwendung auf beweis der ergodizitaet von markoffschen ketten. Magyar Tud. Akad. Mat. Kut. Int. Koezl. 1964, 8, 85–108.
  9. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B (Methodological) 1966, 28, 131–142.
  10. Amari, S.i. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016.
  11. McLachlan, G.J.; Peel, D. Finite Mixture Models; John Wiley & Sons: Hoboken, NJ, USA, 2004.
  12. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466.
  13. Endres, D.M.; Schindelin, J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858–1860.
  14. Fuglede, B.; Topsoe, F. Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the International Symposium on Information Theory, 2004. ISIT 2004. Proceedings, Chicago, IL, USA, 27 June–2 July 2004; IEEE: Piscataway, NJ, USA, 2004; p. 31.
  15. Virosztek, D. The metric property of the quantum Jensen-Shannon divergence. Adv. Math. 2021, 380, 107595.
  16. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. arXiv 2014, arXiv:1406.2661.
  17. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
  18. Nielsen, F. On a generalization of the Jensen-Shannon divergence and the Jensen-Shannon centroid. Entropy 2020, 22, 221.
  19. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318.
  20. Csiszár, I. Axiomatic characterizations of information measures. Entropy 2008, 10, 261–273.
  21. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
  22. Antolín, J.; Angulo, J.; López-Rosa, S. Fisher and Jensen-Shannon divergences: Quantitative comparisons among distributions. Application to position and momentum atomic densities. J. Chem. Phys. 2009, 130, 074110.
  23. Nielsen, F. On the Jensen-Shannon symmetrization of distances relying on abstract means. Entropy 2019, 21, 485.
  24. Nielsen, F. A family of statistical symmetric divergences based on Jensen’s inequality. arXiv 2010, arXiv:1009.4004.
  25. Nielsen, F.; Nock, R. Generalizing skew Jensen divergences and Bregman divergences with comparative convexity. IEEE Signal Process. Lett. 2017, 24, 1123–1127.
  26. De Carvalho, M. Mean, what do you Mean? Am. Stat. 2016, 70, 270–274.
  27. Bullen, P.S. Handbook of Means and Their Inequalities; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 560.
  28. Niculescu, C.P.; Persson, L.E. Convex Functions and Their Applications: A Contemporary Approach; Springer: Berlin/Heidelberg, Germany, 2018.
  29. Nielsen, F. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognit. Lett. 2014, 42, 25–34.
  30. Deasy, J.; Simidjievski, N.; Liò, P. Constraining Variational Inference with Geometric Jensen-Shannon Divergence. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020.
  31. Amari, S.I. Integration of stochastic models by minimizing α-divergence. Neural Comput. 2007, 19, 2780–2796.
  32. Calin, O.; Udriste, C. Geometric Modeling in Probability and Statistics; Mathematics and Statistics; Springer International Publishing: Berlin/Heidelberg, Germany, 2014.
  33. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1961; Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California: Oakland, CA, USA, 1961.
  34. Blondel, M.; Martins, A.F.; Niculae, V. Learning with Fenchel-Young losses. J. Mach. Learn. Res. 2020, 21, 1–69.
  35. Faddeev, D.K. Zum Begriff der Entropie einer endlichen Wahrscheinlichkeitsschemas. In Arbeiten zur Informationstheorie I; Deutscher Verlag der Wissenschaften: Berlin, Germany, 1957; pp. 85–90.
  36. Kolmogorov, A.N.; Castelnuovo, G. Sur la Notion de la Moyenne; Bardi, G., Ed.; Atti della Academia Nazionale dei Lincei: Rome, Italy, 1930; Volume 12, pp. 323–343.
  37. Nagumo, M. Über eine klasse der mittelwerte. In Japanese Journal of Mathematics: Transactions and Abstracts; The Mathematical Society of Japan: Tokyo, Japan, 1930; Volume 7, pp. 71–79.
  38. De Finetti, B. Sul Concetto di Media; Istituto Italiano Degli Attuari: Roma, Italy, 1931.
  39. Van Erven, T.; Harremoës, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
  40. Sibson, R. A brief description of natural neighbour interpolation. In Interpreting Multivariate Data; Barnett, V., Ed.; John Wiley & Sons: Hoboken, NJ, USA, 1981; pp. 21–36.
  41. Boyd, S.P.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
  42. Nielsen, F.; Sun, K. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy 2016, 18, 442.
  43. Nielsen, F. Chernoff information of exponential families. arXiv 2011, arXiv:1102.2684.
  44. Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett. 2013, 20, 269–272.
  45. Nielsen, F.; Yvinec, M. An output-sensitive convex hull algorithm for planar objects. Int. J. Comput. Geom. Appl. 1998, 8, 39–65.
  46. Nielsen, F.; Nock, R. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Process. Lett. 2013, 21, 10–13.
  47. Nielsen, F. The statistical Minkowski distances: Closed-form formula for Gaussian mixture models. In International Conference on Geometric Science of Information; Springer: Berlin/Heidelberg, Germany, 2019; pp. 359–367.
  48. Fréchet, M. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. L’Institut Henri Poincaré 1948, 10, 215–310.
  49. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904.
  50. Naudts, J. Generalised Thermostatistics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.
  51. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
  52. Nielsen, F. On Voronoi diagrams on the information-geometric Cauchy manifolds. Entropy 2020, 22, 713.
  53. Nock, R.; Nielsen, F.; Amari, S.i. On conformal divergences and their population minimizers. IEEE Trans. Inf. Theory 2015, 62, 527–538.
  54. Brekelmans, R.; Nielsen, F.; Makhzani, A.; Galstyan, A.; Steeg, G.V. Likelihood Ratio Exponential Families. arXiv 2020, arXiv:2012.15480.
  55. Brekelmans, R.; Masrani, V.; Bui, T.; Wood, F.; Galstyan, A.; Steeg, G.V.; Nielsen, F. Annealed Importance Sampling with q-Paths. arXiv 2020, arXiv:2012.07823.
  56. Nielsen, F. A generalization of the α-divergences based on comparable and distinct weighted means. arXiv 2020, arXiv:2001.09660.
  57. Amari, S.i.; Ohara, A. Geometry of q-exponential family of probability distributions. Entropy 2011, 13, 1170–1185.
  58. Grosse, R.; Maddison, C.J.; Salakhutdinov, R. Annealing between distributions by averaging moments. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 2769–2777.
  59. Nielsen, F. What is an information projection? Not. AMS 2018, 65, 321–324.
  60. Nielsen, F.; Garcia, V. Statistical exponential families: A digest with flash cards. arXiv 2009, arXiv:0911.4863.
  61. Nielsen, F.; Nock, R. Entropies and cross-entropies of exponential families. In Proceedings of the 2010 IEEE International Conference on Image Processing, Hong Kong, China, 26–29 September 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 3621–3624.
  62. Nielsen, F. On Geodesic Triangles with Right Angles in a Dually Flat Space. In Progress in Information Geometry: Theory and Applications; Springer: Berlin/Heidelberg, Germany, 2021; pp. 153–190.
  63. Bauckhage, C. Computing the Kullback-Leibler divergence between two Weibull distributions. arXiv 2013, arXiv:1310.3713.
  64. Michalowicz, J.V.; Nichols, J.M.; Bucholtz, F. Handbook of Differential Entropy; CRC Press: Boca Raton, FL, USA, 2013.
  65. Csiszár, I. On topological properties of f-divergences. Stud. Sci. Math. Hung. 1967, 2, 329–339.
  66. Nielsen, F. On information projections between multivariate elliptical and location-scale families. arXiv 2021, arXiv:2101.03839.
  67. Pelletier, B. Informative barycentres in statistics. Ann. Inst. Stat. Math. 2005, 57, 767–780.
  68. Schwander, O.; Nielsen, F. Learning mixtures by simplifying kernel density estimators. In Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013; pp. 403–426.
  69. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
  70. Davis, J.V.; Dhillon, I. Differential entropic clustering of multivariate Gaussians. In Proceedings of the 19th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 4–7 December 2006; pp. 337–344.
  71. Nielsen, F.; Nock, R. Clustering multivariate normal distributions. In Emerging Trends in Visual Computing; Springer: Berlin/Heidelberg, Germany, 2008; pp. 164–174.
  72. Fischer, A. Quantization and clustering with Bregman divergences. J. Multivar. Anal. 2010, 101, 2207–2221.
  73. Zhang, K.; Kwok, J.T. Simplifying mixture models through function approximation. IEEE Trans. Neural Netw. 2010, 21, 644–658.
  74. Duan, J.; Wang, Y. Information-Theoretic Clustering for Gaussian Mixture Model via Divergence Factorization. In Proceedings of the 2013 Chinese Intelligent Automation Conference, Yangzhou, China, 23–25 August 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 565–573.
  75. Wang, J.C.; Yang, Y.H.; Wang, H.M.; Jeng, S.K. Modeling the affective content of music with a Gaussian mixture model. IEEE Trans. Affect. Comput. 2015, 6, 56–68.
  76. Spurek, P.; Pałka, W. Clustering of Gaussian distributions. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3346–3353.
  77. Esteban, M.D.; Morales, D. A summary on entropy statistics. Kybernetika 1995, 31, 337–346.
  78. Nielsen, F.; Okamura, K. On f-divergences between Cauchy distributions. arXiv 2021, arXiv:2101.12459.
  79. Li, W. Transport information Bregman divergences. arXiv 2021, arXiv:2101.01162.
  80. Chen, P.; Chen, Y.; Rao, M. Metrics defined by Bregman divergences: Part 2. Commun. Math. Sci. 2008, 6, 927–948.
  81. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217.
  82. Arnaudon, M.; Nielsen, F. On approximating the Riemannian 1-center. Comput. Geom. 2013, 46, 93–104.
  83. Candan, Ç. Chebyshev Center Computation on Probability Simplex With α-Divergence Measure. IEEE Signal Process. Lett. 2020, 27, 1515–1519.
  84. Birrell, J.; Dupuis, P.; Katsoulakis, M.A.; Rey-Bellet, L.; Wang, J. Variational Representations and Neural Network Estimation for Rényi Divergences. arXiv 2020, arXiv:2007.03814.
Figure 1. Illustrating several cases of the relative Jensen-Shannon divergence based on whether $p \in \mathcal{R}$ and $q \in \mathcal{R}$ or not.
Figure 2. Three equivalent expressions of the ordinary (skewed) Jensen-Shannon divergence which yield three different generalizations.

Nielsen, F. On a Variational Definition for the Jensen-Shannon Symmetrization of Distances Based on the Information Radius. Entropy 2021, 23, 464. https://doi.org/10.3390/e23040464
