Article

Guaranteed Bounds on Information-Theoretic Measures of Univariate Mixtures Using Piecewise Log-Sum-Exp Inequalities

Frank Nielsen 1,2,* and Ke Sun 3
1 Computer Science Department LIX, École Polytechnique, 91128 Palaiseau Cedex, France
2 Sony Computer Science Laboratories Inc., Tokyo 141-0022, Japan
3 King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
* Author to whom correspondence should be addressed.
Entropy 2016, 18(12), 442; https://doi.org/10.3390/e18120442
Submission received: 20 October 2016 / Revised: 4 December 2016 / Accepted: 5 December 2016 / Published: 9 December 2016
(This article belongs to the Special Issue Differential Geometrical Theory of Statistics)

Abstract:
Information-theoretic measures, such as the entropy, the cross-entropy and the Kullback–Leibler divergence between two mixture models, are core primitives in many signal processing tasks. Since the Kullback–Leibler divergence of mixtures provably does not admit a closed-form formula, it is in practice either estimated using costly Monte Carlo stochastic integration, approximated or bounded using various techniques. We present a fast and generic method that builds algorithmically closed-form lower and upper bounds on the entropy, the cross-entropy, the Kullback–Leibler and the α-divergences of mixtures. We illustrate the versatile method by reporting our experiments for approximating the Kullback–Leibler and the α-divergences between univariate exponential mixtures, Gaussian mixtures, Rayleigh mixtures and Gamma mixtures.

1. Introduction

Mixture models are commonly used in signal processing. A typical scenario is to use mixture models [1,2,3] to smoothly model histograms. For example, Gaussian Mixture Models (GMMs) can be used to convert grey-valued images into binary images by building a GMM fitting the image intensity histogram and then choosing the binarization threshold as the average of the Gaussian means [1]. Similarly, Rayleigh Mixture Models (RMMs) are often used in ultrasound imagery [2] to model histograms, and perform segmentation by classification. When using mixtures, a fundamental primitive is to define a proper statistical distance between them. The Kullback–Leibler (KL) divergence [4], also called relative entropy or information discrimination, is the most commonly used distance; hence the main target of this paper is to faithfully measure the KL divergence. Let $m(x)=\sum_{i=1}^k w_i p_i(x)$ and $m'(x)=\sum_{i=1}^{k'} w_i' p_i'(x)$ be two finite statistical density mixtures of $k$ and $k'$ components, respectively. Notice that the Cumulative Distribution Function (CDF) of a mixture is, like its density, a convex combination of the component CDFs. However, beware that a mixture is not a sum of random variables (RVs); indeed, sums of RVs have convolutional densities. In statistics, the mixture components $p_i(x)$ are often parametric: $p_i(x)=p(x;\theta_i)$, where $\theta_i$ is a vector of parameters. For example, a mixture of Gaussians (MoG, also used as a shortcut for GMM) has each component distribution parameterized by its mean $\mu_i$ and its covariance matrix $\Sigma_i$ (so that the parameter vector is $\theta_i=(\mu_i,\Sigma_i)$). Let $\mathcal{X}=\{x\in\mathbb{R}:p(x;\theta)>0\}$ be the support of the component distributions. Denote by $H_\times(m,m')=-\int_{\mathcal{X}} m(x)\log m'(x)\,\mathrm{d}x$ the cross-entropy [4] between two continuous mixture densities $m$ and $m'$, and by $H(m)=H_\times(m,m)=\int_{\mathcal{X}} m(x)\log\frac{1}{m(x)}\,\mathrm{d}x=-\int_{\mathcal{X}} m(x)\log m(x)\,\mathrm{d}x$ the Shannon entropy [4]. Then the Kullback–Leibler divergence between $m$ and $m'$ is given by:
$$\mathrm{KL}(m:m') = H_\times(m:m') - H(m) = \int_{\mathcal{X}} m(x)\log\frac{m(x)}{m'(x)}\,\mathrm{d}x \geq 0.$$
The notation “:” is used instead of the usual comma “,” notation to emphasize that the distance is not a metric distance: it is neither symmetric ($\mathrm{KL}(m:m')\neq\mathrm{KL}(m':m)$), nor does it satisfy the triangular inequality of metric distances ($\mathrm{KL}(m:m')+\mathrm{KL}(m':m'')\not\geq\mathrm{KL}(m:m'')$) [4]. When the natural base of the logarithm is chosen, we get a differential entropy measure expressed in nat units. Alternatively, we can also use the base-2 logarithm ($\log_2 x=\frac{\log x}{\log 2}$) and get the entropy expressed in bit units. Although the KL divergence is available in closed form for many distributions (in particular as equivalent Bregman divergences for exponential families [5], see Appendix C), it was proven that the Kullback–Leibler divergence between two (univariate) GMMs is not analytic [6] (see also the particular case of a GMM of two components with the same variance that was analyzed in [7]). See Appendix A for an analysis. Note that the differential entropy may be negative. For example, the differential entropy of a univariate Gaussian distribution is $\log(\sigma\sqrt{2\pi e})$, and is therefore negative when the standard deviation $\sigma<\frac{1}{\sqrt{2\pi e}}\approx 0.242$. We consider continuous distributions with well-defined entropies (the entropy may be undefined for singular distributions like Cantor’s distribution [8]).

1.1. Prior Work

Many approximation techniques have been designed to beat the computationally intensive Monte Carlo (MC) stochastic estimation: $\widehat{\mathrm{KL}}_s(m:m')=\frac{1}{s}\sum_{i=1}^s\log\frac{m(x_i)}{m'(x_i)}$ with $x_1,\ldots,x_s\sim m(x)$ ($s$ independently and identically distributed (i.i.d.) samples $x_1,\ldots,x_s$). The MC estimator is asymptotically consistent, $\lim_{s\to\infty}\widehat{\mathrm{KL}}_s(m:m')=\mathrm{KL}(m:m')$, so that the “true value” of the KL of mixtures is estimated in practice by taking a very large sample (say, $s=10^9$). However, we point out that the MC estimator gives as output a stochastic approximation, and therefore does not guarantee deterministic bounds (confidence intervals may be used). Deterministic lower and upper bounds of the integral can be obtained by various numerical integration techniques using quadrature rules. We refer to [9,10,11,12] for the current state-of-the-art approximation techniques and bounds on the KL of GMMs. The latest work for computing the entropy of GMMs is [13]. It considers arbitrarily finely tuned bounds of the entropy of isotropic Gaussian mixtures (a case encountered when dealing with KDEs, kernel density estimators). However, there is a catch in the technique of [13]: it relies on solving for the unique roots of some log-sum-exp equations (see Theorem 1 of [13], p. 3342) that do not admit a closed-form solution. Thus it is a hybrid method that contrasts with our combinatorial approach. Bounds on the KL divergence between mixture models can be generalized to bounds on the likelihood function of mixture models [14], because the log-likelihood is just the KL between the empirical distribution and the mixture model up to a constant shift.
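For concreteness, the following is a minimal sketch of such a Monte Carlo estimator for two univariate GMMs (the mixture parameters below are illustrative placeholders, not the mixtures used in Section 6):
```python
import numpy as np

def gmm_pdf(x, w, mu, sigma):
    """Density of a univariate Gaussian mixture at the points x."""
    x = np.atleast_1d(x)[:, None]
    comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return comp @ w

def gmm_sample(n, w, mu, sigma, rng):
    """Draw n i.i.d. samples from the mixture."""
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], sigma[idx])

def kl_mc(s, wm, mum, sm, wmp, mump, smp, seed=0):
    """Monte Carlo estimate (1/s) sum_i log m(x_i)/m'(x_i) with x_i ~ m."""
    rng = np.random.default_rng(seed)
    x = gmm_sample(s, wm, mum, sm, rng)
    return np.mean(np.log(gmm_pdf(x, wm, mum, sm) / gmm_pdf(x, wmp, mump, smp)))

# illustrative two-component mixtures m and m'
w1, mu1, s1 = np.array([0.5, 0.5]), np.array([0.0, 3.0]), np.array([1.0, 0.5])
w2, mu2, s2 = np.array([0.7, 0.3]), np.array([1.0, 4.0]), np.array([1.0, 1.0])
print(kl_mc(10**5, w1, mu1, s1, w2, mu2, s2))   # stochastic, no deterministic guarantee
```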
In information geometry [15], a mixture family of linearly independent probability distributions $p_1(x),\ldots,p_k(x)$ is defined by the convex combinations of those non-parametric component distributions: $m(x;\eta)=\sum_{i=1}^k\eta_i p_i(x)$ with $\eta_i>0$ and $\sum_{i=1}^k\eta_i=1$. A mixture family induces a dually flat space where the Kullback–Leibler divergence is equivalent to a Bregman divergence [5,15] defined on the $\eta$-parameters. However, in that case, the Bregman convex generator $F(\eta)=\int m(x;\eta)\log m(x;\eta)\,\mathrm{d}x$ (the Shannon information) is not available in closed form. An exception is the family of multinomial distributions, which is both a mixture family (with closed-form $\mathrm{KL}(m:m')=\sum_{i=1}^k m_i\log\frac{m_i}{m_i'}$, the discrete KL [4]) and an exponential family [15].

1.2. Contributions

In this work, we present a simple and efficient method that builds algorithmically a closed-form formula that guarantees both deterministic lower and upper bounds on the KL divergence within an additive factor of $\log k+\log k'$. We then further refine our technique to get improved adaptive bounds. For univariate GMMs, we get the non-adaptive bounds in $O(k\log k+k'\log k')$ time, and the adaptive bounds in $O(k^2+k'^2)$ time. To illustrate our generic technique, we demonstrate it on Exponential Mixture Models (EMMs), Gamma mixtures, RMMs and GMMs. We extend our preliminary results on the KL divergence [16] to other information-theoretic measures such as the differential entropy and α-divergences.

1.3. Paper Outline

The paper is organized as follows. Section 2 describes the algorithmic construction of the formula using piecewise log-sum-exp inequalities for the cross-entropy and the Kullback–Leibler divergence. Section 3 instantiates this algorithmic principle to the entropy and discusses related works. Section 4 extends the proposed bounds to the family of alpha divergences. Section 5 discusses an extension of the lower bound to f-divergences. Section 6 reports our experimental results on several mixture families. Finally, Section 7 concludes this work by discussing extensions to other statistical distances. Appendix A proves that the Kullback–Leibler divergence of mixture models is not analytic [6]. Appendix B reports the closed-form formula for the KL divergence between scaled and truncated distributions of the same exponential family [17] (that include Rayleigh, Gaussian and Gamma distributions among others). Appendix C shows that the KL divergence between two mixtures can be approximated by a Bregman divergence.

2. A Generic Combinatorial Bounding Algorithm Based on Density Envelopes

Let us bound the cross-entropy $H_\times(m:m')$ by deterministic lower and upper bounds, $L_\times(m:m')\leq H_\times(m:m')\leq U_\times(m:m')$, so that the bounds on the Kullback–Leibler divergence $\mathrm{KL}(m:m')=H_\times(m:m')-H_\times(m:m)$ follow as:
$$L_\times(m:m')-U_\times(m:m)\leq\mathrm{KL}(m:m')\leq U_\times(m:m')-L_\times(m:m).$$
Since the cross-entropy of two mixtures $\sum_{i=1}^k w_i p_i(x)$ and $\sum_{j=1}^{k'}w_j'p_j'(x)$:
$$H_\times(m:m')=-\int_{\mathcal{X}}\sum_{i=1}^k w_i p_i(x)\log\left(\sum_{j=1}^{k'}w_j'p_j'(x)\right)\mathrm{d}x$$
has a log-sum term of positive arguments, we shall use bounds on the log-sum-exp (lse) function [18,19]:
$$\mathrm{lse}\{x_i\}_{i=1}^l=\log\left(\sum_{i=1}^l e^{x_i}\right).$$
We have the following basic inequalities:
$$\max\{x_i\}_{i=1}^l<\mathrm{lse}\{x_i\}_{i=1}^l\leq\log l+\max\{x_i\}_{i=1}^l.$$
The left-hand-side (LHS) strict inequality holds because $\sum_{i=1}^l e^{x_i}>\max\{e^{x_i}\}_{i=1}^l=\exp(\max\{x_i\}_{i=1}^l)$ since $e^x>0$ for all $x\in\mathbb{R}$. The right-hand-side (RHS) inequality follows from the fact that $\sum_{i=1}^l e^{x_i}\leq l\max\{e^{x_i}\}_{i=1}^l=l\exp(\max\{x_i\}_{i=1}^l)$, and equality holds if and only if $x_1=\cdots=x_l$. The lse function is convex but not strictly convex, see exercise 7.9 [20]. It is known [21] that the conjugate of the lse function is the negative entropy restricted to the probability simplex. The lse function enjoys the following translation identity property: $\mathrm{lse}\{x_i\}_{i=1}^l=c+\mathrm{lse}\{x_i-c\}_{i=1}^l$ for any $c\in\mathbb{R}$. Similarly, we can also lower bound the lse function by $\log l+\min\{x_i\}_{i=1}^l$. We write equivalently that for $l$ positive numbers $x_1,\ldots,x_l$,
$$\max\left\{\log\max\{x_i\}_{i=1}^l,\ \log l+\log\min\{x_i\}_{i=1}^l\right\}\leq\log\sum_{i=1}^l x_i\leq\log l+\log\max\{x_i\}_{i=1}^l.$$
In practice, we seek matching lower and upper bounds that minimize the bound gap. The gap of the ham-sandwich inequality in Equation (5) is $\min\left\{\log\frac{\max_i\{x_i\}}{\min_i\{x_i\}},\log l\right\}$, which is upper bounded by $\log l$.
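A quick numeric sanity check of the inequality in Equation (5) (a minimal sketch; the inputs are arbitrary positive numbers):
```python
import numpy as np

def log_sum_bounds(xs):
    """Lower/upper bounds of log(sum xs) for positive xs, per Equation (5)."""
    xs = np.asarray(xs, dtype=float)
    l = len(xs)
    lower = max(np.log(xs.max()), np.log(l) + np.log(xs.min()))
    upper = np.log(l) + np.log(xs.max())
    return lower, upper

xs = [0.2, 1.5, 3.0, 0.7]
lo, up = log_sum_bounds(xs)
print(lo, np.log(np.sum(xs)), up)   # lower <= log-sum <= upper
```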
A mixture model $\sum_{j=1}^{k'}w_j'p_j'(x)$ must satisfy
$$\max\left\{\max\{\log(w_j'p_j'(x))\}_{j=1}^{k'},\ \log k'+\min\{\log(w_j'p_j'(x))\}_{j=1}^{k'}\right\}\leq\log\sum_{j=1}^{k'}w_j'p_j'(x)\leq\log k'+\max\{\log(w_j'p_j'(x))\}_{j=1}^{k'}$$
pointwise for any $x\in\mathcal{X}$. Therefore we shall bound the integral term $-\int_{\mathcal{X}}m(x)\log\left(\sum_{j=1}^{k'}w_j'p_j'(x)\right)\mathrm{d}x$ in Equation (3) using piecewise lse inequalities where the min and max are kept unchanged. We get
$$L_\times(m:m')=-\int_{\mathcal{X}}m(x)\max\{\log(w_j'p_j'(x))\}_{j=1}^{k'}\,\mathrm{d}x-\log k',$$
$$U_\times(m:m')=-\int_{\mathcal{X}}m(x)\max\left\{\min\{\log(w_j'p_j'(x))\}_{j=1}^{k'}+\log k',\ \max\{\log(w_j'p_j'(x))\}_{j=1}^{k'}\right\}\mathrm{d}x.$$
In order to calculate $L_\times(m:m')$ and $U_\times(m:m')$ efficiently using closed-form formulas, let us compute the upper and lower envelopes of the $k'$ real-valued functions $\{w_j'p_j'(x)\}_{j=1}^{k'}$ defined on the support $\mathcal{X}$, that is, $\mathcal{E}_U(x)=\max\{w_j'p_j'(x)\}_{j=1}^{k'}$ and $\mathcal{E}_L(x)=\min\{w_j'p_j'(x)\}_{j=1}^{k'}$. These envelopes can be computed exactly using techniques of computational geometry [22,23] provided that we can calculate the roots of the equation $w_r'p_r'(x)=w_s'p_s'(x)$, where $w_r'p_r'(x)$ and $w_s'p_s'(x)$ are a pair of weighted components. (Although this amounts to solving quadratic equations for Gaussian or Rayleigh distributions, the roots may not always be available in closed form, e.g., in the case of Weibull distributions.)
Let the envelopes be combinatorially described by elementary interval pieces of the form $I_r=(a_r,a_{r+1})$ partitioning the support $\mathcal{X}=\uplus_{r=1}^{\ell}I_r$ (with $a_1=\min\mathcal{X}$ and $a_{\ell+1}=\max\mathcal{X}$). Observe that on each interval $I_r$, the maximum of the functions $\{w_j'p_j'(x)\}_{j=1}^{k'}$ is given by $w_{\delta(r)}'p_{\delta(r)}'(x)$, where $\delta(r)$ indicates the weighted component dominating all the others, i.e., the argmax of $\{w_j'p_j'(x)\}_{j=1}^{k'}$ for any $x\in I_r$, and the minimum of $\{w_j'p_j'(x)\}_{j=1}^{k'}$ is given by $w_{\epsilon(r)}'p_{\epsilon(r)}'(x)$.
To fix ideas, when mixture components are univariate Gaussians, computing the upper envelope $\mathcal{E}_U(x)$ amounts to finding the lower envelope of $k'$ parabolas (see Figure 1), which has linear complexity and can be computed in $O(k'\log k')$ time [24], or in output-sensitive time $O(k'\log\ell')$ [25], where $\ell'$ denotes the number of parabola segments in the envelope. When the Gaussian mixture components all have the same weight and variance (e.g., kernel density estimators), the upper envelope amounts to finding a lower envelope of cones, $\min_j|x-\mu_j'|$ (a Voronoi diagram in arbitrary dimension).
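A simple quadratic-time sketch of this envelope construction for weighted univariate Gaussian components (not the optimal $O(k'\log k')$ algorithm of [24]): collect the pairwise intersection roots, sort them into elementary intervals, and read off the dominating and dominated components at each interval midpoint. Truncating the support to [lo, hi] is an assumption made here for simplicity.
```python
import numpy as np

def gauss_log_wpdf(x, w, mu, sigma):
    """log(w_j * N(x; mu_j, sigma_j)) for every weighted component; shape (len(x), k)."""
    x = np.atleast_1d(x)[:, None]
    return (np.log(w) - np.log(sigma) - 0.5 * np.log(2 * np.pi)
            - 0.5 * ((x - mu) / sigma) ** 2)

def envelope_intervals(w, mu, sigma, lo=-50.0, hi=50.0):
    """Elementary intervals (a_r, a_{r+1}) with the argmax/argmin component on each."""
    cuts = [lo, hi]
    k = len(w)
    for i in range(k):
        for j in range(i + 1, k):
            # log(w_i p_i) - log(w_j p_j) = 0 is a quadratic a x^2 + b x + c = 0
            a = 0.5 / sigma[j] ** 2 - 0.5 / sigma[i] ** 2
            b = mu[i] / sigma[i] ** 2 - mu[j] / sigma[j] ** 2
            c = (np.log(w[i] / w[j]) + np.log(sigma[j] / sigma[i])
                 - 0.5 * mu[i] ** 2 / sigma[i] ** 2 + 0.5 * mu[j] ** 2 / sigma[j] ** 2)
            if abs(a) > 1e-12:
                roots = np.roots([a, b, c])
            elif abs(b) > 1e-12:
                roots = np.array([-c / b])
            else:
                roots = np.array([])
            for r in roots:
                if abs(np.imag(r)) < 1e-9 and lo < np.real(r) < hi:
                    cuts.append(float(np.real(r)))
    cuts = sorted(set(cuts))
    intervals = []
    for a_r, a_r1 in zip(cuts[:-1], cuts[1:]):
        logs = gauss_log_wpdf(0.5 * (a_r + a_r1), w, mu, sigma)[0]
        intervals.append((a_r, a_r1, int(np.argmax(logs)), int(np.argmin(logs))))
    return intervals   # each tuple is (a_r, a_{r+1}, delta(r), epsilon(r))

w = np.array([0.3, 0.7]); mu = np.array([0.0, 4.0]); sigma = np.array([1.0, 2.0])
for itv in envelope_intervals(w, mu, sigma):
    print(itv)
```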
To proceed once the envelopes have been built, we need to calculate two types of definite integrals on those elementary intervals: (i) the probability mass in an interval, $\int_a^b p(x)\,\mathrm{d}x=\Phi(b)-\Phi(a)$, where $\Phi$ denotes the Cumulative Distribution Function (CDF); and (ii) the partial cross-entropy $-\int_a^b p(x)\log p'(x)\,\mathrm{d}x$ [26]. Thus let us define these two quantities:
$$C_{i,j}(a,b)=-\int_a^b w_i p_i(x)\log(w_j'p_j'(x))\,\mathrm{d}x,$$
$$M_i(a,b)=\int_a^b w_i p_i(x)\,\mathrm{d}x.$$
By Equations (7) and (8), we get the bounds of $H_\times(m:m')$ as
$$L_\times(m:m')=\sum_{r=1}^{\ell}\sum_{s=1}^{k}C_{s,\delta(r)}(a_r,a_{r+1})-\log k',\qquad U_\times(m:m')=\sum_{r=1}^{\ell}\sum_{s=1}^{k}\min\left\{C_{s,\delta(r)}(a_r,a_{r+1}),\ C_{s,\epsilon(r)}(a_r,a_{r+1})-M_s(a_r,a_{r+1})\log k'\right\}.$$
The size of the lower/upper bound formula depends on the envelope complexity $\ell$, the number $k$ of mixture components, and the closed-form expressions of the integral terms $C_{i,j}(a,b)$ and $M_i(a,b)$. In general, when a pair of weighted component densities intersect in at most $p$ points, the envelope complexity is related to Davenport–Schinzel sequences [27]. It is quasi-linear for bounded $p=O(1)$, see [27].
Note that in symbolic computing, the Risch semi-algorithm [28] solves the problem of computing indefinite integration in terms of elementary functions provided that there exists an oracle (hence the term “semi-algorithm”) for checking whether an expression is equivalent to zero or not (however it is unknown whether there exists an algorithm implementing the oracle or not).
We presented the technique by bounding the cross-entropy (and entropy) to deliver lower/upper bounds on the KL divergence. When only the KL divergence needs to be bounded, we rather consider the ratio term $\frac{m(x)}{m'(x)}$. This requires partitioning the support $\mathcal{X}$ into elementary intervals by overlaying the critical points of both the lower and upper envelopes of $m(x)$ and $m'(x)$, which can be done in linear time. In a given elementary interval, since $\max\{k\min_i\{w_ip_i(x)\},\max_i\{w_ip_i(x)\}\}\leq m(x)\leq k\max_i\{w_ip_i(x)\}$, we then consider the inequalities:
$$\frac{\max\{k\min_i\{w_ip_i(x)\},\ \max_i\{w_ip_i(x)\}\}}{k'\max_j\{w_j'p_j'(x)\}}\leq\frac{m(x)}{m'(x)}\leq\frac{k\max_i\{w_ip_i(x)\}}{\max\{k'\min_j\{w_j'p_j'(x)\},\ \max_j\{w_j'p_j'(x)\}\}}.$$
We now need to compute definite integrals of the form $\int_a^b w_1p(x;\theta_1)\log\frac{w_2p(x;\theta_2)}{w_3p(x;\theta_3)}\,\mathrm{d}x$ (see Appendix B for explicit formulas when considering scaled and truncated exponential families [17]). (Thus for exponential families, the ratio of densities removes the auxiliary carrier measure term.)
We call these bounds CELB and CEUB for Combinatorial Envelope Lower and Upper Bounds, respectively.

2.1. Tighter Adaptive Bounds

We shall now consider shape-dependent bounds improving over the additive $\log k+\log k'$ non-adaptive bounds. This is made possible by a decomposition of the lse function explained as follows. Let $t_i(x_1,\ldots,x_k)=\log\left(\sum_{j=1}^k e^{x_j-x_i}\right)$. By the translation identity of the lse function,
$$\mathrm{lse}(x_1,\ldots,x_k)=x_i+t_i(x_1,\ldots,x_k)$$
for all $i\in[k]$. Since $e^{x_j-x_i}=1$ if $j=i$, and $e^{x_j-x_i}>0$, we necessarily have $t_i(x_1,\ldots,x_k)>0$ for any $i\in[k]$. Since Equation (13) is an identity for all $i\in[k]$, we minimize the residual $t_i(x_1,\ldots,x_k)$ by maximizing $x_i$. Denoting by $x_{(1)},\ldots,x_{(k)}$ the sequence of numbers sorted in non-decreasing order, the decomposition
$$\mathrm{lse}(x_1,\ldots,x_k)=x_{(k)}+t_{(k)}(x_1,\ldots,x_k)$$
yields the smallest residual. Since $x_{(j)}-x_{(k)}\leq 0$ for all $j\in[k]$, we have
$$t_{(k)}(x_1,\ldots,x_k)=\log\left(1+\sum_{j=1}^{k-1}e^{x_{(j)}-x_{(k)}}\right)\leq\log k.$$
This shows that the bounds introduced earlier can indeed be improved by a more accurate computation of the residual term $t_{(k)}(x_1,\ldots,x_k)$.
When considering 1D GMMs, let us now bound $t_{(k)}(x_1,\ldots,x_k)$ in a combinatorial range $I_r=(a_r,a_{r+1})$. Let $\delta=\delta(r)$ denote the index of the dominating weighted component in this range. Then,
$$\forall x\in I_r,\ \forall i,\quad \exp\left(-\log\sigma_i-\frac{(x-\mu_i)^2}{2\sigma_i^2}+\log w_i\right)\leq\exp\left(-\log\sigma_\delta-\frac{(x-\mu_\delta)^2}{2\sigma_\delta^2}+\log w_\delta\right).$$
Thus we have:
$$\log m(x)=\log\frac{w_\delta}{\sigma_\delta\sqrt{2\pi}}-\frac{(x-\mu_\delta)^2}{2\sigma_\delta^2}+\log\left(1+\sum_{i\neq\delta}\exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}+\log\frac{w_i}{\sigma_i}+\frac{(x-\mu_\delta)^2}{2\sigma_\delta^2}-\log\frac{w_\delta}{\sigma_\delta}\right)\right).$$
Now consider the ratio term:
$$\rho_{i,\delta}(x)=\exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}+\log\frac{w_i\sigma_\delta}{w_\delta\sigma_i}+\frac{(x-\mu_\delta)^2}{2\sigma_\delta^2}\right).$$
It is maximized in $I_r=(a_r,a_{r+1})$ by equivalently maximizing the following quadratic function:
$$l_{i,\delta}(x)=-\frac{(x-\mu_i)^2}{2\sigma_i^2}+\log\frac{w_i\sigma_\delta}{w_\delta\sigma_i}+\frac{(x-\mu_\delta)^2}{2\sigma_\delta^2}.$$
Setting the derivative to zero ($l_{i,\delta}'(x)=0$), we get the root (when $\sigma_i\neq\sigma_\delta$)
$$x_{i,\delta}=\left(\frac{\mu_\delta}{\sigma_\delta^2}-\frac{\mu_i}{\sigma_i^2}\right)\Big/\left(\frac{1}{\sigma_\delta^2}-\frac{1}{\sigma_i^2}\right).$$
If $x_{i,\delta}\in I_r$, the ratio $\rho_{i,\delta}(x)$ can be bounded in the slab $I_r$ by considering the extreme values of the three-element set $\{\rho_{i,\delta}(a_r),\rho_{i,\delta}(x_{i,\delta}),\rho_{i,\delta}(a_{r+1})\}$. Otherwise $\rho_{i,\delta}(x)$ is monotonic in $I_r$, and its bounds in $I_r$ are given by $\{\rho_{i,\delta}(a_r),\rho_{i,\delta}(a_{r+1})\}$. In any case, let $\rho_{i,\delta}^{\min}(r)$ and $\rho_{i,\delta}^{\max}(r)$ denote the resulting lower and upper bounds of $\rho_{i,\delta}(x)$ in $I_r$. Then $t_\delta$ is bounded in the range $I_r$ by:
$$0<\log\left(1+\sum_{i\neq\delta}\rho_{i,\delta}^{\min}(r)\right)\leq t_\delta\leq\log\left(1+\sum_{i\neq\delta}\rho_{i,\delta}^{\max}(r)\right)\leq\log k.$$
In practice, we always get better bounds using the shape-dependent technique at the expense of computing overall $O(k^2)$ intersection points of the pairwise densities. We call those bounds CEALB and CEAUB for Combinatorial Envelope Adaptive Lower Bound and Combinatorial Envelope Adaptive Upper Bound.
Let us illustrate one scenario where this adaptive technique yields very good approximations. Consider a GMM with all variances $\sigma_i^2$ tending to zero (a mixture of $k$ Diracs). Then in a combinatorial slab $I_r$, we have $\rho_{i,\delta}^{\max}(r)\to 0$ for all $i\neq\delta$, and therefore we get tight bounds.
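A per-slab sketch of this adaptive bound for a GMM, following Equations (16)–(21) (the slab endpoints, the dominating index, and the mixture parameters below are illustrative):
```python
import numpy as np

def rho(x, i, d, w, mu, sigma):
    """Ratio rho_{i,delta}(x) of Equation (17)."""
    return np.exp(-(x - mu[i]) ** 2 / (2 * sigma[i] ** 2)
                  + np.log(w[i] * sigma[d] / (w[d] * sigma[i]))
                  + (x - mu[d]) ** 2 / (2 * sigma[d] ** 2))

def t_delta_bounds(a, b, d, w, mu, sigma):
    """Lower/upper bounds of the residual t_delta on the slab (a, b), Equation (21)."""
    lo_sum, hi_sum = 0.0, 0.0
    for i in range(len(w)):
        if i == d:
            continue
        vals = [rho(a, i, d, w, mu, sigma), rho(b, i, d, w, mu, sigma)]
        if abs(sigma[i] - sigma[d]) > 1e-12:
            x_star = ((mu[d] / sigma[d] ** 2 - mu[i] / sigma[i] ** 2)
                      / (1.0 / sigma[d] ** 2 - 1.0 / sigma[i] ** 2))   # Equation (19)
            if a < x_star < b:           # stationary point falls inside the slab
                vals.append(rho(x_star, i, d, w, mu, sigma))
        lo_sum += min(vals)
        hi_sum += max(vals)
    return np.log1p(lo_sum), np.log1p(hi_sum)

w = np.array([0.3, 0.7]); mu = np.array([0.0, 4.0]); sigma = np.array([1.0, 2.0])
print(t_delta_bounds(-1.0, 1.0, 0, w, mu, sigma))   # component 0 assumed dominant on (-1, 1)
```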
As a related technique, we could also upper bound $\int_{a_r}^{a_{r+1}}\log m(x)\,\mathrm{d}x$ by $(a_{r+1}-a_r)\log\overline{m}(a_r,a_{r+1})$, where $\overline{m}(x,x')$ denotes the maximal value of the mixture density in the range $(x,x')$. This maximal value is either found at the slab extremities, or is a mode of the GMM. It then requires finding the modes of a GMM [29,30], for which no analytical solution is known in general.

2.2. Another Derivation Using the Arithmetic-Geometric Mean Inequality

Let us start by considering the inequality of arithmetic and geometric weighted means (AGI, Arithmetic-Geometric Inequality) applied to the mixture component distributions:
$$m(x)=\sum_{i=1}^k w_i p(x;\theta_i)\geq\prod_{i=1}^k p(x;\theta_i)^{w_i},$$
with equality if and only if $\theta_1=\cdots=\theta_k$.
To get a tractable formula with a positive remainder of the log-sum term $\log m(x)$, we need the log argument to be greater than or equal to 1, and thus we shall write the positive remainder:
$$R(x)=\log\frac{m(x)}{\prod_{i=1}^k p(x;\theta_i)^{w_i}}\geq 0.$$
Therefore, we can decompose the log-sum into a tractable part and a remainder as:
$$\log m(x)=\sum_{i=1}^k w_i\log p(x;\theta_i)+\log\frac{m(x)}{\prod_{i=1}^k p(x;\theta_i)^{w_i}}.$$
For exponential families, the first term can be integrated accurately. For the second term, we notice that $\prod_{i=1}^k p(x;\theta_i)^{w_i}$ has the form of an (unnormalized) density of the same exponential family. We denote $p(x;\theta_0)=\prod_{i=1}^k p(x;\theta_i)^{w_i}$. Then
$$R(x)=\log\sum_{i=1}^k\frac{w_i p(x;\theta_i)}{p(x;\theta_0)}.$$
As the ratio $p(x;\theta_i)/p(x;\theta_0)$ can be bounded above and below using the techniques in Section 2.1, $R(x)$ can be correspondingly bounded. Notice the similarity between Equations (14) and (15). The key difference with the adaptive bounds is that here we choose $p(x;\theta_0)$ instead of the dominating component of $m(x)$ as the “reference distribution” in the decomposition. This subtle difference is not studied in detail in our experiments but is discussed here for completeness. Essentially, the gap of the bounds is up to the difference between the geometric average and the arithmetic average. In the extreme case where all mixture components are identical, this gap reaches zero. Therefore we expect good-quality bounds with a small gap when the mixture components are similar as measured by the KL divergence.
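A small numeric check of the arithmetic-geometric inequality of Equation (22) on a toy GMM (a sketch with arbitrary parameters):
```python
import numpy as np

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

w = np.array([0.4, 0.6]); mu = np.array([0.0, 3.0]); sigma = np.array([1.0, 2.0])
x = np.linspace(-5, 10, 1001)
mix = sum(wi * gauss(x, mi, si) for wi, mi, si in zip(w, mu, sigma))
geo = np.prod([gauss(x, mi, si) ** wi for wi, mi, si in zip(w, mu, sigma)], axis=0)
print(np.all(mix >= geo))            # True: arithmetic mean dominates geometric mean
print(np.max(np.log(mix / geo)))     # the positive remainder R(x) on the grid
```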

2.3. Case Studies

In the following, we instantiate the proposed method for several prominent cases of mixtures of exponential family distributions.

2.3.1. The Case of Exponential Mixture Models

An exponential distribution has density $p(x;\lambda)=\lambda\exp(-\lambda x)$, defined on $\mathcal{X}=[0,\infty)$ for $\lambda>0$. Its CDF is $\Phi(x;\lambda)=1-\exp(-\lambda x)$. Any two components $w_1p(x;\lambda_1)$ and $w_2p(x;\lambda_2)$ (with $\lambda_1\neq\lambda_2$) have a unique intersection point
$$x^*=\frac{\log(w_1\lambda_1)-\log(w_2\lambda_2)}{\lambda_1-\lambda_2}$$
if $x^*\geq 0$; otherwise they do not intersect. The basic formulas to evaluate the bounds are
$$C_{i,j}(a,b)=-\log(\lambda_j'w_j')\,M_i(a,b)+w_i\lambda_j'\left[\left(a+\frac{1}{\lambda_i}\right)e^{-\lambda_i a}-\left(b+\frac{1}{\lambda_i}\right)e^{-\lambda_i b}\right],$$
$$M_i(a,b)=w_i\left(e^{-\lambda_i a}-e^{-\lambda_i b}\right).$$
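A direct transcription of Equations (18) and (19) into code, checked against numerical quadrature (a sketch; the second mixture's primed parameters are passed as separate arguments):
```python
import numpy as np
from scipy.integrate import quad

def M_exp(a, b, w_i, lam_i):
    """Mass of the weighted exponential component w_i p(x; lam_i) on (a, b)."""
    return w_i * (np.exp(-lam_i * a) - np.exp(-lam_i * b))

def C_exp(a, b, w_i, lam_i, w_j, lam_j):
    """-int_a^b w_i p(x; lam_i) log(w_j p(x; lam_j)) dx, Equation (18)."""
    piece = ((a + 1.0 / lam_i) * np.exp(-lam_i * a)
             - (b + 1.0 / lam_i) * np.exp(-lam_i * b))
    return -np.log(w_j * lam_j) * M_exp(a, b, w_i, lam_i) + w_i * lam_j * piece

def x_star_exp(w1, lam1, w2, lam2):
    """Crossing point of two weighted exponential densities (lam1 != lam2)."""
    return (np.log(w1 * lam1) - np.log(w2 * lam2)) / (lam1 - lam2)

# consistency check against numerical quadrature
w_i, lam_i, w_j, lam_j, a, b = 0.3, 0.5, 0.7, 2.0, 0.1, 3.0
num = quad(lambda x: -w_i * lam_i * np.exp(-lam_i * x)
           * np.log(w_j * lam_j * np.exp(-lam_j * x)), a, b)[0]
print(C_exp(a, b, w_i, lam_i, w_j, lam_j), num)   # should agree closely
```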

2.3.2. The Case of Rayleigh Mixture Models

A Rayleigh distribution has density $p(x;\sigma)=\frac{x}{\sigma^2}\exp\left(-\frac{x^2}{2\sigma^2}\right)$, defined on $\mathcal{X}=[0,\infty)$ for $\sigma>0$. Its CDF is $\Phi(x;\sigma)=1-\exp\left(-\frac{x^2}{2\sigma^2}\right)$. Any two components $w_1p(x;\sigma_1)$ and $w_2p(x;\sigma_2)$ (with $\sigma_1\neq\sigma_2$) must intersect at $x_0=0$ and can have at most one other intersection point
$$x^*=\sqrt{\log\frac{w_1\sigma_2^2}{w_2\sigma_1^2}\Big/\left(\frac{1}{2\sigma_1^2}-\frac{1}{2\sigma_2^2}\right)}$$
if the square root is well defined and $x^*>0$. We have
$$C_{i,j}(a,b)=-\log\frac{w_j'}{(\sigma_j')^2}\,M_i(a,b)+\frac{w_i}{2(\sigma_j')^2}\left[(a^2+2\sigma_i^2)e^{-\frac{a^2}{2\sigma_i^2}}-(b^2+2\sigma_i^2)e^{-\frac{b^2}{2\sigma_i^2}}\right]-w_i\int_a^b\frac{x}{\sigma_i^2}\exp\left(-\frac{x^2}{2\sigma_i^2}\right)\log x\,\mathrm{d}x,$$
$$M_i(a,b)=w_i\left(e^{-\frac{a^2}{2\sigma_i^2}}-e^{-\frac{b^2}{2\sigma_i^2}}\right).$$
The last term in Equation (20) does not have a simple closed form (it requires the exponential integral, Ei). One needs a numerical integrator to compute it.

2.3.3. The Case of Gaussian Mixture Models

The Gaussian density $p(x;\mu,\sigma)=\frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/(2\sigma^2)}$ has support $\mathcal{X}=\mathbb{R}$ and parameters $\mu\in\mathbb{R}$ and $\sigma>0$. Its CDF is $\Phi(x;\mu,\sigma)=\frac{1}{2}\left[1+\mathrm{erf}\left(\frac{x-\mu}{\sqrt{2}\sigma}\right)\right]$, where erf is the Gauss error function. The intersection points of two components $w_1p(x;\mu_1,\sigma_1)$ and $w_2p(x;\mu_2,\sigma_2)$ can be obtained by solving the quadratic equation $\log(w_1p(x;\mu_1,\sigma_1))=\log(w_2p(x;\mu_2,\sigma_2))$, which gives at most two solutions. As shown in Figure 1, the upper envelope of Gaussian densities corresponds to the lower envelope of parabolas. We have
$$C_{i,j}(a,b)=-M_i(a,b)\left(\log w_j'-\log\sigma_j'-\frac{1}{2}\log(2\pi)-\frac{(\mu_j'-\mu_i)^2+\sigma_i^2}{2(\sigma_j')^2}\right)+\frac{w_i\sigma_i}{2\sqrt{2\pi}(\sigma_j')^2}\left[(a+\mu_i-2\mu_j')e^{-\frac{(a-\mu_i)^2}{2\sigma_i^2}}-(b+\mu_i-2\mu_j')e^{-\frac{(b-\mu_i)^2}{2\sigma_i^2}}\right],$$
$$M_i(a,b)=\frac{w_i}{2}\left[\mathrm{erf}\left(\frac{b-\mu_i}{\sqrt{2}\sigma_i}\right)-\mathrm{erf}\left(\frac{a-\mu_i}{\sqrt{2}\sigma_i}\right)\right].$$
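These two Gaussian primitives in code, checked against numerical quadrature (a sketch):
```python
import numpy as np
from scipy.special import erf
from scipy.integrate import quad

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def M_gauss(a, b, w_i, mu_i, s_i):
    """Mass of w_i N(mu_i, s_i^2) on (a, b), Equation (24)."""
    return 0.5 * w_i * (erf((b - mu_i) / (np.sqrt(2) * s_i))
                        - erf((a - mu_i) / (np.sqrt(2) * s_i)))

def C_gauss(a, b, w_i, mu_i, s_i, w_j, mu_j, s_j):
    """-int_a^b w_i p_i(x) log(w_j p_j(x)) dx, Equation (23)."""
    const = (np.log(w_j) - np.log(s_j) - 0.5 * np.log(2 * np.pi)
             - ((mu_j - mu_i) ** 2 + s_i ** 2) / (2 * s_j ** 2))
    tail = (w_i * s_i / (2 * np.sqrt(2 * np.pi) * s_j ** 2)
            * ((a + mu_i - 2 * mu_j) * np.exp(-(a - mu_i) ** 2 / (2 * s_i ** 2))
               - (b + mu_i - 2 * mu_j) * np.exp(-(b - mu_i) ** 2 / (2 * s_i ** 2))))
    return -M_gauss(a, b, w_i, mu_i, s_i) * const + tail

a, b = -1.0, 2.5
w_i, mu_i, s_i, w_j, mu_j, s_j = 0.3, 0.0, 1.0, 0.7, 1.5, 0.5
num = quad(lambda x: -w_i * gauss(x, mu_i, s_i)
           * np.log(w_j * gauss(x, mu_j, s_j)), a, b)[0]
print(C_gauss(a, b, w_i, mu_i, s_i, w_j, mu_j, s_j), num)   # should agree closely
```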

2.3.4. The Case of Gamma Distributions

For simplicity, we only consider gamma distributions with the shape parameter $k>0$ fixed and the scale $\lambda>0$ varying. The density is defined on $(0,\infty)$ as $p(x;k,\lambda)=\frac{x^{k-1}e^{-\frac{x}{\lambda}}}{\lambda^k\Gamma(k)}$, where $\Gamma(\cdot)$ is the gamma function. Its CDF is $\Phi(x;k,\lambda)=\gamma(k,x/\lambda)/\Gamma(k)$, where $\gamma(\cdot,\cdot)$ is the lower incomplete gamma function. Two weighted gamma densities $w_1p(x;k,\lambda_1)$ and $w_2p(x;k,\lambda_2)$ (with $\lambda_1\neq\lambda_2$) intersect at a unique point
$$x^*=\left(\log\frac{w_1}{\lambda_1^k}-\log\frac{w_2}{\lambda_2^k}\right)\Big/\left(\frac{1}{\lambda_1}-\frac{1}{\lambda_2}\right)$$
if $x^*>0$; otherwise they do not intersect. From straightforward derivations,
$$C_{i,j}(a,b)=-\log\frac{w_j'}{(\lambda_j')^k\Gamma(k)}\,M_i(a,b)+w_i\int_a^b\frac{x^{k-1}e^{-\frac{x}{\lambda_i}}}{\lambda_i^k\Gamma(k)}\left(\frac{x}{\lambda_j'}-(k-1)\log x\right)\mathrm{d}x,$$
$$M_i(a,b)=\frac{w_i}{\Gamma(k)}\left[\gamma\left(k,\frac{b}{\lambda_i}\right)-\gamma\left(k,\frac{a}{\lambda_i}\right)\right].$$
Similar to the case of Rayleigh mixtures, the last term in Equation (25) relies on numerical integration.

3. Upper-Bounding the Differential Entropy of a Mixture

First, consider a finite parametric mixture model $m(x)=\sum_{i=1}^k w_i p(x;\theta_i)$. Using the chain rule of the entropy, we end up with the well-known lemma:
Lemma 1.
The entropy of a d-variate mixture is upper bounded by the sum of the entropies of its marginal mixtures: $H(m)\leq\sum_{i=1}^d H(m_i)$, where $m_i$ is the 1D marginal mixture with respect to variable $x_i$.
Since the 1D marginals of a multivariate GMM are univariate GMMs, we thus get a loose upper bound. A generic sample-based probabilistic bound is reported for the entropies of distributions with given support [31]: the method builds probabilistic upper and lower piecewise-linear CDFs based on an i.i.d. finite sample set of size n and a given deviation probability threshold. It then builds algorithmically between those two bounds the maximum entropy distribution [31] with a so-called string-tightening algorithm.
Instead, we proceed as follows: Consider finite mixtures of component distributions defined on the full support R d that have finite component means and variances (like exponential families). Then we shall use the fact that the maximum entropy distribution with prescribed mean and variance is a Gaussian distribution, and conclude the upper bound by plugging the mixture mean and variance in the differential entropy formula of the Gaussian distribution. In general, the maximum entropy with moment constraints yields as a solution an exponential family.
Without loss of generality, consider GMMs in the form $m(x)=\sum_{i=1}^k w_i p(x;\mu_i,\Sigma_i)$ ($\Sigma_i=\sigma_i^2$ for univariate Gaussians). The mean of the mixture is $\bar{\mu}=\sum_{i=1}^k w_i\mu_i$ and its variance is $\bar{\sigma}^2=E[m^2]-E[m]^2$. Since $E[m^2]=\sum_{i=1}^k w_i\int x^2 p(x;\mu_i,\Sigma_i)\,\mathrm{d}x=\sum_{i=1}^k w_i(\mu_i^2+\sigma_i^2)$, we deduce that
$$\bar{\sigma}^2=\sum_{i=1}^k w_i(\mu_i^2+\sigma_i^2)-\left(\sum_{i=1}^k w_i\mu_i\right)^2=\sum_{i=1}^k w_i\left((\mu_i-\bar{\mu})^2+\sigma_i^2\right).$$
The entropy of a random variable with a prescribed variance $\bar{\sigma}^2$ is maximal for the Gaussian distribution with the same variance $\bar{\sigma}^2$, see [4]. Since the differential entropy of a Gaussian is $\log(\bar{\sigma}\sqrt{2\pi e})$, we deduce that the entropy of the GMM is upper bounded by
$$H(m)\leq\frac{1}{2}\log(2\pi e)+\frac{1}{2}\log\left(\sum_{i=1}^k w_i\left((\mu_i-\bar{\mu})^2+\sigma_i^2\right)\right).$$
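The univariate bound of Equation (27) takes only a few lines (a minimal sketch):
```python
import numpy as np

def gmm_entropy_meub(w, mu, sigma):
    """Maximum-entropy upper bound on the differential entropy of a 1D GMM, Equation (27)."""
    w, mu, sigma = map(np.asarray, (w, mu, sigma))
    mbar = np.sum(w * mu)                               # mixture mean
    var = np.sum(w * ((mu - mbar) ** 2 + sigma ** 2))   # mixture variance, Equation (26)
    return 0.5 * np.log(2 * np.pi * np.e) + 0.5 * np.log(var)

print(gmm_entropy_meub([0.5, 0.5], [0.0, 4.0], [1.0, 2.0]))
```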
This upper bound can be easily generalized to arbitrary dimensionality. We get the following lemma:
Lemma 2.
The entropy of a d-variate GMM $m(x)=\sum_{i=1}^k w_i p(x;\mu_i,\Sigma_i)$ is upper bounded by $\frac{d}{2}\log(2\pi e)+\frac{1}{2}\log\det\Sigma$, where $\Sigma=\sum_{i=1}^k w_i(\mu_i\mu_i^\top+\Sigma_i)-\left(\sum_{i=1}^k w_i\mu_i\right)\left(\sum_{i=1}^k w_i\mu_i\right)^\top$.
In general, exponential families have finite moments of any order [17]: in particular, we have $E[t(X)]=\nabla F(\theta)$ and $V[t(X)]=\nabla^2 F(\theta)$. For the Gaussian distribution, we have the sufficient statistics $t(x)=(x,x^2)$, so that $E[t(X)]=\nabla F(\theta)$ yields the mean and variance from the log-normalizer. It is easy to generalize Lemma 2 to mixtures of exponential family distributions.
Note that this bound (called the Maximum Entropy Upper Bound in [13], MEUB) is tight when the GMM approximates a single Gaussian. It is fast to compute compared to the bound reported in [9] that uses Taylor’s expansion of the log-sum of the mixture density.
A similar argument cannot be applied for a lower bound since a GMM with a given variance may have entropy tending to $-\infty$. For example, assume the 2-component mixture’s mean is zero, and that the variance approximates 1 by taking $m(x)=\frac{1}{2}G(x;-1,\epsilon)+\frac{1}{2}G(x;1,\epsilon)$, where $G$ denotes the Gaussian density. Letting $\epsilon\to 0$, the entropy tends to $-\infty$.
We remark that our log-sum-exp inequality technique yields a log 2 additive approximation range in the case of a Gaussian mixture with two components. It thus generalizes the bounds reported in [7] to GMMs with arbitrary variances that are not necessarily equal.
To see the bound gap, we have
$$-\sum_r\int_{I_r}m(x)\left(\log k+\log\max_i\{w_ip_i(x)\}\right)\mathrm{d}x\leq H(m)\leq-\sum_r\int_{I_r}m(x)\max\left\{\log\max_i\{w_ip_i(x)\},\ \log k+\log\min_i\{w_ip_i(x)\}\right\}\mathrm{d}x.$$
Therefore the gap is at most
$$\Delta=\min\left\{\sum_r\int_{I_r}m(x)\log\frac{\max_i\{w_ip_i(x)\}}{\min_i\{w_ip_i(x)\}}\,\mathrm{d}x,\ \log k\right\}=\min\left\{\sum_s\sum_r\int_{I_r}w_sp_s(x)\log\frac{\max_i\{w_ip_i(x)\}}{\min_i\{w_ip_i(x)\}}\,\mathrm{d}x,\ \log k\right\}.$$
Thus to compute the gap error bound of the differential entropy, we need to integrate terms in the form
$$\int w_ap_a(x)\log\frac{w_bp_b(x)}{w_cp_c(x)}\,\mathrm{d}x.$$
See Appendix B for a closed-form formula when dealing with exponential family components.

4. Bounding the α-Divergence

The α-divergence [15,32,33,34] between $m(x)=\sum_{i=1}^k w_ip_i(x)$ and $m'(x)=\sum_{i=1}^{k'}w_i'p_i'(x)$ is defined as
$$D_\alpha(m:m')=\frac{1}{\alpha(1-\alpha)}\left(1-\int_{\mathcal{X}}m(x)^\alpha m'(x)^{1-\alpha}\,\mathrm{d}x\right),$$
which clearly satisfies $D_\alpha(m:m')=D_{1-\alpha}(m':m)$. The α-divergence is a family of information divergences parameterized by $\alpha\in\mathbb{R}\setminus\{0,1\}$. Letting $\alpha\to 1$, we get the KL divergence (see [35] for a proof):
$$\lim_{\alpha\to 1}D_\alpha(m:m')=\mathrm{KL}(m:m')=\int_{\mathcal{X}}m(x)\log\frac{m(x)}{m'(x)}\,\mathrm{d}x,$$
and $\alpha\to 0$ gives the reverse KL divergence:
$$\lim_{\alpha\to 0}D_\alpha(m:m')=\mathrm{KL}(m':m).$$
Other interesting values [33] include $\alpha=1/2$ (squared Hellinger distance), $\alpha=2$ (Pearson chi-square distance), $\alpha=-1$ (Neyman chi-square distance), etc. Notably, the Hellinger distance is a valid distance metric which satisfies non-negativity, symmetry, and the triangle inequality. In general, $D_\alpha(m:m')$ only satisfies non-negativity, so that $D_\alpha(m:m')\geq 0$ for any $m(x)$ and $m'(x)$; it is neither symmetric nor does it satisfy the triangle inequality. Minimization of α-divergences allows one to choose a trade-off between mode fitting and support fitting of the minimizer [36]. The minimizer of α-divergences, including MLE as a special case, has interesting connections with transcendental number theory [37].
Computing $D_\alpha(m:m')$ for given $m(x)$ and $m'(x)$ reduces to evaluating the Hellinger integral [38,39]:
$$H_\alpha(m:m')=\int_{\mathcal{X}}m(x)^\alpha m'(x)^{1-\alpha}\,\mathrm{d}x,$$
which in general does not have a closed form, as it is known that the α-divergence of mixture models is not analytic [6]. Moreover, $H_\alpha(m:m')$ may diverge, making the α-divergence unbounded. Once $H_\alpha(m:m')$ can be solved, the Rényi and Tsallis divergences [35] and, in general, Sharma–Mittal divergences [40] can be easily computed. Therefore the results presented here directly extend to those divergence families.
Similar to the case of the KL divergence, the Monte Carlo stochastic estimation of $H_\alpha(m:m')$ can be computed either as
$$\hat{H}_\alpha^n(m:m')=\frac{1}{n}\sum_{i=1}^n\left(\frac{m'(x_i)}{m(x_i)}\right)^{1-\alpha},$$
where $x_1,\ldots,x_n\sim m(x)$ are i.i.d. samples, or as
$$\hat{H}_\alpha^n(m:m')=\frac{1}{n}\sum_{i=1}^n\left(\frac{m(x_i)}{m'(x_i)}\right)^{\alpha},$$
where $x_1,\ldots,x_n\sim m'(x)$ are i.i.d. In either case, the estimator is consistent, so that $\lim_{n\to\infty}\hat{H}_\alpha^n(m:m')=H_\alpha(m:m')$. However, MC estimation requires a large sample and does not guarantee deterministic bounds. The techniques described in [41] work in practice for very close distributions, and do not apply between mixture models. We will therefore derive combinatorial bounds for $H_\alpha(m:m')$. The structure of this Section is parallel with Section 2 with the necessary reformulations for a clear presentation.

4.1. Basic Bounds

For a pair of given $m(x)$ and $m'(x)$, we only need to derive bounds of $H_\alpha(m:m')$ in Equation (31) so that $L_\alpha(m:m')\leq H_\alpha(m:m')\leq U_\alpha(m:m')$. Then the α-divergence $D_\alpha(m:m')$ can be bounded by a linear transformation of the range $[L_\alpha(m:m'),U_\alpha(m:m')]$. In the following we always assume without loss of generality $\alpha\geq 1/2$; otherwise we can bound $D_\alpha(m:m')$ by considering equivalently the bounds of $D_{1-\alpha}(m':m)$.
Recall that in each elementary slab $I_r$, we have
$$\max\left\{k\,w_{\epsilon(r)}p_{\epsilon(r)}(x),\ w_{\delta(r)}p_{\delta(r)}(x)\right\}\leq m(x)\leq k\,w_{\delta(r)}p_{\delta(r)}(x).$$
Notice that $k\,w_{\epsilon(r)}p_{\epsilon(r)}(x)$, $w_{\delta(r)}p_{\delta(r)}(x)$, and $k\,w_{\delta(r)}p_{\delta(r)}(x)$ are all single-component distributions up to a scaling coefficient. The general idea is to bound the multi-component mixture $m(x)$ by single-component distributions in each elementary interval, so that the integral in Equation (31) can be computed piecewise.
For the convenience of notation, we rewrite Equation (32) as
$$c_{\nu(r)}p_{\nu(r)}(x)\leq m(x)\leq c_{\delta(r)}p_{\delta(r)}(x),$$
where
$$c_{\nu(r)}p_{\nu(r)}(x):=k\,w_{\epsilon(r)}p_{\epsilon(r)}(x)\ \text{ or }\ w_{\delta(r)}p_{\delta(r)}(x),$$
$$c_{\delta(r)}p_{\delta(r)}(x):=k\,w_{\delta(r)}p_{\delta(r)}(x).$$
If $1/2\leq\alpha<1$, then both $x^\alpha$ and $x^{1-\alpha}$ are monotonically increasing on $\mathbb{R}_+$. Therefore we have
$$A_{\nu(r),\nu'(r)}^\alpha(I_r)\leq\int_{I_r}m(x)^\alpha m'(x)^{1-\alpha}\,\mathrm{d}x\leq A_{\delta(r),\delta'(r)}^\alpha(I_r),$$
where
$$A_{i,j}^\alpha(I)=\int_I\left(c_ip_i(x)\right)^\alpha\left(c_j'p_j'(x)\right)^{1-\alpha}\mathrm{d}x,$$
and $I$ denotes an interval $I=(a,b)\subseteq\mathbb{R}$. The other case $\alpha>1$ is similar by noting that $x^\alpha$ and $x^{1-\alpha}$ are monotonically increasing and decreasing on $\mathbb{R}_+$, respectively. In conclusion, we obtain the following bounds of $H_\alpha(m:m')$:
If $1/2\leq\alpha<1$,
$$L_\alpha(m:m')=\sum_{r=1}^{\ell}A_{\nu(r),\nu'(r)}^\alpha(I_r),\qquad U_\alpha(m:m')=\sum_{r=1}^{\ell}A_{\delta(r),\delta'(r)}^\alpha(I_r);$$
if $\alpha>1$,
$$L_\alpha(m:m')=\sum_{r=1}^{\ell}A_{\nu(r),\delta'(r)}^\alpha(I_r),\qquad U_\alpha(m:m')=\sum_{r=1}^{\ell}A_{\delta(r),\nu'(r)}^\alpha(I_r).$$
The remaining problem is to compute the definite integral $A_{i,j}^\alpha(I)$ in the above equations. Here we assume all mixture components are in the same exponential family, so that $p_i(x)=p(x;\theta_i)=h(x)\exp\left(\theta_i^\top t(x)-F(\theta_i)\right)$, where $h(x)$ is a base measure, $t(x)$ is a vector of sufficient statistics, and the function $F$ is known as the cumulant generating function. Then it is straightforward from Equation (37) that
$$A_{i,j}^\alpha(I)=c_i^\alpha(c_j')^{1-\alpha}\int_I h(x)\exp\left(\left(\alpha\theta_i+(1-\alpha)\theta_j'\right)^\top t(x)-\alpha F(\theta_i)-(1-\alpha)F(\theta_j')\right)\mathrm{d}x.$$
If $1/2\leq\alpha<1$, then $\bar{\theta}=\alpha\theta_i+(1-\alpha)\theta_j'$ belongs to the natural parameter space. Therefore $A_{i,j}^\alpha(I)$ is bounded and can be computed from the CDF of $p(x;\bar{\theta})$ as
$$A_{i,j}^\alpha(I)=c_i^\alpha(c_j')^{1-\alpha}\exp\left(F(\bar{\theta})-\alpha F(\theta_i)-(1-\alpha)F(\theta_j')\right)\int_I p(x;\bar{\theta})\,\mathrm{d}x.$$
The other case $\alpha>1$ is more difficult: if $\bar{\theta}=\alpha\theta_i+(1-\alpha)\theta_j'$ still lies in the natural parameter space, then $A_{i,j}^\alpha(I)$ can be computed by Equation (41); otherwise we try to solve it by a numerical integrator. This is not ideal as the integral may diverge, or our approximation may be too loose to conclude. We point the reader to [42] and Equations (61)–(69) in [35] for related analysis with more details. As computing $A_{i,j}^\alpha(I)$ only requires $O(1)$ time, the overall computational complexity (without considering the envelope computation) is $O(\ell)$.
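As an illustration, here is a sketch of Equation (41) for scaled univariate Gaussian components, using the natural parameterization $\theta=(\mu/\sigma^2,-1/(2\sigma^2))$ with $t(x)=(x,x^2)$, $h(x)=1$ and $F(\theta)=-\theta_1^2/(4\theta_2)+\frac{1}{2}\log(-\pi/\theta_2)$ (these parameterization details are standard but are spelled out here as working assumptions):
```python
import numpy as np
from scipy.stats import norm

def nat(mu, sigma):
    """Natural parameters of N(mu, sigma^2) with t(x) = (x, x^2)."""
    return np.array([mu / sigma ** 2, -0.5 / sigma ** 2])

def F(theta):
    """Log-normalizer of the univariate Gaussian exponential family."""
    return -theta[0] ** 2 / (4 * theta[1]) + 0.5 * np.log(-np.pi / theta[1])

def from_nat(theta):
    sigma2 = -0.5 / theta[1]
    return theta[0] * sigma2, np.sqrt(sigma2)    # (mu, sigma)

def A_gauss(a, b, alpha, c_i, mu_i, s_i, c_j, mu_j, s_j):
    """A^alpha_{i,j}((a, b)) of Equation (41) for scaled Gaussian components."""
    th_bar = alpha * nat(mu_i, s_i) + (1 - alpha) * nat(mu_j, s_j)
    if th_bar[1] >= 0:     # interpolated parameter left the natural space (alpha > 1)
        return None        # fall back to numerical integration (not shown)
    mu_bar, s_bar = from_nat(th_bar)
    mass = norm.cdf(b, mu_bar, s_bar) - norm.cdf(a, mu_bar, s_bar)
    return (c_i ** alpha * c_j ** (1 - alpha)
            * np.exp(F(th_bar) - alpha * F(nat(mu_i, s_i)) - (1 - alpha) * F(nat(mu_j, s_j)))
            * mass)

print(A_gauss(-1.0, 2.0, 0.7, 0.4, 0.0, 1.0, 0.6, 1.0, 0.5))
```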

4.2. Adaptive Bounds

This section derives the shape-dependent bounds which improve the basic bounds in Section 4.1. We can rewrite a mixture model $m(x)$ in a slab $I_r$ as
$$m(x)=w_{\zeta(r)}p_{\zeta(r)}(x)\left(1+\sum_{i\neq\zeta(r)}\frac{w_ip_i(x)}{w_{\zeta(r)}p_{\zeta(r)}(x)}\right),$$
where $w_{\zeta(r)}p_{\zeta(r)}(x)$ is a weighted component of $m(x)$ serving as a reference. We only discuss the case where the reference is chosen as the dominating component, i.e., $\zeta(r)=\delta(r)$; however, it is worth noting that the proposed bounds do not depend on this particular choice. Therefore the ratio
$$\frac{w_ip_i(x)}{w_{\zeta(r)}p_{\zeta(r)}(x)}=\frac{w_i}{w_{\zeta(r)}}\exp\left(\left(\theta_i-\theta_{\zeta(r)}\right)^\top t(x)-F(\theta_i)+F(\theta_{\zeta(r)})\right)$$
can be bounded within a sub-range of $[0,1]$ by analyzing the extreme values of $t(x)$ in the slab $I_r$. This can be done because $t(x)$ usually consists of polynomial functions with finitely many critical points which can be solved for easily. Correspondingly, the function $1+\sum_{i\neq\zeta(r)}\frac{w_ip_i(x)}{w_{\zeta(r)}p_{\zeta(r)}(x)}$ in $I_r$ can be bounded within a sub-range of $[1,k]$, denoted $[\omega_{\zeta(r)}(I_r),\Omega_{\zeta(r)}(I_r)]$. Hence
$$\omega_{\zeta(r)}(I_r)\,w_{\zeta(r)}p_{\zeta(r)}(x)\leq m(x)\leq\Omega_{\zeta(r)}(I_r)\,w_{\zeta(r)}p_{\zeta(r)}(x).$$
This forms better bounds of $m(x)$ than Equation (32) because each component in the slab $I_r$ is analyzed more accurately. Therefore, we refine the fundamental bounds of $m(x)$ by replacing Equations (34) and (35) with
$$c_{\nu(r)}p_{\nu(r)}(x):=\omega_{\zeta(r)}(I_r)\,w_{\zeta(r)}p_{\zeta(r)}(x),$$
$$c_{\delta(r)}p_{\delta(r)}(x):=\Omega_{\zeta(r)}(I_r)\,w_{\zeta(r)}p_{\zeta(r)}(x).$$
Then, the improved bounds of $H_\alpha$ are given by Equations (38) and (39) according to the above replaced definitions of $c_{\nu(r)}p_{\nu(r)}(x)$ and $c_{\delta(r)}p_{\delta(r)}(x)$.
Evaluating $\omega_{\zeta(r)}(I_r)$ and $\Omega_{\zeta(r)}(I_r)$ requires iterating through all components in each slab. Therefore the computational complexity is increased to $O(\ell(k+k'))$.

4.3. Variance-Reduced Bounds

This section further improves the proposed bounds based on variance reduction [43]. By assumption $\alpha\geq 1/2$, so $m(x)^\alpha m'(x)^{1-\alpha}$ is more similar to $m(x)$ than to $m'(x)$. The ratio $m(x)^\alpha m'(x)^{1-\alpha}/m(x)$ is likely to have a small variance when $x$ varies inside a slab $I_r$, especially when α is close to 1. We will therefore bound this ratio term in
$$\int_{I_r}m(x)^\alpha m'(x)^{1-\alpha}\,\mathrm{d}x=\int_{I_r}m(x)\,\frac{m(x)^\alpha m'(x)^{1-\alpha}}{m(x)}\,\mathrm{d}x=\sum_{i=1}^k\int_{I_r}w_ip_i(x)\left(\frac{m'(x)}{m(x)}\right)^{1-\alpha}\mathrm{d}x.$$
Whether $\alpha<1$ or $\alpha>1$, the function $x^{1-\alpha}$ is monotonic on $\mathbb{R}_+$. In each slab $I_r$, $\left(m'(x)/m(x)\right)^{1-\alpha}$ ranges between the two functions
$$\left(\frac{c_{\nu'(r)}'p_{\nu'(r)}'(x)}{c_{\delta(r)}p_{\delta(r)}(x)}\right)^{1-\alpha}\quad\text{and}\quad\left(\frac{c_{\delta'(r)}'p_{\delta'(r)}'(x)}{c_{\nu(r)}p_{\nu(r)}(x)}\right)^{1-\alpha},$$
where $c_{\nu(r)}p_{\nu(r)}(x)$, $c_{\delta(r)}p_{\delta(r)}(x)$, $c_{\nu'(r)}'p_{\nu'(r)}'(x)$ and $c_{\delta'(r)}'p_{\delta'(r)}'(x)$ are defined in Equations (45) and (46). Similar to the definition of $A_{i,j}^\alpha(I)$ in Equation (37), we define
$$B_{i,j,l}^\alpha(I)=\int_I w_ip_i(x)\left(\frac{c_l'p_l'(x)}{c_jp_j(x)}\right)^{1-\alpha}\mathrm{d}x.$$
Therefore we have
$$L_\alpha(m:m')=\min S,\qquad U_\alpha(m:m')=\max S,\qquad S=\left\{\sum_{r=1}^{\ell}\sum_{i=1}^k B_{i,\delta(r),\nu'(r)}^\alpha(I_r),\ \sum_{r=1}^{\ell}\sum_{i=1}^k B_{i,\nu(r),\delta'(r)}^\alpha(I_r)\right\}.$$
The remaining problem is to evaluate $B_{i,j,l}^\alpha(I)$ in Equation (49). Similar to Section 4.1, assuming the components are in the same exponential family with respect to the natural parameters $\theta$, we get
$$B_{i,j,l}^\alpha(I)=w_i\frac{(c_l')^{1-\alpha}}{c_j^{1-\alpha}}\exp\left(F(\bar{\theta})-F(\theta_i)-(1-\alpha)F(\theta_l')+(1-\alpha)F(\theta_j)\right)\int_I p(x;\bar{\theta})\,\mathrm{d}x.$$
If $\bar{\theta}=\theta_i+(1-\alpha)\theta_l'-(1-\alpha)\theta_j$ is in the natural parameter space, $B_{i,j,l}^\alpha(I)$ can be computed from the CDF of $p(x;\bar{\theta})$; otherwise $B_{i,j,l}^\alpha(I)$ can be numerically integrated by its definition in Equation (49). The computational complexity is the same as for the bounds in Section 4.2, i.e., $O(\ell(k+k'))$.
We have introduced three pairs of deterministic lower and upper bounds that enclose the true value of the α-divergence between univariate mixture models. Thus the gap between the upper and lower bounds provides the additive approximation factor of the bounds. We conclude by emphasizing that the presented methodology can be easily generalized to other divergences [35,40] relying on Hellinger-type integrals $H_{\alpha,\beta}(p:q)=\int p(x)^\alpha q(x)^\beta\,\mathrm{d}x$, like the γ-divergence [44], as well as entropy measures [45].

5. Lower Bounds of the f -Divergence

The f-divergence between two distributions $m(x)$ and $m'(x)$ (not necessarily mixtures) is defined for a convex generator $f$ by:
$$D_f(m:m')=\int m(x)\,f\!\left(\frac{m'(x)}{m(x)}\right)\mathrm{d}x.$$
If $f(x)=-\log x$, then $D_f(m:m')=\mathrm{KL}(m:m')$.
Let us partition the support $\mathcal{X}=\uplus_{r=1}^{\ell}I_r$ arbitrarily into elementary ranges, which do not necessarily correspond to the envelopes. Denote by $M_I$ the probability mass of a mixture $m(x)$ in the range $I$: $M_I=\int_I m(x)\,\mathrm{d}x$ (and $M_I'$ for $m'(x)$). Then
$$D_f(m:m')=\sum_{r=1}^{\ell}M_{I_r}\int_{I_r}\frac{m(x)}{M_{I_r}}\,f\!\left(\frac{m'(x)}{m(x)}\right)\mathrm{d}x.$$
Note that in range $I_r$, $\frac{m(x)}{M_{I_r}}$ is a unit weight distribution. Thus by Jensen’s inequality $f(E[X])\leq E[f(X)]$, we get
$$D_f(m:m')\geq\sum_{r=1}^{\ell}M_{I_r}\,f\!\left(\int_{I_r}\frac{m(x)}{M_{I_r}}\frac{m'(x)}{m(x)}\,\mathrm{d}x\right)=\sum_{r=1}^{\ell}M_{I_r}\,f\!\left(\frac{M_{I_r}'}{M_{I_r}}\right).$$
Notice that the RHS of Equation (52) is the f-divergence between $(M_{I_1},\ldots,M_{I_\ell})$ and $(M_{I_1}',\ldots,M_{I_\ell}')$, denoted by $D_f^{\mathcal{I}}(m:m')$. In the special case that $\ell=1$ and $I_1=\mathcal{X}$, Equation (52) reduces to the usual Gibbs’ inequality $D_f(m:m')\geq f(1)$, and the Csiszár generator is chosen so that $f(1)=0$. In conclusion, for a fixed (coarse-grained) countable partition of $\mathcal{X}$, we recover the well-known information monotonicity [46] of the f-divergences:
$$D_f(m:m')\geq D_f^{\mathcal{I}}(m:m')\geq 0.$$
In practice, we get closed-form lower bounds when $M_I=\int_a^b m(x)\,\mathrm{d}x=\Phi(b)-\Phi(a)$ is available in closed form, where $\Phi(\cdot)$ denotes the CDF. In particular, if $m(x)$ is a mixture model, then its CDF can be computed by linearly combining the CDFs of its components.
To wrap up, we have proved that coarse-graining by making a finite partition of the support X yields a lower bound on the f-divergence by virtue of the information monotonicity. Therefore, instead of doing Monte Carlo stochastic integration:
$$\hat{D}_f^n(m:m')=\frac{1}{n}\sum_{i=1}^n f\!\left(\frac{m'(x_i)}{m(x_i)}\right),$$
with $x_1,\ldots,x_n$ i.i.d. sampled from $m(x)$, it could be better to sort those $n$ samples and consider the coarse-grained partition:
$$\mathcal{I}=(-\infty,x_{(1)}]\ \cup\ \bigcup_{i=1}^{n-1}\left(x_{(i)},x_{(i+1)}\right]\ \cup\ (x_{(n)},+\infty)$$
to get a guaranteed lower bound on the f-divergence. We will call this bound CGQLB for Coarse Graining Quantization Lower Bound.
Given a budget of n splitting points on the range X , it would be interesting to find the best n points that maximize the lower bound D f I ( m : m ) . This is ongoing research.
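A minimal implementation of CGQLB for the KL generator $f(u)=-\log u$, using only the mixture CDFs (a sketch for univariate GMMs; the mixtures below are illustrative):
```python
import numpy as np
from scipy.stats import norm

def gmm_cdf(x, w, mu, sigma):
    """CDF of a univariate GMM: a convex combination of its component CDFs."""
    return np.sum(w * norm.cdf(np.atleast_1d(x)[:, None], mu, sigma), axis=1)

def cgqlb_kl(samples, wm, mum, sm, wmp, mump, smp):
    """Coarse Graining Quantization Lower Bound on KL(m : m') induced by the
    partition at the sorted sample points (Equation (52) with f(u) = -log u)."""
    cuts = np.sort(np.asarray(samples))
    Fm = np.concatenate(([0.0], gmm_cdf(cuts, wm, mum, sm), [1.0]))
    Fmp = np.concatenate(([0.0], gmm_cdf(cuts, wmp, mump, smp), [1.0]))
    Pm, Pmp = np.diff(Fm), np.diff(Fmp)        # cell masses M_I and M'_I
    keep = (Pm > 0) & (Pmp > 0)                # guard against numerically empty cells
    return np.sum(Pm[keep] * np.log(Pm[keep] / Pmp[keep]))   # discrete KL lower bound

rng = np.random.default_rng(0)
w1, mu1, s1 = np.array([0.5, 0.5]), np.array([0.0, 4.0]), np.array([1.0, 0.5])
w2, mu2, s2 = np.array([0.6, 0.4]), np.array([1.0, 5.0]), np.array([1.0, 1.0])
idx = rng.choice(2, size=100, p=w1)
samples = rng.normal(mu1[idx], s1[idx])        # 100 splitting points drawn from m
print(cgqlb_kl(samples, w1, mu1, s1, w2, mu2, s2))
```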

6. Experiments

We perform an empirical study to verify our theoretical bounds. We simulate four pairs of mixture models { ( EMM 1 , EMM 2 ) , ( RMM 1 , RMM 2 ) , ( GMM 1 , GMM 2 ) , ( GaMM 1 , GaMM 2 ) } as the test subjects. The component type is implied by the model name, where GaMM stands for Gamma mixtures. The components of each mixture model are given as follows.
  • $\mathrm{EMM}_1$’s components, in the form $(\lambda_i,w_i)$, are given by $(0.1,1/3)$, $(0.5,1/3)$, $(1,1/3)$; $\mathrm{EMM}_2$’s components are $(2,0.2)$, $(10,0.4)$, $(20,0.4)$.
  • $\mathrm{RMM}_1$’s components, in the form $(\sigma_i,w_i)$, are given by $(0.5,1/3)$, $(2,1/3)$, $(10,1/3)$; $\mathrm{RMM}_2$ consists of $(5,0.25)$, $(60,0.25)$, $(100,0.5)$.
  • $\mathrm{GMM}_1$’s components, in the form $(\mu_i,\sigma_i,w_i)$, are $(-5,1,0.05)$, $(-2,0.5,0.1)$, $(5,0.3,0.2)$, $(10,0.5,0.2)$, $(15,0.4,0.05)$, $(25,0.5,0.3)$, $(30,2,0.1)$; $\mathrm{GMM}_2$ consists of $(-16,0.5,0.1)$, $(-12,0.2,0.1)$, $(-8,0.5,0.1)$, $(-4,0.2,0.1)$, $(0,0.5,0.2)$, $(4,0.2,0.1)$, $(8,0.5,0.1)$, $(12,0.2,0.1)$, $(16,0.5,0.1)$.
  • $\mathrm{GaMM}_1$’s components, in the form $(k_i,\lambda_i,w_i)$, are $(2,0.5,1/3)$, $(2,2,1/3)$, $(2,4,1/3)$; $\mathrm{GaMM}_2$ consists of $(2,5,1/3)$, $(2,8,1/3)$, $(2,10,1/3)$.
We compare the proposed bounds with Monte Carlo estimation with different sample sizes in the range $\{10^2,10^3,10^4,10^5\}$. For each sample size configuration, we report the 0.95 confidence interval by Monte Carlo estimation using the corresponding number of samples. Figure 2a–d shows the input signals as well as the estimation results, where the proposed bounds CELB, CEUB, CEALB, CEAUB, CGQLB are presented as horizontal lines, and the Monte Carlo estimations over different sample sizes are presented as error bars. We can loosely consider the average Monte Carlo output with the largest sample size ($10^5$) as the underlying truth, which is clearly inside our bounds. This serves as an empirical justification of the correctness of the bounds.
A key observation is that the bounds can be very tight, especially when the underlying KL divergence has a large magnitude, e.g., $\mathrm{KL}(\mathrm{RMM}_2:\mathrm{RMM}_1)$. This is because the gap between the lower and upper bounds is always guaranteed to be within $\log k+\log k'$. Because the KL divergence is unbounded [4], two mixture models may in general have a large KL divergence, and our approximation gap is then relatively very small. On the other hand, we also observed that the bounds in certain cases, e.g., $\mathrm{KL}(\mathrm{EMM}_2:\mathrm{EMM}_1)$, are not as tight as in the other cases. When the underlying KL divergence is small, the bounds are not as informative as in the general case.
Comparatively, there is a significant improvement of the shape-dependent bounds (CEALB and CEAUB) over the combinatorial bounds (CELB and CEUB). In all investigated cases, the adaptive bounds can roughly shrink the gap by half of its original size at the cost of additional computation.
Note that the bounds are guaranteed and must contain the true value, while Monte Carlo estimation gives no guarantee of where the true value is. For example, in estimating $\mathrm{KL}(\mathrm{GMM}_1:\mathrm{GMM}_2)$, Monte Carlo estimation based on $10^4$ samples can go beyond our bounds! It therefore suffers from a larger estimation error.
CGQLB, as a simple-to-implement technique, shows surprisingly good performance in several cases, e.g., $\mathrm{KL}(\mathrm{RMM}_1:\mathrm{RMM}_2)$. Although it requires a large number of samples, we can observe that increasing the sample size has a limited effect on improving this bound. Therefore, in practice, one may intersect the range defined by CEALB and CEAUB with the range defined by CGQLB with a small sample size (e.g., 100) to get better bounds.
We simulate a set of Gaussian mixture models besides the above $\mathrm{GMM}_1$ and $\mathrm{GMM}_2$. Figure 3 shows the GMM densities as well as their differential entropy. A detailed description of the components of each GMM model is omitted for brevity.
The key observation is that CEUB (CEAUB) is very tight in most of the investigated cases. This is because the upper envelope used to compute CEUB (CEAUB) gives a very good estimate of the input signal.
Notice that MEUB only gives an upper bound of the differential entropy as discussed in Section 3. In general the proposed bounds are tighter than MEUB. However, this is not the case when the mixture components are merged together and approximate one single Gaussian (and therefore its entropy can be well approximated by the Gaussian entropy), as shown in the last line of Figure 3.
For the α-divergence, the bounds introduced in Section 4.1, Section 4.2 and Section 4.3 are denoted as “Basic”, “Adaptive” and “VR”, respectively. Figure 4 visualizes the investigated GMMs and plots the estimates of their α-divergences against α. The red lines show the upper envelopes, and the dashed vertical lines mark the elementary intervals. The components of $\mathrm{GMM}_1$ and $\mathrm{GMM}_2$ are more separated than those of $\mathrm{GMM}_3$ and $\mathrm{GMM}_4$; therefore these two pairs present different cases. For a clear presentation, only VR (which is expected to be better than Basic and Adaptive) is shown. We can see that, visually at a coarse scale, VR tightly surrounds the true value.
For a more quantitative comparison, Table 1 shows the α-divergence estimated by MC, Basic, Adaptive, and VR. As $D_\alpha$ is only defined for $\alpha\in\mathbb{R}\setminus\{0,1\}$, the KL bounds CE(A)LB and CE(A)UB are presented for $\alpha=0$ or $1$. Overall, we have the following order of gap size: Basic > Adaptive > VR, and VR is recommended in general for bounding α-divergences. In certain cases the VR upper bound is looser than the Adaptive one. In practice one can compute the intersection of these bounds, as well as the trivial bound $D_\alpha(m:m')\geq 0$, to get the best estimate.
Note the similarity between the KL divergence in Equation (30) and the expression in Equation (47). We state without a formal analysis that CEAL(U)B is equivalent to VR in the limit $\alpha\to 0$ or $\alpha\to 1$. Experimentally, as we let $\alpha\to 1$, we can see that VR is consistent with CEAL(U)B.

7. Concluding Remarks and Perspectives

We have presented a fast, versatile method to compute bounds on the Kullback–Leibler divergence between mixtures by building algorithmic formulae. We reported on our experiments for various mixture models in the exponential family. For univariate GMMs, we get a guaranteed bound of the KL divergence of two mixtures $m$ and $m'$ with $k$ and $k'$ components within an additive approximation factor of $\log k+\log k'$ in $O((k+k')\log(k+k'))$ time. Therefore, the larger the KL divergence, the better the bound when considering a multiplicative $(1+\alpha)$-approximation factor, since $\alpha=\frac{\log k+\log k'}{\mathrm{KL}(m:m')}$. The adaptive bounds are guaranteed to yield better bounds at the expense of computing potentially $O(k^2+(k')^2)$ intersection points of pairwise weighted components.
Our technique also yields bounds for the Jeffreys divergence (the symmetrized KL divergence, $J(m,m')=\mathrm{KL}(m:m')+\mathrm{KL}(m':m)$) and the Jensen–Shannon divergence [47] (JS):
$$\mathrm{JS}(m,m')=\frac{1}{2}\left(\mathrm{KL}\left(m:\frac{m+m'}{2}\right)+\mathrm{KL}\left(m':\frac{m+m'}{2}\right)\right),$$
since $\frac{m+m'}{2}$ is a mixture model with $k+k'$ components. One advantage of this statistical distance is that it is symmetric, always bounded by $\log 2$, and its square root yields a metric distance [48]. The log-sum-exp inequalities may also be used to compute some Rényi divergences [35]:
$$R_\alpha(m,p)=\frac{1}{\alpha-1}\log\int m(x)^\alpha p(x)^{1-\alpha}\,\mathrm{d}x,$$
when α is an integer, m ( x ) a mixture and p ( x ) a single (component) distribution. Getting fast guaranteed tight bounds on statistical distances between mixtures opens many avenues. For example, we may consider building hierarchical mixture models by merging iteratively two mixture components so that those pairs of components are chosen so that the KL distance between the full mixture and the simplified mixture is minimized.
Unfortunately, our technique is limited in practice to univariate mixtures: indeed, in higher dimensions, we can still compute the maximization diagram of weighted components (an additively weighted Bregman–Voronoi diagram [49,50] for components belonging to the same exponential family). However, it becomes more complex to compute, in the elementary Voronoi cells $V$, the functions $C_{i,j}(V)$ and $M_i(V)$ (in 1D, the Voronoi cells are segments). We may obtain hybrid algorithms by approximating or estimating these functions. In 2D, it is thus possible to obtain lower and upper bounds on the Mutual Information [51] (MI) when the joint distribution $m(x,y)$ is a 2D mixture of Gaussians:
$$I(X;Y)=\iint m(x,y)\log\frac{m(x,y)}{m(x)m(y)}\,\mathrm{d}x\,\mathrm{d}y.$$
Indeed, the marginal distributions m ( x ) and m ( y ) are univariate Gaussian mixtures.
A Python code implementing those computational-geometric methods for reproducible research is available online [52].

Acknowledgments

The authors gratefully thank the referees for their comments. This work was carried out while Ke Sun was visiting Frank Nielsen at Ecole Polytechnique, Palaiseau, France.

Author Contributions

Frank Nielsen and Ke Sun contributed to the theoretical results as well as to the writing of the article. Ke Sun implemented the methods and performed the numerical experiments. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. The Kullback–Leibler Divergence of Mixture Models Is Not Analytic [6]

Ideally, we aim at getting a finite length closed-form formula to compute the KL divergence of finite mixture models. However, this is provably mathematically intractable [6] because of the log-sum term in the integral, as we shall prove below.
Analytic expressions encompass closed-form formulas and may include special functions (e.g., the Gamma function) but do not allow the use of limits or integrals. An analytic function $f(x)$ is a $C^\infty$ function (infinitely differentiable) such that around any point $x_0$ the $k$-order Taylor series $T_k(x)=\sum_{i=0}^k\frac{f^{(i)}(x_0)}{i!}(x-x_0)^i$ converges to $f(x)$: $\lim_{k\to\infty}T_k(x)=f(x)$ for $x$ belonging to a neighborhood $N_r(x_0)=\{x:|x-x_0|\leq r\}$ of $x_0$, where $r$ is called the radius of convergence. The analytic property of a function is equivalent to the condition that for each $k\in\mathbb{N}$, there exists a constant $c$ such that $\left|\frac{\mathrm{d}^kf}{\mathrm{d}x^k}(x)\right|\leq c^{k+1}k!$.
To prove that the KL divergence of mixtures is not analytic (hence does not admit a closed-form formula), we shall adapt the proof reported in [6] (in Japanese; we thank Professor Aoyagi for sending us his paper [6]). We shall prove that $\mathrm{KL}(p:q)$ is not analytic for $p(x)=G(x;0,1)$ and $q(x;w)=(1-w)G(x;0,1)+wG(x;1,1)$, where $w\in(0,1)$, and $G(x;\mu,\sigma)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ is the density of a univariate Gaussian of mean $\mu$ and standard deviation $\sigma$. Let $D(w)=\mathrm{KL}(p(x):q(x;w))$ denote the KL divergence between these two mixtures ($p$ has a single component and $q$ has two components).
We have
$$\log\frac{p(x)}{q(x;w)}=\log\frac{\exp\left(-\frac{x^2}{2}\right)}{(1-w)\exp\left(-\frac{x^2}{2}\right)+w\exp\left(-\frac{(x-1)^2}{2}\right)}=-\log\left(1+w\left(e^{x-\frac{1}{2}}-1\right)\right).$$
Therefore
$$\frac{\mathrm{d}^kD}{\mathrm{d}w^k}=\frac{(-1)^k}{k}\int p(x)\left(e^{x-\frac{1}{2}}-1\right)^k\mathrm{d}x.$$
Let $x_0$ be the root of the equation $e^{x-\frac{1}{2}}-1=e^{\frac{x}{2}}$, so that for $x\geq x_0$ we have $e^{x-\frac{1}{2}}-1\geq e^{\frac{x}{2}}$. It follows that:
$$\left|\frac{\mathrm{d}^kD}{\mathrm{d}w^k}\right|\geq\frac{1}{k}\int_{x_0}^{\infty}p(x)e^{\frac{kx}{2}}\,\mathrm{d}x=\frac{1}{k}e^{\frac{k^2}{8}}A_k$$
with $A_k=\int_{x_0}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x-\frac{k}{2})^2}{2}\right)\mathrm{d}x$. When $k\to\infty$, we have $A_k\to 1$. Consider $k_0\in\mathbb{N}$ such that $A_{k_0}>0.9$. Then the radius of convergence $r$ is such that:
$$\frac{1}{r}\geq\lim_{k\to\infty}\left(\frac{1}{k\,k!}\,0.9\exp\left(\frac{k^2}{8}\right)\right)^{\frac{1}{k}}=\infty.$$
Thus the convergence radius is $r=0$, and therefore the KL divergence is not an analytic function of the parameter $w$. The KL divergence of mixtures is an example of a non-analytic smooth function. (Notice that the absolute value function is not analytic at 0.)

Appendix B. Closed-Form Formula for the Kullback–Leibler Divergence between Scaled and Truncated Exponential Families

When computing approximation bounds for the KL divergence between two mixtures $m(x)$ and $m'(x)$, we end up with the task of computing $\int_D w_ap_a(x)\log\frac{w_bp_b(x)}{w_cp_c(x)}\,\mathrm{d}x$, where $D\subseteq\mathcal{X}$ is a subset of the full support $\mathcal{X}$. We report a generic formula for computing these integrals when the mixture (scaled and truncated) components belong to the same exponential family [17]. An exponential family has canonical log-density written as $l(x;\theta)=\log p(x;\theta)=\theta^\top t(x)-F(\theta)+k(x)$, where $t(x)$ denotes the sufficient statistics, $F(\theta)$ the log-normalizer (also called cumulant function or partition function), and $k(x)$ an auxiliary carrier term.
Let $\mathrm{KL}(w_1p_1:w_2p_2:w_3p_3)=\int_{\mathcal{X}}w_1p_1(x)\log\frac{w_2p_2(x)}{w_3p_3(x)}\,\mathrm{d}x=H_\times(w_1p_1:w_3p_3)-H_\times(w_1p_1:w_2p_2)$. Since it is a difference of two cross-entropies, we get for three distributions belonging to the same exponential family [26] the following formula:
$$\mathrm{KL}(w_1p_1:w_2p_2:w_3p_3)=w_1\log\frac{w_2}{w_3}+w_1\left(F(\theta_3)-F(\theta_2)-(\theta_3-\theta_2)^\top\nabla F(\theta_1)\right).$$
Furthermore, when the support is restricted, say to a support range $D\subseteq\mathcal{X}$, let $m_D(\theta)=\int_D p(x;\theta)\,\mathrm{d}x$ denote the mass and $\tilde{p}(x;\theta)=\frac{p(x;\theta)}{m_D(\theta)}$ the normalized distribution. Then we have:
$$\int_D w_1p_1(x)\log\frac{w_2p_2(x)}{w_3p_3(x)}\,\mathrm{d}x=m_D(\theta_1)\left(\mathrm{KL}(w_1\tilde{p}_1:w_2\tilde{p}_2:w_3\tilde{p}_3)-w_1\log\frac{m_D(\theta_3)}{m_D(\theta_2)}\right).$$
When $F_D(\theta)=F(\theta)+\log m_D(\theta)$ is strictly convex and differentiable, then $\tilde{p}(x;\theta)$ is an exponential family and the closed-form formula follows straightforwardly. Otherwise, we still get a closed form but need more derivations. For univariate distributions, we write $D=(a,b)$ and $m_D(\theta)=\int_a^b p(x;\theta)\,\mathrm{d}x=P_\theta(b)-P_\theta(a)$, where $P_\theta(a)=\int_{-\infty}^a p(x;\theta)\,\mathrm{d}x$ denotes the cumulative distribution function.
The usual formula for the truncated and scaled Kullback–Leibler divergence is:
$$\mathrm{KL}_D(w\,p(x;\theta) : w'\,p(x;\theta')) = w\,m_D(\theta)\left(\log\frac{w}{w'} + B_F(\theta':\theta)\right) + w\,(\theta-\theta')^\top\nabla m_D(\theta),$$
where $B_F(\theta':\theta)$ is a Bregman divergence [5]:
$$B_F(\theta':\theta) = F(\theta') - F(\theta) - (\theta'-\theta)^\top\nabla F(\theta).$$
This formula extends the classic formula [5] for full regular exponential families (recovered by setting $w = w' = 1$, $m_D(\theta) = 1$ and $\nabla m_D(\theta) = 0$).
Similar formulæ are available for the cross-entropy and entropy of exponential families [26].
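To make the formula concrete, here is a minimal numerical check, assuming unit-variance Gaussians $\mathcal{N}(\theta,1)$ as the exponential family (so $t(x)=x$, $F(\theta)=\theta^2/2$, hence $B_F(\theta':\theta)=(\theta'-\theta)^2/2$, $m_D(\theta)=\Phi(b-\theta)-\Phi(a-\theta)$ and $\nabla m_D(\theta)=\phi(a-\theta)-\phi(b-\theta)$ on $D=(a,b)$). This is our own illustrative sketch with hypothetical function names, not code from the accompanying software [52]:

```python
# Minimal sketch: truncated, scaled KL between unit-variance Gaussians N(theta, 1)
# on D = (a, b), compared against direct numerical integration.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def kl_D_closed_form(w, theta, wp, thetap, a, b):
    m_D = norm.cdf(b - theta) - norm.cdf(a - theta)       # mass of p(.; theta) on D
    grad_m_D = norm.pdf(a - theta) - norm.pdf(b - theta)  # derivative of the mass w.r.t. theta
    bregman = 0.5 * (thetap - theta) ** 2                 # B_F(theta' : theta)
    return w * m_D * (np.log(w / wp) + bregman) + w * (theta - thetap) * grad_m_D

def kl_D_numerical(w, theta, wp, thetap, a, b):
    integrand = lambda x: w * norm.pdf(x - theta) * np.log(
        (w * norm.pdf(x - theta)) / (wp * norm.pdf(x - thetap)))
    return quad(integrand, a, b)[0]

w, theta, wp, thetap, a, b = 0.3, 0.0, 0.7, 1.5, -1.0, 2.0
print(kl_D_closed_form(w, theta, wp, thetap, a, b))   # both outputs should agree
print(kl_D_numerical(w, theta, wp, thetap, a, b))     # up to quadrature error
```

Note that the truncated, scaled quantity may be negative; it is a building block of the bounds, not a divergence in itself.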

Appendix C. On the Approximation of KL between Smooth Mixtures by a Bregman Divergence [5]

Clearly, since Bregman divergences are always finite while KL divergences may diverge, extra conditions are needed to assert that the KL between mixtures can be approximated by a Bregman divergence; we require the Jeffreys divergence between the mixtures to be finite. We loosely derive this observation (careful derivations will be reported elsewhere) using two different approaches:
  • First, continuous mixture distributions have smooth densities that can be arbitrarily closely approximated using a single (potentially multi-modal) distribution belonging to the Polynomial Exponential Families [53,54] (PEFs). A polynomial exponential family of order $D$ has log-likelihood $l(x;\theta) \propto \sum_{i=1}^{D}\theta_i x^i$: therefore, a PEF is an exponential family with polynomial sufficient statistics $t(x) = (x, x^2, \ldots, x^D)$. However, the log-normalizer $F_D(\theta) = \log\int\exp(\theta^\top t(x))\,\mathrm{d}x$ of a $D$-order PEF is not available in closed form: it is computationally intractable. Nevertheless, the KL between two mixtures $m(x)$ and $m'(x)$ can in theory be closely approximated by a Bregman divergence between the two corresponding PEFs: $\mathrm{KL}(m(x):m'(x)) \simeq \mathrm{KL}(p(x;\theta):p(x;\theta')) = B_{F_D}(\theta':\theta)$, where $\theta$ and $\theta'$ are the natural parameters of the PEF family $\{p(x;\theta)\}$ approximating $m(x)$ and $m'(x)$, respectively (i.e., $m(x) \simeq p(x;\theta)$ and $m'(x) \simeq p(x;\theta')$). Notice that the Bregman divergence of PEFs is necessarily finite, while the KL of two smooth mixtures can potentially diverge (infinite value); hence the condition that the Jeffreys divergence be finite.
  • Second, consider two finite mixtures $m(x) = \sum_{i=1}^{k} w_i p_i(x)$ and $m'(x) = \sum_{j=1}^{k'} w'_j p'_j(x)$ of $k$ and $k'$ components (possibly with heterogeneous components $p_i(x)$ and $p'_j(x)$), respectively. In information geometry, a mixture family is the set of convex combinations of fixed component densities (whereas in statistics, a mixture usually denotes a convex combination of parametric components). Let us consider the mixture family $\{g(x;(w,w'))\}$ generated by the $D = k + k'$ fixed components $p_1(x),\ldots,p_k(x), p'_1(x),\ldots,p'_{k'}(x)$:
    $$g(x;(w,w')) = \sum_{i=1}^{k} w_i p_i(x) + \sum_{j=1}^{k'} w'_j p'_j(x), \qquad \sum_{i=1}^{k} w_i + \sum_{j=1}^{k'} w'_j = 1.$$
    We can approximate $m(x)$ arbitrarily finely (with respect to total variation) for any $\epsilon > 0$ by $g(x;\alpha) = (1-\epsilon)\,m(x) + \epsilon\,m'(x)$ with $\alpha = ((1-\epsilon)w, \epsilon w')$ (so that $\sum_{i=1}^{k+k'}\alpha_i = 1$), and $m'(x)$ by $g(x;\alpha') = \epsilon\,m(x) + (1-\epsilon)\,m'(x)$ with $\alpha' = (\epsilon w, (1-\epsilon)w')$ (and $\sum_{i=1}^{k+k'}\alpha'_i = 1$). Therefore $\mathrm{KL}(m(x):m'(x)) \simeq \mathrm{KL}(g(x;\alpha):g(x;\alpha')) = B_{F^*}(\alpha:\alpha')$, where $F^*(\alpha) = \int g(x;\alpha)\log g(x;\alpha)\,\mathrm{d}x$ is the Shannon information (negative Shannon entropy) of the composite mixture family. Again, the Bregman divergence $B_{F^*}(\alpha:\alpha')$ is necessarily finite, but $\mathrm{KL}(m(x):m'(x))$ may be infinite when the KL integral diverges (hence the condition of Jeffreys-divergence finiteness). Interestingly, this Shannon information can be arbitrarily closely approximated when considering isotropic Gaussians [13]. Notice that the convex conjugate $F(\theta)$ of the continuous Shannon neg-entropy $F^*(\eta)$ is the log-sum-exp function on the inverse soft map. A small Monte Carlo illustration of this composite-mixture approximation is sketched after this list.
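To illustrate the second approach numerically, the following Monte Carlo sketch (our own hypothetical example; the two GMMs and helper functions are illustrative and not taken from the paper's experiments) builds the composite mixtures $g(\cdot;\alpha)$ and $g(\cdot;\alpha')$ for a small $\epsilon$ and checks that $\mathrm{KL}(g(\cdot;\alpha):g(\cdot;\alpha'))$, which equals the Bregman divergence $B_{F^*}(\alpha:\alpha')$ on the composite mixture family, is close to $\mathrm{KL}(m:m')$:

```python
# Monte Carlo illustration (hypothetical example): KL(m : m') vs. KL(g_a : g_a'),
# where g_a = (1-eps) m + eps m' and g_a' = eps m + (1-eps) m' lie in the
# composite mixture family, so that KL(g_a : g_a') = B_{F*}(alpha : alpha').
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def gmm_pdf(x, weights, means, stds):
    return sum(w * norm.pdf(x, mu, s) for w, mu, s in zip(weights, means, stds))

def gmm_sample(n, weights, means, stds):
    idx = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(np.asarray(means)[idx], np.asarray(stds)[idx])

def kl_mc(p, q, n=200_000):
    x = gmm_sample(n, *p)                      # sample from the first mixture
    return np.mean(np.log(gmm_pdf(x, *p) / gmm_pdf(x, *q)))

m  = ([0.5, 0.5], [0.0, 3.0], [1.0, 0.5])      # mixture m
mp = ([0.3, 0.7], [1.0, 4.0], [0.8, 1.2])      # mixture m'

eps = 1e-3
comp_means, comp_stds = m[1] + mp[1], m[2] + mp[2]
g  = ([(1 - eps) * w for w in m[0]] + [eps * w for w in mp[0]], comp_means, comp_stds)
gp = ([eps * w for w in m[0]] + [(1 - eps) * w for w in mp[0]], comp_means, comp_stds)

print("KL(m : m')     ~", kl_mc(m, mp))
print("KL(g_a : g_a') ~", kl_mc(g, gp))        # close to the line above for small eps
```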

References

  1. Huang, Z.K.; Chau, K.W. A new image thresholding method based on Gaussian mixture model. Appl. Math. Comput. 2008, 205, 899–907.
  2. Seabra, J.; Ciompi, F.; Pujol, O.; Mauri, J.; Radeva, P.; Sanches, J. Rayleigh mixture model for plaque characterization in intravascular ultrasound. IEEE Trans. Biomed. Eng. 2011, 58, 1314–1324.
  3. Julier, S.J.; Bailey, T.; Uhlmann, J.K. Using Exponential Mixture Models for Suboptimal Distributed Data Fusion. In Proceedings of the 2006 IEEE Nonlinear Statistical Signal Processing Workshop, Cambridge, UK, 13–15 September 2006; IEEE: New York, NY, USA, 2006; pp. 160–163.
  4. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
  5. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
  6. Watanabe, S.; Yamazaki, K.; Aoyagi, M. Kullback Information of Normal Mixture is Not an Analytic Function; Technical Report of IEICE Neurocomputing; The Institute of Electronics, Information and Communication Engineers: Tokyo, Japan, 2004; pp. 41–46. (In Japanese)
  7. Michalowicz, J.V.; Nichols, J.M.; Bucholtz, F. Calculation of differential entropy for a mixed Gaussian distribution. Entropy 2008, 10, 200–206.
  8. Pichler, G.; Koliander, G.; Riegler, E.; Hlawatsch, F. Entropy for singular distributions. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, USA, 29 June–4 July 2014; pp. 2484–2488.
  9. Huber, M.F.; Bailey, T.; Durrant-Whyte, H.; Hanebeck, U.D. On entropy approximation for Gaussian mixture random vectors. In Proceedings of the IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, Seoul, Korea, 20–22 August 2008; IEEE: New York, NY, USA, 2008; pp. 181–188.
  10. Yamada, M.; Sugiyama, M. Direct importance estimation with Gaussian mixture models. IEICE Trans. Inf. Syst. 2009, 92, 2159–2162.
  11. Durrieu, J.L.; Thiran, J.P.; Kelly, F. Lower and upper bounds for approximation of the Kullback-Leibler divergence between Gaussian Mixture Models. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; IEEE: New York, NY, USA, 2012; pp. 4833–4836.
  12. Schwander, O.; Marchand-Maillet, S.; Nielsen, F. Comix: Joint estimation and lightspeed comparison of mixture models. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, 20–25 March 2016; pp. 2449–2453.
  13. Moshksar, K.; Khandani, A.K. Arbitrarily Tight Bounds on Differential Entropy of Gaussian Mixtures. IEEE Trans. Inf. Theory 2016, 62, 3340–3354.
  14. Mezuman, E.; Weiss, Y. A Tight Convex Upper Bound on the Likelihood of a Finite Mixture. arXiv 2016.
  15. Amari, S.-I. Information Geometry and Its Applications; Springer: Tokyo, Japan, 2016; Volume 194.
  16. Nielsen, F.; Sun, K. Guaranteed Bounds on the Kullback–Leibler Divergence of Univariate Mixtures. IEEE Signal Process. Lett. 2016, 23, 1543–1546.
  17. Nielsen, F.; Garcia, V. Statistical exponential families: A digest with flash cards. arXiv 2009.
  18. Calafiore, G.C.; El Ghaoui, L. Optimization Models; Cambridge University Press: Cambridge, UK, 2014.
  19. Shen, C.; Li, H. On the dual formulation of boosting algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 2216–2231.
  20. Beck, A. Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with MATLAB; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2014.
  21. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
  22. De Berg, M.; van Kreveld, M.; Overmars, M.; Schwarzkopf, O.C. Computational Geometry; Springer: Heidelberg, Germany, 2000.
  23. Setter, O.; Sharir, M.; Halperin, D. Constructing Two-Dimensional Voronoi Diagrams via Divide-and-Conquer of Envelopes in Space; Springer: Heidelberg, Germany, 2010.
  24. Devillers, O.; Golin, M.J. Incremental algorithms for finding the convex hulls of circles and the lower envelopes of parabolas. Inf. Process. Lett. 1995, 56, 157–164.
  25. Nielsen, F.; Yvinec, M. An output-sensitive convex hull algorithm for planar objects. Int. J. Comput. Geom. Appl. 1998, 8, 39–65.
  26. Nielsen, F.; Nock, R. Entropies and cross-entropies of exponential families. In Proceedings of the 17th IEEE International Conference on Image Processing (ICIP), Hong Kong, China, 26–29 September 2010; IEEE: New York, NY, USA, 2010; pp. 3621–3624.
  27. Sharir, M.; Agarwal, P.K. Davenport-Schinzel Sequences and Their Geometric Applications; Cambridge University Press: Cambridge, UK, 1995.
  28. Bronstein, M. Algorithms and computation in mathematics. In Symbolic Integration. I. Transcendental Functions; Springer: Berlin, Germany, 2005.
  29. Carreira-Perpinan, M.A. Mode-finding for mixtures of Gaussian distributions. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1318–1323.
  30. Aprausheva, N.N.; Sorokin, S.V. Exact equation of the boundary of unimodal and bimodal domains of a two-component Gaussian mixture. Pattern Recognit. Image Anal. 2013, 23, 341–347.
  31. Learned-Miller, E.; DeStefano, J. A probabilistic upper bound on differential entropy. IEEE Trans. Inf. Theory 2008, 54, 5223–5230.
  32. Amari, S.-I. α-Divergence Is Unique, Belonging to Both f-Divergence and Bregman Divergence Classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931.
  33. Cichocki, A.; Amari, S.I. Families of Alpha- Beta- and Gamma-Divergences: Flexible and Robust Measures of Similarities. Entropy 2010, 12, 1532–1568.
  34. Póczos, B.; Schneider, J. On the Estimation of α-Divergences. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; pp. 609–617.
  35. Nielsen, F.; Nock, R. On Rényi and Tsallis entropies and divergences for exponential families. arXiv 2011.
  36. Minka, T. Divergence Measures and Message Passing; Technical Report MSR-TR-2005-173; Microsoft Research: Cambridge, UK, 2005.
  37. Améndola, C.; Drton, M.; Sturmfels, B. Maximum Likelihood Estimates for Gaussian Mixtures Are Transcendental. arXiv 2015.
  38. Hellinger, E. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Reine Angew. Math. 1909, 136, 210–271. (In German)
  39. Van Erven, T.; Harremos, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
  40. Nielsen, F.; Nock, R. A closed-form expression for the Sharma-Mittal entropy of exponential families. J. Phys. A Math. Theor. 2012, 45, 032003.
  41. Nielsen, F.; Nock, R. On the Chi Square and Higher-Order Chi Distances for Approximating f-Divergences. IEEE Signal Process. Lett. 2014, 21, 10–13.
  42. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466.
  43. Jarosz, W. Efficient Monte Carlo Methods for Light Transport in Scattering Media. Ph.D. Thesis, University of California, San Diego, CA, USA, 2008.
  44. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081.
  45. Havrda, J.; Charvát, F. Quantification method of classification processes. Concept of structural α-entropy. Kybernetika 1967, 3, 30–35.
  46. Liang, X. A Note on Divergences. Neural Comput. 2016, 28, 2045–2062.
  47. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151.
  48. Endres, D.M.; Schindelin, J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858–1860.
  49. Nielsen, F.; Boissonnat, J.D.; Nock, R. On Bregman Voronoi diagrams. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2007; pp. 746–755.
  50. Boissonnat, J.D.; Nielsen, F.; Nock, R. Bregman Voronoi diagrams. Discret. Comput. Geom. 2010, 44, 281–307.
  51. Foster, D.V.; Grassberger, P. Lower bounds on mutual information. Phys. Rev. E 2011, 83, 010101.
  52. Nielsen, F.; Sun, K. PyKLGMM: Python Software for Computing Bounds on the Kullback-Leibler Divergence between Mixture Models. 2016. Available online: https://www.lix.polytechnique.fr/~nielsen/KLGMM/ (accessed on 6 December 2016).
  53. Cobb, L.; Koppstein, P.; Chen, N.H. Estimation and moment recursion relations for multimodal distributions of the exponential family. J. Am. Stat. Assoc. 1983, 78, 124–130.
  54. Nielsen, F.; Nock, R. Patch matching with polynomial exponential families and projective divergences. In Proceedings of the 9th International Conference Similarity Search and Applications (SISAP), Tokyo, Japan, 24–26 October 2016.
Figure 1. Lower envelope of parabolas corresponding to the upper envelope of weighted components of a Gaussian mixture with k = 3 components.
Figure 2. Lower and upper bounds on the KL divergence between mixture models. The y-axis is the KL divergence. Solid/dashed lines represent the combinatorial/adaptive bounds, respectively. The error bars show the 0.95 confidence interval of the Monte Carlo estimate for the corresponding sample size (x-axis). The narrow dotted bars show the CGQLB estimate w.r.t. the sample size.
Figure 3. Lower and upper bounds on the differential entropy of Gaussian mixture models. On the left of each subfigure is the simulated GMM signal; on the right is the estimate of its differential entropy. Note that a subset of the bounds coincide with each other in several cases.
Figure 4. Two pairs of Gaussian Mixture Models and their α-divergences against different values of α. The "true" value of $D_\alpha$ is estimated by MC using $10^4$ random samples. VR(L) and VR(U) denote the variation-reduced lower and upper bounds, respectively. The range of α is selected for each pair for clear visualization.
Table 1. The estimated $D_\alpha$ and its bounds. The 95% confidence interval is shown for MC.

| Mixtures | α | MC($10^2$) | MC($10^3$) | MC($10^4$) | Basic L | Basic U | Adaptive L | Adaptive U | VR L | VR U |
|---|---|---|---|---|---|---|---|---|---|---|
| GMM 1 & GMM 2 | 0 | 15.96 ± 3.9 | 12.30 ± 1.0 | 13.63 ± 0.3 | 11.75 | 15.89 | 12.96 | 14.63 | | |
| | 0.01 | 13.36 ± 2.9 | 10.63 ± 0.8 | 11.66 ± 0.3 | −700.50 | 11.73 | −77.33 | 11.73 | 11.40 | 12.27 |
| | 0.5 | 3.57 ± 0.3 | 3.47 ± 0.1 | 3.47 ± 0.07 | −0.60 | 3.42 | 3.01 | 3.42 | 3.17 | 3.51 |
| | 0.99 | 40.04 ± 7.7 | 37.22 ± 2.3 | 38.58 ± 0.8 | −333.90 | 39.04 | 5.36 | 38.98 | 38.28 | 38.96 |
| | 1 | 104.01 ± 28 | 84.96 ± 7.2 | 92.57 ± 2.5 | 91.44 | 95.59 | 92.76 | 94.41 | | |
| GMM 3 & GMM 4 | 0 | 0.71 ± 0.2 | 0.63 ± 0.07 | 0.62 ± 0.02 | 0.00 | 1.76 | 0.00 | 1.16 | | |
| | 0.01 | 0.71 ± 0.2 | 0.63 ± 0.07 | 0.62 ± 0.02 | −179.13 | 7.63 | −38.74 | 4.96 | 0.29 | 1.00 |
| | 0.5 | 0.82 ± 0.3 | 0.57 ± 0.1 | 0.62 ± 0.04 | −5.23 | 0.93 | −0.71 | 0.85 | −0.18 | 1.19 |
| | 0.99 | 0.79 ± 0.3 | 0.76 ± 0.1 | 0.80 ± 0.03 | −165.72 | 12.10 | −59.76 | 9.11 | 0.37 | 1.28 |
| | 1 | 0.80 ± 0.3 | 0.77 ± 0.1 | 0.81 ± 0.03 | 0.00 | 1.82 | 0.31 | 1.40 | | |
