#### *1.4. Kullback–Leibler Divergence Between Exponential Family Densities*

It is well known that the KLD between two distributions $P\_{\theta\_1}$ and $P\_{\theta\_2}$ of $\mathcal{E}$ amounts to computing an equivalent Fenchel–Young divergence [13]:

$$D\_{\rm KL}[P\_{\theta\_1} : P\_{\theta\_2}] = \int\_{\mathcal{X}} p\_{\theta\_1}(\mathbf{x}) \log \frac{p\_{\theta\_1}(\mathbf{x})}{p\_{\theta\_2}(\mathbf{x})} \, \mathrm{d}\mu(\mathbf{x}) = Y\_{F, F^\*}(\theta\_2, \eta\_1), \tag{9}$$

where $\eta = \nabla F(\theta) = E\_{P\_{\theta}}[t(x)]$ is the moment parameter [1] and

$$\nabla F(\theta) = \left[ \frac{\partial}{\partial \theta\_1} F(\theta), \dots, \frac{\partial}{\partial \theta\_D} F(\theta) \right]^\top \tag{10}$$

is the gradient of $F$ with respect to $\theta = [\theta\_1, \ldots, \theta\_D]^\top$. The Fenchel–Young divergence is defined for a pair of strictly convex conjugate functions [14] $F(\theta)$ and $F^\*(\eta)$ related by the Legendre–Fenchel transform by

$$Y\_{F,F^\*} (\theta\_1, \eta\_2) := F(\theta\_1) + F^\*(\eta\_2) - \theta\_1^\top \eta\_2. \tag{11}$$
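As a minimal numerical sketch of this definition (the choice of generator is ours, not the source's), one can evaluate $Y\_{F,F^\*}$ for the conjugate pair $F(\theta) = \exp(\theta)$ and $F^\*(\eta) = \eta \log \eta - \eta$, which arises as the log-normalizer pair of the Poisson family:

```python
import math

# Conjugate pair for the Poisson family: F(theta) = exp(theta),
# F*(eta) = eta*log(eta) - eta, grad F(theta) = exp(theta).
def F(theta):
    return math.exp(theta)

def F_star(eta):
    return eta * math.log(eta) - eta

def fenchel_young(theta1, eta2):
    # Y_{F,F*}(theta1, eta2) = F(theta1) + F*(eta2) - theta1 * eta2
    return F(theta1) + F_star(eta2) - theta1 * eta2

theta1, theta2 = 0.3, 1.2
eta2 = math.exp(theta2)  # eta2 = grad F(theta2)

print(fenchel_young(theta1, eta2) > 0.0)         # strictly positive since eta2 != grad F(theta1)
print(abs(fenchel_young(theta2, eta2)) < 1e-12)  # vanishes when eta2 = grad F(theta2)
```

The two prints illustrate the non-negativity of the divergence and its equality condition discussed below.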

Amari (1985) first introduced this formula as the canonical divergence of dually flat spaces in information geometry [15] (Equation 3.21), and proved that the Fenchel–Young divergence is obtained as the KLD between densities belonging to the same exponential family [15] (Theorem 3.7). Azoury and Warmuth (2001) expressed the KLD $D\_{\mathrm{KL}}[P\_{\theta\_1} : P\_{\theta\_2}]$ using dual Bregman divergences [13]:

$$D\_{\rm KL}[P\_{\theta\_1} : P\_{\theta\_2}] = B\_F(\theta\_2 : \theta\_1) = B\_{F^\*}(\eta\_1 : \eta\_2),\tag{12}$$

where a Bregman divergence [16] $B\_F(\theta\_1 : \theta\_2)$ is defined for a strictly convex and differentiable generator $F(\theta)$ by:

$$B\_F(\theta\_1:\theta\_2) := F(\theta\_1) - F(\theta\_2) - (\theta\_1 - \theta\_2)^\top \nabla F(\theta\_2). \tag{13}$$
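Equations (12) and (13) are easy to check numerically; the following sketch (our worked instance, again with the Poisson generator $F(\theta) = e^{\theta}$, for which $\theta = \log\lambda$ and $\eta = \lambda$) verifies that both sided Bregman divergences recover the closed-form Poisson KLD $\lambda\_1 \log(\lambda\_1/\lambda\_2) + \lambda\_2 - \lambda\_1$:

```python
import math

def bregman(F, grad_F, t1, t2):
    # B_F(t1 : t2) = F(t1) - F(t2) - (t1 - t2) * grad_F(t2)
    return F(t1) - F(t2) - (t1 - t2) * grad_F(t2)

# Poisson family: theta = log(lam), F(theta) = exp(theta), eta = lam,
# F*(eta) = eta*log(eta) - eta, grad F*(eta) = log(eta).
lam1, lam2 = 2.0, 5.0
t1, t2 = math.log(lam1), math.log(lam2)
e1, e2 = lam1, lam2

F_star = lambda eta: eta * math.log(eta) - eta

kl = lam1 * math.log(lam1 / lam2) + lam2 - lam1  # closed-form Poisson KLD

print(abs(bregman(math.exp, math.exp, t2, t1) - kl) < 1e-12)  # B_F(theta2 : theta1)
print(abs(bregman(F_star, math.log, e1, e2) - kl) < 1e-12)    # B_{F*}(eta1 : eta2)
```

Note the swapped argument order in the primal Bregman divergence, matching Equation (12).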

Acharyya termed the divergence $Y\_{F,F^\*}$ the Fenchel–Young divergence in his PhD thesis [17] (2013), and Blondel et al. (2020) called such divergences Fenchel–Young losses in the context of machine learning [18] (Equation (9) in Definition 2). The divergence was also called the Legendre–Fenchel divergence by the author in [19]. The Fenchel–Young divergence stems from the Fenchel–Young inequality [14,20]:

$$F(\theta\_1) + F^\*(\eta\_2) \ge \theta\_1^\top \eta\_2, \tag{14}$$

with equality if and only if $\eta\_2 = \nabla F(\theta\_1)$.

Figure 1 visualizes the 1D Fenchel–Young divergence and gives a geometric proof that $Y\_{F,F^\*}(\theta\_1, \eta\_2) \ge 0$ with equality if and only if $\eta\_2 = F'(\theta\_1)$. Indeed, by considering the behavior of the Legendre–Fenchel transformation under translations (adding a constant $c$ to $F$ yields $(F + c)^\*(\eta) = F^\*(\eta) - c$), we may assume without loss of generality that $F(0) = 0$. The derivative $F'(\theta)$ is strictly increasing and continuous since $F(\theta)$ is a strictly convex and differentiable function. Thus we have $F(\theta) = \int\_0^{\theta} F'(u) \, \mathrm{d}u$ and $F^\*(\eta) = \int\_0^{\eta} (F^\*)'(v) \, \mathrm{d}v = \int\_0^{\eta} (F')^{-1}(v) \, \mathrm{d}v$.

**Figure 1.** Visualizing the Fenchel–Young divergence.

The Bregman divergence $B\_F(\theta\_1 : \theta\_2)$ amounts to a dual Bregman divergence [13] between the dual parameters with swapped order: $B\_F(\theta\_1 : \theta\_2) = B\_{F^\*}(\eta\_2 : \eta\_1)$ where $\eta\_i = \nabla F(\theta\_i)$ for $i \in \{1, 2\}$. Thus the KLD between two distributions $P\_{\theta\_1}$ and $P\_{\theta\_2}$ of $\mathcal{E}$ can be expressed equivalently as follows:

$$D\_{\mathrm{KL}}[P\_{\theta\_1} : P\_{\theta\_2}] = Y\_{F, F^\*}(\theta\_2 : \eta\_1) = B\_F(\theta\_2 : \theta\_1) = B\_{F^\*}(\eta\_1 : \eta\_2) = Y\_{F^\*, F}(\eta\_1 : \eta\_2). \tag{15}$$

The symmetrized Kullback–Leibler divergence $D\_J[P\_{\theta\_1} : P\_{\theta\_2}]$ between two distributions $P\_{\theta\_1}$ and $P\_{\theta\_2}$ of $\mathcal{E}$ is called Jeffreys' divergence [21] and amounts to a symmetrized Bregman divergence [22]:

$$D\_J[P\_{\theta\_1} : P\_{\theta\_2}] = D\_{\mathrm{KL}}[P\_{\theta\_1} : P\_{\theta\_2}] + D\_{\mathrm{KL}}[P\_{\theta\_2} : P\_{\theta\_1}], \tag{16}$$

$$= B\_F(\theta\_2 : \theta\_1) + B\_F(\theta\_1 : \theta\_2), \tag{17}$$

$$= (\theta\_2 - \theta\_1)^\top (\eta\_2 - \eta\_1) := S\_F(\theta\_1, \theta\_2). \tag{18}$$
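The inner-product identity of Equation (18) can be verified numerically; a small sketch with the Poisson generator $F(\theta) = e^{\theta}$ (an example of our choosing):

```python
import math

def bregman_exp(t1, t2):
    # B_F(t1 : t2) for F(theta) = exp(theta)
    return math.exp(t1) - math.exp(t2) - (t1 - t2) * math.exp(t2)

theta1, theta2 = math.log(2.0), math.log(5.0)
eta1, eta2 = math.exp(theta1), math.exp(theta2)  # eta_i = grad F(theta_i)

# symmetrized Bregman divergence vs. the inner-product form S_F
sym = bregman_exp(theta2, theta1) + bregman_exp(theta1, theta2)
S_F = (theta2 - theta1) * (eta2 - eta1)

print(abs(sym - S_F) < 1e-12)
```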

Note that the Bregman divergence *BF*(*θ*<sup>1</sup> : *θ*2) can also be interpreted as a surface area:

$$B\_F(\theta\_1 : \theta\_2) = \int\_{\theta\_2}^{\theta\_1} (F'(\theta) - F'(\theta\_2)) \, \mathrm{d}\theta. \tag{19}$$
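Equation (19) can be checked by quadrature; here is a sketch (with $F(\theta) = e^{\theta}$ and a midpoint Riemann sum, both our choices) comparing the area to the closed-form Bregman divergence:

```python
import math

def area(theta1, theta2, n=20000):
    # midpoint Riemann sum of (F'(t) - F'(theta2)) over [theta2, theta1]
    h = (theta1 - theta2) / n
    return sum((math.exp(theta2 + (i + 0.5) * h) - math.exp(theta2)) * h
               for i in range(n))

theta1, theta2 = 1.5, 0.2
# B_F(theta1 : theta2) for F(theta) = exp(theta)
bf = math.exp(theta1) - math.exp(theta2) - (theta1 - theta2) * math.exp(theta2)

print(abs(area(theta1, theta2) - bf) < 1e-6)
```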

Figure 2 illustrates the sided and symmetrized Bregman divergences.

**Figure 2.** Visualizing the sided and symmetrized Bregman divergences.

#### *1.5. Contributions and Paper Outline*

We recall in Section 2 the formula obtained for the Kullback–Leibler divergence between two exponential family densities equivalent to each other [23] (Equation (29)). Inspired by this formula, we give a definition of the duo Fenchel–Young divergence induced by a pair of strictly convex functions $F\_1$ and $F\_2$ (Definition 1) in Section 3, and prove that the divergence is always non-negative provided that $F\_1$ upper bounds $F\_2$. We then define the duo Bregman divergence (Definition 2) corresponding to the duo Fenchel–Young divergence. In Section 4, we show that the Kullback–Leibler divergence between a truncated density and a density of the same parametric exponential family amounts to a duo Fenchel–Young divergence, or equivalently to a duo Bregman divergence on swapped parameters (Theorem 1). That is, we consider a truncated exponential family [7] $\mathcal{E}\_1$ of an exponential family $\mathcal{E}\_2$ such that the common support of the distributions of $\mathcal{E}\_1$ is contained in the common support of the distributions of $\mathcal{E}\_2$ and both canonical decompositions of the families coincide (see Equation (2)). In particular, when $\mathcal{E}\_2$ is also a truncated exponential family of $\mathcal{E}$, then we express the KLD between two truncated distributions as a duo Bregman divergence. As examples, we report the formula for the Kullback–Leibler divergence between two densities of truncated exponential families (Corollary 1), and illustrate the formula for the Kullback–Leibler divergence between truncated exponential distributions (Example 6) and for the Kullback–Leibler divergence between truncated normal distributions (Example 7).

In Section 5, we further consider the skewed Bhattacharyya distance between densities of truncated exponential families and prove that it amounts to a duo Jensen divergence (Theorem 2). Finally, we conclude in Section 6.

#### **2. Kullback–Leibler Divergence Between Different Exponential Families**

Consider now two exponential families [1] $\mathcal{P}$ and $\mathcal{Q}$ defined by their Radon–Nikodym derivatives with respect to two positive measures $\mu\_{\mathcal{P}}$ and $\mu\_{\mathcal{Q}}$ on $(\mathcal{X}, \Sigma)$:

$$\mathcal{P} = \{P\_{\theta} \,:\, \theta \in \Theta\},\tag{20}$$

$$\mathcal{Q} = \{Q\_{\theta'} \,:\, \theta' \in \Theta'\}.\tag{21}$$

The corresponding natural parameter spaces are

$$\Theta = \left\{ \theta \in \mathbb{R}^{D} \,:\, \int\_{\mathcal{X}} \exp(\theta^{\top} t\_{\mathcal{P}}(\mathbf{x}) + k\_{\mathcal{P}}(\mathbf{x})) \, \mathrm{d}\mu\_{\mathcal{P}}(\mathbf{x}) < \infty \right\},\tag{22}$$

$$\Theta' = \left\{ \theta' \in \mathbb{R}^{D'} \,:\, \int\_{\mathcal{X}} \exp(\theta'^{\top} t\_{\mathcal{Q}}(\mathbf{x}) + k\_{\mathcal{Q}}(\mathbf{x})) \, \mathrm{d}\mu\_{\mathcal{Q}}(\mathbf{x}) < \infty \right\}.\tag{23}$$

The order of $\mathcal{P}$ is $D$, $t\_{\mathcal{P}}(\mathbf{x})$ denotes the sufficient statistics of $P\_{\theta}$, and $k\_{\mathcal{P}}(\mathbf{x})$ is an optional term to adjust/tilt the base measure $\mu\_{\mathcal{P}}$. Similarly, the order of $\mathcal{Q}$ is $D'$, $t\_{\mathcal{Q}}(\mathbf{x})$ denotes the sufficient statistics of $Q\_{\theta'}$, and $k\_{\mathcal{Q}}(\mathbf{x})$ is an optional term to adjust the base measure $\mu\_{\mathcal{Q}}$. Let $p\_{\theta}$ and $q\_{\theta'}$ denote the Radon–Nikodym derivatives with respect to the measures $\mu\_{\mathcal{P}}$ and $\mu\_{\mathcal{Q}}$, respectively:

$$p\_{\theta}(\mathbf{x}) = \frac{\mathrm{d}P\_{\theta}}{\mathrm{d}\mu\_{\mathcal{P}}}(\mathbf{x}) = \exp(\theta^{\top}t\_{\mathcal{P}}(\mathbf{x}) - F\_{\mathcal{P}}(\theta) + k\_{\mathcal{P}}(\mathbf{x})),\tag{24}$$

$$q\_{\theta'}(\mathbf{x}) = \frac{\mathrm{d}Q\_{\theta'}}{\mathrm{d}\mu\_{\mathcal{Q}}}(\mathbf{x}) = \exp(\theta'^{\top} t\_{\mathcal{Q}}(\mathbf{x}) - F\_{\mathcal{Q}}(\theta') + k\_{\mathcal{Q}}(\mathbf{x})),\tag{25}$$

where $F\_{\mathcal{P}}(\theta)$ and $F\_{\mathcal{Q}}(\theta')$ denote the corresponding log-normalizers of $\mathcal{P}$ and $\mathcal{Q}$, respectively:

$$F\_{\mathcal{P}}(\theta) \quad = \log \left( \int \exp(\theta^{\top} t\_{\mathcal{P}}(\mathbf{x}) + k\_{\mathcal{P}}(\mathbf{x})) \, \mathrm{d}\mu\_{\mathcal{P}}(\mathbf{x}) \right), \tag{26}$$

$$F\_{\mathcal{Q}}(\theta') \;=\; \log \left( \int \exp(\theta'^{\top} t\_{\mathcal{Q}}(\mathbf{x}) + k\_{\mathcal{Q}}(\mathbf{x})) \, \mathrm{d}\mu\_{\mathcal{Q}}(\mathbf{x}) \right). \tag{27}$$

The functions $F\_{\mathcal{P}}$ and $F\_{\mathcal{Q}}$ are strictly convex and real analytic [1]. Hence, these functions are infinitely differentiable on their open natural parameter spaces.

Consider the KLD between $P\_{\theta} \in \mathcal{P}$ and $Q\_{\theta'} \in \mathcal{Q}$ such that $\mu\_{\mathcal{P}} = \mu\_{\mathcal{Q}}$ (and hence $P\_{\theta} \ll Q\_{\theta'}$). The KLD between $P\_{\theta}$ and $Q\_{\theta'}$ was first considered in [23]:

$$\begin{split} D\_{\mathrm{KL}}[P\_{\theta} : Q\_{\theta'}] &= E\_{P\_{\theta}}\left[\log\left(\frac{\mathrm{d}P\_{\theta}}{\mathrm{d}Q\_{\theta'}}\right)\right],\\ &= E\_{P\_{\theta}}\left[\theta^{\top}t\_{\mathcal{P}}(\mathbf{x}) - \theta'^{\top}t\_{\mathcal{Q}}(\mathbf{x}) - F\_{\mathcal{P}}(\theta) + F\_{\mathcal{Q}}(\theta') + k\_{\mathcal{P}}(\mathbf{x}) - k\_{\mathcal{Q}}(\mathbf{x})\right],\\ &= F\_{\mathcal{Q}}(\theta') - F\_{\mathcal{P}}(\theta) + \theta^{\top}E\_{P\_{\theta}}[t\_{\mathcal{P}}(\mathbf{x})] - \theta'^{\top}E\_{P\_{\theta}}[t\_{\mathcal{Q}}(\mathbf{x})] + E\_{P\_{\theta}}[k\_{\mathcal{P}}(\mathbf{x}) - k\_{\mathcal{Q}}(\mathbf{x})]. \end{split} \tag{28}$$

Recall that the dual parameterization of an exponential family density $P\_{\theta}$ is $P\_{\eta}$ with $\eta = E\_{P\_{\theta}}[t\_{\mathcal{P}}(\mathbf{x})] = \nabla F\_{\mathcal{P}}(\theta)$ [1], and that the Fenchel–Young equality is $F(\theta) + F^\*(\eta) = \theta^{\top}\eta$ for $\eta = \nabla F(\theta)$. Thus the KLD between $P\_{\theta}$ and $Q\_{\theta'}$ can be rewritten as

$$D\_{\mathrm{KL}}[P\_{\theta} : Q\_{\theta'}] = F\_{\mathcal{Q}}(\theta') + F\_{\mathcal{P}}^\*(\eta) - \theta'^{\top}E\_{P\_{\theta}}[t\_{\mathcal{Q}}(\mathbf{x})] + E\_{P\_{\theta}}[k\_{\mathcal{P}}(\mathbf{x}) - k\_{\mathcal{Q}}(\mathbf{x})].\tag{29}$$

This formula was reported in [23] and generalizes the Fenchel–Young divergence [17] obtained when $\mathcal{P} = \mathcal{Q}$ (with $t\_{\mathcal{P}}(\mathbf{x}) = t\_{\mathcal{Q}}(\mathbf{x})$, $k\_{\mathcal{P}}(\mathbf{x}) = k\_{\mathcal{Q}}(\mathbf{x})$, $F(\theta) = F\_{\mathcal{P}}(\theta) = F\_{\mathcal{Q}}(\theta)$, and $F^\*(\eta) = F\_{\mathcal{P}}^\*(\eta) = F\_{\mathcal{Q}}^\*(\eta)$).

The formula of Equation (29) was illustrated in [23] with two examples: the KLD between Laplacian distributions and zero-centered Gaussian distributions, and the KLD between two Weibull distributions. Both examples use the Lebesgue base measure for $\mu\_{\mathcal{P}}$ and $\mu\_{\mathcal{Q}}$.

Let us report another example that uses the counting measure as the base measure for $\mu\_{\mathcal{P}}$ and $\mu\_{\mathcal{Q}}$.

**Example 1.** *Consider the KLD between a Poisson probability mass function (pmf) and a geometric pmf. The canonical decompositions of the Poisson and geometric pmfs are summarized in Table 1. The KLD between a Poisson pmf $p\_{\lambda}$ and a geometric pmf $q\_p$ is equal to*

$$D\_{\mathrm{KL}}[P\_{\lambda} : Q\_{p}] = F\_{\mathcal{Q}}(\theta') + F\_{\mathcal{P}}^{\*}(\eta) - \theta'^{\top} E\_{P\_{\lambda}}[t\_{\mathcal{Q}}(\mathbf{x})] + E\_{P\_{\lambda}}[k\_{\mathcal{P}}(\mathbf{x}) - k\_{\mathcal{Q}}(\mathbf{x})], \tag{30}$$

$$= -\log p + \lambda \log \lambda - \lambda - \lambda \log(1 - p) - E\_{P\_{\lambda}}[\log x!]. \tag{31}$$

*Since* $E\_{P\_{\lambda}}[-\log x!] = -\sum\_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k \log(k!)}{k!}$, *we have*

$$D\_{\mathrm{KL}}[P\_{\lambda}:Q\_{\mathcal{P}}] = -\log p + \lambda \log \frac{\lambda}{1-p} - \lambda - \sum\_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k \log(k!)}{k!}.\tag{32}$$
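Equation (32) can be sanity-checked numerically by truncating both the KLD series and the expectation of $\log k!$ (the truncation at 100 terms and the parameter values are our choices; the Poisson tail makes the remainders negligible):

```python
import math

lam, p = 3.0, 0.4  # example parameters (our choice)

def poisson(k):
    # Poisson pmf p_lam(k) = exp(-lam) * lam^k / k!
    return math.exp(-lam) * lam**k / math.factorial(k)

def geometric(k):
    # geometric pmf q_p(k) = (1 - p)^k * p, k = 0, 1, 2, ...
    return (1.0 - p)**k * p

# direct (truncated) summation of sum_k p_lam(k) * log(p_lam(k) / q_p(k))
direct = sum(poisson(k) * math.log(poisson(k) / geometric(k)) for k in range(100))

# closed form of Equation (32), with E[log x!] also truncated at 100 terms
closed = (-math.log(p) + lam * math.log(lam / (1.0 - p)) - lam
          - sum(poisson(k) * math.lgamma(k + 1) for k in range(100)))

print(abs(direct - closed) < 1e-9)
```

Here `math.lgamma(k + 1)` computes $\log k!$, avoiding overflow for large $k$.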

*Note that we can calculate the KLD between two geometric distributions $Q\_{p\_1}$ and $Q\_{p\_2}$ as*

$$\begin{split} D\_{\mathrm{KL}}[Q\_{p\_1} : Q\_{p\_2}] &= B\_{F\_{\mathcal{Q}}}(\theta(p\_2) : \theta(p\_1)),\\ &= F\_{\mathcal{Q}}(\theta(p\_2)) - F\_{\mathcal{Q}}(\theta(p\_1)) - (\theta(p\_2) - \theta(p\_1))\,\eta(p\_1). \end{split} \tag{33}$$

*We obtain:*

$$D\_{\mathrm{KL}}[Q\_{p\_1} : Q\_{p\_2}] = \log\left(\frac{p\_1}{p\_2}\right) - \left(1 - \frac{1}{p\_1}\right)\log\frac{1 - p\_1}{1 - p\_2}.$$
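This closed form can likewise be checked against a truncated summation of the KLD series (the truncation at 500 terms and the parameter values are our choices; the geometric tails are negligible there):

```python
import math

p1, p2 = 0.3, 0.6  # example parameters (our choice)

def geom(p, k):
    # geometric pmf q_p(k) = (1 - p)^k * p
    return (1.0 - p)**k * p

# direct (truncated) summation of the KLD series
direct = sum(geom(p1, k) * math.log(geom(p1, k) / geom(p2, k)) for k in range(500))

# closed form derived from the Bregman divergence above
closed = math.log(p1 / p2) - (1.0 - 1.0 / p1) * math.log((1.0 - p1) / (1.0 - p2))

print(abs(direct - closed) < 1e-9)
```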


**Table 1.** Canonical decomposition of the Poisson and the geometric discrete exponential families.

| Quantity | Poisson $P\_{\lambda}$ | Geometric $Q\_p$ |
|---|---|---|
| pmf | $e^{-\lambda} \frac{\lambda^x}{x!}$ | $(1-p)^x\, p$ |
| support | $\{0, 1, 2, \ldots\}$ | $\{0, 1, 2, \ldots\}$ |
| sufficient statistic $t(x)$ | $x$ | $x$ |
| natural parameter $\theta$ | $\log \lambda$ | $\log(1-p)$ |
| log-normalizer $F(\theta)$ | $e^{\theta} = \lambda$ | $-\log(1-e^{\theta}) = -\log p$ |
| base measure term $k(x)$ | $-\log x!$ | $0$ |
| moment parameter $\eta$ | $\lambda$ | $\frac{1-p}{p}$ |
