**5. Bhattacharyya Skewed Divergence Between Truncated Densities of an Exponential Family**

The Bhattacharyya *α*-skewed divergence [29,30] between two densities *p*(*x*) and *q*(*x*) with respect to a base measure *μ* is defined for a skewing scalar parameter *α* ∈ (0, 1) as:

$$D\_{\text{Bhat},\alpha}[p:q] := -\log \int\_{\mathcal{X}} p(\mathbf{x})^{\alpha} q(\mathbf{x})^{1-\alpha} \, \mathrm{d}\mu(\mathbf{x}),\tag{114}$$

where $\mathcal{X}$ denotes the support of the distributions. The Bhattacharyya distance is

$$D\_{\text{Bhat}}[p,q] = D\_{\text{Bhat},\frac{1}{2}}[p:q] = -\log \int\_{\mathcal{X}} \sqrt{p(\mathbf{x})q(\mathbf{x})} \, \mathrm{d}\mu(\mathbf{x}).\tag{115}$$

The Bhattacharyya distance is not a metric distance since it does not satisfy the triangle inequality. It is related to the Hellinger distance [31] as follows:

$$D\_{H}[p,q] = \sqrt{\frac{1}{2} \int\_{\mathcal{X}} \left(\sqrt{p(\mathbf{x})} - \sqrt{q(\mathbf{x})}\right)^{2} \mathrm{d}\mu(\mathbf{x})} = \sqrt{1 - \exp(-D\_{\mathrm{Bhat}}[p,q])}.\tag{116}$$

The Hellinger distance is a metric distance.
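As a quick numerical illustration (our own sketch, not taken from the paper), the following Python snippet approximates the Bhattacharyya distance (115) and the Hellinger distance (116) between two univariate normal densities by quadrature and checks the relation between them; the chosen means and scales are arbitrary.

```python
# Numerical check of the Bhattacharyya-Hellinger relation (116),
# using two arbitrary univariate Gaussians as p and q.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm(loc=0.0, scale=1.0)   # p(x)
q = norm(loc=1.0, scale=2.0)   # q(x)

# Bhattacharyya coefficient: integral of sqrt(p(x) q(x)) over the support.
coeff, _ = quad(lambda x: np.sqrt(p.pdf(x) * q.pdf(x)), -np.inf, np.inf)
d_bhat = -np.log(coeff)

# Hellinger distance computed directly from its integral definition.
h2, _ = quad(lambda x: (np.sqrt(p.pdf(x)) - np.sqrt(q.pdf(x))) ** 2,
             -np.inf, np.inf)
d_hell = np.sqrt(0.5 * h2)

# Relation (116): D_H = sqrt(1 - exp(-D_Bhat)).
assert np.isclose(d_hell, np.sqrt(1.0 - np.exp(-d_bhat)), atol=1e-6)
print(d_bhat, d_hell)
```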

Let $I\_\alpha[p:q] := \int\_{\mathcal{X}} p(\mathbf{x})^{\alpha} q(\mathbf{x})^{1-\alpha} \, \mathrm{d}\mu(\mathbf{x})$ denote the skewed affinity coefficient so that $D\_{\text{Bhat},\alpha}[p:q] = -\log I\_\alpha[p:q]$. Since $I\_\alpha[p:q] = I\_{1-\alpha}[q:p]$, we have $D\_{\text{Bhat},\alpha}[p:q] = D\_{\text{Bhat},1-\alpha}[q:p]$.

Consider an exponential family $\mathcal{E} = \{p\_\theta\}$ with log-normalizer $F(\theta)$. It is then well-known that the *α*-skewed Bhattacharyya divergence between two densities of an exponential family amounts to a skewed Jensen divergence [30] (originally called the Jensen difference in [32]):

$$D\_{\text{Bhat},\alpha}[p\_{\theta\_1} : p\_{\theta\_2}] = J\_{F,\alpha}(\theta\_1 : \theta\_2), \tag{117}$$

where the skewed Jensen divergence is defined by

$$J\_{F,\alpha}(\theta\_1:\theta\_2) = \alpha F(\theta\_1) + (1-\alpha)F(\theta\_2) - F(\alpha\theta\_1 + (1-\alpha)\theta\_2). \tag{118}$$

The convexity of the log-normalizer $F(\theta)$ ensures that $J\_{F,\alpha}(\theta\_1 : \theta\_2) \geq 0$. The Jensen divergence can be extended to the full real range of $\alpha$ by rescaling it by $\frac{1}{\alpha(1-\alpha)}$; see [33].
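To make identity (117) concrete, here is a small self-contained Python check (our own illustration, with arbitrarily chosen natural parameters) for the exponential distributions $p\_\theta(x) = -\theta e^{\theta x}$ on $[0,\infty)$, an exponential family with log-normalizer $F(\theta) = -\log(-\theta)$ for $\theta < 0$:

```python
# Verify numerically that the alpha-skewed Bhattacharyya divergence (114)
# equals the skewed Jensen divergence (118) for exponential distributions.
import numpy as np
from scipy.integrate import quad

F = lambda t: -np.log(-t)                 # log-normalizer, theta < 0
pdf = lambda x, t: -t * np.exp(t * x)     # density on [0, inf)

def jensen(t1, t2, a):
    """Skewed Jensen divergence (118) induced by F."""
    return a * F(t1) + (1 - a) * F(t2) - F(a * t1 + (1 - a) * t2)

def bhat(t1, t2, a):
    """alpha-skewed Bhattacharyya divergence (114) by quadrature."""
    I, _ = quad(lambda x: pdf(x, t1) ** a * pdf(x, t2) ** (1 - a), 0, np.inf)
    return -np.log(I)

t1, t2, a = -1.0, -3.0, 0.3
assert np.isclose(bhat(t1, t2, a), jensen(t1, t2, a), atol=1e-6)
```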

**Remark 1.** *The Bhattacharyya skewed divergence $D\_{\text{Bhat},\alpha}[p:q]$ appears naturally as the negative of the log-normalizer of the exponential family induced by the exponential arc $\{r\_\alpha(\mathbf{x}) \,:\, \alpha \in (0,1)\}$ linking two densities $p$ and $q$, with $r\_\alpha(\mathbf{x}) \propto p(\mathbf{x})^{\alpha} q(\mathbf{x})^{1-\alpha}$. This arc is an exponential family of order* 1*:*

$$r\_{\alpha}(\mathbf{x}) = \exp(\alpha\log p(\mathbf{x}) + (1 - \alpha)\log q(\mathbf{x}) - \log Z\_{\alpha}(p:q)),\tag{119}$$

$$=\exp\left(\alpha\log\frac{p(\mathbf{x})}{q(\mathbf{x})} - F\_{pq}(\alpha)\right)q(\mathbf{x}).\tag{120}$$

*The sufficient statistic is $t(\mathbf{x}) = \log\frac{p(\mathbf{x})}{q(\mathbf{x})}$, the natural parameter is $\alpha \in (0,1)$, and the log-normalizer is $F\_{pq}(\alpha) = \log Z\_\alpha(p:q) = \log \int p(\mathbf{x})^{\alpha} q(\mathbf{x})^{1-\alpha}\,\mathrm{d}\mu(\mathbf{x}) = -D\_{\text{Bhat},\alpha}[p:q]$. This shows that $D\_{\text{Bhat},\alpha}[p:q]$ is concave with respect to $\alpha$ since log-normalizers $F\_{pq}(\alpha)$ are always convex. Grünwald called those exponential families the likelihood ratio exponential families [34].*
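The concavity claim of Remark 1 can be observed numerically; the sketch below (our own illustration, with arbitrarily chosen Gaussian endpoints $p$ and $q$) evaluates $F\_{pq}(\alpha)$ on a grid and verifies that its discrete second differences are nonnegative, i.e., that $F\_{pq}$ is convex and hence $D\_{\text{Bhat},\alpha}$ is concave in $\alpha$:

```python
# Convexity of the log-normalizer F_pq(alpha) of the likelihood ratio
# exponential family of Remark 1, checked by discrete second differences.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p, q = norm(0.0, 1.0), norm(2.0, 1.5)   # arbitrary endpoint densities

def F_pq(a):
    Z, _ = quad(lambda x: p.pdf(x) ** a * q.pdf(x) ** (1 - a),
                -np.inf, np.inf)
    return np.log(Z)

alphas = np.linspace(0.1, 0.9, 9)
vals = np.array([F_pq(a) for a in alphas])
# Convexity: all discrete second differences are nonnegative.
assert np.all(np.diff(vals, 2) >= -1e-10)
```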

Now, consider calculating $D\_{\text{Bhat},\alpha}[p\_{\theta\_1} : q\_{\theta\_2}]$, where $p\_{\theta\_1} \in \mathcal{E}\_1$ with $\mathcal{E}\_1$ a truncated exponential family of $\mathcal{E}\_2$, and $q\_{\theta\_2} \in \mathcal{E}\_2 = \{q\_\theta\}$. On the truncated support we have $q\_\theta(\mathbf{x}) = \frac{Z\_1(\theta)}{Z\_2(\theta)}\, p\_\theta(\mathbf{x})$, where $Z\_1(\theta)$ and $Z\_2(\theta)$ are the partition functions of $\mathcal{E}\_1$ and $\mathcal{E}\_2$, respectively. Thus we have

$$I\_{\alpha}[p\_{\theta\_1} : q\_{\theta\_2}] = \left(\frac{Z\_1(\theta\_2)}{Z\_2(\theta\_2)}\right)^{1-\alpha} I\_{\alpha}[p\_{\theta\_1} : p\_{\theta\_2}],\tag{121}$$
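To spell out the step behind (121): the affinity integrand vanishes outside the truncated support $\mathcal{X}\_1$ of $\mathcal{E}\_1$, on which $q\_{\theta\_2}$ and $p\_{\theta\_2}$ are proportional, so that

$$I\_\alpha[p\_{\theta\_1} : q\_{\theta\_2}] = \int\_{\mathcal{X}\_1} p\_{\theta\_1}(\mathbf{x})^{\alpha} \left(\frac{Z\_1(\theta\_2)}{Z\_2(\theta\_2)}\right)^{1-\alpha} p\_{\theta\_2}(\mathbf{x})^{1-\alpha} \, \mathrm{d}\mu(\mathbf{x}) = \left(\frac{Z\_1(\theta\_2)}{Z\_2(\theta\_2)}\right)^{1-\alpha} I\_\alpha[p\_{\theta\_1} : p\_{\theta\_2}],$$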

and the *α*-skewed Bhattacharyya divergence is

$$D\_{\text{Bhat},\alpha}[p\_{\theta\_1} : q\_{\theta\_2}] = D\_{\text{Bhat},\alpha}[p\_{\theta\_1} : p\_{\theta\_2}] - (1 - \alpha)(F\_1(\theta\_2) - F\_2(\theta\_2)).\tag{122}$$

Therefore we obtain

$$D\_{\text{Bhat},\alpha}\left[p\_{\theta\_1} : q\_{\theta\_2}\right] \quad = \quad J\_{F\_1,\alpha}(\theta\_1 : \theta\_2) - (1 - \alpha)\left(F\_1(\theta\_2) - F\_2(\theta\_2)\right), \tag{123}$$

$$= \quad \alpha F\_1(\theta\_1) + (1-\alpha)F\_2(\theta\_2) - F\_1(\alpha\theta\_1 + (1-\alpha)\theta\_2),\tag{124}$$

$$=: \quad J\_{F\_1,F\_2,\alpha}(\theta\_1:\theta\_2).\tag{125}$$

We call $J\_{F\_1,F\_2,\alpha}(\theta\_1 : \theta\_2)$ the duo Jensen divergence. Since $F\_2(\theta) \geq F\_1(\theta)$, we check that

$$J\_{F\_1,F\_2,\alpha}(\theta\_1 : \theta\_2) \ge J\_{F\_1,\alpha}(\theta\_1 : \theta\_2) \ge 0. \tag{126}$$

Figure 7 graphically illustrates the duo Jensen divergence $J\_{F\_1,F\_2,\alpha}(\theta\_1 : \theta\_2)$.

**Figure 7.** The duo Jensen divergence $J\_{F\_1,F\_2,\alpha}(\theta\_1 : \theta\_2)$ is greater than the Jensen divergence $J\_{F\_1,\alpha}(\theta\_1 : \theta\_2)$ for $F\_2(\theta) \geq F\_1(\theta)$.

**Theorem 2.** *The α-skewed Bhattacharyya divergence for $\alpha \in (0,1)$ between a truncated density of $\mathcal{E}\_1$ with log-normalizer $F\_1(\theta)$ and another density of an exponential family $\mathcal{E}\_2$ with log-normalizer $F\_2(\theta)$ amounts to a duo Jensen divergence:*

$$D\_{\text{Bhat},\alpha}\left[p\_{\theta\_1} : q\_{\theta\_2}\right] = J\_{F\_1,F\_2,\alpha}(\theta\_1 : \theta\_2),\tag{127}$$

*where $J\_{F\_1,F\_2,\alpha}(\theta\_1 : \theta\_2)$ is the duo skewed Jensen divergence induced by two strictly convex functions $F\_1(\theta)$ and $F\_2(\theta)$ such that $F\_2(\theta) \geq F\_1(\theta)$:*

$$J\_{F\_1,F\_2,\alpha}(\theta\_1 : \theta\_2) = \alpha F\_1(\theta\_1) + (1 - \alpha) F\_2(\theta\_2) - F\_1(\alpha\theta\_1 + (1 - \alpha)\theta\_2). \tag{128}$$
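Here is a numerical sanity check of Theorem 2 (our own sketch, not code from the paper): take $\mathcal{E}\_2$ to be the exponential distributions on $[0,\infty)$ with $F\_2(\theta) = -\log(-\theta)$, and $\mathcal{E}\_1$ its truncation to $[0,b]$ with $F\_1(\theta) = \log\frac{e^{\theta b}-1}{\theta}$; the truncation bound $b$ and the parameters below are arbitrary choices.

```python
# Check Theorem 2: the alpha-skewed Bhattacharyya divergence between a
# truncated exponential density (E1) and a full exponential density (E2)
# equals the duo Jensen divergence (128).
import numpy as np
from scipy.integrate import quad

b = 2.0
F1 = lambda t: np.log((np.exp(t * b) - 1.0) / t)   # truncated log-normalizer
F2 = lambda t: -np.log(-t)                         # full log-normalizer
p = lambda x, t: np.exp(t * x - F1(t))             # density of E1 on [0, b]
q = lambda x, t: np.exp(t * x - F2(t))             # density of E2 on [0, inf)

def duo_jensen(t1, t2, a):
    """Duo skewed Jensen divergence (128)."""
    return a * F1(t1) + (1 - a) * F2(t2) - F1(a * t1 + (1 - a) * t2)

def bhat(t1, t2, a):
    """alpha-skewed Bhattacharyya divergence (114); the integrand is
    supported on the truncated domain [0, b]."""
    I, _ = quad(lambda x: p(x, t1) ** a * q(x, t2) ** (1 - a), 0, b)
    return -np.log(I)

t1, t2, a = -1.0, -2.5, 0.4
assert np.isclose(bhat(t1, t2, a), duo_jensen(t1, t2, a), atol=1e-6)
assert F2(t2) >= F1(t2)   # the inequality underlying (126)
```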

In [30], it is reported that

$$D\_{\text{KL}}[p\_{\theta\_1} : p\_{\theta\_2}] = B\_F(\theta\_2 : \theta\_1), \tag{129}$$

$$=\lim\_{\alpha\to 0} \frac{1}{\alpha} J\_{F,\alpha}(\theta\_2:\theta\_1) = \lim\_{\alpha\to 0} \frac{1}{\alpha} J\_{F,1-\alpha}(\theta\_1:\theta\_2),\tag{130}$$

$$= \lim\_{\alpha \to 0} \frac{1}{\alpha} D\_{\text{Bhat},\alpha} [p\_{\theta\_2} : p\_{\theta\_1}] = \lim\_{\alpha \to 0} \frac{1}{\alpha} D\_{\text{Bhat},1-\alpha} [p\_{\theta\_1} : p\_{\theta\_2}].\tag{131}$$

Indeed, using the first-order Taylor expansion of

$$F(\theta\_1 + \alpha(\theta\_2 - \theta\_1)) \underset{\alpha \to 0}{\approx} F(\theta\_1) + \alpha(\theta\_2 - \theta\_1)^\top \nabla F(\theta\_1) \tag{132}$$

as *α* → 0, we check that

$$\frac{1}{\alpha} J\_{F,\alpha}(\theta\_2:\theta\_1) \quad = \quad \frac{F(\theta\_1) + \alpha(F(\theta\_2) - F(\theta\_1)) - F(\theta\_1 + \alpha(\theta\_2 - \theta\_1))}{\alpha},\tag{133}$$

$$\underset{\alpha\to 0}{\approx} \quad \frac{F(\theta\_1) + \alpha(F(\theta\_2) - F(\theta\_1)) - F(\theta\_1) - \alpha(\theta\_2 - \theta\_1)^\top \nabla F(\theta\_1)}{\alpha},\tag{134}$$

$$= \quad F(\theta\_2) - F(\theta\_1) - (\theta\_2 - \theta\_1)^\top \nabla F(\theta\_1),\tag{135}$$

$$=: \quad B\_F(\theta\_2 : \theta\_1).\tag{136}$$

Thus we have $\lim\_{\alpha\to 0} \frac{1}{\alpha} J\_{F,\alpha}(\theta\_2 : \theta\_1) = B\_F(\theta\_2 : \theta\_1)$. Moreover, we have

$$\lim\_{\alpha \to 0} \frac{1}{\alpha} D\_{\text{Bhat}, 1-\alpha}[p:q] = D\_{\text{KL}}[p:q]. \tag{137}$$

Similarly, we can prove that

$$\lim\_{\alpha \to 1} \frac{1}{1 - \alpha} J\_{F\_1,F\_2,\alpha}(\theta\_1 : \theta\_2) = B\_{F\_2,F\_1}(\theta\_2 : \theta\_1), \tag{138}$$

which can be reinterpreted as

$$\lim\_{\alpha \to 1} \frac{1}{1 - \alpha} D\_{\text{Bhat},\alpha} [p\_{\theta\_1} : q\_{\theta\_2}] = D\_{\text{KL}} [p\_{\theta\_1} : q\_{\theta\_2}].\tag{139}$$
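The limits (138) and (139) can also be observed numerically. The sketch below (our own illustration, redefining the ingredients of the truncated exponential example above so that it is self-contained, with arbitrary parameters) scales the duo Jensen divergence and watches it approach the duo Bregman divergence $B\_{F\_2,F\_1}(\theta\_2 : \theta\_1) = F\_2(\theta\_2) - F\_1(\theta\_1) - (\theta\_2 - \theta\_1)^\top \nabla F\_1(\theta\_1)$:

```python
# Observe the limit (138): the scaled duo Jensen divergence tends to the
# duo Bregman divergence B_{F2,F1}(theta2:theta1) as alpha -> 1.
import numpy as np

b, t1, t2 = 2.0, -1.0, -2.5
F1 = lambda t: np.log((np.exp(t * b) - 1.0) / t)             # truncated
F2 = lambda t: -np.log(-t)                                   # full
dF1 = lambda t: b * np.exp(t * b) / (np.exp(t * b) - 1.0) - 1.0 / t  # F1'

duo_jensen = lambda a: (a * F1(t1) + (1 - a) * F2(t2)
                        - F1(a * t1 + (1 - a) * t2))
duo_bregman = F2(t2) - F1(t1) - (t2 - t1) * dF1(t1)

for a in (0.9, 0.99, 0.999):
    print(a, duo_jensen(a) / (1 - a), duo_bregman)  # converges to duo Bregman
```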

#### **6. Concluding Remarks**

We considered the Kullback–Leibler divergence between two parametric densities $p\_\theta \in \mathcal{E}\_1$ and $q\_\theta \in \mathcal{E}\_2$ belonging to truncated exponential families [7] $\mathcal{E}\_1$ and $\mathcal{E}\_2$, and we showed that their KLD is equivalent to a duo Bregman divergence with swapped parameter order (Theorem 1). This result generalizes the study of Azoury and Warmuth [13]. The duo Bregman divergence can be rewritten as a duo Fenchel–Young divergence using mixed natural/moment parameterizations of the exponential family densities (Definition 1). This second result generalizes the approach taken in information geometry [15,35]. We showed how to calculate the Kullback–Leibler divergence between two truncated normal distributions as a duo Bregman divergence. More generally, we proved that the skewed Bhattacharyya distance between two parametric densities of truncated exponential families amounts to a duo Jensen divergence (Theorem 2). We showed that, asymptotically, scaled duo Jensen divergences tend to duo Bregman divergences, generalizing a result of [30,33]. This study of duo divergences induced by pairs of generators was motivated by the formula obtained for the Kullback–Leibler divergence between two densities of two different exponential families originally reported in [23] (Equation (29)).

It would be interesting to find applications of the duo Fenchel–Young, Bregman, and Jensen divergences beyond the scope of calculating statistical distances between truncated exponential family densities. Note that in [36], the authors exhibit a relationship between densities with nested supports and quasi-convex Bregman divergences. However, the parametric densities considered there do not form exponential families since their supports depend on the parameter. Recently, Khan and Swaroop [37] used the duo Fenchel–Young divergence in machine learning for knowledge-adaptation priors in the so-called change regularizer task.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The author would like to thank the three reviewers for their helpful comments, which led to this improved paper.

**Conflicts of Interest:** The author declares no conflicts of interest.

#### **References**

