**4. Kullback–Leibler Divergence between Distributions of Truncated Exponential Families**

Let E<sub>1</sub> = {*P*<sub>*θ*</sub> : *θ* ∈ Θ<sub>1</sub>} be an exponential family of distributions all dominated by *μ* with Radon–Nikodym density *p*<sub>*θ*</sub>(*x*) = exp(*θt*(*x*) − *F*<sub>1</sub>(*θ*) + *k*(*x*)) d*μ*(*x*) defined on the support X<sub>1</sub>. Let E<sub>2</sub> = {*Q*<sub>*θ*</sub> : *θ* ∈ Θ<sub>2</sub>} be another exponential family of distributions all dominated by *μ* with Radon–Nikodym density *q*<sub>*θ*</sub>(*x*) = exp(*θt*(*x*) − *F*<sub>2</sub>(*θ*) + *k*(*x*)) d*μ*(*x*) defined on the support X<sub>2</sub> such that X<sub>1</sub> ⊆ X<sub>2</sub>. Let *p̃*<sub>*θ*</sub>(*x*) = exp(*θt*(*x*) + *k*(*x*)) d*μ*(*x*) be the common unnormalized density so that

$$p\_{\theta}(\mathbf{x}) = \frac{\tilde{p}\_{\theta}(\mathbf{x})}{Z\_{1}(\theta)}\tag{56}$$

and

$$q\_{\theta}(\mathbf{x}) = \frac{\tilde{p}\_{\theta}(\mathbf{x})}{Z\_{2}(\theta)} = \frac{Z\_{1}(\theta)}{Z\_{2}(\theta)} \, p\_{\theta}(\mathbf{x}),\tag{57}$$

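with *Z*<sub>1</sub>(*θ*) = exp(*F*<sub>1</sub>(*θ*)) and *Z*<sub>2</sub>(*θ*) = exp(*F*<sub>2</sub>(*θ*)) being the partition functions (normalizers) of E<sub>1</sub> and E<sub>2</sub>, respectively, so that *F*<sub>1</sub> and *F*<sub>2</sub> are the corresponding log-normalizer functions.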

We have

$$D\_{\rm KL}[p\_{\theta\_1} : q\_{\theta\_2}] = \int\_{\mathcal{X}\_1} p\_{\theta\_1}(\mathbf{x}) \log \frac{p\_{\theta\_1}(\mathbf{x})}{q\_{\theta\_2}(\mathbf{x})} \,\mathrm{d}\mu(\mathbf{x}), \tag{58}$$

$$=\int\_{\mathcal{X}\_1} p\_{\theta\_1}(\mathbf{x}) \log \frac{p\_{\theta\_1}(\mathbf{x})}{p\_{\theta\_2}(\mathbf{x})} \,\mathrm{d}\mu(\mathbf{x}) + \int\_{\mathcal{X}\_1} p\_{\theta\_1}(\mathbf{x}) \log \left(\frac{Z\_2(\theta\_2)}{Z\_1(\theta\_2)}\right) \mathrm{d}\mu(\mathbf{x}),\tag{59}$$

$$= D\_{\rm KL}[p\_{\theta\_1} : p\_{\theta\_2}] + \log Z\_2(\theta\_2) - \log Z\_1(\theta\_2). \tag{60}$$

Since *D*<sub>KL</sub>[*p*<sub>*θ*<sub>1</sub></sub> : *p*<sub>*θ*<sub>2</sub></sub>] = *B*<sub>*F*<sub>1</sub></sub>(*θ*<sub>2</sub> : *θ*<sub>1</sub>) and log *Z*<sub>*i*</sub>(*θ*) = *F*<sub>*i*</sub>(*θ*), we obtain

$$\begin{array}{rcl}D\_{\rm KL}[p\_{\theta\_1} : q\_{\theta\_2}] & = & B\_{\rm F\_1}(\theta\_2 : \theta\_1) + F\_2(\theta\_2) - F\_1(\theta\_2), \\ \end{array} \tag{61}$$

$$=\left.F\_1(\theta\_2) - F\_1(\theta\_1) - (\theta\_2 - \theta\_1)^\top \nabla F\_1(\theta\_1) + F\_2(\theta\_2) - F\_1(\theta\_2)\right.\tag{62}$$

$$= F\_2(\theta\_2) - F\_1(\theta\_1) - \left(\theta\_2 - \theta\_1\right)^\top \nabla F\_1(\theta\_1) =: B\_{F\_2, F\_1}(\theta\_2: \theta\_1). \tag{63}$$
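The identity behind Equations (61)–(63) can be checked numerically. The following is a minimal sketch for scalar natural parameters; the function names `duo_bregman` and `bregman` and the two illustrative generators (which differ by the constant log 2) are assumptions made for this example only.

```python
import math

def duo_bregman(F2, F1, grad_F1, theta2, theta1):
    """Duo Bregman divergence B_{F2,F1}(theta2 : theta1), scalar parameters."""
    return F2(theta2) - F1(theta1) - (theta2 - theta1) * grad_F1(theta1)

def bregman(F, grad_F, theta2, theta1):
    """Ordinary Bregman divergence B_F(theta2 : theta1)."""
    return duo_bregman(F, F, grad_F, theta2, theta1)

# Illustrative generators (assumed for this sketch): F2 - F1 = log 2 >= 0.
F1 = lambda t: -math.log(t)
F2 = lambda t: -math.log(t) + math.log(2.0)
grad_F1 = lambda t: -1.0 / t

theta1, theta2 = 1.5, 0.7
# Left-hand side: Equation (63); right-hand side: Equation (61).
lhs = duo_bregman(F2, F1, grad_F1, theta2, theta1)
rhs = bregman(F1, grad_F1, theta2, theta1) + F2(theta2) - F1(theta2)
print(lhs, rhs)
```

Both printed values should agree up to floating-point rounding, since the term *F*<sub>1</sub>(*θ*<sub>2</sub>) cancels exactly as in Equation (62).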

Observe that since X<sub>1</sub> ⊆ X<sub>2</sub>, we have:

$$F\_2(\theta) = \log \int\_{\mathcal{X}\_2} \tilde{p}\_\theta(\mathbf{x}) \, \mathrm{d}\mu(\mathbf{x}) \geq \log \int\_{\mathcal{X}\_1} \tilde{p}\_\theta(\mathbf{x}) \, \mathrm{d}\mu(\mathbf{x}) =: F\_1(\theta). \tag{64}$$

Therefore Θ<sub>2</sub> ⊆ Θ<sub>1</sub>, and the common natural parameter space is Θ<sub>12</sub> = Θ<sub>1</sub> ∩ Θ<sub>2</sub> = Θ<sub>2</sub>.

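Notice that the reverse Kullback–Leibler divergence *D*<sup>∗</sup><sub>KL</sub>[*p*<sub>*θ*<sub>1</sub></sub> : *q*<sub>*θ*<sub>2</sub></sub>] = *D*<sub>KL</sub>[*q*<sub>*θ*<sub>2</sub></sub> : *p*<sub>*θ*<sub>1</sub></sub>] = +∞ whenever X<sub>1</sub> is a strict subset of X<sub>2</sub>, since *Q*<sub>*θ*<sub>2</sub></sub> is then not absolutely continuous with respect to *P*<sub>*θ*<sub>1</sub></sub>.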

**Theorem 1** (Kullback–Leibler divergence between truncated exponential family densities)**.** *Let* E<sub>2</sub> = {*q*<sub>*θ*<sub>2</sub></sub>} *be an exponential family with support* X<sub>2</sub>, *and* E<sub>1</sub> = {*p*<sub>*θ*<sub>1</sub></sub>} *a truncated exponential family of* E<sub>2</sub> *with support* X<sub>1</sub> ⊂ X<sub>2</sub>. *Let* *F*<sub>1</sub> *and* *F*<sub>2</sub> *denote the log-normalizers of* E<sub>1</sub> *and* E<sub>2</sub>, *and* *η*<sub>1</sub> *and* *η*<sub>2</sub> *the moment parameters corresponding to the natural parameters* *θ*<sub>1</sub> *and* *θ*<sub>2</sub>. *Then the Kullback–Leibler divergence between a truncated density of* E<sub>1</sub> *and a density of* E<sub>2</sub> *is*

$$D\_{\rm KL}[p\_{\theta\_1} : q\_{\theta\_2}] = Y\_{F\_2, F\_1^\*}(\theta\_2 : \eta\_1) = B\_{F\_2, F\_1}(\theta\_2 : \theta\_1) = B\_{F\_1^\*, F\_2^\*}(\eta\_1 : \eta\_2) = Y\_{F\_1^\*, F\_2}(\eta\_1 : \theta\_2). \tag{65}$$

For example, consider the calculation of the KLD between an exponential distribution (viewed as half a Laplacian distribution, i.e., a Laplacian distribution truncated to the positive reals) and a Laplacian distribution defined on the whole real line.

**Example 4.** *Let* R<sub>++</sub> = {*x* ∈ R : *x* > 0} *denote the set of positive reals. Let* E<sub>1</sub> = {*p*<sub>*λ*</sub>(*x*) = *λ* exp(−*λx*), *λ* ∈ R<sub>++</sub>, *x* > 0} *and* E<sub>2</sub> = {*q*<sub>*λ*</sub>(*x*) = (*λ*/2) exp(−*λ*|*x*|), *λ* ∈ R<sub>++</sub>, *x* ∈ R} *denote the exponential families of exponential distributions and Laplacian distributions, respectively. We have the sufficient statistic* *t*(*x*) = −|*x*| *and natural parameter* *θ* = *λ* *so that* *p̃*<sub>*θ*</sub>(*x*) = exp(−|*x*|*θ*). *The log-normalizers are* *F*<sub>1</sub>(*θ*) = −log *θ* *and* *F*<sub>2</sub>(*θ*) = −log *θ* + log 2 *(hence* *F*<sub>2</sub>(*θ*) ≥ *F*<sub>1</sub>(*θ*)*). The moment parameter is* *η* = ∇*F*<sub>1</sub>(*θ*) = ∇*F*<sub>2</sub>(*θ*) = −1/*θ* = −1/*λ*. *Thus using the duo Bregman divergence, we have:*

$$\begin{array}{rcl}D\_{\text{KL}}[p\_{\theta\_1} : q\_{\theta\_2}] & = & B\_{F\_2, F\_1}(\theta\_2 : \theta\_1), \\ \end{array} \tag{66}$$

$$=\left.F\_{2}(\theta\_{2}) - F\_{1}(\theta\_{1}) - (\theta\_{2} - \theta\_{1})^{\top} \nabla F\_{1}(\theta\_{1})\right. \tag{67}$$

$$=\quad\log 2 + \log \frac{\lambda\_1}{\lambda\_2} + \frac{\lambda\_2}{\lambda\_1} - 1.\tag{68}$$

*Moreover, we can interpret that divergence using the Itakura–Saito divergence [24]:*

$$D\_{\rm IS}[\lambda\_1 : \lambda\_2] := \frac{\lambda\_1}{\lambda\_2} - \log \frac{\lambda\_1}{\lambda\_2} - 1 \ge 0. \tag{69}$$

*We have*

$$D\_{\rm KL}[p\_{\theta\_1} : q\_{\theta\_2}] = D\_{\rm IS}[\lambda\_2 : \lambda\_1] + \log 2 \ge 0. \tag{70}$$

*We check the result using the duo Fenchel–Young divergence:*

$$D\_{\rm KL}[p\_{\theta\_1} : q\_{\theta\_2}] = Y\_{F\_2, F\_1^\*}(\theta\_2 : \eta\_1),\tag{71}$$

*with* *F*<sub>1</sub><sup>∗</sup>(*η*) = −1 + log(−1/*η*)*:*

$$D\_{\text{KL}}[p\_{\theta\_1} : q\_{\theta\_2}] = Y\_{F\_2, F\_1^\*}(\theta\_2 : \eta\_1),\tag{72}$$

$$= -\log\lambda\_2 + \log 2 - 1 + \log\lambda\_1 + \frac{\lambda\_2}{\lambda\_1}, \tag{73}$$

$$= \log\frac{\lambda\_1}{\lambda\_2} + \frac{\lambda\_2}{\lambda\_1} + \log 2 - 1.\tag{74}$$
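The closed form of Example 4 can be compared against direct numerical integration of the KLD integral. The sketch below assumes the Laplacian density is normalized as (*λ*/2) exp(−*λ*|*x*|), which is what *F*<sub>2</sub>(*θ*) = −log *θ* + log 2 implies; the helper names and the truncation of the integration range at 50/*λ*<sub>1</sub> are choices made for this illustration.

```python
import math

def simpson(f, lo, hi, n=20000):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def itakura_saito(x, y):
    """Itakura-Saito divergence D_IS[x : y], Equation (69)."""
    return x / y - math.log(x / y) - 1.0

def kl_exp_laplace_closed(lam1, lam2):
    """Equation (68): KLD from Exp(lam1) to Laplace(lam2)."""
    return math.log(2.0) + math.log(lam1 / lam2) + lam2 / lam1 - 1.0

def kl_exp_laplace_numeric(lam1, lam2):
    """Integrate p log(p/q) over the support of p (the positive reals)."""
    def integrand(x):
        log_p = math.log(lam1) - lam1 * x
        log_q = math.log(lam2 / 2.0) - lam2 * x  # x >= 0, so |x| = x
        return math.exp(log_p) * (log_p - log_q)
    # The mass of p beyond 50 / lam1 is about exp(-50), hence negligible.
    return simpson(integrand, 0.0, 50.0 / lam1)

print(kl_exp_laplace_closed(2.0, 0.5), kl_exp_laplace_numeric(2.0, 0.5))
```

The same closed form also equals `itakura_saito(lam2, lam1) + math.log(2.0)`, in agreement with Equation (70).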

Next, consider the calculation of the KLD between a half-normal distribution and a (full) normal distribution:

**Example 5.** *Consider* E<sub>1</sub> *and* E<sub>2</sub> *to be the scale family of the half standard normal distributions and the scale family of the standard normal distribution, respectively. We have* *p̃*<sub>*θ*</sub>(*x*) = exp(−*x*<sup>2</sup>/(2*σ*<sup>2</sup>)) *with* *Z*<sub>1</sub>(*θ*) = *σ*√(π/2) *and* *Z*<sub>2</sub>(*θ*) = *σ*√(2π). *Let the sufficient statistic be* *t*(*x*) = −*x*<sup>2</sup>/2 *so that the natural parameter is* *θ* = 1/*σ*<sup>2</sup> ∈ R<sub>++</sub>. *Here, we have both* Θ<sub>1</sub> = Θ<sub>2</sub> = R<sub>++</sub>. *For this example, we check that* *Z*<sub>1</sub>(*θ*) = ½ *Z*<sub>2</sub>(*θ*). *We have* *F*<sub>1</sub>(*θ*) = −½ log *θ* + ½ log(π/2) *and* *F*<sub>2</sub>(*θ*) = −½ log *θ* + ½ log(2π) *(with* *F*<sub>2</sub>(*θ*) ≥ *F*<sub>1</sub>(*θ*)*). We have* *η* = −1/(2*θ*) = −*σ*<sup>2</sup>/2. *The KLD between two half scale normal distributions is*

$$D\_{\rm KL}[p\_{\theta\_1} : p\_{\theta\_2}] \quad = \quad B\_{F\_1}(\theta\_2 : \theta\_1), \tag{75}$$

$$=\frac{1}{2}\left(\log\frac{\sigma\_2^2}{\sigma\_1^2} + \frac{\sigma\_1^2}{\sigma\_2^2} - 1\right). \tag{76}$$

*Since* *F*<sub>1</sub>(*θ*) *and* *F*<sub>2</sub>(*θ*) *differ only by a constant and the Bregman divergence is invariant under the addition of an affine term to its generator, we have*

$$D\_{\rm KL}[q\_{\theta\_1} : q\_{\theta\_2}] \quad = \quad B\_{\rm F\_2}(\theta\_2 : \theta\_1), \tag{77}$$

$$= B\_{F\_1}(\theta\_2 : \theta\_1) = \frac{1}{2} \left( \log \frac{\sigma\_2^2}{\sigma\_1^2} + \frac{\sigma\_1^2}{\sigma\_2^2} - 1 \right). \tag{78}$$

*Moreover, we can interpret those Bregman divergences as half of the Itakura–Saito divergence:*

$$D\_{\rm KL}[p\_{\theta\_1} : p\_{\theta\_2}] = D\_{\rm KL}[q\_{\theta\_1} : q\_{\theta\_2}] = B\_{\rm F\_2}(\theta\_2 : \theta\_1) = \frac{1}{2} D\_{\rm IS}[\sigma\_1^2 : \sigma\_2^2]. \tag{79}$$

*It follows that*

$$D\_{\rm KL}[p\_{\theta\_1} : q\_{\theta\_2}] \quad = \quad B\_{\rm F\_2,F\_1}(\theta\_2 : \theta\_1) = F\_2(\theta\_2) - F\_1(\theta\_1) - (\theta\_2 - \theta\_1)^\top \nabla F\_1(\theta\_1), \tag{80}$$

$$=\frac{1}{2}\left(\log\frac{\sigma\_2^2}{\sigma\_1^2} + \frac{\sigma\_1^2}{\sigma\_2^2} + \log 4 - 1\right),\tag{81}$$

$$=\ \ \ D\_{\text{KL}}[q\_{\theta\_1} : q\_{\theta\_2}] + \log 2. \tag{82}$$

*Since* log 2 > 0, *we have* *D*<sub>KL</sub>[*p*<sub>*θ*<sub>1</sub></sub> : *q*<sub>*θ*<sub>2</sub></sub>] ≥ *D*<sub>KL</sub>[*q*<sub>*θ*<sub>1</sub></sub> : *q*<sub>*θ*<sub>2</sub></sub>].
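Equations (81) and (82) can be checked the same way. This is a sketch under the parameterization of Example 5, with hypothetical helper names and an integration range cut at 12*σ*<sub>1</sub>, beyond which the half-normal mass is negligible.

```python
import math

def simpson(f, lo, hi, n=20000):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def kl_normal_scale(s1, s2):
    """Equation (78): KLD between two zero-mean normals with scales s1, s2."""
    return 0.5 * (math.log(s2**2 / s1**2) + s1**2 / s2**2 - 1.0)

def kl_half_vs_full_closed(s1, s2):
    """Equation (81): KLD from a half-normal (scale s1) to a normal (scale s2)."""
    return 0.5 * (math.log(s2**2 / s1**2) + s1**2 / s2**2 + math.log(4.0) - 1.0)

def kl_half_vs_full_numeric(s1, s2):
    """Integrate p log(p/q) over [0, 12 s1], with Z1 and Z2 as in Example 5."""
    log_z1 = math.log(s1 * math.sqrt(math.pi / 2.0))
    log_z2 = math.log(s2 * math.sqrt(2.0 * math.pi))
    def integrand(x):
        log_p = -x * x / (2.0 * s1 * s1) - log_z1
        log_q = -x * x / (2.0 * s2 * s2) - log_z2
        return math.exp(log_p) * (log_p - log_q)
    return simpson(integrand, 0.0, 12.0 * s1)

print(kl_half_vs_full_closed(1.3, 0.8), kl_half_vs_full_numeric(1.3, 0.8))
```

The gap between `kl_half_vs_full_closed` and `kl_normal_scale` is the constant log 2 of Equation (82), independent of the two scales.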

Thus the Kullback–Leibler divergence between a truncated density and another density of the same exponential family amounts to calculating a duo Bregman divergence on the reverse parameter order: *D*<sub>KL</sub>[*p*<sub>*θ*<sub>1</sub></sub> : *q*<sub>*θ*<sub>2</sub></sub>] = *B*<sub>*F*<sub>2</sub>,*F*<sub>1</sub></sub>(*θ*<sub>2</sub> : *θ*<sub>1</sub>). Let *D*<sup>∗</sup><sub>KL</sub>[*p* : *q*] := *D*<sub>KL</sub>[*q* : *p*] be the reverse Kullback–Leibler divergence. Then *D*<sup>∗</sup><sub>KL</sub>[*q*<sub>*θ*<sub>2</sub></sub> : *p*<sub>*θ*<sub>1</sub></sub>] = *B*<sub>*F*<sub>2</sub>,*F*<sub>1</sub></sub>(*θ*<sub>2</sub> : *θ*<sub>1</sub>).

Notice that truncated exponential families are also exponential families but those exponential families may be non-steep [25].

Let E<sub>1</sub> = {*p*<sup>*a*<sub>1</sub>,*b*<sub>1</sub></sup><sub>*θ*</sub>} and E<sub>2</sub> = {*p*<sup>*a*<sub>2</sub>,*b*<sub>2</sub></sup><sub>*θ*</sub>} be two truncated exponential families of the exponential family E = {*p*<sub>*θ*</sub> = d*P*<sub>*θ*</sub>/d*μ*} with log-normalizer *F*(*θ*) such that

$$p\_{\theta}^{a\_i, b\_i}(\mathbf{x}) = \frac{p\_{\theta}(\mathbf{x})}{Z\_{a\_i, b\_i}(\theta)},\tag{83}$$

with *Z*<sub>*a*<sub>*i*</sub>,*b*<sub>*i*</sub></sub>(*θ*) = Φ<sub>*θ*</sub>(*b*<sub>*i*</sub>) − Φ<sub>*θ*</sub>(*a*<sub>*i*</sub>), where Φ<sub>*θ*</sub>(*x*) denotes the CDF of *p*<sub>*θ*</sub>(*x*). Then the log-normalizer of E<sub>*i*</sub> is *F*<sub>*i*</sub>(*θ*) = *F*(*θ*) + log(Φ<sub>*θ*</sub>(*b*<sub>*i*</sub>) − Φ<sub>*θ*</sub>(*a*<sub>*i*</sub>)) for *i* ∈ {1, 2}.

**Corollary 1** (Kullback–Leibler divergence between densities of truncated exponential families)**.** *Let* E<sub>*i*</sub> = {*p*<sup>*a*<sub>*i*</sub>,*b*<sub>*i*</sub></sup><sub>*θ*</sub>} *be truncated exponential families of the exponential family* E = {*p*<sub>*θ*</sub>} *with support* X<sub>*i*</sub> = [*a*<sub>*i*</sub>, *b*<sub>*i*</sub>] ⊂ X *(where* X *denotes the support of* E*) for* *i* ∈ {1, 2}. *Then the Kullback–Leibler divergence between* *p*<sup>*a*<sub>1</sub>,*b*<sub>1</sub></sup><sub>*θ*<sub>1</sub></sub> *and* *p*<sup>*a*<sub>2</sub>,*b*<sub>2</sub></sup><sub>*θ*<sub>2</sub></sub> *is infinite if* [*a*<sub>1</sub>, *b*<sub>1</sub>] ⊄ [*a*<sub>2</sub>, *b*<sub>2</sub>] *and has the following formula when* [*a*<sub>1</sub>, *b*<sub>1</sub>] ⊆ [*a*<sub>2</sub>, *b*<sub>2</sub>]*:*

$$D\_{\rm KL} \left[ p\_{\theta\_1}^{a\_1, b\_1} \, : \, p\_{\theta\_2}^{a\_2, b\_2} \right] = D\_{\rm KL} \left[ p\_{\theta\_1}^{a\_1, b\_1} \, : \, p\_{\theta\_2}^{a\_1, b\_1} \right] + \log \frac{Z\_{a\_2, b\_2}(\theta\_2)}{Z\_{a\_1, b\_1}(\theta\_2)}.\tag{84}$$

**Proof.** We have *p*<sup>*a*<sub>1</sub>,*b*<sub>1</sub></sup><sub>*θ*</sub> = *p*<sub>*θ*</sub> / *Z*<sub>*a*<sub>1</sub>,*b*<sub>1</sub></sub>(*θ*) and *p*<sup>*a*<sub>2</sub>,*b*<sub>2</sub></sup><sub>*θ*</sub> = *p*<sub>*θ*</sub> / *Z*<sub>*a*<sub>2</sub>,*b*<sub>2</sub></sub>(*θ*). Therefore, on X<sub>1</sub>, *p*<sup>*a*<sub>2</sub>,*b*<sub>2</sub></sup><sub>*θ*</sub> = *p*<sup>*a*<sub>1</sub>,*b*<sub>1</sub></sup><sub>*θ*</sub> *Z*<sub>*a*<sub>1</sub>,*b*<sub>1</sub></sub>(*θ*) / *Z*<sub>*a*<sub>2</sub>,*b*<sub>2</sub></sub>(*θ*). Thus we have

$$D\_{\rm KL}[p\_{\theta\_1}^{a\_1, b\_1} : p\_{\theta\_2}^{a\_2, b\_2}] = \int\_{\mathcal{X}\_1} p\_{\theta\_1}^{a\_1, b\_1}(\mathbf{x}) \log \frac{p\_{\theta\_1}^{a\_1, b\_1}(\mathbf{x})}{p\_{\theta\_2}^{a\_2, b\_2}(\mathbf{x})} \,\mathrm{d}\mu(\mathbf{x}),\tag{85}$$

$$=\int\_{\mathcal{X}\_1} p\_{\theta\_1}^{a\_1,b\_1}(\mathbf{x}) \log \frac{p\_{\theta\_1}^{a\_1,b\_1}(\mathbf{x})}{p\_{\theta\_2}^{a\_1,b\_1}(\mathbf{x})} \,\mathrm{d}\mu(\mathbf{x}) + \log \frac{Z\_{a\_2,b\_2}(\theta\_2)}{Z\_{a\_1,b\_1}(\theta\_2)},\tag{86}$$

$$= D\_{\rm KL}[p\_{\theta\_1}^{a\_1, b\_1} : p\_{\theta\_2}^{a\_1, b\_1}] + \log \frac{Z\_{a\_2, b\_2}(\theta\_2)}{Z\_{a\_1, b\_1}(\theta\_2)}.\tag{87}$$

Thus the KLD between truncated exponential family densities *p*<sup>*a*<sub>1</sub>,*b*<sub>1</sub></sup><sub>*θ*<sub>1</sub></sub> and *p*<sup>*a*<sub>2</sub>,*b*<sub>2</sub></sup><sub>*θ*<sub>2</sub></sub> amounts to the KLD between the densities with the same truncation parameters plus an additive term: the log ratio of the masses of the two truncated supports, evaluated at *θ*<sub>2</sub>. We shall illustrate with two examples the calculation of the KLD between truncated exponential families.

**Example 6.** *Consider the calculation of the KLD between a truncated exponential distribution* *p*<sup>*a*<sub>1</sub>,*b*<sub>1</sub></sup><sub>*λ*<sub>1</sub></sub> *with support* X<sub>1</sub> = [*a*<sub>1</sub>, *b*<sub>1</sub>] *(b*<sub>1</sub> > *a*<sub>1</sub> ≥ 0*) and another truncated exponential distribution* *p*<sup>*a*<sub>2</sub>,*b*<sub>2</sub></sup><sub>*λ*<sub>2</sub></sub> *with support* X<sub>2</sub> = [*a*<sub>2</sub>, *b*<sub>2</sub>] *(b*<sub>2</sub> > *a*<sub>2</sub> ≥ 0*). We have* *p*<sub>*λ*</sub>(*x*) = *λ* exp(−*λx*) *(density of the untruncated exponential family with natural parameter* *θ* = *λ*, *sufficient statistic* *t*(*x*) = −*x* *and log-normalizer* *F*(*θ*) = −log *θ**),* *p*<sup>*a*<sub>1</sub>,*b*<sub>1</sub></sup><sub>*λ*<sub>1</sub></sub>(*x*) = *p*<sub>*λ*<sub>1</sub></sub>(*x*) / *Z*<sub>*a*<sub>1</sub>,*b*<sub>1</sub></sub>(*λ*<sub>1</sub>), *and* *p*<sup>*a*<sub>2</sub>,*b*<sub>2</sub></sup><sub>*λ*<sub>2</sub></sub>(*x*) = *p*<sub>*λ*<sub>2</sub></sub>(*x*) / *Z*<sub>*a*<sub>2</sub>,*b*<sub>2</sub></sub>(*λ*<sub>2</sub>). *Let* Φ<sub>*λ*</sub>(*x*) = 1 − exp(−*λx*) *denote the cumulative distribution function of the exponential distribution. We have* *Z*<sub>*a*,*b*</sub>(*λ*) = Φ<sub>*λ*</sub>(*b*) − Φ<sub>*λ*</sub>(*a*) *and*

$$F\_{a,b}(\lambda) = F(\lambda) + \log(\Phi\_{\lambda}(b) - \Phi\_{\lambda}(a)) = -\log \lambda + \log(e^{-\lambda a} - e^{-\lambda b}).\tag{88}$$

*If* [*a*<sub>1</sub>, *b*<sub>1</sub>] ⊄ [*a*<sub>2</sub>, *b*<sub>2</sub>] *then* *D*<sub>KL</sub>[*p*<sup>*a*<sub>1</sub>,*b*<sub>1</sub></sup><sub>*λ*<sub>1</sub></sub> : *p*<sup>*a*<sub>2</sub>,*b*<sub>2</sub></sup><sub>*λ*<sub>2</sub></sub>] = +∞. *Otherwise,* [*a*<sub>1</sub>, *b*<sub>1</sub>] ⊆ [*a*<sub>2</sub>, *b*<sub>2</sub>], *and the family* E<sub>1</sub> *is a truncated exponential family of* E<sub>2</sub>. *Using the computer algebra system Maxima (https://maxima.sourceforge.io/ accessed on 15 March 2022), we find that*

$$-E\_{p^{a,b}\_{\lambda}}[x] = \frac{(1+\lambda b)e^{\lambda a} - (1+\lambda a)e^{\lambda b}}{\lambda(e^{\lambda b} - e^{\lambda a})} = F\_{a,b}'(\lambda). \tag{89}$$

*Thus we have:*

$$\begin{split} D\_{\mathrm{KL}}[p^{a\_1,b\_1}\_{\lambda\_1} : p^{a\_2,b\_2}\_{\lambda\_2}] &= B\_{F\_2,F\_1}(\theta\_2 : \theta\_1), \\ &= F\_{a\_2,b\_2}(\lambda\_2) - F\_{a\_1,b\_1}(\lambda\_1) - (\lambda\_2 - \lambda\_1) F\_{a\_1,b\_1}'(\lambda\_1), \\ &= \log\frac{\lambda\_1}{\lambda\_2} + (\lambda\_2 - \lambda\_1) E\_{p^{a\_1,b\_1}\_{\lambda\_1}}[x] + \log \frac{e^{-\lambda\_2 a\_2} - e^{-\lambda\_2 b\_2}}{e^{-\lambda\_1 a\_1} - e^{-\lambda\_1 b\_1}}. \end{split} \tag{91}$$

*When* *a*<sub>1</sub> = *a*<sub>2</sub> = 0 *and* *b*<sub>1</sub> = *b*<sub>2</sub> = +∞, *we recover the KLD between two exponential distributions* *p*<sub>*λ*<sub>1</sub></sub> *and* *p*<sub>*λ*<sub>2</sub></sub>*:*

$$D\_{\rm KL}[p\_{\lambda\_1} : p\_{\lambda\_2}] \quad = \quad B\_{\rm F}(\lambda\_2 : \lambda\_1), \tag{92}$$

$$= F(\theta\_2) - F(\theta\_1) - (\theta\_2 - \theta\_1)F'(\theta\_1), \tag{93}$$

$$=\ \frac{\lambda\_2}{\lambda\_1} - \log \frac{\lambda\_2}{\lambda\_1} - 1 = D\_{\text{IS}}[\lambda\_2 : \lambda\_1]. \tag{94}$$

*Note that the KLD between two truncated exponential distributions with the same truncation support* X = [*a*, *b*] *is*

$$D\_{\rm KL}[p^{a,b}\_{\lambda\_1} : p^{a,b}\_{\lambda\_2}] = \log \frac{\lambda\_1}{\lambda\_2} + \log \frac{\Phi\_{\lambda\_2}(b) - \Phi\_{\lambda\_2}(a)}{\Phi\_{\lambda\_1}(b) - \Phi\_{\lambda\_1}(a)} + (\lambda\_2 - \lambda\_1) E\_{p^{a,b}\_{\lambda\_1}}[x]. \tag{95}$$

*We also check Corollary 1:*

$$D\_{\rm KL}[p^{a\_1, b\_1}\_{\lambda\_1} : p^{a\_2, b\_2}\_{\lambda\_2}] = D\_{\rm KL}[p^{a\_1, b\_1}\_{\lambda\_1} : p^{a\_1, b\_1}\_{\lambda\_2}] + \log \frac{Z\_{a\_2, b\_2}(\lambda\_2)}{Z\_{a\_1, b\_1}(\lambda\_2)}\tag{96}$$
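The formulas of Example 6 can be exercised numerically. The sketch below evaluates the duo Bregman form *F*<sub>*a*<sub>2</sub>,*b*<sub>2</sub></sub>(*λ*<sub>2</sub>) − *F*<sub>*a*<sub>1</sub>,*b*<sub>1</sub></sub>(*λ*<sub>1</sub>) + (*λ*<sub>2</sub> − *λ*<sub>1</sub>) *E*[*x*] using Equations (88) and (89), compares it with a direct integration of the KLD, and checks the decomposition of Corollary 1. All function names are hypothetical and finite truncation bounds are assumed.

```python
import math

def simpson(f, lo, hi, n=20000):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def log_mass(lam, a, b):
    """log Z_{a,b}(lam) = log(Phi_lam(b) - Phi_lam(a)) for finite a < b."""
    return math.log(math.exp(-lam * a) - math.exp(-lam * b))

def log_normalizer(lam, a, b):
    """F_{a,b}(lam), Equation (88)."""
    return -math.log(lam) + log_mass(lam, a, b)

def mean(lam, a, b):
    """E[x] of the exponential truncated to [a, b], i.e. -F'_{a,b}(lam) of Eq. (89)."""
    ea, eb = math.exp(lam * a), math.exp(lam * b)
    return -((1 + lam * b) * ea - (1 + lam * a) * eb) / (lam * (eb - ea))

def kl_closed(lam1, a1, b1, lam2, a2, b2):
    """Duo Bregman form of the KLD; infinite when the supports are not nested."""
    if a1 < a2 or b1 > b2:
        return math.inf
    return (log_normalizer(lam2, a2, b2) - log_normalizer(lam1, a1, b1)
            + (lam2 - lam1) * mean(lam1, a1, b1))

def kl_numeric(lam1, a1, b1, lam2, a2, b2):
    """Direct integration of p log(p/q) over [a1, b1] (nested supports assumed)."""
    def integrand(x):
        log_p = math.log(lam1) - lam1 * x - log_mass(lam1, a1, b1)
        log_q = math.log(lam2) - lam2 * x - log_mass(lam2, a2, b2)
        return math.exp(log_p) * (log_p - log_q)
    return simpson(integrand, a1, b1)

args = (1.2, 0.5, 2.0, 0.8, 0.0, 3.0)  # (lam1, a1, b1, lam2, a2, b2)
print(kl_closed(*args), kl_numeric(*args))
```

Corollary 1 corresponds to `kl_closed(lam1, a1, b1, lam2, a1, b1) + log_mass(lam2, a2, b2) - log_mass(lam2, a1, b1)`, which should return the same value as `kl_closed(*args)`.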

The next example shows how to compute the Kullback–Leibler divergence between two truncated normal distributions:

**Example 7.** *Let* *N*<sub>*a*,*b*</sub>(*m*, *s*) *denote a truncated normal distribution with support the open interval* (*a*, *b*) *(a* < *b) and probability density function defined by:*

$$p\_{m,s}^{a,b}(\mathbf{x}) = \frac{1}{Z\_{a,b}(m,s)} \exp\left(-\frac{(\mathbf{x}-m)^2}{2s^2}\right),\tag{97}$$

*where* *Z*<sub>*a*,*b*</sub>(*m*, *s*) *is related to the partition function [26] expressed using the cumulative distribution function (CDF)* Φ<sub>*m*,*s*</sub>(*x*)*:*

$$Z\_{a,b}(m,s) = \sqrt{2\pi}\, s \left(\Phi\_{m,s}(b) - \Phi\_{m,s}(a)\right),\tag{98}$$

*with*

$$\Phi\_{m,s}(\mathbf{x}) = \frac{1}{2} \left( 1 + \text{erf} \left( \frac{\mathbf{x} - m}{\sqrt{2}s} \right) \right), \tag{99}$$

*where* erf(*x*) *is the error function:*

$$\text{erf}(\mathbf{x}) := \frac{2}{\sqrt{\pi}} \int\_0^\mathbf{x} e^{-t^2} \,\mathrm{d}t. \tag{100}$$

*Thus we have* erf(*x*) = 2Φ(√2 *x*) − 1 *where* Φ(*x*) = Φ<sub>0,1</sub>(*x*). *The pdf can also be written as*

$$p\_{m,s}^{a,b}(\boldsymbol{x}) = \frac{1}{s} \frac{\phi\left(\frac{\boldsymbol{x}-\boldsymbol{m}}{s}\right)}{\Phi\left(\frac{b-\boldsymbol{m}}{s}\right) - \Phi\left(\frac{a-\boldsymbol{m}}{s}\right)}\tag{101}$$

*where* *φ*(*x*) *denotes the standard normal pdf (φ*(*x*) = *p*<sup>−∞,+∞</sup><sub>0,1</sub>(*x*)*):*

$$\phi(\mathbf{x}) := \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\mathbf{x}^2}{2}\right),\tag{102}$$

*and* Φ(*x*) = Φ<sub>0,1</sub>(*x*) = ∫<sub>−∞</sub><sup>*x*</sup> *φ*(*t*) d*t* *is the standard normal CDF. When* *a* = −∞ *and* *b* = +∞, *we have* *Z*<sub>−∞,∞</sub>(*m*, *s*) = √(2π) *s* *since* Φ(−∞) = 0 *and* Φ(+∞) = 1.

*The density* *p*<sup>*a*,*b*</sup><sub>*m*,*s*</sub>(*x*) *belongs to an exponential family* E<sub>*a*,*b*</sub> *with natural parameter* *θ* = (*m*/*s*<sup>2</sup>, −1/(2*s*<sup>2</sup>)), *sufficient statistics* *t*(*x*) = (*x*, *x*<sup>2</sup>), *and log-normalizer:*

$$F\_{a,b}(\theta) = -\frac{\theta\_1^2}{4\theta\_2} + \log Z\_{a,b}(\theta) \tag{103}$$

*The natural parameter space is* Θ = R × R<sub>−−</sub> *where* R<sub>−−</sub> = {*x* ∈ R : *x* < 0} *denotes the set of negative real numbers.*

*The log-normalizer can be expressed using the source parameters* (*m*,*s*) *(which are not the mean and standard deviation when the support is truncated, hence the notation m and s):*

$$F\_{a,b}(m,s) \quad = \quad \frac{m^2}{2s^2} + \log Z\_{a,b}(m,s),\tag{104}$$

$$=\frac{m^2}{2s^2} + \frac{1}{2}\log (2\pi s^2) + \log(\Phi\_{m,s}(b) - \Phi\_{m,s}(a)).\tag{105}$$

*We shall use the fact that the gradient of the log-normalizer of any exponential family distribution amounts to the expectation of the sufficient statistics [1]:*

$$\nabla F\_{a,b}(\theta) = E\_{p^{a,b}\_{m,s}}[t(\mathbf{x})] = \eta. \tag{106}$$

*Parameter η is called the moment or expectation parameter [1].*

*The mean* *μ*(*m*, *s*; *a*, *b*) = *E*<sub>*p*<sup>*a*,*b*</sup><sub>*m*,*s*</sub></sub>[*x*] = ∂*F*<sub>*a*,*b*</sub>(*θ*)/∂*θ*<sub>1</sub> *and the variance* *σ*<sup>2</sup>(*m*, *s*; *a*, *b*) = *E*<sub>*p*<sup>*a*,*b*</sup><sub>*m*,*s*</sub></sub>[*x*<sup>2</sup>] − *μ*<sup>2</sup> *(with* *E*<sub>*p*<sup>*a*,*b*</sup><sub>*m*,*s*</sub></sub>[*x*<sup>2</sup>] = ∂*F*<sub>*a*,*b*</sub>(*θ*)/∂*θ*<sub>2</sub>*) of the truncated normal* *p*<sup>*a*,*b*</sup><sub>*m*,*s*</sub> *can be expressed using the following formula [26,27] (page 25):*

$$\mu(m, s; a, b) = m - s \frac{\phi(\beta) - \phi(\alpha)}{\Phi(\beta) - \Phi(\alpha)},\tag{107}$$

$$\sigma^2(m, s; a, b) = s^2 \left( 1 - \frac{\beta \phi(\beta) - \alpha \phi(\alpha)}{\Phi(\beta) - \Phi(\alpha)} - \left( \frac{\phi(\beta) - \phi(\alpha)}{\Phi(\beta) - \Phi(\alpha)} \right)^2 \right), \tag{108}$$

*where* *α* := (*a* − *m*)/*s* *and* *β* := (*b* − *m*)/*s*. *Thus we have the following moment parameter* *η* = (*η*<sub>1</sub>, *η*<sub>2</sub>) *with*

$$\eta\_1(m, s; a, b) = E\_{p\_{m,s}^{a,b}}[x] = \mu(m, s; a, b), \tag{109}$$

$$\eta\_2(m, s; a, b) = E\_{p\_{m,s}^{a,b}}[x^2] = \sigma^2(m, s; a, b) + \mu^2(m, s; a, b). \tag{110}$$
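The moment parameters of Equations (107)–(110) can be computed with the standard library error function. The following sketch assumes finite truncation bounds and uses hypothetical function names; it compares the closed forms with moments obtained by numerical integration of the density of Equation (97).

```python
import math

def simpson(f, lo, hi, n=20000):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def phi(x):
    """Standard normal pdf, Equation (102)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF via erf, Equation (99) with m = 0, s = 1."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def truncnorm_moments(m, s, a, b):
    """Return (eta1, eta2) = (E[x], E[x^2]) from Equations (107)-(110)."""
    alpha, beta = (a - m) / s, (b - m) / s
    mass = Phi(beta) - Phi(alpha)
    r = (phi(beta) - phi(alpha)) / mass
    mu = m - s * r
    var = s * s * (1.0 - (beta * phi(beta) - alpha * phi(alpha)) / mass - r * r)
    return mu, var + mu * mu

def truncnorm_moments_numeric(m, s, a, b):
    """Same moments by integrating against the density of Equation (97)."""
    z = math.sqrt(2.0 * math.pi) * s * (Phi((b - m) / s) - Phi((a - m) / s))
    p = lambda x: math.exp(-(x - m) ** 2 / (2.0 * s * s)) / z
    return simpson(lambda x: x * p(x), a, b), simpson(lambda x: x * x * p(x), a, b)

print(truncnorm_moments(0.3, 1.1, -1.0, 2.0))
print(truncnorm_moments_numeric(0.3, 1.1, -1.0, 2.0))
```

Evaluating Φ as 0.5 (1 + erf) loses relative precision far in the tails, so this sketch is only meant for truncation intervals that carry non-negligible mass.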

*Now consider two truncated normal distributions* *p*<sup>*a*<sub>1</sub>,*b*<sub>1</sub></sup><sub>*m*<sub>1</sub>,*s*<sub>1</sub></sub> *and* *p*<sup>*a*<sub>2</sub>,*b*<sub>2</sub></sup><sub>*m*<sub>2</sub>,*s*<sub>2</sub></sub> *with* [*a*<sub>1</sub>, *b*<sub>1</sub>] ⊆ [*a*<sub>2</sub>, *b*<sub>2</sub>] *(otherwise, we have* *D*<sub>KL</sub>[*p*<sup>*a*<sub>1</sub>,*b*<sub>1</sub></sup><sub>*m*<sub>1</sub>,*s*<sub>1</sub></sub> : *p*<sup>*a*<sub>2</sub>,*b*<sub>2</sub></sup><sub>*m*<sub>2</sub>,*s*<sub>2</sub></sub>] = +∞*). Then the KLD between* *p*<sup>*a*<sub>1</sub>,*b*<sub>1</sub></sup><sub>*m*<sub>1</sub>,*s*<sub>1</sub></sub> *and* *p*<sup>*a*<sub>2</sub>,*b*<sub>2</sub></sup><sub>*m*<sub>2</sub>,*s*<sub>2</sub></sub> *is equivalent to a duo Bregman divergence, where* *θ*<sub>1</sub> *and* *θ*<sub>2</sub> *now denote the natural parameter vectors of the two distributions:*

$$\begin{split} D\_{\mathrm{KL}}[p\_{m\_1,s\_1}^{a\_1,b\_1} : p\_{m\_2,s\_2}^{a\_2,b\_2}] &= F\_{a\_2,b\_2}(\theta\_2) - F\_{a\_1,b\_1}(\theta\_1) - (\theta\_2 - \theta\_1)^{\top} \nabla F\_{a\_1,b\_1}(\theta\_1), \\ &= \frac{m\_2^2}{2s\_2^2} - \frac{m\_1^2}{2s\_1^2} + \log \frac{Z\_{a\_2,b\_2}(m\_2, s\_2)}{Z\_{a\_1,b\_1}(m\_1, s\_1)} - \left(\frac{m\_2}{s\_2^2} - \frac{m\_1}{s\_1^2}\right) \eta\_1(m\_1, s\_1; a\_1, b\_1) \\ &\quad - \left(\frac{1}{2s\_1^2} - \frac{1}{2s\_2^2}\right) \eta\_2(m\_1, s\_1; a\_1, b\_1). \end{split} \tag{111}$$

*Note that* *F*<sub>*a*<sub>2</sub>,*b*<sub>2</sub></sub>(*θ*) ≥ *F*<sub>*a*<sub>1</sub>,*b*<sub>1</sub></sub>(*θ*).

*This formula is valid for (1) the KLD between two truncated normal distributions, or for (2) the KLD between a truncated normal distribution and a (full support) normal distribution. Note that the formula depends on the erf function used in function* Φ. *Furthermore, when* *a*<sub>1</sub> = *a*<sub>2</sub> = −∞ *and* *b*<sub>1</sub> = *b*<sub>2</sub> = +∞, *we recover (3) the KLD between two univariate normal distributions, since* log(*Z*<sub>*a*<sub>2</sub>,*b*<sub>2</sub></sub>(*m*<sub>2</sub>, *s*<sub>2</sub>) / *Z*<sub>*a*<sub>1</sub>,*b*<sub>1</sub></sub>(*m*<sub>1</sub>, *s*<sub>1</sub>)) = log(*s*<sub>2</sub>/*s*<sub>1</sub>) = ½ log(*s*<sub>2</sub><sup>2</sup>/*s*<sub>1</sub><sup>2</sup>)*:*

$$D\_{\mathrm{KL}}[p\_{m\_1,s\_1} : p\_{m\_2,s\_2}] = \frac{1}{2} \left( \log \frac{s\_2^2}{s\_1^2} + \frac{s\_1^2}{s\_2^2} + \frac{(m\_2 - m\_1)^2}{s\_2^2} - 1 \right). \tag{112}$$

*Note that for full support normal distributions, we have* *μ*(*m*, *s*; −∞, +∞) = *m* *and* *σ*<sup>2</sup>(*m*, *s*; −∞, +∞) = *s*<sup>2</sup>.
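The duo Bregman expression for truncated normals can be sketched in a few lines. The code below starts from the log-normalizer *F*<sub>*a*,*b*</sub>(*m*, *s*) = *m*<sup>2</sup>/(2*s*<sup>2</sup>) + log *Z*<sub>*a*,*b*</sub>(*m*, *s*) of Equation (104) and the moments of Equations (107)–(110), compares the result with a direct integration, and checks the full-support limit of Equation (112). Function names are hypothetical, and the guard that sets *xφ*(*x*) to 0 at infinite bounds is an implementation choice of this illustration.

```python
import math

def simpson(f, lo, hi, n=20000):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def x_phi(x):
    """x * phi(x), with the limit 0 taken at infinite x (avoids inf * 0)."""
    return 0.0 if math.isinf(x) else x * phi(x)

def log_Z(m, s, a, b):
    """log Z_{a,b}(m, s), Equation (98)."""
    return math.log(math.sqrt(2.0 * math.pi) * s * (Phi((b - m) / s) - Phi((a - m) / s)))

def moments(m, s, a, b):
    """(eta1, eta2) from Equations (107)-(110)."""
    alpha, beta = (a - m) / s, (b - m) / s
    mass = Phi(beta) - Phi(alpha)
    r = (phi(beta) - phi(alpha)) / mass
    mu = m - s * r
    var = s * s * (1.0 - (x_phi(beta) - x_phi(alpha)) / mass - r * r)
    return mu, var + mu * mu

def kl_truncnorm(m1, s1, a1, b1, m2, s2, a2, b2):
    """Duo Bregman form of the KLD; infinite when (a1, b1) is not inside (a2, b2)."""
    if a1 < a2 or b1 > b2:
        return math.inf
    eta1, eta2 = moments(m1, s1, a1, b1)
    return (m2**2 / (2 * s2**2) - m1**2 / (2 * s1**2)
            + log_Z(m2, s2, a2, b2) - log_Z(m1, s1, a1, b1)
            - (m2 / s2**2 - m1 / s1**2) * eta1
            - (1 / (2 * s1**2) - 1 / (2 * s2**2)) * eta2)

def kl_truncnorm_numeric(m1, s1, a1, b1, m2, s2, a2, b2):
    """Direct integration of p log(p/q) over (a1, b1), finite bounds assumed."""
    lz1, lz2 = log_Z(m1, s1, a1, b1), log_Z(m2, s2, a2, b2)
    def integrand(x):
        log_p = -(x - m1) ** 2 / (2 * s1 * s1) - lz1
        log_q = -(x - m2) ** 2 / (2 * s2 * s2) - lz2
        return math.exp(log_p) * (log_p - log_q)
    return simpson(integrand, a1, b1)

def kl_normal(m1, s1, m2, s2):
    """Equation (112): KLD between two full-support univariate normals."""
    return 0.5 * (math.log(s2**2 / s1**2) + s1**2 / s2**2 + (m2 - m1) ** 2 / s2**2 - 1.0)

args = (0.3, 1.1, -1.0, 2.0, -0.2, 1.7, -3.0, 4.0)
print(kl_truncnorm(*args), kl_truncnorm_numeric(*args))
```

Calling `kl_truncnorm` with `-math.inf` and `math.inf` as bounds for both distributions should reproduce `kl_normal` up to rounding.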

*The entropy of a truncated normal distribution (an exponential family [28]) is* *h*[*p*<sup>*a*,*b*</sup><sub>*m*,*s*</sub>] = −∫<sub>*a*</sub><sup>*b*</sup> *p*<sup>*a*,*b*</sup><sub>*m*,*s*</sub>(*x*) log *p*<sup>*a*,*b*</sup><sub>*m*,*s*</sub>(*x*) d*x* = −*F*<sup>∗</sup>(*η*) = *F*(*θ*) − *θ*<sup>⊤</sup>*η*. *We find that*

$$h[p\_{m,s}^{a,b}] = \log\left(\sqrt{2\pi e}\, s\left(\Phi(\beta) - \Phi(\alpha)\right)\right) + \frac{\alpha\phi(\alpha) - \beta\phi(\beta)}{2(\Phi(\beta) - \Phi(\alpha))}.\tag{113}$$

*When* (*a*, *b*) = (−∞, ∞), *we have* Φ(*β*) − Φ(*α*) = 1 *and* *αφ*(*α*) − *βφ*(*β*) = 0 *since* *β* = −*α*, *φ*(−*x*) = *φ*(*x*) *(an even function), and* lim<sub>*β*→+∞</sub> *βφ*(*β*) = 0. *Thus we recover the differential entropy of a normal distribution:* *h*[*p*<sub>*μ*,*σ*</sub>] = log(√(2π*e*) *σ*).
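The entropy formula of Equation (113) can be checked against a numerical evaluation of −∫ *p* log *p*. This is a sketch with hypothetical function names; as before, *xφ*(*x*) is set to its limit 0 at infinite bounds, which is an implementation choice of this illustration.

```python
import math

def simpson(f, lo, hi, n=20000):
    """Composite Simpson rule on [lo, hi] with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def x_phi(x):
    """x * phi(x), with the limit 0 taken at infinite x (avoids inf * 0)."""
    return 0.0 if math.isinf(x) else x * phi(x)

def entropy_truncnorm(m, s, a, b):
    """Equation (113): differential entropy of the normal truncated to (a, b)."""
    alpha, beta = (a - m) / s, (b - m) / s
    mass = Phi(beta) - Phi(alpha)
    return (math.log(math.sqrt(2.0 * math.pi * math.e) * s * mass)
            + (x_phi(alpha) - x_phi(beta)) / (2.0 * mass))

def entropy_truncnorm_numeric(m, s, a, b):
    """-integral of p log p over (a, b), finite bounds assumed."""
    log_z = math.log(math.sqrt(2.0 * math.pi) * s * (Phi((b - m) / s) - Phi((a - m) / s)))
    def integrand(x):
        log_p = -(x - m) ** 2 / (2.0 * s * s) - log_z
        return -math.exp(log_p) * log_p
    return simpson(integrand, a, b)

print(entropy_truncnorm(0.3, 1.1, -1.0, 2.0), entropy_truncnorm_numeric(0.3, 1.1, -1.0, 2.0))
print(entropy_truncnorm(0.0, 2.0, -math.inf, math.inf))  # full-support case
```

With infinite bounds the second term vanishes and the function returns log(√(2π*e*) *s*), the entropy of the untruncated normal.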
