# **Divergence Measures Mathematical Foundations and Applications in Information-Theoretic and Statistical Problems**

Edited by Igal Sason Printed Edition of the Special Issue Published in *Entropy*

www.mdpi.com/journal/entropy

## **Divergence Measures: Mathematical Foundations and Applications in Information-Theoretic and Statistical Problems**

Editor

**Igal Sason**

MDPI · Basel · Beijing · Wuhan · Barcelona · Belgrade · Manchester · Tokyo · Cluj · Tianjin

*Editor* Igal Sason Andrew & Erna Viterbi Faculty of Electrical and Computer Engineering and the Faculty of Mathematics Technion – Israel Institute of Technology Haifa Israel

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Entropy* (ISSN 1099-4300) (available at: www.mdpi.com/journal/entropy/special_issues/divergence).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-4332-1 (Hbk) ISBN 978-3-0365-4331-4 (PDF)**

© 2022 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**


## **About the Editor**

#### **Igal Sason**

Igal Sason received his B.Sc. and Ph.D. degrees in electrical engineering from the Technion—Israel Institute of Technology, Haifa, Israel, in 1992 and 2001, respectively. During 1993–1997, he worked in Israel as a communication engineer. During 2001–2003, he was a scientific collaborator at the School of Computer and Communication Sciences at EPFL, Lausanne, Switzerland. Since 2003, he has been a faculty member at the Andrew and Erna Viterbi Faculty of Electrical and Computer Engineering of the Technion, where he is currently a professor. Since 2021, he has also been a professor at the Faculty of Mathematics of the Technion. His research interests are in information theory and its aspects in combinatorics and probability theory. He served on the editorial board of the IEEE Transactions on Information Theory for a total of 10 years, including terms as the Executive Editor and Editor-in-Chief. He also served as a Guest Editor of two Special Issues of the *Entropy* journal (MDPI) during 2019–2022. I. Sason has been a member of the IEEE since 1998, and he is an IEEE Fellow of the Information Theory Society (effective from 2019) for contributions to the achievable rate region of the Gaussian interference channel and the analysis of low-complexity capacity-achieving linear codes.

## *Editorial* **Divergence Measures: Mathematical Foundations and Applications in Information-Theoretic and Statistical Problems**

**Igal Sason 1,2**


Data science, information theory, probability theory, statistical learning, statistical signal processing, and other related disciplines greatly benefit from non-negative measures of dissimilarity between pairs of probability measures. These are known as divergence measures, and exploring their mathematical foundations and diverse applications is of significant interest (see, e.g., [1–10] and references therein).

The present Special Issue, entitled *Divergence Measures: Mathematical Foundations and Applications in Information-Theoretic and Statistical Problems*, is focused on the study of the mathematical properties and applications of classical and generalized divergence measures from an information-theoretic perspective. It includes eight original contributions on the subject, which mainly deal with two key generalizations of the relative entropy: namely, the Rényi divergence and the important class of *f*-divergences. The Rényi divergence was introduced by Rényi as a generalization of relative entropy (relative entropy is a.k.a. the Kullback–Leibler divergence [11]), and it has found numerous applications in information theory, statistics, and other related fields [12,13]. The notion of an *f*-divergence, which was independently introduced by Ali–Silvey [14], Csiszár [15–17], and Morimoto [18], is a useful generalization of some well-known divergence measures, retaining some of their major properties, including data-processing inequalities. It should be noted that, although the Rényi divergence of an arbitrary order is not an *f*-divergence, it is a one-to-one transformation of a subclass of *f*-divergences, so it inherits some of the key properties of *f*-divergences. We next describe the eight contributions in this Special Issue, and their relation to the literature.

Relative entropy is a well-known asymmetric and unbounded divergence measure [11], whereas the Jensen–Shannon divergence [19,20] (a.k.a. the capacitory discrimination [21]) is a bounded symmetrization of relative entropy, which does not require the pair of probability measures to have matching supports. It has the pleasing property that its square root is a distance metric, and it also belongs to the class of *f*-divergences. The latter implies, in particular, that the Jensen–Shannon divergence satisfies data-processing inequalities. The first paper in this Special Issue [22], authored by Nielsen, studies generalizations of the Jensen–Shannon divergence and the Jensen–Shannon centroid. The work in [22] further suggests an iterative algorithm for the numerical computation of the Jensen–Shannon-type centroids for a set of probability densities belonging to a mixture family in information geometry. This includes the case of calculating the Jensen–Shannon centroid of a set of categorical distributions or normalized histograms.

**Citation:** Sason, I. Divergence Measures: Mathematical Foundations and Applications in Information-Theoretic and Statistical Problems. *Entropy* **2022**, *24*, 712. https://doi.org/10.3390/e24050712

Received: 3 May 2022 Accepted: 13 May 2022 Published: 16 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Many of Shannon's information measures appear naturally in the context of horse gambling, when the gambler's utility function is the expected log-wealth. The second paper [23], coauthored by Bleuler, Lapidoth, and Pfister, shows that, under a more general family of utility functions, gambling also provides a context for some of Rényi's information measures. Motivated by a horse betting problem in the setting where the gambler has side information, a new conditional Rényi divergence is introduced in [23]. It is compared with the conditional Rényi divergences of Csiszár and Sibson, and the properties of all three are studied in depth by the authors, with an emphasis on the behavior of these conditional divergence measures under data processing. In the same way that Csiszár's and Sibson's conditional divergences lead to their respective dependence measures, the new conditional divergence in [23] leads to the Lapidoth–Pfister mutual information. The authors further demonstrate that their new conditional divergence measure is also related to the Arimoto–Rényi conditional entropy and to Arimoto's measure of dependence. In the second part of [23], the horse betting problem is analyzed where, instead of Kelly's expected log-wealth criterion, a more general family of power-mean utility functions is considered. The key role in the analysis is played by the Rényi divergence, and the setting where the gambler has access to side information provides an operational meaning to the Lapidoth–Pfister mutual information. Finally, a universal strategy for independent and identically distributed races is presented in [23] which, without knowing the winning probabilities or the parameter of the utility function, asymptotically maximizes the gambler's utility function.

The relative entropy [11] and the chi-squared divergence [24] are classical divergence measures which play a key role in information theory, statistical machine learning, signal processing, statistics, probability theory, and many other branches of mathematics. These divergence measures are fundamental in problems pertaining to source and channel coding, large deviations theory, tests of goodness-of-fit and independence in statistics, expectation–maximization iterative algorithms for estimating a distribution from incomplete data, and other sorts of problems. They also belong to the generalized class of *f*-divergences. The third paper [25], by Nishiyama and Sason, studies integral relations between the relative entropy and the chi-squared divergence, the implications of these relations, their information-theoretic applications, and some generalizations pertaining to the rich class of *f*-divergences. Applications that are studied in [25] include lossless compression, the method of types and large deviations, strong data-processing inequalities, bounds on contraction coefficients and maximal correlation, and the convergence rate to stationarity of a type of discrete-time Markov chain.

The interesting interplay between inequalities and information theory has a rich history, with notable examples that include the relationship between the Brunn–Minkowski inequality and the entropy power inequality, transportation-cost inequalities and their tight connections to information theory, logarithmic Sobolev inequalities and the entropy method, inequalities for matrices obtained from the nonnegativity of relative entropy, and connections between information inequalities and finite groups, combinatorics, and other fields of mathematics (see, e.g., [26–30]). The fourth paper, by Reeves [31], considers applications of a two-moment inequality for the integral of a fractional power of a function between zero and one. The first contribution of this paper provides an upper bound on the Rényi entropy of a random vector, expressed in terms of two different moments. This also recovers some previous results based on maximum entropy distributions under a single moment constraint. The second contribution in [31] is a method for upper bounding mutual information in terms of certain integrals with respect to the variance of the conditional density.

Basic properties of an *f*-divergence are its non-negativity, convexity in the pair of probability measures, and the satisfiability of data-processing inequalities as a result of the convexity of the function *f* (and by the requirement that *f* vanishes at 1). These properties lead to *f*-divergence inequalities, and to information-theoretic applications (see, e.g., [4,10,32–37]). Furthermore, tightened (strong) data-processing inequalities for *f*-divergences have been of recent interest (see, e.g., [38–42]). The fifth paper [43], authored by Melbourne, is focused on the study of how stronger convexity properties of the function *f* imply improvements of classical *f*-divergence inequalities. It provides a systematic study of strongly convex divergences, and it quantifies how the convexity of a divergence generator *f* influences the behavior of the *f*-divergence. It proves that every (so-called) strongly convex divergence dominates the square of the total variation, which extends the classical bound provided by the chi-squared divergence. Its analysis also yields improvements of Bayes risk *f*-divergence inequalities, consequently achieving a sharpening of Pinsker's inequality.
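As a concrete numerical illustration of the classical inequalities that strong convexity sharpens, the following sketch (not code from [43]; all function names are ours) checks Pinsker's inequality, KL ≥ 2·TV² in nats, and the Cauchy–Schwarz bound χ² ≥ 4·TV², on random discrete distributions:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence (nats) between discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def chi2(p, q):
    """Pearson chi-squared divergence: sum (p - q)^2 / q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum((p - q) ** 2 / q))

def tv(p, q):
    """Total variation distance: half the L1 distance."""
    return 0.5 * float(np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float))))

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert kl(p, q) >= 2 * tv(p, q) ** 2       # Pinsker's inequality
    assert chi2(p, q) >= 4 * tv(p, q) ** 2     # chi-squared dominates 4 * TV^2
```

The strongly convex divergences studied in [43] dominate TV² in the same spirit, with constants depending on the modulus of convexity of *f*.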

Divergences between probability measures are often used in statistics and data science in order to perform inference under models of various types. The corresponding methods extend the likelihood paradigm, and suggest inference in settings of minimum distance or minimum divergence, while allowing some tradeoff between efficiency and robustness. The sixth paper [44], authored by Broniatowski, considers a subclass of *f*-divergences which contains most of the classical inferential tools and which is indexed by a single scalar parameter. This class belongs to the family of *f*-divergences, and is usually referred to as the power divergence class, which has been considered by Cressie and Read [7,45]. The work in [44] states that the most commonly used minimum divergence estimators are maximum-likelihood estimators for suitably generalized bootstrapped sampling schemes. It also considers the optimality of associated goodness-of-fit tests under such sampling schemes.

The seventh paper, by Verdú [46], is a research and tutorial paper on error exponents and *α*-mutual information. Similarly to [23] (the second paper in this Special Issue), it relates to Rényi's generalization of the relative entropy and mutual information. In light of the landmark paper by Shannon [47], it is well known that the analysis of the fundamental limits of noisy communication channels in the regime of vanishing error probability (by letting the blocklength of the code tend to infinity) leads to the introduction of the channel capacity as the maximal rate which enables reliable communication. The channel capacity is expressed in terms of a basic information measure: the input–output mutual information maximized over the input distribution. Furthermore, in the regime of fixed nonzero error probability, the asymptotic fundamental limit is a function not only of the channel capacity but also of the channel dispersion, which is expressible in terms of an information measure: the variance of the information density obtained with the capacity-achieving distribution [48]. In the regime of exponentially decreasing error probability, at fixed code rate below capacity, the analysis of the fundamental limits has gone through three distinct phases: (1) the early days of information theory and the error exponent analysis at MIT; (2) expressions for the error exponent functions by incorporating the relative entropy; and (3) error exponent research with Rényi information measures. Thanks to Csiszár's realization of the relevance of Rényi's information measures to this problem [32], the third phase has found a way to express the error exponent functions in terms of generalized information measures, and also to solve the associated optimization problems in a systematic way. While, in the absence of cost constraints, the problem reduces to finding the maximal *α*-mutual information, cost constraints make the problem significantly more challenging.
The remaining gaps in the interrelationships between these three approaches, in the general case of cost-constrained encoding, motivated the study in [46]. Furthermore, no systematic approach had been suggested so far for solving the attendant optimization problems by exploiting the specific structure of the information functions. The work by Verdú in [46] closes those gaps, while proposing a simple method to maximize the Augustin–Csiszár mutual information of order *α* under cost constraints [32,49], by means of the maximization of the *α*-mutual information subject to an exponential average constraint.

In statistical inference, the information-theoretic performance limits can often be expressed in terms of a statistical divergence measure between the underlying statistical models (see, e.g., [50] and references therein). As the data dimension grows, computing the statistics involved in decision making and the attendant performance limits (divergence measures) faces complexity and stability challenges. Dimensionality reduction addresses these challenges at the expense of compromising performance because of the attendant loss of information. The eighth and last paper in the present Special Issue [51] considers linear dimensionality reduction such that the divergence between the models is maximally preserved. Specifically, this work is focused on Gaussian models, where discriminant analysis under several *f*-divergence measures is considered. The optimal design of the linear transformation of the data onto a lower-dimensional subspace is characterized for zero-mean Gaussian models, and numerical algorithms are employed to find the design for general Gaussian models with non-zero means.

It is our hope that the reader will find interest in the eight original contributions of this Special Issue, and that these works will stimulate further research in the study of the mathematical foundations and applications of divergence measures.

**Acknowledgments:** The Guest Editor is grateful to all the authors for their contributions to this Special Issue, and to the anonymous peer-reviewers for their timely reports and constructive feedback.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


## *Article* **On a Generalization of the Jensen–Shannon Divergence and the Jensen–Shannon Centroid**

#### **Frank Nielsen**

Sony Computer Science Laboratories, Tokyo 141-0022, Japan; Frank.Nielsen@acm.org

Received: 5 December 2019; Accepted: 14 February 2020; Published: 16 February 2020

**Abstract:** The Jensen–Shannon divergence is a renowned bounded symmetrization of the Kullback–Leibler divergence which does not require probability densities to have matching supports. In this paper, we introduce a vector-skew generalization of the scalar *α*-Jensen–Bregman divergences and derive thereof the vector-skew *α*-Jensen–Shannon divergences. We prove that the vector-skew *α*-Jensen–Shannon divergences are *f*-divergences and study the properties of these novel divergences. Finally, we report an iterative algorithm to numerically compute the Jensen–Shannon-type centroids for a set of probability densities belonging to a mixture family: This includes the case of the Jensen–Shannon centroid of a set of categorical distributions or normalized histograms.

**Keywords:** Bregman divergence; *f*-divergence; Jensen–Bregman divergence; Jensen diversity; Jensen–Shannon divergence; capacitory discrimination; Jensen–Shannon centroid; mixture family; information geometry; difference of convex (DC) programming

#### **1. Introduction**

Let $(\mathcal{X}, \mathcal{F}, \mu)$ be a measure space [1], where $\mathcal{X}$ denotes the sample space, $\mathcal{F}$ the $\sigma$-algebra of measurable events, and $\mu$ a positive measure; for example, the measure space defined by the Lebesgue measure $\mu_L$ with the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R}^d)$ for $\mathcal{X} = \mathbb{R}^d$, or the measure space defined by the counting measure $\mu_c$ with the power set $\sigma$-algebra $2^{\mathcal{X}}$ on a finite alphabet $\mathcal{X}$. Denote by $L^1(\mathcal{X}, \mathcal{F}, \mu)$ the Lebesgue space of measurable functions, by $\mathcal{P}_1$ the subspace of *positive* integrable functions $f$ such that $\int_{\mathcal{X}} f(x)\,\mathrm{d}\mu(x) = 1$ and $f(x) > 0$ for all $x \in \mathcal{X}$, and by $\bar{\mathcal{P}}_1$ the subspace of *non-negative* integrable functions $f$ such that $\int_{\mathcal{X}} f(x)\,\mathrm{d}\mu(x) = 1$ and $f(x) \geq 0$ for all $x \in \mathcal{X}$.

We refer to the book of Deza and Deza [2] and the survey of Basseville [3] for an introduction to the many types of statistical divergences met in information sciences and their justifications. The *Kullback–Leibler Divergence* (KLD) $\mathrm{KL} : \mathcal{P}_1 \times \mathcal{P}_1 \to [0, \infty]$ is an oriented statistical distance (commonly called the relative entropy in information theory [4]) defined between two densities $p$ and $q$ (i.e., the Radon–Nikodym densities of $\mu$-absolutely continuous probability measures $P$ and $Q$) by

$$\mathrm{KL}(p:q) := \int p \log \frac{p}{q}\,\mathrm{d}\mu. \tag{1}$$

Although $\mathrm{KL}(p:q) \geq 0$ with equality iff $p = q$ $\mu$-a.e. (Gibbs' inequality [4]), the KLD may diverge to infinity depending on the underlying densities. Since the KLD is asymmetric, several symmetrizations [5] have been proposed in the literature.
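For discrete distributions, the KLD of Equation (1) and its two basic properties (non-negativity with equality iff $p = q$, and possible divergence to infinity under support mismatch) can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def kl(p, q):
    """KL(p:q) = sum_x p(x) log(p(x)/q(x)); +inf if supp(p) is not contained in supp(q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((q == 0) & (p > 0)):
        return np.inf          # q vanishes where p has mass: KLD diverges
    mask = p > 0               # convention 0 log 0 = 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(kl(p, q))    # log 2 ~ 0.6931 (finite, non-negative)
print(kl(q, p))    # inf: q puts mass where p vanishes (asymmetry)
print(kl(p, p))    # 0.0 (Gibbs' inequality with equality)
```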

A well-grounded symmetrization of the KLD is the *Jensen–Shannon Divergence* [6] (JSD), also called *capacitory discrimination* in the literature (e.g., see [7]):

$$\text{JS}(p,q) \quad := \quad \frac{1}{2} \left( \text{KL}\left(p: \frac{p+q}{2}\right) + \text{KL}\left(q: \frac{p+q}{2}\right) \right), \tag{2}$$

$$=\frac{1}{2}\int \left(p\log\frac{2p}{p+q} + q\log\frac{2q}{p+q}\right)\mathrm{d}\mu = \mathrm{JS}(q,p).\tag{3}$$

The Jensen–Shannon divergence can be interpreted as the *total KL divergence to the average distribution* $\frac{p+q}{2}$. The Jensen–Shannon divergence was historically implicitly introduced in [8] (Equation (19)) to calculate distances between random graphs. A nice feature of the Jensen–Shannon divergence is that this divergence can be applied to densities with *arbitrary* supports (i.e., $p, q \in \bar{\mathcal{P}}_1$ with the conventions that $0 \log 0 = 0$ and $0 \log \frac{0}{0} = 0$); moreover, the JSD is *always* upper bounded by $\log 2$. Let $\mathcal{X}_p = \mathrm{supp}(p)$ and $\mathcal{X}_q = \mathrm{supp}(q)$ denote the supports of the densities $p$ and $q$, respectively, where $\mathrm{supp}(p) := \{x \in \mathcal{X} : p(x) > 0\}$. The JSD saturates to $\log 2$ whenever the supports $\mathcal{X}_p$ and $\mathcal{X}_q$ are disjoint. We can rewrite the JSD as

$$\text{JS}(p,q) = h\left(\frac{p+q}{2}\right) - \frac{h(p)+h(q)}{2},\tag{4}$$

where $h(p) = -\int p \log p\,\mathrm{d}\mu$ denotes Shannon's entropy. Thus, the JSD can also be interpreted as the *entropy of the average distribution minus the average of the entropies*.
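The entropic rewriting in Equation (4) gives a direct way to compute the JSD and to observe its saturation at $\log 2$ for disjoint supports (a minimal sketch; the function names are ours):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, with the convention 0 log 0 = 0."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def js(p, q):
    """JS(p,q) = h((p+q)/2) - (h(p)+h(q))/2, as in Eq. (4); always <= log 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))

p = [1.0, 0.0]
q = [0.0, 1.0]
print(js(p, q))    # disjoint supports: saturates at log 2 ~ 0.6931
```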

The square root of the JSD is a metric [9] satisfying the triangle inequality, but the square root of the JD is not a metric (nor is any positive power of the Jeffreys divergence, see [10]). In fact, the JSD can be interpreted as a Hilbert metric distance, meaning that there exists some isometric embedding of $(\mathcal{X}, \sqrt{\mathrm{JS}})$ into a Hilbert space [11,12]. Other principled symmetrizations of the KLD have been proposed in the literature: For example, Naghshvar et al. [13] proposed the *extrinsic Jensen–Shannon divergence* and demonstrated its use for variable-length coding over a discrete memoryless channel (DMC).

Another symmetrization of the KLD sometimes met in the literature [14–16] is the *Jeffreys divergence* [17,18] (JD) defined by

$$\mathrm{J}(p,q) := \mathrm{KL}(p:q) + \mathrm{KL}(q:p) = \int (p-q)\log\frac{p}{q}\,\mathrm{d}\mu = \mathrm{J}(q,p). \tag{5}$$

However, we point out that this Jeffreys divergence lacks sound information-theoretical justifications.

For two positive but not necessarily normalized densities $\tilde{p}$ and $\tilde{q}$, we define the *extended Kullback–Leibler divergence* as follows:

$$\mathrm{KL}^+(\tilde{p}:\tilde{q}) := \mathrm{KL}(\tilde{p}:\tilde{q}) + \int \tilde{q}\,\mathrm{d}\mu - \int \tilde{p}\,\mathrm{d}\mu, \tag{6}$$

$$= \int \left(\tilde{p}\log\frac{\tilde{p}}{\tilde{q}} + \tilde{q} - \tilde{p}\right) \mathrm{d}\mu. \tag{7}$$

The Jensen–Shannon divergence and the Jeffreys divergence can both be extended to positive (unnormalized) densities without changing their formula expressions:

$$\mathrm{JS}^{+}(\tilde{p},\tilde{q}) := \frac{1}{2}\left(\mathrm{KL}^{+}\left(\tilde{p}:\frac{\tilde{p}+\tilde{q}}{2}\right) + \mathrm{KL}^{+}\left(\tilde{q}:\frac{\tilde{p}+\tilde{q}}{2}\right)\right), \tag{8}$$

$$= \frac{1}{2}\left(\mathrm{KL}\left(\tilde{p}:\frac{\tilde{p}+\tilde{q}}{2}\right) + \mathrm{KL}\left(\tilde{q}:\frac{\tilde{p}+\tilde{q}}{2}\right)\right) = \mathrm{JS}(\tilde{p},\tilde{q}), \tag{9}$$

$$\mathrm{J}^{+}(\tilde{p},\tilde{q}) := \mathrm{KL}^{+}(\tilde{p}:\tilde{q}) + \mathrm{KL}^{+}(\tilde{q}:\tilde{p}) = \int (\tilde{p}-\tilde{q})\log\frac{\tilde{p}}{\tilde{q}}\,\mathrm{d}\mu = \mathrm{J}(\tilde{p},\tilde{q}). \tag{10}$$

However, the extended JS$^+$ divergence is upper bounded by $\left(\frac{1}{2}\log 2\right)\int (\tilde{p}+\tilde{q})\,\mathrm{d}\mu = \frac{1}{2}\left(\mu(\tilde{p})+\mu(\tilde{q})\right)\log 2$ instead of $\log 2$ for normalized densities (i.e., when $\mu(\tilde{p})+\mu(\tilde{q}) = 2$).
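This scaled upper bound on JS$^+$ is easy to verify numerically on random positive, unnormalized vectors (a sanity-check sketch, not from the paper; function names are ours, and $\mu(\tilde{p})$ is taken as $\sum_x \tilde{p}(x)$ in the discrete case):

```python
import numpy as np

def kl_plus(p, q):
    """Extended KL for positive, possibly unnormalized densities, Eq. (7)."""
    return float(np.sum(p * np.log(p / q) + q - p))

def js_plus(p, q):
    """Extended Jensen-Shannon divergence, Eq. (8)."""
    m = 0.5 * (p + q)
    return 0.5 * (kl_plus(p, m) + kl_plus(q, m))

rng = np.random.default_rng(1)
for _ in range(1000):
    p = rng.uniform(0.1, 3.0, size=4)    # positive, unnormalized
    q = rng.uniform(0.1, 3.0, size=4)
    bound = 0.5 * (p.sum() + q.sum()) * np.log(2)
    assert 0.0 <= js_plus(p, q) <= bound + 1e-12
```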

Let $(pq)_\alpha(x) := (1-\alpha)p(x) + \alpha q(x)$ denote the statistical weighted mixture with component densities $p$ and $q$ for $\alpha \in [0,1]$. The asymmetric $\alpha$-skew Jensen–Shannon divergence can be defined for a scalar parameter $\alpha \in (0,1)$ by considering the weighted mixture $(pq)_\alpha$ as follows:

$$\mathrm{JS}_\alpha(p:q) := (1-\alpha)\,\mathrm{KL}(p:(pq)_\alpha) + \alpha\,\mathrm{KL}(q:(pq)_\alpha), \tag{11}$$

$$= (1-\alpha)\int p\log\frac{p}{(pq)_\alpha}\,\mathrm{d}\mu + \alpha\int q\log\frac{q}{(pq)_\alpha}\,\mathrm{d}\mu. \tag{12}$$

Let us introduce the *α-skew K-divergence* [6,19] $K_\alpha(p:q)$ by:

$$K_\alpha(p:q) := \mathrm{KL}\left(p:(1-\alpha)p+\alpha q\right) = \mathrm{KL}\left(p:(pq)_\alpha\right). \tag{13}$$

Then, both the Jensen–Shannon divergence and the Jeffreys divergence can be rewritten [20] using *K<sup>α</sup>* as follows:

$$\mathrm{JS}(p,q) = \frac{1}{2}\left(K_{\frac{1}{2}}(p:q) + K_{\frac{1}{2}}(q:p)\right), \tag{14}$$

$$\mathrm{J}(p,q) = K_1(p:q) + K_1(q:p), \tag{15}$$

since $(pq)_1 = q$, $\mathrm{KL}(p:q) = K_1(p:q)$, and $(pq)_{\frac{1}{2}} = (qp)_{\frac{1}{2}}$.

We can thus define the *symmetric α-skew Jensen–Shannon divergence* [20] for *α* ∈ (0, 1) as follows:

$$\mathrm{JS}^{\alpha}(p,q) := \frac{1}{2}K_\alpha(p:q) + \frac{1}{2}K_\alpha(q:p) = \mathrm{JS}^{\alpha}(q,p). \tag{16}$$

The ordinary Jensen–Shannon divergence is recovered for $\alpha = \frac{1}{2}$.
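The skew $K$-divergence of Equation (13) and the symmetric skew divergence of Equation (16) can be sketched as follows, checking that $\alpha = \frac{1}{2}$ recovers the ordinary JSD (a minimal numerical illustration; function names are ours):

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence (supports assumed to match here)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def k_alpha(p, q, a):
    """Skew K-divergence K_alpha(p:q) = KL(p : (1-a)p + a q), Eq. (13)."""
    return kl(p, (1 - a) * p + a * q)

def js_alpha(p, q, a):
    """Symmetric alpha-skew Jensen-Shannon divergence, Eq. (16)."""
    return 0.5 * k_alpha(p, q, a) + 0.5 * k_alpha(q, p, a)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
# alpha = 1/2 recovers the ordinary Jensen-Shannon divergence:
m = 0.5 * (p + q)
assert abs(js_alpha(p, q, 0.5) - 0.5 * (kl(p, m) + kl(q, m))) < 1e-12
```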

In general, skewing divergences (e.g., using the divergence $K_\alpha$ instead of the KLD) has been experimentally shown to perform better in applications such as some natural language processing (NLP) tasks [21].

The *α-Jensen–Shannon divergences* are Csiszár *f*-divergences [22–24]. An *f*-divergence is defined for a convex function *f*, strictly convex at 1 and satisfying $f(1) = 0$, as:

$$I_f(p:q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) \mathrm{d}\mu(x) \geq f(1) = 0. \tag{17}$$

We can always symmetrize *f*-divergences by taking the *conjugate* convex function $f^*(x) = x f\left(\frac{1}{x}\right)$ (related to the perspective function): $I_{f+f^*}(p,q)$ is a symmetric divergence. The *f*-divergences are convex statistical distances which are provably the only separable invariant divergences in information geometry [25], except for binary alphabets $\mathcal{X}$ (see [26]).

The Jeffreys divergence is an *f*-divergence for the generator $f(x) = (x-1)\log x$, and the *α*-Jensen–Shannon divergences are *f*-divergences for the generator family $f_\alpha(x) = -\log((1-\alpha)+\alpha x) - x\log\left((1-\alpha)+\frac{\alpha}{x}\right)$. The *f*-divergences are upper bounded by $f(0) + f^*(0)$. Thus, the *f*-divergences are finite when $f(0) + f^*(0) < \infty$.
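The generator family $f_\alpha$ can be checked numerically against Equation (17) (a sanity-check sketch, not from the paper; function names are ours). Expanding $\int q\, f_\alpha(p/q)\,\mathrm{d}\mu$ term by term gives $K_\alpha(p:q) + K_\alpha(q:p)$, i.e., twice the symmetric divergence of Equation (16), so the two agree up to the $\frac{1}{2}$ normalization convention:

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def k_alpha(p, q, a):
    """Skew K-divergence K_alpha(p:q) = KL(p : (1-a)p + a q)."""
    return kl(p, (1 - a) * p + a * q)

def f_div(p, q, f):
    """Csiszar f-divergence I_f(p:q) = sum_x q(x) f(p(x)/q(x)), Eq. (17)."""
    return float(np.sum(q * f(p / q)))

def f_alpha(a):
    """Generator family of the alpha-Jensen-Shannon divergences."""
    return lambda x: -np.log((1 - a) + a * x) - x * np.log((1 - a) + a / x)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
for a in (0.25, 0.5, 0.75):
    lhs = f_div(p, q, f_alpha(a))
    rhs = k_alpha(p, q, a) + k_alpha(q, p, a)   # = 2 * JS^alpha(p,q) of Eq. (16)
    assert abs(lhs - rhs) < 1e-12
```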

The main contributions of this paper are summarized as follows:

• First, we generalize the Jensen–Bregman divergence by skewing a weighted separable Jensen–Bregman divergence with a *k*-dimensional *vector* $\alpha \in [0,1]^k$ in Section 2. This yields a generalization of the symmetric skew *α*-Jensen–Shannon divergences to a vector-skew parameter. This extension retains the key properties of being upper bounded and of applying to densities with potentially different supports. The proposed generalization also allows one to grasp a better understanding of the "mechanism" of the Jensen–Shannon divergence itself. We also show how to directly obtain the weighted vector-skew Jensen–Shannon divergence from the decomposition of the KLD as the difference of the cross-entropy minus the entropy (i.e., KLD as the relative entropy).


#### **2. Extending the Jensen–Shannon Divergence**

#### *2.1. Vector-Skew Jensen–Bregman Divergences and Jensen Diversities*

Recall our notational shortcut: $(ab)_\alpha := (1-\alpha)a + \alpha b$. For a *k*-dimensional vector $\alpha \in [0,1]^k$, a weight vector $w$ belonging to the $(k-1)$-dimensional open simplex $\Delta_k$, and a scalar $\gamma \in (0,1)$, let us define the following vector-skew *α-Jensen–Bregman divergence* (*α*-JBD), following [28]:

$$\mathrm{JB}_F^{\alpha,\gamma,w}(\theta_1:\theta_2) := \sum_{i=1}^{k} w_i\, B_F\!\left((\theta_1\theta_2)_{\alpha_i} : (\theta_1\theta_2)_{\gamma}\right) \geq 0, \tag{18}$$

where $B_F$ is the *Bregman divergence* [29] induced by a strictly convex and smooth generator $F$:

$$B_F(\theta_1:\theta_2) := F(\theta_1) - F(\theta_2) - \langle \theta_1 - \theta_2, \nabla F(\theta_2) \rangle, \tag{19}$$

with $\langle \cdot, \cdot \rangle$ denoting the Euclidean inner product $\langle x, y \rangle = x^\top y$ (dot product). Expanding the Bregman divergence formulas in the expression of the *α*-JBD and using the fact that

$$(\theta_1\theta_2)_{\alpha_i} - (\theta_1\theta_2)_{\gamma} = (\gamma - \alpha_i)(\theta_1 - \theta_2), \tag{20}$$

we get the following expression:

$$\mathrm{JB}_F^{\alpha,\gamma,w}(\theta_1:\theta_2) = \left(\sum_{i=1}^{k} w_i F\!\left((\theta_1\theta_2)_{\alpha_i}\right)\right) - F\!\left((\theta_1\theta_2)_{\gamma}\right) - \left\langle \sum_{i=1}^{k} w_i(\gamma-\alpha_i)(\theta_1-\theta_2),\; \nabla F\!\left((\theta_1\theta_2)_{\gamma}\right) \right\rangle. \tag{21}$$

The inner product term of Equation (21) vanishes when

$$\gamma = \sum_{i=1}^{k} w_i \alpha_i =: \bar{\alpha}. \tag{22}$$

Thus, when $\gamma = \bar{\alpha}$ (assuming at least two distinct components in $\alpha$ so that $\gamma \in (0,1)$), we get the simplified formula for the vector-skew *α*-JBD:

$$\mathrm{JB}_F^{\alpha,w}(\theta_1:\theta_2) = \left(\sum_{i=1}^{k} w_i F\!\left((\theta_1\theta_2)_{\alpha_i}\right)\right) - F\!\left((\theta_1\theta_2)_{\bar{\alpha}}\right). \tag{23}$$

*Entropy* **2020**, *22*, 221

This vector-skew Jensen–Bregman divergence is always finite and amounts to a *Jensen diversity* [30] $J_F$ induced by Jensen's inequality gap:

$$\mathrm{JB}_F^{\alpha,w}(\theta_1:\theta_2) = J_F\!\left((\theta_1\theta_2)_{\alpha_1},\dots,(\theta_1\theta_2)_{\alpha_k}; w_1,\dots,w_k\right) := \sum_{i=1}^{k} w_i F\!\left((\theta_1\theta_2)_{\alpha_i}\right) - F\!\left((\theta_1\theta_2)_{\bar{\alpha}}\right) \geq 0. \tag{24}$$

The Jensen diversity is a quantity which arises as a generalization of the cluster variance when clustering with Bregman divergences instead of the ordinary squared Euclidean distance; see [29,30] for details. In the context of Bregman clustering, the Jensen diversity has been called the *Bregman information* [29] and motivated by rate distortion theory: Bregman information measures the minimum expected loss when encoding a set of points using a single point when the loss is measured using a Bregman divergence. In general, a *k*-point measure is called a diversity measure (for *k* > 2), while a distance/divergence is the special case of a 2-point measure.
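The Jensen diversity of Equation (24) and its reduction to the cluster variance for the squared Euclidean generator can be sketched as follows (an illustrative sketch, not code from the paper; function names are ours):

```python
import numpy as np

def jensen_diversity(F, thetas, w):
    """Jensen diversity J_F of Eq. (24): the Jensen inequality gap
    sum_i w_i F(theta_i) - F(sum_i w_i theta_i) >= 0 for convex F."""
    thetas = np.atleast_2d(np.asarray(thetas, float))
    w = np.asarray(w, float)
    theta_bar = w @ thetas
    return float(w @ np.array([F(t) for t in thetas]) - F(theta_bar))

# For F(theta) = ||theta||^2, the induced Bregman divergence is the squared
# Euclidean distance, and the Jensen diversity equals the cluster variance
# sum_i w_i ||theta_i - theta_bar||^2 (the Bregman information of [29]).
F = lambda t: float(t @ t)
thetas = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
w = np.array([1/3, 1/3, 1/3])
theta_bar = w @ thetas
var = float(np.sum(w * np.sum((thetas - theta_bar) ** 2, axis=1)))
assert abs(jensen_diversity(F, thetas, w) - var) < 1e-12
```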

Conversely, in 1D, we may start from Jensen's inequality for a strictly convex function *F*:

$$\sum\_{i=1}^{k} w\_i F(\theta\_i) \ge F\left(\sum\_{i=1}^{k} w\_i \theta\_i\right). \tag{25}$$

Let us notationally write $[k] := \{1,\dots,k\}$, and define $\theta_m := \min_{i\in[k]} \theta_i$ and $\theta_M := \max_{i\in[k]} \theta_i > \theta_m$ (i.e., assuming at least two distinct values). The barycenter $\bar{\theta} = \sum_i w_i\theta_i =: (\theta_m\theta_M)_\gamma$ can be interpreted as the linear interpolation of the extremal values for some $\gamma \in (0,1)$. Let us write $\theta_i = (\theta_m\theta_M)_{\alpha_i}$ for $i \in [k]$ and proper values of the $\alpha_i$'s. Then, it follows that

$$\bar{\theta} = \sum\_{i} w\_{i}\theta\_{i}, \tag{26}$$

$$= \sum\_{i} w\_{i}(\theta\_{m}\theta\_{M})\_{\alpha\_{i}}, \tag{27}$$

$$= \sum\_{i} w\_{i}\left((1-\alpha\_{i})\theta\_{m} + \alpha\_{i}\theta\_{M}\right), \tag{28}$$

$$= \left(1 - \sum\_{i} w\_{i}\alpha\_{i}\right)\theta\_{m} + \left(\sum\_{i} w\_{i}\alpha\_{i}\right)\theta\_{M}, \tag{29}$$

$$= (\theta\_m\theta\_M)\_{\sum\_i w\_i\alpha\_i} = (\theta\_m\theta\_M)\_{\gamma}, \tag{30}$$

so that $\gamma = \sum\_i w\_i\alpha\_i = \bar{\alpha}$.

#### *2.2. Vector-Skew Jensen–Shannon Divergences*

Let $f(x) = x\log x - x$ be a strictly convex and smooth function on $(0,\infty)$. Then, the Bregman divergence induced by this univariate generator is

$$B\_f(p:q) = p \log \frac{p}{q} + q - p = \text{kl}\_+(p:q),\tag{31}$$

the *extended scalar Kullback–Leibler divergence*.

We extend the scalar-skew Jensen–Shannon divergence as follows: $\mathrm{JS}^{\alpha,w}(p:q) := \mathrm{JB}\_{-h}^{\alpha,\bar{\alpha},w}(p:q)$ for $h$ the Shannon entropy [4] (a strictly concave function [4]).

**Definition 1** (Weighted vector-skew (*α*, *w*)-Jensen–Shannon divergence)**.** *For a vector* $\alpha \in [0,1]^k$ *and a unit positive weight vector* $w \in \Delta\_k$*, the* $(\alpha,w)$*-Jensen–Shannon divergence between two densities* $p, q \in \bar{\mathcal{P}}\_1$ *is defined by:*

$$\mathrm{JS}^{\alpha,w}(p:q) := \sum\_{i=1}^k w\_i \mathrm{KL}((pq)\_{\alpha\_i} : (pq)\_{\bar{\alpha}}) = h\left((pq)\_{\bar{\alpha}}\right) - \sum\_{i=1}^k w\_i h\left((pq)\_{\alpha\_i}\right),$$

*with* $\bar{\alpha} = \sum\_{i=1}^{k} w\_i\alpha\_i$*, where* $h(p) = -\int p(x)\log p(x)\mathrm{d}\mu(x)$ *denotes the Shannon entropy [4] (i.e.,* $-h$ *is strictly convex).*

This definition generalizes the ordinary JSD: we recover the ordinary Jensen–Shannon divergence when $k = 2$, $\alpha\_1 = 0$, $\alpha\_2 = 1$, and $w\_1 = w\_2 = \frac{1}{2}$ with $\bar{\alpha} = \frac{1}{2}$: $\mathrm{JS}(p,q) = \mathrm{JS}^{(0,1),(\frac{1}{2},\frac{1}{2})}(p:q)$.
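On a finite sample space, Definition 1 is straightforward to implement and check numerically. The following Python sketch (the helper names `entropy` and `js_vector_skew` are ours) computes $\mathrm{JS}^{\alpha,w}$ via its entropic form and verifies the reduction to the ordinary JSD:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    mask = p > 0
    return -float(np.sum(p[mask] * np.log(p[mask])))

def js_vector_skew(p, q, alphas, weights):
    """JS^{alpha,w}(p:q) via h((pq)_abar) - sum_i w_i h((pq)_{alpha_i})."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    abar = float(np.dot(weights, alphas))
    mix = lambda a: (1.0 - a) * p + a * q          # statistical mixture (pq)_a
    return entropy(mix(abar)) - sum(w * entropy(mix(a))
                                    for a, w in zip(alphas, weights))

# k = 2, alpha = (0, 1), w = (1/2, 1/2) recovers the ordinary JSD:
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.1, 0.5])
js = js_vector_skew(p, q, alphas=[0.0, 1.0], weights=[0.5, 0.5])
m = 0.5 * (p + q)
js_classic = entropy(m) - 0.5 * (entropy(p) + entropy(q))
assert abs(js - js_classic) < 1e-12
```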

Let $\mathrm{KL}\_{\alpha,\beta}(p:q) := \mathrm{KL}((pq)\_{\alpha} : (pq)\_{\beta})$. Then, we have $\mathrm{KL}\_{\alpha,\beta}(q:p) = \mathrm{KL}\_{1-\alpha,1-\beta}(p:q)$. Using this $(\alpha,\beta)$-KLD, we have the following identity:

$$\mathrm{JS}^{\alpha,w}(p:q) = \sum\_{i=1}^{k} w\_i \mathrm{KL}\_{\alpha\_i,\bar{\alpha}}(p:q),\tag{32}$$

$$= \sum\_{i=1}^{k} w\_i \mathrm{KL}\_{1-\alpha\_i, 1-\bar{\alpha}}(q:p) = \mathrm{JS}^{1\_k-\alpha, w}(q:p),\tag{33}$$

since $\sum\_{i=1}^{k} w\_i(1-\alpha\_i) = 1-\bar{\alpha}$ (the barycenter of the flipped skewing vector $1\_k - \alpha$), where $1\_k = (1,\dots,1)$ is a $k$-dimensional vector of ones.

A very interesting property is that the vector-skew Jensen–Shannon divergences are *f*-divergences [22].

**Theorem 1.** *The vector-skew Jensen–Shannon divergences* $\mathrm{JS}^{\alpha,w}(p:q)$ *are f-divergences for the generator* $f\_{\alpha,w}(u) = \sum\_{i=1}^{k} w\_i(\alpha\_i u + (1-\alpha\_i))\log\frac{(1-\alpha\_i)+\alpha\_i u}{(1-\bar{\alpha})+\bar{\alpha} u}$ *with* $\bar{\alpha} = \sum\_{i=1}^{k} w\_i\alpha\_i$*.*

**Proof.** First, let us observe that a positively weighted sum of *f*-divergences is an *f*-divergence: $\sum\_{i=1}^{k} w\_i I\_{f\_i}(p:q) = I\_f(p:q)$ for the generator $f(u) = \sum\_{i=1}^{k} w\_i f\_i(u)$.

Now, let us express the divergence KL*α*,*β*(*p* : *q*) as an *f*-divergence:

$$\mathrm{KL}\_{\alpha,\beta}(p:q) = I\_{f\_{\alpha,\beta}}(p:q),\tag{34}$$

with generator

$$f\_{\alpha,\beta}(u) = (\alpha u + 1 - \alpha) \log \frac{(1-\alpha)+\alpha u}{(1-\beta)+\beta u}.\tag{35}$$

Thus, it follows that

$$\mathrm{JS}^{\alpha,w}(p:q) = \sum\_{i=1}^{k} w\_i \mathrm{KL}((pq)\_{\alpha\_i} : (pq)\_{\bar{\alpha}}), \tag{36}$$

$$= \sum\_{i=1}^{k} w\_{i}\, I\_{f\_{\alpha\_{i},\bar{\alpha}}}(p:q),\tag{37}$$

$$= I\_{\sum\_{i=1}^{k} w\_i f\_{\alpha\_i,\bar{\alpha}}}(p:q). \tag{38}$$

Therefore, the vector-skew Jensen–Shannon divergence is an *f*-divergence for the following generator:

$$f\_{\alpha,w}(u) = \sum\_{i=1}^{k} w\_i (\alpha\_i u + (1 - \alpha\_i)) \log \frac{(1 - \alpha\_i) + \alpha\_i u}{(1 - \bar{\alpha}) + \bar{\alpha} u}, \tag{39}$$

where $\bar{\alpha} = \sum\_{i=1}^{k} w\_i\alpha\_i$.

When $\alpha = (0,1)$ and $w = \left(\frac{1}{2}, \frac{1}{2}\right)$, we recover the *f*-divergence generator for the JSD:

$$f\_{\mathrm{JS}}(u) = \frac{1}{2}\log\frac{1}{\frac{1}{2}+\frac{1}{2}u} + \frac{1}{2}u\log\frac{u}{\frac{1}{2}+\frac{1}{2}u}, \tag{40}$$

$$= \frac{1}{2}\left(\log\frac{2}{1+u} + u\log\frac{2u}{1+u}\right).\tag{41}$$

Observe that $f^\*\_{\alpha,w}(u) = u f\_{\alpha,w}(1/u) = f\_{1-\alpha,w}(u)$, where $1-\alpha := (1-\alpha\_1,\dots,1-\alpha\_k)$.
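Theorem 1 can be sanity-checked numerically on a finite sample space. The sketch below (variable names are ours) compares $\mathrm{JS}^{\alpha,w}(p:q)$, computed as a weighted sum of KLDs between mixtures, against the $f$-divergence $I\_{f\_{\alpha,w}}(p:q)$, using the convention $I\_f(p:q) = \int p\, f(q/p)\,\mathrm{d}\mu$, under which the generator of Theorem 1 reproduces the JSD:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.1, 0.5])
alphas = np.array([0.1, 0.7])
weights = np.array([0.4, 0.6])
abar = float(weights @ alphas)

def kl(a, b):
    """Kullback-Leibler divergence between positive discrete distributions."""
    return float(np.sum(a * np.log(a / b)))

mix = lambda t: (1.0 - t) * p + t * q              # statistical mixture (pq)_t
js = sum(w * kl(mix(a), mix(abar)) for a, w in zip(alphas, weights))

def f_gen(u):
    """Generator f_{alpha,w}(u) of Theorem 1 (elementwise in u)."""
    return sum(w * (a * u + 1.0 - a) *
               np.log(((1.0 - a) + a * u) / ((1.0 - abar) + abar * u))
               for a, w in zip(alphas, weights))

# f-divergence with the convention I_f(p:q) = sum_x p(x) f(q(x)/p(x)):
i_f = float(np.sum(p * f_gen(q / p)))
assert abs(js - i_f) < 1e-12
```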

We also refer the reader to Theorem 4.1 of [31], which defines skew *f*-divergences from any *f*-divergence.

**Remark 1.** *Since the vector-skew Jensen divergence is an f-divergence, we easily obtain Fano and Pinsker inequalities following [32], or reverse Pinsker inequalities following [33,34] (i.e., upper bounds for the vector-skew Jensen divergences using the total variation metric distance), data processing inequalities using [35], etc.*

Next, we show that $\mathrm{KL}\_{\alpha,\beta}$ (and $\mathrm{JS}^{\alpha,w}$) are separable convex divergences. Since the *f*-divergences are separable convex, the $\mathrm{KL}\_{\alpha,\beta}$ divergences and the $\mathrm{JS}^{\alpha,w}$ divergences are separable convex. For the sake of completeness, we report a simple explicit proof below.

**Theorem 2** (Separable convexity)**.** *The divergence* $\mathrm{KL}\_{\alpha,\beta}(p:q)$ *is strictly separable convex for* $\alpha \neq \beta$ *and* $x \in \mathcal{X}\_p \cap \mathcal{X}\_q$*.*

**Proof.** Let us calculate the second partial derivative of KL*α*,*β*(*x* : *y*) with respect to *x*, and show that it is strictly positive:

$$\frac{\partial^2}{\partial x^2}\mathrm{KL}\_{\alpha,\beta}(x:y) = \frac{(\beta-\alpha)^2 y^2}{(xy)\_{\alpha}\,(xy)\_{\beta}^2} > 0,\tag{42}$$

for $x, y > 0$. Thus, $\mathrm{KL}\_{\alpha,\beta}$ is strictly convex in its first argument. Similarly, since $\mathrm{KL}\_{\alpha,\beta}(y:x) = \mathrm{KL}\_{1-\alpha,1-\beta}(x:y)$, we deduce that $\mathrm{KL}\_{\alpha,\beta}$ is strictly convex in its second argument. Therefore, the divergence $\mathrm{KL}\_{\alpha,\beta}$ is separable convex.

It follows that the divergence $\mathrm{JS}^{\alpha,w}(p:q)$ is strictly separable convex, since it is a convex combination of weighted $\mathrm{KL}\_{\alpha\_i,\bar{\alpha}}$ divergences.

Another way to derive the vector-skew JSD is to decompose the KLD as the cross-entropy $h^{\times}$ minus the entropy $h$ (hence, the KLD is also called the relative entropy):

$$\text{KL}(p:q) = h^{\times}(p:q) - h(p), \tag{43}$$

where $h^{\times}(p:q) := -\int p \log q\, \mathrm{d}\mu$ and $h(p) := h^{\times}(p:p)$ (self cross-entropy). Since $\alpha\_1 h^{\times}(p\_1:q) + \alpha\_2 h^{\times}(p\_2:q) = h^{\times}(\alpha\_1 p\_1 + \alpha\_2 p\_2 : q)$ (for $\alpha\_2 = 1-\alpha\_1$), it follows that

$$\mathrm{JS}^{\alpha,w}(p:q) := \sum\_{i=1}^{k} w\_i \mathrm{KL}((pq)\_{\alpha\_i} : (pq)\_{\gamma}), \tag{44}$$

$$= \sum\_{i=1}^{k} w\_i \left( h^{\times}((pq)\_{\alpha\_i} : (pq)\_{\gamma}) - h((pq)\_{\alpha\_i}) \right),\tag{45}$$

$$= h^{\times}\left(\sum\_{i=1}^k w\_i(pq)\_{\alpha\_i} : (pq)\_{\gamma} \right) - \sum\_{i=1}^k w\_i h\left((pq)\_{\alpha\_i}\right). \tag{46}$$

Here, the "trick" is to choose $\gamma = \bar{\alpha}$ in order to "convert" the cross-entropy into an entropy: $h^{\times}(\sum\_{i=1}^{k} w\_i(pq)\_{\alpha\_i} : (pq)\_{\gamma}) = h((pq)\_{\bar{\alpha}})$ when $\gamma = \bar{\alpha}$. Then, we end up with

$$\mathrm{JS}^{\alpha,w}(p:q) = h\left((pq)\_{\bar{\alpha}}\right) - \sum\_{i=1}^{k} w\_i h\left((pq)\_{\alpha\_i}\right). \tag{47}$$

When $\alpha = (\alpha\_1, \alpha\_2)$ with $\alpha\_1 = 0$ and $\alpha\_2 = 1$, and $w = (w\_1, w\_2) = \left(\frac{1}{2}, \frac{1}{2}\right)$, we have $\bar{\alpha} = \frac{1}{2}$, and we recover the Jensen–Shannon divergence:

$$\text{JS}(p:q) = h\left(\frac{p+q}{2}\right) - \frac{h(p)+h(q)}{2}.\tag{48}$$

Notice that Equation (13) is the usual definition of the Jensen–Shannon divergence, while Equation (48) is the reduced formula of the JSD, which can be interpreted as a Jensen gap for Shannon entropy, hence its name: The *Jensen–Shannon divergence*.

Moreover, if we consider the cross-entropy/entropy extended to positive densities $\tilde{p}$ and $\tilde{q}$:

$$h\_+^{\times}(\tilde{p}:\tilde{q}) = -\int (\tilde{p}\log\tilde{q} - \tilde{q})\mathrm{d}\mu,\quad h\_+(\tilde{p}) = h\_+^{\times}(\tilde{p}:\tilde{p}) = -\int (\tilde{p}\log\tilde{p} - \tilde{p})\mathrm{d}\mu,\tag{49}$$

we get:

$$\mathrm{JS}\_{+}^{\alpha,w}(\tilde{p}:\tilde{q}) = \sum\_{i=1}^{k} w\_{i} \mathrm{KL}\_{+}((\tilde{p}\tilde{q})\_{\alpha\_{i}} : (\tilde{p}\tilde{q})\_{\bar{\alpha}}) = h\_{+}((\tilde{p}\tilde{q})\_{\bar{\alpha}}) - \sum\_{i=1}^{k} w\_{i}h\_{+}((\tilde{p}\tilde{q})\_{\alpha\_{i}}).\tag{50}$$

Next, we shall prove that our generalization of the skew Jensen–Shannon divergence to vector-skewing is always bounded. We first state a lemma bounding the KLD between two mixtures sharing the same components:

**Lemma 1** (KLD between two *w*-mixtures)**.** *For* $\alpha \in [0,1]$ *and* $\beta \in (0,1)$*, we have:*

$$\mathrm{KL}\_{\alpha,\beta}(p:q) = \mathrm{KL}\left((pq)\_{\alpha} : (pq)\_{\beta}\right) \le \log \max\left\{ \frac{1-\alpha}{1-\beta}, \frac{\alpha}{\beta} \right\}.$$

**Proof.** For *p*(*x*), *q*(*x*) > 0, we have

$$\frac{(1-\alpha)p(x)+\alpha q(x)}{(1-\beta)p(x)+\beta q(x)} \le \max\left\{\frac{1-\alpha}{1-\beta}, \frac{\alpha}{\beta}\right\}.\tag{51}$$

Indeed, by considering the two cases $\alpha \ge \beta$ (or equivalently, $1-\alpha \le 1-\beta$) and $\alpha \le \beta$ (or equivalently, $1-\alpha \ge 1-\beta$), we check that $(1-\alpha)p(x) \le \max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}(1-\beta)p(x)$ and $\alpha q(x) \le \max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}\beta q(x)$. Thus, we have $\frac{(1-\alpha)p(x)+\alpha q(x)}{(1-\beta)p(x)+\beta q(x)} \le \max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\}$. Therefore, it follows that:

$$\mathrm{KL}\left((pq)\_{\alpha} : (pq)\_{\beta}\right) \le \int (pq)\_{\alpha} \log \max\left\{ \frac{1-\alpha}{1-\beta}, \frac{\alpha}{\beta} \right\} \mathrm{d}\mu = \log \max\left\{ \frac{1-\alpha}{1-\beta}, \frac{\alpha}{\beta} \right\}.\tag{52}$$

Notice that we can interpret $\log \max\left\{\frac{1-\alpha}{1-\beta},\frac{\alpha}{\beta}\right\} = \max\left\{\log\frac{1-\alpha}{1-\beta}, \log\frac{\alpha}{\beta}\right\}$ as the $\infty$-Rényi divergence [36,37] between the two two-point distributions $(\alpha, 1-\alpha)$ and $(\beta, 1-\beta)$; see Theorem 6 of [36].

A weaker upper bound is $\mathrm{KL}((pq)\_{\alpha} : (pq)\_{\beta}) \le \log\frac{1}{\beta(1-\beta)}$. Indeed, let us form a partition of the sample space $\mathcal{X}$ into two dominance regions, $R\_p := \{x \in \mathcal{X} : p(x) \ge q(x)\}$ and $R\_q := \{x \in \mathcal{X} : q(x) > p(x)\}$.

We have $(pq)\_{\alpha}(x) = (1-\alpha)p(x) + \alpha q(x) \le p(x)$ for $x \in R\_p$ and $(pq)\_{\alpha}(x) \le q(x)$ for $x \in R\_q$. It follows that

$$\mathrm{KL}\left((pq)\_{\alpha} : (pq)\_{\beta}\right) \le \int\_{R\_p} (pq)\_{\alpha}(x) \log \frac{p(x)}{(1-\beta)p(x)} \mathrm{d}\mu(x) + \int\_{R\_q} (pq)\_{\alpha}(x) \log \frac{q(x)}{\beta q(x)} \mathrm{d}\mu(x).$$

That is, $\mathrm{KL}((pq)\_{\alpha} : (pq)\_{\beta}) \le -\log(1-\beta) - \log\beta = \log\frac{1}{\beta(1-\beta)}$. Notice that we allow $\alpha$ to take the extreme values $\{0,1\}$, but not $\beta$ (i.e., $\beta \in (0,1)$).

In fact, it is known that for both $\alpha, \beta \in (0,1)$, computing $\mathrm{KL}\left((pq)\_{\alpha} : (pq)\_{\beta}\right)$ amounts to computing a Bregman divergence for the Shannon negentropy generator, since $\{(pq)\_{\gamma} : \gamma \in (0,1)\}$ defines a *mixture family* [38] of order 1 in information geometry. Hence, it is always finite, as Bregman divergences are always finite (but not necessarily bounded).

By using the fact that

$$\mathrm{JS}^{\alpha,w}(p:q) = \sum\_{i=1}^{k} w\_i \mathrm{KL}\left( (pq)\_{\alpha\_i} : (pq)\_{\bar{\alpha}} \right),\tag{53}$$

we conclude that the vector-skew Jensen–Shannon divergence is upper-bounded:

**Lemma 2** (Bounded (*w*, *α*)-Jensen–Shannon divergence)**.** $\mathrm{JS}^{\alpha,w}$ *is bounded by* $\log\frac{1}{\bar{\alpha}(1-\bar{\alpha})}$*, where* $\bar{\alpha} = \sum\_{i=1}^{k} w\_i\alpha\_i \in (0,1)$*.*

**Proof.** We have $\mathrm{JS}^{\alpha,w}(p:q) = \sum\_i w\_i \mathrm{KL}\left((pq)\_{\alpha\_i} : (pq)\_{\bar{\alpha}}\right)$. Since $0 \le \mathrm{KL}\left((pq)\_{\alpha\_i} : (pq)\_{\bar{\alpha}}\right) \le \log\frac{1}{\bar{\alpha}(1-\bar{\alpha})}$, it follows that we have

$$0 \le \mathrm{JS}^{\alpha, w}(p:q) \le \log \frac{1}{\bar{\alpha}(1-\bar{\alpha})}.$$

Notice that we also have

$$\mathrm{JS}^{\alpha, w}(p:q) \le \sum\_{i} w\_i \log \max \left\{ \frac{1 - \alpha\_i}{1 - \bar{\alpha}}, \frac{\alpha\_i}{\bar{\alpha}} \right\}.$$
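Both upper bounds are easy to stress-test with random discrete distributions; a minimal Monte Carlo sketch (our own helper names):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    return -float(np.sum(p * np.log(p)))

for _ in range(100):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    alphas = rng.uniform(0.05, 0.95, size=3)
    weights = rng.dirichlet(np.ones(3))
    abar = float(weights @ alphas)
    mix = lambda t: (1.0 - t) * p + t * q
    # Entropic form of the vector-skew JSD (Eq. (47)):
    js = entropy(mix(abar)) - sum(w * entropy(mix(a))
                                  for a, w in zip(alphas, weights))
    bound_lemma2 = np.log(1.0 / (abar * (1.0 - abar)))          # Lemma 2
    bound_lemma1 = sum(w * np.log(max((1.0 - a) / (1.0 - abar), a / abar))
                       for a, w in zip(alphas, weights))        # termwise Lemma 1
    assert -1e-12 <= js <= min(bound_lemma2, bound_lemma1) + 1e-12
```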

The vector-skew Jensen–Shannon divergence is symmetric if and only if for each index $i \in [k]$ there exists a matching index $\sigma(i)$ such that $\alpha\_{\sigma(i)} = 1-\alpha\_i$ and $w\_{\sigma(i)} = w\_i$.

For example, we may define the *symmetric scalar α-skew Jensen–Shannon divergence* as

$$\mathrm{JS}\_{s}^{\alpha}(p,q) = \frac{1}{2}\mathrm{KL}((pq)\_{\alpha} : (pq)\_{\frac{1}{2}}) + \frac{1}{2}\mathrm{KL}((pq)\_{1-\alpha} : (pq)\_{\frac{1}{2}}), \tag{54}$$

$$= \frac{1}{2}\int (pq)\_{\alpha} \log \frac{(pq)\_{\alpha}}{(pq)\_{\frac{1}{2}}} \mathrm{d}\mu + \frac{1}{2} \int (pq)\_{1-\alpha} \log \frac{(pq)\_{1-\alpha}}{(pq)\_{\frac{1}{2}}} \mathrm{d}\mu,\tag{55}$$

$$= \frac{1}{2}\int (qp)\_{1-\alpha} \log \frac{(qp)\_{1-\alpha}}{(qp)\_{\frac{1}{2}}} \mathrm{d}\mu + \frac{1}{2} \int (qp)\_{\alpha} \log \frac{(qp)\_{\alpha}}{(qp)\_{\frac{1}{2}}} \mathrm{d}\mu, \tag{56}$$

$$= h((pq)\_{\frac{1}{2}}) - \frac{h((pq)\_{\alpha}) + h((pq)\_{1-\alpha})}{2},\tag{57}$$

$$=: \mathrm{JS}\_{s}^{\alpha}(q,p),\tag{58}$$

since it holds that $(ab)\_{c} = (ba)\_{1-c}$ for any $a, b, c \in \mathbb{R}$. Note that $\mathrm{JS}\_{s}^{\alpha}(p,q) \neq \mathrm{JS}^{\alpha}(p,q)$.

**Remark 2.** *We can always symmetrize a vector-skew Jensen–Shannon divergence by doubling the dimension of the skewing vector. Let* $\alpha = (\alpha\_1, \dots, \alpha\_k)$ *and* $w$ *be the vector parameters of an asymmetric vector-skew JSD, and consider* $\alpha' = (1-\alpha\_1, \dots, 1-\alpha\_k)$ *and* $w$ *to be the parameters of* $\mathrm{JS}^{\alpha',w}$*. Then,* $\mathrm{JS}^{(\alpha,\alpha'),(\frac{w}{2},\frac{w}{2})}$ *is a symmetric vector-skew JSD:*

$$\mathrm{JS}^{(\alpha,\alpha'),(\frac{w}{2},\frac{w}{2})}(p:q) := \frac{1}{2}\mathrm{JS}^{\alpha,w}(p:q) + \frac{1}{2}\mathrm{JS}^{\alpha',w}(p:q),\tag{59}$$

$$= \frac{1}{2}\mathrm{JS}^{\alpha,w}(p:q) + \frac{1}{2}\mathrm{JS}^{\alpha,w}(q:p) = \mathrm{JS}^{(\alpha,\alpha'),(\frac{w}{2},\frac{w}{2})}(q:p).\tag{60}$$

*Entropy* **2020**, *22*, 221

*Since the vector-skew Jensen–Shannon divergence is an f-divergence for the generator* $f\_{\alpha,w}$ *(Theorem 1), we can take the generator* $f\_{\alpha,w}^{s}(u) = \frac{f\_{\alpha,w}(u)+f\_{\alpha,w}^\*(u)}{2}$ *to define the symmetrized f-divergence, where* $f\_{\alpha,w}^\*(u) = u f\_{\alpha,w}\left(\frac{1}{u}\right)$ *denotes the convex conjugate function. When* $f\_{\alpha,w}$ *yields a symmetric f-divergence* $I\_{f\_{\alpha,w}}$*, we can apply the generic upper bound of f-divergences (i.e.,* $I\_f \le f(0) + f^\*(0)$*) to get the upper bound on the symmetric vector-skew Jensen–Shannon divergences:*

$$I\_{f\_{\alpha,w}}(p:q) \le f\_{\alpha,w}(0) + f\_{\alpha,w}^\*(0),\tag{61}$$

$$= \sum\_{i=1}^{k} w\_i \left( (1 - \alpha\_i) \log \frac{1 - \alpha\_i}{1 - \bar{\alpha}} + \alpha\_i \log \frac{\alpha\_i}{\bar{\alpha}} \right), \tag{62}$$

*since*

$$f\_{\alpha,w}^\*(u) = u f\_{\alpha,w}\left(\frac{1}{u}\right),\tag{63}$$

$$= \sum\_{i=1}^{k} w\_i ((1 - \alpha\_i)u + \alpha\_i) \log \frac{(1 - \alpha\_i)u + \alpha\_i}{(1 - \bar{\alpha})u + \bar{\alpha}}.\tag{64}$$

*For example, consider the ordinary Jensen–Shannon divergence with* $w = \left(\frac{1}{2}, \frac{1}{2}\right)$ *and* $\alpha = (0, 1)$*. Then, we find* $\mathrm{JS}(p:q) = I\_{f\_{(0,1),(\frac{1}{2},\frac{1}{2})}}(p:q) \le \frac{1}{2}\log 2 + \frac{1}{2}\log 2 = \log 2$*, the usual upper bound of the JSD.*

As a side note, let us notice that our notation (*pq*)*<sup>α</sup>* allows one to compactly write the following property:

**Property 1.** *We have* $q = (qq)\_{\lambda}$ *for any* $\lambda \in [0,1]$*, and* $((p\_1 p\_2)\_{\lambda}(q\_1 q\_2)\_{\lambda})\_{\alpha} = ((p\_1 q\_1)\_{\alpha}(p\_2 q\_2)\_{\alpha})\_{\lambda}$ *for any* $\alpha, \lambda \in [0,1]$*.*

**Proof.** Clearly, $q = (1-\lambda)q + \lambda q =: (qq)\_{\lambda}$ for any $\lambda \in [0,1]$. Now, we have

$$((p\_1p\_2)\_{\lambda}(q\_1q\_2)\_{\lambda})\_{\alpha} = (1-\alpha)(p\_1p\_2)\_{\lambda} + \alpha(q\_1q\_2)\_{\lambda}, \tag{65}$$

$$= (1-\alpha)\left((1-\lambda)p\_1+\lambda p\_2\right) + \alpha\left((1-\lambda)q\_1+\lambda q\_2\right),\tag{66}$$

$$= (1-\lambda)\left((1-\alpha)p\_1+\alpha q\_1\right) + \lambda\left((1-\alpha)p\_2+\alpha q\_2\right),\tag{67}$$

$$= (1-\lambda)(p\_1q\_1)\_{\alpha} + \lambda(p\_2q\_2)\_{\alpha},\tag{68}$$

$$= ((p\_1q\_1)\_{\alpha}(p\_2q\_2)\_{\alpha})\_{\lambda}. \tag{69}$$

#### *2.3. Building Symmetric Families of Vector-Skewed Jensen–Shannon Divergences*

We can build infinitely many vector-skew Jensen–Shannon divergences. For example, consider $\alpha = \left(0, 1, \frac{1}{3}\right)$ and $w = \left(\frac{1}{3}, \frac{1}{3}, \frac{1}{3}\right)$. Then, $\bar{\alpha} = \frac{1}{3} + \frac{1}{9} = \frac{4}{9}$, and

$$\mathrm{JS}^{\alpha,w}(p:q) = h\left((pq)\_{\frac{4}{9}}\right) - \frac{h(p) + h(q) + h\left((pq)\_{\frac{1}{3}}\right)}{3} \neq \mathrm{JS}^{\alpha,w}(q:p). \tag{70}$$

Interestingly, we can also build infinitely many families of *symmetric* vector-skew Jensen–Shannon divergences. The following examples illustrate the construction process:

• Consider $k = 2$. Let $(w, 1-w)$ denote the weight vector, and $\alpha = (\alpha\_1, \alpha\_2)$ the skewing vector. We have $\bar{\alpha} = w\alpha\_1 + (1-w)\alpha\_2 = \alpha\_2 + w(\alpha\_1 - \alpha\_2)$. The vector-skew JSD is symmetric iff $w = 1-w = \frac{1}{2}$ (with $\bar{\alpha} = \frac{\alpha\_1+\alpha\_2}{2}$) and $\alpha\_2 = 1-\alpha\_1$. In that case, we have $\bar{\alpha} = \frac{1}{2}$, and we obtain the following family of symmetric Jensen–Shannon divergences:

$$\mathrm{JS}^{(\alpha,1-\alpha),(\frac{1}{2},\frac{1}{2})}(p,q) = h\left((pq)\_{\frac{1}{2}}\right) - \frac{h((pq)\_{\alpha}) + h((pq)\_{1-\alpha})}{2},\tag{71}$$

$$= h\left((pq)\_{\frac{1}{2}}\right) - \frac{h((pq)\_{\alpha}) + h((qp)\_{\alpha})}{2} = \mathrm{JS}^{(\alpha,1-\alpha),(\frac{1}{2},\frac{1}{2})}(q,p).\tag{72}$$

• Consider $k = 4$, weight vector $w = \left(\frac{1}{3}, \frac{1}{3}, \frac{1}{6}, \frac{1}{6}\right)$, and skewing vector $\alpha = (\alpha\_1, 1-\alpha\_1, \alpha\_2, 1-\alpha\_2)$ for $\alpha\_1, \alpha\_2 \in (0,1)$. Then, $\bar{\alpha} = \frac{1}{2}$, and we get the following family of symmetric vector-skew JSDs:

$$\mathrm{JS}^{(\alpha\_1,\alpha\_2)}(p,q) = h\left((pq)\_{\frac{1}{2}}\right) - \frac{2h((pq)\_{\alpha\_1}) + 2h((pq)\_{1-\alpha\_1}) + h((pq)\_{\alpha\_2}) + h((pq)\_{1-\alpha\_2})}{6},\tag{73}$$

$$= h\left((pq)\_{\frac{1}{2}}\right) - \frac{2h((pq)\_{\alpha\_1}) + 2h((qp)\_{\alpha\_1}) + h((pq)\_{\alpha\_2}) + h((qp)\_{\alpha\_2})}{6}, \tag{74}$$

$$= \mathrm{JS}^{(\alpha\_1,\alpha\_2)}(q,p).\tag{75}$$

• We can similarly carry on the construction of such symmetric JSDs by increasing the dimensionality of the skewing vector.

In fact, we can define

$$\mathrm{JS}\_{s}^{\alpha,w}(p,q) := h\left((pq)\_{\frac{1}{2}}\right) - \sum\_{i=1}^{k} w\_i \frac{h((pq)\_{\alpha\_i}) + h((pq)\_{1-\alpha\_i})}{2} = \sum\_{i=1}^{k} w\_i \mathrm{JS}\_{s}^{\alpha\_i}(p,q), \tag{76}$$

with

$$\mathrm{JS}\_s^a(p,q) := h\left((pq)\_{\frac{1}{2}}\right) - \frac{h((pq)\_a) + h((pq)\_{1-a})}{2}.\tag{77}$$
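The symmetry of these constructions can be spot-checked numerically; the following sketch (helper names are ours) tests the $k = 4$ family above:

```python
import numpy as np

def entropy(p):
    return -float(np.sum(p * np.log(p)))

def js_vs(p, q, alphas, weights):
    """Entropic form of JS^{alpha,w}(p:q) for discrete distributions."""
    abar = float(np.dot(weights, alphas))
    mix = lambda t: (1.0 - t) * p + t * q
    return entropy(mix(abar)) - sum(w * entropy(mix(a))
                                    for a, w in zip(alphas, weights))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(4))
q = rng.dirichlet(np.ones(4))

# k = 4 construction: alpha = (a1, 1-a1, a2, 1-a2), w = (1/3, 1/3, 1/6, 1/6)
a1, a2 = 0.2, 0.7
alphas = [a1, 1 - a1, a2, 1 - a2]
weights = [1/3, 1/3, 1/6, 1/6]
assert abs(np.dot(weights, alphas) - 0.5) < 1e-12        # abar = 1/2
assert abs(js_vs(p, q, alphas, weights)
           - js_vs(q, p, alphas, weights)) < 1e-12       # symmetric in (p, q)
```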

#### **3. Jensen–Shannon Centroids on Mixture Families**

#### *3.1. Mixture Families and Jensen–Shannon Divergences*

Consider a mixture family in information geometry [25]. That is, consider a prescribed set of $D+1$ linearly independent probability densities $p\_0(x), \dots, p\_D(x)$ defined on the sample space $\mathcal{X}$. A *mixture family* $\mathcal{M}$ of order $D$ consists of all *strictly* convex combinations of these component densities:

$$\mathcal{M} := \left\{ m(\mathbf{x}; \boldsymbol{\theta}) := \sum\_{i=1}^{D} \theta^{i} p\_{i}(\mathbf{x}) + \left( 1 - \sum\_{i=1}^{D} \theta^{i} \right) p\_{0}(\mathbf{x}) \; : \; \theta^{i} > 0, \; \sum\_{i=1}^{D} \theta^{i} < 1 \right\}. \tag{78}$$

For example, the family of categorical distributions (sometimes called "multinouilli" distributions) is a mixture family [25]:

$$\mathcal{M} = \left\{ m\_{\theta}(x) = \sum\_{i=1}^{D} \theta\_{i} \delta(x - x\_{i}) + \left( 1 - \sum\_{i=1}^{D} \theta\_{i} \right) \delta(x - x\_{0}) \right\}, \tag{79}$$

where $\delta(x)$ is the Dirac distribution (i.e., $\delta(x) = 1$ for $x = 0$ and $\delta(x) = 0$ for $x \neq 0$). Note that the mixture family of categorical distributions can also be interpreted as an exponential family.

Notice that the linear independence assumption on the probability densities ensures an identifiable model: $\theta \leftrightarrow m(x;\theta)$.

The KL divergence between two densities of a mixture family M amounts to a Bregman divergence for the Shannon negentropy generator *F*(*θ*) = −*h*(*m<sup>θ</sup>* ) (see [38]):

$$\text{KL}(m\_{\theta\_1} : m\_{\theta\_2}) = B\_F(\theta\_1 : \theta\_2) = B\_{-h(m\_\theta)}(\theta\_1 : \theta\_2). \tag{80}$$


On a mixture manifold M, the mixture density (1 − *α*)*mθ*<sup>1</sup> + *αmθ*<sup>2</sup> of two mixtures *mθ*<sup>1</sup> and *mθ*<sup>2</sup> of M also belongs to M:

$$(1 - \mathfrak{a})m\_{\theta\_1} + \mathfrak{a}m\_{\theta\_2} = m\_{(\theta\_1 \theta\_2)\_a} \in \mathcal{M},\tag{81}$$

where we extend the notation $(\theta\_1\theta\_2)\_{\alpha} := (1-\alpha)\theta\_1 + \alpha\theta\_2$ to vectors $\theta\_1$ and $\theta\_2$: $(\theta\_1\theta\_2)\_{\alpha}^{i} = (\theta\_1^{i}\theta\_2^{i})\_{\alpha}$.

Thus, the vector-skew JSD amounts to a vector-skew Jensen diversity for the Shannon negentropy convex function *F*(*θ*) = −*h*(*m<sup>θ</sup>* ):

$$\mathrm{JS}^{\alpha,w}(m\_{\theta\_1} : m\_{\theta\_2}) = \sum\_{i=1}^{k} w\_i \mathrm{KL}\left( (m\_{\theta\_1} m\_{\theta\_2})\_{\alpha\_i} : (m\_{\theta\_1} m\_{\theta\_2})\_{\bar{\alpha}} \right), \tag{82}$$

$$= \sum\_{i=1}^{k} w\_i \mathrm{KL}\left(m\_{(\theta\_1 \theta\_2)\_{\alpha\_i}} : m\_{(\theta\_1 \theta\_2)\_{\bar{\alpha}}}\right), \tag{83}$$

$$= \sum\_{i=1}^{k} w\_i B\_F\left( (\theta\_1 \theta\_2)\_{\alpha\_i} : (\theta\_1 \theta\_2)\_{\bar{\alpha}} \right),\tag{84}$$

$$= \mathrm{JB}\_{F}^{\alpha,w}(\theta\_1:\theta\_2),\tag{85}$$

$$= \sum\_{i=1}^{k} w\_i F\left((\theta\_1 \theta\_2)\_{\alpha\_i}\right) - F\left((\theta\_1 \theta\_2)\_{\bar{\alpha}}\right),\tag{86}$$

$$= h\left(m\_{(\theta\_1\theta\_2)\_{\bar{\alpha}}}\right) - \sum\_{i=1}^{k} w\_i h\left(m\_{(\theta\_1\theta\_2)\_{\alpha\_i}}\right).\tag{87}$$

#### *3.2. Jensen–Shannon Centroids*

Given a set of $n$ mixture densities $m\_{\theta\_1}, \dots, m\_{\theta\_n}$ of $\mathcal{M}$, we seek to calculate the *skew-vector Jensen–Shannon centroid* (or barycenter for non-uniform weights) defined as $m\_{\theta^\*}$, where $\theta^\*$ is the minimizer of the following objective function (or loss function):

$$L(\theta) := \sum\_{j=1}^{n} \omega\_j \mathrm{JS}^{\alpha,w}(m\_{\theta\_j} : m\_{\theta}), \tag{88}$$

where $\omega \in \Delta\_n$ is the weight vector of the densities (uniform weights for the centroid, and non-uniform weights for a barycenter). This definition of the skew-vector Jensen–Shannon centroid generalizes the *Fréchet mean* [39] to non-metric spaces (the Fréchet mean may not be unique, as is the case on the sphere for two antipodal points, whose Fréchet means with respect to the geodesic metric distance form a great circle). Since the divergence $\mathrm{JS}^{\alpha,w}$ is strictly separable convex, it follows that the Jensen–Shannon-type centroids are unique when they exist.

Plugging Equation (86) into Equation (88), we get that the calculation of the Jensen–Shannon centroid amounts to the following minimization problem:

$$L(\theta) = \sum\_{j=1}^{n} \omega\_j \left( \sum\_{i=1}^{k} w\_i F((\theta\_j \theta)\_{\alpha\_i}) - F\left((\theta\_j \theta)\_{\bar{\alpha}}\right) \right). \tag{89}$$

This optimization is a *Difference of Convex* (DC) programming optimization, for which we can use the ConCave–Convex procedure [27,40] (CCCP). Indeed, let us define the following two convex functions:

$$A(\theta) = \sum\_{j=1}^{n} \sum\_{i=1}^{k} \omega\_j w\_i F((\theta\_j \theta)\_{\alpha\_i}),\tag{90}$$

$$B(\theta) = \sum\_{j=1}^{n} \omega\_j F\left( (\theta\_j \theta)\_{\bar{\alpha}} \right). \tag{91}$$

Both functions *A*(*θ*) and *B*(*θ*) are convex since *F* is convex. Then, the minimization problem of Equation (89) to solve can be rewritten as:

$$\min\_{\theta} A(\theta) - B(\theta). \tag{92}$$

This is a DC programming optimization problem which can be solved iteratively by initializing *θ* to an arbitrary value *θ* (0) (say, the centroid of the *θi*s), and then by updating the parameter at step *t* using the CCCP [27] as follows:

$$\theta^{(t+1)} = (\nabla B)^{-1} (\nabla A(\theta^{(t)})).\tag{93}$$

Compared to a gradient descent local optimization, there is no required step size (also called "learning" rate) in CCCP.

We have $\nabla A(\theta) = \sum\_{j=1}^{n} \sum\_{i=1}^{k} \omega\_j w\_i \alpha\_i \nabla F((\theta\_j\theta)\_{\alpha\_i})$ and $\nabla B(\theta) = \sum\_{j=1}^{n} \omega\_j \bar{\alpha} \nabla F\left((\theta\_j\theta)\_{\bar{\alpha}}\right)$.

The CCCP converges to a local optimum *θ* ∗ where the support hyperplanes of the function graphs of *A* and *B* at *θ* ∗ are parallel to each other, as depicted in Figure 1. The set of stationary points is {*θ* : ∇*A*(*θ*) = ∇*B*(*θ*)}. In practice, the delicate step is to invert ∇*B*. Next, we show how to implement this algorithm for the Jensen–Shannon centroid of a set of categorical distributions (i.e., normalized histograms with all non-empty bins).

**Figure 1.** The Convex–ConCave Procedure (CCCP) iteratively updates the parameter *θ* by aligning the support hyperplanes at *θ*. In the limit case of convergence to *θ* ∗ , the support hyperplanes at *θ* ∗ are parallel to each other. CCCP finds a local minimum.

#### 3.2.1. Jensen–Shannon Centroids of Categorical Distributions

To illustrate the method, let us consider the mixture family of categorical distributions [25]:

$$\mathcal{M} = \left\{ m\_{\theta}(\mathbf{x}) = \sum\_{i=1}^{D} \theta\_{i} \delta(\mathbf{x} - \mathbf{x}\_{i}) + \left( 1 - \sum\_{i=1}^{D} \theta\_{i} \right) \delta(\mathbf{x} - \mathbf{x}\_{0}) \right\}. \tag{94}$$

The Shannon negentropy is

$$F(\theta) = -h(m\_{\theta}) = \sum\_{i=1}^{D} \theta\_i \log \theta\_i + \left(1 - \sum\_{i=1}^{D} \theta\_i\right) \log \left(1 - \sum\_{i=1}^{D} \theta\_i\right). \tag{95}$$

We have the partial derivatives

$$\nabla F(\theta) = \left[\frac{\partial}{\partial \theta\_i} F(\theta)\right]\_{i}, \quad \frac{\partial}{\partial \theta\_i} F(\theta) = \log \frac{\theta\_{i}}{1 - \sum\_{j=1}^{D} \theta\_j}. \tag{96}$$


Inverting the gradient ∇*F* requires us to solve the equation ∇*F*(*θ*) = *η* so that we get *θ* = (∇*F*) −1 (*η*). We find that

$$\nabla F^\*(\eta) = (\nabla F)^{-1}(\eta) = \left[\frac{\exp(\eta\_i)}{1 + \sum\_{j=1}^{D} \exp(\eta\_j)}\right]\_{i}, \quad \theta\_i = \left((\nabla F)^{-1}(\eta)\right)\_i = \frac{\exp(\eta\_i)}{1 + \sum\_{j=1}^{D} \exp(\eta\_j)}, \quad \forall i \in [D]. \tag{97}$$
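The maps of Equations (96) and (97) form a logit/softmax-like pair; a minimal sketch (our own function names) checks that $(\nabla F)^{-1} \circ \nabla F$ is the identity:

```python
import numpy as np

def grad_F(theta):
    """Eq. (96): eta_i = log(theta_i / (1 - sum_j theta_j))."""
    theta = np.asarray(theta, dtype=float)
    return np.log(theta / (1.0 - theta.sum()))

def grad_F_inv(eta):
    """Eq. (97): theta_i = exp(eta_i) / (1 + sum_j exp(eta_j))."""
    e = np.exp(np.asarray(eta, dtype=float))
    return e / (1.0 + e.sum())

theta = np.array([0.1, 0.3, 0.2])      # D = 3; the weight of x_0 is 0.4
assert np.allclose(grad_F_inv(grad_F(theta)), theta)
```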

Table 1 summarizes the dual view of the family of categorical distributions, either interpreted as an exponential family or as a mixture family.

We have $\mathrm{JS}(p\_1, p\_2) = J\_F(\theta\_1, \theta\_2)$ for $p\_1 = m\_{\theta\_1}$ and $p\_2 = m\_{\theta\_2}$, where

$$J\_F(\theta\_1:\theta\_2) = \frac{F(\theta\_1) + F(\theta\_2)}{2} - F\left(\frac{\theta\_1 + \theta\_2}{2}\right),\tag{98}$$

is the Jensen divergence [40]. Thus, to compute the Jensen–Shannon centroid of a set of *n* densities *p*1, . . . , *p<sup>n</sup>* of a mixture family (with *p<sup>i</sup>* = *mθ<sup>i</sup>* ), we need to solve the following optimization problem for a density *p* = *m<sup>θ</sup>* :

$$\min\_{p} \sum\_{i} \mathrm{JS}(p\_i, p),\tag{99}$$

$$\equiv \min\_{\theta} \sum\_{i} J\_{F}(\theta\_{i}, \theta), \tag{100}$$

$$\equiv \min\_{\theta} \sum\_{i} \frac{F(\theta\_i) + F(\theta)}{2} - F\left(\frac{\theta\_i + \theta}{2}\right),\tag{101}$$

$$\equiv \min\_{\theta} \frac{1}{2} F(\theta) - \frac{1}{n} \sum\_{i} F\left(\frac{\theta\_i + \theta}{2}\right) =: E(\theta). \tag{102}$$

The CCCP algorithm for the Jensen–Shannon centroid proceeds by initializing $\theta^{(0)} = \frac{1}{n}\sum\_{i}\theta\_i$ (the center of mass of the natural parameters), and iteratively updates the parameter as follows:

$$\theta^{(t+1)} = (\nabla F)^{-1} \left( \frac{1}{n} \sum\_{i} \nabla F \left( \frac{\theta\_i + \theta^{(t)}}{2} \right) \right). \tag{103}$$

We iterate until the absolute difference $|E(\theta^{(t)}) - E(\theta^{(t+1)})|$ between two successive iterates $\theta^{(t)}$ and $\theta^{(t+1)}$ goes below a prescribed threshold value. The convergence of the CCCP algorithm is linear [41] to a local minimum that is a fixed point of the equation

$$\theta = M\_H \left( \frac{\theta\_1 + \theta}{2}, \dots, \frac{\theta\_n + \theta}{2} \right),\tag{104}$$

where $M\_H(v\_1, \dots, v\_n) := H^{-1}\left(\frac{1}{n}\sum\_{i=1}^{n} H(v\_i)\right)$ is a vector generalization of the formula of the quasi-arithmetic means [30,40] obtained for the generator $H = \nabla F$. Algorithm 1 summarizes the method for approximating the Jensen–Shannon centroid of a given set of categorical distributions (given a prescribed number of iterations). In the pseudo-code, we used the notation $^{(t+1)}\theta$ instead of $\theta^{(t+1)}$ in order to highlight the conversion procedures of the natural parameters to/from the mixture weight parameters by using superscript notations for coordinates.
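The CCCP iteration of Equation (103) for categorical distributions takes only a few lines. The following sketch (our own function names; uniform weights, and a fixed iteration budget instead of the threshold test) approximates the Jensen–Shannon centroid and checks the fixed-point condition of Equation (104):

```python
import numpy as np

def grad_F(theta):
    """Gradient of the Shannon negentropy of Eq. (95): Eq. (96)."""
    return np.log(theta / (1.0 - theta.sum()))

def grad_F_inv(eta):
    """Inverse gradient, Eq. (97)."""
    e = np.exp(eta)
    return e / (1.0 + e.sum())

def js_centroid(thetas, iters=100):
    """CCCP fixed-point iteration of Eq. (103), uniform weights."""
    theta = np.mean(thetas, axis=0)            # theta^(0): center of mass
    for _ in range(iters):
        theta = grad_F_inv(np.mean([grad_F((t + theta) / 2.0) for t in thetas],
                                   axis=0))
    return theta

# Natural parameters theta = (p_1, ..., p_D) of three categorical distributions (D = 2):
thetas = [np.array([0.2, 0.3]), np.array([0.5, 0.1]), np.array([0.3, 0.4])]
theta_star = js_centroid(thetas)

# Stationarity: grad F(theta*) equals the averaged midpoint gradient (Eq. (104)).
midpoint_grad = np.mean([grad_F((t + theta_star) / 2.0) for t in thetas], axis=0)
assert np.allclose(grad_F(theta_star), midpoint_grad, atol=1e-8)
assert theta_star.sum() < 1.0 and (theta_star > 0).all()
```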





Figure 2 displays the results of the calculations of the Jeffreys centroid [18] and the Jensen–Shannon centroid for two normalized histograms obtained from grey-valued images of Lena and Barbara. Figure 3 shows the Jeffreys centroid and the Jensen–Shannon centroid for the Barbara image and its negative image. Figure 4 demonstrates that the Jensen–Shannon centroid is well defined even if the input histograms do not have coinciding supports. Notice that on the parts of the support where only one distribution is defined, the JS centroid is a scaled copy of that defined distribution.


**Figure 2.** The Jeffreys centroid (grey histogram) and the Jensen–Shannon centroid (black histogram) for two grey normalized histograms of the Lena image (red histogram) and the Barbara image (blue histogram). Although these Jeffreys and Jensen–Shannon centroids look quite similar, observe that there is a major difference between them in the range [0, 20] where the blue histogram is zero.


**Figure 3.** The Jeffreys centroid (grey histogram) and the Jensen–Shannon centroid (black histogram) for the grey normalized histogram of the Barbara image (red histogram) and its negative image (blue histogram which corresponds to the reflection around the vertical axis *x* = 128 of the red histogram).

**Figure 4.** Jensen–Shannon centroid (black histogram) for the clamped grey normalized histogram of the Lena image (red histogram) and the clamped grey normalized histogram of the Barbara image (blue histogram). Notice that on the part of the sample space where only one distribution is non-zero, the JS centroid scales that histogram portion.

3.2.2. Special Cases

Let us now consider two special cases:

• For the special case of $D = 1$, the categorical family is the Bernoulli family, and we have $F(\theta) = \theta \log \theta + (1-\theta) \log(1-\theta)$ (binary negentropy), $F'(\theta) = \log \frac{\theta}{1-\theta}$ (with $F''(\theta) = \frac{1}{\theta(1-\theta)} > 0$), and $(F')^{-1}(\eta) = \frac{e^{\eta}}{1+e^{\eta}}$. The CCCP update rule to compute the binary Jensen–Shannon centroid becomes

$$\theta^{(t+1)} = (F')^{-1} \left( \sum\_{i} w\_i F' \left( \frac{\theta^{(t)} + \theta\_i}{2} \right) \right). \tag{105}$$

• Since the skew-vector Jensen–Shannon divergence formula holds for positive densities:

$$\text{JS}^{+\alpha,w}(\tilde{p}:\tilde{q}) = \sum\_{i=1}^{k} w\_i \, \text{KL}^+\left((\tilde{p}\tilde{q})\_{\alpha\_i} : (\tilde{p}\tilde{q})\_{\bar{\alpha}}\right), \tag{106}$$

$$= \sum\_{i=1}^{k} w\_i \, \text{KL}\left((\tilde{p}\tilde{q})\_{\alpha\_i} : (\tilde{p}\tilde{q})\_{\bar{\alpha}}\right) + \int (\tilde{p}\tilde{q})\_{\bar{\alpha}} \, \mathrm{d}\mu - \underbrace{\sum\_{i=1}^{k} w\_i \int (\tilde{p}\tilde{q})\_{\alpha\_i} \, \mathrm{d}\mu}\_{=\int (\tilde{p}\tilde{q})\_{\bar{\alpha}} \, \mathrm{d}\mu}, \tag{107}$$

$$= \text{JS}^{\alpha,w}(\tilde{p} : \tilde{q}), \tag{108}$$

we can *relax* the computation of the Jensen–Shannon centroid by considering 1D separable minimization problems. We then normalize the positive JS centroids to get an approximation of the probability JS centroids. This approach was also considered when dealing with the Jeffreys centroid [18]. In 1D, we have $F(\theta) = \theta \log \theta - \theta$, $F'(\theta) = \log \theta$, and $(F')^{-1}(\eta) = e^{\eta}$.
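For the Bernoulli case ($D = 1$), the update (105) involves only the logit function and its inverse. The following is a minimal sketch with illustrative names, defaulting to uniform weights:

```python
from math import log, exp

def logit(t):
    # F'(θ) = log(θ / (1 - θ)) for the binary negentropy.
    return log(t / (1.0 - t))

def sigmoid(e):
    # (F')^{-1}(η) = e^η / (1 + e^η).
    return 1.0 / (1.0 + exp(-e))

def binary_js_centroid(thetas, weights=None, iterations=100):
    # CCCP update (105) for Bernoulli parameters `thetas` with weights w_i.
    n = len(thetas)
    w = weights if weights is not None else [1.0 / n] * n
    theta = sum(wi * ti for wi, ti in zip(w, thetas))  # center-of-mass initialization
    for _ in range(iterations):
        theta = sigmoid(sum(wi * logit((theta + ti) / 2.0) for wi, ti in zip(w, thetas)))
    return theta

centroid = binary_js_centroid([0.2, 0.7])
```

By symmetry of the update, the centroid of $\{\theta, 1-\theta\}$ with uniform weights is $1/2$, which gives a quick sanity check.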

In general, calculating the negentropy for a mixture family with continuous densities sharing the same support is not tractable because of the log-sum term of the differential entropy. However, the following remark emphasizes an extension of the mixture family of categorical distributions:

#### 3.2.3. Some Remarks and Properties

**Remark 3.** *Consider a mixture family $m_\theta(x) = \sum_{i=1}^{D} \theta_i p_i(x) + \left(1 - \sum_{i=1}^{D} \theta_i\right) p_0(x)$ (for a parameter $\theta$ belonging to the $D$-dimensional standard simplex) of probability densities $p_0(x), \ldots, p_D(x)$ defined respectively on the supports $\mathcal{X}_0, \mathcal{X}_1, \ldots, \mathcal{X}_D$. Let $\theta_0 := 1 - \sum_{i=1}^{D} \theta_i$. Assume that the supports $\mathcal{X}_i$ of the $p_i$'s are* mutually non-intersecting *($\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$ for all $i \neq j$, implying that the $D+1$ densities are linearly independent), so that $m_\theta(x) = \theta_i p_i(x)$ for all $x \in \mathcal{X}_i$, and let $\mathcal{X} = \cup_i \mathcal{X}_i$. Consider the Shannon negentropy $F(\theta) = -h(m_\theta)$ as a strictly convex function. Then, we have*

$$F(\theta) = -h(m\_{\theta}) = \int\_{\mathcal{X}} m\_{\theta}(x) \log m\_{\theta}(x) \, \mathrm{d}\mu(x), \tag{109}$$

$$= \sum\_{i=0}^{D} \theta\_i \int\_{\mathcal{X}\_i} p\_i(x) \log(\theta\_i p\_i(x)) \, \mathrm{d}\mu(x), \tag{110}$$

$$= \sum\_{i=0}^{D} \theta\_i \log \theta\_i - \sum\_{i=0}^{D} \theta\_i h(p\_i). \tag{111}$$

*Note that the term $\sum_i \theta_i h(p_i)$ is affine in $\theta$, and Bregman divergences are defined up to affine terms, so that the Bregman generator $F$ is equivalent to the Bregman generator of the family of categorical distributions. This example generalizes the ordinary mixture family of categorical distributions, where the $p_i$'s are distinct Dirac distributions. Note that when the supports of the component distributions are not pairwise disjoint, the (neg)entropy may not be analytic [42] (e.g., a mixture of the convex weighting of two prescribed distinct Gaussian distributions). This contrasts with the fact that the cumulant function of an exponential family is always real-analytic [43]. Observe that the term $\sum_i \theta_i h(p_i)$ can be interpreted as a conditional entropy: $\sum_i \theta_i h(p_i) = h(X|\Theta)$, where $\Pr(\Theta = i) = \theta_i$ and $\Pr(X \in S \,|\, \Theta = i) = \int_S p_i(x) \, \mathrm{d}\mu(x)$.*

*Notice that we can truncate an exponential family [25] to get a (potentially non-regular [44]) exponential family for defining the $p_i$'s on mutually non-intersecting domains $\mathcal{X}_i$. The entropy of a natural exponential family $\{e(x:\theta) = \exp(x^\top \theta - \psi(\theta)) : \theta \in \Theta\}$ with cumulant function $\psi(\theta)$ and natural parameter space $\Theta$ is $-\psi^*(\eta)$, where $\eta = \nabla\psi(\theta)$ and $\psi^*$ is the Legendre convex conjugate [45]: $h(e(x:\theta)) = -\psi^*(\nabla\psi(\theta))$.*

In general, the entropy and cross-entropy between densities of a mixture family (whether the distributions have disjoint supports or not) can be calculated in closed-form.

**Property 2.** *The entropy of a density belonging to a mixture family $\mathcal{M}$ is $h(m_\theta) = -F(\theta)$, and the cross-entropy between two mixture densities $m_{\theta_1}$ and $m_{\theta_2}$ is $h^{\times}(m_{\theta_1} : m_{\theta_2}) = -F(\theta_2) - (\theta_1 - \theta_2)^\top \eta_2 = F^*(\eta_2) - \theta_1^\top \eta_2$, where $\eta_2 = \nabla F(\theta_2)$.*

**Proof.** Let us write the KLD as the cross-entropy minus the entropy [4]:

$$\text{KL}(m\_{\theta\_1} : m\_{\theta\_2}) \quad = \quad h^\times(m\_{\theta\_1} : m\_{\theta\_2}) - h(m\_{\theta\_1}), \tag{112}$$

$$=\ \mathcal{B}\_{\mathcal{F}}(\theta\_1 : \theta\_2),\tag{113}$$

$$=\left.F(\theta\_1) - F(\theta\_2) - (\theta\_1 - \theta\_2)^\top \nabla F(\theta\_2). \tag{114}$$

Following [45], we deduce that $h(m_\theta) = -F(\theta) + c$ and $h^{\times}(m_{\theta_1} : m_{\theta_2}) = -F(\theta_2) - (\theta_1 - \theta_2)^\top \eta_2 - c$ for a constant $c$. Since $F(\theta) = -h(m_\theta)$ by definition, it follows that $c = 0$ and that $h^{\times}(m_{\theta_1} : m_{\theta_2}) = -F(\theta_2) - (\theta_1 - \theta_2)^\top \eta_2 = F^*(\eta_2) - \theta_1^\top \eta_2$, where $\eta = \nabla F(\theta)$.
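For the mixture family of categorical distributions, the identity $\text{KL}(m_{\theta_1} : m_{\theta_2}) = B_F(\theta_1 : \theta_2)$ from (112)–(114) can be checked numerically. The sketch below (illustrative names; natural-base logarithms) compares the two sides:

```python
from math import log

def F(theta):
    # Shannon negentropy of a categorical distribution; θ = (θ_1, ..., θ_D), θ_0 = 1 - Σ θ_j.
    full = [1.0 - sum(theta)] + list(theta)
    return sum(t * log(t) for t in full)

def grad_F(theta):
    theta0 = 1.0 - sum(theta)
    return [log(t / theta0) for t in theta]

def bregman_F(theta1, theta2):
    # B_F(θ1 : θ2) = F(θ1) - F(θ2) - (θ1 - θ2)^T ∇F(θ2), cf. (114).
    g = grad_F(theta2)
    return F(theta1) - F(theta2) - sum((a - b) * gi for a, b, gi in zip(theta1, theta2, g))

def kl(p, q):
    # Discrete Kullback-Leibler divergence between full probability vectors.
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

theta1, theta2 = [0.5, 0.3], [0.1, 0.6]
p1 = [1.0 - sum(theta1)] + theta1   # (0.2, 0.5, 0.3)
p2 = [1.0 - sum(theta2)] + theta2   # (0.3, 0.1, 0.6)
```

The two quantities agree up to floating-point error, as the derivation predicts.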

Thus, we can numerically compute the Jensen–Shannon centroids (or barycenters) of a set of densities belonging to a mixture family. This includes the case of categorical distributions and the case of Gaussian Mixture Models (GMMs) with prescribed Gaussian components [38] (although in this case, the negentropy needs to be stochastically approximated using Monte Carlo techniques [46]). When the densities do not belong to a mixture family (say, the Gaussian family, which is an exponential family [25]), we face the problem that the mixture of two densities no longer belongs to the family. One way to tackle this problem is to project the mixture onto the Gaussian family. This corresponds to an *m*-projection (mixture projection), which can be interpreted as a Maximum Entropy projection of the mixture [25,47].

Notice that we can perform fast *k*-means clustering without centroid calculations using a generalization of the *k*-means++ probabilistic initialization [48,49]. See [50] for details of the generalized *k*-means++ probabilistic initialization defined according to an arbitrary divergence.

Finally, let us notice some decompositions of the Jensen–Shannon divergence and the skew Jensen divergences.

**Remark 4.** *We have the following decomposition for the Jensen–Shannon divergence:*

$$\text{JS}(p\_1, p\_2) \quad = \; h\left(\frac{p\_1 + p\_2}{2}\right) - \frac{h(p\_1) + h(p\_2)}{2},\tag{115}$$

$$= h\_{\text{JS}}^{\times}(p\_1 : p\_2) - h\_{\text{JS}}(p\_2) \ge 0, \tag{116}$$

*where*

$$h\_{\rm JS}^{\times}(p\_1:p\_2) = h\left(\frac{p\_1+p\_2}{2}\right) - \frac{1}{2}h(p\_1),\tag{117}$$

*and $h_{\text{JS}}(p_2) = h^{\times}_{\text{JS}}(p_2 : p_2) = h(p_2) - \frac{1}{2} h(p_2) = \frac{1}{2} h(p_2)$. This decomposition bears some similarity with the KLD decomposition viewed as the cross-entropy minus the entropy (with the cross-entropy always upper-bounding the entropy).*

*Similarly, the α-skew Jensen divergence*

$$J\_F^{\alpha}(\theta\_1 : \theta\_2) := \left(F(\theta\_1) F(\theta\_2)\right)\_{\alpha} - F\left((\theta\_1 \theta\_2)\_{\alpha}\right), \quad \alpha \in (0,1), \tag{118}$$

*can be decomposed as the information $I_F^{\alpha}(\theta_1) = (1-\alpha) F(\theta_1)$ minus the cross-information $C_F^{\alpha}(\theta_1 : \theta_2) := F\left((\theta_1 \theta_2)_{\alpha}\right) - \alpha F(\theta_2)$:*

$$J\_F^{\alpha}(\theta\_1 : \theta\_2) = I\_F^{\alpha}(\theta\_1) - C\_F^{\alpha}(\theta\_1 : \theta\_2) \ge 0. \tag{119}$$

*Notice that the information $I_F^{\alpha}(\theta_1)$ is the self cross-information: $I_F^{\alpha}(\theta_1) = C_F^{\alpha}(\theta_1 : \theta_1) = (1-\alpha) F(\theta_1)$. Recall that the convex information is the negentropy, where the entropy is concave. For the Jensen–Shannon divergence on the mixture family of categorical distributions, the convex generator $F(\theta) = -h(m_\theta) = \sum_{i=1}^{D} \theta^i \log \theta^i$ is the Shannon negentropy.*

Finally, let us briefly mention the *Jensen–Shannon diversity* [30] which extends the Jensen–Shannon divergence to a weighted set of densities as follows:

$$\text{JS}(p\_1, \ldots, p\_k; w\_1, \ldots, w\_k) := \sum\_{i=1}^{k} w\_i \text{KL}(p\_i : \bar{p}), \tag{120}$$

where $\bar{p} = \sum_{i=1}^{k} w_i p_i$. The Jensen–Shannon diversity plays the role of the variance of a cluster with respect to the KLD. Indeed, let us state the compensation identity [51]: For any $q$, we have

$$\sum\_{i=1}^{k} w\_i \text{KL}(p\_i : q) = \sum\_{i=1}^{k} w\_i \text{KL}(p\_i : \overline{p}) + \text{KL}(\overline{p} : q). \tag{121}$$

Thus, the cluster center defined as the minimizer of $\sum_{i=1}^{k} w_i \text{KL}(p_i : q)$ is the centroid $\bar{p}$, and

$$\sum\_{i=1}^{k} w\_i \text{KL}(p\_i : \bar{p}) = \text{JS}(p\_1, \ldots, p\_k; w\_1, \ldots, w\_k). \tag{122}$$
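The diversity definition (120) and the compensation identity (121) are straightforward to verify numerically. The sketch below (illustrative names; natural-base logarithms) checks that the two sides of (121) agree for an arbitrary reference distribution $q$:

```python
from math import log

def kl(p, q):
    # Discrete Kullback-Leibler divergence between probability vectors.
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

def js_diversity(ps, ws):
    # Jensen-Shannon diversity (120): Σ_i w_i KL(p_i : p̄), with p̄ = Σ_i w_i p_i.
    pbar = [sum(w * p[j] for w, p in zip(ws, ps)) for j in range(len(ps[0]))]
    return sum(w * kl(p, pbar) for w, p in zip(ws, ps)), pbar

ps = [[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.3, 0.3, 0.4]]
ws = [0.5, 0.25, 0.25]
diversity, pbar = js_diversity(ps, ws)

# Compensation identity (121) for an arbitrary reference PMF q:
q = [0.1, 0.1, 0.8]
lhs = sum(w * kl(p, q) for w, p in zip(ws, ps))
rhs = diversity + kl(pbar, q)
```

The identity also makes explicit why $\bar{p}$ minimizes $\sum_i w_i \text{KL}(p_i : q)$ over $q$: the extra term $\text{KL}(\bar{p} : q)$ is nonnegative and vanishes only at $q = \bar{p}$.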

#### **4. Conclusions and Discussion**

The Jensen–Shannon divergence [6] is a renowned symmetrization of the oriented Kullback–Leibler divergence that enjoys the following three essential properties:


This JSD plays an important role in machine learning and in deep learning for studying Generative Adversarial Networks (GANs) [52]. Traditionally, the JSD has been skewed with a scalar parameter [19,53] *α* ∈ (0, 1). In practice, it has been experimentally demonstrated that skewing divergences may significantly improve the performance of some tasks (e.g., [21,54]).

In general, we can symmetrize the KLD KL(*p* : *q*) by taking an *abstract mean* (we require a symmetric mean *M*(*x*, *y*) = *M*(*y*, *x*) with the in-betweenness property: min{*x*, *y*} ≤ *M*(*x*, *y*) ≤ max{*x*, *y*}) *M* between the two orientations KL(*p* : *q*) and KL(*q* : *p*):

$$\text{KL}\_M(p,q) := M(\text{KL}(p:q), \text{KL}(q:p)).\tag{123}$$

We recover the Jeffreys divergence by taking twice the arithmetic mean (i.e., $J(p,q) = 2A(\text{KL}(p:q), \text{KL}(q:p))$, where $A(x,y) = \frac{x+y}{2}$), and the resistor average divergence [55] by taking the harmonic mean (i.e., $R_{\text{KL}}(p,q) = H(\text{KL}(p:q), \text{KL}(q:p)) = \frac{2\,\text{KL}(p:q)\,\text{KL}(q:p)}{\text{KL}(p:q)+\text{KL}(q:p)}$, where $H(x,y) = \frac{2}{\frac{1}{x}+\frac{1}{y}}$). When we take the limits of Hölder power means, we get the following extremal symmetrizations of the KLD:

$$\text{KL}^{\text{min}}(p:q) \quad = \min\{\text{KL}(p:q), \text{KL}(q:p)\} = \text{KL}^{\text{min}}(q:p), \tag{124}$$

$$\text{KL}^{\text{max}}(p:q) \quad = \max\{\text{KL}(p:q), \text{KL}(q:p)\} = \text{KL}^{\text{max}}(q:p). \tag{125}$$
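The abstract-mean symmetrization (123) and its special cases (Jeffreys, resistor average, and the extremal symmetrizations (124)–(125)) can be sketched as follows; the names are illustrative and the logarithms natural-base:

```python
from math import log

def kl(p, q):
    # Discrete Kullback-Leibler divergence between probability vectors.
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

def kl_sym(p, q, M):
    # KL_M (123): symmetrize the KLD with an abstract symmetric mean M.
    return M(kl(p, q), kl(q, p))

arithmetic = lambda x, y: (x + y) / 2.0          # gives half the Jeffreys divergence J
harmonic = lambda x, y: 2.0 * x * y / (x + y)    # gives the resistor average divergence

p, q = [0.2, 0.5, 0.3], [0.6, 0.1, 0.3]
jeffreys = 2.0 * kl_sym(p, q, arithmetic)
resistor = kl_sym(p, q, harmonic)
kl_min = kl_sym(p, q, min)                       # extremal symmetrization (124)
kl_max = kl_sym(p, q, max)                       # extremal symmetrization (125)
```

Since any symmetric mean with the in-betweenness property lies between the min and max, every such symmetrization is sandwiched between (124) and (125).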

In this work, we showed how to *vector-skew* the JSD while preserving the above three properties. These new families of *weighted vector-skew Jensen–Shannon divergences* may allow one to fine-tune the dissimilarity in applications by replacing the skewing scalar parameter of the JSD by a vector parameter (informally, adding some "knobs" for tuning a divergence). We then considered computing the Jensen–Shannon centroids of a set of densities belonging to a mixture family [25] by using the convex–concave procedure [27].

In general, we can vector-skew any arbitrary divergence $D$ by using two $k$-dimensional vectors $\alpha \in [0,1]^k$ and $\beta \in [0,1]^k$ (with $\alpha \neq \beta$) by building a weighted separable divergence as follows:

$$D^{\alpha,\beta,w}(p:q) := \sum\_{i=1}^{k} w\_i D\left((pq)\_{\alpha\_i} : (pq)\_{\beta\_i}\right) = D^{1\_k - \alpha, 1\_k - \beta, w}(q:p), \quad \alpha \neq \beta. \tag{126}$$

This bi-vector-skew divergence unifies the Jeffreys divergence with the Jensen–Shannon *α*-skew divergence by setting the following parameters:

$$\text{KL}^{(0,1),(1,0),(1,1)}(p:q) \quad = \text{KL}(p:q) + \text{KL}(q:p) = \text{J}(p,q), \tag{127}$$

$$\text{KL}^{(0,1),(\alpha,\alpha),(\frac{1}{2},\frac{1}{2})}(p:q) = \frac{1}{2}\text{KL}(p:(pq)\_{\alpha}) + \frac{1}{2}\text{KL}(q:(pq)\_{\alpha}). \tag{128}$$

We have shown in this paper that interesting properties may occur when the skewing vector $\beta$ is purposely correlated with the skewing vector $\alpha$: Namely, for the bi-vector-skew Bregman divergences with $\beta = (\bar{\alpha}, \ldots, \bar{\alpha})$ and $\bar{\alpha} = \sum_i w_i \alpha_i$, we obtain an equivalent Jensen diversity for the Jensen–Bregman divergence, and, as a byproduct, a vector-skew generalization of the Jensen–Shannon divergence.

**Funding:** This research received no external funding.

**Acknowledgments:** The author is very grateful to the two Reviewers and the Academic Editor for their careful reading, helpful comments, and suggestions, which led to this improved manuscript. In particular, Reviewer 2 kindly suggested the stronger bound of Lemma 1 and hinted at Theorem 1.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Conditional Rényi Divergences and Horse Betting**

#### **Cédric Bleuler, Amos Lapidoth and Christoph Pfister \***

Signal and Information Processing Laboratory, ETH Zurich, 8092 Zurich, Switzerland; cedric.bleuler@gmail.com (C.B.); lapidoth@isi.ee.ethz.ch (A.L.)

**\*** Correspondence: pfister@isi.ee.ethz.ch

Received: 17 December 2019; Accepted: 9 March 2020; Published: 11 March 2020

**Abstract:** Motivated by a horse betting problem, a new conditional Rényi divergence is introduced. It is compared with the conditional Rényi divergences that appear in the definitions of the dependence measures by Csiszár and Sibson, and the properties of all three are studied with emphasis on their behavior under data processing. In the same way that Csiszár's and Sibson's conditional divergences lead to the respective dependence measures, so does the new conditional divergence lead to the Lapidoth–Pfister mutual information. Moreover, the new conditional divergence is also related to the Arimoto–Rényi conditional entropy and to Arimoto's measure of dependence. In the second part of the paper, the horse betting problem is analyzed where, instead of Kelly's expected log-wealth criterion, a more general family of power-mean utility functions is considered. The key role in the analysis is played by the Rényi divergence, and in the setting where the gambler has access to side information, the new conditional Rényi divergence is key. The setting with side information also provides another operational meaning to the Lapidoth–Pfister mutual information. Finally, a universal strategy for independent and identically distributed races is presented that—without knowing the winning probabilities or the parameter of the utility function—asymptotically maximizes the gambler's utility function.

**Keywords:** conditional Rényi divergence; horse betting; Kelly gambling; Rényi divergence; Rényi mutual information

#### **1. Introduction**

As shown by Kelly [1,2], many of Shannon's information measures appear naturally in the context of horse gambling when the gambler's utility function is expected log-wealth. Here, we show that under a more general family of utility functions, gambling also provides a context for some of Rényi's information measures. Moreover, the setting where the gambler has side information motivates a new Rényi-like conditional divergence, which we study and compare to other conditional divergences. The proposed family of utility functions in the context of gambling with side information also provides another operational meaning to the Rényi-like mutual information that was recently proposed by Lapidoth and Pfister [3]: it measures the gambler's benefit from the side information, namely the increase in the minimax value of the two-player zero-sum game in which the bookmaker picks the odds and the gambler then places the bets based on these odds and her side information.

Deferring the gambling-based motivation to the second part of the paper, we first describe the different conditional divergences and study some of their properties with emphasis on their behavior under data processing. We also show that the new conditional Rényi divergence relates to the Lapidoth–Pfister mutual information in much the same way that Csiszár's and Sibson's conditional divergences relate to their corresponding mutual informations. Before discussing the conditional divergences, we first recall other information measures.

*Entropy* **2020**, *22*, 316

The Kullback–Leibler divergence (or relative entropy) is an important concept in information theory and statistics [2,4–6]. It is defined between two probability mass functions (PMFs) *P* and *Q* over a finite set X as

$$D(P\|Q) \triangleq \sum\_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}, \tag{1}$$

where log(·) denotes the base-2 logarithm. Defining a conditional Kullback–Leibler divergence is straightforward because, as simple algebra shows, the two natural approaches lead to the same result:

$$D\left(P\_{Y|X} \| Q\_{Y|X} | P\_X\right) \triangleq \sum\_{x \in \text{supp}(P\_X)} P(x) D\left(P\_{Y|X=x} \| Q\_{Y|X=x}\right) \tag{2}$$

$$= D(P\_X P\_{Y|X} \| P\_X Q\_{Y|X}), \tag{3}$$

where $\text{supp}(P) \triangleq \{x \in \mathcal{X} : P(x) > 0\}$ denotes the support of $P$, and in (3) and throughout, $P_X P_{Y|X}$ denotes the PMF on $\mathcal{X} \times \mathcal{Y}$ that assigns $(x,y)$ the probability $P_X(x) P_{Y|X}(y|x)$.

The Rényi divergence of order *α* [7,8] between two PMFs *P* and *Q* is defined for all positive *α*'s other than one as

$$D\_{\alpha}(P\|Q) \triangleq \frac{1}{\alpha-1} \log \sum\_{x \in \mathcal{X}} P(x)^{\alpha} Q(x)^{1-\alpha}. \tag{4}$$
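For $\alpha \in (0,1) \cup (1,\infty)$ and strictly positive PMFs, Definition (4) translates directly into code. The following minimal sketch (illustrative names) uses the base-2 logarithm, as in the paper:

```python
from math import log2

def renyi_divergence(p, q, alpha):
    # Rényi divergence of order α, Equation (4), in bits, for strictly
    # positive PMFs p and q and α ∈ (0, 1) ∪ (1, ∞).
    return log2(sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))) / (alpha - 1.0)

p, q = [0.5, 0.3, 0.2], [0.25, 0.25, 0.5]
d_half = renyi_divergence(p, q, 0.5)   # α = 1/2
d_two = renyi_divergence(p, q, 2.0)    # α = 2
```

As $\alpha \to 1$, the expression tends to the Kullback–Leibler divergence (1), consistent with the continuous extension of the definition at $\alpha = 1$.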

A conditional Rényi divergence can be defined in more than one way. In this paper, we consider the following three definitions, two classic and one new:

$$D\_{\alpha}^{\mathrm{c}}(P\_{Y|X} \| Q\_{Y|X} | P\_X) \triangleq \sum\_{x \in \text{supp}(P\_X)} P(x) D\_{\alpha}(P\_{Y|X=x} \| Q\_{Y|X=x}) \tag{5}$$

$$D\_{\alpha}^{\mathrm{s}}(P\_{Y|X} \| Q\_{Y|X} | P\_X) \triangleq D\_{\alpha}(P\_X P\_{Y|X} \| P\_X Q\_{Y|X}), \tag{6}$$

$$D\_{\alpha}^{\mathrm{l}}(P\_{Y|X} \| Q\_{Y|X} | P\_X) \triangleq \frac{\alpha}{\alpha-1} \log \sum\_{x \in \text{supp}(P\_X)} P(x) \, 2^{\frac{\alpha-1}{\alpha} D\_{\alpha}(P\_{Y|X=x} \| Q\_{Y|X=x})}, \tag{7}$$

where (5) is inspired by Csiszár [9]; (6) is inspired by Sibson [10]; and (7) is motivated by the horse betting problem discussed in Section 9. The first two conditional Rényi divergences were used to define the Rényi measures of dependence of Csiszár $I_{\alpha}^{\mathrm{c}}(X;Y)$ [9] and of Sibson $I_{\alpha}^{\mathrm{s}}(X;Y)$ [10]:

$$I\_{\alpha}^{\mathrm{c}}(X;Y) \triangleq \min\_{Q\_Y} D\_{\alpha}^{\mathrm{c}}(P\_{Y|X} \| Q\_Y | P\_X) \tag{8}$$

$$I\_{\alpha}^{\mathrm{s}}(X;Y) \triangleq \min\_{Q\_Y} D\_{\alpha}^{\mathrm{s}}(P\_{Y|X} \| Q\_Y | P\_X) \tag{9}$$

where the minimization is over all PMFs on the set $\mathcal{Y}$. (Gallager's $E_0$ function [11] and $I_{\alpha}^{\mathrm{s}}(X;Y)$ are in one-to-one correspondence; see (65) below.) The analogous minimization of $D_{\alpha}^{\mathrm{l}}(\cdot)$ leads to the Lapidoth–Pfister mutual information $J_{\alpha}(X;Y)$ [3]:

$$J\_{\alpha}(X;Y) \triangleq \min\_{Q\_X, Q\_Y} D\_{\alpha}(P\_{XY} \| Q\_X Q\_Y) \tag{10}$$

$$= \min\_{Q\_Y} D\_{\alpha}^{\mathrm{l}}(P\_{Y|X} \| Q\_Y | P\_X), \tag{11}$$

where (11) is proved in Proposition 5.
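For finite alphabets, the three conditional divergences (5)–(7) can be computed directly. The following sketch (illustrative names; base-2 logarithms; strictly positive PMFs) also checks that, when neither conditional PMF depends on $x$, all three coincide with the unconditional Rényi divergence:

```python
from math import log2

def renyi(p, q, alpha):
    # Unconditional Rényi divergence (4), in bits, for α ∈ (0,1) ∪ (1,∞).
    return log2(sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))) / (alpha - 1.0)

def d_csiszar(PX, PYgX, QYgX, alpha):
    # D^c_α (5): the P_X-average of the per-x Rényi divergences.
    return sum(px * renyi(py, qy, alpha)
               for px, py, qy in zip(PX, PYgX, QYgX) if px > 0)

def d_sibson(PX, PYgX, QYgX, alpha):
    # D^s_α (6): Rényi divergence between the joints P_X P_{Y|X} and P_X Q_{Y|X}.
    pj = [px * py for px, row in zip(PX, PYgX) for py in row]
    qj = [px * qy for px, row in zip(PX, QYgX) for qy in row]
    return renyi(pj, qj, alpha)

def d_lp(PX, PYgX, QYgX, alpha):
    # D^l_α (7): an exponentiated power mean of the per-x Rényi divergences.
    c = (alpha - 1.0) / alpha
    s = sum(px * 2.0 ** (c * renyi(py, qy, alpha))
            for px, py, qy in zip(PX, PYgX, QYgX) if px > 0)
    return log2(s) / c

PX = [0.4, 0.6]
PYgX = [[0.7, 0.3], [0.2, 0.8]]   # row x is P_{Y|X=x}
QYgX = [[0.5, 0.5], [0.5, 0.5]]
dc = d_csiszar(PX, PYgX, QYgX, 2.0)
ds = d_sibson(PX, PYgX, QYgX, 2.0)
dl = d_lp(PX, PYgX, QYgX, 2.0)
```

For $\alpha > 1$, the exponent $\frac{\alpha-1}{\alpha}$ is positive, so by Jensen's inequality the exponentiated mean in (7) dominates the plain average in (5).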

The first part of the paper is structured as follows: In Section 2, we discuss some preliminaries. In Sections 3–5, we study the properties of the three conditional Rényi divergences and their associated measures of dependence. In Section 6, we express the Arimoto–Rényi conditional entropy $H_{\alpha}(X|Y)$ and the Arimoto measure of dependence $I_{\alpha}^{\mathrm{a}}(X;Y)$ [12] in terms of $D_{\alpha}^{\mathrm{l}}(P_{X|Y} \| U_X | P_Y)$. In Section 7,


we relate the conditional Rényi divergences to each other and discuss the relations between the Rényi dependence measures.

The second part of the paper deals with horse gambling under our proposed family of power-mean utility functions. It is in this context that the Rényi divergence (Theorem 9) and the conditional Rényi divergence $D_{\alpha}^{\mathrm{l}}(\cdot)$ (Theorem 10) appear naturally.

More specifically, consider a horse race with a finite nonempty set of horses $\mathcal{X}$, where a bookmaker offers odds $o(x)$-for-1 on each horse $x \in \mathcal{X}$, with $o : \mathcal{X} \to (0, \infty)$ [2] (Section 6.1). A gambler spends all her wealth placing bets on the horses. The fraction of her wealth that she bets on Horse $x \in \mathcal{X}$ is denoted $b(x) \geq 0$; these fractions sum to one over $x \in \mathcal{X}$, and the PMF $b$ is her "betting strategy." The winning horse, which we denote $X$, is drawn according to the PMF $p$, where we assume $p(x) > 0$ for all $x \in \mathcal{X}$. The wealth relative (or end-to-beginning wealth ratio) is the random variable

$$S \stackrel{\Delta}{=} b(X)o(X). \tag{12}$$

Hence, given an initial wealth *γ*, the gambler's wealth after the race is *γS*. We seek betting strategies that maximize the utility function

$$U\_{\beta} \triangleq \begin{cases} \frac{1}{\beta} \log \mathbb{E}[S^{\beta}] & \text{if } \beta \neq 0, \\ \mathbb{E}[\log S] & \text{if } \beta = 0, \end{cases} \tag{13}$$

where *β* ∈ R is a parameter that accounts for the risk sensitivity. This optimization generalizes the following cases:


Note that, for $\beta \neq 0$ and $\eta \triangleq 1 - \beta$, maximizing $U_{\beta}$ is equivalent to maximizing

$$\mathbb{E}\left[\frac{\mathbb{S}^{1-\eta}}{1-\eta}\right],\tag{14}$$

which is known in the finance literature as Constant Relative Risk Aversion (CRRA) [13,14].

We refer to our utility function as "power mean" because it can be written as the logarithm of a weighted power mean [15,16]:

$$U\_{\beta} = \log \left[ \sum\_{x} p(x) \left( b(x) o(x) \right)^{\beta} \right]^{\frac{1}{\beta}}. \tag{15}$$

Because the power mean tends to the geometric mean as *β* tends to zero [15] (Problem 8.1), *U<sup>β</sup>* is continuous at *β* = 0:

$$\lim\_{\beta \to 0} U\_{\beta} = \log \prod\_{x} \left( b(x) o(x) \right)^{p(x)} \tag{16}$$

$$= \mathbb{E}[\log S]. \tag{17}$$
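The utility (13)/(15) and its continuity at $\beta = 0$ can be sketched numerically as follows (illustrative names and data; base-2 logarithms):

```python
from math import log2

def utility(p, b, o, beta):
    # Power-mean utility U_β of (13)/(15), in bits, for winning PMF p,
    # betting strategy b, and odds o (all entries positive).
    if beta == 0.0:
        return sum(pi * log2(bi * oi) for pi, bi, oi in zip(p, b, o))
    return log2(sum(pi * (bi * oi) ** beta for pi, bi, oi in zip(p, b, o))) / beta

p = [0.6, 0.4]    # winning probabilities
o = [1.8, 2.5]    # odds offered by the bookmaker
b = [0.6, 0.4]    # proportional betting

u_log = utility(p, b, o, 0.0)       # Kelly's expected log-wealth criterion
u_near0 = utility(p, b, o, 1e-6)    # ≈ u_log, by continuity at β = 0
```

Since weighted power means are nondecreasing in the exponent, $U_\beta$ is nondecreasing in $\beta$ for a fixed betting strategy.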

Campbell [17,18] used an exponential cost function with a similar structure to (15) to provide an operational meaning to the Rényi entropy in source coding. Other information-theoretic applications of exponential moments were studied in [19].

The second part of the paper is structured as follows: In Section 8, we relate the utility function $U_{\beta}$ to the Rényi divergence (Theorem 9) and derive its optimal gambling strategy. In Section 9, we consider the situation where the gambler observes side information prior to betting, a situation that leads to the conditional Rényi divergence $D_{\alpha}^{\mathrm{l}}(\cdot)$ (Theorem 10) and to a new operational meaning for the measure of dependence $J_{\alpha}(X;Y)$ (Theorem 11). In Section 10, we consider the situation where the gambler invests only part of her money. In Section 11, we present a universal strategy for independent and identically distributed (IID) races that requires knowledge of neither the winning probabilities nor the parameter $\beta$ of the utility function and yet asymptotically maximizes the utility function for all PMFs $p$ and all $\beta \in \mathbb{R}$.

#### **2. Preliminaries**

Throughout the paper, $\log(\cdot)$ denotes the base-2 logarithm, $\mathcal{X}$ and $\mathcal{Y}$ are finite sets, $P_{XY}$ denotes a joint PMF over $\mathcal{X} \times \mathcal{Y}$, $Q_X$ denotes a PMF over $\mathcal{X}$, and $Q_Y$ denotes a PMF over $\mathcal{Y}$. An expression of the form $P_X P_{Y|X}$ denotes the PMF on $\mathcal{X} \times \mathcal{Y}$ that assigns $(x,y)$ the probability $P_X(x) P_{Y|X}(y|x)$. We use $P$ and $Q$ as generic PMFs over a finite set $\mathcal{X}$. We denote by $\text{supp}(P) \triangleq \{x \in \mathcal{X} : P(x) > 0\}$ the support of $P$, and by $\mathcal{P}(\mathcal{X})$ the set of all PMFs over $\mathcal{X}$. When clear from the context, we often omit sets and subscripts: for example, we write $\sum_x$ for $\sum_{x \in \mathcal{X}}$, $\min_{Q_X, Q_Y}$ for $\min_{(Q_X, Q_Y) \in \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{Y})}$, $P(x)$ for $P_X(x)$, and $P(y|x)$ for $P_{Y|X}(y|x)$. When $P(x)$ is 0, we define the conditional probability $P(y|x)$ as $1/|\mathcal{Y}|$. The conditional distribution of $Y$ given $X = x$ is denoted by $P_{Y|X=x}$; thus

$$P\_{Y|X=x}(y) = P(y|\mathbf{x}).\tag{18}$$

We denote by <sup>1</sup>{condition} the indicator function that is one if the condition is satisfied and zero otherwise.

In the definition of the Kullback–Leibler divergence in (1), we use the conventions

$$0 \log \frac{0}{q} = 0 \quad \forall q \ge 0, \qquad p \log \frac{p}{0} = \infty \quad \forall p > 0. \tag{19}$$

In the definition of the Rényi divergence in (4), we read $P(x)^{\alpha} Q(x)^{1-\alpha}$ as $P(x)^{\alpha} / Q(x)^{\alpha-1}$ for $\alpha > 1$ and use the conventions

$$\frac{0}{0} = 0, \qquad \frac{p}{0} = \infty \quad \forall p > 0. \tag{20}$$

For *α* being zero, one, or infinity, we define by continuous extension of (4)

$$D\_0(P\|Q) \triangleq -\log \sum\_{x \in \text{supp}(P)} Q(x), \tag{21}$$

$$D\_1(P\|Q) \triangleq D(P\|Q), \tag{22}$$

$$D\_{\infty}(P\|Q) \triangleq \log \max\_{\mathbf{x}} \frac{P(\mathbf{x})}{Q(\mathbf{x})}.\tag{23}$$

The Rényi divergence for negative *α* is defined as

$$D\_{\alpha}(P\|Q) \triangleq \frac{1}{\alpha-1} \log \sum\_{x} \frac{Q(x)^{1-\alpha}}{P(x)^{-\alpha}}. \tag{24}$$


(We use negative *α* in the proof of Proposition 1 (e) below and in Remark 6. More about negative orders can be found in [8] (Section V). For other applications of negative orders, see [20] (Proof of Theorem 1 and Example 1).)

The Rényi divergence satisfies the following basic properties:

**Proposition 1.** *Let $P$ and $Q$ be PMFs. Then, the Rényi divergence $D_{\alpha}(P\|Q)$ satisfies the following:*


$$P'(\mathbf{x'}) \stackrel{\Delta}{=} \sum\_{\mathbf{x}} P(\mathbf{x}) \, A\_{X'|X}(\mathbf{x'}|\mathbf{x}) \, \tag{25}$$

$$Q'(\mathbf{x'}) \triangleq \sum\_{\mathbf{x}} Q(\mathbf{x}) A\_{X'|X}(\mathbf{x'}|\mathbf{x}).\tag{26}$$

*Then, for all α* ∈ [0, ∞]*,*

$$D\_{\mathfrak{a}}(P' \| Q') \le D\_{\mathfrak{a}}(P \| Q). \tag{27}$$

#### **Proof.** See Appendix A.

All three conditional Rényi divergences reduce to the unconditional Rényi divergence when both $P_{Y|X}$ and $Q_{Y|X}$ are independent of $X$:

**Remark 1.** *Let $P_Y$, $Q_Y$, and $P_X$ be PMFs. Then, for all $\alpha \in [0, \infty]$,*

$$D\_{\alpha}^{\mathrm{c}}(P\_Y \| Q\_Y | P\_X) = D\_{\alpha}^{\mathrm{s}}(P\_Y \| Q\_Y | P\_X) = D\_{\alpha}^{\mathrm{l}}(P\_Y \| Q\_Y | P\_X) = D\_{\alpha}(P\_Y \| Q\_Y). \tag{28}$$

**Proof.** This follows from the definitions of $D_{\alpha}^{\mathrm{c}}(\cdot)$, $D_{\alpha}^{\mathrm{s}}(\cdot)$, and $D_{\alpha}^{\mathrm{l}}(\cdot)$ in (5)–(7).

#### **3. Csiszár's Conditional Rényi Divergence**

For a PMF $P_X$ and conditional PMFs $P_{Y|X}$ and $Q_{Y|X}$, Csiszár's conditional Rényi divergence $D_{\alpha}^{\mathrm{c}}(\cdot)$ is defined for every $\alpha \in [0, \infty]$ as

$$D\_{\alpha}^{\mathrm{c}}(P\_{Y|X} \| Q\_{Y|X} | P\_X) \triangleq \sum\_{x \in \text{supp}(P\_X)} P(x) D\_{\alpha}(P\_{Y|X=x} \| Q\_{Y|X=x}). \tag{29}$$

For *α* ∈ (0, 1) ∪ (1, ∞),

$$D\_{\alpha}^{\mathrm{c}}(P\_{Y|X} \| Q\_{Y|X} | P\_X) = \frac{1}{\alpha-1} \sum\_{x \in \text{supp}(P\_X)} P(x) \log \sum\_{y} P(y|x)^{\alpha} Q(y|x)^{1-\alpha}, \tag{30}$$

which follows from the definition of the Rényi divergence in (4). For *α* being zero, one, or infinity, we obtain from (21)–(23) and (2)

$$D\_0^{\mathrm{c}}(P\_{Y|X} \| Q\_{Y|X} | P\_X) = -\sum\_{x \in \text{supp}(P\_X)} P(x) \log \sum\_{y \in \text{supp}(P\_{Y|X=x})} Q(y|x), \tag{31}$$

$$D\_1^{\mathrm{c}}(P\_{Y|X} \| Q\_{Y|X} | P\_X) = D(P\_{Y|X} \| Q\_{Y|X} | P\_X), \tag{32}$$

$$D\_{\infty}^{\mathrm{c}}(P\_{Y|X} \| Q\_{Y|X} | P\_X) = \sum\_{x \in \text{supp}(P\_X)} P(x) \log \max\_{y} \frac{P(y|x)}{Q(y|x)}. \tag{33}$$

Augustin [21] and later Csiszár [9] defined the measure of dependence

$$I\_{\alpha}^{\mathrm{c}}(X;Y) \triangleq \min\_{Q\_Y} D\_{\alpha}^{\mathrm{c}}(P\_{Y|X} \| Q\_Y | P\_X). \tag{34}$$

Augustin used this measure to study the error exponents for channel coding with input constraints, while Csiszár used it to study generalized cutoff rates for channel coding with composition constraints. Nakibo˘glu [22] studied more properties of *I* c *α* (*X*;*Y*). Inter alia, he analyzed the minimax properties of the Augustin capacity

$$\sup\_{P\_X \in \mathcal{A}} I\_{\mathfrak{a}}^{\mathbf{c}}(P\_{X\prime} P\_{Y|X}) = \sup\_{P\_X \in \mathcal{A}} \min D\_{\mathfrak{a}}^{\mathbf{c}}(P\_{Y|X} \| Q\_Y | P\_X) \,\tag{35}$$

where $\mathcal{A} \subseteq \mathcal{P}(\mathcal{X})$ is a constraint set. The Augustin capacity is used in [23] to establish the sphere packing bound for memoryless channels with cost constraints.

The rest of the section presents some properties of $D_\alpha^{\mathrm{c}}(\cdot)$. Being an average of Rényi divergences (see (29)), $D_\alpha^{\mathrm{c}}(\cdot)$ inherits many properties from the Rényi divergence:

**Proposition 2.** *Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Then,*


**Proof.** These follow from (29) and the properties of the Rényi divergence (Proposition 1). For Parts (f) and (g), recall that a nonnegative weighted sum of concave functions is concave.

We next consider data-processing inequalities for $D_\alpha^{\mathrm{c}}(\cdot)$. We distinguish between processing $Y$ and processing $X$. The data-processing inequality for processing $Y$ follows from the data-processing inequality for the (unconditional) Rényi divergence:

**Theorem 1.** *Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. For a conditional PMF $A_{Y'|XY}$, define*

$$P\_{Y'|X}(y'|\mathbf{x}) \stackrel{\Delta}{=} \sum\_{\mathbf{y}} P\_{Y|X}(y|\mathbf{x}) A\_{Y'|XY}(y'|\mathbf{x}, \mathbf{y}) \, \tag{36}$$

$$Q\_{Y'|X}(y'|\mathbf{x}) \stackrel{\Delta}{=} \sum\_{\mathbf{y}} Q\_{Y|X}(y|\mathbf{x}) A\_{Y'|XY}(y'|\mathbf{x},\mathbf{y}).\tag{37}$$

*Then, for all $\alpha \in [0,\infty]$,*

$$D_\alpha^{\mathrm{c}}(P_{Y'|X} \| Q_{Y'|X} | P_X) \le D_\alpha^{\mathrm{c}}(P_{Y|X} \| Q_{Y|X} | P_X). \tag{38}$$

**Proof.** See Appendix B.
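Theorem 1 can be checked numerically. The sketch below draws random finite-alphabet PMFs, pushes both $P_{Y|X}$ and $Q_{Y|X}$ through a common channel $A_{Y'|XY}$ as in (36)–(37), and verifies (38); all helper names and the particular alphabet sizes are our illustrative choices.

```python
import math
import random

def renyi_div(p, q, alpha):
    # Rényi divergence in bits, cf. (4); valid here for alpha in (0,1) or (1,inf)
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log2(s) / (alpha - 1)

def cond_renyi_c(p_x, p_y_x, q_y_x, alpha):
    # Csiszár's conditional Rényi divergence per (29)
    return sum(px * renyi_div(py, qy, alpha)
               for px, py, qy in zip(p_x, p_y_x, q_y_x) if px > 0)

def rand_pmf(n):
    w = [random.random() for _ in range(n)]
    t = sum(w)
    return [v / t for v in w]

random.seed(0)
nx, ny, nyp, alpha = 3, 4, 2, 2.0
p_x = rand_pmf(nx)
p_y_x = [rand_pmf(ny) for _ in range(nx)]
q_y_x = [rand_pmf(ny) for _ in range(nx)]
# Channel A_{Y'|XY} applied to both P and Q, as in (36)-(37)
a = [[rand_pmf(nyp) for _ in range(ny)] for _ in range(nx)]
p_yp_x = [[sum(p_y_x[x][y] * a[x][y][yp] for y in range(ny))
           for yp in range(nyp)] for x in range(nx)]
q_yp_x = [[sum(q_y_x[x][y] * a[x][y][yp] for y in range(ny))
           for yp in range(nyp)] for x in range(nx)]
lhs = cond_renyi_c(p_x, p_yp_x, q_yp_x, alpha)
rhs = cond_renyi_c(p_x, p_y_x, q_y_x, alpha)
assert lhs <= rhs + 1e-12  # the data-processing inequality (38)
```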

The following data-processing inequality for processing $X$ holds for $\alpha \in [0,1]$ (as shown in Example 1 below, it does not extend to $\alpha \in (1,\infty]$):

**Theorem 2.** *Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. For a conditional PMF $B_{X'|X}$, define the PMFs*

$$P\_{X'}(\mathbf{x'}) \stackrel{\Delta}{=} \sum\_{\mathbf{x}} P\_X(\mathbf{x}) \, B\_{X'|X}(\mathbf{x'}|\mathbf{x}) \, \tag{39}$$

$$\mathcal{B}\_{X|X'}(\mathbf{x}|\mathbf{x'}) \triangleq \begin{cases} \mathcal{P}\_X(\mathbf{x})\mathcal{B}\_{X'|X}(\mathbf{x'}|\mathbf{x})/\mathcal{P}\_{X'}(\mathbf{x'}) & \text{if } \mathcal{P}\_{X'}(\mathbf{x'}) > 0, \\ 1/|\mathcal{X}| & \text{otherwise,} \end{cases} \tag{40}$$

$$P\_{Y|X'}(y|\mathbf{x'}) \stackrel{\Delta}{=} \sum\_{\mathbf{x}} B\_{X|X'}(\mathbf{x}|\mathbf{x'}) \, P\_{Y|X}(y|\mathbf{x}) \, \tag{41}$$

$$Q\_{Y|X'}(y|\mathbf{x'}) \stackrel{\Delta}{=} \sum\_{\mathbf{x}} B\_{X|X'}(\mathbf{x}|\mathbf{x'}) \, Q\_{Y|X}(y|\mathbf{x}).\tag{42}$$

*Then, for all $\alpha \in [0,1]$,*

$$D_\alpha^{\mathrm{c}}(P_{Y|X'} \| Q_{Y|X'} | P_{X'}) \le D_\alpha^{\mathrm{c}}(P_{Y|X} \| Q_{Y|X} | P_X). \tag{43}$$

Note that $P_{X'}$, $P_{Y|X'}$, and $Q_{Y|X'}$ in Theorem 2 can be obtained from the following marginalizations:

$$P\_{X'}(\mathbf{x'})P\_{Y|X'}(y|\mathbf{x'}) = \sum\_{\mathbf{x}} P\_X(\mathbf{x}) \, B\_{X'|X}(\mathbf{x'}|\mathbf{x}) \, P\_{Y|X}(y|\mathbf{x}) \, \tag{44}$$

$$P\_{X'}(\mathbf{x'})Q\_{Y|X'}(y|\mathbf{x'}) = \sum\_{\mathbf{x}} P\_X(\mathbf{x})B\_{X'|X}(\mathbf{x'}|\mathbf{x})Q\_{Y|X}(y|\mathbf{x}).\tag{45}$$

**Proof of Theorem 2.** See Appendix C.

As a special case of Theorem 2, we obtain the following relation between the conditional and the unconditional Rényi divergence:

**Corollary 1.** *For a PMF $P_X$ and conditional PMFs $P_{Y|X}$ and $Q_{Y|X}$, define the marginal PMFs*

$$P\_Y(y) \stackrel{\Delta}{=} \sum\_{\mathbf{x}} P\_X(\mathbf{x}) \, P\_{Y|X}(y|\mathbf{x}) \, \tag{46}$$

$$Q\_Y(y) \stackrel{\Delta}{=} \sum\_{\mathbf{x}} P\_X(\mathbf{x}) \, Q\_{Y|X}(y|\mathbf{x}). \tag{47}$$

*Then, for all $\alpha \in [0,1]$,*

$$D_\alpha(P_Y \| Q_Y) \le D_\alpha^{\mathrm{c}}(P_{Y|X} \| Q_{Y|X} | P_X). \tag{48}$$

*Entropy* **2020**, *22*, 316

**Proof.** See Appendix D.

Consider next $\alpha \in (1,\infty]$. It turns out that Corollary 1, and hence Theorem 2, cannot be extended to these values of $\alpha$ (not even if $Q_{Y|X}$ is restricted to be independent of $X$, i.e., if $Q_{Y|X} = Q_Y$):

**Example 1.** *Let $\mathcal{X} = \mathcal{Y} = \{0,1\}$. For $\epsilon \in (0,1)$, define the PMFs $P_X$, $Q_Y^{(\epsilon)}$, and $P_{Y|X}^{(\epsilon)}$ as*

$$P\_X(0) = 0.5, \qquad \qquad P\_X(1) = 0.5,\tag{49}$$

$$Q\_Y^{(\epsilon)}(0) = 1 - \epsilon, \qquad Q\_Y^{(\epsilon)}(1) = \epsilon,\tag{50}$$

$$P\_{Y|X}^{(\epsilon)}(0|0) = 1 - \epsilon, \qquad P\_{Y|X}^{(\epsilon)}(1|0) = \epsilon,\tag{51}$$

$$P\_{Y|X}^{(\epsilon)}(0|1) = \epsilon, \qquad \qquad P\_{Y|X}^{(\epsilon)}(1|1) = 1 - \epsilon. \tag{52}$$

*Then, for every $\alpha \in (1,\infty]$, there exists an $\epsilon \in (0,1)$ such that*

$$D_\alpha\Big(P_Y \,\Big\|\, Q_Y^{(\epsilon)}\Big) > D_\alpha^{\mathrm{c}}\Big(P_{Y|X}^{(\epsilon)} \,\Big\|\, Q_Y^{(\epsilon)} \,\Big|\, P_X\Big), \tag{53}$$

*where the PMF $P_Y$ is defined by* (46) *and, irrespective of $\epsilon$, satisfies $P_Y(0) = P_Y(1) = 0.5$.*

**Proof.** See Appendix E.
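A single numerical instance of Example 1 already exhibits the failure of the data-processing inequality for $\alpha > 1$. The values $\alpha = 2$ and $\epsilon = 0.01$ below are our illustrative choices; any small enough $\epsilon$ works for this $\alpha$.

```python
import math

def renyi_div(p, q, alpha):
    # Rényi divergence in bits, cf. (4), for alpha in (0,1) or (1,inf)
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log2(s) / (alpha - 1)

alpha, eps = 2.0, 0.01
q_y = [1 - eps, eps]                          # Q_Y^(eps), (50)
p_y_x = [[1 - eps, eps], [eps, 1 - eps]]      # P_{Y|X}^(eps), (51)-(52)
p_y = [0.5, 0.5]  # marginal of P_{Y|X}^(eps) under uniform P_X, for any eps

d_uncond = renyi_div(p_y, q_y, alpha)
# Csiszár's conditional divergence (29): the x = 0 term vanishes
# because P_{Y|X=0} = Q_Y^(eps)
d_cond = (0.5 * renyi_div(p_y_x[0], q_y, alpha)
          + 0.5 * renyi_div(p_y_x[1], q_y, alpha))
assert d_uncond > d_cond  # (53): the unconditional divergence is larger
```

Intuitively, the mass $Q_Y^{(\epsilon)}(1) = \epsilon$ hurts the mixture $P_Y$ at full weight $0.5$, while in the conditional average only the $x = 1$ term (weight $0.5$) pays that price.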

#### **4. Sibson's Conditional Rényi Divergence**

For a PMF $P_X$ and conditional PMFs $P_{Y|X}$ and $Q_{Y|X}$, Sibson's conditional Rényi divergence $D_\alpha^{\mathrm{s}}(\cdot)$ is defined for every $\alpha \in [0,\infty]$ as

$$D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_{Y|X} | P_X) \triangleq D_\alpha(P_X P_{Y|X} \| P_X Q_{Y|X}). \tag{54}$$

For $\alpha \in (0,1) \cup (1,\infty)$,

$$D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_{Y|X} | P_X) = \frac{1}{\alpha-1} \log \sum_{x \in \operatorname{supp}(P_X)} P(x) \sum_{y} P(y|x)^\alpha\, Q(y|x)^{1-\alpha} \tag{55}$$

$$= \frac{1}{\alpha-1} \log \sum_{x \in \operatorname{supp}(P_X)} P(x)\, 2^{(\alpha-1) D_\alpha(P_{Y|X=x} \| Q_{Y|X=x})}, \tag{56}$$

where (55) and (56) follow from the definition of the Rényi divergence in (4). For $\alpha$ being zero, one, or infinity, we obtain from (21)–(23) and (3)

$$D_0^{\mathrm{s}}(P_{Y|X} \| Q_{Y|X} | P_X) = -\log \sum_{x \in \operatorname{supp}(P_X)} P(x) \sum_{y \in \operatorname{supp}(P_{Y|X=x})} Q(y|x), \tag{57}$$

$$D_1^{\mathrm{s}}(P_{Y|X} \| Q_{Y|X} | P_X) = D(P_{Y|X} \| Q_{Y|X} | P_X), \tag{58}$$

$$D_\infty^{\mathrm{s}}(P_{Y|X} \| Q_{Y|X} | P_X) = \log \max_{x \in \operatorname{supp}(P_X)} \max_{y} \frac{P(y|x)}{Q(y|x)}. \tag{59}$$

Sibson [10] defined the measure of dependence

$$I_\alpha^{\mathrm{s}}(X;Y) \triangleq \min_{Q_Y} D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_Y | P_X). \tag{60}$$

This minimum can be computed explicitly [10] (Corollary 2.3): For *α* ∈ (0, 1) ∪ (1, ∞),

$$I_\alpha^{\mathrm{s}}(X;Y) = \frac{\alpha}{\alpha-1} \log \sum_{y} \left[ \sum_{x} P(x)\, P(y|x)^\alpha \right]^{\frac{1}{\alpha}}, \tag{61}$$

and for *α* being one or infinity,

$$I_1^{\mathrm{s}}(X;Y) = I(X;Y), \tag{62}$$

$$I_\infty^{\mathrm{s}}(X;Y) = \log \sum_{y} \max_{x} P(y|x), \tag{63}$$

where *I*(*X*;*Y*) denotes Shannon's mutual information.
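Since $I_\alpha^{\mathrm{s}}(X;Y)$ in (60) is a minimum over $Q_Y$, the closed form (61) must lower-bound $D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_Y | P_X)$ for every $Q_Y$. The sketch below checks this against many random $Q_Y$; the helper names and the particular $P_X$, $P_{Y|X}$ are our illustrative choices.

```python
import math
import random

def sibson_cond_renyi(p_x, p_y_x, q_y, alpha):
    """D_alpha^s(P_{Y|X} || Q_Y | P_X) per (55), in bits."""
    s = sum(px * sum(p ** alpha * q ** (1 - alpha)
                     for p, q in zip(py, q_y) if p > 0)
            for px, py in zip(p_x, p_y_x) if px > 0)
    return math.log2(s) / (alpha - 1)

def sibson_info(p_x, p_y_x, alpha):
    """I_alpha^s(X;Y) via the closed form (61), in bits."""
    ny = len(p_y_x[0])
    z = sum(sum(px * py[y] ** alpha
                for px, py in zip(p_x, p_y_x)) ** (1 / alpha)
            for y in range(ny))
    return alpha / (alpha - 1) * math.log2(z)

random.seed(1)
alpha = 0.5
p_x = [0.2, 0.3, 0.5]
p_y_x = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
i_s = sibson_info(p_x, p_y_x, alpha)
# (60): I_alpha^s is the minimum over Q_Y, so every Q_Y gives an upper bound
for _ in range(100):
    w = [random.random() for _ in range(3)]
    q_y = [v / sum(w) for v in w]
    assert i_s <= sibson_cond_renyi(p_x, p_y_x, q_y, alpha) + 1e-12
gap = sibson_cond_renyi(p_x, p_y_x, [1 / 3, 1 / 3, 1 / 3], alpha) - i_s
assert gap >= -1e-12
```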

The concavity and convexity properties of $D_\alpha^{\mathrm{s}}(\cdot)$ and $I_\alpha^{\mathrm{s}}(X;Y)$ were studied by Ho–Verdú [24]. More properties of $I_\alpha^{\mathrm{s}}(X;Y)$ were collected by Verdú [25]. The maximization of $I_\alpha^{\mathrm{s}}(X;Y)$ with respect to $P_X$ and the minimax properties of $D_\alpha^{\mathrm{s}}(\cdot)$ were studied by Nakiboğlu [26] and Cai–Verdú [27].

The conditional Rényi divergence $D_\alpha^{\mathrm{s}}(\cdot)$ was used by Fong and Tan [28] to establish strong converse theorems for multicast networks. Yu and Tan [29] analyzed channel resolvability, among other measures, in terms of $D_\alpha^{\mathrm{s}}(\cdot)$.

From (61) we see that Gallager's $E_0$ function [11], which is defined as

$$E_0(\rho, P_X, P_{Y|X}) \triangleq -\log \sum_{y} \left[ \sum_{x} P(x)\, P(y|x)^{\frac{1}{1+\rho}} \right]^{1+\rho}, \tag{64}$$

is in one-to-one correspondence to Sibson's measure of dependence:

$$I_\alpha^{\mathrm{s}}(X;Y) = \frac{\alpha}{1-\alpha}\, E_0\!\left( \frac{1-\alpha}{\alpha},\, P_X,\, P_{Y|X} \right). \tag{65}$$
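The correspondence (65) is easy to verify numerically: substituting $\rho = (1-\alpha)/\alpha$ into (64) gives $1/(1+\rho) = \alpha$ and $1+\rho = 1/\alpha$, so both sides reduce to the same sum. Below is a check in base-2 logarithms; the helper names and the small channel are our illustrative choices.

```python
import math

def gallager_e0(rho, p_x, p_y_x):
    """Gallager's E_0 function per (64), in bits."""
    ny = len(p_y_x[0])
    s = sum(sum(px * py[y] ** (1 / (1 + rho))
                for px, py in zip(p_x, p_y_x)) ** (1 + rho)
            for y in range(ny))
    return -math.log2(s)

def sibson_info(p_x, p_y_x, alpha):
    """I_alpha^s(X;Y) via the closed form (61), in bits."""
    ny = len(p_y_x[0])
    z = sum(sum(px * py[y] ** alpha
                for px, py in zip(p_x, p_y_x)) ** (1 / alpha)
            for y in range(ny))
    return alpha / (alpha - 1) * math.log2(z)

p_x = [0.4, 0.6]
p_y_x = [[0.8, 0.2], [0.3, 0.7]]
for alpha in [0.5, 0.8, 2.0, 5.0]:
    rho = (1 - alpha) / alpha
    lhs = sibson_info(p_x, p_y_x, alpha)
    rhs = alpha / (1 - alpha) * gallager_e0(rho, p_x, p_y_x)  # (65)
    assert abs(lhs - rhs) < 1e-10
```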

Gallager's $E_0$ function is important in channel coding: it appears in the random coding exponent [30] and in the sphere packing exponent [31,32] (see also Gallager [11]). The exponential strong converse theorem proved by Arimoto [33] also uses the $E_0$ function. Polyanskiy and Verdú [34] extended the exponential strong converse theorem to channels with feedback. Augustin [21] and Nakiboğlu [35,36] extended the sphere packing bound to channels with feedback.

The rest of the section presents some properties of $D_\alpha^{\mathrm{s}}(\cdot)$. Because $D_\alpha^{\mathrm{s}}(\cdot)$ can be written as an (unconditional) Rényi divergence (see (54)), it inherits many properties from the Rényi divergence:

**Proposition 3.** *Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Then,*


**Proof.** These follow from (54) and the properties of the Rényi divergence (Proposition 1).

We next consider data-processing inequalities for $D_\alpha^{\mathrm{s}}(\cdot)$. We distinguish between processing $Y$ and processing $X$. The data-processing inequality for processing $Y$ follows from the data-processing inequality for the (unconditional) Rényi divergence:

**Theorem 3.** *Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. For a conditional PMF $A_{Y'|XY}$, define*

$$P\_{Y'|X}(y'|\mathbf{x}) \stackrel{\Delta}{=} \sum\_{\mathbf{y}} P\_{Y|X}(y|\mathbf{x}) A\_{Y'|XY}(y'|\mathbf{x}, \mathbf{y}) \, \tag{66}$$

$$Q\_{Y'|X}(y'|\mathbf{x}) \stackrel{\Delta}{=} \sum\_{\mathbf{y}} Q\_{Y|X}(y|\mathbf{x}) A\_{Y'|XY}(y'|\mathbf{x}, \mathbf{y}).\tag{67}$$

*Then, for all α* ∈ [0, ∞]*,*

$$D_\alpha^{\mathrm{s}}(P_{Y'|X} \| Q_{Y'|X} | P_X) \le D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_{Y|X} | P_X). \tag{68}$$

**Proof.** See Appendix F.

The data-processing inequality for processing *X* similarly follows from the data-processing inequality for the (unconditional) Rényi divergence:

**Theorem 4.** *Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. For a conditional PMF $B_{X'|X}$, define the PMFs*

$$P\_{X'}(\mathbf{x'}) \stackrel{\Delta}{=} \sum\_{\mathbf{x}} P\_X(\mathbf{x}) \, B\_{X'|X}(\mathbf{x'}|\mathbf{x}) \, \tag{69}$$

$$\mathcal{B}\_{X|X'}(\mathbf{x}|\mathbf{x'}) \triangleq \begin{cases} \mathcal{P}\_X(\mathbf{x})\mathcal{B}\_{X'|X}(\mathbf{x'}|\mathbf{x})/\mathcal{P}\_{X'}(\mathbf{x'}) & \text{if } \mathcal{P}\_{X'}(\mathbf{x'}) > 0, \\ 1/|\mathcal{X}| & \text{otherwise,} \end{cases} \tag{70}$$

$$P_{Y|X'}(y|x') \triangleq \sum_{x} B_{X|X'}(x|x')\, P_{Y|X}(y|x), \tag{71}$$

$$Q\_{Y|X'}(y|\mathbf{x'}) \triangleq \sum\_{\mathbf{x}} B\_{X|X'}(\mathbf{x}|\mathbf{x'}) \, Q\_{Y|X}(y|\mathbf{x}).\tag{72}$$

*Then, for all α* ∈ [0, ∞]*,*

$$D_\alpha^{\mathrm{s}}(P_{Y|X'} \| Q_{Y|X'} | P_{X'}) \le D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_{Y|X} | P_X). \tag{73}$$

**Proof.** See Appendix G.

As a special case of Theorem 4, we obtain the following relation between the conditional and the unconditional Rényi divergence:

**Corollary 2.** *Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Define the marginal PMFs*

$$P\_Y(y) \stackrel{\Delta}{=} \sum\_{\mathbf{x}} P\_{\mathbf{X}}(\mathbf{x}) \, P\_{Y|X}(y|\mathbf{x}) \, \tag{74}$$

$$Q\_Y(y) \triangleq \sum\_{\mathbf{x}} P\_X(\mathbf{x}) \, Q\_{Y|X}(y|\mathbf{x}). \tag{75}$$

*Then, for all α* ∈ [0, ∞]*,*

$$D_\alpha(P_Y \| Q_Y) \le D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_{Y|X} | P_X). \tag{76}$$

**Proof.** This follows from Theorem 4 in the same way that Corollary 1 followed from Theorem 2.
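Corollary 2 can be checked numerically by forming the joint PMFs in (54) and the marginals (74)–(75). The sketch below does this for random instances; the helper names, seed, and alphabet sizes are our illustrative choices.

```python
import math
import random

def renyi_div(p, q, alpha):
    # Rényi divergence in bits, cf. (4), for alpha in (0,1) or (1,inf)
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log2(s) / (alpha - 1)

def sibson_cond_renyi(p_x, p_y_x, q_y_x, alpha):
    # D_alpha^s via (54): one Rényi divergence between the joint PMFs
    pj = [px * p for px, py in zip(p_x, p_y_x) for p in py]
    qj = [px * q for px, qy in zip(p_x, q_y_x) for q in qy]
    return renyi_div(pj, qj, alpha)

def rand_pmf(n):
    w = [random.random() for _ in range(n)]
    return [v / sum(w) for v in w]

random.seed(2)
p_x = rand_pmf(3)
p_y_x = [rand_pmf(4) for _ in range(3)]
q_y_x = [rand_pmf(4) for _ in range(3)]
p_y = [sum(px * py[y] for px, py in zip(p_x, p_y_x)) for y in range(4)]  # (74)
q_y = [sum(px * qy[y] for px, qy in zip(p_x, q_y_x)) for y in range(4)]  # (75)
for alpha in [0.3, 0.9, 2.0, 10.0]:
    # (76) holds for all alpha, in contrast to the Csiszár version (48)
    assert (renyi_div(p_y, q_y, alpha)
            <= sibson_cond_renyi(p_x, p_y_x, q_y_x, alpha) + 1e-12)
```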

#### **5. New Conditional Rényi Divergence**

Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. For $\alpha \in (0,1) \cup (1,\infty)$, define

$$D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) \triangleq \frac{\alpha}{\alpha-1} \log \sum_{x \in \operatorname{supp}(P_X)} P(x)\, 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x} \| Q_{Y|X=x})} \tag{77}$$

$$= \frac{\alpha}{\alpha-1} \log \sum_{x \in \operatorname{supp}(P_X)} P(x) \left[ \sum_{y} P(y|x)^\alpha\, Q(y|x)^{1-\alpha} \right]^{\frac{1}{\alpha}}, \tag{78}$$

where (78) follows from the definition of the Rényi divergence in (4). (Except for the sign, the exponential averaging in (77) is very similar to the one of the Arimoto–Rényi conditional entropy; compare with (147) below.) For $\alpha$ being zero, one, or infinity, we define by continuous extension of (77)

$$D_0^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) \triangleq -\log \max_{x \in \operatorname{supp}(P_X)} \sum_{y \in \operatorname{supp}(P_{Y|X=x})} Q(y|x), \tag{79}$$

$$D_1^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) \triangleq D(P_{Y|X} \| Q_{Y|X} | P_X), \tag{80}$$

$$D_\infty^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) \triangleq \log \sum_{x \in \operatorname{supp}(P_X)} P(x) \max_{y} \frac{P(y|x)}{Q(y|x)}. \tag{81}$$

This conditional Rényi divergence has an operational meaning in horse betting with side information (see Theorem 10 below). Before discussing the measure of dependence associated with $D_\alpha^{\mathrm{l}}(\cdot)$, we establish the following alternative characterization of $D_\alpha^{\mathrm{l}}(\cdot)$:

**Proposition 4.** *Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Then, for all $\alpha \in [0,\infty]$,*

$$D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) = \min_{Q_X} D_\alpha(P_X P_{Y|X} \| Q_X Q_{Y|X}). \tag{82}$$

**Proof.** We first treat the case $\alpha \in (0,1) \cup (1,\infty)$. Some algebra reveals that, for every PMF $Q_X$,

$$D_\alpha(P_X P_{Y|X} \| Q_X Q_{Y|X}) = D_\alpha\Big(Q_X^{*(\alpha)} \,\Big\|\, Q_X\Big) + \frac{\alpha}{\alpha-1} \log \sum_{x \in \operatorname{supp}(P_X)} P(x) \left[ \sum_{y} P(y|x)^\alpha\, Q(y|x)^{1-\alpha} \right]^{\frac{1}{\alpha}}, \tag{83}$$

where the PMF $Q_X^{*(\alpha)}$ is defined as

$$Q_X^{*(\alpha)}(x) \triangleq \frac{P(x) \left[ \sum_{y} P(y|x)^\alpha\, Q(y|x)^{1-\alpha} \right]^{1/\alpha}}{\sum_{x' \in \operatorname{supp}(P_X)} P(x') \left[ \sum_{y} P(y|x')^\alpha\, Q(y|x')^{1-\alpha} \right]^{1/\alpha}}. \tag{84}$$

The right-hand side (RHS) of (82) is thus equal to the minimum over $Q_X$ of the RHS of (83). Since $D_\alpha\big(Q_X^{*(\alpha)} \big\| Q_X\big) \ge 0$ with equality if $Q_X = Q_X^{*(\alpha)}$ (Proposition 1 (a)), this minimum is equal to the second term on the RHS of (83), which, by (78), equals $D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X)$.

For $\alpha = 1$ and $\alpha = \infty$, (82) follows from the same argument using that, for every PMF $Q_X$,

$$D_1(P_X P_{Y|X} \| Q_X Q_{Y|X}) = D(P_X \| Q_X) + D(P_{Y|X} \| Q_{Y|X} | P_X), \tag{85}$$

$$D_\infty(P_X P_{Y|X} \| Q_X Q_{Y|X}) = D_\infty\Big(Q_X^{*(\infty)} \,\Big\|\, Q_X\Big) + \log \sum_{x \in \operatorname{supp}(P_X)} P(x) \max_{y} \frac{P(y|x)}{Q(y|x)}, \tag{86}$$

where the PMF $Q_X^{*(\infty)}$ is defined as

$$Q_X^{*(\infty)}(x) \triangleq \frac{P(x) \max_{y} \big[ P(y|x)/Q(y|x) \big]}{\sum_{x' \in \operatorname{supp}(P_X)} P(x') \max_{y} \big[ P(y|x')/Q(y|x') \big]}. \tag{87}$$

For *α* = 0, (82) holds because

$$\min_{Q_X} D_0(P_X P_{Y|X} \| Q_X Q_{Y|X}) = \min_{Q_X} -\log \sum_{x \in \operatorname{supp}(P_X)} Q(x) \sum_{y \in \operatorname{supp}(P_{Y|X=x})} Q(y|x) \tag{88}$$

$$= -\log \max_{Q_X} \sum_{x \in \operatorname{supp}(P_X)} Q(x) \sum_{y \in \operatorname{supp}(P_{Y|X=x})} Q(y|x) \tag{89}$$

$$= -\log \max_{x \in \operatorname{supp}(P_X)} \sum_{y \in \operatorname{supp}(P_{Y|X=x})} Q(y|x) \tag{90}$$

$$= D_0^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X), \tag{91}$$

where (88) follows from the definition of $D_0(P\|Q)$ in (21), and (91) follows from (79).
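The decomposition (83) also yields a direct numerical check of Proposition 4: evaluating the joint Rényi divergence at the minimizer $Q_X^{*(\alpha)}$ of (84) must reproduce $D_\alpha^{\mathrm{l}}$ as given by (78). The helper names and the small example below are ours.

```python
import math

def renyi_div(p, q, alpha):
    # Rényi divergence in bits, cf. (4), for alpha in (0,1) or (1,inf)
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log2(s) / (alpha - 1)

def new_cond_renyi(p_x, p_y_x, q_y_x, alpha):
    """The new conditional Rényi divergence per (78), in bits."""
    s = sum(px * sum(p ** alpha * q ** (1 - alpha)
                     for p, q in zip(py, qy) if p > 0) ** (1 / alpha)
            for px, py, qy in zip(p_x, p_y_x, q_y_x) if px > 0)
    return alpha / (alpha - 1) * math.log2(s)

alpha = 2.0
p_x = [0.3, 0.7]
p_y_x = [[0.6, 0.4], [0.1, 0.9]]
q_y_x = [[0.5, 0.5], [0.4, 0.6]]
# The minimizing Q_X^{*(alpha)} of (84)
w = [px * sum(p ** alpha * q ** (1 - alpha)
              for p, q in zip(py, qy)) ** (1 / alpha)
     for px, py, qy in zip(p_x, p_y_x, q_y_x)]
q_x_star = [v / sum(w) for v in w]
pj = [px * p for px, py in zip(p_x, p_y_x) for p in py]
qj = [qx * q for qx, qy in zip(q_x_star, q_y_x) for q in qy]
lhs = new_cond_renyi(p_x, p_y_x, q_y_x, alpha)
rhs = renyi_div(pj, qj, alpha)  # D_alpha(P_X P_{Y|X} || Q_X^* Q_{Y|X}), cf. (82)-(83)
assert abs(lhs - rhs) < 1e-10
```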

Tomamichel and Hayashi [37] and Lapidoth and Pfister [3] independently introduced and studied the dependence measure

$$J_\alpha(X;Y) \triangleq \min_{Q_X, Q_Y} D_\alpha(P_{XY} \| Q_X Q_Y). \tag{92}$$

(For some measure-theoretic properties of $J_\alpha(X;Y)$, see Aishwarya–Madiman [38].) The measure $J_\alpha(X;Y)$ can be related to the error exponents in a hypothesis testing problem where the samples are either from a known joint distribution or an unknown product distribution (see [37] (Equation (57)) and [39]). It also appears in horse betting with side information (see Theorem 11 below).

Similar to $I_\alpha^{\mathrm{c}}(X;Y)$ in (34) and $I_\alpha^{\mathrm{s}}(X;Y)$ in (60), the measure $J_\alpha(X;Y)$ can be expressed as a minimization involving the new conditional Rényi divergence:

**Proposition 5.** *Let $P_{XY}$ be a joint PMF. Denote its marginal PMFs by $P_X$ and $P_Y$ and its conditional PMFs by $P_{Y|X}$ and $P_{X|Y}$, so $P_{XY} = P_X P_{Y|X} = P_Y P_{X|Y}$. Then, for all $\alpha \in [0,\infty]$,*

$$J_\alpha(X;Y) = \min_{Q_Y} D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_Y | P_X) \tag{93}$$

$$= \min_{Q_X} D_\alpha^{\mathrm{l}}(P_{X|Y} \| Q_X | P_Y). \tag{94}$$

**Proof.** Equation (93) holds because

$$\min_{Q_Y} D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_Y | P_X) = \min_{Q_Y} \min_{Q_X} D_\alpha(P_X P_{Y|X} \| Q_X Q_Y) \tag{95}$$

$$= J_\alpha(X;Y), \tag{96}$$

where (95) follows from Proposition 4, and (96) follows from (92). Swapping the roles of *X* and *Y* establishes (94):

$$\min_{Q_X} D_\alpha^{\mathrm{l}}(P_{X|Y} \| Q_X | P_Y) = \min_{Q_X} \min_{Q_Y} D_\alpha(P_Y P_{X|Y} \| Q_Y Q_X) \tag{97}$$

$$= J_\alpha(X;Y), \tag{98}$$

where (97) follows from Proposition 4, and (98) follows from (92).

The rest of the section presents some properties of $D_\alpha^{\mathrm{l}}(\cdot)$.

**Proposition 6.** *Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Then,*


**Proof.** We prove these properties as follows:

(a) For all *α* ∈ [0, ∞], Proposition 4 implies

$$D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) = \min_{Q_X} D_\alpha(P_X P_{Y|X} \| Q_X Q_{Y|X}). \tag{99}$$

The nonnegativity of $D_\alpha^{\mathrm{l}}(\cdot)$ now follows from the nonnegativity of the Rényi divergence (Proposition 1 (a)). If $P_{Y|X=x} = Q_{Y|X=x}$ for all $x \in \operatorname{supp}(P_X)$, then $P_X P_{Y|X} = P_X Q_{Y|X}$. Hence, using $Q_X = P_X$ on the RHS of (99), $D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X)$ equals zero. Conversely, if $\alpha \in (0,\infty]$ and $D_\alpha^{\mathrm{l}}(\cdot) = 0$, then $P_X P_{Y|X} = Q_X Q_{Y|X}$ for some $Q_X$ by Proposition 1 (a), which implies $P_{Y|X=x} = Q_{Y|X=x}$ for all $x \in \operatorname{supp}(P_X)$.


We next consider the continuity at $\alpha = 0$. Define $\tau \triangleq \min_{x \in \operatorname{supp}(P_X)} P(x)$. Then, for all $\alpha \in (0,1)$,

$$(\alpha-1)\, D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) = \alpha \log \sum_{x \in \operatorname{supp}(P_X)} P(x)\, 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x} \| Q_{Y|X=x})} \tag{100}$$

$$\ge \alpha \log \sum_{x \in \operatorname{supp}(P_X)} \tau\, 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x} \| Q_{Y|X=x})} \tag{101}$$

$$\ge \alpha \log \max_{x \in \operatorname{supp}(P_X)} \tau\, 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x} \| Q_{Y|X=x})} \tag{102}$$

$$= \alpha \log \tau + \max_{x \in \operatorname{supp}(P_X)} (\alpha-1)\, D_\alpha(P_{Y|X=x} \| Q_{Y|X=x}), \tag{103}$$

where (100) follows from the definition in (77). On the other hand, for all *α* ∈ (0, 1),

$$(\alpha-1)\, D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) = \alpha \log \sum_{x \in \operatorname{supp}(P_X)} P(x)\, 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x} \| Q_{Y|X=x})} \tag{104}$$

$$\le \alpha \log \max_{x \in \operatorname{supp}(P_X)} 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x} \| Q_{Y|X=x})} \tag{105}$$

$$= \max_{x \in \operatorname{supp}(P_X)} (\alpha-1)\, D_\alpha(P_{Y|X=x} \| Q_{Y|X=x}). \tag{106}$$

Because $\lim_{\alpha \to 0} \alpha \log \tau = 0$, it follows from (103) and (106) and the sandwich theorem that

$$\lim_{\alpha \downarrow 0} D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) = \lim_{\alpha \downarrow 0} \frac{1}{\alpha-1} \max_{x \in \operatorname{supp}(P_X)} (\alpha-1)\, D_\alpha(P_{Y|X=x} \| Q_{Y|X=x}) \tag{107}$$

$$= -\log \max_{x \in \operatorname{supp}(P_X)} \sum_{y \in \operatorname{supp}(P_{Y|X=x})} Q(y|x), \tag{108}$$

where (108) follows from the continuity of the Rényi divergence (Proposition 1 (c)) and the definition of $D_0(P\|Q)$ in (21).

We conclude with the continuity at *α* = ∞. Observe that

$$\lim_{\alpha \to \infty} D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) = \lim_{\alpha \to \infty} \frac{\alpha}{\alpha-1} \log \sum_{x \in \operatorname{supp}(P_X)} P(x)\, 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x} \| Q_{Y|X=x})} \tag{109}$$

$$= \log \sum_{x \in \operatorname{supp}(P_X)} P(x)\, 2^{\lim_{\alpha \to \infty} D_\alpha(P_{Y|X=x} \| Q_{Y|X=x})} \tag{110}$$

$$= \log \sum_{x \in \operatorname{supp}(P_X)} P(x) \max_{y} \frac{P(y|x)}{Q(y|x)}, \tag{111}$$

where (109) follows from the definition in (77), and (111) follows from the continuity of the Rényi divergence (Proposition 1 (c)) and the definition of $D_\infty(P\|Q)$ in (23).

(d) For all *α* ∈ [0, ∞], Proposition 4 implies

$$D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) = \min_{Q_X} D_\alpha(P_X P_{Y|X} \| Q_X Q_{Y|X}). \tag{112}$$

Because $\alpha \mapsto D_\alpha(P\|Q)$ is nonincreasing on $[0,\infty]$ (Proposition 1 (d)) and because the pointwise minimum preserves the monotonicity, the mapping $\alpha \mapsto D_\alpha^{\mathrm{l}}(\cdot)$ is nonincreasing on $[0,\infty]$.

(e) By Proposition 4,

$$\frac{1-\alpha}{\alpha}\, D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) = \begin{cases} \min_{Q_X} \frac{1-\alpha}{\alpha}\, D_\alpha(P_X P_{Y|X} \| Q_X Q_{Y|X}) & \text{if } \alpha \in (0,1], \\ \max_{Q_X} \frac{1-\alpha}{\alpha}\, D_\alpha(P_X P_{Y|X} \| Q_X Q_{Y|X}) & \text{if } \alpha \in (1,\infty). \end{cases} \tag{113}$$

By the nonnegativity of the Rényi divergence (Proposition 1 (a)), the RHS of (113) is nonnegative for $\alpha \in (0,1]$ and nonpositive for $\alpha \in (1,\infty)$. Hence, it suffices to show separately that the mapping $\alpha \mapsto \frac{1-\alpha}{\alpha}\, D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X)$ is nonincreasing on $(0,1]$ and on $(1,\infty)$. This is indeed the case: the mapping $\alpha \mapsto \frac{1-\alpha}{\alpha}\, D_\alpha(P_X P_{Y|X} \| Q_X Q_{Y|X})$ on the RHS of (113) is nonincreasing on $(0,\infty)$ (Proposition 1 (e)), and the monotonicity is preserved by the pointwise minimum and maximum, respectively.

(f) For *α* ∈ [0, 1], Proposition 4 implies that

$$(1-\alpha)\, D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) = \min_{Q_X} \Big[ (1-\alpha)\, D_\alpha(P_X P_{Y|X} \| Q_X Q_{Y|X}) \Big]. \tag{114}$$

Because $\alpha \mapsto (1-\alpha)\, D_\alpha(P_X P_{Y|X} \| Q_X Q_{Y|X})$ is concave on $[0,1]$ (Proposition 1 (f)) and because the pointwise minimum preserves the concavity, the mapping $\alpha \mapsto (1-\alpha)\, D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X)$ is concave on $[0,1]$.

(g) This follows from Proposition 1 (g) in the same way that Part (f) followed from Proposition 1 (f).

We next consider data-processing inequalities for $D_\alpha^{\mathrm{l}}(\cdot)$. We distinguish between processing $Y$ and processing $X$. The data-processing inequality for processing $Y$ follows from the data-processing inequality for the (unconditional) Rényi divergence:

**Theorem 5.** *Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. For a conditional PMF $A_{Y'|XY}$, define*

$$P\_{Y'|X}(y'|\mathbf{x}) \stackrel{\Delta}{=} \sum\_{\mathbf{y}} P\_{Y|X}(y|\mathbf{x}) A\_{Y'|XY}(y'|\mathbf{x}, \mathbf{y}) \, \tag{115}$$

$$Q\_{Y'|X}(y'|\mathbf{x}) \stackrel{\Delta}{=} \sum\_{\mathbf{y}} Q\_{Y|X}(y|\mathbf{x}) A\_{Y'|X\mathbf{Y}}(y'|\mathbf{x},\mathbf{y}).\tag{116}$$

*Then, for all α* ∈ [0, ∞]*,*

$$D_\alpha^{\mathrm{l}}(P_{Y'|X} \| Q_{Y'|X} | P_X) \le D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X). \tag{117}$$

**Proof.** We prove (117) for $\alpha \in (0,1) \cup (1,\infty)$; the claim will then extend to $\alpha \in [0,\infty]$ by the continuity of $D_\alpha^{\mathrm{l}}(\cdot)$ in $\alpha$ (Proposition 6 (c)). For every $x \in \operatorname{supp}(P_X)$, we can apply Proposition 1 (h) with the substitution of $A_{Y'|Y,X=x}$ for $A_{Y'|Y}$ to obtain

$$D\_{\mathfrak{a}}\left(\mathbf{P}\_{Y'|X=\mathbf{x}} \| \mathbf{Q}\_{Y'|X=\mathbf{x}}\right) \le D\_{\mathfrak{a}}\left(\mathbf{P}\_{Y|X=\mathbf{x}} \| \mathbf{Q}\_{Y|X=\mathbf{x}}\right). \tag{118}$$

For *α* ∈ (0, 1) ∪ (1, ∞), (117) now follows from (77) and (118).

Processing $X$ is different. Consider first a $Q_{Y|X}$ that does not depend on $X$. Then, writing $Q_{Y|X} = Q_Y$, we have the following result (which, as shown in Example 2 below, does not extend to general $Q_{Y|X}$):

**Theorem 6.** *Let $P_X$ and $Q_Y$ be PMFs, and let $P_{Y|X}$ be a conditional PMF. For a conditional PMF $B_{X'|X}$, define the PMFs*

$$P\_{X'}(\mathbf{x'}) \stackrel{\Delta}{=} \sum\_{\mathbf{x}} P\_X(\mathbf{x}) \, B\_{X'|X}(\mathbf{x'}|\mathbf{x})\_{\prime} \tag{119}$$

$$\mathcal{B}\_{X|X'}(\mathbf{x}|\mathbf{x'}) \triangleq \begin{cases} \mathcal{P}\_X(\mathbf{x}) \mathcal{B}\_{X'|X}(\mathbf{x'}|\mathbf{x}) / \mathcal{P}\_{X'}(\mathbf{x'}) & \text{if } \mathcal{P}\_{X'}(\mathbf{x'}) > 0, \\ 1/|\mathcal{X}| & \text{otherwise,} \end{cases} \tag{120}$$

$$P\_{Y|X'}(y|\mathbf{x'}) \triangleq \sum\_{\mathbf{x}} B\_{X|X'}(\mathbf{x}|\mathbf{x'}) \, P\_{Y|X}(y|\mathbf{x}).\tag{121}$$

*Then, for all α* ∈ [0, ∞]*,*

$$D_\alpha^{\mathrm{l}}(P_{Y|X'} \| Q_Y | P_{X'}) \le D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_Y | P_X). \tag{122}$$

Once we provide the operational meaning of $D_\alpha^{\mathrm{l}}(\cdot)$ in horse betting with side information (Theorem 10 below), Theorem 6 will become very intuitive: it expresses the fact that preprocessing the side information cannot increase the gambler's utility; see Remark 8. Note that $P_{X'}$ and $P_{Y|X'}$ in Theorem 6 can be obtained from the following marginalization:

$$P\_{X'}(\mathbf{x'})\,\mathcal{P}\_{Y|X'}(\mathcal{y}|\mathbf{x'}) = \sum\_{\mathbf{x}} \mathcal{P}\_{\mathbf{X}}(\mathbf{x})\,\mathcal{B}\_{\mathbf{X'}|\mathbf{X}}(\mathbf{x'}|\mathbf{x})\,\mathcal{P}\_{\mathbf{Y}|\mathbf{X}}(\mathcal{y}|\mathbf{x}).\tag{123}$$

**Proof of Theorem 6.** We show (122) for $\alpha \in (0,1) \cup (1,\infty)$; the claim will then extend to $\alpha \in [0,\infty]$ by the continuity of $D_\alpha^{\mathrm{l}}(\cdot)$ in $\alpha$ (Proposition 6 (c)). Consider first $\alpha \in (1,\infty)$. Then, (122) holds because

$$\begin{split} \frac{\mathfrak{a}-1}{\mathfrak{a}} D\_{\mathfrak{a}}^{1}(P\_{Y|X'}||Q\_{Y}|P\_{X'})\\ &= \log \sum\_{\mathbf{x'} \in \operatorname{supp}(P\_{X'})} P\_{X'}(\mathbf{x'}) \left[ \sum\_{\mathbf{y}} P\_{Y|X'}(\mathbf{y}|\mathbf{x'})^{\mathbf{a}} Q\_{Y}(\mathbf{y})^{1-\mathbf{a}} \right]^{\frac{1}{\mathbf{a}}} \end{split} \tag{124}$$

$$=\log\sum\_{\mathbf{x}'\in\text{supp}(P\_{\mathbf{X}'})}P\_{X'}(\mathbf{x}')\left[\sum\_{\mathbf{y}}\left[\sum\_{\mathbf{x}}B\_{X|\mathbf{X}'}(\mathbf{x}|\mathbf{x}')P\_{Y|\mathbf{X}}(\mathbf{y}|\mathbf{x})Q\_{Y}(\mathbf{y})^{\frac{1-a}{a}}\right]^{a}\right]^{\frac{1}{a}}\tag{125}$$

$$=\log\sum\_{\mathbf{x}'\in\operatorname{supp}(P\_{\mathbf{X}'})}\left[\sum\_{\mathcal{Y}}\left[\sum\_{\mathbf{x}\in\operatorname{supp}(P\_{\mathbf{X}})}P\_{\mathbf{X}}(\mathbf{x})\,\mathcal{B}\_{\mathbf{X}'|\mathbf{X}}(\mathbf{x}'|\mathbf{x})\,\mathcal{P}\_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x})\,\mathcal{Q}\_{\mathbf{Y}}(\mathbf{y})^{\frac{1-q}{q}}\right]^{a}\right]^{\frac{1}{a}}\tag{126}$$

$$\le \log \sum_{x' \in \operatorname{supp}(P_{X'})} \sum_{x \in \operatorname{supp}(P_X)} \left[\sum_{y} \left[P_X(x)\, B_{X'|X}(x'|x)\, P_{Y|X}(y|x)\, Q_Y(y)^{\frac{1-\alpha}{\alpha}}\right]^{\alpha}\right]^{\frac{1}{\alpha}}\tag{127}$$

$$= \log \sum_{x \in \operatorname{supp}(P_X)} P_X(x) \left[\sum_{x' \in \operatorname{supp}(P_{X'})} B_{X'|X}(x'|x)\right] \left[\sum_{y} P_{Y|X}(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}}\tag{128}$$

$$= \log \sum_{x \in \operatorname{supp}(P_X)} P_X(x) \left[\sum_{y} P_{Y|X}(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}}\tag{129}$$

$$= \frac{\alpha-1}{\alpha}\, D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_Y|P_X),\tag{130}$$

where (124) follows from (78); (125) follows from (121); (126) follows from (120); (127) follows from the Minkowski inequality [16] (III 2.4 Theorem 9); (129) holds because $P_X(x) > 0$ and $P_{X'}(x') = 0$ imply $B_{X'|X}(x'|x) = 0$, hence the first expression in square brackets on the left-hand side (LHS) of (129) equals one; and (130) follows from (78).

The proof for $\alpha \in (0,1)$ is very similar: (124)–(126) and (128)–(130) continue to hold, and the inequality in (127) is reversed [16] (III 2.4 Theorem 9). Because now $\frac{\alpha-1}{\alpha} < 0$, (122) continues to hold for $\alpha \in (0,1)$.
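As a numerical sanity check, the data-processing inequality (122) can be tested directly from the power-sum form (78). The following Python sketch uses an arbitrary chain $X \to X'$; the PMFs and the kernel $B_{X'|X}$ below are illustrative assumptions, not taken from the text:

```python
from math import log2

def cond_renyi_l(a, P_X, W, Q_Y):
    """D_alpha^l(P_{Y|X} || Q_Y | P_X) in bits via (78), for a in (0,1) or (1,oo)."""
    s = sum(
        p_x * sum(w ** a * q ** (1 - a) for w, q in zip(row, Q_Y)) ** (1 / a)
        for p_x, row in zip(P_X, W)
        if p_x > 0
    )
    return a / (a - 1) * log2(s)

P_X = [0.2, 0.5, 0.3]
B = [[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]]    # preprocessing kernel B_{X'|X}
W = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]    # P_{Y|X}
Q_Y = [0.6, 0.4]

# Marginalization (123): joint PMF of (X', Y), then P_{X'} and P_{Y|X'}.
joint = [[sum(P_X[x] * B[x][xp] * W[x][y] for x in range(3)) for y in range(2)]
         for xp in range(2)]
P_Xp = [sum(row) for row in joint]
W_p = [[joint[xp][y] / P_Xp[xp] for y in range(2)] for xp in range(2)]

for a in (0.5, 2.0):
    # (122): preprocessing the side information cannot increase D^l.
    assert cond_renyi_l(a, P_Xp, W_p, Q_Y) <= cond_renyi_l(a, P_X, W, Q_Y) + 1e-12
```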

As a special case of Theorem 6, we obtain the following relation between the conditional and the unconditional Rényi divergence:

**Corollary 3.** *Let $P_X$ and $Q_Y$ be PMFs, and let $P_{Y|X}$ be a conditional PMF. Define the marginal PMF*

$$P_Y(y) \triangleq \sum_{x} P_X(x)\, P_{Y|X}(y|x).\tag{131}$$

*Then, for all α* ∈ [0, ∞]*,*

$$D_\alpha(P_Y \| Q_Y) \le D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_Y | P_X).\tag{132}$$

**Proof.** This follows from Theorem 6 in the same way that Corollary 1 followed from Theorem 2.

Consider next $Q_{Y|X}$ that does depend on $X$. It turns out that Corollary 3, and hence Theorem 6, cannot be extended to this setting:

**Example 2.** *Let $\mathcal{X} = \{0, 1\}$ and $\mathcal{Y} = \{0, 1, 2\}$. Define the PMFs $P_X$, $P_{Y|X}$, and $Q_{Y|X}$ as*

$$P\_{\mathcal{X}}(0) = 0.5, \qquad \qquad P\_{\mathcal{X}}(1) = 0.5,\tag{133}$$

$$P\_{Y|X}(0|0) = 0.96, \qquad P\_{Y|X}(1|0) = 0.02, \qquad P\_{Y|X}(2|0) = 0.02,\tag{134}$$

$$P_{Y|X}(0|1) = 0.12, \qquad P_{Y|X}(1|1) = 0.02, \qquad P_{Y|X}(2|1) = 0.86,\tag{135}$$

$$Q_{Y|X}(0|0) = 0.06, \qquad Q_{Y|X}(1|0) = 0.92, \qquad Q_{Y|X}(2|0) = 0.02,\tag{136}$$

$$Q_{Y|X}(0|1) = 0.02, \qquad Q_{Y|X}(1|1) = 0.16, \qquad Q_{Y|X}(2|1) = 0.82.\tag{137}$$

*Then, for α* = 0.5 *and for α* = 2*,*

$$D_\alpha(P_Y \| Q_Y) > D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X),\tag{138}$$

*where the PMFs P<sup>Y</sup> and Q<sup>Y</sup> are given by*

$$P\_Y(y) \stackrel{\Delta}{=} \sum\_{\mathbf{x}} P\_{\mathbf{X}}(\mathbf{x}) \, P\_{Y|X}(y|\mathbf{x}) \, \tag{139}$$

$$Q\_Y(y) \stackrel{\Delta}{=} \sum\_{\mathbf{x}} P\_X(\mathbf{x}) \, Q\_{Y|X}(y|\mathbf{x}). \tag{140}$$

**Proof.** Numerically, $D_{0.5}(P_Y\|Q_Y) \approx 1.11$ bits, which is larger than $D_{0.5}^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) \approx 0.93$ bits. Similarly, $D_2(P_Y\|Q_Y) \approx 2.95$ bits, which is larger than $D_2^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) \approx 2.75$ bits.
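The numerical values in this proof can be reproduced directly from the definitions. The following Python sketch (an illustration; the divergences are computed in bits from their power-sum forms) recovers the four quantities and checks the inequality:

```python
from math import log2

P_X = [0.5, 0.5]
P = [[0.96, 0.02, 0.02], [0.12, 0.02, 0.86]]   # P_{Y|X}, rows of (134)-(135)
Q = [[0.06, 0.92, 0.02], [0.02, 0.16, 0.82]]   # Q_{Y|X}

P_Y = [sum(P_X[x] * P[x][y] for x in range(2)) for y in range(3)]   # (139)
Q_Y = [sum(P_X[x] * Q[x][y] for x in range(2)) for y in range(3)]   # (140)

def renyi(a, p, q):
    return log2(sum(pi ** a * qi ** (1 - a) for pi, qi in zip(p, q))) / (a - 1)

def cond_renyi_l(a, P_X, P, Q):   # from the power-sum form (78)
    s = sum(P_X[x] * sum(P[x][y] ** a * Q[x][y] ** (1 - a) for y in range(3)) ** (1 / a)
            for x in range(2))
    return a / (a - 1) * log2(s)

for a in (0.5, 2.0):
    assert renyi(a, P_Y, Q_Y) > cond_renyi_l(a, P_X, P, Q)   # inequality (138)

# Approximate values (in bits): 1.11 vs. 0.93, and 2.95 vs. 2.75.
assert round(renyi(0.5, P_Y, Q_Y), 2) == 1.11
assert round(cond_renyi_l(0.5, P_X, P, Q), 2) == 0.93
assert round(renyi(2.0, P_Y, Q_Y), 2) == 2.95
assert round(cond_renyi_l(2.0, P_X, P, Q), 2) == 2.75
```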

#### **6. Relation to Arimoto's Measures**

Before discussing Arimoto's measures, we first recall the definition of the Rényi entropy. The Rényi entropy of order *α* [7] is defined for all positive *α*'s other than one as

$$H_\alpha(X) \triangleq \frac{1}{1-\alpha} \log \sum_{x} P(x)^{\alpha}.\tag{141}$$

For *α* being zero, one, or infinity, we define by continuous extension of (141)

$$H\_0(X) \triangleq \log |\text{supp}(P\_X)|\,\tag{142}$$

$$H\_1(X) \triangleq H(X),\tag{143}$$

$$H\_{\infty}(X) \stackrel{\Delta}{=} -\log \max\_{\mathbf{x}} P(\mathbf{x}),\tag{144}$$

where *H*(*X*) denotes Shannon's entropy. The Rényi entropy can be related to the Rényi divergence as follows:

$$H_\alpha(X) = \log|\mathcal{X}| - D_\alpha(P_X \| U_X),\tag{145}$$

where *U<sup>X</sup>* denotes the uniform distribution over X .

There are different ways to define a conditional Rényi entropy [40]; we use Arimoto's proposal. The Arimoto–Rényi conditional entropy of order *α* [12,38,40,41] is defined for positive *α* other than one as

$$H_\alpha(X|Y) \triangleq \frac{\alpha}{1-\alpha} \log \sum_{y \in \operatorname{supp}(P_Y)} P(y) \left[\sum_{x} P(x|y)^{\alpha}\right]^{\frac{1}{\alpha}}\tag{146}$$

$$= \frac{\alpha}{1-\alpha} \log \sum_{y \in \operatorname{supp}(P_Y)} P(y)\, 2^{\frac{1-\alpha}{\alpha} H_\alpha(P_{X|Y=y})},\tag{147}$$

where (147) follows from the definition of the Rényi entropy in (141). The Arimoto–Rényi conditional entropy plays a key role in guessing with side information [20,42–44] and in task encoding with side information [45]; and it can be related to hypothesis testing [41]. For *α* being zero, one, or infinity, we define by continuous extension of (146)

$$H\_0(X|Y) \stackrel{\Delta}{=} \log \max\_{y \in \text{supp}(P\_Y)} |\text{supp}(P\_{X|Y=y})|\,\tag{148}$$

$$H\_1(X|Y) \triangleq H(X|Y),\tag{149}$$

$$H\_{\infty}(X|Y) \stackrel{\Delta}{=} -\log \sum\_{y \in \text{supp}(P\_Y)} P(y) \max\_{\mathbf{x}} P(\mathbf{x}|y),\tag{150}$$

where *H*(*X*|*Y*) denotes Shannon's conditional entropy. The analog of (145) for *Hα*(*X*|*Y*) is:

*Entropy* **2020**, *22*, 316

**Remark 2.** *For all α* ∈ [0, ∞]*,*

$$H_\alpha(X|Y) = \log|\mathcal{X}| - D_\alpha^{\mathrm{l}}(P_{X|Y} \| U_X | P_Y)\tag{151}$$

$$= \log|\mathcal{X}| - \min_{Q_Y} D_\alpha(P_Y P_{X|Y} \| Q_Y U_X).\tag{152}$$

**Proof.** Equation (151) follows, using some algebra, from the definition of $D_\alpha^{\mathrm{l}}(\cdot)$ in (78)–(81); and (152) follows from Proposition 4. (The characterization in (152) previously appeared as [40] (Theorem 4).)

Arimoto [12] also defined the following measure of dependence:

$$I_\alpha^{\mathrm{a}}(X;Y) \triangleq H_\alpha(X) - H_\alpha(X|Y)\tag{153}$$

$$= \frac{\alpha}{\alpha-1} \log \sum_{y} \left[\sum_{x} \frac{P(x)^{\alpha}}{\sum_{x' \in \mathcal{X}} P(x')^{\alpha}}\, P(y|x)^{\alpha}\right]^{\frac{1}{\alpha}},\tag{154}$$

where (154) follows from (141) and (146). Using Remark 2, we can express $I_\alpha^{\mathrm{a}}(X;Y)$ in terms of $D_\alpha^{\mathrm{l}}(\cdot)$:

**Remark 3.** *For all α* ∈ [0, ∞]*,*

$$I_\alpha^{\mathrm{a}}(X;Y) = D_\alpha^{\mathrm{l}}(P_{X|Y} \| U_X | P_Y) - D_\alpha(P_X \| U_X).\tag{155}$$

**Proof.** This follows from (145), (151), and (153).
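As an illustration, the identities (145) and (153)–(154) can be checked numerically. The following Python sketch assumes an arbitrary joint distribution (the PMF $P_X$ and channel $P_{Y|X}$ below are illustrative choices, not from the text):

```python
from math import log2

P_X = [0.5, 0.3, 0.2]
W = [[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]    # P_{Y|X}
a = 1.7

def renyi_entropy(p):                        # (141), in bits
    return log2(sum(pi ** a for pi in p)) / (1 - a)

def arimoto_cond_entropy(P_X, W):            # (146), in bits
    P_Y = [sum(P_X[x] * W[x][y] for x in range(3)) for y in range(2)]
    s = sum(
        P_Y[y] * sum((P_X[x] * W[x][y] / P_Y[y]) ** a for x in range(3)) ** (1 / a)
        for y in range(2)
    )
    return a / (1 - a) * log2(s)

# (145): H_a(X) = log|X| - D_a(P_X || U_X)
D = lambda p, q: log2(sum(pi ** a * qi ** (1 - a) for pi, qi in zip(p, q))) / (a - 1)
assert abs(renyi_entropy(P_X) - (log2(3) - D(P_X, [1 / 3] * 3))) < 1e-9

# The definition (153) against the closed form (154)
I_def = renyi_entropy(P_X) - arimoto_cond_entropy(P_X, W)
Z = sum(p ** a for p in P_X)
I_closed = a / (a - 1) * log2(sum(
    sum(P_X[x] ** a / Z * W[x][y] ** a for x in range(3)) ** (1 / a)
    for y in range(2)
))
assert abs(I_def - I_closed) < 1e-9
```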

#### **7. Relations Between the Conditional Rényi Divergences and the Rényi Dependence Measures**

In this section, we first establish an ordering of the conditional Rényi divergences, where the order depends on whether $\alpha \in [0,1]$ or $\alpha \in [1,\infty]$. We then show that this implies the same order between the dependence measures derived from the conditional Rényi divergences. Finally, we remark that many of the dependence measures coincide when they are maximized over all PMFs $P_X$.

**Proposition 7.** *For all α* ∈ [0, ∞]*,*

$$D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) \le D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_{Y|X} | P_X).\tag{156}$$

**Proof.** This holds because

$$D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) = \min_{Q_X} D_\alpha(P_X P_{Y|X} \| Q_X Q_{Y|X})\tag{157}$$

$$\le D_\alpha(P_X P_{Y|X} \| P_X Q_{Y|X})\tag{158}$$

$$= D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_{Y|X} | P_X),\tag{159}$$

where (157) follows from Proposition 4, and (159) follows from the definition of $D_\alpha^{\mathrm{s}}(\cdot)$ in (54).

**Theorem 7.** *For all α* ∈ [0, 1]*,*

$$D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) \le D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_{Y|X} | P_X) \le D_\alpha^{\mathrm{c}}(P_{Y|X} \| Q_{Y|X} | P_X).\tag{160}$$

*For all α* ∈ [1, ∞]*,*

$$D_\alpha^{\mathrm{c}}(P_{Y|X} \| Q_{Y|X} | P_X) \le D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X) \le D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_{Y|X} | P_X).\tag{161}$$

**Proof.** For both $\alpha \in [0,1]$ and $\alpha \in [1,\infty]$, the relation $D_\alpha^{\mathrm{l}}(\cdot) \le D_\alpha^{\mathrm{s}}(\cdot)$ follows from Proposition 7.

We next show that $D_\alpha^{\mathrm{s}}(\cdot) \le D_\alpha^{\mathrm{c}}(\cdot)$ for $\alpha \in [0,1]$. We show this for $\alpha \in (0,1)$; the claim will then extend to $\alpha \in [0,1]$ by the continuity in $\alpha$ of $D_\alpha^{\mathrm{s}}(\cdot)$ and $D_\alpha^{\mathrm{c}}(\cdot)$ (Proposition 3 (c) and Proposition 2 (c)). For $\alpha \in (0,1)$,

$$(\alpha-1)\, D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_{Y|X} | P_X) = \log \sum_{x \in \operatorname{supp}(P_X)} P(x) \sum_{y} P(y|x)^{\alpha}\, Q(y|x)^{1-\alpha}\tag{162}$$

$$\ge \sum_{x \in \operatorname{supp}(P_X)} P(x) \log \sum_{y} P(y|x)^{\alpha}\, Q(y|x)^{1-\alpha}\tag{163}$$

$$= (\alpha-1)\, D_\alpha^{\mathrm{c}}(P_{Y|X} \| Q_{Y|X} | P_X),\tag{164}$$

where (162) follows from (55); (163) follows from Jensen's inequality because $\log(\cdot)$ is a concave function; and (164) follows from (30). The proof of the claim for $\alpha \in (0,1)$ is finished by dividing (162)–(164) by $\alpha - 1$, which reverses the inequality because $\alpha - 1 < 0$.

We conclude by showing that $D_\alpha^{\mathrm{c}}(\cdot) \le D_\alpha^{\mathrm{l}}(\cdot)$ for $\alpha \in [1,\infty]$. We show this for $\alpha \in (1,\infty)$; the claim will then extend to $\alpha \in [1,\infty]$ by the continuity of $D_\alpha^{\mathrm{c}}(\cdot)$ and $D_\alpha^{\mathrm{l}}(\cdot)$ in $\alpha$ (Proposition 2 (c) and Proposition 6 (c)). For $\alpha \in (1,\infty)$,

$$D_\alpha^{\mathrm{c}}(P_{Y|X} \| Q_{Y|X} | P_X) = \sum_{x \in \operatorname{supp}(P_X)} P(x)\, \frac{1}{\alpha-1} \log \sum_{y} P(y|x)^{\alpha}\, Q(y|x)^{1-\alpha}\tag{165}$$

$$= \frac{\alpha}{\alpha-1} \sum_{x \in \operatorname{supp}(P_X)} P(x) \log \left[\sum_{y} P(y|x)^{\alpha}\, Q(y|x)^{1-\alpha}\right]^{\frac{1}{\alpha}}\tag{166}$$

$$\le \frac{\alpha}{\alpha-1} \log \sum_{x \in \operatorname{supp}(P_X)} P(x) \left[\sum_{y} P(y|x)^{\alpha}\, Q(y|x)^{1-\alpha}\right]^{\frac{1}{\alpha}}\tag{167}$$

$$= D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_{Y|X} | P_X),\tag{168}$$

where (165) follows from (30); (167) follows from Jensen's inequality because log(·) is a concave function; and (168) follows from (78).
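As a numerical sanity check of Theorem 7, the following Python sketch evaluates the three conditional Rényi divergences from the power-sum forms in (30), (55), and (78) for arbitrary illustrative PMFs (assumptions, not from the text) and verifies both orderings:

```python
from math import log2

P_X = [0.6, 0.4]
P = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]   # P_{Y|X}
Q = [[0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]   # Q_{Y|X}

def inner(a, x):
    return sum(P[x][y] ** a * Q[x][y] ** (1 - a) for y in range(3))

def D_c(a):   # (30): average of the per-x Renyi divergences
    return sum(P_X[x] * log2(inner(a, x)) / (a - 1) for x in range(2))

def D_s(a):   # (55)
    return log2(sum(P_X[x] * inner(a, x) for x in range(2))) / (a - 1)

def D_l(a):   # (78)
    return a / (a - 1) * log2(sum(P_X[x] * inner(a, x) ** (1 / a) for x in range(2)))

eps = 1e-12
assert D_l(0.5) <= D_s(0.5) + eps and D_s(0.5) <= D_c(0.5) + eps   # (160)
assert D_c(2.0) <= D_l(2.0) + eps and D_l(2.0) <= D_s(2.0) + eps   # (161)
```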

**Corollary 4.** *For all α* ∈ [0, 1]*,*

$$J_\alpha(X;Y) \le I_\alpha^{\mathrm{s}}(X;Y) \le I_\alpha^{\mathrm{c}}(X;Y).\tag{169}$$

*For all α* ∈ [1, ∞]*,*

$$I_\alpha^{\mathrm{c}}(X;Y) \le J_\alpha(X;Y) \le I_\alpha^{\mathrm{s}}(X;Y).\tag{170}$$

**Proof.** By (34), (60), and Proposition 5, respectively,

$$I_\alpha^{\mathrm{c}}(X;Y) = \min_{Q_Y} D_\alpha^{\mathrm{c}}(P_{Y|X} \| Q_Y | P_X),\tag{171}$$

$$I_\alpha^{\mathrm{s}}(X;Y) = \min_{Q_Y} D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_Y | P_X),\tag{172}$$

$$J_\alpha(X;Y) = \min_{Q_Y} D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_Y | P_X).\tag{173}$$

The corollary now follows from (171)–(173) and Theorem 7.

Despite $I_\alpha^{\mathrm{c}}(X;Y)$, $I_\alpha^{\mathrm{s}}(X;Y)$, $I_\alpha^{\mathrm{a}}(X;Y)$, and $J_\alpha(X;Y)$ being different measures, they often coincide when maximized over all PMFs $P_X$:

**Theorem 8.** *For every conditional PMF $P_{Y|X}$ and every $\alpha \in (0,1) \cup (1,\infty)$,*

$$\max_{P_X} I_\alpha^{\mathrm{c}}(P_X, P_{Y|X}) = \max_{P_X} I_\alpha^{\mathrm{s}}(P_X, P_{Y|X})\tag{174}$$

$$= \max_{P_X} I_\alpha^{\mathrm{a}}(P_X, P_{Y|X}).\tag{175}$$

*In addition, for every conditional PMF $P_{Y|X}$ and every $\alpha \in [\frac{1}{2}, 1) \cup (1, \infty)$,*

$$\max_{P_X} J_\alpha(P_X, P_{Y|X}) = \max_{P_X} I_\alpha^{\mathrm{s}}(P_X, P_{Y|X}).\tag{176}$$

*For $\alpha \in (0, \frac{1}{2})$, the situation is different: there exists a conditional PMF $P_{Y|X}$ such that, for every $\alpha \in (0, \frac{1}{2})$,*

$$\max_{P_X} J_\alpha(P_X, P_{Y|X}) < \max_{P_X} I_\alpha^{\mathrm{s}}(P_X, P_{Y|X}).\tag{177}$$

**Proof.** Equation (174) follows from [9] (Proposition 1); (175) follows from [12] (Lemma 1); and (176) follows from [38] (Theorem V.1) for *α* ∈ (1, ∞).

We next establish (176) for $\alpha \in [\frac{1}{2}, 1)$. Observe that, for $\alpha \in [\frac{1}{2}, 1)$, (176) is equivalent to

$$\max_{P_X} -2^{\frac{\alpha-1}{\alpha} J_\alpha(P_X, P_{Y|X})} = \max_{P_X} -2^{\frac{\alpha-1}{\alpha} I_\alpha^{\mathrm{s}}(P_X, P_{Y|X})}.\tag{178}$$

For $\alpha \in [\frac{1}{2}, 1)$, (178) holds because

$$\max_{P_X} -2^{\frac{\alpha-1}{\alpha} J_\alpha(P_X, P_{Y|X})} = \max_{P_X} \min_{Q_Y} -2^{\frac{\alpha-1}{\alpha} D_\alpha^{\mathrm{l}}(P_{Y|X} \| Q_Y | P_X)}\tag{179}$$

$$= -\min_{P_X} \max_{Q_Y} \sum_{x} P_X(x) \left[\sum_{y} P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}}\tag{180}$$

$$= -\max_{Q_Y} \min_{P_X} \sum_{x} P_X(x) \left[\sum_{y} P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}}\tag{181}$$

$$= -\max_{Q_Y} \min_{x} \left[\sum_{y} P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}}\tag{182}$$

$$= -\left[\max_{Q_Y} \min_{x} \sum_{y} P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}}\tag{183}$$

$$= -\left[\max_{Q_Y} \min_{P_X} \sum_{x} P_X(x) \sum_{y} P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}}\tag{184}$$

$$= -\left[\min_{P_X} \max_{Q_Y} \sum_{x} P_X(x) \sum_{y} P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}}\tag{185}$$

$$= -\min_{P_X} \max_{Q_Y} \left[\sum_{x} P_X(x) \sum_{y} P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}}\tag{186}$$

$$= \max_{P_X} \min_{Q_Y} -2^{\frac{\alpha-1}{\alpha} D_\alpha^{\mathrm{s}}(P_{Y|X} \| Q_Y | P_X)}\tag{187}$$

$$= \max_{P_X} -2^{\frac{\alpha-1}{\alpha} I_\alpha^{\mathrm{s}}(P_X, P_{Y|X})},\tag{188}$$

where (179) follows from Proposition 5; (180) follows from (78); (181) and (185) follow from a minimax theorem and are justified below; (187) follows from (55); and (188) follows from (60).

To justify (181), we apply the minimax theorem [46] (Corollary 37.3.2) to the function $f \colon \mathcal{P}(\mathcal{Y}) \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}$,

$$f(Q_Y, P_X) = \sum_{x} P_X(x) \left[\sum_{y} P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}}.\tag{189}$$

The sets of all PMFs over $\mathcal{X}$ and over $\mathcal{Y}$ are convex and compact; the function $f$ is jointly continuous in the pair $(Q_Y, P_X)$ because it is a composition of continuous functions; for every $Q_Y \in \mathcal{P}(\mathcal{Y})$, the function $f$ is linear and hence convex in $P_X$; and it only remains to show that the function $f$ is concave in $Q_Y$ for every $P_X \in \mathcal{P}(\mathcal{X})$. Indeed, for every $\lambda, \lambda' \in [0,1]$ with $\lambda + \lambda' = 1$, every $Q_Y, Q_Y' \in \mathcal{P}(\mathcal{Y})$, and every $P_X \in \mathcal{P}(\mathcal{X})$,

$$f(\lambda Q_Y + \lambda' Q_Y', P_X)\tag{190}$$

$$= \sum_{x} P_X(x) \left[\sum_{y} P(y|x)^{\alpha} \left[\lambda\, Q_Y(y) + \lambda'\, Q_Y'(y)\right]^{1-\alpha}\right]^{\frac{1}{\alpha}}\tag{191}$$

$$= \sum_{x} P_X(x) \left[\sum_{y} \left[\lambda\, P(y|x)^{\frac{\alpha}{1-\alpha}} Q_Y(y) + \lambda'\, P(y|x)^{\frac{\alpha}{1-\alpha}} Q_Y'(y)\right]^{1-\alpha}\right]^{\frac{1}{1-\alpha} \cdot \frac{1-\alpha}{\alpha}}\tag{192}$$

$$\ge \sum_{x} P_X(x) \left\{\left[\sum_{y} \left[\lambda\, P(y|x)^{\frac{\alpha}{1-\alpha}} Q_Y(y)\right]^{1-\alpha}\right]^{\frac{1}{1-\alpha}} + \left[\sum_{y} \left[\lambda'\, P(y|x)^{\frac{\alpha}{1-\alpha}} Q_Y'(y)\right]^{1-\alpha}\right]^{\frac{1}{1-\alpha}}\right\}^{\frac{1-\alpha}{\alpha}}\tag{193}$$

$$= \sum_{x} P_X(x) \left\{\lambda \left[\sum_{y} P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{1-\alpha}} + \lambda' \left[\sum_{y} P(y|x)^{\alpha}\, Q_Y'(y)^{1-\alpha}\right]^{\frac{1}{1-\alpha}}\right\}^{\frac{1-\alpha}{\alpha}}\tag{194}$$

$$\ge \sum_{x} P_X(x) \left\{\lambda \left[\sum_{y} P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}} + \lambda' \left[\sum_{y} P(y|x)^{\alpha}\, Q_Y'(y)^{1-\alpha}\right]^{\frac{1}{\alpha}}\right\}\tag{195}$$

$$= \lambda f(Q_Y, P_X) + \lambda' f(Q_Y', P_X),\tag{196}$$

where (193) follows from the reverse Minkowski inequality [16] (III 2.4 Theorem 9) because $\alpha \in [\frac{1}{2}, 1)$; and (195) holds because the function $z \mapsto z^{(1-\alpha)/\alpha}$ is concave for $\alpha \in [\frac{1}{2}, 1)$.

The justification of (185) is very similar to that of (181); here, we apply the minimax theorem to the function $g \colon \mathcal{P}(\mathcal{Y}) \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}$,

$$g(Q_Y, P_X) = \sum_{x} P_X(x) \sum_{y} P(y|x)^{\alpha}\, Q_Y(y)^{1-\alpha}.\tag{197}$$

Compared to the justification of (181), the only essential difference lies in showing that the function $g$ is concave in $Q_Y$ for every $P_X \in \mathcal{P}(\mathcal{X})$: here, this follows easily from the concavity of the function $z \mapsto z^{1-\alpha}$ for $\alpha \in [\frac{1}{2}, 1)$.

We conclude the proof by establishing (177). Let $\mathcal{X} = \mathcal{Y} = \{0, 1\}$, and let the conditional PMF $P_{Y|X}$ be given by $P_{Y|X}(y|x) = \mathbb{1}\{y = x\}$. (This corresponds to a binary noiseless channel.) Then, denoting by $U_X$ the uniform distribution over $\mathcal{X}$,

$$\max_{P_X} I_\alpha^{\mathrm{s}}(P_X, P_{Y|X}) \ge I_\alpha^{\mathrm{s}}(U_X, P_{Y|X})\tag{198}$$

$$=\log 2,\tag{199}$$

where (199) follows from (61). On the other hand, for every *<sup>α</sup>* <sup>∈</sup> (0, <sup>1</sup> 2 ) and every PMF *PX*,

$$J_\alpha(P_X, P_{Y|X}) = \frac{\alpha}{1-\alpha} H_\infty(P_X)\tag{200}$$

$$\le \frac{\alpha}{1-\alpha} \log 2\tag{201}$$

$$< \log 2,\tag{202}$$

where (200) follows from [3] (Lemma 11); (201) follows from (144); and (202) holds because $\alpha \in (0, \frac{1}{2})$. Inequality (177) now follows from (199) and (202).

#### **8. Horse Betting**

In this section, we analyze horse betting with a gambler investing all her money. Recall from the introduction that the winning horse $X$ is distributed according to the PMF $p$, where we assume $p(x) > 0$ for all $x \in \mathcal{X}$; that the odds offered by the bookmaker are denoted by $o \colon \mathcal{X} \to (0, \infty)$; that the fraction of her wealth that the gambler bets on Horse $x \in \mathcal{X}$ is denoted $b(x) \ge 0$; that the wealth relative is the random variable $S \triangleq b(X)\,o(X)$; and that we seek betting strategies that maximize the utility function

$$U_\beta \triangleq \begin{cases} \frac{1}{\beta} \log \mathbb{E}[S^\beta] & \text{if } \beta \neq 0,\\ \mathbb{E}[\log S] & \text{if } \beta = 0.\end{cases}\tag{203}$$

Because the gambler invests all her money, *b* is a PMF. As in [47] (Section 10.3), define the constant

$$c \triangleq \left[\sum\_{\mathbf{x}} \frac{1}{o(\mathbf{x})}\right]^{-1} \tag{204}$$

and the PMF

$$r(\mathbf{x}) \stackrel{\Delta}{=} \frac{c}{o(\mathbf{x})}.\tag{205}$$

Using these definitions, the utility function $U_\beta$ can be decomposed as follows:

**Theorem 9.** *Let β* ∈ (−∞, 1)*, and let b be a PMF. Then,*

$$U_\beta = \log c + D_{\frac{1}{1-\beta}}(p\|r) - D_{1-\beta}(g^{(\beta)}\|b),\tag{206}$$

*where the PMF $g^{(\beta)}$ is given by*

$$g^{(\beta)}(x) \triangleq \frac{p(x)^{\frac{1}{1-\beta}}\, o(x)^{\frac{\beta}{1-\beta}}}{\sum_{x' \in \mathcal{X}} p(x')^{\frac{1}{1-\beta}}\, o(x')^{\frac{\beta}{1-\beta}}}.\tag{207}$$

*Thus, choosing $b = g^{(\beta)}$ uniquely maximizes $U_\beta$ among all PMFs $b$.*

The three terms in (206) can be interpreted as follows:

1. The first term, $\log c$, depends only on the odds. It is the utility of the risk-free strategy $b(x) = c/o(x)$, which guarantees $S = c$ irrespective of the winning horse (see Proposition 8 below).
2. The second term, $D_{\frac{1}{1-\beta}}(p\|r)$, quantifies by how much the odds deviate from the winning probabilities; by Theorem 9, it is the gain over $\log c$ that a gambler who knows $p$ secures by betting $b = g^{(\beta)}$.
3. The third term, $-D_{1-\beta}(g^{(\beta)}\|b)$, is related to the gambler's estimate of the winning probabilities. It is zero if and only if $b$ is equal to $g^{(\beta)}$.

**Remark 4.** *For β* = 0*,* (206) *reduces to the following decomposition of the doubling rate* E[log *S*]*:*

$$\mathbb{E}[\log S] = \log c + D(p\|r) - D(p\|b).\tag{208}$$

*(This decomposition appeared previously in [47] (Section 10.3).) Equation* (208) *implies that the doubling rate is maximized by proportional gambling, i.e., that* E[log *S*] *is maximized if and only if b is equal to p.*

**Remark 5.** *Considering the limits $\beta \to -\infty$ and $\beta \uparrow 1$, the PMF $g^{(\beta)}$ satisfies, for every $x \in \mathcal{X}$,*

$$\lim\_{\beta \to -\infty} g^{(\beta)}(\mathbf{x}) = \frac{c}{o(\mathbf{x})},\tag{209}$$

$$\lim_{\beta \uparrow 1} g^{(\beta)}(x) = \frac{p(x)\, \mathbb{1}\{x \in \mathcal{S}\}}{\sum_{x' \in \mathcal{X}} p(x')\, \mathbb{1}\{x' \in \mathcal{S}\}},\tag{210}$$

*where the set $\mathcal{S}$ is defined as $\mathcal{S} \triangleq \{x' \in \mathcal{X} : p(x')o(x') = \max_x [p(x)o(x)]\}$. It follows from Proposition 8 below that the RHS of* (209) *is the unique maximizer of $\lim_{\beta \to -\infty} U_\beta$; and it follows from the proof of Proposition 9 below that the RHS of* (210) *is a maximizer (not necessarily unique) of $U_1$.*

**Proof of Remark 5.** Recall that we assume $p(x) > 0$ for every $x \in \mathcal{X}$. Then, (209) follows from (207) and the definition of $c$ in (204). To establish (210), define $\tau \triangleq \max_x [p(x)o(x)]$ and observe that, for every $x \in \mathcal{X}$,

$$\lim_{\beta \uparrow 1} g^{(\beta)}(x) = \lim_{\beta \uparrow 1} \frac{p(x) \left[p(x)o(x)/\tau\right]^{\frac{\beta}{1-\beta}}}{\sum_{x' \in \mathcal{X}} p(x') \left[p(x')o(x')/\tau\right]^{\frac{\beta}{1-\beta}}}\tag{211}$$

$$=\frac{p(\mathbf{x})\,\mathbb{1}\{\mathbf{x}\in\mathcal{S}\}}{\sum\_{\mathbf{x'}\in\mathcal{X}}p(\mathbf{x'})\,\mathbb{1}\{\mathbf{x'}\in\mathcal{S}\}}\,\tag{212}$$

where (211) follows from (207) and some algebra; and (212) is justified as follows: if $x \in \mathcal{S}$, then $\left[p(x)o(x)/\tau\right]^{\beta/(1-\beta)}$ equals one; and if $x \notin \mathcal{S}$, then $\left[p(x)o(x)/\tau\right]^{\beta/(1-\beta)}$ tends to zero as $\beta \uparrow 1$ because $p(x)o(x)/\tau < 1$ and because $\lim_{\beta \uparrow 1} \frac{\beta}{1-\beta} = +\infty$.

**Remark 6.** *Using the definition in* (24) *for the Rényi divergence of negative orders, it is not difficult to see from the proof of Theorem 9 below that* (206) *also holds for $\beta > 1$. However, because the Rényi divergence of negative orders is nonpositive instead of nonnegative, the above interpretation is not valid anymore; in particular, for $\beta > 1$, choosing $b = g^{(\beta)}$ is in general not optimal.*

**Proof of Theorem 9.** We first show the maximization claim. The only term on the RHS of (206) that depends on $b$ is $-D_{1-\beta}(g^{(\beta)}\|b)$. Because $1 - \beta > 0$, this term is maximized if and only if $b = g^{(\beta)}$ (Proposition 1 (a)).

We now establish (206) for *β* ∈ (−∞, 0) ∪ (0, 1); we omit the proof for *β* = 0, which can be found in [47] (Section 10.3). For *β* ∈ (−∞, 0) ∪ (0, 1),

$$U_\beta = \frac{1}{\beta} \log \sum_{x} p(x)\, b(x)^{\beta} o(x)^{\beta}.\tag{213}$$

For every *x* ∈ X ,

$$p(x)\, b(x)^{\beta} o(x)^{\beta} = \left[\sum_{x' \in \mathcal{X}} p(x')^{\frac{1}{1-\beta}}\, o(x')^{\frac{\beta}{1-\beta}}\right]^{1-\beta} g^{(\beta)}(x)^{1-\beta}\, b(x)^{\beta},\tag{214}$$

which follows from (207). Now, (206) holds because

$$U_\beta = \frac{1-\beta}{\beta} \log \sum_{x' \in \mathcal{X}} p(x')^{\frac{1}{1-\beta}}\, o(x')^{\frac{\beta}{1-\beta}} + \frac{1}{\beta} \log \sum_{x} g^{(\beta)}(x)^{1-\beta}\, b(x)^{\beta}\tag{215}$$

$$= \frac{1-\beta}{\beta} \log \sum_{x' \in \mathcal{X}} p(x')^{\frac{1}{1-\beta}}\, o(x')^{\frac{\beta}{1-\beta}} - D_{1-\beta}(g^{(\beta)}\|b)\tag{216}$$

$$= \log c + \frac{1-\beta}{\beta} \log \sum_{x' \in \mathcal{X}} p(x')^{\frac{1}{1-\beta}}\, r(x')^{\frac{-\beta}{1-\beta}} - D_{1-\beta}(g^{(\beta)}\|b)\tag{217}$$

$$= \log c + D_{\frac{1}{1-\beta}}(p\|r) - D_{1-\beta}(g^{(\beta)}\|b),\tag{218}$$

where (215) follows from (213) and (214); (216) follows from identifying the Rényi divergence (recall that *g* (*β*) and *b* are PMFs); (217) follows from (205); and (218) follows from identifying the Rényi divergence (recall that *r* is a PMF).
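The decomposition (206) lends itself to a direct numerical check. The following Python sketch (with illustrative $p$, $o$, $b$, and $\beta$, which are assumptions for this example; divergences in bits) verifies (206) and the optimality of $b = g^{(\beta)}$:

```python
from math import log2

p = [0.5, 0.3, 0.2]    # winning probabilities
o = [2.0, 4.0, 8.0]    # odds
b = [0.4, 0.4, 0.2]    # the gambler's bets (a PMF)
beta = 0.5

def renyi(a, P, Q):
    return log2(sum(pi ** a * qi ** (1 - a) for pi, qi in zip(P, Q))) / (a - 1)

U = log2(sum(p[x] * (b[x] * o[x]) ** beta for x in range(3))) / beta   # (203)

c = 1 / sum(1 / ox for ox in o)                                        # (204)
r = [c / ox for ox in o]                                               # (205)
w = [p[x] ** (1 / (1 - beta)) * o[x] ** (beta / (1 - beta)) for x in range(3)]
g = [wx / sum(w) for wx in w]                                          # (207)

# Decomposition (206)
assert abs(U - (log2(c) + renyi(1 / (1 - beta), p, r) - renyi(1 - beta, g, b))) < 1e-9

# Betting b = g^(beta) maximizes U_beta: the last divergence then vanishes.
U_opt = log2(sum(p[x] * (g[x] * o[x]) ** beta for x in range(3))) / beta
assert abs(U_opt - (log2(c) + renyi(1 / (1 - beta), p, r))) < 1e-9
assert U <= U_opt + 1e-9
```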

The rest of the section presents the cases *β* → −∞, *β* ≥ 1, and *β* → +∞.

**Proposition 8.** *Let b be a PMF. Then,*

$$\lim_{\beta \to -\infty} U_\beta = \log \min_{x} \left[b(x)o(x)\right]\tag{219}$$

$$\le \log c.\tag{220}$$

*Inequality* (220) *holds with equality if and only if b*(*x*) = *c*/*o*(*x*) *for all x* ∈ X *.*

Observe that if *b*(*x*) = *c*/*o*(*x*) for all *x* ∈ X , then *S* = *c* with probability one, i.e., *S* does not depend on the winning horse.

**Proof of Proposition 8.** Equation (219) holds because

$$\lim_{\beta \to -\infty} U_\beta = \lim_{\beta \to -\infty} \log \left[\sum_{x} p(x) \left(b(x)o(x)\right)^{\beta}\right]^{\frac{1}{\beta}}\tag{221}$$

$$=\log\min\_{\mathbf{x}}\left[b(\mathbf{x})o(\mathbf{x})\right],\tag{222}$$

where (222) holds because, in the limit as *β* tends to −∞, the power mean tends to the minimum (since *p* is a PMF with *p*(*x*) > 0 for all *x* ∈ X [15] (Chapter 8)).

We show (220) by contradiction. Assume that there exists a PMF *b* that does not satisfy (220), thus

$$b(\mathbf{x})o(\mathbf{x}) > c \tag{223}$$

for all *x* ∈ X . Then,

$$1 = \sum\_{\mathbf{x}} b(\mathbf{x})\tag{224}$$

$$> \sum_{x} \frac{c}{o(x)}\tag{225}$$

$$= 1,\tag{226}$$

where (224) holds because *b* is a PMF; (225) follows from (223); and (226) follows from the definition of *c* in (204). Because 1 > 1 is impossible, such a *b* cannot exist, which establishes (220).

It is not difficult to see that (220) holds with equality if *b*(*x*) = *c*/*o*(*x*) for all *x* ∈ X . We therefore focus on establishing that if (220) holds with equality, then *b*(*x*) = *c*/*o*(*x*) for all *x* ∈ X . Observe first that, if (220) holds with equality, then, for all *x* ∈ X ,

$$b(\mathbf{x})o(\mathbf{x}) \ge \mathbf{c}.\tag{227}$$

We now claim that (227) holds with equality for all $x \in \mathcal{X}$. Indeed, if this were not the case, then there would exist an $x' \in \mathcal{X}$ for which $b(x')o(x') > c$, thus (224)–(226) would hold, which would lead to a contradiction. Hence, if (220) holds with equality, then $b(x) = c/o(x)$ for all $x \in \mathcal{X}$.
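As a numerical illustration of Proposition 8 (the market $p$, $o$ below is an assumption chosen for this sketch): betting $b(x) = c/o(x)$ yields $U_\beta = \log c$ for every $\beta$, while any other bet falls below $\log c$ in the $\beta \to -\infty$ limit:

```python
from math import log2

p = [0.5, 0.3, 0.2]
o = [2.0, 4.0, 8.0]
c = 1 / sum(1 / ox for ox in o)            # (204)

def U(beta, b):
    return log2(sum(p[x] * (b[x] * o[x]) ** beta for x in range(3))) / beta

b_safe = [c / ox for ox in o]              # the unique maximizer in (220)
for beta in (-5.0, -1.0, 0.5):
    assert abs(U(beta, b_safe) - log2(c)) < 1e-9   # S = c with probability one

b_other = [0.5, 0.3, 0.2]
# (219): for large negative beta, U_beta approaches log min_x [b(x) o(x)]
assert abs(U(-200.0, b_other) - log2(min(b_other[x] * o[x] for x in range(3)))) < 0.05
assert U(-200.0, b_other) < log2(c)        # (220) is strict for b != c/o
```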

**Proposition 9.** *Let β* ≥ 1*, and let b be a PMF. Then,*

$$U_\beta \le \log \max_{x} \left[p(x)^{1/\beta} o(x)\right].\tag{228}$$

*Equality in* (228) *can be achieved by choosing $b(x) = \mathbb{1}\{x = x'\}$ for some $x' \in \mathcal{X}$ satisfying*

$$p(\mathbf{x'})^{1/\beta}o(\mathbf{x'}) = \max\_{\mathbf{x}} \left[ p(\mathbf{x})^{1/\beta}o(\mathbf{x}) \right]. \tag{229}$$

**Remark 7.** *Proposition 9 implies that if β* ≥ 1*, then it is optimal to bet on a single horse. Unless* |X | = 1*, this is not the case when β* < 1*: When β* < 1*, an optimal betting strategy requires placing a bet on every horse. This follows from Theorem 9 and our assumption that p*(*x*) *and o*(*x*) *are all positive.*

**Proof of Proposition 9.** Inequality (228) holds because

$$U_\beta = \frac{1}{\beta} \log \sum_{x} p(x)\, b(x)^{\beta} o(x)^{\beta}\tag{230}$$

$$\leq \frac{1}{\beta} \log \sum\_{\mathbf{x}} p(\mathbf{x}) b(\mathbf{x}) o(\mathbf{x})^{\beta} \tag{231}$$

$$\leq \frac{1}{\beta} \log \sum\_{\mathbf{x}} b(\mathbf{x}) \cdot \max\_{\mathbf{x'} \in \mathcal{X}} \left[ p(\mathbf{x'}) o(\mathbf{x'})^{\beta} \right] \tag{232}$$

$$= \frac{1}{\beta} \log \max\_{\mathbf{x'} \in \mathcal{X}} \left[ p(\mathbf{x'}) o(\mathbf{x'})^\beta \right] \tag{233}$$

$$=\log\max\_{\mathbf{x}'\in\mathcal{X}}\left[p(\mathbf{x}')^{1/\beta}o(\mathbf{x}')\right],\tag{234}$$

where (231) holds because *b*(*x*) ∈ [0, 1] and *β* ≥ 1, and (233) holds because *b* is a PMF. It is not difficult to see that (228) holds with equality if *b*(*x*) = 𝟙{*x* = *x*′} for some *x*′ ∈ X satisfying (229).
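Proposition 9 and its equality condition (229) are easy to check numerically. The following sketch uses a hypothetical three-horse race (the values of *p* and *o* are illustrative assumptions, not from the paper):

```python
import math, random

# Hypothetical three-horse race (illustrative values, not from the paper).
p = [0.5, 0.3, 0.2]          # winning probabilities
o = [1.5, 3.0, 6.0]          # odds offered by the bookmaker

def U(beta, b):
    """Utility (1/beta) * log E[S^beta] for beta != 0, with S = b(X) o(X)."""
    return math.log(sum(pi * (bi * oi) ** beta for pi, bi, oi in zip(p, b, o))) / beta

beta = 2.0                   # any beta >= 1
bound = math.log(max(pi ** (1 / beta) * oi for pi, oi in zip(p, o)))  # RHS of (228)

# Betting everything on a horse x' achieving the max in (229) attains the bound.
x_star = max(range(3), key=lambda x: p[x] ** (1 / beta) * o[x])
b_star = [1.0 if x == x_star else 0.0 for x in range(3)]
assert abs(U(beta, b_star) - bound) < 1e-12

# Randomly drawn strategies never exceed the bound (228).
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(3)]
    b = [wi / sum(w) for wi in w]
    assert U(beta, b) <= bound + 1e-12
```

The single-horse optimum for *β* ≥ 1 contrasts with the diversified strategy (207) that is optimal for *β* < 1.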

**Proposition 10.** *Let b be a PMF. Then,*

$$\lim\_{\beta \to +\infty} \mathcal{U}\_{\beta} = \log \max\_{\mathbf{x}} \left[ b(\mathbf{x}) o(\mathbf{x}) \right] \tag{235}$$

$$\le \log \max\_{\mathbf{x}} o(\mathbf{x}). \tag{236}$$

*Equality in* (236) *can be achieved by choosing b*(*x*) = 𝟙{*x* = *x*′} *for some x*′ ∈ X *satisfying*

$$o(\mathbf{x}') = \max\_{\mathbf{x}} o(\mathbf{x}).\tag{237}$$

**Proof.** Equation (235) holds because

$$\lim\_{\beta \to +\infty} \mathcal{U}\_{\beta} = \lim\_{\beta \to +\infty} \log \left[ \sum\_{\mathbf{x}} p(\mathbf{x}) \left( b(\mathbf{x}) o(\mathbf{x}) \right)^{\beta} \right]^{\frac{1}{\beta}} \tag{238}$$

$$=\log\max\_{\mathbf{x}}\left[b(\mathbf{x})o(\mathbf{x})\right],\tag{239}$$

where (239) holds because in the limit as *β* tends to +∞, the power mean tends to the maximum (since *p* is a PMF with *p*(*x*) > 0 for all *x* ∈ X [15] (Chapter 8)). Inequality (236) holds because *b*(*x*) ≤ 1 for all *x* ∈ X . It is not difficult to see that (236) holds with equality if *b*(*x*) = 𝟙{*x* = *x*′} for some *x*′ ∈ X satisfying (237).
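The power-mean limit in (235) can be observed numerically. A minimal sketch, assuming a hypothetical race and betting strategy (the values of *p*, *o*, and *b* are illustrative only):

```python
import math

# Hypothetical race and strategy (illustrative values, not from the paper).
p = [0.5, 0.3, 0.2]
o = [1.5, 3.0, 6.0]
b = [0.2, 0.3, 0.5]

def U(beta):
    """Utility (1/beta) * log E[(b(X) o(X))^beta]."""
    return math.log(sum(pi * (bi * oi) ** beta for pi, bi, oi in zip(p, b, o))) / beta

limit = math.log(max(bi * oi for bi, oi in zip(b, o)))  # RHS of (235)

# The power mean converges to the maximum as beta grows; the gap is O(1/beta).
for beta in (10.0, 50.0, 200.0):
    assert abs(U(beta) - limit) < 10.0 / beta
```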

#### **9. Horse Betting with Side Information**

In this section, we study the horse betting problem where the gambler observes some side information *Y* before placing her bets. This setting leads to the conditional Rényi divergence *D*<sup>l</sup>*α*(·) discussed in Section 5 (see Theorem 10). In addition, it provides a new operational meaning to the dependence measure *Jα*(*X*;*Y*) (see Theorem 11).

We adapt our notation as follows: The joint PMF of *X* and *Y* is denoted *pXY*. (Recall that *X* denotes the winning horse.) We drop the assumption that the winning probabilities *p*(*x*) are positive, but we assume that *p*(*y*) > 0 for all *y* ∈ Y. We continue to assume that the gambler invests all her wealth, so a betting strategy is now a conditional PMF *bX*|*Y*, and the wealth relative *S* is

$$S \stackrel{\triangle}{=} b(X|Y)o(X). \tag{240}$$

As in Section 8, define the constant

$$c \triangleq \left[\sum\_{\mathbf{x}} \frac{1}{o(\mathbf{x})}\right]^{-1} \tag{241}$$

and the PMF

$$r\_X(\mathbf{x}) \triangleq \frac{c}{o(\mathbf{x})}.\tag{242}$$

The following decomposition of the utility function *Uβ* parallels that of Theorem 9:

**Theorem 10.** *Let β* ∈ (−∞, 1)*. Then,*

$$\mathcal{U}\_{\beta} = \log c + D\_{\frac{1}{1-\beta}}^{\mathrm{l}}\big(p\_{X|Y} \| r\_X \big| p\_Y\big) - D\_{1-\beta}\big(g\_{X|Y}^{(\beta)} g\_Y^{(\beta)} \big\| b\_{X|Y} g\_Y^{(\beta)}\big), \tag{243}$$

*where the conditional PMF g*(*β*) *X*|*Y and the PMF g*(*β*) *Y are given by*

$$g\_{X|Y}^{(\beta)}(\mathbf{x}|y) \triangleq \frac{p(\mathbf{x}|y)^{\frac{1}{1-\beta}} o(\mathbf{x})^{\frac{\beta}{1-\beta}}}{\sum\_{\mathbf{x'}} p(\mathbf{x'}|y)^{\frac{1}{1-\beta}} o(\mathbf{x'})^{\frac{\beta}{1-\beta}}}\,, \tag{244}$$

$$g\_{Y}^{(\beta)}(y) \triangleq \frac{p(y) \left[\sum\_{\mathbf{x'}} p(\mathbf{x'}|y)^{\frac{1}{1-\beta}} o(\mathbf{x'})^{\frac{\beta}{1-\beta}}\right]^{1-\beta}}{\sum\_{y'} p(y') \left[\sum\_{\mathbf{x'}} p(\mathbf{x'}|y')^{\frac{1}{1-\beta}} o(\mathbf{x'})^{\frac{\beta}{1-\beta}}\right]^{1-\beta}}. \tag{245}$$

*Thus, choosing bX*|*Y* = *g*(*β*)*X*|*Y uniquely maximizes Uβ among all conditional PMFs bX*|*Y.*

**Proof.** We first show that *Uβ* is uniquely maximized by *g*(*β*)*X*|*Y*. The only term on the RHS of (243) that depends on *bX*|*Y* is −*D*1−*β*(*g*(*β*)*X*|*Y g*(*β*)*Y* ‖ *bX*|*Yg*(*β*)*Y*). Because 1 − *β* > 0, this term is maximized if and only if *bX*|*Yg*(*β*)*Y* = *g*(*β*)*X*|*Y g*(*β*)*Y* (Proposition 1 (a)). By our assumptions that *p*(*y*) > 0 for all *y* ∈ Y and *o*(*x*) > 0 for all *x* ∈ X , we have *g*(*β*)*Y*(*y*) > 0 for all *y* ∈ Y. Consequently, *bX*|*Yg*(*β*)*Y* = *g*(*β*)*X*|*Y g*(*β*)*Y* if and only if *bX*|*Y* = *g*(*β*)*X*|*Y*.

Consider now (243) for *β* = 0. For *β* = 0, (243) reduces to

$$\mathbb{E}[\log S] = \log c + D(p\_{X|Y} p\_Y \| r\_X p\_Y) - D(p\_{X|Y} p\_Y \| b\_{X|Y} p\_Y), \tag{246}$$

and some algebra reveals that (246) holds.

We conclude with establishing (243) for *β* ∈ (−∞, 0) ∪ (0, 1). For *β* ∈ (−∞, 0) ∪ (0, 1),

$$\mathcal{U}\_{\beta} = \frac{1}{\beta} \log \sum\_{\mathbf{x}, y} p(\mathbf{x}, y) b(\mathbf{x}|y)^{\beta} o(\mathbf{x})^{\beta}. \tag{247}$$

For every *x* ∈ X and every *y* ∈ Y,

$$p(\mathbf{x}, y) b(\mathbf{x}|y)^{\beta} o(\mathbf{x})^{\beta} = \sum\_{y' \in \mathcal{Y}} p(y') \left[ \sum\_{\mathbf{x}' \in \mathcal{X}} p(\mathbf{x}'|y')^{\frac{1}{1-\beta}} o(\mathbf{x}')^{\frac{\beta}{1-\beta}} \right]^{1-\beta} \cdot g\_Y^{(\beta)}(y)\, g\_{X|Y}^{(\beta)}(\mathbf{x}|y)^{1-\beta} b(\mathbf{x}|y)^{\beta}, \tag{248}$$

which follows from (244) and (245). Now, (243) holds because

$$\begin{split} \mathcal{U}\_{\beta} &= \frac{1}{\beta} \log \sum\_{y' \in \mathcal{Y}} p(y') \left[ \sum\_{x' \in \mathcal{X}} p(x'|y')^{\frac{1}{1-\beta}} o(x')^{\frac{\beta}{1-\beta}} \right]^{1-\beta} \\ &+ \frac{1}{\beta} \log \sum\_{x,y} \left[ g\_{X|Y}^{(\beta)}(x|y) g\_Y^{(\beta)}(y) \right]^{1-\beta} \left[ b(x|y) g\_Y^{(\beta)}(y) \right]^{\beta} \end{split} \tag{249}$$

$$= \frac{1}{\beta} \log \sum\_{y' \in \mathcal{Y}} p(y') \left[ \sum\_{\mathbf{x}' \in \mathcal{X}} p(\mathbf{x}'|y')^{\frac{1}{1-\beta}} o(\mathbf{x}')^{\frac{\beta}{1-\beta}} \right]^{1-\beta} - D\_{1-\beta}\big(g\_{X|Y}^{(\beta)} g\_{Y}^{(\beta)} \big\| b\_{X|Y} g\_{Y}^{(\beta)}\big) \tag{250}$$

$$= \log c + \frac{1}{\beta} \log \sum\_{y' \in \mathcal{Y}} p(y') \left[ \sum\_{\mathbf{x}' \in \mathcal{X}} p(\mathbf{x}'|y')^{\frac{1}{1-\beta}} r\_X(\mathbf{x}')^{\frac{-\beta}{1-\beta}} \right]^{1-\beta} - D\_{1-\beta}\big(g\_{X|Y}^{(\beta)} g\_{Y}^{(\beta)} \big\| b\_{X|Y} g\_{Y}^{(\beta)}\big) \tag{251}$$

$$= \log c + D\_{\frac{1}{1-\beta}}^{\mathrm{l}}\big(p\_{X|Y} \| r\_X \big| p\_Y\big) - D\_{1-\beta}\big(g\_{X|Y}^{(\beta)} g\_Y^{(\beta)} \big\| b\_{X|Y} g\_Y^{(\beta)}\big), \tag{252}$$

where (249) follows from (247) and (248) and from the fact that *g*(*β*)*Y*(*y*) = *g*(*β*)*Y*(*y*)<sup>1−*β*</sup> *g*(*β*)*Y*(*y*)<sup>*β*</sup>; (250) follows by identifying the Rényi divergence; (251) follows from (242); and (252) follows by identifying the conditional Rényi divergence using (78).
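The optimal conditional strategy (244) is straightforward to compute. The following sketch builds *g*(*β*)*X*|*Y* for a hypothetical joint PMF and odds (all numbers are illustrative assumptions) and checks numerically that no randomly drawn conditional strategy achieves a larger utility (247):

```python
import math, random

# Hypothetical joint PMF p(x, y) and odds o(x); values are illustrative only.
pXY = [[0.20, 0.10],   # pXY[x][y]
       [0.05, 0.25],
       [0.15, 0.25]]
o = [2.0, 4.0, 3.0]
beta = 0.5                                  # any beta in (-inf, 1)

pY = [sum(pXY[x][y] for x in range(3)) for y in range(2)]
pXgY = [[pXY[x][y] / pY[y] for y in range(2)] for x in range(3)]

def U(bXgY):
    """(1/beta) log sum_{x,y} p(x,y) b(x|y)^beta o(x)^beta, cf. (247)."""
    return math.log(sum(pXY[x][y] * (bXgY[x][y] * o[x]) ** beta
                        for x in range(3) for y in range(2))) / beta

# Optimal strategy g^(beta)_{X|Y} from (244).
e = 1 / (1 - beta)
g = [[pXgY[x][y] ** e * o[x] ** (beta * e) for y in range(2)] for x in range(3)]
for y in range(2):
    Z = sum(g[x][y] for x in range(3))
    for x in range(3):
        g[x][y] /= Z

best = U(g)
random.seed(1)
for _ in range(500):                        # no random strategy beats g
    b = [[random.random() for _ in range(2)] for _ in range(3)]
    for y in range(2):
        Z = sum(b[x][y] for x in range(3))
        for x in range(3):
            b[x][y] /= Z
    assert U(b) <= best + 1e-12
```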

**Remark 8.** *It follows from Theorem 10 that, if the gambler gambles optimally, then, for β* ∈ (−∞, 1)*,*

$$\mathcal{U}\_{\beta} = \log c + D\_{\frac{1}{1-\beta}}^{\mathrm{l}}\big(p\_{X|Y} \| r\_X \big| p\_Y\big). \tag{253}$$

*Operationally, it is clear that preprocessing the side information cannot increase the gambler's utility, i.e., that, for every conditional PMF pY*′|*Y,*

$$D\_{\frac{1}{1-\beta}}^{\mathrm{l}}\big(p\_{X|Y'} \| r\_X \big| p\_{Y'}\big) \le D\_{\frac{1}{1-\beta}}^{\mathrm{l}}\big(p\_{X|Y} \| r\_X \big| p\_Y\big), \tag{254}$$

*where pX*|*Y*′ *and pY*′ *are derived from the joint PMF pXYY*′ *given by*

$$p\_{XYY'}(\mathbf{x}, y, y') = p\_Y(y)\, p\_{X|Y}(\mathbf{x}|y)\, p\_{Y'|Y}(y'|y). \tag{255}$$

*This provides the intuition for Theorem 6, where* (254) *is shown directly.*

*The extreme case is when the preprocessing maps the side information to a constant and hence leads to the case where the side information is absent. In this case, Y*′ *is deterministic and pX*|*Y*′ *equals pX. Theorems 9 and 10 then lead to the following relation between the conditional and unconditional Rényi divergence:*

$$D\_{\frac{1}{1-\beta}}(p\_X \| r\_X) \le D\_{\frac{1}{1-\beta}}^{\mathrm{l}}\big(p\_{X|Y} \| r\_X \big| p\_Y\big), \tag{256}$$

*where the marginal PMF p<sup>X</sup> is given by*

$$p\_X(\mathbf{x}) = \sum\_{\mathbf{y}} p\_{XY}(\mathbf{x}, \mathbf{y}). \tag{257}$$

*This motivates Corollary 3, where* (256) *is derived from* (254)*.*

The last result of this section provides a new operational meaning to the Lapidoth–Pfister mutual information *Jα*(*X*;*Y*): assuming that *β* ∈ (−∞, 1) and that the gambler knows the winning probabilities, *J*1/(1−*β*)(*X*;*Y*) measures how much the side information that is available to the gambler but not the bookmaker increases the gambler's smallest guaranteed utility for a fixed level of fairness *c*. To see this, consider first the setting without side information. By Theorem 9, the gambler chooses *b* = *g*(*β*) to maximize her utility, where *g*(*β*) is defined in (207). Then, using the nonnegativity of the Rényi divergence (Proposition 1 (a)), the following lower bound on the gambler's utility follows from (206):

$$\mathcal{U}\_{\beta} \ge \log c. \tag{258}$$

We call the RHS of (258) the smallest guaranteed utility for a fixed level of fairness *c* because (258) holds with equality if the bookmaker chooses the odds inversely proportional to the winning probabilities. Comparing (258) with (259) below, we see that the difference due to the side information is *J*1/(1−*β*)(*X*;*Y*). Note that *J*1/(1−*β*)(*X*;*Y*) is typically not the difference between the utility with and without side information; this is because the odds for which (258) and (259) hold with equality are typically not the same.
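The equality case of (258) can be checked directly: if the odds are inversely proportional to the winning probabilities, the optimal strategy (207) reduces to *b* = *p* and earns exactly log *c*. A minimal sketch, assuming hypothetical values of *p* and *c*:

```python
import math

# If the bookmaker sets o(x) = c / p(x), the optimal gambler earns log c.
# The PMF p and the fairness level c are hypothetical, illustrative values.
p = [0.5, 0.3, 0.2]
c = 0.9                                     # subfair: c < 1
o = [c / px for px in p]
assert abs(sum(1 / ox for ox in o) - 1 / c) < 1e-12   # consistent with (241)

beta = -1.0                                 # any beta in (-inf, 1), beta != 0
e = 1 / (1 - beta)
g = [px ** e * ox ** (beta * e) for px, ox in zip(p, o)]   # optimal b, cf. (207)
g = [gx / sum(g) for gx in g]               # here g normalizes back to p itself

U = math.log(sum(px * (gx * ox) ** beta for px, gx, ox in zip(p, g, o))) / beta
assert abs(U - math.log(c)) < 1e-12         # guaranteed utility (258) attained
```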

**Theorem 11.** *Let β* ∈ (−∞, 1)*. If bX*|*Y is equal to g*(*β*)*X*|*Y from Theorem 10, then*

$$\mathcal{U}\_{\beta} \ge \log c + J\_{\frac{1}{1-\beta}}(X;Y). \tag{259}$$

*Moreover, for every c* > 0*, there exist odds o* : X → (0, ∞) *such that* (259) *holds with equality.*

**Proof.** For this choice of *bX*|*Y*, (259) holds because

$$\mathcal{U}\_{\beta} = \log c + D\_{\frac{1}{1-\beta}}^{\mathrm{l}}\big(p\_{X|Y} \| r\_X \big| p\_Y\big) \tag{260}$$

$$\ge \log c + \min\_{\tilde{r}\_X \in \mathcal{P}(\mathcal{X})} D\_{\frac{1}{1-\beta}}^{\mathrm{l}}\big(p\_{X|Y} \| \tilde{r}\_X \big| p\_Y\big) \tag{261}$$

$$= \log c + J\_{\frac{1}{1-\beta}}(X;Y), \tag{262}$$

where (260) follows from Theorem 10, and (262) follows from Proposition 5.

Fix now *c* > 0, let *r̃*∗*X* achieve the minimum on the RHS of (261), and choose the odds

$$o(\mathbf{x}) = \frac{c}{\tilde{r}\_X^\*(\mathbf{x})}. \tag{263}$$

Then, (261) holds with equality because *rX* = *r̃*∗*X* by (241) and (242).

#### **10. Horse Betting with Part of the Money**

In this section, we treat the possibility that the gambler does not invest all her wealth. We restrict ourselves to the setting without side information and to *β* ∈ (−∞, 0) ∪ (0, 1). (For the case *β* = 0, see [47] (Section 10.5).) We assume that *p*(*x*) > 0 and *o*(*x*) > 0 for all *x* ∈ X . Denote by *b*(0) the fraction of her wealth that the gambler does not use for betting. (We assume 0 ∉ X .) Then, *b* : X ∪ {0} → [0, 1] is a PMF, and the wealth relative *S* is the random variable

$$S \triangleq b(0) + b(X)\, o(X). \tag{264}$$

As in Section 8, define the constant

$$c \triangleq \left[ \sum\_{\mathbf{x}} \frac{1}{o(\mathbf{x})} \right]^{-1} \,. \tag{265}$$

We treat the cases *c* < 1 and *c* ≥ 1 separately, starting with the latter. If *c* ≥ 1, then it is optimal to invest all the money:

**Proposition 11.** *Assume c* ≥ 1*, let β* ∈ R*, and let b be a PMF on* X ∪ {0} *with utility Uβ. Then, there exists a PMF b*′ *on* X ∪ {0} *with b*′(0) = 0 *and utility U*′*β* ≥ *Uβ.*

**Proof.** Choose the PMF *b*′ as follows:

$$b'(\mathbf{x}) = \begin{cases} \frac{c}{o(\mathbf{x})} \cdot b(0) + b(\mathbf{x}) & \text{if } \mathbf{x} \in \mathcal{X}, \\ 0 & \text{if } \mathbf{x} = \mathbf{0}. \end{cases} \tag{266}$$

Then, for every *x* ∈ X ,

$$b'(0) + b'(\mathbf{x})o(\mathbf{x}) = c \cdot b(0) + b(\mathbf{x})o(\mathbf{x}) \tag{267}$$

$$\geq b(0) + b(\mathbf{x})o(\mathbf{x}),\tag{268}$$

where (268) holds because *c* ≥ 1 by assumption. For *β* > 0, *U*′*β* ≥ *Uβ* holds because (268) implies E[*S*′<sup>*β*</sup>] ≥ E[*S*<sup>*β*</sup>]. For *β* < 0 and *β* = 0, *U*′*β* ≥ *Uβ* follows similarly from (268).
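The construction (266) is easy to check numerically. A minimal sketch with a hypothetical superfair race and a hypothetical initial strategy (all values are illustrative assumptions):

```python
import math

# When c >= 1, moving the cash reserve b(0) into the bets proportionally to
# c/o(x), as in (266), can only increase the utility.  Hypothetical values:
p = [0.5, 0.3, 0.2]
o = [3.0, 4.0, 6.0]
c = 1 / sum(1 / ox for ox in o)             # here c = 4/3 >= 1 (superfair odds)

b0 = 0.4                                    # fraction kept as cash
b = [0.2, 0.2, 0.2]                         # bets; b0 + sum(b) == 1

bp = [c / ox * b0 + bx for ox, bx in zip(o, b)]   # b' from (266); b'(0) = 0
assert abs(sum(bp) - 1.0) < 1e-12           # still a PMF, since sum_x c/o(x) = 1

def U(beta, cash, bets):
    wealth = [cash + bx * ox for bx, ox in zip(bets, o)]   # wealth relative (264)
    return math.log(sum(px * w ** beta for px, w in zip(p, wealth))) / beta

for beta in (-1.0, 0.5):                    # U'_beta >= U_beta in both regimes
    assert U(beta, 0.0, bp) >= U(beta, b0, b) - 1e-12
```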

On the other hand, if *β* < 1 and the odds are subfair, i.e., if *c* < 1, then Claim (c) of the following theorem shows that investing all the money is not optimal:

**Theorem 12.** *Assume c* < 1*, let β* ∈ (−∞, 0) ∪ (0, 1)*, and let b* ∗ *be a PMF on* X ∪ {0} *that maximizes U<sup>β</sup> among all PMFs b. Defining*

$$\mathcal{S} \triangleq \{ \mathbf{x} \in \mathcal{X} : b^\*(\mathbf{x}) > 0 \}, \tag{269}$$

$$\Gamma \triangleq \frac{1 - \sum\_{\mathbf{x} \in \mathcal{S}} p(\mathbf{x})}{1 - \sum\_{\mathbf{x} \in \mathcal{S}} \frac{1}{o(\mathbf{x})}}\,, \tag{270}$$

$$\gamma(\mathbf{x}) \triangleq \max \left\{ 0, \Gamma^{\frac{1}{\beta - 1}} p(\mathbf{x})^{\frac{1}{1 - \beta}} o(\mathbf{x})^{\frac{\beta}{1 - \beta}} - \frac{1}{o(\mathbf{x})} \right\} \quad \forall \mathbf{x} \in \mathcal{X}, \tag{271}$$

*the following claims hold:*

*(a) The quantity* Γ *in* (270) *is well-defined and positive.*

*(b) For every x* ∈ X *,*

$$b^\*(\mathbf{x}) = \gamma(\mathbf{x})\, b^\*(0). \tag{272}$$

*(c) The quantity b*∗ (0) *satisfies*

$$b^\*(0) = \frac{1}{1 + \sum\_{\mathbf{x} \in \mathcal{X}} \gamma(\mathbf{x})}.\tag{273}$$

*In particular, b*∗ (0) > 0*.*

Claim (b) implies that for every *x* ∈ X , *b*∗(*x*) > 0 if and only if *p*(*x*)*o*(*x*) > Γ. Ordering the elements *x*1, *x*2, . . . of X such that *p*(*x*1)*o*(*x*1) ≥ *p*(*x*2)*o*(*x*2) ≥ . . ., the set S thus has a special structure: it is either empty or equal to {*x*1, *x*2, . . . , *xk*} for some integer *k*. To maximize *Uβ*, the following procedure can be used: for every S with this structure, compute the corresponding *b* according to (270)–(273), and from these *b*'s take one that maximizes *Uβ*. This procedure leads to an optimal solution: an optimal *b*∗ exists because we are optimizing a continuous function over a compact set, and *b*∗ corresponds to a set S that is considered by the procedure.
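The procedure above can be sketched as follows, for a hypothetical subfair race (the values of *p*, *o*, and *β* are illustrative assumptions, not from the paper):

```python
import math

# Order the horses by p(x)o(x), try every prefix as the support S, build a
# candidate b via (270)-(273), and keep the candidate with the best utility.
p = [0.6, 0.3, 0.1]
o = [2.0, 2.0, 2.0]                     # c = 2/3 < 1: subfair odds
beta = 0.5

def utility(b0, b):                     # U_beta with wealth relative (264)
    wealth = [b0 + bx * ox for bx, ox in zip(b, o)]
    return math.log(sum(px * w ** beta for px, w in zip(p, wealth))) / beta

order = sorted(range(3), key=lambda x: -p[x] * o[x])
best = (utility(1.0, [0.0] * 3), 1.0, [0.0] * 3)    # S empty: keep all the cash
for k in range(1, 4):
    supp = order[:k]
    num = 1 - sum(p[x] for x in supp)
    den = 1 - sum(1 / o[x] for x in supp)
    if num <= 0 or den <= 0:            # Gamma from (270) must be positive
        continue
    Gamma = num / den
    gamma = [max(0.0, Gamma ** (1 / (beta - 1)) * p[x] ** (1 / (1 - beta))
                 * o[x] ** (beta / (1 - beta)) - 1 / o[x]) for x in range(3)]
    b0 = 1 / (1 + sum(gamma))           # (273)
    b = [g * b0 for g in gamma]         # (272)
    best = max(best, (utility(b0, b), b0, b))

U_star, b0_star, b_star = best
assert b0_star > 0                      # consistent with Claim (c)
assert U_star > utility(0.0, p)        # beats investing everything with b = p
```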

**Proof of Theorem 12.** The proof is based on the Karush–Kuhn–Tucker conditions. By separately considering the cases *β* ∈ (0, 1) and *β* < 0, we first show that, for *β* ∈ (−∞, 0) ∪ (0, 1), a strategy *b*(·) is optimal if and only if the following conditions are satisfied for some *µ* ∈ R:

$$\sum\_{\mathbf{x}\in\mathcal{X}} p(\mathbf{x}) \left( b(\mathbf{0}) + b(\mathbf{x}) o(\mathbf{x}) \right)^{\beta - 1} \begin{cases} = \mu & \text{if } b(\mathbf{0}) > \mathbf{0}, \\ \le \mu & \text{if } b(\mathbf{0}) = \mathbf{0}, \end{cases} \tag{274}$$

and, for every *x* ∈ X ,

$$p(\mathbf{x})o(\mathbf{x})\left(b(0) + b(\mathbf{x})o(\mathbf{x})\right)^{\beta - 1} \begin{cases} = \mu & \text{if } b(\mathbf{x}) > 0, \\ \le \mu & \text{if } b(\mathbf{x}) = 0. \end{cases} \tag{275}$$

Consider first *β* ∈ (0, 1), and define the function *τ* : P(X ∪ {0}) → R,

$$\tau(b) \triangleq \sum\_{\mathbf{x} \in \mathcal{X}} p(\mathbf{x}) \left( b(0) + b(\mathbf{x}) o(\mathbf{x}) \right)^{\beta}. \tag{276}$$

Since *β* > 0 and since the logarithm is an increasing function, maximizing *Uβ* = (1/*β*) log E[*S*<sup>*β*</sup>] over *b* is equivalent to maximizing *τ*(*b*). Observe that *τ* is concave; thus, by the Karush–Kuhn–Tucker conditions [11] (Theorem 4.4.1), it is maximized by a PMF *b* if and only if there exists a *λ* ∈ R such that (i) for all *x* ∈ X ∪ {0} with *b*(*x*) > 0,

$$\frac{\partial \tau}{\partial b(\mathbf{x})}(b) = \lambda\_\prime \tag{277}$$

and (ii) for all *x* ∈ X ∪ {0} with *b*(*x*) = 0,

$$\frac{\partial \tau}{\partial b(\mathbf{x})}(b) \le \lambda. \tag{278}$$

Henceforth, we use the following notation: to designate that (i) and (ii) both hold, we write

$$\frac{\partial \tau}{\partial b(\mathbf{x})}(b) \begin{cases} = \lambda & \text{if } b(\mathbf{x}) > 0, \\ \le \lambda & \text{if } b(\mathbf{x}) = 0. \end{cases} \tag{279}$$

Dividing both sides of (279) by *β* > 0 and defining *µ* ≜ *λ*/*β*, we obtain that (279) is equivalent to

$$\frac{1}{\beta} \cdot \frac{\partial \tau}{\partial b(\mathbf{x})}(b) \begin{cases} = \mu & \text{if } b(\mathbf{x}) > 0, \\ \le \mu & \text{if } b(\mathbf{x}) = 0. \end{cases} \tag{280}$$

Now, (280) translates to (274) for *x* = 0 and to (275) for *x* ∈ X .

Consider now *β* < 0, and define *τ* as in (276). Then, because *β* < 0, maximizing *U<sup>β</sup>* = <sup>1</sup> *β* log E[*S β* ] is equivalent to minimizing *τ*. The function *τ* is convex, thus Inequality (278) is reversed. Dividing by *β* < 0 again reverses the inequalities, thus (280), (274), and (275) continue to hold for *β* < 0.

Having established that, for all *β* ∈ (−∞, 0) ∪ (0, 1), a strategy *b* is optimal if and only if (274) and (275) hold, we next continue with the proof. Let *β* ∈ (−∞, 0) ∪ (0, 1), and let *b* <sup>∗</sup> be a PMF on X ∪ {0} that maximizes *Uβ*. By the above discussion, (274) and (275) are satisfied by *b* ∗ for some *µ* ∈ R. The LHS of (274) is positive, so *µ* > 0. We now show that for all *x* ∈ X ,

$$b^\*(\mathbf{x}) = \max \left\{ 0, \left[ \frac{p(\mathbf{x}) o(\mathbf{x})^\beta}{\mu} \right]^{\frac{1}{1-\beta}} - \frac{b^\*(0)}{o(\mathbf{x})} \right\}. \tag{281}$$

To this end, fix *x* ∈ X . If *b* ∗ (*x*) > 0, then (275) implies

$$b^\*(\mathbf{x}) = \left[\frac{p(\mathbf{x})o(\mathbf{x})^\beta}{\mu}\right]^{\frac{1}{1-\beta}} - \frac{b^\*(0)}{o(\mathbf{x})},\tag{282}$$

and the RHS of (282) is equal to the RHS of (281) because, being equal to *b* ∗ (*x*), it is positive. If *b* ∗ (*x*) = 0, then (275) implies

$$\left[\frac{p(\mathbf{x})o(\mathbf{x})^{\beta}}{\mu}\right]^{\frac{1}{1-\beta}} - \frac{b^\*(0)}{o(\mathbf{x})} \le 0, \tag{283}$$

so the RHS of (281) is zero and (281) hence holds.

Having established (281), we next show that *b* ∗ (*x*ˆ) = 0 for some *x*ˆ ∈ X . For a contradiction, assume that *b* ∗ (*x*) > 0 for all *x* ∈ X . Then,

$$\sum\_{\mathbf{x}\in\mathcal{X}} p(\mathbf{x}) \left( b^\*(0) + b^\*(\mathbf{x}) o(\mathbf{x}) \right)^{\beta - 1} = \mu \sum\_{\mathbf{x}\in\mathcal{X}} \frac{1}{o(\mathbf{x})} \tag{284}$$

$$> \mu, \tag{285}$$

where (284) follows from (275), and (285) holds because *c* < 1 by assumption. However, this is impossible: (285) contradicts (274).

Let now *x*ˆ ∈ X be such that *b* ∗ (*x*ˆ) = 0. Then, by (281),

$$\left[\frac{p(\hat{\mathbf{x}})o(\hat{\mathbf{x}})^{\beta}}{\mu}\right]^{\frac{1}{1-\beta}} - \frac{b^\*(0)}{o(\hat{\mathbf{x}})} \le 0. \tag{286}$$

Because *p*(*x*ˆ) and *o*(*x*ˆ) are positive, this implies *b* ∗ (0) > 0. Thus, by (274),

$$\sum\_{\mathbf{x}\in\mathcal{X}} p(\mathbf{x}) \left( b^\*(\mathbf{0}) + b^\*(\mathbf{x}) o(\mathbf{x}) \right)^{\beta - 1} = \mu. \tag{287}$$

Splitting the sum on the LHS of (287) depending on whether *b* ∗ (*x*) > 0 or *b* ∗ (*x*) = 0, we obtain

$$\mu = \sum\_{\mathbf{x} \in \mathcal{S}} p(\mathbf{x}) \left( b^\*(\mathbf{0}) + b^\*(\mathbf{x}) o(\mathbf{x}) \right)^{\beta - 1} + \sum\_{\mathbf{x} \notin \mathcal{S}} p(\mathbf{x}) \left( b^\*(\mathbf{0}) + b^\*(\mathbf{x}) o(\mathbf{x}) \right)^{\beta - 1} \tag{288}$$

$$= \sum\_{\mathbf{x} \in \mathcal{S}} \frac{\mu}{o(\mathbf{x})} + \sum\_{\mathbf{x} \notin \mathcal{S}} p(\mathbf{x})\, b^\*(0)^{\beta-1} \tag{289}$$

$$= \mu \sum\_{\mathbf{x}\in\mathcal{S}} \frac{1}{o(\mathbf{x})} + b^\*(0)^{\beta-1} \left[ 1 - \sum\_{\mathbf{x}\in\mathcal{S}} p(\mathbf{x}) \right], \tag{290}$$

where (289) follows from (275). Rearranging (290), we obtain

$$\mu\left[1-\sum\_{\mathbf{x}\in\mathcal{S}}\frac{1}{o(\mathbf{x})}\right] = b^\*(0)^{\beta-1}\left[1-\sum\_{\mathbf{x}\in\mathcal{S}}p(\mathbf{x})\right].\tag{291}$$

Recall that *µ* > 0 and *b*∗(0) > 0. In addition, 1 − ∑*x*∈S *p*(*x*) > 0 because *b*∗(*x̂*) = 0 and hence *x̂* ∉ S. Thus, 1 − ∑*x*∈S 1/*o*(*x*) > 0, so both the numerator and the denominator in the definition of Γ in (270) are positive, which establishes Claim (a), namely that Γ is well-defined and positive.

To establish Claim (b), note that (291) and (270) imply that *µ* is given by

$$\mu = b^\*(0)^{\beta - 1}\, \Gamma, \tag{292}$$

which, when substituted into (281), yields (272).

We conclude by proving Claim (c). Because *b* ∗ is a PMF on X ∪ {0},

$$1 = b^\*(0) + \sum\_{\mathbf{x} \in \mathcal{X}} b^\*(\mathbf{x}) \tag{293}$$

$$= b^\*(0)\left[1 + \sum\_{\mathbf{x} \in \mathcal{X}} \gamma(\mathbf{x})\right], \tag{294}$$

where (294) follows from (272). Rearranging (294) yields (273).

#### **11. Universal Betting for IID Races**

In this section, we present a universal gambling strategy for IID races that requires neither knowledge of the winning probabilities nor of the parameter *β* of the utility function and yet asymptotically maximizes the utility function for all PMFs *p* and all *β* ∈ R. Consider *n* consecutive horse races, where the winning horse in the *i*th race is denoted *X<sup>i</sup>* for *i* ∈ {1, . . . , *n*}. We assume that *X*1, . . . , *X<sup>n</sup>* are IID according to the PMF *p*, where *p*(*x*) > 0 for all *x* ∈ X . In every race, the bookmaker offers the same odds *o* : X → (0, ∞), and the gambler spends all her wealth placing bets on the horses. The gambler plays race-after-race, i.e., before placing bets for a race, she is revealed the winning horse of the previous race and receives the money from the bookmaker. Her betting strategy is hence a sequence of conditional PMFs *bX*<sup>1</sup> , *bX*2|*X*<sup>1</sup> , *bX*3|*X*1*X*<sup>2</sup> , . . . , *<sup>b</sup>Xn*|*X*1*X*2···*Xn*−<sup>1</sup> . The wealth relative is the random variable

$$\mathbf{S}\_{n} \triangleq \prod\_{i=1}^{n} b(\mathbf{X}\_{i}|\mathbf{X}\_{1}, \dots, \mathbf{X}\_{i-1}) o(\mathbf{X}\_{i}).\tag{295}$$

We seek betting strategies that maximize the utility function

$$\mathcal{U}\_{\beta,n} \triangleq \begin{cases} \frac{1}{\beta} \log \mathbb{E}[S\_n^{\beta}] & \text{if } \beta \neq 0, \\ \mathbb{E}[\log S\_n] & \text{if } \beta = 0. \end{cases} \tag{296}$$

We first establish that to maximize *Uβ*,*n* for a fixed *β* ∈ R, it suffices to use the same betting strategy in every race; see Theorem 13. We then show that the individual-sequence-universal strategy of Cover–Ordentlich [48] makes it possible to asymptotically achieve the same normalized utility without knowing *p* or *β* (see Theorem 14).

For a fixed *β* ∈ R, let the PMF *b* <sup>∗</sup> be a betting strategy that maximizes the single-race utility *U<sup>β</sup>* discussed in Section 8, and denote by *U*∗ *β* the utility associated with *b* ∗ . Using the same betting strategy *b* <sup>∗</sup> over *n* races leads to the utility *Uβ*,*n*, and it follows from (295) and (296) that

$$\mathcal{U}\_{\beta,n} = n\, \mathcal{U}\_{\beta}^\*. \tag{297}$$

As we show next, *nU*∗ *β* is the maximum utility that can be achieved among all betting strategies:

**Theorem 13.** *Let <sup>β</sup>* <sup>∈</sup> <sup>R</sup>*, and let bX*<sup>1</sup> , *bX*2|*X*<sup>1</sup> , *bX*3|*X*1*X*<sup>2</sup> , . . . , *<sup>b</sup>Xn*|*X*1*X*2···*Xn*−<sup>1</sup> *be a sequence of conditional PMFs. Then,*

$$\mathcal{U}\_{\beta,n} \le n\, \mathcal{U}\_{\beta}^\*. \tag{298}$$

**Proof.** We show (298) for *β* > 0; analogous arguments establish (298) for *β* < 0 and *β* = 0. We prove (298) by induction on *n*. For *n* = 1, (298) holds because *U*∗ *β* is the maximum single-race utility. Assume now *n* ≥ 2 and that (298) is valid for *n* − 1. For *β* > 0, (298) holds because

$$\mathcal{U}\_{\beta,n} = \frac{1}{\beta} \log \mathbb{E}[S\_{n}^{\beta}] \tag{299}$$

$$= \frac{1}{\beta} \log \sum\_{\mathbf{x}\_1, \dots, \mathbf{x}\_n} p(\mathbf{x}\_1) \cdots p(\mathbf{x}\_n) \prod\_{i=1}^n b(\mathbf{x}\_i | \mathbf{x}^{i-1})^{\beta} o(\mathbf{x}\_i)^{\beta} \tag{300}$$

$$= \frac{1}{\beta} \log \sum\_{\mathbf{x}\_1, \dots, \mathbf{x}\_{n-1}} p(\mathbf{x}\_1) \cdots p(\mathbf{x}\_{n-1}) \left[ \prod\_{i=1}^{n-1} b(\mathbf{x}\_i|\mathbf{x}^{i-1})^{\beta} o(\mathbf{x}\_i)^{\beta} \right] \sum\_{\mathbf{x}\_n} p(\mathbf{x}\_n) b(\mathbf{x}\_n|\mathbf{x}^{n-1})^{\beta} o(\mathbf{x}\_n)^{\beta} \tag{301}$$

$$\le \frac{1}{\beta} \log \sum\_{\mathbf{x}\_1, \dots, \mathbf{x}\_{n-1}} p(\mathbf{x}\_1) \cdots p(\mathbf{x}\_{n-1}) \left[ \prod\_{i=1}^{n-1} b(\mathbf{x}\_i|\mathbf{x}^{i-1})^{\beta} o(\mathbf{x}\_i)^{\beta} \right] \max\_{b \in \mathcal{P}(\mathcal{X})} \sum\_{\mathbf{x}\_n} p(\mathbf{x}\_n) b(\mathbf{x}\_n)^{\beta} o(\mathbf{x}\_n)^{\beta} \tag{302}$$

$$= \frac{1}{\beta} \log \sum\_{\mathbf{x}\_1, \dots, \mathbf{x}\_{n-1}} p(\mathbf{x}\_1) \cdots p(\mathbf{x}\_{n-1}) \left[ \prod\_{i=1}^{n-1} b(\mathbf{x}\_i|\mathbf{x}^{i-1})^{\beta} o(\mathbf{x}\_i)^{\beta} \right] \sum\_{\mathbf{x}\_n} p(\mathbf{x}\_n) b^\*(\mathbf{x}\_n)^{\beta} o(\mathbf{x}\_n)^{\beta} \tag{303}$$

$$= \mathcal{U}\_{\beta, n-1} + \mathcal{U}\_{\beta}^\* \tag{304}$$

$$\leq \left(n-1\right)\mathcal{U}\_{\boldsymbol{\beta}}^{\*} + \mathcal{U}\_{\boldsymbol{\beta}}^{\*}\tag{305}$$

$$= n\, \mathcal{U}\_{\beta}^\*, \tag{306}$$

where (303) holds because *b* <sup>∗</sup> maximizes the single-race utility *Uβ*, and (305) holds because (298) is valid for *n* − 1.
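For *n* = 2, both (297) and (298) can be checked directly. A minimal sketch, assuming a hypothetical two-horse race and the optimal single-race strategy *g*(*β*) from (207):

```python
import math, random

# Repeating the single-race optimum b* attains 2 * U*_beta, cf. (297), and no
# pair (b_{X_1}, b_{X_2|X_1}) found by random search exceeds it, cf. (298).
# The race (p, o) and beta are hypothetical, illustrative values.
p = [0.6, 0.4]
o = [1.8, 2.5]
beta = 0.5
e = 1 / (1 - beta)

b_star = [p[x] ** e * o[x] ** (beta * e) for x in range(2)]   # (207)
b_star = [b / sum(b_star) for b in b_star]
U_star = math.log(sum(p[x] * (b_star[x] * o[x]) ** beta for x in range(2))) / beta

def U2(b1, b2):                      # b2[x1] is the PMF used after seeing x1
    return math.log(sum(p[x1] * p[x2] * (b1[x1] * o[x1] * b2[x1][x2] * o[x2]) ** beta
                        for x1 in range(2) for x2 in range(2))) / beta

assert abs(U2(b_star, [b_star, b_star]) - 2 * U_star) < 1e-12   # (297)

random.seed(3)
def rand_pmf():
    w = [random.random() for _ in range(2)]
    return [x / sum(w) for x in w]

for _ in range(300):                 # (298): no strategy beats 2 * U*_beta
    assert U2(rand_pmf(), [rand_pmf(), rand_pmf()]) <= 2 * U_star + 1e-12
```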

In portfolio theory, Cover–Ordentlich [48] (Definition 1) proposed a universal strategy. Adapted to our setting, it leads to the following sequence of conditional PMFs:

$$\hat{b}(\mathbf{x}\_{i}|\mathbf{x}^{i-1}) = \frac{\int\_{b \in \mathcal{P}(\mathcal{X})} b(\mathbf{x}\_{i})\, S\_{i-1}(b, \mathbf{x}^{i-1})\, d\mu(b)}{\int\_{b \in \mathcal{P}(\mathcal{X})} S\_{i-1}(b, \mathbf{x}^{i-1})\, d\mu(b)}, \tag{307}$$

where *i* ∈ {1, 2, . . .}; *µ* is the Dirichlet(1/2, . . . , 1/2) distribution on P(X ); *S*0(*b*, *x*<sup>0</sup>) ≜ 1; and

$$\mathcal{S}\_i(b, \boldsymbol{x}^i) \triangleq \prod\_{j=1}^i b(\boldsymbol{x}\_j) o(\boldsymbol{x}\_j). \tag{308}$$

*Entropy* **2020**, *22*, 316

This strategy depends neither on the winning probabilities *p* nor on the parameter *β*. Denoting by *U*ˆ*β*,*n* the utility (296) associated with the strategy *b*ˆ(*xi*|*x*<sup>*i*−1</sup>), we have the following result:
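The mixture (307) can be approximated by Monte Carlo, replacing the integrals over P(X ) by averages over Dirichlet(1/2, . . . , 1/2) samples. The sketch below is such an approximation, not an exact implementation; the race size, odds, and sample count are illustrative assumptions:

```python
import math, random

# Monte Carlo sketch of the Cover-Ordentlich strategy (307).
random.seed(2)
K = 3
o = [2.0, 3.0, 4.0]

def dirichlet_half():
    """One sample from Dirichlet(1/2, ..., 1/2) via normalized Gamma(1/2)."""
    g = [random.gammavariate(0.5, 1.0) for _ in range(K)]
    s = sum(g)
    return [x / s for x in g]

SAMPLES = [dirichlet_half() for _ in range(20000)]

def b_hat(x_i, past):
    """Approximate b_hat(x_i | x^{i-1}): wealth-weighted average bet."""
    def S(b):                                   # S_{i-1}(b, x^{i-1}) from (308)
        w = 1.0
        for xj in past:
            w *= b[xj] * o[xj]
        return w
    num = sum(b[x_i] * S(b) for b in SAMPLES)
    den = sum(S(b) for b in SAMPLES)
    return num / den

# With no past races, the strategy bets the Dirichlet(1/2) mean: uniform.
first = [b_hat(x, []) for x in range(K)]
assert abs(sum(first) - 1.0) < 1e-9
assert all(abs(f - 1 / K) < 0.02 for f in first)

# After repeatedly seeing horse 0 win, the bets shift toward horse 0.
assert b_hat(0, [0] * 10) > 0.5
```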

**Theorem 14.** *For every β* ∈ R*,*

$$n\, \mathcal{U}\_{\beta}^\* - \log 2 - \frac{|\mathcal{X}| - 1}{2} \log(n + 1) \le \hat{\mathcal{U}}\_{\beta, n} \tag{309}$$

$$\leq \mathfrak{n} \mathcal{U}\_{\mathfrak{B}}^{\*}.\tag{310}$$

*Hence,*

$$\lim\_{n \to \infty} \frac{1}{n} \hat{\mathcal{U}}\_{\beta, n} = \mathcal{U}\_{\beta}^\*. \tag{311}$$

**Proof.** Inequality (310) follows from Theorem 13; and (311) follows from (309) and (310) and the sandwich theorem. It thus remains to establish (309): We do so for *β* > 0; analogous arguments establish (309) for *β* < 0 and *β* = 0. For a fixed sequence *x <sup>n</sup>* ∈ X *<sup>n</sup>* , let ˜*<sup>b</sup>* be a PMF on <sup>X</sup> that maximizes *Sn*(*b*, *x n* ), and denote the wealth relative in (295) associated with using ˜*b* in every race by *S*˜ *<sup>n</sup>*(*x n* ), thus

$$\tilde{S}\_n(\mathbf{x}^n) = \max\_{b \in \mathcal{P}(\mathcal{X})} \prod\_{i=1}^n b(\mathbf{x}\_i) o(\mathbf{x}\_i). \tag{312}$$

Let *S*ˆ *<sup>n</sup>*(*x n* ) denote the wealth relative in (295) associated with the strategy ˆ*b*(*x<sup>i</sup>* |*x i*−1 ) and the sequence *x n* . Using [48] (Theorem 2) it follows that, for every *x <sup>n</sup>* ∈ X *<sup>n</sup>* ,

$$\hat{S}\_{n}(\mathbf{x}^{n}) \ge \frac{1}{2(n+1)^{(|\mathcal{X}|-1)/2}}\, \tilde{S}\_{n}(\mathbf{x}^{n}). \tag{313}$$

This implies that (309) holds for *β* > 0 because

$$\hat{\mathcal{U}}\_{\beta,n} = \frac{1}{\beta} \log \mathbb{E}\left[ \hat{S}\_n(X^n)^{\beta} \right] \tag{314}$$

$$\geq \frac{1}{\beta} \log \mathbb{E} \left[ \tilde{S}\_n(X^n)^\beta \right] - \log 2 - \frac{|\mathcal{X}| - 1}{2} \log(n + 1) \tag{315}$$

$$\geq \frac{1}{\beta} \log \sum\_{\mathbf{x}\_1, \dots, \mathbf{x}\_n} p(\mathbf{x}\_1) \cdots p(\mathbf{x}\_n) \prod\_{i=1}^n b^\*(\mathbf{x}\_i)^{\beta} o(\mathbf{x}\_i)^{\beta} - \log 2 - \frac{|\mathcal{X}| - 1}{2} \log(n + 1) \tag{316}$$

$$= n\, \mathcal{U}\_{\beta}^\* - \log 2 - \frac{|\mathcal{X}| - 1}{2} \log(n + 1), \tag{317}$$

where (315) follows from (313), and (316) follows from (312).

**Remark 9.** *As discussed in Section 8, the optimal single-race betting strategy varies significantly with different values of β, thus it might be a bit surprising that the Cover–Ordentlich strategy is not only universal with respect to the winning probabilities, but also with respect to β. This is due to the following two reasons: First, for fixed winning probabilities and a fixed β, it is optimal to use the same betting strategy in every race (see Theorem 13). Second, for every x <sup>n</sup>* ∈ X *<sup>n</sup> , the wealth relative of the Cover–Ordentlich strategy is not much worse than that of using the same strategy b*(·) *in every race, irrespective of b*(·) *(see* (313)*). Hence, irrespective of the optimal single-race betting strategy, the Cover–Ordentlich strategy is able to asymptotically achieve the same normalized utility.*

**Author Contributions:** Writing—original draft preparation, C.B., A.L., and C.P.; and writing—review and editing, C.B., A.L., and C.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.


**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Proof of Proposition 1**

These properties mostly follow from van Erven–Harremoës [8]:


$$\frac{1-\alpha}{\alpha} D\_{\alpha}(P\|Q) = D\_{1-\alpha}(Q\|P) \tag{A1}$$

$$\geq D\_{1-\alpha'}(Q\|P) \tag{A2}$$

$$= \frac{1-\alpha'}{\alpha'} D\_{\alpha'}(P\|Q), \tag{A3}$$

where (A1) and (A3) follow from [8] (Lemma 10), and (A2) holds because the Rényi divergence, extended to negative orders, is nondecreasing in its order ([8] (Theorem 39)).
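The skew symmetry (A1) can be checked numerically. A minimal sketch, using two hypothetical strictly positive PMFs:

```python
import math

# Numeric check of (A1): (1-alpha)/alpha * D_alpha(P||Q) = D_{1-alpha}(Q||P).
# P and Q are hypothetical, illustrative PMFs.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]

def D(alpha, P, Q):
    """Renyi divergence of order alpha (alpha != 0, 1), strictly positive PMFs."""
    return math.log(sum(p ** alpha * q ** (1 - alpha) for p, q in zip(P, Q))) / (alpha - 1)

for alpha in (0.3, 0.7, 2.0, -0.5):
    lhs = (1 - alpha) / alpha * D(alpha, P, Q)
    rhs = D(1 - alpha, Q, P)
    assert abs(lhs - rhs) < 1e-12
```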


$$(\alpha-1) D\_{1/\alpha}(P\|Q) = \alpha\left(1 - \frac{1}{\alpha}\right) D\_{1/\alpha}(P\|Q) \tag{A4}$$

$$= \alpha \inf\_{R} \left[ \frac{1}{\alpha} D(R\|P) + \left( 1 - \frac{1}{\alpha} \right) D(R\|Q) \right] \tag{A5}$$

$$= \inf\_{R} \left[ D(R\|P) + (\alpha - 1) D(R\|Q) \right], \tag{A6}$$

where (A5) follows from [8] (Theorem 30). Hence, (*α* − 1)*D*1/*α*(*P*‖*Q*) is concave in *α* because the expression in square brackets on the RHS of (A6) is affine, and hence concave, in *α* for every *R*, and because the pointwise infimum of concave functions is concave.

(h) See [8] (Theorem 9).

#### **Appendix B. Proof of Theorem 1**

Beginning with (29),

$$D_{\alpha}^{\mathrm{c}}(P_{Y'|X}\|Q_{Y'|X}|P_X) = \sum_{x \in \operatorname{supp}(P_X)} P_X(x)\, D_{\alpha}(P_{Y'|X=x}\|Q_{Y'|X=x}) \tag{A7}$$

$$\leq \sum_{x \in \operatorname{supp}(P_X)} P_X(x)\, D_{\alpha}(P_{Y|X=x}\|Q_{Y|X=x}) \tag{A8}$$

$$=D_{\alpha}^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X), \tag{A9}$$

where (A8) follows by applying, separately for every *x* ∈ supp(*P<sub>X</sub>*), Proposition 1 (h) with the conditional PMF *A*<sub>*Y'*|*Y*,*X*=*x*</sub>.

#### **Appendix C. Proof of Theorem 2**

We show (43) for *α* ∈ (0, 1); the claim then extends to *α* ∈ [0, 1] by the continuity of *D*<sub>*α*</sub><sup>c</sup>(·) in *α* (Proposition 2 (c)). Let *α* ∈ (0, 1). Keeping in mind that *α* − 1 < 0, (43) holds because

$$\begin{aligned} (\alpha - 1)\, &D_{\alpha}^{\mathrm{c}}(P_{Y|X'}\|Q_{Y|X'}|P_{X'}) \\ &= \sum_{x' \in \operatorname{supp}(P_{X'})} P_{X'}(x') \log \sum_{y} P_{Y|X'}(y|x')^{\alpha}\, Q_{Y|X'}(y|x')^{1-\alpha} \end{aligned} \tag{A10}$$

$$=\sum_{x'\in\operatorname{supp}(P_{X'})}P_{X'}(x')\log\sum_{y}\left[\sum_{x}B_{X|X'}(x|x')\,P_{Y|X}(y|x)\right]^{\alpha}\left[\sum_{x}B_{X|X'}(x|x')\,Q_{Y|X}(y|x)\right]^{1-\alpha}\tag{A11}$$

$$\geq \sum_{x'\in\operatorname{supp}(P_{X'})} P_{X'}(x') \log \sum_{y} \sum_{x} B_{X|X'}(x|x')\, P_{Y|X}(y|x)^{\alpha}\, Q_{Y|X}(y|x)^{1-\alpha} \tag{A12}$$

$$=\sum_{x'\in\operatorname{supp}(P_{X'})}P_{X'}(x')\log\sum_{x\in\operatorname{supp}(P_X)}B_{X|X'}(x|x')\sum_{y}P_{Y|X}(y|x)^{\alpha}\,Q_{Y|X}(y|x)^{1-\alpha}\tag{A13}$$

$$\geq \sum_{x'\in\operatorname{supp}(P_{X'})} P_{X'}(x') \sum_{x\in\operatorname{supp}(P_X)} B_{X|X'}(x|x') \log \sum_{y} P_{Y|X}(y|x)^{\alpha}\, Q_{Y|X}(y|x)^{1-\alpha} \tag{A14}$$

$$=\sum_{x\in\operatorname{supp}(P_X)}P_X(x)\left[\sum_{x'\in\operatorname{supp}(P_{X'})}B_{X'|X}(x'|x)\right]\log\sum_{y}P_{Y|X}(y|x)^{\alpha}\,Q_{Y|X}(y|x)^{1-\alpha}\tag{A15}$$

$$=\sum_{x\in\operatorname{supp}(P_X)}P_X(x)\log\sum_{y}P_{Y|X}(y|x)^{\alpha}\,Q_{Y|X}(y|x)^{1-\alpha}\tag{A16}$$

$$= (\alpha - 1)\, D_{\alpha}^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X), \tag{A17}$$

where (A10) follows from (30); (A11) follows from (41) and (42); (A12) follows from Hölder's inequality; (A13) holds because *B*<sub>*X*|*X'*</sub>(*x*|*x'*) = 0 if *P*<sub>*X'*</sub>(*x'*) > 0 and *P<sub>X</sub>*(*x*) = 0; (A14) follows from Jensen's inequality because log(·) is concave; (A15) follows from (40); (A16) holds because *P<sub>X</sub>*(*x*) > 0 and *P*<sub>*X'*</sub>(*x'*) = 0 imply *B*<sub>*X'*|*X*</sub>(*x'*|*x*) = 0, hence the expression in square brackets in (A15) equals one; and (A17) follows from (30).
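The data-processing inequality (43) proved by the chain above can be checked numerically. The sketch below assumes the conditional Rényi divergence of (30), the reverse channel *B*<sub>*X*|*X'*</sub> obtained from (40) via Bayes' rule, and random channel matrices (all names are illustrative):

```python
import numpy as np

def cond_renyi(PyX, QyX, Px, a):
    # conditional Renyi divergence per (30), in nats, for a in (0, 1):
    # (1/(a-1)) * sum_x P(x) log sum_y P(y|x)^a Q(y|x)^(1-a)
    inner = np.sum(PyX**a * QyX**(1.0 - a), axis=1)
    return np.sum(Px * np.log(inner)) / (a - 1.0)

rng = np.random.default_rng(1)
nx, ny, nxp = 4, 3, 5
PyX = rng.random((nx, ny)); PyX /= PyX.sum(axis=1, keepdims=True)
QyX = rng.random((nx, ny)); QyX /= QyX.sum(axis=1, keepdims=True)
Px = rng.random(nx); Px /= Px.sum()
B = rng.random((nx, nxp)); B /= B.sum(axis=1, keepdims=True)  # B_{X'|X}

Pxp = Px @ B                                # P_{X'}
Brev = (Px[:, None] * B) / Pxp[None, :]     # B_{X|X'}(x|x'), Bayes rule, cf. (40)
PyXp = Brev.T @ PyX                         # P_{Y|X'}, cf. (41)
QyXp = Brev.T @ QyX                         # Q_{Y|X'}, cf. (42)

for a in (0.3, 0.7):                        # alpha in (0, 1), as in the proof
    assert cond_renyi(PyXp, QyXp, Pxp, a) <= cond_renyi(PyX, QyX, Px, a) + 1e-12
```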

#### **Appendix D. Proof of Corollary 1**

Applying Theorem 2 with X′ := {1} and the conditional PMF *B*<sub>*X'*|*X*</sub>(*x'*|*x*) := 1, we obtain

$$D_{\alpha}^{\mathrm{c}}(P_{Y|X'}\|Q_{Y|X'}|P_{X'}) \le D_{\alpha}^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X). \tag{A18}$$

To complete the proof of (48), observe that

$$D_{\alpha}^{\mathrm{c}}(P_{Y|X'}\|Q_{Y|X'}|P_{X'}) = D_{\alpha}^{\mathrm{c}}(P_Y\|Q_Y|P_{X'})\tag{A19}$$

$$=D_{\alpha}(P_Y\|Q_Y), \tag{A20}$$

where (A19) holds because (41) and (46) imply *P*<sub>*Y*|*X'*</sub>(*y*|*x'*) = *P<sub>Y</sub>*(*y*) and because (42) and (47) imply *Q*<sub>*Y*|*X'*</sub>(*y*|*x'*) = *Q<sub>Y</sub>*(*y*); and (A20) follows from Remark 1.

#### **Appendix E. Proof of Example 1**

If *α* = ∞, then it can be verified numerically that (53) holds for *ε* = 0.1. Fix now *α* ∈ (1, ∞). Then, for all *ε* ∈ (0, 1),

$$D_{\alpha}\left(P_Y \,\middle\|\, Q_Y^{(\epsilon)}\right) = \frac{1}{\alpha-1} \log \left[ 0.5^{\alpha} (1-\epsilon)^{1-\alpha} + 0.5^{\alpha}\, \epsilon^{1-\alpha} \right] \tag{A21}$$

$$\geq \frac{1}{\alpha - 1} \log \left[ 0.5^{\alpha}\, \epsilon^{1 - \alpha} \right] \tag{A22}$$

$$= \frac{\alpha}{\alpha - 1} \log 0.5 + \log \frac{1}{\epsilon}.\tag{A23}$$

The RHS of (53) satisfies, for sufficiently small *ε*,

$$D_{\alpha}^{\mathrm{c}}\left(P_{Y|X}^{(\epsilon)} \,\middle\|\, Q_{Y}^{(\epsilon)} \,\middle|\, P_X\right) = 0.5 \cdot 0 + 0.5 \cdot D_{\alpha}\left( P_{Y|X=1}^{(\epsilon)} \,\middle\|\, Q_{Y}^{(\epsilon)} \right) \tag{A24}$$

$$=\frac{0.5}{\alpha - 1} \log \left[ \epsilon^{\alpha} (1 - \epsilon)^{1 - \alpha} + (1 - \epsilon)^{\alpha}\, \epsilon^{1 - \alpha} \right] \tag{A25}$$

$$=\frac{0.5}{\alpha - 1} \log \left[ \epsilon^{1-\alpha} \left( (1 - \epsilon)^{\alpha} + \epsilon^{2\alpha-1} (1 - \epsilon)^{1-\alpha} \right) \right] \tag{A26}$$

$$\le \frac{0.5}{\alpha - 1} \log \left[ 2\, \epsilon^{1 - \alpha} \right] \tag{A27}$$

$$=\frac{0.5}{\alpha - 1} \log 2 + 0.5 \log \frac{1}{\epsilon}, \tag{A28}$$

where (A27) holds for sufficiently small *ε* because lim<sub>*ε*↓0</sub> [(1 − *ε*)<sup>*α*</sup> + *ε*<sup>2*α*−1</sup>(1 − *ε*)<sup>1−*α*</sup>] = 1. Because lim<sub>*ε*↓0</sub> log(1/*ε*) = ∞, (53) follows from (A23) and (A28) for sufficiently small *ε*.
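The comparison in (53) can be spot-checked numerically from the explicit expressions above. The sketch below assumes, as in (A24), that *P*<sup>(*ε*)</sup><sub>*Y*|*X*=0</sub> equals *Q*<sup>(*ε*)</sup><sub>*Y*</sub> so its divergence term vanishes; the PMFs follow (A21) and (A25):

```python
import numpy as np

def renyi(p, q, a):
    # Renyi divergence of order a (in nats) for discrete PMFs, a != 1
    return np.log(np.sum(p**a * q**(1.0 - a))) / (a - 1.0)

a, eps = 2.0, 0.01
Py  = np.array([0.5, 0.5])                  # P_Y
Qy  = np.array([1.0 - eps, eps])            # Q_Y^(eps)
Py1 = np.array([eps, 1.0 - eps])            # P_{Y|X=1}^(eps)

lhs = renyi(Py, Qy, a)                      # unconditional divergence, cf. (A21)
rhs = 0.5 * renyi(Py1, Qy, a)               # conditional divergence, cf. (A24)-(A25)
assert lhs > rhs                            # the strict gap claimed in (53)
```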

#### **Appendix F. Proof of Theorem 3**

Observe that, for all *x'* ∈ X and all *y'* ∈ Y′,

$$P_X(x')\,P_{Y'|X}(y'|x') = \sum_{x,y} P_X(x)\,P_{Y|X}(y|x)\,\mathbb{1}\{x'=x\}\,A_{Y'|XY}(y'|x,y), \tag{A29}$$

$$P_X(x')\,Q_{Y'|X}(y'|x') = \sum_{x,y} P_X(x)\,Q_{Y|X}(y|x)\,\mathbb{1}\{x'=x\}\,A_{Y'|XY}(y'|x,y).\tag{A30}$$

Hence, (68) follows from (54) and

$$D_{\alpha}(P_X P_{Y'|X} \,\|\, P_X Q_{Y'|X}) \le D_{\alpha}(P_X P_{Y|X} \,\|\, P_X Q_{Y|X}), \tag{A31}$$

which follows from the data-processing inequality for the Rényi divergence by substituting 𝟙{*X'* = *X*} *A*<sub>*Y'*|*XY*</sub> for *A*<sub>*X'Y'*|*XY*</sub> in Proposition 1 (h).

#### **Appendix G. Proof of Theorem 4**

Observe that, for all *x'* ∈ X′ and all *y'* ∈ Y,

$$P_{X'}(x')\,P_{Y|X'}(y'|x') = \sum_{x,y} P_X(x)\,P_{Y|X}(y|x)\,B_{X'|X}(x'|x)\,\mathbb{1}\{y'=y\},\tag{A32}$$

$$P_{X'}(x')\,Q_{Y|X'}(y'|x') = \sum_{x,y} P_X(x)\,Q_{Y|X}(y|x)\,B_{X'|X}(x'|x)\,\mathbb{1}\{y'=y\}.\tag{A33}$$

Hence, (73) follows from (54) and

$$D_{\alpha}\left(P_{X'} P_{Y|X'} \,\middle\|\, P_{X'} Q_{Y|X'}\right) \le D_{\alpha}\left(P_X P_{Y|X} \,\middle\|\, P_X Q_{Y|X}\right),\tag{A34}$$

which follows from the data-processing inequality for the Rényi divergence by substituting *B*<sub>*X'*|*X*</sub> 𝟙{*Y'* = *Y*} for *A*<sub>*X'Y'*|*XY*</sub> in Proposition 1 (h).

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

#### *Article* **On Relations Between the Relative Entropy and *χ*<sup>2</sup>-Divergence, Generalizations and Applications**

**Tomohiro Nishiyama <sup>1</sup> and Igal Sason <sup>2,</sup>\***


Received: 22 April 2020; Accepted: 17 May 2020; Published: 18 May 2020

**Abstract:** The relative entropy and the chi-squared divergence are fundamental divergence measures in information theory and statistics. This paper is focused on a study of integral relations between the two divergences, the implications of these relations, their information-theoretic applications, and some generalizations pertaining to the rich class of *f*-divergences. Applications that are studied in this paper refer to lossless compression, the method of types and large deviations, strong data–processing inequalities, bounds on contraction coefficients and maximal correlation, and the convergence rate to stationarity of a type of discrete-time Markov chains.

**Keywords:** relative entropy; chi-squared divergence; *f*-divergences; method of types; large deviations; strong data–processing inequalities; information contraction; maximal correlation; Markov chains

#### **1. Introduction**

The relative entropy (also known as the Kullback–Leibler divergence [1]) and the chi-squared divergence [2] are divergence measures which play a key role in information theory, statistics, learning, signal processing, and other theoretical and applied branches of mathematics. These divergence measures are fundamental in problems pertaining to source and channel coding, combinatorics and large deviations theory, goodness-of-fit and independence tests in statistics, expectation–maximization iterative algorithms for estimating a distribution from incomplete data, and other sorts of problems (the reader is referred to the tutorial paper by Csiszár and Shields [3]). They both belong to an important class of divergence measures, defined by means of convex functions *f*, and named *f*-divergences [4–8]. In addition to the relative entropy and the chi-squared divergence, this class unifies other useful divergence measures such as the total variation distance in functional analysis, and it is also closely related to the Rényi divergence which generalizes the relative entropy [9,10]. In general, *f*-divergences (defined in Section 2) are attractive since they satisfy pleasing features such as the data–processing inequality, convexity, (semi)continuity, and duality properties, and they therefore find nice applications in information theory and statistics (see, e.g., [6,8,11,12]).

In this work, we study integral relations between the relative entropy and the chi-squared divergence, implications of these relations, and some of their information-theoretic applications. Some generalizations which apply to the class of *f*-divergences are also explored in detail. In this context, it should be noted that integral representations of general *f*-divergences, expressed as a function of the DeGroot statistical information [13], the *E<sub>γ</sub>*-divergence (a parametric sub-class of *f*-divergences which generalizes the total variation distance [14] [p. 2314]), or the relative information spectrum, have been derived in [12] [Section 5], [15] [Section 7.B], and [16] [Section 3], respectively.

Applications in this paper are related to lossless source compression, large deviations by the method of types, and strong data–processing inequalities. The relevant background for each of these applications is provided to make the presentation self-contained.

We next outline the paper contributions and the structure of our manuscript.

#### *1.1. Paper Contributions*

This work starts by introducing integral relations between the relative entropy and the chi-squared divergence, and some inequalities which relate these two divergences (see Theorem 1, its corollaries, and Proposition 1). It continues with a study of the implications and generalizations of these relations, pertaining to the rich class of *f*-divergences. One implication leads to a tight lower bound on the relative entropy between a pair of probability measures, expressed as a function of the means and variances under these measures (see Theorem 2). A second implication of Theorem 1 leads to an upper bound on a skew divergence (see Theorem 3 and Corollary 3). In view of the concavity of the Shannon entropy, the concavity deficit of the entropy function is defined as the non-negative difference between the entropy of a convex combination of distributions and the corresponding convex combination of the entropies of these distributions. Corollary 4 provides an upper bound on this deficit, expressed as a function of the pairwise relative entropies between all pairs of distributions. Theorem 4 provides a generalization of Theorem 1 to the class of *f*-divergences. It recursively constructs non-increasing sequences of *f*-divergences and, as a consequence of Theorem 4 followed by the usage of polylogarithms, Corollary 5 provides a generalization of the useful integral relation in Theorem 1 between the relative entropy and the chi-squared divergence. Theorem 5 relates probabilities of sets to *f*-divergences, generalizing a known and useful result by Csiszár for the relative entropy. With respect to Theorem 1, the integral relation between the relative entropy and the chi-squared divergence has been independently derived in [17], which also derived an alternative upper bound on the concavity deficit of the entropy as a function of total variation distances (differing from the bound in Corollary 4, which depends on pairwise relative entropies). 
The interested reader is referred to [17], with a preprint of the extended version in [18], and to [19] where the connections in Theorem 1 were originally discovered in the quantum setting.

The second part of this work studies information-theoretic applications of the above results. These are ordered by starting from the relatively simple applications, and ending at the more complicated ones. The first one includes a bound on the redundancy of the Shannon code for universal lossless compression with discrete memoryless sources, used in conjunction with Theorem 3 (see Section 4.1). An application of Theorem 2 in the context of the method of types and large deviations analysis is then studied in Section 4.2, providing non-asymptotic bounds which lead to a closed-form expression as a function of the Lambert *W* function (see Proposition 2). Strong data–processing inequalities with bounds on contraction coefficients of skew divergences are provided in Theorem 6, Corollary 7 and Proposition 3. Consequently, non-asymptotic bounds on the convergence to stationarity of time-homogeneous, irreducible, and reversible discrete-time Markov chains with finite state spaces are obtained by relying on our bounds on the contraction coefficients of skew divergences (see Theorem 7). The exact asymptotic convergence rate is also obtained in Corollary 8. Finally, a property of maximal correlations is obtained in Proposition 4 as an application of our starting point on the integral relation between the relative entropy and the chi-squared divergence.

#### *1.2. Paper Organization*

This paper is structured as follows. Section 2 presents notation and preliminary material which is necessary for, or otherwise related to, the exposition of this work. Section 3 refers to the developed relations between divergences, and Section 4 studies information-theoretic applications. Proofs of the results in Sections 3 and 4 (except for short proofs) are deferred to Section 5.

#### **2. Preliminaries and Notation**

This section provides definitions of divergence measures which are used in this paper, and it also provides relevant notation.

**Definition 1.** *[12] [p. 4398] Let P and Q be probability measures, let µ be a dominating measure of P and Q (i.e., P*, *Q* ≪ *µ), and let p* := d*P*/d*µ and q* := d*Q*/d*µ be the densities of P and Q with respect to µ. The f*-divergence *from P to Q is given by*

$$D_f(P\|Q) := \int q\, f\!\left(\frac{p}{q}\right) \mathrm{d}\mu,\tag{1}$$

*where*

$$f(0) := \lim\_{t \to 0^+} f(t), \quad 0f\left(\frac{0}{0}\right) := 0,\tag{2}$$

$$0f\left(\frac{a}{0}\right) := \lim\_{t \to 0^+} tf\left(\frac{a}{t}\right) = a \lim\_{u \to \infty} \frac{f(u)}{u}, \quad a > 0. \tag{3}$$

*It should be noted that the right side of* (1) *does not depend on the dominating measure µ.*

Throughout the paper, we denote by 1{relation} the indicator function; it is equal to 1 if the relation is true, and to 0 otherwise. Unless indicated explicitly, logarithms have an arbitrary common base (larger than 1), and exp(·) denotes the inverse function of the logarithm with that base.

**Definition 2.** *[1] The* relative entropy *is the f -divergence with f*(*t*) := *t* log *t for t* > 0*,*

$$D(P\|Q) := D_f(P\|Q) \tag{4}$$

$$= \int p \log \frac{p}{q}\, \mathrm{d}\mu. \tag{5}$$

**Definition 3.** *The* total variation distance *between probability measures P and Q is the f-divergence from P to Q with f*(*t*) := |*t* − 1| *for all t* ≥ 0*. It is a symmetric f -divergence, denoted by* |*P* − *Q*|*, which is given by*

$$|P - Q| := D_f(P\|Q) \tag{6}$$

$$= \int |p - q| \,\mathrm{d}\mu. \tag{7}$$

**Definition 4.** *[2] The* chi-squared divergence *from P to Q is defined to be the f-divergence in* (1) *with f*(*t*) := (*t* − 1)² *or f*(*t*) := *t*² − 1 *for all t* > 0*,*

$$\chi^2(P\|Q) := D_f(P\|Q) \tag{8}$$

$$= \int \frac{(p-q)^2}{q}\, \mathrm{d}\mu = \int \frac{p^2}{q}\, \mathrm{d}\mu - 1. \tag{9}$$

The Rényi divergence, a generalization of the relative entropy, was introduced by Rényi [10] in the special case of finite alphabets. Its general definition is given as follows (see, e.g., [9]).

**Definition 5.** *[10] Let P and Q be probability measures on* X *dominated by µ, and let their densities be respectively denoted by p* = d*P*/d*µ and q* = d*Q*/d*µ. The* Rényi divergence *of order α* ∈ [0, ∞] *is defined as follows:*

*Entropy* **2020**, *22*, 563

• *If α* ∈ (0, 1) ∪ (1, ∞)*, then*

$$D_{\alpha}(P\|Q) = \frac{1}{\alpha-1} \log \mathbb{E}\left[p^{\alpha}(Z)\, q^{1-\alpha}(Z)\right] \tag{10}$$

$$= \frac{1}{\alpha - 1} \log \sum_{x \in \mathcal{X}} P^{\alpha}(x)\, Q^{1-\alpha}(x), \tag{11}$$

*where Z* ∼ *µ in* (10)*, and* (11) *holds if* X *is a discrete set.*

• *By the continuous extension of D<sub>α</sub>*(*P*‖*Q*)*,*

$$D_0(P\|Q) = \max_{\mathcal{A}:\, P(\mathcal{A}) = 1} \log \frac{1}{Q(\mathcal{A})}\,,\tag{12}$$

$$D_1(P\|Q) = D(P\|Q),\tag{13}$$

$$D_{\infty}(P\|Q) = \log \operatorname{ess\,sup} \frac{p(Z)}{q(Z)}.\tag{14}$$

The second-order Rényi divergence and the chi-squared divergence are related as follows:

$$D_2(P\|Q) = \log\left(1 + \chi^2(P\|Q)\right),\tag{15}$$

and the relative entropy and the chi-squared divergence satisfy (see, e.g., [20] [Theorem 5])

$$D(P\|Q) \le \log\left(1 + \chi^2(P\|Q)\right). \tag{16}$$

Inequality (16) readily follows from (13), (15), and the fact that *D<sub>α</sub>*(*P*‖*Q*) is monotonically increasing in *α* ∈ (0, ∞) (see [9] [Theorem 3]). A tightened version of (16), introducing an improved and locally-tight upper bound on *D*(*P*‖*Q*) as a function of *χ*²(*P*‖*Q*) and *χ*²(*Q*‖*P*), is introduced in [15] [Theorem 20]. Another sharpened version of (16) is derived in [15] [Theorem 11] under the assumption of a bounded relative information. Furthermore, under the latter assumption, tight upper and lower bounds on the ratio *D*(*P*‖*Q*)/*χ*²(*P*‖*Q*) are obtained in [15] [(169)].
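The identity (15) and the inequality (16) can be checked numerically on a random pair of discrete distributions (an illustrative sketch; divergences are computed in nats, so log e = 1):

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.random(6); p /= p.sum()
q = rng.random(6); q /= q.sum()

kl   = np.sum(p * np.log(p / q))            # D(P||Q), nats
chi2 = np.sum(p**2 / q) - 1.0               # chi^2(P||Q), cf. (9)
d2   = np.log(np.sum(p**2 / q))             # D_2(P||Q), order-2 Renyi divergence

assert abs(d2 - np.log1p(chi2)) < 1e-12     # identity (15)
assert kl <= np.log1p(chi2) + 1e-12         # inequality (16)
```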

**Definition 6.** *[21] The* Györfi–Vajda divergence *of order s* ∈ [0, 1] *is an f -divergence with*

$$f(t) = \phi_s(t) := \frac{(t-1)^2}{s + (1-s)t}, \quad t \ge 0. \tag{17}$$

*The Vincze–Le Cam distance (also known as the triangular discrimination) [22,23] is a special case with s* = 1/2*.*

In view of (1), (9) and (17), it can be verified that the Györfi–Vajda divergence is related to the chi-squared divergence as follows:

$$D_{\phi_s}(P\|Q) = \begin{cases} \dfrac{1}{s^2} \cdot \chi^2\bigl(P \,\big\|\, (1-s)P + sQ\bigr), & s \in (0,1],\\[4pt] \chi^2(Q\|P), & s = 0. \end{cases} \tag{18}$$

Hence,

$$D_{\phi_1}(P\|Q) = \chi^2(P\|Q),\tag{19}$$

$$D_{\phi_0}(P\|Q) = \chi^2(Q\|P). \tag{20}$$
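The relation (18) between the Györfi–Vajda divergence and the chi-squared divergence is an exact algebraic identity in the discrete case, and it is easy to verify numerically (an illustrative sketch; the helpers `f_div` and `chi2` implement (1) and (9) for finite alphabets):

```python
import numpy as np

def f_div(p, q, f):                          # discrete f-divergence, cf. (1)
    return np.sum(q * f(p / q))

def chi2(p, q):                              # chi^2(P||Q), cf. (9)
    return np.sum((p - q) ** 2 / q)

rng = np.random.default_rng(3)
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()

for s in (0.25, 0.5, 1.0):
    phi_s = lambda t, s=s: (t - 1.0) ** 2 / (s + (1.0 - s) * t)   # (17)
    lhs = f_div(p, q, phi_s)                 # D_{phi_s}(P||Q)
    rhs = chi2(p, (1.0 - s) * p + s * q) / s**2   # right side of (18)
    assert abs(lhs - rhs) < 1e-12
```

The case *s* = 1 recovers (19), since *φ*<sub>1</sub>(*t*) = (*t* − 1)².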

#### **3. Relations between Divergences**

We introduce in this section results on the relations between the relative entropy and the chi-squared divergence, their implications, and generalizations. Information–theoretic applications are studied in the next section.

#### *3.1. Relations between the Relative Entropy and the Chi-Squared Divergence*

The following result relates the relative entropy and the chi-squared divergence, which are two fundamental divergence measures in information theory and statistics. This result was recently obtained in an equivalent form in [17] [(12)] (it is noted that this identity was also independently derived by the coauthors in two separate unpublished works in [24] [(16)] and [25]). It should be noted that these connections between divergences in the quantum setting were originally discovered in [19] [Theorem 6]. Beyond serving as an interesting relation between these two fundamental divergence measures, it is introduced here for the following reasons:


**Theorem 1.** *Let P and Q be probability measures defined on a measurable space* (X , F )*, and let*

$$R_{\lambda} := (1 - \lambda)P + \lambda Q, \quad \lambda \in [0, 1] \tag{21}$$

*be the convex combination of P and Q. Then, for all λ* ∈ [0, 1]*,*

$$\frac{1}{\log e}\, D(P\|R_{\lambda}) = \int_{0}^{\lambda} \chi^{2}(P\|R_{s})\, \frac{\mathrm{d}s}{s}\,,\tag{22}$$

$$\frac{1}{2}\,\lambda^2\, \chi^2(P\|Q) = \int_0^\lambda \chi^2(R_{1-s}\|Q)\,\frac{\mathrm{d}s}{s}.\tag{23}$$

**Proof.** See Section 5.1.

A specialization of Theorem 1 by letting *λ* = 1 gives the following identities.

**Corollary 1.**

$$\frac{1}{\log e}\, D(P\|Q) = \int_0^1 \chi^2\bigl(P \,\big\|\, (1 - s)P + sQ\bigr)\, \frac{\mathrm{d}s}{s}\,,\tag{24}$$

$$\frac{1}{2}\,\chi^2(P\|Q) = \int_0^1 \chi^2\bigl(sP + (1-s)Q \,\big\|\, Q\bigr)\, \frac{\mathrm{d}s}{s}.\tag{25}$$
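Both identities in Corollary 1 can be verified numerically by a simple midpoint-rule quadrature (an illustrative sketch for finite alphabets; working in nats makes the factor 1/log e equal to 1):

```python
import numpy as np

rng = np.random.default_rng(4)
p = rng.random(4); p /= p.sum()
q = rng.random(4); q /= q.sum()

kl = np.sum(p * np.log(p / q))                  # D(P||Q) in nats

n = 100_000
s = ((np.arange(n) + 0.5) / n)[:, None]         # midpoint grid on (0, 1)
R = (1 - s) * p + s * q                         # rows: (1-s)P + sQ
g24 = ((p - R) ** 2 / R).sum(axis=1) / s[:, 0]  # chi^2(P || (1-s)P+sQ) / s
T = s * p + (1 - s) * q                         # rows: sP + (1-s)Q
g25 = ((T - q) ** 2 / q).sum(axis=1) / s[:, 0]  # chi^2(sP+(1-s)Q || Q) / s

chi2_pq = np.sum((p - q) ** 2 / q)
assert abs(g24.mean() - kl) < 1e-3              # identity (24)
assert abs(g25.mean() - 0.5 * chi2_pq) < 1e-3   # identity (25)
```

Note that the integrand of (25) equals *s* *χ*²(*P*‖*Q*) exactly, since the chi-squared divergence is quadratic in its first argument's perturbation, which explains the factor ½.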

**Remark 1.** *The substitution s* := 1/(1 + *t*) *transforms* (24) *to [26] [Equation (31)], i.e.,*

$$\frac{1}{\log e}\, D(P\|Q) = \int_0^\infty \chi^2\left(P \,\middle\|\, \frac{tP + Q}{1 + t}\right) \frac{\mathrm{d}t}{1 + t}. \tag{26}$$

In view of (18) and (21), an equivalent form of (22) and (24) is given as follows:

**Corollary 2.** *For s* ∈ [0, 1]*, let φ<sub>s</sub>* : [0, ∞) → R *be given in* (17)*. Then,*

$$\frac{1}{\log e}\, D(P\|R_{\lambda}) = \int_{0}^{\lambda} s\, D_{\phi_s}(P\|Q)\, \mathrm{d}s, \quad \lambda \in [0, 1], \tag{27}$$

$$\frac{1}{\log e}\, D(P\|Q) = \int_0^1 s\, D_{\phi_s}(P\|Q)\, \mathrm{d}s. \tag{28}$$

By Corollary 1, we obtain original and simple proofs of new and old *f*-divergence inequalities, collected in the following proposition.

**Proposition 1.** *(a)*


$$D(P\|Q) \ge \frac{1}{2}\,|P - Q|^2 \log e.\tag{29}$$

*(b)*

$$\frac{1}{\log e}\, D(P\|Q) \le \frac{1}{3}\,\chi^2(P\|Q) + \frac{1}{6}\,\chi^2(Q\|P). \tag{30}$$

*Furthermore, let* {*Pn*} *be a sequence of probability measures that is defined on a measurable space* (X , F )*, and which converges to a probability measure P in the sense that*

$$\lim_{n \to \infty} \operatorname{ess\,sup} \frac{\mathrm{d}P_n}{\mathrm{d}P}(X) = 1,\tag{31}$$

*with X* ∼ *P. Then,* (30) *is locally tight in the sense that both of its sides converge to 0, and*

$$\lim_{n \to \infty} \frac{\frac{1}{3}\,\chi^2(P_n\|P) + \frac{1}{6}\,\chi^2(P\|P_n)}{\frac{1}{\log e}\, D(P_n\|P)} = 1. \tag{32}$$

*(c) For all θ* ∈ (0, 1)*,*

$$D(P\|Q) \ge (1-\theta) \log\left(\frac{1}{1-\theta}\right) D_{\phi_\theta}(P\|Q). \tag{33}$$

*Moreover, under the assumption in* (31)*, for all θ* ∈ [0, 1]

$$\lim_{n \to \infty} \frac{D(P\|P_n)}{D_{\phi_\theta}(P\|P_n)} = \frac{1}{2} \log e. \tag{34}$$

*(d) [15] [Theorem 2]:*

$$\frac{1}{\log e}\, D(P\|Q) \le \frac{1}{2}\,\chi^2(P\|Q) + \frac{1}{4}\,|P - Q|.\tag{35}$$
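A quick numerical check of the inequalities above on a random pair of PMFs (an illustrative sketch; all divergences are in nats, so log e = 1):

```python
import numpy as np

rng = np.random.default_rng(5)
p = rng.random(6); p /= p.sum()
q = rng.random(6); q /= q.sum()

kl   = np.sum(p * np.log(p / q))             # D(P||Q), nats
tv   = np.sum(np.abs(p - q))                 # |P - Q|, cf. (7)
c_pq = np.sum((p - q) ** 2 / q)              # chi^2(P||Q)
c_qp = np.sum((p - q) ** 2 / p)              # chi^2(Q||P)

assert kl >= 0.5 * tv**2                     # cf. (29), Pinsker-type bound
assert kl <= c_pq / 3 + c_qp / 6             # cf. (30)
assert kl <= 0.5 * c_pq + 0.25 * tv          # cf. (35)
```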

**Proof.** See Section 5.2.

**Remark 2.** *Inequality* (30) *is locally tight in the sense that* (31) *yields* (32)*. This property, however, is not satisfied by* (16) *since the assumption in* (31) *implies that*

$$\lim_{n \to \infty} \frac{\log\left(1 + \chi^2(P_n\|P)\right)}{D(P_n\|P)} = 2.\tag{36}$$

**Remark 3.** *Inequality* (30) *readily yields*

$$D(P\|Q) + D(Q\|P) \le \frac{1}{2}\left(\chi^2(P\|Q) + \chi^2(Q\|P)\right) \log e,\tag{37}$$

*which is proved by a different approach in [27] [Proposition 4]. It is further shown in [15] [Theorem 2 b)] that*

$$\sup \frac{D(P\|Q) + D(Q\|P)}{\chi^2(P\|Q) + \chi^2(Q\|P)} = \frac{1}{2} \log e,\tag{38}$$

*where the supremum is over all P* ≪ *Q with P* ≠ *Q.*

#### *3.2. Implications of Theorem 1*

We next provide two implications of Theorem 1. The first implication, which relies on the Hammersley–Chapman–Robbins (HCR) bound for the chi-squared divergence [28,29], gives the following tight lower bound on the relative entropy *D*(*P*‖*Q*) as a function of the means and variances under *P* and *Q*.

**Theorem 2.** *Let P and Q be probability measures defined on the measurable space* (R, B)*, where* R *is the real line and* B *is the Borel σ-algebra of subsets of* R*. Let m<sub>P</sub>, m<sub>Q</sub>, σ<sub>P</sub>², and σ<sub>Q</sub>² denote the expected values and variances of X* ∼ *P and Y* ∼ *Q, i.e.,*

$$\mathbb{E}[X] =: m_P, \quad \mathbb{E}[Y] =: m_Q, \quad \operatorname{Var}(X) =: \sigma_P^2, \quad \operatorname{Var}(Y) =: \sigma_Q^2. \tag{39}$$

*(a) If m<sub>P</sub>* ≠ *m<sub>Q</sub>, then*

$$D(P\|Q) \ge d(r\|s),\tag{40}$$

*where d*(*r*‖*s*) := *r* log(*r*/*s*) + (1 − *r*) log((1 − *r*)/(1 − *s*))*, for r*,*s* ∈ [0, 1]*, denotes the binary relative entropy (with the convention that* 0 log(0/0) = 0*), and*

$$r := \frac{1}{2} + \frac{b}{4av} \in [0, 1], \tag{41}$$

$$s := r - \frac{a}{2v} \in [0, 1], \tag{42}$$

$$a := m_P - m_Q, \tag{43}$$

$$b := a^2 + \sigma_Q^2 - \sigma_P^2, \tag{44}$$

$$v := \sqrt{\sigma_P^2 + \frac{b^2}{4a^2}}.\tag{45}$$

*(b) The lower bound on the right side of* (40) *is attained for P and Q which are defined on the two-element set* U := {*u*<sub>1</sub>, *u*<sub>2</sub>}*, and*

$$P(u_1) = r, \quad Q(u_1) = s,\tag{46}$$

*with r and s in* (41) *and* (42)*, respectively, and for m<sub>P</sub>* ≠ *m<sub>Q</sub>,*

$$u_1 := m_P + \sqrt{\frac{(1-r)\,\sigma_P^2}{r}}, \quad u_2 := m_P - \sqrt{\frac{r\,\sigma_P^2}{1-r}}.\tag{47}$$

*(c) If m<sub>P</sub>* = *m<sub>Q</sub>, then*

$$\inf_{P,Q} D(P\|Q) = 0,\tag{48}$$

*where the infimum on the left side of* (48) *is taken over all P and Q which satisfy* (39)*.*

**Proof.** See Section 5.3.
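The two-point construction of Item (b) is easy to check numerically: for example moments with *m<sub>P</sub>* ≠ *m<sub>Q</sub>*, the measures defined by (41), (42), (46), and (47) reproduce the prescribed means and variances in (39), and the bound (40) equals the binary relative entropy *d*(*r*‖*s*) (an illustrative sketch; the moment values are arbitrary):

```python
import numpy as np

mP, mQ, sP2, sQ2 = 0.0, 1.0, 1.0, 2.0       # example moments with mP != mQ

a = mP - mQ                                 # (43)
b = a**2 + sQ2 - sP2                        # (44)
v = np.sqrt(sP2 + b**2 / (4 * a**2))        # (45)
r = 0.5 + b / (4 * a * v)                   # (41)
s = r - a / (2 * v)                         # (42)

u1 = mP + np.sqrt((1 - r) * sP2 / r)        # (47)
u2 = mP - np.sqrt(r * sP2 / (1 - r))

# P = (r, 1-r) and Q = (s, 1-s) on {u1, u2} match the moment constraints (39):
assert abs(r * u1 + (1 - r) * u2 - mP) < 1e-9
assert abs(r * (u1 - mP)**2 + (1 - r) * (u2 - mP)**2 - sP2) < 1e-9
assert abs(s * u1 + (1 - s) * u2 - mQ) < 1e-9
assert abs(s * (u1 - mQ)**2 + (1 - s) * (u2 - mQ)**2 - sQ2) < 1e-9

d_rs = r * np.log(r / s) + (1 - r) * np.log((1 - r) / (1 - s))  # bound (40), nats
assert d_rs > 0
```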

**Remark 4.** *Consider the case of non-equal means in Items (a) and (b) of Theorem 2. If these means are fixed, then the infimum of D*(*P*‖*Q*) *is zero by choosing arbitrarily large equal variances. Suppose now that the non-equal means m<sub>P</sub> and m<sub>Q</sub> are fixed, as well as one of the variances (either σ<sub>P</sub>² or σ<sub>Q</sub>²). Numerical experimentation shows that, in this case, the achievable lower bound in* (40) *is monotonically decreasing as a function of the other variance, and it tends to zero as we let the free variance tend to infinity. This asymptotic convergence to zero can be justified by assuming, for example, that m<sub>P</sub>, m<sub>Q</sub>, and σ<sub>Q</sub>² are fixed, and m<sub>P</sub>* > *m<sub>Q</sub> (the other cases can be justified in a similar way). Then, it can be verified from* (41)*–*(45) *that*

$$r = \frac{(m\_P - m\_Q)^2}{\sigma\_P^2} + O\left(\frac{1}{\sigma\_P^4}\right), \quad s = O\left(\frac{1}{\sigma\_P^4}\right),\tag{49}$$

*which implies that d*(*r*‖*s*) → 0 *as we let σ<sub>P</sub>* → ∞*. The infimum of the relative entropy D*(*P*‖*Q*) *is therefore equal to zero since the probability measures P and Q in* (46) *and* (47)*, which are defined on a two-element set and attain the lower bound on the relative entropy under the constraints in* (39)*, have a vanishing relative entropy in this asymptotic case.*

**Remark 5.** *The proof of Item (c) in Theorem 2 suggests explicit constructions of sequences of pairs of probability measures* {(*P<sub>n</sub>*, *Q<sub>n</sub>*)} *which satisfy the constraints in* (39) *and for which*

$$\lim_{n \to \infty} D(P_n\|Q_n) = 0.$$

*This yields in particular* (48)*.*

A second consequence of Theorem 1 gives the following result. Its first part holds due to the concavity of exp(−*D*(*P*‖·)) (see [30] [Problem 4.2]). The second part is new, and its proof relies on Theorem 1. As an educational note, we provide an alternative proof of the first part by relying on Theorem 1.

**Theorem 3.** *Let P* ≪ *Q, and let F* : [0, 1] → [0, ∞) *be given by*

$$F(\lambda) := D\left(P \parallel (1 - \lambda)P + \lambda Q\right), \quad \forall \lambda \in [0, 1]. \tag{50}$$

*Then, for all λ* ∈ [0, 1]*,*

$$F(\lambda) \le \log\left(\frac{1}{1 - \lambda + \lambda \exp\left(-D(P\|Q)\right)}\right),\tag{51}$$

*with an equality if λ* = 0 *or λ* = 1*. Moreover, F is monotonically increasing, differentiable, and it satisfies*

$$F'(\lambda) \ge \frac{1}{\lambda} \left[ \exp \left( F(\lambda) \right) - 1 \right] \log \mathbf{e}, \quad \forall \lambda \in (0, 1], \tag{52}$$

$$\lim\_{\lambda \to 0^+} \frac{F'(\lambda)}{\lambda} = \chi^2(Q \| P) \text{ log\,e},\tag{53}$$

*so the limit in* (53) *is twice as large as the lower bound on this limit that follows from the right side of* (52)*.*

**Proof.** See Section 5.4.

**Remark 6.** *By the convexity of the relative entropy, it follows that F*(*λ*) ≤ *λ D*(*P*k*Q*) *for all λ* ∈ [0, 1]*. It can be verified, however, that the inequality* 1 − *λ* + *λ* exp(−*x*) ≥ exp(−*λx*) *holds for all x* ≥ 0 *and λ* ∈ [0, 1]*. Letting x* := *D*(*P*k*Q*) *implies that the upper bound on F*(*λ*) *on the right side of* (51) *is tighter than or equal to the upper bound λ D*(*P*k*Q*) *(with an equality if and only if either λ* ∈ {0, 1} *or P* ≡ *Q).*
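Both the skew-divergence bound (51) and the comparison in Remark 6 can be checked numerically on a random pair of PMFs (an illustrative sketch; divergences in nats, so exp(·) is the natural exponential):

```python
import numpy as np

rng = np.random.default_rng(6)
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()

kl = np.sum(p * np.log(p / q))                     # D(P||Q), nats

for lam in (0.1, 0.5, 0.9):
    r = (1 - lam) * p + lam * q                    # R_lambda, cf. (21)
    F = np.sum(p * np.log(p / r))                  # F(lam) = D(P || R_lambda)
    ub = -np.log(1 - lam + lam * np.exp(-kl))      # right side of (51)
    assert F <= ub + 1e-12                         # inequality (51)
    assert ub <= lam * kl + 1e-12                  # Remark 6: (51) is tighter
```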

**Corollary 3.** *Let* {*P<sub>j</sub>*}<sup>*m*</sup><sub>*j*=1</sub>*, with m* ∈ N*, be probability measures defined on a measurable space* (X , F )*, and let* {*α<sub>j</sub>*}<sup>*m*</sup><sub>*j*=1</sub> *be a sequence of non-negative numbers that sum to 1. Then, for all i* ∈ {1, . . . , *m*}*,*

$$D\left(P_i \,\middle\|\, \sum_{j=1}^m \alpha_j P_j\right) \le -\log\left(\alpha_i + (1 - \alpha_i) \exp\left(-\frac{1}{1 - \alpha_i} \sum_{j \ne i} \alpha_j\, D(P_i\|P_j)\right)\right).\tag{54}$$

**Proof.** For an arbitrary *i* ∈ {1, . . . , *m*}, apply the upper bound on the right side of (51) with *λ* := 1 − *α<sub>i</sub>*, *P* := *P<sub>i</sub>*, and *Q* := (1/(1 − *α<sub>i</sub>*)) ∑<sub>*j*≠*i*</sub> *α<sub>j</sub>P<sub>j</sub>*. The right side of (54) is obtained from (51) by invoking the convexity of the relative entropy, which gives *D*(*P<sub>i</sub>*‖*Q*) ≤ (1/(1 − *α<sub>i</sub>*)) ∑<sub>*j*≠*i*</sub> *α<sub>j</sub>D*(*P<sub>i</sub>*‖*P<sub>j</sub>*).

The next result provides an upper bound on the non-negative difference between the entropy of a convex combination of distributions and the respective convex combination of the individual entropies (this difference is also termed the concavity deficit of the entropy function in [17] [Section 3]).

**Corollary 4.** *Let {P_j}_{j=1}^m, with m ∈ ℕ, be probability measures defined on a measurable space (X, F), and let {α_j}_{j=1}^m be a sequence of non-negative numbers that sum to 1. Then,*

$$0 \le H\left(\sum\_{j=1}^{m} \alpha\_j P\_j\right) - \sum\_{j=1}^{m} \alpha\_j H(P\_j) \le -\sum\_{i=1}^{m} \alpha\_i \log\left(\alpha\_i + (1 - \alpha\_i) \exp\left(-\frac{1}{1 - \alpha\_i} \sum\_{j \ne i} \alpha\_j D(P\_i \| P\_j)\right)\right). \tag{55}$$

**Proof.** The lower bound holds due to the concavity of the entropy function. The upper bound readily follows from Corollary 3, and the identity

$$H\left(\sum\_{j=1}^{m} \alpha\_j P\_j\right) - \sum\_{j=1}^{m} \alpha\_j H(P\_j) = \sum\_{i=1}^{m} \alpha\_i D\left(P\_i \parallel \sum\_{j=1}^{m} \alpha\_j P\_j\right). \tag{56}$$

**Remark 7.** *The upper bound in* (55) *refines the known bound (see, e.g., [31] [Lemma 2.2])*

$$H\left(\sum\_{j=1}^{m} \alpha\_{j} P\_{j}\right) - \sum\_{j=1}^{m} \alpha\_{j} H(P\_{j}) \le \sum\_{j=1}^{m} \alpha\_{j} \log \frac{1}{\alpha\_{j}} = H(\underline{\alpha}),\tag{57}$$

*by relying on all the ½ m(m − 1) pairwise relative entropies between the individual distributions {P_j}_{j=1}^m. Another refinement of* (57)*, expressed in terms of total variation distances, has recently been provided in [17] [Theorem 3.1].*
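As a quick numerical sanity check (a toy example with made-up distributions, not taken from the text), the chain "concavity deficit ≤ upper bound (55) ≤ upper bound (57)" can be verified directly, with all quantities in nats:

```python
import math

def H(p):  # Shannon entropy in nats
    return -sum(x * math.log(x) for x in p if x > 0)

def D(p, q):  # relative entropy in nats (assumes q > 0 wherever p > 0)
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

P = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]   # three PMFs on {0,1,2}
alpha = [0.5, 0.3, 0.2]
mix = [sum(a * p[k] for a, p in zip(alpha, P)) for k in range(3)]

deficit = H(mix) - sum(a * H(p) for a, p in zip(alpha, P))  # LHS of (55)

bound_55 = -sum(                                            # RHS of (55)
    alpha[i] * math.log(
        alpha[i] + (1 - alpha[i]) * math.exp(
            -sum(alpha[j] * D(P[i], P[j]) for j in range(3) if j != i)
            / (1 - alpha[i])
        )
    )
    for i in range(3)
)

bound_57 = H(alpha)                                         # weaker bound (57)
```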

#### *3.3. Monotonic Sequences of f-Divergences and an Extension of Theorem 1*

The present subsection generalizes Theorem 1, and it also provides relations between *f*-divergences which are defined in a recursive way.

**Theorem 4.** *Let P and Q be probability measures defined on a measurable space (X, F). Let R_λ, for λ ∈ [0, 1], be the convex combination of P and Q as in (21). Let f_0 : (0, ∞) → ℝ be a convex function with f_0(1) = 0, and let {f_k(·)}_{k=0}^∞ be a sequence of functions defined on (0, ∞) by the recursive equation*

$$f\_{k+1}(x) := \int\_0^{1-x} f\_k(1-s) \, \frac{\mathrm{d}s}{s}, \quad x > 0, \; k \in \{0, 1, \ldots\}. \tag{58}$$

*Then,*

$$(a) \quad \left\{ D\_{f\_k}(P \| Q) \right\}\_{k=0}^{\infty} \text{ is a non-increasing (and non-negative) sequence of } f \text{-divergences.} $$

*(b) For all λ* ∈ [0, 1] *and k* ∈ {0, 1, . . .}*,*

$$D\_{f\_{k+1}}(R\_{\lambda} \| P) = \int\_0^{\lambda} D\_{f\_k}(R\_s \| P) \, \frac{\mathrm{d}s}{s}.\tag{59}$$

**Proof.** See Section 5.5.

We next use the polylogarithm functions, which satisfy the recursive equation [32] [Equation (7.2)]:

$$\mathrm{Li}\_k(x) := \begin{cases} \dfrac{x}{1 - x}, & \text{if } k = 0, \\\\ \displaystyle\int\_0^x \frac{\mathrm{Li}\_{k-1}(s)}{s} \, \mathrm{d}s, & \text{if } k \ge 1. \end{cases} \tag{60}$$

This gives Li₁(x) = −log_e(1 − x), Li₂(x) = −∫₀ˣ (1/s) log_e(1 − s) ds, and so on; these functions are real-valued and finite for x < 1.
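As an illustration (a small numerical sketch of our own, with an arbitrarily chosen grid size), the recursion (60) can be evaluated by cumulative numerical integration and checked against the closed forms Li₁(1/2) = log_e 2 and Li₂(1/2) = π²/12 − (log_e 2)²/2:

```python
import math

def polylog_table(kmax, x, n=20000):
    """Tabulate Li_0, ..., Li_kmax on a midpoint grid of (0, x] via (60)."""
    h = x / n
    grid = [(i + 0.5) * h for i in range(n)]
    li = [s / (1 - s) for s in grid]          # Li_0(s) = s / (1 - s)
    tables = [li]
    for _ in range(kmax):
        cum, nxt = 0.0, []
        for s, v in zip(grid, li):
            cum += (v / s) * h                # accumulates int_0^s Li_{k-1}(t)/t dt
            nxt.append(cum)
        li = nxt
        tables.append(li)
    return tables

tables = polylog_table(2, 0.5)
li1, li2 = tables[1][-1], tables[2][-1]       # approximations of Li_1(0.5), Li_2(0.5)
```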

**Corollary 5.** *Let*

$$f\_k(\mathbf{x}) := \text{Li}\_k(1-\mathbf{x}), \quad \mathbf{x} > \mathbf{0}, \ k \in \{0, 1, \ldots\}. \tag{61}$$

*Then,* (59) *holds for all λ* ∈ [0, 1] *and k* ∈ {0, 1, . . .}*. Furthermore, setting k* = 0 *in* (59) *yields* (22) *as a special case.*

**Proof.** See Section 5.6.

#### *3.4. On Probabilities and f-Divergences*

The following result relates probabilities of sets to *f*-divergences.

**Theorem 5.** *Let* (X , F, *µ*) *be a probability space, and let* C ∈ F *be a measurable set with µ*(C) > 0*. Define the conditional probability measure*

$$\mu\_{\mathcal{C}}(\mathcal{E}) := \frac{\mu(\mathcal{C} \cap \mathcal{E})}{\mu(\mathcal{C})}, \quad \forall \mathcal{E} \in \mathcal{F}. \tag{62}$$

*Let f : (0, ∞) → ℝ be an arbitrary convex function with f(1) = 0, and assume (by continuous extension of f at zero) that f(0) := lim_{t→0⁺} f(t) < ∞. Furthermore, let f̃ : (0, ∞) → ℝ be the convex function which is given by*

$$
\widetilde{f}(t) := tf\left(\frac{1}{t}\right), \quad \forall t > 0. \tag{63}
$$

*Then,*

$$D\_f(\mu\_{\mathcal{C}} \| \mu) = \tilde{f}(\mu(\mathcal{C})) + \left(1 - \mu(\mathcal{C})\right) f(0). \tag{64}$$

**Proof.** See Section 5.7.

Connections of probabilities to the relative entropy, and to the chi-squared divergence, are next exemplified as special cases of Theorem 5.

**Corollary 6.** *In the setting of Theorem 5,*

$$D\left(\mu\_{\mathcal{C}} \| \mu\right) = \log \frac{1}{\mu\left(\mathcal{C}\right)}\,\tag{65}$$

$$
\chi^2(\mu\_{\mathcal{C}} \| \mu) = \frac{1}{\mu(\mathcal{C})} - 1,\tag{66}
$$

*so* (16) *is satisfied in this case with equality. More generally, for all α* ∈ (0, ∞)*,*

$$D\_{\alpha}\left(\mu\_{\mathcal{C}} \| \mu\right) = \log \frac{1}{\mu\left(\mathcal{C}\right)}.\tag{67}$$

**Proof.** See Section 5.7.

**Remark 8.** *In spite of its simplicity,* (65) *proved very useful in the seminal work by Marton on transportation–cost inequalities, proving concentration of measures by information-theoretic tools [33,34] (see also [35] [Chapter 8] and [36] [Chapter 3]). As a side note, the simple identity* (65) *was apparently first explicitly used by Csiszár (see [37] [Equation (4.13)]).*
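Both identities in Corollary 6 are immediate to confirm on a toy discrete space (an example of our own, not from the text):

```python
import math

mu = {'a': 0.1, 'b': 0.2, 'c': 0.3, 'd': 0.4}   # a probability measure on 4 points
C = {'a', 'c'}                                  # conditioning set with mu(C) = 0.4
muC_mass = sum(mu[x] for x in C)
mu_C = {x: (mu[x] / muC_mass if x in C else 0.0) for x in mu}  # (62)

# (65): D(mu_C || mu) = log(1 / mu(C)), here in nats
kl = sum(p * math.log(p / mu[x]) for x, p in mu_C.items() if p > 0)

# (66): chi^2(mu_C || mu) = 1 / mu(C) - 1
chi2 = sum((mu_C[x] - mu[x]) ** 2 / mu[x] for x in mu)
```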

#### **4. Applications**

This section provides applications of our results in Section 3. These include universal lossless compression, method of types and large deviations, and strong data–processing inequalities (SDPIs).

#### *4.1. Application of Corollary 3: Shannon Code for Universal Lossless Compression*

Consider *m* > 1 discrete, memoryless, and stationary sources with probability mass functions {P_i}_{i=1}^m, and assume that the symbols are emitted by one of these sources with an *a priori* probability α_i for source no. i, where {α_i}_{i=1}^m are positive and sum to 1.

For lossless data compression by a universal source code, suppose that a single source code is designed with respect to the average probability mass function P := ∑_{j=1}^m α_j P_j.

Assume that the designer uses a Shannon code, where the codeword assigned to a symbol x ∈ X is of length ℓ(x) = ⌈log (1/P(x))⌉ bits (logarithms are on base 2). Due to the mismatch in the source distribution, the average codeword length ℓ_avg satisfies (see [38] [Proposition 3.B])

$$\sum\_{i=1}^{m} \alpha\_i H(P\_i) + \sum\_{i=1}^{m} \alpha\_i D(P\_i \| P) \le \ell\_{\text{avg}} \le \sum\_{i=1}^{m} \alpha\_i H(P\_i) + \sum\_{i=1}^{m} \alpha\_i D(P\_i \| P) + 1. \tag{68}$$

The fractional penalty in the average codeword length, denoted by ν, is defined as the ratio between the penalty in the average codeword length caused by the source mismatch and the average codeword length under perfect matching. From (68), it follows that

$$\frac{\sum\_{i=1}^{m} \alpha\_i D(P\_i || P)}{1 + \sum\_{i=1}^{m} \alpha\_i H(P\_i)} \le \nu \le \frac{1 + \sum\_{i=1}^{m} \alpha\_i D(P\_i || P)}{\sum\_{i=1}^{m} \alpha\_i H(P\_i)}.\tag{69}$$

We next rely on Corollary 3 to obtain an upper bound on ν which is expressed as a function of the m(m − 1) relative entropies D(P_i‖P_j) for all i ≠ j in {1, . . . , m}. This is useful if, e.g., the m relative entropies on the left and right sides of (69) do not admit closed-form expressions, in contrast to the m(m − 1) relative entropies D(P_i‖P_j) for i ≠ j. We next exemplify this case.

For i ∈ {1, . . . , m}, let P_i be a Poisson distribution with parameter λ_i > 0. For all i, j ∈ {1, . . . , m}, the relative entropy from P_i to P_j admits the closed-form expression

$$D(P\_i \| P\_j) = \lambda\_i \log \left(\frac{\lambda\_i}{\lambda\_j}\right) + (\lambda\_j - \lambda\_i) \log \mathbf{e}.\tag{70}$$

From (54) and (70), it follows that

$$D(P\_i \| P) \le -\log\left(\alpha\_i + (1 - \alpha\_i) \exp\left(-\frac{f\_i(\underline{\alpha}, \underline{\lambda})}{1 - \alpha\_i}\right)\right),\tag{71}$$

where

$$f\_i(\underline{\alpha}, \underline{\lambda}) := \sum\_{j \neq i} \alpha\_j D(P\_i \| P\_j) \tag{72}$$

$$= \sum\_{j \neq i} \left\{ \alpha\_j \left[ \lambda\_i \log \left( \frac{\lambda\_i}{\lambda\_j} \right) + (\lambda\_j - \lambda\_i) \log \mathbf{e} \right] \right\}. \tag{73}$$

The entropy of a Poisson distribution with parameter λ_i is given by the integral representation [39–41]

$$H(P\_i) = \lambda\_i \log\left(\frac{\mathbf{e}}{\lambda\_i}\right) + \left[ \int\_0^\infty \left(\lambda\_i - \frac{1 - \mathbf{e}^{-\lambda\_i(1 - \mathbf{e}^{-u})}}{1 - \mathbf{e}^{-u}}\right) \frac{\mathbf{e}^{-u}}{u} \,\mathrm{d}u \right] \log \mathbf{e}.\tag{74}$$

Combining (69), (71) and (74) finally gives an upper bound on ν in the considered setup.

**Example 1.** *Consider five discrete memoryless sources where the probability mass function of source no. i is given by P_i = Poisson(λ_i) with λ = [16, 20, 24, 28, 32]. Suppose that the symbols are emitted from one of the sources with equal probability, so α = (1/5, 1/5, 1/5, 1/5, 1/5). Let P := (1/5)(P₁ + . . . + P₅) be the average probability mass function of the five sources. The term ∑_i α_i D(P_i‖P), which appears in the numerators of the upper and lower bounds on ν (see (69)), does not lend itself to a closed-form expression, and it is not even an easy task to calculate it numerically due to the need to compute an infinite series which involves factorials. We therefore apply the closed-form upper bound in (71) to get that ∑_i α_i D(P_i‖P) ≤ 1.46 bits, whereas the upper bound which follows from the convexity of the relative entropy (i.e., ∑_i α_i f_i(α, λ)) is equal to 1.99 bits (both upper bounds are smaller than the trivial bound log₂ 5 ≈ 2.32 bits). From (69), (74), and the stronger upper bound on ∑_i α_i D(P_i‖P), the improved upper bound on ν is equal to 57.0% (as compared to a looser upper bound of 69.3%, which follows from (69), (74), and the looser upper bound on ∑_i α_i D(P_i‖P) that is equal to 1.99 bits).*
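The two upper bounds on ∑_i α_i D(P_i‖P) quoted in Example 1 can be reproduced from the closed forms (70)–(73) alone (a short sketch; all values in bits):

```python
import math

LOG2E = 1 / math.log(2)          # log2(e), converts nats to bits
lam = [16, 20, 24, 28, 32]
m = len(lam)
alpha = [1 / m] * m

def D_poisson(li, lj):
    # relative entropy (70) between Poisson(li) and Poisson(lj), in bits
    return (li * math.log(li / lj) + (lj - li)) * LOG2E

# f_i(alpha, lambda) as in (72)-(73), in bits
f = [sum(alpha[j] * D_poisson(lam[i], lam[j]) for j in range(m) if j != i)
     for i in range(m)]

# convexity-based bound on sum_i alpha_i D(P_i || P): sum_i alpha_i f_i
bound_convexity = sum(a * fi for a, fi in zip(alpha, f))

# refined bound obtained by averaging (71) over i (exp expects nats)
bound_refined = sum(
    a * (-math.log2(a + (1 - a) * math.exp(-fi / ((1 - a) * LOG2E))))
    for a, fi in zip(alpha, f)
)
```

Running this recovers the values stated in the example: `bound_refined` ≈ 1.46 bits and `bound_convexity` ≈ 1.99 bits.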

#### *4.2. Application of Theorem 2 in the Context of the Method of Types and Large Deviations Theory*

Let Xⁿ = (X₁, . . . , X_n) be a sequence of i.i.d. random variables with X₁ ∼ Q, where Q is a probability measure defined on a finite set X, and Q(x) > 0 for all x ∈ X. Let P be a set of probability measures on X such that Q ∉ P, and suppose that the closure of P coincides with the closure of its interior. Then, by Sanov's theorem (see, e.g., [42] [Theorem 11.4.1] and [43] [Theorem 3.3]), the probability that the empirical distribution P̂_{Xⁿ} belongs to P vanishes exponentially at the rate

$$\lim\_{n \to \infty} \frac{1}{n} \log \frac{1}{\mathbb{P}[\widehat{P}\_{X^n} \in \mathcal{P}]} = \inf\_{P \in \mathcal{P}} D(P \| Q).\tag{75}$$

Furthermore, for finite *n*, the method of types yields the following upper bound on this rare event:

$$\mathbb{P}[\widehat{P}\_{X^n} \in \mathcal{P}] \le \binom{n+|\mathcal{X}|-1}{|\mathcal{X}|-1} \exp\left(-n \inf\_{P \in \mathcal{P}} D(P \| Q)\right) \tag{76}$$

$$\le (n+1)^{|\mathcal{X}|-1} \exp\left(-n \inf\_{P \in \mathcal{P}} D(P \| Q)\right),\tag{77}$$

whose exponential decay rate coincides with the exact asymptotic result in (75).

Suppose that Q is not fully known, but its mean m_Q and variance σ²_Q are available. Let m₁ ∈ ℝ and δ₁, ε₁, σ₁ > 0 be fixed, and let P be the set of all probability measures P, defined on the finite set X, with mean m_P ∈ [m₁ − δ₁, m₁ + δ₁] and variance σ²_P ∈ [σ₁² − ε₁, σ₁² + ε₁], where |m₁ − m_Q| > δ₁. Hence, P coincides with the closure of its interior, and Q ∉ P.

The lower bound on the relative entropy in Theorem 2, used in conjunction with the upper bound in (77), can serve to obtain an upper bound on the probability of the event that the empirical distribution of *X <sup>n</sup>* belongs to the set P, regardless of the uncertainty in *Q*. This gives

$$\mathbb{P}[\widehat{P}\_{X^n} \in \mathcal{P}] \le (n+1)^{|\mathcal{X}|-1} \exp\left(-n d^\*\right),\tag{78}$$

where

$$d^\* := \inf\_{m\_P,\, \sigma\_P^2} d(r \| s), \tag{79}$$

and, for fixed (m_P, m_Q, σ²_P, σ²_Q), the parameters r and s are given in (41) and (42), respectively.

Standard algebraic manipulations that rely on (78) lead to the following result, which is expressed as a function of the Lambert *W* function [44]. This function, which finds applications in various engineering and scientific fields, is a standard built–in function in mathematical software tools such as Mathematica, Matlab, and Maple. Applications of the Lambert *W* function in information theory and coding are briefly surveyed in [45].

**Proposition 2.** *For ε ∈ (0, 1), let n* := n*(ε) denote the minimal value of n ∈ ℕ such that the upper bound on the right side of (78) does not exceed ε. Then, n* admits the following closed-form expression:*

$$n^\* = \max\left\{ \left\lceil -\frac{\left(|\mathcal{X}| - 1\right) W\_{-1}(\eta)\log\mathbf{e}}{d^\*} \right\rceil - 1, \; 1\right\},\tag{80}$$

*with*

$$\eta := -\frac{d^\* \left(\varepsilon \exp(-d^\*)\right)^{1/(|\mathcal{X}|-1)}}{\left(|\mathcal{X}|-1\right)\log\mathbf{e}} \in \left[-\frac{1}{\mathbf{e}}, 0\right),\tag{81}$$

*and W₋₁(·) on the right side of* (80) *denotes the secondary real-valued branch of the Lambert W function (i.e., x := W₋₁(y), where W₋₁ : [−1/e, 0) → (−∞, −1] is the inverse function of y := x eˣ on that branch).*
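Since W₋₁ is monotone on its branch, it is easy to bracket by bisection, so the closed form (80)–(81) can be checked against a direct scan of the inequality in (78). The sketch below assumes d* is measured in nats (so log e = 1), and the parameter values d* = 0.203, |X| = 10, ε = 10⁻¹⁰ are purely illustrative:

```python
import math

def lambert_w_minus1(y):
    # Secondary real branch W_{-1}: [-1/e, 0) -> (-inf, -1], solving x e^x = y
    assert -1 / math.e <= y < 0
    lo, hi = -1.0, -1.0
    while lo * math.exp(lo) < y:          # push lo left until the root is bracketed
        lo *= 2.0
    for _ in range(200):                  # bisection; x e^x is monotone on (-inf, -1]
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) >= y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def n_star(d, k, eps):
    # (80)-(81) with log e = 1 (d in nats); k = |X| - 1
    eta = -(d / k) * (eps * math.exp(-d)) ** (1.0 / k)
    return max(math.ceil(-k * lambert_w_minus1(eta) / d) - 1, 1)

def n_star_scan(d, k, eps):
    # brute force: minimal n with (n+1)^k exp(-n d) <= eps, compared in log domain
    n = 1
    while k * math.log(n + 1) - n * d > math.log(eps):
        n += 1
    return n
```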

**Example 2.** *Let Q be an arbitrary probability measure, defined on a finite set X, with mean m_Q = 40 and variance σ²_Q = 20. Let P be the set of all probability measures P, defined on X, whose mean m_P and variance σ²_P lie in the intervals [43, 47] and [18, 22], respectively. Suppose that it is required that, for all probability measures Q as above, the probability that the empirical distribution of the i.i.d. sequence Xⁿ ∼ Qⁿ be included in the set P is at most ε = 10⁻¹⁰. We rely here on the upper bound in (78), and impose the stronger condition where it should not exceed ε. By this approach, it is obtained numerically from (79) that d* = 0.203 nats. We next examine two cases:*


We close this discussion with a numerical experiment on the lower bound on the relative entropy in Theorem 2, comparing this attainable lower bound (see Item (b) of Theorem 2) with the following closed-form expressions for relative entropies:

(a) The relative entropy between real-valued Gaussian distributions is given by

$$D\left(\mathcal{N}(m\_P, \sigma\_\mathcal{P}^2) \parallel \mathcal{N}(m\_Q, \sigma\_\mathcal{Q}^2)\right) = \log \frac{\sigma\_\mathcal{Q}}{\sigma\_\mathcal{P}} + \frac{1}{2} \left[\frac{(m\_P - m\_\mathcal{Q})^2 + \sigma\_\mathcal{P}^2}{\sigma\_\mathcal{Q}^2} - 1\right] \log \text{e.}\tag{82}$$

(b) Let E_µ denote a random variable which is exponentially distributed with mean µ > 0; its probability density function is given by

$$e\_{\mu}(\mathbf{x}) = \frac{1}{\mu} \mathbf{e}^{-\mathbf{x}/\mu} \mathbf{1}\{\mathbf{x} \ge 0\}. \tag{83}$$

Then, for a₁, a₂ > 0 and d₁, d₂ ∈ ℝ,

$$D(E\_{a\_1} + d\_1 \| E\_{a\_2} + d\_2) = \begin{cases} \log \frac{a\_2}{a\_1} + \frac{d\_1 + a\_1 - d\_2 - a\_2}{a\_2} \log \mathbf{e}, & d\_1 \ge d\_2, \\ \infty, & d\_1 < d\_2. \end{cases} \tag{84}$$

In this case, the means under P and Q are m_P = d₁ + a₁ and m_Q = d₂ + a₂, respectively, and the variances are σ²_P = a₁² and σ²_Q = a₂². Hence, to obtain the required means and variances, set

$$a\_1 = \sigma\_P, \quad a\_2 = \sigma\_Q, \quad d\_1 = m\_P - \sigma\_P, \quad d\_2 = m\_Q - \sigma\_Q.\tag{85}$$
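For use in the comparison below, here is a small sketch of the two closed forms (82) and (84)–(85), parameterized directly by means and standard deviations (all values in nats):

```python
import math

def kl_gauss(mP, sP, mQ, sQ):
    # (82): relative entropy between N(mP, sP^2) and N(mQ, sQ^2), in nats
    return math.log(sQ / sP) + 0.5 * (((mP - mQ) ** 2 + sP ** 2) / sQ ** 2 - 1)

def kl_shifted_exp(mP, sP, mQ, sQ):
    # (84)-(85): shifted exponentials matched to the given means/variances,
    # i.e., a_i = sigma and d_i = m - sigma
    a1, a2, d1, d2 = sP, sQ, mP - sP, mQ - sQ
    if d1 < d2:
        return math.inf
    return math.log(a2 / a1) + (d1 + a1 - d2 - a2) / a2
```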

**Example 3.** *We compare numerically the attainable lower bound on the relative entropy, as it is given in* (40)*, with the two relative entropies in* (82) *and* (84)*:*


#### *4.3. Strong Data–Processing Inequalities and Maximal Correlation*

Information contraction is a fundamental concept in information theory. The contraction of f-divergences through channels is captured by data–processing inequalities, which can be further tightened by the derivation of SDPIs with channel-dependent or source-channel dependent contraction coefficients (see, e.g., [26,46–52]).

We next provide necessary definitions which are relevant for the presentation in this subsection.

**Definition 7.** *Let Q_X be a probability distribution which is defined on a set X, and that is not a point mass, and let W_{Y|X} : X → Y be a stochastic transformation. The* contraction coefficient for f-divergences *is defined as*

$$\mu\_f(Q\_X, W\_{Y|X}) := \sup\_{P\_X : D\_f(P\_X \| Q\_X) \in (0, \infty)} \frac{D\_f(P\_Y \| Q\_Y)}{D\_f(P\_X \| Q\_X)}, \tag{86}$$

*where, for all y* ∈ Y*,*

$$P\_Y(y) = \left(P\_X W\_{Y|X}\right)(y) := \int\_{\mathcal{X}} \mathrm{d}P\_X(x) \, W\_{Y|X}(y|x),\tag{87}$$

$$Q\_Y(y) = \left(Q\_X W\_{Y|X}\right)(y) := \int\_{\mathcal{X}} \mathrm{d}Q\_X(x) \, W\_{Y|X}(y|x). \tag{88}$$

*The notation in* (87) *and* (88) *is consistent with the standard notation used in information theory (see, e.g., the first displayed equation after (3.2) in [53]).*

The derivation of good upper bounds on contraction coefficients for f-divergences, which are strictly smaller than 1, leads to SDPIs. These inequalities find applications, e.g., in studying the exponential convergence rate of an irreducible, time-homogeneous and reversible discrete-time Markov chain to its unique invariant distribution over its state space (see, e.g., [49] [Section 2.4.3] and [50] [Section 2]). This is in sharp contrast to DPIs, which by themselves do not yield convergence to stationarity at any rate. We return to this point later in this subsection, and determine the exact convergence rate to stationarity under two parametric families of f-divergences.

We next rely on Theorem 1 to obtain upper bounds on the contraction coefficients for the following *f*-divergences.

**Definition 8.** *For α* ∈ (0, 1]*, the α-skew K-divergence is given by*

$$K\_{\alpha}(P \| Q) := D\left(P \, \| \, (1-\alpha)P + \alpha Q\right),\tag{89}$$

*and, for α* ∈ [0, 1]*, let*

$$S\_{\alpha}(P \| Q) := \alpha \, D\left(P \, \| \, (1 - \alpha)P + \alpha Q \right) + (1 - \alpha) \, D\left(Q \, \| \, (1 - \alpha)P + \alpha Q \right) \tag{90}$$

$$= \alpha \, K\_{\alpha}(P \| Q) + (1 - \alpha) \, K\_{1 - \alpha}(Q \| P), \tag{91}$$

*with the convention that K₀(P‖Q) ≡ 0 (by a continuous extension at α = 0 in (89)). These divergence measures specialize to relative entropies:*

$$K\_1(P \| Q) = D(P \| Q) = S\_1(P \| Q), \quad S\_0(P \| Q) = D(Q \| P), \tag{92}$$

*and S_{1/2}(P‖Q) is the Jensen–Shannon divergence [54–56] (also known as the capacitory discrimination [57]):*

$$S\_{\frac{1}{2}}(P \| Q) = \frac{1}{2} D\left(P \,\|\, \frac{1}{2}(P+Q)\right) + \frac{1}{2} D\left(Q \,\|\, \frac{1}{2}(P+Q)\right)\tag{93}$$

$$= H\left(\frac{1}{2}(P+Q)\right) - \frac{1}{2}H(P) - \frac{1}{2}H(Q) =: \mathrm{JS}(P \| Q). \tag{94}$$

*It can be verified that the divergence measures in* (89) *and* (90) *are f -divergences:*

$$K\_{\alpha}(P \| Q) = D\_{k\_{\alpha}}(P \| Q), \quad \alpha \in (0,1], \tag{95}$$

$$S\_{\alpha}(P \| Q) = D\_{s\_{\alpha}}(P \| Q), \quad \alpha \in [0, 1], \tag{96}$$

*with*

$$k\_{\alpha}(t) := t \log t - t \log \left( \alpha + (1 - \alpha)t \right), \quad t > 0, \ \alpha \in (0, 1], \tag{97}$$

$$s\_{\alpha}(t) := \alpha t \log t - \left(\alpha t + 1 - \alpha\right) \log\left(\alpha + (1 - \alpha)t\right) \tag{98}$$

$$= \alpha\, k\_{\alpha}(t) + (1 - \alpha)\, t \, k\_{1-\alpha}\left(\frac{1}{t}\right), \quad t > 0, \ \alpha \in [0, 1], \tag{99}$$

*where kα*(·) *and sα*(·) *are strictly convex functions on* (0, ∞)*, and vanish at 1.*
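The identity in (99) is elementary but easy to get wrong; a quick numerical check over a grid of (α, t) values (a sketch of our own) confirms it:

```python
import math

def k_f(a, t):
    # k_alpha(t) as in (97), with natural logarithms
    return t * math.log(t) - t * math.log(a + (1 - a) * t)

def s_f(a, t):
    # s_alpha(t) as in (98), with natural logarithms
    return a * t * math.log(t) - (a * t + 1 - a) * math.log(a + (1 - a) * t)

# identity (99): s_alpha(t) = alpha k_alpha(t) + (1 - alpha) t k_{1-alpha}(1/t)
max_gap = max(
    abs(s_f(a, t) - (a * k_f(a, t) + (1 - a) * t * k_f(1 - a, 1 / t)))
    for a in [0.1, 0.3, 0.5, 0.7, 0.9]
    for t in [0.25, 0.5, 2.0, 5.0]
)
```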

**Remark 9.** *The α-skew K-divergence in* (89) *is considered in [55] and [58] [(13)] (including pointers in the latter paper to its utility). The divergence in* (90) *is akin to Lin's measure in [55] [(4.1)], the asymmetric α-skew Jensen–Shannon divergence in [58] [(11)–(12)], the symmetric α-skew Jensen–Shannon divergence in [58] [(16)], and divergence measures in [59] which involve arithmetic and geometric means of two probability distributions. Properties and applications of quantum skew divergences are studied in [19] and references therein.*

**Theorem 6.** *The f-divergences in* (89) *and* (90) *satisfy the following integral identities, which are expressed in terms of the Györfi–Vajda divergence in* (17)*:*

$$\frac{1}{\log \mathbf{e}} \, K\_{\alpha}(P \| Q) = \int\_0^{\alpha} s\, D\_{\phi\_s}(P \| Q) \, \mathrm{d}s, \qquad \alpha \in (0, 1], \tag{100}$$

$$\frac{1}{2\log \mathbf{e}} \, S\_{\alpha}(P \| Q) = \int\_0^1 g\_{\alpha}(s) \, D\_{\phi\_s}(P \| Q) \, \mathrm{d}s, \quad \alpha \in [0, 1], \tag{101}$$

*with*

$$g\_{\alpha}(s) := \alpha s \, \mathbf{1}\{ s \in (0, \alpha] \} + (1 - \alpha)(1 - s) \, \mathbf{1}\{ s \in [\alpha, 1) \}, \quad (\alpha, s) \in [0, 1] \times [0, 1]. \tag{102}$$

*Moreover, the contraction coefficients for these f -divergences are related as follows:*

$$
\mu\_{\chi^2}(Q\_X, W\_{Y|X}) \le \mu\_{k\_{\alpha}}(Q\_X, W\_{Y|X}) \le \sup\_{s \in (0,\alpha]} \mu\_{\phi\_s}(Q\_X, W\_{Y|X}), \quad \alpha \in (0,1], \tag{103}
$$

$$
\mu\_{\chi^2}(Q\_X, W\_{Y|X}) \le \mu\_{s\_{\alpha}}(Q\_X, W\_{Y|X}) \le \sup\_{s \in (0,1)} \mu\_{\phi\_s}(Q\_X, W\_{Y|X}), \quad \alpha \in [0,1], \tag{104}
$$

*where μ_{χ²}(Q_X, W_{Y|X}) denotes the contraction coefficient for the chi-squared divergence.*

**Proof.** See Section 5.8.

**Remark 10.** *The upper bounds on the contraction coefficients for the parametric f-divergences in* (89) *and* (90) *generalize the upper bound on the contraction coefficient for the relative entropy in [51] [Theorem III.6] (recall that K₁(P‖Q) = D(P‖Q) = S₁(P‖Q)), so the upper bounds in Theorem 6 specialize to the latter bound at α = 1.*

**Corollary 7.** *Let*

$$\mu\_{\chi^2}(W\_{Y|X}) := \sup\_{Q\_X} \mu\_{\chi^2}(Q\_X, W\_{Y|X}), \tag{105}$$

*where the supremum on the right side is over all probability measures Q<sup>X</sup> defined on* X *. Then,*

$$
\mu\_{\chi^2}(Q\_X, W\_{Y|X}) \le \mu\_{k\_{\alpha}}(Q\_X, W\_{Y|X}) \le \mu\_{\chi^2}(W\_{Y|X}), \quad \alpha \in (0,1],\tag{106}
$$

$$
\mu\_{\chi^2}(Q\_X, W\_{Y|X}) \le \mu\_{s\_{\alpha}}(Q\_X, W\_{Y|X}) \le \mu\_{\chi^2}(W\_{Y|X}), \quad \alpha \in [0,1]. \tag{107}
$$

**Proof.** See Section 5.9.

**Example 4.** *Let Q_X = Bernoulli(1/2), and let W_{Y|X} correspond to a binary symmetric channel (BSC) with crossover probability ε. Then, μ_{χ²}(Q_X, W_{Y|X}) = μ_{χ²}(W_{Y|X}) = (1 − 2ε)². The upper and lower bounds on μ_{k_α}(Q_X, W_{Y|X}) and μ_{s_α}(Q_X, W_{Y|X}) in (106) and (107) match for all α, and they are all equal to (1 − 2ε)².*
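Example 4 can be verified via the maximal-correlation characterization in (119) below: the matrix B with entries B_{x,y} = √Q_X(x) W(y|x)/√Q_Y(y) has largest singular value 1, and μ_{χ²}(Q_X, W_{Y|X}) is the square of its second singular value. A small sketch for ε = 0.1 (the value is arbitrary):

```python
import math

eps = 0.1
QX = [0.5, 0.5]                          # Bernoulli(1/2) input
W = [[1 - eps, eps], [eps, 1 - eps]]     # BSC with crossover probability eps
QY = [sum(QX[x] * W[x][y] for x in range(2)) for y in range(2)]

# B[x][y] = sqrt(QX[x]) * W[x][y] / sqrt(QY[y]); its singular values are
# 1 and rho_m(X; Y), so mu_chi2 is the second eigenvalue of B^T B
B = [[math.sqrt(QX[x]) * W[x][y] / math.sqrt(QY[y]) for y in range(2)]
     for x in range(2)]
M = [[sum(B[i][x] * B[i][y] for i in range(2)) for y in range(2)]
     for x in range(2)]
tr = M[0][0] + M[1][1]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
disc = math.sqrt(tr * tr - 4 * det)
top, mu_chi2 = (tr + disc) / 2, (tr - disc) / 2   # eigenvalues of B^T B
```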

The upper bound on the contraction coefficients in Corollary 7 is given by μ_{χ²}(W_{Y|X}), whereas the lower bound is given by μ_{χ²}(Q_X, W_{Y|X}), which depends on the input distribution Q_X. We next provide alternative upper bounds on the contraction coefficients for the considered (parametric) f-divergences which, similarly to the lower bound, scale like μ_{χ²}(Q_X, W_{Y|X}). Although the upper bound in Corollary 7 may be tighter in some cases than the alternative upper bounds which are next presented in Proposition 3 (and, in fact, the former upper bound may even be achieved with equality, as in Example 4), the bounds in Proposition 3 are used shortly to determine the exponential rate of the convergence to stationarity of a class of Markov chains.

**Proposition 3.** *For all α* ∈ (0, 1]*,*

$$
\mu\_{\chi^2}(Q\_X, W\_{Y|X}) \le \mu\_{k\_{\alpha}}(Q\_X, W\_{Y|X}) \le \frac{1}{\alpha\, Q\_{\min}} \cdot \mu\_{\chi^2}(Q\_X, W\_{Y|X}),\tag{108}
$$

$$
\mu\_{\chi^2}(Q\_X, W\_{Y|X}) \le \mu\_{s\_{\alpha}}(Q\_X, W\_{Y|X}) \le \frac{(1-\alpha)\log\_{\mathrm{e}}\left(\frac{1}{\alpha}\right) + 2\alpha - 1}{(1 - 3\alpha + 3\alpha^2)\, Q\_{\min}} \cdot \mu\_{\chi^2}(Q\_X, W\_{Y|X}),\tag{109}
$$

*where Q*min *denotes the minimal positive mass of the input distribution QX.*

**Proof.** See Section 5.10.

**Remark 11.** *In view of* (92)*, at α* = 1*,* (108) *and* (109) *specialize to an upper bound on the contraction coefficient of the relative entropy (KL divergence) as a function of the contraction coefficient of the chi-squared divergence. In this special case, both* (108) *and* (109) *give*

$$
\mu\_{\chi^2}(Q\_X, W\_{Y|X}) \le \mu\_{\mathrm{KL}}(Q\_X, W\_{Y|X}) \le \frac{1}{Q\_{\min}} \cdot \mu\_{\chi^2}(Q\_X, W\_{Y|X}), \tag{110}
$$

*which then coincides with [48] [Theorem 10].*

We next apply Proposition 3 to consider the convergence rate to stationarity of Markov chains by the introduced *f*-divergences in Definition 8. The next result follows [49] [Section 2.4.3], and it provides a generalization of the result there.

**Theorem 7.** *Consider a time-homogeneous, irreducible, and reversible discrete-time Markov chain with a finite state space X, let W be its probability transition matrix, and let Q_X be its unique stationary distribution (reversibility means that Q_X(x)[W]_{x,y} = Q_X(y)[W]_{y,x} for all x, y ∈ X). Let P_X be an initial probability distribution over X. Then, for all α ∈ (0, 1] and n ∈ ℕ,*

$$K\_{\alpha}(P\_X W^n \| Q\_X) \le \mu\_{k\_{\alpha}}(Q\_X, W^n) \, K\_{\alpha}(P\_X \| Q\_X), \tag{111}$$

$$S\_{\alpha}(P\_X W^n \| Q\_X) \le \mu\_{s\_{\alpha}}(Q\_X, W^n) \, S\_{\alpha}(P\_X \| Q\_X), \tag{112}$$

*and the contraction coefficients on the right sides of* (111) *and* (112) *scale like the n-th power of the contraction coefficient for the chi-squared divergence as follows:*

$$\left(\mu\_{\chi^2}(Q\_X, W)\right)^n \le \mu\_{k\_{\alpha}}(Q\_X, W^n) \le \frac{1}{\alpha\, Q\_{\min}} \cdot \left(\mu\_{\chi^2}(Q\_X, W)\right)^n,\tag{113}$$

$$\left(\mu\_{\chi^2}(Q\_X, W)\right)^n \le \mu\_{s\_{\alpha}}(Q\_X, W^n) \le \frac{(1-\alpha)\log\_{\mathrm{e}}\left(\frac{1}{\alpha}\right) + 2\alpha - 1}{(1 - 3\alpha + 3\alpha^2)\, Q\_{\min}} \cdot \left(\mu\_{\chi^2}(Q\_X, W)\right)^n. \tag{114}$$

**Proof.** Inequalities (111) and (112) hold since Q_X Wⁿ = Q_X for all n ∈ ℕ, and due to Definition 7, (95) and (96). Inequalities (113) and (114) hold by Proposition 3, and due to the reversibility of the Markov chain, which implies that (see [49] [Equation (2.92)])

$$\mu\_{\chi^2}(Q\_X, W^n) = \left(\mu\_{\chi^2}(Q\_X, W)\right)^n, \quad n \in \mathbb{N}.\tag{115}$$
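The identity (115) is easy to visualize for a two-state reversible chain, where the second eigenvalue of a 2×2 stochastic matrix equals trace − 1, and (for a reversible chain) μ_{χ²}(Q_X, W) is the square of the second eigenvalue. A sketch with arbitrarily chosen parameters:

```python
# Reversible two-state chain: detailed balance Q[0] * W[0][1] = Q[1] * W[1][0]
Q = [0.25, 0.75]
a = 0.3                                  # transition probability 0 -> 1
b = Q[0] * a / Q[1]                      # forced by reversibility
W = [[1 - a, a], [b, 1 - b]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def second_eig(M):
    # eigenvalues of a 2x2 stochastic matrix are 1 and trace - 1
    return M[0][0] + M[1][1] - 1.0

mus, Wn = [], W
for _ in range(5):
    mus.append(second_eig(Wn) ** 2)      # mu_chi2(Q, W^n) for this chain
    Wn = matmul(Wn, W)
```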

In view of (113) and (114), Theorem 7 readily gives the following result on the exponential decay rate of the upper bounds on the divergences on the left sides of (111) and (112).

*Entropy* **2020**, *22*, 563

**Corollary 8.** *For all α* ∈ (0, 1]*,*

$$\lim\_{n \to \infty} \left( \mu\_{k\_{\alpha}}(Q\_X, W^n) \right)^{1/n} = \mu\_{\chi^2}(Q\_X, W) = \lim\_{n \to \infty} \left( \mu\_{s\_{\alpha}}(Q\_X, W^n) \right)^{1/n}.\tag{116}$$

**Remark 12.** *Theorem 7 and Corollary 8 generalize the results in [49] [Section 2.4.3], which follow as a special case at α* = 1 *(see* (92)*).*

We end this subsection by considering maximal correlations, which are closely related to the contraction coefficient for the chi-squared divergence.

**Definition 9.** *The* maximal correlation *between two random variables X and Y is defined as*

$$\rho\_{\mathbf{m}}(X;Y) := \sup\_{f,g} \mathbb{E}[f(X)g(Y)],\tag{117}$$

*where the supremum is taken over all real-valued functions f and g such that*

$$\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0, \quad \mathbb{E}[f^2(X)] \le 1, \quad \mathbb{E}[g^2(Y)] \le 1. \tag{118}$$

It is well known [60] that, if X ∼ Q_X and Y ∼ Q_Y = Q_X W_{Y|X}, then the contraction coefficient for the chi-squared divergence μ_{χ²}(Q_X, W_{Y|X}) is equal to the square of the maximal correlation between the random variables X and Y, i.e.,

$$
\rho\_{\mathrm{m}}(X;Y) = \sqrt{\mu\_{\chi^2}(Q\_X, W\_{Y|X})}.\tag{119}
$$

A simple application of Corollary 1 and (119) gives the following result.

**Proposition 4.** *In the setting of Definition 7, for s ∈ [0, 1], let X_s ∼ (1 − s)P_X + sQ_X and Y_s ∼ (1 − s)P_Y + sQ_Y, with P_X ≠ Q_X and P_X ≪≫ Q_X (i.e., P_X and Q_X are mutually absolutely continuous). Then, the following inequality holds:*

$$\sup\_{s \in [0,1]} \rho\_{\mathrm{m}}(X\_{\mathrm{s}}; Y\_{\mathrm{s}}) \geq \max \left\{ \sqrt{\frac{D(P\_{\mathrm{Y}} \| Q\_{\mathrm{Y}})}{D(P\_{\mathrm{X}} \| Q\_{\mathrm{X}})}}, \sqrt{\frac{D(Q\_{\mathrm{Y}} \| P\_{\mathrm{Y}})}{D(Q\_{\mathrm{X}} \| P\_{\mathrm{X}})}} \right\}.\tag{120}$$

**Proof.** See Section 5.11.

#### **5. Proofs**

This section provides proofs of the results in Sections 3 and 4.

#### *5.1. Proof of Theorem 1*

*Proof of* (22): We rely on an integral representation of the logarithm function (with base e):

$$\log_e x = \int_0^1 \frac{x - 1}{x + (1 - x)v} \,\mathrm{d}v, \quad \forall\, x > 0. \tag{121}$$
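This representation is easy to verify numerically; a minimal sketch (the midpoint-rule grid size and the test points are arbitrary choices):

```python
import math

def log_e_via_121(x, n=100000):
    # Midpoint-rule approximation of the integral representation above.
    return sum((x - 1.0) / (x + (1.0 - x) * (k + 0.5) / n)
               for k in range(n)) / n

for x in (0.2, 1.0, 5.0):
    assert abs(log_e_via_121(x) - math.log(x)) < 1e-6
```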

Let *<sup>µ</sup>* be a dominating measure of *<sup>P</sup>* and *<sup>Q</sup>* (i.e., *<sup>P</sup>*, *<sup>Q</sup> <sup>µ</sup>*), and let *<sup>p</sup>* :<sup>=</sup> <sup>d</sup>*<sup>P</sup>* d*µ* , *q* := d*Q* d*µ* , and

$$r_\lambda := \frac{\mathrm{d}R_\lambda}{\mathrm{d}\mu} = (1 - \lambda)p + \lambda q, \quad \forall\, \lambda \in [0, 1], \tag{122}$$

where the last equality is due to (21). For all *λ* ∈ [0, 1],

$$\frac{1}{\log \text{e}} D(P \| R\_{\lambda}) = \int p \log\_{\text{e}} \left( \frac{p}{r\_{\lambda}} \right) \text{d}\mu \tag{123}$$

$$=\int\_{0}^{1}\int \frac{p(p-r\_{\lambda})}{p+v(r\_{\lambda}-p)}\,\mathrm{d}\mu\,\mathrm{d}v,\tag{124}$$

where (124) holds due to (121) with $x := \frac{p}{r_\lambda}$, and by swapping the order of integration. The inner integral on the right side of (124) satisfies, for all $v \in (0, 1]$,

$$\int \frac{p(p - r\_{\lambda})}{p + v(r\_{\lambda} - p)} \, \mathrm{d}\mu = \int (p - r\_{\lambda}) \left( 1 + \frac{v(p - r\_{\lambda})}{p + v(r\_{\lambda} - p)} \right) \, \mathrm{d}\mu \tag{125}$$

$$=\int (p - r\_{\lambda}) \, \mathrm{d}\mu + v \int \frac{(p - r\_{\lambda})^2}{p + v(r\_{\lambda} - p)} \, \mathrm{d}\mu \tag{126}$$

$$= v \int \frac{(p - r\_{\lambda})^2}{(1 - v)p + vr\_{\lambda}} \, \text{d}\mu \tag{127}$$

$$=\frac{1}{v}\int \frac{\left(p-\left[(1-v)p+vr\_{\lambda}\right]\right)^{2}}{(1-v)p+vr\_{\lambda}}\,\mathrm{d}\mu\,\tag{128}$$

$$=\frac{1}{v}\,\chi^2\bigl(P \,\|\, (1-v)P + vR_\lambda\bigr),\tag{129}$$

where (127) holds since $\int p \,\mathrm{d}\mu = 1$ and $\int r_\lambda \,\mathrm{d}\mu = 1$. From (21), for all $(\lambda, v) \in [0, 1] \times [0, 1]$,

$$(1 - v)P + vR_{\lambda} = (1 - \lambda v)P + \lambda v\, Q = R_{\lambda v}.\tag{130}$$

The substitution of (130) into the right side of (129) gives that, for all (*λ*, *v*) ∈ [0, 1] × (0, 1],

$$\int \frac{p(p - r_{\lambda})}{p + v(r_{\lambda} - p)} \, \mathrm{d}\mu = \frac{1}{v} \, \chi^{2}(P \| R_{\lambda v}).\tag{131}$$

Finally, substituting (131) into the right side of (124) gives that, for all *λ* ∈ (0, 1],

$$\frac{1}{\log e}\, D(P \| R_{\lambda}) = \int_{0}^{1} \frac{1}{v}\, \chi^{2}(P \| R_{\lambda v}) \, \mathrm{d}v \tag{132}$$

$$=\int_0^\lambda \frac{1}{s}\, \chi^2(P \| R_s) \, \mathrm{d}s,\tag{133}$$

where (133) holds by the change of variable $s := \lambda v$. Equality (133) also holds for $\lambda = 0$ since $D(P\|R_0) = D(P\|P) = 0$.
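For discrete distributions, the identity just established can be checked by quadrature; a minimal sketch in nats (the two distributions, the value of $\lambda$, and the grid size are arbitrary illustrative choices):

```python
import math

P = [0.2, 0.8]
Q = [0.6, 0.4]

def D(P, Q):  # relative entropy in nats
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def chi2(P, Q):
    return sum((p - q) ** 2 / q for p, q in zip(P, Q))

def R(lmbda):  # R_lambda = (1 - lambda) P + lambda Q, as in (21)
    return [(1 - lmbda) * p + lmbda * q for p, q in zip(P, Q)]

lam, n = 0.7, 100000
# Midpoint-rule evaluation of the integral of chi^2(P || R_s) / s over (0, lam].
integral = sum(chi2(P, R(lam * (k + 0.5) / n)) / (lam * (k + 0.5) / n)
               for k in range(n)) * (lam / n)
assert abs(integral - D(P, R(lam))) < 1e-4
```

The integrand behaves like $s\,\chi^2(P\|Q)$ near $s = 0$, so the integral is proper and the midpoint rule converges without special handling.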

*Proof of* (23): For all *s* ∈ (0, 1],

$$\begin{split} \chi^2(P \| Q) &= \int \frac{(p-q)^2}{q} \, \mathrm{d}\mu \\ &= \frac{1}{s^2} \int \frac{\left[ \left( sp + (1-s)q \right) - q \right]^2}{q} \, \mathrm{d}\mu \end{split} \tag{134}$$

$$=\frac{1}{s^2} \int \frac{\left(r\_{1-s} - q\right)^2}{q} \,\mathrm{d}\mu\,\tag{135}$$

$$=\frac{1}{s^2}\,\chi^2(R_{1-s} \| Q),\tag{136}$$

where (135) holds due to (122). From (136), it follows that for all *λ* ∈ [0, 1],

$$\int_0^\lambda \frac{1}{s}\, \chi^2(R_{1-s} \| Q) \, \mathrm{d}s = \int_0^\lambda s \, \mathrm{d}s \; \chi^2(P \| Q) = \frac{1}{2} \lambda^2 \chi^2(P \| Q).\tag{137}$$

*5.2. Proof of Proposition 1*

(a) *Simple Proof of Pinsker's Inequality*: By [61] or [62] [(58)],

$$\chi^2(P\|Q) \ge \begin{cases} |P - Q|^2, & \text{if } |P - Q| \in [0, 1], \\ \dfrac{|P - Q|}{2 - |P - Q|}, & \text{if } |P - Q| \in (1, 2]. \end{cases} \tag{138}$$

We need the weaker inequality $\chi^2(P\|Q) \ge |P - Q|^2$, proved by the Cauchy–Schwarz inequality:

$$\chi^2(P\|Q) = \int \frac{(p-q)^2}{q} \, \mathrm{d}\mu \int q \, \mathrm{d}\mu \tag{139}$$

$$\geq \left( \int \frac{|p-q|}{\sqrt{q}} \cdot \sqrt{q} \, \mathrm{d}\mu \right)^2 \tag{140}$$

$$= |P - Q|^2. \tag{141}$$

By combining (24) and (139)–(141), it follows that

$$\frac{1}{\log e}\, D(P\|Q) = \int_0^1 \chi^2\bigl(P\|(1-s)P + sQ\bigr) \, \frac{\mathrm{d}s}{s} \tag{142}$$

$$\geq \int\_{0}^{1} \left| P - \left( (1 - s)P + sQ \right) \right|^{2} \frac{ds}{s} \tag{143}$$

$$=\int_0^1 s \left| P - Q \right|^2 \mathrm{d}s \tag{144}$$

$$= \frac{1}{2}\, |P - Q|^2. \tag{145}$$
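The chain (142)–(145) gives Pinsker's inequality, $D(P\|Q) \ge \frac{1}{2}\,|P-Q|^2 \log e$; a quick numerical spot-check in nats (the test distributions are arbitrary, and $|P-Q|$ denotes the total variation in the $L^1$ sense used here, taking values in $[0,2]$):

```python
import math

def D(P, Q):  # relative entropy in nats
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def tv(P, Q):  # |P - Q| = sum_x |p(x) - q(x)|
    return sum(abs(p - q) for p, q in zip(P, Q))

pairs = [([0.2, 0.8], [0.6, 0.4]),
         ([0.1, 0.2, 0.7], [0.3, 0.3, 0.4]),
         ([0.5, 0.5], [0.5, 0.5])]
for P, Q in pairs:
    assert D(P, Q) >= 0.5 * tv(P, Q) ** 2 - 1e-12
```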

(b) *Proof of* (30) *and its local tightness*:

$$\frac{1}{\log e}\, D(P\|Q) = \int_0^1 \chi^2\bigl(P\|(1-s)P + sQ\bigr)\,\frac{\mathrm{d}s}{s} \tag{146}$$

$$= \int_0^1 \left( \int \frac{\left[ p - ((1-s)p + sq) \right]^2}{(1-s)p + sq} \, \mathrm{d}\mu \right) \frac{\mathrm{d}s}{s} \tag{147}$$

$$= \int_0^1 \int \frac{s(p-q)^2}{(1-s)p+sq} \,\mathrm{d}\mu\,\mathrm{d}s \tag{148}$$

$$\le \int_0^1 \int s(p-q)^2 \left(\frac{1-s}{p} + \frac{s}{q}\right) \mathrm{d}\mu \,\mathrm{d}s \tag{149}$$

$$=\int_0^1 s^2 \, \mathrm{d}s \int \frac{(p-q)^2}{q} \, \mathrm{d}\mu + \int_0^1 s(1-s) \, \mathrm{d}s \int \frac{(p-q)^2}{p} \, \mathrm{d}\mu \tag{150}$$

$$=\frac{1}{3}\,\chi^2(P\|Q) + \frac{1}{6}\,\chi^2(Q\|P), \tag{151}$$

where (146) is (24), and (149) holds due to Jensen's inequality and the convexity of the function $x \mapsto \frac{1}{x}$ on $(0, \infty)$. We next show the local tightness of inequality (30) by proving that (31) yields (32). Let $\{P_n\}$ be a sequence of probability measures, defined on a measurable space $(\mathcal{X}, \mathcal{F})$, and assume that $\{P_n\}$ converges to a probability measure $P$ in the sense that (31) holds. In view of [16] [Theorem 7] (see also [15] [Section 4.F] and [63]), it follows that

$$\lim\_{n \to \infty} D(P\_n || P) = \lim\_{n \to \infty} \chi^2(P\_n || P) = 0,\tag{152}$$

and

$$\lim\_{n \to \infty} \frac{D(P\_n || P)}{\chi^2(P\_n || P)} = \frac{1}{2} \text{ log\,e},\tag{153}$$

$$\lim\_{n \to \infty} \frac{\chi^2(P\_n || P)}{\chi^2(P || P\_n)} = 1,\tag{154}$$

which therefore yields (32).
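Since $\int_0^1 s^2\,\mathrm{d}s = \frac{1}{3}$ and $\int_0^1 s(1-s)\,\mathrm{d}s = \frac{1}{6}$, the chain ending at (150) bounds the relative entropy (in nats) by $\frac{1}{3}\chi^2(P\|Q) + \frac{1}{6}\chi^2(Q\|P)$. A numerical spot-check of this bound on a few arbitrary distributions:

```python
import math

def D(P, Q):  # relative entropy in nats
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def chi2(P, Q):
    return sum((p - q) ** 2 / q for p, q in zip(P, Q))

pairs = [([0.2, 0.8], [0.6, 0.4]),
         ([0.1, 0.2, 0.7], [0.3, 0.3, 0.4]),
         ([0.25, 0.25, 0.5], [0.3, 0.5, 0.2])]
for P, Q in pairs:
    # bound obtained from (150) with the two moment integrals evaluated
    assert D(P, Q) <= chi2(P, Q) / 3 + chi2(Q, P) / 6 + 1e-12
```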

(c) *Proof of* (33) *and* (34): The proof of (33) relies on (28) and the following lemma.

**Lemma 1.** *For all s*, *θ* ∈ (0, 1)*,*

$$\frac{D_{\phi_s}(P\|Q)}{D_{\phi_\theta}(P\|Q)} \ge \min\left\{\frac{1-\theta}{1-s}, \frac{\theta}{s}\right\}.\tag{155}$$

**Proof.**

$$D_{\phi_s}(P\|Q) = \int \frac{(p-q)^2}{(1-s)p + sq} \, \mathrm{d}\mu \tag{156}$$

$$=\int \frac{(p-q)^2}{(1-\theta)p+\theta q} \frac{(1-\theta)p+\theta q}{(1-s)p+sq} \, \mathrm{d}\mu \tag{157}$$

$$\geq \min\left\{ \frac{1-\theta}{1-s}, \frac{\theta}{s} \right\} \int \frac{(p-q)^2}{(1-\theta)p+\theta q} \, \mathrm{d}\mu \tag{158}$$

$$=\min\left\{\frac{1-\theta}{1-s}, \frac{\theta}{s}\right\} D_{\phi_\theta}(P\|Q). \tag{159}$$

From (28) and (155), for all *θ* ∈ (0, 1),

$$\frac{1}{\log e}\, D(P\|Q) = \int_0^\theta s\, D_{\phi_s}(P\|Q) \, \mathrm{d}s + \int_\theta^1 s\, D_{\phi_s}(P\|Q) \, \mathrm{d}s \tag{160}$$

$$\geq \int_{0}^{\theta} \frac{s(1 - \theta)}{1 - s}\, D_{\phi_\theta}(P\|Q) \, \mathrm{d}s + \int_{\theta}^{1} \theta\, D_{\phi_\theta}(P\|Q) \, \mathrm{d}s \tag{161}$$

$$=\left[-\theta + \log_e\left(\frac{1}{1-\theta}\right)\right] (1-\theta)\, D_{\phi_\theta}(P\|Q) + \theta(1-\theta)\, D_{\phi_\theta}(P\|Q) \tag{162}$$

$$=(1-\theta)\log_e\left(\frac{1}{1-\theta}\right) D_{\phi_\theta}(P\|Q).\tag{163}$$

This proves (33). Furthermore, under the assumption in (31), for all *θ* ∈ [0, 1],

$$\lim_{n \to \infty} \frac{D(P \| P_n)}{D_{\phi_\theta}(P \| P_n)} = \lim_{n \to \infty} \frac{D(P \| P_n)}{\chi^2(P \| P_n)} \lim_{n \to \infty} \frac{\chi^2(P \| P_n)}{D_{\phi_\theta}(P \| P_n)}\tag{164}$$

$$=\frac{1}{2}\log\mathbf{e}\cdot\frac{2}{\phi\_{\theta}^{\prime\prime}(1)}\tag{165}$$

$$= \frac{1}{2} \log e, \tag{166}$$

where (165) holds due to (153) and the local behavior of $f$-divergences [63], and (166) holds due to (17), which implies that $\phi_\theta''(1) = 2$ for all $\theta \in [0, 1]$. This proves (34).

(d) *Proof of* (35): From (24), we get

$$\frac{1}{\text{loge}}D(P\|\|Q) = \int\_0^1 \chi^2(P\|\|\mathbf{1}(1-\mathbf{s})\mathbf{P} + \mathbf{s}\mathbf{Q}) \, \frac{\mathbf{ds}}{\mathbf{s}} \tag{167}$$

$$= \int_0^1 \left[ \chi^2\bigl(P \| (1 - s)P + sQ\bigr) - s^2 \chi^2(P \| Q) \right] \frac{\mathrm{d}s}{s} + \int_0^1 s \, \mathrm{d}s \; \chi^2(P \| Q) \tag{168}$$

$$=\int_0^1 \left[\chi^2\bigl(P\|(1-s)P+sQ\bigr)-s^2 \chi^2(P\|Q)\right] \frac{\mathrm{d}s}{s} + \frac{1}{2}\chi^2(P\|Q).\tag{169}$$

Referring to the integrand of the first term on the right side of (169), for all *s* ∈ (0, 1],

$$\frac{1}{s} \left[ \chi^2(P \parallel (1-s)P + sQ) - s^2 \chi^2(P \parallel Q) \right]$$

$$= s \int (p-q)^2 \left[ \frac{1}{(1-s)p + sq} - \frac{1}{q} \right] \mathrm{d}\mu \tag{170}$$

$$= s(1 - s) \int \frac{(q - p)^3}{q \left[ (1 - s)p + sq \right]} \, \mathrm{d}\mu \tag{171}$$

$$= s(1-s) \int |q-p| \cdot \underbrace{\frac{|q-p|}{q} \cdot \frac{q-p}{p+s(q-p)}}_{\leq \frac{1}{s}\, \mathbf{1}\{q \geq p\}} \, \mathrm{d}\mu \tag{172}$$

$$\leq (1-s)\int (q-p)\,\mathbf{1}\{q \geq p\} \,\mathrm{d}\mu \tag{173}$$

$$= \frac{1}{2}(1 - s)\left|P - Q\right|,\tag{174}$$

where the last equality holds since the equality $\int (q - p) \,\mathrm{d}\mu = 0$ implies that

$$\int (q-p)\,\mathbf{1}\{q \ge p\} \,\mathrm{d}\mu = \int (p-q)\,\mathbf{1}\{p \ge q\} \,\mathrm{d}\mu \tag{175}$$

$$=\frac{1}{2}\int |p-q|\,\mathrm{d}\mu = \frac{1}{2}\,|P-Q|.\tag{176}$$

From (170)–(174), an upper bound on the right side of (169) results. This gives

$$\frac{1}{\log \mathrm{e}}\, D(P\|Q) \le \frac{1}{2}\int_0^1 (1-s)\,\mathrm{d}s\,\left|P-Q\right| + \frac{1}{2}\chi^2(P\|Q) \tag{177}$$

$$=\frac{1}{4}\left|P-Q\right|+\frac{1}{2}\chi^{2}(P\|Q).\tag{178}$$

It should be noted that [15] [Theorem 2(a)] shows that inequality (35) is tight. To that end, let $\varepsilon \in (0, 1)$, and define probability measures $P_\varepsilon$ and $Q_\varepsilon$ on the set $\mathcal{A} = \{0, 1\}$ with $P_\varepsilon(1) = \varepsilon^2$ and $Q_\varepsilon(1) = \varepsilon$. Then,

$$\lim\_{\varepsilon \downarrow 0} \frac{\frac{1}{\log e} D(P\_{\varepsilon} || Q\_{\varepsilon})}{\frac{1}{4} |P\_{\varepsilon} - Q\_{\varepsilon}| + \frac{1}{2} \chi^{2} (P\_{\varepsilon} || Q\_{\varepsilon})} = 1. \tag{179}$$
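The limit in the tightness claim can be watched numerically; a small sketch (in nats) with the two-point measures just defined:

```python
import math

def ratio(eps):
    # P_eps(1) = eps^2 and Q_eps(1) = eps on {0, 1}
    P = [1 - eps ** 2, eps ** 2]
    Q = [1 - eps, eps]
    D = sum(p * math.log(p / q) for p, q in zip(P, Q))
    tv = sum(abs(p - q) for p, q in zip(P, Q))
    chi2 = sum((p - q) ** 2 / q for p, q in zip(P, Q))
    # D / (|P - Q|/4 + chi^2/2), which should approach 1 as eps -> 0
    return D / (tv / 4 + chi2 / 2)

assert abs(ratio(1e-3) - 1) < 0.01
assert abs(ratio(1e-5) - 1) < 1e-3
```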

#### *5.3. Proof of Theorem 2*

We first prove Item (a) in Theorem 2. In view of the Hammersley–Chapman–Robbins lower bound on the $\chi^2$ divergence, for all $\lambda \in [0, 1]$,

$$\chi^2(P\|(1-\lambda)P+\lambda Q) \ge \frac{\left(\mathbb{E}[X] - \mathbb{E}[Z\_{\lambda}]\right)^2}{\text{Var}(Z\_{\lambda})},\tag{180}$$

where *X* ∼ *P*, *Y* ∼ *Q* and *Z<sup>λ</sup>* ∼ *R<sup>λ</sup>* := (1 − *λ*)*P* + *λQ* is defined by

$$Z\_{\lambda} := \begin{cases} X, & \text{with probability } 1 - \lambda, \\ Y, & \text{with probability } \lambda. \end{cases} \tag{181}$$

For *λ* ∈ [0, 1],

$$\mathbb{E}[Z_{\lambda}] = (1 - \lambda)m_P + \lambda m_Q, \tag{182}$$

and it can be verified that

$$\text{Var}(Z\_{\lambda}) = (1 - \lambda)\sigma\_P^2 + \lambda\sigma\_Q^2 + \lambda(1 - \lambda)(m\_P - m\_Q)^2. \tag{183}$$
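Inequality (180), together with (182) and (183), can be spot-checked for discrete distributions; a minimal sketch with arbitrary three-point $P$ and $Q$ on $\{0, 1, 2\}$ (the supports and probabilities are illustrative choices, not from the text):

```python
import math

support = [0.0, 1.0, 2.0]
P = [0.6, 0.3, 0.1]
Q = [0.1, 0.4, 0.5]

def mean(M):
    return sum(u * m for u, m in zip(support, M))

def var(M):
    mu = mean(M)
    return sum((u - mu) ** 2 * m for u, m in zip(support, M))

def chi2(A, B):
    return sum((a - b) ** 2 / b for a, b in zip(A, B))

for k in range(1, 100):
    lam = k / 100.0
    R = [(1 - lam) * p + lam * q for p, q in zip(P, Q)]
    # E[Z_lam] and Var(Z_lam) as in (182) and (183)
    mZ = (1 - lam) * mean(P) + lam * mean(Q)
    vZ = ((1 - lam) * var(P) + lam * var(Q)
          + lam * (1 - lam) * (mean(P) - mean(Q)) ** 2)
    # (182)/(183) agree with the moments of the mixture R_lam ...
    assert abs(mZ - mean(R)) < 1e-12 and abs(vZ - var(R)) < 1e-12
    # ... and the Hammersley-Chapman-Robbins bound (180) holds
    assert chi2(P, R) >= (mean(P) - mZ) ** 2 / vZ - 1e-12
```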

We now rely on (24)

$$\frac{1}{\log e}\, D(P\|Q) = \int_0^1 \chi^2\bigl(P\|(1-\lambda)P + \lambda Q\bigr) \, \frac{\mathrm{d}\lambda}{\lambda} \tag{184}$$

to get a lower bound on the relative entropy. Combining (180), (183) and (184) yields

$$\frac{1}{\log e}\, D(P\|Q) \ge (m_P - m_Q)^2 \int_0^1 \frac{\lambda}{(1-\lambda)\sigma_P^2 + \lambda\sigma_Q^2 + \lambda(1-\lambda)(m_P - m_Q)^2}\,\mathrm{d}\lambda. \tag{185}$$

From (43) and (44), we get

$$\int_0^1 \frac{\lambda}{(1-\lambda)\sigma_P^2 + \lambda\sigma_Q^2 + \lambda(1-\lambda)(m_P - m_Q)^2} \,\mathrm{d}\lambda = \int_0^1 \frac{\lambda}{(\alpha - a\lambda)(\beta + a\lambda)} \,\mathrm{d}\lambda,\tag{186}$$

where

$$\alpha := \sqrt{\sigma_P^2 + \frac{b^2}{4a^2}} + \frac{b}{2a}, \tag{187}$$

$$\beta := \sqrt{\sigma_P^2 + \frac{b^2}{4a^2}} - \frac{b}{2a}. \tag{188}$$

By using the partial fraction decomposition of the integrand on the right side of (186), we get (after multiplying both sides of (185) by log e)

$$D(P\|Q) \ge \frac{(m_P - m_Q)^2}{a^2} \left[ \frac{\alpha}{\alpha+\beta} \log\left(\frac{\alpha}{\alpha-a}\right) + \frac{\beta}{\alpha+\beta} \log\left(\frac{\beta}{\beta+a}\right) \right] \tag{189}$$

$$= \frac{\alpha}{\alpha + \beta} \log \left( \frac{\alpha}{\alpha - a} \right) + \frac{\beta}{\alpha + \beta} \log \left( \frac{\beta}{\beta + a} \right) \tag{190}$$

$$= d \left( \frac{\alpha}{\alpha+\beta} \,\middle\|\, \frac{\alpha-a}{\alpha+\beta} \right), \tag{191}$$

where (189) holds by integration since $\alpha - a\lambda$ and $\beta + a\lambda$ are both non-negative for all $\lambda \in [0, 1]$. To verify the latter claim, it should be noted that (43) and the assumption that $m_P \neq m_Q$ imply that $a \neq 0$. Since $\alpha, \beta > 0$, it follows that, for all $\lambda \in [0, 1]$, either $\alpha - a\lambda > 0$ or $\beta + a\lambda > 0$ (if $a < 0$, then the former is positive, and, if $a > 0$, then the latter is positive). By comparing the denominators of both integrands on the left and right sides of (186), it follows that $(\alpha - a\lambda)(\beta + a\lambda) \ge 0$ for all $\lambda \in [0, 1]$. Since the product of $\alpha - a\lambda$ and $\beta + a\lambda$ is non-negative and at least one of these terms is positive, it follows that $\alpha - a\lambda$ and $\beta + a\lambda$ are both non-negative for all $\lambda \in [0, 1]$. Finally, (190) follows from (43).

If $m_P - m_Q \to 0$ and $\sigma_P \neq \sigma_Q$, then it follows from (43) and (44) that $a \to 0$ and $b \to \sigma_P^2 - \sigma_Q^2 \neq 0$. Hence, from (187) and (188), $\alpha \ge \frac{b}{a} \to \infty$ and $\beta \to 0$, which implies that the lower bound on $D(P\|Q)$ in (191) tends to zero.

Letting $r := \frac{\alpha}{\alpha+\beta}$ and $s := \frac{\alpha-a}{\alpha+\beta}$, we obtain that the lower bound on $D(P\|Q)$ in (40) holds. This bound is consistent with the expressions of $r$ and $s$ in (41) and (42) since, from (45), (187) and (188),

$$r = \frac{\alpha}{\alpha+\beta} = \frac{v + \frac{b}{2a}}{2v} = \frac{1}{2} + \frac{b}{4av}, \tag{192}$$

$$s = \frac{\alpha - a}{\alpha + \beta} = r - \frac{a}{\alpha + \beta} = r - \frac{a}{2v}.\tag{193}$$

It should be noted that $r, s \in [0, 1]$. First, from (187) and (188), $\alpha$ and $\beta$ are positive if $\sigma_P \neq 0$, which yields $r = \frac{\alpha}{\alpha+\beta} \in (0, 1)$. We next show that $s \in [0, 1]$. Recall that $\alpha - a\lambda$ and $\beta + a\lambda$ are both non-negative for all $\lambda \in [0, 1]$. Setting $\lambda = 1$ yields $\alpha \ge a$, which (from (193)) implies that $s \ge 0$. Furthermore, from (193) and the positivity of $\alpha + \beta$, it follows that $s \le 1$ if and only if $\beta \ge -a$. The latter holds since $\beta + a\lambda \ge 0$ for all $\lambda \in [0, 1]$ (in particular, for $\lambda = 1$). If $\sigma_P = 0$, then it follows from (41)–(45) that $v = \frac{b}{2|a|}$, $b = a^2 + \sigma_Q^2$, and (recall that $a \neq 0$)

(i) if $a > 0$, then $v = \frac{b}{2a}$ implies that $r = \frac{1}{2} + \frac{b}{4av} = 1$, and $s = r - \frac{a}{2v} = 1 - \frac{a^2}{b} = \frac{\sigma_Q^2}{\sigma_Q^2 + a^2} \in [0, 1]$;

(ii) if $a < 0$, then $v = -\frac{b}{2a}$ implies that $r = \frac{1}{2} + \frac{b}{4av} = 0$, and $s = r - \frac{a}{2v} = \frac{a^2}{b} = \frac{a^2}{a^2 + \sigma_Q^2} \in [0, 1]$.

We next prove Item (b) in Theorem 2 (i.e., the achievability of the lower bound in (40)). To that end, we provide a technical lemma, which can be verified by the reader.

**Lemma 2.** *Let r*,*s be given in* (41)*–*(45)*, and let u*1,2 *be given in* (47)*. Then,*

$$(s-r)(u_1 - u_2) = m_Q - m_P, \tag{194}$$

$$u_1 + u_2 = m_P + m_Q + \frac{\sigma_Q^2 - \sigma_P^2}{m_Q - m_P}.\tag{195}$$

Let $X \sim P$ and $Y \sim Q$ be defined on a set $\mathcal{U} = \{u_1, u_2\}$ (for the moment, the values of $u_1$ and $u_2$ are not yet specified) with $P[X = u_1] = r$, $P[X = u_2] = 1 - r$, $Q[Y = u_1] = s$, and $Q[Y = u_2] = 1 - s$. We now calculate $u_1$ and $u_2$ such that $\mathbb{E}[X] = m_P$ and $\mathrm{Var}(X) = \sigma_P^2$. This is equivalent to

$$r u_1 + (1 - r)u_2 = m_P, \tag{196}$$

$$r u_1^2 + (1 - r)u_2^2 = m_P^2 + \sigma_P^2. \tag{197}$$

Substituting (196) into the right side of (197) gives

$$r u_1^2 + (1 - r) u_2^2 = \left[ r u_1 + (1 - r) u_2 \right]^2 + \sigma_P^2, \tag{198}$$

*Entropy* **2020**, *22*, 563

which, by rearranging terms, also gives

$$u_1 - u_2 = \pm \sqrt{\frac{\sigma_P^2}{r(1-r)}}.\tag{199}$$

Solving simultaneously (196) and (199) gives

$$u_1 = m_P \pm \sqrt{\frac{(1-r)\sigma_P^2}{r}},\tag{200}$$

$$u_2 = m_P \mp \sqrt{\frac{r\sigma_P^2}{1-r}}.\tag{201}$$

We next verify that, by setting $u_{1,2}$ as in (47), one also gets (as desired) that $\mathbb{E}[Y] = m_Q$ and $\mathrm{Var}(Y) = \sigma_Q^2$. From Lemma 2, and from (196) and (197), we have

$$\mathbb{E}[Y] = su\_1 + (1 - s)u\_2 \tag{202}$$

$$= \left(ru_1 + (1 - r)u_2\right) + (s - r)(u_1 - u_2) \tag{203}$$

$$= m_P + (s - r)(u_1 - u_2) = m_Q, \tag{204}$$

$$\mathbb{E}[Y^2] = su\_1^2 + (1-s)u\_2^2\tag{205}$$

$$=ru\_1^2 + (1-r)u\_2^2 + (s-r)(u\_1^2 - u\_2^2) \tag{206}$$

$$= \mathbb{E}[X^2] + (s - r)(u_1 - u_2)(u_1 + u_2) \tag{207}$$

$$= m_P^2 + \sigma_P^2 + (m_Q - m_P) \left( m_P + m_Q + \frac{\sigma_Q^2 - \sigma_P^2}{m_Q - m_P} \right) \tag{208}$$

$$= m_Q^2 + \sigma_Q^2. \tag{209}$$

By combining (204) and (209), we obtain $\mathrm{Var}(Y) = \sigma_Q^2$. Hence, the probability mass functions $P$ and $Q$ defined on $\mathcal{U} = \{u_1, u_2\}$ (with $u_1$ and $u_2$ in (47)) such that

$$P(u_1) = 1 - P(u_2) = r, \quad Q(u_1) = 1 - Q(u_2) = s \tag{210}$$

satisfy the equality constraints in (39), while also achieving the lower bound on $D(P\|Q)$, which is equal to $d(r\|s)$. It can also be verified that the second option, where

$$u\_1 = m\_P - \sqrt{\frac{(1-r)\sigma\_P^2}{r}}, \quad u\_2 = m\_P + \sqrt{\frac{r\sigma\_P^2}{1-r}}\tag{211}$$

does *not* satisfy the conditions $\mathbb{E}[Y] = m_Q$ and $\mathrm{Var}(Y) = \sigma_Q^2$, so there is a unique pair of probability measures $P$ and $Q$, defined on a two-element set, that achieves the lower bound in (40) under the equality constraints in (39).

We finally prove Item (c) in Theorem 2. Let $m \in \mathbb{R}$, $\sigma_P^2$, and $\sigma_Q^2$ be selected arbitrarily such that $\sigma_Q^2 \ge \sigma_P^2$. We construct probability measures $P_\varepsilon$ and $Q_\varepsilon$, depending on a free parameter $\varepsilon$, with means $m_P = m_Q := m$ and variances $\sigma_P^2$ and $\sigma_Q^2$, respectively (means and variances are independent of $\varepsilon$), which are defined on a three-element set $\mathcal{U} := \{u_1, u_2, u_3\}$ as follows:

$$P_{\varepsilon}(u_{1}) = r, \quad P_{\varepsilon}(u_{2}) = 1 - r, \quad P_{\varepsilon}(u_{3}) = 0, \tag{212}$$

$$Q_{\varepsilon}(u_{1}) = s, \quad Q_{\varepsilon}(u_{2}) = 1 - s - \varepsilon, \quad Q_{\varepsilon}(u_{3}) = \varepsilon,\tag{213}$$

with *ε* > 0. We aim to set the parameters *r*,*s*, *u*1, *u*<sup>2</sup> and *u*<sup>3</sup> (as a function of *m*, *σP*, *σ<sup>Q</sup>* and *ε*) such that

$$\lim\_{\varepsilon \to 0^{+}} D(P\_{\varepsilon} \| Q\_{\varepsilon}) = 0. \tag{214}$$

Proving (214) yields (48), while it also follows that the infimum on the left side of (48) can be restricted to probability measures which are defined on a three-element set.

In view of the constraints on the means and variances in (39), with equal means *m*, we get the following set of equations from (212) and (213):

$$\begin{cases} ru_1 + (1 - r)u_2 = m, \\ su_1 + (1 - s - \varepsilon)u_2 + \varepsilon u_3 = m, \\ ru_1^2 + (1 - r)u_2^2 = m^2 + \sigma_P^2, \\ su_1^2 + (1 - s - \varepsilon)u_2^2 + \varepsilon u_3^2 = m^2 + \sigma_Q^2. \end{cases} \tag{215}$$

The first and second equations in (215) refer to the equal means under *P* and *Q*, and the third and fourth equations in (215) refer to the second moments in (39). Furthermore, in view of (212) and (213), the relative entropy is given by

$$D(P\_{\varepsilon} \| Q\_{\varepsilon}) = r \log \frac{r}{s} + (1 - r) \log \frac{1 - r}{1 - s - \varepsilon}. \tag{216}$$

Subtracting the square of the first equation in (215) from its third equation gives the equivalent set of equations

$$\begin{cases} ru_1 + (1 - r)u_2 = m, \\ su_1 + (1 - s - \varepsilon)u_2 + \varepsilon u_3 = m, \\ r(1 - r)(u_1 - u_2)^2 = \sigma_P^2, \\ su_1^2 + (1 - s - \varepsilon)u_2^2 + \varepsilon u_3^2 = m^2 + \sigma_Q^2. \end{cases} \tag{217}$$

We next select $u_1$ and $u_2$ such that $u_1 - u_2 := 2\sigma_P$. Then, the third equation in (217) gives $r(1-r) = \frac{1}{4}$, so $r = \frac{1}{2}$. Furthermore, the first equation in (217) gives

$$u_1 = m + \sigma_P, \tag{218}$$

$$u_2 = m - \sigma_P.\tag{219}$$

Since *r*, *u*1, and *u*<sup>2</sup> are independent of *ε*, so is the probability measure *P<sup>ε</sup>* := *P*. Combining the second equation in (217) with (218) and (219) gives

$$u_3 = m - \left(1 + \frac{2s - 1}{\varepsilon}\right) \sigma_P.\tag{220}$$

Substituting (218)–(220) into the fourth equation of (217) gives a quadratic equation for $s$, whose selected solution (such that $s$ and $r = \frac{1}{2}$ are close for small $\varepsilon > 0$) is equal to

$$s = \frac{1}{2} \left[ 1 - \varepsilon + \sqrt{\left( \frac{\sigma_Q^2}{\sigma_P^2} - 1 + \varepsilon \right) \varepsilon} \, \right]. \tag{221}$$

Hence, $s = \frac{1}{2} + O(\sqrt{\varepsilon})$, which implies that $s \in (0, 1 - \varepsilon)$ for sufficiently small $\varepsilon > 0$ (as required in (213)). In view of (216), it also follows that $D(P_\varepsilon\|Q_\varepsilon)$ vanishes as we let $\varepsilon$ tend to zero.
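The three-point construction (212)–(221) is easy to exercise numerically; a sketch in nats (the values of $m$, $\sigma_P$, $\sigma_Q$, and $\varepsilon$ are arbitrary illustrative choices):

```python
import math

def construction(m, sP, sQ, eps):
    # r = 1/2 and u_1, u_2 per (218)-(219); s per (221); u_3 per (220)
    r = 0.5
    s = 0.5 * (1 - eps + math.sqrt((sQ**2 / sP**2 - 1 + eps) * eps))
    u1, u2 = m + sP, m - sP
    u3 = m - (1 + (2 * s - 1) / eps) * sP
    P = {u1: r, u2: 1 - r, u3: 0.0}
    Q = {u1: s, u2: 1 - s - eps, u3: eps}
    return P, Q

m, sP, sQ = 0.5, 1.0, 2.0
for eps in (1e-2, 1e-4, 1e-6):
    P, Q = construction(m, sP, sQ, eps)
    # both measures have mean m and the prescribed variances
    for M, v in ((P, sP**2), (Q, sQ**2)):
        mu = sum(u * w for u, w in M.items())
        assert abs(mu - m) < 1e-8
        assert abs(sum((u - mu)**2 * w for u, w in M.items()) - v) < 1e-6
    # relative entropy (216) vanishes as eps -> 0
    D = sum(w * math.log(w / Q[u]) for u, w in P.items() if w > 0)
    assert D < 10 * math.sqrt(eps)
```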

We finally outline an alternative proof, which refers to the case of equal means with arbitrarily selected $\sigma_P^2$ and $\sigma_Q^2$. Let $(\sigma_P^2, \sigma_Q^2) \in (0, \infty)^2$. We next construct a sequence of pairs of probability measures $\{(P_n, Q_n)\}$ with zero mean and respective variances $(\sigma_P^2, \sigma_Q^2)$ for which $D(P_n\|Q_n) \to 0$ as $n \to \infty$ (without any loss of generality, one can assume that the equal means are equal to zero). We start by assuming $(\sigma_P^2, \sigma_Q^2) \in (1, \infty)^2$. Let

$$\mu_n := \sqrt{1 + n(\sigma_Q^2 - 1)},\tag{222}$$

and define a sequence of quaternary real-valued random variables with probability mass functions

$$Q\_n(a) := \begin{cases} \frac{1}{2} - \frac{1}{2n} & a = \pm 1, \\ \frac{1}{2n} & a = \pm \mu\_n. \end{cases} \tag{223}$$

It can be verified that, for all $n \in \mathbb{N}$, $Q_n$ has zero mean and variance $\sigma_Q^2$. Furthermore, let

$$P_n(a) := \begin{cases} \frac{1}{2} - \frac{\xi}{2n} & a = \pm 1, \\ \frac{\xi}{2n} & a = \pm \mu_n, \end{cases} \tag{224}$$

with

$$\xi := \frac{\sigma_P^2 - 1}{\sigma_Q^2 - 1}.\tag{225}$$

If $\xi > 1$, then, for $n = 1, \ldots, \lceil \xi \rceil$, we choose $P_n$ arbitrarily with mean 0 and variance $\sigma_P^2$. Then,

$$\mathrm{Var}(P_n) = 1 - \frac{\xi}{n} + \frac{\xi}{n}\, \mu_n^2 = \sigma_P^2, \tag{226}$$

$$D(P_n \| Q_n) = d\left(\frac{\xi}{n} \,\middle\|\, \frac{1}{n}\right) \to 0. \tag{227}$$

Next, suppose $\sigma^2 := \min\{\sigma_P^2, \sigma_Q^2\} < 1$. Then, construct $P'_n$ and $Q'_n$ as before with variances $\frac{2\sigma_P^2}{\sigma^2} > 1$ and $\frac{2\sigma_Q^2}{\sigma^2} > 1$, respectively. If $P_n$ and $Q_n$ denote the probability measures obtained from $P'_n$ and $Q'_n$ by scaling the random variables by a factor of $\sqrt{\frac{\sigma^2}{2}}$, then their variances are $\sigma_P^2$ and $\sigma_Q^2$, respectively, and $D(P_n\|Q_n) = D(P'_n\|Q'_n) \to 0$ as we let $n \to \infty$.
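A numerical sketch of the quaternary construction (222)–(227), in nats (the variance values are arbitrary choices, both above 1):

```python
import math

sP2, sQ2 = 2.0, 5.0           # sigma_P^2 and sigma_Q^2, both > 1
xi = (sP2 - 1) / (sQ2 - 1)    # as in (225)

def d(a, b):  # binary relative entropy in nats
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

for n in (10, 100, 1000):
    mu_n = math.sqrt(1 + n * (sQ2 - 1))  # as in (222)
    atoms = [1.0, -1.0, mu_n, -mu_n]
    Qn = [0.5 - 0.5 / n, 0.5 - 0.5 / n, 0.5 / n, 0.5 / n]
    Pn = [0.5 - 0.5 * xi / n] * 2 + [0.5 * xi / n] * 2
    # zero mean and the prescribed variances, per (226)
    for M, v in ((Pn, sP2), (Qn, sQ2)):
        assert abs(sum(a * m for a, m in zip(atoms, M))) < 1e-12
        assert abs(sum(a * a * m for a, m in zip(atoms, M)) - v) < 1e-9
    # D(P_n || Q_n) collapses to the binary divergence in (227)
    D = sum(p * math.log(p / q) for p, q in zip(Pn, Qn))
    assert abs(D - d(xi / n, 1 / n)) < 1e-12
assert d(xi / 1000, 1 / 1000) < 1e-2  # vanishes as n grows
```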

To conclude, it should be noted that the sequences of probability measures in the latter proof are defined on a four-element set. Recall that, in the earlier proof, specialized to the case of (equal means with) $\sigma_P^2 \le \sigma_Q^2$, the introduced probability measures are defined on a three-element set, and the reference probability measure $P$ is fixed while referring to an equiprobable binary random variable.

#### *5.4. Proof of Theorem 3*

We first prove (52). Differentiating both sides of (22) gives that, for all *λ* ∈ (0, 1],

$$F'(\lambda) = \frac{1}{\lambda} \chi^2(P \| R\_{\lambda}) \text{ log e} \tag{228}$$

$$\geq \frac{1}{\lambda} \left[ \exp \left( D(P \| R_{\lambda}) \right) - 1 \right] \log e \tag{229}$$

$$= \frac{1}{\lambda} \left[ \exp \left( F(\lambda) \right) - 1 \right] \log e,\tag{230}$$

where (228) holds due to (21), (22) and (50); (229) holds by (16) and (230) is due to (21) and (50). This gives (52).

We next prove (53), and the conclusion which appears after it. In view of [16] [Theorem 8], applied to *f*(*t*) := − log *t* for all *t* > 0, we get (it should be noted that, by the definition of *F* in (50), the result in [16] [(195)–(196)] is used here by swapping *P* and *Q*)

$$\lim_{\lambda \to 0^+} \frac{F(\lambda)}{\lambda^2} = \frac{1}{2}\, \chi^2(Q \| P) \log e. \tag{231}$$

Since $\lim_{\lambda \to 0^+} F(\lambda) = 0$, it follows by L'Hôpital's rule that

$$\lim_{\lambda \to 0^+} \frac{F'(\lambda)}{\lambda} = 2 \lim_{\lambda \to 0^+} \frac{F(\lambda)}{\lambda^2} = \chi^2(Q \| P) \log e,\tag{232}$$

which gives (53). A comparison of the limit in (53) with a lower bound which follows from (52) gives

$$\lim\_{\lambda \to 0^+} \frac{F'(\lambda)}{\lambda} \ge \lim\_{\lambda \to 0^+} \frac{1}{\lambda^2} \left[ \exp \left( F(\lambda) \right) - 1 \right] \log \mathbf{e} \tag{233}$$

$$=\lim\_{\lambda \to 0^{+}} \frac{F(\lambda)}{\lambda^{2}} \lim\_{\lambda \to 0^{+}} \frac{\exp\left(F(\lambda)\right) - 1}{F(\lambda)} \cdot \log \mathbf{e} \tag{234}$$

$$=\lim\_{\lambda \to 0^{+}} \frac{F(\lambda)}{\lambda^2} \lim\_{\mu \to 0} \frac{\mathbf{e}^{\mu} - 1}{\mu} \tag{235}$$

$$=\frac{1}{2}\, \chi^2(Q \| P) \log e,\tag{236}$$

where (236) relies on (231). Hence, the limit in (53) is twice as large as its lower bound on the right side of (236). This proves the conclusion which comes right after (53).

We finally prove the known result in (51) by showing an alternative proof, which is based on (52). The function $F$ is non-negative on $[0, 1]$, and it is strictly positive on $(0, 1]$ if $P \neq Q$. Let $P \neq Q$ (otherwise, (51) is trivial). Rearranging terms in (52) and integrating both sides over the interval $[\lambda, 1]$, for $\lambda \in (0, 1]$, gives

$$\int\_{\lambda}^{1} \frac{F'(t)}{\exp\{F(t)\} - 1} \, dt \ge \int\_{\lambda}^{1} \frac{\mathrm{d}t}{t} \, \log \mathrm{e} \tag{237}$$

$$=\log\frac{1}{\lambda}, \quad \forall\, \lambda \in (0,1]. \tag{238}$$

The left side of (237) satisfies

$$\int\_{\lambda}^{1} \frac{F'(t)}{\exp(F(t)) - 1} \, \mathrm{d}t = \int\_{\lambda}^{1} \frac{F'(t) \exp(-F(t))}{1 - \exp(-F(t))} \, \mathrm{d}t \tag{239}$$

$$=\int\_{\lambda}^{1} \frac{\mathbf{d}}{\mathbf{d}t} \left\{ \log \left( 1 - \exp \left( -F(t) \right) \right) \right\} \, \mathrm{d}t \tag{240}$$

$$=\log\left(\frac{1-\exp\left(-D(P||Q)\right)}{1-\exp\left(-F(\lambda)\right)}\right),\tag{241}$$

where (241) holds since $F(1) = D(P\|Q)$ (see (50)). Combining (237)–(241) gives

$$\frac{1 - \exp\left(-D(P\|Q)\right)}{1 - \exp\left(-F(\lambda)\right)} \ge \frac{1}{\lambda}, \quad \forall \lambda \in (0, 1], \tag{242}$$

which, due to the non-negativity of *F*, gives the right side inequality in (51) after rearrangement of terms in (242).

#### *5.5. Proof of Theorem 4*

**Lemma 3.** *Let $f_0 \colon (0, \infty) \to \mathbb{R}$ be a convex function with $f_0(1) = 0$, and let $\{f_k(\cdot)\}_{k=0}^\infty$ be defined as in* (58)*. Then,* $\{f_k(\cdot)\}_{k=0}^\infty$ *is a sequence of convex functions on* $(0, \infty)$*, and*

$$f_k(x) \ge f_{k+1}(x), \quad \forall\, x > 0, \; k \in \{0, 1, \ldots\}. \tag{243}$$

**Proof.** We prove the convexity of $\{f_k(\cdot)\}$ on $(0, \infty)$ by induction. Suppose that $f_k(\cdot)$ is a convex function with $f_k(1) = 0$ for a fixed integer $k \ge 0$. The recursion in (58) yields $f_{k+1}(1) = 0$ and, by the change of integration variable $s := (1 - x)s'$,

$$f_{k+1}(x) = \int_0^1 f_k(s'x - s' + 1)\, \frac{\mathrm{d}s'}{s'}, \quad x > 0. \tag{244}$$

Consequently, for $t \in (0, 1)$ and $x \neq y$ with $x, y > 0$, applying (244) gives

$$f\_{k+1}((1-t)\mathbf{x} + ty) = \int\_0^1 f\_k(s'[(1-t)\mathbf{x} + ty] - s' + 1) \frac{\mathbf{ds}'}{\mathbf{s}'} \tag{245}$$

$$=\int\_{0}^{1} f\_{k}\left((1-t)(s'x-s'+1) + t(s'y-s'+1)\right) \frac{\mathbf{ds}'}{\mathbf{s}'}\tag{246}$$

$$\leq (1-t)\int\_{0}^{1} f\_{k}(s'x - s' + 1) \frac{\mathrm{d}s'}{\mathrm{s'}} + t \int\_{0}^{1} f\_{k}(s'y - s' + 1) \frac{\mathrm{d}s'}{\mathrm{s'}} \tag{247}$$

$$=(1-t)f\_{k+1}(\mathbf{x}) + tf\_{k+1}(y),\tag{248}$$

where (247) holds since $f_k(\cdot)$ is convex on $(0, \infty)$ (by assumption). Hence, from (245)–(248), $f_{k+1}(\cdot)$ is also convex on $(0, \infty)$ with $f_{k+1}(1) = 0$. By mathematical induction and our assumptions on $f_0$, it follows that $\{f_k(\cdot)\}_{k=0}^\infty$ is a sequence of convex functions on $(0, \infty)$ which vanish at 1.

We next prove (243). For all *x*, *y* > 0 and *k* ∈ {0, 1, . . .},

$$f\_{k+1}(y) \ge f\_{k+1}(\mathbf{x}) + f\_{k+1}'(\mathbf{x}) \left(y - \mathbf{x}\right) \tag{249}$$

$$= f_{k+1}(x) + \frac{f_k(x)}{x - 1}\, (y - x),\tag{250}$$

where (249) holds since $f_{k+1}(\cdot)$ is convex on $(0, \infty)$, and (250) relies on the recursive equation in (58). Substituting $y = 1$ into (249)–(250), and using the equality $f_{k+1}(1) = 0$, gives (243).

We next prove Theorem 4. From Lemma 3, it follows that $D_{f_k}(P\|Q)$ is an $f$-divergence for all integers $k \ge 0$, and the non-negative sequence $\{D_{f_k}(P\|Q)\}_{k=0}^\infty$ is monotonically non-increasing. From (21) and (58), it also follows that, for all $\lambda \in [0, 1]$ and integers $k \in \{0, 1, \ldots\}$,

$$D\_{f\_{k+1}}(\mathcal{R}\_{\lambda} || P) = \int p \, f\_{k+1} \left( \frac{r\_{\lambda}}{p} \right) \, \mathrm{d}\mu \tag{251}$$

$$= \int p \int_0^{(p-q)\lambda/p} f_k(1-s) \, \frac{\mathrm{d}s}{s} \, \mathrm{d}\mu \tag{252}$$

$$=\int p\int\_{0}^{\lambda} f\_k \left(1 + \frac{(q-p)s'}{p}\right) \frac{\mathbf{d}s'}{\mathbf{s'}} \,\mathrm{d}\mu\tag{253}$$

$$= \int_0^\lambda \int p\, f_k \left(\frac{r_{s'}}{p}\right) \mathrm{d}\mu \; \frac{\mathrm{d}s'}{s'} \tag{254}$$

$$= \int_0^\lambda D_{f_k}(R_{s'} \| P) \, \frac{\mathrm{d}s'}{s'},\tag{255}$$

where the substitution $s := \frac{(p-q)s'}{p}$ is invoked in (253), and then (254) holds since $\frac{r_{s'}}{p} = 1 + \frac{(q-p)s'}{p}$ for $s' \in [0, 1]$ (this follows from (21)), and by interchanging the order of the integrations.

#### *5.6. Proof of Corollary 5*

Combining (60) and (61) yields (58); furthermore, $f_0 \colon (0, \infty) \to \mathbb{R}$, given by $f_0(x) = \frac{1}{x} - 1$ for all $x > 0$, is convex on $(0, \infty)$ with $f_0(1) = 0$. Hence, Theorem 4 holds for the selected functions $\{f_k(\cdot)\}_{k=0}^\infty$ in (61), which are therefore all convex on $(0, \infty)$ and vanish at 1. This proves that (59) holds for all $\lambda \in [0, 1]$ and $k \in \{0, 1, \ldots\}$. Since $f_0(x) = \frac{1}{x} - 1$ and $f_1(x) = -\log_e(x)$ for all $x > 0$ (see (60) and (61)), then, for every pair of probability measures $P$ and $Q$:

$$D_{f_0}(P \| Q) = \chi^2(Q \| P), \qquad D_{f_1}(P \| Q) = \frac{1}{\log \mathrm{e}} \, D(Q \| P). \tag{256}$$

Finally, combining (59), for *k* = 0, together with (256), gives (22) as a special case.
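The two identities in (256) are easy to verify numerically on a finite alphabet. The sketch below uses arbitrarily chosen toy distributions and computes all divergences in nats, so that the $1/\log \mathrm{e}$ factor disappears:

```python
import math

# Toy three-point distributions (arbitrary choices for illustration).
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]

def f_divergence(f, P, Q):
    # D_f(P||Q) = sum_x q(x) f(p(x)/q(x))
    return sum(q * f(p / q) for p, q in zip(P, Q))

f0 = lambda x: 1.0 / x - 1.0   # f_0(x) = 1/x - 1
f1 = lambda x: -math.log(x)    # f_1(x) = -log_e(x)

chi2_QP = sum((q - p) ** 2 / p for p, q in zip(P, Q))   # chi^2(Q||P)
kl_QP = sum(q * math.log(q / p) for p, q in zip(P, Q))  # D(Q||P) in nats

assert math.isclose(f_divergence(f0, P, Q), chi2_QP)
assert math.isclose(f_divergence(f1, P, Q), kl_QP)
```

Both assertions pass because $\sum_x q \, (q/p - 1) = \sum_x q^2/p - 1 = \chi^2(Q\|P)$ and $\sum_x q \log(q/p) = D(Q\|P)$.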

#### *5.7. Proof of Theorem 5 and Corollary 6*

For an arbitrary measurable set $\mathcal{E} \subseteq \mathcal{X}$, we have from (62)

$$
\mu_{\mathcal{C}}(\mathcal{E}) = \int_{\mathcal{E}} \frac{\mathbf{1}_{\mathcal{C}}(x)}{\mu(\mathcal{C})} \, \mathrm{d}\mu(x), \tag{257}
$$

where $\mathbf{1}_{\mathcal{C}} \colon \mathcal{X} \to \{0,1\}$ is the indicator function of $\mathcal{C} \subseteq \mathcal{X}$, i.e., $\mathbf{1}_{\mathcal{C}}(x) := 1\{x \in \mathcal{C}\}$ for $x \in \mathcal{X}$. Hence,

$$\frac{\mathrm{d}\mu_{\mathcal{C}}}{\mathrm{d}\mu}(x) = \frac{\mathbf{1}_{\mathcal{C}}(x)}{\mu(\mathcal{C})}, \quad \forall \, x \in \mathcal{X}, \tag{258}$$

and

$$D_f(\mu_{\mathcal{C}} \| \mu) = \int_{\mathcal{X}} f\!\left(\frac{\mathrm{d}\mu_{\mathcal{C}}}{\mathrm{d}\mu}\right) \mathrm{d}\mu \tag{259}$$

$$= \int_{\mathcal{C}} f\!\left(\frac{1}{\mu(\mathcal{C})}\right) \mathrm{d}\mu(x) + \int_{\mathcal{X} \setminus \mathcal{C}} f(0) \, \mathrm{d}\mu(x) \tag{260}$$

$$= \mu(\mathcal{C}) \, f\!\left(\frac{1}{\mu(\mathcal{C})}\right) + \mu(\mathcal{X} \setminus \mathcal{C}) \, f(0) \tag{261}$$

$$= \widetilde{f}\left(\mu(\mathcal{C})\right) + \left(1 - \mu(\mathcal{C})\right) f(0), \tag{262}$$

where the last equality holds by the definition of $\widetilde{f}$ in (63). This proves Theorem 5. Corollary 6 is next proved by first establishing (67) for the Rényi divergence. For all $\alpha \in (0,1) \cup (1,\infty)$,

$$D_{\alpha}\left(\mu_{\mathcal{C}} \| \mu\right) = \frac{1}{\alpha-1} \log \int_{\mathcal{X}} \left(\frac{\mathrm{d}\mu_{\mathcal{C}}}{\mathrm{d}\mu}\right)^{\alpha} \mathrm{d}\mu \tag{263}$$

$$= \frac{1}{\alpha - 1} \log \int_{\mathcal{C}} \left(\frac{1}{\mu(\mathcal{C})}\right)^{\alpha} \mathrm{d}\mu \tag{264}$$

$$= \frac{1}{\alpha - 1} \log\left(\left(\frac{1}{\mu(\mathcal{C})}\right)^{\alpha} \mu(\mathcal{C})\right) \tag{265}$$

$$= \log \frac{1}{\mu(\mathcal{C})}. \tag{266}$$

The justification of (67) for *α* = 1 is due to the continuous extension of the order-*α* Rényi divergence at *α* = 1, which gives the relative entropy (see (13)). Equality (65) is obtained from (67) at *α* = 1. Finally, (66) is obtained by combining (15) and (67) with *α* = 2.
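The $\alpha$-independence of $D_{\alpha}(\mu_{\mathcal{C}} \| \mu)$ in (67) is easy to confirm numerically. A minimal sketch on a toy space (our own choice: a uniform measure on ten atoms, conditioned on a three-atom set):

```python
import math

# mu uniform on {0,...,9}; mu_C is mu conditioned on C = {0,1,2}.
mu = [0.1] * 10
C = {0, 1, 2}
mu_C = sum(mu[x] for x in C)                                    # mu(C) = 0.3
mu_cond = [mu[x] / mu_C if x in C else 0.0 for x in range(10)]

def renyi(P, Q, alpha):
    # Order-alpha Renyi divergence in nats (alpha != 1); 0^alpha terms are dropped.
    return math.log(sum(p ** alpha * q ** (1 - alpha)
                        for p, q in zip(P, Q) if p > 0)) / (alpha - 1)

for alpha in (0.5, 2.0, 10.0):
    assert math.isclose(renyi(mu_cond, mu, alpha), math.log(1 / mu_C))
```

The common value is $\log(1/\mu(\mathcal{C}))$, in agreement with (266), for every tested order.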

#### *5.8. Proof of Theorem 6*

Equation (100) is an equivalent form of (27). From (91) and (100), for all $\alpha \in [0,1]$,

$$\frac{1}{\log \mathrm{e}} \, S_{\alpha}(P \| Q) = \alpha \cdot \frac{1}{\log \mathrm{e}} \, K_{\alpha}(P \| Q) + (1-\alpha) \cdot \frac{1}{\log \mathrm{e}} \, K_{1-\alpha}(Q \| P) \tag{267}$$

$$= \alpha \int_0^{\alpha} s \, D_{\phi_s}(P \| Q) \, \mathrm{d}s + (1-\alpha) \int_0^{1-\alpha} s \, D_{\phi_s}(Q \| P) \, \mathrm{d}s \tag{268}$$

$$= \alpha \int_0^{\alpha} s \, D_{\phi_s}(P \| Q) \, \mathrm{d}s + (1-\alpha) \int_{\alpha}^{1} (1-s) \, D_{\phi_{1-s}}(Q \| P) \, \mathrm{d}s. \tag{269}$$

Regarding the integrand of the second term in (269), in view of (18), for all *s* ∈ (0, 1)

$$D_{\phi_{1-s}}(Q \| P) = \frac{1}{(1-s)^2} \cdot \chi^2\big(Q \,\|\, (1-s)P + sQ\big) \tag{270}$$

$$= \frac{1}{s^2} \cdot \chi^2\big(P \,\|\, (1-s)P + sQ\big) \tag{271}$$

$$= D_{\phi_s}(P \| Q), \tag{272}$$

where (271) readily follows from (9). Since we also have $D_{\phi_1}(P \| Q) = \chi^2(P \| Q) = D_{\phi_0}(Q \| P)$ (see (18)), it follows that

$$D_{\phi_{1-s}}(Q \| P) = D_{\phi_s}(P \| Q), \quad s \in [0,1]. \tag{273}$$
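The identity (273) can also be spot-checked numerically from the $\chi^2$-based representation used in (270)–(272); the three-point distributions below are arbitrary choices:

```python
def chi2(P, Q):
    # chi^2(P||Q) = sum (p - q)^2 / q (assuming q > 0 everywhere)
    return sum((p - q) ** 2 / q for p, q in zip(P, Q))

def D_phi(s, P, Q):
    # D_{phi_s}(P||Q) = (1/s^2) * chi^2(P || (1-s)P + sQ) for s in (0,1]
    R = [(1 - s) * p + s * q for p, q in zip(P, Q)]
    return chi2(P, R) / s ** 2

P = [0.6, 0.3, 0.1]
Q = [0.2, 0.3, 0.5]
for s in (0.1, 0.25, 0.5, 0.9):
    assert abs(D_phi(1 - s, Q, P) - D_phi(s, P, Q)) < 1e-12
```

The equality holds exactly because, with $M = (1-s)P + sQ$, both $\chi^2(Q\|M)/(1-s)^2$ and $\chi^2(P\|M)/s^2$ reduce to $\sum (p-q)^2/m$.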

By using this identity, we get from (269) that, for all *α* ∈ [0, 1]

$$\frac{1}{\log \mathrm{e}} \, S_{\alpha}(P \| Q) = \alpha \int_0^{\alpha} s \, D_{\phi_s}(P \| Q) \, \mathrm{d}s + (1-\alpha) \int_{\alpha}^{1} (1-s) \, D_{\phi_s}(P \| Q) \, \mathrm{d}s \tag{274}$$

$$= \int_0^1 g_{\alpha}(s) \, D_{\phi_s}(P \| Q) \, \mathrm{d}s, \tag{275}$$

where the function *g<sup>α</sup>* : [0, 1] → R is defined in (102). This proves the integral identity (101).

The lower bounds in (103) and (104) hold since, if $f \colon (0,\infty) \to \mathbb{R}$ is convex, twice continuously differentiable, and strictly convex at 1, then

$$
\mu_{\chi^2}(Q_X, W_{Y|X}) \leq \mu_f(Q_X, W_{Y|X}), \tag{276}
$$

(see, e.g., [46] (Proposition II.6.5) and [50] (Theorem 2)). Hence, this holds in particular for the $f$-divergences in (95) and (96) (since the required properties are satisfied by the parametric functions in (97) and (98), respectively). We next prove the upper bounds on the contraction coefficients in (103) and (104) by relying on (100) and (101), respectively. In the setting of Definition 7, if $P_X \neq Q_X$, then it follows from (100) that, for $\alpha \in (0,1]$,

$$\frac{K_{\alpha}(P_Y \| Q_Y)}{K_{\alpha}(P_X \| Q_X)} = \frac{\int_0^{\alpha} s \, D_{\phi_s}(P_Y \| Q_Y) \, \mathrm{d}s}{\int_0^{\alpha} s \, D_{\phi_s}(P_X \| Q_X) \, \mathrm{d}s} \tag{277}$$

$$\leq \frac{\int_0^{\alpha} s \, \mu_{\phi_s}(Q_X, W_{Y|X}) \, D_{\phi_s}(P_X \| Q_X) \, \mathrm{d}s}{\int_0^{\alpha} s \, D_{\phi_s}(P_X \| Q_X) \, \mathrm{d}s} \tag{278}$$

$$\leq \sup_{s \in (0,\alpha]} \mu_{\phi_s}(Q_X, W_{Y|X}). \tag{279}$$

Finally, taking the supremum of the left-hand side of (277) over all probability measures $P_X$ such that $0 < K_{\alpha}(P_X \| Q_X) < \infty$ gives the upper bound on $\mu_{k_{\alpha}}(Q_X, W_{Y|X})$ in (103). The proof of the upper bound on $\mu_{s_{\alpha}}(Q_X, W_{Y|X})$, for all $\alpha \in [0,1]$, follows similarly from (101), since the function $g_{\alpha}(\cdot)$ defined in (102) is positive on the interval $(0,1)$.

*Entropy* **2020**, *22*, 563

#### *5.9. Proof of Corollary 7*

The upper bounds in (106) and (107) rely on those in (103) and (104), respectively, by showing that

$$\sup_{s \in (0,1]} \mu_{\phi_s}(Q_X, W_{Y|X}) \leq \mu_{\chi^2}(W_{Y|X}). \tag{280}$$

Inequality (280) is obtained as follows, similarly to the proof of [51] (Remark 3.8). For all $s \in (0,1]$ and $P_X \neq Q_X$,

$$\frac{D_{\phi_s}(P_X W_{Y|X} \,\|\, Q_X W_{Y|X})}{D_{\phi_s}(P_X \| Q_X)} = \frac{\chi^2\big(P_X W_{Y|X} \,\|\, (1-s) P_X W_{Y|X} + s \, Q_X W_{Y|X}\big)}{\chi^2\big(P_X \,\|\, (1-s)P_X + sQ_X\big)} \tag{281}$$

$$\leq \mu_{\chi^2}\big((1-s)P_X + sQ_X, \, W_{Y|X}\big) \tag{282}$$

$$\leq \mu_{\chi^2}(W_{Y|X}), \tag{283}$$

where (281) holds due to (18), and (283) is due to the definition in (105).

#### *5.10. Proof of Proposition 3*

The lower bound on the contraction coefficients in (108) and (109) is due to (276). The derivation of the upper bounds relies on [49] (Theorem 2.2), which states the following. Let $f \colon [0,\infty) \to \mathbb{R}$ be a three-times differentiable, convex function with $f(1) = 0$ and $f''(1) > 0$, and let the function $z \colon (0,\infty) \to \mathbb{R}$, defined as $z(t) := \frac{f(t) - f(0)}{t}$ for all $t > 0$, be concave. Then,

$$
\mu_f(Q_X, W_{Y|X}) \leq \frac{f'(1) + f(0)}{f''(1)} \cdot \mu_{\chi^2}(Q_X, W_{Y|X}). \tag{284}
$$

For $\alpha \in (0,1]$, let $z_{\alpha,1} \colon (0,\infty) \to \mathbb{R}$ and $z_{\alpha,2} \colon (0,\infty) \to \mathbb{R}$ be given by

$$z_{\alpha,1}(t) := \frac{k_{\alpha}(t) - k_{\alpha}(0)}{t}, \quad t > 0, \tag{285}$$

$$z_{\alpha,2}(t) := \frac{s_{\alpha}(t) - s_{\alpha}(0)}{t}, \quad t > 0, \tag{286}$$

with $k_{\alpha}$ and $s_{\alpha}$ in (97) and (98). Straightforward calculus shows that, for $\alpha \in (0,1]$ and $t > 0$,

$$\frac{1}{\log \mathrm{e}} \, z_{\alpha,1}''(t) = -\frac{\alpha^2 + 2\alpha(1-\alpha)t}{t^2 \left[\alpha + (1-\alpha)t\right]^2} < 0, \tag{287}$$

$$\begin{split} \frac{1}{\log \mathrm{e}} \, z_{\alpha,2}''(t) &= -\frac{\alpha^2 \left[ \alpha + 2(1-\alpha)t \right]}{t^2 \left[ \alpha + (1-\alpha)t \right]^2} \\ &\quad - \frac{2(1-\alpha)}{t^3} \left[ \log_{\mathrm{e}}\left(1 + \frac{(1-\alpha)t}{\alpha}\right) - \frac{(1-\alpha)t}{\alpha + (1-\alpha)t} - \frac{(1-\alpha)^2 t^2}{2\left[\alpha + (1-\alpha)t\right]^2} \right]. \end{split} \tag{288}$$

The first term on the right side of (288) is negative. To show that the second term is also negative, we rely on the power series expansion $\log_{\mathrm{e}}(1+u) = u - \tfrac{1}{2}u^2 + \tfrac{1}{3}u^3 - \ldots$ for $u \in (-1,1]$. Setting $u := -\frac{x}{1+x}$, for $x > 0$, and using the Leibniz theorem for alternating series yields

$$\log_{\mathrm{e}}(1+x) = -\log_{\mathrm{e}}\left(1 - \frac{x}{1+x}\right) > \frac{x}{1+x} + \frac{x^2}{2(1+x)^2}, \qquad x > 0. \tag{289}$$


Consequently, setting $x := \frac{(1-\alpha)t}{\alpha} \in [0,\infty)$ in (289), for $t > 0$ and $\alpha \in (0,1]$, proves that the second term on the right side of (288) is negative. Hence, $z_{\alpha,1}''(t) < 0$ and $z_{\alpha,2}''(t) < 0$, so both $z_{\alpha,1}, z_{\alpha,2} \colon (0,\infty) \to \mathbb{R}$ are concave functions.
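Inequality (289), the key step in the argument above, is easy to spot-check numerically over several orders of magnitude of $x$:

```python
import math

# Spot-check of (289): log(1+x) > x/(1+x) + x^2 / (2(1+x)^2) for x > 0.
for x in (1e-3, 0.1, 1.0, 10.0, 1e3):
    lower = x / (1 + x) + x ** 2 / (2 * (1 + x) ** 2)
    assert math.log1p(x) > lower   # log1p(x) = log_e(1 + x), accurate for small x
```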

Since the conditions of [49] (Theorem 2.2) are satisfied by the $f$-divergences with $f = k_{\alpha}$ or $f = s_{\alpha}$, the upper bounds in (108) and (109) follow from (284), together with

$$k_{\alpha}(0) = 0, \qquad k_{\alpha}'(1) = \alpha \log \mathrm{e}, \qquad k_{\alpha}''(1) = \alpha^2 \log \mathrm{e}, \tag{290}$$

$$s_{\alpha}(0) = -(1-\alpha) \log \alpha, \quad s_{\alpha}'(1) = (2\alpha - 1) \log \mathrm{e}, \quad s_{\alpha}''(1) = (1 - 3\alpha + 3\alpha^2) \log \mathrm{e}. \tag{291}$$

#### *5.11. Proof of Proposition 4*

In view of (24), we get

$$\frac{D(P_Y \| Q_Y)}{D(P_X \| Q_X)} = \frac{\int_0^1 \chi^2\big(P_Y \,\|\, (1-s)P_Y + sQ_Y\big) \, \frac{\mathrm{d}s}{s}}{\int_0^1 \chi^2\big(P_X \,\|\, (1-s)P_X + sQ_X\big) \, \frac{\mathrm{d}s}{s}} \tag{292}$$

$$\leq \frac{\int_0^1 \mu_{\chi^2}\big((1-s)P_X + sQ_X, \, W_{Y|X}\big) \, \chi^2\big(P_X \,\|\, (1-s)P_X + sQ_X\big) \, \frac{\mathrm{d}s}{s}}{\int_0^1 \chi^2\big(P_X \,\|\, (1-s)P_X + sQ_X\big) \, \frac{\mathrm{d}s}{s}} \tag{293}$$

$$\leq \sup_{s \in [0,1]} \mu_{\chi^2}\big((1-s)P_X + sQ_X, \, W_{Y|X}\big). \tag{294}$$

In view of (119) and the distributions of $X_s$ and $Y_s$, and since $\big((1-s)P_X + sQ_X\big) W_{Y|X} = (1-s)P_Y + sQ_Y$ holds for all $s \in [0,1]$, it follows that

$$\rho_{\mathrm{m}}(X_s; Y_s) = \sqrt{\mu_{\chi^2}\big((1-s)P_X + sQ_X, \, W_{Y|X}\big)}, \quad s \in [0,1], \tag{295}$$

which, from (292)–(295), implies that

$$\sup_{s \in [0,1]} \rho_{\mathrm{m}}(X_s; Y_s) \geq \sqrt{\frac{D(P_Y \| Q_Y)}{D(P_X \| Q_X)}}. \tag{296}$$

Switching $P_X$ and $Q_X$ in (292)–(294) and using the mapping $s \mapsto 1-s$ in (294) gives (due to the symmetry of the maximal correlation)

$$\sup_{s \in [0,1]} \rho_{\mathrm{m}}(X_s; Y_s) \geq \sqrt{\frac{D(Q_Y \| P_Y)}{D(Q_X \| P_X)}}, \tag{297}$$

and, finally, taking the maximal lower bound among those in (296) and (297) gives (120).

**Author Contributions:** Both coauthors contributed to this research work, and to the writing and proofreading of this article. The starting point of this work was in independent derivations of preliminary versions of Theorems 1 and 2 in two separate unpublished works [24,25]. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** Sergio Verdú is gratefully acknowledged for a careful reading, and well-appreciated feedback on the submitted version of this paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

1. Kullback, S.; Leibler, R.A. On information and sufficiency. *Ann. Math. Stat.* **1951**, *22*, 79–86.


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **A Two-Moment Inequality with Applications to Rényi Entropy and Mutual Information**

#### **Galen Reeves 1,2**

<sup>1</sup> Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA; galen.reeves@duke.edu

<sup>2</sup> Department of Statistical Science, Duke University, Durham, NC 27708, USA

Received: 14 September 2020; Accepted: 6 October 2020; Published: 1 November 2020

**Abstract:** This paper explores some applications of a two-moment inequality for the integral of the *r*th power of a function, where 0 < *r* < 1. The first contribution is an upper bound on the Rényi entropy of a random vector in terms of two different moments. When one of the moments is the zeroth moment, these bounds recover previous results based on maximum entropy distributions under a single moment constraint. More generally, evaluation of the bound with two carefully chosen nonzero moments can lead to significant improvements with a modest increase in complexity. The second contribution is a method for upper bounding mutual information in terms of certain integrals with respect to the variance of the conditional density. The bounds have a number of useful properties arising from the connection with variance decompositions.

**Keywords:** information inequalities; mutual information; Rényi entropy; Carlson–Levin inequality

#### **1. Introduction**

The interplay between inequalities and information theory has a rich history, with notable examples including the relationship between the Brunn–Minkowski inequality and the entropy power inequality as well as the matrix determinant inequalities obtained from differential entropy [1]. In this paper, the focus is on a "two-moment" inequality that provides an upper bound on the integral of the *r*th power of a function. Specifically, if *f* is a nonnegative function defined on R*<sup>n</sup>* and *p*, *q*,*r* are real numbers satisfying 0 < *r* < 1 and *p* < 1/*r* − 1 < *q*, then

$$\left(\int f(\mathbf{x})^r \, \mathrm{d}\mathbf{x}\right)^{\frac{1}{r}} \leq C_{n,p,q,r} \left(\int \|\mathbf{x}\|^{np} f(\mathbf{x}) \, \mathrm{d}\mathbf{x}\right)^{\frac{qr+r-1}{(q-p)r}} \left(\int \|\mathbf{x}\|^{nq} f(\mathbf{x}) \, \mathrm{d}\mathbf{x}\right)^{\frac{1-(1+p)r}{(q-p)r}}, \tag{1}$$

where the best possible constant *Cn*,*p*,*q*,*<sup>r</sup>* is given exactly; see Propositions 2 and 3 ahead. The one-dimensional version of this inequality is a special case of the classical Carlson–Levin inequality [2–4], and the multidimensional version is a special case of a result presented by Barza et al. [5]. The particular formulation of the inequality used in this paper was derived independently in [6], where the proof follows from a direct application of Hölder's inequality and Jensen's inequality.

In the context of information theory and statistics, a useful property of the two-moment inequality is that it provides a bound on a nonlinear functional, namely the $r$-quasi-norm $\|\cdot\|_r$, in terms of integrals that are linear in $f$. Consequently, this inequality is well suited to settings where $f$ is a mixture of simple functions whose moments can be evaluated. We note that this reliance on moments to bound a nonlinear functional is closely related to bounds obtained from variational characterizations such as the Donsker–Varadhan representation of the Kullback–Leibler divergence [7] and its generalizations to Rényi divergence [8,9].

The first application considered in this paper concerns the relationship between the entropy of a probability measure and its moments. This relationship is fundamental to the principle of maximum entropy, which originated in statistical physics and has since been applied to statistical inference problems [10]. It also plays a prominent role in information theory and estimation theory, where the Gaussian distribution maximizes differential entropy under second moment constraints ([11], Theorem 8.6.5). Moment–entropy inequalities for Rényi entropy were studied in a series of works by Lutwak et al. [12–14], as well as related works by Costa et al. [15,16] and Johnson and Vignat [17], in which it is shown that, under a single moment constraint, Rényi entropy is maximized by a family of generalized Gaussian distributions. The connection between these moment–entropy inequalities and the Carlson–Levin inequality was noted recently by Nguyen [18].

In this direction, one of the contributions of this paper is a new family of moment–entropy inequalities. This family of inequalities follows from applying Inequality (1) in the setting where $f$ is a probability density function, so that there is a one-to-one correspondence between the integral of the $r$th power and the Rényi entropy of order $r$. In the special case where one of the moments is the zeroth moment, this approach recovers the moment–entropy inequalities given in previous work. More generally, the additional flexibility provided by considering two different moments can lead to stronger results. For example, in Proposition 6, it is shown that if $f$ is the standard Gaussian density function defined on $\mathbb{R}^n$, then the difference between the Rényi entropy and the upper bound given by the two-moment inequality (equivalently, the ratio between the left- and right-hand sides of (1)) is bounded uniformly with respect to $n$ under the following specification of the moments:

$$p\_n = \frac{1-r}{r} - \frac{1}{r} \sqrt{\frac{2(1-r)}{n+1}}, \qquad q\_n = \frac{1-r}{r} + \frac{1}{r} \sqrt{\frac{2(1-r)}{n+1}}.\tag{2}$$

Conversely, if one of the moments is restricted to be equal to zero, as is the case in the usual moment–entropy inequalities, then the difference between the Rényi entropy and the upper bound diverges with *n*.

The second application considered in this paper is the problem of bounding mutual information. In conjunction with Fano's inequality and its extensions, bounds on mutual information play a prominent role in establishing minimax rates of statistical estimation [19] as well as the information-theoretic limits of detection in high-dimensional settings [20]. In many cases, one of the technical challenges is to provide conditions under which the dependence between the observations and an underlying signal or model parameters converges to zero in the limit of high dimension.

This paper introduces a new method for bounding mutual information, which can be described as follows. Let $P_{X,Y}$ be a probability measure on $\mathcal{X} \times \mathcal{Y}$ such that $P_{Y|X=x}$ and $P_Y$ have densities $f(y \mid x)$ and $f(y)$ with respect to the Lebesgue measure on $\mathbb{R}^n$. We begin by showing that the mutual information between $X$ and $Y$ satisfies the upper bound

$$I(X;Y) \le \int \sqrt{\mathsf{Var}(f(y \mid X))} \,\mathrm{d}y,\tag{3}$$

where $\mathsf{Var}(f(y \mid X)) = \int (f(y \mid x) - f(y))^2 \, \mathrm{d}P_X(x)$ is the variance of $f(y \mid X)$; see Proposition 8 ahead. In view of (3), an application of the two-moment Inequality (1) with $r = 1/2$ leads to an upper bound with respect to the moments of the variance of the density:

$$\int \|y\|^{ns} \, \mathsf{Var}(f(y \mid X)) \, \mathrm{d}y, \tag{4}$$

where this expression is evaluated at $s \in \{p, q\}$ with $p < 1 < q$. A useful property of this bound is that the integrated variance is quadratic in $P_X$, and thus Expression (4) can be evaluated by swapping the integration over $y$ with the expectation over two independent copies of $X$. For example, when $P_{X,Y}$ is a Gaussian scale mixture, this approach provides closed-form upper bounds in terms of

the moments of the Gaussian density. An early version of this technique is used to prove Gaussian approximations for random projections [21] arising in the analysis of a random linear estimation problem appearing in wireless communications and compressed sensing [22,23].
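The bound (3) is straightforward to evaluate numerically. The sketch below uses a hypothetical model of our own choosing ($X$ uniform on $\{-1,+1\}$ and $Y \mid X = x \sim \mathcal{N}(x, 1)$); quantities are in nats, and the grid on $[-10, 10]$ is an arbitrary discretization:

```python
import math

# Compare I(X;Y) against the right-hand side of (3) for a binary-input
# Gaussian model: X uniform on {-1,+1}, Y | X = x ~ N(x, 1).
phi = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

dy = 1e-3
I = 0.0     # mutual information I(X;Y), nats
rhs = 0.0   # integral of sqrt(Var(f(y|X))) over y
for i in range(20001):
    y = -10 + dy * i
    f_plus, f_minus = phi(y - 1), phi(y + 1)   # f(y | +1), f(y | -1)
    f_marg = 0.5 * (f_plus + f_minus)          # marginal density f(y)
    I += 0.5 * (f_plus * math.log(f_plus / f_marg)
                + f_minus * math.log(f_minus / f_marg)) * dy
    var = 0.5 * ((f_plus - f_marg) ** 2 + (f_minus - f_marg) ** 2)
    rhs += math.sqrt(var) * dy

assert 0 < I < rhs
```

For this model the bound is loose by roughly a factor of two, but both sides are finite and easily computed, which is the point of the method.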

#### **2. Moment Inequalities**

Let $L^p(S)$ be the space of Lebesgue measurable functions from $S$ to $\mathbb{R}$ whose $p$th power is absolutely integrable, and for $p \neq 0$, define

$$\|f\|_p := \left(\int_S |f(x)|^p \, \mathrm{d}x\right)^{1/p}.$$

Recall that $\|\cdot\|_p$ is a norm for $p \geq 1$ but only a quasi-norm for $0 < p < 1$ because it does not satisfy the triangle inequality. The $s$th moment of $f$ is defined as

$$\mathcal{M}_s(f) := \int_S \|x\|^s \, |f(x)| \, \mathrm{d}x,$$

where k · k denotes the standard Euclidean norm on vectors.

The two-moment Inequality (1) can be derived straightforwardly using the following argument. For $r \in (0,1)$, the mapping $f \mapsto \|f\|_r$ is concave on the subset of nonnegative functions and admits the variational representation

$$\|f\|\_{r} = \inf \left\{ \frac{\|fg\|\_{1}}{\|g\|\_{r^\*}} : g \in L^{r^\*} \right\},\tag{5}$$

where $r^* = r/(r-1) \in (-\infty, 0)$ is the Hölder conjugate of $r$. Consequently, each $g \in L^{r^*}$ leads to an upper bound on $\|f\|_r$. For example, if $f$ has bounded support $S$, choosing $g$ to be the indicator function of $S$ leads to the basic inequality $\|f\|_r \leq (\mathrm{Vol}(S))^{(1-r)/r} \|f\|_1$. The upper bound on $\|f\|_r$ given in Inequality (1) can be obtained by restricting the minimum in Expression (5) to the parametric class of functions of the form $g(x) = \nu_1 \|x\|^{np} + \nu_2 \|x\|^{nq}$ with $\nu_1, \nu_2 > 0$ and then optimizing over the parameters $(\nu_1, \nu_2)$. Here, the constraints on $p, q$ are necessary and sufficient to ensure that $g \in L^{r^*}(\mathbb{R}^n)$.
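To make the variational representation (5) concrete: any admissible $g$ certifies an upper bound on $\|f\|_r$. The sketch below uses our own toy choices, $f(x) = e^{-x}$ on $[0, \infty)$ with $r = 1/2$ (so $r^* = -1$) and $g(x) = 2 + x^2$, i.e., the parametric class above with $n = 1$, $p = 0$, $q = 2$, $\nu_1 = 2$, $\nu_2 = 1$:

```python
import math

# Check that ||f||_r <= ||f g||_1 / ||g||_{r*} for one admissible g, as in (5).
r = 0.5
f = lambda x: math.exp(-x)
g = lambda x: 2.0 + x * x

dx = 1e-3
xs = [dx * (i + 0.5) for i in range(100000)]   # midpoint rule on [0, 100]
norm_f_r = (sum(f(x) ** r for x in xs) * dx) ** (1 / r)       # ||f||_{1/2} = 4
norm_fg_1 = sum(f(x) * g(x) for x in xs) * dx                 # ||f g||_1
norm_g_rstar = (sum(1.0 / g(x) for x in xs) * dx) ** (-1.0)   # ||g||_{-1}

assert norm_f_r <= norm_fg_1 / norm_g_rstar
```

Here $\|f\|_{1/2} = \big(\int_0^\infty e^{-x/2}\,\mathrm{d}x\big)^2 = 4$, and the certified bound comes out near $\pi\sqrt{2} \approx 4.44$.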

In the following sections, we provide a more detailed derivation, starting with the problem of maximizing k *f* k*<sup>r</sup>* under multiple moment constraints and then specializing to the case of two moments. For a detailed account of the history of the Carlson type inequalities as well as some further extensions, see [4].

#### *2.1. Multiple Moments*

Consider the following optimization problem:

$$\begin{aligned} \text{maximize} \quad & \|f\|\_{r} \\ \text{subject to} \quad & f(\mathbf{x}) \ge 0 \qquad \text{for all } \mathbf{x} \in \mathcal{S} \\ & \mathcal{M}\_{\mathbf{s}\_i}(f) \le m\_i \quad \text{for } 1 \le i \le k. \end{aligned}$$

For $r \in (0,1)$, this is a convex optimization problem because $\|\cdot\|_r^r$ is concave and the moment constraints are linear. By standard theory in convex optimization (e.g., [24]), it can be shown that if the problem is feasible and the maximum is finite, then the maximizer has the form

$$f^*(x) = \left(\sum_{i=1}^k \nu_i^* \, \|x\|^{s_i}\right)^{\frac{1}{r-1}}, \quad \text{for all } x \in S.$$

The parameters $\nu_1^*, \ldots, \nu_k^*$ are nonnegative, and the $i$th moment constraint holds with equality for all $i$ such that $\nu_i^*$ is strictly positive; that is, $\nu_i^* > 0 \implies \mathcal{M}_{s_i}(f^*) = m_i$. Consequently, the maximum can be expressed in terms of a linear combination of the moments:

$$\|f^\*\|\_r^r = \|(f^\*)^r\|\_1 = \|f^\*(f^\*)^{r-1}\|\_1 = \sum\_{i=1}^k \nu\_i^\* m\_i.$$

For the purposes of this paper, it is useful to consider a relative inequality in terms of the moments of the function itself. Given a number $0 < r < 1$ and vectors $s \in \mathbb{R}^k$ and $\nu \in \mathbb{R}_+^k$, the function $c_r(\nu, s)$ is defined according to

$$c_r(\nu, s) = \left( \int_0^\infty \left( \sum_{i=1}^k \nu_i \, x^{s_i} \right)^{-\frac{r}{1-r}} \mathrm{d}x \right)^{\frac{1-r}{r}},$$

if the integral exists. Otherwise, $c_r(\nu, s)$ is defined to be positive infinity. It can be verified that $c_r(\nu, s)$ is finite provided that there exist $i, j$ such that $\nu_i$ and $\nu_j$ are strictly positive and $s_i < (1-r)/r < s_j$.

The following result can be viewed as a consequence of the constrained optimization problem described above. We provide a different and very simple proof that depends only on Hölder's inequality.

**Proposition 1.** *Let $f$ be a nonnegative Lebesgue measurable function defined on the positive reals $\mathbb{R}_+$. For any number $0 < r < 1$ and vectors $s \in \mathbb{R}^k$ and $\nu \in \mathbb{R}_+^k$, we have*

$$\|f\|\_{r} \le c\_{r}(\nu, s) \sum\_{i=1}^{k} \nu\_{i} \mathcal{M}\_{s\_{i}}(f).$$

**Proof.** Let $g(x) = \sum_{i=1}^k \nu_i \, x^{s_i}$. Then, we have

$$\|f\|_r^r = \|g^{-r}(fg)^r\|_1 \leq \|g^{-r}\|_{\frac{1}{1-r}} \, \|(gf)^r\|_{\frac{1}{r}} = \|g^{\frac{-r}{1-r}}\|_1^{1-r} \, \|gf\|_1^r = \left(c_r(\nu, s) \sum_{i=1}^k \nu_i \, \mathcal{M}_{s_i}(f)\right)^r,$$

where the second step is Hölder's inequality with conjugate exponents 1/(1 − *r*) and 1/*r*.
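Proposition 1 can be spot-checked numerically; the sketch below uses our own toy data ($f(x) = e^{-x}$ on $(0,\infty)$, $r = 1/2$, $s = (0, 2)$, $\nu = (1, 1)$):

```python
import math

# Numerical check of ||f||_r <= c_r(nu, s) * sum_i nu_i * M_{s_i}(f).
r = 0.5
s = (0.0, 2.0)
nu = (1.0, 1.0)
f = lambda x: math.exp(-x)

dx = 1e-3
xs = [dx * (i + 0.5) for i in range(100000)]   # midpoint rule on [0, 100]

norm_f_r = (sum(f(x) ** r for x in xs) * dx) ** (1 / r)        # ||f||_r = 4
moments = [sum(x ** si * f(x) for x in xs) * dx for si in s]   # M_0 = 1, M_2 = 2
c_r = (sum(sum(n * x ** si for n, si in zip(nu, s)) ** (-r / (1 - r))
           for x in xs) * dx) ** ((1 - r) / r)

bound = c_r * sum(n * m for n, m in zip(nu, moments))
assert norm_f_r <= bound
```

With these parameters $c_r(\nu, s) = \int_0^\infty (1 + x^2)^{-1} \mathrm{d}x = \pi/2$, so the bound is roughly $3\pi/2 \approx 4.71$, dominating $\|f\|_{1/2} = 4$.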

#### *2.2. Two Moments*

For *a*, *b* > 0, the beta function B(*a*, *b*) and gamma function Γ(*a*) are given by

$$\begin{aligned} \mathrm{B}(a,b) &= \int_0^1 t^{a-1}(1-t)^{b-1} \, \mathrm{d}t, \\ \Gamma(a) &= \int_0^\infty t^{a-1} e^{-t} \, \mathrm{d}t, \end{aligned}$$

and satisfy the relation B(*a*, *b*) = Γ(*a*)Γ(*b*)/Γ(*a* + *b*), *a*, *b* > 0. To lighten the notation, we define the normalized beta function

$$
\widetilde{B}(a,b) = \mathbf{B}(a,b)(a+b)^{a+b}a^{-a}b^{-b}.\tag{6}
$$

Properties of these functions are provided in Appendix A.

The next result follows from Proposition 1 for the case of two moments.

**Proposition 2.** *Let f be a nonnegative Lebesgue measurable function defined on* [0, ∞)*. For any numbers p*, *q*,*r with* 0 < *r* < 1 *and p* < 1/*r* − 1 < *q,*

$$\|f\|_r \leq \left[\psi_r(p,q)\right]^{\frac{1-r}{r}} \left[\mathcal{M}_p(f)\right]^{\lambda} \left[\mathcal{M}_q(f)\right]^{1-\lambda},$$

*where λ* = (*q* + 1 − 1/*r*)/(*q* − *p*) *and*

$$
\psi\_r(p,q) = \frac{1}{(q-p)} \widetilde{\mathcal{B}}\left(\frac{r\lambda}{1-r}, \frac{r(1-\lambda)}{1-r}\right),\tag{7}
$$

*where $\widetilde{\mathrm{B}}(\cdot, \cdot)$ is defined in Equation* (6)*.*

**Proof.** Letting $s = (p, q)$ and $\nu = (\gamma^{1-\lambda}, \gamma^{-\lambda})$ with $\gamma > 0$, we have

$$\left[c\_r(\nu, s)\right]^{\frac{r}{1-r}} = \int\_0^\infty \left(\gamma^{1-\lambda} x^p + \gamma^{-\lambda} x^q\right)^{-\frac{r}{1-r}} \,\mathrm{d}x.$$

Making the change of variable $x \mapsto (\gamma u)^{\frac{1}{q-p}}$ leads to

$$[c_r(\nu, s)]^{\frac{r}{1-r}} = \frac{1}{q-p} \int_0^\infty \frac{u^{b-1}}{(1+u)^{a+b}} \, \mathrm{d}u = \frac{\mathrm{B}(a,b)}{q-p},$$

where $a = \frac{r\lambda}{1-r}$ and $b = \frac{r(1-\lambda)}{1-r}$, and the second step follows from recognizing the integral representation of the beta function given in Equation (A3). Therefore, by Proposition 1, the inequality

$$\|f\|_r \leq \left(\frac{\mathrm{B}(a,b)}{q-p}\right)^{\frac{1-r}{r}} \left(\gamma^{1-\lambda} \mathcal{M}_p(f) + \gamma^{-\lambda} \mathcal{M}_q(f)\right)$$

holds for all *γ* > 0. Evaluating this inequality with

$$\gamma = \frac{\lambda \, \mathcal{M}_q(f)}{(1-\lambda) \, \mathcal{M}_p(f)},$$

leads to the stated result.

The special case *r* = 1/2 admits the simplified expression

$$
\psi\_{1/2}(p,q) = \frac{\pi \lambda^{-\lambda} (1-\lambda)^{-(1-\lambda)}}{(q-p)\sin(\pi\lambda)},\tag{8}
$$

where we have used Euler's reflection formula for the beta function ([25], [Theorem 1.2.1]).
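For a concrete instance of Proposition 2 (our own parameter choice), take $f(x) = e^{-x}$ with $r = 1/2$, $p = 0$, $q = 2$, so that $\lambda = 1/2$, $\mathcal{M}_0(f) = 1$, $\mathcal{M}_2(f) = 2$, and $\|f\|_{1/2} = \big(\int_0^\infty e^{-x/2}\,\mathrm{d}x\big)^2 = 4$ in closed form. The sketch below also confirms that (7) and the simplified expression (8) agree:

```python
import math

# Worked instance of Proposition 2 with f(x) = exp(-x), r = 1/2, p = 0, q = 2.
r, p, q = 0.5, 0.0, 2.0
lam = (q + 1 - 1 / r) / (q - p)   # lambda = 1/2

# psi_r(p, q) via (7), using the normalized beta function from (6) ...
a, b = r * lam / (1 - r), r * (1 - lam) / (1 - r)
B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
psi_7 = B * (a + b) ** (a + b) * a ** (-a) * b ** (-b) / (q - p)

# ... and via the simplified expression (8) for r = 1/2:
psi_8 = (math.pi * lam ** (-lam) * (1 - lam) ** (-(1 - lam))
         / ((q - p) * math.sin(math.pi * lam)))
assert math.isclose(psi_7, psi_8)   # both equal pi here

M_p, M_q = 1.0, 2.0   # moments of exp(-x): Gamma(1) and Gamma(3)
bound = psi_7 ** ((1 - r) / r) * M_p ** lam * M_q ** (1 - lam)
assert 4.0 <= bound   # bound = pi * sqrt(2) dominates ||f||_{1/2} = 4
```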

Next, we consider an extension of Proposition 2 for functions defined on R*<sup>n</sup>* . Given any measurable subset *S* of R*<sup>n</sup>* , we define

$$
\omega(\mathcal{S}) = \text{Vol}(\mathcal{B}^n \cap \text{cone}(\mathcal{S})),
\tag{9}
$$

where $\mathcal{B}^n = \{u \in \mathbb{R}^n : \|u\| \leq 1\}$ is the $n$-dimensional Euclidean ball of radius one and

$$\text{cone}(\mathcal{S}) = \{ \mathfrak{x} \in \mathbb{R}^n \, : \, t\mathfrak{x} \in \mathcal{S} \text{ for some } t > 0 \}.$$

The function *ω*(*S*) is proportional to the surface measure of the projection of *S* on the Euclidean sphere and satisfies

$$
\omega(S) \leq \omega(\mathbb{R}^n) = \frac{\pi^{\frac{n}{2}}}{\Gamma(\frac{n}{2} + 1)}, \tag{10}
$$

for all $S \subseteq \mathbb{R}^n$. Note that $\omega(\mathbb{R}_+) = 1$ and $\omega(\mathbb{R}) = 2$.

**Proposition 3.** *Let $f$ be a nonnegative Lebesgue measurable function defined on a subset $S$ of $\mathbb{R}^n$. For any numbers $p, q, r$ with $0 < r < 1$ and $p < 1/r - 1 < q$,*

$$\|f\|_r \leq \left[\omega(S) \, \psi_r(p,q)\right]^{\frac{1-r}{r}} \left[\mathcal{M}_{np}(f)\right]^{\lambda} \left[\mathcal{M}_{nq}(f)\right]^{1-\lambda},$$

*where λ* = (*q* + 1 − 1/*r*)/(*q* − *p*) *and ψr*(*p*, *q*) *is given by Equation* (7)*.*

**Proof.** Let $f$ be extended to $\mathbb{R}^n$ using the rule $f(x) = 0$ for all $x$ outside of $S$, and let $g \colon \mathbb{R}_+ \to \mathbb{R}_+$ be defined according to

$$g(y) = \frac{1}{n} \int_{\mathbb{S}^{n-1}} f(y^{1/n} u) \, \mathrm{d}\sigma(u),$$

where $\mathbb{S}^{n-1} = \{u \in \mathbb{R}^n : \|u\| = 1\}$ is the Euclidean sphere of radius one and $\sigma(u)$ is the surface measure of the sphere. In the following, we will show that

$$\|f\|\_{r} \le \left(\omega(\mathcal{S})\right)^{\frac{1-r}{r}} \|g\|\_{r} \tag{11}$$

$$\mathcal{M}\_{\rm ns}(f) = \mathcal{M}\_{\rm s}(\mathbf{g}).\tag{12}$$

The stated inequality then follows from applying Proposition 2 to the function $g$.

To prove Inequality (11), we begin with a transformation into polar coordinates:

$$\|f\|_r^r = \int_0^\infty \int_{\mathbb{S}^{n-1}} |f(tu)|^r \, t^{n-1} \, \mathrm{d}\sigma(u) \, \mathrm{d}t. \tag{13}$$

Letting **1**cone(*S*) (*x*) denote the indicator function of the set cone(*S*), the integral over the sphere can be bounded using:

$$\begin{split} \int\_{\mathbb{S}^{n-1}} |f(tu)|^{r} \, \mathrm{d}\sigma(u) &= \int\_{\mathbb{S}^{n-1}} \mathbf{1}\_{\mathrm{cone}(\mathcal{S})}(u) \, |f(tu)|^{r} \, \mathrm{d}\sigma(u) \\ &\stackrel{(a)}{\leq} \left( \int\_{\mathbb{S}^{n-1}} \mathbf{1}\_{\mathrm{cone}(\mathcal{S})}(u) \, \mathrm{d}\sigma(u) \right)^{1-r} \left( \int\_{\mathbb{S}^{n-1}} |f(tu)| \, \mathrm{d}\sigma(u) \right)^{r} \\ &\stackrel{(b)}{=} n \, (\omega(\mathcal{S}))^{1-r} \, g^{r}(t^{n}), \end{split} \tag{14}$$

where: (a) follows from Hölder's inequality with conjugate exponents $\frac{1}{1-r}$ and $\frac{1}{r}$, and (b) follows from the definition of $g$ and the fact that

$$\begin{split} \omega(\mathcal{S}) &= \int\_0^1 \int\_{\mathbb{S}^{n-1}} \mathbf{1}\_{\text{cone}(\mathcal{S})}(u) \, t^{n-1} \, \mathrm{d}\sigma(u) \, \mathrm{d}t \\ &= \frac{1}{n} \int\_{\mathbb{S}^{n-1}} \mathbf{1}\_{\text{cone}(\mathcal{S})}(u) \, \mathrm{d}\sigma(u). \end{split}$$

Plugging Inequality (14) back into Equation (13) and then making the change of variable $t \mapsto y^{\frac{1}{n}}$ yields

$$\|f\|\_{r}^{r} \le n \left(\omega(\mathcal{S})\right)^{1-r} \int\_{0}^{\infty} g^{r}(t^{n}) \, t^{n-1} \, \mathrm{d}t = \left(\omega(\mathcal{S})\right)^{1-r} \|g\|\_{r}^{r}.$$

The proof of Equation (12) follows along similar lines. We have

$$\begin{split} \mathcal{M}\_{\text{ns}}(f) & \stackrel{(a)}{=} \int\_{0}^{\infty} \int\_{\mathbb{S}^{n-1}} t^{\text{ns}} f(tu) \, t^{n-1} \, \mathrm{d}\sigma(u) \, \mathrm{d}t \\ & \stackrel{(b)}{=} \frac{1}{n} \int\_{0}^{\infty} \int\_{\mathbb{S}^{n-1}} y^{s} f(y^{\frac{1}{n}} u) \, \mathrm{d}\sigma(u) \, \mathrm{d}y \\ & = \mathcal{M}\_{\text{s}}(g) \end{split}$$

where (a) follows from a transformation into polar coordinates and (b) follows from the change of variable $t \mapsto y^{\frac{1}{n}}$.

Having established Inequality (11) and Equation (12), an application of Proposition 2 completes the proof.

#### **3. Rényi Entropy Bounds**

Let $X$ be a random vector that has a density $f(x)$ with respect to the Lebesgue measure on $\mathbb{R}^n$. The differential Rényi entropy of order $r \in (0, 1) \cup (1, \infty)$ is defined according to [11]:

$$h\_r(\mathbf{X}) = \frac{1}{1-r} \log \left( \int\_{\mathbb{R}^n} f^r(\mathbf{x}) \, d\mathbf{x} \right).$$

Throughout this paper, it is assumed that the logarithm is defined with respect to the natural base and entropy is measured in nats. The Rényi entropy is continuous and nonincreasing in $r$. If the support set $\mathcal{S} = \{x \in \mathbb{R}^n : f(x) > 0\}$ has finite measure, then the limit as $r$ converges to zero is given by $h_0(X) = \log \mathrm{Vol}(\mathcal{S})$. If the support does not have finite measure, then $h_r(X)$ increases to infinity as $r$ decreases to zero. The case $r = 1$ is given by the Shannon differential entropy:

$$h\_1(\mathbf{X}) = -\int\_S f(\mathbf{x}) \log f(\mathbf{x}) \, \mathrm{d}x.$$
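As an informal numerical check (not part of the paper), $h_r$ can be evaluated for a one-dimensional standard Gaussian and compared against the closed form $\frac{1}{2}\log(2\pi r^{\frac{1}{r-1}})$, which is the $n = 1$ case of the expression given later in Section 3.3; the grid parameters below are arbitrary.

```python
import math

def renyi_entropy_gauss_numeric(r, step=1e-3, half_width=12.0):
    # h_r = (1/(1-r)) * log( integral of f^r ), evaluated by a Riemann sum
    total, y = 0.0, -half_width
    while y <= half_width:
        f = math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)
        total += (f ** r) * step
        y += step
    return math.log(total) / (1.0 - r)

def renyi_entropy_gauss_closed(r):
    # n = 1 case of h_r(Y) for Y ~ N(0, I_n); see Section 3.3
    return 0.5 * math.log(2.0 * math.pi * r ** (1.0 / (r - 1.0)))
```

Evaluating both functions on a grid of orders also illustrates that $h_r$ is nonincreasing in $r$.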

Given a random variable $X$ that is not identically zero and numbers $p, q, r$ with $0 < r < 1$ and $p < 1/r - 1 < q$, we define the function

$$L\_r(X;p,q) = \frac{r\lambda}{1-r}\log\mathbb{E}\left[|X|^p\right] + \frac{r(1-\lambda)}{1-r}\log\mathbb{E}\left[|X|^q\right],$$

where *λ* = (*q* + 1 − 1/*r*)/(*q* − *p*).

The next result, which follows directly from Proposition 3, provides an upper bound on the Rényi entropy.

**Proposition 4.** *Let X be a random vector with a density on* $\mathbb{R}^n$*. For any numbers* $p, q, r$ *with* $0 < r < 1$ *and* $p < 1/r - 1 < q$*, the Rényi entropy satisfies*

$$h\_r(X) \le \log \omega(\mathcal{S}) + \log \psi\_r(p, q) + L\_r(\|X\|^n; p, q), \tag{15}$$

*where ω*(*S*) *is defined in Equation* (9) *and ψr*(*p*, *q*) *is defined in Equation* (7)*.*

**Proof.** This result follows immediately from Proposition 3 and the definition of Rényi entropy.

The relationship between Proposition 4 and previous results depends on whether the moment *p* is equal to zero:


*Entropy* **2020**, *22*, 1244

The contribution of two-moment inequalities is that they lead to tighter bounds. To quantify the tightness, we define ∆*r*(*X*; *p*, *q*) to be the gap between the right-hand side and left-hand side of Inequality (15) corresponding to the pair (*p*, *q*)—that is,

$$
\Delta\_r(X; p, q) = \log \omega(S) + \log \psi\_r(p, q) + L\_r(||X||^n; p, q) - h\_r(X).
$$

The gaps corresponding to the optimal two-moment and one-moment inequalities are defined according to

$$
\Delta\_r(X) = \inf\_{p,q} \Delta\_r(X; p, q)
$$

$$
\widetilde{\Delta}\_r(X) = \inf\_q \Delta\_r(X; 0, q).
$$

#### *3.1. Some Consequences of These Bounds*

By Lyapunov's inequality, the mapping $s \mapsto \frac{1}{s} \log \mathbb{E}\left[|X|^s\right]$ is nondecreasing on $[0, \infty)$, and thus

$$L\_r(X; p, q) \le L\_r(X; 0, q) = \frac{1}{q} \log \mathbb{E}\left[|X|^q\right], \quad p \ge 0. \tag{16}$$

In other words, the case *p* = 0 provides an upper bound on *Lr*(*X*; *p*, *q*) for nonnegative *p*. Alternatively, we also have the lower bound

$$L\_r(X; p, q) \ge \frac{r}{1 - r} \log \mathbb{E}\left[|X|^{\frac{1 - r}{r}}\right],\tag{17}$$

which follows from the convexity of the mapping $s \mapsto \log \mathbb{E}\left[|X|^s\right]$.
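To make Inequalities (16) and (17) concrete, the sketch below (an illustration with arbitrarily chosen values of $r$, $\mu$, and $\sigma^2$) evaluates $L_r(X; p, q)$ for a log-normal $X$, using the moment formula $\log \mathbb{E}[X^s] = \mu s + \sigma^2 s^2/2$ from Section 3.2.

```python
import math

def L_r_lognormal(r, p, q, mu=0.0, sigma2=1.0):
    # L_r(X; p, q) with lambda = (q + 1 - 1/r)/(q - p) and
    # log E[X^s] = mu*s + sigma2*s^2/2 for log-normal X
    lam = (q + 1.0 - 1.0 / r) / (q - p)
    m = lambda s: mu * s + 0.5 * sigma2 * s * s
    return (r / (1.0 - r)) * (lam * m(p) + (1.0 - lam) * m(q))

r, mu, sigma2 = 0.4, 0.3, 1.0          # arbitrary illustration values
s0 = 1.0 / r - 1.0                     # the pivot 1/r - 1
upper = L_r_lognormal(r, 0.0, 3.0, mu, sigma2)                # p = 0 case, Eq. (16)
lower = (r / (1.0 - r)) * (mu * s0 + 0.5 * sigma2 * s0 * s0)  # right side of Eq. (17)
mid = L_r_lognormal(r, 1.0, 3.0, mu, sigma2)                  # a generic (p, q) pair
```

Here `lower <= mid <= upper`, matching Inequalities (17) and (16), and `upper` coincides exactly with $\frac{1}{q}\log \mathbb{E}[X^q]$.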

A useful property of *Lr*(*X*; *p*, *q*) is that it is additive with respect to the product of independent random variables. Specifically, if *X* and *Y* are independent, then

$$L\_r(XY; p, q) = L\_r(X; p, q) + L\_r(Y; p, q). \tag{18}$$

One consequence is that multiplication by a bounded random variable cannot increase the Rényi entropy by an amount that exceeds the gap of the two-moment inequality with nonnegative moments.

**Proposition 5.** *Let Y be a random vector on* $\mathbb{R}^n$ *with finite Rényi entropy of order* $0 < r < 1$*, and let X be an independent random variable that satisfies* $0 < X \le t$*. Then,*

$$h\_r(XY) \le h\_r(tY) + \Delta\_r(Y; p, q),$$

*for all* 0 < *p* < 1/*r* − 1 < *q.*

**Proof.** Let *Z* = *XY* and let *S<sup>Z</sup>* and *S<sup>Y</sup>* denote the support sets of *Z* and *Y*, respectively. The assumption that *X* is nonnegative means that cone(*SZ*) = cone(*SY*). We have

$$h\_r(Z) \stackrel{(a)}{\leq} \log \omega(\mathcal{S}\_Z) + \log \psi\_r(p, q) + L\_r(\|Z\|^n; p, q),$$

$$\stackrel{(b)}{=} h\_r(Y) + L\_r(|X|^n; p, q) + \Delta\_r(Y; p, q)$$

$$\stackrel{(c)}{\leq} h\_r(Y) + n \log t + \Delta\_r(Y; p, q),$$

where (a) follows from Proposition 4, (b) follows from Equation (18) and the definition of $\Delta_r(Y; p, q)$, and (c) follows from Inequality (16) and the assumption $|X| \le t$. Finally, recalling that $h_r(tY) = h_r(Y) + n \log t$ completes the proof.

#### *3.2. Example with Log-Normal Distribution*

If $W \sim \mathcal{N}(\mu, \sigma^2)$, then the random variable $X = \exp(W)$ has a log-normal distribution with parameters $(\mu, \sigma^2)$. The Rényi entropy is given by

$$h\_r(X) = \mu + \frac{1}{2} \left( \frac{1 - r}{r} \right) \sigma^2 + \frac{1}{2} \log(2\pi r^{\frac{1}{r - 1}} \sigma^2),$$

and the logarithm of the *s*th moment is given by

$$\log \mathbb{E}\left[|X|^s\right] = \mu s + \frac{1}{2}\sigma^2 s^2.$$

With a bit of work, it can be shown that the gap of the optimal two-moment inequality does not depend on the parameters (*µ*, *σ* 2 ) and is given by

$$\Delta\_{r}(X) = \log\left(\widetilde{\mathcal{B}}\left(\frac{r}{2(1-r)}, \frac{r}{2(1-r)}\right) \sqrt{\frac{r}{4(1-r)}}\right) + \frac{1}{2} - \frac{1}{2}\log(2\pi r^{\frac{1}{r-1}}).\tag{19}$$

The details of this derivation are given in Appendix B.1. Meanwhile, the gap of the optimal one-moment inequality is given by

$$\widetilde{\Delta}\_{r}(X) = \inf\_{q} \left[ \log \left( \widetilde{\mathcal{B}} \left( \frac{r}{1-r} - \frac{1}{q}, \frac{1}{q} \right) \frac{1}{q} \right) + \frac{1}{2} q \sigma^{2} \right] - \frac{1}{2} \left( \frac{1-r}{r} \right) \sigma^{2} - \frac{1}{2} \log(2 \pi r^{\frac{1}{r-1}} \sigma^{2}). \tag{20}$$

The functions $\Delta_r(X)$ and $\widetilde{\Delta}_r(X)$ are illustrated in Figure 1 as a function of $r$ for various $\sigma^2$. The function $\Delta_r(X)$ is bounded uniformly with respect to $r$ and converges to zero as $r$ increases to one. The tightness of the two-moment inequality in this regime follows from the fact that the log-normal distribution maximizes Shannon entropy subject to a constraint on $\mathbb{E}[\log X]$. By contrast, the function $\widetilde{\Delta}_r(X)$ varies with the parameter $\sigma^2$. For any fixed $r \in (0, 1)$, it can be shown that $\widetilde{\Delta}_r(X)$ increases to infinity if $\sigma^2$ converges to zero or infinity.

**Figure 1.** Comparison of upper bounds on Rényi entropy in nats for the log-normal distribution as a function of the order *r* for various *σ* 2 .
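Equation (19) is straightforward to evaluate; the sketch below (an informal check, with $\widetilde{\mathcal{B}}$ computed through log-gamma values for numerical stability) confirms the behavior described above: the gap is a small positive constant that decays to zero as $r$ increases to one.

```python
import math

def log_btilde(x, y):
    # log of B~(x, y) = B(x, y) * (x + y)^(x+y) * x^(-x) * y^(-y); see Appendix A
    return (math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
            + (x + y) * math.log(x + y) - x * math.log(x) - y * math.log(y))

def gap_two_moment_lognormal(r):
    # Delta_r(X) of Equation (19); independent of the parameters (mu, sigma^2)
    a = r / (2.0 * (1.0 - r))
    return (log_btilde(a, a) + 0.5 * math.log(r / (4.0 * (1.0 - r)))
            + 0.5 - 0.5 * math.log(2.0 * math.pi * r ** (1.0 / (r - 1.0))))
```

For instance, the gap at $r = 1/2$ is roughly $0.03$ nats, and it becomes negligible as $r$ approaches one.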

#### *3.3. Example with Multivariate Gaussian Distribution*

Next, we consider the case where *Y* ∼ N (0, *In*) is an *n*-dimensional Gaussian vector with mean zero and identity covariance. The Rényi entropy is given by

$$h\_r(Y) = \frac{n}{2} \log(2\pi r^{\frac{1}{r-1}}),$$

and the $s$th moment of the magnitude $\|Y\|$ is given by

$$\mathbb{E}\left[||Y||^s\right] = \frac{2^{\frac{s}{2}}\Gamma(\frac{n+s}{2})}{\Gamma(\frac{n}{2})}.$$

The next result shows that as the dimension *n* increases, the gap of the optimal two-moment inequality converges to the gap for the log-normal distribution. Moreover, for each *r* ∈ (0, 1), the following choice of moments is optimal in the large-*n* limit:

$$p\_n = \frac{1-r}{r} - \frac{1}{r} \sqrt{\frac{2(1-r)}{n+1}}, \qquad q\_n = \frac{1-r}{r} + \frac{1}{r} \sqrt{\frac{2(1-r)}{n+1}}.\tag{21}$$

The proof is given in Appendix B.3.

**Proposition 6.** *If Y* ∼ N (0, *In*)*, then, for each r* ∈ (0, 1)*,*

$$\lim\_{n \to \infty} \Delta\_r(Y) = \lim\_{n \to \infty} \Delta\_r(Y; p\_n, q\_n) = \Delta\_r(X),$$

*where X has a log-normal distribution and* (*pn*, *qn*) *are given by* (21)*.*

Figure 2 provides a comparison of $\Delta_r(Y)$, $\Delta_r(Y; p_n, q_n)$, and $\widetilde{\Delta}_r(Y)$ as a function of $n$ for $r = 0.1$. Here, we see that both $\Delta_r(Y)$ and $\Delta_r(Y; p_n, q_n)$ converge rapidly to the asymptotic limit given by the gap of the log-normal distribution. By contrast, the gap of the optimal one-moment inequality $\widetilde{\Delta}_r(Y)$ increases without bound.

**Figure 2.** Comparison of upper bounds on Rényi entropy in nats for the multivariate Gaussian distribution N (0, *In*) as a function of the dimension *n* with *r* = 0.1. The solid black line is the gap of the optimal two-moment inequality for the log-normal distribution.

#### *3.4. Inequalities for Differential Entropy*

Proposition 4 can also be used to recover some known inequalities for differential entropy by considering the limiting behavior as *r* converges to one. For example, it is well known that the differential entropy of an *n*-dimensional random vector *X* with finite second moment satisfies

$$h(X) \le \frac{1}{2} \log \left( 2\pi e \mathbb{E} \left[ \frac{1}{n} \|X\|^2 \right] \right),\tag{22}$$

with equality if and only if the entries of *X* are i.i.d. zero-mean Gaussian. A generalization of this result in terms of an arbitrary positive moment is given by

$$h(X) \le \log \frac{\Gamma\left(\frac{n}{s} + 1\right)}{\Gamma\left(\frac{n}{2} + 1\right)} + \frac{n}{2} \log \pi + \frac{n}{s} \log \left(es \operatorname{E}\left[\frac{1}{n} \|X\|^s\right]\right),\tag{23}$$

for all *s* > 0. Note that Inequality (22) corresponds to the case *s* = 2.

Inequality (23) can be proved as an immediate consequence of Proposition 4 and the fact that *hr*(*X*) is nonincreasing in *r*. Using properties of the beta function given in Appendix A, it is straightforward to verify that

$$\lim\_{r \to 1} \psi\_r(0, q) = (e \, q)^{\frac{1}{q}} \, \Gamma \left( \frac{1}{q} + 1 \right), \quad \text{for all } q > 0.$$

Combining this result with Proposition 4 and Inequality (16) leads to

$$h(X) \le \log \omega(S) + \log \Gamma\left(\frac{1}{q} + 1\right) + \frac{1}{q} \log \left(eq\mathbb{E}\left[||X||^{nq}\right]\right).$$

Using Inequality (10) and making the substitution *s* = *nq* leads to Inequality (23).
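As an informal check of Inequality (23), the sketch below evaluates its right-hand side for a one-dimensional standard Gaussian, using the moment formula $\mathbb{E}[\|X\|^s] = 2^{s/2}\Gamma(\frac{n+s}{2})/\Gamma(\frac{n}{2})$ from Section 3.3; the bound holds for every $s > 0$ and is attained at $s = 2$.

```python
import math

def rhs_23_gaussian(s, n=1):
    # Right-hand side of Inequality (23) for X ~ N(0, I_n), using
    # E[||X||^s] = 2^(s/2) * Gamma((n+s)/2) / Gamma(n/2)
    log_moment = (0.5 * s * math.log(2.0)
                  + math.lgamma((n + s) / 2.0) - math.lgamma(n / 2.0))
    return (math.lgamma(n / s + 1.0) - math.lgamma(n / 2.0 + 1.0)
            + 0.5 * n * math.log(math.pi)
            + (n / s) * (1.0 + math.log(s) + log_moment - math.log(n)))

# Differential entropy of the standard Gaussian (n = 1), the left-hand side
shannon_entropy = 0.5 * math.log(2.0 * math.pi * math.e)
```

Scanning `rhs_23_gaussian(s)` over $s$ shows a minimum at $s = 2$, where the bound reduces to the second-moment inequality (22) with equality for the Gaussian.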

Another example follows from the fact that the log-normal distribution maximizes the differential entropy of a positive random variable *X* subject to constraints on the mean and variance of log(*X*), and hence

$$h(X) \le \mathbb{E}\left[\log(X)\right] + \frac{1}{2}\log\left(2\pi e \mathsf{Var}(\log(X))\right),\tag{24}$$

with equality if and only if *X* is log-normal. In Appendix B.4, it is shown how this inequality can be proved using our two-moment inequalities by studying the behavior as both *p* and *q* converge to zero as *r* increases to one.

#### **4. Bounds on Mutual Information**

#### *4.1. Relative Entropy and Chi-Squared Divergence*

Let *P* and *Q* be distributions defined on a common probability space that have densities *p* and *q* with respect to a dominating measure *µ*. The relative entropy (or Kullback–Leibler divergence) is defined according to

$$D\left(P \parallel Q\right) = \int p \log \left(\frac{p}{q}\right) \,\mathrm{d}\mu,$$

and the chi-squared divergence is defined as

$$\chi^2(P \parallel Q) = \int \frac{\left(p - q\right)^2}{q} \mathbf{d}\mu.$$

Both of these divergences can be seen as special cases of the general class of *f*-divergence measures, and there exists a rich literature on comparisons between different divergences [8,26–32]. The chi-squared divergence can be viewed as the squared $L^2$ distance between $p/\sqrt{q}$ and $\sqrt{q}$, and it can also be interpreted as the first nonzero term in the power series expansion of the relative entropy ([26], [Lemma 4]). More generally, the chi-squared divergence provides an upper bound on the relative entropy via

$$D\left(P \parallel Q\right) \le \log(1 + \chi^2(P \parallel Q)).\tag{25}$$

The proof of this inequality follows straightforwardly from Jensen's inequality and the concavity of the logarithm; see [27,31,32] for further refinements.

Given a random pair (*X*,*Y*), the mutual information between *X* and *Y* is defined according to

$$I(X;Y) = D\left(P\_{X,Y} \parallel P\_X P\_Y\right).$$

From Inequality (25), we see that the mutual information can always be upper bounded using

$$I(X;Y) \le \log(1 + \chi^2(P\_{X,Y}||P\_X P\_Y)).\tag{26}$$

The next section provides bounds on the mutual information that can improve upon this inequality.

#### *4.2. Mutual Information and Variance of Conditional Density*

Let $(X, Y)$ be a random pair such that the conditional distribution of $Y$ given $X$ has a density $f_{Y|X}(y|x)$ with respect to the Lebesgue measure on $\mathbb{R}^n$. Note that the marginal density of $Y$ is given by $f_Y(y) = \mathbb{E}\left[f_{Y|X}(y|X)\right]$. To simplify the notation, we will write $f(y|x)$ and $f(y)$ where the subscripts are implicit. The support set of $Y$ is denoted by $\mathcal{S}_Y$.

The measure of the dependence between *X* and *Y* that is used in our bounds can be understood in terms of the variance of the conditional density. For each *y*, the conditional density *f*(*y*|*X*) evaluated with a random realization of *X* is a random variable. The variance of this random variable is given by

$$\mathsf{Var}(f(y|X)) = \mathbb{E}\left[ (f(y|X) - f(y))^2 \right],\tag{27}$$

where we have used the fact that the marginal density *f*(*y*) is the expectation of *f*(*y*|*X*). The *s*th moment of the variance of the conditional density is defined according to

$$V\_s(Y|X) = \int\_{S\_Y} \|y\|^s \mathsf{Var}(f(y|X)) \,\mathrm{d}y.\tag{28}$$

The variance moment *Vs*(*Y*|*X*) is nonnegative and equal to zero if and only if *X* and *Y* are independent.

The function *κ*(*t*) is defined according to

$$\kappa(t) = \sup\_{u \in (0,\infty)} \frac{\log(1+u)}{u^t}, \qquad t \in (0,1]. \tag{29}$$

The proof of the following result is given in Appendix C. The behavior of *κ*(*t*) is illustrated in Figure 3.

**Figure 3.** Graphs of *κ*(*t*) and *tκ*(*t*) as a function of *t*.

**Proposition 7.** *The function κ*(*t*) *defined in Equation* (29) *can be expressed as*

$$\kappa(t) = \frac{\log(1+u)}{u^t}, \qquad t \in (0,1],$$

*where*

$$u = \exp\left(W\left(-\frac{1}{t}\exp\left(-\frac{1}{t}\right)\right) + \frac{1}{t}\right) - 1,$$

*and* $W(\cdot)$ *denotes Lambert's W-function; i.e.,* $W(z)$ *is the unique solution to the equation* $z = w \exp(w)$ *on the interval* $[-1, \infty)$*. Furthermore, the function* $g(t) = t\kappa(t)$ *is strictly increasing on* $(0, 1]$ *with* $\lim_{t \to 0} g(t) = 1/e$ *and* $g(1) = 1$*, and thus*

$$\frac{1}{et} \le \kappa(t) \le \frac{1}{t}, \qquad t \in (0,1],$$

*where the lower bound* 1/(*et*) *is tight for small values of t* ∈ (0, 1) *and the upper bound* 1/*t is tight for values of t close to 1.*
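Proposition 7 is simple to implement; the sketch below is an informal rendering, with Lambert's W evaluated by a basic Halley iteration that is assumed adequate on the interval $(-1/e, 0)$ (the starting point is a crude heuristic, not from the paper).

```python
import math

def lambert_w0(z, iters=100):
    # Principal branch of Lambert's W on (-1/e, 0), via Halley's method
    w = -0.5 if z < -0.2 else z   # crude starting point, assumed adequate here
    for _ in range(iters):
        ew = math.exp(w)
        f = w * ew - z
        step = f / (ew * (w + 1.0) - (w + 2.0) * f / (2.0 * w + 2.0))
        w -= step
        if abs(step) < 1e-14:
            break
    return w

def kappa(t):
    # Closed form of Proposition 7: kappa(t) = log(1 + u) / u^t at the optimizer u
    u = math.exp(lambert_w0(-(1.0 / t) * math.exp(-1.0 / t)) + 1.0 / t) - 1.0
    return math.log1p(u) / u ** t
```

For example, `kappa(0.5)` agrees with the constant $0.80477\ldots$ quoted later in Proposition 10, and the values can be cross-checked against a brute-force maximization of $\log(1+u)/u^t$.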

We are now ready to give the main results of this section, which are bounds on the mutual information. We begin with a general upper bound in terms of the variance of the conditional density.

**Proposition 8.** *For any* 0 < *t* ≤ 1*, the mutual information satisfies*

$$I(\mathbf{X}; \mathbf{Y}) \le \kappa(t) \int\_{S\_Y} \left[ f(y) \right]^{1-2t} \left[ \mathsf{Var}(f(y \mid \mathbf{X})) \right]^t \, \mathrm{d}y.$$

**Proof.** We use the following series of inequalities:

$$\begin{aligned} I(X;Y) & \stackrel{(a)}{=} \int f(y) \, D\left(P\_{X|Y=y} \, \Big|\, P\_X\right) \, \mathrm{d}y \\ & \stackrel{(b)}{\leq} \int f(y) \, \log\left(1 + \chi^2(P\_{X|Y=y} || P\_X)\right) \, \mathrm{d}y \\ & \stackrel{(c)}{=} \int f(y) \, \log\left(1 + \frac{\mathsf{Var}(f(y \mid X))}{f^2(y)}\right) \, \mathrm{d}y \\ & \stackrel{(d)}{\leq} \kappa(t) \int f(y) \left(\frac{\mathsf{Var}(f(y \mid X))}{f^2(y)}\right)^t \, \mathrm{d}y, \end{aligned}$$

where (a) follows from the definition of mutual information, (b) follows from Inequality (25), and (c) follows from Bayes' rule, which allows us to write the chi-square in terms of the variance of the conditional density:

$$\chi^2(P\_{X|Y=y}||P\_X) = \mathbb{E}\left[\left(\frac{f(y|X)}{f(y)} - 1\right)^2\right] = \frac{\mathsf{Var}(f(y|X))}{f^2(y)}.$$

Inequality (d) follows from the nonnegativity of the variance and the definition of *κ*(*t*).

Evaluating Proposition 8 with $t = 1$ recovers the well-known inequality $I(X;Y) \le \chi^2(P_{X,Y} \| P_X P_Y)$. The next two results follow from the cases $0 < t < \frac{1}{2}$ and $t = \frac{1}{2}$, respectively.

**Proposition 9.** *For any* 0 < *r* < 1*, the mutual information satisfies*

$$I(X;Y) \le \kappa(t) \left( e^{h\_r(Y)} \, V\_0(Y|X) \right)^t,$$

*where t* = (1 − *r*)/(2 − *r*)*.*

**Proof.** Starting with Proposition 8 and applying Hölder's inequality with conjugate exponents 1/(1 − *t*) and 1/*t* leads to

$$I(X; Y) \le \kappa(t) \left( \int f^r(y) \, \mathrm{d}y \right)^{1-t} \left( \int \mathsf{Var}(f(y \mid X)) \, \mathrm{d}y \right)^t = \kappa(t) \, e^{t h\_r(Y)} V\_0^t(Y|X),$$

where we have used the fact that *r* = (1 − 2*t*)/(1 − *t*).
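To see Proposition 9 in action, the sketch below (an illustration with an arbitrarily chosen two-point input) evaluates the bound for a scalar Gaussian channel $Y = X + W$, computing $h_r(Y)$, $V_0(Y|X)$, and the exact mutual information by numerical integration; $\kappa(t)$ is obtained here by brute-force maximization rather than through the closed form of Proposition 7.

```python
import math

# Arbitrary two-point input distribution and Renyi order (illustration only)
points, probs, r = (0.0, 2.0), (0.9, 0.1), 0.5
t = (1.0 - r) / (2.0 - r)

def phi(y):
    # standard Gaussian density (the noise W)
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

# Numerically integrate f^r, Var(f(y|X)), and -f log f over a grid
step = 1e-3
int_fr = v0 = h_y = 0.0
for k in range(30001):
    y = -15.0 + step * k
    cond = [phi(y - x) for x in points]
    f = sum(p * c for p, c in zip(probs, cond))
    int_fr += f ** r * step
    v0 += (sum(p * c * c for p, c in zip(probs, cond)) - f * f) * step
    h_y -= f * math.log(f) * step

h_r = math.log(int_fr) / (1.0 - r)                        # Renyi entropy of Y
exact_mi = h_y - 0.5 * math.log(2.0 * math.pi * math.e)   # I(X;Y) = h(Y) - h(W)
kappa_t = max(math.log1p(0.01 * k) / (0.01 * k) ** t for k in range(1, 100000))
bound = kappa_t * (math.exp(h_r) * v0) ** t               # Proposition 9
```

With these parameters, `bound` is larger than `exact_mi`, as the proposition requires, while both remain well below one nat.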

**Proposition 10.** *For any p* < 1 < *q, the mutual information satisfies*

$$I(\mathbf{X}; \mathbf{Y}) \le \mathcal{C}(\lambda) \sqrt{\frac{\omega(\mathcal{S}\_Y) V\_{np}^{\lambda}(\mathbf{Y}|\mathbf{X}) V\_{nq}^{1-\lambda}(\mathbf{Y}|\mathbf{X})}{(q-p)}}$$

*where λ* = (*q* − 1)/(*q* − *p*) *and*

$$\mathcal{C}(\lambda) = \kappa(\frac{1}{2}) \sqrt{\frac{\pi \lambda^{-\lambda} (1 - \lambda)^{-(1 - \lambda)}}{\sin(\pi \lambda)}}$$

*with* $\kappa(\tfrac{1}{2}) = 0.80477\ldots$.

**Proof.** Evaluating Proposition 8 with *t* = 1/2 gives

$$I(X; Y) \le \kappa(\tfrac{1}{2}) \int\_{\mathcal{S}\_Y} \sqrt{\mathsf{Var}(f(y \mid X))} \, \mathrm{d}y.$$

Evaluating Proposition 3 with $r = \frac{1}{2}$ leads to

$$\left(\int\_{\mathcal{S}\_Y} \sqrt{\mathsf{Var}(f(y \mid X))} \, \mathrm{d}y\right)^2 \le \omega(\mathcal{S}\_Y) \, \psi\_{1/2}(p, q) V\_{np}^\lambda(Y|X) V\_{nq}^{1-\lambda}(Y|X).$$

Combining these inequalities with the expression for *ψ*1/2(*p*, *q*) given in Equation (8) completes the proof.

The contribution of Propositions 9 and 10 is that they provide bounds on the mutual information in terms of quantities that can be easy to characterize. One application of these bounds is to establish conditions under which the mutual information corresponding to a sequence of random pairs $(X_k, Y_k)$ converges to zero. In this case, Proposition 9 provides a sufficient condition in terms of the Rényi entropy of $Y_k$ and the function $V_0(Y_k|X_k)$, while Proposition 10 provides a sufficient condition in terms of $V_s(Y_k|X_k)$ evaluated with two different values of $s$. These conditions are summarized in the following result.

**Proposition 11.** *Let* $(X_k, Y_k)$ *be a sequence of random pairs such that the conditional distribution of* $Y_k$ *given* $X_k$ *has a density on* $\mathbb{R}^n$*. The following are sufficient conditions under which the mutual information* $I(X_k; Y_k)$ *converges to zero as k increases to infinity:*

*1. There exists* 0 < *r* < 1 *such that*

$$\lim\_{k \to \infty} e^{h\_r(Y\_k)} V\_0(Y\_k | X\_k) = 0.$$

*2. There exists p* < 1 < *q such that*

$$\lim\_{k \to \infty} V\_{np}^{q-1}(Y\_k | X\_k) \, V\_{nq}^{1-p}(Y\_k | X\_k) = 0.$$

#### *4.3. Properties of the Bounds*

The variance moment $V_s(Y|X)$ has a number of interesting properties. The variance of the conditional density can be expressed in terms of an expectation with respect to two independent random variables $X_1$ and $X_2$ with the same distribution as $X$ via the decomposition:

$$\mathsf{Var}(f(y|X)) = \mathbb{E}\left[f(y|X)f(y|X) - f(y|X\_1)f(y|X\_2)\right].$$

Consequently, by swapping the order of the integration and expectation, we obtain

$$V\_{s}(Y|X) = \mathbb{E}\left[K\_{s}(X, X) - K\_{s}(X\_1, X\_2)\right],\tag{30}$$

where

$$K\_s(x\_1, x\_2) = \int \|y\|^s f(y|x\_1)\, f(y|x\_2) \,\mathrm{d}y.$$

The function *Ks*(*x*1, *x*2) is a positive definite kernel that does not depend on the distribution of *X*. For *s* = 0, this kernel has been studied previously in the machine learning literature [33], where it is referred to as the expected likelihood kernel.

The variance of the conditional density also satisfies a data processing inequality. Suppose that *U* → *X* → *Y* forms a Markov chain. Then, the square of the conditional density of *Y* given *U* can be expressed as

$$f\_{Y|U}^2(y|u) = \mathbb{E}\left[f\_{Y|X}(y|X\_1')\, f\_{Y|X}(y|X\_2') \mid U = u\right],$$

where $(U, X_1', X_2') \sim P_U P_{X_1'|U} P_{X_2'|U}$. Combining this expression with Equation (30) yields

$$V\_s(Y|U) = \mathbb{E}\left[K\_s(X\_1', X\_2') - K\_s(X\_1, X\_2)\right],\tag{31}$$

where we recall that (*X*1, *X*2) are independent copies of *X*.

Finally, it is easy to verify that the function $V_s(Y|X)$ satisfies

$$V\_s(aY|X) = |a|^{s-n} V\_s(Y|X), \quad \text{for all } a \neq 0.$$

Using this scaling relationship, we see that the sufficient conditions in Proposition 11 are invariant to scaling of *Y*.

#### *4.4. Example with Additive Gaussian Noise*

We now provide a specific example of our bounds on the mutual information. Let $X \in \mathbb{R}^n$ be a random vector with distribution $P_X$ and let $Y$ be the output of a Gaussian noise channel

$$Y = X + W, \tag{32}$$

where $W \sim \mathcal{N}(0, I_n)$ is independent of $X$. If $\|X\|$ has a finite second moment, then the mutual information satisfies

$$I(X;Y) \le \frac{n}{2} \log \left( 1 + \frac{1}{n} \mathbb{E} \left[ \|X\|^2 \right] \right),\tag{33}$$

where equality is attained if and only if $X$ has a zero-mean isotropic Gaussian distribution. This inequality follows straightforwardly from the fact that the Gaussian distribution maximizes differential entropy subject to a second moment constraint [11]. One of the limitations of this bound is that it can be loose when the second moment is dominated by events that have small probability. In fact, it is easy to construct examples for which $\|X\|$ does not have a finite second moment, and yet $I(X;Y)$ is arbitrarily close to zero.

Our results provide bounds on $I(X;Y)$ that are less sensitive to the effects of rare events. Let $\phi_n(x) = (2\pi)^{-n/2} \exp(-\|x\|^2/2)$ denote the density of the standard Gaussian distribution on $\mathbb{R}^n$. The product of the conditional densities can be factored according to

$$\begin{split} f(y \mid \mathbf{x\_1}) f(y \mid \mathbf{x\_2}) &= \phi\_{2n} \left( \begin{bmatrix} y - \mathbf{x\_1} \\ y - \mathbf{x\_2} \end{bmatrix} \right) = \phi\_{2n} \left( \begin{bmatrix} \sqrt{2}y - (\mathbf{x\_1} + \mathbf{x\_2})/\sqrt{2} \\ (\mathbf{x\_1} - \mathbf{x\_2})/\sqrt{2} \end{bmatrix} \right) \\ &= \phi\_n \left( \sqrt{2}y - \frac{\mathbf{x\_1} + \mathbf{x\_2}}{\sqrt{2}} \right) \phi\_n \left( \frac{\mathbf{x\_1} - \mathbf{x\_2}}{\sqrt{2}} \right), \end{split}$$

where the second step follows because *φ*2*n*(·) is invariant to orthogonal transformations. Integrating with respect to *y* leads to

$$K\_{s}(x\_{1}, x\_{2}) = 2^{-\frac{n+s}{2}} \, \mathbb{E}\left[ \left\| W + \frac{x\_{1} + x\_{2}}{\sqrt{2}} \right\|^{s} \right] \phi\_{n}\left(\frac{x\_{1} - x\_{2}}{\sqrt{2}}\right),$$

where we recall that $W \sim \mathcal{N}(0, I_n)$. For the case $s = 0$, we see that $K_0(x_1, x_2)$ is a Gaussian kernel, and thus

$$V\_0(Y|X) = (4\pi)^{-\frac{n}{2}} \left[ 1 - \mathbb{E} \left[ e^{-\frac{1}{4} \|X\_1 - X\_2\|^2} \right] \right]. \tag{34}$$

A useful property of $V_0(Y|X)$ is that the conditions under which it converges to zero are weaker than the conditions needed for other measures of dependence. Observe that the expectation in Equation (34) is bounded uniformly with respect to $(X_1, X_2)$. In particular, for every $\epsilon > 0$ and $x \in \mathbb{R}$, we have

$$1 - \mathbb{E}\left[e^{-\frac{1}{4}(X\_1 - X\_2)^2}\right] \le \epsilon^2 + 2\,\mathbb{P}\left[\left|X - x\right| \ge \epsilon\right],$$

where we have used the inequality $1 - e^{-x} \le x$ and the fact that $\mathbb{P}\left[|X_1 - X_2| \ge 2\epsilon\right] \le 2\,\mathbb{P}\left[|X - x| \ge \epsilon\right]$. Consequently, $V_0(Y|X)$ converges to zero whenever $X$ converges to a constant value $x$ in probability.
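Equation (34) can be verified numerically in the scalar case; the sketch below (an informal check with an arbitrary two-point distribution for $X$) compares the closed form against direct integration of the variance of the conditional density.

```python
import math

def phi(y):
    # standard Gaussian density
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def v0_numeric(points, probs, step=1e-3, half_width=15.0):
    # V_0(Y|X) = integral of Var(f(y|X)) dy for Y = X + W, discrete scalar X
    total = 0.0
    n_steps = int(2.0 * half_width / step)
    for k in range(n_steps + 1):
        y = -half_width + k * step
        cond = [phi(y - x) for x in points]
        mean = sum(p * c for p, c in zip(probs, cond))
        total += (sum(p * c * c for p, c in zip(probs, cond)) - mean * mean) * step
    return total

def v0_closed(points, probs):
    # Equation (34) with n = 1
    e = sum(pi * pj * math.exp(-0.25 * (xi - xj) ** 2)
            for pi, xi in zip(probs, points) for pj, xj in zip(probs, points))
    return (1.0 - e) / math.sqrt(4.0 * math.pi)
```

When the two support points coincide, $X$ is deterministic and both expressions return zero, consistent with $V_s(Y|X) = 0$ if and only if $X$ and $Y$ are independent.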

To study some further properties of these bounds, we now focus on the case where *X* is a Gaussian scalar mixture generated according to

$$X = A\sqrt{U}, \quad A \sim \mathcal{N}(0, 1), \quad U \ge 0,\tag{35}$$

with *A* and *U* independent. In this case, the expectations with respect to the kernel *Ks*(*x*1, *x*2) can be computed explicitly, leading to

$$V\_s(Y|X) = \frac{\Gamma(\frac{1+s}{2})}{2\pi} \mathbb{E}\left[ (1+2U)^{\frac{s}{2}} - \frac{(1+U\_1)^{\frac{s}{2}}(1+U\_2)^{\frac{s}{2}}}{\left(1+\frac{1}{2}(U\_1+U\_2)\right)^{\frac{s+1}{2}}} \right],\tag{36}$$

where (*U*1, *U*2) are independent copies of *U*. It can be shown that this expression depends primarily on the magnitude of *U*. This is not surprising given that *X* converges to a constant if and only if *U* converges to zero.

Our results can also be used to bound the mutual information $I(U;Y)$ by noting that $U \to X \to Y$ forms a Markov chain, and by taking advantage of the characterization provided in Equation (31). Letting $X_1' = A_1\sqrt{U}$ and $X_2' = A_2\sqrt{U}$ with $(A_1, A_2, U)$ mutually independent leads to

$$V\_s(Y|U) = \frac{\Gamma(\frac{1+s}{2})}{2\pi} \mathbb{E}\left[ (1+U)^{\frac{s-1}{2}} - \frac{(1+U\_1)^{\frac{s}{2}}(1+U\_2)^{\frac{s}{2}}}{(1+\frac{1}{2}(U\_1+U\_2))^{\frac{s+1}{2}}} \right].\tag{37}$$

In this case, *Vs*(*Y*|*U*) is a measure of the variation in *U*. To study its behavior, we consider the simple upper bound

$$V\_s(Y|U) \le \frac{\Gamma(\frac{1+s}{2})}{2\pi} \, \mathbb{P}\left[U\_1 \ne U\_2\right] \mathbb{E}\left[\left(1+U\right)^{\frac{s-1}{2}}\right],\tag{38}$$

which follows from noting that the term inside the expectation in Equation (37) is zero on the event $U_1 = U_2$. This bound shows that if $s \le 1$, then $V_s(Y|U)$ is bounded uniformly with respect to the distribution of $U$, and if $s > 1$, then $V_s(Y|U)$ is bounded in terms of the $\left(\frac{s-1}{2}\right)$th moment of $U$.

In conjunction with Propositions 9 and 10, the function *Vs*(*Y*|*U*) provides bounds on the mutual information *I*(*U*;*Y*) that can be expressed in terms of simple expectations involving two independent copies of *U*. Figure 4 provides an illustration of the upper bound in Proposition 10 for the case where *U* is a discrete random variable supported on two points, and *X* and *Y* are generated according to Equations (32) and (35). This example shows that there exist sequences of distributions for which our upper bounds on the mutual information converge to zero while the chi-squared divergence between *PXY* and *PXP<sup>Y</sup>* is bounded away from zero.

**Figure 4.** Bounds on the mutual information $I(U;Y)$ in nats when $U \sim (1-\epsilon)\delta_1 + \epsilon\,\delta_{a(\epsilon)}$, with $a(\epsilon) = 1 + 1/\sqrt{\epsilon}$, and $X$ and $Y$ are generated according to Equations (32) and (35). The bound from Proposition 10 is evaluated with $p = 0$ and $q = 2$.

#### **5. Conclusions**

This paper provides bounds on Rényi entropy and mutual information that are based on a relatively simple two-moment inequality. Extensions to inequalities with more moments are worth exploring. Another potential application is to provide a refined characterization of the "all-or-nothing" behavior seen in a sparse linear regression problem [34,35], where the current methods of analysis depend on a complicated conditional second moment method.

**Funding:** This research was supported in part by the National Science Foundation under Grant 1750362 and in part by the Laboratory for Analytic Sciences (LAS). Any opinions, findings, conclusions, and recommendations expressed in this material are those of the author and do not necessarily reflect the views of the sponsors.

**Conflicts of Interest:** The author declares no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A. The Gamma and Beta Functions**

This section reviews some properties of the gamma and beta functions. For $x > 0$, the gamma function is defined according to $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t} \, \mathrm{d}t$. Binet's formula for the logarithm of the gamma function ([25], [Theorem 1.6.3]) gives

$$
\log \Gamma(\mathbf{x}) = \left(\mathbf{x} - \frac{1}{2}\right) \log \mathbf{x} - \mathbf{x} + \frac{1}{2} \log(2\pi) + \theta(\mathbf{x}),
\tag{A1}
$$

where the remainder term $\theta(x)$ is convex and nonincreasing with $\lim_{x \to 0} \theta(x) = \infty$ and $\lim_{x \to \infty} \theta(x) = 0$. Euler's reflection formula ([25], [Theorem 1.2.1]) gives

$$
\Gamma(\mathbf{x})\Gamma(1-\mathbf{x}) = \frac{\pi}{\sin(\pi\mathbf{x})}, \quad \mathbf{0} < \mathbf{x} < 1. \tag{A2}
$$

For *x*, *y* > 0, the beta function can be expressed as follows

$$\mathrm{B}(x, y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)} = \int\_0^1 t^{x - 1}(1 - t)^{y - 1} \, \mathrm{d}t = \int\_0^\infty \frac{u^{x - 1}}{(1 + u)^{x + y}} \, \mathrm{d}u,\tag{A3}$$

where the second integral expression follows from the change of variables *t* ↦ *u*/(1 + *u*). Recall that B̃(*x*, *y*) = B(*x*, *y*)(*x* + *y*)<sup>*x*+*y*</sup>*x*<sup>−*x*</sup>*y*<sup>−*y*</sup>. Using Equation (A1) leads to

$$\log\left(\tilde{\mathcal{B}}(\mathbf{x},\boldsymbol{y})\sqrt{\frac{\mathbf{x}\cdot\mathbf{y}}{2\pi(\mathbf{x}+\boldsymbol{y})}}\right) = \theta(\mathbf{x}) + \theta(\boldsymbol{y}) - \theta(\mathbf{x}+\boldsymbol{y}).\tag{A4}$$

It can also be shown that ([36], Equation (2), p. 2)

$$
\widetilde{\mathcal{B}}(x, y) \ge \frac{x + y}{xy}. \tag{A5}
$$
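A quick numerical check of Equation (A4) and Inequality (A5); `log_Btilde` below computes log B̃(*x*, *y*) directly from the definition above, and the sample points are illustrative:

```python
import math

def theta(x):
    # Binet remainder from Equation (A1)
    return math.lgamma(x) - (x - 0.5) * math.log(x) + x - 0.5 * math.log(2 * math.pi)

def log_Btilde(x, y):
    # log of Btilde(x, y) = B(x, y) (x + y)^(x+y) x^(-x) y^(-y)
    logB = math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    return logB + (x + y) * math.log(x + y) - x * math.log(x) - y * math.log(y)

for x, y in [(0.5, 1.5), (2.0, 3.0), (10.0, 0.1)]:
    # Identity (A4)
    lhs = log_Btilde(x, y) + 0.5 * math.log(x * y / (2 * math.pi * (x + y)))
    rhs = theta(x) + theta(y) - theta(x + y)
    assert abs(lhs - rhs) < 1e-9
    # Lower bound (A5)
    assert math.exp(log_Btilde(x, y)) >= (x + y) / (x * y)
```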

#### **Appendix B. Details for Rényi Entropy Examples**

This appendix studies properties of the two-moment inequalities for Rényi entropy described in Section 3.

#### *Appendix B.1. Log-Normal Distribution*

Let *X* be a log-normal random variable with parameters (*µ*, *σ*<sup>2</sup>) and consider the parametrization

$$p = \frac{1-r}{r} - (1-\lambda)\sqrt{\frac{(1-r)\,u}{r\lambda(1-\lambda)}}$$

$$q = \frac{1-r}{r} + \lambda\sqrt{\frac{(1-r)\,u}{r\lambda(1-\lambda)}}$$

where *λ* ∈ (0, 1) and *u* ∈ (0, ∞). Then, we have

$$\begin{aligned} \psi\_r(p,q) &= \widetilde{\mathcal{B}}\left(\frac{r\lambda}{1-r}, \frac{r(1-\lambda)}{1-r}\right) \sqrt{\frac{r\lambda(1-\lambda)}{(1-r)\,u}},\\ L\_r(X;p,q) &= \mu + \frac{1}{2}\left(\frac{1-r}{r}\right)\sigma^2 + \frac{1}{2}u\sigma^2. \end{aligned}$$

Combining these expressions with Equation (A4) leads to

$$\Delta\_r(X;p,q) = \theta\left(\frac{r\lambda}{1-r}\right) + \theta\left(\frac{r(1-\lambda)}{1-r}\right) - \theta\left(\frac{r}{1-r}\right) + \frac{1}{2}u\sigma^2 - \frac{1}{2}\log\left(u\sigma^2\right) - \frac{1}{2}\log\left(r^{\frac{1}{r-1}}\right). \tag{A6}$$

We now characterize the minimum with respect to the parameters (*λ*, *u*). Note that the mapping *λ* ↦ *θ*(*rλ*/(1 − *r*)) + *θ*(*r*(1 − *λ*)/(1 − *r*)) is convex and symmetric about the point *λ* = 1/2. Therefore, the minimum with respect to *λ* is attained at *λ* = 1/2. Meanwhile, the mapping *u* ↦ *uσ*<sup>2</sup> − log(*uσ*<sup>2</sup>) is convex and attains its minimum at *u* = 1/*σ*<sup>2</sup>. Evaluating Equation (A6) with these values, we see that the optimal two-moment inequality can be expressed as

$$\Delta\_r(X) = 2\theta\left(\frac{r}{2(1-r)}\right) - \theta\left(\frac{r}{1-r}\right) + \frac{1}{2}\log\left(e\,r^{\frac{1}{1-r}}\right).$$

By Equation (A4), this expression is equivalent to Equation (A1). Moreover, the fact that ∆<sub>*r*</sub>(*X*) decreases to zero as *r* increases to one follows from the fact that *θ*(*x*) decreases to zero as *x* increases to infinity.
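As a numerical sanity check on this limit, the sketch below evaluates the optimal gap at *λ* = 1/2 and *u* = 1/*σ*<sup>2</sup> (where the *u*-dependent terms of Equation (A6) contribute exactly 1/2) and confirms that it is positive and decreases to zero as *r* approaches one; `theta` and `gap` are helper names, not notation from the text:

```python
import math

def theta(x):
    # Binet remainder from Equation (A1)
    return math.lgamma(x) - (x - 0.5) * math.log(x) + x - 0.5 * math.log(2 * math.pi)

def gap(r):
    """Two-moment gap at lambda = 1/2, u = 1/sigma^2:
    2 theta(r/(2(1-r))) - theta(r/(1-r)) + (1/2) log(e * r**(1/(1-r)))."""
    return (2 * theta(r / (2 * (1 - r))) - theta(r / (1 - r))
            + 0.5 * (1 + math.log(r) / (1 - r)))

rs = [0.5, 0.7, 0.9, 0.99]
gaps = [gap(r) for r in rs]
assert all(v > 0 for v in gaps)          # the gap is nonnegative
assert all(a > b for a, b in zip(gaps, gaps[1:]))  # and decreasing in r
assert gaps[-1] < 1e-3                   # vanishing as r -> 1
```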

Next, we express the gap in terms of the pair (*p*, *q*). Comparing ∆<sub>*r*</sub>(*X*; *p*, *q*) with ∆<sub>*r*</sub>(*X*) leads to

$$
\Delta\_r(X; p, q) = \Delta\_r(X) + \frac{1}{2}\varrho \left( \frac{r\lambda(1-\lambda)}{1-r}(q-p)^2\sigma^2 \right) + \theta \left( \frac{r\lambda}{1-r} \right) + \theta \left( \frac{r(1-\lambda)}{1-r} \right) - 2\theta \left( \frac{r}{2(1-r)} \right),
$$

where *ϱ*(*x*) = *x* − log(*x*) − 1. In particular, if *p* = 0, then we obtain the simplified expression

$$
\Delta\_r(X;0,q) = \Delta\_r(X) + \frac{1}{2}\varrho\left(\left(q-\frac{1-r}{r}\right)\sigma^2\right) + \theta\left(\frac{r}{1-r}-\frac{1}{q}\right) + \theta\left(\frac{1}{q}\right) - 2\theta\left(\frac{r}{2(1-r)}\right).
$$

This characterization shows that the gap of the optimal one-moment inequality ∆̃<sub>*r*</sub>(*X*) increases to infinity in the limit as either *σ*<sup>2</sup> → 0 or *σ*<sup>2</sup> → ∞.

#### *Appendix B.2. Multivariate Gaussian Distribution*

Let *Y* ∼ N(0, *I<sub>n</sub>*) be an *n*-dimensional Gaussian vector and consider the parametrization

$$\begin{aligned} p &= \frac{1-r}{r} - \frac{1-\lambda}{r} \sqrt{\frac{2(1-r)z}{\lambda(1-\lambda)}}, \\ q &= \frac{1-r}{r} + \frac{\lambda}{r} \sqrt{\frac{2(1-r)z}{\lambda(1-\lambda)}}. \end{aligned}$$

where *λ* ∈ (0, 1) and *z* ∈ (0, ∞). We can write

$$\begin{aligned} \log \omega(S\_Y) &= \frac{n}{2} \log \pi - \log \left(\frac{n}{2}\right) - \log \Gamma \left(\frac{n}{2}\right) \\ \psi\_r(p,q) &= \tilde{\mathcal{B}} \left(\frac{r\lambda}{1-r}, \frac{r(1-\lambda)}{1-r}\right) \sqrt{\frac{r\lambda(1-\lambda)}{(1-r)}} \sqrt{\frac{nr}{2z}}. \end{aligned}$$

Furthermore, if

$$2(1 - \lambda)\sqrt{\frac{2(1 - r)z}{\lambda(1 - \lambda)n}} < 1,\tag{A7}$$

then *L<sub>r</sub>*(‖*Y*‖<sup>*n*</sup>; *p*, *q*) is finite and is given by

$$L\_r(\|Y\|^n; p,q) = Q\_{r,n}(\lambda,z) + \frac{n}{2}\log 2 + \frac{r}{1-r}\left[\log \Gamma\left(\frac{n}{2r}\right) - \log \Gamma\left(\frac{n}{2}\right)\right],$$

where

$$\begin{split} Q\_{r,n}(\lambda,z) &= \frac{r\lambda}{1-r} \log \Gamma\left(\frac{n}{2r} - \frac{1-\lambda}{r} \sqrt{\frac{(1-r)nz}{2\lambda(1-\lambda)}}\right) + \frac{r(1-\lambda)}{1-r} \log \Gamma\left(\frac{n}{2r} + \frac{\lambda}{r} \sqrt{\frac{(1-r)nz}{2\lambda(1-\lambda)}}\right) \\ &\quad - \frac{r}{1-r} \log \Gamma\left(\frac{n}{2r}\right). \end{split} \tag{A8}$$

Here, we note that the scaling in Equation (21) corresponds to *λ* = 1/2 and *z* = *n*/(*n* + 1), and thus the condition in Inequality (A7) is satisfied for all *n* ≥ 1. Combining the above expressions and then using Equations (A1) and (A4) leads to

$$\begin{split} \Delta\_r(Y;p,q) &= \theta\left(\frac{r\lambda}{1-r}\right) + \theta\left(\frac{r(1-\lambda)}{1-r}\right) - \theta\left(\frac{r}{1-r}\right) + Q\_{r,n}(\lambda,z) - \frac{1}{2}\log z - \frac{1}{2}\log\left(r^{\frac{1}{r-1}}\right) \\ &\quad + \frac{r}{1-r}\theta\left(\frac{n}{2r}\right) - \frac{1}{1-r}\theta\left(\frac{n}{2}\right). \end{split} \tag{A9}$$

Next, we study some properties of *Q*<sub>*r*,*n*</sub>(*λ*, *z*). By Equation (A1), the logarithm of the gamma function can be expressed as the sum of convex functions:

$$\log \Gamma(x) = \varphi(x) + \frac{1}{2} \log \left(\frac{1}{x}\right) + \frac{1}{2} \log(2\pi) - 1 + \theta(x),$$

where *ϕ*(*x*) = *x* log *x* + 1 − *x*. Starting with the definition of *Q*<sub>*r*,*n*</sub>(*λ*, *z*) and then using Jensen's inequality yields

$$\begin{split} Q\_{r,n}(z,\lambda) &\geq \frac{r\lambda}{1-r} \varphi\left(\frac{n}{2r} - \frac{1-\lambda}{r} \sqrt{\frac{(1-r)nz}{2\lambda(1-\lambda)}}\right) \\ &\quad + \frac{r(1-\lambda)}{1-r} \varphi\left(\frac{n}{2r} + \frac{\lambda}{r} \sqrt{\frac{(1-r)nz}{2\lambda(1-\lambda)}}\right) - \frac{r}{1-r} \varphi\left(\frac{n}{2r}\right) \\ &= \frac{\lambda}{a} \varphi\left(1 - \sqrt{\left(\frac{1-\lambda}{\lambda}\right)az}\right) + \frac{(1-\lambda)}{a} \varphi\left(1 + \sqrt{\left(\frac{\lambda}{1-\lambda}\right)az}\right). \end{split}$$

where *a* = 2(1 − *r*)/*n*. Using the inequality *ϕ*(*x*) ≥ (3/2)(*x* − 1)<sup>2</sup>/(*x* + 2) leads to

$$\begin{split} Q\_{r,n}(\lambda, z) &\geq \frac{z}{2} \left[ \left( 1 - \sqrt{\left( \frac{1-\lambda}{\lambda} \right) bz} \right) \left( 1 + \sqrt{\left( \frac{\lambda}{1-\lambda} \right) bz} \right) \right]^{-1} \\ &\geq \frac{z}{2} \left( 1 + \sqrt{\left( \frac{\lambda}{1-\lambda} \right) bz} \right)^{-1} \end{split} \tag{A10}$$

where *b* = 2(1 − *r*)/(9*n*).

Observe that the right-hand side of Inequality (A10) converges to *z*/2 as *n* increases to infinity. It turns out this limiting behavior is tight. Using Equation (A1), it is straightforward to show that *Q*<sub>*r*,*n*</sub>(*λ*, *z*) converges pointwise to *z*/2 as *n* increases to infinity; that is,

$$\lim\_{n \to \infty} Q\_{r,n}(\lambda, z) = \frac{1}{2} z,\tag{A11}$$

for any fixed pair (*λ*, *z*) ∈ (0, 1) × (0, ∞).
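The convergence in Equation (A11) can be observed directly by evaluating Equation (A8) with the log-gamma function; the parameter choices below are illustrative:

```python
import math

def Q(r, n, lam, z):
    """Q_{r,n}(lambda, z) from Equation (A8), computed via log-gamma."""
    a = n / (2 * r)
    s = math.sqrt((1 - r) * n * z / (2 * lam * (1 - lam)))
    w1 = r * lam / (1 - r)
    w2 = r * (1 - lam) / (1 - r)
    return (w1 * math.lgamma(a - (1 - lam) / r * s)
            + w2 * math.lgamma(a + lam / r * s)
            - (w1 + w2) * math.lgamma(a))

r, lam, z = 0.5, 0.5, 1.0
vals = [Q(r, n, lam, z) for n in (10**3, 10**4, 10**6)]
# Q_{r,n}(lambda, z) approaches z/2 as n grows (Equation (A11)).
assert abs(vals[-1] - z / 2) < 0.01
assert abs(vals[-1] - z / 2) < abs(vals[0] - z / 2)
```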

#### *Appendix B.3. Proof of Proposition 6*

Let *D* = (0, 1) × (0, ∞). For fixed *r* ∈ (0, 1), we use *Q<sub>n</sub>*(*λ*, *z*) to denote the function *Q*<sub>*r*,*n*</sub>(*λ*, *z*) defined in Equation (A8), and we use *G<sub>n</sub>*(*λ*, *z*) to denote the right-hand side of Equation (A9). These functions are defined to be equal to positive infinity for any pair (*λ*, *z*) ∈ *D* such that Inequality (A7) does not hold.

Note that the terms *θ*(*n*/(2*r*)) and *θ*(*n*/2) converge to zero in the limit as *n* increases to infinity. In conjunction with Equation (A11), this shows that *Gn*(*λ*, *z*) converges pointwise to a limit *G*(*λ*, *z*) given by

$$G(\lambda, z) = \theta\left(\frac{r\lambda}{1 - r}\right) + \theta\left(\frac{r(1 - \lambda)}{1 - r}\right) - \theta\left(\frac{r}{1 - r}\right) + \frac{1}{2}z - \frac{1}{2}\log\left(z\right) - \frac{1}{2}\log\left(r^{\frac{1}{r - 1}}\right).$$

At this point, the correspondence with the log-normal distribution can be seen from the fact that *G*(*λ*, *z*) is equal to the right-hand side of Equation (A6) evaluated with *uσ* <sup>2</sup> = *z*.

To show that the gap corresponding to the log-normal distribution provides an upper bound on the limit, we use

$$\begin{aligned} \limsup\_{n \to \infty} \Delta\_r(Y) &= \limsup\_{n \to \infty} \inf\_{(\lambda, z) \in D} G\_n(\lambda, z) \\ &\le \inf\_{(\lambda, z) \in D} \limsup\_{n \to \infty} G\_n(\lambda, z) \\ &= \inf\_{(\lambda, z) \in D} G(\lambda, z) \\ &= \Delta\_r(X). \end{aligned} \tag{A12}$$

Here, the last equality follows from the analysis in Appendix B.1, which shows that the minimum of *G*(*λ*, *z*) is attained at *λ* = 1/2 and *z* = 1.

To prove the lower bound requires a bit more work. Fix any *ε* ∈ (0, 1) and let *D<sub>ε</sub>* = (0, 1 − *ε*] × (0, ∞). Using the lower bound on *Q<sub>n</sub>*(*λ*, *z*) given in Inequality (A10), it can be verified that

$$\liminf\_{n \to \infty} \inf\_{(\lambda, z) \in D\_{\varepsilon}} \left[ Q\_n(\lambda, z) - \frac{1}{2} \log z \right] \ge \frac{1}{2}.$$

Consequently, we have

$$\liminf\_{n \to \infty} \inf\_{(\lambda, z) \in D\_{\varepsilon}} G\_n(\lambda, z) = \inf\_{(\lambda, z) \in D\_{\varepsilon}} G(\lambda, z) \ge \Delta\_r(X). \tag{A13}$$

To complete the proof, we will show that for any sequence *λ<sub>n</sub>* that converges to one as *n* increases to infinity, we have

$$\liminf\_{n \to \infty} \inf\_{z \in (0, \infty)} G\_n(\lambda\_n, z) = \infty. \tag{A14}$$

To see why this is the case, note that by Equation (A4) and Inequality (A5),

$$\theta\left(\frac{r\lambda}{1-r}\right) + \theta\left(\frac{r(1-\lambda)}{1-r}\right) - \theta\left(\frac{r}{1-r}\right) \ge \frac{1}{2}\log\left(\frac{1-r}{2\pi r\lambda\left(1-\lambda\right)}\right).$$

Therefore, we can write

$$G\_n(\lambda, z) \ge Q\_n(\lambda, z) - \frac{1}{2} \log \left( \lambda (1 - \lambda) z \right) + c\_n, \tag{A15}$$

where *c<sup>n</sup>* is bounded uniformly for all *n*. Making the substitution *u* = *λ*(1 − *λ*)*z*, we obtain

$$\inf\_{z>0} G\_n\left(\lambda, z\right) \geq \inf\_{u>0} \left[Q\_n\left(\lambda, \frac{u}{\lambda\left(1-\lambda\right)}\right) - \frac{1}{2}\log u\right] + c\_n.$$

Next, let *b<sup>n</sup>* = 2(1 − *r*)/(9*n*). The lower bound in Inequality (A10) leads to

$$\inf\_{u>0} \left[ Q\_n \left( \lambda, \frac{u}{\lambda(1-\lambda)} \right) - \frac{1}{2} \log u \right] \ge \inf\_{u>0} \left[ \frac{u}{2\lambda} \left( \frac{1}{1-\lambda + \sqrt{b\_n u}} \right) - \frac{1}{2} \log u \right]. \tag{A16}$$

The limiting behavior in Equation (A14) can now be seen as a consequence of Inequality (A15) and the fact that, for any sequence *λ<sup>n</sup>* converging to one, the right-hand side of Inequality (A16) increases without bound as *n* increases. Combining Inequality (A12), Inequality (A13), and Equation (A14) establishes that the large *n* limit of ∆*r*(*Y*) exists and is equal to ∆*r*(*X*). This concludes the proof of Proposition 6.

*Appendix B.4. Proof of Inequality* (24)

Given any *λ* ∈ (0, 1) and *u* ∈ (0, ∞) let

$$p(r) = \frac{1-r}{r} - \sqrt{\frac{1-r}{r} \left(\frac{1-\lambda}{\lambda}\right) u}$$

$$q(r) = \frac{1-r}{r} + \sqrt{\frac{1-r}{r} \left(\frac{\lambda}{1-\lambda}\right) u}$$

We need the following results, which characterize the terms in Proposition 4 in the limit as *r* increases to one.

**Lemma A1.** *The function ψr*(*p*(*r*), *q*(*r*)) *satisfies*

$$\lim\_{r \to 1} \psi\_r(p(r), q(r)) = \sqrt{\frac{2\pi}{u}}.$$

**Proof.** Starting with Equation (A4), we can write

$$
\psi\_r(p,q) = \frac{1}{q-p} \sqrt{\frac{2\pi(1-r)}{r\lambda(1-\lambda)}} \exp\left(\theta\left(\frac{r\lambda}{1-r}\right) + \theta\left(\frac{r(1-\lambda)}{1-r}\right) - \theta\left(\frac{r}{1-r}\right)\right).
$$

As *r* increases to one, the terms in the exponent converge to zero. Noting that *q*(*r*) − *p*(*r*) = ((1 − *r*)*u*/(*rλ*(1 − *λ*)))<sup>1/2</sup> completes the proof.

**Lemma A2.** *If X is a random variable such that s* ↦ E[|*X*|<sup>*s*</sup>] *is finite in a neighborhood of zero, then* E[log |*X*|] *and* Var(log |*X*|) *are finite, and*

$$\lim\_{r \to 1} L\_r(X; p(r), q(r)) = \mathbb{E}\left[\log|X|\right] + \frac{u}{2} \operatorname{Var}(\log|X|).$$

**Proof.** Let Λ(*s*) = log(E[|*X*|<sup>*s*</sup>]). The assumption that E[|*X*|<sup>*s*</sup>] is finite in a neighborhood of zero means that E[(log |*X*|)<sup>*m*</sup>] is finite for all positive integers *m*, and thus Λ(*s*) is real analytic in a neighborhood of zero. Hence, there exist constants *δ* > 0 and *C* < ∞, depending on the distribution of *X*, such that

$$\left|\Lambda(s) - as - bs^2\right| \le C \left|s\right|^3, \quad \text{for all } |s| \le \delta,$$

where *a* = E[log |*X*|] and *b* = ½ Var(log |*X*|). Consequently, for all *r* such that −*δ* < *p*(*r*) < (1 − *r*)/*r* < *q*(*r*) < *δ*, it follows that

$$\left| L\_r(X; p(r), q(r)) - a - \left( \frac{1 - r}{r} + u \right) b \right| \le C \frac{r}{1 - r} \left( \lambda |p(r)|^3 + (1 - \lambda) |q(r)|^3 \right).$$

Taking the limit as *r* increases to one completes the proof.

We are now ready to prove Inequality (24). Combining Proposition 4 with Lemma A1 and Lemma A2 yields

$$\limsup\_{r \to 1} h\_r(X) \le \frac{1}{2} \log \left( \frac{2\pi}{u} \right) + \mathbb{E} \left[ \log |X| \right] + \frac{u}{2} \operatorname{Var}(\log |X|).$$

The stated inequality follows from evaluating the right-hand side with *u* = 1/Var(log |*X*|), recalling that *h*(*X*) corresponds to the limit of *h<sub>r</sub>*(*X*) as *r* increases to one.
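For a concrete instance of the resulting bound *h*(*X*) ≤ E[log *X*] + ½ log(2*πe* Var(log *X*)), take *X* ∼ Exp(1), for which *h*(*X*) = 1 nat, E[log *X*] = −*γ* (the Euler–Mascheroni constant), and Var(log *X*) = *π*<sup>2</sup>/6 are standard closed forms; the sketch checks that the bound holds with a small gap:

```python
import math

# For X ~ Exp(1): h(X) = 1 nat, E[log X] = -gamma (Euler-Mascheroni),
# Var(log X) = pi^2 / 6 (log X follows a reflected Gumbel distribution).
euler_gamma = 0.5772156649015329
h_X = 1.0
mean_logX = -euler_gamma
var_logX = math.pi ** 2 / 6

# Right-hand side of the limiting bound, evaluated at u = 1 / Var(log X):
bound = mean_logX + 0.5 * math.log(2 * math.pi * math.e * var_logX)
assert h_X <= bound          # the bound holds
assert bound - h_X < 0.1     # and it is fairly tight here
```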

#### **Appendix C. Proof of Proposition 7**

The function *κ* : (0, 1] → R<sup>+</sup> can be expressed as

$$\kappa(t) = \sup\_{u \in (0, \infty)} \rho\_t(u), \tag{A17}$$

where *ρ<sub>t</sub>*(*u*) = log(1 + *u*)/*u*<sup>*t*</sup>. For *t* = 1, the bound log(1 + *u*) ≤ *u* implies that *ρ*<sub>1</sub>(*u*) ≤ 1. Noting that lim<sub>*u*→0</sub> *ρ*<sub>1</sub>(*u*) = 1, we conclude that *κ*(1) = 1.

Next, we consider the case *t* ∈ (0, 1). The function *ρ<sub>t</sub>* is continuously differentiable on (0, ∞) with

$$\operatorname{sgn}(\rho\_t'(u)) = \operatorname{sgn}\left(u - t(1+u)\log(1+u)\right). \tag{A18}$$

Under the assumption *t* ∈ (0, 1), we see that *ρ<sub>t</sub>*(*u*) is increasing for all *u* sufficiently close to zero and decreasing for all *u* sufficiently large, and thus the supremum is attained at a stationary point of *ρ<sub>t</sub>*(*u*) on (0, ∞). Making the substitution *w* = log(1 + *u*) − 1/*t* leads to

$$
\rho\_t'(u) = 0 \quad \Longleftrightarrow \quad we^w = -\frac{1}{t}e^{-\frac{1}{t}}.
$$

For *t* ∈ (0, 1), it follows that −(1/*t*)*e*<sup>−1/*t*</sup> ∈ (−*e*<sup>−1</sup>, 0), and thus *ρ<sub>t</sub>*′(*u*) has a unique root that can be expressed as

$$u\_t^\* = \exp\left(W\left(-\frac{1}{t}\exp\left(-\frac{1}{t}\right)\right) + \frac{1}{t}\right) - 1,$$

where Lambert's function *W*(*z*) is the solution to the equation *z* = *we*<sup>*w*</sup> on the interval [−1, ∞).

**Lemma A3.** *The function g*(*t*) = *tκ*(*t*) *is strictly increasing on* (0, 1] *with* lim*t*→<sup>0</sup> *g*(*t*) = 1/*e and g*(1) = 1*.*

**Proof.** The fact that *g*(1) = 1 follows from *κ*(1) = 1. By the envelope theorem [37], the derivative of *g*(*t*) can be expressed as

$$g'(t) = \frac{\mathrm{d}}{\mathrm{d}t} \, t\rho\_t(u)\Big|\_{u=u\_t^\*} = \frac{\log(1+u\_t^\*)}{(u\_t^\*)^t} - t\log(u\_t^\*)\frac{\log(1+u\_t^\*)}{(u\_t^\*)^t}.$$

In view of Equation (A18), it follows that *ρ<sub>t</sub>*′(*u<sub>t</sub>*<sup>∗</sup>) = 0 can be expressed equivalently as

$$\frac{u\_t^\*}{(1+u\_t^\*)\log(1+u\_t^\*)} = t,\tag{A19}$$

and thus

$$\operatorname{sgn}(g'(t)) = \operatorname{sgn}\left(1 - \frac{u\_t^\* \log u\_t^\*}{(1 + u\_t^\*) \log(1 + u\_t^\*)}\right). \tag{A20}$$

Noting that *u* log *u* < (1 + *u*) log(1 + *u*) for all *u* ∈ (0, ∞), it follows that *g*′(*t*) > 0, and thus *g*(*t*) is strictly increasing.

To prove the small *t* limit, we use Equation (A19) to write

$$\log(g(t)) = \log\left(\frac{u\_t^\*}{1 + u\_t^\*}\right) - \frac{u\_t^\* \log u\_t^\*}{(1 + u\_t^\*) \log(1 + u\_t^\*)}.\tag{A21}$$

Now, as *t* decreases to zero, Equation (A19) shows that *u<sub>t</sub>*<sup>∗</sup> increases to infinity. By Equation (A21), it then follows that log(*g*(*t*)) converges to negative one, which proves the desired limit.
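The quantities in this proof are easy to compute, which gives a quick check on Lemma A3: the hypothetical helper `u_star` below solves Equation (A19) for *u<sub>t</sub>*<sup>∗</sup> by bisection (so no Lambert-*W* routine is needed), and *g*(*t*) = *tκ*(*t*) is then seen to be increasing with values in (1/*e*, 1]:

```python
import math

def u_star(t, lo=1e-12, hi=1e9, iters=200):
    """Unique positive root of u = t (1 + u) log(1 + u), i.e. Equation (A19)."""
    f = lambda u: u - t * (1 + u) * math.log(1 + u)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def g(t):
    # g(t) = t * kappa(t) with kappa(t) = log(1 + u*) / (u*)**t
    u = u_star(t)
    return t * math.log(1 + u) / u ** t

# Lemma A3: g is strictly increasing on (0, 1), with values in (1/e, 1).
ts = [0.1, 0.3, 0.5, 0.7, 0.9]
gs = [g(t) for t in ts]
assert all(a < b for a, b in zip(gs, gs[1:]))
assert all(1 / math.e - 1e-6 < v < 1.0 for v in gs)
```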

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Strongly Convex Divergences**

#### **James Melbourne**

Department of Electrical and Computer Engineering, University of Minnesota-Twin Cities, Minneapolis, MN 55455, USA; melbo013@umn.edu

Received: 2 September 2020; Accepted: 9 November 2020; Published: 21 November 2020

**Abstract:** We consider a sub-class of the *f*-divergences satisfying a stronger convexity property, which we refer to as strongly convex, or *κ*-convex divergences. We derive new and old relationships, based on convexity arguments, between popular *f*-divergences.

**Keywords:** information measures; *f*-divergence; hypothesis testing; total variation; skew-divergence; convexity; Pinsker's inequality; Bayes risk; Jensen–Shannon divergence

#### **1. Introduction**

The concept of an *f*-divergence, introduced independently by Ali-Silvey [1], Morimoto [2], and Csiszár [3], unifies several important information measures between probability distributions, as integrals of a convex function *f*, composed with the Radon–Nikodym derivative of the two probability distributions. (An additional assumption can be made that *f* is strictly convex at 1, to ensure that *D<sub>f</sub>*(*µ*||*ν*) > 0 for *µ* ≠ *ν*. This obviously holds when *f*″(1) > 0, and can hold for some *f*-divergences without classical derivatives; for instance, the total variation is strictly convex at 1. An example of an *f*-divergence that is not strictly convex is provided by the so-called "hockey-stick" divergence, where *f*(*x*) = (*x* − *γ*)<sub>+</sub>; see [4–6].) For a convex function *f* : (0, ∞) → R such that *f*(1) = 0, and measures *P* and *Q* such that *P* ≪ *Q*, the *f*-divergence from *P* to *Q* is given by *D<sub>f</sub>*(*P*||*Q*) := ∫ *f*(*dP*/*dQ*) *dQ*. The canonical example of an *f*-divergence, realized by taking *f*(*x*) = *x* log *x*, is the relative entropy (often called the KL-divergence), which we denote with the subscript *f* omitted. *f*-divergences inherit many properties enjoyed by this special case: non-negativity, joint convexity in the arguments, and a data processing inequality. Other important examples include the total variation, the *χ*<sup>2</sup>-divergence, and the squared Hellinger distance. The reader is directed to Chapters 6 and 7 of [7] for more background.

We are interested in how stronger convexity properties of *f* give improvements of classical *f*-divergence inequalities. More explicitly, we consider consequences of *f* being *κ*-convex, in the sense that the map *x* ↦ *f*(*x*) − *κx*<sup>2</sup>/2 is convex. This is in part inspired by the work of Sason [8], who demonstrated that divergences that are *κ*-convex satisfy "stronger than *χ*<sup>2</sup>" data-processing inequalities.

Perhaps the most well-known example of an *f*-divergence inequality is Pinsker's inequality, which bounds the square of the total variation above by a constant multiple of the relative entropy. That is, for probability measures *P* and *Q*, |*P* − *Q*|<sub>*TV*</sub><sup>2</sup> ≤ *c* *D*(*P*||*Q*). The optimal constant is achieved for Bernoulli measures and, under our conventions for the total variation, *c* = ½ log *e*. Many extensions and sharpenings of Pinsker's inequality exist (for examples, see [9–11]). Building on the work of Guntuboyina [9] and Topsøe [11], we achieve a further sharpening of Pinsker's inequality in Theorem 9.
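As a small illustration (not part of the original text), Pinsker's inequality can be checked directly on Bernoulli pairs; with the relative entropy in nats, the constant *c* = ½ log *e* becomes simply 1/2:

```python
import math

def kl(p, q):
    """Relative entropy (nats) between Bernoulli(p) and Bernoulli(q)."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

def tv(p, q):
    # Total variation with the normalization |P - Q|_TV <= 1.
    return abs(p - q)

# Pinsker's inequality, |P - Q|_TV^2 <= (1/2) D(P||Q), on a grid of Bernoulli pairs.
for p in (0.1, 0.3, 0.5, 0.9):
    for q in (0.2, 0.5, 0.8):
        assert tv(p, q) ** 2 <= 0.5 * kl(p, q) + 1e-12
```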

Aside from the total variation, most divergences of interest have stronger than affine convexity, at least when *f* is restricted to a sub-interval of the real line. This observation is especially relevant to the situation in which one wishes to study *D<sub>f</sub>*(*P*||*Q*) in the presence of a bounded Radon–Nikodym derivative *dP*/*dQ* ∈ (*a*, *b*) ⊊ (0, ∞). One naturally obtains such bounds for skew divergences, that is, divergences of the form (*P*, *Q*) ↦ *D<sub>f</sub>*((1 − *t*)*P* + *tQ*||(1 − *s*)*P* + *sQ*) for *t*, *s* ∈ [0, 1], as in this case,

d((1 − *t*)*P* + *tQ*)/d((1 − *s*)*P* + *sQ*) ≤ max{(1 − *t*)/(1 − *s*), *t*/*s*}. Important examples of skew divergences include the skew divergence [12] based on the relative entropy and the Vincze–Le Cam divergence [13,14], called the triangular discrimination in [11], and its generalization due to Györfi and Vajda [15] based on the *χ*<sup>2</sup>-divergence. The Jensen–Shannon divergence [16] and its recent generalization [17] give examples of *f*-divergences realized as linear combinations of skewed divergences.

Let us outline the paper. In Section 2, we derive elementary results on *κ*-convex divergences and give a table of examples of *κ*-convex divergences. We demonstrate that *κ*-convex divergences can be lower bounded by the *χ*<sup>2</sup>-divergence, and that the joint convexity of the map (*P*, *Q*) ↦ *D<sub>f</sub>*(*P*||*Q*) can be sharpened under *κ*-convexity conditions on *f*. As a consequence, we obtain bounds between the mean-square total variation distance of a set of distributions from its barycenter, and the average *f*-divergence from the set to the barycenter.

In Section 3, we investigate general skewing of *f*-divergences. In particular, we introduce the skew-symmetrization of an *f*-divergence, which recovers the Jensen–Shannon divergence and the Vincze–Le Cam divergences as special cases. We also show that a scaling of the Vincze–Le Cam divergence is minimal among skew-symmetrizations of *κ*-convex divergences on (0, 2). We then consider linear combinations of skew divergences and show that a generalized Vincze–Le Cam divergence (based on skewing the *χ*<sup>2</sup>-divergence) can be upper bounded by the generalized Jensen–Shannon divergence introduced recently by Nielsen [17] (based on skewing the relative entropy), reversing the classical convexity bounds *D*(*P*||*Q*) ≤ log(1 + *χ*<sup>2</sup>(*P*||*Q*)) ≤ (log *e*) *χ*<sup>2</sup>(*P*||*Q*). We also derive upper and lower total variation bounds for Nielsen's generalized Jensen–Shannon divergence.

In Section 4, we consider a family of densities {*p<sub>i</sub>*} weighted by *λ<sub>i</sub>*, and a density *q*. We use the Bayes estimator *T*(*x*) = arg max<sub>*i*</sub> *λ<sub>i</sub>p<sub>i</sub>*(*x*) to derive a convex decomposition of the barycenter *p* = ∑<sub>*i*</sub> *λ<sub>i</sub>p<sub>i</sub>* and of *q*, each into two auxiliary densities. (Recall, a Bayes estimator is one that minimizes the expected value of a loss function. By the assumptions of our model, that P(*θ* = *i*) = *λ<sub>i</sub>* and P(*X* ∈ *A*|*θ* = *i*) = ∫<sub>*A*</sub> *p<sub>i</sub>*(*x*)d*x*, we have Eℓ(*θ*, *θ̂*) = 1 − ∫ *λ*<sub>*θ̂*(*x*)</sub>*p*<sub>*θ̂*(*x*)</sub>(*x*)d*x* for the loss function ℓ(*i*, *j*) = 1 − *δ<sub>i</sub>*(*j*) and any estimator *θ̂*. It follows that Eℓ(*θ*, *θ̂*) ≥ Eℓ(*θ*, *T*), since *λ*<sub>*θ̂*(*x*)</sub>*p*<sub>*θ̂*(*x*)</sub>(*x*) ≤ *λ*<sub>*T*(*x*)</sub>*p*<sub>*T*(*x*)</sub>(*x*). Thus, *T* is a Bayes estimator associated to ℓ.) We use this decomposition to sharpen, for *κ*-convex divergences, an elegant theorem of Guntuboyina [9] that generalizes the Fano and Pinsker inequalities to *f*-divergences. We then demonstrate explicitly, using an argument of Topsøe, how our sharpening of Guntuboyina's inequality gives a new sharpening of Pinsker's inequality in terms of the convex decomposition induced by the Bayes estimator.

#### *Notation*

Throughout, *f* denotes a convex function *f* : (0, ∞) → R ∪ {∞} such that *f*(1) = 0. For a convex function defined on (0, ∞), we define *f*(0) := lim<sub>*x*→0</sub> *f*(*x*). We denote by *f*<sup>∗</sup> the convex function *f*<sup>∗</sup> : (0, ∞) → R ∪ {∞} defined by *f*<sup>∗</sup>(*x*) = *x f*(*x*<sup>−1</sup>). We consider Borel probability measures *P* and *Q* on a Polish space X and define the *f*-divergence from *P* to *Q*, via densities *p* for *P* and *q* for *Q* with respect to a common reference measure *µ*, as

$$\begin{split} D\_f(p||q) &= \int\_{\mathcal{X}} f\left(\frac{p}{q}\right) q d\mu \\ &= \int\_{\{pq>0\}} q f\left(\frac{p}{q}\right) d\mu + f(0)Q(\{p=0\}) + f^\*(0)P(\{q=0\}). \end{split} \tag{1}$$

We note that this representation is independent of *µ*, and such a reference measure always exists, take *µ* = *P* + *Q* for example.

For *t*,*s* ∈ [0, 1], define the binary *f*-divergence

$$D\_f(t||s) \coloneqq sf\left(\frac{t}{s}\right) + (1-s)f\left(\frac{1-t}{1-s}\right) \tag{2}$$

with the conventions *f*(0) = lim<sub>*t*→0<sup>+</sup></sub> *f*(*t*), 0 *f*(0/0) = 0, and 0 *f*(*a*/0) = *a* lim<sub>*t*→∞</sub> *f*(*t*)/*t*. For a random variable *X* and a set *A*, we denote the probability that *X* takes a value in *A* by P(*X* ∈ *A*), the expectation of the random variable by E*X*, and the variance by Var(*X*) := E|*X* − E*X*|<sup>2</sup>. For a probability measure *µ* satisfying *µ*(*A*) = P(*X* ∈ *A*) for all Borel *A*, we write *X* ∼ *µ*, and, when there exists a probability density function such that P(*X* ∈ *A*) = ∫<sub>*A*</sub> *f*(*x*)d*γ*(*x*) for a reference measure *γ*, we write *X* ∼ *f*. For a probability measure *µ* on X, and an *L*<sup>2</sup> function *f* : X → R, we denote Var<sub>*µ*</sub>(*f*) := Var(*f*(*X*)) for *X* ∼ *µ*.

#### **2. Strongly Convex Divergences**

**Definition 1.** *An* R ∪ {∞}*-valued function f on a convex set K* ⊆ R *is κ-convex when x*, *y* ∈ *K and t* ∈ [0, 1] *implies*

$$f((1-t)\mathbf{x}+ty) \le (1-t)f(\mathbf{x}) + tf(y) - \kappa t(1-t)(\mathbf{x}-y)^2/2. \tag{3}$$

For example, when *f* is twice differentiable, (3) is equivalent to *f*″(*x*) ≥ *κ* for *x* ∈ *K*. Note that the case *κ* = 0 is just usual convexity.

**Proposition 1.** *For f* : *K* → R ∪ {∞} *and κ* ∈ [0, ∞)*, the following are equivalent:*


$$f'\_+(t) \ge f'\_+(s) + \kappa(t - s)$$

*for t* ≥ *s.*

**Proof.** Observe that it is enough to prove the result when *κ* = 0, where the proposition is reduced to the classical result for convex functions.

**Definition 2.** *An f-divergence D<sup>f</sup> is κ-convex on an interval K for κ* ≥ 0 *when the function f is κ-convex on K.*

Table 1 lists some *κ*-convex *f*-divergences of interest to this article.


**Table 1.** Examples of Strongly Convex Divergences.

Observe that we have taken the normalization convention on the total variation (the total variation of a signed measure *µ* on a space *X* can be defined through the Hahn–Jordan decomposition of the measure into non-negative measures *µ*<sup>+</sup> and *µ*<sup>−</sup> such that *µ* = *µ*<sup>+</sup> − *µ*<sup>−</sup>, as ‖*µ*‖ = *µ*<sup>+</sup>(*X*) + *µ*<sup>−</sup>(*X*) (see [18]); in our notation, |*µ*|<sub>*TV*</sub> = ‖*µ*‖/2), which we denote by |*P* − *Q*|<sub>*TV*</sub>, such that |*P* − *Q*|<sub>*TV*</sub> = sup<sub>*A*</sub> |*P*(*A*) − *Q*(*A*)| ≤ 1. In addition, note that the *α*-divergence interpolates: it gives Pearson's *χ*<sup>2</sup>-divergence when *α* = 3, one half of Neyman's *χ*<sup>2</sup>-divergence when *α* = −3, and the squared Hellinger divergence when *α* = 0, and has as limiting cases the relative entropy when *α* = 1 and the reverse relative entropy when *α* = −1. If *f* is *κ*-convex on [*a*, *b*], then its dual divergence *f*<sup>∗</sup>(*x*) := *x f*(*x*<sup>−1</sup>) is *κa*<sup>3</sup>-convex on [1/*b*, 1/*a*]. Recall that *f*<sup>∗</sup> satisfies the equality *D*<sub>*f*<sup>∗</sup></sub>(*P*||*Q*) = *D<sub>f</sub>*(*Q*||*P*). For brevity, we use *χ*<sup>2</sup>-divergence to refer to the Pearson *χ*<sup>2</sup>-divergence, and we articulate Neyman's *χ*<sup>2</sup> explicitly when necessary.

The next lemma is a restatement of Jensen's inequality.

**Lemma 1.** *If f is κ-convex on the range of X,*

$$\mathbb{E}f(X) \ge f(\mathbb{E}(X)) + \frac{\kappa}{2}\operatorname{Var}(X).$$

**Proof.** Apply Jensen's inequality to *f*(*x*) − *κx* <sup>2</sup>/2.

For a convex function *f* such that *f*(1) = 0 and *c* ∈ R, the function *f̃*(*t*) = *f*(*t*) + *c*(*t* − 1) remains a convex function, and what is more satisfies

$$D\_{\tilde{f}}(P||Q) = D\_f(P||Q)$$

since ∫ *c*(*p*/*q* − 1)*q* d*µ* = 0.

**Definition 3** (*χ* 2 -divergence)**.** *For f*(*t*) = (*t* − 1) 2 *, we write*

$$
\chi^2(P||Q) := D\_f(P||Q).
$$

We pursue a generalization of the following bound on the total variation by the *χ*<sup>2</sup>-divergence [19–21].

**Theorem 1** ([19–21])**.** *For measures P and Q,*

$$|P - Q|\_{TV}^2 \le \frac{\chi^2(P||Q)}{2}.\tag{4}$$

We mention the work of Harremoës and Vajda [20], in which it is shown, through a characterization of the extreme points of the joint range associated to a pair of *f*-divergences (valid in general), that the inequality characterizes the "joint range", that is, the range of the function (*P*, *Q*) ↦ (|*P* − *Q*|<sub>*TV*</sub>, *χ*<sup>2</sup>(*P*||*Q*)). We use the following lemma, which shows that every strongly convex divergence can be lower bounded, up to its convexity constant *κ* > 0, by the *χ*<sup>2</sup>-divergence.

**Lemma 2.** *For a κ-convex f ,*

$$D\_f(P||Q) \ge \frac{\kappa}{2} \chi^2(P||Q).$$

**Proof.** Define *f̃*(*t*) = *f*(*t*) − *f*′<sub>+</sub>(1)(*t* − 1) and note that *f̃* defines the same *κ*-convex divergence as *f*. Thus, we may assume without loss of generality that *f*′<sub>+</sub>(1) = 0. Since *f* is *κ*-convex, *φ* : *t* ↦ *f*(*t*) − *κ*(*t* − 1)<sup>2</sup>/2 is convex, and, by *f*′<sub>+</sub>(1) = 0, *φ*′<sub>+</sub>(1) = 0 as well. Thus, *φ* takes its minimum at *t* = 1, and hence *φ* ≥ 0, so that *f*(*t*) ≥ *κ*(*t* − 1)<sup>2</sup>/2. Computing,

$$\begin{aligned} D\_f(P||Q) &= \int f\left(\frac{dP}{dQ}\right) dQ\\ &\geq \frac{\kappa}{2} \int \left(\frac{dP}{dQ} - 1\right)^2 dQ\\ &= \frac{\kappa}{2} \chi^2(P||Q). \end{aligned}$$
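These bounds are straightforward to exercise on discrete distributions. The sketch below (illustrative distributions, relative entropy in nats) checks Theorem 1 and then Lemma 2 for *f*(*x*) = *x* log *x*, using that *f*″(*x*) = 1/*x*, so *f* is *κ*-convex with *κ* = 1/*b* on an interval of likelihood ratios bounded by *b*:

```python
import math

def chi2(P, Q):
    # Pearson chi-squared divergence: sum (p - q)^2 / q
    return sum((p - q) ** 2 / q for p, q in zip(P, Q))

def kl(P, Q):
    # relative entropy in nats
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def tv(P, Q):
    # |P - Q|_TV = (1/2) sum |p - q|
    return 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

P = [0.2, 0.5, 0.3]
Q = [0.4, 0.4, 0.2]

# Theorem 1: |P - Q|_TV^2 <= chi^2(P||Q) / 2.
assert tv(P, Q) ** 2 <= chi2(P, Q) / 2

# Lemma 2 for f(x) = x log x: on ratios p/q <= b, f''(x) = 1/x >= 1/b,
# so with kappa = 1/b we get D(P||Q) >= (kappa/2) chi^2(P||Q).
b = max(p / q for p, q in zip(P, Q))
kappa = 1 / b
assert kl(P, Q) >= (kappa / 2) * chi2(P, Q)

# Combined bound: |P - Q|_TV^2 <= D_f(P||Q) / kappa.
assert tv(P, Q) ** 2 <= kl(P, Q) / kappa
```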

Based on a Taylor series expansion of *f* about 1, Nielsen and Nock ([22], Corollary 1) gave the estimate

$$D\_f(P||Q) \approx \frac{f''(1)}{2} \chi^2(P||Q) \tag{5}$$

for divergences with a non-zero second derivative and *P* close to *Q*. Lemma 2 complements this estimate with a lower bound when *f* is *κ*-convex. In particular, if *f*″(1) = *κ*, it shows that the approximation in (5) is an underestimate.

**Theorem 2.** *For measures P and Q, and a κ-convex divergence D<sub>f</sub>,*

$$\|\mathbf{P} - \mathbf{Q}\|\_{TV}^2 \le \frac{D\_f(\mathbf{P}||\mathbf{Q})}{\kappa}.\tag{6}$$

**Proof.** By Lemma 2 and then Theorem 1,

$$\frac{D\_f(P||Q)}{\kappa} \ge \frac{\chi^2(P||Q)}{2} \ge |P - Q|\_{TV}^2 \tag{7}$$

The proof of Lemma 2 uses a pointwise inequality between convex functions to derive an inequality between their respective divergences. This simple technique was shown to have useful implications by Sason and Verdú in [6], where it appears as Theorem 1 and is used to give sharp comparisons in several *f*-divergence inequalities.

**Theorem 3** (Sason–Verdú [6])**.** *For divergences defined by $g$ and $f$ with $c f(t) \ge g(t)$ for all $t$,*

$$D\_{\mathcal{S}}(P||Q) \le c D\_f(P||Q).$$

*Moreover, if $f'(1) = g'(1) = 0$, then*

$$\sup_{P \neq Q} \frac{D_g(P\|Q)}{D_f(P\|Q)} = \sup_{t \neq 1} \frac{g(t)}{f(t)}.$$

**Corollary 1.** *For a smooth $\kappa$-convex $f$, the inequality*

$$D\_f(P||Q) \ge \frac{\kappa}{2} \chi^2(P||Q) \tag{8}$$

*is sharp multiplicatively in the sense that*

$$\inf\_{P \neq Q} \frac{D\_f(P||Q)}{\chi^2(P||Q)} = \frac{\kappa}{2}.\tag{9}$$

*if $f''(1) = \kappa$.*

In information geometry, a standard $f$-divergence is defined as an $f$-divergence satisfying the normalization $f(1) = f'(1) = 0$, $f''(1) = 1$ (see [23]). Thus, Corollary 1 shows that $\frac{1}{2}\chi^2$ provides a sharp lower bound on every standard $f$-divergence that is $1$-convex. In particular, the lower bound in Lemma 2 complementing the estimate (5) is shown to be sharp.

**Proof.** Without loss of generality, we assume that $f'(1) = 0$. If $f''(1) = \kappa$, then taking $g(t) = (t-1)^2$ and applying Theorem 3 and Lemma 2,

$$\sup\_{P \neq Q} \frac{D\_{\mathcal{S}}(P||Q)}{D\_f(P||Q)} = \sup\_{t \neq 1} \frac{g(t)}{f(t)} \le \frac{2}{\kappa}.\tag{10}$$

Observe that, after two applications of L'Hôpital's rule,

$$\lim\_{\varepsilon \to 0} \frac{g(1+\varepsilon)}{f(1+\varepsilon)} = \lim\_{\varepsilon \to 0} \frac{g'(1+\varepsilon)}{f'(1+\varepsilon)} = \frac{g''(1)}{f''(1)} = \frac{2}{\kappa} \le \sup\_{t \ne 1} \frac{g(t)}{f(t)}.$$

Thus, (9) follows.

**Proposition 2.** *Let $D_f$ be an $f$-divergence such that $f$ is $\kappa$-convex on $[a, b]$, and let $P_\theta$ and $Q_\theta$ be probability measures indexed by a set $\Theta$ such that $a \le \frac{dP_\theta}{dQ_\theta}(x) \le b$ holds for all $\theta$. If $P := \int_\Theta P_\theta \, d\mu(\theta)$ and $Q := \int_\Theta Q_\theta \, d\mu(\theta)$ for a probability measure $\mu$ on $\Theta$, then*

$$D\_f(P||Q) \le \int\_{\Theta} D\_f(P\_\theta||Q\_\theta) d\mu(\theta) - \frac{\kappa}{2} \int\_{\Theta} \int\_X \left(\frac{dP\_\theta}{dQ\_\theta} - \frac{dP}{dQ}\right)^2 dQ d\mu,\tag{11}$$

*In particular, when $Q_\theta = Q$ for all $\theta$,*

$$\begin{split} D\_f(\mathbb{P}||\mathcal{Q}) \\ \leq & \int\_{\Theta} D\_f(P\_{\theta}||\mathcal{Q}) d\mu(\theta) - \frac{\kappa}{2} \int\_{\Theta} \int\_{\mathcal{X}} \left( \frac{dP\_{\theta}}{d\mathcal{Q}} - \frac{dP}{d\mathcal{Q}} \right)^2 d\mathcal{Q} d\mu(\theta) \\ \leq & \int\_{\Theta} D\_f(P\_{\theta}||\mathcal{Q}) d\mu(\theta) - \kappa \int\_{\Theta} |P\_{\theta} - P|^2\_{TV} d\mu(\theta) \end{split} \tag{12}$$

**Proof.** Let $d\theta$ denote a reference measure dominating $\mu$, so that $d\mu = \varphi(\theta)d\theta$, and write $\nu_\theta = \nu(\theta, x) = \frac{dQ_\theta}{dQ}(x)\varphi(\theta)$. Then

$$\begin{split} D\_f(P||Q) &= \int\_{\mathcal{X}} f\left(\frac{dP}{dQ}\right) dQ \\ &= \int\_{\mathcal{X}} f\left(\int\_{\Theta} \frac{dP\_{\theta}}{dQ} d\mu(\theta)\right) dQ \\ &= \int\_{\mathcal{X}} f\left(\int\_{\Theta} \frac{dP\_{\theta}}{dQ\_{\theta}} \nu(\theta, \mathbf{x}) d\theta\right) dQ \end{split} \tag{13}$$

By Jensen's inequality, as in Lemma 1

$$f\left(\int_{\Theta} \frac{dP_\theta}{dQ_\theta} \nu_\theta \, d\theta\right) \leq \int_{\Theta} f\left(\frac{dP_\theta}{dQ_\theta}\right) \nu_\theta \, d\theta - \frac{\kappa}{2} \int_{\Theta} \left(\frac{dP_\theta}{dQ_\theta} - \int_{\Theta} \frac{dP_{\theta'}}{dQ_{\theta'}} \nu_{\theta'} \, d\theta'\right)^2 \nu_\theta \, d\theta.$$

*Entropy* **2020**, *22*, 1327

Integrating this inequality gives

$$D\_f(\boldsymbol{P}||\boldsymbol{Q}) \leq \int\_X \left( \int\_{\boldsymbol{\theta}} f \left( \frac{d\mathbf{P}\_{\boldsymbol{\theta}}}{d\mathbf{Q}\_{\boldsymbol{\theta}}} \right) \boldsymbol{\nu}\_{\boldsymbol{\theta}} d\boldsymbol{\theta} - \frac{\kappa}{2} \int\_{\boldsymbol{\Theta}} \left( \frac{d\mathbf{P}\_{\boldsymbol{\theta}}}{d\mathbf{Q}\_{\boldsymbol{\theta}}} - \int\_{\boldsymbol{\Theta}} \frac{d\mathbf{P}\_{\boldsymbol{\theta}}}{d\mathbf{Q}\_{\boldsymbol{\theta}}} \boldsymbol{\nu}\_{\boldsymbol{\theta}} d\boldsymbol{\theta} \right)^2 \boldsymbol{\nu}\_{\boldsymbol{\theta}} d\boldsymbol{\theta} \right) d\boldsymbol{Q} \tag{14}$$

Note that

$$\int_{X} \int_{\Theta} \left( \frac{dP_\theta}{dQ_\theta} - \int_{\Theta} \frac{dP_{\theta'}}{dQ_{\theta'}} \nu_{\theta'} \, d\theta' \right)^2 \nu_\theta \, d\theta \, dQ = \int_{\Theta} \int_{X} \left( \frac{dP_\theta}{dQ_\theta} - \frac{dP}{dQ} \right)^2 dQ \, d\mu(\theta)$$

and

$$\begin{split} \int_{X} \int_{\Theta} f \left( \frac{dP_\theta}{dQ_\theta} \right) \nu(\theta, x) \, d\theta \, dQ &= \int_{\Theta} \int_{X} f \left( \frac{dP_\theta}{dQ_\theta} \right) \nu(\theta, x) \, dQ \, d\theta \\ &= \int_{\Theta} \int_{X} f \left( \frac{dP_\theta}{dQ_\theta} \right) dQ_\theta \, d\mu(\theta) \\ &= \int_{\Theta} D_f(P_\theta \| Q_\theta) \, d\mu(\theta) \end{split} \tag{15}$$

Inserting these equalities into (14) gives the result.

To obtain the total variation bound, one need only apply Jensen's inequality,

$$\begin{split} \int\_{\mathcal{X}} \left( \frac{dP\_{\theta}}{dQ} - \frac{dP}{dQ} \right)^{2} dQ &\geq \left( \int\_{\mathcal{X}} \left| \frac{dP\_{\theta}}{dQ} - \frac{dP}{dQ} \right| dQ \right)^{2} \\ &= |P\_{\theta} - P|\_{TV}^{2} . \end{split} \tag{16}$$

Observe that, taking $Q = P = \int_\Theta P_\theta \, d\mu(\theta)$ in Proposition 2, one obtains a lower bound for the average $f$-divergence from a set of distributions to their barycenter in terms of the mean squared total variation of the set of distributions to the barycenter,

$$\kappa \int_{\Theta} |P_\theta - P|_{TV}^2 \, d\mu(\theta) \le \int_{\Theta} D_f(P_\theta \| P) \, d\mu(\theta). \tag{17}$$

An alternative proof of this can be obtained by applying $|P_\theta - P|_{TV}^2 \le D_f(P_\theta\|P)/\kappa$ from Theorem 2 pointwise.

The next result shows that, for $f$ strongly convex, Pinsker-type inequalities can never be reversed.

**Proposition 3.** *Given $f$ strongly convex and $M > 0$, there exist probability measures $P$, $Q$ such that*

$$D\_f(P||Q) \ge M|P - Q|\_{TV} \tag{18}$$

**Proof.** By $\kappa$-convexity, $\varphi(t) = f(t) - \kappa t^2/2$ is a convex function. Thus, $\varphi(t) \ge \varphi(1) + \varphi'_+(1)(t-1)$ with $\varphi'_+(1) = f'_+(1) - \kappa$, and hence $\lim_{t\to\infty} \frac{f(t)}{t} = \lim_{t\to\infty} \left( \frac{\varphi(t)}{t} + \frac{\kappa t}{2} \right) = \infty$. Taking measures on the two-point space, $P = \{1/2, 1/2\}$ and $Q = \{1/2t, 1 - 1/2t\}$ gives $D_f(P\|Q) \ge \frac{1}{2}\frac{f(t)}{t}$, which tends to infinity as $t \to \infty$, while $|P - Q|_{TV} \le 1$.

In fact, building on the work of Basu–Shioya–Park [24] and Vajda [25], Sason and Verdú proved in [6] that, for any $f$-divergence, $\sup_{P\ne Q} \frac{D_f(P\|Q)}{|P-Q|_{TV}} = f(0) + f^*(0)$. Thus, an $f$-divergence can be bounded above by a constant multiple of the total variation if and only if $f(0) + f^*(0) < \infty$. From this perspective, Proposition 3 is simply the obvious fact that strongly convex functions have super-linear (at least quadratic) growth at infinity.
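Proposition 3 can be seen concretely with $f(t) = (t-1)^2$ (so that $D_f = \chi^2$, a $2$-convex divergence): on the two-point space of the proof, the ratio $\chi^2/|P-Q|_{TV}$ grows without bound. A small numerical sketch (ours, with total variation taken as half the $L^1$ distance):

```python
import numpy as np

def chi2(p, q):
    """chi-square divergence on a finite alphabet."""
    return float(np.sum((p - q) ** 2 / q))

def tv(p, q):
    """Total variation as half the L1 distance."""
    return 0.5 * float(np.sum(np.abs(p - q)))

p = np.array([0.5, 0.5])
ratios = []
for t in [10.0, 100.0, 1000.0]:
    q = np.array([1 / (2 * t), 1 - 1 / (2 * t)])
    ratios.append(chi2(p, q) / tv(p, q))

# The ratio grows roughly linearly in t: no reverse Pinsker inequality holds.
assert ratios[0] < ratios[1] < ratios[2]
```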

#### **3. Skew Divergences**

If we denote by $Cvx(0,\infty)$ the quotient of the cone of convex functions $f$ on $(0,\infty)$ such that $f(1) = 0$ under the equivalence relation $f_1 \sim f_2$ when $f_1 - f_2 = c(x-1)$ for some $c \in \mathbb{R}$, then the map $f \mapsto D_f$ gives a linear isomorphism between $Cvx(0,\infty)$ and the space of all $f$-divergences. The mapping $\mathcal{T} : Cvx(0,\infty) \to Cvx(0,\infty)$ defined by $\mathcal{T}f = f^*$, where we recall $f^*(t) = t f(t^{-1})$, gives an involution of $Cvx(0,\infty)$. Indeed, $D_{\mathcal{T}f}(P\|Q) = D_f(Q\|P)$, so that $D_{\mathcal{T}(\mathcal{T}f)}(P\|Q) = D_f(P\|Q)$. Mathematically, skew divergences give an interpolation of this involution, as

$$(P, Q) \mapsto D_f((1-t)P + tQ \,\|\, (1-s)P + sQ)$$

gives $D_f(P\|Q)$ by taking $s = 1$ and $t = 0$, or yields $D_{f^*}(P\|Q)$ by taking $s = 0$ and $t = 1$.

Moreover, as mentioned in the Introduction, skewing imposes boundedness of the Radon–Nikodym derivative $\frac{dP}{dQ}$, which allows us to constrain the domain of $f$-divergences and leverage $\kappa$-convexity to obtain $f$-divergence inequalities in this section.

The following appears as Theorem III.1 in the preprint [26]. It states that skewing an $f$-divergence preserves its status as such. This guarantees that the generalized skew divergences of this section are indeed $f$-divergences. A proof is given in Appendix A for the convenience of the reader.

**Theorem 4** (Melbourne et al. [26])**.** *For $t, s \in [0,1]$ and a divergence $D_f$,*

$$S_f(P\|Q) := D_f((1-t)P + tQ \,\|\, (1-s)P + sQ) \tag{19}$$

*is an f -divergence as well.*

**Definition 4.** *For an f -divergence, its skew symmetrization,*

$$
\Delta_f(P\|Q) := \frac{1}{2} D_f\left(P \,\middle\|\, \frac{P+Q}{2}\right) + \frac{1}{2} D_f\left(Q \,\middle\|\, \frac{P+Q}{2}\right).
$$

$\Delta_f$ is determined by the convex function

$$x \mapsto \frac{1+x}{2}\left(f\left(\frac{2x}{1+x}\right) + f\left(\frac{2}{1+x}\right)\right). \tag{20}$$

Observe that $\Delta_f(P\|Q) = \Delta_f(Q\|P)$, and, when $f(0) < \infty$, $\Delta_f(P\|Q) \le \sup_{x\in[0,2]} f(x) < \infty$ for all $P, Q$, since $\frac{dP}{d(P+Q)/2}, \frac{dQ}{d(P+Q)/2} \le 2$. When $f(x) = x\log x$, the relative entropy's skew symmetrization is the Jensen–Shannon divergence. When $f(x) = (x-1)^2$, up to a normalization constant, the $\chi^2$-divergence's skew symmetrization is the Vincze–Le Cam divergence, which we state below for emphasis. The work of Topsøe [11] provides more background on this divergence, where it is referred to as the triangular discrimination.

**Definition 5.** *When $f(t) = \frac{(t-1)^2}{t+1}$, denote the Vincze–Le Cam divergence by $\Delta(P\|Q) := D_f(P\|Q)$.*

If one denotes the skew symmetrization of the $\chi^2$-divergence by $\Delta_{\chi^2}$, one can compute easily from (20) that $\Delta_{\chi^2}(P\|Q) = \Delta(P\|Q)/2$. We note that, although skewing preserves $0$-convexity, by the above example it does not preserve $\kappa$-convexity in general. The skew symmetrization of the $\chi^2$-divergence is a $2$-convex divergence, while $f(t) = (t-1)^2/(t+1)$, corresponding to the Vincze–Le Cam divergence, satisfies $f''(t) = \frac{8}{(t+1)^3}$, which cannot be bounded away from zero on $(0,\infty)$.
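The identity $\Delta_{\chi^2}(P\|Q) = \Delta(P\|Q)/2$ can be verified directly, since $\chi^2\left(P\middle\|\frac{P+Q}{2}\right) = \chi^2\left(Q\middle\|\frac{P+Q}{2}\right) = \Delta(P\|Q)/2$ pointwise. A numerical sketch (helper names ours):

```python
import numpy as np

rng = np.random.default_rng(6)

def chi2(p, q):
    """chi-square divergence on a finite alphabet."""
    return float(np.sum((p - q) ** 2 / q))

p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()
m = 0.5 * (p + q)  # the midpoint (P + Q)/2

skew_sym = 0.5 * chi2(p, m) + 0.5 * chi2(q, m)         # Definition 4 for chi^2
vincze_le_cam = float(np.sum((p - q) ** 2 / (p + q)))  # Definition 5
assert np.isclose(skew_sym, 0.5 * vincze_le_cam)
```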

**Corollary 2.** *For an $f$-divergence such that $f$ is $\kappa$-convex on $(0,2)$,*

$$
\Delta\_f(P||Q) \ge \frac{\kappa}{4} \Delta(P||Q) = \frac{\kappa}{2} \Delta\_{\chi^2}(P||Q), \tag{21}
$$

*with equality when $f(t) = (t-1)^2$, corresponding to the $\chi^2$-divergence, where $\Delta_f$ denotes the skew symmetrized divergence associated to $f$ and $\Delta$ is the Vincze–Le Cam divergence.*

**Proof.** Applying Proposition 2,

$$\begin{split} 0 &= D\_f \left( \frac{P+Q}{2} \middle| \middle| \frac{Q+P}{2} \right) \\ &\leq \frac{1}{2} D\_f \left( P \middle| \middle| \frac{Q+P}{2} \right) + \frac{1}{2} D\_f \left( Q \middle| \middle| \frac{Q+P}{2} \right) - \frac{\kappa}{8} \int \left( \frac{2P}{P+Q} - \frac{2Q}{P+Q} \right)^2 d(P+Q)/2 \\ &= \Delta\_f(P||Q) - \frac{\kappa}{4} \Delta(P||Q). \end{split}$$

When $f(x) = x\log x$, we have $f''(x) \ge \frac{\log e}{2}$ on $[0,2]$, which demonstrates that, up to a constant $\frac{\log e}{8}$, the Jensen–Shannon divergence bounds the Vincze–Le Cam divergence (see [11] for an improvement of the inequality in the case of the Jensen–Shannon divergence, called the "capacitory discrimination" in the reference, by a factor of 2).
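In nats, the above reads $f''(x) = 1/x \ge 1/2$ on $(0,2]$, so Corollary 2 gives $\mathrm{JSD} \ge \Delta/8$. A numerical sketch of this bound (not from the paper; helper names ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    """Relative entropy in nats."""
    return float(np.sum(p * np.log(p / q)))

p = rng.random(8); p /= p.sum()
q = rng.random(8); q /= q.sum()
m = 0.5 * (p + q)

jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)          # Jensen-Shannon divergence
delta = float(np.sum((p - q) ** 2 / (p + q)))  # Vincze-Le Cam divergence
assert jsd >= delta / 8   # Corollary 2 with kappa = 1/2 (nats)
```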

We now investigate more general, non-symmetric skewing in what follows.

**Proposition 4.** *For $\alpha, \beta \in [0,1]$, define*

$$C(\alpha) := \begin{cases} 1-\alpha & \text{when } \alpha \le \beta \\ \alpha & \text{when } \alpha > \beta \end{cases} \tag{22}$$

*and*

$$S_{\alpha,\beta}(P\|Q) := D((1-\alpha)P + \alpha Q \,\|\, (1-\beta)P + \beta Q). \tag{23}$$

*Then,*

$$S_{\alpha,\beta}(P\|Q) \le C(\alpha)\, D_\infty(\alpha\|\beta)\, |P - Q|_{TV} \tag{24}$$

*where $D_\infty(\alpha\|\beta) := \log\max\left\{\frac{\alpha}{\beta}, \frac{1-\alpha}{1-\beta}\right\}$ is the binary $\infty$-Rényi divergence [27].*

We need the following lemma, originally proved by Audenaert in the quantum setting [28]. It is based on a differential relationship between the skew divergence [12] and that of [15] (see [29,30]).

**Lemma 3** (Theorem III.1 [26])**.** *For P and Q probability measures and t* ∈ [0, 1]*,*

$$S_{0,t}(P\|Q) \le -\log t\, |P - Q|_{TV}. \tag{25}$$

**Proof of Proposition 4.** If $\alpha \le \beta$, then $D_\infty(\alpha\|\beta) = \log\frac{1-\alpha}{1-\beta}$ and $C(\alpha) = 1-\alpha$. In addition,

$$(1-\beta)P + \beta Q = t\left((1-\alpha)P + \alpha Q\right) + (1-t)Q \tag{26}$$

with $t = \frac{1-\beta}{1-\alpha}$; thus,

$$\begin{split} S_{\alpha,\beta}(P\|Q) &= S_{0,t}((1-\alpha)P + \alpha Q \,\|\, Q) \\ &\leq (-\log t)\left| ((1-\alpha)P + \alpha Q) - Q \right|_{TV} \\ &= C(\alpha)\, D_\infty(\alpha\|\beta)\, |P - Q|_{TV} \end{split} \tag{27}$$

where the inequality follows from Lemma 3. Following the same argument for $\alpha > \beta$, so that $C(\alpha) = \alpha$, $D_\infty(\alpha\|\beta) = \log\frac{\alpha}{\beta}$, and

$$(1-\beta)P + \beta Q = t\left((1-\alpha)P + \alpha Q\right) + (1-t)P \tag{28}$$

for *t* = *β α* completes the proof. Indeed,

$$\begin{split} S_{\alpha,\beta}(P\|Q) &= S_{0,t}((1-\alpha)P + \alpha Q \,\|\, P) \\ &\leq -\log t\, \left| ((1-\alpha)P + \alpha Q) - P \right|_{TV} \\ &= C(\alpha)\, D_\infty(\alpha\|\beta)\, |P - Q|_{TV}. \end{split} \tag{29}$$

We recover the classical bound [11,16] of the Jensen–Shannon divergence by the total variation.

**Corollary 3.** *For probability measures $P$ and $Q$,*

$$\mathrm{JSD}(P\|Q) \le \log 2\, |P - Q|_{TV}. \tag{30}$$

**Proof.** Since $\mathrm{JSD}(P\|Q) = \frac{1}{2}S_{0,\frac{1}{2}}(P\|Q) + \frac{1}{2}S_{1,\frac{1}{2}}(P\|Q)$, the result follows from Proposition 4.
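Corollary 3 can be checked numerically; in nats the bound reads $\mathrm{JSD}(P\|Q) \le (\ln 2)\,|P-Q|_{TV}$, with total variation taken as half the $L^1$ distance. A sketch (helper names ours):

```python
import numpy as np

rng = np.random.default_rng(7)

def kl(p, q):
    """Relative entropy in nats."""
    return float(np.sum(p * np.log(p / q)))

p = rng.random(10); p /= p.sum()
q = rng.random(10); q /= q.sum()
m = 0.5 * (p + q)

jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
tv = 0.5 * float(np.sum(np.abs(p - q)))
assert jsd <= np.log(2) * tv   # Corollary 3, in nats
```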

Proposition 4 gives a sharpening of Lemma 1 of Nielsen [17], who proved $S_{\alpha,\beta}(P\|Q) \le D_\infty(\alpha\|\beta)$, and used the result to establish the boundedness of a generalization of the Jensen–Shannon divergence.

**Definition 6** (Nielsen [17])**.** *For $p$ and $q$ densities with respect to a reference measure $\mu$, $w_i > 0$ such that $\sum_{i=1}^n w_i = 1$, and $\alpha_i \in [0,1]$, define*

$$JS^{\alpha,w}(p:q) = \sum_{i=1}^n w_i\, D((1-\alpha_i)p + \alpha_i q \,\|\, (1-\bar{\alpha})p + \bar{\alpha}q) \tag{31}$$

*where $\bar{\alpha} := \sum_{i=1}^n w_i \alpha_i$.*

Note that, when $n = 2$, $\alpha_1 = 1$, $\alpha_2 = 0$ and $w_i = \frac{1}{2}$, $JS^{\alpha,w}(p:q) = \mathrm{JSD}(p\|q)$, the usual Jensen–Shannon divergence. We now demonstrate that Nielsen's generalized Jensen–Shannon divergence can be bounded by the total variation distance just as the ordinary Jensen–Shannon divergence can.

**Theorem 5.** *For $p$ and $q$ densities with respect to a reference measure $\mu$, $w_i > 0$ such that $\sum_{i=1}^n w_i = 1$, and $\alpha_i \in (0,1)$,*

$$\log e\, \mathrm{Var}_w(\alpha)\, |p - q|_{TV}^2 \le JS^{\alpha,w}(p:q) \le \mathcal{A}\, H(w)\, |p - q|_{TV} \tag{32}$$

*where $H(w) := -\sum_i w_i \log w_i \ge 0$ and $\mathcal{A} = \max_i |\alpha_i - \bar{\alpha}_i|$ with $\bar{\alpha}_i = \sum_{j \ne i} \frac{w_j \alpha_j}{1-w_i}$.*

Note that, since $\bar{\alpha}_i$ is the $w$-average of the $\alpha_j$ terms with $\alpha_i$ removed, $\bar{\alpha}_i \in [0,1]$ and thus $\mathcal{A} \le 1$. We need the following theorem from Melbourne et al. [26] for the upper bound.

**Theorem 6** ([26] Theorem 1.1)**.** *For $f_i$ densities with respect to a common reference measure $\gamma$ and $\lambda_i > 0$ such that $\sum_{i=1}^n \lambda_i = 1$,*

$$h_\gamma\left(\sum_i \lambda_i f_i\right) - \sum_i \lambda_i h_\gamma(f_i) \le \mathcal{T} H(\lambda), \tag{33}$$

*where $h_\gamma(f_i) := -\int f_i(x)\log f_i(x)\, d\gamma(x)$ and $\mathcal{T} = \sup_i |f_i - \tilde{f}_i|_{TV}$ with $\tilde{f}_i = \sum_{j\ne i} \frac{\lambda_j}{1-\lambda_i} f_j$.*

**Proof of Theorem 5.** We apply Theorem 6 with $f_i = (1-\alpha_i)p + \alpha_i q$ and $\lambda_i = w_i$. Noticing that, in general,

$$h_\gamma\left(\sum_i \lambda_i f_i\right) - \sum_i \lambda_i h_\gamma(f_i) = \sum_i \lambda_i D(f_i \| f), \quad f := \sum_i \lambda_i f_i, \tag{34}$$

we have

$$\begin{split} JS^{\alpha,w}(p:q) &= \sum_{i=1}^n w_i D((1-\alpha_i)p + \alpha_i q \,\|\, (1-\bar{\alpha})p + \bar{\alpha}q) \\ &\le \mathcal{T} H(w). \end{split} \tag{35}$$

It remains to determine $\mathcal{T} = \max_i |f_i - \tilde{f}_i|_{TV}$:

$$\begin{split} \tilde{f}_i - f_i &= \frac{f - f_i}{1-\lambda_i} \\ &= \frac{((1-\bar{\alpha})p + \bar{\alpha}q) - ((1-\alpha_i)p + \alpha_i q)}{1-w_i} \\ &= \frac{(\alpha_i - \bar{\alpha})(p - q)}{1-w_i} \\ &= (\alpha_i - \bar{\alpha}_i)(p - q). \end{split} \tag{36}$$

Thus, $\mathcal{T} = \max_i |\alpha_i - \bar{\alpha}_i|\, |p-q|_{TV} = \mathcal{A}\, |p-q|_{TV}$, and the proof of the upper bound is complete.

To prove the lower bound, we apply Pinsker's inequality, $2\log e\, |P-Q|_{TV}^2 \le D(P\|Q)$:

$$\begin{split} JS^{\alpha,w}(p:q) &= \sum_{i=1}^n w_i D((1-\alpha_i)p + \alpha_i q \,\|\, (1-\bar{\alpha})p + \bar{\alpha}q) \\ &\geq \frac{1}{2}\sum_{i=1}^n w_i\, 2\log e \left| ((1-\alpha_i)p + \alpha_i q) - ((1-\bar{\alpha})p + \bar{\alpha}q) \right|_{TV}^2 \\ &= \log e \sum_{i=1}^n w_i (\alpha_i - \bar{\alpha})^2 |p - q|_{TV}^2 \\ &= \log e\, \mathrm{Var}_w(\alpha)\, |p - q|_{TV}^2. \end{split} \tag{37}$$
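Both bounds of Theorem 5 (in nats) can be sketched numerically for a case with $n = 3$; the weights, skewing parameters, and helper names below are our own choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    """Relative entropy in nats."""
    return float(np.sum(p * np.log(p / q)))

def tv(p, q):
    """Total variation as half the L1 distance."""
    return 0.5 * float(np.sum(np.abs(p - q)))

p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()
w = np.array([0.2, 0.3, 0.5])
alpha = np.array([0.1, 0.5, 0.9])
abar = float(np.sum(w * alpha))

mix = lambda a: (1 - a) * p + a * q
js = sum(wi * kl(mix(ai), mix(abar)) for wi, ai in zip(w, alpha))

var_w = float(np.sum(w * (alpha - abar) ** 2))  # Var_w(alpha)
H_w = float(-np.sum(w * np.log(w)))             # H(w) in nats
abar_i = (abar - w * alpha) / (1 - w)           # leave-one-out averages
A = float(np.max(np.abs(alpha - abar_i)))       # the constant A of Theorem 5

lower = var_w * tv(p, q) ** 2
upper = A * H_w * tv(p, q)
assert lower <= js <= upper   # Theorem 5, Equation (32), in nats
```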

**Definition 7.** *Given an $f$-divergence, densities $p$ and $q$ with respect to a common reference measure, $\alpha \in [0,1]^n$ and $w \in (0,1)^n$ such that $\sum_i w_i = 1$, define its generalized skew divergence*

$$D_f^{\alpha,w}(p:q) = \sum_{i=1}^n w_i D_f((1-\alpha_i)p + \alpha_i q \,\|\, (1-\bar{\alpha})p + \bar{\alpha}q) \tag{38}$$

*where $\bar{\alpha} = \sum_i w_i \alpha_i$.*

Note that, by Theorem 4, $D_f^{\alpha,w}$ is an $f$-divergence. The generalized skew divergence of the relative entropy is the generalized Jensen–Shannon divergence $JS^{\alpha,w}$. We denote the generalized skew divergence of the $\chi^2$-divergence from $p$ to $q$ by

$$\chi^2_{\alpha,w}(p:q) := \sum_i w_i \chi^2((1-\alpha_i)p + \alpha_i q \,\|\, (1-\bar{\alpha})p + \bar{\alpha}q). \tag{39}$$

Note that, when $n = 2$, $\alpha_1 = 0$, $\alpha_2 = 1$ and $w_i = \frac{1}{2}$, we recover the skew symmetrized divergence in Definition 4:

$$D\_f^{(0,1),(1/2,1/2)}(p:q) = \Delta\_f(p||q) \tag{40}$$

The following theorem shows that the usual upper bound for the relative entropy by the *χ* 2 -divergence can be reversed up to a factor in the skewed case.

**Theorem 7.** *For $p$ and $q$ with a common dominating measure $\mu$,*

$$
\chi^2_{\alpha,w}(p:q) \le 2N_\infty(\alpha, w)\, JS^{\alpha,w}(p:q).
$$

For $\alpha \in [0,1]^n$ and $w \in (0,1)^n$ such that $\sum_i w_i = 1$, we use the notation $N_\infty(\alpha, w) := \max_i e^{D_\infty(\alpha_i\|\bar{\alpha})} = \max_i \max\left\{\frac{1-\alpha_i}{1-\bar{\alpha}}, \frac{\alpha_i}{\bar{\alpha}}\right\}$, where $\bar{\alpha} := \sum_i w_i \alpha_i$.

**Proof.** By definition,

$$JS^{\alpha,w}(p:q) = \sum_{i=1}^n w_i D((1-\alpha_i)p + \alpha_i q \,\|\, (1-\bar{\alpha})p + \bar{\alpha}q).$$

Taking $P_i$ to be the measure associated to $(1-\alpha_i)p + \alpha_i q$ and $Q$ given by $(1-\bar{\alpha})p + \bar{\alpha}q$, then

$$\frac{dP_i}{dQ} = \frac{(1-\alpha_i)p + \alpha_i q}{(1-\bar{\alpha})p + \bar{\alpha}q} \le \max\left\{\frac{1-\alpha_i}{1-\bar{\alpha}}, \frac{\alpha_i}{\bar{\alpha}}\right\} = e^{D_\infty(\alpha_i\|\bar{\alpha})} \le N_\infty(\alpha, w). \tag{41}$$

Since $f(x) = x\log x$, the convex function associated to the usual relative entropy, satisfies $f''(x) = \frac{1}{x}$, $f$ is $N_\infty(\alpha,w)^{-1}$-convex on $[0, \sup_{x,i} \frac{dP_i}{dQ}(x)]$; applying Proposition 2, we obtain

$$D\left(\sum_i w_i P_i \,\middle\|\, Q\right) \le \sum_i w_i D(P_i\|Q) - \frac{\sum_i w_i \int_X \left(\frac{dP_i}{dQ} - \frac{dP}{dQ}\right)^2 dQ}{2N_\infty(\alpha, w)}. \tag{42}$$

Since $Q = \sum_i w_i P_i$, the left-hand side of (42) is zero, while

$$\begin{split} \sum_i w_i \int_X \left(\frac{dP_i}{dQ} - \frac{dP}{dQ}\right)^2 dQ &= \sum_i w_i \int_X \left(\frac{dP_i}{dP} - 1\right)^2 dP \\ &= \sum_i w_i \chi^2(P_i\|P) \\ &= \chi^2_{\alpha,w}(p:q). \end{split} \tag{43}$$

Rearranging gives,

$$\frac{\chi^2_{\alpha,w}(p:q)}{2N_\infty(\alpha, w)} \le JS^{\alpha,w}(p:q), \tag{44}$$

which is our conclusion.
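Theorem 7 can also be sketched numerically (in nats, following (44)); the parameter choices and helper names below are ours:

```python
import numpy as np

rng = np.random.default_rng(3)

def kl(p, q):
    """Relative entropy in nats."""
    return float(np.sum(p * np.log(p / q)))

def chi2(p, q):
    """chi-square divergence on a finite alphabet."""
    return float(np.sum((p - q) ** 2 / q))

p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()
w = np.array([0.25, 0.25, 0.5])
alpha = np.array([0.0, 0.4, 1.0])
abar = float(np.sum(w * alpha))

mix = lambda a: (1 - a) * p + a * q
js = sum(wi * kl(mix(ai), mix(abar)) for wi, ai in zip(w, alpha))
chi2_skew = sum(wi * chi2(mix(ai), mix(abar)) for wi, ai in zip(w, alpha))

# N_infty(alpha, w) = max_i exp(D_infty(alpha_i || abar))
N_inf = max(max((1 - ai) / (1 - abar), ai / abar) for ai in alpha)
assert chi2_skew <= 2 * N_inf * js   # Theorem 7 / Equation (44)
```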

#### **4. Total Variation Bounds and Bayes Risk**

In this section, we derive bounds on the Bayes risk associated to a family of probability measures with a prior distribution $\lambda$. Let us state definitions and recall basic relationships. Given probability densities $\{p_i\}_{i=1}^n$ on a space $X$ with respect to a reference measure $\mu$, and $\lambda_i \ge 0$ such that $\sum_{i=1}^n \lambda_i = 1$, define the Bayes risk

$$R := R_\lambda(p) := 1 - \int_X \max_i \{\lambda_i p_i(x)\}\, d\mu(x). \tag{45}$$

Let $\ell(x,y) = 1 - \delta_x(y)$, and define $T(x) := \arg\max_i \lambda_i p_i(x)$; then observe that this definition is consistent with the usual definition of the Bayes risk associated to the loss function $\ell$. Below, we consider $\theta$ to be a random variable on $\{1, 2, \dots, n\}$ such that $\mathbb{P}(\theta = i) = \lambda_i$, and $X$ to be a random variable with conditional distribution $\mathbb{P}(X \in A \,|\, \theta = i) = \int_A p_i(x)\, d\mu(x)$. The following result shows that the Bayes risk gives the probability of the categorization error under an optimal estimator.

**Proposition 5.** *The Bayes risk satisfies*

$$R = \min_{\hat{\theta}} \mathbb{E}\ell(\theta, \hat{\theta}(X)) = \mathbb{E}\ell(\theta, T(X))$$

*where the minimum is taken over all $\hat{\theta} : X \to \{1, 2, \dots, n\}$.*

**Proof.** Observe that $R = 1 - \int_X \lambda_{T(x)} p_{T(x)}(x)\, d\mu(x) = \mathbb{E}\ell(\theta, T(X))$. Similarly,

$$\begin{aligned} \mathbb{E}\ell(\theta, \hat{\theta}(X)) &= 1 - \int_X \lambda_{\hat{\theta}(x)} p_{\hat{\theta}(x)}(x)\, d\mu(x) \\ &\geq 1 - \int_X \lambda_{T(x)} p_{T(x)}(x)\, d\mu(x) = R, \end{aligned}$$

which gives our conclusion.

It is known (see, for example, [9,31]) that the Bayes risk can also be tied directly to the total variation in the following special case, whose proof we include for completeness.

**Proposition 6.** *When $n = 2$ and $\lambda_1 = \lambda_2 = \frac{1}{2}$, the Bayes risk associated to the densities $p_1$ and $p_2$ satisfies*

$$2R = 1 - |p_1 - p_2|_{TV}. \tag{46}$$

**Proof.** Since $p_T = \frac{|p_1 - p_2| + p_1 + p_2}{2}$, integrating gives $\int_X p_T(x)\, d\mu(x) = |p_1 - p_2|_{TV} + 1$, from which the equality follows.
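Proposition 6 is an exact identity and is easy to confirm numerically (total variation as half the $L^1$ distance; names ours):

```python
import numpy as np

rng = np.random.default_rng(4)

p1 = rng.random(6); p1 /= p1.sum()
p2 = rng.random(6); p2 /= p2.sum()

# Bayes risk (45) for n = 2 with the uniform prior lambda_1 = lambda_2 = 1/2
R = 1.0 - float(np.sum(np.maximum(0.5 * p1, 0.5 * p2)))
tv = 0.5 * float(np.sum(np.abs(p1 - p2)))
assert np.isclose(2 * R, 1 - tv)   # Proposition 6
```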

Information-theoretic bounds to control the Bayes and minimax risk have an extensive literature (see, for example, [9,32–35]). Fano's inequality is the seminal result in this direction, and we direct the reader to a survey of such techniques in statistical estimation (see [36]). What follows can be understood as a sharpening of the work of Guntuboyina [9] under the assumption of $\kappa$-convexity.

The function $T(x) = \arg\max_i\{\lambda_i p_i(x)\}$ induces the following convex decompositions of our densities. The density $q$ can be realized as a convex combination of $q_1 = \frac{\lambda_T q}{1-Q}$, where $Q := 1 - \int \lambda_T q\, d\mu$, and $q_2 = \frac{(1-\lambda_T)q}{Q}$,

$$q = (1 - Q)q\_1 + Qq\_2.$$

If we take $p := \sum_i \lambda_i p_i$, then $p$ can be decomposed with $\rho_1 = \frac{\lambda_T p_T}{1-R}$ and $\rho_2 = \frac{p - \lambda_T p_T}{R}$, so that

$$p = (1 - R)\rho\_1 + R\rho\_2.$$

**Theorem 8.** *When $f$ is $\kappa$-convex on $(a,b)$, with $a = \inf_{i,x} \frac{p_i(x)}{q(x)}$ and $b = \sup_{i,x} \frac{p_i(x)}{q(x)}$,*

$$\sum_i \lambda_i D_f(p_i\|q) \ge D_f(R\|Q) + \frac{\kappa W}{2}$$

*where*

$$W := W(\lambda_i, p_i, q) := \frac{(1-R)^2}{1-Q}\chi^2(\rho_1\|q_1) + \frac{R^2}{Q}\chi^2(\rho_2\|q_2) + W_0$$

*for W*<sup>0</sup> ≥ 0*.*


$W_0$ can be expressed explicitly as

$$W_0 = \int (1-\lambda_T)\, \mathrm{Var}_{\lambda_i \ne T}\left(\frac{p_i}{q}\right) q\, d\mu = \int \sum_{i\ne T} \lambda_i \frac{\left|p_i - \sum_{j\ne T} \frac{\lambda_j}{1-\lambda_T} p_j\right|^2}{q}\, d\mu,$$

where, for fixed $x$, we consider $\mathrm{Var}_{\lambda_i \ne T}\left(\frac{p_i}{q}\right)$ to be the variance of a random variable taking values $p_i(x)/q(x)$ with probability $\lambda_i/(1-\lambda_{T(x)})$ for $i \ne T(x)$. Note that this term is non-zero only when $n > 2$.

**Proof.** For fixed $x$, we apply Lemma 1:

$$\begin{split} \sum_i \lambda_i f\left(\frac{p_i}{q}\right) &= \lambda_T f\left(\frac{p_T}{q}\right) + (1-\lambda_T) \sum_{i\ne T} \frac{\lambda_i}{1-\lambda_T} f\left(\frac{p_i}{q}\right) \\ &\geq \lambda_T f\left(\frac{p_T}{q}\right) + (1-\lambda_T)\left[ f\left(\frac{p - \lambda_T p_T}{q(1-\lambda_T)}\right) + \frac{\kappa}{2}\, \mathrm{Var}_{\lambda_i \ne T}\left(\frac{p_i}{q}\right) \right] \end{split} \tag{47}$$

Integrating,

$$\sum_i \lambda_i D_f(p_i\|q) \ge \int \lambda_T f\left(\frac{p_T}{q}\right) q + \int (1-\lambda_T) f\left(\frac{p - \lambda_T p_T}{q(1-\lambda_T)}\right) q + \frac{\kappa}{2} W_0, \tag{48}$$

where

$$W_0 = \int \sum_{i \ne T(x)} \lambda_i \frac{\left|p_i - \sum_{j\ne T} \frac{\lambda_j}{1-\lambda_T} p_j\right|^2}{q}\, d\mu. \tag{49}$$

Applying the *κ*-convexity of *f* ,

$$\begin{split} \int \lambda_T f\left(\frac{p_T}{q}\right) q &= (1-Q)\int q_1 f\left(\frac{p_T}{q}\right) \\ &\geq (1-Q)\left( f\left(\frac{\int \lambda_T p_T}{1-Q}\right) + \frac{\kappa}{2}\mathrm{Var}_{q_1}\left(\frac{p_T}{q}\right)\right) \\ &= (1-Q)f\left(\frac{1-R}{1-Q}\right) + \frac{(1-Q)\kappa}{2} W_1, \end{split} \tag{50}$$

with

$$\begin{aligned} W\_1 &:= \text{Var}\_{q\_1} \left( \frac{p\_T}{q} \right) \\ &= \left( \frac{1 - R}{1 - Q} \right)^2 \text{Var}\_{q\_1} \left( \frac{\lambda\_T p\_T}{\lambda\_T q} \frac{1 - Q}{1 - R} \right) \\ &= \left( \frac{1 - R}{1 - Q} \right)^2 \text{Var}\_{q\_1} \left( \frac{\rho\_1}{q\_1} \right) \\ &= \left( \frac{1 - R}{1 - Q} \right)^2 \chi^2(\rho\_1 || q\_1) \end{aligned} \tag{51}$$


Similarly,

$$\begin{split} \int (1-\lambda_T) f\left(\frac{p - \lambda_T p_T}{q(1-\lambda_T)}\right) q &= Q \int q_2 f\left(\frac{p - \lambda_T p_T}{q(1-\lambda_T)}\right) \\ &\ge Q f\left(\int q_2 \frac{p - \lambda_T p_T}{q(1-\lambda_T)}\right) + \frac{Q\kappa}{2} W_2 \\ &= Q f\left(\frac{R}{Q}\right) + \frac{Q\kappa}{2} W_2, \end{split} \tag{52}$$

where

$$\begin{split} W_2 &:= \mathrm{Var}_{q_2}\left(\frac{p - \lambda_T p_T}{q(1-\lambda_T)}\right) \\ &= \left(\frac{R}{Q}\right)^2 \mathrm{Var}_{q_2}\left(\frac{p - \lambda_T p_T}{q(1-\lambda_T)}\frac{Q}{R}\right) \\ &= \left(\frac{R}{Q}\right)^2 \int q_2\left(\frac{\rho_2}{q_2} - 1\right)^2 \\ &= \left(\frac{R}{Q}\right)^2 \chi^2(\rho_2\|q_2). \end{split} \tag{53}$$

Writing $W = W_0 + (1-Q)W_1 + QW_2$, we have our result.

**Corollary 4.** *When λ<sup>i</sup>* = <sup>1</sup> *n , and f is κ-convex on* (inf*i*,*<sup>x</sup> pi*/*q*, sup*i*,*<sup>x</sup> pi*/*q*)

$$\frac{1}{n}\sum_i D_f(p_i\|q) \ge D_f(R\|(n-1)/n) + \frac{\kappa}{2}\left(n(1-R)^2\chi^2(\rho_1\|q) + \frac{nR^2}{n-1}\chi^2(\rho_2\|q) + W_0\right) \tag{54}$$

*further, when $n = 2$,*

$$\begin{split} \frac{D_f(p_1\|q) + D_f(p_2\|q)}{2} &\geq D_f\left(\frac{1-|p_1-p_2|_{TV}}{2}\,\middle\|\,\frac{1}{2}\right) \\ &\quad + \frac{\kappa}{4}\left((1+|p_1-p_2|_{TV})^2\chi^2(\rho_1\|q) + (1-|p_1-p_2|_{TV})^2\chi^2(\rho_2\|q)\right). \end{split} \tag{55}$$

**Proof.** Note that $q_1 = q_2 = q$, since $\lambda_i = \frac{1}{n}$ implies $\lambda_T = \frac{1}{n}$ as well. In addition, $Q = 1 - \int \lambda_T q\, d\mu = \frac{n-1}{n}$, so that applying Theorem 8 gives

$$\sum_{i=1}^n D_f(p_i\|q) \ge n D_f(R\|(n-1)/n) + \frac{\kappa n W(\lambda_i, p_i, q)}{2}. \tag{56}$$

The term *W* can be simplified as well. In the notation of the proof of Theorem 8,

$$\begin{aligned} W_1 &= n^2(1-R)^2\chi^2(\rho_1\|q), \\ W_2 &= \left(\frac{nR}{n-1}\right)^2\chi^2(\rho_2\|q), \\ W_0 &= \int \frac{\frac{1}{n}\sum_{i\ne T}\left(p_i - \frac{1}{n-1}\sum_{j\ne T}p_j\right)^2}{q}\, d\mu. \end{aligned} \tag{57}$$

For the special case, one need only recall $R = \frac{1-|p_1-p_2|_{TV}}{2}$ while inserting $n = 2$.

**Corollary 5.** *When $p_i \le q/t^*$ for $t^* > 0$, and $f(x) = x\log x$,*

$$\sum_i \lambda_i D(p_i\|q) \ge D(R\|Q) + \frac{t^* W(\lambda_i, p_i, q)}{2}$$

*where $D(p_i\|q)$ denotes the relative entropy. In particular,*

$$\sum\_{i} \lambda\_i D(p\_i||q) \ge D(p||q) + D(R||P) + \frac{t^\* \mathcal{W}(\lambda\_{i\prime} p\_{i\prime} p)}{2}$$

*where P* = 1 − R *λ<sup>T</sup> pdµ for p* = ∑*<sup>i</sup> λ<sup>i</sup> p<sup>i</sup> and t*<sup>∗</sup> = min *λ<sup>i</sup> .*

**Proof.** For the relative entropy, $f(x) = x \log x$ is $\frac{1}{M}$-convex on $[0, M]$ since $f''(x) = 1/x$. When $p\_i \leq q/t^{\*}$ holds for all $i$, we can apply Theorem 8 with $M = \frac{1}{t^{\*}}$. For the second inequality, recall the compensation identity, $\sum\_i \lambda\_i D(p\_i||q) = \sum\_i \lambda\_i D(p\_i||p) + D(p||q)$, and apply the first inequality to $\sum\_i \lambda\_i D(p\_i||p)$ for the result.

This gives an upper bound on the Jensen–Shannon divergence, defined as $\text{JSD}(\mu||\nu) = \frac{1}{2}D(\mu||\frac{\mu+\nu}{2}) + \frac{1}{2}D(\nu||\frac{\mu+\nu}{2})$. Let us also note that, through the compensation identity $\sum\_i \lambda\_i D(p\_i||q) = \sum\_i \lambda\_i D(p\_i||p) + D(p||q)$, we have $\sum\_i \lambda\_i D(p\_i||q) \geq \sum\_i \lambda\_i D(p\_i||p)$, where $p = \sum\_i \lambda\_i p\_i$. In the case that $\lambda\_i = \frac{1}{n}$,

$$\begin{aligned} \sum\_{i} \lambda\_i D(p\_i || q) &\geq \sum\_{i} \lambda\_i D(p\_i || p) \\ &\geq Qf\left(\frac{1-R}{Q}\right) + (1-Q)f\left(\frac{R}{1-Q}\right) + \frac{t^\* W}{2}. \end{aligned} \tag{58}$$

**Corollary 6.** *For two densities $p\_1$ and $p\_2$, the Jensen–Shannon divergence satisfies the following,*

$$\begin{split} \text{JSD}(p\_1||p\_2) \geq\ & D\left(\frac{1-|p\_1-p\_2|\_{TV}}{2} \,\middle\|\, \frac{1}{2}\right) \\ &+ \frac{1}{4} \left( (1+|p\_1-p\_2|\_{TV})^2 \chi^2(\rho\_1||p) + (1-|p\_1-p\_2|\_{TV})^2 \chi^2(\rho\_2||p) \right) \end{split} \tag{59}$$

*with $\rho\_i$ defined above and $p = p\_1/2 + p\_2/2$.*

**Proof.** Since $\frac{p\_i}{(p\_1+p\_2)/2} \leq 2$ and $f(x) = x \log x$ satisfies $f''(x) \geq \frac{1}{2}$ on $(0, 2)$, taking $q = \frac{p\_1+p\_2}{2}$ in the $n = 2$ case of Corollary 4 with $\kappa = \frac{1}{2}$ yields the result.

Noting that $2D((1 + V)/2||1/2) = (1 + V)\log(1 + V) + (1 - V)\log(1 - V) \geq V^2 \log e$, we obtain the further bound

$$\text{JSD}(p\_1||p\_2) \ge \frac{\log e}{2}V^2 + \frac{(1+V)^2 \chi^2(\rho\_1||p) + (1-V)^2 \chi^2(\rho\_2||p)}{4},\tag{60}$$

can be obtained for *V* = |*p*<sup>1</sup> − *p*2|*TV*.
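As a numerical sanity check of (60), the sketch below evaluates both sides for a pair of two-point distributions. Here $\rho\_1 = \max(p\_1, p\_2)/(1+V)$ and $\rho\_2 = \min(p\_1, p\_2)/(1-V)$, the $\rho$-decomposition of $p$ induced by $T(x) = \arg\max\{p\_1(x), p\_2(x)\}$; natural logarithms are used throughout, so $\log e = 1$. The particular distributions are illustrative choices.

```python
import math

def kl(a, b):
    """Relative entropy D(a||b) in nats for discrete distributions."""
    return sum(x * math.log(x / y) for x, y in zip(a, b))

def chi2(a, b):
    """chi-squared divergence chi^2(a||b) = sum_k (a_k - b_k)^2 / b_k."""
    return sum((x - y) ** 2 / y for x, y in zip(a, b))

p1 = [0.5, 0.5]
p2 = [0.9, 0.1]

V = 0.5 * sum(abs(x - y) for x, y in zip(p1, p2))   # |p1 - p2|_TV = 0.4
p = [(x + y) / 2 for x, y in zip(p1, p2)]
jsd = (kl(p1, p) + kl(p2, p)) / 2

# rho-decomposition of p induced by T(x) = argmax{p1(x), p2(x)}:
# p = ((1 + V)/2) * rho1 + ((1 - V)/2) * rho2
rho1 = [max(x, y) / (1 + V) for x, y in zip(p1, p2)]
rho2 = [min(x, y) / (1 - V) for x, y in zip(p1, p2)]

# right-hand side of (60); natural logarithms, so log e = 1
bound = V ** 2 / 2 + ((1 + V) ** 2 * chi2(rho1, p)
                      + (1 - V) ** 2 * chi2(rho2, p)) / 4
assert jsd >= bound
```

By Corollary 6, the same check should pass for any pair of strictly positive distributions on a finite alphabet.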

*On Topsøe's Sharpening of Pinsker's Inequality*

For $P\_i$, $Q$ probability measures with densities $p\_i$ and $q$ with respect to a common reference measure, and $\sum\_{i=1}^{n} t\_i = 1$ with $t\_i > 0$, denote by $P = \sum\_i t\_i P\_i$ the mixture, with density $p = \sum\_i t\_i p\_i$; the compensation identity is

$$\sum\_{i=1}^{n} t\_i D(P\_i || Q) = D(P || Q) + \sum\_{i=1}^{n} t\_i D(P\_i || P). \tag{61}$$

**Theorem 9.** *For $P\_1$ and $P\_2$, denote $M\_k = 2^{-k}P\_1 + (1 - 2^{-k})P\_2$, and define*

$$\mathcal{M}\_1(k) = \frac{M\_k \mathbb{1}\_{\{P\_1 > P\_2\}} + P\_2 \mathbb{1}\_{\{P\_1 \le P\_2\}}}{M\_k \{P\_1 > P\_2\} + P\_2 \{P\_1 \le P\_2\}} \qquad \mathcal{M}\_2(k) = \frac{M\_k \mathbb{1}\_{\{P\_1 \le P\_2\}} + P\_2 \mathbb{1}\_{\{P\_1 > P\_2\}}}{M\_k \{P\_1 \le P\_2\} + P\_2 \{P\_1 > P\_2\}}.$$

*then the following sharpening of Pinsker's inequality can be derived,*

$$D(P\_1||P\_2) \ge (2\log e)|P\_1 - P\_2|\_{TV}^2 + \sum\_{k=0}^{\infty} 2^k \left( \frac{\chi^2(\mathcal{M}\_1(k), M\_{k+1})}{2} + \frac{\chi^2(\mathcal{M}\_2(k), M\_{k+1})}{2} \right).$$

**Proof.** When $n = 2$ and $t\_1 = t\_2 = \frac{1}{2}$, if we denote $M = \frac{P\_1+P\_2}{2}$, then (61) reads as

$$\frac{1}{2}D(P\_1||Q) + \frac{1}{2}D(P\_2||Q) = D(M||Q) + \text{JSD}(P\_1||P\_2). \tag{62}$$

Taking *Q* = *P*2, we arrive at

$$D(P\_1||P\_2) = 2D(M||P\_2) + 2\,\text{JSD}(P\_1||P\_2). \tag{63}$$

Iterating, and writing $M\_k = 2^{-k}P\_1 + (1 - 2^{-k})P\_2$, we have

$$D(P\_1||P\_2) = 2^n D(M\_n||P\_2) + 2\sum\_{k=0}^{n-1} 2^k\, \text{JSD}(M\_k||P\_2). \tag{64}$$

It can be shown (see [11]) that $2^n D(M\_n||P\_2) \to 0$ as $n \to \infty$, giving the following series representation,

$$D(P\_1||P\_2) = 2\sum\_{k=0}^{\infty} 2^k \text{JSD}(M\_k||P\_2). \tag{65}$$

Note that the $\rho$-decomposition of $M\_k$ is exactly $\rho\_i = \mathcal{M}\_i(k)$; thus, by Corollary 6,

$$\begin{split} D(P\_1||P\_2) &= 2\sum\_{k=0}^{\infty} 2^k \text{JSD}(M\_k||P\_2) \\ &\geq \sum\_{k=0}^{\infty} 2^k \left( |M\_k - P\_2|\_{TV}^2 \log e + \frac{\chi^2(\mathcal{M}\_1(k), M\_{k+1})}{2} + \frac{\chi^2(\mathcal{M}\_2(k), M\_{k+1})}{2} \right) \\ &= (2\log e)|P\_1 - P\_2|\_{TV}^2 + \sum\_{k=0}^{\infty} 2^k \left( \frac{\chi^2(\mathcal{M}\_1(k), M\_{k+1})}{2} + \frac{\chi^2(\mathcal{M}\_2(k), M\_{k+1})}{2} \right). \end{split} \tag{66}$$

Thus, we arrive at the desired sharpening of Pinsker's inequality.

Observe that the *k* = 0 term in the above series is equivalent to

$$2^0 \left( \frac{\chi^2(\mathcal{M}\_1(0), \mathcal{M}\_{0+1})}{2} + \frac{\chi^2(\mathcal{M}\_2(0), \mathcal{M}\_{0+1})}{2} \right) = \frac{\chi^2(\rho\_1, p)}{2} + \frac{\chi^2(\rho\_2, p)}{2},\tag{67}$$

where $\rho\_i$ is the convex decomposition of $p = \frac{p\_1+p\_2}{2}$ in terms of $T(x) = \arg\max\{p\_1(x), p\_2(x)\}$.
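The series representation (65) at the heart of the proof can be checked numerically: truncating the sum at a moderate depth already reproduces $D(P\_1||P\_2)$ to high accuracy. A minimal sketch with illustrative two-point distributions:

```python
import math

def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b))

def jsd(a, b):
    m = [(x + y) / 2 for x, y in zip(a, b)]
    return (kl(a, m) + kl(b, m)) / 2

P1 = [0.8, 0.2]
P2 = [0.3, 0.7]

# partial sum of (65): D(P1||P2) = 2 * sum_k 2^k JSD(M_k||P2),
# with M_k = 2^{-k} P1 + (1 - 2^{-k}) P2; 25 terms suffice here
total = 0.0
for k in range(25):
    w = 2.0 ** (-k)
    Mk = [w * x + (1 - w) * y for x, y in zip(P1, P2)]
    total += 2 * 2 ** k * jsd(Mk, P2)

assert abs(total - kl(P1, P2)) < 1e-6
```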

#### **5. Conclusions**

In this article, we begin a systematic study of strongly convex divergences, and of how the strength of convexity of a divergence generator $f$, quantified by the parameter $\kappa$, influences the behavior of the divergence $D\_f$. We prove that every strongly convex divergence dominates the square of the total variation, extending the classical bound provided by the $\chi^2$-divergence. We also study a general notion of skew divergence, providing new bounds, in particular for the generalized skew divergence of Nielsen. Finally, we show how $\kappa$-convexity can be leveraged to yield improvements of Bayes risk $f$-divergence inequalities, and as a consequence achieve a sharpening of Pinsker's inequality.

**Funding:** This research was funded by NSF grant CNS 1809194.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Theorem A1.** *The class of f-divergences is stable under skewing. That is, if $f$ is convex, satisfying $f(1) = 0$, then*

$$\hat{f}(x) := (tx + (1 - t))\, f\left(\frac{rx + (1 - r)}{tx + (1 - t)}\right) \tag{A1}$$

*is convex with $\hat{f}(1) = 0$ as well.*

**Proof.** If $\mu$ and $\nu$ have respective densities $u$ and $v$ with respect to a reference measure $\gamma$, then $r\mu + (1 - r)\nu$ and $t\mu + (1 - t)\nu$ have densities $ru + (1 - r)v$ and $tu + (1 - t)v$, so that

$$S\_{f,r,t}(\mu||\nu) = \int f\left(\frac{ru + (1-r)v}{tu + (1-t)v}\right) (tu + (1-t)v)\, d\gamma \tag{A2}$$

$$=\int f\left(\frac{r\frac{u}{v} + (1-r)}{t\frac{u}{v} + (1-t)}\right) \left(t\frac{u}{v} + (1-t)\right) v\, d\gamma \tag{A3}$$

$$= \int \hat{f}\left(\frac{u}{v}\right) v\, d\gamma. \tag{A4}$$

Since $\hat{f}(1) = f(1) = 0$, we need only prove $\hat{f}$ convex. For this, recall that the conic transform $g$ of a convex function $f$, defined by $g(x, y) = y f(x/y)$ for $y > 0$, is convex, since

$$\frac{y\_1 + y\_2}{2} f\left(\frac{x\_1 + x\_2}{2} \Big/ \frac{y\_1 + y\_2}{2}\right) = \frac{y\_1 + y\_2}{2} f\left(\frac{y\_1}{y\_1 + y\_2} \frac{x\_1}{y\_1} + \frac{y\_2}{y\_1 + y\_2} \frac{x\_2}{y\_2}\right) \tag{A5}$$

$$\leq \frac{y\_1}{2} f(x\_1/y\_1) + \frac{y\_2}{2} f(x\_2/y\_2). \tag{A6}$$

Our result follows since $\hat{f}$ is the composition of the affine function $A(x) = (rx + (1 - r),\, tx + (1 - t))$ with the conic transform of $f$,

$$
\hat{f}(x) = g(A(x)). \tag{A7}
$$
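Theorem A1 is easy to probe numerically: for the Kullback-Leibler generator $f(x) = x\log x$ and illustrative skewing parameters $r = 0.3$, $t = 0.6$ (our choice, not from the text), the generator $\hat{f}$ of (A1) vanishes at $1$ and passes random midpoint-convexity checks. A minimal sketch:

```python
import math
import random

def f(x):
    # Kullback-Leibler generator f(x) = x log x, convex with f(1) = 0
    return x * math.log(x)

def f_hat(x, r=0.3, t=0.6):
    # skewed generator (A1): (tx + (1-t)) f((rx + (1-r)) / (tx + (1-t)))
    denom = t * x + (1 - t)
    return denom * f((r * x + (1 - r)) / denom)

assert abs(f_hat(1.0)) < 1e-12          # f_hat(1) = 0

random.seed(0)
for _ in range(1000):                   # random midpoint-convexity checks
    x1, x2 = random.uniform(0.01, 10), random.uniform(0.01, 10)
    assert f_hat((x1 + x2) / 2) <= (f_hat(x1) + f_hat(x2)) / 2 + 1e-12
```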

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Minimum Divergence Estimators, Maximum Likelihood and the Generalized Bootstrap**

**Michel Broniatowski**

Faculté de Mathématiques, Laboratoire de Probabilité, Statistique et Modélisation, Université Pierre et Marie Curie (Sorbonne Université), 4 Place Jussieu, CEDEX 05, 75252 Paris, France; michel.broniatowski@sorbonne-universite.fr

**Abstract:** This paper states that most commonly used minimum divergence estimators are MLEs for suited generalized bootstrapped sampling schemes. Optimality in the sense of Bahadur for associated tests of fit under such sampling is considered.

**Keywords:** statistical divergences; minimum divergence estimator; maximum likelihood; bootstrap; conditional limit theorem; Bahadur efficiency

#### **1. Motivation and Context**

Divergences between probability measures are widely used in statistics and data science in order to perform inference under models of various kinds: parametric or semi-parametric, or even in non-parametric settings. The corresponding methods extend the likelihood paradigm and insert inference in some minimum "distance" framing, which provides a convenient description for the properties of the resulting estimators and tests, under the model or under misspecification. Furthermore, they pave the way to a large number of competitive methods, which allow trading off between efficiency and robustness, among other things. Many families of such divergences have been proposed, some of them stemming from classical statistics (such as the Chi-square divergence), while others have their origin in other fields, such as information theory. Some measures of discrepancy involve regularity of the corresponding probability measures, while others seem to be restricted to measures on finite or countable spaces, at least when using them as inferential tools, hence in situations where the elements of a model have to be confronted with a dataset. The choice of a specific discrepancy measure in a specific context is somewhat arbitrary in many cases, although the resulting conclusion of the inference might differ accordingly, above all under misspecification.

The goal of this paper can be stated briefly. The current literature on risks, seen from a statistical standpoint, has developed in two main directions from basic definitions and principles, following the seminal papers [1,2].

A first stream of papers aims to describe classes of discrepancy indices (divergences) associated with invariance under classes of transformations and similar properties; see [3–5] for a review.

The second flow aims at making use of these indices for practical purposes under various models, from parametric models to semi-parametric ones, mostly. The literature on learning procedures also makes extensive use of divergence-based risks, with a strong accent on implementation issues. Following the standard approach, their properties are mainly considered under i.i.d. sampling, providing limit results, confidence areas, etc.; see [6,7] and references therein for reviews and developments, and the monographs [8,9]. Comparisons among discrepancy indices are also considered in terms of performance either under the model, or with respect to robustness (aiming at minimizing the role of outliers in the inference by providing estimators with redescending influence functions), or

**Citation:** Broniatowski, M. Minimum Divergence Estimators, Maximum Likelihood and the Generalized Bootstrap. *Entropy* **2021**, *23*, 185. https://doi.org/10.3390/e23020185

Academic Editor: Igal Sason Received: 10 September 2020 Accepted: 28 January 2021 Published: 31 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

with respect to misspecification, hence focusing on the loss in estimation or testing with respect to the distance from the assumed model to the true one.

This literature, however, rarely considers the rationale for specific choices of indices in relation with the concepts which define statistics, such as the Bayesian paradigm or the maximum likelihood (ML) one; for a contribution in this direction for inference in models defined by linear constraints, see [10]. In [11], we could prove that minimum divergence estimators (in the class of the ones considered in the present paper) coincide with MLEs under i.i.d. sampling in regular exponential models (but need not, even in common models such as mixtures). Here it is proved that minimum divergence estimators are indeed MLEs under weighted sampling, instead of standard i.i.d. one, commonly met in bootstrap procedures which aim at providing finite sample properties of estimators through simulation.

This paper considers a specific class of divergences, which contains most of the classical inferential tools, and which is indexed by a single scalar parameter. This class of divergences belongs to the Csiszar-Ali-Silvey-Arimoto family of divergences (see [4]), and is usually referred to as the power divergence class, which has been considered by Cressie and Read [12]; however this denomination is also shared by other discrepancy measures of some different nature [13]. We will use the acronym CR for the class of divergences under consideration in this paper.

Section 2 recalls that the MLE is obtained as a proxy of the minimizer of the Kullback-Leibler divergence between the generic law of the observed variable and the model, which is the large deviation limit for the empirical distribution. This limit statement is nothing but the continuation of the classical ML paradigm, namely to make the dataset more "probable" under the fitted distribution in the model, or, equivalently, to fit the most "likely" distribution in the model to the dataset.

Section 3 states that, given a divergence pseudo-distance $\phi$ in CR, the Minimum Divergence Estimator (MDE) is obtained as a proxy of the minimizer of the large deviation limit for some bootstrap version of the empirical distribution, which establishes that the MDE is an MLE for bootstrapped samples defined in relation with the divergence. This fact is based on the strong relation which associates to any CR $\phi$-divergence a specific RV $W$ (see Section 1.1.2); this link is the cornerstone for the interpretation of the minimum $\phi$-divergence estimators as MLEs for specific bootstrapped sampling schemes where $W$ plays a prominent rôle. A specific remark explores the link between MDE and MLE in exponential families. As a by-product, we also introduce a bootstrapped estimator of the divergence pseudo-distance $\phi$ between the distribution of the data and the model.

In Section 4, we specify the bootstrapped estimator of the divergence which can be used in order to perform an optimal test of fit. Due to the type of asymptotics handled in this paper, optimality is studied in terms of Bahadur efficiency. It is shown that tests of fit based on such estimators enjoy Bahadur optimality with respect to other bootstrap plans when the bootstrap is performed under the distribution associated with the divergence criterion itself.

The discussion held in this paper pertains to parametric estimation in a model $\mathcal{P}\_\Theta$ whose elements $P\_\theta$ are probability measures defined on the same finite space $Y := \{d\_1, \ldots, d\_K\}$, and $\theta \in \Theta$, where $\Theta$ is an index space; we assume identifiability, namely that different values of $\theta$ induce different probability laws $P\_\theta$. Also, all the entries of $P\_\theta$ will be positive, for all $\theta$ in $\Theta$.

#### *1.1. Notation*

#### 1.1.1. Divergences

We consider regular *divergence functions* $\varphi$, which are non-negative convex functions with values in $\mathbb{R}^{+}$ which belong to $C^2(\mathbb{R})$ and satisfy $\varphi(1) = \varphi'(1) = 0$ and $\varphi''(1) = 1$; see [3,4] for properties and extensions. An important class of such functions is defined through the power divergence functions

$$\varphi\_{\gamma}(\mathbf{x}) := \frac{\mathbf{x}^{\gamma} - \gamma \mathbf{x} + \gamma - 1}{\gamma(\gamma - 1)} \tag{1}$$

defined for all real $\gamma \neq 0, 1$, with $\varphi\_0(x) := -\log x + x - 1$ (the likelihood divergence function) and $\varphi\_1(x) := x \log x - x + 1$ (the Kullback-Leibler divergence function). This class is usually referred to as the Cressie-Read family of divergence functions (see [12]). It is a very simple class of functions (taking the limits as $\gamma \to 0, 1$) which allows one to represent nearly all commonly used statistical criteria. Parametric inference in commonly met situations, including continuous models or some non-regular models, can be performed with them; see [6]. The $L^1$ divergence function $\varphi(x) := |x - 1|$ is not captured by the CR family of functions. Where undefined, the function $\varphi$ is declared to assume the value $+\infty$.
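As an illustration, the family (1), together with its $\gamma \to 0, 1$ limits, can be coded directly, and the normalization $\varphi(1) = \varphi'(1) = 0$, $\varphi''(1) = 1$ checked by central finite differences; a minimal sketch:

```python
import math

def phi(gamma, x):
    """Cressie-Read power divergence generator (1), with the
    gamma -> 0 and gamma -> 1 limits handled explicitly."""
    if gamma == 0:                      # likelihood divergence function
        return -math.log(x) + x - 1
    if gamma == 1:                      # Kullback-Leibler divergence function
        return x * math.log(x) - x + 1
    return (x ** gamma - gamma * x + gamma - 1) / (gamma * (gamma - 1))

# phi(1) = phi'(1) = 0 and phi''(1) = 1 for every gamma (central differences)
h = 1e-5
for g in (-1.0, 0.0, 0.5, 1.0, 2.0):
    d1 = (phi(g, 1 + h) - phi(g, 1 - h)) / (2 * h)
    d2 = (phi(g, 1 + h) - 2 * phi(g, 1.0) + phi(g, 1 - h)) / h ** 2
    assert abs(phi(g, 1.0)) < 1e-12
    assert abs(d1) < 1e-8
    assert abs(d2 - 1) < 1e-4
```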

Associated with a divergence function $\varphi$ is the *divergence* $\phi$ between a probability measure and a finite signed measure; see [14].

For $P := (p\_1, \ldots, p\_K)$ and $Q := (q\_1, \ldots, q\_K)$ in $S^K$, the simplex of all probability measures on $Y$, define, whenever $Q$ and $P$ have non-null entries,

$$\phi(Q, P) := \sum\_{k=1}^{K} p\_k\, \varphi\left(\frac{q\_k}{p\_k}\right).$$

Indexing this pseudo-distance by $\gamma$ and using $\varphi\_\gamma$ as divergence function yields the Kullback-Leibler divergence $KL(Q, P) := \phi\_1(Q, P) := \sum q\_k \log \frac{q\_k}{p\_k}$, the likelihood or modified Kullback-Leibler divergence

$$KL\_m(Q, P) := \phi\_0(Q, P) := -\sum p\_k \log \left(\frac{q\_k}{p\_k}\right),$$

the Hellinger divergence

$$\phi\_{1/2}(Q,P) := \frac{1}{2} \sum p\_k \left( \sqrt{\frac{q\_k}{p\_k}} - 1 \right)^2,$$

the modified (or Neyman) $\chi^2$ divergence

$$
\chi^2\_m(Q, P) := \phi\_{-1}(Q, P) := \frac{1}{2} \sum p\_k \left(\frac{q\_k}{p\_k} - 1\right)^2 \left(\frac{q\_k}{p\_k}\right)^{-1}.
$$

The $\chi^2$ divergence

$$\phi\_2(Q, P) := \frac{1}{2} \sum p\_k \left(\frac{q\_k}{p\_k} - 1\right)^2$$

is defined between signed measures; see [15] for definitions in a more general setting, and [6] for the advantage of extending the definition to possibly signed measures in the context of parametric inference for non-regular models. Also, the present discussion, which is restricted to finite spaces $Y$, can be extended to general spaces.

The conjugate divergence function of *ϕ* is defined through

$$
\widetilde{\varphi}(x) := x\, \varphi\left(\frac{1}{x}\right) \tag{2}
$$

and the corresponding divergence $\widetilde{\phi}(P, Q)$ is

$$\widetilde{\phi}(P,Q) := \sum\_{k=1}^{K} q\_k\, \widetilde{\varphi}\left(\frac{p\_k}{q\_k}\right),$$

which satisfies

$$
\widetilde{\phi}(P,Q) = \phi(Q,P),
$$

whenever defined, and equals $+\infty$ otherwise. When $\varphi = \varphi\_\gamma$, then $\widetilde{\varphi} = \varphi\_{1-\gamma}$, as follows by substitution. Pairs $(\varphi\_\gamma, \varphi\_{1-\gamma})$ are therefore *conjugate pairs*. Inside the Cressie-Read family, the Hellinger divergence function is self-conjugate.
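The conjugacy relations above are straightforward to verify numerically: $\widetilde{\varphi}\_\gamma = \varphi\_{1-\gamma}$ pointwise, and consequently $\widetilde{\phi}(P, Q) = \phi(Q, P)$. A short sketch with illustrative distributions:

```python
import math

def phi(gamma, x):
    if gamma == 0:
        return -math.log(x) + x - 1
    if gamma == 1:
        return x * math.log(x) - x + 1
    return (x ** gamma - gamma * x + gamma - 1) / (gamma * (gamma - 1))

def div(gamma, Q, P):
    """phi_gamma(Q, P) = sum_k p_k phi_gamma(q_k / p_k)."""
    return sum(p * phi(gamma, q / p) for q, p in zip(Q, P))

def conj(gamma, x):
    """Conjugate generator (2): x * phi_gamma(1/x)."""
    return x * phi(gamma, 1 / x)

P = [0.2, 0.3, 0.5]
Q = [0.4, 0.4, 0.2]

for g in (-1.0, 0.0, 0.5, 1.0, 2.0):
    # conjugate pair: phi_gamma tilde equals phi_{1-gamma}
    assert abs(conj(g, 1.7) - phi(1 - g, 1.7)) < 1e-12
    # and hence phi_tilde(P, Q) = phi(Q, P)
    assert abs(div(1 - g, P, Q) - div(g, Q, P)) < 1e-12
```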

For $P = P\_\theta$ and $Q \in S^K$ we denote $\phi(Q, P)$ by $\phi(Q, \theta)$ (resp. $\phi(\theta, Q)$, or $\phi(\theta', \theta)$, etc., according to the context).

#### 1.1.2. Weights

This paragraph introduces the special link which connects CR divergences with specific random variables, which we call weights. These will be associated with the dataset and define what is usually referred to as a generalized bootstrap procedure. This is the setting which allows for an interpretation of the MDEs as generalized bootstrapped MLEs.

For a given real-valued random variable (RV) $W$, denote by

$$M(t) := \log E[\exp(tW)] \tag{3}$$

its cumulant generating function, which we assume to be finite in a non-void open neighborhood of $0$. The Fenchel-Legendre transform of $M$ (also called the Chernoff function) is defined through

$$\varphi^{W}(x) = M^\*(x) := \sup\_t\, (tx - M(t)). \tag{4}$$

The function $x \to \varphi^W(x)$ is non-negative, $C^{\infty}$ and convex. We also assume that $EW = 1$ together with $\operatorname{Var} W = 1$, which implies $\varphi^W(1) = (\varphi^W)'(1) = 0$ and $(\varphi^W)''(1) = 1$. Hence $\varphi^W(x)$ is a divergence function with corresponding divergence $\phi^W$. Associated with $\varphi^W$ is the conjugate divergence $\widetilde{\phi}^W$ with divergence function $\widetilde{\varphi}^W$, which therefore satisfies $\phi^W(Q, P) = \widetilde{\phi}^W(P, Q)$ whenever neither $P$ nor $Q$ has null entries.

It is of interest to note that the classical power divergences $\varphi\_\gamma$ can be represented through (4) for $\gamma \leq 1$ or $\gamma \geq 2$. A first proof of this lies in the fact that when $W$ has a distribution in a Natural Exponential Family (NEF) with power variance function with exponent $\alpha = 2 - \gamma$, the Legendre transform $\varphi^W$ of its cumulant generating function $M$ is indeed of the form (1). See [16,17] for NEFs and power variance functions, and [18] for the relation to the bootstrap. A general result of a different nature, including the former ones, can be seen in [19], Theorem 20. The correspondence between the various values of $\gamma$ and the distribution of the respective weights can be found in [19], Example 39, and can be summarized as follows.

For $\gamma < 0$ the RV $W$ is constructed as follows: let $Z$ be an auxiliary RV with density $f\_Z$ and support $[0, \infty)$, following a stable law with parameter triplet $\left(-\frac{\gamma}{1-\gamma},\, 0,\, \frac{(1-\gamma)^{-\gamma/(1-\gamma)}}{\gamma}\right)$ in terms of the "form B notation" on p. 12 in [20]; then $W$ has an absolutely continuous distribution with density

$$f\_W(y) := \frac{\exp(-y/(1-\gamma))}{\exp(1/\gamma)}\, f\_{Z}(y)\, \mathbf{1}\_{[0,\infty)}(y).$$

For $\gamma = 0$ (which amounts to considering the limit as $\gamma \to 0$ in (1)), $W$ has a standard exponential distribution $E(1)$ on $[0, \infty)$.

For $\gamma \in (0, 1)$, $W$ has a compound Gamma-Poisson distribution

$$C(POI(\theta), GAM(\alpha, \beta))$$

where

$$
\theta = \frac{1}{\gamma}, \quad \alpha = \frac{1}{1 - \gamma}, \quad \beta = \frac{\gamma}{1 - \gamma}.
$$

For $\gamma = 1$, $W$ has a Poisson distribution with parameter $1$, $POI(1)$. For $\gamma = 2$, the RV $W$ has a normal distribution with expectation and variance equal to $1$.

For $\gamma > 2$, the RV $W$ is constructed as follows: let $Z$ be an auxiliary RV with density $f\_Z$ and support $(-\infty, \infty)$, following a stable law with parameter triplet $\left(\frac{\gamma}{\gamma-1},\, 0,\, \frac{(\gamma-1)^{-\gamma/(\gamma-1)}}{\gamma}\right)$ in terms of the "form B notation" on p. 12 in [20]; then $W$ has an absolutely continuous distribution with density

$$f\_W(y) := \frac{\exp(y/(\gamma - 1))}{\exp(1/\gamma)}\, f\_Z(-y), \quad y \in \mathbb{R}.$$
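For instance, for $\gamma = 1$ one can check numerically that the Fenchel-Legendre transform (4) of the cumulant generating function of $W \sim POI(1)$, $M(t) = e^t - 1$, recovers the Kullback-Leibler divergence function $\varphi\_1(x) = x \log x - x + 1$. A crude grid-based sketch (grid bounds and resolution are arbitrary choices):

```python
import math

def M(t):
    # cumulant generating function of W ~ Poisson(1): log E[exp(tW)] = e^t - 1
    return math.exp(t) - 1

def legendre(x, lo=-10.0, hi=10.0, steps=40001):
    # crude grid evaluation of (4): M*(x) = sup_t (t x - M(t))
    best = -float("inf")
    for i in range(steps):
        t = lo + (hi - lo) * i / (steps - 1)
        best = max(best, t * x - M(t))
    return best

for x in (0.5, 1.0, 2.0, 3.0):
    phi1 = x * math.log(x) - x + 1      # Kullback-Leibler generator phi_1
    assert abs(legendre(x) - phi1) < 1e-6
```

The supremum is attained at $t = \log x$, which lies well inside the grid for the tested values of $x$.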

#### **2. Maximum Likelihood under Finitely Supported Distributions and Simple Sampling**

#### *2.1. Standard Derivation*

Let $X\_1, \ldots, X\_n$ be a set of $n$ independent random variables with common probability measure $P\_{\theta\_T}$, and consider the Maximum Likelihood estimator of $\theta\_T$. A common way to define the ML paradigm is as follows: for any $\theta$, consider independent random variables $(X\_{1,\theta}, \ldots, X\_{n,\theta})$ with probability measure $P\_\theta$, thus *sampled in the same way as the $X\_i$'s*, but under some alternative $\theta$.

Denote

$$P\_n := \frac{1}{n} \sum\_{i=1}^n \delta\_{X\_i}$$

and

$$P\_{n, \theta} := \frac{1}{n} \sum\_{i=1}^n \delta\_{X\_{i, \theta}}$$

the empirical measures pertaining respectively to $(X\_1, \ldots, X\_n)$ and $(X\_{1,\theta}, \ldots, X\_{n,\theta})$.

Define $\theta\_{ML}$ as the value of the parameter $\theta$ for which the probability that $(X\_{1,\theta}, \ldots, X\_{n,\theta})$ coincides with $(X\_1, \ldots, X\_n)$, up to a permutation of the order of the $X\_{i,\theta}$'s, is maximal, conditionally on the observed sample $X\_1, \ldots, X\_n$. In formula,

$$\theta\_{ML} := \arg\max\_{\theta} P\_{\theta}(P\_{n,\theta} = P\_n | P\_n). \tag{5}$$

An explicit enumeration of the above expression $P\_\theta(P\_{n,\theta} = P\_n | P\_n)$ involves the quantities

$$n\_j := \operatorname{card}\{i : X\_i = d\_j\}$$

for *j* = 1, . . . , *K* and yields

$$P\_{\theta}(P\_{n,\theta} = P\_n | P\_n) = \frac{n!}{\prod\_{j=1}^{K} n\_j!}\, \prod\_{j=1}^{K} P\_{\theta}(d\_j)^{n\_j} \tag{6}$$

as follows from the classical multinomial distribution. Optimizing on *θ* in (6) yields

$$\begin{aligned} \theta\_{ML} &= \arg\max\_{\theta} \sum\_{j=1}^{K} \frac{n\_j}{n} \log P\_{\theta} \left( d\_j \right) \\ &= \arg\max\_{\theta} \frac{1}{n} \sum\_{i=1}^{n} \log P\_{\theta} (X\_i) .\end{aligned}$$

It follows from direct evaluation that

$$
\theta\_{ML} = \arg\inf\_{\theta} KL\_m(P\_{\theta}, P\_n).
$$

Introducing the Kullback-Leibler divergence $KL(P\_n, P\_\theta)$, it thus holds that

$$\theta\_{ML} = \arg\inf\_{\theta} KL\_m(P\_{\theta}, P\_n) = \arg\inf\_{\theta} KL(P\_n, P\_{\theta}).$$

We have recalled that minimizing the Kullback-Leibler divergence $KL(P\_n, \theta)$ amounts to minimizing the likelihood divergence $KL\_m(\theta, P\_n)$ and produces the ML estimate of $\theta\_T$.
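As a small numerical illustration of this equivalence, consider the two-point model $P\_\theta = (\theta, 1-\theta)$: minimizing $KL(P\_n, P\_\theta)$ over a grid of $\theta$ values recovers the empirical frequency, i.e., the MLE. The model, empirical measure and grid below are illustrative choices:

```python
import math

def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

# empirical measure P_n on a two-point space Y = {d1, d2}
Pn = [0.7, 0.3]                       # i.e., n_1 / n = 0.7

# model: P_theta = (theta, 1 - theta); the MLE is the empirical frequency
grid = [i / 1000 for i in range(1, 1000)]
theta_kl = min(grid, key=lambda th: kl(Pn, [th, 1 - th]))
theta_klm = min(grid, key=lambda th: -sum(p * math.log(q)
                for p, q in zip(Pn, [th, 1 - th])))   # max log-likelihood

assert abs(theta_kl - 0.7) < 1e-9
assert theta_kl == theta_klm
```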

#### *2.2. Asymptotic Derivation*

We assume that

$$\lim\_{n \to \infty} P\_n = P\_{\theta\_T} \quad \text{a.s.}$$

This holds, for example, when the $X\_i$'s are drawn as an i.i.d. sample with common law $P\_{\theta\_T}$, which we may assume in the present context. From an asymptotic standpoint, the Kullback-Leibler divergence is related to the way $P\_n$ keeps away from $P\_\theta$ when $\theta$ is not equal to the true value of the parameter $\theta\_T$ generating the observations $X\_i$'s, and is closely related to the type of sampling of the $X\_i$'s. In the present case, when i.i.d. sampling of the $X\_{i,\theta}$'s under $P\_\theta$ is performed, Sanov's Large Deviation theorem leads to

$$\lim\_{n \to \infty} \frac{1}{n} \log P\_{\theta}(P\_{n, \theta} = P\_n | P\_n) = -KL(\theta\_{T}, \theta). \tag{7}$$

This result can easily be obtained from (6), using the Stirling formula to handle the factorial terms and the law of large numbers, which states that for all $j$'s, $n\_j/n$ tends to $P\_{\theta\_T}(d\_j)$ as $n$ tends to infinity. We note that the MLE $\theta\_{ML}$ is a proxy of the minimizer in $\theta$ of the natural estimator of $KL(\theta\_T, \theta)$, substituting the unknown measure generating the $X\_i$'s by its empirical counterpart $P\_n$. Alternatively, as will be used in the sequel, $\theta\_{ML}$ minimizes upon $\theta$ the likelihood divergence $KL\_m(\theta, \theta\_T)$ between $P\_\theta$ and $P\_{\theta\_T}$, substituting the unknown measure $P\_{\theta\_T}$ generating the $X\_i$'s by its empirical counterpart $P\_n$. Summarizing, we have obtained:

The ML estimate can be obtained from a LDP statement as given in (7), optimizing in *θ* in the estimator of the LDP rate where the plug-in method of the empirical measure of the data is used instead of the unknown measure *Pθ<sup>T</sup>* . Alternatively it holds

$$\theta\_{ML} := \arg\min\_{\theta} \overline{KL\_m}(\theta, \theta\_T) \tag{8}$$

with

$$
\overline{KL\_m}(\theta, \theta\_T) := KL\_m(\theta, P\_n).
$$

This principle will be kept throughout this paper: the estimator is defined as maximizing the probability that the simulated empirical measure be close to the empirical measure as observed on the sample, conditionally on it, following the same sampling scheme. This yields a maximum likelihood estimator, and its properties are then obtained when randomness is introduced as resulting from the sampling scheme.
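The limit (7) can also be observed numerically: evaluating the multinomial probability (6) at the observed type via `math.lgamma` and normalizing by $1/n$ approaches $-KL(\theta\_T, \theta)$ as $n$ grows. A rough sketch on a binary alphabet (the particular distributions are illustrative):

```python
import math

def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def log_prob(n, counts, P):
    """log P_theta(P_{n,theta} = P_n | P_n), the multinomial formula (6)."""
    out = math.lgamma(n + 1)
    for nj, pj in zip(counts, P):
        out += nj * math.log(pj) - math.lgamma(nj + 1)
    return out

P_theta = [0.5, 0.5]
P_true = [0.7, 0.3]                  # plays the role of P_{theta_T}

errs = []
for n in (100, 1000, 10000):
    counts = [round(n * p) for p in P_true]
    rate = -log_prob(n, counts, P_theta) / n
    errs.append(abs(rate - kl(P_true, P_theta)))

# (1/n) log P_theta(...) -> -KL(theta_T, theta), as in (7)
assert errs[0] > errs[1] > errs[2] and errs[2] < 1e-3
```

The residual error at each $n$ is of order $(\log n)/n$, coming from the Stirling correction.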

#### **3. Bootstrap and Weighted Sampling**

The sampling scheme which we consider is commonly used in connection with the bootstrap and is referred to as the *weighted* or *generalized bootstrap*, sometimes called *wild bootstrap*, first introduced by Newton and Mason [21].

Let $X\_1, \ldots, X\_n$ have common distribution $P$ on $Y := \{d\_1, \ldots, d\_K\}$.

Consider a collection $W\_1, \ldots, W\_n$ of independent copies of $W$, whose distribution satisfies the conditions stated in Section 1. The weighted empirical measure $P\_n^W$ is defined through

$$P\_n^W := \frac{1}{n} \sum\_{i=1}^n W\_i\, \delta\_{X\_i}.$$

This empirical measure need not be a probability measure, since its mass may not equal 1. Also, it might not be positive, since the weights may take negative values. Therefore $P\_n^W$ can be identified with a random point in $\mathbb{R}^K$. The measure $P\_n^W$ converges almost surely to $P$ when the weights $W\_i$ satisfy the hypotheses stated in Section 1.

We also consider the normalized weighted empirical measure

$$\mathfrak{P}\_n^W := \sum\_{i=1}^n Z\_i\, \delta\_{X\_i} \tag{9}$$

where

$$Z\_i := \frac{W\_i}{\sum\_{j=1}^{n} W\_j} \tag{10}$$

whenever $\sum\_{j=1}^{n} W\_j \neq 0$, and

$$
\mathfrak{P}\_n^W = \infty
$$

when $\sum\_{j=1}^{n} W\_j = 0$, where $\mathfrak{P}\_n^W = \infty$ means $\mathfrak{P}\_n^W(d\_k) = \infty$ for all $d\_k$ in $Y$.
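A minimal simulation of $P\_n^W$ and $\mathfrak{P}\_n^W$ with standard exponential weights (which have $EW = \operatorname{Var} W = 1$, the $\gamma = 0$ case of Section 1.1.2); the alphabet, sampling distribution and sample size are illustrative choices:

```python
import random
from collections import Counter

random.seed(1)
Y = ["d1", "d2", "d3"]
sample = random.choices(Y, weights=[0.5, 0.3, 0.2], k=2000)

# standard-exponential weights: E W = Var W = 1 (the gamma = 0 case)
W = [random.expovariate(1.0) for _ in sample]

# weighted empirical measure P_n^W (not necessarily a probability measure)
n = len(sample)
PnW = Counter()
for x, w in zip(sample, W):
    PnW[x] += w / n

# normalized weighted empirical measure: Z_i = W_i / sum_j W_j
total = sum(W)
PnW_norm = {y: PnW[y] * n / total for y in Y}

assert abs(sum(PnW_norm.values()) - 1) < 1e-9      # a genuine pm
assert abs(PnW_norm["d1"] - 0.5) < 0.1             # loose: weights add O(n**-0.5) noise
```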

#### *3.1. A Conditional Sanov Type Result for the Weighted Empirical Measure*

We now state a conditional Sanov-type result for the family of random measures $\mathfrak{P}\_n^W$. It follows readily from a companion result pertaining to $P\_n^W$, and enjoys a simple form when the weights $W\_i$ are associated with power divergences, as defined in Section 1.1.2. We quote the following results, referring to [19].

Consider a set Ω in R*<sup>K</sup>* such that

$$cl\Omega = cl[\text{Int}(\Omega)]\tag{11}$$

which amounts to a regularity assumption (obviously met when Ω is an open set), and which allows for the replacement of the usual lim inf and lim sup by standard limits in usual LDP statements. We denote by *P <sup>W</sup>* the probability measure of the random family of i.i.d. weights *W<sup>i</sup>* .

It then holds

**Proposition 1** (Theorem 9 in [19])**.** *The weighted empirical measure $P\_n^W$ satisfies a conditional Large Deviation Principle in $\mathbb{R}^K$, namely, denoting $P$ the a.s. limit of $P\_n$,*

$$\lim\_{n \to \infty} \frac{1}{n} \log P^W \left( P\_n^W \in \Omega \Big| X\_1^n \right) = -\phi^W(\Omega, P)$$

*where $\phi^W(\Omega, P) := \inf\_{Q \in \Omega} \phi^W(Q, P)$.*

As a direct consequence of the former result, the following holds for any $\Omega \subset S^K$ satisfying (11), where $S^K$ designates the simplex of all pm's on $Y$.

**Theorem 1** (Theorem 12 in [19])**.** *The normalized weighted empirical measure $\mathfrak{P}\_n^W$ satisfies a conditional Large Deviation Principle in $S^K$,*

$$\lim\_{n \to \infty} \frac{1}{n} \log P^W \left( \mathfrak{P}\_n^W \in \Omega \,\Big|\, X\_1^n \right) = - \inf\_{m \neq 0} \phi^W(m\Omega, P). \tag{12}$$

A flavour of the simple proofs of Proposition 1 and Theorem 1 is presented in Appendix A; see [19] for a detailed treatment; see also Theorem 3.2 and Corollary 3.3 in [22] where Theorem 1 is proved in a more abstract setting.

We will be interested in the pm's in $\Omega$ which minimize the RHS in the above display. The case when $\phi^W$ is a power divergence, namely $\phi^W = \phi_\gamma$ for some $\gamma$, enjoys a special property with respect to the pm's $Q$ achieving the infimum (upon $Q$ in $\Omega$) in (12). It holds

**Proposition 2** (Lemma 14 in [19])**.** *Assume that φ <sup>W</sup> is a power divergence. Then*

$$Q \in \arg\inf \left\{ \inf\_{m \neq 0} \phi^W(mQ, P),\; Q \in \Omega \right\}$$

*and*

$$Q \in \arg\inf \left\{ \phi^W(Q, P),\; Q \in \Omega \right\}$$

*are equivalent statements.*

Indeed Proposition 2 holds as a consequence of the following results, to be used later on.

**Lemma 1.** *For Q and P two pm's such that the involved expressions are finite, it holds*

*(i) For $\gamma \neq 0$ and $\gamma \neq 1$ it holds that*

$$\inf\_{m \neq 0} \phi\_\gamma(mQ, P) = \frac{1}{\gamma} \left[ 1 - (1 + \gamma(\gamma - 1)\phi\_\gamma(Q, P))^{-1/(\gamma - 1)} \right].$$

*(ii)* $\inf_{m \neq 0} \phi_1(mQ, P) = 1 - \exp(-KL(Q, P)) = 1 - \exp(-\phi_1(Q, P))$.

*(iii)* $\inf_{m \neq 0} \phi_0(mQ, P) = KL_m(Q, P) = \phi_0(Q, P)$.
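Lemma 1(i) can be checked numerically, assuming the Cressie–Read form of the power divergence generator, $\varphi_\gamma(x) = (x^\gamma - \gamma x + \gamma - 1)/(\gamma(\gamma - 1))$, which agrees with the generators quoted in Example 1 below (e.g., $\varphi_2(x) = \frac{1}{2}(x-1)^2$); the helper names are hypothetical, and a grid search stands in for the infimum over $m$:

```python
def phi_gamma(gamma, Q, P):
    """Power divergence phi_gamma(Q, P) with Cressie-Read generator
    phi(x) = (x**gamma - gamma*x + gamma - 1) / (gamma*(gamma - 1))."""
    return sum(p * (((q / p) ** gamma - gamma * (q / p) + gamma - 1)
                    / (gamma * (gamma - 1))) for q, p in zip(Q, P))

def inf_over_m_closed_form(gamma, Q, P):
    """Closed form of inf_{m != 0} phi_gamma(mQ, P) given in Lemma 1(i)."""
    d = phi_gamma(gamma, Q, P)
    return (1.0 - (1.0 + gamma * (gamma - 1) * d) ** (-1.0 / (gamma - 1))) / gamma

def inf_over_m_numeric(gamma, Q, P):
    """Brute-force minimization of m -> phi_gamma(mQ, P) over a grid of m > 0."""
    grid = [0.001 * k for k in range(1, 5000)]
    return min(phi_gamma(gamma, [m * q for q in Q], P) for m in grid)

P = [0.2, 0.3, 0.5]
Q = [0.4, 0.4, 0.2]
for gamma in (2.0, 0.5, -1.0):
    assert abs(inf_over_m_closed_form(gamma, Q, P)
               - inf_over_m_numeric(gamma, Q, P)) < 1e-4
```

The three values of $\gamma$ cover the Pearson, Hellinger-type, and Neyman generators appearing later in the text.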

When $W$ is a RV with standard exponential distribution, a link between the present approach and Bayesian inference can be drawn, since the normalized weighted empirical measure $\mathfrak{P}^W_n$ is a realization of the posterior distribution for the Dirichlet prior on the non-parametric distribution of $X$. See [23].

The weighted empirical measure $P^W_n$ has been used in the weighted bootstrap (or wild bootstrap) context, although it is not a pm. However, conditionally upon the sample points, it produces statistical estimators $T(P^W_n)$ whose weak behavior (conditionally upon the sample) converges to the same limit as does $T(P_n)$ when normalized on the classical CLT range; see e.g. Newton and Mason [21]. Large deviation theorems for the weighted empirical measure $P^W_n$ have been obtained by [24]; for other contributions in this line, see [22,25]. Normalizing the weights produces families of exchangeable weights $Z_i$, and the normalized weighted empirical measure $\mathfrak{P}^W_n$ is the cornerstone of the so-called non-parametric Bayesian bootstrap, initiated by [23] and further developed by [26], among others. Note however that in this context the RV's $W_i$ are chosen distributed as standard exponential variables. The link with spacings from a uniform distribution and the corresponding reproducibility of the Dirichlet distributions are the basic ingredients which justify the non-parametric bootstrap approach; in the present context, the choice of the distribution of the $W_i$'s is a natural extension of this paradigm, at least when the $W_i$'s are positive RV's.

#### *3.2. Maximum Likelihood for the Generalized Bootstrap*

Let us turn back to the estimation of $\theta_T$, assuming $P_{\theta_T}$ is the common distribution of the independent observations $X_1, \dots, X_n$. We will consider maximum likelihood in the same spirit as developed in Section 2.2, here in the context of the normalized weighted empirical measure; this amounts to justifying minimum divergence estimators as appropriate MLEs under such a bootstrap procedure.

We thus consider the same statistical model $\mathcal{P}_\Theta$ and keep in mind the ML principle, seen as resulting from a maximization of the conditional probability of getting simulated observations close to the initially observed data. Similarly as in Section 2, fix an arbitrary $\theta$ and simulate $X_{1,\theta}, \dots, X_{n,\theta}$ with distribution $P_\theta$. Define accordingly $P^W_{n,\theta}$ and $\mathfrak{P}^W_{n,\theta}$, making use of i.i.d. RV's $W_1, \dots, W_n$. Now the event $\mathfrak{P}^W_{n,\theta}(k) = n_k/n$ has probability 0 in most cases (for example when $W$ has a continuous distribution), and therefore we are led to consider events of the form $\mathfrak{P}^W_{n,\theta} \in V_\varepsilon(P_n)$, meaning $\max_k \big|\mathfrak{P}^W_{n,\theta}(d_k) - P_n(d_k)\big| \le \varepsilon$ for some $\varepsilon > 0$; notice that $V_\varepsilon(P_n)$, defined through

$$V\_{\varepsilon}(P\_n) := \left\{ Q \in \mathcal{S}^K : \max\_k |Q(d\_k) - P\_n(d\_k)| \le \varepsilon \right\}$$

has non-void interior.

For such a configuration consider

$$P^W\left(\mathfrak{P}\_{n,\theta}^w \in V\_\varepsilon(P\_n) \Big| X\_{1,\theta}, \dots, X\_{n,\theta}, P\_n\right) \tag{13}$$

where the $X_{i,\theta}$ are randomly drawn i.i.d. under $P_\theta$. Obviously, for $\theta$ far away from $\theta_T$, the sample $(X_{1,\theta}, \dots, X_{n,\theta})$ is realized "far away" from $(X_1, \dots, X_n)$, which has been generated under the truth, namely $P_{\theta_T}$, and the probability in (13) is small for small $\varepsilon$, whatever the weights.

We will now consider (13) for large $n$ since, in contrast with the first derivation of the standard MLE in Section 2.1, we cannot perform the same calculation for each $n$, which was based on multinomial counts. Note that we obtained a justification for the usual MLE through the asymptotic Sanov LDP, leading to the KL divergence and finally back to the MLE through an approximation step of the latter. From Theorem 1 (Theorem 12 in [19]), together with the a.s. convergence of $P_n$ to $P_{\theta_T}$ in $\mathcal{S}^K$, it follows that for some $\alpha < 1 < \beta$

$$-\inf\_{m\neq 0} \phi^{W}(mV\_{\alpha\varepsilon}(P\_{\theta\_T}), \theta) \tag{14}$$

$$\begin{split} & \leq \lim\_{n\to\infty} \frac{1}{n} \log P^{W} \left( \mathfrak{P}^{W}\_{n,\theta} \in V\_{\varepsilon}(P\_{n}) | X\_{1,\theta}, \dots, X\_{n,\theta}, P\_{n} \right) \\ & \leq -\inf\_{m\neq 0} \phi^{W}(mV\_{\beta\varepsilon}(P\_{\theta\_T}), \theta) \end{split}$$

where $\phi^W(V_{c\varepsilon}(\theta_T), \theta) := \inf_{\mu \in V_{c\varepsilon}(P_{\theta_T})} \phi^W(\mu, \theta)$ for $c \in \{\alpha, \beta\}$. As $\varepsilon \to 0$, by continuity it holds that

$$\begin{split} & \lim\_{\varepsilon \to 0} \lim\_{n \to \infty} \frac{1}{n} \log P^W \Big( \mathfrak{P}\_{n, \theta}^W \in V\_{\varepsilon}(P\_n) | X\_{1, \theta}, \dots, X\_{n, \theta}, P\_n \Big) \\ &= \ - \inf\_{m \neq 0} \phi^W(m P\_{\theta\_T}, \theta). \end{split} \tag{15}$$

The ML principle amounts to maximizing

$$P^W\left(\mathfrak{P}\_{n,\theta}^W \in V\_\mathfrak{e}(P\_n)|X\_{1,\theta}, \dots, X\_{n,\theta}, P\_n\right) \tag{16}$$

over $\theta$. Whenever $\Theta$ is a compact set, we may insert this optimization in (14), which yields, following (15),

$$\begin{split} &\lim\_{\varepsilon \to 0} \lim\_{n \to \infty} \frac{1}{n} \log \sup\_{\theta} P^{W} \left( \mathfrak{P}\_{n, \theta}^{W} \in V\_{\varepsilon}(P\_{n}) | X\_{1, \theta}, \dots, X\_{n, \theta}, P\_{n} \right) \\ &= \ - \inf\_{\theta \in \Theta} \inf\_{m \neq 0} \phi^{W}(m P\_{\theta\_T}, \theta). \end{split}$$

We consider weights $W$ such that there exists a power divergence function $\varphi_\gamma$ satisfying (4), which amounts to $\phi^W = \phi_\gamma$; by the results quoted in Section 1.1.2 this holds when $\gamma \in (-\infty, 1] \cup [2, +\infty)$.

By Proposition 2, the argument of the infimum upon $\theta$ in the RHS of the above display coincides with the corresponding argument of the infimum of $\phi^W(\theta_T, \theta)$, which is obviously attained at $\theta_T$. This justifies considering a proxy of this minimization problem as a "ML" estimator based on normalized weighted data.

A further interpretation of the MDE in the context of non-parametric Bayesian procedures may also be proposed; this is deferred to a subsequent paper.

Since

$$\phi^{W}(\theta\_T, \theta) = \widetilde{\phi}^{W}(\theta, \theta\_T)$$

the ML estimator is obtained, as in the conventional case, by plugging in the LDP rate. Obviously the "best" plug-in consists in substituting $P_n$, the empirical measure of the sample, for $P_{\theta_T}$, since $P_n$ achieves the best rate of convergence to $P_{\theta_T}$ when compared with any bootstrapped version, which adds "noise" to the sampling. We may therefore call

$$\begin{split} \theta\_{ML}^{W} &:= \arg\inf\_{\theta \in \Theta} \widetilde{\phi}^{W}(\theta, P\_{n}) := \arg\inf\_{\theta \in \Theta} \sum\_{k=1}^{K} P\_{n}(d\_{k})\, \widetilde{\varphi}^{W}\!\left( \frac{P\_{\theta}(d\_{k})}{P\_{n}(d\_{k})} \right) \\ &= \arg\inf\_{\theta \in \Theta} \sum\_{k=1}^{K} P\_{\theta}(d\_{k})\, \varphi^{W}\!\left( \frac{P\_{n}(d\_{k})}{P\_{\theta}(d\_{k})} \right) \end{split} \tag{17}$$

the MLE for the bootstrap sampling; here $\widetilde{\phi}^W$ (with divergence function $\widetilde{\varphi}^W$) is the conjugate divergence of $\phi^W$ (with divergence function $\varphi^W$). Since $\phi^W = \phi_\gamma$ for some $\gamma$, it holds that $\widetilde{\phi}^W = \phi_{1-\gamma}$.

We can also plug in the normalized weighted empirical measure, which is also a proxy of $P_{\theta_T}$ for each run of the weights. This produces a bootstrap estimate of $\theta_T$ through

$$\begin{split} \theta\_{B}^{W} &:= \arg\inf\_{\theta \in \Theta} \widetilde{\phi}^{W}(\theta, \mathfrak{P}\_{n}^{W}) := \arg\inf\_{\theta \in \Theta} \sum\_{k=1}^{K} \mathfrak{P}\_{n}^{W}(d\_{k})\, \widetilde{\varphi}^{W}\!\left( \frac{P\_{\theta}(d\_{k})}{\mathfrak{P}\_{n}^{W}(d\_{k})} \right) \\ &= \arg\inf\_{\theta \in \Theta} \sum\_{k=1}^{K} P\_{\theta}(d\_{k})\, \varphi^{W}\!\left( \frac{\mathfrak{P}\_{n}^{W}(d\_{k})}{P\_{\theta}(d\_{k})} \right) \end{split} \tag{18}$$

where $\mathfrak{P}^W_n$ is defined in (9), assuming $n$ large enough that the sum of the $W_i$'s is not zero. Whenever $P(W = 0) > 0$, these estimators are defined for large $n$ in order that $\mathfrak{P}^W_n(d_k)$ be positive for all $k$. Since $E(W) = 1$, this occurs for large samples.
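A minimal numerical sketch of the two plug-in estimators (17) and (18), under the assumption $\phi^W = \phi_2$ (Pearson) and on a toy binomial model; all helper names are hypothetical, and a grid search stands in for the arg-inf:

```python
import random

def phi2(x):
    """Pearson chi-square generator phi_2(x) = (x - 1)**2 / 2."""
    return 0.5 * (x - 1.0) ** 2

def binom2(theta):
    """Toy model on {0, 1, 2}: Binomial(2, theta)."""
    return [(1.0 - theta) ** 2, 2.0 * theta * (1.0 - theta), theta ** 2]

def divergence(theta, Q):
    """sum_k P_theta(d_k) * phi(Q(d_k) / P_theta(d_k)), the sum in (17)-(18)."""
    P = binom2(theta)
    return sum(p * phi2(q / p) for p, q in zip(P, Q))

def mde(Q):
    """Grid-search proxy for the arg-inf over theta (hypothetical helper)."""
    grid = [0.001 * i for i in range(1, 1000)]
    return min(grid, key=lambda t: divergence(t, Q))

random.seed(1)
theta_T = 0.3
xs = random.choices([0, 1, 2], weights=binom2(theta_T), k=5000)
ws = [random.expovariate(1.0) for _ in xs]
n, total = len(xs), sum(ws)
Pn = [xs.count(k) / n for k in (0, 1, 2)]                       # empirical measure
PWn = [sum(w for x, w in zip(xs, ws) if x == k) / total for k in (0, 1, 2)]

theta_ML = mde(Pn)   # plug-in with P_n, as in (17)
theta_B = mde(PWn)   # plug-in with the normalized weighted measure, as in (18)
assert abs(theta_ML - theta_T) < 0.05 and abs(theta_B - theta_T) < 0.1
```

Both estimators concentrate around $\theta_T$; the bootstrapped one carries the extra weight noise discussed above.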

For a given weighted bootstrapped sample with weights $W_1, \dots, W_n$ leading to the normalized weighted empirical measure $\mathfrak{P}^W_n$, $\theta^W_B$ is the MLE in the sense of (16), hence defined as a proxy of the maximizer of

$$P^{W\prime}\left(\mathfrak{P}\_{n,\theta}^{W'} \in V\_{\epsilon}(\mathfrak{P}\_n^{W})|X\_{1,\theta}, \dots, X\_{n,\theta}, \mathfrak{P}\_n^{W}\right),$$

where the vector $(W'_1, \dots, W'_n)$ is an independent copy of $(W_1, \dots, W_n)$. This estimator usually differs from the bootstrapped version of the MLE based on $P_n$ (see (8)), which is defined for $n$ large enough through

$$
\theta\_{ML}^B := \arg\inf\_{\theta} KL\_m(\theta, \mathfrak{P}\_n^W).
$$

When $\mathcal{Y}$ is not a finite space, an equivalent construction can be developed based on the variational form of the divergence; see [6].

**Remark 1.** *We may also consider cases when the MLE $\theta^W_{ML}$ defined in (17) coincides with the standard MLE $\theta_{ML}$ under i.i.d. sampling, and when its bootstrapped counterpart $\theta^W_B$ defined in (18) coincides with the bootstrapped standard MLE $\theta^b_{ML}$, defined through the likelihood estimating equation where the factor $1/n$ is substituted by the weight $Z_i$. It is proved in Theorem 5 of [11] that whenever $\mathcal{P}_\Theta$ is an exponential family with natural parametrization $\theta \in \mathbb{R}^d$ and sufficient statistic $T$,*

$$P\_{\theta}(d\_j) = \exp\left[T(d\_j)'\theta - \mathcal{C}(\theta)\right], \quad 1 \le j \le K$$

*where the Hessian matrix of $C(\theta)$ is positive definite, then for every divergence pseudo-distance $\phi$ satisfying regularity conditions (therefore including the present cases), $\theta^W_{ML}$ equals $\theta_{ML}$, the classical MLE in $\mathcal{P}_\Theta$, defined as the solution of the normal equation*

$$\frac{1}{n}\sum\_{i=1}^{n} T(X\_i) = \nabla C(\theta\_{ML})$$

*irrespective of $\phi$. Therefore, on regular exponential families and under i.i.d. sampling, all minimum divergence estimators coincide with the MLE (which is indeed one of them). The proof of this result is based on the variational form of the estimated divergence $Q \mapsto \phi(Q, P)$, which coincides with the plug-in version in (17) when the common support of all distributions in $\mathcal{P}_\Theta$ is finite. Following verbatim the proof of Theorem 5 in [11], substituting $P_n$ by $\mathfrak{P}^W_n$, it results that $\theta^W_B$ equals the weighted MLE (the standard generalized bootstrapped MLE $\theta^b_{ML}$) defined through the normal equation*

$$\sum\_{i=1}^{n} Z\_i T(X\_i) = \nabla C(\theta\_{ML}^{b}),$$

*where the Z<sup>i</sup> 's are defined in (10). This fact holds for any choice of the weights, irrespectively on the choice of the divergence function ϕ with the only restriction that it satisfies the mild conditions (RC) in [11]. It results that for those models any generalized bootstrapped MDE coincides with the corresponding standard bootstrapped MLE.*

**Example 1.** *A. When $W$ has a standard Poisson $POI(1)$ distribution, the resulting estimator is the minimum modified Kullback–Leibler one, which takes the usual weighted form of the standard generalized bootstrap MLE*

$$\theta\_B^{POI(1)} := \arg\sup\_{\theta} \sum\_{k=1}^K \left( \frac{\sum\_{i=1}^n W\_i \mathbf{1}\_k(X\_i)}{\sum\_{i=1}^n W\_i} \right) \log P\_\theta(k)$$

*which is defined for $n$ large enough. Also in this case $\theta^W_{ML}$ coincides with the standard MLE.*

*B. If $W$ has an Inverse Gaussian distribution $IG(1,1)$, then $\varphi(x) = \varphi_{-1}(x) = \frac{1}{2}(x-1)^2/x$ for $x > 0$, and the ML estimator minimizes the Pearson chi-square divergence with generator function $\varphi_2(x) = \frac{1}{2}(x-1)^2$, which is defined on $\mathbb{R}$.*

*C. If $W$ follows a normal distribution with expectation and variance 1, then the resulting divergence is the Pearson chi-square divergence with generator $\varphi_2(x)$, and the resulting estimator minimizes the Neyman chi-square divergence with $\varphi(x) = \varphi_{-1}(x)$.*

*D. When $W$ has a Compound Poisson Gamma distribution $C(POI(2), \Gamma(2,1))$, the corresponding divergence has generator $\varphi_{1/2}(x) = 2\left(\sqrt{x} - 1\right)^2$, which is self-conjugate, whence the ML estimator is the minimum Hellinger distance one.*
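The conjugacy relations used in Examples B–D can be checked directly, assuming the usual conjugate-generator convention $\widetilde{\varphi}(x) = x\,\varphi(1/x)$:

```python
import math

def phi_2(x):
    """Pearson chi-square generator phi_2(x) = (x - 1)**2 / 2."""
    return 0.5 * (x - 1.0) ** 2

def phi_m1(x):
    """Neyman chi-square generator phi_{-1}(x) = (x - 1)**2 / (2x), x > 0."""
    return 0.5 * (x - 1.0) ** 2 / x

def phi_half(x):
    """Hellinger-type generator phi_{1/2}(x) = 2 * (sqrt(x) - 1)**2."""
    return 2.0 * (math.sqrt(x) - 1.0) ** 2

def conjugate(phi, x):
    """Conjugate generator phi~(x) = x * phi(1/x)."""
    return x * phi(1.0 / x)

for x in (0.1, 0.5, 1.0, 2.0, 7.3):
    # conjugate of phi_2 is phi_{-1} (gamma -> 1 - gamma), and phi_{1/2} is self-conjugate
    assert abs(conjugate(phi_2, x) - phi_m1(x)) < 1e-9
    assert abs(conjugate(phi_half, x) - phi_half(x)) < 1e-9
```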

#### **4. Bahadur Efficiency of Minimum Divergence Tests under Generalized Bootstrap**

In [27], Efron and Tibshirani suggest the bootstrap as a valuable approach for testing, based on bootstrapped samples. We show that bootstrap testing for parametric models based on appropriate divergence statistics enjoys maximal Bahadur efficiency with respect to any bootstrap test statistic.

The standard approach to Bahadur efficiency can be adapted to the present generalized bootstrapped tests as follows.

Consider the test of the simple null hypothesis H0: $\theta_T = \theta$ versus the simple alternative H1: $\theta_T = \theta'$.

We consider two competing statistics for this problem. The first one is based on the bootstrap estimate of $\widetilde{\phi}^W(\theta, \theta_T)$, namely

$$T\_{n,X} := \tilde{\Phi}\left(\theta, \mathfrak{P}\_{n,X}^W\right) = T\left(\mathfrak{P}\_{n,X}^W\right)$$

which leads to rejecting H0 for large values, since $\lim_{n \to \infty} T_{n,X} = 0$ whenever H0 holds. In the above display we have emphasized in $\mathfrak{P}^W_{n,X}$ the fact that the RV's $X_i$ have been used. Let

$$L\_n(t) := P^W(T\_{n,X} > t | X\_1, \dots, X\_n).$$

We use $P^W$ to emphasize that the randomness is due to the weights. Consider now a set of RV's $Z_1, \dots, Z_n$ extracted from a sequence such that

$$\lim\_{n \to \infty} P\_{n,Z} = P\_{\theta'}$$

a.s.; we have denoted by $P_{n,Z}$ the empirical measure of $(Z_1, \dots, Z_n)$; accordingly, define $\mathfrak{P}^{W'}_{n,Z}$, the normalized weighted empirical measure of the $Z_i$'s, making use of weights $W'_1, \dots, W'_n$ which are i.i.d. copies of $(W_1, \dots, W_n)$, drawn independently from $(W_1, \dots, W_n)$. Define accordingly

$$T\_{n,Z} := \tilde{\Phi}\left(\theta, \mathfrak{P}\_{n,Z}^{W'}\right) = T\left(\mathfrak{P}\_{n,Z}^{W'}\right).$$

Define

$$L\_n(T\_{n,Z}) := P^W(T\_{n,X} > T\_{n,Z} | X\_1, \dots, X\_n),$$

which is a RV (as a function of $T_{n,Z}$). It holds

$$\lim\_{n \to \infty} T\_{n,Z} = \tilde{\Phi}(\theta, \theta') \quad \text{a.s.}$$

and therefore the Bahadur slope for the test with statistic $T_{n,X}$ is $\Phi(\theta', \theta)$, as follows from

$$\begin{aligned} \lim\_{n \to \infty} \frac{1}{n} \log L\_n(T\_{n,Z}) &= -\inf \left\{ \Phi(Q, \theta\_T) : \widetilde{\Phi}(\theta, Q) > \widetilde{\Phi}(\theta, \theta') \right\} \\ &= -\inf \{ \Phi(Q, \theta\_T) : \Phi(Q, \theta) > \Phi(\theta', \theta) \} \\ &= -\Phi(\theta', \theta) \end{aligned}$$

If *θ<sup>T</sup>* = *θ*. Under H0 the rate of decay of the *p*-value corresponding to a sampling under H1 is captured through the divergence Φ(*θ* 0 , *θ*).

Consider now a competing test statistic $S\left(\mathfrak{P}^W_{n,X}\right)$ and evaluate its Bahadur slope. Similarly as above it holds, assuming continuity of the functional $S$ on $\mathcal{S}^K$,

$$\begin{aligned} &\lim\_{n\to\infty} \frac{1}{n} \log P^W \left( S \left( \mathfrak{P}\_{n,X}^W \right) > S \left( \mathfrak{P}\_{n,\mathbb{Z}}^W \right) \Big| \mathbf{X}\_{1}, \dots, \mathbf{X}\_{n} \right) \\ &= - \inf \{ \Phi(Q, \theta\_T) : S(Q) > S(\theta') \} \\ &\geq - \Phi(\theta', \theta\_T) \end{aligned}$$

as follows from the continuity of $Q \mapsto \Phi(Q, \theta_T)$. Hence, the Bahadur slope of the test based on $S\left(\mathfrak{P}^W_{n,X}\right)$ is at most $\Phi(\theta', \theta)$.

We have proved that the chances under H0 for the statistic $T_{n,X}$ to exceed a value obtained under H1 are (asymptotically) less than the corresponding chances associated with any other statistic based on the same bootstrapped sample; as such, it is most specific on this scale with respect to any competing one. Namely, the following result holds:

**Proposition 3.** *Under the weighted sampling, the test statistic $T\left(\mathfrak{P}^W_{n,X}\right)$ is the most efficient among all tests which are empirical versions of continuous functionals on $\mathcal{S}^K$.*

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Acknowledgments:** The author thanks the Editor and two anonymous referees for many remarks which helped to improve this paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

*A Heuristic Derivation of the Conditional LDP for the Normalized Weighted Empirical Measure*

The following sketch of proof gives the core argument which yields Proposition 1; a proof adapted to a more abstract setting can be found in [22], following their Theorem 3.2 and Corollary 3.3, but we find it useful to present a proof which reduces to simple arguments. We look at the probability of the event

$$P\_n^W \in V(R) \tag{A1}$$

for a given vector $R$ in $\mathbb{R}^K$, where $V(R)$ denotes a neighborhood of $R$, defined through

$$(Q \in V(R)) \Longleftrightarrow (Q(d\_l) \approx R(d\_l);\ 1 \le l \le K)$$

We denote by *P* the distribution of the RV *X* so that *P<sup>n</sup>* converges to *P* a.s.

Evaluating loosely the probability of the event defined in (A1) yields, denoting by $P_{X_1^n}$ the conditional distribution given $(X_1, \dots, X_n)$,

$$\begin{split} P\_{X\_1^n}\left(P\_n^W \in V(R)\right) &= P\_{X\_1^n}\left(\bigcap\_{l=1}^{K}\left(\frac{1}{n}\sum\_{i=1}^{n} W\_i\,\delta\_{X\_i}(d\_l) \approx R(d\_l)\right)\right) \\ &= P\_{X\_1^n}\left(\bigcap\_{l=1}^{K}\left(\frac{1}{n}\sum\_{i=1}^{n\_l} W\_{i,l} \approx R(d\_l)\right)\right) \\ &= \prod\_{l=1}^{K} P\_{X\_1^n}\left(\frac{1}{n\_l}\sum\_{i=1}^{n\_l} W\_{i,l} \approx \frac{n}{n\_l}\, R(d\_l)\right) \\ &= \prod\_{l=1}^{K} P\_{X\_1^n}\left(\frac{1}{n\_l}\sum\_{i=1}^{n\_l} W\_{i,l} \approx \frac{R(d\_l)}{P(d\_l)}\right) \end{split}$$

where we used repeatedly the fact that the r.v.'s $W_i$ are i.i.d. In the above display, from the second line on, the r.v.'s $W_{i,l}$ are independent copies of $W_1$ for all $i$ and $l$. Here $n_l$ is the number of $X_i$'s which equal $d_l$, and the $W_{i,l}$ are the weights corresponding to these $X_i$'s. We used the convergence of $n_l/n$ to $P(d_l)$ in the last line.

Now, for each $l$ in $\{1, 2, \dots, K\}$, by the Cramér LDP for the empirical mean it holds that

$$\frac{1}{n\_l}\log P\left(\frac{1}{n\_l}\sum\_{i=1}^{n\_l} W\_{i,l} \approx \frac{R(d\_l)}{P(d\_l)}\right) \approx -\varphi^W\!\left(\frac{R(d\_l)}{P(d\_l)}\right)$$

i.e.,

$$\frac{1}{n}\log P\left(\frac{1}{n\_l}\sum\_{i=1}^{n\_l} W\_{i,l} \approx \frac{R(d\_l)}{P(d\_l)}\right) \approx -P(d\_l)\,\varphi^W\!\left(\frac{R(d\_l)}{P(d\_l)}\right),$$

as follows from the classical Cramér LDP, and therefore

$$\begin{aligned} &\frac{1}{n}\log P\_{X\_1^n}\left(P\_n^W \in V(R)\right) \\ &\approx \frac{1}{n}\log P\_{X\_1^n}\left(\bigcap\_{l=1}^K \left(\frac{1}{n}\sum\_{i=1}^{n\_l} W\_{i,l} \approx R(d\_l)\right)\right) \\ &\rightarrow -\sum\_{l=1}^K \phi^W\left(\frac{R(d\_l)}{P(d\_l)}\right)P(d\_l) = -\phi^W(R,P) \end{aligned}$$

where the limit in the last line is obtained by letting $n \to \infty$.

A precise derivation of Proposition 1 involves two arguments: firstly, for a set $\Omega \subset \mathbb{R}^K$, a covering procedure by small balls, which allows the above derivation to be used locally; secondly, the regularity assumption (11), which allows obtaining proper limits in the standard LDP statement.

The argument leading from Proposition 1 to Theorem 1 can now be summarized. For a subset $\Omega$ of $\mathcal{S}^K$ with non-void interior it holds that

$$\left(\mathfrak{P}\_n^W \in \Omega\right) = \bigcup\_{m \neq 0} \left( \left( P\_n^W \in m\Omega \right) \cap \left( \sum\_{i=1}^n W\_i = m \right) \right)$$

and $\left(P^W_n \in m\Omega\right) \subset \left(\sum_{i=1}^{n} W_i = m\right)$ for all $m \neq 0$. Therefore

$$P\_{X\_1^n} \left( \mathfrak{P}\_n^W \in \Omega \right) = P\_{X\_1^n} \left( \bigcup\_{m \neq 0} \left( P\_n^W \in m\Omega \right) \right).$$

Making use of Proposition 1

$$\lim\_{n \to \infty} \frac{1}{n} \log P\_{\mathbf{X}\_1^n} \left( \mathfrak{P}\_n^W \in \Omega \right) = -\phi^W \left( \bigcup\_{m \neq 0} m \Omega, P \right).$$

Now

$$\phi^W \left( \bigcup\_{m \neq 0} m \Omega, P \right) = \inf\_{m \neq 0} \inf\_{Q \in \Omega} \phi^W(m Q, P).$$

We have sketched the arguments leading to Theorem 1; see [19] for details.

#### **References**


## *Article* **Error Exponents and** *α***-Mutual Information**

**Sergio Verdú**

Independent Researcher, Princeton, NJ 08540, USA; verdu@informationtheory.org

**Abstract:** Over the last six decades, the representation of error exponent functions for data transmission through noisy channels at rates below capacity has seen three distinct approaches: (1) Through Gallager's *E*<sup>0</sup> functions (with and without cost constraints); (2) large deviations form, in terms of conditional relative entropy and mutual information; (3) through the *α*-mutual information and the Augustin–Csiszár mutual information of order *α* derived from the Rényi divergence. While a fairly complete picture has emerged in the absence of cost constraints, there have remained gaps in the interrelationships between the three approaches in the general case of cost-constrained encoding. Furthermore, no systematic approach has been proposed to solve the attendant optimization problems by exploiting the specific structure of the information functions. This paper closes those gaps and proposes a simple method to maximize Augustin–Csiszár mutual information of order *α* under cost constraints by means of the maximization of the *α*-mutual information subject to an exponential average constraint.

**Keywords:** information measures; relative entropy; Rényi divergence; mutual information; *α*-mutual information; Augustin–Csiszár mutual information; data transmission; error exponents; large deviations

#### **1. Introduction**

#### *1.1. Phase 1: The MIT School*

The capacity *C* of a stationary memoryless channel is equal to the maximal symbolwise input–output mutual information. Not long after Shannon [1] established this result, Rice [2] observed that, when operating at any encoding rate $R < C$, there exist codes whose error probability vanishes exponentially with blocklength, with a speed of decay that decreases as *R* approaches *C*. This early observation moved the center of gravity of information theory research towards the quest for the reliability function, a term coined by Shannon [3] to refer to the maximal achievable exponential decay as a function of *R*. The MIT information theory school, and most notably Elias [4], Feinstein [5], Shannon [3,6], Fano [7], Gallager [8,9], and Shannon, Gallager and Berlekamp [10,11], succeeded in upper/lower bounding the reliability function by the sphere-packing error exponent function and the random coding error exponent function, respectively. Fortunately, these functions coincide for rates between *C* and a certain value, called the critical rate, thereby determining the reliability function in that region. The influential 1968 textbook by Gallager [9] set down the major error exponent results obtained during Phase 1 of research on this topic, including the expurgation technique to improve upon the random coding error exponent lower bound. Two aspects of those early works (and of Dobrushin's contemporary papers [12,13] on the topic) stand out:

(a) The error exponent functions were expressed as the result of the Karush–Kuhn–Tucker optimization of ad hoc functions which, unlike mutual information, carried little insight. In particular, during the first phase, center stage is occupied by the parametrized function of the input distribution $P_X$ and the random transformation (or "channel") $P_{Y|X}$,

**Citation:** Verdú, S. Error Exponents and *α*-Mutual Information. *Entropy* **2021**, *23*, 199. https://doi.org/ 10.3390/e23020199

Received: 7 December 2020 Accepted: 28 January 2021 Published: 5 February 2021


**Copyright:** © 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

$$E\_0(\rho, P\_X) = -\log \sum\_{y \in \mathcal{B}} \left( \sum\_{\mathbf{x} \in \mathcal{A}} P\_X(\mathbf{x}) P\_{Y|X}^{\frac{1}{1+\rho}}(y|\mathbf{x}) \right)^{1+\rho} \tag{1}$$

introduced by Gallager in [8].
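For concreteness, (1) is straightforward to evaluate for a small discrete channel; the sketch below (hypothetical helper names) computes $E_0(\rho, P_X)$ for a binary symmetric channel and checks two classical properties: $E_0(0, P_X) = 0$, and the slope of $\rho \mapsto E_0(\rho, P_X)$ at $\rho = 0$ equals the mutual information $I(P_X; P_{Y|X})$ in nats:

```python
import math

def gallager_E0(rho, PX, PYX):
    """Gallager's E0 of (1): -log sum_y (sum_x P_X(x) P_{Y|X}(y|x)**(1/(1+rho)))**(1+rho)."""
    total = 0.0
    for y in range(len(PYX[0])):
        inner = sum(PX[x] * PYX[x][y] ** (1.0 / (1.0 + rho)) for x in range(len(PX)))
        total += inner ** (1.0 + rho)
    return -math.log(total)

def mutual_information(PX, PYX):
    """I(P_X; P_{Y|X}) in nats, for checking the slope of E0 at rho = 0."""
    PY = [sum(PX[x] * PYX[x][y] for x in range(len(PX))) for y in range(len(PYX[0]))]
    return sum(PX[x] * PYX[x][y] * math.log(PYX[x][y] / PY[y])
               for x in range(len(PX)) for y in range(len(PY)) if PYX[x][y] > 0)

# Binary symmetric channel with crossover 0.1, equiprobable inputs.
PYX = [[0.9, 0.1], [0.1, 0.9]]
PX = [0.5, 0.5]
assert abs(gallager_E0(0.0, PX, PYX)) < 1e-12            # E0(0, PX) = 0
slope = gallager_E0(1e-6, PX, PYX) / 1e-6                # ~ dE0/drho at rho = 0
assert abs(slope - mutual_information(PX, PYX)) < 1e-4   # slope equals I(PX; P_{Y|X})
```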

(b) Despite the large-deviations nature of the setup, none of the tools from that then-nascent field (other than the Chernoff bound) found their way to the first phase of the work on error exponents; in particular, relative entropy, introduced by Kullback and Leibler [14], failed to put in an appearance.

To this date, the reliability function remains open for low rates even for the binary symmetric channel, despite a number of refined converse and achievability results (e.g., [15–21]) obtained since [9]. Our focus in this paper is not on converse/achievability techniques but on the role played by various information measures in the formulation of error exponent results.

#### *1.2. Phase 2: Relative Entropy*

The second phase of the error exponent research was pioneered by Haroutunian [22] and Blahut [23], who infused the expressions for the error exponent functions with meaning by incorporating relative entropy. The sphere-packing error exponent function corresponding to a random transformation $P_{Y|X}$ is given as

$$E\_{\mathrm{sp}}(R) = \sup\_{P\_X} \min\_{\substack{Q\_{Y|X} \colon \mathcal{A} \to \mathcal{B} \\ I(P\_X, Q\_{Y|X}) \le R}} D(Q\_{Y|X} \| P\_{Y|X} | P\_X). \tag{2}$$

Roughly speaking, optimal codes of rate $R < C$ incur errors due to atypical channel behavior, and large deviations establishes that the overwhelmingly most likely such behavior can be explained as if the channel were supplanted by the one with mutual information bounded by $R$ which is closest to the true channel in conditional relative entropy $D(Q_{Y|X} \| P_{Y|X} | P_X)$. Within the confines of finite-alphabet memoryless channels, this direction opened the possibility of using the combinatorial method of types to obtain refined results, robustifying the choice of the optimal code against incomplete knowledge of the channel. The 1981 textbook by Csiszár and Körner [24] summarizes the main results obtained during Phase 2.

#### *1.3. Phase 3: Rényi Information Measures*

Entropy and relative entropy were generalized by Rényi [25], who introduced the notions of Rényi entropy and Rényi divergence of order *α*. He arrived at Rényi entropy by relaxing the axioms that Shannon proposed in [1] and showed to be satisfied by no measure but entropy. Shortly after [25], Campbell [26] realized the operational role of Rényi entropy in variable-length data compression if the usual average encoding length criterion $\mathbb{E}[\ell(c(X))]$ is replaced by an exponential average $\alpha^{-1} \log \mathbb{E}[\exp(\alpha\, \ell(c(X)))]$. Arimoto [27] put forward a generalized conditional entropy inspired by Rényi's measures (now known as the Arimoto–Rényi conditional entropy) and proposed a generalized mutual information by taking the difference between the Rényi entropy and the Arimoto–Rényi conditional entropy. The role of the Arimoto–Rényi conditional entropy in the analysis of the error probability of Bayesian *M*-ary hypothesis testing problems has been recently shown in [28], tightening and generalizing a number of results dating back to Fano's inequality [29].

Phase 3 of the error exponent research was pioneered by Csiszár [30] where he established a connection between Gallager's *E*<sup>0</sup> function and Rényi divergence by means of a Bayesian measure of the discrepancy among a finite collection of distributions introduced by Sibson [31]. Although [31] failed to realize its connection to mutual information, Csiszár [30,32] noticed that it could be viewed as a natural generalization of mutual information. Arimoto [27] also observed that the unconstrained maximization of his generalized mutual information measure with respect to the input distribution coincides with a scaled version of the maximal *E*<sup>0</sup> function. This resulted in an extension of the Arimoto-Blahut algorithm useful for the computation of error exponent functions [33] (see also [34]) for finite-alphabet memoryless channels.

Within Haroutunian's framework [22] applied in the context of the method of types, Poltyrev [35] proposed an alternative to Gallager's *E*<sup>0</sup> function, defined by means of a cumbersome maximization over a reverse random transformation. This measure turned out to coincide (modulo different parametrizations) with another generalized mutual information introduced four years earlier by Augustin in his unpublished thesis [36], by means of a minimization with respect to an output probability measure.

The key contribution in the development of this third phase is Csiszár's paper [32], where he makes a compelling case for the adoption of Rényi's information measures in the large deviations analysis of lossless data compression, hypothesis testing and data transmission. Recall that more than two decades earlier, Csiszár [30] had already established the connection of Gallager's *E*<sup>0</sup> function and the generalized mutual information inspired by Sibson [31], which, henceforth, we refer to as the *α*-mutual information. Therefore, its relevance to the error exponent analysis of error correcting codes had already been established. Incidentally, more recently, another operational role was found for *α*-mutual information in the context of the large deviations analysis of composite hypothesis testing [37]. In addition to *α*-mutual information, and always working with discrete alphabets, Csiszár [32] considers the generalized mutual informations due to Arimoto [27], and to Augustin [36], which we refer to as the Augustin–Csiszár mutual information of order *α*. Csiszár shows that all three of those generalizations of mutual information coincide upon their unconstrained maximization with respect to the input distribution. Further relationships among those Rényi-based generalized mutual informations have been obtained in recent years in [38–45]. In [32], the maximal *α*-mutual information, or generalized capacity of order *α*, finds an operational characterization as a generalized cutoff rate, an equivalent way to express the reliability function. This would have been the final word on the topic if it weren't for its limitation to discrete-alphabet channels and, more importantly, to encoding without cost constraints.

#### *1.4. Cost Constraints*

If the transmitted codebook is cost-constrained, i.e., every codeword $(c_1, \ldots, c_n)$ is forced to satisfy $\sum_{i=1}^{n} \mathsf{b}(c_i) \le n\,\theta$ for some nonnegative cost function $\mathsf{b}(\cdot)$, then the channel capacity is equal to the input–output mutual information maximized over input probability measures restricted to satisfy $\mathbb{E}[\mathsf{b}(X)] \le \theta$. Gallager [9] incorporated cost constraints in his treatment of error exponents by generalizing (1) to the function

$$E_0(\rho, P_X, r, \theta) = -\log \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} P_X(x) \exp(r\, \mathsf{b}(x) - r\, \theta)\, P_{Y|X}^{\frac{1}{1+\rho}}(y|x) \right)^{1+\rho} \tag{3}$$

with which he was able to prove an achievability result invoking Shannon's random coding technique [1]. Gallager also suggested, in the footnote on page 329 of [9], that the converse technique of [10] is amenable to extension to prove a sphere-packing converse based on (3). An important limitation, however, is that that technique only applies to constant-composition codes (all codewords have the same empirical distribution). A more powerful converse circumventing that limitation (at least for symmetric channels) was given by [46], also expressing the upper bound on the reliability function by optimizing (3) with respect to *ρ*, *r* and $P_X$. A notable success of the approach based on the optimization of (3) was the determination of the reliability function (for all rates below capacity) of the direct detection photon channel [47].
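To make (3) concrete, the sketch below evaluates the cost-constrained $E_0$ function for a finite-alphabet channel by direct summation. The binary channel matrix, the cost function, and all parameter values are hypothetical illustrations (not taken from [9]); natural logarithms are used.

```python
import numpy as np

def E0(rho, PX, PYgX, b, r, theta):
    """Cost-constrained Gallager function E_0(rho, P_X, r, theta) of (3), in nats."""
    PX = np.asarray(PX, dtype=float)
    PYgX = np.asarray(PYgX, dtype=float)   # PYgX[x, y] = P_{Y|X}(y|x)
    b = np.asarray(b, dtype=float)
    weights = PX * np.exp(r * b - r * theta)          # inner weights over inputs x
    inner = weights @ PYgX ** (1.0 / (1.0 + rho))     # one value per output y
    return -np.log(np.sum(inner ** (1.0 + rho)))

# illustrative binary-input, binary-output channel with a linear cost
PX = [0.5, 0.5]
PYgX = [[0.9, 0.1], [0.2, 0.8]]
b = [0.0, 1.0]
val = E0(0.5, PX, PYgX, b, r=0.3, theta=0.4)
```

For $\rho = 0$ and $r = 0$ the inner sum reduces to $P_Y$, so $E_0 = -\log 1 = 0$, a quick consistency check on the implementation.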

In contrast, the Phase 2 expression (2) for the sphere-packing error exponent for cost-constrained channels is much more natural and similar to the way the expression for channel capacity is impacted by cost constraints, namely, we simply constrain the maximization in (2) to satisfy $\mathbb{E}[\mathsf{b}(X)] \le \theta$. Unfortunately, no general methods to solve the ensuing optimization have been reported.

Once cost constraints are incorporated, the equivalence among the maximal *α*-mutual information, maximal order-*α* Augustin–Csiszár mutual information, and maximal Arimoto mutual information of order *α* breaks down. Of those three alternatives, it is the maximal Augustin–Csiszár mutual information under cost constraints that appears in the error exponent functions. The challenge is that Augustin–Csiszár mutual information is much harder to evaluate, let alone maximize, than *α*-mutual information. The Phase 3 effort to encompass cost constraints was started by Augustin [36] and continued recently by Nakiboglu [43]. Their focus was to find a way to express (3) in terms of Rényi information measures. Although, as we explain in Item 62, they did not quite succeed, their efforts were instrumental in developing key properties of the Augustin–Csiszár mutual information.

#### *1.5. Organization*

To enhance readability and ease of reference, the rest of this work is organized in 81 items, grouped into 13 sections and an appendix.

Basic notions and notation (including the key concept of *α*-response) are collected in Section 2. Unlike much of the literature on the topic, we do not restrict attention to discrete input/output alphabets, nor do we impose any topological structures on them.

The paper is essentially self-contained. Section 3 covers the required background material on relative entropy, Rényi divergence of order *α*, and their conditional versions, including a key representation of Rényi divergence in terms of relative entropies and a tilted probability measure, and additive decompositions of Rényi divergence involving the *α*-response.

Section 4 studies the basic properties of *α*-mutual information and order-*α* Augustin–Csiszár mutual information. This includes their variational representations in terms of conventional (non-Rényi) information measures such as conditional relative entropy and mutual information, which are particularly simple to show in the main range of interest in applications to error exponents, namely, $\alpha \in (0,1)$.

The interrelationships between *α*-mutual information and order-*α* Augustin–Csiszár mutual information are covered in Section 5, which introduces the dual notions of the *α*-adjunct and the ⟨*α*⟩-adjunct of an input probability measure.

The maximizations with respect to the input distribution of *α*-mutual information and order-*α* Augustin–Csiszár mutual information account for their role in the fundamental limits in data transmission through noisy channels. Section 6 gives a brief review of the results in [45] for the maximization of *α*-mutual information. For Augustin–Csiszár mutual information, Section 7 covers its unconstrained maximization, which coincides with its *α*-mutual information counterpart. Section 8 proposes an approach to find $C^{\mathsf{c}}_{\alpha}(\theta)$, the maximal Augustin–Csiszár mutual information of order $\alpha \in (0,1)$ subject to $\mathbb{E}[\mathsf{b}(X)] \le \theta$. Instead of trying to identify directly the input distribution that maximizes Augustin–Csiszár mutual information, the method seeks its ⟨*α*⟩-adjunct. This is tantamount to maximizing *α*-mutual information over a larger set of distributions.

Section 9 shows

$$\rho\, C^{\mathsf{c}}_{\frac{1}{1+\rho}}(\theta) = \min_{r \ge 0}\, \max_{P_X} E_0(\rho, P_X, r, \theta), \tag{4}$$

where the maximization on the right side is unconstrained. In other words, the minimax of Gallager's *E*<sup>0</sup> function (3) with cost constraints is shown to be equal to the maximal Augustin–Csiszár mutual information, thereby bridging the existing gap between the Phase 1 and Phase 3 representations alluded to earlier in this introduction.

As in [48], Section 10 defines the sphere-packing and random-coding error exponent functions in the natural canonical form of Phase 2 (e.g., (2)), and gives a very simple proof of the nexus between the Phase 2 and Phase 3 representations, namely,

$$E_{\mathsf{sp}}(R) = \sup_{\rho \ge 0} \left\{ \rho\, C^{\mathsf{c}}_{\frac{1}{1+\rho}}(\theta) - \rho\, R \right\},\tag{5}$$

with or without cost constraints. In this regard, we note that, although all the ingredients required were already present at the time the revised version of [24] was published three decades after the original, [48] does not cover the role of Rényi's information measures in channel error exponents.

Examples illustrating the proposed method are given in Sections 11 and 12 for the additive Gaussian noise channel under a quadratic cost function, and the additive exponential noise channel under a linear cost function, respectively. Simple parametric expressions are given for the error exponent functions, and the least favorable channels that account for the most likely error mechanism (Section 1.2) are identified in both cases.

#### **2. Relative Information and Information Density**

We begin with basic terminology and notation required for the subsequent development.

3. Given probability measures $P$ and $Q$ on the same measurable space, with $P \ll Q$ (i.e., $Q$ dominates $P$), the relative information is defined as

$$\imath_{P\|Q}(a) = \log \frac{\mathrm{d}P}{\mathrm{d}Q}(a) \in [-\infty, +\infty), \quad a \in \mathcal{A}.\tag{6}$$

As with the Radon–Nikodym derivative, any identity involving relative informations can be changed on a set of measure zero under the reference measure without incurring any contradiction. If $P \ll Q \ll R$, then the chain rule of Radon–Nikodym derivatives yields

$$\imath_{P\|Q}(a) + \imath_{Q\|R}(a) = \imath_{P\|R}(a), \quad a \in \mathcal{A}.\tag{7}$$

Throughout the paper, the base of exp and log is the same and chosen by the reader unless explicitly indicated otherwise. We frequently define a probability measure $P$ from the specification of $\imath_{P\|Q}$ and a probability measure $Q \gg P$, since

$$P(A) = \int\_{A} \exp\left(\imath\_{P\parallel Q}(a)\right) \mathrm{d}Q(a), \quad A \in \mathcal{F}. \tag{8}$$

If *<sup>X</sup>* " *<sup>P</sup>* and *<sup>Y</sup>* " *<sup>Q</sup>*, it is often convenient to write *<sup>ı</sup>X*}*<sup>Y</sup>* <sup>p</sup>*x*<sup>q</sup> instead of *<sup>ı</sup>P*}*Q*p*x*q. Note that

$$\mathbb{E}\left[\exp\left(\imath\_{X\parallel Y}(Y)\right)\right] = 1.\tag{9}$$
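Identity (9) is easy to verify on a finite alphabet: with $X \sim P$, $Y \sim Q$ and $P \ll Q$, the expectation $\sum_a Q(a)\,P(a)/Q(a)$ telescopes to 1. The two distributions below are arbitrary illustrative choices, and natural logarithms are used.

```python
import numpy as np

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])
info = np.log(P / Q)                       # relative information i_{X||Y} in nats
expectation = np.sum(Q * np.exp(info))     # E[exp(i_{X||Y}(Y))], Y ~ Q
```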

**Example 1.** *If $X \sim \mathcal{N}\left(\mu_X, \sigma_X^2\right)$ (Gaussian with mean $\mu_X$ and variance $\sigma_X^2$) and $Y \sim \mathcal{N}\left(\mu_Y, \sigma_Y^2\right)$, then*
$$\iota\_{X\parallel Y}(a) = \frac{1}{2}\log\frac{\sigma\_Y^2}{\sigma\_X^2} + \frac{1}{2}\left(\frac{(a-\mu\_Y)^2}{\sigma\_Y^2} - \frac{(a-\mu\_X)^2}{\sigma\_X^2}\right)\log\text{e.}\tag{10}$$
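A quick numerical sanity check of (10): the Gaussian relative information must equal the log-ratio of the two densities. The parameter values are arbitrary, and natural logarithms are used (so log e = 1).

```python
import numpy as np

mX, vX, mY, vY = 0.3, 1.5, -0.7, 0.8   # illustrative means and variances

def gauss_pdf(a, m, v):
    return np.exp(-(a - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def rel_info(a):
    # right side of (10), in nats
    return 0.5 * np.log(vY / vX) + 0.5 * ((a - mY) ** 2 / vY - (a - mX) ** 2 / vX)

a = np.linspace(-4.0, 4.0, 9)
direct = np.log(gauss_pdf(a, mX, vX) / gauss_pdf(a, mY, vY))
```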

4. Let $(\mathcal{A}, \mathcal{F})$ and $(\mathcal{B}, \mathcal{G})$ be measurable spaces, known as the input and output spaces, respectively. Likewise, $\mathcal{A}$ and $\mathcal{B}$ are referred to as the input and output alphabets, respectively. The simplified notation $P_{Y|X}\colon \mathcal{A} \to \mathcal{B}$ denotes a random transformation.

5. If the input probability measure to the random transformation is $P_X \in \mathcal{P}_{\mathcal{A}}$, the corresponding joint probability measure is denoted by $P_X P_{Y|X} \in \mathcal{P}_{\mathcal{A}\times\mathcal{B}}$ (or, interchangeably, $P_{Y|X} P_X$). The notation $P_X \to P_{Y|X} \to Q$ indicates that the output marginal of the joint probability measure $P_X P_{Y|X}$ is denoted by $Q \in \mathcal{P}_{\mathcal{B}}$, namely,
$$Q(B) = \int P_{Y|X}(B|x)\, \mathrm{d}P_X(x) = \mathbb{E}\left[P_{Y|X}(B|X)\right], \quad B \in \mathcal{G}.\tag{11}$$

6. If *<sup>P</sup><sup>X</sup>* <sup>Ñ</sup> *<sup>P</sup>Y*|*<sup>X</sup>* <sup>Ñ</sup> *<sup>P</sup><sup>Y</sup>* and *<sup>P</sup>Y*|*X*"*<sup>a</sup>* ! *<sup>P</sup>Y*, the information density *<sup>ı</sup>X*;*<sup>Y</sup>* : A <sup>ˆ</sup> B Ñ r´8, 8q is defined as 6. If *<sup>P</sup><sup>X</sup>* → *<sup>P</sup>Y*|*<sup>X</sup>* → *<sup>P</sup><sup>Y</sup>* and *<sup>P</sup>Y*|*X*=*<sup>a</sup> <sup>P</sup>Y*, the information density *<sup>ı</sup>X*;*<sup>Y</sup>* : A × B → [−∞, ∞) is defined as

$$\imath_{X;Y}(a;b) = \imath_{P_{Y|X=a}\,\|\,P_Y}(b), \quad (a,b) \in \mathcal{A} \times \mathcal{B}.\tag{12}$$

Following Rényi's terminology [49], if $P_X P_{Y|X} \ll P_X \times P_Y$, the dependence between *X* and *Y* is said to be regular, and the information density can be defined on $(x, y) \in \mathcal{A} \times \mathcal{B}$. Henceforth, we assume that $P_{Y|X}$ is such that the dependence between its input and output is regular regardless of the input probability measure. For example, if $X = Y \in \mathbb{R}$, then $P_{Y|X=a}(A) = 1\{a \in A\}$, and their dependence is not regular, since for any $P_X$ with non-discrete components $P_{XY} \not\ll P_X \times P_Y$.

7. Let $\alpha > 0$, and $P_X \to P_{Y|X} \to P_Y$. The *α*-response to $P_X \in \mathcal{P}_{\mathcal{A}}$ is the output probability measure $P_{Y[\alpha]} \ll P_Y$ with relative information given by

$$\imath_{Y[\alpha]\|Y}(y) = \frac{1}{\alpha} \log \mathbb{E}[\exp(\alpha\, \imath_{X;Y}(X;y) - \kappa_{\alpha})], \quad X \sim P_X, \tag{13}$$

where $\kappa_\alpha$ is a scalar that guarantees that $P_{Y[\alpha]}$ is a probability measure. Invoking (9), we obtain

$$\kappa_\alpha = \alpha \log \mathbb{E}\left[\mathbb{E}^{\frac{1}{\alpha}}\left[\exp(\alpha\, \imath_{X;Y}(X;\bar{Y}))\,\middle|\,\bar{Y}\right]\right], \quad (X,\bar{Y}) \sim P_X \times P_Y. \tag{14}$$

For brevity, the dependence of $\kappa_\alpha$ on $P_X$ and $P_{Y|X}$ is omitted. Jensen's inequality applied to $(\cdot)^\alpha$ results in $\kappa_\alpha \le 0$ for $\alpha \in (0,1)$ and $\kappa_\alpha \ge 0$ for $\alpha > 1$. Although the *α*-response has a long record of services to information theory, this terminology and notation were introduced recently in [45]. Alternative terminology and notation were proposed in [42], which refers to the *α*-response as the order-*α* Rényi mean. Note that $\kappa_1 = 0$ and the 1-response to $P_X$ is $P_Y$. If $p_{Y[\alpha]}$ and $p_{Y|X}$ denote the densities of $P_{Y[\alpha]}$ and $P_{Y|X}$ with respect to some common dominating measure, then (13) becomes

$$p_{Y[\alpha]}(y) = \exp\left(-\frac{\kappa_\alpha}{\alpha}\right) \mathbb{E}^{\frac{1}{\alpha}}\left[p_{Y|X}^{\alpha}(y|X)\right], \quad X \sim P_X. \tag{15}$$

For $\alpha > 1$ (resp. $\alpha < 1$) we can think of the normalized version of $p_{Y|X}^{\alpha}$ as a random transformation with less (resp. more) "noise" than $p_{Y|X}$.
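On a finite alphabet, (15) says that the *α*-response is proportional to $\big(\sum_x P_X(x)\, P_{Y|X}^\alpha(y|x)\big)^{1/\alpha}$, with $\kappa_\alpha$ absorbed by the normalization. The following minimal sketch computes it for an illustrative binary channel; the channel and input distribution are hypothetical.

```python
import numpy as np

def alpha_response(PX, PYgX, alpha):
    """Finite-alphabet alpha-response of (13)/(15); PYgX[x, y] = P_{Y|X}(y|x)."""
    PX = np.asarray(PX, dtype=float)
    PYgX = np.asarray(PYgX, dtype=float)
    unnormalized = (PX @ PYgX ** alpha) ** (1.0 / alpha)
    return unnormalized / unnormalized.sum()   # normalization absorbs kappa_alpha

PX = [0.3, 0.7]
PYgX = [[0.8, 0.2], [0.1, 0.9]]
PY1 = alpha_response(PX, PYgX, 1.0)   # the 1-response is the output distribution P_Y
```

As a check, the 1-response coincides with the output marginal $P_Y$ of (11).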


8. We will have opportunity to apply the following examples.

**Example 2.** *If $Y = X + N$, where $X \sim \mathcal{N}\left(\mu_X, \sigma_X^2\right)$ is independent of $N \sim \mathcal{N}\left(\mu_N, \sigma_N^2\right)$, then the α-response to $P_X$ is*

$$Y[\alpha] \sim \mathcal{N}\Big(\mu_X + \mu_N,\, \alpha\,\sigma_X^2 + \sigma_N^2\Big). \tag{16}$$
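Example 2 can be checked numerically from (15): quadrature of $\mathbb{E}[p_{Y|X}^\alpha(y|X)]$, raised to $1/\alpha$ and normalized, should reproduce the $\mathcal{N}(\mu_X+\mu_N, \alpha\sigma_X^2+\sigma_N^2)$ density. All parameter values below are illustrative.

```python
import numpy as np

mX, vX, mN, vN, alpha = 0.5, 2.0, -1.0, 0.7, 0.6

def trap(f, x):
    # simple trapezoidal quadrature along the last axis
    return np.sum(0.5 * (f[..., 1:] + f[..., :-1]) * np.diff(x), axis=-1)

def gauss(t, m, v):
    return np.exp(-(t - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

x = np.linspace(-15.0, 15.0, 4001)    # quadrature grid for the input X
y = np.linspace(-8.0, 8.0, 401)       # evaluation grid for the output Y
E = trap(gauss(x, mX, vX) * gauss(y[:, None] - x, mN, vN) ** alpha, x) ** (1.0 / alpha)
p = E / trap(E, y)                    # normalization absorbs exp(-kappa_alpha/alpha)
target = gauss(y, mX + mN, alpha * vX + vN)   # right side of (16)
```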

**Example 3.** *Suppose that $Y = X + N$, where N is exponential with mean ζ, independent of X, which is a mixed random variable with density*

$$f_X(t) = \frac{\zeta}{\alpha\,\mu}\,\delta(t) + \left(1 - \frac{\zeta}{\alpha\,\mu}\right) \frac{1}{\mu}\, \mathrm{e}^{-t/\mu}\, 1\{t > 0\},\tag{17}$$

*with $\alpha\,\mu \ge \zeta$. Then, $Y[\alpha]$, the α-response to $P_X$, is exponential with mean $\alpha\,\mu$.*
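Example 3 can also be verified numerically from (15), handling the point mass of (17) at zero separately from the continuous part. If $Y[\alpha]$ is exponential with mean $\alpha\mu$, then $\mathbb{E}^{1/\alpha}[p_{Y|X}^\alpha(y|X)]\,\mathrm{e}^{y/(\alpha\mu)}$ must be constant in $y$. Parameter values are illustrative and satisfy $\alpha\mu \ge \zeta$.

```python
import numpy as np

zeta, mu, alpha = 0.5, 2.0, 0.8       # alpha*mu = 1.6 >= zeta

def trap(f, x):
    return np.sum(0.5 * (f[..., 1:] + f[..., :-1]) * np.diff(x), axis=-1)

def noise_pdf(u):
    # exponential noise density with mean zeta (clip avoids overflow for u < 0)
    return np.where(u >= 0, np.exp(-np.clip(u, 0, None) / zeta) / zeta, 0.0)

t = np.linspace(0.0, 60.0, 30001)
fX_cont = (1 - zeta / (alpha * mu)) * np.exp(-t / mu) / mu   # continuous part of (17)
w0 = zeta / (alpha * mu)                                     # mass of the atom at 0

y = np.linspace(0.2, 8.0, 40)
E = (w0 * noise_pdf(y) ** alpha
     + trap(fX_cont * noise_pdf(y[:, None] - t) ** alpha, t)) ** (1.0 / alpha)
ratio = E * np.exp(y / (alpha * mu))   # constant iff Y[alpha] ~ exponential(alpha*mu)
```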

#### **3. Relative Entropy and Rényi Divergence**

Given a pair of probability measures $(P, Q) \in \mathcal{P}_{\mathcal{A}}^2$, relative entropy and Rényi divergence gauge the distinctness between *P* and *Q*.

9. Provided $P \ll Q$, the relative entropy is the expectation of the relative information with respect to the dominated measure

$$D(P\|Q) = \mathbb{E}\left[\imath_{P\|Q}(X)\right], \quad X \sim P \tag{18}$$

$$= \mathbb{E}\left[\exp\left(\imath_{P\|Q}(Y)\right)\imath_{P\|Q}(Y)\right], \quad Y \sim Q \tag{19}$$

$$\geqslant 0,\tag{20}$$

with equality if and only if $P = Q$. If $P \not\ll Q$, then $D(P\|Q) = \infty$. As in Item 3, if $X \sim P$ and $Y \sim Q$, we may write $D(X\|Y)$ instead of $D(P\|Q)$, in the same spirit that the expectation and entropy of *P* are written as $\mathbb{E}[X]$ and $H(X)$, respectively.

10. Arising in the sequel, a common optimization in information theory finds, among the probability measures satisfying an average cost constraint, the one closest to a given reference measure *Q* in the sense of $D(\cdot\|Q)$. For that purpose, the following result proves sufficient. Incidentally, we often refer to unconstrained maximizations over probability distributions. It should be understood that those optimizations are still constrained to the sets $\mathcal{P}_{\mathcal{A}}$ or $\mathcal{P}_{\mathcal{B}}$. As customary in information theory, we abbreviate $\max_{P_X \in \mathcal{P}_{\mathcal{A}}}$ by $\max_X$ or $\max_{P_X}$.

**Theorem 1.** *Let $P_Z \in \mathcal{P}_{\mathcal{A}}$ and suppose that $g\colon \mathcal{A} \to [0, \infty)$ is a Borel measurable mapping. Then,*

$$\min_{X}\{D(X\|Z) + \mathbb{E}[g(X)]\} = -\log\mathbb{E}[\exp(-g(Z))],\tag{21}$$

*achieved uniquely by $P_X^*$, mutually absolutely continuous with $P_Z$, defined by*

$$\imath_{X^*\|Z}(a) = -g(a) - \log\mathbb{E}[\exp(-g(Z))], \quad a \in \mathcal{A}.\tag{22}$$

**Proof.** Note that since *g* is nonnegative, $\eta = \mathbb{E}[\exp(-g(Z))] \in (0, 1]$. Furthermore,

$$\mathbb{E}[\mathbf{g}(X^\*)] = \frac{\int \mathbf{g}(t) \exp(-\mathbf{g}(t)) \, \mathrm{d}P\_Z(t)}{\mathbb{E}[\exp(-\mathbf{g}(Z))]} \in \left[0, \frac{1}{\mathrm{e}\,\eta}\right].\tag{23}$$

Therefore, the subset of $\mathcal{P}_{\mathcal{A}}$ for which the term in $\{\cdot\}$ in (21) is finite is nonempty. Fix any $P_X$ from that subset (which therefore satisfies $P_X \ll P_Z \ll P_X^*$) and invoke the chain rule (7) to write


$$D(X\|Z) + \mathbb{E}[g(X)] = \mathbb{E}\left[\imath_{X\|X^*}(X) + \imath_{X^*\|Z}(X) + g(X)\right] \tag{24}$$

$$= D(X\|X^*) - \log\mathbb{E}[\exp(-g(Z))], \quad X \sim P_X, \tag{25}$$


which is uniquely minimized by letting $P_X = P_X^*$. Note that, for typographical convenience, we have denoted $X^* \sim P_X^*$.
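Theorem 1 is easy to test on a finite alphabet: the tilted distribution proportional to $P_Z\,\exp(-g)$ of (22) should attain $-\log\mathbb{E}[\exp(-g(Z))]$, and no other distribution should do better. The alphabet, reference measure and cost *g* below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
PZ = np.array([0.1, 0.2, 0.3, 0.4])
g = np.array([0.0, 1.0, 0.5, 2.0])      # nonnegative cost function

eta = np.sum(PZ * np.exp(-g))           # E[exp(-g(Z))]
Pstar = PZ * np.exp(-g) / eta           # the achiever (22)

def objective(P):
    # D(P||P_Z) + E[g(X)], X ~ P, in nats
    return np.sum(P * np.log(P / PZ)) + np.sum(P * g)

min_val = -np.log(eta)                  # the minimum predicted by (21)
others = rng.dirichlet(np.ones(4), size=200)   # random competing distributions
```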

11. Let *p* and *q* denote the Radon–Nikodym derivatives of probability measures *P* and *Q*, respectively, with respect to a common dominating σ-finite measure *µ*. The Rényi divergence of order $\alpha \in (0,1)\cup(1,\infty)$ between *P* and *Q* is defined as [25,50]

$$D_{\alpha}(P\|Q) = \frac{1}{\alpha-1}\log\int_{\mathcal{A}} p^{\alpha} q^{1-\alpha}\,\mathrm{d}\mu \tag{26}$$

$$= \frac{1}{\alpha-1}\log\mathbb{E}\left[\exp\left(\alpha\,\imath_{P\|R}(Z) + (1-\alpha)\,\imath_{Q\|R}(Z)\right)\right], \quad Z \sim R \tag{27}$$

$$= \frac{1}{\alpha-1}\log\mathbb{E}\left[\exp\left(\alpha\,\imath_{P\|Q}(Y)\right)\right], \quad Y \sim Q \tag{28}$$

$$= \frac{1}{\alpha-1}\log\mathbb{E}\left[\exp\left((\alpha-1)\,\imath_{P\|Q}(X)\right)\right], \quad X \sim P, \tag{29}$$

where (28) and (29) hold if $P \ll Q$, and in (27), *R* is a probability measure that dominates both *P* and *Q*. Note that (28) and (29) state that $(t-1)D_t(X\|Y)$ and $t\,D_{1+t}(X\|Y)$ are the cumulant generating functions of the random variables $\imath_{X\|Y}(Y)$ and $\imath_{X\|Y}(X)$, respectively. The relative entropy is the limit of $D_\alpha(P\|Q)$ as $\alpha \uparrow 1$, so it is customary to let $D_1(P\|Q) = D(P\|Q)$. For any $\alpha > 0$, $D_\alpha(P\|Q) \ge 0$ with equality if and only if $P = Q$. Furthermore, $D_\alpha(P\|Q)$ is non-decreasing in *α*, satisfies the skew-symmetric property

$$(1-\alpha)D_{\alpha}(P\|Q) = \alpha\, D_{1-\alpha}(Q\|P), \quad \alpha \in [0,1], \tag{30}$$

and


$$\inf_{\alpha\in(0,1)} D_{\alpha}(P\|Q) = \infty \Longleftrightarrow P \perp Q \Longrightarrow \inf_{\alpha>1} D_{\alpha}(P\|Q) = \infty. \tag{31}$$
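On a finite alphabet, the three expressions (26), (28) and (29) for Rényi divergence are algebraically identical, and the skew-symmetry (30) can be checked directly. The distributions and order below are arbitrary illustrative choices; natural logarithms are used.

```python
import numpy as np

P = np.array([0.6, 0.3, 0.1])
Q = np.array([0.2, 0.3, 0.5])
alpha = 0.4

def D(alpha, P, Q):
    # Renyi divergence via (26), in nats
    return np.log(np.sum(P ** alpha * Q ** (1 - alpha))) / (alpha - 1)

via28 = np.log(np.sum(Q * np.exp(alpha * np.log(P / Q)))) / (alpha - 1)        # (28)
via29 = np.log(np.sum(P * np.exp((alpha - 1) * np.log(P / Q)))) / (alpha - 1)  # (29)
skew_lhs = (1 - alpha) * D(alpha, P, Q)
skew_rhs = alpha * D(1 - alpha, Q, P)                                          # (30)
```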

12. The expressions in the following pair of examples will come in handy in Sections 11 and 12.

**Example 4.** *Suppose that $\sigma_\alpha^2 = \alpha\,\sigma_1^2 + (1-\alpha)\,\sigma_0^2 > 0$ and $\alpha \in (0,1)\cup(1,\infty)$. Then,*

$$D_{\alpha}\left(\mathcal{N}\left(\mu_{0},\sigma_{0}^{2}\right)\|\,\mathcal{N}\left(\mu_{1},\sigma_{1}^{2}\right)\right) = \frac{1}{2}\log\frac{\sigma_{1}^{2}}{\sigma_{0}^{2}} + \frac{1}{2(\alpha-1)}\log\frac{\sigma_{1}^{2}}{\sigma_{\alpha}^{2}} + \frac{\alpha(\mu_{1}-\mu_{0})^{2}}{2\sigma_{\alpha}^{2}}\log\mathrm{e},\tag{32}$$

$$D\left(\mathcal{N}\left(\mu\_0, \sigma\_0^2\right) \parallel \mathcal{N}\left(\mu\_1, \sigma\_1^2\right)\right) = \frac{1}{2} \log \frac{\sigma\_1^2}{\sigma\_0^2} + \frac{1}{2} \left(\frac{\sigma\_0^2}{\sigma\_1^2} - 1\right) \log \mathbf{e} + \frac{(\mu\_1 - \mu\_0)^2}{2\sigma\_1^2} \log \mathbf{e} \tag{33}$$

$$=\lim_{\alpha\to 1} D_{\alpha}\left(\mathcal{N}\left(\mu_{0},\sigma_{0}^{2}\right) \| \,\mathcal{N}\left(\mu_{1},\sigma_{1}^{2}\right)\right).\tag{34}$$
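The closed form (32) can be checked against direct quadrature of the integral in (26). Parameter values are illustrative, and natural logarithms are used (log e = 1).

```python
import numpy as np

m0, v0, m1, v1, alpha = 0.5, 1.2, -0.3, 2.0, 0.7
va = alpha * v1 + (1 - alpha) * v0      # sigma_alpha^2 > 0

# right side of (32), in nats
closed = (0.5 * np.log(v1 / v0) + np.log(v1 / va) / (2 * (alpha - 1))
          + alpha * (m1 - m0) ** 2 / (2 * va))

def gauss(t, m, v):
    return np.exp(-(t - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

t = np.linspace(-30.0, 30.0, 120001)
integrand = gauss(t, m0, v0) ** alpha * gauss(t, m1, v1) ** (1 - alpha)
quad = np.log(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t))) / (alpha - 1)
```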


**Example 5.** *Suppose Z is exponentially distributed with unit mean, i.e., its probability density function is $\mathrm{e}^{-t}\,1\{t \ge 0\}$. For $d_0 \ge d_1$ and α such that $(1-\alpha)\,\mu_0 + \alpha\,\mu_1 > 0$, we obtain*

$$D_{\alpha}(\mu_0 Z + d_0 \parallel \mu_1 Z + d_1) = \frac{d_0 - d_1}{\mu_1}\log\mathrm{e} + \log\frac{\mu_1}{\mu_0} + \frac{1}{1-\alpha}\log\left(\alpha + (1-\alpha)\frac{\mu_0}{\mu_1}\right),$$


$$D(\mu\_0 Z + d\_0 \parallel \mu\_1 Z + d\_1) = \left(\frac{\mu\_0}{\mu\_1} - 1 + \frac{d\_0 - d\_1}{\mu\_1}\right) \log \mathbf{e} + \log \frac{\mu\_1}{\mu\_0} \tag{35}$$

$$=\lim_{\alpha \to 1} D_{\alpha}(\mu_0 Z + d_0 \parallel \mu_1 Z + d_1). \tag{36}$$
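The first display of Example 5 can likewise be checked against quadrature of (26), with $\mu_i Z + d_i$ a shifted exponential of scale $\mu_i$ and offset $d_i$. Parameter values are illustrative and satisfy $d_0 \ge d_1$ and $(1-\alpha)\mu_0 + \alpha\mu_1 > 0$; natural logarithms are used.

```python
import numpy as np

m0, d0, m1, d1, alpha = 1.5, 1.0, 2.0, 0.5, 0.6

# closed form from Example 5, in nats
closed = ((d0 - d1) / m1 + np.log(m1 / m0)
          + np.log(alpha + (1 - alpha) * m0 / m1) / (1 - alpha))

def shifted_exp_pdf(t, mu, d):
    # density of mu*Z + d for Z ~ exponential with unit mean
    return np.where(t >= d, np.exp(-np.clip(t - d, 0, None) / mu) / mu, 0.0)

t = np.linspace(d0, d0 + 80.0, 400001)   # the product's support starts at d0 >= d1
integrand = shifted_exp_pdf(t, m0, d0) ** alpha * shifted_exp_pdf(t, m1, d1) ** (1 - alpha)
quad = np.log(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t))) / (alpha - 1)
```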

13. Intimately connected with the notion of Rényi divergence is the tilted probability measure $P_\alpha$ defined, if $D_\alpha(P_1\|P_0) < \infty$, by

$$\imath_{P_\alpha\|Q}(a) = \alpha\,\imath_{P_1\|Q}(a) + (1-\alpha)\,\imath_{P_0\|Q}(a) + (1-\alpha)D_\alpha(P_1\|P_0),\tag{37}$$

where *Q* is any probability measure that dominates both $P_0$ and $P_1$. Although (37) is defined in general, our main emphasis is on the range $\alpha \in (0,1)$, in which, as long as $P_0 \not\perp P_1$, the tilted probability measure is defined and satisfies $P_\alpha \ll P_0$ and $P_\alpha \ll P_1$, with corresponding relative informations

$$\imath_{P_\alpha\|P_0}(a) = \imath_{P_\alpha\|Q}(a) - \imath_{P_0\|Q}(a) \tag{38}$$

$$= (1-\alpha)D_\alpha(P_1\|P_0) + \alpha\left(\imath_{P_1\|Q}(a) - \imath_{P_0\|Q}(a)\right),\tag{39}$$

$$\imath_{P_\alpha\|P_1}(a) = \imath_{P_\alpha\|Q}(a) - \imath_{P_1\|Q}(a) \tag{40}$$

$$= (1-\alpha)D_\alpha(P_1\|P_0) - (1-\alpha)\left(\imath_{P_1\|Q}(a) - \imath_{P_0\|Q}(a)\right),\tag{41}$$

where we have used the chain rule for $P_\alpha \ll P_0 \ll Q$ and $P_\alpha \ll P_1 \ll Q$. Taking a linear combination of (38)–(41), we conclude that, for all $a \in \mathcal{A}$,

$$(1-\alpha)D_\alpha(P_1\|P_0) = (1-\alpha)\,\imath_{P_\alpha\|P_0}(a) + \alpha\,\imath_{P_\alpha\|P_1}(a). \tag{42}$$

Henceforth, we focus particular attention on the case $\alpha \in (0,1)$ since that is the region of interest in the application of Rényi information measures to the evaluation of error exponents in channel coding for codes whose rate is below capacity. In addition, proofs often simplify considerably for $\alpha \in (0,1)$.

14. Much of the interplay between relative entropy and Rényi divergence hinges on the following identity, which appears, without proof, in (3) of [51].

**Theorem 2.** *Let $\alpha \in (0,1)$ and assume that $P_0 \not\perp P_1$ are defined on the same measurable space. Then, for any $P \ll P_1$ and $P \ll P_0$,*

$$\alpha\, D(P\|P_1) + (1-\alpha)\, D(P\|P_0) = D(P\|P_\alpha) + (1-\alpha)D_\alpha(P_1\|P_0), \tag{43}$$

*where P<sup>α</sup> is the tilted probability measure in* (37) *and* (43) *holds regardless of whether the relative entropies are finite. In particular,*

$$D(P\|P_\alpha) < \infty \Longleftrightarrow \max\{D(P\|P_0), D(P\|P_1)\} < \infty. \tag{44}$$

**Proof.** We distinguish three overlapping cases:

(1) $D(P\|P_\alpha) < \infty$: taking the expectation of (42) with respect to $a \leftarrow X \sim P$ yields (43) because

$$\mathbb{E}\left[\imath_{P_\alpha\|P_0}(X)\right] = D(P\|P_0) - D(P\|P_\alpha), \tag{45}$$

$$\mathbb{E}\left[\imath_{P_\alpha\|P_1}(X)\right] = D(P\|P_1) - D(P\|P_\alpha), \tag{46}$$

where, thanks to the assumption that $D(P\|P_\alpha) < \infty$, we have invoked Corollary A1 in the Appendix twice, with $(P, Q, R) \leftarrow (P, P_\alpha, P_0)$ and $(P, Q, R) \leftarrow (P, P_\alpha, P_1)$, respectively;


Finally, to show that (44) follows from (43), simply recall from (31) that $D_\alpha(P_1\|P_0) < \infty$.
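On a finite alphabet the identity (43) can be verified directly: there, (37) says $P_\alpha$ is proportional to $P_1^\alpha P_0^{1-\alpha}$, with normalization constant $\exp((\alpha-1)D_\alpha(P_1\|P_0))$. The three distributions and the order below are arbitrary illustrative choices.

```python
import numpy as np

P = np.array([0.3, 0.4, 0.3])
P1 = np.array([0.5, 0.25, 0.25])
P0 = np.array([0.2, 0.5, 0.3])
alpha = 0.7

def D(P, Q):
    # relative entropy in nats
    return np.sum(P * np.log(P / Q))

Dalpha = np.log(np.sum(P1 ** alpha * P0 ** (1 - alpha))) / (alpha - 1)
Palpha = P1 ** alpha * P0 ** (1 - alpha)
Palpha /= Palpha.sum()                 # the tilted measure (37)

lhs = alpha * D(P, P1) + (1 - alpha) * D(P, P0)
rhs = D(P, Palpha) + (1 - alpha) * Dalpha   # right side of (43)
```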

15. Relative entropy and Rényi divergence are related by the following fundamental variational representation.

**Theorem 3.** *Fix $\alpha \in (0,1)$ and $(P_1, P_0) \in \mathcal{P}_{\mathcal{A}}^2$. Then, the Rényi divergence between $P_1$ and $P_0$ satisfies*

$$(1-\alpha)\, D_\alpha(P_1\|P_0) = \min_{P}\{\alpha\, D(P\|P_1) + (1-\alpha)\, D(P\|P_0)\},\tag{47}$$

*where the minimum is over $\mathcal{P}_{\mathcal{A}}$. If $P_0 \not\perp P_1$, then the right side of* (47) *is attained by the tilted measure $P_\alpha$, and the minimization can be restricted to the subset of probability measures which are dominated by both $P_1$ and $P_0$.*

**Proof.** If $P_0 \perp P_1$, then both sides of (47) are $+\infty$ since there is no probability measure that is dominated by both $P_0$ and $P_1$. If $P_0 \not\perp P_1$, then minimizing both sides of (43) with respect to *P* yields (47) and the fact that the tilted probability measure attains the minimum therein.

The variational representation in (47) was observed in [39] in the finite-alphabet case and, contemporaneously, in full generality in [50]. Unlike Theorem 3, both of those references also deal with $\alpha > 1$. The function $d(\alpha) = (1-\alpha)\, D_\alpha(P_1\|P_0)$, with $d(1) = \lim_{\alpha\uparrow 1} d(\alpha)$, is concave in *α* because the right side of (47) is a minimum of affine functions of *α*.
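A finite-alphabet sketch of Theorem 3: the tilted measure attains the minimum in (47), and randomly drawn distributions never do better. The distributions, the order, and the number of random competitors are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
P1 = np.array([0.5, 0.25, 0.25])
P0 = np.array([0.2, 0.5, 0.3])
alpha = 0.3

def D(P, Q):
    return np.sum(P * np.log(P / Q))    # relative entropy in nats

# left side of (47): (1 - alpha) * D_alpha(P1 || P0)
lhs = (1 - alpha) * np.log(np.sum(P1 ** alpha * P0 ** (1 - alpha))) / (alpha - 1)

Palpha = P1 ** alpha * P0 ** (1 - alpha)
Palpha /= Palpha.sum()                  # tilted measure (37)
attained = alpha * D(Palpha, P1) + (1 - alpha) * D(Palpha, P0)
others = [alpha * D(P, P1) + (1 - alpha) * D(P, P0)
          for P in rng.dirichlet(np.ones(3), size=300)]
```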

16. Given random transformations $P_{Y|X}\colon \mathcal{A} \to \mathcal{B}$, $Q_{Y|X}\colon \mathcal{A} \to \mathcal{B}$, and a probability measure $P_X \in \mathcal{P}_{\mathcal{A}}$ on the input space, the conditional relative entropy is

$$D(P\_{Y|X} \parallel Q\_{Y|X} \mid P\_X) = D(P\_{Y|X} P\_X \parallel Q\_{Y|X} P\_X) \tag{48}$$

$$= \mathbb{E}\left[D\left(P_{Y|X}(\cdot|X) \,\|\, Q_{Y|X}(\cdot|X)\right)\right], \quad X \sim P_X. \tag{49}$$

Analogously, the conditional Rényi divergence is defined as

$$D_{\alpha}(P_{Y|X} \,\|\, Q_{Y|X} \,|\, P_X) = D_{\alpha}(P_{Y|X} P_X \,\|\, Q_{Y|X} P_X). \tag{50}$$

A word of caution: the notation in (50) conforms to that in [38,45] but it is not universally adopted, e.g., [43] uses the left side of (50) to denote the Rényi generalization of the right side of (49). We can express the conditional Rényi divergence as

$$\begin{split} &D_{\alpha}(P_{Y|X} \parallel Q_{Y|X} \,|\, P_{X}) \\ &= \frac{1}{\alpha - 1} \log \mathbb{E} \Big[ \exp \Big( (\alpha - 1) D_{\alpha} \Big( P_{Y|X}(\cdot | X) \parallel Q_{Y|X}(\cdot | X) \Big) \Big) \Big], \quad X \sim P_{X}, \end{split} \tag{51}$$

$$=\frac{1}{a-1}\log\mathbb{E}\left[\left(\frac{\mathbf{d}P\_{Y|X}}{\mathbf{d}Q\_{Y|X}}(Y|X)\right)^{a-1}\right], \quad (X,Y) \sim P\_{\mathbf{X}}P\_{Y|X} \tag{52}$$

where (52) holds if $P_X P_{Y|X} \ll P_X Q_{Y|X}$. Jensen's inequality applied to (51) results in

$$D\_{\mathfrak{a}}(P\_{Y|X} \parallel Q\_{Y|X} | P\_X) \leqslant \mathbb{E}\left[D\_{\mathfrak{a}}(P\_{Y|X}(\cdot|X) \parallel Q\_{Y|X}(\cdot|X))\right], \quad \mathfrak{a} \in (0,1);\tag{53}$$

$$D\_{\mathfrak{A}}(P\_{Y|X} \| \, Q\_{Y|X} | P\_X) \geqslant \mathbb{E} \Big[ D\_{\mathfrak{A}}(P\_{Y|X}(\cdot | X) \parallel Q\_{Y|X}(\cdot | X)) \Big], \quad \mathfrak{a} > 1. \tag{54}$$

Nevertheless, an immediate and crucial observation we can draw from (51) is that the unconstrained maximizations of the sides of (53) and of (54) over $P_X$ do coincide: for all $\alpha > 0$,

$$\sup_{P_X} D_{\alpha}(P_{Y|X} \parallel Q_{Y|X} | P_X) = \sup_{P_X} \mathbb{E} \left[ D_{\alpha}(P_{Y|X}(\cdot | X) \parallel Q_{Y|X}(\cdot | X)) \right] \tag{55}$$

$$=\sup\_{a\in\mathcal{A}} D\_{\mathfrak{a}}(P\_{Y|X=a} \parallel Q\_{Y|X=a}).\tag{56}$$
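Identity (48)&ndash;(51) and the Jensen bound (53) can be checked directly on a toy pair of channels; the following sketch (NumPy assumed; the input and the two channels are made up for illustration) computes the conditional Rényi divergence both as a log-moment of the per-input divergences, per (51), and as the unconditional divergence between the joint measures, per (48)/(50).

```python
# Conditional Renyi divergence: (51) as a log-moment of per-x divergences,
# checked against the joint-measure form (48)/(50), plus the bound (53).
import numpy as np

def D_alpha(p, q, a):
    return float(np.log(np.sum(p**a * q**(1 - a))) / (a - 1))

a = 0.5
PX = np.array([0.3, 0.7])
PYX = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows: P_{Y|X=x}
QYX = np.array([[0.6, 0.4], [0.5, 0.5]])   # rows: Q_{Y|X=x}

per_x = np.array([D_alpha(PYX[x], QYX[x], a) for x in range(2)])

# (51): log-moment of the per-input divergences
cond = np.log(np.sum(PX * np.exp((a - 1) * per_x))) / (a - 1)

# (48): must agree with the divergence between the joint measures
joint_p = (PX[:, None] * PYX).ravel()
joint_q = (PX[:, None] * QYX).ravel()
assert abs(cond - D_alpha(joint_p, joint_q, a)) < 1e-12

# (53): for a in (0,1), conditioning is upper bounded by the expectation
assert cond <= float(np.sum(PX * per_x)) + 1e-12
```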

17. Conditional Rényi divergence satisfies the following additive decomposition, originally pointed out, without proof, by Sibson [31] in the setting of finite A.

**Theorem 4.** *Given* $P_X \in \mathcal{P}_{\mathcal{A}}$*,* $Q_Y \in \mathcal{P}_{\mathcal{B}}$*,* $P_{Y|X}\colon \mathcal{A} \to \mathcal{B}$*, and* $\alpha \in (0,1) \cup (1,\infty)$*, we have*

$$D_{\alpha}(P_{Y|X} \parallel Q_Y \,|\, P_X) = D_{\alpha}(P_{Y|X} \parallel P_{Y[\alpha]} \,|\, P_X) + D_{\alpha}(P_{Y[\alpha]} \parallel Q_Y). \tag{57}$$

*Furthermore, with κ<sup>α</sup> as in* (14)*,*

$$D\_{\mathfrak{a}}\left(P\_{Y|X} \parallel P\_{Y[\mathfrak{a}]} \middle| P\_{\mathbf{X}}\right) = \frac{\kappa\_{\mathfrak{a}}}{\mathfrak{a} - 1}.\tag{58}$$

**Proof.** Select an arbitrary probability measure $R_Y \in \mathcal{P}_{\mathcal{B}}$ that dominates both $Q_Y$ and $P_Y$, and, therefore, $P_{Y[\alpha]}$ too. Letting $(X, Z) \sim P_X \times R_Y$, we have

$$D\_{\mathfrak{a}}(P\_{Y|X} \parallel Q\_Y | P\_X) = \frac{1}{\mathfrak{a} - 1} \log \mathbb{E} \left[ \left( \frac{\mathrm{d}P\_{XY}}{\mathrm{d}P\_X \times R\_Y} (X, Z) \right)^{\mathfrak{a}} \left( \frac{\mathrm{d}Q\_Y}{\mathrm{d}R\_Y} (Z) \right)^{1 - \mathfrak{a}} \right] \tag{59}$$

$$=\frac{1}{\alpha-1}\log\mathbb{E}\left[\mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(X;Z)\right)\,\middle|\,Z\right] \left(\frac{\mathrm{d}P_{Y}}{\mathrm{d}R_{Y}}(Z)\right)^{\alpha}\left(\frac{\mathrm{d}Q_{Y}}{\mathrm{d}R_{Y}}(Z)\right)^{1-\alpha}\right] \tag{60}$$

$$=\frac{\kappa_{\alpha}}{\alpha-1} + \frac{1}{\alpha-1} \log \mathbb{E}\left[ \left( \frac{\mathrm{d}P_{Y[\alpha]}}{\mathrm{d}P_{Y}}(Z) \right)^{\alpha} \left( \frac{\mathrm{d}P_{Y}}{\mathrm{d}R_{Y}}(Z) \right)^{\alpha} \left( \frac{\mathrm{d}Q_{Y}}{\mathrm{d}R_{Y}}(Z) \right)^{1-\alpha} \right] \tag{61}$$

$$= \frac{\kappa_{\alpha}}{\alpha - 1} + \frac{1}{\alpha - 1} \log \mathbb{E} \left[ \left( \frac{\mathrm{d} P_{Y[\alpha]}}{\mathrm{d} R_Y}(Z) \right)^{\alpha} \left( \frac{\mathrm{d} Q_Y}{\mathrm{d} R_Y}(Z) \right)^{1 - \alpha} \right] \tag{62}$$

$$= \frac{\kappa_{\alpha}}{\alpha - 1} + D_{\alpha}(P_{Y[\alpha]} \| Q_{Y}), \tag{63}$$

where (61) follows from (13), and (62) follows from the chain rule of Radon&ndash;Nikodym derivatives applied to $P_{Y[\alpha]} \ll P_Y \ll R_Y$. Then, (58) follows by specializing $Q_Y = P_{Y[\alpha]}$, and the proof of (57) is complete, upon plugging (58) into the right side of (63).

A proof of (57) in the discrete case can be found in Appendix A of [37].
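On a discrete alphabet, the $\alpha$-response satisfies $P_{Y[\alpha]}(y) \propto \big(\sum_x P_X(x) P^\alpha_{Y|X}(y|x)\big)^{1/\alpha}$, so (57) can be verified with a few lines; the channel and $Q_Y$ below are arbitrary illustrations (NumPy assumed).

```python
# Check of the decomposition (57):
# D_a(P_{Y|X} || Q_Y | P_X) = D_a(P_{Y|X} || P_{Y[a]} | P_X) + D_a(P_{Y[a]} || Q_Y).
import numpy as np

def D_alpha(p, q, a):
    return float(np.log(np.sum(p**a * q**(1 - a))) / (a - 1))

def cond_D_alpha(PYX, QY, PX, a):   # (51) specialized to a fixed Q_Y
    per_x = np.array([D_alpha(PYX[x], QY, a) for x in range(len(PX))])
    return float(np.log(np.sum(PX * np.exp((a - 1) * per_x))) / (a - 1))

a = 0.3
PX = np.array([0.25, 0.75])
PYX = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
QY = np.array([0.2, 0.3, 0.5])

resp = (PX @ PYX**a) ** (1 / a)     # alpha-response, up to normalization
resp /= resp.sum()

lhs = cond_D_alpha(PYX, QY, PX, a)
rhs = cond_D_alpha(PYX, resp, PX, a) + D_alpha(resp, QY, a)
assert abs(lhs - rhs) < 1e-12
```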

18. For all $\alpha > 0$, given two inputs $(P_X, Q_X) \in \mathcal{P}_{\mathcal{A}}^2$ and one random transformation $P_{Y|X}\colon \mathcal{A} \to \mathcal{B}$, Rényi divergence (and, in particular, relative entropy) satisfies the data processing inequality,

$$D_{\alpha}(P_X \| Q_X) \geqslant D_{\alpha}(P_Y \| Q_Y),\tag{64}$$

where $P_X \to P_{Y|X} \to P_Y$, and $Q_X \to P_{Y|X} \to Q_Y$. The data processing inequality for Rényi divergence was observed by Csiszár [52] in the more general context of $f$-divergences. More recently it was stated in [39,50]. Furthermore, given one input $P_X \in \mathcal{P}_{\mathcal{A}}$ and two transformations $P_{Y|X}\colon \mathcal{A} \to \mathcal{B}$ and $Q_{Y|X}\colon \mathcal{A} \to \mathcal{B}$, conditioning cannot decrease Rényi divergence,

$$D_{\alpha}(P_{Y|X} \parallel Q_{Y|X} \,|\, P_X) \geqslant D_{\alpha}(P_Y \parallel Q_Y). \tag{65}$$

Since $D_\alpha(P_{Y|X} \,\|\, Q_{Y|X} \,|\, P_X) = D_\alpha(P_X P_{Y|X} \,\|\, P_X Q_{Y|X})$, (65) follows by applying (64) to a deterministic transformation which takes an input pair and outputs the second component. Inequalities (53) and (65) imply the convexity of $D_\alpha(P \| Q)$ in $(P, Q)$ for $\alpha \in (0,1]$.
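The data processing inequality (64) is elementary to test: push two input distributions through the same channel and compare divergences before and after. The channel and inputs below are arbitrary (NumPy assumed).

```python
# Elementary check of the data processing inequality (64): pushing two
# distributions through the same stochastic matrix cannot increase
# Renyi divergence, for any order a > 0.
import numpy as np

def D_alpha(p, q, a):
    return float(np.log(np.sum(p**a * q**(1 - a))) / (a - 1))

PX = np.array([0.6, 0.4])
QX = np.array([0.1, 0.9])
W = np.array([[0.8, 0.2],      # P_{Y|X}; rows sum to 1
              [0.3, 0.7]])

for a in (0.2, 0.5, 2.0, 5.0):
    assert D_alpha(PX, QX, a) >= D_alpha(PX @ W, QX @ W, a) - 1e-12
```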

#### **4. Dependence Measures**

In this paper we are interested in three information measures that quantify the dependence between random variables $X$ and $Y$, such that $P_X \to P_{Y|X} \to P_Y$, namely, mutual information, and two of its generalizations, $\alpha$-mutual information and Augustin&ndash;Csiszár mutual information of order $\alpha$.

19. The mutual information is

$$I(X;Y) = I(P_X, P_{Y|X}) = D(P_{Y|X} \parallel P_Y \mid P_X) \tag{66}$$

$$=\min\_{Q\_Y} D(P\_{Y|X} \parallel Q\_Y \mid P\_X) \tag{67}$$

$$=\min\_{Q\_Y} D(P\_{XY} \parallel P\_X \times Q\_Y). \tag{68}$$

20. Given $\alpha \in (0,1) \cup (1,\infty)$, the $\alpha$-mutual information is defined as (see [30–32,40,42,45])

$$I_{\alpha}(X;Y) = I_{\alpha}(P_X, P_{Y|X}) \tag{69}$$

$$= \min_{Q_Y} D_{\alpha}(P_{Y|X} \parallel Q_Y \mid P_X) \tag{70}$$

$$= \min_{Q_Y} D_{\alpha}(P_{XY} \parallel P_X \times Q_Y) \tag{71}$$

$$= D_{\alpha}\left(P_{Y|X} \parallel P_{Y[\alpha]} \mid P_X\right) \tag{72}$$

$$= \frac{1}{\alpha-1}\log\mathbb{E}\left[\exp\left((\alpha-1)\, D_{\alpha}\left(P_{Y|X}(\cdot|X) \parallel P_{Y[\alpha]}\right)\right)\right], \quad X \sim P_X \tag{73}$$

$$= D_{\alpha}\left(P_{Y|X} \parallel P_Y \mid P_X\right) - D_{\alpha}\left(P_{Y[\alpha]} \parallel P_Y\right) \tag{74}$$

$$= \frac{\kappa_{\alpha}}{\alpha - 1} \tag{75}$$

$$= \frac{\alpha}{\alpha - 1} \log \mathbb{E}\Big[\mathbb{E}^{\frac{1}{\alpha}}\big[\exp\big(\alpha\, \imath_{X;Y}(X;\bar{Y})\big) \,\big|\, \bar{Y}\big]\Big], \quad (X, \bar{Y}) \sim P_X \times P_Y, \tag{76}$$

where (72) and (74) follow from (57); (73) is a special case of (51); (75) follows from Theorem 4; and (76) is (14). In view of (67) and (69), we let $I_1(X;Y) = I(X;Y)$. The notation we use for $\alpha$-mutual information conforms to that used in [40,42,45,53]. Other notations include $K_\alpha$ in [32,38,39] and $I^{\mathrm{g}}_\alpha$ in [43]. $I_0(X;Y)$ and $I_\infty(X;Y)$ are defined by taking the corresponding limits.

21. Theorem 4 and (72) result in the additive decomposition

$$I\_{\mathfrak{a}}(X;Y) = D\_{\mathfrak{a}}(P\_{Y|X} \parallel Q\_Y | P\_X) - D\_{\mathfrak{a}}(P\_{Y[\mathfrak{a}]} \parallel Q\_Y),\tag{77}$$

for any $Q_Y$ with $D_\alpha(P_{Y[\alpha]} \,\|\, Q_Y) < \infty$, thereby generalizing the well-known decomposition for mutual information,

$$I(X;Y) = D(P_{Y|X} \parallel Q_Y \mid P_X) - D(P_Y \parallel Q_Y), \tag{78}$$

which, in contrast to (77), is a simple consequence of the chain rule whenever the dependence between *X* and *Y* is regular, and of Lemma A1 in general.

22.

**Example 6.** *Additive independent Gaussian noise. If $Y = X + N$, where $X \sim \mathcal{N}(0, \sigma_X^2)$ independent of $N \sim \mathcal{N}(0, \sigma_N^2)$, then, for $\alpha > 0$,*

$$Y_{[\alpha]} \sim \mathcal{N}\Big(0,\, \alpha \,\sigma_X^2 + \sigma_N^2\Big),\tag{79}$$

$$I_{\alpha}(X;X+N) = I_{\alpha}(X+N;X) = \frac{1}{2} \log \left( 1 + \alpha\, \frac{\sigma_X^2}{\sigma_N^2} \right). \tag{80}$$
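The closed form (80) can be reproduced numerically from the continuous analogue of (82), $I_\alpha(X;Y) = \frac{\alpha}{\alpha-1}\log \int \big(\mathbb{E}[p^\alpha_N(y-X)]\big)^{1/\alpha}\,\mathrm{d}y$; the grid-based sketch below (NumPy assumed, parameters arbitrary) agrees with $\frac{1}{2}\log(1+\alpha\,\mathsf{snr})$ to high accuracy.

```python
# Grid evaluation of the continuous analogue of (82) for Gaussian X, N,
# checked against the closed form (80).
import numpy as np

sx, sn, a = 1.0, 1.0, 0.5
x = np.linspace(-8.0, 8.0, 2001)
y = np.linspace(-10.0, 10.0, 2001)
dx, dy = x[1] - x[0], y[1] - y[0]

px = np.exp(-x**2 / (2 * sx**2)) / np.sqrt(2 * np.pi * sx**2)

# E over X ~ N(0, sx^2) of p_{Y|X}(y|X)^a, for every y on the grid
pyx_a = np.exp(-a * (y[:, None] - x[None, :])**2 / (2 * sn**2)) \
        / (2 * np.pi * sn**2)**(a / 2)
inner = (pyx_a * px[None, :]).sum(axis=1) * dx

I = a / (a - 1) * np.log((inner**(1 / a)).sum() * dy)
closed_form = 0.5 * np.log(1 + a * sx**2 / sn**2)   # (80), in nats
assert abs(I - closed_form) < 1e-3
```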

23. If $\alpha \in (0,1)$, (47) and (69) result in

$$\begin{aligned} &(1-\alpha)\, I_{\alpha}(P_X, P_{Y|X}) \\ &= \min_{Q_X, Q_{Y|X}} \Big\{ D(Q_X \parallel P_X) + \alpha \, D(Q_{Y|X} \parallel P_{Y|X} \mid Q_X) + (1-\alpha) \, I(Q_X, Q_{Y|X}) \Big\}. \end{aligned} \tag{81}$$

For $\alpha > 1$, a proof of (81) is given in [39] for finite alphabets.

24. Unlike $I(P_X, P_{Y|X})$, we can express $I_\alpha(P_X, P_{Y|X})$ directly in terms of its arguments without involving the corresponding output distribution or the $\alpha$-response to $P_X$. This is most evident in the case of discrete alphabets, in which (76) becomes

$$I\_{\mathfrak{a}}(X;Y) = \frac{\mathfrak{a}}{\mathfrak{a}-1} \log \sum\_{\mathbf{y} \in \mathcal{B}} \left( \sum\_{\mathbf{x} \in \mathcal{A}} P\_{\mathbf{X}}(\mathbf{x}) P\_{Y|X=\mathbf{x}}^{\mathfrak{a}}(\mathbf{y}) \right)^{\frac{1}{\mathfrak{a}}} \tag{82}$$

$$I_0(X;Y) = -\log\max_{y \in \mathcal{B}} \sum_{x \in \mathcal{A}} P_X(x)\, 1\{ P_{Y|X}(y|x) > 0 \},\tag{83}$$

$$I_{\infty}(X;Y) = \log \left( \sum_{b \in \mathcal{B}} \sup_{a \colon P_X(a) > 0} P_{Y|X}(b|a) \right). \tag{84}$$

For example, if $X$ is discrete and $H_\alpha(X)$ denotes the Rényi entropy of order $\alpha$, then for all $\alpha > 0$,

$$H\_{\mathfrak{a}}(X) = I\_{\frac{1}{\mathfrak{a}}}(X;X). \tag{85}$$

If $X$ and $Y$ are equiprobable with $\mathbb{P}[X \neq Y] = \delta$, then, in bits, $I_\alpha(X;Y) = 1 - h_\alpha(\delta)$, where $h_\alpha(\delta)$ denotes the binary Rényi entropy.
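Formula (82) and the identity (85) are easy to exercise in code; the sketch below (NumPy assumed, distribution arbitrary) computes $I_\alpha$ from (82) and checks $H_\alpha(X) = I_{1/\alpha}(X;X)$ by taking the channel to be the identity.

```python
# (82) on a discrete alphabet, and the Renyi-entropy identity (85):
# H_a(X) = I_{1/a}(X;X).
import numpy as np

def I_alpha(PX, PYX, a):            # (82), in nats
    inner = (PX @ PYX**a) ** (1 / a)
    return float(a / (a - 1) * np.log(inner.sum()))

PX = np.array([0.5, 0.25, 0.25])    # arbitrary example distribution
a = 2.0

H_a = np.log(np.sum(PX**a)) / (1 - a)   # Renyi entropy of order a
identity = np.eye(3)                    # the channel Y = X
assert abs(H_a - I_alpha(PX, identity, 1 / a)) < 1e-12
```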

25. In the main region of interest, namely, $\alpha \in (0,1)$, frequently we use a different parametrization in terms of $\rho > 0$, with $\alpha = \frac{1}{1+\rho}$.

**Theorem 5.** *For any $\rho > 0$, we have the upper bound*

$$\rho\, I_{\frac{1}{1+\rho}}(X;Y) \leqslant \min_{Q_{Y|X}\colon \mathcal{A} \to \mathcal{B}} \left\{ D(Q_{Y|X} \| P_{Y|X} \mid P_X) + \rho\, I(P_X, Q_{Y|X}) \right\}. \tag{86}$$

**Proof.** Fix $Q_{Y|X}\colon \mathcal{A} \to \mathcal{B}$, and let $P_X \to Q_{Y|X} \to Q_Y$. Then,

$$I_{\frac{1}{1+\rho}}(X;Y) \leqslant D_{\frac{1}{1+\rho}}(P_{XY} \| P_X \times Q_Y) \tag{87}$$

$$=\frac{1+\rho}{\rho}\min\_{R\_{XY}}\left\{\frac{1}{1+\rho}D(R\_{XY}\|P\_{XY})+\frac{\rho}{1+\rho}D(R\_{XY}\|P\_X\times Q\_Y)\right\}\tag{88}$$

$$\leqslant \frac{1}{\rho} D(Q_{Y|X} P_X \| P_{XY}) + D(Q_{Y|X} P_X \| P_X \times Q_Y) \tag{89}$$

$$= \frac{1}{\rho} D(Q_{Y|X} \| P_{Y|X} \,|\, P_X) + I(P_X, Q_{Y|X}), \tag{90}$$

where (87), (88) and (90) follow from (69), (47) and (66), respectively.

Just like (53), we will show in Section 7 that (86) becomes an equality upon the unconstrained maximization of both sides.

26. Before introducing the last dependence measure in this section, recall from Definition 7 and (58) that $P_{Y[\alpha]} \ll P_Y$, the $\alpha$-response (of $P_{Y|X}$) to $P_X$, defined by

$$\imath_{Y[\alpha]\|Y}(y) = \frac{1}{\alpha} \log \mathbb{E}\left[\exp\left(\alpha\, \imath_{X;Y}(X;y) + (1-\alpha)\, D_{\alpha}\left(P_{Y|X} \,\big\|\, P_{Y[\alpha]} \,\big|\, P_X\right)\right)\right],\tag{91}$$

attains $\min_{Q_Y} D_\alpha(P_{Y|X} \| Q_Y \,|\, P_X)$, where the expectation is with respect to $X \sim P_X$. We proceed to define $P_{Y\langle\alpha\rangle} \ll P_Y$, the $\langle\alpha\rangle$-response (of $P_{Y|X}$) to $P_X$, by means of

$$\imath_{Y\langle\alpha\rangle\|Y}(y) = \frac{1}{\alpha} \log \mathbb{E}\left[\exp\left(\alpha\, \imath_{X;Y}(X;y) + (1-\alpha)\, D_{\alpha}\left(P_{Y|X}(\cdot|X) \parallel P_{Y\langle\alpha\rangle}\right)\right)\right] \tag{92}$$

with $X \sim P_X$. Note that $P_{Y\langle 1\rangle} = P_{Y[1]} = P_Y$.

27. In the case of discrete alphabets, (92) becomes the implicit equation

$$P_{Y\langle\alpha\rangle}^{\alpha}(y) = \sum_{a \in \mathcal{A}} P_X(a)\, \frac{P_{Y|X}^{\alpha}(y|a)}{\sum_{b \in \mathcal{B}} P_{Y|X}^{\alpha}(b|a) \, P_{Y\langle\alpha\rangle}^{1-\alpha}(b)}, \quad y \in \mathcal{B},\tag{93}$$

which coincides with (9.24) in Fano's 1961 textbook [7], with $s \leftarrow 1-\alpha$, and is also given by Haroutunian in (19) of [22]. For example, if $\mathcal{A} = \mathcal{B}$ is discrete and $Y = X$, then $P_{Y\langle\alpha\rangle} = P_X$, while $P_{Y[\alpha]}^{\alpha}(y) = c\, P_X(y)$, $y \in \mathcal{A}$.

28. The $\langle\alpha\rangle$-response satisfies the following identity, which can be regarded as the counterpart of (57) satisfied by the $\alpha$-response.

**Theorem 6.** *Fix $P_X \in \mathcal{P}_{\mathcal{A}}$, $P_{Y|X}\colon \mathcal{A} \to \mathcal{B}$ and $Q_Y \in \mathcal{P}_{\mathcal{B}}$. Then,*

$$\begin{split} &D_{\alpha}(P_{Y\langle\alpha\rangle} \parallel Q_{Y}) \\ &= \frac{1}{\alpha-1} \log \mathbb{E} \Big[ \exp \Big( (1-\alpha) \Big( D_{\alpha}\big(P_{Y|X}(\cdot|X) \,\|\, P_{Y\langle\alpha\rangle}\big) - D_{\alpha}\big(P_{Y|X}(\cdot|X) \,\|\, Q_{Y}\big) \Big) \Big) \Big]. \end{split} \tag{94}$$

**Proof.** For brevity we assume $Q_Y \ll P_Y$. Otherwise, the proof is similar, adopting a reference measure that dominates both $Q_Y$ and $P_Y$. The definition of unconditional Rényi divergence in Item 11 implies that we can write the exponential of $(\alpha - 1)$ times the left side of (94) as

$$\exp\Big( (\alpha-1)\, D_{\alpha}(P_{Y\langle\alpha\rangle} \| Q_Y) \Big) = \mathbb{E}\left[\left(\frac{\mathrm{d}P_{Y\langle\alpha\rangle}}{\mathrm{d}P_Y}(Y)\right)^{\alpha}\left(\frac{\mathrm{d}Q_Y}{\mathrm{d}P_Y}(Y)\right)^{1-\alpha}\right] \tag{95}$$

$$=\mathbb{E}\left[\exp\left(\alpha\, \imath_{X;Y}(X;Y) + (1-\alpha)\, D_{\alpha}\left(P_{Y|X}(\cdot|X) \parallel P_{Y\langle\alpha\rangle}\right)\right) \left(\frac{\mathrm{d}Q_{Y}}{\mathrm{d}P_{Y}}(Y)\right)^{1-\alpha}\right] \tag{96}$$

$$=\mathbb{E}\left[\mathbb{E}\left[\exp\left(\alpha\, \imath_{X;Y}(X;Y) + (1-\alpha)\left(\imath_{Q_Y\|P_Y}(Y) + D_{\alpha}\left(P_{Y|X}(\cdot|X) \,\|\, P_{Y\langle\alpha\rangle}\right)\right)\right) \,\middle|\, X\right]\right]$$

$$= \mathbb{E}\left[\exp\left((1-\alpha)\left(D_{\alpha}\left(P_{Y|X}(\cdot|X) \,\|\, P_{Y\langle\alpha\rangle}\right) - D_{\alpha}\left(P_{Y|X}(\cdot|X) \,\|\, Q_{Y}\right)\right)\right)\right], \tag{97}$$

where $(X,Y) \sim P_X \times P_Y$, (96) follows from (92), and (97) follows from the definition of unconditional Rényi divergence in (27).

**Theorem 7.** *If $\alpha \in (0,1]$, then*

$$D_{\alpha}(P_{Y\langle\alpha\rangle} \parallel Q_{Y}) \leqslant \mathbb{E}\Big[D_{\alpha}(P_{Y|X}(\cdot|X) \parallel Q_{Y})\Big] - \mathbb{E}\Big[D_{\alpha}(P_{Y|X}(\cdot|X) \parallel P_{Y\langle\alpha\rangle})\Big] \tag{98}$$

$$\leqslant D(P_{Y\langle\alpha\rangle} \parallel Q_{Y}). \tag{99}$$

*If $\alpha \geqslant 1$, inequalities* (98) *and* (99) *are reversed.*

**Proof.** Assume $\alpha \in (0,1]$. Jensen's inequality applied to the right side of (94) results in (98). To show (99), again we assume for brevity $Q_Y \ll P_Y$, and define the positive functions $V\colon \mathcal{A}\times\mathcal{B} \to (0,\infty)$ and $W\colon \mathcal{A}\times\mathcal{B} \to (0,\infty)$,

$$V(x, y) = \exp\left(\alpha\, \imath_{X;Y}(x; y) + (1 - \alpha)\, \imath_{Y\langle\alpha\rangle \| Y}(y)\right),\tag{100}$$

$$W(x, y) = \exp\left(\alpha\, \imath_{X;Y}(x; y) + (1 - \alpha)\, \imath_{Q_Y \| P_Y}(y)\right). \tag{101}$$

Note that, with $(X,Y) \sim P_X \times P_Y$, and $(x, y) \in \mathcal{A}\times\mathcal{B}$,

$$\mathbb{E}[V(x,Y)] = \exp\left( (\alpha - 1)\, D_{\alpha}(P_{Y|X=x} \| P_{Y\langle\alpha\rangle}) \right), \tag{102}$$

$$\mathbb{E}[W(x,Y)] = \exp\left( (\alpha-1)\, D_{\alpha}(P_{Y|X=x} \| Q_{Y}) \right),\tag{103}$$

$$\mathbb{E}\left[\frac{V(X,y)}{\mathbb{E}[V(X,Y)\,|\,X]}\right] = \exp\left((1-\alpha)\,\imath_{Y\langle\alpha\rangle\|Y}(y)\right) \cdot \mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(X;y) + (1-\alpha)\, D_{\alpha}(P_{Y|X}(\cdot|X) \,\|\, P_{Y\langle\alpha\rangle})\right)\right] \tag{104}$$

$$= \frac{\mathrm{d}P_{Y\langle\alpha\rangle}}{\mathrm{d}P_Y}(y), \tag{105}$$

where (104) uses (100) and (102), and (105) follows from (92). Then,

$$D_{\alpha}(P_{Y|X=x} \parallel Q_Y) - D_{\alpha}(P_{Y|X=x} \parallel P_{Y\langle\alpha\rangle})$$

$$= \frac{1}{1-\alpha} \log \frac{\mathbb{E}[V(x, Y)]}{\mathbb{E}[W(x, Y)]} \tag{106}$$

$$\leqslant \frac{1}{1-\alpha}\, \mathbb{E}\left[\frac{V(x,Y)}{\mathbb{E}[V(x,Y)]} \log \frac{V(x,Y)}{W(x,Y)}\right] \tag{107}$$

$$= \mathbb{E}\left[\frac{V(x, Y)}{\mathbb{E}[V(x, Y)]} \Big(\imath_{Y\langle\alpha\rangle\|Y}(Y) - \imath_{Q_Y\|P_Y}(Y)\Big)\right],\tag{108}$$

where the expectations are with respect to $Y \sim P_Y$, and

• (107) follows from the log-sum inequality for integrable non-negative random variables,

$$\mathbb{E}[V]\log\frac{\mathbb{E}[V]}{\mathbb{E}[\mathcal{W}]} \leqslant \mathbb{E}\left[V\log\frac{V}{\mathcal{W}}\right];\tag{109}$$

• (108) follows from (100) and (101).

Taking the expectation with respect to $X \sim P_X$ of (106)–(108) yields (99) because of Lemma A1 and (105). If $\alpha \geqslant 1$, then Jensen's inequality applied to the right side of (94) results in (98) with the opposite inequality. Moreover, (107) is reversed and the remainder of the proof holds verbatim.

In the case of finite input-alphabets, a different proof of (99) is given in Appendix B of [54].

29. Introduced in the unpublished dissertation [36] and rescued from oblivion in [32], the Augustin–Csiszár mutual information of order $\alpha$ is defined for $\alpha > 0$ as

$$I_{\alpha}^{\mathrm{c}}(X;Y) = I_{\alpha}^{\mathrm{c}}(P_X, P_{Y|X}) = \min_{Q_Y} \mathbb{E}\left[D_{\alpha}(P_{Y|X}(\cdot|X) \parallel Q_Y)\right] \tag{110}$$

$$=\mathbb{E}\left[D_{\alpha}(P_{Y|X}(\cdot|X)\parallel P_{Y\langle\alpha\rangle})\right],\tag{111}$$

where (111) follows from (98) if $\alpha \in (0,1]$, and from the reverse of (99) if $\alpha \geqslant 1$. We conform to the notation in [40], where $I^{\mathrm{a}}_\alpha$ was used to denote the difference between entropy and Arimoto&ndash;Rényi conditional entropy. In [32,39,43] the Augustin&ndash;Csiszár mutual information of order $\alpha$ is denoted by $I_\alpha$. In Augustin's original notation [36], $I_\rho(P_X)$ means $I^{\mathrm{c}}_{1-\rho}(P_X, P_{Y|X})$, $\rho \in (0,1)$. Independently of [36], Poltyrev [35] introduced a functional (expressed as a maximization over a reverse random transformation) which turns out to be $\rho\, I^{\mathrm{c}}_{\frac{1}{1+\rho}}(X;Y)$ and which he denoted by $E_0(\rho, P_X)$, although in Gallager's notation that corresponds to $\rho\, I_{\frac{1}{1+\rho}}(X;Y)$, as we will see in (233). $I^{\mathrm{c}}_0(X;Y)$ and $I^{\mathrm{c}}_\infty(X;Y)$ are defined by taking the corresponding limits.

30. In the discrete case, (110) boils down to

$$I_{\alpha}^{\mathrm{c}}(X;Y) = \min_{Q_Y} \frac{1}{\alpha-1} \sum_{x \in \mathcal{A}} P_X(x) \log \sum_{y \in \mathcal{B}} P_{Y|X}^{\alpha}(y|x) \, Q_Y^{1-\alpha}(y),\tag{112}$$

which can be juxtaposed with the much simpler expression in (82) for $I_\alpha(X;Y)$, involving no further optimization. Minimizing the Lagrangian, we can verify that the minimizer in (112) satisfies (93). With $(X, \bar{Y}) \sim P_X \times Q_Y$, we have

$$I_{0}^{\mathrm{c}}(X;Y) = \min_{Q_Y} \mathbb{E}\left[\log \frac{1}{\mathbb{P}[P_{Y|X}(\bar{Y}|X) > 0 \mid X]}\right],\tag{113}$$

$$I_{\infty}^{\mathrm{c}}(X;Y) = \min_{Q_Y} \mathbb{E}\left[\log \left\| \frac{P_{Y|X}(\bar{Y}|X)}{Q_Y(\bar{Y})} \right\|_{\infty} \right],\tag{114}$$

where the expectations are with respect to *X*.

31. The respective minimizers of (72) and (110), namely, the $\alpha$-response and the $\langle\alpha\rangle$-response, are quite different. Most notably, in contrast to Item 7, an explicit expression for $P_{Y\langle\alpha\rangle}$ is unknown. Instead of defining $P_{Y\langle\alpha\rangle}$ through (92), [36] defines it, equivalently, as the fixed point of the operator (dubbed the Augustin operator in [43]) which maps the set of probability measures on the output space to itself,

$$\frac{\mathrm{d}\mathsf{T}_{\alpha}(Q)}{\mathrm{d}Q}(y) = \mathbb{E}\left[\left(\frac{\mathrm{d}P_{Y|X}}{\mathrm{d}Q}(y|X)\right)^{\alpha}\exp\left((1-\alpha)\, D_{\alpha}\big(P_{Y|X}(\cdot|X)\,\|\,Q\big)\right)\right],\tag{115}$$

where $X \sim P_X$. Although we do not rely on them, Lemma 34.2 of [36] ($\alpha \in (0,1)$) and Lemma 13 of [43] ($\alpha > 1$) claim that the minimizer in (110), referred to in [43] as the Augustin mean of order $\alpha$, is unique and is a fixed point of the operator $\mathsf{T}_\alpha$ regardless of $P_X$. Moreover, Lemma 13(c) of [43] establishes that for $\alpha \in (0,1)$ and finite input alphabets, repeated iterations of the operator $\mathsf{T}_\alpha$ with initial argument $P_{Y[\alpha]}$ converge to $P_{Y\langle\alpha\rangle}$.
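On a discrete alphabet, the fixed-point iteration with the Augustin operator is a few lines of code. The sketch below (NumPy assumed; the channel and input are arbitrary) starts from the $\alpha$-response, iterates the discrete form of (115)/(93), and reads off the Augustin&ndash;Csiszár mutual information via (111); random alternatives confirm that the fixed point is the minimizer in (110).

```python
# Fixed-point iteration with the (discrete) Augustin operator (115)/(93),
# starting from the alpha-response, followed by the readout (111).
import numpy as np

def D_alpha(p, q, a):
    return float(np.log(np.sum(p**a * q**(1 - a))) / (a - 1))

a = 0.4
PX = np.array([0.3, 0.7])
PYX = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])

Q = (PX @ PYX**a) ** (1 / a)          # alpha-response as initial point
Q /= Q.sum()
for _ in range(500):                  # Augustin operator, see (93)
    num = PYX**a * Q[None, :]**(1 - a)
    Q = PX @ (num / num.sum(axis=1, keepdims=True))

I_c = float(PX @ [D_alpha(PYX[x], Q, a) for x in range(2)])   # (111)

# (110): the fixed point should beat (or tie) any other output distribution
rng = np.random.default_rng(0)
for _ in range(20):
    R = rng.dirichlet(np.ones(3))
    assert I_c <= PX @ [D_alpha(PYX[x], R, a) for x in range(2)] + 1e-9
```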

32. It is interesting to contrast the next example with the formulas in Examples 2 and 6.

**Example 7.** *Additive independent Gaussian noise. If $Y = X + N$, where $X \sim \mathcal{N}(0, \sigma_X^2)$ independent of $N \sim \mathcal{N}(0, \sigma_N^2)$, then*

$$Y_{\langle\alpha\rangle} \sim \mathcal{N}\left(0,\, \frac{\sigma_N^2}{2} \left(2 - \frac{1}{\alpha} + \Delta + \mathsf{snr}\right)\right),\tag{116}$$

$$\text{snr} = \frac{\sigma\_X^2}{\sigma\_N^2} \tag{117}$$

$$
\Delta = \sqrt{4\,\text{snr} + \left(\frac{1}{\alpha} - \text{snr}\right)^2}.\tag{118}
$$

*This result can be obtained by postulating a zero-mean Gaussian distribution with variance $v_\alpha^2$ as $P_{Y\langle\alpha\rangle}$ and verifying that* (92) *is indeed satisfied if $v_\alpha^2$ is chosen as in* (116)*. The first step is to invoke* (32)*, which yields*

$$D_{\alpha}\left(P_{Y|X=x} \parallel P_{Y\langle\alpha\rangle}\right) = \frac{\lambda_{\alpha}}{2} + \frac{\alpha}{2}\, \frac{x^2}{s_{\alpha}^2}\, \log \mathrm{e},\tag{119}$$

$$\lambda_{\alpha} = \log \frac{v_{\alpha}^2}{\sigma_N^2} + \frac{1}{\alpha - 1} \log \frac{v_{\alpha}^2}{s_{\alpha}^2}\,, \tag{120}$$

*where we have denoted $s_\alpha^2 = \alpha\, v_\alpha^2 + (1-\alpha)\,\sigma_N^2$. Since $Y \sim \mathcal{N}(0, \sigma_X^2 + \sigma_N^2)$,*

$$\imath_{X;Y}(x;y) = \frac{1}{2} \log \frac{\sigma_X^2 + \sigma_N^2}{\sigma_N^2} + \frac{1}{2} \left( \frac{y^2}{\sigma_X^2 + \sigma_N^2} - \frac{(y-x)^2}{\sigma_N^2} \right) \log \mathrm{e},\tag{121}$$

$$\imath_{Y\langle\alpha\rangle\|Y}(y) = \frac{1}{2}\log\frac{\sigma_X^2 + \sigma_N^2}{v_\alpha^2} + \frac{1}{2}\left(\frac{y^2}{\sigma_X^2 + \sigma_N^2} - \frac{y^2}{v_\alpha^2}\right)\log \mathrm{e}. \tag{122}$$

*Assembling* (120) *and* (121)*, the right side of* (92) *becomes*

$$\begin{split} &\frac{1}{\alpha}\, \log\mathbb{E}\Big[ \exp\Big(\alpha\, \imath_{X;Y}(X; y) + (1 - \alpha)\, D_{\alpha}\big( P_{Y|X}(\cdot|X) \,\|\, P_{Y\langle\alpha\rangle} \big)\Big)\Big] \\ &= \frac{1}{2} \log \frac{\sigma_X^2 + \sigma_N^2}{\sigma_N^2} + \frac{y^2 \log \mathrm{e}}{2\,(\sigma_X^2 + \sigma_N^2)} + \frac{1-\alpha}{2\alpha}\, \lambda_\alpha + \frac{1}{\alpha} \log \mathbb{E}\left[ \exp_{\mathrm{e}}\left( -\frac{\alpha\, (y - X)^2}{2 \sigma_N^2} + \frac{\alpha (1-\alpha) X^2}{2 s_\alpha^2} \right)\right] \end{split} \tag{123}$$

$$= \frac{1}{2} \log \frac{\sigma_X^2 + \sigma_N^2}{\sigma_N^2} + \frac{1-\alpha}{2\alpha}\, \lambda_\alpha + \frac{y^2 \log \mathrm{e}}{2} \left( \frac{1}{\sigma_X^2 + \sigma_N^2} - \frac{s_\alpha^2 - \alpha(1-\alpha)\,\sigma_X^2}{\sigma_N^2 s_\alpha^2 + \alpha^2 v_\alpha^2 \sigma_X^2} \right) + \frac{1}{2\alpha} \log \frac{\sigma_N^2 s_\alpha^2}{\sigma_N^2 s_\alpha^2 + \alpha^2 v_\alpha^2 \sigma_X^2} \tag{124}$$

$$= \frac{1}{2}\log\frac{\sigma_X^2+\sigma_N^2}{v_\alpha^2} + \frac{1}{2}\left(\frac{y^2}{\sigma_X^2+\sigma_N^2} - \frac{y^2}{v_\alpha^2}\right)\log \mathrm{e}, \tag{125}$$

*where* (124) *follows by Gaussian integration, and the marvelous simplification in* (125) *is satisfied provided that we choose*

$$s_\alpha^2 = \frac{\alpha\, \sigma_X^2\, v_\alpha^2}{v_\alpha^2 - \sigma_N^2}. \tag{126}$$

*Comparing* (122) *and* (125)*, we see that* (92) *is indeed satisfied with $Y_{\langle\alpha\rangle} \sim \mathcal{N}(0, v_\alpha^2)$ if $v_\alpha^2$ satisfies the quadratic equation* (126)*, whose solution is in* (116)*–*(118)*. Invoking* (32) *and* (116)*, we obtain*

$$\begin{split} I_\alpha^{\mathrm{c}}(X;X+N) &= \frac{\alpha\, \mathsf{snr}}{1 + \alpha\, \Delta + \alpha\, \mathsf{snr}}\, \log \mathrm{e} + \frac{1}{2} \log\left(1 + \frac{1}{2} \left(\Delta + \mathsf{snr} - \frac{1}{\alpha}\right)\right) \\ &\quad - \frac{1}{2(1-\alpha)} \log\frac{2 - \frac{1}{\alpha} + \Delta + \mathsf{snr}}{1 + \alpha\, \Delta + \alpha\, \mathsf{snr}}. \end{split} \tag{127}$$

Beyond its role in evaluating the Augustin–Csiszár mutual information for Gaussian inputs, the Gaussian distribution in (116) has found some utility in the analysis of finite blocklength fundamental limits for data transmission [55].
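The variance in (116)&ndash;(118) can be sanity-checked against the constraint (126) without any calculus; the sketch below (standard library only, parameters arbitrary) confirms that the two expressions for $s_\alpha^2$ agree at the postulated $v_\alpha^2$.

```python
# Check that the variance (116) solves the quadratic constraint (126):
# with s_a^2 = a*v_a^2 + (1-a)*sn^2, we need s_a^2 = a*sx^2*v_a^2/(v_a^2 - sn^2).
import math

sx2, sn2, a = 2.0, 0.5, 0.3          # arbitrary example parameters
snr = sx2 / sn2
delta = math.sqrt(4 * snr + (1 / a - snr) ** 2)     # (118)
v2 = sn2 / 2 * (2 - 1 / a + delta + snr)            # (116)

s2 = a * v2 + (1 - a) * sn2          # definition of s_a^2 below (120)
assert abs(s2 - a * sx2 * v2 / (v2 - sn2)) < 1e-9   # (126)
```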

33. This item gives a variational representation for the Augustin–Csiszár mutual information in terms of mutual information and conditional relative entropy (i.e., non-Rényi information measures). As we will see in Section 10, this representation accounts for the role played by Augustin–Csiszár mutual information in expressing error exponent functions.

**Theorem 8.** *For $\alpha \in (0,1)$, the Augustin–Csiszár mutual information satisfies the variational representation in terms of conditional relative entropy and mutual information,*

$$(1 - \alpha) \, I_{\alpha}^{\mathrm{c}}(P_X, P_{Y|X}) = \min_{Q_{Y|X}} \Big\{ \alpha \, D(Q_{Y|X} \| P_{Y|X} \,|\, P_X) + (1 - \alpha) \, I(P_X, Q_{Y|X}) \Big\}, \tag{128}$$

*where the minimum is over all the random transformations from the input to the output spaces.*

**Proof.** Invoking (47) with $(P_1, P_0) \leftarrow (P_{Y|X=x}, Q_Y)$, we obtain

$$(1 - \mathfrak{a}) \operatorname{D}\_{\mathfrak{a}}(P\_{Y|X=x} \| Q\_Y) = \min\_{\mathcal{R}\_Y} \left\{ \mathfrak{a} \, D(\mathcal{R}\_Y \| P\_{Y|X=x}) + (1 - \mathfrak{a}) \, D(\mathcal{R}\_Y \| Q\_Y) \right\} \tag{129}$$

$$=\min\_{R\_{Y|X=x}}\left\{\alpha \, D(R\_{Y|X=x} \| P\_{Y|X=x}) + (1-\alpha) \, D(R\_{Y|X=x} \| Q\_Y) \right\}.\tag{130}$$

Averaging over $x \sim P_X$, followed by minimization with respect to $Q_Y$, yields (128) upon recalling (67).

In the finite-alphabet case with $\alpha \in (0,1) \cup (1,\infty)$, the representation in (128) is implicit in the appendix of [32], and stated explicitly in [39], where it is shown by means of a minimax theorem. This is one of the instances in which the proof of the result is considerably easier for $\alpha \in (0,1)$; we can take the following route to show (128) for $\alpha > 1$. Neglecting to emphasize its dependence on $P_X$, denote

$$f_{\alpha}(Q_Y, R_{Y|X}) = \frac{\alpha}{1-\alpha}\, D(R_{Y|X} \| P_{Y|X} \,|\, P_X) + D(R_{Y|X} \| Q_Y \,|\, P_X).\tag{131}$$

Invoking (47) we obtain

$$D\_{\mathfrak{a}}(P\_{Y|X=x} \| Q\_Y) = \max\_{\mathcal{R}\_{Y|X=x}} \left\{ \frac{\mathfrak{a}}{1-\mathfrak{a}} D(R\_{Y|X=x} \| P\_{Y|X=x}) + D(R\_{Y|X=x} \| Q\_Y) \right\}.\tag{132}$$

Averaging (132) with respect to $P_X$, followed by minimization over $Q_Y$, results in

$$I_{\alpha}^{\mathrm{c}}(P_X, P_{Y|X}) = \min_{Q_Y} \max_{R_{Y|X}} f_{\alpha}(Q_Y, R_{Y|X}) \tag{133}$$

$$\geqslant \max_{R_{Y|X}} \min_{Q_Y} f_{\alpha}(Q_Y, R_{Y|X}) \tag{134}$$

$$= \max_{R_{Y|X}} \left\{ \frac{\alpha}{1-\alpha} \, D(R_{Y|X} \| P_{Y|X} \,|\, P_X) + I(P_X, R_{Y|X}) \right\},\tag{135}$$

which shows $\geqslant$ in (128). If a minimax theorem can be invoked to show equality in (134), then (128) is established for $\alpha > 1$. For that purpose, for fixed $R_{Y|X}$, $f_\alpha(\cdot, R_{Y|X})$ is convex and lower semicontinuous in $Q_Y$ on the set where it is finite. Rewriting

$$\begin{split} &f_{\alpha}(Q_Y, R_{Y|X}) \\ &= \frac{1}{1-\alpha}\, D(R_{Y|X} \| P_{Y|X} \,|\, P_X) + D(R_{Y|X} \| Q_Y \,|\, P_X) - D(R_{Y|X} \| P_{Y|X} \,|\, P_X), \end{split} \tag{136}$$

it can be seen that $f_\alpha(Q_Y, \cdot)$ is upper semicontinuous and concave (if $\alpha > 1$). A different, and considerably more intricate, route is taken in Lemma 13(d) of [43], which also gives (128) for $\alpha > 1$ assuming finite input alphabets.

34. Unlike mutual information, neither $I_\alpha(X;Y) = I_\alpha(Y;X)$ nor $I^{\mathrm{c}}_\alpha(X;Y) = I^{\mathrm{c}}_\alpha(Y;X)$ hold in general.

**Example 8.** *Erasure transformation. Let $\mathcal{A} = \{0,1\}$, $\mathcal{B} = \{0,1,\mathrm{e}\}$,*

$$P_{Y|X}(b|a) = \begin{cases} 1 - \delta, & a = b; \\ \delta, & b = \mathrm{e}; \\ 0, & a \neq b \neq \mathrm{e}, \end{cases} \tag{137}$$

*with $\delta \in (0,1)$, and $P_X(0) = \frac{1}{2}$. Then, we obtain, for $\alpha \in (0,1) \cup (1,\infty)$,*

$$I\_{\mathfrak{a}}(X;Y) = I\_{\mathfrak{a}}^{\mathfrak{c}}(X;Y) = \frac{\mathfrak{a}}{\mathfrak{a}-1} \log \left( \delta + (1-\delta) \, 2^{\left(1-\frac{1}{\mathfrak{a}}\right)} \right),\tag{138}$$

$$I_{\alpha}(Y;X) = \frac{1}{\alpha-1} \log \left( \delta + (1-\delta) \, 2^{\alpha-1} \right),\tag{139}$$

$$I_{\alpha}^{\mathrm{c}}(Y;X) = I(X;Y) = 1 - \delta \ \text{bits}.\tag{140}$$
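The asymmetry in Example 8 is easy to confirm numerically from (82) applied in both orientations; the sketch below (NumPy assumed) matches the closed forms (138) and (139) and shows $I_\alpha(X;Y) \neq I_\alpha(Y;X)$.

```python
# Numerical check of Example 8 (binary erasure transformation), using (82)
# for both orientations of the pair against the closed forms (138)-(139).
import numpy as np

def I_alpha(PX, PYX, a):                       # (82), in nats
    return float(a / (a - 1) *
                 np.log(((PX @ PYX**a) ** (1 / a)).sum()))

d, a = 0.3, 2.0
PX = np.array([0.5, 0.5])
PYX = np.array([[1 - d, 0.0, d],               # output order: 0, 1, e
                [0.0, 1 - d, d]])

# (138)
assert abs(I_alpha(PX, PYX, a) -
           a / (a - 1) * np.log(d + (1 - d) * 2**(1 - 1/a))) < 1e-12

# reverse direction: output distribution P_Y and reverse channel P_{X|Y}
PY = np.array([(1 - d) / 2, (1 - d) / 2, d])
PXY = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
# (139)
assert abs(I_alpha(PY, PXY, a) -
           1 / (a - 1) * np.log(d + (1 - d) * 2**(a - 1))) < 1e-12
```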

35. It was shown in Theorem 5.2 of [38] that *α*-mutual information satisfies the data processing lemma, namely, if *X* and *Z* are conditionally independent given *Y*, then

$$I\_{\mathfrak{a}}(X;Z) \lesssim \min\{I\_{\mathfrak{a}}(X;Y), I\_{\mathfrak{a}}(Y;Z)\},\tag{141}$$

$$I_\alpha(Z;X) \leq \min\{I_\alpha(Z;Y),\, I_\alpha(Y;X)\}. \tag{142}$$

As shown by Csiszár [32] using the data processing inequality for Rényi divergence, the data processing lemma also holds for $I_\alpha^{\mathrm{c}}$.
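As a minimal numerical sanity check of (141) (an illustration added here, using the same discrete closed form for $I_\alpha$ as above), one can draw a random Markov chain $X - Y - Z$ and verify the data processing inequality:

```python
import numpy as np

def i_alpha(p_x, W, a):
    # I_alpha(P_X, P_{Y|X}) in nats for finite alphabets
    return a / (a - 1.0) * np.log(((p_x @ W**a) ** (1.0 / a)).sum())

rng = np.random.default_rng(0)
a = 0.7
p_x = rng.dirichlet(np.ones(3))
Wxy = rng.dirichlet(np.ones(4), size=3)   # P_{Y|X}, rows sum to 1
Wyz = rng.dirichlet(np.ones(5), size=4)   # P_{Z|Y}

p_y = p_x @ Wxy
Wxz = Wxy @ Wyz                           # end-to-end kernel of X - Y - Z

ixz = i_alpha(p_x, Wxz, a)
ixy = i_alpha(p_x, Wxy, a)
iyz = i_alpha(p_y, Wyz, a)
assert ixz <= min(ixy, iyz) + 1e-12       # Eq. (141)
```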

36. From (53), (54) and the monotonicity of $D_\alpha(P\|Q)$ in $\alpha$, we obtain the ordering

$$I_\beta(X;Y) \leq I_\alpha(X;Y) \leq I_\alpha^{\mathrm{c}}(X;Y) \leq I_\nu^{\mathrm{c}}(X;Y) \leq I(X;Y), \quad 0 < \beta \leq \alpha \leq \nu < 1; \tag{143}$$

$$I(X;Y) \leq I_\nu^{\mathrm{c}}(X;Y) \leq I_\alpha^{\mathrm{c}}(X;Y) \leq I_\alpha(X;Y) \leq I_\beta(X;Y), \quad 1 < \nu \leq \alpha \leq \beta. \tag{144}$$
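The portion of the ordering involving $I_\alpha$ alone ($I_\alpha$ nondecreasing in $\alpha$ and capped by $I(X;Y)$ on $(0,1)$) is easy to confirm numerically. The following sketch, added here for illustration, again assumes the discrete closed form for $I_\alpha$:

```python
import numpy as np

def i_alpha(p_x, W, a):
    return a / (a - 1.0) * np.log(((p_x @ W**a) ** (1.0 / a)).sum())

def mi(p_x, W):   # ordinary mutual information (the alpha -> 1 limit), nats
    p_y = p_x @ W
    with np.errstate(divide="ignore", invalid="ignore"):
        t = np.where(W > 0, W * np.log(W / p_y), 0.0)
    return float(p_x @ t.sum(axis=1))

delta = 0.2
W = np.array([[1 - delta, 0.0, delta],     # erasure channel of Example 8
              [0.0, 1 - delta, delta]])
p_x = np.array([0.4, 0.6])

alphas = [0.2, 0.5, 0.8, 0.99]
vals = [i_alpha(p_x, W, a) for a in alphas]
assert all(v1 <= v2 + 1e-12 for v1, v2 in zip(vals, vals[1:]))  # nondecreasing
assert vals[-1] <= mi(p_x, W) + 1e-6                            # below I(X;Y)
```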

37. The convexity/concavity properties of the generalized mutual information measures are summarized next.

#### **Theorem 9.**


#### **Proof.**


$$I_\alpha\big(\lambda P_X^1 + (1-\lambda)P_X^0,\, P_{Y|X}\big) \tag{145}$$

$$= \inf_{Q_Y} D_\alpha\big(P_{Y|X}\,\|\,Q_Y \,\big|\, \lambda P_X^1 + (1-\lambda)P_X^0\big) \tag{146}$$

$$\geq \inf_{Q_Y}\left\{\lambda\, D_\alpha\big(P_{Y|X}\,\|\,Q_Y\,\big|\,P_X^1\big) + (1-\lambda)\, D_\alpha\big(P_{Y|X}\,\|\,Q_Y\,\big|\,P_X^0\big)\right\} \tag{147}$$

$$\geq \lambda\, I_\alpha\big(P_X^1, P_{Y|X}\big) + (1-\lambda)\, I_\alpha\big(P_X^0, P_{Y|X}\big). \tag{148}$$

(c) The convexity of $I(P_X,\cdot)$ and $I_\alpha(P_X,\cdot)$ follows from the convexity of $D_\alpha(P\|Q)$ in $(P,Q)$ for $\alpha\in(0,1]$, as we saw in Item 18. To show the convexity of $I_\alpha^{\mathrm{c}}(P_X,\cdot)$ if $\alpha\in(0,1)$, we apply (169) in Item 45 with $P_{Y|X} = \lambda P_{Y|X}^1 + (1-\lambda)P_{Y|X}^0$, and invoke the convexity of $I_\alpha(Q_X,\cdot)$:

$$\begin{split}
&(1-\alpha)\, I_\alpha^{\mathrm{c}}\big(P_X, P_{Y|X}\big)\\
&\quad = \max_{Q_X}\left\{(1-\alpha)\, I_\alpha\big(Q_X,\, \lambda P_{Y|X}^1 + (1-\lambda)P_{Y|X}^0\big) - D(P_X\|Q_X)\right\}
\end{split} \tag{149}$$

$$\leq \max_{Q_X}\left\{\lambda\Big[(1-\alpha)\, I_\alpha\big(Q_X, P_{Y|X}^1\big) - D(P_X\|Q_X)\Big] + (1-\lambda)\Big[(1-\alpha)\, I_\alpha\big(Q_X, P_{Y|X}^0\big) - D(P_X\|Q_X)\Big]\right\} \tag{150}$$

$$\leq (1-\alpha)\Big(\lambda\, I_\alpha^{\mathrm{c}}\big(P_X, P_{Y|X}^1\big) + (1-\lambda)\, I_\alpha^{\mathrm{c}}\big(P_X, P_{Y|X}^0\big)\Big). \tag{151}$$

Although not used in the sequel, we note, for completeness, that if $\alpha\in(0,1)\cup(1,\infty)$, [38] (see the corrected version in [41]) shows that $\frac{1}{\alpha-1}\exp\left(\left(1-\frac{1}{\alpha}\right) I_\alpha\big(\cdot, P_{Y|X}\big)\right)$ is concave.

#### **5. Interplay between $I_\alpha(P_X, P_{Y|X})$ and $I_\alpha^{\mathrm{c}}(P_X, P_{Y|X})$**

In this section, we study the interplay between both notions of mutual information of order $\alpha$ and, in particular, various variational representations of these information measures.

38. For given $\alpha\in(0,1)\cup(1,\infty)$ and $P_{Y|X}\colon \mathcal{A}\to\mathcal{B}$, define $Q_{X[\alpha]} \ll P_X$, the $\alpha$-adjunct of $P_X$, by

$$\imath_{Q_{X[\alpha]}\|P_X}(x) = (\alpha-1)\, D_\alpha\big(P_{Y|X=x}\,\|\,P_{Y[\alpha]}\big) - \kappa_\alpha, \tag{152}$$

with $\kappa_\alpha$ the constant in (14) and $P_{Y[\alpha]}$ the $\alpha$-response to $P_X$.

39. **Example 9.** *Let* $Y = X + N$ *with* $X \sim \mathcal{N}(0,\sigma_X^2)$ *independent of* $N \sim \mathcal{N}(0,\sigma_N^2)$*, and* $\mathsf{snr} = \sigma_X^2/\sigma_N^2$*. The* $\alpha$*-adjunct of the input is*

$$Q_{X[\alpha]} = \mathcal{N}\left(0,\; \sigma_X^2\,\frac{1+\alpha^2\,\mathsf{snr}}{1+\alpha\,\mathsf{snr}}\right). \tag{153}$$

40. **Theorem 10.** *The* $\langle\alpha\rangle$*-response to* $Q_{X[\alpha]}$ *is* $P_{Y[\alpha]}$*, the* $\alpha$*-response to* $P_X$*.*

**Proof.** We just need to verify that (92) is satisfied if we substitute $Y_{\langle\alpha\rangle}$ by $Y_{[\alpha]}$ and, instead of taking the expectation on the right side with respect to $X \sim P_X$, we take it with respect to $\widetilde{X} \sim Q_{X[\alpha]}$. Then,

$$\begin{split}
&\mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(\widetilde{X};y) + (1-\alpha)\, D_\alpha\big(P_{Y|X}(\cdot|\widetilde{X})\,\|\,P_{Y[\alpha]}\big)\right)\right]\\
&= \mathbb{E}\left[\exp\left(\imath_{Q_{X[\alpha]}\|P_X}(X) + \alpha\,\imath_{X;Y}(X;y) + (1-\alpha)\, D_\alpha\big(P_{Y|X}(\cdot|X)\,\|\,P_{Y[\alpha]}\big)\right)\right]
\end{split} \tag{154}$$

$$= \mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(X;y) - \kappa_\alpha\right)\right] \tag{155}$$

$$= \exp\left(\alpha\,\imath_{Y[\alpha]\|Y}(y)\right), \tag{156}$$

where (154) is by change of measure, (155) follows by substitution of (152), and (156) is the same as (13).

41. For given $\alpha\in(0,1)\cup(1,\infty)$ and $P_{Y|X}\colon\mathcal{A}\to\mathcal{B}$, we define $Q_{X\langle\alpha\rangle} \ll P_X$, the $\langle\alpha\rangle$-adjunct of an input probability measure $P_X$, through

$$\imath_{Q_{X\langle\alpha\rangle}\|P_X}(x) = (1-\alpha)\, D_\alpha\big(P_{Y|X=x}\,\|\,P_{Y\langle\alpha\rangle}\big) + \upsilon_\alpha, \tag{157}$$

where $P_{Y\langle\alpha\rangle}$ is the $\langle\alpha\rangle$-response to $P_X$ and $\upsilon_\alpha$ is a normalizing constant so that $Q_{X\langle\alpha\rangle}$ is a probability measure. According to (9), we must have

$$\mathbb{E}\left[\exp\left(\imath_{Q_{X\langle\alpha\rangle}\|P_X}(X)\right)\right] = 1, \quad X \sim P_X. \tag{158}$$

Hence,

$$\upsilon_\alpha = (\alpha-1)\, D_\alpha\big(P_{Y|X}\,\|\,P_{Y\langle\alpha\rangle}\,\big|\,Q_{X\langle\alpha\rangle}\big). \tag{159}$$

42. With the aid of the expression in Example 7, we obtain

**Example 10.** *Let* $Y = X + N$ *with* $X \sim \mathcal{N}(0,\sigma_X^2)$ *independent of* $N \sim \mathcal{N}(0,\sigma_N^2)$*, and* $\mathsf{snr} = \sigma_X^2/\sigma_N^2$*. Then, the* $\langle\alpha\rangle$*-adjunct of the input is*

$$Q_{X\langle\alpha\rangle} = \mathcal{N}\left(0,\; \sigma_X^2\,\frac{1+\alpha(\Delta+\mathsf{snr})}{1+\alpha(\Delta-\mathsf{snr})+2\alpha^2\,\mathsf{snr}}\right), \tag{160}$$

*which, in contrast to* $Q_{X[\alpha]}$*, has larger variance than* $\sigma_X^2$ *if* $\alpha\in(0,1)$*.*

43. The following result is the dual of Theorem 10.

**Theorem 11.** *The* $\alpha$*-response to* $Q_{X\langle\alpha\rangle}$ *is* $P_{Y\langle\alpha\rangle}$*, the* $\langle\alpha\rangle$*-response to* $P_X$*. Therefore,*

$$
\upsilon_\alpha = (\alpha-1)\, I_\alpha\big(Q_{X\langle\alpha\rangle},\, P_{Y|X}\big). \tag{161}
$$

**Proof.** The proof is similar to that of Theorem 10. We just need to verify that we obtain the right side of (92) if, on the right side of (91), we substitute $P_X$ by $Q_{X\langle\alpha\rangle}$ and $P_{Y[\alpha]}$ by $P_{Y\langle\alpha\rangle}$. Let $\bar{X} \sim Q_{X\langle\alpha\rangle}$. Then,

$$\begin{split}
&\frac{1}{\alpha}\log\mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(\bar{X};y) + (1-\alpha)\, D_\alpha\big(P_{Y|X}\,\|\,P_{Y\langle\alpha\rangle}\,\big|\,Q_{X\langle\alpha\rangle}\big)\right)\right]\\
&= \frac{1}{\alpha}\log\mathbb{E}\left[\exp\left(\imath_{Q_{X\langle\alpha\rangle}\|P_X}(X) + \alpha\,\imath_{X;Y}(X;y) - \upsilon_\alpha\right)\right]
\end{split} \tag{162}$$

$$= \frac{1}{\alpha}\log\mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(X;y) + (1-\alpha)\, D_\alpha\big(P_{Y|X}(\cdot|X)\,\|\,P_{Y\langle\alpha\rangle}\big)\right)\right] \tag{163}$$

$$= \imath_{Y\langle\alpha\rangle\|Y}(y), \tag{164}$$

where (162)–(164) follow by change of measure, (157), and (92), respectively.

44. By recourse to a minimax theorem, the following representation is given for $\alpha\in(0,1)\cup(1,\infty)$ in the case of finite alphabets in [39]; the restriction on the finiteness of the output space is dropped in [43]. As we show, a very simple and general proof is possible for $\alpha\in(0,1)$.

**Theorem 12.** *Fix* $\alpha\in(0,1)$*,* $P_X\in\mathcal{P}_\mathcal{A}$ *and* $P_{Y|X}\colon\mathcal{A}\to\mathcal{B}$*. Then,*

$$(1-\alpha)\, I_\alpha(X;Y) = \min_{Q_X}\left\{(1-\alpha)\, I_\alpha^{\mathrm{c}}\big(Q_X, P_{Y|X}\big) + D(Q_X\|P_X)\right\}, \tag{165}$$

*where the minimum is attained by* $Q_{X[\alpha]}$*, the* $\alpha$*-adjunct of* $P_X$ *defined in* (152)*.*

**Proof.** The variational representations in (81) and (128) result in (165). To show that the minimum is indeed attained by $Q_{X[\alpha]}$, recall from Theorem 10 that the $\langle\alpha\rangle$-response to $Q_{X[\alpha]}$ is $P_{Y[\alpha]}$. Therefore, evaluating the term in $\{\cdot\}$ in (165) for $Q_X \leftarrow Q_{X[\alpha]}$ yields, with $\widetilde{X} \sim Q_{X[\alpha]}$,

$$\begin{split}
&(1-\alpha)\, I_\alpha^{\mathrm{c}}\big(Q_{X[\alpha]}, P_{Y|X}\big) + D\big(Q_{X[\alpha]}\,\|\,P_X\big)\\
&= (1-\alpha)\,\mathbb{E}\left[D_\alpha\big(P_{Y|X}(\cdot|\widetilde{X})\,\|\,P_{Y[\alpha]}\big)\right] + D\big(Q_{X[\alpha]}\,\|\,P_X\big)
\end{split} \tag{166}$$

$$
= -\kappa_\alpha \tag{167}
$$

$$= (1-\alpha)\, I_\alpha(X;Y), \tag{168}$$

where (167) follows from (152) and (168) results from (69)–(75).

45. For finite input alphabets, Lemma 18(b) of [43] (earlier, Theorem 3.4 of [35] gave an equivalent variational characterization assuming, in addition, finite output alphabets) established the following dual to Theorem 12.

**Theorem 13.** *Fix* $\alpha\in(0,1)$*,* $P_X\in\mathcal{P}_\mathcal{A}$ *and* $P_{Y|X}\colon\mathcal{A}\to\mathcal{B}$*. Then,*

$$(1-\alpha)\, I_\alpha^{\mathrm{c}}(X;Y) = \max_{Q_X}\left\{(1-\alpha)\, I_\alpha\big(Q_X, P_{Y|X}\big) - D(P_X\|Q_X)\right\}. \tag{169}$$

*The maximum is attained by* $Q_{X\langle\alpha\rangle}$*, the* $\langle\alpha\rangle$*-adjunct of* $P_X$ *defined by* (157)*.*

**Proof.** First observe that (165) implies that $\geq$ holds in (169). Second, the term in $\{\cdot\}$ on the right side of (169) evaluated at $Q_X \leftarrow Q_{X\langle\alpha\rangle}$ becomes

$$\begin{split}
&(1-\alpha)\, I_\alpha\big(Q_{X\langle\alpha\rangle}, P_{Y|X}\big) - D\big(P_X\,\|\,Q_{X\langle\alpha\rangle}\big)\\
&= (1-\alpha)\, I_\alpha\big(Q_{X\langle\alpha\rangle}, P_{Y|X}\big) + (1-\alpha)\, I_\alpha^{\mathrm{c}}\big(P_X, P_{Y|X}\big) + \upsilon_\alpha
\end{split} \tag{170}$$

$$= (1-\alpha)\, I_\alpha^{\mathrm{c}}\big(P_X, P_{Y|X}\big), \tag{171}$$

where (170) follows by taking the expectation of minus (157) with respect to $P_X$, and (171) follows from (161). Therefore, $\leq$ also holds in (169) and the maximum is attained by $Q_{X\langle\alpha\rangle}$, as we wanted to show.

Hinging on Theorem 8, Theorems 12 and 13 are given for $\alpha\in(0,1)$, which is the region of interest in the analysis of error exponents. Whenever, as in the finite-alphabet case, (128) holds for $\alpha>1$, Theorems 12 and 13 also hold for $\alpha>1$.

Notice that, since the definition of $Q_{X\langle\alpha\rangle}$ involves $P_{Y\langle\alpha\rangle}$, the fact that it attains the maximum in (169) does not bring us any closer to finding $I_\alpha^{\mathrm{c}}(X;Y)$ for a specific input probability measure $P_X$. Fortunately, as we will see in Section 8, (169) proves to be the gateway to the maximization of $I_\alpha^{\mathrm{c}}(X;Y)$ in the presence of input-cost constraints.

46. Focusing on the main range of interest, $\alpha\in(0,1)$, we can express (169) as

$$I_\alpha^{\mathrm{c}}\big(P_X, P_{Y|X}\big) = \max_{Q_X}\left\{I_\alpha\big(Q_X, P_{Y|X}\big) - \frac{1}{1-\alpha}\, D(P_X\|Q_X)\right\} \tag{172}$$

$$= \max_{\xi\geq 0}\left\{\mathbb{I}(\xi) - \frac{\xi}{1-\alpha}\right\} \tag{173}$$

$$= \mathbb{I}(\xi_\alpha) - \frac{\xi_\alpha}{1-\alpha}, \tag{174}$$

where we have defined the function (dependent on $\alpha$, $P_X$, and $P_{Y|X}$)

$$\mathbb{I}(\xi) = \max_{\substack{Q_X:\\ D(P_X\|Q_X)\leq\xi}} I_\alpha\big(Q_X, P_{Y|X}\big), \tag{175}$$

and $\xi_\alpha$ is the solution to

$$\dot{\mathbb{I}}(\xi_\alpha) = \frac{1}{1-\alpha}. \tag{176}$$

Recall that the maxima over the input distribution in (172) and (175) are attained by the $\langle\alpha\rangle$-adjunct $Q_{X\langle\alpha\rangle}$ defined in Item 41.
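For a small finite-alphabet example, the duality (172) can be checked by brute force. The following sketch (an illustration added here, not from the paper) computes the Augustin–Csiszár information directly as the inner minimization over $Q_Y$ in its definition, and independently evaluates the right side of (172) by a grid search over binary $Q_X$:

```python
import numpy as np

a = 0.4
W = np.array([[0.8, 0.2],
              [0.3, 0.7]])            # P_{Y|X}, binary input and output
p_x = np.array([0.6, 0.4])

def d_alpha(p, q):                    # Renyi divergence of order a, nats
    return np.log(np.sum(p**a * q**(1 - a))) / (a - 1.0)

def i_alpha(q_x):                     # alpha-mutual information I_a(Q_X, W)
    return a / (a - 1.0) * np.log(((q_x @ W**a) ** (1.0 / a)).sum())

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

grid = np.linspace(1e-4, 1 - 1e-4, 4001)

# left side of (172): Augustin-Csiszar information, minimum over Q_Y
lhs = min(p_x[0] * d_alpha(W[0], np.array([q, 1 - q]))
          + p_x[1] * d_alpha(W[1], np.array([q, 1 - q])) for q in grid)

# right side of (172): maximum over auxiliary inputs Q_X = (t, 1-t)
rhs = max(i_alpha(np.array([t, 1 - t]))
          - kl(p_x, np.array([t, 1 - t])) / (1 - a) for t in grid)

assert abs(lhs - rhs) < 1e-4
```

With a fine enough grid the two sides agree to within numerical tolerance, as Theorem 13 guarantees.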

	- $P_Y$: The familiar output probability measure $P_X \to P_{Y|X} \to P_Y$, defined in Item 5.
	- $P_{Y[\alpha]}$: The $\alpha$-response to $P_X$, defined in Item 7. It is the unique achiever of the minimization in the definition of $\alpha$-mutual information in (67).
	- $P_{Y\langle\alpha\rangle}$: The $\langle\alpha\rangle$-response to $P_X$, defined in Item 26. It is the unique achiever of the minimization in the definition of Augustin–Csiszár $\alpha$-mutual information in (110).
	- $Q_{X[\alpha]}$: The $\alpha$-adjunct of $P_X$, defined in (152). The $\langle\alpha\rangle$-response to $Q_{X[\alpha]}$ is $P_{Y[\alpha]}$. Furthermore, $Q_{X[\alpha]}$ achieves the minimum in (165).
	- $Q_{X\langle\alpha\rangle}$: The $\langle\alpha\rangle$-adjunct of $P_X$, defined in (157). The $\alpha$-response to $Q_{X\langle\alpha\rangle}$ is $P_{Y\langle\alpha\rangle}$. Furthermore, $Q_{X\langle\alpha\rangle}$ achieves the maximum in (169).

#### **6. Maximization of $I_\alpha(X;Y)$**

Just like the maximization of mutual information with respect to the input distribution yields the channel capacity (of course, subject to conditions [57]), the maximization of $I_\alpha(X;Y)$ and of $I_\alpha^{\mathrm{c}}(X;Y)$ arises in the analysis of error exponents, as we will see in Section 10. A recent in-depth treatment of the maximization of $\alpha$-mutual information is given in [45]. As we see most clearly in (82) for the discrete case, when it comes to its optimization, one advantage of $I_\alpha(X;Y)$ over $I(X;Y)$ is that the input distribution does not affect the expression through its influence on the output distribution.

48. The maximization of *α*-mutual information is facilitated by the following result.

**Theorem 14** ([45])**.** *Given* $\alpha\in(0,1)\cup(1,\infty)$*; a random transformation* $P_{Y|X}\colon\mathcal{A}\to\mathcal{B}$*; and a convex set* $\mathcal{P}\subset\mathcal{P}_\mathcal{A}$*, the following are equivalent.*

*(a)* $P_X^*\in\mathcal{P}$ *attains the maximal* $\alpha$*-mutual information on* $\mathcal{P}$*,*

$$I_\alpha\big(P_X^*, P_{Y|X}\big) = \max_{P\in\mathcal{P}} I_\alpha\big(P, P_{Y|X}\big) < \infty. \tag{177}$$

*(b) For any* $P_X\in\mathcal{P}$*, and any output distribution* $Q_Y\in\mathcal{P}_\mathcal{B}$*,*

$$D_\alpha\big(P_{Y|X}\,\|\,P_{Y[\alpha]}^*\,\big|\,P_X\big) \leq D_\alpha\big(P_{Y|X}\,\|\,P_{Y[\alpha]}^*\,\big|\,P_X^*\big) \tag{178}$$

$$$$

*where* $P_{Y[\alpha]}^*$ *is the* $\alpha$*-response to* $P_X^*$*.*

*those that satisfy the constraint*

*Moreover, if* $P_{Y[\alpha]}$ *denotes the* $\alpha$*-response to* $P_X$*, then*

$$D_\alpha\big(P_{Y[\alpha]}\,\|\,P_{Y[\alpha]}^*\big) \leq I_\alpha\big(P_X^*, P_{Y|X}\big) - I_\alpha\big(P_X, P_{Y|X}\big) < \infty. \tag{180}$$

Note that, while $I_\alpha(\cdot, P_{Y|X})$ may not be maximized by a unique (or, in fact, by any) input distribution, the resulting $\alpha$-response $P_{Y[\alpha]}^*$ is indeed unique. If $\mathcal{P}$ is such that none of its elements attains the maximal $I_\alpha$, it is known [42,45] that the $\alpha$-response to any asymptotically optimal sequence of input distributions converges to $P_{Y[\alpha]}^*$. This is the counterpart of a result by Kemperman [58] concerning mutual information.

49. The following example appears in [45]. **Example 11.** *Let* $Y = X + N$*, where* $N\sim\mathcal{N}(0,\sigma_N^2)$ *is independent of* $X$*. Fix* $\alpha\in(0,1)$ *and* $P>0$*. Suppose that the set* $\mathcal{P}\subset\mathcal{P}_\mathcal{A}$ *of allowable input probability measures consists of those that satisfy the constraint*

$$\mathbb{E}\left[\exp_{\mathrm{e}}\left(-\frac{\alpha(1-\alpha)X^2}{2\left(\alpha^2 P+\sigma_N^2\right)}\right)\right] \geq \sqrt{\frac{\alpha^2 P+\sigma_N^2}{\alpha P+\sigma_N^2}}. \tag{181}$$

*We can readily check that* $X^*\sim\mathcal{N}(0,P)$ *satisfies* (181) *with equality and, as we saw in Example 2, its* $\alpha$*-response is* $P_{Y[\alpha]}^* = \mathcal{N}(0,\alpha P+\sigma_N^2)$*. Theorem 14 establishes that* $P_X^*$ *does indeed maximize the* $\alpha$*-mutual information among all the distributions in* $\mathcal{P}$*, yielding (recall Example 6)*

$$\max_{P_X\in\mathcal{P}} I_\alpha(X;Y) = \frac{1}{2}\log\left(1+\frac{\alpha P}{\sigma_N^2}\right). \tag{182}$$

*Curiously, if, instead of* $\mathcal{P}$ *defined by the constraint* (181)*, we consider the more conventional* $\mathcal{P} = \{X\colon \mathbb{E}[X^2]\leq P\}$*, then the left side of* (182) *is unknown at present. Numerical evidence shows that it can exceed the right side by employing non-Gaussian inputs.*
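The claim that $X^*\sim\mathcal{N}(0,P)$ meets the displayed constraint with equality can be verified in closed form. The sketch below (an illustration added here) relies only on the Gaussian moment identity $\mathbb{E}[e^{-sX^2}] = (1+2sP)^{-1/2}$ for $X\sim\mathcal{N}(0,P)$, plus a Monte Carlo cross-check:

```python
import numpy as np

a, P, sN2 = 0.6, 2.0, 1.5
s = a * (1 - a) / (2 * (a**2 * P + sN2))

# closed form: for X ~ N(0, P), E[exp(-s X^2)] = 1/sqrt(1 + 2 s P)
lhs = 1.0 / np.sqrt(1 + 2 * s * P)
rhs = np.sqrt((a**2 * P + sN2) / (a * P + sN2))
assert np.isclose(lhs, rhs)            # equality in the constraint

# Monte Carlo cross-check of the same expectation
rng = np.random.default_rng(1)
x = rng.normal(0.0, np.sqrt(P), 400_000)
assert abs(np.exp(-s * x**2).mean() - lhs) < 1e-3
```

Algebraically, $1+2sP = (\alpha P+\sigma_N^2)/(\alpha^2 P+\sigma_N^2)$, which is exactly how the two sides of the constraint coincide for the Gaussian input.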

50. Recalling (56), (178) implies that if $P_X^*$ attains the finite maximal unconstrained $\alpha$-mutual information and its $\alpha$-response is denoted by $P_{Y[\alpha]}^*$, then

$$\max_X I_\alpha(X;Y) = \max_{P\in\mathcal{P}_\mathcal{A}} I_\alpha\big(P, P_{Y|X}\big) = \max_{a\in\mathcal{A}} D_\alpha\big(P_{Y|X=a}\,\|\,P_{Y[\alpha]}^*\big), \tag{183}$$

which requires that $P_X^*(\mathcal{A}_\alpha^*) = 1$, with

$$\mathcal{A}_\alpha^* = \left\{x\in\mathcal{A}\colon D_\alpha\big(P_{Y|X=x}\,\|\,P_{Y[\alpha]}^*\big) = \max_{a\in\mathcal{A}} D_\alpha\big(P_{Y|X=a}\,\|\,P_{Y[\alpha]}^*\big)\right\}. \tag{184}$$

For discrete alphabets, this requires that if $x\notin\mathcal{A}_\alpha^*$, then $P_X^*(x) = 0$, which is tantamount to

$$\sum_{y\in\mathcal{B}} P_{Y|X}^\alpha(y|x)\; \mathbb{E}^{\frac{1-\alpha}{\alpha}}\left[P_{Y|X}^\alpha(y|X^*)\right] \geq \exp\left(\frac{\alpha-1}{\alpha}\, I_\alpha(X^*;Y^*)\right), \tag{185}$$

with equality for all $x\in\mathcal{A}$ such that $P_X^*(x) > 0$. For finite-alphabet random transformations, this observation is equivalent to Theorem 5.6.5 in [9].

51. Getting slightly ahead of ourselves, we note that, in view of (128), an important consequence of Theorem 15 below is that, as anticipated in Item 25, the unconstrained maximization of $I_\alpha(X;Y)$ for $\alpha\in(0,1)$ can be expressed in terms of the solution to an optimization problem involving only conventional mutual information and conditional relative entropy. For $\rho\geq 0$,

$$\rho\,\sup_X I_{\frac{1}{1+\rho}}(X;Y) = \sup_X\;\min_{Q_{Y|X}\colon\mathcal{A}\to\mathcal{B}}\left\{D\big(Q_{Y|X}\,\|\,P_{Y|X}\,\big|\,P_X\big) + \rho\, I\big(P_X, Q_{Y|X}\big)\right\}. \tag{186}$$

#### **7. Unconstrained Maximization of $I_\alpha^{\mathrm{c}}(X;Y)$**

52. In view of the fact that it is much easier to determine the $\alpha$-mutual information than the order-$\alpha$ Augustin–Csiszár information, it would be advantageous to show that the unconstrained maximum of $I_\alpha^{\mathrm{c}}(X;Y)$ equals the unconstrained maximum of $I_\alpha(X;Y)$. In the finite-alphabet setting, in which it is possible to invoke a "minisup" theorem (e.g., see Section 7.1.7 of [59]), Csiszár [32] showed this result for $\alpha>0$. The assumption of finite output alphabets was dropped in Theorem 1 of [42], and further generalized in Theorem 3 of the same reference. As we see next, for $\alpha\in(0,1)$, it is possible to give an elementary proof without restrictions on the alphabets.

**Theorem 15.** *Let* $\alpha\in(0,1)$*. If the suprema are over* $\mathcal{P}_\mathcal{A}$*, the set of all probability measures defined on the input space, then*

$$\sup_X I_\alpha^{\mathrm{c}}(X;Y) = \sup_X I_\alpha(X;Y). \tag{187}$$

**Proof.** In view of (143), $\geq$ holds in (187). To show $\leq$, we assume $\sup_X I_\alpha(X;Y) < \infty$ as, otherwise, there is nothing left to prove. The unconstrained maximization identity in (183) implies

$$\sup_X I_\alpha(X;Y) = \sup_{a\in\mathcal{A}} D_\alpha\big(P_{Y|X=a}\,\|\,P_{Y[\alpha]}^*\big) \tag{188}$$

$$= \sup_{P_X\in\mathcal{P}_\mathcal{A}} \mathbb{E}\left[D_\alpha\big(P_{Y|X}(\cdot|X)\,\|\,P_{Y[\alpha]}^*\big)\right] \tag{189}$$

$$\geq \inf_{Q\in\mathcal{Q}}\;\sup_{P_X\in\mathcal{P}_\mathcal{A}} \mathbb{E}\left[D_\alpha\big(P_{Y|X}(\cdot|X)\,\|\,Q\big)\right] \tag{190}$$

$$\geq \sup_{P_X\in\mathcal{P}_\mathcal{A}}\;\inf_{Q\in\mathcal{Q}} \mathbb{E}\left[D_\alpha\big(P_{Y|X}(\cdot|X)\,\|\,Q\big)\right] \tag{191}$$

$$= \sup_X I_\alpha^{\mathrm{c}}(X;Y), \tag{192}$$

where $P_{Y[\alpha]}^*$ is the unique $\alpha$-response to any input that achieves the maximal $\alpha$-mutual information and, if there is no such input, it is the limit of the $\alpha$-responses to any asymptotically optimal input sequence (Item 48).

Furthermore, if $\{X_n\}$ is asymptotically optimal for $I_\alpha$, i.e., $\lim_{n\to\infty} I_\alpha(X_n;Y_n) = \sup_X I_\alpha(X;Y)$, then $\{X_n\}$ is also asymptotically optimal for $I_\alpha^{\mathrm{c}}$ because, for any $\delta>0$, we can find $N$ such that for all $n>N$,

$$I_\alpha(X_n;Y_n) + \delta \geq \sup_{a\in\mathcal{A}} D_\alpha\big(P_{Y|X=a}\,\|\,P_{Y[\alpha]}^*\big) \tag{193}$$

$$\geq \mathbb{E}\left[D_\alpha\big(P_{Y|X}(\cdot|X_n)\,\|\,P_{Y[\alpha]}^*\big)\right] \tag{194}$$

$$\geq I_\alpha^{\mathrm{c}}(X_n;Y_n) \tag{195}$$

$$\geq I_\alpha(X_n;Y_n). \tag{196}$$

#### **8. Maximization of $I_\alpha^{\mathrm{c}}(X;Y)$ Subject to Average Cost Constraints**

This section is at the heart of the relevance of Rényi information measures to error exponent functions.

53. Given $\alpha\in(0,1)$, $P_{Y|X}\colon\mathcal{A}\to\mathcal{B}$, a cost function $\mathsf{b}\colon\mathcal{A}\to[0,\infty)$, and a real scalar $\theta\geq 0$, the objective is to maximize the Augustin–Csiszár mutual information allowing only those probability measures that satisfy $\mathbb{E}[\mathsf{b}(X)]\leq\theta$, namely,

$$\mathbb{C}_\alpha^{\mathrm{c}}(\theta) = \sup_{\substack{P_X:\\ \mathbb{E}[\mathsf{b}(X)]\leq\theta}} I_\alpha^{\mathrm{c}}\big(P_X, P_{Y|X}\big). \tag{197}$$

Unfortunately, identity (187) no longer holds when the maximizations over the input probability measure are cost-constrained and, in general, we can only claim

$$\mathbb{C}_\alpha^{\mathrm{c}}(\theta) \geq \sup_{\substack{P_X:\\ \mathbb{E}[\mathsf{b}(X)]\leq\theta}} I_\alpha\big(P_X, P_{Y|X}\big). \tag{198}$$

A conceptually simple approach to solve for $\mathbb{C}_\alpha^{\mathrm{c}}(\theta)$ is to find a saddle point $(P_X^*, P_Y^*)$ of the game with payoff function


$$B(P_X, Q_Y) = \int D_\alpha\big(P_{Y|X=x}\,\|\,Q_Y\big)\,\mathrm{d}P_X, \tag{199}$$

where $Q_Y\in\mathcal{P}_\mathcal{B}$ and $P_X$ is chosen from the convex subset of $\mathcal{P}_\mathcal{A}$ of probability measures which satisfy $\mathbb{E}[\mathsf{b}(X)]\leq\theta$.

Since $P_Y^*$ is already known, by definition, to be the $\langle\alpha\rangle$-response to $P_X^*$, verifying the saddle point is tantamount to showing that $B(P_X, P_Y^*)$ is maximized by $P_X^*$ among $\{P_X\in\mathcal{P}_\mathcal{A}\colon \mathbb{E}[\mathsf{b}(X)]\leq\theta\}$. Theorem 1 of [43] guarantees the existence of a saddle point in the case of finite input alphabets. In addition to the fact that it is not always easy to guess the optimum input $P_X^*$ (see, e.g., Section 12), the main stumbling block is the difficulty in determining the $\langle\alpha\rangle$-response to any candidate input distribution, although sometimes this is indeed feasible, as we saw in Example 7.

54. Naturally, Theorem 15 implies

$$\mathbb{C}_\alpha^{\mathrm{c}}(\theta) \leq \sup_X I_\alpha(X;Y). \tag{200}$$

If the unconstrained maximization of $I_\alpha^{\mathrm{c}}(\cdot, P_{Y|X})$ is achieved by an input distribution $X^\star$ that satisfies $\mathbb{E}[\mathsf{b}(X^\star)]\leq\theta$, then equality holds in (200), which, in turn, is equal to $I_\alpha^{\mathrm{c}}(P_X^\star, P_{Y|X})$. In that case, the average cost constraint is said to be inactive. For most cost functions and random transformations of practical interest, the cost constraint is active for all $\theta>0$. To ascertain whether it is, we simply verify whether there exists an input achieving the right side of (200) which happens to satisfy the constraint. If so, $\mathbb{C}_\alpha^{\mathrm{c}}(\theta)$ has been found. The same holds if we can find a sequence $\{X_n\}$ such that $\mathbb{E}[\mathsf{b}(X_n)]\leq\theta$ and $I_\alpha(X_n;Y_n)\to\sup_X I_\alpha(X;Y)$. Otherwise, we proceed with the method described below. Thus, henceforth, we assume that the cost constraint is active.

55. The approach proposed in this paper to solve for $\mathbb{C}_\alpha^{\mathrm{c}}(\theta)$ for $\alpha\in(0,1)$ hinges on the variational representation in (172), which allows us to sidestep having to find any $\langle\alpha\rangle$-response. Note that, once we set out to maximize $I_\alpha^{\mathrm{c}}(P_X, P_{Y|X})$ over $\mathcal{P} = \{P_X\in\mathcal{P}_\mathcal{A}\colon \mathbb{E}[\mathsf{b}(X)]\leq\theta\}$, the allowable $Q_X$ in the maximization in (175) range over a $\xi$-blow-up of $\mathcal{P}$ defined by

$$\Gamma_\xi(\mathcal{P}) = \{Q_X\in\mathcal{P}_\mathcal{A}\colon \exists\, P_X\in\mathcal{P} \text{ such that } D(P_X\|Q_X)\leq\xi\}. \tag{201}$$

As we show in Item 56, we can accomplish such an optimization by solving an unconstrained maximization of the sum of *α*-mutual information and a term suitably derived from the cost function.

56. It will not be necessary to solve for (176), as our goal is to further maximize (172) over $P_X$ subject to an average cost constraint. The Lagrangian corresponding to the constrained optimization in (197) is

$$\mathbb{L}_\alpha(\nu, P_X) = I_\alpha^{\mathrm{c}}(X;Y) - \nu\,\mathbb{E}[\mathsf{b}(X)] + \nu\,\theta, \tag{202}$$

where, on the left side, we have omitted, for brevity, the dependence on $\theta$ stemming from the last term on the right side. The Lagrange multiplier method (e.g., [60]) implies that, if $X^*$ achieves the supremum in (197), then there exists $\nu^*\geq 0$ such that, for all $P_X$ on $\mathcal{A}$ and $\nu\geq 0$,

$$\mathbb{L}_\alpha(\nu^*, P_X) \leq \mathbb{L}_\alpha(\nu^*, P_X^*) \leq \mathbb{L}_\alpha(\nu, P_X^*). \tag{203}$$

Note from (202) that the right inequality in (203) can only be achieved if

$$\mathbb{E}[\mathsf{b}(X^*)] = \theta, \tag{204}$$

and, consequently,

$$\mathbb{C}_\alpha^{\mathrm{c}}(\theta) = \mathbb{L}_\alpha(\nu^*, P_X^*) = \min_{\nu\geq 0}\max_{P_X}\mathbb{L}_\alpha(\nu, P_X) = \max_{P_X}\min_{\nu\geq 0}\mathbb{L}_\alpha(\nu, P_X). \tag{205}$$

The pivotal result enabling us to obtain $\mathbb{C}_\alpha^{\mathrm{c}}(\theta)$ without the need to deal with Augustin–Csiszár mutual information is the following.

**Theorem 16.** *Given* $\alpha\in(0,1)$*,* $\nu\geq 0$*,* $P_{Y|X}\colon\mathcal{A}\to\mathcal{B}$*, and* $\mathsf{b}\colon\mathcal{A}\to[0,\infty)$*, denote the function*

$$\mathbb{A}_\alpha(\nu) = \max_X\left\{I_\alpha(X;Y) + \frac{1}{1-\alpha}\log\mathbb{E}\big[\exp\big(-(1-\alpha)\,\nu\,\mathsf{b}(X)\big)\big]\right\}. \tag{206}$$

*Then,*

$$\sup_{P_X\in\mathcal{P}_\mathcal{A}}\mathbb{L}_\alpha(\nu, P_X) = \nu\,\theta + \mathbb{A}_\alpha(\nu), \tag{207}$$

*and*

$$\mathbb{C}_\alpha^{\mathrm{c}}(\theta) = \min_{\nu\geq 0}\{\nu\,\theta + \mathbb{A}_\alpha(\nu)\}. \tag{208}$$

**Proof.** Plugging (172) into (197), we obtain, with $X\sim P_X$ and $\hat{X}\sim Q_X$,

$$\sup_{P_X\in\mathcal{P}_\mathcal{A}}\mathbb{L}_\alpha(\nu, P_X) = \sup_{P_X}\big\{I_\alpha^{\mathrm{c}}(X;Y) - \nu\,\mathbb{E}[\mathsf{b}(X)] + \nu\,\theta\big\} \tag{209}$$

$$= \sup_{P_X\in\mathcal{P}_\mathcal{A}}\left\{\max_{Q_X\in\mathcal{P}_\mathcal{A}}\left\{I_\alpha\big(Q_X, P_{Y|X}\big) - \frac{1}{1-\alpha}\,D(P_X\|Q_X)\right\} - \nu\,\mathbb{E}[\mathsf{b}(X)] + \nu\,\theta\right\} \tag{210}$$

$$= \nu\,\theta + \max_{Q_X\in\mathcal{P}_\mathcal{A}}\left\{I_\alpha\big(Q_X, P_{Y|X}\big) - \frac{1}{1-\alpha}\inf_{P_X}\big\{D(P_X\|Q_X) + \nu(1-\alpha)\,\mathbb{E}[\mathsf{b}(X)]\big\}\right\} \tag{211}$$

$$= \nu\,\theta + \max_{Q_X\in\mathcal{P}_\mathcal{A}}\left\{I_\alpha\big(Q_X, P_{Y|X}\big) + \frac{1}{1-\alpha}\log\mathbb{E}\big[\exp\big(-\nu(1-\alpha)\,\mathsf{b}(\hat{X})\big)\big]\right\} \tag{212}$$

$$
= \nu\,\theta + \mathbb{A}_\alpha(\nu),
\tag{213}
$$

where (209) and (213) follow from (202) and (206), respectively, and (212) follows by invoking Theorem 1 with $Z \sim Q_X$ and

$$g(a) = (1-\alpha)\,\nu\,\mathsf{b}(a), \tag{214}$$

which is nonnegative since $\alpha\in(0,1)$ and $\nu\geq 0$. Finally, (208) follows from (205) and (207).
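The key step (211) to (212) is the Gibbs variational formula behind Theorem 1: for nonnegative $g$, $\inf_{P}\{D(P\|Q) + \mathbb{E}_P[g(X)]\} = -\log\mathbb{E}_Q[e^{-g(X)}]$, attained by the tilted measure. A minimal numerical sketch of this identity (added here for illustration; the variable names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
q = rng.dirichlet(np.ones(4))          # reference distribution Q
g = rng.uniform(0.0, 2.0, 4)           # nonnegative cost, playing (1-a) nu b(x)

# right side: -log E_Q[exp(-g(X))]
rhs = -np.log(np.sum(q * np.exp(-g)))

# left side: min over P of { D(P||Q) + E_P[g(X)] }, attained by the
# tilted distribution P*(x) proportional to Q(x) exp(-g(x))
p_star = q * np.exp(-g)
p_star /= p_star.sum()
lhs = np.sum(p_star * np.log(p_star / q)) + np.sum(p_star * g)
assert np.isclose(lhs, rhs)

# any other P can only do worse
for _ in range(100):
    p = rng.dirichlet(np.ones(4))
    assert np.sum(p * np.log(p / q)) + np.sum(p * g) >= rhs - 1e-12
```

This is exactly what turns the inner infimum over $P_X$ in (211) into the log-moment-generating term of (212).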

In conclusion, we have shown that the maximization of the Augustin–Csiszár mutual information of order $\alpha$ subject to $\mathbb{E}[\mathsf{b}(X)]\leq\theta$ boils down to the unconstrained maximization of a Lagrangian consisting of the sum of $\alpha$-mutual information and an exponential average of the cost function. Circumventing the need to deal with $\langle\alpha\rangle$-responses and with the Augustin–Csiszár mutual information of order $\alpha$ leads to a particularly simple optimization, as illustrated in Sections 11 and 12.

57. Theorem 16 solves for the maximal Augustin–Csiszár mutual information of order $\alpha$ under an average cost constraint without having to find out the input probability measure $P_X^*$ that attains it, nor its $\langle\alpha\rangle$-response $P_Y^*$ (using the notation in Item 53). Instead, it gives the solution as

$$\mathbb{C}_{\alpha}^{\mathsf{c}}(\theta) = \min_{\nu \geqslant 0} \left\{ \nu\,\theta + \max_{X} \left\{ I_{\alpha}(X;Y) + \frac{1}{1-\alpha} \log \mathbb{E}[\exp(-(1-\alpha)\nu\,\mathsf{b}(X))] \right\} \right\}. \tag{215}$$

Although we are not going to invoke a minimax theorem, with the aid of Theorem 9-(b) we can see that the functional within the inner brackets is concave in $P_X$; furthermore, if $V \in (0,1]$, then $\log\mathbb{E}[V^{\nu}]$ is easily seen to be convex in $\nu$ with the aid of the Cauchy–Schwarz inequality. Before we characterize the saddle point $(\nu^*, Q_X^*)$ of the game in (215), we note that $(P_X^*, P_Y^*)$ can be readily obtained from $(\nu^*, Q_X^*)$.
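The convexity claim is easy to check empirically. The following sketch (not part of the original text; all parameters are assumed for illustration) verifies midpoint convexity of $\nu \mapsto \log\mathbb{E}[V^{\nu}]$ for samples of $V \in (0,1]$:

```python
import math
import random

# V = exp(-(1 - alpha) * b(X)) with alpha in (0, 1) and b >= 0 lies in (0, 1].
# Assumed illustrative parameters, not taken from the text.
random.seed(0)
alpha = 0.3
samples = [math.exp(-(1 - alpha) * 5.0 * random.random()) for _ in range(20_000)]

def f(nu):
    """log E[V^nu] under the empirical distribution of the samples."""
    return math.log(sum(v ** nu for v in samples) / len(samples))

# Midpoint convexity (exact for the empirical measure, by Cauchy-Schwarz)
for lo, hi in [(0.1, 2.1), (0.5, 3.5), (1.0, 5.0)]:
    assert f((lo + hi) / 2) <= (f(lo) + f(hi)) / 2 + 1e-12
```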

**Theorem 17.** *Fix $\alpha \in (0,1)$. Let $\nu^* > 0$ denote the minimizer on the right side of* (215)*, and $Q_X^*$ the input probability measure that attains the maximum in* (206) *(or* (215)*) for $\nu = \nu^*$. Then,*


$$\imath_{P_X^* \| Q_X^*}(a) = -(1-\alpha)\,\nu^*\,\mathsf{b}(a) + \tau_{\alpha}, \quad a \in \mathcal{A},\tag{216}$$

*where $\tau_{\alpha}$ is a normalizing constant ensuring that $P_X^*$ is a probability measure.*

#### **Proof.**

(a) We had already established in Theorem 13 that the maximum on the right side of (210) is achieved by the $\langle\alpha\rangle$-adjunct of $P_X$. In the special case $\nu = \nu^*$, such $P_X$ is $P_X^*$. Therefore, $Q_X^*$, the argument that achieves the maximum in (206) for $\nu = \nu^*$, is the $\langle\alpha\rangle$-adjunct of $P_X^*$.


The saddle point of (215) admits the following characterization.

**Theorem 18.** *If $\alpha \in (0,1)$, the saddle point $(\nu^*, Q_X^*)$ of* (215) *satisfies*

$$\mathbb{E}\left[\mathsf{b}(\bar{X}^*)\exp\left(-(1-\alpha)\nu^*\,\mathsf{b}(\bar{X}^*)\right)\right] = \theta\,\mathbb{E}\left[\exp\left(-(1-\alpha)\nu^*\,\mathsf{b}(\bar{X}^*)\right)\right], \quad \bar{X}^* \sim Q_X^*, \tag{217}$$

$$D_{\alpha}\left(P_{Y|X=a} \parallel Q_{Y[\alpha]}^*\right) = \nu^*\,\mathsf{b}(a) + c_{\alpha}(\nu^*), \quad a \in \mathcal{A}, \tag{218}$$

*where $Q_{Y[\alpha]}^*$ is the $\alpha$-response to $Q_X^*$, and $c_{\alpha}(\nu^*)$ does not depend on $a \in \mathcal{A}$. Furthermore,*

$$\mathbb{A}_{\alpha}(\nu^*) = c_{\alpha}(\nu^*),\tag{219}$$

$$\mathbb{C}_{\alpha}^{\mathsf{c}}(\theta) = \nu^*\,\theta + c_{\alpha}(\nu^*). \tag{220}$$

**Proof.** First, we show that the scalar $\nu^* \geqslant 0$ that minimizes

$$f(\nu) = \nu\,\theta + I_{\alpha}(Q_X^*, P_{Y|X}) + \frac{1}{1-\alpha}\log\mathbb{E}\left[\exp\left(-(1-\alpha)\nu\,\mathsf{b}(\bar{X}^*)\right)\right] \tag{221}$$

satisfies (217). If we abbreviate $V = \exp\left(-(1-\alpha)\,\mathsf{b}(\bar{X}^*)\right) \in (0,1]$, then the dominated convergence theorem results in

$$\frac{\mathbf{d}}{\mathbf{d}\boldsymbol{\nu}} \left\{ \boldsymbol{\nu} \,\theta + \frac{1}{1-\boldsymbol{\alpha}} \log \mathbb{E}[V^{\boldsymbol{\nu}}] \right\} = \theta + \frac{1}{1-\boldsymbol{\alpha}} \frac{\mathbb{E}[V^{\boldsymbol{\nu}} \log V]}{\mathbb{E}[V^{\boldsymbol{\nu}}]}.\tag{222}$$

Therefore, (217) is equivalent to $\dot{f}(\nu^*) = 0$, which is all we need on account of the convexity of $f(\cdot)$. To show (218), notice that for all $a \in \mathcal{A}$,

$$(1-\alpha)\,\nu^*\,\mathsf{b}(a) - \tau_{\alpha} = \imath_{Q_X^* \| P_X^*}(a) \tag{223}$$

$$= (1-\alpha)\,D_{\alpha}\left(P_{Y|X=a} \parallel P_Y^*\right) + \upsilon_{\alpha}, \tag{224}$$

where (223) is (216) and (224) is (157) with $P_{Y\langle\alpha\rangle} \leftarrow P_Y^*$ in view of Theorem 17-(b). In conclusion, (218) holds with

$$c_{\alpha}(\nu^*) = \frac{\upsilon_{\alpha} + \tau_{\alpha}}{\alpha - 1}. \tag{225}$$

Finally, (206) implies

$$\mathbb{A}_{\alpha}(\nu^*) = I_{\alpha}(Q_X^*, P_{Y|X}) + \frac{1}{1-\alpha}\log\mathbb{E}\left[\exp\left(-(1-\alpha)\nu^*\,\mathsf{b}(\bar{X}^*)\right)\right]\tag{226}$$

$$\begin{split} &= \frac{1}{\alpha-1}\log\mathbb{E}\left[\exp\left((\alpha-1)\,D_{\alpha}\left(P_{Y|X}(\cdot|\bar{X}^*) \parallel P_Y^*\right)\right)\right] \\ &\quad + \frac{1}{1-\alpha}\log\mathbb{E}\left[\exp\left((\alpha-1)\,\nu^*\,\mathsf{b}(\bar{X}^*)\right)\right] \end{split}\tag{227}$$

$$\begin{split} &= \frac{1}{\alpha-1}\log\mathbb{E}\left[\exp\left((\alpha-1)\left(\nu^*\,\mathsf{b}(\bar{X}^*) + c_{\alpha}(\nu^*)\right)\right)\right] \\ &\quad + \frac{1}{1-\alpha}\log\mathbb{E}\left[\exp\left((\alpha-1)\,\nu^*\,\mathsf{b}(\bar{X}^*)\right)\right] \end{split}\tag{228}$$

$$= c_{\alpha}(\nu^*), \tag{229}$$

where (227) follows from the definition of *α*-mutual information and Theorem 17-(b), and (228) follows from (218). Plugging (219) into (208) results in (220).

58. In view of Theorems 17 and 18, $\mathbb{C}_{\alpha}^{\mathsf{c}}(\theta)$ can be determined by:

	- (a) guessing the form of the auxiliary input $Q_X^*$ (modulo some unknown parameter),
	- (b) obtaining its $\alpha$-response $Q_{Y[\alpha]}^*$, and
	- (c) verifying that (217) and (218) are satisfied for some specific choice of the unknown parameter.

With the same approach, we can postulate, for every $\nu \geqslant 0$, an input distribution $R_X^{\nu}$, whose $\alpha$-response $R_{Y[\alpha]}^{\nu}$ satisfies

$$D_{\alpha}\left(P_{Y|X=a} \parallel R_{Y[\alpha]}^{\nu}\right) = \nu\,\mathsf{b}(a) + c_{\alpha}(\nu), \quad a \in \mathcal{A},\tag{230}$$

where the only condition we place on $c_{\alpha}(\nu)$ is that it not depend on $a \in \mathcal{A}$. If this is indeed the case, then the same derivation in (226)–(229) results in

$$\mathbb{A}_{\alpha}(\nu) = c_{\alpha}(\nu),\tag{231}$$

and we determine $\nu^*$ as the solution to $\theta = -\dot{c}_{\alpha}(\nu^*)$, in lieu of (217). Sections 11 and 12 illustrate the effortless nature of this approach to solve for $\mathbb{A}_{\alpha}(\nu)$. Incidentally, (230) can be seen as the $\alpha$-generalization of the condition in Problem 8.2 of [48], elaborated later in [61].

#### **9. Gallager's** *E***<sup>0</sup> Functions and the Maximal Augustin–Csiszár Mutual Information**

In keeping with Gallager's setting [9], we stick to discrete alphabets throughout this section.

59. In his derivation of an achievability result for discrete memoryless channels, Gallager [8] introduced the function (1), which we repeat for convenience,

$$E\_0(\rho, P\_X) = -\log \sum\_{y \in \mathcal{B}} \left( \sum\_{\mathbf{x} \in \mathcal{A}} P\_X(\mathbf{x}) P\_{Y|X}^{\frac{1}{1+\rho}}(y|\mathbf{x}) \right)^{1+\rho} \tag{232}$$

Comparing (82) and (232), we obtain

$$E_0(\rho, P_X) = \rho\, I_{\frac{1}{1+\rho}}(X;Y),\tag{233}$$

which, as we mentioned in Section 1, is the observation by Csiszár in [30] that triggered the third phase in the representation of error exponents. Popularized in [9], the $E_0$ function was employed by Shannon, Gallager and Berlekamp [10] for $\rho \geqslant 0$ and by Arimoto [62] for $\rho \in (-1,0)$ in the derivation of converse results in data transmission, the latter of which considers rates above capacity, a region in which error probability increases with blocklength, approaching one at an exponential rate. For the achievability part, [8] showed upper bounds on the error probability involving $E_0(\rho, P_X)$ for $\rho \in [0,1]$. Therefore, for rates below capacity, the $\alpha$-mutual information only enters the picture for $\alpha \in (0,1)$. One exception in which Rényi divergence of order greater than 1 plays a role at rates below capacity was found by Sason [63], where a refined achievability result is shown for binary linear codes for output-symmetric channels (a case in which equiprobable $P_X$ maximizes (233)), as a function of their Hamming weight distribution.
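The identity (233) can be verified numerically; the following sketch (not part of the original text) checks it for a binary symmetric channel with an assumed crossover probability and equiprobable input, using natural logarithms:

```python
import math

# Assumed example: BSC with crossover delta, equiprobable input.
delta = 0.11
P_X = [0.5, 0.5]
W = [[1 - delta, delta], [delta, 1 - delta]]   # P_{Y|X}

def E0(rho):
    # Gallager's function (232)
    total = 0.0
    for y in range(2):
        inner = sum(P_X[x] * W[x][y] ** (1 / (1 + rho)) for x in range(2))
        total += inner ** (1 + rho)
    return -math.log(total)

def I_alpha(a):
    # alpha-mutual information (82) for discrete alphabets
    total = 0.0
    for y in range(2):
        inner = sum(P_X[x] * W[x][y] ** a for x in range(2))
        total += inner ** (1 / a)
    return (a / (a - 1)) * math.log(total)

for rho in (0.25, 0.5, 1.0):
    assert abs(E0(rho) - rho * I_alpha(1 / (1 + rho))) < 1e-12
```

With $\alpha = 1/(1+\rho)$, the prefactor $\alpha/(\alpha-1)$ equals $-1/\rho$, so the two formulas coincide term by term, which is what the assertions confirm.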

Although Gallager did not have the benefit of the insight provided by the Rényi information measures, he did notice certain behaviors of $E_0$ reminiscent of mutual information. For example, the derivative of (233) with respect to $\rho$, at $\rho = 0$, is equal to $I(X;Y)$. As pointed out by Csiszár in [32], in the absence of cost constraints, Gallager's $E_0$ function in (232) satisfies

$$\max\_{P\_X} E\_0(\rho, P\_X) = \rho \max\_X I\_{\frac{1}{1+\rho}}(X; Y) = \rho \max\_X I\_{\frac{1}{1+\rho}}^{\mathbf{c}}(X; Y), \tag{234}$$

in view of (233) and (187).

Recall that Gallager's modified *E*<sup>0</sup> function in the case of cost constraints is

$$E_0(\rho, P_X, r, \theta) = -\log \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} P_X(x) \exp(r\,\mathsf{b}(x) - r\,\theta)\, P_{Y|X}^{\frac{1}{1+\rho}}(y|x) \right)^{1+\rho}, \tag{235}$$

which, like (232), he introduced in order to show an achievability result. Up until now, no counterpart to (234) has been found with cost constraints and (235). This is accomplished in the remainder of this section.

60. In the finite alphabet case the following result is useful to obtain a numerical solution for the functional in (206). More importantly, it is relevant to the discussion in Item 61.

**Theorem 19.** *In the special case of discrete alphabets, the function in* (206) *is equal to*

$$\mathbb{A}_{\alpha}(\nu) = \max_{G} \frac{\alpha}{\alpha-1} \log \sum_{y \in \mathcal{B}} \left( \sum_{a \in \mathcal{A}} G(a)\, P_{Y|X}^{\alpha}(y|a) \right)^{\frac{1}{\alpha}} \tag{236}$$

*where the maximization is over all $G\colon \mathcal{A} \to [0,\infty)$ such that*

$$\sum_{a \in \mathcal{A}} G(a) \exp(-(1-\alpha)\,\nu\,\mathsf{b}(a)) = 1. \tag{237}$$

**Proof.** Recalling (82) we have

$$I_{\alpha}(X;Y) + \frac{1}{1-\alpha} \log \mathbb{E}[\exp(-(1-\alpha)\nu\,\mathsf{b}(X))]$$

$$= \frac{\alpha}{\alpha-1} \log \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} P_X(x)\, P_{Y|X=x}^{\alpha}(y) \right)^{\frac{1}{\alpha}} + \frac{1}{1-\alpha} \log \mathbb{E}[\exp(-(1-\alpha)\nu\,\mathsf{b}(X))] \tag{238}$$

$$= \frac{\alpha}{\alpha-1} \log \sum_{y \in \mathcal{B}} \left( \frac{\mathbb{E}\left[P_{Y|X}^{\alpha}(y|X)\right]}{\mathbb{E}[\exp(-(1-\alpha)\nu\,\mathsf{b}(X))]} \right)^{\frac{1}{\alpha}} \tag{239}$$

$$= \frac{\alpha}{\alpha-1} \log \sum_{y \in \mathcal{B}} \left( \sum_{a \in \mathcal{A}} G(a)\, P_{Y|X}^{\alpha}(y|a) \right)^{\frac{1}{\alpha}}, \tag{240}$$

where

$$G(x) = \frac{P_X(x)}{\sum_{a \in \mathcal{A}} P_X(a) \exp(-(1-\alpha)\,\nu\,\mathsf{b}(a))}.\tag{241}$$
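The change of variable in (241) can be checked directly: the resulting $G$ automatically satisfies the normalization (237). The following sketch (not part of the original text; the alphabet, input distribution and cost are assumed for illustration) confirms this:

```python
import math

# Assumed toy setup: ternary input alphabet with cost b(a).
alpha, nu = 0.4, 0.8
A = [0, 1, 2]
P_X = [0.5, 0.3, 0.2]
b = [0.0, 1.0, 4.0]

# G as defined in (241)
norm = sum(P_X[a] * math.exp(-(1 - alpha) * nu * b[a]) for a in A)
G = [P_X[a] / norm for a in A]

# G is a valid candidate in (236): it satisfies the normalization (237)
lhs = sum(G[a] * math.exp(-(1 - alpha) * nu * b[a]) for a in A)
assert abs(lhs - 1.0) < 1e-12
assert all(g >= 0 for g in G)
```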

61. We can now proceed to close the circle between the maximization of Augustin–Csiszár mutual information subject to average cost constraints (Phase 3 in Section 1) and Gallager's approach (Phase 1 in Section 1).

**Theorem 20.** *In the discrete alphabet case, recalling the definitions in* (202) *and* (235)*, for $\rho > 0$,*

$$\max_{P_X} E_0(\rho, P_X, r, \theta) = \rho \max_{P_X} \mathbb{L}_{\frac{1}{1+\rho}}\left( r + \frac{r}{\rho},\, P_X \right), \quad r > 0; \tag{242}$$

$$\min_{r \geqslant 0} \max_{P_X} E_0(\rho, P_X, r, \theta) = \rho\, \mathbb{C}_{\frac{1}{1+\rho}}^{\mathsf{c}}(\theta), \tag{243}$$

*where the maximizations are over $\mathcal{P}_{\mathcal{A}}$.*

**Proof.** With

$$\alpha = \frac{1}{1+\rho} \quad \text{and} \quad \nu = r\,\frac{1+\rho}{\rho} = \frac{r}{1-\alpha}, \tag{244}$$

the maximization of (235) with respect to the input probability measure yields

$$\max_{P_X} E_0(\rho, P_X, r, \theta) = \max_{P_X} \left\{ (1+\rho)\, r\,\theta - \log \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} P_X(x) \exp(r\,\mathsf{b}(x))\, P_{Y|X}^{\frac{1}{1+\rho}}(y|x) \right)^{1+\rho} \right\} \tag{245}$$

$$= \rho\,\nu\,\theta + \rho \max_{P_X} \frac{\alpha}{\alpha-1} \log \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} P_X(x) \exp((1-\alpha)\,\nu\,\mathsf{b}(x))\, P_{Y|X}^{\alpha}(y|x) \right)^{\frac{1}{\alpha}} \tag{246}$$

$$= \rho\,\nu\,\theta + \rho \max_{G} \frac{\alpha}{\alpha-1} \log \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} G(x)\, P_{Y|X}^{\alpha}(y|x) \right)^{\frac{1}{\alpha}} \tag{247}$$

$$= \rho\,\nu\,\theta + \rho\,\mathbb{A}_{\alpha}(\nu) \tag{248}$$

$$= \rho \max_{P_X} \mathbb{L}_{\alpha}(\nu, P_X), \tag{249}$$

where (247) and (248) follow from Theorem 19, and (249) follows from (208).
The proof of (242) is complete once (244) is invoked to substitute *α* and *ν* from the right side of (249). If we now minimize the outer sides of (245)–(249) with respect to *r* we obtain, using (205) and (244),

$$\min_{r \geqslant 0} \max_{P_X} E_0(\rho, P_X, r, \theta) = \rho \min_{r \geqslant 0} \max_{P_X} \mathbb{L}_{\alpha}\left( \frac{r}{1-\alpha},\, P_X \right) \tag{250}$$

$$= \rho \min_{\nu \geqslant 0} \max_{P_X} \mathbb{L}_{\alpha}(\nu, P_X) \tag{251}$$

$$=\rho \, \mathbb{C}^{\mathbf{c}}\_{\frac{1}{1+\rho}}(\theta). \tag{252}$$

On p. 329 of [9], Gallager poses the unconstrained maximization (i.e., over $P_X \in \mathcal{P}_{\mathcal{A}}$) of the Lagrangian

$$E_0(\rho, P_X, r, \theta) + \gamma \sum_{a \in \mathcal{A}} P_X(a)\,\mathsf{b}(a) - \gamma\,\theta. \tag{253}$$

Note the apparent discrepancy between the optimizations in (243) and (253): the latter is parametrized by $r$ and $\gamma$ (in addition to $\rho$ and $\theta$), while the maximization on the right side of (243) does not enforce any average cost constraint. In fact, there is no disparity since Gallager loc. cit. finds serendipitously that $\gamma = 0$ regardless of $r$ and $\theta$, and, therefore, just one parameter is enough.

62. The raison d'être for Augustin's introduction of $I_{\alpha}^{\mathsf{c}}$ in [36] was his quest to view Gallager's approach with average cost constraints under the optic of Rényi information measures. Contrasting (232) and (235), and inspired by the fact that, in the absence of cost constraints, (232) satisfies a variational characterization in view of (69) and (233), Augustin [36] dealt, not with (235), but with

$$\min_{Q_Y} D_{\alpha}(\tilde{P}_{Y|X} \| Q_Y | P_X), \quad \text{where} \quad \tilde{P}_{Y|X=x} = P_{Y|X=x} \exp\left(r\,\mathsf{b}(x)\right).$$

Assuming finite alphabets, Augustin was able to connect this quantity with the maximal $I_{\alpha}^{\mathsf{c}}(X;Y)$ under cost constraints in an arcane analysis that invokes a minimax theorem. This line of work was continued in Section 5 of [43], which refers to $\min_{Q_Y} D_{\alpha}(\tilde{P}_{Y|X} \| Q_Y | P_X)$ as the Rényi–Gallager information. Unfortunately, since $\tilde{P}_{Y|X}$ is not a random transformation, the conditional pseudo-Rényi divergence $D_{\alpha}(\tilde{P}_{Y|X} \| Q_Y | P_X)$ need not satisfy the key additive decomposition in Theorem 4, so the approach of [36,43] fails to establish an identity equating the maximization of Gallager's function (235) with the maximization of Augustin–Csiszár mutual information, which is what we have accomplished through a crisp and elementary analysis.

#### **10. Error Exponent Functions**

The central objects of interest in the error exponent analysis of data transmission are the functions $E_{\mathrm{sp}}(R, P_X)$ and $E_{\mathrm{r}}(R, P_X)$ of a random transformation $P_{Y|X}\colon \mathcal{A} \to \mathcal{B}$. Reflecting the three different phases referred to in Section 1, there is no unanimity in the definition of those functions. Following [48], we adopt the standard canonical Phase 2 (Section 1.2) definitions of those functions, which are given in Items 63 and 67.

63. If $R \geqslant 0$ and $P_X \in \mathcal{P}_{\mathcal{A}}$, the sphere-packing error exponent function is (e.g., (10.19) of [48])

$$E\_{\mathfrak{sp}}(R, P\_X) = \min\_{\substack{Q\_{Y|X}: \mathcal{A} \to \mathcal{B} \\ I(P\_X, Q\_{Y|X}) \leqslant R}} D(Q\_{Y|X} \parallel P\_{Y|X} \mid P\_X). \tag{254}$$

64. As a function of $R \geqslant 0$, the basic properties of (254) for fixed $(P_X, P_{Y|X})$ are as follows.


$$E_{\mathrm{sp}}(r, P_X) \geqslant E_{\mathrm{sp}}(R, P_X) - \rho_R\, r + \rho_R\, R. \tag{255}$$

65. In view of Theorem 8 and its definition in (254), it is not surprising that $E_{\mathrm{sp}}(R, P_X)$ is intimately related to the Augustin–Csiszár mutual information, through the following key identity.

**Theorem 21.**

$$E_{\mathrm{sp}}(R, P_X) = \sup_{\rho \geqslant 0} \left\{ \rho\, I_{\frac{1}{1+\rho}}^{\mathsf{c}}(X;Y) - \rho\, R \right\}, \quad R \geqslant 0; \tag{256}$$

$$R_{\infty}(P_X) = I_0^{\mathsf{c}}(X;Y). \tag{257}$$

**Proof.** First note that $\geqslant$ holds in (256) because from (128) we obtain, for all $\rho \geqslant 0$,

$$\rho\, I_{\frac{1}{1+\rho}}^{\mathsf{c}}(X;Y) = \min_{Q_{Y|X}} \left\{ D(Q_{Y|X} \| P_{Y|X} | P_X) + \rho\, I(P_X, Q_{Y|X}) \right\} \tag{258}$$

$$\leqslant \min_{\substack{Q_{Y|X}:\\ I(P_X, Q_{Y|X}) \leqslant R}} \left\{ D(Q_{Y|X} \| P_{Y|X} | P_X) + \rho\, I(P_X, Q_{Y|X}) \right\} \tag{259}$$

$$\leqslant E_{\mathrm{sp}}(R, P_X) + \rho\, R, \tag{260}$$

where (260) follows from the definition in (254). To show $\leqslant$ in (256) for those $R$ such that $0 < E_{\mathrm{sp}}(R, P_X) < \infty$, Property (d) in Item 64 allows us to write

$$\min_{Q_{Y|X}} \left\{ D(Q_{Y|X} \| P_{Y|X} | P_X) + \rho_R\, I(P_X, Q_{Y|X}) \right\} = \min_{r \geqslant 0} \left\{ E_{\mathrm{sp}}(r, P_X) + \rho_R\, r \right\} \tag{261}$$

$$\geqslant E_{\mathrm{sp}}(R, P_X) + \rho_R\, R, \tag{262}$$

where (262) follows from (255).

To determine the region where the sphere-packing error exponent is infinite and show (257), first note that if $R < I_0^{\mathsf{c}}(X;Y) = \lim_{\alpha \downarrow 0} I_{\alpha}^{\mathsf{c}}(X;Y)$, then $E_{\mathrm{sp}}(R, P_X) = \infty$ because for any $\rho \geqslant 0$, the function in $\{\cdot\}$ on the right side of (256) satisfies

$$
\rho \, I^{\mathbf{c}}\_{\frac{1}{1+\rho}}(\mathbf{X};\mathbf{Y}) - \rho \, \mathbf{R} = \rho \, I^{\mathbf{c}}\_{\frac{1}{1+\rho}}(\mathbf{X};\mathbf{Y}) - \rho \, I^{\mathbf{c}}\_{0}(\mathbf{X};\mathbf{Y}) + \rho \, I^{\mathbf{c}}\_{0}(\mathbf{X};\mathbf{Y}) - \rho \, \mathbf{R} \tag{263}
$$

$$\geqslant \rho\, I_0^{\mathsf{c}}(X;Y) - \rho\, R, \tag{264}$$

where (264) follows from the monotonicity of $I_{\alpha}^{\mathsf{c}}(X;Y)$ in $\alpha$ we saw in (143). Conversely, if $I_0^{\mathsf{c}}(X;Y) < R < \infty$, there exists $\varepsilon \in (0,1)$ such that $I_{\varepsilon}^{\mathsf{c}}(X;Y) < R$, which implies that in the minimization

$$I_{\varepsilon}^{\mathsf{c}}(X;Y) = \min_{Q_{Y|X}} \left\{ \frac{\varepsilon}{1-\varepsilon}\, D(Q_{Y|X} \| P_{Y|X} | P_X) + I(P_X, Q_{Y|X}) \right\} \tag{265}$$

we may restrict to those $Q_{Y|X}$ such that $I(P_X, Q_{Y|X}) \leqslant R$, and consequently, $I_{\varepsilon}^{\mathsf{c}}(X;Y) \geqslant \frac{\varepsilon}{1-\varepsilon}\, E_{\mathrm{sp}}(R, P_X)$. Therefore, to avoid a contradiction, we must have $E_{\mathrm{sp}}(R, P_X) < \infty$.

The remaining case is $I_0^{\mathsf{c}}(X;Y) = \infty$. Again, the monotonicity of the Augustin–Csiszár mutual information implies that $I_{\alpha}^{\mathsf{c}}(X;Y) = \infty$ for all $\alpha > 0$. So, (128) prescribes $D(Q_{Y|X} \| P_{Y|X} | P_X) = \infty$ for any $Q_{Y|X}$ such that $I(P_X, Q_{Y|X}) < \infty$. Therefore, $E_{\mathrm{sp}}(R, P_X) = \infty$ for all $R \geqslant 0$, as we wanted to show.

Augustin [36] provided lower bounds on error probability for codes of type $P_X$ as a function of $I_{\alpha}^{\mathsf{c}}(X;Y)$ but did not state (256); neither did Csiszár in [32], as he was interested in a non-conventional parametrization (generalized cutoff rates) of the reliability function. As pointed out in p. 5605 of [64], the ingredients for the proof of (256) were already present in the hint of Problem 23 of Section II.5 of [24]. In the discrete case, an exponential lower bound on error probability for codes with constant composition $P_X$ is given as a function of $I_{\frac{1}{1+\rho}}^{\mathsf{c}}(P_X, P_{Y|X})$ in [44,64]. As in [64], Nakiboglu [65] gives (256) as the definition of the sphere-packing function and connects it with (254) in Lemma 3 therein, within the context of discrete input alphabets.

In the discrete case, (257) is well-known (e.g., [66]), and given by (83). As pointed out in [40], $\max_X I_0^{\mathsf{c}}(X;Y)$ is the zero-error capacity with noiseless feedback found by Shannon [67], provided there is at least a pair $(a_1, a_2) \in \mathcal{A}^2$ such that $P_{Y|X=a_1} \perp P_{Y|X=a_2}$. Otherwise, the zero-error capacity with feedback is zero.

66. The critical rate, $R_{\mathrm{c}}(P_X)$, is defined as the smallest abscissa at which the convex function $E_{\mathrm{sp}}(\cdot, P_X)$ meets its supporting line of slope $-1$. According to (256),

$$I_{\frac{1}{2}}^{\mathsf{c}}(X;Y) = R_{\mathrm{c}}(P_X) + E_{\mathrm{sp}}(R_{\mathrm{c}}(P_X), P_X). \tag{266}$$

67. If $R \geqslant 0$ and $P_X \in \mathcal{P}_{\mathcal{A}}$, the random-coding exponent function is (e.g., (10.15) of [48])

$$E_{\mathrm{r}}(R, P_X) = \min_{Q_{Y|X}: \mathcal{A} \to \mathcal{B}} \left\{ D(Q_{Y|X} \| P_{Y|X} | P_X) + [I(P_X, Q_{Y|X}) - R]^+ \right\},\tag{267}$$

with $[t]^+ = \max\{0, t\}$.

68. The random-coding error exponent function is determined by the sphere-packing error exponent function through the following relation, illustrated in Figure 1.

**Figure 1.** $E_{\mathrm{sp}}(\cdot, P_X)$ and $E_{\mathrm{r}}(\cdot, P_X)$.

**Theorem 22.**

$$E_{\mathrm{r}}(R, P_X) = \min_{r \geqslant R} \{ E_{\mathrm{sp}}(r, P_X) + r - R \} \tag{268}$$

$$= \begin{cases} 0, & R \geqslant I(P_X, P_{Y|X}); \\ E_{\mathrm{sp}}(R, P_X), & R \in [R_{\mathrm{c}}(P_X),\, I(P_X, P_{Y|X})]; \\ I_{\frac{1}{2}}^{\mathsf{c}}(X;Y) - R, & R \in [0,\, R_{\mathrm{c}}(P_X)]. \end{cases} \tag{269}$$

$$= \sup_{\rho \in [0,1]} \left\{ \rho\, I_{\frac{1}{1+\rho}}^{\mathsf{c}}(X;Y) - \rho\, R \right\}. \tag{270}$$

**Proof.** Identities (268) and (269) are well-known (e.g., Lemma 10.4 and Corollary 10.4 in [48]). To show (270), note that (256) expresses $E_{\mathrm{sp}}(\cdot, P_X)$ as the supremum of supporting lines parametrized by their slope $-\rho$. By definition of critical rate (for brevity, we do not show explicitly its dependence on $P_X$), if $R \in [R_{\mathrm{c}},\, I(P_X, P_{Y|X})]$, then $E_{\mathrm{sp}}(R, P_X)$ can be obtained by restricting the optimization in (256) to $\rho \in [0,1]$. In that segment of values of $R$, $E_{\mathrm{sp}}(R, P_X) = E_{\mathrm{r}}(R, P_X)$ according to (269). Moreover, on the interval $R \in [0, R_{\mathrm{c}}]$, we have


$$\max\_{\rho \in [0,1]} \left\{ \rho \, I\_{\frac{1}{1+\rho}}^{\mathbb{C}}(X;Y) - \rho \, R \right\} = I\_{\frac{1}{2}}^{\mathbb{C}}(X;Y) - R \tag{271}$$

$$=E\_{\rm sp}(R\_{\bf c}, P\_X) + R\_{\bf c} - R \tag{272}$$

$$= E_{\mathrm{r}}(R, P_X), \tag{273}$$

where we have used (266) and (269).

The first explicit connection between $E_{\mathrm{r}}(R, P_X)$ and the Augustin–Csiszár mutual information was made by Poltyrev [35], although he used a different form for $I_{\alpha}^{\mathsf{c}}(X;Y)$, as we discussed in (29).

69. The unconstrained maximizations over the input distribution of the sphere-packing and random coding error exponent functions are denoted, respectively, by

$$E_{\mathrm{sp}}(R) = \sup_{P_X} E_{\mathrm{sp}}(R, P_X), \tag{274}$$

$$E_{\mathrm{r}}(R) = \sup_{P_X} E_{\mathrm{r}}(R, P_X). \tag{275}$$

Coding theorems [8–10,22,48] have shown that when these functions coincide they yield the reliability function (optimum speed at which the error probability vanishes with blocklength) as a function of the rate $R < \max_X I(X;Y)$. The intuition is that, for the most favorable input distribution, errors occur when the channel behaves so atypically that codes of rate $R$ are not reliable. There are many ways in which the channel may exhibit such behavior and they are all unlikely, but the most likely among them is the one that achieves (254).

It follows from (187), (256) and (270) that (274) and (275) can be expressed as

$$E_{\mathrm{sp}}(R) = \sup_{\rho \geqslant 0} \left\{ \rho \sup_X I_{\frac{1}{1+\rho}}(X;Y) - \rho\, R \right\}, \tag{276}$$

$$E\_{\mathbf{r}}(R) = \sup\_{\rho \in [0, 1]} \left\{ \rho \sup\_{X} I\_{\frac{1}{1+\rho}}(X; Y) - \rho \, R \right\}.\tag{277}$$

Therefore, we can sidestep working with the Augustin–Csiszár mutual information in the absence of cost constraints.

70. Shannon [1] showed that, operating at rates below maximal mutual information, it is possible to find codes whose error probability vanishes with blocklength; for the converse, instead of error probability, Shannon measured reliability by the conditional entropy of the message given the channel output. That alternative reliability measure, as well as its generalization to the Arimoto–Rényi conditional entropy, is also useful in analyzing the average performance over code ensembles. It turns out (see, e.g., [28,68]) that, below capacity, those conditional entropies also vanish exponentially fast in much the same way as error probability, with bounds that are governed by $E_{\mathrm{sp}}(R)$ and $E_{\mathrm{r}}(R)$, thereby lending additional operational significance to those functions.

71. We now introduce a cost function $\mathsf{b}\colon \mathcal{A} \to [0,\infty)$ and a real scalar $\theta \geqslant 0$, and reexamine the optimizations in (274) and (275) allowing only those probability measures that satisfy $\mathbb{E}[\mathsf{b}(X)] \leqslant \theta$. With a patent, but unavoidable, abuse of notation we define

$$E\_{\mathfrak{sp}}(\mathbb{R}, \theta) = \sup\_{\substack{P\_{\mathbb{X}} \colon \\ \mathbb{E}[\mathfrak{b}(X)] \leqslant \theta}} E\_{\mathfrak{sp}}(\mathbb{R}, P\_{\mathbb{X}}) \tag{278}$$

$$= \sup_{\rho \geqslant 0} \left\{ \rho \sup_{\substack{P_X:\\ \mathbb{E}[\mathsf{b}(X)] \leqslant \theta}} I_{\frac{1}{1+\rho}}^{\mathsf{c}}(X;Y) - \rho\, R \right\} \tag{279}$$

$$=\sup\_{\rho\geq0} \left\{ \rho \, \mathbb{C}^{\mathbf{c}}\_{\frac{1}{1+\rho}}(\theta) - \rho \, \mathbb{R} \right\} \tag{280}$$

$$= \sup_{\rho \geqslant 0} \left\{ -\rho\, R + \rho \min_{\nu \geqslant 0} \left\{ \nu\,\theta + \mathbb{A}_{\frac{1}{1+\rho}}(\nu) \right\} \right\} \tag{281}$$

$$= \sup_{\rho \geqslant 0} \left\{ -\rho\, R + \min_{\nu \geqslant 0} \left\{ \rho\,\nu\,\theta + \max_X \left\{ \rho\, I_{\frac{1}{1+\rho}}(X;Y) + (1+\rho) \log \mathbb{E}\left[ \exp\left( -\frac{\rho\,\nu}{1+\rho}\,\mathsf{b}(X) \right) \right] \right\} \right\} \right\}, \tag{282}$$

where (279), (281) and (282) follow from (256), (208) and (206), respectively.

72. In parallel to (278)–(281),

$$E_{\mathrm{r}}(R, \theta) = \sup_{\substack{P_X:\\ \mathbb{E}[\mathsf{b}(X)] \leqslant \theta}} E_{\mathrm{r}}(R, P_X) \tag{283}$$

$$= \sup_{\rho \in [0,1]} \left\{ \rho \sup_{\substack{P_X:\\ \mathbb{E}[\mathsf{b}(X)] \leqslant \theta}} I_{\frac{1}{1+\rho}}^{\mathsf{c}}(X;Y) - \rho\, R \right\} \tag{284}$$

$$= \sup_{\rho \in [0,1]} \left\{ \rho\, \mathbb{C}_{\frac{1}{1+\rho}}^{\mathsf{c}}(\theta) - \rho\, R \right\}, \tag{285}$$

where (284) follows from (270). In particular, if we define the critical rate and the cutoff rate as

$$R_{\mathrm{c}} = \sup_{\substack{P_X:\\ \mathbb{E}[\mathsf{b}(X)] \leqslant \theta}} R_{\mathrm{c}}(P_X), \tag{286}$$

$$R_0 = \sup_{\substack{P_X:\\ \mathbb{E}[\mathsf{b}(X)] \leqslant \theta}} I_{\frac{1}{2}}^{\mathsf{c}}(X;Y), \tag{287}$$

respectively, then it follows from (270) that

$$E_{\mathrm{r}}(R, \theta) = R_0 - R, \quad R \in [0, R_{\mathrm{c}}].\tag{288}$$

Summarizing, the evaluation of $E_{\mathrm{sp}}(R, \theta)$ and $E_{\mathrm{r}}(R, \theta)$ can be accomplished by the method proposed in Section 8, at the heart of which is the maximization in (206) involving $\alpha$-mutual information instead of Augustin–Csiszár mutual information. In Sections 11 and 12, we illustrate the evaluation of the error exponent functions with two important additive-noise examples.

#### **11. Additive Independent Gaussian Noise; Input Power Constraint**

We illustrate the procedure in Item 58 by taking Example 6 considerably further.

73. Suppose $\mathcal{A} = \mathcal{B} = \mathbb{R}$, $\mathsf{b}(x) = x^2$, and $P_{Y|X=a} = \mathcal{N}\left(a, \sigma_N^2\right)$. We start by testing whether we can find $R_X^{\nu} \in \mathcal{P}_{\mathcal{A}}$ such that its $\alpha$-response satisfies (230). Naturally, it makes sense to try $R_X^{\nu} = \mathcal{N}\left(0, \sigma^2\right)$ for some yet to be determined $\sigma^2$. As we saw in Example 6, this choice implies that its $\alpha$-response is $R_{Y[\alpha]}^{\nu} = \mathcal{N}\left(0, \alpha\sigma^2 + \sigma_N^2\right)$. Specializing Example 4, we obtain

$$D_{\alpha}\left(P_{Y|X=x} \parallel R_{Y[\alpha]}^{\nu}\right) = D_{\alpha}\left(\mathcal{N}\left(x, \sigma_N^2\right) \parallel \mathcal{N}\left(0,\, \alpha\sigma^2 + \sigma_N^2\right)\right) \tag{289}$$

$$= \frac{1}{2} \log\left(1 + \frac{\alpha\,\sigma^2}{\sigma_N^2}\right) - \frac{1}{2(1-\alpha)} \log\left(1 + \frac{\alpha(1-\alpha)\sigma^2}{\alpha^2\sigma^2 + \sigma_N^2}\right) + \frac{1}{2}\, \frac{\alpha\, x^2}{\alpha^2\sigma^2 + \sigma_N^2} \log \mathrm{e}. \tag{290}$$
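The closed form (289)–(290) can be confirmed against the integral definition of the Rényi divergence, $D_{\alpha}(P\|Q) = \frac{1}{\alpha-1}\log\int p^{\alpha} q^{1-\alpha}$. The following sketch (not part of the original text; parameters assumed, natural logarithms so that $\log\mathrm{e} = 1$) compares the two:

```python
import math

# Assumed example parameters
alpha, x, sigma2, sigmaN2 = 0.6, 1.3, 2.0, 1.0

def npdf(t, mean, var):
    return math.exp(-(t - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Midpoint-rule quadrature of the Renyi integral
lo, hi, n = -30.0, 30.0, 400_000
h = (hi - lo) / n
integral = sum(
    npdf(lo + (k + 0.5) * h, x, sigmaN2) ** alpha
    * npdf(lo + (k + 0.5) * h, 0.0, alpha * sigma2 + sigmaN2) ** (1 - alpha)
    for k in range(n)
) * h
numeric = math.log(integral) / (alpha - 1)

# Closed form (290), with log e = 1
s = alpha ** 2 * sigma2 + sigmaN2
closed = (0.5 * math.log(1 + alpha * sigma2 / sigmaN2)
          - math.log(1 + alpha * (1 - alpha) * sigma2 / s) / (2 * (1 - alpha))
          + 0.5 * alpha * x ** 2 / s)
assert abs(numeric - closed) < 1e-6
```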

Therefore, (230) is indeed satisfied with

$$c\_{\alpha}(\nu) = \frac{1}{2}\log\left(1 + \frac{\alpha\,\sigma^2}{\sigma\_N^2}\right) - \frac{1}{2(1-\alpha)}\log\left(1 + \frac{\alpha(1-\alpha)\,\sigma^2}{\alpha^2\sigma^2 + \sigma\_N^2}\right), \tag{291}$$

$$\nu = \frac{1}{2}\,\frac{\alpha}{\alpha^2\sigma^2 + \sigma\_N^2}\,\log\mathbf{e}, \tag{292}$$

where (292) follows if we choose the variance of the auxiliary input as

$$
\sigma^2 = \frac{\log \mathbf{e}}{2 \,\alpha \,\nu} - \frac{\sigma\_N^2}{\alpha^2} \tag{293}
$$

$$=\frac{\sigma\_N^2}{\alpha^2} \left(\frac{\alpha}{\lambda} - 1\right). \tag{294}$$

In (294) we have introduced an alternative, more convenient, parametrization for the Lagrange multiplier

$$
\lambda = \frac{2 \,\mathrm{\boldsymbol{\nu}} \,\sigma\_N^2}{\log \mathbf{e}} \in (0, \boldsymbol{\alpha}). \tag{295}
$$
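The interplay among (292), (293) and (295) is easy to sanity-check numerically. The following Python sketch (helper names are ours; all logarithms natural, so log e = 1) substitutes (293) back into (292) to recover *ν*, and checks that *λ* from (295) lands in (0, *α*):

```python
def nu_of_sigma2(alpha, sigma2, sigma2_N):
    # (292) in nats: nu = alpha / (2 (alpha^2 sigma^2 + sigma_N^2))
    return 0.5 * alpha / (alpha ** 2 * sigma2 + sigma2_N)

def sigma2_of_nu(alpha, nu, sigma2_N):
    # (293) in nats: sigma^2 = 1 / (2 alpha nu) - sigma_N^2 / alpha^2
    return 1.0 / (2 * alpha * nu) - sigma2_N / alpha ** 2

alpha, sigma2_N, nu = 0.6, 1.0, 0.2
sigma2 = sigma2_of_nu(alpha, nu, sigma2_N)  # variance of the auxiliary input
lam = 2 * nu * sigma2_N                     # (295) in nats
```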

In conclusion, with the choice in (293), $\mathcal{N}(0, \sigma^2)$ attains the maximum in (206), and in view of (231), $\mathbb{A}_{\alpha}(\nu)$ is given by the right side of (291) with $\sigma^2$ substituted by (293). Therefore, we have

$$\nu\,\theta + \mathbb{A}\_{\alpha}(\nu) = \frac{\lambda}{2}\,\mathsf{snr}\,\log\mathbf{e} + c\_{\alpha}\!\left(\frac{\lambda\log\mathbf{e}}{2\,\sigma\_N^2}\right) \tag{296}$$

$$= \frac{\lambda}{2}\,\mathsf{snr}\,\log\mathbf{e} + \frac{1}{2}\log\left(1 + \frac{1}{\lambda} - \frac{1}{\alpha}\right) - \frac{1}{2(1-\alpha)}\log(\alpha - \lambda(1-\alpha)) + \frac{\log\alpha}{1-\alpha}, \tag{297}$$

where we denoted $\mathsf{snr} = \theta / \sigma_N^2$. In accordance with Theorem 16, all that remains is to minimize (297) with respect to *ν*, or equivalently, with respect to *λ*. Differentiating (297) with respect to *λ*, the minimum is achieved at $\lambda^*$ satisfying

$$\text{snr} = \frac{1}{\lambda^\*} \frac{\alpha - \lambda^\*}{\alpha - \lambda^\* + \alpha \lambda^\*},\tag{298}$$

whose only valid root (obtained by solving a quadratic equation) is

$$
\lambda^\* = \frac{1 + \alpha \operatorname{snr} - \alpha \Delta}{2 \operatorname{snr} (1 - \alpha)} \in (0, \alpha), \tag{299}
$$

with ∆ defined in (118). So, for $\alpha \in (0, 1)$, (208) becomes

$$\begin{split} \mathbb{C}\_{\alpha}^{\mathbf{c}}(\mathsf{snr}\,\sigma\_{N}^{2}) &= \frac{1+\alpha\,\mathsf{snr}-\alpha\,\Delta}{4(1-\alpha)}\log\mathbf{e} + \frac{1}{2}\log\left(1 + \frac{2\,\mathsf{snr}\,(1-\alpha)}{1+\alpha\,\mathsf{snr}-\alpha\,\Delta} - \frac{1}{\alpha}\right) \\ &\quad - \frac{1}{2(1-\alpha)}\log\left(\frac{\alpha\,\mathsf{snr}+\alpha\,\Delta-1}{2\,\mathsf{snr}\,\alpha^{2}}\right). \end{split} \tag{300}$$

Letting *α* " 1 1`*ρ* , we obtain

$$\mathbb{C}^{\mathbf{c}}\_{\frac{1}{1+\rho}}\left(\mathsf{snr}\,\sigma\_{N}^{2}\right) = \frac{\mathsf{snr}}{2\rho}(1-\beta)\log\mathbf{e} + \frac{1}{2}\log(1+\beta\,\mathsf{snr}) - \frac{1+\rho}{2\rho}\log((1+\rho)\beta), \tag{301}$$

with

$$\beta = \frac{1}{2}\left(1 - \frac{1}{\alpha\,\mathsf{snr}} + \frac{\Delta}{\mathsf{snr}}\right) = \frac{1}{2}\left(1 - \frac{1+\rho}{\mathsf{snr}} + \sqrt{\frac{4}{\mathsf{snr}} + \left(\frac{1+\rho}{\mathsf{snr}} - 1\right)^2}\,\right). \tag{302}$$
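The closed form (301)–(302) lends itself to a quick numerical check (a Python sketch with hypothetical function names; everything in nats, so log e = 1): β from (302) satisfies the quadratic 1 + ρ = snr(1 − β) + 1/β exploited in the proof of Theorem 23 below, and as ρ → 0 (i.e., α → 1) the right side of (301) approaches the capacity ½ log(1 + snr):

```python
import math

def beta(rho, snr):
    # (302), second form
    t = (1 + rho) / snr - 1
    return 0.5 * (1 - (1 + rho) / snr + math.sqrt(4 / snr + t * t))

def C_alpha(rho, snr):
    # Right side of (301) in nats
    b = beta(rho, snr)
    return (snr / (2 * rho)) * (1 - b) + 0.5 * math.log(1 + b * snr) \
        - (1 + rho) / (2 * rho) * math.log((1 + rho) * b)

snr = 3.0
b1 = beta(1.0, snr)
quad_residual = 2.0 - (snr * (1 - b1) + 1 / b1)  # quadratic at rho = 1
cap_gap = abs(C_alpha(1e-5, snr) - 0.5 * math.log(1 + snr))
```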

74. Alternatively, it is instructive to apply Theorem 18 to the current Gaussian/quadratic cost setting. Suppose we let $Q_X^* = \mathcal{N}(0, \sigma^{*2})$, where $\sigma^{*2}$ is to be determined. With the aid of the formulas

$$\mathbb{E}\left[X^2 \text{ e}^{-\mu X^2}\right] = \frac{\sigma^2}{\left(1 + 2\,\mu \,\sigma^2\right)^{\frac{3}{2}}},\tag{303}$$

$$\mathbb{E}\left[\mathbf{e}^{-\mu X^2}\right] = \frac{1}{\sqrt{1 + 2\,\mu\,\sigma^2}},\tag{304}$$

where $\mu \geqslant 0$ and $X \sim \mathcal{N}(0, \sigma^2)$, (217) becomes

$$\frac{1}{\mathsf{snr}} = \frac{\sigma\_N^2}{\sigma^{*2}} + (1-\alpha)\,\lambda^*, \tag{305}$$

upon substituting $\sigma^2 \leftarrow \sigma^{*2}$ and

$$
\mu \leftarrow \nu^\* \frac{1-\alpha}{\log \mathbf{e}} = \lambda^\* \frac{1-\alpha}{2\sigma\_N^2}.\tag{306}
$$

Likewise, (218) translates into (291) and (292) with $(\nu, \sigma^2) \leftarrow (\nu^*, \sigma^{*2})$, namely,

$$c\_{\alpha}(\nu^*) = \frac{1}{2}\log\left(1 + \frac{\alpha\,\sigma^{*2}}{\sigma\_N^2}\right) - \frac{1}{2(1-\alpha)}\log\left(1 + \frac{\alpha(1-\alpha)\,\sigma^{*2}}{\alpha^2\sigma^{*2} + \sigma\_N^2}\right), \tag{307}$$

$$\lambda^* = \frac{\alpha\,\sigma\_N^2}{\alpha^2\sigma^{*2} + \sigma\_N^2}. \tag{308}$$

Eliminating $\sigma^{*2}$ from (305) by means of (308) results in (299), and the same derivation that led to (300) shows that it is equal to $\nu^*\theta + c_{\alpha}(\nu^*)$.

75. Applying Theorem 17, we can readily find the input distribution, $P_X^*$, that attains $\mathbb{C}_{\alpha}^{\mathbf{c}}(\theta)$ as well as its $\langle\alpha\rangle$-response $P_Y^*$ (recall the notation in Item 53). According to Example 2, $P_Y^*$, the *α*-response to $Q_X^*$, is Gaussian with zero mean and variance

$$
\sigma\_N^2 + \alpha \,\sigma^{\*2} = \sigma\_N^2 \left( 1 + \frac{1}{\lambda^\*} - \frac{1}{\alpha} \right) \tag{309}
$$

$$= \frac{\sigma\_N^2}{2}\left(2 - \frac{1}{\alpha} + \Delta + \mathsf{snr}\right), \tag{310}$$

where (309) follows from (308), and (310) follows by using the expression for ∆ in (118). Note from Example 7 that $P_Y^*$ is nothing but the $\langle\alpha\rangle$-response to $\mathcal{N}(0, \mathsf{snr}\,\sigma_N^2)$. We can easily verify from Theorem 17 that indeed $P_X^* = \mathcal{N}(0, \mathsf{snr}\,\sigma_N^2)$ since in this case (216) becomes

$$\imath\_{P\_X^* \| Q\_X^*}(a) = -(1-\alpha)\,\nu^*\,a^2 + \kappa\_{\alpha}, \tag{311}$$

which can only be satisfied by $P_X^* = \mathcal{N}(0, \mathsf{snr}\,\sigma_N^2)$ in view of (305). As an independent confirmation, we can verify, after some algebra, that the right sides of (127) and (300) are identical.

In fact, in the current Gaussian setting, we could start by postulating that the distribution that maximizes the Augustin–Csiszár mutual information under the second-moment constraint does not depend on *α* and is given by $P_X^* = \mathcal{N}(0, \theta)$. Its $\langle\alpha\rangle$-response $P_{Y\langle\alpha\rangle}^*$ was already obtained in Example 7. Then, an alternative method to find $\mathbb{C}_{\alpha}^{\mathbf{c}}(\theta)$, given in Section 6.2 of [43], is to follow the approach outlined in Item 53. To validate the choice of $P_X^*$, we must show that it maximizes $B(P_X, P_{Y\langle\alpha\rangle}^*)$ (in the notation introduced in (199)) among the subset of $\mathcal{P}_{\mathcal{A}}$ which satisfies $\mathbb{E}[X^2] \leqslant \theta$. This follows from the fact that $D_{\alpha}\big(P_{Y|X=x} \,\|\, P_{Y\langle\alpha\rangle}^*\big)$ is an affine function of $x^2$.

76. Let's now use the result in Item 73 to evaluate, with a novel parametrization, the error exponent functions for the Gaussian channel under an average power constraint.

**Theorem 23.** *Let* $\mathcal{A} = \mathcal{B} = \mathbb{R}$, $\mathsf{b}(x) = x^2$, *and* $P_{Y|X=a} = \mathcal{N}(a, \sigma_N^2)$. *Then, for* $\beta \in [0, 1]$*,*

$$E\_{\mathsf{SP}}(R, \mathsf{snr} \,\sigma\_N^2) = \frac{\mathsf{snr}}{2} (1 - \beta) \log \mathsf{e} - \frac{1}{2} \log(1 + \mathsf{snr} \,\beta(1 - \beta)),\tag{312}$$

$$R = \frac{1}{2} \log \left( 1 + \frac{\beta^2}{\beta(1 - \beta) + \frac{1}{\text{snr}}} \right). \tag{313}$$

*The critical rate and cutoff rate are, respectively,*

$$R\_{\mathbf{C}} = \frac{1}{2} \log \left( \frac{1}{2} + \frac{\mathbf{snr}}{4} + \frac{1}{2} \sqrt{1 + \frac{\mathbf{snr}^2}{4}} \right) . \tag{314}$$

$$R\_0 = \frac{1}{2} \left( 1 + \frac{\mathbf{s}\mathbf{n}\mathbf{r}}{2} - \sqrt{1 + \frac{\mathbf{s}\mathbf{n}\mathbf{r}^2}{4}} \right) \log\mathbf{e} + \frac{1}{2} \log\left(\frac{1}{2} + \frac{1}{2}\sqrt{1 + \frac{\mathbf{s}\mathbf{n}\mathbf{r}^2}{4}}\right). \tag{315}$$

**Proof.** Expression (315) for the cutoff rate follows by letting $\rho = 1$ in (301) and (302). The supremum in (281) is attained by $\rho^* \geqslant 0$ that satisfies (recall the concavity result in Theorem 9-(a))

$$R = \frac{\mathrm{d}}{\mathrm{d}\rho}\,\rho\,\mathbb{C}^{\mathbf{c}}\_{\frac{1}{1+\rho}}\left(\mathsf{snr}\,\sigma\_N^2\right)\Big|\_{\rho \leftarrow \rho^*} \tag{316}$$

$$=\frac{1}{2}\log\left(\text{snr}+\frac{1}{\beta}\right)-\frac{1}{2}\log(1+\rho^\*),\tag{317}$$

obtained after a dose of symbolic computation working with (301). In particular, letting $\rho^* = 1$, we obtain the critical rate in (314). Note that if in (302) we substitute $\rho \leftarrow \rho^*$, with $\rho^*$ given as a function of *R*, snr and *β* by (317), we end up with an equation involving *R*, snr, and *β*. We proceed to verify that that equation is, in fact, (313). By solving a quadratic equation, we can readily check that (302) is the positive root of

$$1 + \rho = \mathsf{snr}\,(1-\beta) + \frac{1}{\beta}. \tag{318}$$

If we particularize (318) to $\rho \leftarrow \rho^*$, with $\rho^*$ given by (317), namely,

$$
\rho^\* = -1 + \exp(-2R) \left( \text{snr} + \frac{1}{\beta} \right),
\tag{319}
$$

we obtain

$$\exp(2R) = \frac{\mathsf{snr}\,\beta + 1}{\mathsf{snr}\,\beta(1-\beta) + 1}, \tag{320}$$

which is (313). Notice that the right side of (320) is monotonically increasing in $\beta > 0$, ranging from 1 (for $\beta = 0$) to $1 + \mathsf{snr}$ (for $\beta = 1$). Therefore, $\beta \in [0, 1]$ spans the whole gamut of values of *R* of interest.

Assembling (281), (301) and (317), we obtain

$$E\_{\mathsf{sp}}(R, \mathsf{snr}\,\sigma\_{N}^{2}) = -\rho^* R + \frac{\mathsf{snr}}{2}(1-\beta)\log\mathbf{e} + \frac{\rho^*}{2}\log(1+\beta\,\mathsf{snr}) - \frac{1+\rho^*}{2}\log((1+\rho^*)\beta) \tag{321}$$

$$= -\rho^* R + \frac{\mathsf{snr}}{2}(1-\beta)\log\mathbf{e} + \frac{\rho^*}{2}\log(1+\beta\,\mathsf{snr}) - \frac{1+\rho^*}{2}\log\beta + (1+\rho^*)R - \frac{1+\rho^*}{2}\log\left(\mathsf{snr}+\frac{1}{\beta}\right) \tag{322}$$

$$= R + \frac{\mathsf{snr}}{2}(1-\beta)\log\mathbf{e} - \frac{1}{2}\log(1+\beta\,\mathsf{snr}) \tag{323}$$

$$= \frac{\mathsf{snr}}{2}(1-\beta)\log\mathbf{e} - \frac{1}{2}\log(1+\mathsf{snr}\,\beta(1-\beta)), \tag{324}$$

where (324) follows by substituting (313) on the left side.

Note that the parametric expression in (312) and (313) (shown in Figure 2) is, in fact, a closed-form expression for $E_{\mathsf{sp}}(R, \mathsf{snr}\,\sigma_N^2)$, since we can invert (313) to obtain

$$\beta = \frac{1}{2} (1 - \exp(-2 \, R)) \left( 1 + \sqrt{1 + \frac{4}{\text{snr} \, (1 - \exp(-2 \, R))}} \right). \tag{325}$$

The random coding error exponent is

$$E\_{\mathbf{r}}(R,\theta) = \begin{cases} E\_{\mathsf{sp}}(R,\theta), & R \in \left(R\_{\mathbf{c}}, \frac{1}{2}\log(1+\mathsf{snr})\right); \\ R\_0 - R, & R \in [0, R\_{\mathbf{c}}], \end{cases} \tag{326}$$

with the critical rate $R_{\mathbf{c}}$ and cutoff rate $R_0$ in (314) and (315), respectively. It can be checked that (326) coincides with the expression given by Gallager [9] (p. 340), where he optimizes (235) with respect to *ρ* and *r*, but not $P_X$, which he just assumes to be $P_X = \mathcal{N}(0, \theta)$. The expression for $R_{\mathbf{c}}$ in (314) can be found in (7.4.34) of [9]; $R_0$ in (315) is implicit in p. 340 of [9], and explicit in, e.g., [69].
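These closed forms are easy to verify numerically. The Python sketch below (nats; function names are ours) evaluates the parametric pair (312)–(313), the inverse (325), and the rates (314)–(315), checking that the round trip β → R → β is exact and that E<sub>sp</sub>(R<sub>c</sub>) = R<sub>0</sub> − R<sub>c</sub>, i.e., that the two branches of (326) meet at the critical rate:

```python
import math

def R_of_beta(b, snr):       # (313), nats
    return 0.5 * math.log(1 + b * b / (b * (1 - b) + 1 / snr))

def beta_of_R(R, snr):       # (325)
    u = 1 - math.exp(-2 * R)
    return 0.5 * u * (1 + math.sqrt(1 + 4 / (snr * u)))

def Esp(R, snr):             # (312) evaluated via (325), nats
    b = beta_of_R(R, snr)
    return 0.5 * snr * (1 - b) - 0.5 * math.log(1 + snr * b * (1 - b))

snr = 3.0
Rc = 0.5 * math.log(0.5 + snr / 4 + 0.5 * math.sqrt(1 + snr ** 2 / 4))  # (314)
R0 = 0.5 * (1 + snr / 2 - math.sqrt(1 + snr ** 2 / 4)) \
    + 0.5 * math.log(0.5 + 0.5 * math.sqrt(1 + snr ** 2 / 4))           # (315)

roundtrip_err = abs(beta_of_R(R_of_beta(0.4, snr), snr) - 0.4)
gap_at_Rc = abs(Esp(Rc, snr) - (R0 - Rc))
```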

**Figure 2.** $E_{\mathsf{sp}}(R, \mathsf{snr}\,\sigma_N^2)$ in (312) and (313); logarithms in base 2.

77. The expression for $E_{\mathsf{sp}}(R, \theta)$ in Theorem 23 has more structure than meets the eye. The analysis in Item 73 has shown that $E_{\mathsf{sp}}(R, P_X)$ is maximized over $P_X$ with second moment not exceeding *θ* by $P_X^* = \mathcal{N}(0, \theta)$, regardless of $R \in \left(0, \frac{1}{2}\log(1+\mathsf{snr})\right)$. The fact that we have found a closed-form expression for (254) when evaluated at such an input probability measure and $P_{Y|X=a} = \mathcal{N}(a, \sigma_N^2)$ is indicative that the minimum therein is attained by a Gaussian random transformation $Q_{Y|X}^*$. This is indeed the case: define the random transformation

$$Q\_{Y|X=a}^{\*} = \mathcal{N}(\beta \, a, \sigma\_1^2),\tag{327}$$

$$\frac{\sigma\_1^2}{\sigma\_N^2} = 1 + \text{snr } \beta (1 - \beta). \tag{328}$$

In comparison with the nominal random transformation $P_{Y|X=a} = \mathcal{N}(a, \sigma_N^2)$, this channel attenuates the input and contaminates it with a more powerful noise. Then,

$$I(P\_X^*, Q\_{Y|X}^*) = \frac{1}{2}\log\left(1 + \frac{\beta^2}{\beta(1-\beta) + \frac{1}{\mathsf{snr}}}\right) = R. \tag{329}$$

Furthermore, invoking (33), we get

$$D(Q\_{Y|X}^{\ast} \| P\_{Y|X} | P\_X^{\ast}) = \mathbb{E}\left[D\left(\mathcal{N}\left(\beta X^{\ast}, \sigma\_1^2\right) \| \mathcal{N}\left(X^{\ast}, \sigma\_N^2\right)\right)\right] \tag{330}$$

$$=\frac{1}{2}\left(\left(\beta-1\right)^{2}\mathbf{s}\mathbf{n}\mathbf{r}+\frac{\sigma\_{1}^{2}}{\sigma\_{N}^{2}}-1\right)\log\mathbf{e}-\frac{1}{2}\log\frac{\sigma\_{1}^{2}}{\sigma\_{N}^{2}}\tag{331}$$

$$= \frac{\mathsf{snr}}{2}(1-\beta)\log\mathbf{e} - \frac{1}{2}\log(1+\mathsf{snr}\,\beta(1-\beta)) \tag{332}$$

$$= E\_{\mathsf{sp}}(R, \mathsf{snr}\,\sigma\_N^2), \tag{333}$$

where (333) is (312). Therefore, $Q_{Y|X}^*$ does indeed achieve the minimum in (254) if $P_{Y|X=a} = \mathcal{N}(a, \sigma_N^2)$ and $P_X^* = \mathcal{N}(0, \theta)$. So, the most likely error mechanism is the result of atypically large noise strength and an attenuated received signal. Both effects cannot be combined into additional noise variance: there is no $\sigma^2 > 0$ such that $Q_{Y|X=a} = \mathcal{N}(a, \sigma^2)$ achieves the minimum in (254).
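The identity chain (330)–(333) can be confirmed directly: averaging the Gaussian relative-entropy formula over $P_X^* = \mathcal{N}(0, \theta)$ reproduces (312). A Python sketch (nats; variable names are ours):

```python
import math

snr, vN = 3.0, 1.0
theta = snr * vN                      # second-moment budget
b = 0.6                               # beta in (0, 1) parametrizes the rate
v1 = vN * (1 + snr * b * (1 - b))     # (328): noise variance of Q*_{Y|X}

# (330)-(331): E[ D( N(b X, v1) || N(X, vN) ) ] with X ~ N(0, theta);
# the Gaussian KL formula turns the mean mismatch into (b - 1)^2 * theta.
cond_div = 0.5 * ((b - 1) ** 2 * theta / vN + v1 / vN - 1 - math.log(v1 / vN))

Esp = 0.5 * snr * (1 - b) - 0.5 * math.log(1 + snr * b * (1 - b))  # (312), nats
R = 0.5 * math.log(1 + b * b / (b * (1 - b) + 1 / snr))            # (329) = (313)
```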

#### **12. Additive Independent Exponential Noise; Input-Mean Constraint**

This section finds the sphere-packing error exponent for the additive independent exponential noise channel under an input-mean constraint.

78. Suppose that $\mathcal{A} = \mathcal{B} = [0, \infty)$, $\mathsf{b}(x) = x$, and

$$Y = X + N, \tag{334}$$

where *N* is exponentially distributed, independent of *X*, with $\mathbb{E}[N] = \zeta$. Therefore, $P_{Y|X=a}$ has density

$$p\_{Y|X=a}(t) = \frac{1}{\zeta}\,\mathbf{e}^{-\frac{t-a}{\zeta}}\,\mathbf{1}\{t \geqslant a\}. \tag{335}$$

It is shown in [70,71] that

$$\max\_{P\_X:\ \mathbb{E}[X] \leqslant \theta} I(X; X+N) = \log(1+\mathsf{snr}), \tag{336}$$

$$\mathsf{snr} = \frac{\theta}{\zeta}, \tag{337}$$

achieved by a mixed random variable with density

$$f\_X^\*(t) = \frac{\zeta}{\zeta + \theta} \,\delta(t) + \frac{\theta}{(\zeta + \theta)^2} \,\mathbf{e}^{-t/(\zeta + \theta)} \mathbf{1}\{t > 0\}.\tag{338}$$
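The optimality of (338) hinges on the fact that this mixed input makes the channel output exponential with mean ζ + θ, which is quickly confirmed with Laplace transforms, as the following Python sketch illustrates (helper names are ours):

```python
zeta, theta = 1.0, 3.0

def lt_input(s):
    # E[e^{-sX}] for the mixed density (338): an atom at 0 plus an exponential tail
    return zeta / (zeta + theta) + theta / ((zeta + theta) * (1 + s * (zeta + theta)))

def lt_noise(s):
    # E[e^{-sN}] for N exponential with mean zeta
    return 1.0 / (1 + s * zeta)

def lt_exponential(s, mean):
    return 1.0 / (1 + s * mean)

# X independent of N, so the transforms multiply; the product should be
# the transform of an exponential output with mean zeta + theta.
gaps = [abs(lt_input(s) * lt_noise(s) - lt_exponential(s, zeta + theta))
        for s in (0.1, 0.5, 2.0)]
```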

To determine $\mathbb{C}_{\alpha}^{\mathbf{c}}(\mathsf{snr}\,\zeta)$, $\alpha \in (0, 1)$, we invoke Theorem 18. A sensible candidate for the auxiliary input distribution $Q_X^*$ is a mixed random variable with density

$$q\_X^\*(t) = \Gamma^\* \, \delta(t) + (1 - \Gamma^\*) \frac{1}{\mu} \mathbf{e}^{-t/\mu} \, \mathbf{1}\{t > 0\},\tag{339}$$

$$
\mu = \frac{\zeta}{\alpha \Gamma^\*} \tag{340}
$$

where $\Gamma^* \in (0, 1)$ is yet to be determined. This is an attractive choice because its *α*-response, $Q_{Y[\alpha]}^*$, is particularly simple: exponential with mean $\alpha\,\mu = \frac{\zeta}{\Gamma^*}$, as we can verify using Laplace transforms. Then, if *Z* is exponential with unit mean, with the aid of Example 5, we can write

$$D\_{\alpha}\left(P\_{Y|X=x} \parallel Q\_{Y[\alpha]}^*\right) = D\_{\alpha}(\zeta Z + x \parallel \alpha\,\mu\,Z) \tag{341}$$

$$= \frac{x}{\alpha\,\mu}\log\mathbf{e} + \log\frac{\alpha\,\mu}{\zeta} + \frac{1}{1-\alpha}\log\left(\alpha + (1-\alpha)\frac{\zeta}{\alpha\,\mu}\right) \tag{342}$$

$$= \frac{\Gamma^* x}{\zeta}\log\mathbf{e} - \log\Gamma^* + \frac{1}{1-\alpha}\log(\alpha + (1-\alpha)\Gamma^*). \tag{343}$$

So, (218) is satisfied with

$$\nu^* = \frac{\Gamma^*}{\zeta}\,\log\mathbf{e}, \tag{344}$$

$$c\_{\alpha}(\nu^*) = \frac{1}{1-\alpha}\log(\alpha + (1-\alpha)\Gamma^*) - \log\Gamma^*. \tag{345}$$

To evaluate (217), it is useful to note that if *γ* ą ´1, then

$$\mathbb{E}\left[Z\mathbf{e}^{-\gamma Z}\right] = \frac{1}{(1+\gamma)^2},\tag{346}$$

$$\mathbb{E}\left[\mathbf{e}^{-\gamma Z}\right] = \frac{1}{1+\gamma}.\tag{347}$$


Therefore, with $\bar{X}^* \sim Q_X^*$, the left side of (217) specializes to

$$\mathbb{E}\left[\mathsf{b}(\bar{X}^*)\exp\left(-(1-\alpha)\,\nu^*\,\mathsf{b}(\bar{X}^*)\right)\right] = \frac{\mu(1-\Gamma^*)}{\left(1+\mu(1-\alpha)\frac{\nu^*}{\log\mathbf{e}}\right)^2} \tag{348}$$

$$= \zeta\,\alpha\left(\frac{1}{\Gamma^*}-1\right), \tag{349}$$

while the expectation on the right side of (217) is given by

$$\mathbb{E}\left[\exp\left(-(1-\mathfrak{a})\nu^\*\mathfrak{b}(\bar{X}^\*)\right)\right] = \mathfrak{a} + \Gamma^\* - \mathfrak{a}\Gamma^\*.\tag{350}$$

Therefore, (217) yields

$$\mathsf{snr} = \frac{1}{\Gamma^*} - \frac{1}{\alpha + (1-\alpha)\Gamma^*}, \tag{351}$$

whose solution is

$$\Gamma^\* = \frac{1}{2\rho \text{ snr}} \left( \sqrt{(1+\text{snr})^2 + 4\rho \text{ snr}} - 1 - \text{snr} \right),\tag{352}$$

with $\rho = \frac{1-\alpha}{\alpha}$. So, finally, (220), (344) and (345) give the closed-form expression

$$\mathbb{C}\_{\alpha}^{\mathbf{c}}(\theta) = \mathsf{snr}\,\Gamma^*\log\mathbf{e} - \log\Gamma^* + \frac{1}{1-\alpha}\log(\alpha + (1-\alpha)\Gamma^*). \tag{353}$$

As in Item 73, we can postulate an auxiliary distribution that satisfies (230) for every $\nu \geqslant 0$. This is identical to what we did in (341)–(343), except that now (344) and (345) hold for generic *ν* and Γ. Then, (351) is the result of solving $\theta = -\dot{c}_{\alpha}(\nu^*)$, which is, in fact, somewhat simpler than obtaining it through (217).
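As a sanity check on (351)–(353), the Python sketch below (nats; names are ours) verifies that Γ* from (352) is a root of (351), and that as α → 1 the closed form (353) approaches the capacity log(1 + snr) from (336):

```python
import math

def Gamma_star(rho, snr):
    # (352)
    return (math.sqrt((1 + snr) ** 2 + 4 * rho * snr) - 1 - snr) / (2 * rho * snr)

def C_cost(alpha, snr):
    # (353) in nats, with rho = (1 - alpha) / alpha
    rho = (1 - alpha) / alpha
    g = Gamma_star(rho, snr)
    return snr * g - math.log(g) + math.log(alpha + (1 - alpha) * g) / (1 - alpha)

snr, alpha = 3.0, 0.5
g = Gamma_star((1 - alpha) / alpha, snr)
residual_351 = snr - (1 / g - 1 / (alpha + (1 - alpha) * g))  # (351) should hold
cap_gap = abs(C_cost(1 - 1e-6, snr) - math.log(1 + snr))
```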

79. We proceed to get a very simple parametric expression for $E_{\mathsf{sp}}(R, \theta)$.

**Theorem 24.** *Let* $\mathcal{A} = \mathcal{B} = [0, \infty)$, $\mathsf{b}(x) = x$, *and* $Y = X + N$, *with N exponentially distributed, independent of X, and* $\mathbb{E}[N] = \zeta$. *Then, under the average cost constraint* $\mathbb{E}[\mathsf{b}(X)] \leqslant \zeta\,\mathsf{snr}$*,*

$$E\_{\mathsf{sp}}(R, \zeta\,\mathsf{snr}) = \left(\frac{1}{\eta} - 1\right)\log\mathbf{e} + \log\eta, \tag{354}$$

$$R = \log(1 + \eta \,\text{snr}),\tag{355}$$

*where* $\eta \in (0, 1]$*.*

**Proof.** Rewriting (353) results in

$$\rho\,\mathbb{C}^{\mathbf{c}}\_{\frac{1}{1+\rho}}(\theta) = \rho\,\mathsf{snr}\,\Gamma^*\log\mathbf{e} - \rho\log\Gamma^* + (1+\rho)\log\frac{1+\rho\,\Gamma^*}{1+\rho}, \tag{356}$$

which is monotonically decreasing with *ρ*. With $\dot{\Gamma}^* = \frac{\partial}{\partial\rho}\Gamma^*(\rho, \mathsf{snr})$, the counterpart of (317) is now

$$R = \frac{\mathbf{d}}{\mathbf{d}\rho} \rho \mathbb{C}^{\mathbf{c}}\_{\frac{1}{1+\rho}}(\theta)\Big|\_{\rho \leftarrow \rho^{\ast}} \tag{357}$$

$$= (\Gamma^* + \rho^*\dot{\Gamma}^*)\left(\mathsf{snr} - \frac{1}{\Gamma^*} + \frac{1+\rho^*}{1+\rho^*\Gamma^*}\right)\log\mathbf{e} + \log\frac{1+\rho^*\Gamma^*}{\Gamma^* + \rho^*\Gamma^*} \tag{358}$$

$$= (\Gamma^* + \rho^*\dot{\Gamma}^*)\left(\mathsf{snr} + \frac{1}{\Gamma^*}\,\frac{\Gamma^*-1}{1+\rho^*\Gamma^*}\right)\log\mathbf{e} + \log\frac{1+\rho^*\Gamma^*}{\Gamma^* + \rho^*\Gamma^*} \tag{359}$$

$$=\log\frac{1+\rho^\*\Gamma^\*}{\Gamma^\*+\rho^\*\Gamma^\*},\tag{360}$$

where the drastic simplification in (360) occurs because, with the current parametrization, (351) becomes

$$1 - \Gamma^\* = (1 + \rho^\* \Gamma^\*) \,\Gamma^\* \text{snr.}\tag{361}$$

Now we go ahead and express both *ρ* ˚ and Γ ˚ as functions of snr and *R* exclusively. We may rewrite (357)–(360) as

$$
\rho^\* \Gamma^\* = \frac{\exp(-R) - \Gamma^\*}{1 - \exp(-R)},
\tag{362}
$$

which, when plugged in (361), results in

$$\Gamma^* = \frac{1}{\mathsf{snr}}(1 - \exp(-R)) < 1, \tag{363}$$

$$\rho^* = \frac{(1+\mathsf{snr})\exp(-R) - 1}{\left(1 - \exp(-R)\right)^2} > 0, \tag{364}$$

where the inequalities in (363) and (364) follow from $R < \log(1+\mathsf{snr})$. So, in conclusion,

$$E\_{\mathsf{sp}}(R,\theta) = \max\_{\rho \geqslant 0}\left\{\rho\,\mathbb{C}^{\mathbf{c}}\_{\frac{1}{1+\rho}}(\theta) - \rho\,R\right\} \tag{365}$$

$$= \rho^*\,\mathbb{C}^{\mathbf{c}}\_{\frac{1}{1+\rho^*}}(\theta) - \rho^*\,R \tag{366}$$

$$= \rho^*\,\mathsf{snr}\,\Gamma^*\log\mathbf{e} - \rho^*\log\Gamma^* + (1+\rho^*)\log\frac{1+\rho^*\Gamma^*}{1+\rho^*} - \rho^* R \tag{367}$$

$$= \rho^*\,\mathsf{snr}\,\Gamma^*\log\mathbf{e} - \rho^*\log\Gamma^* + (1+\rho^*)(R + \log\Gamma^*) - \rho^* R \tag{368}$$

$$= \rho^*\,\mathsf{snr}\,\Gamma^*\log\mathbf{e} + \log\Gamma^* + R \tag{369}$$

$$= \left(\frac{\mathsf{snr}}{\exp(R)-1} - 1\right)\log\mathbf{e} + \log\frac{\exp(R)-1}{\mathsf{snr}} \tag{370}$$

$$= \left(\frac{1}{\eta} - 1\right)\log\mathbf{e} + \log\eta, \tag{371}$$

where we have introduced

$$\eta = \frac{\exp(R) - 1}{\mathsf{snr}} = \frac{\Gamma^*}{1 - \mathsf{snr}\,\Gamma^*}. \tag{372}$$

Evidently, the left identity in (372) is the same as (355).
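The inversion leading to (363) and (364) can be double-checked numerically: for any *R* below log(1 + snr), the resulting pair (Γ*, ρ*) must satisfy (361), and (360) must give back the same *R*. A Python sketch (nats):

```python
import math

snr = 3.0
R = 0.5 * math.log(1 + snr)              # any rate in (0, log(1 + snr))
G = (1 - math.exp(-R)) / snr                                    # (363)
rho = ((1 + snr) * math.exp(-R) - 1) / (1 - math.exp(-R)) ** 2  # (364)

residual_361 = (1 - G) - (1 + rho * G) * G * snr                # (361)
R_back = math.log((1 + rho * G) / (G * (1 + rho)))              # (360)
```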

The critical rate and the cutoff rate are obtained by particularizing (360) and (356) to $\rho^* = 1$ and $\rho = 1$, respectively. This yields

$$R\_{\mathbf{C}} = \log \frac{1 + \Gamma\_1^\*}{2\,\Gamma\_1^\*},\tag{373}$$

$$R\_0 = \mathsf{snr}\,\Gamma\_1^*\log\mathbf{e} - \log(4\,\Gamma\_1^*) + 2\log(1 + \Gamma\_1^*), \tag{374}$$

$$\Gamma\_1^* = \frac{\sqrt{(1+\mathsf{snr})^2 + 4\,\mathsf{snr}} - 1 - \mathsf{snr}}{2\,\mathsf{snr}}. \tag{375}$$

As in (326), the random coding error exponent is

$$E\_{\mathbf{r}}(R, \zeta\,\mathsf{snr}) = \begin{cases} E\_{\mathsf{sp}}(R, \zeta\,\mathsf{snr}), & R \in (R\_{\mathbf{c}}, \log(1+\mathsf{snr})); \\ R\_0 - R, & R \in [0, R\_{\mathbf{c}}], \end{cases} \tag{376}$$

with the critical rate $R_{\mathbf{c}}$ and cutoff rate $R_0$ in (373) and (374), respectively. This function is shown along with $E_{\mathsf{sp}}(R, \zeta\,\mathsf{snr})$ in Figure 3 for $\mathsf{snr} = 3$.
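Continuity of (376) at the critical rate is again easy to check numerically: E<sub>sp</sub>(R<sub>c</sub>, ζ snr) computed from (354)–(355) must equal R<sub>0</sub> − R<sub>c</sub> from (373)–(375). A Python sketch (nats):

```python
import math

snr = 3.0
G1 = (math.sqrt((1 + snr) ** 2 + 4 * snr) - 1 - snr) / (2 * snr)  # (375)
Rc = math.log((1 + G1) / (2 * G1))                                # (373)
R0 = snr * G1 - math.log(4 * G1) + 2 * math.log(1 + G1)           # (374)

def Esp(R, snr):
    eta = (math.exp(R) - 1) / snr         # (355)
    return 1 / eta - 1 + math.log(eta)    # (354) in nats

gap_at_Rc = abs(Esp(Rc, snr) - (R0 - Rc))
```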

80. In parallel to Item 77, we find the random transformation that explains the most likely mechanism to produce errors at every rate *R*, namely the minimizer of (254) when $P_X = P_X^*$, the maximizer of the Augustin–Csiszár mutual information of order *α*. In this case, $P_X^*$ is not as trivial to guess as in Section 11, but since we already found $Q_X^*$ in (339) with $\Gamma = \Gamma^*$, we can invoke Theorem 17 to show that the density of $P_X^*$ achieving the maximal order-*α* Augustin–Csiszár mutual information is

$$p\_X^*(t) = \frac{\Gamma^*}{\alpha + (1-\alpha)\Gamma^*}\,\delta(t) + \frac{1-\Gamma^*}{\alpha + (1-\alpha)\Gamma^*}\,\frac{\alpha\,\Gamma^*}{\zeta}\,\mathbf{e}^{-t\,\Gamma^*/\zeta}\,\mathbf{1}\{t > 0\}, \tag{377}$$

whose mean is, as it should be,

$$\frac{\alpha\,(1-\Gamma^*)}{\Gamma^*}\,\frac{\zeta}{\alpha + (1-\alpha)\Gamma^*} = \zeta\,\mathsf{snr} = \theta. \tag{378}$$

Let $Q_Y^*$ be exponential with mean $\theta + \kappa$, and $Q_{Y|X=a}^*$ have density

$$q\_{Y|X=a}^*(t) = \frac{1}{\kappa}\,\mathbf{e}^{-\frac{t-a}{\kappa}}\,\mathbf{1}\{t \geqslant a\}, \tag{379}$$

with

$$\kappa = \frac{\zeta}{\eta},\tag{380}$$

and *η* as defined in (372). Using Laplace transforms, we can verify that $P_X^* \to Q_{Y|X}^* \to Q_Y^*$, where $P_X^*$ is the probability measure with density in (377). Let *Z* be unit-mean exponentially distributed. Writing mutual information as the difference between the output differential entropy and the noise differential entropy, we get

$$I(P\_X^*, Q\_{Y|X}^*) = h((\theta + \kappa)Z) - h(\kappa Z) \tag{381}$$

$$=\log\left(1+\frac{\theta}{\kappa}\right)\tag{382}$$

$$= R, \tag{383}$$

in view of (363). Furthermore, using (335) and (379),

$$D(Q\_{Y|X}^{\ast} \parallel P\_{Y|X} | P\_X^{\ast}) = \log \frac{\zeta}{\kappa} + \left(\frac{\kappa}{\zeta} - 1\right) \log \mathbf{e} \tag{384}$$

$$=\log\eta + \left(\frac{1}{\eta} - 1\right)\log\mathbf{e}\tag{385}$$

$$=E\_{\mathsf{sp}}(\mathsf{R}, \mathsf{\zeta}\mathsf{snr}),\tag{386}$$

where we have used (380) and (354). Therefore, we have shown that $Q_{Y|X}^*$ is indeed the minimizer of (254). In this case, the most likely mechanism for errors to happen is that the channel adds independent exponential noise with mean $\zeta/\eta$, instead of the nominal mean *ζ*. In this respect, the behavior is reminiscent of that of the exponential timing channel, for which the error exponent is dominated (at least above critical rate) by an exponential server which is slower than the nominal [72].

**Figure 3.** Error exponent functions in (354), (355) and (376).

#### **13. Recap**

81. The analysis of the fundamental limits of noisy channels in the regime of vanishing error probability with blocklength growing without bound expresses channel capacity

in terms of a basic information measure: the input–output mutual information maximized over the input distribution. In the regime of fixed nonzero error probability, the asymptotic fundamental limit is a function of not only capacity but channel dispersion [73], which is also expressible in terms of an information measure: the variance of the information density obtained with the capacity-achieving distribution. In the regime of exponentially decreasing error probability (at fixed rate below capacity), the analysis of the fundamental limits has gone through three distinct phases. No information measures were involved during the first phase, and any optimization with respect to various auxiliary parameters and input distribution had to rely on standard convex optimization techniques, such as Karush–Kuhn–Tucker conditions, which not only are cumbersome to solve in this particular setting, but shed little light on the structure of the solution. The second phase firmly anchored the problem in a large deviations foundation, with the fundamental limits expressed in terms of conditional relative entropy as well as mutual information. Unfortunately, the associated maximinimization in (2) did not immediately lend itself to analytical progress. Thanks to Csiszár's realization of the relevance of Rényi's information measures to this problem, the third phase has found a way to not only express the error exponent functions as a function of information measures, but to solve the associated optimization problems in a systematic way. While, in the absence of cost constraints, the problem reduces to finding the maximal *α*-mutual information, cost constraints make the problem much more challenging because of the difficulty in determining the order-*α* Augustin–Csiszár mutual information.
Fortunately, thanks to the introduction of an auxiliary input distribution (the $\langle\alpha\rangle$-adjunct of the distribution that maximizes $I_{\alpha}^{\mathbf{c}}$), we have shown that *α*-mutual information also comes to the rescue in the maximization of the order-*α* Augustin–Csiszár mutual information in the presence of average cost constraints. We have also finally ended the isolation of Gallager's *E*<sub>0</sub> function with cost constraints from the representations in Phases 2 and 3. The pursuit of such a link is what motivated Augustin in 1978 to define a generalized mutual information measure. Overall, the analysis has given yet another instance of the benefits of variational representations of information measures, leading to solutions based on saddle points. However, we have steered clear of off-the-shelf minimax theorems and their associated topological constraints.

We have worked out two channels/cost constraints (additive Gaussian noise with quadratic cost, and additive exponential noise with a linear cost) that admit closed-form error-exponent functions, most easily expressed in parametric form. Furthermore, in Items 77 and 80, we have illuminated the structure of those closed-form expressions by identifying the anomalous channel behavior responsible for most errors at every given rate. In the exponential noise case, the solution is simply a noisier exponential channel, while in the Gaussian case it is the result of both a noisier Gaussian channel and an attenuated input.

These observations prompt the question of whether there might be an alternative general approach that eschews Rényi's information measures to arrive at not only the most likely anomalous channel behavior, but the error exponent functions themselves.

**Funding:** This research received no external funding.

**Acknowledgments:** The manuscript incorporates constructive suggestions by Academic Editor Igal Sason and the anonymous referees.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **Appendix A**

Recall that the relative information $\imath_{P\|Q}$ is defined only if $P \ll Q$, while $D(P\|Q) \in [0, +\infty]$ is always defined, and equal to $+\infty$ if (but not only if) $P \not\ll Q$.


**Lemma A1.** *If Q* ! *R and X* " *P* ! *R, then <sup>Q</sup>*(*B*) = <sup>Z</sup> *<sup>P</sup>Y*|*X*(*B*|*x*) <sup>d</sup>*PX*(*x*) = E

*Entropy* **2021**, *1*, 0 6 of 52

*<sup>ı</sup>X*k*Y*(*a*) = <sup>1</sup>

(B, <sup>G</sup> ), and for any *<sup>B</sup>* ∈ <sup>G</sup> , *<sup>P</sup>Y*|*X*=·

by *Q* ∈ PB, namely,

[−∞, ∞) is defined as

*µX*, *σ* 2 *X* 

2 log *σ* 2 *Y σ* 2 *X* + 1 2 

from (A, <sup>F</sup> ) to (B, <sup>G</sup> ), i.e. for any *<sup>x</sup>* ∈ A, *<sup>P</sup>Y*|*X*=*<sup>x</sup>*

**Example 1.** *If <sup>X</sup>* ∼ N

N *µY*, *σ* 2 *Y , then,*

$$\mathbb{E}\left[\iota\_{P\|\mathcal{R}}(X) - \iota\_{Q\|\mathcal{R}}(X)\right] = D(P\|\,Q)\_{\prime} \tag{A1}$$

*regardless of whether the right side is finite. <sup>ı</sup>X*;*Y*(*a*; *<sup>b</sup>*) = *<sup>ı</sup>PY*|*X*=*a*k*P<sup>Y</sup>* (*b*), (*a*, *b*) ∈ A × B. (12)

**Proof.** If *P* ! *Q* ! *R*, we may invoke the chain rule (7) to decompose Following Rényi's terminology [49], if *<sup>P</sup>XPY*|*<sup>X</sup> <sup>P</sup><sup>X</sup>* × *<sup>P</sup>Y*, the dependence between

$$\mathfrak{a}\_{P\|\mathbb{R}}(a) - \mathfrak{a}\_{Q\|\mathbb{R}}(a) = \mathfrak{a}\_{P\|Q}(a). \tag{A2}$$

Then, the result follows by taking expectations of (A2) when *a* Ð *X* " *P*. if *<sup>X</sup>* = *<sup>Y</sup>* ∈ R, then *<sup>P</sup>Y*|*X*=*<sup>a</sup>* (*A*) = 1{*a* ∈ *A*}, and their dependence is not regular, since

input and output is regular regardless of the input probability measure. For example,

To show that (A1) also holds when *P* for any *P<sup>X</sup>* with non-discrete components *PXY* 6 *P<sup>X</sup>* × *PY*. 7. Let *<sup>α</sup>* > 0, and *<sup>P</sup><sup>X</sup>* → *<sup>P</sup>Y*|*<sup>X</sup>* → *<sup>P</sup>Y*. The *<sup>α</sup>*-response to *<sup>P</sup><sup>X</sup>* ∈ P<sup>A</sup> is the output probability measure *PY*[*α*] *P<sup>Y</sup>* with relative information given by *Q*, i.e., that the expectation on the left side is `8, we invoke the Lebesgue decomposition theorem (e.g. p. 384 of [74]), which ensures that we can find *α* P r0, 1q, *P*<sup>0</sup> K *Q* and *P*<sup>1</sup> ! *Q*, such that

*(Gaussian with mean µ<sup>X</sup> and variance σ*

(*a* − *µY*)

*σ* 2 *Y*

4. Let (A, F ) and (B, G ) be measurable spaces, known as the input and output spaces, respectively. Likewise, A and B are referred to as the input and output alphabets respectively. The simplified notation *<sup>P</sup>Y*|*<sup>X</sup>* : A → B denotes a random transformation

5. We abbreviate by P<sup>A</sup> the set of probability measures on (A, F ), and by PA×B the set of probability measures on (A × B, <sup>F</sup> ⊗ <sup>G</sup> ). If *<sup>P</sup>* ∈ P<sup>A</sup> and *<sup>P</sup>Y*|*<sup>X</sup>* : A → B is a random transformation, the corresponding joint probability measure is denoted by *P PY*|*<sup>X</sup>* ∈ PA×B (or, interchangeably, *<sup>P</sup>Y*|*XP*). The notation *<sup>P</sup>* → *<sup>P</sup>Y*|*<sup>X</sup>* → *<sup>Q</sup>* simply indicates that the output marginal of the joint probability measure *P PY*|*<sup>X</sup>* is denoted

2

−

(*B*) is an F-measurable function.

h

*<sup>P</sup>Y*|*X*(*B*|*X*)

i

(*a* − *µX*)

*σ* 2 *X* 2

!

(·) is a probability measure on

, *B* ∈ G . (11)

2 *X*

log e. (10)

*) and Y* ∼

$$P = \mathfrak{a} \, P\_{\mathfrak{I}} + (1 - \mathfrak{a}) P\_{\mathfrak{O}}.\tag{A3}$$

where *κ<sup>α</sup>* is a scalar that guarantees that *PY*[*α*] Since *P*<sup>1</sup> K *P*0, we have

*κ<sup>α</sup>* = *α* logE

*<sup>ı</sup>Y*[*α*]k*Y*(*y*) = <sup>1</sup>

*α*

h E 1

$$D(P\_1 \parallel P) = \log \frac{1}{\alpha'} \tag{A4}$$

is a probability measure. Invoking (9),

$$D(P\_0 \parallel P) = \log \frac{1}{1 - \alpha}.\tag{A5}$$

applied to (·) *α* results in *κ<sup>α</sup>* ≤ 0 for *α* ∈ (0, 1) and *κ<sup>α</sup>* ≥ 0 for *α* > 1. Although the *α*-response has a long record of services to information theory, this terminology and If *X*<sup>1</sup> " *P*1, then

$$\mathbb{E}\left[\mathfrak{l}\_{\mathbb{P}\|\mathbb{R}}(\mathbf{X}\_{\mathbf{1}}) - \mathfrak{l}\_{\mathbb{Q}\|\mathbb{R}}(\mathbf{X}\_{\mathbf{1}})\right] = \mathbb{E}\left[\mathfrak{l}\_{\mathbb{P}\_{\mathbf{1}}\|\mathbb{R}}(\mathbf{X}\_{\mathbf{1}}) - \mathfrak{l}\_{\mathbb{Q}\|\mathbb{R}}(\mathbf{X}\_{\mathbf{1}})\right] - \mathbb{E}\left[\mathfrak{l}\_{\mathbb{P}\_{\mathbf{1}}\|\mathbb{R}}(\mathbf{X}\_{\mathbf{1}}) - \mathfrak{l}\_{\mathbb{P}\|\mathbb{R}}(\mathbf{X}\_{\mathbf{1}})\right] \tag{A6}$$

$$=D(\mathbf{P\_1} \parallel Q) - D(\mathbf{P\_1} \parallel P) \tag{A7}$$

$$=D(P\_1 \parallel Q) - \log \frac{1}{\mathfrak{a}'} \tag{A8}$$

where

we obtain

For *α* > 1 (resp. *α* < 1) we can think of the normalized version of *p α Y*|*X* as a random transformation with less (resp. more) "noise" than *<sup>p</sup>Y*|*X*. • (A7) ðù (A1) with p*P*, *Q*, *R*q Ð p*P*1, *Q*, *R*q and (A1) with p*P*, *Q*, *R*q Ð p*P*1, *P*, *R*q, which we are entitled to invoke since *P*1 is dominated by both *Q* and *R*;

*Y*|*X*

• (A8) ðù (A4).

Analogously, if *X*<sup>0</sup> " *P*0, then

*pY*[*α*]

$$\mathbb{E}\left[\mathfrak{l}\_{P\|R}(X\_{\bullet})\right] = \mathbb{E}\left[\mathfrak{l}\_{P\_{\bullet}\|R}(X\_{\bullet})\right] - \mathbb{E}\left[\mathfrak{l}\_{P\_{\bullet}\|R}(X\_{\bullet}) - \mathfrak{l}\_{P\|R}(X\_{\bullet})\right] \tag{A9}$$

$$\mathbf{P} = D(P\_{\mathbf{0}} \parallel \mathbf{R}) - D(P\_{\mathbf{0}} \parallel P) \tag{A10}$$

$$=D(P\_0 \parallel R) - \log \frac{1}{1 - \alpha}.\tag{A11}$$

Therefore, we are ready to conclude that

−

*α*

$$\begin{aligned} &\mathbb{E}\left[\mathfrak{l}\_{P\|\mathcal{R}}(\mathcal{X}) - \mathfrak{l}\_{Q\|\mathcal{R}}(\mathcal{X})\right] \\ &= \alpha \, \mathbb{E}\left[\mathfrak{l}\_{P\|\mathcal{R}}(\mathcal{X}\_{\mathcal{1}}) - \mathfrak{l}\_{Q\|\mathcal{R}}(\mathcal{X}\_{\mathcal{1}})\right] + (1 - \alpha) \, \mathbb{E}\left[\mathfrak{l}\_{P\|\mathcal{R}}(\mathcal{X}\_{\mathcal{0}}) - \mathfrak{l}\_{Q\|\mathcal{R}}(\mathcal{X}\_{\mathcal{0}})\right] \end{aligned} \tag{A12}$$

$$\mathbf{P} = \mathbf{a} \, D(P\_1 \parallel Q) + (1 - \mathbf{a}) D(P\_0 \parallel R) - (1 - \mathbf{a}) \mathbb{E} \left[ \mathbf{r}\_{Q \parallel R}(X\_0) \right] - h(\mathbf{a}) \tag{A13}$$

$$
\omega = +\infty \tag{A14}
$$

where


$$\begin{aligned} \bullet \quad \text{(A14)}\\ \sqsupset \quad \square \end{aligned} \iff \mathbb{E}\left[\iota\_{Q \mid R}(X\_0)\right] = -\infty \Longleftrightarrow \begin{aligned} &P\_0\left(\mathbf{x} \in \mathcal{A} \colon \frac{\mathrm{d}Q}{\mathrm{dR}}(\mathbf{x}) = \mathbf{0}\right) = 1 \Longleftrightarrow P\_0 \perp Q. \end{aligned}$$

**Corollary A1.** *Suppose that Q* ! *R and X* " *P* ! *R. Then,*

$$\mathbb{E}\left[\iota\_{Q\|R}(X)\right] = D(P\|\,R) - D(P\|\,Q)\_{\prime} \tag{A15}$$

*as long as at least one of the relative entropies on the right side is finite.*

#### **References**


## *Article* **Discriminant Analysis under** *f***-Divergence Measures**

**Anmol Dwivedi, Sihui Wang and Ali Tajer \***

Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180, USA; dwivea2@rpi.edu (A.D.); scottwon@bupt.edu.cn (S.W.) **\*** Correspondence: tajer@ecse.rpi.edu; Tel.: +1-518-276-8237

**Abstract:** In statistical inference, the information-theoretic performance limits can often be expressed in terms of a statistical divergence between the underlying statistical models (e.g., in binary hypothesis testing, the error probability is related to the total variation distance between the statistical models). As the data dimension grows, computing the statistics involved in decision-making and the attendant performance limits (divergence measures) face complexity and stability challenges. Dimensionality reduction addresses these challenges at the expense of compromising the performance (the divergence reduces by the data-processing inequality). This paper considers linear dimensionality reduction such that the divergence between the models is maximally preserved. Specifically, this paper focuses on Gaussian models where we investigate discriminant analysis under five *f*-divergence measures (Kullback–Leibler, symmetrized Kullback–Leibler, Hellinger, total variation, and *χ*<sup>2</sup>). We characterize the optimal design of the linear transformation of the data onto a lower-dimensional subspace for zero-mean Gaussian models and employ numerical algorithms to find the design for general Gaussian models with non-zero means. There are two key observations for zero-mean Gaussian models. First, projections are not necessarily along the largest modes of the covariance matrix of the data, and, in some situations, they can even be along the smallest modes. Second, under specific regimes, the optimal design of subspace projection is identical under all the *f*-divergence measures considered, rendering a degree of universality to the design, independent of the inference problem of interest.

**Keywords:** dimensionality reduction; discriminant analysis; *f*-divergence; statistical inference

#### **1. Introduction**

#### *1.1. Motivation*

Consider a simple binary hypothesis testing problem in which we observe an *n*-dimensional sample *X* and aim to discern the underlying model according to:

$$
\mathsf{H}\_{0}\colon X \sim \mathbb{P} \qquad \text{vs.} \qquad \mathsf{H}\_{1}\colon X \sim \mathbb{Q}\,.\tag{1}
$$

The optimal decision rule (in the Neyman–Pearson sense) involves computing the likelihood ratio dP/dQ (*X*), and the performance limit (the sum of the type I and type II error probabilities) is related to the total variation distance between P and Q. We emphasize that our focus is on settings in which the *n* elements of *X* are not statistically independent, in which case the likelihood ratio dP/dQ (*X*) cannot be decomposed into the product of coordinate-level likelihood ratios. One of the key practical obstacles to solving such problems is the computational cost of finding and performing the statistical tests. This renders a gap between the performance that is information-theoretically viable (unbounded complexity) and the performance possible under bounded computational complexity [1,2].
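The connection between the minimal error sum and the total variation distance can be illustrated on a finite alphabet (a sketch, not from the paper; the alphabet size and seed are arbitrary): the likelihood-ratio test with unit threshold attains a type I plus type II error equal to one minus the total variation distance.

```python
import numpy as np

# Sketch: for pmfs p (H0) and q (H1), the likelihood-ratio test that decides H1
# whenever q(x)/p(x) > 1 achieves type I + type II error = 1 - TV(p, q).
rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(8))
q = rng.dirichlet(np.ones(8))

tv = 0.5 * np.abs(p - q).sum()   # total variation distance

decide_h1 = q > p
type1 = p[decide_h1].sum()       # P(decide H1) under H0
type2 = q[~decide_h1].sum()      # P(decide H0) under H1
assert np.isclose(type1 + type2, 1.0 - tv)
```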

Dimensionality reduction techniques have become an integral part of statistical analysis in high dimensions [3–6]. In particular, linear dimensionality reduction methods have been developed and used for over a century for various reasons, such as their low computational complexity and simple geometric interpretation, as well as for a multitude of applications, such as data compression, storage, and visualization, to name only a few.

**Citation:** Dwivedi, A.; Wang, S.; Tajer, A. Discriminant Analysis under *f*-Divergence Measures. *Entropy* **2022**, *24*, 188. https://doi.org/10.3390/ e24020188

Academic Editor: Igal Sason

Received: 18 November 2021 Accepted: 25 January 2022 Published: 27 January 2022


These methods linearly map the high-dimensional data to lower dimensions while ensuring that the desired features of the data are preserved. There exist two broad sets of approaches to linear dimensionality reduction in one dataset *X*, which we review next.

#### *1.2. Related Literature*

*(1) Feature extraction:* In one set of approaches, the objective is to select and extract informative and non-redundant features of the dataset *X*. These approaches are generally unsupervised. The most widely used among them are principal component analysis (PCA) and its variations [7–9], multidimensional scaling (MDS) [10–13], and sufficient dimensionality reduction (SDR) [14]. The objective of PCA is to retain as much of the variation in the data as possible in a lower dimension by minimizing the reconstruction error. In contrast, MDS maximizes an aggregate metric of the scatter of the projection. Finally, the objective of SDR is to design an orthogonal mapping of the data that makes the data *X* and the responses conditionally independent (given the projected data). There exist extensive variations of these three approaches, and we refer the reader to Reference [6] for further discussion.

*(2) Class separation:* In another set of approaches, the objective is to perform classification in the lower-dimensional space. These approaches are supervised. Depending on the problem formulation and the underlying assumptions, the resulting decision boundaries between the models can be linear or non-linear. One approach pertinent to this paper's scope is discriminant analysis (DA), which leverages the distinction between given models and designs a mapping such that its lower-dimensional output exhibits maximum separation across different models [15–20]. In general, this approach generates two matrices: the within-class and between-class scatter matrices. The within-class scatter matrix captures the scatter of the samples around their respective class means, whereas the between-class scatter matrix captures the scatter of the samples around the mixture mean of all the models. Subsequently, a univariate function of these matrices is formed such that it increases when the between-class scatter becomes larger or when the within-class scatter becomes smaller. Examples of such classification indices include the ratio of the determinants, the difference of the determinants, and the ratio of the traces of the two matrices [17]. These approaches focus on reducing the dimension to one and maximizing separability between the two classes. There exist, however, studies that consider reducing to dimensions higher than one and separation across more than two classes. Finally, depending on the structure of the class-conditional densities, the resulting shape of the decision boundaries gives rise to linear and quadratic DA.

An *f*-divergence between a pair of probability measures quantifies the dissimilarity between them. Shannon [21] introduced mutual information as a divergence measure, which was later studied comprehensively by Kullback and Leibler [22] and Kolmogorov [23], establishing the importance of such measures in information theory, probability theory, and related disciplines. The family of *f*-divergences, introduced independently by Csiszár [24], Ali and Silvey [25], and Morimoto [26], generalizes the Kullback–Leibler divergence and enables characterizing the information-theoretic performance limits of a wide range of inference, learning, source coding, and channel coding problems. For instance, References [27–30] consider their application to various statistical decision-making problems [31–34]. More recent developments on the properties of *f*-divergence measures can be found in References [31,35–37].

#### *1.3. Contributions*

The contribution of this paper has two main distinctions from the existing literature on DA. First, DA generally focuses on the classification problem for determining the underlying model of the data. Secondly, motivated by the complexities of finding the optimal decision rules for classification (e.g., density estimation), the existing criteria used for separation are selected heuristically. In this paper, we study this problem by referring to the family of *f*-divergences as measures of the distinction between a pair of probability distributions. Such a choice has three main features: (i) it enables designing linear mappings for a wider range of inference problems (beyond classification); (ii) it provides the designs that are optimal for the inference problem at hand; and (iii) it enables characterizing the information-theoretic performance limits after linear mapping. Our analyses are focused on Gaussian models. Even though we observe that the design of the linear mapping has differences under different *f*-divergence measures, we have two main observations in the case of zero-mean Gaussian models: (i) the optimal design of the linear mapping is not necessarily along the most dominant components of the data matrix; and (ii) in certain regimes, irrespective of the choice of the *f*-divergence measure, the design of the linear map that retains the maximal divergence between the two models is robust. In such cases, this makes the optimal design of the linear map independent of the inference problem at hand rendering a degree of universality (in the considered space of the Gaussian probability measures).

The remainder of the paper is organized as follows. Section 2 provides the linear dimensionality reduction model and an overview of the *f*-divergence measures considered in this paper. Section 3 formulates the problem and sets up the mathematical analysis of the subsequent sections. In Section 4, we provide a motivating operational interpretation for each *f*-divergence measure and then characterize an optimal design of the linear mapping for zero-mean Gaussian models. Section 5 presents numerical simulations for inference problems associated with the *f*-divergence measures of interest for zero-mean Gaussian models. Section 6 generalizes the theory to non-zero mean Gaussian models and discusses numerical algorithms that help characterize the design of the linear map, and Section 7 concludes the paper. A list of abbreviations used in this paper is provided on page 22.

#### **2. Preliminaries**

Consider a pair of *n*-dimensional Gaussian models:

$$\mathbb{P}: \quad \mathcal{N}(\boldsymbol{\mu}\_{\mathbb{P}}, \boldsymbol{\Sigma}\_{\mathbb{P}}) \;, \quad \text{and} \quad \mathbb{Q}: \quad \mathcal{N}(\boldsymbol{\mu}\_{\mathbb{Q}}, \boldsymbol{\Sigma}\_{\mathbb{Q}}) \;, \tag{2}$$

where *µ*<sup>P</sup> , *µ*<sup>Q</sup> and **Σ**P, **Σ**<sup>Q</sup> are the respective mean vectors and covariance matrices of the two distinct models, and P and Q denote their associated probability measures. Nature selects one model and generates a random variable *<sup>X</sup>* <sup>∈</sup> <sup>R</sup>*<sup>n</sup>* . We perform linear dimensionality reduction on *<sup>X</sup>* via a matrix **<sup>A</sup>** <sup>∈</sup> <sup>R</sup>*r*×*<sup>n</sup>* , where *r* < *n*, rendering

$$Y \stackrel{\triangle}{=} \mathbf{A} \cdot X \,. \tag{3}$$

After linear mapping, the two possible distributions of *Y* induced by matrix **A** are denoted by P**<sup>A</sup>** and Q**A**, where

$$\begin{array}{rcl} \mathbb{P}\_{\mathbf{A}} & : & \mathcal{N}(\mathbf{A} \cdot \boldsymbol{\mu}\_{\mathbb{P}}, \mathbf{A} \cdot \boldsymbol{\Sigma}\_{\mathbb{P}} \cdot \mathbf{A}^{\top}) \\ \mathbb{Q}\_{\mathbf{A}} & : & \mathcal{N}(\mathbf{A} \cdot \boldsymbol{\mu}\_{\mathbb{Q}}, \mathbf{A} \cdot \boldsymbol{\Sigma}\_{\mathbb{Q}} \cdot \mathbf{A}^{\top}) \end{array} \tag{4}$$

Motivated by inference problems that we discuss in Section 3, our objective is to design the linear mapping parameterized by matrix **A** that ensures that the two possible distributions of *Y*, i.e., P**<sup>A</sup>** and Q**A**, are maximally distinguishable. That is, to design **A** as a function of the statistical models (i.e., *µ*<sup>P</sup> , *µ*<sup>Q</sup> , **Σ**<sup>P</sup> and **Σ**Q) such that relevant notions of *f*-divergences between P**<sup>A</sup>** and Q**<sup>A</sup>** are maximized. We use a number of *f*-divergence measures for capturing the distinction between P**<sup>A</sup>** and Q**A**, each with a distinct operational meaning under specific inference problems. For this purpose, we denote the *f*-divergence of Q**<sup>A</sup>** from P**<sup>A</sup>** by *D<sup>f</sup>* (**A**), where

$$D\_f(\mathbf{A}) \stackrel{\triangle}{=} \mathbb{E}\_{\mathbb{P}\_{\mathbf{A}}} \left[ f \left( \frac{\mathbf{d}\mathbb{Q}\_{\mathbf{A}}}{\mathbf{d}\mathbb{P}\_{\mathbf{A}}} \right) \right]. \tag{5}$$

We use the shorthand *D<sup>f</sup>* (**A**) for the canonical notation *D<sup>f</sup>* (Q**<sup>A</sup>** ‖ P**A**) to emphasize the dependence on **A** and for notational simplicity. EP**<sup>A</sup>** denotes expectation with respect to P**A**, and *f* : (0, +∞) → R is a convex function that is strictly convex at 1 with *f*(1) = 0. Strict convexity at 1 ensures that the *f*-divergence between a pair of probability measures is zero if and only if the probability measures are identical. Given the linear dimensionality reduction model in (3), the objective is to solve

$$\mathcal{P}: \max\_{\mathbf{A} \in \mathbb{R}^{r \times n}} D\_f(\mathbf{A}) \tag{6}$$

for the following choices of the *f*-divergence measures.

1. *Kullback–Leibler (KL) divergence* for *f*(*t*) = *t* log *t*:

$$D\_{\mathbf{KL}}(\mathbf{A}) \stackrel{\triangle}{=} \mathbb{E}\_{\mathbf{Q}\_{\mathbf{A}}} \left[ \log \frac{\mathbf{d} \mathbf{Q}\_{\mathbf{A}}}{\mathbf{d} \mathbb{P}\_{\mathbf{A}}} \right]. \tag{7}$$

We also denote the KL divergence from P**<sup>A</sup>** to Q**<sup>A</sup>** by *D*KL(P**<sup>A</sup>** ‖ Q**A**).

2. *Symmetric KL divergence* for *f*(*t*) = (*t* − 1)log *t*:

$$D\_{\mathsf{SKL}}(\mathbf{A}) \triangleq D\_{\mathsf{KL}}(\mathbb{Q}\_{\mathbf{A}} \parallel \mathbb{P}\_{\mathbf{A}}) + D\_{\mathsf{KL}}(\mathbb{P}\_{\mathbf{A}} \parallel \mathbb{Q}\_{\mathbf{A}}) \,. \tag{8}$$

3. *Squared Hellinger distance* for *f*(*t*) = (1 − √*t*)<sup>2</sup>:

$$\mathsf{H}^{2}(\mathbf{A}) \stackrel{\triangle}{=} \int\_{\mathbb{R}^{r}} \left(\sqrt{\mathbf{d}\mathbb{Q}\_{\mathbf{A}}} - \sqrt{\mathbf{d}\mathbb{P}\_{\mathbf{A}}}\right)^{2} . \tag{9}$$

4. *Total variation distance* for *f*(*t*) = |*t* − 1|/2:

$$d\_{\mathbf{T}}(\mathbf{A}) \stackrel{\triangle}{=} \frac{1}{2} \int\_{\mathbb{R}^{r}} \left| \mathbf{d}\mathbb{Q}\_{\mathbf{A}} - \mathbf{d}\mathbb{P}\_{\mathbf{A}} \right| \,. \tag{10}$$

5. *χ*<sup>2</sup>*-divergence* for *f*(*t*) = (*t* − 1)<sup>2</sup>:

$$\chi^2(\mathbf{A}) \stackrel{\triangle}{=} \int\_{\mathbb{R}^{r}} \frac{(\mathbf{d}\mathbb{Q}\_{\mathbf{A}} - \mathbf{d}\mathbb{P}\_{\mathbf{A}})^2}{\mathbf{d}\mathbb{P}\_{\mathbf{A}}}\,. \tag{11}$$

We also denote the *χ*<sup>2</sup>-divergence from P**<sup>A</sup>** to Q**<sup>A</sup>** by *χ*<sup>2</sup>(P**<sup>A</sup>** ‖ Q**A**).
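These five divergences can be evaluated numerically from their generators. The sketch below (not from the paper) does so by quadrature for a hypothetical univariate pair, ℙ = N(0, 1) and ℚ = N(1, 1.5), with variances chosen so that the *χ*²-divergence is finite, and checks the KL value against the standard closed form for Gaussians:

```python
import numpy as np

# Evaluate D_f = E_P[f(dQ/dP)] on a fine grid for P = N(0,1), Q = N(1,1.5).
x = np.linspace(-25.0, 25.0, 200001)
dx = x[1] - x[0]

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

p = gauss(x, 0.0, 1.0)
q = gauss(x, 1.0, 1.5)
t = q / p                                  # likelihood ratio dQ/dP

generators = {
    "KL":         lambda t: t * np.log(t),
    "SKL":        lambda t: (t - 1.0) * np.log(t),
    "Hellinger2": lambda t: (1.0 - np.sqrt(t)) ** 2,
    "TV":         lambda t: 0.5 * np.abs(t - 1.0),
    "chi2":       lambda t: (t - 1.0) ** 2,
}
div = {name: float(np.sum(p * f(t)) * dx) for name, f in generators.items()}

# Closed-form KL(Q || P) for Gaussians: 0.5*[log(v_P/v_Q) + (v_Q + dmu^2)/v_P - 1].
kl_closed = 0.5 * (np.log(1.0 / 1.5) + (1.5 + 1.0) / 1.0 - 1.0)
assert np.isclose(div["KL"], kl_closed, atol=1e-6)
assert div["Hellinger2"] < 2.0 and div["TV"] < 1.0
```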

#### **3. Problem Formulation**

In this section, to avoid cumbersome expressions, we focus without loss of generality on the setting where one of the covariance matrices is the identity matrix and the other is denoted by **Σ**. One key observation is that the design of **A** under the different measures has strong similarities. We first note that, by defining $\bar{\mathbf{A}} \triangleq \mathbf{A} \cdot \boldsymbol{\Sigma}_{\mathbb{P}}^{1/2}$, $\boldsymbol{\mu} \triangleq \boldsymbol{\Sigma}_{\mathbb{P}}^{-1/2} \cdot (\boldsymbol{\mu}_{\mathbb{Q}} - \boldsymbol{\mu}_{\mathbb{P}})$, and $\boldsymbol{\Sigma} \triangleq \boldsymbol{\Sigma}_{\mathbb{P}}^{-1/2} \cdot \boldsymbol{\Sigma}_{\mathbb{Q}} \cdot \boldsymbol{\Sigma}_{\mathbb{P}}^{-1/2}$, designing **A** for maximally distinguishing

$$\mathcal{N}(\mathbf{A}\cdot\boldsymbol{\mu}\_{\mathbb{P}}, \mathbf{A}\cdot\boldsymbol{\Sigma}\_{\mathbb{P}}\cdot\mathbf{A}^{\top}) \quad \text{and} \quad \mathcal{N}(\mathbf{A}\cdot\boldsymbol{\mu}\_{\mathbb{Q}}, \mathbf{A}\cdot\boldsymbol{\Sigma}\_{\mathbb{Q}}\cdot\mathbf{A}^{\top}) \tag{12}$$

is equivalent to designing **A**¯ for maximally distinguishing

$$\mathcal{N}(\mathbf{0}, \mathbf{\bar{A}} \cdot \mathbf{\bar{A}}^{\top}) \quad \text{and} \quad \mathcal{N}(\mathbf{\bar{A}} \cdot \boldsymbol{\mu}, \mathbf{\bar{A}} \cdot \boldsymbol{\Sigma} \cdot \mathbf{\bar{A}}^{\top}) \,. \tag{13}$$

Hence, without loss of generality, we focus on the setting where *µ*<sup>P</sup> = **0**, **Σ**<sup>P</sup> = **I***n*, and **Σ**<sup>Q</sup> = **Σ**. Next, we show that determining an optimal design for **A** can be confined to the class of semi-orthogonal matrices.
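The equivalence of (12) and (13) can be verified numerically. In the sketch below (not from the paper; all models are randomly generated, and `gauss_kl` is a hypothetical helper implementing the standard closed-form Gaussian KL divergence), the KL divergence between the projected models equals that between their whitened counterparts:

```python
import numpy as np

# Check that KL between the models in (12) equals KL between those in (13).
rng = np.random.default_rng(2)
n, r = 5, 2

def random_cov(n):
    m = rng.standard_normal((n, n))
    return m @ m.T + n * np.eye(n)          # well-conditioned PD covariance

def gauss_kl(mu0, s0, mu1, s1):
    """KL( N(mu0, s0) || N(mu1, s1) ) in nats."""
    d = mu1 - mu0
    s1_inv = np.linalg.inv(s1)
    return 0.5 * (np.log(np.linalg.det(s1) / np.linalg.det(s0))
                  - len(mu0) + np.trace(s1_inv @ s0) + d @ s1_inv @ d)

mu_p, mu_q = rng.standard_normal(n), rng.standard_normal(n)
sig_p, sig_q = random_cov(n), random_cov(n)
A = rng.standard_normal((r, n))

# Whitened quantities A_bar, mu, Sigma as defined in the text.
w, v = np.linalg.eigh(sig_p)
sqrt_p = v @ np.diag(np.sqrt(w)) @ v.T
inv_sqrt_p = v @ np.diag(w ** -0.5) @ v.T
A_bar = A @ sqrt_p
mu = inv_sqrt_p @ (mu_q - mu_p)
sigma = inv_sqrt_p @ sig_q @ inv_sqrt_p

kl_orig = gauss_kl(A @ mu_q, A @ sig_q @ A.T, A @ mu_p, A @ sig_p @ A.T)
kl_white = gauss_kl(A_bar @ mu, A_bar @ sigma @ A_bar.T, np.zeros(r), A_bar @ A_bar.T)
assert np.isclose(kl_orig, kl_white)
```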

**Theorem 1.** *For every* **A***, there exists a semi-orthogonal matrix* **A**¯ *such that D<sup>f</sup>* (**A**¯ ) = *D<sup>f</sup>* (**A**)*.*

**Proof.** See Appendix A.

This observation indicates that we can reduce the unconstrained problem in (6) to the following constrained problem:

$$\mathcal{Q}: \max\_{\mathbf{A} \in \mathbb{R}^{r \times n}} D\_f(\mathbf{A}) \quad \text{s.t.} \quad \mathbf{A} \cdot \mathbf{A}^\top = \mathbf{I}\_r \text{ .} \tag{14}$$

We show that the design of **A** in the case of *µ* = **0**, under the considered *f*-divergence measures, directly relates to analyzing the eigenspace of the matrix **Σ**. For this purpose, we denote the non-negative eigenvalues of **Σ**, ordered in descending order, by {*λ<sup>i</sup>* : *i* ∈ [*n*]}, where for an integer *m* we have defined [*m*] = {1, . . . , *m*}. For an arbitrary permutation function *π* : [*n*] → [*n*], we denote the permutation of {*λ<sup>i</sup>* : *i* ∈ [*n*]} with respect to *π* by {*λπ*(*i*) : *i* ∈ [*n*]}. We also denote the eigenvalues of **A** · **Σ** · **A**<sup>⊤</sup>, ordered in descending order, by {*γ<sup>i</sup>* : *i* ∈ [*r*]}. Throughout the analysis, we frequently use the Poincaré separation theorem [38] for finding the row space of the matrix **A** with respect to the eigenvalues of **Σ**.

**Theorem 2** (Poincaré Separation Theorem)**.** *Let* **Σ** *be a real symmetric n* × *n matrix and* **A** *be a semi-orthogonal r* × *n matrix. The eigenvalues of* **Σ** *denoted by* {*λ<sup>i</sup>* : *i* ∈ [*n*]} *(sorted in descending order) and the eigenvalues of* **A** · **Σ** · **A**<sup>⊤</sup> *denoted by* {*γ<sup>i</sup>* : *i* ∈ [*r*]} *(sorted in descending order) satisfy*

$$
\lambda\_{n-(r-i)} \le \gamma\_i \le \lambda\_i \quad \forall i \in \left[r\right]. \tag{15}
$$
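A quick numerical illustration of the theorem (a sketch, not from the paper; the dimensions and seed are arbitrary), with a semi-orthogonal **A** generated via a QR factorization:

```python
import numpy as np

# Check lambda_{n-(r-i)} <= gamma_i <= lambda_i for a random Sigma and a random
# semi-orthogonal A (orthonormal rows), eigenvalues sorted in descending order.
rng = np.random.default_rng(3)
n, r = 6, 3

m = rng.standard_normal((n, n))
sigma = m @ m.T                                   # random symmetric PSD matrix
lam = np.sort(np.linalg.eigvalsh(sigma))[::-1]    # lambda_1 >= ... >= lambda_n

orth, _ = np.linalg.qr(rng.standard_normal((n, r)))
A = orth.T                                        # r x n with A A^T = I_r
gam = np.sort(np.linalg.eigvalsh(A @ sigma @ A.T))[::-1]

for i in range(1, r + 1):                         # interlacing bounds of (15)
    assert lam[n - r + i - 1] - 1e-9 <= gam[i - 1] <= lam[i - 1] + 1e-9
```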

Finally, we define the following functions, which we will refer to frequently throughout the paper:

$$h\_1(\mathbf{A}) \stackrel{\triangle}{=} \mathbf{A} \cdot \boldsymbol{\Sigma} \cdot \mathbf{A}^{\top} \,, \tag{16}$$

$$h\_2(\mathbf{A}) \stackrel{\triangle}{=} \boldsymbol{\mu}^\top \cdot \mathbf{A}^\top \cdot \mathbf{A} \cdot \boldsymbol{\mu} \,, \tag{17}$$

$$h\_3(\mathbf{A}) \stackrel{\triangle}{=} \boldsymbol{\mu}^\top \cdot \mathbf{A}^\top \cdot [h\_1(\mathbf{A})]^{-1} \cdot \mathbf{A} \cdot \boldsymbol{\mu} \,. \tag{18}$$

In the next sections, we analyze the design of **A** under different *f*-divergence measures. In particular, in Sections 4 and 5, we focus on zero-mean Gaussian models for P and Q where we provide an operational interpretation of the measure in the dichotomous model in (4). Subsequently, we will discuss the generalization to non-zero mean Gaussian models in Section 6.

#### **4. Main Results for Zero-Mean Gaussian Models**

In this section, we analyze problem Q defined in (14) for each of the *f*-divergence measures separately. Specifically, for each case, we briefly provide an inference problem as a motivating example, in the context of which we relate the optimal performance limit of that inference problem to the *f*-divergence of interest. These analyses are provided in Sections 4.1–4.5. Subsequently, we provide the main results on the optimal design of the linear mapping matrix **A** in Section 4.6.

#### *4.1. Kullback–Leibler Divergence*

#### 4.1.1. Motivation

The KL divergence, being the expected value of the log-likelihood ratio, captures, at least partially, the performance of a wide range of inference problems. One specific problem whose performance is completely captured by *D*KL(**A**) is the quickest changepoint detection. Consider an observation process (time-series) {*X<sup>t</sup>* : *t* ∈ N} in which the observations *<sup>X</sup><sup>t</sup>* <sup>∈</sup> <sup>R</sup>*<sup>n</sup>* are generated by a distribution with probability measure <sup>P</sup> specified in (2). This distribution changes to Q at an unknown (random or deterministic) time *κ*, i.e.,

$$\mathbf{X}\_{t} \sim \mathbb{P} \quad \text{for } t < \kappa \qquad \text{and} \qquad \mathbf{X}\_{t} \sim \mathbb{Q} \quad \text{for } t \ge \kappa \,. \tag{19}$$

Change-point detection algorithms sample the observation process sequentially and aim to detect the change point with minimal delay after it occurs, subject to a false-alarm constraint. Hence, the two key figures of merit capturing the performance of a sequential change-point detection algorithm are the average detection delay (ADD) and the rate of false alarms. Whether the change-point *κ* is random or deterministic gives rise to two broad classes of quickest change-point detection problems, namely the Bayesian setting (*κ* is random) and the minimax setting (*κ* is deterministic). Irrespective of their discrepancies in settings and the nature of performance guarantees, the ADD for the (asymptotically) optimal algorithms is of the form [39]:

$$\text{ADD} \sim \frac{c\_1}{D\_{\text{KL}}(\mathbb{Q} \parallel \mathbb{P})} \,. \tag{20}$$

Hence, after the linear mapping induced by matrix **A**, for the ADD, we have

$$\text{ADD} \sim \frac{c\_2}{D\_{\text{KL}}(\mathbb{Q}\_\mathbf{A} \parallel \mathbb{P}\_\mathbf{A})} \text{ }, \tag{21}$$

where *c*<sup>1</sup> and *c*<sup>2</sup> are constants specified by the false-alarm constraints. Clearly, the design of **A** that minimizes the ADD is the one that maximizes the disparity between the pre- and post-change distributions P**<sup>A</sup>** and Q**A**, respectively.

#### 4.1.2. Connection between *D*KL and **A**

By noting that **A** is a semi-orthogonal matrix and recalling that the eigenvalues of *h*1(**A**) are denoted by {*γ<sup>i</sup>* : *i* ∈ [*r*]}, simple algebraic manipulations simplify *D*KL(Q**<sup>A</sup>** k P**A**) to:

$$D\_{\mathsf{KL}}(\mathbb{Q}\_{\mathbf{A}} \parallel \mathbb{P}\_{\mathbf{A}}) = \frac{1}{2} \left[ \log \frac{1}{|h\_1(\mathbf{A})|} - r + \text{Tr}[h\_1(\mathbf{A})] + h\_2(\mathbf{A}) \right]. \tag{22}$$

By setting *µ* = **0** and leveraging Theorem 2, the problem of finding an optimal design for **A** that solves (14) can be found as the solution to:

$$\max\_{\{\gamma\_{i}:i\in[r]\}} \sum\_{i=1}^{r} g\_{\mathsf{KL}}(\gamma\_{i}) \qquad \text{s.t.} \qquad \lambda\_{n-(r-i)} \le \gamma\_{i} \le \lambda\_{i} \ \forall i \in [r] \,, \tag{23}$$

where we have defined

$$g\_{\mathsf{KL}}(x) \stackrel{\triangle}{=} \frac{1}{2} \, (x - \log x - 1)\,. \tag{24}$$

Likewise, finding the optimal design for **A** that optimizes *D*KL(P**<sup>A</sup>** ‖ Q**A**) when *µ* = **0** can be done by replacing *g*KL(*γi*) with *g*KL(1/*γi*) in (23). In either case, the optimal design of **A** is constructed by choosing *r* eigenvectors of **Σ** as the rows of **A**. The results and observations are formalized in Section 4.6.
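This construction can be sketched as follows (assumptions: *µ* = **0**, **Σ**P = **I**; the eigenvalues below are hypothetical and deliberately include one very small mode, illustrating that the optimal projection need not be along the largest modes):

```python
import numpy as np
from itertools import combinations

# Build A from the r eigenvectors of Sigma that maximize the sum of g_KL(lambda),
# and confirm that the closed form (22) with mu = 0 equals that sum.
g_kl = lambda x: 0.5 * (x - np.log(x) - 1.0)

lam = np.array([3.0, 1.2, 1.0, 0.9, 0.05])        # hypothetical spectrum of Sigma
rng = np.random.default_rng(4)
evecs, _ = np.linalg.qr(rng.standard_normal((5, 5)))
sigma = evecs @ np.diag(lam) @ evecs.T

r = 2
best = max(combinations(range(5), r), key=lambda s: g_kl(lam[list(s)]).sum())
A = evecs[:, list(best)].T                         # rows = selected eigenvectors

h1 = A @ sigma @ A.T
d_kl = 0.5 * (-np.log(np.linalg.det(h1)) - r + np.trace(h1))
assert np.isclose(d_kl, g_kl(lam[list(best)]).sum())
assert 4 in best                                   # the smallest mode (0.05) is chosen
```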

### *4.2. Symmetric KL Divergence*

#### 4.2.1. Motivation

The KL divergence discussed in Section 4.1 is an asymmetric measure of the separation between two probability measures. It is symmetrized by adding the two directed divergences taken in opposite directions. The symmetric KL divergence has applications in model selection problems in which the model selection criterion is based on a measure of disparity between the true model and the approximating models. As shown in Reference [40], using the symmetric KL divergence outperforms the individual directed KL divergences since it better reflects the risks associated with underfitting and overfitting the models, respectively.

4.2.2. Connection between *D*SKL and **A**

For a given **A**, the symmetric KL divergence of interest specified in (8) is given by

$$D\_{\mathbf{SKL}}(\mathbf{A}) = \frac{1}{2} \cdot \left[ \text{Tr} \left( [h\_1(\mathbf{A})]^{-1} + h\_1(\mathbf{A}) \right) + h\_2(\mathbf{A}) + h\_3(\mathbf{A}) \right] - r \text{.} \tag{25}$$

By setting *µ* = 0, and leveraging Theorem 2, the problem of finding an optimal design for **A** that solves (14) can be found as the solution to:

$$\max\_{\{\gamma\_i : i \in [r]\}} \sum\_{i=1}^{r} g\_{\mathsf{SKL}}(\gamma\_i) \qquad \text{s.t.} \qquad \lambda\_{n-(r-i)} \le \gamma\_i \le \lambda\_i \ \forall i \in [r] \,, \tag{26}$$

where we have defined

$$g\_{\mathsf{SKL}}(x) \stackrel{\triangle}{=} \frac{1}{2}\left(x + \frac{1}{x} - 2\right). \tag{27}$$
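A quick numerical check of a decomposition implicit in the symmetrization (our own observation, not stated explicitly in the text): the per-eigenvalue objective in (27) is the sum of the two directed KL objectives, *g*SKL(*x*) = *g*KL(*x*) + *g*KL(1/*x*), mirroring the definition of the symmetric KL divergence as the sum of the two directed divergences.

```python
import numpy as np

# g_KL from Eq. (24) and g_SKL from Eq. (27)
g_kl = lambda x: 0.5 * (x - np.log(x) - 1.0)
g_skl = lambda x: 0.5 * (x + 1.0 / x - 2.0)

xs = np.linspace(0.1, 5.0, 50)
assert np.allclose(g_skl(xs), g_kl(xs) + g_kl(1.0 / xs))   # directed-sum identity
assert abs(g_skl(1.0)) < 1e-12                             # global minimum at x = 1
```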

*4.3. Squared Hellinger Distance*

4.3.1. Motivation

The squared Hellinger distance facilitates analysis in high dimensions, especially when other measures fail to admit closed-form expressions. We will discuss an important instance of this in the next subsection in the analysis of *d*TV. The squared Hellinger distance is symmetric, and it is confined to the range [0, 2].

#### 4.3.2. Connection between H<sup>2</sup> and **A**

For a given matrix **A**, we have the following closed-form expression:

$$\mathsf{H}^{2}(\mathbf{A}) = 2 - 2\,\frac{|4 \cdot h\_{1}(\mathbf{A})|^{\frac{1}{4}}}{|h\_{1}(\mathbf{A}) + \mathbf{I}\_{r}|^{\frac{1}{2}}} \cdot \exp\left(-\frac{\boldsymbol{\mu}^{\top} \cdot \mathbf{A}^{\top} \cdot [h\_{1}(\mathbf{A}) + \mathbf{I}\_{r}]^{-1} \cdot \mathbf{A} \cdot \boldsymbol{\mu}}{4}\right) . \tag{28}$$

By setting *µ* = 0, and leveraging Theorem 2, the problem of finding an optimal design for **A** that solves (14) can be found as the solution to:

$$\max\_{\{\gamma\_{i}:i\in[r]\}} \prod\_{i=1}^{r} g\_{\mathsf{H}}(\gamma\_{i}) \qquad \text{s.t.} \qquad \lambda\_{n-(r-i)} \le \gamma\_{i} \le \lambda\_{i} \text{ } \forall i \in [r] \text{ } \tag{29}$$

where we have defined

$$\mathfrak{g}\_{\mathsf{H}}(\mathfrak{x}) \stackrel{\triangle}{=} \frac{(\mathfrak{x} + 1)^2}{\mathfrak{x}}.\tag{30}$$

#### *4.4. Total Variation Distance*

#### 4.4.1. Motivation

The total variation distance appears as the key performance metric in binary hypothesis testing and in high-dimensional inference, e.g., Le Cam's method for the binary quantization and testing of the individual dimensions (which is, in essence, binary hypothesis testing). In particular, for the simple binary hypothesis testing model in (65), the minimum total probability of error (the sum of the type-I and type-II error probabilities) is related to the total variation distance *d*TV(**A**). Specifically, for a decision rule *d* : *X* → {H0, H1}, the following holds:

$$\inf\_{d}\left[\mathbb{P}\_{\mathbf{A}}(d = \mathsf{H}\_1) + \mathbb{Q}\_{\mathbf{A}}(d = \mathsf{H}\_0)\right] = 1 - d\_{\mathsf{TV}}(\mathbf{A})\,. \tag{31}$$

The total variation between two Gaussian distributions does not have a closed-form expression. Hence, unlike the other settings, an optimal solution to (6) in this context cannot be obtained analytically. Alternatively, in order to gain intuition into the structure of a

near-optimal matrix **A**, we design **A** such that it optimizes known bounds on *d*TV(**A**). In particular, we use two sets of bounds on *d*TV(**A**). One set is obtained by bounding it via the Hellinger distance, and another is due to a recent study that established upper and lower bounds that are identical up to a constant factor [41].

#### 4.4.2. Connection between *d*TV and **A**

*(1) Bounding by Hellinger Distance:* The total variation distance can be bounded by the Hellinger distance according to

$$\frac{1}{2}\mathsf{H}^{2}(\mathbf{A}) \le d\_{\mathsf{TV}}(\mathbf{A}) \le \mathsf{H}(\mathbf{A})\sqrt{1 - \frac{\mathsf{H}^{2}(\mathbf{A})}{4}}\,. \tag{32}$$

It can be readily verified that these bounds are monotonically increasing with H<sup>2</sup>(**A**) in the interval [0, 2]. Hence, they are maximized simultaneously by maximizing the squared Hellinger distance, as discussed in Section 4.3. We refer to this bound as the Hellinger bound.
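The monotonicity claim for the sandwich (32) is also easy to confirm numerically; a small self-contained sketch of our own:

```python
import numpy as np

# Bounds in (32) as functions of H^2 over its range [0, 2].
h2 = np.linspace(0.0, 2.0, 201)
lower = 0.5 * h2
upper = np.sqrt(h2) * np.sqrt(1.0 - h2 / 4.0)

assert np.all(lower <= upper + 1e-12)   # the sandwich is consistent
assert np.all(np.diff(lower) > 0.0)     # lower bound increasing in H^2
assert np.all(np.diff(upper) > 0.0)     # upper bound increasing on [0, 2]
```

Both curves rise from 0 at H<sup>2</sup> = 0 to 1 at H<sup>2</sup> = 2, so any design that increases H<sup>2</sup>(**A**) tightens both bounds at once.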

*(2) Matching Bounds up to a Constant:* The second set of bounds that we use is provided in Reference [41]. These bounds relate the total variation between two Gaussian models to the Frobenius norm (FB) of a matrix related to their covariance matrices. Specifically, these FB-based bounds on the total variation *d*TV(**A**) are given by

$$\frac{1}{100} \le \frac{d\_{\mathsf{TV}}(\mathbf{A})}{\min\left\{1,\; \sqrt{\sum\_{i=1}^{r} g\_{\mathsf{TV}}(\gamma\_i)}\right\}} \le \frac{3}{2}\,, \tag{33}$$

where we have defined

$$\mathcal{g}\_{\mathsf{T}\mathsf{V}}(\mathsf{x}) \stackrel{\triangle}{=} \left(\frac{1}{\mathsf{x}} - 1\right)^{2}.\tag{34}$$

Since the lower and upper bounds on *d*TV(**A**) are identical up to a constant, they will be maximized by the same design of **A**.

#### *4.5. χ*<sup>2</sup>*-Divergence*

4.5.1. Motivation

*χ*<sup>2</sup>-divergence appears in a wide range of statistical estimation problems for the purpose of finding a lower bound on the estimation noise variance. For instance, consider the canonical problem of estimating a latent variable *θ* from the observed data *X*, and denote two candidate estimates by *p*(*X*) and *q*(*X*). Define P and Q as the probability measures of *p*(*X*) and *q*(*X*), respectively. According to the Hammersley-Chapman-Robbins (HCR) bound on the quadratic loss function, for any estimator ˆ*θ*, we have

$$\mathsf{var}\_{\theta}(\hat{\theta}) \ge \sup\_{p \ne q} \frac{\left[\mathbb{E}\_{\mathbb{Q}}[q(X)] - \mathbb{E}\_{\mathbb{P}}[p(X)]\right]^2}{\chi^2(\mathbb{Q} \parallel \mathbb{P})} \,,\tag{35}$$

which, for unbiased estimators *p* and *q*, simplifies to the Cramér-Rao lower bound

$$\mathsf{var}\_{\theta}(\hat{\theta}) \ge \sup\_{p \ne q} \frac{(q-p)^2}{\chi^2(\mathbb{Q} \parallel \mathbb{P})} \,, \tag{36}$$

depending on P and Q only through their *χ*<sup>2</sup>-divergence. Besides its applications to estimation problems, *χ*<sup>2</sup> is easier to compute than some other *f*-divergence measures (e.g., the total variation distance). Specifically, for product distributions, *χ*<sup>2</sup> tensorizes: it can be expressed in terms of its one-dimensional components, which are easier to compute than the KL divergence and the total variation distance. Hence, bounding other measures by *χ*<sup>2</sup> and then analyzing *χ*<sup>2</sup> is a strategy that appears in a wide range of inference problems.

4.5.2. Connection between *χ*<sup>2</sup> and **A**

By setting *µ* = 0, for a given matrix **A**, from (11), we have the following closed-form expression:

$$\chi^2(\mathbf{A}) = \frac{1}{|h\_1(\mathbf{A})|\sqrt{\left|2(h\_1(\mathbf{A}))^{-1} - \mathbf{I}\_r\right|}} - 1 \tag{37}$$

$$= \prod\_{i=1}^{r} g\_{\chi\_1}(\gamma\_i) - 1\,, \tag{38}$$

where we have defined

$$g\_{\chi\_1}(x) \stackrel{\triangle}{=} \frac{1}{\sqrt{x(2-x)}}\,. \tag{39}$$

As we show in Appendix C, for *χ*<sup>2</sup>(**A**) to exist (i.e., be finite), all the eigenvalues {*γ<sup>i</sup>* : *i* ∈ [*r*]} should fall in the interval (0, 2). Subsequently, finding the optimal design for **A** that optimizes *χ*<sup>2</sup>(P**<sup>A</sup>** ∥ Q**A**) when *µ* = 0 can be done by replacing *gχ*<sup>1</sup> in (38) with *gχ*<sup>2</sup>, which is given by

$$\mathbf{g}\_{\chi\_2}(\mathbf{x}) \stackrel{\triangle}{=} \sqrt{\frac{\mathbf{x}^2}{2\mathbf{x}-1}}.\tag{40}$$

Based on this, and by following a similar line of argument as in the case of the KL divergence, designing an optimal **A** reduces to identifying a subset of the eigenvalues of **Σ** and assigning their associated eigenvectors as the rows of matrix **A**. These observations are formalized in Section 4.6.
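The product form (38) and the finiteness condition above are easy to check numerically; a small sketch (the function names are ours):

```python
import numpy as np

def g_chi1(x):
    """Per-eigenvalue factor from Eq. (39); finite only for x in (0, 2)."""
    return 1.0 / np.sqrt(x * (2.0 - x))

def chi2_A(gammas):
    """chi^2(A) via the product form (38) for the zero-mean case."""
    gammas = np.asarray(gammas, dtype=float)
    if np.any(gammas <= 0.0) or np.any(gammas >= 2.0):
        return np.inf                 # divergence blows up outside (0, 2)
    return float(np.prod(g_chi1(gammas)) - 1.0)

print(chi2_A([1.0, 1.0]))   # identical distributions give 0.0
print(chi2_A([1.0, 2.5]))   # an eigenvalue outside (0, 2) gives inf
```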

#### *4.6. Main Results*

In this section, we provide analytical closed-form solutions for designing optimal matrices **A** for the following *f*-divergence measures: *D*KL, *D*SKL, H<sup>2</sup>, and *χ*<sup>2</sup>. The total variation measure *d*TV does not admit a closed form for Gaussian models. In this case, we provide a design for **A** that optimizes the bounds we have provided for *d*TV in Section 4.4. Due to the structural similarities of the results, we group and treat *D*KL, *D*SKL, and *d*TV in Theorem 3. Similarly, we group and treat H<sup>2</sup> and *χ*<sup>2</sup> in Theorem 4.

**Theorem 3** (*D*KL, *D*SKL, *d*TV)**.** *For a given function g* : R → R*, define the permutations:*

$$\pi^\* \stackrel{\triangle}{=} \arg\max\_{\pi} \sum\_{i=1}^r g(\lambda\_{\pi(i)})\,. \tag{41}$$

*Then, for D<sup>f</sup>* (**A**) ∈ {*D*KL(**A**), *D*SKL(**A**), *d*TV(**A**)} *and functions g<sup>f</sup>* ∈ {*g*KL, *g*SKL, *g*TV}*:*

*1. For maximizing D<sup>f</sup> , set g* = *g<sup>f</sup> and select the eigenvalues of* **AΣA**<sup>&</sup>gt; *as*

$$
\gamma\_i = \lambda\_{\pi^\*(i)}\,, \qquad \text{for} \qquad i \in [r]\,. \tag{42}
$$

*2. Row i* ∈ [*r*] *of matrix* **A** *is the eigenvector of* **Σ** *associated with the eigenvalue γ<sup>i</sup> .*

**Proof.** See Appendix B.

By further leveraging the structures of the functions *g*KL, *g*SKL, and *g*TV, we can simplify the design of the matrix **A**. Specifically, note that *g*KL, *g*SKL, and *g*TV are all strictly convex functions attaining their global minima at *x* = 1. Based on this, we have the following observations.

**Corollary 1** (*D*KL, *D*SKL, *d*TV)**.** *For maximizing D<sup>f</sup>* (**A**) ∈ {*D*KL(**A**), *D*SKL(**A**), *d*TV(**A**)}*, when λ<sup>n</sup>* ≥ 1*, we have γ<sup>i</sup>* = *λ<sup>i</sup> for all i* ∈ [*r*]*, and the rows of* **A** *are eigenvectors of* **Σ** *associated with its r largest eigenvalues, i.e.,* {*λ<sup>i</sup>* : *i* ∈ [*r*]}*.*

**Corollary 2** (*D*KL, *D*SKL, *d*TV)**.** *For maximizing D<sup>f</sup>* (**A**) ∈ {*D*KL(**A**), *D*SKL(**A**), *d*TV(**A**)}*, when λ*<sup>1</sup> ≤ 1*, we have γ<sup>i</sup>* = *λn*−*r*+*<sup>i</sup> for all i* ∈ [*r*]*, and the rows of* **A** *are eigenvectors of* **Σ** *associated with its r smallest eigenvalues, i.e.,* {*λ<sup>i</sup>* : *i* ∈ {*n* − *r* + 1, . . . , *n*}}*.*

**Remark 1.** *In order to maximize D<sup>f</sup>* (**A**) ∈ {*D*KL(**A**), *D*SKL(**A**), *d*TV(**A**)} *when λ<sup>n</sup>* ≤ 1 ≤ *λ*1*, finding the best permutation of eigenvalues involves sorting all the n eigenvalues λ<sup>i</sup> 's and subsequently performing r comparisons as illustrated in Algorithm 1. This amounts to* O(*n* · log(*n*)) *time complexity instead of* O(*n* · log(*r*)) *time complexity involved in determining the design for* **A** *in the case of Corollaries 1 and 2, which require finding the r extreme eigenvalues in determining the design for π* ∗ *.*

**Remark 2.** *The optimal design of* **A** *often does not involve being aligned with the largest eigenvalues of the covariance matrix* **Σ***, which is in contrast to some of the key approaches to linear dimensionality reduction that generally perform linear mapping along the eigenvectors associated with the largest eigenvalues of the covariance matrix. When the eigenvalues of* **Σ** *are all smaller than 1, in particular,* **A** *will be designed by choosing eigenvectors associated with the smallest eigenvalues of* **Σ** *in order to preserve largest separability.*

We now provide the counterpart results for the H<sup>2</sup> and *χ*<sup>2</sup>-divergence measures. Their major distinction from the previous three measures is that, for these two, *D<sup>f</sup>*(**A**) decomposes into a product of individual functions of the eigenvalues {*γ<sup>i</sup>* : *i* ∈ [*r*]}. The counterparts of Theorem 3 and Corollaries 1 and 2 for H<sup>2</sup> and *χ*<sup>2</sup> are as follows.

**Theorem 4** (H 2 , *χ* 2 )**.** *For a given function g* : R → R*, define the permutations:*

$$\pi^\* \stackrel{\triangle}{=} \arg\max\_{\pi} \prod\_{i=1}^r g(\lambda\_{\pi(i)})\,. \tag{43}$$

*Then, for D<sup>f</sup>* (**A**) ∈ {H<sup>2</sup>(**A**), *χ*<sup>2</sup>(**A**), *χ*<sup>2</sup>(P**<sup>A</sup>** ∥ Q**A**)} *and functions g<sup>f</sup>* ∈ {*g*H, *gχ*<sup>1</sup>, *gχ*<sup>2</sup>}*:*

*1. For maximizing D<sup>f</sup> , set g* = *g<sup>f</sup> and select the eigenvalues of* **AΣA**<sup>&</sup>gt; *as*

$$
\gamma\_i = \lambda\_{\pi^\*(i)}\,, \qquad \text{for} \qquad i \in [r]\,. \tag{44}
$$

*2. Row i* ∈ [*r*] *of matrix* **A** *is the eigenvector of* **Σ** *associated with the eigenvalue γ<sup>i</sup> .*

**Proof.** See Appendix C.

Next, note that *g*<sup>H</sup> is a strictly convex function taking its global minimum at *x* = 1. Furthermore, *gχ<sup>i</sup>* for *i* ∈ [2] are strictly convex over (0, 2) and take their global minimum at *x* = 1.

**Corollary 3** (H<sup>2</sup>, *χ*<sup>2</sup>)**.** *For maximizing D<sup>f</sup>* (**A**) ∈ {H<sup>2</sup>(**A**), *χ*<sup>2</sup>(**A**), *χ*<sup>2</sup>(P**<sup>A</sup>** ∥ Q**A**)}*, when λ<sup>n</sup>* ≥ 1*, we have γ<sup>i</sup>* = *λ<sup>i</sup> for all i* ∈ [*r*]*, and the rows of* **A** *are eigenvectors of* **Σ** *associated with its r largest eigenvalues, i.e.,* {*λ<sup>i</sup>* : *i* ∈ [*r*]}*.*

**Corollary 4** (H<sup>2</sup>, *χ*<sup>2</sup>)**.** *For maximizing D<sup>f</sup>* (**A**) ∈ {H<sup>2</sup>(**A**), *χ*<sup>2</sup>(**A**), *χ*<sup>2</sup>(P**<sup>A</sup>** ∥ Q**A**)}*, when λ*<sup>1</sup> ≤ 1*, we have γ<sup>i</sup>* = *λn*−*r*+*<sup>i</sup> for all i* ∈ [*r*]*, and the rows of* **A** *are eigenvectors of* **Σ** *associated with its r smallest eigenvalues, i.e.,* {*λ<sup>i</sup>* : *i* ∈ {*n* − *r* + 1, . . . , *n*}}*.*

#### **Algorithm 1** Optimal Permutation *π* <sup>∗</sup> When *λ<sup>n</sup>* ≤ 1 ≤ *λ*<sup>1</sup>

1: Initialize *i* ← *n*, *j* ← 1, *p<sup>k</sup>* ← *λ<sup>k</sup>* ∀*k* ∈ {*i*, *j*}, *π*<sup>∗</sup> ← ∅
2: Sort the eigenvalues of **Σ** in descending order {*λ<sup>k</sup>* : *k* ∈ [*n*]}
3: **while** |*π*<sup>∗</sup>| ≠ *r* **do**
4: **if** *g<sup>f</sup>*(*p<sup>i</sup>*) > *g<sup>f</sup>*(*p<sup>j</sup>*) **then**
5: *π*<sup>∗</sup> ← *π*<sup>∗</sup> ∪ {*p<sup>i</sup>*}
6: *i* ← *i* − 1
7: **else**
8: *π*<sup>∗</sup> ← *π*<sup>∗</sup> ∪ {*p<sup>j</sup>*}
9: *j* ← *j* + 1
10: **end if**
11: **end while**
12: **return** *π*<sup>∗</sup>
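A direct transcription of Algorithm 1 (a sketch; variable names are ours). The two-extremes scan is justified because *g<sup>f</sup>* is strictly convex with its minimum at *x* = 1, so the best remaining eigenvalue is always either the smallest or the largest one left:

```python
import numpy as np

def optimal_permutation(lams, r, g):
    """Algorithm 1: greedily pick r eigenvalues from the two extremes of the
    sorted spectrum (intended for the case lambda_n <= 1 <= lambda_1)."""
    p = np.sort(np.asarray(lams, dtype=float))[::-1]   # descending order
    i, j = len(p) - 1, 0        # i -> smallest remaining, j -> largest remaining
    chosen = []
    while len(chosen) < r:
        if g(p[i]) > g(p[j]):   # smallest remaining eigenvalue scores higher
            chosen.append(p[i]); i -= 1
        else:                   # largest remaining eigenvalue scores higher
            chosen.append(p[j]); j += 1
    return chosen

g_kl = lambda x: 0.5 * (x - np.log(x) - 1.0)
res = optimal_permutation([0.1, 0.8, 1.0, 1.5, 4.0], r=2, g=g_kl)
print(res)   # picks from both ends of the spectrum
```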

Finally, we remark that, unlike the other measures, the total variation does not admit a closed form, and we used two sets of tractable bounds to analyze the total variation case. By comparing the design of **A** based on the different bounds, we have the following observation.

**Remark 3.** *We note that both sets of bounds lead to the same design of* **A** *when either λ*<sup>1</sup> ≤ 1 *or λ<sup>n</sup>* ≥ 1*. Otherwise, each will be selecting a different set of the eigenvectors of* **Σ** *to construct* **A** *according to the functions*

$$g\_{\mathsf{H}}(x) = \frac{(x+1)^2}{x} \quad \text{versus} \quad g\_{\mathsf{TV}}(x) = \left(\frac{1}{x} - 1\right)^2. \tag{45}$$

#### **5. Zero-Mean Gaussian Models–Simulations**

#### *5.1. KL Divergence*

In this section, we show the gains of the above analysis for the KL divergence measure *D*KL(**A**) through simulations on a change-point detection problem. We focus on the minimax setting in which the change-point *κ* is deterministic. The objective is to detect a change in the stochastic process *X<sup>t</sup>* with minimal delay after the probability measure changes at *κ*, and we define *τ* ∈ N as the time at which we can form a confident decision. A canonical model to quantify the decision delay is the conditional average detection delay (CADD) due to Pollak [42]:

$$\mathsf{CADD}(\tau) \stackrel{\triangle}{=} \sup\_{\kappa \ge 1} \mathbb{E}\_{\kappa} \left[ \tau - \kappa \mid \tau \ge \kappa \right],\tag{46}$$

where E*<sup>κ</sup>* is the expectation with respect to the probability distribution when the change happens at time *κ*. The objective of this formulation is to optimize the decision delay for the worst-case value of the change-point *κ* (that is, the change-point that leads to the maximum decision delay), while the constraints on the false alarm rate are satisfied. In this formulation, this worst case is *κ* = 1, in which case all the data points are generated from the post-change distribution. In the minimax setting, a reasonable measure of false alarms is the mean time to false alarm, or its reciprocal, the false alarm rate (FAR), defined as

$$\mathsf{FAR}(\tau) \stackrel{\triangle}{=} \frac{1}{\mathbb{E}\_{\infty}[\tau]}\,, \tag{47}$$

where E<sup>∞</sup> is the expectation with respect to the distribution when a change never occurs, i.e., *κ* = ∞. A standard approach to balance the trade-off between decision delay and false alarm rates involves solving [42]

$$\min\_{\tau} \mathsf{CADD}(\tau) \qquad \text{s.t.} \qquad \mathsf{FAR}(\tau) \le \alpha\,, \tag{48}$$

where *α* ∈ R<sup>+</sup> controls the rate of false alarms. For the quickest change-point detection formulation in (48), the popular cumulative sum (CuSum) test generates the optimal solutions, involving computing the following test statistic:

$$\mathcal{W}[t] \stackrel{\triangle}{=} \max\_{1 \le k \le t+1} \sum\_{i=k}^{t} \log \left( \frac{d\mathbb{Q}\_{\mathbf{A}}(X\_i)}{d\mathbb{P}\_{\mathbf{A}}(X\_i)} \right) \,. \tag{49}$$

Computing *W*[*t*] follows a convenient recursion given by

$$\mathcal{W}[t] \stackrel{\triangle}{=} \left( \mathcal{W}[t-1] + \log \left( \frac{d \mathbb{Q}\_{\mathbf{A}}(\mathbf{X}\_{t})}{d \mathbb{P}\_{\mathbf{A}}(\mathbf{X}\_{t})} \right) \right)^{+},\tag{50}$$

where *W*[0] = 0. The CuSum statistic declares a change at a stopping time *τ* given by

$$
\tau \stackrel{\triangle}{=} \inf\{t \ge 1 : W[t] > C\}\,, \tag{51}
$$

where *C* is chosen such that the constraint on FAR(*τ*) in (48) is satisfied.
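The recursion (50) and stopping rule (51) can be sketched as follows for a one-dimensional reduced observation stream (the pre/post-change densities, the seed, and the threshold below are illustrative choices of ours, not the paper's exact simulation setup):

```python
import numpy as np

def cusum_stopping_time(x, llr, C):
    """Run the CuSum recursion (50) on a stream of reduced observations and
    return the stopping time (51), or None if no change is ever declared.
    `llr(x_t)` is the per-sample log-likelihood ratio log(dQ_A/dP_A)(x_t)."""
    W = 0.0
    for t, xt in enumerate(x, start=1):
        W = max(W + llr(xt), 0.0)   # recursion (50); (.)^+ clamps at zero
        if W > C:                   # stopping rule (51)
            return t
    return None

# Toy example with r = 1: pre-change N(0, 1), post-change N(0, gamma).
gamma = 4.0
llr = lambda x: -0.5 * np.log(gamma) + 0.5 * (1.0 - 1.0 / gamma) * x ** 2
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 200),                # pre-change
                    rng.normal(0.0, np.sqrt(gamma), 200)])    # change at t = 201
tau = cusum_stopping_time(x, llr, C=np.log(5000.0))
print(tau)
```

Under the post-change measure the statistic drifts upward at rate *D*KL(Q**<sup>A</sup>** ∥ P**A**) per sample, which is the mechanism behind the delay scaling in (21).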

In this setting, we consider two zero-mean Gaussian models with the following pre- and post-linear dimensionality reduction structures:

$$\begin{array}{llll}\mathbb{P}: & \mathcal{N}(\mathbf{0}, \mathbf{I}\_{n}) \quad \text{and} & \mathbb{Q}: & \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})\\\mathbb{P}\_{\mathbf{A}}: & \mathcal{N}(\mathbf{0}, \mathbf{I}\_{r}) \quad \text{and} & \mathbb{Q}\_{\mathbf{A}}: & \mathcal{N}(\mathbf{0}, h\_{1}(\mathbf{A})) \end{array} \tag{52}$$

where the covariance matrix **Σ** is generated randomly, and its eigenvalues are sampled from a uniform distribution. In particular, for the original data dimension *n*, d0.9*n*e eigenvalues are sampled such that {*λ<sup>i</sup>* ∼ U(0.064, 1)}, and the remaining eigenvalues are sampled such that {*λ<sup>i</sup>* ∼ U(1, 4.24)}. We note that this is done since the objective function lies in the same range for the eigenvalues within the range [0.0649, 1] and [1, 4.24]. In order to consider the worst case detection delay, we set *κ* = 1 and generate stochastic observations according to the model described in (52) that follows the change-point detection model in (19). For every random realization of covariance matrix **Σ**, we run the CuSum statistic (50), where we generate **A** according to the following two schemes:

*(1) Largest eigen modes:* In this scheme, the linear map **A** is designed such that its rows are eigenvectors associated with the *r* largest eigenvalues of **Σ**.

*(2) Optimal design:* In this scheme, the linear map **A** is designed such that its rows are eigenvectors associated with *r* eigenvalues of **Σ** that maximize *D*KL(**A**) according to Theorem 3.

In order to evaluate and compare the performance of the two schemes, we compute the ADD obtained by running a Monte-Carlo simulation over 5000 random realizations of the stochastic process *X<sup>t</sup>* following the change-point detection model in (19) for every random realization of **Σ** and for each reduced dimension 1 ≤ *r* ≤ 9. The detection delays obtained are then averaged again over 100 random realizations of the covariance matrices **Σ** for each reduced dimension *r*. Figure 1 shows the plot of ADD versus *r* for multiple initial data dimensions *n* and for a fixed FAR = 1/5000. Owing to the dependence on *D*KL(**A**) given in (21), the delay associated with the optimal linear mapping in Theorem 3 achieves better performance.

**Figure 1.** Comparison of the average detection delay (ADD) under the optimal design and largest eigen modes schemes for multiple reduced data dimensions *r* as a function of original data dimension *n* for a fixed false alarm rate (FAR) which is equal to 1/5000.

#### *5.2. Symmetric KL Divergence*

In this section, we show the gains of the analysis by numerically computing *D*SKL(**A**). We follow the pre- and post-linear dimensionality reduction structures given in (52), where the covariance matrix **Σ** is randomly generated following the setup used in Section 5.1. As plotted in Figure 2, by choosing the design scheme for *D*SKL(**A**) according to Theorem 3, the optimal design outperforms other schemes.

**Figure 2.** Comparison of the empirical average computed for the optimal design and largest eigen modes schemes for multiple reduced data dimensions *r* as a function of original data dimension *n*.

#### *5.3. Squared Hellinger Distance*

We consider a Bayesian hypothesis testing problem with class prior probabilities *p*P**<sup>A</sup>**, *p*Q**<sup>A</sup>** and Gaussian class-conditional densities for the linear dimensionality reduction model in (52). Without loss of generality, we assume a 0–1 loss function associated with misclassification for the hypothesis test. In order to quantify the performance of the Bayes decision rule, it is imperative to compute the associated probability of error, also known as the Bayes error, which we denote by *P<sup>e</sup>*. Since, in general, computing *P<sup>e</sup>* for the optimal decision rule for multivariate Gaussian conditional densities is intractable, numerous techniques have been devised to bound *P<sup>e</sup>*. Owing to its simplicity, one of the most commonly employed metrics is the Bhattacharyya coefficient, given by

$$\mathsf{BC}(\mathbf{A}) \stackrel{\triangle}{=} \int\_{\mathbb{R}^{\mathsf{r}}} \sqrt{\mathsf{d}\mathbb{P}\_{\mathbf{A}} \cdot \mathsf{d}\mathbb{Q}\_{\mathbf{A}}} \,. \tag{53}$$

The metric in (53) facilitates upper bounding the error probability as

$$P\_e \le \sqrt{p\_{\mathbb{P}\_{\mathbf{A}}}\, p\_{\mathbb{Q}\_{\mathbf{A}}}} \cdot \mathsf{BC}(\mathbf{A})\,, \tag{54}$$

which is widely referred to as the Bhattacharyya bound. Relevant to this study is that the squared Hellinger distance is related to the Bhattacharyya coefficient in (53) through

$$\mathsf{H}^{2}(\mathbf{A}) = 2 - 2\,\mathsf{BC}(\mathbf{A})\,. \tag{55}$$

Hence, maximizing the Hellinger distance H<sup>2</sup>(**A**) results in a tighter bound on *P<sup>e</sup>* from (54). To show the performance numerically, we compute BC(**A**) via (55). For the pre- and post-linear dimensionality reduction structures given in (52), the covariance matrix **Σ** is randomly generated following the setup used in Section 5.1. As plotted in Figure 3, by employing the design scheme according to Theorem 4, the optimal design results in a smaller BC(**A**) and, hence, a tighter upper bound on *P<sup>e</sup>* in comparison to other schemes.
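For the zero-mean model (52), the Bhattacharyya coefficient inherits a per-eigenvalue product form from (28) with *µ* = 0. A small sketch (our own derivation; `bhattacharyya` and `g_H` are our names) showing that maximizing ∏ *g*H(*γ<sup>i</sup>*), as in Theorem 4, directly shrinks BC(**A**) and thus tightens (54):

```python
import numpy as np

def g_H(x):
    """Per-eigenvalue Hellinger objective from Eq. (30)."""
    return (x + 1.0) ** 2 / x

def bhattacharyya(gammas):
    """BC(A) for the zero-mean model (52): from Eq. (28) with mu = 0,
    BC(A) = prod_i (4*g_i / (1 + g_i)^2)^{1/4} = prod_i (4 / g_H(g_i))^{1/4}."""
    gammas = np.asarray(gammas, dtype=float)
    return float(np.prod((4.0 / g_H(gammas)) ** 0.25))

print(bhattacharyya([1.0, 1.0]))   # identical distributions: BC = 1, loosest bound
print(bhattacharyya([4.0, 0.2]))   # more separated spectrum: smaller BC
```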

**Figure 3.** Comparison of the empirical average of the Bhattacharyya coefficient BC(**A**) under optimal design and largest eigen modes schemes for multiple reduced data dimensions *r* as a function of original data dimension *n*.

#### *5.4. Total Variation Distance*

Consider a binary hypothesis test with Gaussian class-conditional densities following the model in (52) and equal class prior probabilities, i.e., *p*P**<sup>A</sup>** = *p*Q**<sup>A</sup>**. We define *cij* as the cost associated with deciding in favor of H*<sup>i</sup>* when the true hypothesis is H*<sup>j</sup>*, where 0 ≤ *i*, *j* ≤ 1, and denote the densities associated with the measures P**A** and Q**<sup>A</sup>** by *f*P**<sup>A</sup>** and *f*Q**<sup>A</sup>**, respectively. Without loss of generality, we assume a 0–1 loss function such that *cij* = 1 for all *i* ≠ *j* and *cii* = 0 for all *i*. The optimal Bayes decision rule that minimizes the error probability is given by

$$\frac{f\_{\mathbb{P}\_{\mathbf{A}}}(x)}{f\_{\mathbb{Q}\_{\mathbf{A}}}(x)} \;\underset{d=\mathsf{H}\_1}{\overset{d=\mathsf{H}\_0}{\gtrless}}\; 1\,. \tag{56}$$

Since the total variation distance cannot be computed in closed form, we numerically compute the error probability *P<sup>e</sup>* under the two bounds (Hellinger-based and FB-based) introduced in Section 4.4.2 to quantify the performance of the design of matrix **A** for the underlying inference problem. The covariance matrix **Σ** is randomly generated following the setup used in Section 5.1. As plotted in Figure 4, by optimizing the Hellinger-based bound according to Theorem 4 and the FB-based bound according to Theorem 3, the two design schemes achieve a smaller *P<sup>e</sup>* than the largest eigen modes scheme. We further observe that the FB-based bounds are loose in comparison to the Hellinger-based bounds. Therefore, we choose not to plot the lower bound on *P<sup>e</sup>* for the FB-based bounds in Figure 4.

**Figure 4.** Comparing the logarithm of the empirical average value for *P<sup>e</sup>* under the two bounds on *d*TV(**A**) (Hellinger-based and Frobenius norm (FB)-based) with the largest eigen modes scheme for multiple projected data dimensions *r* as a function of initial data dimension *n*.

#### *5.5. χ*<sup>2</sup>*-Divergence*

In this section, we show the gains of the proposed analysis by numerically computing *χ*<sup>2</sup>(**A**) to find a lower bound (up to a constant) on the noise variance var*<sup>θ</sup>*(ˆ*θ*). Following the pre- and post-linear dimensionality reduction structures given in (52), the covariance matrix **Σ** is randomly generated following the setup used in Section 5.1. As shown in Figure 5, constructing the optimal design according to Theorem 4 achieves a tighter lower bound in comparison to the other scheme.

**Figure 5.** Comparison of the lower bound on the noise variance given by 1/*χ*<sup>2</sup>(**A**) under the optimal and largest eigen modes schemes for multiple reduced data dimensions *r* as a function of original data dimension *n*.

#### **6. General Gaussian Models**

In the previous section, we focused on *µ* = 0. When *µ* ≠ 0, optimizing each *f*-divergence measure under the semi-orthogonality constraint does not render closed-form expressions. Nevertheless, to provide some intuition, we provide a numerical approach to the optimal design of **A**, which might also enjoy some *local* optimality guarantees. To start, note that the feasible set of solutions M*<sup>r</sup> <sup>n</sup>* ≜ {**A** ∈ R*<sup>r</sup>*×*<sup>n</sup>* : **A** · **A**<sup>></sup> = **I***<sup>r</sup>*}, owing to the orthogonality constraints in Q, is often referred to as the Stiefel manifold. Therefore, solving Q requires designing algorithms that optimize the objective while preserving the manifold constraints during iterations.

We employ the method of Lagrange multipliers to formulate the Lagrangian function. By denoting the matrix of Lagrangian multipliers by **<sup>L</sup>** <sup>∈</sup> <sup>R</sup>*r*×*<sup>r</sup>* , the Lagrangian function of problem (14) is given by

$$\mathcal{L}(\mathbf{A}, \mathbf{L}) = D\_f(\mathbf{A}) + \left< \mathbf{L}, \mathbf{A} \cdot \mathbf{A}^\top - \mathbf{I}\_r \right>. \tag{57}$$

From the first order optimality condition, for any local maximizer **A**∗ of (14), there exists a Lagrange multiplier **L** ∗ such that

$$\left. \nabla\_{\mathbf{A}} \mathcal{L}(\mathbf{A}, \mathbf{L}) \right|\_{\mathbf{A}^\*, \mathbf{L}^\*} = \mathbf{0} \,, \tag{58}$$

where we denote the partial derivative with respect to **A** by ∇**A**. In what follows, we iterate the design mapping **A** using the gradient ascent algorithm in order to find a solution for **A**. As discussed in the next subsection, this solution is guaranteed to be at least locally optimal.

#### *6.1. Optimizing via Gradient Ascent*

We use an iterative gradient ascent-based algorithm to find the local maximizer of *Df* (**A**) such that **<sup>A</sup>** ∈ M*<sup>r</sup> n* . The gradient ascent update at any given iteration *k* ∈ N is given by

$$\mathbf{A}^{k+1} = \mathbf{A}^k + \alpha \cdot \nabla\_\mathbf{A} \mathcal{L}(\mathbf{A}, \mathbf{L}) \Big|\_{\mathbf{A}^k} \,. \tag{59}$$

Note that, following this update, since the new point **A***<sup>k</sup>*+<sup>1</sup> in (59) may not satisfy semi-orthogonality, i.e., **A***<sup>k</sup>*+<sup>1</sup> ∉ M*<sup>r</sup> <sup>n</sup>*, it is imperative to establish a relation between the multipliers **L** and **A***<sup>k</sup>* in every iteration *k* to ensure a constraint-preserving update scheme. In particular, to enforce the semi-orthogonality constraint on **A***<sup>k</sup>*+<sup>1</sup>, a relationship between the multipliers and the gradients is derived in every iteration *k*. Following a similar line of analysis for gradient descent in Reference [43], the relationship between the multipliers and the gradients is provided in Appendix E. More details on the analysis of the update scheme can be found in Reference [43], and a detailed discussion on the convergence guarantees of classical steepest descent update schemes adapted to semi-orthogonality constraints can be found in Reference [44].

In order to simplify ∇**A**L(**A**, **L**) and state the relationships, we define **Λ** ≜ **L** + **L**<sup>></sup> and subsequently find a relationship between **Λ** and **A***<sup>k</sup>* in every iteration *k*. This is obtained by right-multiplying (59) by (**A***<sup>k</sup>*+<sup>1</sup>)<sup>></sup> and solving for the **Λ** that enforces the semi-orthogonality constraint on **A***<sup>k</sup>*+<sup>1</sup>. To simplify the analysis, we take a finite Taylor series expansion of **Λ** around *α* = 0 and choose *α* such that the resulting update is a good approximation of the gradient step subject to **A** · **A**<sup>></sup> = **I***<sup>r</sup>*. As derived in Appendix E, simple algebraic manipulations show that the matrices **Λ**0, **Λ**1, and **Λ**2, for which the finite Taylor series expansion **Λ** ≈ **Λ**<sup>0</sup> + *α* · **Λ**<sup>1</sup> + *α*<sup>2</sup> · **Λ**<sup>2</sup> enforces the constraint to a good approximation, are given by

$$\boldsymbol{\Lambda}\_{0} \stackrel{\triangle}{=} -\frac{1}{2} \left[ \nabla\_{\mathbf{A}} D\_{f}(\mathbf{A}) \cdot \mathbf{A}^{\top} + \mathbf{A} \cdot \nabla\_{\mathbf{A}} D\_{f}(\mathbf{A})^{\top} \right],\tag{60}$$

$$\boldsymbol{\Lambda}\_{1} \stackrel{\triangle}{=} -\frac{1}{2} \left[ \left( \nabla\_{\mathbf{A}} D\_{f}(\mathbf{A}) + \boldsymbol{\Lambda}\_{0}\,\mathbf{A} \right) \cdot \left( \nabla\_{\mathbf{A}} D\_{f}(\mathbf{A}) + \boldsymbol{\Lambda}\_{0}\,\mathbf{A} \right)^{\top} \right],\tag{61}$$

$$\boldsymbol{\Lambda}\_{2} \stackrel{\triangle}{=} -\frac{1}{2} \Big[ \boldsymbol{\Lambda}\_{1} \cdot \mathbf{A} \cdot \nabla\_{\mathbf{A}} D\_{f}(\mathbf{A})^{\top} + \nabla\_{\mathbf{A}} D\_{f}(\mathbf{A}) \cdot \mathbf{A}^{\top} \cdot \boldsymbol{\Lambda}\_{1} + \boldsymbol{\Lambda}\_{0} \cdot \boldsymbol{\Lambda}\_{1} + \boldsymbol{\Lambda}\_{1} \cdot \boldsymbol{\Lambda}\_{0} \Big]. \tag{62}$$
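A numerical sanity check of (59)-(62) (our own sketch, with a generic stand-in for the gradient ∇**A***D<sup>f</sup>*(**A**)): substituting **Λ** ≈ **Λ**<sup>0</sup> + *α***Λ**<sup>1</sup> + *α*<sup>2</sup>**Λ**<sup>2</sup> into the update cancels the constraint violation up to third order in *α*, so the residual ‖**A***<sup>k</sup>*+<sup>1</sup>(**A***<sup>k</sup>*+<sup>1</sup>)<sup>></sup> − **I***<sup>r</sup>*‖ shrinks rapidly as *α* decreases.

```python
import numpy as np

def multipliers(A, G):
    """Taylor-corrected Lagrange multipliers from Eqs. (60)-(62);
    G stands for the gradient grad_A D_f(A)."""
    L0 = -0.5 * (G @ A.T + A @ G.T)                         # Eq. (60)
    M = G + L0 @ A
    L1 = -0.5 * (M @ M.T)                                   # Eq. (61)
    L2 = -0.5 * (L1 @ A @ G.T + G @ A.T @ L1
                 + L0 @ L1 + L1 @ L0)                       # Eq. (62)
    return L0, L1, L2

def ascent_step(A, G, alpha):
    """One gradient-ascent update (59) with Lambda ~ L0 + a*L1 + a^2*L2."""
    L0, L1, L2 = multipliers(A, G)
    Lam = L0 + alpha * L1 + alpha ** 2 * L2
    return A + alpha * (G + Lam @ A)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 3)))
A = Q.T                                      # 3 x 6 semi-orthogonal start point
G = rng.standard_normal(A.shape)             # generic stand-in gradient
errs = []
for alpha in (1e-1, 1e-2):
    A1 = ascent_step(A, G, alpha)
    errs.append(np.linalg.norm(A1 @ A1.T - np.eye(3)))
    print(alpha, errs[-1])
```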

Additionally, we note that, since finding the global maximum is not guaranteed, it is imperative to initialize **A**<sup>0</sup> close to the estimated maximum. In this regard, we leverage the structure of the objective function for each *f*-divergence measure as given in Appendix D. In particular, we observe that the objective of each *f*-divergence measure can be decomposed into two parts: the first not involving *µ* (which, as shown in Section 4, makes that part a convex problem), and the second a function of *µ*. Hence, leveraging the structure of the solution from Section 4, we initialize **A**<sup>0</sup> as the maximizer of the objective in the case of zero-mean Gaussian models. We further note that, while more sophisticated orthogonality-constraint-preserving algorithms exist [45], we find that the method adopted from Reference [43] is sufficient for our purpose, as we show next through numerical simulations.
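The update scheme above can be sketched numerically. The following is a minimal illustration (our own sketch, not the authors' code), assuming NumPy: it applies the first-order corrected step with **Λ**<sub>0</sub> from (60), and then projects back onto the set of semi-orthogonal matrices via a polar retraction; the retraction is our addition, standing in for the higher-order corrections **Λ**<sub>1</sub> and **Λ**<sub>2</sub>.

```python
import numpy as np

def lambda0(grad, A):
    # Lambda_0 = -1/2 (grad A^T + A grad^T), cf. (60)
    return -0.5 * (grad @ A.T + A @ grad.T)

def constrained_step(A, grad, alpha):
    # First-order constraint-preserving ascent step as in (59), followed by
    # a polar retraction onto the manifold of semi-orthogonal r x n matrices.
    A_new = A + alpha * (grad + lambda0(grad, A) @ A)
    U, _, Vt = np.linalg.svd(A_new, full_matrices=False)
    return U @ Vt  # nearest matrix with orthonormal rows

rng = np.random.default_rng(0)
r, n = 2, 5
A, _ = np.linalg.qr(rng.standard_normal((n, r)))
A = A.T                                  # semi-orthogonal: A @ A.T = I_r
grad = rng.standard_normal((r, n))       # placeholder for grad of D_f
A_next = constrained_step(A, grad, alpha=0.1)
```

After the retraction, `A_next` satisfies the semi-orthogonality constraint exactly, not merely to first order in *α*.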

#### *6.2. Results and Discussion*

The design of **A** when *µ* ≠ **0** is not characterized analytically. Therefore, we resort to numerical simulations to show the gains of optimizing *f*-divergence measures when *µ* ≠ **0**. In particular, we consider the linear discriminant analysis (LDA) problem, where the goal is to design a mapping **A** and perform classification in the lower-dimensional space (of dimension *r*). Without loss of generality, we assume *n* = 10 and consider Gaussian densities with the following pre- and post-linear dimensionality reduction structures:

$$\begin{array}{llll}\mathbb{P}: & \mathcal{N}(\mathbf{0}, \mathbf{I}\_{n}) \quad \text{and} & \mathbb{Q}: & \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\\\mathbb{P}\_{\mathbf{A}}: & \mathcal{N}(\mathbf{0}, \mathbf{I}\_{r}) \quad \text{and} & \mathbb{Q}\_{\mathbf{A}}: & \mathcal{N}(\mathbf{A} \cdot \boldsymbol{\mu}, h\_{1}(\mathbf{A})) \end{array} \tag{63}$$

where the covariance matrix **Σ** is generated randomly, with its eigenvalues sampled from a uniform distribution, {*λ<sup>i</sup>* ∼ U(0, 1)}<sup>10</sup><sub>*i*=1</sub>. For the model in (63), we consider two kinds of performance metrics that have information-theoretic interpretations: (i) the total probability of error, related to *d*TV(**A**), and (ii) the exponential decay rate of the error probability, related to *D*KL(P**A** ‖ Q**A**). In what follows, we demonstrate that optimizing

appropriate *f*-divergence measures between P**A** and Q**A** leads to better performance than the popular Fisher's quadratic discriminant analysis (QDA) classifier [20]. In particular, Fisher's approach sets *r* = 1 and designs **A** by solving

$$\underset{\mathbf{A}\in\mathbb{R}^{1\times n}}{\arg\max}\quad\frac{(\boldsymbol{\mu}\cdot\mathbf{A}^{\top})^{2}}{\mathbf{A}\cdot(\mathbf{I}\_{n}+\boldsymbol{\Sigma})\cdot\mathbf{A}^{\top}}\,.\tag{64}$$
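The objective in (64) is a generalized Rayleigh quotient, whose maximizer is available in closed form: the optimal row vector is proportional to (**I**<sub>*n*</sub> + **Σ**)<sup>−1</sup> · *µ*. The following sketch (our own illustration, assuming NumPy) checks this fact numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
mu = rng.standard_normal(n)
M = rng.standard_normal((n, n))
Sigma = M @ M.T / n                      # some positive-definite covariance
S = np.eye(n) + Sigma                    # I_n + Sigma from (64)

def fisher(a):
    # Fisher's objective (64) for a row vector a in R^n
    return (mu @ a) ** 2 / (a @ S @ a)

a_star = np.linalg.solve(S, mu)          # closed-form maximizer direction
vals = [fisher(rng.standard_normal(n)) for _ in range(100)]  # random competitors
```

At the maximizer, the objective value equals *µ*<sup>⊤</sup>(**I**<sub>*n*</sub> + **Σ**)<sup>−1</sup>*µ*, and no random direction exceeds it.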

In contrast, we design **A** such that the information-theoretic objective functions associated with the total probability of error (captured by *d*TV(**A**)) and the exponential decay of the error probability (captured by *D*KL(P**A** ‖ Q**A**)) are optimized. The structure of these objective functions is discussed below under *Total Probability of Error* and *Type-II Error Subjected to Type-I Error Constraints*. After projecting the data into a lower dimension, both our methods and Fisher's method deploy optimal detectors to discern the true model. It is noteworthy that, in all methods, the data in the lower dimension has a Gaussian model, for which the conventional QDA classifier [20] is the optimal detector. Hence, we emphasize that our approach designs **A** to maximize the distance between the probability measures after reducing the dimensions, i.e., the distance between P**A** and Q**A**. Since this distance captures the quality of the decisions, our design of **A** outperforms Fisher's. For each comparison, we consider various values of *µ* and compare the appropriate performance metrics with those of Fisher's QDA. In all cases, the data is synthetically generated, i.e., sampled from a Gaussian distribution, with 2000 data points associated with each measure P and Q.
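The covariance model used in these simulations can be reproduced as follows. This is our own reconstruction, assuming NumPy; the paper specifies only the uniform eigenvalue law, so the choice of a random eigenbasis is an assumption.

```python
import numpy as np

# Random n x n covariance Sigma with eigenvalues drawn i.i.d. from U(0, 1).
rng = np.random.default_rng(1)
n = 10
eigvals = rng.uniform(0.0, 1.0, size=n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthonormal basis
Sigma = Q @ np.diag(eigvals) @ Q.T
evals_check = np.sort(np.linalg.eigvalsh(Sigma))   # recovered spectrum
```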

#### 6.2.1. Schemes for Linear Map

*(1) Total Probability of Error*: In this scheme, the linear map **A** is designed such that *d*TV(**A**) is optimized via gradient ascent iterations until convergence. As discussed in Section 4.4.1, since the total probability of error is the key performance metric that arises when optimizing *d*TV(**A**), optimizing *d*TV(**A**) is expected to result in a smaller total error than schemes that optimize other objective functions (e.g., Fisher's QDA). We note that, since no closed-form expression exists for the total variation distance, we instead maximize a bound on *d*TV(**A**), via the Hellinger bound in (33), as a proxy for minimizing the total probability of error. The corresponding gradient expression for optimizing H<sup>2</sup>(**A**) (used to perform iterative updates as in (59)) is derived in closed form and is given in Appendix D.

*(2) Type-II Error Subjected to Type-I Error Constraints*: In this scheme, the linear map **A** is designed such that *D*KL(P**A** ‖ Q**A**) is optimized via gradient ascent iterations until convergence. In order to relate this objective to the type-II error, consider the following binary hypothesis test:

$$\mathsf{H}\_{0}\ : X \sim \mathbb{P}\_{\mathbf{A}} \quad \text{versus} \quad \mathsf{H}\_{1}\ : X \sim \mathbb{Q}\_{\mathbf{A}}\ . \tag{65}$$

When minimizing the probability of type-II error subjected to type-I error constraints, the optimal test guarantees that the probability of type-II error decays exponentially as

$$\lim\_{s \to \infty} \frac{-\log(\mathbb{Q}\_{\mathbf{A}}(d = \mathsf{H}\_{0}))}{s} = D\_{\mathsf{KL}}(\mathbb{P}\_{\mathbf{A}} \parallel \mathbb{Q}\_{\mathbf{A}}) \,, \tag{66}$$

where we define *d* : *X* → {H0, H1} as the decision rule for the hypothesis test, and *s* denotes the sample size. As a result, *D*KL(P**A** ‖ Q**A**) appears as the error exponent for the hypothesis test in (65). Hence, optimizing *D*KL(P**A** ‖ Q**A**) is expected to result in a smaller type-II error at the same type-I error than a method that optimizes other objectives (e.g., Fisher's QDA). The corresponding gradient expression for optimizing *D*KL(P**A** ‖ Q**A**) is derived in closed form and is given in Appendix D.
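As a concrete check of the exponent in (66), the closed-form *D*KL(P**A** ‖ Q**A**) for the projected Gaussians can be compared against a Monte Carlo estimate of the expected log-likelihood ratio under P**A**. This sketch is our own illustration, assuming NumPy and assuming *h*1(**A**) = **A** · **Σ** · **A**<sup>⊤</sup> (the covariance of Q**A** induced by (63)):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 10, 2
mu = 0.4 * np.ones(n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
Sigma = Q @ np.diag(rng.uniform(0.5, 1.5, n)) @ Q.T   # well-conditioned covariance
A, _ = np.linalg.qr(rng.standard_normal((n, r)))
A = A.T                                               # semi-orthogonal r x n

h1 = A @ Sigma @ A.T                                  # assumed covariance of Q_A
h1_inv = np.linalg.inv(h1)
m = A @ mu
# Closed-form KL for P_A = N(0, I_r) versus Q_A = N(m, h1):
kl = 0.5 * (np.log(np.linalg.det(h1)) - r + np.trace(h1_inv) + m @ h1_inv @ m)

# Monte Carlo estimate of E_{P_A}[log dP_A/dQ_A] (common constants cancel):
x = rng.standard_normal((200_000, r))                 # samples from P_A
log_p = -0.5 * np.sum(x**2, axis=1)
d = x - m
log_q = -0.5 * np.einsum('ij,jk,ik->i', d, h1_inv, d) - 0.5 * np.log(np.linalg.det(h1))
kl_mc = np.mean(log_p - log_q)
```

The two estimates agree to within Monte Carlo error, consistent with *D*KL(P**A** ‖ Q**A**) being the achievable type-II error exponent.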

For the sake of comparison and reference, we also consider schemes in which **A** is designed to optimize *D*KL(**A**), the largest eigen modes (LEM), and the smallest eigen modes (SEM); the latter two carry no specific operational significance in the context of the binary classification problem. In the LEM and SEM schemes, the linear map **A** is designed such that the rows of **A** are the eigenvectors associated with the largest and smallest modes of the matrix **Σ**, respectively. Furthermore, we define **1** as the all-ones vector of appropriate dimension.

#### 6.2.2. Performance Comparison

After learning the linear map **A** for each scheme described in Section 6.2.1, we perform classification in the lower-dimensional space of dimension *r* to find the type-I, type-II, and total probability of error for each scheme. Tables 1–4 tabulate the results for various choices of the mean parameter *µ*. We have the following important observations: (i) optimizing H<sup>2</sup>(**A**) results in a smaller total probability of error than optimizing Fisher's objective; it is important to note that this superior performance is observed despite maximizing a (suboptimal) bound on *d*TV(**A**) rather than the distance itself; and (ii) except for the case of *µ* = 0.8 · **1**, optimizing *D*KL(P**A** ‖ Q**A**) results in a smaller type-II error than optimizing Fisher's objective, indicating a gain in optimizing *D*KL(P**A** ‖ Q**A**) over Fisher's objective in (64).

**Table 1.** *µ* = 0.2 · **1**, *r* = 1.


**Table 2.** *µ* = 0.4 · **1**, *r* = 1.


**Table 3.** *µ* = 0.6 · **1**, *r* = 1.


**Table 4.** *µ* = 0.8 · **1**, *r* = 1.


It is important to note that convergence of the gradient ascent algorithm only guarantees a locally optimal solution. While we have restricted the reported results to a maximum separation of *µ* = 0.8 · **1**, we have performed additional simulations for larger separations between the models (*µ* > 0.8 · **1**). We have the following observations: (i) the solution for the linear map **A** obtained through gradient ascent becomes highly sensitive to the initialization **A**<sup>0</sup>; specifically, for some random initializations, optimizing Fisher's objective outperforms optimizing H<sup>2</sup>(**A**), and vice versa for others; and (ii) the gradient ascent solver becomes more prone to getting stuck at local maxima for larger separations between the models. We conjecture that the odd observation in the case of *µ* = 0.8 · **1** when optimizing *D*KL(P**A** ‖ Q**A**) (where optimizing Fisher's objective outperforms optimizing *D*KL(P**A** ‖ Q**A**)) supports this observation. Furthermore, we note that, since the problem is convex for *µ* = **0**, a deviation from this assumption moves the problem further from being convex, making the solver prone to getting stuck at locally optimal solutions for larger separations between the Gaussian models.

#### 6.2.3. Subspace Representation

In order to gain more intuition about the learned representations, we illustrate the two-dimensional projections of the original 10-dimensional data obtained after optimizing the corresponding *f*-divergence measures. For brevity, we only show the plots for *D*KL(P**A** ‖ Q**A**) and H<sup>2</sup>(**A**). Figures 6 and 7 plot the two-dimensional projections of the synthetic dataset that optimize *D*KL(P**A** ‖ Q**A**) and H<sup>2</sup>(**A**), respectively. As expected, the total probability of error is smaller when optimizing H<sup>2</sup>(**A**). Figure 8 shows the variation in the objective function as a function of the gradient ascent iterations. As the iterations grow, the objective functions eventually converge to a locally optimal solution.

**Figure 6.** Two-dimensional projected data obtained by optimizing *D*KL(P**A** ‖ Q**A**).

**Figure 7.** Two-dimensional projected data obtained by optimizing H<sup>2</sup>(**A**).

**Figure 8.** Convergence of the gradient ascent algorithm when optimizing H<sup>2</sup>(**A**).

#### **7. Conclusions**

In this paper, we have considered the problem of discriminant analysis such that the separation between the classes is maximized under *f*-divergence measures. This approach is motivated by dimensionality reduction for inference problems, where we have investigated discriminant analysis under the Kullback–Leibler, symmetrized Kullback–Leibler, Hellinger, *χ*<sup>2</sup>, and total variation measures. We have characterized the optimal design for the linear transformation of the data onto a lower-dimensional subspace for each measure in the case of zero-mean Gaussian models, and adopted numerical algorithms to find the design of the linear transformation in the case of general Gaussian models with non-zero means. We have shown that, in the case of zero-mean Gaussian models, the row space of the mapping matrix lies in the eigenspace of a matrix associated with the covariance matrix of the Gaussian models involved. While each *f*-divergence measure favors specific eigenvector components, we have shown that all the designs become identical in certain regimes, making the design of the linear mapping independent of the inference problem of interest.

**Author Contributions:** A.D., S.W. and A.T. contributed equally. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported in part by the U. S. National Science Foundation under grants CAREER Award ECCS-1554482 and ECCS-1933107, and RPI-IBM Artificial Intelligence Research Collaboration (AIRC).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:



#### **Appendix A. Proof of Theorem 1**

Consider two pairs of probability measures (P**A**, Q**A**) and (P**A**¯ , Q**A**¯ ) associated with the mapping **<sup>A</sup>** in space <sup>X</sup> and **<sup>A</sup>**¯ in space <sup>Y</sup>, respectively. Let *<sup>g</sup>* : X → Y denote any invertible transformation. Under the invertible map, we have

$$\mathrm{d}\mathbb{Q}\_{\bar{\mathbf{A}}} = \mathrm{d}\mathbb{Q}\_{\mathbf{A}} \cdot |\mathcal{T}|^{-1}, \quad \text{and} \quad \mathrm{d}\mathbb{P}\_{\bar{\mathbf{A}}} = \mathrm{d}\mathbb{P}\_{\mathbf{A}} \cdot |\mathcal{T}|^{-1}, \tag{A1}$$

where |T | denotes the determinant of the Jacobian matrix associated with *g*. Leveraging (A1), the *f*-divergence measure *D<sup>f</sup>* (**A**¯ ) simplifies as follows.

$$D\_f(\bar{\mathbf{A}}) \stackrel{\triangle}{=} \mathbb{E}\_{\mathbb{P}\_{\bar{\mathbf{A}}}} \Big[ f\Big(\frac{\mathrm{d}\mathbb{Q}\_{\bar{\mathbf{A}}}}{\mathrm{d}\mathbb{P}\_{\bar{\mathbf{A}}}}\Big) \Big] \tag{A2}$$

$$=\int\_{\mathcal{Y}} f\left(\frac{\mathrm{d}\mathbb{Q}\_{\bar{\mathbf{A}}}}{\mathrm{d}\mathbb{P}\_{\bar{\mathbf{A}}}}\right) \mathrm{d}\mathbb{P}\_{\bar{\mathbf{A}}}(y) \tag{A3}$$

$$=\int\_{\mathcal{X}} |\mathcal{T}(\boldsymbol{x})|^{-1} \cdot f\left(\frac{\mathrm{d}\mathbb{Q}\_{\mathbf{A}} \cdot |\mathcal{T}(\boldsymbol{x})|^{-1}}{\mathrm{d}\mathbb{P}\_{\mathbf{A}} \cdot |\mathcal{T}(\boldsymbol{x})|^{-1}}\right) \cdot |\mathcal{T}(\boldsymbol{x})| \, \mathrm{d}\mathbb{P}\_{\mathbf{A}}(\boldsymbol{x}) \tag{A4}$$

$$=\int\_{\mathcal{X}} f\left(\frac{\mathrm{d}\mathbb{Q}\_{\mathbf{A}}}{\mathrm{d}\mathbb{P}\_{\mathbf{A}}}\right) \mathrm{d}\mathbb{P}\_{\mathbf{A}}(\boldsymbol{x})\tag{A5}$$

$$=D\_f(\mathbf{A})\,. \tag{A6}$$

Therefore, *f*-divergence measures are invariant under invertible transformations (both linear and non-linear), ensuring the existence of **A**¯ for every **A** in the special case of linear transformations.
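This invariance is easy to confirm numerically. The following sketch (our own illustration, assuming NumPy) checks that the Gaussian Kullback–Leibler divergence, one member of the *f*-divergence family, is unchanged under a random invertible linear map:

```python
import numpy as np

rng = np.random.default_rng(4)
r = 3

def kl_gauss(m0, S0, m1, S1):
    # KL(N(m0, S0) || N(m1, S1)) for r-dimensional Gaussians
    S1_inv = np.linalg.inv(S1)
    d = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - r
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

m0, m1 = np.zeros(r), rng.standard_normal(r)
B, C = rng.standard_normal((r, r)), rng.standard_normal((r, r))
S0, S1 = B @ B.T + np.eye(r), C @ C.T + np.eye(r)

T = rng.standard_normal((r, r)) + 3 * np.eye(r)   # invertible map g(x) = T x
kl_before = kl_gauss(m0, S0, m1, S1)
kl_after = kl_gauss(T @ m0, T @ S0 @ T.T, T @ m1, T @ S1 @ T.T)
```

The Jacobian determinant |T| cancels between the density ratio and the reference measure, exactly as in (A2)–(A6).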

#### **Appendix B. Proof of Theorem 3**

We observe that *D*KL(**A**), *D*SKL(**A**), and the objective optimized through the matching bound (Section 4.4.2, Matching Bounds up to a Constant on *d*TV(**A**)) can each be decomposed as a summation of strictly convex functions involving *g*KL(*x*), *g*SKL(*x*), and *g*TV(*x*), respectively. Since a summation of strictly convex functions is strictly convex, we conclude that each objective *D<sup>f</sup>* ∈ {*D*KL(**A**), *D*SKL(**A**), *d*TV(**A**)} is strictly convex.

Next, the goal is to choose {*γ<sup>i</sup>*}<sup>*r*</sup><sub>*i*=1</sub> such that *D<sup>f</sup>* ∈ {*D*KL(**A**), *D*SKL(**A**), *d*TV(**A**)} is maximized subject to the spectral constraints *λ*<sub>*n*−(*r*−*i*)</sub> ≤ *γ<sup>i</sup>* ≤ *λ<sup>i</sup>*. In order to choose appropriate *γ<sup>i</sup>*'s, we first note that the global minimizer of each function *g<sup>f</sup>* ∈ {*g*KL, *g*SKL, *g*TV} is attained at *x* = 1. Since each *g<sup>f</sup>* is strictly convex, it can be readily verified that *g<sup>f</sup>*(*x*) is monotonically increasing for *x* > 1 and monotonically decreasing for *x* < 1. This guides the selection of {*γ<sup>i</sup>*}<sup>*r*</sup><sub>*i*=1</sub>, as explained next.

In the case of *λ<sup>n</sup>* ≥ 1, i.e., when all the eigenvalues are larger than or equal to 1, maximizing each *D<sup>f</sup>* ∈ {*D*KL(**A**), *D*SKL(**A**), *d*TV(**A**)} boils down to maximizing a monotonically increasing function over the feasible domain. This is trivially done by choosing *γ<sup>i</sup>* = *λ<sup>i</sup>* for *i* ∈ [*r*], proving Corollary 1. On the other hand, when *λ*<sup>1</sup> ≤ 1, i.e., when all the eigenvalues are smaller than or equal to 1, the same line of argument shows that the objective boils down to maximizing a monotonically decreasing function over the feasible domain. This is trivially done by choosing *γ<sup>i</sup>* = *λ*<sub>*n*−*r*+*i*</sub> for *i* ∈ [*r*].

When *λ<sup>n</sup>* ≤ 1 ≤ *λ*<sup>1</sup>, the selection is not trivial. Rather, an iterative algorithm can be followed: we start from the eigenvalues farthest away from 1 on either side and, in every iteration, choose the one that achieves the higher objective. This procedure is repeated recursively until *r* eigenvalues are chosen; it is summarized in Algorithm 1 in Section 4.6.
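The recursive selection for the case *λ<sup>n</sup>* ≤ 1 ≤ *λ*<sup>1</sup> can be sketched as follows. This is our own illustration; we use g(x) = x − log x − 1 as a stand-in strictly convex function with its minimum at x = 1, in the spirit of *g*KL (the paper's exact *g*-functions are defined in Section 4).

```python
import numpy as np

def greedy_select(eigvals, r, g):
    # Starting from the eigenvalues farthest from 1 on either side, keep
    # whichever yields the larger objective value g, until r are chosen.
    vals = sorted(eigvals, reverse=True)   # lambda_1 >= ... >= lambda_n
    lo, hi = len(vals) - 1, 0
    chosen = []
    for _ in range(r):
        if g(vals[hi]) >= g(vals[lo]):
            chosen.append(vals[hi]); hi += 1   # take from the large end
        else:
            chosen.append(vals[lo]); lo -= 1   # take from the small end
    return chosen

g_kl = lambda x: x - np.log(x) - 1.0           # stand-in convex g, min at x = 1
picked = greedy_select([2.0, 1.5, 0.9, 0.2, 0.1], r=2, g=g_kl)
```

For this example the eigenvalues far below 1 dominate, since g grows faster toward 0 than toward large x.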

Finally, constructing the optimal matrix **A** that maximizes *D<sup>f</sup>* for any covariance matrix **Σ** becomes equivalent to choosing, as the rows of **A**, the eigenvectors associated with the chosen eigenvalues in each of the aforementioned cases.

#### **Appendix C. Proof for Theorem 4**

We first find closed-form expressions for *χ*<sup>2</sup>(**A**) and *χ*<sup>2</sup>(P**A** ‖ Q**A**). From the definition, we have

$$\chi^2(\mathbf{A}) \stackrel{\triangle}{=} \frac{|\mathbf{I}\_r|^{\frac{1}{2}}}{(2\pi)^{\frac{r}{2}} \cdot |h\_1(\mathbf{A})|} \cdot \int\_{\mathbb{R}^r} \exp\left[-\frac{1}{2} \cdot \left(Y^\top \cdot \mathbf{K}\_1 \cdot Y\right)\right] \, \mathrm{d}Y - 1\,,\tag{A7}$$

where we define **K**<sub>1</sub> ≜ 2 · *h*1(**A**)<sup>−1</sup> − **I**<sub>*r*</sub>. We note that **K**<sub>1</sub> is a real symmetric matrix since *h*1(**A**) is a real symmetric matrix. We denote the eigen-decomposition of **K**<sub>1</sub> by **K**<sub>1</sub> = **U** · **Θ** · **U**<sup>⊤</sup>, where **Θ** is a diagonal matrix with the eigenvalues {*θ<sup>i</sup>*}<sup>*r*</sup><sub>*i*=1</sub> as its diagonal elements. Based on this decomposition, we have

$$\chi^2(\mathbf{A}) = \frac{1}{(2\pi)^{\frac{r}{2}} \cdot |h\_1(\mathbf{A})|} \cdot \int\_{\mathbb{R}^r} \exp\left[ -\frac{1}{2} \left( Y^\top \cdot \mathbf{U} \Theta \mathbf{U}^\top \cdot Y \right) \right] \, \mathrm{d}Y - 1 \tag{A8}$$

$$=\frac{1}{(2\pi)^{\frac{r}{2}} \cdot |h\_1(\mathbf{A})|} \cdot \int\_{\mathbb{R}^{r}} \exp\left[-\frac{1}{2} \left(W^{\top} \cdot \Theta \cdot W\right)\right] \,\mathrm{d}W - 1\tag{A9}$$

$$=\frac{1}{(2\pi)^{\frac{r}{2}}\cdot|h\_1(\mathbf{A})|}\cdot\prod\_{i=1}^{r}\int\_{-\infty}^{\infty}\exp\left[-\frac{1}{2}\left(\theta\_i\cdot w\_i^2\right)\right]\,\mathrm{d}w\_i-1\,,\tag{A10}$$

where we have defined *W* ≜ **U**<sup>⊤</sup> · *Y*. We note that, in order for *χ*<sup>2</sup>(**A**) to be finite, the eigenvalues {*θ<sup>i</sup>*}<sup>*r*</sup><sub>*i*=1</sub> must be strictly positive. Hence, based on the definition of **K**<sub>1</sub>, all the eigenvalues *λ<sup>i</sup>* should fall in the open interval (0, 2). We thus obtain:

$$\chi^2(\mathbf{A}) = \frac{1}{(2\pi)^{\frac{r}{2}} \cdot |h\_1(\mathbf{A})|} \cdot \prod\_{i=1}^r \int\_{-\infty}^\infty \exp\left[-\frac{1}{2} \left(\theta\_i \cdot w\_i^2\right)\right] \mathrm{d}w\_i - 1 \tag{A11}$$

$$=\frac{1}{(2\pi)^{\frac{r}{2}}\cdot|h\_1(\mathbf{A})|}\cdot\prod\_{i=1}^{r}\sqrt{\frac{2\pi}{\theta\_i}}-1\tag{A12}$$

$$=\frac{1}{|h\_1(\mathbf{A})|} \cdot \sqrt{\frac{1}{|\mathbf{K}\_1|}} - 1 \,. \tag{A13}$$

Recall that the eigenvalues of *h*1(**A**) are given by {*γ<sup>i</sup>*}<sup>*r*</sup><sub>*i*=1</sub> in descending order. Therefore, (A13) simplifies to:

$$\chi^2(\mathbf{A}) = \prod\_{i=1}^r \sqrt{\frac{1}{\gamma\_i \cdot (2 - \gamma\_i)}} - 1 = \prod\_{i=1}^r g\_{\chi\_1}(\gamma\_i) - 1 \,. \tag{A14}$$

Hence, from (A14), maximizing *χ*<sup>2</sup>(**A**) is equivalent to choosing the eigenvalues {*γ<sup>i</sup>*}<sup>*r*</sup><sub>*i*=1</sub> so as to maximize *g*<sub>*χ*1</sub>(*x*). Similarly, the closed-form expression for *χ*<sup>2</sup>(P**A** ‖ Q**A**) is derived as follows:

$$\chi^2(\mathbb{P}\_\mathbf{A} \parallel \mathbb{Q}\_\mathbf{A}) = \frac{|h\_1(\mathbf{A})|^{\frac{1}{2}}}{(2\pi)^{\frac{r}{2}} \cdot |\mathbf{I}\_r|} \cdot \int\_{\mathbb{R}^r} \exp\left[ -\frac{1}{2} \cdot \left( Y^\top \cdot \mathbf{K}\_2 \cdot Y \right) \right] \, \mathrm{d}Y - 1 \,\tag{A15}$$

where we define **K**<sub>2</sub> ≜ 2 · **I**<sub>*r*</sub> − *h*1(**A**)<sup>−1</sup>. We note that **K**<sub>2</sub> is a real symmetric matrix since *h*1(**A**) is a real symmetric matrix. Hence, following a similar line of argument as for *χ*<sup>2</sup>(**A**), and as a consequence of Theorem 2, we conclude that all the eigenvalues *λ<sup>i</sup>* should fall in the interval (0.5, ∞) to ensure a finite value of *χ*<sup>2</sup>(P**A** ‖ Q**A**). Under this requirement, since the integrals converge, we obtain the following closed-form expression:

$$\chi^2(\mathbb{P}\_\mathbf{A} \parallel \mathbb{Q}\_\mathbf{A}) = |h\_1(\mathbf{A})|^{\frac{1}{2}} \cdot \sqrt{\frac{1}{|\mathbf{K}\_2|}} - 1 \,. \tag{A16}$$

Recall that the eigenvalues of *h*1(**A**) are given by {*γ<sup>i</sup>*}<sup>*r*</sup><sub>*i*=1</sub>; then, (A16) simplifies to

$$\chi^2(\mathbb{P}\_\mathbf{A} \parallel \mathbb{Q}\_\mathbf{A}) = \prod\_{i=1}^r \sqrt{\frac{\gamma\_i^2}{2\gamma\_i - 1}} - 1 = \prod\_{i=1}^r g\_{\chi\_2}(\gamma\_i) - 1 \,. \tag{A17}$$

Hence, from (A17), maximizing *χ*<sup>2</sup>(P**A** ‖ Q**A**) is equivalent to choosing the eigenvalues {*γ<sup>i</sup>*}<sup>*r*</sup><sub>*i*=1</sub> so as to maximize *g*<sub>*χ*2</sub>(*x*).
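The closed form in (A14) can also be verified numerically: for *γ<sup>i</sup>* ∈ (0, 2), the quantity *χ*<sup>2</sup>(**A**) + 1 factorizes into one-dimensional integrals of *q*²/*p*. A quadrature check (our own illustration, assuming NumPy):

```python
import numpy as np

# Verify (A14): chi^2 = prod_i 1/sqrt(gamma_i * (2 - gamma_i)) - 1 for the
# zero-mean Gaussians N(0, I_r) and N(0, diag(gamma)), with gamma_i in (0, 2).
gammas = np.array([0.7, 1.3])
closed_form = np.prod(1.0 / np.sqrt(gammas * (2.0 - gammas))) - 1.0

x = np.linspace(-12.0, 12.0, 200_001)
dx = x[1] - x[0]
prod = 1.0
for g in gammas:
    p = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)            # N(0, 1) density
    q = np.exp(-x**2 / (2.0 * g)) / np.sqrt(2.0 * np.pi * g)  # N(0, g) density
    f = q**2 / p                                              # chi^2 integrand
    prod *= (f.sum() - 0.5 * (f[0] + f[-1])) * dx             # trapezoid rule
chi2_numeric = prod - 1.0
```

Each one-dimensional factor integrates to 1/√(γ(2 − γ)), matching (A14) to quadrature accuracy.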

We observe that H<sup>2</sup>(**A**), *χ*<sup>2</sup>(**A**), and *χ*<sup>2</sup>(P**A** ‖ Q**A**) can each be decomposed as the product of *r* non-negative identical convex functions involving *g*H(*x*), *g*<sub>*χ*1</sub>(*x*), and *g*<sub>*χ*2</sub>(*x*), respectively. Hence, the goal is to choose {*γ<sup>i</sup>*}<sup>*r*</sup><sub>*i*=1</sub> such that *D<sup>f</sup>* ∈ {H<sup>2</sup>(**A**), *χ*<sup>2</sup>(**A**), *χ*<sup>2</sup>(P**A** ‖ Q**A**)} is maximized subject to the spectral constraints *λ*<sub>*n*−(*r*−*i*)</sub> ≤ *γ<sup>i</sup>* ≤ *λ<sup>i</sup>*. In order to choose appropriate *γ<sup>i</sup>*'s, we first note that the global minimizer of each *g<sup>f</sup>* ∈ {*g*H, *g*<sub>*χ*1</sub>, *g*<sub>*χ*2</sub>} is attained at *x* = 1. Leveraging this observation, together with the convexity of each *g<sup>f</sup>*, it is easy to infer that each *g<sup>f</sup>*(*x*) is monotonically increasing for *x* > 1 and monotonically decreasing for *x* < 1. By the exact same argument as in Appendix B, we obtain Corollaries 3 and 4.

Therefore, similarly to Appendix B, constructing the linear map **A** that maximizes *D<sup>f</sup>* ∈ {H<sup>2</sup>(**A**), *χ*<sup>2</sup>(**A**), *χ*<sup>2</sup>(P**A** ‖ Q**A**)} for any covariance matrix **Σ** boils down to choosing, as the rows of **A**, the eigenvectors associated with the chosen eigenvalues in each of the aforementioned cases.

#### **Appendix D. Gradient Expressions for** *f***-Divergence Measures**

For clarity in analysis, we define the following functions:

$$h\_2(\mathbf{A}) \stackrel{\triangle}{=} \boldsymbol{\mu}^{\top} \cdot \mathbf{A}^{\top} \cdot \mathbf{A} \cdot \boldsymbol{\mu} \,, \tag{A18}$$

$$h\_3(\mathbf{A}) \stackrel{\triangle}{=} \boldsymbol{\mu}^\top \cdot \mathbf{A}^\top \cdot [h\_1(\mathbf{A})]^{-1} \cdot \mathbf{A} \cdot \boldsymbol{\mu} \,. \tag{A19}$$

Based on these definitions, we have the following representations for the divergence measures and their associated gradients:

$$D\_{\mathsf{KL}}(\mathbf{A}) = \frac{1}{2} \left[ \log \frac{1}{|h\_1(\mathbf{A})|} - r + \text{Tr}[h\_1(\mathbf{A})] + h\_2(\mathbf{A}) \right], \tag{A20}$$

$$\begin{split} \nabla\_{\mathbf{A}} D\_{\mathsf{KL}}(\mathbf{A}) &= [h\_1(\mathbf{A})]^{-1} \cdot \left[ \mathbf{I}\_r - [h\_1(\mathbf{A})]^{-1} - \mathbf{A} \cdot \boldsymbol{\mu} \cdot \boldsymbol{\mu}^\top \cdot \mathbf{A}^\top \cdot [h\_1(\mathbf{A})]^{-1} \right] \cdot \mathbf{A} \cdot \boldsymbol{\Sigma} \\ &\quad + [h\_1(\mathbf{A})]^{-1} \cdot \mathbf{A} \cdot \boldsymbol{\mu} \cdot \boldsymbol{\mu}^\top \, . \end{split} \tag{A21}$$

$$\begin{split} D\_{\mathsf{KL}}(\mathbb{P}\_{\mathbf{A}} \parallel \mathbb{Q}\_{\mathbf{A}}) &= \frac{1}{2} \Big[ \log \left| h\_{1}(\mathbf{A}) \right| - r + \mathrm{Tr} \Big[ h\_{1}(\mathbf{A})^{-1} \Big] + h\_{3}(\mathbf{A}) \Big], \\ \nabla\_{\mathbf{A}} D\_{\mathsf{KL}}(\mathbb{P}\_{\mathbf{A}} \parallel \mathbb{Q}\_{\mathbf{A}}) &= \left( \mathbf{I}\_{r} - \left[ h\_{1}(\mathbf{A}) \right]^{-1} \right) \cdot \mathbf{A} \cdot \boldsymbol{\Sigma} + \mathbf{A} \cdot \boldsymbol{\mu} \cdot \boldsymbol{\mu}^{\top} \,. \end{split} \tag{A22}$$

$$\begin{split} D\_{\mathsf{SKL}}(\mathbf{A}) &= \frac{1}{2} \cdot \Big[ \mathrm{Tr}\Big[ [h\_1(\mathbf{A})]^{-1} + h\_1(\mathbf{A}) \Big] + h\_2(\mathbf{A}) + h\_3(\mathbf{A}) \Big] - r \,, \\ \nabla\_{\mathbf{A}} D\_{\mathsf{SKL}}(\mathbf{A}) &= \Big[ \mathbf{I}\_{r} - [h\_1(\mathbf{A})]^{-2} - [h\_1(\mathbf{A})]^{-1} \cdot \mathbf{A} \cdot \boldsymbol{\mu} \cdot \boldsymbol{\mu}^{\top} \cdot \mathbf{A}^{\top} \cdot [h\_1(\mathbf{A})]^{-1} \Big] \cdot \mathbf{A} \cdot \boldsymbol{\Sigma} \\ &\quad + \Big( \mathbf{I}\_{r} + [h\_1(\mathbf{A})]^{-1} \Big) \cdot \mathbf{A} \cdot \boldsymbol{\mu} \cdot \boldsymbol{\mu}^{\top} \,. \end{split} \tag{A23}$$

$$\begin{split} \mathsf{H}^{2}(\mathbf{A}) &= 2 - 2 \cdot \frac{2^{\frac{r}{2}} \cdot |h\_{1}(\mathbf{A})|^{\frac{1}{4}}}{|h\_{1}(\mathbf{A}) + \mathbf{I}\_{r}|^{\frac{1}{2}}} \cdot \exp\left(-\frac{\boldsymbol{\mu}^{\top} \cdot \mathbf{A}^{\top} \cdot [h\_{1}(\mathbf{A}) + \mathbf{I}\_{r}]^{-1} \cdot \mathbf{A} \cdot \boldsymbol{\mu}}{4}\right), \\ \frac{\nabla\_{\mathbf{A}} \mathsf{H}^{2}(\mathbf{A})}{-\big[2 - \mathsf{H}^{2}(\mathbf{A})\big]} &= \frac{1}{2} \cdot [h\_{1}(\mathbf{A})]^{-1} \cdot \mathbf{A} \cdot \boldsymbol{\Sigma} + [h\_{1}(\mathbf{A}) + \mathbf{I}\_{r}]^{-1} \cdot \Big[ -\mathbf{A} \cdot [\boldsymbol{\Sigma} + \mathbf{I}\_{n}] - \frac{1}{2} \cdot \mathbf{A} \cdot \boldsymbol{\mu} \cdot \boldsymbol{\mu}^{\top} \\ &\quad + \frac{1}{2} \cdot \mathbf{A} \cdot \boldsymbol{\mu} \cdot \boldsymbol{\mu}^{\top} \cdot \mathbf{A}^{\top} \cdot [h\_{1}(\mathbf{A}) + \mathbf{I}\_{r}]^{-1} \cdot \mathbf{A} \cdot [\boldsymbol{\Sigma} + \mathbf{I}\_{n}] \Big]. \end{split} \tag{A24}$$

#### **Appendix E. Proof for Lagrange Multipliers**

Denoting ∇<sub>**A**</sub>L by **∆**˜ and ∇<sub>**A**</sub>*D<sup>f</sup>* by **∆**, and post-multiplying (59) by (**A**<sup>*k*+1</sup>)<sup>⊤</sup>, we have:

$$\mathbf{A}^{k+1} \cdot (\mathbf{A}^{k+1})^\top = \mathbf{A}^k \cdot (\mathbf{A}^{k+1})^\top + \alpha \cdot \tilde{\boldsymbol{\Delta}} \cdot (\mathbf{A}^{k+1})^\top,\tag{A25}$$

$$\mathbf{I}\_r = \mathbf{A}^k \cdot \left(\mathbf{A}^k + \alpha \cdot \tilde{\boldsymbol{\Delta}}\right)^\top + \alpha \cdot \tilde{\boldsymbol{\Delta}} \cdot \left(\mathbf{A}^k + \alpha \cdot \tilde{\boldsymbol{\Delta}}\right)^\top,\tag{A26}$$

$$\mathbf{0} = \mathbf{A}^k \cdot \tilde{\boldsymbol{\Delta}}^{\top} + \tilde{\boldsymbol{\Delta}} \cdot (\mathbf{A}^k)^{\top} + \alpha \cdot \tilde{\boldsymbol{\Delta}} \cdot \tilde{\boldsymbol{\Delta}}^{\top} \,. \tag{A27}$$

Substituting **∆**˜ = **∆** + **Λ** · **A**<sup>*k*</sup> in (A27) and simplifying the expression, we obtain:

$$2 \cdot \boldsymbol{\Lambda} + \mathbf{A}^k \cdot \boldsymbol{\Delta}^\top + \boldsymbol{\Delta} \cdot (\mathbf{A}^k)^\top = -\alpha \cdot \left(\boldsymbol{\Delta} \cdot \boldsymbol{\Delta}^\top + \boldsymbol{\Delta} \cdot (\mathbf{A}^k)^\top \cdot \boldsymbol{\Lambda} + \boldsymbol{\Lambda} \cdot \mathbf{A}^k \cdot \boldsymbol{\Delta}^\top + \boldsymbol{\Lambda} \cdot \boldsymbol{\Lambda}^\top\right). \tag{A28}$$

By noting that **Λ** is symmetric, taking the Taylor series expansion of **Λ** around *α* = 0, and equating like powers of *α* on both sides, we obtain the relationships in (60)–(62).
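As a sanity check on the zeroth-order term, one can verify numerically that the corrected direction **∆** + **Λ**<sub>0</sub> · **A** cancels the O(*α*) violation of semi-orthogonality, leaving an O(*α*²) residual (our own illustration, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(5)
r, n = 2, 6
A, _ = np.linalg.qr(rng.standard_normal((n, r)))
A = A.T                                  # semi-orthogonal: A @ A.T = I_r
grad = rng.standard_normal((r, n))       # placeholder gradient Delta
L0 = -0.5 * (grad @ A.T + A @ grad.T)    # Lambda_0 from (60)

def err(alpha):
    # Constraint violation of A + alpha * (Delta + Lambda_0 A)
    A_new = A + alpha * (grad + L0 @ A)
    return np.linalg.norm(A_new @ A_new.T - np.eye(r))

e1, e2 = err(1e-2), err(1e-3)            # residuals at two step sizes
```

Shrinking *α* by a factor of 10 shrinks the residual by roughly 100, confirming that only the quadratic term *α* · **∆**˜ · **∆**˜<sup>⊤</sup> in (A27) survives.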

#### **References**


MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Entropy* Editorial Office E-mail: entropy@mdpi.com www.mdpi.com/journal/entropy
