The Cauchy Distribution in Information Theory
Abstract
1. Introduction
2. The Cauchy Distribution and Generalizations
- A real-valued random variable V is said to be standard Cauchy if its probability density function is $f_V(v) = \frac{1}{\pi (1 + v^2)}$, $v \in \mathbb{R}$ (1). Furthermore, X is said to be Cauchy if there exist $\mu \in \mathbb{R}$ and $\lambda > 0$ such that $X = \lambda V + \mu$, in which case $f_X(x) = \frac{\lambda}{\pi \left(\lambda^2 + (x - \mu)^2\right)}$ (2), where $\mu$ and $\lambda$ are referred to as the location (or median) and scale, respectively, of the Cauchy distribution. If $\mu = 0$, (2) is said to be centered Cauchy.
- Since $\mathbb{E}[|V|] = \infty$, the mean of a Cauchy random variable does not exist. Furthermore, $\mathbb{E}[|V|^p] = \infty$ for $p \geq 1$, and the moment generating function of V does not exist (except, trivially, at 0). The characteristic function of the standard Cauchy random variable is $\mathbb{E}\left[e^{\mathrm{i} t V}\right] = e^{-|t|}$ (3).
- Using (3), we can verify that a Cauchy random variable has the curious property that adding an independent copy to it has the same effect, statistically speaking, as adding an identical copy. In addition to the Gaussian and Lévy distributions, the Cauchy distribution is stable: a linear combination of independent copies remains in the family, and is infinitely divisible: it can be expressed as an n-fold convolution for any n. It follows from (3) that if $V_1, V_2, \ldots$ are independent, standard Cauchy, and $\{a_i\}$ is a deterministic sequence with finite $\ell_1$-norm $\|a\|_1 = \sum_i |a_i|$, then $\sum_i a_i V_i$ has the same distribution as $\|a\|_1 V_1$. In particular, the time average of independent identically distributed Cauchy random variables has the same distribution as any of the random variables. The families and , with any interval of the real line, are some of the simplest parametrized random variables that are not an exponential family.
- If $\Theta$ is uniformly distributed on $\left[-\frac{\pi}{2}, \frac{\pi}{2}\right]$, then $\tan \Theta$ is standard Cauchy. This follows since, in view of (1) and (A1), the standard Cauchy cumulative distribution function is $\mathsf{F}_V(v) = \frac{1}{2} + \frac{1}{\pi} \arctan v$ (4). Therefore, V has unit semi-interquartile length. The functional inverse of (4) is the standard Cauchy quantile function given by $\mathsf{Q}_V(t) = \tan\left(\pi \left(t - \tfrac{1}{2}\right)\right)$, $t \in (0,1)$ (5).
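A quick numerical illustration of the two facts above (a minimal NumPy sketch; the sample size and checkpoints are arbitrary): the quantile function maps uniform samples to standard Cauchy samples, the empirical CDF tracks (4), and the semi-interquartile length is 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse-CDF (quantile) sampling: Q(u) = tan(pi * (u - 1/2)), see (5)
u = rng.uniform(size=1_000_000)
v = np.tan(np.pi * (u - 0.5))          # standard Cauchy samples

# Standard Cauchy CDF (4): F(v) = 1/2 + arctan(v) / pi
grid = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
empirical = [(v <= t).mean() for t in grid]
exact = 0.5 + np.arctan(grid) / np.pi
print(np.round(empirical, 4))
print(np.round(exact, 4))

# Semi-interquartile length: (Q(3/4) - Q(1/4)) / 2 = 1
print((np.tan(np.pi * 0.25) - np.tan(-np.pi * 0.25)) / 2)   # 1.0
```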
- If $X_1$ and $X_2$ are standard Gaussian with correlation coefficient $\rho$, then $X_1 / X_2$ is Cauchy with scale $\sqrt{1 - \rho^2}$ and location $\rho$. This implies that the reciprocal of a standard Cauchy random variable is also standard Cauchy.
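For the uncorrelated case ($\rho = 0$), the following sketch checks both claims empirically with SciPy's Kolmogorov–Smirnov statistic; the sample sizes and seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200_000

# Ratio of independent standard normals (rho = 0 case)
x, y = rng.standard_normal(n), rng.standard_normal(n)
ratio = x / y

# Reciprocal of a standard Cauchy sample
v = rng.standard_cauchy(n)
recip = 1.0 / v

# Both KS statistics against the standard Cauchy CDF should be small (~1e-3)
print(stats.kstest(ratio, stats.cauchy.cdf).statistic)
print(stats.kstest(recip, stats.cauchy.cdf).statistic)
```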
- Taking the cue from the Gaussian case, we say that a random vector is multivariate Cauchy if any linear combination of its components has a Cauchy distribution. Necessary and sufficient conditions for a characteristic function to be that of a multivariate Cauchy were shown by Ferguson [5]. Unfortunately, no general expression is known for the corresponding probability density function. This accounts for the fact that one aspect, in which the Cauchy distribution does not quite reach the wealth of information theoretic results attainable with the Gaussian distribution, is in the study of multivariate models of dependent random variables. Nevertheless, special cases of the multivariate Cauchy distribution do admit some interesting information theoretic results as we will see below. The standard spherical multivariate Cauchy probability density function on $\mathbb{R}^n$ is (e.g., [6]) $f_{X^n}(x^n) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\pi^{\frac{n+1}{2}}} \left(1 + \|x^n\|^2\right)^{-\frac{n+1}{2}}$ (6), where $\Gamma(\cdot)$ is the Gamma function. Therefore, $X_1, \ldots, X_n$ are exchangeable random variables. If $W, Z_1, \ldots, Z_n$ are independent standard normal, then the vector $\left(\frac{Z_1}{|W|}, \ldots, \frac{Z_n}{|W|}\right)$ has the density in (6). With the aid of (A10), we can verify that any subset of k components of $X^n$ is distributed according to the k-dimensional version of (6). In particular, the marginals of (6) are given by (1). Generalizing (3), the characteristic function of (6) is $\mathbb{E}\left[e^{\mathrm{i}\, \langle t^n, X^n\rangle}\right] = e^{-\|t^n\|}$ (7).
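A sketch of the construction just mentioned (dividing independent standard normals by the magnitude of another independent standard normal), checking that the marginals are standard Cauchy and that a linear combination is Cauchy with scale equal to the Euclidean norm of the coefficients; the sample size and coefficient vector are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, dim = 200_000, 3

# X_i = Z_i / |W| with Z_1, ..., Z_dim, W i.i.d. standard normal
z = rng.standard_normal((n, dim))
w = np.abs(rng.standard_normal((n, 1)))
x = z / w

# Each marginal should be standard Cauchy (KS statistic close to 0)
for i in range(dim):
    print(stats.kstest(x[:, i], stats.cauchy.cdf).statistic)

# Any linear combination is Cauchy; for the spherical model the scale is ||a||_2
a = np.array([0.5, -1.0, 2.0])
comb = x @ a
scale = np.linalg.norm(a)
print(stats.kstest(comb, stats.cauchy(scale=scale).cdf).statistic)
```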
- In parallel to Item 1, we may generalize (6) by dropping the restriction that it be centered at the origin and allowing ellipsoidal deformation, i.e., letting $Y^n = \mu^n + \mathsf{A} X^n$ with location $\mu^n \in \mathbb{R}^n$ and a positive definite matrix $\mathsf{A}$, which yields the density in (8). While any linear combination of the components of $Y^n$ is a Cauchy random variable, (8) fails to encompass every multivariate Cauchy distribution, in particular, the important case of independent Cauchy random variables. Another reason the usefulness of the model in (8) is limited is that it is not closed under independent additions: if $Y_1^n$ and $Y_2^n$ are independent, each distributed according to the model in (8), then their sum, while multivariate Cauchy, does not have a density of the type in (8) unless the respective deformation matrices are proportional.
- Another generalization of the (univariate) Cauchy distribution, which comes into play in our analysis, was introduced by Rider in 1958 [7]. With and , the generalized Cauchy density is given in (9). In addition to the parametrization in (9), we may introduce scale and location parameters by means of , just as we did in the Cauchy case. Another notable special case is , which is the centered Student-t random variable, itself equivalent to a Pearson type VII distribution.
3. Differential Entropy
- 9.
- The differential entropy of a Cauchy random variable with scale $\lambda$ is $h(X) = \log (4 \pi \lambda)$, using (A3). Throughout this paper, unless the logarithm base is explicitly shown, it can be chosen by the reader as long as it is the same on both sides of the equation. For natural logarithms, the information measure unit is the nat.
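A numerical sanity check of the expression above (in nats), integrating $-f \log f$ directly for a few arbitrary scales:

```python
import numpy as np
from scipy import integrate

def cauchy_pdf(x, loc=0.0, scale=1.0):
    return scale / (np.pi * (scale**2 + (x - loc)**2))

def diff_entropy_nats(scale):
    # h = -int f log f over the real line, evaluated by quadrature
    f = lambda x: -cauchy_pdf(x, scale=scale) * np.log(cauchy_pdf(x, scale=scale))
    val, _ = integrate.quad(f, -np.inf, np.inf)
    return val

for lam in (0.5, 1.0, 3.0):
    print(diff_entropy_nats(lam), np.log(4 * np.pi * lam))   # should agree
```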
- 10.
- An alternative, sometimes advantageous, expression for the differential entropy of a real-valued random variable is feasible if its cumulative distribution function $\mathsf{F}_X$ is continuous and strictly monotonic. Then, the quantile function $\mathsf{Q}_X$ is its functional inverse, i.e., $\mathsf{F}_X(\mathsf{Q}_X(t)) = t$ for all $t \in (0,1)$, which implies that $f_X(\mathsf{Q}_X(t))\, \mathsf{Q}_X'(t) = 1$ for all $t \in (0,1)$. Moreover, since X and $\mathsf{Q}_X(U)$ with U uniformly distributed on $[0,1]$ have identical distributions, we obtain $h(X) = \int_0^1 \log \mathsf{Q}_X'(t)\, \mathrm{d}t$.
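For the standard Cauchy, $\mathsf{Q}_V'(t) = \pi / \cos^2\left(\pi\left(t - \tfrac{1}{2}\right)\right)$, and the quantile-based integral recovers $\ln(4\pi)$; a minimal quadrature sketch:

```python
import numpy as np
from scipy import integrate

# h(X) = int_0^1 log Q'(t) dt, with Q the quantile function of X.
# For the standard Cauchy, Q(t) = tan(pi*(t - 1/2)).
q_prime = lambda t: np.pi / np.cos(np.pi * (t - 0.5)) ** 2

h, _ = integrate.quad(lambda t: np.log(q_prime(t)), 0.0, 1.0, limit=200)
print(h, np.log(4 * np.pi))   # both ~ 2.531 nats
```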
- 11.
- Despite not having finite moments, an independent identically distributed sequence of Cauchy random variables is information stable in the sense that $-\frac{1}{n} \log f_{X^n}(X^n) \to h(X_1)$ almost surely, because of the strong law of large numbers.
- 12.
- With $X^n$ distributed according to the standard spherical multivariate Cauchy density in (6), it is shown in [8] that , where $\gamma$ is the Euler–Mascheroni constant and $\psi(\cdot)$ is the digamma function. Therefore, the differential entropy of (6) is, in nats, (see also [9]) , whose growth is essentially linear with n: the conditional differential entropy is monotonically decreasing with n.
- 13.
- By the scaling law of differential entropy and its invariance to location, we obtain
- 14.
- 15.
- The Rényi differential entropy of order of an absolutely continuous random variable with probability density function is
- 16.
4. The Shannon- and η-Transforms
- 17.
- The Shannon transform of a nonnegative random variable X is the function defined by $\mathsf{V}_X(\gamma) = \mathbb{E}[\log(1 + \gamma X)]$, $\gamma \geq 0$. Unless $\mathsf{V}_X(\gamma) = \infty$ for all $\gamma > 0$ (e.g., if X has the log-Cauchy density ), or $\mathsf{V}_X(\gamma) = 0$ for all $\gamma \geq 0$ (which occurs if $X = 0$ a.s.), the Shannon transform is a strictly concave continuous function from , which grows without bound as $\gamma \to \infty$.
- 18.
- 19.
- 20.
- The $\eta$-transform of a non-negative random variable is defined as the function $\eta_X(\gamma) = \mathbb{E}\left[\frac{1}{1 + \gamma X}\right]$, $\gamma \geq 0$, which is intimately related to the Cauchy–Stieltjes transform [11]. For example,
5. Strength
- 21.
- The strength of a real-valued random variable Z is defined in (34). It follows that the only random variable with zero strength is $Z = 0$, almost surely. If the inequality in (34) is not satisfied for any finite value, then the strength is infinite. Otherwise, the strength is the unique positive solution to (35). If , then (35) holds with ≤.
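Since the displays (34)–(35) did not survive extraction, the following sketch assumes that the strength of Z is the unique positive root of $\mathbb{E}\left[\log\left(1 + Z^2/\eta^2\right)\right] = \log 4$, which is consistent with Items 21–24 (the standard Cauchy has strength 1, and a constant c has strength $|c|/\sqrt{3}$); the Monte Carlo expectation and bracketing bounds are arbitrary.

```python
import numpy as np
from scipy import optimize

def strength(samples, hi=1e6):
    """Numerical strength: positive root of E[log(1 + Z^2/eta^2)] = log 4.

    Assumed reconstruction of (34)-(35); expectation is Monte Carlo.
    """
    z2 = np.asarray(samples, dtype=float) ** 2
    g = lambda eta: np.mean(np.log1p(z2 / eta**2)) - np.log(4.0)
    if g(hi) > 0:          # even a huge eta cannot bring the mean down to log 4
        return np.inf
    return optimize.brentq(g, 1e-12, hi)

rng = np.random.default_rng(3)
print(strength(rng.standard_cauchy(500_000)))         # ~ 1 (standard Cauchy)
print(strength(np.full(1000, 2.0)), 2 / np.sqrt(3))   # constant c: |c|/sqrt(3)
print(strength(rng.standard_normal(500_000)))         # finite, below 1
```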
- 22.
- The set of probability measures whose strength is upper bounded by a given finite nonnegative constant, as in (36), is convex: the set is a singleton as seen in Item 21, while, for , we can express (36) as in (37). Therefore, if and , we must have .
- 23.
- The peculiar constant in the definition of strength is chosen so that if V is standard Cauchy, then its strength is 1 because, in view of (29),
- 24.
- If , a.s., then its strength is
- 25.
- The left side of (35) is the Shannon transform of $Z^2$ evaluated at , which is continuous in . If , then (35) can be written as in (40), where, on the right side, we have denoted the functional inverse of the Shannon transform. Clearly, the square root of the right side of (40) cannot be expressed as the expectation with respect to Z of any that does not depend on . Nevertheless, thanks to (37), (36) can be expressed as
- 26.
- Theorem 1. The strength of a real-valued random variable satisfies the following properties:
- (a)
- (b)
- with equality if and only if is deterministic.
- (c)
- If , and , then
- (d)
- If V is standard Cauchy, independent of X, then is the solution to (45) if it exists; otherwise, . Moreover, ≤ holds in (45) if .
- (e)
- (f)
- If , then (47) holds, where V is standard Cauchy, and stands for the relative entropy with reference probability measure and dominated measure .
- (g)
- (h)
- If V is standard Cauchy, then
- (i)
- The finiteness of strength is sufficient for the finiteness of the entropy of the integer part of the random variable, i.e.,
- (j)
- If in for any , then .
- (k)
- (l)
- If , then .
- (m)
- If , and Z is independent of , then .
Proof. For the first three properties, it is clear that they are satisfied if , i.e., almost surely.
- (a)
- (b)
- (c)
- (d)
- Substituting x by X and averaging over X, the result follows from the definition of strength.
- (e)
- The result holds trivially if either or . Otherwise, we simply rewrite (35) as , and upper/lower bound the right side by .
- (f)
- (g)
- (h)
- It is sufficient to assume for the condition on the right of (49) because the condition on the left holds if and only if it holds for , for any and . If , then , which is finite unless either or . This establishes ⟹ in view of (48). To establish ⟸, it is enough to show that (60) holds, in view of (48) and the fact that, according to (59), if both and are finite. To show (60), we invoke the following variational representation of relative entropy (first noted by Kullback [12] for absolutely continuous random variables): If , then , attained only at . Let Q be the absolutely continuous random variable with probability density function
- (i)
- (j)
- If , then a.e., and the result follows from (44). For all , , where (71) follows by maximizing the left side over . Denote the difference between the right side and the left side of (72) by , an even function which satisfies , and . Now, because of the scaling property in (42), we may assume without loss of generality that . Thus, (74) and (75) result in , which requires that , since, by assumption, the right side vanishes. Assume now that , and therefore, . Inequality (75) remains valid in this case, implying that, as soon as the right side is finite (which it must be for all sufficiently large n), , and therefore, in view of (48).
- (k)
- 1st ⟸
- For any , Markov’s inequality results in
- ⟹
- First, we show that, for any , we have (78). The case is trivial. The case follows because implies , where ≥ is obvious, and ≤ holds because . If infinitely often, so is in view of (48). Assume that , and is finite for all sufficiently large. Then, there is a subsequence such that , and , for all sufficiently large i and . Consequently, (78) implies that .
- 2nd ⟸
- Suppose that . Therefore, there is a subsequence along which . If , then along the subsequence. Because of the continuity of the Shannon transform and the fact that it grows without bound as its argument goes to infinity (Item 25), if , we can find such that , which implies . Therefore, as we wanted to show.
- (l)
- We start by showing that (83) holds, where we have denoted the right side of (71) with arbitrary logarithm base by . Since , it is easy to verify that (84) holds, where the lower and upper bounds are attained uniquely at and , respectively. The lower bound results in ⟸ in (83). To show ⟹, decompose, for arbitrary , , where (87) holds from the upper bound in (84), and the fact that (89) is decreasing in , and (88) holds for all sufficiently large n if . Since the right side of (88) goes to 0 as , (83) is established. Assume . From the linearity property (42), we have with and which satisfies . Therefore, we may restrict attention to without loss of generality. Following (71) and (74), and abbreviating , we obtain
- (m)
- If , then a.s., and in view of Part (f). Assume henceforth that . Since , it suffices to show (92). Under the assumptions, Part (l) guarantees that . If V is a standard Cauchy random variable, then in distribution as the characteristic function converges: for all t. Analogously, according to Part (k), since in probability. Since the strength of is finite for all sufficiently large n, we may invoke (47) to express, for those n, (94). The lower semicontinuity of relative entropy under weak convergence (which, in turn, is a corollary to the Donsker–Varadhan [14,15] variational representation of relative entropy) results in (95), because and . Therefore, (92) follows from (94) and (95).□
- 27.
- In view of (42) and Item 23, the strength of $\lambda V$ is $\lambda > 0$ if V is standard Cauchy. Furthermore, if $X_1$ and $X_2$ are centered independent Cauchy random variables, then their sum is centered Cauchy whose strength is the sum of the individual strengths (96). More generally, it follows from Theorem 1-(d) that, if is centered Cauchy, and (96) holds for and all , then X must be centered Cauchy. Invoking (52), we obtain (97), which is also valid for as we saw in Item 24.
- 28.
- If X is standard Gaussian, then , and . Therefore, if $X_1$ and $X_2$ are zero-mean independent Gaussian random variables, then the squared strength of their sum is the sum of their squared strengths. Thus, in this case, the strength of the sum is strictly smaller than the sum of the strengths.
- 29.
- It follows from Theorem 1-(d) that, with X independent of standard Cauchy V, we obtain whenever X is such that . An example is the heavy-tailed probability density function , for which .
- 30.
- Using (A8), we can verify that, if X is zero-mean uniform with variance , then , where c is the solution to .
- 31.
- We say that in strength if . Parts (j) and (k) of Theorem 1 show that this convergence criterion is intermediate between the traditional in probability and criteria. It is not equivalent to either one: if , then , while in probability. If, instead, , with probability , then in strength, but not in for any .
- 32.
- The assumption in Theorem 1-(m) that in strength cannot be weakened to convergence in probability. Suppose that is absolutely continuous with probability density function (103). We have in probability since, regardless of how small , for all . Furthermore, , because (103) is the mixture of a uniform and an infinite differential entropy probability density function, and differential entropy is concave. We conclude that , since .
- 33.
- The following result on the continuity of differential entropy is shown in [16]: if X and Z are independent, and , then . This result is weaker than Theorem 1-(m) because finite first absolute moment implies finite strength as we saw in (44), and in if , and therefore, it vanishes in strength too.
- 34.
- 35.
6. Maximization of Differential Entropy
- 36.
- Among random variables with a given second moment (resp. first absolute moment), differential entropy is maximized by the zero-mean Gaussian (resp. Laplace) distribution. More generally, among random variables with a given p-absolute moment , differential entropy is maximized by the parameter-p Subbotin (or generalized normal) distribution in (108) with p-absolute moment [17]. Among nonnegative random variables with a given mean, differential entropy is maximized by the exponential distribution. In those well-known solutions, the cost function is an affine function of the negative logarithm of the maximal differential entropy probability density function. Is there a cost function such that, among all random variables with a given expected cost, the Cauchy distribution is the maximal differential entropy solution? To answer this question, we adopt a more general viewpoint. Consider the following result (Theorem 2), whose special case was solved in [18] using convex optimization.
Proof.
- (a)
- For every and , there is a unique that satisfies (110) because the function of on the right side is strictly monotonically decreasing, grows without bound as , and goes to zero as .
- (b)
- For any Z which satisfies , its relative entropy, in nats, with respect to is , where (113) and (114) follow from (110) and (22), respectively. Since relative entropy is nonnegative, and zero only if both measures are identical, not only does (2) hold but any random variable other than achieves strictly lower differential entropy.□
- 37.
- An unfortunate consequence of Theorem 2 is that, while we were able to find a cost function such that the Cauchy distribution is the maximal differential entropy distribution under an average cost constraint, this holds only for a specific value of the constraint. This behavior is quite different from the classical cases discussed in Item 36, for which the solution is, modulo scale, the same regardless of the value of the cost constraint. As we see next, this deficiency is overcome by the notion of strength introduced in Section 5.
- 38.
- Theorem 3 (Strength constraint). The differential entropy of a real-valued random variable Z with strength $\eta$ is upper bounded by $h(Z) \leq \log(4 \pi \eta)$. If $\eta < \infty$, equality holds if and only if Z has a centered Cauchy density, i.e., for some .
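A numerical check of the bound as reconstructed above, reusing the root-finding definition of strength assumed in the sketch of Item 21: strict inequality for a Gaussian, near-equality for a centered Cauchy.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(4)

def strength(z):
    # Positive root of E[log(1 + Z^2/eta^2)] = log 4 (assumed form of (35))
    g = lambda eta: np.mean(np.log1p(z**2 / eta**2)) - np.log(4.0)
    return optimize.brentq(g, 1e-9, 1e9)

# Gaussian: h = 0.5*ln(2*pi*e*sigma^2) nats, strictly below log(4*pi*strength)
sigma = 1.0
z = sigma * rng.standard_normal(500_000)
print(0.5 * np.log(2 * np.pi * np.e * sigma**2), np.log(4 * np.pi * strength(z)))

# Centered Cauchy with scale 2: equality, both sides ~ log(8*pi) ~ 3.22 nats
lam = 2.0
v = lam * rng.standard_cauchy(500_000)
print(np.log(4 * np.pi * lam), np.log(4 * np.pi * strength(v)))
```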
- 39.
- The entropy power of a random variable Z is the variance of a Gaussian random variable whose differential entropy is , i.e., (116). While the power of a Cauchy random variable is infinite, its entropy power is given by . In the same spirit as the definition of entropy power, Theorem 3 suggests the definition of , the entropy strength of Z, as the strength of a centered Cauchy random variable whose differential entropy is , i.e., . Therefore, , where (119) follows from (56), and (120) holds with equality if and only if Z is centered Cauchy. Note that, for all , . Comparing (116) and (118), we see that entropy power is simply a scaled version of the entropy strength squared, as in (122). The entropy power inequality (e.g., [19,20]) states that, if and are independent real-valued random variables, then (123) holds, regardless of whether they have moments. According to (122), we may rewrite the entropy power inequality (123) replacing each entropy power by the corresponding squared entropy strength. Therefore, the squared entropy strength of the sum of independent random variables satisfies (124).
- 40.
- Theorem 3 implies that any random variable with infinite differential entropy has infinite strength. There are indeed random variables with finite differential entropy and infinite strength. For example, let be an absolutely continuous random variable with probability density function . Then, nats, while the entropy of the quantized version as well as the strength satisfy .
- 41.
- With the same approach, we may generalize Theorem 3 to encompass the full slew of the generalized Cauchy distributions in (9). To that end, fix and define the -strength of a random variable as . Therefore, for , and if satisfy (110), then . As in Item 25, if , we have
- 42.
- Theorem 4.
- 43.
- In the multivariate case, we may find a simple upper bound on differential entropy based on the strength of the norm of the random vector. Theorem 5. The differential entropy of a random vector is upper bounded by
- 44.
- To obtain a full generalization of Theorem 3 in the multivariate case, it is advisable to define the strength of a random n-vector as (138) for . To verify (139), note (15)–(17). Notice that and, for , (138) is equal to (34). The following result provides a maximal differential entropy characterization of the standard spherical multivariate Cauchy density. Theorem 6.
7. Relative Information
- 45.
- For probability measures P and Q on the same measurable space , such that , the logarithm of their Radon–Nikodym derivative is the relative information denoted by
- 46.
- As usual, we may employ the notation to denote . The distributions of the random variables and are referred to as relative information spectra (e.g., [21]). It can be shown that there is a one-to-one correspondence between the cumulative distributions of and . For example, if they are absolutely continuous random variables with respective probability density functions and , then . Obviously, the distributions of and determine each other. One caveat is that relative information may take the value . It can be shown that
- 47.
- The information spectra determine all measures of the distance between the respective probability measures of interest (e.g., [22,23]), including f-divergences and Rényi divergences. For example, the relative entropy (or Kullback–Leibler divergence) of the dominated measure P with respect to the reference measure Q is the average of the relative information when the argument is distributed according to P, i.e., . If , then .
- 48.
- The information spectra also determine the fundamental trade-off in hypothesis testing. Let denote the minimal probability of deciding when is true subject to the constraint that the probability of deciding when is true is no larger than . A consequence of the Neyman–Pearson lemma is , where and .
- 49.
- Cauchy distributions are absolutely continuous with respect to each other and, in view of (2),
- 50.
- The following result, proved in Item 58, shows that the relative information spectrum corresponding to Cauchy distributions with respective scale/locations and depends on the four parameters only through the single scalar defined in (149), where equality holds if and only if . Theorem 7. Suppose that , and V is standard Cauchy. Denote . Then,
- (a)
- (b)
- Z has the same distribution as the random variable , where Θ is uniformly distributed on and . Therefore, the probability density function of Z is (153) on the interval .
- 51.
- 52.
- For future use, note that the endpoints of the support of (153) are reciprocals of each other. Furthermore, (154) holds, which implies
8. Equivalent Pairs of Probability Measures
- 53.
- Suppose that and are probability measures on such that and and are probability measures on such that . We say that and are equivalent pairs, and write , if the cumulative distribution functions of and are identical with and . Naturally, ≡ is an equivalence relationship. Because of the one-to-one correspondence indicated in Item 46, the definition of equivalent pairs does not change if we require equality of the information spectra under the dominated measure, i.e., that and be equally distributed and . Obviously, the requirement that the information spectra coincide is the same as requiring that the distributions of and are equal. As in Item 46, we also employ the notation to indicate if , , , and .
- 54.
- Suppose that the output probability measures of a certain (random or deterministic) transformation are and when the input is distributed according to and , respectively. If , then the transformation is a sufficient statistic for deciding between and (i.e., the case of a binary parameter).
- 55.
- If is a measurable space on which the probability measures are defined, and is a -measurable injective function, then are probability measures on and . Consequently, .
- 56.
- The most important special case of Item 55 is an affine transformation of an arbitrary real-valued random variable X, which enables the reduction of four-parameter problems into two-parameter problems: for all and , with , by choosing the affine function .
- 57.
- Theorem 8. If is an even random vector, i.e., , then whenever .
- 58.
- We now proceed to prove Theorem 7. Proof. Since and have identical distributions, we may assume for convenience that and . Furthermore, capitalizing on Item 56, we may assume , , , and , and then recover the general result letting and . Invoking (A9) and (A10), we have (169), and we can verify that we recover (151) through the aforementioned substitution. Once we have obtained the expectation of , we proceed to determine its distribution. Denoting the right side of (169) by , we have , where is uniformly distributed on . We have substituted (see Item 4) in (172), and invoked elementary trigonometric identities in (173) and (174). Since the phase in (175) does not affect the distribution, the distribution of Z is indeed as claimed in (152), and (153) follows because the probability density function of is .□
- 59.
- In general, it need not hold that —for example, if X and Y are zero-mean Gaussian with different variances. However, the class of scalar Cauchy distributions does satisfy this property since the result of Theorem 7 is invariant to swapping and . More generally, Theorem 7 implies that, if , thenCuriously, (177) implies that .
- 60.
- For location–dilation families of random variables, we saw in Item 56 how to reduce a four-parameter problem into a two-parameter problem since with the appropriate substitution. In the Cauchy case, Theorem 7 reveals that, in fact, we can go one step further and turn it into a one-parameter problem. We have two basic ways of doing this:
- (a)
- with .
- (b)
- with either , which are the solutions to .
9. f-Divergences
- 61.
- If and is convex and right-continuous at 0, f-divergence is defined as
- 62.
- The most important property of f-divergence is the data processing inequality (180), where and are the responses of a (random or deterministic) transformation to and , respectively. If f is strictly convex at 1 and , then is necessary and sufficient for equality in (180).
- 63.
- If , then with the transform , which satisfies .
- 64.
- Theorem 9. If and , then , where stands for all convex right-continuous . Proof. As mentioned in Item 53, is equivalent to and having identical distributions with and .
- ⟹
- According to (179), is determined by the distribution of the random variable , .
- ⟸
- For , the function , , is convex and right-continuous at 0, and is the moment generating function, evaluated at t, of the random variable , . Therefore, for all t implies that .□
- 65.
- Since is not necessary in order to define (finite) , it is possible to enlarge the scope of Theorem 9 by defining dropping the restriction that and . For that purpose, let and be -finite measures on and , respectively, and denote , , . Then, we say if
- (a)
- when restricted to , the random variables and have identical distributions with and ;
- (b)
- when restricted to , the random variables and have identical distributions with and .
Note that those conditions imply that
- (c)
- ;
- (d)
- ;
- (e)
- .
For example, if and , then . To show the generalized version of Theorem 9, it is convenient to use the symmetrized form
- 66.
- Suppose that there is a class of probability measures on a given measurable space with the property that there exists a convex function (right-continuous at 0) such that, if and , then (183) holds. In such a case, Theorem 9 indicates that can be partitioned into equivalence classes such that, within every equivalence class, the value of is constant, though naturally dependent on f. Throughout , the value of determines the value of , i.e., we can express , where is a non-decreasing function. Consider the following examples:
- (a)
- Let be the class of real-valued Gaussian probability measures with given variance . Then,
- (b)
- Let be the collection of all Cauchy random variables. Theorem 7 reveals that (183) is also satisfied if because, if and , then
- 67.
- An immediate consequence of Theorems 7 and 9 is that, for any valid f, the f-divergence between Cauchy densities is symmetric: . This property does not generalize to the multivariate case. While, in view of Theorem 8, , in general , since the corresponding relative entropies do not coincide as shown in [8].
- 68.
- It follows from Item 66 and Theorem 7 that any f-divergence between Cauchy probability measures is a monotonically increasing function of given by (149). The following result shows how to obtain that function from f.
- 69.
- Suppose now that we have two sequences of Cauchy measures with respective parameters and such that . Then, Theorem 10 indicates that . The most common f-divergences are such that since in that case . In addition, adding the function to does not change the value of and, with appropriately chosen , we can turn into canonical form in which not only but . In the special case in which the second measure is fixed, Theorem 9 in [25] shows that, if with , then , provided the limit on the right side exists; otherwise, the left side lies between the left and right limits at 1. In the Cauchy case, we can allow the second probability to depend on n and sharpen that result by means of Theorem 10. In particular, it can be shown that , provided the right side is not .
10. χ²-Divergence
- 70.
- With either or , the f-divergence is the χ²-divergence,
- 71.
- If P and Q are Cauchy distributions, then (149), (151) and (195) result in (197), a formula obtained in Appendix D of [26] using complex analysis and the Cauchy integral formula. In addition, invoking complex analysis and the maximal group invariant results in [27,28], ref. [26] shows that any f-divergence between Cauchy distributions can be expressed as a function of their χ²-divergence, although [26] left open how to obtain that function, which is given by Theorem 10 substituting .
11. Relative Entropy
- 72.
- The relative entropy between Cauchy distributions is given by (198), where . The special case of (198) was found in Example 4 of [29]. The next four items give different simple justifications for (198). An alternative proof was recently given in Appendix C of [26] using complex analysis holomorphisms and the Cauchy integral formula. Yet another, much more involved, proof is reported in [30]. See also Remark 19 in [26] for another route invoking the Lévy–Khintchine formula and the Frullani integral.
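The display in (198) did not survive extraction; the closed form reported in the works cited above is $D(P_0 \| P_1) = \log \frac{(\lambda_0 + \lambda_1)^2 + (\mu_0 - \mu_1)^2}{4 \lambda_0 \lambda_1}$ (treated here as the assumed reconstruction of (198)), which the following sketch checks against direct numerical integration.

```python
import numpy as np
from scipy import integrate

def cauchy_pdf(x, mu, lam):
    return lam / (np.pi * (lam**2 + (x - mu)**2))

def kl_numeric(mu0, lam0, mu1, lam1):
    # D(P0 || P1) = int p0 * log(p0 / p1) dx, in nats
    f = lambda x: cauchy_pdf(x, mu0, lam0) * np.log(
        cauchy_pdf(x, mu0, lam0) / cauchy_pdf(x, mu1, lam1))
    val, _ = integrate.quad(f, -np.inf, np.inf)
    return val

def kl_closed_form(mu0, lam0, mu1, lam1):
    # Assumed reconstruction of (198)
    return np.log(((lam0 + lam1)**2 + (mu0 - mu1)**2) / (4 * lam0 * lam1))

for params in [(0, 1, 0, 2), (0, 1, 3, 1), (-1, 0.5, 2, 4)]:
    print(kl_numeric(*params), kl_closed_form(*params))
```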
- 73.
- Now, substituting and , we obtain (198) since, according to Item 56, .
- 74.
- 75.
- 76.
- 77.
- If V is standard Cauchy, independent of Cauchy and , then (198) results in , where and , and is an independent (or exact) copy of V. In contrast, the corresponding result in the Gaussian case, in which X, , are independent Gaussian with means and variances , respectively, is
- 78.
- An important information theoretic result due to Csiszár [32] is that, if and P is such that the orthogonality condition (207) holds, then the following Pythagorean identity holds. Among other applications, this result leads to elegant proofs of minimum relative entropy results. For example, the closest Gaussian to a given P with a finite second moment has the same first and second moments as P. If we let and be centered Cauchy with strengths and , respectively, then the orthogonality condition (207) becomes, with the aid of (148) and (198),
- 79.
- Mutually absolutely continuous random variables may be such that
- 80.
- While relative entropy is lower semi-continuous, it is not continuous. For example, using the Cauchy distribution, we can show that relative entropy is not stable against small contamination of a Gaussian random variable: if X is Gaussian independent of V, then no matter how small ,
12. Total Variation Distance
- 81.
- 82.
- Example 15 of [33] shows that the total variation distance between centered Cauchy distributions is given by (217)–(218), in view of (197). Since any f-divergence between Cauchy distributions depends on the parameters only through the corresponding χ²-divergence, (217)–(218) imply the general formula (219). Alternatively, applying Theorem 11 to the case of Cauchy random variables, note that, in this case, Z is an absolutely continuous random variable with density function (153). Therefore, = 1, and , where (221) follows from (154) and the identity specialized to . Though more laborious (see [26]), (219) can also be verified by direct integration.
13. Hellinger Divergence
- 83.
- The Hellinger divergence of order is the f-divergence with . Notable special cases are , where is known as the squared Hellinger distance.
- 84.
14. Rényi Divergence
- 85.
- For absolutely continuous probability measures P and Q, with corresponding probability density functions p and q, the Rényi divergence of order is [35] . Note that, if , then . Moreover, although Rényi divergence of order is not an f-divergence, it is in one-to-one correspondence with the Hellinger divergence of order :
- 86.
- 87.
- 88.
- 89.
- Since , particularizing (230), we obtain
- 90.
- Since , for Cauchy random variables, we obtain
- 91.
- For Cauchy random variables, the Rényi divergence for integer order 4 or higher can be obtained through (235), (236) and the recursion (dropping for typographical convenience) , which follows from (230) and the recursion of the Legendre polynomials , which, in fact, also holds for non-integer n (see 8.5.3 in [34]).
- 92.
- The Chernoff information (239) satisfies regardless of . If, as in the case of Cauchy measures, , then Chernoff information is equal to the Bhattacharyya distance (240), where is the squared Hellinger distance, which is the f-divergence with . Together with Item 87, (240) gives the Chernoff information for Cauchy distributions. While it involves the complete elliptic integral function, its simplicity should be contrasted with the formidable expression for Gaussian distributions, recently derived in [38]. The reason (240) holds is that the supremum in (239) is achieved at . To see this, note that , where (241) reflects the skew-symmetry of Rényi divergence, and (242) holds because . Since is concave and its own mirror image, it is maximized at .
15. Fisher’s Information
- 93.
- The score function of the standard Cauchy density (1) is $\rho(v) = \frac{\mathrm{d}}{\mathrm{d}v} \log f_V(v) = -\frac{2v}{1 + v^2}$. Then, $\rho(V)$ is a zero-mean random variable with second moment equal to Fisher's information $J(V) = \mathbb{E}\left[\rho^2(V)\right] = \frac{1}{2}$, where we have used (A11). Since Fisher's information is invariant to location and scales as the reciprocal of the squared scale, we obtain $J(\mu + \lambda V) = \frac{1}{2 \lambda^2}$.
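A quadrature check of the values above: the Fisher information of the Cauchy density with scale λ, computed from the score function, equals $1/(2\lambda^2)$.

```python
import numpy as np
from scipy import integrate

def fisher_info_cauchy(lam):
    # J = int (f'(x)/f(x))^2 f(x) dx for the centered Cauchy density with scale lam
    f = lambda x: lam / (np.pi * (lam**2 + x**2))
    score = lambda x: -2 * x / (lam**2 + x**2)        # d/dx log f(x)
    val, _ = integrate.quad(lambda x: score(x)**2 * f(x), -np.inf, np.inf)
    return val

for lam in (0.5, 1.0, 2.0):
    print(fisher_info_cauchy(lam), 1 / (2 * lam**2))   # should agree
```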
- 94.
- Introduced in [39], Fisher's information of a density function (245) quantifies its similarity with a slightly shifted version of itself. A more general notion is the Fisher information matrix of a random transformation satisfying the regularity condition . Then, the Fisher information matrix of at has coefficients , and satisfies (with relative entropy in nats)
- 95.
- The relative Fisher information is defined as . Although the purpose of this definition is to avoid some of the pitfalls of the classical definition of Fisher's information, not only do equivalent pairs fail to have the same relative Fisher information but, unlike relative entropy or f-divergence, relative Fisher information is not transparent to injective transformations. For example, . Centered Cauchy random variables illustrate this fact since
- 96.
- de Bruijn’s identity [4] states that, if is independent of X, then, in nats, (254). As well as serving as the key component in the original proofs of the entropy power inequality, the differential equation in (254) provides a concrete link between Shannon theory and its prehistory. As we show in Theorem 12, it turns out that there is a Cauchy counterpart of de Bruijn’s identity (254). Before stating the result, we introduce the following notation for a parametrized random variable (to be specified later): , i.e., and are the Fisher information with respect to location and with respect to dilation, respectively (corresponding to the coefficients and of the Fisher information matrix when , as in Item 94). The key to (254) is that satisfies the partial differential equation (259). Theorem 12. Suppose that X is independent of standard Cauchy V. Then, in nats, (260) holds. Proof. Equation (259) does not hold in the current case in which , and . However, some algebra (the differentiation/integration swaps can be justified invoking the bounded convergence theorem) indicates that the convolution with the Cauchy density satisfies the Laplace partial differential equation . The derivative of the differential entropy of is, in nats, . Taking another derivative, the left side of (260) becomes , where
- 97.
- Theorem 12 reveals that the increasing function is concave (which does not follow from the concavity of the differential entropy functional of the density). In contrast, it was shown by Costa [40] that the entropy power , with , is concave in t.
16. Mutual Information
- 98.
- Most of this section is devoted to an additive noise model. We begin with the simplest case in which the noise N is centered Cauchy with strength $\lambda_N$, independent of X, also centered Cauchy with strength $\lambda_X$. Then, (11) yields $I(X; X + N) = \log\left(1 + \frac{\lambda_X}{\lambda_N}\right)$ (271), thereby establishing a pleasing parallelism with Shannon’s formula [1] for the mutual information between a Gaussian random variable and its sum with an independent Gaussian random variable. Aside from a factor of $\frac{1}{2}$, in the Cauchy case, the role of the variance is taken by the strength. Incidentally, as shown in [2], if N is standard exponential on , an independent X on can be found so that is exponential, in which case the formula (271) also applies because the ratio of strengths of exponentials is equal to the ratio of their means. More generally, if input and noise are independent non-centered Cauchy, their locations do not affect the mutual information, but they do affect their strengths, so, in that case, (271) holds provided that the strengths are evaluated for the centered versions of the Cauchy random variables.
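A sketch assuming (271) reads $I(X; X+N) = \log\left(1 + \lambda_X/\lambda_N\right)$ as reconstructed above: the mutual information is computed as $h(X+N) - h(N)$ via numerical integration, using the stability of the Cauchy family (the chosen strengths are arbitrary).

```python
import numpy as np
from scipy import integrate

def h_cauchy_numeric(lam):
    # Differential entropy (nats) of a centered Cauchy with scale/strength lam
    f = lambda x: lam / (np.pi * (lam**2 + x**2))
    val, _ = integrate.quad(lambda x: -f(x) * np.log(f(x)), -np.inf, np.inf)
    return val

lam_x, lam_n = 3.0, 1.5
# Y = X + N is centered Cauchy with scale lam_x + lam_n (stability), so
# I(X; Y) = h(Y) - h(N) = log(4*pi*(lam_x + lam_n)) - log(4*pi*lam_n)
mi_numeric = h_cauchy_numeric(lam_x + lam_n) - h_cauchy_numeric(lam_n)
mi_formula = np.log(1 + lam_x / lam_n)
print(mi_numeric, mi_formula)      # both ~ log(3) ~ 1.0986 nats
```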
- 99.
- It is instructive, as well as useful in the sequel, to obtain (271) through a more circuitous route. Since is centered Cauchy with strength , the information density (e.g., [41]) is defined as . Averaging with respect to , we obtain
- 100.
- If the strengths of output and independent noise N are finite and their differential entropies are not , we can obtain a general representation of the mutual information without requiring that either input or noise be Cauchy. Invoking (56) and , we have (278), since, as we saw in (49), the finiteness of the strengths guarantees the finiteness of the relative entropies in (278). We can readily verify the alternative representation (279) in which strength is replaced by standard deviation, and the standard Cauchy V is replaced by standard normal W. A byproduct of (278) is the upper bound , where (281) follows from , and (282) follows by dropping the last term on the right side of (278). Note that (281) is the counterpart of the upper bound given by Shannon [1] in which the standard deviation of Y takes the place of the strength in the numerator, and the square root of the noise entropy power takes the place of the entropy strength in the denominator. Shannon gave his bound three years before Kullback and Leibler introduced relative entropy in [42]. The counterpart of (282) with analogous substitutions of strengths by standard deviations was given by Pinsker [43], and by Ihara [44] for continuous-time processes.
- 101.
- We proceed to investigate the maximal mutual information between the (possibly non-Cauchy) input and its additive Cauchy-noise contaminated version. Theorem 13 (Maximal mutual information: output strength constraint). For any , (283) holds, where is centered Cauchy independent of X. The maximum in (283) is attained uniquely by the centered Cauchy distribution with strength .
- 102.
- In the information theory literature, the maximization of mutual information over the input distribution is usually carried out under a constraint on the average cost for some real-valued function . Before we investigate whether the optimization in (283) can be cast into that conventional paradigm, it is instructive to realize that the maximization of mutual information in the case of input-independent additive Gaussian noise can be viewed as one in which we allow any input such that the output variance is constrained; because the output variance is the sum of input and noise variances, the familiar optimization over variance-constrained inputs obtains. Likewise, in the case of additive exponential noise and random variables taking nonnegative values, if we constrain the output mean, we automatically constrain the input mean. In contrast, the output strength is not equal to the sum of Cauchy noise strength and the input strength, unless the input is Cauchy. Indeed, as we saw in Theorem 1-(d), the output strength depends not only on the input strength but on the shape of its probability density function. Since the noise is Cauchy, (45) yields (286), which is the same input constraint found in [45] (see also Lemma 6 in [46] and Section V in [47]), in which affects not only the allowed expected cost but the definition of the cost function itself. If X is centered Cauchy with strength , then (286) is satisfied with equality, in keeping with the fact that that input achieves the maximum in (283). Any alternative input with the same strength that produces output strength lower than or equal to can only result in lower mutual information. However, as we saw in Item 29, we can indeed find input distributions with strength that can produce output strength higher than . Can any of those input distributions achieve ? The answer is affirmative. If we let , defined in (9), we can verify numerically that, for , . We conclude that, at least for , the capacity–input–strength function satisfies
- 103.
- Although not always acknowledged, the key step in the maximization of mutual information over the input distribution for a given random transformation is to identify the optimal output distribution. The results in Items 101 and 102 point out that it is mathematically more natural to impose constraints on the attributes of the observed noisy signal than on the transmitted noiseless signal. In the usual framework of power constraints, both formulations are equivalent, as an increase in the gain of the receiver antenna (or a decrease in the front-end amplifier thermal noise) of dB has the same effect as an increase of dB in the gain of the transmitter antenna (or increase in the output power of the transmitter amplifier). When, as in the case of strength, both formulations lead to different solutions, it is worthwhile to recognize that what we usually view as transmitter/encoder constraints also involve receiver features.
- 104.
- Consider a multiaccess channel , where is a sequence of strength independent centered Cauchy random variables. While the capacity region is unknown if we place individual cost or strength constraints on the transmitters, it is easily solvable if we impose an output strength constraint. In that case, the capacity region is the triangle (289), where is the output strength constraint. To see this, note (a) the corner points are achievable thanks to Theorem 13; (b) if the transmitters are synchronous, a time-sharing strategy with Cauchy distributed inputs satisfies the output strength constraint in view of (107); (c) replacing the independent encoders by a single encoder which encodes both messages would not be able to achieve a higher rate sum. It is also possible to achieve (289) using the successive decoding strategy invented by Cover [48] and Wyner [49] for the Gaussian multiple-access channel: fix ; to achieve and , we let the transmitters use random coding with sequences of independent Cauchy random variables with respective strengths , which abide by the output strength constraint since , and , a rate-pair which is achievable by successive decoding by using a single-user decoder for user 1, which treats the codeword transmitted by user 2 as noise; upon decoding the message of user 1, it is re-encoded and subtracted from the received signal, thereby presenting a single-user decoder for user 2 with a signal devoid of any trace of user 1 (with high probability).
- 105.
- The capacity per unit energy of the additive Cauchy-noise channel , where is an independent sequence of standard Cauchy random variables, was shown in [29] to be equal to , even though the capacity-cost function of such a channel is unknown. A corollary to Theorem 13 is that the capacity per unit output strength of the same channel is . By only considering Cauchy distributed inputs, the capacity per unit input strength is lower bounded by , but is otherwise unknown as it is not encompassed by the formula in [29].
- 106.
- We turn to the scenario, dual to that in Theorem 13, in which the input is Cauchy but the noise need not be. As Shannon showed in [1], if the input is Gaussian, among all noise distributions with given second moment, independent Gaussian noise is the least favorable. Shannon showed that fact applying the entropy power inequality to the numerator on the right side of (279), and then further weakened the resulting lower bound by replacing the noise entropy power in the denominator by its variance. Taking a cue from this simple approach, we apply the entropy strength inequality (124) to (277) to obtain , where (299) follows from . Unfortunately, unlike the case of Gaussian input, this route falls short of showing that Cauchy noise of a given strength is least favorable because the right side of (299) is strictly smaller than the Cauchy-input Cauchy-noise mutual information in (271). Evidently, while the entropy power inequality is tight for Gaussian random variables, it is not for Cauchy random variables, as we observed in Item 39. For this approach to succeed in showing that, under a strength constraint, the least favorable noise is centered Cauchy, we would need that, if W is independent of standard Cauchy V, then . (See Item 119-(a).)
- 107.
- As in Item 102, the counterpart in the Cauchy-input case is more challenging due to the fact that, unlike variance, the output strength need not be equal to the sum of input and noise strength. The next two results give lower bounds which, although achieved by Cauchy noise, do not just depend on the noise distribution through its strength. Theorem 14. If is centered Cauchy, independent of W with , denote . Then, , with equality if W is centered Cauchy. Proof. Let us abbreviate . Consider the following chain: , where . Although the lower bound in Theorem 14 is achieved by a centered Cauchy, it does not rule out the existence of W such that and .
- 108.
- For the following lower bound, it is advisable to assume for notational simplicity and without loss of generality that . To remove that restriction, we may simply replace W by . Theorem 15. Let V be standard Cauchy independent of W. Then, (306) holds, where is the solution to (307). Equality holds in (306) if W is a centered Cauchy random variable, in which case, . Proof. It can be shown that, if and is an auxiliary random transformation such that , where is the response of to , then , where and the information density corresponds to the joint probability measure . We can particularize this decomposition of mutual information to the case where , where $W_C$ is centered Cauchy with strength $\lambda > 0$. Then, is the joint distribution of V and $V + W_C$, and . Taking expectation with respect to , and invoking (52), we obtain . Finally, taking expectation with respect to , we obtain . If , namely, the solution to (307), then (306) follows as a result of (108). If , then the solution to (307) is , and the equality in (306) can be seen by specializing (271) to . □
- 109.
- 110.
- As the proof indicates, at the expense of additional computation, we may sharpen the lower bound in Theorem 15 to show , which is attained at the solution to
- 111.
- Theorem 16. The rate–distortion function of a memoryless source whose distribution is centered Cauchy with strength , such that the time-average of the distortion strength is upper bounded by D, is given by . Proof. If , reproducing the source by results in a time-average of the distortion strength equal to . Therefore, . If , we proceed to determine the minimal among all such that . For any such random transformation, , where (320) holds because conditioning cannot increase differential entropy, and (322) follows from Theorem 3 applied to . The fact that there is an allowable that achieves the lower bound with equality is best seen by letting , where Z and are independent centered Cauchy random variables with and . Then, is such that the X marginal is indeed centered Cauchy with strength , and . Recalling (271), , and the lower bound in (323) can indeed be satisfied with equality. We are not finished yet since we need to justify that the rate–distortion function is indeed (325), which does not follow from the conventional memoryless lossy compression theorem with average distortion because, although the distortion measure is separable, it is not the average of a function with respect to the joint probability measure . This departure from the conventional setting does not impact the direct part of the theorem (i.e., ≤ in (325)), but it does affect the converse and in particular the proof of the fact that the n-version of the right side of (325) single-letterizes. To that end, it is sufficient to show that the function of D on the right side of (325) is convex (e.g., see pp. 316–317 in [19]). In the conventional setting, this follows from the convexity of the mutual information in the random transformation since, with a distortion function , we have , where , , and . Unfortunately, as we saw in Item 35, strength is not convex on the probability measure so, in general, we cannot claim that (327) holds. The way out of this quandary is to realize that (327) is only needed for those and that attain the minimum on the right side of (325) for different distortion bounds and . As we saw earlier in this proof, those optimal random transformations are such that and are centered Cauchy. Fortuitously, as we noted in (107), (327) does indeed hold when we restrict attention to mixtures of centered Cauchy distributions. □ Theorem 16 gives another example in which the Shannon lower bound to the rate–distortion function is tight. In addition to Gaussian sources with mean–square distortion, other examples can be found in [50]. Another interesting aspect of the lossy compression of memoryless Cauchy sources under the strength distortion measure is that it is optimally successively refinable in the sense of [51,52]. As in the Gaussian case, this is a simple consequence of the stability of the Cauchy distribution and the fact that the strength of the sum of independent Cauchy random variables is equal to the sum of their respective strengths (Item 27).
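A sketch assuming the statement of Theorem 16 (whose display did not survive extraction) is $R(D) = \log(\lambda / D)$ nats for $0 < D \leq \lambda$ and 0 otherwise, consistent with the Shannon-type lower bound and the test channel used in the proof; the strength routine repeats the assumed definition from Item 21, and the chosen parameters are arbitrary.

```python
import numpy as np
from scipy import optimize

def strength(z):
    # Positive root of E[log(1 + Z^2/eta^2)] = log 4 (assumed form of (35))
    g = lambda eta: np.mean(np.log1p(z**2 / eta**2)) - np.log(4.0)
    return optimize.brentq(g, 1e-9, 1e9)

def rate_distortion(lam, d):
    """Assumed reading of Theorem 16: R(D) = log(lam/D) nats for 0 < D <= lam."""
    return np.log(lam / d) if d <= lam else 0.0

rng = np.random.default_rng(5)
lam, d = 2.0, 0.5
# Test channel from the proof: X = Z + Zbar, with Z, Zbar independent
# centered Cauchy of strengths lam - d and d; the reproduction is Z.
z = (lam - d) * rng.standard_cauchy(500_000)
zbar = d * rng.standard_cauchy(500_000)
x = z + zbar
print(strength(x), lam)            # source strength ~ 2
print(strength(x - z), d)          # distortion strength ~ 0.5
print(rate_distortion(lam, d))     # log(4) ~ 1.386 nats
```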
- 112.
- The continuity of mutual information can be shown under the following sufficient conditions. Theorem 17. Suppose that is a sequence of real-valued random variables that vanishes in strength, Z is independent of , and . Then, (328) holds. Proof. Under the assumptions, . Therefore, , and (328) follows from Theorem 1-(m). □
- 113.
- The assumption is not superfluous for the validity of Theorem 17 even though it was not needed in Theorem 1-(m). Suppose that Z is integer valued, and , where has probability mass function . Then, , while , and therefore, .
- 114.
- In the case in which and are standard spherical multivariate Cauchy random variables with densities in (6), it follows from (7) that has the same distribution as . Therefore, (330) holds, where we have used the scaling law . There is no possibility of a Cauchy counterpart of the celebrated log-determinant formula for additive Gaussian vectors (e.g., Theorem 9.2.1 in [41]) because, as pointed out in Item 7, is not distributed according to the ellipsoidal density in (8) unless and are proportional, in which case the setup reverts to that in (330).
- 115.
- 116.
- The shared information of n random variables is a generalization of mutual information introduced in [54] for deriving the fundamental limit of interactive data exchange among agents who have access to the individual components and establish a dialog to ensure that all of them find out the value of the random vector. The shared information of is defined as , where , with , and the minimum is over all partitions of : , such that . If we divide (338) by , we obtain the shared information of n random variables distributed according to the standard spherical multivariate Cauchy model. This is a consequence of the following result, which is of independent interest. Theorem 18. If are exchangeable random variables, any subset of which have finite differential entropy, then for any partition Π of , (340) holds. Proof. Fix any partition with chunks. Denote by the number of chunks in with cardinality . Therefore, . By exchangeability, any chunk of cardinality k has the same differential entropy, which we denote by . Then, , and the difference between the left and right sides of (340) multiplied by is readily seen to equal , where . Naturally, the same proof applies to n discrete exchangeable random variables with finite joint entropy.
17. Outlook
- 117.
- We have seen that a number of key information theoretic properties pertaining to the Gaussian law are also satisfied in the Cauchy case. Conceptually, those extensions shed light on the underlying reason the conventional Gaussian results hold. Naturally, we would like to explore how far beyond the Cauchy law those results can be expanded. As far as the maximization of differential entropy is concerned, the essential step is to redefine strength, tailoring it to the desired law: fix a reference random variable W with probability density function and finite differential entropy , and define the W-strength of a real-valued random variable Z as . For example,
- (a)
- For , ;
- (b)
- if W is standard normal, then ;
- (c)
- if V is standard Cauchy, then ;
- (d)
- if W is standard exponential, then if a.s., otherwise, ;
- (e)
- if W is standard Subbotin (108) with , then, ;
- (f)
- (g)
- if W is uniformly distributed on , ;
- (h)
- if W is standard Rayleigh, then if a.s., otherwise, .
The pivotal Theorems 3 and 4 admit the following generalization. Theorem 19. Suppose and . Then, (348) holds. Proof. Fix any Z in the feasible set. For any such that , we have . Therefore, , by definition of , thereby establishing ≤ in (348). Equality holds since . □ A corollary to Theorem 19 is a very general form of the Shannon lower bound for the rate–distortion function of a memoryless source Z such that the distortion is constrained to have W-strength not higher than D, namely, . Theorem 19 finds an immediate extension to the multivariate case , where, for with , we have defined . For example, if is zero-mean multivariate Gaussian with positive definite covariance , then .
- 118.
- One aspect in which we have shown that Cauchy distributions lend themselves to simplification unavailable in the Gaussian case is the single-parametrization of their likelihood ratio, which paves the way for a slew of closed-form expressions for f-divergences and Rényi divergences. It would be interesting to identify other multiparameter (even just scale/location) families of distributions that enjoy the same property. To that end, it is natural, though by no means hopeful, to study various generalizations of the Cauchy distribution such as the Student-t random variable, or more generally, the Rider distribution in (9). The information theoretic study of general stable distributions is hampered by the fact that they are characterized by their characteristic functions (e.g., p. 164 in [55]), which so far, have not lent themselves to the determination of relative entropy or even differential entropy.
- 119.
- Although we cannot expect that the cornucopia of information theoretic results in the Gaussian case can be extended to other domains, we have been able to show that a number of those results do find counterparts in the Cauchy case. Nevertheless, much remains to be explored. To name a few,
- (a)
- The concavity of the entropy strength, a counterpart of Costa's entropy power inequality [40], would guarantee the least favorability of Cauchy noise among all strength-constrained noises, as well as the entropy strength inequality
- (b)
- Information theoretic analyses quantifying the approach to normality in the central limit theorem are well-known (e.g., [56,57,58]). It would be interesting to explore the decrease in the relative entropy (relative to the Cauchy law) of independent sums distributed according to a law in the domain of attraction of the Cauchy distribution [55].
- (c)
- Since de Bruijn’s identity is one of the ancestors of the i-mmse formula of [59], and we now have a counterpart of de Bruijn’s identity for convolutions with scaled Cauchy, it is natural to wonder if there may be some sort of integral representation of the mutual information between a random variable and its noisy version contaminated by additive Cauchy noise. In this respect, note that counterparts for the i-mmse formula for models other than additive Gaussian noise have been found in [60,61,62].
- (d)
- Mutual information is robust against the addition of small non-Gaussian contamination in the sense that its effects are the same as if it were Gaussian [63]. The proof methods rely on Taylor series expansions that require the existence of moments. Any Cauchy counterparts (recall Item 77) would require substantially different methods.
- (e)
- Pinsker [41] showed that Gaussian processes are information stable imposing only very mild assumptions. The key is that, modulo a factor, the variance of the information density is upper bounded by its mean, the mutual information. Does the spherical multivariate Cauchy distribution enjoy similar properties?
- 120.
- Although not surveyed here, there are indeed a number of results in the engineering literature advocating Cauchy models in certain heavy-tailed infinite-variance scenarios (see, e.g., [45] and the references therein). At the end, either we abide by the information theoretic maxim that “there is nothing more practical than a beautiful formula”, or we pay heed to Poisson, who, after pointing out in [64] that Laplace’s proof of the central limit theorem broke down for what we now refer to as the Cauchy law, remarked that “Mais nous ne tiendrons pas compte de ce cas particulier, qu'il nous suffira d’avoir remarqué à cause de sa singularité, et qui ne se rencontre sans doute pas dans la pratique” (but we shall not take this particular case into account; it suffices to have noted it because of its singularity, and it doubtless does not occur in practice).
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Definite Integrals
References
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656.
- Verdú, S. The exponential distribution in information theory. Probl. Inf. Transm. 1996, 32, 86–95.
- Anantharam, V.; Verdú, S. Bits through queues. IEEE Trans. Inf. Theory 1996, 42, 4–18.
- Stam, A. Some inequalities satisfied by the quantities of information of Fisher and Shannon. Inf. Control. 1959, 2, 101–112.
- Ferguson, T.S. A representation of the symmetric bivariate Cauchy distribution. Ann. Math. Stat. 1962, 33, 1256–1266.
- Fang, K.T.; Kotz, S.; Ng, K.W. Symmetric Multivariate and Related Distributions; CRC Press: Boca Raton, FL, USA, 2018.
- Rider, P.R. Generalized Cauchy distributions. Ann. Inst. Stat. Math. 1958, 9, 215–223.
- Bouhlel, N.; Rousseau, D. A generic formula and some special cases for the Kullback–Leibler divergence between central multivariate Cauchy distributions. Entropy 2022, 24, 838.
- Abe, S.; Rajagopal, A.K. Information theoretic approach to statistical properties of multivariate Cauchy-Lorentz distributions. J. Phys. A Math. Gen. 2001, 34, 8727–8731.
- Tulino, A.M.; Verdú, S. Random matrix theory and wireless communications. Found. Trends Commun. Inf. Theory 2004, 1, 1–182.
- Widder, D.V. The Stieltjes transform. Trans. Am. Math. Soc. 1938, 43, 7–60.
- Kullback, S. Information Theory and Statistics; Dover: New York, NY, USA, 1968; Originally published in 1959 by John Wiley.
- Wu, Y.; Verdú, S. Rényi information dimension: Fundamental limits of almost lossless analog compression. IEEE Trans. Inf. Theory 2010, 56, 3721–3747.
- Donsker, M.D.; Varadhan, S.R.S. Asymptotic evaluation of certain Markov process expectations for large time, I. Commun. Pure Appl. Math. 1975, 28, 1–47.
- Donsker, M.D.; Varadhan, S.R.S. Asymptotic evaluation of certain Markov process expectations for large time, III. Commun. Pure Appl. Math. 1977, 29, 369–461.
- Lapidoth, A.; Moser, S.M. Capacity bounds via duality with applications to multiple-antenna systems on flat-fading channels. IEEE Trans. Inf. Theory 2003, 49, 2426–2467.
- Subbotin, M.T. On the law of frequency of error. Mat. Sb. 1923, 31, 296–301.
- Kapur, J.N. Maximum-Entropy Models in Science and Engineering; Wiley-Eastern: New Delhi, India, 1989.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley: New York, NY, USA, 2006.
- Dembo, A.; Cover, T.M.; Thomas, J.A. Information theoretic inequalities. IEEE Trans. Inf. Theory 1991, 37, 1501–1518.
- Han, T.S. Information Spectrum Methods in Information Theory; Springer: Heidelberg, Germany, 2003.
- Vajda, I. Theory of Statistical Inference and Information; Kluwer: Dordrecht, The Netherlands, 1989.
- Deza, E.; Deza, M.M. Dictionary of Distances; Elsevier: Amsterdam, The Netherlands, 2006.
- Gradshteyn, I.S.; Ryzhik, I.M. Table of Integrals, Series, and Products, 7th ed.; Academic Press: Burlington, MA, USA, 2007.
- Sason, I.; Verdú, S. f-divergence inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006.
- Nielsen, F.; Okamura, K. On f-divergences between Cauchy distributions. In Proceedings of the International Conference on Geometric Science of Information, Paris, France, 21–23 July 2021; pp. 799–807.
- Eaton, M.L. Group Invariance Applications in Statistics. In Proceedings of the Regional Conference Series in Probability and Statistics; Institute of Mathematical Statistics: Hayward, CA, USA, 1989; Volume 1.
- McCullagh, P. On the distribution of the Cauchy maximum-likelihood estimator. Proc. R. Soc. London. Ser. A Math. Phys. Sci. 1993, 440, 475–479.
- Verdú, S. On channel capacity per unit cost. IEEE Trans. Inf. Theory 1990, 36, 1019–1030.
- Chyzak, F.; Nielsen, F. A closed-form formula for the Kullback–Leibler divergence between Cauchy distributions. arXiv 2019, arXiv:1905.10965.
- Verdú, S. Mismatched estimation and relative entropy. IEEE Trans. Inf. Theory 2010, 56, 3712–3720.
- Csiszár, I. I-Divergence geometry of probability distributions and minimization problems. Ann. Probab. 1975, 3, 146–158.
- Sason, I.; Verdú, S. Bounds among f-divergences. arXiv 2015, arXiv:1508.00335.
- Abramowitz, M.; Stegun, I.A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables; US Government Printing Office: Washington, DC, USA, 1964; Volume 55.
- Rényi, A. On measures of information and entropy. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability; Neyman, J., Ed.; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
- Gil, M.; Alajaji, F.; Linder, T. Rényi divergence measures for commonly used univariate continuous distributions. Inf. Sci. 2013, 249, 124–131.
- González, M. Elliptic integrals in terms of Legendre polynomials. Glasg. Math. J. 1954, 2, 97–99.
- Nielsen, F. Revisiting Chernoff information with likelihood ratio exponential families. Entropy 2022, 24, 1400.
- Fisher, R.A. Theory of statistical estimation. Math. Proc. Camb. Philos. Soc. 1925, 22, 700–725.
- Costa, M.H.M. A new entropy power inequality. IEEE Trans. Inf. Theory 1985, 31, 751–760.
- Pinsker, M.S. Information and Information Stability of Random Variables and Processes; Holden-Day: San Francisco, CA, USA, 1964; Originally published in Russian in 1960.
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Pinsker, M.S. Calculation of the rate of message generation by a stationary random process and the capacity of a stationary channel. Dokl. Akad. Nauk 1956, 111, 753–766.
- Ihara, S. On the capacity of channels with additive non-Gaussian noise. Inf. Control. 1978, 37, 34–39.
- Fahs, J.; Abou-Faycal, I.C. A Cauchy input achieves the capacity of a Cauchy channel under a logarithmic constraint. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 3077–3081.
- Rioul, O.; Magossi, J.C. On Shannon’s formula and Hartley’s rule: Beyond the mathematical coincidence. Entropy 2014, 16, 4892–4910.
- Dytso, A.; Egan, M.; Perlaza, S.; Poor, H.; Shamai, S. Optimal inputs for some classes of degraded wiretap channels. In Proceedings of the 2018 IEEE Information Theory Workshop, Guangzhou, China, 25–29 November 2018; pp. 1–7.
- Cover, T.M. Some advances in broadcast channels. In Advances in Communication Systems; Viterbi, A.J., Ed.; Academic Press: New York, NY, USA, 1975; Volume 4, pp. 229–260.
- Wyner, A.D. Recent results in the Shannon theory. IEEE Trans. Inf. Theory 1974, 20, 2–9.
- Berger, T. Rate Distortion Theory; Prentice-Hall: Englewood Cliffs, NJ, USA, 1971.
- Koshelev, V.N. Estimation of mean error for a discrete successive approximation scheme. Probl. Inf. Transm. 1981, 17, 20–33.
- Equitz, W.H.R.; Cover, T.M. Successive refinement of information. IEEE Trans. Inf. Theory 1991, 37, 269–274.
- Kotz, S.; Nadarajah, S. Multivariate t-Distributions and Their Applications; Cambridge University Press: Cambridge, UK, 2004.
- Csiszár, I.; Narayan, P. The secret key capacity of multiple terminals. IEEE Trans. Inf. Theory 2004, 50, 3047–3061.
- Kolmogorov, A.N.; Gnedenko, B.V. Limit Distributions for Sums of Independent Random Variables; Addison-Wesley: Reading, MA, USA, 1954.
- Barron, A.R. Entropy and the central limit theorem. Ann. Probab. 1986, 14, 336–342.
- Artstein, S.; Ball, K.; Barthe, F.; Naor, A. Solution of Shannon’s problem on the monotonicity of entropy. J. Am. Math. Soc. 2004, 17, 975–982.
- Tulino, A.M.; Verdú, S. Monotonic decrease of the non-Gaussianness of the sum of independent random variables: A simple proof. IEEE Trans. Inf. Theory 2006, 52, 4295–4297.
- Guo, D.; Shamai, S.; Verdú, S. Mutual information and minimum mean-square error in Gaussian channels. IEEE Trans. Inf. Theory 2005, 51, 1261–1282.
- Guo, D.; Shamai, S.; Verdú, S. Mutual information and conditional mean estimation in Poisson channels. IEEE Trans. Inf. Theory 2008, 54, 1837–1849.
- Jiao, J.; Venkat, K.; Weissman, T. Relations between information and estimation in discrete-time Lévy channels. IEEE Trans. Inf. Theory 2017, 63, 3579–3594.
- Arras, B.; Swan, Y. IT formulae for gamma target: Mutual information and relative entropy. IEEE Trans. Inf. Theory 2018, 64, 1083–1091.
- Pinsker, M.S.; Prelov, V.; Verdú, S. Sensitivity of channel capacity. IEEE Trans. Inf. Theory 1995, 41, 1877–1888.
- Poisson, S.D. Sur la probabilité des résultats moyens des observations. In Connaisance des Tems, ou des Mouvemens Célestes a l’usage des Astronomes, et des Navigateurs, pour l’an 1827; Bureau des longitudes: Paris, France, 1824; pp. 273–302.