Article

Entropy and Its Discontents: A Note on Definitions

by
Nicola Cufaro Petroni
Dipartimento di Matematica and TIRES, Università di Bari, INFN Sezione di Bari via E. Orabona 4, 70125 Bari, Italy
Entropy 2014, 16(7), 4044-4059; https://doi.org/10.3390/e16074044
Submission received: 29 May 2014 / Revised: 27 June 2014 / Accepted: 8 July 2014 / Published: 17 July 2014
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

The routine definitions of Shannon entropy for both discrete and continuous probability laws show inconsistencies that make them not reciprocally coherent. We propose a few possible modifications of these quantities so that: (1) they no longer show incongruities; and (2) they go one into the other in a suitable limit as the result of a renormalization. The properties of the new quantities differ slightly from those of the usual entropies in a few other respects.
PACS Classifications:
02.50.Cw; 05.45.Tp
MSC Classifications:
94A17; 54C70

1. Introduction

As it is usually defined, the Shannon entropy of a discrete law pk = P{xk}, associated with the values xk of some random variable, is:
H = -\sum_{k} p_k \ln p_k
and is apparently a non-negative, dimensionless quantity. As a matter of fact, however, it does not depend on all of the details of the distribution: for instance, only the pk are relevant, while the xk play no role at all. This means that if we modify our distribution just by moving the xk, the entropy is left the same: this entails, among other things, that H does not always change along with the variance (or other typical parameters) of the distribution, which instead is contingent on the xk values. In particular, H is invariant under every linear transformation axk + b (centering and rescaling) of the random quantities: in this sense, every type of law [1] is isentropic. Surprisingly enough, despite the simplicity of Definition (1) and beyond a few elementary examples, explicit formulas displaying the dependence of the entropy H on the parameters of the most common discrete distributions are not known. If, for instance, we take the entropy H of the binomial distributions 𝔅n,p with:
x_k = k \qquad p_k = \binom{n}{k} p^k (1-p)^{n-k} \qquad k = 0, 1, \ldots, n
although it would always be possible to calculate the entropy H for every particular example, no general formula giving its explicit dependence on n and p is available, and only its asymptotic behavior for large n is known in the literature [2,3]:
H[𝔅_{n,p}] = \frac{1}{2}\ln[2\pi e\, np(1-p)] + \frac{4p(1-p)-1}{12\,np(1-p)} + O\!\left(\frac{1}{n^2}\right)
It should be remarked, moreover, that while this formula explicitly contains np(1 − p), namely the variance of 𝔅n,p, it is easy to recognize that, as long as we leave the n + 1 probabilities pk untouched, the entropy H[𝔅n,p] remains the same when we change the variance by moving the points xk away from their usual locations xk = k. In particular, this is true for the standardized (centered, unit variance) binomial 𝔅*n,p with:
x_k = \frac{k - np}{\sqrt{np(1-p)}} \qquad k = 0, 1, \ldots, n
and the same pk of (2), which entails H[𝔅n,p] = H[𝔅*n,p]. All this hints at the fact that what is relevant to the entropy is not the variance itself, but some other feature, possibly related to the shape of the distribution. In a similar vein, for the Poisson distributions 𝔓λ with:
x_k = k \qquad p_k = e^{-\lambda}\,\frac{\lambda^k}{k!} \qquad k = 0, 1, \ldots
the entropy is:
H[𝔓_\lambda] = \lambda(1 - \ln\lambda) + e^{-\lambda}\sum_{k=0}^{\infty}\frac{\lambda^k \ln k!}{k!}
with an asymptotic expression for large λ:
H[𝔓_\lambda] = \frac{1}{2}\ln(2\pi e\lambda) - \frac{1}{12\lambda} - \frac{1}{24\lambda^2} - \frac{19}{360\lambda^3} + O(\lambda^{-4})
which explicitly contains the parameter λ (also playing the role of the variance), but which is again completely independent of the values of the xk's. As a consequence, a standardized Poisson distribution 𝔓*λ, with:
x_k = \frac{k - \lambda}{\sqrt{\lambda}} \qquad k = 0, 1, \ldots
and the same probabilities pk, has the same entropy as 𝔓λ, namely H[𝔓λ] = H[𝔓*λ].
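As a purely illustrative numerical sketch of these remarks (Python with numpy and scipy is assumed; the values of n and p are arbitrary), the entropy of a binomial law can be computed directly from its probabilities pk, where no xk appears, and compared with the asymptotic expansion (3):

```python
# A minimal numerical check (numpy and scipy assumed): H depends only on the
# p_k, so a binomial law and its standardized version share the same entropy,
# which for large n is well reproduced by the asymptotic expansion (3).
import numpy as np
from scipy.stats import binom

n, p = 200, 0.3                               # arbitrary illustrative values
pk = binom.pmf(np.arange(n + 1), n, p)
pk = pk[pk > 0]
H_exact = -np.sum(pk * np.log(pk))            # Definition (1): the x_k never enter this sum
H_asymp = 0.5 * np.log(2 * np.pi * np.e * n * p * (1 - p)) \
          + (4 * p * (1 - p) - 1) / (12 * n * p * (1 - p))
print(H_exact, H_asymp)                       # both ≈ 3.29, agreeing to O(1/n^2)
```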
When, on the other hand, we consider continuous laws (for short, we will call continuous the laws possessing a pdf f(x), without insisting on the difference between continuous and absolutely continuous distributions, which is not relevant here), Definition (1) no longer applies, and we are led to introduce another quantity, commonly known as the differential entropy (we acknowledge that this name for an integral could be misleading, but we will retain it in the following to abide by a long-established habit):
h = -\int_{\mathbb{R}} f(x)\,\ln f(x)\,dx
which, in several respects, differs from the entropy (1) of the discrete distributions. First of all, explicit formulas for the entropy (9) are known for most of the usual laws: for example (see also Appendix A), the distributions 𝔘(a), uniform on [0, a] with a > 0, have entropy:
h[𝔘] = \ln a
while for the centered Gaussian laws 𝔑(a) with variance a², we have:
h[𝔑] = \ln\!\left(a\sqrt{2\pi e}\right)
An exhaustive list of similar formulas for other families of laws is widely available in the literature, but even from these two examples alone it is apparent that:
(1)
at variance with the discrete case, the differential entropies explicitly depend on a scaling parameter a, showing now a dependence either on the variance or on some other dispersion index, such as an interquantile range (IQnR); this means, in particular, that the types of continuous laws are no longer isentropic;
(2)
the differential entropies can take negative values when the parameters of the laws are chosen in such a way that the argument of the logarithm falls below 1;
(3)
the arguments of the logarithms are not in general dimensionless quantities, in apparent violation of the homogeneity rule that the scalar arguments of transcendental functions (such as logarithms) must be dimensionless; this entails, in particular, that the entropy depends on the units of measurement.
These three remarks hence make it abundantly clear that something is inscribed in Definition (9) that is not present in Definition (1), and vice versa.
Finally, the two definitions seem not to be reciprocally consistent in the sense that, when, for instance, a continuous law is weakly approximated by a sequence of discrete laws, we would like to see the entropies of the discrete distributions converge toward the entropy of the continuous one. That this is not the case is apparent from a few counterexamples. It is well known, for instance, that, for every 0 < p < 1, the sequence of the standardized binomial laws 𝔅*n,p weakly converges to the Gaussian 𝔑(1) when n → ∞; however, since the binomial probabilities pk are unaffected by a standardization, the entropies H[𝔅*n,p] still obey Formula (3), and hence their sequence diverges as ln n instead of converging to the differential entropy of 𝔑(1), which, from (11), is ln√(2πe). In the same vein, the cdf F(x) of a uniform law 𝔘(a) can be approximated by the sequence Fn(x) of the discrete uniform laws 𝔘n(a) concentrated with equal probabilities p1 = ... = pn = 1/n on the n equidistant points x1, ..., xn, where xk = kΔ for k = 1, 2, ..., n, with Δ = xk − xk−1 = a/n and x0 = 0. However, it is easy to see that:
H[𝔘_n] = -\sum_{k=1}^{n}\frac{1}{n}\ln\frac{1}{n} = \ln n
so that their sequence again diverges as ln n, while the differential entropy h[𝔘] of the uniform law has the finite value (10).
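The mismatch can be seen with a few lines of code (a sketch assuming numpy; the value of a is arbitrary): the entropies of the discrete uniform laws grow like ln n whatever the value of a, while the differential entropy of the limiting law stays at ln a.

```python
# A short illustration of the divergence (numpy assumed): the Shannon entropies
# of the discrete uniform laws U_n(a) grow like ln n, while the differential
# entropy of the limiting law U(a) is the finite constant ln a.
import numpy as np

a = 1.0
for n in (10, 100, 1000, 10000):
    pk = np.full(n, 1.0 / n)
    H = -np.sum(pk * np.log(pk))      # = ln n, regardless of where the x_k sit
    print(n, H, np.log(a))            # H diverges; h[U(a)] = ln a stays put (0 here)
```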
As a consequence of these remarks, in the following sections we will propose a few elementary ways to change the two definitions, (1) and (9), in order to rid them of the said inconsistencies and to make them reciprocally coherent, without losing too much of the essential properties of the usual quantities. These new definitions, moreover, operate an effective renormalization of the said divergences, so that now, when a continuous law is weakly approximated by a sequence of discrete laws, the entropies of the discrete distributions also converge toward the entropy of the continuous one. A few additional points with examples and explicit calculations are finally collected in the appendices. It must be clearly stated at this point, however, that we do not claim here that the Shannon entropy is somehow ill-defined in itself: we rather point out a few reciprocal inconsistencies among the different manifestations of this time-honored concept, and we try to attune them in such a way that every probability distribution (either discrete or continuous) is treated on the same footing.

2. Entropy for Continuous Laws

Let us begin with some remarks about the differential entropy for continuous laws with a pdf f(x): the simplest way to achieve the essentials of our aims would be to adopt some new definition of the type:
-\int_{\mathbb{R}} f(x)\,\ln[\kappa\, f(x)]\,dx = h - \ln\kappa
where κ is any parameter of the law f(x) with the same dimensions as x and with a finite and strictly positive value for every non-degenerate law. To this end, the first idea that comes to the fore consists in taking the standard deviation σ to play the role of κ in (13), but it is also apparent that this choice would restrict our definition to the continuous laws with a finite second moment, leaving out many important cases. A strong alternative candidate for the role of κ could instead be some interquantile range (IQnR), which can represent a measure of the dispersion even when the variance does not exist. In the following, we will analyze a few possible choices for the parameter κ along with their principal consequences.

2.1. Interquantile Range

The calculation of the IQnR goes through the quantile function Q(p), namely the inverse of the cumulative distribution function (cdf). In order to take into account possible jumps and flat spots of a given cdf F(x), the quantile function is usually defined as:
Q(p) = \inf\{x \in \mathbb{R} : p \le F(x)\} \qquad 0 \le p \le 1
In the case of continuous laws (no jumps), however, this can be reduced to:
Q(p) = \inf\{x \in \mathbb{R} : p = F(x)\}
and when F (x) is also strictly increasing (no flat spots), we finally have:
Q(p) = F^{-1}(p)
It is apparent then that Q(p) jumps wherever F (x) has flat spots, while it has flat spots wherever F (x) jumps. The IQnR function is then defined as:
\varrho(p) = Q(1-p) - Q(p) \qquad 0 < p < \tfrac{1}{2}
and the classical interquartile range (IQrR) is just the particular value:
\varrho\!\left(\tfrac{1}{4}\right) = Q\!\left(\tfrac{3}{4}\right) - Q\!\left(\tfrac{1}{4}\right)
The IQnR ϱ(p) is a non-increasing function of p, and for continuous laws (since Q(p) has no flat spots) it is always well defined and never vanishes, so that one of its values can safely play the role of κ in (13). Of course, when a law also has a finite second moment, the IQnR ϱ and the standard deviation σ are both well defined, and the ratio γ = ϱ/σ often takes the same value for entire families of laws. We now propose to adopt a new form for the entropy of continuous laws which, by making use of some particular value of the IQnR, denoted ϱ̃, instead of the variance, will encompass even the case of the laws without a finite second moment:
\tilde{h} = -\int_{\mathbb{R}} f(x)\,\ln[\tilde{\varrho}\, f(x)]\,dx = h - \ln\tilde{\varrho}
In particular, for the continuous laws, we can simply take ϱ̃ = ϱ(1/4), the IQrR.
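As a sketch of how Definition (19) can be evaluated in practice (scipy is assumed; the helper name h_tilde and the chosen laws are ours and purely illustrative), the quantile function of a scipy distribution is its ppf and the differential entropy h is returned by its entropy() method:

```python
# A sketch of definition (19), h~ = h - ln(rho~), with rho~ the IQrR
# Q(3/4) - Q(1/4) (scipy assumed; the helper name is ours).
import numpy as np
from scipy.stats import norm, cauchy

def h_tilde(dist):
    iqr = dist.ppf(0.75) - dist.ppf(0.25)   # quantile function Q(p) = ppf(p)
    return dist.entropy() - np.log(iqr)     # differential entropy h minus ln(rho~)

print(h_tilde(norm(scale=1)), h_tilde(norm(scale=7)))   # both ≈ 1.11959: the scale drops out
print(h_tilde(cauchy(scale=2)))                         # ≈ ln(2*pi): defined even without a variance
```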
Despite the minimality of this change of definition, however, the new entropy h̃ has properties slightly different from those of h. The examples of Appendix A show that, at variance with the usual differential entropy h, this new entropy has neither a minimum nor a maximum value, because, according to the particular continuous law considered, it can take every possible real value, both positive and negative. In this respect, we must instead recall the well-known property of the Gaussian laws 𝔑(a), which qualify as the laws with the maximum differential entropy h among all of the continuous laws with the same variance σ². It is apparent then that, within our new definition (19), this special position of the Gaussian laws will simply be lost.
The adoption of (19), however, brings several benefits that will also be made apparent in the examples of Appendix A: first of all, the argument of the logarithm is now by definition a dimensionless quantity, so that the value of h̃ becomes invariant under a change of measurement units. Second, the new entropy will no longer depend on the value of scaling parameters linked to the variance: its values are determined by the form of the distribution, rather than by its actual numerical dispersion, and will be the same for entire families of laws. When, in fact, the variables are subject to some linear transformation y = ax + b (with a > 0, as in changes of unit of measurement), the differential entropy h changes with the new pdf according to:
-\int_{\mathbb{R}} \frac{1}{a}\, f\!\left(\frac{y-b}{a}\right)\ln\!\left[\frac{1}{a}\, f\!\left(\frac{y-b}{a}\right)\right] dy = -\int_{\mathbb{R}} f(x)\,\ln f(x)\,dx + \ln a
namely, it depends explicitly on the scaling parameter a, while it is independent of the centering parameter b. It is apparent, moreover, that, according to these remarks, the quantile function of the transformed cdf:
F\!\left(\frac{x-b}{a}\right)
is changed into aF−1(p) + b = aQ(p) + b, so that any IQnR is modified according to aϱ(p); namely, it is again sensitive only to the scaling parameter a, and not to the centering one. As a consequence, the modifications of both h and ϱ̃ under a linear transformation of the variables cancel out exactly, so that h̃, as defined in (19), is always left unchanged: this means, in particular, that the types of laws are isentropic.
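This cancellation can be checked numerically; the sketch below (scipy assumed, with an arbitrary gamma law as the test case) shows that h and ln ϱ̃ both shift by ln a under a rescaling while their difference does not move.

```python
# A numerical check of the cancellation (scipy assumed): under y = a*x + b,
# h gains ln a and ln(rho~) gains the same ln a, so h~ = h - ln(rho~) is a
# constant on the whole type of laws.
import numpy as np
from scipy.stats import gamma

base  = gamma(a=2.5)                          # shape 2.5, scale 1, loc 0
moved = gamma(a=2.5, loc=-4.0, scale=7.3)     # same type, recentred and rescaled

for dist in (base, moved):
    h = dist.entropy()
    rho = dist.ppf(0.75) - dist.ppf(0.25)
    print(h, np.log(rho), h - np.log(rho))    # first two columns shift by ln 7.3, the third does not
```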

2.2. Variance and Scaling Parameters

By restricting ourselves to the continuous laws with a finite second moment and standard deviation σ, an alternative redefinition of the differential entropy could be considered as:
-\int_{\mathbb{R}} f(x)\,\ln[\sigma f(x)]\,dx = h - \ln\sigma
This form of differential entropy would bring the same benefits as h̃: the argument of the logarithm is dimensionless, and it no longer depends on the value of scaling parameters. Since, however, the dimensional parameter is the standard deviation, it is possible to show that the Gaussian laws would now keep their usual role of maximum entropy laws, and this suggests a further possible change of definition (please notice the change of sign):
\hat{h} = \int_{\mathbb{R}} f(x)\,\ln\!\left[\sigma\sqrt{2\pi e}\; f(x)\right] dx = \ln\!\left(\sigma\sqrt{2\pi e}\right) - h
As shown in Appendix A, all of the Gaussian laws 𝔑(a) will now have ĥ[𝔑] = 0 and, because of the change of sign, this value will now represent the minimum for all the other laws, irrespective of their variance: as a consequence, the entropy ĥ of all of the continuous distributions with finite variance will now be non-negative, as is the entropy H of the discrete laws.
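A quick check of this behavior (scipy assumed; the helper name h_hat and the chosen laws are ours):

```python
# A small check of the sign-flipped definition (21) (scipy assumed):
# h^ = ln(sigma*sqrt(2*pi*e)) - h is zero for every Gaussian and positive
# for the other finite-variance laws tried here.
import numpy as np
from scipy.stats import norm, uniform, expon

def h_hat(dist):
    return np.log(dist.std() * np.sqrt(2 * np.pi * np.e)) - dist.entropy()

print(h_hat(norm(scale=5)))      # 0 for any Gaussian, whatever its variance
print(h_hat(uniform(scale=3)))   # ≈ 0.1765 = ln(sqrt(pi*e/6))
print(h_hat(expon(scale=2)))     # ≈ 0.4189 = ln(sqrt(2*pi/e))
```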
For laws lacking a finite second moment (such as the Cauchy laws), we would have no entropy ĥ, because these distributions have no variance to speak of: this is an apparent shortcoming of Definition (21) of ĥ, and to get around this weakness, we introduced Definition (19) of h̃ by exploiting the properties of the IQnR ϱ(p), which is always well defined for every possible distribution. It is worth remarking, however, not only that these are not the only two possible choices, but also that even seemingly harmless modifications can imply slightly different properties. For instance, going back to the remarks at the end of Section 2.1, it is well known that by linear transformations of the variables (with a > 0 to simplify), every continuous law f(x) spans a type of continuous laws:
\frac{1}{a}\, f\!\left(\frac{x-b}{a}\right)
As already pointed out, the centering parameter b has no influence on the value of the entropy, while the scaling parameter a would change the differential entropy h of Definition (9) by an additional ln a. As a consequence, by simply adopting as a new definition:
\bar{h} = -\int_{\mathbb{R}} f(x)\,\ln[a\, f(x)]\,dx = h - \ln a
where a is the parameter locating the law within its type, we would get an entropy invariant under rescaling. It is apparent that Definition (22) treats a just as a parameter, and not as a measure of dispersion, and it is interesting to notice that it also entails a few consequences shown in the examples of Appendix A. In particular, the h̄ entropy again takes all of the (positive and negative) real values, and hence there is no such thing as a maximum entropy distribution, as in the case of the h̃ entropy.

3. Entropy for Discrete Laws

We could now naively extend our previous re-definitions to the discrete laws simply by taking H − ln κ, with H given by (1) and a suitable choice of κ; but in so doing, we would miss a chance to reconcile the two forms (discrete and continuous) of our entropy in some limiting behavior. We find it more convenient, then, to introduce some further changes that, for the sake of generality, we will discuss in the setting of Section 2.1, where κ is an IQnR.

3.1. Renormalization

In order to extend Definition (19) to the discrete distributions, we must first remark that, at variance with the continuous case, the IQrR ϱ(1/4) can now vanish and, hence, cannot be immediately adopted as κ in our definitions. For the discrete laws, in fact, F(x) makes jumps and, hence, Q(p) has flat spots, so that ϱ(p) can be zero for some values of p; in particular, this can happen also for p = 1/4. If, however, our distributions are purely discrete (a few remarks about the more general case of mixtures can be found in Appendix B), ϱ(p) is a non-increasing function of p which changes values only by jumping and which is constant between subsequent jumps. As a consequence, with the only exception of the degenerate laws (which have a constant Q(p) and, hence, a ϱ vanishing for every p), ϱ(p) certainly takes non-zero values for some 0 < p ≤ 1/4, even when ϱ(1/4) = 0. We can then use as the dimensional constant ϱ̃ in our definitions the smallest non-zero IQnR larger than or equal to the IQrR ϱ(1/4): more precisely, if 𝒫 is the set of all of the values of ϱ(p) for 0 < p ≤ 1/4 and 𝒫0 = 𝒫\{0}, we will take ϱ̃ = min 𝒫0 > 0. We remark that, in particular, we again have ϱ̃ = ϱ(1/4) whenever the IQrR does not vanish.
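A tiny illustration of this recipe on a hypothetical three-point law (numpy assumed; the probabilities are chosen only so that the IQrR vanishes):

```python
# A hypothetical three-point law whose IQrR vanishes (numpy assumed): rho~
# then falls back on the smallest non-zero IQnR over 0 < p <= 1/4.
import numpy as np

xk = np.array([0.0, 1.0, 2.0])
pk = np.array([0.2, 0.7, 0.1])
Fk = np.cumsum(pk)                                    # ≈ [0.2, 0.9, 1.0]
Q = lambda p: xk[np.searchsorted(Fk, p)]              # Q(p) = inf{x : p <= F(x)}

print(Q(0.75) - Q(0.25))      # IQrR = 0: unusable as a dimensional constant
# rho(p) is piecewise constant here: 2 on (0, 0.1), 1 on [0.1, 0.2], 0 on (0.2, 0.25]
rho = np.array([Q(1 - p) - Q(p) for p in (0.05, 0.15, 0.22)])
print(rho[rho > 0].min())     # rho~ = 1, the smallest non-zero IQnR
```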
We start by remarking that if F (x) is the cumulative distribution function of a discrete distribution concentrated on xk with probabilities pk for k = 1, 2 . . ., by taking:
\Delta x_k = x_k - x_{k-1} \qquad \Delta F_k = F(x_k) - F(x_{k-1}) = p_k \qquad k = 1, 2, \ldots
(with x0 < x1 and, hence, F(x0) = 0: for instance, x0 = x1 − infk≥2 Δxk, so that Δx1 = infk≥2 Δxk), we see, first, that Definition (1) can be immediately recast in the form:
H = -\sum_{k\ge 1} \Delta F_k \ln \Delta F_k
Since, on the other hand, many typical discrete distributions (binomial, Poisson ...) describe counting experiments, in many instances, we have Δxk = 1, and in these cases (since Δxk is also dimensionless), we could also write:
H = -\sum_{k\ge 1} \Delta F_k \ln\frac{\Delta F_k}{\Delta x_k}
By comparing this expression with the definition of differential entropy (9) and by recalling that for a continuous distribution, we have f (x) = F′(x), we are led to propose as a new definition of the entropy of a discrete law the quantity:
\tilde{H} = -\sum_{k\ge 1} \frac{\Delta F_k}{\Delta x_k}\,\ln\!\left(\tilde{\varrho}\,\frac{\Delta F_k}{\Delta x_k}\right)\Delta x_k = H + \sum_{k\ge 1} p_k \ln\Delta x_k - \ln\tilde{\varrho}
In general, even for discrete distributions, Δxk is not dimensionless, but this is apparently compensated for by means of ϱ̃. Definition (23) has properties that are similar to those of the new differential entropy h̃ defined in the previous section, but the main benefit of this new formulation is that now, as will be discussed in the next section, the differential entropy (19) of a continuous law can be recovered as the limit of the entropies H̃ of a sequence of approximating discrete laws. In fact, the new Definition (23) effectively renormalizes the traditional entropy H in such a way that the asymptotic divergences pointed out in Section 1 are exactly compensated for by our dimensional parameters. These conclusions also hold for a suitable extension of the alternative Definitions (21) and (22) of ĥ and h̄, respectively.
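A compact sketch of Definition (23) for a purely discrete, non-degenerate law is given below (numpy assumed; the function name and the breakpoint-probing strategy used to obtain ϱ̃ are ours):

```python
# A sketch of definition (23), H~ = H + sum_k p_k ln(Delta x_k) - ln(rho~),
# for a purely discrete, non-degenerate law (numpy assumed).
import numpy as np

def H_tilde(xk, pk):
    xk, pk = np.asarray(xk, float), np.asarray(pk, float)
    Fk = np.cumsum(pk)
    Fk[-1] = 1.0                                     # guard against round-off in the cumulative sum
    Q = lambda p: xk[np.searchsorted(Fk, p)]         # quantile function of the discrete law

    # rho~: the IQrR when non-zero, otherwise the smallest non-zero IQnR. Since
    # rho(p) only changes at the levels F_k or 1 - F_k, probing those breakpoints
    # and the midpoints between them captures every value taken on (0, 1/4].
    bps = np.concatenate([Fk, 1.0 - Fk, [0.25]])
    bps = np.unique(bps[(bps > 0.0) & (bps <= 0.25)])
    grid = np.concatenate([bps, (np.concatenate([[0.0], bps[:-1]]) + bps) / 2.0])
    rho = np.array([Q(1.0 - p) - Q(p) for p in grid])
    rho_t = rho[rho > 0].min()

    dxk = np.diff(xk)
    dxk = np.concatenate([[dxk.min()], dxk])         # convention: Delta x_1 = inf_k Delta x_k
    return -np.sum(pk * np.log(pk)) + np.sum(pk * np.log(dxk)) - np.log(rho_t)

print(H_tilde([0.0, 1.0, 2.0], [0.2, 0.7, 0.1]))     # three-point law of Section 3.1: rho~ = 1, H~ ≈ 0.802
```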

3.2. Convergence

We will discuss in this section a few particular examples showing that the two quantities h̃ of (19) and H̃ of (23) are no longer disconnected concepts, as happens for the usual Definitions (1) and (9). Let us consider first the case of the binomial laws 𝔅n,p and of their standardized versions 𝔅*n,p, already introduced in Section 1: we know that H[𝔅n,p] = H[𝔅*n,p] and that both of these entropies diverge as ln n when n → ∞. On the other hand, since Δxk = 1 for 𝔅n,p, from (23) we get:
\tilde{H}[𝔅_{n,p}] = H[𝔅_{n,p}] - \ln\tilde{\varrho}
For binomial laws with fixed p and n large enough, the IQrR does not vanish, so that ϱ̃ = ϱ(1/4) and, by introducing the ratio:
\gamma = \frac{\varrho(1/4)}{\sigma}
we have:
\tilde{\varrho} = \gamma\,\sigma = \gamma[𝔅_{n,p}]\sqrt{np(1-p)}
and hence:
\tilde{H}[𝔅_{n,p}] = H[𝔅_{n,p}] - \ln\tilde{\varrho} = H[𝔅_{n,p}] - \ln\!\left(\gamma[𝔅_{n,p}]\sqrt{np(1-p)}\right)
From (3), for large n, since from the binomial limit theorem we have γ[𝔅n,p] → γ[𝔑] as n → ∞, while from (27) and (28) in Appendix A it is:
\gamma[𝔑] = \Phi^{-1}\!\left(\tfrac{3}{4}\right) - \Phi^{-1}\!\left(\tfrac{1}{4}\right)
by taking into account (29) of Appendix A, we finally get:
\tilde{H}[𝔅_{n,p}] = \frac{1}{2}\ln[2\pi e\,np(1-p)] + \frac{4p(1-p)-1}{12\,np(1-p)} + O\!\left(\frac{1}{n^2}\right) - \ln\!\left(\gamma[𝔅_{n,p}]\sqrt{np(1-p)}\right) = \ln\frac{\sqrt{2\pi e}}{\gamma[𝔅_{n,p}]} + \frac{4p(1-p)-1}{12\,np(1-p)} + O\!\left(\frac{1}{n^2}\right) \;\xrightarrow[n\to\infty]{}\; \ln\frac{\sqrt{2\pi e}}{\gamma[𝔑]} = \tilde{h}[𝔑]
which is the first example of the convergence of entropies to differential entropies in the framework of the new definitions. The same result is achieved for H̃[𝔅*n,p], because now σ = 1, so that ϱ̃ = γ[𝔅*n,p], while from (4) we have, for every k:
\Delta x_k = \frac{1}{\sqrt{np(1-p)}}
and hence, from (23), we get:
\tilde{H}[𝔅^*_{n,p}] = H[𝔅^*_{n,p}] - \ln\!\left(\gamma[𝔅^*_{n,p}]\sqrt{np(1-p)}\right)
On the other hand, we know that H[𝔅*n,p] = H[𝔅n,p], so that from γ[𝔅*n,p] → γ[𝔑] as n → ∞, we get again:
\tilde{H}[𝔅^*_{n,p}] \xrightarrow[n\to\infty]{} \tilde{h}[𝔑]
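This convergence can also be observed numerically; the sketch below (scipy assumed; p = 0.3 is an arbitrary choice) uses the fact that for 𝔅n,p one has Δxk = 1 and, for these values of n, ϱ̃ equal to the IQrR:

```python
# A numerical check of the renormalized convergence (scipy assumed): for B(n,p),
# Delta x_k = 1 and rho~ = IQrR, so H~ = H - ln(IQrR); it approaches
# h~[N] ≈ 1.11959 instead of diverging like ln n.
import numpy as np
from scipy.stats import binom, norm

p = 0.3                                                  # arbitrary illustrative value
for n in (10, 100, 1000, 10000):
    pk = binom.pmf(np.arange(n + 1), n, p)
    pk = pk[pk > 0]
    H = -np.sum(pk * np.log(pk))
    iqr = binom.ppf(0.75, n, p) - binom.ppf(0.25, n, p)  # non-zero for these n
    print(n, H, H - np.log(iqr))                         # H grows like ln n; H - ln(IQrR) -> 1.11959...
print(np.log(np.sqrt(2 * np.pi * np.e) / (norm.ppf(0.75) - norm.ppf(0.25))))   # the limit h~[N]
```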
Similar results hold for the Poisson laws 𝔓λ: we already remarked in Section 1 that, while 𝔓*λ weakly converges to 𝔑(1) for λ → ∞, the entropies H[𝔓λ] = H[𝔓*λ] diverge as ln λ. It could be shown instead that both H̃[𝔓λ] and H̃[𝔓*λ] converge to h̃[𝔑], because now we respectively have σ = √λ and Δxk = 1/√λ.
In a similar way for the discrete uniform distributions 𝔘n(a) introduced in Section 1, we now have:
x_k = \frac{ka}{n} \qquad \Delta x_k = \frac{a}{n} \qquad p_k = \frac{1}{n} \qquad k = 1, \ldots, n
so that, since ϱ̃[𝔘n] is the IQrR ϱ(1/4), again from (12) and (23) we get:
\tilde{H}[𝔘_n] = \ln n + \ln\frac{a}{n} - \ln\tilde{\varrho}[𝔘_n] = \ln a - \ln\tilde{\varrho}[𝔘_n]
If we then remember from (30) that ϱ̃[𝔘n] → ϱ̃[𝔘] = a/2 for n → ∞, we finally get from (31):
\tilde{H}[𝔘_n] \xrightarrow[n\to\infty]{} \ln a - \ln\tilde{\varrho}[𝔘] = \ln a - \ln\frac{a}{2} = \ln 2 = \tilde{h}[𝔘]
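A direct numerical confirmation of this last limit (numpy assumed; a = 3 is arbitrary):

```python
# A quick check on the discrete uniform laws (numpy assumed): with Delta x_k = a/n
# and rho~ equal to the (non-vanishing) IQrR, H~[U_n(a)] settles at ln 2 = h~[U(a)].
import numpy as np

a = 3.0                                            # arbitrary illustrative value
for n in (8, 100, 1001):
    xk = a * np.arange(1, n + 1) / n
    Fk = np.arange(1, n + 1) / n
    Q = lambda p: xk[np.searchsorted(Fk, p)]
    iqr = Q(0.75) - Q(0.25)                        # -> a/2 as n grows
    print(n, np.log(n) + np.log(a / n) - np.log(iqr), np.log(2.0))
```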
We are then allowed to conjecture that this is a general behavior: within the framework of our Definitions (23) of H̃ and (19) of h̃, whenever, as in our previous examples, a sequence of purely discrete laws 𝔄n weakly converges to a continuous law 𝔄, then also H̃[𝔄n] → h̃[𝔄] for n → ∞.

4. Conclusions

We have proposed to modify both the usual Definition (1) of the entropy H and (9) of the differential entropy h, respectively, into (23) and (19), namely within our most general notation:
\tilde{H} = -\sum_{k\ge 1} \frac{\Delta F_k}{\Delta x_k}\,\ln\!\left(\tilde{\varrho}\,\frac{\Delta F_k}{\Delta x_k}\right)\Delta x_k = H - \ln\tilde{\varrho} + \sum_{k\ge 1} p_k \ln\Delta x_k
\tilde{h} = -\int_{\mathbb{R}} f(x)\,\ln[\tilde{\varrho}\, f(x)]\,dx = h - \ln\tilde{\varrho}
where, in general, ϱ̃ coincides with the IQrR ϱ(1/4) of the considered distribution, except when the IQrR vanishes (as can happen for discrete laws): in this last event, ϱ̃ is taken as the smallest non-zero IQnR of the distribution. There are also several other possible re-definitions, which essentially differ among them by the choice of the parameter κ in (13) and by the set of their possible values. All of these definitions, moreover, bypass the anomalies listed in Section 1 and appear to go smoothly one into the other in a suitable discrete-continuous limit. As a matter of fact, the introduction of the dimensional parameter κ effectively renormalizes the divergences that we would otherwise encounter in the limiting processes leading from discrete to continuous laws. We remark, finally, that the discrete form of (25) can also easily be customized to fit the estimation of the entropy from empirical data.
We end the paper by pointing out that, despite extensive similarities, the new quantities, such as H̃ and h̃, no longer share all of the properties of H and h. For instance, at the present stage, we could neither prove nor disprove (by means of some counterexample) that H̃ ≥ 0, as holds for H. On the other hand, the examples also seem to leave no room for h̃-extremal distributions, as the normal laws were for the differential entropy h. We remark, however, that these conclusions would be different if the alternative definitions presented in Section 2.2 were adopted. While all of these topics seem to be interesting fields of inquiry, it would also be important to review extensively what is preserved of the well-known properties of the usual definitions and how to adapt further ideas, such as relative entropy, mutual information and whatever else is today used in information processing [4,5]. This remark emphasizes the possibilities opened by our seemingly naive changes: as a bid to connect two previous standpoints, our proposed definitions blend the properties of the older quantities and, in so doing, can also break new ground. An extensive analysis of all of the possible consequences, both of the proposed definitions and of their articulations, will be the subject of a forthcoming paper; here we limit ourselves to pointing out that many relevant features of the entropy essentially derive from the properties of the logarithms, which, in any case, play a central role also in the new definitions. Finally, it would be stimulating to explore how, if at all, the new definitions can be made compatible with other celebrated extensions of the classical entropies, such as, for instance, that proposed by Tsallis [6] or the more recent cumulative residual entropy [7].

Acknowledgments

The author would like to thank Andrea Andrisani, Salvatore De Martino, Silvio De Siena, Christopher J. Ellison and Sebastiano Stramaglia for invaluable comments and suggestions.

Conflicts of Interest

The author declares no conflict of interest.

Appendix

A. Examples

We begin by comparing the values of the two differential entropies h and h̃, respectively defined in (9) and (19), for the most common families of laws, neglecting the centrality parameters, which are irrelevant for our purposes because both entropies are independent of them.
For the Gaussian laws 𝔑(a) with:
f(x) = \frac{e^{-x^2/2a^2}}{a\sqrt{2\pi}} \qquad \sigma = a \qquad h[𝔑] = \ln\!\left(a\sqrt{2\pi e}\right)
we know that:
\tilde{\varrho}[𝔑] = a\left[\Phi^{-1}\!\left(\tfrac{3}{4}\right) - \Phi^{-1}\!\left(\tfrac{1}{4}\right)\right] \qquad \Phi(y) = \int_{-\infty}^{y}\frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz
and hence, we immediately have from (19):
\tilde{h}[𝔑] = h[𝔑] - \ln\tilde{\varrho}[𝔑] = \ln\frac{\sqrt{2\pi e}}{\Phi^{-1}\!\left(\tfrac{3}{4}\right) - \Phi^{-1}\!\left(\tfrac{1}{4}\right)} \approx 1.11959
For the laws 𝔘(a) uniform on [0, a] with (here, ϑ(x) is the Heaviside function):
f(x) = \frac{\vartheta(x) - \vartheta(x-a)}{a} \qquad \sigma = \frac{a}{\sqrt{12}} \qquad h[𝔘] = \ln a \qquad \tilde{\varrho} = \frac{a}{2}
we instead have:
\tilde{h}[𝔘] = \ln a - \ln\frac{a}{2} = \ln 2 \approx 0.693147
For the gamma laws 𝔊λ(a), λ > 0 with:
f(x) = \vartheta(x)\,\frac{x^{\lambda-1} e^{-x/a}}{a^{\lambda}\Gamma(\lambda)} \qquad \sigma = a\sqrt{\lambda} \qquad h[𝔊_\lambda] = (1-\lambda)\psi(\lambda) + \ln\!\left[a\,e^{\lambda}\Gamma(\lambda)\right]
where:
\psi(z) = \frac{\Gamma'(z)}{\Gamma(z)}
we have:
\tilde{\varrho} = a\left[\Gamma_{\lambda}^{-1}\!\left(\tfrac{3}{4}\right) - \Gamma_{\lambda}^{-1}\!\left(\tfrac{1}{4}\right)\right] \qquad \Gamma_{\lambda}(y) = 1 - \frac{\Gamma(\lambda,y)}{\Gamma(\lambda,0)} \qquad \Gamma(\lambda,y) = \int_{y}^{\infty} t^{\lambda-1} e^{-t}\,dt
and hence:
\tilde{h}[𝔊_\lambda] = (1-\lambda)\psi(\lambda) + \ln\frac{e^{\lambda}\Gamma(\lambda)}{\Gamma_{\lambda}^{-1}\!\left(\tfrac{3}{4}\right) - \Gamma_{\lambda}^{-1}\!\left(\tfrac{1}{4}\right)}
which, as a function of λ, is displayed in Figure 1: this shows that h̃[𝔊λ] also takes negative values, that h̃[𝔊λ] → −∞ for λ → 0 and that h̃[𝔊λ] ↑ h̃[𝔑] for λ → +∞. In particular, for the exponential laws 𝔈(a) = 𝔊1(a), we have:
\tilde{h}[𝔈] = 1 - \ln(\ln 3) \approx 0.905952
Finally, for the family of Student laws 𝔗λ(a), λ > 0 with:
f(x) = \frac{1}{a\,B\!\left(\tfrac{1}{2},\tfrac{\lambda}{2}\right)}\left(\frac{a^2}{a^2+x^2}\right)^{\frac{\lambda+1}{2}} \qquad \sigma = \frac{a}{\sqrt{\lambda-2}} \qquad B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}
the variance exists only for λ > 2, but the differential entropy is well defined for every λ > 0:
h[𝔗_\lambda] = \frac{\lambda+1}{2}\left[\psi\!\left(\frac{\lambda+1}{2}\right) - \psi\!\left(\frac{\lambda}{2}\right)\right] + \ln\!\left[a\,B\!\left(\tfrac{1}{2},\tfrac{\lambda}{2}\right)\right]
For brevity, the explicit form of the IQrR ϱ̃ will not be given here. Since, however, the IQrR ϱ̃[𝔗λ] is proportional to the scaling parameter a, while the differential entropy h[𝔗λ] contains a only through the additive term ln a, the entropy h̃[𝔗λ] is independent of a and, as a function of λ, is displayed in Figure 2, which shows that h̃[𝔗λ] always takes positive values larger than h̃[𝔑] ≈ 1.11959, that h̃[𝔗λ] → +∞ for λ → 0 and that h̃[𝔗λ] ↓ h̃[𝔑] for λ → +∞. In particular, for the Cauchy laws ℭ(a) = 𝔗1(a), without variance, with:
f(x) = \frac{1}{a\pi}\,\frac{a^2}{a^2+x^2} \qquad h[ℭ] = \ln(4\pi a) \qquad \tilde{\varrho} = 2a
we have that:
\tilde{h}[ℭ] = \ln(4\pi a) - \ln(2a) = \ln(2\pi) \approx 1.83788 > \tilde{h}[𝔑]
A particular consequence of these examples is that, as already remarked in Section 2.1, the entropy h̃ has neither a maximum nor a minimum value and, by suitably choosing the continuous law, it can take every real value, both positive and negative.
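These constants, and the trend shown in Figure 1, can be cross-checked numerically (scipy assumed; the chosen shape parameters are arbitrary):

```python
# A numerical cross-check of the appendix values (scipy assumed): h~ computed as
# entropy() - ln(IQrR) reproduces the constants above and the behaviour of Figure 1.
import numpy as np
from scipy.stats import norm, uniform, expon, cauchy, gamma

h_tilde = lambda d: d.entropy() - np.log(d.ppf(0.75) - d.ppf(0.25))

print(h_tilde(norm()))       # ≈ 1.11959
print(h_tilde(uniform()))    # ln 2 ≈ 0.693147
print(h_tilde(expon()))      # 1 - ln(ln 3) ≈ 0.905952
print(h_tilde(cauchy()))     # ln(2*pi) ≈ 1.83788
for lam in (0.05, 1.0, 5.0, 500.0):
    print(lam, h_tilde(gamma(a=lam)))   # negative for small lambda, increasing toward h~[N]
```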
Figure 1. Entropy h̃[𝔊λ] for gamma laws.
Figure 2. Entropy h̃[𝔗λ] for Student laws.
Similar calculations can then be carried out also for the alternative entropy definition (21) of ĥ, for laws endowed with a finite second moment: we immediately get, for the Gaussian laws:
\hat{h}[𝔑] = \ln\!\left(a\sqrt{2\pi e}\right) - \ln\!\left(a\sqrt{2\pi e}\right) = 0
and for the uniform laws:
\hat{h}[𝔘] = \ln\!\left(\frac{a}{\sqrt{12}}\sqrt{2\pi e}\right) - \ln a = \ln\sqrt{\frac{\pi e}{6}} \approx 0.1765
For the gamma laws, we have now:
\hat{h}[𝔊_\lambda] = \ln\!\left(a\sqrt{2\pi e\lambda}\right) - (1-\lambda)\psi(\lambda) - \ln\!\left[a\,e^{\lambda}\Gamma(\lambda)\right] = \ln\frac{\sqrt{2\pi e\lambda}}{e^{\lambda}\Gamma(\lambda)} - (1-\lambda)\psi(\lambda)
which always takes positive values, as shown in Figure 3, with ĥ[𝔊λ] → 0 for λ → +∞, and in particular, for the exponential law, we have:
\hat{h}[𝔈] = \ln\!\left(a\sqrt{2\pi e}\right) - \ln(ea) = \ln\sqrt{\frac{2\pi}{e}} \approx 0.4189
while for the Student laws, with λ > 2, we have:
\hat{h}[𝔗_\lambda] = -\frac{\lambda+1}{2}\left[\psi\!\left(\frac{\lambda+1}{2}\right) - \psi\!\left(\frac{\lambda}{2}\right)\right] - \ln\!\left[\sqrt{\frac{\lambda-2}{2\pi e}}\;B\!\left(\tfrac{1}{2},\tfrac{\lambda}{2}\right)\right]
displayed in Figure 4. It is easy to check that, in all of these examples, the ĥ entropy takes only non-negative values.
Figure 3. Entropy ĥ[𝔊λ] for the gamma laws.
Figure 4. Entropy ĥ[𝔗λ] for Student laws with λ > 2.
Finally, for the third definition (22) of h̄, we have the following values: for the Gaussian type, we have:
\bar{h}[𝔑] = \ln\sqrt{2\pi e} \approx 1.41894
and for the uniform type:
\bar{h}[𝔘] = 0
For the gamma types:
\bar{h}[𝔊_\lambda] = (1-\lambda)\psi(\lambda) + \ln\!\left[e^{\lambda}\Gamma(\lambda)\right]
the values, as displayed in Figure 5, go from −∞ for λ → 0+, to +∞ for λ → +∞. In particular, for the exponential type, we have:
\bar{h}[𝔈] = 1
For the Student types, we finally have:
\bar{h}[𝔗_\lambda] = \frac{\lambda+1}{2}\left[\psi\!\left(\frac{\lambda+1}{2}\right) - \psi\!\left(\frac{\lambda}{2}\right)\right] + \ln B\!\left(\tfrac{1}{2},\tfrac{\lambda}{2}\right)
with values shown in Figure 6 and going again from +∞ at λ → 0+, to −∞ for λ → +∞. In particular, for the Cauchy type, we get:
\bar{h}[ℭ] = \ln(4\pi) \approx 2.53102
As remarked in Section 2.2, these examples show that the h̄ entropy again takes all of the (positive and negative) real values and, hence, that there is no maximum entropy distribution, as in the case of the h̃ entropy.
Figure 5. Entropy h̄[𝔊λ] for gamma laws.
Figure 6. Entropy h̄[𝔗λ] for Student laws.

B. Mixtures

The definition of ϱ̃ proposed in Section 3 certainly produces non-zero values for both purely discrete and purely continuous laws. Some care must be exercised, however, for discrete-continuous mixtures. Let us take, for instance, the mixture:
\frac{2}{3}\,\delta_{1/2} + \frac{1}{3}\,𝔘(1)
of a law degenerate in x = 1/2 and a law uniform on [0, 1]. Its cdf would then be:
F ( x ) = { 0 x < 0 2 3 ϑ ( x 1 2 ) + x 3 0 x 1 1 1 < x
where ϑ is the Heaviside function, and its quantile function is:
Q(p) = \begin{cases} 3p & 0 \le p \le \frac{1}{6} \\ \frac{1}{2} & \frac{1}{6} \le p \le \frac{5}{6} \\ 3p - 2 & \frac{5}{6} \le p \le 1 \end{cases}
as displayed in Figure 7. It is apparent then that the IQrR is zero, because:
\varrho\!\left(\tfrac{1}{4}\right) = Q\!\left(\tfrac{3}{4}\right) - Q\!\left(\tfrac{1}{4}\right) = \frac{1}{2} - \frac{1}{2} = 0
while the IQnR ϱ(p) is a continuous function, so that (with the notations of Section 3) also ϱ̃ = 0, because inf 𝒫0 = 0. As a consequence, in the case of discrete-continuous mixtures with a cdf, such as:
F(x) = q\,F_d(x) + (1-q)\,F_c(x) \qquad 0 < q < 1
we cannot simply extend the definitions of Section 3. We could, however, consider separately both the ϱ̃d of the discrete component (as defined in Section 3) and the ϱ̃c = ϱ(1/4) of the continuous component, and take as the dimensional constant κ their convex combination:
q\,\tilde{\varrho}_d + (1-q)\,\tilde{\varrho}_c
which never vanishes, because, at least, its continuous part is always non-zero.
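A short check of this example (numpy assumed; the weights q = 2/3, ϱ̃d = 0 for the degenerate part and ϱ̃c = 1/2 for the uniform part follow from the discussion above):

```python
# A small check on the mixture (2/3)delta_{1/2} + (1/3)U(1) (numpy assumed):
# its own IQrR vanishes and the non-zero IQnR values have no positive minimum,
# while the proposed convex combination q*rho_d + (1-q)*rho_c stays positive.
import numpy as np

def Q(p):
    # quantile function of the mixture, as in the piecewise formula above
    if p <= 1/6:
        return 3*p
    if p <= 5/6:
        return 0.5
    return 3*p - 2

print(Q(0.75) - Q(0.25))                  # IQrR = 0
ps = np.linspace(1e-6, 0.25, 100000)
rho = np.array([Q(1 - p) - Q(p) for p in ps])
print(rho[rho > 0].min())                 # tiny value: inf of the non-zero IQnR values is 0
q, rho_d, rho_c = 2/3, 0.0, 0.5           # degenerate discrete part: rho_d = 0; uniform part: rho_c = 1/2
print(q * rho_d + (1 - q) * rho_c)        # kappa = 1/6 > 0
```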
Figure 7. The quantile function Q(p) for a mixture of degenerate and uniform laws.

References

  1. Loève, M. Probability Theory I–II; Springer: Berlin, Germany, 1977.
  2. Jacquet, P.; Szpankowski, W. Entropy computations via analytic depoissonization. IEEE Trans. Inf. Theory 1999, 45, 1072–1081.
  3. Cichoń, J.; Gołębiewski, Z. On Bernoulli sums and Bernstein polynomials. In Proceedings of the 23rd International Meeting on Probabilistic, Combinatorial, and Asymptotic Methods for the Analysis of Algorithms (AofA'12), Montreal, Canada, 17–22 June 2012; pp. 179–190.
  4. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: Hoboken, NJ, USA, 2006.
  5. Bettencourt, L.M.A.; Gintautas, V.; Ham, M.I. Identification of functional information subgraphs in complex networks. Phys. Rev. Lett. 2008, 100.
  6. Tsallis, C. Possible generalization of Boltzmann–Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
  7. Drissi, N.; Chonavel, T.; Boucher, J.M. Generalized cumulative residual entropy for distributions with unrestricted supports. Res. Lett. Signal Process. 2008, 11.
