Article

The Lorenz Curve: A Proper Framework to Define Satisfactory Measures of Symbol Dominance, Symbol Diversity, and Information Entropy

Unidad Docente de Ecología, Departamento de Ciencias de la Vida, Universidad de Alcalá, 28805 Alcalá de Henares, Madrid, Spain
Entropy 2020, 22(5), 542; https://doi.org/10.3390/e22050542
Submission received: 9 March 2020 / Revised: 4 May 2020 / Accepted: 11 May 2020 / Published: 13 May 2020
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Novel measures of symbol dominance (dC1 and dC2), symbol diversity (DC1 = N (1 − dC1) and DC2 = N (1 − dC2)), and information entropy (HC1 = log2 DC1 and HC2 = log2 DC2) are derived from Lorenz-consistent statistics that I had previously proposed to quantify dominance and diversity in ecology. Here, dC1 refers to the average absolute difference between the relative abundances of dominant and subordinate symbols, with its value being equivalent to the maximum vertical distance from the Lorenz curve to the 45-degree line of equiprobability; dC2 refers to the average absolute difference between all pairs of relative symbol abundances, with its value being equivalent to twice the area between the Lorenz curve and the 45-degree line of equiprobability; N is the number of different symbols or maximum expected diversity. These Lorenz-consistent statistics are compared with statistics based on Shannon’s entropy and Rényi’s second-order entropy to show that the former have better mathematical behavior than the latter. The use of dC1, DC1, and HC1 is particularly recommended, as only changes in the allocation of relative abundance between dominant (pd > 1/N) and subordinate (ps < 1/N) symbols are of real relevance for probability distributions to achieve the reference distribution (pi = 1/N) or to deviate from it.

1. Introduction

Following the early use of Shannon’s [1] entropy (HS) by some theoretical ecologists during the 1950s [2,3,4], HS has been extensively used in community ecology to quantify species diversity. Ecologists have considered the relative abundance or probability of the ith symbol in a message or sequence of N different symbols whose meaning is irrelevant [1,5,6] as the relative abundance or probability of the ith species in a community or assemblage of S different species whose phylogeny is irrelevant (i.e., all species are considered taxonomically equally distinct) [4,7,8]. This use of HS implies that the concept of species diversity is directly related to the concept of information entropy, basically representing the amount of information or uncertainty in a probability distribution defined for a set of N possible symbols [1] or a set of S possible species [4]. HS takes values from 0 to log2 N or log2 S and is properly expressed in bits, but it can also be expressed in nats or dits (also called bans, decits, or Hartleys) if the natural logarithm or the decimal logarithm is calculated [1,4,5,6,7,8].
In recent decades, several ecologists have, however, claimed that HS is an unsatisfactory diversity index because species diversity actually takes values from 1 to S and is ideally expressed in units of species (i.e., in the same units as S). Keeping this perspective in mind, and only considering the number of different symbols as the number of different species and the relative abundances of symbols as the relative abundances of species, Hill [9] proposed the exponential form of Shannon’s [1] entropy (HS) and the exponential form of Rényi’s [10] second-order entropy (HR), which equals the reciprocal of Simpson’s [11] concentration statistic (λ), as better alternatives to quantify species diversity, thereby assuming that the amount of information or uncertainty in a probability distribution defined for a set of S possible species was mathematically equivalent to the logarithm of its related species diversity. Similarly, we can assume that the amount of information or uncertainty (expressed in bits) in a probability distribution defined for a set of N possible symbols is mathematically equivalent to the binary logarithm of its related symbol diversity. Additionally, we can assume that symbol dominance characterizes the extent of relative abundance inequality among different symbols, particularly between dominant and subordinate symbols, and that symbol diversity equals the number of different symbols (N) or maximum expected diversity in any given message with equiprobability.
On the basis of these working assumptions, I first use the Lorenz curve [12] as the key framework to assess symbol dominance, symbol diversity, and information entropy. The contrast between symbol dominance and symbol redundancy is also highlighted. Subsequently, novel measures of symbol dominance (dC1 and dC2), symbol diversity (DC1 and DC2), and information entropy (HC1 and HC2) are derived from Lorenz-consistent statistics that I had previously proposed to quantify dominance and diversity in community ecology [13,14,15,16,17] and landscape ecology [18]. Finally, Lorenz-consistent statistics (dC1, dC2, DC1, DC2, HC1, and HC2) are compared with HS-based and HR-based statistics (dS, dR, DS, DR, HS, and HR) to show that the former have better mathematical behavior than the latter when measuring symbol dominance, symbol diversity, and information entropy in hypothetical messages. In this regard, I recently found that the corresponding versions of dC1, dC2, DC1, and DC2 exhibited better mathematical behavior than the corresponding versions of dS, dR, DS, and DR when measuring land cover dominance and diversity in hypothetical landscapes [18]. This better mathematical behavior was inherent to the compatibility of dC1 and dC2 with the Lorenz-curve-based graphical representation of land cover dominance [18].
The Lorenz curve [12] was introduced in the early twentieth century as a graphical method to assess the inequality in the distribution of income among the individuals of a population. Subsequently, this graphical method and Lorenz-consistent indices of income inequality, such as Gini’s [19,20] index and Schutz’s [21] index, have become popular in the field of econometrics (see reviews in [22,23]). More recently, owing to the increasing economic inequality during the present market globalization [24], some authors have supported the use of Bonferroni’s [25] curve and Zenga’s [26] curve and related indices to better assess poverty, as these inequality measures are more sensitive to the lower levels of the income distribution [27,28,29]. To me, however, the Lorenz curve represents the best and most logical framework to define satisfactory indices of inequality (dominance) and associated measures of diversity or entropy.

2. Materials and Methods

2.1. Assessing Symbol Dominance, Symbol Diversity, and Information Entropy within the Framework of the Lorenz Curve

In econometrics, the Lorenz curve [12] is ideally depicted within a unit (1 × 1) square, in which the cumulative proportion of income (the vertical y-axis) is related to the cumulative proportion of individuals (the horizontal x-axis), ranked from the person with the lowest income to the person with the highest income. The 45-degree (diagonal) line represents equidistribution or perfect income equality. Income inequality may be quantified as the maximum vertical distance from the Lorenz curve to the 45-degree line of equidistribution if only differences in income between the rich and the poor are of interest (this measure being equivalent to the value of Schutz’s inequality index), or as twice the area between the Lorenz curve and the 45-degree line of equidistribution if differences in income among all of the individuals are of interest (this measure being equivalent to the value of Gini’s inequality index), with both measures exhibiting the same value whenever income inequality occurs only between the rich and the poor (see reviews in [22,23]; also see [18]). Therefore, in any given population with M individuals, income inequality takes a minimum possible value of 0 when every person has the same income (= total income/M, including M = 1) and a maximum possible value of 1 − 1/M when a single person has all the income and the remaining M − 1 people have none, as persons with no income can exist in a population.
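As a concrete illustration of these two measures, the following minimal Python sketch (an illustration of my own; the function names and the income vector are hypothetical, not taken from the cited econometric literature) computes Schutz’s index as the maximum vertical distance from the Lorenz curve to the 45-degree line and Gini’s index as twice the area between them. Because inequality in the example occurs only between one rich person and three equally poor ones, both indices coincide:

import numpy as np

def schutz_index(income):
    # Share of total income held by above-average earners in excess of the mean:
    # the maximum vertical distance from the Lorenz curve to the 45-degree line.
    x = np.asarray(income, dtype=float)
    shares = x / x.sum()
    return np.sum(np.maximum(shares - 1.0 / len(x), 0.0))

def gini_index(income):
    # Mean absolute difference between all pairs of incomes, divided by twice the
    # mean income: twice the area between the Lorenz curve and the 45-degree line.
    x = np.asarray(income, dtype=float)
    diffs = np.abs(x[:, None] - x[None, :])
    return diffs.sum() / (2.0 * len(x) ** 2 * x.mean())

incomes = [10, 10, 10, 70]       # hypothetical population of M = 4 individuals
print(schutz_index(incomes))     # ≈ 0.45
print(gini_index(incomes))       # ≈ 0.45 (equal: inequality only between rich and poor)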
If we assume that symbol dominance characterizes the extent of relative abundance inequality among different symbols, particularly between dominant and subordinate symbols, then the Lorenz-curve-based graphical representation of symbol dominance is given by the separation of the Lorenz curve from the 45-degree line of equiprobability, in which every symbol i has the same relative abundance (pi = 1/N, with N = the number of different symbols). This separation may be quantified as the maximum vertical distance from the Lorenz curve to the 45-degree line if only differences in relative abundance between dominant and subordinate symbols are of interest, or as twice the area between the Lorenz curve and the 45-degree line if differences in relative abundance among all symbols are of interest, with both measures giving the same value whenever relative abundance inequality occurs only between dominant and subordinate symbols.
In any given message with equiprobability, the relative abundance of each different symbol equals 1/N, so a symbol may be objectively regarded as dominant if its probability (pd) > 1/N and as subordinate if its probability (ps) < 1/N. I had already used an equivalent method to discriminate between dominant and subordinate species [13,14,15,16,17] and between dominant and subordinate land cover types [18]. Thus, symbol dominance takes a minimum possible value of 0 when every different symbol has the same relative abundance (= 1/N, including N = 1), and approaches a maximum possible value of 1 − 1/N when a single symbol has a relative abundance very close to 1 and the remaining N − 1 symbols have minimum relative abundances (> 0), as symbols with no abundance or zero probability do not exist in a message.
In addition, if we assume that symbol diversity equals the number of different symbols or maximum expected diversity (N) in any given message with equiprobability (symbol dominance = 0 because pi = 1/N), then symbol diversity in any given message with symbol dominance > 0 must equal the maximum expected diversity minus the impact of symbol dominance on it; that is, symbol diversity = N – (N × symbol dominance) = N (1 – symbol dominance). This Lorenz-consistent measure of symbol diversity is a function of both the number of different symbols and the equal distribution of their relative abundances (i.e., symbol diversity is a probabilistic concept free of semantic attributes), taking values from 1 to N (maximum diversity if pi = 1/N) and being properly expressed in units of symbols. Therefore, symbol diversity/N = 1 – symbol dominance (i.e., symbol dominance triggers the inequality between symbol diversity and its maximum expected value).
It should also be evident that the reciprocal of symbol diversity refers to the concentration of relative abundance in the same symbol, and consequently may be regarded as a Lorenz-consistent measure of symbol redundancy = 1/(N (1 − symbol dominance)). This redundancy measure is a function of both the fewness of different symbols and the unequal distribution of their relative abundances, taking values from 1/N to 1 (maximum redundancy if N = 1). Thus, symbol dominance (relative abundance inequality among different symbols) and symbol redundancy are distinct concepts, although the value of the former affects the value of the latter.
Lastly, if we assume that information entropy is mathematically equivalent to the binary logarithm of its related symbol diversity, then the resulting Lorenz-consistent measure of information entropy = log2 (N (1 − symbol dominance)). This entropy measure takes values from 0 to log2 N (maximum entropy if pi = 1/N) and is properly expressed in bits, quantifying the amount of information or uncertainty in a probability distribution defined for a set of N possible symbols. Obviously, the degree of uncertainty attains a minimum value of 0 as symbol redundancy reaches a maximum value of 1.
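As a brief worked example (corresponding to hypothetical message V in Table 1 and Table 3 below): for a message containing only two different symbols with relative abundances 0.6 and 0.4, symbol dominance = 0.6 − 1/2 = 0.1, symbol diversity = 2 (1 − 0.1) = 1.8 symbols, symbol redundancy = 1/1.8 ≈ 0.556, and information entropy = log2 1.8 ≈ 0.848 bits.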

2.2. Deriving Measures of Symbol Dominance, Symbol Diversity, and Information Entropy from Lorenz-Consistent Statistics

Following the theoretical approach of assessing symbol dominance, symbol diversity, and information entropy within the framework of the Lorenz curve, novel measures of symbol dominance (dC1 and dC2), symbol diversity (DC1 and DC2), and information entropy (HC1 and HC2) are derived from Lorenz-consistent statistics, which I had previously proposed to quantify species dominance and diversity [13,14,15,16,17] and land cover dominance and diversity [18]. In this derivation the number of different species or land cover types is considered as the number of different symbols, and the probabilities of species or land cover types are considered as the probabilities of symbols:
dC1 = ∑_{d=1}^{L} (pd − 1/N) = (∑_{d,s=1}^{G} (pd − ps))/N,  (1)
DC1 = N − (N × dC1) = N (1 − dC1) = N − ∑(pd − ps),  (2)
HC1 = log2 DC1 = log2 (N − ∑(pd − ps)),  (3)
dC2 = (∑_{i,j=1}^{K} |pi − pj|)/N,  (4)
DC2 = N − (N × dC2) = N (1 − dC2) = N − ∑|pi − pj|,  (5)
HC2 = log2 DC2 = log2 (N − ∑|pi − pj|),  (6)
where N is the number of different symbols or maximum expected diversity, pd > 1/N is the relative abundance of each dominant symbol, ps < 1/N is the relative abundance of each subordinate symbol, pi and pj are the relative abundances of two different symbols in the same message, L is the number of dominant symbols, G is the number of subtractions between the relative abundances of dominant and subordinate symbols, and K = N (N − 1)/2 is the number of subtractions between all pairs of relative symbol abundances.
The dominance statistic dC1 refers to the average absolute difference between the relative abundances of dominant and subordinate symbols (Equation (1)), with its value being equivalent to the maximum vertical distance from the Lorenz curve to the 45-degree line of equiprobability (see also [18]). Accordingly, the value of DC1 equals the number of different symbols minus the impact of symbol dominance (dC1) on the maximum expected diversity (Equation (2)). The binary logarithm of this subtraction is the associated measure of information entropy (Equation (3)).
Likewise, the dominance statistic dC2 refers to the average absolute difference between all pairs of relative symbol abundances (Equation (4)), with its value being equivalent to twice the area between the Lorenz curve and the 45-degree line of equiprobability (see also [18]). Accordingly, the value of DC2 equals the number of different symbols minus the impact of symbol dominance (dC2) on the maximum expected diversity (Equation (5)). The binary logarithm of this subtraction is the associated measure of information entropy (Equation (6)).
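For readers who wish to reproduce these values, a minimal Python sketch of Equations (1)–(6) is given below (the function name is illustrative; the example distribution is hypothetical message VII of Table 1):

import numpy as np

def lorenz_statistics(p):
    # Lorenz-consistent dominance (dC1, dC2), diversity (DC1, DC2), and entropy
    # (HC1, HC2) of a probability distribution p, following Equations (1)-(6).
    p = np.asarray(p, dtype=float)
    n = len(p)
    # dC1: total excess probability of dominant symbols (pd > 1/N), i.e., the
    # maximum vertical distance from the Lorenz curve to the equiprobability line.
    d_c1 = np.sum(np.maximum(p - 1.0 / n, 0.0))
    # dC2: sum of absolute differences over all pairs of probabilities, divided
    # by N, i.e., twice the area between the Lorenz curve and that line.
    d_c2 = np.sum(np.abs(p[:, None] - p[None, :])) / (2.0 * n)
    D_c1, D_c2 = n * (1.0 - d_c1), n * (1.0 - d_c2)
    return d_c1, D_c1, np.log2(D_c1), d_c2, D_c2, np.log2(D_c2)

print(lorenz_statistics([0.6, 0.1, 0.1, 0.1, 0.1]))
# ≈ (0.4, 3.0, 1.585, 0.4, 3.0, 1.585), matching column VII of Table 3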
Despite the above dissimilarities between Lorenz-consistent statistics of symbol dominance, symbol diversity, and information entropy, dC1 = dC2 = 0, DC1 = DC2 = N, and HC1 = HC2 = log2 N if there is equiprobability (pi = 1/N, including N = 1); and dC1 = dC2 > 0, DC1 = DC2 < N, and HC1 = HC2 < log2 N whenever relative abundance inequality occurs only between dominant and subordinate symbols. In this regard, it is worth noting that dC1 is comparable to Schutz’s [21] index of income inequality (also known as the Pietra ratio or Robin Hood index) and dC2 is comparable to Gini’s [19,20] index of income inequality. In fact, Gini’s index and Schutz’s index take the same value whenever income inequality occurs only between the rich and the poor (see reviews in [22,23]; also see [18]). However, there is a particular difference between the measurement of symbol dominance (dC1 and dC2) and the measurement of income inequality (Schutz’s index and Gini’s index): income inequality can reach a maximum value of 1 − 1/M when a single person has all the income and the remaining M – 1 people have none (as individuals with no income are considered to measure income inequality), but symbol dominance can only approach a maximum value of 1 – 1/N when a single symbol has a relative abundance very close to 1 and the remaining N – 1 symbols have minimum relative abundances (as symbols with no abundance or zero probability cannot be considered to measure symbol dominance).
Additionally, because the reciprocal of symbol diversity refers to the concentration of relative abundance in the same symbol (as already explained in Section 2.1), two Lorenz-consistent statistics of symbol redundancy are RC1 = 1/DC1 and RC2 = 1/DC2. RC1 and RC2 take values from 1/N to 1 (maximum redundancy if N = 1), and therefore their mathematical behavior can considerably differ from the mathematical behavior of Gatlin’s [30] classical redundancy index (RG = 1 – HS/log2 N). Indeed, since RG takes a maximum value of 1 if N = 1 and a minimum value of 0 if pi = 1/N [30], RG should be regarded as a combination of redundancy and dominance (see also [15]).
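A short numerical comparison (using hypothetical four-symbol distributions of my own, not taken from the tables below) illustrates this difference: at equiprobability, RC1 equals 1/N while RG equals 0.

import numpy as np

def redundancies(p):
    # Lorenz-consistent redundancy RC1 = 1/DC1 versus Gatlin's RG = 1 - HS/log2 N.
    p = np.asarray(p, dtype=float)
    n = len(p)
    d_c1 = np.sum(np.maximum(p - 1.0 / n, 0.0))
    r_c1 = 1.0 / (n * (1.0 - d_c1))
    h_s = -np.sum(p * np.log2(p))
    r_g = 1.0 - h_s / np.log2(n)
    return r_c1, r_g

print(redundancies([0.25, 0.25, 0.25, 0.25]))   # (0.25, 0.0): RC1 = 1/N, RG = 0
print(redundancies([0.7, 0.1, 0.1, 0.1]))       # ≈ (0.455, 0.322): both increase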

2.3. Comparing Lorenz-Consistent Statistics with HS-Based and HR-Based Statistics

Lorenz-consistent statistics of symbol dominance (dC1 and dC2), symbol diversity (DC1 and DC2), and information entropy (HC1 and HC2) are compared with statistics based on Shannon’s [1] entropy (HS) and Rényi’s [10] second-order entropy (HR). More specifically, on the basis of Hill’s [9] proposals for measuring diversity and Camargo’s [17] proposals for measuring dominance, the HS-based and HR-based statistics are:
HS = −∑_{i=1}^{N} pi log2 pi,  (7)
DS = 2^HS,  (8)
dS = 1 − DS/N,  (9)
HR = log2 (1/∑_{i=1}^{N} pi²),  (10)
DR = 2^HR,  (11)
dR = 1 − DR/N,  (12)
where pi is the relative abundance or probability of the ith symbol in a message or sequence of N different symbols.
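A parallel Python sketch of Equations (7)–(12) (again illustrative, applied to hypothetical message VII of Table 1) allows direct comparison with the Lorenz-consistent statistics computed above:

import numpy as np

def shannon_renyi_statistics(p):
    # HS-based and HR-based statistics: Shannon entropy HS, its exponential DS,
    # dominance dS = 1 - DS/N; Renyi second-order entropy HR, its exponential DR
    # (the reciprocal of Simpson's concentration), and dominance dR = 1 - DR/N.
    p = np.asarray(p, dtype=float)
    n = len(p)
    h_s = -np.sum(p * np.log2(p))
    d_s_diversity = 2.0 ** h_s
    h_r = np.log2(1.0 / np.sum(p ** 2))
    d_r_diversity = 2.0 ** h_r
    d_s = 1.0 - d_s_diversity / n
    d_r = 1.0 - d_r_diversity / n
    return h_s, d_s_diversity, d_s, h_r, d_r_diversity, d_r

print(shannon_renyi_statistics([0.6, 0.1, 0.1, 0.1, 0.1]))
# ≈ (1.771, 3.413, 0.317, 1.322, 2.5, 0.5), matching column VII of Table 3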
Although dC1 = dC2 = dS = dR = 0, DC1 = DC2 = DS = DR = N, and HC1 = HC2 = HS = HR = log2 N whenever there is equiprobability, differences in mathematical behavior between Lorenz-consistent statistics and HS-based and HR-based statistics were examined by computing all these statistics for the ten probability distributions (I–X) described as hypothetical messages in Table 1. As we can see, the hypothetical message V is the primary or starting distribution, having two different symbols with probabilities of 0.6 and 0.4. From distribution V to I, the probabilities of all different symbols are successively halved by doubling their number, with the whole relative abundance of dominant symbols that must be transferred to subordinate symbols to achieve equiprobability remaining steady (= 0.1). From distribution V to X, only the probabilities of subordinate symbols are successively halved by doubling their number, with the whole relative abundance of dominant symbols that must be transferred to subordinate symbols to achieve equiprobability approaching the probability of the single dominant symbol (= 0.6). Accordingly, the degree of dominance in each dominant symbol is given by the positive deviation of its probability (pd) from the expected equiprobable value of 1/N, while the degree of subordination in each subordinate symbol is given by the negative deviation of its probability (ps) from 1/N. Thus, in each probability distribution or hypothetical message, symbol dominance = symbol subordination = the average absolute difference between the relative abundances of dominant and subordinate symbols (Equation (1)) = the whole relative abundance of dominant symbols that must be transferred to subordinate symbols to achieve equiprobability (Ptransfer values in Table 1).
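The construction of these distributions can be summarized in a short Python sketch (an illustration of my own, reproducing the two halving procedures just described and the Ptransfer values of Table 1):

import numpy as np

def p_transfer(p):
    # Whole relative abundance of dominant symbols (pd > 1/N) that must be
    # transferred to subordinate symbols to reach equiprobability.
    p = np.asarray(p, dtype=float)
    return np.sum(np.maximum(p - 1.0 / len(p), 0.0))

# From distribution V towards I: halve every probability by doubling the number
# of symbols; Ptransfer stays at 0.1.
p = np.array([0.6, 0.4])                          # distribution V
for _ in range(4):
    p = np.repeat(p / 2.0, 2)                     # IV, III, II, I
    print(len(p), round(p_transfer(p), 3))        # N = 4, 8, 16, 32; Ptransfer = 0.1

# From distribution V towards X: halve only the subordinate probabilities by
# doubling their number; Ptransfer approaches the dominant probability of 0.6.
p = np.array([0.6, 0.4])
for _ in range(5):
    p = np.concatenate(([p[0]], np.repeat(p[1:] / 2.0, 2)))   # VI, ..., X
    print(len(p), round(p_transfer(p), 3))        # 0.267, 0.4, 0.489, 0.541, 0.57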
In addition, disparities in mathematical behavior between Lorenz-consistent statistics and HS-based and HR-based statistics were examined by computing all these statistics for the ten probability distributions (XI–XX) described as hypothetical messages in Table 2, where differences in relative abundance or probability occur not only between dominant and subordinate symbols (as in Table 1), but also among dominant symbols and among subordinate symbols. However, because the Ptransfer value equals 0.25 in all hypothetical messages, only changes in the allocation of relative abundance between dominant and subordinate symbols (but not among dominant symbols or among subordinate symbols) seem to be of real significance for probability distributions to achieve the reference distribution (involving equiprobability) or to deviate from it. The reasons for this are evident: in the case of a dominant symbol increasing its relative abundance at the expense of other dominant symbols (Table 2, relative abundances p1p5 in probability distributions XVI–XIX), the resulting proportional abundance of all the dominant symbols is the same as before the transfer, since the increase in the probability of a dominant symbol (becoming more dominant) is compensated by an equivalent decrease in the probability of other dominant symbols (becoming less dominant); similarly, in the case of a subordinate symbol increasing its relative abundance at the expense of other subordinate symbols (Table 2, relative abundances p6p10 in probability distributions XII–XV), the resulting proportional abundance of all the subordinate symbols is the same as before the transfer, since the increase in the probability of a subordinate symbol (becoming less subordinate) is compensated by an equivalent decrease in the probability of other subordinate symbols (becoming more subordinate or rare).
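This behavior can be checked directly with a small Python sketch (illustrative; it recomputes dC1 and dC2 for distributions XI and XII of Table 2):

import numpy as np

def d_c1(p):
    # Dominance as the total excess probability of dominant symbols (Equation (1)).
    p = np.asarray(p, dtype=float)
    return np.sum(np.maximum(p - 1.0 / len(p), 0.0))

def d_c2(p):
    # Dominance as summed absolute pairwise differences divided by N (Equation (4)).
    p = np.asarray(p, dtype=float)
    return np.sum(np.abs(p[:, None] - p[None, :])) / (2.0 * len(p))

xi  = [0.15] * 5 + [0.05] * 5                       # distribution XI
xii = [0.15] * 5 + [0.09, 0.04, 0.04, 0.04, 0.04]   # distribution XII
for p in (xi, xii):
    print(round(d_c1(p), 3), round(d_c2(p), 3))
# XI:  0.25 0.25
# XII: 0.25 0.27  (dC1 unchanged by transfers among subordinate symbols; dC2 increases)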
Probability distributions in Table 1 and Table 2 were selected to better assess differences in mathematical behavior between Lorenz-consistent statistics (Camargo’s indices) and HS-based and HR-based statistics. Otherwise, when using probability distributions that were chosen at random, we could obtain results that do not allow us to appreciate significant differences between the respective mathematical behaviors.

3. Results and Discussion

The Lorenz-curve-based graphical representation of symbol dominance (relative abundance inequality among different symbols) is shown in Figure 1. Estimated values of symbol dominance are 0.1 (I–V, with five Lorenz curves perfectly superimposed), 0.267 (VI), 0.4 (VII), 0.489 (VIII), 0.541 (IX), and 0.57 (X), with all these dominance values being equivalent to the respective Ptransfer values in Table 1. Additionally, estimated values of symbol diversity are 28.8 (I), 14.4 (II), 7.2 (III), 3.6 (IV), 1.8 (V), 2.199 (VI), 3.0 (VII), 4.599 (VIII), 7.803 (IX), and 14.19 (X) symbols, and estimated values of information entropy are 4.848 (I), 3.848 (II), 2.848 (III), 1.848 (IV), 0.848 (V), 1.137 (VI), 1.585 (VII), 2.202 (VIII), 2.964 (IX), and 3.827 (X) bits.
Differences in mathematical behavior between Lorenz-consistent statistics (dC1, dC2, DC1, DC2, HC1, and HC2) and HS-based and HR-based statistics (dS, dR, DS, DR, HS, and HR) are shown in Table 3. Because dC1, dC2, DC1, DC2, HC1, and HC2 are Lorenz-consistent, their estimated values match estimated values of symbol dominance, symbol diversity, and information entropy concerning Figure 1. In fact, estimated values of dC1 and dC2 are equivalent to the respective Ptransfer values in Table 1. By contrast, estimated values of dS, dR, DS, DR, HS, and HR do not match estimated values of symbol dominance, symbol diversity, and information entropy concerning Figure 1, while dS and dR exhibit values even greater than the upper limit for symbol dominance (= 0.6). Consequently, DS and DR can underestimate symbol diversity when differences in relative abundance between dominant and subordinate symbols are large or can overestimate it when such differences are relatively small.
The observed shortcomings in the measurement of symbol dominance (using dS and dR) and symbol diversity (using DS and DR) seem to be a consequence of the mathematical behavior of the associated entropy measures (HS and HR). As we can see in Table 3, from distribution V to I, where the Ptransfer value remains relatively small (= 0.1; Table 1), inequalities between entropy measures result in HS values > HR values > HC1 and HC2 values. On the contrary, from distribution VII to X, where Ptransfer approaches a higher value of 0.6 (Table 1), inequalities between entropy measures result in HC1 and HC2 values > HS values > HR values. In fact, whereas the normalized entropies of HC1 and HC2 increase from distribution VII to X, the normalized entropies of HS and HR decrease markedly.
This remarkable finding would indicate that HC1 and HC2 can quantify the amount of information or uncertainty in a probability distribution more efficiently than HS and HR, particularly when differences between higher and lower probabilities are maximized by increasing the number of small probabilities (as shown in Table 3 regarding data in Table 1). After all, within the context of classical information theory, the information content of a symbol is an increasing function of the reciprocal of its probability [1,5,6,10] (also see [31,32]).
Other relevant disparities in mathematical behavior regarding measures of symbol dominance, symbol diversity, and information entropy are shown in Table 4. The respective values of dC1, DC1, and HC1 remain identical from distribution XI to XX, since dC1 is sensitive only to differences in relative abundance between dominant and subordinate symbols. On the other hand, because dC2 is sensitive to differences in relative abundance among all different symbols, the respective values of dC2, DC2, and HC2 do not remain identical from distribution XI to XX, even though they are equal in XII and XVI, in XIII and XVII, in XIV and XVIII, and in XV and XIX, as in each of these distribution pairs changes in the allocation of relative abundance among dominant symbols and among subordinate symbols are equivalent. A similar pattern of values is observed concerning dR, DR, and HR, but not regarding dS, DS, and HS, whose respective values remain distinct from distribution XI to XX.

4. Concluding Remarks

This theoretical analysis has shown that the Lorenz curve is a proper framework for defining satisfactory measures of symbol dominance, symbol diversity, and information entropy (Figure 1 and Table 3 and Table 4). The value of symbol dominance equals the maximum vertical distance from the Lorenz curve to the 45-degree line of equiprobability when only differences in relative abundance between dominant and subordinate symbols are quantified, which is equivalent to the average absolute difference between the relative abundances of dominant and subordinate symbols = dC1 (Equation (1)); or it equals twice the area between the Lorenz curve and the 45-degree line of equiprobability when differences in relative abundance among all symbols are quantified, which is equivalent to the average absolute difference between all pairs of relative symbol abundances = dC2 (Equation (4)). Symbol diversity = N (1 − symbol dominance) (i.e., DC1 = N (1 − dC1) and DC2 = N (1 − dC2)) and information entropy = log2 symbol diversity (i.e., HC1 = log2 DC1 and HC2 = log2 DC2). Additionally, the reciprocal of symbol diversity may be regarded as a satisfactory measure of symbol redundancy (i.e., RC1 = 1/DC1 and RC2 = 1/DC2).
This study has also shown that Lorenz-consistent statistics (dC1, dC2, DC1, DC2, HC1, and HC2) have better mathematical behavior than HS-based and HR-based statistics (dS, dR, DS, DR, HS, and HR), exhibiting greater coherence and objectivity when measuring symbol dominance, symbol diversity, and information entropy (Table 3 and Table 4). However, considering that the 45-degree line of equiprobability (Figure 1) represents the reference distribution (pi = 1/N), and that only changes in the allocation of relative abundance between dominant and subordinate symbols (but not among dominant symbols or among subordinate symbols) seem to have true relevance for probability distributions to achieve the reference distribution or to deviate from it (Table 2), the use of dC1, DC1, and HC1 may be more practical than, and preferable to, the use of dC2, DC2, and HC2 when measuring symbol dominance, symbol diversity, and information entropy. In this regard, it should be evident that if the number of different symbols (N) is fixed in any given message, increasing differences in relative abundance between dominant and subordinate symbols necessarily imply decreases in symbol diversity and information entropy, whereas decreasing differences in relative abundance between dominant and subordinate symbols necessarily imply increases in symbol diversity and information entropy, with both variables reaching their maximum values when pi = 1/N. By contrast, increasing or decreasing differences in relative abundance among dominant symbols or among subordinate symbols will not affect symbol diversity and information entropy, since the decrease or increase in the information content of a dominant or subordinate symbol is compensated by an equivalent increase or decrease in the information content of other dominant or subordinate symbols.

Funding

This research received no external funding.

Acknowledgments

The study was financially supported by the University of Alcala (research budget MC-100). The valuable comments and suggestions of four anonymous reviewers are gratefully acknowledged.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
2. Good, I.J. The population frequencies of species and the estimation of population parameters. Biometrika 1953, 40, 237–264.
3. MacArthur, R.H. Fluctuations of animal populations and a measure of community stability. Ecology 1955, 36, 533–536.
4. Margalef, R. Information theory in ecology. Int. J. Gen. Syst. 1958, 3, 36–71.
5. Lombardi, O.; Holik, F.; Vanni, L. What is Shannon information? Synthese 2016, 193, 1983–2012.
6. Rioul, O. This is IT: A primer on Shannon’s entropy and information. Sém. Poincaré 2018, XXIII, 43–77.
7. Magurran, A.E. Measuring Biological Diversity; Blackwell: Oxford, UK, 2004.
8. Hayek, L.A.C.; Buzas, M.A. Surveying Natural Populations: Quantitative Tools for Assessing Biodiversity, 2nd ed.; Columbia University Press: New York, NY, USA, 2010.
9. Hill, M.O. Diversity and evenness: A unifying notation and its consequences. Ecology 1973, 54, 427–432.
10. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
11. Simpson, E.H. Measurement of diversity. Nature 1949, 163, 688.
12. Lorenz, M.O. Methods of measuring the concentration of wealth. J. Am. Stat. Assoc. 1905, 9, 209–219.
13. Camargo, J.A. Temporal and spatial variations in dominance, diversity and biotic indices along a limestone stream receiving a trout farm effluent. Water Air Soil Pollut. 1992, 63, 343–359.
14. Camargo, J.A. New diversity index for assessing structural alterations in aquatic communities. Bull. Environ. Contam. Toxicol. 1992, 48, 428–434.
15. Camargo, J.A. Must dominance increase with the number of subordinate species in competitive interactions? J. Theor. Biol. 1993, 161, 537–542.
16. Camargo, J.A. On measuring species evenness and other associated parameters of community structure. Oikos 1995, 74, 538–542.
17. Camargo, J.A. Revisiting the relation between species diversity and information theory. Acta Biotheor. 2008, 56, 275–283.
18. Camargo, J.A. The Lorenz curve: A suitable framework to define satisfactory indices of landscape composition. Landscape Ecol. 2019, 34, 2735–2742.
19. Gini, C. Sulla misura della concentrazione e della variabilità dei caratteri. Atti del Reale Ist. Veneto di Sci. Lett. ed Arti 1914, 73, 1203–1248.
20. Gini, C. Measurement of inequality of incomes. Econ. J. 1921, 31, 124–126.
21. Schutz, R.R. On the measurement of income inequality. Am. Econ. Rev. 1951, 41, 107–122.
22. Idrees, M.; Ahmad, E. Measurement of income inequality: A survey. FJES 2017, 13, 1–32.
23. Fellman, J. Income inequality measures. Theor. Econ. Lett. 2018, 8, 557–574.
24. Milanovic, B. Global Inequality: A New Approach for the Age of Globalization; The Belknap Press of Harvard University Press: Cambridge, MA, USA, 2016.
25. Bonferroni, C. Elementi di Statistica Generale; Libreria Seeber: Firenze, Italy, 1930.
26. Zenga, M. Inequality curve and inequality index based on the ratios between lower and upper arithmetic means. Stat. Appl. 2007, 5, 3–27.
27. Giordani, P.; Giorgi, G.M. A fuzzy logic approach to poverty analysis based on the Gini and Bonferroni inequality indices. Stat. Method. Appl. 2010, 19, 587–607.
28. Greselin, F.; Pasquazzi, L.; Zitikis, R. Contrasting the Gini and Zenga indices of economic inequality. J. Appl. Stat. 2013, 40, 282–297.
29. Pasquazzi, L.; Zenga, M. Components of Gini, Bonferroni, and Zenga inequality indexes for EU income data. J. Off. Stat. 2018, 34, 149–180.
30. Gatlin, L.L. Information Theory and the Living System; Columbia University Press: New York, NY, USA, 1972.
31. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: New York, NY, USA, 1991.
32. Applebaum, D. Probability and Information: An Integrated Approach, 2nd ed.; Cambridge University Press: Cambridge, UK, 2008.
Figure 1. The cumulative proportion of abundance is related to the cumulative proportion of symbols, ranked from the symbol with the lowest relative abundance to the symbol with the highest relative abundance, for the ten probability distributions (I–X) described as hypothetical messages in Table 1. The reference distribution is depicted by the 45-degree line of equiprobability, where every symbol has the same relative abundance = 1/N, symbol dominance = 0, and symbol diversity = the number of different symbols (N). Symbol dominance may be estimated as the maximum vertical distance from the Lorenz curve to the 45-degree line, or as twice the area between the Lorenz curve and the 45-degree line, with both measures giving the same value whenever relative abundance inequality occurs only between dominant and subordinate symbols (as shown in this figure). In addition, symbol diversity = N (1 – symbol dominance), symbol redundancy = 1/symbol diversity, and information entropy = log2 symbol diversity.
Table 1. Ten probability distributions (I–X) are described as hypothetical messages: N = the number of different symbols; p1p33 = the relative abundances of symbols (symbol probabilities); Ptransfer = the whole relative abundance of dominant symbols (pd > 1/N) that must be transferred to subordinate symbols (ps < 1/N) to achieve equiprobability (pi = 1/N, including N = 1).
            I        II       III      IV       V        VI       VII      VIII     IX       X
N           32       16       8        4        2        3        5        9        17       33
p1          0.0375   0.075    0.15     0.3      0.6      0.6      0.6      0.6      0.6      0.6
p2          0.0375   0.075    0.15     0.3      0.4      0.2      0.1      0.05     0.025    0.0125
p3          0.0375   0.075    0.15     0.2      –        0.2      0.1      0.05     0.025    0.0125
p4          0.0375   0.075    0.15     0.2      –        –        0.1      0.05     0.025    0.0125
p5          0.0375   0.075    0.1      –        –        –        0.1      0.05     0.025    0.0125
p6          0.0375   0.075    0.1      –        –        –        –        0.05     0.025    0.0125
p7          0.0375   0.075    0.1      –        –        –        –        0.05     0.025    0.0125
p8          0.0375   0.075    0.1      –        –        –        –        0.05     0.025    0.0125
p9          0.0375   0.05     –        –        –        –        –        0.05     0.025    0.0125
p10         0.0375   0.05     –        –        –        –        –        –        0.025    0.0125
p11         0.0375   0.05     –        –        –        –        –        –        0.025    0.0125
p12         0.0375   0.05     –        –        –        –        –        –        0.025    0.0125
p13         0.0375   0.05     –        –        –        –        –        –        0.025    0.0125
p14         0.0375   0.05     –        –        –        –        –        –        0.025    0.0125
p15         0.0375   0.05     –        –        –        –        –        –        0.025    0.0125
p16         0.0375   0.05     –        –        –        –        –        –        0.025    0.0125
p17         0.025    –        –        –        –        –        –        –        0.025    0.0125
p18         0.025    –        –        –        –        –        –        –        –        0.0125
p19         0.025    –        –        –        –        –        –        –        –        0.0125
p20         0.025    –        –        –        –        –        –        –        –        0.0125
p21         0.025    –        –        –        –        –        –        –        –        0.0125
p22         0.025    –        –        –        –        –        –        –        –        0.0125
p23         0.025    –        –        –        –        –        –        –        –        0.0125
p24         0.025    –        –        –        –        –        –        –        –        0.0125
p25         0.025    –        –        –        –        –        –        –        –        0.0125
p26         0.025    –        –        –        –        –        –        –        –        0.0125
p27         0.025    –        –        –        –        –        –        –        –        0.0125
p28         0.025    –        –        –        –        –        –        –        –        0.0125
p29         0.025    –        –        –        –        –        –        –        –        0.0125
p30         0.025    –        –        –        –        –        –        –        –        0.0125
p31         0.025    –        –        –        –        –        –        –        –        0.0125
p32         0.025    –        –        –        –        –        –        –        –        0.0125
p33         –        –        –        –        –        –        –        –        –        0.0125
Ptransfer   0.1      0.1      0.1      0.1      0.1      0.267    0.4      0.489    0.541    0.57
Table 2. Ten probability distributions (XI–XX) are described as hypothetical messages: N = the number of different symbols; p1p10 = the relative abundances of symbols (symbol probabilities); Ptransfer = the whole relative abundance of dominant symbols (pd > 1/N) that must be transferred to subordinate symbols (ps < 1/N) to achieve equiprobability (pi = 1/N, including N = 1).
            XI       XII      XIII     XIV      XV       XVI      XVII     XVIII    XIX      XX
N           10       10       10       10       10       10       10       10       10       10
p1          0.15     0.15     0.15     0.15     0.15     0.19     0.19     0.19     0.19     0.19
p2          0.15     0.15     0.15     0.15     0.15     0.14     0.17     0.17     0.17     0.17
p3          0.15     0.15     0.15     0.15     0.15     0.14     0.13     0.15     0.15     0.15
p4          0.15     0.15     0.15     0.15     0.15     0.14     0.13     0.12     0.13     0.13
p5          0.15     0.15     0.15     0.15     0.15     0.14     0.13     0.12     0.11     0.11
p6          0.05     0.09     0.09     0.09     0.09     0.05     0.05     0.05     0.05     0.09
p7          0.05     0.04     0.07     0.07     0.07     0.05     0.05     0.05     0.05     0.07
p8          0.05     0.04     0.03     0.05     0.05     0.05     0.05     0.05     0.05     0.05
p9          0.05     0.04     0.03     0.02     0.03     0.05     0.05     0.05     0.05     0.03
p10         0.05     0.04     0.03     0.02     0.01     0.05     0.05     0.05     0.05     0.01
Ptransfer   0.25     0.25     0.25     0.25     0.25     0.25     0.25     0.25     0.25     0.25
Table 3. Measures of symbol dominance (dC1, dC2, dR, and dS), symbol diversity (DC1, DC2, DR, and DS), and information entropy (HC1, HC2, HR, and HS) are computed for the ten probability distributions (I–X) described as hypothetical messages in Table 1. Hmax = log2 N = maximum expected entropy; HC1/Hmax, HC2/Hmax, HR/Hmax, and HS/Hmax = normalized entropies. All statistics are explained in the text.
            I        II       III      IV       V        VI       VII      VIII     IX       X
dC1         0.100    0.100    0.100    0.100    0.100    0.267    0.400    0.489    0.541    0.570
DC1         28.800   14.400   7.200    3.600    1.800    2.199    3.000    4.599    7.803    14.190
HC1         4.848    3.848    2.848    1.848    0.848    1.137    1.585    2.202    2.964    3.827
HC1/Hmax    0.970    0.962    0.949    0.924    0.848    0.717    0.683    0.695    0.725    0.759
dC2         0.100    0.100    0.100    0.100    0.100    0.267    0.400    0.489    0.541    0.570
DC2         28.800   14.400   7.200    3.600    1.800    2.199    3.000    4.599    7.803    14.190
HC2         4.848    3.848    2.848    1.848    0.848    1.137    1.585    2.202    2.964    3.827
HC2/Hmax    0.970    0.962    0.949    0.924    0.848    0.717    0.683    0.695    0.725    0.759
dR          0.038    0.038    0.038    0.038    0.038    0.242    0.500    0.708    0.841    0.917
DR          30.768   15.384   7.692    3.846    1.923    2.273    2.500    2.632    2.703    2.740
HR          4.943    3.943    2.943    1.943    0.943    1.184    1.322    1.396    1.434    1.454
HR/Hmax     0.989    0.986    0.981    0.972    0.943    0.747    0.569    0.440    0.351    0.288
dS          0.020    0.020    0.020    0.020    0.020    0.138    0.317    0.500    0.650    0.762
DS          31.360   15.680   7.840    3.920    1.960    2.586    3.413    4.503    5.942    7.841
HS          4.971    3.971    2.971    1.971    0.971    1.371    1.771    2.171    2.571    2.971
HS/Hmax     0.994    0.993    0.990    0.985    0.971    0.865    0.763    0.685    0.629    0.589
Hmax        5.000    4.000    3.000    2.000    1.000    1.585    2.322    3.170    4.087    5.044
Table 4. Measures of symbol dominance (dC1, dC2, dR, and dS), symbol diversity (DC1, DC2, DR, and DS), and information entropy (HC1, HC2, HR, and HS) are computed for the ten probability distributions (XI–XX) described as hypothetical messages in Table 2. Hmax = log2 N = maximum expected entropy; HC1/Hmax, HC2/Hmax, HR/Hmax, and HS/Hmax = normalized entropies. All statistics are explained in the text.
            XI       XII      XIII     XIV      XV       XVI      XVII     XVIII    XIX      XX
dC1         0.250    0.250    0.250    0.250    0.250    0.250    0.250    0.250    0.250    0.250
DC1         7.500    7.500    7.500    7.500    7.500    7.500    7.500    7.500    7.500    7.500
HC1         2.907    2.907    2.907    2.907    2.907    2.907    2.907    2.907    2.907    2.907
HC1/Hmax    0.875    0.875    0.875    0.875    0.875    0.875    0.875    0.875    0.875    0.875
dC2         0.250    0.270    0.282    0.288    0.290    0.270    0.282    0.288    0.290    0.330
DC2         7.500    7.300    7.180    7.120    7.100    7.300    7.180    7.120    7.100    6.700
HC2         2.907    2.868    2.844    2.832    2.828    2.868    2.844    2.832    2.828    2.744
HC2/Hmax    0.875    0.863    0.856    0.852    0.851    0.863    0.856    0.852    0.851    0.826
dR          0.200    0.213    0.220    0.224    0.225    0.213    0.220    0.224    0.225    0.248
DR          8.000    7.870    7.800    7.760    7.750    7.870    7.800    7.760    7.750    7.520
HR          3.000    2.976    2.963    2.956    2.954    2.976    2.963    2.956    2.954    2.911
HR/Hmax     0.903    0.896    0.892    0.890    0.889    0.896    0.892    0.890    0.889    0.876
dS          0.122    0.137    0.149    0.157    0.161    0.128    0.131    0.133    0.134    0.173
DS          8.779    8.628    8.512    8.431    8.387    8.724    8.691    8.675    8.662    8.269
HS          3.134    3.109    3.089    3.076    3.068    3.125    3.120    3.117    3.115    3.048
HS/Hmax     0.943    0.936    0.930    0.926    0.924    0.941    0.939    0.938    0.937    0.918
Hmax        3.322    3.322    3.322    3.322    3.322    3.322    3.322    3.322    3.322    3.322
