1. Introduction
The Simpson index of a population distributed among $k$ categories or classes is defined as:
$$S = \sum_{i=1}^{k} p_i^2,$$
where $p_i$ denotes the probability (or proportion of occurrences) of class $i$. So, one has $p_i \geq 0$ and $\sum_{i=1}^{k} p_i = 1$, and therefore $S$ is defined on a $(k-1)$-simplex. This index equals the probability that two elements taken at random from the population of interest belong to the same category or class. The value of Simpson’s index ranges from $1/k$ to 1, with 1 representing no diversity; so, the larger the value of $S$, the lower the diversity. The name “Simpson index” stems from the influential 1949 paper by Edward Hugh Simpson entitled “Measurement of Diversity” [1], wherein he introduced what he called a measure of concentration defined in terms of population constants, with the minimum concentration equaling the maximum diversity. The Simpson index became a widely used quantitative metric in ecological and biodiversity studies as a tool for assessing and quantifying the diversity and evenness of species within ecological communities. It also applies to other biological problems, including those in the biomedical sciences, such as measuring diversity in the immune response to viral infections (e.g., [2]).
However, it is also acknowledged that the original mathematical formulation was used in cryptanalysis as far back as the 1920s and 1930s, therein named probability of monographic coincidence, by the American cryptanalysts William Friedman and Solomon Kullback (e.g., [3]). It is relevant to note that the Italian statistician Corrado Gini had already applied the quantity $1 - \sum_{i=1}^{k} f_i^2$ as early as 1912. He defined the index with relative frequencies $f_i$ computed from large samples, referring to it as an index of mutability for disconnected (qualitative) variables [4]. This quantity later became known as the “Gini–Simpson index”, a name adopted in the 1980s by the eminent statistician C. R. Rao (e.g., [5, 6]), who restated it with probabilities as $GS = 1 - \sum_{i=1}^{k} p_i^2$. The two names are sometimes confused: for instance, Jian et al. [7] consider a “Livelihood Simpson Index”, which in fact is a Gini–Simpson index. Given a probability distribution $(p_1, \dots, p_k)$, the Simpson and Gini–Simpson indices correspond to complementary events, verifying $S + GS = 1$.
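The complementarity of the two indices can be checked numerically. The following is a minimal sketch, not taken from the paper: the function names `simpson` and `gini_simpson` and the example proportions are illustrative assumptions.

```python
from math import isclose

def simpson(p):
    """Simpson index: sum of squared class probabilities."""
    return sum(pi * pi for pi in p)

def gini_simpson(p):
    """Gini-Simpson index: complement of the Simpson index."""
    return 1.0 - simpson(p)

# Hypothetical composition with k = 3 classes (illustrative values only).
p = [0.5, 0.3, 0.2]
uniform = [1 / 3] * 3  # the most diverse composition for k = 3

print(simpson(p))        # approximately 0.38
print(gini_simpson(p))   # approximately 0.62
print(simpson(uniform))  # approximately 1/3, the lower bound 1/k
```

For the uniform distribution the Simpson index reaches its lower bound $1/k$, while a distribution concentrated on a single class gives the upper bound 1.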
The use of a weighted version of the Simpson index appears to have been first reported in 1992, when Nowak and May [8] conceived the effective immune response against the virus population composed of different strains in the context of HIV infections, which was then revisited two years later [9].
Some refer to the weighted Simpson index when they are actually dealing with the weighted Gini–Simpson index (e.g., [10, 11]). The weighted Gini–Simpson index is defined as $\sum_{i=1}^{k} w_i p_i (1 - p_i)$, a concave function, differentiable in the interior of a simplex, with an identifiable maximum value. A method to determine the optimal point (maximizer) was framed based on the fact that one is dealing with feasibility values associated with the constraints of the simplex [12], namely, that the optimal coordinates must verify $p_i^* \geq 0$. These constraints were not taken into account in [11], which may lead to miscalculations.
Yet, Kasulo and Perrings used a price-weighted Simpson index to assess scenarios relative to the connection between the diversity of catch in a multi-species fishery and profit-maximizing regimes [13], and Ma used a symmetric form of the weighted Simpson index for ranking alternatives under a decision-making procedure involving mixed attribute values [14].
Many recent publications address diversity, including phylogenetic diversity, and dissimilarity indices, either in biology (e.g., [15, 16, 17]) or, following the pioneering approach of Patil and Taillie [18], in the social sciences (e.g., [19, 20]), mostly associated with developments relative to Hill’s numbers [21] and Rényi entropies [22]. Despite this, it seems that an analytical study concerning the optimal point (minimizer) and the optimal value (minimum) of the weighted Simpson index has not yet been published.
In this short communication, we undertake a comprehensive analytical study of the weighted Simpson index focusing on the optimization problem. We do not deal with statistical developments, which could be addressed in the future, presumably within the scope of information measures (e.g., [23]).
Concerning the structure of this paper, in Section 2, we solve both the optimization problem and the inverse problem and discuss different normalization procedures of the index in relation to the results obtained and, in Section 3, we present some final remarks.
2. The Weighted Simpson Index
Herein, we define the weighted Simpson index of a population distributed among $k$ categories or classes as:
$$S_w = \sum_{i=1}^{k} w_i p_i^2, \tag{1}$$
where $p_i$ denotes the probability (or proportion of occurrences) of class $i$ and $w_i > 0$ is a weight assigned to that class, altogether defining a vector of positive real values $w = (w_1, \dots, w_k)$. For now, we do not impose any extra conditions on the weights, leaving this matter to be discussed later. Our current focus is on understanding the broader context.
Weights allow one to consider various features for the classes. In the context of biodiversity, these features may be related, for example, to environmental benefits, conservation importance, and the vulnerability or economic value of species, a subject that has been emphasized since at least the early 1980s and exemplified with biomass and other importance values [24].
2.1. Optimizing the Weighted Simpson Index
Now, we address the optimization problem concerning the weighted Simpson index $S_w$ for fixed weights $w_1, \dots, w_k$. We recall that the lower the value of $S_w$, the greater the diversity of a composition associated with a positive valuation of the different classes or corresponding events. In general, for a weighted index, diversity is not maximized by the uniform distribution of classes. So, it is natural to ask the following: which distribution of the classes minimizes $S_w$, and what is its minimum value? In the following proposition, we derive this optimal distribution and the minimum value of $S_w$, and we also determine the maximum value of $S_w$ (which indicates minimum diversity and occurs at a vertex of the simplex).
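Proposition 1. Given a vector of positive real numbers $w = (w_1, \dots, w_k)$ and the weighted Simpson index $S_w$ defined on the simplex $\Delta_{k-1} = \{(p_1, \dots, p_k) : p_i \geq 0,\ \sum_{i=1}^{k} p_i = 1\}$,
- (a) the maximum value of $S_w$ is given by $\max\{w_1, \dots, w_k\}$;
- (b) the minimum value of $S_w$ is given by
$$S_w^{\min} = \frac{1}{\sum_{j=1}^{k} \frac{1}{w_j}} \tag{2}$$
and corresponds to the distribution
$$p_i^* = \frac{1/w_i}{\sum_{j=1}^{k} \frac{1}{w_j}}, \quad i = 1, \dots, k. \tag{3}$$
Proof. The weighted Simpson index is a continuous convex function defined on the compact domain $\Delta_{k-1}$, and thus the extreme value theorem ensures the existence of the (global) maximum and minimum values of the index. Being convex, $S_w$ attains its absolute maximum on the boundary of $\Delta_{k-1}$, at a vertex, and its absolute minimum in the interior of $\Delta_{k-1}$. At a vertex, all except one of the $p_i$ are zero; if $p_i = 1$, the index equals $w_i$, and therefore one has $\max S_w = \max\{w_1, \dots, w_k\}$.
The minimum can be found with the method of Lagrange multipliers, by locating the critical points of $S_w$ subject to the equality constraint $\sum_{i=1}^{k} p_i = 1$. As $S_w$ is differentiable in the interior of $\Delta_{k-1}$, one can build the Lagrangian function:
$$L(p_1, \dots, p_k, \lambda) = \sum_{i=1}^{k} w_i p_i^2 - \lambda \left( \sum_{i=1}^{k} p_i - 1 \right).$$
Equating partial derivatives to zero, we get:
$$\frac{\partial L}{\partial p_i} = 2 w_i p_i - \lambda = 0, \quad i = 1, \dots, k; \qquad \frac{\partial L}{\partial \lambda} = 1 - \sum_{i=1}^{k} p_i = 0. \tag{4}$$
From the first $k$ equations in (4), we conclude that for any specific $i$, the following equivalence holds:
$$w_j p_j = w_i p_i \iff p_j = \frac{w_i}{w_j}\, p_i, \quad j = 1, \dots, k.$$
Then, using the constraint $\sum_{j=1}^{k} p_j = 1$, one has
$$w_i p_i \sum_{j=1}^{k} \frac{1}{w_j} = 1$$
and, recalling that a critical point of a convex function is a global minimizer, we obtain the optimal coordinates of the minimum point given by:
$$p_i^* = \frac{1/w_i}{\sum_{j=1}^{k} \frac{1}{w_j}}, \quad i = 1, \dots, k.$$
Now one can evaluate the minimum value of the weighted Simpson index (1) as follows:
$$S_w^{\min} = \sum_{i=1}^{k} w_i \left( p_i^* \right)^2 = \frac{\sum_{i=1}^{k} \frac{1}{w_i}}{\left( \sum_{j=1}^{k} \frac{1}{w_j} \right)^2} = \frac{1}{\sum_{j=1}^{k} \frac{1}{w_j}}. \quad \square$$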
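The closed-form results of Proposition 1 can be checked numerically. The sketch below is illustrative only: the helper names, the weights, and the random sampling scheme are assumptions, not part of the original derivation. It assumes the optimal distribution is proportional to the reciprocals of the weights and that the minimum equals the reciprocal of the sum of those reciprocals.

```python
import random
from math import isclose

def weighted_simpson(w, p):
    """Weighted Simpson index: sum of w_i * p_i^2."""
    return sum(wi * pi * pi for wi, pi in zip(w, p))

def minimizer(w):
    """Closed-form optimal distribution: p_i* proportional to 1/w_i."""
    total = sum(1.0 / wi for wi in w)
    return [(1.0 / wi) / total for wi in w]

def minimum_value(w):
    """Closed-form minimum: reciprocal of the sum of reciprocal weights."""
    return 1.0 / sum(1.0 / wi for wi in w)

def random_simplex_point(k, rng):
    """Draw a point of the simplex by normalizing exponential variates."""
    x = [rng.expovariate(1.0) for _ in range(k)]
    s = sum(x)
    return [xi / s for xi in x]

w = [1.0, 2.0, 4.0]            # hypothetical weights
p_star = minimizer(w)          # equals (4/7, 2/7, 1/7) up to rounding
s_min = minimum_value(w)       # equals 4/7 up to rounding

# No sampled composition should fall below the minimum or exceed max(w).
rng = random.Random(0)
values = [weighted_simpson(w, random_simplex_point(len(w), rng))
          for _ in range(10_000)]
print(min(values) >= s_min - 1e-12, max(values) <= max(w) + 1e-12)
```

Random sampling cannot prove optimality; it only offers a quick sanity check that no sampled composition beats the closed-form minimizer and that the vertex with the largest weight bounds the index from above.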
2.2. Some Further Comments on the Minimizer
Note that the minimum value of the weighted Simpson index (2) is related to the harmonic mean $H$ of the weights by $S_w^{\min} = H/k$. The harmonic mean $H$ of $k$ positive real numbers $w_1, \dots, w_k$ is the reciprocal of the arithmetic mean of the reciprocals of those numbers:
$$H = \frac{k}{\sum_{i=1}^{k} \frac{1}{w_i}}$$
and is appropriate for averaging rates over constant numerator units [25]. As a typical example, suppose several sums of capital are invested at different interest rates and all yield the same income. The single rate at which the total capital would have to be invested to produce the same total revenue is equal to the harmonic mean of the individual rates ([26], p. 240).
The special case of all weights being equal to 1 leads to the uniform distribution $p_i^* = 1/k$, $i = 1, \dots, k$, and to the minimum value $1/k$, as can be seen from expressions (3) and (2), and also because in this case $S_w = S$, with $S$ being the (unweighted) Simpson index whose minimum is $1/k$.
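The link between the minimum value and the harmonic mean can be illustrated with the standard library's `statistics.harmonic_mean`. This is a minimal sketch with illustrative weights; `minimum_value` is a hypothetical helper that evaluates the reciprocal of the sum of reciprocal weights.

```python
from math import isclose
from statistics import harmonic_mean

def minimum_value(w):
    """Minimum of the weighted Simpson index: 1 / sum(1 / w_i)."""
    return 1.0 / sum(1.0 / wi for wi in w)

w = [1.0, 2.0, 4.0]   # hypothetical weights
k = len(w)

# The minimum equals the harmonic mean of the weights divided by k.
print(isclose(minimum_value(w), harmonic_mean(w) / k))

# With unit weights the harmonic mean is 1 and the minimum reduces to 1/k.
unit = [1.0] * 5
print(isclose(minimum_value(unit), 1 / 5))
```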
Rewriting the optimal distribution in (3) as:
$$p_i^* = \frac{1}{1 + w_i \sum_{j \neq i} \frac{1}{w_j}}, \quad i = 1, \dots, k,$$
it is straightforward to conclude that the weights are driving forces of the optimal probabilities, operating reciprocally. When the weight attached to a specific class increases and all the others remain unchanged, the corresponding optimal probability decreases; when the weight attached to another class increases, all the others remaining unchanged, the optimal probability of the original class increases.
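The reciprocal effect of the weights on the optimal probabilities can be seen in a small numerical experiment. The sketch below uses illustrative weights and a hypothetical helper `minimizer`, assuming optimal probabilities proportional to the reciprocals of the weights.

```python
def minimizer(w):
    """Optimal distribution: p_i* proportional to 1/w_i."""
    total = sum(1.0 / wi for wi in w)
    return [(1.0 / wi) / total for wi in w]

base = [1.0, 2.0, 4.0]       # hypothetical weights
own_up = [1.0, 3.0, 4.0]     # only the weight of class 2 increases
other_up = [1.0, 2.0, 8.0]   # only the weight of class 3 increases

p2_base = minimizer(base)[1]
p2_own = minimizer(own_up)[1]      # smaller than p2_base
p2_other = minimizer(other_up)[1]  # larger than p2_base

print(p2_own < p2_base < p2_other)
```

Raising a class's own weight lowers its optimal share, whereas raising the weight of a competing class raises it, in line with the monotonicity described above.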
If one considers the valuation of the classes of a distribution in the usual sense of importance assessment, with an ordering of positive real numbers $v_1, \dots, v_k$, expecting that $v_i > v_j$ would promote the result $p_i^* > p_j^*$, then one should be aware that the weights associated with the optimal point (3) would not be the values $v_i$ and $v_j$, but could possibly be conceived as $w_i = 1/v_i$ and $w_j = 1/v_j$ instead.
2.3. Normalization
In the realm of indices, normalization comprises the adjustment of the index values to conform to a predetermined range or interval. The use of normalized indices in applications can be important for several reasons. Normalized indices provide a standardized scale, usually ranging from 0 to 1, or 0% to 100%, irrespective of the specific scale or magnitude of the data. This standardization allows for direct comparisons between different datasets or populations and remains consistent across different contexts and scales.
In the case of the weighted Simpson index, this can be done in the classic way by defining the normalized weighted Simpson index as:
$$S_w^{N} = \frac{S_w - S_w^{\min}}{S_w^{\max} - S_w^{\min}}.$$
However, this normalization eliminates the effect of the number of classes in the distribution (e.g., species in a community). For example, in the case of all weights equal to 1, the normalized weighted Simpson index of a population with $k$ species uniformly distributed is always 0, and thus independent of the number of species. The fact that normalized indices of diversity can be misleading has already been mentioned by several authors (e.g., [11]).
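The loss of information about the number of classes can be made concrete. The sketch below assumes the usual min-max form of normalization, (index − minimum) / (maximum − minimum), with the minimum and maximum taken from Proposition 1. The function name and the test compositions are illustrative, and at least two classes are required to avoid a zero denominator.

```python
def normalized_weighted_simpson(w, p):
    """Min-max normalized weighted Simpson index (assumed form); needs k >= 2."""
    s = sum(wi * pi * pi for wi, pi in zip(w, p))
    s_min = 1.0 / sum(1.0 / wi for wi in w)   # minimum from Proposition 1(b)
    s_max = max(w)                            # maximum from Proposition 1(a)
    return (s - s_min) / (s_max - s_min)

# Unit weights, uniform compositions with very different numbers of classes.
for k in (2, 5, 50):
    value = normalized_weighted_simpson([1.0] * k, [1.0 / k] * k)
    print(k, abs(value) < 1e-12)   # the normalized index is 0 up to rounding
```

Communities with 2, 5, or 50 evenly distributed species all receive the same normalized value, which is why the normalized index alone cannot distinguish them.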
In the case of a weighted index, it may be relevant to normalize the weights so that the index becomes dimensionless and independent of the order of magnitude of the weights. For the weighted Simpson index, the normalization can be performed, for example, by imposing the condition $\sum_{i=1}^{k} w_i = 1$. This can be accomplished by dividing each non-normalized weight by the sum of all the weights. So, the weighted Simpson index with normalized weights corresponding to the weighted Simpson index $S_w$ is denoted by $S_{\tilde{w}}$, with $\tilde{w} = (\tilde{w}_1, \dots, \tilde{w}_k)$ and with $\tilde{w}_i = w_i / \sum_{j=1}^{k} w_j$.
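Proposition 2. Let $S_{\tilde{w}}$ be the weighted Simpson index computed with normalized weights, corresponding to the weighted Simpson index $S_w$. Then:
- (a) The maximum value of $S_{\tilde{w}}$ is given by $\max\{w_1, \dots, w_k\} / \sum_{j=1}^{k} w_j$;
- (b) The minimum value of $S_{\tilde{w}}$ is given by $1 \big/ \left( \sum_{j=1}^{k} w_j \cdot \sum_{j=1}^{k} \frac{1}{w_j} \right)$ and corresponds to the distribution $p_i^* = \dfrac{1/w_i}{\sum_{j=1}^{k} 1/w_j}$ for $i = 1, \dots, k$.
Proof. Note that $S_{\alpha w} = \alpha S_w$ for any real number $\alpha > 0$. In fact, $S_{\alpha w} = \sum_{i=1}^{k} \alpha w_i p_i^2 = \alpha \sum_{i=1}^{k} w_i p_i^2 = \alpha S_w$. As $S_{\tilde{w}} = \alpha S_w$ with $\alpha = 1 / \sum_{j=1}^{k} w_j > 0$, this normalization procedure does not affect the maximum and minimum points of the index, and the maximum and minimum values of $S_{\tilde{w}}$ are obtained by multiplying by $\alpha$, respectively, the maximum and minimum values of $S_w$. □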
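The scaling argument in the proof can be verified directly. The following sketch uses illustrative weights and hypothetical helper names. It normalizes the weights to sum to 1 and checks that the minimizer is unchanged, while the extreme values are rescaled by the reciprocal of the sum of the original weights.

```python
from math import isclose

def minimizer(w):
    """Optimal distribution: p_i* proportional to 1/w_i."""
    total = sum(1.0 / wi for wi in w)
    return [(1.0 / wi) / total for wi in w]

def minimum_value(w):
    """Minimum of the weighted Simpson index: 1 / sum(1 / w_i)."""
    return 1.0 / sum(1.0 / wi for wi in w)

def normalize_weights(w):
    """Divide each weight by the sum of all weights."""
    total = sum(w)
    return [wi / total for wi in w]

w = [1.0, 2.0, 4.0]                 # hypothetical weights
w_tilde = normalize_weights(w)
alpha = 1.0 / sum(w)                # scaling factor used in the proof

same_point = all(isclose(a, b) for a, b in zip(minimizer(w), minimizer(w_tilde)))
print(same_point)
print(isclose(minimum_value(w_tilde), alpha * minimum_value(w)))
print(isclose(max(w_tilde), alpha * max(w)))
```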
2.4. The Inverse Problem
The inverse procedure relative to the weighted Gini–Simpson index was formulated recently [27]. Now, we consider the analogous problem concerning the weighted Simpson index, stated as follows: given a minimum point of $S_w$ denoted by $p^* = (p_1^*, \dots, p_k^*)$, verifying both $p_i^* > 0$ and $\sum_{i=1}^{k} p_i^* = 1$, what would be a set of weights able to generate that solution? The answer to this question is straightforward, as follows:
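Proposition 3. Suppose that $p^* = (p_1^*, \dots, p_k^*)$, verifying both $p_i^* > 0$ and $\sum_{i=1}^{k} p_i^* = 1$, is a minimum point of the weighted Simpson index with normalized weights. Then
$$w_i = \frac{1/p_i^*}{\sum_{j=1}^{k} \frac{1}{p_j^*}}, \quad i = 1, \dots, k,$$
and the harmonic mean of the weights equals the harmonic mean of the coordinates of the minimum point, $H(w) = H(p^*)$.
Proof. Recalling (3), the weights must be chosen to be inversely proportional to the optimal coordinates, with the proportionality constant equal to $S_w^{\min}$ (2), as we can see by rewriting:
$$p_i^* = \frac{S_w^{\min}}{w_i} \iff w_i = \frac{S_w^{\min}}{p_i^*}.$$
For normalized weights, using the condition $\sum_{i=1}^{k} w_i = 1$, one gets $S_w^{\min} = 1 \big/ \sum_{j=1}^{k} \frac{1}{p_j^*}$, and so the weights must be chosen as:
$$w_i = \frac{1/p_i^*}{\sum_{j=1}^{k} \frac{1}{p_j^*}}, \quad i = 1, \dots, k.$$
Note that from the condition that the sum of the weights equals 1, it follows that $S_w^{\min} = 1 \big/ \sum_{j=1}^{k} \frac{1}{p_j^*}$, and thus, by (2), we can proceed with the equality $1 \big/ \sum_{j=1}^{k} \frac{1}{w_j} = 1 \big/ \sum_{j=1}^{k} \frac{1}{p_j^*}$, obtaining the result concerning the corresponding harmonic means $H$. □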
For non-normalized weights, there are infinitely many solutions to the inverse problem, parameterized by the minimum $S_w^{\min}$. In other words, in that case, for the inverse problem to have a unique solution, it is not sufficient to know the minimum point, and the minimum value must also be given.
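The inverse problem lends itself to a round-trip check: recover normalized weights from a target minimum point, then confirm that those weights reproduce the point. The sketch below uses a hypothetical target distribution and illustrative helper names. It assumes normalized weights proportional to the reciprocals of the target coordinates.

```python
from math import isclose
from statistics import harmonic_mean

def minimizer(w):
    """Optimal distribution: p_i* proportional to 1/w_i."""
    total = sum(1.0 / wi for wi in w)
    return [(1.0 / wi) / total for wi in w]

def weights_from_minimizer(p_star):
    """Normalized weights (summing to 1) inversely proportional to p_i*."""
    total = sum(1.0 / pi for pi in p_star)
    return [(1.0 / pi) / total for pi in p_star]

p_star = [0.5, 0.3, 0.2]            # hypothetical target minimum point
w = weights_from_minimizer(p_star)
recovered = minimizer(w)

print(all(isclose(a, b) for a, b in zip(recovered, p_star)))
print(isclose(harmonic_mean(w), harmonic_mean(p_star)))

# Any positive rescaling of the weights yields the same minimum point,
# which is why non-normalized weights are not uniquely determined.
scaled = [3.0 * wi for wi in w]
print(all(isclose(a, b) for a, b in zip(minimizer(scaled), p_star)))
```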
3. Final Remarks
We have presented a detailed analytical study of the optimization problem associated with the weighted Simpson index. The core result is that at the optimal point one has $w_i p_i^* = w_j p_j^*$ for $i \neq j$, and also that for all $i$, one gets $w_i p_i^* = S_w^{\min}$, the minimum value of the index. So, there is a trade-off between the weights and the optimal probabilities (or proportions of occurrences) in what could be seen as an equilibrium condition. The fact that Nowak [8, 9] has used the weighted Simpson index as a Lyapunov function to assess an antigenic diversity threshold seems compatible with an equilibrium point perspective, which could also be used in a broad sense concerning different problems within several scientific fields. The mathematical results are obtained as simple closed-form solutions and so do not seem prone to unexpected computational difficulties.
Furthermore, for a random variable $W$ with values corresponding to the previous weights $w_1, \dots, w_k$, and the probability function given by $P(W = w_i) = p_i^*$, with $p_i^*$ defined as in (3), computing the mean value of $W$ entails $E[W] = \sum_{i=1}^{k} w_i p_i^* = k\, S_w^{\min}$, which equals the harmonic mean of the weights, meaning $E[W] = H$.
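This last identity, and the equilibrium condition that the product of each weight and its optimal probability is constant, can be checked with a few lines. The weights and helper name below are illustrative assumptions.

```python
from math import isclose
from statistics import harmonic_mean

def minimizer(w):
    """Optimal distribution: p_i* proportional to 1/w_i."""
    total = sum(1.0 / wi for wi in w)
    return [(1.0 / wi) / total for wi in w]

w = [1.0, 2.0, 4.0]        # hypothetical weights
p_star = minimizer(w)

# Equilibrium condition: w_i * p_i* is the same for every class.
products = [wi * pi for wi, pi in zip(w, p_star)]
print(all(isclose(x, products[0]) for x in products))

# Mean of the weights under the optimal distribution vs. harmonic mean.
mean_w = sum(products)
print(isclose(mean_w, harmonic_mean(w)))
```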