#### 4.1.2. k-Dominance Plot

This kind of plot shows percentage cumulative abundance (*y* axis) against species rank or log species rank (*x* axis). Here, more elevated curves represent the less diverse assemblages (see [12,13]).

#### 4.1.3. Abundance/Biomass Comparison Curve or ABC Curve

A variant of the k-dominance plot was introduced by [14]. The related curve is constructed using two measures of abundance: the number of individuals and biomass. The level of disturbance, pollution-induced or otherwise, affecting the assemblage can be inferred from the resulting curve.

The method was developed for benthic macrofauna and has been used productively by a number of investigators in this context.

The ABC plot is used to study the entire species abundance distribution. The author of [15] has introduced a summary statistic specified as *W* (named after R. M. Warwick), and defined by

$$W = \sum\_{i=1}^{S} \frac{B\_i - A\_i}{50(S - 1)}$$

where *B*<sub>*i*</sub> denotes the biomass value of each species rank (*i*) in the ABC curve, *A*<sub>*i*</sub> represents the abundance (individuals) value of each species rank (*i*), and *S* represents the number of species. *A*<sub>*i*</sub> and *B*<sub>*i*</sub> do not necessarily refer to the same species, since species are ranked separately for each abundance measure. The result will be positive if the biomass curve is consistently above the individuals curve, which indicates an undisturbed assemblage. In contrast, a grossly perturbed assemblage, in which the individuals curve is consistently above the biomass curve, will give a negative value. Curves that overlap and produce a value of *W* close to 0 signify moderate disturbance. *W* usually ranges from −1 to +1.
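
As an illustration, the *W* statistic can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: it assumes that *A*<sub>*i*</sub> and *B*<sub>*i*</sub> are the cumulative percentage values of the two curves at rank *i*, with species ranked separately by individuals and by biomass. The function name `w_statistic` and the example data are hypothetical.

```python
def w_statistic(abundance, biomass):
    """W statistic from per-species individual counts and biomasses.

    Assumption: A_i and B_i are cumulative percentages of the ranked
    abundance and biomass values (the two curves of the ABC plot).
    """
    s = len(abundance)
    if s != len(biomass) or s < 2:
        raise ValueError("need two lists of equal length with S >= 2")

    def cumulative_percent(values):
        # Rank from largest to smallest, then accumulate as a percentage.
        ranked = sorted(values, reverse=True)
        total = sum(ranked)
        running, curve = 0.0, []
        for v in ranked:
            running += v
            curve.append(100.0 * running / total)
        return curve

    a_curve = cumulative_percent(abundance)
    b_curve = cumulative_percent(biomass)
    return sum(b - a for a, b in zip(a_curve, b_curve)) / (50.0 * (s - 1))


# Hypothetical assemblage: individuals spread evenly, biomass concentrated.
w = w_statistic(abundance=[10, 10, 10, 10], biomass=[97, 1, 1, 1])
```

In this toy example the biomass curve lies above the individuals curve, so `w` is positive; swapping the two inputs changes its sign.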

The *W* statistic is generally computed for each sample separately. If treatments have been replicated, ANOVA can be used to test for significant differences. Alternatively, if un-replicated samples have been taken along a transect or over a time series (such as before, during and after a pollution event), graphing *W* values can be a very effective way of illustrating shifts in the composition of the assemblage. *W* statistics are most useful when ABC curves are used to discriminate among samples (see [16]).

#### *4.2. Species Abundance Models*

Statistical models were initially devised as the best empirical fits to the observed data (see [17]). One of their advantages is that they help the investigator to compare different assemblages objectively. In some cases, a parameter of the distribution can be used as an index of diversity. A second set of models comprises the biological or theoretical models.

#### 4.2.1. Statistical Models

• log series model

In this model, the number of species (*y* axis) is displayed against the number of individuals per species (*x* axis), with the abundance classes presented on a log scale. This plot is typically used when the log normal distribution is chosen. This type of graph is sometimes dubbed the "Preston plot" (see [18]) after F. W. Preston, who pioneered the use of the log normal model in [19]. In the log series model, the mode falls in the lowest abundance class, which represents a single individual, so the plot places more focus on rare species. Log transforming the *x* axis tends to shift the mode to the right, revealing a log normal pattern.

• Negative binomial model

The author of [20] describes many applications of the negative binomial model in ecology, particularly in estimating species richness (see [21]). However, the authors of [22] remarked that it is only rarely fitted to data on species abundance (one exception being [23]). Because the log series is a limiting form of the negative binomial, the model nevertheless has some potential interest.

• Zipf-Mandelbrot model

This model has its roots in linguistics and information theory. It has several applications in environmental diversity, which are well described in [24–28]. The Zipf-Mandelbrot model implies a rigorous sequence of colonists, with the same species always present at the same point in the succession in identical habitats. According to [29], this model is not better than the log series or log normal model. It has, however, been used successfully in [28,30–32]. We also refer to [33–35] for the use of this model in terrestrial studies, and to [36] for its use in aquatic systems. The author of [37] states that it can be used to test the performance of various diversity estimators. In one application, the Zipf-Mandelbrot model provided the best description of the cover data, while the biomass data were compatible with the log normal distribution.

#### 4.2.2. Goodness of Fit Tests

A goodness of fit test, often the *χ*<sup>2</sup> test, is used to compare the observed and expected frequencies of species in each abundance class [38]. To fit a deterministic model, the conventional method is to assign the observed data to abundance classes, usually based on log<sub>2</sub>. The number of species expected in each abundance class is then determined according to the model used.

The model takes the observed values of *S* (number of species) and *N* (total abundance), and then determines how these *N* individuals should be distributed among the *S* species. If *p* < 0.05, the model is rejected because it does not adequately describe the pattern of species abundances. If *p* > 0.05, the fit cannot be rejected; ideally, *p* >> 0.05 is taken to indicate a good fit. Tests of empirical data typically involve a very small number of abundance classes (10 or fewer), which reduces the degrees of freedom (d.f.) available. The fewer the degrees of freedom, the harder it becomes to reject a model.

The author of [29] remarked that goodness of fit tests work most effectively with large assemblages (which might not, however, be ecologically coherent units). Instead of *χ*<sup>2</sup>, he recommends the Kolmogorov–Smirnov (K–S) goodness of fit (*GOF*) test (see [38,39]). Indeed, Tokeshi suggests adopting the K–S *GOF* test as the standard method of assessing the goodness of fit of deterministic models. He also suggests that the K–S two-sample test can be used to compare the abundance patterns of two datasets directly.

The author of [11] reinforces that, if one model fits the data and another does not, it is not possible to conclude that the fit of the two is significantly different. His solution is to use replicated observations. The deviations can be log transformed, if necessary to achieve normality. A multiple comparison test, for example, Duncan's new multiple range test (see [38]) can then be used to infer which models are significantly different from one another.

#### 4.2.3. Biological or Theoretical Models

• Deterministic and stochastic models

Deterministic models assume that *N* individuals will be distributed amongst the *S* species in the assemblage in a fixed way. The geometric series is the only deterministic niche apportionment model. Stochastic models recognize that replicate communities structured according to the same set of rules will vary in the relative abundances of the species found there, and they try to capture the random elements inherent in natural processes. This makes biological sense. Perhaps not surprisingly, stochastic models are more challenging to fit than their deterministic counterparts. In a practical sense, it is necessary to know whether a model is deterministic or stochastic. Stochastic models have a complexity that requires replicated data, a problem that is addressed in Tokeshi's refinements (see [40]).

• Geometric series

Assume that the dominant species pre-empts a proportion *k* of a limiting resource, the second most dominant species pre-empts the same proportion *k* of the remainder, and so on, until all *S* species have been accommodated. If this assumption is fulfilled and species abundance is proportional to the amount of resource used, the resulting pattern will follow a geometric series (the niche pre-emption hypothesis). Here, species are ranked from most to least abundant. The ratio of the abundance of each species to that of its predecessor is constant throughout the ranked list, so the series will appear as a straight line when plotted on a log abundance/species rank graph. This plot helps identify whether or not a dataset is consistent with a geometric series. A full mathematical treatment of the geometric series can be found in [41], which also presents the species abundance distribution corresponding to the rank/abundance series. In a geometric series, the abundances of species, ranked from the most to least abundant, will be (see [41,42]):

$$n\_i = N C\_k k (1 - k)^{i - 1}$$

where *n*<sub>*i*</sub> is the number of individuals in the *i*th species, *S* is the total number of species, *N* is the total number of individuals, *k* is the proportion of the remaining niche space occupied by each successively colonizing species (*k* is a constant), and *C*<sub>*k*</sub> = [1 − (1 − *k*)<sup>*S*</sup>]<sup>−1</sup> is a constant that ensures that ∑<sub>*i*=1</sub><sup>*S*</sup> *n*<sub>*i*</sub> = *N*. Because the ratio of the abundance of each species to the abundance of its predecessor is constant through the ranked list of species, the series will appear as a straight line when plotted on a log abundance/species rank graph.

• Broken stick model

The broken stick model, also known as the random niche boundary hypothesis, was proposed in [43]. When relative species abundance is plotted on a linear scale on the *y* axis against the logged species sequence on the *x* axis, ordered from most to least abundant, the model yields a straight line. As [22] states, the model has the drawback that it may be derived from more than one hypothesis. It provides evidence that some ecological factor is being shared more or less evenly between species (see [41]). According to [29], it represents a group of *S* species with equal competitive ability vying for niche space. It is typically expressed in terms of rank order abundance (see [11]). The authors of [44] prepared a program which estimates species abundance. This model can be tricky to fit to empirical data (see [29]).

• Tokeshi's models

Tokeshi has developed several niche apportionment models, including the dominance pre-emption, random fraction, power fraction, MacArthur fraction, and dominance decay models in [45,46]. They work with the assumption that abundance is proportional to the fraction of niche space occupied by a species. All of these models assume that the niche selected for division is split at random. The only difference between the models is how the target niche is selected. The more likely it is that a larger niche is selected, the more even the resulting species abundance distribution will be. Evenness increases from the dominance pre-emption model onward, in the order listed above. The random assortment model represents a random collection of niches of arbitrary sizes (see [45]).

**–** Random fraction

In this model, available niche space is divided at random into two pieces. Among these two, one is selected randomly for further subdivision, and so on, till all species are accommodated (see [40]). The sequential breakage model depicts a situation in which a new colonist competes for the niche of a species that is already in the community and takes over a random proportion of the previously existing niche. This model can be used to cover speciation events (see [40]). In addition, this is conceptually simple and found to be fit for a small community of freshwater *chironomids* (see [45,47]). The authors of [44] have created a Microsoft Excel program which can model the species abundance and distribution associated with it.
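
The random fraction process described above can be simulated with a few lines of Python. This is only a sketch under stated assumptions: every existing niche is equally likely to be chosen for the next split, and the split point is uniform on the chosen niche. The function name `random_fraction` and its parameters are hypothetical.

```python
import random


def random_fraction(n_species, total=1.0, rng=None):
    """Simulate one replicate of a random fraction style niche division.

    Assumptions: each existing niche has the same chance of being
    selected, and the selected niche is cut at a uniform random point.
    """
    rng = rng or random.Random()
    niches = [total]
    while len(niches) < n_species:
        # Pick one existing niche at random, regardless of its size.
        piece = niches.pop(rng.randrange(len(niches)))
        cut = rng.random()
        niches.extend([piece * cut, piece * (1.0 - cut)])
    # Report niche sizes (taken as relative abundances) from most to least.
    return sorted(niches, reverse=True)


# One hypothetical replicate with six species and a fixed seed.
example = random_fraction(6, rng=random.Random(1))
```

Because the model is stochastic, a single replicate says little on its own; in practice many replicates would be simulated and summarized per rank.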

**–** Power fraction model

Unlike Tokeshi's other models, this model is applicable to species-rich assemblages (see [46]). In this case, the niche space is subdivided in the same way as in the random fraction model. However, the probability that a niche is split increases, albeit only slightly, with its size (*x*) through a power function with exponent *K*. As *K* approaches 1, the largest niche is increasingly likely to be selected for fragmentation. When *K* = 1, the power fraction model resembles the MacArthur fraction model. When *K* = 0, the niche to be fragmented is chosen at random and the model becomes the random fraction model. Usually *K* is set to 0.5 for the power fraction model (see [46]). According to Tokeshi, the model accounts for virtually all assemblages. The author of [48] states that larger niches have a higher probability of fragmentation, which could occur either ecologically or evolutionarily.

**–** Dominance pre-emption model

This model assumes that each species pre-empts more than half of the remaining niche space and is therefore dominant over all the remaining species combined (see [45]). The proportion of available niche space assigned to each species lies between 0.5 and 1. When the number of replications increases (or *K* = 0.75, the same as the power fraction model), it becomes more similar to the geometric series (see [45]). It can also be applied to niche fragmentation (see [29,40]).

**–** MacArthur fraction model

The MacArthur fraction and the broken stick models lead to the same predicted species abundance distribution. In this model, the probability of niche fragmentation is proportional to niche size. This creates a very uniform distribution of species abundances and is only plausible in small communities of taxonomically related species. However, Tokeshi also reminds us that unreplicated data are not suitable for fitting either the broken stick or the MacArthur fraction model.

**–** Dominance decay model

Here, a more uniform pattern of species abundance is considered. The largest niche is always selected for fragmentation and is split at random. To date, no empirical data indicate that communities as predicted by Tokeshi's dominance decay model can be found in nature. This may be due to insufficient investigation or to the low chance of finding such an even distribution in nature.

#### 4.2.4. Fitting Niche Apportionment Models to Empirical Data

The author of [45] found a new way of testing stochastic models. Species (*S*) are listed in decreasing order of abundance. A niche apportionment model is considered to fit if the mean observed abundance at each rank falls within the confidence limits of the expected abundance, which are given by

$$R(x\_{i}) = \mu\_{i} \pm \frac{r\sigma\_{i}}{\sqrt{n}}$$

where *x*<sub>*i*</sub> is the mean observed abundance of the species of rank *i* (*i* = 1 for the most abundant species, *i* = 2 for the next most abundant, and so on to *i* = *S* for the least abundant); *μ*<sub>*i*</sub> is the mean expected abundance at rank *i*; *σ*<sub>*i*</sub> is the standard deviation of the expected abundance at rank *i*; *n* is the number of replicated samples; and *r* is the breadth of the confidence limit.

The mean abundances constitute the observed distribution. The expected abundance is estimated by simulating the model a large number of times (*N*) for an assemblage of the same number of species (*S*), which yields *μ*<sub>*i*</sub> and *σ*<sub>*i*</sub> for each rank. Confidence limits are then assigned to each rank of expected abundance using *n* (the number of replicated samples) rather than *N* (the number of times the model was simulated).

#### *4.3. Species Richness Indices*

Two well-known species richness indices, both easy to calculate, were introduced in [49,50], respectively.

• Margalef's diversity index (*D*<sub>*Mg*</sub>)

$$D\_{Mg} = \frac{S - 1}{\log N}$$

• Menhinick's index (*DMn*)

$$D\_{Mn} = \frac{S}{\sqrt{N}}$$

where *S* is the number of species recorded, *N* the total number of individuals in the sample.

Despite the attempt to correct for sample size, both measures remain strongly influenced by sampling effort. Nonetheless, they are intuitively meaningful indices that can be useful in biological diversity research.
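
Both richness indices are simple enough to compute directly. The sketch below assumes that the logarithm in Margalef's index is the natural logarithm; the function names and the example values are hypothetical.

```python
import math


def margalef_index(n_species, n_individuals):
    """D_Mg = (S - 1) / log N, taking log as the natural logarithm (assumption)."""
    return (n_species - 1) / math.log(n_individuals)


def menhinick_index(n_species, n_individuals):
    """D_Mn = S / sqrt(N)."""
    return n_species / math.sqrt(n_individuals)


# Hypothetical sample: 10 species among 100 individuals.
d_mg = margalef_index(10, 100)
d_mn = menhinick_index(10, 100)
```

Doubling the number of individuals while keeping *S* fixed lowers both values, which illustrates the dependence on sampling effort noted above.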

#### Estimating Species Richness

There are two approaches to estimating species richness from samples, as cited in [51,52]. The first is the extrapolation of species accumulation or species–area curves. The second approach is to use a non-parametric estimator.

• Species accumulation curves

Species accumulation curves, also known as collector curves, plot *S*, the total number of species, as a function of sampling effort (*n*) (see [51]). These curves are widely used in botanical research (see [53,54]). The species–area curve is one type of species accumulation curve. The most common forms are plots of *S* versus area (*A*) for different areas (such as islands) and plots for increasingly larger parcels of the same region.

The overall shape of a species accumulation curve depends on the order in which samples (or individuals) are added. Randomizing this order makes the curve smoother and also allows the mean and standard deviation of species richness to be deduced. According to [55], these curves resemble rarefaction curves (see [56]). Species accumulation curves move from left to right as new species are added, whereas rarefaction curves conventionally move from right to left. Many scientists have plotted species accumulation curves using linear scales on both axes. However, it is better to use a log-transformed *x* axis, since semi-log plots make it easier to distinguish asymptotic curves from logarithmic curves (see [57]). To find an estimate of total species richness, the authors of [58] extrapolate the graph.

Functions used in this kind of extrapolation can be classified into asymptotic or nonasymptotic. Both of their roles are to help the user predict an increase in species richness with additional sampling effort rather than to estimate total species richness.

**–** Asymptotic curves

They can be generated using two methods. The first is by using a negative exponential model (see [58]). The second is using the Michaelis–Menten equation (see [59]). The usual form of the equation is

$$S(n) = \frac{n S\_{\text{max}}}{B + n}$$

where *S*(*n*) is the number of species observed in *n* samples, *S*<sub>max</sub> is the total number of species in the assemblage, and *B* is the sampling effort required to detect 50% of *S*<sub>max</sub>.
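
The role of *B* in the Michaelis–Menten equation is easy to check numerically. In the sketch below, the function name and the parameter values are hypothetical and serve only to illustrate the formula.

```python
def michaelis_menten_richness(n_samples, s_max, b):
    """S(n) = n * S_max / (B + n)."""
    return n_samples * s_max / (b + n_samples)


# Hypothetical assemblage with S_max = 80 species and B = 20 samples.
half_way = michaelis_menten_richness(n_samples=20, s_max=80, b=20)
after_many = michaelis_menten_richness(n_samples=2000, s_max=80, b=20)
```

At *n* = *B* the curve reaches half of *S*<sub>max</sub>, and with further sampling it approaches *S*<sub>max</sub> without exceeding it.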

**–** Non-asymptotic curves

These curves are used to estimate species richness. The authors of [60] proposed that the relationship between area and species is best described by a log linear model, which can be extrapolated to a larger area. The authors of [61] imposed an asymptote on the log-log species area curve to avoid extremely high estimates of species richness.

· *Chao*1

It represents a simple estimator of the absolute number of species in an assemblage, which was introduced by Anne Chao (see [63]). The measure is named by [51] as *Chao*1 and it is based on the number of rare species in a sample. The following notation was provided by [52]:

$$S\_{Chao1} = S\_{obs} + \frac{F\_1^2}{2F\_2}$$

where *S*<sub>*obs*</sub> denotes the number of species in the sample, *F*<sub>1</sub> is the number of observed species represented by a single individual (singletons) and *F*<sub>2</sub> denotes the number of observed species represented by two individuals (doubletons).

The requirement for abundance data is an obvious disadvantage of *Chao*1. At a minimum, the data must show which species are singletons and which are doubletons. Often, however, only presence/absence data, more properly called incidence or occurrence data, are available. The variance of *Chao*1 can also be calculated (see [64,65]).
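
A minimal Python sketch of the *Chao*1 formula is given below. The function name and the example counts are hypothetical, and the sketch simply raises an error when there are no doubletons, since the formula as written is undefined in that case.

```python
from collections import Counter


def chao1(counts):
    """S_Chao1 = S_obs + F1**2 / (2 * F2) from per-species individual counts."""
    observed = [c for c in counts if c > 0]
    frequency = Counter(observed)
    f1 = frequency[1]  # species represented by a single individual
    f2 = frequency[2]  # species represented by exactly two individuals
    if f2 == 0:
        raise ValueError("no doubletons: the formula as given is undefined")
    return len(observed) + f1 ** 2 / (2 * f2)


# Hypothetical sample: 8 observed species, 4 singletons, 2 doubletons.
estimate = chao1([1, 1, 1, 1, 2, 2, 5, 10])
```

Here the estimator adds 4² / (2 × 2) = 4 unseen species to the 8 observed ones.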

· *Chao*2

Anne Chao recognized that the same approach can be applied to incidence data, provided that the number of species found in only one sample and the number found in two samples are known. For this, a new estimator, *Chao*2, was introduced. It is as follows (see [51]):

$$S\_{\text{Chao2}} = S\_{\text{obs}} + \frac{Q\_1^2}{2Q\_2}$$

where *Q*<sub>1</sub> is the number of species that occur in one sample only (unique species) and *Q*<sub>2</sub> is the number of species that occur in two samples.

· Other estimators

The author of [51] also introduced another category of estimators, called coverage estimators (see [66]). Coverage estimators are based on the recognition that widespread or abundant species are likely to be included in any sample (see [67]). The abundance-based coverage estimator (ACE) is another estimator based on empirical data (see [68]). Its partner, the incidence-based coverage estimator (ICE), focuses on species found in <10 sampling units. Two further estimators of the true number of species are the jackknife and bootstrap estimators, which are described in the following sections. The estimators are evaluated using criteria such as sample size, patchiness and overall abundance.

#### *4.4. Diversity Measures*

Diversity measures fall into two categories: parametric diversity indices and non-parametric diversity indices.

#### 4.4.1. Parametric Measures of Diversity

• log series *α*

The parameters of the log series model are *x* and *α*, where *α* is a diversity index that is calculated during the fitting of the distribution. When *S* and *N* are known, the value of *α* can easily be read from the Williams nomograph (see [69]) or from appendix 4 of [70]. Here, *x* is estimated by iterating the following relation:

$$\frac{S}{N} = -\log(1 - x)\frac{1 - x}{x}$$

According to [70], unless *x* > 0.5 and *S* > *α*, the log series distribution is not the best descriptor of the species abundance pattern. In fact, for natural assemblages, usually *x* > 0.9 or close to 1 and *S* > *α*. This implies that *α* is approximately the same as the number of species represented by a single individual.
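
The iteration for *x* can be sketched with a simple bisection, as below. Two points are assumptions rather than statements from the text: the logarithm is taken as natural, and *α* is recovered from the standard log series relation *α* = *N*(1 − *x*)/*x*. The function name and example values are hypothetical.

```python
import math


def log_series_alpha(n_species, n_individuals, iterations=100):
    """Solve S/N = -ln(1 - x) * (1 - x) / x for x by bisection.

    Returns (x, alpha), with alpha = N * (1 - x) / x (assumed standard relation).
    """
    if not 0 < n_species < n_individuals:
        raise ValueError("requires 0 < S < N")
    target = n_species / n_individuals

    def rhs(x):
        # Right-hand side of the equation; decreases from 1 to 0 on (0, 1).
        return (1.0 - x) / x * -math.log1p(-x)

    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if rhs(mid) > target:
            lo = mid  # x is still too small
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    return x, n_individuals * (1.0 - x) / x


# Hypothetical sample: S = 50 species, N = 1000 individuals.
x, alpha = log_series_alpha(50, 1000)
```

A quick consistency check is that the fitted values reproduce the species count through *S* = *α* ln(1 + *N*/*α*).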

• log normal *λ*

The standard deviation (*σ*) of the log normal distribution could in principle serve as a measure of diversity. In practice, although it can be used as an evenness measure, *σ* is not a good choice as an index for discriminating amongst samples. It is also impossible to estimate for small sample sizes (see [71]). *S*∗ (the estimator of *S*, the number of species) is an estimator of total species richness. However, the ratio of these two parameters (*S*∗/*σ*), neither of which is suitable on its own, turns out to be an effective diversity measure (*λ*). It is effective in discriminating among assemblages (see [72]), and its ranking of sites agrees well with that of *α*.

• The *Q* statistic

The authors of [73,74] proposed the *Q* statistic, which is based on the distribution of species abundance data. For this measure, the user does not require a model to fit the empirical data. Hence, for empirical data, a cumulative species abundance curve is drawn and its inter-quartile slope is used to measure diversity. The author of [75] suggests that by restricting the measure to the inter-quartile region, the complete cumulative species abundance curve can be used to explain diversity as well as to remove the bias caused by the extremities (very rare and very abundant species).

This is analogous to *α* and hence can be expressed in terms of a log series model, described by [76]. The following equation is estimated from empirical data:

$$Q = \frac{(1/2)n\_{R\_1} + \sum\_{R\_1+1}^{R\_2-1} n\_r + (1/2)n\_{R\_2}}{\log(R\_2/R\_1)}$$

where *n*<sub>*r*</sub> is the total number of species with abundance *r*, *R*<sub>1</sub> and *R*<sub>2</sub> are the 25% and 75% quartiles, *n*<sub>*R*<sub>1</sub></sub> is the number of species in the class where *R*<sub>1</sub> falls, and *n*<sub>*R*<sub>2</sub></sub> is the number of species in the class where *R*<sub>2</sub> falls. The quartiles are chosen so that:

$$\sum\_{1}^{R\_1 - 1} n\_r < \frac{1}{4}S \le \sum\_{1}^{R\_1} n\_r$$

and

$$\sum\_{1}^{R\_2 - 1} n\_r < \frac{3}{4} S \le \sum\_{1}^{R\_2} n\_r$$

where *S* is the total number of species in the sample. The exact placement of *R*<sub>1</sub> and *R*<sub>2</sub> is not critical, as the inter-quartile region of a cumulative species abundance curve, or indeed a rank/abundance plot, tends to be linear.

For the log normal model, *Q* = 0.371. Although *Q* is not formally a parametric index, its performance is somewhat similar to that of parametric ones. However, for species which are censused >50%, *Q* may be biased (see [74]). The author of [77] has proposed an evenness measure similar to the *Q* statistic, *E*<sub>*Q*</sub>, which will be discussed later.

#### 4.4.2. Non-Parametric Measures of Diversity

Most diversity measures are not explicitly associated with named species abundance models, even though their performance is often governed by the underlying distribution of species abundances. They are non-parametric measures of diversity.

• Shannon Index (*H*′)

It was independently derived by Claude Shannon and Norbert Wiener and is generally known as the Shannon index or Shannon information index. However, it is sometimes mistakenly referred to as the Shannon–Weaver index (see [9]). It is represented as

$$H' = -\sum\_{i=1}^{S} p\_i \log p\_i$$

Usually, in samples, *p*<sub>*i*</sub> is unknown, but it is estimated using the maximum likelihood estimator, *n*<sub>*i*</sub>/*N* (see [78]), where *n*<sub>*i*</sub> is the number of individuals in the *i*th species and *N* is the total number of individuals. Ecological validity and computational ease led Shannon to express the index in terms of the logarithm of *p*<sub>*i*</sub>. Historically, log<sub>2</sub> has been used for calculating the Shannon index, but there is no biological reason for this choice. An increasing trend toward standardizing the logarithm base is noted in [79]. However, the Shannon index does not have an unbiased estimator (see [80]).
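
The calculation of the Shannon index from a sample can be sketched as follows, using the estimate *n*<sub>*i*</sub>/*N* for each proportion. The function name, the optional `base` parameter, and the example counts are hypothetical; species with zero counts are skipped, following the usual convention that 0 log 0 = 0.

```python
import math


def shannon_index(counts, base=math.e):
    """H' = -sum(p_i * log p_i), with p_i estimated as n_i / N."""
    total = sum(counts)
    h = 0.0
    for n in counts:
        if n > 0:  # zero counts contribute nothing to the sum
            p = n / total
            h -= p * math.log(p) / math.log(base)
    return h


# Hypothetical sample: four equally abundant species.
h_natural = shannon_index([5, 5, 5, 5])        # natural logarithm
h_bits = shannon_index([5, 5, 5, 5], base=2)   # log base 2
```

The two results differ only by a constant factor, which is why values computed with different logarithm bases are not directly comparable.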

**–** A model using Shannon index: Caswell's neutral model

Caswell's neutral model is well known for its innovative approach to community structure analysis (see [81]). The model focuses on species abundance patterns when biological interactions are removed. It is represented by the deviation statistic defined by

$$V = \frac{H' - E[H']}{SD(H')}$$

where *H*′ is the Shannon diversity index. The statistic compares the observed diversity (*H*′) with the predicted neutral diversity *E*[*H*′]. Values of *V* > 2 or *V* < −2 indicate a departure from neutrality [82]. The author of [83] presented a computer program in PRIMER to calculate *V*, which has been termed a measure of environmental stress (see [84,85]) but is very rarely used. Because richness and evenness are in a complex relationship, *V* is probably useful only as a measure of disturbance. For large values of *S* and *N*, the expected values of *H*′ generated by the neutral model closely resemble the values predicted by the log series model (see [70]), where *S* is the total number of species in the sample and *N* is the total number of individuals.

• The Shannon evenness measure (*J*′)

Diversity reaches its maximum, *H*<sub>max</sub>, when all species have equal abundance. The ratio of observed diversity to this maximum gives the evenness measure *J*′ (see [22,78]). It is defined as

$$J' = \frac{H'}{H\_{\text{max}}} = \frac{H'}{\log S}$$

where *S* is the number of species and *H*′ is the Shannon diversity index. To find *H*<sub>min</sub>, the author of [86] gives a simple method that can be utilized in other forms of the Shannon evenness (see [87]).

• Heip's index of evenness (*E*<sub>*Heip*</sub>)

In [88], Heip notes that an evenness measure should not depend on species richness. On this basis, he proposed the following measure:

$$E\_{Heip} = \frac{e^{H'} - 1}{S - 1}$$

Compared with *J*′, *E*<sub>*Heip*</sub> is less affected by species richness, although it does not meet the requirement of independence from richness when there are only 10 species in a sample (see [89]). The minimum value of *E*<sub>*Heip*</sub> is 0, and the index approaches 0.006 when an extremely uneven community is considered.
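
The two Shannon-based evenness measures can be compared on the same data with a short sketch. It assumes natural logarithms throughout and counts only species with at least one individual toward *S*; the function names and example counts are hypothetical.

```python
import math


def shannon_index(counts):
    """H' with natural logarithms, p_i estimated as n_i / N."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def shannon_evenness(counts):
    """J' = H' / log S."""
    s = sum(1 for n in counts if n > 0)
    if s < 2:
        raise ValueError("evenness needs at least two species")
    return shannon_index(counts) / math.log(s)


def heip_evenness(counts):
    """E_Heip = (exp(H') - 1) / (S - 1)."""
    s = sum(1 for n in counts if n > 0)
    if s < 2:
        raise ValueError("evenness needs at least two species")
    return (math.exp(shannon_index(counts)) - 1.0) / (s - 1)


# Hypothetical samples: one perfectly even, one strongly dominated.
even_sample = [5, 5, 5, 5]
uneven_sample = [97, 1, 1, 1]
j_even, heip_even = shannon_evenness(even_sample), heip_evenness(even_sample)
j_uneven, heip_uneven = shannon_evenness(uneven_sample), heip_evenness(uneven_sample)
```

Both measures equal 1 for the even sample; for the dominated sample Heip's index is the lower of the two, reflecting its lower minimum.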

• SHE analysis

One of the main characteristics of the Shannon index is that it depends on both species richness and evenness. The authors of [70,90] recognized that this property of the Shannon index can be put to use. Consider a measure of evenness *E* = *e*<sup>*H*′</sup>/*S* (see [88]), such that

$$H' = \log S + \log E$$

This decomposition aids the user in interpreting changes in diversity.

A decrease in diversity, for example following a pollution incident, may be due to a loss of richness, a loss of evenness, or a combination of the two.

The essence of SHE analysis is the relationship between *S* (species richness), *H*′ (diversity as measured by the Shannon index) and *E* (evenness). SHE analysis was used by [91] to examine geographic patterns of body mass diversity in Mexican mammals; evenness was found to be high at intermediate spatial scales but low at the regional scale.

• The Brillouin index (*HB*)

The Brillouin index, abbreviated *HB*, is appropriate when sample randomness is not guaranteed, a community is completely censused, or every individual is accounted for (see [22,78]). It is given as

$$HB = \frac{\log N! - \sum\_{i=1}^{S} \log n\_i!}{N}$$

where *N* is the total number of individuals in the sample, *ni* is the number of individuals from the *i*th species and *S* is the number of species. The *HB* value is rarely greater than 4.5. When compared to the Shannon Index, *HB* always yields a lower value, but they both provide similar or correlated estimates of diversity. The reason is that the Brillouin index describes a completely known collection without any uncertainty. Evenness (*E*) for the Brillouin diversity index is obtained from

$$E = \frac{HB}{HB\_{\text{max}}}$$

where *HB*max is calculated as

$$HB\_{\max} = \frac{1}{N} \log \left( \frac{N!}{(N\_S!)^{S-r} \, [(N\_S+1)!]^{r}} \right)$$

where *NS* is the integer part of *N*/*S* and *r* = *N* − *S*(*NS*).

Because the index describes a completely known collection, it has no variance, and hence no statistical test is needed to assess significance. Mathematically speaking, *HB* is superior to the other two indices presented by [92]. However, some scientists state that it is more time-consuming to calculate and less familiar. Its strong dependence on sample size can lead to unexpected results. It is also unsuitable when abundance is measured as biomass or productivity (see [9,93]).
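
The Brillouin index and its evenness can be computed without forming large factorials by working with log factorials. The sketch below assumes natural logarithms and uses `math.lgamma(n + 1)` for log *n*!; the function names and example counts are hypothetical.

```python
import math


def log_factorial(n):
    """Natural log of n!, via the log-gamma function."""
    return math.lgamma(n + 1)


def brillouin_index(counts):
    """HB = (log N! - sum(log n_i!)) / N."""
    counts = [n for n in counts if n > 0]
    total = sum(counts)
    return (log_factorial(total) - sum(log_factorial(n) for n in counts)) / total


def brillouin_max(counts):
    """HB_max for the same N and S, with N_S = N // S and r = N - S * N_S."""
    counts = [n for n in counts if n > 0]
    total, s = sum(counts), len(counts)
    n_s = total // s
    r = total - s * n_s
    log_denominator = (s - r) * log_factorial(n_s) + r * log_factorial(n_s + 1)
    return (log_factorial(total) - log_denominator) / total


def brillouin_evenness(counts):
    """E = HB / HB_max."""
    return brillouin_index(counts) / brillouin_max(counts)


# Hypothetical fully censused collections.
hb_even = brillouin_index([5, 5, 5, 5])
e_even = brillouin_evenness([5, 5, 5, 5])
e_uneven = brillouin_evenness([10, 4, 3, 3])
```

For the even collection, `hb_even` is below ln 4, the Shannon value for four equally abundant species, in line with the comparison made above.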

• Dominance and evenness measures

A group of diversity indices is weighted by the abundances of the commonest species; these are usually referred to as either dominance or evenness measures.

**–** Simpson's index (*D*)

It is occasionally called the Yule index, after G. U. Yule (see [20]). The probability that any two individuals drawn at random from an infinitely large community belong to the same species is given by [94] as

$$D = \sum\_{i=1}^{n} p\_i^2$$

where *p*<sub>*i*</sub> denotes the proportion of individuals in the *i*th species, and *n* is the number of species. The form of the index appropriate for a finite community is:

$$D = \sum\_{i=1}^{n} \frac{n\_i(n\_i - 1)}{N(N - 1)}$$

where *ni* is the number of individuals in the *i*th species and *N* is the total number of individuals in the sample.

Because diversity decreases as *D* increases, Simpson's index is usually expressed as 1 − *D* or 1/*D*. It captures the variance of the species abundance distribution. Simpson's index is less sensitive to species richness and more heavily weighted toward the abundances of the commonest species. Confidence limits can be obtained by jackknifing. Simpson's index is regarded as one of the most meaningful and robust of the measures. The reciprocal form of the Simpson index was questioned by [95], who recommends using log(*D*) instead of (1 − *D*) or (1/*D*) because the reciprocal form can suffer from severe variance problems. He also advises Kemp's transformation.

**–** Simpson's measure of evenness (*E*<sub>1/*D*</sub>)

The Simpson measure of evenness, denoted by *E*<sub>1/*D*</sub> and stated in [9,89], is defined by

$$E\_{1/D} = \frac{1/D}{S}$$

Here, *E*<sub>1/*D*</sub> ranges between 0 and 1 and is not sensitive to species richness. Because the reciprocal form of Simpson's index (1/*D*) is the product of Simpson's evenness measure and *S*, multiplying by *S* turns any good evenness index into a heterogeneity measure (see [96]).
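
The two forms of Simpson's index and the associated evenness measure can be sketched as follows. The evenness function uses the infinite-community form of *D*, which is an assumption made here so that a perfectly even sample gives exactly 1; the function names and example counts are hypothetical.

```python
def simpson_infinite(counts):
    """D = sum(p_i**2), the form for an infinitely large community."""
    total = sum(counts)
    return sum((n / total) ** 2 for n in counts)


def simpson_finite(counts):
    """D = sum(n_i * (n_i - 1)) / (N * (N - 1)), the finite-community form."""
    total = sum(counts)
    return sum(n * (n - 1) for n in counts) / (total * (total - 1))


def simpson_evenness(counts):
    """E_{1/D} = (1 / D) / S, using the infinite-community D (assumption)."""
    s = sum(1 for n in counts if n > 0)
    return (1.0 / simpson_infinite(counts)) / s


# Hypothetical sample: four equally abundant species.
sample = [5, 5, 5, 5]
d_inf = simpson_infinite(sample)
d_fin = simpson_finite(sample)
complement, reciprocal = 1.0 - d_inf, 1.0 / d_inf
evenness = simpson_evenness(sample)
```

For this even sample the reciprocal form equals the number of species, so the evenness is 1.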

**–** McIntosh's measure of diversity (*U*)

McIntosh postulated in 1967 that a community may be thought of as a point in an *S*-dimensional hypervolume, with the Euclidean distance between the assemblage and the origin serving as a measure of diversity (see [97]). The distance is known as *U* and is calculated as

$$U = \sqrt{\sum\_{i=1}^{n} n\_i^2}$$

where *ni* is the number of individuals in the *i*th species and *n* is the number of species. The McIntosh *U* index is not formally a dominance index. However, a measure of diversity (*D*) or dominance that is independent of *N*, the total number of individuals, can also be calculated as

$$D = \frac{N - U}{N - \sqrt{N}}$$

A further evenness measure can be obtained from the following formula (see [22]):

$$E = \frac{N - U}{N - N/\sqrt{S}}$$
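As a small worked sketch of the three McIntosh quantities, the following Python function computes *U*, *D* and *E* from a list of species counts. The function name `mcintosh` is hypothetical and the code assumes at least two species and more than one individual, so that both denominators are non-zero.

```python
import math


def mcintosh(counts):
    """Return (U, D, E) for a list of per-species abundances."""
    counts = [c for c in counts if c > 0]
    n_total = sum(counts)            # N, total number of individuals
    s = len(counts)                  # S, number of species
    if s < 2 or n_total < 2:
        raise ValueError("need at least two species and two individuals")
    u = math.sqrt(sum(n * n for n in counts))
    d = (n_total - u) / (n_total - math.sqrt(n_total))
    e = (n_total - u) / (n_total - n_total / math.sqrt(s))
    return u, d, e


u, d, e = mcintosh([25, 25, 25, 25])
print(u, d, e)
```

For a perfectly even assemblage such as the one above, *U* equals *N*/√*S*, so the evenness measure *E* reaches its maximum of 1.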

**–** The Berger-Parker index (*d*)

The Berger-Parker index, denoted by *d*, is an easy-to-calculate dominance measure (see [41,98]). The proportional abundance of the most abundant species is expressed by this index:

$$d = \frac{N\_{\text{max}}}{N}$$

where *N*max is the number of individuals in the most abundant species. Thus, *d* expresses the relative importance of the most dominant species in the assemblage. The reciprocal form of the Berger-Parker index (1/*d*) is often adopted so that an increase in the value of the index accompanies an increase in diversity and a reduction in dominance, making it similar to Simpson's index. It is one of the most satisfactory diversity measures available because of its simplicity and biological significance (see [41]). In small assemblages, however, *d* is not independent of *S*: its value tends to decrease with increasing species richness.

**–** Nee, Harvey and Cotgreave's evenness measure (*ENHC*)

As an evenness measure, the authors of [77] proposed the slope *b* of a rank/abundance plot (with abundances log transformed). The resulting measure is

$$E\_{NHC} = b$$

*ENHC* ranges from −∞ to 0, where 0 is perfect evenness. This measure is difficult to interpret because of its range of values. It is more properly a measure of diversity than of evenness, which is one of its demerits (see [73]). The authors of [89] therefore proposed a new form of the measure, which is

$$E\_Q = -\frac{2}{\pi} \arctan(b')$$

In this measure, the ranks are scaled before the regression is fitted, and *b*′ denotes the corresponding slope. The scaling is accomplished by dividing all ranks by the highest rank, such that the most abundant species receives a rank of 1.0 and the least abundant receives a rank of 1/*S*. The transformation −(2/*π*) arctan(*b*′) places the measure in the 0 (no evenness) to 1 (perfect evenness) range.

**–** Camargo's evenness index (*Ec*)

The author of [99] also introduced the following evenness measure:

$$E\_C = 1 - \sum\_{i=1}^{S} \sum\_{j=i+1}^{S} \frac{|p\_i - p\_j|}{S}$$

where *Ec* is Camargo's index of evenness, *pi* the proportion of species *i* in the sample, *pj* the proportion of species *j* in the sample, and *S* the total number of species in the sample. Although the index is simple to calculate and relatively unaffected by rare species (see [100]), the authors of [37] found it to be biased, especially in comparison with the Simpson index.

**–** Smith and Wilson's evenness index (*Evar*)

The authors of [89] proposed a new index to provide an intuitive measure of evenness. This index is based on the variance in species abundances, calculated over log abundances so that proportional differences are examined. This makes the index independent of measurement units. Smith and Wilson called their measure *Evar*. It is defined by

$$E\_{\text{var}} = 1 - \frac{2}{\pi} \arctan\left\{ \frac{1}{S} \sum\_{i=1}^{S} \left( \log n\_i - \frac{1}{S} \sum\_{j=1}^{S} \log n\_j \right)^2 \right\}$$

where *ni* is the number of individuals in species *i*, *nj* is the number of individuals in species *j* and *S* represents the total number of species. The conversion by 1 − (2/*π*) arctan(*x*) ensures that the resulting measure falls between 0 (minimum evenness) and 1 (maximum evenness).
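The computation of *Evar* is short enough to sketch directly. The version below takes the form 1 − (2/*π*) arctan(variance of log abundances), uses natural logarithms and ignores species with zero counts; the function name `smith_wilson_evar` is hypothetical.

```python
import math


def smith_wilson_evar(counts):
    """E_var = 1 - (2/pi) * arctan(variance of the log abundances)."""
    counts = [c for c in counts if c > 0]
    s = len(counts)
    if s == 0:
        raise ValueError("no species present")
    logs = [math.log(n) for n in counts]
    mean_log = sum(logs) / s
    variance = sum((x - mean_log) ** 2 for x in logs) / s
    return 1 - (2 / math.pi) * math.atan(variance)


print(smith_wilson_evar([20, 20, 20, 20]))   # perfectly even sample
print(smith_wilson_evar([1000, 1, 1]))       # strongly dominated sample
```

Because arctan maps the non-negative variance into the interval from 0 to *π*/2, the result always lies between 0 and 1, with 1 attained when all abundances are equal.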

#### 4.4.3. Taxonomic Diversity

If two assemblages have the same number of species and similar patterns of species abundance, but differ in the diversity of taxa to which the species belong, it seems intuitively reasonable that the assemblage whose species are spread across the wider range of taxa is the more diverse. A taxonomic distinctness measure is one of the most recent developments in taxonomic diversity (see [101,102]).

• Clarke and Warwick's taxonomic distinctness index

This measure gives the average taxonomic distance, that is, the path length through the taxonomic hierarchy between two randomly chosen organisms. The measure takes two forms. The first is taxonomic diversity (Δ), which considers both taxonomic relatedness and species abundance; here, the two organisms may belong to the same species. The second form is taxonomic distinctness (Δ∗), a pure measure of taxonomic relatedness, which is equivalent to dividing Δ by the value it would take if all species belonged to the same genus, that is, in the absence of a taxonomic hierarchy. When presence/absence data are used, both measures reduce to the same statistic, Δ+, which is the average taxonomic distance between two randomly selected species. It is calculated as follows:

$$\Delta^{+} = \frac{\sum\_{i=1}^{S} \sum\_{j=i+1}^{S} \omega\_{ij}}{S(S-1)/2}$$

where *S* denotes the number of species in the study and *ωij* is the taxonomic path length between species *i* and *j*.

The taxonomic distinctness index is distinguished by its lack of reliance on sampling effort (see [103]).
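To make the averaging in Δ+ concrete, the following Python sketch computes the mean pairwise path length over all species pairs. The way path lengths are assigned here is an assumption made purely for illustration: each species carries a lineage tuple ordered from the lowest rank upward, and *ωij* is the number of steps needed to reach the first shared taxon, with equal step lengths. The function names are hypothetical.

```python
from itertools import combinations


def path_length(lineage_a, lineage_b):
    """Steps up the hierarchy until two lineages share a taxon.

    Lineages are tuples ordered from the lowest rank (e.g. genus)
    to the highest. Equal step lengths are an illustrative assumption.
    """
    for steps, (taxon_a, taxon_b) in enumerate(zip(lineage_a, lineage_b), start=1):
        if taxon_a == taxon_b:
            return steps
    return len(lineage_a) + 1        # no shared taxon within the ranks given


def average_taxonomic_distinctness(lineages):
    """Delta+ : mean path length over all S(S-1)/2 species pairs."""
    species = list(lineages)
    s = len(species)
    if s < 2:
        raise ValueError("need at least two species")
    total = sum(path_length(lineages[a], lineages[b])
                for a, b in combinations(species, 2))
    return total / (s * (s - 1) / 2)


lineages = {
    "sp1": ("G1", "F1"),
    "sp2": ("G1", "F1"),
    "sp3": ("G2", "F1"),
    "sp4": ("G3", "F2"),
}
print(average_taxonomic_distinctness(lineages))
```

In practice the path lengths would come from the taxonomy or phylogeny used in the study, and any weighting of the steps between ranks should follow the cited sources rather than this sketch.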

Using Δ+, a significance test can be carried out. Here, the null hypothesis considered is "taxonomic distinctness of a locality is not significantly different from the global list". On the other hand, the author of [104] used multivariate methods to detect small variations in community structure and diversity. Multivariate analysis also helps find increased variability between samples (see [105]).

#### *4.5. Sampling: An Essential Attribute*

There are essentially two choices regarding sample size. The investigator may either adjust the sample size to suit the situation or adopt a standard sample size. The second approach, which is also recommended by [70], is preferable. If two samples of different sizes are drawn from the same assemblage, they may lead to different conclusions about its diversity (see [22]). If samples are replicated several times, plotting the measure of diversity (or evenness) against cumulative sample size may yield a smooth curve.

• Replications

The number of replicates required is a question with no general answer. Ideally, the sample size and the number of replicates are chosen on the basis of the most diverse assemblage and are then kept the same throughout the study. This consideration becomes even more important when sample size is not consistent. One should be well aware of the difference between replication and pseudoreplication (see [106]). For more ideas in this context, users can refer to [107]. The primary condition is that all replicates must be (spatially) independent.

#### *4.6. Comparison of Communities*

The manner in which the statistical comparison of communities or other ecological entities is achieved depends to some extent, though with significant overlaps, on the aspect of biodiversity that has been measured.

• Rarefaction—Sample data to common abundance level

Rarefaction is a technique that reduces sample data to a common abundance level, which allows species richness to be compared directly between communities. To estimate the richness of a smaller sample by rarefaction, complete information on all the collected species is required. Rarefaction curves converge when sample sizes are small (see [55,108]). Sampling should be sufficient to characterize the community; if the sample is insufficient, the estimates may be biased.

The author of [109] states that software can be used to create rarefaction curves. In [65], sample-based rarefaction curves were calculated using the EstimateS software. Confidence intervals can be incorporated into these curves. Rarefaction by the log series model is computationally simple. Indeed, it may even be used in circumstances where species abundances do not follow a log series distribution. However, if the sampling was inadequate in the first place, no method of rarefaction is going to compensate.
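The text does not give a rarefaction formula, so the following sketch makes an explicit assumption: it uses the classical individual-based expectation usually attributed to Hurlbert, in which the expected number of species in a random subsample of *n* individuals is the sum over species of 1 − C(*N* − *Ni*, *n*)/C(*N*, *n*). The function name `rarefied_richness` is hypothetical.

```python
from math import comb


def rarefied_richness(counts, n):
    """Expected species richness in a random subsample of n individuals.

    counts : per-species abundances of the full sample (total N)
    n      : common abundance level to rarefy to (n <= N)
    """
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if not 0 < n <= total:
        raise ValueError("n must lie between 1 and the sample total")
    all_subsamples = comb(total, n)
    # comb(total - c, n) is 0 when the subsample cannot avoid the species
    return sum(1 - comb(total - c, n) / all_subsamples for c in counts)


large = [50, 30, 10, 5, 3, 1, 1]     # 100 individuals, 7 species
small = [20, 15, 5]                  # 40 individuals, 3 species
common = sum(small)
print(rarefied_richness(large, common), rarefied_richness(small, common))
```

Rarefying the larger sample down to the size of the smaller one puts both richness values on the same footing, which is the comparison that rarefaction is meant to support.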

• Statistical tests

Standard statistical techniques such as *t*-tests and ANOVA can be used to compare assemblages (see [38]). Alternatively, jackknifing or bootstrapping can be used to attach confidence intervals to a diversity statistic.

**–** Jackknifing: a measure of diversity

Jackknifing (see [110]) is a strategy for improving the estimate of almost any statistic. It can also be used to calculate the number of species present. It was first proposed by Quenouille in 1956, with Tukey making changes in 1958. The author of [111] was the first to apply the approach to diversity statistics. This application was further investigated by [112,113].

Jackknifing does not require assumptions about the underlying distribution. Instead, it uses a set of artificially produced "pseudo-values". These pseudo-values are (usually) normally distributed, and their mean forms the best estimate of the statistic. Approximate confidence limits can also be attached to the estimate. The procedure is simple. The first step is to estimate the diversity of all *n* samples together. This produces *St*, the original diversity estimate. Next, the diversity measure is recalculated *n* times, missing out each sample in turn. Each recalculation produces a new estimate, *St*−*i*. The pseudo-value (*φi*) can then be calculated for each of the *n* samples as

$$\phi\_i = nSt - (n-1)St\_{-i}$$

The jackknifed estimate of the diversity statistic is simply the mean of these pseudo-values:

$$\bar{\phi} = \sum\_{i=1}^{n} \frac{\phi\_i}{n}$$

The approximate standard error of the jackknifed estimate is

$$SE\_{\bar{\phi}} = \sqrt{\sum\_{i=1}^{n} \frac{(\phi\_i - \bar{\phi})^2}{n(n-1)}}$$

This standard error may be used to assign approximate confidence limits to the jackknifed diversity estimate. Confidence limits are set in the usual way, i.e.,

$$
\bar{\phi} \pm t\_{0.05(n-1)} SE\_{\bar{\phi}}.
$$
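The jackknife procedure described above (pooled estimate, leave-one-out recalculation, pseudo-values, mean, standard error and confidence limits) can be sketched as follows. The function names are hypothetical, the diversity statistic is passed in as an ordinary function, and the critical *t* value for *n* − 1 degrees of freedom is assumed to be looked up by the caller rather than computed.

```python
import math


def jackknife_estimate(samples, statistic, t_crit):
    """Jackknife a statistic over n samples.

    samples   : list of samples (any objects the statistic understands)
    statistic : function taking a list of samples and returning a number
    t_crit    : two-sided t value for n - 1 degrees of freedom (caller-supplied)
    """
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are needed")
    pooled = statistic(samples)                       # St, all samples together
    pseudo = []
    for i in range(n):
        reduced = samples[:i] + samples[i + 1:]       # leave sample i out
        pseudo.append(n * pooled - (n - 1) * statistic(reduced))
    mean = sum(pseudo) / n                            # jackknifed estimate
    se = math.sqrt(sum((p - mean) ** 2 for p in pseudo) / (n * (n - 1)))
    return mean, se, (mean - t_crit * se, mean + t_crit * se)


def pooled_richness(samples):
    """Example statistic: number of species in the pooled samples."""
    return len(set().union(*samples))


quadrats = [{"a", "b"}, {"a"}, {"a", "c"}]
print(jackknife_estimate(quadrats, pooled_richness, t_crit=4.303))
```

Passing the statistic as a function keeps the sketch independent of any particular diversity index; a Shannon or Simpson function operating on pooled counts could be substituted for `pooled_richness`.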

Prior to jackknifing, the author of [38] recommended that statistics with a restricted range (such as those constrained between 0 and 1) should be transformed. The same methods were subsequently used to estimate species richness, with considerable success. The resulting estimators are called Jackknife 1, a first-order jackknife estimator that employs the number of species that occur only in a single sample (see [114,115]), and Jackknife 2, a second-order estimator which, like the *Chao*2 equation, takes both the number of species found in one sample only (*Q*1) and in precisely two samples (*Q*2) into account (see [116]). Both require incidence data. In the following equations, *m* denotes the number of samples:

$$S\_{Jack1} = S\_{obs} + Q\_1 \left(\frac{m-1}{m}\right)$$

$$S\_{Jack2} = S\_{obs} + \left(\frac{Q\_1(2m-3)}{m} - \frac{Q\_2(m-2)^2}{m(m-1)}\right)$$

The variances of both estimators can be calculated.
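A brief sketch of the two incidence-based richness estimators is given below. It uses the usual forms *S*obs + *Q*1(*m* − 1)/*m* for Jackknife 1 and *S*obs + *Q*1(2*m* − 3)/*m* − *Q*2(*m* − 2)²/(*m*(*m* − 1)) for Jackknife 2; the function name `jackknife_richness` and the representation of each sample as a set of species names are assumptions made for illustration.

```python
from collections import Counter


def jackknife_richness(samples):
    """Return (S_obs, Jack1, Jack2) from incidence data.

    samples : list of iterables, each holding the species seen in one sample
    """
    m = len(samples)
    if m < 2:
        raise ValueError("at least two samples are needed")
    # number of samples in which each species occurs
    incidence = Counter(sp for sample in samples for sp in set(sample))
    s_obs = len(incidence)
    q1 = sum(1 for k in incidence.values() if k == 1)   # uniques
    q2 = sum(1 for k in incidence.values() if k == 2)   # duplicates
    jack1 = s_obs + q1 * (m - 1) / m
    jack2 = s_obs + q1 * (2 * m - 3) / m - q2 * (m - 2) ** 2 / (m * (m - 1))
    return s_obs, jack1, jack2


print(jackknife_richness([{"a", "b"}, {"a"}, {"a", "c"}]))
```

With these three samples, the first-order estimate coincides with the result of jackknifing pooled richness directly, which is a useful consistency check on the formula.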

**–** Bootstrapping

A related method for producing standard errors and confidence bounds is bootstrapping. It is more computationally intensive than the jackknife, although it is regarded as an improvement. In essence, the original dataset is sampled numerous times to obtain a large number of different observations. These are then used to deduce the standard error. The authors of [20,38] provide more details. Bootstrapping, like jackknifing, can be used in species richness estimation.
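A minimal bootstrap sketch is shown below, assuming that whole samples are the resampling unit and that they are drawn with replacement. The percentile interval, the default of 1000 resamples, the fixed seed and the function name `bootstrap_ci` are all illustrative choices rather than recommendations from the cited sources.

```python
import math
import random


def bootstrap_ci(samples, statistic, n_boot=1000, seed=0):
    """Bootstrap mean, standard error and a 95% percentile interval."""
    if n_boot < 2:
        raise ValueError("n_boot must be at least 2")
    rng = random.Random(seed)                 # fixed seed for repeatability
    n = len(samples)
    estimates = []
    for _ in range(n_boot):
        # draw n samples with replacement from the original samples
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    mean = sum(estimates) / n_boot
    se = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n_boot - 1))
    estimates.sort()
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return mean, se, (lower, upper)


def pooled_richness(samples):
    """Example statistic: number of species in the pooled samples."""
    return len(set().union(*samples))


quadrats = [{"a", "b"}, {"a"}, {"a", "c"}, {"b", "d"}, {"a", "b"}]
print(bootstrap_ci(quadrats, pooled_richness))
```

As with the jackknife, any diversity statistic that accepts a list of samples can be substituted for `pooled_richness`.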

• Null models

In the last decade, there has been a rising use of null models in diversity measurement. Ecologists are becoming aware of the importance of developing testable null hypotheses (see [117]). According to the null hypothesis, the observed patterns are not due to the presumed causal explanation. It is based on the assumption that nothing significant has occurred (see [118]). Null models can also be used to determine whether perceived differences in diversity are simply an artifact of sampling. As [55] emphasizes, a null model does not presume that there is no structure in a community or that all processes are random. Instead, randomness is assumed only in respect of the mechanism being tested. Null models are already used extensively to evaluate species co-occurrence patterns (see [119]).

#### *4.7. Diversity in Space (and Time)*

So far, we have focused on the diversity of a defined assemblage or habitat, or *α* diversity. The author of [120] makes the distinction between *α* and *β* diversity, where *β* diversity increases as the similarity in species composition decreases. *β* diversity reflects biotic change or species replacement, whereas *α* diversity is a property of a specific spatial unit. When the diversity of two or more spatial units differs, *β* diversity can be used to quantify the difference. The relationship between *α* and *β* diversity is scale-dependent. The observation made by [80] is that

$$D\_{\gamma} = \bar{D}\_{\alpha} + D\_{\beta}$$

When species richness is used to measure *α* and *γ* diversity, *β* diversity may be estimated as follows:

$$D\_{\beta} = S\_T - \bar{S}\_j = \sum\_{j=1}^{n} q\_j (S\_T - S\_j)$$

where *ST* is the species richness of the landscape (*γ* diversity), *Sj* denotes the richness of assemblage *j* and *qj* is the proportional weight of assemblage *j* based on its sample size or importance; the sum runs over the *n* assemblages.

This approach can also be applied to the Shannon and Simpson diversity measures. Low *α* and high *β* diversity will come from many small sampling units, but the opposite will be true if there are fewer but larger samples. If all other factors are equal, both sampling procedures yield the same conclusions concerning *γ* diversity.
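The additive partition of species richness can be illustrated with a short sketch. The weights *qj* are assumed to sum to 1 and default to equal weights; the function name `additive_partition` and the representation of each assemblage as a set of species are hypothetical.

```python
def additive_partition(samples, weights=None):
    """Return (mean alpha, beta, gamma) for species richness.

    samples : list of iterables of species names, one per assemblage
    weights : proportional weights q_j summing to 1 (equal if omitted)
    """
    sets = [set(s) for s in samples]
    n = len(sets)
    if weights is None:
        weights = [1 / n] * n
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    gamma = len(set().union(*sets))                       # S_T
    alpha_bar = sum(q * len(s) for q, s in zip(weights, sets))
    beta = sum(q * (gamma - len(s)) for q, s in zip(weights, sets))
    return alpha_bar, beta, gamma


alpha_bar, beta, gamma = additive_partition([{"a", "b"}, {"b", "c"}])
print(alpha_bar, beta, gamma)        # alpha_bar + beta equals gamma
```

Because the weights sum to 1, the mean *α* richness and the *β* component always add up to the landscape richness, which is the identity stated above.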

• Indices of *β* diversity

The majority of these indices use presence/absence data and, as such, focus on the species richness element of diversity.

1. Whittaker's measure (*βW*)

One of the simplest, and most effective, measures of *β* diversity was devised by [120]:

$$
\beta\_W = \frac{S}{\bar{\alpha}}
$$

where *S* is the total number of species recorded in the system (i.e., *γ* diversity) and *ᾱ* is the average sample diversity, where each sample is a standard size and diversity is measured as species richness. This is equivalent to:

$$D\_{\beta} = \frac{S\_T}{\overline{S}\_j}$$

in Lande's notation.

When *βW* is computed for a pair of samples, values of the measure range from 1 (complete similarity) to 2 (no overlap in species composition). The author of [121] introduced a modification of Whittaker's measure. This allows the user to compare two transects (or samples) of different size. The related formula is

$$\beta\_{H1} = \frac{(S/\bar{\alpha}) - 1}{N - 1} \times 100$$

where *S* denotes the total number of species recorded, *ᾱ* is the mean *α* diversity and *N* is the number of sites (or grid squares) along a transect. The measure ranges from 0 (no turnover) to 100 (every sample has a unique set of species) and can be used to examine pairwise differentiation between sites. The author of [121] suggested a second modification which is insensitive to species richness trends. It is given by

$$\beta\_{H2} = \frac{(S/\alpha\_{\text{max}}) - 1}{N - 1} \times 100$$

Here, *α*max is the maximum within-taxon richness per sample. The authors of [122] used *βH*2 to compare the turnover of various taxa in relation to disturbance in a Cameroon forest.
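Whittaker's measure and the two modifications can be sketched from presence/absence data as follows. The sketch takes *βW* = *S*/*ᾱ*, *βH*1 = ((*S*/*ᾱ*) − 1)/(*N* − 1) × 100 and *βH*2 = ((*S*/*α*max) − 1)/(*N* − 1) × 100; the function names and the representation of each site as a set of species are assumptions.

```python
def _richness_summary(samples):
    """Total richness S, per-site richness values and number of sites N."""
    sets = [set(s) for s in samples]
    total = len(set().union(*sets))
    return total, [len(s) for s in sets], len(sets)


def whittaker_beta(samples):
    """beta_W = S / mean alpha."""
    total, alphas, n_sites = _richness_summary(samples)
    return total / (sum(alphas) / n_sites)


def harrison_beta1(samples):
    """beta_H1 = ((S / mean alpha) - 1) / (N - 1) * 100."""
    _, _, n_sites = _richness_summary(samples)
    return (whittaker_beta(samples) - 1) / (n_sites - 1) * 100


def harrison_beta2(samples):
    """beta_H2 = ((S / alpha_max) - 1) / (N - 1) * 100."""
    total, alphas, n_sites = _richness_summary(samples)
    return (total / max(alphas) - 1) / (n_sites - 1) * 100


sites = [{"a", "b"}, {"c", "d"}]          # no species shared
print(whittaker_beta(sites), harrison_beta1(sites), harrison_beta2(sites))
```

For two sites with no species in common the pairwise *βW* reaches 2 and both modified measures reach 100, matching the ranges quoted in the text.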

2. Cody's measure (*βC*)

The author of [123] proposed an index, which is easy to calculate and is a good measure of species turnover. It is given by

$$\beta\_C = \frac{g(H) + l(H)}{2}$$

where *g*(*H*) is the number of species gained and *l*(*H*) is the number of species lost.

3. Routledge's measures (*βR*, *βI* and *βE*)

The author of [124] was concerned with how diversity measures can be partitioned into *α* and *β* components. His first index, denoted by *βR*, takes overall species richness and the degree of species overlap into consideration. This index is defined by

$$\beta\_R = \frac{S^2}{2r + S} - 1$$

where *S* is the total number of species in all samples and *r* is the number of species pairs with overlapping distributions.

*βI*, the second index, stems from information theory and has been simplified for presence/absence data and equal sample size by [125]:

$$\beta\_I = \log T - \frac{1}{T} \sum\_{i=1}^{S} e\_i \log e\_i - \frac{1}{T} \sum\_{j=1}^{n} S\_j \log S\_j$$

where *ei* is the number of samples in the transect in which species *i* is present, *Sj* is the species richness of sample *j*, *T* = ∑*i* *ei* = ∑*j* *Sj* (summing over all *S* species and over all *n* samples, respectively), and *n* is the total number of samples.

The third index, *βE*, is simply the exponential form of *βI*. That is

$$
\beta\_E = e^{\beta\_I}
$$
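The following sketch computes *βI* and *βE* from presence/absence data, summing *ei* log *ei* over species and *Sj* log *Sj* over samples. Natural logarithms are assumed, as are the function names and the representation of each sample as a set of species.

```python
import math
from collections import Counter


def routledge_beta_i(samples):
    """beta_I = log T - (1/T) sum e_i log e_i - (1/T) sum S_j log S_j."""
    sets = [set(s) for s in samples if len(set(s)) > 0]
    # e_i: number of samples in which species i is present
    occurrences = Counter(sp for s in sets for sp in s)
    total = sum(occurrences.values())            # T = sum e_i = sum S_j
    species_term = sum(e * math.log(e) for e in occurrences.values()) / total
    sample_term = sum(len(s) * math.log(len(s)) for s in sets) / total
    return math.log(total) - species_term - sample_term


def routledge_beta_e(samples):
    """beta_E is the exponential of beta_I."""
    return math.exp(routledge_beta_i(samples))


print(routledge_beta_e([{"a", "b"}, {"a", "b"}]))   # identical samples
print(routledge_beta_e([{"a", "b"}, {"c", "d"}]))   # no shared species
```

For two samples of equal richness, *βE* runs from 1 when the species lists are identical to 2 when they share nothing, which gives a quick sanity check on an implementation.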

4. Wilson and Shmida's index (*βT*)

The authors of [125] proposed a new measure of *β* diversity. It is given by

$$\beta\_T = \frac{g(H) + l(H)}{2\bar{S}\_j}$$

where *S̄j* is the mean of *Sj*, that is, the mean sample richness. Most measures of *β* diversity are sensitive to scale. Turnover decreases as progressively larger areas are investigated.

• Indices of complementarity and similarity

The author of [126] coined the term complementarity to characterize the differences across locations in respect of the species they support. Complementarity is, of course, another name for *β* diversity. The larger the *β* diversity of two sites, the more complementary they are. Measures typically combine three variables: *a*, the total number of species present in both quadrats or samples, *b*, the number of species present only in quadrat 1, and *c*, the number of species present only in quadrat 2. The main indices are as follows.

1. Marczewski–Steinhaus (MS) distance

Following [127], the author of [51] recommended the Marczewski–Steinhaus (MS) distance as a measure of complementarity. It is expressed as

$$C\_{MS} = 1 - \frac{a}{a+b+c}$$

This measure is in fact the complement of the familiar Jaccard similarity index [128]:

$$C\_{J} = \frac{a}{a+b+c}$$

As suggested by Pielou, the statistic can also be adapted to give a single measure of complementarity across a set of samples or along a transect:

$$C\_T = \frac{1}{n} \sum\_{j=1}^{n} \sum\_{k=j+1}^{n} U\_{jk}$$

where *Ujk* = *Sj* + *Sk* − 2*Vjk* and is summed across all pairs of samples, *Vjk* is the number of species common to the two lists *j* and *k* (the same value as a in the formulae above), *Sj* and *Sk* are the number of species in samples *j* and *k*, respectively, and *n* is the number of samples.

When *n* is large, *CT* approaches a value of *nST*/4, where *ST* is the species richness of all samples combined.

The Marczewski–Steinhaus dissimilarity measure (and hence the complement of the Jaccard similarity measure) is a metric (as opposed to a nonmetric) measure. This means that it meets specific geometric criteria. The significant result for the user is that it may be used as a distance measure and in ordination (see [127]).

2. Sorensen's measure

Another popular similarity measure was devised by [129]:

$$C\_S = \frac{2a}{2a+b+c}$$

Sorensen's measure (see [20]) is widely recognized as one of the most effective presence/absence similarity metrics. It is identical to the Bray-Curtis presence/absence coefficient.

3. Lennon turnover measure

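Lennon and co-authors noted that Sorensen's measure, expressed as a dissimilarity, will always be large when two samples differ greatly in species richness. They therefore introduced a new turnover measure, *βsim*, that focuses more precisely on differences in composition: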

$$
\beta\_{\rm sim} = 1 - \frac{a}{a + \min(b, c)}
$$

This is related to a measure derived by [130]. Any difference in species richness inflates either *b* or *c*. The consequence of using the smallest of these values in the denominator is thus to reduce the impact of any imbalance in species richness. The authors of [131] found that this measure performs well.

One of the primary advantages of these measures is that they are simple to calculate and comprehend. However, the coefficients do not take the relative abundance of species into consideration, which is a flaw.
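The presence/absence measures above all derive from the same three counts *a*, *b* and *c*, which the following sketch makes explicit. The function names are hypothetical and each site is assumed to be given as a collection of species names with at least one species overall.

```python
def shared_and_unique(site1, site2):
    """Return (a, b, c): shared species, only in site 1, only in site 2."""
    s1, s2 = set(site1), set(site2)
    return len(s1 & s2), len(s1 - s2), len(s2 - s1)


def jaccard(site1, site2):
    a, b, c = shared_and_unique(site1, site2)
    return a / (a + b + c)


def marczewski_steinhaus(site1, site2):
    """Complement of the Jaccard similarity."""
    return 1 - jaccard(site1, site2)


def sorensen(site1, site2):
    a, b, c = shared_and_unique(site1, site2)
    return 2 * a / (2 * a + b + c)


def beta_sim(site1, site2):
    """Turnover measure that discounts differences in richness."""
    a, b, c = shared_and_unique(site1, site2)
    return 1 - a / (a + min(b, c))


poor = {"a", "b"}
rich = {"a", "b", "c", "d"}          # contains every species of `poor`
print(sorensen(poor, rich), beta_sim(poor, rich))
```

In the nested example the Sorensen similarity falls below 1 purely because of the difference in richness, whereas *βsim* is 0, reflecting that no species are actually replaced.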

4. Sorensen quantitative index or Bray-Curtis index

Similarity and dissimilarity measures can also be based on quantitative data. The author of [132] introduced a modified version of the Sorensen index. This is sometimes called the Sorensen quantitative index (see [133]). It is given by

$$C\_N = \frac{2jN}{N\_a + N\_b}$$

where *Na* is the total number of individuals in site *A*, *Nb* is the total number of individuals in site *B*, and 2*jN* is the sum of the lower of the two abundances for species found in both sites.
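A short sketch of the quantitative index follows, with each site given as a dictionary mapping species names to counts; this data layout and the function name `sorensen_quantitative` are assumptions made for illustration.

```python
def sorensen_quantitative(site_a, site_b):
    """C_N = 2 jN / (N_a + N_b) for two dicts of species -> abundance."""
    n_a = sum(site_a.values())
    n_b = sum(site_b.values())
    # jN: sum of the lower of the two abundances for shared species
    j_n = sum(min(site_a[sp], site_b[sp]) for sp in site_a.keys() & site_b.keys())
    return 2 * j_n / (n_a + n_b)


site_a = {"x": 10, "y": 5}
site_b = {"x": 4, "y": 5, "z": 6}
print(sorensen_quantitative(site_a, site_b))
```

Here *jN* is 4 + 5 = 9 and both sites hold 15 individuals, so the index is 18/30.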

5. Other notable indices

The authors of [134] looked into a number of quantitative similarity indices and discovered that, with the exception of the Morisita–Horn index, they were all heavily influenced by species richness and sample size. The Morisita–Horn index (MH) has the drawback of being extremely sensitive to the abundance of the most abundant species. Despite this, the author of [135] was able to measure *β* diversity in tropical cockroach assemblages using a modified version of the index. It is defined by

$$C\_{MH} = \frac{2\sum\_{i=1}^{n} a\_i b\_i}{(d\_a + d\_b) \times N\_a \times N\_b}$$

where *Na* is the total number of individuals at site A, *Nb* is the total number of individuals at site B, *ai* is the number of individuals in the *i*th species in A, *bi* is the number of individuals in the *i*th species in B, *n* is the total number of species and *da* and *db* are calculated as follows:

$$d\_a = \frac{\sum\_{i=1}^{n} a\_i^2}{N\_a^2}, \qquad d\_b = \frac{\sum\_{i=1}^{n} b\_i^2}{N\_b^2}$$

The Morisita–Horn measure is widely used (see [136,137]). The authors of [20] provided a version of Morisita's original index that is suitable for easy computation. A further simple measure is percentage similarity (see [20]):

$$P = 100 - 0.5 \sum\_{i=1}^{S} |P\_{ai} - P\_{bi}|$$

where *Pai* and *Pbi* are the percentage abundances of species *i* in samples *a* and *b*, respectively, and *S* is the total number of species.
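The Morisita–Horn index and percentage similarity can be sketched together, since both work on abundance data for two sites. The sketch assumes that *db* is computed in the same way as *da* (the sum of squared abundances divided by the squared site total), that sites are dictionaries mapping species names to counts, and that species missing from a site have abundance 0. The function names are hypothetical.

```python
def morisita_horn(site_a, site_b):
    """C_MH = 2 sum(a_i b_i) / ((d_a + d_b) N_a N_b)."""
    n_a = sum(site_a.values())
    n_b = sum(site_b.values())
    d_a = sum(v * v for v in site_a.values()) / n_a ** 2
    d_b = sum(v * v for v in site_b.values()) / n_b ** 2
    species = set(site_a) | set(site_b)
    cross = sum(site_a.get(sp, 0) * site_b.get(sp, 0) for sp in species)
    return 2 * cross / ((d_a + d_b) * n_a * n_b)


def percentage_similarity(site_a, site_b):
    """P = 100 - 0.5 * sum |P_ai - P_bi| over all species."""
    n_a = sum(site_a.values())
    n_b = sum(site_b.values())
    species = set(site_a) | set(site_b)
    diff = sum(abs(100 * site_a.get(sp, 0) / n_a - 100 * site_b.get(sp, 0) / n_b)
               for sp in species)
    return 100 - 0.5 * diff


site_a = {"x": 30, "y": 10}
site_b = {"x": 10, "y": 25, "z": 5}
print(morisita_horn(site_a, site_b), percentage_similarity(site_a, site_b))
```

Both measures reach their maximum (1 and 100, respectively) for sites with identical relative abundances and fall to 0 when no species are shared.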

Some practical applications are given below based on [5].

#### *4.8. Extreme Values in Modeling Atmospheric Ozone*

The traditional method of extreme value analysis popularized by [138] was the annual maximum method, in which one of the three classical types of extreme value distributions was fitted to, say, the annual maxima of a river or sea level series. Modified approaches to extreme value analysis which cope with time series dependence are discussed by [139,140]. The extreme value trend centered on the statistical features of insurance claims for environmental damage. The author of [141] suggested that exceedances over a high threshold can be modeled approximately by the generalized Pareto distribution (GPD).

#### *4.9. Environmental Epidemiology*

The study of associations between environmental pollutants and negative health consequences is a prominent topic in current environmental health science.

The authors of [142,143] have considered some methodological issues associated with detecting clusters in spatial point processes of disease. The authors of [144] extended the approach to the modeling of spatially aggregated data. Earlier, the authors of [145] proposed a non-parametric test for identifying disease clusters. However, because a disease may have several sources, it is often impossible to attribute an effect to each of them, and clusters therefore cannot be detected easily. In such cases, it is generally assumed that comparing mortality or disease incidence with pollutant levels across spatial regions is subject to heavy confounding with other environmental effects. To estimate the sequential mean and covariances, Zidek adapted a Bayesian approach to the spatial prediction of a multidimensional variable (see [146]).
