1. Introduction
Species richness is the most commonly used quantitative diversity metric and is easily understood. The term “species” can be broadly defined to include biological species, software bugs, words in a book, genes, alleles, or other discrete entities, as reviewed in [1,2,3]. This article focuses on biological applications, specifically the number of detectable species within a given area. However, due to constraints in resources or sampling, creating a complete species inventory for a target area is often infeasible. Instead, a random sample, representing a small portion of the target area’s size or community, is typically used to evaluate species diversity. In ecological studies, there are two main formats for assessing species diversity: individual-based abundance data and sample-based incidence data. Individual-based abundance data involve randomly sampling individual organisms, identifying them to species, and recording the frequency of each species. Sample-based incidence data involve randomly sampling a plot, quadrat, trap, transect, or net from the target area and recording the presence or absence of each species in the sampled unit [4].
Since the true number of species in an area is the sum of the species observed in the sample and those not appearing in the sample, the observed richness in the sample always underestimates the true richness. Generally, the extent to which species are underestimated in a sample hinges on sampling effort and sample completeness [5]. Accurately estimating the species richness of an assemblage remains a statistical challenge, particularly in highly heterogeneous assemblages [6]. To address the negative bias of observed richness, numerous estimators have been proposed, leading to significant advancements in various disciplines (refer to the review papers [1,3,7,8] for detailed information).
Generally, richness estimators in the literature can be classified into three types: curve-fitting, parametric, and non-parametric approaches. Curve-fitting approaches use parametric curves to extrapolate species-accumulation or species-area curves, taking the predicted asymptote as the estimate of species richness [9,10]. This method does not directly use the frequencies of common and rare species; it only extrapolates the trajectory of the rising curve. Parametric approaches treat species composition as a random variable that follows a distribution with a small number of parameters [11,12,13]. This parameter reduction enables the application of standard statistical inference procedures, such as the maximum likelihood method. A primary advantage of parametric methods is their simplicity. However, curve-fitting and parametric methods face challenges in selecting the appropriate parametric function or distribution. Models using different functions or distributions might fit the data equally well yet produce vastly different estimates. Further, a well-fitting parametric model does not guarantee a satisfactory estimate of species richness. Non-parametric richness estimators, which do not make model assumptions about species detection probability or species composition, tend to be more robust and are often preferred by ecologists. In ecological research, Chao’s lower bound estimators [14,15], which are rooted in the Cauchy–Schwarz inequality, are prominently used. Moreover, to address the bias in observed richness, jackknife-based estimators [16,17] were developed. These work by successively excluding individuals from the data and analyzing the resulting sub-datasets. In addition, these non-parametric estimators do not require all the information on observed species; only the rare species in the sample (singletons and doubletons) are used to estimate undetected richness. These estimators show the expected robust statistical behavior only when the sampling unit (i.e., an individual in abundance data and a plot in incidence data) is randomly sampled. However, due to resource constraints, a random sample is often only feasible in a limited area, not in a large-scale area.
In recent decades, monitoring species richness to reveal the impact of human activities on a large or global scale has become an increasingly urgent task [18,19,20,21,22,23]. However, estimating richness for large-scale areas (or across multiple sites) remains a statistical challenge, and no reliable estimator has been developed to date. In general, the datasets used to estimate richness across multiple sites consist of samples collected separately from each site under different sampling schemes. Such an integrated dataset therefore mixes data formats, including individual-based abundance data and sample-based incidence data. The widely used estimators in the literature are limited by their underlying theoretical assumptions and are not equipped to analyze this type of integrated data. Consequently, no estimator has so far been proposed specifically for estimating richness across multiple sites from integrated data.
In this article, I provide a theoretical interpretation of the applicability of Chao’s lower bound estimator for estimating species richness based on a pooled sample of integrated data. Additionally, utilizing the Good–Turing frequency formula [24], I address the negative bias inherent in Chao’s lower bound estimator and propose a bias-corrected alternative. The variance of the new estimator can be calculated through the asymptotic approach, and its 95% confidence interval can be obtained through logarithmic transformation. To evaluate the efficacy of this proposed estimator, three commonly used ecological models and two real datasets are utilized in simulation studies and illustrative examples. Based on simulation results from various scenarios of integrated data, both estimators significantly reduce the negative bias of observed richness, providing reliable lower bound estimates across various hypothetical models and exhibiting convergence towards the true richness as the sample size increases. Notably, the newly proposed bias-corrected estimator outperforms Chao’s lower bound, exhibiting lower bias, lower root mean square error (RMSE), and a more accurate 95% confidence interval (CI) for the true richness, particularly when dealing with small sample sizes or highly heterogeneous communities.
2. Materials and Methods
2.1. Sampling Distribution Model
Assume there are a total of $S$ distinct species in the community of interest. In ecological studies, individual-based abundance data and sample-based incidence data are the most commonly collected data types in the assessment of species richness [4]. The sampling unit of individual-based abundance data is an individual independently sampled from the target area and identified to species. The sampling unit of sample-based incidence data is a plot, quadrat, trap, transect, or net randomly sampled from the target area, for which only the incidence (presence or absence) of each species is recorded.
For individual-based abundance data, assume $n$ (a small fraction of community size) individuals are independently sampled by sampling with replacement or sampling without replacement. Let $X_i$ be the number of individuals of species $i$ counted in the sample. The species frequencies (or species abundances) $(X_1, X_2, \ldots, X_S)$ could be assumed to follow a multinomial distribution with size $n$ and probabilities $(p_1, p_2, \ldots, p_S)$, and the species frequency $X_i$ follows a binomial distribution with parameters $n$ and $p_i$:

$$P(X_i = x) = \binom{n}{x} p_i^{x} (1 - p_i)^{n - x}, \quad x = 0, 1, \ldots, n, \tag{1}$$

where $p_i$ is the relative detection probability of species $i$. Let $f_k$ be the number of species that are observed exactly $k$ times in the sample, $k = 0, 1, \ldots, n$. Therefore, $f_0$, $f_1$, $f_2$, and $f_3$ represent the undetected richness, singleton richness, doubleton richness, and tripleton richness in the abundance sample, respectively, where $f_0$ is an unknown parameter.
For sample-based incidence data, assume $t$ sampling units are randomly sampled from the target area and only the incidence (presence or absence) of species in each sampled unit is recorded. Let $Y_i$ be the number of units in which species $i$ is detected among the $t$ sampled units. Then, $Y_i$ could be assumed to follow a binomial distribution with size $t$ and probability $\pi_i$:

$$P(Y_i = y) = \binom{t}{y} \pi_i^{y} (1 - \pi_i)^{t - y}, \quad y = 0, 1, \ldots, t, \tag{2}$$

where $\pi_i$ is the detection probability of species $i$, which depends on the abundance, body size, and color of species $i$, as well as the investigator’s capability. Let $Q_k$ be the number of species that are detected in exactly $k$ out of $t$ sampling units, $k = 0, 1, \ldots, t$. Therefore, $Q_0$, $Q_1$, $Q_2$, and $Q_3$ are the unseen richness, singleton richness, doubleton richness, and tripleton richness in the incidence sample, respectively, where $Q_0$ is an unknown parameter.
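As a small illustration of the two data formats, the following Python sketch tabulates frequency counts from made-up data. The species counts, the incidence matrix, and the helper name `frequency_counts` are hypothetical and are not taken from the article's datasets.

```python
from collections import Counter

def frequency_counts(counts):
    """Map k -> number of species observed exactly k times (k >= 1)."""
    return Counter(c for c in counts if c > 0)

# Hypothetical abundance sample: number of individuals recorded per species.
abundance = [5, 1, 1, 2, 0, 3, 1, 2, 0, 7]
f = frequency_counts(abundance)
n = sum(abundance)  # sample size in individuals

# Hypothetical incidence sample: rows are species, columns are t = 4 sampling units.
incidence = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],  # never detected; in real data such rows are unobserved
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
y = [sum(row) for row in incidence]  # number of units in which each species occurs
q = frequency_counts(y)

print("abundance: n, singletons, doubletons, tripletons =", n, f[1], f[2], f[3])
print("incidence: singletons, doubletons, tripletons =", q[1], q[2], q[3])
```

Undetected species (count 0) never appear in real data, which is why the number of undetected species has to be estimated rather than counted.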
2.2. Richness Estimation for a Single Assemblage Using Integrated Data
Assuming that N samples are randomly collected from the target area through various sampling schemes, including individual-unit-based and sample-unit-based sampling methods, the integrated data comprise two formats (i.e., abundance data and incidence data), as commonly seen in ecological studies. To determine the richness of the assemblage, Chao1 and Chao2 [14,15], derived without model assumptions on species detection rates, are the most commonly used estimators and are briefly outlined below. In this context, I will not explore jackknife-based estimators in depth because they lack a theoretical basis for bias reduction in species richness estimation [25] and they exhibit inferior statistical performance compared to Chao’s lower bound estimators [26].
2.2.1. Chao’s Lower Bound Estimators
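Based on the Cauchy–Schwarz inequality, and without making any assumptions on species detection rates, Chao proposed lower bound estimators for richness in 1984 and 1987. These estimators were designed for individual-based abundance data and sample-based incidence data and are referred to as the Chao1 and Chao2 estimators, respectively. Writing $S_{obs}$ for the number of species observed in the sample, and $(f_1, f_2)$ and $(Q_1, Q_2)$ for the numbers of singleton and doubleton species in abundance data and incidence data, the Chao1 and Chao2 estimators are, in their classical forms for $f_2 > 0$ and $Q_2 > 0$, expressed as

$$\hat{S}_{Chao1} = S_{obs} + \frac{f_1^2}{2 f_2}, \qquad \hat{S}_{Chao2} = S_{obs} + \frac{Q_1^2}{2 Q_2}.$$

Chao’s lower bound estimators only use the frequency counts of the two rarest classes of species (i.e., the numbers of singleton and doubleton species) in the sample to estimate undetected richness.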
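A minimal sketch of this type of lower bound in Python is given below. It assumes the classical form, observed richness plus the squared singleton count divided by twice the doubleton count. The function name `chao_lower_bound` and the fallback used when no doubletons occur are illustrative choices rather than the article's own definitions.

```python
def chao_lower_bound(s_obs, singletons, doubletons):
    """Classical Chao-type lower bound: S_obs + f1^2 / (2 * f2)."""
    if doubletons > 0:
        f0_hat = singletons ** 2 / (2 * doubletons)
    else:
        # Assumption: bias-corrected form often used in practice when f2 = 0.
        f0_hat = singletons * (singletons - 1) / 2
    return s_obs + f0_hat

# Hypothetical sample: 8 observed species, 3 singletons, 2 doubletons.
print(chao_lower_bound(8, 3, 2))  # 8 + 9/4 = 10.25
```

The same function applies to incidence data by passing the incidence-based singleton and doubleton counts.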
By the Cauchy–Schwarz inequality, Chao’s lower bound estimators are unbiased when the detection rates of species are homogeneous (i.e., $p_i = p$ in Equation (1) or $\pi_i = \pi$ in Equation (2), for $i = 1, 2, \ldots, S$). In addition, according to the Good–Turing frequency formula [24], Chao et al. [27,28] show that Chao’s lower bound estimators are nearly unbiased only when rare species have approximately homogeneous detection probabilities (or rates). Therefore, the degree of heterogeneity among the abundant species in the assemblage has no bearing on whether Chao’s lower bound estimators are unbiased. When the detection rate of rare species is highly heterogeneous or the sample size is not large enough, Chao1 or Chao2, in contrast to parametric estimators, can still provide a robust lower bound estimate of richness [2,28]. However, Chao1 and Chao2 were separately derived based on different sampling models for abundance data and incidence data. Importantly, there is still no theoretical evidence or proof that Chao’s lower bound estimator can be used to estimate species richness using a pooled sample of integrated data.
2.2.2. Extending Chao’s Lower Bound Estimators for Integrated Data
Many richness estimators proposed in the literature, whether they are parametric or non-parametric, are designed for randomly sampled data. This means that the detection rate of a species for each random trial, such as a selected individual or plot, is assumed to be identical. These estimators assume that the underlying assumptions of the binomial distribution are met.
However, if N samples are separately collected using different sampling methods (e.g., sampling schemes, sampling efforts, plot sizes, or investigators) from the target area, the observed species count in the pooled sample no longer follows a binomial distribution. This violates the theoretical assumption of a random sample. This type of integrated data is often encountered in ecological studies, where individual-based abundance data and sample-based incidence data are collected from the same target area. While integrated data are commonly employed to estimate richness, no estimator has been rigorously designed for such data. In this section, I will theoretically illustrate how Chao’s lower bound estimator can be modified to handle integrated data.
For individual-based abundance data, according to probability theory, when the sample size $n$ is sufficiently large and the relative abundance (or detection probability) $p_i$ is sufficiently small, the binomial distribution of the species frequency $X_i$ converges to a Poisson distribution. This implies that the frequency $X_i$ of a rare species (i.e., $p_i$ is sufficiently small) in the sample approximately follows a Poisson distribution with mean $n p_i$. This convergence also applies to sample-based incidence data. When the number of plots $t$ is large and the detection rate $\pi_i$ tends to zero, the incidence count $Y_i$ of a rare species in the sample approximately follows a Poisson distribution with mean $t \pi_i$.
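The convergence described above can be checked numerically. The sketch below compares binomial and Poisson probabilities for small counts; the sample size of 500 and the detection probability of 0.002 are arbitrary illustrative values, not values from the article.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

n, p = 500, 0.002  # hypothetical large sample and rare species
gaps = []
for k in range(4):  # counts 0, 1, 2, 3 are the ones used by the estimators
    b, q = binomial_pmf(k, n, p), poisson_pmf(k, n * p)
    gaps.append(abs(b - q))
    print(k, round(b, 5), round(q, 5))

max_gap = max(gaps)
```

Raising `p` toward values typical of abundant species makes the two columns diverge, which is why the approximation is restricted to rare species.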
Without loss of generality, suppose two random samples are collected from the target region through different sampling schemes, namely, individual-based sampling and plot-based sampling. These samples correspond to individual-based abundance data and sample-based incidence data, respectively. When the two samples are pooled, the pooled species frequency $Z_i = X_i + Y_i$ represents the count of species $i$ in the pooled sample, where $X_i$ and $Y_i$ are the counts of species $i$ in the abundance sample (of $n$ individuals) and in the incidence sample (of $t$ sampling units). Here, $Z_i$ is no longer a random variable following a binomial distribution.

Based on the convergence principle between the binomial and Poisson distributions discussed earlier, for species with low detection rates in the combined sampling scheme (i.e., small $p_i$ and small $\pi_i$), the species abundance ($Z_i$) in the pooled sample approximately follows a Poisson distribution with a mean parameter $n p_i + t \pi_i$. For simplicity, denote this mean parameter as $T \lambda_i$, where $T$ represents the unknown size of the pooled sample and $\lambda_i$ represents the detection rate of species $i$. Next, let $F_k$ be the species frequency count, representing the number of species that are present exactly $k$ times in the pooled sample. When $k$ is small (e.g., $k$ = 0, 1, 2 or 3) and the size of the pooled sample is sufficiently large, $F_k$ is primarily contributed by the rare species, whose counts approximately follow a Poisson distribution. Given a specific sampling scheme, all species in the region can be divided into a set of rare species, denoted as $R$, and a set of abundant species, denoted as $A$. Based on the existing convergence theory between the binomial distribution and the Poisson distribution, the expectation of $F_k$ for small $k$ can be decomposed as

$$E(F_k) = \sum_{i=1}^{S} P(Z_i = k) = \sum_{i \in R} P(Z_i = k) + \sum_{i \in A} P(Z_i = k).$$

When $k$ is small, the probability that abundant species have a count of $k$ tends to zero. Therefore, $\sum_{i \in A} P(Z_i = k)$ is roughly equal to 0. We have

$$E(F_k) \approx \sum_{i \in R} P(Z_i = k).$$

According to the convergence property between the binomial and Poisson distributions for rare species, the following approximation holds:

$$\sum_{i \in R} P(Z_i = k) \approx \sum_{i \in R} \frac{(T \lambda_i)^k e^{-T \lambda_i}}{k!}.$$

Therefore, we can derive the following four approximation equations for the expectation of undetected richness, singleton richness, doubleton richness, and tripleton richness, which represent the number of rare species in the pooled sample:

$$E(F_0) \approx \sum_{i \in R} e^{-T \lambda_i}, \tag{3a}$$
$$E(F_1) \approx \sum_{i \in R} T \lambda_i e^{-T \lambda_i}, \tag{3b}$$
$$E(F_2) \approx \sum_{i \in R} \frac{(T \lambda_i)^2 e^{-T \lambda_i}}{2}, \tag{3c}$$
$$E(F_3) \approx \sum_{i \in R} \frac{(T \lambda_i)^3 e^{-T \lambda_i}}{6}. \tag{3d}$$

It is worth emphasizing once again that these approximations are valid only for low species frequency counts in the sample, under the condition that the sample sizes ($n$ and $t$) are sufficiently large. In Appendix A, numerical simulations provide evidence that these approximate equations hold.

Based on the Cauchy–Schwarz inequality, we have the following inequality:

$$\left[ \sum_{i \in R} e^{-T \lambda_i} \right] \left[ \sum_{i \in R} (T \lambda_i)^2 e^{-T \lambda_i} \right] \geq \left[ \sum_{i \in R} T \lambda_i e^{-T \lambda_i} \right]^2. \tag{4}$$

According to Equations (3a)–(3c), Equation (4) is equivalent to $E(F_0) \geq [E(F_1)]^2 / [2 E(F_2)]$. This inequality also holds when the mean abundance of a detected species, $\theta = T \lambda$, is assumed to be a random variable with probability density function $g(\theta)$, expressed as

$$\left[ \int e^{-\theta} g(\theta) \, d\theta \right] \left[ \int \theta^2 e^{-\theta} g(\theta) \, d\theta \right] \geq \left[ \int \theta e^{-\theta} g(\theta) \, d\theta \right]^2.$$

Therefore, we have the lower bound estimator of undetected richness

$$\hat{F}_0 = \frac{F_1^2}{2 F_2},$$

which uses the number of singletons and doubletons in the pooled sample to estimate undetected richness. Therefore, the proposed richness estimator could be interpreted as an extension of Chao1 or Chao2. It is denoted as Chao3, and, for $F_2 > 0$, the formula can be expressed as

$$\hat{S}_{Chao3} = S_{obs} + \frac{F_1^2}{2 F_2},$$

where $S_{obs}$ is the number of species observed in the pooled sample.
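The pooling step and the resulting lower bound can be sketched in Python as follows. This is an illustration under stated assumptions: the species names and counts are invented, the function name `chao3` is a label for this example, and the estimate is taken as observed richness plus the squared number of pooled singletons over twice the number of pooled doubletons.

```python
from collections import Counter

def chao3(pooled_counts):
    """Lower bound from pooled counts: S_obs + F1^2 / (2 * F2)."""
    positive = [c for c in pooled_counts if c > 0]
    s_obs = len(positive)
    freq = Counter(positive)
    f1, f2 = freq[1], freq[2]
    if f2 == 0:
        # The article's handling of this case is not reproduced here.
        raise ValueError("no doubletons in the pooled sample")
    return s_obs + f1 ** 2 / (2 * f2)

# Hypothetical abundance sample (individuals per species).
abundance_counts = Counter({"sp1": 5, "sp2": 1, "sp4": 2})
# Hypothetical incidence sample (number of plots in which each species occurs).
incidence_counts = Counter({"sp1": 3, "sp3": 1, "sp5": 2})

# Pooling adds the two counts species by species.
pooled = abundance_counts + incidence_counts
print(dict(pooled))
print(chao3(pooled.values()))  # 5 observed species, F1 = 2, F2 = 2 -> 6.0
```

Only the pooled singleton and doubleton counts enter the estimate, so the original format of each sample does not need to be tracked after pooling.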
2.2.3. Modified Good–Turing Frequency Formula for Integrated Data
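Before adjusting Chao’s lower bound estimator to obtain a more accurate estimator, it is essential first to introduce the Good–Turing frequency formula. Given a species abundance sample of size $n$ collected randomly, let $\alpha_r$ symbolize the mean relative abundance of species that appear exactly $r$ times in the sample, expressed as $\alpha_r = \sum_{i=1}^{S} p_i I(X_i = r) / f_r$, where $I(\cdot)$ is the indicator function and $f_r$ is the number of species observed exactly $r$ times. The Good–Turing frequency formula, designed to estimate $\alpha_r$, is presented as

$$\hat{\alpha}_r = \frac{(r + 1) f_{r+1}}{n f_r}.$$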
This formula has its roots in the work of Alan Turing and I. J. Good during World War II. They collaborated on deciphering German ciphers and innovatively utilized this statistical method to estimate the true frequencies of rare code elements, including those undetected, based on observed frequencies in intercepted samples of Nazi code. Later, Good’s papers in 1953 [29], and jointly with Toulmin in 1956 [24], shed light on Turing’s wartime explorations concerning the frequency formula and other related research topics.
In ecological studies, sample coverage represents the proportion of the total number of individuals in a community that belong to the species represented in the sample. This is represented mathematically as $C = \sum_{i=1}^{S} p_i I(X_i > 0)$, providing a measure of the sample’s completeness. Since sample coverage is equivalent to $1 - \alpha_0 f_0$, where $\alpha_0$ is the mean relative abundance of the undetected species and $f_0$ is their number, it can be estimated as $\hat{C} = 1 - f_1 / n$ [24]. This metric helps ecologists determine how well their sample represents the underlying community and whether more sampling is needed.
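As a brief illustration, the coverage estimate of one minus the proportion of singletons among sampled individuals can be computed as below. The counts are invented for the example and the function name `turing_coverage` is an illustrative label.

```python
def turing_coverage(counts):
    """Estimated sample coverage: 1 - (number of singletons) / (sample size)."""
    n = sum(counts)
    singletons = sum(1 for c in counts if c == 1)
    return 1 - singletons / n

# Hypothetical abundance sample: 22 individuals, 3 species seen exactly once.
counts = [5, 1, 1, 2, 0, 3, 1, 2, 0, 7]
coverage = turing_coverage(counts)
print(round(coverage, 3))
```

A value close to 1 suggests that most individuals in the community belong to species already detected, while a low value signals that further sampling is likely to reveal new species.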
The Good–Turing frequency formula can also be used to estimate the number of unobserved species in a sample, based on the intuitive concept that the mean relative abundance of unseen species ($\alpha_0$) should be no greater than the mean relative abundance of species observed once in the sample ($\alpha_1$), i.e., $\alpha_0 \leq \alpha_1$. Upon employing the Good–Turing frequency formula to estimate $\alpha_0$ and $\alpha_1$, we arrive at the inequality $f_1 / (n f_0) \leq 2 f_2 / (n f_1)$, where $f_k$ is the number of species observed exactly $k$ times in a sample of size $n$. Then, the lower bound estimator of undetected richness can be obtained as $\hat{f}_0 = f_1^2 / (2 f_2)$, which is identical to the Chao1 lower bound estimator initially derived via the Cauchy–Schwarz inequality.
Based on the concept of the Good–Turing frequency formula, for the pooled sample from an integrated data, the mean detection rate of species which are present
r times in the pooled sample is denoted as
That can be effectively estimated via a modified Good–Turing frequency formula, shown as
2.2.4. Modified Chao’s Lower Bound Estimator for Integrated Data
According to the Cauchy–Schwarz inequality, Chao3 is a lower bound estimator. Based on the Good–Turing frequency formula, Chao3 will be severely negatively biased when the rare species have a high degree of heterogeneity. In this section, the negative bias of Chao3 is corrected using the Good–Turing frequency formula [24]. The bias of Chao3 is approximately equal to
Using the modified Good–Turing frequency formula (Equation (6)), we have the following approximate equations:
According to Equations (7a) and (7b), the bias of
can be approximately derived as
Therefore, the bias of Chao3 can be estimated by replacing
and
with
and
, respectively. It is given as
Then, we have the bias-corrected estimator of Chao3, expressed as
where
equals
if
and
if
. Here, as
(or
),
(or
is replaced by 1 to make Equation (8) always well-defined. The mathematical form of the estimator shown in Equation (8) is identical to the parametric estimator proposed by Chiu [30,31], which was derived based on the beta-binomial mixture model for sample-based incidence data or on the gamma-Poisson mixture model for individual-based abundance data. The new estimator can also be proved to be a lower bound of richness under the incidence-based beta-binomial mixture model or the abundance-based gamma-Poisson mixture model [30,31].
Since the Chao3Adj estimator uses only the frequency counts of the three rarest classes of species in the sample (singletons, doubletons, and tripletons) to estimate undetected richness, it can be applied to integrated data consisting of multiple samples randomly collected from the target area without adhering to a specific sampling model or scheme. For a comprehensive comparison, a table is provided in Appendix B. This table details the equations and symbols utilized in the proposed estimators, complete with their definitions, origins (cited with references), and their statistical performances in estimating richness.
2.3. Estimating Richness across Multiple Assemblages Using Integrated Data
When there are $N$ assemblages (sites), species sampling data are collected independently and separately from each site. The sampling data can be collected by either an individual-unit-based sampling method or a sample-unit-based sampling method. Let $X_{ij}$ represent the number of sampling units (such as the number of individuals in individual-based abundance data or the number of plots in sample-based incidence data) of species $i$ in sample $j$, which is collected from the $j$th site. Here, $i$ ranges from 1 to $S$ for the species, and $j$ ranges from 1 to $N$ for the sites. If the sample size (i.e., the total number of individuals in abundance data or the total number of plots in incidence data) is sufficiently large in each sample, the counts ($X_{ij}$) of species with low detection rates will approximate a Poisson distribution. Then, the total count of species $i$ in the pooled sample, denoted as $Z_i = \sum_{j=1}^{N} X_{ij}$, will approximate a Poisson distribution when the detection rate of species $i$ in each site is uniformly low.
Let $F_k$ be the number of species with a count of exactly $k$ in the pooled sample. The approximate equations shown in Equations (3a)–(3d) are also applicable to the pooled sample of integrated data. Additionally, formulae for Chao3 and Chao3Adj can be derived to estimate species richness across multiple assemblages based on the Good–Turing frequency formula without making any specific model assumptions. Similarly, their variance estimators can be obtained using the asymptotic approach, and the 95% confidence interval (CI) of species richness can be derived by referring to the discussion surrounding Equation (9).
According to the derivation, the proposed richness estimator possesses the following properties: (i) when the samples are individually and randomly collected from each site, the sampled samples can be directly combined to estimate undetected richness, regardless of whether the data format in each sample is identical or not; (ii) the estimation of undetected richness is solely based on the frequency counts of the rarest species in the pooled sample; (iii) when the detection rates of rare species are homogeneous (including the homogeneous model as a special case) or the sample size is sufficiently large, the proposed estimators are nearly unbiased.
2.4. Estimation of the Variance for the Estimator
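To derive the variance estimator for the proposed richness estimator, an asymptotic approach is employed. Let $F_k$ denote the number of species observed exactly $k$ times in the pooled sample. By defining $F_{3+}$ as the total frequency count of species with a count of at least 3 in the sample (i.e., $F_{3+} = \sum_{k \geq 3} F_k$), the estimator Chao3 can be expressed as a function of $(F_1, F_2, F_{3+})$. The estimator of Chao3’s variance can be obtained by the asymptotic approach, in which $(F_0, F_1, F_2, F_{3+})$ approximately follow a multinomial distribution with total $S$ and cell probabilities $(E(F_0)/S, E(F_1)/S, E(F_2)/S, E(F_{3+})/S)$. Additionally, let $F_{4+}$ be the total frequency count of species with a count of at least 4 in the sample (i.e., $F_{4+} = \sum_{k \geq 4} F_k$); then, Chao3Adj becomes a function of $(F_1, F_2, F_3, F_{4+})$. The estimator of Chao3Adj’s variance can also be obtained using the asymptotic approach, where $(F_0, F_1, F_2, F_3, F_{4+})$ approximately follow a multinomial distribution with total $S$ and the corresponding cell probabilities.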
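The variance estimator of Chao3 or Chao3Adj can be derived via the delta method and is expressed as

$$\widehat{\mathrm{var}}(\hat{S}) = \sum_{i} \sum_{j} \frac{\partial \hat{S}}{\partial F_i} \frac{\partial \hat{S}}{\partial F_j} \widehat{\mathrm{cov}}(F_i, F_j),$$

where $\hat{S}$ denotes either estimator, the sums run over the frequency classes on which the estimator depends, and

$$\widehat{\mathrm{cov}}(F_i, F_j) = \begin{cases} F_i (1 - F_i / \hat{S}), & i = j, \\ -F_i F_j / \hat{S}, & i \neq j. \end{cases}$$

To derive the 95% confidence interval (CI) of species richness and to ensure that the lower bound of the 95% CI of species richness is larger than the observed richness, assume $\hat{S} - S_{obs}$ follows a log-normal distribution [27,32]; then, the two-sided 95% CI of species richness is obtained as

$$\left[ S_{obs} + \frac{\hat{S} - S_{obs}}{R}, \; S_{obs} + (\hat{S} - S_{obs}) R \right], \quad R = \exp\left\{ 1.96 \left[ \log\left( 1 + \frac{\widehat{\mathrm{var}}(\hat{S})}{(\hat{S} - S_{obs})^2} \right) \right]^{1/2} \right\}.$$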
When samples are randomly collected, Chao3 consistently provides a lower bound estimate of species richness. Similarly, Chao3Adj also provides a lower bound estimate of species richness under the gamma-Poisson model or the beta-binomial model [30,31]. Therefore, in cases where the community exhibits high heterogeneity or the sample size is small, these two estimators can offer lower bound estimates and more informative one-sided 95% confidence intervals (CIs) of species richness, shown as
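The delta-method variance and the log-normal confidence interval discussed in this section can be sketched numerically. The Python example below is a sketch under assumptions, not the article's implementation: it treats the singleton-doubleton lower bound as a function of three frequency classes, uses the standard multinomial covariance approximation with the estimated richness as the total, and applies the usual log-transformed two-sided interval. The function name and the input counts are hypothetical.

```python
import math

def chao3_variance_and_ci(f1, f2, f3_plus, z=1.96):
    """Delta-method variance and log-normal CI; requires f1 > 0 and f2 > 0."""
    s_obs = f1 + f2 + f3_plus
    s_hat = s_obs + f1 ** 2 / (2 * f2)
    freqs = [f1, f2, f3_plus]
    # Partial derivatives of s_hat with respect to f1, f2, and f3_plus.
    grad = [1 + f1 / f2, 1 - f1 ** 2 / (2 * f2 ** 2), 1.0]
    var = 0.0
    for i, fi in enumerate(freqs):
        for j, fj in enumerate(freqs):
            # Multinomial covariance approximation with estimated total s_hat.
            cov = fi * (1 - fi / s_hat) if i == j else -fi * fj / s_hat
            var += grad[i] * grad[j] * cov
    # Log-normal interval for the number of undetected species.
    undetected = s_hat - s_obs
    r = math.exp(z * math.sqrt(math.log(1 + var / undetected ** 2)))
    return s_hat, var, (s_obs + undetected / r, s_obs + undetected * r)

s_hat, var, (lower, upper) = chao3_variance_and_ci(2, 2, 1)
print(s_hat, round(var, 3), round(lower, 2), round(upper, 2))
```

Because the interval is built on the undetected part only, its lower limit always stays above the observed richness, which is the motivation given in the text for the log-normal assumption.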
3. Results
3.1. Hypothetical Species Composition Models for Simulation Study
A simulation study was conducted to examine the statistical behaviors of the new estimators. The study involved the use of three species abundance models to generate individual-based abundance data and three species detection models to generate sample-based incidence data. The number of species was kept constant at S = 600, and the simulated datasets were generated separately and independently using the following models.
3.1.1. Models for Individual-Based Abundance Sampling
The species detection probabilities (or species relative abundances) in each model are provided below, where $c$ is a normalizing constant such that $\sum_{i=1}^{S} p_i = 1$. The coefficient of variation (CV) of $(p_1, p_2, \ldots, p_S)$ is also presented to indicate the degree of heterogeneity of the $p_i$.
a. Abundance model 1, random uniform model (CV = 0.53), with $p_i = c a_i$, where $(a_1, a_2, \ldots, a_S)$ is a random sample from a uniform distribution.
b. Abundance model 2, broken-stick model (CV = 0.97), with $p_i = c a_i$, where $(a_1, a_2, \ldots, a_S)$ is a random sample from an exponential distribution with parameter 1. This model is commonly used in the literature and is equivalent to the Dirichlet distribution.
c. Abundance model 3, log-normal model (CV = 1.56), with $p_i = c a_i$, where $(a_1, a_2, \ldots, a_S)$ is a random sample from a log-normal distribution with parameters 0 and 1.
3.1.2. Models for Sample-Based Incidence Sampling
The species detection probabilities in each model were determined as follows, where $c$ is a rescaling constant such that the maximum detection probability is fixed at a constant value. The coefficient of variation (CV) of $(\pi_1, \pi_2, \ldots, \pi_S)$ is also calculated to indicate the degree of heterogeneity of the $\pi_i$.
d. Incidence model 1: the random uniform model (CV = 0.57), where $\pi_i = c a_i$, $(a_1, a_2, \ldots, a_S)$ is a random sample from a uniform distribution with parameters (0, 1), and the scale $c$ is used to control the maximum of $\pi_i$.
e. Incidence model 2: the broken-stick model (CV = 0.99), where $\pi_i = c a_i$, $(a_1, a_2, \ldots, a_S)$ is a random sample from an exponential distribution with parameter 1, and the scale $c$ is used to control the maximum of $\pi_i$. This model is commonly used in the literature and is equivalent to the Dirichlet distribution.
f. Incidence model 3: the log-normal model (CV = 1.23), where $\pi_i = c a_i$, $(a_1, a_2, \ldots, a_S)$ is a random sample from a log-normal distribution, and the scale $c$ is used to control the maximum of $\pi_i$.
The coefficient of variation (CV) in these six models ranged from 0.53 to 1.56, covering most practical cases in real-world applications.
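The way such species composition models can be generated is sketched below in Python. The seed, the helper names, the maximum detection probability of 0.5, and the log-normal parameters used for the incidence rescaling are assumptions made for illustration; the CV values printed by this sketch will therefore differ from those reported for the article's own random draws.

```python
import random
import statistics

def normalize(a):
    """Scale draws so they sum to 1 (relative abundances)."""
    total = sum(a)
    return [x / total for x in a]

def rescale_to_max(a, max_prob):
    """Scale draws so the largest detection probability equals max_prob."""
    top = max(a)
    return [max_prob * x / top for x in a]

def cv(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

rng = random.Random(1)  # arbitrary seed for reproducibility
S = 600
draws = {
    "random uniform": [rng.random() for _ in range(S)],
    "broken-stick": [rng.expovariate(1.0) for _ in range(S)],
    "log-normal": [rng.lognormvariate(0.0, 1.0) for _ in range(S)],
}
for name, a in draws.items():
    p = normalize(a)              # abundance model: probabilities sum to 1
    pi = rescale_to_max(a, 0.5)   # incidence model: hypothetical maximum of 0.5
    print(name, "CV =", round(cv(p), 2), "max pi =", max(pi))
```

Since the CV is unchanged by multiplying all values by a constant, the normalized and rescaled versions of the same draws share one CV.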
In the simulation study, different sample sizes are considered to represent varying levels of sampling effort. For each simulation scenario, 1000 simulated datasets are generated. The estimates and their corresponding estimated standard errors (SE) are averaged across the 1000 simulated datasets to obtain the mean estimate and mean estimated SE. The sample SE and sample root mean square error (RMSE) are calculated from the 1000 estimates. The percentage of 95% confidence intervals (CIs) that cover the true value and the average observed richness are also calculated. All the simulation results are presented in Table 1 and Table 2. For simplicity, the average estimates of the discussed estimators are plotted in Figure 1 and Figure 2 to illustrate their statistical behavior as a function of sampling effort.
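The summary statistics described above can be computed as in the following Python sketch. The function name `summarize` and the three toy estimates are hypothetical; a real run would pass the 1000 estimates and intervals from one simulation scenario.

```python
import math
import statistics

def summarize(estimates, intervals, true_value):
    """Bias, sample SE, RMSE, and CI coverage over simulated datasets."""
    mean_estimate = statistics.mean(estimates)
    squared_errors = [(e - true_value) ** 2 for e in estimates]
    covered = sum(1 for lo, hi in intervals if lo <= true_value <= hi)
    return {
        "mean_estimate": mean_estimate,
        "bias": mean_estimate - true_value,
        "sample_se": statistics.stdev(estimates),
        "rmse": math.sqrt(statistics.mean(squared_errors)),
        "coverage": covered / len(intervals),
    }

# Toy input: three estimates of a true richness of 10 with their intervals.
result = summarize([9, 10, 11], [(8, 12), (9, 9.5), (10, 13)], 10)
print(result)
```

RMSE combines bias and sampling variability, so an estimator with a larger SE can still have a smaller RMSE if its bias is sufficiently reduced.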
3.2. Simulation Results for Richness Estimation in a Single Assemblage
In this case, the integrated dataset consists of three random samples that are independently collected from the same assemblage. Each sample is simulated separately based on one of the three discussed abundance/incidence models, representing different sampling situations or methods. Different sample sizes are considered to indicate varying levels of sampling efforts, ranging from n = 200 to 600, with an increment of 50 for abundance data, and t = 10 to 50 with an increment of 5 for incidence data.
Four different scenarios are examined, including:
Three abundance models: random uniform, broken-stick, and log-normal;
Two abundance models: random uniform and broken-stick; one incidence model: log-normal;
One abundance model: random uniform; two incidence models: broken-stick and log-normal;
Three incidence models: random uniform, broken-stick, and log-normal.
The simulation results for these four scenarios are presented separately in Figure 1a–d and Table 1.
For richness estimation in a single assemblage based on integrated data, Figure 1 and Table 1 show that, under the various scenarios of integrated datasets, both Chao3 and Chao3Adj effectively reduce the negative bias of observed richness, and their bias and RMSE decrease as sample size increases. When the sample size is small, both Chao3 and Chao3Adj provide a lower bound for the true richness, and they approach the true richness as sampling effort increases. The variance estimator derived via the asymptotic method performs well in all simulation scenarios (shown in Table 1).
Compared to Chao3, Chao3Adj offers a nearly unbiased and resilient estimate (with reduced bias and RMSE) and a more accurate 95% CI in every simulation scenario (as illustrated in Figure 1 and Table 1), even if the sample size is small.
3.3. Simulation Results for Richness Estimation across Multiple Assemblages
To evaluate the statistical behavior of the discussed estimators for richness estimation based on integrated data, three assemblages (sites) are assumed here. The integrated data consist of three samples that are collected separately from each site. It is assumed that the three sites comprise S = 600 species, with each site containing 300 species. There are varying numbers of shared species and unique species in each site.
Different sample sizes are considered to represent different sampling efforts, ranging from n = 400 to 4000, with an increment of 450 for abundance data, and t = 10 to 50, with an increment of 5 for incidence data. Four different scenarios are examined:
Three abundance models: random uniform, broken-stick, and log-normal;
Two abundance models: random uniform and broken-stick; one incidence model: log-normal;
One abundance model: random uniform; two incidence models: broken-stick and log-normal;
Three incidence models: random uniform, broken-stick, and log-normal.
The simulation results for these four scenarios are presented separately in Figure 2a–d and Table 2.
To assess richness across multiple assemblages using integrated data, both Figure 2 and Table 2 indicate that Chao3 and Chao3Adj effectively mitigate the negative bias of observed richness in each discussed integrated data scenario. As the sample size increases, the bias and RMSE for these two estimators decline. With limited sample sizes, Chao3 and Chao3Adj act as lower bounds for true richness. As sample size increases, these estimators converge toward true richness. The variance estimator, derived through the asymptotic method, consistently performs well across all simulated scenarios, as corroborated by Table 2. While Chao3Adj exhibits higher variance compared to Chao3, it yields more accurate estimates, demonstrating less bias and lower RMSE. This ensures a more dependable 95% CI across all simulation scenarios, as emphasized in both Figure 2 and Table 2.
3.4. Remarks of Simulation Results
For a fixed sample size, a superior species richness estimator should exhibit lower bias and variance (i.e., low RMSE). Additionally, the coverage rate of its associated 95% confidence interval should be close to 0.95. As the sample size increases, an effective estimator should exhibit the following key characteristics: its bias should decrease; its accuracy (measured by RMSE) should improve; and the coverage rate of its confidence interval should generally move closer to the nominal level, with the estimate ultimately approximating the true species richness when the sample size is adequately large. Based on these criteria, the following findings can be concluded from the simulation results:
In all simulation scenarios presented in Tables 1 and 2 and Figures 1 and 2, both Chao3 and Chao3Adj consistently provide robust lower bound estimates, and they tend to approach the true richness as the sample size increases;
Both Chao3 and Chao3Adj exhibit the essential statistical behaviors: their bias and RMSE decrease, and their 95% confidence intervals become more accurate, as the sample size increases (Tables 1 and 2);
The variance estimators of both richness estimators, derived using the asymptotic approach, perform well across all simulation scenarios (Tables 1 and 2);
Compared to Chao3, Chao3Adj exhibits lower bias, larger standard errors, and more accurate 95% confidence intervals for the true richness in all simulation scenarios (Tables 1 and 2);
When samples are collected directly from the entire region (Table 1), Chao3Adj has higher RMSEs than Chao3; however, when samples are collected separately from each local area within the target region (Table 2), Chao3Adj has lower RMSEs.
These findings collectively demonstrate the favorable performance of Chao3Adj in terms of bias, standard error, and accuracy of the 95% confidence interval, particularly when samples are collected from each site within the target area.
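The three evaluation criteria used throughout this section (bias, RMSE, and 95% CI coverage rate) can be sketched as follows. This is only an illustrative sketch: the function name `summarize_replicates` is hypothetical, and the interval is assumed to be the symmetric normal interval (estimate ± 1.96 s.e.), which may differ from the interval construction actually used in the simulations.

```python
import math

def summarize_replicates(estimates, std_errors, true_richness, z=1.96):
    """Summarize simulation replicates of one richness estimator.

    estimates     : richness estimate from each simulated dataset
    std_errors    : estimated standard error from each simulated dataset
    true_richness : known richness of the simulated assemblage
    Assumption: a symmetric normal CI (estimate +/- z * s.e.) is used.
    """
    r = len(estimates)
    mean_estimate = sum(estimates) / r
    # Bias: average estimate minus the true value (negative = underestimation).
    bias = mean_estimate - true_richness
    # RMSE: square root of the mean squared deviation from the true value.
    rmse = math.sqrt(sum((e - true_richness) ** 2 for e in estimates) / r)
    # Coverage: fraction of replicates whose CI contains the true value.
    covered = sum(
        1 for e, s in zip(estimates, std_errors)
        if e - z * s <= true_richness <= e + z * s
    )
    return {"bias": bias, "rmse": rmse, "coverage": covered / r}

# Toy replicates (made-up numbers, for illustration only).
summary = summarize_replicates([90, 100, 110], [10, 10, 10], true_richness=100)
```

Under these criteria, a good estimator shows bias and RMSE shrinking toward zero and coverage approaching 0.95 as the sample size grows.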
3.5. Using Datasets as True Assemblages
I used two biological survey datasets to represent true assemblages and generated simulated datasets from each of them. In each survey dataset, the observed species relative abundance was treated as the true species relative abundance or detection probability. A sample of size n (or t) was then generated through sampling with replacement to create the sampling dataset. Different sample sizes were considered to represent varying levels of sampling effort.
The average estimates and other relevant statistics obtained from the 1000 generated datasets, as a function of sample size, are shown in Figures 3 and 4 and in Appendix Tables A3 and A4 (refer to Appendix C for detailed information). These evaluations assess the statistical behavior of the discussed richness estimators under four sampling scenarios: three abundance datasets; two abundance datasets and one incidence dataset; one abundance dataset and two incidence datasets; and three incidence datasets.
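The resampling step described above, in which a survey dataset is treated as the true assemblage and individuals are drawn with replacement, can be sketched as follows. This is a minimal illustration: the species counts are made up, the helper names are hypothetical, and n is assumed to denote the number of individuals drawn.

```python
import random
from collections import Counter

def sample_abundance(reference_counts, n, rng):
    """Draw n individuals with replacement from a reference assemblage.

    The observed relative abundances in reference_counts are treated
    as the true relative abundances.
    """
    species = list(reference_counts)
    weights = [reference_counts[s] for s in species]
    draws = rng.choices(species, weights=weights, k=n)
    return Counter(draws)  # species -> number of sampled individuals

def frequency_counts(sample, k_max=3):
    """Return [f1, ..., f_kmax]: number of species seen exactly k times."""
    freq = Counter(sample.values())
    return [freq.get(k, 0) for k in range(1, k_max + 1)]

# Hypothetical reference assemblage (illustrative counts only).
reference = {"sp_a": 50, "sp_b": 20, "sp_c": 5, "sp_d": 1}
rng = random.Random(1)  # fixed seed so the sketch is reproducible
sample = sample_abundance(reference, n=30, rng=rng)
observed_richness = len(sample)
f1, f2, f3 = frequency_counts(sample)
```

Repeating this draw many times (1000 in the study) for each sample size gives the replicates from which average estimates, bias, and RMSE are computed.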
3.5.1. Moth Species Data
Moth species data were collected in the Golfo Dulce region of the Costa Rican rainforest from July to October 2014 [33]. The target region was divided into three types of forest: creek forest, slope forest, and ridge forest. Light traps were set up at 18 sites, with six replicates within each forest type. Further details can be found in [33].
Table 3 presents a summary of the data, including the sample size, observed richness, and the first five species frequency counts for each forest type. In the pooled sample, a total of 421 species were recorded, with 115, 285, and 356 species observed in the creek, slope, and ridge forests, respectively. In this case, the survey datasets are considered as the true assemblages. The proportion of species in the sample is assumed to represent the species’ relative abundance for generating individual-based abundance data, while the ratio between species abundance and the maximum abundance is considered as the species’ detection probability for generating sample-based incidence data. Consequently, each type of forest has its corresponding abundance model and incidence model. Four different scenarios are examined:
Three abundance models: creek, slope, and ridge;
Two abundance models: creek and slope; one incidence model: ridge;
One abundance model: creek; two incidence models: slope and ridge;
Three incidence models: creek, slope, and ridge.
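The incidence model described above for the moth data, in which a species' detection probability is its abundance divided by the maximum abundance, can be sketched as follows. This is an illustration under stated assumptions: the counts are made up, the helper names are hypothetical, t is assumed to denote the number of sampling units, and each species is assumed to be detected independently in each unit.

```python
import random

def detection_probabilities(reference_counts):
    """Detection probability = species abundance / maximum abundance."""
    max_count = max(reference_counts.values())
    return {s: c / max_count for s, c in reference_counts.items()}

def sample_incidence(det_prob, t, rng):
    """Simulate t sampling units; return species -> number of units
    in which the species was detected (undetected species omitted)."""
    incidence = {}
    for species, p in det_prob.items():
        # One independent presence/absence trial per sampling unit.
        hits = sum(1 for _ in range(t) if rng.random() < p)
        if hits > 0:
            incidence[species] = hits
    return incidence

# Hypothetical reference assemblage (illustrative counts only).
reference = {"sp_a": 50, "sp_b": 20, "sp_c": 5, "sp_d": 1}
probs = detection_probabilities(reference)
incidence = sample_incidence(probs, t=10, rng=random.Random(7))
```

With this construction, the most abundant species has detection probability 1 and is recorded in every sampling unit, while rare species are detected in few units or not at all.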
3.5.2. Xylobiont Beetle Species Data
The second real dataset comprises xylobiont beetle species data collected from the Leipzig floodplain forest in 2016 [34]. The beetle species data were collected separately from three dominant tree species in the Leipzig floodplain forest area: Quercus robur (QR), Tilia cordata (TC), and Fraxinus excelsior (FE).
Table 3 provides information on the sample size, observed richness, and the first three species frequency counts for each tree species. In total, 307 beetle species were observed, with 174, 198, and 184 species recorded in QR, TC, and FE tree species, respectively.
In this case, the survey datasets are treated as the true assemblages, and the species abundance/incidence model is constructed using the same method discussed earlier for each tree species. Four different scenarios are considered, including:
Three abundance models: QR, TC, and FE;
Two abundance models: QR and TC; one incidence model: FE;
One abundance model: QR; two incidence models: TC and FE;
Three incidence models: QR, TC, and FE.
Abbreviations: CV, coefficient of variation; and are, respectively, the singleton richness, doubleton richness, tripleton richness, and the number of species observed more than three times in the sample.
The results, shown in Figures 3 and 4 and in the Appendix Tables in Appendix C, demonstrate that both Chao3 and Chao3Adj effectively reduce the negative bias of observed richness. From a theoretical standpoint, and considering the results of the simulation study, Chao3Adj is expected to exhibit lower bias than Chao3, particularly when heterogeneity is high, as indicated by a high coefficient of variation (CV). Appendix Tables A3 and A4 in Appendix C confirm that Chao3Adj has lower bias, higher standard error, and lower root mean square error (RMSE). The higher estimated standard error of Chao3Adj suggests that it may provide a more accurate 95% confidence interval for true richness than Chao3. This observation aligns with the findings of the simulation study presented in Tables 1 and 2.
Overall, the results indicate that Chao3Adj has lower bias and lower RMSE than Chao3, while its larger standard error yields more accurate confidence intervals for true richness, particularly in scenarios with higher heterogeneity.
4. Discussion and Conclusions
Species richness is the most commonly used diversity metric in ecological research. Numerous methods for estimating total species richness in a given area have been explored in the literature. These methods can be broadly categorized as either parametric or non-parametric estimators. Parametric estimators rely on assumptions about species composition and require complex computations to solve likelihood functions. Furthermore, they often encounter convergence issues during iterative numerical procedures or yield high variance when sample sizes are small. As such, they are poorly suited to small samples and are seldom applied in ecological studies. In contrast, non-parametric estimators, which do not impose assumptions on species composition and have simple, closed-form formulae, tend to be more robust across simulation cases. Consequently, they have been widely adopted in ecological studies. However, both parametric and non-parametric approaches are derived under the assumption that the sample is collected randomly from the target area, so that the number of individuals belonging to a given species (in an abundance sample) or the number of plots in which a species is detected (in an incidence sample) follows a binomial distribution.
Estimating species richness for a large-scale area or multiple assemblages poses a statistical challenge due to the difficulty of obtaining a random sample from the entire region. Typically, integrated data collected for assessing species richness in such cases consist of multiple samples that are individually drawn from each assemblage or local area, and these samples may employ different sampling schemes or strategies. Therefore, the detection probability of a species may vary across the samples, and the data format (individual-based abundance data or sample-based incidence data) can differ among the samples. As a result, the pooled sample of the integrated data cannot be considered a random sample from the entire region, even though each individual sample is randomly collected from its respective local area or assemblage, and it cannot be modeled using a traditional sampling distribution. Moreover, no estimator has previously been developed in the literature to estimate richness based on integrated data.
In this context, richness estimators that rely only on the frequency counts of rare species in the sample have been theoretically demonstrated to be applicable to the pooled sample, as long as the samples are randomly collected and the sample size is not excessively small. Numerous non-parametric techniques are based on the frequency counts of rare species, such as the widely adopted jackknife estimators, but these often violate the essential requirement that bias and root mean squared error (RMSE) diminish as the sample size increases. They are also not consistently reliable, especially with limited data or when the assemblage is highly heterogeneous [26,30,35]. Hence, this manuscript does not consider these estimators; instead, it focuses on the widely used Chao’s lower bound estimator, which uses the numbers of singletons and doubletons to estimate undetected richness and provides a reliable estimate when the sample size is not sufficiently large. In this research, a lower bound estimator (Chao3) and its bias-corrected estimator (Chao3Adj) are theoretically proven to be suitable for estimating richness in multiple assemblages based on the pooled sample from integrated data. Chao3, derived using the Cauchy–Schwarz inequality, provides a lower bound richness estimate, while Chao3Adj corrects the bias of Chao3 based on the Good–Turing frequency formula.
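For orientation, the classical Chao lower bound for abundance data (commonly called Chao1), which uses only singletons and doubletons as described above, can be sketched as follows. This is the traditional single-sample estimator in its widely used bias-corrected form, not the Chao3 or Chao3Adj estimators proposed in this article; the function name is hypothetical.

```python
def chao1_lower_bound(s_obs, f1, f2, n):
    """Classical Chao1 lower bound for species richness (abundance data).

    s_obs : number of species observed in the sample
    f1    : number of singletons (species observed exactly once)
    f2    : number of doubletons (species observed exactly twice)
    n     : number of individuals in the sample
    Note: this is NOT the Chao3/Chao3Adj estimator of this article.
    """
    k = (n - 1) / n
    if f2 > 0:
        undetected = k * f1 ** 2 / (2 * f2)
    else:
        # Modified form used when no doubletons are observed.
        undetected = k * f1 * (f1 - 1) / 2
    return s_obs + undetected

# Toy numbers for illustration only.
estimate = chao1_lower_bound(s_obs=100, f1=20, f2=10, n=1000)
```

The estimated number of undetected species grows with the number of singletons and shrinks with the number of doubletons, which is why only counts of rare species are needed.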
Since no single statistical model can accurately fit all ecological communities, no existing richness estimator is uniformly unbiased across them. Therefore, developing a more robust estimator is essential for estimating species richness. Such an estimator should be designed so that both its bias and its RMSE, the two most crucial properties of an estimator, decrease as the sample size increases. Additionally, the coverage rate of the 95% confidence interval should approach 0.95 as the sample size grows. Based on these criteria, I arrived at the following conclusions from the simulation results. In all simulated scenarios, the observed richness substantially underestimated the true richness, particularly when the sample size was small or when the species composition of the community was highly heterogeneous. The simulation results demonstrate that both Chao3 and Chao3Adj correct the severe negative bias of observed richness, and that their bias and RMSE decrease as the sample size increases across all models discussed. These two estimators provide lower bound estimates in all hypothetical models and tend to converge to the true richness as the sample size increases. This implies that both proposed estimators can be used to estimate regional richness based on the pooled sample from integrated data, which aligns with the theoretical findings. Notably, when the sample size is small or the community exhibits high heterogeneity, Chao3 presents a significant negative bias, and the coverage rate of its 95% confidence interval (CI) is generally much lower than 0.95. In this case, Chao3Adj outperforms Chao3 with lower bias, lower RMSE, higher standard error (s.e.), and a more accurate 95% CI for true richness. This indicates that Chao3Adj tends to be more stable and less susceptible to these challenges than the traditional Chao’s lower bound estimator.
All estimators proposed in this article are based on the assumption that each sampling unit is collected independently. When individuals are not sampled independently, and individuals of the same species are more likely to be sampled together, the proposed estimators can be severely negatively biased. When individuals of a species exhibit spatial aggregation, it is difficult to collect them independently and individually, so the individual-based sampling method may not be practical to implement. In these situations, the sample-based incidence sampling method is recommended for collecting data to assess species richness, since it approximately ensures that the sampling units are sampled independently, in line with the underlying model assumptions.
The proposed estimators depend solely on data related to rare species, namely, the count of singletons, doubletons, and tripletons in the aggregated sample, in order to estimate unobserved richness. Compared to more complex computations such as the maximum likelihood method, these methods offer a computational advantage as they provide estimates more simply, and their user-friendliness is emphasized by their straightforward formulae. From a practical standpoint, another significant benefit is that these estimators eliminate the need for detailed tracking of the count of abundant species observed during field surveys, thereby considerably reducing the field sampling burden.
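The rare-species counts mentioned above can be obtained from integrated data along the following lines. This is only a sketch: it assumes that the pooled frequency of a species is the sum of its abundance counts and incidence counts across the individual samples, and the helper names and toy data are hypothetical.

```python
from collections import Counter

def pool_samples(samples):
    """Pool several samples (species -> frequency) by summing frequencies.

    Assumption: abundance counts and incidence counts are simply added
    to form the pooled frequency of each species.
    """
    pooled = Counter()
    for sample in samples:
        pooled.update(sample)  # adds frequencies species by species
    return pooled

def rare_species_counts(pooled):
    """Return (f1, f2, f3): numbers of singletons, doubletons, tripletons."""
    freq = Counter(pooled.values())
    return freq.get(1, 0), freq.get(2, 0), freq.get(3, 0)

# Toy integrated data: two abundance samples and one incidence sample.
creek_abundance = {"a": 1, "b": 3}
slope_abundance = {"a": 1, "c": 1}
ridge_incidence = {"b": 4, "d": 2}

pooled = pool_samples([creek_abundance, slope_abundance, ridge_incidence])
observed_richness = len(pooled)
f1, f2, f3 = rare_species_counts(pooled)
```

Because only the counts of rarely observed species enter the calculation, the exact frequencies of abundant species (such as species "b" here) do not affect the result.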
In summary, while the newly introduced estimators show promising results in the hypothetical models and two real datasets, their applicability still necessitates further validation using a broader range of real datasets in the future.