1. Introduction
Uncovering the true scale of the COVID-19 pandemic is critical for shaping public health policy, designing effective medical interventions, and planning economic, social, and educational activities. This requires testing studies that assess the unknown prevalence of ongoing or past infection in a given population. Knowledge of the prevalence parameter is also indispensable for estimating the all-important infection mortality rate of COVID-19.
The prevalence of a current or past disease or infection in a population at a given time is defined as the fraction of the population that at the time of interest has or has had the disease or infection. The prevalence is time-dependent. Due to the fact that COVID-19 is highly contagious and has a relatively short incubation period, the prevalence of COVID-19 in a population may change on the time scale of weeks. Thus, to provide a meaningful estimate of the population prevalence of COVID-19, testing studies should be conducted during a short period of time.
Selection of an appropriate target population for a given prevalence study is essential for the study’s scientific value. Such selection may be guided by the current knowledge of the epidemiology of the disease. Testing the entire population for COVID-19 or other diseases is expensive and usually impractical. This is why it is typically performed on a relatively small sample of individuals deemed to be representative of the population in terms of the likelihood of a present or past infection. The sample can be drawn randomly or recruited by other means and controlled for individual characteristics known to be associated with the rate of occurrence of the infection or severity of the disease in order to approximate the distribution of these characteristics in the target population. In the case of COVID-19, these include age, sex, race, socio-economic status, membership in certain professional groups, geographic location, and the presence of various comorbidities. One notable exception to the small sample approach is a SARS-CoV-2 infection study conducted on 21–29 February 2020 and 7 March 2020 in Vo’, a municipality of 3275 inhabitants in the Veneto region in Italy, where nasopharyngeal swabs were collected from 85.9% and 71.5% of the entire population in two phases of the study [
1].
Several reports on COVID-19 population prevalence studies have been recently published [
2,
3,
4,
5]. One of them, the Santa Clara county seroprevalence study [
2], has stirred a considerable scientific [
6] and public [
7,
8] controversy. The reasons include, but are not limited to, questions raised about the study design, statistical methodology, and the lack of a full disclosure of methods and underlying assumptions. This creates an urgent need for methodological clarity regarding mathematical, statistical, and epidemiological foundations of testing for COVID-19, the analysis of the resulting data including accurate accounting for false positive and false negative test results, and effective design of population prevalence studies. The present work is a step in this direction. In particular, I explicitly state three fairly minimal biomedical/statistical assumptions essential for estimation of the prevalence of COVID-19 or other diseases in a given population. In addition, I identify possible reasons that may lead to violation of these assumptions using the studies [
2,
3,
4,
5] as an illustration. Finally, detailed proofs of all the mathematical and statistical results are provided. These proofs depend on the three aforementioned assumptions alone, and it is specified which assumptions are requisite for each result.
An important consideration to reckon with when studying prevalence of a present or past infection in a target population is heterogeneity of the infection prevalence across its various subpopulations. These subpopulations may be determined by the above-mentioned observable individual characteristics or their combinations. The effects of these attributes on the risk of COVID-19 infection and severity of the disease have been well-documented. For example, health care workers may be exposed to higher doses of SARS-CoV-2, and, more frequently, than members of the general population; consequently, they may have a higher prevalence of the infection. Furthermore, potentially significant fractions of the population may be insusceptible to COVID-19 or carry the infection asymptomatically. In addition, various presently unknown variables associated e.g., with the genetic make-up, functioning of the immune system, and the history of immunizations and exposures to similar pathogens may all prove important for proper prevalence stratification and estimation.
Various approaches to accounting for population heterogeneity can be illustrated using the aforementioned prevalence studies [
2,
3,
4,
5]. The Gangelt study [
3] did not deal with explicitly defined subpopulations; however, it addressed the effects of the household size, the presence of several co-morbidities, and participation in a COVID-19 super-spreading event on the risk of COVID-19 infection and severity of the disease. The Santa Clara county study [
2] defined its subpopulations by sex, race, and zip code. To mitigate a mismatch between the sample of tested subjects and the population (manifesting, for example, in the fact that 63.7% of study participants with valid test results were females while the overall proportion of females in the Santa Clara county is 49.5%), the test results were re-weighted, see [
2] for details. A similar approach was taken in the Los Angeles county study [
4] that stratified the population by sex, age, race/ethnicity, and income. Finally, the population queried for the prevalence of ongoing COVID-19 infection in the Iceland study [
5] was only stratified by age and sex.
A subpopulation of a given population will be called homogeneous if, at the time of survey (or over a certain period of time in the case of survey for a past infection), all members of the subpopulation have, or had, the same probability of being infected. This probability represents the prevalence of the disease within the subpopulation. Homogeneous subpopulations of the given population together with their prevalences will be referred to as the prevalence structure of the population. Like many other abstract concepts of mathematics and statistics, the concepts of homogeneous population and prevalence structure are idealizations; however, they will prove to be useful for our analysis.
The diagnostic quality of a test for a certain infection (or disease) is determined by its sensitivity $\alpha$ (i.e., the probability of obtaining a positive test result for an infected individual) and specificity $\beta$ (i.e., the probability of a negative test result for an uninfected individual). Great effort is expended by the manufacturers of test kits to ensure that the sensitivity and specificity of their test remain stable across a wide range of testing conditions, which involves running the test on a large number of known true positive and true negative samples. This is why in this article I assume $\alpha$ and $\beta$ to be fixed. In spite of the effort, however, the sensitivity and specificity of a test, especially a newly designed one, may display certain random or systematic variation depending on the testing site, tested individual, and other conditions. A Bayesian framework for studying the effects of such variation on population prevalence of COVID-19 was presented, in the context of critical analysis of the Santa Clara county study [
2], in [
6]. For a review of Bayesian modeling approaches to the assessment of accuracy of diagnostic tests, see [
9].
Below, the reader will encounter references to various tests for COVID-19 infection including molecular amplification tests such as RT-PCR for SARS-CoV-2 and serological tests for its antibodies, such as IgA, IgM, and IgG that serve as biomarkers of the present or past infection. For a general reference on this subject, see [
10].
In this work, I compute in closed form, starting with a homogeneous population, (1) the distribution of the number, $n$, of positive test outcomes that result from testing $N$ individuals using a test with given sensitivity $\alpha$ and specificity $\beta$; (2) the distribution of the number of true and false positive test results conditional on the event that $n$ positive test results were observed; and (3) the conditional expected value and variance of the true number of infected individuals among those tested given $n$ positive test results, see
Section 3 and
Section 4. In
Section 5, an
Equivalence Principle that enables an extension of all these results to heterogeneous populations is established. In
Section 6, I bring the above results to bear on the design of a consistent estimator of the population prevalence of the disease (or infection) from the test results collected from a sample of individuals. There, I also provide a detailed justification of the mathematical form of the prevalence estimator and study its properties. This estimator was employed in [
2] and used for the analysis of uncertainty propagation in [
6].
The results of this work can be used for the assessment of plausibility of prevalence estimates reported by various population studies. Specifically, the prevalence of the current or past infection can be checked for self-consistency by comparing the observed number of positive cases with its theoretical prediction and by computing the expected number of false positive and false negative test results and associated probabilities. Additionally, the assumptions formulated in
Section 2 may help optimize the design of future prevalence studies for COVID-19 and other diseases.
The Santa Clara county study [
2] was criticized for using suboptimal tests for the presence of IgM and IgG antibodies that serve as respective biomarkers of current and past SARS-CoV-2 infection, with combined sensitivity
and specificity
on the grounds that these tests could have produced a large number of false positive and false negative results. I argue that this criticism is largely unfounded and show that the estimator of the true population prevalence,
derived in the present work and also utilized in [
2], is consistent and has small variance for a sufficiently large sample size for
any and
such that
and
For example, if at the time of the study [
2], the true COVID-19 seroprevalence in Santa Clara county was, say, between 1% and 10%, then the tests employed (and even those with the same specificity and a dismal sensitivity of 15%) would still be appropriate for a sufficiently large number of tested individuals, should the three assumptions formulated in
Section 2 be met.
Because of the current acute interest in SARS-CoV-2 and COVID-19, I will be employing below the terminology and biological considerations pertaining to COVID-19 testing. However, the results of this work may prove applicable to testing for other diseases or conditions as well as in some non-epidemiological settings.
This article can be viewed as a tutorial on some basic mathematical and statistical aspects of testing and prevalence estimation. It represents an expanded and refined version of the author’s preprint [
11].
2. Basic Assumptions
I start by spelling out three assumptions, termed A, B, C that are ideally to be met by a well-designed population prevalence study. Their discussion is tailored to testing for COVID-19 and uses studies [
2,
3,
4,
5] as an illustration. Like many general assumptions in probability and statistics, Assumptions A–C are largely empirically untestable. However, various kinds of empirical evidence as well as careful study design can strengthen the case for their validity. In the terminology of Henri Poincaré [
12], these assumptions serve as
neutral hypotheses that are not too restrictive yet enable building a rigorous quantitative framework for population prevalence estimation. Conversely, specific aspects of study design, discussed below, likely contravene these assumptions. Finally, the assumptions do not have to be necessarily met by the entire sample of subjects recruited for testing or those actually tested; rather, they may serve as a guide for selection of a subset of study participants whose valid test results will be used for population prevalence estimation.
Although Assumption A is indispensable for rigorous statistical analysis of testing data, its violations in prevalence studies are fairly common. One particular reason is excessive inclusion of multiple members of the same household in a set of testing data used for prevalence estimation. The effects of such oversampling were clearly demonstrated by the study [
3] conducted on 31 March–6 April 2020 in Gangelt, a community of around 12,500 people in the state of North Rhine-Westphalia, Germany. One of the aims of the study was to estimate the excess risk of contracting COVID-19 for someone who lives in the same household with an infected individual. It was found that, interestingly, the risk of such secondary infection increases by 28.1%, 20.2%, and 2.8% for households with two, three, and four people, respectively, relative to the 15.5% risk of the primary infection [
3]. Given that 919 participants of the Gangelt study for whom valid RT-PCR SARS-CoV-2 test results and/or the titers of IgA or IgG antibodies were obtained belonged to just 405 households, the validity of Assumption A for this study is questionable. The same likely applies to the Santa Clara study [
2] conducted on 3–4 April 2020 in Santa Clara county, CA with the population of around 2 million people. In that study, among 3390 participants whose blood specimens were analyzed for the presence of IgM or IgG antibodies there were 2747 adults and 643 children, no more than one per household, living with some of these adults [
2]. By contrast to the Gangelt and Santa Clara studies, the Los Angeles study of IgM/IgG seroprevalence [
4] that was conducted on 10–11 April 2020 in the Los Angeles county, CA limited participation to one individual per household.
The validity of Assumption A may also prove problematic if a disproportionately large number of study participants attended a known COVID-19 super-spreading event or may be suspected of belonging to known clusters of COVID-19 cases. For example, investigation of the effects of one such event, a carnival festivity held around 15 February 2020 in Gangelt, revealed that the infection rate among participants of the event was 2.6 times higher than among non-participants; in addition, the course of the disease was much more severe in the former than in the latter [
3].
Although Assumption B is tacitly adopted almost universally, its validity may prove questionable in certain cases. For example, in asymptomatic or pre-symptomatic carriers of COVID-19, the number of viral particles on a nasopharyngeal swab may be too low to be detectable by the RT-PCR test. Additionally, in some asymptomatic or even symptomatic individuals, the titer of IgA or IgM antibodies indicative of an ongoing disease may not exceed the detection threshold of a serological test. As yet another example, in convalescent COVID-19 patients, the presence of viral particles may already be undetectable while the titer of IgG antibody, an indicator of a past disease, may not be detectable yet. In all these cases, the sensitivity of the test will be reduced. Likewise, cross-reactivity with viral fragments of, or antibodies to, another virus may increase the likelihood of false positive responses in those individuals who at the time of testing are, or have recently been, infected with a similar pathogen, e.g., a coronavirus causing the common cold. Such cross-reactivity will result in a reduction in the test’s specificity.
Another source of systematic non-uniformity of test performance is the use of composite tests. For example, IgM and IgG antibody titers in the Santa Clara county study [
2] were measured concurrently, and so were IgA and IgG antibody titers in the Gangelt study. Suppose two tests with respective sensitivities $\alpha_1, \alpha_2$ and specificities $\beta_1, \beta_2$ were given to the same group of subjects. Whereas the specificity of the composite test for any uninfected individual can be assumed to be $\beta_1\beta_2$, the sensitivity of the composite test varies. Consider, for example, an infected individual who only has the first antibody. If a positive result is defined as detection of at least one antibody, then, under the independence assumption, the sensitivity of the composite test for such an individual would be $1-(1-\alpha_1)\beta_2$. Similarly, for an infected individual who has only the second antibody, it is $1-\beta_1(1-\alpha_2)$. However, for those subjects who have both antibodies, the test’s sensitivity is $1-(1-\alpha_1)(1-\alpha_2)$ (for evidence that the fraction of such subjects is not negligible, see, e.g., [13]). Thus, the sensitivities of the composite test for these three categories of individuals are, generally speaking, all distinct.
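The composite-test argument can be checked numerically with a minimal sketch. It assumes two independent tests and a positive call whenever at least one test is positive; the function names and the numerical values of the sensitivities and specificities are hypothetical and chosen only for illustration.

```python
def composite_sensitivity(alpha1, alpha2, beta1, beta2, has_first, has_second):
    """Probability that at least one of two independent tests is positive
    for an infected individual carrying the indicated antibodies."""
    # A test detects a present antibody with its sensitivity and flags an
    # absent antibody with its false positive rate (1 - specificity).
    p_pos1 = alpha1 if has_first else 1.0 - beta1
    p_pos2 = alpha2 if has_second else 1.0 - beta2
    return 1.0 - (1.0 - p_pos1) * (1.0 - p_pos2)


def composite_specificity(beta1, beta2):
    """Probability that both independent tests are negative for an uninfected individual."""
    return beta1 * beta2


# Hypothetical test characteristics (not taken from any cited study).
a1, a2, b1, b2 = 0.80, 0.90, 0.99, 0.98
only_first = composite_sensitivity(a1, a2, b1, b2, True, False)
only_second = composite_sensitivity(a1, a2, b1, b2, False, True)
both = composite_sensitivity(a1, a2, b1, b2, True, True)
spec = composite_specificity(b1, b2)
print(only_first, only_second, both, spec)
```

With these illustrative inputs the three sensitivities differ, which is the non-uniformity that conflicts with Assumption B.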
Finally, test results of the Vo’ study [
1] have not been adjusted for the sensitivity and specificity of the RT-PCR test, which amounts to assuming that the sensitivity and specificity of the test are both equal to 1.
Assumption C, the Matching Principle, suggests a particular way in which the sample of tested individuals is representative of the target population. One sampling method that satisfies the Matching Principle is Simple Random Sampling (SRS), see, e.g., [
14], whose defining property is that all samples of a given size are equally likely to be selected. To see that SRS satisfies Assumption C, consider drawing a sample of a given size
$N$ from a population of $K$ individuals. For a fixed subpopulation $S$, homogeneous or otherwise, consisting of $K_S$ individuals, denote by $N_S$ the random number of individuals from a sample that belong to $S$. It follows from the definition of SRS that random variable $N_S$ has hypergeometric distribution
$$P(N_S = k) = \binom{K_S}{k}\binom{K-K_S}{N-k}\Big/\binom{K}{N}.$$
Therefore, for its expected value, $E(N_S)$, we have $E(N_S) = N K_S/K$, so that $E(N_S)/N = w$, where $w = K_S/K$ is the fractional size, or weight, of the subpopulation $S$. Thus, under SRS, every subpopulation is represented in the sample, on average, in accordance with its weight.
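The claim that SRS represents a subpopulation in proportion to its weight can be verified exactly for small numbers. The sketch below computes the mean of the hypergeometric distribution by direct summation; the names `K`, `K_S`, and `N` for the population size, subpopulation size, and sample size, as well as their values, are assumptions made for this illustration.

```python
from math import comb


def hypergeom_pmf(k, K, K_S, N):
    """P(exactly k of the N sampled individuals belong to a subpopulation of size K_S)."""
    return comb(K_S, k) * comb(K - K_S, N - k) / comb(K, N)


def expected_count(K, K_S, N):
    """Mean of the hypergeometric distribution, summed over its support."""
    lo = max(0, N - (K - K_S))
    hi = min(N, K_S)
    return sum(k * hypergeom_pmf(k, K, K_S, N) for k in range(lo, hi + 1))


# Hypothetical sizes: population of 1000, subpopulation of 250, sample of 40.
K, K_S, N = 1000, 250, 40
mean_fraction = expected_count(K, K_S, N) / N
weight = K_S / K
print(mean_fraction, weight)
```

The two printed numbers should agree up to floating-point rounding, matching the identity between the expected sample fraction and the subpopulation weight.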
SRS can be combined with stratification of the population into several subpopulations determined by observable individual characteristics associated with the likelihood of the disease or infection. For example, the total sample size $N$ can first be partitioned into $r$ subsample sizes, $N_1, \dots, N_r$, proportional to the demographic weights of the identified subpopulations, and then a random subsample of size $N_i$ can be generated from the $i$-th subpopulation by means of SRS for each $i = 1, \dots, r$.
One source of potential violation of Assumption C is oversampling from the same household, discussed above, allowed by the design of the studies [
2,
3]. The validity of Assumption C may also prove uncertain if recruitment for testing involves a significant opportunity for self-selection, which makes it likely that people who surmised that they have, or have had, the disease volunteered for the study. Such selection bias was manifestly present in the SARS-CoV-2 population prevalence study [
5] conducted in Iceland between 13 March and 1 April 2020 where about half of the 10,797 tested participants who volunteered for the study had mild respiratory symptoms. A possibility for self-selection also existed in the Santa Clara county study whose recruited volunteers responded to an advertisement posted on Facebook [
2]. The same problem was potentially present in the Los Angeles study where only 865 among 1952 randomly selected adults (with some restrictions aimed at matching the county demographics) were actually tested [
4]. Finally, the initial recruitment effort of the Gangelt study consisted of generating a random sample of 600 community members with distinct last names and inviting them to participate in the study. However, the 407 study participants who responded to the invitation were allowed to bring in other household members for testing. As a result, 1007 individuals from 405 households were tested [
3].
The overall logic and flow of exposition in the rest of the article are as follows. All mathematical results, derived under Assumptions A and B for a homogeneous population, where the probability of having a current or past infection can be assumed the same for all individuals, are formulated in
Section 3 and
Section 4. Next,
Section 5 introduces, based on Assumption C, the
Equivalence Principle that enables a natural extension of all the results obtained in
Section 3 and
Section 4 to a heterogeneous population consisting of any number of homogeneous subpopulations.
Section 6 is dedicated to construction of a prevalence estimator and studying its properties. Finally, in
Section 7, I summarize the findings and formulate recommendations for the design of prevalence studies informed by the analysis of this article.
3. Distribution of the Number of True and False Positive Test Results
Consider a test with a binary outcome (positive/negative) administered to $N$ individuals selected from a homogeneous population with infection prevalence $p$. Let $\alpha$ be the sensitivity and $\beta$ be the specificity of the test. Suppose the test resulted in $n$ positive outcomes. Denote by $X$ and $Y$ the respective unobservable numbers of true positive and false positive test results, and let $X + Y$ be the observable total number of positive outcomes. Denote by $M$ the unknown true number of presently or previously infected individuals (depending on the nature of the test) among the $N$ tested individuals. Below, we seek to compute, under Assumptions A and B, the distribution of random variables $X$, $Y$, and $X + Y$ and the conditional distributions of $X$ and $Y$ given $X + Y = n$. The conditional expectation and variance of random variable $M$ given $X + Y = n$ will be computed in closed form in the next section.
Assumption A implies that the distribution of random variable $M$ is binomial:
$$M \sim B(N, p). \tag{1}$$
For this and other basic concepts and results from probability, the reader is referred to [15].
If $M = m$ is fixed, then the testing of each infected individual produces a positive test result, independently of other individuals (Assumption A), with the same probability $\alpha$, the sensitivity of the test (Assumption B). Then, for the number, $X$, of true positive test results, we have
$$X \mid M = m \;\sim\; B(m, \alpha). \tag{2}$$
Similarly, it follows from Assumption B that every uninfected individual receives a false positive test result with the same probability $1-\beta$, where $\beta$ is the specificity of the test. Thus, the distribution of the number, $Y$, of false positives is given by
$$Y \mid M = m \;\sim\; B(N-m, 1-\beta). \tag{3}$$
Importantly, it follows from Assumption A that, for every $m$, $0 \le m \le N$, random variables $X$ and $Y$ are conditionally independent given $M = m$.
Due to Assumptions A and B, random variable $X$ is a thinning of the binomial random variable $M$ with probability $\alpha$, the sensitivity of the test. In general, thinning of a sequence of random events is their independent marking, or filtration, with the same probability. Accordingly, the random variable that counts the number of marked events is called a thinning of the random variable counting the occurrence of the original events; for more on thinning, see [16]. By compounding distributions (1) and (2), we find that random variable $X$ has binomial distribution $B(N, \alpha p)$. In fact, for $0 \le k \le N$, we have using the formula of total probability, setting $j = m - k$, and finally employing Newton’s binomial formula:
$$P(X=k)=\sum_{m=k}^{N}\binom{N}{m}p^m(1-p)^{N-m}\binom{m}{k}\alpha^k(1-\alpha)^{m-k}
=\binom{N}{k}(\alpha p)^k\sum_{j=0}^{N-k}\binom{N-k}{j}[(1-\alpha)p]^j(1-p)^{N-k-j}
=\binom{N}{k}(\alpha p)^k(1-\alpha p)^{N-k}.$$
Likewise, $Y$ represents a thinning of the binomial random variable $N - M$ with probability $1-\beta$, where $\beta$ is the specificity of the test, and consequently has distribution $B(N, (1-\beta)(1-p))$. Similarly, the distribution of the number of false negative test results is $B(N, (1-\alpha)p)$.
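The thinning property can be confirmed numerically. The following sketch compounds a binomial number of infected individuals with a binomial number of detections by the formula of total probability and compares the result with a single binomial distribution whose success probability is the product of prevalence and sensitivity. The variable names `p` (prevalence) and `alpha` (sensitivity) and all numerical values are illustrative assumptions.

```python
from math import comb


def binom_pmf(k, n, q):
    """Probability of k successes in n independent trials with success probability q."""
    return comb(n, k) * q**k * (1.0 - q) ** (n - k)


def true_positive_pmf(k, N, p, alpha):
    """P(X = k) via total probability over the number m of infected individuals."""
    return sum(binom_pmf(m, N, p) * binom_pmf(k, m, alpha) for m in range(k, N + 1))


# Hypothetical values: 30 tested, prevalence 10%, sensitivity 80%.
N, p, alpha = 30, 0.10, 0.80
max_gap = max(
    abs(true_positive_pmf(k, N, p, alpha) - binom_pmf(k, N, alpha * p))
    for k in range(N + 1)
)
print(max_gap)
```

The largest discrepancy over all values of `k` should be at the level of floating-point rounding.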
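To compute the joint distribution of random variables $X$ and $Y$, notice that, if $X = x$ and $Y = y$, then every admissible value, $m$, of random variable $M$ satisfies the inequalities $x \le m \le N - y$. Using the formula of total probability, invoking Equations (1)–(3), rearranging the factors, making a change of variable $j = m - x$, and finally employing Newton’s binomial formula, we obtain for all $x, y \ge 0$ such that $x + y \le N$
$$P(X=x,\,Y=y)=\sum_{m=x}^{N-y}\binom{N}{m}p^m(1-p)^{N-m}\binom{m}{x}\alpha^x(1-\alpha)^{m-x}\binom{N-m}{y}(1-\beta)^y\beta^{N-m-y}
=\frac{N!}{x!\,y!\,(N-x-y)!}\,(\alpha p)^x\,[(1-\beta)(1-p)]^y\,[(1-\alpha)p+\beta(1-p)]^{N-x-y},$$
where $\alpha$ and $\beta$ are the sensitivity and specificity of the test. Therefore, random vector $(X, Y)$ has trinomial distribution
$$(X,Y)\sim \mathrm{Trinomial}\bigl(N;\ \alpha p,\ (1-\beta)(1-p)\bigr). \tag{4}$$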
Finally, lumping together true and false positive test results and combining their probabilities leads to the conclusion that the distribution of the total number, $X + Y$, of positive test results is binomial:
$$X+Y\sim B(N,\pi), \tag{5}$$
where, with $\alpha$ and $\beta$ denoting the sensitivity and specificity of the test,
$$\pi=\alpha p+(1-\beta)(1-p). \tag{6}$$
According to the formula of total probability, $\pi$ represents the probability of obtaining a positive test result for a randomly selected individual from the given homogeneous population.
Formulas (4) and (5) produce the following distributions of the number of true and false positive test results conditional on the observed total number of positive outcomes: for $0 \le x, y \le n$,
$$P(X=x\mid X+Y=n)=\binom{n}{x}u^x(1-u)^{n-x} \tag{7}$$
and
$$P(Y=y\mid X+Y=n)=\binom{n}{y}(1-u)^y u^{n-y}, \tag{8}$$
where
$$u=\frac{\alpha p}{\alpha p+(1-\beta)(1-p)},\qquad 1-u=\frac{(1-\beta)(1-p)}{\alpha p+(1-\beta)(1-p)}. \tag{9}$$
Thus, the distribution of the number of true and false positives given the observed number, $n$, of positive test results is $B(n,u)$ and $B(n,1-u)$, respectively.
Distributions (7) and (8) have the following three notable features:
- (a) They are independent of the total number, $N$, of tested individuals;
- (b) Parameters $u$ and $1-u$ specified in (9) represent the predictive positive and predictive negative values that can be obtained by applying Bayes theorem to prior probabilities $p$ and $1-p$, see, e.g., [17];
- (c) Distributions (7) and (8) depend on a single parameter, $u$, that combines the basic parameters $p$, $\alpha$, $\beta$.
The extraordinary simplicity of Formulas (5), (7), and (8) should not becloud the fact that their validity depends critically on Assumptions A and B.
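As a small illustration of the conditional split of positive results, the sketch below computes the probability that a positive result is a true positive by Bayes' theorem and the resulting expected numbers of true and false positives among the observed positives. The function names and the values of prevalence `p`, sensitivity `alpha`, and specificity `beta` are hypothetical.

```python
def positive_rate(p, alpha, beta):
    """Probability that a randomly selected individual tests positive."""
    return alpha * p + (1.0 - beta) * (1.0 - p)


def true_positive_share(p, alpha, beta):
    """Bayes posterior probability that a positive result comes from an infected individual."""
    rate = positive_rate(p, alpha, beta)
    if rate == 0.0:
        raise ValueError("positive results are impossible for these parameters")
    return alpha * p / rate


def expected_split(n, p, alpha, beta):
    """Expected numbers of true and false positives among n observed positives."""
    u = true_positive_share(p, alpha, beta)
    return n * u, n * (1.0 - u)


# Hypothetical inputs: 2% prevalence, 80% sensitivity, 99.5% specificity, 50 positives.
u = true_positive_share(0.02, 0.80, 0.995)
tp, fp = expected_split(50, 0.02, 0.80, 0.995)
print(u, tp, fp)
```

Note that the number of tested individuals does not enter the computation; only the number of observed positives does.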
4. Conditional Expected Number of Infected Individuals for a Given Number of Positive Test Results
A natural estimator, $\hat p$, of the prevalence of an infection in a population can be defined as the expected fraction of infected individuals among those tested given the observed number, $n$, of positive test results:
$$\hat p=\frac{E(M\mid X+Y=n)}{N}. \tag{10}$$
The main goal of this section is to compute $\hat p$ in the case of a homogeneous population.
For the distribution of random variable $M$ conditional on $X+Y=n$, we have
$$P(M=m\mid X+Y=n)=\frac{1}{P(X+Y=n)}\sum_{x}P(M=m)\,P(X=x\mid M=m)\,P(Y=n-x\mid M=m), \tag{11}$$
where $P(X+Y=n)$ is given in (5)–(6) and $x$ satisfies the inequalities $0\le x\le m$ and $0\le n-x\le N-m$, or equivalently
$$\max\{0,\,n+m-N\}\le x\le\min\{m,\,n\}. \tag{12}$$
Although (1)–(3) and (5) combined with (11) lead to a formula for the conditional probability $P(M=m\mid X+Y=n)$, this formula does not seem to be reducible to a simple expression. However, the corresponding conditional expectation and variance can be computed in closed form, as I show below.
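Formula (11) suggests that, in order to find the conditional expectation $E(M\mid X+Y=n)$, one has to compute the following quantity:
$$A=\sum_{m=0}^{N}m\sum_{x}P(M=m)\,P(X=x\mid M=m)\,P(Y=n-x\mid M=m),$$
where the bounds for variable $x$ are given in (12). Notice that the range of pairs $(x,j)$, where $j=m-x$, has a simpler representation, $0\le x\le n$, $0\le j\le N-n$, than for pairs $(m,x)$. Therefore, switching the order of summation, changing the variable in the internal sum to $j=m-x$, and using Formulas (1)–(3) yield
$$A=\binom{N}{n}\sum_{x=0}^{n}\binom{n}{x}(\alpha p)^x[(1-\beta)(1-p)]^{n-x}\sum_{j=0}^{N-n}(x+j)\binom{N-n}{j}[(1-\alpha)p]^j[\beta(1-p)]^{N-n-j}, \tag{13}$$
where $\alpha$ and $\beta$ are the sensitivity and specificity of the test. Using (6) we represent the internal sum in (13) as
$$(1-\pi)^{N-n}\sum_{j=0}^{N-n}(x+j)\binom{N-n}{j}v^j(1-v)^{N-n-j}=(1-\pi)^{N-n}\,[x+(N-n)v],$$
where we set $v=(1-\alpha)p/(1-\pi)$, with $\pi=\alpha p+(1-\beta)(1-p)$, and used the formula for the expected value of the binomial distribution $B(N-n,v)$. Now, the above derivation of the formula for $A$ can be continued:
$$A=\binom{N}{n}\pi^n(1-\pi)^{N-n}\sum_{x=0}^{n}[x+(N-n)v]\binom{n}{x}u^x(1-u)^{n-x}=P(X+Y=n)\,[nu+(N-n)v],$$
where we employed (5) along with the formula for the expected value of the binomial distribution $B(n,u)$, where $u=\alpha p/\pi$ is the same as in Formula (9). Thus, in view of (11),
$$E(M\mid X+Y=n)=nu+(N-n)v. \tag{14}$$
A very similar argument leads to the following formula for the conditional second moment of $M$ given $X+Y=n$:
$$E(M^2\mid X+Y=n)=nu(1-u)+(N-n)v(1-v)+[nu+(N-n)v]^2.$$
Therefore, due to (14) and (6),
$$\mathrm{Var}(M\mid X+Y=n)=nu(1-u)+(N-n)v(1-v). \tag{15}$$
Inspection of Formulas (14) and (15) reveals that the conditional expectation and variance of random variable $M$ given $X+Y=n$ depend on the following two combinations of parameters $p$, $\alpha$, $\beta$ alone:
$$u=\frac{\alpha p}{\alpha p+(1-\beta)(1-p)},\qquad v=\frac{(1-\alpha)p}{(1-\alpha)p+\beta(1-p)}.$$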
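The closed-form conditional mean and variance can be cross-checked against brute-force enumeration. The sketch below assumes the closed forms $nu+(N-n)v$ and $nu(1-u)+(N-n)v(1-v)$, where $u$ is the probability that a positive result comes from an infected individual and $v$ is the probability that a negative result comes from an infected individual; these expressions follow from Bayes' theorem under Assumptions A and B, and all function names and numerical values are hypothetical.

```python
from math import comb


def binom_pmf(k, n, q):
    """Probability of k successes in n independent trials with success probability q."""
    return comb(n, k) * q**k * (1.0 - q) ** (n - k)


def closed_form_moments(N, n, p, alpha, beta):
    """Conditional mean and variance of the number infected given n positives."""
    pi = alpha * p + (1.0 - beta) * (1.0 - p)  # probability of a positive result
    u = alpha * p / pi                         # P(infected | positive)
    v = (1.0 - alpha) * p / (1.0 - pi)         # P(infected | negative)
    mean = n * u + (N - n) * v
    var = n * u * (1.0 - u) + (N - n) * v * (1.0 - v)
    return mean, var


def brute_force_moments(N, n, p, alpha, beta):
    """Same moments by summing over infected count m and true positive count x."""
    total = s1 = s2 = 0.0
    for m in range(N + 1):
        for x in range(max(0, n + m - N), min(m, n) + 1):
            w = (binom_pmf(m, N, p) * binom_pmf(x, m, alpha)
                 * binom_pmf(n - x, N - m, 1.0 - beta))
            total += w
            s1 += m * w
            s2 += m * m * w
    mean = s1 / total
    return mean, s2 / total - mean**2


# Hypothetical inputs: 20 tested, 3 positives, prevalence 10%, sensitivity 80%, specificity 95%.
cf = closed_form_moments(20, 3, 0.10, 0.80, 0.95)
bf = brute_force_moments(20, 3, 0.10, 0.80, 0.95)
print(cf, bf)
```

Agreement between the two pairs of numbers supports the closed forms for these particular inputs; it is a sanity check, not a proof.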
5. The Equivalence Principle for Heterogeneous Populations
Section 3 and
Section 4 dealt with a population that was assumed homogeneous in the sense that all its individuals had the same probability, $p$, to have a current or past infection. The aim of this section is to extend the results of Section 3 and Section 4 to a more realistic case of a heterogeneous population consisting of $r$ homogeneous subpopulations. Let $\mathbf{w}=(w_1,\dots,w_r)$, where $w_1+\dots+w_r=1$, be the vector of relative sizes (weights) of these subpopulations and $\mathbf{p}=(p_1,\dots,p_r)$ be the vector of their disease prevalences.
I start with introducing the following convenient notation. For a non-negative integer vector with r components, set and In addition, for two such vectors and , we denote and Finally, means that for
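Let $N_i$ be the number of individuals from the $i$th homogeneous subpopulation among $N$ tested individuals. The Matching Principle (Assumption C) implies that random vector $(N_1,\dots,N_r)$ with $N_1+\dots+N_r=N$ has multinomial distribution
$$P(N_1=k_1,\dots,N_r=k_r)=\frac{N!}{k_1!\cdots k_r!}\,w_1^{k_1}\cdots w_r^{k_r},\qquad k_1+\dots+k_r=N, \tag{16}$$
where $w_1,\dots,w_r$ are the weights of the subpopulations.
Let $M_i$ be the number of infected individuals among the $N_i$ tested individuals, $i=1,\dots,r$. Random vector $(M_1,\dots,M_r)$ represents a component-wise thinning of random vector $(N_1,\dots,N_r)$ with thinning probabilities forming the vector $(p_1,\dots,p_r)$ of subpopulation prevalences. What is the distribution of random vector $(M_1,\dots,M_r)$? A computation below shows that, in contrast to the binomial case, it is not multinomial!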
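It follows from Assumption A and subpopulation homogeneity that, for any $i$, $1\le i\le r$, the conditional distribution of random variable $M_i$ given $N_i=k_i$ is binomial $B(k_i,p_i)$, where $p_i$ is the prevalence in the $i$th subpopulation. In addition, Assumption A implies that, for every vector $(k_1,\dots,k_r)$ such that $k_1+\dots+k_r=N$, components of random vector $(M_1,\dots,M_r)$ are conditionally independent given $(N_1,\dots,N_r)=(k_1,\dots,k_r)$. Therefore, for any vector $(m_1,\dots,m_r)$ with $m=m_1+\dots+m_r\le N$, we have using (16), employing the formula of total probability, making a change of variable $j_i=k_i-m_i$, and, finally, using the multinomial formula,
$$P(M_1=m_1,\dots,M_r=m_r)=\frac{N!}{m_1!\cdots m_r!\,(N-m)!}\prod_{i=1}^{r}(w_ip_i)^{m_i}\Bigl(1-\sum_{i=1}^{r}w_ip_i\Bigr)^{N-m}, \tag{17}$$
where $w_i$ is the weight of the $i$th subpopulation.
Due to Assumption B, all the computations in Section 3 and Section 4 involve only the total number, $M=M_1+\dots+M_r$, of infected individuals among those tested. The distribution of random variable $M$ can now be derived using the multinomial formula and (17):
$$P(M=m)=\sum_{m_1+\dots+m_r=m}P(M_1=m_1,\dots,M_r=m_r)=\binom{N}{m}\Bigl(\sum_{i=1}^{r}w_ip_i\Bigr)^{m}\Bigl(1-\sum_{i=1}^{r}w_ip_i\Bigr)^{N-m}.$$
Thus, the total number of infected individuals among $N$ subjects tested follows the binomial distribution
$$M\sim B\Bigl(N,\ \sum_{i=1}^{r}w_ip_i\Bigr). \tag{18}$$
Comparison between Formulas (18) and (1) leads to the following conclusion that can be termed the Equivalence Principle: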
Under Assumption C, the distribution of the total number of infected individuals among $N$ tested individuals selected from a heterogeneous population consisting of $r$ homogeneous subpopulations with weights $w_1,\dots,w_r$ and infection prevalences $p_1,\dots,p_r$ is the same as for a homogeneous population with infection prevalence
$$p=\sum_{i=1}^{r}w_ip_i. \tag{19}$$
The Equivalence Principle is also true when the $r$ subpopulations comprising the population of interest are heterogeneous. In fact, partitioning them into homogeneous subsubpopulations, applying Formula (19), regrouping and rescaling the terms pertaining to the same subpopulation, and applying the Equivalence Principle again leads to Formula (19) in which $w_1,\dots,w_r$ are the weights (relative sizes) and $p_1,\dots,p_r$ are the prevalences of the $r$ heterogeneous subpopulations.
In summary, all the results in
Section 3 and
Section 4 that were derived for a homogeneous population with infection prevalence
p are also valid for a heterogeneous population if one selects
p in accordance with Formula (19). This equivalence property is, of course, quite natural; however, it depends on Assumptions A–C in very essential ways.
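The Equivalence Principle can be illustrated with a short simulation. The sketch below assumes that each tested individual independently falls into a subpopulation with probability equal to its weight (as implied by the Matching Principle) and is then infected with that subpopulation's prevalence; the simulated infected fraction is compared with the weighted average of the subpopulation prevalences. The function names, weights, prevalences, sample size, and seed are hypothetical.

```python
import random


def pooled_prevalence(weights, prevalences):
    """Weighted average of subpopulation prevalences."""
    if len(weights) != len(prevalences):
        raise ValueError("weights and prevalences must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * q for w, q in zip(weights, prevalences))


def simulate_infected_fraction(N, weights, prevalences, trials, seed=0):
    """Average fraction infected among N tested individuals over repeated samples."""
    rng = random.Random(seed)
    strata = range(len(weights))
    infected = 0
    for _ in range(trials):
        # Draw the subpopulation of each tested individual, then the infection status.
        for i in rng.choices(strata, weights=weights, k=N):
            if rng.random() < prevalences[i]:
                infected += 1
    return infected / (N * trials)


# Hypothetical prevalence structure with three subpopulations.
weights = [0.5, 0.3, 0.2]
prevalences = [0.01, 0.05, 0.20]
p_equiv = pooled_prevalence(weights, prevalences)
p_sim = simulate_infected_fraction(200, weights, prevalences, trials=2000)
print(p_equiv, p_sim)
```

The simulated fraction should be close to the pooled value, with the remaining gap due to Monte Carlo error.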
6. Prevalence Estimation
If $N$ individuals drawn from a population of interest were tested and $n$ positive test results were observed, then a “naïve” estimate of the prevalence of the current or past infection in the population would be $n/N$. The testing process can be viewed as the following mental experiment: for an infected individual, a coin is flipped that lands “heads” with probability $\alpha$ (the sensitivity of the test) and “tails” with probability $1-\alpha$ while, for an uninfected individual, another coin is flipped that lands “heads” with probability $1-\beta$ and “tails” with probability $\beta$ (the specificity of the test). In these terms, $n/N$ is the fraction of “heads” (positive test results) recorded for $N$ independent replications of this random experiment. Clearly, $n/N$ depends on the sensitivity and specificity of the test and the prevalence of the disease. Therefore, one needs to untangle them and construct a consistent estimator of the prevalence alone.
In the rest of this section, it will be assumed that the sensitivity $\alpha$ and specificity $\beta$ of the test satisfy
$$\alpha+\beta>1. \tag{20}$$
This condition is always met in practice; otherwise, either the sensitivity or the specificity of a test would not exceed 0.5, thus making the test equivalent or inferior, for either infected or uninfected individuals, to flipping a fair coin.
Recall that the number, $X+Y$, of positive test results has binomial distribution $B(N,\pi)$, see Section 3, where $\pi=\alpha p+(1-\beta)(1-p)=1-\beta+(\alpha+\beta-1)p$. Notice that, under condition (20),
$$1-\beta\le\pi\le\alpha. \tag{21}$$
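I first define the desired estimate, $\hat p$, of the population prevalence $p$ heuristically. Because $n/N$ is a consistent unbiased estimator of $\pi$, the following “plug-in” equation can be set up for $\hat p$:
$$\alpha\hat p+(1-\beta)(1-\hat p)=\frac{n}{N}, \tag{22}$$
where $\alpha$ and $\beta$ are the sensitivity and specificity of the test. This defines $\hat p$ as the population prevalence that would produce, on average, the same fraction of positive test results when $N$ individuals are tested as the one actually observed. Solving Equation (22) for $\hat p$ yields
$$\hat p=\frac{n/N+\beta-1}{\alpha+\beta-1}. \tag{23}$$
Note that this estimator was employed in the Santa Clara county study [2], see also [6]. Observe that $0\le\hat p\le1$ if and only if $1-\beta\le n/N\le\alpha$, compare with (21). Thus, the complete definition of estimator $\hat p$ is
$$\hat p=\begin{cases}0, & n/N<1-\beta,\\[4pt] \dfrac{n/N+\beta-1}{\alpha+\beta-1}, & 1-\beta\le n/N\le\alpha,\\[4pt] 1, & n/N>\alpha.\end{cases}$$
This formula implies that meaningful prevalence estimation in a population with low prevalence of a disease requires a test with high specificity (namely, with $\beta>1-\hat\pi$, where $\hat\pi=n/N$ is the raw positivity rate for a representative sample). Likewise, estimation of large prevalence requires a test with sufficiently high sensitivity (specifically, with $\alpha>\hat\pi$).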
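A minimal sketch of the prevalence estimator follows. It assumes the estimator corrects the raw positivity rate for the sensitivity `alpha` and specificity `beta` of the test and is truncated to the interval from 0 to 1; the function name and the numerical inputs are hypothetical and do not reproduce data from any of the cited studies.

```python
def prevalence_estimate(n, N, alpha, beta):
    """Prevalence estimate from n positives among N tested, truncated to [0, 1]."""
    if N <= 0 or not 0 <= n <= N:
        raise ValueError("need N > 0 and 0 <= n <= N")
    if alpha + beta <= 1.0:
        raise ValueError("the test must satisfy alpha + beta > 1")
    raw_rate = n / N  # raw positivity rate
    corrected = (raw_rate + beta - 1.0) / (alpha + beta - 1.0)
    return min(1.0, max(0.0, corrected))


# Hypothetical inputs: 60 positives among 3000 tested, sensitivity 80%, specificity 99.5%.
estimate = prevalence_estimate(60, 3000, 0.80, 0.995)
# A raw rate below the false positive rate 1 - beta is truncated to zero.
floor_case = prevalence_estimate(5, 3000, 0.80, 0.995)
print(estimate, floor_case)
```

In the second call the raw positivity rate is smaller than the false positive rate of the test, so the data carry no evidence of nonzero prevalence and the estimate is set to zero.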
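Since, by the strong law of large numbers, $n/N \to q$ almost surely as $N \to \infty$, Equations (23), (21), and (6) imply that $\hat p \to p$ almost surely as $N \to \infty$. Therefore, $\hat p$ is a consistent estimator of $p$. Because estimator $\hat p$ is uniformly bounded, we also have $E(\hat p) \to p$ as $N \to \infty$, which means that estimator $\hat p$ is asymptotically unbiased.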
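Consistency can be illustrated with a small simulation. The sketch below assumes the binomial model for the number of positive results and uses hypothetical values for the true prevalence, sensitivity, and specificity; the function name `simulate_estimate` is likewise an assumption made for illustration.

```python
import random


def simulate_estimate(p, sensitivity, specificity, n_tested, rng):
    """Draw the number of positives for n_tested individuals and return
    the truncated prevalence estimate computed from it."""
    # Probability that a randomly selected individual tests positive.
    q = p * sensitivity + (1.0 - p) * (1.0 - specificity)
    # Binomial(n_tested, q) draw as a sum of independent Bernoulli trials.
    n_pos = sum(rng.random() < q for _ in range(n_tested))
    estimate = (n_pos / n_tested + specificity - 1.0) / (
        sensitivity + specificity - 1.0
    )
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    for n_tested in (100, 10_000, 1_000_000):
        print(n_tested, simulate_estimate(0.03, 0.90, 0.98, n_tested, rng))
```

As the sample size grows, the printed estimates should settle near the true prevalence used in the simulation (0.03 here).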
The heuristic formula (23) can be derived on more “theoretical” grounds. Recall the expected fraction of infected individuals among those tested conditional on the observed number of positive test results, see Equation (10). In view of (14), this quantity is a function of the true prevalence:
$$f(p) = \frac{p}{N}\left[\frac{n\alpha}{q} + \frac{(N-n)(1-\alpha)}{1-q}\right], \qquad q = p\alpha + (1-p)(1-\beta).$$
The principal difficulty with using $f(p)$ as the prevalence estimator is that it depends on the unknown true prevalence parameter $p$ that it seeks to estimate. A natural idea, then, would be to determine the value of $p$ for which $f(p) = p$ and take it as the desired prevalence estimator. Using expression (6) for $q$, one finds after some algebra
$$p = \frac{n/N + \beta - 1}{\alpha + \beta - 1}.$$
Upon comparison with (23), this leads to the conclusion that, under assumption (20), the required fixed point of function $f$ is exactly the above heuristic estimator $\hat p$.
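The fixed-point property can be checked numerically. In the sketch below, the conditional expected fraction of infected individuals is written out via Bayes' rule (probability of infection given a positive result and given a negative result); this explicit form is an assumption of the sketch rather than a quotation of Equation (14), and all function names and numbers are hypothetical.

```python
def expected_infected_fraction(p, n_pos, n_tested, sensitivity, specificity):
    """Expected fraction of infected individuals among those tested,
    given n_pos positive results, for an assumed true prevalence p."""
    q = p * sensitivity + (1.0 - p) * (1.0 - specificity)
    if not 0.0 < q < 1.0:
        raise ValueError("positive-test probability must lie strictly in (0, 1)")
    infected_given_pos = p * sensitivity / q
    infected_given_neg = p * (1.0 - sensitivity) / (1.0 - q)
    n_neg = n_tested - n_pos
    return (n_pos * infected_given_pos + n_neg * infected_given_neg) / n_tested


def plug_in_estimate(n_pos, n_tested, sensitivity, specificity):
    """Untruncated plug-in estimate of the prevalence."""
    return (n_pos / n_tested + specificity - 1.0) / (
        sensitivity + specificity - 1.0
    )


if __name__ == "__main__":
    # Hypothetical data: 50 positives among 1000 tested.
    p_hat = plug_in_estimate(50, 1000, 0.90, 0.98)
    print(p_hat, expected_infected_fraction(p_hat, 50, 1000, 0.90, 0.98))
```

The two printed numbers should agree up to rounding error, whereas evaluating the function at a prevalence other than the plug-in estimate (and other than the trivial values 0 and 1) returns a different number.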
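My next goal is to estimate the variance of $\hat p$. Recall that a function $T: \mathbb{R} \to \mathbb{R}$ is called a contraction if
$$|T(x) - T(y)| \le |x - y|$$
for all $x, y \in \mathbb{R}$. Setting here $y = 0$ shows that for every contraction $T$
$$|T(x)| \le |T(0)| + |x|. \tag{24}$$
An important family of contractions consists of functions $T_{a,b}$ defined for $a < b$ by
$$T_{a,b}(x) = \begin{cases} a, & x < a, \\ x, & a \le x \le b, \\ b, & x > b. \end{cases}$$
If $U$ is a random variable with finite second moment defined on a sample space $S$ with probability measure $P$ and $V = T(U)$, where $T$ is a contraction, then it follows from (24) that the second moment of random variable $V$ is also finite. Moreover,
$$\mathrm{Var}(V) \le \mathrm{Var}(U).$$
In fact,
$$\mathrm{Var}(V) = \frac{1}{2}\int_S\!\int_S \left[T(U(s)) - T(U(t))\right]^2 dP(s)\,dP(t) \le \frac{1}{2}\int_S\!\int_S \left[U(s) - U(t)\right]^2 dP(s)\,dP(t) = \mathrm{Var}(U).$$
In particular, setting
$$U = \frac{n/N + \beta - 1}{\alpha + \beta - 1}, \qquad T = T_{0,1},$$
we conclude from the above full definition of estimator $\hat p$ that
$$\mathrm{Var}(\hat p) \le \mathrm{Var}(U) = \frac{q(1-q)}{N(\alpha + \beta - 1)^2}.$$
Since $q(1-q) \le 1/4$, this leads to the following upper bound for the variance of $\hat p$:
$$\mathrm{Var}(\hat p) \le \frac{1}{4N(\alpha + \beta - 1)^2}. \tag{25}$$
Interestingly, this upper bound depends only on the quantity $N(\alpha + \beta - 1)^2$.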
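The variance bound translates directly into a sample-size rule. The following sketch assumes the bound in the form one over four times the sample size times the squared quantity (sensitivity + specificity − 1); the function names and the target standard deviation in the usage line are hypothetical.

```python
import math


def variance_bound(n_tested, sensitivity, specificity):
    """Upper bound on the variance of the prevalence estimator."""
    return 1.0 / (4.0 * n_tested * (sensitivity + specificity - 1.0) ** 2)


def sample_size_for_sd(max_sd, sensitivity, specificity):
    """Smallest sample size for which the bound guarantees that the
    standard deviation of the estimator does not exceed max_sd."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0 or max_sd <= 0:
        raise ValueError("need sensitivity + specificity > 1 and max_sd > 0")
    return math.ceil(1.0 / (4.0 * max_sd ** 2 * denom ** 2))


if __name__ == "__main__":
    # Hypothetical target: standard deviation of at most one percentage point.
    n_required = sample_size_for_sd(0.01, sensitivity=0.90, specificity=0.98)
    print(n_required, variance_bound(n_required, 0.90, 0.98))
```

Because the bound is a worst case over all prevalence values, the resulting sample size is conservative; the actual variance is smaller when the probability of a positive result is far from one half.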
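The estimator $\hat p$ has a remarkable feature that can be termed the “mixture-invariance” property. Consider a population that consists of $r$ subpopulations. Let $N_i$ be the number of tested individuals from the $i$th subpopulation and $n_i$ be the number of positive outcomes, based on the same test. Denote by $\hat p_i$ the above-designed prevalence estimate for the $i$th subpopulation and set $w_i = N_i/N$, where $N = N_1 + \dots + N_r$. Finally, assume that $1-\beta \le n_i/N_i \le \alpha$ for all $i = 1, \dots, r$. Then, the prevalence estimate, $\hat p$, for the entire population becomes
$$\hat p = \sum_{i=1}^{r} w_i \hat p_i, \tag{26}$$
compare with (19).
In fact, it follows from Formula (23) and $w_1 + \dots + w_r = 1$ that
$$\sum_{i=1}^{r} w_i \hat p_i = \sum_{i=1}^{r} w_i\, \frac{n_i/N_i + \beta - 1}{\alpha + \beta - 1} = \frac{n/N + \beta - 1}{\alpha + \beta - 1}, \qquad \frac{n}{N} = \sum_{i=1}^{r} w_i\, \frac{n_i}{N_i}, \tag{27}$$
where $n = n_1 + \dots + n_r$. Thus, $n/N$ is a weighted average of the ratios $n_i/N_i$, which implies, due to the above-assumed bounds for these ratios, that $1-\beta \le n/N \le \alpha$. We now continue (27) to finally get
$$\sum_{i=1}^{r} w_i \hat p_i = \frac{n/N + \beta - 1}{\alpha + \beta - 1} = \hat p.$$
The mixture-invariance property (26) enables one to combine prevalence estimates for several subpopulations of a given population obtained within the same study, or different studies utilizing the same test, into a prevalence estimate for the entire population.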
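The mixture-invariance property is easy to verify numerically. The sketch below assumes that subpopulation weights are the sample fractions and that every subpopulation's raw positivity rate lies between one minus the specificity and the sensitivity, so that no truncation occurs; the function names and the counts are hypothetical.

```python
def plug_in_estimate(n_pos, n_tested, sensitivity, specificity):
    """Untruncated prevalence estimate for one (sub)population."""
    return (n_pos / n_tested + specificity - 1.0) / (
        sensitivity + specificity - 1.0
    )


def combined_estimate(counts, sensitivity, specificity):
    """Weighted average of subpopulation estimates.

    counts is a list of (n_pos_i, n_tested_i) pairs; the weight of each
    subpopulation is its share of all tested individuals.
    """
    total_tested = sum(n_tested for _, n_tested in counts)
    return sum(
        (n_tested / total_tested)
        * plug_in_estimate(n_pos, n_tested, sensitivity, specificity)
        for n_pos, n_tested in counts
    )


if __name__ == "__main__":
    # Hypothetical subpopulations: (positives, tested).
    counts = [(30, 400), (50, 600)]
    pooled = plug_in_estimate(80, 1000, 0.90, 0.98)
    print(pooled, combined_estimate(counts, 0.90, 0.98))
```

The pooled estimate and the weighted average of subpopulation estimates should coincide up to rounding error. If truncation to 0 or 1 is active in some subpopulation, the identity need not hold, which is why the bounds on the subpopulation positivity rates are assumed.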
7. Discussion and Recommendations
In this article, I computed the distribution of the number of positive outcomes resulting from administration of a test with known sensitivity and specificity to $N$ individuals selected from a given population. I also found the conditional distribution of the unobservable number of true and false positive test results given the observed number, $n$, of positive outcomes. These formulas lead to a closed form expression for the expected value of the unknown true number of infected individuals among those tested conditional on $n$. In Section 3 and Section 4, these results were obtained for a homogeneous population while the Equivalence Principle derived in Section 5 extended them to a heterogeneous population. This theory culminated with a construction in Section 6 of a consistent estimator, $\hat p$, of the prevalence of infected individuals in a population and finding an upper bound for the variance of $\hat p$. Importantly, in Section 2, I formulated three basic assumptions required for the validity of the above results and identified, using the well-known early COVID-19 prevalence studies [2,3,4,5] as examples, several sources of their violation. Because it is uncommon for epidemiological studies including [2,3,4,5] to disclose all the details of their statistical analyses, it is hard to say with certainty if some of the formulas obtained in this work were employed in the published prevalence studies (as mentioned earlier, the Santa Clara county study [2] did use the prevalence estimator (23)). However, the design of these studies does not seem to fully meet the assumptions upon which these formulas vitally depend.
Note that this work never employed any asymptotic arguments, i.e., those applicable to large values of $N$. Therefore, all the results can be used for small sample size $N$ as long as the sample of tested individuals is sufficiently representative of the prevalence structure of the target population.
One possible line of extension of this work would be to incorporate the compliance rate of individuals recruited for a prevalence study and the rate at which a testing system produces invalid results into the prevalence estimator. Another direction of further work would be to develop a probabilistic framework for the propagation of uncertainty in the test’s sensitivity and specificity into the population prevalence estimator. As mentioned in the Introduction, a Bayesian approach to quantifying such propagation was developed in [6].
I close with a list of specific conclusions and recommendations regarding the design of population prevalence studies and relevant statistical methodology.
1. The estimator, $\hat p$, of population prevalence $p$ introduced in Section 6 is consistent for any test whose sensitivity, $\alpha$, and specificity, $\beta$, satisfy the conditions $\alpha > 1/2$ and $\beta > 1/2$. The quality of this estimator, as measured by the upper bound on its variance, depends on the quantity $N(\alpha + \beta - 1)^2$ alone, see inequality (25), which can be used for deciding on the number of individuals to be tested. While the accuracy of estimator $\hat p$ improves with the increase in the sensitivity and specificity of the test, the same improvement can be achieved by increasing the sample size. Thus, contrary to the common belief, high sensitivity and specificity of a test is not of primary importance for population prevalence estimation (although the accuracy of individual test results depends critically on how close the test’s sensitivity and specificity are to 100%).
2. The “naïve” prevalence estimator $n/N$ depends on the true population prevalence and the test’s sensitivity and specificity. It may deviate considerably from the correct “disentangled” population prevalence estimator. For example, for a perfectly specific test ($\beta = 1$), one has $E(n/N) = \alpha p$. Therefore, the use of $n/N$ as prevalence estimator should be discouraged, unless a test with sensitivity and specificity close to 100% is employed.
3. Accurate prevalence estimation for a population with a high prevalence of a disease or infection requires a very sensitive test while, for a population with low prevalence, a very specific test should be used.
4. Prevalence estimates for subpopulations with known weights, obtained on the same testing platform, lead automatically, through the “mixture-invariance” property (see Section 6), to a prevalence estimate for a heterogeneous population composed of these subpopulations, without the need for a de novo study.
5. The validity of an estimate of population prevalence of the current or past infection depends critically on study design. Here are a few recommendations for selecting study participants and for choosing among them those individuals whose valid test results can be used for prevalence estimation:
(a) Excessive inclusion of testing data for more than one household member in the same analysis of the prevalence of COVID-19 should be avoided. The same applies to individuals who are known to have been in close and/or protracted contact without PPE.
(b) Prevalence studies in populations where known COVID-19 super-spreading events have occurred should eschew oversampling from the subpopulation of their participants. Likewise, study participants could be screened for association with known infection clusters.
(c) Compliance with the Matching Principle (Assumption C) can be achieved through simple random sampling from the target population. This approach can be combined with stratification based on individual characteristics relevant to the prevalence of the disease or infection. Selected individuals can be additionally screened based on the above criteria (a) and (b) as well as for membership in high-risk professional groups and for residence in locations with high or low prevalence of registered cases.
(d) Self-selection of study subjects can make the resulting prevalence estimate unreliable. One way to prevent this selection bias is to improve post-selection compliance by providing financial incentives to study participants.
(e) Using composite testing data (e.g., combining molecular RT-PCR or IgA/IgM serological test for current infection with serological IgG test for past infection) within the same prevalence analysis violates the testing uniformity assumption and should be avoided.