Article

Group Testing with a Graph Infection Spread Model

Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA
* Author to whom correspondence should be addressed.
Information 2023, 14(1), 48; https://doi.org/10.3390/info14010048
Submission received: 1 December 2022 / Revised: 26 December 2022 / Accepted: 9 January 2023 / Published: 12 January 2023
(This article belongs to the Special Issue Advanced Technologies in Storage, Computing, and Communication)

Abstract:

The group testing idea is an efficient infection identification approach based on pooling the test samples of a group of individuals, which results in identification with fewer tests than individually testing the population. In our work, we propose a novel infection spread model based on a random connection graph which represents connections between $n$ individuals. Infection spreads via connections between individuals, and this results in a probabilistic cluster formation structure as well as non-i.i.d. (correlated) infection statuses for individuals. We propose a class of two-step sampled group testing algorithms where we exploit the known probabilistic infection spread model. We investigate the metrics associated with two-step sampled group testing algorithms. To demonstrate our results, for analytically tractable exponentially split cluster formation trees, we calculate the required number of tests and the expected number of false classifications in terms of the system parameters, and identify the trade-off between them. For such exponentially split cluster formation trees, for zero-error construction, we prove that the required number of tests is $O(\log_2 n)$. Thus, for such cluster formation trees, our algorithm outperforms any zero-error non-adaptive group test, binary splitting algorithm, and Hwang's generalized binary splitting algorithm. Our results imply that, by exploiting probabilistic information on the connections of individuals, group testing can be used to significantly reduce the number of required tests even when the infection rate is high, contrasting the prevalent belief that group testing is useful only when the infection rate is low.

1. Introduction

The group testing problem, introduced by Dorfman in [1], is the problem of identifying the infection statuses of a set of individuals by performing fewer tests than individually testing everyone. The key idea of group testing is to mix the test samples of several individuals and test the mixed sample. A negative test result implies that everyone within that group is negative, thereby identifying the infection statuses of an entire group with a single test. A positive test result implies that there is at least one positive individual in that group, in which case Dorfman's original algorithm goes into a second phase of testing everyone individually.
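Dorfman's two-stage procedure is simple enough to state in a few lines of code. The following is a minimal sketch (ours, not from [1]), assuming an idealized error-free pooled test that is positive if and only if the pool contains at least one infected sample; `is_infected` is a hypothetical ground-truth status vector:

```python
def dorfman_test(is_infected, pool_size):
    """Dorfman's two-stage group testing with an idealized, error-free test.

    Returns the recovered infection statuses and the number of tests used.
    """
    n = len(is_infected)
    estimates = [0] * n
    num_tests = 0
    for start in range(0, n, pool_size):
        pool = range(start, min(start + pool_size, n))
        num_tests += 1                            # one pooled test per group
        if any(is_infected[i] for i in pool):     # positive pool:
            for i in pool:                        # second phase: test individually
                num_tests += 1
                estimates[i] = is_infected[i]
        # a negative pool clears all of its members with that single test
    return estimates, num_tests

# e.g., 100 individuals with 2 infected, pools of 10: at most 30 tests vs. 100
```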
Since Dorfman's seminal work, various families of algorithms have been studied, such as adaptive algorithms, where one designs the test pools in the $(i+1)$st step by using information from the test results of the first $i$ steps, and non-adaptive algorithms, where every test pool is predetermined and run in parallel. In addition, various forms of infection spread models have been considered as well, such as the independent and identically distributed (i.i.d.) model, where each person is infected independently of others with probability $p$, and the combinatorial model, where $k$ out of $n$ people are infected, uniformly distributed over the sample space of $\binom{n}{k}$ elements. Under these various system models and families of algorithms, the group testing problem has been widely studied. For instance, Ref. [2] gives a detailed study of combinatorial group testing and zero-error group testing, Ref. [3] relates the group testing problem to a channel coding problem, and Refs. [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25] advance the group testing literature in various directions. The advantage of group testing is known to diminish when the disease is not rare [26,27,28].
Early works mainly consider two infection models: the combinatorial model, where, prior to designing the algorithm, the exact number of infections is assumed to be known, and the probabilistic model, where each individual is assumed to be infected with probability $p$ identically and independently. Although there is no general result for arbitrary infection probabilities and arbitrary correlations, Refs. [29,30,31,32,33,34] have considered advanced probabilistic models. Our goal in this paper is to consider a realistic graph-based infection spread model, and to exploit the knowledge of the infection spread model to design efficient group testing algorithms. In this paper, we expand on our prior conference paper [35] to present a comprehensive analysis.
To that end, first, we propose a novel infection spread model, where individuals are connected via a random connection graph, whose connection probabilities are known (for instance, location data obtained from cell phones can be used to estimate connection probabilities). A realization of the random connection graph results in different connected components, i.e., clusters, which partition the set of all individuals. The infection starts with a patient zero who is chosen uniformly at random among the $n$ individuals. Then, any individual who is connected to at least one infected individual is also infected. For this system model, we propose a novel family of algorithms which we coin two-step sampled group testing algorithms. The algorithm consists of a sampling step, where a set of individuals is chosen to be tested, and a zero-error non-adaptive test step, where the selected individuals are tested according to a zero-error non-adaptive group test matrix. In order to select the individuals to test in the first step, one of the possible cluster formations that can be formed in the random connection graph is selected. Then, according to the selected cluster formation, we select exactly one individual from every cluster. After identifying the infection statuses of the selected individuals with zero-error, we assign the same infection statuses to the other individuals in the same cluster as the identified individuals. Note that the actual cluster formation is not known prior to the test design and, because of that, the selected cluster formation can differ from the actual cluster formation. Thus, this process is not necessarily a zero-error group testing procedure.
Our main contributions consist of proposing a novel infection spread model with a random connection graph, proposing a two-step sampled group testing algorithm which is based on novel $\mathcal{F}$-separable zero-error non-adaptive test matrices, characterizing the optimal design of two-step sampled group testing algorithms, and presenting explicit results on analytically tractable exponentially split cluster formation trees. For the considered two-step sampled group testing algorithms, we identify the optimal sampling function selection, calculate the required number of tests and the expected number of false classifications in terms of the system parameters, and identify the trade-off between them. Our $\mathcal{F}$-separable zero-error non-adaptive test matrix construction is based on taking advantage of the known probability distribution of cluster formations. In order to present an analytically tractable case study for our proposed two-step sampled group testing algorithm, we consider exponentially split cluster formation trees as a special case, in which we explicitly calculate the required number of tests and the expected number of false classifications. For zero-error construction, we prove that the required number of tests is less than $4(\log_2 n + 1)/3$ and is of $O(\log_2 n)$ when there are at most $n$ equal-sized clusters in the system, each having $\delta$ individuals. For the sake of fairness, in our comparisons, we take $\delta$ to be 1, ignoring further reductions of the number of tests due to $\delta$. We show that, even when we ignore the gain from the cluster size $\delta$, our non-adaptive algorithm, in the zero-error setting, outperforms any zero-error non-adaptive group test and Hwang's generalized binary splitting algorithm [36], which is known to be the optimal zero-error adaptive group test [28]. Since the number of infections scales as $n\delta/\log_2 n$ in exponentially split cluster formation trees with $n\delta$ individuals, our results show that we can use group testing to reduce the required number of tests significantly in our system model, even when the infection rate is high, by using our two-step sampled group testing algorithm.

2. Related Work

In the classical group testing works, the infection model is mostly based on the combinatorial or i.i.d. probabilistic model [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,28]. In more recent works, researchers have challenged the infection modeling dimension of the group testing problem. These related works include non-identical and/or correlated infection probabilities. Ref. [29] considers a probabilistic model with independent but non-identically distributed infection probabilities. Ref. [30] considers a correlated infection distribution under very specific assumptions. Ref. [31] considers a system where individuals are modeled as a community with positive correlations between them for specific setups, such as individuals at contiguous positions in a line. Ref. [32] considers a model where individuals belong to disjoint communities, and the system parameters are the number of infected families and the probability that a family is infected. The authors show that leveraging the community information improves the testing performance by reducing the number of required tests from the scale of the number of infections to the scale of the number of infected families, for both probabilistic and combinatorial setups. In the subsequent work [33], the authors consider overlapping communities. In [34], the authors focus on a community-structured system model, where the underlying network is drawn from the stochastic block model. Over a fixed community structure, initial infections are introduced i.i.d. to the system; then, infection spread within and between communities is realized, with infections spreading within a community with a higher fixed probability than between communities. The authors propose an adaptive algorithm and compare its performance with the binary splitting algorithm that does not leverage the community information. In [32,33,34], a form of correlation between the infection statuses of individuals is considered, in a structured way, represented by the community structure networks of the individuals. In [37,38], further structured community-network-based systems are considered. In our work, we consider a random graph based infection spread model, which introduces correlations to the system.

3. System Model

We consider a group of $n$ individuals. The random infection vector $U = (U_1, U_2, \ldots, U_n)$ represents the infection statuses of the individuals. Here, $U_i$ is a Bernoulli random variable with parameter $p_i$. If individual $i$ is infected, then $U_i = 1$; otherwise, $U_i = 0$. The random variables $U_i$ need not be independent. A patient zero random variable $Z$ is uniformly distributed over the set of individuals, i.e., $Z = i$ with probability $p_Z(i) = \frac{1}{n}$ for $i = 1, \ldots, n$. Patient zero is the first person to be infected. Thus far, the infection model is identical to the traditional combinatorial model with $k = 1$ infected among $n$ individuals.
Next, we define a random connection graph $C$, which is a random graph where vertices represent the individuals and edges represent the connections between the individuals. Let $p_C$ denote the probability distribution of the random graph $C$ over the support set of all possible edge realizations. For the special class of random connection graphs where the edges are realized independently, we fully characterize the statistics of the random connection graph by the random connection matrix $C$, which is a symmetric $n \times n$ matrix, where the $(i,j)$th entry $C_{ij}$ is the probability that there is an edge between vertices $i$ and $j$ for $i \neq j$, and $C_{ij} = 0$ for $i = j$ by definition.
A random connection graph $C$ is an undirected random graph with vertex set $V_C = [n]$, with each vertex representing a unique individual, and a random edge set $E_C = \{e_{ij}\}$, which represents connections between individuals, satisfying the following: (1) if $e_{ij} \in E_C$, then there is an edge between vertices $i$ and $j$; (2) for an arbitrary edge set $E_C^*$, the probability of $E_C = E_C^*$ is equal to $p_C(E_C^*, V_C)$. In the case when all $\mathbb{1}\{e_{ij} \in E_C\}$ are independent, where $\mathbb{1}_A$ denotes the indicator function of the event $A$, the random connection matrix $C$ fully characterizes the statistics of the edge realizations. There is a path between vertices $i$ and $j$ if there exists a set of vertices $\{i_1, i_2, \ldots, i_k\}$ in $[n]$ such that $\{e_{ii_1}, e_{i_1 i_2}, e_{i_2 i_3}, \ldots, e_{i_k j}\} \subseteq E_C$; i.e., two vertices are connected if there exists a path between them. We summarize the system and algorithm parameters that we use throughout the paper in Table 1.
In our system model, if there is a path in $C$ between two individuals, then their infection statuses are equal. In other words, the infection spreads from patient zero $Z$ to everyone that is connected to patient zero. Thus, $U_k = U_l$ if there exists a path between $k$ and $l$ in $C$. Here, we note that a realization of the random graph $C$ consists of clusters of individuals, where a cluster is a subset of vertices in $C$ such that all elements in a cluster are connected with each other, and none of them is connected to any vertex that is not in the cluster. More rigorously, a subset $S = \{i_1, i_2, \ldots, i_k\}$ of $V_C$ is a cluster if $i_l$ and $i_m$ are connected for all $i_l \neq i_m \in S$, but $i_a$ and $i_b$ are not connected for any $i_a \in S$ and all $i_b \in V_C \setminus S$.
Note that the set of all clusters in a realization of the random graph $C$ is a partition of $[n]$. In a random connection graph structure, the formation of clusters in $C$, along with patient zero $Z$, determines the status of the infection vector. Therefore, instead of focusing on the specific structure of the graph $C$, we focus on the cluster formations in $C$. For a given $p_C$, we can calculate the probabilities of the possible cluster formations in $C$.
To solidify ideas, we give an example in Figure 1. For a random connection graph where the edges are realized independently, we give the probabilities of the existence of edges (zero probabilities are not shown) in Figure 1a, and three different realizations of a random connection graph $C$, where all three realizations result in different cluster formations, in Figure 1b–d. In Figure 1, we consider a random connection graph $C$ that has $n = 21$ vertices, which represent the individuals in our group testing model. Since in this example we assume that the edges are realized independently, every edge between vertices $i$ and $j$ exists with probability $C_{ij}$, independently. As we defined, if there is a path between two vertices (i.e., they are in the same cluster), then their infection statuses are the same. One way of interpreting this is that there is a patient zero $Z$, chosen uniformly at random among the $n$ individuals, and patient zero spreads the infection to everyone in its cluster. Therefore, working on the cluster formation structures, rather than on the random connection graph itself, is equally informative for the sake of designing group tests. For instance, in the realization that we give in Figure 1b, if the edge between vertices 5 and 10 did not exist, that would be a different realization of the random connection graph $C$; however, the cluster formations would still be the same. As all infections are determined by the cluster formations and the realization of patient zero, cluster formations are sufficient statistics. Before we rigorously argue this point, we first focus on constructing a basis for random cluster formations.
The random cluster formation variable $F$ is distributed over $\mathcal{F}$ as $P(F = F_i) = p_F(F_i)$ for all $F_i \in \mathcal{F}$, where $\mathcal{F}$ is a subset of the set of all partitions of the set $\{1, 2, \ldots, n\}$. In our model, we know the set $\mathcal{F}$ (i.e., the set of cluster formations that can occur) and the probability distribution $p_F$, since we know $p_C$. Let us denote $|\mathcal{F}|$ by $f$. For a cluster formation $F_i$, individuals that are in the same cluster have the same infection status. Let $|F_i| = \sigma_i$, i.e., there are $\sigma_i$ subsets in the partition $F_i$ of $\{1, 2, \ldots, n\}$. Without loss of generality, for $i < j$, we have $\sigma_i \leq \sigma_j$, i.e., the cluster formations in $\mathcal{F}$ are ordered in increasing sizes. Let $S_j^i$ be the $j$th subset of the partition $F_i$, where $i \in [f]$ and $j \in [\sigma_i]$. Then, for fixed $i$ and $j$, $U_k = U_l$ for all $k, l \in S_j^i$, for all $i \in [f]$ and $j \in [\sigma_i]$.
To clarify the definitions, we give a simple running example which we will refer to throughout this section. Consider a population with $n = 3$ individuals who are connected according to the random connection matrix $C$, and assume that the edges are realized independently:

$$C = \begin{pmatrix} 0 & 0.3 & 0.5 \\ 0.3 & 0 & 0 \\ 0.5 & 0 & 0 \end{pmatrix} \tag{1}$$
By definition, the main diagonal of the random connection matrix is zero, since we define edges between distinct vertices only. In this example, $\mathcal{F}$ consists of four possible cluster formations, and thus we have $f = |\mathcal{F}| = 4$. The random cluster formation variable $F$ can take these four possible cluster formations with the following probabilities:
$$F = \begin{cases} F_1 = \{\{1,2,3\}\}, & \text{w.p. } 0.15 \\ F_2 = \{\{1,2\},\{3\}\}, & \text{w.p. } 0.15 \\ F_3 = \{\{1,3\},\{2\}\}, & \text{w.p. } 0.35 \\ F_4 = \{\{1\},\{2\},\{3\}\}, & \text{w.p. } 0.35 \end{cases} \tag{2}$$
This example network and the corresponding cluster formations are shown in Figure 2. Here, cluster formation $F_1$ occurs when the edge between vertices 1 and 2 and the edge between vertices 1 and 3 are both realized; $F_2$ occurs when only the edge between vertices 1 and 2 is realized; and $F_3$ occurs when only the edge between vertices 1 and 3 is realized. Finally, $F_4$ occurs when none of the edges in $C$ is realized. In this example, we have $\sigma_1 = |F_1| = 1$, $\sigma_2 = |F_2| = 2$, $\sigma_3 = |F_3| = 2$, and $\sigma_4 = |F_4| = 3$. Note that $\sigma_1 \leq \sigma_2 \leq \sigma_3 \leq \sigma_4$, as assumed without loss of generality above. The subsets that form the partition $F_i$ are denoted by $S_j^i$; for instance, $F_3$ consists of $S_1^3 = \{1,3\}$ and $S_2^3 = \{2\}$.
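Because the running example is small, the distribution of $F$ can be verified directly by enumerating all edge realizations of $C$. The following sketch (ours) assumes independently realized edges, as in this example, and recovers the probabilities in (2):

```python
from itertools import product

def cluster_formation_distribution(C):
    """Enumerate the independent edge realizations of a random connection
    graph and accumulate the probability of each cluster formation."""
    n = len(C)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if C[i][j] > 0]
    dist = {}
    for realization in product([0, 1], repeat=len(edges)):
        prob = 1.0
        parent = list(range(n))                   # union-find over vertices

        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x

        for (i, j), present in zip(edges, realization):
            prob *= C[i][j] if present else 1 - C[i][j]
            if present:
                parent[find(i)] = find(j)         # merge the two clusters
        clusters = {}
        for v in range(n):
            clusters.setdefault(find(v), set()).add(v + 1)   # 1-indexed
        formation = frozenset(frozenset(c) for c in clusters.values())
        dist[formation] = dist.get(formation, 0.0) + prob
    return dist

# Running example: prints the four formations with p_F = 0.15, 0.15, 0.35, 0.35
C = [[0, 0.3, 0.5], [0.3, 0, 0], [0.5, 0, 0]]
for formation, p in cluster_formation_distribution(C).items():
    print(sorted(sorted(c) for c in formation), round(p, 2))
```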
Next, we argue formally that cluster formations are sufficient statistics, i.e., they carry the same amount of information as the realization of the random graph as far as the infection statuses of the individuals are concerned. When $Z$ and $F$ are realized, the infection statuses of the $n$ individuals are also realized, i.e., $H(U|Z,F) = 0$. Then,

$$\begin{aligned}
I(U;F) &= H(U) - H(U|F) && (3) \\
&= H(U) - H(U,Z|F) + H(Z|U,F) && (4) \\
&= H(U) - H(Z|F) - H(U|Z,F) + H(Z|U,F) && (5) \\
&= H(U) - H(Z) + H(Z|U,F) && (6) \\
&\geq H(U) - H(Z|C) - H(U|Z,C) + H(Z|U,C) && (7) \\
&= H(U) - H(U|C) && (8) \\
&= I(U;C) && (9)
\end{aligned}$$
where the inequality in (7) follows from the fact that $F$ is a function of $C$ (not necessarily invertible), so that $H(Z|U,F) \geq H(Z|U,C)$, together with $H(Z) = H(Z|C)$ and $H(U|Z,C) = 0$. In addition, from the Markov chain $U - C - F$, we also have $I(U;F) \leq I(U;C)$, which together with (9) implies $I(U;F) = I(U;C)$. Thus, $F$ is a sufficient statistic for $C$ relative to $U$. Therefore, from this point on, we focus on the random cluster formation variable $F$ in our analysis.
The graph model and the resulting cluster formations we have described so far are general. For tractability, in this paper, we investigate a specific class of $\mathcal{F}$ which satisfies the following condition: for all $i$, $F_i$ can only be obtained by partitioning some elements of $F_{i-1}$. This assumption results in a tree-like structure for cluster formations; thus, we call $\mathcal{F}$ sets that satisfy this condition cluster formation trees. Formally, $\mathcal{F}$ is a cluster formation tree if $F_{i+1} \setminus F_i$ can be obtained by partitioning the elements of $F_i \setminus F_{i+1}$, for all $i \in [f-1]$. Note that $\mathcal{F}$ in (2) is not a cluster formation tree. However, if the probability of the edge between vertices 1 and 3 were 0, then $\mathcal{F}$ would not contain $F_1$ and $F_3$, and $\mathcal{F}$ would be a cluster formation tree in this case. Note that cluster formation trees may arise in real-life clustering scenarios, for instance, if individuals belong to a hierarchical structure. An example is: an individual may belong to a professor's lab, then to a department, then to a building, and then to a campus.
Next, we define the family of algorithms that we consider, which we coin two-step sampled group testing algorithms. In two-step sampled group testing algorithms, the two steps do not involve consecutive testing phases: the proposed algorithm family in our paper consists of non-adaptive constructions and should not be confused with semi-adaptive algorithms with two testing phases, such as the two-stage algorithms in [32]. Two-step sampled group testing algorithms consist of two steps in both the testing phase and the decoding phase. The following definitions are necessary in order to characterize the family of algorithms that we consider in this paper.
In order to design a two-step sampled group testing algorithm, we first pick one of the cluster formations in $\mathcal{F}$ to be the sampling cluster formation $F_m$. The selection of $F_m$ is a design choice; for example, recalling the running example in (1) and (2), one can choose $F_2$ to be the sampling cluster formation.
Next, we define the sampling function, $M$, to be a function of $F_m$. The sampling function selects which individuals are to be tested by selecting exactly one individual from every subset that forms the partition $F_m$. Let the infected set among the sampled individuals be denoted by $K_M$. The output of the sampling function $M$ is the set of individuals that are sampled and going to be tested. In the second step, a zero-error non-adaptive group test is performed on the sampled individuals. This results in the identification of the infection statuses of the selected $\sigma_m = |F_m|$ individuals with zero error probability. For example, recalling the running example in (1) and (2), when the sampling cluster formation is chosen as $F_2$, we may design $M$ as
$$M = \{1, 3\} \tag{10}$$
Note that, for each selection of $F_m$, $M$ selects exactly one individual from each $S_j^m$. As long as it satisfies this property, $M$ can be chosen freely while designing the group testing algorithm.
The test matrix $X$ is a non-adaptive test matrix of size $T \times \sigma_m$, where $T$ is the required number of tests. Let $U^{(M)}$ denote the infection status vector of the sampled individuals. Then, we have the following test result vector $y$:
$$y_i = \bigvee_{j \in [\sigma_m]} X_{ij} U_j^{(M)}, \quad i \in [T] \tag{11}$$
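In code, (11) is a column-wise OR. A small helper (our naming) that computes the result vector from the test matrix and the sampled statuses:

```python
def test_results(X, U_M):
    """Evaluate (11): y_i = 1 iff test i pools at least one infected sample.

    X is a T x sigma_m 0/1 matrix; U_M lists the sampled individuals' statuses.
    """
    return [int(any(X[i][j] and U_M[j] for j in range(len(U_M))))
            for i in range(len(X))]
```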
In classical group testing applications, while constructing zero-error non-adaptive test matrices, the aim is to obtain unique result vectors $y$ for every unique possible infected set; for instance, in the combinatorial setting with $d$ infections, the $d$-separable matrix construction is proposed [39]. In the classical $d$-separable matrix construction, we have
$$\bigvee_{i \in S_1} X^{(i)} \neq \bigvee_{i \in S_2} X^{(i)} \tag{12}$$
for all subsets $S_1$ and $S_2$ of cardinality $d$. As a more general approach, we do not restrict the possible infected sets to subsets of $[n]$ of the same size; instead, we consider the problem of designing test matrices that satisfy (12) for every pair of distinct sets $S_1$ and $S_2$ in a given set of possible infected sets. This approach provides a more general basis for designing zero-error non-adaptive group testing algorithms for various scenarios where the set of possible infected sets can be restricted by the available side information.
By using the test result vector $y$, in the first decoding step, the infection statuses of the sampled individuals are identified with zero error probability. In the second decoding step, depending on $F_m$ and the infection statuses of the sampled individuals, the statuses of the non-tested individuals are estimated by assigning the same infection status to all of the individuals that share a cluster with an identified individual in the cluster formation $F_m$. In the running example, with $M$ given in (10), one must design a zero-error non-adaptive test matrix $X$ which identifies the infection statuses of individuals 1 and 3.
Let $\hat{U} = (\hat{U}_1, \hat{U}_2, \ldots, \hat{U}_n)$ be the estimated infection status vector. By definition, the infection estimates are the same within each cluster, i.e., for sampling cluster formation $F_m$, $\hat{U}_k = \hat{U}_l$ for all $k, l \in S_j^m$, for all $j \in [\sigma_m]$. Since $M$ samples exactly one individual from every subset that forms the partition $F_m$, there is exactly one identified individual per cluster at the beginning of the second step of the decoding phase and, by the aforementioned rule, all $n$ individuals have estimated infection statuses at the end of the process. For instance, in the running example, for the sampling cluster formation $F_2$, we have $M = \{1, 3\}$ as given in (10), and $X$ identifies $U_1$ and $U_3$ with zero error. Then, $\hat{U}_2 = U_1$, since individuals 1 and 2 are in the same cluster in $F_2$.
Finally, we have two metrics to measure the performance of a group testing algorithm. The first one is the required number of tests $T$, which is the number of rows of $X$ in the two-step sampled group testing algorithm family that we defined. Minimizing the required number of tests is one of the aims of the group testing procedure. The second metric is the expected number of false classifications. Due to the second step of decoding, the overall two-step sampled group testing algorithm is not a zero-error algorithm (except for the choice of $m = f$), and the expected number of false classifications is a metric that measures the error performance of the algorithm. We use $E_f = \mathbb{E}[d_H(U \oplus \hat{U})]$ to denote the expected number of false classifications, where $d_H(\cdot)$ is the Hamming weight of a binary vector.
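The second decoding step and the $E_f$ metric are straightforward to express in code. A sketch under our own (hypothetical) naming conventions, with clusters given as sets of individuals $1, \ldots, n$:

```python
def second_step_estimates(F_m, identified, n):
    """Spread each cluster's identified status to all of the cluster's members.

    F_m: list of clusters (sets of individuals 1..n); identified: the
    zero-error status decoded for the one sampled individual of each cluster.
    """
    U_hat = [0] * (n + 1)                    # index 0 unused; individuals 1..n
    for cluster, status in zip(F_m, identified):
        for individual in cluster:
            U_hat[individual] = status
    return U_hat[1:]

def num_false_classifications(U, U_hat):
    """Hamming weight of U xor U_hat: the number of misclassified individuals."""
    return sum(u != v for u, v in zip(U, U_hat))
```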
Designing a two-step sampled group testing algorithm consists of selecting $F_m$, then designing the function $M$, and then designing the non-adaptive test matrix $X$ for the second step of the testing phase and the first step of the decoding phase, for zero-error identification of the infection statuses of the sampled $\sigma_m$ individuals. We consider cluster formation trees and the uniform patient zero assumption for our infection spread model, and we consider two-step sampled group testing algorithms for the group test design.
In the following section, we present a motivating example to demonstrate our key ideas.

4. Motivating Example

Consider the following example. There are $n = 10$ individuals and a cluster formation tree with $f = 3$ levels. The full characterization of $\mathcal{F}$ is as follows:

$$F = \begin{cases} F_1 = \{\{1,2,3\},\{4,5\},\{6,7,8,9,10\}\}, & \text{w.p. } 0.4 \\ F_2 = \{\{1,2\},\{3\},\{4,5\},\{6,7,8,9,10\}\}, & \text{w.p. } 0.2 \\ F_3 = \{\{1,2\},\{3\},\{4,5\},\{6,7\},\{8,9,10\}\}, & \text{w.p. } 0.4 \end{cases} \tag{13}$$
First, we find the optimal sampling functions, $M$, for all possible selections of $F_m$. First of all, note that $M$ selects exactly one individual from each subset that forms $F_m$, by definition. Therefore, the number of sampled individuals is constant for a fixed choice of $F_m$. Thus, in the optimal sampling function design, the only parameter that we consider is the minimum expected number of false classifications $E_f$. Note that a false classification occurs only when one of the sampled individuals has a different infection status than one of the individuals in its cluster in $F_m$. For instance, assume that $m = 1$ is chosen, and that the sampling function $M$ selects individual 1 from the set $S_1^1 = \{1,2,3\}$. Recall that, after the second step of the two-step group testing algorithm, by using $X$, the infection status of individual 1 is identified with zero error, and its status is used to estimate the statuses of individuals 2 and 3, since they are in the same cluster in $F_m = F_1$. However, with positive probability, individuals 1 and 3 can have distinct infection statuses, in which case a false classification occurs. Note that this scenario occurs only when $F_m$ is at a higher level than the realized $F$ in the cluster formation tree $\mathcal{F}$, where we refer to $F_1$ as the top level of the cluster formation tree and $F_f$ as the bottom level.
While finding the optimal sampling function $M$, one must consider the possible false classifications and minimize $E_f$, the expected number of false classifications. As shown in Figure 3, the cluster $\{4,5\}$ never becomes partitioned, so for all three choices of $F_m$, $M$ can sample either one of the individuals 4 and 5. This selection does not change the expected number of false classifications, since $U_4 = U_5$ in all possible realizations of $F$. For all sampling cluster formation selections, we have the following analysis:
  • If $F_m = F_1$: If $M$ samples individual 1 or 2 from the cluster $S_1^1 = \{1,2,3\}$, a false classification occurs if $F = F_2$ and the cluster $\{1,2\}$ is infected; in that case, individual 3 is falsely classified as infected. A similar false classification occurs when $F = F_3$ and the cluster $\{1,2\}$ is infected. Likewise, in these cases, if individual 3 is infected, then individual 3 is falsely classified as non-infected. Thus, for cluster $\{1,2,3\}$, when either individual 1 or 2 is sampled, the expected number of false classifications is:
    $$(p_F(F_2) + p_F(F_3))(p_Z(1) + p_Z(2) + p_Z(3)) = 0.6 \times 0.3 = 0.18 \tag{14}$$
    Similarly, when individual 3 is sampled from the cluster $\{1,2,3\}$, individuals 1 and 2 are falsely classified when $F = F_2$ or $F = F_3$ and either the cluster $\{1,2\}$ or individual 3 is infected. Thus, in that case, the expected number of false classifications is:
    $$2(p_F(F_2) + p_F(F_3))(p_Z(1) + p_Z(2) + p_Z(3)) = 2 \times 0.6 \times 0.3 = 0.36 \tag{15}$$
    Thus, (14) and (15) imply that, for cluster $S_1^1 = \{1,2,3\}$, the optimal $M$ should select either individual 1 or 2 for testing. As discussed above, for cluster $S_2^1 = \{4,5\}$, the selection of the sampled individual is indifferent and results in 0 expected false classifications. Finally, for cluster $S_3^1 = \{6,7,8,9,10\}$, a similar analysis implies that the optimal $M$ should select one of the individuals in $\{8,9,10\}$ for testing.
  • If $F_m = F_2$: Similar combinatorial arguments follow, and we conclude that the selection of the sampled individuals from the clusters $S_1^2 = \{1,2\}$, $S_2^2 = \{3\}$ and $S_3^2 = \{4,5\}$ is indifferent in terms of the expected number of false classifications. A false classification is only possible in cluster $S_4^2 = \{6,7,8,9,10\}$, when $F = F_3$ and the infected cluster is either $S_4^3 = \{6,7\}$ or $S_5^3 = \{8,9,10\}$. Similar to the case $m = 1$, if the sampled individual is either 6 or 7, then the expected number of false classifications is 0.6, in contrast to 0.4 when the sampled individual is one of 8, 9 and 10. Thus, the optimal $M$ should select one of the individuals 8, 9 and 10 as the sampled individual to minimize the expected number of false classifications.
  • If $F_m = F_3$: It is not possible to make a false classification, since, for all clusters in $F_3$, all individuals that are in the same cluster have the same infection status with probability 1.
Therefore, for this example, the optimal sampling function selects either individual 1 or 2 from the set $S_1^1$; either 4 or 5 from the set $S_2^1$; and either 8, 9 or 10 from the set $S_3^1$ if $F_m = F_1$; the same sampling is optimal, with the addition of individual 3, if $F_m = F_2$. Let us assume that $M$ selects the individual with the smallest index when the selection is indifferent among a set of individuals. Thus, the optimal sampling function $M$ for this example is $\{1,4,8\}$, $\{1,3,4,8\}$ or $\{1,3,4,6,8\}$, depending on the selection of $F_m$ being $F_1$, $F_2$, or $F_3$, respectively.
Now, for these possible sets of sampled individuals, we need to design zero-error non-adaptive test matrices.
  • If $F_m = F_1$ (i.e., $M = \{1,4,8\}$): The set of all possible infected sets is $P(K_M) = \{\{1\},\{4\},\{8\}\}$. By a counting argument, we need at least two tests, since each of the three possible infected sets must result in a unique result vector $y$, and each of these sets has one element. We can achieve this lower bound by using the following test matrix:
[Test matrix $X$, shown as an image in the original: a $2 \times 3$ binary matrix whose columns (for individuals 1, 4, 8) are distinct nonzero result vectors, e.g., $(1,0)^T$, $(0,1)^T$, $(1,1)^T$.]
  • If $F_m = F_2$ (i.e., $M = \{1,3,4,8\}$): In this case, the set of all possible infected sets is $P(K_M) = \{\{1\},\{3\},\{1,3\},\{4\},\{8\}\}$. In the classical zero-error construction for the combinatorial group testing model, one can construct $d$-separable matrices, and the rationale behind the construction is to enable the decoding of the infected set when the infected set can be any $d$-sized subset of $[n]$. However, in our model, the set of all possible infected sets, i.e., $P(K_M)$, is not a set of all fixed-sized subsets of $[n]$; instead, it consists of varying-sized subsets of $[n]$ that are structured, depending on the given $\mathcal{F}$. As illustrated in Figure 3, a given cluster formation tree $\mathcal{F}$ can be represented by a tree structure with nodes representing possible infected sets, i.e., clusters at each level. (Throughout the paper, we use the word "node" only for the possible clusters in the cluster formation tree representations, not for the vertices in the connection graphs that represent the individuals.) Then, the aim of constructing a zero-error test matrix is to have unique test result vectors for each unique possible infected set, i.e., for the unique nodes in the cluster formation tree. In Figure 4, we present the subtree of $\mathcal{F}$ which ends at the level $F_2$, with result vectors assigned to each node. One must assign unique binary vectors to each node, except for the nodes that do not become partitioned while moving from level to level: those nodes represent the same cluster, and thus the same vector is assigned, as seen in Figure 4. Moreover, when nodes merge at an upper level, the binary OR of the vectors assigned to the descendant nodes must be assigned to their ancestor node. By combinatorial arguments, one can find the minimum vector length such that such vectors can be assigned to the nodes.
    In this case, the required number of tests must be at least 3, and, by assigning result vectors as in Figure 4, we can construct the following test matrix $X$:
    [Test matrix $X$, shown as an image in the original: a $3 \times 4$ binary matrix whose columns (for individuals 1, 3, 4, 8) realize the result-vector assignment of Figure 4.]
    Note that, for all elements of $P(K_M)$, the corresponding result vector is unique and satisfies the tree structure criteria, as shown in Figure 4.
  • If $F_m = F_3$ (i.e., $M = \{1,3,4,6,8\}$): In this case, the set of all possible infected sets is $P(K_M) = \{\{1\},\{3\},\{1,3\},\{4\},\{6\},\{8\},\{6,8\}\}$. In Figure 5, we give a tree structure representation with assigned result vectors of length 3 that satisfies the tree structure criteria discussed above, where each unique node is assigned a unique vector, except for the nodes that do not become partitioned while moving from level to level. Note that every unique node in the tree representation corresponds to a unique element of $P(K_M)$. The corresponding test matrix $X$ is the following $3 \times 5$ matrix:
[Test matrix $X$, shown as an image in the original: a $3 \times 5$ binary matrix whose columns (for individuals 1, 3, 4, 6, 8) realize the result-vector assignment of Figure 5.]
A more structured and detailed analysis of the selection of the optimal sampling function and the minimum number of required tests is given in the next section.
We finalize our analysis of this example by calculating the expected number of false classifications, where $E_{f,\alpha}$ denotes the conditional expected number of false classifications given $F = F_\alpha$:
  • If $F_m = F_1$:
    $$E_f = \sum_{\alpha} p_F(F_\alpha) E_{f,\alpha} = p_F(F_2) E_{f,2} + p_F(F_3) E_{f,3} = 0.2(0.3 \times 1) + 0.4(0.3 \times 1 + 0.5 \times 2) = 0.58 \tag{16}$$
  • If $F_m = F_2$:
    $$E_f = p_F(F_3) E_{f,3} = 0.4(0.5 \times 2) = 0.4 \tag{17}$$
  • If $F_m = F_3$, we have $E_f = 0$.
Note that the choice of $F_m$ is a design choice, and one can use time sharing between different choices of $m$, depending on the specifications of the desired group testing algorithm. (Time sharing can be implemented by assigning a probability distribution to $F_m$ over $\mathcal{F}$, instead of picking one cluster formation from $\mathcal{F}$ to be $F_m$ deterministically.) For instance, if the minimum number of tests is desired, then one can pick $m = 1$, which results in two tests, the minimum possible, but with 0.58 expected false classifications, the maximum possible in this example. On the other hand, if the minimum expected number of false classifications is desired, then one can pick $m = 3$, which results in 0 expected false classifications, the minimum possible, but with 3 tests, the maximum possible in this example. In general, there is a trade-off between the number of tests and the number of false classifications, and we can formulate optimization problems for specific system requirements, such as finding a time sharing distribution for $F_m$ that minimizes the number of tests for a desired level of false classifications, or vice versa.
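The three operating points above can be verified by brute force, by averaging over all $(F, Z)$ pairs. A verification sketch (ours), hard-coding the parameters of this example:

```python
F_levels = {
    1: [{1, 2, 3}, {4, 5}, {6, 7, 8, 9, 10}],
    2: [{1, 2}, {3}, {4, 5}, {6, 7, 8, 9, 10}],
    3: [{1, 2}, {3}, {4, 5}, {6, 7}, {8, 9, 10}],
}
p_F = {1: 0.4, 2: 0.2, 3: 0.4}
samplings = {1: [1, 4, 8], 2: [1, 3, 4, 8], 3: [1, 3, 4, 6, 8]}
n = 10

def cluster_of(k, formation):
    return next(c for c in formation if k in c)

for m, M in samplings.items():
    E_f = 0.0
    for alpha, p_alpha in p_F.items():                 # realized formation
        for z in range(1, n + 1):                      # uniform patient zero
            infected = cluster_of(z, F_levels[alpha])  # true infected cluster
            U = [i in infected for i in range(1, n + 1)]
            U_hat = [False] * n
            for i in M:                                # second decoding step
                for j in cluster_of(i, F_levels[m]):
                    U_hat[j - 1] = U[i - 1]
            E_f += p_alpha / n * sum(u != v for u, v in zip(U, U_hat))
    print(f"F_m = F_{m}: E_f = {E_f:.2f}")             # 0.58, 0.40, 0.00
```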
In the following section, we describe the details of our proposed group testing algorithm.

5. Proposed Algorithm and Analysis

In our $\mathcal{F}$-separable matrix construction, we aim to construct binary matrices that have $n$ columns, such that, for each possible infected subset of the selected individuals, there is a corresponding distinct result vector. A binary matrix $X$ is $\mathcal{F}$-separable if

$$\bigvee_{i \in S_1} X^{(i)} \neq \bigvee_{i \in S_2} X^{(i)} \tag{18}$$
is satisfied for all distinct subsets $S_1$ and $S_2$ in the set of all possible infected subsets, where $X^{(i)}$ denotes the $i$th column of $X$. In the $d$-separable matrix construction [39], this condition must hold for all subsets $S_1$ and $S_2$ of cardinality $d$; here, it must hold for all feasible infected subsets as defined by $\mathcal{F}$. From this point of view, our $\mathcal{F}$-separable test matrix construction exploits the known structure of $\mathcal{F}$, and thus it results in an efficient zero-error non-adaptive test design for the second step of our proposed algorithm.
We adopt a combinatorial approach to the design of the non-adaptive test matrix $X$. Note that, for a given $M$, we have $\sigma_m$ individuals to be identified with zero error probability. The key point of our algorithm is the fact that the infected set among those selected individuals can only be one of some specific subsets of those $\sigma_m$ individuals. Without any information about the cluster formation, any one of the $2^{\sigma_m}$ subsets of the selected individuals could be the infected set. However, since we are given $\mathcal{F}$, we know that the infected set among the selected individuals, $K_M$, can be one of the $2^{\sigma_m}$ subsets only if there exists at least one set $S_j^i$ that contains $K_M$, and no element of the difference set $M \setminus K_M$ is an element of every set $S_j^i$ containing $K_M$. This fact, especially in a cluster formation tree structure, significantly reduces the total number of possible infected subsets that need to be considered. Therefore, we can focus on such subsets and design the test matrix $X$ by requiring the logical ORs of the columns that correspond to the possible $K_M$ sets to be distinct, in order to decode the test results with zero error. Let $P(K_M)$ denote the set of possible infected subsets of the selected individuals, i.e., the set of possible sets that $K_M$ can be. Then, the matrix $X$ must satisfy (18) for all distinct $S_1$ and $S_2$ that are elements of $P(K_M)$. Note that the decoding process is a mapping from the result vectors to the infected sets, and thus we require the distinct result vector property to guarantee zero-error decoding.
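Condition (18) over $P(K_M)$ is easy to check mechanically. A small verifier (our sketch), where each possible infected set is given as a tuple of column indices of $X$:

```python
def is_F_separable(X, possible_infected_sets):
    """Return True iff all sets in P(K_M) yield distinct result vectors."""
    T = len(X)
    seen = {}
    for S in possible_infected_sets:
        y = tuple(int(any(X[t][j] for j in S)) for t in range(T))
        if y in seen and seen[y] != S:
            return False                     # (18) violated: two sets collide
        seen[y] = S
    return True
```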
Designing an $X$ matrix that satisfies the aforementioned property is the key idea of our algorithm. Before going into the design of $X$, we first derive the expected number of false classifications in a given two-step sampled group testing algorithm. Recall that false classifications occur during the second step of the decoding phase. In particular, in the second step of the decoding phase, depending on the selection of the sampling cluster formation $F_m$, the infection statuses of the selected individuals $M$ are assigned to the other individuals, such that the infection status estimate is the same within each cluster. For a fixed sampling cluster formation $F_m$ and sampling function $M$, the expected number of false classifications can be calculated as in the following theorem.
 Theorem 1. 
In a two-step sampled group testing algorithm with a given sampling cluster formation $F_m$ and sampling function $M$ over a cluster formation tree structure defined by $\mathcal{F}$ and $p_F$, with uniform patient zero distribution $p_Z$ over $[n]$, the expected number of false classifications given $F = F_\alpha$ is

$$E_{f,\alpha} = \sum_{i \in [\sigma_m]} \left( \frac{|S_\alpha(M_i)|}{n} \cdot |S_i^m \setminus S_\alpha(M_i)| + \sum_{S_j^\alpha \subseteq S_i^m \setminus S_\alpha(M_i)} \frac{|S_j^\alpha|^2}{n} \right) \tag{19}$$
and the expected number of false classifications is

$$E_f = \sum_{\alpha > m} p_F(F_\alpha) E_{f,\alpha} \tag{20}$$

where $S_\alpha(M_i)$ is the subset in the partition $F_\alpha$ which contains the $i$th selected individual.
Next, we present Theorem 2, which characterizes the optimal choice of the sampling function $M$. First, we define the $\beta_i(k)$ functions as follows. For $i \in [f]$ and $k \in [n]$,

$$\beta_i(k) \triangleq \sum_{j > i} p_F(F_j) \left( |S_j(k)| \cdot |S_i(k) \setminus S_j(k)| + \sum_{S_l^j \subseteq S_i(k) \setminus S_j(k)} |S_l^j|^2 \right) \tag{21}$$

where $S_i(k)$ is the subset in the partition $F_i$ that contains $k$.
 Theorem 2. 
For sampling cluster formation $F_m$, the optimal choice of $M$ that minimizes the expected number of false classifications is

$$M_i = \arg\min_{k \in S_i^m} \beta_m(k) \tag{22}$$

where $M_i$ is the $i$th selected individual. Moreover, the number of required tests is constant and independent of the choice of $M$.
We present the proofs of Theorems 1 and 2 in Appendix A.
The optimal $M$ analysis focuses on choosing the sampling function that results in the minimum expected number of false classifications among the set of functions that select exactly one individual from each cluster of a given $F_m$. For some scenarios, it is possible to choose a sampling function that selects multiple individuals from some clusters of a given $F_m$ and thereby achieves (expected false classifications, required number of tests) points that cannot be achieved by the optimal $M$ in (A6). However, in the majority of cases, the sampling functions of interest, i.e., those that choose exactly one individual from each cluster of $F_m$, are globally optimal. First, a sampling function that selects multiple individuals from a cluster that never becomes partitioned further in the levels below $F_m$ is sub-optimal: such a sampling function spends tests to identify multiple individuals who are guaranteed to have the same infection status. For instance, in the zero expected false classifications case, i.e., when the bottom level $F_f$ is chosen as the sampling cluster formation, sampling more than one individual from each cluster is sub-optimal. Second, picking the sampling cluster formation $F_m$ and choosing an $M$ that selects multiple individuals from some clusters that become partitioned further in the levels below $F_m$ is equivalent to choosing a sampling cluster formation below $F_m$ and using an $M$ that selects exactly one individual from each cluster of the new sampling cluster formation. The only exceptions are scenarios where multiple clusters become partitioned between two consecutive cluster formations in a given $\mathcal{F}$: there, one can construct a sampling function that selects multiple individuals from some clusters of a given $F_m$ and that cannot be represented as a sampling function that selects exactly one individual from each cluster of another cluster formation. For the sake of compactness, we focus on the family of sampling functions $M$ that select exactly one individual from each cluster of the chosen $F_m$.
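Theorem 2 translates directly into a selection routine. A sketch (ours) that represents each $F_i$ as a list of clusters (sets), takes $p_F$ as a dictionary of level probabilities, and breaks ties toward the smallest index, as in the motivating example:

```python
def optimal_sampling(F_levels, p_F, m):
    """Pick arg min_k beta_m(k) from each cluster of F_m, as in (22)."""
    def cluster_of(k, formation):
        return next(c for c in formation if k in c)

    def beta(k):                                  # beta_m(k) from (21)
        S_m_k = cluster_of(k, F_levels[m])
        total = 0.0
        for j in (lvl for lvl in F_levels if lvl > m):
            S_j_k = cluster_of(k, F_levels[j])
            rest = S_m_k - S_j_k
            total += p_F[j] * (len(S_j_k) * len(rest)
                               + sum(len(S) ** 2 for S in F_levels[j] if S <= rest))
        return total

    return [min(sorted(cluster), key=beta) for cluster in F_levels[m]]

# With the motivating example's F_levels and p_F: returns [1, 4, 8] for m = 1
```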
Thus far, we have presented a method to select the individuals to be tested in a way that minimizes the expected number of false classifications. Now, we move on to the design of $X$, the zero-error non-adaptive test matrix which identifies the infection statuses of the selected individuals $M$ with a minimum number of tests. Recall that, since $|\mathcal{F}| = f$, there are $f$ possible choices of $F_m$, and each choice results in a different test matrix $X$.
Based on the combinatorial viewpoint stated in (18), we propose a family of non-adaptive group testing algorithms which satisfy the separability condition for all of the subsets in $P(K_M)$, which is determined by $\mathcal{F}$. We call such matrices $\mathcal{F}$-separable matrices, and we call non-adaptive group tests that use $\mathcal{F}$-separable matrices as their test matrices $\mathcal{F}$-separable non-adaptive group tests. In the rest of the section, we present our results on the required number of tests for $\mathcal{F}$-separable non-adaptive group tests.
The key idea of designing an $\mathcal{F}$-separable matrix is determining the set $P(K_M)$ for a given set of selected individuals $M$ and the tree structure of $\mathcal{F}$, so that we can find binary column vectors for each selected individual such that all of the corresponding possible result vectors are distinct. Note that, for a given choice of $F_m$, if we consider the corresponding subtree of $\mathcal{F}$ which starts from the first level $F_1$ and ends at the level $F_m$, the problem of finding an $\mathcal{F}$-separable non-adaptive test matrix is equivalent to finding a set of length-$T$ binary column vectors for the nodes at level $F_m$ that satisfy the following criteria:
  • Every node at a level above $F_m$ must be assigned a binary column vector that is equal to the OR of all vectors assigned to its descendant nodes. This is because each node in the tree corresponds to a possible set of infected individuals among the selected individuals, and each merging of nodes corresponds to the union of the possible infected sets, which results in taking the OR of the assigned vectors of the merged nodes.
  • Each assigned binary vector must be unique for each unique node, i.e., for every node that represents a unique set $S_j^i$. For the nodes that do not split between two levels, the assigned vector remains the same. This is because each unique node corresponds to a unique possible infected subset of the selected individuals (note that, when a node does not split between levels, it still represents the same set of individuals), and these subsets must satisfy (18).
In other words, for a cluster formation tree with result vectors assigned to each node, a sufficient condition for the achievability of $\mathcal{F}$-separable matrices is as follows:
  • Let $u$ be a node with Hamming weight $d_H(u)$. Then, the number of descendant nodes of $u$ with Hamming weight $i$ must be less than $\binom{d_H(u)}{i}$, for all $i$. This must hold for all nodes $u$. Furthermore, the total number of nodes with Hamming weight $i$ must be less than $\binom{T}{i}$, for all $i$. In addition, the Hamming weights of the nodes must strictly decrease while moving from ancestor nodes to descendant nodes.
This condition is indeed sufficient, because it guarantees the existence of a set of unique vectors that can be assigned to the nodes of the subtree of $\mathcal{F}$ while satisfying the merging/OR structure determined by the subtree.
The problem of designing an $\mathcal{F}$-separable non-adaptive group test can thus be reduced to finding the minimum number $T$ for which we can find $\sigma_m$ binary vectors of length $T$, such that all vectors that are assigned to the nodes satisfy the above condition. Here, the assigned vectors are the result vectors $y$ obtained when the corresponding node is the infected node.
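For small trees, this minimum $T$, together with a valid assignment, can be found by exhaustive search over the column choices, which makes the conditions above concrete. A brute-force sketch (ours; exponential complexity, for illustration only):

```python
from itertools import product

def min_tests_brute_force(num_leaves, unique_nodes, max_T=6):
    """Smallest T admitting an F-separable assignment, by exhaustive search.

    unique_nodes: every unique cluster of the subtree, given as a frozenset
    of the bottom-level (leaf) cluster indices 0..num_leaves-1 it contains.
    """
    for T in range(1, max_T + 1):
        nonzero = [v for v in product([0, 1], repeat=T) if any(v)]
        for cols in product(nonzero, repeat=num_leaves):
            assigned = set()
            for node in unique_nodes:
                # a node's vector is the OR of its leaves' column vectors
                y = tuple(int(any(cols[i][t] for i in node)) for t in range(T))
                if y in assigned:                 # uniqueness violated
                    break
                assigned.add(y)
            else:
                return T, cols                    # all node vectors distinct
    return None

# F_m = F_2 subtree of the motivating example: leaves {1,2},{3},{4,5},
# {6,...,10} (indices 0..3) plus the merged node {1,2,3} = leaves {0,1}.
nodes = [frozenset({0}), frozenset({1}), frozenset({0, 1}),
         frozenset({2}), frozenset({3})]
print(min_tests_brute_force(4, nodes))            # finds T = 3
```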
We need the following definitions for Theorem 3. For a given $\mathcal{F}$, we define $\lambda_{S_j^i}$ as the number of unique ancestor nodes of the set $S_j^i$. We also define $\lambda_j$ as the number of unique sets $S_b^a$ in $\mathcal{F}$ at and above the level $F_j$. Note that $\sum_{a \leq j} \sigma_a$ is the total number of sets $S_b^a$ in $\mathcal{F}$ at and above the level $F_j$, and thus we have

$$\sum_{a \leq j} \sigma_a \geq \lambda_j \tag{23}$$
 Theorem 3. 
For a given $\mathcal{F}$ and $F_m$ with $m < f$, the number of required tests for an $\mathcal{F}$-separable non-adaptive group test, i.e., the number of rows of the test matrix $X$, must satisfy

$$T \geq \max\left\{ \max_{j \in [\sigma_m]} \left( \lambda_{S_j^m} + 1 \right), \left\lceil \log_2 (\lambda_m + 1) \right\rceil \right\} \tag{24}$$

with the added 1's removed in (24) for the special case of $m = f$.
We present the proof of Theorem 3 in Appendix A. Note that Theorem 3 is a converse argument, without a statement about the achievability of the given lower bound. In fact, the given lower bound is not always achievable.
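Even though it is not always tight, the bound in (24) can be computed directly from the tree. A sketch (ours) for the $m < f$ case, with clusters represented as frozensets and levels numbered $1, \ldots, f$ from top to bottom:

```python
import math

def theorem3_lower_bound(F_levels, m):
    """Evaluate the right-hand side of (24) for a sampling level m < f."""
    # lambda_m: number of unique clusters at and above level F_m
    unique = {c for lvl in range(1, m + 1) for c in F_levels[lvl]}
    lambda_m = len(unique)

    def num_ancestors(S):                # unique strict ancestors of S
        return len({c for lvl in range(1, m) for c in F_levels[lvl] if S < c})

    max_chain = max(num_ancestors(S) + 1 for S in F_levels[m])
    return max(max_chain, math.ceil(math.log2(lambda_m + 1)))

# Motivating example (clusters as frozensets): the bound evaluates to
# 2 for m = 1 and 3 for m = 2, matching the matrices constructed above.
```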
Complexity: The time complexity of two-step sampled group testing algorithms consists of the complexity of finding the optimal $M$ given $F_m$ and $\mathcal{F}$, the complexity of the construction of the $\mathcal{F}$-separable test matrix given $M$ and $\mathcal{F}$, and the complexity of the decoding of the test results given the test matrix $X$ and the result vector $y$. In the following lemmas, we analyze the complexity of these processes.
 Lemma 1. 
For a given cluster formation tree $\mathcal{F}$ and a sampling cluster formation $F_m$, the complexity of finding the optimal $M$ as in Theorem 2 is

$$O(n(f-m)\zeta_m) \tag{25}$$

where $\zeta_m = \max_{k \in [n]} |\{S_l^f : S_l^f \subseteq S_m(k) \setminus S_f(k)\}|$.
Proof. 
In order to find the optimal $M$, $\beta_m(k)$ needs to be calculated as in (21) for each $k \in [n]$. The complexity of each of these calculations is bounded above by the number of cluster formations below $F_m$, multiplied by the number of level-$f$ clusters that do not include the individual $k$ but lie within the cluster $S_m(k)$, i.e., the clusters $S_l^f$ that satisfy $S_l^f \subseteq S_m(k) \setminus S_f(k)$. Note that this upper bound varies for each $k \in [n]$, and the total complexity is the summation of these sizes, multiplied by $f - m$, i.e., the number of cluster formations below $F_m$, over all $k \in [n]$. As an upper bound, we consider the maximum of these sizes, i.e., $\zeta_m$, concluding the proof. □
In the next lemma, we analyze the complexity of the construction of the F -separable test matrix given M and F .
 Lemma 2. 
For a given cluster formation tree $\mathcal{F}$ and a sampling function $M$, the complexity of assigning the binary result vectors to the nodes in $\mathcal{F}$, and thus of the construction of the $\mathcal{F}$-separable test matrix, is $\Omega(m\sigma_m)$.
Proof. 
When the cluster formation tree $\mathcal{F}$ and the sampling function $M$ are given, in order to assign unique binary result vectors to each node in $\mathcal{F}$ that represents a unique possible infected cluster, we need to consider the subtree of $\mathcal{F}$ that starts at the level $F_1$ and ends at the level $F_m$, as in the example in Figure 4. Then, we need to traverse from each bottom node in the subtree to the top node, to detect every merging of each cluster. This results in finding the numbers $\lambda_{S_j^m}$ for $j \in [\sigma_m]$ and $\lambda_m$, after which unique binary test result vectors can be assigned to each unique node in $\mathcal{F}$. Traversing the subtree of $\mathcal{F}$ from the bottom level $F_m$ to the top level, for each bottom level node, has complexity $\Theta(m\sigma_m)$. This traversal does not immediately result in the explicit construction of the unique binary result vectors to be assigned, but it gives an asymptotic lower bound for the complexity of the construction of $\mathcal{F}$-separable test matrices. □
Note that Lemma 2 gives an asymptotic lower bound for the complexity of the binary result vector assignment to the unique nodes in $\mathcal{F}$, and thus for the construction of the $\mathcal{F}$-separable test matrix $X$. This analysis is a baseline for the proposed model; proposing explicit $\mathcal{F}$-separable test matrix constructions, with exact characterizations of the required number of tests and of the complexity, remains an open problem.
 Lemma 3. 
For a given $\mathcal{F}$-separable test matrix $X$, with the corresponding cluster formation tree $\mathcal{F}$ with binary result vectors assigned to each node, and the result vector $y$, the decoding complexity is $O(1)$.
Proof. 
While constructing the $\mathcal{F}$-separable test matrix, we consider the assignment of unique binary result vectors to the nodes in the given cluster formation tree $\mathcal{F}$. For a given test matrix $X$ and result vector $y$, the decoding problem is a hash table lookup, with complexity $O(1)$. □
Since, during the proposed process of assigning unique binary result vectors to each unique node in $\mathcal{F}$, we specifically assign a test result vector to every unique possible infected set, the decoding process is simply a hash table lookup, resulting in fast decoding with low complexity.
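Concretely, the decoder can be precomputed as a dictionary from result vectors to infected sets, built once from $X$ and $P(K_M)$. A sketch (ours):

```python
def build_decoder(X, possible_infected_sets):
    """Map every achievable result vector to its unique infected set."""
    T = len(X)
    table = {}
    for S in possible_infected_sets:
        y = tuple(int(any(X[t][j] for j in S)) for t in range(T))
        table[y] = S                     # unique, by F-separability of X
    table[(0,) * T] = ()                 # the all-negative outcome
    return table

# decoding a result vector y is then a single lookup: table[tuple(y)]
```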
Key Steps of the Proposed Algorithm: The summary of the key steps of the two-step sampled group testing algorithm is given below:
  • We start with the assumption that the exact connections between the individuals are not known, but the probability distribution of the possible edge realizations is known.
  • The given edge set probability distribution results in a random cluster formation variable, F. Each possible cluster formation is a partition of the set of all individuals.
  • Out of the set of all possible cluster formations (which we denote by $\mathcal{F}$), one cluster formation is selected as the sampling cluster formation, which we call $F_m$.
  • Exactly one individual is selected from each cluster in F m . These individuals are then tested and identified.
  • The selection is carried out according to the sampling function $M$. For the given choice of $F_m$, $M$ selects the individuals from the clusters in a way that minimizes the expected number of false classifications, as given in Theorem 2, and this results in the expected number of false classifications given in Theorem 1.
  • By using the given set of possible cluster formations, $\mathcal{F}$, an $\mathcal{F}$-separable test matrix is constructed to identify the individuals selected by $M$. This test matrix is guaranteed to identify the selected individuals, since the construction is based on assigning a unique test result vector to every possible infected set among the selected individuals.
  • In Theorem 3, we present a converse argument by giving a lower bound for the required number of tests, in terms of the system parameters.
  • After obtaining the test results and identifying the selected individuals with zero-error, for each selected individual, their infection status is assigned to the others in their cluster, in F m . Note that there is exactly one individual selected and identified from every cluster in F m . This step introduces possible false classifications.
  • Selecting $F_m$ from lower levels of the cluster formation tree results in fewer expected false classifications while increasing the number of tests required for identification. This results in a trade-off between the number of tests and the expected number of false classifications. By using a randomized selection of $F_m$, intermediate points can also be achieved for the expected number of false classifications and the required number of tests.
In the next section, we introduce and focus on a family of cluster formation trees which we call exponentially split cluster formation trees. For this analytically tractable family of cluster formation trees, we achieve the lower bound in Theorem 3 order-wise, and we compare our result with the results in the literature.

6. Exponentially Split Cluster Formation Trees

In this section, we consider a family of cluster formation trees and explicitly characterize the selection of the optimal sampling function, the resulting expected number of false classifications, and the required number of tests. We also compare our results with Hwang's generalized binary splitting algorithm [36] and with zero-error non-adaptive group testing algorithms, in order to show the gain of utilizing the cluster formation structure as achieved in this paper.
A cluster formation tree $\mathcal{F}$ is an exponentially split cluster formation tree if it satisfies the following criteria:
  • An exponentially split cluster formation tree that consists of $f$ levels has $2^{i-1}$ nodes at level $F_i$, for each $i \in [f]$, i.e., $\sigma_i = 2^{i-1}$, $i \in [f]$.
  • At level $F_i$, every node has $2^{f-i}\delta$ individuals, where $\delta$ is a constant positive integer, i.e., $|S_j^i| = 2^{f-i}\delta$, $i \in [f]$, $j \in [\sigma_i]$.
  • Every node has exactly two descendant nodes one level below in the cluster formation tree, i.e., every node is partitioned into two equal-sized nodes when moving one level down in the cluster formation tree.
  • The random cluster formation variable $F$ is uniformly distributed over $\mathcal{F}$, i.e., $p_F(F_i) = 1/f$, $i \in [f]$.
We analyze the expected number of false classifications and the required number of tests for exponentially split cluster formation trees, using the general results derived in Section 5. In Figure 6, we give a 4-level exponentially split cluster formation tree example. In that example, there is $2^0 = 1$ node at level $F_1$, and the number of nodes doubles at each level, since each node is split into two nodes when moving one level down in the tree. In addition, the sizes of the nodes at the same level are the same, with the bottom level nodes having size $\delta$.
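Such trees are easy to generate programmatically, which is convenient for numerically checking the expressions derived below. A generation sketch (ours), indexing the individuals $1, \ldots, 2^{f-1}\delta$:

```python
def exponentially_split_tree(f, delta):
    """Levels of an f-level exponentially split cluster formation tree.

    Level i has 2**(i-1) clusters, each containing 2**(f-i)*delta individuals.
    """
    n_total = 2 ** (f - 1) * delta
    individuals = list(range(1, n_total + 1))
    levels = {}
    for i in range(1, f + 1):
        size = 2 ** (f - i) * delta
        levels[i] = [frozenset(individuals[s:s + size])
                     for s in range(0, n_total, size)]
    return levels

# f = 4, delta = 4 reproduces the tree of Figure 6: 1, 2, 4, 8 clusters per
# level, of sizes 32, 16, 8, 4, over 32 individuals in total.
```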
Being a subset of cluster formation trees, exponentially split cluster formation trees correspond to random connection graphs where the edges between individuals are not independently realized in non-trivial cases. For instance, in Figure 7, we present four different possible realizations of the edges of the 4-level exponentially split cluster formation tree system given in Figure 6, where there are δ = 4 individuals in the bottom level clusters. If the edges between individuals were realized independently, then there would be possible cluster formations that do not result in an exponentially split cluster formation tree structure. The edge realizations are correlated in the sense that, if there is at least one edge realized between two neighboring bottom level clusters, then there must be at least one edge realized between the other neighboring bottom level cluster pairs as well. Similarly, if there is at least one bottom level cluster pair that are not immediate neighbors but get merged at some upper level F_k in F, then the other bottom level cluster pairs that get merged at F_k must be connected as well. In Figure 7, in the F_4 realization, the only edges present are the edges that form the bottom level clusters. In the F_3 realization, there is at least one edge realized between each neighboring bottom level cluster pair, resulting in clusters of eight individuals. Similarly, more distant connections are realized in the F_2 and F_1 realizations.

From a practical point of view, the 4-level exponentially split cluster formation tree example in Figure 6 and Figure 7 can be used to model real-life scenarios, such as the infection spread in an apartment complex with multiple buildings. At the bottom level, there are households, which are guaranteed to be connected; at level F_3, the households that are in close contact are connected; at level F_2, there is a building-wise connection; and, at F_1, the whole community is connected. Note that the connections given in Figure 7 are example realizations that fall under the four possible cluster formations; all edge realization scenarios are possible as long as the resulting cluster formation is one of the four given cluster formations. While designing the group testing algorithm, the given information is the probability distribution over the cluster formations. In practice, one can expect a probability distribution where the bottom level cluster formations, i.e., cluster formations towards F_4, have higher probabilities in a community with strict social isolation measures and high immunity rates for a contagious infection, whereas higher probabilities for the upper level cluster formations, i.e., cluster formations towards F_1, can be expected in communities with high contact rates and lower immunity.
Optimal sampling function and expected number of false classifications: Due to the symmetry of the system, for any choice of F_m, every element of S_i^m has the same β_m(·) value, for each i ∈ [σ_m]. Therefore, the selection of a particular individual within a cluster does not change the expected number of false classifications, and we can pick any sampling function that selects exactly one element from each S_i^m. By Theorem 1, the expected number of false classifications, for a given F_m, is
E_f = \sum_{\alpha > m} \frac{1}{f} \sum_{i \in [\sigma_m]} \left( \frac{|S^\alpha(M_i)|}{n} \, |S_i^m \setminus S^\alpha(M_i)| + \sum_{S_j^\alpha \subseteq S_i^m \setminus S^\alpha(M_i)} \frac{|S_j^\alpha|^2}{n} \right)

= \sum_{\alpha > m} \frac{1}{f} \, \frac{\sigma_m}{\sigma_\alpha} \left( \delta \left( 2^{f-m} - 2^{f-\alpha} \right) + \left( 2^{\alpha-m} - 1 \right) \delta \, 2^{f-\alpha} \right)

= \sum_{\alpha > m} \frac{2^{f+1}\delta}{f} \left( 2^{-\alpha} - 2^{m} \, 2^{-2\alpha} \right)

= \frac{2^{f+1}\delta}{f} \left( \sum_{\alpha > m} 2^{-\alpha} - 2^m \sum_{\alpha > m} 2^{-2\alpha} \right)

= \frac{2^{f+1}\delta}{f} \left( \left( 2^{-m} - 2^{-f} \right) - \frac{2^m}{3} \left( 2^{-2m} - 2^{-2f} \right) \right)

= \frac{\delta}{3f} \left( 2^{f-m+2} + 2^{m-f+1} - 6 \right)
This expected number of false classifications takes its maximum value when F_m = F_1:

E_f = \frac{\delta}{3f} \left( 2^{f+1} + 2^{2-f} - 6 \right)    (32)
and it takes its minimum value, E_f = 0, when F_m = F_f. Since the choice of F_m is a design parameter, one can use time sharing between the possible selections of F_m to achieve any desired value of the expected number of false classifications between 0 and the maximum in (32).
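The closed form above can be checked numerically against the geometric sums in the intermediate steps. The snippet below is a sanity check we add for illustration; it verifies, in exact rational arithmetic, that the summed and closed-form expressions agree for every choice of m:

    from fractions import Fraction

    def Ef_sum(f, m, delta):
        # E_f = sum_{alpha > m} (2 delta / f) (2^(f-alpha) - 2^(f+m-2 alpha))
        return sum(Fraction(2 * delta, f)
                   * (Fraction(2) ** (f - a) - Fraction(2) ** (f + m - 2 * a))
                   for a in range(m + 1, f + 1))

    def Ef_closed(f, m, delta):
        # E_f = (delta / (3 f)) (2^(f-m+2) + 2^(m-f+1) - 6)
        return Fraction(delta, 3 * f) * (2 ** (f - m + 2) + Fraction(2) ** (m - f + 1) - 6)

    f, delta = 10, 1
    assert all(Ef_sum(f, m, delta) == Ef_closed(f, m, delta) for m in range(1, f + 1))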
Required number of tests: We first recall that, if we choose the sampling cluster formation level F_m, the required number of tests for the selected individuals at that level, for whom we design an F-separable test matrix, depends on the subtree composed of the first m levels of the cluster formation tree F. Note that the first m levels of an exponentially split cluster formation tree also form an exponentially split cluster formation tree with m levels. In Theorem 4 below, we focus on the sampling cluster formation choice at the bottom level, F_m = F_f, and characterize the exact required number of tests to be between f and (4/3)f. This implies that the required number of tests at level F_f is O(f), and thus the required number of tests at level F_m is O(m).
 Theorem 4. 
For an f-level exponentially split cluster formation tree, at level f, there exists an F-separable test matrix X with no more than (4/3)f rows, i.e., an upper (achievable) bound for the number of required tests is (4/3)(log_2 n + 1) for n individuals. Conversely, this is also order-wise tight, since the number of required tests must be at least f.
We present the proof of Theorem 4 in Appendix A.
Expected number of infections: In an exponentially split cluster formation tree structure with f levels, the expected total number of infections is

\sum_{i=1}^{f} \frac{1}{f} \, 2^{f-i} \delta = \frac{\delta}{f} \left( 2^f - 1 \right)    (33)

since p_F(F_i) = 1/f and, if F = F_i, then there are 2^{f-i}δ infections. Thus, the expected number of infections is O(n / log_2 n).
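As a quick illustrative check of (33), assuming F is uniform over the f levels and level F_i carries 2^{f-i}δ infections:

    f, delta = 10, 1
    expected = sum((1 / f) * 2 ** (f - i) * delta for i in range(1, f + 1))
    assert abs(expected - delta * (2 ** f - 1) / f) < 1e-9  # = 102.3 for f = 10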
Comparison: In order to compare our results for exponentially split cluster formation trees with other results in the literature, for fairness, we focus on the zero-error case of our system model, which occurs when F_m = F_f is chosen. The resulting sampling function selects a total of 2^{f-1} individuals, and the resulting number of required tests is between f and (4/3)f, i.e., O(log_2 n), as proved in Theorem 4. Note that, by performing at most (4/3)f tests on 2^{f-1} individuals, we identify the infection statuses of 2^{f-1}δ individuals with zero false classifications, which implies that the number of tests scales with the number of nodes at the bottom level instead of the total number of individuals in the system. This results in a gain that scales with δ. However, in order to fairly compare our results with the results in the literature, we ignore this gain and compare only the performance of the second step of our algorithm, i.e., the identification of the infection statuses of the selected individuals. To avoid confusion, let δ = 1, i.e., each cluster at the bottom level is a single individual, and thus n = 2^{f-1}.
From (33), the expected number of infections in this system is (2^f - 1)/f = O(n / log_2 n). When the number of infections scales faster than √n, as proved in [26] (see also [28]), non-adaptive tests with a zero-error criterion cannot perform better than individual testing. Since our algorithm requires O(f) = O(log_2 n) tests, it outperforms all non-adaptive algorithms in the literature. Furthermore, we compare our results with Hwang's generalized binary splitting algorithm [36], even though it is an adaptive algorithm and assumes prior knowledge of the exact number of infections. Hwang's algorithm achieves zero-error identification of k infections among a population of n individuals with k log_2(n/k) + O(k) tests and attains the capacity of adaptive group testing [28,36,40]. Since the number of infections takes one of the f values in the set {1, 2, 2^2, ..., 2^{f-1}} uniformly at random, the resulting mean number of required tests when Hwang's generalized binary splitting algorithm is used is
E[T_{Hwang}] = \sum_{i=0}^{f-1} \frac{1}{f} \, 2^i \log_2 \left( 2^{f-1-i} \right) + O\!\left( \frac{n}{\log_2 n} \right)

= \frac{f-1}{f} \sum_{i=0}^{f-1} 2^i - \frac{1}{f} \sum_{i=0}^{f-1} i \, 2^i + O\!\left( \frac{n}{\log_2 n} \right)

= \frac{2^f - f - 1}{f} + O\!\left( \frac{n}{\log_2 n} \right)

= O\!\left( \frac{n}{\log_2 n} \right)
Thus, the expected number of tests when Hwang's generalized binary splitting algorithm is used scales as O(n / log_2 n), which is much faster than our result of O(log_2 n). We note that Hwang's generalized binary splitting algorithm assumes prior knowledge of the exact number of infections and is an adaptive algorithm, and, furthermore, we have ignored the gain of our algorithm in the first step (i.e., δ = 1). Despite these advantages given to it, our algorithm still outperforms Hwang's generalized binary splitting algorithm for exponentially split cluster formation trees.
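The mean above can also be verified numerically; the snippet below (our illustration, not the paper's code) evaluates the dominant term of E[T_Hwang] and confirms the closed form (2^f - f - 1)/f:

    f = 10
    # sum_{i=0}^{f-1} (1/f) * 2^i * log2(2^(f-1-i)), ignoring the O(k) term
    dominant = sum((1 / f) * 2 ** i * (f - 1 - i) for i in range(f))
    assert abs(dominant - (2 ** f - f - 1) / f) < 1e-9
    print(dominant)  # ~101.3 for f = 10; scales as O(n / log2 n) with n = 2^(f-1)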

7. Numerical Results

In this section, we present numerical results for the proposed two-step sampled group testing algorithm and compare them with the existing results in the literature. In the first simulation environment, we focus on exponentially split cluster formation trees, as presented in Section 6, and verify our analytical results. In the second simulation environment, we consider an arbitrary random connection graph, as discussed in Section 3, which does not satisfy the cluster formation tree assumption, and show that our ideas can be applied to arbitrary random connection graph based networks.

7.1. Exponentially Split Cluster Formation Tree Based System

In the first simulation environment, we have an exponentially split cluster formation tree with f = 10 levels and δ = 1 at the bottom level. For this system of n = 2^{f-1}δ = 512 individuals, for each sampling cluster formation choice F_m (which is a design parameter), from m = 1, i.e., the top level of the cluster formation tree, to m = 10, i.e., the bottom level, we calculate the expected number of false classifications and the minimum required number of tests. Note that the required number of tests is fixed for a fixed sampling cluster formation F_m, while the number of false classifications depends on the realization of the true cluster formation F_α and patient zero Z. This is because, when a sampling cluster formation is selected, the test matrix of choice is guaranteed to identify the sampled individuals with zero-error, independent of the realized infections. In Figure 8a, we plot the expected number of false classifications, which matches the analytical expressions found in Section 6. To plot Figure 8a, we run our simulation and realize the infections 1000 times to numerically obtain the average number of false classifications in the system.

While calculating the minimum number of required tests, for each choice of F_m, our program finds the minimum T that satisfies the sufficient criteria presented in Section 5 and in the proof of Theorem 4, by searching over possible assignments of binary result vectors to the nodes of the given exponentially split cluster formation tree, starting from vector length 1 and increasing the vector length by 1 whenever no such assignment is found. When a binary vector assignment to the nodes is found, the resulting test matrix is constructed and used to run the simulation 1000 times to obtain the numerical average of the number of false classifications. We plot the minimum required number of tests in Figure 8b. Note that, unlike the number of false classifications, the number of required tests is fixed for a fixed F_m, and thus we do not repeat the simulations while calculating it. The resulting non-adaptive test matrix X is fixed for a fixed F_m and identifies the infection statuses of the individuals selected by M with zero-error.
Next, for this network setting, we compare our zero-error construction with a variation of Hwang's generalized binary splitting algorithm [36,40], presented in [41], which further reduces the number of required tests by reducing the O(k) term in the capacity expression of Hwang's algorithm. As stated in the comparison part of Section 6, the required number of tests in our algorithm scales as O(log_2 n); in our numerical results, the required number of tests is 13 at level m = f = 10, as seen in Figure 8b. On the other hand, the average number of required tests for Hwang's algorithm scales as O(n / log_2 n), and is approximately 172 in this case. Furthermore, when we remove the assumption of a known number of infections, we have to use the binary splitting algorithm originally presented in [42], which results in a number of tests no lower than individual testing, i.e., n = 512 tests in this case. For both Hwang's generalized binary splitting algorithm and the original binary splitting algorithm, we run the algorithms 1000 times, realizing the infection statuses of the population at each iteration, to obtain the numerical average of the number of required tests.
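The 13 tests observed here can be reproduced directly from the sufficient condition used in the proof of Theorem 4 (Equation (A11) in the Appendix): the number of tests is f + k for the minimum k such that the first k + 1 binomial sums cover the 2^{f-1} bottom level nodes. A short search (our own sketch, not the paper's search program) gives:

    from math import comb

    def min_tests(f):
        # smallest k with sum_{i=1}^{k+1} C(f+k, i) >= 2^(f-1), per (A11)
        k = 0
        while sum(comb(f + k, i) for i in range(1, k + 2)) < 2 ** (f - 1):
            k += 1
        return f + k

    print(min_tests(10))  # -> 13, matching Figure 8b at m = f = 10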

7.2. Arbitrary Random Connection Graph Based System

In our second simulation environment, we consider an arbitrary random connection graph C with 20 individuals, shown in Figure 9c, where the edges are realized independently with the probabilities shown on them (zero-probability edges are not shown). In this system, since each independent realization of the nine probabilistic edges results in a distinct cluster formation, there are in total 2^9 = 512 cluster formations that can be realized with positive probability. Note that this system with the random connection graph C does not yield a cluster formation tree, yet we still apply our ideas designed for cluster formation trees here. For each one of the 512 possible selections of m, we plot the corresponding expected number of false classifications in Figure 9a and the required number of tests in Figure 9b for our two-step sampled group testing algorithm.
In this simulation, for each possible choice of the sampling cluster formation F_m, we calculate the set of all possible infected sets P(K_M) for all possible choices of M, calculate the resulting expected number of false classifications by also computing p_F, the probability distribution of the random cluster formations, and select the optimal sampling function M. For the required number of tests, we find the minimum number of tests that satisfies the sufficient criteria presented in Section 5 for constructing F-separable matrices for this system. In our simulation environment, this procedure is carried out by brute force, since this system is not a cluster formation tree as in our system model and we cannot use the systematic results that we derived. This simulation demonstrates that the presented ideas can be generalized and applied to arbitrary random connection graph structures.
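The 512 candidate formations can be enumerated mechanically: every subset of the probabilistic edges yields a partition as the connected components of the realized graph. The sketch below illustrates the idea with a small placeholder graph (the edge lists are hypothetical, not the actual edges of Figure 9c):

    from itertools import product

    def components(n, edges):
        """Connected components via union-find; returns a canonical partition."""
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for u, v in edges:
            parent[find(u)] = find(v)
        groups = {}
        for v in range(n):
            groups.setdefault(find(v), []).append(v)
        return tuple(sorted(tuple(g) for g in groups.values()))

    certain = [(0, 1), (2, 3)]                # probability-1 edges (assumed)
    probabilistic = [(1, 2), (3, 4), (0, 4)]  # edges that may or may not realize
    formations = {components(5, certain + [e for e, on in zip(probabilistic, bits) if on])
                  for bits in product([0, 1], repeat=len(probabilistic))}
    print(len(formations))  # at most 2^3 = 8 distinct cluster formations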
Since this system is arbitrary, unlike the exponentially split cluster formation tree structure of the first simulation environment in Section 7.1, the resulting expected number of false classifications is not monotonically decreasing when the choices of F_m are sorted in increasing order of the required number of tests. In Figure 9a, we mark the choices of sampling cluster formations that result in the minimum expected number of false classifications within each required-number-of-tests range. By time sharing between these choices of sampling cluster formations, the dotted red lines between them can be achieved. The six corner points in Figure 9a,b correspond to the following cluster formations:
F_1 = {{1-18}, {19, 20}}    (37)

F_43 = {{1-6}, {7-13}, {14-18}, {19, 20}}    (38)

F_184 = {{1-6}, {7-9}, {10-13}, {14-18}, {19}, {20}}    (39)

F_428 = {{1}, {2}, {3-6}, {7-9}, {10-13}, {14-17}, {18}, {19}, {20}}    (40)

F_510 = {{1, 2}, {3-6}, {7-9}, {10-13}, {14, 15}, {16}, {17}, {18}, {19}, {20}}    (41)

F_512 = {{1}, {2}, {3-6}, {7-9}, {10-13}, {14, 15}, {16}, {17}, {18}, {19}, {20}}    (42)
For instance, F_43 in (38) is composed of four clusters, with S_1^{43} = {1, 2, 3, 4, 5, 6}, S_2^{43} = {7, 8, 9, 10, 11, 12, 13}, S_3^{43} = {14, 15, 16, 17, 18}, and S_4^{43} = {19, 20}. When F_m = F_43 is chosen as the sampling cluster formation, the resulting expected number of false classifications is E_f = 1.505 and the required number of tests is 3, as seen in Figure 9a,b. For sampling cluster formation choices other than the six cluster formations listed above, one of these six cluster formations can be chosen to minimize the expected number of false classifications while keeping the required number of tests constant. For instance, all choices of m between m = 2 and m = 42 result in the same required number of three tests as m = 43, but yield a larger E_f than m = 43 does.
For this system as well, we calculate the average number of required tests for Hwang's generalized binary splitting algorithm by using the results of [36,40,41], as in the first simulation (implementing and running these algorithms 1000 times, realizing the infection statuses of the population at each iteration), and find that the average number of required tests is 16.4 in this case. Similar to the first simulation environment, the binary splitting algorithm originally presented in [42], which does not require the exact number of infections, cannot perform better than individual testing.

8. Conclusions

In this paper, we introduced a novel infection spread model that consists of a random patient zero and a random connection graph, which corresponds to non-identically distributed and correlated (non-i.i.d.) infection statuses for the individuals. We proposed a family of group testing algorithms, which we call two-step sampled group testing algorithms, and characterized their optimal parameters. We determined the optimal sampling function selection, derived the expected number of false classifications, and proposed F-separable non-adaptive group tests, a family of zero-error non-adaptive group testing algorithms that exploit a given random cluster formation structure. For a specific family of random cluster formations, which we call exponentially split cluster formation trees, we calculated the expected number of false classifications and the required number of tests explicitly by using our general results. We showed that our two-step sampled group testing algorithm outperforms all non-adaptive tests that do not exploit the cluster formation structure, as well as Hwang's adaptive generalized binary splitting algorithm, even though our algorithm is non-adaptive and we ignore the gain from the first step of our two-step sampled group testing algorithm. Finally, our work has an important implication: in contrast to the prevalent belief that group testing is useful only when infections are rare, our group testing algorithm shows that a considerable reduction in the number of required tests can be achieved by using prior probabilistic knowledge about the connections between individuals, even in scenarios with a significantly high number of infections.

Author Contributions

Conceptualization, B.A. and S.U.; methodology, B.A. and S.U.; software, B.A. and S.U.; validation, B.A. and S.U.; formal analysis, B.A. and S.U.; investigation, B.A. and S.U.; resources, B.A. and S.U.; data curation, B.A. and S.U.; writing—original draft preparation, B.A. and S.U.; writing—review and editing, B.A. and S.U.; visualization, B.A. and S.U.; supervision, S.U.; project administration, S.U.; funding acquisition, S.U. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

 Theorem A1. 
In a two-step sampled group testing algorithm with a given sampling cluster formation F_m and sampling function M, over a cluster formation tree structure defined by F and p_F, with uniform patient zero distribution p_Z over [n], the expected number of false classifications given F = F_α is

E_{f,\alpha} = \sum_{i \in [\sigma_m]} \left( \frac{|S^\alpha(M_i)|}{n} \, |S_i^m \setminus S^\alpha(M_i)| + \sum_{S_j^\alpha \subseteq S_i^m \setminus S^\alpha(M_i)} \frac{|S_j^\alpha|^2}{n} \right)    (A1)

and the expected number of false classifications is

E_f = \sum_{\alpha > m} p_F(F_\alpha) \, E_{f,\alpha}    (A2)

where S^\alpha(M_i) is the subset of the partition F_α that contains the ith selected individual.
Proof. 
For the sake of simplicity, we denote the subset of the partition F_α that contains the ith selected individual by S^α(M_i). We start our calculation with the conditional expectation, where F = F_α is given. Observe that an error can occur in the second step of the decoding process only if F_m is at a higher level of the cluster formation tree than the realization F = F_α and the true infected cluster K = S_γ^α is merged at level F_m, i.e., α > m and S_γ^α ∉ F_m. Since there is exactly one true infected cluster, which is at level F_α, false classifications can only happen in the set S_θ^m that contains S_γ^α. Now, for the given sampling function M, the θth selected individual is selected from the set S_θ^m and, in the second step of the decoding phase, its infection status is assigned to all members of the set S_θ^m. Therefore, the members of the difference set S_θ^m \ S^α(M_θ) are falsely classified if the set S^α(M_θ) is the true infected set: in that case, all members of S_θ^m are classified as infected while only the subset S^α(M_θ) is infected. On the other hand, when the cluster of the selected individual at level F_α is not infected, i.e., the infected cluster is a subset of S_θ^m \ S^α(M_θ), then only the infected cluster is falsely classified, since all members of S_θ^m are classified as non-infected. Thus, we have the following conditional expected number of false classifications when F = F_α is given, where p_{S_i^j} denotes the probability of the set S_i^j being the infected set:

E_{f,\alpha} = \sum_{i \in [\sigma_m]} \left( p_{S^\alpha(M_i)} \, |S_i^m \setminus S^\alpha(M_i)| + \sum_{S_j^\alpha \subseteq S_i^m \setminus S^\alpha(M_i)} p_{S_j^\alpha} |S_j^\alpha| \right)    (A3)

= \sum_{i \in [\sigma_m]} \left( \frac{|S^\alpha(M_i)|}{n} \, |S_i^m \setminus S^\alpha(M_i)| + \sum_{S_j^\alpha \subseteq S_i^m \setminus S^\alpha(M_i)} \frac{|S_j^\alpha|^2}{n} \right)    (A4)

where (A4) follows from the uniform patient zero assumption. Finally, since false classifications occur only when α > m, we have the following expression for the expected number of false classifications:

E_f = \sum_{\alpha > m} p_F(F_\alpha) \, E_{f,\alpha}    (A5)

concluding the proof. □
 Theorem A2. 
For a sampling cluster formation F_m, the optimal choice of M that minimizes the expected number of false classifications is

M_i = \arg\min_{k \in S_i^m} \beta_m(k)    (A6)

where M_i is the ith selected individual. Moreover, the number of required tests is constant and independent of the choice of M.
Proof. 
We first prove the second part of the theorem, i.e., that the choice of M does not change the required number of tests. In a cluster formation tree structure, when we sample exactly one individual from each subset S_i^m, P(K_M) contains the singleton subsets of the selected individuals, since, when F = F_m, we have exactly one infected individual, which can be any one of the selected individuals with positive probability. Now, consider the cluster formation F_{m-1}. Since this is a cluster formation tree structure, there must be at least one S_i^{m-1} such that S_i^{m-1} = S_j^m ∪ S_k^m with S_j^m ≠ S_k^m, which means that P(K_M) must also contain the set of the individuals selected from S_j^m and S_k^m, because, in the case of F = F_{m-1}, these individuals can be infected simultaneously. Similarly, when moving towards the top node of the cluster formation tree (i.e., F_1), whenever we observe a merging, we must add the corresponding union of subsets of individuals to P(K_M), which is the set of all possible infected sets for the selected individuals M. Thus, the structure of the distinct sets of possibly infected individuals does not depend on the indices of the sampled individuals within each S_i^m, but only on the given F and F_m, completing the proof of the second part of the theorem.
We next prove the first part of the theorem, i.e., we prove that selecting the individual that has the minimum β_m(k) value in each S_i^m results in the minimum expected number of false classifications and is thus the optimal choice. First, recall that, by definition, M depends on F_m, and thus we design the sampling function M for a given F_m. Now, recall the expected number of false classifications stated in (A1) and (A2). A sampling function that minimizes E_f for a given F_m can be obtained as follows. From (A1) and (A2),

\min_M E_f = \min_M \left\{ \sum_{\alpha > m} p_F(F_\alpha) \sum_{i \in [\sigma_m]} \left( \frac{|S^\alpha(M_i)|}{n} \, |S_i^m \setminus S^\alpha(M_i)| + \sum_{S_j^\alpha \subseteq S_i^m \setminus S^\alpha(M_i)} \frac{|S_j^\alpha|^2}{n} \right) \right\}    (A7)

= \frac{1}{n} \sum_{i \in [\sigma_m]} \min_{M_i} \left\{ \sum_{\alpha > m} p_F(F_\alpha) \left( |S^\alpha(M_i)| \, |S_i^m \setminus S^\alpha(M_i)| + \sum_{S_j^\alpha \subseteq S_i^m \setminus S^\alpha(M_i)} |S_j^\alpha|^2 \right) \right\}    (A8)

= \frac{1}{n} \sum_{i \in [\sigma_m]} \sum_{\alpha > m} p_F(F_\alpha) \left( |S^\alpha(k_i^*)| \, |S_i^m \setminus S^\alpha(k_i^*)| + \sum_{S_j^\alpha \subseteq S_i^m \setminus S^\alpha(k_i^*)} |S_j^\alpha|^2 \right)    (A9)

where k_i^* = \arg\min_{k \in S_i^m} \beta_m(k), and (A9) is the minimum value of the expected number of false classifications for the given F_m. The sampling function M defined in (A6) achieves this minimum and is thus optimal, completing the proof of the first part of the theorem. □
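A direct rendering of this rule, assuming β_m is available as a callable on individuals (the definition of β_m(k) is part of the paper's general results in Section 5):

    def optimal_sampling(F_m, beta_m):
        """Pick, from each cluster S_i^m, an individual minimizing beta_m, per (A6)."""
        return [min(cluster, key=beta_m) for cluster in F_m]

    # Example with a hypothetical beta_m:
    print(optimal_sampling([[0, 1, 2], [3, 4]], beta_m=lambda k: k % 3))  # -> [0, 3]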
 Theorem A3. 
For given F and F_m with m < f, the number of required tests for an F-separable non-adaptive group test, i.e., the number of rows of the test matrix X, must satisfy

T \geq \max\left\{ \max_{j \in [\sigma_m]} \left( \lambda_{S_j^m} + 1 \right), \; \left\lceil \log_2 \left( \lambda_m + 1 \right) \right\rceil \right\}    (A10)

and, for the special case of m = f, the same bound holds with the additions of 1 in (A10) removed.
Proof. 
First, each unique node (i.e., a node that represents a unique subset S_i^j) represents a unique possibly infected set K_M, and each such set must be assigned a unique result vector. Therefore, in total, we must have at least λ_m unique vectors. Furthermore, when m < f, it is possible that the infected set among the sampled individuals is the empty set, so we must also reserve the all-zero vector for this case. Therefore, the total number of tests must be at least ⌈log_2(λ_m + 1)⌉ in general, with the exception of the m = f case, where we can assign the zero vector to one of the nodes and may achieve ⌈log_2 λ_m⌉.

Second, observe that, for any node j at an arbitrary level F_i with i < m, the set of positions of 1's of its result vector must contain the sets of positions of 1's of the result vectors of all descendants of node j. Moreover, since all nodes that split must be assigned unique vectors, the Hamming weights of the vectors must strictly decrease as we move from an ancestor node to a descendant at each level. Since the ancestor node at the top level can have Hamming weight at most T, and the nodes at level F_m must be assigned vectors of Hamming weight at least 1, including the node that has the most unique ancestor nodes, T must be at least max_{j∈[σ_m]}(λ_{S_j^m} + 1). Similar to the first case, when m = f, we can assign the zero vector to one of the bottom level nodes, and thus T at least max_{j∈[σ_m]} λ_{S_j^m} suffices. □
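For a concrete evaluation, the general form of the bound (A10) can be computed as below (our own illustration); for the f-level exponentially split tree at m = f, both terms evaluate to f, which is the converse value used in Theorem 4:

    from math import ceil, log2

    def test_lower_bound(lambda_clusters, lambda_m):
        # T >= max( max_j (lambda_{S_j^m} + 1), ceil(log2(lambda_m + 1)) ), per (A10)
        return max(max(l + 1 for l in lambda_clusters), ceil(log2(lambda_m + 1)))

    f = 10
    # At m = f: every bottom node has f - 1 unique ancestors, and lambda_f = 2^f - 1.
    print(test_lower_bound([f - 1] * 2 ** (f - 1), 2 ** f - 1))  # -> 10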
 Theorem A4. 
For an f-level exponentially split cluster formation tree, at level f, there exists an F-separable test matrix X with no more than (4/3)f rows, i.e., an upper (achievable) bound for the number of required tests is (4/3)(log_2 n + 1) for n individuals. Conversely, this is also order-wise tight, since the number of required tests must be at least f.
Proof. 
By using the converse in Theorem 3, we already know that the required number of tests is at least f from (24), since there are λ_f = 2^f - 1 unique nodes and λ_{S_i^f} + 1 = f for every subset S_i^f. This proves the converse part of the theorem.

In order to satisfy the sufficient conditions for the existence of an F-separable matrix, each node in the tree must be represented by a length-T vector of sufficient Hamming weight, so that (i) every descendant can be represented by a unique vector whose positions of 1's form a subset of the positions of 1's of its ancestor nodes, and (ii) the OR of the vectors of all descendants of a node equals the vector of that node. In our proof, we show that, for exponentially split cluster formation trees, it suffices to check that we have a sufficient number of rows in X to uniquely assign vectors to the bottom level nodes, i.e., the subsets S_i^f at level F_f.

First, as stated above, from the converse in Theorem 3, an F-separable test matrix of an exponentially split cluster formation tree with f levels must have at least f rows. However, for exponentially split cluster formation trees, this converse is not achievable: there are 2^{f-1} nodes at level f, but only f + 1 binary vectors of Hamming weight at most 1. Since, for f > 3, f + 1 is less than 2^{f-1}, we cannot assign distinct such vectors to the bottom level nodes. Thus, we need vectors of length longer than f. Now, assume that an achievable F-separable test matrix has f + k rows, where k is a non-negative integer. Our objective in the remainder of the proof is to characterize this k in terms of f.
We argue that, if the number of nodes at the bottom level, which is equal to 2^{f-1}, is at most \sum_{i=1}^{k+1} \binom{f+k}{i}, then we can find an achievable F-separable test matrix, i.e.,

\sum_{i=1}^{k+1} \binom{f+k}{i} \geq 2^{f-1}    (A11)

is a sufficient condition for the existence of an achievable F-separable test matrix for a given (f, k) pair. The minimum k that satisfies (A11) results in the minimum number of required tests, f + k. In our construction, we assign each node at level F_i a unique vector of Hamming weight f + k + 1 - i, except for the bottom level F_f. Since each node is assigned a unique vector, when moving from a level to one level below, descendant nodes must be assigned vectors whose Hamming weights are at least 1 less than that of their ancestor node. At the bottom level, we use the remaining vectors of Hamming weight at most k + 1. We choose the minimum such k for this construction, resulting in the minimum number of tests.
Before proving the achievability of the above construction, we first analyze the minimum k that satisfies (A11) in terms of f. In Lemma A1 below, we state and prove that k = f/3 satisfies (A11), which gives an upper bound for the minimum k and finalizes the first part of the achievability proof. This, in turn, shows that we can use all vectors of Hamming weight 1 through k + 1 at the bottom level to represent all 2^{f-1} nodes at that level.
Next, we show that our construction is achievable for the upper levels, i.e., that we can find sufficiently many vectors of the corresponding Hamming weights. By using Lemma A2 below and the fact that k ≤ f/3, when f ≥ 13, we have

\binom{f+k}{k+2} \geq 2^{f-2}    (A12)

which implies that we can find unique vectors of Hamming weight k + 2 to assign to the nodes at level F_{f-1} (one level up from the bottom level). For the remaining levels below ⌈(f+k)/2⌉, we have \binom{f+k}{i} > \binom{f+k}{i+1}, and the number of nodes decreases by half as we move upwards in the tree. Thus, we can find unique vectors to represent the nodes by increasing the Hamming weights by 1 at each level, which is the minimum increase of Hamming weights while moving upwards in the tree. For the remaining nodes, which are above the level ⌈(f+k)/2⌉, we can use the lower bound for the binomial coefficient

\binom{f+k}{i} \geq \left( \frac{f+k}{i} \right)^i \geq 2^i    (A13)

to show that there are sufficiently many unique vectors of the required weights at those levels as well.
Thus, there are sufficiently many unique vectors of appropriate Hamming weights at every level. Finally, we have to check whether there is a sufficient number of unique vectors for every subtree of descendants of each node. In exponentially split cluster formation trees, due to the symmetry of the tree, every descendant subtree of each node is again an exponentially split cluster formation tree. If k, where the number of rows of X is equal to f + k, is a minimum number satisfying (A11), then every descendant subtree below the top level has parameters (f - i, k), and we show in Lemma A1 below that these pairs also satisfy condition (A11). For f values below the corresponding thresholds in our proof steps (e.g., the f ≥ 13 threshold before (A12) above), manual calculations yield the desired results. This proves the achievability part of the theorem. □
 Lemma A1. 
The minimum k that satisfies

\sum_{i=1}^{k+1} \binom{f+k}{i} \geq 2^{f-1}    (A14)

is upper bounded by f/3.
Proof. 
We prove the statement of the lemma by showing that the pair (f, k) = (f, f/3) satisfies (A14). We first consider the left-hand side of (A14) when f is incremented by 1 for fixed k, and write it as

\sum_{i=1}^{k+1} \binom{f+k+1}{i} = 2 \sum_{i=1}^{k+1} \binom{f+k}{i} + 1 - \binom{f+k}{k+1}    (A15)

which follows by using the identity \binom{a}{b} = \binom{a-1}{b-1} + \binom{a-1}{b}.

Second, we prove the following statement for k ≥ 1:

\sum_{i=1}^{k+1} \binom{4k}{i} \geq 2^{3k-1}    (A16)

Note that, when k = f/3, (A16) is equivalent to (A14) for f values that are divisible by 3. For f values that are not divisible by 3, since the pairs (f - 1, k) and (f - 2, k) satisfy (A14) whenever the pair (f, k) satisfies (A14), by (A15), it suffices to prove the statement in (A16).
We prove (A16) by induction on k. For k = 1, the inequality holds. Assume that the inequality holds for some k ≥ 1; we show that it then also holds for k + 1. In the lines below, we use the identity \binom{a}{b} = \binom{a-1}{b-1} + \binom{a-1}{b} recursively:

\sum_{i=1}^{k+2} \binom{4k+4}{i} = \sum_{i=1}^{k+2} \binom{4k+3}{i} + \sum_{i=1}^{k+2} \binom{4k+3}{i-1}    (A17)

= \sum_{i=1}^{k+2} \binom{4k+2}{i} + \sum_{i=1}^{k+2} \binom{4k+2}{i-1} + 1 + \sum_{i=1}^{k+1} \binom{4k+2}{i} + \sum_{i=1}^{k+1} \binom{4k+2}{i-1}    (A18)

= 9 \sum_{i=1}^{k+1} \binom{4k}{i} - 5 \binom{4k}{k+1} + \binom{4k}{k+2} + 4 \binom{4k}{k-1} + 5 \binom{4k}{k-2} + A    (A19)

= 9 \sum_{i=1}^{k+1} \binom{4k}{i} - \frac{2k+11}{k+2} \binom{4k}{k+1} + 4 \binom{4k}{k-1} + 5 \binom{4k}{k-2} + A    (A20)

= 8 \sum_{i=1}^{k+1} \binom{4k}{i} - \frac{k+9}{k+2} \binom{4k}{k+1} + \binom{4k}{k} + 5 \binom{4k}{k-1} + 6 \binom{4k}{k-2} + A'    (A21)

= 8 \sum_{i=1}^{k+1} \binom{4k}{i} + 3 \binom{4k}{k-2} + A''    (A22)

\geq 2^{3k+2}    (A23)

where A, A', A'' are positive terms that are o(\binom{4k}{k-2}), and we use the identity \binom{a}{b} = \frac{a-b+1}{b} \binom{a}{b-1} after Equation (A19) to eliminate the negative \binom{4k}{k+1} term. Inequality (A23) follows from the induction assumption. This proves the statement for k + 1 and completes the proof. □
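The statement can also be spot-checked numerically; the loop below (illustrative only) verifies (A16), and hence the f/3 bound for f divisible by 3, over a small range:

    from math import comb

    for k in range(1, 30):
        assert sum(comb(4 * k, i) for i in range(1, k + 2)) >= 2 ** (3 * k - 1)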
 Lemma A2. 
When k ≤ (2n - 8)/5, the following inequality holds:

\frac{1}{2} \sum_{i=1}^{k} \binom{n}{i} < \binom{n}{k+1}    (A24)
Proof. 
We prove the lemma by induction over k. First, note that the inequality holds when k = 1:

\frac{1}{2} \binom{n}{1} < \binom{n}{2}    (A25)

Then, assume that the statement is true for k. Now, we check the statement for k + 1:

\frac{1}{2} \sum_{i=1}^{k+1} \binom{n}{i} < \frac{3}{2} \binom{n}{k+1}    (A26)

\leq \frac{n-k-1}{k+2} \binom{n}{k+1}    (A27)

= \binom{n}{k+2}    (A28)

where (A26) follows from the induction assumption, and (A27) holds because k ≤ (2n - 8)/5. This proves the statement for k + 1 and completes the proof. □
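A numeric spot-check of the lemma over a small range (illustrative only), using exact integer arithmetic:

    from math import comb

    for n in range(6, 60):
        for k in range(1, (2 * n - 8) // 5 + 1):
            assert sum(comb(n, i) for i in range(1, k + 1)) < 2 * comb(n, k + 1)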

References

  1. Dorfman, R. The Detection of Defective Members of Large Populations. Ann. Math. Stat. 1943, 14, 436–440.
  2. Du, D.Z.; Hwang, F.K. Combinatorial Group Testing and Its Applications, 2nd ed.; World Scientific: London, UK, 1999.
  3. Wolf, J. Born Again Group Testing: Multiaccess Communications. IEEE Trans. Inf. Theory 1985, 31, 185–191.
  4. Atia, G.K.; Saligrama, V. Boolean Compressed Sensing and Noisy Group Testing. IEEE Trans. Inf. Theory 2012, 58, 1880–1901.
  5. Wadayama, T. Nonadaptive Group Testing Based on Sparse Pooling Graphs. IEEE Trans. Inf. Theory 2017, 63, 1525–1534.
  6. Wang, C.; Zhao, Q.; Chuah, C. Optimal Nested Test Plan for Combinatorial Quantitative Group Testing. IEEE Trans. Signal Process. 2018, 66, 992–1006.
  7. Wu, S.; Wei, S.; Wang, Y.; Vaidyanathan, R.; Yuan, J. Partition Information and Its Transmission Over Boolean Multi-Access Channels. IEEE Trans. Inf. Theory 2015, 61, 1010–1027.
  8. Shangguan, C.; Ge, G. New Bounds on the Number of Tests for Disjunct Matrices. IEEE Trans. Inf. Theory 2016, 62, 7518–7521.
  9. Scarlett, J.; Johnson, O. Noisy Non-Adaptive Group Testing: A (Near-)Definite Defectives Approach. IEEE Trans. Inf. Theory 2020, 66, 3775–3797.
  10. Scarlett, J.; Cevher, V. Near-Optimal Noisy Group Testing via Separate Decoding of Items. IEEE J. Sel. Top. Signal Process. 2018, 12, 902–915.
  11. Scarlett, J. Noisy Adaptive Group Testing: Bounds and Algorithms. IEEE Trans. Inf. Theory 2019, 65, 3646–3661.
  12. Mazumdar, A. Nonadaptive Group Testing with Random Set of Defectives. IEEE Trans. Inf. Theory 2016, 62, 7522–7531.
  13. Kealy, T.; Johnson, O.; Piechocki, R. The Capacity of Non-Identical Adaptive Group Testing. In Proceedings of the Allerton Conference, Monticello, IL, USA, 30 September–3 October 2014; pp. 101–108.
  14. Johnson, O.; Aldridge, M.; Scarlett, J. Performance of Group Testing Algorithms with Near-Constant Tests Per Item. IEEE Trans. Inf. Theory 2019, 65, 707–723.
  15. Inan, H.A.; Kairouz, P.; Wootters, M.; Ozgur, A. On the Optimality of the Kautz-Singleton Construction in Probabilistic Group Testing. In Proceedings of the Allerton Conference, Monticello, IL, USA, 2–5 October 2018; pp. 188–195.
  16. Karimi, E.; Kazemi, F.; Heidarzadeh, A.; Narayanan, K.R.; Sprintson, A. Non-adaptive Quantitative Group Testing Using Irregular Sparse Graph Codes. In Proceedings of the Allerton Conference, Monticello, IL, USA, 24–27 September 2019; pp. 608–614.
  17. Gebhard, O.; Hahn-Klimroth, M.; Kaaser, D.; Loick, P. Quantitative Group Testing in the Sublinear Regime. arXiv 2021, arXiv:1905.01458.
  18. Falahatgar, M.; Jafarpour, A.; Orlitsky, A.; Pichapati, V.; Suresh, A.T. Estimating the Number of Defectives with Group Testing. In Proceedings of the IEEE ISIT, Barcelona, Spain, 10–15 July 2016; pp. 1376–1380.
  19. Coja-Oghlan, A.; Gebhard, O.; Hahn-Klimroth, M.; Loick, P. Information-Theoretic and Algorithmic Thresholds for Group Testing. IEEE Trans. Inf. Theory 2020, 66, 7911–7928.
  20. Chan, C.L.; Jaggi, S.; Saligrama, V.; Agnihotri, S. Non-Adaptive Group Testing: Explicit Bounds and Novel Algorithms. IEEE Trans. Inf. Theory 2014, 60, 3019–3035.
  21. Cai, S.; Jahangoshahi, M.; Bakshi, M.; Jaggi, S. Efficient Algorithms for Noisy Group Testing. IEEE Trans. Inf. Theory 2017, 63, 2113–2136.
  22. Bondorf, S.; Chen, B.; Scarlett, J.; Yu, H.; Zhao, Y. Sublinear-Time Non-Adaptive Group Testing with O(k log n) Tests via Bit-Mixing Coding. arXiv 2020, arXiv:1904.10102.
  23. Aldridge, M. Individual Testing Is Optimal for Nonadaptive Group Testing in the Linear Regime. IEEE Trans. Inf. Theory 2019, 65, 2058–2061.
  24. Agarwal, A.; Jaggi, S.; Mazumdar, A. Novel Impossibility Results for Group-Testing. In Proceedings of the IEEE ISIT, Vail, CO, USA, 17–22 June 2018; pp. 2579–2583.
  25. Heidarzadeh, A.; Narayanan, K. Two-Stage Adaptive Pooling with RT-qPCR for COVID-19 Screening. arXiv 2020, arXiv:2007.02695.
  26. Ruszinko, M. On the Upper Bound of the Size of the r-Cover-Free Families. J. Comb. Theory Ser. A 1994, 66, 302–310.
  27. Riccio, L.; Colbourn, C.J. Sharper Bounds in Adaptive Group Testing. Taiwan. J. Math. 2000, 4, 669–673.
  28. Aldridge, M.; Johnson, O.; Scarlett, J. Group Testing: An Information Theory Perspective. Found. Trends Commun. Inf. Theory 2019, 15, 196–392.
  29. Li, T.; Chan, C.L.; Huang, W.; Kaced, T.; Jaggi, S. Group Testing with Prior Statistics. In Proceedings of the IEEE ISIT, Honolulu, HI, USA, 29 June–4 July 2014; pp. 2346–2350.
  30. Lendle, S.D.; Hudgens, M.G.; Qaqish, B.F. Group Testing for Case Identification with Correlated Responses. Biometrics 2012, 68, 532–540.
  31. Lin, Y.J.; Yu, C.H.; Liu, T.H.; Chang, C.S.; Chen, W.T. Positively Correlated Samples Save Pooled Testing Costs. arXiv 2021, arXiv:2011.09794.
  32. Nikolopoulos, P.; Guo, T.; Fragouli, C.; Diggavi, S. Community Aware Group Testing. arXiv 2021, arXiv:2007.08111.
  33. Nikolopoulos, P.; Srinivasavaradhan, S.R.; Guo, T.; Fragouli, C.; Diggavi, S. Group Testing for Overlapping Communities. In Proceedings of the ICC 2021—IEEE International Conference on Communications, Montreal, QC, Canada, 14–23 June 2021; pp. 1–7.
  34. Ahn, S.; Chen, W.N.; Ozgur, A. Adaptive Group Testing on Networks with Community Structure. arXiv 2021, arXiv:2101.02405.
  35. Arasli, B.; Ulukus, S. Graph and Cluster Formation Based Group Testing. In Proceedings of the IEEE ISIT, Melbourne, Australia, 12–20 July 2021.
  36. Hwang, F.K. A Method for Detecting All Defective Members in a Population by Group Testing. J. Am. Stat. Assoc. 1972, 67, 605–608.
  37. Idalino, T.B.; Moura, L. Structure-Aware Combinatorial Group Testing: A New Method for Pandemic Screening. arXiv 2022, arXiv:2202.09264.
  38. Gonen, M.; Langberg, M.; Sprintson, A. Group Testing on General Set-Systems. arXiv 2022, arXiv:2202.04988.
  39. Chen, H.B.; Hwang, F.K. Exploring the Missing Link Among d-Separable, d¯-Separable and d-Disjunct Matrices. Discret. Appl. Math. 2007, 155, 662–664.
  40. Baldassini, L.; Johnson, O.; Aldridge, M. The Capacity of Adaptive Group Testing. In Proceedings of the IEEE ISIT, Istanbul, Turkey, 7–12 July 2013.
  41. Allemann, A. An Efficient Algorithm for Combinatorial Group Testing. In Proceedings of the Information Theory, Combinatorics, and Search Theory: In Memory of Rudolf Ahlswede, Bielefeld, Germany, 25–26 July 2011.
  42. Sobel, M.; Groll, P.A. Group Testing To Eliminate Efficiently All Defectives in a Binomial Sample. Bell Syst. Tech. J. 1959, 38, 1179–1252.
Figure 1. Random connection graph C and three possible realizations and cluster formations. We show each cluster with a different color. (a) Probabilities of the edges; (b) a realization of C with four clusters; (c) a realization of C with six clusters; (d) a realization of C with four clusters.
Figure 2. Edge probabilities of C and elements of F in the example C given in (1), with clusters shown in different colors.
Figure 3. Cluster formation tree F.
Figure 4. Subtree of F with assigned result vectors for each node.
Figure 5. F with assigned result vectors for each node.
Figure 6. A 4-level exponentially split cluster formation tree.
Figure 7. Four realizations of a random connection graph C that fall under four different cluster formations in a 4-level exponentially split cluster formation tree with δ = 4.
Figure 8. (a) Expected number of false classifications vs. the choice of sampling cluster formation F_m; (b) required number of tests vs. the choice of sampling cluster formation F_m.
Figure 9. (a) Expected number of false classifications vs. the choice of sampling cluster formation F_m; (b) required number of tests vs. the choice of sampling cluster formation F_m; (c) random connection graph.
Table 1. Nomenclature.

System
n: number of individuals in the system
U: infection status vector of size n
Z: patient zero random variable
p_Z(i): probability that individual i is the patient zero
C: random connection graph
E_C: edge set of C
V_C: vertex set of C, also equal to [n]
C: random connection matrix
F: cluster formation random variable
F: set of all possible cluster formations, i.e., {F_i}
p_F(F_i): probability that the true cluster formation is F_i
f: number of possible cluster formations, i.e., |F|
σ_i: number of clusters in the cluster formation F_i
S_j^i: jth cluster in F_i
λ_j: number of unique clusters in F at and above the level F_j
λ_{S_i^j}: number of unique ancestor nodes of S_i^j in F
δ: size of the bottom level clusters in an exponentially split F

Algorithm
F_m: sampling cluster formation chosen from F
M: sampling function that selects the individuals to be tested
U^(M): infection status vector of the individuals selected by M
S^α(M_i): cluster in F_α that contains the ith individual selected by M
K_M: set of infections among the individuals selected by M
P(K_M): set of all possible infected sets that K_M can be
T: number of tests to be performed
X: T × σ_m test matrix
X^(i): ith column of X
y: test result vector of size T
Û: estimated infection status vector of the n individuals after the test results
E_{f,α}: expected number of false classifications given F = F_α
E_f: expected number of false classifications