4.1. Proposed Robust Algorithm
In this section, we propose a robust algorithm (RA) for GT based on MAP estimation. Finding the optimal MAP solution in GT is NP-hard, so many researchers have instead looked for sub-optimal approaches that come close to the optimum. Among them, the belief propagation algorithm, introduced by MacKay in [20], showed results close to the Shannon limit in channel coding theory.
To set up the proposed RA, we assume the following: each person has a prior probability of being in the infected or the normal state (given by Equation (1)) under the GT scheme, assuming that the input vector, group matrix, and output result are mutually independent. The challenge for GT is finding the MAP combination of the estimated
cases, given the observed output
. This is formulated as
where the second equality comes from the independence assumption of the prior cases.
Using Bayes’ rule, the conditional probability
can be rewritten as
where the notation
denotes a summation over all entries of
except the entry
, the second proportional relation holds due to Bayes’ rule, and the last equality comes from the independence assumption. The aim of the proposed algorithm is to find the maximum marginal probability for each sample in Equation (5).
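As a sketch only, the two steps described above can be written as follows. The notation is assumed for illustration and is not taken from the text: x is the input vector with entries x_j, y is the output vector with entries y_i, and the subscript "~x_j" means summation over all entries of x except x_j.

```latex
% MAP estimate and its per-sample factorization (assumed notation)
\hat{\mathbf{x}}
  = \arg\max_{\mathbf{x}} P(\mathbf{x} \mid \mathbf{y})
  = \arg\max_{\mathbf{x}} \prod_{j} P(x_j \mid \mathbf{y})

% Marginal posterior of one sample: marginalization, Bayes' rule, independence
P(x_j \mid \mathbf{y})
  = \sum_{\sim x_j} P(\mathbf{x} \mid \mathbf{y})
  \propto \sum_{\sim x_j} P(\mathbf{y} \mid \mathbf{x}) \, P(\mathbf{x})
  = \sum_{\sim x_j} \prod_{i} P(y_i \mid \mathbf{x}) \prod_{k} P(x_k)
```

The first line corresponds to the MAP formulation and the second to the rewritten conditional probability; the exact form used in Equations (4) and (5) may differ.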
Next, we describe the key idea of the proposed RA. We first give a graphical representation of one example of the GT scheme in Figure 3. There are 10 samples and seven output results. As the first row of A, shown in Figure 2, is [0 1 0 0 0 1 0 0 0 1], there are three edges in Figure 3 between the first output and three samples: the second, the sixth, and the tenth. The other edges between the samples and the outputs are drawn in the same way.
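The mapping from a row of the group matrix to edges of the graph can be illustrated with a minimal Python sketch. Only the first row of A is quoted in the text, so only that row is used here; the function name `row_to_samples` is a hypothetical helper introduced for illustration.

```python
# First row of the group matrix A, as quoted in the text.
first_row = [0, 1, 0, 0, 0, 1, 0, 0, 0, 1]


def row_to_samples(row):
    """Return the 1-based indices of the samples pooled in this group."""
    return [j for j, entry in enumerate(row, start=1) if entry == 1]


# Each listed sample gets one edge to the first output in the bipartite graph.
print(row_to_samples(first_row))  # -> [2, 6, 10]
```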
Let
be the set of samples participating in the
ith group, and
be the set of groups that contain the jth sample. We also use
to denote the set
excluding the
jth sample, and
to denote the set
excluding the
ith group. The proposed RA can be described as a process in which two marginal probabilities exchange information in each iteration. Recall that we aim to find the maximum posterior probability for each sample, as in the last line of Equation (5). In other words, the two conditional probabilities
and
are exchanged with each other to maximize the posterior probability. Let
be the downward message from sample
to output
, and
be the upward message from output
to sample
. Both messages are expressed as conditional probabilities. As indicated by Equation (5), the two messages exchange their information with each other. The downward message
is a function of the upward message
, and vice versa. In the example of
Figure 3, the downward message
is obtained from the two upward messages
and
. Conversely, the upward message
is obtained from the two downward messages
and
.
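The neighborhood sets and their "excluding" variants can be sketched in Python as below. The 3-by-4 matrix is a hypothetical toy example, not the matrix of Figure 2, and the helper names are assumptions made for illustration. Indices are 0-based.

```python
# Hypothetical toy group matrix: rows are groups (outputs), columns are samples.
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]


def samples_in_group(A, i):
    """Samples that participate in group i."""
    return [j for j, entry in enumerate(A[i]) if entry == 1]


def groups_of_sample(A, j):
    """Groups that contain sample j."""
    return [i for i, row in enumerate(A) if row[j] == 1]


def excluding(members, index):
    """The same set with one member left out."""
    return [m for m in members if m != index]


# The upward message from group 1 to sample 2 depends on the downward
# messages of the other samples in group 1.
print(excluding(samples_in_group(A, 1), 2))   # -> [1]
# The downward message from sample 2 to group 1 depends on the upward
# messages of the other groups that contain sample 2.
print(excluding(groups_of_sample(A, 2), 1))   # -> [2]
```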
Now, the RA updates the messages
and
associated with each edge between the sample
and the output
. There are three steps to estimate each input sample: initialization, updating the messages
and
, and tentative decoding to check the constraint condition in Equation (
2). In the initialization step, we define the probability distribution of
in Equation (
1), generate the group matrix
with a random design (i.e., low-density parity-check (LDPC) codes, as in [20]), and obtain the output result
from the given
and
. We aim to find an (unknown) input vector
by using (known)
and
. In addition, the initial downward message
can be obtained from the prior probability distribution of Equation (
1), assuming that the upward messages for 0 and 1 are equally distributed.
Next, we consider the upward message
from output
to sample
. This message
is obtained as follows:
where the constraint condition in Equation (
2) is satisfied as
in the noiseless GT scheme. The summation in Equation (6) collects all the associated downward probabilities that satisfy the constraint condition of Equation (2). To update the upward message, this sum is multiplied by the noise probability
.
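One plausible form of such an upward update, given only as a sketch, is shown below. It assumes notation that is not confirmed by the text: N(i) is the set of samples in group i, m_{k→i} is a downward message, m_{i→j} is an upward message, z_i is the noiseless OR of the samples in group i, and q is a symmetric probability that a test result is flipped.

```latex
% Generic sum-product form of the upward message (assumed notation)
m_{i \to j}(x_j)
  = \sum_{\{x_k\}:\, k \in N(i) \setminus j}
    P\bigl(y_i \mid x_j, \{x_k\}\bigr)
    \prod_{k \in N(i) \setminus j} m_{k \to i}(x_k)

% Closed form for OR pooling, with \pi_{ij} = \prod_{k \in N(i) \setminus j} m_{k \to i}(0)
m_{i \to j}(1) = P(y_i \mid z_i = 1), \qquad
m_{i \to j}(0) = P(y_i \mid z_i = 0)\,\pi_{ij} + P(y_i \mid z_i = 1)\,(1 - \pi_{ij})

% Symmetric noise assumption
P(y_i \mid z_i) = \begin{cases} 1 - q, & y_i = z_i \\ q, & y_i \neq z_i \end{cases}
```

In the noiseless case (q = 0), the likelihood reduces to an indicator that the constraint condition holds, so only the configurations consistent with the observed output contribute to the sum.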
The downward message
can be written as
where the variable
is used for normalization of the total probability. Let
be the posterior probability for the sample
. Finally, we decide between 0 and 1 for each sample by choosing the value with the larger posterior probability, as defined in Equation (5),
Using Equations (6) and (7), the proposed RA iteratively updates the messages
and
; that is, as the algorithm runs, the posterior probability of each sample generally moves toward convergence. However, the posterior probability may still not have converged when the maximum number of iterations is reached. In this case, we treat the sample as unreliable and, in the final step, test only these unstable samples individually instead of relying on GT. Algorithm 1 describes our proposed RA for the detection of people infected with COVID-19.
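The iteration described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: pooling is a Boolean OR, noise is a symmetric flip probability `q`, a sample counts as unstable if its posterior still changes by at least `tol` in the last iteration, and the decision threshold is 0.5. All function and variable names are hypothetical.

```python
def bp_decode(groups, y, n_samples, p, q, max_iter=20, tol=1e-6):
    """Belief propagation sketch for noisy OR-pooled group testing.

    groups[i] lists the samples pooled in test i and y[i] is its observed result.
    Assumes 0 < p < 1 and 0 < q < 0.5 so that no normalizer is zero.
    """
    member_of = [[] for _ in range(n_samples)]
    for i, members in enumerate(groups):
        for j in members:
            member_of[j].append(i)

    # Downward message (sample j -> test i), stored as P(x_j = 1); starts at the prior.
    down = {(i, j): p for i, members in enumerate(groups) for j in members}
    up = {}  # upward message (test i -> sample j): (weight for x_j = 0, weight for x_j = 1)
    post = [p] * n_samples
    delta = [0.0] * n_samples

    for _ in range(max_iter):
        # Upward update: combine the other samples' downward messages with the noise model.
        for i, members in enumerate(groups):
            lik_neg = 1.0 - q if y[i] == 0 else q  # P(y_i | pool truly negative)
            lik_pos = q if y[i] == 0 else 1.0 - q  # P(y_i | pool truly positive)
            for j in members:
                others_clean = 1.0
                for k in members:
                    if k != j:
                        others_clean *= 1.0 - down[(i, k)]
                up[(i, j)] = (lik_neg * others_clean + lik_pos * (1.0 - others_clean), lik_pos)

        # Downward update and posterior: prior times the upward messages, normalized.
        new_post = []
        for j in range(n_samples):
            for i in member_of[j]:
                w0, w1 = 1.0 - p, p
                for other in member_of[j]:
                    if other != i:
                        w0 *= up[(other, j)][0]
                        w1 *= up[(other, j)][1]
                down[(i, j)] = w1 / (w0 + w1)
            w0, w1 = 1.0 - p, p
            for i in member_of[j]:
                w0 *= up[(i, j)][0]
                w1 *= up[(i, j)][1]
            new_post.append(w1 / (w0 + w1))

        delta = [abs(a - b) for a, b in zip(new_post, post)]
        post = new_post
        if max(delta) < tol:
            break

    estimate = [1 if b > 0.5 else 0 for b in post]
    unstable = [j for j, d in enumerate(delta) if d >= tol]
    return estimate, post, unstable


# Toy chain-shaped design: sample 1 is infected, so tests 0 and 1 are positive.
toy_groups = [[0, 1], [1, 2], [2, 3]]
estimate, post, unstable = bp_decode(toy_groups, y=[1, 1, 0], n_samples=4, p=0.1, q=0.01)
```

On this toy design the graph has no cycles, so the messages settle after a few iterations and no sample is flagged for individual testing. Samples returned in `unstable` would be the ones sent to individual testing in the final step.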
Algorithm 1: Proposed Robust Algorithm (RA).
Let
be the number of samples that require individual testing because their posterior probability does not converge. The total number of tests T needed to successfully find the people infected with COVID-19 is obtained as follows:
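A plausible reading of this count, assuming that the group tests and the follow-up individual tests simply add up, is sketched below in Python. The function and argument names are hypothetical and introduced only for illustration.

```python
def total_tests(n_group_tests, unstable_samples):
    """Group tests plus one individual test per sample that did not converge.

    Assumption: the two kinds of tests are simply summed.
    """
    return n_group_tests + len(unstable_samples)


# Example: 300 pooled tests and three samples left unresolved by the decoder.
print(total_tests(300, [4, 17, 256]))  # -> 303
```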
4.2. Simulation Results
COVID-19 has a different incidence rate in each country.
Figure 4 shows the number of infected people and the number of tests in each country. The statistics in Figure 4 list the countries with the largest number of confirmed cases as of 12 April 2020. China, the first affected country, was excluded from Figure 4 due to a lack of information on the number of tests. The average incidence rate for the most-affected countries in Figure 4 is 12.89%, where the total number of confirmed cases is 1,743,883 and the total number of tests is 13,531,095. However, this incidence rate is reported to be lower than the actual one, as active testing of suspected patients is not recommended. We consider South Korea as our simulation environment for the diagnosis of COVID-19 because it adopted the world’s most objective and aggressive countermeasures to COVID-19. According to these results, the incidence rate of COVID-19 in South Korea is very low, at about 2%.
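The quoted average follows directly from the two totals, as this short Python check shows. The variable names are illustrative only.

```python
confirmed_cases = 1_743_883   # total confirmed cases quoted in the text
total_tests_run = 13_531_095  # total number of tests quoted in the text

# Incidence rate among tested people, in percent.
incidence_percent = 100 * confirmed_cases / total_tests_run
print(round(incidence_percent, 2))  # -> 12.89
```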
We implemented a decoding algorithm for the GT scheme to find people infected with COVID-19 in noiseless (
) and noisy settings with both false positives and false negatives (
). Our proposed algorithm is based on the belief propagation algorithm for LDPC codes [20] in channel coding theory, which has shown performance close to the Shannon limit. The difference between MacKay’s method [20] and the proposed RA lies in the underlying operations: in channel coding, the operations are performed over finite fields, whereas GT uses Boolean operations such as AND, OR, and XOR. We now evaluate the performance of the RA for the GT schemes. To this end, we set up the simulation environment as follows: the defective samples are generated from the probability distribution in Equation (1) with different incidence rates
, and the group matrix comes from LDPC codes with a constant weight of five in each column. Additionally, we set the maximum number of iterations to 20 for the RA. Throughout this paper, we consider
, assuming that 1000 COVID-19 diagnoses are carried out simultaneously at one site. All results for the number of tests are averaged over 100 experiments. The computational complexity of the RA, based on the belief propagation algorithm for LDPC codes, is
in [20]. In addition, the relationship between the number of iterations and the performance of the belief propagation algorithm was analyzed in [21] using density evolution under binary erasure and binary symmetric channels. Intuitively, the more iterations of belief propagation decoding, the better the performance; for this to hold, however, the group matrix must be generated with suitable structural properties (e.g., with respect to stopping sets), as discussed in [21].
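The simulation setup can be sketched in Python as below. This is an illustration under assumptions: the group matrix is a plain random design with five distinct tests per sample, which stands in for a proper LDPC construction; the number of tests (300) is an arbitrary example value; and the noise is a symmetric flip probability `q`. All names are hypothetical.

```python
import random


def make_instance(n_samples, n_tests, p, q, col_weight=5, seed=0):
    """Draw infected samples, a constant-column-weight design, and noisy pooled results."""
    rng = random.Random(seed)
    # Each sample is infected independently with incidence rate p.
    x = [1 if rng.random() < p else 0 for _ in range(n_samples)]
    # Each sample joins col_weight distinct tests chosen uniformly at random.
    groups = [[] for _ in range(n_tests)]
    for j in range(n_samples):
        for i in rng.sample(range(n_tests), col_weight):
            groups[i].append(j)
    # A pooled test is positive if any pooled sample is infected (Boolean OR).
    clean = [1 if any(x[j] for j in members) else 0 for members in groups]
    # Each result is flipped independently with probability q.
    y = [z ^ 1 if rng.random() < q else z for z in clean]
    return x, groups, y


# 1000 samples at a 2% incidence rate, noiseless results.
x, groups, y = make_instance(n_samples=1000, n_tests=300, p=0.02, q=0.0)
```

Averaging a decoder's test count over 100 such instances, each with a different `seed`, would mirror the averaging described above.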
Next, we evaluate the total number of tests T needed to successfully find infected people at different incidence rates in the noiseless GT scheme, as shown in Table 1. First, the information-theoretic bound was obtained from [22]; it is derived via Fano’s inequality [23] and serves as a lower bound on the number of tests. Dorfman’s method [4], the DT method [1], and the GBS method [5] were evaluated at
and different incidence rates of
(0.01, 0.02, 0.03, 0.04, 0.05, and 0.1), values chosen based on the statistics of COVID-19 infections as of 12 April 2020 shown in Figure 4. All results for the number of tests T are shown in Table 1. In the noiseless scheme, the best method for the detection of COVID-19 was the GBS algorithm [5], whose results were close to the information-theoretic bound. Although this method performed best, it has the drawback that it cannot test all samples simultaneously. In contrast, our proposed RA has the advantage of group-testing all the samples at once using a predefined group matrix. In other words, in the GBS method each test is chosen based on the results of the previous tests, whereas the RA runs all tests at the same time, so large numbers of samples can be processed in parallel.
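To give a feel for the scale of these quantities, the Python sketch below computes two standard reference values for 1000 samples: the expected number of tests of two-stage Dorfman pooling with the best group size, and the entropy-based counting bound N·H(p). The counting bound is used here as an assumed stand-in for the information-theoretic bound of [22], whose exact expression may differ, and the numbers are not meant to reproduce Table 1. Function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def dorfman_tests_per_sample(p, k):
    """Expected tests per sample: one pooled test per k samples, plus k retests if positive."""
    return 1.0 / k + 1.0 - (1.0 - p) ** k


def best_dorfman(p, max_group_size=100):
    """Group size (at least 2) that minimizes the expected tests per sample."""
    best_k = min(range(2, max_group_size + 1), key=lambda k: dorfman_tests_per_sample(p, k))
    return best_k, dorfman_tests_per_sample(p, best_k)


n_samples = 1000
for rate in (0.01, 0.02, 0.03, 0.04, 0.05, 0.1):
    k, per_sample = best_dorfman(rate)
    print(rate, k, round(n_samples * per_sample, 1), round(n_samples * binary_entropy(rate), 1))
```

Each printed row lists the incidence rate, the best Dorfman group size, the expected Dorfman test count, and the counting bound.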
The main advantage of our algorithm over the others is that it is robust to noise. The other detection methods shown in Table 1 perform better in noiseless GT. In practice, however, the GT framework is noisy, which limits the usefulness of these methods. We need a way to find infected samples even when the GT is noisy. To this end, we consider the noisy GT scheme with different noise values
, which model the false positives and false negatives of the GT as formulated in Equation (3).
Figure 5 shows the total number of tests required, according to the theoretical bounds [22], to successfully find the infected samples at different incidence rates
in the noisy GT schemes with
, where
, 0.03, 0.05, and 0.1. As shown in
Figure 5, if the incidence rate
is greater than 0.2 and the noise probability
is greater than 0.05, individual testing is better than the GT scheme. In other words, the GT scheme for COVID-19 reduces the number of tests only when the incidence rate is below 0.2, assuming
.
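The crossover described above is consistent with a standard capacity-style bound, sketched below in Python. It assumes symmetric flip noise with probability q and the bound H(p) / (1 − H(q)) tests per sample; whether [22] and Figure 5 use exactly this expression is an assumption. Function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def min_tests_per_sample(p, q):
    """Lower bound on tests per sample when each result is flipped with probability q."""
    return binary_entropy(p) / (1.0 - binary_entropy(q))


# A value above 1 means that even an ideal GT scheme needs more tests than
# testing every sample individually once.
for p, q in [(0.02, 0.05), (0.1, 0.05), (0.2, 0.05)]:
    print(p, q, min_tests_per_sample(p, q) > 1.0)
```

Under this bound, an incidence rate of 0.2 with a noise probability of 0.05 already requires more than one test per sample, whereas the lower incidence rates stay well below one.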
Table 2 shows the number of tests T required by our proposed RA in the noisy GT scheme for different incidence rates
and noise probabilities
. Note that interpreting false negatives is more complicated because the input samples themselves may be contaminated. In addition, test results that include a large number of defective samples are unlikely to lead to false negatives, even with noise. The authors of [24] showed that it is easier to find false negatives than false positives, which also led to improved results on the number of tests in the case of false negatives.
The reason for this robustness against noise is as follows. The RA updates all the posterior probabilities of the unknown binary input vector at every iteration by exchanging upward and downward messages. This allows it to recover the uncorrupted value from the beliefs of neighboring nodes even if a test result has been changed by a false negative or false positive error. As the number of iterations increases, the posterior probability gradually converges, provided that the bipartite graph of the GT scheme is cycle-free and the noise is low. Note that generating a cycle-free group matrix is difficult for large lengths. In our proposed RA, as the noise increases, belief propagation stops working, and the decoder falls into a region where errors cannot be corrected. For more details on the effect of noise on belief propagation algorithms and their robustness to it, please refer to Lav’s paper in [25].