Article

Nonparametric Additive Regression for High-Dimensional Group Testing Data

1 School of Mathematics and Statistics, Guangxi Normal University, Guilin 541004, China
2 School of Mathematics, Hohai University, Nanjing 210098, China
* Authors to whom correspondence should be addressed.
Mathematics 2024, 12(5), 686; https://doi.org/10.3390/math12050686
Submission received: 12 January 2024 / Revised: 20 February 2024 / Accepted: 21 February 2024 / Published: 27 February 2024
(This article belongs to the Special Issue Advances in Biostatistics and Applications)

Abstract

Group testing has been verified as a cost-effective and time-efficient approach, in which individual samples are pooled with a predefined group size for subsequent testing. Recent research has explored the integration of covariate information to improve the modeling of group testing data. While existing work on high-dimensional data primarily focuses on parametric models, this study considers a more flexible generalized nonparametric additive model. Nonlinear components are approximated using B-splines, and model estimation under the sparsity assumption is derived employing the group lasso. Theoretical results demonstrate that our method selects the true model with high probability and provides consistent estimates. Numerical studies are conducted to illustrate the good performance of our proposed method, using both simulated and real data.

1. Introduction

Group testing is a proven cost-effective and time-efficient approach. In group testing, individual samples are pooled with a predefined group size and tested; only those pools yielding positive results require additional tests, including individual tests. Dorfman [1] introduced group testing to screen for syphilis among soldiers. It has since been widely applied to detect infected samples in large populations, covering a variety of infectious diseases such as chlamydia and gonorrhea [2,3,4], HIV, the hepatitis B virus [5], West Nile virus [6], SARS-CoV-2 [7], and COVID-19 [8]. Group testing is also used in many other fields, such as genetics [9], blood safety [10], drug discovery [11], and veterinary diagnostics, e.g., the detection of Mycoplasma hyopneumoniae [12]. The efficiency of this approach is particularly remarkable when disease prevalence is low. Gregory et al. [13] noted that the State Hygienic Laboratory (SHL) in Iowa saved approximately $3.1 million from 2009 to 2014 in screening female subjects for chlamydia. Verougstraete et al. [4] evaluated the feasibility of Chlamydia trachomatis (CT) and Neisseria gonorrhoeae (NG) molecular testing in a large number of female sex workers and showed that pooled testing reduced the cost by 35% compared with single-site testing.
Given the rapid advancement of digital technologies, it is possible to access extensive covariates, encompassing personal details such as age and occupation, electronic medical records, genetic data, and environmental variables. In the recent group testing literature, covariate information has been explored for estimation and case identification [14,15,16]. Vansteelandt et al. [17] proposed estimating individual infection probabilities by utilizing the test results obtained from initial pools. Xie [18] developed a general regression methodology that relates group testing responses to individual covariate information. Black et al. [19] improved the Halving group testing procedure by taking population heterogeneity into account. It is evident that covariate information improves the estimation of individual risk probabilities, further reduces the number of tests, and lowers costs [20].
Most existing research on group testing with covariate information has been developed under a parametric regression model, such as the logistic model. Zhang et al. [2] addressed the goal of case identification by incorporating retesting results. McMahan et al. [21] developed a Bayesian framework for group testing with covariate information. Gregory et al. [13] used the adaptive elastic net for group testing with high-dimensional covariates. Lin et al. [3] considered variable selection for multiple-infection group testing data. Alternatively, semiparametric and nonparametric regression has also received attention, but mainly for low-dimensional group testing. Yuan et al. [22] considered semiparametric isotonic regression models for the simultaneous estimation of the conditional probability curve and covariate effects, using an expectation–maximization (EM) algorithm to overcome the computational challenge. Delaigle et al. considered nonparametric regression methods for group testing data [23,24,25]. Liu et al. [26] developed a Bayesian generalized additive regression approach to model the individual-level probability of disease. However, little work has been conducted on high-dimensional covariates within the semiparametric or nonparametric regression framework for group testing.
In a standard group testing procedure, test results are collected at every stage, and high-dimensional covariate information is obtained during the initial enrollment of specimens. To maximize the utility of these extensive data, we propose an integrated modeling framework that employs the test results from every phase of the group testing algorithm and combines them with the high-dimensional covariate information. We adopt a generalized nonparametric additive model to analyze the heterogeneous high-dimensional group testing data. The unknown nonlinear components are approximated with B-splines [27] and estimated by the group lasso method [28]. Four commonly used group testing algorithms are considered: master pool, Dorfman, Halving, and array testing [1,29,30,31]. To enhance computational efficiency, we utilize the EM algorithm in conjunction with stochastic gradient descent. Theoretical results demonstrate that our method selects the true model with high probability and provides consistent estimates of the components.
The remaining text is organized as follows. In Section 2, we establish our estimation method under the generalized nonparametric additive model through the B-spline approximation, along with the corresponding algorithm, and then investigate and prove the theoretical properties of the proposed method. In Section 3, we derive the conditional expectations utilized in the EM algorithm to enhance computational efficiency. In Section 4 and Section 5, we conduct extensive simulations and apply the method to real data; the numerical results show the good performance of our method. We conclude in Section 6. The technical details of the proofs are presented in Appendix A.

2. Main Results

2.1. Notation

Throughout this work, we define $\mathbb{R}^d$ as the set of $d$-dimensional real vectors. We denote by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ the minimum and maximum eigenvalues of a matrix $A$, respectively. We employ $[m]$ to denote the set $\{1,\dots,m\}$. The notation $\|\cdot\|_\infty$ represents the sup norm. We write $\epsilon_n = o(1)$ to indicate that $\epsilon_n \to 0$. We use $E(\cdot)$ to denote the expectation with respect to the underlying probability space. The symbol $\stackrel{d}{\to}$ denotes convergence in distribution. We write $a_n \ll b_n$ to signify that $a_n/b_n = o(1)$. Let $\|f\|_2^2 = E(f^2)$ and $\|f\|_{n,2}^2 = E_n(f^2)$, with $E_n$ representing the empirical measure. The superscript "T" indicates the transpose of a matrix or a vector.

2.2. Model

Suppose that $n$ individuals are recruited to participate in testing. We denote by $\tilde Y_i$ the true status of the $i$-th individual, indicating whether it is infected ($\tilde Y_i = 1$) or not ($\tilde Y_i = 0$) for $i \in [n]$. Then $\tilde Y_i$ follows a Bernoulli distribution with risk probability $p_i$. Simultaneously, we gather and express individual information as $X_i = (X_{i1},\dots,X_{iq_n})^T \in \mathbb{R}^{q_n}$ for the $i$th individual, where $X_i$ has dimension $q_n$. We assume that, given $X_i$, $\tilde Y_i$ follows the generalized additive model (GAM)
$$P(\tilde Y_i = 1 \mid X_i) = g\Big(\mu + \sum_{j=1}^{q_n} f_j(X_{ij})\Big), \quad (1)$$
where the link function is the logit, $g(t) = \frac{e^t}{1+e^t}$, $\mu$ represents the intercept, and $f_j$ ($j \in [q_n]$) is an unknown smooth nonlinear function [27,32,33]. The corresponding true values are denoted by $\mu^*$ and $f_j^*$, respectively. The dimension of the covariates $X$ is notably large and increases with $n$. Typically, one assumes model sparsity, which implies that only a subset of the covariates $X$ correlates with the response variable. Let the true model be represented as $\Omega^* = \{j \in [q_n]: f_j^*(\cdot) \neq 0\}$, with size denoted by $s_0$. Without loss of generality, we assume that each unknown additive component $f_j^*$ is centered, meaning that $E[f_j^*(X_j)] = 0$.
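To make the model concrete, the following sketch (our own illustration, not code from the paper; the component functions, dimensions, and seed are purely illustrative) simulates latent statuses $\tilde Y_i$ from model (1) with the logistic link:

```python
import numpy as np

def risk_probability(X, mu, components):
    """Risk probability p_i = g(mu + sum_j f_j(X_ij)) of model (1), with the
    logistic link g(t) = e^t / (1 + e^t). `components` maps a column index j
    to a vectorized f_j; unlisted columns have f_j = 0 (the sparsity assumption)."""
    eta = np.full(X.shape[0], float(mu))
    for j, f in components.items():
        eta += f(X[:, j])
    return 1.0 / (1.0 + np.exp(-eta))

# Illustrative use: two active components out of q_n = 50
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 50))
p = risk_probability(X, mu=-4.0,
                     components={0: lambda x: 2 * x,
                                 1: lambda x: (2 * x - 0.25) ** 2 - 1.4})
Y_tilde = rng.binomial(1, p)  # latent true statuses, unobserved in group testing
```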
In group testing, the true status of an individual is not directly observable; instead, tests are performed at the group level. Moreover, for groups flagged as high-risk due to positive results, additional tests, including individual tests, may be conducted, as individuals within these groups are more likely to be infected [34]. A typical group testing algorithm contains two steps. First, partition the specimens into $J$ groups and test each group. Second, conduct additional tests for the groups that tested positive in the first step. We denote the $J$ groups by $\{P_{j,1}\}_{j=1}^J$ with group sizes $\{k_j\}_{j=1}^J$, satisfying $\cup_{j=1}^J P_{j,1} = \{1,\dots,n\}$. We define $P_j = (P_{j,1},\dots,P_{j,T_j})$, where $P_{j,1}$ denotes the initial $j$th group and $P_{j,l}$ denotes the subsequent subgroups or individuals within this group for $l = 2,\dots,T_j$, with $T_j$ representing the total number of tests associated with the $j$th group. In Figure 1, we present a typical example of the Halving algorithm with a group size of 6, where the total number of tests is $T_j = 6$. Correspondingly, let $Z_j = (Z_{j,1},\dots,Z_{j,T_j})$ encompass all the test results associated with $P_{j,1}$, including subgroup and individual tests.
Considering measurement error, we denote sensitivity and specificity by $S_e$ and $S_p$, respectively. Sensitivity is the probability that a test correctly identifies a positive sample as positive, while specificity is the probability that a test correctly identifies a negative sample as negative. We assume these values are known and do not depend on the group size. We denote by $\tilde Z_j = (\tilde Z_{j,1},\dots,\tilde Z_{j,T_j})$ the true statuses corresponding to $Z_j$, where $\tilde Z_{j,l} = \max_{i \in P_{j,l}} \tilde Y_i$. Given the true status $\tilde Z_{j,l}$, the outcome $Z_{j,l}$ follows a Bernoulli distribution with success probability $S_e^{\tilde Z_{j,l}}(1-S_p)^{1-\tilde Z_{j,l}}$ for $l \in [T_j]$ and $j \in [J]$, and the outcomes are conditionally independent given their true statuses. Given $X_1,\dots,X_n$, the individual statuses $\tilde Y_1,\dots,\tilde Y_n$ are conditionally independent. Similarly, the group statuses $\{\tilde Z_{j,l},\, l \in [T_j]\}$ are conditionally independent given $\{\tilde Y_i,\, i \in P_{j,1}\}$. These assumptions are commonly made in the literature [13,16]. Let $\tilde{\mathbf Y} = (\tilde Y_1,\dots,\tilde Y_n)$ and $\mathbf X = (X_1,\dots,X_n)^T$. Then, the log-likelihood function of $\tilde{\mathbf Y}$ is expressed as follows:
$$\ln P(\tilde{\mathbf Y} \mid \mathbf X) = \sum_{i=1}^n \big[\tilde Y_i \ln(p_i) + (1-\tilde Y_i)\ln(1-p_i)\big],$$
where $p_i = P(\tilde Y_i = 1 \mid X_i) = g(\mu + \sum_{j=1}^{q_n} f_j(X_{ij}))$. We note that the individual true statuses $\tilde{\mathbf Y}$ are unknown, whereas the group-level test results $\mathbf Z = \{Z_j,\, j \in [J]\}$ are observed. The corresponding likelihood function is as follows:
$$P(\mathbf Z \mid \mathbf X) = \sum_{\tilde{\mathbf Y} \in \{0,1\}^n} P(\mathbf Z \mid \tilde{\mathbf Y})\, P(\tilde{\mathbf Y} \mid \mathbf X),$$
where the conditional probability $P(\mathbf Z \mid \tilde{\mathbf Y})$ is formulated as follows:
$$P(\mathbf Z \mid \tilde{\mathbf Y}) = \prod_{j=1}^J P(Z_j \mid \tilde{\mathbf Y}) = \prod_{j=1}^J \prod_{l=1}^{T_j} \big\{S_e^{\tilde Z_{j,l}}(1-S_p)^{1-\tilde Z_{j,l}}\big\}^{Z_{j,l}} \big\{(1-S_e)^{\tilde Z_{j,l}} S_p^{1-\tilde Z_{j,l}}\big\}^{1-Z_{j,l}}.$$
With the above likelihood function, our goal is to estimate the true model $\Omega^*$ and obtain an accurate estimate of the risk probability through model (1). Since model (1) is a nonparametric additive model, we employ the B-spline method, known for its effectiveness in such nonparametric frameworks.
Let $S_n$ denote the space of polynomial splines of degree $l \ge 1$, and let $\phi_{jk}(\cdot)$, $k \in [m_n]$, denote the standardized B-spline basis functions satisfying $\|\phi_{jk}\|_\infty \le 1$. Then for any $f_{nj} \in S_n$, there exist coefficients $\beta_{jk}$, $k \in [m_n]$, such that
$$f_{nj}(x) = \sum_{k=1}^{m_n} \beta_{jk}\phi_{jk}(x).$$
Using $f_{nj}(x)$ allows us to approximate the nonlinear function $f_j(x)$. Let $\beta_j = (\beta_{j1},\dots,\beta_{jm_n})^T$ and $\Phi_{ij} = (\phi_{j1}(X_{ij}),\dots,\phi_{jm_n}(X_{ij}))^T$. Then $\mu + \sum_{j=1}^{q_n} f_j(X_{ij})$ can be approximated by $\Phi_i^T\beta$, where $\Phi_i = (1, \Phi_{i1}^T,\dots,\Phi_{iq_n}^T)^T$ and $\beta = (\beta_0, \beta_1^T,\dots,\beta_{q_n}^T)^T$ includes the intercept $\beta_0$. Therefore, the risk probability $p_i = g(\mu + \sum_{j=1}^{q_n} f_j(X_{ij}))$ is approximated by $p_i^\beta = g(\Phi_i^T\beta)$ for $i \in [n]$.
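As an illustration of the basis expansion above, the following sketch (our own minimal implementation of the Cox–de Boor recursion, not the authors' code; the knot placement and degree are illustrative choices) builds the matrix whose rows play the role of $\Phi_{ij}^T$ for one covariate:

```python
import numpy as np

def bspline_design(x, knots, degree):
    """All B-spline basis functions of `degree` evaluated at x (Cox-de Boor).
    `knots` is a full non-decreasing knot vector with boundary knots repeated
    degree + 1 times; returns an (len(x), len(knots) - degree - 1) matrix.
    Points equal to the right boundary fall outside the half-open intervals."""
    x, t = np.asarray(x, float), np.asarray(knots, float)
    B = ((t[:-1] <= x[:, None]) & (x[:, None] < t[1:])).astype(float)  # degree 0
    for k in range(1, degree + 1):
        ld, rd = t[k:-1] - t[:-k-1], t[k+1:] - t[1:-k]   # recursion denominators
        left = np.where(ld > 0, (x[:, None] - t[:-k-1]) / np.where(ld > 0, ld, 1), 0)
        right = np.where(rd > 0, (t[k+1:] - x[:, None]) / np.where(rd > 0, rd, 1), 0)
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

# m_n = 6 cubic basis functions on [-1, 1] for one covariate column x_j
deg, m_n = 3, 6
interior = np.linspace(-1, 1, m_n - deg + 1)[1:-1]
knots = np.concatenate([[-1.0] * (deg + 1), interior, [1.0] * (deg + 1)])
x_j = np.random.default_rng(1).uniform(-1, 1, 200)
Phi_j = bspline_design(x_j, knots, deg)   # row i corresponds to Phi_ij in the text
```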
Consequently, the corresponding likelihood function is
$$P_\beta(\mathbf Z \mid \mathbf X) = \sum_{\tilde{\mathbf Y} \in \{0,1\}^n} P(\mathbf Z \mid \tilde{\mathbf Y})\, P_\beta(\tilde{\mathbf Y} \mid \mathbf X), \quad (5)$$
where $P_\beta$ represents the probability measure with respect to the parameter $\beta$ and
$$\ln P_\beta(\tilde{\mathbf Y} \mid \mathbf X) = \sum_{i=1}^n \Big[\tilde Y_i \ln\Big(\frac{p_i^\beta}{1-p_i^\beta}\Big) + \ln(1-p_i^\beta)\Big].$$
Given the high dimensionality of $\mathbf X$, we propose maximizing $P_\beta(\mathbf Z \mid \mathbf X)$ with the group lasso penalty to select the important components, obtaining the estimator
$$\hat\beta = \arg\max_\beta \Big\{\ln P_\beta(\mathbf Z \mid \mathbf X) - n\lambda\sum_{j=1}^{q_n}\|\beta_j\|_2\Big\},$$
where $\lambda$ is the tuning parameter and the intercept $\beta_0$ is not penalized. The model $\Omega^*$ is estimated by $\hat\Omega = \{j \in [q_n]: \|\hat\beta_j\|_2 \neq 0\}$, and the nonlinear components are estimated by $\hat f_{nj}(x) = \sum_{k=1}^{m_n}\hat\beta_{jk}\phi_{jk}(x)$, $j \in [q_n]$.
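The group penalty $\sum_j \|\beta_j\|_2$ zeroes out entire coefficient blocks at once. One standard way to handle it numerically, shown below as a hedged sketch (the paper does not prescribe this particular operator), is the proximal map of the penalty, i.e., block soft-thresholding:

```python
import numpy as np

def prox_group_lasso(beta, groups, step, lam):
    """Proximal map of step * lam * sum_j ||beta_j||_2 (block soft-thresholding).
    `groups` lists the index array of each block beta_j; indices not covered by
    any group (e.g., the intercept beta_0) are left unpenalized."""
    out = beta.copy()
    for idx in groups:
        norm = np.linalg.norm(beta[idx])
        shrink = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
        out[idx] = shrink * beta[idx]
    return out
```

Blocks whose norm falls below step $\times$ lam are set exactly to zero, which is what makes the selected set $\hat\Omega$ well defined.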

2.3. The Estimation Procedure with EM Algorithm

The individual statuses $\tilde{\mathbf Y}$ are latent and not directly observable, posing a challenge in maximizing (5) to obtain the estimator. To address this issue, we employ the EM algorithm, which iterates between two steps: the E-step and the M-step. For the $(t+1)$th iteration, these two steps are carried out as follows. In the E-step, we calculate the conditional expectation of the complete log-likelihood function of $\tilde{\mathbf Y}$, given the observed data $\mathbf Z$ and the $t$th iterate $\beta^{(t)}$, denoted by $E[\ln f(\tilde{\mathbf Y} \mid \beta) \mid \mathbf Z, \beta^{(t)}]$. In the M-step, we maximize the regularized function $S^{(t)}(\beta) = E[\ln f(\tilde{\mathbf Y} \mid \beta) \mid \mathbf Z, \beta^{(t)}] - n\lambda\sum_{s=1}^{q_n}\|\beta_s\|_2$ and update the parameter to $\beta^{(t+1)}$. Our method is summarized in Algorithm 1.
Algorithm 1 Regularized GAM for group testing.
Input: The observed outcomes $\mathbf Z$, covariates $\mathbf X$, and the initial value $\beta^{(0)}$.
Compute:
Step 1 (E-step). Given the parameter at the $t$th iteration, $\beta^{(t)}$, calculate the conditional expectation $E[\ln f(\tilde{\mathbf Y} \mid \beta) \mid \mathbf Z, \beta^{(t)}]$.
Step 2 (M-step). Update the parameter to $\beta^{(t+1)}$ by maximizing the objective function $S^{(t)}(\beta) = E\big[\sum_{i=1}^n \big[\tilde Y_i \ln\big(\frac{p_i^\beta}{1-p_i^\beta}\big) + \ln(1-p_i^\beta)\big] \,\big|\, \mathbf Z, \beta^{(t)}\big] - n\lambda\sum_{s=1}^{q_n}\|\beta_s\|_2$.
Step 3. Repeat the E-step and M-step until the parameters converge.
Output: The estimates $\hat\beta$, $\hat f_{nj}$, $j \in [q_n]$, and $\hat\Omega$.
We should note that the conditional expectation in the E-step varies across different group testing methods. The specific formulas are presented in Section 3 to facilitate the implementation of the EM algorithm.
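A minimal skeleton of Algorithm 1 might look as follows. This is a sketch under our own assumptions — a fixed number of proximal gradient iterations in the M-step, and a user-supplied `estep_weights` implementing the algorithm-specific formulas of Section 3 — not the authors' implementation; the function names are hypothetical, and `prox_group_lasso` is the sketch given above.

```python
import numpy as np

def em_group_lasso(Phi, groups, estep_weights, lam, lr=1e-3, outer=100, inner=50):
    """EM for the penalized group testing likelihood (a sketch of Algorithm 1).
    Phi           : (n, 1 + q_n * m_n) spline design matrix, first column all ones
    groups        : index arrays of the q_n coefficient blocks (intercept excluded)
    estep_weights : function p -> w with w_i = E[Y_i | Z, beta^(t)] (Section 3)
    """
    n = Phi.shape[0]
    beta = np.zeros(Phi.shape[1])
    for _ in range(outer):
        p = 1.0 / (1.0 + np.exp(-Phi @ beta))
        w = estep_weights(p)                       # E-step
        for _ in range(inner):                     # M-step via proximal gradient steps
            p = 1.0 / (1.0 + np.exp(-Phi @ beta))
            beta = beta + lr * (Phi.T @ (w - p))   # gradient of the Q-function
            beta = prox_group_lasso(beta, groups, lr, n * lam)
        # a convergence check on beta would normally terminate the outer loop
    return beta
```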

2.4. The Consistency of the Estimate

In this section, we establish the theoretical results for our method. We begin by presenting the conditions required to establish our main results; these conditions are commonly employed in B-spline methodologies. In Conditions D and E, a set $U_m$ is introduced and defined as follows: the results $\{Z_1,\dots,Z_J\}$ are categorized into $M$ classes, with $U_m$ representing the index set of those $Z_j$ belonging to the $m$th category, for $m \in [M]$.
A. The nonparametric components $\{f_j^*\}_{j=1}^{q_n}$ belong to the function class $\mathcal F$, in which the $r$th derivative $f^{(r)}$ exists and is Lipschitz of order $\alpha$:
$$\mathcal F = \big\{f(\cdot): |f^{(r)}(s) - f^{(r)}(t)| \le K|s-t|^\alpha,\ \forall\, s,t \in [a,b]\big\}$$
for some positive constant $K$, where $r$ is a non-negative integer, $[a,b]$ is the support of the covariate, and $\alpha \in (0,1]$ is such that $d = r + \alpha > 0.5$.
B. $Ef_j^* = 0$, and the marginal density function $g_j$ of $X_j$ satisfies $0 < K_1 \le g_j(X_j) \le K_2 < \infty$ on $[a,b]$ for $1 \le j \le q_n$, where $K_1$ and $K_2$ are constants.
C. $\min_{j \in \Omega^*}\{\|f_j^*\|_2\} \gg q_n n^{-\frac{2d-1}{2(2d+1)}}$, with $q_n = o\big(n^{\frac{3}{4} - \frac{1}{2(2d+1)}}\big)$.
D. $|U_m|/J \to \gamma_m \in (0,1)$ as $J \to \infty$.
E. $I_m(\beta^*) = -\frac{1}{|U_m|}\sum_{j \in U_m}\frac{\partial^2 \log P_\beta(Z_j \mid \mathbf X)}{\partial\beta\,\partial\beta^T}\Big|_{\beta=\beta^*}$ is positive-definite and possesses a minimum eigenvalue larger than $\tau_m$ for every $m \in [M]$.
Conditions A–C are commonly employed for nonparametric additive models, with similar assumptions made by Huang et al. [35]. When the dimension $q_n$ is of order $O(1)$, Condition C is less stringent than Condition (A1) of Huang et al. [35]. Conditions D and E pertain to the high-dimensional group testing data and guarantee an adequate number of observations for each category of testing results, as employed by Gregory et al. [13].
Let the coefficients $\beta^*$ represent the population-level parameter used to approximate the nonparametric functions, given by
$$\beta^* = \arg\min_\beta\; E_{\mathbf Z}\big[-\log P_\beta(\mathbf Z \mid \mathbf X)\big].$$
Under mild conditions, our penalized estimate $\hat\beta$ converges to $\beta^*$, as demonstrated in the following theorem:
Theorem 1. 
Given Conditions A–E, $m_n = O(n^{\frac{1}{2d+1}})$, and $\lambda = o(q_n^{1/2}n^{\gamma/2}/n)$, we have
$$\|\hat\beta - \beta^*\|_2 = O_p\big(q_n^{1/2} n^{-(1-\gamma)/2}\big),$$
where $\gamma = \frac{1}{2d+1}$.
This theorem reveals that the difference between $\beta^*$ and its estimate converges to zero in probability at the rate $O_p(q_n^{1/2}n^{-(1-\gamma)/2})$. According to Huang et al. [35], we set $m_n = O(n^{\frac{1}{2d+1}})$ to achieve the optimal convergence rate. In what follows, we derive the convergence rate between $f_j^*$ and its estimate $\hat f_{nj} = \Phi_j^T\hat\beta_j$, and we then validate the model selection property of our method.
Theorem 2. 
Given the conditions in Theorem 1, we have
(1) $\|\hat f_{nj} - f_j^*\|_2 = O_p\big(q_n n^{-\frac{2d-1}{2(2d+1)}}\big)$;
(2) $P(\Omega^* \subseteq \hat\Omega) \to 1$ as $n \to \infty$.
The above theorem demonstrates the convergence of the estimator $\hat f_{nj}$ to $f_j^*$, which quantifies the importance of the $j$th component. It also establishes the sure screening property of the set $\hat\Omega$: with probability approaching 1, $\hat\Omega$ encompasses the true model $\Omega^*$.

3. The Conditional Expectation

In this section, we present the derivations of the conditional expectations of $\tilde Y_i$ given the observed group results, a crucial component of the EM algorithm. These expectations are denoted by $w_i^{(t)} = E[\tilde Y_i \mid Z_1,\dots,Z_J, \mu^{(t)}, \beta^{(t)}]$. Since $\tilde Y_i$ follows a Bernoulli distribution, we may write $w_i^{(t)} = P(\tilde Y_i = 1 \mid Z_1,\dots,Z_J, \mu^{(t)}, \beta^{(t)})$. The form of this conditional expectation depends on the group testing algorithm being implemented. Detailed derivations are provided for three group testing methods: the master pool, Dorfman's algorithm, and Halving. The derivation for the array method is considerably more involved and is omitted here for brevity.
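Before specializing, note that for the small group sizes used in practice, all of these conditional expectations can be computed by brute-force enumeration of the $2^{k_j}$ possible status vectors of a group. The sketch below is our own (the helper name and interface are assumptions; `prob_tests_given_y` is a user-supplied function returning $P(Z_j \mid \tilde Y_{P_{j,1}} = y)$ under the error model of Section 2.2) and applies equally to the master pool, Dorfman, and Halving algorithms:

```python
import itertools
import numpy as np

def estep_enumeration(p_group, prob_tests_given_y):
    """w_i = P(Y_i = 1 | Z_j) for each member of one group, obtained by summing
    the joint probability over all 2^k status vectors y; feasible for small k."""
    k = len(p_group)
    num, den = np.zeros(k), 0.0
    for y in itertools.product([0, 1], repeat=k):
        y = np.asarray(y)
        prior = np.prod(np.where(y == 1, p_group, 1.0 - p_group))  # P(Y = y | X)
        joint = prob_tests_given_y(y) * prior                      # P(Z_j, Y = y)
        den += joint
        num += joint * y
    return num / den
```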

3.1. Master Pool

With the master pool method, specimens are divided into non-overlapping groups, and each individual is exclusively assigned to one of these groups. Testing is performed on each group, without requiring additional tests. Then we have the following:
$$\omega_i^{(t)} = \begin{cases} P(\tilde Y_i = 1 \mid Z_{j,1} = 0) = (1-S_e)\,p_i^{\beta^{(t)}}\big/\big(1 - P(Z_{j,1}=1)\big), & \text{if } Z_{j,1} = 0,\\ P(\tilde Y_i = 1 \mid Z_{j,1} = 1) = S_e\,p_i^{\beta^{(t)}}\big/P(Z_{j,1}=1), & \text{if } Z_{j,1} = 1,\end{cases}$$
where $P(Z_{j,1}=1) = S_e + (1-S_e-S_p)\prod_{i \in P_{j,1}}(1-p_i^{\beta^{(t)}})$.
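In code, the closed-form master pool weights above amount to the following short function (a sketch; the default `se` and `sp` values are those used in our simulations):

```python
import numpy as np

def estep_master_pool(p_group, z, se=0.98, sp=0.98):
    """Master pool E-step weights for one group with a single test result z."""
    p_z1 = se + (1.0 - se - sp) * np.prod(1.0 - p_group)   # P(Z_{j,1} = 1)
    if z == 1:
        return se * p_group / p_z1
    return (1.0 - se) * p_group / (1.0 - p_z1)
```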

3.2. Dorfman

In Dorfman's algorithm, specimens are partitioned into non-overlapping groups, as in the master pool method, and group tests are conducted. Subsequently, individual tests are conducted for the groups that test positive. When $Z_{j,1} = 0$, the conditional expectation mirrors that of master pool testing, namely $\omega_i^{(t)} = P(\tilde Y_i = 1 \mid Z_{j,1}=0) = (1-S_e)\,p_i^{\beta^{(t)}}/(1 - P(Z_{j,1}=1))$. When the $j$th group tests positive, $Z_{j,1} = 1$, individual tests are performed subsequently. Let $Z_{j,s:t} = (Z_{j,s},\dots,Z_{j,t})$, so that $Z_j$ is short for $Z_{j,1:T_j}$. Then we have the following:
$$\omega_i^{(t)} = P(\tilde Y_i = 1 \mid Z_{j,1}=1, Z_{j,2:T_j}) = \frac{P(\tilde Y_i = 1, Z_{j,1}=1, Z_{j,2:T_j})}{P(Z_{j,1}=1, Z_{j,2:T_j})}.$$
The denominator is expressed as follows:
$$P(Z_{j,1}=1, Z_{j,2:T_j}) = \sum_{\tilde Z_j} P(Z_{j,1}=1, Z_{j,2:T_j} \mid \tilde Z_j)\,P(\tilde Z_j) = \sum_{\tilde Z_j} S_e^{\tilde Z_{j,1}}(1-S_p)^{1-\tilde Z_{j,1}}\,P(Z_{j,2:T_j} \mid \tilde Z_{j,2:T_j})\,P(\tilde Z_j),$$
where $\tilde Z_j = (\tilde Z_{j,1},\dots,\tilde Z_{j,T_j})$ and $\tilde Z_{j,2:T_j} = (\tilde Z_{j,2},\dots,\tilde Z_{j,T_j})$. Without loss of generality, we assume that $\tilde Y_i$ represents the true status of the first individual of the group; similar calculations apply to the other cases. The numerator is then as follows:
$$P(\tilde Y_i = 1, Z_{j,1}=1, Z_{j,2:T_j}) = P(Z_{j,1}=1, Z_{j,2}, Z_{j,3:T_j} \mid \tilde Y_i = 1)\,P(\tilde Y_i = 1) = S_e\, S_e^{Z_{j,2}}(1-S_e)^{1-Z_{j,2}}\,P(Z_{j,3:T_j})\,p_i^{\beta^{(t)}}.$$
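These Dorfman weights can be checked numerically against the enumeration sketch given at the start of this section; only the error model $P(Z_j \mid y)$ changes, being a group-level term times one term per individual retest. A hedged example (the helper names are ours), assuming the group tested positive so that all individual results are observed:

```python
import numpy as np

def dorfman_prob(z_obs, se=0.98, sp=0.98):
    """Error model P(Z_j | y) for a positive Dorfman group, where
    z_obs = (group result, individual results) and the corresponding
    true statuses are (max(y), y)."""
    def prob(y):
        z_tilde = np.concatenate(([y.max()], y))
        return np.prod(np.where(z_obs == 1,
                                np.where(z_tilde == 1, se, 1.0 - sp),
                                np.where(z_tilde == 1, 1.0 - se, sp)))
    return prob

z_obs = np.array([1, 1, 0, 0, 0])            # positive pool; first retest positive
w = estep_enumeration(np.full(4, 0.1), dorfman_prob(z_obs))
```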

3.3. Halving

With the Halving algorithm, the testing procedure is performed in three stages. The first stage is identical to the master pool method. In the second stage, groups that test positive are evenly split into two subgroups. In the third stage, these subgroups are tested, and positive ones undergo further individual tests. When $Z_{j,1} = 0$, the conditional expectation aligns with master pool testing, with $\omega_i^{(t)} = P(\tilde Y_i = 1 \mid Z_{j,1}=0) = (1-S_e)\,p_i^{\beta^{(t)}}/(1 - P(Z_{j,1}=1))$. Assuming the $j$th group tests positive with $Z_{j,1} = 1$, sequential tests are further carried out. These tests can unfold in any of the following scenarios:
(1)
$Z_j = (1,0,0)$ signifies that the group containing the $i$th specimen tested positive in the first stage, followed by negative results in both subgroup tests; the $i$th specimen is therefore concluded to be negative. Then, we have the following:
$$\omega_i^{(t)} = P(\tilde Y_i = 1 \mid Z_{j,1:3}=(1,0,0)) = \frac{P(\tilde Y_i = 1, Z_{j,1:3}=(1,0,0))}{P(Z_{j,1:3}=(1,0,0))},$$
where the denominator is as follows:
$$P(Z_{j,1:3}=(1,0,0)) = \sum_{\tilde Z_j} P(Z_{j,1:3}=(1,0,0) \mid \tilde Z_{j,1:3})\,P(\tilde Z_{j,1:3}) = \sum_{\tilde Z_j} S_e^{\tilde Z_{j,1}}(1-S_p)^{1-\tilde Z_{j,1}}(1-S_e)^{\tilde Z_{j,2}+\tilde Z_{j,3}}S_p^{2-\tilde Z_{j,2}-\tilde Z_{j,3}}\,P(\tilde Z_{j,1:3}).$$
For simplicity, we assume that $\tilde Y_i$ is the first individual of the subgroup $P_{j,2}$; the derivation is similar in the other cases. Then, the numerator is as follows:
$$P(\tilde Y_i = 1, Z_{j,1:3}=(1,0,0)) = P(Z_{j,1:3}=(1,0,0) \mid \tilde Y_i = 1)\,P(\tilde Y_i = 1) = S_e(1-S_e)\,P(Z_{j,3}=0)\,p_i^{\beta^{(t)}}.$$
(2)
For $Z_{j,1:3} = (1,1,0)$, we have the following:
$$\omega_i^{(t)} = P(\tilde Y_i = 1 \mid Z_{j,1:3}=(1,1,0), Z_{j,4:T_j}) = \frac{P(\tilde Y_i = 1, Z_{j,1:3}=(1,1,0), Z_{j,4:T_j})}{P(Z_{j,1:3}=(1,1,0), Z_{j,4:T_j})},$$
where the denominator is as follows:
$$P(Z_{j,1:3}=(1,1,0), Z_{j,4:T_j}) = \sum_{\tilde Z_j} P(Z_{j,1:3}=(1,1,0), Z_{j,4:T_j} \mid \tilde Z_j)\,P(\tilde Z_j) = \sum_{\tilde Z_j} S_e^{\tilde Z_{j,1}+\tilde Z_{j,2}}(1-S_p)^{2-\tilde Z_{j,1}-\tilde Z_{j,2}}(1-S_e)^{\tilde Z_{j,3}}S_p^{1-\tilde Z_{j,3}}\,P(Z_{j,4:T_j} \mid \tilde Z_{j,4:T_j})\,P(\tilde Z_j).$$
For the numerator, we consider the two possible locations of the $i$th specimen. If $\tilde Y_i$ is the first individual in the first subgroup (so that, correspondingly, $\tilde Z_{j,4} = 1$), then the numerator is as follows:
$$P(\tilde Y_i = 1, Z_{j,1:3}=(1,1,0), Z_{j,4:T_j}) = P(Z_{j,1:3}=(1,1,0), Z_{j,4:T_j} \mid \tilde Y_i = 1)\,P(\tilde Y_i = 1) = S_e^{2+Z_{j,4}}(1-S_e)^{1-Z_{j,4}}\,P(Z_{j,3}=0)\,P(Z_{j,5:T_j})\,p_i^{\beta^{(t)}}.$$
Otherwise, if $\tilde Y_i$ corresponds to the first individual in the second subgroup, then the numerator is as follows:
$$P(\tilde Y_i = 1, Z_{j,1:3}=(1,1,0), Z_{j,4:T_j}) = P(Z_{j,1:3}=(1,1,0), Z_{j,4:T_j} \mid \tilde Y_i = 1)\,P(\tilde Y_i = 1) = S_e(1-S_e)\,P(Z_{j,2}=1)\,P(Z_{j,4:T_j})\,p_i^{\beta^{(t)}}.$$
(3)
For $Z_{j,1:3} = (1,0,1)$, we have the following:
$$\omega_i^{(t)} = P(\tilde Y_i = 1 \mid Z_{j,1:3}=(1,0,1), Z_{j,4:T_j}) = \frac{P(\tilde Y_i = 1, Z_{j,1:3}=(1,0,1), Z_{j,4:T_j})}{P(Z_{j,1:3}=(1,0,1), Z_{j,4:T_j})},$$
where the denominator is as follows:
$$P(Z_{j,1:3}=(1,0,1), Z_{j,4:T_j}) = \sum_{\tilde Z_j} P(Z_{j,1:3}=(1,0,1), Z_{j,4:T_j} \mid \tilde Z_j)\,P(\tilde Z_j) = \sum_{\tilde Z_j} S_e^{\tilde Z_{j,1}+\tilde Z_{j,3}}(1-S_p)^{2-\tilde Z_{j,1}-\tilde Z_{j,3}}(1-S_e)^{\tilde Z_{j,2}}S_p^{1-\tilde Z_{j,2}}\,P(Z_{j,4:T_j} \mid \tilde Z_{j,4:T_j})\,P(\tilde Z_j).$$
Similarly, we derive the numerator according to the location of the $i$th specimen. If it is assigned to the first subgroup, then the numerator is as follows:
$$P(\tilde Y_i = 1, Z_{j,1:3}=(1,0,1), Z_{j,4:T_j}) = P(Z_{j,1:3}=(1,0,1), Z_{j,4:T_j} \mid \tilde Y_i = 1)\,P(\tilde Y_i = 1) = S_e(1-S_e)\,P(Z_{j,3}=1)\,P(Z_{j,4:T_j})\,p_i^{\beta^{(t)}}.$$
If it is assigned to the second subgroup, as its first individual (so that $\tilde Z_{j,4} = 1$), then the numerator is as follows:
$$P(\tilde Y_i = 1, Z_{j,1:3}=(1,0,1), Z_{j,4:T_j}) = P(Z_{j,1:3}=(1,0,1), Z_{j,4:T_j} \mid \tilde Y_i = 1)\,P(\tilde Y_i = 1) = S_e^{2+Z_{j,4}}(1-S_e)^{1-Z_{j,4}}\,P(Z_{j,2}=0)\,P(Z_{j,5:T_j})\,p_i^{\beta^{(t)}}.$$
(4)
For $Z_{j,1:3} = (1,1,1)$, we have the following:
$$\omega_i^{(t)} = P(\tilde Y_i = 1 \mid Z_{j,1:3}=(1,1,1), Z_{j,4:T_j}) = \frac{P(\tilde Y_i = 1, Z_{j,1:3}=(1,1,1), Z_{j,4:T_j})}{P(Z_{j,1:3}=(1,1,1), Z_{j,4:T_j})},$$
where the denominator is as follows:
$$P(Z_{j,1:3}=(1,1,1), Z_{j,4:T_j}) = \sum_{\tilde Z_j} P(Z_{j,1:3}=(1,1,1), Z_{j,4:T_j} \mid \tilde Z_j)\,P(\tilde Z_j) = \sum_{\tilde Z_j} \prod_{k=1}^3 S_e^{\tilde Z_{j,k}}(1-S_p)^{1-\tilde Z_{j,k}}\,P(Z_{j,4:T_j} \mid \tilde Z_{j,4:T_j})\,P(\tilde Z_j).$$
In this case, the location of the $i$th specimen among the subgroups does not affect the form of the conditional expectation, so we assume it is the first individual in the first subgroup. Given $\tilde Y_i = 1$, we have $\tilde Z_{j,1} = \tilde Z_{j,2} = \tilde Z_{j,4} = 1$, and the numerator is derived as follows:
$$P(\tilde Y_i = 1, Z_{j,1:3}=(1,1,1), Z_{j,4:T_j}) = P(Z_{j,1:3}=(1,1,1), Z_{j,4:T_j} \mid \tilde Y_i = 1)\,P(\tilde Y_i = 1) = S_e^{2+Z_{j,4}}(1-S_e)^{1-Z_{j,4}}\,P(Z_{j,3}=1)\,P(Z_{j,5:T_j})\,p_i^{\beta^{(t)}}.$$

4. Simulation

In this section, we conduct numerical studies to demonstrate the performance of Algorithm 1 and assess the effectiveness of modeling group testing data within the generalized additive model framework. We adapt the methodology of Liu et al. [26] to generate group testing data with high-dimensional covariates. We employ the following additive models:
Example 1. 
($n = 1000$ and $500$, $q_n = 50$, $s_0 = 4$). The model is $P(\tilde Y_i = 1 \mid X_i) = g(\mu + f_1(X_{i1}) + f_2(X_{i2}) + f_3(X_{i3}) + f_4(X_{i4}))$, where $\mu = -4$, $f_1(x) = 2x$, $f_2(x) = (2x - \frac14)^2 - 1.4$, $f_3(x) = 3\sin(x) + \cos(x^2) - \frac56$, $f_4(x) = 3e^{-(x+1.5)^2/0.8} - \frac14$, and $f_j = 0$ for $j = 5,\dots,q_n$. The covariates are generated from the uniform distribution $U(-1,1)$. In this case, the average prevalence rate is about 0.112. Figure 2 presents the true components and their corresponding estimators in a simulation run.
Example 2. 
($n = 500$, $q_n = 50$ and $100$, $s_0 = 4$, $\rho = 0.2$). In contrast to Example 1, Example 2 incorporates a correlation structure within the covariates while utilizing the same functions for data generation. The covariates $(X_1,\dots,X_n)^T$ are generated as
$$X_j = \frac{W_j + tU}{\sqrt{1+t^2}}, \quad j \in [q_n],$$
where $W_1,\dots,W_{q_n}$ and $U$ are i.i.d. samples from Uniform$(-1,1)$, and $t = \sqrt{\rho/(1-\rho)}$. In this case, the disease prevalence rate is about 0.064.
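As we read the construction in Example 2, the covariates can be generated as follows (a sketch; the seed is arbitrary):

```python
import numpy as np

def correlated_uniform_covariates(n, q_n, rho, rng):
    """X_j = (W_j + t U) / sqrt(1 + t^2) with t = sqrt(rho / (1 - rho)),
    so that distinct columns have pairwise correlation rho."""
    t = np.sqrt(rho / (1.0 - rho))
    W = rng.uniform(-1, 1, size=(n, q_n))
    U = rng.uniform(-1, 1, size=(n, 1))
    return (W + t * U) / np.sqrt(1.0 + t * t)

X = correlated_uniform_covariates(500, 50, rho=0.2, rng=np.random.default_rng(2))
```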
Example 3. 
($n = 1000$, $q_n = 50$ and $100$, $s_0 = 3$, $\rho = 0.2$). The model is defined as $P(\tilde Y_i = 1 \mid X_i) = g(\mu + f_1(X_{i1}) + f_2(X_{i2}) + f_3(X_{i3}))$, with $\mu = -3.5$, $f_1(x) = 2x$, $f_2(x) = (2x - \frac14)^2 - 1.1$, $f_3(x) = 3\sin(\frac{\pi}{4}x) + \cos^2(\frac{\pi}{4}x) - 0.9$, and $f_j = 0$ for $j = 4,\dots,q_n$. The covariates are simulated as in Example 2, and the average disease prevalence rate is approximately 0.063.
Example 4. 
($n = 500$, $q_n = 50$ and $100$, $s_0 = 3$, $\rho = 0.2$). Following Example 3, we employ the same functions for data generation and conduct simulation experiments with varied values of $q_n$ while reducing the sample size to 500. The individual average prevalence rate is approximately 0.10.
We implement our method with four group testing algorithms: master pool testing (master pool), Dorfman, Halving, and array. For the master pool, individuals are grouped into non-overlapping pools of size 4 and tested without any subsequent individual tests. For Dorfman and Halving testing, the group size is fixed at 4. For array testing, individuals are arranged into $4 \times 4$ arrays. The sensitivity and specificity are both set to 0.98.
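For concreteness, the Dorfman arm of this design can be simulated as below (a sketch under our own assumptions; `Y` holds latent statuses generated from one of the examples):

```python
import numpy as np

def simulate_dorfman(Y, group_size=4, se=0.98, sp=0.98, rng=None):
    """Dorfman testing: one pooled test per group, then individual retests
    only for groups whose pooled test is positive."""
    if rng is None:
        rng = np.random.default_rng()
    results = []
    for start in range(0, len(Y), group_size):
        y = Y[start:start + group_size]
        z_group = rng.binomial(1, se if y.max() == 1 else 1.0 - sp)
        z_ind = rng.binomial(1, np.where(y == 1, se, 1.0 - sp)) if z_group else None
        results.append((z_group, z_ind))
    return results
```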
Let $\|\hat f_{nj}\|_n^2 = \frac1n\sum_{i=1}^n \hat f_{nj}^2(X_{ij}) = \hat\beta_j^T(E_n\Phi_j\Phi_j^T)\hat\beta_j$ denote the estimated effect of a component. The estimates $\hat\beta$ are obtained using Algorithm 1. Since we utilize stochastic gradient descent to optimize the regularized objective function, the coefficients of non-significant components may not be exactly zero. As illustrated in Figure 2, a variable is deemed significant and selected if $\|\hat f_{nj}\|_n$ deviates substantially from zero; in contrast, for non-significant variables, such as $x_5$ and $x_6$ depicted in Figure 2, the functional estimates closely approximate zero.
To assess the performance of Algorithm 1 in variable selection and model estimation, we conduct experiments across the four examples, repeating the procedure 100 times for each example. The initial parameters for each experiment are drawn from the standard normal distribution. We report the numbers of true positives (TPs) and false positives (FPs) to evaluate variable selection. Additionally, we employ the mean prediction error (PE) to measure the overall prediction error of the model, defined as $\mathrm{PE} = E_X[(\hat f(X) - f(X))^2]$. Table 1 reports the averages and corresponding standard deviations of TP, FP, and PE. The results demonstrate that Algorithm 1 effectively identifies significant variables and exhibits high robustness. In Table 1, we set a threshold of $\tau = 0.05$ to obtain the estimate $\hat\Omega$, employing stochastic gradient descent to optimize the objective. Notably, non-significant variables are mistakenly identified in only a few cases.
To further validate the performance of Algorithm 1, we utilize two additional metrics: $\mathrm{PE}_{f_j}$, the prediction error of each component, and $\mathrm{PE}_{\mathrm{non}}$, the average prediction error over the non-significant variables. In Table 2 and Table 3, we present the outcomes of the four models under the four group testing methods. Notably, the prediction error $\mathrm{PE}_{f_j}$ associated with each significant variable is relatively small, indicating that the B-spline approximation successfully captures the unknown nonlinear functions in Algorithm 1. These findings confirm Algorithm 1's strong variable selection capability: it precisely identifies important variables while rarely selecting non-significant ones.
Among the four group testing methods, Dorfman and array testing demonstrate superior performance compared with the master pool and Halving methods. This reflects the fact that Dorfman employs sufficient individual retesting and array testing utilizes more group-level tests. In contrast, Halving adopts a subgroup retest followed by individual tests, while the master pool uses only the initial pools and involves no retesting.

5. Application

In our study, we perform regression analysis using the dataset from the 1999–2004 National Health and Nutrition Examination Survey (NHANES). NHANES is a continuous cross-sectional survey conducted in the United States, which utilizes a probabilistic sampling method to select participants. The survey collects demographic information, health history, and behavioral data through interviews with participants and their families. The dataset contains 5515 observations ($n = 5515$) and classifies individuals as diabetic or non-diabetic. Following Yu et al. [36], we select a subset of covariates for regression analysis and divide the dataset into a training set ($n = 4412$) and a test set ($n = 1103$). The subset contains 14 variables that are potentially associated with the risk of diabetes: education ($X_1$), household income ($X_2$), physical activity ($X_3$), hypertension ($X_4$), family history ($X_5$), sex ($X_6$), race and ethnicity ($X_7$), alcohol use ($X_8$), BMI ($X_9$), height ($X_{10}$), waist circumference ($X_{11}$), weight ($X_{12}$), age ($X_{13}$), and smoking ($X_{14}$). To examine the performance of our method on high-dimensional data, we add non-significant variables drawn from the standard normal distribution, increasing the dimension to $q_n = 500$.
To perform group testing, we choose Dorfman's algorithm with group sizes $k = 2, 5, 8$, setting the sensitivity to $S_e = 0.95$ and the specificity to $S_p = 0.98$. Based on all available risk factors collected from each individual, we use the following model to describe an individual's diabetes status:
$$P(\tilde Y_i = 1 \mid X_i) = g\Big(\beta_0 + X_{i1}^T\beta_1 + \dots + X_{i7}^T\beta_7 + \sum_{j=8}^{q_n} f_j(X_{ij})\Big), \quad (6)$$
where the categorical variables $X_1$ to $X_7$ are transformed using one-hot encoding, represented by $X_{ij}$ for $j = 1,\dots,7$. Nonlinear functions are applied to the variables $X_8$ through $X_{14}$ to capture their relationship with diabetes status more accurately. The unknown functions $f_j(\cdot)$ in (6) allow variables to have a nonlinear influence on diabetes status. We assume that the link function $g(\cdot)$ is the logistic function. For comparison, we also report the results using the individual data ($k = 1$) and a generalized linear regression model (GLM) based on the original data. To demonstrate the performance of our algorithm, we report the following statistics: accuracy (ACC), positive predictive value ($\mathrm{PPV} = \frac{TP}{TP+FP}$), negative predictive value ($\mathrm{NPV} = \frac{TN}{TN+FN}$), and recall ($\mathrm{Recall} = \frac{TP}{TP+FN}$), where TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively. The results are presented in Table 4.
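The four reported statistics can be computed directly from the confusion counts; the snippet below is a straightforward sketch:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """ACC, PPV, NPV, and recall as defined above."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return {"ACC": (tp + tn) / (tp + fp + tn + fn),
            "PPV": tp / (tp + fp),
            "NPV": tn / (tn + fn),
            "Recall": tp / (tp + fn)}
```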
Our algorithm achieves an accuracy of approximately 83%, while the GLM achieves 81.50% in correctly classifying individuals as diabetic or non-diabetic. Our algorithm also outperforms the GLM on the other three measures. As expected, individual testing ($k = 1$) achieves the highest precision; however, our group testing algorithms with $k > 1$ perform competitively while saving a considerable number of tests compared with individual testing. Among the continuous variables $X_8$ to $X_{500}$, our algorithm selects $X_{10}$, $X_{11}$, and $X_{13}$; the other variables have little or no impact on the response. The estimates of the functions $f_{10}$, $f_{11}$, and $f_{13}$ are shown in Figure 3, along with the estimates obtained using the GLM. The slightly higher accuracy of our method over the GLM indicates its potential advantage in capturing more complex relationships and improving predictive capability.

6. Concluding Remarks

In this study, we propose a generalized nonparametric additive model to characterize high-dimensional group testing data. We use B-splines to approximate the unknown nonparametric functions. Theoretical results demonstrate that our method selects the true model with high probability and provides consistent estimates of the components. Our work extends existing work from generalized linear models to a more flexible generalized nonparametric model for high-dimensional data.
Consideration of the measurement error in a test kit is essential for practical applications. The unobservable nature of an individual’s true status, treated as a latent variable, prompts us to use the EM algorithm to maximize the regularized log-likelihood function for high-dimensional data. The E-step of the EM algorithm varies according to the structures of different group testing procedures and needs to be calculated separately to speed up the algorithm.
Another critical issue for group testing is case identification, whose efficiency could be improved with the help of covariates. In future work, we will study the improvement of our method for case identification, and implement an informative retest procedure to accurately identify the infections and further reduce costs.

Author Contributions

Methodology, W.X. and X.Z.; writing—original draft preparation, X.Z. and J.D.; validation, J.D. and J.Z.; formal analysis, X.Z., J.Z. and J.D.; writing—review and editing, J.D. and W.X.; supervision, W.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (grant nos. 11801102, 12361055), the Fundamental Research Funds for the Central Universities (423131), and Guangxi Natural Science Foundation (2021GXNSFAA220054).

Data Availability Statement

The data source is described within the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In this appendix, we first present a fact and a lemma that have been established in the literature. Then we provide detailed proofs of Theorems 1 and 2.
Fact A1 
(Zhou et al. [37]). Under Conditions A and B, there exist positive constants $C_1$ and $C_2$ such that
$$C_1 m_n^{-1} \le \lambda_{\min}\big(E(\Phi\Phi^T)\big) \le \lambda_{\max}\big(E(\Phi\Phi^T)\big) \le C_2 m_n^{-1}.$$
Lemma A1 
(Huang et al. [35]). Under Condition B, for any $j \in [q_n]$ there exists an $f_{nj} \in S_n$ satisfying
$$\|f_{nj} - f_j^*\|_2 = O_p\big(m_n^{-d} + m_n^{1/2}n^{-1/2}\big).$$
In particular, if we choose $m_n = O(n^{\frac{1}{2d+1}})$, then
$$\|f_{nj} - f_j^*\|_2 = O_p(m_n^{-d}) = O_p\big(n^{-\frac{d}{2d+1}}\big).$$
Proof of Theorem 1. 
Note that the objective function is as follows:
$$S_n(\beta) = \ell_n(\beta) - n\lambda P(\beta),$$
where $\ell_n(\beta) = \sum_{j=1}^J \log P_\beta(Z_j \mid \mathbf X)$ is the log-likelihood function and $n\lambda P(\beta)$ is the penalty with tuning parameter $\lambda$, $P(\beta) = \sum_{j=1}^{q_n}\|\beta_j\|_2$. We first prove that the estimate $\hat\beta$ satisfies $\|\hat\beta - \beta^*\| = O_p(q_n^{1/2}n^{(\gamma-1)/2})$ with $\gamma = \frac{1}{2d+1}$. Let $\alpha_n = q_n^{1/2}n^{(\gamma-1)/2}$, and let $u$ denote a $(q_n+1)$-dimensional vector. To verify that the convergence rate is $\alpha_n$, we only need to prove that there exists some positive constant $C$ such that
$$P\Big(\sup_{\|u\|=C}\big[S_n(\beta^* + \alpha_n u) - S_n(\beta^*)\big] < 0\Big) \to 1.$$
Taking a Taylor expansion, we have the following:
$$S_n(\beta^*+\alpha_n u) - S_n(\beta^*) = \ell_n(\beta^*+\alpha_n u) - \ell_n(\beta^*) - n\lambda\sum_{j=1}^{q_n}\big[P(\beta_j^*+\alpha_n u_j) - P(\beta_j^*)\big] = \alpha_n u^T\frac{\partial\ell_n(\beta^*)}{\partial\beta} + \frac12\alpha_n^2 u^T\frac{\partial^2\ell_n(\beta^*)}{\partial\beta\,\partial\beta^T}u + \frac16\alpha_n^3\sum_{l,j,k}\frac{\partial^3\ell_n(\beta^w)}{\partial\beta_l\partial\beta_j\partial\beta_k}u_lu_ju_k - n\lambda\sum_{j=1}^{q_n}\big[P(\beta_j^*+\alpha_n u_j)-P(\beta_j^*)\big] =: I_1 + I_2 + I_3 - I_4,$$
where $\beta^w$ is an intermediate value between $\beta^*$ and $\beta^*+\alpha_n u$. In what follows, we bound $I_1$, $I_2$, $I_3$, and $I_4$ separately.
(1)
Since $I_1 = \alpha_n u^T\frac{\partial\ell_n(\beta^*)}{\partial\beta}$, we first study the first-order derivative:
$$\frac{\partial\ell_n(\beta^*)}{\partial\beta} = \sum_{j=1}^J\frac{\partial\log P_{\beta^*}(Z_j \mid \mathbf X)}{\partial\beta} = \sqrt n\sum_{m=1}^M\Big(\frac{|U_m|}{n}\Big)^{1/2}S_m,$$
where $S_m = \frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\frac{\partial\log P_{\beta^*}(Z_j \mid \mathbf X)}{\partial\beta}$ and $P_{\beta^*}$ is the probability measure under the parameter $\beta^*$. Recall that $P_j = (P_{j,1},\dots,P_{j,T_j})$ and $P_{j,1}$ represents the original group of size $k_j$. Since $Z_j$ depends only on the true statuses of its individuals, $y_{P_{j,1}} \in \{0,1\}^{k_j}$, we have the following:
$$P_{\beta^*}(Z_j \mid \mathbf X) = \sum_{y_{P_{j,1}}} P(Z_j \mid \tilde Y_{P_{j,1}} = y_{P_{j,1}})\,P_{\beta^*}(\tilde Y_{P_{j,1}} = y_{P_{j,1}} \mid X_{P_{j,1}}).$$
Recall that $P_{\beta^*}(\tilde Y_{P_{j,1}} = y_{P_{j,1}} \mid X_{P_{j,1}}) = \prod_{s \in P_{j,1}} P_{\beta^*}(\tilde Y_s = y_s \mid X_s)$. We denote the true risk probability by $P^*(\tilde Y_s = y_s \mid X_s) = \frac{e^{(\mu^* + \sum_{j=1}^{q_n} f_j^*(X_{sj}))y_s}}{1 + e^{\mu^* + \sum_{j=1}^{q_n} f_j^*(X_{sj})}}$. Then we have the following:
$$\big|P_{\beta^*}(\tilde Y_s = y_s \mid X_s) - P^*(\tilde Y_s = y_s \mid X_s)\big| = \Bigg|\frac{e^{\Phi_s^T\beta^* y_s}}{1+e^{\Phi_s^T\beta^*}} - \frac{e^{(\mu^*+\sum_{j=1}^{q_n}f_j^*(X_{sj}))y_s}}{1+e^{\mu^*+\sum_{j=1}^{q_n}f_j^*(X_{sj})}}\Bigg| \le \Big|\Phi_s^T\beta^* - \Big(\mu^* + \sum_{j=1}^{q_n}f_j^*(X_{sj})\Big)\Big|.$$
Note that the logistic model can be viewed as a generalized linear model with likelihood $\exp(y_i\eta_i - b(\eta_i))$, where $b(\eta_i) = \log(1+\exp(\eta_i))$ and $\eta_i = \Phi_i^T\beta$. We denote the population-level linear predictor by $\eta_i^* = \Phi_i^T\beta^*$; then we have the following:
$$S_m = \frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\frac{1}{P_{\beta^*}(Z_j \mid \mathbf X)}\frac{\partial P_{\beta^*}(Z_j \mid \mathbf X)}{\partial\beta} = \frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\frac{\sum_{y_{P_{j,1}}}P(Z_j \mid \tilde Y_{P_{j,1}}=y_{P_{j,1}})\exp\big(\sum_{l \in P_{j,1}}(y_l\eta_l^* - b(\eta_l^*))\big)\sum_{l \in P_{j,1}}(y_l - b'(\eta_l^*))\Phi_l}{\sum_{y_{P_{j,1}}}P(Z_j \mid \tilde Y_{P_{j,1}}=y_{P_{j,1}})\exp\big(\sum_{l \in P_{j,1}}(y_l\eta_l^* - b(\eta_l^*))\big)} = \frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\sum_{y_{P_{j,1}}}P_{\beta^*}(\tilde Y_{P_{j,1}}=y_{P_{j,1}} \mid Z_j)\cdot\sum_{l \in P_{j,1}}(y_l - b'(\eta_l^*))\Phi_l,$$
where the last equality comes from the following conditional probability:
$$P_{\beta^*}(\tilde Y_{P_{j,1}}=y_{P_{j,1}} \mid Z_j) = \frac{P(Z_j \mid \tilde Y_{P_{j,1}}=y_{P_{j,1}})\exp\big(\sum_{l \in P_{j,1}}(y_l\eta_l^* - b(\eta_l^*))\big)}{\sum_{y_{P_{j,1}}}P(Z_j \mid \tilde Y_{P_{j,1}}=y_{P_{j,1}})\exp\big(\sum_{l \in P_{j,1}}(y_l\eta_l^* - b(\eta_l^*))\big)}.$$
Let $\rho_l^* = \mu^* + \sum_{j=1}^{q_n}f_j^*(X_{lj})$ denote the true additive predictor; then we have the following:
$$y_l\eta_l^* - b(\eta_l^*) = y_l\rho_l^* - b(\rho_l^*) + y_l(\eta_l^*-\rho_l^*) - \big(b(\eta_l^*) - b(\rho_l^*)\big) = y_l\rho_l^* - b(\rho_l^*) + \big(y_l - b'(\rho_l^*)\big)(\eta_l^*-\rho_l^*) - \varpi_l,$$
where $\varpi_l = b(\eta_l^*) - b(\rho_l^*) - b'(\rho_l^*)(\eta_l^*-\rho_l^*)$. Then,
$$\sum_{l \in P_{j,1}}\big[y_l\eta_l^* - b(\eta_l^*)\big] = \sum_{l \in P_{j,1}}\big[y_l\rho_l^* - b(\rho_l^*)\big] + \sum_{l \in P_{j,1}}\big(y_l - b'(\rho_l^*)\big)(\eta_l^*-\rho_l^*) - \sum_{l \in P_{j,1}}\varpi_l.$$
Let $\delta_l = (y_l - b'(\rho_l^*))(\eta_l^*-\rho_l^*)$. Combining this with the fact that $\varpi_l$ does not depend on the value of $y_{P_{j,1}}$, we have the following:
$$S_m = \frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\frac{\sum_{y_{P_{j,1}}}P(Z_j \mid \tilde Y_{P_{j,1}}=y_{P_{j,1}})\exp\big(\sum_{l \in P_{j,1}}(y_l\rho_l^* - b(\rho_l^*)) + \sum_{l \in P_{j,1}}\delta_l\big)\sum_{l \in P_{j,1}}\big(y_l - b'(\rho_l^*) + b''(\eta_{l,\omega})(\rho_l^*-\eta_l^*)\big)\Phi_l}{\sum_{y_{P_{j,1}}}P(Z_j \mid \tilde Y_{P_{j,1}}=y_{P_{j,1}})\exp\big(\sum_{l \in P_{j,1}}(y_l\rho_l^* - b(\rho_l^*)) + \sum_{l \in P_{j,1}}\delta_l\big)},$$
where $\eta_{l,\omega}$ lies between $\rho_l^*$ and $\eta_l^*$. Note that $b'(\eta) = \frac{e^\eta}{1+e^\eta}$ and $b''(\eta) = \frac{e^\eta}{(1+e^\eta)^2}$ both belong to the interval $(0,1)$. Recall that $\eta_l^*$ represents a B-spline approximation of $\rho_l^*$. By Lemma A1, we have $\sum_{l \in P_{j,1}}\delta_l = o_p(1)$ for a finite group size, and
$$S_m = \frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\frac{\sum_{y_{P_{j,1}}}P(Z_j \mid \tilde Y_{P_{j,1}}=y_{P_{j,1}})\exp\big(\sum_{l \in P_{j,1}}(y_l\rho_l^* - b(\rho_l^*)) + o_p(1)\big)\big[\sum_{l \in P_{j,1}}(y_l - b'(\rho_l^*))\Phi_l + \sum_{l \in P_{j,1}}(b'(\rho_l^*) - b'(\eta_l^*))\Phi_l\big]}{\sum_{y_{P_{j,1}}}P(Z_j \mid \tilde Y_{P_{j,1}}=y_{P_{j,1}})\exp\big(\sum_{l \in P_{j,1}}(y_l\rho_l^* - b(\rho_l^*)) + o_p(1)\big)}.$$
Since $\sum_{l \in P_{j,1}}\big(b'(\rho_l^*) - b'(\eta_l^*)\big)\Phi_l$ does not depend on the latent vector $y_{P_{j,1}}$, we have the following:
$$S_m = \frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\frac{\sum_{y_{P_{j,1}}}P(Z_j \mid \tilde Y_{P_{j,1}}=y_{P_{j,1}})\exp\big(\sum_{l \in P_{j,1}}(y_l\rho_l^* - b(\rho_l^*)) + o_p(1)\big)\sum_{l \in P_{j,1}}(y_l - b'(\rho_l^*))\Phi_l}{\sum_{y_{P_{j,1}}}P(Z_j \mid \tilde Y_{P_{j,1}}=y_{P_{j,1}})\exp\big(\sum_{l \in P_{j,1}}(y_l\rho_l^* - b(\rho_l^*)) + o_p(1)\big)} + \frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\sum_{l \in P_{j,1}}\big(b'(\rho_l^*) - b'(\eta_l^*)\big)\Phi_l.$$
Combining this with (A2), we have the following:
$$S_m = \frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\Big[\sum_{y_{P_{j,1}}}P^*(\tilde Y_{P_{j,1}}=y_{P_{j,1}} \mid Z_j)\sum_{l \in P_{j,1}}(y_l - b'(\rho_l^*))\Phi_l\Big](1+o_p(1)) + \frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\Big[\sum_{y_{P_{j,1}}}P^*(\tilde Y_{P_{j,1}}=y_{P_{j,1}} \mid Z_j)\sum_{l \in P_{j,1}}\big(b'(\rho_l^*) - b'(\eta_l^*)\big)\Phi_l\Big](1+o_p(1)) =: \frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\mathrm{I}_{m,j}(1+o_p(1)) + \frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\mathrm{II}_{m,j}(1+o_p(1)).$$
By taking the expectation over $\mathrm{I}_{m,j}$, we find that
$$E_{Z_j}\mathrm{I}_{m,j} = E_{Z_j}\Big[\sum_{y_{P_{j,1}}}P^*(\tilde Y_{P_{j,1}}=y_{P_{j,1}} \mid Z_j)\sum_{l \in P_{j,1}}\big(y_l - b'(\rho_l^*)\big)\Phi_l\Big] = E_{Z_j}E\Big[\sum_{l \in P_{j,1}}\big(\tilde Y_l - b'(\rho_l^*)\big)\Phi_l \,\Big|\, Z_j\Big] = 0.$$
Because the groups $P_j$, $j \in [J]$, are non-overlapping, Condition D yields $\frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\mathrm{I}_{m,j}\stackrel{d}{\to}N(0, V_m)$ with $V_m = E_{Z_j}\big(\mathrm{I}_{m,j}\mathrm{I}_{m,j}^T\big)$. Additionally, by Lemma A1 and Fact A1, we have the following:
$$\Big\|\frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\mathrm{II}_{m,j}\Big\| = \Big\|\frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\sum_{l \in P_{j,1}}\big(b'(\rho_l^*) - b'(\eta_l^*)\big)\Phi_l\Big\| = O_p\big(\sqrt{|U_m|}\,n^{-\frac{d}{2d+1}}\big) = O_p\big(n^{\frac{1}{2(2d+1)}}\big).$$
So $S_m$ is dominated by $\frac{1}{\sqrt{|U_m|}}\sum_{j \in U_m}\mathrm{II}_{m,j}$. It follows that $S_m = O_p(n^{\frac{1}{2(2d+1)}})$ and $\frac{\partial\ell_n(\beta^*)}{\partial\beta} = O_P\big(\sqrt n\cdot n^{\frac{1}{2(2d+1)}}\big) = O_p\big(n^{\frac{d+1}{2d+1}}\big)$. Since $\gamma = \frac{1}{2d+1}$, we have $\frac{\partial\ell_n(\beta^*)}{\partial\beta} = O_P\big(n^{\frac{\gamma+1}{2}}\big)$ and
$$I_1 = \alpha_n u^T\frac{\partial\ell_n(\beta^*)}{\partial\beta} = O_p\big(\alpha_n\|u\|_2\,q_n^{1/2}n^{\gamma/2}\sqrt n\big) = O_p\big(n\alpha_n^2\|u\|_2\big).$$
(2)
We continue to derive $I_2 = \frac12\alpha_n^2 u^T\frac{\partial^2\ell_n(\beta^*)}{\partial\beta\,\partial\beta^T}u$, in which
$$\frac{\partial^2\ell_n(\beta^*)}{\partial\beta\,\partial\beta^T} = n\sum_{m=1}^M\frac{|U_m|}{n}\cdot\frac{1}{|U_m|}\sum_{j \in U_m}\frac{\partial^2\log P_{\beta^*}(Z_j \mid \mathbf X)}{\partial\beta\,\partial\beta^T}.$$
By Condition D and the law of large numbers, we have
$$\frac1n\frac{\partial^2\ell_n(\beta^*)}{\partial\beta\,\partial\beta^T} \to -\sum_{m=1}^M\gamma_m I_m(\beta^*).$$
By Condition E, $I_m(\beta^*)$ is positive-definite, with minimum eigenvalue exceeding $\tau_m$ for every $m \in [M]$. Let $\underline\tau = \min_{m \in [M]}\{\tau_m\}$. Then, for sufficiently large $n$, we have
$$I_2 \le -\frac14 n\underline\tau\,\alpha_n^2\|u\|^2.$$
(3)
In the following, we derive the bound for $I_3 = \frac16\alpha_n^3\sum_{l,j,k}\frac{\partial^3\ell_n(\beta^w)}{\partial\beta_l\partial\beta_j\partial\beta_k}u_lu_ju_k$, where
$$\frac1n\frac{\partial^3\ell_n(\beta^w)}{\partial\beta_l\partial\beta_j\partial\beta_k} = \sum_{m=1}^M\frac{|U_m|}{n}\cdot\frac{1}{|U_m|}\sum_{j' \in U_m}\frac{\partial^3\ell_{nj'}(\beta^w)}{\partial\beta_l\partial\beta_j\partial\beta_k},$$
and $\ell_{nj'}(\cdot)$ denotes the log-likelihood function for $Z_{j'}$. Note that
$$\frac{\partial^3\ell_{nj'}(\beta^w)}{\partial\beta_l\partial\beta_j\partial\beta_k} = \frac{1}{P_\beta(Z_{j'} \mid \mathbf X)}\frac{\partial^3 P_\beta(Z_{j'} \mid \mathbf X)}{\partial\beta_l\partial\beta_j\partial\beta_k} + \frac{2}{P_\beta(Z_{j'} \mid \mathbf X)^3}\prod_{i \in \{l,j,k\}}\frac{\partial P_\beta(Z_{j'} \mid \mathbf X)}{\partial\beta_i} - \frac{1}{P_\beta(Z_{j'} \mid \mathbf X)^2}\Big[\frac{\partial^2 P_\beta(Z_{j'} \mid \mathbf X)}{\partial\beta_l\partial\beta_j}\frac{\partial P_\beta(Z_{j'} \mid \mathbf X)}{\partial\beta_k} + \frac{\partial^2 P_\beta(Z_{j'} \mid \mathbf X)}{\partial\beta_l\partial\beta_k}\frac{\partial P_\beta(Z_{j'} \mid \mathbf X)}{\partial\beta_j} + \frac{\partial^2 P_\beta(Z_{j'} \mid \mathbf X)}{\partial\beta_j\partial\beta_k}\frac{\partial P_\beta(Z_{j'} \mid \mathbf X)}{\partial\beta_l}\Big].$$
According to the proof of Theorem 1 in Gregory et al. [13], we have the following bounds:
$$\Big|\frac{\partial P_\beta(Z_{j'} \mid \mathbf X)}{\partial\beta_l}\Big| \le 2\sum_{i \in P_{j'}}|\Phi_{il}|, \quad \Big|\frac{\partial^2 P_\beta(Z_{j'} \mid \mathbf X)}{\partial\beta_l\partial\beta_j}\Big| \le 4\sum_{i_1,i_2 \in P_{j'}}|\Phi_{i_1l}\Phi_{i_2j}|, \quad \Big|\frac{\partial^3 P_\beta(Z_{j'} \mid \mathbf X)}{\partial\beta_l\partial\beta_j\partial\beta_k}\Big| \le 12\sum_{i_1,i_2,i_3 \in P_{j'}}|\Phi_{i_1l}\Phi_{i_2j}\Phi_{i_3k}|.$$
By the fact that the covariates are bounded, we have the following:
$$\frac{1}{|U_m|}\sum_{j' \in U_m}\frac{\partial^3\log P_\beta(Z_{j'} \mid \mathbf X)}{\partial\beta_l\partial\beta_j\partial\beta_k}\Big|_{\beta=\beta^w} = O_p(1).$$
It follows that
$$\sum_{l,j,k=1}^{q_n+1}u_lu_ju_k\sum_{j'=1}^J\frac{\partial^3\ell_{nj'}(\beta^w)}{\partial\beta_l\partial\beta_j\partial\beta_k} = O_p\big(q_n^{3/2}\|u\|^3\big).$$
By Condition C with $q_n = o(n^{3/4-\gamma/2})$, we have $I_3 = o(n\alpha_n^2)$.
(4)
Further, by the triangle inequality, we have
$$\Big|n\lambda\sum_{j=1}^{q_n}\big(\|\beta_j^*+\alpha_n u_j\|_2 - \|\beta_j^*\|_2\big)\Big| \le n\lambda\sum_{j=1}^{q_n}\alpha_n\|u_j\|_2.$$
Given the condition $\lambda = o(q_n^{1/2}n^{\gamma/2}/n)$, we find that
$$I_4 = n\lambda\sum_{j=1}^{q_n}\big[P(\beta_j^*+\alpha_n u_j) - P(\beta_j^*)\big] = o_p(n\alpha_n^2).$$
Combining the above results, for a sufficiently large $\|u\| = C$, the quantity $S_n(\beta^*+\alpha_n u) - S_n(\beta^*)$ is dominated by $I_2$, which is negative. This implies the existence of a positive constant $C$ such that
$$P\Big(\sup_{\|u\|=C}\big[S_n(\beta^* + \alpha_n u) - S_n(\beta^*)\big] < 0\Big) \to 1, \quad n \to \infty.$$
Then, the proof is finished.
Proof of Theorem 2. 
(1).
By the properties of the B-spline approximation [38], there exist positive constants $\tilde c_1$ and $\tilde c_2$ such that
$$\tilde c_1 m_n^{-1}\|\hat\beta_j - \beta_j^*\|^2 \le \|\hat f_{nj} - f_{nj}^*\|_2^2 \le \tilde c_2 m_n^{-1}\|\hat\beta_j - \beta_j^*\|^2,$$
where $f_{nj}^* = \Phi_j^T\beta_j^*$. Then we have the following:
$$\|\hat f_{nj} - f_{nj}^*\|_2 = O_p\big(q_n n^{-\frac{2d-1}{2(2d+1)}}\big).$$
Combining this with the fact that $\|f_{nj}^* - f_j^*\|_2 = O_p(m_n^{-d})$, we obtain $\|\hat f_{nj} - f_j^*\|_2 = O_p\big(q_n n^{-\frac{2d-1}{2(2d+1)}}\big)$.
(2).
If there exists $j \in \Omega^*$ such that $\hat\beta_j = 0$, then $\hat f_{nj} = 0$. Combining this with result (1), we would have $\|f_j^*\|_2 = \|\hat f_{nj} - f_j^*\|_2 = O_p\big(q_n n^{-\frac{2d-1}{2(2d+1)}}\big)$, which contradicts Condition C. Then, the proof is finished.

References

1. Dorfman, R. The detection of defective members of large populations. Ann. Math. Stat. 1943, 14, 436–440.
2. Zhang, B.; Bilder, C.R.; Tebbs, J.M. Group testing regression model estimation when case identification is a goal. Biom. J. 2013, 55, 173–189.
3. Lin, J.; Wang, D.; Zheng, Q. Regression analysis and variable selection for two-stage multiple-infection group testing data. Stat. Med. 2019, 38, 4519–4533.
4. Verougstraete, N.; Verbeke, V.; De Cannière, A.S.; Simons, C.; Padalko, E.; Coorevits, L. To pool or not to pool? Screening of Chlamydia trachomatis and Neisseria gonorrhoeae in female sex workers: Pooled versus single-site testing. Sex. Transm. Infect. 2020, 96, 417–421.
5. Stramer, S.L.; Notari, E.P.; Krysztof, D.E.; Dodd, R.Y. Hepatitis B virus testing by minipool nucleic acid testing: Does it improve blood safety? Transfusion 2013, 53, 2449–2458.
6. Busch, M.P.; Caglioti, S.; Robertson, E.F.; McAuley, J.D.; Tobler, L.H.; Kamel, H.; Linnen, J.M.; Shyamala, V.; Tomasulo, P.; Kleinman, S.H. Screening the blood supply for West Nile virus RNA by nucleic acid amplification testing. N. Engl. J. Med. 2005, 353, 460–467.
7. Mutesa, L.; Ndishimye, P.; Butera, Y.; Souopgui, J.; Uwineza, A.; Rutayisire, R.; Ndoricimpaye, E.L.; Musoni, E.; Rujeni, N.; Nyatanyi, T.; et al. A pooled testing strategy for identifying SARS-CoV-2 at low prevalence. Nature 2021, 589, 276–280.
8. Bish, D.R.; Bish, E.K.; El-Hajj, H.; Aprahamian, H. A robust pooled testing approach to expand COVID-19 screening capacity. PLoS ONE 2021, 16, e0246285.
9. Gastwirth, J.L. The efficiency of pooling in the detection of rare mutations. Am. J. Hum. Genet. 2000, 67, 1036–1039.
10. Okasha, H.; Baddour, M.; Elsawy, M.; Sadek, N.; Eltoweissy, M.; Gouda, A.; Abdelkhalek, O.; Meheissen, M. Optimization of pooling technique for hepatitis C virus nucleic acid testing (NAT) in blood banks. Hepat. Mon. 2020, 20, e99571.
11. Hughes-Oliver, J.M. Pooling experiments for blood screening and drug discovery. In Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics; Springer: New York, NY, USA, 2006; pp. 48–68.
12. Sponheim, A.; Munoz-Zanzi, C.; Fano, E.; Polson, D.; Pieters, M. Pooled-sample testing for detection of Mycoplasma hyopneumoniae during late experimental infection as a diagnostic tool for a herd eradication program. Prev. Vet. Med. 2021, 189, 105313.
13. Gregory, K.B.; Wang, D.; McMahan, C.S. Adaptive elastic net for group testing. Biometrics 2019, 75, 13–23.
14. Chen, P.; Tebbs, J.M.; Bilder, C.R. Group testing regression models with fixed and random effects. Biometrics 2009, 65, 1270–1278.
15. McMahan, C.S.; Tebbs, J.M.; Bilder, C.R. Regression models for group testing data with pool dilution effects. Biostatistics 2013, 14, 284–298.
16. Wang, D.; McMahan, C.; Gallagher, C.; Kulasekera, K. Semiparametric group testing regression models. Biometrika 2014, 101, 587–598.
17. Vansteelandt, S.; Goetghebeur, E.; Verstraeten, T. Regression models for disease prevalence with diagnostic tests on pools of serum samples. Biometrics 2000, 56, 1126–1133.
18. Xie, M. Regression analysis of group testing samples. Stat. Med. 2001, 20, 1957–1969.
19. Black, M.S.; Bilder, C.R.; Tebbs, J.M. Group testing in heterogeneous populations by using halving algorithms. J. R. Stat. Soc. Ser. C Appl. Stat. 2012, 61, 277–290.
20. Bilder, C.R.; Tebbs, J.M.; Chen, P. Informative retesting. J. Am. Stat. Assoc. 2010, 105, 942–955.
21. McMahan, C.S.; Tebbs, J.M.; Hanson, T.E.; Bilder, C.R. Bayesian regression for group testing data. Biometrics 2017, 73, 1443–1452.
22. Yuan, A.; Piao, J.; Ning, J.; Qin, J. Semiparametric isotonic regression modelling and estimation for group testing data. Can. J. Stat. 2021, 49, 659–677.
23. Delaigle, A.; Meister, A. Nonparametric regression analysis for group testing data. J. Am. Stat. Assoc. 2011, 106, 640–650.
24. Delaigle, A.; Hall, P. Nonparametric regression with homogeneous group testing data. Ann. Stat. 2012, 40, 131–158.
25. Delaigle, A.; Hall, P.; Wishart, J. New approaches to nonparametric and semiparametric regression for univariate and multivariate group testing data. Biometrika 2014, 101, 567–585.
26. Liu, Y.; McMahan, C.S.; Tebbs, J.M.; Gallagher, C.M.; Bilder, C.R. Generalized additive regression for group testing data. Biostatistics 2021, 22, 873–889.
27. Yoshida, T.; Naito, K. Asymptotics for penalised splines in generalised additive models. J. Nonparametr. Stat. 2014, 26, 269–289.
28. Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 2006, 68, 49–67.
29. Litvak, E.; Tu, X.M.; Pagano, M. Screening for the presence of a disease by pooling sera samples. J. Am. Stat. Assoc. 1994, 89, 424–434.
30. Kim, H.Y.; Hudgens, M.G.; Dreyfuss, J.M.; Westreich, D.J.; Pilcher, C.D. Comparison of group testing algorithms for case identification in the presence of test error. Biometrics 2007, 63, 1152–1163.
31. Xiong, W.; Ding, J.; He, Y.; Li, Q. Improved matrix pooling. Stat. Methods Med. Res. 2019, 28, 211–222.
32. Stone, C.J. Additive regression and other nonparametric models. Ann. Stat. 1985, 13, 689–705.
33. Hastie, T.; Tibshirani, R. Generalized additive models: Some applications. J. Am. Stat. Assoc. 1987, 82, 371–386.
34. Xiong, W.; Ding, J.; Zhang, W.; Liu, A.; Li, Q. Nested Group Testing Procedure. Commun. Math. Stat. 2023, 11, 663–693.
35. Huang, J.; Horowitz, J.L.; Wei, F. Variable selection in nonparametric additive models. Ann. Stat. 2010, 38, 2282.
36. Yu, W.; Liu, T.; Valdez, R.; Gwinn, M.; Khoury, M.J. Application of support vector machine modeling for prediction of common diseases: The case of diabetes and pre-diabetes. BMC Med. Inform. Decis. Mak. 2010, 10, 16.
37. Zhou, S.; Shen, X.; Wolfe, D. Local asymptotics for regression splines and confidence regions. Ann. Stat. 1998, 26, 1760–1782.
38. De Boor, C. A Practical Guide to Splines; Springer: New York, NY, USA, 1978; Volume 27.
Figure 1. An illustration of the test procedure using the Halving algorithm, where filled circles denote positive samples, marked with a "+", and hollow circles denote negative samples, marked with a "−".
Figure 2. The estimators of the functional components in Example 1, simulated using the master pool group testing algorithm. The solid line represents the estimated function, $\hat f_j$, and the dashed line represents the true function, $f_j$.
Figure 3. The estimates of the nonlinear functions using our method, along with the coefficient estimates obtained using the GLM.
Table 1. Overall performance of our method under four group testing algorithms.

Model                Setting     Procedure      PE           TP           FP
Example 1            n = 1000    Master Pool    0.82 (0.12)  3.92 (0.27)  3.09 (0.29)
(q_n = 50, s_0 = 4)              Halving        0.83 (0.11)  3.96 (0.20)  2.98 (0.14)
                                 Dorfman        0.76 (0.01)  4.00 (0.00)  1.98 (0.15)
                                 Array          0.80 (0.01)  3.99 (0.10)  2.20 (0.45)
                     n = 500     Master Pool    1.57 (0.51)  3.83 (0.38)  3.22 (1.45)
                                 Halving        1.45 (0.55)  3.89 (0.31)  3.18 (1.42)
                                 Dorfman        0.95 (0.01)  3.90 (0.30)  2.94 (0.34)
                                 Array          0.71 (0.01)  4.00 (0.00)  1.01 (0.17)
Example 2            q_n = 50    Master Pool    1.67 (0.32)  3.74 (0.44)  3.94 (1.78)
(n = 500, s_0 = 4)               Halving        1.32 (1.01)  3.71 (0.48)  3.75 (1.67)
                                 Dorfman        1.19 (0.01)  3.88 (0.36)  1.94 (0.23)
                                 Array          1.10 (0.01)  3.92 (0.31)  1.93 (0.26)
                     q_n = 100   Master Pool    1.72 (0.43)  3.65 (0.58)  3.21 (1.71)
                                 Halving        1.52 (0.28)  3.57 (0.59)  2.72 (1.54)
                                 Dorfman        1.45 (0.01)  3.64 (0.57)  2.02 (1.41)
                                 Array          1.29 (0.01)  3.85 (0.48)  2.48 (1.72)
Example 3            q_n = 50    Master Pool    1.61 (0.08)  2.89 (0.45)  3.97 (0.33)
(n = 1000, s_0 = 3)              Halving        1.64 (0.24)  2.85 (0.48)  3.51 (0.88)
                                 Dorfman        0.60 (0.01)  2.91 (0.35)  1.73 (0.44)
                                 Array          0.54 (0.01)  3.00 (0.00)  1.87 (0.34)
                     q_n = 100   Master Pool    0.92 (0.01)  2.89 (0.40)  0.00 (0.00)
                                 Halving        1.23 (0.15)  2.74 (0.50)  1.91 (0.29)
                                 Dorfman        0.85 (0.01)  2.84 (0.44)  0.97 (0.17)
                                 Array          0.95 (0.08)  2.92 (0.37)  2.01 (0.18)
Example 4            q_n = 50    Master Pool    1.47 (0.35)  2.78 (0.48)  3.13 (1.01)
(n = 500, s_0 = 3)               Halving        1.48 (0.21)  2.69 (0.53)  3.16 (1.36)
                                 Dorfman        0.81 (0.02)  2.93 (0.36)  0.44 (0.50)
                                 Array          0.86 (0.01)  2.91 (0.38)  0.46 (1.11)
                     q_n = 100   Master Pool    1.52 (0.16)  2.69 (0.53)  3.17 (1.54)
                                 Halving        1.46 (0.28)  2.73 (0.51)  3.55 (1.55)
                                 Dorfman        1.31 (0.02)  2.88 (0.40)  2.96 (1.97)
                                 Array          1.10 (0.01)  2.90 (0.39)  1.98 (0.83)
Table 2. The estimates of the functions in Examples 1 and 2.

Model        Setting     Procedure      PE_f1        PE_f2        PE_f3        PE_f4        PE_non
Example 1    n = 1000    Master Pool    0.04 (0.03)  0.17 (0.10)  0.18 (0.09)  0.15 (0.05)  0.02 (0.01)
(q_n = 50)               Halving        0.05 (0.04)  0.16 (0.11)  0.17 (0.08)  0.14 (0.05)  0.01 (0.01)
                         Dorfman        0.03 (0.01)  0.06 (0.02)  0.08 (0.01)  0.11 (0.01)  0.02 (0.01)
                         Array          0.04 (0.01)  0.06 (0.03)  0.09 (0.02)  0.12 (0.02)  0.02 (0.01)
             n = 500     Master Pool    0.05 (0.06)  0.40 (0.14)  0.41 (0.17)  0.12 (0.04)  0.01 (0.01)
                         Halving        0.04 (0.03)  0.23 (0.09)  0.31 (0.08)  0.10 (0.04)  0.02 (0.01)
                         Dorfman        0.03 (0.01)  0.17 (0.03)  0.17 (0.02)  0.07 (0.01)  0.01 (0.01)
                         Array          0.01 (0.01)  0.16 (0.02)  0.24 (0.01)  0.05 (0.02)  0.01 (0.01)
Example 2    q_n = 50    Master Pool    0.32 (0.06)  0.22 (0.06)  0.44 (0.17)  0.14 (0.03)  0.02 (0.01)
(n = 500)                Halving        0.14 (0.14)  0.28 (0.30)  0.27 (0.22)  0.14 (0.08)  0.01 (0.01)
                         Dorfman        0.19 (0.01)  0.21 (0.04)  0.39 (0.01)  0.04 (0.01)  0.02 (0.01)
                         Array          0.24 (0.01)  0.20 (0.04)  0.51 (0.01)  0.06 (0.01)  0.01 (0.01)
             q_n = 100   Master Pool    0.31 (0.17)  0.14 (0.11)  0.20 (0.19)  0.09 (0.04)  0.01 (0.01)
                         Halving        0.36 (0.20)  0.13 (0.12)  0.18 (0.08)  0.09 (0.03)  0.02 (0.01)
                         Dorfman        0.05 (0.01)  0.12 (0.03)  0.08 (0.01)  0.06 (0.01)  0.01 (0.01)
                         Array          0.05 (0.01)  0.15 (0.07)  0.09 (0.02)  0.07 (0.01)  0.01 (0.01)
Table 3. The specific results of Examples 3 and 4.

Model        Setting     Procedure      PE_f1        PE_f2        PE_f3        PE_non
Example 3    q_n = 50    Master Pool    0.09 (0.02)  0.19 (0.26)  0.07 (0.24)  0.04 (0.02)
(n = 1000)               Halving        0.06 (0.03)  0.17 (0.26)  0.06 (0.25)  0.05 (0.01)
                         Dorfman        0.01 (0.02)  0.10 (0.01)  0.02 (0.02)  0.02 (0.01)
                         Array          0.01 (0.01)  0.07 (0.01)  0.01 (0.01)  0.01 (0.01)
             q_n = 100   Master Pool    0.04 (0.02)  0.25 (0.09)  0.03 (0.01)  0.01 (0.01)
                         Halving        0.02 (0.01)  0.13 (0.03)  0.06 (0.02)  0.01 (0.01)
                         Dorfman        0.03 (0.01)  0.24 (0.02)  0.04 (0.01)  0.02 (0.01)
                         Array          0.03 (0.01)  0.12 (0.01)  0.03 (0.01)  0.01 (0.01)
Example 4    q_n = 50    Master Pool    0.23 (0.14)  0.41 (0.21)  0.06 (0.07)  0.01 (0.01)
(n = 500)                Halving        0.19 (0.07)  0.51 (0.17)  0.06 (0.02)  0.02 (0.01)
                         Dorfman        0.19 (0.01)  0.13 (0.01)  0.02 (0.01)  0.01 (0.01)
                         Array          0.14 (0.01)  0.12 (0.01)  0.01 (0.01)  0.02 (0.01)
             q_n = 100   Master Pool    0.21 (0.02)  0.10 (0.01)  0.27 (0.05)  0.02 (0.01)
                         Halving        0.22 (0.05)  0.08 (0.02)  0.32 (0.10)  0.01 (0.01)
                         Dorfman        0.06 (0.01)  0.15 (0.01)  0.13 (0.02)  0.01 (0.01)
                         Array          0.09 (0.01)  0.16 (0.01)  0.26 (0.01)  0.01 (0.01)
Table 4. The performance of the models; the GLM uses the original low-dimensional data ($q_n = 14$), while our method uses $q_n = 500$.

Metrics    Our Method                            GLM
           k = 1    k = 2    k = 5    k = 8
ACC        0.8296   0.8250   0.8268   0.8277    0.8150
PPV        0.5800   0.5703   0.5761   0.5766    0.5472
NPV        0.9027   0.8993   0.8977   0.9006    0.8952
Recall     0.6360   0.6228   0.6140   0.6272    0.6096