1. Introduction
The problems involved in testing statistical hypotheses occupy an important place in applied statistics and arise in such areas as genetics, biology, astronomy, radar, computer graphics, etc. The classical methods for solving these problems are based on a single hypothesis test. There is a sample $X$ of size $m$, and the null hypothesis $H_0$ is tested against the general alternative $H_1$. The hypothesis is tested using a statistic $T$, a function of the sample with a known distribution under the null hypothesis (the null distribution). For a given null distribution, the attained $p$-values are calculated, and the decision to reject the null hypothesis is made on their basis. Errors arising from the application of this single hypothesis testing algorithm are divided into two types, and the probability of falsely rejecting a true null hypothesis (the probability of a type I error) is bounded by a given significance level $\alpha$:
$$\mathrm{P}(T > t \mid H_0) \le \alpha,$$
where $t$ is the critical threshold value.
With this approach, we can often not only find the critical region for which the $\alpha$-constraint on the probability of a type I error is satisfied, but also minimize the probability of a type II error, i.e., maximize the statistical power.
When considering the problem of multiple hypothesis testing, the task becomes more complicated: now we are dealing with $n$ different null hypotheses $H_0^1, \ldots, H_0^n$ and the alternatives $H_1^1, \ldots, H_1^n$. These hypotheses are tested by statistics $T_1, \ldots, T_n$ with given null distributions. Thus, for each hypothesis, the attained $p$-value can be calculated, as well as the type II error probability.
Let us introduce the notation: $I_0$ is the set of indices of the true null hypotheses, and $R$ is the set of indices of the rejected hypotheses. Then $V = |I_0 \cap R|$ is the number of type I errors. The task is to keep $V$ under control through the choice of the rejection set $R$.
There are many statistical procedures that offer different ways to solve the multiple hypothesis testing problem. One of the first measures proposed to generalize the type I error was the family-wise error rate (FWER) [1]. This value is defined as the probability of making at least one type I error; i.e., instead of controlling the probability of a type I error at the level $\alpha$ for each test, the overall FWER is controlled: $\mathrm{FWER} = \mathrm{P}(V \ge 1) \le \alpha$. However, such a strict criterion significantly increases the type II error rate when the number of tested hypotheses is large.
In [2], an alternative measure called the false discovery rate (FDR) was proposed. This measure controls the expected proportion of false rejections among all rejections:
$$\mathrm{FDR} = \mathrm{E}\left[\frac{V}{\max(|R|, 1)}\right].$$
This approach is widely used in situations where the number of tested hypotheses is so large that it is preferable to allow a certain number of type I errors in order to increase the statistical power.
To control the FDR, the Benjamini–Hochberg [2] multiple hypothesis testing algorithm is often used, which, under the condition of independence of the test statistics, bounds the FDR by the controlling parameter $q$, i.e., $\mathrm{FDR} \le q$. In this procedure, the significance levels change linearly: $\alpha_k = qk/n$, $k = 1, \ldots, n$. To apply the Benjamini–Hochberg method, the attained $p$-values are arranged in increasing order:
$$p_{(1)} \le p_{(2)} \le \ldots \le p_{(n)}.$$
All hypotheses corresponding to $p_{(1)}, \ldots, p_{(k)}$ are rejected, where $k$, $1 \le k \le n$, is the maximum index for which $p_{(k)} \le \alpha_k$.
There are other measures to control the total number of type I errors. In [1], a $q$-value is considered that provides control of the positive false discovery rate (pFDR). Controlling the false coverage rate (FCR) involves solving the problem of multiple hypothesis testing in terms of confidence intervals [3]. The papers [4,5] are devoted to the harmonic mean $p$-value (HMP) method. However, in this paper we focus on the properties of the FDR method. The widespread use of the FDR measure is commonly attributed to the development of technologies that allow collecting and analyzing large amounts of data. Computing power makes it easy to perform hundreds or thousands of statistical tests on a given data set, and in this setting the use of the FWER criterion loses its relevance.
In this paper, we study the asymptotic properties of the mean-square risk estimate for the FDR method in the problem of multiple hypothesis testing for the mathematical expectation of a Gaussian vector with independent components. The consistency of this estimate was proved in [6]; here, we prove its asymptotic normality.
The paper is organized as follows. Section 2 provides basic information about the statement of the problem and the vector classes under consideration. Section 3 defines the mean-square risk of the thresholding method and describes the properties of the FDR threshold. Section 4 considers the asymptotic properties of the mean-square risk estimate, and Section 5 contains some concluding remarks.
2. Preliminaries
Consider the problem of estimating the mathematical expectation of a Gaussian vector observed in the model
$$Y_i = \mu_i + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$
where $\epsilon_i$ are independent normally distributed random variables with zero expectation and known variance $\sigma^2$, and $\mu = (\mu_1, \ldots, \mu_n)$ is an unknown vector belonging to some given set (class). The key assumption adopted in this paper is the “sparsity” of the vector $\mu$, i.e., it is assumed that only a relatively small number of its components are significantly large. A similar problem statement arises, for example, in the analysis and processing of signals containing noise. In this case, the sparse or “economical” representation of the signal is achieved using some special preprocessing, for example, a discrete wavelet transform of the signal vector.
In this paper, we consider the following definitions of sparsity. Let $\|\mu\|_0$ denote the number of nonzero components of $\mu$. Fixing $\eta \in (0, 1)$, define the class
$$\ell_0[\eta] = \{\mu \in \mathbb{R}^n : \|\mu\|_0 \le \eta n\}.$$
For small values of $\eta$, only a small number of the vector components are nonzero.
Another possible way to define sparsity is to limit the absolute values of the components of $\mu$. To do this, consider the sorted absolute values $|\mu|_{(1)} \ge |\mu|_{(2)} \ge \ldots \ge |\mu|_{(n)}$ and for $p > 0$ define the class
$$m_p[\eta] = \{\mu \in \mathbb{R}^n : |\mu|_{(k)} \le \sigma \eta \,(n/k)^{1/p},\ k = 1, \ldots, n\}.$$
In addition, sparsity can be modeled using the $\ell_p$-norm
$$\|\mu\|_p = \left(\sum_{i=1}^n |\mu_i|^p\right)^{1/p}.$$
In this case, the sparse class is defined as
$$\ell_p[\eta] = \Big\{\mu \in \mathbb{R}^n : \tfrac{1}{n}\textstyle\sum_{i=1}^n |\mu_i/\sigma|^p \le \eta^p\Big\}.$$
There are important relationships between these classes. As $p \to 0$, the $\ell_p$-norm approaches $\|\mu\|_0$: $\|\mu\|_p^p \to \|\mu\|_0$. The embedding $\ell_p[\eta] \subset m_p[\eta]$ is also valid.
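The three sparsity classes can be sketched as membership tests. The normalizations below (in particular the factor $\sigma$ and the $1/n$ scaling) follow common conventions in the FDR-thresholding literature and should be treated as assumptions where the displayed definitions above are incomplete:

```python
import numpy as np

def in_l0(mu, eta):
    """l0-class: at most eta*n nonzero components."""
    mu = np.asarray(mu, dtype=float)
    return np.count_nonzero(mu) <= eta * mu.size

def in_mp(mu, eta, p, sigma=1.0):
    """Weak-l_p class m_p[eta]: sorted |mu|_(k) <= sigma*eta*(n/k)^(1/p)."""
    a = np.sort(np.abs(np.asarray(mu, dtype=float)))[::-1]
    n = a.size
    k = np.arange(1, n + 1)
    return bool(np.all(a <= sigma * eta * (n / k) ** (1.0 / p)))

def in_lp(mu, eta, p, sigma=1.0):
    """l_p class: (1/n) * sum |mu_i/sigma|^p <= eta^p."""
    mu = np.asarray(mu, dtype=float)
    return np.mean(np.abs(mu / sigma) ** p) <= eta ** p
```

With these normalizations, membership in $\ell_p[\eta]$ implies membership in $m_p[\eta]$: by Markov's inequality, $k\,|\mu|_{(k)}^p \le \sum_i |\mu_i|^p \le n\eta^p\sigma^p$.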
3. Mean-Square Risk and Properties of the FDR Threshold
In the considered problem, one of the widespread and well-proven methods for constructing an estimate of $\mu$ is the method of (hard) thresholding of each vector component:
$$\hat\mu_i = Y_i\,\mathbf{1}(|Y_i| > T), \qquad (2)$$
i.e., a vector component is zeroed if its absolute value does not exceed the critical threshold $T$. This procedure is equivalent to testing the hypothesis of zero mathematical expectation for each component of the vector, and when using the FDR method, the threshold value $T$ is selected according to the following rule. The observations are used to construct the ordered sequence of decreasing absolute values
$$|Y|_{(1)} \ge |Y|_{(2)} \ge \ldots \ge |Y|_{(n)},$$
and these values are compared with the right-tail Gaussian quantiles $t_k = \sigma z\big(\tfrac{q}{2}\cdot\tfrac{k}{n}\big)$, where $z(\alpha)$ denotes the upper $\alpha$-quantile of the standard normal distribution. Let $\hat{k}_F$ be the largest index $k$ for which $|Y|_{(k)} \ge t_k$; then the threshold $\hat{T}_F = t_{\hat{k}_F}$ is chosen.
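The threshold-selection rule can be sketched as follows (NumPy and the standard-library `NormalDist` are assumed; the fallback used when no order statistic crosses its quantile is a convention adopted here, not taken from the text):

```python
import numpy as np
from statistics import NormalDist

def z_upper(a):
    """Upper a-quantile of the standard normal law: P(Z > z) = a."""
    return NormalDist().inv_cdf(1.0 - a)

def fdr_threshold(y, q, sigma=1.0):
    """FDR threshold: compare |y|_(1) >= ... >= |y|_(n) with
    sigma * z(q*k/(2n)) and take the quantile at the last crossing."""
    y = np.asarray(y, dtype=float)
    n = y.size
    abs_sorted = np.sort(np.abs(y))[::-1]
    t = np.array([sigma * z_upper(q * k / (2 * n)) for k in range(1, n + 1)])
    crossings = np.nonzero(abs_sorted >= t)[0]
    if crossings.size == 0:
        # No crossing at all: fall back to the universal threshold
        # (an assumed convention; nothing is rejected in this case).
        return sigma * np.sqrt(2 * np.log(n))
    return float(t[crossings.max()])
```

Taking the *last* crossing (the largest index $k$) is what distinguishes this rule from a step-down procedure and mirrors the step-up character of the Benjamini–Hochberg method.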
In combination with hypothesis testing methods, the penalty method is also widely used, in which the target loss function is minimized with the addition of a penalty term [7,8,9]. In a particular case, this method leads to the so-called soft thresholding: the estimates of the vector components are calculated according to the rule
$$\hat\mu_i = \mathrm{sgn}(Y_i)\,(|Y_i| - T)_+. \qquad (3)$$
This approach is in some cases more adequate than (2), since the function in (3) is continuous in $T$.
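Both estimators are elementary componentwise operations; a minimal sketch (NumPy assumed):

```python
import numpy as np

def hard_threshold(y, t):
    """Hard thresholding: zero every component with |y_i| <= t, keep the rest."""
    y = np.asarray(y, dtype=float)
    return np.where(np.abs(y) > t, y, 0.0)

def soft_threshold(y, t):
    """Soft thresholding: shrink each component toward zero by t."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)
```

As a function of the data (and of $T$), the soft-thresholding map is continuous, whereas the hard-thresholding map jumps at $|y_i| = T$; this is the continuity property referred to above.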
The mean-square error (or risk) of the considered procedures is defined as
$$R_n(T) = \sum_{i=1}^n \mathrm{E}\,(\hat\mu_i - \mu_i)^2. \qquad (4)$$
Methods for selecting the threshold value $T$ are usually focused on minimizing the risk (4) provided that the vector $\mu$ belongs to a given class. A “perfect” value of the threshold is
$$T_{\min} = \arg\min_T R_n(T).$$
Note that the expression (4) contains the unknown values of $\mu_i$, so it is impossible to calculate $T_{\min}$ and $R_n(T_{\min})$ in practice. Therefore, a minimax approach is used. The threshold $\hat{T}_F$ is calculated based on the observed values of $Y_i$ and has the property of adaptive minimax optimality in the considered sparse classes [7]. In addition, $\hat{T}_F$ has the following important property [7], which we will use later in proving the asymptotic normality of the risk estimate.
Theorem 1. [7] Suppose that or , where for and for , . Then there exists such that for the FDR-threshold with a controlling parameter and large n, where for and for . Thus, if is chosen so that , the value of is the lower bound for the threshold .
Note also that the so-called universal threshold $T_U = \sigma\sqrt{2\ln n}$ is popular as well. This threshold is, in a certain sense, maximal: it was shown in [10,11] that threshold values exceeding $T_U$ can be ignored. Based on this, we will assume everywhere that $T \le \sigma\sqrt{2\ln n}$.
4. Asymptotic Properties of the Risk Estimate
As already mentioned, the expression (4) explicitly depends on the unknown values of $\mu_i$, so it cannot be calculated in practice. However, it is possible to construct an estimate of it that is calculated using only the observed data. This estimate is determined by the expression
$$\hat{R}_n(T) = \sum_{i=1}^n F[Y_i, T], \qquad (7)$$
where
$$F[x, T] = (x^2 - \sigma^2)\,\mathbf{1}(|x| \le T) + \sigma^2\,\mathbf{1}(|x| > T)$$
for the hard thresholding and
$$F[x, T] = (x^2 - \sigma^2)\,\mathbf{1}(|x| \le T) + (\sigma^2 + T^2)\,\mathbf{1}(|x| > T)$$
for the soft thresholding [12].
In [6] it is proved that the estimate (7) is consistent.

Theorem 2. [6] Let the conditions of Theorem 1 be satisfied and as so that , then

Let us now prove a statement about the asymptotic normality of the estimate (7), which, in particular, allows constructing asymptotic confidence intervals for the mean-square risk (4). In the proof, we will use the same letter $C$ for different positive constants that may depend on the parameters of the classes and methods under consideration, but do not depend on $n$.
First, consider the class $\ell_0[\eta_n]$.
Theorem 3. Let , , . Let be the FDR-threshold with a controlling parameter and as , where and are defined in (5). Then

Proof. Let us prove the theorem for the soft thresholding method. In the case of hard thresholding, the proof is similar.
With soft thresholding, $\hat{R}_n(T)$ is an unbiased estimate of $R_n(T)$, and with hard thresholding, under the conditions of the theorem, the bias divided by the normalizing factor tends to zero [12]. For the variance of the numerator, we have [13]
Moreover, since the $Y_i$ are independent, and the number of nonzero $\mu_i$ does not exceed $\eta_n n$, we obtain
Finally, the Lindeberg condition is met: for any $\varepsilon > 0$, as $n \to \infty$,
where . Indeed, due to (9) and (10), and since the summands in are bounded in absolute value by , starting from some $n$ all the indicators in (11) vanish.
Therefore, (
8) holds, and to prove the theorem it remains to show that
Repeating the reasoning of [14,15,16], it can be shown that , where . To shorten the notation without compromising the proof, we can omit and assume that .
Let
,
, where the sum
contains terms with
, and
contains all other terms. By the definition of the class
, the number of terms in
does not exceed
. Moreover, the absolute value of each term is bounded by
. For convenience, we will assume that
contains terms with indices from 1 to
, i.e.,
Next,
and
. Given the definition of the class
and the form of
, it can be shown that for the terms of
the estimate
is valid when
. So
and for
as
.
Next, take
and denote
Divide the segment
into equal parts:
,
,
. Then
where
Applying the Hoeffding inequality [
17] for
, we obtain the estimate
It is easy to show that
. Hence,
Applying the Hoeffding inequality, we obtain
Thus, for an arbitrary
as
.
Let us now consider the sum
. For large
n, the number of terms in this sum is
. Repeating the above reasoning, we divide the segment
into equal parts:
,
,
. Then
where
Taking into account the definition of the class
and the form of
, we can bound the variance of the terms in
(and hence
):
. Then, applying Bernstein’s inequality [
18] for
, we obtain
The variance of the terms in is bounded by .
Applying Bernstein’s inequality, we obtain
Thus, for an arbitrary
as
.
Combining (
8), (
12), (
14) and (
15), we obtain the statement of the theorem. □
A similar statement is true for the class $m_p[\eta_n]$.
Theorem 4. Let , , , . Let be the FDR-threshold with a controlling parameter and as , where and are defined in (6). Then

Proof. The main steps in the proof of this theorem repeat the proof of Theorem 3. We also write
The statement
is proved exactly the same as the statement (
8). Let
,
, where the sum
contains terms with
, and
contains all other terms. By the definition of the class
, the number of terms in
does not exceed
and each term is bounded in absolute value by
. Considering the form of
, it can be shown that the mathematical expectations of the terms in
do not exceed
, and their variances do not exceed
. Next, arguing as in Theorem 3, we see that
and
for an arbitrary
as
. Thus, since
,
as
. □
The above statements demonstrate that the considered method for constructing estimates in the model (1) has properties very similar to those of the method based on minimizing the estimate (7) with respect to the parameter $T$ (see [19]).