Article

Quantile-Composited Feature Screening for Ultrahigh-Dimensional Data

Shuaishuai Chen 1 and Jun Lu 2
1 School of Mathematics, Shandong University, Jinan 250100, China
2 School of Science, National University of Defense Technology, Changsha 410000, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(10), 2398; https://doi.org/10.3390/math11102398
Submission received: 24 April 2023 / Revised: 12 May 2023 / Accepted: 19 May 2023 / Published: 22 May 2023
(This article belongs to the Special Issue Methods and Applications in Multivariate Statistics)

Abstract: Ultrahigh-dimensional grouped data are frequently encountered by biostatisticians working on multi-class categorical problems. To rapidly screen out the null predictors, this paper proposes a quantile-composited feature screening procedure. The new method first transforms a continuous predictor into a Bernoulli variable by thresholding the predictor at a certain quantile. Consequently, the independence between the response and each predictor is easy to judge by employing the Pearson chi-square statistic. The newly proposed method has the following salient features: (1) it is robust against high-dimensional heterogeneous data; (2) it is model-free, without specifying any regression structure between the covariate and the outcome variable; (3) it enjoys a low computational cost, with the computational complexity controlled at the sample size level. Under some mild conditions, the new method is shown to achieve the sure screening property without imposing any moment condition on the predictors. Numerical studies and real data analyses further confirmed the effectiveness of the new screening procedure.

1. Introduction

With the rapid advancements in science and technology, ultrahigh-dimensional data are becoming increasingly common across various fields of scientific research: these include, but are not limited to, biomedical imaging, neuroscience, tomography, and tumor classifications, where the number of variables or parameters can exponentially increase with the sample size. In such a situation, an important task is to recover the important features from thousands or even millions of predictors.
In order to rapidly lower the huge dimensionality of data to an acceptable size, Fan and Lv [1] introduced the method of sure independence screening, which ranks the importance of predictors according to their marginal utilities. Since then, a series in the literature has been devoted to this issue, in various scenarios, which can basically be divided into two groups: the model-based and the model-free methods. For the former, the typical literature includes, but is not limited to, Wang [2], Chang et al. [3], and Wang and Leng [4] for linear models, Fan et al. [5] for additive models, and Fan et al. [6] and Liu et al. [7] for varying coefficients models, amongst others. Model-based methods are computationally efficient, but can suffer from the risk of model misspecification. To avoid such a risk, researchers developed the model-free methods, without specifying a concrete model. For example, Zhu et al. [8] proposed a screening procedure named SIRS for the multi-index model; Li et al. [9] introduced a sure screening procedure via the distance correlation called DCS; for the heterogeneous data, He et al. [10] developed a quantile-adaptive screening method; Lin et al. [11] proposed a novel approach, dubbed Nonparametric Ranking Feature Screening (NRS), leveraging the local information flows of the predictors; Lu and Lin [12] developed a conditional model-free screening procedure, utilizing the conditional distance correlation; and Tong et al. [13] proposed a model-free conditional feature screening method with FDR control. Additionally, Ref. [14] recently introduced a data-adaptive threshold selection procedure with error rate control, which is applicable to most kinds of popular screening methods. Ref. [15] proposed a feature screening method for the interval-valued response.
The literature listed above mainly concentrated on the continuous response; however, ultrahigh-dimensional grouped data, in which the label of a sample can be seen as a categorical response, are also very frequently encountered in many scientific research fields—specifically, for biostatisticians who work on multi-class categorical problems. For example, in the diagnosis of tumor classification, researchers need to judge the type of tumor, according to the gene expression level. If we do not reduce the dimension of the predictors, the established classifier will behave as poorly as random guessing, due to the diverging spectra and accumulation of noise (Fan et al. [16]); therefore, it makes sense to screen out the null predictors before further analysis. The following are the existing works that have made some progress on this issue. Huang et al. [17] proposed a screening method based on Pearson chi-square statistics, for discrete predictors. Pan et al. [18] set the maximal mean difference for each pair of classes as a ranking index and, based on this, proposed a corresponding screening procedure. Mai and Zou [19] built a Kolmogorov–Smirnov type distance, to measure the dependence between two variables, and used it as a filter for screening out noise predictors. Cui et al. [20] proposed a screening method via measuring the distance of the distribution of the subgroup from the whole distribution. Recently, Xie et al. [21] established a category-adaptive screening procedure, by calculating the difference between the conditional distribution of the response and the marginal one. All these aforementioned methods were clearly motivated, and have been examined effectively for feature screening in different settings.
In this paper, we propose a new robust screening method for ultrahigh-dimensional grouped data. Our research was partly motivated by an empirical analysis of a leukemia dataset consisting of 72 observations and 3571 genes, of which 47 were acute lymphocytic leukemia (ALL) and 25 were acute myelogenous leukemia (AML). Figure 1 plots the density functions of the first 20 features selected from the 3571 genes of the 47 ALLs, from which it can be seen that all of them are far from being regularly distributed, most of them have sharp peaks and heavy tails (e.g., gene 9 and gene 12), and some of them are even multi-modal (e.g., gene 6 and gene 8), although these samples are from the same ALL group. This phenomenon challenges most of the existing methods. For example, the method in Pan et al. [18] might fail if the data are not normally distributed, and the method in Xie et al. [21] might lose efficiency when the distribution of a predictor is multi-modal. It is known that quantile-based statistics are not sensitive to outliers or heavy-tailed data; thus, a quantile-based screening method is expected to be robust against heterogeneous data. Furthermore, compared to point estimation, quantile-based statistics can usually provide a more detailed picture of a predictor at different quantile levels. Motivated by the above discussion, we propose a quantile-composited screening approach that aggregates the distributional information over many quantile levels. The basic idea of our method is straightforward. If $X_j$ has no contribution to predicting the category of the outcome variable, denoted by $Y$, at the $\tau$-th quantile level, the conditional quantile function of $X_j$ given $Y$ should be equal to the unconditional one, i.e., $q_{X_j|Y}(\tau) = q_{X_j}(\tau)$. Moreover, if $X_j$ and $Y$ are independent, we have $q_{X_j|Y}(\tau) = q_{X_j}(\tau)$ (a.s.) for all $\tau \in (0,1)$, where a.s. means 'almost surely'. Thus, the equality $q_{X_j|Y}(\tau) = q_{X_j}(\tau)$ plays a key role in measuring the independence between $Y$ and $X_j$. To quantify this kind of independence, we show that $q_{X_j|Y}(\tau) = q_{X_j}(\tau)$ for a given $\tau$ is equivalent to the independence between the indicator variable $I(X_j - q_{X_j}(\tau) > 0)$ and the label variable $Y$. Then, checking the equality between $q_{X_j|Y}(\tau)$ and $q_{X_j}(\tau)$ is converted into testing the independence between two discrete variables, which can easily be done with the Pearson chi-square test statistic. Finally, we aggregate all the discriminative information over the whole distribution in an efficient way, and based on this we establish the corresponding screening procedure. Our newly proposed screening method enjoys the following salient features. First of all, compared to the existing methods, it is robust against non-normal data, which are very common in high dimensions. Secondly, it is model-free, in the sense that we do not need to assume a specific statistical model, such as a linear or quadratic discriminant analysis model, between the predictor and the outcome variable. Thirdly, its ranking index has a very simple form, and the computational complexity is controlled at the sample size level, so the proposed screening method can be implemented very quickly. In addition, as a by-product, our new method is invariant under monotonic transformations of the data.
The rest of the paper is organized as follows. Section 2 gives the details of the quantile-composited screening procedure, including the methodological development, theoretical properties, and some extensions. Section 3 provides convincing numerical results and two real data analyses. Technical proofs of the main results are deferred to Appendix A.

2. A New Feature Screening Procedure

Let $X = (X_1, \ldots, X_p)$ be the $p$-dimensional predictor and, without loss of generality, let $Y \in \{1, \ldots, K\}$ be the outcome variable indicating which group $X$ belongs to, where $K$ is allowed to grow with the sample size at a certain rate. Define the index set of active predictors corresponding to quantile level $\tau$ as
$$\mathcal{A}_{\tau} = \{1 \le j \le p : q_{X_j|Y}(\tau) \text{ functionally depends on } Y\},$$
where $q_{X_j|Y}(\tau) = \inf\{t : P(X_j \le t \mid Y) \ge \tau\}$. Denote by $|\mathcal{A}_{\tau}|$ the cardinality of $\mathcal{A}_{\tau}$; under the sparsity assumption, $|\mathcal{A}_{\tau}|$ is usually less than the sample size $n$. Denote by $q_{X_j}(\tau) = \inf\{t : P(X_j \le t) \ge \tau\}$ the $\tau$-th quantile of $X_j$. Intuitively, if $q_{X_j|Y}(\tau)$ does not functionally depend on $Y$, then $q_{X_j|Y}(\tau) = q_{X_j}(\tau)$ for all values of $Y$: in other words, $X_j$ has no ability to predict its label $Y$ at the quantile level $\tau$. On the other hand, if $q_{X_j|Y}(\tau)$ is far away from $q_{X_j}(\tau)$ for some $Y$, then $X_j$ will be helpful for predicting the category of $Y$. Hence, the difference between $q_{X_j|Y}(\tau)$ and $q_{X_j}(\tau)$ determines whether $X_j$ is a contributive predictor at the $\tau$-th quantile level. The following lemma is of central importance to our methodological development.
Lemma 1.
Let $Y$ be the outcome variable, and let $X_j$ be a continuous variable; then, we have two conclusions:
(1) 
$q_{X_j|Y}(\tau) = q_{X_j}(\tau)$ a.s. if and only if the Bernoulli variable $I\{X_j > q_{X_j}(\tau)\}$ and $Y$ are independent, where $I\{\cdot\}$ is the indicator function;
(2) 
$q_{X_j|Y}(\tau) = q_{X_j}(\tau)$ a.s. for all $\tau \in (0,1)$ if and only if $Y$ and $X_j$ are independent.
The proof of this lemma is presented in Appendix A. Conclusions (1) and (2) imply that the independence between $X_j$ and $Y$ over $\tau \in (0,1)$ is equivalent to the independence between $I\{X_j > q_{X_j}(\tau)\}$ and $Y$; consequently, it is natural to apply the Pearson chi-square statistic to measure the independence between them. Let $Z_j(\tau) = I\{X_j > q_{X_j}(\tau)\}$, $\pi_{yk} = P(Y = k)$, $\pi_{jb}(\tau) = P(Z_j(\tau) = b)$, and $\pi_{yk,jb}(\tau) = P(Y = k, Z_j(\tau) = b)$. Then, the dependence of $X_j$ on the response $Y$ at quantile level $\tau$ can be evaluated by
$$Q_j(\tau) = \sum_{k=1}^{K}\sum_{b=0}^{1}\frac{\big(\pi_{yk}\,\pi_{jb}(\tau) - \pi_{yk,jb}(\tau)\big)^{2}}{\pi_{yk}\,\pi_{jb}(\tau)}.$$
Clearly, $Q_j(\tau) = 0$ if and only if $Z_j(\tau)$ and $Y$ are independent.
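As a toy numerical illustration (the probabilities here are hypothetical and not taken from the paper), consider a binary response and the median level $\tau = 0.5$, so that $\pi_{y1} = \pi_{y2} = 0.5$ and $\pi_{j0}(0.5) = \pi_{j1}(0.5) = 0.5$. If the joint cells were $\pi_{y1,j0}(0.5) = \pi_{y2,j1}(0.5) = 0.4$ and $\pi_{y1,j1}(0.5) = \pi_{y2,j0}(0.5) = 0.1$, every cell would deviate from the product $0.25$ by $0.15$, so
$$Q_j(0.5) = 4 \times \frac{0.15^{2}}{0.25} = 0.36 > 0,$$
whereas under independence every joint cell equals $0.25$ and $Q_j(0.5) = 0$.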
$Q_j(\tau)$ provides a way to identify whether $X_j$ is active at quantile level $\tau$. However, it is not easy to determine the informative quantile levels for every predictor. Moreover, an active predictor could be contributive at many quantiles instead of a single one. For these reasons, we propose a quantile-composited screening index, which integrates $Q_j(\tau)$ over the interval $(0,1)$. More specifically, the ranking index is defined as
$$Q_j = \int_{\alpha}^{1-\alpha} Q_j(\tau)\, w_j(\tau)\, d\tau,$$
where $w_j(\tau)$ is some positive weight function, and $\alpha$ is a value tending to 0 at a certain rate related to the sample size, which will be specified in the next section. Note that $Q_j$ avoids integrating at the endpoints 0 and 1, because $Q_j(\tau)$ could be ill-defined at these two points. Theoretically, $Q_j = 0$ if $X_j$ is independent of $Y$, regardless of the choice of $w_j(\tau)$, which is easy to prove from Lemma 1. By the above analysis, $Q_j(\tau)$ is always non-negative for $\tau \in (0,1)$, and equals 0 if $X_j$ is independent of $Y$.
For the choice of the weight $w_j(\tau)$, different settings lead to different values of $Q_j$. For example, a naive setting is $w_j(\tau) = 1$ for $\tau \in (0,1)$, which means that all $Q_j(\tau)$ are treated equally. Clearly, this is not a good option. Intuitively, if $X_j$ is active, $Q_j(\tau)$ should be large for some $\tau$ in $(0,1)$, so we should place more weight on those quantile levels. For this reason, we set $w_j(\tau) = Q_j(\tau) \big/ \int_{\alpha}^{1-\alpha} Q_j(\tau)\, d\tau$; the resultant $Q_j$ then has the following form:
$$Q_j = \int_{\alpha}^{1-\alpha} Q_j^{2}(\tau)\, d\tau \Big/ \int_{\alpha}^{1-\alpha} Q_j(\tau)\, d\tau.$$
In addition, to make $Q_j$ well defined, we set $Q_j = 0$ when $Q_j(\tau) = 0$ for all $\tau \in (0,1)$.
In the following, we give the estimation of $Q_j$. Suppose $\{X_i, Y_i\}_{i=1}^{n}$ is a set of i.i.d. (independent and identically distributed) samples from $(X, Y)$. Let $\hat q_{X_j}(\tau)$ be the $\tau$-th sample quantile of $X_j$ and $Z_{ij}(\tau) = I\{X_{ij} > \hat q_{X_j}(\tau)\}$; then $\pi_{yk}$, $\pi_{jb}(\tau)$ and $\pi_{yk,jb}(\tau)$ can be estimated by $\hat\pi_{yk} = n^{-1}\sum_{i=1}^{n} I\{Y_i = k\}$, $\hat\pi_{jb}(\tau) = n^{-1}\sum_{i=1}^{n} I\{Z_{ij}(\tau) = b\}$ and $\hat\pi_{yk,jb}(\tau) = n^{-1}\sum_{i=1}^{n} I\{Y_i = k\} I\{Z_{ij}(\tau) = b\}$, respectively. Then, by the plug-in method, $Q_j(\tau)$ is estimated as
$$\hat Q_j(\tau) = \sum_{k=1}^{K}\sum_{b=0}^{1}\frac{\big(\hat\pi_{yk}\,\hat\pi_{jb}(\tau) - \hat\pi_{yk,jb}(\tau)\big)^{2}}{\hat\pi_{yk}\,\hat\pi_{jb}(\tau)},$$
and $Q_j$ is estimated as
$$\hat Q_j = \int_{\alpha}^{1-\alpha} \hat Q_j^{2}(\tau)\, d\tau \Big/ \int_{\alpha}^{1-\alpha} \hat Q_j(\tau)\, d\tau.$$
Regarding $\hat Q_j(\tau)$, we make the following remarks:
Remark 1.
1. 
If $q_{X_j|Y}(\tau) = q_{X_j}(\tau)$, then $n\hat Q_j(\tau)$ asymptotically follows the $\chi^2$ distribution with $K-1$ degrees of freedom [22].
2. 
$\hat Q_j$ is invariant to any monotonic transformation of the predictors, because $Z_j(\tau)$ is free of monotonic transformations of $X_j$.
3. 
The computation of $\hat Q_j$ involves integration over $\tau$. We can calculate it by an approximate numerical method as
$$\hat Q_j \approx \sum_{i=1}^{s} \hat Q_j^{2}(i/s) \Big/ \sum_{i=1}^{s} \hat Q_j(i/s).$$
4. 
The choice of s. Intuitively, a large $s$ makes the approximation of the integral more accurate. However, our method aims to efficiently separate the active predictors from the null ones, rather than to estimate $Q_j$ accurately. Figure 2 displays the density curves of the marginal utilities of active and inactive predictors for different choices of $s$, using Example 2 in Section 3. It can be seen that the choice of $s$ affects the distribution of neither the active predictors' utilities nor the inactive ones'.
5. 
Figure 2 also shows that the gap between the indices of active predictors and inactive ones is clear, which means the proposed method separates the influential predictors from the inactive ones well. Moreover, the marginal utilities of the active predictors have a smaller variance than those of the inactive ones, which implies that the new method is sensitive to the active predictors.
With the estimate $\hat Q_j$, the index set of active predictors can be estimated as
$$\hat{\mathcal{A}} = \{1 \le j \le p : \hat Q_j \ge cn^{-\eta}\},$$
where $c$ and $\eta$ are two predetermined thresholding values. In practice, we usually adopt a hard-threshold criterion to determine the submodel as
$$\hat{\mathcal{A}} = \{1 \le j \le p : \hat Q_j \text{ is among the top } d_n \text{ largest of all}\},$$
where $d_n$ is a predetermined threshold value. We call the above quantile-composited screening procedure based on $\hat Q_j$ QCS.
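To make the computation concrete, the following is a minimal sketch of the QCS utilities in Python. This is our own illustrative code, not the authors' released implementation; the function name, the interior quantile grid, and the default s are assumptions.

import numpy as np

def qcs_scores(X, y, s=50):
    # X: (n, p) matrix of continuous predictors; y: length-n vector of class labels
    # s: number of quantile levels used to approximate the integral in Q_j
    n, p = X.shape
    classes, y_idx = np.unique(y, return_inverse=True)
    K = len(classes)
    pi_y = np.bincount(y_idx, minlength=K) / n            # estimates of pi_{yk}
    taus = np.arange(1, s + 1) / (s + 1)                  # interior quantile grid (an assumption)
    scores = np.zeros(p)
    for j in range(p):
        q_tau = np.zeros(s)
        for t, tau in enumerate(taus):
            z = (X[:, j] > np.quantile(X[:, j], tau)).astype(int)   # Z_j(tau)
            pi_b = np.array([np.mean(z == 0), np.mean(z == 1)])     # estimates of pi_{jb}(tau)
            chi2 = 0.0
            for k in range(K):
                for b in range(2):
                    joint = np.mean((y_idx == k) & (z == b))        # estimate of pi_{yk,jb}(tau)
                    expected = pi_y[k] * pi_b[b]
                    if expected > 0:
                        chi2 += (expected - joint) ** 2 / expected
            q_tau[t] = chi2                                         # Q_j(tau) at this level
        denom = q_tau.sum()
        scores[j] = (q_tau ** 2).sum() / denom if denom > 0 else 0.0  # composited index
    return scores

Ranking the columns by these utilities and keeping the top $d_n = [n/\log n]$ of them then yields the hard-threshold submodel, e.g., keep = np.argsort(qcs_scores(X, y))[::-1][:d_n].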

2.1. Theoretical Properties

This section establishes the sure screening property of the newly proposed method, which guarantees its effectiveness. The corresponding technical details of the proofs can be found in Appendix A.
We first prove the consistency of $\hat Q_j(\tau)$. To this end, we require the following condition.
(C1): There exist two constants $c_1 < c_2$ such that $c_1/K < \pi_{yk} < c_2/K$ for $k \in \{1, 2, \ldots, K\}$, with $K = O(n^{\gamma})$.
Condition (C1) requires that the sample size of each subgroup be neither too small nor too large. The condition $K = O(n^{\gamma})$ allows the number of categories to diverge to infinity at a certain rate as the sample size increases. The following theorem states the consistency of $\hat Q_j(\tau)$.
Theorem 1.
For a given quantile level $\tau \in (\alpha, 1-\alpha)$, under condition (C1),
$$P\Big(\big|\hat Q_j(\tau) - Q_j(\tau)\big| \ge cn^{-\eta}\Big) \lesssim K\exp\big(-\{n^{1-2\gamma-2\eta} + n^{-\eta-2\gamma}/\bar\tau\}\big),$$
where $\bar\tau = \min(\tau, 1-\tau)$.
This theorem shows that the consistency of $\hat Q_j(\tau)$ can be guaranteed under suitable conditions. In addition, it reminds us that we cannot select quantile levels very close to either zero or one, because the term $\bar\tau$ would then collapse to zero, which would make the consistency of $\hat Q_j(\tau)$ problematic. Based on the above theorem, the following corollary provides the consistency of $\hat Q_j$.
Corollary 1.
Under the conditions of Theorem 1, if $\bar\tau = O(n^{\eta-1})$,
$$P\Big(\max_{1\le j\le p}\big|\hat Q_j - Q_j\big| > cn^{-\eta}\Big) \le pK\exp\big(-O\{n^{1-2\eta-2\gamma}\}\big).$$
This corollary states that the gap between $\hat Q_j$ and $Q_j$ disappears with probability tending to 1 as $n \to \infty$. It also shows that our method can handle dimensionality of order $o\big(\exp\{n^{1-2\gamma-2\eta}\}\big)$.
In the following, we provide the sure screening property of our method.
Theorem 2.
(Sure screening property). Let $\mathcal{A} = \{1 \le j \le p : Q_j > 0\}$; then, under condition (C1) and the minimal signal condition $\min_{j\in\mathcal{A}} Q_j \ge 2cn^{-\eta}$,
$$P\big(\mathcal{A} \subseteq \hat{\mathcal{A}}\big) \ge 1 - s_n K\exp\big(-O(n^{1-2\eta-2\gamma})\big),$$
where $s_n$ is the cardinality of $\mathcal{A}$.

2.2. Extensions

Up to this point, the new method has been designed for ultrahigh-dimensional data with a categorical response. In this section, to make the proposed method applicable in more settings, we give two natural extensions, and in the next section we use numerical simulations to illustrate their effectiveness.
Extension to Genome-Wide Association Studies. We first apply our method to the typical setting of genome-wide association studies (GWAS), where the predictors are single-nucleotide polymorphisms (SNPs) taking three values, denoted by $\{AA, Aa, aa\}$, and the response is continuous. Our strategy for this problem is straightforward: code the sample space $\{AA, Aa, aa\}$ as $\{-1, 0, 1\}$, respectively; then the marginal utility of $X_j$ at quantile level $\tau$ is defined as
$$\hat Q_{1,j}(\tau) = \sum_{b=-1}^{1}\sum_{k=0}^{1}\frac{\big(\hat\pi_{yk}^{1}(\tau)\,\hat\pi_{jb}^{1} - \hat\pi_{yk,jb}^{1}(\tau)\big)^{2}}{\hat\pi_{yk}^{1}(\tau)\,\hat\pi_{jb}^{1}}, \quad j = 1, \ldots, p,$$
where $Y_i(\tau) = I\big(Y_i > \hat q_Y(\tau)\big)$, $\hat\pi_{yk}^{1}(\tau) = n^{-1}\sum_{i=1}^{n} I\big(Y_i(\tau) = k\big)$, $\hat\pi_{jb}^{1} = n^{-1}\sum_{i=1}^{n} I\big(X_{ij} = b\big)$, and $\hat\pi_{yk,jb}^{1}(\tau) = n^{-1}\sum_{i=1}^{n} I\big(Y_i(\tau) = k, X_{ij} = b\big)$ for $b = -1, 0, 1$.
Extension to additive models. We can also extend our method to models in which both the response and the predictors are continuous. To make our method applicable, we first slice each predictor into several segments according to some threshold values. For example, taking the quartiles of a predictor as the cut points, the predictor is transformed into a balanced four-category variable. Specifically, let $(\hat Q_{j1}, \ldots, \hat Q_{j(N-1)})$ be $N-1$ percentiles of $X_j$, and define $X_{ij}^{*} = b$ if $\hat Q_{jb} \le X_{ij} < \hat Q_{j(b+1)}$, where $b = 0, 1, \ldots, N-1$; here, we set $\hat Q_{j0} = \min_i X_{ij}$ and $\hat Q_{jN} = \max_i X_{ij}$. Then, similarly to the GWAS extension above, we define the marginal utility of $X_j$ at quantile level $\tau$ as
$$\hat Q_{2,j}(\tau) = \sum_{b=0}^{N-1}\sum_{k=0}^{1}\frac{\big(\hat\pi_{yk}^{*}(\tau)\,\hat\pi_{jb}^{*} - \hat\pi_{yk,jb}^{*}(\tau)\big)^{2}}{\hat\pi_{yk}^{*}(\tau)\,\hat\pi_{jb}^{*}}, \quad j = 1, \ldots, p,$$
where $\hat\pi_{yk}^{*}(\tau) = \hat\pi_{yk}^{1}(\tau)$, $\hat\pi_{jb}^{*} = n^{-1}\sum_{i=1}^{n} I\big(X_{ij}^{*} = b\big)$, and $\hat\pi_{yk,jb}^{*}(\tau) = n^{-1}\sum_{i=1}^{n} I\big(Y_i(\tau) = k, X_{ij}^{*} = b\big)$ for $b = 0, 1, \ldots, N-1$. A code sketch of this extension follows.
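The following is a minimal sketch of this continuous-response variant (our own illustrative code; the function name, the interior quantile grid, and the default of N = 4 quartile-based slices are assumptions, not part of the paper):

import numpy as np

def qcs_scores_continuous(X, Y, s=50, N=4):
    # Continuous response Y and continuous predictors X: Y is dichotomized at each
    # quantile level tau, and each X_j is sliced into N (here quartile-based)
    # categories before the chi-square utility is formed.
    n, p = X.shape
    taus = np.arange(1, s + 1) / (s + 1)
    cuts = np.quantile(X, np.arange(1, N) / N, axis=0)                       # (N-1, p) cut points
    X_star = np.zeros_like(X, dtype=int)
    for j in range(p):
        X_star[:, j] = np.searchsorted(cuts[:, j], X[:, j], side="right")    # X*_{ij} in {0,...,N-1}
    scores = np.zeros(p)
    for j in range(p):
        q_tau = np.zeros(s)
        for t, tau in enumerate(taus):
            yk = (Y > np.quantile(Y, tau)).astype(int)                       # Y_i(tau)
            pi_y = np.array([np.mean(yk == 0), np.mean(yk == 1)])
            chi2 = 0.0
            for b in range(N):
                pi_b = np.mean(X_star[:, j] == b)
                for k in range(2):
                    joint = np.mean((yk == k) & (X_star[:, j] == b))
                    expected = pi_y[k] * pi_b
                    if expected > 0:
                        chi2 += (expected - joint) ** 2 / expected
            q_tau[t] = chi2
        denom = q_tau.sum()
        scores[j] = (q_tau ** 2).sum() / denom if denom > 0 else 0.0
    return scores

The GWAS extension is analogous: the predictors are already coded in {-1, 0, 1}, so the slicing step is not needed and the inner loop simply runs over those three categories.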

3. Numerical Studies

3.1. General Settings

For this section, we first conducted some Monte Carlo simulations, to compare our method to those of several competitors. Then, we applied our screening procedure to two real data examples.
We compared our method to: (1) MV-based sure independence screening (MVS) [20], which can be seen as the weighted average of the Cramér–von Mises distances between the conditional distribution function of X given Y = k and the unconditional distribution function of X; (2) distance correlation–sure independence screening (DCS) [9], which employs distance correlation as a measure to evaluate the importance of each predictor; (3) category-adaptive variable screening (CAS) [21], which screens the inactive predictor, by comparing its marginal distribution to its marginal conditional one; (4) Kolmogorov filter screening (KFS) [19], which filters the inactive predictors, by comparing the Kolmogorov distance between the conditional distribution and the unconditional one. Note that DCS is not efficient for categorical variables. Thus, we transferred the categorical variable into a multivariate dummy variable, with the i-th coordinate equal to 1, and other coordinates equal to 0, where i was the category of a sample, e.g., we transformed Y = 3 into ( 0 , 0 , 1 , 0 , 0 ) if Y was five-category.
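For instance, a minimal sketch of this dummy-variable recoding (illustrative code; names are ours):

import numpy as np

def one_hot(y, K):
    # map a label in {1, ..., K} to a K-dimensional dummy vector
    out = np.zeros((len(y), K))
    out[np.arange(len(y)), np.asarray(y) - 1] = 1.0
    return out

# e.g., with K = 5, the label y = 3 becomes (0, 0, 1, 0, 0)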
Throughout the simulations, we repeated each experiment 1000 times and always set $s = 50$. To evaluate the performance of the different methods fairly, the following criteria were employed: (1) MS: the minimum model size required to include all active predictors (i.e., to achieve sure screening); (2) $\mathcal{P}_d$: the percentage of replications in which the selected submodel of size $d$ contains all active predictors. Let MS($t$) be the result of the $t$-th replication, and denote by $\mathrm{MS}_{\alpha}$ the $\alpha$-level quantile of $\{\mathrm{MS}(1), \ldots, \mathrm{MS}(1000)\}$; we then report the median of MS (MMS), the interquartile range (IQR) of MS, and the extreme percentile range (EPR) of MS, namely:
$$\mathrm{MMS} = \mathrm{MS}_{0.5}, \quad \mathrm{IQR} = \mathrm{MS}_{0.75} - \mathrm{MS}_{0.25}, \quad \mathrm{EPR} = \mathrm{MS}_{0.95} - \mathrm{MS}_{0.05}, \quad \mathcal{P}_d = \frac{1}{1000}\sum_{t=1}^{1000} I\big(\mathrm{MS}(t) \le d\big) \times 100\%.$$
We considered $d = d_n = [n/\log n]$ and $d = 2d_n = 2[n/\log n]$ for a small and a large model size, respectively, where $[a]$ denotes the integer part of $a$. Under these criteria, a well-behaved screening method should have a small MS and $\mathcal{P}_d$ close to 100%.
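These criteria are straightforward to compute from the recorded minimum model sizes; a short sketch (function and variable names are ours) follows:

import numpy as np

def summarize_ms(ms, d):
    # ms: minimum model sizes over replications; d: model size for the coverage P_d
    ms = np.asarray(ms)
    mms = np.quantile(ms, 0.5)                            # MMS
    iqr = np.quantile(ms, 0.75) - np.quantile(ms, 0.25)   # IQR
    epr = np.quantile(ms, 0.95) - np.quantile(ms, 0.05)   # EPR
    p_d = np.mean(ms <= d) * 100                          # P_d in percent
    return mms, iqr, epr, p_d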

3.2. Monte Carlo Simulations

Example 1.
Data were generated in the following manner (a data-generation sketch in code is given after the case list). For a given $Y = k$, the $p$-dimensional random vector $X \mid \{Y = k\}$ was generated from the mixture distribution $(1-r)Z + rW$, where $Z \sim N(\mu_k, I_p)$, with $I_p$ being the identity matrix and $\mu_k = (\mu_{k1}, \ldots, \mu_{kp})$; $W$ was a random vector with each component independently following Student's $t$-distribution with one degree of freedom. Here, $r$ was used to check the robustness of our method against heavy-tailed distributions. We considered $r = 0.05$ and $0.15$, representing, respectively, a low and a high proportion of heavy-tailed samples in the data. The categorical variable $Y$ was set to be binary and multi-category, with both balanced and imbalanced designs, through the following scenarios:
Case 1. 
$P(Y = 1) = P(Y = 2) = 0.5$, $\mu_1 = (1.5, 0, \ldots, 0)$ and $\mu_2 = (0, 1.5, 0, \ldots, 0)$.
Case 2. 
The same setup as Case 1, except that $P(Y = 1) = 1/3$ and $P(Y = 2) = 2/3$.
Case 3. 
$P(Y = k) = 1/K$ for $k = 1, \ldots, 8$, and $\mu_k = (0_{k-1}, 2, 0_{p-k})$ for $k = 1, \ldots, 8$, where $0_d$ represents a $d$-dimensional zero vector.
Case 4. 
The same setup as Case 3, except that $P(Y = k) = 2[1 + (k-1)/(K-1)]/(3K)$.
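A data-generation sketch for Case 1 of this example (our own illustrative code; we read the contamination $(1-r)Z + rW$ as a componentwise contaminated-normal mixture, which is one possible interpretation):

import numpy as np

def generate_example1_case1(n=50, p=1000, r=0.05, shift=1.5, seed=0):
    # Balanced binary Y; predictors drawn componentwise from N(mu_k, 1) with
    # probability 1 - r and from t_1 with probability r.
    rng = np.random.default_rng(seed)
    y = rng.integers(1, 3, size=n)                    # labels in {1, 2}, P = 0.5 each
    X = np.empty((n, p))
    for i in range(n):
        mu = np.zeros(p)
        mu[y[i] - 1] = shift                          # mu_1 or mu_2 shifts one coordinate
        z = rng.normal(mu, 1.0)                       # light-tailed component
        w = rng.standard_t(df=1, size=p)              # heavy-tailed t_1 component
        X[i] = np.where(rng.random(p) < r, w, z)      # contaminate a fraction r
    return X, y

# X, y = generate_example1_case1(); scores = qcs_scores(X, y)  # see the sketch in Section 2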
The numerical results are reported in Table 1, setting $(n, p) = (50, 1000)$ for $K = 2$ and $(n, p) = (160, 2000)$ for $K = 8$. From this table, it can be seen that QCS, MVS, CAS, and KFS performed comparably well, with both $\mathcal{P}_{d_n}$ and $\mathcal{P}_{2d_n}$ close to 100%. However, the performance of DCS was unsatisfactory, in that it was sensitive to heavy-tailed data and was easily affected by an imbalanced response.
Example 2.
In this example, we used a more complex setting to check the effectiveness of the proposed method. The example was similar to Example 2 in Xie et al. [21]. For a given $Y = k$, the $p$-dimensional random vector $X \mid \{Y = k\}$ was generated in the same way as in Example 1, but the correlation structure among the predictors was set as $\mathrm{Corr}(X_i, X_j) = 0.05^{|i-j|}$. We considered a five-category response; the mean shifts $\mu_k$ for each class were $\mu_1 = (1.5, 1.5, 0_{p-2})$, $\mu_2 = (0_5, 1.5, 1.5, 1.5, 0_{p-8})$, $\mu_3 = (0_{10}, 1.5, 1.5, 1.5, 1.5, 0_{p-14})$, $\mu_4 = (0_{20}, 1.5, 1.5, 1.5, 1.5, 1.5, 0_{p-25})$, $\mu_5 = (0_{30}, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 0_{p-36})$, so the corresponding active set was $\mathcal{A} = \{1, 2, 6, 7, 8, 11, \ldots, 14, 21, \ldots, 25, 31, \ldots, 36\}$. $Y$ was generated both in a balanced way, with $P(Y = k) = 0.2$ for $k = 1, \ldots, 5$, and in an imbalanced way, with $P(Y = k) = 0.1$ for $k = 1, 2, 3$ and $P(Y = k) = 0.35$ for $k = 4, 5$. We considered $n = 200$ and $p = 1000$ or $3000$.
Table 2 presents the simulation results. In this example, we can see that QCS performed better than its competitors: it had the smallest MMS, IQR, and EPR. Secondly, it can be seen that the increase of dimensionality p had a negative effect on all methods, but that our method suffered the least.
Example 3.
This example mimicked the scenario in which samples from the same group have a multi-modal distribution. Given $Y = k$, the random vector $X$ was generated in the same way as in Example 1, except that we fixed $r = 0.05$ and generated $Z$ from a normal mixture, designed as follows:
Case 1. 
$Z \mid \{Y = k\} \sim 0.2\, N(\mu_k, I_p) + 0.8\, N(-\mu_k, I_p)$;
Case 2. 
$Z \mid \{Y = k\} \sim 0.3\, N(\mu_k, I_p) + 0.7\, N(-\mu_k, I_p)$;
Case 3. 
$Z \mid \{Y = k\} \sim 0.4\, N(\mu_k, I_p) + 0.6\, N(-\mu_k, I_p)$,
where $\mu_{kk} = 2.5$ for $k = 1, 2, \ldots, K$ and the other components of $\mu_k$ are 0. In this example, we only considered a balanced setting for $Y$. Similarly, we considered $(K, n, p) = (2, 50, 1000)$ and $(8, 160, 2000)$, respectively.
The simulation results are shown in Table 3. This table shows that the number of categories $K$ had a strongly negative effect on all the competitors, in that they suffered a considerable efficiency loss when we increased $K$ from 2 to 8. In particular, in Case 3, where the distribution of the data had two comparable modes, all methods except ours missed the active predictors completely, even under the large model size $2d_n$. The above results show that the newly proposed method is very robust.
Example 4.
This example considered a K-categorical logistic model with
$$P(Y = k \mid X) = \frac{\exp(X\beta_k)}{1 + \sum_{i=1}^{K}\exp(X\beta_i)}, \quad k = 1, 2, \ldots, K,$$
where the model settings were configured as follows:
Case 1. 
$K = 2$, $\beta_2 = 0_p$ and $\beta_1 = (\beta_1, \ldots, \beta_{10}, 0_{p-10})$ with $\beta_j \sim \mathrm{Uniform}(1, 2)$;
Case 2. 
$K = 5$, $\beta_1 = 0_p$, $\beta_2 = (\beta_1, 0_{p-1})$, $\beta_3 = (0, \beta_2, \beta_3, 0_{p-3})$, $\beta_4 = (0_3, \beta_4, \beta_5, \beta_6, 0_{p-6})$ and $\beta_5 = (0_6, \beta_7, \beta_8, \beta_9, \beta_{10}, 0_{p-10})$ with $\beta_j \sim \mathrm{Uniform}(1, 2)$.
We considered the standard normal distribution $X_j \sim N(0, 1)$ and Student's $t$-distribution $X_j \sim t_3$ for the predictors. The correlation structure among the predictors was $\mathrm{Corr}(X_i, X_j) = 0.5^{|i-j|}$. We set $(n, p) = (150, 1000)$. The corresponding simulation results are shown in Table 4, which shows that all the methods performed similarly, but that our method behaved slightly better under the $t$-distribution. In addition, it seems that the $t$-distribution led to a more accurate screening result for all methods. The reason may be attributed to the structure of the logistic model. Consider the simplest case, $P(Y = 1 \mid X) = 1/(1 + \exp(-X))$ and $P(Y = 0 \mid X) = 1/(1 + \exp(X))$: clearly, a larger $|X|$ makes the classification between positive and negative easier. Consequently, under the logistic function, $t$-distributed data lead to a more accurate result, because the $t$-distribution has a higher probability of generating predictors with large values.
Example 5.
This example aimed to check the effectiveness of the two extensions of the new method in Section 2.2. We considered the following three models:
1. 
$Y = \sum_{i=1}^{5} X_i + \exp\big(\sum_{i=6}^{10} X_i\big) + \varepsilon$, where $X_j \sim N(0, 1)$ with $\mathrm{Corr}(X_i, X_j) = 0.5^{|i-j|}$ and $\varepsilon \sim N(0, 1)$;
2. 
$Y = 3f_1(X_1) + f_2(X_2) - 1.5f_3(X_3) + f_4(X_4) + \varepsilon$, where $f_1(x) = \sin(2x)$, $f_2(x) = x^2 - 25/12$, $f_3(x) = x$, $f_4(x) = \exp(-x) - 0.4\sinh(2.5)$, and the $X_j$ were independent $\mathrm{Uniform}(-2.5, 2.5)$ variables;
3. 
$Y = 1.5\,\big(\log(n)/n\big)\,(X_1 + X_2 - 2X_{10} + 2X_{20} - 2|X_{100}|) + \varepsilon$, where $X_j$ equals $-1$ if $Z_j < q_1$, $1$ if $Z_j \ge q_3$, and $0$ otherwise, with $Z_j \sim N(0, 1)$, $\mathrm{Corr}(Z_j, Z_k) = 0.5^{|j-k|}$, and $q_1$ and $q_3$ the first and third quartiles of the standard normal distribution.
Model 1 was an index model from Zhu et al. [8]. Model 2 was an additive model from Meier et al. [23]. Model 3 mimicked the SNPs, with equal allele frequencies { 1 , 0 , 1 } representing { A A , A a , a a } , respectively; this model has been analyzed in Cui et al. [20]. We report the simulation results in Table 5. It is clear that the proposed method always demonstrated a superior performance under the three models. More specifically, in Models 1 and 2, DCS did not work, though the predictor was not heavy-tailed. In Model 3, the performance of DCS and CAS were unsatisfying, with large MS and less probability of including the active predictors.
Overall, through the above simulations, we can summarize that QCS was the most robust method: compared to its competitors, it had a very stable performance within different model settings.
Computational complexity. Before ending this subsection, we discuss the computational complexity of our method. Theoretically, the computational complexity of our method is $O(np)$, which is controlled at the sample size level. To obtain a clearer view, we conducted some simulations to compare the computing time of our method with that of its competitors (see Figure 3). The figure shows that the computing time of our method increased linearly with the sample size, while the computing times of the other methods grew quadratically in $n$. The simulations were conducted using Matlab 2013a on a Dell OptiPlex 7060 SFF equipped with eight 3.20 GHz Intel(R) Core(TM) i7-8700 CPUs and 16.0 GB RAM.

3.3. Real Data Analyses

For this section, we applied the new screening method to two cancer datasets. The first was the leukemia dataset, consisting of 72 samples and 3571 genes, of which 47 samples were acute lymphocytic leukemia (ALL) and the remaining 25 were acute myelogenous leukemia (AML). Note that the original leukemia data had 73 samples and 7129 genes; the data analyzed here had been pre-feature-selected (see details in Dettling [24]). The second dataset comprised the small-round-blue-cell tumor (SRBCT) data, consisting of 63 samples and 2308 genes. The 63 subjects covered four types of tumors: Burkitt lymphoma (BL, 23 cases), Ewing sarcoma (EWS, 20 cases), neuroblastoma (NB, 12 cases), and rhabdomyosarcoma (RMS, 8 cases), so the data were four-category. The two datasets are available at http://www.stat.cmu.edu/~jiashun/Research/software/GenomicsData/, accessed on 5 March 2022.
The purpose for both datasets was to identify the key genes that have a dominant effect on predicting the diagnostic category of a tumor. We first employed the screening methods to reduce the large $p$ to a moderate scale $d_n$. Then, we invoked penalized linear discriminant analysis (penLDA) [25] to further select the discriminative predictors from the $d_n$ retained predictors. This is the popular two-stage approach commonly used in the analysis of ultrahigh-dimensional data. Note that the penLDA in the second stage could also be replaced by other penalized methods, such as the sparse discriminant analysis proposed by Clemmensen et al. [26].
We randomly extracted 70% of the samples from each class as the training data and set the remaining samples as the testing data; the training data were used both to implement the screening procedure and to build the classifier, while the testing data were used to check the performance of the trained classifier. We repeated the above procedure for 500 replications, and we report both the training errors and the testing errors of the different methods (a sketch of this two-stage pipeline is given below). In the screening stage, we set $d_n = [n/\log n]$ and $2[n/\log n]$, respectively: thus, $d_n = 16$ and 32 for the leukemia dataset, and $d_n = 15$ and 30 for the SRBCT dataset. In the second stage, the tuning parameter of the penLDA method was determined by 5-fold cross-validation. Table 6 displays the corresponding results, where QCS–penLDA denotes the two-stage method of QCS followed by penLDA; similar definitions apply to MVS–penLDA, DCS–penLDA, and so on.
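A schematic of this two-stage analysis (hypothetical code: qcs_scores is the sketch from Section 2, and a plain linear discriminant classifier from scikit-learn stands in here for the penLDA of [25]):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def two_stage_errors(X, y, d_n, n_splits=500, seed=0):
    # Stage 1: screen with QCS on the training part; Stage 2: fit a classifier
    # on the retained genes; report average training/testing misclassification counts.
    rng = np.random.RandomState(seed)
    train_err, test_err = [], []
    for _ in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=0.7, stratify=y, random_state=rng)
        keep = np.argsort(qcs_scores(X_tr, y_tr))[::-1][:d_n]        # screening stage
        clf = LinearDiscriminantAnalysis().fit(X_tr[:, keep], y_tr)  # classification stage
        train_err.append(np.sum(clf.predict(X_tr[:, keep]) != y_tr))
        test_err.append(np.sum(clf.predict(X_te[:, keep]) != y_te))
    return np.mean(train_err), np.mean(test_err)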
From Table 6, the following conclusions can be obtained. For the leukemia dataset, all methods except DCS performed reasonably well, in that they kept the average testing errors below one misclassified sample. However, for the SRBCT data, our method performed significantly better than the other methods: it achieved the smallest training errors and testing errors closest to 0. The CAS-based two-stage method yielded poor results for both the training and the testing error. The reason may be that the distribution underlying the data is not unimodal.

4. Conclusions

This paper proposes a new quantile-composited feature screening (QCS) procedure to rapidly screen out the null predictors. Compared to the existing methods, QCS stands out in the following respects. Firstly, the ranking index has a simple structure, so the screening procedure is computationally easy to implement. Secondly, QCS is a quantile-composited method: it utilizes distributional information across many quantile levels, which significantly improves the screening efficiency while keeping the computational cost low. The simulations and real data analyses also demonstrated the effectiveness of QCS.
In addition, it is worth mentioning that QCS can be further improved. For example, the selection of the number $s$ of quantile levels remains an open problem, which could be the focus of future work based on this article.

Author Contributions

Methodology, S.C. and J.L.; Formal analysis, S.C.; Writing—review & editing, S.C. and J.L.; Funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

Jun Lu’s research was partly supported by National Natural Science Foundation of China (No.12001486), Natural Science Foundation of Zhejiang Province (No.LQ20A010001), and Advance Research Plan of National University of Defense Technology.

Data Availability Statement

The datasets used in this paper can be accessed freely on the website http://www.stat.cmu.edu/~jiashun/Research/software/GenomicsData/.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Main Results

Proof of Lemma 1.
For Lemma 1(1), we only prove the sufficiency; the necessity can be proved similarly. If $q_X(\tau) = q_{X|Y}(\tau)$ a.s., then for any $k = 1, \ldots, K$,
$$P\big(I\{X > q_X(\tau)\} = 0, Y = k\big) = P\big(X \le q_X(\tau) \mid Y = k\big)\, P(Y = k) = P\big(X \le q_{X|Y}(\tau) \mid Y = k\big)\, P(Y = k) = P\big(X \le q_X(\tau)\big)\, P(Y = k) = P\big(I\{X > q_X(\tau)\} = 0\big)\, P(Y = k).$$
For $I\{X > q_X(\tau)\} = 1$, the proof is the same.
To prove Lemma 1(2), for any $x_0 \in \mathbb{R}$ there exists $\tau_0$ such that $x_0 = q_X(\tau_0)$; then,
$$P(X \le x_0, Y = k) = P\big(X \le q_X(\tau_0), Y = k\big) = P\big(X \le q_X(\tau_0) \mid Y = k\big)\, P(Y = k) = P\big(X \le q_{X|Y}(\tau_0) \mid Y = k\big)\, P(Y = k) = P\big(X \le q_{X|Y}(\tau_0)\big)\, P(Y = k) = P\big(X \le q_X(\tau_0)\big)\, P(Y = k) = P(X \le x_0)\, P(Y = k).$$
Proof of Theorem 1.
We prove this theorem in two steps.
Firstly, we prove the consistency of $\hat\pi_{jb}(\tau)$, $\hat\pi_{yk}$, $\hat\pi_{yk,jb}(\tau)$ and $\hat\pi_{yk}\hat\pi_{jb}(\tau)$.
(1) If $b = 0$, then
$$\hat\pi_{j0}(\tau) - \pi_{j0}(\tau) = \frac{1}{n}\sum_{i=1}^{n} I\big(X_{ij} \le \hat q_{X_j}(\tau)\big) - P\big(X_j \le q_{X_j}(\tau)\big) = \frac{[n\tau] + I(n\tau > [n\tau])}{n} - \frac{n\tau}{n} = \frac{I(n\tau > [n\tau]) - (n\tau - [n\tau])}{n} \le \frac{1}{n}. \quad (A1)$$
The conclusion for $b = 1$ can be proved similarly.
(2) By using Hoeffding's inequality, we obtain
$$P\big(|\hat\pi_{yk} - \pi_{yk}| > \varepsilon\big) \le 2\exp\big(-2n\varepsilon^{2}\big). \quad (A2)$$
(3) Define $\tilde Z_{ij}(\tau) = I\big(X_{ij} > q_{X_j}(\tau)\big)$, such that
$$P\Big(\big|\hat\pi_{yk,jb}(\tau) - \pi_{yk,jb}(\tau)\big| \ge \varepsilon\Big) = P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} I\{Y_i = k\} I\{Z_{ij}(\tau) = b\} - P\big(Y = k, Z_j(\tau) = b\big)\Big| \ge \varepsilon\Big) \le P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} I\{Y_i = k\}\big(I\{Z_{ij}(\tau) = b\} - I\{\tilde Z_{ij}(\tau) = b\}\big)\Big| \ge \frac{\varepsilon}{2}\Big) \quad (A3)$$
$$\quad + P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} I\{Y_i = k\} I\{\tilde Z_{ij}(\tau) = b\} - P\big(Y = k, Z_j(\tau) = b\big)\Big| \ge \frac{\varepsilon}{2}\Big). \quad (A4)$$
For (A3), note that for each $j$, either $I\{Z_{ij}(\tau) = b\} - I\{\tilde Z_{ij}(\tau) = b\} \le 0$ for all $i$, or $I\{Z_{ij}(\tau) = b\} - I\{\tilde Z_{ij}(\tau) = b\} \ge 0$ for all $i$. Using Hoeffding's inequality, (A3) can be bounded as
$$P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} I\{Y_i = k\}\big(I\{Z_{ij}(\tau) = b\} - I\{\tilde Z_{ij}(\tau) = b\}\big)\Big| \ge \frac{\varepsilon}{2}\Big) = P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} I\{Y_i = k\}\big(I\{X_{ij} \le \hat q_{X_j}(\tau)\} - I\{X_{ij} \le q_{X_j}(\tau)\}\big)\Big| \ge \frac{\varepsilon}{2}\Big) \le P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n}\big(I\{X_{ij} \le \hat q_{X_j}(\tau)\} - I\{X_{ij} \le q_{X_j}(\tau)\}\big)\Big| \ge \frac{\varepsilon}{2}\Big) = P\Big(\Big|\hat\pi_{j0}(\tau) - \frac{1}{n}\sum_{i=1}^{n} I\{X_{ij} \le q_{X_j}(\tau)\}\Big| \ge \frac{\varepsilon}{2}\Big) = P\Big(\Big|\hat\pi_{j0}(\tau) - \pi_{j0}(\tau) + \pi_{j0}(\tau) - \frac{1}{n}\sum_{i=1}^{n} I\{X_{ij} \le q_{X_j}(\tau)\}\Big| \ge \frac{\varepsilon}{2}\Big) \le P\Big(\frac{1}{n} + \Big|\frac{1}{n}\sum_{i=1}^{n} I\{X_{ij} \le q_{X_j}(\tau)\} - \tau\Big| \ge \frac{\varepsilon}{2}\Big) = P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} I\{X_{ij} \le q_{X_j}(\tau)\} - \tau\Big| \ge \frac{\varepsilon}{2} - \frac{1}{n}\Big) \le 2\exp\Big(-2n\Big(\frac{\varepsilon}{2} - \frac{1}{n}\Big)^{2}\Big). \quad (A5)$$
For (A4), using Hoeffding's inequality,
$$P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} I(Y_i = k)\, I\big(\tilde Z_{ij}(\tau) = b\big) - P\big(Y = k, Z_j(\tau) = b\big)\Big| > \frac{\varepsilon}{2}\Big) \le 2\exp\Big(-\frac{n\varepsilon^{2}}{2}\Big). \quad (A6)$$
Consequently, combining the results of (A5) and (A6), it is simple to establish that
$$P\Big(\big|\hat\pi_{yk,jb}(\tau) - \pi_{yk,jb}(\tau)\big| > \varepsilon\Big) \le 2\exp\Big(-2n\Big(\frac{\varepsilon}{2} - \frac{1}{n}\Big)^{2}\Big) + 2\exp\Big(-\frac{n\varepsilon^{2}}{2}\Big). \quad (A7)$$
(4) By employing a similar argument, $|\hat\pi_{yk}\hat\pi_{jb}(\tau) - \pi_{yk}\pi_{jb}(\tau)|$ can be bounded easily, as
$$P\Big(\big|\hat\pi_{yk}\hat\pi_{jb}(\tau) - \pi_{yk}\pi_{jb}(\tau)\big| \ge \varepsilon\Big) \le P\Big(\big|\hat\pi_{yk}\hat\pi_{jb}(\tau) - \hat\pi_{yk}\pi_{jb}(\tau)\big| + \big|\hat\pi_{yk}\pi_{jb}(\tau) - \pi_{yk}\pi_{jb}(\tau)\big| \ge \varepsilon\Big) = P\Big(\hat\pi_{yk}\cdot\big|\hat\pi_{jb}(\tau) - \pi_{jb}(\tau)\big| + \pi_{jb}(\tau)\cdot\big|\hat\pi_{yk} - \pi_{yk}\big| \ge \varepsilon\Big) \le P\Big(\frac{1}{n} + \big|\hat\pi_{yk} - \pi_{yk}\big| \ge \varepsilon\Big) \le P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} I(Y_i = k) - P(Y = k)\Big| \ge \varepsilon - \frac{1}{n}\Big) \le 2\exp\Big(-2n\Big(\varepsilon - \frac{1}{n}\Big)^{2}\Big), \quad (A8)$$
where the second inequality holds because $|\hat\pi_{jb}(\tau) - \pi_{jb}(\tau)| \le 1/n$, and where the last inequality holds due to Hoeffding's inequality.
Secondly, we prove the consistency of $\hat Q_j(\tau) - Q_j(\tau)$. Because
$$\hat Q_j(\tau) - Q_j(\tau) = \sum_{k=1}^{K}\sum_{b=0}^{1}\frac{\big(\hat\pi_{yk}\hat\pi_{jb}(\tau) - \hat\pi_{yk,jb}(\tau)\big)^{2}}{\hat\pi_{yk}\hat\pi_{jb}(\tau)} - \sum_{k=1}^{K}\sum_{b=0}^{1}\frac{\big(\pi_{yk}\pi_{jb}(\tau) - \pi_{yk,jb}(\tau)\big)^{2}}{\pi_{yk}\pi_{jb}(\tau)} = \sum_{k=1}^{K}\sum_{b=0}^{1}\Big(\hat\pi_{yk}\hat\pi_{jb}(\tau) - 2\hat\pi_{yk,jb}(\tau) + \frac{\hat\pi_{yk,jb}^{2}(\tau)}{\hat\pi_{yk}\hat\pi_{jb}(\tau)}\Big) - \sum_{k=1}^{K}\sum_{b=0}^{1}\Big(\pi_{yk}\pi_{jb}(\tau) - 2\pi_{yk,jb}(\tau) + \frac{\pi_{yk,jb}^{2}(\tau)}{\pi_{yk}\pi_{jb}(\tau)}\Big) = \sum_{k=1}^{K}\sum_{b=0}^{1}\big(\hat\pi_{yk}\hat\pi_{jb}(\tau) - \pi_{yk}\pi_{jb}(\tau)\big) - 2\sum_{k=1}^{K}\sum_{b=0}^{1}\big(\hat\pi_{yk,jb}(\tau) - \pi_{yk,jb}(\tau)\big) + \sum_{k=1}^{K}\sum_{b=0}^{1}\Big(\frac{\hat\pi_{yk,jb}^{2}(\tau)}{\hat\pi_{yk}\hat\pi_{jb}(\tau)} - \frac{\pi_{yk,jb}^{2}(\tau)}{\pi_{yk}\pi_{jb}(\tau)}\Big) = 0 - 0 + \sum_{k=1}^{K}\sum_{b=0}^{1}\Big(\frac{\hat\pi_{yk,jb}^{2}(\tau)}{\hat\pi_{yk}\hat\pi_{jb}(\tau)} - \frac{\pi_{yk,jb}^{2}(\tau)}{\pi_{yk}\pi_{jb}(\tau)}\Big),$$
we only need to prove the consistency of $\sum_{k=1}^{K}\sum_{b=0}^{1}\big(\hat\pi_{yk,jb}^{2}(\tau)/(\hat\pi_{yk}\hat\pi_{jb}(\tau)) - \pi_{yk,jb}^{2}(\tau)/(\pi_{yk}\pi_{jb}(\tau))\big)$.
When 0 < τ 1 2 ,
k = 1 K b = 0 1 π ^ y k , j b 2 ( τ ) π ^ y k π ^ j b ( τ ) π y k , j b 2 ( τ ) π y k π j b ( τ ) = k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) + π ^ y k , j 1 2 ( τ ) π ^ y k π ^ j 1 ( τ ) π y k , j 1 2 ( τ ) π y k π j 1 ( τ ) = k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) + [ π ^ y k π ^ y k , j 0 ( τ ) ] 2 π ^ y k π ^ j 1 ( τ ) [ π y k π y k , j 0 ( τ ) ] 2 π y k π j 1 ( τ ) = k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) + π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 1 ( τ ) π y k , j 0 2 ( τ ) π y k π j 1 ( τ ) + π ^ y k 2 π ^ y k , j 0 ( τ ) π ^ j 1 ( τ ) π y k 2 π y k , j 0 ( τ ) π j 1 ( τ ) k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) + π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 1 ( τ ) π y k , j 0 2 ( τ ) π y k π j 1 ( τ )
+ k = 1 K π ^ y k 2 π ^ y k , j 0 ( τ ) π ^ j 1 ( τ ) π y k 2 π y k , j 0 ( τ ) π j 1 ( τ ) .
For (A9), combining the results of (A1), (A2), (A7) and (A8), we have:
k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) + π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 1 ( τ ) π y k , j 0 2 ( τ ) π y k π j 1 ( τ ) = k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k 1 π ^ j 0 ( τ ) + 1 π ^ j 1 ( τ ) π y k , j 0 2 ( τ ) π y k 1 π j 0 ( τ ) + 1 π j 1 ( τ ) = k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π ^ j 1 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) π j 1 ( τ ) = k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π ^ j 1 ( τ ) π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π j 1 ( τ ) + π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π j 1 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) π j 1 ( τ ) k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) · π j 1 ( τ ) π ^ j 1 ( τ ) π ^ j 1 ( τ ) π j 1 ( τ ) + k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) · 1 π j 1 ( τ ) = π j 1 ( τ ) π ^ j 1 ( τ ) π ^ j 1 ( τ ) π j 1 ( τ ) · k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) + 1 π j 1 ( τ ) · k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) 1 n ( 1 τ ) · ( 1 τ 1 n ) · k = 1 K π ^ y k , j 0 ( τ ) π ^ j 0 ( τ ) + 2 k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) 1 n ( 1 τ ) · ( 1 τ 1 n ) + 2 k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) ,
and
k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) = k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ j 0 ( τ ) π ^ y k , j 0 2 ( τ ) π ^ y k π j 0 ( τ ) + π ^ y k , j 0 2 ( τ ) π ^ y k π j 0 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) = k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k 1 π ^ j 0 ( τ ) 1 π j 0 ( τ ) + k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π j 0 ( τ ) π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) 1 π ^ j 0 ( τ ) 1 π j 0 ( τ ) k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k + 1 π j 0 ( τ ) k = 1 K π ^ y k , j 0 2 ( τ ) π ^ y k π ^ y k , j 0 ( τ ) π y k , j 0 ( τ ) π ^ y k + π ^ y k , j 0 ( τ ) π y k , j 0 ( τ ) π ^ y k π y k , j 0 2 ( τ ) π y k π j 0 ( τ ) π ^ j 0 ( τ ) π ^ j 0 ( τ ) π j 0 ( τ ) k = 1 K π ^ y k , j 0 ( τ ) + 1 π j 0 ( τ ) k = 1 K π ^ y k , j 0 ( τ ) π ^ y k π ^ y k , j 0 ( τ ) π y k , j 0 ( τ ) + π y k , j 0 ( τ ) π ^ y k π ^ y k , j 0 ( τ ) π y k , j 0 ( τ ) π j 0 ( τ ) π ^ j 0 ( τ ) π ^ j 0 ( τ ) π j 0 ( τ ) · π ^ j 0 ( τ ) + 1 π j 0 ( τ ) k = 1 K π ^ y k , j 0 ( τ ) π ^ y k π ^ y k , j 0 ( τ ) π y k , j 0 ( τ ) + 1 π j 0 ( τ ) k = 1 K π ^ y k , j 0 ( τ ) π ^ y k π ^ y k , j 0 ( τ ) π y k , j 0 ( τ ) π j 0 ( τ ) π ^ j 0 ( τ ) π j 0 ( τ ) + 1 π j 0 ( τ ) k = 1 K π ^ y k , j 0 ( τ ) π y k , j 0 ( τ ) + 1 π j 0 ( τ ) k = 1 K π ^ y k , j 0 ( τ ) π y k , j 0 ( τ ) = 1 n τ + 2 1 τ k = 1 K π ^ y k , j 0 ( τ ) π y k , j 0 ( τ ) .
For (A10),
k = 1 K π ^ y k 2 π ^ y k , j 0 ( τ ) π ^ j 1 ( τ ) π y k 2 π y k , j 0 ( τ ) π j 1 ( τ ) = k = 1 K π ^ y k π ^ j 1 ( τ ) π y k π j 1 ( τ ) + 2 π y k , j 0 ( τ ) π j 1 ( τ ) 2 π ^ y k , j 0 ( τ ) π ^ j 1 ( τ ) = 1 π ^ j 1 ( τ ) 1 π j 1 ( τ ) + 2 π j 0 ( τ ) π j 1 ( τ ) π ^ j 0 ( τ ) π ^ j 1 ( τ ) = 1 π ^ j 1 ( τ ) 1 π j 1 ( τ ) + 2 1 π j 1 ( τ ) π j 1 ( τ ) 1 π ^ j 1 ( τ ) π ^ j 1 ( τ ) = 1 π ^ j 1 ( τ ) 1 π j 1 ( τ ) + 2 1 π j 1 ( τ ) 1 π ^ j 1 ( τ ) = 1 π ^ j 1 ( τ ) 1 π j 1 ( τ ) 1 n ( 1 τ ) · ( 1 τ 1 n ) .
Combining the results of (A9)–(A13), for $0 < \tau \le \tfrac{1}{2}$,
$$\Big|\sum_{k=1}^{K}\sum_{b=0}^{1}\Big(\frac{\hat\pi_{yk,jb}^{2}(\tau)}{\hat\pi_{yk}\hat\pi_{jb}(\tau)} - \frac{\pi_{yk,jb}^{2}(\tau)}{\pi_{yk}\pi_{jb}(\tau)}\Big)\Big| \le \frac{2}{(1-\tau)\big[n(1-\tau) - 1\big]} + \frac{2}{n\tau} + \frac{4}{1-\tau}\sum_{k=1}^{K}\big|\hat\pi_{yk,j0}(\tau) - \pi_{yk,j0}(\tau)\big|. \quad (A14)$$
For $\tfrac{1}{2} < \tau < 1$, by employing a similar argument, it can be proved that
$$\Big|\sum_{k=1}^{K}\sum_{b=0}^{1}\Big(\frac{\hat\pi_{yk,jb}^{2}(\tau)}{\hat\pi_{yk}\hat\pi_{jb}(\tau)} - \frac{\pi_{yk,jb}^{2}(\tau)}{\pi_{yk}\pi_{jb}(\tau)}\Big)\Big| \le \frac{2}{\tau\,(n\tau - 1)} + \frac{2}{n(1-\tau)} + \frac{4}{\tau}\sum_{k=1}^{K}\big|\hat\pi_{yk,j1}(\tau) - \pi_{yk,j1}(\tau)\big|. \quad (A15)$$
Hence, for any $\tau \in (0,1)$, by (A14) and (A15), it holds that
$$\Big|\sum_{k=1}^{K}\sum_{b=0}^{1}\Big(\frac{\hat\pi_{yk,jb}^{2}(\tau)}{\hat\pi_{yk}\hat\pi_{jb}(\tau)} - \frac{\pi_{yk,jb}^{2}(\tau)}{\pi_{yk}\pi_{jb}(\tau)}\Big)\Big| \le \frac{2}{\tilde\tau\,(n\tilde\tau - 1)} + \frac{2}{n\bar\tau} + \frac{4}{\tilde\tau}\sum_{k=1}^{K}\big|\hat\pi_{yk,j\bar b}(\tau) - \pi_{yk,j\bar b}(\tau)\big|, \quad (A16)$$
where $\tilde\tau = \max\{\tau, 1-\tau\}$, $\bar\tau = \min\{\tau, 1-\tau\}$, and $\bar b = I\{\tau > 1-\tau\}$. From (A7) and (A16), we can obtain
$$P\Big(\big|\hat Q_j(\tau) - Q_j(\tau)\big| \ge \varepsilon\Big) \le P\Big(\frac{2}{\tilde\tau(n\tilde\tau - 1)} + \frac{2}{n\bar\tau} + \frac{4}{\tilde\tau}\sum_{k=1}^{K}\big|\hat\pi_{yk,j\bar b}(\tau) - \pi_{yk,j\bar b}(\tau)\big| \ge \varepsilon\Big) = P\Big(\sum_{k=1}^{K}\big|\hat\pi_{yk,j\bar b}(\tau) - \pi_{yk,j\bar b}(\tau)\big| \ge \frac{\tilde\tau}{4}\Big[\varepsilon - \frac{2}{\tilde\tau(n\tilde\tau - 1)} - \frac{2}{n\bar\tau}\Big]\Big) \le K\, P\Big(\big|\hat\pi_{yk,j\bar b}(\tau) - \pi_{yk,j\bar b}(\tau)\big| \ge \frac{\tilde\tau}{4K}\Big[\varepsilon - \frac{2}{\tilde\tau(n\tilde\tau - 1)} - \frac{2}{n\bar\tau}\Big]\Big) \le 4K\exp\Big(-2n\Big(\frac{\tilde\tau}{8K}\Big[\varepsilon - \frac{2}{\tilde\tau(n\tilde\tau - 1)} - \frac{2}{n\bar\tau}\Big] - \frac{1}{n}\Big)^{2}\Big) + 4K\exp\Big(-\frac{n}{2}\Big(\frac{\tilde\tau}{4K}\Big[\varepsilon - \frac{2}{\tilde\tau(n\tilde\tau - 1)} - \frac{2}{n\bar\tau}\Big]\Big)^{2}\Big). \quad (A17)$$
Let $\tau \in (\alpha, 1-\alpha)$; then, by condition (C1), it can be derived that
$$P\Big(\big|\hat Q_j(\tau) - Q_j(\tau)\big| \ge \varepsilon\Big) \lesssim K\exp\big(-\{nK^{-2}\varepsilon^{2} + \varepsilon K^{-2}/\bar\tau\}\big). \quad (A18)$$
Let $K = O(n^{\gamma})$ and $\varepsilon = cn^{-\eta}$; if $\bar\tau = o(n^{\eta-1})$, then
$$P\Big(\big|\hat Q_j(\tau) - Q_j(\tau)\big| \ge cn^{-\eta}\Big) \lesssim K\exp\big(-O\{n^{1-2\eta-2\gamma}\}\big). \quad (A19)$$
The proof of Corollary 1. Under the conditions in Theorem 1, following (A17) and (A19), we obtain
P Q ^ j Q j ε = P a 1 a Q ^ j 2 ( τ ) d τ a 1 a Q ^ j ( τ ) d τ a 1 a Q j 2 ( τ ) d τ a 1 a Q j ( τ ) d τ ε = P a 1 a Q ^ j 2 ( τ ) d τ a 1 a Q ^ j ( τ ) d τ a 1 a Q ^ j ( τ ) Q j ( τ ) d τ a 1 a Q ^ j ( τ ) d τ + a 1 a Q ^ j ( τ ) Q j ( τ ) d τ a 1 a Q ^ j ( τ ) d τ a 1 a Q ^ j ( τ ) Q j ( τ ) d τ a 1 a Q j ( τ ) d τ + a 1 a Q ^ j ( τ ) Q j ( τ ) d τ a 1 a Q j ( τ ) d τ a 1 a Q j 2 ( τ ) d τ a 1 a Q j ( τ ) d τ ε = P a 1 a Q ^ j ( τ ) Q ^ j ( τ ) Q j ( τ ) d τ a 1 a Q ^ j ( τ ) d τ + a 1 a Q ^ j ( τ ) Q j ( τ ) d τ a 1 a Q ^ j ( τ ) Q j ( τ ) d τ a 1 a Q ^ j ( τ ) d τ a 1 a Q j ( τ ) d τ + a 1 a Q ^ j ( τ ) Q ^ j ( τ ) Q j ( τ ) d τ a 1 a Q j ( τ ) d τ ε P a 1 a Q ^ j ( τ ) Q ^ j ( τ ) Q j ( τ ) d τ a 1 a Q ^ j ( τ ) d τ + a 1 a Q ^ j ( τ ) Q j ( τ ) d τ a 1 a Q ^ j ( τ ) Q j ( τ ) d τ a 1 a Q ^ j ( τ ) d τ a 1 a Q j ( τ ) d τ + a 1 a Q j ( τ ) Q ^ j ( τ ) Q j ( τ ) d τ a 1 a Q j ( τ ) d τ ε , Q ^ j ( τ ) Q j ( τ ) < ε 4 + P Q ^ j ( τ ) Q j ( τ ) ε 4 P ε 4 + ε 4 · a 1 a Q ^ j ( τ ) Q j ( τ ) d τ a 1 a Q ^ j ( τ ) d τ a 1 a Q j ( τ ) d τ + ε 4 ε , Q ^ j ( τ ) Q j ( τ ) < ε 4 + P Q ^ j ( τ ) Q j ( τ ) ε 4 P ε 2 + ε 4 · a 1 a Q ^ j 2 ( τ ) d τ a 1 a Q j 2 ( τ ) d τ a 1 a Q ^ j ( τ ) d τ a 1 a Q j ( τ ) d τ ε , Q ^ j ( τ ) Q j ( τ ) < ε 4 + P Q ^ j ( τ ) Q j ( τ ) ε 4 P ε 2 + ε 4 · a 1 a Q ^ j ( τ ) d τ a 1 a Q j ( τ ) d τ a 1 a Q ^ j ( τ ) d τ a 1 a Q j ( τ ) d τ ε , Q ^ j ( τ ) Q j ( τ ) < ε 4 + P Q ^ j ( τ ) Q j ( τ ) ε 4 = P ε 2 + ε 4 · 1 ε , Q ^ j ( τ ) Q j ( τ ) < ε 4 + P Q ^ j ( τ ) Q j ( τ ) ε 4 = P Q ^ j ( τ ) Q j ( τ ) ε 4 K exp ( O { n 1 2 η 2 γ } ) .
Proof of Theorem 2.
If $\mathcal{A} \not\subseteq \hat{\mathcal{A}}$, then there must exist some $k \in \mathcal{A}$ such that $\hat Q_k < cn^{-\eta}$. It follows from the minimal signal condition in Theorem 2, $\min_{j\in\mathcal{A}} Q_j \ge 2cn^{-\eta}$, that $|\hat Q_k - Q_k| > cn^{-\eta}$ for some $k \in \mathcal{A}$, indicating that $\{\mathcal{A} \not\subseteq \hat{\mathcal{A}}\} \subseteq \{|\hat Q_k - Q_k| > cn^{-\eta} \text{ for some } k \in \mathcal{A}\}$, and hence $E_n = \{\max_{k\in\mathcal{A}} |\hat Q_k - Q_k| \le cn^{-\eta}\} \subseteq \{\mathcal{A} \subseteq \hat{\mathcal{A}}\}$. Consequently,
$$P\big(\mathcal{A} \subseteq \hat{\mathcal{A}}\big) \ge P(E_n) = 1 - P(E_n^{c}) = 1 - P\Big(\max_{k\in\mathcal{A}}\big|\hat Q_k - Q_k\big| > cn^{-\eta}\Big) \ge 1 - s_n\, P\Big(\big|\hat Q_k - Q_k\big| > cn^{-\eta}\Big) \ge 1 - s_n K\exp\big(-O(n^{1-2\eta-2\gamma})\big),$$
where s n is the cardinality of A . □

References

  1. Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. R. Statist. Soc. B 2008, 70, 849–911.
  2. Wang, H. Forward regression for ultra-high dimensional variable screening. J. Am. Statist. Assoc. 2009, 104, 1512–1524.
  3. Chang, J.; Tang, C.; Wu, Y. Marginal empirical likelihood and sure independence feature screening. Ann. Statist. 2013, 41, 2123–2148.
  4. Wang, X.; Leng, C. High dimensional ordinary least squares projection for screening variables. J. R. Statist. Soc. B 2016, 78, 589–611.
  5. Fan, J.; Feng, Y.; Song, R. Nonparametric independence screening in sparse ultrahigh dimensional additive models. J. Am. Statist. Assoc. 2011, 106, 544–557.
  6. Fan, J.; Ma, Y.; Dai, W. Nonparametric independence screening in sparse ultra-high-dimensional varying coefficient models. J. Am. Statist. Assoc. 2014, 109, 1270–1284.
  7. Liu, J.; Li, R.; Wu, R. Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J. Am. Statist. Assoc. 2014, 109, 266–274.
  8. Zhu, L.; Li, L.; Li, R.; Zhu, L. Model-free feature screening for ultrahigh-dimensional data. J. Am. Statist. Assoc. 2011, 106, 1464–1475.
  9. Li, R.; Zhong, W.; Zhu, L. Feature screening via distance correlation learning. J. Am. Statist. Assoc. 2012, 107, 1129–1139.
  10. He, X.; Wang, L.; Hong, H. Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann. Statist. 2013, 41, 342–369.
  11. Lin, L.; Sun, J.; Zhu, L. Nonparametric feature screening. Comput. Statist. Data Anal. 2013, 67, 162–174.
  12. Lu, J.; Lin, L. Model-free conditional screening via conditional distance correlation. Statist. Pap. 2020, 61, 225–244.
  13. Tong, Z.; Cai, Z.; Yang, S.; Li, R. Model-Free Conditional Feature Screening with FDR Control. J. Am. Statist. Assoc. 2022.
  14. Guo, X.; Ren, H.; Zou, C.; Li, R. Threshold Selection in Feature Screening for Error Rate Control. J. Am. Statist. Assoc. 2022, 36, 1–13.
  15. Zhong, W.; Qian, C.; Liu, W.; Zhu, L.; Li, R. Feature Screening for Interval-Valued Response with Application to Study Association between Posted Salary and Required Skills. J. Am. Statist. Assoc. 2023.
  16. Fan, J.; Feng, Y.; Tong, X. A road to classification in high dimensional space: The regularized optimal affine discriminant. J. R. Statist. Soc. B 2012, 74, 745–771.
  17. Huang, D.; Li, R.; Wang, H. Feature screening for ultrahigh dimensional categorical data with applications. J. Bus. Econ. Stat. 2014, 32, 237–244.
  18. Pan, R.; Wang, H.; Li, R. Ultrahigh-dimensional multiclass linear discriminant analysis by pairwise sure independence screening. J. Am. Statist. Assoc. 2016, 111, 169–179.
  19. Mai, Q.; Zou, H. The fused Kolmogorov filter: A nonparametric model-free feature screening. Ann. Statist. 2015, 43, 1471–1497.
  20. Cui, H.; Li, R.; Zhong, W. Model-free feature screening for ultrahigh dimensional discriminant analysis. J. Am. Statist. Assoc. 2015, 110, 630–641.
  21. Xie, J.; Lin, Y.; Yan, X.; Tang, N. Category-Adaptive Variable Screening for Ultra-high Dimensional Heterogeneous Categorical Data. J. Am. Statist. Assoc. 2019, 36, 747–760.
  22. Shao, J. Mathematical Statistics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2003.
  23. Meier, L.; van de Geer, S.; Bühlmann, P. High-Dimensional Additive Modeling. Ann. Statist. 2009, 37, 3779–3821.
  24. Dettling, M. BagBoosting for tumor classification with gene expression data. Bioinformatics 2004, 20, 3583–3593.
  25. Witten, D.M.; Tibshirani, R. Penalized classification using Fisher's linear discriminant. J. R. Statist. Soc. B 2011, 73, 753–772.
  26. Clemmensen, L.; Hastie, T.; Witten, D.; Ersbøll, B. Sparse discriminant analysis. Technometrics 2011, 53, 406–413.
Figure 1. The sample histograms of the 47 ALLs corresponding to the first 20 features selected from 3571 genes.
Figure 2. Density curves of marginal utilities of active predictors (solid line) and inactive ones (dashed line) for s = 10 (red), 20 (green), 50 (blue), 100 (black). The simulations were repeated 1000 times, using the model in Example 2 in Section 3 with a balanced response and r = 0.05.
Figure 3. Computing time of different methods based on 100 replications, where QCS is our method, MVS is the MV-based sure independence screening method in [20], DCS is the distance correlation–sure independence screening procedure in [9], CAS is the category-adaptive variable screening in [21], and KFS is the Kolmogorov filter method in [19]. This simulation used Example 1, with (K, p) = (8, 2000).
Table 1. Simulation results of Example 1.
Table 1. Simulation results of Example 1.
Method r = 0.05 r = 0.15
MMS, IQR, EPR, P_dn (%), P_2dn (%) | MMS, IQR, EPR, P_dn (%), P_2dn (%)
Case 1
QCS211294.297.3222191.295.4
MVS21697.198.62116.593.497.0
DCS751486.498.020134412.065.2
CAS20597.598.9211592.195.9
KFS221691.296.1232986.293.1
Case 2
QCS222390.794.7343485.892.6
MVS211591.696.0222290.493.7
DCS7623.578.894.822.5196213.356.1
CAS211693.396.0232490.193.2
KFS343585.091.1355383.090.2
Case 3
QCS80398.899.380109899.3
MVS821596.498.38.534592.396.0
DCS68.5311260.241.2190.510333400
CAS80299.799.981698.298.9
KFS9.542196.099.21174390.096.3
Case 4
QCS801196.998.78224.594.997.0
MVS9439.592.696.011148483.592.0
DCS9163270.50.019.22722136300.00.0
CAS81698.499.4941993.196.3
KFS12104786.696.214.5209877.091.6
Table 2. Simulation results of Example 2.
Table 2. Simulation results of Example 2.
Method r = 0.05 r = 0.15
MMS, IQR, EPR, P_dn (%), P_2dn (%) | MMS, IQR, EPR, P_dn (%), P_2dn (%)
Balanced response, p = 1000
QCS2001100100200199.2100
MVS2000100100200199.5100
DCS3381974.099.768.52158066.3
CAS20001001002000100100
KFS2002100100201699.2100
Imbalanced response, p = 1000
QCS2032492.598.02285185.595.3
MVS2173188.897.623146678.492.4
DCS83832871.741.6203.514342000.7
CAS26133880.297.332195563.594.0
KFS31206464.791.6362913752.483.1
Balanced response, p = 3000
QCS2001100100200299.2100
MVS2001100100200399.2100
DCS611656080.4159.55016200
CAS20001001002001100100
KFS202896.81002121795.198.8
Imbalanced response, p = 3000
QCS2188084.493.2242723069.983.2
MVS231510677.190.3294622060.278.2
DCS228.523790801.2581423114800
CAS373410950.981.7576216120.563.3
KFS48.55625736.669.268.58030619.154.8
Table 3. Simulation results of Example 3.
Table 3. Simulation results of Example 3.
Method K = 2 K = 8
MMS, IQR, EPR, P_dn (%), P_2dn (%) | MMS, IQR, EPR, P_dn (%), P_2dn (%)
Case 1
QCS20398.999.780498.999.8
MVS202100100212914164.583.4
DCS104.51275.099.0147.57923800
CAS3312.592.997.43137.5169.550.579.5
KFS20299.6100222310267.087.2
Case 2
QCS321493.697.39327.594.396.9
MVS4512.593.398.4103126.53663.025.5
DCS1351649.596.5192.59242000
CAS354112511.333.0467385101400
KFS221195.097.6103.5108344.58.028.4
Case 3
QCS61044.574.587.6131686.578.189.7
MVS12134952.179.4346.5277.584100
DCS1582128.788.2271153.5545.500
CAS297.5207.5550.5001644317696.500
KFS9185560.979.3411332.573800
Table 4. Simulation results of Example 4.
Table 4. Simulation results of Example 4.
Method K = 2 K = 5
MMS, IQR, EPR, P_dn (%), P_2dn (%) | MMS, IQR, EPR, P_dn (%), P_2dn (%)
X j N ( 0 , 1 )
QCS12.51511277.788.3121210181.690.4
MVS131816575.284.8121414376.688.1
DCS131817776.085.6183721065.180.0
CAS12.51315378.187.131.546.017247.272.8
KFS213628162.278.132.55323048.070.4
X j t 3
QCS101513785.692.210124992.095.2
MVS101515085.392.410126790.194.4
DCS101513384.791.2112311279.088.6
CAS101410685.292.8143013672.683.6
KFS122413979.288.5133417472.882.8
Table 5. Simulation results of Example 5.
Table 5. Simulation results of Example 5.
Method | MMS, IQR, EPR, P_dn (%), P_2dn (%)
Model 1QCS111011186.292.1
MVS11.51413282.489.5
DCS95771315130.40.8
CAS307833856.771.2
KFS141816583.290.3
Model 2QCS41798.1100
MVS41998.0100
DCS495819534.664.5
CAS6103592.998.3
KFS5137292.494.1
Model 3QCS6.51410988.794.2
MVS71611486.490.8
DCS132621576.586.6
CAS175425860.878.9
KFS258731658.268.4
Table 6. Numerical results of the real data analyses.
Table 6. Numerical results of the real data analyses.
Data | d_n | Method | No. of Training Errors (Mean, Std) | No. of Testing Errors (Mean, Std)
Leukemia16QCS-penLDA0.1760.4010.7940.865
MVS-penLDA0.1660.3830.8280.864
DCS-penLDA1.1880.7831.3341.082
CAS-penLDA0.1400.3530.8140.867
KFS-penLDA0.2600.4780.8960.924
32QCS-penLDA0.1520.3590.6700.813
MVS-penLDA0.1300.3360.6960.808
DCS-penLDA0.8980.7320.9740.920
CAS-penLDA0.1280.3340.6820.801
KFS-penLDA0.2100.4170.7920.873
SRBCT15QCS-penLDA0.1000.3190.5740.818
MVS-penLDA0.4361.7771.1801.399
DCS-penLDA1.3662.7531.7401.701
CAS-penLDA7.2362.2465.7442.174
KFS-penLDA2.8501.6652.8521.744
30QCS-penLDA0.0881.3430.2060.872
MVS-penLDA0.1301.7100.4701.021
DCS-penLDA0.3202.6930.6041.458
CAS-penLDA3.3601.6013.8641.787
KFS-penLDA0.5940.8310.8600.964