Article

Variable Selection for Sparse Logistic Regression with Grouped Variables

School of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(24), 4979; https://doi.org/10.3390/math11244979
Submission received: 16 November 2023 / Revised: 7 December 2023 / Accepted: 11 December 2023 / Published: 17 December 2023
(This article belongs to the Section Probability and Statistics)

Abstract:
We present a new penalized method for estimation in sparse logistic regression models with a group structure. Group sparsity implies that we should consider the Group Lasso penalty. In contrast to penalized log-likelihood estimation, our method can be viewed as a penalized weighted score function method. Under some mild conditions, we provide non-asymptotic oracle inequalities promoting the group sparsity of predictors. A modified block coordinate descent algorithm based on a weighted score function is also employed. The net advantage of our algorithm over existing Group Lasso-type procedures is that the tuning parameter can be pre-specified. The simulations show that this algorithm is considerably faster and more stable than competing methods. Finally, we illustrate our methodology with two real data sets.

1. Introduction

Logistic regression models are a powerful and popular technique for modeling the relationship between predictors and a categorical response variable. Let (x_1, y_1), …, (x_n, y_n) be independent pairs of observed data which are realizations of a random vector (X, Y), with p-dimensional predictors X ∈ R^p and a univariate binary response Y ∈ {0, 1}. (X, Y) is assumed to satisfy
P(Y = 1 \mid X = x) = G(x^T\beta^0) = \frac{\exp(x^T\beta^0)}{1 + \exp(x^T\beta^0)},   (1)
where β 0 R p is a regression vector to be estimated. We are especially concerned with a sparse logistic regression problem in which the dimension p is high and the sample size n might be small, i.e., the so-called “small n, large p” framework, which is a variable selection problem for high-dimensional data.
When dealing with high-dimensional data, there are usually two important considerations: model sparsity and prediction accuracy. The Lasso [1] was proposed to address these two objectives, since Lasso can determine submodels with a moderate number of parameters that still fit the data adequately. There are also other similar methods including SCAD [2], elastic net [3], Dantzig selector [4], MCP [5] and so on. In high-dimensional logistic regression models, Lasso study topics range from asymptotic results, including the consistency and asymptotic distribution of the estimator, e.g., Sur et al. [6], Ma et al. [7], Bianco et al. [8], to non-asymptotic results, including the non-asymptotic oracle inequalities of the estimation and prediction errors, e.g., Abramovich et al. [9], Huang et al. [10] and Yin [11].
In many applications, predictors can naturally be thought of as grouped. For example, in genome-wide association studies (GWASs), genes usually do not act individually; their effects are reflected in the covariation of several genes with each other. Additionally, in studies of histologically normal epithelium (NlEpi), the non-linear effects of genes in microarray data need to be considered, which naturally leads to grouped basis expansions. As with the Lasso, incorporating this grouping information into the modeling process should improve both the interpretability and the accuracy of the model. Yuan and Lin [12] proposed an extension of the Lasso, called the Group Lasso, which applies an L_2 norm to each group of variables and an L_1 penalty to the resulting block norms, rather than an L_1 penalty to individual variables. Suppose x_i and β^0 in model (1) are divided into g known groups; that is, we consider a partition {G_1, …, G_g} of {1, …, p} into groups, denote the cardinality of a group G_l by |G_l|, and write x_i = (x_{i(1)}^T, x_{i(2)}^T, …, x_{i(g)}^T)^T and β^0 = ((β^0_{(1)})^T, (β^0_{(2)})^T, …, (β^0_{(g)})^T)^T with x_{i(l)} ∈ R^{|G_l|} and β^0_{(l)} ∈ R^{|G_l|}. We wish to achieve sparsity at the level of groups, i.e., to estimate β^0 such that β^0_{(l)} = 0 for some of the groups l ∈ {1, …, g}. For high-dimensional logistic regression models, the Group Lasso provides an estimator of β^0:
\hat\beta^{GL} := \arg\min_{\beta \in \mathbb{R}^p}\; \frac{1}{n}\sum_{i=1}^n \Big\{ \log\big(1+\exp(x_i^T\beta)\big) - (x_i^T\beta)\,y_i \Big\} + \lambda \sum_{l=1}^g \omega_l \|\beta_{(l)}\|_2,   (2)
where λ ≥ 0 is a tuning parameter which controls the amount of penalization, ω_l = \sqrt{|G_l|} is used to normalize across groups of different sizes, and ‖·‖_2 denotes the L_2 norm of a vector. Meier et al. [13] established the asymptotic consistency of the Group Lasso for logistic regression, Wang et al. [14] analyzed the rates of convergence, Blazere et al. [15] stated oracle inequalities, and Kwemou [16] and Nowakowski et al. [17] studied non-asymptotic oracle inequalities. Furthermore, Zhang et al. [18] studied L_{p,q}-regularized estimates for logistic regression. In terms of computational algorithms, Meier et al. [13] applied the block coordinate descent algorithm of Tseng [19] to the Group Lasso for logistic regression, and Breheny and Huang [20] proposed the Group descent algorithm. While the aforementioned methods have shown promising performance in practical settings (Abramovich et al. [21], Chen and Wang [22], Kilcullen et al. [23], Yang et al. [24]), a pressing issue that remains unresolved is that these approaches only compute the exact coefficients quickly at selected values of λ; the choice of the tuning parameter itself is left open.
However, it is well known that, for the Lasso (or the Group Lasso) in linear regression models, the optimal value of the tuning parameter λ depends on the unknown parameter σ^2, the homogeneous noise variance, whose accurate estimation is generally difficult when p ≫ n. To solve this problem, Belloni et al. [25] proposed the square-root Lasso, which removes this unknown parameter by using a weighted score function (i.e., by taking the square root of the empirical loss function). Bunea et al. [26] extended the ideas behind the square-root Lasso to group selection and developed the Group square-root Lasso. Inspired by the Group square-root Lasso, we propose a new penalized weighted score function method, which replaces the original score function (i.e., the gradient of the negative log-likelihood) with a weighted score function (Huang and Wang [27]) to study sparse logistic regression with a Group Lasso penalty. We obtain convergence rates for the estimation error and provide a direct choice for the tuning parameter. Moreover, we propose a modified block coordinate descent algorithm based on the weighted score function, which greatly reduces the computational cost.
The framework of this paper is as follows. In Section 2, we apply the idea behind the Group square-root Lasso to sparse logistic models and develop our method, the penalized weighted score function method. In Section 3, we establish non-asymptotic bounds for the new estimator and give a direct choice of the tuning parameter. In Section 4, we provide the weighted block coordinate descent algorithm. In Section 5, numerical simulations show the advantages of our algorithm in terms of selection performance and computational time. In Section 6, we analyze two real data sets (musk and gene expression data) to support the simulations and theoretical results. Section 7 concludes our work. All proofs are given in Appendix A.
Notation: Throughout the paper, the set of non-zero groups of β^0 is denoted by I = {l : ‖β^0_{(l)}‖_2 ≠ 0} and s = card{I} is the number of non-zero groups of β^0. For any δ ∈ R^p and subset I, δ_{(I)} has the same coordinates as δ on I and zero coordinates on the complement I^C of I. For a function f(β) ∈ R, we denote by f'(β) ∈ R^p its gradient and by H(β) ∈ R^{p×p} its Hessian matrix at β ∈ R^p. The L_q norm of any vector v is defined as ‖v‖_q = (Σ_i |v_i|^q)^{1/q}, and for any vector β ∈ R^p with group structure, the block norm of β for any 0 ≤ q ≤ ∞ is denoted by ‖β‖_{2,q} = (Σ_{l=1}^g ‖β_{(l)}‖_2^q)^{1/q}. In particular, ‖β‖_{2,0} = Σ_{l=1}^g 1{β_{(l)} ≠ 0} is the number of non-zero groups, ‖β‖_{2,1} = Σ_{l=1}^g ‖β_{(l)}‖_2 is the Group Lasso form, ‖β‖_{2,2} = ‖β‖_2 is the L_2 norm, and ‖β‖_{2,∞} = max_l ‖β_{(l)}‖_2 is the largest L_2 norm over all groups. Moreover, Φ(x) denotes the cumulative distribution function of the standard normal distribution.
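To make the block norm notation concrete, here is a minimal numerical sketch (plain Python with numpy; the vector and grouping below are hypothetical, purely for illustration):

import numpy as np

# Hypothetical example: p = 6 coefficients split into g = 3 groups of size 2.
beta = np.array([1.0, -2.0, 0.0, 0.0, 3.0, 4.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

# ||beta_(l)||_2 for each group l
group_l2 = np.array([np.linalg.norm(beta[g]) for g in groups])

block_2_0 = np.sum(group_l2 > 0)    # ||beta||_{2,0}: number of non-zero groups
block_2_1 = np.sum(group_l2)        # ||beta||_{2,1}: the Group Lasso penalty form
block_2_2 = np.linalg.norm(beta)    # ||beta||_{2,2} = ||beta||_2
block_2_inf = np.max(group_l2)      # ||beta||_{2,inf}: largest group-wise L2 norm

print(block_2_0, block_2_1, block_2_2, block_2_inf)  # 2, ~7.236, ~5.477, 5.0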

2. Penalized Weighted Score Function Method

Recall that, for model (1), the loss function (i.e., the negative log-likelihood) is given by
\ell(\beta) = \frac{1}{n}\sum_{i=1}^n \Big\{ \log\big(1+\exp(x_i^T\beta)\big) - (x_i^T\beta)\,y_i \Big\},
leading to the score function
\ell'(\beta) = \frac{1}{n}\sum_{i=1}^n \big(G(x_i^T\beta) - y_i\big)\,x_i.
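As a quick illustration, a minimal Python/numpy sketch of the loss and score function above (the toy data are hypothetical):

import numpy as np

def G(t):
    # logistic link G(t) = exp(t) / (1 + exp(t)), in a numerically stable form
    return np.where(t >= 0, 1.0 / (1.0 + np.exp(-t)), np.exp(t) / (1.0 + np.exp(t)))

def loss(beta, X, y):
    # l(beta) = (1/n) sum_i { log(1 + exp(x_i^T beta)) - (x_i^T beta) y_i }
    eta = X @ beta
    return np.mean(np.logaddexp(0.0, eta) - eta * y)

def score(beta, X, y):
    # l'(beta) = (1/n) sum_i (G(x_i^T beta) - y_i) x_i
    eta = X @ beta
    return X.T @ (G(eta) - y) / X.shape[0]

# hypothetical toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
beta_true = np.array([1.0, -1.0, 0.0, 0.0])
y = rng.binomial(1, G(X @ beta_true))
print(loss(beta_true, X, y), score(beta_true, X, y))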
Note that the solution \hat\beta^{GL} of model (2) satisfies the KKT conditions
-\frac{1}{n}\sum_{i=1}^n \big(G(x_i^T\hat\beta^{GL}) - y_i\big)x_{i(l)} = \lambda\,\omega_l\,\hat\beta^{GL}_{(l)}\big/\|\hat\beta^{GL}_{(l)}\|_2, \quad \text{if } \hat\beta^{GL}_{(l)} \neq 0,
\Big\|\frac{1}{n}\sum_{i=1}^n \big(G(x_i^T\hat\beta^{GL}) - y_i\big)x_{i(l)}\Big\|_2 \le \lambda\,\omega_l, \quad \text{if } \hat\beta^{GL}_{(l)} = 0,   (3)
for all l = 1, …, g. The left-hand side of Equation (3) is (up to sign) the score function for logistic regression restricted to a group, which shows that \hat\beta^{GL} is in fact a penalized score function estimator. To obtain a good estimator, one usually requires that the inequality λω_l ≥ c‖ℓ'(β^0)_{(l)}‖_2 holds with high probability for all l = 1, …, g and some constant c ≥ 1 (Meier et al. [13] and Kwemou [16]). However, the random part G(x_i^Tβ^0) − y_i of ℓ'(β^0), the score function evaluated at β = β^0, has variance G(x_i^Tβ^0)(1 − G(x_i^Tβ^0)), which is the variance of the binary random variable Y_i | X_i = x_i. Clearly, this binary noise is not homogeneous, unlike the noise in linear regression models, so a single tuning parameter for all of the different coefficients is not a good choice.
We apply the idea of the Group square-root Lasso to solve the above problem of choosing a tuning parameter, and develop our method as follows. Huang and Wang [27] formed a class of root-n consistent estimating functions for logistic regression by means of a weighted score function
\ell'_\psi(\beta) = \frac{1}{n}\sum_{i=1}^n \psi(x_i^T\beta)\big(G(x_i^T\beta) - y_i\big)\,x_i,   (4)
where ψ(·) is a weight function of x_i^Tβ. This requires choosing a suitable weight function to ensure that ℓ'_ψ(β) is integrable with respect to β, so that a corresponding loss function exists. Replacing the score function in Equation (3) with the weighted score function, we obtain a penalized weighted score function estimate \hat\beta, which is a solution of the following equations:
-\frac{1}{n}\sum_{i=1}^n \psi(x_i^T\hat\beta)\big(G(x_i^T\hat\beta) - y_i\big)x_{i(l)} = \lambda\,\omega_l\,\hat\beta_{(l)}\big/\|\hat\beta_{(l)}\|_2, \quad \text{if } \hat\beta_{(l)} \neq 0,
\Big\|\frac{1}{n}\sum_{i=1}^n \psi(x_i^T\hat\beta)\big(G(x_i^T\hat\beta) - y_i\big)x_{i(l)}\Big\|_2 \le \lambda\,\omega_l, \quad \text{if } \hat\beta_{(l)} = 0.   (5)
Let ℓ_ψ(β) be the loss function corresponding to the weighted score function (4); then solving Equation (5) is equivalent to solving the following optimization problem:
\hat\beta := \arg\min_{\beta \in \mathbb{R}^p}\; \ell_\psi(\beta) + \lambda \sum_{l=1}^g \omega_l \|\beta_{(l)}\|_2.   (6)
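The following sketch (plain numpy, hypothetical data) shows the generic weighted score (4) and the group penalty appearing in (6); the weight psi is passed in as a function, and the specific weight of Theorem 2 in Section 3 is used as an example:

import numpy as np

def G(t):
    # logistic link, numerically stable
    return np.where(t >= 0, 1.0 / (1.0 + np.exp(-t)), np.exp(t) / (1.0 + np.exp(t)))

def weighted_score(beta, X, y, psi):
    # Equation (4): (1/n) sum_i psi(x_i^T beta) (G(x_i^T beta) - y_i) x_i
    eta = X @ beta
    return X.T @ (psi(eta) * (G(eta) - y)) / X.shape[0]

def group_penalty(beta, groups, lam_omega):
    # sum_l (lambda * omega_l) ||beta_(l)||_2, with lambda*omega_l supplied per group
    return sum(lw * np.linalg.norm(beta[g]) for g, lw in zip(groups, lam_omega))

# Example weight (the choice of Theorem 2 below): psi(t) = (exp(t/2) + exp(-t/2)) / 2
psi_cosh = lambda t: np.cosh(t / 2.0)

rng = np.random.default_rng(1)
X, y = rng.normal(size=(20, 4)), rng.binomial(1, 0.5, size=20)
groups = [np.array([0, 1]), np.array([2, 3])]
print(weighted_score(np.zeros(4), X, y, psi_cosh))
print(group_penalty(np.ones(4), groups, lam_omega=[0.1, 0.1]))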
Our method is motivated by Bunea et al.’s [26] minimization of the Group square-root Lasso for the linear model:
\hat\beta^{GSL} := \arg\min_{\beta \in \mathbb{R}^p}\; \frac{\|Y - X\beta\|_2}{\sqrt{n}} + \frac{\lambda}{\sqrt{n}} \sum_{l=1}^g \omega_l \|\beta_{(l)}\|_2,
where Y ∈ R^{n×1} and X ∈ R^{n×p}. When ‖Y − X\hat\beta^{GSL}‖_2 is non-zero, the Group square-root Lasso estimator \hat\beta^{GSL} satisfies the KKT conditions
\sum_{i=1}^n \big(\sqrt{n}\,\|Y - X\hat\beta^{GSL}\|_2\big)^{-1}(y_i - x_i^T\hat\beta^{GSL})\,x_{i(l)} = \frac{\lambda\,\omega_l}{\sqrt{n}}\;\hat\beta^{GSL}_{(l)}\big/\|\hat\beta^{GSL}_{(l)}\|_2, \quad \text{if } \hat\beta^{GSL}_{(l)} \neq 0,
\Big\|\sum_{i=1}^n \big(\sqrt{n}\,\|Y - X\hat\beta^{GSL}\|_2\big)^{-1}(y_i - x_i^T\hat\beta^{GSL})\,x_{i(l)}\Big\|_2 \le \frac{\lambda\,\omega_l}{\sqrt{n}}, \quad \text{if } \hat\beta^{GSL}_{(l)} = 0.
Comparing the KKT conditions of the Group square-root Lasso and the Group Lasso, the Group square-root Lasso adds the weight (\sqrt{n}\,\|Y - X\hat\beta^{GSL}\|_2)^{-1}, which implicitly estimates the homogeneous noise variance and therefore allows the tuning parameter λ to be chosen independently of it. Thus, the Group square-root Lasso estimates the grouped coefficients and accounts for the noise level in the choice of the tuning parameter simultaneously.
A drawback of the Group square-root Lasso is that this direct selection of the tuning parameter is only available for linear regression models; for logistic regression models there is no such direct choice. The penalized weighted score function method carries this scheme over to the logistic setting. We discuss this in more detail in the next section.

3. Statistical Properties

In this section, we establish non-asymptotic oracle inequalities for the penalized weighted score function estimate and present a direct choice of the tuning parameter.
Throughout this paper, we consider a fixed design setting (i.e., x_1, …, x_n are considered as deterministic), and we make the following assumptions:
(A1) There exists a positive constant M < ∞ such that max_{1≤i≤n} max_{1≤l≤g} Σ_{j∈G_l} x_{ij}^2 ≤ M.
(A2) n, p → ∞ satisfy p = o(e^{n^{1/3}}), and p/ε > 2 for ε ∈ (0, 1).
(A3) There exists N(β^0) > 0 such that
N^2(\beta^0) = \max_{1\le j\le p}\frac{1}{n}\sum_{1\le i\le n}\psi^2(x_i^T\beta^0)\,G(x_i^T\beta^0)\big(1 - G(x_i^T\beta^0)\big)\,x_{ij}^2.
(A4) Let ℓ_ψ(·): R^p → R be a convex, three-times differentiable function such that, for all u, v ∈ R^p, the function g(t) = ℓ_ψ(u + tv) satisfies |g'''(t)| ≤ τ_0 max_{1≤i≤n}|x_i^T v|\, g''(t) for all t ∈ R, where τ_0 > 0 is a constant.
Assumption (A1) bounds the predictors, which is natural since collected real data are typically bounded. Assumption (A2) controls the growth of the dimension and the lower bound on the probability with which the non-asymptotic results hold. Assumption (A3) ensures that the variance of each component of ℓ'_ψ(β^0) is bounded by choosing a suitable weight function ψ(·). Assumption (A4) is similar to Proposition 1 of Bach [28]. Under Assumption (A4), we can obtain lower and upper Taylor-type expansions of the loss function ℓ_ψ(·), which are used to derive the non-asymptotic results.
Moreover, a restricted eigenvalue condition plays a key role in deriving oracle inequalities. For the Group Lasso problem in high-dimensional linear regression models, the oracle property under a group restricted eigenvalue condition was discussed by Hu et al. [29] and extended to logistic regression models by Zhang et al. [18]. To establish the desired group restricted eigenvalue condition, we introduce the following group restricted set
\Theta_\alpha := \big\{ \vartheta \in \mathbb{R}^p : \|W_{I^C}\vartheta_{(I^C)}\|_{2,1} \le \alpha\,\|W_I\vartheta_{(I)}\|_{2,1} \big\}, \quad \alpha > 0,   (8)
which is a grouped version of the restricted set \theta_\alpha := \{\vartheta \in \mathbb{R}^p : \|\vartheta_{I^C}\|_1 \le \alpha\|\vartheta_I\|_1\} of Bickel et al. [30], where W_I is a diagonal matrix whose j-th diagonal element equals ω_l if j ∈ G_l for some l ∈ I and 0 otherwise. Based on the group restricted set (8), we propose the following group restricted eigenvalue condition:
(A5) For some integer s such that 1 < s < g and a positive number α, the following condition holds:
\mu(s,\alpha) = \min_{\substack{I \subseteq \{1,\ldots,g\} \\ |I| \le s}} \; \min_{\substack{\delta \neq 0 \\ \delta \in \Theta_\alpha}} \; \frac{\big(\delta^T H_\psi(\beta^0)\,\delta\big)^{1/2}}{\|W_I\delta_{(I)}\|_{2,2}} > 0,   (9)
where H_ψ(β^0) is the Hessian matrix of ℓ_ψ at β^0. In contrast to the restricted eigenvalue condition of Bickel et al. [30] for linear regression models, the group restricted eigenvalue condition for logistic regression replaces the L_2 norm by the block norm in the denominator of (9) and the Gram matrix by the Hessian matrix H_ψ(β^0) in the numerator.
Remark 1.
The Hessian matrix of ℓ_ψ(β) is given by
H_\psi(\beta) = \frac{1}{n}\sum_{i=1}^n \Big[ \psi'(x_i^T\beta)\Big(\frac{\exp(x_i^T\beta)}{1+\exp(x_i^T\beta)} - y_i\Big) + \psi(x_i^T\beta)\,\frac{\exp(x_i^T\beta)}{\big(1+\exp(x_i^T\beta)\big)^2} \Big] x_i x_i^T
= \frac{1}{n}\sum_{i=1}^n \Big[ \psi'(x_i^T\beta)\big(G(x_i^T\beta) - y_i\big) + \psi(x_i^T\beta)\,G(x_i^T\beta)\big(1 - G(x_i^T\beta)\big) \Big] x_i x_i^T.
Bach [28] has shown that the Hessian matrix of ℓ(β) is positive definite on certain restricted sets. If the chosen weight function ψ(x_i^Tβ) makes the loss function ℓ_ψ(β) satisfy Assumption (A3), then H_ψ(β) is also positive definite on the group restricted set (8). Such weight functions do exist and will be described later. In addition, the group restricted eigenvalue condition effectively controls the estimation error, enabling estimators with good statistical properties and reliable results.
Theorem 1.
Assume that (A1)–(A4) are satisfied. Let λ < \frac{k(1-z)\mu(s,\alpha)}{4\tau_0 M\sqrt{s}} with z ∈ (0, 1) and k < \min_{1\le l\le g}\omega_l, and let the tuning parameter λ be chosen such that
\lambda\,\omega_l = \frac{N(\beta^0)}{z}\sqrt{\frac{|G_l|}{n}}\;\Phi^{-1}\Big(1 - \frac{\epsilon}{2p}\Big).   (10)
Then, with probability at least 1 − ε(1 + o(1)), we have the following:
1. The estimation error lies in the group restricted set, \hat\beta - \beta^0 \in \Theta_\alpha, with α = \frac{1+z}{1-z}.
2. Under the group restricted eigenvalue condition (A5), the block norm estimation errors satisfy
\|\hat\beta - \beta^0\|_{2,1} \le \frac{2k\lambda s}{(\min_{1\le l\le g}\omega_l - k)(1-z)\,\mu(s,\alpha)},   (11)
\|\hat\beta - \beta^0\|_{2,q}^q \le \Big(\frac{2k\lambda s}{(\min_{1\le l\le g}\omega_l - k)(1-z)\,\mu(s,\alpha)}\Big)^q, \quad \text{for all } 1 < q < 2,   (12)
respectively, and the error of the loss function ℓ_ψ satisfies
\big|\ell_\psi(\hat\beta) - \ell_\psi(\beta^0)\big| \le \frac{2\min_{1\le l\le g}\omega_l\,\lambda^2 s}{(\min_{1\le l\le g}\omega_l - k)(1-z)\,\mu(s,\alpha)}.   (13)
The non-asymptotic oracle inequalities for the true coefficient β^0 are provided in (11)–(13). Unfortunately, the quantity N(β^0) depends on the true coefficient β^0, so the choice of λ in (10) also depends on β^0. We therefore choose a suitable weight ψ(x_i^Tβ^0) to remove this dependence in the next theorem.
Theorem 2.
Choose the weight function of the following form:
\psi(x_i^T\beta^0) = \frac{1}{2}\Big(\exp\Big(\frac{x_i^T\beta^0}{2}\Big) + \exp\Big(-\frac{x_i^T\beta^0}{2}\Big)\Big).   (14)
Under Assumptions (A2) and (A3), choose the tuning parameter as
\lambda\,\omega_l = \frac{\sqrt{|G_l|\,\max_{1\le j\le p}\sum_{i=1}^n x_{ij}^2}}{2nz}\;\Phi^{-1}\Big(1 - \frac{\epsilon}{2p}\Big).   (15)
Then, under the assumptions of Theorem 1, with probability at least 1 − ε(1 + o(1)), inequalities (11)–(13) hold.
Regarding Theorem 2, Yin [11] discusses the order of Φ^{-1}(1 − ε/(2p)) in (15), showing that Φ^{-1}(1 − ε/(2p)) is of order O(\sqrt{\log(2p/\epsilon)}). When |G_l| = 1 for l = 1, 2, …, g, our estimate \hat\beta reduces to a Lasso-type estimate whose theoretical properties have been well studied by Yin [11].
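The reason this particular weight removes the dependence on β^0 can be verified directly (a short check, not spelled out in the text): with ψ as in (14),
\psi^2(x_i^T\beta^0)\,G(x_i^T\beta^0)\big(1 - G(x_i^T\beta^0)\big)
= \frac{\big(e^{x_i^T\beta^0/2} + e^{-x_i^T\beta^0/2}\big)^2}{4}\cdot\frac{e^{x_i^T\beta^0}}{\big(1 + e^{x_i^T\beta^0}\big)^2}
= \frac{1}{4},
so that N^2(β^0) = \max_{1\le j\le p}\frac{1}{4n}\sum_{i=1}^n x_{ij}^2 no longer involves β^0, and substituting this into (10) yields exactly (15).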
Remark 2.
If ψ(x_i^Tβ^0) is chosen as in Theorem 2, the loss function, the weighted score function and the Hessian matrix are, respectively,
\ell_\psi(\beta^0) = \frac{1}{n}\sum_{i=1}^n \Big[(1-y_i)\exp\Big(\frac{x_i^T\beta^0}{2}\Big) + y_i\exp\Big(-\frac{x_i^T\beta^0}{2}\Big)\Big],
\ell'_\psi(\beta^0) = \frac{1}{2n}\sum_{i=1}^n \Big[(1-y_i)\exp\Big(\frac{x_i^T\beta^0}{2}\Big) - y_i\exp\Big(-\frac{x_i^T\beta^0}{2}\Big)\Big]x_i,
H_\psi(\beta^0) = \frac{1}{4n}\sum_{i=1}^n \Big[(1-y_i)\exp\Big(\frac{x_i^T\beta^0}{2}\Big) + y_i\exp\Big(-\frac{x_i^T\beta^0}{2}\Big)\Big]x_i x_i^T.
Clearly, the Hessian matrix corresponding to the weight function of Theorem 2 is positive definite.
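A minimal numpy sketch of the three quantities in Remark 2, together with a finite-difference check that the stated weighted score really is the gradient of the stated loss (hypothetical toy data):

import numpy as np

def loss_psi(beta, X, y):
    # l_psi(beta) = (1/n) sum_i [ (1 - y_i) exp(eta_i/2) + y_i exp(-eta_i/2) ]
    eta = X @ beta
    return np.mean((1 - y) * np.exp(eta / 2) + y * np.exp(-eta / 2))

def grad_psi(beta, X, y):
    # l_psi'(beta) = (1/2n) sum_i [ (1 - y_i) exp(eta_i/2) - y_i exp(-eta_i/2) ] x_i
    eta = X @ beta
    w = (1 - y) * np.exp(eta / 2) - y * np.exp(-eta / 2)
    return X.T @ w / (2 * X.shape[0])

def hess_psi(beta, X, y):
    # H_psi(beta) = (1/4n) sum_i [ (1 - y_i) exp(eta_i/2) + y_i exp(-eta_i/2) ] x_i x_i^T
    eta = X @ beta
    w = (1 - y) * np.exp(eta / 2) + y * np.exp(-eta / 2)
    return (X * w[:, None]).T @ X / (4 * X.shape[0])

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = rng.binomial(1, 0.5, size=40)
beta = rng.normal(size=3)

num_grad = np.array([(loss_psi(beta + 1e-6 * e, X, y) - loss_psi(beta - 1e-6 * e, X, y)) / 2e-6
                     for e in np.eye(3)])
print(np.allclose(num_grad, grad_psi(beta, X, y), atol=1e-6))  # expected: True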

4. Weighted Block Coordinate Descent Algorithm

We apply the techniques of the block coordinate descent algorithm to the penalized weighted score function. Choosing the weight function of the form (14) and setting β = \hat\beta + ζ, a second-order Taylor expansion of the loss function ℓ_ψ(β) in Equation (6) gives
D(\hat\beta + \zeta) = \ell_\psi(\hat\beta) + \zeta^T \ell'_\psi(\hat\beta) + \frac{1}{2}\zeta^T H_\psi(\hat\beta)\,\zeta + \lambda\,\|W(\hat\beta + \zeta)\|_{2,1}.   (16)
Now, we consider minimizing D(\hat\beta + ζ) with respect to the l-th group of penalized parameters. This yields
\ell'_\psi(\hat\beta)_{(l)} + H_\psi(\hat\beta)_{(l)}\,\zeta_{(l)} + \lambda\,\omega_l\,\frac{\hat\beta_{(l)} + \zeta_{(l)}}{\|\hat\beta_{(l)} + \zeta_{(l)}\|_2} = 0.   (17)
Inspired by the assumptions of Meier et al. [13], we approximate the sub-matrix H_ψ(\hat\beta)_{(l)} by H_ψ(\hat\beta)_{(l)} = h_ψ(\hat\beta)_{(l)} I_{(l)}, where h_ψ(\hat\beta)_{(l)} = max{diag(H_ψ(\hat\beta)_{(l)}), r_0} and r_0 > 0 is a lower bound used to ensure convergence. Then, simplifying Equation (17) gives
\lambda\,\omega_l\,\frac{\hat\beta_{(l)} + \zeta_{(l)}}{\|\hat\beta_{(l)} + \zeta_{(l)}\|_2} + h_\psi(\hat\beta)_{(l)}\big(\hat\beta_{(l)} + \zeta_{(l)}\big) = h_\psi(\hat\beta)_{(l)}\,\hat\beta_{(l)} - \ell'_\psi(\hat\beta)_{(l)}.
This leads to the equivalent equation
\frac{\hat\beta_{(l)} + \zeta_{(l)}}{\|\hat\beta_{(l)} + \zeta_{(l)}\|_2} = \frac{h_\psi(\hat\beta)_{(l)}\,\hat\beta_{(l)} - \ell'_\psi(\hat\beta)_{(l)}}{\|h_\psi(\hat\beta)_{(l)}\,\hat\beta_{(l)} - \ell'_\psi(\hat\beta)_{(l)}\|_2}.
According to Equation (15) and Remark 2, the group-wise update is obtained as follows. If \|h_\psi(\hat\beta)_{(l)}\hat\beta_{(l)} - \ell'_\psi(\hat\beta)_{(l)}\|_2 \le \lambda\omega_l, the value of ζ at the k-th iteration is given by
\zeta^{(k)}_{(l)} = -\hat\beta^{(k)}_{(l)},
otherwise
\zeta^{(k)}_{(l)} = -\frac{1}{h_\psi(\hat\beta^{(k)})_{(l)}}\left(\ell'_\psi(\hat\beta^{(k)})_{(l)} + \lambda\,\omega_l\,\frac{h_\psi(\hat\beta^{(k)})_{(l)}\,\hat\beta^{(k)}_{(l)} - \ell'_\psi(\hat\beta^{(k)})_{(l)}}{\|h_\psi(\hat\beta^{(k)})_{(l)}\,\hat\beta^{(k)}_{(l)} - \ell'_\psi(\hat\beta^{(k)})_{(l)}\|_2}\right),
where \lambda\,\omega_l = \sqrt{|G_l|\,\max_{1\le j\le p}\sum_{i=1}^n x_{ij}^2}\;\Phi^{-1}\big(1 - \frac{\epsilon}{2p}\big)\big/(2nz). If \zeta^{(k)}_{(l)} \neq 0, we use the Armijo rule of Tseng and Yun [31] to select the step factor σ^{(k)} as follows:
  • Armijo rule: choose σ_0 > 0 and let σ^{(k)} be the largest value in {σ_0 θ^j}_{j≥0} satisfying
D\big(\hat\beta^{(k)}_{(l)} + \sigma^{(k)}\zeta^{(k)}_{(l)}\big) \le D\big(\hat\beta^{(k)}_{(l)}\big) + \sigma^{(k)}\varrho\,\Delta^{(k)}_l,
where 0 < θ < 1, 0 < ϱ < 1, and
\Delta^{(k)}_l = \big(\zeta^{(k)}_{(l)}\big)^T \ell'_\psi(\hat\beta^{(k)})_{(l)} + \lambda\,\omega_l\,\|\hat\beta^{(k)}_{(l)} + \zeta^{(k)}_{(l)}\|_2 - \lambda\,\omega_l\,\|\hat\beta^{(k)}_{(l)}\|_2.
Finally, the update direction is combined with the selected step size to update the parameters:
\hat\beta^{(k+1)}_{(l)} = \hat\beta^{(k)}_{(l)} + \sigma^{(k)}\,\zeta^{(k)}_{(l)}.
The weighted block coordinate gradient descent algorithm is summarized in Algorithm 1. Tseng and Yun [31] suggested the initial parameter setting σ_0 = 1, θ = 0.5 and ϱ = 0.1. In the simulations below, we set the convergence criterion in Step 3 of Algorithm 1 to σ^{(k)} ≤ 10^{-10}. In general, selecting the tuning parameter λ by cross-validation is computationally demanding. As Algorithm 1 shows, our algorithm eliminates the selection process for the tuning parameter λω_l: given an initial value \hat\beta^{(0)}, we can iterate directly until the estimate converges to the required accuracy.
Algorithm 1 Weighted block coordinate gradient descent algorithm
  • Step 1: Let \hat\beta^{(0)} ∈ R^p be an initial parameter vector.
  • Step 2: For l = 1, …, g:
         H_\psi(\hat\beta^{(k)})_{(l)} = h_\psi(\hat\beta^{(k)})_{(l)} I_{(l)},
         \zeta^{(k)} = \arg\min_{\zeta \in \mathbb{R}^p}\{ D(\hat\beta^{(k)} + \zeta) \},
         if \zeta^{(k)} = 0:
             \hat\beta^{(k+1)} = \hat\beta^{(k)},
         else:
             search σ^{(k)} using the Armijo rule,
             \hat\beta^{(k+1)} = \hat\beta^{(k)} + \sigma^{(k)}\zeta^{(k)}.
  • Step 3: Repeat Step 2 until some convergence criterion is met.
It is worth noting that we have given a direct choice (15) of λ under the specific weight function ψ(x_i^Tβ^0) in (14), so the weighted block coordinate gradient descent algorithm is computationally faster than working iteratively over a fixed grid of tuning parameters λ (see Meier et al. [13]). If other weight functions are chosen, the weighted block coordinate gradient descent algorithm can still be used to solve (6); however, the tuning parameter λ then depends on the unknown β^0, and cross-validation can be used to choose λ.
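For illustration, here is a compact, self-contained Python/numpy sketch of Algorithm 1 under the weight function (14). It is a sketch under several assumptions rather than the authors' R implementation: the group structure, the value z = 0.5 in (15), r_0, and the stopping rule below are illustrative choices.

import numpy as np
from scipy.stats import norm

def G(t):
    return np.where(t >= 0, 1.0 / (1.0 + np.exp(-t)), np.exp(t) / (1.0 + np.exp(t)))

def loss_psi(beta, X, y):
    # weighted loss of Remark 2
    eta = X @ beta
    return np.mean((1 - y) * np.exp(eta / 2) + y * np.exp(-eta / 2))

def grad_psi(beta, X, y):
    # weighted score (gradient of loss_psi)
    eta = X @ beta
    return X.T @ ((1 - y) * np.exp(eta / 2) - y * np.exp(-eta / 2)) / (2 * X.shape[0])

def wgrplasso_sketch(X, y, groups, eps=0.05, z=0.5, r0=1e-3,
                     sigma0=1.0, theta=0.5, rho=0.1, tol=1e-10, max_sweeps=50):
    n, p = X.shape
    # direct choice (15) of lambda * omega_l; z in (0, 1) is an illustrative value
    lam = np.array([np.sqrt(len(g) * np.max(np.sum(X ** 2, axis=0)))
                    * norm.ppf(1 - eps / (2 * p)) / (2 * n * z) for g in groups])

    def objective(beta):
        return loss_psi(beta, X, y) + sum(lw * np.linalg.norm(beta[g])
                                          for g, lw in zip(groups, lam))

    beta = np.zeros(p)
    for _ in range(max_sweeps):
        max_step = 0.0
        for g, lw in zip(groups, lam):
            grad = grad_psi(beta, X, y)
            eta = X @ beta
            w = (1 - y) * np.exp(eta / 2) + y * np.exp(-eta / 2)
            # h_psi(beta)_(l): max of the group block of diag(H_psi), floored at r0
            h = max(np.max(w @ (X[:, g] ** 2)) / (4 * n), r0)
            u = h * beta[g] - grad[g]
            if np.linalg.norm(u) <= lw:
                zeta = -beta[g]                                   # zero out the group
            else:
                zeta = -(grad[g] + lw * u / np.linalg.norm(u)) / h  # update direction (20)
            if not np.any(zeta):
                continue
            # Armijo rule of Tseng and Yun on the penalized objective
            delta = zeta @ grad[g] + lw * (np.linalg.norm(beta[g] + zeta)
                                           - np.linalg.norm(beta[g]))
            obj_old, sigma = objective(beta), sigma0
            while True:
                trial = beta.copy()
                trial[g] = beta[g] + sigma * zeta
                if objective(trial) <= obj_old + sigma * rho * delta or sigma < tol:
                    break
                sigma *= theta
            beta = trial
            max_step = max(max_step, sigma * np.linalg.norm(zeta))
        if max_step < tol:      # illustrative convergence criterion
            break
    return beta

# toy usage with a hypothetical group structure (4 groups of size 3, first group active)
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 12))
groups = [np.arange(3 * l, 3 * l + 3) for l in range(4)]
y = rng.binomial(1, G(X @ np.r_[np.ones(3), np.zeros(9)]))
print(np.round(wgrplasso_sketch(X, y, groups), 2))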

5. Simulations

In this section, we use simulated datasets to evaluate the performance of the penalized weighted score function estimator. Meier et al. [13] implemented the block coordinate gradient descent algorithm in the R package grplasso (https://cran.r-project.org/web/packages/grplasso/grplasso.pdf, accessed on 6 July 2023); we ran all experiments in R 4.3.1. While grplasso offers 20 predefined values of the tuning parameter λ, it has no built-in rule for choosing an optimal λ. We modified grplasso by adding the direct choice of the tuning parameter, named the result wgrplasso, and use it to implement the weighted block coordinate gradient descent algorithm. We compare the performance of the wgrplasso algorithm, the R package grpreg (https://cran.r-project.org/web/packages/grpreg/grpreg.pdf, accessed on 6 July 2023) developed by Breheny and Huang [20], and the R package gglasso (https://cran.r-project.org/web/packages/gglasso/gglasso.pdf, accessed on 6 July 2023) developed by Yang and Zou [32]. Three main aspects of model performance are considered: the correctness of variable selection, the accuracy of coefficient estimation and the running time of the algorithm. The evaluation indicators for the model include the following:
  • TP: the number of truly non-zero coefficients that are estimated as non-zero by the selected model.
  • TN: the number of truly zero coefficients that are estimated as zero.
  • FP: the number of truly zero coefficients that are estimated as non-zero.
  • FN: the number of truly non-zero coefficients that are estimated as zero.
  • TPR: the proportion of truly non-zero coefficients that are correctly selected, calculated as
    TPR = TP / (TP + FN).
  • Accur: the proportion of correctly classified coefficients, calculated as
    Accur = (TP + TN) / (TP + TN + FP + FN).
  • Time: the running time of the algorithm.
  • BNE: the block norm of the estimation error, calculated as BNE = ‖\hat\beta − β‖_{2,1} (see the sketch after this list).
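A minimal sketch of how these indicators can be computed from an estimate (coefficient-level counts, as an interpretation of the definitions above; the inputs are hypothetical):

import numpy as np

def selection_metrics(beta_hat, beta_true, groups):
    sel, truth = beta_hat != 0, beta_true != 0
    tp = int(np.sum(sel & truth)); tn = int(np.sum(~sel & ~truth))
    fp = int(np.sum(sel & ~truth)); fn = int(np.sum(~sel & truth))
    tpr = tp / (tp + fn)                              # TPR = TP / (TP + FN)
    accur = (tp + tn) / (tp + tn + fp + fn)           # Accur = (TP + TN) / total
    bne = sum(np.linalg.norm(beta_hat[g] - beta_true[g]) for g in groups)  # ||.||_{2,1}
    return dict(TP=tp, TN=tn, FP=fp, FN=fn, TPR=tpr, Accur=accur, BNE=bne)

# hypothetical example: p = 9, three groups of size 3, first group truly active
beta_true = np.r_[np.ones(3), np.zeros(6)]
beta_hat = np.r_[0.8, 1.1, 0.9, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0]
groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]
print(selection_metrics(beta_hat, beta_true, groups))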
The sample size was 200. We set p = 300, 600 and 900, and generated 500 random datasets to repeat each simulation. We set ε to 0.01 and 0.05 and specified the true non-zero coefficients of the logistic regression models uniformly as
\beta = (\underbrace{\underbrace{1,1,1}_{3},\ldots,\underbrace{1,1,1}_{3}}_{30},\;\underbrace{0,\ldots,0}_{p-30}).
For the setting of the log odds η, we considered the following four different models.
(a) In Model I, the observed data X are sampled from a multivariate normal distribution and the log odds η is linear; the data between groups are independent but the data within groups are correlated. We set the size of each group to 3 and assume that the data within group i obey X_i ∼ N(0, Σ_i) with (Σ_i)_{jk} = 0.5^{|j−k|}. The observed data can then be written as X ∼ N(0, Σ) with Σ = diag(Σ_1, …, Σ_{p/3}).
(b) In Model II, the observed data X are the sum of two uniform variables and the log odds η is linear. Assume that the p-dimensional vectors Z_1, …, Z_p and W are generated independently from a uniform distribution on [−1, 1]. The observed data are then defined as X_i = Z_i + W.
The log odds η for Models I and II are then defined as
\eta = \beta_0 + X_1\beta_1 + \cdots + X_p\beta_p.
(c) In Model III, the observed data X follow a standard multivariate normal distribution and the log odds η is additive. Assuming that X obeys the p/3-dimensional standard normal distribution, the observed data can be written as X ∼ N(0, I_{p/3}).
(d) In Model IV, the observed data X are the sum of two uniform variables and the log odds η is additive. The p/3-dimensional vectors Z_1, …, Z_{p/3} and W are generated independently from a uniform distribution on [−1, 1], and the observed data are defined as X_i = Z_i + W.
The log odds η for Models III and IV are then defined as
\eta = \beta_0 + X_1\beta_1 + X_1^2\beta_2 + X_1^3\beta_3 + \cdots + X_{p/3}\beta_{p-2} + X_{p/3}^2\beta_{p-1} + X_{p/3}^3\beta_p.
Then, the response variable Y was generated from the logistic regression model
P(Y = 1 \mid \eta) = \frac{\exp(\eta)}{1 + \exp(\eta)}.
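A sketch of the data-generating process for the linear models above (Models I and II), with the intercept β_0 taken as 0, an assumption made here only for illustration:

import numpy as np

def G(t):
    return np.where(t >= 0, 1.0 / (1.0 + np.exp(-t)), np.exp(t) / (1.0 + np.exp(t)))

def make_beta(p):
    # 30 leading coefficients equal to 1 (ten active groups of size 3), the rest zero
    return np.r_[np.ones(30), np.zeros(p - 30)]

def model_I(n, p, rng):
    # within-group covariance 0.5^{|j-k|} for groups of size 3, independent across groups
    Sigma3 = 0.5 ** np.abs(np.subtract.outer(np.arange(3), np.arange(3)))
    return np.hstack([rng.multivariate_normal(np.zeros(3), Sigma3, size=n)
                      for _ in range(p // 3)])

def model_II(n, p, rng):
    # X_i = Z_i + W, with Z_i and W independent Uniform[-1, 1] (interpreted column-wise)
    return rng.uniform(-1, 1, size=(n, p)) + rng.uniform(-1, 1, size=(n, 1))

rng = np.random.default_rng(4)
n, p = 200, 300
beta = make_beta(p)
for make_X in (model_I, model_II):
    X = make_X(n, p, rng)
    y = rng.binomial(1, G(X @ beta))   # binary responses from the logistic model
    print(make_X.__name__, X.shape, y.mean())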
Table 1 shows the average simulation results of the three algorithms for the linear case, and Figure 1 shows the point–line plots of Model I and Model II for TPR, Accur, Time and BNE.
First, from the TPR perspective, all three algorithms show excellent selection results when the normal distribution assumption is adopted. However, when the uniform distribution assumption is used, the wgrplasso algorithm shows higher correct selection in the nonzero set than the other algorithms, and the wgrplasso algorithm is also more stable in terms of variance.
Second, from the Accur perspective, the wgrplasso and gglasso algorithms maintain high selection accuracy under the normal distribution assumption, compared to the grpreg algorithm. However, Accur is also affected by FP, and the variances show that the grpreg and gglasso algorithms are not stable enough in controlling FP. In addition, under the uniform distribution assumption, the wgrplasso algorithm keeps FP lower, both in terms of the selection effect and the stability of the variance, which makes it perform better than the other algorithms in terms of Accur.
Third, from a Time perspective, using the wgrplasso algorithm saves a lot of time, both for the normal distribution assumption and the uniform distribution assumption.
Finally, from a BNE perspective, under the normal distribution assumption the BNE values obtained by the wgrplasso and gglasso algorithms are similar and smaller than that of the grpreg algorithm. Under the uniform distribution assumption, the BNE obtained by the wgrplasso algorithm is smaller than those of the gglasso and grpreg algorithms, which means that the wgrplasso algorithm performs better.
Table 2 presents the simulation results of the three algorithms for the additive case, and Figure 2 shows the point–line plots of Models III and IV for TPR, Accur, Time and BNE.
The simulation results show that the grpreg and gglasso algorithms perform poorly in the additive case in terms of both TPR and Accur; their variances also show that their selection is unstable, and their computational time and BNE values increase. In contrast, wgrplasso obtains results in the additive case similar to those in the linear case and still maintains good selection. In terms of TPR, Accur and BNE, the wgrplasso algorithm performs better than the other algorithms, and its advantage in Time is even more obvious.

6. Real Data

In this section, we apply our proposed estimates to analyze two real data sets. The first data set comes from the molecular shape and conformation of musk. The second data set comes from histologically normal epithelial cells from breast cancer patients and cancer-free prophylactic mastectomy patients. As in the previous section, we set ϵ to 0.01 and 0.05, respectively. In Section 6.1, we compare the number of variables selected and the computation time of the three algorithms in the above simulation, and in Section 6.2, we compare the prediction accuracy and the computation time.

6.1. Studies on the Molecular Structure of Musk

The R package kernlab (https://cran.r-project.org/web/packages/kernlab/kernlab.pdf, accessed on 12 July 2023) contains the musk dataset, which describes the molecular shape and conformation of musk molecules. The data set contains a data frame of 476 observations on 167 variables. The first 162 variables are distance features along rays, measured relative to an origin placed along each ray; any experiment with the data should treat these features as being on an arbitrary continuous scale. Variable 163 is the distance of the oxygen atom to a designated point in 3-D space, and variables 164, 165 and 166 are the X-, Y- and Z-displacements from that point. Variable 167 takes the value 0 for non-musk and 1 for musk.
We used 3/4 of the data for training and performed a third-order B-spline basis function expansion on the training data; we then estimated the model on the expanded training data using the wgrplasso, grpreg, gglasso and glmnet (https://cran.r-project.org/web/packages/glmnet/glmnet.pdf, accessed on 12 July 2023) algorithms, respectively. The remaining 1/4 of the data were used as a test set, and the estimated coefficients were used to predict the test data, comparing the prediction accuracy, model size and computation time of the four algorithms. Table 3 presents the results of 100 repetitions.
The experimental results show that wgrplasso has the highest prediction accuracy among the four algorithms, indicating that the algorithm is able to identify the target class more accurately in the task of categorizing musk data, and wgrplasso also exhibits a shorter computation time without sacrificing accuracy. This makes the wgrplasso algorithm the preferred algorithm for dealing with the problem of categorizing musk datasets.

6.2. Gene Expression Studies in Epithelial Cells of Breast Cancer Patients

We obtained microarray data on histologically normal epithelial cells from the NCBI Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/, accessed on 31 August 2023) under accession GSE20437. The dataset consists of 42 samples with 22,283 variables: microarray gene expression data collected from the histologically normal epithelium (NlEpi) of 18 breast cancer patients (HN), 18 patients undergoing breast reduction (RM) and 6 high-risk women undergoing cancer-free prophylactic mastectomy (PM). Graham et al. [33] have shown that genes are differentially expressed between HN and RM samples, and this is discussed more fully in Yang and Zou [32]. Here, we consider the effect of genes on the HN and RM groups. Following Yang and Zou's [32] treatment of the data, we fit a sparse additive logistic regression model using the Group Lasso penalty while selecting the significant additive components.
As in Section 6.1, we trained with 3/4 of the data and expanded the training data using third-order B-spline basis functions, treating the basis functions of each gene as a group to reflect its role in the additive model, leading to a grouped regression problem with n = 36 and p = 66,849. All data were then standardized so that each original variable had mean zero and unit sample variance. The experiment was repeated 100 times to obtain the prediction error. We also built a complete observational model for one of the experiments and report the genes selected by the wgrplasso, grpreg and gglasso algorithms. These results are listed in Table 4. We observe that the wgrplasso and gglasso algorithms select more variables than the grpreg algorithm, and wgrplasso has a lower prediction error. Summarizing the above results, our proposed penalized weighted score function method can pick more meaningful variables for explanation and prediction.

7. Conclusions

In this work, we propose the penalized weighted score function method for Group Lasso-penalized logistic regression models. We establish a high-probability upper bound on the parameter estimation error and a direct choice of the tuning parameter under a specific weight function. With this direct choice of the tuning parameter, we modify the block coordinate descent algorithm to reduce computational time and complexity. Simulation results show that our method not only exhibits better statistical accuracy, but also runs faster than competing methods. Experimental results on real data also show that our method is effective in fields such as biology and chemistry. Our approach can be extended to other generalized linear models with a sparse group structure, which we leave for future research.

Author Contributions

Conceptualization, Z.Y.; Methodology, M.Z., Z.Y. and Z.W.; Software, M.Z.; Data curation, Z.W.; Writing—original draft, M.Z. and Z.Y.; Writing—review & editing, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

The authors’ work was supported by the Educational Commission of Jiangxi Province of China (No.GJJ160927) and the National Natural Science Foundation of China (No.62266002).

Data Availability Statement

All data are available in the paper and its related references.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Lemma A1
(Bach [28]). Consider a three-times differentiable convex function g: R → R such that, for all t ∈ R, |g'''(t)| ≤ S g''(t) for some S ≥ 0. Then, for all t ≥ 0:
\frac{g''(0)}{S^2}\big(e^{-St} + St - 1\big) \;\le\; g(t) - g(0) - g'(0)\,t \;\le\; \frac{g''(0)}{S^2}\big(e^{St} - St - 1\big).
Lemma A2
(Hu et al. [29]). If the inequality \sum_{i=1}^n a_i \le b_0 holds with a_i > 0, then \sum_{i=1}^n a_i^q \le b_0^q for 1 < q < 2.
Proof of Lemma A2.
We first recall the Hölder inequality: let m, n > 1 with \frac{1}{m} + \frac{1}{n} = 1, and let a_i and b_i be non-negative real numbers; then
\sum_{i=1}^n a_i b_i \le \Big(\sum_{i=1}^n a_i^m\Big)^{1/m}\Big(\sum_{i=1}^n b_i^n\Big)^{1/n}.
Applying the Hölder inequality with m = \frac{1}{2-q} and n = \frac{1}{q-1}, we have
\sum_{i=1}^n a_i^q = \sum_{i=1}^n a_i^{2-q}\,a_i^{2q-2} \le \Big(\sum_{i=1}^n a_i\Big)^{2-q}\Big(\sum_{i=1}^n a_i^2\Big)^{q-1},
and, because \sum_{i=1}^n a_i^2 \le (\sum_{i=1}^n a_i)^2 \le b_0^2, it follows that
\sum_{i=1}^n a_i^q \le b_0^{2-q}\,b_0^{2(q-1)} = b_0^q,
where m, n > 1 requires q ∈ (1, 2). □
Lemma A3
(Sakhanenko [34]). Let F_1, …, F_n be independent random variables with E(F_i) = 0 and |F_i| ≤ 1 for all 1 ≤ i ≤ n. Denote B_n^2 = \sum_{i=1}^n E(F_i^2) and L_n = \sum_{i=1}^n E(|F_i|^3)/B_n^3. Then there exists a positive constant R such that, for all x ∈ [1, \frac{1}{R}\min\{B_n, L_n^{-1/3}\}],
P\Big(\sum_{i=1}^n F_i > B_n x\Big) = \big(1 + O(1)\,x^3 L_n\big)\big(1 - \Phi(x)\big).
Proof of Theorem 1.
Define the event
A = \Big\{ \max_{1\le l\le g} \sqrt{\textstyle\sum_{j\in G_l}\big[\ell'_\psi(\beta^0)\big]_j^2}\Big/\omega_l \;\le\; z\lambda \Big\}.
We first establish the results of the theorem on the event A and then find a lower bound for P(A).
Recall I = \{l : \|\beta^0_{(l)}\|_2 \neq 0\}. Since \hat\beta is the minimizer of \ell_\psi(\beta) + \lambda\|W\beta\|_{2,1}, we get
\ell_\psi(\hat\beta) + \lambda\|W\hat\beta\|_{2,1} \;\le\; \ell_\psi(\beta^0) + \lambda\|W\beta^0\|_{2,1}.   (A1)
Adding \lambda\|W(\hat\beta - \beta^0)\|_{2,1} to both sides of (A1) and rearranging the inequality, we obtain
\ell_\psi(\hat\beta) - \ell_\psi(\beta^0) + \lambda\|W(\hat\beta - \beta^0)\|_{2,1} \;\le\; \lambda\|W\beta^0\|_{2,1} - \lambda\|W\hat\beta\|_{2,1} + \lambda\|W(\hat\beta - \beta^0)\|_{2,1} \;\le\; 2\lambda\|W_I(\hat\beta - \beta^0)_{(I)}\|_{2,1}.   (A2)
Since \ell_\psi(\beta) is a convex function, a first-order expansion together with the Cauchy–Schwarz inequality gives, on the event A,
\ell_\psi(\hat\beta) - \ell_\psi(\beta^0) \;\ge\; (\hat\beta - \beta^0)^T \ell'_\psi(\beta^0)
\;\ge\; -\sum_{l=1}^g \sqrt{\textstyle\sum_{j\in G_l}\big[\ell'_\psi(\beta^0)\big]_j^2}\Big/\omega_l \cdot \omega_l\|(\hat\beta - \beta^0)_{(l)}\|_2
\;\ge\; -\max_{1\le l\le g}\sqrt{\textstyle\sum_{j\in G_l}\big[\ell'_\psi(\beta^0)\big]_j^2}\Big/\omega_l \cdot \sum_{l=1}^g \omega_l\|(\hat\beta - \beta^0)_{(l)}\|_2
\;\ge\; -z\lambda\|W(\hat\beta - \beta^0)\|_{2,1}.   (A3)
Combining (A2) and (A3) and defining δ = \hat\beta - \beta^0, we obtain the weighted group restricted inequality
\|W_{I^C}\delta_{(I^C)}\|_{2,1} \;\le\; \alpha\,\|W_I\delta_{(I)}\|_{2,1}.
Therefore, on the event A, we have δ ∈ Θ_α and μ(s, α) > 0 for α = \frac{1+z}{1-z}.
Next, since \ell_\psi is three-times differentiable, define the function g(t) = \ell_\psi(\beta^0 + t\delta). By the Cauchy–Schwarz inequality, we have
|g'''(t)| \;\le\; \tau_0 \max_{1\le i\le n}|x_i^T\delta|\,g''(t)
\;\le\; \tau_0 \max_{1\le i\le n}\sum_{l=1}^g \sqrt{\textstyle\sum_{j\in G_l} x_{ij}^2}\Big/\omega_l \cdot \omega_l\|\delta_{(l)}\|_2\,g''(t)
\;\le\; \tau_0 \max_{1\le i\le n}\max_{1\le l\le g}\sqrt{\textstyle\sum_{j\in G_l} x_{ij}^2}\Big/\omega_l \cdot \|W\delta\|_{2,1}\,g''(t)
\;\le\; \tau_0\,\frac{\sqrt{M}}{\min_{1\le l\le g}\omega_l}\,(\alpha + 1)\sqrt{s}\,\|W_I\delta_{(I)}\|_{2,2}\,g''(t).
Set \bar M = \tau_0(\alpha + 1)\sqrt{sM}\big/\min_{1\le l\le g}\omega_l; since the ω_l are fixed positive constants, \bar M is bounded, and |g'''(t)| \le \bar M\,\|W_I\delta_{(I)}\|_{2,2}\,g''(t). By Lemma A1 (applied at t = 1), we have
\ell_\psi(\hat\beta) - \ell_\psi(\beta^0) \;\ge\; \delta^T\ell'_\psi(\beta^0) + \frac{\delta^T H_\psi(\beta^0)\delta}{\bar M^2\|W_I\delta_{(I)}\|_{2,2}^2}\Big(e^{-\bar M\|W_I\delta_{(I)}\|_{2,2}} + \bar M\|W_I\delta_{(I)}\|_{2,2} - 1\Big).   (A4)
Combining (A3) and (A4), we have the following result:
-z\lambda\|W\delta\|_{2,1} + \frac{\delta^T H_\psi(\beta^0)\delta}{\bar M^2\|W_I\delta_{(I)}\|_{2,2}^2}\Big(e^{-\bar M\|W_I\delta_{(I)}\|_{2,2}} + \bar M\|W_I\delta_{(I)}\|_{2,2} - 1\Big) \;\le\; \lambda\|W_I\delta_{(I)}\|_{2,1} - \lambda\|W_{I^C}\delta_{(I^C)}\|_{2,1}.   (A5)
Furthermore, using the group restricted eigenvalue condition, we obtain
\frac{\mu(s,\alpha)}{\bar M^2}\Big(e^{-\bar M\|W_I\delta_{(I)}\|_{2,2}} + \bar M\|W_I\delta_{(I)}\|_{2,2} - 1\Big) + (1-z)\lambda\|W\delta\|_{2,1} \;\le\; 2\lambda\sqrt{s}\,\|W_I\delta_{(I)}\|_{2,2}.
This implies that
e^{-\bar M\|W_I\delta_{(I)}\|_{2,2}} + \bar M\|W_I\delta_{(I)}\|_{2,2} - 1 \;\le\; \frac{2\lambda\sqrt{s}\,\bar M^2}{\mu(s,\alpha)}\,\|W_I\delta_{(I)}\|_{2,2}.   (A6)
In fact, for all t ∈ [0, 1) we have
\exp\Big(-\frac{2t}{1-t}\Big) + 2t - 1 \;\ge\; 0.
Taking t = \bar M\|W_I\delta_{(I)}\|_{2,2}\big/\big(2 + \bar M\|W_I\delta_{(I)}\|_{2,2}\big), which satisfies this condition, we obtain
e^{-\bar M\|W_I\delta_{(I)}\|_{2,2}} + \bar M\|W_I\delta_{(I)}\|_{2,2} - 1 \;\ge\; \frac{\bar M^2\|W_I\delta_{(I)}\|_{2,2}^2}{2 + \bar M\|W_I\delta_{(I)}\|_{2,2}}.   (A7)
Combining (A6) and (A7), we have
\frac{\|W_I\delta_{(I)}\|_{2,2}}{2 + \bar M\|W_I\delta_{(I)}\|_{2,2}} \;\le\; \frac{2\lambda\sqrt{s}}{\mu(s,\alpha)}.
Based on the group restricted eigenvalue condition, choose \lambda \le \frac{k(1-z)\mu(s,\alpha)}{8\tau_0\sqrt{s}\,M} for a positive constant k < \min_{1\le l\le g}\omega_l and substitute it into the above inequality, which gives
\bar M\,\|W_I\delta_{(I)}\|_{2,2} \;\le\; \frac{2k}{\min_{1\le l\le g}\omega_l - k}.
Then, substituting this into (A7), we have
e^{-\bar M\|W_I\delta_{(I)}\|_{2,2}} + \bar M\|W_I\delta_{(I)}\|_{2,2} - 1 \;\ge\; \frac{\min_{1\le l\le g}\omega_l - k}{2\min_{1\le l\le g}\omega_l}\,\bar M^2\,\|W_I\delta_{(I)}\|_{2,2}^2.   (A8)
Combining (A5) and (A8) and using the Cauchy–Schwarz inequality, we have
\frac{\min_{1\le l\le g}\omega_l - k}{2\min_{1\le l\le g}\omega_l}\,\mu(s,\alpha)\,\|W_I\delta_{(I)}\|_{2,2}^2 + (1-z)\lambda\|W\delta\|_{2,1} \;\le\; 2\lambda\|W_I\delta_{(I)}\|_{2,1} \;\le\; 2\lambda\sqrt{s}\,\|W_I\delta_{(I)}\|_{2,2} \;\le\; a\lambda^2 s + \frac{1}{a}\|W_I\delta_{(I)}\|_{2,2}^2.
Letting a = \frac{2\min_{1\le l\le g}\omega_l}{(\min_{1\le l\le g}\omega_l - k)\,\mu(s,\alpha)}, we obtain, on the event A,
\|W\delta\|_{2,1} \;\le\; \frac{2\min_{1\le l\le g}\omega_l\,\lambda s}{(\min_{1\le l\le g}\omega_l - k)(1-z)\,\mu(s,\alpha)},
which implies that
\|\delta\|_{2,1} \;\le\; \frac{2\lambda s}{(\min_{1\le l\le g}\omega_l - k)(1-z)\,\mu(s,\alpha)}.
Furthermore, Equation (12) follows from (11) by applying Lemma A2.
Furthermore, by (A2) and (A3), we obtain
\big|\ell_\psi(\hat\beta) - \ell_\psi(\beta^0)\big| \;\le\; \lambda\|W\delta\|_{2,1} \;\le\; \frac{2\min_{1\le l\le g}\omega_l\,\lambda^2 s}{(\min_{1\le l\le g}\omega_l - k)(1-z)\,\mu(s,\alpha)}.
Now, we bound the probability of the event A:
P(A^c) = P\Big(\max_{1\le l\le g}\sqrt{\textstyle\sum_{j\in G_l}\big[\ell'_\psi(\beta^0)\big]_j^2}\Big/\omega_l > z\lambda\Big)
\le P\Big(\max_{1\le l\le g}\max_{j\in G_l}\frac{|G_l|\,\big[\ell'_\psi(\beta^0)\big]_j^2}{\omega_l^2} > (z\lambda)^2\Big)
\le P\Big(\max_{1\le j\le p}\big|\big[\ell'_\psi(\beta^0)\big]_j\big| > \frac{z\lambda\omega_l}{\sqrt{|G_l|}}\Big).
Take \eta = \Phi^{-1}\big(1 - \frac{\epsilon}{2p}\big) and \lambda\omega_l = \frac{N(\beta^0)}{z}\sqrt{\frac{|G_l|}{n}}\,\eta; then it follows that
P(A^c) \le p\max_{1\le j\le p} P\Big(\big|\big[\ell'_\psi(\beta^0)\big]_j\big| > \frac{z\lambda\omega_l}{\sqrt{|G_l|}}\Big)
= p\max_{1\le j\le p} P\Big(\Big|\frac{1}{n}\sum_{i=1}^n \psi(x_i^T\beta^0)\big[G(x_i^T\beta^0) - Y_i\big]x_{ij}\Big| > \frac{z\lambda\omega_l}{\sqrt{|G_l|}}\Big)
= p\max_{1\le j\le p} P\Big(\Big|\sum_{i=1}^n \kappa_{ij}\Big| > \sqrt{n}\,N(\beta^0)\,\eta\Big),
where \kappa_{ij} = \psi(x_i^T\beta^0)\big[G(x_i^T\beta^0) - Y_i\big]x_{ij}. Under the stated assumptions, we obtain
E(\kappa_{ij}) = \psi(x_i^T\beta^0)\big[G(x_i^T\beta^0) - E(Y_i)\big]x_{ij} = 0, \qquad E(\kappa_{ij}^2) = \mathrm{Var}(\kappa_{ij}) = \psi^2(x_i^T\beta^0)\,G(x_i^T\beta^0)\big(1 - G(x_i^T\beta^0)\big)x_{ij}^2,
so that \frac{1}{n}\sum_{i=1}^n E(\kappa_{ij}^2) \le N^2(\beta^0). Moreover,
|\kappa_{ij}| \;\le\; \psi(x_i^T\beta^0)\,\big|G(x_i^T\beta^0) - Y_i\big|\,\big(\max_{i,j}|x_{ij}|\big) \;\le\; MR,
with the positive constant R = \max_{1\le i\le n}\psi(x_i^T\beta^0), using 0 \le G(x_i^T\beta^0) \le 1. Set F_{ij} = \kappa_{ij}/(MR); then |F_{ij}| \le 1 and E(F_{ij}) = 0.
Furthermore,
B_{nj}^2 = \sum_{i=1}^n E(F_{ij}^2) = \sum_{i=1}^n E(\kappa_{ij}^2)/(MR)^2 \le n N^2(\beta^0)/(MR)^2, \qquad
L_{nj} = \sum_{i=1}^n E(|F_{ij}|^3)\big/B_{nj}^3 \le \sum_{i=1}^n E(|F_{ij}|^2)\big/B_{nj}^3 = \frac{1}{B_{nj}}.
Hence B_{nj} = O(\sqrt{n}) and L_{nj} = O(1/\sqrt{n}). By Lemma A3, we have
P\Big(\Big|\sum_{i=1}^n \kappa_{ij}\Big| > \sqrt{n}N(\beta^0)\eta\Big)
= P\Big(\Big|\sum_{i=1}^n F_{ij}\Big| > \frac{\sqrt{n}N(\beta^0)}{MR}\,\eta\Big)
\le P\Big(\Big|\sum_{i=1}^n F_{ij}\Big| > B_{nj}\,\eta\Big)
= 2\big(1 + O(1)\eta^3 L_{nj}\big)\big(1 - \Phi(\eta)\big)
= \frac{\epsilon}{p}\big(1 + O(\eta^3/\sqrt{n})\big).
Note that, for any η > 0, 1 - \Phi(\eta) \le \phi(\eta)/\eta, where \phi denotes the standard normal density; then
\frac{\epsilon}{2p} = 1 - \Phi(\eta) \le \frac{\phi(\eta)}{\eta} = \frac{\exp(-\eta^2/2)}{\sqrt{2\pi}\,\eta}.
Since p/\epsilon > 2, we have \eta > \Phi^{-1}(3/4) > 1/\sqrt{2\pi}, and so
\frac{\epsilon}{2p} \le \frac{\exp(-\eta^2/2)}{\sqrt{2\pi}\,\eta} < \exp\Big(-\frac{\eta^2}{2}\Big).
Hence we get
\eta < \sqrt{2\log\frac{2p}{\epsilon}}.
As n, p → ∞ with p = o(e^{n^{1/3}}), we have \eta^3/\sqrt{n} \to 0, and therefore
P(A^c) \le \epsilon\big(1 + o(1)\big),
which completes the proof of Theorem 1. □
Proof of Theorem 2.
We only need to show that the weight function of the form (14) makes the logistic loss satisfy the self-concordance condition in Assumption (A4).
Write g(t) = \ell_\psi(u + tv; X, Y) for u, v ∈ R^p. Then
g'(t) = \frac{1}{2n}\sum_{i=1}^n \Big[(1-Y_i)\exp\Big(\frac{x_i^T u + t x_i^T v}{2}\Big) - Y_i\exp\Big(-\frac{x_i^T u + t x_i^T v}{2}\Big)\Big]\,v^T x_i,
g''(t) = \frac{1}{4n}\sum_{i=1}^n \Big[(1-Y_i)\exp\Big(\frac{x_i^T u + t x_i^T v}{2}\Big) + Y_i\exp\Big(-\frac{x_i^T u + t x_i^T v}{2}\Big)\Big]\,(v^T x_i)^2,
g'''(t) = \frac{1}{8n}\sum_{i=1}^n \Big[(1-Y_i)\exp\Big(\frac{x_i^T u + t x_i^T v}{2}\Big) - Y_i\exp\Big(-\frac{x_i^T u + t x_i^T v}{2}\Big)\Big]\,(v^T x_i)^3.
It is not difficult to see that g''(t) \ge 0, and then
|g'''(t)| = \Big|\frac{1}{8n}\sum_{i=1}^n \Big[(1-Y_i)\exp\Big(\frac{x_i^T u + t x_i^T v}{2}\Big) - Y_i\exp\Big(-\frac{x_i^T u + t x_i^T v}{2}\Big)\Big](v^T x_i)^3\Big|
\le \frac{1}{2}\max_{1\le i\le n}|x_i^T v|\cdot\frac{1}{4n}\sum_{i=1}^n \Big[(1-Y_i)\exp\Big(\frac{x_i^T u + t x_i^T v}{2}\Big) + Y_i\exp\Big(-\frac{x_i^T u + t x_i^T v}{2}\Big)\Big](v^T x_i)^2
= \frac{1}{2}\big(\max_{1\le i\le n}|x_i^T v|\big)\,g''(t),
which completes the proof of Theorem 2. □

References

  1. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. Stat. Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  2. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  3. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. Stat. Methodol. 2005, 67, 301–320. [Google Scholar] [CrossRef]
  4. Candes, E.; Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat. 2007, 35, 2313–2351. [Google Scholar]
  5. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef] [PubMed]
  6. Sur, P.; Chen, Y.; Candès, E.J. The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square. Probab. Theory Relat. Fields 2019, 175, 487–558. [Google Scholar] [CrossRef]
  7. Ma, R.; Tony Cai, T.; Li, H. Global and simultaneous hypothesis testing for high-dimensional logistic regression models. J. Am. Stat. Assoc. 2021, 116, 984–998. [Google Scholar] [CrossRef]
  8. Bianco, A.M.; Boente, G.; Chebi, G. Penalized robust estimators in sparse logistic regression. Test 2022, 31, 563–594. [Google Scholar] [CrossRef]
  9. Abramovich, F.; Grinshtein, V. High-dimensional classification by sparse logistic regression. IEEE Trans. Inf. Theory 2018, 65, 3068–3079. [Google Scholar] [CrossRef]
  10. Huang, H.; Gao, Y.; Zhang, H.; Li, B. Weighted Lasso estimates for sparse logistic regression: Non-asymptotic properties with measurement errors. Acta Math. Sci. 2021, 41, 207–230. [Google Scholar] [CrossRef]
  11. Yin, Z. Variable selection for sparse logistic regression. Metrika 2020, 83, 821–836. [Google Scholar] [CrossRef]
  12. Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. Stat. Methodol. 2006, 68, 49–67. [Google Scholar] [CrossRef]
  13. Meier, L.; Van De Geer, S.; Bühlmann, P. The group lasso for logistic regression. J. R. Stat. Soc. Ser. Stat. Methodol. 2008, 70, 53–71. [Google Scholar] [CrossRef]
  14. Wang, L.; You, Y.; Lian, H. Convergence and sparsity of Lasso and group Lasso in high-dimensional generalized linear models. Stat. Pap. 2015, 56, 819–828. [Google Scholar] [CrossRef]
  15. Blazere, M.; Loubes, J.M.; Gamboa, F. Oracle Inequalities for a Group Lasso Procedure Applied to Generalized Linear Models in High Dimension. IEEE Trans. Inf. Theory 2014, 60, 2303–2318. [Google Scholar] [CrossRef]
  16. Kwemou, M. Non-asymptotic oracle inequalities for the Lasso and group Lasso in high dimensional logistic model. ESAIM Probab. Stat. 2016, 20, 309–331. [Google Scholar] [CrossRef]
  17. Nowakowski, S.; Pokarowski, P.; Rejchel, W.; Sołtys, A. Improving group Lasso for high-dimensional categorical data. In Proceedings of the International Conference on Computational Science; Springer: Berlin/Heidelberg, Germany, 2023; pp. 455–470. [Google Scholar]
  18. Zhang, Y.; Wei, C.; Liu, X. Group Logistic Regression Models with Lp, q Regularization. Mathematics 2022, 10, 2227. [Google Scholar] [CrossRef]
  19. Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 2001, 109, 475–494. [Google Scholar] [CrossRef]
  20. Breheny, P.; Huang, J. Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat. Comput. 2015, 25, 173–187. [Google Scholar] [CrossRef]
  21. Abramovich, F.; Grinshtein, V.; Levy, T. Multiclass classification by sparse multinomial logistic regression. IEEE Trans. Inf. Theory 2021, 67, 4637–4646. [Google Scholar] [CrossRef]
  22. Chen, S.; Wang, P. Gene selection from biological data via group LASSO for logistic regression model: Effects of different clustering algorithms. In Proceedings of the 2021 40th Chinese Control Conference (CCC), Shanghai, China, 26–28 July 2021; pp. 6374–6379. [Google Scholar]
  23. Ryan Kilcullen, J.; Castonguay, L.G.; Janis, R.A.; Hallquist, M.N.; Hayes, J.A.; Locke, B.D. Predicting future courses of psychotherapy within a grouped LASSO framework. Psychother. Res. 2021, 31, 63–77. [Google Scholar] [CrossRef] [PubMed]
  24. Yang, Y.; Hu, X.; Jiang, H. Group penalized logistic regressions predict up and down trends for stock prices. N. Am. J. Econ. Financ. 2022, 59, 101564. [Google Scholar] [CrossRef]
  25. Belloni, A.; Chernozhukov, V.; Wang, L. Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 2011, 98, 791–806. [Google Scholar] [CrossRef]
  26. Bunea, F.; Lederer, J.; She, Y. The group square-root lasso: Theoretical properties and fast algorithms. IEEE Trans. Inf. Theory 2013, 60, 1313–1325. [Google Scholar] [CrossRef]
  27. Huang, Y.; Wang, C. Consistent functional methods for logistic regression with errors in covariates. J. Am. Stat. Assoc. 2001, 96, 1469–1482. [Google Scholar] [CrossRef]
  28. Bach, F. Self-concordant analysis for logistic regression. Electron. J. Stat. 2010, 4, 384–414. [Google Scholar] [CrossRef]
  29. Hu, Y.; Li, C.; Meng, K.; Qin, J.; Yang, X. Group sparse optimization via lp, q regularization. J. Mach. Learn. Res. 2017, 18, 960–1011. [Google Scholar]
  30. Bickel, P.J.; Ritov, Y.; Tsybakov, A.B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 2009, 37, 1705–1732. [Google Scholar] [CrossRef]
  31. Tseng, P.; Yun, S. A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. 2009, 117, 387–423. [Google Scholar] [CrossRef]
  32. Yang, Y.; Zou, H. A fast unified algorithm for solving group-lasso penalize learning problems. Stat. Comput. 2015, 25, 1129–1141. [Google Scholar] [CrossRef]
  33. Graham, K.; de Las Morenas, A.; Tripathi, A.; King, C.; Kavanah, M.; Mendez, J.; Stone, M.; Slama, J.; Miller, M.; Antoine, G.; et al. Gene expression in histologically normal epithelium from breast cancer patients and from cancer-free prophylactic mastectomy patients shares a similar profile. Br. J. Cancer 2010, 102, 1284–1293. [Google Scholar] [CrossRef] [PubMed]
  34. Sakhanenko, A. Berry-Esseen type estimates for large deviation probabilities. Sib. Math. J. 1991, 32, 647–656. [Google Scholar] [CrossRef]
Figure 1. Average TPR, Accur, Time and BNE plots for 500 repetitions of the three algorithms in Model I and Model II.
Figure 2. Average TPR, Accur, Time and BNE plots for 500 repetitions of the three algorithms in Model III and Model IV.
Table 1. Average results for 500 repetitions of the three algorithms in Models I and II.
Model I
Method | TP | TPR | FP | Accur | Time | BNE
p = 300
grpreg (λ = min) | 30.00 (0.00) | 1.000 | 91.28 (19.46) | 0.696 | 300.63 | 18.32 (1.96)
gglasso (λ = min) | 30.00 (0.00) | 1.000 | 41.64 (29.92) | 0.861 | 390.56 | 17.96 (3.11)
gglasso (λ = lse) | 29.68 (1.10) | 0.990 | 13.44 (14.73) | 0.954 | 389.27 | 21.81 (2.29)
wgrplasso (ϵ = 0.01) | 29.61 (1.06) | 0.987 | 26.15 (7.92) | 0.912 | 23.53 | 18.51 (0.65)
wgrplasso (ϵ = 0.05) | 29.77 (0.85) | 0.993 | 36.14 (9.80) | 0.879 | 29.24 | 17.88 (0.70)
p = 600
grpreg (λ = min) | 29.90 (0.55) | 0.997 | 116.36 (26.51) | 0.806 | 444.31 | 20.35 (1.73)
gglasso (λ = min) | 29.80 (0.91) | 0.994 | 45.85 (34.78) | 0.923 | 508.35 | 19.95 (2.41)
gglasso (λ = lse) | 29.32 (2.00) | 0.978 | 17.37 (16.92) | 0.970 | 506.27 | 22.77 (1.81)
wgrplasso (ϵ = 0.01) | 29.25 (1.40) | 0.975 | 41.84 (11.33) | 0.929 | 38.97 | 19.17 (0.71)
wgrplasso (ϵ = 0.05) | 29.50 (1.19) | 0.984 | 55.81 (12.78) | 0.906 | 45.16 | 18.73 (0.76)
p = 900
grpreg (λ = min) | 29.66 (1.13) | 0.989 | 130.12 (32.66) | 0.855 | 590.55 | 21.56 (1.82)
gglasso (λ = min) | 29.88 (0.59) | 0.996 | 64.84 (39.83) | 0.928 | 614.64 | 20.07 (2.24)
gglasso (λ = lse) | 29.30 (1.53) | 0.977 | 24.07 (21.79) | 0.972 | 612.24 | 23.13 (1.80)
wgrplasso (ϵ = 0.01) | 29.19 (1.43) | 0.973 | 54.10 (15.45) | 0.939 | 52.63 | 19.58 (0.73)
wgrplasso (ϵ = 0.05) | 29.44 (1.21) | 0.982 | 70.01 (15.98) | 0.922 | 62.81 | 19.20 (0.78)

Model II
Method | TP | TPR | FP | Accur | Time | BNE
p = 300
grpreg (λ = min) | 17.82 (4.36) | 0.594 | 65.31 (10.55) | 0.742 | 641.23 | 27.77 (1.32)
gglasso (λ = min) | 14.30 (4.92) | 0.476 | 36.25 (10.33) | 0.827 | 391.28 | 27.69 (1.43)
gglasso (λ = lse) | 11.36 (4.80) | 0.378 | 27.70 (11.50) | 0.846 | 389.83 | 28.73 (0.96)
wgrplasso (ϵ = 0.01) | 25.07 (2.67) | 0.836 | 6.52 (4.83) | 0.962 | 39.71 | 15.92 (1.09)
wgrplasso (ϵ = 0.05) | 25.02 (2.68) | 0.834 | 6.28 (4.70) | 0.962 | 40.24 | 15.85 (1.09)
p = 600
grpreg (λ = min) | 12.61 (4.32) | 0.420 | 85.84 (11.35) | 0.828 | 894.47 | 29.13 (1.17)
gglasso (λ = min) | 10.95 (4.99) | 0.365 | 47.08 (13.41) | 0.890 | 584.32 | 28.73 (1.04)
gglasso (λ = lse) | 8.23 (4.76) | 0.274 | 36.33 (13.85) | 0.903 | 581.74 | 29.26 (0.72)
wgrplasso (ϵ = 0.01) | 24.57 (2.81) | 0.819 | 9.43 (6.08) | 0.975 | 69.48 | 15.96 (0.96)
wgrplasso (ϵ = 0.05) | 24.69 (2.80) | 0.823 | 9.23 (6.26) | 0.976 | 72.05 | 15.89 (0.99)
p = 900
grpreg (λ = min) | 10.53 (4.60) | 0.351 | 96.88 (12.79) | 0.871 | 1115.73 | 29.64 (1.07)
gglasso (λ = min) | 8.43 (4.49) | 0.281 | 53.67 (13.97) | 0.916 | 746.62 | 29.14 (0.93)
gglasso (λ = lse) | 6.09 (4.20) | 0.203 | 40.74 (15.09) | 0.928 | 742.62 | 29.49 (0.58)
wgrplasso (ϵ = 0.01) | 24.86 (2.66) | 0.829 | 10.80 (6.39) | 0.982 | 106.94 | 15.85 (1.01)
wgrplasso (ϵ = 0.05) | 24.99 (2.71) | 0.833 | 11.05 (6.23) | 0.982 | 111.95 | 15.80 (1.00)

Reported numbers are the averages and standard errors (shown in parentheses).
Table 2. Average results for 500 repetitions of the three algorithms in Models III and IV.
Model III
Method | TP | TPR | FP | Accur | Time | BNE
p = 300
grpreg (λ = min) | 29.39 (1.79) | 0.980 | 73.59 (21.16) | 0.753 | 447.46 | 27.52 (1.96)
gglasso (λ = min) | 29.91 (0.59) | 0.997 | 74.11 (25.60) | 0.753 | 812.03 | 24.06 (2.05)
gglasso (λ = lse) | 29.57 (2.32) | 0.986 | 40.58 (21.48) | 0.863 | 807.65 | 25.27 (1.69)
wgrplasso (ϵ = 0.01) | 27.69 (2.51) | 0.923 | 24.02 (7.69) | 0.912 | 35.92 | 28.99 (1.27)
wgrplasso (ϵ = 0.05) | 28.55 (2.06) | 0.952 | 32.00 (8.15) | 0.888 | 39.13 | 28.84 (1.38)
p = 600
grpreg (λ = min) | 28.05 (2.96) | 0.935 | 86.76 (28.04) | 0.852 | 598.05 | 28.65 (1.70)
gglasso (λ = min) | 29.40 (2.37) | 0.980 | 97.53 (36.13) | 0.836 | 974.70 | 25.44 (1.92)
gglasso (λ = lse) | 27.62 (5.90) | 0.920 | 45.84 (27.29) | 0.920 | 968.57 | 26.65 (1.87)
wgrplasso (ϵ = 0.01) | 27.15 (2.69) | 0.905 | 40.41 (10.68) | 0.928 | 56.35 | 29.40 (1.22)
wgrplasso (ϵ = 0.05) | 28.18 (2.21) | 0.940 | 51.31 (11.66) | 0.911 | 63.67 | 29.34 (1.33)
p = 900
grpreg (λ = min) | 25.66 (5.66) | 0.856 | 82.92 (36.76) | 0.903 | 745.82 | 29.33 (1.51)
gglasso (λ = min) | 28.77 (3.79) | 0.959 | 105.48 (45.77) | 0.881 | 1121.19 | 26.32 (1.87)
gglasso (λ = lse) | 24.33 (9.47) | 0.811 | 42.12 (35.83) | 0.947 | 1113.45 | 27.76 (2.14)
wgrplasso (ϵ = 0.01) | 26.85 (2.87) | 0.895 | 50.99 (10.80) | 0.940 | 68.74 | 29.70 (1.18)
wgrplasso (ϵ = 0.05) | 27.80 (2.38) | 0.926 | 63.14 (12.27) | 0.927 | 81.32 | 29.67 (1.27)

Model IV
Method | TP | TPR | FP | Accur | Time | BNE
p = 300
grpreg (λ = min) | 21.94 (4.03) | 0.732 | 63.80 (9.64) | 0.760 | 466.73 | 35.16 (1.78)
gglasso (λ = min) | 19.88 (4.43) | 0.662 | 52.83 (11.36) | 0.790 | 409.92 | 28.30 (1.13)
gglasso (λ = lse) | 17.30 (4.74) | 0.577 | 47.80 (11.44) | 0.798 | 408.22 | 28.93 (0.74)
wgrplasso (ϵ = 0.01) | 28.75 (1.65) | 0.959 | 25.96 (8.12) | 0.909 | 218.10 | 26.09 (2.55)
wgrplasso (ϵ = 0.05) | 28.78 (1.65) | 0.960 | 26.32 (8.14) | 0.908 | 221.08 | 26.13 (2.57)
p = 600
grpreg (λ = min) | 18.32 (4.40) | 0.611 | 83.08 (12.48) | 0.842 | 689.27 | 35.02 (1.79)
gglasso (λ = min) | 16.48 (5.10) | 0.549 | 70.00 (14.34) | 0.861 | 571.90 | 29.08 (1.01)
gglasso (λ = lse) | 14.05 (5.17) | 0.468 | 62.39 (14.65) | 0.869 | 567.98 | 29.37 (0.62)
wgrplasso (ϵ = 0.01) | 28.58 (1.80) | 0.953 | 34.33 (10.12) | 0.940 | 384.79 | 26.59 (2.69)
wgrplasso (ϵ = 0.05) | 28.58 (1.83) | 0.953 | 34.76 (10.11) | 0.940 | 380.57 | 26.63 (2.70)
p = 900
grpreg (λ = min) | 15.66 (4.25) | 0.522 | 94.71 (12.41) | 0.879 | 356.36 | 34.90 (1.50)
gglasso (λ = min) | 13.80 (4.61) | 0.460 | 80.03 (13.92) | 0.893 | 289.06 | 29.45 (0.92)
gglasso (λ = lse) | 11.61 (4.49) | 0.387 | 70.52 (15.85) | 0.901 | 287.73 | 29.64 (0.54)
wgrplasso (ϵ = 0.01) | 28.55 (1.83) | 0.952 | 39.33 (12.57) | 0.955 | 184.13 | 26.53 (2.34)
wgrplasso (ϵ = 0.05) | 28.56 (1.80) | 0.952 | 39.24 (12.45) | 0.955 | 186.89 | 26.56 (2.36)

Reported numbers are the averages and standard errors (shown in parentheses).
Table 3. Average prediction accuracy, model size and time taken for 100 repetitions of the four algorithms in the musk dataset.
 | wgrplasso (ϵ = 0.05) | grpreg (λ = min) | gglasso (λ = min) | glmnet (λ = min)
Prediction accuracy | 0.820 | 0.813 | 0.771 | 0.758
Model size | 66.53 | 31.29 | 30.14 | 53.53
Time | 0.69 | 3.04 | 2.70 | 2.12
Table 4. Average prediction error and model size for selected genes for 100 repetitions of three algorithms in microarray gene expression data from histological epithelial cells.
 | wgrplasso (ϵ = 0.05) | grpreg (λ = min) | gglasso (λ = min)
Prediction accuracy | 0.73 | 0.63 | 0.71
Model size | 14 | 9 | 14
Selected genes | 117_at, 1255_g_at, 200000_s_at, 200002_at, 200030_s_at, 200040_at, 200041_s_at, 200655_s_at, 200661_at, 200729_s_at, 201040_at, 201465_s_at, 202707_at, 211997_x_at | 201464_x_at, 201465_s_at, 201778_s_at, 202707_at, 204620_s_at, 205544_s_at, 211997_x_at, 213280_at, 217921_at | 200047_s_at, 200729_s_at, 200801_x_at, 201465_s_at, 202046_s_at, 202707_at, 205544_s_at, 208443_x_at, 211374_x_at, 211997_x_at, 212234_at, 213280_at, 217921_at, 220811_at