1. Introduction
Classification problems appear in diverse practical applications, such as spam e-mail classification, disease diagnosis and drug discovery, among many others (e.g., [1,2,3]). In these classification problems, the goal is to predict class labels based on a given set of variables. Recent research has focused extensively on linear classification; see [4,5] for comprehensive introductions. Among the many linear classification methods, support vector machines (SVMs) (see [6,7]) and distance-weighted discrimination (DWD) (see [8,9,10]) are two commonly used large-margin classification methods.
Owing to the recent advent of new technologies for data acquisition and storage, classification with high dimensional features, i.e., a large number of variables, has become a ubiquitous problem in both theoretical and applied scientific studies. Typically, only a small number of instances are available in such studies, a setting referred to as high-dimensional, low-sample size (HDLSS), as in [11]. In the HDLSS setting, a so-called "data-piling" phenomenon is observed in [8] for SVMs: the projections of many training instances onto the vector normal to the separating hyperplane are nearly identical, suggesting severe overfitting. DWD was originally proposed to overcome data-piling in the HDLSS setting. In binary classification problems, linear SVMs seek a hyperplane maximizing the smallest margin over all data points, while DWD seeks a hyperplane minimizing the sum of the inverse margins over all data points. Reference [8] suggests replacing the inverse margins by the q-th power of the inverse margins, yielding a generalized DWD method; see [12] for a detailed description. Formally, for a training data set of N observations with p-dimensional covariate vectors and binary class labels, binary generalized linear DWD seeks a proper separating hyperplane through the optimization problem (1), in which a and the slope vector are the intercept and slope parameters, respectively. A slack variable is introduced to ensure that each margin is non-negative, and a tuning constant controls the overlap between the two classes. Problem (1) can also be written in an equivalent loss-plus-penalty form (2) (e.g., [12]), which combines the generalized DWD loss on the margins with an L2 penalty on the slope. When q = 1, problem (1) becomes the standard DWD problem of [8], while problem (2) appears in [9,13].
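For reference, a standard way of writing the binary generalized DWD loss-plus-penalty form (2), following the formulation popularized in [12] (the notation below is illustrative and may differ in detail from the equations omitted here), is

$$
\min_{a,\,\boldsymbol{\beta}}\;\frac{1}{N}\sum_{i=1}^{N}\phi_q\bigl(y_i(a+\mathbf{x}_i^{\top}\boldsymbol{\beta})\bigr)+\lambda\lVert\boldsymbol{\beta}\rVert_2^2,
\qquad
\phi_q(u)=
\begin{cases}
1-u, & u\le \dfrac{q}{q+1},\\[6pt]
\dfrac{1}{u^{q}}\cdot\dfrac{q^{q}}{(q+1)^{q+1}}, & u> \dfrac{q}{q+1},
\end{cases}
$$

so that the loss is linear on the misclassified side and decays like an inverse power of the margin on the well-classified side.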
The binary classification problem (1) is well studied. However, in many applications, such as image classification [1], cancer diagnosis [2] and speech recognition [3], to name a few, problems with more than two categories are commonplace. To solve such multicategory problems with the DWD classifier, approaches based on either formulation (1) or (2) are used. One common strategy extends problem (1) to multiple classes by solving a series of binary problems in a one-versus-one (OVO) or one-versus-rest (OVR) fashion (e.g., [14]). Instead of reducing the multicategory problem to a sequence of binary ones, another strategy based on problem (1) considers all classes at once; as shown in [14], this approach generally works better than the OVO and OVR methods. Based on an extension of problem (2), [15] proposes a multicategory DWD written in a loss-plus-penalty form, given in (4), in which separate intercept and slope parameters are estimated for each category k. Although these methods can be applied to multicategory classification in the HDLSS setting, both problems (2) and (4) use the L2 penalty and therefore do not perform feature selection. As discussed in [16], taking all features into consideration does not work well for high dimensional classification, for two reasons. First, based on prior knowledge, often only a small number of variables are relevant to the classification problem, so a good high dimensional classifier should be able to select the important variables and discard redundant ones. Second, classifiers that use all available variables in high dimensional settings may have poor classification performance.
Much of the SVM literature has considered variable selection in high dimensional classification problems to improve performance (e.g., [17,18,19]). In the DWD literature, to the best of our knowledge, only [16] has considered variable selection and classification simultaneously. Wang and Zou [16] replaced the L2 penalty in problem (2) with an L1 penalty to improve interpretability through sparsity in binary classification. However, [16] makes selections based on the strengths of the input variables within individual classes and ignores the strengths of input variable groupings, thereby selecting more factors than necessary for each class. To overcome this weakness, in this paper we develop a multicategory generalized DWD method that performs variable selection and classification simultaneously. Our approach incorporates both sparsity and group structure information via the sparse group lasso penalty (see [20,21,22,23,24]).
Although DWD is well studied, it is less popular than the SVM for binary classification, arguably for computational and theoretical reasons. For an up-to-date list of works on DWD, mostly focused on the q = 1 case, see [14,15]. Theoretical asymptotic properties of large-margin classifiers in high dimensional settings were studied in [25], and [26] derived an expression for the asymptotic generalization error. In terms of computation, [8] solved the standard DWD problem (1) as a second-order cone programming (SOCP) problem using a primal-dual interior-point method, which is computationally expensive when N or p is large. To overcome this computational bottleneck, [12] proposed an approach based on a novel formulation of the primal DWD model in (1); however, this method does not scale to large data sets and requires further work. Lam et al. [27] designed a new algorithm for large-scale DWD problems based on convergent multi-block ADMM-type methods (see [28]). Wang and Zou [16] solved the lasso-penalized binary DWD problem by combining majorization-minimization and coordinate descent methods, since the lasso penalty does not directly permit an SOCP solution. In fact, solution identifiability for the generalized DWD problem with such a penalty requires additional constraints and remains an open research problem (see [8]). To the best of our knowledge, no existing work focuses on the computational aspects of lasso-penalized multicategory generalized DWD (MgDWD), and the same holds for sparse group lasso-penalized MgDWD.
The theoretical and computational contributions of this paper are as follows. First, we establish the uniqueness of the minimizer in the population form of the MgDWD problem. Second, we prove a non-asymptotic estimation error bound for the sparse group lasso-regularized MgDWD loss function in the ultra-high dimensional setting under mild regularity conditions. Third, we develop a fast, efficient algorithm able to solve the sparse group lasso-penalized MgDWD problem using proximal methods.
The rest of this paper is organized as follows. In Section 2.1, we introduce the MgDWD problem with the sparse group lasso penalty. In Section 2.2 and Section 2.3, we establish theoretical properties of the population classifier and of the regularized empirical loss. We propose a computational algorithm in Section 2.4. Section 3 illustrates the finite-sample performance of our method through simulation studies and a real data analysis. Proofs of the major theorems are given in Appendix A.
2. Methodology
2.1. Model Setup
We begin with some basic set-up and notation. Consider the multicategory classification problem for a random sample of N independent and identically distributed (i.i.d.) observations from some underlying distribution. Here, y is the categorical response, taking values in a set of K category labels, and the covariate vector lies in a p-dimensional space. We wish to obtain a proper separating hyperplane for each category, with category-specific intercept and slope parameters.
In this paper, we consider MgDWD with sparse group lasso regularization. That is, we estimate the classification boundary by solving the constrained optimization problem (5), in which the loss function is as defined in (3).
To approach this problem, we apply the concept of a "margin vector" to extend the definition of the (binary) margin to the multicategory case. For each observation, the margin vector collects the K category-wise discriminant values, and its components satisfy a sum-to-zero constraint; the class indicator vector encodes the observed label. The multicategory margin of a data point is then the component of the margin vector corresponding to its observed class. Therefore, the MgDWD loss can be rewritten as in (6).
Based on (6), Lemma 1 establishes the Fisher consistency of the MgDWD loss.
Lemma 1. Given the covariate vector, the minimizer of the conditional expectation of (6) exists and has an explicit form determined by the conditional class probabilities.
Consequently, this minimizer can be treated as an effective proxy for the conditional class probabilities and, for any new observation, a reasonable prediction of its label is the category whose component of the fitted margin vector is largest.
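A minimal sketch of this prediction rule in Python (the array names a_hat and B_hat are illustrative; a_hat is the length-K intercept vector and B_hat the p-by-K slope matrix, with labels indexed from zero here):

import numpy as np

def predict_label(x_new, a_hat, B_hat):
    """Assign the label whose fitted discriminant a_k + x^T beta_k is largest."""
    scores = a_hat + x_new @ B_hat   # length-K vector of category-wise discriminants
    return int(np.argmax(scores))    # predicted category (0-based index)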
Turning to the sparse group lasso (SGL) regularization in (5), the lasso penalty encourages an element-wise sparse estimator that selects the important variables for each individual category. Assuming that the parameters of different categories share the same information, we additionally use a group penalty to encourage a group-wise sparsity structure that removes covariates that are irrelevant across all categories. Specifically, collect the slope parameters into a p × K coefficient matrix whose k-th column is the slope vector for category label k and whose j-th row is the group coefficient vector of the j-th variable. If a variable is noise in the classification problem, or is not relevant to category label k, then the corresponding entry of the coefficient matrix should be shrunk to exactly zero. The SGL penalty in (5) can be written as a convex combination of the lasso and group lasso penalties on this coefficient matrix, as in (7), where λ is the overall scale of the penalty and τ tunes the balance between the element-wise and group-wise sparsity structures.
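A minimal numerical sketch of this penalty (a hypothetical helper; B is the p × K slope matrix, lam the overall scale λ and tau the mixing weight τ, and any additional scaling factors used in the paper, such as group-size weights, are omitted):

import numpy as np

def sgl_penalty(B, lam, tau):
    """Convex combination of the lasso and row-wise group lasso penalties on B."""
    lasso = np.abs(B).sum()                   # element-wise L1 term
    group = np.linalg.norm(B, axis=1).sum()   # L2 norm of each row (variable group), summed
    return lam * (tau * lasso + (1.0 - tau) * group)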
2.2. Population MgDWD
In this subsection, some basic results pertaining to unpenalized population MgDWD are given. These results are necessary for further theoretical analysis.
Denote the marginal probability mass function of y by its class probabilities, and denote the conditional probability density function of the covariate vector given each class label accordingly. Collect the intercept and slope coefficients of all K labels into a single parameter vector. The population version of the MgDWD problem in (6) is given in (8), where the parameter vector is the vectorization of the coefficient matrix and the covariates enter through a corresponding random vector. Denote the true parameter value as a minimizer of the population MgDWD problem over the set of sum-constrained parameter vectors, in which the sum-to-zero constraint across categories is written compactly using the Kronecker product ⊗.
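One plausible explicit form of the population loss in (8), assuming the margin-vector construction above and writing φ_q for the generalized DWD loss, π_k for the class probabilities and (a_k, β_k) for the category-k parameters (all of this notation is assumed here rather than taken from the omitted equations), is

$$
L(\boldsymbol{\theta})=\sum_{k=1}^{K}\pi_k\,\mathbb{E}\bigl[\phi_q\bigl(a_k+\mathbf{x}^{\top}\boldsymbol{\beta}_k\bigr)\,\big|\,y=k\bigr],
$$

that is, the expected generalized DWD loss of the margin component associated with the observed class.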
To facilitate our theoretical analysis, we first define the gradient vector and Hessian matrix of the population MgDWD loss function, and then introduce the regularity conditions needed to derive the theoretical properties of this problem. In what follows, a diagonal matrix may be constructed from a vector in the usual way, and ∘ and ⊕ denote the Hadamard product and the direct matrix sum, respectively. The gradient vector of the population MgDWD loss function (8) and its Hessian matrix are expressed in terms of the first and second derivatives of the generalized DWD loss function. The block structure of the Hessian implies a parallel relationship between the categories; the relationship between the blocks is reflected by the sum-to-zero constraint in the definition of the parameter set.
We assume the following regularity conditions.
(C1) The conditional densities of the covariate vector given each class label are continuous and have finite second moments.
(C2) A non-degeneracy condition on the covariate distribution holds for every class label; see Remark 1.
(C3) A positive-definiteness-type moment condition holds for every class label; see Remark 1.
Remark 1. Condition (C1) ensures that the population loss, its gradient and its Hessian are well defined and continuous in the parameters. For the theoretically optimal hyperplane, a zero slope vector leaves the covariates useless for classification; on the other hand, a zero slope vector paired with a non-zero intercept makes the hyperplane the empty set, which is similarly meaningless. Condition (C2) is proposed to avoid such degenerate cases, so that the covariate vector always contains information relevant to the classification problem. For bounded random variables, condition (C2) should be assumed with caution. Condition (C3) implies the positive definiteness of the Hessian matrix.
By convexity and the second-order Lagrange condition, the following theorem shows that the local minimizer of the population MgDWD problem exists and is unique.
Theorem 1. Under the regularity conditions (C1)-(C3), the true parameter is the unique minimizer of the population MgDWD loss, and the minimal loss value admits explicit lower and upper bounds.
The bounds in Theorem 1 show how q affects the loss function. The upper bound is a decreasing function of q, while in the lower bound the first term is an increasing function of q and the last term is a decreasing function of q. Consequently, for a given population, a larger q encourages the population MgDWD estimator to focus more on the regions that correspond to misclassifications; as a result, its behaviour becomes similar to that of the hinge loss as q grows. Setting q too small leads to an ineffective classifier because of the unreasonable penalty placed on the well-classified region. This variation of the lower bound with respect to q provides a necessary condition for the existence of an optimal q.
Remark 2. The explicit relationship between q and the minimal loss value is complicated. While it would be more desirable to prove that a greater value of q results in a smaller value of the loss function, there is no explicit formula for the optimal value in terms of q.
2.3. Estimator Consistency
Under the unpenalized framework presented in the previous subsection, all covariates contribute to the classification task for each category, a scenario that may lead to a classifier that overfits the training data set. In this subsection, we study the consistency of the estimator in (5) in ultra-high dimensional settings.
To achieve structural sparsity in the estimator, the regularization parameter λ in (7) must be large enough to dominate, with high probability, the gradient of the empirical MgDWD loss evaluated at the theoretical minimizer. On the other hand, λ should also be as small as possible to reduce the bias incurred by the SGL regularization term. Lemma 2 provides a suitable choice of λ under the following assumption.
(A1) The predictors are independent sub-Gaussian random vectors; that is, there exists a constant such that every one-dimensional projection of a predictor satisfies a sub-Gaussian tail bound governed by that constant. From here on, the largest eigenvalue of the predictors' population covariance matrix is also used in our bounds.
Lemma 2. Under condition (A1), the gradient of the empirical MgDWD loss evaluated at the true parameter value is bounded by an explicit quantity, with probability at least a level determined by constants given in the Appendix.
It is difficult to obtain a closed form for the conjugate of the SGL penalty; instead, we work with a regularized upper bound. Based on Lemma 2, we propose the theoretical tuning parameter value given in (9), which involves a given constant.
Before we can derive an error bound for the estimator in (5), we impose two additional assumptions.
(A2) The true coefficient matrix has a sparse structure, with element-wise and group-wise support sets of given cardinalities.
(A3) There exist positive constants that bound, from below, suitable restricted eigenvalues of the Hessian over the sparse support sets and their complements.
Under the choice of λ given in (9), we now establish the consistency of the estimator in (5).
Theorem 2. Suppose that conditions (A1)-(A3) hold. Then, with the tuning parameter value (9) used in (5), the estimation error is bounded by an explicit quantity with high probability.
Remark 3. The sub-Gaussian distribution assumption (A1) is common in high dimensional scenarios. It characterizes the tail behavior of a collection of random variables that includes Gaussian, Bernoulli and bounded variables as special cases. Assumption (A2) describes structural sparsity at two levels: the element-wise size is the size of the underlying generative model, and the group-wise size is the size of the signal covariate set. Both sizes are allowed to depend on the sample size N; as a result, the dimension p is also allowed to increase with N. Assumption (A3) guarantees that the relevant eigenvalues are positive in this sparse scenario.
Remark 4. In practice, the tuning parameters λ and τ in (7) are commonly chosen by M-fold cross validation; that is, we choose the pair with the highest average prediction accuracy over the M held-out sub-data sets.
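A schematic of this M-fold search in Python (fit and accuracy are user-supplied callables standing in for the SGL-MgDWD estimator and its evaluation; they are placeholders, not functions defined in this paper):

import numpy as np
from itertools import product

def cv_select(X, y, lambdas, taus, fit, accuracy, M=5):
    """Pick (lambda, tau) with the highest average validation accuracy over M folds."""
    folds = np.array_split(np.random.permutation(len(y)), M)
    best_pair, best_acc = None, -np.inf
    for lam, tau in product(lambdas, taus):
        accs = []
        for m in range(M):
            val = folds[m]
            trn = np.hstack([folds[i] for i in range(M) if i != m])
            model = fit(X[trn], y[trn], lam, tau)         # placeholder fitting routine
            accs.append(accuracy(model, X[val], y[val]))  # placeholder evaluation
        if np.mean(accs) > best_acc:
            best_pair, best_acc = (lam, tau), np.mean(accs)
    return best_pair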
2.4. Computational Algorithm
In this section, we propose an efficient algorithm to solve problem (5). Our approach uses the proximal algorithm (see [29]) for solving high dimensional regularization problems. In two main steps, this approach obtains a solution to the constrained optimization problem by applying the proximal operator to the solution of the unconstrained problem.
Since regularization is not needed for the intercept terms, they can be separated from the slope coefficients. The empirical counterpart of the MgDWD loss in (8) is written in terms of the intercept vector and the slope matrix. Several properties of this empirical loss function are given below.
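As a point of reference, a minimal numerical sketch of this empirical loss (assuming, as in the sketch of (2) above, that the generalized DWD loss φ_q is applied to the observed-class margin; all names are illustrative):

import numpy as np

def phi_q(u, q=1.0):
    """Generalized DWD loss: linear below q/(q+1), inverse-power tail above it (assumed form)."""
    u = np.asarray(u, dtype=float)
    thresh = q / (q + 1.0)
    tail = (q ** q / (q + 1.0) ** (q + 1.0)) / np.maximum(u, thresh) ** q
    return np.where(u <= thresh, 1.0 - u, tail)

def empirical_mgdwd_loss(X, y, a, B, q=1.0):
    """Average loss of the observed-class margins a_{y_i} + x_i^T beta_{y_i}; y holds 0-based labels."""
    margins = a[y] + np.sum(X * B[:, y].T, axis=1)
    return phi_q(margins, q).mean()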
Lemma 3. The empirical loss function has Lipschitz continuous partial derivatives. In particular, the partial derivative with respect to the intercepts is Lipschitz continuous with a constant involving the largest group sample size, and the partial derivative with respect to the k-th column of the slope matrix is Lipschitz continuous with a constant involving the observations belonging to the k-th group.
Hence, following the majorization-minimization scheme, we can majorize the empirical MgDWD loss by a quadratic function whose curvature is determined by the Lipschitz constants in Lemma 3. Instead of minimizing the empirical loss directly, we apply gradient descent to minimize this surrogate upper bound. The gradient descent updates are given in (10)-(12).
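A sketch of one such descent step in Python (grad_a and grad_B stand for the gradients of the empirical loss with respect to the intercepts and slopes, and L_a, L_B for the Lipschitz constants of Lemma 3; all names are illustrative):

def gradient_step(a, B, grad_a, grad_B, L_a, L_B):
    """One majorization-minimization step: move against the gradient with step sizes 1/L."""
    a_new = a - grad_a / L_a   # intercept update, no penalty applied
    B_new = B - grad_B / L_B   # slope update, before the proximal (penalty) step
    return a_new, B_new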
Next, we address the problem's constraints and regularization simultaneously by applying the proximal operator. For the intercepts, the proximal step is trivial, since they are not penalized. For the slope matrix, the proximal minimization problem can be expressed row by row, as in (13), which implies that the minimization can be carried out for the p groups in parallel. The following theorem provides the solution to (13).
Theorem 3. The constrained regularization problem (13) has a solution of the form given in (14), for some scalar determined by the constraint.
, the constrained regularization problem in Theorem 3 reduces to the constrained lasso problem with solution
. Combined with (
14), the proximal operator
, given by
can be introduced to realize the group sparsity of
.
For the standard lasso problem, the subgradient equation has a closed-form solution given by the soft-thresholding operator. However, under the sum-to-zero constraint, the naive solution is misleading: it satisfies the constraint but does not achieve shrinkage, let alone minimize the loss function. The difficulty lies in the intersection between the subdifferential set of the penalty and the constraint set; in this sense, the constrained solution might not have a closed form. Here, we consider using coordinate descent to solve the constrained lasso problem. For some fixed coordinate m, the sum-to-zero constraint allows the m-th coordinate to be expressed in terms of the remaining coordinates. Rewriting the objective function of the constrained lasso problem in a coordinate-wise form, we obtain (16). Next, Theorem 4 provides the solution to the optimization problem (16).
Theorem 4. The regularization problem (16) has a closed-form solution, given in (17).
By Theorem 4, given a fixed coordinate m, the coordinate-wise minimizer of any other coordinate can be expressed as a proximal operator. If m were kept fixed across iterations, the shrinkage of the m-th coordinate would only be reflected indirectly through the other coordinates. We therefore propose that m change with k in the coordinate-wise minimization process, so that every coordinate can be shrunk equally. We summarize our proposed algorithm in Algorithm 1.
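For reference, the coordinate-wise updates in Algorithm 1 below build on the usual soft-thresholding map, sketched here in Python (a generic sketch; the constrained update in (17) further adjusts for the fixed coordinate m):

import numpy as np

def soft_threshold(z, threshold):
    """Element-wise lasso shrinkage: sign(z) * max(|z| - threshold, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)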
Algorithm 1: Proximal gradient descent algorithm for SGL-MgDWD.
Input: the training data and the tuning parameters.
Initialization: starting values for the intercept and slope parameters.
1:  repeat
2:    Perform the gradient update according to (10) and (12).
3:    Perform the gradient update according to (11).
4:    Set the working variables for the inner loop.
5:    repeat
6:      for m = 1 to K do
7:        for k in the remaining coordinates do
8:          Compute the coordinate-wise quantities.
9:          Update the k-th coordinate according to (17) and recover the m-th coordinate from the sum-to-zero constraint.
10:       end for
11:     end for
12:   until convergence.
13:   Apply the proximal update according to (15).
14:   Set the updated values for the next outer iteration.
15: until the stopping condition is met.
Output: the estimated intercept and slope parameters.