1. Introduction
Classification problems appear in diverse practical applications, such as spam e-mail classification, disease diagnosis and drug discovery, among many others (e.g., [1,2,3]). In these classification problems, the goal is to predict class labels based on a given set of variables. Recent research has focused extensively on linear classification; see [4,5] for comprehensive introductions. Among the many linear classification methods, support vector machines (SVMs) (see [6,7]) and distance-weighted discrimination (DWD) (see [8,9,10]) are two commonly used large-margin classification methods.
Owing to the recent advent of new technologies for data acquisition and storage, classification with high dimensional features, i.e., a large number of variables, has become a ubiquitous problem in both theoretical and applied scientific studies. Typically, only a small number of instances are available in such studies, a setting referred to as high-dimensional, low-sample size (HDLSS), as in [11]. In the HDLSS setting, a so-called "data-piling" phenomenon is observed in [8] for SVMs: the projections of many training instances onto the vector normal to the separating hyperplane are nearly identical, suggesting severe overfitting. DWD was originally proposed to overcome data-piling in the HDLSS setting. In binary classification problems, linear SVMs seek a hyperplane maximizing the smallest margin over all data points, while DWD seeks a hyperplane minimizing the sum of the inverse margins over all data points. Reference [8] suggests replacing the inverse margins by the q-th power of the inverse margins, yielding a generalized DWD method; see [12] for a detailed description. Formally, for a training data set of N observations with p-dimensional covariate vectors and binary class labels, binary generalized linear DWD seeks a proper separating hyperplane through the optimization problem (1), in which a and the slope vector are the intercept and slope parameters, respectively. A slack variable is introduced to ensure that each margin is non-negative, and a tuning constant controls the overlap between the two classes. Problem (1) can also be written in an equivalent loss-plus-penalty form (2) (e.g., [12]), which combines the generalized DWD loss on the margins with an L2 penalty on the slope. When q = 1, problem (1) becomes the standard DWD problem of [8], while problem (2) appears in [9,13].
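For reference, a standard way of writing the binary generalized DWD loss-plus-penalty form (2), following the formulation popularized in [12] (the notation below is illustrative and may differ in detail from the equations omitted here), is

$$
\min_{a,\,\boldsymbol{\beta}}\;\frac{1}{N}\sum_{i=1}^{N}\phi_q\bigl(y_i(a+\mathbf{x}_i^{\top}\boldsymbol{\beta})\bigr)+\lambda\lVert\boldsymbol{\beta}\rVert_2^2,
\qquad
\phi_q(u)=
\begin{cases}
1-u, & u\le \dfrac{q}{q+1},\\[6pt]
\dfrac{1}{u^{q}}\cdot\dfrac{q^{q}}{(q+1)^{q+1}}, & u> \dfrac{q}{q+1},
\end{cases}
$$

so that the loss is linear on the misclassified side and decays like an inverse power of the margin on the well-classified side.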
The binary classification problem (1) is well studied. However, in many applications, such as image classification [1], cancer diagnosis [2] and speech recognition [3], to name a few, problems with more than two categories are commonplace. To solve such multicategory problems with the DWD classifier, approaches based on either formulation (1) or (2) are used. One common strategy extends problem (1) to multiple classes by solving a series of binary problems in a one-versus-one (OVO) or one-versus-rest (OVR) fashion (e.g., [14]). Instead of reducing the multicategory problem to a sequence of binary ones, another strategy based on problem (1) considers all classes at once; as shown in [14], this approach generally works better than the OVO and OVR methods. Based on an extension of problem (2), [15] proposes a multicategory DWD written in a loss-plus-penalty form, given in (4), in which separate intercept and slope parameters are estimated for each category k. Although these methods can be applied to multicategory classification in the HDLSS setting, both problems (2) and (4) use the L2 penalty and therefore do not perform feature selection. As discussed in [16], taking all features into consideration does not work well for high dimensional classification, for two reasons. First, based on prior knowledge, often only a small number of variables are relevant to the classification problem, so a good high dimensional classifier should be able to select the important variables and discard redundant ones. Second, classifiers that use all available variables in high dimensional settings may have poor classification performance.
Much of the SVM literature has considered variable selection in high dimensional classification problems to improve performance (e.g., [17,18,19]). In the DWD literature, to the best of our knowledge, only [16] has considered variable selection and classification simultaneously. Wang and Zou [16] replaced the L2 penalty in problem (2) with an L1 penalty to improve interpretability through sparsity in binary classification. However, [16] makes selections based on the strengths of the input variables within individual classes and ignores the strengths of input variable groupings, thereby selecting more factors than necessary for each class. To overcome this weakness, in this paper we develop a multicategory generalized DWD method that performs variable selection and classification simultaneously. Our approach incorporates both sparsity and group structure information via the sparse group lasso penalty (see [20,21,22,23,24]).
Although DWD is well studied, it is less popular than the SVM for binary classification, arguably for computational and theoretical reasons. For an up-to-date list of works on DWD, mostly focused on the q = 1 case, see [14,15]. Theoretical asymptotic properties of large-margin classifiers in high dimensional settings were studied in [25], and [26] derived an expression for the asymptotic generalization error. In terms of computation, [8] solved the standard DWD problem (1) as a second-order cone programming (SOCP) problem using a primal-dual interior-point method, which is computationally expensive when N or p is large. To overcome this computational bottleneck, [12] proposed an approach based on a novel formulation of the primal DWD model in (1); however, this method does not scale to large data sets and requires further work. Lam et al. [27] designed a new algorithm for large-scale DWD problems based on convergent multi-block ADMM-type methods (see [28]). Wang and Zou [16] solved the lasso-penalized binary DWD problem by combining majorization-minimization and coordinate descent methods, since the lasso penalty does not directly permit an SOCP solution. In fact, solution identifiability for the generalized DWD problem with such a penalty requires additional constraints and remains an open research problem (see [8]). To the best of our knowledge, no existing work focuses on the computational aspects of lasso-penalized multicategory generalized DWD (MgDWD), and the same holds for sparse group lasso-penalized MgDWD.
The theoretical and computational contributions of this paper are as follows. First, we establish the uniqueness of the minimizer in the population form of the MgDWD problem. Second, we prove a non-asymptotic estimation error bound for the sparse group lasso-regularized MgDWD loss function in the ultra-high dimensional setting under mild regularity conditions. Third, we develop a fast, efficient algorithm able to solve the sparse group lasso-penalized MgDWD problem using proximal methods.
The rest of this paper is organized as follows. In Section 2.1, we introduce the MgDWD problem with the sparse group lasso penalty. In Section 2.2 and Section 2.3, we establish theoretical properties of the population classifier and of the regularized empirical loss. We propose a computational algorithm in Section 2.4. Section 3 illustrates the finite-sample performance of our method through simulation studies and a real data analysis. Proofs of the major theorems are given in Appendix A.
2. Methodology
2.1. Model Setup
We begin with some basic set-up and notation. Consider the multicategory classification problem for a random sample of N independent and identically distributed (i.i.d.) observations from some underlying distribution. Here, y is the categorical response, taking values in a set of K category labels, and the covariate vector lies in a p-dimensional space. We wish to obtain a proper separating hyperplane for each category, with category-specific intercept and slope parameters.
In this paper, we consider MgDWD with sparse group lasso regularization. That is, we estimate the classification boundary by solving the constrained optimization problem (5), in which the loss function is as defined in (3).
To approach this problem, we apply the concept of a "margin vector" to extend the definition of the (binary) margin to the multicategory case. For each observation, the margin vector collects the K category-wise discriminant values, and its components satisfy a sum-to-zero constraint; the class indicator vector encodes the observed label. The multicategory margin of a data point is then the component of the margin vector corresponding to its observed class. Therefore, the MgDWD loss can be rewritten as in (6).
Based on (6), Lemma 1 establishes the Fisher consistency of the MgDWD loss.
Lemma 1. Given the covariate vector, the minimizer of the conditional expectation of (6) exists and has an explicit form determined by the conditional class probabilities.
Consequently, this minimizer can be treated as an effective proxy for the conditional class probabilities and, for any new observation, a reasonable prediction of its label is the category whose component of the fitted margin vector is largest.
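A minimal sketch of this prediction rule in Python (the array names a_hat and B_hat are illustrative; a_hat is the length-K intercept vector and B_hat the p-by-K slope matrix, with labels indexed from zero here):

import numpy as np

def predict_label(x_new, a_hat, B_hat):
    """Assign the label whose fitted discriminant a_k + x^T beta_k is largest."""
    scores = a_hat + x_new @ B_hat   # length-K vector of category-wise discriminants
    return int(np.argmax(scores))    # predicted category (0-based index)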
Turning to the sparse group lasso (SGL) regularization in (5), the lasso penalty encourages an element-wise sparse estimator that selects the important variables for each individual category. Assuming that the parameters of different categories share the same information, we additionally use a group penalty to encourage a group-wise sparsity structure that removes covariates that are irrelevant across all categories. Specifically, collect the slope parameters into a p × K coefficient matrix whose k-th column is the slope vector for category label k and whose j-th row is the group coefficient vector of the j-th variable. If a variable is noise in the classification problem, or is not relevant to category label k, then the corresponding entry of the coefficient matrix should be shrunk to exactly zero. The SGL penalty in (5) can be written as a convex combination of the lasso and group lasso penalties on this coefficient matrix, as in (7), where λ is the overall scale of the penalty and τ tunes the balance between the element-wise and group-wise sparsity structures.
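A minimal numerical sketch of this penalty (a hypothetical helper; B is the p × K slope matrix, lam the overall scale λ and tau the mixing weight τ, and any additional scaling factors used in the paper, such as group-size weights, are omitted):

import numpy as np

def sgl_penalty(B, lam, tau):
    """Convex combination of the lasso and row-wise group lasso penalties on B."""
    lasso = np.abs(B).sum()                   # element-wise L1 term
    group = np.linalg.norm(B, axis=1).sum()   # L2 norm of each row (variable group), summed
    return lam * (tau * lasso + (1.0 - tau) * group)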
2.2. Population MgDWD
In this subsection, some basic results pertaining to unpenalized population MgDWD are given. These results are necessary for further theoretical analysis.
Denote the marginal probability mass function of y by its class probabilities, and denote the conditional probability density function of the covariate vector given each class label accordingly. Collect the intercept and slope coefficients of all K labels into a single parameter vector. The population version of the MgDWD problem in (6) is given in (8), where the parameter vector is the vectorization of the coefficient matrix and the covariates enter through a corresponding random vector. Denote the true parameter value as a minimizer of the population MgDWD problem over the set of sum-constrained parameter vectors, in which the sum-to-zero constraint across categories is written compactly using the Kronecker product ⊗.
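One plausible explicit form of the population loss in (8), assuming the margin-vector construction above and writing φ_q for the generalized DWD loss, π_k for the class probabilities and (a_k, β_k) for the category-k parameters (all of this notation is assumed here rather than taken from the omitted equations), is

$$
L(\boldsymbol{\theta})=\sum_{k=1}^{K}\pi_k\,\mathbb{E}\bigl[\phi_q\bigl(a_k+\mathbf{x}^{\top}\boldsymbol{\beta}_k\bigr)\,\big|\,y=k\bigr],
$$

that is, the expected generalized DWD loss of the margin component associated with the observed class.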
To facilitate our theoretical analysis, we first define the gradient vector and Hessian matrix of the population MgDWD loss function, and then introduce the regularity conditions needed to derive the theoretical properties of this problem. In what follows, a diagonal matrix may be constructed from a vector in the usual way, and ∘ and ⊕ denote the Hadamard product and the direct matrix sum, respectively. The gradient vector of the population MgDWD loss function (8) and its Hessian matrix are expressed in terms of the first and second derivatives of the generalized DWD loss function. The block structure of the Hessian implies a parallel relationship between the categories; the relationship between the blocks is reflected by the sum-to-zero constraint in the definition of the parameter set.
We assume the following regularity conditions.
(C1) The conditional densities of the covariate vector given each class label are continuous and have finite second moments.
(C2) A non-degeneracy condition on the covariate distribution holds for every class label; see Remark 1.
(C3) A positive-definiteness-type moment condition holds for every class label; see Remark 1.
Remark 1. Condition (C1) ensures that the population loss, its gradient and its Hessian are well defined and continuous in the parameters. For the theoretically optimal hyperplane, a zero slope vector leaves the covariates useless for classification; on the other hand, a zero slope vector paired with a non-zero intercept makes the hyperplane the empty set, which is similarly meaningless. Condition (C2) is proposed to avoid such degenerate cases, so that the covariate vector always contains information relevant to the classification problem. For bounded random variables, condition (C2) should be assumed with caution. Condition (C3) implies the positive definiteness of the Hessian matrix.
By convexity and the second-order Lagrange condition, the following theorem shows that the local minimizer of the population MgDWD problem exists and is unique.
Theorem 1. Under the regularity conditions (C1)-(C3), the true parameter is the unique minimizer of the population MgDWD loss, and the minimal loss value admits explicit lower and upper bounds.
The bounds in Theorem 1 show how q affects the loss function. The upper bound is a decreasing function of q, while in the lower bound the first term is an increasing function of q and the last term is a decreasing function of q. Consequently, for a given population, a larger q encourages the population MgDWD estimator to focus more on the regions that correspond to misclassifications; as a result, its behaviour becomes similar to that of the hinge loss as q grows. Setting q too small leads to an ineffective classifier because of the unreasonable penalty placed on the well-classified region. This variation of the lower bound with respect to q provides a necessary condition for the existence of an optimal q.
Remark 2. The explicit relationship between q and the minimal loss value is complicated. While it would be more desirable to prove that a greater value of q results in a smaller value of the loss function, there is no explicit formula for the optimal value in terms of q.
2.3. Estimator Consistency
Under the unpenalized framework presented in the previous subsection, all covariates contribute to the classification task for each category, a scenario that may lead to a classifier that overfits the training data set. In this subsection, we study the consistency of the estimator in (5) in ultra-high dimensional settings.
To achieve structural sparsity in the estimator, the regularization parameter λ in (7) must be large enough to dominate, with high probability, the gradient of the empirical MgDWD loss evaluated at the theoretical minimizer. On the other hand, λ should also be as small as possible to reduce the bias incurred by the SGL regularization term. Lemma 2 provides a suitable choice of λ under the following assumption.
(A1) The predictors are independent sub-Gaussian random vectors; that is, there exists a constant such that every one-dimensional projection of a predictor satisfies a sub-Gaussian tail bound governed by that constant. From here on, the largest eigenvalue of the predictors' population covariance matrix is also used in our bounds.
Lemma 2. Under condition (A1), the gradient of the empirical MgDWD loss evaluated at the true parameter value is bounded by an explicit quantity, with probability at least a level determined by constants given in the Appendix.
It is difficult to obtain a closed form for the conjugate of the SGL penalty; instead, we work with a regularized upper bound. Based on Lemma 2, we propose the theoretical tuning parameter value given in (9), which involves a given constant.
Before we can derive an error bound for the estimator in (5), we impose two additional assumptions.
(A2) The true coefficient matrix has a sparse structure, with element-wise and group-wise support sets of given cardinalities.
(A3) There exist positive constants that bound, from below, suitable restricted eigenvalues of the Hessian over the sparse support sets and their complements.
Under the choice of λ given in (9), we now establish the consistency of the estimator in (5).
Theorem 2. Suppose that conditions (A1)-(A3) hold. Then, with the tuning parameter value (9) used in (5), the estimation error is bounded by an explicit quantity with high probability.
Remark 3. The sub-Gaussian distribution assumption (A1) is common in high dimensional scenarios. It characterizes the tail behavior of a collection of random variables that includes Gaussian, Bernoulli and bounded variables as special cases. Assumption (A2) describes structural sparsity at two levels: the element-wise size is the size of the underlying generative model, and the group-wise size is the size of the signal covariate set. Both sizes are allowed to depend on the sample size N; as a result, the dimension p is also allowed to increase with N. Assumption (A3) guarantees that the relevant eigenvalues are positive in this sparse scenario.
Remark 4. In practice, the tuning parameters λ and τ in (7) are commonly chosen by M-fold cross validation; that is, we choose the pair with the highest average prediction accuracy over the M held-out sub-data sets.
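A schematic of this M-fold search in Python (fit and accuracy are user-supplied callables standing in for the SGL-MgDWD estimator and its evaluation; they are placeholders, not functions defined in this paper):

import numpy as np
from itertools import product

def cv_select(X, y, lambdas, taus, fit, accuracy, M=5):
    """Pick (lambda, tau) with the highest average validation accuracy over M folds."""
    folds = np.array_split(np.random.permutation(len(y)), M)
    best_pair, best_acc = None, -np.inf
    for lam, tau in product(lambdas, taus):
        accs = []
        for m in range(M):
            val = folds[m]
            trn = np.hstack([folds[i] for i in range(M) if i != m])
            model = fit(X[trn], y[trn], lam, tau)         # placeholder fitting routine
            accs.append(accuracy(model, X[val], y[val]))  # placeholder evaluation
        if np.mean(accs) > best_acc:
            best_pair, best_acc = (lam, tau), np.mean(accs)
    return best_pair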
2.4. Computational Algorithm
In this section, we propose an efficient algorithm to solve problem (5). Our approach uses the proximal algorithm (see [29]) for solving high dimensional regularization problems. In two main steps, this approach obtains a solution to the constrained optimization problem by applying the proximal operator to the solution of the unconstrained problem.
Since regularization is not needed for the intercept terms, they can be separated from the slope coefficients. The empirical counterpart of the MgDWD loss in (8) is written in terms of the intercept vector and the slope matrix. Several properties of this empirical loss function are given below.
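As a point of reference, a minimal numerical sketch of this empirical loss (assuming, as in the sketch of (2) above, that the generalized DWD loss φ_q is applied to the observed-class margin; all names are illustrative):

import numpy as np

def phi_q(u, q=1.0):
    """Generalized DWD loss: linear below q/(q+1), inverse-power tail above it (assumed form)."""
    u = np.asarray(u, dtype=float)
    thresh = q / (q + 1.0)
    tail = (q ** q / (q + 1.0) ** (q + 1.0)) / np.maximum(u, thresh) ** q
    return np.where(u <= thresh, 1.0 - u, tail)

def empirical_mgdwd_loss(X, y, a, B, q=1.0):
    """Average loss of the observed-class margins a_{y_i} + x_i^T beta_{y_i}; y holds 0-based labels."""
    margins = a[y] + np.sum(X * B[:, y].T, axis=1)
    return phi_q(margins, q).mean()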
Lemma 3. The empirical loss function has Lipschitz continuous partial derivatives. In particular, the partial derivative with respect to the intercepts is Lipschitz continuous with a constant involving the largest group sample size, and the partial derivative with respect to the k-th column of the slope matrix is Lipschitz continuous with a constant involving the observations belonging to the k-th group.
Hence, following the majorization-minimization scheme, we can majorize the empirical MgDWD loss by a quadratic function whose curvature is determined by the Lipschitz constants in Lemma 3. Instead of minimizing the empirical loss directly, we apply gradient descent to minimize this surrogate upper bound. The gradient descent updates are given in (10)-(12).
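A sketch of one such descent step in Python (grad_a and grad_B stand for the gradients of the empirical loss with respect to the intercepts and slopes, and L_a, L_B for the Lipschitz constants of Lemma 3; all names are illustrative):

def gradient_step(a, B, grad_a, grad_B, L_a, L_B):
    """One majorization-minimization step: move against the gradient with step sizes 1/L."""
    a_new = a - grad_a / L_a   # intercept update, no penalty applied
    B_new = B - grad_B / L_B   # slope update, before the proximal (penalty) step
    return a_new, B_new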
Next, we address the problem's constraints and regularization simultaneously by applying the proximal operator. For the intercepts, the proximal step is trivial, since they are not penalized. For the slope matrix, the proximal minimization problem can be expressed row by row, as in (13), which implies that the minimization can be carried out for the p groups in parallel. The following theorem provides the solution to (13).
Theorem 3. The constrained regularization problem (13) has a solution of the form given in (14), for some scalar determined by the constraint.
, the constrained regularization problem in Theorem 3 reduces to the constrained lasso problem with solution
. Combined with (
14), the proximal operator
, given by
can be introduced to realize the group sparsity of
.
For the standard lasso problem, the subgradient equation has a closed-form solution given by the soft-thresholding operator. However, under the sum-to-zero constraint, the naive solution is misleading: it satisfies the constraint but does not achieve shrinkage, let alone minimize the loss function. The difficulty lies in the intersection between the subdifferential set of the penalty and the constraint set; in this sense, the constrained solution might not have a closed form. Here, we consider using coordinate descent to solve the constrained lasso problem. For some fixed coordinate m, the sum-to-zero constraint allows the m-th coordinate to be expressed in terms of the remaining coordinates. Rewriting the objective function of the constrained lasso problem in a coordinate-wise form, we obtain (16). Next, Theorem 4 provides the solution to the optimization problem (16).
Theorem 4. The regularization problem (16) has a closed-form solution, given in (17).
By Theorem 4, given a fixed coordinate m, the coordinate-wise minimizer of any other coordinate can be expressed as a proximal operator. If m were kept fixed across iterations, the shrinkage of the m-th coordinate would only be reflected indirectly through the other coordinates. We therefore propose that m change with k in the coordinate-wise minimization process, so that every coordinate can be shrunk equally. We summarize our proposed algorithm in Algorithm 1.
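For reference, the coordinate-wise updates in Algorithm 1 below build on the usual soft-thresholding map, sketched here in Python (a generic sketch; the constrained update in (17) further adjusts for the fixed coordinate m):

import numpy as np

def soft_threshold(z, threshold):
    """Element-wise lasso shrinkage: sign(z) * max(|z| - threshold, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)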
Algorithm 1: Proximal gradient descent algorithm for SGL-MgDWD.
Input: the training data and the tuning parameters.
Initialization: starting values for the intercept and slope parameters.
1:  repeat
2:    Perform the gradient update according to (10) and (12).
3:    Perform the gradient update according to (11).
4:    Set the working variables for the inner loop.
5:    repeat
6:      for m = 1 to K do
7:        for k in the remaining coordinates do
8:          Compute the coordinate-wise quantities.
9:          Update the k-th coordinate according to (17) and recover the m-th coordinate from the sum-to-zero constraint.
10:       end for
11:     end for
12:   until convergence.
13:   Apply the proximal update according to (15).
14:   Set the updated values for the next outer iteration.
15: until the stopping condition is met.
Output: the estimated intercept and slope parameters.