Article

Communication-Efficient Distributed Learning for High-Dimensional Support Vector Machines

School of Statistics and Data Science, Nanjing Audit University, Nanjing 211085, China
*
Author to whom correspondence should be addressed.
Mathematics 2022, 10(7), 1029; https://doi.org/10.3390/math10071029
Submission received: 21 February 2022 / Revised: 20 March 2022 / Accepted: 22 March 2022 / Published: 23 March 2022
(This article belongs to the Special Issue Statistical Modeling for Analyzing Data with Complex Structures)

Abstract:
Distributed learning has received increasing attention in recent years and meets a special need of the big data era. For the support vector machine (SVM), a powerful binary classification tool, we propose a novel efficient distributed sparse learning algorithm, the communication-efficient surrogate likelihood support vector machine (CSLSVM), in high dimensions with convex or nonconvex penalties, based on the communication-efficient surrogate likelihood (CSL) framework. We extend the CSL framework to distributed SVMs without the need to smooth the hinge loss or the gradient of the loss. For the CSLSVM with the lasso penalty, we prove that its estimator achieves a near-oracle property matching that of $\ell_1$-penalized SVM estimators computed on the whole dataset. For the CSLSVM with the smoothly clipped absolute deviation (SCAD) penalty, we show that its estimator enjoys the oracle property, and we use local linear approximation (LLA) to solve the optimization problem. Furthermore, we show that the LLA algorithm is guaranteed to converge to the oracle estimator, even in our distributed framework and in the ultrahigh-dimensional setting, provided an appropriate initial estimator is available. The proposed approach is highly competitive with the centralized method within a few rounds of communication. Numerical experiments provide supportive evidence.

1. Introduction

The support vector machine (SVM), originally introduced by [1], has been a great success when applied to many classification problems. Owing to its high accuracy, flexibility and solid mathematical foundations in machine learning, it is one of the most popular binary classification tools. The motivation of an SVM is to find a maximum-margin hyperplane through a regularized functional optimization problem. In statistical machine learning, the penalized functional is the sum of the hinge loss and an $\ell_2$-norm regularization term. The statistical properties of the SVM have been studied in many works. In this work, we focus on a distributed penalized linear SVM for datasets with large sample sizes and large dimensions.
With the development of modern technology, the size of data has become incredibly large and, in some cases, the data cannot even be stored on a single machine. In real-world applications, many datasets are stored locally on individual servers and personal devices, such as mobile phones and computers. It is difficult to collect these local data onto a single machine due to communication costs and privacy preservation. Thus, new methods and theories in distributed learning are called for. Distributed learning has attracted increasing attention in recent years; for example, see Refs. [2,3] for M-estimation, Refs. [4,5,6] for quantile regression, Refs. [7,8] for nonparametric regression, Refs. [3,9] for confidence intervals, and so on. These works focus on the simple setting of data parallelism, under which the dataset is partitioned and distributed across m worker machines that can analyze data independently. Most methods suggest that, in each round of communication, each worker machine estimates the parameters of the model locally, and then communicates these local estimators to a master machine that averages them to form a global estimator. Although this divide-and-conquer approach is communication-efficient, it has some restrictions: to achieve the minimax rate of convergence, the number of worker machines cannot be too large and the sample on each worker machine should be large enough; these conditions are highly restrictive in practice. In addition, averaging can perform poorly if the estimator is nonlinear.
Not only can the size of data be exceedingly large, but many successful models are also heavily over-parameterized. These problems have been discussed widely. Zou [10] proposed an improved $\ell_1$-penalized SVM for simultaneous feature selection and classification, and showed that the hybrid SVM not only often improves classification accuracy, but also enjoys better feature-selection performance. Meinshausen and Buhlmann [11] studied the problem of variable selection in high-dimensional graphs, and explained that neighborhood selection with the lasso is a computationally attractive alternative to standard covariance selection for sparse high-dimensional graphs. Zhao and Yu [12] studied almost necessary and sufficient conditions for the lasso to select the true model when $p < n$ or $p \gg n$, where p is the dimension of the model parameters and n is the sample size. Meinshausen and Yu [13] introduced sparse representations for high-dimensional data and proved that the estimator remains consistent even when the lasso cannot recover the correct sparsity pattern. The adaptive lasso, proposed by Zou [14], is a new version of the lasso that enjoys oracle properties. In high-dimensional problems, the dimension p of the covariates can be larger than the sample size, but only a few covariates are relevant to the response. As a concrete example, in a microarray dataset containing more than 10,000 genes, only several genes will make a difference to the result. Recently, statistical inference for high-dimensional data has been investigated; readers can refer to [15,16] for details. In high-dimensional settings, a standard SVM can easily be affected by many redundant variables, so variable selection is important for high-dimensional SVMs. Fan and Fan [17] showed that using all features in classification can be as poor as "tossing a coin" due to the accumulation of noise in high-dimensional analysis.
Many works have been proposed to handle such problems. Bradley and Mangasarian [18], Peng et al. [19], Zhu et al. [20] and Wegkamp and Yuan [21] studied the $\ell_1$-penalized SVM; Fan and Li [22] proposed an approach for simultaneous variable selection and model estimation by using the smoothly clipped absolute deviation (SCAD) penalty or the minimax concave penalty (MCP), which are non-convex. Becker et al. [23], Park et al. [24] and Zhang et al. [25] considered the SCAD-penalized SVM. Lian and Fan [26] gave a divide-and-conquer debiased estimator for the SVM, but such simple averaging might result in high computational costs for the high-dimensional problem, although it can produce a debiased estimator under the lasso penalty. Jordan et al. [3] proposed the communication-efficient surrogate likelihood (CSL) framework for solving distributed statistical inference problems, which also works for high-dimensional penalized regression. As [27] stated, the CSL approach is different from distributed first-order optimization methods, in that it leverages both global first-order information and local higher-order information; yet, to the best of our knowledge, distributed inference for variable selection in high-dimensional SVMs has not been studied in the CSL framework.
In this paper, we propose communication-efficient distributed learning for support vector machines in high dimensions. Instead of using all the data to estimate the parameters, our method only needs to solve a regularized optimization problem on the first machine, based on the gradients collected from all worker machines. For the penalty function, we consider the convex $\ell_1$ penalty and the non-convex SCAD penalty in the high-dimensional SVM, which achieve variable selection and estimation simultaneously. For the distributed high-dimensional SVM, we do not need smoothness assumptions on the loss or on its gradient. We also provide theoretical guarantees for the proposed methods.
The remainder of this paper is organized as follows. In Section 2, we give the problem formulation. Communication-efficient distributed estimation for an SVM is presented in Section 3. In Section 4, we provide simulation studies and real data examples that demonstrate encouraging performance.

2. Problem Formulation

In this section, we set up our learning problem formally. We adopted empirical risk minimization to obtain the optimal model. We considered a distributed learning framework with m worker machines, in which the first machine was regarded as the central machine; it could aggregate information from the other $m-1$ worker machines. In addition, every machine had n samples, so the total sample size was $N = nm$. For a standard binary classification problem, we denoted $\mathcal{X}$ to be the input space and $\mathcal{Y} = \{-1, +1\}$ to be the output space. Random vectors $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ were drawn from an unknown joint distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$. Let the parameter vector be $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ and the feature vector be $x_i = (1, x_{i1}, \ldots, x_{ip})^T$. Suppose that training data points $\{(x_i, y_i)\}_{i=1}^N$ are available from $\mathcal{D}$. Let $\ell(\beta; (X, Y))$ be a loss function and
$\beta_0 = \arg\min_\beta \mathbb{E}\,\ell(\beta; (X, Y)),$
where $\beta_0 = (\beta_{00}, \beta_{01}, \ldots, \beta_{0p})^T$ is the true parameter. Let the i.i.d. (independent and identically distributed) samples $\{(x_i, y_i)\}_{i=1}^N$ be stored on m machines, and use $I_k$ to denote the indices of samples on the kth machine, with $|I_k| = n = N/m$ for all $k \in [m]$ and $I_j \cap I_k = \emptyset$ for $j \neq k$, $j, k \in [m]$. The empirical risk of the kth machine is defined by
$\hat{L}_k(\beta) = \frac{1}{|I_k|} \sum_{i \in I_k} \ell(\beta; (x_i, y_i)),$
the empirical risk based on all N samples is
$\hat{L}(\beta) = \frac{1}{m} \sum_{k=1}^{m} \hat{L}_k(\beta).$
We used the structural risk minimization strategy to learn $\beta_0$; the resulting estimator is defined by
$\hat{\beta} = \arg\min_\beta \big\{ \hat{L}(\beta) + g(\beta) \big\},$
where $g(\beta)$ is the penalty term, such as the $\ell_1$, $\ell_2$, SCAD or MCP penalty.
In distributed statistical learning, Jordan et al. [3] proposed a distributed estimator with statistical guarantee and communication efficiency. Given an appropriate initial estimator β ˜ , we have
$\tilde{L}(\beta) = \hat{L}_1(\beta) - \beta^T \big( \nabla\hat{L}_1(\tilde{\beta}) - \nabla\hat{L}(\tilde{\beta}) \big).$
By the above formula, we could introduce a communication-efficient distributed learning algorithm: the first worker machine broadcasts the initial estimator $\tilde{\beta} = \hat{\beta}_1$ to the remaining $m-1$ machines, and each machine computes the local gradient $\nabla\hat{L}_k(\tilde{\beta})$. Then, these local gradients are sent back to the first machine, where the first worker carries out an SVM by aggregating these local gradients. The communication-efficient estimator is given by
$\check{\beta} = \arg\min_\beta \big\{ \tilde{L}(\beta) + g(\beta) \big\},$
with the approximate loss $\tilde{L}(\beta)$. This distributed learning architecture reduces the total communication cost from $O((m-1)np)$ to $O((m-1)p)$, where p is the dimension of $\beta$.
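The round described above can be sketched in a few lines. The following minimal example is our own illustration, not the authors' code: for clarity it uses a smooth squared loss so the gradient is exact, whereas the SVM case of Section 3 replaces the gradient with a hinge subgradient; all variable names are ours.

```python
import numpy as np

# Sketch of one CSL communication round under a smooth squared loss
# (an assumption for illustration; the paper's case is the hinge loss).
rng = np.random.default_rng(0)
m, n, p = 4, 50, 3                       # machines, samples per machine, dimension
beta_true = np.array([1.0, -2.0, 0.5])
X = [rng.normal(size=(n, p)) for _ in range(m)]
y = [Xk @ beta_true + 0.1 * rng.normal(size=n) for Xk in X]

def grad(beta, Xk, yk):
    """Gradient of the local loss L_k(beta) = ||yk - Xk beta||^2 / (2n)."""
    return Xk.T @ (Xk @ beta - yk) / len(yk)

beta_init = np.linalg.lstsq(X[0], y[0], rcond=None)[0]   # pilot estimator on machine 1

# Each worker returns only a p-dimensional gradient: O((m-1)p) communication.
local_grads = [grad(beta_init, Xk, yk) for Xk, yk in zip(X, y)]
global_grad = np.mean(local_grads, axis=0)

def surrogate_loss(beta):
    """L~(beta) = L_1(beta) - beta^T (grad L_1(beta~) - grad L(beta~))."""
    return 0.5 * np.mean((y[0] - X[0] @ beta) ** 2) - beta @ (local_grads[0] - global_grad)

# Machine 1 minimizes the surrogate by plain gradient descent (penalty omitted here).
beta = beta_init.copy()
for _ in range(500):
    beta -= 0.1 * (grad(beta, X[0], y[0]) - (local_grads[0] - global_grad))
```

The surrogate needs only machine 1's raw data plus the averaged gradient, which is exactly what makes the scheme communication-efficient.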

3. Distributed Learning for an SVM

A standard non-separable SVM has the following form,
$\hat{L}(\beta) + g(\beta) = \frac{1}{N} \sum_{i=1}^{N} \big(1 - y_i x_i^T \beta\big)_+ + \frac{\lambda}{2} \|\beta_*\|_2^2,$
where $(x)_+ = \max(x, 0)$ defines the hinge loss function, which is piecewise linear and differentiable except at the point 0; $\beta_* = (\beta_1, \ldots, \beta_p)^T$ is the unknown p-dimensional parameter; $\lambda$ is the regularization parameter, which determines the importance of the penalty term. The true parameter $\beta_0$ is the minimizer of the following population loss function
$L(\beta) = \mathbb{E}\big(1 - Y X^T \beta\big)_+.$
We define
$S(\beta) = -\mathbb{E}\big[ I\big(1 - Y X^T \beta \geq 0\big)\, Y X \big],$
$H(\beta) = \mathbb{E}\big[ \delta\big(1 - Y X^T \beta\big)\, X X^T \big],$
where $I(\cdot)$ is the indicator function and $\delta(\cdot)$ is the Dirac delta function. $S(\beta)$ and $H(\beta)$ can be viewed as the gradient vector and Hessian matrix of $L(\beta)$, respectively.
The empirical loss function of the kth worker machine is $\hat{L}_k(\beta) = \frac{1}{|I_k|} \sum_{i \in I_k} (1 - y_i x_i^T \beta)_+$. Given an initial estimator $\tilde{\beta}$ of $\beta_0$, we use $\tilde{L}(\beta) := \hat{L}_1(\beta) - \beta^T \big( \nabla\hat{L}_1(\tilde{\beta}) - \nabla\hat{L}(\tilde{\beta}) \big)$ to replace $\hat{L}(\beta) = \frac{1}{N} \sum_{i=1}^{N} (1 - y_i x_i^T \beta)_+$, and then obtain the distributed estimator
$\check{\beta} = \arg\min_\beta \big\{ \tilde{L}(\beta) + g(\beta) \big\}. \quad (1)$
In this paper, we considered the l 1 penalty and SCAD penalty, respectively. The l 1 penalty is
$g(\beta) = \lambda \|\beta\|_1,$
and the SCAD penalty is
$g(\beta) = \sum_{j=1}^{p} p_\lambda(|\beta_j|),$
where
$p_\lambda(t) = \lambda |t|\, I(0 \leq |t| < \lambda) + \frac{a \lambda |t| - (t^2 + \lambda^2)/2}{a - 1}\, I(\lambda \leq |t| \leq a\lambda) + \frac{(a+1)\lambda^2}{2}\, I(|t| > a\lambda)$
for some a > 2 . Note that the SCAD penalty has the following properties:
Property 1: $p_\lambda(t)$ is symmetric and, for $t \geq 0$, is non-decreasing and concave, with $p_\lambda(0) = 0$.
Property 2: The derivative of $p_\lambda(t)$ is continuous on $(0, \infty)$: for some $a > 1$,
$\lim_{t \to 0+} p'_\lambda(t) = \lambda$, $p'_\lambda(t) \geq \lambda - t/a$ for $0 < t < a\lambda$, and $p'_\lambda(t) = 0$ for $t \geq a\lambda$.
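In code, the SCAD penalty and its derivative can be transcribed directly from the display above. The sketch below is our own (not the authors' implementation), using the conventional default $a = 3.7$; it also makes the continuity at the knots $t = \lambda$ and $t = a\lambda$ easy to check numerically.

```python
import numpy as np

# Direct transcription of the SCAD penalty p_lambda and its derivative (a > 2);
# an illustrative sketch, not the authors' implementation.
def scad(t, lam, a=3.7):
    t = np.abs(t)
    return np.where(
        t < lam,
        lam * t,                                                     # linear part near 0
        np.where(t <= a * lam,
                 (a * lam * t - (t ** 2 + lam ** 2) / 2) / (a - 1),  # quadratic part
                 (a + 1) * lam ** 2 / 2))                            # constant tail

def scad_deriv(t, lam, a=3.7):
    t = np.abs(t)
    return np.where(t < lam, lam,
                    np.where(t <= a * lam, (a * lam - t) / (a - 1), 0.0))
```

The derivative equals $\lambda$ near the origin and vanishes beyond $a\lambda$, which is exactly Property 2: large coefficients are left unpenalized, unlike with the $\ell_1$ penalty.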
In practice, we adopted the following CSL distributed learning for an SVM, which is summarized in Algorithm 1. However, our theories were based on the distributed estimator β ˇ in (1).
Algorithm 1: CSL distributed support vector machine (CSLSVM).
Input: the initial estimator $\tilde{\beta} = \hat{\beta}_1$ computed on the first machine, and the penalty $g(\beta)$.
Step 1: The first machine broadcasts $\tilde{\beta}$ to the other $m-1$ worker machines.
Step 2: Each machine k computes the local subgradient $\nabla\hat{L}_k(\tilde{\beta})$ and sends it back to the first machine.
Step 3: The first machine averages the local subgradients into $\nabla\hat{L}(\tilde{\beta})$, forms the surrogate loss $\tilde{L}(\beta)$, and solves $\check{\beta} = \arg\min_\beta \{\tilde{L}(\beta) + g(\beta)\}$.
Output: the distributed estimator $\check{\beta}$.

3.1. A Communication-Efficient Distributed SVM with Lasso Penalty

In this section, we establish the theoretical properties of the proposed estimator. Despite the generality and elegance of [3]'s method, that approach requires at least twice-differentiable (smooth) loss functions, such as the cross-entropy loss, and therefore cannot be applied directly to the hinge loss of an SVM. Recall the surrogate loss $\tilde{L}(\beta) := \hat{L}_1(\beta) - \beta^T \big( \nabla\hat{L}_1(\tilde{\beta}) - \nabla\hat{L}(\tilde{\beta}) \big)$. We only used first-order information, so as long as the gradient existed almost everywhere, it was usable in the aforementioned method. For an SVM, the hinge loss is differentiable except at the point 0. We could use the subgradient function
$\nabla\hat{L}_k(\beta) = -\frac{1}{|I_k|} \sum_{i \in I_k} I\big(1 - y_i x_i^T \beta \geq 0\big)\, y_i x_i$
in place of the gradient and, thus, the surrogate loss was directly usable.
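As a concrete illustration (our own sketch, with hypothetical variable names), the subgradient above is one line of vectorized code; each worker evaluates it at the broadcast initial estimator and returns only a p-vector.

```python
import numpy as np

# Averaged hinge-loss subgradient on one machine, matching the display above:
# -(1/|I_k|) * sum_i I(1 - y_i x_i^T beta >= 0) y_i x_i.
def hinge_subgrad(beta, X, y):
    active = (1.0 - y * (X @ beta)) >= 0.0   # points on or inside the margin
    return -((active * y) @ X) / len(y)
```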
Our main results were established under the following assumptions.
(A1) The conditional densities of $X^T \beta_0$ given $Y = 1$ and $Y = -1$ are denoted as f and g, respectively. It is assumed that f is uniformly bounded away from 0 and $\infty$ in a neighborhood of 1, and that g is uniformly bounded away from 0 and $\infty$ in a neighborhood of $-1$.
(A2) $\beta_0$ is a sparse and nonzero vector, and S denotes the support of $\beta_0$.
(A3) X is a sub-Gaussian random vector; that is, for any $\eta \in \mathbb{R}^p$,
$\mathbb{E}\exp\big(\eta^T X\big) \leq \exp\big(C \|\eta\|^2\big);$
it is assumed that each component $x_{ij}$ of the feature vector $x_i$ is a random variable with mean zero and variance 1.
Denote $\mathbb{X} = (x_1, x_2, \ldots, x_n)^T$ to be the feature design matrix and define the restricted eigenvalues as follows,
$\lambda_{\max} = \max_{\delta \in \mathbb{R}^{p+1}:\, \|\delta\|_0 \leq Cq} \frac{\delta^T \mathbb{X}^T \mathbb{X} \delta}{n \|\delta\|_2^2},$
and
$\lambda_{\min}\big(H(\beta_*); q\big) = \min_{\delta \in \Delta} \frac{\delta^T H(\beta_*) \delta}{\|\delta\|_2^2},$
where $\Delta$ is a restricted cone in $\mathbb{R}^{p+1}$,
$\Delta = \big\{ \gamma \in \mathbb{R}^{p+1} : \|\gamma_{S_+^c}\|_1 \leq 3 \|\gamma_{S_+}\|_1 \big\},$
with $S_+ = S \cup \{0\}$, $S \subseteq \{1, 2, \ldots, p\}$ and $|S| \leq q$.
(A4) $\lambda_{\max}$ and $\lambda_{\min}(H(\beta_*); q)$ are bounded away from zero.
(A5) The initial estimator $\tilde{\beta}$ is sparse and $\|\tilde{\beta} - \beta_0\|_1 \leq C q \sqrt{\frac{\log p}{n}}$.
Remark 1.
Under Assumption (A1), the Hessian matrix $H(\beta)$ is well-defined and continuous in β. (A1) ensures that we can obtain sufficient information around the non-differentiable point of the hinge loss; see more details in [24,28]. (A2) is a common assumption in high-dimensional problems, and we know $|S| \leq s$ for some $s \ll \min(p, n)$. In this paper, we use q to denote the number of nonzero entries in $\beta_0$. Using the sub-Gaussianity assumption (A3), we easily obtain $\max_i \|x_i\|_\infty \leq C \sqrt{\log p}$ with probability $1 - p^{-C}$. Assumption (A4) is similar to Lemma 2 in [19]; we use these quantities to control the bounds of the empirical loss function of the SVM and its expectation. (A5) is an assumption on the initial estimator for the iterative algorithm; Ref. [19] proved that the $L_1$-norm SVM coefficients satisfy such assumptions.
Our results were as follows.
Theorem 1.
Assume that (A1)–(A5) above hold and that $\lambda \geq 2 \|\nabla\tilde{L}(\beta_0)\|_\infty$; then, with probability at least $1 - n^{-C}$,
$\|\check{\beta} - \beta_0\| \leq C \Big( \lambda \sqrt{q} + \frac{q^{3/2} (\log p)^{5/2}}{n} + \frac{(q \log p)^{1/2}}{n^{1/2}} + \frac{q^{3/2} \log p}{n^{3/4}} \Big).$
Remark 2.
Although N does not appear in the formula, the condition $\lambda \geq 2 \|\nabla\tilde{L}(\beta_0)\|_\infty$ in fact implies that the convergence rate depends on N. If m is not too big, that is, n is not too small, we choose $\lambda \asymp \sqrt{\log p / N}$. If $q \leq n^{1/4} / (\log p)^{1/2}$ and $q \log p / n^{3/4} \leq \sqrt{\log p / N}$, the convergence rate is dominated by the first term $\lambda \sqrt{q}$. That is,
$\|\check{\beta} - \beta_0\| \leq C \sqrt{\frac{q \log p}{N}}.$
This was a near-oracle property for the l 1 penalized SVM estimator based on the entirety of the datasets [27].
Proof of Theorem  1.
We prove the result by the following three steps in line with the proof of [6].
Step 1. Let $\delta = \check{\beta} - \beta_0$. Since $\tilde{L}(\beta)$ is convex in β, we have
$\tilde{L}(\beta) - \tilde{L}(\beta_0) \geq \nabla\tilde{L}(\beta_0)^T (\beta - \beta_0)$
for all β. In terms of $\tilde{L}(\check{\beta}) + \lambda \|\check{\beta}\|_1 \leq \tilde{L}(\beta_0) + \lambda \|\beta_0\|_1$ and Hölder's inequality, we obtain
$-\|\nabla\tilde{L}(\beta_0)\|_\infty \|\delta\|_1 \leq \tilde{L}(\check{\beta}) - \tilde{L}(\beta_0) \leq \lambda \|\beta_0\|_1 - \lambda \|\beta_0 + \delta\|_1.$
Using $\lambda \geq 2 \|\nabla\tilde{L}(\beta_0)\|_\infty$, we obtain
$-\frac{\lambda}{2} \|\delta\|_1 \leq \lambda \|\beta_0\|_1 - \lambda \|\beta_0 + \delta\|_1.$
Writing $\|\delta\|_1 = \|\delta_S\|_1 + \|\delta_{S^c}\|_1$, $\|\beta_0\|_1 = \|\beta_{0S}\|_1$ and $\|\beta_0 + \delta\|_1 = \|\beta_{0S} + \delta_S\|_1 + \|\delta_{S^c}\|_1 \geq \|\beta_{0S}\|_1 - \|\delta_S\|_1 + \|\delta_{S^c}\|_1$, we obtain
$-\frac{\lambda}{2} \|\delta_S\|_1 - \frac{\lambda}{2} \|\delta_{S^c}\|_1 \leq \lambda \|\delta_S\|_1 - \lambda \|\delta_{S^c}\|_1.$
After rearranging, we have
$\|\delta_{S^c}\|_1 \leq 3 \|\delta_S\|_1.$
Step 2. We observe that
$\tilde{L}(\beta_0 + \delta) - \tilde{L}(\beta_0) - \delta^T \nabla\tilde{L}(\beta_0) = \hat{L}_1(\beta_0 + \delta) - \hat{L}_1(\beta_0) - \delta^T \nabla\hat{L}_1(\beta_0) = \frac{1}{n} \sum_{i=1}^{n} \big[ I\big(y_i x_i^T (\beta_0 + \delta) \leq 1\big) - I\big(y_i x_i^T \beta_0 \leq 1\big) \big] - \frac{1}{n} \sum_{i=1}^{n} \big[ y_i x_i^T (\beta_0 + \delta)\, I\big(y_i x_i^T (\beta_0 + \delta) \leq 1\big) - y_i x_i^T \beta_0\, I\big(y_i x_i^T \beta_0 \leq 1\big) \big] - \delta^T \nabla\hat{L}_1(\beta_0) =: Q_{1n} + Q_{2n} + Q_{3n}.$
With arguments basically the same as in the proof of Proposition 3, we can prove that, for any δ,
$\sup_{\|\delta\| \leq t} |Q_{1n} - \mathbb{E}(Q_{1n})| = \sup_{\|\delta\| \leq t} O\Big( \|x\|_\infty^{1/2} q^{1/4} \|\delta\|^{1/2} \sqrt{\frac{q \log p}{n}} + \|x\|_\infty \frac{q \log p}{n} \Big) = O_p\Big( \frac{(q \log p)^{3/4} t^{1/2}}{n^{1/2}} + \frac{q (\log p)^{3/2}}{n} \Big),$
$\sup_{\|\delta\| \leq t} |Q_{2n} - \mathbb{E}(Q_{2n})| = \sup_{\|\delta\| \leq t} O\Big( \|x\|_\infty^{3/2} q^{1/4} \|\beta_0\| \|\delta\|^{1/2} \sqrt{\frac{q \log p}{n}} + \|x\|_\infty \frac{q \log p}{n} \Big) = O_p\Big( \frac{q^{3/4} (\log p)^{5/4} t^{1/2}}{n^{1/2}} + \frac{q (\log p)^{3/2}}{n} \Big).$
By following Proposition 1, we have
$\sup_{\|\delta\| \leq t} |Q_{3n}| = O\big( \|\nabla\hat{L}_1(\beta_0)\|_\infty \|\delta\|_1 \big) = O_p\Big( \Big( \sqrt{\frac{\log p}{n}} + \Big(\frac{q \log p}{n}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{n} \Big) \sqrt{q}\, t \Big).$
Based on (2)–(5), we have, with probability $1 - n^{-C}$,
$\sup_{\|\delta\| \leq t} \Big| \hat{L}_1(\beta_0 + \delta) - \hat{L}_1(\beta_0) - \delta^T \nabla\hat{L}_1(\beta_0) - \mathbb{E}\hat{L}_1(\beta_0 + \delta) + \mathbb{E}\hat{L}_1(\beta_0) \Big| \leq C \Big( \frac{q^{3/4} (\log p)^{5/4} t^{1/2}}{n^{1/2}} + \Big( \sqrt{\frac{\log p}{n}} + \Big(\frac{q \log p}{n}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{n} \Big) \sqrt{q}\, t \Big).$
Step 3. Assume that $\|\check{\beta} - \beta_0\| > t$ for some $t > 0$. By Step 1, this implies
$\inf_{\|\delta\| = t,\ \|\delta_{S^c}\|_1 \leq 3 \|\delta_S\|_1} \tilde{L}(\beta_0 + \delta) - \tilde{L}(\beta_0) + \lambda \|\beta_0 + \delta\|_1 - \lambda \|\beta_0\|_1 \leq 0.$
By the triangle inequality, we have $\|\beta_0 + \delta\|_1 - \|\beta_0\|_1 \geq -\|\delta_S\|_1 \geq -\sqrt{q} \|\delta_S\| \geq -\sqrt{q}\, t$. Using the result from Step 2 and the lower bound for $\mathbb{E}\hat{L}_1(\beta_0 + \delta) - \mathbb{E}\hat{L}_1(\beta_0)$, similar to Lemma 4 of [29], we have
$\tilde{L}(\beta_0 + \delta) - \tilde{L}(\beta_0) \geq \mathbb{E}\hat{L}_1(\beta_0 + \delta) - \mathbb{E}\hat{L}_1(\beta_0) - \|\delta\|_1 \|\nabla\tilde{L}(\beta_0)\|_\infty - C \Big( \frac{q^{3/4} (\log p)^{5/4} t^{1/2}}{n^{1/2}} + \Big( \sqrt{\frac{\log p}{n}} + \Big(\frac{q \log p}{n}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{n} \Big) \sqrt{q}\, t \Big) \geq C (t^2 \wedge t) - C \lambda \sqrt{q}\, t - C \Big( \frac{q^{3/4} (\log p)^{5/4} t^{1/2}}{n^{1/2}} + \Big( \sqrt{\frac{\log p}{n}} + \Big(\frac{q \log p}{n}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{n} \Big) \sqrt{q}\, t \Big).$
Thus, we have
$C (t^2 \wedge t) - C \lambda \sqrt{q}\, t - C \Big( \frac{q^{3/4} (\log p)^{5/4} t^{1/2}}{n^{1/2}} + \Big( \sqrt{\frac{\log p}{n}} + \Big(\frac{q \log p}{n}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{n} \Big) \sqrt{q}\, t \Big) \leq 0.$
Some algebra shows that
$t \leq C \Big( \lambda \sqrt{q} + \frac{q^{3/2} (\log p)^{5/2}}{n} + \sqrt{\frac{q \log p}{n}} + \frac{q^{3/2} \log p}{n^{3/4}} \Big). \qquad \square$
Proposition 1.
Under the same assumptions as Theorem 1, with probability at least $1 - p^{-C}$,
$\|\nabla\tilde{L}(\beta_0)\|_\infty \leq C \Big( \sqrt{\frac{\log p}{N}} + \Big(\frac{q \log p}{n}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{n} \Big).$
Proof of Proposition 1.
By the definition of $\tilde{L}$, we have $\nabla\tilde{L}(\beta_0) = \nabla\hat{L}_1(\beta_0) - \nabla\hat{L}_1(\tilde{\beta}) + \nabla\hat{L}(\tilde{\beta})$, and thus
$\|\nabla\tilde{L}(\beta_0)\|_\infty \leq \big\| \nabla\hat{L}_1(\beta_0) - \nabla\hat{L}_1(\tilde{\beta}) - \nabla\hat{L}(\beta_0) + \nabla\hat{L}(\tilde{\beta}) \big\|_\infty + \big\| \nabla\hat{L}(\beta_0) \big\|_\infty.$
The last term above is the same as that dealt with in Lemma 1 in [19], which shows that, with probability at least $1 - p^{-C}$,
$\|\nabla\hat{L}(\beta_0)\|_\infty \leq C \sqrt{\log p / N}.$
In Proposition 2, we show that, with probability $1 - p^{-C}$,
$\Big\| \frac{1}{n} \sum_{i \in I_1} y_i x_i \big[ I\big(y_i x_i^T \tilde{\beta} \leq 1\big) - I\big(y_i x_i^T \beta_0 \leq 1\big) \big] - \mathbb{E}\Big\{ y x \big[ I\big(y x^T \tilde{\beta} \leq 1\big) - I\big(y x^T \beta_0 \leq 1\big) \big] \Big\} \Big\|_\infty \leq C \Big( \Big(\frac{q \log p}{n}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{n} \Big).$
Similarly,
$\Big\| \frac{1}{N} \sum_{i} y_i x_i \big[ I\big(y_i x_i^T \tilde{\beta} \leq 1\big) - I\big(y_i x_i^T \beta_0 \leq 1\big) \big] - \mathbb{E}\Big\{ y x \big[ I\big(y x^T \tilde{\beta} \leq 1\big) - I\big(y x^T \beta_0 \leq 1\big) \big] \Big\} \Big\|_\infty \leq C \Big( \Big(\frac{q \log p}{N}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{N} \Big) \leq C \Big( \Big(\frac{q \log p}{n}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{n} \Big).$
Thus, we have
$\big\| \nabla\hat{L}_1(\beta_0) - \nabla\hat{L}_1(\tilde{\beta}) - \nabla\hat{L}(\beta_0) + \nabla\hat{L}(\tilde{\beta}) \big\|_\infty \leq C \Big( \Big(\frac{q \log p}{n}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{n} \Big).$
Then we can obtain
$\|\nabla\tilde{L}(\beta_0)\|_\infty \leq C \Big( \sqrt{\frac{\log p}{N}} + \Big(\frac{q \log p}{n}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{n} \Big). \qquad \square$
Proposition 2.
Under the same assumptions as Theorem 1, with probability at least $1 - p^{-C}$, we have
$\Big\| \frac{1}{N} \sum_{i} y_i x_i \big[ I\big(y_i x_i^T \tilde{\beta} \leq 1\big) - I\big(y_i x_i^T \beta_0 \leq 1\big) \big] - \mathbb{E}\Big\{ y x \big[ I\big(y x^T \tilde{\beta} \leq 1\big) - I\big(y x^T \beta_0 \leq 1\big) \big] \Big\} \Big\|_\infty \leq C \Big( \Big(\frac{q \log p}{N}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{N} \Big).$
Proof of Proposition 2.
We take $\Omega = \big\{ \beta \in \mathbb{R}^p : \|\beta\|_0 \leq q,\ \|\beta - \beta_0\|_1 \leq C q \sqrt{\log p / N} \big\}$. Define the class of functions
$\mathcal{G}_j = \big\{ y x_j \big[ I\big(y x^T \beta \leq 1\big) - I\big(y x^T \beta_0 \leq 1\big) \big] : \beta \in \Omega \big\}$
with square-integrable envelope function $F(x, y) = |x_j|$. We decompose $\Omega$ as $\Omega = \bigcup_{T \subseteq \{1, \ldots, p\},\, |T| \leq Cq} \Omega(T)$ with $\Omega(T) = \{\beta : \mathrm{support}(\beta) \subseteq T\} \cap \Omega$. We also define $\mathcal{G}_j(T) = \big\{ y x_j \big[ I\big(y x^T \beta \leq 1\big) - I\big(y x^T \beta_0 \leq 1\big) \big] : \beta \in \Omega(T) \big\}$. By Lemma 2.6.15, Lemma 2.6.18 (vi) and (viii) in [30], for each fixed $T \subseteq \{1, \ldots, p\}$ with $|T| \leq Cq$, $\mathcal{G}_j(T)$ is a VC-subgraph class with index bounded by Cq, and by Theorem 2.6.7 of [30], we have
$N\big(\epsilon, \mathcal{G}_j(T), L_2(P_n)\big) \leq \Big( \frac{C \|F\|_{L_2(P_n)}}{\epsilon} \Big)^{Cq} \leq \Big( \frac{C}{\epsilon} \Big)^{Cq}.$
Since there are at most $\binom{p}{Cq} \leq (ep / Cq)^{Cq}$ different such T, we have
$N\big(\epsilon, \mathcal{G}_j, L_2(P_n)\big) \leq \Big( \frac{C}{\epsilon} \Big)^{Cq} \Big( \frac{ep}{Cq} \Big)^{Cq} \leq \Big( \frac{Cp}{\epsilon} \Big)^{Cq},$
and thus
$N\Big(\epsilon, \bigcup_{j=1}^{p} \mathcal{G}_j, L_2(P_n)\Big) \leq p \Big( \frac{Cp}{\epsilon} \Big)^{Cq}.$
Let $\sigma^2 = \sup_{f \in \mathcal{G}_j} P f^2$. Then, by Theorem 3.12 of [31], we have
$\mathbb{E} \|R_n\|_{\cup_j \mathcal{G}_j} \leq C \Big( \sigma \sqrt{\frac{q \log p}{N}} + \sqrt{\log p}\, \frac{q \log p}{N} \Big),$
where $\|R_n\|_{\cup_j \mathcal{G}_j} = \sup_{f \in \cup_j \mathcal{G}_j} \big| N^{-1} \sum_{i=1}^{N} \varepsilon_i f(x_i, y_i) \big|$ with $\varepsilon_i$ being i.i.d. Rademacher random variables. Using the symmetrization inequality, which states that $\mathbb{E} \|P_n - P\|_{\cup_j \mathcal{G}_j} \leq 2 \mathbb{E} \|R_n\|_{\cup_j \mathcal{G}_j}$, where $\|P_n - P\|_{\cup_j \mathcal{G}_j} = \sup_{f \in \cup_j \mathcal{G}_j} \big| N^{-1} \sum_i f(x_i, y_i) - \mathbb{E} f(x, y) \big|$, Talagrand's inequality (page 24 of [31]) gives
$P\Big( \|P_n - P\|_{\cup_j \mathcal{G}_j} \geq C \Big( \sigma \sqrt{\frac{q \log p}{N}} + \sqrt{\log p}\, \frac{q \log p}{N} + \sigma \sqrt{\frac{t}{N}} + \sqrt{\log p}\, \frac{t}{N} \Big) \Big) \leq e^{-t};$
that is, with probability at least $1 - p^{-C}$,
$\Big\| \frac{1}{N} \sum_{i} y_i x_i \big[ I\big(y_i x_i^T \tilde{\beta} \leq 1\big) - I\big(y_i x_i^T \beta_0 \leq 1\big) \big] - \mathbb{E}\Big\{ y x \big[ I\big(y x^T \tilde{\beta} \leq 1\big) - I\big(y x^T \beta_0 \leq 1\big) \big] \Big\} \Big\|_\infty \leq C \Big( \sigma \sqrt{\frac{q \log p}{N}} + \sqrt{\log p}\, \frac{q \log p}{N} \Big).$
Finally, we need to determine the size of $\sigma^2$. For $\beta \in \Omega$, we have that
$\mathbb{E}\Big[ \big( I\big(x^T \beta \leq 1\big) - I\big(x^T \beta_0 \leq 1\big) \big)^2 \,\Big|\, y = 1 \Big] \leq P\big( x^T \beta \leq 1,\ x^T \beta_0 > 1 \,\big|\, y = 1 \big) + P\big( x^T \beta > 1,\ x^T \beta_0 \leq 1 \,\big|\, y = 1 \big) \leq C \big| x^T (\beta - \beta_0) \big| \leq C \|x\|_\infty \|\beta - \beta_0\|_1 \leq C q \log p / \sqrt{N}.$
Thus, with probability at least $1 - p^{-C}$,
$\Big\| \frac{1}{N} \sum_{i} y_i x_i \big[ I\big(y_i x_i^T \tilde{\beta} \leq 1\big) - I\big(y_i x_i^T \beta_0 \leq 1\big) \big] - \mathbb{E}\Big\{ y x \big[ I\big(y x^T \tilde{\beta} \leq 1\big) - I\big(y x^T \beta_0 \leq 1\big) \big] \Big\} \Big\|_\infty \leq C \Big( \Big(\frac{q \log p}{N}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{N} \Big). \qquad \square$
Proposition 3.
Under the same assumptions as Theorem 1, with probability at least $1 - p^{-C}$, we have
$\Big\| \frac{1}{N} \sum_{i} y_i x_i \big[ x_i^T \beta\, I\big(y_i x_i^T \beta \leq 1\big) - x_i^T \beta_0\, I\big(y_i x_i^T \beta_0 \leq 1\big) \big] - \mathbb{E}\Big\{ y x \big[ x^T \beta\, I\big(y x^T \beta \leq 1\big) - x^T \beta_0\, I\big(y x^T \beta_0 \leq 1\big) \big] \Big\} \Big\|_\infty \leq C \Big( \frac{q^{5/4} (\log p)^{3/2}}{n^{3/4}} + \frac{q^2 (\log p)^{3/2}}{n} \Big).$
Proof of Proposition 3.
The proof is similar to the proof of Proposition 2. We take $\Omega = \big\{ \beta \in \mathbb{R}^p : \|\beta\|_0 \leq q,\ \|\beta - \beta_0\|_1 \leq C q \sqrt{\log p / N} \big\}$. Define the class of functions
$\mathcal{G}_j = \big\{ y x_j \big[ x^T \beta\, I\big(y x^T \beta \leq 1\big) - x^T \beta_0\, I\big(y x^T \beta_0 \leq 1\big) \big] : \beta \in \Omega \big\}$
with square-integrable envelope function $F(x, y) = C |x_j|$. With probability at least $1 - p^{-C}$, we have
$\Big\| \frac{1}{N} \sum_{i} y_i x_i \big[ x_i^T \beta\, I\big(y_i x_i^T \beta \leq 1\big) - x_i^T \beta_0\, I\big(y_i x_i^T \beta_0 \leq 1\big) \big] - \mathbb{E}\Big\{ y x \big[ x^T \beta\, I\big(y x^T \beta \leq 1\big) - x^T \beta_0\, I\big(y x^T \beta_0 \leq 1\big) \big] \Big\} \Big\|_\infty \leq C \Big( \sigma \sqrt{\frac{q \log p}{N}} + \sqrt{\log p}\, \frac{q \log p}{N} \Big),$
where $\sigma^2 = \sup_{f \in \mathcal{G}_j} P f^2$. Next, we need to determine the order of $\sigma^2$. For $\beta \in \Omega$, using the basic inequality $2ab \leq a^2 + b^2$, we have that
$\mathbb{E}\Big[ \big( x^T \beta\, I\big(x^T \beta \leq 1\big) - x^T \beta_0\, I\big(x^T \beta_0 \leq 1\big) \big)^2 \,\Big|\, y = 1, x \Big] \leq 2 \mathbb{E}\Big[ (x^T \beta_0)^2 \big( I\big(x^T \beta \leq 1\big) - I\big(x^T \beta_0 \leq 1\big) \big)^2 \,\Big|\, x \Big] + 2 \mathbb{E}\Big[ \big( x^T (\beta - \beta_0) \big)^2 I\big(x^T \beta \leq 1\big) \,\Big|\, x \Big] \leq 2 \|x\|_\infty^2 \|\beta_0\|_1^2\, C \big| x^T (\beta - \beta_0) \big| + 2 \big| x^T (\beta - \beta_0) \big|^2 \leq 2 \sqrt{q}\, \|\beta_0\|_1^2 \|x\|_\infty^3 \|\beta - \beta_0\| + 2 q \|x\|_\infty^2 \|\beta - \beta_0\|^2 = O_p\Big( \frac{q^{3/2} (\log p)^2}{\sqrt{n}} + \frac{q^3 (\log p)^2}{n} \Big).$
Therefore, $\sigma^2 = O\Big( \frac{q^{3/2} (\log p)^2}{\sqrt{n}} + \frac{q^3 (\log p)^2}{n} \Big)$. Thus, we complete the proof of Proposition 3. $\square$

3.2. A Communication-Efficient SVM with SCAD Penalty

In this section, we further discuss the advantages of the distributed non-convex penalized SVM in ultra-high dimensions. Similarly, the oracle property of the distributed non-convex penalized SVM coefficients is investigated.
Our main results were established under the following assumptions.
(C1) The densities of X given $Y = 1$ and $Y = -1$ are continuous and have common support in $\mathbb{R}^q$.
(C2) The densities of X given $Y = 1$ and $Y = -1$ have finite second moments.
(C3) The true model dimension $q_N = O(N^{c_1})$ for some $0 \leq c_1 < \frac{1}{2}$.
(C4) $\lambda_{\max}\big(N^{-1} X_A^T X_A\big) \leq M_1$ for a constant $M_1 > 0$, where $\lambda_{\max}$ denotes the largest eigenvalue and $X_A$ consists of the first $q_N + 1$ columns of the design matrix.
(C5) $\lambda_{\min}\big(H(\beta_{01})\big) \geq M_2$ for some constant $M_2 > 0$, where $\lambda_{\min}$ denotes the smallest eigenvalue.
(C6) There exist constants $M_3 > 0$ and $2 c_1 < c_2 \leq 1$ such that $N^{(1 - c_2)/2} \min_{1 \leq j \leq q_N} |\beta_{0j}| \geq M_3$.
(C7) f is uniformly bounded away from 0 and $\infty$ in a neighborhood of 1, and g is uniformly bounded away from 0 and $\infty$ in a neighborhood of $-1$, where f and g are the conditional densities of $X^T \beta_{01}$ given $Y = 1$ and $Y = -1$, respectively.
Remark 3.
Assumptions (C1)–(C2) and (C4)–(C5) are similar to the assumptions in Section 3.1, and have been used by [25]. Assumption (C3) controls the divergence rate of the number of nonzero coefficients, which cannot be faster than $\sqrt{N}$; see also Remark 2. Assumption (C6) simply requires that the signals do not decay too quickly, which implies that the relevant signals are not too small, so that they can be identified; this is common in the literature on high-dimensional problems. Assumption (C7) holds trivially for the unbounded support of the conditional distribution of $X_A$ given Y. See Remark 1 in [25].
First, we introduce the oracle estimator $\dot{\beta} = (\hat{\beta}_1^T, 0^T)^T$, where $\hat{\beta}_1$ is estimated using only the covariates associated with the true model, and $\|\hat{\beta}_1 - \beta_{01}\| = O_p\big(\sqrt{q_N / N}\big)$ as $N \to \infty$ (based on the whole dataset) in [25].
With a non-convex penalty, there might be multiple local minima. We use $B_N(\lambda)$ to denote the set of local minima. The non-convex problem can be written as the difference of two convex functions, for which a sufficient local optimality condition is available.
Let
$f(\beta) = \hat{L}_1(\beta) - \beta^T \big( \nabla\hat{L}_1(\tilde{\beta}) - \nabla\hat{L}(\tilde{\beta}) \big) + \sum_{j=1}^{p} p_\lambda(|\beta_j|).$
Although f ( β ) is non-convex, we can write it as
f ( β ) = g ( β ) h ( β ) ,
where
$g(\beta) = n^{-1} \sum_{i=1}^{n} \big(1 - y_i x_i^T \beta\big)_+ + \lambda \sum_{j=1}^{p} |\beta_j| - \beta^T \nabla\hat{L}_1(\tilde{\beta})$
and
$h(\beta) = \lambda \sum_{j=1}^{p} |\beta_j| - \sum_{j=1}^{p} p_\lambda(|\beta_j|) - \beta^T \nabla\hat{L}(\tilde{\beta}).$
Obviously, h ( β ) and g ( β ) are convex.
To present our main results, we need a sufficient local optimal condition based on subgradient estimation as described below.
Lemma 1.
(Sufficient local optimality condition). If there is a neighborhood U around the point $x^*$ such that $\partial h(x) \cap \partial g(x^*) \neq \emptyset$ for all $x \in U \cap \mathrm{dom}(g)$, then $x^*$ is a local minimizer of $g(x) - h(x)$.
Lemma 1 has been stated as Corollary 1 in [32]. The main results are summarized in the following theorem.
Theorem 2.
Assume that assumptions (C1)–(C7) hold; then the oracle estimator satisfies
$P\big( \dot{\beta} \in B_N(\lambda) \big) \to 1$
as $N \to \infty$, when $\lambda = o\big(N^{-(1 - c_2)/2}\big)$ and $\sqrt{q \log p}\, \log(N)\, N^{-1/2} = o(\lambda)$.
Remark 4.
From Theorem 2, we can see that the oracle property holds when taking $\lambda = N^{-1/2 + \delta}$ for some $c_1 < \delta < c_2 / 2$, even for $p = o\big( \exp(N^{(\delta - c_1)/2}) \big)$. So, the local oracle property holds for the non-convex distributed penalized SVM even when the number of features, p, grows exponentially with the sample size, N, of the whole dataset.
Proof of Theorem 2.
We sketch our proof as follows:
Step 1. From Theorem 1 in [25], we obtain the following properties of $s_j(\dot{\beta})$ and $\hat{\beta}_j$: with probability approaching 1,
$s_j(\dot{\beta}) = 0, \quad j = 0, 1, \ldots, q,$
$|\hat{\beta}_j| \geq \big(a + \tfrac{1}{2}\big) \lambda, \quad j = 1, \ldots, q,$
$|s_j(\dot{\beta})| \leq \lambda, \quad \hat{\beta}_j = 0, \quad j = q + 1, \ldots, p.$
Step 2. By Proposition 1, we have, with probability at least $1 - p^{-C}$,
$\big\| \nabla\hat{L}_1(\tilde{\beta}) - \nabla\hat{L}(\tilde{\beta}) \big\|_\infty \leq C \Big( \sqrt{\frac{\log p}{N}} + \sqrt{\frac{\log p}{n}} + \Big(\frac{q \log p}{n}\Big)^{3/4} + \frac{q (\log p)^{3/2}}{n} \Big),$
so that, when $n \to \infty$, we obtain $P\big( \big\| \nabla\hat{L}_1(\tilde{\beta}) - \nabla\hat{L}(\tilde{\beta}) \big\|_\infty < \delta \big) \to 1$ for any $\delta > 0$.
Step 3. Let
$G = \big\{ \xi = (\xi_0, \ldots, \xi_p) \big\},$
where
$\xi_0 = -\big[\nabla\hat{L}_1(\tilde{\beta})\big]_0,$
$\xi_j = \lambda\, \mathrm{sgn}(\dot{\beta}_j) - \big[\nabla\hat{L}_1(\tilde{\beta})\big]_j, \quad j = 1, \ldots, q,$
$\xi_j = s_j(\dot{\beta}) + \lambda l_j - \big[\nabla\hat{L}_1(\tilde{\beta})\big]_j, \quad j = q + 1, \ldots, p,$
$l_j \in [-1, 1], \quad j = q + 1, \ldots, p.$
By Step 1, we obtain $P\{G \subseteq \partial g(\dot{\beta})\} \to 1$. Then we show that there exists $\xi^* \in G$ such that $P\big( \xi_j^* = \partial h(\beta) / \partial \beta_j \big) \to 1$ as $n \to \infty$, for any β in the ball in $\mathbb{R}^{p+1}$ with center $\dot{\beta}$ and radius $\lambda / 2$.
Since $\partial h(\beta) / \partial \beta_0 = -\big[\nabla\hat{L}(\tilde{\beta})\big]_0$, by Step 2 we have $P\big( \xi_0^* = \partial h(\beta) / \partial \beta_0 \big) \to 1$.
For $j = 1, \ldots, q$, we have $\min_{1 \leq j \leq q} |\beta_j| \geq \min_{1 \leq j \leq q} |\hat{\beta}_j| - \max_{1 \leq j \leq q} |\hat{\beta}_j - \beta_j| \geq \big(a + \tfrac{1}{2}\big)\lambda - \lambda/2 = a\lambda$ with probability tending to 1 by Step 1. Therefore, by Property 2 of the class of penalties, $P\big( \partial h(\beta) / \partial \beta_j = \lambda\, \mathrm{sgn}(\beta_j) - [\nabla\hat{L}(\tilde{\beta})]_j \big) \to 1$ for $j = 1, \ldots, q$. For sufficiently large n, $\mathrm{sgn}(\beta_j) = \mathrm{sgn}(\dot{\beta}_j)$ and, by Step 2, $[\nabla\hat{L}(\tilde{\beta})]_j$ is arbitrarily close to $[\nabla\hat{L}_1(\tilde{\beta})]_j$. Thus, we have $P\big( \xi_j^* = \partial h(\beta) / \partial \beta_j \big) \to 1$ as $n \to \infty$ for $j = 1, \ldots, q$.
For $j = q + 1, \ldots, p$, we have $P\big( |\beta_j| \leq |\hat{\beta}_j| + |\beta_j - \hat{\beta}_j| \leq \lambda \big) \to 1$ by Step 1, so the penalty part of $\partial h(\beta) / \partial \beta_j$ vanishes for SCAD, i.e., $P\big( \partial h(\beta) / \partial \beta_j = -[\nabla\hat{L}(\tilde{\beta})]_j \big) \to 1$. By Step 1, $P\big( |s_j(\dot{\beta})| \leq \lambda \big) \to 1$ for $j = q + 1, \ldots, p$. Hence, we can always find $l_j \in [-1, 1]$ such that $P\big( \xi_j^* = s_j(\dot{\beta}) + \lambda l_j - [\nabla\hat{L}_1(\tilde{\beta})]_j = \partial h(\beta) / \partial \beta_j \big) \to 1$ for $j = q + 1, \ldots, p$. This completes the proof. $\square$
In this paper, we did not need to assume that the solution of the minimization problem was unique. With numerical algorithms that solve the non-convex penalized SCADSVM, we could identify the oracle estimator. Ref. [33] introduced the local linear approximation (LLA) algorithm to obtain sparse estimators in non-convex penalized likelihood models. We applied the LLA algorithm to our SCADSVM approach; we now describe the procedure.
Let $\tilde{\beta}^{(0)} = \big( \tilde{\beta}_0^{(0)}, \ldots, \tilde{\beta}_p^{(0)} \big)^T$. We update $\tilde{\beta}^{(t)}$ by solving
$\min_\beta\ \hat{L}_1(\beta) - \beta^T \big( \nabla\hat{L}_1(\tilde{\beta}^{(t-1)}) - \nabla\hat{L}(\tilde{\beta}^{(t-1)}) \big) + \sum_{j=1}^{p} p'_\lambda\big( |\tilde{\beta}_j^{(t-1)}| \big) |\beta_j|.$
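Each LLA pass therefore reduces to a weighted lasso problem with weights $p'_\lambda(|\tilde{\beta}_j^{(t-1)}|)$. The sketch below is our own illustration, not the paper's code: it solves that convex subproblem by proximal gradient steps with soft-thresholding, takes the surrogate gradient as a black-box callable, and uses a toy quadratic surrogate (an assumption for the usage check); all names are ours.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """SCAD derivative p'_lambda (repeated here so the sketch is self-contained)."""
    t = np.abs(t)
    return np.where(t < lam, lam,
                    np.where(t <= a * lam, (a * lam - t) / (a - 1), 0.0))

def lla_step(beta_prev, surrogate_grad, lam, step=0.05, iters=300, a=3.7):
    """One LLA pass: minimize L~(beta) + sum_j w_j |beta_j| with fixed
    weights w_j = p'_lam(|beta_prev_j|), via proximal gradient descent."""
    w = scad_deriv(beta_prev, lam, a)              # LLA weights, frozen this pass
    beta = beta_prev.copy()
    for _ in range(iters):
        z = beta - step * surrogate_grad(beta)     # (sub)gradient step on L~
        beta = np.sign(z) * np.maximum(np.abs(z) - step * w, 0.0)  # soft-threshold
    return beta

# Usage with a toy quadratic surrogate L~(beta) = ||beta - b||^2 / 2:
b = np.array([2.0, 0.05, 0.0])
beta1 = lla_step(b.copy(), lambda t: t - b, lam=0.5)
```

Large pilot coefficients get weight 0 and are left untouched, while small ones keep the full weight $\lambda$ and are thresholded to zero, which is how an LLA pass can land exactly on the oracle estimator.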
Consider the following events:
(a) $F_{n1} = \big\{ |\tilde{\beta}_j^{(0)} - \beta_{0j}| > \lambda \text{ for some } 1 \leq j \leq p \big\};$
(b) $F_{n2} = \big\{ |\beta_{0j}| < (a + 1) \lambda \text{ for some } 1 \leq j \leq q \big\};$
(c) $F_{n3} = \big\{ \text{for all subgradients } s(\dot{\beta}),\ |s_j(\dot{\beta})| > (1 - 1/a) \lambda \text{ for some } q + 1 \leq j \leq p, \text{ or } s_j(\dot{\beta}) \neq 0 \text{ for some } 0 \leq j \leq q \big\};$
(d) $F_{n4} = \big\{ |\hat{\beta}_j| < a \lambda \text{ for some } 1 \leq j \leq q \big\};$
(e) $F_{n5} = \big\{ \big| [\nabla\hat{L}_1(\tilde{\beta})]_j - [\nabla\hat{L}(\tilde{\beta})]_j \big| > \delta \text{ for some } 0 \leq j \leq p \big\}.$
The first four events are similar to those in [25]. Denote $P_{ni} = P(F_{ni})$; then we have the following Theorem 3.
Theorem 3.
Using the LLA algorithm initialized at $\tilde{\beta}^{(0)}$, we obtain the oracle estimator after two iterations with probability at least $1 - P_{n1} - P_{n2} - P_{n3} - P_{n4} - P_{n5}$.
Remark 5.
Theorem 3 gives a non-asymptotic lower bound on the probability that the oracle estimator is obtained by the LLA algorithm. That is, the LLA algorithm can identify the oracle estimator in two iterations.
Proof of Theorem 3.
Assume that none of the events $F_{ni}$, $i = 1, \ldots, 5$, is true; this holds with probability at least $1 - P_{n1} - P_{n2} - P_{n3} - P_{n4} - P_{n5}$. Then we have
$|\tilde{\beta}_j^{(0)}| = |\tilde{\beta}_j^{(0)} - \beta_{0j}| \leq \lambda, \quad q + 1 \leq j \leq p; \qquad |\tilde{\beta}_j^{(0)}| \geq |\beta_{0j}| - |\tilde{\beta}_j^{(0)} - \beta_{0j}| \geq a \lambda, \quad 1 \leq j \leq q.$
By the properties of the class of non-convex penalties, we have $p'_\lambda\big( |\tilde{\beta}_j^{(0)}| \big) = 0$ for $1 \leq j \leq q$. Therefore, the next LLA iterate $\tilde{\beta}^{(1)}$ is the solution to the convex optimization problem
$\tilde{\beta}^{(1)} = \arg\min_\beta\ \hat{L}_1(\beta) - \beta^T \big( \nabla\hat{L}_1(\tilde{\beta}^{(0)}) - \nabla\hat{L}(\tilde{\beta}^{(0)}) \big) + \sum_{j=q+1}^{p} p'_\lambda\big( |\tilde{\beta}_j^{(0)}| \big) |\beta_j|.$
By the fact that $F_{n3}$ is not true, there are subgradients of the oracle estimator, $s(\dot{\beta})$, such that $s_j(\dot{\beta}) = 0$ for $0 \leq j \leq q$ and $|s_j(\dot{\beta})| \leq (1 - 1/a) \lambda$ for $q + 1 \leq j \leq p$. By the definition of a subgradient, we have
$\hat{L}_1(\beta) \geq \hat{L}_1(\dot{\beta}) + \sum_{0 \leq j \leq p} s_j(\dot{\beta}) \big( \beta_j - \hat{\beta}_j \big) = \hat{L}_1(\dot{\beta}) + \sum_{q+1 \leq j \leq p} s_j(\dot{\beta}) \beta_j.$
Then, we have, for any β,
$\hat{L}_1(\beta) - \beta^T \big( \nabla\hat{L}_1(\tilde{\beta}^{(0)}) - \nabla\hat{L}(\tilde{\beta}^{(0)}) \big) + \sum_{j=q+1}^{p} p'_\lambda\big( |\tilde{\beta}_j^{(0)}| \big) |\beta_j| - \Big[ \hat{L}_1(\dot{\beta}) - \dot{\beta}^T \big( \nabla\hat{L}_1(\tilde{\beta}^{(0)}) - \nabla\hat{L}(\tilde{\beta}^{(0)}) \big) + \sum_{j=q+1}^{p} p'_\lambda\big( |\tilde{\beta}_j^{(0)}| \big) |\hat{\beta}_j| \Big] \geq \sum_{q+1 \leq j \leq p} \big[ p'_\lambda\big( |\tilde{\beta}_j^{(0)}| \big) - s_j(\dot{\beta})\, \mathrm{sgn}(\beta_j) \big] |\beta_j| - (\beta - \dot{\beta})^T \big( \nabla\hat{L}_1(\tilde{\beta}^{(0)}) - \nabla\hat{L}(\tilde{\beta}^{(0)}) \big) \geq \sum_{q+1 \leq j \leq p} \big[ (1 - 1/a) \lambda - |s_j(\dot{\beta})| \big] |\beta_j| - (\beta - \dot{\beta})^T \big( \nabla\hat{L}_1(\tilde{\beta}^{(0)}) - \nabla\hat{L}(\tilde{\beta}^{(0)}) \big) \geq 0.$
So we can obtain $\tilde{\beta}^{(1)} = \dot{\beta}$. This proves that the LLA algorithm finds the oracle estimator after one iteration.
If $F_{n4}$ is not true, one obtains $|\hat{\beta}_j| \geq a \lambda$ for all $1 \leq j \leq q$. So we have $p'_\lambda\big( |\hat{\beta}_j| \big) = 0$ for all $1 \leq j \leq q$, and $p'_\lambda\big( |\hat{\beta}_j| \big) = p'_\lambda(0) = \lambda$ for all $q + 1 \leq j \leq p$, by Property 2 of the class of penalties. At iteration 1, when the LLA algorithm has found $\dot{\beta}$, the solution to the next LLA iteration, $\tilde{\beta}^{(2)}$, is the minimizer of the convex optimization problem
$$
\tilde{\beta}^{(2)} = \arg\min_{\beta} \Big\{ \hat{L}_1(\beta) - \beta^T \big( \nabla \hat{L}_1(\tilde{\beta}^{(1)}) - \nabla \hat{L}(\tilde{\beta}^{(1)}) \big) + \sum_{j=q+1}^{p} \lambda |\beta_j| \Big\}.
$$
Then we have for any β
$$
\begin{aligned}
& \hat{L}_1(\beta) - \beta^T \big( \nabla \hat{L}_1(\tilde{\beta}^{(1)}) - \nabla \hat{L}(\tilde{\beta}^{(1)}) \big) + \sum_{q+1 \le j \le p} \lambda |\beta_j| \\
&\qquad - \Big[ \hat{L}_1(\dot{\beta}) - \dot{\beta}^T \big( \nabla \hat{L}_1(\tilde{\beta}^{(1)}) - \nabla \hat{L}(\tilde{\beta}^{(1)}) \big) + \sum_{q+1 \le j \le p} \lambda |\dot{\beta}_j| \Big] \\
&\quad \ge \sum_{q+1 \le j \le p} \big[ \lambda - s_j(\dot{\beta}) \, \mathrm{sgn}(\beta_j) \big] |\beta_j| - (\beta - \dot{\beta})^T \big( \nabla \hat{L}_1(\tilde{\beta}^{(1)}) - \nabla \hat{L}(\tilde{\beta}^{(1)}) \big) \ge 0.
\end{aligned}
$$
Hence, iteration 2 again finds the oracle estimator, $\tilde{\beta}^{(2)} = \dot{\beta}$, and the algorithm stops. □
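The two-step LLA scheme analyzed above can be sketched in code. The sketch below is illustrative only: it replaces the surrogate SVM loss with a separable quadratic loss $\tfrac{1}{2}\|b - z\|^2$, for which each weighted-$l_1$ subproblem is solved exactly by soft-thresholding; the SCAD derivative and the iterative reweighting structure are as in the proof, while the function names and defaults are ours.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """SCAD penalty derivative p'_lambda(t) for t >= 0 (Fan and Li, 2001)."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam, lam,
        np.where(t <= a * lam, np.maximum(a * lam - t, 0.0) / (a - 1.0), 0.0),
    )

def lla(z, lam, a=3.7, n_iter=2):
    """LLA iterations for min_b 0.5*||b - z||^2 + sum_j p_lambda(|b_j|).

    With this separable quadratic loss (a toy stand-in for the surrogate
    SVM loss), each weighted-l1 subproblem is solved exactly by
    soft-thresholding with SCAD-derivative weights.
    """
    beta = np.zeros_like(np.asarray(z, dtype=float))  # zero initial estimator
    for _ in range(n_iter):
        w = scad_deriv(beta, lam, a)                  # local linear weights
        beta = np.sign(z) * np.maximum(np.abs(z) - w, 0.0)
    return beta
```

Note how a large coordinate (above $a\lambda$ after the first pass) receives zero penalty weight at the next iteration and is left unshrunk, mirroring the unbiasedness of the oracle estimator.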

4. Numerical Experiments

4.1. Simulation Experiments

We considered four models to evaluate the finite-sample performance of the distributed SVM. The first and second models were similar to models 1 and 2 of [27], respectively. The first model was essentially a standard linear discriminant analysis setting, as used in [19,24,25]. The other three models were probit regression models under different settings. In the simulation experiments, we generated the data in R; for model 1 we formulated the optimization problem in AMPL and solved it with the CPLEX solver, and we used Python for the other models.
Model 1: $\Pr(Y = 1) = \Pr(Y = -1) = 0.5$, $X^* \mid (Y = 1) \sim MN(\mu, \Sigma)$, $X^* \mid (Y = -1) \sim MN(-\mu, \Sigma)$, $q = 5$, $\mu = (0.1, 0.2, 0.3, 0.4, 0.5, 0, \ldots, 0)^T \in \mathbb{R}^p$, $\Sigma = (\sigma_{ij})$ with non-zero elements $\sigma_{ii} = 1$ for $i = 1, 2, \ldots, p$ and $\sigma_{ij} = \rho = 0.2$ for $1 \le i \ne j \le q$. The Bayes rule is $\mathrm{sgn}(2.67 X_1 + 2.83 X_2 + 3 X_3 + 3.17 X_4 + 3.33 X_5)$ with Bayes error $6.3\%$.
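A data generator for Model 1 can be sketched as follows (a minimal NumPy illustration; the function name and defaults are ours, not from the paper):

```python
import numpy as np

def gen_model1(n, p=500, q=5, rho=0.2, seed=0):
    """Sample (X, y) from Model 1: Y = +1 or -1 with probability 1/2 each,
    and X | (Y = y) ~ N(y * mu, Sigma), where Sigma has unit diagonal and
    equicorrelation rho among the first q coordinates (q = 5 here, to
    match the length of the nonzero part of mu)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    mu[:q] = [0.1, 0.2, 0.3, 0.4, 0.5]
    Sigma = np.eye(p)
    Sigma[:q, :q] += rho * (1.0 - np.eye(q))   # sigma_ij = 0.2 for i != j <= q
    y = rng.choice([-1, 1], size=n)
    noise = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    X = y[:, None] * mu + noise                # class-conditional mean is y * mu
    return X, y
```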
Model 2: $X^* \sim MN(0, \Sigma)$, $\Sigma = (\sigma_{ij})$ with $\sigma_{ij} = 0.4^{|i-j|}$ for $1 \le i \ne j \le p$ and $\sigma_{ij} = 1$ for $i = j$; $\Pr(Y = 1 \mid X^*) = \Phi(X^{*T} \beta^*)$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. The Bayes rule is $\mathrm{sgn}(X_1 + X_2 + X_3 + X_4)$ with Bayes error $10.4\%$.
Model 3: $X^* \sim MN(0, \Sigma)$, $\Sigma = (\sigma_{ij})$ with $\sigma_{ij} = 0.5^{|i-j|}$; $\Pr(Y = 1 \mid X^*) = \Phi(X^{*T} \beta^*)$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. The true parameter $\beta_0$ is sparse: its first $q$ entries are i.i.d. random variables uniformly distributed on $[0, 1]$, and the remaining entries are zero.
Model 4: $X^* \sim MN(0, \Sigma)$, $\Sigma = (\sigma_{ij})$ with $\sigma_{ij} = 0.5^{|i-j|/5}$; $\Pr(Y = 1 \mid X^*) = \Phi(X^{*T} \beta^*)$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. The true parameter $\beta_0$ is sparse: its first $q$ entries are i.i.d. random variables uniformly distributed on $[0, 1]$, and the remaining entries are zero. This is an ill-conditioned version of model 3.
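The probit designs of models 2–4 differ only in the decay rate of the AR(1)-type covariance; a minimal NumPy sketch of a Model 2-style generator follows (names and defaults are ours, and the standard-normal CDF is built from `math.erf` to stay dependency-free):

```python
import numpy as np
from math import erf, sqrt

def gen_probit(n, p=1000, q=4, rho=0.4, seed=0):
    """Model 2-style data: X ~ N(0, Sigma) with AR(1) covariance
    sigma_ij = rho^|i-j|, and P(Y = 1 | X) = Phi(X^T beta*), where the
    first q entries of beta* are 1 and the rest are 0."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[:q] = 1.0
    # Standard normal cdf, vectorized over the linear predictor.
    Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
    y = np.where(rng.uniform(size=n) < Phi(X @ beta), 1, -1)
    return X, y
```

For model 3 one would draw the first $q$ entries of $\beta^*$ uniformly from $[0,1]$, and for model 4 replace the exponent $|i-j|$ with $|i-j|/5$.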
For model 1, we used the dimensions $p = 500$ and $p = 1000$, local sample sizes $n = 200$ and $400$, and numbers of machines $m = 5, 10, 15, 20$. For models 2–4, the dimension was $p = 1000$, the number of machines $m = 5, 10, 20$, and the total sample size $N = nm$ = 10,000.
We compared the finite sample performances of the following four estimators:
  • L1SVM algorithm: the proposed communication-efficient estimator β ˇ L 1 ;
  • SCADSVM algorithm: the proposed communication-efficient estimator β ˇ S C A D ;
  • Cen algorithm: the central estimator $\hat{\beta}_{Cen}$, which computes the $l_1$-regularized estimator using the whole dataset;
  • Sub algorithm: the sub-data estimator $\hat{\beta}_{Sub}$, which computes the $l_1$-regularized estimator using only the data on the first machine.
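The communication pattern shared by the two proposed estimators — every machine sends its local gradient, and the first machine minimizes a gradient-shifted surrogate — can be sketched as follows. This is an illustration only: the paper solves the $l_1$-penalized hinge-loss surrogate directly (e.g., via AMPL/CPLEX), whereas the sketch substitutes a smooth squared hinge and proximal gradient descent so that it is self-contained; all names are ours.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def local_grad(beta, X, y):
    """Gradient of the squared-hinge loss (a smooth stand-in for the hinge)."""
    m = np.maximum(0.0, 1.0 - y * (X @ beta))
    return -2.0 * (X * (m * y)[:, None]).mean(axis=0)

def csl_round(beta_t, parts, lam, lr=0.05, n_steps=300):
    """One CSL communication round: workers send local gradients at beta_t,
    and machine 1 minimizes the shifted surrogate
        L1(beta) - beta^T (grad L1(beta_t) - grad L(beta_t)) + lam * ||beta||_1
    here by proximal gradient descent."""
    grads = [local_grad(beta_t, X, y) for X, y in parts]
    shift = grads[0] - np.mean(grads, axis=0)   # grad L1 - grad L at beta_t
    X1, y1 = parts[0]
    beta = beta_t.copy()
    for _ in range(n_steps):
        g = local_grad(beta, X1, y1) - shift    # gradient of the surrogate
        beta = soft_threshold(beta - lr * g, lr * lam)
    return beta
```

Only one vector of gradients per machine is communicated per round, which is what makes the scheme communication-efficient relative to shipping raw data.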
We use the first model to compare the variable selection performance of the above four algorithms, with results listed in Table 1, and use the other models to evaluate the estimation errors of the parameters via the MSE, presented in Figure 1, Figure 2 and Figure 3.
The numbers in Table 1 are the numbers of zero coefficients incorrectly estimated as nonzero. The number of nonzero coefficients incorrectly estimated as zero was always zero, which means all four algorithms could find the relevant variables; hence, these counts are not listed. From Table 1, we observed the following:
(i)
The centralized algorithm was the best among these algorithms because it used the information of the whole dataset.
(ii)
The sub algorithm performed poorly because it only used the information of the data on the first machine.
(iii)
Our proposed L1SVM and SCADSVM could both select the relevant variables, and SCADSVM performed better than L1SVM. This implied the non-convex SCADSVM algorithm was more robust than the convex L1SVM, especially for complex models and massive datasets.
(iv)
When N = m n was large, our two proposed distributed SVM algorithms were as good as the centralized algorithm.
We give the prediction error analysis for models 2–4 in Figure 1, Figure 2 and Figure 3, from which we have the following observations:
(i)
The central algorithm was still the best classifier, but had the highest communication cost and risk of privacy leakage. There was a big gap between the sub estimator and the centralized estimator.
(ii)
Our two proposed communication-efficient estimators could match the central estimator within a few rounds of communication. The prediction errors of SCADSVM were lower than those of L1SVM, and it was more robust than L1SVM.

4.2. Real Data

In this subsection, we verify the performance of the CSLSVM algorithms (L1SVM and SCADSVM) using three real datasets: ‘a9a’, ‘w8a’, and ‘phishing’ from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvm/, accessed on 12 February 2022). These datasets are listed in Table 2. The ‘a9a’ dataset was an adult dataset from the 1994 Census database; the prediction task was to determine whether a person makes over 50K a year. The ‘w8a’ dataset was also based on the Census database, but it had more features than ‘a9a’. The ‘phishing’ dataset aimed to predict phishing websites. Phishing is the process by which a fraudster impersonates a legitimate person by simulating the same or similar web pages or websites to steal personal or private information for illegal political and economic gain. As phishing becomes more and more serious, phishing web detection is gaining attention as an anti-phishing measure and technique.
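In these experiments, each dataset is first split into training and test parts, and the training part is distributed across the $m$ worker machines; a minimal sketch of this preprocessing (our helper, not from the paper):

```python
import numpy as np

def split_and_shard(X, y, m, train_frac=0.8, seed=0):
    """Shuffle, hold out (1 - train_frac) of the data for testing, and
    split the training part evenly across m worker machines."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    train, test = idx[:n_train], idx[n_train:]
    shards = [(X[s], y[s]) for s in np.array_split(train, m)]
    return shards, (X[test], y[test])
```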
Approximately 80% of each dataset was used to train the model and the remainder was used to test it. In the distributed setting, we used m = 5, 10, and 20 worker machines, respectively. The classification errors for the three datasets are provided in Figure 4, Figure 5 and Figure 6, from which we found the following:
(i)
Since these datasets had no well-specified model, the curves behaved quite differently on these datasets. However, overall there was a large gap between the sub algorithm and centralized solution.
(ii)
In most of the cases, the distributed L1SVM algorithm still converged quite slowly.
(iii)
The proposed distributed SCADSVM could obtain a solution that was highly competitive with the centralized model within a few rounds of communications, and was more robust than the distributed L1SVM.
The experimental results on simulated and real datasets verified that the proposed distributed SCADSVM/L1SVM algorithms were two effective procedures for distributed sparse learning on classification via the SVM technique, which maintained efficiency in both communication and computation.
Regarding the computational effort of the four methods, the numerical experiments also showed the following. The central algorithm used the whole dataset to train the model, so it gave the most accurate estimates, but had the highest computational cost and a risk of privacy leakage. The sub-data algorithm had the least computational cost because of the small amount of data and required no communication; however, it had the largest estimation error. Our proposed L1SVM and SCADSVM algorithms were communication- and computation-efficient thanks to the CSL framework, and could match the central estimator.

5. Conclusions

In this paper, we proposed a novel distributed CSLSVM learning algorithm with convex ( l 1 )/nonconvex (SCAD) penalties based on a communication-efficient surrogate likelihood (CSL) framework, which was efficient in both communication and computation. For the CSLSVM with l 1 penalty, we proved that the estimator of L1SVM could achieve a near-oracle property for an l 1 penalized SVM estimator based on the whole dataset. For the CSLSVM with SCAD penalty, we showed that the estimator of SCADSVM enjoyed the oracle property, i.e., one of the local minima of the distributed non-convex penalized SVM behaved similarly to the oracle estimator based on the whole dataset, as if the true sparsity were known in advance and only the relevant features were used to form the decision boundary. We also showed that, as long as the initial estimator was appropriate, the oracle estimator could be identified with a probability tending to 1. Extensive experiments on both simulated and real data illustrated that the proposed CSLSVM algorithm improved on the estimator computed on the first worker machine alone and matched the centralized method. In addition, the proposed distributed SCADSVM could obtain a solution highly competitive with the centralized model within a few rounds of communication, and was more robust than the distributed L1SVM.

Author Contributions

Conceptualization, X.Z. and H.S.; methodology, X.Z.; software, H.S.; validation, H.S.; investigation, X.Z.; writing—original draft preparation, H.S.; writing—review and editing, X.Z.; supervision, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Chinese National Social Science Fund (Grant No. 19BTJ034), National Natural Science Foundation of China (Grant No. 12171242, 11971235) and Postgraduate Research & Practice Innovation Program of Jiangsu Province (Grant No. KYCX201676).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1995.
  2. Banerjee, M.; Durot, C.; Sen, B. Divide and conquer in nonstandard problems and the super-efficiency phenomenon. Ann. Stat. 2019, 47, 720–757.
  3. Jordan, M.I.; Lee, J.D.; Yang, Y. Communication-Efficient Distributed Statistical Inference. J. Am. Stat. Assoc. 2019, 114, 668–681.
  4. Volgushev, S.; Chao, S.K.; Cheng, G. Distributed inference for quantile regression processes. arXiv 2017, arXiv:1701.06088.
  5. Chen, X.; Liu, W.; Mao, X.; Yang, Z. Distributed High-dimensional Regression Under a Quantile Loss Function. arXiv 2019, arXiv:1906.05741.
  6. Wang, L.; Lian, H. Communication-efficient estimation of high-dimensional quantile regression. Anal. Appl. 2020, 18, 1057–1075.
  7. Zhang, Y.; Duchi, J.; Wainwright, M. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. J. Mach. Learn. Res. 2015, 16, 3299–3340.
  8. Han, Y.; Mukherjee, P.; Ozgur, A.; Weissman, T. Distributed statistical estimation of high-dimensional and nonparametric distributions. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 506–510.
  9. Wang, X.; Yang, Z.; Chen, X.; Liu, W. Distributed inference for linear support vector machine. J. Mach. Learn. Res. 2019, 20, 1–41.
  10. Zou, H. An improved 1-norm SVM for simultaneous classification and variable selection. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, PMLR, San Juan, Puerto Rico, 21–24 March 2007; pp. 675–681.
  11. Meinshausen, N.; Bühlmann, P. High-dimensional graphs and variable selection with the lasso. Ann. Stat. 2006, 34, 1436–1462.
  12. Zhao, P.; Yu, B. On model selection consistency of Lasso. J. Mach. Learn. Res. 2006, 7, 2541–2563.
  13. Meinshausen, N.; Yu, B. Lasso-type recovery of sparse representations for high-dimensional data. Ann. Stat. 2009, 37, 246–270.
  14. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
  15. Bühlmann, P.; van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.
  16. Giraud, C. Estimator Selection. In Introduction to High-Dimensional Statistics; Chapman and Hall/CRC: London, UK, 2014; pp. 117–136.
  17. Fan, J.; Fan, Y. High dimensional classification using features annealed independence rules. Ann. Stat. 2008, 36, 2605.
  18. Bradley, P.S.; Mangasarian, O.L. Feature selection via concave minimization and support vector machines. In Proceedings of the 15th International Conference on Machine Learning (ICML), 1998; pp. 82–90.
  19. Peng, B.; Wang, L.; Wu, Y. An error bound for l1-norm support vector machine coefficients in ultra-high dimension. J. Mach. Learn. Res. 2016, 17, 8279–8304.
  20. Zhu, J.; Rosset, S.; Tibshirani, R.; Hastie, T. 1-norm support vector machines. Adv. Neural Inf. Process. Syst. 2003, 16, 49–56.
  21. Wegkamp, M.; Yuan, M. Support vector machines with a reject option. Bernoulli 2011, 17, 1368–1385.
  22. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
  23. Becker, N.; Toedt, G.; Lichter, P.; Benner, A. Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data. BMC Bioinform. 2011, 12, 1–13.
  24. Park, C.; Kim, K.R.; Myung, R.; Koo, J.Y. Oracle properties of SCAD-penalized support vector machine. J. Stat. Plan. Inference 2012, 142, 2257–2270.
  25. Zhang, X.; Wu, Y.; Wang, L.; Li, R. Variable selection for support vector machines in moderately high dimensions. J. R. Stat. Soc. Ser. B Stat. Methodol. 2016, 78, 53.
  26. Lian, H.; Fan, Z. Divide-and-conquer for debiased l1-norm support vector machine in ultra-high dimensions. J. Mach. Learn. Res. 2017, 18, 6691–6716.
  27. Wang, J.; Kolar, M.; Srebro, N.; Zhang, T. Efficient Distributed Learning with Sparsity. arXiv 2016, arXiv:1605.07991.
  28. Koo, J.Y.; Lee, Y.; Kim, Y.; Park, C. A Bahadur representation of the linear support vector machine. J. Mach. Learn. Res. 2008, 9, 1343–1368.
  29. Belloni, A.; Chernozhukov, V. l1-penalized quantile regression in high-dimensional sparse models. Ann. Stat. 2011, 39, 82–130.
  30. van der Vaart, A.W.; Wellner, J.A. Weak Convergence and Empirical Processes: With Applications to Statistics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1996.
  31. Koltchinskii, V. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011; Volume 2033.
  32. Tao, P.D.; An, L.T.H. Convex analysis approach to D.C. programming: Theory, algorithms and applications. Acta Math. Vietnam. 1997, 22, 289–355.
  33. Zou, H.; Li, R. One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat. 2008, 36, 1509.
Figure 1. Prediction error analysis vs. rounds of communication when m = 5 , 10, and 20 for model 2.
Figure 2. Prediction error analysis vs. rounds of communication when m = 5 , 10, and 20 for model 3.
Figure 3. Prediction error analysis vs. rounds of communication when m = 5 , 10, and 20 for model 4.
Figure 4. Classification error vs. rounds of communications for ’a9a’ data.
Figure 5. Classification error vs. rounds of communications for ’w8a’ data.
Figure 6. Classification error vs. rounds of communications for ’phishing’ data.
Table 1. Variable selection results for Model 1.
n = 200, p = 500
m     Sub   L1SVM   SCADSVM   Cen
5     21    2       1         0
10    22    3       4         0
15    28    0       1         0
20    24    0       0         0

n = 200, p = 1000
m     Sub   L1SVM   SCADSVM   Cen
5     49    8       7         0
10    39    2       0         0
15    42    1       0         0
20    42    1       1         0

n = 400, p = 500
m     Sub   L1SVM   SCADSVM   Cen
5     4     4       0         0
10    3     0       0         0
15    5     0       0         0
20    1     0       0         0

n = 400, p = 1000
m     Sub   L1SVM   SCADSVM   Cen
5     6     0       0         0
10    7     0       0         0
15    4     0       0         0
20    4     0       0         0
Table 2. Real data used in the experiments.
Data Name   Number of Data   Features   Task
a9a         48,842           123        Classification
w8a         64,700           301        Classification
phishing    11,055           68         Classification

Share and Cite

MDPI and ACS Style

Zhou, X.; Shen, H. Communication-Efficient Distributed Learning for High-Dimensional Support Vector Machines. Mathematics 2022, 10, 1029. https://doi.org/10.3390/math10071029

