Article

ASPDC: Accelerated SPDC Regularized Empirical Risk Minimization for Ill-Conditioned Problems in Large-Scale Machine Learning

1 School of Biomedical Engineering, Sun Yat-sen University, Guangzhou 510006, China
2 College of Engineering, Shantou University, Shantou 515041, China
3 Department of Computer Science, Sun Yat-sen University, Guangzhou 510006, China
4 School of Artificial Intelligence, Xidian University, Xi’an 710071, China
5 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
6 School of Intelligent Systems Engineering, Sun Yat-sen University, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(15), 2382; https://doi.org/10.3390/electronics11152382
Submission received: 30 June 2022 / Revised: 25 July 2022 / Accepted: 26 July 2022 / Published: 29 July 2022
(This article belongs to the Special Issue Machine Learning in Big Data)

Abstract

This paper aims to improve the response speed of SPDC (stochastic primal–dual coordinate ascent) in large-scale machine learning, as the per-iteration complexity of SPDC is not satisfactory. We propose an accelerated stochastic primal–dual coordinate ascent method called ASPDC and its further accelerated variant, ASPDC-i. Our proposed ASPDC methods achieve a good balance between low per-iteration computation complexity and fast convergence speed, even when the condition number becomes very large. A large condition number causes an ill-conditioned problem, which usually requires many more iterations before convergence and a longer per-iteration time when training machine learning models. We performed experiments on various machine learning problems. The experimental results demonstrate that ASPDC and ASPDC-i converge faster than their counterparts and enjoy low per-iteration complexity as well.

1. Introduction

In this paper, we consider a composite convex optimization problem, Regularized Empirical Risk Minimization (RERM), that can be solved by SPDC [1]. Our goal is to use our proposed ASPDC to find the approximate solution of the following optimization problem:
$$\min_{w\in\mathbb{R}^d}\Big\{P(w)=\frac{1}{n}\sum_{i=1}^{n}\phi_i(y_i, w^Tx_i, b)+g(w)\Big\} \quad(1)$$
where $x_i\in\mathbb{R}^d$ is a feature vector, $y_i$ is the corresponding label in a machine learning task, $\{(x_i, y_i)\},\ i=1,2,\dots,n$ are the $n$ samples in the dataset, $\phi_i$ is a proper convex function of the linear predictor $w^Tx_i$, and $g(w)$ is a simple convex regularization function.
RERM is one of the central problems in machine learning. It is now prevalent in the data mining and machine learning domain. More background information on RERM can be found in [2]. The following are four examples of RERM:
  • Linear SVM, where $\phi_i(y_i, w^Tx_i, b)=\max\{0,\ 1-y_i(w^Tx_i+b)\}$ and $g(w)=\frac{\lambda}{2}||w||_2^2$;
  • Ridge regression, where $\phi_i(y_i, w^Tx_i, b)=\frac{1}{2}\big(y_i-(w^Tx_i+b)\big)^2$ and $g(w)=\frac{\lambda}{2}||w||_2^2$;
  • Lasso, where $\phi_i(y_i, w^Tx_i, b)=\frac{1}{2}\big(y_i-(w^Tx_i+b)\big)^2$ and $g(w)=\lambda||w||_1$;
  • Logistic regression, where $\phi_i(y_i, w^Tx_i, b)=\log\big(1+\exp(-y_i(w^Tx_i+b))\big)$ and $g(w)=\frac{\lambda}{2}||w||_2^2$.
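As a concrete illustration, the four RERM instances above can be written down directly. The following sketch evaluates the primal objective $P(w)$ of Equation (1) for each loss/regularizer pair; all function names are our own, not from the paper's released code.

```python
import numpy as np

def linear_svm_loss(y, z):
    # hinge loss: max{0, 1 - y * (w^T x + b)}, with z = w^T x + b
    return max(0.0, 1.0 - y * z)

def square_loss(y, z):
    # ridge regression / lasso loss: (1/2)(y - (w^T x + b))^2
    return 0.5 * (y - z) ** 2

def logistic_loss(y, z):
    # logistic regression loss: log(1 + exp(-y * (w^T x + b)))
    return np.log1p(np.exp(-y * z))

def l2_reg(w, lam):
    # g(w) = (lam/2) * ||w||_2^2
    return 0.5 * lam * np.dot(w, w)

def l1_reg(w, lam):
    # g(w) = lam * ||w||_1
    return lam * np.abs(w).sum()

def primal_objective(loss, reg, X, y, w, b, lam):
    # P(w) = (1/n) sum_i phi_i(y_i, w^T x_i, b) + g(w), cf. Equation (1)
    z = X @ w + b
    return np.mean([loss(yi, zi) for yi, zi in zip(y, z)]) + reg(w, lam)
```

Swapping the `loss`/`reg` arguments switches between the four problems without changing the surrounding optimization code.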
Here, we focus on the scenario in which the number of samples n is very large, as the per-iteration complexity of SPDC is intolerable in this scenario. Computing a full gradient becomes extremely expensive in terms of time and space costs. Therefore, RERM algorithms with a lower per-iteration complexity are more attractive in large-scale machine learning applications.
General optimization methods to the RERM problem using gradients are categorized into two types, namely, first-order and second-order. Second-order methods such as the Newton algorithm employ a Hessian matrix at each iteration to decrease the objective value. The disadvantage of these second-order methods is that both obtaining and using a Hessian matrix is computationally expensive. On the other hand, while first-order optimization schemes are lightweight in gradient computation, they may converge slowly [3,4].
Among the algorithms for solving the RERM problem, we are more interested in dual algorithms such as stochastic dual coordinate ascent-SDCA, as the dual-gap is a clearer stopping criterion than gradients. In addition, they are capable of handling non-differentiable primal optimal functions more easily [5]. SDCA is a first-order optimization method and is widely used in the current machine learning domain. Dual coordinate methods have been implemented in open machine learning libraries [4].
The dual methods do not solve the primal problem directly. Instead, they solve the dual or saddle point problem of the primal problem. The corresponding dual problem of the primal problem in Equation (1) is formulated as follows:
$$\max_{\alpha\in\mathbb{R}^n}\Big\{D(\alpha)=-\frac{1}{n}\sum_{i=1}^{n}\phi_i^*(\alpha_i)-g^*\Big({-\frac{1}{n}\sum_{i=1}^{n}\alpha_ix_i}\Big)\Big\} \quad(2)$$
where $g^*(u)=\max_{w\in\mathbb{R}^d}\{w^Tu-g(w)\}$ and $\phi_i^*$ are the convex conjugate functions of $g$ and $\phi_i$, respectively. Due to the structure of this dual problem, coordinate ascent methods can be more efficient than full gradient methods [4,6,7].
In the stochastic dual coordinate ascent method (SDCA) [5], a dual coordinate $\alpha_i$ is picked randomly at each iteration and then updated to increase the dual objective value. This helps SDCA to reach a low per-iteration computational complexity. Nevertheless, the convergence speed of SDCA becomes much slower as the condition number grows. A large condition number leads to an ill-conditioned problem, i.e., one in which a small change in one of the values of the coefficient matrix causes a large change in the solution vector [8,9,10,11]. Hence, SDCA is not applicable to large-scale data processing in ill-conditioned scenarios. Unfortunately, many training tasks involving large-scale data are ill-conditioned. Ill-conditioned problems are particularly common in mathematics and the geosciences [12].
Paper Organization. The rest of this paper is organized as follows. In Section 2, we describe related works.
In Section 3, we describe the relevant assumptions and preliminaries.
In Section 4, we discuss the accelerated stochastic primal–dual coordinate method. In this section, we present ASPDC in Algorithm 1 and its convergence analysis for the saddle point problem in Equation (3).
In Section 5, we extend ASPDC to ill-conditioned problems, in particular those in which $\lambda < \frac{4}{n\gamma}$. Our proposed extension method is called ASPDC-i, where i means “for ill-conditioned problems”.
In Section 6, we evaluate the performance of our proposed ASPDC algorithms against several state-of-the-art algorithms for solving machine learning problems, then discuss the experimental results.
In Section 7, we conclude the paper and discuss potential avenues for future work.

2. Related Work

Shalev-Shwartz and Zhang [13] developed an accelerated proximal stochastic dual coordinate ascent method (ASDCA), which converges faster than traditional methods when the condition number is large (Table 1). ASDCA can be regarded as a variant of a proximal point algorithm equipped with Nesterov’s accelerated technique [14,15,16]. ASDCA uses an inner–outer iteration procedure, where the outer loop is a minimization of an auxiliary problem with a regularized quadratic term. Then, the proximal SDCA starts to solve the auxiliary problem with a customized precision. At the end of each outer loop, Nesterov’s accelerated update is performed on the primal variable $w$. Nonetheless, ASDCA requires $\lambda$ to be limited to a range of low-level values, for example, $\lambda \le \frac{R^2}{10n\gamma}$, where $\gamma$ is the smoothness parameter of $\phi_i$, $n$ is the number of samples, and $R^2 = \max_i ||x_i||_2^2$.
Studies have extended the inner–outer iteration method in order to derive more general accelerated proximal-point algorithms, e.g., Catalyst, [17,18]. Theoretically, one can replace the inner-loop proximal SDCA algorithm using other algorithms, such as SVRG [19] and Prox-SVRG [20], to obtain the same overall complexity concerning the number of outer loops.
More recently, Zhang and Xiao [1,21] proposed a stochastic primal–dual coordinate (SPDC) method to solve the RERM problem defined in Equation (1). SPDC achieves a faster convergence rate in reducing the dual-gap than ASDCA and other dual methods in general optimization problems whose condition numbers are not very large. However, the per-iteration computation complexity of SPDC is much higher than that of ASDCA and SDCA. Theoretically, the per-iteration complexity of SPDC is $O(d)$; however, due to the auxiliary variable update and the momentum term, SPDC requires much more time to process one pass of a dataset, as verified in our experiments. When the condition number is large, the per-iteration computation complexity of SPDC is intolerable, which makes SPDC inapplicable to large-scale data processing. Our experiments verified that SPDC is more time-consuming than ASDCA and other low per-iteration complexity methods. Moreover, the dual-gap of SPDC is much larger when the data are sparse and high-dimensional.
The above issue leads to the following key question: “Can we design an algorithm with both a low per-iteration complexity and a fast convergence rate, especially for ill-conditioned scenarios in large-scale data processing?” We propose the ASPDC and ASPDC-i algorithms as the answer to this question. ASPDC methods have the following three advantages:
  • Simple structure at each iteration. In comparison with SPDC or other accelerated variants, ASPDC does not need to keep track of any other auxiliary variables; it only maintains the primal and dual variable. Each iteration only involves a dual update and primal update. This design makes its per-iteration complexity much lower than SPDC and other variants. The simple iteration design makes it easy to be implemented as well.
  • Short running time. Our experiments show that to reach the same precision, our methods need far less time and fewer epochs (numbers of passes through the entire data) to satisfy the stop condition.
  • Theoretical guarantee. ASPDC adopts Nesterov’s estimation technique [22,23]. We present a new proof of the convergence of the proposed methods.

3. Assumptions and Preliminary

Throughout this paper, $||\cdot||_2$ denotes the standard Euclidean norm, $||w||_2 = \sqrt{\sum_i |w_i|^2}$. We use $\mathbb{E}$ to denote the expectation taken with respect to the randomness of $\alpha_i$. For the sake of convenience, we absorb the bias term into the notation, $x_i \leftarrow (x_i^T, 1)^T$ and $w \leftarrow (w^T, b)^T$. Without loss of generality, we continue to write $w\in\mathbb{R}^d$, $x_i\in\mathbb{R}^d$. We then make the following assumptions to clearly specify the problem in Equation (1):
Assumption 1.
Each $\phi_i$ is lower semi-continuous and convex, and its derivative is $\frac{1}{\gamma}$-Lipschitz continuous (equivalently, $\phi_i$ is $\frac{1}{\gamma}$-smooth); i.e., there exists $\gamma > 0$ such that $|\phi_i'(a) - \phi_i'(b)| \le \frac{1}{\gamma}|a-b|$ for all $a, b \in \mathbb{R}$ and $i = 1, 2, \dots, n$.
It is widely known that Assumption 1 implies that $\phi_i^*$ is $\gamma$-strongly convex (see Theorem 4.2.2 in the fundamental book on convex analysis [24]).
Assumption 2.
The primal function $P(w)$ is $\lambda$-strongly convex: there exists $\lambda > 0$ such that for all $w_1, w_2 \in \mathbb{R}^d$,
$$P(w_1) \ge P(w_2) + \nabla P(w_2)^T(w_1 - w_2) + \frac{\lambda}{2}||w_1 - w_2||_2^2.$$
The convexity of P ( w ) may come from either ϕ i or g ( w ) or both. For instance, if g ( w ) = λ 2 | | w | | 2 2 , Assumption 2 holds.
Assumption 3.
$||x_i||_2 \le 1$, $i = 1, 2, \dots, n$.
Assumption 3 is not restrictive, as it holds whenever the data are normalized.
Under the three assumptions above, the RERM problem defined in Equation (1) can be rewritten as the following convex–concave saddle point problem [1]:
$$\min_{w\in\mathbb{R}^d}\max_{\alpha\in\mathbb{R}^n}\Big\{f(w,\alpha)=\frac{1}{n}\sum_{i=1}^{n}\big[\alpha_i w^Tx_i-\phi_i^*(\alpha_i)\big]+g(w)\Big\} \quad(3)$$
where $\phi_i^*(\alpha_i)=\sup_{s\in\mathbb{R}}\{s\alpha_i-\phi_i(s)\}$ is the convex conjugate function of $\phi_i$. Lemma 1 demonstrates the relationship between the primal problem of Equation (1) and the problem of Equation (3).
Lemma 1.
Let w * = arg min w R d P ( w ) and α * = arg max α R n D ( α ) , then we have
(1) 
P ( w ) = max α R n f ( w , α )
(2) 
D ( α ) = min w R d f ( w , α )
(3) 
There exists a unique solution ( w * , α * ) such that P ( w * ) = D ( α * ) = f ( w * , α * ) .
Proof. 
Presented in Appendix A. □
Lemma 1 implies that we can calculate the optimal solution of the primal problem in Equation (1) by solving the saddle point problem in Equation (3).

4. Accelerated Stochastic Primal–Dual Coordinate Method

In this section, we present ASPDC in Algorithm 1 and its convergence analysis for the saddle point problem in Equation (3).
Each iteration in ASPDC can be divided into two steps: the dual update step and the primal update step. The dual update step is executed first. As shown in lines 4–6 of Algorithm 1, a dual coordinate, α i , is picked randomly and updated to increase the objective value of f ( w , α ) while keeping the primal variable w and other α j ( j i ) fixed. Then, the primal update step is executed later. As shown in line 7 of Algorithm 1, the primal variable w is updated to decrease the objective value of f ( w , α ) while keeping α j ( j = 1 , 2 , , n ) fixed.
The update of the dual variable $\alpha$ is extremely simple: it reduces to a univariate optimization problem, which makes its per-iteration complexity much lower than that of traditional SPDC algorithms. Specifically, the local update of the dual variable $\alpha_i$ is
$$\Delta\alpha_i^*=\arg\max_{\Delta\alpha_i\in\mathbb{R}} f(w, \alpha+\Delta\alpha_i e_i)=\arg\max_{\Delta\alpha_i\in\mathbb{R}}\big(\Delta\alpha_i x_i^Tw^{(t)}-\phi_i^*(\alpha_i^{(t)}+\Delta\alpha_i)\big), \quad(4)$$
where $e_i\in\mathbb{R}^n$ is the unit vector whose $i$th element is one.
The update of the primal variable $w$ is shown in Equation (5):
$$w^*=\arg\min_{w\in\mathbb{R}^d} f(w,\alpha^{(t+1)}) \quad(5)$$
$$=\arg\min_{w\in\mathbb{R}^d}\Big\{\Big(\frac{1}{n}\sum_{i=1}^{n}\alpha_i^{(t+1)}x_i\Big)^Tw+g(w)\Big\} \quad(6)$$
$$=\arg\max_{w\in\mathbb{R}^d}\Big\{\Big({-\frac{1}{n}\sum_{i=1}^{n}\alpha_i^{(t+1)}x_i}\Big)^Tw-g(w)\Big\} \quad(7)$$
$$=\nabla g^*\Big({-\frac{1}{n}\sum_{i=1}^{n}\alpha_i^{(t+1)}x_i}\Big), \quad(8)$$
where the last equality follows from the conjugate subgradient theorem in [25]. In this way, we turn the optimization process into a derivative operation on $g^*$. For instance, if $g(w)=\frac{\lambda}{2}||w||_2^2$, the update of the primal variable can be written as $w^{(t+1)}=-\frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i^{(t+1)}x_i$.
We compare the complexity of SPD1, SPD1-VR, and SVRG [19] with that of our methods in Table 2. In Table 2, $r$ is the maximum number of non-zero elements in each sample, $S$ is the number of non-zero elements in the whole dataset, $d$ is the dimension of the dataset, and $n$ is the number of data samples. Usually, $S$ is much smaller than $nd$ when the data are sparse and high-dimensional. In most large-scale data applications, the datasets are sparse and high-dimensional, i.e., most of the attributes are zeros. At each iteration, SPD1 and SPD1-VR choose $x_{ij}$ (the $j$-th value of sample $x_i$) to update the primal variable and dual variable regardless of whether $x_{ij}$ is 0 or not. This allows the per-iteration complexity of SPD1 and SPD1-VR to be reduced to $O(1)$. However, their complexity per pass through the data is $O(nd)$, which is the same as SVRG. In contrast, ASPDC does not execute the update if $x_{ij} = 0$. Thus, its complexity per pass through the data is $O(S)$, which is much lower than that of SPD1 and SVRG when the data are sparse and high-dimensional.
There are two major differences between SDCA and ASPDC, as follows. First, SDCA tries to solve the dual problem, while ASPDC tries to solve a saddle point problem. Second, the dual update of ASPDC is significantly simpler than that of SDCA. The dual update of SDCA is shown in (9); in comparison with that of ASPDC in Equation (4), it involves the additional computation of the term $\frac{1}{2\lambda n}||x_i||_2^2(\Delta\alpha_i)^2$:
$$\Delta\alpha_i^*=\arg\max_{\Delta\alpha_i\in\mathbb{R}}\Big(\Delta\alpha_i x_i^Tw^{(t)}-\phi_i^*(\alpha_i^{(t)}+\Delta\alpha_i)-\frac{1}{2\lambda n}||x_i||_2^2(\Delta\alpha_i)^2\Big) \quad(9)$$
We use the dual-gap metric as the stopping criterion, as shown in line 9 of Algorithm 1. The dual-gap is calculated as $P(w) - D(\alpha)$, and it is sufficient to conclude that $|P(w) - P(w^*)| \le \epsilon$ whenever $P(w) - D(\alpha) \le \epsilon$, since $|P(w) - P(w^*)| \le P(w) - D(\alpha) \le \epsilon$. This stopping criterion is easier to implement than alternatives such as $|P(w) - P(w^*)| \le \epsilon$, because $w^*$ is not known in advance in real-world machine learning applications.
Algorithm 1 ASPDC
  1: Input: $f(w,\alpha)$, $\alpha^{(0)}$, $\epsilon$
  2: Initialize $w^{(0)} = \nabla g^*\big({-\frac{1}{n}\sum_{i=1}^{n}\alpha_i^{(0)}x_i}\big)$
  3: for $t = 0, 1, 2, \dots$ do
  4:   pick $i \in \{1, 2, \dots, n\}$ under the uniform distribution
  5:   $\Delta\alpha_i^* = \arg\max_{\Delta\alpha_i\in\mathbb{R}}\big(\Delta\alpha_i x_i^Tw^{(t)} - \phi_i^*(\alpha_i^{(t)} + \Delta\alpha_i)\big)$
  6:   $\alpha^{(t+1)} = \alpha^{(t)} + \Delta\alpha_i^* e_i$
  7:   $w^{(t+1)} = \nabla g^*\big({-\frac{1}{n}\sum_{i=1}^{n}\alpha_i^{(t+1)}x_i}\big)$
  8: end for
  9: Stop condition: $P(w^{(T)}) - D(\alpha^{(T)}) \le \epsilon$
Output: $w^{(T)}$, $\alpha^{(T)}$, $P(w^{(T)}) - D(\alpha^{(T)})$
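To make the structure of Algorithm 1 concrete, the following is a minimal sketch of ASPDC for ridge regression, where $\phi_i(z)=\frac{1}{2}(y_i-z)^2$, $\phi_i^*(a)=y_ia+\frac{a^2}{2}$, and $g(w)=\frac{\lambda}{2}||w||_2^2$, so that line 5 has the closed form $\alpha_i^{(t)}+\Delta\alpha_i^* = x_i^Tw^{(t)}-y_i$ and line 7 becomes $w=-\frac{1}{\lambda n}\sum_i\alpha_ix_i$. All names, the gap-check schedule, and the toy setting are our own choices, not the paper's implementation.

```python
import numpy as np

def duality_gap(X, y, lam, w, alpha):
    # P(w) - D(alpha) for ridge: P = (1/n) sum (1/2)(y_i - x_i^T w)^2 + (lam/2)||w||^2,
    # D = -(1/n) sum (y_i alpha_i + alpha_i^2/2) - (lam/2)||w||^2,
    # where w = -(1/(lam n)) X^T alpha is maintained by the algorithm
    z = X @ w
    primal = 0.5 * np.mean((y - z) ** 2) + 0.5 * lam * (w @ w)
    dual = -np.mean(y * alpha + 0.5 * alpha ** 2) - 0.5 * lam * (w @ w)
    return primal - dual

def aspdc_ridge(X, y, lam, eps=1e-9, max_epochs=5000, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    alpha = np.zeros(n)
    v = np.zeros(d)                  # running sum v = (1/n) sum_i alpha_i x_i
    w = -v / lam                     # line 2: w = grad g*(-v) = -v / lam
    for _ in range(max_epochs):
        if duality_gap(X, y, lam, w, alpha) <= eps:   # line 9 stop condition
            break
        for _ in range(n):
            i = rng.integers(n)      # line 4: uniform coordinate choice
            # line 5: closed-form dual maximizer for the square loss
            new_ai = X[i] @ w - y[i]
            v += (new_ai - alpha[i]) * X[i] / n
            alpha[i] = new_ai        # line 6: dual update
            w = -v / lam             # line 7: primal update
    return w, alpha
```

Maintaining the running sum $v$ incrementally keeps each iteration at $O(d)$ cost (or $O(r)$ for sparse $x_i$), matching the low per-iteration complexity claimed above.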
In the rest of this section, we show the proof for ASPDC’s convergence. We first present the following lemma.
Lemma 2.
On the basis of Assumptions 1–3, let $w^{(t)}$ and $\alpha^{(t)}$ be the sequences produced by ASPDC, let $g(w)=\frac{\lambda}{2}||w||_2^2$, and let $\lambda \ge \frac{4}{n\gamma}$; then, we have:
$$\mathbb{E}\big(P(w^{(t)})-D(\alpha^{(t)})\big) \le 2n\Big(1-\frac{1}{2n}\Big)^t\big(P(w^{(0)})-D(\alpha^{(0)})\big) \quad(10)$$
Proof. 
The detailed proof can be found in the Appendix. In the proof, we assume that g ( w ) = λ 2 | | w | | 2 2 for convenience. Therefore, the theory only works for l2 regularization. The extension to l1 regularization is a topic for future work.
The skeleton of the proof in the Appendix can be described using the following three steps:
First, we obtain
$$\mathbb{E}\big(D(\alpha^*)-D(\alpha^{(t)})\big) \le \Big(1-\frac{1}{2n}\Big)^t\big(D(\alpha^*)-D(\alpha^{(0)})\big).$$
Second, we have
$$\frac{1}{2n}\mathbb{E}\big(P(w^{(t)})-D(\alpha^{(t)})\big) \le \mathbb{E}\big(D(\alpha^{(t+1)})-D(\alpha^{(t)})\big) = \big(D(\alpha^*)-D(\alpha^{(t)})\big) - \mathbb{E}\big(D(\alpha^*)-D(\alpha^{(t+1)})\big) \le D(\alpha^*)-D(\alpha^{(t)}).$$
Finally, using weak duality, we can obtain
$$\mathbb{E}\big(P(w^{(t)})-D(\alpha^{(t)})\big) \le 2n\Big(1-\frac{1}{2n}\Big)^t\big(P(w^{(0)})-D(\alpha^{(0)})\big).$$
   □
Theorem 1.
The total number of iterations needed to achieve an expected duality gap of $\mathbb{E}\big(P(w^{(t)})-D(\alpha^{(t)})\big) \le \epsilon$ is
$$t \ge 2n\log\Big(\frac{2n\big(P(w^{(0)})-D(\alpha^{(0)})\big)}{\epsilon}\Big) \quad(11)$$
Proof. 
Using Lemma 2, we can obtain
$$\mathbb{E}\big(P(w^{(t)})-D(\alpha^{(t)})\big) \le 2n\exp\Big({-\frac{t}{2n}}\Big)\big(P(w^{(0)})-D(\alpha^{(0)})\big),$$
where we use the fact that $\big(1-\frac{1}{2n}\big)^t \le \exp\big(-\frac{t}{2n}\big)$. Setting $2n\exp\big(-\frac{t}{2n}\big)\big(P(w^{(0)})-D(\alpha^{(0)})\big) \le \epsilon$, we finally obtain $t \ge 2n\log\big(\frac{2n(P(w^{(0)})-D(\alpha^{(0)}))}{\epsilon}\big)$. □
As shown by Equation (11), the complexity of ASPDC is $O\big(n\log(\frac{n}{\epsilon})\big)$. In contrast, the complexity of SVRG is $O\big(d(n+\kappa)\log(\frac{1}{\epsilon})\big)$ and the complexity of SPDC is $O\big(d(n+\sqrt{n\kappa})\log(\frac{1}{\epsilon})\big)$.
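As a quick sanity check on the bound in Theorem 1, one can plug in representative numbers; the values below are illustrative, not taken from the paper's experiments.

```python
import math

def aspdc_iteration_bound(n, gap0, eps):
    # Theorem 1: t >= 2n * log(2n * gap0 / eps), where gap0 is the
    # initial duality gap P(w^(0)) - D(alpha^(0))
    return math.ceil(2 * n * math.log(2 * n * gap0 / eps))

# e.g., n = 10^5 samples, unit initial gap, target eps = 10^-6:
iters = aspdc_iteration_bound(100000, 1.0, 1e-6)
epochs = iters / 100000   # roughly 52 passes over the data
```

The bound is independent of the dimension $d$, which is why the overall complexity above carries no $d$ factor beyond the per-iteration cost.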

5. ASPDC for Ill-Conditioned Problems

According to convex theory [16], the value $Q_f = L/\mu$ is called the condition number of a function $f$ that is $L$-smooth and $\mu$-strongly convex. Under Assumptions 1–3, the condition number of the primal function in Equation (1) is $(\frac{1}{\gamma}+\lambda)/\lambda = \frac{1}{\lambda\gamma}+1$. As $\lambda$ decreases, the condition number $Q_f$ grows. When $Q_f \gg 1$, the problem $f$ is called ill-conditioned.
In this section, we extend ASPDC to ill-conditioned problems, especially when $\lambda < \frac{4}{n\gamma}$. The extended method is called ASPDC-i, in which the suffix i means “for ill-conditioned problems”.
As shown in Algorithm 2, the procedure of ASPDC-i can be divided into epochs, indexed s = 1 , 2 , 3 , , S . Each epoch uses ASPDC to solve the following problem with a decreasing precision parameter ξ s :
$$\min_{w\in\mathbb{R}^d}\max_{\alpha\in\mathbb{R}^n} \tilde{f}_s(w,\alpha)=\frac{1}{n}\sum_{i=1}^{n}\big[\alpha_i w^Tx_i-\phi_i^*(\alpha_i)\big]+\tilde{g}(w) \quad(12)$$
where $\tilde{g}(w)=g(w)+\frac{\kappa}{2}||w||_2^2-\kappa w^T\tilde{w}_s$, $\kappa\in\mathbb{R}$ is a constant throughout the procedure, and $\tilde{g}(w)$ is $g(w)$ plus an additional perturbation term. This additional term is employed to ensure that the strong convexity parameter $\lambda+\kappa$ of $\tilde{g}(w)$ satisfies $\lambda+\kappa \ge \frac{4}{n\gamma}$. Note that a smaller $\kappa$ is preferable, as a larger $\kappa$ leads to a more severe bias between $f(w,\alpha)$ and $\tilde{f}_s(w,\alpha)$. Therefore, in the implementation of our ASPDC algorithms we simply use the smallest admissible $\kappa$: $\kappa = \frac{4}{n\gamma}-\lambda$.
These calls to ASPDC produce a sequence $\tilde{w}_s,\ s=1,2,\dots$, of solutions of the corresponding approximate problems in Equation (12). Here, we need to prove both that each call to ASPDC terminates after finitely many iterations and that the output $\tilde{w}_S$ satisfies the condition $|P(\tilde{w}_S)-P(w^*)| \le \epsilon$, where $w^*$ is the theoretical optimal solution of $P(w)$. These facts are established in the following Theorem 2.
Theorem 2.
Algorithm 2 needs $S \ge 1+\frac{2}{\eta}\log\big(\frac{\xi_1}{\epsilon}\big)$ epochs to reach an approximate solution satisfying $|P(\tilde{w}_S)-P(w^*)| \le \epsilon$.
The proof can be found in Appendix A. The settings of the hyperparameters of Algorithm 2 are presented in the proof.
Algorithm 2 ASPDC-i
  1: Parameter: $\lambda < \frac{4}{n\gamma}$, $\kappa = \frac{4}{n\gamma}-\lambda$, $\eta = \frac{\lambda}{\lambda+2\kappa}$, $\xi_1 = (1+\eta^{-1})\big(P(\tilde{w}_1)-D(\tilde{\alpha}_1)\big)$
  2: Initialize: $\tilde{w}_1 = 0$, $\tilde{\alpha}_1 = 0$
  3: for $s = 1, 2, 3, \dots$ do
  4:   $(\tilde{w}_{s+1}, \tilde{\alpha}_{s+1}, \epsilon_{s+1}) = \mathrm{ASPDC}\big(\tilde{f}_s(w,\alpha),\ \tilde{\alpha}_s,\ \frac{\eta}{2(1+\eta^{-1})}\xi_s\big)$
  5:   $\xi_{s+1} = (1-0.5\eta)\xi_s$
  6: end for
  7: Stop condition: $S \ge 1+\frac{2}{\eta}\log\big(\frac{\xi_1}{\epsilon}\big)$
Output: $\tilde{w}_S$, $\tilde{\alpha}_S$
To allow for a fair comparison with other algorithms, we provide a realistic implementation of Algorithm 2, shown in Algorithm 3. Here, the number of inner iterations in Algorithm 3 is set to a constant $m$ (e.g., $m = 2n$). As demonstrated in the experiment section, this approach works well.
Algorithm 3 Implemented version of ASPDC-i
  1: Parameter: $\lambda < \frac{4}{n\gamma}$, $\kappa = \frac{4}{n\gamma}-\lambda$
  2: Initialize: $\tilde{w}_0 = 0$, $\tilde{\alpha}_0 = 0$
  3: for $s = 1, 2, 3, \dots, S$ do
  4:   $\alpha^{(0)} = \tilde{\alpha}_{s-1}$, $w^{(0)} = \nabla\tilde{g}^*\big({-\frac{1}{n}\sum_{i=1}^{n}x_i\alpha_i^{(0)}}\big)$
  5:   for $t = 0, 1, 2, \dots, m-1$ do
  6:     pick $i \in \{1, 2, \dots, n\}$ under the uniform distribution
  7:     $\Delta\alpha_i^* = \arg\max_{\Delta\alpha_i\in\mathbb{R}}\big(\Delta\alpha_i x_i^Tw^{(t)} - \phi_i^*(\alpha_i^{(t)} + \Delta\alpha_i)\big)$
  8:     $\alpha^{(t+1)} = \alpha^{(t)} + \Delta\alpha_i^* e_i$
  9:     $w^{(t+1)} = \nabla\tilde{g}^*\big({-\frac{1}{n}\sum_{i=1}^{n}x_i\alpha_i^{(t+1)}}\big)$
  10:  end for
  11:  $\tilde{w}_s = w^{(m)}$, $\tilde{\alpha}_s = \alpha^{(m)}$
  12: end for
Output: $\tilde{w}_S$, $\tilde{\alpha}_S$
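The following is a minimal sketch of Algorithm 3 for ridge regression with a small $\lambda$, again with our own naming and epoch budget. The only changes relative to the plain ASPDC inner loop are the perturbed regularizer $\tilde{g}(w)=\frac{\lambda+\kappa}{2}||w||_2^2-\kappa w^T\tilde{w}_s$, whose primal update is $w=\frac{\kappa\tilde{w}_s - v}{\lambda+\kappa}$ with $v=\frac{1}{n}\sum_i\alpha_ix_i$, and the re-centering $\tilde{w}_s \leftarrow w^{(m)}$ at the end of each epoch.

```python
import numpy as np

def aspdc_i_ridge(X, y, lam, outer=300, seed=0):
    n, d = X.shape
    gamma = 1.0                          # the square loss is 1-smooth
    kappa = 4.0 / (n * gamma) - lam      # smallest kappa with lam + kappa >= 4/(n*gamma)
    lam_tilde = lam + kappa              # strong convexity of g_tilde
    rng = np.random.default_rng(seed)
    alpha = np.zeros(n)
    w_s = np.zeros(d)                    # center of the perturbation, w~_s
    m = 2 * n                            # fixed inner budget, as in Algorithm 3
    for _ in range(outer):
        v = X.T @ alpha / n              # v = (1/n) sum_i alpha_i x_i
        w = (kappa * w_s - v) / lam_tilde    # w = grad g_tilde*(-v)
        for _ in range(m):
            i = rng.integers(n)
            new_ai = X[i] @ w - y[i]     # closed-form dual step for the square loss
            v += (new_ai - alpha[i]) * X[i] / n
            alpha[i] = new_ai
            w = (kappa * w_s - v) / lam_tilde
        w_s = w                          # re-center the perturbation for the next epoch
    return w_s
```

At the fixed point the perturbation term vanishes ($\tilde{w}_s = w$), so the iterates converge to the solution of the original, unperturbed ridge problem.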

6. Experiments

In this section, we evaluate the performance of our ASPDC algorithms along with several state-of-the-art algorithms for solving machine learning problems such as SVM. All the algorithms were implemented in C++ and executed through a Matlab interface. The experiments were performed on a PC with an Intel i5-4690 CPU and 16.0 GB RAM. The source code and the detailed proofs can be downloaded from the GitHub website (https://github.com/lianghb6/ASPDC, accessed on 28 June 2022) and the datasets can be obtained from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, accessed on 28 June 2022).
As the computation processes of the problems are similar, in these experiments we mainly evaluated the practical performance of ASPDC for solving the following SVM optimization problem:
$$\min_{w\in\mathbb{R}^d}\Big\{P(w)=\frac{1}{n}\sum_{i=1}^{n}\phi_i(w^Tx_i)+\frac{\lambda}{2}||w||_2^2\Big\} \quad(13)$$
where $\phi_i$ is the smooth hinge loss, which is also used in [1,5]:
$$\phi_i(w^Tx_i)=\begin{cases} 0 & y_iw^Tx_i \ge 1 \\ \frac{1}{2}-y_iw^Tx_i & y_iw^Tx_i \le 0 \\ \frac{1}{2}\big(1-y_iw^Tx_i\big)^2 & \text{otherwise.} \end{cases} \quad(14)$$
The corresponding convex–concave saddle point problem is as follows:
$$\min_{w\in\mathbb{R}^d}\max_{\alpha\in\mathbb{R}^n}\Big\{f(w,\alpha)=\frac{1}{n}\sum_{i=1}^{n}\big[\alpha_iw^Tx_i-\phi_i^*(\alpha_i)\big]+\frac{\lambda}{2}||w||_2^2\Big\} \quad(15)$$
where
$$\phi_i^*(\alpha_i)=\begin{cases} y_i\alpha_i+\frac{1}{2}\alpha_i^2 & -1 \le y_i\alpha_i \le 0 \\ +\infty & \text{otherwise.} \end{cases} \quad(16)$$
Under Assumption 3, the smooth parameter γ of ϕ i is 1. The strongly convex parameter of P ( w ) is λ , which comes from the regularized function g ( w ) = λ 2 | | w | | 2 2 .
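The smooth hinge loss (14) and its conjugate (16) are easy to implement and check numerically. The sketch below (function names are ours) also verifies the conjugacy relation $\phi_i(z)=\max_a\big(az-\phi_i^*(a)\big)$ underlying the saddle point problem (15), using a grid search over the feasible interval $-1 \le y_ia \le 0$.

```python
import numpy as np

def smooth_hinge(y, z):
    # Equation (14): 0 if y*z >= 1; 1/2 - y*z if y*z <= 0; (1/2)(1 - y*z)^2 otherwise
    m = y * z
    if m >= 1.0:
        return 0.0
    if m <= 0.0:
        return 0.5 - m
    return 0.5 * (1.0 - m) ** 2

def smooth_hinge_conj(y, a):
    # Equation (16): y*a + a^2/2 if -1 <= y*a <= 0, else +infinity
    if -1.0 <= y * a <= 0.0:
        return y * a + 0.5 * a * a
    return np.inf

def conj_of_conj(y, z, num=200001):
    # recover phi_i(z) = max_a (a*z - phi_i*(a)) over the feasible interval
    grid = np.linspace(-1.0, 0.0, num) * y       # all a with y*a in [-1, 0]
    vals = grid * z - (y * grid + 0.5 * grid ** 2)
    return vals.max()
```

Because the inner maximization in (15) stays within $-1 \le y_i\alpha_i \le 0$, the conjugate never evaluates to $+\infty$ along the algorithm's trajectory.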
In Figure 1 and Table 3, we show the cases where $\lambda$ is relatively large (e.g., $10^{-2}$, $10^{-3}$, $10^{-4}$). We compare ASPDC (Algorithm 1) with state-of-the-art dual methods: the stochastic dual coordinate ascent method (SDCA) [5] and the stochastic primal–dual coordinate method (SPDC) [1]. Note that accelerated stochastic dual coordinate ascent (ASDCA) [13] cannot be applied in this scenario, as ASDCA requires $\lambda$ to be extremely small (i.e., $\lambda \le \frac{1}{10n\gamma}$). We omit the comparison between ASPDC and the stochastic gradient descent method and its variants (e.g., SVRG [19] and Katyusha [26]), as extensive experiments comparing SPDC with these methods have already been reported in the literature.
The horizontal axis in Figure 1 is the number of passes through the entire dataset, and the vertical axis is the logarithmic dual-gap. It can be seen from Figure 1 that ASPDC and SDCA have comparable performances on relatively large λ . With the same epoch, the dual-gap of ASPDC is lower than that of SPDC by two orders of magnitude after several epochs.
Figure 1 shows that both SDCA and ASPDC are faster than SPDC. This is because $\lambda$ in Figure 1 is relatively large (e.g., 0.01), and in this case the condition number of the problem is relatively small. When the condition number is large, ASPDC and SPDC perform better than SDCA. Overall, ASPDC is faster and is well suited to ill-conditioned problems.
Table 3 lists the running time needed for the dual-gap of each algorithm to decrease to a given precision (e.g., dual-gap $\le 10^{-6}$) on different datasets. Table 3 demonstrates that ASPDC and SDCA need less time to reach the given precision, and verifies that the convergence of ASPDC and SDCA is faster than that of SPDC. Table 4 presents the total running time for the algorithms to go through the entire dataset once, which measures the per-iteration computation complexity: a shorter running time indicates a lower per-iteration computation complexity. Table 4 shows that ASPDC and SDCA have lower per-iteration complexity than SPDC. Among all of the running time results, ASPDC demonstrates both fast convergence and low per-iteration complexity when $\lambda$ is large.
We then tested the case when $\lambda$ is relatively small (e.g., $\lambda < \frac{4}{n\gamma}$) and compared ASPDC-i with SDCA, SPDC, and ASDCA. Figure 2 plots the convergence results. Figure 2 shows that the convergence of SDCA, ASDCA, and SPDC is significantly slower than that of the same algorithms in Figure 1. The reason for this is that the condition number of the problem in this test case is larger than that in Figure 1. ASPDC-i performs much better in this experiment, as can be seen from Figure 2. ASPDC-i needs far fewer epochs than the other algorithms to reach the same level of dual-gap. Additionally, ASPDC-i can reach a significantly lower dual-gap than the others within the same number of epochs.
In addition, we compared ASPDC-i to a widely used non-dual-based algorithm, SVRG [19]. As SVRG is not dual-based, we directly compared its speed in reducing the primal objective value with that of ASPDC-i. Figure 3 shows that the convergence speed of ASPDC-i is faster than that of SVRG.
Note that ASDCA cannot be applied to the case in which the dataset is covtype and $\lambda = 10^{-6}$, as ASDCA needs the extra condition $\lambda \le \frac{1}{10n\gamma}$. Table 5 lists the running time that the different algorithms need to decrease the dual-gap to a given precision (e.g., $10^{-4}$). Table 6 presents the total running time for the algorithms to go through the entire dataset once; it shows that ASPDC-i and ASDCA have lower per-iteration complexity than SPDC. Although SDCA has low per-iteration complexity, its convergence is the slowest among these methods when $\lambda$ is relatively small; for this reason, we did not list the corresponding results for SDCA in Table 5 and Table 6. In summary, the above experiments show that our proposed methods achieve both fast convergence and low per-iteration complexity.

7. Conclusions and Future Work

In this paper, we propose two stochastic primal–dual coordinate methods, ASPDC and its further accelerated variant, ASPDC-i. These two algorithms are designed for the regularized empirical risk minimization problem. We proved the theoretical convergence guarantee of the algorithms and performed a series of experiments. The results illustrate that our methods achieve a good balance between low per-iteration computation complexity and fast convergence. The new convergence proof presented here uses Nesterov’s estimation sequence technique and $g(w)=\frac{\lambda}{2}||w||_2^2$. We believe that it is possible to extend this proof to more general regularization functions $g(w)$; however, we leave this as a possibility for future work.

Author Contributions

Writing—original draft, H.L.; Data curation, F.S. and X.L.; Writing—review & editing, H.C., H.W. and J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Program of Guangzhou, China (No. 202002020045) and by the Meizhou Major Scientific and Technological Innovation Platforms and Projects of Guangdong Provincial Science & Technology Plan Projects under Grant No. 2019A0102005.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Proof of Lemma 1

We prove the following equations: $P(w)=\max_{\alpha\in\mathbb{R}^n}f(w,\alpha)$, $D(\alpha)=\min_{w\in\mathbb{R}^d}f(w,\alpha)$, and $P(w^*)=D(\alpha^*)=f(w^*,\alpha^*)$. We first prove $P(w)=\max_{\alpha\in\mathbb{R}^n}f(w,\alpha)$.
Proof. 
$$\max_{\alpha\in\mathbb{R}^n}f(w,\alpha)=\max_{\alpha\in\mathbb{R}^n}\Big\{\frac{1}{n}\sum_{i=1}^{n}\big[\alpha_iw^Tx_i-\phi_i^*(\alpha_i)\big]+g(w)\Big\}=\frac{1}{n}\sum_{i=1}^{n}\max_{\alpha_i\in\mathbb{R}}\big\{\alpha_iw^Tx_i-\phi_i^*(\alpha_i)\big\}+g(w)=\frac{1}{n}\sum_{i=1}^{n}\phi_i(w^Tx_i)+g(w)=P(w)$$
In the last equation, we use the Conjugate Theorem (Convex Optimization Theory). Then, we prove that D ( α ) = min w R d f ( w , α ) .
$$\min_{w\in\mathbb{R}^d}f(w,\alpha)=\min_{w\in\mathbb{R}^d}\Big\{\frac{1}{n}\sum_{i=1}^{n}\big[\alpha_iw^Tx_i-\phi_i^*(\alpha_i)\big]+g(w)\Big\}=-\frac{1}{n}\sum_{i=1}^{n}\phi_i^*(\alpha_i)+\min_{w\in\mathbb{R}^d}\Big\{\Big(\frac{1}{n}\sum_{i=1}^{n}\alpha_ix_i\Big)^Tw+g(w)\Big\}=-\frac{1}{n}\sum_{i=1}^{n}\phi_i^*(\alpha_i)-\max_{w\in\mathbb{R}^d}\Big\{\Big({-\frac{1}{n}\sum_{i=1}^{n}\alpha_ix_i}\Big)^Tw-g(w)\Big\}=-\frac{1}{n}\sum_{i=1}^{n}\phi_i^*(\alpha_i)-g^*\Big({-\frac{1}{n}\sum_{i=1}^{n}\alpha_ix_i}\Big)=D(\alpha)$$
The proof of P ( w * ) = D ( α * ) = f ( w * , α * ) can be found in [1]. □

Appendix A.2. Proof of Lemma 2

Proof. 
When g ( w ) = λ 2 | | w | | 2 2 , the primal objective can be written as follows:
$$P(w)=\frac{1}{n}\sum_{i=1}^{n}\phi_i(w^Tx_i)+\frac{\lambda}{2}||w||_2^2.$$
The corresponding dual objective is
$$D(\alpha)=-\frac{1}{n}\sum_{i=1}^{n}\phi_i^*(\alpha_i)-\frac{\lambda}{2}\Big|\Big|\frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_ix_i\Big|\Big|_2^2.$$
Note that throughout the algorithm we can set
$$w^{(t)}=-\frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i^{(t)}x_i.$$
Thus, the D ( α ( t ) ) can be written as
$$D(\alpha^{(t)})=-\frac{1}{n}\sum_{i=1}^{n}\phi_i^*(\alpha_i^{(t)})-\frac{\lambda}{2}||w^{(t)}||_2^2.$$
Suppose we have $\alpha^{(t)}$ and that the $i$th coordinate is chosen at iteration $t+1$. Collecting the terms that change, we have
$$D(\alpha^{(t+1)})-D(\alpha^{(t)})=\underbrace{-\frac{1}{n}\phi_i^*(\alpha_i^{(t+1)})-\frac{\lambda}{2}\Big|\Big|w^{(t)}-\frac{1}{\lambda n}\Delta\alpha_i^*x_i\Big|\Big|_2^2}_{R_1}-\underbrace{\Big\{{-\frac{1}{n}\phi_i^*(\alpha_i^{(t)})-\frac{\lambda}{2}||w^{(t)}||_2^2}\Big\}}_{R_2}. \quad(A6)$$
The variables in the algorithm are as follows:
$$\Delta\alpha_i^*=\arg\max_{d\in\mathbb{R}}\big(d\,x_i^Tw^{(t)}-\phi_i^*(\alpha_i^{(t)}+d)\big)=\arg\max_{d\in\mathbb{R}}\big((\alpha_i^{(t)}+d)\,x_i^Tw^{(t)}-\phi_i^*(\alpha_i^{(t)}+d)\big)=\arg\max_{\beta\in\mathbb{R}}\big(\beta\,x_i^Tw^{(t)}-\phi_i^*(\beta)\big)-\alpha_i^{(t)}, \quad(A7)$$
where in the last equality we substitute $\beta=\alpha_i^{(t)}+d$ and correspondingly define $\beta^*=\alpha_i^{(t)}+\Delta\alpha_i^*$.
$$R_1=-\frac{1}{n}\phi_i^*(\alpha_i^{(t+1)})-\frac{\lambda}{2}\Big|\Big|w^{(t)}-\frac{1}{\lambda n}\Delta\alpha_i^*x_i\Big|\Big|_2^2$$
$$=-\frac{1}{n}\phi_i^*(\alpha_i^{(t)}+\Delta\alpha_i^*)+\frac{1}{n}\Delta\alpha_i^*x_i^Tw^{(t)}-\frac{1}{2\lambda n^2}||x_i||_2^2(\Delta\alpha_i^*)^2-\frac{\lambda}{2}||w^{(t)}||_2^2$$
$$=\frac{1}{n}\max_{d\in\mathbb{R}}\big(d\,x_i^Tw^{(t)}-\phi_i^*(\alpha_i^{(t)}+d)\big)-\frac{1}{2\lambda n^2}||x_i||_2^2(\Delta\alpha_i^*)^2-\frac{\lambda}{2}||w^{(t)}||_2^2$$
$$\overset{①}{\ge}\frac{1}{n}\Big(q(\beta^*-\alpha_i^{(t)})x_i^Tw^{(t)}-\phi_i^*\big(\alpha_i^{(t)}+q(\beta^*-\alpha_i^{(t)})\big)\Big)-\frac{1}{2\lambda n^2}||x_i||_2^2(\Delta\alpha_i^*)^2-\frac{\lambda}{2}||w^{(t)}||_2^2$$
$$=-\frac{1}{n}\phi_i^*\big((1-q)\alpha_i^{(t)}+q\beta^*\big)+\frac{q}{n}(\beta^*-\alpha_i^{(t)})x_i^Tw^{(t)}-\frac{1}{2\lambda n^2}||x_i||_2^2(\Delta\alpha_i^*)^2-\frac{\lambda}{2}||w^{(t)}||_2^2$$
$$\overset{②}{\ge}-\frac{1}{n}\Big(q\phi_i^*(\beta^*)+(1-q)\phi_i^*(\alpha_i^{(t)})-\frac{\gamma q(1-q)}{2}(\beta^*-\alpha_i^{(t)})^2\Big)+\frac{q}{n}(\beta^*-\alpha_i^{(t)})x_i^Tw^{(t)}-\frac{1}{2\lambda n^2}||x_i||_2^2(\Delta\alpha_i^*)^2-\frac{\lambda}{2}||w^{(t)}||_2^2$$
$$=\frac{q}{n}\big({-\phi_i^*(\beta^*)}+\beta^*x_i^Tw^{(t)}\big)-\frac{1-q}{n}\phi_i^*(\alpha_i^{(t)})+\frac{\gamma(1-q)q}{2n}(\beta^*-\alpha_i^{(t)})^2-\frac{q}{n}\alpha_i^{(t)}x_i^Tw^{(t)}-\frac{1}{2\lambda n^2}||x_i||_2^2(\Delta\alpha_i^*)^2-\frac{\lambda}{2}||w^{(t)}||_2^2, \quad(A8)$$
where $q\in(0,1)$ in inequality ① (we evaluate the maximum at the suboptimal point $d=q(\beta^*-\alpha_i^{(t)})$), while in inequality ② we use the fact that if $\phi_i$ is $\frac{1}{\gamma}$-smooth, then $\phi_i^*$ is $\gamma$-strongly convex.
On the one hand, according to (A7), we obtain
\[ \beta^\ast = \arg\max_{\beta\in\mathbb{R}} \Big( \beta\, x_i^\top w^{(t)} - \phi_i^\ast(\beta) \Big). \]
This implies the optimality condition
\[ x_i^\top w^{(t)} \in \partial \phi_i^\ast(\beta^\ast). \]
On the other hand, by the definition of the convex conjugate function, we have $\phi_i^{\ast\ast}\big(x_i^\top w^{(t)}\big) = \max_{\beta\in\mathbb{R}} \big( \beta\, x_i^\top w^{(t)} - \phi_i^\ast(\beta) \big)$. According to the Fenchel conjugate subgradient theorem, we have
\[ x_i^\top w^{(t)} \in \partial \phi_i^\ast(\beta^\ast) \;\Longrightarrow\; \beta^\ast x_i^\top w^{(t)} - \phi_i^\ast(\beta^\ast) = \phi_i^{\ast\ast}\big(x_i^\top w^{(t)}\big) \overset{\text{③}}{=} \phi_i\big(x_i^\top w^{(t)}\big), \tag{A11} \]
where in ③ we apply the Fenchel–Moreau theorem: since $\phi_i$ is convex and closed, $\phi_i^{\ast\ast} = \phi_i$.
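The biconjugation identity above can be checked numerically for a concrete loss. Here we take $\phi(u) = u^2/2$ (an illustrative choice, not the paper's loss), whose conjugate is $\phi^\ast(a) = a^2/2$; maximizing $\beta u - \phi^\ast(\beta)$ over a fine grid recovers both $\phi^{\ast\ast}(u) = \phi(u)$ and the optimality condition $\beta^\ast = (\phi^\ast)'^{-1}(u) = u$:

```python
import numpy as np

# phi(u) = u^2/2 is self-conjugate: phi*(a) = a^2/2 (illustrative choice).
betas = np.linspace(-10.0, 10.0, 200001)      # grid step 1e-4
for u in (-2.0, -0.5, 0.0, 1.3, 3.7):
    vals = betas * u - betas**2 / 2           # beta*u - phi*(beta)
    beta_star = betas[np.argmax(vals)]
    assert abs(vals.max() - u**2 / 2) < 1e-6  # phi**(u) == phi(u)
    assert abs(beta_star - u) < 1e-3          # maximizer satisfies beta* = u
```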
Combining (A8) with (A11), we obtain
\[ R_1 \geq \frac{q}{n}\Big( \phi_i\big(x_i^\top w^{(t)}\big) + \phi_i^\ast\big(\alpha_i^{(t)}\big) - \alpha_i^{(t)} x_i^\top w^{(t)} \Big) + \frac{\gamma(1-q)q}{2n}\big(\beta^\ast - \alpha_i^{(t)}\big)^2 - \frac{1}{2\lambda n^2}\|x_i\|_2^2 (\Delta\alpha_i^\ast)^2 + \underbrace{\Big( -\frac{1}{n}\phi_i^\ast\big(\alpha_i^{(t)}\big) - \frac{\lambda}{2}\|w^{(t)}\|_2^2 \Big)}_{R_2}. \tag{A12} \]
Combining $\beta^\ast = \alpha_i^{(t)} + \Delta\alpha_i^\ast$ with (A6) and (A12), we have
\[ \begin{aligned} D(\alpha^{(t+1)}) - D(\alpha^{(t)}) &\geq \frac{q}{n}\Big( \phi_i\big(x_i^\top w^{(t)}\big) + \phi_i^\ast\big(\alpha_i^{(t)}\big) - \alpha_i^{(t)} x_i^\top w^{(t)} \Big) + \Big( \frac{\gamma(1-q)q}{2n} - \frac{\|x_i\|_2^2}{2\lambda n^2} \Big)(\Delta\alpha_i^\ast)^2 \\ &\geq \frac{q}{n}\Big( \phi_i\big(x_i^\top w^{(t)}\big) + \phi_i^\ast\big(\alpha_i^{(t)}\big) - \alpha_i^{(t)} x_i^\top w^{(t)} \Big) + \Big( \frac{\gamma(1-q)q}{2n} - \frac{1}{2\lambda n^2} \Big)(\Delta\alpha_i^\ast)^2, \tag{A13} \end{aligned} \]
where the last inequality uses the assumption $\|x_i\|_2^2 \leq 1$. Recall that we have supposed the $i$-th coordinate of $\alpha$ is chosen; thus, taking the expectation of (A13) with respect to $i$, we obtain
\[ \mathbb{E}\big[ D(\alpha^{(t+1)}) - D(\alpha^{(t)}) \big] \geq \frac{q}{n}\cdot\frac{1}{n}\sum_{i=1}^{n}\Big( \phi_i\big(x_i^\top w^{(t)}\big) + \phi_i^\ast\big(\alpha_i^{(t)}\big) - \alpha_i^{(t)} x_i^\top w^{(t)} \Big) + \Big( \frac{\gamma(1-q)q}{2n} - \frac{1}{2\lambda n^2} \Big)\frac{1}{n}\sum_{i=1}^{n}(\Delta\alpha_i^\ast)^2. \tag{A14} \]
Recall that
\[ P(w^{(t)}) - D(\alpha^{(t)}) = \frac{1}{n}\sum_{i=1}^{n}\Big( \phi_i\big(x_i^\top w^{(t)}\big) + \phi_i^\ast\big(\alpha_i^{(t)}\big) \Big) + \lambda\|w^{(t)}\|_2^2 \overset{\text{④}}{=} \frac{1}{n}\sum_{i=1}^{n}\Big( \phi_i\big(x_i^\top w^{(t)}\big) + \phi_i^\ast\big(\alpha_i^{(t)}\big) - \alpha_i^{(t)} x_i^\top w^{(t)} \Big), \tag{A15} \]
where in ④ we use the fact that $w^{(t)} = -\frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i^{(t)} x_i$, so that $\lambda\|w^{(t)}\|_2^2 = -\frac{1}{n}\sum_{i=1}^{n}\alpha_i^{(t)} x_i^\top w^{(t)}$.
Combining (A14) with (A15), we obtain
\[ \mathbb{E}\big[ D(\alpha^{(t+1)}) - D(\alpha^{(t)}) \big] \geq \frac{q}{n}\big( P(w^{(t)}) - D(\alpha^{(t)}) \big) + \Big( \frac{\gamma(1-q)q}{2n} - \frac{1}{2\lambda n^2} \Big)\frac{1}{n}\sum_{i=1}^{n}(\Delta\alpha_i^\ast)^2. \]
Setting $q = 1/2$ and $\lambda \geq \frac{4}{n\gamma}$, we have $\frac{\gamma(1-q)q}{2n} - \frac{1}{2\lambda n^2} = \frac{\gamma}{8n} - \frac{1}{2\lambda n^2} \geq 0$, and therefore
\[ \mathbb{E}\big[ D(\alpha^{(t+1)}) - D(\alpha^{(t)}) \big] \geq \frac{1}{2n}\big( P(w^{(t)}) - D(\alpha^{(t)}) \big). \tag{A17} \]
Note that $\alpha^\ast = \arg\max_\alpha D(\alpha)$; by weak duality, it is well known that $P(w^{(t)}) \geq D(\alpha^\ast) \geq D(\alpha^{(t)})$. Combined with (A17), we obtain
\[ \begin{aligned} \frac{1}{2n}\big( D(\alpha^\ast) - D(\alpha^{(t)}) \big) \leq \frac{1}{2n}\big( P(w^{(t)}) - D(\alpha^{(t)}) \big) &\leq \mathbb{E}\big[ D(\alpha^{(t+1)}) - D(\alpha^{(t)}) \big] \\ &= \mathbb{E}\big[ D(\alpha^{(t+1)}) - D(\alpha^\ast) + D(\alpha^\ast) - D(\alpha^{(t)}) \big] \\ &= \big( D(\alpha^\ast) - D(\alpha^{(t)}) \big) - \mathbb{E}\big[ D(\alpha^\ast) - D(\alpha^{(t+1)}) \big]. \end{aligned} \]
This further implies that
\[ \mathbb{E}\big[ D(\alpha^\ast) - D(\alpha^{(t+1)}) \big] \leq \Big( 1 - \frac{1}{2n} \Big)\big( D(\alpha^\ast) - D(\alpha^{(t)}) \big). \]
Until now, we have treated $\alpha^{(t)}$ as fixed, with the expectation taken over the random index $i$. Taking the expectation over the entire history of random indices, we obtain
\[ \mathbb{E}\big[ D(\alpha^\ast) - D(\alpha^{(t+1)}) \big] \leq \Big( 1 - \frac{1}{2n} \Big)^{t+1}\big( D(\alpha^\ast) - D(\alpha^{(0)}) \big). \]
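For intuition, the geometric factor above can be converted into an iteration count: the bound falls below a target $\varepsilon$ after roughly $2n\log(\varepsilon_0/\varepsilon)$ iterations, i.e., $O(n\log(1/\varepsilon))$. The helper `iterations_to_eps` below is a hypothetical illustration, not part of the paper:

```python
import math

def iterations_to_eps(n, eps0, eps):
    """Smallest t with (1 - 1/(2n))**t * eps0 <= eps."""
    return math.ceil(math.log(eps / eps0) / math.log(1.0 - 1.0 / (2 * n)))

t = iterations_to_eps(n=1000, eps0=1.0, eps=1e-6)
# t is minimal: the bound is below eps at t but not at t - 1.
assert (1 - 1 / 2000) ** t <= 1e-6 < (1 - 1 / 2000) ** (t - 1)
# ...and t is within the ~2n*log(eps0/eps) estimate.
assert t <= math.ceil(2 * 1000 * math.log(1e6)) + 1
```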
In addition, (A17) implies that
\[ \begin{aligned} \frac{1}{2n}\,\mathbb{E}\big[ P(w^{(t)}) - D(\alpha^{(t)}) \big] &\leq \mathbb{E}\big[ D(\alpha^{(t+1)}) - D(\alpha^{(t)}) \big] = \mathbb{E}\big[ D(\alpha^\ast) - D(\alpha^{(t)}) \big] - \mathbb{E}\big[ D(\alpha^\ast) - D(\alpha^{(t+1)}) \big] \\ &\leq \mathbb{E}\big[ D(\alpha^\ast) - D(\alpha^{(t)}) \big] \leq \Big( 1 - \frac{1}{2n} \Big)^{t}\big( D(\alpha^\ast) - D(\alpha^{(0)}) \big). \end{aligned} \]
This implies that $\mathbb{E}\big[ P(w^{(t)}) - D(\alpha^{(t)}) \big] \leq 2n \big( 1 - \frac{1}{2n} \big)^{t} \big( D(\alpha^\ast) - D(\alpha^{(0)}) \big)$. □
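As a sanity check on the argument as a whole, the coordinate step (A7) can be simulated on a toy problem. The squared loss $\phi_i(u) = (u - y_i)^2/2$ (so $\phi_i^\ast(a) = a^2/2 + a y_i$ and $\gamma = 1$), the random dataset, and the choice of $\lambda$ are illustrative assumptions, with $\lambda \geq 4/(n\gamma)$ so the proof's condition holds; the dual objective should then never decrease:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, lam = 50, 5, 0.5                      # lam = 0.5 >= 4/n = 0.08
X = rng.normal(size=(n, dim))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1))[:, None]   # enforce ||x_i|| <= 1
y = rng.normal(size=n)

def dual(alpha):
    """D(alpha) with w(alpha) = -(1/(lam*n)) * sum_i alpha_i x_i."""
    w = -X.T @ alpha / (lam * n)
    return -np.mean(alpha**2 / 2 + alpha * y) - lam / 2 * (w @ w)

alpha, w = np.zeros(n), np.zeros(dim)
D_vals = [dual(alpha)]
for _ in range(500):
    i = rng.integers(n)
    # Relaxed coordinate step: maximize d*x_i^T w - phi_i*(alpha_i + d);
    # for the squared loss the closed form is d = x_i^T w - y_i - alpha_i.
    d = X[i] @ w - y[i] - alpha[i]
    alpha[i] += d
    w -= d * X[i] / (lam * n)                 # O(d) incremental update of w
    D_vals.append(dual(alpha))

# Monotone dual ascent, as guaranteed when lam >= 4/(n*gamma).
assert all(b >= a - 1e-10 for a, b in zip(D_vals, D_vals[1:]))
assert D_vals[-1] > D_vals[0]
```

With this step size regime each update increases $D$ deterministically; the expectation in the proof is only needed to relate the per-coordinate progress to the full duality gap.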

References

  1. Zhang, Y.; Xiao, L. Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 353–361. [Google Scholar]
  2. Ruppert, D. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Publ. Am. Stat. Assoc. 2010, 99, 567. [Google Scholar] [CrossRef]
  3. Chiang, W.; Lee, M.; Lin, C. Parallel Dual Coordinate Descent Method for Large-scale Linear Classification in Multi-core Environments. In KDD ’16, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 1485–1494. [Google Scholar]
  4. Hsieh, C.; Chang, K.; Lin, C.; Keerthi, S.S.; Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 408–415. [Google Scholar]
  5. Shalev-Shwartz, S.; Zhang, T. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. J. Mach. Learn. Res. 2013, 14, 567–599. [Google Scholar]
  6. Chang, K.W.; Hsieh, C.J.; Lin, C.J. Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines. J. Mach. Learn. Res. 2008, 9, 1369–1398. [Google Scholar]
  7. Platt, J.C. Fast Training of Support Vector Machines Using Sequential Minimal Optimization; MIT Press: Cambridge, MA, USA, 1999; pp. 185–208. [Google Scholar]
  8. Naskovska, K.; Lau, S.; Korobkov, A.A.; Haueisen, J.; Haardt, M. Coupled CP decomposition of simultaneous MEG-EEG signals for differentiating oscillators during photic driving. Front. Neurosci. 2020, 14, 261. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  9. Lee, S.; Kim, E.; Kim, C.; Kim, K. Localization with a mobile beacon based on geometric constraints in wireless sensor networks. IEEE Trans. Wirel. Commun. 2009, 8, 5801–5805. [Google Scholar] [CrossRef]
  10. Wang, J.; Dong, P.; Jing, Z.; Cheng, J. Consensus-based filter for distributed sensor networks with colored measurement noise. Sensors 2018, 18, 3678. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  11. Anastassiu, H.T.; Vougioukas, S.; Fronimos, T.; Regen, C.; Petrou, L.; Zude, M.; Käthner, J. A computational model for path loss in wireless sensor networks in orchard environments. Sensors 2014, 14, 5118–5135. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  12. Deng, X.; Yin, L.; Peng, S.; Ding, M. An iterative algorithm for solving ill-conditioned linear least squares problems. Geod. Geodyn. 2015, 6, 453–459. [Google Scholar] [CrossRef] [Green Version]
  13. Shalev-Shwartz, S.; Zhang, T. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014. [Google Scholar]
  14. Bauschke, H.H.; Combettes, P.L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces; Springer: New York, NY, USA, 2011; Volume 408. [Google Scholar]
  15. Güler, O. New proximal point algorithms for convex minimization. SIAM J. Optim. 1992, 2, 649–664. [Google Scholar] [CrossRef]
  16. Nesterov, Y. Introductory Lectures on Convex Optimization; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2014; pp. xviii, 236. [Google Scholar]
  17. Frostig, R.; Ge, R.; Kakade, S.; Sidford, A. Un-regularizing: Approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2540–2548. [Google Scholar]
  18. Lin, H.; Mairal, J.; Harchaoui, Z. A Universal Catalyst for First-Order Optimization. Available online: https://proceedings.neurips.cc/paper/2015/hash/c164bbc9d6c72a52c599bbb43d8db8e1-Abstract.html (accessed on 29 June 2022).
  19. Johnson, R.; Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 315–323. [Google Scholar]
  20. Xiao, L.; Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 2014, 24, 2057–2075. [Google Scholar] [CrossRef]
  21. Zhang, Y.; Xiao, L. Stochastic primal-dual coordinate method for regularized empirical risk minimization. J. Mach. Learn. Res. 2017, 18, 2939–2980. [Google Scholar]
  22. Devolder, O.; Glineur, F.; Nesterov, Y. First-order methods of smooth convex optimization with inexact oracle. Math. Program. 2014, 146, 37–75. [Google Scholar] [CrossRef] [Green Version]
  23. Schmidt, M.; Roux, N.L.; Bach, F.R. Convergence rates of inexact proximal-gradient methods for convex optimization. In Proceedings of the Advances in Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; pp. 1458–1466. [Google Scholar]
  24. Hiriart-Urruty, J.B.; Lemaréchal, C. Fundamentals of Convex Analysis; Springer Science & Business Media: New York, NY, USA, 2012. [Google Scholar]
  25. Bertsekas, D.P. Convex Optimization Theory; Athena Scientific Belmont: Belmont, MA, USA, 2009. [Google Scholar]
  26. Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, Montreal, QC, Canada, 19–23 June 2017; pp. 1200–1205. [Google Scholar]
Figure 1. Dual-gap (y-axis) vs. the number of epochs (x-axis). Comparing ASPDC with other methods for smooth hinge SVM on real-world datasets with regularization coefficient $\lambda \in \{0.1, 0.01, 0.001, 0.0001\}$. The horizontal axis is the number of passes through the entire dataset, and the vertical axis is the logarithmic dual-gap.
Figure 2. Dual-gap (y-axis) vs. the number of epochs (x-axis). Comparing ASPDC-i with other methods for smooth hinge SVM on real-world datasets with regularization coefficient $\lambda \in \{10^{-6}, 10^{-7}, 10^{-8}\}$. The horizontal axis is the number of passes through the entire dataset, and the vertical axis is the logarithmic dual-gap.
Figure 3. Optimal primal value (y-axis) vs. the number of epochs (x-axis): Comparing ASPDC-i with SVRG for smooth hinge SVM on real-world datasets with the regularization coefficient $10^{-6}$. The x-axis is the number of passes through the entire dataset, and the y-axis is the primal objective value.
Table 1. Abbreviations used in this study.

| Complete Name | Abbreviation |
|---|---|
| Stochastic primal-dual coordinate ascent | SPDC |
| Stochastic dual coordinate ascent method | SDCA |
| Accelerated stochastic primal-dual coordinate ascent | ASPDC |
| Extended ASPDC to the ill-conditioned problem | ASPDC-i |
| Accelerated stochastic dual ascent | ASDCA |
Table 2. Complexity comparison of per-iteration and pass through data.

| Method | Per-Iteration | Pass through Data |
|---|---|---|
| ASPDC, ASPDC-i | O(r) | O(S) |
| SPD1, SPD1-VR | O(1) | O(nd) |
| SVRG [19] | O(d) | O(nd) |
Table 3. The running time for the dual-gap to reach the given precision ($10^{-6}$) when $\lambda = 0.01$.

| Dataset | SDCA | SPDC | ASPDC |
|---|---|---|---|
| a9a | 0.505 s | 1.311 s | 0.636 s |
| ijcnn | 0.984 s | 1.438 s | 1.183 s |
| covtype | 11.502 s | 20.526 s | 14.972 s |
Table 4. The average running time for the algorithms to pass through the entire dataset once when $\lambda = 0.01$.

| Dataset | SDCA | SPDC | ASPDC |
|---|---|---|---|
| a9a | 0.029 s | 0.052 s | 0.028 s |
| ijcnn | 0.053 s | 0.061 s | 0.052 s |
| covtype | 0.650 s | 0.840 s | 0.644 s |
Table 5. The running time for the dual-gap to reach the given precision ($10^{-4}$) when $\lambda = 10^{-6}$.

| Dataset | ASDCA | SPDC | ASPDC-i |
|---|---|---|---|
| a9a | 0.582 s | 2.262 s | 0.8464 s |
| ijcnn | 0.994 s | 3.127 s | 2.033 s |
| covtype | 8.407 s | 91.132 s | 47.734 s |
Table 6. The average running time for the algorithms to pass through the entire dataset once when $\lambda = 10^{-6}$.

| Dataset | ASDCA | SPDC | ASPDC-i |
|---|---|---|---|
| a9a | 0.0165 s | 0.0857 s | 0.0167 s |
| ijcnn | 0.0305 s | 0.0821 s | 0.0302 s |
| covtype | 0.208 s | 1.253 s | 0.408 s |

Liang, H.; Cai, H.; Wu, H.; Shang, F.; Cheng, J.; Li, X. ASPDC: Accelerated SPDC Regularized Empirical Risk Minimization for Ill-Conditioned Problems in Large-Scale Machine Learning. Electronics 2022, 11, 2382. https://doi.org/10.3390/electronics11152382
