1. Introduction
The extreme learning machine (ELM) [1] has received much attention in recent years owing to its fast training speed and good generalization. However, traditional ELMs run into memory limitations on large-scale datasets. In the era of big data in particular, datasets are usually extremely large and the data are often very high-dimensional [2,3,4,5,6]; this growing scale enlarges the hidden-layer output matrix, which leads to huge memory requirements and a heavy computational load in matrix-inversion-based (MI-based) solutions.
To address these limitations, enhanced ELMs with parallel or distributed structures have been implemented to meet the challenge of large-scale datasets, as summarized in Table 1. For example, ELMs based on the MapReduce framework can compute the required matrix multiplications in parallel [7,8] and learn efficiently from massive, rapidly updated datasets [9]. However, MapReduce-based parallel ELMs incur a large amount of extra overhead, which degrades the learning speed. An algorithm based on the Spark parallel framework was therefore proposed to speed up the whole ELM computing process for big data [10].
The methods discussed above accelerate MI-based solutions through parallel and distributed hardware structures and programming models. The alternating direction method of multipliers (ADMM), which avoids the time-consuming MI operation, is another effective approach to distributed optimization [3,11,12]. Under the ADMM framework, the model-fitting problem can be decomposed into a set of subproblems that are executed in parallel, yielding efficient classification performance and meeting the needs of large-scale data processing in real environments. To achieve optimal performance without user oversight, an adaptive method that automatically tunes the key algorithm parameters has been applied to improve the relaxed ADMM [13]. Appropriate selection of the penalty parameter is crucial to ADMM performance; since analytic results for its optimal choice are very limited, an adaptive penalty strategy based on residual balancing has been proposed [14]. Because a convex model-fitting problem can be split into concurrently executable subproblems, the regularized least-squares problem can be split across the coefficients and combined with a relaxation technique to achieve good convergence in big data environments [15]. Furthermore, elastic-net theory has been employed to simultaneously improve the sparsity and stability of the model, leading to an accelerated ADMM algorithm [16].
Table 1. Review of various approaches of enhanced ELMs in the literature.
| Framework | Utilized Techniques | Metrics | Datasets | Main Characteristics |
|---|---|---|---|---|
| Parallel or distributed learning | MapReduce [7] | Running time, Speedup | Synthetic datasets | Parallel computing ability, Efficient learning of large-scale data |
| | MapReduce [9] | Running time, Update ratio | Synthetic datasets | Efficient learning in massive rapidly updated datasets |
| | MapReduce [8] | Speedup, Scaleup, Sizeup | Real datasets | Parallelism, Low runtime memory, Good scalability |
| | Spark [10] | Running time, Speedup, Accuracy | Synthetic datasets | Fault tolerance, Persist/cache strategies |
| ADMM | Residuals normalization [14] | Iteration number | Synthetic datasets | Robust in sparse coding |
| | Adaptive penalty, Relaxation technique [13] | Iteration number | Real datasets | Without user oversight or parameter tuning |
| | Maximally splitting, Relaxation technique [15] | Convergence ratio, Acceleration ratios | Real datasets | Fast convergence, Less computation, High parallelism |
| | Inertial technique, Bregman distance [17] | Training time, Constraint errors | Synthetic datasets | Global convergence, High acceleration |
For real-time data classification posed as a convex optimization problem, the primal problem can be decomposed into several subproblems by leveraging ADMM [18]. The globally optimal solution to the original problem is then obtained by processing the subproblems in parallel. Fast convergence and parallelism make ADMM suitable for solving large-scale distributed optimization problems. However, a subproblem must be optimized at each iteration, which imposes a heavy computational burden [19]. Numerical experience has shown that the effective solution of subproblems is critical to ADMM performance [20].
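For reference, the generic scaled-form ADMM iteration that such decompositions build on takes the following standard form; this is the textbook template from the ADMM literature, not the specific update derived later in this paper:

```latex
% Scaled-form ADMM for  min_{x,z}  f(x) + g(z)   s.t.  Ax + Bz = c
\begin{aligned}
x^{k+1} &:= \arg\min_{x}\; f(x) + \tfrac{\rho}{2}\,\lVert Ax + Bz^{k} - c + u^{k}\rVert_2^2,\\
z^{k+1} &:= \arg\min_{z}\; g(z) + \tfrac{\rho}{2}\,\lVert Ax^{k+1} + Bz - c + u^{k}\rVert_2^2,\\
u^{k+1} &:= u^{k} + Ax^{k+1} + Bz^{k+1} - c,
\end{aligned}
```

where $\rho$ is the penalty factor and $u$ is the scaled dual variable; the $x$- and $z$-updates are precisely the subproblems whose solution cost dominates each iteration.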
Several alternatives are available for unconstrained optimization, such as Newton-type methods [21], Chebyshev-like methods [22], the quasi-Newton method (QNM) [23], and others [24]. These methods require little computational effort to calculate the search direction and therefore converge rapidly. The regularized least-squares (RLS) problem in the regularized ELM (RELM) mainly involves computing a Hessian matrix and the gradient of the cost function. Evaluating the second-order partial derivatives of the Hessian can be avoided by exploiting the displacement and first-derivative information of two adjacent iterates [25,26]. Combined with line-search techniques, this approach can achieve attractive global convergence properties; the classical construction is sketched below.
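Concretely, the "displacement and first-derivative information of two adjacent iterates" is the secant pair used by the classical BFGS update; in the usual notation (which may differ from the symbols used later in this paper) it reads:

```latex
% Secant pair from two adjacent iterates and the classical BFGS update of B_k
s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k),\qquad
B_{k+1} = B_k - \frac{B_k s_k s_k^{\top} B_k}{s_k^{\top} B_k s_k}
          + \frac{y_k y_k^{\top}}{y_k^{\top} s_k}.
```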
However, in machine learning and image processing, computing the Hessian matrix at each iteration is not a trivial task [13]. The cost of storing and working with Hessian approximations can be excessive for large matrices. To reduce this storage, variants of the quasi-Newton approach, such as limited-memory BFGS (L-BFGS) and stochastic QNMs [27,28,29,30,31], have been developed to store Hessian approximations compactly. Azam et al. [32] analyzed the convergence of L-BFGS on convex optimization problems and demonstrated its practical value for large-scale optimization. Aryan et al. [28] proposed a stochastic QNM (SQN) that uses second-order information to accelerate stochastic convergence and modified the BFGS update formula so that the eigenvalues of the Hessian approximation remain bounded, ensuring that extreme values of the function can be attained. Chen et al. [30] proposed a stochastic damped L-BFGS that introduces damping parameters to preserve positive definiteness and avoid ill-conditioned results during the Hessian update. The quasi-Newton update also involves a step-size parameter, which can be determined by a line-search method. Backtracking line search is commonly used to guarantee convergence when the model assumptions break down and an unstable step size would otherwise be produced, but it is time-consuming and may lose its advantage in other types of ELMs.
In this paper, we study a low-cost computational scheme for the ADMM and jointly devise an adaptive step-size selection. The stochastic damped optimal L-BFGS (R-SDL-BFGS) is thereby derived, which improves the computational efficiency of the ADMM. Our contributions can be summarized as follows:
- (1) Low-cost computational scheme: curvature information from recent iterations is used to reduce the computational cost;
- (2) Damped BFGS correction scheme: damping technology is introduced into BFGS to compensate for the loss of positive definiteness in the Hessian approximation and to keep the BFGS matrix positive definite in non-convex optimization;
- (3) Step-size selection scheme: a non-monotonic Wolfe-type strategy is applied to the memory gradient method, combined with BB spectral gradient descent, to obtain the optimal step-size factor.
Finally, we compare the proposed method with other ADMM variants through experiments on real-world classification and image-processing problems.
3. Adaptive Stochastic Damping Optimization for Limited Memory
Positive definite matrices play a significant role in convex optimization, and BFGS uses a positive definite matrix to approximate the Hessian. During the iterations, however, this Hessian approximation may become singular, which significantly affects the convergence of the algorithm. Moreover, BFGS requires the optimization problem to be convex; otherwise, the approximation may lose positive definiteness and the curvature along the step may fail to be positive. Therefore, we need to deal with non-convexity and ill-conditioned behavior to guarantee the positive definiteness of the Hessian approximation.
3.1. Proposed Damped SL-BFGS Method
In the optimization process of BFGS, the Hessian approximation may become non-positive definite if the curvature condition $s_k^{\top} y_k > 0$ (with $s_k$ the displacement and $y_k$ the gradient difference between two adjacent iterates) is not satisfied. In this case, the algorithm can no longer be guaranteed to move along a good search direction. Since the convergence of BFGS relies heavily on positive definiteness, damping technology [23] is used to correct the BFGS update formula. This compensates for the loss of positive definiteness in the Hessian approximation and therefore keeps it positive definite, as sketched below.
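A minimal sketch of the Powell-style damping correction commonly used for this purpose is given below; the constants 0.2 and 0.8 are the conventional choices and need not coincide with the exact damping rule adopted in this paper:

```latex
% Powell damping: replace y_k by \bar{y}_k so that s_k^\top \bar{y}_k > 0 is guaranteed
\theta_k =
\begin{cases}
1, & s_k^{\top} y_k \ge 0.2\, s_k^{\top} B_k s_k,\\[4pt]
\dfrac{0.8\, s_k^{\top} B_k s_k}{\,s_k^{\top} B_k s_k - s_k^{\top} y_k\,}, & \text{otherwise},
\end{cases}
\qquad
\bar{y}_k = \theta_k\, y_k + (1 - \theta_k)\, B_k s_k,
```

and the BFGS update is then applied with $y_k$ replaced by $\bar{y}_k$, which keeps the new approximation positive definite even when the original curvature condition fails.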
L-BFGS reduces the computational cost by limiting the amount of data that must be stored or transmitted. Nevertheless, such modifications do affect the accuracy of the Hessian approximation. Stochastic optimization methods are a popular tool for obtaining good solutions efficiently: by using stochastic gradient information to approximate the curvature of the objective function in convex optimization, the optimal solution can be obtained while the convergence is accelerated.
Because the noise of the stochastic gradient may be amplified without bound in the curvature estimation, the Hessian approximation matrix is negatively affected, which reduces the convergence speed. We therefore adjust the gradient-difference estimate and the displacement estimate with different batch sizes, thereby decoupling the computation of the stochastic gradient from that of the curvature estimate. By extending the random damping technique to L-BFGS, these two estimates are formed over an interval of length b (also called the batch size) and corrected by a scalar damping factor.
It is thus possible to design a more efficient scheme that computes the stochastic gradient and the target variable with a batch update of size b in each iteration, governed by a batch step size and fixed constants. A sketch of how such a mini-batch update can be combined with limited-memory damping is given below.
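The following Python sketch illustrates how a mini-batch curvature pair, a damping correction, and the limited-memory two-loop recursion can fit together. It is an illustration under simplifying assumptions (the damping uses the identity in place of the Hessian approximation, and the batch handling is schematic); it is not the paper's exact R-SDL-BFGS update.

```python
import numpy as np

def damped_pair(s, y, delta=0.2):
    """Powell-style damping of the curvature pair so that s^T y_bar > 0.
    The Hessian approximation is taken as the identity here (a simplifying
    assumption of this sketch), so B s reduces to s."""
    s_dot_s, s_dot_y = s @ s, s @ y
    if s_dot_y >= delta * s_dot_s:
        theta = 1.0
    else:
        theta = (1.0 - delta) * s_dot_s / (s_dot_s - s_dot_y)
    return theta * y + (1.0 - theta) * s

def lbfgs_direction(grad, pairs):
    """Two-loop recursion: returns the direction -H_k * grad using only the
    stored (s, y) pairs, ordered oldest first."""
    q = grad.copy()
    alphas = []
    rhos = [1.0 / (y @ s) for s, y in pairs]
    for (s, y), rho in zip(reversed(pairs), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    s_last, y_last = pairs[-1]
    q *= (s_last @ y_last) / (y_last @ y_last)   # common initial scaling
    for (s, y), rho, a in zip(pairs, rhos, reversed(alphas)):
        q += (a - rho * (y @ q)) * s
    return -q

# Schematic use inside a stochastic loop: s comes from the displacement of the
# iterates, y from the difference of mini-batch gradients of size b, e.g.
#   pairs.append((s, damped_pair(s, y)))
#   d = lbfgs_direction(batch_grad, pairs)
```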
Although knowledge of gradient information allows BFGS to gradually approximate the inverse of the Hessian, the search direction also plays a crucial role in the global convergence. We should ensure that the algorithm makes reasonable progress along the given search direction and focus on finding a suitable step length along this direction.
3.2. Robust Optimization Approach for Limited Memory
As a common criterion for searching along a descent direction, the success of an inexact line search hinges on the objective value decreasing monotonically at every step. In many cases, a non-monotonic search technique [34] can be leveraged to relax this condition while overcoming oscillation, but such a method tends to get trapped in local extrema when the initial value lies near a local valley of the function.
To avoid these problems, a non-monotonic Wolfe-type search strategy can be devised. This method combines the current iterate with function information from several past iterates to seek global solutions. When applied to a convex optimization problem, a trial objective value is accepted according to rules based on the worst objective value over a window of recent iterates, as formalized below.
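A standard form of such a non-monotone Wolfe-type acceptance rule (the exact window length and constants used in this paper may differ) compares the trial point with the worst objective value over the last few iterates:

```latex
% Non-monotone Wolfe-type conditions with window m(k) and 0 < \delta < \sigma < 1
f(x_k + \alpha_k d_k) \le \max_{0 \le j \le m(k)} f(x_{k-j})
    + \delta\,\alpha_k\, \nabla f(x_k)^{\top} d_k,
\qquad
\nabla f(x_k + \alpha_k d_k)^{\top} d_k \ge \sigma\, \nabla f(x_k)^{\top} d_k.
```

The initial trial step can be taken from the Barzilai–Borwein (BB) spectral rule, e.g. $\alpha_k^{\mathrm{BB}} = (s_{k-1}^{\top} s_{k-1})/(s_{k-1}^{\top} y_{k-1})$, and then adjusted until the two conditions above hold.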
5. Simulation Experiment and Result Analysis
The R-SDL-BFGS method discussed so far applies not only to unconstrained optimization problems but also to convex optimization problems. For unconstrained optimization, its ability to bound the optimal value and its stochastic robust approximation make the proposed method superior to other quasi-Newton algorithms. To verify this claim, simulations are carried out on four benchmark functions (the Branin function, Levy function N.13, Matyas function, and Three-Hump Camel function) to compare the performance of the R-SDL-BFGS, SD-BFGS, L-BFGS, and BFGS methods. Experiments are conducted using MATLAB 2019 (The MathWorks, Inc., Natick, MA, USA) on a desktop with an Intel Core i7-10700 8-core CPU and 16 GB of RAM. To ensure that the observed performance is not accidental, results are averaged over multiple runs, confirming that each algorithm's convergence behavior is consistent across experiments and that the comparison is robust. The specific description is shown in
Table 2.
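As a rough illustration of this kind of benchmark-function comparison, the snippet below minimizes the Branin function with SciPy's built-in BFGS and L-BFGS-B solvers and reports their iteration counts; these off-the-shelf solvers are only stand-ins and do not implement the SD-BFGS or R-SDL-BFGS variants evaluated in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def branin(p):
    """Branin benchmark function; its global minimum value is about 0.397887."""
    x, y = p
    a, b, c = 1.0, 5.1 / (4 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (y - b * x ** 2 + c * x - r) ** 2 + s * (1 - t) * np.cos(x) + s

x0 = np.array([2.0, 2.0])                      # common starting point
for method in ("BFGS", "L-BFGS-B"):
    res = minimize(branin, x0, method=method, tol=1e-8)
    print(f"{method:9s} iterations = {res.nit:3d}, f* = {res.fun:.6f}")
```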
For convex optimization problems, the R-SDL-BFGS method simplifies the solution of the subproblems through the Hessian approximation matrix, thereby reducing the ADMM computational cost and improving the speed of convergence. To verify this claim, simulations are carried out on eight benchmark datasets to compare the performance of the proposed R-SDL-BFGS-based LCC-AADMM with the MS-AADMM and RB-ADMM algorithms. The benchmark datasets are Gisette, USPS, Magic, BASEHOCK, Pendigits, Optical-Digits, Statlog, and PCMAC; their characteristics are shown in Table 3. The first six datasets are from the UCI machine learning repository, and the last two are from the ASU feature selection repository.
5.1. Comparative Analysis of Convergence Performance of Quasi-Newton Algorithms
This section reports the convergence speed of the different algorithms, measured by the number of iterations required before the error falls below the stopping tolerance Q. In general, a quasi-Newton method needs only simple line-search procedures to satisfy the termination condition, which keeps the computational cost of the training phase low. It is therefore convenient to use the iteration count to evaluate the effectiveness of each method.
As shown in Table 4 and Table 5, the standard BFGS algorithm avoids the problem of singular matrices by replacing the inverse matrix with a Hessian approximation matrix. However, this algorithm must compute and store the full matrix at each iteration, which leads to an expensive computational cost as well as slow convergence.
To ensure positive definiteness, the SD-BFGS algorithm employs damping technology to keep the BFGS matrix positive definite in non-convex optimization. However, because the global convergence of this algorithm depends on the objective decreasing monotonically, it is generally not well suited to problems whose iterates behave non-monotonically.
To attain asymptotic linear convergence, a non-monotonic Wolfe-type strategy can be applied to the memory gradient method (R-SDL-BFGS). By combining the function information of the current iterate with that of several past iterates, it overcomes the oscillation phenomenon and improves the global convergence of the algorithm.
From a theoretical point of view, the R-SDL-BFGS algorithm therefore offers better global convergence and a higher convergence speed than the BFGS, SD-BFGS, and L-BFGS methods. Experiments on the benchmark functions are presented in Table 4, which shows that the speed of convergence is quite acceptable. As can be seen from Table 5, R-SDL-BFGS converges faster than both BFGS and SD-BFGS on the CEC benchmark functions. The numerical results are thus fully consistent with the theoretical analysis, and the method achieves good performance in practice. The results in Figure 1, Figure 2, Figure 3, Figure 4 and Table 5 demonstrate that the R-SDL-BFGS algorithm converges faster than the other QNMs.
5.2. Convergence Performance Comparative Analysis
The complexity of an iterative algorithm is determined by the per-iteration computational complexity and the number of iterations: the per-iteration complexity is the number of floating-point operations required for a single iteration, and the iteration complexity is the number of iterations needed to reach a solution of given precision. Since the per-iteration complexity is almost the same for the algorithms considered here, convergence performance is evaluated by comparing the number of iterations.
In this section, the convergence performance of the different ADMM algorithms is compared under the same error condition, with the iterative termination condition (27) serving as the key criterion for evaluating convergence speed. Under the same termination condition, the LCC-AADMM, MS-AADMM, and RB-ADMM methods are evaluated on the eight benchmark datasets. To ensure that the observed performance is not accidental, results are averaged over multiple runs, confirming that each algorithm converges consistently across experiments. In addition, the way these algorithms compute and approximate the matrix is analyzed by comparing the number of iterations.
From a theoretical point of view, for the M-category classification problem, (28) can be rewritten in vectorized form, where ⊗ denotes the Kronecker product and vec(·) denotes the concatenation of all columns of a matrix into a single vector. The MS-AADMM iteration (31) and the RB-ADMM iteration (32) can then be expressed in the same vectorized form, with the associated variables defined accordingly. The standard identity underlying such a rewrite is illustrated below.
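The rewrite relies on the standard vectorization identity vec(AXB) = (Bᵀ ⊗ A) vec(X), where vec(·) stacks the columns of a matrix. A quick NumPy check of this identity (with arbitrary dimensions, purely for illustration) is shown below.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

def vec(M):
    """Column-wise vectorization: concatenate the columns of M into one vector."""
    return M.reshape(-1, order="F")

lhs = vec(A @ X @ B)                 # vec(A X B)
rhs = np.kron(B.T, A) @ vec(X)       # (B^T kron A) vec(X)
print(np.allclose(lhs, rhs))         # prints True
```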
5.2.1. Convergence Performance Analysis Compared with RB-ADMM
Combining (29) and (31), it can be seen that most of the iterative steps in LCC-AADMM are linear, which reduces the computational cost and improves the speed of convergence. From a theoretical point of view, the choice of the penalty factor is of practical importance for the overall performance of the model. Although RB-ADMM can automatically adjust the penalty factor by balancing the dual residuals against the primal residuals, its computational cost varies greatly with the size of the problem, and if no proper penalty factor is chosen the algorithm may not converge.
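The residual-balancing idea mentioned above is usually implemented with a simple rule of the following form; the thresholds μ = 10 and τ = 2 are the conventional defaults from the ADMM literature and are not necessarily those used by RB-ADMM.

```python
def update_penalty(rho, primal_res, dual_res, mu=10.0, tau=2.0):
    """Residual balancing: increase rho when the primal residual dominates,
    decrease it when the dual residual dominates, otherwise keep it fixed."""
    if primal_res > mu * dual_res:
        return rho * tau
    if dual_res > mu * primal_res:
        return rho / tau
    return rho
```

Whenever ρ is rescaled in this way, the scaled dual variable must be rescaled by the inverse factor so that the iteration remains consistent.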
A parameter selection scheme can provide fast and accurate estimates of the optimal algorithm parameters. To improve the convergence performance, LCC-AADMM uses step-size selection constraints to construct an adaptive parameter selection scheme, where the step size is chosen to satisfy the Wolfe conditions. Moreover, instead of computing the Hessian approximation afresh at every iteration, LCC-AADMM updates it in a simple manner that accounts for the curvature measured during the most recent steps. This makes LCC-AADMM converge more rapidly than RB-ADMM.
The simulation results for the objective function are given in Table 6, which shows that R-SDL-BFGS has the desired effect of reducing the ADMM computational cost. For comparison, the improvement rates of the different algorithms are shown in Table 7. As can be seen from Table 7, LCC-AADMM converges faster than RB-ADMM on the 2-class datasets, and its convergence speed is also improved on average on the 6-class and 10-class datasets.
5.2.2. Convergence Performance Analysis Compared with MS-AADMM
One powerful approach to obtaining the optimal output weights starts from an appropriate parameter selection scheme, which allows an adjustable step size to speed up convergence. For the practical numerical performance of ADMM, the subproblem-solving process is the key factor determining overall performance. However, MS-AADMM ignores this key factor.
It can be seen from (31) and (32) that LCC-AADMM converts the exact solution into an approximate one by performing inexact optimization with the help of the Hessian approximation matrix, which greatly reduces the computational cost and thereby improves the speed of convergence. Theoretically, the convergence performance of LCC-AADMM is therefore better than that of MS-AADMM.
A comparison of the convergence performance of the different methods is shown in Table 6. It can be seen that the R-SDL-BFGS algorithm clearly performs best in terms of classification performance, and Table 7 shows that the R-SDL-BFGS method improves on MS-AADMM in classification efficiency.
5.2.3. Overall Convergence Performance Analysis
The LCC-AADMM method divides the convex optimization problem of RELM into univariate subproblems that can be executed in parallel by using the maximum partitioning technique, which reduces the computational complexity in iterative updates. By introducing the R-SDL-BFGS algorithm, AADMM achieves inexact optimization with the help of the Hessian approximation matrix, which reduces the computational cost while maintaining a fast convergence speed.
Theoretically, LCC-AADMM usually has better convergence performance than the other algorithms in solving classification problems. It can be seen from Table 6 that the LCC-AADMM algorithm has the fastest convergence speed. The difference in performance among these methods is also evident as the size of the error is varied. Under the same conditions, the LCC-AADMM algorithm always converges best, as shown in Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12.
5.3. Accuracy Analysis for LCC-AADMM
Classification accuracy is one of the most important indicators for evaluating the performance and quality of a classification model. The fatal flaw of RELM is that it cannot be applied to large-scale distributed optimization problems owing to its high computational cost. In view of this shortcoming, LCC-AADMM is adopted to decompose the convex optimization problem into a set of subproblems that can be executed in parallel, thereby achieving efficient classification performance. We performed experiments on the eight benchmark classification datasets in Table 3 using the MI-based, MS-AADMM, and proposed methods. The accuracy of the test results is shown in Figure 13.
Figure 13 shows the performance of the different methods on the big data classification task. The LCC-AADMM method consistently outperforms the two competing ELM algorithms on all eight datasets and provides the best overall performance. In addition, LCC-AADMM shows a significant improvement over the best results obtained by the other two competing ADMM methods, demonstrating good classification accuracy and suitability for applications that require superior accuracy.
It can be concluded that the proposed method performs well on a wide variety of problems and does not require excessive computer time or storage compared with MI-based and MS-AADMM methods. In practice, this technique can be expected to provide good learning ability and satisfactory generalization performance.
6. Conclusions
In this paper, we implement distributed learning through the effective solution of subproblems; that is, the regularized LS problem in the RELM is split into a set of optimization subproblems. To solve these subproblems with high computational efficiency, an efficient LCC-AADMM based on the R-SDL-BFGS algorithm is proposed. The novelty of this method lies mainly in three aspects: (1) an SL-BFGS method is devised that uses a limited amount of storage and updates the quasi-Newton matrix continuously; (2) a random damping technique is proposed that adopts a new strategy for determining the step size at each iteration and guarantees the positive definiteness of the BFGS matrix, yielding high-quality learning; (3) based on the residual-balancing scheme, an adaptive penalty-factor selection strategy is applied to balance the distance from convergence against the residuals and thereby achieve good convergence.
The effectiveness of this approach is demonstrated on eight benchmark dataset example problems. The experiments show that the proposed method achieves good performance in certain cases and converges faster than other ADMM methods. The high parallelism of LCC-AADMM is further demonstrated by comparison with an MI-based method. The LCC-AADMM method therefore offers a complementary alternative for optimization problems in large-scale applications.