1. Introduction
The extreme learning machine (ELM) [1] has received much attention in recent years owing to its fast training speed and good generalization. However, traditional ELMs run into memory limitations on large-scale datasets. In the era of big data in particular, datasets are usually extremely large and the data are often very high-dimensional [2,3,4,5,6]; this growing scale enlarges the hidden-layer output matrix, which leads to huge memory requirements and a heavy computational load in matrix-inversion-based (MI-based) solutions.
To address these limitations, enhanced ELMs with parallel or distributed structures have been implemented to meet the challenge of large-scale datasets, as summarized in Table 1. For example, ELMs based on the MapReduce framework can compute the required matrix multiplications in parallel [7,8] and learn efficiently from massive, rapidly updated datasets [9]. However, MapReduce-based parallel ELMs incur a large amount of extra overhead, which degrades the learning speed. An algorithm based on the Spark parallel framework was therefore proposed to speed up the whole ELM computing process for big data [10].
The methods discussed above accelerate MI-based solutions through parallel and distributed hardware structures and programming models. The alternating direction method of multipliers (ADMM), which avoids the time-consuming MI operation, is another effective approach to distributed optimization [3,11,12]. Under the ADMM framework, the model-fitting problem can be decomposed into a set of subproblems that are executed in parallel, yielding efficient classification performance and meeting the needs of large-scale data processing in real environments. To achieve optimal performance without user oversight, an adaptive method that automatically tunes the key algorithm parameters has been applied to improve the relaxed ADMM [13]. Appropriate selection of the penalty parameter is crucial to ADMM performance; since analytic results for its optimal choice are very limited, an adaptive penalty strategy based on residual balancing has been proposed [14]. Because a convex model-fitting problem can be split into concurrently executable subproblems, the regularized least-squares problem can be split across the coefficients and combined with a relaxation technique to achieve good convergence in big data environments [15]. Furthermore, elastic-net theory has been employed to simultaneously improve the sparsity and stability of the model, leading to an accelerated ADMM algorithm [16].
Table 1. Review of various approaches of enhanced ELMs in the literature.
| Framework | Utilized Techniques | Metrics | Datasets | Main Characteristics |
|---|---|---|---|---|
| Parallel or distributed learning | MapReduce [7] | Running time, Speedup | Synthetic datasets | Parallel computing ability, Efficient learning of large-scale data |
| | MapReduce [9] | Running time, Update ratio | Synthetic datasets | Efficient learning in massive rapidly updated datasets |
| | MapReduce [8] | Speedup, Scaleup, Sizeup | Real datasets | Parallelism, Low runtime memory, Good scalability |
| | Spark [10] | Running time, Speedup, Accuracy | Synthetic datasets | Fault tolerance, Persist/cache strategies |
| ADMM | Residuals normalization [14] | Iteration number | Synthetic datasets | Robust in sparse coding |
| | Adaptive penalty, Relaxation technique [13] | Iteration number | Real datasets | Without user oversight or parameter tuning |
| | Maximally splitting, Relaxation technique [15] | Convergence ratio, Acceleration ratios | Real datasets | Fast convergence, Less computation, High parallelism |
| | Inertial technique, Bregman distance [17] | Training time, Constraint errors | Synthetic datasets | Global convergence, High acceleration |
For real-time data classification posed as a convex optimization problem, the primal problem can be decomposed into several subproblems by leveraging ADMM [18]. The globally optimal solution to the original problem is then obtained by processing the subproblems in parallel. Fast convergence and parallelism make ADMM suitable for solving large-scale distributed optimization problems. However, a subproblem must be optimized at each iteration, which imposes a heavy computational burden [19]. Numerical experience has shown that the effective solution of subproblems is critical to ADMM performance [20].
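For reference, the generic scaled-form ADMM iteration that such decompositions build on takes the following standard form; this is the textbook template from the ADMM literature, not the specific update derived later in this paper:

```latex
% Scaled-form ADMM for  min_{x,z}  f(x) + g(z)   s.t.  Ax + Bz = c
\begin{aligned}
x^{k+1} &:= \arg\min_{x}\; f(x) + \tfrac{\rho}{2}\,\lVert Ax + Bz^{k} - c + u^{k}\rVert_2^2,\\
z^{k+1} &:= \arg\min_{z}\; g(z) + \tfrac{\rho}{2}\,\lVert Ax^{k+1} + Bz - c + u^{k}\rVert_2^2,\\
u^{k+1} &:= u^{k} + Ax^{k+1} + Bz^{k+1} - c,
\end{aligned}
```

where $\rho$ is the penalty factor and $u$ is the scaled dual variable; the $x$- and $z$-updates are precisely the subproblems whose solution cost dominates each iteration.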
Several alternatives are available for unconstrained optimization, such as Newton-type methods [21], Chebyshev-like methods [22], the quasi-Newton method (QNM) [23], and others [24]. These methods require little computational effort to calculate the search direction and therefore converge rapidly. The regularized least-squares (RLS) problem in the regularized ELM (RELM) mainly involves computing a Hessian matrix and the gradient of the cost function. Evaluating the second-order partial derivatives of the Hessian can be avoided by exploiting the displacement and first-derivative information of two adjacent iterates [25,26]. Combined with line-search techniques, this approach can achieve attractive global convergence properties; the classical construction is sketched below.
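Concretely, the "displacement and first-derivative information of two adjacent iterates" is the secant pair used by the classical BFGS update; in the usual notation (which may differ from the symbols used later in this paper) it reads:

```latex
% Secant pair from two adjacent iterates and the classical BFGS update of B_k
s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k),\qquad
B_{k+1} = B_k - \frac{B_k s_k s_k^{\top} B_k}{s_k^{\top} B_k s_k}
          + \frac{y_k y_k^{\top}}{y_k^{\top} s_k}.
```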
However, in machine learning and image processing, computing the Hessian matrix at each iteration is not a trivial task [13]. The cost of storing and working with Hessian approximations can be excessive for large matrices. To reduce this storage, variants of the quasi-Newton approach, such as limited-memory BFGS (L-BFGS) and stochastic QNMs [27,28,29,30,31], have been developed to store Hessian approximations compactly. Azam et al. [32] analyzed the convergence of L-BFGS on convex optimization problems and demonstrated its practical value for large-scale optimization. Aryan et al. [28] proposed a stochastic QNM (SQN) that uses second-order information to accelerate stochastic convergence and modified the BFGS update formula so that the eigenvalues of the Hessian approximation remain bounded, ensuring that extreme values of the function can be attained. Chen et al. [30] proposed a stochastic damped L-BFGS that introduces damping parameters to preserve positive definiteness and avoid ill-conditioned results during the Hessian update. The quasi-Newton update also involves a step-size parameter, which can be determined by a line-search method. Backtracking line search is commonly used to guarantee convergence when the model assumptions break down and an unstable step size would otherwise be produced, but it is time-consuming and may lose its advantage in other types of ELMs.
In this paper, we study a low-cost computational scheme for the ADMM and jointly devise an adaptive step-size selection. The stochastic damped optimal L-BFGS (R-SDL-BFGS) is thereby derived, which improves the computational efficiency of the ADMM. Our contributions can be summarized as follows:
- (1) Low-cost computational scheme: curvature information from recent iterations is used to reduce the computational cost;
- (2) Damped BFGS correction scheme: damping technology is introduced into BFGS to compensate for the loss of positive definiteness in the Hessian approximation and to keep the BFGS matrix positive definite in non-convex optimization;
- (3) Step-size selection scheme: a non-monotonic Wolfe-type strategy is applied to the memory gradient method, combined with BB spectral gradient descent, to obtain the optimal step-size factor.
Finally, we compare the proposed method with other ADMM variants through experiments on real-world classification and image-processing problems.
3. Adaptive Stochastic Damping Optimization for Limited Memory
Positive definite matrices play a significant role in convex optimization, and BFGS uses a positive definite matrix to approximate the Hessian. During the iterations, however, this Hessian approximation may become singular, which significantly affects the convergence of the algorithm. Moreover, BFGS requires the optimization problem to be convex; otherwise, the approximation may lose positive definiteness and the curvature along the step may fail to be positive. Therefore, we need to deal with non-convexity and ill-conditioned behavior to guarantee the positive definiteness of the Hessian approximation.
3.1. Proposed Damped SL-BFGS Method
In the optimization process of BFGS, the Hessian approximation may become non-positive definite if the curvature condition $s_k^{\top} y_k > 0$ (with $s_k$ the displacement and $y_k$ the gradient difference between two adjacent iterates) is not satisfied. In this case, the algorithm can no longer be guaranteed to move along a good search direction. Since the convergence of BFGS relies heavily on positive definiteness, damping technology [23] is used to correct the BFGS update formula. This compensates for the loss of positive definiteness in the Hessian approximation and therefore keeps it positive definite, as sketched below.
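A minimal sketch of the Powell-style damping correction commonly used for this purpose is given below; the constants 0.2 and 0.8 are the conventional choices and need not coincide with the exact damping rule adopted in this paper:

```latex
% Powell damping: replace y_k by \bar{y}_k so that s_k^\top \bar{y}_k > 0 is guaranteed
\theta_k =
\begin{cases}
1, & s_k^{\top} y_k \ge 0.2\, s_k^{\top} B_k s_k,\\[4pt]
\dfrac{0.8\, s_k^{\top} B_k s_k}{\,s_k^{\top} B_k s_k - s_k^{\top} y_k\,}, & \text{otherwise},
\end{cases}
\qquad
\bar{y}_k = \theta_k\, y_k + (1 - \theta_k)\, B_k s_k,
```

and the BFGS update is then applied with $y_k$ replaced by $\bar{y}_k$, which keeps the new approximation positive definite even when the original curvature condition fails.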
L-BFGS reduces the computational cost by limiting the amount of data that must be stored or transmitted. Nevertheless, such modifications do affect the accuracy of the Hessian approximation. Stochastic optimization methods are a popular tool for obtaining good solutions efficiently: by using stochastic gradient information to approximate the curvature of the objective function in convex optimization, the optimal solution can be obtained while the convergence is accelerated.
Because the noise of the stochastic gradient may be amplified without bound in the curvature estimation, the Hessian approximation matrix is negatively affected, which reduces the convergence speed. We therefore adjust the gradient-difference estimate and the displacement estimate with different batch sizes, thereby decoupling the computation of the stochastic gradient from that of the curvature estimate. By extending the random damping technique to L-BFGS, these two estimates are formed over an interval of length b (also called the batch size) and corrected by a scalar damping factor.
It is thus possible to design a more efficient scheme that computes the stochastic gradient and the target variable with a batch update of size b in each iteration, governed by a batch step size and fixed constants. A sketch of how such a mini-batch update can be combined with limited-memory damping is given below.
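The following Python sketch illustrates how a mini-batch curvature pair, a damping correction, and the limited-memory two-loop recursion can fit together. It is an illustration under simplifying assumptions (the damping uses the identity in place of the Hessian approximation, and the batch handling is schematic); it is not the paper's exact R-SDL-BFGS update.

```python
import numpy as np

def damped_pair(s, y, delta=0.2):
    """Powell-style damping of the curvature pair so that s^T y_bar > 0.
    The Hessian approximation is taken as the identity here (a simplifying
    assumption of this sketch), so B s reduces to s."""
    s_dot_s, s_dot_y = s @ s, s @ y
    if s_dot_y >= delta * s_dot_s:
        theta = 1.0
    else:
        theta = (1.0 - delta) * s_dot_s / (s_dot_s - s_dot_y)
    return theta * y + (1.0 - theta) * s

def lbfgs_direction(grad, pairs):
    """Two-loop recursion: returns the direction -H_k * grad using only the
    stored (s, y) pairs, ordered oldest first."""
    q = grad.copy()
    alphas = []
    rhos = [1.0 / (y @ s) for s, y in pairs]
    for (s, y), rho in zip(reversed(pairs), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    s_last, y_last = pairs[-1]
    q *= (s_last @ y_last) / (y_last @ y_last)   # common initial scaling
    for (s, y), rho, a in zip(pairs, rhos, reversed(alphas)):
        q += (a - rho * (y @ q)) * s
    return -q

# Schematic use inside a stochastic loop: s comes from the displacement of the
# iterates, y from the difference of mini-batch gradients of size b, e.g.
#   pairs.append((s, damped_pair(s, y)))
#   d = lbfgs_direction(batch_grad, pairs)
```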
Although knowledge of gradient information allows BFGS to gradually approximate the inverse of the Hessian, the search direction also plays a crucial role in the global convergence. We should ensure that the algorithm makes reasonable progress along the given search direction and focus on finding a suitable step length along this direction.
3.2. Robust Optimization Approach for Limited Memory
As a common criterion for searching along a descent direction, the success of an inexact line search hinges on the objective value decreasing monotonically at every step. In many cases, a non-monotonic search technique [34] can be leveraged to relax this condition while overcoming oscillation, but such a method tends to get trapped in local extrema when the initial value lies near a local valley of the function.
To avoid these problems, a non-monotonic Wolfe-type search strategy can be devised. This method combines the current iterate with function information from several past iterates to seek global solutions. When applied to a convex optimization problem, a trial objective value is accepted according to rules based on the worst objective value over a window of recent iterates, as formalized below.
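A standard form of such a non-monotone Wolfe-type acceptance rule (the exact window length and constants used in this paper may differ) compares the trial point with the worst objective value over the last few iterates:

```latex
% Non-monotone Wolfe-type conditions with window m(k) and 0 < \delta < \sigma < 1
f(x_k + \alpha_k d_k) \le \max_{0 \le j \le m(k)} f(x_{k-j})
    + \delta\,\alpha_k\, \nabla f(x_k)^{\top} d_k,
\qquad
\nabla f(x_k + \alpha_k d_k)^{\top} d_k \ge \sigma\, \nabla f(x_k)^{\top} d_k.
```

The initial trial step can be taken from the Barzilai–Borwein (BB) spectral rule, e.g. $\alpha_k^{\mathrm{BB}} = (s_{k-1}^{\top} s_{k-1})/(s_{k-1}^{\top} y_{k-1})$, and then adjusted until the two conditions above hold.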
5. Simulation Experiment and Result Analysis
The R-SDL-BFGS method discussed so far applies not only to unconstrained optimization problems but also to convex optimization problems. For unconstrained optimization, its ability to bound the optimal value and its stochastic robust approximation make the proposed method superior to other quasi-Newton algorithms. To verify this claim, simulations are carried out on four benchmark functions (the Branin function, Levy function N.13, Matyas function, and Three-Hump Camel function) to compare the performance of the R-SDL-BFGS, SD-BFGS, L-BFGS, and BFGS methods. Experiments are conducted using MATLAB 2019 (The MathWorks, Inc., Natick, MA, USA) on a desktop with an Intel Core i7-10700 8-core CPU and 16 GB of RAM. To ensure that the observed performance is not accidental, results are averaged over multiple runs, confirming that each algorithm's convergence behavior is consistent across experiments and that the comparison is robust. The specific description is shown in
Table 2.
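As a rough illustration of this kind of benchmark-function comparison, the snippet below minimizes the Branin function with SciPy's built-in BFGS and L-BFGS-B solvers and reports their iteration counts; these off-the-shelf solvers are only stand-ins and do not implement the SD-BFGS or R-SDL-BFGS variants evaluated in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def branin(p):
    """Branin benchmark function; its global minimum value is about 0.397887."""
    x, y = p
    a, b, c = 1.0, 5.1 / (4 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (y - b * x ** 2 + c * x - r) ** 2 + s * (1 - t) * np.cos(x) + s

x0 = np.array([2.0, 2.0])                      # common starting point
for method in ("BFGS", "L-BFGS-B"):
    res = minimize(branin, x0, method=method, tol=1e-8)
    print(f"{method:9s} iterations = {res.nit:3d}, f* = {res.fun:.6f}")
```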
For convex optimization problems, the R-SDL-BFGS method simplifies the solution of the subproblems through the Hessian approximation matrix, thereby reducing the ADMM computational cost and improving the speed of convergence. To verify this claim, simulations are carried out on eight benchmark datasets to compare the performance of the proposed R-SDL-BFGS-based LCC-AADMM with the MS-AADMM and RB-ADMM algorithms. The benchmark datasets are Gisette, USPS, Magic, BASEHOCK, Pendigits, Optical-Digits, Statlog, and PCMAC; their characteristics are shown in Table 3. The first six datasets are from the UCI machine learning repository, and the last two are from the ASU feature selection repository.
5.1. Comparative Analysis of Convergence Performance of Quasi-Newton Algorithms
This section reports the convergence speed of the different algorithms, measured by the number of iterations required before the error falls below the stopping tolerance Q. In general, a quasi-Newton method needs only simple line-search procedures to satisfy the termination condition, which keeps the computational cost of the training phase low. It is therefore convenient to use the iteration count to evaluate the effectiveness of each method.
As shown in Table 4 and Table 5, the standard BFGS algorithm avoids the problem of singular matrices by replacing the inverse matrix with a Hessian approximation matrix. However, this algorithm must compute and store the full matrix at each iteration, which leads to an expensive computational cost as well as slow convergence.
To ensure positive definiteness, the SD-BFGS algorithm employs damping technology to keep the BFGS matrix positive definite in non-convex optimization. However, because the global convergence of this algorithm depends on the objective decreasing monotonically, it is generally not well suited to problems whose iterates behave non-monotonically.
To attain asymptotic linear convergence, a non-monotonic Wolfe-type strategy can be applied to the memory gradient method (R-SDL-BFGS). By combining the function information of the current iterate with that of several past iterates, it overcomes the oscillation phenomenon and improves the global convergence of the algorithm.
From a theoretical point of view, the R-SDL-BFGS algorithm therefore offers better global convergence and a higher convergence speed than the BFGS, SD-BFGS, and L-BFGS methods. Experiments on the benchmark functions are presented in Table 4, which shows that the speed of convergence is quite acceptable. As can be seen from Table 5, R-SDL-BFGS converges faster than both BFGS and SD-BFGS on the CEC benchmark functions. The numerical results are thus fully consistent with the theoretical analysis, and the method achieves good performance in practice. The results in Figure 1, Figure 2, Figure 3, Figure 4 and Table 5 demonstrate that the R-SDL-BFGS algorithm converges faster than the other QNMs.
5.2. Convergence Performance Comparative Analysis
The complexity of an iterative algorithm is determined by the per-iteration computational complexity and the number of iterations: the per-iteration complexity is the number of floating-point operations required for a single iteration, and the iteration complexity is the number of iterations needed to reach a solution of given precision. Since the per-iteration complexity is almost the same for the algorithms considered here, convergence performance is evaluated by comparing the number of iterations.
In this section, the convergence performance of the different ADMM algorithms is compared under the same error condition, with the iterative termination condition (27) serving as the key criterion for evaluating convergence speed. Under the same termination condition, the LCC-AADMM, MS-AADMM, and RB-ADMM methods are evaluated on the eight benchmark datasets. To ensure that the observed performance is not accidental, results are averaged over multiple runs, confirming that each algorithm converges consistently across experiments. In addition, the way these algorithms compute and approximate the matrix is analyzed by comparing the number of iterations.
From a theoretical point of view, for the M-category classification problem, (28) can be rewritten in vectorized form, where ⊗ denotes the Kronecker product and vec(·) denotes the concatenation of all columns of a matrix into a single vector. The MS-AADMM iteration (31) and the RB-ADMM iteration (32) can then be expressed in the same vectorized form, with the associated variables defined accordingly. The standard identity underlying such a rewrite is illustrated below.
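The rewrite relies on the standard vectorization identity vec(AXB) = (Bᵀ ⊗ A) vec(X), where vec(·) stacks the columns of a matrix. A quick NumPy check of this identity (with arbitrary dimensions, purely for illustration) is shown below.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

def vec(M):
    """Column-wise vectorization: concatenate the columns of M into one vector."""
    return M.reshape(-1, order="F")

lhs = vec(A @ X @ B)                 # vec(A X B)
rhs = np.kron(B.T, A) @ vec(X)       # (B^T kron A) vec(X)
print(np.allclose(lhs, rhs))         # prints True
```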
5.2.1. Convergence Performance Analysis Compared with RB-ADMM
Combining (29) and (31), it can be seen that most of the iterative steps in LCC-AADMM are linear, which reduces the computational cost and improves the speed of convergence. From a theoretical point of view, the choice of the penalty factor is of practical importance for the overall performance of the model. Although RB-ADMM can automatically adjust the penalty factor by balancing the dual residuals against the primal residuals, its computational cost varies greatly with the size of the problem, and if no proper penalty factor is chosen the algorithm may not converge.
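The residual-balancing idea mentioned above is usually implemented with a simple rule of the following form; the thresholds μ = 10 and τ = 2 are the conventional defaults from the ADMM literature and are not necessarily those used by RB-ADMM.

```python
def update_penalty(rho, primal_res, dual_res, mu=10.0, tau=2.0):
    """Residual balancing: increase rho when the primal residual dominates,
    decrease it when the dual residual dominates, otherwise keep it fixed."""
    if primal_res > mu * dual_res:
        return rho * tau
    if dual_res > mu * primal_res:
        return rho / tau
    return rho
```

Whenever ρ is rescaled in this way, the scaled dual variable must be rescaled by the inverse factor so that the iteration remains consistent.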
A parameter selection scheme can provide fast and accurate estimates of the optimal algorithm parameters. To improve the convergence performance, LCC-AADMM uses step-size selection constraints to construct an adaptive parameter selection scheme, where the step size is chosen to satisfy the Wolfe conditions. Moreover, instead of computing the Hessian approximation afresh at every iteration, LCC-AADMM updates it in a simple manner that accounts for the curvature measured during the most recent steps. This makes LCC-AADMM converge more rapidly than RB-ADMM.
The simulation results for the objective function are given in Table 6, which shows that R-SDL-BFGS has the desired effect of reducing the ADMM computational cost. For comparison, the improvement rates of the different algorithms are shown in Table 7. As can be seen from Table 7, LCC-AADMM converges faster than RB-ADMM on the 2-class datasets, and its convergence speed is also improved on average on the 6-class and 10-class datasets.
5.2.2. Convergence Performance Analysis Compared with MS-AADMM
One powerful approach to obtaining the optimal output weights starts from an appropriate parameter selection scheme, which allows an adjustable step size to speed up convergence. For the practical numerical performance of ADMM, the subproblem-solving process is the key factor determining overall performance. However, MS-AADMM ignores this key factor.
It can be seen from (31) and (32) that LCC-AADMM converts the exact solution into an approximate one by performing inexact optimization with the help of the Hessian approximation matrix, which greatly reduces the computational cost and thereby improves the speed of convergence. Theoretically, the convergence performance of LCC-AADMM is therefore better than that of MS-AADMM.
A comparison of the convergence performance of the different methods is shown in Table 6. It can be seen that the R-SDL-BFGS algorithm clearly performs best in terms of classification performance, and Table 7 shows that the R-SDL-BFGS method improves on MS-AADMM in classification efficiency.
5.2.3. Overall Convergence Performance Analysis
The LCC-AADMM method divides the convex optimization problem of RELM into univariate subproblems that can be executed in parallel by using the maximum partitioning technique, which reduces the computational complexity in iterative updates. By introducing the R-SDL-BFGS algorithm, AADMM achieves inexact optimization with the help of the Hessian approximation matrix, which reduces the computational cost while maintaining a fast convergence speed.
Theoretically, LCC-AADMM usually has better convergence performance than the other algorithms in solving classification problems. It can be seen from Table 6 that the LCC-AADMM algorithm has the fastest convergence speed. The difference in performance among these methods is also evident as the size of the error is varied. Under the same conditions, the LCC-AADMM algorithm always converges best, as shown in Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12.
5.3. Accuracy Analysis for LCC-AADMM
Classification accuracy is one of the most important indicators for evaluating the performance and quality of a classification model. The fatal flaw of RELM is that it cannot be applied to large-scale distributed optimization problems owing to its high computational cost. In view of this shortcoming, LCC-AADMM is adopted to decompose the convex optimization problem into a set of subproblems that can be executed in parallel, thereby achieving efficient classification performance. We performed experiments on the eight benchmark classification datasets in Table 3 using the MI-based, MS-AADMM, and proposed methods. The accuracy of the test results is shown in Figure 13.
Figure 13 shows the performance of the different methods on the big data classification task. The LCC-AADMM method consistently outperforms the two competing ELM algorithms on all eight datasets and provides the best overall performance. In addition, LCC-AADMM shows a significant improvement over the best results obtained by the other two competing ADMM methods, demonstrating good classification accuracy and suitability for applications that require superior accuracy.
It can be concluded that the proposed method performs well on a wide variety of problems and does not require excessive computer time or storage compared with MI-based and MS-AADMM methods. In practice, this technique can be expected to provide good learning ability and satisfactory generalization performance.
6. Conclusions
In this paper, we implement distributed learning through the effective solution of subproblems; that is, the regularized LS problem in the RELM is split into a set of optimization subproblems. To solve these subproblems with high computational efficiency, an efficient LCC-AADMM based on the R-SDL-BFGS algorithm is proposed. The novelty of this method lies mainly in three aspects: (1) an SL-BFGS method is devised that uses a limited amount of storage and updates the quasi-Newton matrix continuously; (2) a random damping technique is proposed that adopts a new strategy for determining the step size at each iteration and guarantees the positive definiteness of the BFGS matrix, yielding high-quality learning; (3) based on the residual-balancing scheme, an adaptive penalty-factor selection strategy is applied to balance the distance from convergence against the residuals and thereby achieve good convergence.
The effectiveness of this approach is demonstrated on eight benchmark dataset example problems. The experiments show that the proposed method achieves good performance in certain cases and converges faster than other ADMM methods. The high parallelism of LCC-AADMM is further demonstrated by comparison with an MI-based method. The LCC-AADMM method therefore offers a complementary alternative for optimization problems in large-scale applications.