Article

Robust Fisher-Regularized Twin Extreme Learning Machine with Capped L1-Norm for Classification

1 School of Mathematics and Information Science, North Minzu University, Yinchuan 750021, China
2 The Key Laboratory of Intelligent Information and Big Data Processing of NingXia Province, North Minzu University, Yinchuan 750021, China
* Author to whom correspondence should be addressed.
Axioms 2023, 12(7), 717; https://doi.org/10.3390/axioms12070717
Submission received: 27 June 2023 / Revised: 15 July 2023 / Accepted: 17 July 2023 / Published: 24 July 2023

Abstract: Twin extreme learning machine (TELM) is a classical and high-efficiency classifier. However, it neglects the statistical knowledge hidden inside the data. In this paper, in order to make full use of statistical information from sample data, we first come up with a Fisher-regularized twin extreme learning machine (FTELM) by applying Fisher regularization into the TELM learning framework. This strategy not only inherits the advantages of TELM, but also minimizes the within-class divergence of samples. Further, in an effort to further boost the anti-noise ability of the FTELM method, we propose a new capped L1-norm FTELM (CL1-FTELM) by introducing the capped L1-norm into FTELM to reduce the influence of abnormal points, and CL1-FTELM improves the robust performance of our FTELM. Then, for the proposed FTELM method, we utilize an efficient successive overrelaxation algorithm to solve the corresponding optimization problem. For the proposed CL1-FTELM, an iterative method is designed to solve the corresponding optimization problem based on a re-weighted technique. Meanwhile, the convergence and local optimality of CL1-FTELM are proved theoretically. Finally, numerical experiments on artificial and UCI datasets show that the proposed methods achieve better classification results than state-of-the-art methods in most cases, which demonstrates the effectiveness and stability of the proposed methods.

1. Introduction

Extreme learning machine (ELM) [1,2], as a remarkable training method for single hidden layer feed-forward neural networks (SLFNs) [3], has been widely studied and applied in many fields such as efficient modeling [4], fashion retailing forecasting [5], fingerprint matching [6], metagenomic taxonomic classification [7], online sequential learning [8], and feature selection [9]. In ELM, the input-layer weights and hidden-layer biases are randomly generated, and the output weights of the network are computed efficiently by minimizing the training error together with the norm of the output weights. In addition, many researchers have tried to extend the extreme learning machine model to the support vector machine (SVM) learning framework to solve classification problems [10]. Frenay et al. [11] found that the transformation performed by the first layer of ELM can be viewed as a kernel that can be plugged into SVM. Since SVM-type optimization methods can be utilized to solve the ELM model, an extreme learning machine based on the optimization method (OPTELM) was proposed in [12]. For binary classification problems, traditional ELM needs to compute all the sample points of the training data at the same time in the training stage, which is time-consuming. Moreover, traditional ELM trains a single hyperplane to perform the classification task, which greatly restricts its application prospects and directions of evolution. Jayadeva et al. [13] proposed the twin SVM (TWSVM), a well-known non-parallel hyperplane classification algorithm for binary classification. Inspired by TWSVM, Wan et al. [14] proposed the twin extreme learning machine (TELM). Compared with ELM, TELM trains two non-parallel hyperplanes for the classification task by solving two smaller quadratic programming problems (QPPs). Compared with TWSVM, TELM's optimization problems have fewer constraints, so the training speed is faster and the application prospects are broader. In recent years, researchers have made many improvements to TELM, such as the sparse twin extreme learning machine [15], the robust twin extreme learning machine [16], time-efficient variants of the twin extreme learning machine [17], and a generalized adaptive robust distance metric driven smooth regularization learning framework [18].
Although the above ELM-based algorithms have a good classification effect, the statistical knowledge of the data itself is ignored. However, the statistical knowledge of the data is very important for constructing an efficient classifier. Fisher discriminant analysis (FDA) is an effective discriminant tool that minimizes the intra-class divergence while keeping the inter-class divergence of the data constant. From the above discussion, it is natural to construct a new classification model by combining the characteristics of the ELM model and FDA. Recently, Ma et al. [19] successfully combined them and proposed the Fisher-regularized extreme learning machine (FELM), which not only retains the efficient solution of ELM but also fully considers the statistical knowledge of the data.
Although the above models have good classification performance, most of them use the L2-norm. When the data contain noise or outliers, they cannot deal with them well, which degrades the classification performance of the model. In recent years, researchers have tried to introduce the L1-norm into various models [20,21,22,23] to reduce the impact of outliers. These studies have shown that the L1-norm can reduce the effect of outliers to some extent; however, it is still unsatisfactory when the data contain a large number of outliers. Recently, researchers have introduced the idea of truncation into the L1-norm, constructed a new capped L1-norm, and applied it to various models [24,25,26]. Many studies [27,28] show that the capped L1-norm not only inherits the advantages of the L1-norm but is also bounded, so it is more robust and approximates the L0-norm to some degree. For instance, by applying the capped L1-norm to the twin SVM, Wang et al. [29] proposed a new robust twin support vector machine (CL1-TWSVM). Based on the twin support vector machine with privileged information [30] (TWSVMPI), a new robust TWSVMPI [31] was proposed by replacing the L2-norm with the capped L1-norm; the new model further improves the anti-noise ability of the model.
In order to utilize the advantages of the twin extreme learning machine and FDA, we first put forward a novel classifier named the Fisher-regularized twin extreme learning machine (FTELM). Considering the instability of the L2-norm with respect to outliers, we then introduce the capped L1-norm into the FTELM model and propose a more robust capped L1-norm FTELM (CL1-FTELM) model.
The main contributions of this paper are as follows:
(1) Based on the twin extreme learning machine and the Fisher-regularized extreme learning machine (FELM), a new Fisher-regularized twin extreme learning machine (FTELM) is proposed. FTELM minimizes the intra-class divergence while fixing the inter-class divergence of the samples. FTELM takes full account of the statistical information of the sample data, and its training speed is faster than that of FELM.
(2) Considering the instability of the L2-norm and the Hinge loss used by FTELM, we replace them with the capped L1-norm and propose a new capped L1-norm FTELM model. CL1-FTELM uses the capped L1-norm to reduce the influence of noise points and at the same time utilizes Fisher regularization to exploit the statistical knowledge of the data.
(3) Two algorithms are designed by utilizing the successive overrelaxation (SOR) [32] technique and the re-weighted technique [27] to solve the optimization problems of the proposed FTELM and CL1-FTELM, respectively.
(4) Two theorems about the convergence and local optimality of CL1-FTELM are proved.
The organizational structure of this paper is as follows. In Section 2, we briefly review related work. In Section 3, we describe the FTELM model in detail. The robust capped L1-norm FTELM learning framework, along with the related theoretical proofs, is described in Section 4. In Section 5, we describe numerical experiments on artificial and benchmark datasets. We summarize this paper in Section 6.

2. Related Work

In this section, we first define the notation needed in this paper, and then briefly review Fisher regularization, FELM, TELM, and the successive overrelaxation algorithm.

2.1. The Concept of Symbols

$e$ is a vector whose components are all ones, an identity matrix is represented by $I$, and a matrix (vector) of zeros is represented by $0$. $\|\cdot\|_2$ is the $L_2$-norm, and $\|\cdot\|_F$ stands for the Frobenius norm.
A binary classification problem in Euclidean space $\mathbb{R}^d$ can be formulated in the following form:
$T = \{(x_i, y_i)\} \subseteq X \times Y, \quad i = 1, \ldots, m \qquad (1)$
where $x_i \in X \subseteq \mathbb{R}^d$ is an input sample in a $d$-dimensional Euclidean space. Similarly, $y_i \in Y = \{-1, +1\}$ is the output label corresponding to the input instance $x_i$. In addition, $m_1$ and $m_2$ represent the numbers of positive and negative samples, respectively, and $m = m_1 + m_2$.

2.2. Fisher Regularization

Fisher regularization has the following form:
$\|f\|_F^2 = f^T N f = \sum_{i \in I_+}\left(f(x_i) - \bar{f}_+\right)^2 + \sum_{i \in I_-}\left(f(x_i) - \bar{f}_-\right)^2 \qquad (2)$
where $f = \left(f(x_1), f(x_2), \ldots, f(x_m)\right)^T$, $N = I - G$, $I \in \mathbb{R}^{m \times m}$ is the identity matrix, and $G$ is the matrix with the elements:
$G_{ij} = \begin{cases} \frac{1}{m_1}, & \text{for } i, j \in I_+ \\ \frac{1}{m_2}, & \text{for } i, j \in I_- \\ 0, & \text{otherwise} \end{cases} \qquad (3)$
where $I_{\pm}$ are the index sets of the positive and negative training data, $m_1 = |I_+|$, $m_2 = |I_-|$. The average value of $f(x)$ over the positive sample set is denoted by $\bar{f}_+$, and the average value of $f(x)$ over the negative sample set is denoted by $\bar{f}_-$. From Equation (2), we can see that the Fisher regularization term measures the intra-class divergence of the data.
The proof of Formula (2) is as follows:
$\sum_{i \in I_+}\left(f(x_i) - \bar{f}_+\right)^2 = \sum_{i \in I_+}\left(f^2(x_i) - 2 f(x_i)\bar{f}_+ + \bar{f}_+^2\right) = \sum_{i \in I_+} f^2(x_i) - m_1 \bar{f}_+^2 = f_+^T f_+ - \frac{1}{m_1} f_+^T e e^T f_+ = f_+^T I_+ f_+ - f_+^T M_+ f_+ = f_+^T\left(I_+ - M_+\right) f_+ = f_+^T N_1 f_+ \qquad (4)$
where $e = (1, \ldots, 1)^T$ is a vector of $m_1$ dimensions, $f_+ = \left(f(x_1), f(x_2), \ldots, f(x_{m_1})\right)^T$ with $i \in I_+$, $I_+ \in \mathbb{R}^{m_1 \times m_1}$ is the identity matrix, $M_+ \in \mathbb{R}^{m_1 \times m_1}$, and all the entries of the matrix $M_+$ are $\frac{1}{m_1}$.
Similarly, it can be obtained that:
$\sum_{i \in I_-}\left(f(x_i) - \bar{f}_-\right)^2 = f_-^T\left(I_- - M_-\right) f_- = f_-^T N_2 f_- \qquad (5)$
where $f_- = \left(f(x_1), f(x_2), \ldots, f(x_{m_2})\right)^T$ with $i \in I_-$, $I_- \in \mathbb{R}^{m_2 \times m_2}$ is the identity matrix, $M_- \in \mathbb{R}^{m_2 \times m_2}$, and all the entries of the matrix $M_-$ are $\frac{1}{m_2}$.
Combining Equations (4) and (5), we can get another form of Equation (2):
$f_+^T\left(I_+ - M_+\right) f_+ + f_-^T\left(I_- - M_-\right) f_- = \left(f_+, f_-\right)^T \left(I - \begin{pmatrix} M_+ & 0_1 \\ 0_2 & M_- \end{pmatrix}\right)\left(f_+, f_-\right) = f^T (I - G) f = f^T N f \qquad (6)$
where $0_1 \in \mathbb{R}^{m_1 \times m_2}$, $0_2 \in \mathbb{R}^{m_2 \times m_1}$, and $G = \begin{pmatrix} M_+ & 0_1 \\ 0_2 & M_- \end{pmatrix}$.
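As a quick check of this construction, the following NumPy sketch (an illustration, not code from the paper; the function name is an assumption) builds $N = I - G$ from the labels and verifies numerically that $f^T N f$ equals the within-class scatter of $f$.

```python
import numpy as np

def fisher_matrix(y):
    """Build N = I - G from binary labels y in {+1, -1} (Equations (2)-(3))."""
    y = np.asarray(y)
    m = y.size
    G = np.zeros((m, m))
    for label in (+1, -1):
        idx = np.where(y == label)[0]
        G[np.ix_(idx, idx)] = 1.0 / idx.size   # 1/m1 or 1/m2 on same-class index pairs
    return np.eye(m) - G

# Numerical check: f^T N f equals the within-class scatter of f.
rng = np.random.default_rng(0)
y = np.array([+1, +1, +1, -1, -1])
f = rng.normal(size=y.size)
N = fisher_matrix(y)
scatter = sum(np.sum((f[y == s] - f[y == s].mean()) ** 2) for s in (+1, -1))
assert np.isclose(f @ N @ f, scatter)
```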

2.3. Fisher-Regularized Extreme Learning Machine

The primal problem of Fisher-regularized extreme learning machine (FELM) is as follows:
$\min_{\beta, \xi}\ \frac{1}{2}\beta^T\beta + C_1 e^T\xi + \frac{1}{2}C_2\,\alpha^T K_{ELM} N K_{ELM}\,\alpha \quad \text{s.t.}\quad Y(H\beta) \ge e - \xi, \quad \xi \ge 0 \qquad (7)$
According to the representer theorem $\beta = \sum_{i=1}^{m}\alpha_i h(x_i) = H^T\alpha$, problem (7) can be written as problem (8):
$\min_{\alpha, \xi}\ \frac{1}{2}\alpha^T K_{ELM}\alpha + C_1 e^T\xi + \frac{1}{2}C_2\,\alpha^T K_{ELM} N K_{ELM}\,\alpha \quad \text{s.t.}\quad Y(K_{ELM}\alpha) \ge e - \xi, \quad \xi \ge 0 \qquad (8)$
where $K_{ELM} \in \mathbb{R}^{m \times m}$ is a Gram matrix with elements $k_{ELM}(x_i, x_j)$, $k_{ELM}(x_i, x) = h(x)^T h(x_i)$, $h(x)$ denotes the hidden-layer output vector of $x$, $Y \in \mathbb{R}^{m \times m}$ is a diagonal matrix with elements $y_i$, $C_1, C_2$ are the regularization parameters, and $\xi$ is a nonnegative slack vector.
According to the optimization theory, the dual form of the problem (8) can be obtained as follows:
$\min_{\theta}\ \frac{1}{2}\theta^T Q\theta - e^T\theta \quad \text{s.t.}\quad 0 \le \theta \le C_1 e \qquad (9)$
where $Q = Y\left[\left(I + C_2 N K_{ELM}\right)^{-1}\right]^T K_{ELM}\, Y$.
The decision function of the Fisher-regularized extreme learning machine is:
$f(x) = \mathrm{sign}\left(\sum_{i=1}^{m}\alpha_i\, k_{ELM}(x_i, x)\right) \qquad (10)$

2.4. Twin Extreme Learning Machine

Similar to the form of TWSVM [13], the primal problem of TELM [14] can be expressed in the following:
Primal TELM 1: $\min_{\beta_1, \xi}\ \frac{1}{2}\left\|H_1\beta_1\right\|_2^2 + C_1 e_2^T\xi \quad \text{s.t.}\quad -(H_2\beta_1) + \xi \ge e_2, \quad \xi \ge 0 \qquad (11)$
Primal TELM 2: $\min_{\beta_2, \eta}\ \frac{1}{2}\left\|H_2\beta_2\right\|_2^2 + C_2 e_1^T\eta \quad \text{s.t.}\quad H_1\beta_2 + \eta \ge e_1, \quad \eta \ge 0 \qquad (12)$
where $H_1$ and $H_2$ represent the hidden-layer outputs for the positive and negative samples, respectively, $\xi$ and $\eta$ represent the slack vectors, $0$ is a zero vector, $C_1, C_2 > 0$ are penalty parameters, and $e_1 \in \mathbb{R}^{m_1}$ and $e_2 \in \mathbb{R}^{m_2}$ are vectors of ones.
By introducing Lagrange multipliers $\alpha$ and $\vartheta$, the dual problems of (11) and (12) can be written as follows:
Dual TELM 1: $\min_{\alpha}\ \frac{1}{2}\alpha^T H_2\left(H_1^T H_1\right)^{-1} H_2^T\alpha - e_2^T\alpha \quad \text{s.t.}\quad 0 \le \alpha \le C_1 e_2 \qquad (13)$
Dual TELM 2: $\min_{\vartheta}\ \frac{1}{2}\vartheta^T H_1\left(H_2^T H_2\right)^{-1} H_1^T\vartheta - e_1^T\vartheta \quad \text{s.t.}\quad 0 \le \vartheta \le C_2 e_1 \qquad (14)$
The solutions of (13) and (14) are as follows:
$\beta_1 = -\left(H_1^T H_1 + \epsilon_1 I\right)^{-1} H_2^T\alpha \qquad (15)$
$\beta_2 = \left(H_2^T H_2 + \epsilon_2 I\right)^{-1} H_1^T\vartheta \qquad (16)$
where $\epsilon_1$ and $\epsilon_2$ are two small positive constants and $I$ is an identity matrix. The decision function of the twin extreme learning machine is:
$f(x) = \arg\min_{k=1,2} d_k(x) = \arg\min_{k=1,2}\left|\beta_k^T h(x)\right| \qquad (17)$

2.5. Successive Overrelaxation Algorithm

The successive overrelaxation algorithm [32] mainly aims at the following optimization problems:
$\min_{\mu}\ \frac{1}{2}\left\|H^T\mu\right\|_2^2 - e^T\mu \quad \text{s.t.}\quad \mu \in S = \{\mu \mid 0 \le \mu \le Ce\} \qquad (18)$
Let $HH^T = L + E + L^T$, where $L$ is the strictly lower triangular part of $HH^T$ and $E$ is the diagonal matrix formed by the diagonal elements of $HH^T$.
The gradient projection optimality condition is the necessary and sufficient optimality condition for Equation (18):
$\mu = \left(\mu - \pi E^{-1}\left(HH^T\mu - e\right)\right)_{\#}, \quad \pi > 0 \qquad (19)$
where $(\cdot)_{\#}$ denotes the 2-norm projection onto the feasible region of Equation (18), that is:
$\left(\mu_{\#}\right)_i = \begin{cases} 0, & \text{if } \mu_i \le 0 \\ \mu_i, & \text{if } 0 < \mu_i < C \\ C, & \text{if } \mu_i \ge C \end{cases}, \quad i = 1, 2, \ldots, m \qquad (20)$
The matrix $HH^T$ is split in the following form:
$HH^T = \pi^{-1} E (B + C) \quad \text{s.t.}\quad B - C \text{ is positive definite} \qquad (21)$
Here:
$B = I + \pi E^{-1} L, \qquad C = (\pi - 1) I + \pi E^{-1} L^T, \qquad 0 < \pi < 2$
According to [33], the matrix splitting algorithm is as follows:
$\mu^{i+1} = \left(\mu^{i+1} - \left(B\mu^{i+1} + C\mu^i - \pi E^{-1} e\right)\right)_{\#} \qquad (22)$
Substituting Equation (21) into Equation (22), it can be obtained that:
$\mu^{i+1} = \left(\mu^i - \pi E^{-1}\left(HH^T\mu^i - e + L\left(\mu^{i+1} - \mu^i\right)\right)\right)_{\#} \qquad (23)$
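To make the update (23) concrete, the following NumPy sketch (an illustration, not code from the paper; the function name and defaults are assumptions) performs the SOR sweep for a box-constrained quadratic program of the form (18). Because $L$ is strictly lower triangular, updating the components in order automatically uses the already-updated entries, which is exactly the Gauss–Seidel structure of Equation (23).

```python
import numpy as np

def sor_box_qp(A, C, pi=1.5, tol=1e-5, max_iter=1000):
    """Approximately solve min 0.5*mu^T A mu - e^T mu  s.t.  0 <= mu <= C."""
    m = A.shape[0]
    mu = np.zeros(m)
    diag = np.maximum(np.diag(A), 1e-12)   # diagonal of A = H H^T, guarded against zeros
    for _ in range(max_iter):
        mu_old = mu.copy()
        for i in range(m):                 # sweep in order: entries j < i are already updated
            grad_i = A[i] @ mu - 1.0
            mu[i] = np.clip(mu[i] - pi * grad_i / diag[i], 0.0, C)
        if np.linalg.norm(mu - mu_old) < tol:
            break
    return mu

# Usage for problem (18): A = H @ H.T, with the relaxation parameter pi chosen in (0, 2).
```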

3. Fisher-Regularized Twin Extreme Learning Machine

3.1. Model Formulation

As mentioned above, TELM solves two smaller QPPs, so its solution can be obtained quickly. However, it ignores the prior statistical knowledge of the data. FELM minimizes the within-class scatter while controlling the between-class scatter of the samples, but FELM needs to solve a large-scale quadratic programming problem, which is time-consuming. In this paper, by combining the advantages of FELM and TELM, we first propose the Fisher-regularized twin extreme learning machine (FTELM) by introducing Fisher regularization into the TELM feature space. FTELM only needs to solve two smaller quadratic programming problems and meanwhile utilizes the prior statistical knowledge of the data. The pair of FTELM primal problems is as follows:
Primal FTELM 1: $\min_{\beta_1, \xi}\ \frac{1}{2}\left\|H_1\beta_1\right\|^2 + C_1 e_2^T\xi + \frac{C_2}{2} f_1(x)^T N_1 f_1(x) \quad \text{s.t.}\quad -(H_2\beta_1) + \xi \ge e_2, \quad \xi \ge 0 \qquad (24)$
Primal FTELM 2: $\min_{\beta_2, \eta}\ \frac{1}{2}\left\|H_2\beta_2\right\|^2 + C_3 e_1^T\eta + \frac{C_4}{2} f_2(x)^T N_2 f_2(x) \quad \text{s.t.}\quad H_1\beta_2 + \eta \ge e_1, \quad \eta \ge 0 \qquad (25)$
From Equations (4) and (5), we know that $N_1 = I_+ - M_+$ and $N_2 = I_- - M_-$; $C_1, C_2, C_3, C_4 > 0$ are regularization parameters; $\xi$ and $\eta$ are the error vectors; and all the elements in the vectors $e_1 \in \mathbb{R}^{m_1}$ and $e_2 \in \mathbb{R}^{m_2}$ are one. FTELM first inherits the advantage of the classical twin extreme learning machine, which computes two non-parallel hyperplanes to solve the classification problem. Secondly, FTELM takes full account of the statistical information of the samples and further improves the classification accuracy of the classifier. The optimization objective function in (24) of FTELM mainly has three terms: minimizing the distance from the positive class sample points to the positive class hyperplane, minimizing the empirical loss, and minimizing the intra-class divergence of the samples. The constraint in (24) requires that the distance between the negative class sample points and the positive class hyperplane be greater than or equal to one. In a word, FTELM makes the positive class sample points closer to the positive class hyperplane and the negative class sample points far away from the positive class hyperplane; at the same time, the positive class sample points are more concentrated around their class center. A similar explanation holds for the model (25).
According to the representer theorem $\beta = \sum_{i=1}^{m}\alpha_i h(x_i) = H^T\alpha$, we have $\beta_1 = H_1^T\alpha_1$ and $\beta_2 = H_2^T\alpha_2$. We also know that $f = H\beta$. Therefore, problems (24) and (25) can be written in the following forms:
$\min_{\alpha_1, \xi}\ \frac{1}{2}\alpha_1^T K_{ELM_1}K_{ELM_1}\alpha_1 + C_1 e_2^T\xi + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \quad \text{s.t.}\quad -\left(H_2H_1^T\alpha_1\right) + \xi \ge e_2, \quad \xi \ge 0 \qquad (26)$
$\min_{\alpha_2, \eta}\ \frac{1}{2}\alpha_2^T K_{ELM_2}K_{ELM_2}\alpha_2 + C_3 e_1^T\eta + \frac{C_4}{2}\alpha_2^T K_{ELM_2}N_2K_{ELM_2}\alpha_2 \quad \text{s.t.}\quad H_1H_2^T\alpha_2 + \eta \ge e_1, \quad \eta \ge 0 \qquad (27)$
where $K_{ELM_1} = H_1H_1^T$ and $K_{ELM_2} = H_2H_2^T$ are Gram matrices.

3.2. Model Solution

Introducing Lagrange multipliers $\theta = (\theta_1, \ldots, \theta_{m_2})^T$ and $\vartheta = (\vartheta_1, \ldots, \vartheta_{m_2})^T$, the Lagrangian of (26) can be written as follows:
$L(\alpha_1, \xi, \theta, \vartheta) = \frac{1}{2}\alpha_1^T K_{ELM_1}\left(I_1 + C_2 N_1\right)K_{ELM_1}\alpha_1 + C_1 e_2^T\xi - \theta^T\left(-H_2H_1^T\alpha_1 + \xi - e_2\right) - \vartheta^T\xi \qquad (28)$
According to the KKT conditions, we get:
$\frac{\partial L}{\partial \alpha_1} = K_{ELM_1}\left(I_1 + C_2 N_1\right)K_{ELM_1}\alpha_1 + H_1H_2^T\theta = 0 \qquad (29)$
$\frac{\partial L}{\partial \xi} = C_1 e_2 - \theta - \vartheta = 0 \qquad (30)$
$\theta^T\left(-H_2H_1^T\alpha_1 + \xi - e_2\right) = 0 \qquad (31)$
$\vartheta^T\xi = 0 \qquad (32)$
$\theta \ge 0 \qquad (33)$
$\vartheta \ge 0 \qquad (34)$
From (29) and (30), we can get:
$\alpha_1^* = -\left[K_{ELM_1}\left(I_1 + C_2 N_1\right)K_{ELM_1}\right]^{-1} H_1H_2^T\theta \qquad (35)$
$0 \le \theta \le C_1 e_2 \qquad (36)$
By substituting (29)–(34) into (28), the dual optimization problem of (26) can be written in the following form:
Dual FTELM 1: $\min_{\theta}\ \frac{1}{2}\theta^T Q_1\theta - e_2^T\theta \quad \text{s.t.}\quad 0 \le \theta \le C_1 e_2 \qquad (37)$
Here $Q_1 = H_2H_1^T\left[K_{ELM_1}\left(I_1 + C_2 N_1\right)K_{ELM_1}\right]^{-1} H_1H_2^T$.
Similarly, we can obtain the dual of (27) as:
Dual FTELM 2: $\min_{\lambda}\ \frac{1}{2}\lambda^T Q_2\lambda - e_1^T\lambda \quad \text{s.t.}\quad 0 \le \lambda \le C_3 e_1 \qquad (38)$
Here $\lambda = (\lambda_1, \ldots, \lambda_{m_1})^T$ is the vector of Lagrange multipliers and $Q_2 = H_1H_2^T\left[K_{ELM_2}\left(I_2 + C_4 N_2\right)K_{ELM_2}\right]^{-1} H_2H_1^T$.
We use the successive overrelaxation (SOR) [32] technique to solve the convex quadratic optimization problems (37) and (38) (the SOR-FTELM algorithm is summarized as Algorithm 1), which gives $\theta$ and $\lambda$. Therefore, we can obtain the solutions of problems (24) and (25) as follows:
$\beta_1 = -H_1^T\left[K_{ELM_1}\left(I_1 + C_2 N_1\right)K_{ELM_1} + \delta_1 I_1\right]^{-1} H_1H_2^T\theta \qquad (39)$
$\beta_2 = H_2^T\left[K_{ELM_2}\left(I_2 + C_4 N_2\right)K_{ELM_2} + \delta_2 I_2\right]^{-1} H_2H_1^T\lambda \qquad (40)$
where $\delta_1, \delta_2 > 0$ are small regularization constants. The decision function of FTELM is:
$f(x) = \arg\min_{k=1,2}\left|\beta_k^T h(x)\right| \qquad (41)$
Algorithm 1 The procedure of SOR-FTELM.
Input: Training set $T = \{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^d$, $y_i = \pm 1$; the number of hidden nodes $L$; tolerance $\varepsilon$; regularization parameters $C_1, C_2, C_3, C_4$.
Output: $\beta_1$, $\beta_2$, and the decision function of FTELM.
1: Compute the graph matrices $N_1, N_2$ by Equations (4) and (5).
2: Choose an activation function such as $G(x) = \frac{1}{1 + e^{-x}}$, compute the hidden-layer output matrices $H_1, H_2$ by $h(x_i) = G\left(\sum_{j=1}^{d}\omega_{ji}x_j + b_i\right)$, and compute $K_{ELM_1} = H_1H_1^T$ and $K_{ELM_2} = H_2H_2^T$.
3: Choose $t \in (0, 2)$ and start with any $\theta^0 \in \mathbb{R}^{m_2}$. Having $\theta^i$, compute $\theta^{i+1}$ as follows:
$\theta^{i+1} = \left(\theta^i - tE_1^{-1}\left(Q_1\theta^i - e_2 + L_1\left(\theta^{i+1} - \theta^i\right)\right)\right)_{\#}$
until $\left\|\theta^{i+1} - \theta^i\right\| \le \varepsilon$, where $e_2$ is a vector of ones of appropriate dimension, $L_1 \in \mathbb{R}^{m_2 \times m_2}$ is the strictly lower triangular part of $Q_1$ (with $l_{ij} = q_{ij}$ for $i > j$), and $E_1 \in \mathbb{R}^{m_2 \times m_2}$ is the diagonal matrix with $e_{ii} = q_{ii}$.
Then, given any $\lambda^0 \in \mathbb{R}^{m_1}$ and having $\lambda^i$, compute $\lambda^{i+1}$ as follows:
$\lambda^{i+1} = \left(\lambda^i - tE_2^{-1}\left(Q_2\lambda^i - e_1 + L_2\left(\lambda^{i+1} - \lambda^i\right)\right)\right)_{\#}$
4: Compute the output weights $\beta_1, \beta_2$ using Equations (39) and (40).
5: Construct the decision function:
$f(x) = \arg\min_{k=1,2}\left|\beta_k^T h(x)\right|$
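For concreteness, the following self-contained NumPy sketch walks through the main steps of Algorithm 1: a random sigmoid hidden layer, the within-class matrices $N_1, N_2$ of Equations (4) and (5), the dual matrices $Q_1, Q_2$ of (37) and (38), a simple SOR-style box-constrained solve, and the output weights of Equations (39) and (40) under the sign conventions reconstructed above. It is an illustration only, not the authors' MATLAB implementation; the function names, the small ridge term `delta`, and the componentwise solver are assumptions.

```python
import numpy as np

def hidden_output(X, W, b):
    """Sigmoid hidden layer h(x) = G(Wx + b), one row per sample."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def within_class_matrix(m):
    """N = I - M with every entry of M equal to 1/m (Equations (4)/(5))."""
    return np.eye(m) - np.full((m, m), 1.0 / m)

def sor_box_qp(Q, C, pi=1.5, tol=1e-5, max_iter=500):
    """Gauss-Seidel/SOR sweep for min 0.5 u^T Q u - e^T u, 0 <= u <= C."""
    u = np.zeros(Q.shape[0])
    d = np.maximum(np.diag(Q), 1e-12)
    for _ in range(max_iter):
        u_old = u.copy()
        for i in range(u.size):
            u[i] = np.clip(u[i] - pi * (Q[i] @ u - 1.0) / d[i], 0.0, C)
        if np.linalg.norm(u - u_old) < tol:
            break
    return u

def ftelm_fit(X, y, L=100, C1=1.0, C2=1.0, C3=1.0, C4=1.0, delta=1e-6, seed=0):
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    W, b = rng.normal(size=(X.shape[1], L)), rng.normal(size=L)
    H1 = hidden_output(X[y == +1], W, b)              # positive-class hidden outputs
    H2 = hidden_output(X[y == -1], W, b)              # negative-class hidden outputs
    K1, K2 = H1 @ H1.T, H2 @ H2.T
    N1, N2 = within_class_matrix(H1.shape[0]), within_class_matrix(H2.shape[0])

    A1 = K1 @ (np.eye(K1.shape[0]) + C2 * N1) @ K1 + delta * np.eye(K1.shape[0])
    A2 = K2 @ (np.eye(K2.shape[0]) + C4 * N2) @ K2 + delta * np.eye(K2.shape[0])
    Q1 = H2 @ H1.T @ np.linalg.solve(A1, H1 @ H2.T)   # dual matrix of (37)
    Q2 = H1 @ H2.T @ np.linalg.solve(A2, H2 @ H1.T)   # dual matrix of (38)

    theta = sor_box_qp(Q1, C1)
    lam = sor_box_qp(Q2, C3)
    beta1 = -H1.T @ np.linalg.solve(A1, H1 @ H2.T @ theta)   # Equation (39)
    beta2 = H2.T @ np.linalg.solve(A2, H2 @ H1.T @ lam)      # Equation (40)
    return W, b, beta1, beta2

def ftelm_predict(X, W, b, beta1, beta2):
    """Assign each point to the closer hyperplane in the sense of Equation (41)."""
    H = hidden_output(X, W, b)
    return np.where(np.abs(H @ beta1) <= np.abs(H @ beta2), +1, -1)
```

Here `X` is an m-by-d array and `y` a vector of ±1 labels; the small ridge term merely keeps the inverses in (39) and (40) well conditioned.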

4. Capped L1-Norm Fisher-Regularized Twin Extreme Learning Machine

4.1. Model Formulation

The Fisher-regularized twin extreme learning machine proposed in the previous section not only inherits the advantages of the twin extreme learning machine but also makes full use of the statistical information of the samples. However, due to the use of the squared L2-norm distance and the hinge loss function, FTELM is not robust enough when noisy points are present, which often enlarges the impact of abnormal values. In order to reduce the influence of outliers and improve the robustness of FTELM, we propose a capped L1-norm Fisher-regularized twin extreme learning machine (CL1-FTELM) by replacing the L2-norm and hinge loss in FTELM with the capped L1-norm. The primal problems of CL1-FTELM are as follows:
Primal CL1-FTELM 1:
$\min_{\alpha_1, \xi}\ \sum_{i=1}^{m_1}\min\left(\left\|h^T(x_i)H_1^T\alpha_1\right\|_1, \varepsilon_1\right) + C_1\sum_{j=1}^{m_2}\min\left(\left\|\xi_j\right\|_1, \varepsilon_2\right) + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \quad \text{s.t.}\quad -\left(H_2H_1^T\alpha_1\right) + \xi \ge e_2 \qquad (42)$
Primal CL1-FTELM 2:
$\min_{\alpha_2, \eta}\ \sum_{j=1}^{m_2}\min\left(\left\|h^T(x_j)H_2^T\alpha_2\right\|_1, \varepsilon_3\right) + C_3\sum_{i=1}^{m_1}\min\left(\left\|\eta_i\right\|_1, \varepsilon_4\right) + \frac{C_4}{2}\alpha_2^T K_{ELM_2}N_2K_{ELM_2}\alpha_2 \quad \text{s.t.}\quad H_1H_2^T\alpha_2 + \eta \ge e_1 \qquad (43)$
where $C_1, C_2, C_3, C_4 > 0$ are regularization parameters and $\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4$ are thresholding parameters.
CL1-FTELM uses the capped L1-norm to reduce the influence of noise points, and at the same time utilizes Fisher regularization to exploit the statistical knowledge of the data. Based on FTELM, CL1-FTELM changes the L2-norm metric and the Hinge loss function of the original model to the capped L1-norm. The capped L1-norm is bounded and can constrain the impact of noise within a certain range; therefore, the anti-noise ability of the model can be improved. The optimization objective function in (42) of CL1-FTELM also contains three terms: minimizing the distance between the positive class sample points and the positive class hyperplane by using the capped L1-norm metric, minimizing the empirical loss by using the capped L1-norm loss function, and minimizing the within-class scatter of the samples. The constraint in (42) of CL1-FTELM requires that the distance between the negative class sample points and the positive class hyperplane be greater than or equal to one. In summary, CL1-FTELM inherits the advantages of FTELM while further improving the noise immunity of the model by replacing the metric and loss function with the capped L1-norm. However, the CL1-FTELM problem is non-convex and non-smooth. Here, we use the re-weighting technique [27] to solve the problem corresponding to the CL1-FTELM model, which is shown below:
CL1-FTELM 1:
$\min_{\alpha_1, \xi}\ \frac{1}{2}\alpha_1^T K_{ELM_1}FK_{ELM_1}\alpha_1 + \frac{C_1}{2}\xi^T D\xi + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \quad \text{s.t.}\quad -\left(H_2H_1^T\alpha_1\right) + \xi \ge e_2 \qquad (44)$
where $F$ and $D$ are two diagonal matrices whose $i$-th and $j$-th diagonal elements are:
$f_i = \begin{cases} \dfrac{1}{\left\|h^T(x_i)H_1^T\alpha_1\right\|_1}, & \left\|h^T(x_i)H_1^T\alpha_1\right\|_1 \le \varepsilon_1,\ i \in \{1, \ldots, m_1\} \\ \sigma_1, & \text{otherwise} \end{cases} \qquad (45)$
$d_j = \begin{cases} \dfrac{1}{\left\|\xi_j\right\|_1}, & \left\|\xi_j\right\|_1 \le \varepsilon_2,\ j \in \{1, \ldots, m_2\} \\ \sigma_2, & \text{otherwise} \end{cases} \qquad (46)$
Here $\sigma_1, \sigma_2$ are two small constants.
CL1-FTELM 2:
$\min_{\alpha_2, \eta}\ \frac{1}{2}\alpha_2^T K_{ELM_2}RK_{ELM_2}\alpha_2 + \frac{C_3}{2}\eta^T S\eta + \frac{C_4}{2}\alpha_2^T K_{ELM_2}N_2K_{ELM_2}\alpha_2 \quad \text{s.t.}\quad H_1H_2^T\alpha_2 + \eta \ge e_1 \qquad (47)$
where $R$ and $S$ are two diagonal matrices whose $j$-th and $i$-th diagonal elements are:
$r_j = \begin{cases} \dfrac{1}{\left\|h^T(x_j)H_2^T\alpha_2\right\|_1}, & \left\|h^T(x_j)H_2^T\alpha_2\right\|_1 \le \varepsilon_3,\ j \in \{1, \ldots, m_2\} \\ \sigma_3, & \text{otherwise} \end{cases} \qquad (48)$
$s_i = \begin{cases} \dfrac{1}{\left\|\eta_i\right\|_1}, & \left\|\eta_i\right\|_1 \le \varepsilon_4,\ i \in \{1, \ldots, m_1\} \\ \sigma_4, & \text{otherwise} \end{cases} \qquad (49)$
Here $\sigma_3, \sigma_4$ are two small constants.
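The re-weighting rules (45), (46), (48), and (49) all share the same pattern, sketched below in NumPy (an illustration with assumed names, not the authors' code): residuals inside the cap receive the weight 1/|u|, while residuals beyond the cap receive a small constant, so outliers contribute almost nothing to the surrogate objective.

```python
import numpy as np

def capped_l1_weights(u, eps, sigma=1e-4, floor=1e-8):
    """Diagonal weights f_i (or d_j, r_j, s_i) for the surrogate problems (44)/(47)."""
    a = np.abs(np.asarray(u, dtype=float))
    w = np.where(a <= eps, 1.0 / np.maximum(a, floor), sigma)
    return np.diag(w)

# Example: three residuals of which one clearly exceeds the cap eps = 1.0.
u = np.array([0.05, 0.2, 3.0])
F = capped_l1_weights(u, eps=1.0)
print(np.diag(F))   # weights 20, 5, and 1e-4 -> the outlier is strongly down-weighted
```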

4.2. Model Solution

Introducing Lagrange multipliers α , the Lagrange function of (44) can be written as follows:
$L(\alpha_1, \xi, \alpha) = \frac{1}{2}\alpha_1^T K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}\alpha_1 + \frac{C_1}{2}\xi^T D\xi - \alpha^T\left(-H_2H_1^T\alpha_1 + \xi - e_2\right) \qquad (50)$
According to the KKT conditions, we can get the following formulas:
$\frac{\partial L}{\partial \alpha_1} = K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}\alpha_1 + H_1H_2^T\alpha = 0 \qquad (51)$
$\frac{\partial L}{\partial \xi} = C_1 D\xi - \alpha = 0 \qquad (52)$
$\alpha^T\left(-H_2H_1^T\alpha_1 + \xi - e_2\right) = 0 \qquad (53)$
$\alpha \ge 0 \qquad (54)$
From Equations (51) and (52), we can get:
$\alpha_1 = -\left[K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}\right]^{-1} H_1H_2^T\alpha, \qquad \xi = \frac{1}{C_1}D^{-1}\alpha$
Similarly, we can get:
$\alpha_2 = \left[K_{ELM_2}\left(R + C_4 N_2\right)K_{ELM_2}\right]^{-1} H_2H_1^T\lambda, \qquad \eta = \frac{1}{C_3}S^{-1}\lambda$
Thus, we can get the dual problem of (44) as follows:
Dual CL1-FTELM 1:
$\min_{\alpha \ge 0}\ \frac{1}{2}\alpha^T\left[H_2H_1^T Q_1^{-1} H_1H_2^T + \frac{1}{C_1}D^{-1}\right]\alpha - e_2^T\alpha \qquad (55)$
where $Q_1 = K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}$.
In the same way, we can obtain the dual problem of Equation (47) as follows:
Dual CL1-FTELM 2:
$\min_{\lambda \ge 0}\ \frac{1}{2}\lambda^T\left[H_1H_2^T Q_2^{-1} H_2H_1^T + \frac{1}{C_3}S^{-1}\right]\lambda - e_1^T\lambda \qquad (56)$
where $Q_2 = K_{ELM_2}\left(R + C_4 N_2\right)K_{ELM_2}$.
After solving (55) and (56), $\alpha$ and $\lambda$ are derived, and then $\alpha_1$ and $\alpha_2$ are obtained. So, the decision function of CL1-FTELM is as follows:
$y = \arg\min_{k=1,2}\left|\alpha_k^T H_k h(x)\right| = \arg\min_{k=1,2}\left|\sum_{i=1}^{m_k}\alpha_{ki}\, k_{ELM_k}(x, x_i)\right| \qquad (57)$
Based on the above discussion, our algorithm will be presented in Algorithm 2. Next, we give the convergence analysis of Algorithm 2.
Algorithm 2 The procedure of CL1-FTELM.
Input: Training set $T = \{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^d$, $y_i = \pm 1$; the number of hidden nodes $L$; regularization parameters $C_1, C_2, C_3, C_4 > 0$; $\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4 > 0$; $\rho_1, \rho_2, \sigma_1, \sigma_2, \sigma_3, \sigma_4$.
Output: $\alpha_1^*$, $\alpha_2^*$, and the decision function of CL1-FTELM.
1: Initialize $F^0 \in \mathbb{R}^{m_1 \times m_1}$, $D^0 \in \mathbb{R}^{m_2 \times m_2}$, $R^0 \in \mathbb{R}^{m_2 \times m_2}$, $S^0 \in \mathbb{R}^{m_1 \times m_1}$.
2: Compute the graph matrices $N_1, N_2$ by Equations (4) and (5).
3: Choose an activation function such as $G(x) = \frac{1}{1 + e^{-x}}$, compute the hidden-layer output matrices $H_1, H_2$ by $h(x_i) = G\left(\sum_{j=1}^{d}\omega_{ji}x_j + b_i\right)$, and compute $K_{ELM_1} = H_1H_1^T$ and $K_{ELM_2} = H_2H_2^T$.
4: Set $t = 0$.
5: while true do
• Solve (55) and (56) to obtain $\alpha^t$ and $\lambda^t$.
• Obtain the solutions $\alpha_1^t$, $\alpha_2^t$, $\xi^t$, and $\eta^t$ by
$\alpha_1^t = -\left[K_{ELM_1}\left(F^t + C_2 N_1\right)K_{ELM_1}\right]^{-1} H_1H_2^T\alpha^t, \qquad \xi^t = \frac{1}{C_1}\left(D^t\right)^{-1}\alpha^t$
$\alpha_2^t = \left[K_{ELM_2}\left(R^t + C_4 N_2\right)K_{ELM_2}\right]^{-1} H_2H_1^T\lambda^t, \qquad \eta^t = \frac{1}{C_3}\left(S^t\right)^{-1}\lambda^t$
• Update the matrices $F^{t+1}$, $D^{t+1}$, $R^{t+1}$, and $S^{t+1}$ by (45), (46), (48), and (49), respectively.
• Compute the objective function values $J_1^{t+1}$ and $J_2^{t+1}$ by
$J_1^{t+1} = \frac{1}{2}\left(\alpha_1^t\right)^T K_{ELM_1}F^{t+1}K_{ELM_1}\alpha_1^t + \frac{C_1}{2}\left(\xi^t\right)^T D^{t+1}\xi^t + \frac{C_2}{2}\left(\alpha_1^t\right)^T K_{ELM_1}N_1K_{ELM_1}\alpha_1^t$
$J_2^{t+1} = \frac{1}{2}\left(\alpha_2^t\right)^T K_{ELM_2}R^{t+1}K_{ELM_2}\alpha_2^t + \frac{C_3}{2}\left(\eta^t\right)^T S^{t+1}\eta^t + \frac{C_4}{2}\left(\alpha_2^t\right)^T K_{ELM_2}N_2K_{ELM_2}\alpha_2^t$
• if $\left|J_1^{t+1} - J_1^t\right| \le \rho_1$ and $\left|J_2^{t+1} - J_2^t\right| \le \rho_2$ then break
• else $t = t + 1$
6: end while
7: Stop the iteration process and return the solutions $\alpha_1^*$, $\alpha_2^*$.
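The following compact NumPy sketch mirrors the outer loop of Algorithm 2 for the first sub-problem (44)/(55); it is an illustration under the reconstruction above rather than the authors' code, and the projected-gradient dual solver, the ridge term, and all names are assumptions made for brevity.

```python
import numpy as np

def pg_nonneg_qp(M, q, iters=2000):
    """min 0.5 a^T M a - q^T a  s.t.  a >= 0, by projected gradient."""
    a = np.zeros(M.shape[0])
    step = 1.0 / (np.linalg.norm(M, 2) + 1e-12)
    for _ in range(iters):
        a = np.maximum(a - step * (M @ a - q), 0.0)
    return a

def cl1_ftelm_first_problem(H1, H2, N1, C1, C2, eps1, eps2,
                            sigma=1e-4, rho=1e-3, max_outer=50):
    m1, m2 = H1.shape[0], H2.shape[0]
    K1, G = H1 @ H1.T, H2 @ H1.T                 # Gram matrix and cross term
    F, D = np.eye(m1), np.eye(m2)                # initial weights F^0, D^0
    J_prev, alpha1 = np.inf, np.zeros(m1)
    for _ in range(max_outer):
        Q1 = K1 @ (F + C2 * N1) @ K1 + 1e-8 * np.eye(m1)
        M = G @ np.linalg.solve(Q1, G.T) + np.linalg.inv(D) / C1   # dual matrix of (55)
        a = pg_nonneg_qp(M, np.ones(m2))
        alpha1 = -np.linalg.solve(Q1, G.T @ a)   # alpha1 from (51)
        xi = np.linalg.inv(D) @ a / C1           # xi from (52)
        # refresh the capped-L1 weights, Equations (45)-(46)
        r = np.abs(K1 @ alpha1)
        F = np.diag(np.where(r <= eps1, 1.0 / np.maximum(r, 1e-8), sigma))
        D = np.diag(np.where(np.abs(xi) <= eps2,
                             1.0 / np.maximum(np.abs(xi), 1e-8), sigma))
        J = 0.5 * alpha1 @ K1 @ F @ K1 @ alpha1 + 0.5 * C1 * xi @ D @ xi \
            + 0.5 * C2 * alpha1 @ K1 @ N1 @ K1 @ alpha1
        if abs(J_prev - J) <= rho:               # convergence test of Algorithm 2
            break
        J_prev = J
    return alpha1
```

The second sub-problem (47)/(56) follows the same loop with $H_1$ and $H_2$ swapped and the weights $R, S$ in place of $F, D$.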

4.3. Convergence Analysis

Before we prove the convergence of the iterative algorithm, we first review two lemmas [34].
Lemma 1.
For any non-zero vectors $x, y \in \mathbb{R}^n$, if $f(x) = \|x\|_1 - \frac{\|x\|_1^2}{2\|y\|_1}$, then the inequality $f(x) \le f(y)$ holds.
Lemma 2.
For any non-zero vectors $x, y, p, q \in \mathbb{R}^n$, if $f(x, p) = \|x\|_1 - \frac{\|x\|_1^2}{2\|y\|_1} + C\left(\|p\|_1 - \frac{\|p\|_1^2}{2\|q\|_1}\right)$ with $C \in \mathbb{R}_+$, then the inequality $f(x, p) \le f(y, q)$ holds.
The proofs of the two lemmas are detailed in [34].
Theorem 1.
Algorithm 2 monotonically decreases the objectives of problems (42) and (43) in each iteration until it converges.
Proof. 
Here, we only use problem (42) as an example to prove Theorem 1.
$J(\alpha_1, \xi) = \min_{\alpha_1, \xi}\ \sum_{i=1}^{m_1}\min\left(\left\|h^T(x_i)H_1^T\alpha_1\right\|_1, \varepsilon_1\right) + C_1\sum_{j=1}^{m_2}\min\left(\left\|\xi_j\right\|_1, \varepsilon_2\right) + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (60)$
When $\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 < \varepsilon_1$ and $\left\|\xi_j\right\|_1 < \varepsilon_2$, we have:
$J(\alpha_1, \xi) = \min_{\alpha_1, \xi}\ \sum_{i=1}^{m_1}\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 + C_1\sum_{j=1}^{m_2}\left\|\xi_j\right\|_1 + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (61)$
We take the derivative of Equation (61) with respect to $\alpha_1$ and $\xi$ separately and obtain:
$\sum_{i=1}^{m_1}\frac{H_1 h(x_i)h^T(x_i)H_1^T\alpha_1}{\left\|h^T(x_i)H_1^T\alpha_1\right\|_1} + C_2 K_{ELM_1}N_1K_{ELM_1}\alpha_1 = 0, \qquad C_1\sum_{j=1}^{m_2}\frac{\xi_j}{\left\|\xi_j\right\|_1} = 0 \qquad (62)$
By the above Equation (62), we can get:
$\sum_{i=1}^{m_1}\frac{H_1 h(x_i)h^T(x_i)H_1^T\alpha_1}{\left\|h^T(x_i)H_1^T\alpha_1\right\|_1} + C_1\sum_{j=1}^{m_2}\frac{\xi_j}{\left\|\xi_j\right\|_1} + C_2 K_{ELM_1}N_1K_{ELM_1}\alpha_1 = 0 \qquad (63)$
We define $f_i = \frac{1}{\left\|h^T(x_i)H_1^T\alpha_1\right\|_1}$ and $d_j = \frac{1}{\left\|\xi_j\right\|_1}$ as the diagonal entries of $F$ and $D$, respectively. Thus we can rewrite Equation (63) as follows:
$H_1H_1^T\, F\, H_1H_1^T\alpha_1 + C_1 D\xi + C_2 K_{ELM_1}N_1K_{ELM_1}\alpha_1 = 0 \qquad (64)$
Obviously, Equation (64) is the optimality condition of the following problem:
$\min_{\alpha_1, \xi}\ \frac{1}{2}\alpha_1^T K_{ELM_1}FK_{ELM_1}\alpha_1 + \frac{C_1}{2}\xi^T D\xi + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (65)$
Now, assume that $\bar{\alpha}_1$ and $\bar{\xi}$ denote the updated $\alpha_1$ and $\xi$ of Algorithm 2, respectively. Thus we can get:
$\frac{1}{2}\bar{\alpha}_1^T K_{ELM_1}FK_{ELM_1}\bar{\alpha}_1 + \frac{C_1}{2}\bar{\xi}^T D\bar{\xi} + \frac{C_2}{2}\bar{\alpha}_1^T K_{ELM_1}N_1K_{ELM_1}\bar{\alpha}_1 \le \frac{1}{2}\alpha_1^T K_{ELM_1}FK_{ELM_1}\alpha_1 + \frac{C_1}{2}\xi^T D\xi + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (66)$
We can rewrite Equation (66) as follows:
$\sum_{i=1}^{m_1}\frac{\left(K_{ELM_1}\bar{\alpha}_1\right)^T\left(K_{ELM_1}\bar{\alpha}_1\right)}{2\left\|h^T(x_i)H_1^T\alpha_1\right\|_1} + \sum_{j=1}^{m_2}\frac{C_1\bar{\xi}_j^2}{2\left\|\xi_j\right\|_1} + \frac{C_2}{2}\bar{\alpha}_1^T K_{ELM_1}N_1K_{ELM_1}\bar{\alpha}_1 \le \sum_{i=1}^{m_1}\frac{\left(K_{ELM_1}\alpha_1\right)^T\left(K_{ELM_1}\alpha_1\right)}{2\left\|h^T(x_i)H_1^T\alpha_1\right\|_1} + \sum_{j=1}^{m_2}\frac{C_1\xi_j^2}{2\left\|\xi_j\right\|_1} + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (67)$
Here, we let $x = K_{ELM_1}\bar{\alpha}_1$, $y = K_{ELM_1}\alpha_1$, $C = C_1$, $p = \bar{\xi}_j$, $q = \xi_j$. Based on Lemma 2, we have:
$\left\|K_{ELM_1}\bar{\alpha}_1\right\|_1 - \frac{\left\|K_{ELM_1}\bar{\alpha}_1\right\|_1^2}{2\left\|K_{ELM_1}\alpha_1\right\|_1} + C_1\left(\left\|\bar{\xi}_j\right\|_1 - \frac{\left\|\bar{\xi}_j\right\|_1^2}{2\left\|\xi_j\right\|_1}\right) \le \left\|K_{ELM_1}\alpha_1\right\|_1 - \frac{\left\|K_{ELM_1}\alpha_1\right\|_1^2}{2\left\|K_{ELM_1}\alpha_1\right\|_1} + C_1\left(\left\|\xi_j\right\|_1 - \frac{\left\|\xi_j\right\|_1^2}{2\left\|\xi_j\right\|_1}\right) \qquad (68)$
Then we can get:
$\sum_{i=1}^{m_1}\left\|h^T(x_i)H_1^T\bar{\alpha}_1\right\|_1 - \frac{\left\|K_{ELM_1}\bar{\alpha}_1\right\|_1^2}{2\left\|K_{ELM_1}\alpha_1\right\|_1} + C_1\sum_{j=1}^{m_2}\left(\left\|\bar{\xi}_j\right\|_1 - \frac{\left\|\bar{\xi}_j\right\|_1^2}{2\left\|\xi_j\right\|_1}\right) \le \sum_{i=1}^{m_1}\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 - \frac{\left\|K_{ELM_1}\alpha_1\right\|_1^2}{2\left\|K_{ELM_1}\alpha_1\right\|_1} + C_1\sum_{j=1}^{m_2}\left(\left\|\xi_j\right\|_1 - \frac{\left\|\xi_j\right\|_1^2}{2\left\|\xi_j\right\|_1}\right) \qquad (69)$
Combining (67) and (69), we can get the following inequality:
$\sum_{i=1}^{m_1}\left\|h^T(x_i)H_1^T\bar{\alpha}_1\right\|_1 + C_1\sum_{j=1}^{m_2}\left\|\bar{\xi}_j\right\|_1 + \frac{C_2}{2}\bar{\alpha}_1^T K_{ELM_1}N_1K_{ELM_1}\bar{\alpha}_1 \le \sum_{i=1}^{m_1}\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 + C_1\sum_{j=1}^{m_2}\left\|\xi_j\right\|_1 + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (70)$
Further, we can get:
$\sum_{i=1}^{m_1}\min\left(\left\|h^T(x_i)H_1^T\bar{\alpha}_1\right\|_1, \varepsilon_1\right) + C_1\sum_{j=1}^{m_2}\min\left(\left\|\bar{\xi}_j\right\|_1, \varepsilon_2\right) + \frac{C_2}{2}\bar{\alpha}_1^T K_{ELM_1}N_1K_{ELM_1}\bar{\alpha}_1 \le \sum_{i=1}^{m_1}\min\left(\left\|h^T(x_i)H_1^T\alpha_1\right\|_1, \varepsilon_1\right) + C_1\sum_{j=1}^{m_2}\min\left(\left\|\xi_j\right\|_1, \varepsilon_2\right) + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (71)$
Therefore, we have $J(\bar{\alpha}_1, \bar{\xi}) \le J(\alpha_1, \xi)$. Similarly, when $\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 \ge \varepsilon_1$ and $\left\|\xi_j\right\|_1 \ge \varepsilon_2$, or $\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 < \varepsilon_1$ and $\left\|\xi_j\right\|_1 \ge \varepsilon_2$, or $\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 \ge \varepsilon_1$ and $\left\|\xi_j\right\|_1 < \varepsilon_2$, we can obviously get $J(\bar{\alpha}_1, \bar{\xi}) \le J(\alpha_1, \xi)$. Thus, the inequality $J(\bar{\alpha}_1, \bar{\xi}) \le J(\alpha_1, \xi)$ always holds. Moreover, the three terms in Equation (60) are all greater than or equal to 0, meaning that Algorithm 2 decreases the objective of problem (42) in each iteration until convergence. □
Theorem 2.
Algorithm 2 will converge to a local optimum of problem (42).
Proof. 
Here, we only use (42) as an example to prove Theorem 2.
When $\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 < \varepsilon_1$ and $\left\|\xi_j\right\|_1 < \varepsilon_2$, we write out the Lagrangian of (42):
$L_1(\alpha_1, \xi, \lambda) = \sum_{i=1}^{m_1}\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 + C_1\sum_{j=1}^{m_2}\left\|\xi_j\right\|_1 + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 - \sum_{j=1}^{m_2}\lambda_j\left(-h^T(x_j)H_1^T\alpha_1 + \xi_j - 1\right) \qquad (72)$
Then, we take the derivative of $L_1(\alpha_1, \xi, \lambda)$ with respect to $\alpha_1$:
$\frac{\partial L_1}{\partial \alpha_1} = \sum_{i=1}^{m_1}\frac{H_1 h(x_i)h^T(x_i)H_1^T\alpha_1}{\left\|h^T(x_i)H_1^T\alpha_1\right\|_1} + C_2 K_{ELM_1}N_1K_{ELM_1}\alpha_1 + H_1H_2^T\lambda = K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}\alpha_1 + H_1H_2^T\lambda = 0 \qquad (73)$
Similarly, we obtain the Lagrangian of problem (44):
$L_2(\alpha_1, \xi, \lambda) = \frac{1}{2}\alpha_1^T K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}\alpha_1 + \frac{C_1}{2}\xi^T D\xi - \lambda^T\left(-H_2H_1^T\alpha_1 + \xi - e_2\right) \qquad (74)$
Taking the derivative of $L_2(\alpha_1, \xi, \lambda)$ with respect to $\alpha_1$:
$\frac{\partial L_2}{\partial \alpha_1} = K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}\alpha_1 + H_1H_2^T\lambda = 0 \qquad (75)$
The other three cases are similar. From the discussion above, we can conclude that Equations (73) and (75) are equivalent, so we can solve problem (44) instead of problem (42) for CL1-FTELM, which further illustrates that Algorithm 2 converges to a local optimal solution. □

5. Experiments

The four comparison algorithms are described as follows:
OPTELM: The optimization function of the model consists of minimizing the L2-norm of the weight vector and minimizing the empirical loss. It considers neither the construction of two non-parallel hyperplanes to deal with classification tasks nor the statistical information of the samples. At the same time, since it uses the L2-norm metric and the Hinge loss, its anti-noise ability is weak.
TELM: The optimization function of the model consists of minimizing the distance from the sample points to the hyperplane as well as minimizing the empirical loss. TELM does not fully consider the statistical information of the samples. At the same time, it uses the L2-norm metric and the Hinge loss function, so when there is noise in the dataset, the influence of the noisy data is amplified and the classification accuracy is reduced.
FELM: The optimization function of the model includes minimizing the L2-norm of the weight vector, minimizing the empirical loss, and minimizing the within-class scatter of the sample data. Although FELM takes the statistics of the samples into account, it has to deal with a much larger optimization problem than the twin extreme learning machines, which is time-consuming. At the same time, FELM keeps the metric and loss used by OPTELM, so its anti-noise ability is weak.
CL1-TWSVM: CL1-TWSVM is built on twin support vector machines by changing the model's metric and loss to the capped L1-norm. Although CL1-TWSVM has the ability to resist noise, it does not fully take the statistics of the data into account. Meanwhile, CL1-TWSVM needs to solve not only the weight vector of each hyperplane but also its bias, so it is time-consuming.
We systematically compare our algorithms with the above state-of-the-art algorithms (OPTELM [12], TELM [14], FELM [19], and CL1-TWSVM [29]) on artificial synthetic datasets and UCI real datasets to verify the effectiveness of our FTELM and CL1-FTELM. In Section 5.1, we describe the relevant experimental setting in detail. We describe their performance in different cases in Section 5.2 and Section 5.3, respectively. In Section 5.4, we use the one-versus-rest multi-classification method to perform classification tasks on four image datasets: Yale "http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2023)", ORL "http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2023)", the USPS handwritten digit dataset "http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html (accessed on 15 February 2023)", and the MNIST dataset "http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html (accessed on 15 February 2023)".

5.1. Experimental Setting

All experiments were implemented in MATLAB R2020a on a personal computer (PC) with an AMD Radeon Graphics processor (3.2 GHz) and 16 GB of random-access memory (RAM). For CL1-TWSVM and CL1-FTELM, we set the maximum number of iterations to 100 and the iteration stopping threshold to 0.001. The activation function used in five of the models (OPTELM, TELM, FELM, FTELM, and CL1-FTELM) is $G(x) = \frac{1}{1 + e^{-x}}$. The Gaussian kernel function $K(x, z) = e^{-\|x - z\|^2 / (2\sigma^2)}$ was used for CL1-TWSVM. The parameters of all the above algorithms were selected as follows: $\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4$ were selected from $\{10^i \mid i = -6, -5, -4\}$, $C_1, C_2, C_3, C_4$ were selected from $\{10^i \mid i = -5, -4, \ldots, 4, 5\}$, $\sigma$ was chosen from $\{2^i \mid i = -3, -2, \ldots, 2, 3\}$, and the hidden layer node number $L$ was chosen from $\{50, 100, 200, 500, 1000, 2000, 5000, 10000\}$. The optimal parameters of each model were selected by 10-fold cross-validation and grid search. Normalization was performed for both the artificial and UCI datasets. For the image datasets, we randomly select 20% of the data as the test set to obtain the classification accuracy of each algorithm. All experimental processes are repeated 10 times, the average of the 10 test results is used as the performance measure, and the evaluation criterion selected in this paper is classification accuracy (ACC).
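As an illustration of this model-selection protocol (not the authors' MATLAB pipeline), the sketch below runs a grid search with stratified 10-fold cross-validation scored by accuracy; `train_ftelm` and `predict_ftelm` are placeholders for any of the compared trainers, and the shared C grid is a simplification of tuning $C_1, \ldots, C_4$ independently.

```python
import itertools
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_accuracy(train_ftelm, predict_ftelm, X, y, params, n_splits=10, seed=0):
    """Mean accuracy of one parameter setting over stratified 10-fold CV."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for tr, te in skf.split(X, y):
        model = train_ftelm(X[tr], y[tr], **params)
        scores.append(np.mean(predict_ftelm(model, X[te]) == y[te]))
    return float(np.mean(scores))

def grid_search(train_ftelm, predict_ftelm, X, y):
    C_grid = [10.0 ** i for i in range(-5, 6)]              # 10^-5, ..., 10^5
    L_grid = [50, 100, 200, 500, 1000, 2000, 5000, 10000]   # hidden node numbers
    best = (None, -np.inf)
    for C, L in itertools.product(C_grid, L_grid):
        acc = cv_accuracy(train_ftelm, predict_ftelm, X, y,
                          {"C1": C, "C2": C, "C3": C, "C4": C, "L": L})
        best = max(best, ({"C": C, "L": L}, acc), key=lambda t: t[1])
    return best
```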

5.2. Experiments on Artificial Datasets

We first conduct experiments on the Banana, Circle, Two spirals, and XOR datasets, which are generated by trigonometric functions (sine, cosine), two circles, two spiral lines, and two intersecting lines, respectively. The two-dimensional distributions of the four synthetic datasets are shown in Figure 1. Dark blue '+' represents class 1, and cyan '∘' represents class 2. Figure 2 illustrates the accuracy of four twin algorithms, namely TELM, FTELM, CL1-TWSVM, and CL1-FTELM, on the four datasets with 0%, 20%, and 25% noise. From Figure 2a, we can observe that the classification accuracy of our FTELM and CL1-FTELM on the Banana and Two spirals datasets is higher than that of the other two methods. On the Circle and XOR datasets, the classification accuracy of the four methods is similar. The experimental results show that fully considering the statistical information of the data can effectively improve the classification accuracy of the classifier, which shows that our CL1-FTELM method is effective. From Figure 2b,c, we can see that the overall effect of FTELM is better than TELM, which shows the importance of fully considering the statistical information of the samples. At the same time, we can see that CL1-FTELM performs best, followed by CL1-TWSVM. This indicates that the capped L1-norm can restrict the influence of noise on the model to a certain range, and further shows the effectiveness of using the capped L1-norm. In summary, Figure 2 illustrates the effectiveness of simultaneously considering sample statistics and changing the distance metric and loss of the model to the capped L1-norm.
To further show the robustness of CL1-FTELM, we add noise with different ratios to the Circle dataset. Figure 3 shows the accuracy of the TELM, FTELM, CL1-TWSVM, and CL1-FTELM algorithms on the Circle dataset at different noise ratios. The ratio is set in the range {0.1, 0.15, 0.2, 0.25}. We plot the accuracy results of ten experiments for each noise ratio in a box plot. By observing the medians of the four subgraphs, we can find that the median of the CL1-FTELM algorithm is much higher than those of the other three algorithms, and the results of CL1-FTELM are relatively concentrated across the four noise ratios. In other words, the variance of the ten experimental results obtained by the CL1-FTELM algorithm is smaller and the mean value is larger. The above results show that our CL1-FTELM has better stability and a better classification effect in environments containing noise. This shows the effectiveness and noise resistance of using the capped L1-norm as the distance metric and loss function of the model.

5.3. Experiments on UCI Datasets

In this section, we conduct numerical simulations on UCI datasets. Table 1 describes the characteristics of the UCI datasets used in detail. We also include two more algorithms (OPTELM and FELM) to verify the classification performance of FTELM and CL1-FTELM on the ten UCI datasets.
All experimental results obtained with the optimal parameters are shown in Table 2. Here, the average running time under the optimal parameters is denoted by Times (s), and the average classification accuracy plus or minus the standard deviation is denoted by ACC ± S (%). From Table 2, we can see that FTELM performs better than OPTELM, TELM, and FELM on all ten datasets. This indicates that adding a Fisher regularization term to the TELM framework can significantly improve the classification accuracy of the model. In addition, the average training time of the FTELM algorithm on most datasets is smaller than that of the FELM algorithm, which indicates that FTELM inherits the advantage of TELM's short training time. We can also see that our CL1-FTELM achieves the highest classification accuracy on all datasets except WDBC. Through the analysis of the above results, we can conclude that the Fisher regularization and capped L1-norm added to the TELM learning framework can effectively improve the performance of the classifier. This shows that the proposed FTELM and CL1-FTELM are efficient algorithms.
In order to further verify the robustness of CL1-FTELM to outliers, we added 20% and 25% Gaussian noise to the 10 datasets, respectively. All experimental results are presented in Table 3 and Table 4. From Table 3 and Table 4, we find that the classification accuracy of all six algorithms decreases after adding noise. However, the classification accuracy of our CL1-FTELM algorithm is the highest on eight of the ten datasets, which further reveals the effectiveness of using the capped L1-norm instead of the Hinge loss and the L2-norm distance metric. Compared with the other five algorithms, our CL1-FTELM algorithm is more time-consuming. This is because CL1-FTELM requires considerable time during training for the iterative computation, the suppression of outliers, and the computation of the graph matrices. In addition, we used different noise factor values (0.1, 0.15, 0.2, 0.25, 0.3) on the Breast Cancer, German, Ionosphere, and WDBC datasets for the six algorithms. The experimental results are given in Figure 4. It can be seen from Figure 4a that when the Breast Cancer dataset contains 10% noise, the effects of our FTELM and CL1-FTELM are comparable, which shows that it is important to consider the statistical information of the samples. As the noise ratio increases, the classification accuracy of all methods decreases, but our CL1-FTELM still has the highest accuracy, which illustrates the effectiveness of using the capped L1-norm. Figure 4b shows that, with the increase of the noise ratio, the accuracy of CL1-TWSVM and CL1-FTELM declines with a similar trend, but CL1-FTELM is still the most stable among the six methods under the influence of noise. From Figure 4c,d, we can clearly observe that the anti-noise effect of our CL1-FTELM is the best, which illustrates the effectiveness of using the Fisher regularization term as well as the capped L1-norm.
We also conduct experiments on four datasets (Breast cancer, QSAR, WDBC, and Vote) to verify the convergence of the proposed Algorithm 2. As shown in Figure 5, we plot the objective function value at each iteration. It can be seen that the objective function value converges rapidly to a fixed value as the number of iterations increases. This shows that our algorithm can make the objective function value converge to a local optimal value within a limited number of iterations, demonstrating the effectiveness and convergence of Algorithm 2.

5.4. Experiments on Image Datasets

The image datasets include Yale, ORL, USPS, and MNIST. Figure 6 illustrates examples of the four high-dimensional image datasets. The numbers of samples and features of the four image datasets are shown in Table 5. These four image datasets are used to investigate the performance of our FTELM and CL1-FTELM for multi-classification. Specifically, for the MNIST dataset, we only select the first 2000 samples to participate in the experiment.
Table 6 shows the specific experimental results. As can be seen from the results, our CL1-FTELM and CL1-TWSVM have similar training times. This is because this paper uses an iterative algorithm to solve the non-convex optimization problem of CL1-FTELM, which is time-consuming. Meanwhile, CL1-FTELM attains the highest classification accuracy among the six algorithms on all four datasets (Yale, ORL, USPS, and MNIST). In addition, the classification accuracy of our FTELM algorithm on the four image datasets is the second highest, after our CL1-FTELM. The above results fully show the effectiveness of our two algorithms in dealing with multi-classification tasks.

6. Conclusions

In this paper, we have proposed FTELM and CL1-FTELM. FTELM not only inherits the advantages of TELM but also takes full account of the statistical information of the samples, so as to further improve the classification performance of the classifier. Specifically, when there is no noise in the data or the noise ratio is very small, our FTELM algorithm can deal with the classification problem very well: it is not only time-saving but also achieves high classification accuracy. CL1-FTELM further improves the anti-noise ability of the model by replacing the L2-norm and hinge loss in FTELM with the capped L1-norm. It not only utilizes the distribution information of the data but also improves the anti-noise ability of the model. Furthermore, we have designed two algorithms to solve the problems of FTELM and CL1-FTELM, and we have presented two theorems to prove the convergence and local optimality of CL1-FTELM. However, in terms of computational cost, FTELM is better than CL1-FTELM to some extent. Therefore, in future work, we will propose new techniques to accelerate the computation of CL1-FTELM. In addition, extending FTELM and CL1-FTELM from the supervised learning setting to a semi-supervised learning framework is also a future research focus.

Author Contributions

Z.X., conceptualization, methodology, validation, investigation, project administration, writing—original draft. L.C., methodology, software, validation, formal analysis, investigation, data curation, writing—original draft. All authors have read and agreed to the published version of the manuscript.

Funding

The authors wish to acknowledge the financial support of the National Nature Science Youth Foundation of China (No. 61907012), the Start-up Funds of Scientific Research for Personnel Introduced by North Minzu University (No. 2019KYQD41), the Special project of North Minzu University (No. FWNX01), the Basic Research Plan of Key Scientific Research Projects of Colleges and Universities in Henan Province (No. 19A120005), the Construction Project of First-Class Disciplines in Ningxia Higher Education (NXYLXK2017B09), the Young Talent Cultivation Project of North Minzu University (No. 2021KYQD23), the Natural Science Foundation of Ningxia Provincial of China (No. 2022A0950), the Fundamental Research Funds for the Central Universities (No. 2022XYZSX03).

Informed Consent Statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability Statement

The UCI machine learning repository is available at “http://archive.ics.uci.edu/ml/datasets.php (accessed on 15 February 2023)”. The image data are available at “http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2023)”.

Acknowledgments

The authors thank the anonymous reviewers for their constructive comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: A new learning scheme of feedforward neural networks. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), Budapest, Hungary, 25–29 July 2004; Volume 2, pp. 985–990.
2. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501.
3. Huang, G.B.; Chen, Y.Q.; Babri, H.A. Classification ability of single hidden layer feedforward neural networks. IEEE Trans. Neural Networks 2000, 11, 799–801.
4. Chen, X.; Cui, B. Efficient modeling of fiber optic gyroscope drift using improved EEMD and extreme learning machine. Signal Process. 2016, 128, 1–7.
5. Xia, M.; Zhang, Y.; Weng, L.; Ye, X. Fashion retailing forecasting based on extreme learning machine with adaptive metrics of inputs. Knowl.-Based Syst. 2012, 36, 253–259.
6. Yang, J.; Xie, S.; Yoon, S.; Park, D.; Fang, Z.; Yang, S. Fingerprint matching based on extreme learning machine. Neural Comput. Appl. 2013, 22, 435–445.
7. Rasheed, Z.; Rangwala, H. Metagenomic Taxonomic Classification Using Extreme Learning Machines. J. Bioinform. Comput. Biol. 2012, 10(5), 1250015.
8. Zou, Q.Y.; Wang, X.J.; Zhou, C.J.; Zhang, Q. The memory degradation based online sequential extreme learning machine. Neurocomputing 2018, 275, 2864–2879.
9. Fu, Y.; Wu, Q.; Liu, K.; Gao, H. Feature Selection Methods for Extreme Learning Machines. Axioms 2022, 11, 444.
10. Liu, Q.; He, Q.; Shi, Z. Extreme support vector machine classifier. In Proceedings of the Advances in Knowledge Discovery and Data Mining: 12th Pacific-Asia Conference, PAKDD 2008, Osaka, Japan, 20–23 May 2008; pp. 222–233.
11. Frénay, B.; Verleysen, M. Using SVMs with randomised feature spaces: An extreme learning approach. In Proceedings of the 18th European Symposium on Artificial Neural Networks, ESANN 2010, Bruges, Belgium, 28–30 April 2010.
12. Huang, G.B.; Ding, X.; Zhou, H. Optimization method based extreme learning machine for classification. Neurocomputing 2010, 74, 155–163.
13. Jayadeva; Khemchandani, R.; Chandra, S. Twin support vector machines for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 905–910.
14. Wan, Y.; Song, S.; Huang, G.; Li, S. Twin extreme learning machines for pattern classification. Neurocomputing 2017, 260, 235–244.
15. Shen, J.; Ma, J. Sparse Twin Extreme Learning Machine with ε-Insensitive Zone Pinball Loss. IEEE Access 2019, 7, 112067–112078.
16. Yuan, C.; Yang, L. Robust twin extreme learning machines with correntropy-based metric. Knowl.-Based Syst. 2021, 214, 106707.
17. Anand, P.; Bharti, A.; Rastogi, R. Time efficient variants of Twin Extreme Learning Machine. Intell. Syst. Appl. 2023, 17, 200169.
18. Ma, J.; Yu, G. A generalized adaptive robust distance metric driven smooth regularization learning framework for pattern recognition. Signal Process. 2023, 211, 109102.
19. Ma, J.; Wen, Y.; Yang, L. Fisher-regularized supervised and semi-supervised extreme learning machine. Knowl. Inf. Syst. 2020, 62, 3995–4027.
20. Gao, S.; Ye, Q.; Ye, N. 1-Norm least squares twin support vector machines. Neurocomputing 2011, 74, 3590–3597.
21. Yan, H.; Ye, Q.L.; Zhang, T.A.; Yu, D.J. Efficient and robust TWSVM classifier based on L1-norm distance metric for pattern classification. In Proceedings of the 2017 4th IAPR Asian Conference on Pattern Recognition (ACPR), Nanjing, China, 26–29 November 2017; pp. 436–441.
22. Ye, Q.; Yang, J.; Liu, F.; Zhao, C.; Ye, N.; Yin, T. L1-norm distance linear discriminant analysis based on an effective iterative algorithm. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 114–129.
23. Wu, Q.; Wang, F.; An, Y.; Li, K. L-1-Norm Robust Regularized Extreme Learning Machine with Asymmetric C-Loss for Regression. Axioms 2023, 12, 204.
24. Wu, M.J.; Liu, J.X.; Gao, Y.L.; Kong, X.Z.; Feng, C.M. Feature selection and clustering via robust graph-laplacian PCA based on capped L1-norm. In Proceedings of the 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Kansas City, MO, USA, 13–16 November 2017; pp. 1741–1745.
25. Nie, F.; Huang, H.; Cai, X.; Ding, C. Efficient and robust feature selection via joint L2,1-norms minimization. Adv. Neural Inf. Process. Syst. 2010, 23, 1813–1821.
26. Ma, J.; Yang, L.; Sun, Q. Capped L1-norm distance metric-based fast robust twin bounded support vector machine. Neurocomputing 2020, 412, 295–311.
27. Jiang, W.; Nie, F.; Huang, H. Robust Dictionary Learning with Capped L1-Norm. In Proceedings of the 24th International Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; pp. 3590–3596.
28. Nie, F.; Huo, Z.; Huang, H. Joint Capped Norms Minimization for Robust Matrix Recovery. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 2557–2563.
29. Wang, C.; Ye, Q.; Luo, P.; Ye, N.; Fu, L. Robust capped L1-norm twin support vector machine. Neural Netw. 2019, 114, 47–59.
30. Pal, A.; Khemchandani, R. Learning TWSVM using Privilege Information. In Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 18–21 November 2018; pp. 1548–1554.
31. Li, Y.; Sun, H.; Yan, W.; Cui, Q. R-CTSVM+: Robust capped L1-norm twin support vector machine with privileged information. Inf. Sci. 2021, 574, 12–32.
32. Mangasarian, O.; Musicant, D. Successive overrelaxation for support vector machines. IEEE Trans. Neural Netw. 1999, 10, 1032–1037.
33. Luo, Z.Q.; Tseng, P. Error bounds and convergence analysis of feasible descent methods: A general approach. Ann. Oper. Res. 1993, 46, 157–178.
34. Yang, Y.; Xue, Z.; Ma, J.; Chang, X. Robust projection twin extreme learning machines with capped L1-norm distance metric. Neurocomputing 2023, 517, 229–242.
Figure 1. Four types of data without noise.
Figure 2. Accuracy for TELM, FTELM, CL1-TWSVM, and CL1-FTELM on four types of data with 0%, 20%, and 25% noise.
Figure 3. Accuracy for TELM, FTELM, CL1-TWSVM, and CL1-FTELM on the Circle dataset with noise in different ratios.
Figure 4. Accuracies of six algorithms via different noise factors.
Figure 5. Objective values of CL1-FTELM on four datasets.
Figure 6. Examples of four high-dimensional image datasets.
Table 1. Characteristics of UCI datasets.

Datasets        Instances  Attributes    Datasets      Instances  Attributes
Australian      690        14            Vote          432        16
German          1000       24            Ionosphere    351        35
Breast cancer   699        9             Pima          768        8
WDBC            569        30            QSAR          1055       41
Wholesale       440        7             Spam          4601       57
Table 2. Experimental results on UCI datasets. The best results are marked in bold.
(For each dataset, the first row gives ACC ± S (%) and the second row gives Times (s).)

Datasets        OPTELM          TELM            FELM            FTELM           CL1-TWSVM       CL1-FTELM
Australian      85.31 ± 0.34    85.60 ± 0.44    85.46 ± 0.19    86.79 ± 0.33    85.82 ± 0.28    87.13 ± 0.52
                0.682           0.593           1.698           0.456           1.676           2.533
German          76.26 ± 0.52    76.40 ± 0.16    76.50 ± 0.42    76.56 ± 0.47    76.70 ± 0.25    77.15 ± 1.18
                1.182           0.979           4.555           0.474           5.318           7.006
Breast cancer   95.70 ± 0.24    96.35 ± 0.15    96.45 ± 0.09    97.07 ± 0.15    96.39 ± 0.13    97.32 ± 0.53
                0.601           0.668           1.646           0.505           4.011           3.902
WDBC            96.71 ± 0.27    97.13 ± 0.48    97.55 ± 0.17    98.55 ± 0.26    97.09 ± 0.25    97.86 ± 0.21
                0.416           0.605           1.144           0.578           3.618           4.551
Wholesale       87.35 ± 0.93    89.86 ± 0.84    90.26 ± 0.12    90.56 ± 0.33    89.89 ± 0.30    90.70 ± 0.56
                0.278           2.091           0.665           0.359           1.246           1.377
Vote            95.31 ± 0.16    95.56 ± 0.30    96.04 ± 0.24    96.12 ± 0.31    95.21 ± 0.54    96.43 ± 0.35
                0.256           0.502           0.651           0.445           1.077           0.992
Ionosphere      90.59 ± 0.84    91.38 ± 0.52    92.32 ± 0.32    92.74 ± 0.83    92.56 ± 0.54    93.32 ± 1.21
                0.184           0.476           0.421           0.268           1.128           2.237
Pima            76.83 ± 0.73    77.51 ± 0.08    77.79 ± 0.10    78.24 ± 0.49    77.49 ± 0.37    78.82 ± 0.98
                0.858           0.795           2.099           0.932           1.743           4.708
QSAR            83.91 ± 0.66    86.56 ± 0.19    87.12 ± 0.18    87.35 ± 0.23    85.72 ± 0.59    87.50 ± 0.56
                1.442           0.979           2.489           2.864           2.665           14.288
Spam            85.57 ± 0.65    91.38 ± 0.52    89.67 ± 0.21    91.94 ± 1.23    90.56 ± 1.23    92.27 ± 0.54
                125.498         64.314          488.251         108.232         158.145         170.261
Table 3. Experimental results on UCI datasets with 20% noise. The best results are marked in bold.
(For each dataset, the first row gives ACC ± S (%) and the second row gives Times (s).)

Datasets        OPTELM          TELM            FELM            FTELM           CL1-TWSVM       CL1-FTELM
Australian      79.68 ± 1.75    80.37 ± 0.56    79.06 ± 1.36    80.44 ± 1.34    81.98 ± 0.87    82.78 ± 0.57
                0.621           0.728           1.756           0.224           1.708           3.224
German          69.67 ± 0.97    73.57 ± 1.85    71.99 ± 1.35    72.76 ± 0.88    73.86 ± 1.35    74.32 ± 1.12
                1.318           0.981           4.102           0.398           5.673           6.764
Breast cancer   70.60 ± 0.45    76.97 ± 0.42    70.32 ± 0.37    77.81 ± 0.56    79.84 ± 0.37    80.14 ± 0.91
                0.803           0.706           1.552           0.315           4.572           5.034
WDBC            82.98 ± 0.15    84.38 ± 1.01    83.29 ± 0.68    89.43 ± 1.15    89.98 ± 0.30    93.77 ± 0.32
                0.419           0.204           0.992           0.376           3.899           4.861
Wholesale       73.40 ± 0.93    73.77 ± 0.69    73.74 ± 0.76    74.77 ± 0.56    78.74 ± 0.91    79.47 ± 2.58
                0.275           0.543           0.659           0.404           0.849           1.420
Vote            93.48 ± 0.62    94.36 ± 0.60    94.24 ± 0.82    94.10 ± 0.94    93.90 ± 0.44    94.29 ± 0.61
                0.277           0.619           0.549           0.114           1.048           1.398
Ionosphere      80.79 ± 2.88    82.71 ± 2.09    81.00 ± 3.11    86.06 ± 1.67    85.76 ± 1.58    87.74 ± 1.08
                0.159           0.021           0.456           0.737           0.391           2.081
Pima            65.79 ± 0.23    67.07 ± 0.56    66.12 ± 0.12    66.30 ± 1.34    70.25 ± 1.57    71.42 ± 0.94
                0.873           0.649           2.051           1.492           1.758           3.968
QSAR            68.32 ± 2.48    68.80 ± 0.95    68.54 ± 2.50    72.28 ± 2.18    71.09 ± 2.02    72.31 ± 1.98
                1.534           3.089           4.578           0.892           1.828           9.151
Spam            83.16 ± 0.57    87.38 ± 2.31    85.66 ± 0.65    87.98 ± 0.87    85.77 ± 2.21    86.75 ± 0.45
                128.798         60.565          432.257         106.267         147.365         160.231
Table 4. Experimental results on UCI datasets with 25% noise. The best results are marked in bold.
(For each dataset, the first row gives ACC ± S (%) and the second row gives Times (s).)

Datasets        OPTELM          TELM            FELM            FTELM           CL1-TWSVM       CL1-FTELM
Australian      73.68 ± 2.20    75.41 ± 1.52    74.25 ± 2.01    76.40 ± 1.19    80.56 ± 1.07    81.63 ± 0.71
                0.585           0.673           1.627           0.206           2.205           2.261
German          69.72 ± 0.13    72.87 ± 0.82    71.41 ± 0.88    73.15 ± 0.87    73.13 ± 1.16    73.25 ± 0.76
                1.565           0.871           3.855           0.342           5.233           6.798
Breast cancer   67.59 ± 0.18    70.43 ± 0.79    67.23 ± 0.24    71.65 ± 0.58    70.93 ± 0.52    72.71 ± 0.49
                0.654           0.513           1.438           0.309           4.476           5.124
WDBC            79.61 ± 0.78    81.66 ± 0.84    79.83 ± 0.72    87.96 ± 1.13    88.50 ± 0.74    92.43 ± 0.76
                0.417           0.197           0.887           0.334           3.675           4.861
Wholesale       71.79 ± 1.03    71.63 ± 0.89    69.63 ± 0.38    71.60 ± 1.02    75.53 ± 1.02    75.74 ± 3.48
                0.570           2.021           0.623           0.338           1.147           1.387
Vote            92.62 ± 0.88    92.95 ± 0.50    93.12 ± 0.80    93.21 ± 0.80    93.21 ± 0.68    93.50 ± 1.00
                0.252           0.503           0.514           0.121           1.213           1.390
Ionosphere      78.15 ± 2.94    78.79 ± 3.01    76.62 ± 3.67    83.59 ± 1.49    82.94 ± 2.90    85.03 ± 2.28
                0.229           0.058           0.313           0.737           0.576           1.987
Pima            65.67 ± 0.12    65.45 ± 1.55    65.89 ± 0.12    65.79 ± 0.14    69.01 ± 1.55    68.51 ± 2.75
                0.803           0.761           2.182           0.471           5.532           4.012
QSAR            67.49 ± 3.08    67.81 ± 1.63    70.30 ± 2.33    71.53 ± 3.00    70.49 ± 2.13    69.72 ± 2.14
                2.067           2.730           4.251           0.849           1.783           12.564
Spam            71.77 ± 1.05    75.35 ± 0.72    70.89 ± 1.23    76.43 ± 1.16    83.56 ± 0.26    84.75 ± 0.78
                99.541          61.254          462.221         116.267         142.365         165.214
Table 5. Characteristics of image datasets.

Datasets   Instances  Attributes    Datasets   Instances  Attributes
Yale       165        1024          ORL        400        1024
USPS       9298       256           MNIST      70,000     784
Table 6. Experimental results on image and handwritten digit datasets. The best results are marked in bold.
(For each dataset, the first row gives ACC ± S (%) and the second row gives Times (s).)

Datasets   OPTELM          TELM            FELM            FTELM           CL1-TWSVM       CL1-FTELM
Yale       89.39 ± 2.85    91.44 ± 1.58    90.54 ± 2.01    92.23 ± 1.29    91.54 ± 1.07    93.12 ± 1.71
           0.126           0.101           0.262           0.136           0.135           0.492
ORL        87.72 ± 1.53    90.87 ± 0.52    90.41 ± 0.78    92.45 ± 0.67    92.32 ± 1.16    93.25 ± 0.46
           1.169           0.483           3.064           0.529           1.338           2.695
USPS       98.76 ± 0.18    98.83 ± 0.69    98.23 ± 0.24    99.65 ± 0.68    99.23 ± 0.42    99.89 ± 0.89
           118.729         17.536          134.438         6.795           358.368         355.762
MNIST      89.61 ± 0.58    90.66 ± 0.74    89.83 ± 0.75    91.26 ± 1.13    90.88 ± 0.14    91.53 ± 0.56
           8.723           1.237           41.656          0.868           14.258          14.973
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
