Article

Robust Fisher-Regularized Twin Extreme Learning Machine with Capped L1-Norm for Classification

1 School of Mathematics and Information Science, North Minzu University, Yinchuan 750021, China
2 The Key Laboratory of Intelligent Information and Big Data Processing of NingXia Province, North Minzu University, Yinchuan 750021, China
* Author to whom correspondence should be addressed.
Axioms 2023, 12(7), 717; https://doi.org/10.3390/axioms12070717
Submission received: 27 June 2023 / Revised: 15 July 2023 / Accepted: 17 July 2023 / Published: 24 July 2023

Abstract: Twin extreme learning machine (TELM) is a classical and high-efficiency classifier. However, it neglects the statistical knowledge hidden inside the data. In this paper, in order to make full use of statistical information from sample data, we first come up with a Fisher-regularized twin extreme learning machine (FTELM) by applying Fisher regularization into the TELM learning framework. This strategy not only inherits the advantages of TELM, but also minimizes the within-class divergence of samples. Further, in an effort to further boost the anti-noise ability of the FTELM method, we propose a new capped L1-norm FTELM (CL1-FTELM) by introducing the capped L1-norm into FTELM to reduce the influence of abnormal points, and CL1-FTELM improves the robust performance of our FTELM. Then, for the proposed FTELM method, we utilize an efficient successive overrelaxation algorithm to solve the corresponding optimization problem. For the proposed CL1-FTELM, an iterative method is designed to solve the corresponding optimization problem based on a re-weighted technique. Meanwhile, the convergence and local optimality of CL1-FTELM are proved theoretically. Finally, numerical experiments on artificial and UCI datasets show that the proposed methods achieve better classification results than state-of-the-art methods in most cases, which demonstrates the effectiveness and stability of the proposed methods.

1. Introduction

Extreme learning machine (ELM) [1,2], as a remarkable training method for single hidden layer feed-forward neural networks (SLFNs) [3], has been widely studied and applied in many fields such as efficient modeling [4], fashion retailing forecasting [5], fingerprint matching [6], metagenomic taxonomic classification [7], online sequential learning [8], and feature selection [9]. In ELM, the input-layer weights and hidden-layer biases are randomly generated, and the output weights of the network are computed efficiently by minimizing the training error together with the norm of the output weights. In addition, many researchers have tried to extend the extreme learning machine model to the support vector machine (SVM) learning framework to solve classification problems [10]. Frenay et al. [11] found that the transformation performed by the first layer of ELM can be viewed as a kernel that can be plugged into SVM. Since SVM-type optimization methods can be utilized to solve the ELM model, an extreme learning machine based on the optimization method (OPTELM) was proposed in [12]. For binary classification problems, traditional ELM needs to compute all the sample points of the training data at the same time in the training stage, which is time-consuming. Moreover, traditional ELM trains a single hyperplane to perform the classification task, which greatly restricts its application prospects and directions of evolution. Jayadeva et al. [13] proposed the twin SVM (TWSVM), a well-known non-parallel hyperplane classification algorithm for binary classification. Inspired by TWSVM, Wan et al. [14] proposed the twin extreme learning machine (TELM). Compared with ELM, TELM trains two non-parallel hyperplanes for the classification task by solving two smaller quadratic programming problems (QPPs). Compared with TWSVM, TELM's optimization problems have fewer constraints, so the training speed is faster and the application prospects are broader. In recent years, researchers have made many improvements to TELM, such as the sparse twin extreme learning machine [15], the robust twin extreme learning machine [16], time-efficient variants of the twin extreme learning machine [17], and a generalized adaptive robust distance metric driven smooth regularization learning framework [18].
Although the above ELM-based algorithms have a good classification effect, the statistical knowledge of the data itself is ignored. However, the statistical knowledge of the data is very important for constructing an efficient classifier. Fisher discriminant analysis (FDA) is an effective discriminant tool that minimizes the intra-class divergence while keeping the inter-class divergence of the data constant. From the above discussion, it is natural to construct a new classification model by combining the characteristics of the ELM model and FDA. Recently, Ma et al. [19] successfully combined them and proposed the Fisher-regularized extreme learning machine (FELM), which not only retains the efficient solution of ELM but also fully considers the statistical knowledge of the data.
Although the above models have good classification performance, most of them use the L2-norm. When the data contain noise or outliers, they cannot deal with them well, which degrades the classification performance of the model. In recent years, researchers have tried to introduce the L1-norm into various models [20,21,22,23] to reduce the impact of outliers. These studies have shown that the L1-norm can reduce the effect of outliers to some extent; however, it is still unsatisfactory when the data contain a large number of outliers. Recently, researchers have introduced the idea of truncation into the L1-norm, constructed a new capped L1-norm, and applied it to various models [24,25,26]. Many studies [27,28] show that the capped L1-norm not only inherits the advantages of the L1-norm but is also bounded, so it is more robust and approximates the L0-norm to some degree. For instance, by applying the capped L1-norm to the twin SVM, Wang et al. [29] proposed a new robust twin support vector machine (CL1-TWSVM). Based on the twin support vector machine with privileged information [30] (TWSVMPI), a new robust TWSVMPI [31] was proposed by replacing the L2-norm with the capped L1-norm; the new model further improves the anti-noise ability of the model.
In order to utilize the advantages of the twin extreme learning machine and FDA, we first put forward a novel classifier named the Fisher-regularized twin extreme learning machine (FTELM). Considering the instability of the L2-norm with respect to outliers, we then introduce the capped L1-norm into the FTELM model and propose a more robust capped L1-norm FTELM (CL1-FTELM) model.
The main contributions of this paper are as follows:
(1) Based on the twin extreme learning machine and the Fisher-regularized extreme learning machine (FELM), a new Fisher-regularized twin extreme learning machine (FTELM) is proposed. FTELM minimizes the intra-class divergence while fixing the inter-class divergence of the samples. FTELM takes full account of the statistical information of the sample data, and its training speed is faster than that of FELM.
(2) Considering the instability of the L2-norm and the Hinge loss used by FTELM, we replace them with the capped L1-norm and propose a new capped L1-norm FTELM model. CL1-FTELM uses the capped L1-norm to reduce the influence of noise points and at the same time utilizes Fisher regularization to exploit the statistical knowledge of the data.
(3) Two algorithms are designed by utilizing the successive overrelaxation (SOR) [32] technique and the re-weighted technique [27] to solve the optimization problems of the proposed FTELM and CL1-FTELM, respectively.
(4) Two theorems about the convergence and local optimality of CL1-FTELM are proved.
The organizational structure of this paper is as follows. In Section 2, we briefly review related work. In Section 3, we describe the FTELM model in detail. The robust capped L1-norm FTELM learning framework, along with the related theoretical proofs, is described in Section 4. In Section 5, we describe numerical experiments on artificial and benchmark datasets. We summarize this paper in Section 6.

2. Related Work

In this section, we first define the notation needed in this paper, and then briefly review Fisher regularization, FELM, TELM, and the successive overrelaxation algorithm.

2.1. The Concept of Symbols

$e$ is a vector whose components are all ones, an identity matrix is represented by $I$, and a matrix (vector) of zeros is represented by $0$. $\|\cdot\|_2$ is the $L_2$-norm, and $\|\cdot\|_F$ stands for the Frobenius norm.
A binary classification problem in Euclidean space $\mathbb{R}^d$ can be formulated in the following form:
$T = \{(x_i, y_i)\} \subseteq X \times Y, \quad i = 1, \ldots, m \qquad (1)$
where $x_i \in X \subseteq \mathbb{R}^d$ is an input sample in a $d$-dimensional Euclidean space. Similarly, $y_i \in Y = \{-1, +1\}$ is the output label corresponding to the input instance $x_i$. In addition, $m_1$ and $m_2$ represent the numbers of positive and negative samples, respectively, and $m = m_1 + m_2$.

2.2. Fisher Regularization

Fisher regularization has the following form:
$\|f\|_F^2 = f^T N f = \sum_{i \in I_+}\left(f(x_i) - \bar{f}_+\right)^2 + \sum_{i \in I_-}\left(f(x_i) - \bar{f}_-\right)^2 \qquad (2)$
where $f = \left(f(x_1), f(x_2), \ldots, f(x_m)\right)^T$, $N = I - G$, $I \in \mathbb{R}^{m \times m}$ is the identity matrix, and $G$ is the matrix with the elements:
$G_{ij} = \begin{cases} \frac{1}{m_1}, & \text{for } i, j \in I_+ \\ \frac{1}{m_2}, & \text{for } i, j \in I_- \\ 0, & \text{otherwise} \end{cases} \qquad (3)$
where $I_{\pm}$ are the index sets of the positive and negative training data, $m_1 = |I_+|$, $m_2 = |I_-|$. The average value of $f(x)$ over the positive sample set is denoted by $\bar{f}_+$, and the average value of $f(x)$ over the negative sample set is denoted by $\bar{f}_-$. From Equation (2), we can see that the Fisher regularization term measures the intra-class divergence of the data.
The proof of Formula (2) is as follows:
$\sum_{i \in I_+}\left(f(x_i) - \bar{f}_+\right)^2 = \sum_{i \in I_+}\left(f^2(x_i) - 2 f(x_i)\bar{f}_+ + \bar{f}_+^2\right) = \sum_{i \in I_+} f^2(x_i) - m_1 \bar{f}_+^2 = f_+^T f_+ - \frac{1}{m_1} f_+^T e e^T f_+ = f_+^T I_+ f_+ - f_+^T M_+ f_+ = f_+^T\left(I_+ - M_+\right) f_+ = f_+^T N_1 f_+ \qquad (4)$
where $e = (1, \ldots, 1)^T$ is a vector of $m_1$ dimensions, $f_+ = \left(f(x_1), f(x_2), \ldots, f(x_{m_1})\right)^T$ with $i \in I_+$, $I_+ \in \mathbb{R}^{m_1 \times m_1}$ is the identity matrix, $M_+ \in \mathbb{R}^{m_1 \times m_1}$, and all the entries of the matrix $M_+$ are $\frac{1}{m_1}$.
Similarly, it can be obtained that:
$\sum_{i \in I_-}\left(f(x_i) - \bar{f}_-\right)^2 = f_-^T\left(I_- - M_-\right) f_- = f_-^T N_2 f_- \qquad (5)$
where $f_- = \left(f(x_1), f(x_2), \ldots, f(x_{m_2})\right)^T$ with $i \in I_-$, $I_- \in \mathbb{R}^{m_2 \times m_2}$ is the identity matrix, $M_- \in \mathbb{R}^{m_2 \times m_2}$, and all the entries of the matrix $M_-$ are $\frac{1}{m_2}$.
Combining Equations (4) and (5), we can get another form of Equation (2):
$f_+^T\left(I_+ - M_+\right) f_+ + f_-^T\left(I_- - M_-\right) f_- = \left(f_+, f_-\right)^T \left(I - \begin{pmatrix} M_+ & 0_1 \\ 0_2 & M_- \end{pmatrix}\right)\left(f_+, f_-\right) = f^T (I - G) f = f^T N f \qquad (6)$
where $0_1 \in \mathbb{R}^{m_1 \times m_2}$, $0_2 \in \mathbb{R}^{m_2 \times m_1}$, and $G = \begin{pmatrix} M_+ & 0_1 \\ 0_2 & M_- \end{pmatrix}$.
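As a quick check of this construction, the following NumPy sketch (an illustration, not code from the paper; the function name is an assumption) builds $N = I - G$ from the labels and verifies numerically that $f^T N f$ equals the within-class scatter of $f$.

```python
import numpy as np

def fisher_matrix(y):
    """Build N = I - G from binary labels y in {+1, -1} (Equations (2)-(3))."""
    y = np.asarray(y)
    m = y.size
    G = np.zeros((m, m))
    for label in (+1, -1):
        idx = np.where(y == label)[0]
        G[np.ix_(idx, idx)] = 1.0 / idx.size   # 1/m1 or 1/m2 on same-class index pairs
    return np.eye(m) - G

# Numerical check: f^T N f equals the within-class scatter of f.
rng = np.random.default_rng(0)
y = np.array([+1, +1, +1, -1, -1])
f = rng.normal(size=y.size)
N = fisher_matrix(y)
scatter = sum(np.sum((f[y == s] - f[y == s].mean()) ** 2) for s in (+1, -1))
assert np.isclose(f @ N @ f, scatter)
```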

2.3. Fisher-Regularized Extreme Learning Machine

The primal problem of Fisher-regularized extreme learning machine (FELM) is as follows:
$\min_{\beta, \xi}\ \frac{1}{2}\beta^T\beta + C_1 e^T\xi + \frac{1}{2}C_2\,\alpha^T K_{ELM} N K_{ELM}\,\alpha \quad \text{s.t.}\quad Y(H\beta) \ge e - \xi, \quad \xi \ge 0 \qquad (7)$
According to the representer theorem $\beta = \sum_{i=1}^{m}\alpha_i h(x_i) = H^T\alpha$, problem (7) can be written as problem (8):
$\min_{\alpha, \xi}\ \frac{1}{2}\alpha^T K_{ELM}\alpha + C_1 e^T\xi + \frac{1}{2}C_2\,\alpha^T K_{ELM} N K_{ELM}\,\alpha \quad \text{s.t.}\quad Y(K_{ELM}\alpha) \ge e - \xi, \quad \xi \ge 0 \qquad (8)$
where $K_{ELM} \in \mathbb{R}^{m \times m}$ is a Gram matrix with elements $k_{ELM}(x_i, x_j)$, $k_{ELM}(x_i, x) = h(x)^T h(x_i)$, $h(x)$ denotes the hidden-layer output vector of $x$, $Y \in \mathbb{R}^{m \times m}$ is a diagonal matrix with elements $y_i$, $C_1, C_2$ are the regularization parameters, and $\xi$ is a nonnegative slack vector.
According to the optimization theory, the dual form of the problem (8) can be obtained as follows:
$\min_{\theta}\ \frac{1}{2}\theta^T Q\theta - e^T\theta \quad \text{s.t.}\quad 0 \le \theta \le C_1 e \qquad (9)$
where $Q = Y\left[\left(I + C_2 N K_{ELM}\right)^{-1}\right]^T K_{ELM}\, Y$.
The decision function of the Fisher-regularized extreme learning machine is:
$f(x) = \mathrm{sign}\left(\sum_{i=1}^{m}\alpha_i\, k_{ELM}(x_i, x)\right) \qquad (10)$

2.4. Twin Extreme Learning Machine

Similar to the form of TWSVM [13], the primal problem of TELM [14] can be expressed in the following:
Primal TELM 1: $\min_{\beta_1, \xi}\ \frac{1}{2}\left\|H_1\beta_1\right\|_2^2 + C_1 e_2^T\xi \quad \text{s.t.}\quad -(H_2\beta_1) + \xi \ge e_2, \quad \xi \ge 0 \qquad (11)$
Primal TELM 2: $\min_{\beta_2, \eta}\ \frac{1}{2}\left\|H_2\beta_2\right\|_2^2 + C_2 e_1^T\eta \quad \text{s.t.}\quad H_1\beta_2 + \eta \ge e_1, \quad \eta \ge 0 \qquad (12)$
where $H_1$ and $H_2$ represent the hidden-layer outputs for the positive and negative samples, respectively, $\xi$ and $\eta$ represent the slack vectors, $0$ is a zero vector, $C_1, C_2 > 0$ are penalty parameters, and $e_1 \in \mathbb{R}^{m_1}$ and $e_2 \in \mathbb{R}^{m_2}$ are vectors of ones.
By introducing Lagrange multipliers $\alpha$ and $\vartheta$, the dual problems of (11) and (12) can be written as follows:
Dual TELM 1: $\min_{\alpha}\ \frac{1}{2}\alpha^T H_2\left(H_1^T H_1\right)^{-1} H_2^T\alpha - e_2^T\alpha \quad \text{s.t.}\quad 0 \le \alpha \le C_1 e_2 \qquad (13)$
Dual TELM 2: $\min_{\vartheta}\ \frac{1}{2}\vartheta^T H_1\left(H_2^T H_2\right)^{-1} H_1^T\vartheta - e_1^T\vartheta \quad \text{s.t.}\quad 0 \le \vartheta \le C_2 e_1 \qquad (14)$
The solutions of (13) and (14) are as follows:
$\beta_1 = -\left(H_1^T H_1 + \epsilon_1 I\right)^{-1} H_2^T\alpha \qquad (15)$
$\beta_2 = \left(H_2^T H_2 + \epsilon_2 I\right)^{-1} H_1^T\vartheta \qquad (16)$
where $\epsilon_1$ and $\epsilon_2$ are two small positive constants and $I$ is an identity matrix. The decision function of the twin extreme learning machine is:
$f(x) = \arg\min_{k=1,2} d_k(x) = \arg\min_{k=1,2}\left|\beta_k^T h(x)\right| \qquad (17)$

2.5. Successive Overrelaxation Algorithm

The successive overrelaxation algorithm [32] mainly aims at the following optimization problems:
$\min_{\mu}\ \frac{1}{2}\left\|H^T\mu\right\|_2^2 - e^T\mu \quad \text{s.t.}\quad \mu \in S = \{\mu \mid 0 \le \mu \le Ce\} \qquad (18)$
Let $HH^T = L + E + L^T$, where $L$ is the strictly lower triangular part of $HH^T$ and $E$ is the diagonal matrix formed by the diagonal elements of $HH^T$.
The gradient projection optimality condition is the necessary and sufficient optimality condition for Equation (18):
$\mu = \left(\mu - \pi E^{-1}\left(HH^T\mu - e\right)\right)_{\#}, \quad \pi > 0 \qquad (19)$
where $(\cdot)_{\#}$ denotes the 2-norm projection onto the feasible region of Equation (18), that is:
$\left(\mu_{\#}\right)_i = \begin{cases} 0, & \text{if } \mu_i \le 0 \\ \mu_i, & \text{if } 0 < \mu_i < C \\ C, & \text{if } \mu_i \ge C \end{cases}, \quad i = 1, 2, \ldots, m \qquad (20)$
The matrix $HH^T$ is split in the following form:
$HH^T = \pi^{-1} E (B + C) \quad \text{s.t.}\quad B - C \text{ is positive definite} \qquad (21)$
Here:
$B = I + \pi E^{-1} L, \qquad C = (\pi - 1) I + \pi E^{-1} L^T, \qquad 0 < \pi < 2$
According to [33], the matrix splitting algorithm is as follows:
$\mu^{i+1} = \left(\mu^{i+1} - \left(B\mu^{i+1} + C\mu^i - \pi E^{-1} e\right)\right)_{\#} \qquad (22)$
Substituting Equation (21) into Equation (22), it can be obtained that:
$\mu^{i+1} = \left(\mu^i - \pi E^{-1}\left(HH^T\mu^i - e + L\left(\mu^{i+1} - \mu^i\right)\right)\right)_{\#} \qquad (23)$
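To make the update (23) concrete, the following NumPy sketch (an illustration, not code from the paper; the function name and defaults are assumptions) performs the SOR sweep for a box-constrained quadratic program of the form (18). Because $L$ is strictly lower triangular, updating the components in order automatically uses the already-updated entries, which is exactly the Gauss–Seidel structure of Equation (23).

```python
import numpy as np

def sor_box_qp(A, C, pi=1.5, tol=1e-5, max_iter=1000):
    """Approximately solve min 0.5*mu^T A mu - e^T mu  s.t.  0 <= mu <= C."""
    m = A.shape[0]
    mu = np.zeros(m)
    diag = np.maximum(np.diag(A), 1e-12)   # diagonal of A = H H^T, guarded against zeros
    for _ in range(max_iter):
        mu_old = mu.copy()
        for i in range(m):                 # sweep in order: entries j < i are already updated
            grad_i = A[i] @ mu - 1.0
            mu[i] = np.clip(mu[i] - pi * grad_i / diag[i], 0.0, C)
        if np.linalg.norm(mu - mu_old) < tol:
            break
    return mu

# Usage for problem (18): A = H @ H.T, with the relaxation parameter pi chosen in (0, 2).
```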

3. Fisher-Regularized Twin Extreme Learning Machine

3.1. Model Formulation

As mentioned above, TELM solves two smaller QPPs, so its solution can be obtained quickly. However, it ignores the prior statistical knowledge of the data. FELM minimizes the within-class scatter while controlling the between-class scatter of the samples, but FELM needs to solve a large-scale quadratic programming problem, which is time-consuming. In this paper, by combining the advantages of FELM and TELM, we first propose the Fisher-regularized twin extreme learning machine (FTELM) by introducing Fisher regularization into the TELM feature space. FTELM only needs to solve two smaller quadratic programming problems and meanwhile utilizes the prior statistical knowledge of the data. The pair of FTELM primal problems is as follows:
Primal FTELM 1: $\min_{\beta_1, \xi}\ \frac{1}{2}\left\|H_1\beta_1\right\|^2 + C_1 e_2^T\xi + \frac{C_2}{2} f_1(x)^T N_1 f_1(x) \quad \text{s.t.}\quad -(H_2\beta_1) + \xi \ge e_2, \quad \xi \ge 0 \qquad (24)$
Primal FTELM 2: $\min_{\beta_2, \eta}\ \frac{1}{2}\left\|H_2\beta_2\right\|^2 + C_3 e_1^T\eta + \frac{C_4}{2} f_2(x)^T N_2 f_2(x) \quad \text{s.t.}\quad H_1\beta_2 + \eta \ge e_1, \quad \eta \ge 0 \qquad (25)$
From Equations (4) and (5), we know that $N_1 = I_+ - M_+$ and $N_2 = I_- - M_-$; $C_1, C_2, C_3, C_4 > 0$ are regularization parameters; $\xi$ and $\eta$ are the error vectors; and all the elements in the vectors $e_1 \in \mathbb{R}^{m_1}$ and $e_2 \in \mathbb{R}^{m_2}$ are one. FTELM first inherits the advantage of the classical twin extreme learning machine, which computes two non-parallel hyperplanes to solve the classification problem. Secondly, FTELM takes full account of the statistical information of the samples and further improves the classification accuracy of the classifier. The optimization objective function in (24) of FTELM mainly has three terms: minimizing the distance from the positive class sample points to the positive class hyperplane, minimizing the empirical loss, and minimizing the intra-class divergence of the samples. The constraint in (24) requires that the distance between the negative class sample points and the positive class hyperplane be greater than or equal to one. In a word, FTELM makes the positive class sample points closer to the positive class hyperplane and the negative class sample points far away from the positive class hyperplane; at the same time, the positive class sample points are more concentrated around their class center. A similar explanation holds for the model (25).
According to the representer theorem $\beta = \sum_{i=1}^{m}\alpha_i h(x_i) = H^T\alpha$, we have $\beta_1 = H_1^T\alpha_1$ and $\beta_2 = H_2^T\alpha_2$. We also know that $f = H\beta$. Therefore, problems (24) and (25) can be written in the following forms:
$\min_{\alpha_1, \xi}\ \frac{1}{2}\alpha_1^T K_{ELM_1}K_{ELM_1}\alpha_1 + C_1 e_2^T\xi + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \quad \text{s.t.}\quad -\left(H_2H_1^T\alpha_1\right) + \xi \ge e_2, \quad \xi \ge 0 \qquad (26)$
$\min_{\alpha_2, \eta}\ \frac{1}{2}\alpha_2^T K_{ELM_2}K_{ELM_2}\alpha_2 + C_3 e_1^T\eta + \frac{C_4}{2}\alpha_2^T K_{ELM_2}N_2K_{ELM_2}\alpha_2 \quad \text{s.t.}\quad H_1H_2^T\alpha_2 + \eta \ge e_1, \quad \eta \ge 0 \qquad (27)$
where $K_{ELM_1} = H_1H_1^T$ and $K_{ELM_2} = H_2H_2^T$ are Gram matrices.

3.2. Model Solution

Introducing Lagrange multipliers $\theta = (\theta_1, \ldots, \theta_{m_2})^T$ and $\vartheta = (\vartheta_1, \ldots, \vartheta_{m_2})^T$, the Lagrangian of (26) can be written as follows:
$L(\alpha_1, \xi, \theta, \vartheta) = \frac{1}{2}\alpha_1^T K_{ELM_1}\left(I_1 + C_2 N_1\right)K_{ELM_1}\alpha_1 + C_1 e_2^T\xi - \theta^T\left(-H_2H_1^T\alpha_1 + \xi - e_2\right) - \vartheta^T\xi \qquad (28)$
According to the KKT conditions, we get:
$\frac{\partial L}{\partial \alpha_1} = K_{ELM_1}\left(I_1 + C_2 N_1\right)K_{ELM_1}\alpha_1 + H_1H_2^T\theta = 0 \qquad (29)$
$\frac{\partial L}{\partial \xi} = C_1 e_2 - \theta - \vartheta = 0 \qquad (30)$
$\theta^T\left(-H_2H_1^T\alpha_1 + \xi - e_2\right) = 0 \qquad (31)$
$\vartheta^T\xi = 0 \qquad (32)$
$\theta \ge 0 \qquad (33)$
$\vartheta \ge 0 \qquad (34)$
From (29) and (30), we can get:
$\alpha_1^* = -\left[K_{ELM_1}\left(I_1 + C_2 N_1\right)K_{ELM_1}\right]^{-1} H_1H_2^T\theta \qquad (35)$
$0 \le \theta \le C_1 e_2 \qquad (36)$
By substituting (29)–(34) into (28), the dual optimization problem of (26) can be written in the following form:
Dual FTELM 1: $\min_{\theta}\ \frac{1}{2}\theta^T Q_1\theta - e_2^T\theta \quad \text{s.t.}\quad 0 \le \theta \le C_1 e_2 \qquad (37)$
Here $Q_1 = H_2H_1^T\left[K_{ELM_1}\left(I_1 + C_2 N_1\right)K_{ELM_1}\right]^{-1} H_1H_2^T$.
Similarly, we can obtain the dual of (27) as:
Dual FTELM 2: $\min_{\lambda}\ \frac{1}{2}\lambda^T Q_2\lambda - e_1^T\lambda \quad \text{s.t.}\quad 0 \le \lambda \le C_3 e_1 \qquad (38)$
Here $\lambda = (\lambda_1, \ldots, \lambda_{m_1})^T$ is the vector of Lagrange multipliers and $Q_2 = H_1H_2^T\left[K_{ELM_2}\left(I_2 + C_4 N_2\right)K_{ELM_2}\right]^{-1} H_2H_1^T$.
We use the successive overrelaxation (SOR) [32] technique to solve the convex quadratic optimization problems (37) and (38) (the SOR-FTELM algorithm is summarized as Algorithm 1), which gives $\theta$ and $\lambda$. Therefore, we can obtain the solutions of problems (24) and (25) as follows:
$\beta_1 = -H_1^T\left[K_{ELM_1}\left(I_1 + C_2 N_1\right)K_{ELM_1} + \delta_1 I_1\right]^{-1} H_1H_2^T\theta \qquad (39)$
$\beta_2 = H_2^T\left[K_{ELM_2}\left(I_2 + C_4 N_2\right)K_{ELM_2} + \delta_2 I_2\right]^{-1} H_2H_1^T\lambda \qquad (40)$
where $\delta_1, \delta_2 > 0$ are small regularization constants. The decision function of FTELM is:
$f(x) = \arg\min_{k=1,2}\left|\beta_k^T h(x)\right| \qquad (41)$
Algorithm 1 The procedure of SOR-FTELM.
Input: Training set $T = \{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^d$, $y_i = \pm 1$; the number of hidden nodes $L$; tolerance $\varepsilon$; regularization parameters $C_1, C_2, C_3, C_4$.
Output: $\beta_1$, $\beta_2$, and the decision function of FTELM.
1: Compute the graph matrices $N_1, N_2$ by Equations (4) and (5).
2: Choose an activation function such as $G(x) = \frac{1}{1 + e^{-x}}$, compute the hidden-layer output matrices $H_1, H_2$ by $h(x_i) = G\left(\sum_{j=1}^{d}\omega_{ji}x_j + b_i\right)$, and compute $K_{ELM_1} = H_1H_1^T$ and $K_{ELM_2} = H_2H_2^T$.
3: Choose $t \in (0, 2)$ and start with any $\theta^0 \in \mathbb{R}^{m_2}$. Having $\theta^i$, compute $\theta^{i+1}$ as follows:
$\theta^{i+1} = \left(\theta^i - tE_1^{-1}\left(Q_1\theta^i - e_2 + L_1\left(\theta^{i+1} - \theta^i\right)\right)\right)_{\#}$
until $\left\|\theta^{i+1} - \theta^i\right\| \le \varepsilon$, where $e_2$ is a vector of ones of appropriate dimension, $L_1 \in \mathbb{R}^{m_2 \times m_2}$ is the strictly lower triangular part of $Q_1$ (with $l_{ij} = q_{ij}$ for $i > j$), and $E_1 \in \mathbb{R}^{m_2 \times m_2}$ is the diagonal matrix with $e_{ii} = q_{ii}$.
Then, given any $\lambda^0 \in \mathbb{R}^{m_1}$ and having $\lambda^i$, compute $\lambda^{i+1}$ as follows:
$\lambda^{i+1} = \left(\lambda^i - tE_2^{-1}\left(Q_2\lambda^i - e_1 + L_2\left(\lambda^{i+1} - \lambda^i\right)\right)\right)_{\#}$
4: Compute the output weights $\beta_1, \beta_2$ using Equations (39) and (40).
5: Construct the decision function:
$f(x) = \arg\min_{k=1,2}\left|\beta_k^T h(x)\right|$
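For concreteness, the following self-contained NumPy sketch walks through the main steps of Algorithm 1: a random sigmoid hidden layer, the within-class matrices $N_1, N_2$ of Equations (4) and (5), the dual matrices $Q_1, Q_2$ of (37) and (38), a simple SOR-style box-constrained solve, and the output weights of Equations (39) and (40) under the sign conventions reconstructed above. It is an illustration only, not the authors' MATLAB implementation; the function names, the small ridge term `delta`, and the componentwise solver are assumptions.

```python
import numpy as np

def hidden_output(X, W, b):
    """Sigmoid hidden layer h(x) = G(Wx + b), one row per sample."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def within_class_matrix(m):
    """N = I - M with every entry of M equal to 1/m (Equations (4)/(5))."""
    return np.eye(m) - np.full((m, m), 1.0 / m)

def sor_box_qp(Q, C, pi=1.5, tol=1e-5, max_iter=500):
    """Gauss-Seidel/SOR sweep for min 0.5 u^T Q u - e^T u, 0 <= u <= C."""
    u = np.zeros(Q.shape[0])
    d = np.maximum(np.diag(Q), 1e-12)
    for _ in range(max_iter):
        u_old = u.copy()
        for i in range(u.size):
            u[i] = np.clip(u[i] - pi * (Q[i] @ u - 1.0) / d[i], 0.0, C)
        if np.linalg.norm(u - u_old) < tol:
            break
    return u

def ftelm_fit(X, y, L=100, C1=1.0, C2=1.0, C3=1.0, C4=1.0, delta=1e-6, seed=0):
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    W, b = rng.normal(size=(X.shape[1], L)), rng.normal(size=L)
    H1 = hidden_output(X[y == +1], W, b)              # positive-class hidden outputs
    H2 = hidden_output(X[y == -1], W, b)              # negative-class hidden outputs
    K1, K2 = H1 @ H1.T, H2 @ H2.T
    N1, N2 = within_class_matrix(H1.shape[0]), within_class_matrix(H2.shape[0])

    A1 = K1 @ (np.eye(K1.shape[0]) + C2 * N1) @ K1 + delta * np.eye(K1.shape[0])
    A2 = K2 @ (np.eye(K2.shape[0]) + C4 * N2) @ K2 + delta * np.eye(K2.shape[0])
    Q1 = H2 @ H1.T @ np.linalg.solve(A1, H1 @ H2.T)   # dual matrix of (37)
    Q2 = H1 @ H2.T @ np.linalg.solve(A2, H2 @ H1.T)   # dual matrix of (38)

    theta = sor_box_qp(Q1, C1)
    lam = sor_box_qp(Q2, C3)
    beta1 = -H1.T @ np.linalg.solve(A1, H1 @ H2.T @ theta)   # Equation (39)
    beta2 = H2.T @ np.linalg.solve(A2, H2 @ H1.T @ lam)      # Equation (40)
    return W, b, beta1, beta2

def ftelm_predict(X, W, b, beta1, beta2):
    """Assign each point to the closer hyperplane in the sense of Equation (41)."""
    H = hidden_output(X, W, b)
    return np.where(np.abs(H @ beta1) <= np.abs(H @ beta2), +1, -1)
```

Here `X` is an m-by-d array and `y` a vector of ±1 labels; the small ridge term merely keeps the inverses in (39) and (40) well conditioned.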

4. Capped L1-Norm Fisher-Regularized Twin Extreme Learning Machine

4.1. Model Formulation

The Fisher-regularized twin extreme learning machine proposed in the previous section not only inherits the advantages of the twin extreme learning machine but also makes full use of the statistical information of the samples. However, due to the use of the squared L2-norm distance and the hinge loss function, FTELM is not robust enough when noisy points are present, which often enlarges the impact of abnormal values. In order to reduce the influence of outliers and improve the robustness of FTELM, we propose a capped L1-norm Fisher-regularized twin extreme learning machine (CL1-FTELM) by replacing the L2-norm and hinge loss in FTELM with the capped L1-norm. The primal problems of CL1-FTELM are as follows:
Primal CL1-FTELM 1:
$\min_{\alpha_1, \xi}\ \sum_{i=1}^{m_1}\min\left(\left\|h^T(x_i)H_1^T\alpha_1\right\|_1, \varepsilon_1\right) + C_1\sum_{j=1}^{m_2}\min\left(\left\|\xi_j\right\|_1, \varepsilon_2\right) + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \quad \text{s.t.}\quad -\left(H_2H_1^T\alpha_1\right) + \xi \ge e_2 \qquad (42)$
Primal CL1-FTELM 2:
$\min_{\alpha_2, \eta}\ \sum_{j=1}^{m_2}\min\left(\left\|h^T(x_j)H_2^T\alpha_2\right\|_1, \varepsilon_3\right) + C_3\sum_{i=1}^{m_1}\min\left(\left\|\eta_i\right\|_1, \varepsilon_4\right) + \frac{C_4}{2}\alpha_2^T K_{ELM_2}N_2K_{ELM_2}\alpha_2 \quad \text{s.t.}\quad H_1H_2^T\alpha_2 + \eta \ge e_1 \qquad (43)$
where $C_1, C_2, C_3, C_4 > 0$ are regularization parameters and $\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4$ are thresholding parameters.
CL1-FTELM uses the capped L1-norm to reduce the influence of noise points, and at the same time utilizes Fisher regularization to exploit the statistical knowledge of the data. Based on FTELM, CL1-FTELM changes the L2-norm metric and the Hinge loss function of the original model to the capped L1-norm. The capped L1-norm is bounded and can constrain the impact of noise within a certain range; therefore, the anti-noise ability of the model can be improved. The optimization objective function in (42) of CL1-FTELM also contains three terms: minimizing the distance between the positive class sample points and the positive class hyperplane by using the capped L1-norm metric, minimizing the empirical loss by using the capped L1-norm loss function, and minimizing the within-class scatter of the samples. The constraint in (42) of CL1-FTELM requires that the distance between the negative class sample points and the positive class hyperplane be greater than or equal to one. In summary, CL1-FTELM inherits the advantages of FTELM while further improving the noise immunity of the model by replacing the metric and loss function with the capped L1-norm. However, the CL1-FTELM problem is non-convex and non-smooth. Here, we use the re-weighting technique [27] to solve the problem corresponding to the CL1-FTELM model, which is shown below:
CL1-FTELM 1:
$\min_{\alpha_1, \xi}\ \frac{1}{2}\alpha_1^T K_{ELM_1}FK_{ELM_1}\alpha_1 + \frac{C_1}{2}\xi^T D\xi + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \quad \text{s.t.}\quad -\left(H_2H_1^T\alpha_1\right) + \xi \ge e_2 \qquad (44)$
where $F$ and $D$ are two diagonal matrices whose $i$-th and $j$-th diagonal elements are:
$f_i = \begin{cases} \dfrac{1}{\left\|h^T(x_i)H_1^T\alpha_1\right\|_1}, & \left\|h^T(x_i)H_1^T\alpha_1\right\|_1 \le \varepsilon_1,\ i \in \{1, \ldots, m_1\} \\ \sigma_1, & \text{otherwise} \end{cases} \qquad (45)$
$d_j = \begin{cases} \dfrac{1}{\left\|\xi_j\right\|_1}, & \left\|\xi_j\right\|_1 \le \varepsilon_2,\ j \in \{1, \ldots, m_2\} \\ \sigma_2, & \text{otherwise} \end{cases} \qquad (46)$
Here $\sigma_1, \sigma_2$ are two small constants.
CL1-FTELM 2:
$\min_{\alpha_2, \eta}\ \frac{1}{2}\alpha_2^T K_{ELM_2}RK_{ELM_2}\alpha_2 + \frac{C_3}{2}\eta^T S\eta + \frac{C_4}{2}\alpha_2^T K_{ELM_2}N_2K_{ELM_2}\alpha_2 \quad \text{s.t.}\quad H_1H_2^T\alpha_2 + \eta \ge e_1 \qquad (47)$
where $R$ and $S$ are two diagonal matrices whose $j$-th and $i$-th diagonal elements are:
$r_j = \begin{cases} \dfrac{1}{\left\|h^T(x_j)H_2^T\alpha_2\right\|_1}, & \left\|h^T(x_j)H_2^T\alpha_2\right\|_1 \le \varepsilon_3,\ j \in \{1, \ldots, m_2\} \\ \sigma_3, & \text{otherwise} \end{cases} \qquad (48)$
$s_i = \begin{cases} \dfrac{1}{\left\|\eta_i\right\|_1}, & \left\|\eta_i\right\|_1 \le \varepsilon_4,\ i \in \{1, \ldots, m_1\} \\ \sigma_4, & \text{otherwise} \end{cases} \qquad (49)$
Here $\sigma_3, \sigma_4$ are two small constants.
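The re-weighting rules (45), (46), (48), and (49) all share the same pattern, sketched below in NumPy (an illustration with assumed names, not the authors' code): residuals inside the cap receive the weight 1/|u|, while residuals beyond the cap receive a small constant, so outliers contribute almost nothing to the surrogate objective.

```python
import numpy as np

def capped_l1_weights(u, eps, sigma=1e-4, floor=1e-8):
    """Diagonal weights f_i (or d_j, r_j, s_i) for the surrogate problems (44)/(47)."""
    a = np.abs(np.asarray(u, dtype=float))
    w = np.where(a <= eps, 1.0 / np.maximum(a, floor), sigma)
    return np.diag(w)

# Example: three residuals of which one clearly exceeds the cap eps = 1.0.
u = np.array([0.05, 0.2, 3.0])
F = capped_l1_weights(u, eps=1.0)
print(np.diag(F))   # weights 20, 5, and 1e-4 -> the outlier is strongly down-weighted
```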

4.2. Model Solution

Introducing Lagrange multipliers α , the Lagrange function of (44) can be written as follows:
$L(\alpha_1, \xi, \alpha) = \frac{1}{2}\alpha_1^T K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}\alpha_1 + \frac{C_1}{2}\xi^T D\xi - \alpha^T\left(-H_2H_1^T\alpha_1 + \xi - e_2\right) \qquad (50)$
According to the KKT conditions, we can get the following formulas:
$\frac{\partial L}{\partial \alpha_1} = K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}\alpha_1 + H_1H_2^T\alpha = 0 \qquad (51)$
$\frac{\partial L}{\partial \xi} = C_1 D\xi - \alpha = 0 \qquad (52)$
$\alpha^T\left(-H_2H_1^T\alpha_1 + \xi - e_2\right) = 0 \qquad (53)$
$\alpha \ge 0 \qquad (54)$
From Equations (51) and (52), we can get:
$\alpha_1 = -\left[K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}\right]^{-1} H_1H_2^T\alpha, \qquad \xi = \frac{1}{C_1}D^{-1}\alpha$
Similarly, we can get:
$\alpha_2 = \left[K_{ELM_2}\left(R + C_4 N_2\right)K_{ELM_2}\right]^{-1} H_2H_1^T\lambda, \qquad \eta = \frac{1}{C_3}S^{-1}\lambda$
Thus, we can get the dual problem of (44) as follows:
Dual CL1-FTELM 1:
$\min_{\alpha \ge 0}\ \frac{1}{2}\alpha^T\left[H_2H_1^T Q_1^{-1} H_1H_2^T + \frac{1}{C_1}D^{-1}\right]\alpha - e_2^T\alpha \qquad (55)$
where $Q_1 = K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}$.
In the same way, we can obtain the dual problem of Equation (47) as follows:
Dual CL1-FTELM 2:
$\min_{\lambda \ge 0}\ \frac{1}{2}\lambda^T\left[H_1H_2^T Q_2^{-1} H_2H_1^T + \frac{1}{C_3}S^{-1}\right]\lambda - e_1^T\lambda \qquad (56)$
where $Q_2 = K_{ELM_2}\left(R + C_4 N_2\right)K_{ELM_2}$.
After solving (55) and (56), $\alpha$ and $\lambda$ are derived, and then $\alpha_1$ and $\alpha_2$ are obtained. So, the decision function of CL1-FTELM is as follows:
$y = \arg\min_{k=1,2}\left|\alpha_k^T H_k h(x)\right| = \arg\min_{k=1,2}\left|\sum_{i=1}^{m_k}\alpha_{ki}\, k_{ELM_k}(x, x_i)\right| \qquad (57)$
Based on the above discussion, our algorithm will be presented in Algorithm 2. Next, we give the convergence analysis of Algorithm 2.
Algorithm 2 The procedure of CL1-FTELM.
Input: Training set $T = \{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^d$, $y_i = \pm 1$; the number of hidden nodes $L$; regularization parameters $C_1, C_2, C_3, C_4 > 0$; $\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4 > 0$; $\rho_1, \rho_2, \sigma_1, \sigma_2, \sigma_3, \sigma_4$.
Output: $\alpha_1^*$, $\alpha_2^*$, and the decision function of CL1-FTELM.
1: Initialize $F^0 \in \mathbb{R}^{m_1 \times m_1}$, $D^0 \in \mathbb{R}^{m_2 \times m_2}$, $R^0 \in \mathbb{R}^{m_2 \times m_2}$, $S^0 \in \mathbb{R}^{m_1 \times m_1}$.
2: Compute the graph matrices $N_1, N_2$ by Equations (4) and (5).
3: Choose an activation function such as $G(x) = \frac{1}{1 + e^{-x}}$, compute the hidden-layer output matrices $H_1, H_2$ by $h(x_i) = G\left(\sum_{j=1}^{d}\omega_{ji}x_j + b_i\right)$, and compute $K_{ELM_1} = H_1H_1^T$ and $K_{ELM_2} = H_2H_2^T$.
4: Set $t = 0$.
5: while true do
• Solve (55) and (56) to obtain $\alpha^t$ and $\lambda^t$.
• Obtain the solutions $\alpha_1^t$, $\alpha_2^t$, $\xi^t$, and $\eta^t$ by
$\alpha_1^t = -\left[K_{ELM_1}\left(F^t + C_2 N_1\right)K_{ELM_1}\right]^{-1} H_1H_2^T\alpha^t, \qquad \xi^t = \frac{1}{C_1}\left(D^t\right)^{-1}\alpha^t$
$\alpha_2^t = \left[K_{ELM_2}\left(R^t + C_4 N_2\right)K_{ELM_2}\right]^{-1} H_2H_1^T\lambda^t, \qquad \eta^t = \frac{1}{C_3}\left(S^t\right)^{-1}\lambda^t$
• Update the matrices $F^{t+1}$, $D^{t+1}$, $R^{t+1}$, and $S^{t+1}$ by (45), (46), (48), and (49), respectively.
• Compute the objective function values $J_1^{t+1}$ and $J_2^{t+1}$ by
$J_1^{t+1} = \frac{1}{2}\left(\alpha_1^t\right)^T K_{ELM_1}F^{t+1}K_{ELM_1}\alpha_1^t + \frac{C_1}{2}\left(\xi^t\right)^T D^{t+1}\xi^t + \frac{C_2}{2}\left(\alpha_1^t\right)^T K_{ELM_1}N_1K_{ELM_1}\alpha_1^t$
$J_2^{t+1} = \frac{1}{2}\left(\alpha_2^t\right)^T K_{ELM_2}R^{t+1}K_{ELM_2}\alpha_2^t + \frac{C_3}{2}\left(\eta^t\right)^T S^{t+1}\eta^t + \frac{C_4}{2}\left(\alpha_2^t\right)^T K_{ELM_2}N_2K_{ELM_2}\alpha_2^t$
• if $\left|J_1^{t+1} - J_1^t\right| \le \rho_1$ and $\left|J_2^{t+1} - J_2^t\right| \le \rho_2$ then break
• else $t = t + 1$
6: end while
7: Stop the iteration process and return the solutions $\alpha_1^*$, $\alpha_2^*$.
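The following compact NumPy sketch mirrors the outer loop of Algorithm 2 for the first sub-problem (44)/(55); it is an illustration under the reconstruction above rather than the authors' code, and the projected-gradient dual solver, the ridge term, and all names are assumptions made for brevity.

```python
import numpy as np

def pg_nonneg_qp(M, q, iters=2000):
    """min 0.5 a^T M a - q^T a  s.t.  a >= 0, by projected gradient."""
    a = np.zeros(M.shape[0])
    step = 1.0 / (np.linalg.norm(M, 2) + 1e-12)
    for _ in range(iters):
        a = np.maximum(a - step * (M @ a - q), 0.0)
    return a

def cl1_ftelm_first_problem(H1, H2, N1, C1, C2, eps1, eps2,
                            sigma=1e-4, rho=1e-3, max_outer=50):
    m1, m2 = H1.shape[0], H2.shape[0]
    K1, G = H1 @ H1.T, H2 @ H1.T                 # Gram matrix and cross term
    F, D = np.eye(m1), np.eye(m2)                # initial weights F^0, D^0
    J_prev, alpha1 = np.inf, np.zeros(m1)
    for _ in range(max_outer):
        Q1 = K1 @ (F + C2 * N1) @ K1 + 1e-8 * np.eye(m1)
        M = G @ np.linalg.solve(Q1, G.T) + np.linalg.inv(D) / C1   # dual matrix of (55)
        a = pg_nonneg_qp(M, np.ones(m2))
        alpha1 = -np.linalg.solve(Q1, G.T @ a)   # alpha1 from (51)
        xi = np.linalg.inv(D) @ a / C1           # xi from (52)
        # refresh the capped-L1 weights, Equations (45)-(46)
        r = np.abs(K1 @ alpha1)
        F = np.diag(np.where(r <= eps1, 1.0 / np.maximum(r, 1e-8), sigma))
        D = np.diag(np.where(np.abs(xi) <= eps2,
                             1.0 / np.maximum(np.abs(xi), 1e-8), sigma))
        J = 0.5 * alpha1 @ K1 @ F @ K1 @ alpha1 + 0.5 * C1 * xi @ D @ xi \
            + 0.5 * C2 * alpha1 @ K1 @ N1 @ K1 @ alpha1
        if abs(J_prev - J) <= rho:               # convergence test of Algorithm 2
            break
        J_prev = J
    return alpha1
```

The second sub-problem (47)/(56) follows the same loop with $H_1$ and $H_2$ swapped and the weights $R, S$ in place of $F, D$.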

4.3. Convergence Analysis

Before we prove the convergence of the iterative algorithm, we first review two lemmas [34].
Lemma 1.
For any non-zero vectors $x, y \in \mathbb{R}^n$, if $f(x) = \|x\|_1 - \frac{\|x\|_1^2}{2\|y\|_1}$, then the inequality $f(x) \le f(y)$ holds.
Lemma 2.
For any non-zero vectors $x, y, p, q \in \mathbb{R}^n$, if $f(x, p) = \|x\|_1 - \frac{\|x\|_1^2}{2\|y\|_1} + C\left(\|p\|_1 - \frac{\|p\|_1^2}{2\|q\|_1}\right)$ with $C \in \mathbb{R}_+$, then the inequality $f(x, p) \le f(y, q)$ holds.
The proofs of the two lemmas are detailed in [34].
Theorem 1.
Algorithm 2 monotonically decreases the objectives of problems (42) and (43) in each iteration until it converges.
Proof. 
Here, we only use problem (42) as an example to prove Theorem 1.
$J(\alpha_1, \xi) = \min_{\alpha_1, \xi}\ \sum_{i=1}^{m_1}\min\left(\left\|h^T(x_i)H_1^T\alpha_1\right\|_1, \varepsilon_1\right) + C_1\sum_{j=1}^{m_2}\min\left(\left\|\xi_j\right\|_1, \varepsilon_2\right) + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (60)$
When $\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 < \varepsilon_1$ and $\left\|\xi_j\right\|_1 < \varepsilon_2$, we have:
$J(\alpha_1, \xi) = \min_{\alpha_1, \xi}\ \sum_{i=1}^{m_1}\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 + C_1\sum_{j=1}^{m_2}\left\|\xi_j\right\|_1 + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (61)$
We take the derivative of Equation (61) with respect to $\alpha_1$ and $\xi$ separately and obtain:
$\sum_{i=1}^{m_1}\frac{H_1 h(x_i)h^T(x_i)H_1^T\alpha_1}{\left\|h^T(x_i)H_1^T\alpha_1\right\|_1} + C_2 K_{ELM_1}N_1K_{ELM_1}\alpha_1 = 0, \qquad C_1\sum_{j=1}^{m_2}\frac{\xi_j}{\left\|\xi_j\right\|_1} = 0 \qquad (62)$
By the above Equation (62), we can get:
$\sum_{i=1}^{m_1}\frac{H_1 h(x_i)h^T(x_i)H_1^T\alpha_1}{\left\|h^T(x_i)H_1^T\alpha_1\right\|_1} + C_1\sum_{j=1}^{m_2}\frac{\xi_j}{\left\|\xi_j\right\|_1} + C_2 K_{ELM_1}N_1K_{ELM_1}\alpha_1 = 0 \qquad (63)$
We define $f_i = \frac{1}{\left\|h^T(x_i)H_1^T\alpha_1\right\|_1}$ and $d_j = \frac{1}{\left\|\xi_j\right\|_1}$ as the diagonal entries of $F$ and $D$, respectively. Thus we can rewrite Equation (63) as follows:
$H_1H_1^T\, F\, H_1H_1^T\alpha_1 + C_1 D\xi + C_2 K_{ELM_1}N_1K_{ELM_1}\alpha_1 = 0 \qquad (64)$
Obviously, Equation (64) is the optimality condition of the following problem:
$\min_{\alpha_1, \xi}\ \frac{1}{2}\alpha_1^T K_{ELM_1}FK_{ELM_1}\alpha_1 + \frac{C_1}{2}\xi^T D\xi + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (65)$
Now, assume that $\bar{\alpha}_1$ and $\bar{\xi}$ denote the updated $\alpha_1$ and $\xi$ of Algorithm 2, respectively. Thus we can get:
$\frac{1}{2}\bar{\alpha}_1^T K_{ELM_1}FK_{ELM_1}\bar{\alpha}_1 + \frac{C_1}{2}\bar{\xi}^T D\bar{\xi} + \frac{C_2}{2}\bar{\alpha}_1^T K_{ELM_1}N_1K_{ELM_1}\bar{\alpha}_1 \le \frac{1}{2}\alpha_1^T K_{ELM_1}FK_{ELM_1}\alpha_1 + \frac{C_1}{2}\xi^T D\xi + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (66)$
We can rewrite Equation (66) as follows:
$\sum_{i=1}^{m_1}\frac{\left(K_{ELM_1}\bar{\alpha}_1\right)^T\left(K_{ELM_1}\bar{\alpha}_1\right)}{2\left\|h^T(x_i)H_1^T\alpha_1\right\|_1} + \sum_{j=1}^{m_2}\frac{C_1\bar{\xi}_j^2}{2\left\|\xi_j\right\|_1} + \frac{C_2}{2}\bar{\alpha}_1^T K_{ELM_1}N_1K_{ELM_1}\bar{\alpha}_1 \le \sum_{i=1}^{m_1}\frac{\left(K_{ELM_1}\alpha_1\right)^T\left(K_{ELM_1}\alpha_1\right)}{2\left\|h^T(x_i)H_1^T\alpha_1\right\|_1} + \sum_{j=1}^{m_2}\frac{C_1\xi_j^2}{2\left\|\xi_j\right\|_1} + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (67)$
Here, we let $x = K_{ELM_1}\bar{\alpha}_1$, $y = K_{ELM_1}\alpha_1$, $C = C_1$, $p = \bar{\xi}_j$, $q = \xi_j$. Based on Lemma 2, we have:
$\left\|K_{ELM_1}\bar{\alpha}_1\right\|_1 - \frac{\left\|K_{ELM_1}\bar{\alpha}_1\right\|_1^2}{2\left\|K_{ELM_1}\alpha_1\right\|_1} + C_1\left(\left\|\bar{\xi}_j\right\|_1 - \frac{\left\|\bar{\xi}_j\right\|_1^2}{2\left\|\xi_j\right\|_1}\right) \le \left\|K_{ELM_1}\alpha_1\right\|_1 - \frac{\left\|K_{ELM_1}\alpha_1\right\|_1^2}{2\left\|K_{ELM_1}\alpha_1\right\|_1} + C_1\left(\left\|\xi_j\right\|_1 - \frac{\left\|\xi_j\right\|_1^2}{2\left\|\xi_j\right\|_1}\right) \qquad (68)$
Then we can get:
$\sum_{i=1}^{m_1}\left\|h^T(x_i)H_1^T\bar{\alpha}_1\right\|_1 - \frac{\left\|K_{ELM_1}\bar{\alpha}_1\right\|_1^2}{2\left\|K_{ELM_1}\alpha_1\right\|_1} + C_1\sum_{j=1}^{m_2}\left(\left\|\bar{\xi}_j\right\|_1 - \frac{\left\|\bar{\xi}_j\right\|_1^2}{2\left\|\xi_j\right\|_1}\right) \le \sum_{i=1}^{m_1}\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 - \frac{\left\|K_{ELM_1}\alpha_1\right\|_1^2}{2\left\|K_{ELM_1}\alpha_1\right\|_1} + C_1\sum_{j=1}^{m_2}\left(\left\|\xi_j\right\|_1 - \frac{\left\|\xi_j\right\|_1^2}{2\left\|\xi_j\right\|_1}\right) \qquad (69)$
Combining (67) and (69), we can get the following inequality:
$\sum_{i=1}^{m_1}\left\|h^T(x_i)H_1^T\bar{\alpha}_1\right\|_1 + C_1\sum_{j=1}^{m_2}\left\|\bar{\xi}_j\right\|_1 + \frac{C_2}{2}\bar{\alpha}_1^T K_{ELM_1}N_1K_{ELM_1}\bar{\alpha}_1 \le \sum_{i=1}^{m_1}\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 + C_1\sum_{j=1}^{m_2}\left\|\xi_j\right\|_1 + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (70)$
Further, we can get:
$\sum_{i=1}^{m_1}\min\left(\left\|h^T(x_i)H_1^T\bar{\alpha}_1\right\|_1, \varepsilon_1\right) + C_1\sum_{j=1}^{m_2}\min\left(\left\|\bar{\xi}_j\right\|_1, \varepsilon_2\right) + \frac{C_2}{2}\bar{\alpha}_1^T K_{ELM_1}N_1K_{ELM_1}\bar{\alpha}_1 \le \sum_{i=1}^{m_1}\min\left(\left\|h^T(x_i)H_1^T\alpha_1\right\|_1, \varepsilon_1\right) + C_1\sum_{j=1}^{m_2}\min\left(\left\|\xi_j\right\|_1, \varepsilon_2\right) + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 \qquad (71)$
Therefore, we have $J(\bar{\alpha}_1, \bar{\xi}) \le J(\alpha_1, \xi)$. Similarly, when $\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 \ge \varepsilon_1$ and $\left\|\xi_j\right\|_1 \ge \varepsilon_2$, or $\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 < \varepsilon_1$ and $\left\|\xi_j\right\|_1 \ge \varepsilon_2$, or $\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 \ge \varepsilon_1$ and $\left\|\xi_j\right\|_1 < \varepsilon_2$, we can obviously get $J(\bar{\alpha}_1, \bar{\xi}) \le J(\alpha_1, \xi)$. Thus, the inequality $J(\bar{\alpha}_1, \bar{\xi}) \le J(\alpha_1, \xi)$ always holds. Moreover, the three terms in Equation (60) are all greater than or equal to 0, meaning that Algorithm 2 decreases the objective of problem (42) in each iteration until convergence. □
Theorem 2.
Algorithm 2 will converge to a local optimum of problem (42).
Proof. 
Here, we only use (42) as an example to prove Theorem 2.
When $\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 < \varepsilon_1$ and $\left\|\xi_j\right\|_1 < \varepsilon_2$, we write out the Lagrangian of (42):
$L_1(\alpha_1, \xi, \lambda) = \sum_{i=1}^{m_1}\left\|h^T(x_i)H_1^T\alpha_1\right\|_1 + C_1\sum_{j=1}^{m_2}\left\|\xi_j\right\|_1 + \frac{C_2}{2}\alpha_1^T K_{ELM_1}N_1K_{ELM_1}\alpha_1 - \sum_{j=1}^{m_2}\lambda_j\left(-h^T(x_j)H_1^T\alpha_1 + \xi_j - 1\right) \qquad (72)$
Then, we take the derivative of $L_1(\alpha_1, \xi, \lambda)$ with respect to $\alpha_1$:
$\frac{\partial L_1}{\partial \alpha_1} = \sum_{i=1}^{m_1}\frac{H_1 h(x_i)h^T(x_i)H_1^T\alpha_1}{\left\|h^T(x_i)H_1^T\alpha_1\right\|_1} + C_2 K_{ELM_1}N_1K_{ELM_1}\alpha_1 + H_1H_2^T\lambda = K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}\alpha_1 + H_1H_2^T\lambda = 0 \qquad (73)$
Similarly, we obtain the Lagrangian of problem (44):
$L_2(\alpha_1, \xi, \lambda) = \frac{1}{2}\alpha_1^T K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}\alpha_1 + \frac{C_1}{2}\xi^T D\xi - \lambda^T\left(-H_2H_1^T\alpha_1 + \xi - e_2\right) \qquad (74)$
Taking the derivative of $L_2(\alpha_1, \xi, \lambda)$ with respect to $\alpha_1$:
$\frac{\partial L_2}{\partial \alpha_1} = K_{ELM_1}\left(F + C_2 N_1\right)K_{ELM_1}\alpha_1 + H_1H_2^T\lambda = 0 \qquad (75)$
The other three cases are similar. From the discussion above, we can conclude that Equations (73) and (75) are equivalent, so we can solve problem (44) instead of problem (42) for CL1-FTELM, which further illustrates that Algorithm 2 converges to a local optimal solution. □

5. Experiments

The four comparison algorithms are described as follows:
OPTELM: The optimization function of the model consists of minimizing the L2-norm of the weight vector and minimizing the empirical loss. It considers neither the construction of two non-parallel hyperplanes to deal with classification tasks nor the statistical information of the samples. At the same time, since it uses the L2-norm metric and the Hinge loss, its anti-noise ability is weak.
TELM: The optimization function of the model consists of minimizing the distance from the sample points to the hyperplane as well as minimizing the empirical loss. TELM does not fully consider the statistical information of the samples. At the same time, it uses the L2-norm metric and the Hinge loss function, so when there is noise in the dataset, the influence of the noisy data is amplified and the classification accuracy is reduced.
FELM: The optimization function of the model includes minimizing the L2-norm of the weight vector, minimizing the empirical loss, and minimizing the within-class scatter of the sample data. Although FELM takes the statistics of the samples into account, it has to deal with a much larger optimization problem than the twin extreme learning machines, which is time-consuming. At the same time, FELM keeps the metric and loss used by OPTELM, so its anti-noise ability is weak.
CL1-TWSVM: CL1-TWSVM is built on twin support vector machines by changing the model's metric and loss to the capped L1-norm. Although CL1-TWSVM has the ability to resist noise, it does not fully take the statistics of the data into account. Meanwhile, CL1-TWSVM needs to solve not only the weight vector of each hyperplane but also its bias, so it is time-consuming.
We systematically compare our algorithms with the above state-of-the-art algorithms (OPTELM [12], TELM [14], FELM [19], and CL1-TWSVM [29]) on artificial synthetic datasets and UCI real datasets to verify the effectiveness of our FTELM and CL1-FTELM. In Section 5.1, we describe the relevant experimental setting in detail. We describe their performance in different cases in Section 5.2 and Section 5.3, respectively. In Section 5.4, we use the one-versus-rest multi-classification method to perform classification tasks on four image datasets: Yale "http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2023)", ORL "http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2023)", the USPS handwritten digit dataset "http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html (accessed on 15 February 2023)", and the MNIST dataset "http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html (accessed on 15 February 2023)".

5.1. Experimental Setting

All experiments were implemented in MATLAB R2020a on a personal computer (PC) with an AMD Radeon Graphics processor (3.2 GHz) and 16 GB of random-access memory (RAM). For CL1-TWSVM and CL1-FTELM, we set the maximum number of iterations to 100 and the iteration stopping threshold to 0.001. The activation function used in five of the models (OPTELM, TELM, FELM, FTELM, and CL1-FTELM) is $G(x) = \frac{1}{1 + e^{-x}}$. The Gaussian kernel function $K(x, z) = e^{-\|x - z\|^2 / (2\sigma^2)}$ was used for CL1-TWSVM. The parameters of all the above algorithms were selected as follows: $\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4$ were selected from $\{10^i \mid i = -6, -5, -4\}$, $C_1, C_2, C_3, C_4$ were selected from $\{10^i \mid i = -5, -4, \ldots, 4, 5\}$, $\sigma$ was chosen from $\{2^i \mid i = -3, -2, \ldots, 2, 3\}$, and the hidden layer node number $L$ was chosen from $\{50, 100, 200, 500, 1000, 2000, 5000, 10000\}$. The optimal parameters of each model were selected by 10-fold cross-validation and grid search. Normalization was performed for both the artificial and UCI datasets. For the image datasets, we randomly select 20% of the data as the test set to obtain the classification accuracy of each algorithm. All experimental processes are repeated 10 times, the average of the 10 test results is used as the performance measure, and the evaluation criterion selected in this paper is classification accuracy (ACC).
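As an illustration of this model-selection protocol (not the authors' MATLAB pipeline), the sketch below runs a grid search with stratified 10-fold cross-validation scored by accuracy; `train_ftelm` and `predict_ftelm` are placeholders for any of the compared trainers, and the shared C grid is a simplification of tuning $C_1, \ldots, C_4$ independently.

```python
import itertools
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_accuracy(train_ftelm, predict_ftelm, X, y, params, n_splits=10, seed=0):
    """Mean accuracy of one parameter setting over stratified 10-fold CV."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for tr, te in skf.split(X, y):
        model = train_ftelm(X[tr], y[tr], **params)
        scores.append(np.mean(predict_ftelm(model, X[te]) == y[te]))
    return float(np.mean(scores))

def grid_search(train_ftelm, predict_ftelm, X, y):
    C_grid = [10.0 ** i for i in range(-5, 6)]              # 10^-5, ..., 10^5
    L_grid = [50, 100, 200, 500, 1000, 2000, 5000, 10000]   # hidden node numbers
    best = (None, -np.inf)
    for C, L in itertools.product(C_grid, L_grid):
        acc = cv_accuracy(train_ftelm, predict_ftelm, X, y,
                          {"C1": C, "C2": C, "C3": C, "C4": C, "L": L})
        best = max(best, ({"C": C, "L": L}, acc), key=lambda t: t[1])
    return best
```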

5.2. Experiments on Artificial Datasets

We first conduct experiments on the Banana, Circle, Two spirals, and XOR datasets, which are generated by trigonometric functions (sine, cosine), two circles, two spiral lines, and two intersecting lines, respectively. The two-dimensional distributions of the four synthetic datasets are shown in Figure 1. Dark blue '+' represents class 1, and cyan '∘' represents class 2. Figure 2 illustrates the accuracy of four twin algorithms, namely TELM, FTELM, CL1-TWSVM, and CL1-FTELM, on the four datasets with 0%, 20%, and 25% noise. From Figure 2a, we can observe that the classification accuracy of our FTELM and CL1-FTELM on the Banana and Two spirals datasets is higher than that of the other two methods. On the Circle and XOR datasets, the classification accuracy of the four methods is similar. The experimental results show that fully considering the statistical information of the data can effectively improve the classification accuracy of the classifier, which shows that our CL1-FTELM method is effective. From Figure 2b,c, we can see that the overall effect of FTELM is better than TELM, which shows the importance of fully considering the statistical information of the samples. At the same time, we can see that CL1-FTELM performs best, followed by CL1-TWSVM. This indicates that the capped L1-norm can restrict the influence of noise on the model to a certain range, and further shows the effectiveness of using the capped L1-norm. In summary, Figure 2 illustrates the effectiveness of simultaneously considering sample statistics and changing the distance metric and loss of the model to the capped L1-norm.
To further show the robustness of CL1-FTELM, we add noise with different ratios to the Circle dataset. Figure 3 shows the accuracy of the TELM, FTELM, CL1-TWSVM, and CL1-FTELM algorithms on the Circle dataset at different noise ratios. The ratio is set in the range {0.1, 0.15, 0.2, 0.25}. We plot the accuracy results of ten experiments for each noise ratio in a box plot. By observing the medians of the four subgraphs, we can find that the median of the CL1-FTELM algorithm is much higher than those of the other three algorithms, and the results of CL1-FTELM are relatively concentrated across the four noise ratios. In other words, the variance of the ten experimental results obtained by the CL1-FTELM algorithm is smaller and the mean value is larger. The above results show that our CL1-FTELM has better stability and a better classification effect in environments containing noise. This shows the effectiveness and noise resistance of using the capped L1-norm as the distance metric and loss function of the model.

5.3. Experiments on UCI Datasets

In this section, we conduct numerical simulations on UCI datasets. Table 1 describes the characteristics of the UCI datasets used in detail. We also include two more algorithms (OPTELM and FELM) to verify the classification performance of FTELM and CL1-FTELM on the ten UCI datasets.
All experimental results obtained with the optimal parameters are shown in Table 2. Here, the average running time under the optimal parameters is denoted by Times (s), and the average classification accuracy plus or minus the standard deviation is denoted by ACC ± S (%). From Table 2, we can see that FTELM performs better than OPTELM, TELM, and FELM on all ten datasets. This indicates that adding a Fisher regularization term to the TELM framework can significantly improve the classification accuracy of the model. In addition, the average training time of the FTELM algorithm on most datasets is smaller than that of the FELM algorithm, which indicates that FTELM inherits the advantage of TELM's short training time. We can also see that our CL1-FTELM achieves the highest classification accuracy on all datasets except WDBC. Through the analysis of the above results, we can conclude that the Fisher regularization and capped L1-norm added to the TELM learning framework can effectively improve the performance of the classifier. This shows that the proposed FTELM and CL1-FTELM are efficient algorithms.
In order to further verify the robustness of CL1-FTELM to outliers, we added 20% and 25% Gaussian noise to the 10 datasets, respectively. All experimental results are presented in Table 3 and Table 4. From Table 3 and Table 4, we find that the classification accuracy of all six algorithms decreases after adding noise. However, the classification accuracy of our CL1-FTELM algorithm is the highest on eight of the ten datasets, which further reveals the effectiveness of using the capped L1-norm instead of the Hinge loss and the L2-norm distance metric. Compared with the other five algorithms, our CL1-FTELM algorithm is more time-consuming. This is because CL1-FTELM requires considerable time during training for the iterative computation, the suppression of outliers, and the computation of the graph matrices. In addition, we used different noise factor values (0.1, 0.15, 0.2, 0.25, 0.3) on the Breast Cancer, German, Ionosphere, and WDBC datasets for the six algorithms. The experimental results are given in Figure 4. It can be seen from Figure 4a that when the Breast Cancer dataset contains 10% noise, the effects of our FTELM and CL1-FTELM are comparable, which shows that it is important to consider the statistical information of the samples. As the noise ratio increases, the classification accuracy of all methods decreases, but our CL1-FTELM still has the highest accuracy, which illustrates the effectiveness of using the capped L1-norm. Figure 4b shows that, with the increase of the noise ratio, the accuracy of CL1-TWSVM and CL1-FTELM declines with a similar trend, but CL1-FTELM is still the most stable among the six methods under the influence of noise. From Figure 4c,d, we can clearly observe that the anti-noise effect of our CL1-FTELM is the best, which illustrates the effectiveness of using the Fisher regularization term as well as the capped L1-norm.
We also conduct experiments on four datasets (Breast cancer, QSAR, WDBC, and Vote) to verify the convergence of the proposed Algorithm 2. As shown in Figure 5, we plot the objective function value at each iteration. It can be seen that the objective function value converges rapidly to a fixed value as the number of iterations increases. This shows that our algorithm can make the objective function value converge to a local optimal value within a limited number of iterations, demonstrating the effectiveness and convergence of Algorithm 2.

5.4. Experiments on Image Datasets

The image datasets include Yale, ORL, USPS, and MNIST. Figure 6 illustrates examples of the four high-dimensional image datasets. The numbers of samples and features of the four image datasets are shown in Table 5. These four image datasets are used to investigate the performance of our FTELM and CL1-FTELM for multi-classification. Specifically, for the MNIST dataset, we only select the first 2000 samples to participate in the experiment.
Table 6 shows the specific experimental results. As can be seen from the results, our CL1-FTELM and CL1-TWSVM have similar training times. This is because this paper uses an iterative algorithm to solve the non-convex optimization problem of CL1-FTELM, which is time-consuming. Meanwhile, CL1-FTELM attains the highest classification accuracy among the six algorithms on all four datasets (Yale, ORL, USPS, and MNIST). In addition, the classification accuracy of our FTELM algorithm on the four image datasets is the second highest, after our CL1-FTELM. The above results fully show the effectiveness of our two algorithms in dealing with multi-classification tasks.

6. Conclusions

In this paper, we have proposed FTELM and CL1-FTELM. FTELM not only inherits the advantages of TELM but also takes full account of the statistical information of the samples, so as to further improve the classification performance of the classifier. Specifically, when there is no noise in the data or the noise ratio is very small, our FTELM algorithm can deal with the classification problem very well: it is not only time-saving but also achieves high classification accuracy. CL1-FTELM further improves the anti-noise ability of the model by replacing the L2-norm and hinge loss in FTELM with the capped L1-norm. It not only utilizes the distribution information of the data but also improves the anti-noise ability of the model. Furthermore, we have designed two algorithms to solve the problems of FTELM and CL1-FTELM, and we have presented two theorems to prove the convergence and local optimality of CL1-FTELM. However, in terms of computational cost, FTELM is better than CL1-FTELM to some extent. Therefore, in future work, we will propose new techniques to accelerate the computation of CL1-FTELM. In addition, extending FTELM and CL1-FTELM from the supervised learning setting to a semi-supervised learning framework is also a future research focus.

Author Contributions

Z.X., conceptualization, methodology, validation, investigation, project administration, writing—original draft. L.C., methodology, software, validation, formal analysis, investigation, data curation, writing—original draft. All authors have read and agreed to the published version of the manuscript.

Funding

The authors wish to acknowledge the financial support of the National Nature Science Youth Foundation of China (No. 61907012), the Start-up Funds of Scientific Research for Personnel Introduced by North Minzu University (No. 2019KYQD41), the Special project of North Minzu University (No. FWNX01), the Basic Research Plan of Key Scientific Research Projects of Colleges and Universities in Henan Province (No. 19A120005), the Construction Project of First-Class Disciplines in Ningxia Higher Education (NXYLXK2017B09), the Young Talent Cultivation Project of North Minzu University (No. 2021KYQD23), the Natural Science Foundation of Ningxia Provincial of China (No. 2022A0950), the Fundamental Research Funds for the Central Universities (No. 2022XYZSX03).

Informed Consent Statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability Statement

The UCI machine learning repository is available at “http://archive.ics.uci.edu/ml/datasets.php (accessed on 15 February 2023)”. The image data are available at “http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2023)”.

Acknowledgments

The authors thank the anonymous reviewers for their constructive comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: A new learning scheme of feedforward neural networks. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), Budapest, Hungary, 25–29 July 2004; Volume 2, pp. 985–990.
2. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501.
3. Huang, G.B.; Chen, Y.Q.; Babri, H.A. Classification ability of single hidden layer feedforward neural networks. IEEE Trans. Neural Networks 2000, 11, 799–801.
4. Chen, X.; Cui, B. Efficient modeling of fiber optic gyroscope drift using improved EEMD and extreme learning machine. Signal Process. 2016, 128, 1–7.
5. Xia, M.; Zhang, Y.; Weng, L.; Ye, X. Fashion retailing forecasting based on extreme learning machine with adaptive metrics of inputs. Knowl.-Based Syst. 2012, 36, 253–259.
6. Yang, J.; Xie, S.; Yoon, S.; Park, D.; Fang, Z.; Yang, S. Fingerprint matching based on extreme learning machine. Neural Comput. Appl. 2013, 22, 435–445.
7. Rasheed, Z.; Rangwala, H. Metagenomic Taxonomic Classification Using Extreme Learning Machines. J. Bioinform. Comput. Biol. 2012, 10(5), 1250015.
8. Zou, Q.Y.; Wang, X.J.; Zhou, C.J.; Zhang, Q. The memory degradation based online sequential extreme learning machine. Neurocomputing 2018, 275, 2864–2879.
9. Fu, Y.; Wu, Q.; Liu, K.; Gao, H. Feature Selection Methods for Extreme Learning Machines. Axioms 2022, 11, 444.
10. Liu, Q.; He, Q.; Shi, Z. Extreme support vector machine classifier. In Proceedings of the Advances in Knowledge Discovery and Data Mining: 12th Pacific-Asia Conference, PAKDD 2008, Osaka, Japan, 20–23 May 2008; pp. 222–233.
11. Frénay, B.; Verleysen, M. Using SVMs with randomised feature spaces: An extreme learning approach. In Proceedings of the 18th European Symposium on Artificial Neural Networks, ESANN 2010, Bruges, Belgium, 28–30 April 2010.
12. Huang, G.B.; Ding, X.; Zhou, H. Optimization method based extreme learning machine for classification. Neurocomputing 2010, 74, 155–163.
13. Jayadeva; Khemchandani, R.; Chandra, S. Twin support vector machines for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 905–910.
14. Wan, Y.; Song, S.; Huang, G.; Li, S. Twin extreme learning machines for pattern classification. Neurocomputing 2017, 260, 235–244.
15. Shen, J.; Ma, J. Sparse Twin Extreme Learning Machine with ε-Insensitive Zone Pinball Loss. IEEE Access 2019, 7, 112067–112078.
16. Yuan, C.; Yang, L. Robust twin extreme learning machines with correntropy-based metric. Knowl.-Based Syst. 2021, 214, 106707.
17. Anand, P.; Bharti, A.; Rastogi, R. Time efficient variants of Twin Extreme Learning Machine. Intell. Syst. Appl. 2023, 17, 200169.
18. Ma, J.; Yu, G. A generalized adaptive robust distance metric driven smooth regularization learning framework for pattern recognition. Signal Process. 2023, 211, 109102.
19. Ma, J.; Wen, Y.; Yang, L. Fisher-regularized supervised and semi-supervised extreme learning machine. Knowl. Inf. Syst. 2020, 62, 3995–4027.
20. Gao, S.; Ye, Q.; Ye, N. 1-Norm least squares twin support vector machines. Neurocomputing 2011, 74, 3590–3597.
21. Yan, H.; Ye, Q.L.; Zhang, T.A.; Yu, D.J. Efficient and robust TWSVM classifier based on L1-norm distance metric for pattern classification. In Proceedings of the 2017 4th IAPR Asian Conference on Pattern Recognition (ACPR), Nanjing, China, 26–29 November 2017; pp. 436–441.
22. Ye, Q.; Yang, J.; Liu, F.; Zhao, C.; Ye, N.; Yin, T. L1-norm distance linear discriminant analysis based on an effective iterative algorithm. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 114–129.
23. Wu, Q.; Wang, F.; An, Y.; Li, K. L-1-Norm Robust Regularized Extreme Learning Machine with Asymmetric C-Loss for Regression. Axioms 2023, 12, 204.
24. Wu, M.J.; Liu, J.X.; Gao, Y.L.; Kong, X.Z.; Feng, C.M. Feature selection and clustering via robust graph-laplacian PCA based on capped L1-norm. In Proceedings of the 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Kansas City, MO, USA, 13–16 November 2017; pp. 1741–1745.
25. Nie, F.; Huang, H.; Cai, X.; Ding, C. Efficient and robust feature selection via joint L2,1-norms minimization. Adv. Neural Inf. Process. Syst. 2010, 23, 1813–1821.
26. Ma, J.; Yang, L.; Sun, Q. Capped L1-norm distance metric-based fast robust twin bounded support vector machine. Neurocomputing 2020, 412, 295–311.
27. Jiang, W.; Nie, F.; Huang, H. Robust Dictionary Learning with Capped L1-Norm. In Proceedings of the 24th International Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; pp. 3590–3596.
28. Nie, F.; Huo, Z.; Huang, H. Joint Capped Norms Minimization for Robust Matrix Recovery. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 2557–2563.
29. Wang, C.; Ye, Q.; Luo, P.; Ye, N.; Fu, L. Robust capped L1-norm twin support vector machine. Neural Netw. 2019, 114, 47–59.
30. Pal, A.; Khemchandani, R. Learning TWSVM using Privilege Information. In Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 18–21 November 2018; pp. 1548–1554.
31. Li, Y.; Sun, H.; Yan, W.; Cui, Q. R-CTSVM+: Robust capped L1-norm twin support vector machine with privileged information. Inf. Sci. 2021, 574, 12–32.
32. Mangasarian, O.; Musicant, D. Successive overrelaxation for support vector machines. IEEE Trans. Neural Netw. 1999, 10, 1032–1037.
33. Luo, Z.Q.; Tseng, P. Error bounds and convergence analysis of feasible descent methods: A general approach. Ann. Oper. Res. 1993, 46, 157–178.
34. Yang, Y.; Xue, Z.; Ma, J.; Chang, X. Robust projection twin extreme learning machines with capped L1-norm distance metric. Neurocomputing 2023, 517, 229–242.
Figure 1. Four types of data without noise.
Figure 2. Accuracy for TELM, FTELM, CL1-TWSVM, and CL1-FTELM on four types of data with 0%, 20%, and 25% noise.
Figure 3. Accuracy for TELM, FTELM, CL1-TWSVM, and CL1-FTELM on the Circle dataset with noise in different ratios.
Figure 4. Accuracies of six algorithms via different noise factors.
Figure 5. Objective values of CL1-FTELM on four datasets.
Figure 6. Examples of four high-dimensional image datasets.
Table 1. Characteristics of UCI datasets.

Datasets        Instances  Attributes    Datasets      Instances  Attributes
Australian      690        14            Vote          432        16
German          1000       24            Ionosphere    351        35
Breast cancer   699        9             Pima          768        8
WDBC            569        30            QSAR          1055       41
Wholesale       440        7             Spam          4601       57
Table 2. Experimental results on UCI datasets. The best results are marked in bold.
(For each dataset, the first row gives ACC ± S (%) and the second row gives Times (s).)

Datasets        OPTELM          TELM            FELM            FTELM           CL1-TWSVM       CL1-FTELM
Australian      85.31 ± 0.34    85.60 ± 0.44    85.46 ± 0.19    86.79 ± 0.33    85.82 ± 0.28    87.13 ± 0.52
                0.682           0.593           1.698           0.456           1.676           2.533
German          76.26 ± 0.52    76.40 ± 0.16    76.50 ± 0.42    76.56 ± 0.47    76.70 ± 0.25    77.15 ± 1.18
                1.182           0.979           4.555           0.474           5.318           7.006
Breast cancer   95.70 ± 0.24    96.35 ± 0.15    96.45 ± 0.09    97.07 ± 0.15    96.39 ± 0.13    97.32 ± 0.53
                0.601           0.668           1.646           0.505           4.011           3.902
WDBC            96.71 ± 0.27    97.13 ± 0.48    97.55 ± 0.17    98.55 ± 0.26    97.09 ± 0.25    97.86 ± 0.21
                0.416           0.605           1.144           0.578           3.618           4.551
Wholesale       87.35 ± 0.93    89.86 ± 0.84    90.26 ± 0.12    90.56 ± 0.33    89.89 ± 0.30    90.70 ± 0.56
                0.278           2.091           0.665           0.359           1.246           1.377
Vote            95.31 ± 0.16    95.56 ± 0.30    96.04 ± 0.24    96.12 ± 0.31    95.21 ± 0.54    96.43 ± 0.35
                0.256           0.502           0.651           0.445           1.077           0.992
Ionosphere      90.59 ± 0.84    91.38 ± 0.52    92.32 ± 0.32    92.74 ± 0.83    92.56 ± 0.54    93.32 ± 1.21
                0.184           0.476           0.421           0.268           1.128           2.237
Pima            76.83 ± 0.73    77.51 ± 0.08    77.79 ± 0.10    78.24 ± 0.49    77.49 ± 0.37    78.82 ± 0.98
                0.858           0.795           2.099           0.932           1.743           4.708
QSAR            83.91 ± 0.66    86.56 ± 0.19    87.12 ± 0.18    87.35 ± 0.23    85.72 ± 0.59    87.50 ± 0.56
                1.442           0.979           2.489           2.864           2.665           14.288
Spam            85.57 ± 0.65    91.38 ± 0.52    89.67 ± 0.21    91.94 ± 1.23    90.56 ± 1.23    92.27 ± 0.54
                125.498         64.314          488.251         108.232         158.145         170.261
Table 3. Experimental results on UCI datasets with 20% noise. The best results are marked in bold.
(For each dataset, the first row gives ACC ± S (%) and the second row gives Times (s).)

Datasets        OPTELM          TELM            FELM            FTELM           CL1-TWSVM       CL1-FTELM
Australian      79.68 ± 1.75    80.37 ± 0.56    79.06 ± 1.36    80.44 ± 1.34    81.98 ± 0.87    82.78 ± 0.57
                0.621           0.728           1.756           0.224           1.708           3.224
German          69.67 ± 0.97    73.57 ± 1.85    71.99 ± 1.35    72.76 ± 0.88    73.86 ± 1.35    74.32 ± 1.12
                1.318           0.981           4.102           0.398           5.673           6.764
Breast cancer   70.60 ± 0.45    76.97 ± 0.42    70.32 ± 0.37    77.81 ± 0.56    79.84 ± 0.37    80.14 ± 0.91
                0.803           0.706           1.552           0.315           4.572           5.034
WDBC            82.98 ± 0.15    84.38 ± 1.01    83.29 ± 0.68    89.43 ± 1.15    89.98 ± 0.30    93.77 ± 0.32
                0.419           0.204           0.992           0.376           3.899           4.861
Wholesale       73.40 ± 0.93    73.77 ± 0.69    73.74 ± 0.76    74.77 ± 0.56    78.74 ± 0.91    79.47 ± 2.58
                0.275           0.543           0.659           0.404           0.849           1.420
Vote            93.48 ± 0.62    94.36 ± 0.60    94.24 ± 0.82    94.10 ± 0.94    93.90 ± 0.44    94.29 ± 0.61
                0.277           0.619           0.549           0.114           1.048           1.398
Ionosphere      80.79 ± 2.88    82.71 ± 2.09    81.00 ± 3.11    86.06 ± 1.67    85.76 ± 1.58    87.74 ± 1.08
                0.159           0.021           0.456           0.737           0.391           2.081
Pima            65.79 ± 0.23    67.07 ± 0.56    66.12 ± 0.12    66.30 ± 1.34    70.25 ± 1.57    71.42 ± 0.94
                0.873           0.649           2.051           1.492           1.758           3.968
QSAR            68.32 ± 2.48    68.80 ± 0.95    68.54 ± 2.50    72.28 ± 2.18    71.09 ± 2.02    72.31 ± 1.98
                1.534           3.089           4.578           0.892           1.828           9.151
Spam            83.16 ± 0.57    87.38 ± 2.31    85.66 ± 0.65    87.98 ± 0.87    85.77 ± 2.21    86.75 ± 0.45
                128.798         60.565          432.257         106.267         147.365         160.231
Table 4. Experimental results on UCI datasets with 25% noise. The best results are marked in bold.
(For each dataset, the first row gives ACC ± S (%) and the second row gives Times (s).)

Datasets        OPTELM          TELM            FELM            FTELM           CL1-TWSVM       CL1-FTELM
Australian      73.68 ± 2.20    75.41 ± 1.52    74.25 ± 2.01    76.40 ± 1.19    80.56 ± 1.07    81.63 ± 0.71
                0.585           0.673           1.627           0.206           2.205           2.261
German          69.72 ± 0.13    72.87 ± 0.82    71.41 ± 0.88    73.15 ± 0.87    73.13 ± 1.16    73.25 ± 0.76
                1.565           0.871           3.855           0.342           5.233           6.798
Breast cancer   67.59 ± 0.18    70.43 ± 0.79    67.23 ± 0.24    71.65 ± 0.58    70.93 ± 0.52    72.71 ± 0.49
                0.654           0.513           1.438           0.309           4.476           5.124
WDBC            79.61 ± 0.78    81.66 ± 0.84    79.83 ± 0.72    87.96 ± 1.13    88.50 ± 0.74    92.43 ± 0.76
                0.417           0.197           0.887           0.334           3.675           4.861
Wholesale       71.79 ± 1.03    71.63 ± 0.89    69.63 ± 0.38    71.60 ± 1.02    75.53 ± 1.02    75.74 ± 3.48
                0.570           2.021           0.623           0.338           1.147           1.387
Vote            92.62 ± 0.88    92.95 ± 0.50    93.12 ± 0.80    93.21 ± 0.80    93.21 ± 0.68    93.50 ± 1.00
                0.252           0.503           0.514           0.121           1.213           1.390
Ionosphere      78.15 ± 2.94    78.79 ± 3.01    76.62 ± 3.67    83.59 ± 1.49    82.94 ± 2.90    85.03 ± 2.28
                0.229           0.058           0.313           0.737           0.576           1.987
Pima            65.67 ± 0.12    65.45 ± 1.55    65.89 ± 0.12    65.79 ± 0.14    69.01 ± 1.55    68.51 ± 2.75
                0.803           0.761           2.182           0.471           5.532           4.012
QSAR            67.49 ± 3.08    67.81 ± 1.63    70.30 ± 2.33    71.53 ± 3.00    70.49 ± 2.13    69.72 ± 2.14
                2.067           2.730           4.251           0.849           1.783           12.564
Spam            71.77 ± 1.05    75.35 ± 0.72    70.89 ± 1.23    76.43 ± 1.16    83.56 ± 0.26    84.75 ± 0.78
                99.541          61.254          462.221         116.267         142.365         165.214
Table 5. Characteristics of image datasets.

Datasets   Instances  Attributes    Datasets   Instances  Attributes
Yale       165        1024          ORL        400        1024
USPS       9298       256           MNIST      70,000     784
Table 6. Experimental results on image and handwritten digit datasets. The best results are marked in bold.
(For each dataset, the first row gives ACC ± S (%) and the second row gives Times (s).)

Datasets   OPTELM          TELM            FELM            FTELM           CL1-TWSVM       CL1-FTELM
Yale       89.39 ± 2.85    91.44 ± 1.58    90.54 ± 2.01    92.23 ± 1.29    91.54 ± 1.07    93.12 ± 1.71
           0.126           0.101           0.262           0.136           0.135           0.492
ORL        87.72 ± 1.53    90.87 ± 0.52    90.41 ± 0.78    92.45 ± 0.67    92.32 ± 1.16    93.25 ± 0.46
           1.169           0.483           3.064           0.529           1.338           2.695
USPS       98.76 ± 0.18    98.83 ± 0.69    98.23 ± 0.24    99.65 ± 0.68    99.23 ± 0.42    99.89 ± 0.89
           118.729         17.536          134.438         6.795           358.368         355.762
MNIST      89.61 ± 0.58    90.66 ± 0.74    89.83 ± 0.75    91.26 ± 1.13    90.88 ± 0.14    91.53 ± 0.56
           8.723           1.237           41.656          0.868           14.258          14.973
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
