Article

The Robust Supervised Learning Framework: Harmonious Integration of Twin Extreme Learning Machine, Squared Fractional Loss, Capped L2,p-norm Metric, and Fisher Regularization

1 School of Mathematics and Information Science, North Minzu University, Yinchuan 750021, China
2 The Key Laboratory of Intelligent Information and Big Data Processing of NingXia Province, North Minzu University, Yinchuan 750021, China
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(9), 1230; https://doi.org/10.3390/sym16091230
Submission received: 10 August 2024 / Revised: 4 September 2024 / Accepted: 9 September 2024 / Published: 19 September 2024
(This article belongs to the Section Mathematics)

Abstract
As a novel learning algorithm for feedforward neural networks, the twin extreme learning machine (TELM) boasts advantages such as simple structure, few parameters, low complexity, and excellent generalization performance. However, it employs the squared L 2 -norm metric and an unbounded hinge loss function, which tends to overstate the influence of outliers and subsequently diminishes the robustness of the model. To address this issue, scholars have proposed the bounded capped L 2 , p -norm metric, which can be flexibly adjusted by varying the p value to adapt to different data and reduce the impact of noise. Therefore, we substitute the metric in the TELM with the capped L 2 , p -norm metric in this paper. Furthermore, we propose a bounded, smooth, symmetric, and noise-insensitive squared fractional loss (SF-loss) function to replace the hinge loss function in the TELM. Additionally, the TELM neglects statistical information in the data; thus, we incorporate the Fisher regularization term into our model to fully exploit the statistical characteristics of the data. Drawing upon these merits, a squared fractional loss-based robust supervised twin extreme learning machine (SF-RSTELM) model is proposed by integrating the capped L 2 , p -norm metric, SF-loss, and Fisher regularization term. The model shows significant effectiveness in decreasing the impacts of noise and outliers. However, the proposed model’s non-convexity poses a formidable challenge in the realm of optimization. We use an efficient iterative algorithm to solve it based on the concave-convex procedure (CCCP) algorithm and demonstrate the convergence of the proposed algorithm. Finally, to verify the algorithm’s effectiveness, we conduct experiments on artificial datasets, UCI datasets, image datasets, and NDC large datasets. The experimental results show that our model is able to achieve higher ACC and F 1 scores across most datasets, with improvements ranging from 0.28% to 4.5% compared to other state-of-the-art algorithms.

1. Introduction

In the field of machine learning, researchers have been dedicated to enhancing the efficiency and accuracy of models. Sakheta et al. [1] improved the prediction of the biomass gasification model through six machine learning algorithms. The research demonstrated that the XGBoost algorithm has significant advantages in improving the accuracy of gasification product prediction. Maydanchi et al. [2] systematically compared various machine learning methods and found that tree-based ensemble methods, such as XGBoost, gradient boosting, and random forest, excelled in diabetes prediction. Kim et al. [3] successfully classified three similar enterococci by combining MALDI-TOF mass spectrometry techniques and multiple supervised learning algorithms (e.g., KNN, SVM, random forest). Although these methods have made significant progress in different domains, there remains room for improvement in enhancing computational efficiency and response times. The extreme learning machine (ELM) [4,5] offers a promising solution to these challenges with its efficient training process and superior generalization capabilities. It was first proposed by Huang et al. [6] and quickly gained widespread application in multiple fields, including image classification [7], fault detection [8,9], disease diagnosis [10], computer vision [11], face recognition [12], and signal processing [13]. These application cases fully validate the practicability and effectiveness of ELM as an efficient neural network training method.
In binary classification tasks, traditional ELM only learns a single hyperplane to distinguish between classes. Recently, two nonparallel hyperplanes classification algorithms have attracted significant attention and research interest [14,15]. These algorithms involve the training of multiple hyperplanes, where each hyperplane is designed to minimize its distance to one of the two classes while maximizing its distance from the other class. For example, the twin support vector machine (TSVM) is notable for its efficiency in learning two nonparallel separating hyperplanes more quickly than the traditional support vector machine (SVM) by solving two reduced-sized quadratic programming problems (QPPs). The various variants of TSVM [16,17,18,19] have been extensively studied and have been successfully applied in classification tasks.
Inspired by TSVM, Wan et al. [20] proposed the twin extreme learning machine (TELM). It is noteworthy that the TELM and TSVM use the hinge loss function, which is unbounded and tends to exaggerate the impact of noise and outliers on the model. Consequently, the research community has expressed increasing interest in exploring alternative loss functions. Wang et al. [21] proposed a new robust capped L 1 -norm twin support vector machine (CTWSVM), which maintains the benefits of TWSVM and enhances the robustness of the model. Wang and Yu et al. [22] proposed a new robust loss function, the capped Linex loss function, which was applied to the TSVM to enhance the classification capabilities of the model. Kumari A et al. [23] introduced the capped pinball loss function into the universum twin support vector machine (UTWSVM), and proposed a universum twin support vector machine (Tpin-UTWSVM) based on capped pinball loss function, which improved the model’s generalization performance. Ma et al. [24] proposed a robust adaptive capped L θ ε loss, altering the loss function value by adjusting the adaptive parameter θ during the training process. Applying this loss function to TSVM, an adaptive robust learning framework was proposed, namely the adaptive robust twin support vector machine (ARTSVM). All the above models use bounded capped loss functions, which constrain the impact of noise within certain limits and make the classifiers less sensitive to noise.
In order to further reduce the impact of noise, many scholars have begun to look for new metrics to substitute for the squared L 2 -norm metric used in the TELM. Ma et al. [25] proposed a fast robust twin extreme learning machine (FRTELM) based on capped L 1 -norm metric and loss function in the classic TELM learning framework, which enhances the robustness of the TELM in handling classification problems. Yang et al. [26] added the idea of projection on the basis of the twin extreme learning machine, and combining this with the capped L 1 -norm metric and loss function, they proposed a new capped L 1 -norm projection twin extreme learning machine ( C L 1 -PTELM). It lessens the influence of outliers and demonstrates more robustness than the TELM. Ma and Yang et al. [27] proposed a new robust TELM framework (RTELM) using the capped L 1 -norm metrics and capped L θ ε loss function. RTELM addresses the limitations of L 2 -norm metric and hinge loss, particularly in scenarios with outliers. It retains the strengths of the TELM and further enhances the robustness of classification. These algorithms show that the capped L 1 -norm metric is resistant to outliers. In fact, the capped L 1 -norm metric is considered an effective approximation of the L 0 -norm by a non-negative parameter, and it is superior in robustness to the L 1 -norm metric [27]. In addition, related scholars have begun to focus on the capped L 2 , p -norm metric and have applied it to their models. This metric is bounded and can be flexibly tuned by adjusting the p-value to adapt to diverse datasets and reduce the effect of noise. Yuan et al. [28] created a novel framework to improve robustness by substituting the squared L 2 -norm metric with the robust capped L 2 , p -norm metric in a least squares twin support vector machine (LSTSVM), which is called capped L 2 , p -norm LSTSVM ( C L 2 , p -LSTSVM). Wang et al. [29] proposed a capped L 2 , p -norm metric based on the robust twin support vector machine with Welsch loss function (WCTBSVM). The generalization performance and robustness of the TSVM are further improved. Jiang et al. [30] proposed a novel robust twin extreme learning machine learning framework (CWTELM) by combining the capped L 2 , p -norm metric and Welsch loss function with the TELM. CWTELM improves robustness while preserving the advantages of TELM, thereby enhancing classification performance.
Besides altering metrics and loss functions, regularization techniques play a vital role in improving the generalization capabilities of models. The Fisher regularization term is a notable technique that minimizes within-class variance and excels in improving class separability and robustness. Ma and Wen et al. [31] proposed a Fisher regularization ELM (Fisher-ELM) to reach a minimal within-class scatter. Fisher-ELM utilizes the statistical properties of the data, which exhibits excellent generalization ability. Although Fisher-ELM incorporates statistical knowledge into its framework, it tends to ignore the potential effects of noise or outliers. To reduce the negative effects of these factors, Xue and Zhao et al. [32] first proposed a novel asymmetric Welsch loss function and integrated it into Fisher-ELM, then proposed a robust Fisher regularization extreme learning machine with asymmetric Welsch-induced loss function (AWFisher-ELM). This model better copes with the adverse effects of noise and outliers, enhancing the robustness of the model. Xue et al. [33] added Fisher regularization to the TELM and proposed Fisher regularization TELM (FTELM), which both keeps the strengths of the TELM and minimizes the intra-class differences of samples. In order to further improve the noise immunity of the FTELM method, a new capped L 1 -norm Fisher regularization TELM (C L 1 -FTELM) is proposed by combining the capped L 1 -norm metric and loss function to enhance the robustness of the model.
In this paper, we first propose a bounded, smooth, and symmetric squared fractional loss (SF-loss). Based on the proposed SF-loss, we integrate the TELM, the capped $L_{2,p}$-norm metric, and Fisher regularization, and propose a robust supervised TELM learning framework (SF-RSTELM). SF-RSTELM can effectively utilize the statistical properties of the data, which the TELM lacks. In addition, it can effectively reduce the impact of noise and outliers by employing the bounded capped $L_{2,p}$-norm metric and SF-loss function. In contrast, the TELM uses the unbounded squared $L_2$-norm metric and hinge loss, which are susceptible to the influence of noise and outliers.
The main work of this paper is summarized as follows:
(1)
A new robust loss function called squared fractional loss (SF-loss) is presented. It has some important properties such as being bounded, smooth, symmetric, and noise-insensitive. Moreover, the robustness of the SF-loss is analyzed according to the perspective of M estimation theory [34], and its Fisher consistency is proved according to the Bayesian rule [35].
(2)
An innovative method named “The Robust Supervised Learning Framework: Harmonious Integration of Twin Extreme Learning Machine, Squared Fractional Loss, Capped L 2 , p -norm Metric, and Fisher Regularization” is proposed. This framework cleverly combines the efficiency of the TELM, the robustness of the SF-loss function, the flexibility of the capped L 2 , p -norm metric, and the advantages of Fisher regularization. This integrated approach not only takes into account the statistical information of the data but also significantly reduces the impact of noise, thereby enhancing the model’s performance.
(3)
Due to the non-convex nature of the established optimization model, an efficient algorithm based on CCCP [36] is proposed to solve the optimization problem. Moreover, the convergence of the proposed algorithm is proved.
(4)
We performed extensive experiments on artificial datasets, UCI datasets, image datasets, and NDC-large datasets to validate the effectiveness of our proposed algorithm compared to other state-of-the-art algorithms.
The rest of this paper is structured as follows. In Section 2, we briefly review related work on Fisher regularization, the Fisher regularized twin extreme learning machine, the capped L 2 , p -norm metric, and the concave-convex procedure. In Section 3, we provide a comprehensive description of the proposed model and a detailed solution process. The experimental results on multiple datasets are presented in Section 4. Conclusions and suggestions for future work are given in Section 5.

2. Related Work

In this section, we briefly review related work on Fisher regularization, the Fisher regularized twin extreme learning machine, the Capped L 2 , p -norm metric, and the concave-convex procedure.

2.1. Fisher Regularization

Fisher regularization [32] can measure the intra-class divergence within the data, facilitating the development of more effective learning models. Given the training set $T=\{(x_1,y_1),\ldots,(x_m,y_m)\}$, the Fisher regularization has the form:

$$\|f\|_F^2=\sum_{i\in I^+}\left(f(x_i)-\bar f_+\right)^2+\sum_{i\in I^-}\left(f(x_i)-\bar f_-\right)^2\tag{1}$$

where $f$ is the prediction function and $f(x_i)$ is the value of $f$ on sample $x_i$; the values of $f$ on all samples form the vector $f=\left(f(x_1),f(x_2),\ldots,f(x_m)\right)^T$; $\bar f_+$ and $\bar f_-$ are the means of $f$ over the positive and negative samples, respectively; and $I^+$ and $I^-$ are the index sets of positive and negative samples.
We can expand Equation (1):
$$
\begin{aligned}
\|f\|_F^2 &= \sum_{i\in I^+}\left(f(x_i)-\bar f_+\right)^2+\sum_{i\in I^-}\left(f(x_i)-\bar f_-\right)^2\\
&= \sum_{i\in I^+}\left(f^2(x_i)-2f(x_i)\bar f_+ +\bar f_+^2\right)+\sum_{i\in I^-}\left(f^2(x_i)-2f(x_i)\bar f_- +\bar f_-^2\right)\\
&= \sum_{i\in I^+}f^2(x_i)-2m_1\bar f_+^2+m_1\bar f_+^2+\sum_{i\in I^-}f^2(x_i)-2m_2\bar f_-^2+m_2\bar f_-^2\\
&= \sum_{i\in I^+}f^2(x_i)-m_1\bar f_+^2+\sum_{i\in I^-}f^2(x_i)-m_2\bar f_-^2\\
&= f_+^Tf_+-\frac{1}{m_1}f_+^Tee^Tf_+ + f_-^Tf_--\frac{1}{m_2}f_-^Tee^Tf_-\\
&= f_+^T\left(I_+-M_+\right)f_+ + f_-^T\left(I_--M_-\right)f_-\\
&= (f_+,f_-)^T\begin{pmatrix}I_+-M_+ & 0_1\\ 0_2 & I_--M_-\end{pmatrix}(f_+,f_-)\\
&= f^T(I-G)f = f^TNf
\end{aligned}\tag{2}
$$

where $f_+=\left(f(x_1),f(x_2),\ldots,f(x_i),\ldots,f(x_{m_1})\right)^T$, $i\in I^+$, and $I_+\in\mathbb{R}^{m_1\times m_1}$ is the identity matrix; $f_-=\left(f(x_1),f(x_2),\ldots,f(x_i),\ldots,f(x_{m_2})\right)^T$, $i\in I^-$, and $I_-\in\mathbb{R}^{m_2\times m_2}$ is the identity matrix; $0_1\in\mathbb{R}^{m_1\times m_2}$ and $0_2\in\mathbb{R}^{m_2\times m_1}$ are zero matrices; $M_+\in\mathbb{R}^{m_1\times m_1}$ with all elements equal to $\frac{1}{m_1}$; $M_-\in\mathbb{R}^{m_2\times m_2}$ with all elements equal to $\frac{1}{m_2}$; and $N=I-G$, where $I\in\mathbb{R}^{m\times m}$ is the identity matrix and $G=\begin{pmatrix}M_+ & 0_1\\ 0_2 & M_-\end{pmatrix}$.
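For readers who prefer code, the following minimal NumPy sketch (variable names are ours; the samples are assumed to be ordered with the positive class first) assembles the matrix $N=I-G$ from Equation (2) and evaluates the regularizer $f^TNf$:

```python
import numpy as np

def fisher_regularization_matrix(y):
    """Build N = I - G from Equation (2) for labels y in {+1, -1}.

    G is block-diagonal with M+ (all entries 1/m1) on the positive block and
    M- (all entries 1/m2) on the negative block, assuming all positive samples
    come first in the ordering.
    """
    y = np.asarray(y)
    m1 = int(np.sum(y == 1))    # number of positive samples
    m2 = int(np.sum(y == -1))   # number of negative samples
    m = m1 + m2

    G = np.zeros((m, m))
    G[:m1, :m1] = 1.0 / m1      # M+ block
    G[m1:, m1:] = 1.0 / m2      # M- block
    return np.eye(m) - G

if __name__ == "__main__":
    y = np.array([1, 1, 1, -1, -1])
    f = np.array([0.9, 1.1, 1.0, -0.8, -1.2])
    N = fisher_regularization_matrix(y)
    # f^T N f equals the sum of squared deviations of f from its class means.
    print(float(f @ N @ f))     # 0.1
```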

2.2. Fisher Regularized Twin Extreme Learning Machine

Within a supervised classification framework, the training dataset is typically represented as $T=\{(x_i,y_i)\}_{i=1}^{l}$, where $x_i\in\mathbb{R}^d$ and $y_i\in\{-1,1\}$. The set $T$ includes $l_1$ positive class samples and $l_2$ negative class samples, where $l=l_1+l_2$.
The TELM is a traditional and highly efficient classifier. Nevertheless, it overlooks the statistical properties contained within the data. Drawing inspiration from Fisher’s concepts, Xue et al. [33] proposed Fisher-TELM (FTELM) by introducing Fisher regularization terms into the TELM learning framework. Specifically, the primal FTELM is given as:
$$
\mathrm{Primal\ FTELM1}:\quad
\begin{aligned}
&\min_{\beta_1,\xi_1}\ \frac{1}{2}\|H_1\beta_1\|_2^2+C_1e_2^T\xi_1+\frac{C_2}{2}f_1(x)^TN_1f_1(x)\\
&\ \text{s.t.}\ -(H_2\beta_1)+\xi_1\ge e_2,\quad \xi_1\ge 0
\end{aligned}\tag{3}
$$

$$
\mathrm{Primal\ FTELM2}:\quad
\begin{aligned}
&\min_{\beta_2,\xi_2}\ \frac{1}{2}\|H_2\beta_2\|_2^2+C_3e_1^T\xi_2+\frac{C_4}{2}f_2(x)^TN_2f_2(x)\\
&\ \text{s.t.}\ (H_1\beta_2)+\xi_2\ge e_1,\quad \xi_2\ge 0
\end{aligned}\tag{4}
$$

where $N_1=I_+-M_+$ and $N_2=I_--M_-$; $\beta_1$ and $\beta_2$ represent the output weights connecting the hidden layer to the output layer; and $h(x)$ denotes the hidden layer output vector $h(x)=\left(h_1(x),h_2(x),\ldots,h_L(x)\right)\in\mathbb{R}^{1\times L}$. Here, $L$ represents the number of hidden nodes, with each node function $h_i(x)=G\!\left(\sum_{j=1}^{d}x_j\omega_{ji}+b_i\right)$. $G$ denotes an activation function; frequently used examples are the sigmoid function $G(x)=\frac{1}{1+e^{-x}}$ and the ReLU function $G(x)=\max(0,x)$. In this context, $w_i=\left(\omega_{i,1},\omega_{i,2},\ldots,\omega_{i,d}\right)$ stands for the input weight vector, $\omega_{j,i}$ denotes the input weight connecting the $j$-th feature to the $i$-th hidden node, and $b_i$ is the bias term of the $i$-th hidden node ($i=1,2,\ldots,L$); $d$ represents the sample dimension. $H_1=\left(h(x_1)^T,h(x_2)^T,\ldots,h(x_{l_1})^T\right)^T\in\mathbb{R}^{l_1\times L}$ and $H_2=\left(h(x_1)^T,h(x_2)^T,\ldots,h(x_{l_2})^T\right)^T\in\mathbb{R}^{l_2\times L}$ denote the hidden layer outputs of the positive class and negative class samples, respectively. $C_1,C_2,C_3,C_4>0$ are regularization parameters, while $e_1\in\mathbb{R}^{l_1}$ and $e_2\in\mathbb{R}^{l_2}$ are vectors of ones.
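The random feature map used by the ELM family can be written compactly; the sketch below (our own notation, with a sigmoid activation and input weights drawn uniformly as an illustrative choice) builds the matrices $H_1$ and $H_2$ for the two classes:

```python
import numpy as np

def elm_hidden_output(X, W, b):
    """ELM feature map: row i is h(x_i) = G(x_i W + b) with a sigmoid activation G.

    X : (n, d) samples, W : (d, L) random input weights, b : (L,) random biases.
    W and b are drawn once at random and never trained, as in the ELM/TELM setting.
    """
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, L = 4, 50                          # input dimension, hidden nodes
    W = rng.uniform(-1.0, 1.0, (d, L))    # random input weights
    b = rng.uniform(-1.0, 1.0, L)         # random biases

    X_pos = rng.normal(size=(30, d))      # positive-class samples
    X_neg = rng.normal(size=(40, d))      # negative-class samples
    H1 = elm_hidden_output(X_pos, W, b)   # shape (l1, L)
    H2 = elm_hidden_output(X_neg, W, b)   # shape (l2, L)
    print(H1.shape, H2.shape)
```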
According to the representer theorem, $f_1(x)=H_1\beta_1$ and $f_2(x)=H_2\beta_2$. Therefore, Equations (3) and (4) can be rephrased as follows:

$$
\begin{aligned}
&\min_{\beta_1,\xi_1}\ \frac{1}{2}\|H_1\beta_1\|_2^2+C_1e_2^T\xi_1+\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1\\
&\ \text{s.t.}\ -(H_2\beta_1)+\xi_1\ge e_2,\quad \xi_1\ge 0
\end{aligned}\tag{5}
$$

$$
\begin{aligned}
&\min_{\beta_2,\xi_2}\ \frac{1}{2}\|H_2\beta_2\|_2^2+C_3e_1^T\xi_2+\frac{C_4}{2}\beta_2^TH_2^TN_2H_2\beta_2\\
&\ \text{s.t.}\ (H_1\beta_2)+\xi_2\ge e_1,\quad \xi_2\ge 0
\end{aligned}\tag{6}
$$
The positive training point can be made to approach the hyperplane f 1 as closely as possible by optimizing the first term in the objective function (5). Minimizing the second term ensures that the negative class samples are as far as possible from the positive class hyperplane f 1 . The last term is a Fisher regularization term that minimizes the within-class scatter. For problem (6), we can also use a similar meaning to explain it.
By incorporating Lagrange multipliers θ 1 and λ 1 , the dual problem of (5) can be formulated as follows:
$$\mathrm{Dual\ FTELM1}:\quad \min_{\theta_1}\ \frac{1}{2}\theta_1^TQ_1\theta_1-e_2^T\theta_1\quad \text{s.t.}\ 0\le\theta_1\le C_1e_2\tag{7}$$

Here, $Q_1=H_2\left(H_1^T\left(I_1+C_2N_1\right)H_1\right)^{-1}H_2^T$. Similarly, we can obtain the dual of (6) as:

$$\mathrm{Dual\ FTELM2}:\quad \min_{\lambda_1}\ \frac{1}{2}\lambda_1^TQ_2\lambda_1-e_1^T\lambda_1\quad \text{s.t.}\ 0\le\lambda_1\le C_3e_1\tag{8}$$

where $Q_2=H_1\left(H_2^T\left(I_2+C_4N_2\right)H_2\right)^{-1}H_1^T$.
By solving problems (7) and (8), we obtain $\theta_1$ and $\lambda_1$. Then, we get:

$$\beta_1=\left(H_1^T\left(I_1+C_2N_1\right)H_1\right)^{-1}H_2^T\theta_1\tag{9}$$

$$\beta_2=\left(H_2^T\left(I_2+C_4N_2\right)H_2\right)^{-1}H_1^T\lambda_1\tag{10}$$
With β 1 and β 2 determined, we classify a new sample point x using the following decision function:
$$f(x)=\arg\min_{k=1,2}d_k(x)=\arg\min_{k=1,2}\left|h(x)\beta_k\right|\tag{11}$$

2.3. Capped L 2 , p -Norm Metric

The squared L 2 -norm is frequently employed in TELM-related variant classifiers because it is differentiable and easier to optimize. Nevertheless, the squared term heightens the impact of outliers, thereby diminishing the model’s classification performance. Yuan et al. [28] proposed the L 2 , p -norm and capped L 2 , p -norm to enhance the model’s robustness to outliers by making p fall inside the range of (0, 2].
For any vector $s\in\mathbb{R}^n$, parameter $0<p\le 2$, and thresholding parameter $\varepsilon>0$, the $L_{2,p}$-norm and capped $L_{2,p}$-norm are given by the following Formulas (12) and (13), respectively:

$$f_1(s)=\left(\sum_{i=1}^{n}s_i^2\right)^{\frac{p}{2}}\tag{12}$$

$$f_2(s)=\min\left\{\left(\sum_{i=1}^{n}s_i^2\right)^{\frac{p}{2}},\ \varepsilon\right\}\tag{13}$$
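As a small illustration (helper names are ours), the capped $L_{2,p}$-norm of Formula (13) can be evaluated as follows; note how a large residual is clipped at the threshold $\varepsilon$:

```python
import numpy as np

def capped_l2p_norm(s, p=1.0, eps=1.0):
    """Capped L2,p value of a vector s: min((sum_i s_i^2)^(p/2), eps)."""
    return min(np.sum(np.asarray(s, dtype=float) ** 2) ** (p / 2.0), eps)

if __name__ == "__main__":
    s_normal = np.array([0.3, 0.4])      # small residual, below the cap
    s_outlier = np.array([30.0, 40.0])   # large residual, clipped by the cap
    print(capped_l2p_norm(s_normal))     # 0.5 (= ||s||_2 when p = 1)
    print(capped_l2p_norm(s_outlier))    # 1.0 (capped at eps)
```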
We also provide a comparison of the L 1 -norm, the L 2 -norm, and the capped L 2 , p -norm (p = 0.5, 1, 1.5, 2) of a scalar in Figure 1. Firstly, from Figure 1, we can see that the L 1 -norm and L 2 -norm are unbounded, and the capped L 2 , p -norm is bounded. Secondly, we can observe that the capped L 2 , p -norm can be the capped L 1 -norm when p = 1 , and the capped L 2 , p -norm can be the capped L 2 -norm when p = 2 . This indicates that the capped L 2 , p -norm can behave as a capped form of the traditional norm at certain parameter values.
To more intuitively help us understand the characteristics of the capped L 2 , p -norm metric, we also provide a comparison of the L 2 , p -norm and capped L 2 , p -norm for a two-dimensional vector (the high-dimensional situation is similar) as shown in Figure 2. Figure 2a,b are the L 2 , p -norm metric when p takes 1 and 2, respectively, while Figure 2c,d are corresponding capped versions. Figure 2a is an unbounded L 2 -norm metric, whose surface is a smooth curve surface. Figure 2b is essentially an unbounded squared L 2 -norm metric, which is also a smooth surface, similar to Figure 2a. However, due to the influence of the square, the surface rises rapidly away from the center. Applying it to a model means that it is very sensitive to data points that are far from the center, which can be noise or outliers. Figure 2a also has a relatively sharp turning point, but the overall rise is slow. This norm metric is less sensitive to outliers compared to the squared L 2 -norm metric in Figure 2b. Figure 2c,d are the bounded capped L 2 -norm metric and capped squared L 2 -norm metric, respectively, characterized by flat regions on their surface when the capped threshold is exceeded. If this metric is applied to the model, it can control the impact of the outliers, and the robustness of the model is enhanced.
In conclusion, the capped L 2 , p -norm metric is a better choice when handling datasets containing outliers or noise. Furthermore, the metric can better strengthen the model’s resilience and improve the classification ability of the model.

2.4. Concave-Convex Procedure

The concave-convex procedure (CCCP) [36] is employed to address optimization problems involving the difference of convex functions. Let $x\in\mathbb{R}^n$ be the variable; the optimization problem associated with the CCCP is expressed in the following form:

$$
\begin{aligned}
\min_{x}\ & f(x)\\
\text{s.t.}\ & c_i(x)\le 0,\ i=1,\ldots,p_1\\
& d_j(x)=0,\ j=1,\ldots,p_2
\end{aligned}\tag{14}
$$

where $f(x)=g(x)-h(x)$; $g(\cdot)$ and $h(\cdot)$ are real-valued convex functions; and $p_1$ and $p_2$ denote the numbers of inequality and equality constraints. Suppose that $h(\cdot)$ is differentiable. The solution to (14) is derived through an iterative process of solving the following sequence of convex optimization problems:

$$
\begin{aligned}
x^{k+1}\in\arg\min_{x}\ & g(x)-x^T\nabla h(x^k)\\
\text{s.t.}\ & c_i(x)\le 0,\ i=1,\ldots,p_1\\
& d_j(x)=0,\ j=1,\ldots,p_2
\end{aligned}\tag{15}
$$
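The following toy sketch (our own example function, not the paper's solver) illustrates the CCCP iteration (15) on a one-dimensional unconstrained DC objective whose convex surrogate admits a closed-form minimizer:

```python
import numpy as np

def cccp_1d(grad_h, solve_convex, x0, n_iter=20):
    """Generic unconstrained CCCP loop for f(x) = g(x) - h(x).

    At every step the concave part -h is linearized at x_k and the convex
    surrogate g(x) - grad_h(x_k) * x is minimized exactly by `solve_convex`.
    """
    x = x0
    for _ in range(n_iter):
        slope = grad_h(x)          # gradient of h at the current iterate
        x = solve_convex(slope)    # argmin_x g(x) - slope * x
    return x

if __name__ == "__main__":
    # Toy DC objective: f(x) = x^4 - 2x^2, with g(x) = x^4 and h(x) = 2x^2 (both convex).
    grad_h = lambda x: 4.0 * x
    # argmin_x x^4 - slope * x has the closed form sign(slope) * (|slope| / 4)^(1/3).
    solve_convex = lambda s: np.sign(s) * (abs(s) / 4.0) ** (1.0 / 3.0)
    print(cccp_1d(grad_h, solve_convex, x0=0.5))  # converges to a stationary point near x = 1
```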

3. Squared Fractional Loss Based Robust Supervised Twin Extreme Learning Machine

In this section, we first put forward a new loss function (SF-loss) and then combine it with the TELM, the capped $L_{2,p}$-norm metric, and Fisher regularization to propose a new robust supervised learning framework (SF-RSTELM). We also provide a detailed solution process and a convergence analysis for this model.

3.1. Squared Fractional Loss

Convex loss functions (the $L_1$ loss, the $L_2$ loss, and the hinge loss) are commonly utilized in machine learning due to their ability to achieve global optimality. However, their unbounded nature makes them vulnerable to noise and outliers. According to M-estimation theory [34], loss functions with bounded values or bounded influence functions demonstrate greater robustness to noise and outliers. Thus, we propose a new bounded loss function, called the squared fractional loss (abbreviated as SF-loss), in the following:
Definition 1. 
Given a vector u, the SF-loss is defined as
$$L_r(u)=\frac{u^2}{ru^2+1}\tag{16}$$

where the parameter $r\in(0,+\infty)$.

Figure 3 shows the $L_r(u)$ loss function for different values of the parameter $r$. From Figure 3, we can see that the parameter $r$ controls the upper bound of the loss function: the larger the value of $r$, the smaller the upper bound. In addition, we provide some interesting properties, a robustness analysis, and the Fisher consistency of our SF-loss function.
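A minimal sketch of the SF-loss and its derivative (names are ours) makes the boundedness by $1/r$ and the vanishing gradient for large residuals easy to check:

```python
import numpy as np

def sf_loss(u, r=0.5):
    """Squared fractional loss L_r(u) = u^2 / (r u^2 + 1), Eq. (16)."""
    u = np.asarray(u, dtype=float)
    return u ** 2 / (r * u ** 2 + 1.0)

def sf_loss_grad(u, r=0.5):
    """Derivative L_r'(u) = 2u / (r u^2 + 1)^2, which vanishes as |u| grows."""
    u = np.asarray(u, dtype=float)
    return 2.0 * u / (r * u ** 2 + 1.0) ** 2

if __name__ == "__main__":
    for r in (0.5, 1.0, 2.0):
        print(r, float(sf_loss(1e6, r)))   # approaches the upper bound 1/r
    print(float(sf_loss_grad(1e6)))        # ~0: large residuals barely move the model
```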

3.1.1. The Properties of the SF-Loss Function

Property 1. 
$L_r(u)=0$ if $u=0$. This guarantees that the function $L_r(u)$ passes through the origin.
Property 2. 
L r ( u ) is bounded, which can ensure better robustness.
Proof. 
$$\lim_{u\to\infty}L_r(u)=\lim_{u\to\infty}\frac{u^2}{ru^2+1}=\lim_{u\to\infty}\frac{1}{r+\frac{1}{u^2}}=\frac{1}{r}\tag{17}$$
Therefore, L r ( u ) is a bounded function. □
Property 3. 
L r ( u ) is a differentiable function that can help us optimize better.
Proof. 
$$L_r'(u)=\frac{2u}{\left(ru^2+1\right)^2}\tag{18}$$
Therefore, L r ( u ) is differentiable. □
Property 4. 
L r ( u ) is a symmetrical function.
Proof. 
Calculate L r ( u ) and L r ( u ) separately and obtain:
$$L_r(u)=\frac{u^2}{ru^2+1}\tag{19}$$

$$L_r(-u)=\frac{(-u)^2}{r(-u)^2+1}=\frac{u^2}{ru^2+1}\tag{20}$$

Since $L_r(-u)=L_r(u)$, $L_r(u)$ is symmetric. □
Property 5. 
L r ( u ) is a non-convex function.
Proof. 
Let $r=1$, $\lambda=\frac{1}{2}$, $u_1=1$, and $u_2=2$. Then, we obtain:

$$L_r\!\left(\lambda u_1+(1-\lambda)u_2\right)=L_1\!\left(\tfrac{1}{2}\cdot 1+\tfrac{1}{2}\cdot 2\right)=L_1\!\left(\tfrac{3}{2}\right)=\frac{\left(\tfrac{3}{2}\right)^2}{\left(\tfrac{3}{2}\right)^2+1}=\frac{2.25}{3.25}\approx 0.692\tag{21}$$

$$\lambda L_r(u_1)+(1-\lambda)L_r(u_2)=\tfrac{1}{2}L_1(1)+\tfrac{1}{2}L_1(2)=\tfrac{1}{2}\cdot\tfrac{1}{2}+\tfrac{1}{2}\cdot\tfrac{4}{5}=\frac{13}{20}=0.65\tag{22}$$

Since $0.692>0.65$, i.e., $L_r\!\left(\lambda u_1+(1-\lambda)u_2\right)>\lambda L_r(u_1)+(1-\lambda)L_r(u_2)$, $L_r(u)$ is a non-convex function. □

3.1.2. Robustness Analysis of SF-Loss Function

Clearly, the new loss function L r ( u ) is bounded. From a robust statistics perspective, the L r ( u ) shows noise insensitivity, which ensures superior robustness. The derivative of L r ( u ) is expressed as:
$$L_r'(u)=\frac{2u}{\left(ru^2+1\right)^2}\tag{23}$$

and we have:

$$\lim_{u\to\infty}\frac{2u}{\left(ru^2+1\right)^2}=0\tag{24}$$
Hence, according to M-estimation theory [34], the loss function is robust against noise.

3.1.3. Fisher Consistency of SF-Loss Function

An important attribute for a binary classifier f: X Y is whether the classifier satisfies Fisher consistency. Specifically, a classifier f is deemed Fisher consistent if the minimizer of the associated expected risk, as dictated by a loss function L, exhibits the same sign as the Bayes classifier [35]. Thus, the loss function L is termed Fisher consistent if it adheres to this property.
In binary classification problems, the training set is represented by $\{(x_i,y_i)\}_{i=1}^{l}$, under the assumption that the samples are independent and that $\rho$ is a probability measure on $X\times Y$. Then, the expected risk of a classifier $f:X\to Y$ is defined as:

$$R_{L,\rho}(f)=\int_{X\times Y}L\!\left(1-yf(x)\right)d\rho\tag{25}$$

Here, $L(\cdot)$ denotes the loss function, and $P(\chi)=\mathrm{Prob}(Y=1(-1)\mid X=\chi)$ represents the conditional probability of the positive (or negative) class when $X=\chi$. Under the condition that $\chi$ is given, the conditional distribution of $\rho$ is denoted by $\rho(y\mid\chi)$. To minimize the expected risk, we introduce the optimization variable $q$ and define the function minimizing the expected risk through the following formula:

$$f_{L,\rho}(\chi)=\arg\min_{q}\int_{Y}L(1-yq)\,d\rho(y\mid\chi),\quad \chi\in X\tag{26}$$

where $q=f(x)$ is the quantity to be optimized, representing the predicted value under a specific condition (i.e., given $\chi$). In the binary classification problem, $\rho(y\mid\chi)$ is a binary distribution. Specifically, $\mathrm{Prob}(y=1\mid\chi)$ and $\mathrm{Prob}(y=-1\mid\chi)$ denote the probabilities of the positive and negative classes, respectively. The Bayes classifier is given by:

$$f_c(\chi)=\begin{cases}1, & \text{if }\mathrm{Prob}(y=1\mid\chi)\ge\mathrm{Prob}(y=-1\mid\chi)\\ -1, & \text{if }\mathrm{Prob}(y=1\mid\chi)<\mathrm{Prob}(y=-1\mid\chi)\end{cases}\tag{27}$$
Next, we will examine whether the SF-loss satisfies Fisher consistency.
Property 6. 
Function f L r , ρ ( χ ) , which minimizes SF-loss expected risk among all measurable functions, is equivalent to the Bayes classifier: f L r , ρ ( χ ) = f c χ . This means that the SF-loss satisfies Fisher consistency.
Proof. 
For binary classification, we have:
$$\int_{Y}L_r(1-yq)\,d\rho(y\mid\chi)=L_r(1-q)\,\mathrm{Prob}(y=1\mid\chi)+L_r(1+q)\,\mathrm{Prob}(y=-1\mid\chi)\tag{28}$$

Substituting $1-q$ and $1+q$ into Formula (16), the following two equations are obtained:

$$L_r(1-q)=\frac{(1-q)^2}{r(1-q)^2+1}\tag{29}$$

and

$$L_r(1+q)=\frac{(1+q)^2}{r(1+q)^2+1}\tag{30}$$

By substituting (29) and (30) into (28), we can obtain:

$$\int_{Y}L_r(1-yq)\,d\rho(y\mid\chi)=\frac{(1-q)^2}{r(1-q)^2+1}\,\mathrm{Prob}(y=1\mid\chi)+\frac{(1+q)^2}{r(1+q)^2+1}\,\mathrm{Prob}(y=-1\mid\chi)\tag{31}$$

Therefore, if $\mathrm{Prob}(y=1\mid\chi)\ge\mathrm{Prob}(y=-1\mid\chi)$, the expected risk attains its minimum at $q=1$; if $\mathrm{Prob}(y=1\mid\chi)<\mathrm{Prob}(y=-1\mid\chi)$, the expected risk attains its minimum at $q=-1$. Hence, the minimizer of the expected risk measured by the SF-loss satisfies $f_{L_r,\rho}(\chi)=f_c(\chi)$. This analysis confirms that the minimizer of the associated expected risk, determined by the loss function, has the same sign as the Bayes classifier, thereby proving that this property is satisfied. □
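The property can also be checked numerically. The short sketch below (helper names are ours) minimizes the expected SF-risk of (31) over a grid of $q$ values and confirms that the sign of the minimizer agrees with the Bayes rule:

```python
import numpy as np

def sf_loss(u, r=0.5):
    """Squared fractional loss L_r(u) = u^2 / (r u^2 + 1)."""
    return u ** 2 / (r * u ** 2 + 1.0)

def expected_sf_risk(q, p_pos, r=0.5):
    """Expected SF-loss risk of predicting q when Prob(y = 1 | x) = p_pos, cf. (31)."""
    return sf_loss(1.0 - q, r) * p_pos + sf_loss(1.0 + q, r) * (1.0 - p_pos)

if __name__ == "__main__":
    qs = np.linspace(-3.0, 3.0, 6001)
    for p_pos in (0.2, 0.8):
        q_star = qs[np.argmin(expected_sf_risk(qs, p_pos))]
        # The sign of the risk minimizer agrees with the Bayes rule.
        print(p_pos, int(np.sign(q_star)))
```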
To more clearly compare the performance of our loss function with other advanced loss functions (Hinge loss [20,33], capped L 1 -norm loss [25,33], and Welsch loss [30]), we present them in Figure 4. We can observe that the hinge loss is unbounded and tends to exaggerate the impact of outliers. Other losses are bounded, so we set specific parameters to ensure that the upper bounds of these losses are consistent. In this case, our proposed loss function exhibits a smoother characteristic and its growth trend is relatively gradual compared to the capped L 1 -norm loss. Compared to the Welsch loss function, our proposed loss function increases rapidly near the origin (corresponding to normal data) and grows relatively slowly further away from it (representing noisy or outlier data points). This characteristic indicates that our loss function places greater emphasis on increasing the loss for normal points, encouraging the model to make more precise predictions for regular data. Meanwhile, due to the slower growth of loss for outliers, the model becomes less sensitive to them, thus reducing their impact and enhancing the overall robustness of the model.

3.2. Squared Fractional Loss Based Robust Supervised Twin Extreme Learning Machine

In order to better elaborate on the idea of our model improvement, we rewrite problems (5) and (6) of FTELM as follows:
$$\min_{\beta_1}\ \frac{1}{2}\|H_1\beta_1\|_2^2+C_1\sum_{j=1}^{m_2}L_H\!\left(1+h(x_j)\beta_1\right)+\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1\tag{32}$$

$$\min_{\beta_2}\ \frac{1}{2}\|H_2\beta_2\|_2^2+C_3\sum_{i=1}^{m_1}L_H\!\left(1-h(x_i)\beta_2\right)+\frac{C_4}{2}\beta_2^TH_2^TN_2H_2\beta_2\tag{33}$$

where the hinge loss function is $L_H(u)=\max(0,u)$; for (32), $u\to u_j=1+h(x_j)\beta_1$, and for (33), $u\to u_i=1-h(x_i)\beta_2$. It is worth noting that FTELM uses the squared $L_2$-norm metric and hinge loss, which can exaggerate the influence of noise and outliers. To enhance the robustness of the model, we substitute the squared $L_2$-norm metric and hinge loss with the capped $L_{2,p}$-norm metric and the SF-loss. Therefore, we propose a robust supervised TELM learning framework called SF-loss-based robust supervised TELM (SF-RSTELM). The core problem of our designed model can be described as:
  • SF-RSTELM1:
    $$\min_{\beta_1}\ \frac{1}{2}\sum_{i=1}^{m_1}\min\!\left(\|h(x_i)\beta_1\|_2^p,\ \varepsilon_1\right)+\frac{C_1}{2}\sum_{j=1}^{m_2}L_r\!\left(1+h(x_j)\beta_1\right)+\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1\tag{34}$$
  • SF-RSTELM2:
    $$\min_{\beta_2}\ \frac{1}{2}\sum_{j=1}^{m_2}\min\!\left(\|h(x_j)\beta_2\|_2^p,\ \varepsilon_3\right)+\frac{C_3}{2}\sum_{i=1}^{m_1}L_r\!\left(1-h(x_i)\beta_2\right)+\frac{C_4}{2}\beta_2^TH_2^TN_2H_2\beta_2\tag{35}$$

    where $C_1,C_2,C_3,C_4>0$ are regularization parameters, and $\varepsilon_1$ and $\varepsilon_3$ are the thresholding parameters.
For problem (34), the positive training point approaches the hyperplane f 1 as closely as possible by optimizing the first term. Minimizing the second term ensures the negative class samples are as distant as possible from the positive class hyperplane f 1 . The last term is the Fisher regularization term that minimizes the intra-class divergence from the samples. For problem (35), a similar interpretation can be applied.
To solve the problem effectively, we first deal with the first term of the objective functions (34) and (35) using the concave duality theorem [37].
Theorem 1. 
Consider a continuous nonconvex function $g(\theta):\mathbb{R}^n\to\mathbb{R}$ and suppose $h(\theta):\mathbb{R}^n\to\Omega\subseteq\mathbb{R}^n$ is a map with range $\Omega$. We assume the existence of a concave function $\bar g(u)$ defined on $\Omega$ such that $g(\theta)=\bar g(h(\theta))$ is satisfied. Under this condition, the nonconvex function $g(\theta)$ can be represented as:

$$g(\theta)=\inf_{\upsilon\in\mathbb{R}^n}\left\{\upsilon^Th(\theta)-g^*(\upsilon)\right\}\tag{36}$$

Following concave duality [38], $g^*(\upsilon)$ is the concave dual of $\bar g(u)$, given as:

$$g^*(\upsilon)=\inf_{u\in\Omega}\left\{\upsilon^Tu-\bar g(u)\right\}\tag{37}$$

Furthermore, the minimum on the right-hand side of (36) is attained at $\upsilon^*$:

$$\upsilon^*=\left.\frac{\partial\bar g(u)}{\partial u}\right|_{u=h(\theta)}\tag{38}$$
According to Theorem 1, we define a concave function $\bar g(\theta):\mathbb{R}\to\mathbb{R}$ such that, for $\theta>0$:

$$\bar g(\theta)=\min\!\left(\theta^{\frac{p}{2}},\ \varepsilon\right)\tag{39}$$

Suppose $h(\mu)=\mu^2$. The capped $L_{2,p}$-norm metric can be reformulated as:

$$\min\!\left(\|h(x)\beta\|_2^p,\ \varepsilon\right)=\bar g(h(\mu))\tag{40}$$

where $\mu=\|h(x)\beta\|_2$. Therefore, the capped $L_{2,p}$-norm metric can also be represented as:

$$\sum_{i=1}^{m_1}\bar g\!\left(\|h(x_i)\beta_1\|_2^2\right)\tag{41}$$

$$\sum_{j=1}^{m_2}\bar g\!\left(\|h(x_j)\beta_2\|_2^2\right)\tag{42}$$
Let $\theta_1=h(\mu_1)=\|h(x_i)\beta_1\|_2^2$. According to Equation (36), (40) can also be expressed as:

$$\min\!\left(\|h(x_i)\beta_1\|_2^p,\ \varepsilon_1\right)=\bar g\!\left(\|h(x_i)\beta_1\|_2^2\right)=\inf_{f_{ii}>0}\left\{f_{ii}h(\mu_1)-g^*(f_{ii})\right\}=\inf_{f_{ii}>0}\left\{f_{ii}\theta_1-g^*(f_{ii})\right\}\tag{43}$$

Here, $g^*(f_{ii})$ is the concave dual of $\bar g(\theta_1)$, represented as:

$$g^*(f_{ii})=\inf_{\theta_1}\left\{f_{ii}\theta_1-\bar g(\theta_1)\right\}=\begin{cases}\inf_{\theta_1}\left\{f_{ii}\theta_1-\theta_1^{\frac{p}{2}}\right\}, & \theta_1^{\frac{p}{2}}<\varepsilon_1\\[2pt]\inf_{\theta_1}\left\{f_{ii}\theta_1-\varepsilon_1\right\}, & \theta_1^{\frac{p}{2}}\ge\varepsilon_1\end{cases}\tag{44}$$

By optimizing over $\theta_1$ in (44), we can obtain:

$$g^*(f_{ii})=\begin{cases}f_{ii}\left(\frac{2}{p}f_{ii}\right)^{\frac{2}{p-2}}-\left(\frac{2}{p}f_{ii}\right)^{\frac{p}{p-2}}, & \theta_1^{\frac{p}{2}}<\varepsilon_1\\[2pt]f_{ii}\,\varepsilon_1^{\frac{2}{p}}-\varepsilon_1, & \theta_1^{\frac{p}{2}}\ge\varepsilon_1\end{cases}\tag{45}$$
Therefore, the capped $L_{2,p}$-norm metric can be further converted to:

$$\min_{\beta_1}\sum_{i=1}^{m_1}\min\!\left(\|h(x_i)\beta_1\|_2^p,\ \varepsilon_1\right)\Leftrightarrow\min_{\beta_1}\sum_{i=1}^{m_1}\bar g\!\left(\|h(x_i)\beta_1\|_2^2\right)\Leftrightarrow\min_{\beta_1}\sum_{i=1}^{m_1}\inf_{f_{ii}\ge 0}L_i(\beta_1,f_{ii},\varepsilon_1)\Leftrightarrow\min_{\beta_1,\,f_{ii}\ge 0}\sum_{i=1}^{m_1}L_i(\beta_1,f_{ii},\varepsilon_1)\tag{46}$$

where

$$L_i(\beta_1,f_{ii},\varepsilon_1)=\begin{cases}f_{ii}\theta_1-f_{ii}\left(\frac{2}{p}f_{ii}\right)^{\frac{2}{p-2}}+\left(\frac{2}{p}f_{ii}\right)^{\frac{p}{p-2}}, & \theta_1^{\frac{p}{2}}<\varepsilon_1\\[2pt]f_{ii}\theta_1-f_{ii}\,\varepsilon_1^{\frac{2}{p}}+\varepsilon_1, & \theta_1^{\frac{p}{2}}\ge\varepsilon_1\end{cases}\tag{47}$$

Similarly, let $\theta_2=h(\mu_2)=\|h(x_j)\beta_2\|_2^2$ and let $g^*(t_{jj})$ be the concave dual of $\bar g(\theta_2)$; then the capped $L_{2,p}$-norm metric can be written as:

$$\min_{\beta_2}\sum_{j=1}^{m_2}\min\!\left(\|h(x_j)\beta_2\|_2^p,\ \varepsilon_3\right)\Leftrightarrow\min_{\beta_2}\sum_{j=1}^{m_2}\bar g\!\left(\|h(x_j)\beta_2\|_2^2\right)\Leftrightarrow\min_{\beta_2}\sum_{j=1}^{m_2}\inf_{t_{jj}\ge 0}L_j(\beta_2,t_{jj},\varepsilon_3)\Leftrightarrow\min_{\beta_2,\,t_{jj}\ge 0}\sum_{j=1}^{m_2}L_j(\beta_2,t_{jj},\varepsilon_3)\tag{48}$$

where

$$L_j(\beta_2,t_{jj},\varepsilon_3)=\begin{cases}t_{jj}\theta_2-t_{jj}\left(\frac{2}{p}t_{jj}\right)^{\frac{2}{p-2}}+\left(\frac{2}{p}t_{jj}\right)^{\frac{p}{p-2}}, & \theta_2^{\frac{p}{2}}<\varepsilon_3\\[2pt]t_{jj}\theta_2-t_{jj}\,\varepsilon_3^{\frac{2}{p}}+\varepsilon_3, & \theta_2^{\frac{p}{2}}\ge\varepsilon_3\end{cases}\tag{49}$$

According to (38), the optimal dual variable is obtained from the derivative of $\bar g(\theta)$:

$$\frac{\partial\bar g(\theta)}{\partial\theta}=\begin{cases}\frac{p}{2}\theta^{\frac{p}{2}-1}, & 0<\theta<\varepsilon^{\frac{2}{p}}\\[2pt]0, & \theta>\varepsilon^{\frac{2}{p}}\end{cases}\tag{50}$$
Let $\theta_1=h(\mu_1)=\|h(x_i)\beta_1\|_2^2$; then we can obtain:

$$f_{ii}=\left.\frac{\partial\bar g(\theta_1)}{\partial\theta_1}\right|_{\theta_1=\|h(x_i)\beta_1\|_2^2}=\begin{cases}\frac{p}{2}\|h(x_i)\beta_1\|_2^{p-2}, & 0<\|h(x_i)\beta_1\|_2^p<\varepsilon_1\\[2pt]\sigma_1, & \text{else}\end{cases}\tag{51}$$

Similarly, let $\theta_2=h(\mu_2)=\|h(x_j)\beta_2\|_2^2$; then we can obtain:

$$t_{jj}=\left.\frac{\partial\bar g(\theta_2)}{\partial\theta_2}\right|_{\theta_2=\|h(x_j)\beta_2\|_2^2}=\begin{cases}\frac{p}{2}\|h(x_j)\beta_2\|_2^{p-2}, & 0<\|h(x_j)\beta_2\|_2^p<\varepsilon_3\\[2pt]\sigma_2, & \text{else}\end{cases}\tag{52}$$
When the variables f i i and t j j are fixed to solve the classifier-related parameters β 1 and β 2 , the optimization problems (34) and (35) can be written as:
$$\min_{\beta_1}\ \frac{1}{2}\sum_{i=1}^{m_1}\frac{2}{p}f_{ii}\|h(x_i)\beta_1\|_2^2+\frac{C_1}{2}\sum_{j=1}^{m_2}L_r\!\left(1+h(x_j)\beta_1\right)+\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1\tag{53}$$

$$\min_{\beta_2}\ \frac{1}{2}\sum_{j=1}^{m_2}\frac{2}{p}t_{jj}\|h(x_j)\beta_2\|_2^2+\frac{C_3}{2}\sum_{i=1}^{m_1}L_r\!\left(1-h(x_i)\beta_2\right)+\frac{C_4}{2}\beta_2^TH_2^TN_2H_2\beta_2\tag{54}$$

Let $F=\mathrm{diag}\!\left(\frac{2}{p}f_{11},\frac{2}{p}f_{22},\ldots,\frac{2}{p}f_{m_1m_1}\right)$ and $T=\mathrm{diag}\!\left(\frac{2}{p}t_{11},\frac{2}{p}t_{22},\ldots,\frac{2}{p}t_{m_2m_2}\right)$, so that (53) and (54) are equivalent to (55) and (56), respectively:

$$\min_{\beta_1}\ \frac{1}{2}\left(H_1\beta_1\right)^TFH_1\beta_1+\frac{C_1}{2}\sum_{j=1}^{m_2}L_r\!\left(1+h(x_j)\beta_1\right)+\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1\tag{55}$$

$$\min_{\beta_2}\ \frac{1}{2}\left(H_2\beta_2\right)^TTH_2\beta_2+\frac{C_3}{2}\sum_{i=1}^{m_1}L_r\!\left(1-h(x_i)\beta_2\right)+\frac{C_4}{2}\beta_2^TH_2^TN_2H_2\beta_2\tag{56}$$
The problems (55) and (56) can be rewritten as the following (57) and (58):
$$\min_{\beta_1}\ \frac{1}{2}\beta_1^TH_1^T\left(F+C_2N_1\right)H_1\beta_1+\frac{C_1}{2}\sum_{j=1}^{m_2}L_r\!\left(1+h(x_j)\beta_1\right)\tag{57}$$

$$\min_{\beta_2}\ \frac{1}{2}\beta_2^TH_2^T\left(T+C_4N_2\right)H_2\beta_2+\frac{C_3}{2}\sum_{i=1}^{m_1}L_r\!\left(1-h(x_i)\beta_2\right)\tag{58}$$
It can be seen that (57) and (58) are non-convex optimization problems because of the non-convexity of the loss function. In this research, we utilize the CCCP technique to handle the non-convexity. Note that the loss function can be expressed as the difference of two convex functions, or equivalently, as the sum of a convex function and a concave function. That is, $L_r(u)=L_{r1}(u)+L_{r2}(u)$. The specific expressions of the convex function $L_{r1}(u)$ and the concave function $L_{r2}(u)$ are:

$$L_{r1}(u)=ru^2\tag{59}$$

$$L_{r2}(u)=-ru^2+\frac{u^2}{ru^2+1}\tag{60}$$
Figure 5 represents the graphs for L r 1 u and L r 2 u .
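A quick numerical check (our own script) confirms that the split (59)-(60) reproduces the SF-loss:

```python
import numpy as np

r = 0.5
u = np.linspace(-5.0, 5.0, 101)
L_r  = u ** 2 / (r * u ** 2 + 1.0)                # SF-loss, Eq. (16)
L_r1 = r * u ** 2                                 # convex part, Eq. (59)
L_r2 = -r * u ** 2 + u ** 2 / (r * u ** 2 + 1.0)  # concave part, Eq. (60)
print(np.allclose(L_r, L_r1 + L_r2))              # True: the split recovers the SF-loss
```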
Therefore, the above optimization problem can be expressed as:
$$\min_{\beta_1}\ \frac{1}{2}\beta_1^TH_1^T\left(F+C_2N_1\right)H_1\beta_1+\frac{C_1}{2}\sum_{j=1}^{m_2}\left[L_{r1}\!\left(1+h(x_j)\beta_1\right)+L_{r2}\!\left(1+h(x_j)\beta_1\right)\right]\tag{61}$$

$$\min_{\beta_2}\ \frac{1}{2}\beta_2^TH_2^T\left(T+C_4N_2\right)H_2\beta_2+\frac{C_3}{2}\sum_{i=1}^{m_1}\left[L_{r1}\!\left(1-h(x_i)\beta_2\right)+L_{r2}\!\left(1-h(x_i)\beta_2\right)\right]\tag{62}$$
Since (61) and (62) are similar, we will only show the solution process for (61), which can also be expressed as:
$$\min_{\beta_1}\ \underbrace{\frac{1}{2}\beta_1^TH_1^T\left(F+C_2N_1\right)H_1\beta_1+\frac{C_1}{2}\sum_{j=1}^{m_2}L_{r1}\!\left(1+h(x_j)\beta_1\right)}_{L_{vex}(\beta_1)}+\underbrace{\frac{C_1}{2}\sum_{j=1}^{m_2}L_{r2}\!\left(1+h(x_j)\beta_1\right)}_{L_{cav}(\beta_1)}\tag{63}$$
The solution to the above optimization issue can be obtained by addressing the following equation:
$$\min_{\beta_1}\ L(\beta_1)=L_{vex}(\beta_1)+L_{cav}(\beta_1)\tag{64}$$
The value of $\beta_1$ at iteration $k+1$ is obtained as follows:

$$\beta_1^{k+1}=\arg\min_{\beta_1}\ L_{vex}(\beta_1)+\nabla L_{cav}(\beta_1^k)^T\beta_1\tag{65}$$

Let $\delta_1^k=\nabla L_{cav}(\beta_1^k)$, where $\delta_1^k=\left(\delta_{11}^k,\delta_{12}^k,\ldots,\delta_{1j}^k,\ldots,\delta_{1m_2}^k\right)^T\in\mathbb{R}^{m_2}$ and

$$\delta_{1j}^k=L_{cav}'(u_j)=-2ru_j+\frac{2u_j}{\left(ru_j^2+1\right)^2}\tag{66}$$

where $u_j=1+h(x_j)\beta_1$. Therefore, we can get:

$$\nabla L_{cav}(\beta_1^k)^T\beta_1=\left(\delta_1^k\right)^TH_2\beta_1\tag{67}$$
Solving (65) is equivalent to solving the following subproblem:
$$
\begin{aligned}
\min_{\beta_1}\ & \frac{1}{2}\beta_1^TH_1^T\left(F+C_2N_1\right)H_1\beta_1+\frac{C_1}{2}r\,\xi_1^T\xi_1+\left(\delta_1^k\right)^TH_2\beta_1\\
\text{s.t.}\ & -(H_2\beta_1)+\xi_1=e_2
\end{aligned}\tag{68}
$$

Here, we introduce the Lagrange multiplier $\lambda_1$ for problem (68). Its Lagrange function is given by:

$$L(\beta_1,\xi_1,\lambda_1)=\frac{1}{2}\beta_1^TH_1^T\left(F+C_2N_1\right)H_1\beta_1+\frac{C_1}{2}r\,\xi_1^T\xi_1+\left(\delta_1^k\right)^TH_2\beta_1-\lambda_1^T\left(-(H_2\beta_1)+\xi_1-e_2\right)\tag{69}$$

Following the Karush–Kuhn–Tucker (KKT) conditions, we derive the following constraints:

$$
\begin{aligned}
\frac{\partial L}{\partial\beta_1}&=H_1^T\left(F+C_2N_1\right)H_1\beta_1+H_2^T\delta_1^k+H_2^T\lambda_1=0\\
\frac{\partial L}{\partial\xi_1}&=C_1r\,\xi_1-\lambda_1=0\\
\lambda_1^T\left(-(H_2\beta_1)+\xi_1-e_2\right)&=0
\end{aligned}\tag{70}
$$

From the KKT conditions, we can obtain:

$$\beta_1=-\left(H_1^T\left(F+C_2N_1\right)H_1\right)^{-1}H_2^T\left(\delta_1^k+\lambda_1\right),\qquad C_1r\,\xi_1=\lambda_1\tag{71}$$
Substituting (71) into (68), we can obtain the dual of the original problem:

$$\min_{\lambda_1}\ \frac{1}{2}\left(\delta_1^k+\lambda_1\right)^TH_2\left(H_1^T\left(F+C_2N_1\right)H_1\right)^{-1}H_2^T\left(\delta_1^k+\lambda_1\right)-e_2^T\lambda_1\tag{72}$$

Using a similar approach, we can obtain the dual problem of Equation (58):

$$\min_{\lambda_2}\ \frac{1}{2}\left(\lambda_2-\delta_2^k\right)^TH_1\left(H_2^T\left(T+C_4N_2\right)H_2\right)^{-1}H_1^T\left(\lambda_2-\delta_2^k\right)-e_1^T\lambda_2\tag{73}$$

After solving (72) and (73) to obtain the optimal solutions $\lambda_1$ and $\lambda_2$, we can obtain:

$$\beta_1=-\left(H_1^T\left(F+C_2N_1\right)H_1\right)^{-1}H_2^T\left(\delta_1^k+\lambda_1\right)\tag{74}$$

$$\beta_2=\left(H_2^T\left(T+C_4N_2\right)H_2\right)^{-1}H_1^T\left(\lambda_2-\delta_2^k\right)\tag{75}$$
Therefore, the decision function of SF-RSTELM is given by:
$$f(x)=\arg\min_{k=1,2}d_k(x)=\arg\min_{k=1,2}\left|h(x)\beta_k\right|\tag{76}$$
Based on the previous discussion, we provide a detailed description of the implementation steps of the proposed method in Algorithm 1.
Algorithm 1 The procedure of SF-RSTELM
Input: The training set $T=\{(x_i,y_i)\mid 1\le i\le m\}$, where $x_i\in\mathbb{R}^d$ and $y_i\in\{-1,1\}$;
   parameters $C_1,C_2,C_3,C_4>0$, $\varepsilon_1,\varepsilon_2,\varepsilon_3,\varepsilon_4>0$, $\varepsilon>0$, $\sigma_1$, $\sigma_2$;
   activation function $G(x)$; the number of hidden nodes $L$;
   maximum number of iterations $k_{max}$.
Output: $\beta_1^*$, $\beta_2^*$.
Steps:
   1: Initialize $F^{(0)}\in\mathbb{R}^{m_1\times m_1}$ and $T^{(0)}\in\mathbb{R}^{m_2\times m_2}$; $\delta_1^{(0)}$ and $\delta_2^{(0)}$.
   2: Compute the graph matrices $N_1$, $N_2$.
   3: Set $k=0$.
   4: while true do
            Compute $\lambda_1^{(k)}$ and $\lambda_2^{(k)}$ by solving the dual problems (72) and (73), respectively.
            Then obtain the solutions $\beta_1^{(k)}$, $\beta_2^{(k)}$ by
                     $\beta_1^{(k)}=-\left(H_1^T\left(F^{(k)}+C_2N_1\right)H_1\right)^{-1}H_2^T\left(\delta_1^{(k)}+\lambda_1^{(k)}\right)$
                     $\beta_2^{(k)}=\left(H_2^T\left(T^{(k)}+C_4N_2\right)H_2\right)^{-1}H_1^T\left(\lambda_2^{(k)}-\delta_2^{(k)}\right)$
            Update the matrices $F^{(k+1)}$, $T^{(k+1)}$, $\delta_1^{(k+1)}$, and $\delta_2^{(k+1)}$ by (51), (52), and (66).
            if $k>k_{max}$ or $\|\beta_i^{(k)}-\beta_i^{(k-1)}\|\le\varepsilon$ ($i=1,2$) then
                 break
            else
                 $k=k+1$
      end while
   5: Construct the following decision function:
            $f(x)=\arg\min_{k=1,2}d_k(x)=\arg\min_{k=1,2}\left|h(x)\beta_k\right|$
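To make the overall flow concrete, the following NumPy sketch implements the $\beta_1$ half of Algorithm 1 under simplifying assumptions of ours: the slack $\xi_1=e_2+H_2\beta_1$ is eliminated so that each convex surrogate is solved as a single linear system rather than through the dual (72), the concave-part gradient keeps its $C_1/2$ factor explicitly, and the default parameter values are only illustrative. It is not the authors' MATLAB implementation; the $\beta_2$ half is symmetric.

```python
import numpy as np

def sf_concave_grad(u, r):
    """Derivative of the concave part L_r2(u) = -r u^2 + u^2/(r u^2 + 1) of the SF-loss."""
    return -2.0 * r * u + 2.0 * u / (r * u ** 2 + 1.0) ** 2

def train_beta1(H1, H2, N1, C1=1.0, C2=1.0, r=0.5, p=1.0,
                eps1=1.0, sigma1=1e-6, max_iter=50, tol=1e-5):
    """Sketch of the beta_1 subproblem of SF-RSTELM (the beta_2 subproblem is symmetric)."""
    m1, L = H1.shape
    m2 = H2.shape[0]
    e2 = np.ones(m2)

    F = np.ones(m1)         # F^(0): start from the plain squared L2 weighting
    delta = np.zeros(m2)    # delta_1^(0)
    beta1 = np.zeros(L)

    for _ in range(max_iter):
        # Convex surrogate with F and delta frozen:
        #   1/2 b^T A b + (C1 r / 2) ||e2 + H2 b||^2 + (C1 / 2) delta^T H2 b
        A = H1.T @ (F[:, None] * H1) + C2 * (H1.T @ N1 @ H1)
        lhs = A + C1 * r * (H2.T @ H2)
        rhs = -(C1 * r * (H2.T @ e2) + 0.5 * C1 * (H2.T @ delta))
        beta_new = np.linalg.solve(lhs + 1e-8 * np.eye(L), rhs)

        # Reweighting from the capped L_{2,p} metric, Eq. (51) (smoothed for stability),
        # and the concave-part gradient of the SF-loss, Eq. (66).
        res = np.abs(H1 @ beta_new) + 1e-6
        F = np.where(res ** p < eps1, res ** (p - 2.0), (2.0 / p) * sigma1)
        delta = sf_concave_grad(1.0 + H2 @ beta_new, r)

        if np.linalg.norm(beta_new - beta1) <= tol:
            beta1 = beta_new
            break
        beta1 = beta_new
    return beta1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H1 = rng.normal(size=(30, 20))                   # hidden outputs of positive samples
    H2 = rng.normal(size=(40, 20)) + 1.0             # hidden outputs of negative samples
    N1 = np.eye(30) - np.full((30, 30), 1.0 / 30)    # within-class scatter matrix (class +1)
    print(train_beta1(H1, H2, N1)[:5])
```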

3.3. Convergence Analysis

Theorem 2. 
Utilizing the CCCP technique to address problem (63), the resulting sequence β 1 ( k ) converges.
Proof of Theorem 2. 
At the iteration point of step k + 1 , the following inequality holds:
$$L_{vex}\!\left(\beta_1^{(k)}\right)+\nabla L_{cav}\!\left(\beta_1^{(k)}\right)^T\beta_1^{(k)}\ \ge\ L_{vex}\!\left(\beta_1^{(k+1)}\right)+\nabla L_{cav}\!\left(\beta_1^{(k)}\right)^T\beta_1^{(k+1)}\tag{77}$$

It can be written as

$$L_{vex}\!\left(\beta_1^{(k)}\right)-L_{vex}\!\left(\beta_1^{(k+1)}\right)\ \ge\ \nabla L_{cav}\!\left(\beta_1^{(k)}\right)^T\left(\beta_1^{(k+1)}-\beta_1^{(k)}\right)\tag{78}$$

Due to the concavity of $L_{cav}(\cdot)$, we have

$$\nabla L_{cav}\!\left(\beta_1^{(k)}\right)^T\left(\beta_1^{(k+1)}-\beta_1^{(k)}\right)\ \ge\ L_{cav}\!\left(\beta_1^{(k+1)}\right)-L_{cav}\!\left(\beta_1^{(k)}\right)\tag{79}$$

By combining the above inequalities, we have

$$L_{vex}\!\left(\beta_1^{(k)}\right)+L_{cav}\!\left(\beta_1^{(k)}\right)\ \ge\ L_{vex}\!\left(\beta_1^{(k+1)}\right)+L_{cav}\!\left(\beta_1^{(k+1)}\right)\tag{80}$$
Accordingly, the objective value of problem (63) decreases monotonically with each iteration and remains non-negative, thereby proving the convergence of the sequence. □
Theorem 3. 
Algorithm 1 converges to a local optimum of the problems in (34) and (35).
Proof of Theorem 3. 
Taking problem (34) as an example, the analysis for problem (35) follows a similar approach.
First and foremost, let us recall the formulation of our framework, namely Equation (34).
$$\min_{\beta_1}\ \frac{1}{2}\sum_{i=1}^{m_1}\min\!\left(\|h(x_i)\beta_1\|_2^p,\ \varepsilon_1\right)+\frac{C_1}{2}\sum_{j=1}^{m_2}L_r\!\left(1+h(x_j)\beta_1\right)+\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1\tag{81}$$

For convenience, let $J_1=\frac{C_1}{2}\sum_{j=1}^{m_2}L_r\!\left(1+h(x_j)\beta_1\right)$ and $J_2=\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1$. When $\|h(x_i)\beta_1\|_2^p<\varepsilon_1$, we represent the Lagrange function of (81) as follows:

$$L_1(\beta_1)=\frac{1}{2}\sum_{i=1}^{m_1}\|h(x_i)\beta_1\|_2^p+J_1+J_2\tag{82}$$

Then, we differentiate $L_1(\beta_1)$ with respect to $\beta_1$:

$$\frac{\partial L_1(\beta_1)}{\partial\beta_1}=\sum_{i=1}^{m_1}\frac{1}{2}\,p\,\|h(x_i)\beta_1\|_2^{p-1}\frac{h(x_i)\beta_1}{\|h(x_i)\beta_1\|_2}\,h(x_i)^T+\frac{\partial J_1}{\partial\beta_1}+\frac{\partial J_2}{\partial\beta_1}=0\tag{83}$$

According to (51), we substitute $f_{ii}$ into Formula (83):

$$\frac{\partial L_1(\beta_1)}{\partial\beta_1}=H_1^TFH_1\beta_1+\frac{\partial J_1}{\partial\beta_1}+\frac{\partial J_2}{\partial\beta_1}=0\tag{84}$$

Similarly, we obtain the Lagrangian function of problem (55):

$$L_2(\beta_1)=\frac{1}{2}\left(H_1\beta_1\right)^TFH_1\beta_1+J_1+J_2\tag{85}$$

Taking the derivative of $L_2(\beta_1)$ with respect to $\beta_1$:

$$\frac{\partial L_2(\beta_1)}{\partial\beta_1}=H_1^TFH_1\beta_1+\frac{\partial J_1}{\partial\beta_1}+\frac{\partial J_2}{\partial\beta_1}=0\tag{86}$$
It is noted that Formula (84) is equal to Formula (86) when determining the optimal solution β 1 . Furthermore, the optimal solution β 1 meets the KKT condition of model (34). By solving problem (55), we can determine the optimal solution for problem (34). Thus, Algorithm 1 is capable of converging to a local optimum, making it feasible to obtain the local minimum of problem (34). □

4. Numerical Experiments

Within this part, we performed a comparison of SF-RSTELM with several algorithms, such as TELM [20], FTELM [33], C L 1 -FTELM [33], FRTELM [25], and CWTELM [30]. To assess the performance of the proposed algorithm, experiments are carried out on four distinct types of databases: artificial datasets, UCI datasets, image datasets, and NDC large datasets. In addition, we demonstrate the convergence of the proposed algorithm through experimental analysis.

4.1. Experimental Setting

4.1.1. Operating Environment

All experiments were carried out using MATLAB (2021a) (MathWorks, Natick, United States) on a personal computer (PC) equipped with an Intel Core-i7 processor (2.5 GHz) and 16 GB random-access memory (RAM).

4.1.2. Benchmark Approaches

We have selected five advanced algorithms as benchmarks to compare with the SF-RSTELM proposed in this paper. These algorithms are:
  • TELM [20]: Using the hinge loss function and squared L 2 -norm metric.
  • FTELM [33]: The Fisher regularization term is introduced into TELM, and relates to the statistical information of intra-class samples.
  • C L 1 -FTELM [33]: Capped L 1 -norm loss and metric are introduced into FTELM.
  • FRTELM [25]: Capped L 1 -norm loss and metric are introduced into TELM.
  • CWTELM [30]: Replace the hinge loss function and squared L 2 -norm metric in TELM with Welsch loss function and capped L 2 , p -norm metric.

4.1.3. Parameter Selection

In the training process of a model, the selection of parameters is crucial because it affects the classification performance. The parameters required by each model are listed below. Since both the comparison models and our model are twin models containing two optimization problems, and the two problems are solved separately, we only explain the parameter selection method for one of the optimization problems; the other is handled analogously.
  • TELM: the count of hidden layer nodes L, the regularization parameter C 1 .
  • FTELM: the count of hidden layer nodes L, the regularization parameters C 1 , C 2 .
  • C L 1 -FTELM: the count of hidden layer nodes L, the regularization parameters C 1 , C 2 , the capped parameter ε 1 in the metric, and the capped parameter ε 2 in the loss function.
  • FRTELM: the count of hidden layer nodes L, the regularization parameter C 1 , the capped parameter ε 1 in the metric, and the capped parameter ε 2 in the loss function.
  • CWTELM: the count of hidden layer nodes L, the regularization parameter C 1 , the capped parameters ε 1 and the parameter p of the metric, and the parameter σ in the loss function.
  • SF-RSTELM: the count of hidden layer nodes L, the regularization parameters C 1 , C 2 , the capped parameters ε 1 and the parameter p of capped L 2 , p -norm metric, and the parameter r in the loss function.
Due to the different numbers of parameters in the different models, we adopt different parameter selection strategies. For models with fewer than three parameters (TELM and FTELM), we use grid search and ten-fold cross-validation to explore the parameter space and find the optimal parameter combination. However, when dealing with models with more than three parameters ($CL_1$-FTELM, FRTELM, CWTELM, and SF-RSTELM), directly applying grid search may lead to a significant increase in computational load. To overcome this challenge, we first fix some parameters to narrow down the search space based on initial experimental results and domain expertise. The fixed-parameter strategy is as follows: the count of hidden nodes $L$ is fixed, but $L$ may differ across datasets, and its range is $\{i\times 10^{j}\mid i=1,2,\ldots,6,\ j=1,2\}$. For a fair comparison, we fix the parameters so that the above algorithms with bounded loss functions have the same upper bound on their loss. Specifically, we set the capped $L_1$-norm loss parameter to 2 in $CL_1$-FTELM and FRTELM, the Welsch loss parameter $\sigma$ to 2 in CWTELM, and the SF-loss parameter $r$ to 0.5 in our model. Moreover, for the aforementioned models that utilize bounded metrics, we set all their upper bounds to be equal; that is, the capped parameter $\varepsilon_1$ is set to 0.001. In addition to fixing the above parameters, we still use ten-fold cross-validation and grid search to find the best values of the remaining parameters. Specifically, the regularization parameters $C_1$ and $C_2$ are chosen from $\{i\times 10^{j}\mid i=1,2,\ldots,6,\ j=-3,\ldots,2\}$, and the parameter $p$ is chosen from $(0,2]$.
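The protocol described above can be sketched as follows (the classifier is abstracted as a user-supplied training function that returns a predictor; all names are ours):

```python
import itertools
import numpy as np

def ten_fold_accuracy(train_fn, X, y, params, n_folds=10, seed=0):
    """Average accuracy of `train_fn(X_tr, y_tr, **params)` over ten folds.

    `train_fn` must return a predictor mapping an (n, d) array to labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    accs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        predictor = train_fn(X[train], y[train], **params)
        accs.append(np.mean(predictor(X[test]) == y[test]))
    return float(np.mean(accs))

def grid_search(train_fn, X, y, grid):
    """Exhaustive search over a dict of parameter lists, e.g. {'C1': [...], 'C2': [...]}."""
    best = (None, -np.inf)
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        acc = ten_fold_accuracy(train_fn, X, y, params)
        if acc > best[1]:
            best = (params, acc)
    return best

# Example: the regularization grid of the form {i * 10^j | i = 1..6, j = -3..2}.
C_grid = [i * 10.0 ** j for j in range(-3, 3) for i in range(1, 7)]
```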

4.1.4. Evaluation Criteria

To assess the validity of the model, we employ accuracy ( A C C ) and the F 1 -score as evaluation criteria. Specifically, these criteria are specified as follows:
$$ACC=\frac{TP+TN}{TP+TN+FP+FN}\tag{87}$$

$$F_1=\frac{2TP}{2TP+FP+FN}\tag{88}$$
where T P , T N , F P , and F N refer to true positives, true negatives, false positives, and false negatives, respectively. Both A C C and the F 1 -score serve to evaluate the model’s generalization capability, and the higher they are, the better the performance. To guarantee the reliability of our experiments, we repeat all experimental procedures 10 times, and the final experimental result is the mean of the 10 repetitions.
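For completeness, a small sketch (our own helper) computes both criteria from predicted and true labels in $\{+1,-1\}$:

```python
import numpy as np

def acc_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels in {+1, -1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return acc, f1

if __name__ == "__main__":
    y_true = [1, 1, 1, -1, -1, -1]
    y_pred = [1, 1, -1, -1, -1, 1]
    print(acc_and_f1(y_true, y_pred))   # (0.666..., 0.666...)
```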

4.2. Experiments on the Artificial Datasets

In this subsection, experiments are initially performed on the Two Moons, XOR, and Banana datasets. Both the Two Moons and Banana datasets comprise 400 samples, while the XOR dataset contains 100 samples. Figure 6 visually depicts the two-dimensional distribution graphs of these three artificial datasets. Class 1 is shown as a red ‘∘’, and class 2 is shown as a blue ‘⋄’.
To test the performance of the proposed method, Figure 7 shows the accuracy of the five comparison algorithms as well as the proposed algorithm under conditions of no noise and a noise ratio of 20%. The noise is introduced by randomly selecting training samples and perturbing their features with Gaussian noise following a normal distribution $N(0,\sigma^2)$. Specifically, for the training data $X$, $X+\widetilde{X}$ is used instead of $X$, where $\widetilde{X}$ represents the noise matrix drawn from a normal distribution with mean 0 and variance $\sigma^2$.
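The noise-injection step can be sketched as follows (function and parameter names are ours; the fraction of perturbed samples and $\sigma$ are passed explicitly):

```python
import numpy as np

def add_gaussian_noise(X, noise_ratio=0.2, sigma=0.1, seed=0):
    """Perturb a random fraction of training samples with N(0, sigma^2) feature noise.

    A `noise_ratio` share of the rows of X is selected at random and replaced by
    X + X_tilde, where X_tilde ~ N(0, sigma^2) elementwise.
    """
    rng = np.random.default_rng(seed)
    X_noisy = X.astype(float).copy()
    n_noisy = int(round(noise_ratio * X.shape[0]))
    rows = rng.choice(X.shape[0], size=n_noisy, replace=False)
    X_noisy[rows] += rng.normal(0.0, sigma, size=(n_noisy, X.shape[1]))
    return X_noisy

if __name__ == "__main__":
    X = np.zeros((10, 3))
    print(add_gaussian_noise(X, noise_ratio=0.2, sigma=0.5))
```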
As can be seen from Figure 7a, the accuracy of SF-RSTELM is highest among the six methods when no noise is added to the dataset. Our model has the highest accuracy, which can be explained in three ways. First, it uses a bounded capped L 2 , p -norm metric, whose p-value can be flexibly adjusted to suit different data. Second, we use a bounded loss that effectively controls the upper bound. In addition, we consider the intra-class divergence information of the data. The remaining algorithms are sorted in descending order of accuracy as C L 1 -FTELM, CWTELM, FTELM, FRTELM, TELM. The accuracy of C L 1 -FTELM is slightly lower than that of our model, probably because the capped L 1 -norm metric is not as flexible as the capped L 2 , p -norm metric we use. Although CWTELM has also improved the metric and loss, it still falls slightly below C L 1 -FTELM due to its failure to consider the statistical properties of the data. FTELM is merely an extension of TELM, with the addition of a Fisher regularization term. FRTELM improved the metric and loss; however, the intra-class divergence information of the data is not taken into account. When noise is added to the dataset (Figure 7b), the performance of all algorithms is degraded, but SF-RSTELM’s accuracy is still the highest. This shows that our algorithm has the strongest noise immunity and robustness.

4.3. Experiments on the UCI Datasets

In this subsection, we evaluate the performance of our model on nine UCI datasets from http://archive.ics.uci.edu/ml/datasets.php (accessed on 15 February 2024) and contrast it with five other state-of-the-art algorithms. Table 1 provides a detailed overview of the characteristics of the UCI datasets employed. All features within these datasets are normalized to a scale of 0 , 1 to ensure consistency.
Initially, we assess the steadiness of our SF-RSTELM without adding extra noises. Table 2 presents the results of the experiments, with the best results highlighted in bold. Here, Time(s) denotes the average execution time for each algorithm when optimized with the best parameters. In ACC ± S, ACC denotes the mean learning accuracy, while S represents the standard deviation. Table 2 shows that SF-RSTELM outperforms the other five methods in terms of accuracy and F 1 on most datasets, except for the German, Pima, and QSAR datasets. This is because SF-RSTELM not only incorporates the Fisher regularization term, which considers intra-class divergence, but also leverages the parameter adjustability of SF-loss and the flexibility of the capped L 2 , p -norm metric.
Moreover, to illustrate the noise-resistant properties of SF-RSTELM, we introduce Gaussian noise into the training subset, affecting their features to generate noise. The experimental outcomes at noise levels of 15% and 25% are displayed in Table 3 and Table 4. It is clear that with rising noise levels, the learning accuracy of all algorithms declines. However, SF-RSTELM maintains higher accuracy and F 1 scores compared to the other five methods, except on a few datasets. This also demonstrates the strong noise resistance capability of our model.
To demonstrate the robustness of SF-RSTELM at varying noise levels (10%, 15%, 20%, and 25%), experiments are conducted on three datasets (Australian, Vote, and Ionosphere). Figure 8 illustrates the accuracy variation line charts of the six algorithms on these three datasets across different noise levels. For the original dataset $X$, we replace it with $X+\mu\bar X$, where $\bar X$ is a Gaussian random matrix. Here, $\mu=\rho\,\frac{\|X\|_F}{\|\bar X\|_F}$, where $\rho$ is the noise ratio taking values in $\{0,0.1,0.15,0.2,0.25\}$. As is evident from Figure 8, the performance of all algorithms decreases as noise levels rise, but the accuracy of our model decreases the slowest. This further indicates that SF-RSTELM possesses superior classification accuracy compared to the other five methods. This superiority in noise handling is likely attributed to the combined effectiveness of the SF-loss function, the capped $L_{2,p}$-norm metric, and the Fisher regularization term.

4.4. Experiments on the Image Datasets

In this part, we will perform experiments using high-dimensional image datasets to assess and compare the noise resistance and classification accuracy of our model SF-RSTELM with five other algorithms. We utilized the following three image datasets: COIL-20 http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2024), USPS http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2024), and MNIST http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2024). Since these image datasets are essentially a problem with multiple classes, we employed the “one-vs-rest” strategy to convert them to multiple binary classification problems. Table 5 shows the features of the three image datasets. Figure 9 demonstrates examples from the three high-dimensional image datasets. Specifically, due to the large size of the MNIST dataset, we only select the first 2000 samples to participate in the experiment. These image datasets are utilized to evaluate the performance of our models in multi-class classification tasks.
The specific experimental results are presented in Table 6. From the experiment results, it is evident that SF-RSTELM has the highest accuracy on all three datasets. Table 7 shows the results after adding 15% Gaussian noise. It can be seen that the ACC and F 1 value of all models are significantly reduced, but our model is still the highest, which also indicates that our model has good noise resistance capability, classification ability, and ability to process high-dimensional image datasets.

4.5. Experiments on the NDC Large Datasets

In this subsection, to evaluate the stability of our proposed model on large-scale datasets, we conduct a comparative analysis of our algorithm with five other algorithms using the NDC datasets generated by David Musicant's NDC data generator http://www.cs.wisc.edu/musicant/data/ndc (accessed on 15 February 2024). Detailed descriptions of the NDC datasets are presented in Table 8. Table 9 summarizes the experimental results of the six algorithms on three large-scale NDC datasets.
As can be seen from Table 9, our model SF-RSTELM has the highest accuracy and F 1 value except for the NDC-15k dataset. Overall, SF-RSTELM is more stable than the other five algorithms. This is mainly due to the advantages of SF-RSTELM, which not only incorporates the statistical attributes of the data, but also uses the bounded and flexibly adjustable metric and loss function, which effectively controls the disturbance from noise and outliers. Therefore, from this set of experimental results, we can see that our model is also effective for large datasets.

4.6. Convergence Curve

We also perform experiments on four UCI datasets (Australian, QSAR, WDBC, and Vote) to confirm the convergence of the proposed Algorithm 1. Figure 10 demonstrates the convergence case of the objective function value with the increasing number of iterations. As can be seen from Figure 10, the objective function value converges relatively quickly to a stable fixed value.
This observation verifies the effectiveness of our algorithm in converging the objective function to a local optimum within a finite number of iterations, thereby demonstrating its convergence properties.

5. Conclusions and Future Works

In this paper, we first propose a new kind of SF-loss function that exhibits favorable characteristics including boundedness, smoothness, symmetry, noise insensitivity, and Fisher consistency. Then, SF-RSTELM is proposed by integrating the capped L 2 , p -norm metric, SF-loss, and Fisher regularization term. SF-RSTELM not only integrates the Fisher regularization term, addressing the intra-class divergence of the data, but also exploits the parameter adjustability of SF-loss and the flexibility of capped L 2 , p -norm metrics to reduce the influence of noise and outliers. Moreover, an efficient iterative algorithm is proposed to solve the model, and the convergence of the algorithm is proved. Experimental results on multiple datasets demonstrate the efficiency of the proposed model. Specifically, our model was able to achieve higher ACC and F 1 scores on most datasets, with improvements ranging from 0.28% to 4.5% compared to other state-of-the-art algorithms.
In the future, we will continue to study the improvement of the algorithm. Because the model constructed in this paper represents a non-convex optimization problem, we convert it into a series of convex problems to solve by the CCCP method, resulting in a long training time, so it is necessary to find a fast solution method in future research. Moreover, transforming this paper’s model from supervised to semi-supervised learning remains an important direction for future studies.

Author Contributions

Z.X., conceptualization, methodology, validation, investigation, project administration, writing—original draft. Y.W., methodology, software, validation, formal analysis, investigation, data curation, writing—original draft. Y.R., validation, software. X.Z., validation, software. All authors have read and agreed to the published version of the manuscript.

Funding

The authors wish to acknowledge the financial support of the National Natural Science Youth Foundation of China (No. 61907012), the Construction Project of First-Class Disciplines in Ningxia Higher Education (NXYLXK2017B09), the Postgraduate Innovation Project of North Minzu University (YCX23098, YCX23091), the National Natural Science Foundation of China (No. 62366001), and the Natural Science Foundation of Ningxia (2024A2787).

Informed Consent Statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability Statement

The UCI machine learning repository is available at http://archive.ics.uci.edu/ml/datasets.php (accessed on 15 February 2024). The image data are available at http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2024). The NDC datasets are available at http://www.cs.wisc.edu/musicant/data/ndc (accessed on 15 February 2024).

Acknowledgments

The authors thank the anonymous reviewers for their constructive comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sakheta, A.; Raj, T.; Nayak, R.; O’Hara, I.; Ramirez, J. Improved prediction of biomass gasification models through machine learning. Comput. Chem. Eng. 2024, 191, 108834. [Google Scholar] [CrossRef]
  2. Maydanchi, M.; Ziaei, M.; Mohammadi, M.; Ziaei, A.; Basiri, M.; Haji, F.; Gharibi, K. A Comparative Analysis of the Machine Learning Methods for Predicting Diabetes. J. Oper. Intell. 2024, 2, 230–251. [Google Scholar] [CrossRef]
  3. Kim, E.; Yang, S.M.; Ham, J.H.; Lee, W.; Jung, D.H.; Kim, H.Y. Integration of MALDI-TOF MS and machine learning to classify enterococci: A comparative analysis of supervised learning algorithms for species prediction. Food Chem. 2024, 462, 140931. [Google Scholar] [CrossRef]
  4. Ding, S.; Zhao, H.; Zhang, Y.; Xu, X.; Nie, R. Extreme learning machine: Algorithm, theory and applications. Artif. Intell. Rev. 2015, 44, 103–115. [Google Scholar] [CrossRef]
  5. Deng, C.; Huang, G.; Xu, J.; Tang, J. Extreme learning machines: New trends and applications. Sci. China Inf. Sci. 2015, 2, 1–16. [Google Scholar] [CrossRef]
  6. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501. [Google Scholar] [CrossRef]
  7. Mirza, B.; Kok, S.; Dong, F. Multi-layer online sequential extreme learning machine for image classification. In Proceedings of the ELM-2015 Volume 1: Theory, Algorithms and Applications (I), Hangzhou, China, 15–17 December 2015; Springer: Berlin/Heidelberg, Germany, 2016; pp. 39–49. [Google Scholar]
  8. Yu, H.; Yuan, K.; Li, W.; Zhao, N.; Chen, W.; Huang, C.; Chen, H.; Wang, M. Improved butterfly optimizer-configured extreme learning machine for fault diagnosis. Complexity 2021, 2021, 1–17. [Google Scholar] [CrossRef]
  9. Chen, Z.; Gryllias, K.; Li, W. Mechanical fault diagnosis using convolutional neural networks and extreme learning machine. Mech. Syst. Signal Process. 2019, 133, 106272. [Google Scholar] [CrossRef]
  10. Wang, Z.; Li, M.; Wang, H.; Jiang, H.; Yao, Y.; Zhang, H.; Xin, J. Breast cancer detection using extreme learning machine based on feature fusion with CNN deep features. IEEE Access 2019, 7, 105146–105158. [Google Scholar] [CrossRef]
  11. Zhu, W.; Miao, J.; Hu, J.; Qing, L. Vehicle detection in driving simulation using extreme learning machine. Neurocomputing 2014, 128, 160–165. [Google Scholar] [CrossRef]
  12. Deeb, H.; Sarangi, A.; Mishra, D.; Sarangi, S.K. Human facial emotion recognition using improved black hole based extreme learning machine. Multimed. Tools Appl. 2022, 81, 24529–24552. [Google Scholar] [CrossRef]
  13. Zhou, J.; Zhang, X.; Jiang, Z. Recognition of imbalanced epileptic EEG signals by a graph-based extreme learning machine. Wirel. Commun. Mob. Comput. 2021, 2021, 1–12. [Google Scholar] [CrossRef]
  14. Zhao, J.; Xu, Y.; Fujita, H. An improved non-parallel universum support vector machine and its safe sample screening rule. Knowl.-Based Syst. 2019, 170, 79–88. [Google Scholar] [CrossRef]
  15. Sun, F.; Xie, X. Deep Non-Parallel Hyperplane Support Vector Machine for Classification. IEEE Access 2023, 11, 7759–7767. [Google Scholar] [CrossRef]
  16. Chen, S.; Cao, J.; Huang, Z. Weighted linear loss projection twin support vector machine for pattern classification. IEEE Access 2019, 7, 57349–57360. [Google Scholar] [CrossRef]
  17. Zheng, X.; Zhang, L.; Yan, L. Sparse discriminant twin support vector machine for binary classification. Neural Comput. Appl. 2022, 34, 16173–16198. [Google Scholar] [CrossRef]
  18. Borah, P.; Gupta, D. Robust twin bounded support vector machines for outliers and imbalanced data. Appl. Intell. 2021, 51, 5314–5343. [Google Scholar] [CrossRef]
  19. Xiao, Y.; Liu, J.; Wen, K.; Liu, B.; Zhao, L.; Kong, X. A least squares twin support vector machine method with uncertain data. Appl. Intell. 2023, 53, 10668–10684. [Google Scholar] [CrossRef]
  20. Wan, Y.; Song, S.; Huang, G.; Li, S. Twin extreme learning machines for pattern classification. Neurocomputing 2017, 260, 235–244. [Google Scholar] [CrossRef]
  21. Wang, C.; Ye, Q.; Luo, P.; Ye, N.; Fu, L. Robust capped L1-norm twin support vector machine. Neural Netw. 2019, 114, 47–59. [Google Scholar] [CrossRef]
  22. Wang, Y.; Yu, G.; Ma, J. Capped linex metric twin support vector machine for robust classification. Sensors 2022, 22, 6583. [Google Scholar] [CrossRef] [PubMed]
  23. Kumari, A.; Tanveer, M.; Alzheimer’s Disease Neuroimaging Initiative. Universum twin support vector machine with truncated pinball loss. Eng. Appl. Artif. Intell. 2023, 123, 106427. [Google Scholar] [CrossRef]
  24. Ma, J.; Yang, L.; Sun, Q. Adaptive robust learning framework for twin support vector machine classification. Knowl.-Based Syst. 2021, 211, 106536. [Google Scholar] [CrossRef]
  25. Ma, J. Capped L1-norm distance metric-based fast robust twin extreme learning machine. Appl. Intell. 2020, 50, 3775–3787. [Google Scholar] [CrossRef]
  26. Yang, Y.; Xue, Z.; Ma, J.; Chang, X. Robust projection twin extreme learning machines with capped L1-norm distance metric. Neurocomputing 2023, 517, 229–242. [Google Scholar] [CrossRef]
  27. Ma, J.; Yang, L. Robust supervised and semi-supervised twin extreme learning machines for pattern classification. Signal Process 2021, 180, 107861. [Google Scholar] [CrossRef]
  28. Yuan, C.; Yang, L. Capped L2,P-norm metric based robust least squares twin support vector machine for pattern classification. Neural Netw. 2021, 142, 457–478. [Google Scholar] [CrossRef]
  29. Wang, H.; Yu, G.; Ma, J. Capped L2,P-Norm Metric Based on Robust Twin Support Vector Machine with Welsch Loss. Symmetry 2023, 15, 1076. [Google Scholar] [CrossRef]
  30. Jiang, Y.; Yu, G.; Ma, J. Distance Metric Optimization-Driven Neural Network Learning Framework for Pattern Classification. Axioms 2023, 12, 765. [Google Scholar] [CrossRef]
  31. Ma, J.; Wen, Y.; Yang, L. Fisher-regularized supervised and semi-supervised extreme learning machine. Knowl. Inf. Syst. 2020, 62, 3995–4027. [Google Scholar] [CrossRef]
  32. Xue, Z.; Zhao, C.; Wei, S.; Ma, J.; Lin, S. Robust Fisher-regularized extreme learning machine with asymmetric Welsch-induced loss function for classification. Appl. Intell. 2024, 54, 7352–7376. [Google Scholar] [CrossRef]
  33. Xue, Z.; Cai, L. Robust Fisher-Regularized Twin Extreme Learning Machine with Capped L1-Norm for Classification. Axioms 2023, 12, 717. [Google Scholar] [CrossRef]
  34. Huber, P.J. Robust estimation of a location parameter. In Breakthroughs in Statistics: Methodology and Distribution; Springer: Berlin/Heidelberg, Germany, 1992; pp. 492–518. [Google Scholar]
  35. Yuan, C.; Yang, L. Robust twin extreme learning machines with correntropy-based metric. Knowl.-Based Syst. 2021, 214, 106707. [Google Scholar] [CrossRef]
  36. Yuille, A.L.; Rangarajan, A. The concave-convex procedure. Neural Comput. 2003, 15, 915–936. [Google Scholar] [CrossRef] [PubMed]
  37. Zhang, T. Analysis of multi-stage convex relaxation for sparse regularization. J. Mach. Learn. Res. 2010, 11, 1081–1107. [Google Scholar]
  38. Rockafellar, R. Convex Analysis. Princet. Math. Ser. 1970, 28, 326–332. [Google Scholar]
Figure 1. Comparison of the L 1 −norm, the L 2 −norm, and the capped L 2 , p −norm with different p values.
Figure 2. Comparison of L 2 , p −norm metric and capped L 2 , p −norm metric: (a) L 2 , p −norm metric ( p = 1 ); (b) L 2 , p −norm metric ( p = 2 ); (c) capped L 2 , p −norm metric ( p = 1 , ε = 2 ); (d) capped L 2 , p −norm metric ( p = 2 , ε = 2 ).
Figure 3. Loss function L r ( u ) with different values of r. The horizontal axis indicates the u value ( u = 1 − y f is the margin error), while the vertical axis shows the respective loss function value.
Figure 4. Comparison of the SF-loss with the hinge loss, capped L 1 -norm loss, and Welsch loss. The horizontal axis indicates the u value, while the vertical axis shows the respective loss function value.
Figure 5. Convex function (a) L r 1 ( u ) and concave function (b) L r 2 ( u ).
Figure 6. Two-dimensional distribution graphs of three artificial datasets (Two Moons, XOR, Banana): (a) Two Moons; (b) XOR; (c) Banana.
Figure 7. Experimental results of six algorithms on three artificial datasets: (a) experimental results of six algorithms on three artificial datasets without noise; (b) experimental results of six algorithms on three artificial datasets with 20% noise.
Figure 8. Under different noise ratios, the accuracy variation curves of six algorithms across three datasets: (a) Australian; (b) Vote; (c) Ionosphere.
Figure 9. Example images for three image datasets: (a) COIL-20 database; (b) USPS database; (c) MNIST database.
Figure 10. Convergence curves of the objective function value of SF-RSTELM versus the number of iterations on four datasets (Australian, QSAR, WDBC, Vote): (a) Australian; (b) QSAR; (c) WDBC; (d) Vote.
Table 1. Characteristics of UCI datasets.
Datasets | Instances | Attributes | Datasets | Instances | Attributes
Australian | 690 | 14 | Vote | 435 | 16
Ionosphere | 351 | 34 | WDBC | 569 | 30
German | 1000 | 24 | wholesalesta | 400 | 7
Pima | 768 | 8 | Sonar | 208 | 60
QSAR | 1055 | 41 | | |
Table 2. Experimental results on UCI datasets. The best results are marked in bold. For each dataset, the three consecutive rows report ACC ± S (%), F 1 ± S (%), and Time (s).
Datasets | TELM [20] | FTELM [33] | C L 1 -FTELM [33] | CWTELM [30] | FRTELM [25] | SF-RSTELM
Australian | 85.66 ± 0.37 | 86.37 ± 0.29 | 86.49 ± 0.37 | 86.25 ± 0.82 | 86.03 ± 0.94 | 86.76 ± 0.56
 | 86.27 ± 0.33 | 85.70 ± 1.04 | 86.06 ± 0.76 | 86.14 ± 0.47 | 85.97 ± 0.39 | 86.55 ± 0.64
 | 0.596 | 0.651 | 3.739 | 5.939 | 4.638 | 6.856
Vote | 94.26 ± 0.55 | 94.98 ± 0.69 | 95.74 ± 0.37 | 95.25 ± 0.57 | 94.83 ± 0.66 | 95.86 ± 0.37
 | 93.46 ± 0.83 | 95.89 ± 0.57 | 95.28 ± 0.17 | 96.50 ± 0.46 | 95.80 ± 0.56 | 96.87 ± 1.64
 | 0.233 | 0.794 | 1.009 | 1.843 | 3.036 | 3.604
Ionosphere | 88.67 ± 1.36 | 90.25 ± 1.13 | 90.36 ± 1.64 | 90.34 ± 1.73 | 90.14 ± 0.85 | 90.67 ± 0.93
 | 84.87 ± 2.03 | 86.45 ± 1.95 | 88.33 ± 2.75 | 87.38 ± 1.78 | 86.26 ± 0.94 | 89.41 ± 1.12
 | 0.458 | 0.237 | 2.254 | 2.358 | 3.329 | 3.217
WDBC | 96.13 ± 0.34 | 97.13 ± 0.17 | 97.05 ± 0.65 | 96.63 ± 0.38 | 96.50 ± 0.83 | 97.28 ± 0.98
 | 95.37 ± 0.51 | 96.07 ± 0.21 | 95.78 ± 0.34 | 95.98 ± 0.62 | 95.94 ± 0.87 | 97.03 ± 0.56
 | 0.719 | 0.689 | 4.359 | 3.159 | 4.656 | 5.289
German | 74.52 ± 1.28 | 74.86 ± 2.96 | 76.42 ± 0.92 | 74.38 ± 2.74 | 74.63 ± 0.79 | 74.80 ± 0.68
 | 70.36 ± 0.87 | 71.28 ± 2.77 | 72.90 ± 0.53 | 73.21 ± 0.41 | 72.18 ± 0.47 | 72.73 ± 0.32
 | 0.878 | 0.654 | 8.744 | 7.934 | 7.218 | 8.306
wholesalesta | 87.30 ± 0.72 | 88.01 ± 0.44 | 89.91 ± 0.51 | 89.43 ± 1.04 | 88.72 ± 1.34 | 90.00 ± 1.20
 | 76.76 ± 1.71 | 82.30 ± 0.49 | 82.26 ± 0.27 | 83.87 ± 1.86 | 81.23 ± 2.72 | 85.16 ± 2.26
 | 1.869 | 0.481 | 3.647 | 4.385 | 3.287 | 4.473
Pima | 76.38 ± 1.72 | 76.48 ± 1.38 | 76.97 ± 0.45 | 78.13 ± 2.27 | 76.22 ± 1.11 | 76.79 ± 0.58
 | 74.52 ± 1.28 | 76.74 ± 1.08 | 77.96 ± 0.32 | 78.82 ± 0.45 | 76.90 ± 0.64 | 75.32 ± 0.68
 | 0.969 | 1.282 | 4.561 | 7.548 | 5.078 | 6.631
Sonar | 67.76 ± 0.54 | 69.65 ± 2.21 | 69.83 ± 0.46 | 68.45 ± 3.01 | 68.37 ± 2.12 | 70.50 ± 3.67
 | 65.83 ± 0.66 | 72.10 ± 2.86 | 68.26 ± 0.15 | 65.35 ± 3.41 | 66.64 ± 1.84 | 72.82 ± 2.72
 | 0.611 | 1.055 | 1.376 | 1.537 | 1.718 | 1.476
QSAR | 85.61 ± 0.74 | 86.43 ± 0.82 | 87.44 ± 2.43 | 87.16 ± 0.36 | 86.35 ± 1.93 | 86.81 ± 2.68
 | 77.64 ± 1.11 | 78.36 ± 1.24 | 82.23 ± 0.31 | 81.35 ± 0.74 | 81.36 ± 0.71 | 81.59 ± 0.23
 | 0.784 | 1.522 | 12.884 | 9.572 | 11.667 | 19.101
Table 3. Experimental results on UCI datasets with 15% Gaussian noise. The best results are marked in bold. For each dataset, the three consecutive rows report ACC ± S (%), F 1 ± S (%), and Time (s).
Datasets | TELM [20] | FTELM [33] | C L 1 -FTELM [33] | CWTELM [30] | FRTELM [25] | SF-RSTELM
Australian | 82.10 ± 0.88 | 83.07 ± 0.90 | 85.44 ± 0.22 | 84.25 ± 0.34 | 84.29 ± 0.63 | 85.59 ± 0.75
 | 83.88 ± 0.84 | 84.60 ± 0.78 | 86.64 ± 0.25 | 84.14 ± 0.81 | 84.68 ± 0.61 | 86.17 ± 0.49
 | 0.481 | 0.727 | 3.624 | 5.619 | 4.256 | 6.499
Vote | 91.17 ± 1.04 | 92.54 ± 1.16 | 93.65 ± 0.45 | 92.56 ± 0.23 | 92.43 ± 0.30 | 93.72 ± 0.91
 | 92.84 ± 0.87 | 93.88 ± 0.76 | 92.53 ± 0.26 | 91.62 ± 0.58 | 91.32 ± 0.27 | 93.26 ± 0.69
 | 0.320 | 0.586 | 1.146 | 2.052 | 2.940 | 3.296
Ionosphere | 85.50 ± 1.80 | 85.88 ± 1.71 | 85.97 ± 0.89 | 85.34 ± 1.73 | 85.38 ± 1.33 | 87.94 ± 1.79
 | 82.96 ± 0.61 | 84.34 ± 0.43 | 83.08 ± 0.53 | 85.21 ± 1.78 | 83.39 ± 1.75 | 85.56 ± 0.57
 | 0.231 | 0.164 | 2.897 | 2.358 | 2.428 | 3.299
WDBC | 94.89 ± 0.63 | 95.13 ± 0.41 | 94.36 ± 0.54 | 94.29 ± 0.87 | 93.37 ± 0.46 | 95.36 ± 0.51
 | 93.15 ± 0.88 | 93.09 ± 0.57 | 92.66 ± 0.21 | 91.84 ± 0.34 | 90.24 ± 0.77 | 93.28 ± 0.82
 | 0.392 | 0.227 | 4.087 | 4.527 | 4.560 | 5.498
German | 70.79 ± 0.73 | 71.59 ± 1.19 | 72.87 ± 0.58 | 71.24 ± 0.56 | 70.50 ± 0.66 | 72.30 ± 0.65
 | 68.35 ± 0.15 | 67.57 ± 0.96 | 71.32 ± 0.17 | 69.54 ± 0.24 | 69.88 ± 0.73 | 70.54 ± 0.18
 | 0.786 | 1.446 | 8.582 | 7.934 | 7.572 | 8.081
wholesalesta | 84.19 ± 1.82 | 83.65 ± 1.32 | 85.72 ± 0.19 | 84.91 ± 0.87 | 84.67 ± 0.95 | 86.05 ± 0.74
 | 78.45 ± 2.50 | 82.25 ± 0.78 | 82.15 ± 0.32 | 81.13 ± 1.45 | 81.29 ± 0.79 | 82.34 ± 1.65
 | 0.632 | 0.253 | 3.362 | 4.073 | 3.395 | 4.455
Pima | 69.07 ± 1.03 | 71.87 ± 1.09 | 73.09 ± 0.48 | 72.45 ± 0.71 | 70.75 ± 0.35 | 71.58 ± 1.03
 | 66.15 ± 0.57 | 69.45 ± 0.23 | 70.17 ± 0.23 | 70.05 ± 0.30 | 69.34 ± 0.45 | 69.81 ± 0.73
 | 0.497 | 1.815 | 5.178 | 8.237 | 6.583 | 7.871
Sonar | 63.90 ± 0.82 | 65.25 ± 0.75 | 64.27 ± 0.65 | 64.13 ± 0.46 | 63.95 ± 0.52 | 65.50 ± 2.44
 | 64.19 ± 0.94 | 64.67 ± 2.34 | 64.73 ± 0.24 | 64.23 ± 0.67 | 64.01 ± 1.22 | 65.63 ± 1.25
 | 0.433 | 0.652 | 1.686 | 1.874 | 1.634 | 1.947
QSAR | 78.08 ± 1.49 | 80.05 ± 1.42 | 82.35 ± 0.35 | 82.56 ± 0.76 | 81.21 ± 0.37 | 82.05 ± 0.84
 | 79.35 ± 1.37 | 77.25 ± 0.57 | 80.56 ± 0.28 | 81.74 ± 0.87 | 80.45 ± 0.17 | 79.69 ± 2.20
 | 0.846 | 1.731 | 13.932 | 8.528 | 10.462 | 17.196
Table 4. Experimental results on UCI datasets with 25% Gaussian noise. The best results are marked in bold. For each dataset, the three consecutive rows report ACC ± S (%), F 1 ± S (%), and Time (s).
Datasets | TELM [20] | FTELM [33] | C L 1 -FTELM [33] | CWTELM [30] | FRTELM [25] | SF-RSTELM
Australian | 79.97 ± 0.74 | 80.50 ± 1.25 | 81.82 ± 0.31 | 81.12 ± 0.54 | 80.34 ± 0.93 | 82.50 ± 0.84
 | 78.56 ± 1.41 | 79.13 ± 0.20 | 82.09 ± 0.65 | 81.23 ± 0.34 | 80.39 ± 0.79 | 83.09 ± 0.82
 | 0.369 | 0.549 | 2.937 | 4.457 | 3.210 | 5.239
Vote | 90.40 ± 0.90 | 90.14 ± 0.65 | 92.52 ± 0.46 | 91.12 ± 0.76 | 90.88 ± 1.05 | 92.67 ± 1.09
 | 93.73 ± 0.82 | 91.83 ± 0.91 | 94.48 ± 0.81 | 91.72 ± 0.83 | 92.97 ± 0.75 | 92.87 ± 0.77
 | 0.299 | 0.312 | 1.391 | 2.131 | 3.724 | 2.984
Ionosphere | 80.44 ± 0.92 | 83.06 ± 1.55 | 84.18 ± 0.67 | 84.13 ± 0.76 | 83.23 ± 0.65 | 84.71 ± 1.99
 | 79.24 ± 1.46 | 81.29 ± 0.38 | 81.26 ± 0.34 | 81.05 ± 0.23 | 81.25 ± 0.12 | 81.89 ± 0.86
 | 0.254 | 0.157 | 2.721 | 2.480 | 4.008 | 3.365
WDBC | 91.91 ± 0.34 | 91.93 ± 1.39 | 92.26 ± 0.48 | 92.03 ± 0.56 | 91.82 ± 0.42 | 93.04 ± 0.64
 | 89.71 ± 1.54 | 89.76 ± 0.45 | 90.14 ± 0.86 | 89.47 ± 0.35 | 87.21 ± 0.62 | 90.34 ± 0.76
 | 0.285 | 0.486 | 4.228 | 5.162 | 3.014 | 5.271
German | 70.66 ± 0.91 | 70.09 ± 0.64 | 71.12 ± 0.46 | 70.37 ± 1.08 | 69.46 ± 0.78 | 71.83 ± 0.43
 | 67.23 ± 0.83 | 68.72 ± 0.31 | 67.33 ± 0.28 | 67.53 ± 0.36 | 67.24 ± 0.41 | 68.98 ± 0.65
 | 0.734 | 1.379 | 9.411 | 8.475 | 8.306 | 9.916
wholesalesta | 81.44 ± 1.13 | 80.05 ± 1.65 | 81.16 ± 0.23 | 80.52 ± 0.34 | 80.25 ± 0.31 | 83.33 ± 1.02
 | 78.23 ± 0.68 | 79.21 ± 0.84 | 77.34 ± 0.75 | 78.47 ± 0.75 | 78.02 ± 1.40 | 79.45 ± 0.93
 | 0.493 | 0.367 | 3.737 | 4.782 | 5.434 | 5.837
Pima | 65.71 ± 0.22 | 66.78 ± 0.93 | 68.36 ± 0.57 | 67.23 ± 0.84 | 66.79 ± 0.14 | 65.86 ± 0.82
 | 67.30 ± 0.17 | 67.37 ± 0.56 | 68.65 ± 0.26 | 67.18 ± 0.72 | 65.37 ± 0.23 | 67.73 ± 0.59
 | 0.429 | 1.518 | 6.551 | 8.169 | 5.176 | 8.117
Sonar | 60.75 ± 0.74 | 61.15 ± 0.26 | 62.23 ± 0.14 | 61.23 ± 0.52 | 61.15 ± 0.21 | 62.55 ± 1.26
 | 61.53 ± 0.31 | 61.90 ± 0.88 | 62.17 ± 0.56 | 61.45 ± 1.31 | 62.43 ± 1.61 | 63.23 ± 0.67
 | 0.538 | 0.399 | 1.372 | 1.549 | 1.752 | 1.382
QSAR | 74.63 ± 1.27 | 76.45 ± 0.65 | 81.35 ± 0.35 | 81.68 ± 0.81 | 81.05 ± 0.54 | 79.13 ± 0.76
 | 72.36 ± 1.10 | 72.94 ± 0.46 | 80.82 ± 0.28 | 80.93 ± 0.77 | 80.54 ± 0.23 | 76.56 ± 1.84
 | 0.606 | 1.572 | 12.847 | 9.458 | 11.284 | 18.193
Table 5. Characteristics of image datasets.
Datasets | COIL-20 | USPS | MNIST
Instances | 1440 | 9298 | 70,000
Attributes | 1024 | 256 | 784
Image Resolution | 32 × 32 pixels | 16 × 16 pixels | 28 × 28 pixels
Input Dimension | 1024 | 256 | 784
Description | 20 objects, each rotated on a turntable with images captured every 5 degrees, resulting in 72 images per object. | Contains a total of 9298 images of handwritten digits. | It has a training set of 60,000 examples and a test set of 10,000 examples.
Table 6. Experimental results on image datasets. The best results are marked in bold. For each dataset, the two consecutive rows report ACC ± S (%) and F 1 ± S (%).
Datasets | TELM [20] | FTELM [33] | C L 1 -FTELM [33] | CWTELM [30] | FRTELM [25] | SF-RSTELM
COIL-20 | 88.06 ± 0.89 | 90.38 ± 0.52 | 90.63 ± 0.28 | 91.62 ± 0.39 | 90.25 ± 0.74 | 92.46 ± 0.25
 | 87.61 ± 0.60 | 88.15 ± 1.56 | 89.34 ± 0.75 | 89.45 ± 0.21 | 89.50 ± 0.67 | 90.82 ± 0.71
USPS | 97.25 ± 0.28 | 97.14 ± 0.74 | 98.27 ± 0.65 | 98.49 ± 0.75 | 97.45 ± 0.96 | 98.73 ± 0.38
 | 95.57 ± 0.47 | 94.83 ± 0.36 | 95.51 ± 0.59 | 96.63 ± 0.38 | 95.58 ± 0.35 | 97.25 ± 0.17
MNIST | 89.92 ± 0.35 | 90.23 ± 0.81 | 90.89 ± 0.38 | 89.92 ± 0.32 | 89.61 ± 0.23 | 91.26 ± 0.79
 | 88.24 ± 0.51 | 87.34 ± 0.75 | 89.17 ± 0.29 | 87.38 ± 0.41 | 87.23 ± 0.16 | 89.03 ± 0.85
Table 7. Experimental results on image datasets with 15% Gaussian noise. The best results are marked in bold. For each dataset, the two consecutive rows report ACC ± S (%) and F 1 ± S (%).
Datasets | TELM [20] | FTELM [33] | C L 1 -FTELM [33] | CWTELM [30] | FRTELM [25] | SF-RSTELM
COIL-20 | 86.35 ± 0.56 | 87.87 ± 0.58 | 89.25 ± 0.75 | 88.78 ± 0.91 | 88.46 ± 0.64 | 89.57 ± 0.56
 | 85.51 ± 0.92 | 85.25 ± 0.29 | 87.85 ± 0.28 | 87.25 ± 0.29 | 87.67 ± 0.71 | 88.04 ± 0.63
USPS | 94.54 ± 0.32 | 95.32 ± 0.56 | 95.81 ± 0.26 | 96.58 ± 0.82 | 95.34 ± 0.68 | 96.65 ± 0.76
 | 93.12 ± 0.79 | 93.54 ± 0.93 | 94.78 ± 0.57 | 95.45 ± 0.73 | 94.16 ± 0.45 | 95.47 ± 0.64
MNIST | 87.48 ± 0.83 | 88.74 ± 0.46 | 88.93 ± 0.76 | 88.13 ± 0.25 | 87.94 ± 0.57 | 89.18 ± 0.42
 | 84.62 ± 0.21 | 86.37 ± 0.84 | 87.32 ± 0.82 | 86.84 ± 0.18 | 86.15 ± 0.74 | 87.56 ± 0.27
Table 8. Characteristics of NDC datasets.
Datasets | Instances | Attributes
NDC-5k | 5000 | 35
NDC-10k | 10,000 | 35
NDC-15k | 15,000 | 35
Table 9. Experimental results on NDC large datasets. The best results are marked in bold. For each dataset, the two consecutive rows report ACC ± S (%) and F 1 ± S (%).
Datasets | TELM [20] | FTELM [33] | C L 1 -FTELM [33] | CWTELM [30] | FRTELM [25] | SF-RSTELM
NDC-5k | 86.38 ± 0.26 | 87.63 ± 0.23 | 90.84 ± 0.81 | 90.79 ± 0.23 | 87.65 ± 0.78 | 90.95 ± 0.47
 | 85.89 ± 0.91 | 88.47 ± 0.39 | 91.78 ± 0.39 | 91.60 ± 0.88 | 87.71 ± 0.31 | 91.83 ± 0.39
NDC-10k | 87.39 ± 0.73 | 89.72 ± 0.84 | 91.37 ± 0.45 | 92.74 ± 0.29 | 88.97 ± 0.65 | 92.91 ± 0.23
 | 86.82 ± 0.57 | 88.90 ± 0.61 | 90.72 ± 0.24 | 91.69 ± 0.82 | 87.83 ± 0.18 | 91.80 ± 0.35
NDC-15k | 87.13 ± 0.36 | 88.21 ± 0.13 | 90.69 ± 0.71 | 90.32 ± 0.14 | 88.05 ± 0.53 | 89.95 ± 0.46
 | 86.87 ± 0.85 | 87.56 ± 0.38 | 90.94 ± 0.80 | 90.72 ± 0.52 | 86.48 ± 0.29 | 88.73 ± 0.71
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
