Article

The Robust Supervised Learning Framework: Harmonious Integration of Twin Extreme Learning Machine, Squared Fractional Loss, Capped L2,p-norm Metric, and Fisher Regularization

1 School of Mathematics and Information Science, North Minzu University, Yinchuan 750021, China
2 The Key Laboratory of Intelligent Information and Big Data Processing of NingXia Province, North Minzu University, Yinchuan 750021, China
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(9), 1230; https://doi.org/10.3390/sym16091230
Submission received: 10 August 2024 / Revised: 4 September 2024 / Accepted: 9 September 2024 / Published: 19 September 2024
(This article belongs to the Section Mathematics)

Abstract
As a novel learning algorithm for feedforward neural networks, the twin extreme learning machine (TELM) boasts advantages such as simple structure, few parameters, low complexity, and excellent generalization performance. However, it employs the squared L 2 -norm metric and an unbounded hinge loss function, which tends to overstate the influence of outliers and subsequently diminishes the robustness of the model. To address this issue, scholars have proposed the bounded capped L 2 , p -norm metric, which can be flexibly adjusted by varying the p value to adapt to different data and reduce the impact of noise. Therefore, we substitute the metric in the TELM with the capped L 2 , p -norm metric in this paper. Furthermore, we propose a bounded, smooth, symmetric, and noise-insensitive squared fractional loss (SF-loss) function to replace the hinge loss function in the TELM. Additionally, the TELM neglects statistical information in the data; thus, we incorporate the Fisher regularization term into our model to fully exploit the statistical characteristics of the data. Drawing upon these merits, a squared fractional loss-based robust supervised twin extreme learning machine (SF-RSTELM) model is proposed by integrating the capped L 2 , p -norm metric, SF-loss, and Fisher regularization term. The model shows significant effectiveness in decreasing the impacts of noise and outliers. However, the proposed model’s non-convexity poses a formidable challenge in the realm of optimization. We use an efficient iterative algorithm to solve it based on the concave-convex procedure (CCCP) algorithm and demonstrate the convergence of the proposed algorithm. Finally, to verify the algorithm’s effectiveness, we conduct experiments on artificial datasets, UCI datasets, image datasets, and NDC large datasets. The experimental results show that our model is able to achieve higher ACC and F 1 scores across most datasets, with improvements ranging from 0.28% to 4.5% compared to other state-of-the-art algorithms.

1. Introduction

In the field of machine learning, researchers have been dedicated to enhancing the efficiency and accuracy of models. Sakheta et al. [1] improved the prediction of the biomass gasification model through six machine learning algorithms. The research demonstrated that the XGBoost algorithm has significant advantages in improving the accuracy of gasification product prediction. Maydanchi et al. [2] systematically compared various machine learning methods and found that tree-based ensemble methods, such as XGBoost, gradient boosting, and random forest, excelled in diabetes prediction. Kim et al. [3] successfully classified three similar enterococci by combining MALDI-TOF mass spectrometry techniques and multiple supervised learning algorithms (e.g., KNN, SVM, random forest). Although these methods have made significant progress in different domains, there remains room for improvement in enhancing computational efficiency and response times. The extreme learning machine (ELM) [4,5] offers a promising solution to these challenges with its efficient training process and superior generalization capabilities. It was first proposed by Huang et al. [6] and quickly gained widespread application in multiple fields, including image classification [7], fault detection [8,9], disease diagnosis [10], computer vision [11], face recognition [12], and signal processing [13]. These application cases fully validate the practicability and effectiveness of ELM as an efficient neural network training method.
In binary classification tasks, traditional ELM only learns a single hyperplane to distinguish between classes. Recently, two nonparallel hyperplanes classification algorithms have attracted significant attention and research interest [14,15]. These algorithms involve the training of multiple hyperplanes, where each hyperplane is designed to minimize its distance to one of the two classes while maximizing its distance from the other class. For example, the twin support vector machine (TSVM) is notable for its efficiency in learning two nonparallel separating hyperplanes more quickly than the traditional support vector machine (SVM) by solving two reduced-sized quadratic programming problems (QPPs). The various variants of TSVM [16,17,18,19] have been extensively studied and have been successfully applied in classification tasks.
Inspired by TSVM, Wan et al. [20] proposed the twin extreme learning machine (TELM). It is noteworthy that the TELM and TSVM use the hinge loss function, which is unbounded and tends to exaggerate the impact of noise and outliers on the model. Consequently, the research community has expressed increasing interest in exploring alternative loss functions. Wang et al. [21] proposed a new robust capped L 1 -norm twin support vector machine (CTWSVM), which maintains the benefits of TWSVM and enhances the robustness of the model. Wang and Yu et al. [22] proposed a new robust loss function, the capped Linex loss function, which was applied to the TSVM to enhance the classification capabilities of the model. Kumari A et al. [23] introduced the capped pinball loss function into the universum twin support vector machine (UTWSVM), and proposed a universum twin support vector machine (Tpin-UTWSVM) based on capped pinball loss function, which improved the model’s generalization performance. Ma et al. [24] proposed a robust adaptive capped L θ ε loss, altering the loss function value by adjusting the adaptive parameter θ during the training process. Applying this loss function to TSVM, an adaptive robust learning framework was proposed, namely the adaptive robust twin support vector machine (ARTSVM). All the above models use bounded capped loss functions, which constrain the impact of noise within certain limits and make the classifiers less sensitive to noise.
In order to further reduce the impact of noise, many scholars have begun to look for new metrics to substitute for the squared L 2 -norm metric used in the TELM. Ma et al. [25] proposed a fast robust twin extreme learning machine (FRTELM) based on capped L 1 -norm metric and loss function in the classic TELM learning framework, which enhances the robustness of the TELM in handling classification problems. Yang et al. [26] added the idea of projection on the basis of the twin extreme learning machine, and combining this with the capped L 1 -norm metric and loss function, they proposed a new capped L 1 -norm projection twin extreme learning machine ( C L 1 -PTELM). It lessens the influence of outliers and demonstrates more robustness than the TELM. Ma and Yang et al. [27] proposed a new robust TELM framework (RTELM) using the capped L 1 -norm metrics and capped L θ ε loss function. RTELM addresses the limitations of L 2 -norm metric and hinge loss, particularly in scenarios with outliers. It retains the strengths of the TELM and further enhances the robustness of classification. These algorithms show that the capped L 1 -norm metric is resistant to outliers. In fact, the capped L 1 -norm metric is considered an effective approximation of the L 0 -norm by a non-negative parameter, and it is superior in robustness to the L 1 -norm metric [27]. In addition, related scholars have begun to focus on the capped L 2 , p -norm metric and have applied it to their models. This metric is bounded and can be flexibly tuned by adjusting the p-value to adapt to diverse datasets and reduce the effect of noise. Yuan et al. [28] created a novel framework to improve robustness by substituting the squared L 2 -norm metric with the robust capped L 2 , p -norm metric in a least squares twin support vector machine (LSTSVM), which is called capped L 2 , p -norm LSTSVM ( C L 2 , p -LSTSVM). Wang et al. [29] proposed a capped L 2 , p -norm metric based on the robust twin support vector machine with Welsch loss function (WCTBSVM). The generalization performance and robustness of the TSVM are further improved. Jiang et al. [30] proposed a novel robust twin extreme learning machine learning framework (CWTELM) by combining the capped L 2 , p -norm metric and Welsch loss function with the TELM. CWTELM improves robustness while preserving the advantages of TELM, thereby enhancing classification performance.
Besides altering metrics and loss functions, regularization techniques play a vital role in improving the generalization capabilities of models. The Fisher regularization term is a notable technique that minimizes within-class variance and excels in improving class separability and robustness. Ma and Wen et al. [31] proposed a Fisher regularization ELM (Fisher-ELM) to reach a minimal within-class scatter. Fisher-ELM utilizes the statistical properties of the data, which exhibits excellent generalization ability. Although Fisher-ELM incorporates statistical knowledge into its framework, it tends to ignore the potential effects of noise or outliers. To reduce the negative effects of these factors, Xue and Zhao et al. [32] first proposed a novel asymmetric Welsch loss function and integrated it into Fisher-ELM, then proposed a robust Fisher regularization extreme learning machine with asymmetric Welsch-induced loss function (AWFisher-ELM). This model better copes with the adverse effects of noise and outliers, enhancing the robustness of the model. Xue et al. [33] added Fisher regularization to the TELM and proposed Fisher regularization TELM (FTELM), which both keeps the strengths of the TELM and minimizes the intra-class differences of samples. In order to further improve the noise immunity of the FTELM method, a new capped L 1 -norm Fisher regularization TELM (C L 1 -FTELM) is proposed by combining the capped L 1 -norm metric and loss function to enhance the robustness of the model.
In this paper, we first propose a bounded, smooth, and symmetric squared fractional loss (SF-loss). Based on the proposed SF-loss, we integrate the TELM, the capped $L_{2,p}$-norm metric, and Fisher regularization, and propose a robust supervised TELM learning framework (SF-RSTELM). SF-RSTELM can effectively utilize the statistical properties of the data, which the TELM lacks. In addition, it can effectively reduce the impact of noise and outliers by employing the bounded capped $L_{2,p}$-norm metric and SF-loss function. In contrast, the TELM uses the unbounded squared $L_2$-norm metric and hinge loss, which are susceptible to the influence of noise and outliers.
The main work of this paper is summarized as follows:
(1)
A new robust loss function called squared fractional loss (SF-loss) is presented. It has some important properties such as being bounded, smooth, symmetric, and noise-insensitive. Moreover, the robustness of the SF-loss is analyzed according to the perspective of M estimation theory [34], and its Fisher consistency is proved according to the Bayesian rule [35].
(2)
An innovative method named “The Robust Supervised Learning Framework: Harmonious Integration of Twin Extreme Learning Machine, Squared Fractional Loss, Capped L 2 , p -norm Metric, and Fisher Regularization” is proposed. This framework cleverly combines the efficiency of the TELM, the robustness of the SF-loss function, the flexibility of the capped L 2 , p -norm metric, and the advantages of Fisher regularization. This integrated approach not only takes into account the statistical information of the data but also significantly reduces the impact of noise, thereby enhancing the model’s performance.
(3)
Due to the non-convex nature of the established optimization model, an efficient algorithm based on CCCP [36] is proposed to solve the optimization problem. Moreover, the convergence of the proposed algorithm is proved.
(4)
We performed extensive experiments on artificial datasets, UCI datasets, image datasets, and NDC-large datasets to validate the effectiveness of our proposed algorithm compared to other state-of-the-art algorithms.
The rest of this paper is structured as follows. In Section 2, we briefly review related work on Fisher regularization, the Fisher regularized twin extreme learning machine, the capped L 2 , p -norm metric, and the concave-convex procedure. In Section 3, we provide a comprehensive description of the proposed model and a detailed solution process. The experimental results on multiple datasets are presented in Section 4. Conclusions and suggestions for future work are given in Section 5.

2. Related Work

In this section, we briefly review related work on Fisher regularization, the Fisher regularized twin extreme learning machine, the Capped L 2 , p -norm metric, and the concave-convex procedure.

2.1. Fisher Regularization

Fisher regularization [32] can measure the intra-class divergence within the data, facilitating the development of more effective learning models. Given the training set $T=\{(x_1,y_1),\ldots,(x_m,y_m)\}$, the Fisher regularization has the form:

$$\|f\|_F^2=\sum_{i\in I^+}\left(f(x_i)-\bar f_+\right)^2+\sum_{i\in I^-}\left(f(x_i)-\bar f_-\right)^2\tag{1}$$

where $f$ is the prediction function and $f(x_i)$ is the value of $f$ on sample $x_i$; the values of $f$ on all samples form the vector $f=\left(f(x_1),f(x_2),\ldots,f(x_m)\right)^T$; $\bar f_+$ and $\bar f_-$ are the means of $f$ over the positive and negative samples, respectively; and $I^+$ and $I^-$ are the index sets of positive and negative samples.
We can expand Equation (1):
$$
\begin{aligned}
\|f\|_F^2 &= \sum_{i\in I^+}\left(f(x_i)-\bar f_+\right)^2+\sum_{i\in I^-}\left(f(x_i)-\bar f_-\right)^2\\
&= \sum_{i\in I^+}\left(f^2(x_i)-2f(x_i)\bar f_+ +\bar f_+^2\right)+\sum_{i\in I^-}\left(f^2(x_i)-2f(x_i)\bar f_- +\bar f_-^2\right)\\
&= \sum_{i\in I^+}f^2(x_i)-2m_1\bar f_+^2+m_1\bar f_+^2+\sum_{i\in I^-}f^2(x_i)-2m_2\bar f_-^2+m_2\bar f_-^2\\
&= \sum_{i\in I^+}f^2(x_i)-m_1\bar f_+^2+\sum_{i\in I^-}f^2(x_i)-m_2\bar f_-^2\\
&= f_+^Tf_+-\frac{1}{m_1}f_+^Tee^Tf_+ + f_-^Tf_--\frac{1}{m_2}f_-^Tee^Tf_-\\
&= f_+^T\left(I_+-M_+\right)f_+ + f_-^T\left(I_--M_-\right)f_-\\
&= (f_+,f_-)^T\begin{pmatrix}I_+-M_+ & 0_1\\ 0_2 & I_--M_-\end{pmatrix}(f_+,f_-)\\
&= f^T(I-G)f = f^TNf
\end{aligned}\tag{2}
$$

where $f_+=\left(f(x_1),f(x_2),\ldots,f(x_i),\ldots,f(x_{m_1})\right)^T$, $i\in I^+$, and $I_+\in\mathbb{R}^{m_1\times m_1}$ is the identity matrix; $f_-=\left(f(x_1),f(x_2),\ldots,f(x_i),\ldots,f(x_{m_2})\right)^T$, $i\in I^-$, and $I_-\in\mathbb{R}^{m_2\times m_2}$ is the identity matrix; $0_1\in\mathbb{R}^{m_1\times m_2}$ and $0_2\in\mathbb{R}^{m_2\times m_1}$ are zero matrices; $M_+\in\mathbb{R}^{m_1\times m_1}$ with all elements equal to $\frac{1}{m_1}$; $M_-\in\mathbb{R}^{m_2\times m_2}$ with all elements equal to $\frac{1}{m_2}$; and $N=I-G$, where $I\in\mathbb{R}^{m\times m}$ is the identity matrix and $G=\begin{pmatrix}M_+ & 0_1\\ 0_2 & M_-\end{pmatrix}$.
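For readers who prefer code, the following minimal NumPy sketch (variable names are ours; the samples are assumed to be ordered with the positive class first) assembles the matrix $N=I-G$ from Equation (2) and evaluates the regularizer $f^TNf$:

```python
import numpy as np

def fisher_regularization_matrix(y):
    """Build N = I - G from Equation (2) for labels y in {+1, -1}.

    G is block-diagonal with M+ (all entries 1/m1) on the positive block and
    M- (all entries 1/m2) on the negative block, assuming all positive samples
    come first in the ordering.
    """
    y = np.asarray(y)
    m1 = int(np.sum(y == 1))    # number of positive samples
    m2 = int(np.sum(y == -1))   # number of negative samples
    m = m1 + m2

    G = np.zeros((m, m))
    G[:m1, :m1] = 1.0 / m1      # M+ block
    G[m1:, m1:] = 1.0 / m2      # M- block
    return np.eye(m) - G

if __name__ == "__main__":
    y = np.array([1, 1, 1, -1, -1])
    f = np.array([0.9, 1.1, 1.0, -0.8, -1.2])
    N = fisher_regularization_matrix(y)
    # f^T N f equals the sum of squared deviations of f from its class means.
    print(float(f @ N @ f))     # 0.1
```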

2.2. Fisher Regularized Twin Extreme Learning Machine

Within a supervised classification framework, the training dataset is typically represented as $T=\{(x_i,y_i)\}_{i=1}^{l}$, where $x_i\in\mathbb{R}^d$ and $y_i\in\{-1,1\}$. The set $T$ includes $l_1$ positive class samples and $l_2$ negative class samples, where $l=l_1+l_2$.
The TELM is a traditional and highly efficient classifier. Nevertheless, it overlooks the statistical properties contained within the data. Drawing inspiration from Fisher’s concepts, Xue et al. [33] proposed Fisher-TELM (FTELM) by introducing Fisher regularization terms into the TELM learning framework. Specifically, the primal FTELM is given as:
$$
\mathrm{Primal\ FTELM1}:\quad
\begin{aligned}
&\min_{\beta_1,\xi_1}\ \frac{1}{2}\|H_1\beta_1\|_2^2+C_1e_2^T\xi_1+\frac{C_2}{2}f_1(x)^TN_1f_1(x)\\
&\ \text{s.t.}\ -(H_2\beta_1)+\xi_1\ge e_2,\quad \xi_1\ge 0
\end{aligned}\tag{3}
$$

$$
\mathrm{Primal\ FTELM2}:\quad
\begin{aligned}
&\min_{\beta_2,\xi_2}\ \frac{1}{2}\|H_2\beta_2\|_2^2+C_3e_1^T\xi_2+\frac{C_4}{2}f_2(x)^TN_2f_2(x)\\
&\ \text{s.t.}\ (H_1\beta_2)+\xi_2\ge e_1,\quad \xi_2\ge 0
\end{aligned}\tag{4}
$$

where $N_1=I_+-M_+$ and $N_2=I_--M_-$; $\beta_1$ and $\beta_2$ represent the output weights connecting the hidden layer to the output layer; and $h(x)$ denotes the hidden layer output vector $h(x)=\left(h_1(x),h_2(x),\ldots,h_L(x)\right)\in\mathbb{R}^{1\times L}$. Here, $L$ represents the number of hidden nodes, with each node function $h_i(x)=G\!\left(\sum_{j=1}^{d}x_j\omega_{ji}+b_i\right)$. $G$ denotes an activation function; frequently used examples are the sigmoid function $G(x)=\frac{1}{1+e^{-x}}$ and the ReLU function $G(x)=\max(0,x)$. In this context, $w_i=\left(\omega_{i,1},\omega_{i,2},\ldots,\omega_{i,d}\right)$ stands for the input weight vector, $\omega_{j,i}$ denotes the input weight connecting the $j$-th feature to the $i$-th hidden node, and $b_i$ is the bias term of the $i$-th hidden node ($i=1,2,\ldots,L$); $d$ represents the sample dimension. $H_1=\left(h(x_1)^T,h(x_2)^T,\ldots,h(x_{l_1})^T\right)^T\in\mathbb{R}^{l_1\times L}$ and $H_2=\left(h(x_1)^T,h(x_2)^T,\ldots,h(x_{l_2})^T\right)^T\in\mathbb{R}^{l_2\times L}$ denote the hidden layer outputs of the positive class and negative class samples, respectively. $C_1,C_2,C_3,C_4>0$ are regularization parameters, while $e_1\in\mathbb{R}^{l_1}$ and $e_2\in\mathbb{R}^{l_2}$ are vectors of ones.
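The random feature map used by the ELM family can be written compactly; the sketch below (our own notation, with a sigmoid activation and input weights drawn uniformly as an illustrative choice) builds the matrices $H_1$ and $H_2$ for the two classes:

```python
import numpy as np

def elm_hidden_output(X, W, b):
    """ELM feature map: row i is h(x_i) = G(x_i W + b) with a sigmoid activation G.

    X : (n, d) samples, W : (d, L) random input weights, b : (L,) random biases.
    W and b are drawn once at random and never trained, as in the ELM/TELM setting.
    """
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, L = 4, 50                          # input dimension, hidden nodes
    W = rng.uniform(-1.0, 1.0, (d, L))    # random input weights
    b = rng.uniform(-1.0, 1.0, L)         # random biases

    X_pos = rng.normal(size=(30, d))      # positive-class samples
    X_neg = rng.normal(size=(40, d))      # negative-class samples
    H1 = elm_hidden_output(X_pos, W, b)   # shape (l1, L)
    H2 = elm_hidden_output(X_neg, W, b)   # shape (l2, L)
    print(H1.shape, H2.shape)
```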
According to the representer theorem, $f_1(x)=H_1\beta_1$ and $f_2(x)=H_2\beta_2$. Therefore, Equations (3) and (4) can be rephrased as follows:

$$
\begin{aligned}
&\min_{\beta_1,\xi_1}\ \frac{1}{2}\|H_1\beta_1\|_2^2+C_1e_2^T\xi_1+\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1\\
&\ \text{s.t.}\ -(H_2\beta_1)+\xi_1\ge e_2,\quad \xi_1\ge 0
\end{aligned}\tag{5}
$$

$$
\begin{aligned}
&\min_{\beta_2,\xi_2}\ \frac{1}{2}\|H_2\beta_2\|_2^2+C_3e_1^T\xi_2+\frac{C_4}{2}\beta_2^TH_2^TN_2H_2\beta_2\\
&\ \text{s.t.}\ (H_1\beta_2)+\xi_2\ge e_1,\quad \xi_2\ge 0
\end{aligned}\tag{6}
$$
The positive training point can be made to approach the hyperplane f 1 as closely as possible by optimizing the first term in the objective function (5). Minimizing the second term ensures that the negative class samples are as far as possible from the positive class hyperplane f 1 . The last term is a Fisher regularization term that minimizes the within-class scatter. For problem (6), we can also use a similar meaning to explain it.
By incorporating Lagrange multipliers θ 1 and λ 1 , the dual problem of (5) can be formulated as follows:
$$\mathrm{Dual\ FTELM1}:\quad \min_{\theta_1}\ \frac{1}{2}\theta_1^TQ_1\theta_1-e_2^T\theta_1\quad \text{s.t.}\ 0\le\theta_1\le C_1e_2\tag{7}$$

Here, $Q_1=H_2\left(H_1^T\left(I_1+C_2N_1\right)H_1\right)^{-1}H_2^T$. Similarly, we can obtain the dual of (6) as:

$$\mathrm{Dual\ FTELM2}:\quad \min_{\lambda_1}\ \frac{1}{2}\lambda_1^TQ_2\lambda_1-e_1^T\lambda_1\quad \text{s.t.}\ 0\le\lambda_1\le C_3e_1\tag{8}$$

where $Q_2=H_1\left(H_2^T\left(I_2+C_4N_2\right)H_2\right)^{-1}H_1^T$.
By solving problems (7) and (8), we obtain $\theta_1$ and $\lambda_1$. Then, we get:

$$\beta_1=\left(H_1^T\left(I_1+C_2N_1\right)H_1\right)^{-1}H_2^T\theta_1\tag{9}$$

$$\beta_2=\left(H_2^T\left(I_2+C_4N_2\right)H_2\right)^{-1}H_1^T\lambda_1\tag{10}$$
With β 1 and β 2 determined, we classify a new sample point x using the following decision function:
$$f(x)=\arg\min_{k=1,2}d_k(x)=\arg\min_{k=1,2}\left|h(x)\beta_k\right|\tag{11}$$

2.3. Capped L 2 , p -Norm Metric

The squared L 2 -norm is frequently employed in TELM-related variant classifiers because it is differentiable and easier to optimize. Nevertheless, the squared term heightens the impact of outliers, thereby diminishing the model’s classification performance. Yuan et al. [28] proposed the L 2 , p -norm and capped L 2 , p -norm to enhance the model’s robustness to outliers by making p fall inside the range of (0, 2].
For any vector $s\in\mathbb{R}^n$, parameter $0<p\le 2$, and thresholding parameter $\varepsilon>0$, the $L_{2,p}$-norm and capped $L_{2,p}$-norm are given by the following Formulas (12) and (13), respectively:

$$f_1(s)=\left(\sum_{i=1}^{n}s_i^2\right)^{\frac{p}{2}}\tag{12}$$

$$f_2(s)=\min\left\{\left(\sum_{i=1}^{n}s_i^2\right)^{\frac{p}{2}},\ \varepsilon\right\}\tag{13}$$
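As a small illustration (helper names are ours), the capped $L_{2,p}$-norm of Formula (13) can be evaluated as follows; note how a large residual is clipped at the threshold $\varepsilon$:

```python
import numpy as np

def capped_l2p_norm(s, p=1.0, eps=1.0):
    """Capped L2,p value of a vector s: min((sum_i s_i^2)^(p/2), eps)."""
    return min(np.sum(np.asarray(s, dtype=float) ** 2) ** (p / 2.0), eps)

if __name__ == "__main__":
    s_normal = np.array([0.3, 0.4])      # small residual, below the cap
    s_outlier = np.array([30.0, 40.0])   # large residual, clipped by the cap
    print(capped_l2p_norm(s_normal))     # 0.5 (= ||s||_2 when p = 1)
    print(capped_l2p_norm(s_outlier))    # 1.0 (capped at eps)
```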
We also provide a comparison of the L 1 -norm, the L 2 -norm, and the capped L 2 , p -norm (p = 0.5, 1, 1.5, 2) of a scalar in Figure 1. Firstly, from Figure 1, we can see that the L 1 -norm and L 2 -norm are unbounded, and the capped L 2 , p -norm is bounded. Secondly, we can observe that the capped L 2 , p -norm can be the capped L 1 -norm when p = 1 , and the capped L 2 , p -norm can be the capped L 2 -norm when p = 2 . This indicates that the capped L 2 , p -norm can behave as a capped form of the traditional norm at certain parameter values.
To more intuitively help us understand the characteristics of the capped L 2 , p -norm metric, we also provide a comparison of the L 2 , p -norm and capped L 2 , p -norm for a two-dimensional vector (the high-dimensional situation is similar) as shown in Figure 2. Figure 2a,b are the L 2 , p -norm metric when p takes 1 and 2, respectively, while Figure 2c,d are corresponding capped versions. Figure 2a is an unbounded L 2 -norm metric, whose surface is a smooth curve surface. Figure 2b is essentially an unbounded squared L 2 -norm metric, which is also a smooth surface, similar to Figure 2a. However, due to the influence of the square, the surface rises rapidly away from the center. Applying it to a model means that it is very sensitive to data points that are far from the center, which can be noise or outliers. Figure 2a also has a relatively sharp turning point, but the overall rise is slow. This norm metric is less sensitive to outliers compared to the squared L 2 -norm metric in Figure 2b. Figure 2c,d are the bounded capped L 2 -norm metric and capped squared L 2 -norm metric, respectively, characterized by flat regions on their surface when the capped threshold is exceeded. If this metric is applied to the model, it can control the impact of the outliers, and the robustness of the model is enhanced.
In conclusion, the capped L 2 , p -norm metric is a better choice when handling datasets containing outliers or noise. Furthermore, the metric can better strengthen the model’s resilience and improve the classification ability of the model.

2.4. Concave-Convex Procedure

The concave-convex procedure (CCCP) [36] is employed to address optimization problems involving the difference of convex functions. Let $x\in\mathbb{R}^n$ be the variable; the optimization problem associated with the CCCP is expressed in the following form:

$$
\begin{aligned}
\min_{x}\ & f(x)\\
\text{s.t.}\ & c_i(x)\le 0,\ i=1,\ldots,p_1\\
& d_j(x)=0,\ j=1,\ldots,p_2
\end{aligned}\tag{14}
$$

where $f(x)=g(x)-h(x)$; $g(\cdot)$ and $h(\cdot)$ are real-valued convex functions; and $p_1$ and $p_2$ denote the numbers of inequality and equality constraints. Suppose that $h(\cdot)$ is differentiable. The solution to (14) is derived through an iterative process of solving the following sequence of convex optimization problems:

$$
\begin{aligned}
x^{k+1}\in\arg\min_{x}\ & g(x)-x^T\nabla h(x^k)\\
\text{s.t.}\ & c_i(x)\le 0,\ i=1,\ldots,p_1\\
& d_j(x)=0,\ j=1,\ldots,p_2
\end{aligned}\tag{15}
$$
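The following toy sketch (our own example function, not the paper's solver) illustrates the CCCP iteration (15) on a one-dimensional unconstrained DC objective whose convex surrogate admits a closed-form minimizer:

```python
import numpy as np

def cccp_1d(grad_h, solve_convex, x0, n_iter=20):
    """Generic unconstrained CCCP loop for f(x) = g(x) - h(x).

    At every step the concave part -h is linearized at x_k and the convex
    surrogate g(x) - grad_h(x_k) * x is minimized exactly by `solve_convex`.
    """
    x = x0
    for _ in range(n_iter):
        slope = grad_h(x)          # gradient of h at the current iterate
        x = solve_convex(slope)    # argmin_x g(x) - slope * x
    return x

if __name__ == "__main__":
    # Toy DC objective: f(x) = x^4 - 2x^2, with g(x) = x^4 and h(x) = 2x^2 (both convex).
    grad_h = lambda x: 4.0 * x
    # argmin_x x^4 - slope * x has the closed form sign(slope) * (|slope| / 4)^(1/3).
    solve_convex = lambda s: np.sign(s) * (abs(s) / 4.0) ** (1.0 / 3.0)
    print(cccp_1d(grad_h, solve_convex, x0=0.5))  # converges to a stationary point near x = 1
```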

3. Squared Fractional Loss Based Robust Supervised Twin Extreme Learning Machine

In this section, we first put forward a new loss function (SF-loss) and then combine it with the TELM, the capped $L_{2,p}$-norm metric, and Fisher regularization to propose a new robust supervised learning framework (SF-RSTELM). We also provide a detailed solution process and a convergence analysis for this model.

3.1. Squared Fractional Loss

Convex loss functions (the $L_1$ loss, the $L_2$ loss, and the hinge loss) are commonly utilized in machine learning due to their ability to achieve global optimality. However, their unbounded nature makes them vulnerable to noise and outliers. According to M-estimation theory [34], loss functions with bounded values or bounded influence functions demonstrate greater robustness to noise and outliers. Thus, we propose a new bounded loss function, called the squared fractional loss (abbreviated as SF-loss), in the following:
Definition 1. 
Given a vector u, the SF-loss is defined as
$$L_r(u)=\frac{u^2}{ru^2+1}\tag{16}$$

where the parameter $r\in(0,+\infty)$.

Figure 3 shows the $L_r(u)$ loss function for different values of the parameter $r$. From Figure 3, we can see that the parameter $r$ controls the upper bound of the loss function: the larger the value of $r$, the smaller the upper bound. In addition, we provide some interesting properties, a robustness analysis, and the Fisher consistency of our SF-loss function.
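A minimal sketch of the SF-loss and its derivative (names are ours) makes the boundedness by $1/r$ and the vanishing gradient for large residuals easy to check:

```python
import numpy as np

def sf_loss(u, r=0.5):
    """Squared fractional loss L_r(u) = u^2 / (r u^2 + 1), Eq. (16)."""
    u = np.asarray(u, dtype=float)
    return u ** 2 / (r * u ** 2 + 1.0)

def sf_loss_grad(u, r=0.5):
    """Derivative L_r'(u) = 2u / (r u^2 + 1)^2, which vanishes as |u| grows."""
    u = np.asarray(u, dtype=float)
    return 2.0 * u / (r * u ** 2 + 1.0) ** 2

if __name__ == "__main__":
    for r in (0.5, 1.0, 2.0):
        print(r, float(sf_loss(1e6, r)))   # approaches the upper bound 1/r
    print(float(sf_loss_grad(1e6)))        # ~0: large residuals barely move the model
```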

3.1.1. The Properties of the SF-Loss Function

Property 1. 
$L_r(u)=0$ if $u=0$. This guarantees that the function $L_r(u)$ passes through the origin.
Property 2. 
L r ( u ) is bounded, which can ensure better robustness.
Proof. 
$$\lim_{u\to\infty}L_r(u)=\lim_{u\to\infty}\frac{u^2}{ru^2+1}=\lim_{u\to\infty}\frac{1}{r+\frac{1}{u^2}}=\frac{1}{r}\tag{17}$$
Therefore, L r ( u ) is a bounded function. □
Property 3. 
L r ( u ) is a differentiable function that can help us optimize better.
Proof. 
$$L_r'(u)=\frac{2u}{\left(ru^2+1\right)^2}\tag{18}$$
Therefore, L r ( u ) is differentiable. □
Property 4. 
L r ( u ) is a symmetrical function.
Proof. 
Calculate L r ( u ) and L r ( u ) separately and obtain:
$$L_r(u)=\frac{u^2}{ru^2+1}\tag{19}$$

$$L_r(-u)=\frac{(-u)^2}{r(-u)^2+1}=\frac{u^2}{ru^2+1}\tag{20}$$

Since $L_r(-u)=L_r(u)$, $L_r(u)$ is symmetric. □
Property 5. 
L r ( u ) is a non-convex function.
Proof. 
Let $r=1$, $\lambda=\frac{1}{2}$, $u_1=1$, and $u_2=2$. Then, we obtain:

$$L_r\!\left(\lambda u_1+(1-\lambda)u_2\right)=L_1\!\left(\tfrac{1}{2}\cdot 1+\tfrac{1}{2}\cdot 2\right)=L_1\!\left(\tfrac{3}{2}\right)=\frac{\left(\tfrac{3}{2}\right)^2}{\left(\tfrac{3}{2}\right)^2+1}=\frac{2.25}{3.25}\approx 0.692\tag{21}$$

$$\lambda L_r(u_1)+(1-\lambda)L_r(u_2)=\tfrac{1}{2}L_1(1)+\tfrac{1}{2}L_1(2)=\tfrac{1}{2}\cdot\tfrac{1}{2}+\tfrac{1}{2}\cdot\tfrac{4}{5}=\frac{13}{20}=0.65\tag{22}$$

Since $0.692>0.65$, i.e., $L_r\!\left(\lambda u_1+(1-\lambda)u_2\right)>\lambda L_r(u_1)+(1-\lambda)L_r(u_2)$, $L_r(u)$ is a non-convex function. □

3.1.2. Robustness Analysis of SF-Loss Function

Clearly, the new loss function L r ( u ) is bounded. From a robust statistics perspective, the L r ( u ) shows noise insensitivity, which ensures superior robustness. The derivative of L r ( u ) is expressed as:
$$L_r'(u)=\frac{2u}{\left(ru^2+1\right)^2}\tag{23}$$

and we have:

$$\lim_{u\to\infty}\frac{2u}{\left(ru^2+1\right)^2}=0\tag{24}$$
Hence, according to M-estimation theory [34], the loss function is robust against noise.

3.1.3. Fisher Consistency of SF-Loss Function

An important attribute for a binary classifier f: X Y is whether the classifier satisfies Fisher consistency. Specifically, a classifier f is deemed Fisher consistent if the minimizer of the associated expected risk, as dictated by a loss function L, exhibits the same sign as the Bayes classifier [35]. Thus, the loss function L is termed Fisher consistent if it adheres to this property.
In binary classification problems, the training set is represented by $\{(x_i,y_i)\}_{i=1}^{l}$, under the assumption that the samples are independent and that $\rho$ is a probability measure on $X\times Y$. Then, the expected risk of a classifier $f:X\to Y$ is defined as:

$$R_{L,\rho}(f)=\int_{X\times Y}L\!\left(1-yf(x)\right)d\rho\tag{25}$$

Here, $L(\cdot)$ denotes the loss function, and $P(\chi)=\mathrm{Prob}(Y=1(-1)\mid X=\chi)$ represents the conditional probability of the positive (or negative) class when $X=\chi$. Under the condition that $\chi$ is given, the conditional distribution of $\rho$ is denoted by $\rho(y\mid\chi)$. To minimize the expected risk, we introduce the optimization variable $q$ and define the function minimizing the expected risk through the following formula:

$$f_{L,\rho}(\chi)=\arg\min_{q}\int_{Y}L(1-yq)\,d\rho(y\mid\chi),\quad \chi\in X\tag{26}$$

where $q=f(x)$ is the quantity to be optimized, representing the predicted value under a specific condition (i.e., given $\chi$). In the binary classification problem, $\rho(y\mid\chi)$ is a binary distribution. Specifically, $\mathrm{Prob}(y=1\mid\chi)$ and $\mathrm{Prob}(y=-1\mid\chi)$ denote the probabilities of the positive and negative classes, respectively. The Bayes classifier is given by:

$$f_c(\chi)=\begin{cases}1, & \text{if }\mathrm{Prob}(y=1\mid\chi)\ge\mathrm{Prob}(y=-1\mid\chi)\\ -1, & \text{if }\mathrm{Prob}(y=1\mid\chi)<\mathrm{Prob}(y=-1\mid\chi)\end{cases}\tag{27}$$
Next, we will examine whether the SF-loss satisfies Fisher consistency.
Property 6. 
Function f L r , ρ ( χ ) , which minimizes SF-loss expected risk among all measurable functions, is equivalent to the Bayes classifier: f L r , ρ ( χ ) = f c χ . This means that the SF-loss satisfies Fisher consistency.
Proof. 
For binary classification, we have:
$$\int_{Y}L_r(1-yq)\,d\rho(y\mid\chi)=L_r(1-q)\,\mathrm{Prob}(y=1\mid\chi)+L_r(1+q)\,\mathrm{Prob}(y=-1\mid\chi)\tag{28}$$

Substituting $1-q$ and $1+q$ into Formula (16), the following two equations are obtained:

$$L_r(1-q)=\frac{(1-q)^2}{r(1-q)^2+1}\tag{29}$$

and

$$L_r(1+q)=\frac{(1+q)^2}{r(1+q)^2+1}\tag{30}$$

By substituting (29) and (30) into (28), we can obtain:

$$\int_{Y}L_r(1-yq)\,d\rho(y\mid\chi)=\frac{(1-q)^2}{r(1-q)^2+1}\,\mathrm{Prob}(y=1\mid\chi)+\frac{(1+q)^2}{r(1+q)^2+1}\,\mathrm{Prob}(y=-1\mid\chi)\tag{31}$$

Therefore, if $\mathrm{Prob}(y=1\mid\chi)\ge\mathrm{Prob}(y=-1\mid\chi)$, the expected risk attains its minimum at $q=1$; if $\mathrm{Prob}(y=1\mid\chi)<\mathrm{Prob}(y=-1\mid\chi)$, the expected risk attains its minimum at $q=-1$. Hence, the minimizer of the expected risk measured by the SF-loss satisfies $f_{L_r,\rho}(\chi)=f_c(\chi)$. This analysis confirms that the minimizer of the associated expected risk, determined by the loss function, has the same sign as the Bayes classifier, thereby proving that this property is satisfied. □
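The property can also be checked numerically. The short sketch below (helper names are ours) minimizes the expected SF-risk of (31) over a grid of $q$ values and confirms that the sign of the minimizer agrees with the Bayes rule:

```python
import numpy as np

def sf_loss(u, r=0.5):
    """Squared fractional loss L_r(u) = u^2 / (r u^2 + 1)."""
    return u ** 2 / (r * u ** 2 + 1.0)

def expected_sf_risk(q, p_pos, r=0.5):
    """Expected SF-loss risk of predicting q when Prob(y = 1 | x) = p_pos, cf. (31)."""
    return sf_loss(1.0 - q, r) * p_pos + sf_loss(1.0 + q, r) * (1.0 - p_pos)

if __name__ == "__main__":
    qs = np.linspace(-3.0, 3.0, 6001)
    for p_pos in (0.2, 0.8):
        q_star = qs[np.argmin(expected_sf_risk(qs, p_pos))]
        # The sign of the risk minimizer agrees with the Bayes rule.
        print(p_pos, int(np.sign(q_star)))
```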
To more clearly compare the performance of our loss function with other advanced loss functions (Hinge loss [20,33], capped L 1 -norm loss [25,33], and Welsch loss [30]), we present them in Figure 4. We can observe that the hinge loss is unbounded and tends to exaggerate the impact of outliers. Other losses are bounded, so we set specific parameters to ensure that the upper bounds of these losses are consistent. In this case, our proposed loss function exhibits a smoother characteristic and its growth trend is relatively gradual compared to the capped L 1 -norm loss. Compared to the Welsch loss function, our proposed loss function increases rapidly near the origin (corresponding to normal data) and grows relatively slowly further away from it (representing noisy or outlier data points). This characteristic indicates that our loss function places greater emphasis on increasing the loss for normal points, encouraging the model to make more precise predictions for regular data. Meanwhile, due to the slower growth of loss for outliers, the model becomes less sensitive to them, thus reducing their impact and enhancing the overall robustness of the model.

3.2. Squared Fractional Loss Based Robust Supervised Twin Extreme Learning Machine

In order to better elaborate on the idea of our model improvement, we rewrite problems (5) and (6) of FTELM as follows:
$$\min_{\beta_1}\ \frac{1}{2}\|H_1\beta_1\|_2^2+C_1\sum_{j=1}^{m_2}L_H\!\left(1+h(x_j)\beta_1\right)+\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1\tag{32}$$

$$\min_{\beta_2}\ \frac{1}{2}\|H_2\beta_2\|_2^2+C_3\sum_{i=1}^{m_1}L_H\!\left(1-h(x_i)\beta_2\right)+\frac{C_4}{2}\beta_2^TH_2^TN_2H_2\beta_2\tag{33}$$

where the hinge loss function is $L_H(u)=\max(0,u)$; for (32), $u\to u_j=1+h(x_j)\beta_1$, and for (33), $u\to u_i=1-h(x_i)\beta_2$. It is worth noting that FTELM uses the squared $L_2$-norm metric and hinge loss, which can exaggerate the influence of noise and outliers. To enhance the robustness of the model, we substitute the squared $L_2$-norm metric and hinge loss with the capped $L_{2,p}$-norm metric and the SF-loss. Therefore, we propose a robust supervised TELM learning framework called SF-loss-based robust supervised TELM (SF-RSTELM). The core problem of our designed model can be described as:
  • SF-RSTELM1:
    $$\min_{\beta_1}\ \frac{1}{2}\sum_{i=1}^{m_1}\min\!\left(\|h(x_i)\beta_1\|_2^p,\ \varepsilon_1\right)+\frac{C_1}{2}\sum_{j=1}^{m_2}L_r\!\left(1+h(x_j)\beta_1\right)+\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1\tag{34}$$
  • SF-RSTELM2:
    $$\min_{\beta_2}\ \frac{1}{2}\sum_{j=1}^{m_2}\min\!\left(\|h(x_j)\beta_2\|_2^p,\ \varepsilon_3\right)+\frac{C_3}{2}\sum_{i=1}^{m_1}L_r\!\left(1-h(x_i)\beta_2\right)+\frac{C_4}{2}\beta_2^TH_2^TN_2H_2\beta_2\tag{35}$$

    where $C_1,C_2,C_3,C_4>0$ are regularization parameters, and $\varepsilon_1$ and $\varepsilon_3$ are the thresholding parameters.
For problem (34), the positive training point approaches the hyperplane f 1 as closely as possible by optimizing the first term. Minimizing the second term ensures the negative class samples are as distant as possible from the positive class hyperplane f 1 . The last term is the Fisher regularization term that minimizes the intra-class divergence from the samples. For problem (35), a similar interpretation can be applied.
To solve the problem effectively, we first deal with the first term of the objective functions (34) and (35) using the concave duality theorem [37].
Theorem 1. 
Consider a continuous nonconvex function $g(\theta):\mathbb{R}^n\to\mathbb{R}$ and suppose $h(\theta):\mathbb{R}^n\to\Omega\subseteq\mathbb{R}^n$ is a map with range $\Omega$. We assume the existence of a concave function $\bar g(u)$ defined on $\Omega$ such that $g(\theta)=\bar g(h(\theta))$ is satisfied. Under this condition, the nonconvex function $g(\theta)$ can be represented as:

$$g(\theta)=\inf_{\upsilon\in\mathbb{R}^n}\left\{\upsilon^Th(\theta)-g^*(\upsilon)\right\}\tag{36}$$

Following concave duality [38], $g^*(\upsilon)$ is the concave dual of $\bar g(u)$, given as:

$$g^*(\upsilon)=\inf_{u\in\Omega}\left\{\upsilon^Tu-\bar g(u)\right\}\tag{37}$$

Furthermore, the minimum on the right-hand side of (36) is attained at $\upsilon^*$:

$$\upsilon^*=\left.\frac{\partial\bar g(u)}{\partial u}\right|_{u=h(\theta)}\tag{38}$$
According to Theorem 1, we define a concave function $\bar g(\theta):\mathbb{R}\to\mathbb{R}$ such that, for $\theta>0$:

$$\bar g(\theta)=\min\!\left(\theta^{\frac{p}{2}},\ \varepsilon\right)\tag{39}$$

Suppose $h(\mu)=\mu^2$. The capped $L_{2,p}$-norm metric can be reformulated as:

$$\min\!\left(\|h(x)\beta\|_2^p,\ \varepsilon\right)=\bar g(h(\mu))\tag{40}$$

where $\mu=\|h(x)\beta\|_2$. Therefore, the capped $L_{2,p}$-norm metric can also be represented as:

$$\sum_{i=1}^{m_1}\bar g\!\left(\|h(x_i)\beta_1\|_2^2\right)\tag{41}$$

$$\sum_{j=1}^{m_2}\bar g\!\left(\|h(x_j)\beta_2\|_2^2\right)\tag{42}$$
Let $\theta_1=h(\mu_1)=\|h(x_i)\beta_1\|_2^2$. According to Equation (36), (40) can also be expressed as:

$$\min\!\left(\|h(x_i)\beta_1\|_2^p,\ \varepsilon_1\right)=\bar g\!\left(\|h(x_i)\beta_1\|_2^2\right)=\inf_{f_{ii}>0}\left\{f_{ii}h(\mu_1)-g^*(f_{ii})\right\}=\inf_{f_{ii}>0}\left\{f_{ii}\theta_1-g^*(f_{ii})\right\}\tag{43}$$

Here, $g^*(f_{ii})$ is the concave dual of $\bar g(\theta_1)$, represented as:

$$g^*(f_{ii})=\inf_{\theta_1}\left\{f_{ii}\theta_1-\bar g(\theta_1)\right\}=\begin{cases}\inf_{\theta_1}\left\{f_{ii}\theta_1-\theta_1^{\frac{p}{2}}\right\}, & \theta_1^{\frac{p}{2}}<\varepsilon_1\\[2pt]\inf_{\theta_1}\left\{f_{ii}\theta_1-\varepsilon_1\right\}, & \theta_1^{\frac{p}{2}}\ge\varepsilon_1\end{cases}\tag{44}$$

By optimizing over $\theta_1$ in (44), we can obtain:

$$g^*(f_{ii})=\begin{cases}f_{ii}\left(\frac{2}{p}f_{ii}\right)^{\frac{2}{p-2}}-\left(\frac{2}{p}f_{ii}\right)^{\frac{p}{p-2}}, & \theta_1^{\frac{p}{2}}<\varepsilon_1\\[2pt]f_{ii}\,\varepsilon_1^{\frac{2}{p}}-\varepsilon_1, & \theta_1^{\frac{p}{2}}\ge\varepsilon_1\end{cases}\tag{45}$$
Therefore, the capped $L_{2,p}$-norm metric can be further converted to:

$$\min_{\beta_1}\sum_{i=1}^{m_1}\min\!\left(\|h(x_i)\beta_1\|_2^p,\ \varepsilon_1\right)\Leftrightarrow\min_{\beta_1}\sum_{i=1}^{m_1}\bar g\!\left(\|h(x_i)\beta_1\|_2^2\right)\Leftrightarrow\min_{\beta_1}\sum_{i=1}^{m_1}\inf_{f_{ii}\ge 0}L_i(\beta_1,f_{ii},\varepsilon_1)\Leftrightarrow\min_{\beta_1,\,f_{ii}\ge 0}\sum_{i=1}^{m_1}L_i(\beta_1,f_{ii},\varepsilon_1)\tag{46}$$

where

$$L_i(\beta_1,f_{ii},\varepsilon_1)=\begin{cases}f_{ii}\theta_1-f_{ii}\left(\frac{2}{p}f_{ii}\right)^{\frac{2}{p-2}}+\left(\frac{2}{p}f_{ii}\right)^{\frac{p}{p-2}}, & \theta_1^{\frac{p}{2}}<\varepsilon_1\\[2pt]f_{ii}\theta_1-f_{ii}\,\varepsilon_1^{\frac{2}{p}}+\varepsilon_1, & \theta_1^{\frac{p}{2}}\ge\varepsilon_1\end{cases}\tag{47}$$

Similarly, let $\theta_2=h(\mu_2)=\|h(x_j)\beta_2\|_2^2$ and let $g^*(t_{jj})$ be the concave dual of $\bar g(\theta_2)$; then the capped $L_{2,p}$-norm metric can be written as:

$$\min_{\beta_2}\sum_{j=1}^{m_2}\min\!\left(\|h(x_j)\beta_2\|_2^p,\ \varepsilon_3\right)\Leftrightarrow\min_{\beta_2}\sum_{j=1}^{m_2}\bar g\!\left(\|h(x_j)\beta_2\|_2^2\right)\Leftrightarrow\min_{\beta_2}\sum_{j=1}^{m_2}\inf_{t_{jj}\ge 0}L_j(\beta_2,t_{jj},\varepsilon_3)\Leftrightarrow\min_{\beta_2,\,t_{jj}\ge 0}\sum_{j=1}^{m_2}L_j(\beta_2,t_{jj},\varepsilon_3)\tag{48}$$

where

$$L_j(\beta_2,t_{jj},\varepsilon_3)=\begin{cases}t_{jj}\theta_2-t_{jj}\left(\frac{2}{p}t_{jj}\right)^{\frac{2}{p-2}}+\left(\frac{2}{p}t_{jj}\right)^{\frac{p}{p-2}}, & \theta_2^{\frac{p}{2}}<\varepsilon_3\\[2pt]t_{jj}\theta_2-t_{jj}\,\varepsilon_3^{\frac{2}{p}}+\varepsilon_3, & \theta_2^{\frac{p}{2}}\ge\varepsilon_3\end{cases}\tag{49}$$

According to (38), the optimal dual variable is obtained from the derivative of $\bar g(\theta)$:

$$\frac{\partial\bar g(\theta)}{\partial\theta}=\begin{cases}\frac{p}{2}\theta^{\frac{p}{2}-1}, & 0<\theta<\varepsilon^{\frac{2}{p}}\\[2pt]0, & \theta>\varepsilon^{\frac{2}{p}}\end{cases}\tag{50}$$
Let $\theta_1=h(\mu_1)=\|h(x_i)\beta_1\|_2^2$; then we can obtain:

$$f_{ii}=\left.\frac{\partial\bar g(\theta_1)}{\partial\theta_1}\right|_{\theta_1=\|h(x_i)\beta_1\|_2^2}=\begin{cases}\frac{p}{2}\|h(x_i)\beta_1\|_2^{p-2}, & 0<\|h(x_i)\beta_1\|_2^p<\varepsilon_1\\[2pt]\sigma_1, & \text{else}\end{cases}\tag{51}$$

Similarly, let $\theta_2=h(\mu_2)=\|h(x_j)\beta_2\|_2^2$; then we can obtain:

$$t_{jj}=\left.\frac{\partial\bar g(\theta_2)}{\partial\theta_2}\right|_{\theta_2=\|h(x_j)\beta_2\|_2^2}=\begin{cases}\frac{p}{2}\|h(x_j)\beta_2\|_2^{p-2}, & 0<\|h(x_j)\beta_2\|_2^p<\varepsilon_3\\[2pt]\sigma_2, & \text{else}\end{cases}\tag{52}$$
When the variables f i i and t j j are fixed to solve the classifier-related parameters β 1 and β 2 , the optimization problems (34) and (35) can be written as:
$$\min_{\beta_1}\ \frac{1}{2}\sum_{i=1}^{m_1}\frac{2}{p}f_{ii}\|h(x_i)\beta_1\|_2^2+\frac{C_1}{2}\sum_{j=1}^{m_2}L_r\!\left(1+h(x_j)\beta_1\right)+\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1\tag{53}$$

$$\min_{\beta_2}\ \frac{1}{2}\sum_{j=1}^{m_2}\frac{2}{p}t_{jj}\|h(x_j)\beta_2\|_2^2+\frac{C_3}{2}\sum_{i=1}^{m_1}L_r\!\left(1-h(x_i)\beta_2\right)+\frac{C_4}{2}\beta_2^TH_2^TN_2H_2\beta_2\tag{54}$$

Let $F=\mathrm{diag}\!\left(\frac{2}{p}f_{11},\frac{2}{p}f_{22},\ldots,\frac{2}{p}f_{m_1m_1}\right)$ and $T=\mathrm{diag}\!\left(\frac{2}{p}t_{11},\frac{2}{p}t_{22},\ldots,\frac{2}{p}t_{m_2m_2}\right)$, so that (53) and (54) are equivalent to (55) and (56), respectively:

$$\min_{\beta_1}\ \frac{1}{2}\left(H_1\beta_1\right)^TFH_1\beta_1+\frac{C_1}{2}\sum_{j=1}^{m_2}L_r\!\left(1+h(x_j)\beta_1\right)+\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1\tag{55}$$

$$\min_{\beta_2}\ \frac{1}{2}\left(H_2\beta_2\right)^TTH_2\beta_2+\frac{C_3}{2}\sum_{i=1}^{m_1}L_r\!\left(1-h(x_i)\beta_2\right)+\frac{C_4}{2}\beta_2^TH_2^TN_2H_2\beta_2\tag{56}$$
The problems (55) and (56) can be rewritten as the following (57) and (58):
$$\min_{\beta_1}\ \frac{1}{2}\beta_1^TH_1^T\left(F+C_2N_1\right)H_1\beta_1+\frac{C_1}{2}\sum_{j=1}^{m_2}L_r\!\left(1+h(x_j)\beta_1\right)\tag{57}$$

$$\min_{\beta_2}\ \frac{1}{2}\beta_2^TH_2^T\left(T+C_4N_2\right)H_2\beta_2+\frac{C_3}{2}\sum_{i=1}^{m_1}L_r\!\left(1-h(x_i)\beta_2\right)\tag{58}$$
It can be seen that (57) and (58) are non-convex optimization problems because of the non-convexity of the loss function. In this research, we utilize the CCCP technique to handle the non-convexity. Note that the loss function can be expressed as the difference of two convex functions, or equivalently, as the sum of a convex function and a concave function. That is, $L_r(u)=L_{r1}(u)+L_{r2}(u)$. The specific expressions of the convex function $L_{r1}(u)$ and the concave function $L_{r2}(u)$ are:

$$L_{r1}(u)=ru^2\tag{59}$$

$$L_{r2}(u)=-ru^2+\frac{u^2}{ru^2+1}\tag{60}$$
Figure 5 represents the graphs for L r 1 u and L r 2 u .
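A quick numerical check (our own script) confirms that the split (59)-(60) reproduces the SF-loss:

```python
import numpy as np

r = 0.5
u = np.linspace(-5.0, 5.0, 101)
L_r  = u ** 2 / (r * u ** 2 + 1.0)                # SF-loss, Eq. (16)
L_r1 = r * u ** 2                                 # convex part, Eq. (59)
L_r2 = -r * u ** 2 + u ** 2 / (r * u ** 2 + 1.0)  # concave part, Eq. (60)
print(np.allclose(L_r, L_r1 + L_r2))              # True: the split recovers the SF-loss
```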
Therefore, the above optimization problem can be expressed as:
$$\min_{\beta_1}\ \frac{1}{2}\beta_1^TH_1^T\left(F+C_2N_1\right)H_1\beta_1+\frac{C_1}{2}\sum_{j=1}^{m_2}\left[L_{r1}\!\left(1+h(x_j)\beta_1\right)+L_{r2}\!\left(1+h(x_j)\beta_1\right)\right]\tag{61}$$

$$\min_{\beta_2}\ \frac{1}{2}\beta_2^TH_2^T\left(T+C_4N_2\right)H_2\beta_2+\frac{C_3}{2}\sum_{i=1}^{m_1}\left[L_{r1}\!\left(1-h(x_i)\beta_2\right)+L_{r2}\!\left(1-h(x_i)\beta_2\right)\right]\tag{62}$$
Since (61) and (62) are similar, we will only show the solution process for (61), which can also be expressed as:
$$\min_{\beta_1}\ \underbrace{\frac{1}{2}\beta_1^TH_1^T\left(F+C_2N_1\right)H_1\beta_1+\frac{C_1}{2}\sum_{j=1}^{m_2}L_{r1}\!\left(1+h(x_j)\beta_1\right)}_{L_{vex}(\beta_1)}+\underbrace{\frac{C_1}{2}\sum_{j=1}^{m_2}L_{r2}\!\left(1+h(x_j)\beta_1\right)}_{L_{cav}(\beta_1)}\tag{63}$$
The solution to the above optimization issue can be obtained by addressing the following equation:
$$\min_{\beta_1}\ L(\beta_1)=L_{vex}(\beta_1)+L_{cav}(\beta_1)\tag{64}$$
The value of $\beta_1$ at iteration $k+1$ is obtained as follows:

$$\beta_1^{k+1}=\arg\min_{\beta_1}\ L_{vex}(\beta_1)+\nabla L_{cav}(\beta_1^k)^T\beta_1\tag{65}$$

Let $\delta_1^k=\nabla L_{cav}(\beta_1^k)$, where $\delta_1^k=\left(\delta_{11}^k,\delta_{12}^k,\ldots,\delta_{1j}^k,\ldots,\delta_{1m_2}^k\right)^T\in\mathbb{R}^{m_2}$ and

$$\delta_{1j}^k=L_{cav}'(u_j)=-2ru_j+\frac{2u_j}{\left(ru_j^2+1\right)^2}\tag{66}$$

where $u_j=1+h(x_j)\beta_1$. Therefore, we can get:

$$\nabla L_{cav}(\beta_1^k)^T\beta_1=\left(\delta_1^k\right)^TH_2\beta_1\tag{67}$$
Solving (65) is equivalent to solving the following subproblem:
$$
\begin{aligned}
\min_{\beta_1}\ & \frac{1}{2}\beta_1^TH_1^T\left(F+C_2N_1\right)H_1\beta_1+\frac{C_1}{2}r\,\xi_1^T\xi_1+\left(\delta_1^k\right)^TH_2\beta_1\\
\text{s.t.}\ & -(H_2\beta_1)+\xi_1=e_2
\end{aligned}\tag{68}
$$

Here, we introduce the Lagrange multiplier $\lambda_1$ for problem (68). Its Lagrange function is given by:

$$L(\beta_1,\xi_1,\lambda_1)=\frac{1}{2}\beta_1^TH_1^T\left(F+C_2N_1\right)H_1\beta_1+\frac{C_1}{2}r\,\xi_1^T\xi_1+\left(\delta_1^k\right)^TH_2\beta_1-\lambda_1^T\left(-(H_2\beta_1)+\xi_1-e_2\right)\tag{69}$$

Following the Karush–Kuhn–Tucker (KKT) conditions, we derive the following constraints:

$$
\begin{aligned}
\frac{\partial L}{\partial\beta_1}&=H_1^T\left(F+C_2N_1\right)H_1\beta_1+H_2^T\delta_1^k+H_2^T\lambda_1=0\\
\frac{\partial L}{\partial\xi_1}&=C_1r\,\xi_1-\lambda_1=0\\
\lambda_1^T\left(-(H_2\beta_1)+\xi_1-e_2\right)&=0
\end{aligned}\tag{70}
$$

From the KKT conditions, we can obtain:

$$\beta_1=-\left(H_1^T\left(F+C_2N_1\right)H_1\right)^{-1}H_2^T\left(\delta_1^k+\lambda_1\right),\qquad C_1r\,\xi_1=\lambda_1\tag{71}$$
Substituting (71) into (68), we can obtain the dual of the original problem:

$$\min_{\lambda_1}\ \frac{1}{2}\left(\delta_1^k+\lambda_1\right)^TH_2\left(H_1^T\left(F+C_2N_1\right)H_1\right)^{-1}H_2^T\left(\delta_1^k+\lambda_1\right)-e_2^T\lambda_1\tag{72}$$

Using a similar approach, we can obtain the dual problem of Equation (58):

$$\min_{\lambda_2}\ \frac{1}{2}\left(\lambda_2-\delta_2^k\right)^TH_1\left(H_2^T\left(T+C_4N_2\right)H_2\right)^{-1}H_1^T\left(\lambda_2-\delta_2^k\right)-e_1^T\lambda_2\tag{73}$$

After solving (72) and (73) to obtain the optimal solutions $\lambda_1$ and $\lambda_2$, we can obtain:

$$\beta_1=-\left(H_1^T\left(F+C_2N_1\right)H_1\right)^{-1}H_2^T\left(\delta_1^k+\lambda_1\right)\tag{74}$$

$$\beta_2=\left(H_2^T\left(T+C_4N_2\right)H_2\right)^{-1}H_1^T\left(\lambda_2-\delta_2^k\right)\tag{75}$$
Therefore, the decision function of SF-RSTELM is given by:
$$f(x)=\arg\min_{k=1,2}d_k(x)=\arg\min_{k=1,2}\left|h(x)\beta_k\right|\tag{76}$$
Based on the previous discussion, we provide a detailed description of the implementation steps of the proposed method in Algorithm 1.
Algorithm 1 The procedure of SF-RSTELM
Input: The training set $T=\{(x_i,y_i)\mid 1\le i\le m\}$, where $x_i\in\mathbb{R}^d$ and $y_i\in\{-1,1\}$;
   parameters $C_1,C_2,C_3,C_4>0$, $\varepsilon_1,\varepsilon_2,\varepsilon_3,\varepsilon_4>0$, $\varepsilon>0$, $\sigma_1$, $\sigma_2$;
   activation function $G(x)$; the number of hidden nodes $L$;
   maximum number of iterations $k_{max}$.
Output: $\beta_1^*$, $\beta_2^*$.
Steps:
   1: Initialize $F^{(0)}\in\mathbb{R}^{m_1\times m_1}$ and $T^{(0)}\in\mathbb{R}^{m_2\times m_2}$; $\delta_1^{(0)}$ and $\delta_2^{(0)}$.
   2: Compute the graph matrices $N_1$, $N_2$.
   3: Set $k=0$.
   4: while true do
            Compute $\lambda_1^{(k)}$ and $\lambda_2^{(k)}$ by solving the dual problems (72) and (73), respectively.
            Then obtain the solutions $\beta_1^{(k)}$, $\beta_2^{(k)}$ by
                     $\beta_1^{(k)}=-\left(H_1^T\left(F^{(k)}+C_2N_1\right)H_1\right)^{-1}H_2^T\left(\delta_1^{(k)}+\lambda_1^{(k)}\right)$
                     $\beta_2^{(k)}=\left(H_2^T\left(T^{(k)}+C_4N_2\right)H_2\right)^{-1}H_1^T\left(\lambda_2^{(k)}-\delta_2^{(k)}\right)$
            Update the matrices $F^{(k+1)}$, $T^{(k+1)}$, $\delta_1^{(k+1)}$, and $\delta_2^{(k+1)}$ by (51), (52), and (66).
            if $k>k_{max}$ or $\|\beta_i^{(k)}-\beta_i^{(k-1)}\|\le\varepsilon$ ($i=1,2$) then
                 break
            else
                 $k=k+1$
      end while
   5: Construct the following decision function:
            $f(x)=\arg\min_{k=1,2}d_k(x)=\arg\min_{k=1,2}\left|h(x)\beta_k\right|$
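To make the overall flow concrete, the following NumPy sketch implements the $\beta_1$ half of Algorithm 1 under simplifying assumptions of ours: the slack $\xi_1=e_2+H_2\beta_1$ is eliminated so that each convex surrogate is solved as a single linear system rather than through the dual (72), the concave-part gradient keeps its $C_1/2$ factor explicitly, and the default parameter values are only illustrative. It is not the authors' MATLAB implementation; the $\beta_2$ half is symmetric.

```python
import numpy as np

def sf_concave_grad(u, r):
    """Derivative of the concave part L_r2(u) = -r u^2 + u^2/(r u^2 + 1) of the SF-loss."""
    return -2.0 * r * u + 2.0 * u / (r * u ** 2 + 1.0) ** 2

def train_beta1(H1, H2, N1, C1=1.0, C2=1.0, r=0.5, p=1.0,
                eps1=1.0, sigma1=1e-6, max_iter=50, tol=1e-5):
    """Sketch of the beta_1 subproblem of SF-RSTELM (the beta_2 subproblem is symmetric)."""
    m1, L = H1.shape
    m2 = H2.shape[0]
    e2 = np.ones(m2)

    F = np.ones(m1)         # F^(0): start from the plain squared L2 weighting
    delta = np.zeros(m2)    # delta_1^(0)
    beta1 = np.zeros(L)

    for _ in range(max_iter):
        # Convex surrogate with F and delta frozen:
        #   1/2 b^T A b + (C1 r / 2) ||e2 + H2 b||^2 + (C1 / 2) delta^T H2 b
        A = H1.T @ (F[:, None] * H1) + C2 * (H1.T @ N1 @ H1)
        lhs = A + C1 * r * (H2.T @ H2)
        rhs = -(C1 * r * (H2.T @ e2) + 0.5 * C1 * (H2.T @ delta))
        beta_new = np.linalg.solve(lhs + 1e-8 * np.eye(L), rhs)

        # Reweighting from the capped L_{2,p} metric, Eq. (51) (smoothed for stability),
        # and the concave-part gradient of the SF-loss, Eq. (66).
        res = np.abs(H1 @ beta_new) + 1e-6
        F = np.where(res ** p < eps1, res ** (p - 2.0), (2.0 / p) * sigma1)
        delta = sf_concave_grad(1.0 + H2 @ beta_new, r)

        if np.linalg.norm(beta_new - beta1) <= tol:
            beta1 = beta_new
            break
        beta1 = beta_new
    return beta1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H1 = rng.normal(size=(30, 20))                   # hidden outputs of positive samples
    H2 = rng.normal(size=(40, 20)) + 1.0             # hidden outputs of negative samples
    N1 = np.eye(30) - np.full((30, 30), 1.0 / 30)    # within-class scatter matrix (class +1)
    print(train_beta1(H1, H2, N1)[:5])
```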

3.3. Convergence Analysis

Theorem 2. 
Utilizing the CCCP technique to address problem (63), the resulting sequence β 1 ( k ) converges.
Proof of Theorem 2. 
At the iteration point of step k + 1 , the following inequality holds:
$$L_{vex}\!\left(\beta_1^{(k)}\right)+\nabla L_{cav}\!\left(\beta_1^{(k)}\right)^T\beta_1^{(k)}\ \ge\ L_{vex}\!\left(\beta_1^{(k+1)}\right)+\nabla L_{cav}\!\left(\beta_1^{(k)}\right)^T\beta_1^{(k+1)}\tag{77}$$

It can be written as

$$L_{vex}\!\left(\beta_1^{(k)}\right)-L_{vex}\!\left(\beta_1^{(k+1)}\right)\ \ge\ \nabla L_{cav}\!\left(\beta_1^{(k)}\right)^T\left(\beta_1^{(k+1)}-\beta_1^{(k)}\right)\tag{78}$$

Due to the concavity of $L_{cav}(\cdot)$, we have

$$\nabla L_{cav}\!\left(\beta_1^{(k)}\right)^T\left(\beta_1^{(k+1)}-\beta_1^{(k)}\right)\ \ge\ L_{cav}\!\left(\beta_1^{(k+1)}\right)-L_{cav}\!\left(\beta_1^{(k)}\right)\tag{79}$$

By combining the above inequalities, we have

$$L_{vex}\!\left(\beta_1^{(k)}\right)+L_{cav}\!\left(\beta_1^{(k)}\right)\ \ge\ L_{vex}\!\left(\beta_1^{(k+1)}\right)+L_{cav}\!\left(\beta_1^{(k+1)}\right)\tag{80}$$
Accordingly, the objective value of problem (63) decreases monotonically with each iteration and remains non-negative, thereby proving the convergence of the sequence. □
Theorem 3. 
Algorithm 1 converges to a local optimum of the problems in (34) and (35).
Proof of Theorem 3. 
Taking problem (34) as an example, the analysis for problem (35) follows a similar approach.
First and foremost, let us recall the formulation of our framework, namely Equation (34).
$$\min_{\beta_1}\ \frac{1}{2}\sum_{i=1}^{m_1}\min\!\left(\|h(x_i)\beta_1\|_2^p,\ \varepsilon_1\right)+\frac{C_1}{2}\sum_{j=1}^{m_2}L_r\!\left(1+h(x_j)\beta_1\right)+\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1\tag{81}$$

For convenience, let $J_1=\frac{C_1}{2}\sum_{j=1}^{m_2}L_r\!\left(1+h(x_j)\beta_1\right)$ and $J_2=\frac{C_2}{2}\beta_1^TH_1^TN_1H_1\beta_1$. When $\|h(x_i)\beta_1\|_2^p<\varepsilon_1$, we represent the Lagrange function of (81) as follows:

$$L_1(\beta_1)=\frac{1}{2}\sum_{i=1}^{m_1}\|h(x_i)\beta_1\|_2^p+J_1+J_2\tag{82}$$

Then, we differentiate $L_1(\beta_1)$ with respect to $\beta_1$:

$$\frac{\partial L_1(\beta_1)}{\partial\beta_1}=\sum_{i=1}^{m_1}\frac{1}{2}\,p\,\|h(x_i)\beta_1\|_2^{p-1}\frac{h(x_i)\beta_1}{\|h(x_i)\beta_1\|_2}\,h(x_i)^T+\frac{\partial J_1}{\partial\beta_1}+\frac{\partial J_2}{\partial\beta_1}=0\tag{83}$$

According to (51), we substitute $f_{ii}$ into Formula (83):

$$\frac{\partial L_1(\beta_1)}{\partial\beta_1}=H_1^TFH_1\beta_1+\frac{\partial J_1}{\partial\beta_1}+\frac{\partial J_2}{\partial\beta_1}=0\tag{84}$$

Similarly, we obtain the Lagrangian function of problem (55):

$$L_2(\beta_1)=\frac{1}{2}\left(H_1\beta_1\right)^TFH_1\beta_1+J_1+J_2\tag{85}$$

Taking the derivative of $L_2(\beta_1)$ with respect to $\beta_1$:

$$\frac{\partial L_2(\beta_1)}{\partial\beta_1}=H_1^TFH_1\beta_1+\frac{\partial J_1}{\partial\beta_1}+\frac{\partial J_2}{\partial\beta_1}=0\tag{86}$$
It is noted that Formula (84) is equal to Formula (86) when determining the optimal solution β 1 . Furthermore, the optimal solution β 1 meets the KKT condition of model (34). By solving problem (55), we can determine the optimal solution for problem (34). Thus, Algorithm 1 is capable of converging to a local optimum, making it feasible to obtain the local minimum of problem (34). □

4. Numerical Experiments

Within this part, we performed a comparison of SF-RSTELM with several algorithms, such as TELM [20], FTELM [33], C L 1 -FTELM [33], FRTELM [25], and CWTELM [30]. To assess the performance of the proposed algorithm, experiments are carried out on four distinct types of databases: artificial datasets, UCI datasets, image datasets, and NDC large datasets. In addition, we demonstrate the convergence of the proposed algorithm through experimental analysis.

4.1. Experimental Setting

4.1.1. Operating Environment

All experiments were carried out using MATLAB (2021a) (MathWorks, Natick, United States) on a personal computer (PC) equipped with an Intel Core-i7 processor (2.5 GHz) and 16 GB random-access memory (RAM).

4.1.2. Benchmark Approaches

We have selected five advanced algorithms as benchmarks to compare with the SF-RSTELM proposed in this paper. These algorithms are:
  • TELM [20]: Using the hinge loss function and squared L 2 -norm metric.
  • FTELM [33]: The Fisher regularization term is introduced into TELM, and relates to the statistical information of intra-class samples.
  • C L 1 -FTELM [33]: Capped L 1 -norm loss and metric are introduced into FTELM.
  • FRTELM [25]: Capped L 1 -norm loss and metric are introduced into TELM.
  • CWTELM [30]: Replace the hinge loss function and squared L 2 -norm metric in TELM with Welsch loss function and capped L 2 , p -norm metric.

4.1.3. Parameter Selection

In the training process of a model, the selection of parameters is crucial because it affects the classification performance. The parameters required by each model are listed below. Since both the comparison models and our model are twin models containing two optimization problems, and the two problems are solved separately, we only explain the parameter selection method for one of the optimization problems; the other is handled analogously.
  • TELM: the count of hidden layer nodes L, the regularization parameter C 1 .
  • FTELM: the count of hidden layer nodes L, the regularization parameters C 1 , C 2 .
  • C L 1 -FTELM: the count of hidden layer nodes L, the regularization parameters C 1 , C 2 , the capped parameter ε 1 in the metric, and the capped parameter ε 2 in the loss function.
  • FRTELM: the count of hidden layer nodes L, the regularization parameter C 1 , the capped parameter ε 1 in the metric, and the capped parameter ε 2 in the loss function.
  • CWTELM: the count of hidden layer nodes L, the regularization parameter C 1 , the capped parameters ε 1 and the parameter p of the metric, and the parameter σ in the loss function.
  • SF-RSTELM: the count of hidden layer nodes L, the regularization parameters C 1 , C 2 , the capped parameters ε 1 and the parameter p of capped L 2 , p -norm metric, and the parameter r in the loss function.
Due to the different numbers of parameters in the different models, we adopt different parameter selection strategies. For models with fewer than three parameters (TELM and FTELM), we use grid search and ten-fold cross-validation to explore the parameter space and find the optimal parameter combination. However, when dealing with models with more than three parameters ($CL_1$-FTELM, FRTELM, CWTELM, and SF-RSTELM), directly applying grid search may lead to a significant increase in computational load. To overcome this challenge, we first fix some parameters to narrow down the search space based on initial experimental results and domain expertise. The fixed-parameter strategy is as follows: the count of hidden nodes $L$ is fixed, but $L$ may differ across datasets, and its range is $\{i\times 10^{j}\mid i=1,2,\ldots,6,\ j=1,2\}$. For a fair comparison, we fix the parameters so that the above algorithms with bounded loss functions have the same upper bound on their loss. Specifically, we set the capped $L_1$-norm loss parameter to 2 in $CL_1$-FTELM and FRTELM, the Welsch loss parameter $\sigma$ to 2 in CWTELM, and the SF-loss parameter $r$ to 0.5 in our model. Moreover, for the aforementioned models that utilize bounded metrics, we set all their upper bounds to be equal; that is, the capped parameter $\varepsilon_1$ is set to 0.001. In addition to fixing the above parameters, we still use ten-fold cross-validation and grid search to find the best values of the remaining parameters. Specifically, the regularization parameters $C_1$ and $C_2$ are chosen from $\{i\times 10^{j}\mid i=1,2,\ldots,6,\ j=-3,\ldots,2\}$, and the parameter $p$ is chosen from $(0,2]$.
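The protocol described above can be sketched as follows (the classifier is abstracted as a user-supplied training function that returns a predictor; all names are ours):

```python
import itertools
import numpy as np

def ten_fold_accuracy(train_fn, X, y, params, n_folds=10, seed=0):
    """Average accuracy of `train_fn(X_tr, y_tr, **params)` over ten folds.

    `train_fn` must return a predictor mapping an (n, d) array to labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    accs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        predictor = train_fn(X[train], y[train], **params)
        accs.append(np.mean(predictor(X[test]) == y[test]))
    return float(np.mean(accs))

def grid_search(train_fn, X, y, grid):
    """Exhaustive search over a dict of parameter lists, e.g. {'C1': [...], 'C2': [...]}."""
    best = (None, -np.inf)
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        acc = ten_fold_accuracy(train_fn, X, y, params)
        if acc > best[1]:
            best = (params, acc)
    return best

# Example: the regularization grid of the form {i * 10^j | i = 1..6, j = -3..2}.
C_grid = [i * 10.0 ** j for j in range(-3, 3) for i in range(1, 7)]
```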

4.1.4. Evaluation Criteria

To assess the validity of the model, we employ accuracy ( A C C ) and the F 1 -score as evaluation criteria. Specifically, these criteria are specified as follows:
$$ACC=\frac{TP+TN}{TP+TN+FP+FN}\tag{87}$$

$$F_1=\frac{2TP}{2TP+FP+FN}\tag{88}$$
where T P , T N , F P , and F N refer to true positives, true negatives, false positives, and false negatives, respectively. Both A C C and the F 1 -score serve to evaluate the model’s generalization capability, and the higher they are, the better the performance. To guarantee the reliability of our experiments, we repeat all experimental procedures 10 times, and the final experimental result is the mean of the 10 repetitions.
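For completeness, a small sketch (our own helper) computes both criteria from predicted and true labels in $\{+1,-1\}$:

```python
import numpy as np

def acc_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels in {+1, -1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return acc, f1

if __name__ == "__main__":
    y_true = [1, 1, 1, -1, -1, -1]
    y_pred = [1, 1, -1, -1, -1, 1]
    print(acc_and_f1(y_true, y_pred))   # (0.666..., 0.666...)
```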

4.2. Experiments on the Artificial Datasets

In this subsection, experiments are initially performed on the Two Moons, XOR, and Banana datasets. Both the Two Moons and Banana datasets comprise 400 samples, while the XOR dataset contains 100 samples. Figure 6 visually depicts the two-dimensional distribution graphs of these three artificial datasets. Class 1 is shown as a red ‘∘’, and class 2 is shown as a blue ‘⋄’.
To test the performance of the proposed method, Figure 7 shows the accuracy of the five comparison algorithms as well as the proposed algorithm under conditions of no noise and a noise ratio of 20%. The noise is introduced by randomly selecting training samples and perturbing their features with Gaussian noise following a normal distribution $N(0,\sigma^2)$. Specifically, for the training data $X$, $X+\widetilde{X}$ is used instead of $X$, where $\widetilde{X}$ represents the noise matrix drawn from a normal distribution with mean 0 and variance $\sigma^2$.
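The noise-injection step can be sketched as follows (function and parameter names are ours; the fraction of perturbed samples and $\sigma$ are passed explicitly):

```python
import numpy as np

def add_gaussian_noise(X, noise_ratio=0.2, sigma=0.1, seed=0):
    """Perturb a random fraction of training samples with N(0, sigma^2) feature noise.

    A `noise_ratio` share of the rows of X is selected at random and replaced by
    X + X_tilde, where X_tilde ~ N(0, sigma^2) elementwise.
    """
    rng = np.random.default_rng(seed)
    X_noisy = X.astype(float).copy()
    n_noisy = int(round(noise_ratio * X.shape[0]))
    rows = rng.choice(X.shape[0], size=n_noisy, replace=False)
    X_noisy[rows] += rng.normal(0.0, sigma, size=(n_noisy, X.shape[1]))
    return X_noisy

if __name__ == "__main__":
    X = np.zeros((10, 3))
    print(add_gaussian_noise(X, noise_ratio=0.2, sigma=0.5))
```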
As can be seen from Figure 7a, the accuracy of SF-RSTELM is highest among the six methods when no noise is added to the dataset. Our model has the highest accuracy, which can be explained in three ways. First, it uses a bounded capped L 2 , p -norm metric, whose p-value can be flexibly adjusted to suit different data. Second, we use a bounded loss that effectively controls the upper bound. In addition, we consider the intra-class divergence information of the data. The remaining algorithms are sorted in descending order of accuracy as C L 1 -FTELM, CWTELM, FTELM, FRTELM, TELM. The accuracy of C L 1 -FTELM is slightly lower than that of our model, probably because the capped L 1 -norm metric is not as flexible as the capped L 2 , p -norm metric we use. Although CWTELM has also improved the metric and loss, it still falls slightly below C L 1 -FTELM due to its failure to consider the statistical properties of the data. FTELM is merely an extension of TELM, with the addition of a Fisher regularization term. FRTELM improved the metric and loss; however, the intra-class divergence information of the data is not taken into account. When noise is added to the dataset (Figure 7b), the performance of all algorithms is degraded, but SF-RSTELM’s accuracy is still the highest. This shows that our algorithm has the strongest noise immunity and robustness.

4.3. Experiments on the UCI Datasets

In this subsection, we evaluate the performance of our model on nine UCI datasets from http://archive.ics.uci.edu/ml/datasets.php (accessed on 15 February 2024) and contrast it with five other state-of-the-art algorithms. Table 1 provides a detailed overview of the characteristics of the UCI datasets employed. All features within these datasets are normalized to a scale of 0 , 1 to ensure consistency.
Initially, we assess the steadiness of our SF-RSTELM without adding extra noises. Table 2 presents the results of the experiments, with the best results highlighted in bold. Here, Time(s) denotes the average execution time for each algorithm when optimized with the best parameters. In ACC ± S, ACC denotes the mean learning accuracy, while S represents the standard deviation. Table 2 shows that SF-RSTELM outperforms the other five methods in terms of accuracy and F 1 on most datasets, except for the German, Pima, and QSAR datasets. This is because SF-RSTELM not only incorporates the Fisher regularization term, which considers intra-class divergence, but also leverages the parameter adjustability of SF-loss and the flexibility of the capped L 2 , p -norm metric.
Moreover, to illustrate the noise-resistant properties of SF-RSTELM, we introduce Gaussian noise into the training subset, affecting their features to generate noise. The experimental outcomes at noise levels of 15% and 25% are displayed in Table 3 and Table 4. It is clear that with rising noise levels, the learning accuracy of all algorithms declines. However, SF-RSTELM maintains higher accuracy and F 1 scores compared to the other five methods, except on a few datasets. This also demonstrates the strong noise resistance capability of our model.
To demonstrate the robustness of SF-RSTELM at varying noise levels (10%, 15%, 20%, and 25%), experiments are conducted on three datasets (Australian, Vote, and Ionosphere). Figure 8 illustrates the accuracy variation line charts of the six algorithms on these three datasets across different noise levels. For the original dataset $X$, we replace it with $X+\mu\bar X$, where $\bar X$ is a Gaussian random matrix. Here, $\mu=\rho\,\frac{\|X\|_F}{\|\bar X\|_F}$, where $\rho$ is the noise ratio taking values in $\{0,0.1,0.15,0.2,0.25\}$. As is evident from Figure 8, the performance of all algorithms decreases as noise levels rise, but the accuracy of our model decreases the slowest. This further indicates that SF-RSTELM possesses superior classification accuracy compared to the other five methods. This superiority in noise handling is likely attributed to the combined effectiveness of the SF-loss function, the capped $L_{2,p}$-norm metric, and the Fisher regularization term.

4.4. Experiments on the Image Datasets

In this part, we will perform experiments using high-dimensional image datasets to assess and compare the noise resistance and classification accuracy of our model SF-RSTELM with five other algorithms. We utilized the following three image datasets: COIL-20 http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2024), USPS http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2024), and MNIST http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2024). Since these image datasets are essentially a problem with multiple classes, we employed the “one-vs-rest” strategy to convert them to multiple binary classification problems. Table 5 shows the features of the three image datasets. Figure 9 demonstrates examples from the three high-dimensional image datasets. Specifically, due to the large size of the MNIST dataset, we only select the first 2000 samples to participate in the experiment. These image datasets are utilized to evaluate the performance of our models in multi-class classification tasks.
The specific experimental results are presented in Table 6. From the experiment results, it is evident that SF-RSTELM has the highest accuracy on all three datasets. Table 7 shows the results after adding 15% Gaussian noise. It can be seen that the ACC and F 1 value of all models are significantly reduced, but our model is still the highest, which also indicates that our model has good noise resistance capability, classification ability, and ability to process high-dimensional image datasets.

4.5. Experiments on the NDC Large Datasets

In this subsection, to evaluate the stability of our proposed model on large-scale datasets, we conduct a comparative analysis of our algorithm with five other algorithms using the NDC datasets generated by David Musicant's NDC data generator http://www.cs.wisc.edu/musicant/data/ndc (accessed on 15 February 2024). Detailed descriptions of the NDC datasets are presented in Table 8. Table 9 summarizes the experimental results of the six algorithms on three large-scale NDC datasets.
As can be seen from Table 9, our model SF-RSTELM has the highest accuracy and F 1 value except for the NDC-15k dataset. Overall, SF-RSTELM is more stable than the other five algorithms. This is mainly due to the advantages of SF-RSTELM, which not only incorporates the statistical attributes of the data, but also uses the bounded and flexibly adjustable metric and loss function, which effectively controls the disturbance from noise and outliers. Therefore, from this set of experimental results, we can see that our model is also effective for large datasets.

4.6. Convergence Curve

We also perform experiments on four UCI datasets (Australian, QSAR, WDBC, and Vote) to confirm the convergence of the proposed Algorithm 1. Figure 10 demonstrates the convergence case of the objective function value with the increasing number of iterations. As can be seen from Figure 10, the objective function value converges relatively quickly to a stable fixed value.
This observation verifies the effectiveness of our algorithm in converging the objective function to a local optimum within a finite number of iterations, thereby demonstrating its convergence properties.

5. Conclusions and Future Works

In this paper, we first propose a new kind of SF-loss function that exhibits favorable characteristics including boundedness, smoothness, symmetry, noise insensitivity, and Fisher consistency. Then, SF-RSTELM is proposed by integrating the capped L 2 , p -norm metric, SF-loss, and Fisher regularization term. SF-RSTELM not only integrates the Fisher regularization term, addressing the intra-class divergence of the data, but also exploits the parameter adjustability of SF-loss and the flexibility of capped L 2 , p -norm metrics to reduce the influence of noise and outliers. Moreover, an efficient iterative algorithm is proposed to solve the model, and the convergence of the algorithm is proved. Experimental results on multiple datasets demonstrate the efficiency of the proposed model. Specifically, our model was able to achieve higher ACC and F 1 scores on most datasets, with improvements ranging from 0.28% to 4.5% compared to other state-of-the-art algorithms.
In the future, we will continue to study the improvement of the algorithm. Because the model constructed in this paper represents a non-convex optimization problem, we convert it into a series of convex problems to solve by the CCCP method, resulting in a long training time, so it is necessary to find a fast solution method in future research. Moreover, transforming this paper’s model from supervised to semi-supervised learning remains an important direction for future studies.

Author Contributions

Z.X., conceptualization, methodology, validation, investigation, project administration, writing—original draft. Y.W., methodology, software, validation, formal analysis, investigation, data curation, writing—original draft. Y.R., validation, software. X.Z., validation, software. All authors have read and agreed to the published version of the manuscript.

Funding

The authors wish to acknowledge the financial support of the National Natural Science Youth Foundation of China (No. 61907012), the Construction Project of First-Class Disciplines in Ningxia Higher Education (NXYLXK2017B09), the Postgraduate Innovation Project of North Minzu University (YCX23098, YCX23091), the National Natural Science Foundation of China (No. 62366001), and the Natural Science Foundation of Ningxia (2024A2787).

Informed Consent Statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability Statement

The UCI machine learning repository is available at http://archive.ics.uci.edu/ml/datasets.php (accessed on 15 February 2024). The image data are available at http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 15 February 2024). The NDC datasets are available at http://www.cs.wisc.edu/musicant/data/ndc (accessed on 15 February 2024).

Acknowledgments

The authors thank the anonymous reviewers for their constructive comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sakheta, A.; Raj, T.; Nayak, R.; O’Hara, I.; Ramirez, J. Improved prediction of biomass gasification models through machine learning. Comput. Chem. Eng. 2024, 191, 108834. [Google Scholar] [CrossRef]
  2. Maydanchi, M.; Ziaei, M.; Mohammadi, M.; Ziaei, A.; Basiri, M.; Haji, F.; Gharibi, K. A Comparative Analysis of the Machine Learning Methods for Predicting Diabetes. J. Oper. Intell. 2024, 2, 230–251. [Google Scholar] [CrossRef]
  3. Kim, E.; Yang, S.M.; Ham, J.H.; Lee, W.; Jung, D.H.; Kim, H.Y. Integration of MALDI-TOF MS and machine learning to classify enterococci: A comparative analysis of supervised learning algorithms for species prediction. Food Chem. 2024, 462, 140931. [Google Scholar] [CrossRef]
  4. Ding, S.; Zhao, H.; Zhang, Y.; Xu, X.; Nie, R. Extreme learning machine: Algorithm, theory and applications. Artif. Intell. Rev. 2015, 44, 103–115. [Google Scholar] [CrossRef]
  5. Deng, C.; Huang, G.; Xu, J.; Tang, J. Extreme learning machines: New trends and applications. Sci. China Inf. Sci. 2015, 2, 1–16. [Google Scholar] [CrossRef]
  6. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501. [Google Scholar] [CrossRef]
  7. Mirza, B.; Kok, S.; Dong, F. Multi-layer online sequential extreme learning machine for image classification. In Proceedings of the ELM-2015 Volume 1: Theory, Algorithms and Applications (I), Hangzhou, China, 15–17 December 2015; Springer: Berlin/Heidelberg, Germany, 2016; pp. 39–49. [Google Scholar]
  8. Yu, H.; Yuan, K.; Li, W.; Zhao, N.; Chen, W.; Huang, C.; Chen, H.; Wang, M. Improved butterfly optimizer-configured extreme learning machine for fault diagnosis. Complexity 2021, 2021, 1–17. [Google Scholar] [CrossRef]
  9. Chen, Z.; Gryllias, K.; Li, W. Mechanical fault diagnosis using convolutional neural networks and extreme learning machine. Mech. Syst. Signal Process. 2019, 133, 106272. [Google Scholar] [CrossRef]
  10. Wang, Z.; Li, M.; Wang, H.; Jiang, H.; Yao, Y.; Zhang, H.; Xin, J. Breast cancer detection using extreme learning machine based on feature fusion with CNN deep features. IEEE Access 2019, 7, 105146–105158. [Google Scholar] [CrossRef]
  11. Zhu, W.; Miao, J.; Hu, J.; Qing, L. Vehicle detection in driving simulation using extreme learning machine. Neurocomputing 2014, 128, 160–165. [Google Scholar] [CrossRef]
  12. Deeb, H.; Sarangi, A.; Mishra, D.; Sarangi, S.K. Human facial emotion recognition using improved black hole based extreme learning machine. Multimed. Tools Appl. 2022, 81, 24529–24552. [Google Scholar] [CrossRef]
  13. Zhou, J.; Zhang, X.; Jiang, Z. Recognition of imbalanced epileptic EEG signals by a graph-based extreme learning machine. Wirel. Commun. Mob. Comput. 2021, 2021, 1–12. [Google Scholar] [CrossRef]
  14. Zhao, J.; Xu, Y.; Fujita, H. An improved non-parallel universum support vector machine and its safe sample screening rule. Knowl.-Based Syst. 2019, 170, 79–88. [Google Scholar] [CrossRef]
  15. Sun, F.; Xie, X. Deep Non-Parallel Hyperplane Support Vector Machine for Classification. IEEE Access 2023, 11, 7759–7767. [Google Scholar] [CrossRef]
  16. Chen, S.; Cao, J.; Huang, Z. Weighted linear loss projection twin support vector machine for pattern classification. IEEE Access 2019, 7, 57349–57360. [Google Scholar] [CrossRef]
  17. Zheng, X.; Zhang, L.; Yan, L. Sparse discriminant twin support vector machine for binary classification. Neural Comput. Appl. 2022, 34, 16173–16198. [Google Scholar] [CrossRef]
  18. Borah, P.; Gupta, D. Robust twin bounded support vector machines for outliers and imbalanced data. Appl. Intell. 2021, 51, 5314–5343. [Google Scholar] [CrossRef]
  19. Xiao, Y.; Liu, J.; Wen, K.; Liu, B.; Zhao, L.; Kong, X. A least squares twin support vector machine method with uncertain data. Appl. Intell. 2023, 53, 10668–10684. [Google Scholar] [CrossRef]
  20. Wan, Y.; Song, S.; Huang, G.; Li, S. Twin extreme learning machines for pattern classification. Neurocomputing 2017, 260, 235–244. [Google Scholar] [CrossRef]
  21. Wang, C.; Ye, Q.; Luo, P.; Ye, N.; Fu, L. Robust capped L1-norm twin support vector machine. Neural Netw. 2019, 114, 47–59. [Google Scholar] [CrossRef]
  22. Wang, Y.; Yu, G.; Ma, J. Capped linex metric twin support vector machine for robust classification. Sensors 2022, 22, 6583. [Google Scholar] [CrossRef] [PubMed]
  23. Kumari, A.; Tanveer, M.; Alzheimer’s Disease Neuroimaging Initiative. Universum twin support vector machine with truncated pinball loss. Eng. Appl. Artif. Intell. 2023, 123, 106427. [Google Scholar] [CrossRef]
  24. Ma, J.; Yang, L.; Sun, Q. Adaptive robust learning framework for twin support vector machine classification. Knowl.-Based Syst. 2021, 211, 106536. [Google Scholar] [CrossRef]
  25. Ma, J. Capped L1-norm distance metric-based fast robust twin extreme learning machine. Appl. Intell. 2020, 50, 3775–3787. [Google Scholar] [CrossRef]
  26. Yang, Y.; Xue, Z.; Ma, J.; Chang, X. Robust projection twin extreme learning machines with capped L1-norm distance metric. Neurocomputing 2023, 517, 229–242. [Google Scholar] [CrossRef]
  27. Ma, J.; Yang, L. Robust supervised and semi-supervised twin extreme learning machines for pattern classification. Signal Process 2021, 180, 107861. [Google Scholar] [CrossRef]
  28. Yuan, C.; Yang, L. Capped L2,P-norm metric based robust least squares twin support vector machine for pattern classification. Neural Netw. 2021, 142, 457–478. [Google Scholar] [CrossRef]
  29. Wang, H.; Yu, G.; Ma, J. Capped L2,P-Norm Metric Based on Robust Twin Support Vector Machine with Welsch Loss. Symmetry 2023, 15, 1076. [Google Scholar] [CrossRef]
  30. Jiang, Y.; Yu, G.; Ma, J. Distance Metric Optimization-Driven Neural Network Learning Framework for Pattern Classification. Axioms 2023, 12, 765. [Google Scholar] [CrossRef]
  31. Ma, J.; Wen, Y.; Yang, L. Fisher-regularized supervised and semi-supervised extreme learning machine. Knowl. Inf. Syst. 2020, 62, 3995–4027. [Google Scholar] [CrossRef]
  32. Xue, Z.; Zhao, C.; Wei, S.; Ma, J.; Lin, S. Robust Fisher-regularized extreme learning machine with asymmetric Welsch-induced loss function for classification. Appl. Intell. 2024, 54, 7352–7376. [Google Scholar] [CrossRef]
  33. Xue, Z.; Cai, L. Robust Fisher-Regularized Twin Extreme Learning Machine with Capped L1-Norm for Classification. Axioms 2023, 12, 717. [Google Scholar] [CrossRef]
  34. Huber, P.J. Robust estimation of a location parameter. In Breakthroughs in Statistics: Methodology and Distribution; Springer: Berlin/Heidelberg, Germany, 1992; pp. 492–518. [Google Scholar]
  35. Yuan, C.; Yang, L. Robust twin extreme learning machines with correntropy-based metric. Knowl.-Based Syst. 2021, 214, 106707. [Google Scholar] [CrossRef]
  36. Yuille, A.L.; Rangarajan, A. The concave-convex procedure. Neural Comput. 2003, 15, 915–936. [Google Scholar] [CrossRef] [PubMed]
  37. Zhang, T. Analysis of multi-stage convex relaxation for sparse regularization. J. Mach. Learn. Res. 2010, 11, 1081–1107. [Google Scholar]
  38. Rockafellar, R. Convex Analysis. Princet. Math. Ser. 1970, 28, 326–332. [Google Scholar]
Figure 1. Comparison of the L 1 −norm, the L 2 −norm, and the capped L 2 , p −norm with different p values.
Figure 2. Comparison of L 2 , p −norm metric and capped L 2 , p −norm metric: (a) L 2 , p −norm metric ( p = 1 ); (b) L 2 , p −norm metric ( p = 2 ); (c) capped L 2 , p −norm metric ( p = 1 , ε = 2 ); (d) capped L 2 , p −norm metric ( p = 2 , ε = 2 ).
Figure 3. Loss function L r ( u ) with different values of r. The horizontal axis indicates the u value ( u = 1 − y f is the margin error), while the vertical axis shows the respective loss function value.
Figure 4. Comparison of the SF-loss with the hinge loss, capped L 1 -norm loss, and Welsch loss. The horizontal axis indicates the u value, while the vertical axis shows the respective loss function value.
Figure 5. Convex function (a) L r 1 ( u ) and concave function (b) L r 2 ( u ).
Figure 6. Two-dimensional distribution graphs of three artificial datasets (Two Moons, XOR, Banana): (a) Two Moons; (b) XOR; (c) Banana.
Figure 7. Experimental results of six algorithms on three artificial datasets: (a) experimental results of six algorithms on three artificial datasets without noise; (b) experimental results of six algorithms on three artificial datasets with 20% noise.
Figure 8. Under different noise ratios, the accuracy variation curves of six algorithms across three datasets: (a) Australian; (b) Vote; (c) Ionosphere.
Figure 9. Example images for three image datasets: (a) COIL-20 database; (b) USPS database; (c) MNIST database.
Figure 10. Convergence curves of the objective function value of SF-RSTELM versus the number of iterations on four datasets (Australian, QSAR, WDBC, Vote): (a) Australian; (b) QSAR; (c) WDBC; (d) Vote.
Table 1. Characteristics of UCI datasets.
Datasets | Instances | Attributes | Datasets | Instances | Attributes
Australian | 690 | 14 | Vote | 435 | 16
Ionosphere | 351 | 34 | WDBC | 569 | 30
German | 1000 | 24 | wholesalesta | 400 | 7
Pima | 768 | 8 | Sonar | 208 | 60
QSAR | 1055 | 41 | | |
Table 2. Experimental results on UCI datasets. The best results are marked in bold. For each dataset, the three consecutive rows report ACC ± S (%), F 1 ± S (%), and Time (s).
Datasets | TELM [20] | FTELM [33] | C L 1 -FTELM [33] | CWTELM [30] | FRTELM [25] | SF-RSTELM
Australian | 85.66 ± 0.37 | 86.37 ± 0.29 | 86.49 ± 0.37 | 86.25 ± 0.82 | 86.03 ± 0.94 | 86.76 ± 0.56
 | 86.27 ± 0.33 | 85.70 ± 1.04 | 86.06 ± 0.76 | 86.14 ± 0.47 | 85.97 ± 0.39 | 86.55 ± 0.64
 | 0.596 | 0.651 | 3.739 | 5.939 | 4.638 | 6.856
Vote | 94.26 ± 0.55 | 94.98 ± 0.69 | 95.74 ± 0.37 | 95.25 ± 0.57 | 94.83 ± 0.66 | 95.86 ± 0.37
 | 93.46 ± 0.83 | 95.89 ± 0.57 | 95.28 ± 0.17 | 96.50 ± 0.46 | 95.80 ± 0.56 | 96.87 ± 1.64
 | 0.233 | 0.794 | 1.009 | 1.843 | 3.036 | 3.604
Ionosphere | 88.67 ± 1.36 | 90.25 ± 1.13 | 90.36 ± 1.64 | 90.34 ± 1.73 | 90.14 ± 0.85 | 90.67 ± 0.93
 | 84.87 ± 2.03 | 86.45 ± 1.95 | 88.33 ± 2.75 | 87.38 ± 1.78 | 86.26 ± 0.94 | 89.41 ± 1.12
 | 0.458 | 0.237 | 2.254 | 2.358 | 3.329 | 3.217
WDBC | 96.13 ± 0.34 | 97.13 ± 0.17 | 97.05 ± 0.65 | 96.63 ± 0.38 | 96.50 ± 0.83 | 97.28 ± 0.98
 | 95.37 ± 0.51 | 96.07 ± 0.21 | 95.78 ± 0.34 | 95.98 ± 0.62 | 95.94 ± 0.87 | 97.03 ± 0.56
 | 0.719 | 0.689 | 4.359 | 3.159 | 4.656 | 5.289
German | 74.52 ± 1.28 | 74.86 ± 2.96 | 76.42 ± 0.92 | 74.38 ± 2.74 | 74.63 ± 0.79 | 74.80 ± 0.68
 | 70.36 ± 0.87 | 71.28 ± 2.77 | 72.90 ± 0.53 | 73.21 ± 0.41 | 72.18 ± 0.47 | 72.73 ± 0.32
 | 0.878 | 0.654 | 8.744 | 7.934 | 7.218 | 8.306
wholesalesta | 87.30 ± 0.72 | 88.01 ± 0.44 | 89.91 ± 0.51 | 89.43 ± 1.04 | 88.72 ± 1.34 | 90.00 ± 1.20
 | 76.76 ± 1.71 | 82.30 ± 0.49 | 82.26 ± 0.27 | 83.87 ± 1.86 | 81.23 ± 2.72 | 85.16 ± 2.26
 | 1.869 | 0.481 | 3.647 | 4.385 | 3.287 | 4.473
Pima | 76.38 ± 1.72 | 76.48 ± 1.38 | 76.97 ± 0.45 | 78.13 ± 2.27 | 76.22 ± 1.11 | 76.79 ± 0.58
 | 74.52 ± 1.28 | 76.74 ± 1.08 | 77.96 ± 0.32 | 78.82 ± 0.45 | 76.90 ± 0.64 | 75.32 ± 0.68
 | 0.969 | 1.282 | 4.561 | 7.548 | 5.078 | 6.631
Sonar | 67.76 ± 0.54 | 69.65 ± 2.21 | 69.83 ± 0.46 | 68.45 ± 3.01 | 68.37 ± 2.12 | 70.50 ± 3.67
 | 65.83 ± 0.66 | 72.10 ± 2.86 | 68.26 ± 0.15 | 65.35 ± 3.41 | 66.64 ± 1.84 | 72.82 ± 2.72
 | 0.611 | 1.055 | 1.376 | 1.537 | 1.718 | 1.476
QSAR | 85.61 ± 0.74 | 86.43 ± 0.82 | 87.44 ± 2.43 | 87.16 ± 0.36 | 86.35 ± 1.93 | 86.81 ± 2.68
 | 77.64 ± 1.11 | 78.36 ± 1.24 | 82.23 ± 0.31 | 81.35 ± 0.74 | 81.36 ± 0.71 | 81.59 ± 0.23
 | 0.784 | 1.522 | 12.884 | 9.572 | 11.667 | 19.101
Table 3. Experimental results on UCI datasets with 15% Gaussian noise. The best results are marked in bold. For each dataset, the three consecutive rows report ACC ± S (%), F 1 ± S (%), and Time (s).
Datasets | TELM [20] | FTELM [33] | C L 1 -FTELM [33] | CWTELM [30] | FRTELM [25] | SF-RSTELM
Australian | 82.10 ± 0.88 | 83.07 ± 0.90 | 85.44 ± 0.22 | 84.25 ± 0.34 | 84.29 ± 0.63 | 85.59 ± 0.75
 | 83.88 ± 0.84 | 84.60 ± 0.78 | 86.64 ± 0.25 | 84.14 ± 0.81 | 84.68 ± 0.61 | 86.17 ± 0.49
 | 0.481 | 0.727 | 3.624 | 5.619 | 4.256 | 6.499
Vote | 91.17 ± 1.04 | 92.54 ± 1.16 | 93.65 ± 0.45 | 92.56 ± 0.23 | 92.43 ± 0.30 | 93.72 ± 0.91
 | 92.84 ± 0.87 | 93.88 ± 0.76 | 92.53 ± 0.26 | 91.62 ± 0.58 | 91.32 ± 0.27 | 93.26 ± 0.69
 | 0.320 | 0.586 | 1.146 | 2.052 | 2.940 | 3.296
Ionosphere | 85.50 ± 1.80 | 85.88 ± 1.71 | 85.97 ± 0.89 | 85.34 ± 1.73 | 85.38 ± 1.33 | 87.94 ± 1.79
 | 82.96 ± 0.61 | 84.34 ± 0.43 | 83.08 ± 0.53 | 85.21 ± 1.78 | 83.39 ± 1.75 | 85.56 ± 0.57
 | 0.231 | 0.164 | 2.897 | 2.358 | 2.428 | 3.299
WDBC | 94.89 ± 0.63 | 95.13 ± 0.41 | 94.36 ± 0.54 | 94.29 ± 0.87 | 93.37 ± 0.46 | 95.36 ± 0.51
 | 93.15 ± 0.88 | 93.09 ± 0.57 | 92.66 ± 0.21 | 91.84 ± 0.34 | 90.24 ± 0.77 | 93.28 ± 0.82
 | 0.392 | 0.227 | 4.087 | 4.527 | 4.560 | 5.498
German | 70.79 ± 0.73 | 71.59 ± 1.19 | 72.87 ± 0.58 | 71.24 ± 0.56 | 70.50 ± 0.66 | 72.30 ± 0.65
 | 68.35 ± 0.15 | 67.57 ± 0.96 | 71.32 ± 0.17 | 69.54 ± 0.24 | 69.88 ± 0.73 | 70.54 ± 0.18
 | 0.786 | 1.446 | 8.582 | 7.934 | 7.572 | 8.081
wholesalesta | 84.19 ± 1.82 | 83.65 ± 1.32 | 85.72 ± 0.19 | 84.91 ± 0.87 | 84.67 ± 0.95 | 86.05 ± 0.74
 | 78.45 ± 2.50 | 82.25 ± 0.78 | 82.15 ± 0.32 | 81.13 ± 1.45 | 81.29 ± 0.79 | 82.34 ± 1.65
 | 0.632 | 0.253 | 3.362 | 4.073 | 3.395 | 4.455
Pima | 69.07 ± 1.03 | 71.87 ± 1.09 | 73.09 ± 0.48 | 72.45 ± 0.71 | 70.75 ± 0.35 | 71.58 ± 1.03
 | 66.15 ± 0.57 | 69.45 ± 0.23 | 70.17 ± 0.23 | 70.05 ± 0.30 | 69.34 ± 0.45 | 69.81 ± 0.73
 | 0.497 | 1.815 | 5.178 | 8.237 | 6.583 | 7.871
Sonar | 63.90 ± 0.82 | 65.25 ± 0.75 | 64.27 ± 0.65 | 64.13 ± 0.46 | 63.95 ± 0.52 | 65.50 ± 2.44
 | 64.19 ± 0.94 | 64.67 ± 2.34 | 64.73 ± 0.24 | 64.23 ± 0.67 | 64.01 ± 1.22 | 65.63 ± 1.25
 | 0.433 | 0.652 | 1.686 | 1.874 | 1.634 | 1.947
QSAR | 78.08 ± 1.49 | 80.05 ± 1.42 | 82.35 ± 0.35 | 82.56 ± 0.76 | 81.21 ± 0.37 | 82.05 ± 0.84
 | 79.35 ± 1.37 | 77.25 ± 0.57 | 80.56 ± 0.28 | 81.74 ± 0.87 | 80.45 ± 0.17 | 79.69 ± 2.20
 | 0.846 | 1.731 | 13.932 | 8.528 | 10.462 | 17.196
Table 4. Experimental results on UCI datasets with 25% Gaussian noise. The best results are marked in bold. For each dataset, the three consecutive rows report ACC ± S (%), F 1 ± S (%), and Time (s).
Datasets | TELM [20] | FTELM [33] | C L 1 -FTELM [33] | CWTELM [30] | FRTELM [25] | SF-RSTELM
Australian | 79.97 ± 0.74 | 80.50 ± 1.25 | 81.82 ± 0.31 | 81.12 ± 0.54 | 80.34 ± 0.93 | 82.50 ± 0.84
 | 78.56 ± 1.41 | 79.13 ± 0.20 | 82.09 ± 0.65 | 81.23 ± 0.34 | 80.39 ± 0.79 | 83.09 ± 0.82
 | 0.369 | 0.549 | 2.937 | 4.457 | 3.210 | 5.239
Vote | 90.40 ± 0.90 | 90.14 ± 0.65 | 92.52 ± 0.46 | 91.12 ± 0.76 | 90.88 ± 1.05 | 92.67 ± 1.09
 | 93.73 ± 0.82 | 91.83 ± 0.91 | 94.48 ± 0.81 | 91.72 ± 0.83 | 92.97 ± 0.75 | 92.87 ± 0.77
 | 0.299 | 0.312 | 1.391 | 2.131 | 3.724 | 2.984
Ionosphere | 80.44 ± 0.92 | 83.06 ± 1.55 | 84.18 ± 0.67 | 84.13 ± 0.76 | 83.23 ± 0.65 | 84.71 ± 1.99
 | 79.24 ± 1.46 | 81.29 ± 0.38 | 81.26 ± 0.34 | 81.05 ± 0.23 | 81.25 ± 0.12 | 81.89 ± 0.86
 | 0.254 | 0.157 | 2.721 | 2.480 | 4.008 | 3.365
WDBC | 91.91 ± 0.34 | 91.93 ± 1.39 | 92.26 ± 0.48 | 92.03 ± 0.56 | 91.82 ± 0.42 | 93.04 ± 0.64
 | 89.71 ± 1.54 | 89.76 ± 0.45 | 90.14 ± 0.86 | 89.47 ± 0.35 | 87.21 ± 0.62 | 90.34 ± 0.76
 | 0.285 | 0.486 | 4.228 | 5.162 | 3.014 | 5.271
German | 70.66 ± 0.91 | 70.09 ± 0.64 | 71.12 ± 0.46 | 70.37 ± 1.08 | 69.46 ± 0.78 | 71.83 ± 0.43
 | 67.23 ± 0.83 | 68.72 ± 0.31 | 67.33 ± 0.28 | 67.53 ± 0.36 | 67.24 ± 0.41 | 68.98 ± 0.65
 | 0.734 | 1.379 | 9.411 | 8.475 | 8.306 | 9.916
wholesalesta | 81.44 ± 1.13 | 80.05 ± 1.65 | 81.16 ± 0.23 | 80.52 ± 0.34 | 80.25 ± 0.31 | 83.33 ± 1.02
 | 78.23 ± 0.68 | 79.21 ± 0.84 | 77.34 ± 0.75 | 78.47 ± 0.75 | 78.02 ± 1.40 | 79.45 ± 0.93
 | 0.493 | 0.367 | 3.737 | 4.782 | 5.434 | 5.837
Pima | 65.71 ± 0.22 | 66.78 ± 0.93 | 68.36 ± 0.57 | 67.23 ± 0.84 | 66.79 ± 0.14 | 65.86 ± 0.82
 | 67.30 ± 0.17 | 67.37 ± 0.56 | 68.65 ± 0.26 | 67.18 ± 0.72 | 65.37 ± 0.23 | 67.73 ± 0.59
 | 0.429 | 1.518 | 6.551 | 8.169 | 5.176 | 8.117
Sonar | 60.75 ± 0.74 | 61.15 ± 0.26 | 62.23 ± 0.14 | 61.23 ± 0.52 | 61.15 ± 0.21 | 62.55 ± 1.26
 | 61.53 ± 0.31 | 61.90 ± 0.88 | 62.17 ± 0.56 | 61.45 ± 1.31 | 62.43 ± 1.61 | 63.23 ± 0.67
 | 0.538 | 0.399 | 1.372 | 1.549 | 1.752 | 1.382
QSAR | 74.63 ± 1.27 | 76.45 ± 0.65 | 81.35 ± 0.35 | 81.68 ± 0.81 | 81.05 ± 0.54 | 79.13 ± 0.76
 | 72.36 ± 1.10 | 72.94 ± 0.46 | 80.82 ± 0.28 | 80.93 ± 0.77 | 80.54 ± 0.23 | 76.56 ± 1.84
 | 0.606 | 1.572 | 12.847 | 9.458 | 11.284 | 18.193
Table 5. Characteristics of image datasets.
Datasets | COIL-20 | USPS | MNIST
Instances | 1440 | 9298 | 70,000
Attributes | 1024 | 256 | 784
Image Resolution | 32 × 32 pixels | 16 × 16 pixels | 28 × 28 pixels
Input Dimension | 1024 | 256 | 784
Description | 20 objects, each rotated on a turntable with images captured every 5 degrees, resulting in 72 images per object. | Contains a total of 9298 images of handwritten digits. | It has a training set of 60,000 examples and a test set of 10,000 examples.
Table 6. Experimental results on image datasets. The best results are marked in bold. For each dataset, the two consecutive rows report ACC ± S (%) and F 1 ± S (%).
Datasets | TELM [20] | FTELM [33] | C L 1 -FTELM [33] | CWTELM [30] | FRTELM [25] | SF-RSTELM
COIL-20 | 88.06 ± 0.89 | 90.38 ± 0.52 | 90.63 ± 0.28 | 91.62 ± 0.39 | 90.25 ± 0.74 | 92.46 ± 0.25
 | 87.61 ± 0.60 | 88.15 ± 1.56 | 89.34 ± 0.75 | 89.45 ± 0.21 | 89.50 ± 0.67 | 90.82 ± 0.71
USPS | 97.25 ± 0.28 | 97.14 ± 0.74 | 98.27 ± 0.65 | 98.49 ± 0.75 | 97.45 ± 0.96 | 98.73 ± 0.38
 | 95.57 ± 0.47 | 94.83 ± 0.36 | 95.51 ± 0.59 | 96.63 ± 0.38 | 95.58 ± 0.35 | 97.25 ± 0.17
MNIST | 89.92 ± 0.35 | 90.23 ± 0.81 | 90.89 ± 0.38 | 89.92 ± 0.32 | 89.61 ± 0.23 | 91.26 ± 0.79
 | 88.24 ± 0.51 | 87.34 ± 0.75 | 89.17 ± 0.29 | 87.38 ± 0.41 | 87.23 ± 0.16 | 89.03 ± 0.85
Table 7. Experimental results on image datasets with 15% Gaussian noise. The best results are marked in bold. For each dataset, the two consecutive rows report ACC ± S (%) and F 1 ± S (%).
Datasets | TELM [20] | FTELM [33] | C L 1 -FTELM [33] | CWTELM [30] | FRTELM [25] | SF-RSTELM
COIL-20 | 86.35 ± 0.56 | 87.87 ± 0.58 | 89.25 ± 0.75 | 88.78 ± 0.91 | 88.46 ± 0.64 | 89.57 ± 0.56
 | 85.51 ± 0.92 | 85.25 ± 0.29 | 87.85 ± 0.28 | 87.25 ± 0.29 | 87.67 ± 0.71 | 88.04 ± 0.63
USPS | 94.54 ± 0.32 | 95.32 ± 0.56 | 95.81 ± 0.26 | 96.58 ± 0.82 | 95.34 ± 0.68 | 96.65 ± 0.76
 | 93.12 ± 0.79 | 93.54 ± 0.93 | 94.78 ± 0.57 | 95.45 ± 0.73 | 94.16 ± 0.45 | 95.47 ± 0.64
MNIST | 87.48 ± 0.83 | 88.74 ± 0.46 | 88.93 ± 0.76 | 88.13 ± 0.25 | 87.94 ± 0.57 | 89.18 ± 0.42
 | 84.62 ± 0.21 | 86.37 ± 0.84 | 87.32 ± 0.82 | 86.84 ± 0.18 | 86.15 ± 0.74 | 87.56 ± 0.27
Table 8. Characteristics of NDC datasets.
Datasets | Instances | Attributes
NDC-5k | 5000 | 35
NDC-10k | 10,000 | 35
NDC-15k | 15,000 | 35
Table 9. Experimental results on NDC large datasets. The best results are marked in bold. For each dataset, the two consecutive rows report ACC ± S (%) and F 1 ± S (%).
Datasets | TELM [20] | FTELM [33] | C L 1 -FTELM [33] | CWTELM [30] | FRTELM [25] | SF-RSTELM
NDC-5k | 86.38 ± 0.26 | 87.63 ± 0.23 | 90.84 ± 0.81 | 90.79 ± 0.23 | 87.65 ± 0.78 | 90.95 ± 0.47
 | 85.89 ± 0.91 | 88.47 ± 0.39 | 91.78 ± 0.39 | 91.60 ± 0.88 | 87.71 ± 0.31 | 91.83 ± 0.39
NDC-10k | 87.39 ± 0.73 | 89.72 ± 0.84 | 91.37 ± 0.45 | 92.74 ± 0.29 | 88.97 ± 0.65 | 92.91 ± 0.23
 | 86.82 ± 0.57 | 88.90 ± 0.61 | 90.72 ± 0.24 | 91.69 ± 0.82 | 87.83 ± 0.18 | 91.80 ± 0.35
NDC-15k | 87.13 ± 0.36 | 88.21 ± 0.13 | 90.69 ± 0.71 | 90.32 ± 0.14 | 88.05 ± 0.53 | 89.95 ± 0.46
 | 86.87 ± 0.85 | 87.56 ± 0.38 | 90.94 ± 0.80 | 90.72 ± 0.52 | 86.48 ± 0.29 | 88.73 ± 0.71
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
