Article

Distance Metric Optimization-Driven Neural Network Learning Framework for Pattern Classification

School of Mathematics and Information Sciences, North Minzu University, Yinchuan 750021, China
* Author to whom correspondence should be addressed.
Axioms 2023, 12(8), 765; https://doi.org/10.3390/axioms12080765
Submission received: 21 June 2023 / Revised: 24 July 2023 / Accepted: 26 July 2023 / Published: 3 August 2023
(This article belongs to the Special Issue Mathematics of Neural Networks: Models, Algorithms and Applications)

Abstract
As a novel neural network learning framework, the Twin Extreme Learning Machine (TELM) has received extensive attention in the field of machine learning. However, in practical applications TELM is affected by noise and outliers, so its generalization performance is lower than that of robust learning algorithms. In this paper, we propose two novel distance metric optimization-driven robust twin extreme learning machine learning frameworks for pattern classification, namely, CWTELM and FCWTELM. By introducing the robust Welsch loss function and the capped $L_{2,p}$-distance metric, our methods reduce the effect of outliers and improve the generalization performance of the model compared to TELM. In addition, two efficient iterative algorithms are designed to solve the non-convex optimization problems CWTELM and FCWTELM, and we theoretically guarantee their convergence, local optimality, and computational complexity. The proposed algorithms are then compared with five other classical algorithms under different noise levels and on different datasets, and a statistical significance analysis is carried out. Finally, we conclude that our algorithms have excellent robustness and classification performance.

1. Introduction

Single-Hidden Layer Feedforward Neural Networks (SLFNs) [1] are popular learning models that consist of a hidden layer and an output layer, with adjustable weights between the input layer and the hidden layer. When the activation function of the hidden nodes is chosen properly, SLFNs can form decision regions of arbitrary shape. SLFNs have a large number of applications in the field of pattern recognition [2], where the hidden layer extracts features of the input data and the network classifies and recognizes different modes, such as speech recognition [3], image classification [4], etc. In addition, they are also widely used to solve nonlinear problems and for time series analysis, for instance, stock price forecasting [5] and weather forecasting [6]. Although SLFNs have many advantages and applications, they also have significant limitations. Because SLFNs rely too heavily on the training samples, the networks do not generalize well to new datasets, which makes them prone to overfitting. Moreover, when processing large-scale datasets, the training speed of SLFNs is relatively slow, and the accuracy decreases correspondingly.
In order to break the bottleneck of SLFNs, the Extreme Learning Machine (ELM) was proposed by Huang et al. [1,7] in 2004. ELM is a single-hidden layer feedforward network training algorithm whose input weights and hidden-node biases are randomly generated, so that only the output weights need to be computed analytically. Compared with traditional neural networks, ELM has the advantages of a simple structure, good versatility, and low computational cost [8]. In recent years, owing to its rapid learning, outstanding generalization, and universal approximation capability [9,10,11,12,13,14,15], ELM has been used in biology [9,10], pattern classification [11], big data [12], robotics [13], and other fields. However, ELM learns only one hyperplane, which makes it difficult for ELM to handle large-scale datasets as well as imbalanced data. Therefore, learning frameworks based on two non-parallel hyperplanes have been developed [16,17,18,19]. One of the most widely known is the Twin Support Vector Machine (TSVM), which was presented by Jayadeva et al. [16]. Influenced by TSVM, the Twin Extreme Learning Machine (TELM) was introduced by Wan et al. [20]. TELM introduces two ELM models and trains them together, so TELM learns two hyperplanes. The inputs of the two hyperplanes are the same dataset, and different feature expressions are learned under different objective functions. Finally, the results obtained by the two models are integrated to obtain richer feature expressions and classification results. In 2019, Rastogi et al. [21] proposed the Least Squares Twin Extreme Learning Machine (LS-TELM), which introduces the least squares method into TELM to solve for the weight matrix between the hidden layer and the output layer. While maintaining the advantages of TELM, LS-TELM transforms the inequality constraints into equality constraints, so that the problem reduces to solving two sets of linear equations, which greatly reduces the computational cost.
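To make the basic ELM training step concrete, the following minimal Python sketch (illustrative only; the function names, the sigmoid activation, and the small ridge term are our assumptions, not taken from the cited papers) generates the hidden-layer parameters randomly and solves the output weights by regularized least squares:

```python
import numpy as np

def elm_train(X, y, n_hidden=100, reg=1e-3, seed=None):
    """Minimal ELM sketch: random hidden layer + least-squares output weights."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Input weights W and biases b are drawn randomly and never updated.
    W = rng.normal(size=(n_features, n_hidden))
    b = rng.normal(size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))       # sigmoid hidden-layer output
    # Output weights: ridge-regularized least squares solve of (H^T H + reg*I) beta = H^T y.
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.sign(H @ beta)
```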
In many areas, TELM and its variants are widely used, but they encounter bottlenecks when dealing with problems that contain outliers. To remove this dilemma, many scholars have studied the issue deeply and proposed many robust algorithms based on TELM (see [22,23,24,25,26]). For example, Yuan et al. [22] proposed the Robust Twin Extreme Learning Machine with a correntropy-based metric (LCFTELM), which enhances the robustness and classification performance of TELM by employing a non-convex fractional loss function. A Robust Supervised Twin Extreme Learning Machine (RTELM) was put forward by Ma and Yang [23]; this framework employs a non-convex squared loss function, which greatly suppresses the negative effects of outliers. The presence of outliers is an important factor affecting robustness. To reduce the effect of outliers and improve the robustness of the model, we can use a non-convex loss function that penalizes outliers in a bounded and consistent way, and the above experimental results show that this is an effective approach. Therefore, to suppress the negative effects of outliers, we introduce a non-convex, bounded, and smooth loss function (the Welsch loss) [27,28,29,30]. The Welsch loss is based on the Welsch robust estimation method and can be expressed as $L(y, f(x)) = \frac{\sigma^2}{2}\left[1 - \exp\left(-\frac{(y - f(x))^2}{2\sigma^2}\right)\right]$, where $\sigma$ is a tuning parameter that controls the degree of penalty for the outliers. When the data error is normally distributed, it is comparable to the mean squared error loss; but when the error is non-normally distributed, for example when the error is caused by outliers, the Welsch loss is more robust than the mean squared error loss.
It is worth mentioning that TELM has good classification performance, but it uses the squared $L_2$-norm distance, which amplifies the influence of outliers on the model and changes the construction of the hyperplanes. In recent years, many researchers have therefore turned their attention to the $L_1$-norm measure and proposed a series of robust algorithms, such as the $L_1$-norm and non-squared $L_2$-norm model [31], the Non-parallel Proximal Extreme Learning Machine ($L_1$-NPELM) [32] based on the $L_1$-norm distance measure, and the robust $L_1$-norm Twin Extreme Learning Machine ($L_1$-TELM) [33]. Overall, the $L_1$-norm alleviates the effects of outliers and improves robustness, but it performs poorly when dealing with large numbers of outliers due to its unboundedness. Based on this point, reference [33] presented Capped $L_{2,p}$-norm Support Vector Classification (SVC), and the Capped $L_{2,p}$-norm Least Squares Twin Extreme Learning Machine (C$L_{2,p}$-LSTELM) was proposed in [34]. The convergence of the above methods was proven in theory, and the capped $L_{2,p}$ distance metric significantly improves robustness when dealing with outliers.
Inspired by the above excellent works, we propose two novel distance metric optimization-driven robust twin extreme learning machine learning frameworks for pattern classification, namely, CWTELM and FCWTELM. CWTELM is based on optimization theory and introduces the capped $L_{2,p}$-norm measure and the Welsch loss into the model, which greatly improves its robustness and classification ability. In addition, in order to maintain a relatively stable classification performance while accelerating the computation, we present the least squares version of CWTELM (FCWTELM). Experimental results with different noise rates and different datasets show that the CWTELM and FCWTELM algorithms have significant advantages in terms of classification performance and robustness.
The main work of this paper is summarized as follows:
(1)
By embedding the capped $L_{2,p}$-norm metric distance and the Welsch loss into TELM, a novel robust learning algorithm called the Capped $L_{2,p}$-norm Welsch Robust Twin Extreme Learning Machine (CWTELM) is proposed. CWTELM enhances robustness while maintaining the superiority of TELM, so that the classification performance is also improved;
(2)
To speed up the computation of CWTELM and carry forward its advantages, we present a least squares version of CWTELM, namely, Fast CWTELM (FCWTELM). While inheriting the superiority of CWTELM, FCWTELM transforms the inequality constraints into equality constraints, so that the problem reduces to solving two sets of linear equations, which greatly reduces the computational cost;
(3)
Two efficient iterative algorithms are designed to solve CWTELM and FCWTELM; they are easy to implement and theoretically guarantee the existence of a reasonable optimization procedure. Simultaneously, we carry out a rigorous theoretical analysis and proof of the convergence of the two designed algorithms;
(4)
A large number of experiments conducted across various datasets and different noise proportions demonstrate that CWTELM and FCWTELM are competitive with five other traditional classification methods in terms of robustness and practicability;
(5)
A statistical analysis is performed for our algorithms, which further verifies that CWTELM and FCWTELM outperform the other five classifiers in robustness and classification performance.
The remainder of the article is organized as follows. In Section 2, we briefly review TELM, LS-TELM, RTELM, the Welsch loss, and the capped $L_{2,p}$-norm. In Section 3, we describe the proposed CWTELM and FCWTELM in detail and give a theoretical analysis. In Section 4, we introduce our experimental setup; the proposed algorithms are compared with five other classical algorithms under different noise levels and on different datasets, and a statistical significance analysis is carried out. Section 5 concludes the article. First, we present the abbreviations and main notations in Table 1 and Table 2.

2. Related Work

In this section, we first describe some concepts applied in this paper, and then give a concise introduction to TELM, the Least Squares Twin Extreme Learning Machine (LS-TELM), the Welsch loss, the Robust Supervised Twin Extreme Learning Machine (RTELM) [23], and the capped $L_{2,p}$-norm.

2.1. TELM

ELM is a special feedforward neural network. In the training process, the weights and biases of the hidden layer are generated randomly or given artificially, without updating; computing the weights of the output layer completes the training process. Consider a training dataset $\tau_l = \{(x_1, y_1), \ldots, (x_l, y_l)\} \subset (\mathbb{R}^n \times Y)^l$, where $x_i \in \mathbb{R}^n$, $y_i \in Y = \{-1, 1\}$, $i = 1, \ldots, l$. The training dataset $\tau_l$ comprises $m_1$ positive-class and $m_2$ negative-class samples, where $l = m_1 + m_2$. In addition, we let the matrices $H_1$ and $H_2$ represent the hidden-layer outputs of the samples belonging to the positive class and the negative class, respectively. The goal of TELM is to find a pair of non-parallel hyperplanes to achieve classification:
$$f_1(x) = \beta_1^T h(x), \qquad f_2(x) = \beta_2^T h(x),$$
where $\beta_1$ and $\beta_2$ are the output weights between the hidden layer and the output layer, and $h(x)$ is the nonlinear random feature mapping output of the hidden layer with respect to the input pattern. Inspired by the idea of TSVM, the primal problems of TELM can be given by
$$\min_{\beta_1} \ \frac{1}{2}\|H_1\beta_1\|_2^2 + C_1 e_2^T\xi \quad \text{s.t.} \quad -H_2\beta_1 + \xi \geq e_2, \ \xi \geq 0,$$
$$\min_{\beta_2} \ \frac{1}{2}\|H_2\beta_2\|_2^2 + C_2 e_1^T\eta \quad \text{s.t.} \quad H_1\beta_2 + \eta \geq e_1, \ \eta \geq 0,$$
where $C_1 > 0$ and $C_2 > 0$ are regularization parameters, $\xi$ and $\eta$ are slack vectors, $e_1 \in \mathbb{R}^{m_1}$ and $e_2 \in \mathbb{R}^{m_2}$ are vectors of ones, and the zero vector is denoted by 0. According to the Karush–Kuhn–Tucker theorem, solving the TELM problems is equivalent to solving the following dual optimization problems:
$$\min_{\alpha} \ \frac{1}{2}\alpha^T H_2(H_1^T H_1)^{-1}H_2^T\alpha - e_2^T\alpha \quad \text{s.t.} \quad 0 \leq \alpha \leq C_1 e_2,$$
$$\min_{\vartheta} \ \frac{1}{2}\vartheta^T H_1(H_2^T H_2)^{-1}H_1^T\vartheta - e_1^T\vartheta \quad \text{s.t.} \quad 0 \leq \vartheta \leq C_2 e_1.$$
In the above formulas, α and ϑ are given as Lagrange multipliers. Then, we can obtain two nonparallel separating planes, β 1 and β 2 :
$$\beta_1 = -(H_1^T H_1 + \epsilon I)^{-1}H_2^T\alpha, \qquad \beta_2 = (H_2^T H_2 + \epsilon I)^{-1}H_1^T\vartheta.$$
We classify the new sample points x on the basis of the following decision function:
$$f(x) = \arg\min_{k=1,2} d_k(x) = \arg\min_{k=1,2} |\beta_k^T h(x)|,$$
where $|\cdot|$ denotes the shortest distance from the data point $x$ to the hyperplane $\beta_k$.
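As an illustration of how this decision rule is applied, the short Python sketch below (the names and the sign convention are ours; the dual solves that produce $\beta_1$ and $\beta_2$ are omitted) assigns a new sample to the class whose hyperplane is closer:

```python
import numpy as np

def telm_predict(h_x, beta1, beta2):
    """Assign the class whose hyperplane is closer, following the decision rule above.

    h_x   : hidden-layer output h(x) of a new sample (1D NumPy array)
    beta1 : output weights of the hyperplane fitted to the positive class
    beta2 : output weights of the hyperplane fitted to the negative class
    """
    d1 = np.abs(h_x @ beta1)   # proximity to the positive-class hyperplane
    d2 = np.abs(h_x @ beta2)   # proximity to the negative-class hyperplane
    return 1 if d1 <= d2 else -1
```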

2.2. LS-TELM

In order to accelerate the training of TELM and achieve more stable performance, LS-TELM was proposed [21]. The algorithm uses the least squares method to solve the original problem of TELM. In LS-TELM, the inequality constraints are replaced by equality constraints, and the $L_1$-norm penalty on the slack variables is replaced by the squared $L_2$-norm. LS-TELM is formulated as follows:
$$\min_{\beta_1, \xi} \ \frac{1}{2}\|H_1\beta_1\|_2^2 + \frac{C_1}{2}\xi^T\xi \quad \text{s.t.} \quad -H_2\beta_1 + \xi = e_2,$$
$$\min_{\beta_2, \eta} \ \frac{1}{2}\|H_2\beta_2\|_2^2 + \frac{C_2}{2}\eta^T\eta \quad \text{s.t.} \quad H_1\beta_2 + \eta = e_1,$$
where $H_1$ represents the hidden-layer output matrix of the positive-class sample points and $H_2$ represents the hidden-layer output matrix of the negative-class sample points. According to the constraint in Equation (8), $\xi$ can be expressed as $e_2 + H_2\beta_1$; substituting it into the objective function gives
$$\min_{\beta_1} \ \frac{1}{2}\|H_1\beta_1\|_2^2 + \frac{C_1}{2}\left(e_2 + H_2\beta_1\right)^T\left(e_2 + H_2\beta_1\right).$$
Setting the gradient with respect to β 1 equal to zero gives
$$(H_1^T H_1 + C_1 H_2^T H_2)\beta_1 + C_1 H_2^T e_2 = 0,$$
β 1 can be expressed as
$$\beta_1 = -C_1(H_1^T H_1 + C_1 H_2^T H_2)^{-1}H_2^T e_2.$$
Similarly, β 2 can be written as
$$\beta_2 = C_2(H_2^T H_2 + C_2 H_1^T H_1)^{-1}H_1^T e_1.$$
Having obtained the optimal values of $\beta_1$ and $\beta_2$, the separating hyperplanes
$$\beta_1^T h(x) = 0, \qquad \beta_2^T h(x) = 0$$
can be constructed. The data point $x$ is then assigned to one of the two classes according to the following decision function:
$$f(x) = \arg\min_{k=1,2} d_k(x) = \arg\min_{k=1,2} |\beta_k^T h(x)|,$$
where $|\cdot|$ is the perpendicular distance of the data point $x$ from the hyperplane $\beta_k$.
For more details, please refer to [21].
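As a concrete illustration of the two closed-form solutions for $\beta_1$ and $\beta_2$ above, the following Python sketch (illustrative names; the small ridge term is our addition for numerical stability, and the sign of $\beta_1$ follows the reconstruction above) computes both output-weight vectors with two linear solves:

```python
import numpy as np

def lstelm_fit(H1, H2, C1, C2, ridge=1e-8):
    """LS-TELM output weights via the two closed-form linear solves derived above.

    H1, H2 : hidden-layer output matrices of positive / negative class samples
    ridge  : small regularizer added for numerical stability (our assumption)
    """
    L = H1.shape[1]
    e1 = np.ones(H1.shape[0])
    e2 = np.ones(H2.shape[0])
    I = ridge * np.eye(L)
    beta1 = -C1 * np.linalg.solve(H1.T @ H1 + C1 * (H2.T @ H2) + I, H2.T @ e2)
    beta2 = C2 * np.linalg.solve(H2.T @ H2 + C2 * (H1.T @ H1) + I, H1.T @ e1)
    return beta1, beta2
```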

2.3. Welsch Loss Function

The Welsch loss is used to measure the error between the predicted value and the actual value. Compared to the mean squared error and the absolute error, the Welsch loss function is more robust and can better handle the effects of outliers. The Welsch loss function is expressed as
$$L(y, f(x)) = \frac{\sigma^2}{2}\left[1 - \exp\left(-\frac{(y - f(x))^2}{2\sigma^2}\right)\right],$$
where $y$ is the true value, $f(x)$ is the predicted value, and $\sigma$ is an adjustable parameter. As shown in Figure 1a, as the parameter is increased from 1 to 3, the upper bound of the Welsch loss gradually increases and the loss converges more slowly. From Equation (18), when $y - f(x)$ approaches infinity, $L(y, f(x))$ approaches its upper bound $\frac{\sigma^2}{2}$, which means that the influence of outliers on the model is bounded by the Welsch loss.
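A small numerical check of this boundedness, written as a hedged Python sketch (the function name and the sample residuals are ours):

```python
import numpy as np

def welsch_loss(residual, sigma=1.0):
    """Welsch loss of Equation (18); its value never exceeds sigma**2 / 2."""
    r = np.asarray(residual, dtype=float)
    return (sigma ** 2 / 2.0) * (1.0 - np.exp(-r ** 2 / (2.0 * sigma ** 2)))

# Small residuals behave like a scaled squared error; huge residuals saturate near 0.5:
print(welsch_loss([0.1, 1.0, 10.0, 1e6], sigma=1.0))
```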

2.4. RTELM

TELM is an excellent and powerful classification model with a wide range of research and applications in various fields. However, it uses the squared $L_2$-norm and the hinge loss function, so the effect of outliers is usually exaggerated and the robustness of TELM cannot be guaranteed. On this basis, RTELM was proposed; it replaces the $L_2$-norm distance metric and the hinge loss function with the capped $L_1$-norm distance metric and the adaptive capped $L_{\theta,\varepsilon}$ loss function. The formulation of RTELM is given below:
$$\min_{\beta_1} \ \sum_{i=1}^{m_1}\min\left(|\beta_1^T h(x_i)|, \varepsilon_1\right) + C_1\sum_{i=1}^{m_2}\min\left(\frac{(1+\theta)\xi_{1,i}^2}{|\xi_{1,i}|+\theta}, \varepsilon_2\right) \quad \text{s.t.} \quad -H_2\beta_1 + \xi_1 \geq e_2,$$
$$\min_{\beta_2} \ \sum_{i=1}^{m_2}\min\left(|\beta_2^T h(x_i)|, \varepsilon_3\right) + C_2\sum_{i=1}^{m_1}\min\left(\frac{(1+\theta)\xi_{2,i}^2}{|\xi_{2,i}|+\theta}, \varepsilon_4\right) \quad \text{s.t.} \quad H_1\beta_2 + \xi_2 \geq e_1,$$
where C 1 , C 2 > 0 are regularization parameters, e 1 R m 1 and e 2 R m 2 are vectors of ones, and ε 1 , ε 2 , ε 3 and ε 4 are thresholding parameters. To solve the above optimization problems (14) and (15) more efficiently, we can reformulate the problems as the following approximation problems through the reweighted method [32]:
$$\min_{\beta_1} \ \frac{1}{2}(H_1\beta_1)^T Q H_1\beta_1 + \frac{1}{2}C_1\xi_1^T U\xi_1 \quad \text{s.t.} \quad -H_2\beta_1 + \xi_1 \geq e_2,$$
$$\min_{\beta_2} \ \frac{1}{2}(H_2\beta_2)^T G H_2\beta_2 + \frac{1}{2}C_2\xi_2^T Z\xi_2 \quad \text{s.t.} \quad H_1\beta_2 + \xi_2 \geq e_1,$$
where $e_2 \in \mathbb{R}^{m_2}$ and $e_1 \in \mathbb{R}^{m_1}$ are vectors of ones, and $Q$, $G$, $U$ and $Z$ are four diagonal matrices whose $i$-th diagonal elements are
$$q_i = \begin{cases}\dfrac{1}{|\beta_1^T h(x_i)|}, & |\beta_1^T h(x_i)| \leq \varepsilon_1,\\ \text{small value}, & \text{otherwise},\end{cases} \qquad g_i = \begin{cases}\dfrac{1}{|\beta_2^T h(x_i)|}, & |\beta_2^T h(x_i)| \leq \varepsilon_3,\\ \text{small value}, & \text{otherwise},\end{cases}$$
$$u_i = \begin{cases}\dfrac{(1+\theta)(|\xi_{1,i}|+2\theta)}{2(|\xi_{1,i}|+\theta)^2}, & \dfrac{(1+\theta)\xi_{1,i}^2}{|\xi_{1,i}|+\theta} \leq \varepsilon_2,\\ \text{small value}, & \text{otherwise},\end{cases} \qquad z_i = \begin{cases}\dfrac{(1+\theta)(|\xi_{2,i}|+2\theta)}{2(|\xi_{2,i}|+\theta)^2}, & \dfrac{(1+\theta)\xi_{2,i}^2}{|\xi_{2,i}|+\theta} \leq \varepsilon_4,\\ \text{small value}, & \text{otherwise}.\end{cases}$$
According to optimization theory and duality theory, the Wolfe dual problems of (16) and (17) are obtained as follows:
$$\min_{\alpha \geq 0} \ \frac{1}{2}\alpha^T\left(H_2(H_1^T Q H_1)^{-1}H_2^T + \frac{1}{C_1}U^{-1}\right)\alpha - e_2^T\alpha.$$
Similarly,
$$\min_{\vartheta \geq 0} \ \frac{1}{2}\vartheta^T\left(H_1(H_2^T G H_2)^{-1}H_1^T + \frac{1}{C_2}Z^{-1}\right)\vartheta - e_1^T\vartheta,$$
where α , ϑ are the Lagrange multipliers.

2.5. Capped $L_{2,p}$-Norm

In TELM and other related areas, the $L_2$-norm is often applied in building the model, but the $L_2$-norm may magnify the negative effects of outliers. The $L_{2,p}$-norm is mainly used to enhance the robustness of the model by letting $p$ fall within the interval $(0, 2]$ [32,33]. Therefore, we can build different models for different problems by choosing the parameter $0 < p \leq 2$, which makes the $L_{2,p}$-norm more robust. For any vector $\alpha \in \mathbb{R}^n$ and parameter $0 < p \leq 2$, the $L_{2,p}$-norm and the capped $L_{2,p}$-norm are defined as
$$f_1(\alpha) = \left(\sum_{i=1}^n \alpha_i^2\right)^{\frac{p}{2}}, \qquad f_2(\alpha) = \min\left(\left(\sum_{i=1}^n \alpha_i^2\right)^{\frac{p}{2}}, \varepsilon\right),$$
where $\varepsilon \geq 0$ is the threshold parameter. From the above analysis and Figure 1, the robustness of the capped $L_{2,p}$-norm is stronger than that of the capped $L_1$-norm and the capped $L_2$-norm, of which it is a generalization and extension.
For more details, refer to [30].
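As a quick illustration of how the cap suppresses gross outliers, the following Python sketch (names and sample values are ours) evaluates $f_2$ for a typical point and for an outlier:

```python
import numpy as np

def capped_l2p(alpha, p=1.0, eps=1.0):
    """Capped L_{2,p} value of a vector: min(||alpha||_2 ** p, eps)."""
    return min(np.linalg.norm(alpha) ** p, eps)

print(capped_l2p(np.array([0.1, 0.2]), p=1.0, eps=1.0))    # small residual, left unchanged
print(capped_l2p(np.array([50.0, 80.0]), p=1.0, eps=1.0))  # gross outlier, capped at eps
```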

3. Main Contribution

In this section, we incorporate the Welsch loss and the capped $L_{2,p}$-norm into the TELM model and obtain the two proposed models, CWTELM and FCWTELM. At the same time, to verify the stability of the above models, we conduct a convergence analysis for them.

3.1. CWTELM

The primal problem of the model we built can be written as
$$\min_{\beta_1} \ \sum_{i=1}^{m_1}\min\left(\|\beta_1^T h(x_i)\|_2^p, \varepsilon_1\right) + C_1\sum_{i=1}^{m_2}\left[1 - \exp\left(-\frac{\xi_{1,i}^2}{2\sigma^2}\right)\right] \quad \text{s.t.} \quad -H_2\beta_1 + \xi_1 \geq e_2,$$
$$\min_{\beta_2} \ \sum_{i=1}^{m_2}\min\left(\|\beta_2^T h(x_i)\|_2^p, \varepsilon_2\right) + C_2\sum_{i=1}^{m_1}\left[1 - \exp\left(-\frac{\xi_{2,i}^2}{2\sigma^2}\right)\right] \quad \text{s.t.} \quad H_1\beta_2 + \xi_2 \geq e_1,$$
where $C_1 > 0$ and $C_2 > 0$ are regularization parameters, and $e_1 \in \mathbb{R}^{m_1}$ and $e_2 \in \mathbb{R}^{m_2}$ are vectors of ones.
To address the above issues, we let
$$R_1(\beta_1) = \sum_{i=1}^{m_1}\min\left(\|\beta_1^T h(x_i)\|_2^p, \varepsilon_1\right), \qquad L_1(\beta_1) = C_1\sum_{i=1}^{m_2}\left[1 - \exp\left(-\frac{\xi_{1,i}^2}{2\sigma^2}\right)\right].$$
Further, we note that minimizing $L_1(\beta_1)$ is equivalent to maximizing
$$I(\beta_1) = C_1\sum_{i=1}^{m_2}\exp\left(-\frac{\xi_{1,i}^2}{2\sigma^2}\right).$$
Similarly,
$$I(\beta_2) = C_2\sum_{i=1}^{m_1}\exp\left(-\frac{\xi_{2,i}^2}{2\sigma^2}\right).$$
Thus, (31) and (32) can be written as
$$\max_{\beta_1} \ I(\beta_1) - R(\beta_1), \qquad \max_{\beta_2} \ I(\beta_2) - R(\beta_2).$$
For easier computation, we define the function $g(v) = -v\log(-v) + v$ with $v_i < 0$, $v = (v_1, v_2, \ldots, v_m)$. Based on the theory of conjugate functions, we have
$$I(\beta_1) = \sup_{v < 0}\left[C_1\sum_{i=1}^{m_2}\left(v_i\frac{\xi_{1,i}^2}{2\sigma^2} - g(v_i)\right)\right],$$
where the supremum is attained at $v_i = -\exp\left(-\frac{\xi_{1,i}^2}{2\sigma^2}\right)$.
Then,
$$\max_{\beta_1, v < 0} \ M(\beta_1, v) = C_1\sum_{i=1}^{m_2}\left(v_i\frac{\xi_{1,i}^2}{2\sigma^2} - g(v_i)\right) - R(\beta_1).$$
Thus, the following formula holds true:
$$\min_{\beta_1} \ \sum_{i=1}^{m_1}\min\left(\|\beta_1^T h(x_i)\|_2^p, \varepsilon_1\right) + \frac{C_1}{2\sigma^2}\xi_1^T\Omega\xi_1 \quad \text{s.t.} \quad -H_2\beta_1 + \xi_1 \geq e_2,$$
$$\min_{\beta_2} \ \sum_{i=1}^{m_2}\min\left(\|\beta_2^T h(x_i)\|_2^p, \varepsilon_2\right) + \frac{C_2}{2\sigma^2}\xi_2^T\Omega\xi_2 \quad \text{s.t.} \quad H_1\beta_2 + \xi_2 \geq e_1.$$
In order to optimize the objective function smoothly, we will introduce concave duality in Theorem 1.
Theorem 1.
Let $g(\theta): \mathbb{R}^n \to \mathbb{R}$ be a continuous non-convex function, and suppose $h(\theta): \mathbb{R}^n \to \Omega \subseteq \mathbb{R}^n$ is a map with range $\Omega$. We assume that a concave function $\bar{g}(u)$ defined on $\Omega$ exists such that $g(\theta) = \bar{g}(h(\theta))$ holds. Then the non-convex function $g(\theta)$ can be expressed as [30]
$$g(\theta) = \inf_{v \in \mathbb{R}^n}\left[v^T h(\theta) - g^*(v)\right].$$
According to the concave duality theorem, the concave dual of $\bar{g}$ is written as
$$g^*(v) = \inf_{u \in \Omega}\left[v^T u - \bar{g}(u)\right].$$
Moreover, the minimum on the right-hand side of formula (28) is attained at
$$v^* = \frac{\partial\bar{g}(u)}{\partial u}\Big|_{u = h(\theta)}.$$
Proof. 
Thus, based on Theorem 1, we define a concave function $\bar{g}(\theta): \mathbb{R} \to \mathbb{R}$ such that, for arbitrary $\theta > 0$,
$$\bar{g}(\theta) = \min\left(\theta^{\frac{p}{2}}, \varepsilon\right).$$
Assuming $h(\mu) = \mu^2$, we can find that
$$\min\left(\|\beta^T h(x_i)\|_2^p, \varepsilon\right) = \bar{g}(h(\mu)),$$
where $\mu = \|\beta^T h(x_i)\|_2$.
Therefore, based on (45), problems (39) and (40) can be written as
$$\min_{\beta_1} \ \sum_{i=1}^{m_1}\bar{g}\left(\|\beta_1^T h(x_i)\|_2^2\right) + \frac{C_1}{2\sigma^2}\xi_1^T\Omega\xi_1 \quad \text{s.t.} \quad -H_2\beta_1 + \xi_1 \geq e_2,$$
$$\min_{\beta_2} \ \sum_{i=1}^{m_2}\bar{g}\left(\|\beta_2^T h(x_i)\|_2^2\right) + \frac{C_2}{2\sigma^2}\xi_2^T\Omega\xi_2 \quad \text{s.t.} \quad H_1\beta_2 + \xi_2 \geq e_1.$$
Let $\theta_1 = h(\mu_1) = \|\beta_1^T h(x_i)\|_2^2$; by Theorem 1, the first term of (46) can be expressed as
$$\min\left(\|\beta_1^T h(x_i)\|_2^p, \varepsilon_1\right) = \bar{g}\left(\|\beta_1^T h(x_i)\|_2^2\right) = \inf_{f_{ii} \geq 0}\left[f_{ii}h(\mu_1) - g^*(f_{ii})\right].$$
Here, the concave dual function of $\bar{g}(\theta_1)$ is
$$g^*(f_{ii}) = \inf_{\theta_1}\left[f_{ii}\theta_1 - \bar{g}(\theta_1)\right] = \begin{cases}\inf_{\theta_1}\left[f_{ii}\theta_1 - \theta_1^{\frac{p}{2}}\right], & \theta_1^{\frac{p}{2}} < \varepsilon_1,\\ \inf_{\theta_1}\left[f_{ii}\theta_1 - \varepsilon_1\right], & \theta_1^{\frac{p}{2}} \geq \varepsilon_1.\end{cases}$$
After optimizing over $\theta_1$ in Equation (49), we have
$$g^*(f_{ii}) = \begin{cases} f_{ii}\left(\frac{2}{p}f_{ii}\right)^{\frac{2}{p-2}} - \left(\frac{2}{p}f_{ii}\right)^{\frac{p}{p-2}}, & \theta_1^{\frac{p}{2}} < \varepsilon_1,\\ f_{ii}\,\varepsilon_1^{\frac{2}{p}} - \varepsilon_1, & \theta_1^{\frac{p}{2}} \geq \varepsilon_1.\end{cases}$$
Therefore, the objective function (39) can be further written as
$$\begin{aligned}
&\min_{\beta_1}\ \sum_{i=1}^{m_1}\min\left(\|\beta_1^T h(x_i)\|_2^p, \varepsilon_1\right) + \frac{C_1}{2\sigma^2}\xi_1^T\Omega\xi_1\\
\Longleftrightarrow\ &\min_{\beta_1}\ \sum_{i=1}^{m_1}\inf_{f_{ii}\geq 0} L_i(\beta_1, f_{ii}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\xi_1^T\Omega\xi_1\\
\Longleftrightarrow\ &\min_{\beta_1, f_{ii}\geq 0}\ \sum_{i=1}^{m_1} L_i(\beta_1, f_{ii}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\xi_1^T\Omega\xi_1,
\end{aligned}$$
where
$$L_i(\beta_1, f_{ii}, \varepsilon_1) = \begin{cases} f_{ii}\theta_1 - f_{ii}\left(\frac{2}{p}f_{ii}\right)^{\frac{2}{p-2}} + \left(\frac{2}{p}f_{ii}\right)^{\frac{p}{p-2}}, & \theta_1^{\frac{p}{2}} < \varepsilon_1,\\ f_{ii}\theta_1 - f_{ii}\,\varepsilon_1^{\frac{2}{p}} + \varepsilon_1, & \theta_1^{\frac{p}{2}} \geq \varepsilon_1.\end{cases}$$
Similarly, let $\theta_2 = h(\mu_2) = \|\beta_2^T h(x_i)\|_2^2$ and let $g^*(k_{ii})$ denote the concave dual function of $\bar{g}(\theta_2)$; then formula (40) can be written as
$$\begin{aligned}
&\min_{\beta_2}\ \sum_{i=1}^{m_2}\min\left(\|\beta_2^T h(x_i)\|_2^p, \varepsilon_3\right) + \frac{C_2}{2\sigma^2}\xi_2^T\Omega\xi_2\\
\Longleftrightarrow\ &\min_{\beta_2}\ \sum_{i=1}^{m_2}\inf_{k_{ii}\geq 0} L_i(\beta_2, k_{ii}, \varepsilon_3) + \frac{C_2}{2\sigma^2}\xi_2^T\Omega\xi_2\\
\Longleftrightarrow\ &\min_{\beta_2, k_{ii}\geq 0}\ \sum_{i=1}^{m_2} L_i(\beta_2, k_{ii}, \varepsilon_3) + \frac{C_2}{2\sigma^2}\xi_2^T\Omega\xi_2,
\end{aligned}$$
where
$$L_i(\beta_2, k_{ii}, \varepsilon_3) = \begin{cases} k_{ii}\theta_2 - k_{ii}\left(\frac{2}{p}k_{ii}\right)^{\frac{2}{p-2}} + \left(\frac{2}{p}k_{ii}\right)^{\frac{p}{p-2}}, & \theta_2^{\frac{p}{2}} < \varepsilon_3,\\ k_{ii}\theta_2 - k_{ii}\,\varepsilon_3^{\frac{2}{p}} + \varepsilon_3, & \theta_2^{\frac{p}{2}} \geq \varepsilon_3.\end{cases}$$
The objective functions (52) and (54) are solved by alternately updating the classifier parameters and the weight variables. The gradient of the function $\bar{g}(\theta)$ with respect to $\theta$ is
$$\frac{\partial\bar{g}(\theta)}{\partial\theta} = \begin{cases}\frac{p}{2}\theta^{\frac{p}{2}-1}, & 0 < \theta < \varepsilon^{\frac{2}{p}},\\ 0, & \theta > \varepsilon^{\frac{2}{p}}.\end{cases}$$
If $\theta_1 = h(\mu_1) = \|\beta_1^T h(x_i)\|_2^2$, we can obtain
$$f_{ii} = \frac{\partial\bar{g}(\theta_1)}{\partial\theta_1}\Big|_{\theta_1 = \|\beta_1^T h(x_i)\|_2^2} = \begin{cases}\frac{p}{2}\|\beta_1^T h(x_i)\|_2^{p-2}, & 0 < \|\beta_1^T h(x_i)\|_2^2 < \varepsilon_1,\\ 0, & \text{else}.\end{cases}$$
Likewise, if $\theta_2 = h(\mu_2) = \|\beta_2^T h(x_i)\|_2^2$, we can obtain
$$k_{ii} = \frac{\partial\bar{g}(\theta_2)}{\partial\theta_2}\Big|_{\theta_2 = \|\beta_2^T h(x_i)\|_2^2} = \begin{cases}\frac{p}{2}\|\beta_2^T h(x_i)\|_2^{p-2}, & 0 < \|\beta_2^T h(x_i)\|_2^2 < \varepsilon_3,\\ 0, & \text{else}.\end{cases}$$
To understand the relationship between the parameters more clearly, let $X$ denote the distance from sample $x_i$ to the hyperplane. If $X > \varepsilon_1$, then $f_{ii}$ is almost 0, and the sample $x_i$ is considered an outlier and discarded; $k_{ii}$ behaves analogously to $f_{ii}$. When the variables $f_{ii}$ and $k_{ii}$ are fixed, in order to solve for the classifier parameters $\beta_1$ and $\beta_2$, the optimization problems (39) and (40) can be written as
$$\min_{\beta_1}\ \sum_{i=1}^{m_1} f_{ii}\|\beta_1^T h(x_i)\|_2^2 + \frac{C_1}{2\sigma^2}\xi_1^T\Omega\xi_1 \quad \text{s.t.} \quad -H_2\beta_1 + \xi_1 \geq e_2,$$
$$\min_{\beta_2}\ \sum_{i=1}^{m_2} k_{ii}\|\beta_2^T h(x_i)\|_2^2 + \frac{C_2}{2\sigma^2}\xi_2^T\Omega\xi_2 \quad \text{s.t.} \quad H_1\beta_2 + \xi_2 \geq e_1.$$
Let $F = \mathrm{diag}(f_{11}, f_{22}, f_{33}, \ldots, f_{m_1 m_1})$ be an $m_1 \times m_1$ diagonal matrix and $K = \mathrm{diag}(k_{11}, k_{22}, k_{33}, \ldots, k_{m_2 m_2})$ be an $m_2 \times m_2$ diagonal matrix, so that (39) and (40) are equivalent to
$$\min_{\beta_1}\ (H_1\beta_1)^T F(H_1\beta_1) + \frac{C_1}{2\sigma^2}\xi_1^T\Omega\xi_1 \quad \text{s.t.} \quad -H_2\beta_1 + \xi_1 \geq e_2,$$
$$\min_{\beta_2}\ (H_2\beta_2)^T K(H_2\beta_2) + \frac{C_2}{2\sigma^2}\xi_2^T\Omega\xi_2 \quad \text{s.t.} \quad H_1\beta_2 + \xi_2 \geq e_1.$$
The Lagrangian function of the above optimization problem (60) can be written as
$$L(\beta_1, \xi_1) = \frac{1}{2}(H_1\beta_1)^T F(H_1\beta_1) + \frac{C_1}{2\sigma^2}\xi_1^T\Omega\xi_1 - \alpha^T\left(-H_2\beta_1 + \xi_1 - e_2\right),$$
where $\alpha$ is the vector of Lagrange multipliers. Differentiating the Lagrangian function with respect to $\beta_1$ and $\xi_1$ yields the following Karush–Kuhn–Tucker (KKT) conditions:
$$\begin{aligned}
&\frac{\partial L}{\partial\beta_1} = H_1^T F H_1\beta_1 + H_2^T\alpha = 0, &(i)\\
&\frac{\partial L}{\partial\xi_1} = \frac{C_1}{\sigma^2}\Omega\xi_1 - \alpha = 0, &(ii)\\
&\alpha^T\left(-H_2\beta_1 + \xi_1 - e_2\right) = 0, &(iii)\\
&\alpha \geq 0. &(iv)
\end{aligned}$$
By combining formulas (i) and (ii), we can obtain
$$\beta_1 = -(H_1^T F H_1)^{-1}H_2^T\alpha.$$
Similarly, we can also obtain $\xi_1 = \frac{\sigma^2}{C_1}\Omega^{-1}\alpha$, so the dual problem of (60) is
$$\min_{\alpha \geq 0}\ \frac{1}{2}\alpha^T\left(H_2(H_1^T F H_1)^{-1}H_2^T + \frac{\sigma^2}{C_1}\Omega^{-1}\right)\alpha - e_2^T\alpha.$$
At the same time, the dual problem of (61) is as follows:
$$\min_{\vartheta \geq 0}\ \frac{1}{2}\vartheta^T\left(H_1(H_2^T K H_2)^{-1}H_1^T + \frac{\sigma^2}{C_2}\Omega^{-1}\right)\vartheta - e_1^T\vartheta,$$
where $\alpha$ and $\vartheta$ are the Lagrange multipliers.
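The alternating scheme behind CWTELM repeatedly recomputes the capped-$L_{2,p}$ weights $f_{ii}$ and the Welsch-related diagonal matrix $\Omega$ from the current solution and then re-solves the resulting weighted subproblems. The Python sketch below shows only this reweighting step; the function names, the guard against division by zero, and the assumption that the diagonal entries of $\Omega$ equal $\exp(-\xi_{1,i}^2/(2\sigma^2))$ (as suggested by the half-quadratic reformulation above) are ours, and the dual solve itself is delegated to any QP solver:

```python
import numpy as np

def capped_l2p_weights(H1, beta1, p, eps1):
    """Diagonal matrix F with f_ii = (p/2)*|beta1^T h(x_i)|**(p-2),
    zeroed once the squared projection exceeds the cap eps1 (outlier rejected)."""
    proj = H1 @ beta1                          # beta1^T h(x_i) for every sample
    guard = np.maximum(np.abs(proj), 1e-8)     # avoid division by zero for p < 2
    f = 0.5 * p * guard ** (p - 2.0)
    f[proj ** 2 >= eps1] = 0.0
    return np.diag(f)

def welsch_weights(xi, sigma):
    """Diagonal matrix Omega with entries exp(-xi_i**2 / (2*sigma**2))."""
    return np.diag(np.exp(-xi ** 2 / (2.0 * sigma ** 2)))
```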

3.2. FCWTELM

To reduce the computational complexity of CWTELM, FCWTELM replaces the inequality constraints of CWTELM with equality constraints:
$$\min_{\beta_1}\ \sum_{i=1}^{m_1}\min\left(\|\beta_1^T h(x_i)\|_2^p, \varepsilon_1\right) + C_1\sum_{i=1}^{m_2}\left[1 - \exp\left(-\frac{\xi_{1,i}^2}{2\sigma^2}\right)\right] \quad \text{s.t.} \quad -H_2\beta_1 + \xi_1 = e_2,$$
$$\min_{\beta_2}\ \sum_{i=1}^{m_2}\min\left(\|\beta_2^T h(x_i)\|_2^p, \varepsilon_2\right) + C_2\sum_{i=1}^{m_1}\left[1 - \exp\left(-\frac{\xi_{2,i}^2}{2\sigma^2}\right)\right] \quad \text{s.t.} \quad H_1\beta_2 + \xi_2 = e_1.$$
Handling the second term in the same way as in CWTELM, (67) and (68) can be further written as
$$\min_{\beta_1}\ \sum_{i=1}^{m_1}\min\left(\|\beta_1^T h(x_i)\|_2^p, \varepsilon_1\right) + \frac{C_1}{2\sigma^2}\xi_1^T\Omega\xi_1 \quad \text{s.t.} \quad -H_2\beta_1 + \xi_1 = e_2,$$
$$\min_{\beta_2}\ \sum_{i=1}^{m_2}\min\left(\|\beta_2^T h(x_i)\|_2^p, \varepsilon_2\right) + \frac{C_2}{2\sigma^2}\xi_2^T\Omega\xi_2 \quad \text{s.t.} \quad H_1\beta_2 + \xi_2 = e_1.$$
Substituting the equality constraints into the objective functions, we obtain
$$\min_{\beta_1}\ \sum_{i=1}^{m_1}\min\left(\|\beta_1^T h(x_i)\|_2^p, \varepsilon_1\right) + C_1\left(e_2 + H_2\beta_1\right)^T\Omega_1\left(e_2 + H_2\beta_1\right),$$
$$\min_{\beta_2}\ \sum_{i=1}^{m_2}\min\left(\|\beta_2^T h(x_i)\|_2^p, \varepsilon_2\right) + C_2\left(e_1 - H_1\beta_2\right)^T\Omega_2\left(e_1 - H_1\beta_2\right).$$
Further, similar to the handling of CWTELM, we can obtain
$$\min_{\beta_1}\ (H_1\beta_1)^T F(H_1\beta_1) + \frac{C_1}{2\sigma^2}\xi_1^T\Omega\xi_1,$$
$$\min_{\beta_2}\ (H_2\beta_2)^T K(H_2\beta_2) + \frac{C_2}{2\sigma^2}\xi_2^T\Omega\xi_2.$$
Setting the derivative of (73) with respect to $\beta_1$ to zero gives
$$2H_1^T F H_1\beta_1 + C_1 H_2^T\Omega_1 e_2 + C_1 H_2^T\Omega_1 H_2\beta_1 = 0.$$
So,
$$\beta_1 = -\left(2H_1^T F H_1 + C_1 H_2^T\Omega_1 H_2\right)^{-1}C_1 H_2^T\Omega_1 e_2.$$
Similarly,
$$\beta_2 = \left(2H_2^T K H_2 + C_2 H_1^T\Omega_2 H_1\right)^{-1}C_2 H_1^T\Omega_2 e_1.$$
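Because each FCWTELM update only needs two linear solves, one outer iteration can be sketched compactly in Python (illustrative names; the diagonal weight matrices $F$, $K$, $\Omega_1$, $\Omega_2$ are assumed to have been formed as in the reweighting sketch above, the ridge term is our addition for invertibility, and the sign of $\beta_1$ follows the reconstruction of the closed-form expression above):

```python
import numpy as np

def fcwtelm_step(H1, H2, F, K, Omega1, Omega2, C1, C2, ridge=1e-8):
    """One FCWTELM update: closed-form beta1 and beta2 from weighted linear systems."""
    L = H1.shape[1]
    e1 = np.ones(H1.shape[0])
    e2 = np.ones(H2.shape[0])
    I = ridge * np.eye(L)
    A1 = 2.0 * H1.T @ F @ H1 + C1 * H2.T @ Omega1 @ H2 + I
    beta1 = np.linalg.solve(A1, -C1 * H2.T @ Omega1 @ e2)
    A2 = 2.0 * H2.T @ K @ H2 + C2 * H1.T @ Omega2 @ H1 + I
    beta2 = np.linalg.solve(A2, C2 * H1.T @ Omega2 @ e1)
    return beta1, beta2
```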

3.3. Convergence Analysis

Lemma 1.
For any scalar $t$, when $0 < p \leq 2$, the inequality $2|t|^p - pt^2 + p - 2 \leq 0$ always holds.
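For intuition, Lemma 1 can be checked with a short one-variable argument (added here for readability; it is not part of the original text): let $\phi(t) = 2|t|^p - pt^2 + p - 2$. For $t > 0$, $\phi'(t) = 2p(t^{p-1} - t)$, which is non-negative on $(0,1)$ and non-positive on $(1,\infty)$ when $0 < p \leq 2$; hence $\phi$ attains its maximum at $t = 1$, where $\phi(1) = 0$, so $\phi(t) \leq 0$ for all $t$.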
Lemma 2.
For arbitrary $x, y \in \mathbb{R}^n$, if $f(x) = \|x\|_2 - \frac{\|x\|_2^2}{2\|y\|_2}$, then the inequality $f(x) \leq f(y)$ always holds.
Lemma 3.
For any non-zero vectors $\alpha$, $\beta$, when $0 < p \leq 2$, the inequality
$$\|\alpha\|_2^p - \frac{p}{2}\|\beta\|_2^{p-2}\|\alpha\|_2^2 \leq \|\beta\|_2^p - \frac{p}{2}\|\beta\|_2^{p-2}\|\beta\|_2^2$$
is always established.
Theorem 2.
Algorithm 1 monotonically decreases the objective values of problems (49) and (50), respectively, in each iteration.
Algorithm 1 Training CWTELM.
Input: Training set $T_1 = \{(x_i, y_i)\}_{i=1}^{l}$, where $x_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$; activation function $G(x)$; number of hidden nodes $L$; parameters $C_1$, $C_2$, $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$, $\varepsilon_4$, $\delta_1$ and $\delta_2$.
Output: $\beta_1^*$ and $\beta_2^*$.
Process:
1. Initialize $F \in \mathbb{R}^{m_1\times m_1}$ and $Q \in \mathbb{R}^{m_2\times m_2}$; $K \in \mathbb{R}^{m_2\times m_2}$ and $U \in \mathbb{R}^{m_1\times m_1}$;
2. Compute the Lagrange multipliers $\alpha$ and $\beta$ by solving the dual problems;
3. Calculate $Z_1 = (H^T F H + C_3 I)^{-1}E^T\alpha$ and $Z_2 = (E^T K E + C_4 I)^{-1}H^T\beta$;
4. Update the matrices $Q$, $U$, $F$ and $K$ accordingly, and repeat until convergence.
Proof. 
Let
$$J = \min_{\beta_1}\ \sum_{i=1}^{m_1}\min\left(\|\beta_1^T h(x_i)\|_2^p, \varepsilon_1\right) + C_1\sum_{i=1}^{m_2}\min\left[1 - \exp\left(-\frac{\xi_{1,i}^2}{2\sigma^2}\right), \varepsilon_2\right].$$
When $\|\beta_1^T h(x_i)\|_2^p < \varepsilon_1$ and $1 - \exp\left(-\frac{\xi_{1,i}^2}{2\sigma^2}\right) < \varepsilon_2$,
$$J = \min_{\beta_1, \xi_{1,i}}\ \sum_{i=1}^{m_1}\|\beta_1^T h(x_i)\|_2^p + C_1\sum_{i=1}^{m_2}\left[1 - \exp\left(-\frac{\xi_{1,i}^2}{2\sigma^2}\right)\right].$$
Let $J = J_1 + J_2$, where
$$J_1 = \min_{\beta_1}\ \sum_{i=1}^{m_1}\|\beta_1^T h(x_i)\|_2^p, \qquad J_2 = C_1\min\ \sum_{i=1}^{m_2}\left[1 - \exp\left(-\frac{\xi_{1,i}^2}{2\sigma^2}\right)\right].$$
For $J_1$, assume that the solution at the $(k+1)$-th iteration of Algorithm 1 is
$$\beta_1^{(k+1)} = \arg\min_{\beta_1}\ \frac{1}{2}(H_1\beta_1)^T F^{(k)}(H_1\beta_1).$$
Apparently, Algorithm 1 satisfies the following inequality at iteration $k$:
$$\left(H_1\beta_1^{(k+1)}\right)^T F^{(k)}\left(H_1\beta_1^{(k+1)}\right) \leq \left(H_1\beta_1^{(k)}\right)^T F^{(k)}\left(H_1\beta_1^{(k)}\right),$$
which reduces, for each sample, to
$$\frac{p}{2}\,\frac{\left\|\left(\beta_1^T h(x_i)\right)^{(k+1)}\right\|_2^2}{\left\|\left(\beta_1^T h(x_i)\right)^{(k)}\right\|_2^{2-p}} \leq \frac{p}{2}\,\frac{\left\|\left(\beta_1^T h(x_i)\right)^{(k)}\right\|_2^2}{\left\|\left(\beta_1^T h(x_i)\right)^{(k)}\right\|_2^{2-p}}.$$
Based on Lemma 3, we obtain
$$\left\|\left(\beta_1^T h(x_i)\right)^{(k+1)}\right\|_2^p - \frac{p}{2}\left\|\left(\beta_1^T h(x_i)\right)^{(k)}\right\|_2^{p-2}\left\|\left(\beta_1^T h(x_i)\right)^{(k+1)}\right\|_2^2 \leq \left\|\left(\beta_1^T h(x_i)\right)^{(k)}\right\|_2^p - \frac{p}{2}\left\|\left(\beta_1^T h(x_i)\right)^{(k)}\right\|_2^{p-2}\left\|\left(\beta_1^T h(x_i)\right)^{(k)}\right\|_2^2.$$
Combining (85) and (86), we have
$$\left\|\left(\beta_1^T h(x_i)\right)^{(k+1)}\right\|_2^p \leq \left\|\left(\beta_1^T h(x_i)\right)^{(k)}\right\|_2^p.$$
Thus, $J_1$ is convergent. Next, we discuss the convergence of $g(v) = 1 - \exp(-v^2) = 1 - \exp\left(-\frac{\xi_{1,i}^2}{2\sigma^2}\right)$, where $v = \frac{\xi_{1,i}}{\sqrt{2}\sigma}$. There exists a convex function $\psi(s)$ such that $g(v) = \inf_{s>0}\left[\frac{1}{2}sv^2 + \psi(s)\right]$, and when $v$ is fixed, the minimum $s^*$ satisfies
$$g(v) = \inf_{s>0}\left[\frac{1}{2}sv^2 + \psi(s)\right] = \frac{1}{2}s^*v^2 + \psi(s^*),$$
where $s^* = 2\exp(-v^2)$, so $L(h(x)) = c\lambda\inf_{s>0}\left(\frac{s\,\xi_{1,i}^2}{4\sigma^2} + \psi(s)\right)$.
The above formula is converted into
$$\begin{aligned}
&\min_{\beta_1}\ \sum_{i=1}^{m_1}\min\left(\|\beta_1^T h(x_i)\|_2^p, \varepsilon_1\right) + c\lambda\sum_{i=1}^{N}\inf_{s_i>0}\left(\frac{s_i\,\xi_{1,i}^2}{4\sigma^2} + \psi(s_i)\right)\\
\Longleftrightarrow\ &\min_{\beta_1}\ \sum_{i=1}^{m_1}\min\left(\|\beta_1^T h(x_i)\|_2^p, \varepsilon_1\right) + c\lambda\sum_{i=1}^{N}\left(\frac{s_i\,\xi_{1,i}^2}{4\sigma^2} + \psi(s_i)\right).
\end{aligned}$$
The above problem is solved by an alternating iterative algorithm. Specifically, in the $k$-th iteration, we substitute $s^{(k)}$ into problem (89):
$$\min_{\beta}\ \|\beta\|_2^2 + \sum_{i=1}^{N}c_i\|\xi_{1,i}\|_1,$$
where $c_i = \frac{c\lambda s_i^{(k-1)}}{4\sigma^2}$ and $c_i > 0$. By introducing the relaxation variable $\xi$ in Equation (90), the optimization problem becomes
$$\min_{\beta_1, \xi_1}\ \|\beta\|_2^2 + \sum_{i=1}^{N}c_i\xi_i^2 \quad \text{s.t.} \quad \xi_{1,i} \geq 0, \ i = 1, \ldots, N.$$
The optimal solution $\beta^{(k)}$ can be obtained by solving (85) and then substituting it into (91):
$$\min_{s>0}\ \sum_{i=1}^{N}\left(\frac{s_i\,\xi_{1,i}^2}{4\sigma^2} + \psi(s_i)\right).$$
According to Theorem 1, we obtain the minimal solution
$$s_i^{(k)} = 2\exp\left(-\frac{\xi_{1,i}^2}{2\sigma^2}\right), \quad i = 1, \ldots, N.$$
From (84), (85) and Lemma 2, we can obtain $A_1(\beta, s) \geq A(\beta) \geq 0$, so the sequence is lower bounded. Assuming that $\beta^{(k)}$ and $s^{(k)}$ are obtained after $k$ iterations, we use $s^{(k)}$ to optimize formula (93) with respect to $\beta$:
$$A_1\left(\beta^{(k)}, s^{(k)}\right) \geq A_1\left(\beta^{(k+1)}, s^{(k)}\right),$$
and $\beta^{(k+1)}$ is used to optimize formula (64) with respect to $s$:
$$A_1\left(\beta^{(k+1)}, s^{(k)}\right) \geq A_1\left(\beta^{(k+1)}, s^{(k+1)}\right).$$
Combining the above inequalities, we have
$$A_1\left(\beta^{(k)}, s^{(k)}\right) \geq A_1\left(\beta^{(k+1)}, s^{(k+1)}\right).$$
Therefore, $J_2$ is convergent, and thus the whole sequence is convergent. □

4. Experimental

4.1. Experiments Setup

ELM has the merits of rapid learning, strong approximation ability and excellent generalization, in both regression and multi-class classification. To judge the performance of our proposed CWTELM and FCWTELM, we systematically compare them with other traditional methods, including ELM, the Correntropy-based Robust Extreme Learning Machine (CHELM), TELM, the Capped $L_1$-norm Twin Support Vector Machine (CTSVM) and RTELM. For CTSVM, FCWTELM and CWTELM, we stop the iteration process when the difference of the objective value between two consecutive iterations is less than 0.001 or the number of iterations exceeds 50. The parameters of all the above algorithms are selected as $\varepsilon_1 = \varepsilon_2 = 10^{-4}$; $c, c_1, c_2, c_3, c_4 \in \{10^i \mid i = -5, -4, \ldots, 4, 5\}$; $\sigma \in \{10^i \mid i = -4, -3, \ldots, 3, 4\}$; $\lambda \in \{10^i \mid i = -7, -6, -5, \ldots, 5, 6, 7\}$; $k \in \{3, 7, 9, 11\}$; $\eta \in \{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 1.5, 2\}$; and $L \in \{50, 100, 200, 400, 500, 1000, 2000\}$. We selected the optimal values of the parameters $c$, $c_1$, $c_2$, $\lambda$, $\sigma$, and $L$ by 10-fold cross-validation and grid search. ELM, TELM, CHELM, RTELM, CTSVM, CWTELM and FCWTELM use the activation function $1/(1 + \exp(-(w \cdot x + b)))$, where $w$ and $b$ are randomly generated. Meanwhile, we measure the classification performance of all algorithms by the accuracy (ACC), which is expressed as [23]
$$\text{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$
where TP denotes true positives, TN true negatives, FN false negatives, and FP false positives. Furthermore, the computational efficiency of each algorithm is measured by its learning time. All experiments are implemented in MATLAB 2021a and run on a system with an 11th Gen Intel (R) Core (TM) i5-11357G7 processor (2.40 GHz) and 16 GB of memory.

4.2. Artificial Dataset

Since our proposed algorithms are mainly used to solve binary classification problems, to verify the effectiveness of FCWTELM and CWTELM we generated a binary classification dataset based on a Gaussian distribution. First, 100 artificial data samples are grouped into two classes, one positive and one negative; the positive class is denoted by + and the other by *. Since outliers significantly influence the classification performance of a model, nine outliers were inserted to compare the robustness of ELM, CHELM, TELM, CTSVM, RTELM, CWTELM and FCWTELM. The nine outliers were divided into four positive and five negative ones, as shown in Figure 2.

4.3. UCI Dataset

The UCI repository (http://archive.ics.uci.edu/ml/datasets.html (accessed on 28 December 2022)) is one of the most widely used collections of standard datasets, and it provides a relatively standard basis that makes model comparisons more objective. Since data acquisition and processing is a time-consuming and laborious task, with limited time and resources we used eight datasets: Australian, Balance, Vote, Cancer, Wholesale, QSAR, Pima, and WDBC. As shown in Table 3, they represent different data types and characteristics, and such a selection also makes the study results more reliable; we will follow up with experiments on more datasets. To verify the classification behavior of our two models, we conducted a series of experiments on the above datasets. Considering that noise is an important factor in measuring the robustness of an algorithm, we study these eight datasets under different noise rates; if the classification accuracy varies smoothly across different noise rates, the algorithm shows good robustness.

4.4. Experimental Results on the UCI Datasets without Outliers

In this part, to test the classification behavior of CWTELM, FCWTELM and the other related algorithms, we ran the eight UCI datasets on these algorithms. In Table 4, all test results are based on the optimal parameters; the time (s) represents the average running time obtained by each algorithm under the optimal parameters, and the accuracy (ACC) represents the average classification accuracy. As can be seen from Table 4, from a classification point of view, CWTELM performs better than the other six algorithms on all datasets. In many cases, the ACC of FCWTELM is among the best, and its average running time is shorter. By analyzing the above test results, we can conclude that using the $L_{2,p}$-norm in the TELM framework promotes classification performance. Thus, the proposed CWTELM and FCWTELM are valid supervised algorithms in the absence of outliers.

4.5. Robustness against Outliers

As seen in the previous subsection, CWTELM and FCWTELM have good classification performance. To further test their robustness to outliers, we considered three scenarios with noise levels $M = 0.1$, $M = 0.2$ and $M = 0.25$.
All test results under the noise levels $M = 0.1$, $M = 0.2$ and $M = 0.25$ are reported in Table 5, Table 6 and Table 7, respectively, which show the experimental results of the seven algorithms on the eight datasets with the 0.1, 0.2 and 0.25 noise levels. It is obvious that the accuracy of the seven algorithms decreases after the introduction of outliers. Nevertheless, except in a few cases, the accuracy of CWTELM is still better than that of the other six algorithms. In many cases, the classification accuracy of FCWTELM is among the best of the seven algorithms, and its running time is the shortest.
We take four UCI datasets, Vote, WDBC, Cancer, and Australian, as examples and draw line charts of the above seven algorithms under different Gaussian noise rates, shown in Figure 3. It can be seen intuitively that the accuracies of CWTELM and FCWTELM change more smoothly than those of the other five algorithms when the noise increases. Summarizing the above experimental results, we reach the following conclusion: in the presence of noise, CWTELM and FCWTELM still maintain their advantages in classification performance and robustness. In addition, the classification performance of CWTELM and FCWTELM differs little; next, we use statistical analysis to further verify their accuracy.

4.6. Experimental Results on Artificial Dataset with Outliers

Through tests on eight UCI datasets, we have confirmed that CWTELM and FCWTELM have better classification performance and robustness. To further explore the advantages of both algorithms, we again validate their accuracy on the artificial dataset; the results are shown in Figure 4. Observing Figure 4, we find that, on the artificial dataset with outliers, the classification accuracies of the seven frameworks vary from low to high in roughly the same order as on the UCI datasets. The classification accuracies of the seven models on the artificial dataset are, respectively, ELM 61.3%, CHELM 65.6%, TELM 67.0%, CTSVM 78%, RTELM 80%, CWTELM 84%, and FCWTELM 86%. This further confirms that the $L_{2,p}$-norm measure and the Welsch loss have significant positive effects on robustness and classification performance.
To verify the effect of the parameter $p$ on the models' performance, we present the parameter analysis results in Figure 5. It can be seen from Figure 5 that the proposed methods are only mildly affected by this parameter. In addition, we examine the accuracy of CWTELM and FCWTELM for different values of $p$; in Figure 5, we can see that many of the better accuracies are not achieved at $p = 1$ or $p = 2$, so introducing the $L_{2,p}$-norm metric is a sensible choice.

4.7. Statistical Analysis

In this section, the Friedman test [34] is applied to analyze the significant differences among the above algorithms across the eight UCI datasets. The Friedman test compares multiple related groups over paired observations. Its null hypothesis is that there is no difference between the groups, i.e., that the median observed values are equal across all groups. If the observed difference between groups is significant, the null hypothesis can be rejected, and we conclude that at least one group is significantly different. When the null hypothesis is rejected, we can perform a Nemenyi test [34]. The Nemenyi test is a post hoc method used to determine whether significant differences exist between multiple independent groups; it compares the groups pairwise and determines which of them differ significantly. We calculated the average accuracy and average ranking of the seven algorithms on the eight datasets, which are reported in Table 8. Taking 20% Gaussian noise as an example, we can calculate the Friedman statistic using the following formula:
$$\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right] = 32.21,$$
where $N$ and $k$ represent the number of UCI datasets and the number of algorithms, respectively, and $R_j$ is the average rank of the $j$-th algorithm on the datasets used. In this paper, $k = 7$ and $N = 8$. Moreover, based on the $\chi_F^2$-distribution with $(k-1)$ degrees of freedom, we can obtain
$$F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2} = 14.277,$$
where $F_F$ follows the $F$-distribution with $(k-1)$ and $(k-1)(N-1)$ degrees of freedom. Furthermore, for $\alpha = 0.05$, we obtain $F_\alpha(6, 42) = 2.324$. Clearly, $F_F > F_\alpha$, and therefore we can reject the null hypothesis. From Table 8, we can see that the average rankings of CWTELM and FCWTELM are much lower than those of the other algorithms, meaning that our CWTELM and FCWTELM are more effective. In addition, we further compared the seven algorithms by the Nemenyi post hoc test. When the average rank difference between two algorithms is less than the critical value, the difference in performance between the two algorithms is not significant; otherwise, it is significant. By dividing the Studentized range statistic by $\sqrt{2}$, we can obtain $q_{\alpha=0.05} = 2.949$. Therefore, we calculate the critical difference (CD) using the following formula:
$$CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}} = 2.949 \times \sqrt{\frac{7(7+1)}{6 \times 10}} = 2.5858.$$
As shown in Figure 6, CWTELM and FCWTELM perform significantly better than ELM, CHELM, TELM, CTSVM and RTELM in classification. It can further be seen that there is no significant difference between CWTELM and FCWTELM, as their rank difference is smaller than the CD value. Therefore, the statistical analysis confirms that the proposed CWTELM and FCWTELM methods have better performance.
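For readers who wish to reproduce this two-stage procedure, the Python sketch below (using SciPy's friedmanchisquare on a purely hypothetical accuracy matrix, together with the CD formula above) illustrates the computation; it is not the script used to produce the reported numbers:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical accuracy matrix: rows = 8 datasets, columns = 7 algorithms.
acc = np.random.default_rng(0).uniform(70, 95, size=(8, 7))

# Omnibus Friedman test across the algorithms.
stat, p_value = friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")

# Nemenyi critical difference: CD = q_alpha * sqrt(k * (k + 1) / (6 * N)).
k, N = acc.shape[1], acc.shape[0]
q_alpha = 2.949  # tabulated value for k = 7 at alpha = 0.05, as quoted in the text
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
print(f"Nemenyi critical difference = {cd:.4f}")
```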

5. Conclusions

The Welsch loss function has good properties such as smoothness, non-convexity and boundedness and is therefore more robust than the commonly used $L_1$ and $L_2$ losses. The capped $L_{2,p}$-norm is an excellent distance metric that can reduce the negative effects of outliers and thus improve the robustness of a model. In this paper, we proposed a distance metric optimization-driven robust twin extreme learning machine framework, namely CWTELM, which introduces the Welsch loss and the capped $L_{2,p}$-norm distance into TELM in order to enhance robustness. Then, to speed up the computation of CWTELM while maintaining its advantages, we presented a least squares version of CWTELM, namely Fast CWTELM (FCWTELM). Meanwhile, we designed two efficient iterative algorithms to solve CWTELM and FCWTELM, respectively, and guaranteed their convergence and computational complexity in theory. To evaluate the performance of CWTELM and FCWTELM, we compared them with five classical algorithms on different datasets and at different noise rates. In the absence of noise, CWTELM achieved the best results on seven datasets; the results of FCWTELM on the eight datasets are slightly lower than those of CWTELM, but the gap is small, and its running time is the shortest among the seven algorithms. In the presence of noise, taking 10% noise as an example, CWTELM achieved the best results on Australian, Balance, Cancer, Wholesale, QSAR and WDBC, and FCWTELM performed best on Pima. From a running time perspective, FCWTELM has the fastest running speed on six datasets, all within 1 s. In addition, we found that, on the same dataset, CWTELM and FCWTELM show little difference between the no-noise and 10% noise conditions, and continuing to observe the experimental data with 20% and 25% noise yields the same conclusion. To this end, this paper takes Australian, Vote, WDBC, and Cancer as examples to show more clearly the accuracy of the seven algorithms under different noise proportions. Similarly, we also conducted comparative experiments with the seven algorithms on the artificial dataset and showed their classification effects more intuitively in the form of a scatter plot; the performance of CWTELM and FCWTELM is still excellent. Finally, we carried out statistical tests on the seven algorithms and verified that CWTELM and FCWTELM exceed the other five models and that the two proposed models have no significant difference in performance. From the above work, we conclude that CWTELM and FCWTELM alleviate the negative effects of outliers to some extent, so they have good robustness; moreover, their classification performance differs little, and they run efficiently while maintaining the advantages of TELM. The algorithms CWTELM and FCWTELM proposed in this paper can be applied to pattern classification. On the one hand, our algorithms have good classification performance and robustness, and they can learn the nonlinear relationships in the input data; in this way, a high-precision classification model can be obtained, so our models can produce more accurate results when performing pattern classification. On the other hand, our algorithms improve the robustness of pattern classification: they can adaptively suppress outliers during the classification process and can deal with noise between different categories, so they are more suitable for different pattern classification tasks in practical application scenarios.
Of course, in addition to pattern classification, CWTELM and FCWTELM can also be applied in many fields, such as data mining, pattern recognition, action recognition in robot control, path planning, image classification and so on. In the future, further in-depth study is necessary to improve the proposed algorithms, such as exploring better loss functions for the TELM framework to improve robustness and algorithm performance. In addition, we can also deepen the basic research, for example, by deriving upper bounds on their generalization ability.

Author Contributions

Writing—original draft, Y.J., J.M. and G.Y.; conceptualization, Y.J. and J.M.; writing—reviewing and editing, Y.J. and J.M.; software, Y.J. and J.M.; data curation, Y.J. and J.M.; funding acquisition, Y.J., J.M. and G.Y.; supervision, G.Y.; validation, G.Y.; project administration, G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Fundamental Research Funds for the Central Universities (No. 2021KYQD23, No. 2022XYZSX03), in part by the Natural Science Foundation of Ningxia Provincial of China (No. 2022AAC03260, No. 2023AAC02053), in part by the Key Research and Development Program of Ningxia (Introduction of Talents Project) (No. 2022BSB03046), and in part by the Postgraduate Innovation Project of North Minzu University (YCX23077).

Institutional Review Board Statement

This paper does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent Statement

Informed consent was obtained from all individual participants included in the study.

Data Availability Statement

All of the benchmark datasets used in our numerical experiments are from the UCI Machine Learning Repository, and are available at http://archive.ics.uci.edu/ml/ (accessed on 28 December 2022).

Conflicts of Interest

The authors declare no conflict of interest. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Huang, G.-B.; Zhu, Q.; Siew, C. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501.
2. Huang, G.-B.; Chen, Y.; Babri, H.A. Classification ability of single hidden layer feedforward neural networks. IEEE Trans. Neural Netw. 2000, 11, 799–801.
3. Han, K.; Yu, D.; Tashev, I. Speech emotion recognition using deep neural network and extreme learning machine. In Proceedings of the Interspeech 2014, Singapore, 14–18 September 2014.
4. Romanuke, V. Setting the hidden layer neuron number in feedforward neural network for an image recognition problem under Gaussian noise of distortion. Comput. Inf. Sci. 2013, 6, 38.
5. Tiwari, S.; Bharadwaj, A.; Gupta, S. Stock price prediction using data analytics. In Proceedings of the 2017 International Conference on Advances in Computing, Communication and Control (ICAC3), Mumbai, India, 1–2 December 2017.
6. Imran, M.; Khan, M.R.; Abraham, A. An ensemble of neural networks for weather forecasting. Neural Comput. Appl. 2004, 13, 112–122.
7. Huang, G.-B.; Zhu, Q.; Siew, C. Extreme learning machine: A new learning scheme of feedforward neural networks. In Proceedings of the IEEE International Joint Conference on Neural Networks, Budapest, Hungary, 25–29 July 2004.
8. Son, Y.J.; Kim, H.G.; Kim, E.H.; Choi, S.; Lee, S.K. Application of support vector machine for prediction of medication adherence in heart failure patients. Healthc. Inform. Res. 2010, 16, 253–259.
9. Wang, G.; Zhao, Y.; Wang, D. A protein secondary structure prediction framework based on the extreme learning machine. Neurocomputing 2008, 72, 262–268.
10. Yuan, L.; Soh, Y.C.; Huang, G. Extreme learning machine based bacterial protein subcellular localization prediction. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008.
11. Abdul Adeel, M.; Minhas, R.; Wu, Q.M.J.; Sid-Ahmed, M.A. Human face recognition based on multidimensional PCA and extreme learning machine. Pattern Recognit. 2011, 44, 2588–2597.
12. Nizar, A.H.; Dong, Z.Y.; Wang, Y. Power utility nontechnical loss analysis with extreme learning machine method. IEEE Trans. Power Syst. 2008, 23, 946–955.
13. Decherchi, S.; Gastaldo, P.; Dahiya, R.S.; Valle, M.; Zunino, R. Tactile-data classification of contact materials using computational intelligence. IEEE Trans. Robot. 2011, 27, 635–639.
14. Choudhary, R.; Shukla, S. Reduced-Kernel Weighted Extreme Learning Machine Using Universum Data in Feature Space (RKWELM-UFS) to Handle Binary Class Imbalanced Dataset Classification. Symmetry 2022, 14, 379.
15. Owolabi, T.O.; Abd Rahman, M.A. Prediction of Band Gap Energy of Doped Graphitic Carbon Nitride Using Genetic Algorithm-Based Support Vector Regression and Extreme Learning Machine. Symmetry 2021, 13, 411.
16. Jayadeva; Khemchandani, R.; Chandra, S. Twin support vector machines for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 905–910.
17. Hou, Q.; Zhang, J.; Liu, L.; Wang, Y.; Jing, L. Discriminative information-based nonparallel support vector machine. Signal Process. 2019, 162, 169–179.
18. Nasiri, J.A.; Charkari, N.M.; Mozafari, K. Energy-based model of least squares twin support vector machines for human action recognition. Signal Process. 2014, 104, 248–257.
19. Ghorai, S.; Mukherjee, A.; Dutta, P.K. Nonparallel plane proximal classifier. Signal Process. 2009, 89, 510–522.
20. Wan, Y.; Song, S.; Huang, G.; Li, S. Twin extreme learning machines for pattern classification. Neurocomputing 2017, 260, 235–244.
21. Reshma, R.; Bharti, A. Least squares twin extreme learning machine for pattern classification. In Innovations in Infrastructure: Proceedings of ICIIF 2018; Springer: Singapore, 2019.
22. Yuan, C.; Yang, L. Robust twin extreme learning machines with correntropy-based metric. Knowl.-Based Syst. 2021, 214, 106707.
23. Ma, J.; Yang, L. Robust supervised and semi-supervised twin extreme learning machines for pattern classification. Signal Process. 2021, 180, 107861.
24. Ma, J.; Yuan, C. Adaptive Safe Semi-Supervised Extreme Machine Learning. IEEE Access 2019, 7, 76176–76184.
25. Shen, J.; Ma, J. Sparse Twin Extreme Learning Machine With ε-Insensitive Zone Pinball Loss. IEEE Access 2019, 7, 112067–112078.
26. Zhang, K.; Luo, M. Outlier-robust extreme learning machine for regression problems. Neurocomputing 2015, 151, 1519–1527.
27. Ke, J.; Gong, C.; Liu, T.; Zhao, L.; Yang, J.; Tao, D. Laplacian Welsch Regularization for Robust Semisupervised Learning. IEEE Trans. Cybern. 2020, 52, 164–177.
28. Tokgoz, E.; Trafalis, T.B. Mixed convexity optimization of the SVM QP problem for nonlinear polynomial kernel maps. In Proceedings of the 5th WSEAS International Conference on Computers, Puerto Morelos, Mexico, 29–31 January 2011.
29. Xu, Z.; Lai, J.; Zhou, J.; Chen, H.; Huang, H.; Li, Z. Image Deblurring Using a Robust Loss Function. Circuits Syst. Signal Process. 2021, 41, 1704–1734.
30. Wang, H.; Yu, G.; Ma, J. Capped L2,p-Norm Metric Based on Robust Twin Support Vector Machine with Welsch Loss. Symmetry 2023, 15, 1076.
31. Ma, X.; Ye, Q.; Yan, H. L2,p-norm distance twin support vector machine. IEEE Access 2017, 5, 23473–23483.
32. Li, C.-N.; Shao, Y.-H.; Deng, N.-Y. Robust L1-norm non-parallel proximal support vector machine. Optimization 2014, 65, 169–183.
33. Yuan, C.; Yang, L. Capped L2,p-norm metric based robust least squares twin support vector machine for pattern classification. Neural Netw. 2021, 142, 457–478.
34. Chapelle, O.; Scholkopf, B.; Zien, A. Semi-Supervised Learning (Chapelle, O. et al., Eds.; 2006) [Book Reviews]. IEEE Trans. Neural Netw. 2009, 20, 542.
Figure 1. Loss Functions.
Figure 2. Distribution of artificial datasets with outliers.
Figure 3. Accuracies of five algorithms via different noise factors.
Figure 4. The classification results on the artificial datasets.
Figure 5. Accuracies of CWTELM and FCWTELM via different parameter p.
Figure 6. Visualization of post-hoc tests for UCI datasets.
Table 1. Abbreviations.
Abbreviated Form | Complete Form
SLFNs | Single-Hidden Layer Feedforward Neural Networks
ELM | Extreme Learning Machine
TELM | Twin Extreme Learning Machine
TSVM | Twin Support Vector Machine
LS-TELM | Least Squares Twin Extreme Learning Machine
CHELM | Correntropy-based Robust Extreme Learning Machine
RSS-ELM | Robust Semi-supervised Extreme Learning Machine
$L_1$-NPELM | Non-parallel Proximal Extreme Learning Machine
$L_1$-TELM | Robust $L_1$-norm Twin Extreme Learning Machine
SVC | Capped $L_{2,p}$-norm Support Vector Classification
C$L_{2,p}$-LSTELM | Capped $L_{2,p}$-norm Least Squares Twin Extreme Learning Machine
RTELM | Robust Supervised Twin Extreme Learning Machine
CTSVM | Capped $L_1$-norm Twin Support Vector Machine
CWTELM | Capped $L_{2,p}$-norm Welsch Twin Extreme Learning Machine
FCWTELM | Fast Capped $L_{2,p}$-norm Welsch Twin Extreme Learning Machine
LCFTELM | Robust Twin Extreme Learning Machine with Correntropy-based Metric
ACC | Accuracy
TP | True Positives
TN | True Negatives
FN | False Negatives
FP | False Positives
CD | Critical Difference
Table 2. Notation.
Symbol | Meaning
$\mathbb{R}$ | Real numbers
$\mathbb{R}^n$ | Real n-dimensional vector space
$\mathbb{R}^{n\times n}$ | The linear space of real n-order matrices
$|\cdot|$ | Perpendicular distance of the data point x from the hyperplane
$\|x\|_1$ | The 1-norm of vector x
$\|x\|_2$ | The 2-norm of vector x
$\|x\|_2^2$ | Square of the 2-norm of vector x
$\|x\|_p$ | The p-norm of vector x
$\|A\|_1$ | The 1-norm of matrix A
$\|A\|_2$ | The 2-norm of matrix A
$A^T$ | The transpose of matrix A
$A^{-1}$ | The inverse of matrix A
$\tau$ | Training set
$l$ | Number of samples in the training set
$y_i$ | Label of $x_i$, $y_i \in \{+1, -1\}$
$H_1$ | The hidden-layer output of the samples belonging to the positive class
$H_2$ | The hidden-layer output of the samples belonging to the negative class
$f(x)$ | Decision function
Table 3. Characteristics of UCI Datasets.
Datasets | Samples | Attributes
Australian | 690 | 14
Balance | 576 | 4
Vote | 432 | 16
QSAR | 1055 | 41
Cancer | 699 | 9
Wholesale | 440 | 7
WDBC | 569 | 30
Pima | 768 | 8
Table 4. Experimental results on UCI datasets with 0% Gaussian noise.
Datasets | Metric | ELM | CHELM | TELM | CTSVM | RTELM | CWTELM | FCWTELM
Australian | ACC (%) | 85.74 | 86.53 | 86.69 | 84.93 | 86.58 | 88.24 | 86.70
Australian | Time (s) | 1.541 | 4.561 | 2.093 | 3.466 | 5.125 | 6.847 | 0.536
Balance | ACC (%) | 85.11 | 91.04 | 90.41 | 89.29 | 94.64 | 96.43 | 91.07
Balance | Time (s) | 1.739 | 4.543 | 3.112 | 3.097 | 4.381 | 4.853 | 0.427
Vote | ACC (%) | 94.58 | 95.60 | 95.48 | 95.58 | 95.81 | 97.62 | 92.65
Vote | Time (s) | 1.043 | 4.547 | 0.901 | 9.310 | 6.234 | 4.654 | 0.587
Cancer | ACC (%) | 80.61 | 86.43 | 86.33 | 86.88 | 90.75 | 94.20 | 91.30
Cancer | Time (s) | 1.706 | 5.013 | 0.873 | 2.771 | 4.274 | 6.256 | 0.581
Wholesale | ACC (%) | 75.07 | 74.31 | 74.56 | 73.49 | 81.40 | 86.05 | 81.44
Wholesale | Time (s) | 1.476 | 4.675 | 0.937 | 2.819 | 3.948 | 4.123 | 0.369
QSAR | ACC (%) | 84.43 | 81.66 | 86.87 | 88.31 | 87.64 | 88.46 | 87.50
QSAR | Time (s) | 1.541 | 3.043 | 0.629 | 7.856 | 7.798 | 11.437 | 0.798
Pima | ACC (%) | 77.76 | 76.78 | 78.01 | 72.68 | 78.86 | 79.01 | 76.32
Pima | Time (s) | 2.674 | 3.622 | 1.316 | 6.047 | 7.664 | 6.492 | 0.772
WDBC | ACC (%) | 95.85 | 95.32 | 95.55 | 95.13 | 95.21 | 98.21 | 94.64
WDBC | Time (s) | 1.435 | 8.951 | 1.225 | 9.549 | 6.449 | 5.224 | 0.454
Table 5. Experimental results on UCI datasets with 10% Gaussian noise.
Datasets | Metric | ELM | CHELM | TELM | CTSVM | RTELM | CWTELM | FCWTELM
Australian | ACC (%) | 79.83 | 80.21 | 81.03 | 80.45 | 81.98 | 85.29 | 82.35
Australian | Time (s) | 1.523 | 4.631 | 2.143 | 3.487 | 5.187 | 6.473 | 0.521
Balance | ACC (%) | 83.32 | 84.43 | 83.21 | 87.23 | 85.71 | 91.07 | 89.29
Balance | Time (s) | 1.909 | 4.876 | 2.453 | 4.417 | 4.418 | 4.497 | 0.426
Vote | ACC (%) | 93.57 | 92.22 | 94.65 | 95.01 | 95.43 | 95.24 | 91.35
Vote | Time (s) | 0.978 | 3.872 | 0.376 | 9.654 | 5.503 | 4.717 | 0.587
Cancer | ACC (%) | 79.36 | 83.46 | 85.48 | 85.36 | 84.06 | 89.86 | 86.96
Cancer | Time (s) | 1.758 | 5.001 | 0.773 | 2.608 | 4.316 | 6.354 | 0.590
Wholesale | ACC (%) | 74.47 | 75.31 | 73.56 | 73.14 | 76.64 | 83.72 | 78.37
Wholesale | Time (s) | 1.476 | 4.657 | 0.879 | 2.892 | 4.063 | 4.150 | 0.373
QSAR | ACC (%) | 73.61 | 72.43 | 79.64 | 78.79 | 84.31 | 85.58 | 84.62
QSAR | Time (s) | 1.931 | 6.778 | 2.789 | 9.852 | 10.754 | 11.198 | 0.763
Pima | ACC (%) | 72.21 | 73.45 | 73.47 | 70.38 | 75.91 | 76.32 | 84.62
Pima | Time (s) | 2.013 | 3.023 | 1.482 | 6.924 | 6.765 | 6.714 | 0.768
WDBC | ACC (%) | 88.53 | 89.26 | 87.63 | 91.43 | 92.31 | 96.43 | 91.07
WDBC | Time (s) | 1.238 | 7.693 | 0.924 | 8.988 | 5.973 | 6.608 | 0.667
Table 6. Experimental results on UCI datasets with 20% Gaussian noise.
Datasets | Metric | ELM | CHELM | TELM | CTSVM | RTELM | CWTELM | FCWTELM
Australian | ACC (%) | 76.48 | 78.85 | 78.91 | 79.80 | 81.12 | 83.82 | 80.88
Australian | Time (s) | 1.586 | 6.683 | 0.774 | 9.710 | 8.740 | 7.704 | 0.583
Balance | ACC (%) | 79.11 | 81.39 | 82.01 | 84.91 | 82.14 | 87.50 | 85.71
Balance | Time (s) | 1.679 | 5.460 | 2.489 | 3.345 | 4.392 | 4.871 | 0.424
Vote | ACC (%) | 92.51 | 90.23 | 93.76 | 94.19 | 94.43 | 95.24 | 95.36
Vote | Time (s) | 0.990 | 3.856 | 0.401 | 9.348 | 5.607 | 5.321 | 0.506
Cancer | ACC (%) | 79.25 | 80.15 | 82.67 | 85.29 | 84.06 | 86.96 | 85.51
Cancer | Time (s) | 1.739 | 5.203 | 0.798 | 2.661 | 4.346 | 6.952 | 0.411
Wholesale | ACC (%) | 74.19 | 74.91 | 73.06 | 72.21 | 74.22 | 81.40 | 76.74
Wholesale | Time (s) | 1.330 | 4.857 | 0.896 | 3.412 | 3.953 | 4.119 | 0.374
QSAR | ACC (%) | 64.87 | 65.44 | 68.44 | 68.56 | 76.92 | 83.65 | 79.81
QSAR | Time (s) | 2.245 | 10.212 | 4.443 | 11.387 | 12.876 | 10.844 | 0.809
Pima | ACC (%) | 65.87 | 65.80 | 66.32 | 71.75 | 73.31 | 73.68 | 71.50
Pima | Time (s) | 1.746 | 2.498 | 1.090 | 7.095 | 5.198 | 6.329 | 0.931
WDBC | ACC (%) | 84.76 | 85.37 | 82.08 | 82.75 | 85.56 | 92.86 | 89.29
WDBC | Time (s) | 1.389 | 8.918 | 1.124 | 9.330 | 6.499 | 5.235 | 0.509
Table 7. Experimental results on UCI datasets with 25% Gaussian noise.
Datasets | Metric | ELM | CHELM | TELM | CTSVM | RTELM | CWTELM | FCWTELM
Australian | ACC (%) | 66.79 | 69.85 | 65.66 | 74.65 | 76.76 | 83.82 | 76.47
Australian | Time (s) | 1.413 | 6.747 | 1.406 | 9.552 | 7.895 | 7.842 | 0.724
Balance | ACC (%) | 78.39 | 80.11 | 81.12 | 85.46 | 85.71 | 87.50 | 85.72
Balance | Time (s) | 1.710 | 2.539 | 3.411 | 3.279 | 4.494 | 4.664 | 0.453
Vote | ACC (%) | 93.82 | 92.23 | 91.10 | 93.18 | 93.64 | 93.86 | 93.98
Vote | Time (s) | 0.893 | 3.653 | 0.664 | 9.118 | 7.235 | 5.321 | 0.508
Cancer | ACC (%) | 79.25 | 80.36 | 81.32 | 84.86 | 81.16 | 85.61 | 84.97
Cancer | Time (s) | 1.739 | 5.210 | 0.799 | 2.503 | 4.332 | 6.863 | 0.404
Wholesale | ACC (%) | 74.00 | 74.01 | 73.06 | 70.35 | 72.09 | 79.07 | 74.42
Wholesale | Time (s) | 1.404 | 5.110 | 0.789 | 3.020 | 4.066 | 4.173 | 0.374
QSAR | ACC (%) | 63.72 | 67.69 | 65.77 | 73.65 | 70.01 | 74.04 | 75.00
QSAR | Time (s) | 0.441 | 10.357 | 4.499 | 9.336 | 12.876 | 11.357 | 0.790
Pima | ACC (%) | 65.83 | 65.59 | 65.29 | 70.39 | 70.66 | 73.68 | 69.74
Pima | Time (s) | 1.746 | 2.202 | 1.045 | 6.691 | 7.836 | 6.394 | 0.860
WDBC | ACC (%) | 77.56 | 75.57 | 79.09 | 80.12 | 84.64 | 85.71 | 85.71
WDBC | Time (s) | 1.401 | 8.631 | 1.553 | 9.407 | 6.639 | 5.248 | 0.435
Table 8. Average accuracy and ranking of the seven algorithms on the UCI datasets with different noise proportions.
Metric | ELM | CHELM | TELM | CTSVM | RTELM | CWTELM | FCWTELM
Avg. ACC 10% | 80.61 | 81.35 | 82.33 | 82.72 | 84.54 | 87.94 | 86.08
Avg. rank 10% | 6.000 | 5.500 | 4.875 | 4.625 | 3.000 | 1.250 | 2.750
Avg. ACC 20% | 77.13 | 77.77 | 78.41 | 79.93 | 81.47 | 85.64 | 83.10
Avg. rank 20% | 6.000 | 5.750 | 5.000 | 4.500 | 2.625 | 1.375 | 2.750
Avg. ACC 25% | 74.92 | 75.68 | 75.30 | 79.08 | 79.33 | 82.91 | 80.75
Avg. rank 25% | 5.625 | 5.500 | 5.750 | 4.125 | 3.625 | 1.375 | 2.000
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
