Article

Interactive Guiding Sparse Auto-Encoder with Wasserstein Regularization for Efficient Classification

Department of Computer Engineering, Inha University, Inha-ro 100, Michuhol-gu, Incheon 22212, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(12), 7055; https://doi.org/10.3390/app13127055
Submission received: 26 April 2023 / Revised: 6 June 2023 / Accepted: 9 June 2023 / Published: 12 June 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In the era of big data, feature engineering has proved its efficiency and importance in dimensionality reduction and in extracting useful information from the original features. Feature engineering can be expressed as dimensionality reduction and is divided into two types of methods, namely, feature selection and feature extraction. Each method has its pros and cons, and many studies combine them. The sparse autoencoder (SAE) is a representative deep feature learning method that combines feature selection with feature extraction. However, existing SAEs do not consider feature importance during training, which causes them to extract irrelevant information. In this paper, we propose an interactive guiding sparse autoencoder (IGSAE) that guides the information flow through two interactive guiding layers and sparsity constraints. The interactive guiding layers preserve the main distribution using the Wasserstein distance, a metric of distribution difference, and suppress the leverage of the guiding features to prevent overfitting. We perform experiments on six datasets with different dimensionalities and numbers of samples. The proposed IGSAE produces better classification performance than other dimensionality reduction methods.

1. Introduction

Efficient feature learning algorithms have become an essential element and a hot research topic in the era of big data [1]. Massive, high-dimensional data generally contain a lot of redundancy and noise, which poses huge challenges for extracting important features, processing them, and identifying meaningful patterns. Many traditional methods work well in some cases; for example, methods based on linear relationships are intuitive and easy to understand. However, data can also have non-linear structures, and a machine learning-based model can explain such non-linear structure [2,3,4,5]. An autoencoder [6,7,8,9] is a representative machine learning technique for finding manifolds that represent large-scale data and is employed extensively in feature extraction and dimensionality reduction. It consists of two parts: an encoder and a decoder. The encoder transforms the input into a latent vector that contains the main information, and the decoder reconstructs the latent vector back into features of the same form as the input. The model learns to minimize the reconstruction error. This approach has become one of the more useful feature engineering methods.
Feature engineering is the process of extracting important information from raw data and can generally be divided into two types: feature selection [10,11,12] and feature extraction [13,14]. Feature selection can be explained as finding a subset of features with higher importance for classification, so it can effectively remove less relevant features. These methods generally have high generalization performance and computational efficiency; examples include regularization [15,16], information gain [17], and the chi-square test [18]. However, they have the disadvantage of removing potentially non-noise factors of the original data. On the other hand, feature extraction methods are usually based on reconstructing features and projecting high-dimensional data into a low-dimensional subspace. These algorithms normally preserve the variance of the features, so they retain much information. However, since they consider almost all features in the transformation process, the dimensionality reduction effect is usually weakened. Principal component analysis (PCA) [19,20], canonical correlation analysis (CCA) [21], linear discriminant analysis (LDA) [22,23], and autoencoders based on machine learning models are representative feature extraction methods.
However, an autoencoder has overfitting problems and a highly complex structure. Improved structures have been proposed in recent years to solve these problems. Zhou et al. [24] applied a stacked autoencoder to hyperspectral image classification to generate high-quality hyperspectral image data and maximize task performance. Sun et al. [25] applied a stacked autoencoder to the fault diagnosis of rolling bearings and obtained promising results. Both works increase the size of the network to obtain better results. However, simple autoencoder-based methods still belong to feature extraction, so it is difficult for them to modify or remove less relevant features, which is the advantage of feature selection. To overcome this shortcoming, techniques that apply regularization have been studied. Notably, the stacked sparse autoencoder (SSAE) [26,27,28] is one of the representative deep autoencoder methods that incorporate the idea of feature selection. The SSAE improves the autoencoder by combining two advanced autoencoders: a sparse autoencoder and a stacked autoencoder. A sparse autoencoder adds sparsity constraints to the autoencoder; it suppresses useless nodes to reduce the dimension efficiently. A stacked autoencoder improves the training performance by stacking autoencoders, which enables deeper feature learning compared with the simple learning process of a traditional autoencoder. Sankaran et al. [29] proposed a group sparse autoencoder (GAE) with regularization applied to solve overfitting, a chronic problem of autoencoders.
SSAE’s regularization-based methods have achieved successful results in various applications. However, it is difficult to apply dynamic and efficient feature selection within them because of structural issues; in many cases there are vanishing gradient problems and over-generalized information (e.g., in sensitive applications). Chai et al. [30] proposed a label and sparse regularization autoencoder (LSRAE). The main idea of LSRAE is to learn by dividing the output layer of the autoencoder into two parts, the actual label and the output data. LSRAE extracts features that involve the relationship between features and targets, and prevents overfitting via sparse regularization. Most regularization-based autoencoders optimize the loss function without considering the structure; LSRAE, in contrast, introduces structural changes and the advantages of supervised learning, which is an important point in the feature engineering field. Xu et al. [31] proposed a relational regularized autoencoder (RAE) that flexibly learns a structured prior distribution using the Fused Gromov–Wasserstein (FGW) distance [32]. The main idea is to limit the difference by regularizing the structural distance between the latent prior and the corresponding posterior using a metric of distribution similarity. To obtain high-quality, low-dimensional features, the structure of SSAE needs to be improved and reformed.
In this paper, we propose an interactive guiding sparse autoencoder (IGSAE), which is an advanced autoencoder model. It is hard to prevent overfitting with existing autoencoders because the weights are updated to reduce reconstruction loss. To avoid overfitting, we modify the autoencoder by inserting two guiding layers before and after the reconstruction layer to adjust the leverage of feature importance. These guiding layers learn the distribution of selected labels and relieve overfitting caused by the reconstruction loss. Next, we design a regularization module to prevent the reconstruction layer and guiding layer from having excessive influence. Therefore, our method is a new manifold learning system that can improve the quality of deep features.

Main Contributions

The main contributions of our method are summarized as follows:
  • We propose a new structure that reduces overfitting of the autoencoder by placing a guiding layer before and after the reconstruction layer, so that both label information and reconstruction information can be utilized, similar to semi-supervised learning.
  • We design an architecture that consists of two different autoencoders and show outstanding classification performance compared to six representative feature extraction methods.
The rest of this paper is organized as follows. Section 2 introduces the related work, including the basic theory and advanced theory of the autoencoder that is related to the proposed method. Section 3 describes the details of the proposed IGSAE method and shows its framework. Section 4 illustrates the experiment environment, evaluation metrics, information on datasets, and experimental results. Finally, Section 5 concludes the paper.

2. Background and Related Works

In this section, we describe the autoencoder-based models.

2.1. Sparse Autoencoder (SAE)

An autoencoder consists of two parts: an encoder and a decoder. In the encoder step, the autoencoder transforms the original feature into a latent representation. It is model-interpretable information. In the decoder step, the autoencoder reconstructs the latent representation into new meaningful features.
An encoding function $f$ maps an input data sample $x \in \mathbb{R}^m$ to a low-dimensional latent feature space $\mathbb{R}^n$. It is formulated as:
$h = f(x) = \phi(W_1^T x + b_1)$
where $W_1^T$ is the transpose of the weight matrix $W_1$ between the input layer and the hidden layer, $\phi$ is the activation function, $b_1$ is a bias, and $h$ is the latent vector of the hidden layer. A decoding function $g$ maps $h$ to the reconstructed feature $x'$. It can be expressed as:
$x' = g(h) = \phi(W_2^T h + b_2)$
where $W_2^T$ is the transpose of the weight matrix $W_2$ between the hidden layer and the output layer and $b_2$ is a bias. The purpose of the autoencoder is to find a structure that minimizes the reconstruction error, and the objective function is defined as follows:
$\min_{\theta} J_{AE}(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \| x_i' - x_i \|^2$
where $J_{AE}$ is the reconstruction loss of the autoencoder and $n$ is the number of input samples. The structure of SAE is based on the autoencoder. However, in a basic autoencoder, the distribution of the reconstructed features overlaps heavily owing to problems such as overfitting [33], and simply copying the input may not represent the key information well. To address these problems, a sparsity constraint is added as a penalty term that suppresses the output of neurons when the number of neurons in the hidden layer is large, formulated as follows:
$\min_{\theta} J_{SAE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \| x_i' - x_i \|^2 + \varepsilon \sum_{j=1}^{s} KL(\rho \,\|\, \hat{\rho}_j)$
where $J_{SAE}$ is the objective of the sparse autoencoder (SAE), $s$ is the number of neurons in the hidden layer, $\hat{\rho}_j$ is the average activation probability of the $j$-th hidden neuron, $\rho$ is the sparsity parameter, and $\varepsilon$ weights the penalty term. KL divergence is a measure of how different two probability distributions are [34].
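To make the sparsity constraint concrete, the following is a minimal PyTorch sketch of the SAE objective in Equation (4). It is an illustrative example only: the layer sizes, the sigmoid activation (chosen so that the average activation $\hat{\rho}_j$ stays in $(0,1)$), the sparsity target $\rho$, and the weight $\varepsilon$ are assumptions, not the configuration used in the cited works.

```python
import torch
import torch.nn as nn

class SparseAE(nn.Module):
    # Minimal sparse autoencoder; layer sizes are illustrative.
    def __init__(self, in_dim=784, hid_dim=128):
        super().__init__()
        self.enc = nn.Linear(in_dim, hid_dim)   # W1, b1
        self.dec = nn.Linear(hid_dim, in_dim)   # W2, b2

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))          # latent vector h
        x_rec = torch.sigmoid(self.dec(h))      # reconstruction x'
        return x_rec, h

def kl_sparsity(h, rho=0.05, eps=1e-8):
    # Sum of KL(rho || rho_hat_j) over the hidden units, as in Equation (4).
    rho_hat = h.mean(dim=0).clamp(eps, 1 - eps)  # average activation per unit
    return (rho * torch.log(rho / rho_hat)
            + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()

def sae_loss(x, x_rec, h, eps_coef=0.04, rho=0.05):
    # J_SAE = reconstruction error + epsilon * sparsity penalty.
    recon = ((x_rec - x) ** 2).sum(dim=1).mean()
    return recon + eps_coef * kl_sparsity(h, rho)
```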

2.2. Label and Sparse Regularization Autoencoder (LSRAE)

Chai et al. [30] proposed LSRAE, a structure that combines a supervised learning method with the autoencoder, which is unsupervised, to improve the performance of classification tasks. They argued that the overfitting and irregular-information problems of classification, which are chronic problems of autoencoders, occur because the relationship between the target data and the input data is not considered at all in the reconstruction process. Therefore, they addressed the problem with two main methods: the first is adding a label regularization term to the objective function to reduce the difference between the actual label and the desired label, and the second is using the extreme learning machine (ELM) [35] for fast optimization. The label regularization term ($J_{label}$) of LSRAE is formulated as follows:
$J_{label} = \frac{1}{2n} \sum_{i=1}^{n} \| L - T \|^2 = \frac{1}{2n} \sum_{i=1}^{n} \| H\hat{\beta} - T \|^2$
where $L$ is the actual label output by the autoencoder, $T$ is the desired label, and $n$ is the number of input samples. Figure 1 shows how the ELM is utilized in the autoencoder. The blue lines connect the hidden layer and the output layer, and the red lines connect the hidden layer and the label layer. The softmax classifier is adopted as the classifier. In the training process, the reconstruction loss, the weight decay term, and the sparse regularization term are computed in order. Then, the concept of ELM is used to calculate the real label, which is used to obtain the label regularization. LSRAE learns through backpropagation by obtaining the squared error of Equation (5). This restricts the overfitting of an autoencoder that focuses only on reducing the reconstruction loss and improves the classification accuracies; the sparse constraint is also used to restrict useless nodes. The simplified objective function of LSRAE is calculated as follows:
$J_{sparse} = \sum_{j=1}^{s} KL(\rho \,\|\, \hat{\rho}_j)$
$J_{LSRAE} = J_{AE} + \lambda J_{label} + \beta J_{sparse}$
where $J_{AE}$ is the loss function of the basic autoencoder, and $\lambda$ and $\beta$ are coefficients.
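As a hedged illustration of how these terms combine, the sketch below computes Equation (7) from tensors produced by an LSRAE-like model; the one-hot label tensors, the coefficient values, and the use of a plain squared error in place of the ELM-based label computation are assumptions made for the example.

```python
import torch

def lsrae_loss(x, x_rec, h, labels_pred, labels_true,
               lam=0.1, beta=0.04, rho=0.05):
    # Sketch of Equation (7): J_LSRAE = J_AE + lam * J_label + beta * J_sparse.
    # labels_pred plays the role of the real-label output (H * beta_hat);
    # labels_true is the desired label T as a one-hot tensor.
    j_ae = ((x_rec - x) ** 2).sum(dim=1).mean() / 2                        # J_AE
    j_label = ((labels_pred - labels_true) ** 2).sum(dim=1).mean() / 2     # Eq. (5)
    rho_hat = h.mean(dim=0).clamp(1e-8, 1 - 1e-8)
    j_sparse = (rho * torch.log(rho / rho_hat)
                + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()  # Eq. (6)
    return j_ae + lam * j_label + beta * j_sparse
```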

3. Interactive Guiding Sparse Autoencoder

Autoencoders have traditionally been used as a dimensionality reduction method, like PCA and LDA. However, when overfitting occurs in the learning process, two major problems arise. First, even if unrelated data are used as input, the autoencoder transforms them into a distribution similar to that of the training set. Second, while learning the information of features with low importance, noise can be recognized as important information, which affects classification accuracy. To solve these problems, we developed an interactive guiding sparse autoencoder (IGSAE) that learns the manifold of selected features and guides the rest of the features with it.
As shown in Figure 2, IGSAE adds two guiding layers to the basic autoencoder so that the two layers interact. From left to right, it is composed of the input layer, the hidden layer (with the first guiding layer), the reconstruction layer, and the second guiding layer (output layer). The whole procedure is explained in five steps. In the first step, $H$ and $G_1$ are the output of the hidden units and the output of the first guiding units, respectively. They are computed as follows:
$H = \phi(W_1 X + b_1)$
$G_1 = \phi(W_1' X + b_1')$
where $X = \{x_1, x_2, \ldots, x_n\}$ is the input data, $\phi$ is the activation function (ReLU), $W_1$ and $b_1$ denote the weight and bias between the input layer and the hidden layer, and $W_1'$ and $b_1'$ denote the weight and bias between the input layer and the first guiding layer, respectively. $X$ denotes the guided feature and $X_g$ denotes the guiding feature. The learning process of IGSAE can be divided into two parts: (a) learning the guiding feature as a target from the guided feature to involve the relationship between the two features, and (b) reconstructing the guided feature. A pretrained model from the guided feature $X$ to the guiding feature $X_g$ is required to involve the relationship between the features. The corresponding part of IGSAE maps the guided feature of the input layer $X$ to the guiding feature of the first guiding layer $G_1$ and is partially pretrained. The cost function $J_{fg}$ of the partially pretrained model is formulated as follows:
$J_{fg} = \frac{1}{2n} \| X_g - G_1 \|^2 = \frac{1}{2n} \| X_g - \phi(W_1' X + b_1') \|^2$
where $X_g = \{x_{n+1}, x_{n+2}, \ldots, x_m\}$ is the input data of the guiding layer. The cost function used for pretraining is also used as a regularization term in the objective function of Equation (18).
In the second step, as the number of units in the hidden layer increases, there may be too much sparse information. As a sparsity constraint, the average activity of each hidden unit is calculated: an important unit is activated, and an unimportant one is deactivated. The sparsity criterion for determining the importance of a unit uses the Kullback–Leibler (KL) divergence, the relative entropy that measures the similarity between the sparsity parameter $\rho$ and the average activation probability $\hat{\rho}_j$, where $s$ is the number of hidden units. The average activation $\hat{\rho}_j$ is represented as:
$\hat{\rho}_j = \frac{1}{n} \sum_{i=1}^{n} \phi_j(x_i)$
and $J_{sparse}$ (the KL divergence penalty) is represented as:
$J_{sparse} = \sum_{j=1}^{s} KL(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{s} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j} \right]$
In the third step, there is the reconstruction loss $J_{AE}$, which is calculated from the output vector $X'$ of the reconstruction layer. The output of the reconstruction layer is calculated as follows:
$X' = \phi(W_2 H + b_2) + \phi(W_2' G_1 + b_2') = \phi(W_2 \phi(W_1 X + b_1) + b_2) + \phi(W_2' \phi(W_1' X + b_1') + b_2')$
where $\phi$ is the activation function, $W_2$ and $b_2$ denote the weight and bias between the hidden layer and the reconstruction layer, and $W_2'$ and $b_2'$ denote the weight and bias between the first guiding layer and the reconstruction layer, respectively. $J_{AE}$ is the typical loss used in the existing autoencoder and is the main loss of IGSAE. In general, the mean squared error is used for linear cases and cross-entropy for non-linear cases; here, we compute the mean squared error. $J_{AE}$ is computed as follows:
$J_{AE} = \frac{1}{2n} \| X - X' \|^2 = \frac{1}{2n} \left\| X - \left[ \phi(W_2 \phi(W_1 X + b_1) + b_2) + \phi(W_2' \phi(W_1' X + b_1') + b_2') \right] \right\|^2$
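The two-path reconstruction of Equations (8), (9) and (13) and the loss of Equation (14) can be sketched as follows. This is a simplified illustration: the module layout, the dimensions, and the choice to keep ReLU on the reconstruction output are assumptions made for readability, not the exact published implementation.

```python
import torch
import torch.nn as nn

class IGSAEForward(nn.Module):
    # Sketch of the IGSAE forward pass; dimensions are illustrative.
    def __init__(self, guided_dim, guiding_dim, hid_dim):
        super().__init__()
        self.hidden = nn.Linear(guided_dim, hid_dim)       # W1,  b1
        self.guide1 = nn.Linear(guided_dim, guiding_dim)   # W1', b1' (first guiding layer)
        self.rec_h  = nn.Linear(hid_dim, guided_dim)       # W2,  b2
        self.rec_g  = nn.Linear(guiding_dim, guided_dim)   # W2', b2'

    def forward(self, x):
        h  = torch.relu(self.hidden(x))                    # Eq. (8)
        g1 = torch.relu(self.guide1(x))                    # Eq. (9)
        x_rec = torch.relu(self.rec_h(h)) + torch.relu(self.rec_g(g1))  # Eq. (13)
        return x_rec, h, g1

def j_ae(x, x_rec):
    # Reconstruction loss of Equation (14).
    return ((x - x_rec) ** 2).sum(dim=1).mean() / 2
```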
In the fourth step, IGSAE has the second guiding part. Its operation is almost the same as that used to obtain the first guiding layer, and the two guiding layers operate complementarily. The second guiding regularization term ($J_{sg}$) is calculated as follows:
$G_2 = \phi(W_3 X' + b_3) = \phi\big( W_3 \left[ \phi(W_2 \phi(W_1 X + b_1) + b_2) + \phi(W_2' \phi(W_1' X + b_1') + b_2') \right] + b_3 \big)$
$J_{sg} = \frac{1}{2n} \| X_g - G_2 \|^2 = \frac{1}{2n} \left\| X_g - \phi\big( W_3 \left[ \phi(W_2 \phi(W_1 X + b_1) + b_2) + \phi(W_2' \phi(W_1' X + b_1') + b_2') \right] + b_3 \big) \right\|^2$
where $W_3$ and $b_3$ are the weight and bias between the reconstruction layer and the second guiding layer, respectively. The guiding term ($J_{guide}$) is composed of the two guiding terms in Equations (10) and (16) and is formulated as follows:
$J_{guide} = J_{fg} + J_{sg}$
$J_{guide}$ is the loss of the two guiding layers that learn the distribution of the real label. If the distribution of the label can influence the learning process of the autoencoder, then the overfitting that comes from learning only to reduce the reconstruction loss [36] can be restricted. Therefore, the objective function of IGSAE, using Equations (12), (14) and (17) and the Wasserstein regularization term ($J_{was}$), is defined as:
$\min_{w \in W} J_{IGSAE} = \min_{w \in W} \left[ J_{AE} + \varepsilon J_{sparse} + \lambda J_{guide} + \mu J_{was} \right] = \min_{w \in W} \Big[ \frac{1}{2n} \| X - X' \|^2 + \varepsilon \sum_{j=1}^{s} KL(\rho \,\|\, \hat{\rho}_j) + \lambda \big( \frac{1}{2n} \| X_g - G_2 \|^2 + \frac{1}{2n} \| X_g - G_1 \|^2 \big) + \mu \inf_{\gamma \in \Gamma(P_{\theta_1}, P_{\theta_2})} \mathbb{E}_{(X, H) \sim \gamma} [ d(G_1, G_2) ] \Big]$
where $\varepsilon$, $\lambda$, and $\mu$ are coefficients that balance $J_{IGSAE}$, and $W$ is the set of weights of IGSAE. $\theta_1$ is the parameter of the first guiding layer and $\theta_2$ is the parameter of the second guiding layer. $P_{\theta_1}$ is the pretrained prior distribution, which we assume to be the real data distribution, and $P_{\theta_2}$ is the latent variable distribution. $\Gamma(P_{\theta_1}, P_{\theta_2})$ denotes the set of all joint distributions $\gamma$ whose marginals are $P_{\theta_1}$ and $P_{\theta_2}$, respectively, and $d(G_1, G_2)$ is the distance between $G_1$ and $G_2$.
When training IGSAE, the influence of the first guiding layer can become too strong, which increases the reconstruction loss. If the reconstruction performance decreases, the classification performance can also be degraded [37,38]. Therefore, the Wasserstein regularization term is proposed to limit the influence of the first guiding layer. The Wasserstein regularization term ($J_{was}$) is explained in the next part and is given by:
$J_{was} = \inf_{\gamma \in \Gamma(P_{\theta_1}, P_{\theta_2})} \mathbb{E}_{(X, H) \sim \gamma} [ d(G_1, G_2) ]$
where $J_{was}$ is the infimum of the expected distance between $G_1$ and $G_2$ over all couplings of $(P_{\theta_1}, P_{\theta_2})$. However, the infimum in Equation (19) is intractable and complex. The Kantorovich–Rubinstein duality replaces the problem with a low-complexity dual problem. It can be formulated as:
$\min_{w \in W} J_{IGSAE} = \min_{w \in W} \Big[ J_{AE} + \varepsilon J_{sparse} + \lambda J_{guide} + \mu \sup_{\|f\|_L \le 1} \big( \mathbb{E}_{x \sim P_{\theta_1}} [ f(x) ] - \mathbb{E}_{x \sim P_{\theta_2}} [ f(x) ] \big) \Big]$
where $\|f\|_L \le 1$ means that the supremum is taken over all functions satisfying the 1-Lipschitz condition, and $f$ is a function of $\theta_1$ that satisfies this condition. By adding a weight clipping parameter to $f$ to fit the duality condition, it can be expressed as follows:
$\min_{w \in W} J_{IGSAE} = \min_{w \in W} \Big[ J_{AE} + \varepsilon J_{sparse} + \lambda J_{guide} + \mu \max_{w \in \mathcal{W}} \big( \mathbb{E}_{x \sim P_{\theta_1}} [ f_w(x) ] - \mathbb{E}_{x \sim P_{\theta_2}} [ f_w(g_{\theta_2}(x)) ] \big) \Big]$
IGSAE assumes that a well-performing $f_w$ is already available through pretraining. Ultimately, our remaining goal is to find the optimum that updates $W_1$, $W_2$, and $W_2'$ through the Wasserstein regularization; that is, it should find the optimal $g_{\theta_2}$. Furthermore, the changed $J_{was}$ can be differentiated through backpropagation in Equation (21) by obtaining $\max_{w \in \mathcal{W}} \mathbb{E}_{x \sim P_{\theta_1}} [ f_w(x) ]$. Accordingly, backpropagation of the entire objective function is possible with the gradient descent algorithm. In the partial derivatives that optimize $J_{IGSAE}$, the bias terms are omitted to simplify the expressions.
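A common way to realize the weight-clipped dual of Equation (21) is a small critic network in the style of the Wasserstein GAN. The sketch below is an assumption-laden illustration: the critic architecture, the clipping constant, and the way $G_1$ and $G_2$ stand in for samples from $P_{\theta_1}$ and $P_{\theta_2}$ are choices made for the example rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    # Small critic approximating the 1-Lipschitz function f_w in Eq. (21).
    def __init__(self, dim, width=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, width), nn.ReLU(),
                                 nn.Linear(width, 1))

    def forward(self, z):
        return self.net(z)

def clip_weights(critic, c=0.01):
    # Weight clipping keeps f_w (approximately) 1-Lipschitz, as in WGAN.
    for p in critic.parameters():
        p.data.clamp_(-c, c)

def wasserstein_term(critic, g1, g2):
    # Dual estimate E[f_w(G1)] - E[f_w(G2)]; G1 follows the pretrained prior
    # P_theta1 and G2 the latent distribution P_theta2.
    return critic(g1).mean() - critic(g2).mean()
```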
The weight updates are formulated in order as follows:
$\frac{\partial J_{IGSAE}}{\partial W_1} = \frac{\partial J_{AE}}{\partial W_1} + \varepsilon \frac{\partial J_{sparse}}{\partial W_1} + \lambda \frac{\partial J_{guide}}{\partial W_1} + \mu \frac{\partial J_{was}}{\partial W_1} = \frac{\partial J_{AE}}{\partial W_1} + \varepsilon \frac{\partial J_{sparse}}{\partial W_1} + \lambda \frac{\partial J_{sg}}{\partial W_1} + \mu \frac{\partial J_{was}}{\partial W_1}$
Since $W_1$ is not affected by $J_{fg}$, that term is removed from $J_{guide}$ and only $J_{sg}$ remains after differentiation. $W_2'$ is computed as follows:
$\frac{\partial J_{IGSAE}}{\partial W_2'} = \frac{\partial J_{AE}}{\partial W_2'} + \varepsilon \frac{\partial J_{sparse}}{\partial W_2'} + \lambda \frac{\partial J_{guide}}{\partial W_2'} + \mu \frac{\partial J_{was}}{\partial W_2'} = \frac{\partial J_{AE}}{\partial W_2'} + \lambda \frac{\partial J_{sg}}{\partial W_2'} + \mu \frac{\partial J_{was}}{\partial W_2'}$
Since $W_2'$ is the weight between the first guiding layer and the reconstruction layer, it is not affected by $J_{sparse}$; therefore, $J_{sparse}$ is removed. $W_1'$ is computed as follows:
$\frac{\partial J_{IGSAE}}{\partial W_1'} = \frac{\partial J_{IGSAE}}{\partial W_3} = \left[ \frac{\partial J_{AE}}{\partial W_1'} + \varepsilon \frac{\partial J_{sparse}}{\partial W_1'} + \lambda \frac{\partial J_{guide}}{\partial W_1'} + \mu \frac{\partial J_{was}}{\partial W_1'} \right] + \left[ \frac{\partial J_{AE}}{\partial W_3} + \varepsilon \frac{\partial J_{sparse}}{\partial W_3} + \lambda \frac{\partial J_{guide}}{\partial W_3} + \mu \frac{\partial J_{was}}{\partial W_3} \right] = \lambda \frac{\partial J_{guide}}{\partial W_1'} + \mu \frac{\partial J_{was}}{\partial W_3}$
Since $W_1'$ and $W_3$ must have adjacent gradient values and are updated interactively, the sum of the partial derivative of $J_{guide}$ with respect to $W_1'$ and the partial derivative of $J_{was}$ with respect to $W_3$ is used for the update. $W_2$ is computed as follows:
$\frac{\partial J_{IGSAE}}{\partial W_2} = \frac{\partial J_{AE}}{\partial W_2} + \varepsilon \frac{\partial J_{sparse}}{\partial W_2} + \lambda \frac{\partial J_{guide}}{\partial W_2} + \mu \frac{\partial J_{was}}{\partial W_2} = \frac{\partial J_{AE}}{\partial W_2} + \lambda \frac{\partial J_{guide}}{\partial W_2} + \mu \frac{\partial J_{was}}{\partial W_2}$
The main operating process of IGSAE is shown in Algorithm 1 and Figure 3:
Algorithm 1: IGSAE
Input: Guided feature $X$ (input data), guiding feature $X_g$
1: Set the hyperparameters $\varepsilon$, $\mu$, and $\lambda$, and the number of hidden layers and units.
2: Pretrain $G_1$ of the first guiding layer using Equation (10).
3: Training start
4:      Compute $H$ of the hidden layer and $G_1$ of the first guiding layer, and the $J_{sparse}$ and $J_{fg}$ terms.
5:      Compute $X'$ of the reconstruction layer and the $J_{AE}$ term.
6:      Compute $G_2$ of the second guiding layer and the $J_{guide}$ and $J_{was}$ terms.
7:      Compute the partial derivatives of $J_{IGSAE}$ with respect to $W_1$, $W_2$, $W_1'$, $W_2'$, and $W_3$ by Equations (22)–(25) to obtain the gradients.
8:      Update the weights using gradient descent.
9: Repeat steps 4–8 until the maximum number of iterations is reached.
Output: Reconstructed guided feature $X'$
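The following is a compact, hedged PyTorch sketch of one training step of Algorithm 1. It uses automatic differentiation in place of the hand-derived gradients of Equations (22)–(25), reuses a weight-clipped critic such as the one sketched earlier for $J_{was}$, and all layer sizes, activations, coefficient values, and the optimizer are assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn

class IGSAE(nn.Module):
    # Compact sketch of the structure in Figure 2; sizes are illustrative.
    def __init__(self, guided_dim, guiding_dim, hid_dim):
        super().__init__()
        self.hidden = nn.Linear(guided_dim, hid_dim)       # W1,  b1
        self.guide1 = nn.Linear(guided_dim, guiding_dim)   # W1', b1'
        self.rec_h  = nn.Linear(hid_dim, guided_dim)       # W2,  b2
        self.rec_g  = nn.Linear(guiding_dim, guided_dim)   # W2', b2'
        self.guide2 = nn.Linear(guided_dim, guiding_dim)   # W3,  b3

    def forward(self, x):
        h  = torch.relu(self.hidden(x))                                  # Eq. (8)
        g1 = torch.relu(self.guide1(x))                                  # Eq. (9)
        x_rec = torch.relu(self.rec_h(h)) + torch.relu(self.rec_g(g1))   # Eq. (13)
        g2 = torch.relu(self.guide2(x_rec))                              # Eq. (15)
        return x_rec, h, g1, g2

def kl_sparsity(h, rho=0.05, eps=1e-8):
    rho_hat = h.mean(dim=0).clamp(eps, 1 - eps)
    return (rho * torch.log(rho / rho_hat)
            + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()

def train_step(model, critic, opt, x, x_g, eps=0.03, lam=0.12, mu=0.003):
    # One step of Equation (18); `opt` is an optimizer over model.parameters()
    # (e.g., torch.optim.Adam) and `critic` is a pretrained, weight-clipped f_w.
    x_rec, h, g1, g2 = model(x)
    j_ae     = ((x - x_rec) ** 2).sum(dim=1).mean() / 2                  # Eq. (14)
    j_sparse = kl_sparsity(h)                                            # Eq. (12)
    j_guide  = (((x_g - g1) ** 2).sum(dim=1).mean() / 2                  # Eq. (10)
                + ((x_g - g2) ** 2).sum(dim=1).mean() / 2)               # Eq. (16)
    j_was    = critic(g1).mean() - critic(g2).mean()                     # Eq. (21)
    loss = j_ae + eps * j_sparse + lam * j_guide + mu * j_was            # Eq. (18)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```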

4. Experimental Result of IGSAE

4.1. Datasets and Experimental Setups

In this paper, experiments are conducted using six public datasets with various dimensionalities: Heart Disease (Heart), Wine Quality (Wine), U.S. Postal Service (USPS), Fashion-MNIST, Yale face (Yale), and Columbia Object Image Library (COIL-100). The features of the datasets are divided into low-dimensional, medium-dimensional, and high-dimensional features. Brief information on the datasets is summarized in Table 1. We carry out the experiments on a server with the following configuration: two CPUs (AMD EPYC 7742, 64 cores), eight NVIDIA A100 GPUs (DGX A100), 1 TB of memory, and 15 TB of NVMe storage.
The proposed IGSAE has three regularization terms in Equation (18): an interactive guiding term ($J_{guide}$), a sparse regularization term ($J_{sparse}$), and a Wasserstein regularization term ($J_{was}$). We find the optimal values of $\varepsilon$, $\lambda$, and $\mu$ in the equation for the different datasets. In particular, $\lambda$ and $\mu$ affect the leverage of the guiding feature, so the classification performance depends on sensitively tuned hyperparameters. The parameters of IGSAE, including the guiding layers and the coefficients, are summarized in Table 2.

4.2. Authentication of Main Innovation

The main innovation of this study is to solve the overfitting problem of the autoencoder model. Overfitting of an autoencoder mainly falls into two categories: learning the distribution of the identity function, and transforming unrelated data as if they belonged to the training data [39,40,41,42]. To address these two categories, experiments were conducted in two phases. First, we designed an experiment to show that IGSAE does not learn the distribution of the identity function. Second, an experiment was conducted to test whether the deep features extracted by IGSAE can express the main information. In this section, an ablation study is used to evaluate the basic structure of IGSAE and the effects of sparse regularization and Wasserstein regularization. Table 3 shows the results using SVM as the classifier. The symbols in Table 3 are as follows. RF: raw features; DF: deep features trained by IGSAE without Wasserstein regularization and sparse regularization; DF & WR: deep features with Wasserstein regularization; and DF & WR & SR: deep features with Wasserstein regularization and sparse regularization. As shown in Table 1 and Table 3, our methods produce lower-dimensional features than the conventional RF on most datasets, yet the results show no significant performance degradation. DF & WR (DW) and DF & WR & SR (DWS) performed as follows relative to RF: DWS is 0.17% higher on Heart; DWS is 0.59% lower on Wine; DW is 1.42% lower on USPS; DWS is 1.73% lower on Fashion-MNIST; DWS is 3.21% lower on Yale; and DWS is 0.86% lower on COIL-100. In most cases the performance decreases slightly, but the training efficiency increases considerably: by about 31% (from 13 features to 9) on Heart, about 33% (from 12 to 8) on Wine, about 29% (from 256 to 180) on USPS, about 29% (from 784 to 549) on Fashion-MNIST, about 30% (from 1024 to 717) on Yale, and about 30% (from 1024 to 717) on COIL-100. Most of the results show a learning-efficiency gain of about 30%. Therefore, this strategy is shown to be effective and can improve the performance and stability of the existing autoencoder model.

4.3. Comparison of the Benchmark Datasets

The efficiency of the proposed IGSAE was shown in Section 4.2; in this section, we show that IGSAE performs better than existing dimensionality reduction algorithms and does not learn the distribution of the identity function. When an autoencoder learns the distribution of the identity function to minimize the reconstruction error, it is difficult to expect any improvement in classification because the features are barely transformed. Therefore, if the hybrid feature formed from the deep features extracted by IGSAE and the raw features improves the classification accuracy over using the raw features alone, then IGSAE is not overfitted to the identity function. In Table 4, Hybrid IGSAE improves the accuracy by 0.75%, 1.09%, 0.45%, 0.48%, 2.02%, and 0.57% on the six datasets, respectively, which means that IGSAE does not learn the distribution of the identity function. In addition, Table 4 shows how much IGSAE improves over existing feature extraction methods. The proposed IGSAE is compared with representative autoencoder models and dimensionality reduction methods: kernel PCA, LDA, the stacked autoencoder (SAE), the stacked sparse autoencoder (SSAE), uniform manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (t-SNE) [43,44,45]. As shown in the table, the proposed method is superior in most cases on the six datasets. IGSAE gives the best outcome on Heart, USPS, Fashion-MNIST, and COIL-100, and is only slightly lower, by 0.62% and 0.13%, on Wine and Yale compared to SSAE and kernel PCA, respectively. In addition, the accuracy of Hybrid IGSAE on Heart is higher by 0.75% and 0.58% than RF and the best competing algorithm, respectively. Similarly, the performance increases by 1.09% and 1.06% on Wine, 0.45% and 1.87% on USPS, 0.48% and 2.21% on Fashion-MNIST, 2.02% and 5.23% on Yale, and 0.57% and 1.43% on COIL-100, respectively, compared to RF and the best competing algorithm. This means that the proposed method can avoid overfitting of the autoencoder and achieve superior results.

4.4. Prevention of Transforming Irrelevant Data into the Training Data Distribution

In Section 4.3, we showed that IGSAE does not learn the distribution of the identity function. However, autoencoder performance fundamentally relies on reconstruction quality, the most important component of the encoding–decoding process, and the reconstruction error alone can still be an insufficient indicator because of the other type of overfitting. Thus, additional experiments are necessary. We compare the reconstruction error together with the actual images and show that the proposed IGSAE can relieve the overfitting that transforms data unrelated to the training set into the distribution of the training set. For Figure 4, the experimental setup of IGSAE on Fashion-MNIST and COIL-100 follows Table 2. In addition, Gaussian noise was used as an extra test set to observe the transformation of the data distribution. In Figure 4, IGSAE restores detailed information better than the basic autoencoder model. However, in Table 5, the reconstruction error of IGSAE on Fashion-MNIST and COIL-100 is higher than that of the basic autoencoder. This suggests that the basic autoencoder is overfitted: despite its lower reconstruction error, it shows lower classification performance than IGSAE. In the figure, when Gaussian noise is used as the input, most of the noise information is only vaguely expressed by the basic autoencoder. If there are many similar data in the training set, the reconstruction error can be reduced by transforming the information, but the main information can be lost. IGSAE shows that this overfitting is relieved through the guiding layers, Wasserstein regularization, and sparse regularization.
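As a rough illustration of this experiment, the snippet below feeds a trained model both real test samples and pure Gaussian noise and compares the two reconstruction errors; the model interface follows the IGSAE sketch given earlier and is an assumption of this example.

```python
import torch

def gaussian_noise_probe(model, x_test, n_noise=64):
    # Compare reconstruction error on real test data with error on Gaussian
    # noise; an overfitted autoencoder drags noise toward the training
    # distribution, which shows up as a deceptively low noise error.
    model.eval()
    with torch.no_grad():
        noise = torch.randn(n_noise, x_test.shape[1])   # unrelated input
        rec_real, *_ = model(x_test)
        rec_noise, *_ = model(noise)
        err_real  = ((x_test - rec_real) ** 2).mean().item()
        err_noise = ((noise - rec_noise) ** 2).mean().item()
    return err_real, err_noise
```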

4.5. Effect of the Coefficients and Classifiers

4.5.1. Coefficients Optimization on the Six Datasets

In this experiment, two optimization techniques are used to find the optimal values of the three coefficients $\varepsilon$, $\lambda$, and $\mu$. The first is a grid search algorithm that evaluates all values within a specific range, and the second is a Bayesian optimization algorithm that probabilistically finds the optimal parameters using a surrogate model (Gaussian process) and an acquisition function [46,47,48]. Figure 5 shows the grid search results for $\varepsilon$, $\lambda$, and $\mu$ over the ranges $\{0.005 \times e \mid e = 1, 2, \ldots, 10\}$, $\{0.01 \times e \mid e = 1, 2, \ldots, 10\}$, and $\{0.001 \times e \mid e = 1, 2, \ldots, 10\}$, respectively. Since $\lambda$ is relatively more influential than $\varepsilon$ and $\mu$, we narrow the range of this sensitive coefficient based on the grid search results and additionally tune it with the Bayesian optimization algorithm to obtain a suitable value efficiently. Each of the six datasets is split into two subsets: about one-fifth of the samples as test data and the rest as training data. We ran the tests five times with randomly shuffled samples, and SVM was chosen as the classifier for optimizing all coefficients.

Figure 5a shows the parameter experiment on the Heart dataset. As the Wasserstein regularization parameter increases, the classification accuracy decreases; it shows superior results for values between about 0.003 and 0.004. $\lambda$ is the most important coefficient for extracting core features, and its optimal value should be obtained experimentally since the performance varies greatly depending on the data. If $\lambda$ is too low, it causes a large loss of the guiding feature and low classification accuracy; if $\lambda$ is too high, the weights are updated only to improve the performance of the guiding layers, which leads to a high reconstruction error of IGSAE. The role of $\mu$ is to prevent overfitting. If $\mu$ is too low, the first and second guiding layers are not interactive; in this case, the imbalanced guiding layers are updated only to reduce the reconstruction error, which easily causes overfitting. If $\mu$ is too high, the first and second guiding layers become almost identical, which is too rigid to be optimal and leads to a high reconstruction error. $\varepsilon$ filters out sparse nodes and may increase or decrease the performance depending on the data and hidden units; in Figure 5c, IGSAE shows better results when sparse regularization is not applied. Therefore, the parameter optimization process is imperative to obtain useful features in IGSAE. Figure 5b–f show the optimization results on Wine, USPS, Fashion-MNIST, Yale, and COIL-100, respectively.
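The grid search over the three coefficient ranges can be sketched as below; `train_and_score` is a hypothetical callback (not from the paper) that trains IGSAE with the given coefficients, fits the SVM on the extracted features, and returns the test accuracy for one random 80/20 split.

```python
import itertools
import numpy as np

# Coefficient grids described in Section 4.5.1.
eps_grid = [0.005 * e for e in range(1, 11)]
lam_grid = [0.01 * e for e in range(1, 11)]
mu_grid  = [0.001 * e for e in range(1, 11)]

def grid_search(train_and_score, n_runs=5):
    # Exhaustive search over (eps, lam, mu), averaging accuracy over n_runs
    # shuffled splits, as in the experimental protocol.
    best_cfg, best_acc = None, -np.inf
    for eps, lam, mu in itertools.product(eps_grid, lam_grid, mu_grid):
        acc = np.mean([train_and_score(eps, lam, mu) for _ in range(n_runs)])
        if acc > best_acc:
            best_cfg, best_acc = (eps, lam, mu), acc
    return best_cfg, best_acc
```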

4.5.2. Effects of Classifier

The classifier has a significant impact on the performance of IGSAE. We conducted an experiment using three representative classifiers: support vector machine (SVM), K-nearest neighbors (KNN), and random forest [49]. The experimental results are shown in Table 6. KNN shows relatively poor performance. Random forest achieves higher accuracies than KNN and the highest accuracy on Yale; however, it does not show optimal performance because all hyperparameters are set to their defaults. Since SVM can be applied stably to the proposed algorithm and shows excellent accuracies in most cases, it was chosen as the final classifier for IGSAE. Since the proposed IGSAE belongs to the dimensionality reduction and feature extraction domain, further experiments on different classifiers were not carried out. However, if more suitable and optimized classifiers are used in future work, performance can be improved, and the method can be extended to other tasks.
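A minimal scikit-learn sketch of this comparison is shown below; it assumes the deep features extracted by IGSAE are already available as an array, and it keeps all classifiers at their default settings, as in Table 6.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def compare_classifiers(deep_features, labels, seed=0):
    # Feed the IGSAE deep features to three off-the-shelf classifiers.
    x_tr, x_te, y_tr, y_te = train_test_split(
        deep_features, labels, test_size=0.2, random_state=seed)
    models = {
        "SVM": SVC(),
        "KNN": KNeighborsClassifier(),
        "Random Forest": RandomForestClassifier(random_state=seed),
    }
    return {name: accuracy_score(y_te, m.fit(x_tr, y_tr).predict(x_te))
            for name, m in models.items()}
```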

5. Conclusions and Future Work

In this paper, we introduced a new method called IGSAE that works with interactive guiding layers. It is a dimensionality reduction method that combines the ideas of feature selection and feature extraction. There are two key points. One is the use of guiding layers to flexibly update the weights; this offers the advantages of feature selection and manages the influence of each feature to reduce the loss of classification performance. The other is a constraint method based on the Wasserstein distance and KL divergence: the Wasserstein distance prevents overfitting in the guiding layers, and KL divergence restricts the useless nodes of the network. These two ideas interact to extract the main information of the original features and reduce the dimensionality. In the experiments, we compared IGSAE with other traditional dimensionality reduction methods and showed better performance. The proposed method outperformed the other methods on average and achieved the highest classification accuracies because the coefficients of $J_{guide}$ and $J_{was}$ were efficiently optimized through the experimental tests.
As future work, we plan to extend our methodology to vision tasks by applying convolutional layers to ensure compatibility with image data and high-dimensional data representations [50]. Dimensionality reduction methods are used not only for efficiency and performance but also for visualization to show a representative data structure.

Author Contributions

Conceptualization, S.K.; methodology, L.H.; software, H.L. and C.H.; validation, H.L. and C.H.; formal analysis, H.L. and B.I.; investigations, H.L.; resources, H.L.; data curation, H.L.; writing—original draft preparation, H.L. and S.K.; writing—review and editing, H.L., C.H., B.I. and S.K.; visualization, H.L. and B.I.; supervision, H.L. and C.H.; project administration, S.K.; All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by an Inha University grant.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

All authors declare that they have no conflict of interest.

References

  1. Storcheus, D.; Rostamizadeh, A.; Kumar, S. A survey of modern questions and challenges in feature extraction. In Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015, PMLR2015, Montreal, QC, Canada, 11 December 2015; pp. 1–18.
  2. Zhou, F.; Fan, H.; Liu, Y.; Zhang, H.; Ji, R. Hybrid Model of Machine Learning Method and Empirical Method for Rate of Penetration Prediction Based on Data Similarity. Appl. Sci. 2023, 13, 5870.
  3. Janiesch, C.; Zschech, P.; Heinrich, K. Machine learning and deep learning. Electron Mark. 2021, 31, 685–695.
  4. Chen, X.; Ding, M.; Wang, X.; Xin, Y.; Mo, S.; Wang, Y.; Han, S.; Luo, P.; Zeng, G.; Wang, J. Context autoencoder for self-supervised representation learning. arXiv 2022, arXiv:2202.03026.
  5. Aguilar, D.L.; Medina-Pérez, M.A.; Loyola-Gonzalez, O.; Choo, K.K.R.; Bucheli-Susarrey, E. Towards an interpretable autoencoder: A decision-tree-based autoencoder and its application in anomaly detection. IEEE Trans. Dependable Secur. Comput. 2022, 20, 1048–1059.
  6. Wang, Y.; Yao, H.; Zhao, S. Auto-encoder based dimensionality reduction. Neurocomputing 2016, 184, 232–242.
  7. Liou, C.-Y.; Cheng, W.-C.; Liou, J.-W.; Liou, D.-R. Autoencoder for words. Neurocomputing 2014, 139, 84–96.
  8. Li, J.; Luong, M.-T.; Jurafsky, D. A hierarchical neural autoencoder for paragraphs and documents. arXiv 2015, arXiv:1506.01057.
  9. Tschannen, M.; Bachem, O.; Lucic, M. Recent advances in autoencoder-based representation learning. arXiv 2018, arXiv:1812.05069.
  10. Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature selection: A data perspective. ACM Comput. Surv. (CSUR) 2017, 50, 1–45.
  11. Jović, A.; Brkić, K.; Bogunović, N. A review of feature selection methods with applications. In Proceedings of the 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 25–29 May 2015; pp. 1200–1205.
  12. Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28.
  13. Osia, S.A.; Taheri, A.; Shamsabadi, A.S.; Katevas, K.; Haddadi, H.; Rabiee, H.R. Deep private-feature extraction. IEEE Trans. Knowl. Data Eng. 2018, 32, 54–66.
  14. Ghojogh, B.; Samad, M.N.; Mashhadi, S.A.; Kapoor, T.; Ali, W.; Karray, F.; Crowley, M. Feature selection and feature extraction in pattern analysis: A literature review. arXiv 2019, arXiv:1905.02845.
  15. Schmidt, M.; Fung, G.; Rosales, R. Fast optimization methods for L1 regularization: A comparative study and two new approaches. In Proceedings of the 18th European Conference on Machine Learning, Warsaw, Poland, 17–21 September 2007; pp. 286–297.
  16. Van Laarhoven, T. L2 regularization versus batch and weight normalization. arXiv 2017, arXiv:1706.05350.
  17. Azhagusundari, B.; Thanamani, A.S. Feature selection based on information gain. Int. J. Innov. Technol. Explor. Eng. 2013, 2, 18–21.
  18. Bryant, F.B.; Satorra, A. Principles and practice of scaled difference chi-square testing. Struct. Equ. Model. A Multidiscip. J. 2012, 19, 372–398.
  19. Mika, S.; Schölkopf, B.; Smola, A.J.; Müller, K.-R.; Scholz, M.; Rätsch, G. Kernel PCA and de-noising in feature spaces. Adv. Neural Inf. Process. Syst. 1998, 11, 536–542.
  20. Ding, C.; Zhou, D.; He, X.; Zha, H. R1-PCA: Rotational invariant L1-norm principal component analysis for robust subspace factorization. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 281–288.
  21. Andrew, G.; Arora, R.; Bilmes, J.; Livescu, K. Deep canonical correlation analysis. In Proceedings of the International Conference on Machine Learning, PMLR2013, Atlanta, GA, USA, 17–19 June 2013; pp. 1247–1255.
  22. Yu, H.; Yang, J. A direct LDA algorithm for high-dimensional data—With application to face recognition. Pattern Recognit. 2001, 34, 2067–2070.
  23. Martinez, A.M.; Kak, A.C. PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 228–233.
  24. Zhou, P.; Han, J.; Cheng, G.; Zhang, B. Learning compact and discriminative stacked autoencoder for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4823–4833.
  25. Sun, M.; Wang, H.; Liu, P.; Huang, S.; Fan, P. A sparse stacked denoising autoencoder with optimized transfer learning applied to the fault diagnosis of rolling bearings. Measurement 2019, 146, 305–314.
  26. Coutinho, M.G.; Torquato, M.F.; Fernandes, M.A. Deep neural network hardware implementation based on stacked sparse autoencoder. IEEE Access 2019, 7, 40674–40694.
  27. Shi, Y.; Lei, J.; Yin, Y.; Cao, K.; Li, Y.; Chang, C.-I. Discriminative feature learning with distance constrained stacked sparse autoencoder for hyperspectral target detection. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1462–1466.
  28. Xiao, Y.; Wu, J.; Lin, Z.; Zhao, X. A semi-supervised deep learning method based on stacked sparse auto-encoder for cancer prediction using RNA-seq data. Comput. Methods Programs Biomed. 2018, 166, 99–105.
  29. Sankaran, A.; Vatsa, M.; Singh, R.; Majumdar, A. Group sparse autoencoder. Image Vis. Comput. 2017, 60, 64–74.
  30. Chai, Z.; Song, W.; Wang, H.; Liu, F. A semi-supervised auto-encoder using label and sparse regularizations for classification. Appl. Soft Comput. 2019, 77, 205–217.
  31. Xu, H.; Luo, D.; Henao, R.; Shah, S.; Carin, L. Learning autoencoders with relational regularization. In Proceedings of the International Conference on Machine Learning, PMLR2020, Virtual Event, 13–18 July 2020; pp. 10576–10586.
  32. Vayer, T.; Chapel, L.; Flamary, R.; Tavenard, R.; Courty, N. Fused Gromov-Wasserstein distance for structured objects. Algorithms 2020, 13, 212.
  33. Liang, J.; Liu, R. Stacked denoising autoencoder and dropout together to prevent overfitting in deep neural network. In Proceedings of the 2015 8th International Congress on Image and Signal Processing (CISP), Shenyang, China, 14–16 October 2015; pp. 697–701.
  34. Goldberger, J.; Gordon, S.; Greenspan, H. An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. In Proceedings of the Ninth IEEE International Conference on Computer Vision, ICCV2003, Nice, France, 13–16 October 2003; pp. 487–493.
  35. Huang, G.-B.; Zhu, Q.-Y.; Siew, C.-K. Extreme learning machine: A new learning scheme of feedforward neural networks. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), Budapest, Hungary, 25–29 July 2004; pp. 985–990.
  36. Yang, Z.; Xu, B.; Luo, W.; Chen, F. Autoencoder-based representation learning and its application in intelligent fault diagnosis: A review. Measurement 2022, 189, 110460.
  37. Zheng, Q.; Zhao, P.; Zhang, D.; Wang, H. MR-DCAE: Manifold regularization-based deep convolutional autoencoder for unauthorized broadcasting identification. Int. J. Intell. Syst. 2021, 36, 7204–7238.
  38. Li, Y.; Lei, Y.; Wang, P.; Jiang, M.; Liu, Y. Embedded stacked group sparse autoencoder ensemble with L1 regularization and manifold reduction. Appl. Soft Comput. 2021, 101, 107003.
  39. Steck, H. Autoencoders that don't overfit towards the identity. Adv. Neural Inf. Process. Syst. 2020, 33, 19598–19608.
  40. Probst, M.; Rothlauf, F. Harmless overfitting: Using denoising autoencoders in estimation of distribution algorithms. J. Mach. Learn. Res. 2020, 21, 2992–3022.
  41. Kunin, D.; Bloom, J.; Goeva, A.; Seed, C. Loss landscapes of regularized linear autoencoders. In Proceedings of the International Conference on Machine Learning, PMLR2019, Long Beach, CA, USA, 9–15 June 2019; pp. 3560–3569.
  42. Pretorius, A.; Kroon, S.; Kamper, H. Learning dynamics of linear denoising autoencoders. In Proceedings of the International Conference on Machine Learning, PMLR2018, Stockholm, Sweden, 10–15 July 2018; pp. 4141–4150.
  43. Bunte, K.; Haase, S.; Biehl, M.; Villmann, T. Stochastic neighbor embedding (SNE) for dimension reduction and visualization using arbitrary divergences. Neurocomputing 2012, 90, 23–45.
  44. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426.
  45. Becht, E.; McInnes, L.; Healy, J.; Dutertre, C.A.; Kwok, I.W.; Ng, L.G.; Newell, E.W. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 2018, 37, 38–44.
  46. Wang, H.; van Stein, B.; Emmerich, M.; Back, T. A new acquisition function for Bayesian optimization based on the moment-generating function. In Proceedings of the 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, Canada, 5–8 October 2017; pp. 507–512.
  47. Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian optimization of machine learning algorithms. Adv. Neural Inf. Process. Syst. 2012, 25, 1–9.
  48. Audet, C.; Denni, J.; Moore, D.; Booker, A.; Frank, P. A surrogate-model-based method for constrained optimization. In Proceedings of the 8th Symposium on Multidisciplinary Analysis and Optimization, Long Beach, CA, USA, 6–8 September 2000; p. 4891.
  49. Lin, W.; Wu, Z.; Lin, L.; Wen, A.; Li, J. An ensemble random forest algorithm for insurance big data analysis. IEEE Access 2017, 5, 16568–16575.
  50. Nikoloulopoulou, N.; Perikos, I.; Daramouskas, I.; Makris, C.; Treigys, P.; Hatzilygeroudis, I. A convolutional autoencoder approach for boosting the specificity of retinal blood vessels segmentation. Appl. Sci. 2023, 13, 3255.
Figure 1. The virtual structure of the autoencoder using ELM.
Figure 2. The architecture of IGSAE. It consists of the first guiding layer for the reconstruction of guided features and the second guiding layer to reflect the distribution of guiding features.
Figure 3. Step-by-step flowchart of the proposed method.
Figure 4. Reconstructed images of the overfitted autoencoder models.
Figure 5. Effects of the coefficients for classification accuracies on the six datasets: (a) Heart; (b) Wine; (c) USPS; (d) Fashion-MNIST; (e) Yale; (f) COIL-100.
Table 1. Summary of six public datasets.

| Category | Dataset | Classes | Features | Training | Testing | Dimensionality after Feature Extraction |
|---|---|---|---|---|---|---|
| Low-dimension | Heart | 2 | 13 | 240 | 63 | 9 |
| Low-dimension | Wine | 2 | 12 | 4873 | 1624 | 8 |
| Medium-dimension | USPS | 10 | 256 | 7291 | 2007 | 180 |
| Medium-dimension | Fashion-MNIST | 10 | 784 | 60,000 | 10,000 | 549 |
| High-dimension | Yale | 15 | 1024 | 130 | 35 | 717 |
| High-dimension | COIL-100 | 100 | 1024 | 1080 | 360 | 717 |
Table 2. Parameters of IGSAE for six datasets.

| Dataset | Guiding Units of IGSAE | ε | λ | μ |
|---|---|---|---|---|
| Heart | 64 – 32 – 64 | 0.040 | 0.119 | 0.004 |
| Wine | 64 – 32 – 64 | 0.035 | 0.166 | 0.003 |
| USPS | 256 – 128 – 256 | 0 | 0.025 | 0.001 |
| Fashion-MNIST | 256 – 128 – 256 | 0.030 | 0.115 | 0.003 |
| Yale | 512 – 256 – 512 | 0.030 | 0.171 | 0.002 |
| COIL-100 | 512 – 256 – 512 | 0.035 | 0.172 | 0.003 |
Table 3. Ablation study on key components of IGSAE.

| Dataset | RF (%) | DF (%) | DF & WR (%) | DF & WR & SR (%) |
|---|---|---|---|---|
| Heart | 82.47 ± 2.1 | 81.24 ± 1.3 | 82.44 ± 1.8 | 82.64 ± 2.0 |
| Wine | 96.84 ± 0.5 | 94.62 ± 2.0 | 96.12 ± 0.6 | 96.25 ± 1.1 |
| USPS | 94.92 ± 3.4 | 93.15 ± 1.6 | 93.50 ± 2.4 | 93.24 ± 2.3 |
| Fashion-MNIST | 90.60 ± 0.4 | 86.76 ± 2.3 | 88.46 ± 2.1 | 88.87 ± 1.8 |
| Yale | 74.56 ± 2.1 | 70.51 ± 5.7 | 71.08 ± 4.2 | 71.35 ± 5.5 |
| COIL-100 | 96.19 ± 0.2 | 93.87 ± 1.1 | 94.56 ± 1.2 | 95.33 ± 0.5 |
Table 4. Classification accuracies of seven dimensionality reduction methods and Hybrid IGSAE compared with the raw features (RF) on the six datasets.

| Dataset | RF (%) | Kernel PCA (%) | LDA (%) | SAE (%) | SSAE (%) | UMAP (%) | t-SNE (%) | IGSAE (%) | Hybrid IGSAE (%) |
|---|---|---|---|---|---|---|---|---|---|
| Heart | 82.47 ± 2.1 | 81.82 ± 1.1 | 76.86 ± 4.5 | 82.23 ± 2.8 | 81.82 ± 2.7 | 74.38 ± 5.2 | 60.33 ± 9.2 | 82.64 ± 2.0 | 83.22 ± 1.7 |
| Wine | 96.84 ± 0.5 | 96.00 ± 1.2 | 93.15 ± 1.4 | 95.98 ± 1.5 | 96.87 ± 1.1 | 95.54 ± 2.2 | 95.52 ± 2.9 | 96.25 ± 1.1 | 97.93 ± 0.8 |
| USPS | 94.92 ± 3.4 | 91.46 ± 5.6 | 92.78 ± 5.7 | 93.47 ± 3.9 | 93.23 ± 4.5 | 89.23 ± 9.9 | 86.36 ± 8.4 | 93.50 ± 2.4 | 95.37 ± 3.1 |
| Fashion-MNIST | 90.60 ± 0.4 | 87.07 ± 3.5 | 87.27 ± 3.2 | 87.98 ± 2.7 | 88.65 ± 1.8 | 81.28 ± 6.4 | 81.38 ± 4.7 | 88.87 ± 1.8 | 91.08 ± 1.0 |
| Yale | 74.56 ± 2.1 | 71.48 ± 6.8 | 68.26 ± 3.6 | 67.44 ± 7.4 | 70.78 ± 4.1 | 69.33 ± 5.5 | 61.45 ± 8.5 | 71.35 ± 5.5 | 76.58 ± 4.2 |
| COIL-100 | 96.19 ± 0.2 | 94.55 ± 1.3 | 92.95 ± 1.5 | 95.01 ± 0.9 | 95.12 ± 1.1 | 90.03 ± 4.6 | 89.85 ± 4.2 | 95.33 ± 0.5 | 96.76 ± 0.8 |
Table 5. Reconstruction loss of the autoencoder models.

| Dataset | Basic AE | SAE | SSAE | IGSAE |
|---|---|---|---|---|
| Heart | 0.3521 | 0.3213 | 0.3494 | 0.3132 |
| Wine | 0.4560 | 0.4651 | 0.4476 | 0.4242 |
| USPS | 0.1757 | 0.1634 | 0.1654 | 0.1537 |
| Fashion-MNIST | 0.2463 | 0.2578 | 0.2563 | 0.2487 |
| Yale | 0.8754 | 0.8743 | 0.8864 | 0.8435 |
| COIL-100 | 0.4749 | 0.4836 | 0.4798 | 0.4803 |
Table 6. Classification accuracies of the IGSAE under different classifiers.

| Classifier | Heart | Wine | USPS | Fashion-MNIST | Yale | COIL-100 |
|---|---|---|---|---|---|---|
| SVM (%) | 82.64 ± 2.0 | 96.25 ± 1.1 | 93.50 ± 2.4 | 88.87 ± 1.8 | 71.35 ± 5.5 | 95.33 ± 0.5 |
| KNN (%) | 77.48 ± 6.3 | 92.78 ± 3.2 | 89.75 ± 4.7 | 83.79 ± 2.9 | 69.95 ± 4.1 | 86.46 ± 4.6 |
| Random Forest (%) | 81.72 ± 1.8 | 94.55 ± 2.1 | 90.56 ± 2.2 | 86.78 ± 2.7 | 72.17 ± 4.8 | 94.8 ± 11.4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
