Article

Orthogonal Matrix-Autoencoder-Based Encoding Method for Unordered Multi-Categorical Variables with Application to Neural Network Target Prediction Problems

Yiying Wang, Jinghua Li, Boxin Yang, Dening Song and Lei Zhou

1 College of Shipbuilding Engineering, Harbin Engineering University, Harbin 150001, China
2 College of Mechanical and Electrical Engineering, Harbin Engineering University, Harbin 150001, China
3 Sanya Nanhai Innovation and Development Base of Harbin Engineering University, Harbin Engineering University, Sanya 572024, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7466; https://doi.org/10.3390/app14177466
Submission received: 6 August 2024 / Revised: 21 August 2024 / Accepted: 21 August 2024 / Published: 23 August 2024

Abstract

Neural network models such as BP and LSTM support only numerical inputs, so categorical variables must be preprocessed into numerical data. For unordered multi-categorical variables, existing encoding methods can cause a dimensionality explosion and can also introduce spurious order and distance bias into the neural network computation. To solve these problems, this paper proposes O-AE, an encoding method for unordered multi-categorical variables that encodes with an orthogonal matrix and uses an autoencoder for representation learning and dimensionality reduction of the codes. Bayesian optimization is used for hyperparameter optimization of the autoencoder. Finally, seven experiments were designed, covering the basic O-AE, O-AE with Bayesian optimization of the autoencoder hyperparameters (O-AE-b), and other encoding methods, to encode the unordered multi-categorical variables in five datasets; the encoded data were input into a BP neural network for target prediction. The results show that O-AE and O-AE-b achieve better prediction results, proving that the proposed method is highly feasible and applicable and can serve as an optional method for the data processing of unordered multi-categorical variables.

1. Introduction

Target prediction is the prediction of a target associated with an input based on existing data by some method. Target prediction is increasingly important in finance [1,2], healthcare [3], manufacturing [4], power [5,6], weather [7], transportation [8], etc. Accurate prediction results can identify potential risks, provide credible prediction data, and improve users’ risk management ability; they can also support executable future plans and provide decision-making support for enterprises.
Neural networks, decision trees, linear regression, polynomial regression, etc., are methods for achieving target prediction. Kim [9] used multiple target prediction methods, conducted experiments under different numbers of features, feature types, and sample sizes, and concluded that artificial neural networks perform better at target prediction under multi-categorical variable input conditions.
Overall, neural networks are the dominant model for solving target prediction problems at this stage. It is important to note that neural network models, such as BP, LSTM, etc., support only numerical data inputs; however, many datasets contain non-numerical variables, i.e., categorical variables. For example, the binary categorical variable “drug reaction” includes the two categories “positive” and “negative”, and “assessment result” is also a binary categorical variable, including “pass” and “fail”; another example is the multi-categorical variable “occupation”, which includes “teacher, firefighter, construction worker, doctor…” and other categories. Numerical variables are easier to interpret, but when categorical variables dominate the dataset, it is not easy to view data trends and make predictions [10]. Therefore, a method must be applied to map categorical variables to numerical values before inputting them into the neural network [11].
Categorical variable encoding methods include one-hot encoding, label encoding, target encoding, embedding, etc. Among these, one-hot encoding applies to unordered categorical variables and generates N-dimensional binary vectors of 0 s and 1 s for N categories. Label encoding applies to ordered categorical variables and maps each category to an integer starting from 0. Target encoding replaces each category with the mean of the target values corresponding to that category. Embedding, on the other hand, randomly generates weight vectors according to the specified embedding dimension and maps the text into word vectors according to the indexes of the categorical variables.
Unordered multi-categorical variables are categorical variables that contain several (usually more than three) categories with no differences in order or distance between categories.
Some existing encoding methods are not entirely suitable for unordered multi-categorical variables; for example, when the number of categories is large, applying one-hot encoding or applying embedding with a large embedding dimension creates the risk of dimensionality explosion, and the data are too sparse after one-hot encoding. Embedding and label encoding also introduce additional order and distance to an originally unordered categorical variable. Target encoding, on the other hand, affects the ability of the neural network model to extract information with the risk of overfitting.
In the current research on the target prediction problem, scholars are more concerned with how to design a better model structure to obtain better prediction results; when processing the input data, the data either do not include categorical variables [12,13,14] or use other traditional encoding methods. For example, Bu et al. [15] used label encoding for both ordered and unordered categorical variables when preprocessing data for selective ensemble learning in ship painting man-hour prediction. Hur et al. [16] predicted ship construction man-hours using data deployable at different times during the manufacturing process and converted the categorical variables into dummy variables. Golnaraghi et al. [17] converted binary categorical variables into labels coded as 1 and 2. Wang et al. [18] coded working time and non-working time as 0 and 1, respectively. Carrizosa et al. [19] processed all the categorical variables using one-hot encoding in a binary classification problem in the presence of categorical variables. All of the above studies adopted existing conventional encoding methods to encode unordered multi-categorical variables but ignored the fact that the encoded values do not preserve the characteristics of unordered multi-categorical variables and additionally introduce misleading order and distance bias.
In research on encoding methods for categorical variables, Gnat [20] applied five encoding methods to three regression models and showed that the regression results vary with the encoding method of the categorical variables; Hien et al. [21] likewise verified that three encoding methods, namely label encoding, one-hot encoding, and embedding, affect the performance of deep dense neural network and long short-term memory neural network models differently. Li et al. [22] proposed an ordered log-linear model for ordered categorical variables that can convert ordered categorical data into model-describable multi-categorical lists; De Meulemeester et al. [23] used unsupervised embedding to convert categorical variables into word vectors. Dahouda et al. [24] extended the word-embedding approach by proposing a deep learning method to encode categorical variables. All of the above methods are better suited to ordered categorical variables or to categorical variables with dependencies. Lee et al. [25] devised a Bayesian-network-based method for converting categorical variables into numerical variables, and Jung et al. [26] proposed a method for updating data points in the kernel space of a continuous variable using SVMs to reflect the effect of each categorical variable; these methods are only suitable when the target is a binary categorical variable.
In general, there is no encoding method that transforms unordered multi-categorical variables into numerical codes that preserve the characteristics of unordered variables while keeping the dimensionality small. Based on this need, this paper carries out research on encoding methods for unordered multi-categorical variables. This paper also extends the work to the problem of predicting the working hours of cruise ship production design tasks: the relationship between task workload and task working hours is analyzed from historical working-hour data and design attribute data in each ship area, such as the task type, the number of model structures, the planned working hours, and the feedback working hours. The aim is to provide credible, standards-based working-hour solutions for the planning and scheduling of design tasks when a new ship is designed and constructed in the future, which is also a key link in promoting the standardization of ship design and construction. In the ship production design task working-hour prediction problem, ship area and task type are categorical variables containing dozens of unordered categories, so unordered multi-categorical variables cannot be ignored.
The autoencoder, as a classical tool for feature extraction and data dimensionality reduction, has continued to show its unique value in the field of data processing in recent years. For example, Yang et al. [27] integrated a deep autoencoder network into the Orthogonal Nonnegative Matrix Factorization (ONMF) framework, achieving accurate capture and hierarchical parsing of the intrinsic structure of complex data. This innovation highlights the ability of the autoencoder to automatically extract high-level features from data in an unsupervised learning setting while effectively reducing the data dimensionality and preserving key information through its unique network structure.
Based on this core capability of the autoencoder, this paper proposes a new encoding method, O-AE, for unordered multi-categorical variables. First, an orthogonal matrix is used to numerically encode the unordered multi-categorical variables, ensuring that the codes are positionally independent, equidistant, and of equal magnitude; second, an autoencoder carries out representation learning on the numerical codes and then performs dimensionality reduction, ensuring that the data input to the neural network have learned the relevant characteristics of the orthogonal matrix while remaining low-dimensional. This paper is structured as follows. Section 2 introduces four traditional methods for encoding multi-categorical variables; Section 3 describes the encoding mechanism of the proposed O-AE; Section 4 introduces the Bayesian optimization of the autoencoder hyperparameters for O-AE; Section 5 designs seven encoding experiments covering six encoding methods (including O-AE and its Bayesian-optimized variant O-AE-b), introduces the five datasets, and presents the evaluation metrics; and Section 6 carries out the example validation and discusses the experimental results.

2. Encoding of Categorical Variables

This section presents the theory of several frequently used encoding methods. First, the parametric representation of unordered multi-categorical variables is given, along with the parametric representation after encoding.
Definition 1. 
Unordered multi-categorical variables are non-numerical variables with no size or positional differences between categories.
Definition 2. 
The sample size of the unordered multi-categorical variable $CV$ is set as $\tilde{N}$, and the variable contains $N$ categories $c_i$, $i = 1, 2, \dots, N$, $N \geq 3$, $N \leq \tilde{N}$. There are multiple data points of the same category in $CV$; category $c_i$ contains $m$ data points $c_i^1, c_i^2, \dots, c_i^m$, $m \geq 1$.
$$CV = \begin{bmatrix} c_1 & c_2 & \cdots & c_i^1 & c_i^2 & \cdots & c_N \end{bmatrix}^T_{\tilde{N} \times 1} \Rightarrow CV: \begin{bmatrix} c_1 & c_2 & \cdots & c_i & \cdots & c_N \end{bmatrix}^T_{N \times 1}, \quad c_i = \left\{ c_i^1, c_i^2, \dots, c_i^m \right\}, \; m \geq 1$$
Using the encoding method $f$, the category $c_i$ is mapped to the numerical value $n_i$, so that the unordered multi-categorical variable $CV$ is mapped to the numerical variable $NV$.
$$CV: \begin{bmatrix} c_1 & c_2 & \cdots & c_i & \cdots & c_N \end{bmatrix}^T_{N \times 1} \xrightarrow{f} NV: \begin{bmatrix} n_1 & n_2 & \cdots & n_i & \cdots & n_N \end{bmatrix}^T_{N \times 1}$$
The categorical data of the same category have the same numerical data under the same encoding method mapping.
Under the mapping $f$, the $m$ identical category data points $c_i^1, c_i^2, \dots, c_i^m$ of $c_i$ in $CV$ are transformed into the numerical values $n_i^1, n_i^2, \dots, n_i^m$, respectively, and $n_i^1 = n_i^2 = \dots = n_i^m$.
$$CV: c_i \xrightarrow{f} NV: n_i, \quad c_i = \left\{ c_i^1, c_i^2, \dots, c_i^m \right\}, \; n_i = \left\{ n_i^1, n_i^2, \dots, n_i^m \right\}, \; m \geq 1, \; n_i^1 = n_i^2 = \dots = n_i^m$$

2.1. Label Encoding

The label encoding $f_L$ maps the categories $c_i$ to the integers $0$ to $N - 1$.
$$CV: \begin{bmatrix} c_1 & c_2 & \cdots & c_i & \cdots & c_N \end{bmatrix}^T_{N \times 1} \xrightarrow{f_L} NV: \begin{bmatrix} 0 & 1 & \cdots & i - 1 & \cdots & N - 1 \end{bmatrix}^T_{N \times 1}$$
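As a minimal illustration, the sketch below applies label encoding with pandas; the column and category names are hypothetical, not taken from the paper's datasets.

```python
import pandas as pd

# Label encoding: each category is mapped to an integer starting from 0.
cv = pd.Series(["teacher", "firefighter", "doctor", "teacher"], name="occupation")
codes, categories = pd.factorize(cv)
print(codes)       # [0 1 2 0] -- identical categories receive identical codes
print(categories)  # Index(['teacher', 'firefighter', 'doctor'], dtype='object')
```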

2.2. Target Encoding

The target encoding $f_T$ maps the category $c_i$ to the mean of the target values corresponding to that category.
Assuming that the prediction target is $y$, the target value corresponding to category $c_i$ is $y_i$. When $c_i$ contains multiple data points $c_i^1, c_i^2, \dots, c_i^m$, the corresponding target values are $y_i^1, y_i^2, \dots, y_i^m$.
$$CV: \begin{bmatrix} c_1 & c_2 & \cdots & c_i & \cdots & c_N \end{bmatrix}^T_{N \times 1} \xrightarrow{f_T} NV: \begin{bmatrix} n_1 & n_2 & \cdots & n_i & \cdots & n_N \end{bmatrix}^T_{N \times 1}, \quad n_i = \frac{1}{m} \sum_{j=1}^{m} y_i^j$$
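A minimal sketch of target encoding with pandas follows; the toy column and target values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "occupation": ["teacher", "doctor", "teacher", "doctor", "firefighter"],
    "y": [10.0, 20.0, 14.0, 18.0, 7.0],
})
# Each category is replaced by the mean of its target values, n_i = (1/m) * sum_j y_i^j.
df["occupation_te"] = df.groupby("occupation")["y"].transform("mean")
print(df)  # teacher -> 12.0, doctor -> 19.0, firefighter -> 7.0
```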

2.3. One-Hot Encoding

One-hot encoding $f_O$ maps the $N$ categories $c_i$ into $N$ binary row vectors of dimension $N$, where each binary vector has exactly one valid digit 1 and the rest of the positions are 0, and the position of the valid digit differs between vectors.
$$CV: \begin{bmatrix} c_1 & c_2 & \cdots & c_i & \cdots & c_N \end{bmatrix}^T_{N \times 1} \xrightarrow{f_O} NV: \begin{bmatrix} n_1 \\ n_2 \\ \vdots \\ n_N \end{bmatrix}_{N \times N} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}_{N \times N}$$
For example, the categorical variable “Occupation” contains three categories “Teacher, Firefighter, Construction Worker”, which are mapped to binary vectors [1,0,0], [0,1,0], [0,0,1] using one-hot encoding f O .
From this encoding method, it is easy to see that each vector is a unit vector and the vectors are orthogonal to each other, i.e., the vectors are equal in distance and independent of each other, which is fully consistent with the characteristics of unordered categorical variables.
However, one-hot encoding also has some problems: the vectors consist only of 0 s and 1 s, and when a categorical variable has many categories, the mapped codes become sparse and of huge dimension, which makes the computational cost large.
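The sketch below reproduces the “Occupation” example with pandas; note that get_dummies orders the columns alphabetically.

```python
import pandas as pd

cv = pd.Series(["Teacher", "Firefighter", "Construction Worker"], name="occupation")
onehot = pd.get_dummies(cv).astype(int)  # one N-dimensional binary vector per category
print(onehot.to_numpy())
# [[0 0 1]   Teacher
#  [0 1 0]   Firefighter
#  [1 0 0]]  Construction Worker
```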

2.4. Embedding

The embedding encoding f E maps each category to a point in a vector space, and vectors with similar categories are closer together in the vector space.
Embedding converts the categorical variable $CV$ into a numerical variable $NV$ of a specified embedding dimension $d_E$. The embedding dimension $d_E$ denotes the dimension of the mapped $n_i$, i.e., $c_i \xrightarrow{f_E} n_i = \left[ n_i^1, n_i^2, \dots, n_i^{d_E} \right]$.
Embedding is roughly divided into four steps.
(1) Predefine the vector space, determine the number $N$ of categories $c_i$ contained in the categorical variable, specify the embedding dimension $d_E$, and construct a weight matrix $W$ with $N$ rows and $d_E$ columns. By default, the weight matrix is randomly generated from a standard normal distribution with mean 0 and standard deviation 1, i.e., $W \sim N(0, 1)$.
(2) Assign an integer index starting from 0 to each category in the categorical variables.
(3) Map the integer indexes into the predefined vector space by means of a lookup table (this table is the weight matrix $W$), where each index is associated with the corresponding row vector of the weight matrix.
(4) Return the mapped numeric variables for the categorical variables.
$$CV: \begin{bmatrix} c_1 & c_2 & \cdots & c_i & \cdots & c_N \end{bmatrix}^T_{N \times 1} \xrightarrow{f_E} NV: \begin{bmatrix} n_1 \\ n_2 \\ \vdots \\ n_i \\ \vdots \\ n_N \end{bmatrix}_{N \times d_E} = W = \begin{bmatrix} n_1^1 & n_1^2 & \cdots & n_1^{d_E} \\ n_2^1 & n_2^2 & \cdots & n_2^{d_E} \\ \vdots & \vdots & & \vdots \\ n_i^1 & n_i^2 & \cdots & n_i^{d_E} \\ \vdots & \vdots & & \vdots \\ n_N^1 & n_N^2 & \cdots & n_N^{d_E} \end{bmatrix}_{N \times d_E}$$
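A minimal PyTorch sketch of these four steps follows; the category count N, embedding dimension d_E, and indexes are illustrative.

```python
import torch

N, d_E = 4, 3                     # step (1): N categories, embedding dimension d_E
emb = torch.nn.Embedding(N, d_E)  # weight matrix W of shape (N, d_E), W ~ N(0, 1)
idx = torch.tensor([0, 2, 2, 1])  # step (2): integer index assigned to each data point
vectors = emb(idx)                # steps (3)-(4): lookup returns a (4, d_E) tensor
```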

3. Orthogonal Matrix-Autoencoder-Based Encoding Method for Unordered Multi-Categorical Variables

Among the existing methods for encoding categorical variables, some are not applicable to unordered multi-categorical variables, which have no size, order, or distance relationships between categories, and others produce codes of large dimensionality. Therefore, this paper proposes O-AE, an orthogonal matrix-autoencoder-based method for the encoding and dimensionality reduction of unordered multi-categorical variables. The O-AE encoding and dimensionality reduction process for unordered multi-categorical variables is shown in Figure 1.
$$CV: c_i \xrightarrow{f_{O\text{-}AE}} Z, \qquad CV \to A \xrightarrow{QR} Q \to X, \qquad X \xrightarrow{\mathrm{encoder}} Z \xrightarrow{\mathrm{decoder}} \hat{X} \approx X$$
The model is mainly divided into two parts, the original code generation part and the code reduction part.
The original code generation part is $CV \to A \xrightarrow{QR} Q \to X$. The code dimensionality reduction part is $X \xrightarrow{\mathrm{encoder}} Z \xrightarrow{\mathrm{decoder}} \hat{X} \approx X$.
First, according to the number of categories $N$ in the unordered multi-categorical variable, the $N$-order orthogonal matrix $Q$ is constructed, and its rows are mapped one by one to the original unordered multi-categorical variables according to the position indexes, yielding an orthogonal numerical code matrix $X$ whose codes are of the same size, positionally independent, and equidistant.
Second, the orthogonal numerical code matrix $X$ is input to the autoencoder, which is trained step by step to compress and reconstruct the codes through its encoder and decoder; finally, the learned feature $Z$ is obtained.

3.1. Code Generation

The $N$-order square matrix is set as $A \in \mathbb{R}^{N \times N}$, $A = \left[ a_1, a_2, \dots, a_i, \dots, a_N \right]$, where $a_i$ is a column vector of $A$ and $a_i = \left[ a_i^1, a_i^2, \dots, a_i^N \right]^T$.
There exists a series of transformations that decompose $A$ into an orthogonal matrix $Q$ and an upper triangular matrix $R$, i.e., $A = QR$. These transformations are called the QR decomposition of $A$.
The orthogonal matrix $Q$ is set to consist of $N$ column vectors, $Q = \left[ q_1, q_2, \dots, q_i, \dots, q_N \right]$, where $q_i = \left[ q_i^1, q_i^2, \dots, q_i^N \right]^T$.
The row and column vectors of an orthogonal matrix are pairwise orthogonal unit vectors.
$$Q Q^T = Q^T Q = E$$
$$q_i^T q_j = \begin{cases} 0, & i \neq j \\ 1, & i = j \end{cases}$$
$$\left\| q_i \right\|_2 = 1$$
Therefore, mapping unordered multi-categorical variables into orthogonal numerical codes ensures that the different categories are positionally independent, equidistant, and of the same magnitude, while also increasing data richness to avoid the underfitting that data impoverishment would otherwise cause in the subsequent prediction process.
The steps for encoding unordered multi-categorical variables using an orthogonal matrix are as follows.
(1) Construct an $N$-order square matrix $A \in \mathbb{R}^{N \times N}$ based on the number of categories $N$ in the unordered multi-categorical variable.
(2) Perform the QR decomposition of $A$ using the Gram–Schmidt orthogonal transformation to obtain the orthogonal matrix $Q \in \mathbb{R}^{N \times N}$.
The Gram–Schmidt orthogonal transformation is able to convert non-orthogonal bases into orthogonal bases by first orthogonalizing the non-orthogonal bases and, second, unitizing them.
$$\tilde{q}_1 = a_1, \quad \tilde{q}_2 = a_2 - \frac{\left\langle a_2, \tilde{q}_1 \right\rangle}{\left\langle \tilde{q}_1, \tilde{q}_1 \right\rangle} \tilde{q}_1, \quad \dots, \quad \tilde{q}_N = a_N - \frac{\left\langle a_N, \tilde{q}_1 \right\rangle}{\left\langle \tilde{q}_1, \tilde{q}_1 \right\rangle} \tilde{q}_1 - \frac{\left\langle a_N, \tilde{q}_2 \right\rangle}{\left\langle \tilde{q}_2, \tilde{q}_2 \right\rangle} \tilde{q}_2 - \dots - \frac{\left\langle a_N, \tilde{q}_{N-1} \right\rangle}{\left\langle \tilde{q}_{N-1}, \tilde{q}_{N-1} \right\rangle} \tilde{q}_{N-1}$$
$$q_1 = \frac{\tilde{q}_1}{\left\| \tilde{q}_1 \right\|}, \quad q_2 = \frac{\tilde{q}_2}{\left\| \tilde{q}_2 \right\|}, \quad \dots, \quad q_N = \frac{\tilde{q}_N}{\left\| \tilde{q}_N \right\|}$$
$$Q = \left[ q_1, q_2, \dots, q_i, \dots, q_N \right]$$
(3) Construct the $N$-order identity matrix $E$. Using matrix multiplication, map the rows of the orthogonal matrix one by one to the original unordered multi-categorical variables according to the position indexes to obtain the orthogonal numerical code matrix $X \in \mathbb{R}^{N \times N}$, which consists of row vectors.
$$CV: \begin{bmatrix} c_1 & c_2 & \cdots & c_i & \cdots & c_N \end{bmatrix}^T_{N \times 1} \to X: \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}_{N \times N} = E_{N \times N} Q^T_{N \times N} = \begin{bmatrix} q_1^T \\ q_2^T \\ \vdots \\ q_N^T \end{bmatrix}_{N \times N} = \begin{bmatrix} q_1^1 & q_1^2 & \cdots & q_1^N \\ q_2^1 & q_2^2 & \cdots & q_2^N \\ \vdots & \vdots & & \vdots \\ q_N^1 & q_N^2 & \cdots & q_N^N \end{bmatrix}_{N \times N}$$
where $x_i = q_i^T$ and $x_i$ is the $i$-th row vector of $X$.
After the unordered multi-categorical variables are encoded, the orthogonal numerical code matrix $X$ is input into the autoencoder for representation learning and dimensionality reduction, as sketched below.
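The following NumPy sketch walks through steps (1)-(3); the seed 42 and the category count N = 5 are illustrative, since the paper fixes a random seed without stating its value.

```python
import numpy as np

def gram_schmidt(A):
    """Orthonormalize the columns of A: first orthogonalize, then unitize."""
    Q = np.zeros_like(A, dtype=float)
    for i in range(A.shape[1]):
        q = A[:, i].astype(float).copy()
        for j in range(i):                  # subtract projections onto earlier q_j
            q -= (A[:, i] @ Q[:, j]) * Q[:, j]
        Q[:, i] = q / np.linalg.norm(q)     # unitize
    return Q

rng = np.random.default_rng(42)
N = 5                                       # number of categories in CV
A = rng.standard_normal((N, N))             # step (1): random N-order square matrix
Q = gram_schmidt(A)                         # step (2): Q factor of the QR decomposition
X = Q.T                                     # step (3): row i of X encodes category c_i
assert np.allclose(X @ X.T, np.eye(N))      # codes are orthonormal and equidistant
```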

3.2. Code Dimensionality Reduction

An autoencoder (AE) [28] is an unsupervised neural network model, which mainly consists of two parts, an encoder and decoder, and has a symmetric structure. Through neural network training, an autoencoder can achieve the tasks of data denoising, feature learning, or data dimensionality reduction [29].
An example of the structure of the autoencoder and the working principle of the autoencoder is shown in Figure 2. The autoencoder includes an input layer, a hidden layer, latent space, and an output layer, each with a different number of nodes. In this case, the structure from the input layer to the latent space is referred to as the encoder, and the structure from the latent space to the output layer is referred to as the decoder. The number of nodes in the input layer is the size of the original dimension of the data, the number of nodes in the latent space is the dimension in which the data needs to be compressed, and the number of nodes in the hidden layer is in between the original dimension and the compressed dimension.
The working principle of the autoencoder is as follows: the encoder compresses the input data into a predefined low-dimensional encoding through a series of transformations $f$, and the decoder tries to reconstruct the original input data from the low-dimensional encoding through a series of transformations $g$. The autoencoder is trained iteratively by minimizing the reconstruction error between the input and the output; ultimately, the autoencoder is expected to learn the abstract feature representation $z$ of the samples.
The working principle of the autoencoder to carry out data dimensionality reduction is shown in Equation (16).
$$f: \mathcal{R} \to \mathcal{Z}, \quad g: \mathcal{Z} \to \mathcal{R}, \quad z = f\left( W_1 x + b_1 \right), \quad \hat{x} = g\left( W_2 z + b_2 \right), \quad f, g = \arg\min_{f, g} \left\| x - g\left( f\left( x \right) \right) \right\|^2$$
In this paper, multiple symmetric fully connected layers are designed in the encoder and decoder of the autoencoder to increase the complexity of the model and feature extraction capability by stacking multiple connected layers.
ReLU is used as the activation function; it is simple to compute and relatively stable in performance. ReLU avoids vanishing gradients during model training, increases the nonlinear fitting ability of the model, and performs well on multi-class tasks and datasets. The ReLU formula is $f(x) = \max(0, x)$.
Taking the autoencoder model in Figure 3, which contains four hidden layers and a one-dimensional compression dimension, as an example, $X \in \mathbb{R}^{N \times N}$ denotes the orthogonal numerical code matrix, and $\hat{X} \in \mathbb{R}^{N \times N}$ denotes the numerical matrix reconstructed by the autoencoder. Let $X_1 \in \mathbb{R}^{N \times a}$ and $X_2 \in \mathbb{R}^{N \times b}$, $a, b < N$, denote the process data compressed by the encoding layers; let $\hat{X}_2 \in \mathbb{R}^{N \times b}$ and $\hat{X}_1 \in \mathbb{R}^{N \times a}$ denote the process data reconstructed by the decoding layers; and let $Z \in \mathbb{R}^{N \times 1}$ denote the coded data after dimensionality reduction.
$X$, $X_1$, $X_2$, $\hat{X}_2$, $\hat{X}_1$, and $\hat{X}$ all consist of row vectors.
The change in data dimensions during the encoding and decoding of the orthogonal numerical code matrix by AE is shown in Equation (17).
$$X_{N \times N} \xrightarrow{\mathrm{encoder}} \left( X_1 \right)_{N \times a} \xrightarrow{\mathrm{encoder}} \left( X_2 \right)_{N \times b} \xrightarrow{\mathrm{encoder}} Z_{N \times 1} \xrightarrow{\mathrm{decoder}} \left( \hat{X}_2 \right)_{N \times b} \xrightarrow{\mathrm{decoder}} \left( \hat{X}_1 \right)_{N \times a} \xrightarrow{\mathrm{decoder}} \hat{X}_{N \times N}$$
More specifically, the steps for the dimensionality reduction of the orthogonal vector x i in the orthogonal numerical coding matrix are as follows.
(1) By means of the encoder, the orthogonal vector $x_i$ is mapped from the high-dimensional space to a low-dimensional space to achieve data dimensionality reduction.
$$x_1^i = \max\left( 0, W_1^1 x_i + b_1^1 \right), \quad x_2^i = \max\left( 0, W_1^2 x_1^i + b_1^2 \right)$$
The coded data $z_i$ after feature extraction are obtained by linearly transforming the output of the previous layer, which avoids introducing additional nonlinear relationships that would increase the complexity of the coded data, as in Equation (19).
$$z_i = W_1^3 x_2^i + b_1^3$$
(2) The original data $x_i$ are reconstructed by mapping the low-dimensional space back to the high-dimensional space through the decoder.
$$\hat{x}_2^i = \max\left( 0, W_2^1 z_i + b_2^1 \right), \quad \hat{x}_1^i = \max\left( 0, W_2^2 \hat{x}_2^i + b_2^2 \right)$$
The final output of the decoding layer, $\hat{X}$, is obtained by linearly transforming the output of the previous layer to satisfy the purpose of output data reconstruction, as in Equation (21).
$$\hat{x}_i = W_2^3 \hat{x}_1^i + b_2^3$$
(3) The error between the reconstructed data $\hat{X} \in \mathbb{R}^{N \times N}$ and the original data $X \in \mathbb{R}^{N \times N}$ is calculated, generally using the mean-square error as the loss function $L$.
$$L = \left\| x_i - \hat{x}_i \right\|^2$$
(4) With the goal of minimizing the reconstruction error, the weights $W$ and biases $b$ in the encoder and decoder are optimized by error back propagation using a gradient-based method, thereby training the autoencoder.
$$f, g \left( W, b \right) = \arg\min_{f, g} L\left( x_i, \hat{x}_i \right)$$
(5) After training, the autoencoder has learned the feature $z_i$ of the data.
$$x_i \xrightarrow{f_{O\text{-}AE}} z_i$$
The same holds for the other orthogonal vectors, which ultimately yields the reduced-dimensionality numerical code $Z \in \mathbb{R}^{N \times 1}$ of the orthogonal numerical code matrix.
$$X_{N \times N} \xrightarrow{f_{O\text{-}AE}} Z_{N \times 1} = \begin{bmatrix} z_1 & z_2 & \cdots & z_i & \cdots & z_N \end{bmatrix}^T_{N \times 1}$$
From Equation (3), data points of the same category have the same value after encoding. Therefore, the coded values after dimensionality reduction are expanded to the original sample size, $Z \in \mathbb{R}^{\tilde{N} \times 1}$, to obtain the coded expressions of all the unordered multi-categorical variables.
$$CV: \begin{bmatrix} c_1 & c_2 & \cdots & c_i & \cdots & c_N \end{bmatrix}^T_{N \times 1} \Rightarrow CV = \begin{bmatrix} c_1 & c_2 & \cdots & c_i^1 & c_i^2 & \cdots & c_N \end{bmatrix}^T_{\tilde{N} \times 1} \xrightarrow{f_{O\text{-}AE}} Z_{\tilde{N} \times 1} = \begin{bmatrix} z_1 & z_2 & \cdots & z_i^1 & z_i^2 & \cdots & z_N \end{bmatrix}^T_{\tilde{N} \times 1}$$
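A minimal PyTorch sketch of this dimensionality-reduction step is given below, mirroring the four-hidden-layer example of Figure 3; the toy size N = 5 and the layer widths a = 4 and b = 2 are illustrative, and the training settings follow Table 3.

```python
import torch
import torch.nn as nn

class OAEAutoencoder(nn.Module):
    def __init__(self, n, a, b):
        super().__init__()
        # Encoder: ReLU hidden layers, then a linear map into the 1-D latent code.
        self.encoder = nn.Sequential(
            nn.Linear(n, a), nn.ReLU(),
            nn.Linear(a, b), nn.ReLU(),
            nn.Linear(b, 1),
        )
        # Decoder mirrors the encoder; the output layer is linear for reconstruction.
        self.decoder = nn.Sequential(
            nn.Linear(1, b), nn.ReLU(),
            nn.Linear(b, a), nn.ReLU(),
            nn.Linear(a, n),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

N = 5
X = torch.linalg.qr(torch.randn(N, N)).Q.T     # orthogonal numerical code matrix
model = OAEAutoencoder(N, a=4, b=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for _ in range(2000):                          # epochs as in Table 3
    x_hat, _ = model(X)
    loss = loss_fn(x_hat, X)                   # reconstruction error L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Z = model.encoder(X).detach()                  # (N, 1): one code z_i per category
```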

4. Bayesian Optimization of Autoencoder Hyperparameters

The structure of the autoencoder model is not fixed. Since a dataset may include multiple unordered multi-categorical variables with varying numbers of categories, the number of hidden layers and nodes in the autoencoder adapts according to the dataset’s size and the required level of data compression. Therefore, parameters such as the number of layers and nodes in the hidden layer of the autoencoder are the key influencing factors that affect the performance of the autoencoder and the effect of data dimensionality reduction.
Bayesian optimization is suitable for black-box optimization problems where the objective function is complex, has no analytical expression, and is expensive to evaluate. The core idea is to use a probabilistic surrogate model (e.g., a Gaussian process or random forest) to approximate the objective function and use this model to select the next evaluation point so that the objective function converges to the optimal value as quickly as possible; it is an optimization method based on prior information. There are two core components in this process: the probabilistic model and the acquisition strategy.
Bayes’ theorem is shown in Equation (27).
$$p\left( y \mid x \right) = \frac{p\left( x \mid y \right) p\left( y \right)}{p\left( x \right)}$$
This paper uses the Tree-structured Parzen Estimator (TPE) algorithm in the Optuna framework as the concrete implementation of the Bayesian optimization technique. The TPE is a tree-structured Bayesian optimization method that uses two density functions to define $p(x \mid y)$.
$$p\left( x \mid y \right) = \begin{cases} l\left( x \right), & y < y^* \\ g\left( x \right), & y \geq y^* \end{cases}$$
The TPE uses Kernel Density Estimation (KDE) to model the probability distributions of the objective function values. Specifically, the TPE divides the observations into two components: a probability distribution $l(x)$ for the better objective values and a probability distribution $g(x)$ for the worse objective values.
The TPE uses Expected Improvement (EI) as the acquisition function.
$$EI_{y^*}\left( x \right) = \int_{-\infty}^{+\infty} \max\left( y^* - y, 0 \right) p_M\left( y \mid x \right) dy = \int_{-\infty}^{y^*} \left( y^* - y \right) p\left( y \mid x \right) dy \propto \left( \gamma + \frac{g\left( x \right)}{l\left( x \right)} \left( 1 - \gamma \right) \right)^{-1}$$
This leads to Equation (30).
$$EI_{y^*}\left( x \right) \propto \left( \gamma + \frac{g\left( x \right)}{l\left( x \right)} \left( 1 - \gamma \right) \right)^{-1}$$
The value of EI is proportional to $\left( \gamma + \frac{g(x)}{l(x)} \left( 1 - \gamma \right) \right)^{-1}$ and depends on the ratio of the two densities, so it is necessary to find the $x$ that maximizes the ratio $l(x) / g(x)$ and, in each iteration, return the point with the maximum EI.
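A hedged sketch of this search with Optuna's TPESampler follows; train_autoencoder() is a hypothetical helper that builds the AE from the sampled structure, trains it with early stopping, and returns the validation reconstruction loss, and the trial count and node range are illustrative.

```python
import optuna

def objective(trial):
    # At most five encoder hidden layers (Section 6.1); the decoder is symmetric.
    n_layers = trial.suggest_int("n_encoder_layers", 1, 5)
    widths = [trial.suggest_int(f"units_l{i}", 2, 64) for i in range(n_layers)]
    return train_autoencoder(widths)  # hypothetical: validation reconstruction MSE

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params)
```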

5. Experimental Design

5.1. Dataset

In order to validate the feasibility and effectiveness of the proposed method, this section carried out example validation on five datasets and compared the results with those of one-hot encoding, embedding, label encoding, and target encoding.
In this paper, the working-hour datasets of two typical disciplines in the cruise ship production design process, namely the piping discipline and the hull structure discipline, were selected from a major domestic cruise shipyard, and three public datasets containing unordered multi-categorical variables were also selected from UCI and Kaggle. The sample sizes of the five datasets and the numbers of categories in their unordered multi-categorical variables all differ, which makes it possible to validate the universality of the proposed method.
In particular, the sample size of the Gait dataset was very large, so we randomly selected 1000 records to form a new dataset for the experiment.
The sample size of each dataset, the number of variables, the number of categorical variables, the number of categories in the categorical variables, and the prediction targets are shown in Table 1.

5.2. Detailed Experimental Design

The dimension of the codes obtained by different encoding methods is not the same. In this paper, we set the autoencoder compression dimension to one.
Therefore, O-AE, label encoding, and target encoding yield one-dimensional codes; one-hot encoding yields codes with the same dimensionality as the number of categories in the categorical variable; and embedding can freely set the dimensionality.
We were concerned with the effect of different encoding methods on the input dimensions of the dataset as well as on the target prediction results of the neural network. We therefore designed two experiments with different embedding dimensions for embedding, and one experiment each for O-AE, one-hot encoding, label encoding, target encoding, and O-AE-b (O-AE with Bayesian optimization of the autoencoder hyperparameters), giving seven experiments in total.
The encoded dataset was input to a BP neural network to carry out target prediction.
In the Gait dataset, only the multi-categorical variable “subject” was encoded; the categorical variables “condition” and “joint” contain a small number of categories, and “leg” is a binary categorical variable, so the experiments used the labeled data that came with the original dataset.
In the Restaurant dataset, the multi-categorical variable “Cuisine_Type” was encoded, and the binary categorical variable “Promotions” used the labeled data that came with the original data.
Continuous variables in the dataset were mapped to a standard normal distribution with a mean of 0 and a standard deviation of 1 by Z-Score standardization.
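For reference, a one-line sketch of this standardization with scikit-learn (the toy column is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[3.0], [5.0], [7.0]])        # one continuous column
x_std = StandardScaler().fit_transform(x)  # Z-Score: mean 0, standard deviation 1
```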
The specific methods of the seven experiments and the total dimensions of the five datasets input to the BP neural network under each of the seven experimental codes are shown in Table 2.

5.3. Evaluation Metrics

In this paper, three classical evaluation metrics were used to assess the performance of the different encoding methods in the BP neural network target prediction task: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and the coefficient of determination ($R^2$).
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \quad MAE \in \left[ 0, +\infty \right)$$
$$RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 }, \quad RMSE \in \left[ 0, +\infty \right)$$
$$R^2 = 1 - \frac{ \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 }{ \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 }, \quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \quad R^2 \in \left[ 0, 1 \right]$$
where $n$ is the number of samples, $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the actual values.
MAE is the mean of the absolute errors, and RMSE is the square root of the mean of the squared errors. MAE and RMSE both measure the deviation of the predicted values from the actual values, with smaller values indicating better predictions. $R^2$ assesses the linear relationship between the actual and predicted values, with values closer to 1 indicating a better fit of the model.
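The three metrics can be computed as in the sketch below; the toy arrays are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10.0, 12.0, 9.0, 15.0])
y_pred = np.array([11.0, 11.5, 8.0, 14.0])
mae = mean_absolute_error(y_true, y_pred)           # mean of absolute errors
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root of the mean squared error
r2 = r2_score(y_true, y_pred)                       # closer to 1 = better fit
```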

6. Example Validation

In this section, we analyze the performance of seven experiments. We implemented all the algorithms in Python and conducted all the experiments on a computer with an i7-11390H processor, 3.40 GHz CPU, and 16.0 GB RAM.

6.1. Parameter Setting

The relevant parameters of the autoencoder in Experiment 1 (O-AE) and Experiment 7 (O-AE-b) are shown in Table 3 below. The batch size is set to the sample size of the dataset because the sample sizes of the experimental datasets are small.
In particular, in Experiment 7, the unordered multi-categorical variables are divided into training and validation sets in a 7:3 ratio. An early-stopping strategy is introduced, with the patience parameter set to 100. The number of AE hidden layers and the number of nodes per layer are obtained with Bayesian optimization; the number of encoder hidden layers is set to no more than five, and the decoder and encoder structures are symmetric.
In Experiment 1, the number of hidden layers of the autoencoder and the number of nodes in each layer are manually adjusted for each dataset. The number of epochs in Experiment 1 is kept the same as in Experiment 7. Also, since Experiment 1 has no additional autoencoder hyperparameter optimization procedure, and in order to demonstrate its applicability in comparison with the other experiments, the early-stopping strategy is not used.
To carry out target prediction using a BP neural network, the dataset is divided into a training set, validation set, and test set in the ratio of 7:2:1.
The relevant parameters are shown in Table 4. The batch size is set to the sample size of the dataset because the sample sizes of the experimental datasets are small. To conduct the experiments quickly and obtain good prediction results, the hyperparameters are optimized with a 5-fold cross-validated grid search, and the number of nodes is set to be the same in every hidden layer; the optional numbers of hidden layers and nodes are shown in Table 4. To improve model performance and avoid overfitting, we introduce L2 regularization and an early-stopping strategy and set the patience parameter of early stopping to 100.
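A sketch of this setup is shown below, assuming scikit-learn's MLPRegressor as a stand-in for the paper's BP implementation; the grid follows Table 4 (1-3 hidden layers with 10/20/30 nodes per layer).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {
    # the same node count in every hidden layer, per the text
    "hidden_layer_sizes": [(n,) * layers for layers in (1, 2, 3) for n in (10, 20, 30)],
}
mlp = MLPRegressor(activation="relu", solver="adam", learning_rate_init=0.001,
                   alpha=0.0001, max_iter=1000, early_stopping=True,
                   n_iter_no_change=100, random_state=0)
search = GridSearchCV(mlp, param_grid, cv=5, scoring="neg_mean_squared_error")
# search.fit(X_train, y_train)  # X_train, y_train: encoded and standardized data
```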
Meanwhile, the original code generation step randomly generates the $N$-order square matrix $A \in \mathbb{R}^{N \times N}$ based on the number of categories $N$ in the unordered multi-categorical variables, with a fixed random seed to ensure that the experimental data are the same across runs.

6.2. Target Prediction Results

Under the above parameter settings, several calculations are carried out to obtain the results of the evaluation metrics for the five datasets in seven experiments. The number of nodes per layer of the encoding layer of the autoencoder (the decoding layer is symmetrical to the structure of the encoding layer, so the description is omitted) and the number of layers and nodes in the hidden layer of the BP neural network after hyperparameter optimization are shown in Table 5.
In this paper, the structure of the autoencoder is denoted by the unordered multi-categorical variable's original dimension, the intermediate compression dimensions, and the final dimension 1, joined by hyphens (e.g., 16-6-1).
The results of the evaluation metrics for the five datasets in seven experiments are shown in Table 6.
This paper plots the results of the target prediction experiments for each dataset, including the fitted regression plots between the predicted and actual values in the test set, as shown in Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8, and the line charts between the predicted and actual values in the test set, as shown in Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13.

6.3. Analysis of Results

In this paper, MAE, RMSE, and $R^2$ are chosen as the evaluation metrics of the target prediction results; smaller MAE and RMSE values and an $R^2$ closer to 1 indicate better target prediction quality.
(1) Comparing the metrics results of target prediction.
In the Piping dataset, O-AE obtains the best MAE, and O-AE-b obtains the best RMSE and $R^2$; the MAE of O-AE-b is the best among the seven experiments apart from O-AE, and the RMSE and $R^2$ of O-AE are the best apart from O-AE-b. Among the remaining experiments, TE gives the worst results on all three metrics.
In the Hull structure dataset, O-AE-b obtains the best MAE, RMSE, and $R^2$, and the RMSE and $R^2$ of O-AE are the best among the seven experiments apart from O-AE-b. Among the remaining experiments, TE gives the worst results on all three metrics.
In the Gait dataset, 5-EE obtains the best MAE, and O-AE-b obtains the best RMSE and $R^2$; the MAE of O-AE-b is the best among the seven experiments apart from 5-EE, and the RMSE and $R^2$ of O-AE are the best apart from O-AE-b. Among the remaining experiments, TE gives the worst results on all three metrics.
In the Fish dataset, OHE obtains the best MAE, and O-AE-b obtains the best RMSE and $R^2$; the MAE of O-AE-b is the best among the seven experiments apart from OHE, and the RMSE and $R^2$ of O-AE are the best apart from O-AE-b. Among the remaining experiments, TE gives the worst results on all three metrics.
In particular, the prediction results of the seven experiments on the Restaurant dataset are not very satisfactory, with the highest $R^2$ being only 0.7624; a possible reason is that the chosen BP neural network model is not well suited to the Restaurant dataset. Nevertheless, in the Restaurant dataset, O-AE obtains the best MAE, and O-AE-b obtains the best RMSE and $R^2$; the MAE of O-AE-b is the best among the seven experiments apart from O-AE, and the RMSE and $R^2$ of O-AE are the best apart from O-AE-b. Among the remaining experiments, 1-EE gives the worst results on all three metrics.
Overall, both O-AE-b (O-AE with Bayesian optimization of the autoencoder hyperparameters) and the basic O-AE give consistently excellent target prediction results across the different datasets. Analyzing the metric results, the basic O-AE can already meet the encoding needs of unordered multi-categorical variables, and its MAE results are even better than those of O-AE-b on some datasets.
(2) Comparing the experimental performance of different methods.
Among the other experiments, 1-EE and 5-EE are embedding methods with embedding dimensions of one and five, respectively; their metric results are mid-range on the five datasets, with little difference in performance between them. The metric results of OHE on the five datasets are also relatively stable, and its experimental performance is slightly worse than that of O-AE and O-AE-b, but OHE suffers from large dimensionality and sparse data after encoding unordered multi-categorical variables. The experimental performance of LE and TE on the five datasets is mediocre, mainly because LE introduces additional misleading order and distance bias into originally unordered features, and TE weakens the fitting ability of the neural network model.
Overall, the experimental results of O-AE-b and O-AE are superior to those of the other classical encoding methods. The unordered multi-categorical variables processed by O-AE-b and O-AE satisfy the data input requirements of the subsequent neural network model and facilitate the network's learning from the data, improving the neural network target prediction results.

7. Conclusions

The objective of this paper is to find an encoding method for unordered multi-categorical variables that results in the lower dimensionality of the encoded data and better target prediction results when fed into a neural network.
After analyzing the characteristics of unordered multi-categorical variables and comparing existing encoding methods, this paper proposes a method for the encoding and dimensionality reduction of unordered multi-categorical variables using an orthogonal matrix-autoencoder. Seven experiments are designed for validation using the basic O-AE, O-AE with Bayesian optimization of the autoencoder hyperparameters (O-AE-b), and several other classical encoding methods. The experimental results show that O-AE and O-AE-b have more stable and excellent evaluation metrics, indicating that the proposed method is highly feasible and applicable and can serve as an optional method for the data processing of unordered multi-categorical variables.
Through the experimental analysis, the basic O-AE can meet the encoding requirements of unordered multi-categorical variables, but it requires the number of layers and nodes in the hidden layers of the autoencoder to be adjusted manually, often over several rounds, to achieve optimal metric results. O-AE-b uses Bayesian optimization to find the optimal number of layers and nodes, but the Bayesian optimization of the hyperparameters consumes more time than the basic O-AE. Therefore, the encoding method can be selected according to the characteristics of the problem and the available experimental resources.
Although the research in this paper has achieved preliminary results, there is still considerable room for improvement. Future work will focus on optimizing the model performance, for example by introducing activation functions such as PReLU or other regularization strategies such as combined L1-L2 regularization, with a view to further improving the model performance. Meanwhile, we will also work on improving the efficiency of the Bayesian optimization in the O-AE-b model and explore more efficient ways to optimize the hyperparameters of the autoencoder in order to promote continued progress in this area.

Author Contributions

Conceptualization, Y.W. and J.L.; methodology, Y.W.; software, Y.W.; validation, Y.W. and J.L.; formal analysis, B.Y., D.S. and L.Z.; investigation, Y.W. and B.Y.; resources, J.L.; data curation, B.Y., D.S. and L.Z.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W., J.L., B.Y., D.S. and L.Z.; visualization, Y.W.; supervision, J.L.; project administration, J.L. and B.Y.; funding acquisition, J.L. and B.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This study is funded by the Ministerial Civil Ship Research Project of China (Grant numbers [2024]56).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

The authors are responsible for the contents of this publication. In addition, the authors would like to thank their lab classmates for their contributions to the writing quality.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, J.; Li, C.; Ouyang, P.; Liu, J.; Wu, C. Interpreting the prediction results of the tree-based gradient boosting models for financial distress prediction with an explainable machine learning approach. J. Forecast. 2023, 42, 1112–1137. [Google Scholar] [CrossRef]
  2. Seong, N.; Nam, K. Forecasting price movements of global financial indexes using complex quantitative financial networks. Knowl.-Based Syst. 2022, 235, 107608. [Google Scholar] [CrossRef]
  3. El-Rashidy, N.; El-Sappagh, S.; Abuhmed, T.; Abdelrazek, S.; El-Bakry, H.M. Intensive Care Unit Mortality Prediction: An Improved Patient-Specific Stacking Ensemble Model. IEEE Access 2020, 8, 133541–133564. [Google Scholar] [CrossRef]
  4. Han, M.; Fan, L. A short-term energy consumption forecasting method for attention mechanisms based on spatio-temporal deep learning. Comput. Electr. Eng. 2024, 114, 109063. [Google Scholar] [CrossRef]
  5. Yukseltan, E.; Yucekaya, A.; Bilge, A.H. Hourly electricity demand forecasting using Fourier analysis with feedback. Energy Strategy Rev. 2020, 31, 100524. [Google Scholar] [CrossRef]
  6. Zhao, H.; Zhu, D.; Yang, Y.; Li, Q.; Zhang, E. Study on photovoltaic power forecasting model based on peak sunshine hours and sunshine duration. Energy Sci. Eng. 2023, 11, 4570–4580. [Google Scholar] [CrossRef]
  7. Chu, W.-T.; Liang, Y.-H.; Ho, K.-C. Visual Weather Property Prediction by Multi-Task Learning and Two-Dimensional RNNs. Atmosphere 2021, 12, 584. [Google Scholar] [CrossRef]
  8. Sundareswaran, A.; Lavanya, K. Real-Time Vehicle Traffic Prediction in Apache Spark Using Ensemble Learning for Deep Neural Networks. Int. J. Intell. Inf. Technol. 2020, 16, 19–36. [Google Scholar] [CrossRef]
  9. Kim, Y.S. Comparison of the decision tree, artificial neural network, and linear regression methods based on the number and types of independent variables and sample size. Expert Syst. Appl. 2008, 34, 1227–1234. [Google Scholar] [CrossRef]
  10. Reilly, D.; Taylor, M.; Fergus, P.; Chalmers, C.; Thompson, S. The Categorical Data Conundrum: Heuristics for Classification Problems—A Case Study on Domestic Fire Injuries. IEEE Access 2022, 10, 70113–70125. [Google Scholar] [CrossRef]
  11. Hancock, J.T.; Khoshgoftaar, T.M. Survey on categorical data for neural networks. J. Big Data 2020, 7, 28. [Google Scholar] [CrossRef]
  12. Chen, H.-S.; Lan, T.-S.; Lai, Y.-M. Prediction Model of Working Hours of Cooling Turbine of Jet Engine with Back-propagation Neural Network. Sens. Mater. 2021, 33, 843. [Google Scholar] [CrossRef]
  13. Yu, T.; Cai, H. The Prediction of the Man-Hour in Aircraft Assembly Based on Support Vector Machine Particle Swarm Optimization. J. Aerosp. Technol. Manag. 2015, 7, 19–30. [Google Scholar] [CrossRef]
  14. Ge, Y.; Nan, Y.; Bai, L. A Hybrid Prediction Model for Solar Radiation Based on Long Short-Term Memory, Empirical Mode Decomposition, and Solar Profiles for Energy Harvesting Wireless Sensor Networks. Energies 2019, 12, 4762. [Google Scholar] [CrossRef]
  15. Bu, H.; Ge, Z.; Zhu, X.; Yang, T.; Zhou, H. Prediction of Ship Painting Man-Hours Based on Selective Ensemble Learning. Coatings 2024, 14, 318. [Google Scholar] [CrossRef]
  16. Hur, M.; Lee, S.; Kim, B.; Cho, S.; Lee, D.; Lee, D. A study on the man-hour prediction system for shipbuilding. J. Intell. Manuf. 2015, 26, 1267–1279. [Google Scholar] [CrossRef]
  17. Golnaraghi, S.; Zangenehmadar, Z.; Moselhi, O.; Alkass, S. Application of Artificial Neural Network(s) in Predicting Formwork Labour Productivity. Adv. Civ. Eng. 2019, 2019, e5972620. [Google Scholar] [CrossRef]
  18. Wang, L.; Xie, D.; Zhou, L.; Zhang, Z. Application of the hybrid neural network model for energy consumption prediction of office buildings. J. Build. Eng. 2023, 72, 106503. [Google Scholar] [CrossRef]
  19. Carrizosa, E.; Nogales-Gómez, A.; Romero Morales, D. Clustering categories in support vector machines. Omega 2017, 66, 28–37. [Google Scholar] [CrossRef]
  20. Gnat, S. Impact of Categorical Variables Encoding on Property Mass Valuation. Procedia Comput. Sci. 2021, 192, 3542–3550. [Google Scholar] [CrossRef]
  21. Hien, D.T.T.; Thuy, C.T.T.; Anh, T.K.; Son, D.T.; Giap, C.N. Optimize the Combination of Categorical Variable Encoding and Deep Learning Technique for the Problem of Prediction of Vietnamese Student Academic Performance. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 274–280. [Google Scholar] [CrossRef]
  22. Li, J.; Xu, J.; Zhou, Q. Monitoring serially dependent categorical processes with ordinal information. IISE Trans. 2018, 50, 596–605. [Google Scholar] [CrossRef]
  23. De Meulemeester, H.; De Moor, B. Unsupervised Embeddings for Categorical Variables. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
  24. Dahouda, M.K.; Joe, I. A Deep-Learned Embedding Technique for Categorical Features Encoding. IEEE Access 2021, 9, 114381–114391. [Google Scholar] [CrossRef]
  25. Lee, N.; Kim, J.-M. Conversion of categorical variables into numerical variables via Bayesian network classifiers for binary classifications. Comput. Stat. Data Anal. 2010, 54, 1247–1265. [Google Scholar] [CrossRef]
  26. Jung, T.; Kim, J. A new support vector machine for categorical features. Expert Syst. Appl. 2023, 229, 120449. [Google Scholar] [CrossRef]
  27. Yang, M.; Xu, S. Orthogonal Nonnegative Matrix Factorization using a novel deep Autoencoder Network. Knowl.-Based Syst. 2021, 227, 107236. [Google Scholar] [CrossRef]
  28. Lecun, Y.; Soulie Fogelman, F. Modeles connexionnistes de l’apprentissage. Intellectica Spec. Issue Apprentiss. Mach. 1987, 2, 114–143. [Google Scholar] [CrossRef]
  29. Bourlard, H.; Kamp, Y. Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. 1988, 59, 291–294. [Google Scholar] [CrossRef]
  30. Helwig, N.; Hsiao-Wecksler, E. Multivariate Gait Data; UCI Machine Learning Repository: Irvine, CA, USA, 2022. [Google Scholar] [CrossRef]
  31. MrSimple07. Restaurants Revenue Prediction 2024; Kaggle: San Francisco, CA, USA, 2024. [Google Scholar] [CrossRef]
  32. Kaggle. Fish Market. Available online: https://www.kaggle.com/datasets/vipullrathod/fish-market (accessed on 12 July 2024).
Figure 1. O-AE for unordered multi-categorical variable encoding and dimensionality reduction process.
Figure 2. The structure and working principle of the autoencoder.
Figure 3. Example of dimensionality reduction using autoencoder.
Figure 4. Piping dataset: fitted regression plots.
Figure 5. Hull structure dataset: fitted regression plots.
Figure 6. Gait dataset: fitted regression plots.
Figure 7. Restaurant dataset: fitted regression plots.
Figure 8. Fish dataset: fitted regression plots.
Figure 9. Piping dataset: line charts.
Figure 10. Hull structure dataset: line charts.
Figure 11. Gait dataset: line charts.
Figure 12. Restaurant dataset: line charts.
Figure 13. Fish dataset: line charts.
Table 1. Specific information on the datasets.

| Dataset | Sample Size ($\tilde{N}$) | Total Variables | Continuous Variables | Categorical Variables | Categorical Variable Name | Categories (N) | Target |
|---|---|---|---|---|---|---|---|
| Piping | 163 | 8 | 6 | 2 | area | 16 | Feedback working hours |
| | | | | | task type | 19 | |
| Hull structure | 144 | 4 | 2 | 2 | area | 67 | Feedback working hours |
| | | | | | task type | 4 | |
| Gait [30] | 1000 | 6 | 2 | 4 | subject | 10 | angle |
| | | | | | condition | 3 | |
| | | | | | joint | 3 | |
| | | | | | leg | 2 | |
| Restaurant [31] | 1000 | 7 | 5 | 2 | Cuisine_Type | 4 | Monthly_Revenue |
| | | | | | Promotions | 2 | |
| Fish [32] | 159 | 6 | 5 | 1 | Species | 7 | Weight |
Table 2. The experimental design and the total input dimensions after encoding. The Piping through Fish columns give the total input dimension of each dataset after encoding.

| No. | Encoding Method | Code Dimension | Piping | Hull Structure | Gait | Restaurant | Fish | Abbreviation |
|---|---|---|---|---|---|---|---|---|
| 1 | Orthogonal matrix-autoencoder-based encoding method | 1 | 8 | 4 | 6 | 7 | 6 | O-AE |
| 2 | Embedding | 1 | 8 | 4 | 6 | 7 | 6 | 1-EE |
| 3 | Embedding | 5 | 16 | 12 | 10 | 11 | 10 | 5-EE |
| 4 | One-hot encoding | Number of categories (N) | 31 | 73 | 15 | 10 | 12 | OHE |
| 5 | Label encoding | 1 | 8 | 4 | 6 | 7 | 6 | LE |
| 6 | Target encoding | 1 | 8 | 4 | 6 | 7 | 6 | TE |
| 7 | Bayesian optimization of the autoencoder hyperparameters for O-AE | 1 | 8 | 4 | 6 | 7 | 6 | O-AE-b |
Table 3. The parameter settings of the autoencoder in O-AE and O-AE-b.

| Experiment | Loss Function | Activation Function | Optimizer | Learning Rate | Epoch | Batch Size | Patience |
|---|---|---|---|---|---|---|---|
| O-AE | MSE | ReLU | Adam | 0.001 | 2000 | Sample size ($\tilde{N}$) | - |
| O-AE-b | MSE | ReLU | Adam | 0.001 | 2000 | Sample size ($\tilde{N}$) | 100 |
Table 4. The parameter settings of the BP neural network.

| Loss Function | Activation Function | Optimizer | Learning Rate | Epoch |
|---|---|---|---|---|
| MSE | ReLU | Adam | 0.001 | 1000 |

| L2 Regularization | Batch Size | Patience | Optional Hidden Layers | Optional Nodes |
|---|---|---|---|---|
| 0.0001 | Sample size ($\tilde{N}$) | 100 | 1, 2, 3 | 10, 20, 30 |
Table 5. Optimal hyperparameter results for the autoencoder and the BP neural network.

| Dataset | Hyperparameter | | O-AE | 1-EE | 5-EE | OHE | LE | TE | O-AE-b |
|---|---|---|---|---|---|---|---|---|---|
| Piping | AE | area | 16-6-1 | - | - | - | - | - | 16-13-11-9-5-1 |
| | | task type | 19-6-1 | - | - | - | - | - | 19-18-16-13-12-1 |
| | BP | layers | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| | | nodes | 30 | 10 | 20 | 30 | 30 | 20 | 30 |
| Hull structure | AE | area | 67-30-6-1 | - | - | - | - | - | 67-26-17-1 |
| | | task type | 4-2-1 | - | - | - | - | - | 4-2-1 |
| | BP | layers | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| | | nodes | 30 | 20 | 30 | 30 | 20 | 30 | 30 |
| Gait | AE | subject | 10-4-1 | - | - | - | - | - | 10-9-8-5-1 |
| | BP | layers | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| | | nodes | 30 | 30 | 30 | 30 | 30 | 30 | 30 |
| Restaurant | AE | Cuisine_Type | 4-3-1 | - | - | - | - | - | 4-2-1 |
| | BP | layers | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| | | nodes | 30 | 30 | 10 | 20 | 30 | 20 | 30 |
| Fish | AE | Species | 7-3-1 | - | - | - | - | - | 7-6-5-1 |
| | BP | layers | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| | | nodes | 30 | 30 | 30 | 30 | 30 | 30 | 30 |
Table 6. The results of the evaluation metrics. Bold indicates the best and worst values of the evaluation metrics.

| Dataset | Metric | O-AE | 1-EE | 5-EE | OHE | LE | TE | O-AE-b |
|---|---|---|---|---|---|---|---|---|
| Piping | MAE | 4.3681 | 6.8860 | 6.1809 | 4.8380 | 5.6480 | 8.1285 | 4.5841 |
| | RMSE | 5.8296 | 9.8875 | 9.9672 | 7.9449 | 9.2587 | 11.1007 | 5.7244 |
| | $R^2$ | 0.9718 | 0.9189 | 0.9175 | 0.9476 | 0.9288 | 0.8977 | 0.9728 |
| Hull structure | MAE | 8.4987 | 12.3219 | 10.6569 | 6.4534 | 9.9077 | 14.3765 | 5.8390 |
| | RMSE | 10.2222 | 17.5980 | 15.5944 | 12.5951 | 15.5421 | 19.2587 | 7.6556 |
| | $R^2$ | 0.9628 | 0.8896 | 0.9133 | 0.9435 | 0.9139 | 0.8678 | 0.9791 |
| Gait | MAE | 3.4295 | 3.4297 | 3.1289 | 3.6434 | 3.8893 | 4.2187 | 3.3678 |
| | RMSE | 4.5314 | 4.7570 | 4.6198 | 5.1868 | 5.1943 | 5.7029 | 4.2098 |
| | $R^2$ | 0.9319 | 0.9250 | 0.9292 | 0.9108 | 0.9105 | 0.8921 | 0.9412 |
| Restaurant | MAE | 40.8338 | 42.3132 | 42.3016 | 41.4832 | 41.1913 | 41.6640 | 40.9350 |
| | RMSE | 52.6461 | 54.2528 | 53.0498 | 52.8340 | 53.3646 | 52.6894 | 51.9348 |
| | $R^2$ | 0.7559 | 0.7407 | 0.7521 | 0.7541 | 0.7492 | 0.7555 | 0.7624 |
| Fish | MAE | 21.7449 | 24.5180 | 22.9351 | 20.7503 | 23.8476 | 56.8344 | 20.8339 |
| | RMSE | 32.1668 | 34.4302 | 34.8801 | 33.4807 | 33.7599 | 72.4618 | 31.8064 |
| | $R^2$ | 0.9867 | 0.9848 | 0.9844 | 0.9856 | 0.9854 | 0.9327 | 0.9870 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

