**1. Introduction**


Whether deep or shallow, the operation of artificial neural networks (ANNs) depends on their hyper-parameters and parameters [1–3]. Certain variables of ANNs are called hyper-parameters; they define the architecture, such as the number of layers [2], or control the training process, such as the learning rate [4]. In contrast, the trainable variables pertaining to layer connections and tuned during the training process, namely the weights and biases, are called parameters [5–7]. Although parameter tuning may yield good results, it does not yield notable results without hyper-parameter tuning (HPT).

The importance of HPT became even more manifest with the development of deep learning algorithms. Deep learning is a type of machine learning (ML) technique with diverse hyper-parameters that severely affect its performance [8–10]. Since HPT is an arduous task that requires knowledge of both the data and the network [11,12], it is often performed empirically (trial-and-error), which is time-consuming and does not guarantee significant results in terms of efficient algorithms and overall cost complexity. Therefore, studies that apply optimization methods to ANNs have gained attention.

**Citation:** Jalaeian Zaferani, E.; Teshnehlab, M.; Khodadadian, A.; Heitzinger, C.; Vali, M.; Noii, N.; Wick, T. Hyper-Parameter Optimization of Stacked Asymmetric Auto-Encoders for Automatic Personality Traits Perception. *Sensors* **2022**, *22*, 6206. https://doi.org/10.3390/s22166206

Academic Editor: Jing Tian

Received: 17 July 2022; Accepted: 16 August 2022; Published: 18 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Accordingly, the use of optimization algorithms in ANNs can be divided into three groups, as follows:


Thus, the aims of the present work were to (1) propose a novel optimization method based on cultural evolution and parallel computing, (2) obtain near-optimal values for the hyper-parameters of the stacked asymmetric auto-encoder (SAAE), and (3) classify five personality traits.

The rest of the article is organized as follows. In Section 2, related work on hyper-parameter optimization (HPO) in deep learning and automatic personality perception (APP) is reviewed. In Section 3, the dataset is introduced, and a summary of the feature extraction method is presented in Section 4. The new optimization method is proposed in Section 5. The simulation results of the new method, applied to three benchmark functions for finding global optima, are presented in Section 6. In addition, this section discusses the outcomes of applying the proposed method to the SAAE for automatic personality perception classification.

#### **2. Related Works**

Given that this article examines HPO methods in order to find a proper one to optimize the hyper-parameters of SAAE for automatic personality trait perception, the related works section is divided into two parts. The focus of the first part is on recently published methods of neural network hyper-parameter tuning, regardless of the application in which it is used. Thus, the works related to the investigation of HPO in ML are summarized in the first part. Since the aim of our research was HPT of SAAE to classify five personality traits from speech, the second part is related to studying HPO in machine learning methods applied in the field of personality trait perception.

#### *2.1. Hyper-Parameter Tuning in ML*

Deep learning hyper-parameter types are vast and can be divided into three groups: integer, real, and categorical. The integer group consists of variables such as the number of layers (whether hidden or convolutional) [25], the number of neurons [8], the size of the kernel [26], the number of kernels [27], the batch size, the pooling size, and the maximum number of epochs [9]. The real group includes the learning rate [25], dropout rate [25], regularization factor [25], network weight initialization [5], and momentum [4]. The categorical group comprises the activation function type [8] and the optimization method [8].

A change in the value of each hyper-parameter changes the values of the neural network parameters and hence the output of the network, and examining every possible combination of hyper-parameters is time-consuming, expensive, and practically impossible. Therefore, studies have investigated the effect of adjusting and optimizing some of the most important hyper-parameters.

In this regard, the article in [4] employed the HPO method for bearing fault diagnosis in mechanical equipment. Parallel computing was used to find hyper-parameters of the deep belief network (DBN). The learning rate and momentum were optimized, while other hyper-parameters were predefined and kept constant. Additionally, Wu Deng et al. used quantum-inspired differential evolution (DE) to optimize DBN parameters. Results showed an improvement in global search and avoiding premature convergence for fault classification [28].

The number of hidden neurons as a hyper-parameter, together with the weights and biases as parameters, was optimized in a feed-forward ANN by the Gray wolf optimizer in [18]. Feed-forward ANNs were used without back-propagation because the parameters were adjusted by the optimization method itself.

Y. Peng et al. proposed an HPO method based on a fuzzy system in [8]. They optimized the number of hidden layers and the number of neurons in each layer of a DNN. The activation function type and optimization method, including Genetic Algorithms (GA), Bayesian search, grid search, random search, and quasi-random search, were selected automatically during HPO. For preventing over-fitting, the dropout technique was used. The proposed method was tested in three rainfall prediction datasets.

The authors of [29] suggested a distributed particle swarm optimization (PSO) for the HPO of a convolutional neural network (CNN). Because the population search is time-consuming, parallel computing was employed to speed up the algorithm. They optimized the number and size of the kernels, the type of pooling (max or average) for two convolutional layers, the activation function type in the convolutional layers, the number of neurons, the learning rate, and the dropout rate of the fully connected layers.

Time-series prediction of congestion in highway systems based on long short-term memory (LSTM) was investigated in [9]. To obtain the proper model and structure, the authors recommended an HPO method by applying the Bayesian optimization (BO) method. Five hyper-parameters were automatically obtained, including learning rate, the number of hidden layers, the number of neurons in each layer, batch size, and dropout rate.

The intention of [25] was to examine the robustness of one HPO method over six benchmarks, contrary to other works that designed an algorithm fitted to one problem. In other work, the authors used BO, an established HPO method, in a CNN [1] and applied four strategies to alleviate the drawbacks of BO. They tuned the hyper-parameters of two convolutional layers and two fully connected layers in this way.

In [26], an intuitive architecture design using GA was proposed for CNN. The obtained model was evaluated on a CNN with a single convolutional layer and a fully connected layer. Additionally, some hyper-parameters, including maximum epochs, batch size, initial learning rate, regularization, and momentum were optimized by PSO to prepare a CNN for expression recognition in [30].

Since the success of neural networks depends on their structure, the article in [31] proposed a micro-canonical optimization algorithm for overcoming large parameter spaces and optimizing hyper-parameters of a CNN. Hyper-parameters were the number of convolution layers, activation function type, batch size, pooling type, and dropout rate. The method was evaluated by six image recognition datasets and exhibited accuracy improvement.

State-of-health estimation and remaining useful life prediction in battery prognosis were examined in [32] using a deep convolutional neural network. The authors addressed the hyper-parameter tuning that affects DNN performance and improved the algorithm by using the BO method.

Anjir A. Chowdhury et al. concentrated on the role of hyper-parameter optimization in the performance and reliability of deep learning outcomes [33]. They compared several HPO algorithms to obtain better validation accuracy in DNNs and concluded that most of them are computationally expensive. Finally, a greedy approach-based HPO algorithm was proposed for enabling faster computing on edge devices for on-the-fly learning applications. The VGG and ResNet architectures were used, and their hyper-parameters such as epochs, number of hidden layers, number of units per layer, activation function, dropout rate, batch size, and learning rate were optimized.

The Gray wolf optimization was employed to optimize the parameters of the kernel extreme learning machine to realize a hyperspectral image classification method in [34].

#### *2.2. Automatic Personality Perception*

In psychology, the big five inventory (BFI) is a well-known theory of personality with five traits: openness to experience (Ope.), conscientiousness (Con.), extraversion (Ext.), agreeableness (Agr.), and neuroticism (Neu.). These traits coexist in an individual, each with a different score, and can generally be measured with a BFI questionnaire [35,36].

Due to the importance of personality in daily life, computer science researchers have investigated personality trait identification by multimodal media (audio, text, video, image) recently. Here, we focus on studies structured by deep learning methods.

A multimodal approach for perceiving personality traits was proposed by employing well-known deep structures (ResNet-v2-101 and VGGish) [37], with an LSTM network added at the end to exploit temporal information. The authors optimized only the learning rate, while the other hyper-parameters were configured manually. Since the structure of these deep methods is fixed and the weights and biases are pre-trained, HPO or HPT is not tuned to each dataset in these networks.

Given the fact that personality traits can influence appearance, MobileNetv2 and ResNeSt50 networks were employed in [38] to extract facial features and perform classification. Results indicated that a single pre-trained network such as MobileNetv2 is inappropriate for classifying all five personality traits, implying that each trait must be classified by a specific model, which means different hyper-parameters are necessary. However, the authors did not state this directly and instead combined two pre-trained deep networks to build a complex deep model.

Onno Kampman et al. examined feature extraction and the classification of five personality traits by applying a one-dimensional CNN to a raw audio dataset. The HPT of the deep network containing regularization factors and kernel size was performed manually [39].

One application of personality detection is discovering interpersonal communication skills. Article [40] investigated this aspect from video interviews using a semi-supervised CNN in which HPT was performed by trial-and-error. The authors concentrated on video processing, and a fixed hyper-parameter set was used for all traits.

The study in [41] analyzed the acoustic and lexical features of a speech signal that were affected by BFI traits. Additionally, it designed six models based on recurrent neural networks for classifying those traits. Hyper-parameters such as hidden size, learning rate, batch size, and dropout percentage were defined, but tuning them was not discussed.

## **3. Dataset**

The SSPNet speaker personality corpus (SPC) is a well-known automatic personality perception dataset introduced in 2010. This dataset originally contained 640 recorded speech signals of 322 native French speakers. Each clip contains one speaker and was recorded for 10 s. Due to studies on the effect of mental factors on speech signals [42], the collected clips were emotionally neutral, and to confirm that lexical content did not affect the personality scores, evaluators unfamiliar with the French language were selected. Therefore, eleven assessors who did not understand French evaluated each clip based on the BFI questionnaire. The average score of these assessors was considered the final score for each clip. Hence, five scores were obtained for each clip [43].

Although the SPC dataset has been applied in several works and is a proper dataset for comparison with new methods, the number of samples it contains is too low to train the enormous number of parameters of a DNN. This important challenge was addressed in our previous work [24], where we showed that the number of speech samples can be increased with data augmentation methods based on the spectrogram, so that the prosodic content of speech is preserved. Data augmentation is a popular technique for artificially expanding the size of a dataset and is widely used in image processing. However, applying this technique to speech is not as easy as applying it to images. In other words, we needed to choose transformations that maintain the speaker's personality, and we had to be confident that such manipulations of the spectrogram do not interfere with the extracted features related to personality traits. In this regard, frequency masking and time warping were selected as data augmentation methods, and the number of clips increased to 640,000. For more details, please see [24].
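Frequency masking, one of the two augmentations mentioned above, can be sketched in a few lines, assuming the spectrogram is a 2-D NumPy array (frequency bins × time frames); the function name and parameters are illustrative, not taken from the pipeline of [24].

```python
import numpy as np

def frequency_mask(spec, max_width=8, rng=None):
    """Zero out a random band of frequency bins in a (freq, time) spectrogram."""
    rng = rng or np.random.default_rng()
    masked = spec.copy()
    width = int(rng.integers(1, max_width + 1))            # band height in bins
    start = int(rng.integers(0, spec.shape[0] - width + 1))
    masked[start:start + width, :] = 0.0                   # mask the band over all frames
    return masked
```

Because the mask removes a narrow frequency band rather than shifting pitch or timing wholesale, the prosodic cues relevant to personality are largely preserved.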

#### **4. Feature Extraction**

Despite DNNs' ability to perform automatic feature extraction from raw speech signals, deep learning methods have generally been applied to manually extracted, hand-crafted audio features. This is mainly because of the large volume of data required for deep learning methods to excel. Nevertheless, building a dataset with a large number of labeled samples is costly, time-consuming, and laborious in the automatic personality perception field, which restricts various methods. Therefore, previous studies have used handcrafted features as the DNN input [44].

These handcrafted features comprise 6373 statistical features extracted from 130 low-level descriptors (LLD) [45]. Table 1 lists the 65 LLD features and their 65 first derivatives (ΔLLD), for a total of 130 LLD features.

For the LLD feature extraction process, each clip was divided into 60 ms frames with a 20 ms overlap in the time domain and 20 ms frames with a 10 ms overlap in the frequency domain by the Opensmile2.3 toolkit.
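The 60 ms frames with 20 ms overlap correspond to a 40 ms hop between frame starts; a small helper (illustrative, not part of the toolkit's configuration) makes the frame arithmetic concrete for the time-domain case:

```python
def frame_starts(clip_ms, frame_ms, overlap_ms):
    """Start times (in ms) of analysis frames; hop = frame length - overlap."""
    hop = frame_ms - overlap_ms
    return list(range(0, clip_ms - frame_ms + 1, hop))

# 10 s clip, 60 ms frames with 20 ms overlap in the time domain -> 40 ms hop
starts = frame_starts(10_000, 60, 20)
```

The same arithmetic with 20 ms frames and a 10 ms overlap gives the 10 ms hop used in the frequency domain.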


**Table 1.** The 130 LLD features, including 65 LLD and 65 ΔLLD features [46].



#### **5. Proposed Method**

This section is divided into two parts. In the first part, we thoroughly describe the new optimization method mathematically. In order to apply our optimization method to the SAAE, we had to address several problems. The second part deals with this issue and its solution.

#### *5.1. The Proposed Optimization Method*

HPO of deep learning is a time-consuming task in practice that depends on the network depth, the number of parameters, the processing system, and the speed of the optimization algorithm [5]. Applying HPO to deep learning is challenging for three reasons: (1) the unsupervised learning used by most deep learning methods complicates optimization and leads to imperfect tuning of parameters [47]; (2) large models with enormous numbers of trainable parameters push the processing system into runtime errors [5,8]; and (3) the search space created by the different types of hyper-parameter domains (categorical, continuous, and integer-valued) is intricate, causing inherent computational complexity [5]. A larger search space gives rise to a longer search time.

Parallel evaluation can partly reduce optimization time [48], and culture speeds up the population's evolution more than chromosomes alone (each chromosome represents a solution in the population space) [49]. Accumulated experience that is potentially accessible to all individuals is called culture, which is used in problem-solving activities [50]. The knowledge extracted by identifying patterns in the population's problem-solving experiences influences the generation of new solutions [51]. Therefore, the combination of cultural algorithms (CA) and parallel computing can facilitate the exploration of the search space [52]. In this regard, researchers are interested in combining CA with other optimization algorithms. Sun et al. combined a cultural algorithm with two PSO populations that shared a belief space, showing that sharing the knowledge of the belief space can improve performance by avoiding local optima [53]. Single-population and multi-population methods based on CA were proposed in [54]. A PSO population-based method with an interactive belief space was introduced in [49]. A hybrid evolutionary optimization method coupling CA with GAs was defined in [55]. Fuzzy operations were employed to exchange individuals between the belief space and the population space in [56].

From this perspective, we proposed a four-island approach based on the parallel evaluation and CA.

Although CA and parallel computing can perform better than basic optimization algorithms [57], alone they do not provide enough convergence speed for deep learning. Thus, three driving-force factors were applied to the population space, creating an interactive space among the four island population spaces. An interactive population space gives rise to an interactive belief space, which can determine the search direction and step size faster than traditional optimization methods. Accordingly, our proposed method is called the multi-island interactive cultural (MIC) algorithm.

The MIC method is illustrated in Figure 1. In this method, the control parameters are configured first. The initial population X[m, D] is generated randomly in the feasible space. The variable m indicates the population size (the number of chromosomes or individuals), and D is the chromosome dimension (the number of genes).

**Figure 1.** Flowchart of the MIC algorithm.

After preparing the random initial population, it is transferred into the four islands in parallel (gray lines): GA, PSO, DE, and evolution strategy (ES). GA and PSO are optimization algorithms widely applied in HPO studies in deep learning [1,8]. GA is far more successful in complex networks such as CNNs but eliminates previous information by changing the population every iteration [50]. PSO shares information between the particles and is popular for smaller networks [29]. The DE algorithm is utilized in optimization problems due to its high convergence speed and few control parameters when searching for global optima; it is suitable for nonlinear search spaces [28]. The ES is less popular among global optimization algorithms because it is a simple mutation-selection method, but it is helpful for making small changes [48]. It should be noted that in the first iteration, the populations of the four islands are the same.

The four islands were evaluated individually and in parallel. Then, individuals of each island were selected for transfer into an interactive belief space (InBS) through an acceptance function (colored arrows). Here, the acceptance function selected the 25% best individuals of each island, so the belief space size was y[m, D].
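The 25% acceptance function translates directly into a selection step; the sketch below is a minimal illustration assuming a minimization problem (lower fitness is better), with names of our own choosing:

```python
import numpy as np

def accept_top_fraction(pop, fitness, frac=0.25):
    """Select the best `frac` of a population (minimization assumed)."""
    k = max(1, int(len(pop) * frac))
    idx = np.argsort(fitness)[:k]       # indices of the k best individuals
    return pop[idx]
```

Applying this to each island and stacking the results yields an InBS whose size equals one island population, matching y[m, D] for four islands at 25% each.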

The InBS consists of the normative (N[D]) and situational knowledge (S) of all islands. The knowledge of the different islands in the belief space causes the chromosomes to move away from unwanted regions and get closer to the optimal points, exploiting diverse experiences faster than previously published works. The InBS can be used effectively to prune the population space.

Normative knowledge represents the range of the best solutions by determining the upper and lower bounds of each gene of a chromosome and is used to steer the search efforts toward the promising ranges. In other words, it computes the range of each gene that leads an individual to a good solution.

The offspring affected by normative knowledge are generated by Equation (1) as

$$\mathbf{y}\_{p+i,j}^{t+1} = \begin{cases} \mathbf{y}\_{i,j}^{t} + \left| (\mathbf{u}\_{j}^{t} - \mathbf{l}\_{j}^{t}) \cdot N(0,1) \right| & \text{if } \mathbf{y}\_{i,j}^{t} < \mathbf{l}\_{j}^{t}, \\ \mathbf{y}\_{i,j}^{t} - \left| (\mathbf{u}\_{j}^{t} - \mathbf{l}\_{j}^{t}) \cdot N(0,1) \right| & \text{if } \mathbf{y}\_{i,j}^{t} > \mathbf{u}\_{j}^{t}, \\ \mathbf{y}\_{i,j}^{t} + \beta \left| \mathbf{u}\_{j}^{t} - \mathbf{l}\_{j}^{t} \right| \cdot N(-1,1) & \text{otherwise}, \end{cases} \tag{1}$$

where $u\_j^t$ is the upper and $l\_j^t$ the lower bound of the InBS for the jth gene, *β* is a constant value, t is the current iteration, and $N(0,1)$ is the normal distribution.
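A vectorized sketch of Equation (1), under the assumptions that lower fitness is better and that N(−1, 1) in the third case is a uniform draw on (−1, 1) (the notation leaves this open); all names are illustrative:

```python
import numpy as np

def normative_influence(y, l, u, beta=0.2, rng=None):
    """Move each gene toward [l_j, u_j]: push up if below l, push down if
    above u, otherwise take a small scaled step (Equation (1))."""
    rng = rng or np.random.default_rng()
    m, D = y.shape
    step = np.abs((u - l) * rng.standard_normal((m, D)))
    # assumption: N(-1, 1) read as a uniform draw on (-1, 1)
    small = beta * np.abs(u - l) * rng.uniform(-1.0, 1.0, (m, D))
    return np.where(y < l, y + step, np.where(y > u, y - step, y + small))
```

The absolute values guarantee that out-of-range genes always move toward the promising range, while in-range genes only jitter by a β-scaled amount.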

For each gene, the structure contains the upper bound ($u\_j^t$), the lower bound ($l\_j^t$), the upper bound value ($U\_j^t$), and the lower bound value ($L\_j^t$), which are obtained by Equations (2)–(5), respectively.

$$\mathbf{L}\_{j}^{t+1} = \begin{cases} f(\mathbf{y}\_{i}^{t}) & \text{if } \mathbf{y}\_{i,j}^{t} \le \mathbf{l}\_{j}^{t} \text{ or } f(\mathbf{y}\_{i}^{t}) < \mathbf{L}\_{j}^{t}, \\ \mathbf{L}\_{j}^{t} & \text{otherwise}, \end{cases} \tag{2}$$

$$\mathbf{l}\_{j}^{t+1} = \begin{cases} \mathbf{y}\_{i,j}^{t} & \text{if } \mathbf{y}\_{i,j}^{t} \le \mathbf{l}\_{j}^{t} \text{ or } f(\mathbf{y}\_{i}^{t}) < \mathbf{L}\_{j}^{t}, \\ \mathbf{l}\_{j}^{t} & \text{otherwise}, \end{cases} \tag{3}$$

$$\mathbf{U}\_{j}^{t+1} = \begin{cases} f(\mathbf{y}\_{i}^{t}) & \text{if } \mathbf{y}\_{i,j}^{t} \ge \mathbf{u}\_{j}^{t} \text{ or } f(\mathbf{y}\_{i}^{t}) < \mathbf{U}\_{j}^{t}, \\ \mathbf{U}\_{j}^{t} & \text{otherwise}, \end{cases} \tag{4}$$

$$\mathbf{u}\_{j}^{t+1} = \begin{cases} \mathbf{y}\_{i,j}^{t} & \text{if } \mathbf{y}\_{i,j}^{t} \ge \mathbf{u}\_{j}^{t} \text{ or } f(\mathbf{y}\_{i}^{t}) < \mathbf{U}\_{j}^{t}, \\ \mathbf{u}\_{j}^{t} & \text{otherwise}, \end{cases} \tag{5}$$

where $y\_{i,j}$ is the jth gene in the ith individual of the InBS, and $f(y\_i)$ is the value of the individual $y\_i$ calculated by the fitness function. A fitness (loss) function evaluates the individuals of each island separately; the problem description determines the fitness function.
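Equations (2)–(5) can be read as a single per-gene update of the normative knowledge; the sketch below assumes minimization and in-place NumPy arrays (the naming is ours):

```python
import numpy as np

def update_normative(y_i, f_yi, l, u, L, U):
    """Update per-gene bounds of normative knowledge (Equations (2)-(5)).
    A bound moves when the individual falls outside it or improves on the
    score recorded for that bound."""
    for j in range(len(l)):
        if y_i[j] <= l[j] or f_yi < L[j]:   # Equations (2)-(3): lower side
            l[j], L[j] = y_i[j], f_yi
        if y_i[j] >= u[j] or f_yi < U[j]:   # Equations (4)-(5): upper side
            u[j], U[j] = y_i[j], f_yi
    return l, u, L, U
```

Calling this for every accepted individual tightens the promising ranges that Equation (1) then exploits.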

The situational knowledge adjusts the mutation step size relative to the distance between the current best individual and the other individuals, as seen in Equation (7). The greater the distance between the ith individual, $y\_i$, and the current best individual, the greater the step size, and vice versa.

Updating the situational knowledge adds the InBS's best individual to the situational knowledge if it outperforms the current best individual, as described in Equation (6).

Here, $y\_{best}^t$ is the best individual in the InBS at iteration t. The influence rule can be represented by Equation (7) (for i = 1, . . . , m and j = 1, . . . , D).

$$<\mathbf{E}\_{1}^{t+1}, \mathbf{E}\_{2}^{t+1}, \ldots, \mathbf{E}\_{e}^{t+1}> = \begin{cases} <\mathbf{y}\_{best}^{t}, \mathbf{E}\_{2}^{t}, \ldots, \mathbf{E}\_{e}^{t}> & \text{if } f(\mathbf{y}\_{best}^{t}) > f(\mathbf{E}\_{1}^{t}), \\ <\mathbf{y}\_{best}^{t}> & \text{if change detected}, \\ <\mathbf{E}\_{1}^{t}, \mathbf{E}\_{2}^{t}, \ldots, \mathbf{E}\_{e}^{t}> & \text{otherwise}, \end{cases} \tag{6}$$

$$\mathbf{y}\_{p+i,j}^{t+1} = \begin{cases} \mathbf{y}\_{i,j}^{t} + \left| (\mathbf{y}\_{i,j}^{t} - \mathbf{E}\_{j}^{t}) \cdot N\_{i,j}(0,1) \right| & \text{if } \mathbf{y}\_{i,j}^{t} < \mathbf{E}\_{j}^{t}, \\ \mathbf{y}\_{i,j}^{t} - \left| (\mathbf{y}\_{i,j}^{t} - \mathbf{E}\_{j}^{t}) \cdot N\_{i,j}(0,1) \right| & \text{if } \mathbf{y}\_{i,j}^{t} > \mathbf{E}\_{j}^{t}, \\ \mathbf{y}\_{i,j}^{t} + \beta \left| \mathbf{y}\_{i,j}^{t} - \mathbf{E}\_{j}^{t} \right| \cdot N\_{i,j}(0,1) & \text{otherwise}, \end{cases} \tag{7}$$

where $E\_j$ is the jth gene in the best individual, *β* is a constant factor, $N(0,1)$ is the normal distribution, and $y\_{p+i,j}$ is the offspring of the individual $y\_{i,j}$.
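Equation (7) can be sketched in the same vectorized style as the normative rule (names are ours; E is the best individual, broadcast across the population):

```python
import numpy as np

def situational_influence(y, E, beta=0.2, rng=None):
    """Step each gene toward the best individual E; the step size grows
    with the distance |y_ij - E_j| (Equation (7))."""
    rng = rng or np.random.default_rng()
    m, D = y.shape
    noise = np.abs(rng.standard_normal((m, D)))
    dist = np.abs(y - E)                 # distance to the best individual
    return np.where(y < E, y + dist * noise,
           np.where(y > E, y - dist * noise, y + beta * dist * noise))
```

Genes far from the best individual take large steps toward it, while genes already equal to it stay put, which is exactly the step-size behavior described above.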

After updating the InBS with new generations, some individuals are transferred into each island's population space by the influence function. The individuals of the InBS carry the knowledge of all of the islands; this is a key strength of the proposed method. Various studies have shown that the efficiency of an optimization method varies across problems; in other words, choosing an optimization method for a problem is itself a challenge that some researchers regard as a kind of hyper-parameter to be tuned. Hence, the 25% best individuals of the InBS replaced the 25% worst individuals of the population on each island. Offspring generation then starts on each island separately, and the offspring are evaluated through the fitness function.

If the algorithm reaches the stopping criterion, the process will be stopped. Otherwise, interactive population space is created by three driving forces in order to promote cooperation among the islands and increase diversity.

The three driving-force methods are named the elitism method (EM), merge method (MM), and lambda method (LM).

In the interactive population space, all individuals of each island are considered. In EM, the *m* best individuals are preserved and replace the old population on each island. With this method, the next-generation populations of all islands are identical. This driving force makes the four basic algorithms create the interactive space using only the best individuals of the four islands.

In MM, after considering all individuals of each island, a random number *a*, *a* ∈ (0, 1), is produced. The *a* × *m* best individuals are merged with (1 − *a*) × *m* individuals of the old population on each island. It is clear that each island has a unique new population in this interactive space.

In LM, two islands are selected randomly, and two random numbers μ, μ ∈ (0, 1), and λ, λ ∈ (0, 1), represent emigration and immigration, respectively. Based on μ and λ, random numbers of individuals of each island immigrate to and emigrate from the other random island. This method forces the islands to cooperate with both the best and the worst individuals to create the interactive space.
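The EM and MM driving forces translate directly into array operations; a minimal sketch (assuming minimization, with our own function names):

```python
import numpy as np

def elitism_method(islands, fitnesses, m):
    """EM: pool all islands, keep the m best, give every island the same elites."""
    pool = np.vstack(islands)
    fit = np.concatenate(fitnesses)
    best = pool[np.argsort(fit)[:m]]
    return [best.copy() for _ in islands]

def merge_method(island, island_fit, pool, pool_fit, rng=None):
    """MM: mix a*m of the global best with (1-a)*m of the island's old population."""
    rng = rng or np.random.default_rng()
    m = len(island)
    a = rng.uniform(0.2, 0.8)                      # random mixing ratio in (0, 1)
    k = max(1, int(a * m))
    best = pool[np.argsort(pool_fit)[:k]]
    keep = island[np.argsort(island_fit)[:m - k]]  # retained old individuals
    return np.vstack([best, keep])
```

EM homogenizes the islands around the current elites, while MM keeps each island's population unique, matching the descriptions above.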

Due to the interaction and sharing of individuals among the four islands, if one algorithm becomes trapped in a local optimum, the others can lead the MIC to the global optimum, because the result does not depend on a single algorithm. This feature allows the MIC to escape local optima efficiently in various global optimization problems.

The MIC strategy is presented step by step below (Algorithm 1).

**Algorithm 1:** Implementation of MIC


**Step 1**: Configure the control parameters.

**Step 2**: Generate the initial population randomly.

**Step 3**: Transfer 25% of the best individuals of each island into InBS (Accept).

**Step 4**: Update the belief space with Equations (1)–(7).

**Step 5**: Transfer 25% of offspring into each island (Influ).

**Step 6**: **If** stop criterion < ζ

Stop algorithm.

**Else**

Go to Step 7.

**Step 7:** Create Interactive population space by using the following three methods:

EM: The *m* best individuals of the four islands are selected and replace the old population.

MM: The *a* × *m* best individuals are selected and merged with (1 − *a*) × *m* individuals obtained from the old population of each island.

LM: According to two random numbers, μ and λ, some individuals of a random island can immigrate to and emigrate from another random island.

**Step 8:** Go to Step 3.
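To make the control flow of Algorithm 1 concrete, the toy sketch below runs a simplified MIC on a sphere benchmark. It keeps the accept/influence cycle and the EM driving force, but replaces the four island algorithms (GA, PSO, DE, ES) with plain Gaussian mutation; everything here is our own simplification, not the paper's implementation.

```python
import numpy as np

def sphere(x):
    """Toy fitness function (minimization)."""
    return float(np.sum(x ** 2))

def mic_minimize(dim=3, m=20, iters=30, seed=0):
    rng = np.random.default_rng(seed)
    islands = [rng.uniform(-5.0, 5.0, (m, dim)) for _ in range(4)]
    for _ in range(iters):
        # Accept: 25% best of each island form the interactive belief space (InBS)
        belief = np.vstack([
            pop[np.argsort([sphere(x) for x in pop])[: m // 4]] for pop in islands
        ])
        for i, pop in enumerate(islands):
            # Influence: replace each island's 25% worst with InBS individuals
            order = np.argsort([sphere(x) for x in pop])
            pop[order[-(m // 4):]] = belief[rng.choice(len(belief), m // 4,
                                                       replace=False)]
            # island-local search step (stand-in for GA/PSO/DE/ES operators)
            islands[i] = pop + 0.1 * rng.standard_normal(pop.shape)
        # Driving force (EM): every island keeps the m best individuals overall
        pool = np.vstack(islands)
        elite = pool[np.argsort([sphere(x) for x in pool])[:m]]
        islands = [elite.copy() for _ in range(4)]
    return min(sphere(x) for x in np.vstack(islands))
```

With real island algorithms, only the island-local step changes; the belief-space exchange and the driving forces stay the same.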

#### *5.2. Stacked Asymmetric Auto-Encoder HPO Using MIC*

Since our work aimed to obtain the SAAE near-optimal structure, a brief overview of this method is presented below.

#### (1) **Stacked asymmetric auto-encoder**

The AsyAE is a semi-supervised DNN that addresses the curse of dimensionality. The schematic of the AsyAE is illustrated in Figure 2.

**Figure 2.** Schematic of the asymmetric auto-encoder [24].

In this variant, one neuron holding the desired value of the problem, here the studied personality score, is added to the decoder part of the conventional auto-encoder. This single neuron disrupts the symmetry of the encoder and decoder parts, making the network asymmetric.

The feed-forward equations of the AsyAE are similar to those of the conventional one, as follows. Superscripts 1 and 2 denote the encoder and decoder layers, respectively.

$$\mathbf{net}^{(1)} = \mathbf{W}^{(1)} \mathbf{X}, \tag{8}$$

$$\mathbf{O}^{(1)} = \mathbf{f}\left(\mathbf{net}^{(1)}\right),\tag{9}$$

where **W**(1) indicates the encoder weight matrix, **X** displays the input matrix, **O**(1) is the encoder output matrix, and f is the activation function.

$$\mathbf{net}^{(2)} = \mathbf{W}^{(2)} \mathbf{O}^{(1)},\tag{10}$$

$$\mathbf{O}^{(2)} = \mathbf{f}\left(\mathbf{net}^{(2)}\right),\tag{11}$$

where **W**(2) and **O**(2) are the weight and output matrices of the decoder layer, respectively.
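The forward pass of Equations (8)–(11) is a pair of matrix products with an elementwise nonlinearity; a minimal NumPy sketch (tanh chosen as an illustrative activation, and biases omitted as in the equations above):

```python
import numpy as np

def forward(X, W1, W2, f=np.tanh):
    """Feed-forward pass of the asymmetric auto-encoder (Equations (8)-(11))."""
    net1 = W1 @ X        # Equation (8): encoder pre-activation
    O1 = f(net1)         # Equation (9): encoder output
    net2 = W2 @ O1       # Equation (10): decoder pre-activation
    O2 = f(net2)         # Equation (11): decoder output
    return O1, O2
```

The asymmetry shows up only in the shapes: the decoder output has one more row than the input has features, for the extra label neuron.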

The error used for back-propagation through the encoder and decoder weight matrices is calculated by Equation (12).

$$\mathbf{E} := \frac{1}{\mathbf{k}} \sum\_{\mathbf{i}=1}^{\mathbf{k}} \log(\cosh(\mathbf{e}\_{\mathbf{i}})),\tag{12}$$

where **e**t is the error vector of the AsyAE at time t, described by Equation (13), $e\_i$ denotes its components, and k is the number of neurons in the decoder output layer.

$$\mathbf{e}\_{\mathfrak{t}} \colon = \mathbf{d}\_{\mathfrak{t}} - \mathbf{o}\_{\mathfrak{t}}^{(2)}.\tag{13}$$

The desired output vector at time t is denoted by **d**t, which belongs to the matrix **D**, the desired output matrix of the AsyAE, produced by combining the desired labels with the AsyAE input.

$$\mathbf{D} = \begin{bmatrix} x\_{11} & x\_{12} & \dots & x\_{1n\_0} & L \\ x\_{21} & x\_{22} & \dots & x\_{2n\_0} & L \\ \vdots & \vdots & & \vdots & \vdots \\ x\_{m1} & x\_{m2} & \dots & x\_{mn\_0} & L \end{bmatrix}.$$

Here, $x\_{ij}$ is an element of the AsyAE input matrix, and L is the desired label of the problem. A stacked asymmetric auto-encoder is the result of putting several AsyAEs together.
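The log-cosh loss of Equation (12) and the construction of the desired matrix **D** can be sketched as follows (illustrative names; the per-sample indexing is simplified):

```python
import numpy as np

def logcosh_loss(d, o):
    """Log-cosh error of Equation (12): mean of log(cosh(e_i)) over the k
    decoder outputs, with e = d - o (Equation (13))."""
    e = d - o
    return float(np.mean(np.log(np.cosh(e))))

def desired_matrix(X, label):
    """Desired output D: the AsyAE input with the label appended as a column."""
    L = np.full((X.shape[0], 1), label)
    return np.hstack([X, L])
```

Log-cosh behaves like a squared error near zero but grows only linearly for large residuals, which makes the reconstruction term robust to outliers.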

#### (2) **Optimizing some hyper-parameters of a stacked asymmetric auto-encoder**

Given the fact that the number of DNN hyper-parameters is significantly large, the simultaneous optimization of all of them complicates the computation and requires high-performance computing systems. Hence, we compromised between MIC and expertise, calculating the six critical DNN hyper-parameters as follows:


For the HPO of the SAAE, the following principles apply. Figure 3 illustrates the flowchart of the proposed method in detail.
