**MIC Optimization Method**

**Determining the number of neurons in each hidden layer:** In our work, N_i denotes the number of neurons in the i-th hidden layer, which is optimized by the MIC method. The first MIC variable is therefore N_i, an integer value with N_i ∈ [1, m], where m equals the input size of the AsyAE. This constraint forces the AsyAE to be an undercomplete network, i.e., the encoder layer has fewer neurons than the input layer.

**Determining the learning rate in each hidden layer:** μ_i specifies the learning rate of the i-th hidden layer, which is also optimized by the MIC method. The second variable of the MIC population is therefore a real value between zero and one, μ_i ∈ (0, 1). It should be mentioned that we keep five decimal places for μ_i to examine its effect on SAAE performance.
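To make these two decision variables concrete, the following minimal Python sketch (the function name is hypothetical, not from the paper) draws one MIC candidate for a layer, keeping N_i in [1, m] so the layer stays undercomplete and rounding μ_i to five decimal places:

```python
import random

def sample_layer_variables(m: int) -> tuple[int, float]:
    """Draw one MIC individual's variables for a hidden layer.

    N_i: integer number of neurons in [1, m], where m is the AsyAE
         input size -- this keeps the network undercomplete.
    mu_i: real learning rate in (0, 1), rounded to 5 decimal places
          as described in the text.
    """
    n_i = random.randint(1, m)                        # N_i in [1, m]
    mu_i = round(random.uniform(1e-5, 1 - 1e-5), 5)   # mu_i in (0, 1)
    return n_i, mu_i

# Example: for a 24-dimensional input, draw one candidate (N_i, mu_i).
print(sample_layer_variables(24))
```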

**Initial value of trainable parameters:** Although deep learning methods perform well on a variety of problems, training them is a complicated task: a great many factors strongly influence the result, and one of the most critical is initialization.

The DNN parameters need a starting point in the feasible region from which to be trained. Proper initial parameters can accelerate convergence; conversely, poor random initialization can trap the network in local optima.

Optimization algorithms such as GA and PSO can be used for this purpose. However, the number of DNN parameters (weights and biases) is vast, e.g., on the order of 10^15, and producing chromosomes of such dimensions causes memory errors on the processing system and is not efficient in practice. Another method, suggested by Hinton et al., applies a restricted Boltzmann machine (RBM) network to tune the auto-encoder's initial parameters [58,59].

Given that both the AsyAE and the RBM are ANN-based, the AsyAE can be interpreted as two consecutive RBMs, as illustrated in Figure 4. For RBM1, the input layer is the visible unit and the encoder layer is the hidden unit; for RBM2, the encoder layer is the visible unit and the decoder layer is the hidden unit.

**Figure 4.** Converting auto-encoder to two RBMs for tuning the initial weights of the encoder and decoder layers.

The conventional RBM is based on binary visible and hidden units and is called the Bernoulli-Bernoulli RBM (BBRBM). If both the visible and the hidden units follow a Gaussian distribution, the Gaussian-Gaussian RBM (GGRBM) is employed [60]. Since the AsyAE inputs and parameters are real-valued, we used the GGRBM equations.

The energy function of the GGRBM is defined in Equation (14), where v denotes the visible units, h the hidden units, and g_v and g_h the numbers of visible and hidden units, respectively. It should be noted that the AsyAE input and the encoder output are the visible units of RBM1 and RBM2, respectively.

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{g_v} \sum_{j=1}^{g_h} W_{i,j}\,\frac{v_i h_j}{\sigma_i \sigma_j} + \sum_{i=1}^{g_v} \frac{(v_i - a_i)^2}{2\sigma_i^2} + \sum_{j=1}^{g_h} \frac{(h_j - b_j)^2}{2\sigma_j^2},\tag{14}$$

where a_i and b_j are the visible and hidden unit biases, respectively, and σ_i and σ_j are their standard deviations. W_{i,j} is the weight between visible unit i and hidden unit j. A probability is assigned to each configuration of visible and hidden units by Equation (15),

$$P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp(-E(\mathbf{v}, \mathbf{h})). \tag{15}$$

Here, Z is the normalization constant calculated by Equation (16).

$$Z = \sum_{\mathbf{v}} \sum_{\mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h})).\tag{16}$$

Equation (17) shows the objective, the average log-likelihood over the c training samples, which must be maximized:

$$\underset{\{W_{i,j},\,a_i,\,b_j\}}{\operatorname{maximize}}\;\frac{1}{c}\sum_{\ell=1}^{c}\log P\!\left(\mathbf{v}^{\ell},\mathbf{h}^{\ell}\right).\tag{17}$$

The update rules are

$$\Delta W_{i,j} = \zeta\left(\langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{model}}\right),\tag{18}$$

$$\Delta a_i = \zeta\left(\langle v_i\rangle_{\text{data}} - \langle v_i\rangle_{\text{model}}\right),\tag{19}$$

$$\Delta b_j = \zeta\left(\langle h_j\rangle_{\text{data}} - \langle h_j\rangle_{\text{model}}\right),\tag{20}$$

where ⟨•⟩_data and ⟨•⟩_model are expected values under the data distribution and the model distribution, respectively, and ζ is the learning rate.

Having described the GGRBM briefly, we now apply it. For a traditional auto-encoder, the initial parameters of the encoder layer are first selected randomly and then trained by the GGRBM method; the trained parameters serve as the encoder layer's initial parameters, and their transpose is employed for the decoder layer. In the AsyAE, however, the encoder and decoder parameters are not symmetric and have to be obtained individually, so the same procedure is applied to the decoder layer to obtain its initial parameters.
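As an illustration only, the following Python sketch trains a GGRBM with one-step contrastive divergence in the spirit of Equations (18)-(20), assuming unit variances (σ = 1) for simplicity, and then pre-trains the encoder and decoder with separate RBMs as described above. Names such as `ggrbm_cd1` are placeholders, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def ggrbm_cd1(V, n_hidden, lr=0.01, epochs=50, sigma=1.0):
    """One-step contrastive-divergence training of a GGRBM (unit variances).

    V: (n_samples, n_visible) real-valued data matrix.
    Returns the learned weights W and biases a (visible), b (hidden).
    """
    n_vis = V.shape[1]
    W = rng.normal(0, 0.01, size=(n_vis, n_hidden))
    a = np.zeros(n_vis)       # visible biases
    b = np.zeros(n_hidden)    # hidden biases
    for _ in range(epochs):
        h0 = b + V @ W                                     # positive phase: E[h|v]
        v1 = a + h0 @ W.T + sigma * rng.normal(size=V.shape)  # reconstruct visibles
        h1 = b + v1 @ W                                    # negative phase
        n = V.shape[0]
        W += lr * (V.T @ h0 - v1.T @ h1) / n               # Eq. (18)
        a += lr * (V - v1).mean(axis=0)                    # Eq. (19)
        b += lr * (h0 - h1).mean(axis=0)                   # Eq. (20)
    return W, a, b

# Asymmetric pre-training: the decoder is NOT the transpose of the
# encoder; each layer gets its own RBM, trained individually.
X = rng.normal(size=(200, 24))                     # toy input data
W_enc, a_enc, b_enc = ggrbm_cd1(X, n_hidden=10)    # RBM1: input -> encoder
H = b_enc + X @ W_enc                              # encoder outputs
W_dec, a_dec, b_dec = ggrbm_cd1(H, n_hidden=24)    # RBM2: encoder -> decoder
```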

**The number of hidden layers:** The value of this hyper-parameter depends on the performance of the stacked AsyAEs. Within MIC, the classification performance of each AsyAE is examined for each pair (N_i, μ_i), and a new layer is kept only if it improves on the previous one: if the performance of AsyAE(i+1) is better than that of AsyAE(i), the MIC algorithm continues; otherwise, stacking stops.

The performance criterion differs from one problem to another. The Unweighted Average (UA) recall criterion, frequently used in personality perception studies, is calculated by Equation (21),

$$\text{UA recall} = \frac{1}{2}\left(\text{recall}_{\text{Low}} + \text{recall}_{\text{High}}\right).\tag{21}$$

Here, recall_Low is the recall in detecting a low degree of the studied personality trait, and recall_High is the recall in detecting a high degree.
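A minimal Python sketch of Equation (21) for a Low/High task (illustrative only; the function name is ours):

```python
def ua_recall(y_true, y_pred):
    """Unweighted average recall for a Low/High binary task (Eq. 21)."""
    def recall(cls):
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        total = sum(1 for t in y_true if t == cls)
        return hits / total if total else 0.0
    return 0.5 * (recall("Low") + recall("High"))

# Example: 3 of 4 Low and 2 of 3 High instances recovered -> UA ~ 0.708.
print(ua_recall(["Low"] * 4 + ["High"] * 3,
                ["Low", "Low", "Low", "High", "High", "High", "Low"]))
```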

**The maximum epoch of network training:** Generally, the DNN training process proceeds until a maximum number of epochs (parameter updates) is reached [40]. As discussed in [24], the best class separation does not necessarily occur at the maximum epoch. Thus, the variation of a criterion J is employed as a stopping rule, ending training at the epoch in which the maximum separation is achieved.

J is calculated as follows,

$$J = \frac{\det(\mathbf{S}_B)}{\det(\mathbf{S}_W)},\tag{22}$$

where S_W is the within-class scatter matrix, S_B is the between-class scatter matrix [61], and det denotes the determinant of a matrix.

$$\mathbf{S}_W = \sum_{i=1}^{c}\sum_{\mathbf{x}\in c_i}(\mathbf{x}-\mu_i)(\mathbf{x}-\mu_i)^{T},\tag{23}$$

$$\mathbf{S}_B = \sum_{i=1}^{c} n_i\,(\mu_i-\mu)(\mu_i-\mu)^{T}.\tag{24}$$

Here, n_i is the number of instances in the i-th class, c is the number of classes, x is an encoder output vector (an instance from the encoder output matrix X), μ is the mean over all instances, and μ_i is the mean of the i-th class.
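A direct Python transcription of Equations (22)-(24) might look as follows. One caveat: S_B has rank at most c − 1, so det(S_B) vanishes whenever the feature dimension exceeds c − 1; the toy example therefore uses two features and three classes (trace-based variants are a common workaround, though the equations are transcribed here as given):

```python
import numpy as np

def separation_criterion(X, y):
    """J = det(S_B) / det(S_W) from Equations (22)-(24).

    X: (n_samples, n_features) encoder outputs; y: class labels.
    """
    mu = X.mean(axis=0)                        # overall mean
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for cls in np.unique(y):
        Xc = X[y == cls]
        mu_c = Xc.mean(axis=0)                 # class mean
        D = Xc - mu_c
        S_W += D.T @ D                         # Eq. (23)
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)       # Eq. (24)
    return np.linalg.det(S_B) / np.linalg.det(S_W)   # Eq. (22)

# Toy example: three well-separated classes in two dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
y = np.repeat([0, 1, 2], 50)
X[y == 1] += [4.0, 0.0]
X[y == 2] += [0.0, 4.0]
print(separation_criterion(X, y))   # large J = well-separated classes
```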

**Preventing over-fitting and under-fitting problems:** The over-fitting problem happens when a model trains properly on the training dataset but performs poorly on the testing dataset. The under-fitting problem occurs when a model performs poorly on both the training and testing samples.

The number of layers and the number of neurons per layer can push a model into over-fitting or under-fitting, and both can be altered simply by changing the structure: more neurons and layers make the model overly complex, while too few cannot follow the data pattern. Designing an optimum structure therefore has to deal with this trade-off, so a new loss function is defined to guide the model toward a good fit.

$$\text{Loss} = \frac{\text{UA}_{\text{train}}}{a}\times\frac{\text{UA}_{\text{val}}}{b},\tag{25}$$

where a is the training threshold and b is the validation threshold. We already discussed the UA recall criterion chiefly used in personality perception; here, the loss function of Equation (25) is applied instead of Equation (21), and the aim is to maximize it. We set a = 0.8 and b = 0.6 because a UA_train above 80% and a UA_val above 60% are acceptable. The loss value then lies in the range [0, 2.08], and the pair (N_i, μ_i) that maximizes Equation (25) is accepted.
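A one-line sketch of Equation (25) with the thresholds from the text (the function name and example values are ours):

```python
def mic_loss(ua_train, ua_val, a=0.8, b=0.6):
    """Fitness from Eq. (25), to be MAXIMIZED by MIC.

    With UA recalls in [0, 1], the value lies in [0, 1/(a*b)] ~ [0, 2.08].
    """
    return (ua_train / a) * (ua_val / b)

print(mic_loss(0.85, 0.65))   # ~1.15: both thresholds met
print(mic_loss(0.95, 0.40))   # ~0.79: an over-fitting pattern scores lower
```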

**Final algorithm:** The pseudo-code for optimizing the SAAE hyper-parameters is given in Algorithm 2.
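Algorithm 2 itself is not reproduced here; the following Python sketch only illustrates, under our reading of this section, the layer-wise control flow it encodes. `run_mic` is a stub returning a dummy random fitness, standing in for the full MIC search with GGRBM pre-training, J-based early stopping, and Equation-(25) scoring:

```python
import random

def run_mic(layers, m):
    """Stub standing in for the MIC search over (N_i, mu_i).

    The real search would pre-train each candidate AsyAE with GGRBMs,
    train it until the J criterion stops improving, and score it with
    the loss of Equation (25); here a random fitness is returned.
    """
    n_i = random.randint(1, m)
    mu_i = round(random.uniform(1e-5, 1 - 1e-5), 5)
    return (n_i, mu_i), random.uniform(0.0, 2.08)

def optimize_saae(m):
    """Stack AsyAE layers while each new layer beats the previous one."""
    layers, best_score = [], 0.0
    while True:
        (n_i, mu_i), score = run_mic(layers, m)
        if score <= best_score:      # AsyAE(i+1) is no better: stop stacking
            break
        layers.append((n_i, mu_i))   # keep this layer's (N_i, mu_i)
        best_score = score
    return layers

print(optimize_saae(m=24))
```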


#### **6. Simulations and Results**

In this section, the results of the MIC method on three benchmark functions are first discussed and compared with other published methods. Then, MIC is used to design the structures of five individual DNNs for classifying the five personality traits. A final comparison is given at the end of the section.

#### *6.1. The Results of the MIC on Three Optimization Benchmarks*

Three well-known multimodal, continuous, non-separable benchmark functions with a global minimum value of zero, namely Rastrigin [52], Ackley [62], and Griewank [62], are used to validate the MIC method.

The multimodal property means the function has many local optima or peaks, which tests an algorithm's ability to avoid getting stuck in a local minimum. Non-separability refers to dependence among the solution variables: if all variables were independent, each could be optimized separately and the whole function would thereby be optimized [62], which is not possible here. These three functions are therefore demanding problems for evaluating the performance of any new optimization algorithm.

The formula, the feasible range of the variables, and the global optimum of each of the three functions are summarized in Table 2.


**Table 2.** Description of Three Benchmark Functions.

Here, n indicates the dimension of the function; n ≥ 2 for all of the functions mentioned. Figure 5 shows the shapes of the functions described in Table 2. As can be seen, all three functions have many local optima and are well suited to testing an optimization method's ability to escape them.

**Figure 5.** Benchmark functions: (**A**) Rastrigin, (**B**) Ackley, and (**C**) Griewank.
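For reference, the standard textbook forms of the three functions (all with global minimum 0 at the origin; the exact parameterization in Table 2 may differ slightly) can be written as:

```python
import numpy as np

def rastrigin(x):
    """f(x) = 10 n + sum(x_i^2 - 10 cos(2 pi x_i)); usual range [-5.12, 5.12]."""
    x = np.asarray(x)
    return 10 * x.size + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

def ackley(x):
    """Usual range [-32.768, 32.768]."""
    x = np.asarray(x)
    n = x.size
    return (-20 * np.exp(-0.2 * np.sqrt(np.sum(x**2) / n))
            - np.exp(np.sum(np.cos(2 * np.pi * x)) / n) + 20 + np.e)

def griewank(x):
    """Usual range [-600, 600]."""
    x = np.asarray(x)
    i = np.arange(1, x.size + 1)
    return np.sum(x**2) / 4000 - np.prod(np.cos(x / np.sqrt(i))) + 1

# All three evaluate to ~0 at the origin, e.g., in 30 dimensions:
for f in (rastrigin, ackley, griewank):
    print(f.__name__, round(float(f(np.zeros(30))), 10))
```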

To compare the performance of MIC against conventional optimization methods, the results of the four basic island algorithms and of MIC are reported in Table 3.

Problem complexity increases with dimensionality: raising the number of variables (the dimension) enlarges the search space, which makes finding the best solution more difficult [62]. To investigate the effect of dimension on search quality in MIC, Table 3 compares results for 10D and 30D.

For a fair comparison, all parameters and initial populations for the basic algorithms and MIC were set to the same values.

For a more reliable analysis, six criteria that are common in optimization studies were utilized: AvI, AvP, SI, BOP, SD, and SR, discussed in turn below.


The AvI results show that, for n = 10, MIC reaches more accurate solutions with faster convergence than the traditional algorithms. Although MM's performance diminishes for n = 30, LM and EM preserve theirs as complexity increases, demonstrating that LM and EM steadily improve solutions over long runs without getting stuck in local minima. MIC is clearly more powerful for global optimization than any of the four basic algorithms alone.



According to the AvP values in Table 3 for n = 10 and n = 30, the traditional algorithms are often unsuccessful in finding favorable solutions compared to MIC, especially its EM island. The AvP values also indicate that MIC speeds up convergence to the global optimum. Going from n = 10 to n = 30, the AvP values of LM and EM decreased by about 0.1 on Rastrigin and remained constant on the other two functions, whereas the change for MM is substantial, indicating that, like the traditional methods, it gets stuck in local optima as problem complexity increases.

The SI outcomes show that the MIC method, especially EM and LM, reaches the stopping criterion within a few iterations; that is, MIC speeds up convergence. Moreover, the SI criterion shows that although MM outperforms the basic optimization methods on the simpler cases (n = 10), its performance drops on the complex ones (n = 30). LM and EM are effective not only on simple functions but also on complex problems compared to the other methods.

The evaluation of the BOP criterion shows that the LM and EM methods reach the global optimum more accurately than the basic methods for both n = 10 and n = 30, whereas MM's results degrade with increasing complexity.

The SD values of MIC, except for MM, are very small compared to those of the four basic algorithms for n = 10 and n = 30, indicating the repeatability and robustness of the new algorithm, which stem from its pruning of the search space.

The SR results show that MIC is very promising in delivering higher reliability than the traditional algorithms: LM, EM, and MM reached the desired function value in 100% of runs for n = 10, and as the complexity increases (n = 30), LM and EM still succeed in reaching the desired value.

From Table 3, it is concluded that as the dimension increases, the results of all algorithms degrade except those of EM and LM.

Our study indicates that, on widely used global-optimization test functions, the quality of the solutions found by the proposed method is higher than that of the solutions provided by traditional algorithms. This is due to a more appropriate trade-off, achieved at the parallelism level, between exploring new individuals and exploiting highly fit ones. The three test functions demonstrate that the new method has great potential for substantially improved search performance.

Because these benchmarks are widely used, a comparison with other published works is presented in Table 4. It can be observed that LM and EM achieved the best solutions on the Ackley and Griewank functions (30D).


**Table 4.** Comparison with Other Published Methods in 30D. N/A means not available.

#### *6.2. The Results of Personality Perception with the MIC Method*

After the successful outcomes of the MIC method in finding the global optima of three complex benchmark functions, we applied it to find near-optimal hyper-parameter values for classifying the five personality traits. We say "near-optimal" rather than "optimal" because the MIC hyper-parameters themselves, such as the mutation and crossover rates, are chosen randomly.

Given that different personality traits affect speech characteristics differently [24,42], using the same DNN structure to extract features for all traits is not recommended. Assuming the five personality traits to be independent, five separate neural networks were designed and trained to classify them.

Hence, the network's depth was determined by classifying the output of each AsyAE's encoder layer with an SVM using a radial basis function kernel; the AsyAE with the highest classification result is taken as the output layer of the SAAE.

Table 5 compares our proposed method with other works on the SPC dataset in terms of UA recall and accuracy. In our previous work, the structure of the SAAE was chosen by trial and error, which was time-consuming, and two traits (extraversion and openness) achieved lower accuracy than reported in other research [24].

**Table 5.** Comparison Results of Our Proposed Method with Other Works in the SPC Dataset in Terms of UA Recall % (Accuracy %). N/A means not available.


In the present study, not only was the accuracy of extraversion and openness improved, but the UA recalls also increased beyond previous results. This demonstrates that the performance and robustness of trained models depend strongly on their hyper-parameter settings.
