*Article* **RHOASo: An Early Stop Hyper-Parameter Optimization Algorithm**

**Ángel Luis Muñoz Castañeda 1,2,\*, Noemí DeCastro-García 1,2 and David Escudero García <sup>2</sup>**


**Abstract:** This work proposes a new algorithm for optimizing the hyper-parameters of a machine learning algorithm, RHOASo, based on conditional optimization of concave asymptotic functions. A comparative analysis of the algorithm is presented, with particular emphasis on two important properties: the capability of the algorithm to work efficiently with a small part of a dataset and to finish the tuning process automatically, that is, without the user having to specify the number of iterations that the algorithm must perform. Statistical analyses over 16 public benchmark datasets comparing the performance of seven hyper-parameter optimization algorithms with RHOASo were carried out. The efficiency of RHOASo shows statistically significant positive differences with respect to the other hyper-parameter optimization algorithms considered in the experiments. Furthermore, it is shown that, on average, the algorithm needs around 70% of the iterations needed by other algorithms to achieve competitive performance. The results also show that the algorithm is remarkably stable with respect to the size of the dataset partition used.

**Keywords:** hyperparameters; machine learning; optimization; inference

#### **1. Introduction**

Tuning the hyper-parameter configuration of a machine learning (ML) algorithm is a recommended procedure to obtain a successful ML model for a given problem. Different ML algorithms have specific hyper-parameters whose configuration requires a deep understanding of both the model and the task. Since the hyper-parameter configuration greatly impacts the models' performance, the research in automatic hyper-parameter optimization (HPO) is focused on developing techniques that efficiently find optimal values for the hyper-parameters, maximizing accuracy while avoiding complex and expensive operations. However, this process remains a challenge because not all optimization methods are always suitable for a given problem.

Although there are several methods for tuning both continuous and discrete hyperparameters, they do not perform equally for all ML algorithms, displaying different consumption of computational resources and stability. The process becomes computationally expensive if too many function evaluations of hyper-parameter values must be carried out to obtain a suitable accuracy. Since the size of the dataset in the HPO phase influences the dynamic complexity of the classifier but not its accuracy [1], another possible limitation is that a HPO algorithm may require a large dataset to work efficiently. Finally, most HPO algorithms are iterative, which suggests that stopping the algorithm when the expected improvement of testing new configurations is low can be a good option [2]. Nevertheless, the ML user does not have information about the rate of convergence and the loss function values. Therefore, the user usually tends to keep the default parameters (more than 50 iterations) or set a high number of iterations to ensure good performance ([3]). This implies that the algorithm may perform more iterations than needed to obtain an adequate accuracy, with the consequent increase in computational cost.

**Citation:** Muñoz Castañeda, A.L.; DeCastro-García, N.; Escudero García, D. RHOASo: An Early Stop Hyper-Parameter Optimization Algorithm. *Mathematics* **2021**, *9*, 2334. https://doi.org/10.3390/math9182334

Academic Editors: Florin Leon, Mircea Hulea and Marius Gavrilescu

Received: 25 July 2021 Accepted: 16 September 2021 Published: 20 September 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

This article proposes a novel early stop HPO algorithm, RHOASo (*RIASC hyperoptimization automatic software*). The work aims to analyze RHOASo, compare its behavior with different state-of-the-art HPO algorithms, and measure its early-stop feature, which entails minimal human intervention.

The research questions that will be studied regarding this new algorithm are the following:

RQ1: Given a dataset, how good is the performance (accuracy, time complexity, sensibility, and specificity) of a ML algorithm when RHOASo is applied?

RQ2: How many iterations does the algorithm need until it stops? How much faster or slower is RHOASo, compared with the other HPO algorithms?

RQ3: Are there statistically meaningful differences between the performance of RHOASo and other HPO algorithms?

RQ4: Are the above results consistent? That is, do they hold for different HPO algorithms and different datasets with different characteristics (size, number of features, etc.)?

In order to test the behavior of RHOASo and answer the above questions, we have evaluated the efficiency of the algorithm combined with three well-known classifier algorithms: random forest (RF), gradient boosting (GB), and multi-layer perceptron (MLP). We choose these three supervised models because each follows a different learning paradigm: RF and GB are ensemble models (bagging and boosting, respectively), and MLP is a type of neural network. Therefore, it is possible to study whether a particular type of model works better with the application of RHOASo.

We have measured the efficiency of RHOASo both on its own and by carrying out statistical inference to evaluate how well it performs compared with seven other HPO algorithms from several families (decision-theoretic approaches, Bayesian optimization, evolutionary strategies, etc.). The variables considered in the article are the accuracy, the MCC (Matthews correlation coefficient), the time complexity of the whole optimization (from initialization to termination), the sensibility, and the specificity of the obtained models. These variables are collected by applying the HPO algorithms over 16 well-known public datasets. Furthermore, the algorithms' performances are studied on four different-sized partitions of each dataset in order to study the impact of partition size on the optimization performance. All of these comparisons are studied through two experiments: (a) Experiment 1, in which RHOASo carries out the number of iterations that it needs until it stops, while the other HPO algorithms use a default input (50 iterations); and (b) Experiment 2, in which the other HPO algorithms perform the same number of trials that RHOASo conducted. In this way, we avoid biased comparisons while also evaluating RHOASo under the default number of evaluations that a user could specify ([3]).

The results show that RHOASo works efficiently in low-dimensional hyper-parameter spaces, and it is competitive in terms of accuracy and computational cost. On average, RHOASo achieves a statistically significant positive difference in efficiency in over 70% of the cases in which the algorithms are applied, and it obtains good results when using only small parts of the datasets. Sensitivity and specificity also show positive results in general, although on some unbalanced datasets there is a bias toward the majority class. Lastly, the automatic early-stop feature of RHOASo lets it finish the tuning process before reaching the fixed number of iterations given as input to the other hyper-parameter optimization methods (*Me* = 34 vs. 50), a 30% reduction. When an equal number of iterations is fixed for all algorithms, RHOASo loses this advantage, its gain decreasing to 50% on average, but it remains competitive in two of the three evaluated models. It shows weakness when applied with MLP on some datasets.

The article is organized as follows. In Section 2, we state the hyper-parameter optimization problem when the size of the dataset is left as a variable, and we provide an overview of the state-of-the-art methods. In Section 3, we describe the proposed algorithm. First, we set and solve a conditional optimization problem for the logistic function. The solution is a simple iterative algorithm whose discrete analogue is used as the base to define RHOASo. In Section 4, we describe the experimental details of the study. Finally, in Section 5, we discuss the obtained results. We analyze the performance of RHOASo when it is run together with the three ML algorithms mentioned above and compare it with other HPO algorithms. This is done in two different ways. On one hand, we let the number of iterations of the HPO algorithms take their default values. On the other hand, we allow the HPO algorithms to run for the same number of iterations that RHOASo needs until it stops.

Additionally, we have included an appendix with the results concerning the first experiment in the cases where the ML algorithms are decision tree (DT) and K-nearest neighbor (KNN). It is shown that the performance of RHOASo compared with the other HPO algorithms is substantially better. Due to this superiority of RHOASo over the rest of the HPO algorithms when they are run with DT and KNN, we have carried out the complete analysis only with the RF, GB and MLP algorithms.

In order to facilitate the reading of the article, we include below a scheme describing the experimentation and validation phases that were carried out.

#### **2. Related Work**

A hyper-parameter of a ML algorithm is a hidden component that directly influences the algorithm's behavior. Tuning it allows the user to control the performance of the algorithm.

#### *2.1. Problem Statement*

**Definition 1.** *Let $\mathcal{X}$ be a tuple of random variables, let $\mathcal{Y}$ be the label space, and let $D^{train} \in \cup_{n \in \mathbb{N}}(\mathcal{X} \times \mathcal{Y})^n$ be an i.i.d. sample whose distribution function is $P$.*

*A machine learning algorithm* A *is a functional as follows:*

$$\mathcal{A}: \bigcup_{n \in \mathbb{N}} (\mathcal{X} \times \mathcal{Y})^n \longrightarrow \mathcal{H}, \qquad D^{train} \mapsto \mathcal{A}(D^{train}) := h_{\mathcal{A}, D^{train}} : \mathcal{X} \longrightarrow \mathcal{Y} \tag{1}$$

*The model $h_{\mathcal{A}, D^{train}}(x)$ predicts the label of an (unseen) instance $x$, minimizing the expected loss function $\mathcal{L}(D^{train}, h_{\mathcal{A}, D^{train}})$.*

This loss function measures the discrepancy between a hypothesis $h \in \mathcal{H}$ and an ideal predictor. The target space of functions of the algorithm, $\mathcal{H}$, depends on specific parameters, $\lambda = (\lambda_1, \ldots, \lambda_n) \in \Lambda$, that might take discrete or continuous values and have to be fixed before applying the algorithm. We use the notation $A_\lambda$ to refer to the algorithm with the hyper-parameter configuration $\lambda$.

In this scenario, another independent data set, $D := D^{train} \cup D^{test}$, serves to evaluate the loss function $\mathcal{L}(D^{test}, h_{A_\lambda, D^{train}})$ provided by the algorithm $A_\lambda(D^{train})$. Let the hyper-parameters $\lambda = (\lambda_1, \ldots, \lambda_n)$ remain free in $\mathcal{L}(D^{test}, h_{A_\lambda, D^{train}})$.

In the case of a classification ML problem, we can take the loss function L as the error rate, that is, one minus the cross-validation value. In this situation, one can define the following function:

$$\begin{aligned} \Phi_{A,D} : \Lambda &\longrightarrow [0,1] \\ \lambda &\mapsto \mathrm{mean}_{D^{test}} \mathcal{L}(D^{test}, h_{A_{\lambda}, D^{train}}) \end{aligned} \tag{2}$$

The hyper-parameter optimization (HPO) problem consists of trying to reach $\lambda^* := \arg\min_{\lambda}\left(\mathrm{mean}_{D^{test}} \mathcal{L}(D^{test}, h_{A_\lambda, D^{train}})\right)$.
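In the discrete case, this formulation admits a direct brute-force reading: evaluate the estimated loss for every configuration on a grid and keep the minimizer. The sketch below is a minimal illustration that uses a synthetic error surface in place of the cross-validated loss (in practice, each evaluation of $\Phi_{A,D}$ requires training and validating a model):

```python
import itertools

# Synthetic error surface standing in for mean_{D^test} L(D^test, h_{A_lambda, D^train});
# lambda = (max depth, number of trees) is a hypothetical two-dimensional configuration.
def phi(lam):
    d, t = lam
    return 1.0 / (1.0 + 0.5 * d + 0.3 * t)  # error decreases as both grow

# grid search: lambda* is the argmin of the estimated expected loss over the grid
grid = list(itertools.product(range(1, 6), range(1, 6)))
lam_star = min(grid, key=phi)
print(lam_star)
```

Exhaustive evaluation scales exponentially with the number of hyper-parameters, which is what motivates the smarter strategies reviewed next.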

#### *2.2. Overview of the State-of-the-Art Methods*

Since the hyper-parameter configuration has a significant effect on the performance of a ML model, the main goal in the HPO research is to find optimal values for the hyper-parameters that maximize the accuracy of the model while minimizing the costs and avoiding manual tuning. In the case where hyper-parameters are continuous, HPO algorithms usually work using gradient descent-based methods ([4–6]) in which the search direction in the hyper-parameter space is determined by the gradient of a model selection criterion at any step.

The discrete case has several approaches that perform differently depending on the ML algorithm and the dataset. Bayesian HPO is a type of surrogate-based optimization ([7]) that tunes the hyper-parameters by keeping the assumed prior distribution of the loss function updated, taking into account the new observations that are selected by the acquisition function. The construction of this surrogate model and the hyper-parameter selection criteria result in several types of sequential model-based optimization (SMBO). The main methods model the error distribution with a Gaussian process ([8]) or tree-based algorithms, such as sequential model-based algorithm configuration (SMAC) or the Tree Parzen Estimators (TPE) method ([9,10]). Another perspective is radial basis function optimization (RBFOpt), which proposes a deterministic surrogate model to approximate the error function of the hyper-parameters through dynamic coordinate search. These methods require fewer evaluations, improving on the costs associated with Gaussian process methods ([11]). Regarding the selection function used to choose the next promising hyper-parameter configuration to test in surrogate-based optimization, the typical approach is to use the expected improvement ([8]). There are other alternatives, such as the predictive entropy search ([12]). Other variants of SMBO can be found in [13,14], where different datasets and tasks are characterized by several measurements that allow the prediction of a ranking of several combinations of hyper-parameter values.

Another important HPO approach is the decision-theoretic method, where the algorithm obtains the hyper-parameter setting by searching the hyper-parameter space directly, following some particular strategy. Examples are grid search, which uses brute force, and the simple and effective random search (RS), which tests randomly sampled configurations from the hyper-parameter space ([15,16]).

Other optimization algorithms are applied to the problem of discrete hyper-parameter value selection. This is the case, for instance, of evolutionary algorithms, such as the covariance matrix adaptation evolution strategy (CMA-ES) method [17] and the Nelder–Mead (NM) simplex method ([18,19]), or the application of continuous techniques to the discrete case, such as particle swarm (PS) optimization ([20,21]).

Although there are several options, these methods provide different results and consumption of computational resources, and they do not perform equally well with all ML algorithms. Therefore, in choosing an HPO method, we need to consider its cost, the size of the data required to run the optimization process effectively, and the human interaction needed. These issues give rise to several open research challenges, which we summarize in Table 1.

**Table 1.** Open research challenges in HPO. Content extracted from [3].


RHOASo is an HPO algorithm that is designed in order for the potential user not to have to configure the end of the process. Currently, the termination of a general HPO algorithm can be carried out in several ways ([3]): (1) an amount of runtime fixed by the user based on intuition; (2) a lower bound of the generalization error specified by the user; and (3) considering the convergence of HPO if no progress is identified. All of these procedures can lead to over-optimistic bounds or excessive runtime that increases the computational cost. In this scenario, RHOASo is able to stop automatically, without losing accuracy and with minimal intervention of the user.

Additionally, many state-of-the-art HPO algorithms have, in turn, parameters that must be set up before running them. For instance, when a user wants to tune an (unbounded) integer-valued hyper-parameter of a given ML algorithm, the HPO algorithm requires the user to pre-configure a grid over which it is to be run. In many cases, the larger the grid, the higher the execution cost of the HPO algorithm. A natural way to proceed in these cases is to accelerate the hyper-parameter tuning process using early-stopping techniques ([2,22–24]). However, these algorithms still have other parameters that must be set up. Therefore, in some sense, HPO algorithms move the hyper-parameter tuning problem from ML algorithms to themselves, which increases the complexity and cost of the whole process. Table 2 below shows the parameters on which the HPO algorithms used in this work depend.

**Table 2.** Hyper-parameters of the HPO algorithms considered in this work.


Thus, the natural question is how to tune hyper-parameters of HPO algorithms without increasing the complexity. Since using HPO algorithms over themselves does not solve the problem, it is natural to ask for HPO algorithms depending on as few hyperparameters as possible and achieving good performance, compared with state-of-the-art HPO algorithms.

Our aim is to present a novel HPO algorithm with only one parameter to be tuned and to analyze its performance, compared with other state-of-the-art HPO algorithms.

#### **3. The Proposed Algorithm: RHOASo**

RHOASo is an approach to the HPO problem whose underlying idea comes from the reversible gradient-based HPO method proposed for the continuous case ([5]).

Open source code for RHOASo is hosted on GitHub (https://github.com/amunc/RHOASo, accessed on: 25 July 2021), and it is available under the GPL license (version 3). Users do not need to install any software separately, apart from the Python language. Additionally, RHOASo is included in a ML intelligent system, RADSSo (RIASC automated decision support software), and it has been used in several research works ([29,30]).

#### *3.1. The Setup*

Recall from Section 2.1 that the space of functions in which a learning algorithm takes values is assumed to depend on certain parameters $\lambda = (\lambda_1, \ldots, \lambda_n)$; this space of functions is denoted by $\mathcal{H}_\lambda$. We make the following assumptions:


Typical examples of hyper-parameters satisfying such assumptions are the maximum depth in any tree-based machine learning model, the number of trees if the output of the model is a weighted average of the outputs of all the trees, or the number of neurons in hidden layers of a multilayer perceptron (the weights of the inputs of a given neuron may be zero).

Let $\Phi_{A,D}$ be the functional given in Equation (2) for a given machine learning model defined by a dataset $D$ and a model $\mathcal{H}_\lambda$. If we plot the functional $\Phi_{A,D}$ considering, for example, the random forest model with hyper-parameters maximum depth ($x$-axis) and number of trees ($y$-axis), we can obtain figures like those shown in Figure 1 (the plots are obtained from two different datasets).

From the expression $\Phi_{A,D}(\lambda) = \mathrm{mean}_{D^{test}}\mathcal{L}(D^{test}, h_{A_\lambda, D^{train}})$, it follows that if we let both the size of the dataset and the number of iterations in the cross-validation go to infinity, the surface given in Figure 1 becomes smoother and takes the form shown in Figure 2, which is a concave surface with an asymptote in the plane $z = z_0 \leq 1$.

At this point, one can ask for an algorithm to find a value of $\lambda = (\lambda_1, \lambda_2)$ at which $\Phi_{A,D}$ attains a sufficiently high value while keeping $\lambda_1$ and $\lambda_2$ as small as possible.

#### *3.2. Motivation of the Algorithm*

In order to motivate the algorithm, consider the logistic function $f(x) = 1/(1 + e^{-x})$ restricted to $\mathbb{R}_{>0}$. This is a concave function with an asymptote at $y = 1$; thus, it has no maximum. Although the maximization problem for this function does not make sense, we can ask for the point $x_0 \in \mathbb{R}$ with the highest value $f(x_0)$ subject to the condition that making $x_0$ smaller causes a significant decrease in $f$. One way to formalize this question is by considering the maximization problem of the function $\mathrm{Stb}(x) = f(x)f'(x) = e^{-x} f(x)^3$. In a certain sense, maximizing $\mathrm{Stb}(x)$ consists of choosing a point $x_0$ with a sufficiently large image $f(x_0)$ but whose slope at that point is not too low. Note that $\mathrm{Stb}'(x) = e^{-x} f(x)^3(-1 + 3e^{-x} f(x))$, and therefore $\mathrm{Stb}'(x) = 0$ has only one solution, $x = \ln(2)$, which is a maximum.

Consider now the function $\mathrm{Stb}(x, n) = x^n f(x)f'(x)$, with $n$ a natural number. Then, taking $n > \ln(2)$, the equation $\mathrm{Stb}'(x, n) = 0$ has only one solution $x_0$, which is greater than $\ln(2)$ and becomes larger as $n$ increases. Thus, we can control how small the slope is at the solution $x_0$ by varying $n$.
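These claims are easy to verify numerically on a grid; the short check below confirms that $\mathrm{Stb}(x)$ peaks near $\ln(2)$ and that the maximizer of $\mathrm{Stb}(x, n)$ moves to the right as $n$ grows:

```python
import math

f = lambda x: 1.0 / (1.0 + math.exp(-x))   # logistic function
stb = lambda x: math.exp(-x) * f(x) ** 3    # Stb(x) = f(x) f'(x) = e^{-x} f(x)^3
stb_n = lambda x, n: x ** n * stb(x)        # Stb(x, n) = x^n f(x) f'(x)

xs = [i / 1000.0 for i in range(1, 10000)]  # grid on (0, 10)

# the unique maximum of Stb sits at ln(2)
x0 = max(xs, key=stb)
assert abs(x0 - math.log(2)) < 1e-2

# the maximizer of Stb(x, n) lies beyond ln(2) and moves right as n grows
x1 = max(xs, key=lambda x: stb_n(x, 1))
x3 = max(xs, key=lambda x: stb_n(x, 3))
assert math.log(2) < x1 < x3
```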

**Figure 1.** Surface of $\Phi_{A,D}$ for random forest on two different datasets. (**a**) $\Phi$ in RF. (**b**) $\Phi$ in RF.

**Figure 2.** Smoothing of Φ*A*,*D*.

In order to explain how RHOASo works, and to link directly with the form in which it is presented below, let us denote the function $f$ by $\Phi_{A,D}$ and restrict its domain of definition to the set of natural numbers. The variable is now denoted by $\lambda$. Then, instead of the derivative, we may consider the following function:

$$\frac{\Phi_{A,D}(\lambda + h) - \Phi_{A,D}(\lambda)}{h}$$

where *h* is a natural number. In order to simplify the notation, we set *h* = 1. Thus, we may consider the optimization problem defined by the following:

$$\max_{\lambda} \{ \mathrm{Stb}(\lambda, n) \},$$

where

$$\mathrm{Stb}(\lambda, n) := \lambda^n \, \Phi_{A,D}(\lambda)\left(\Phi_{A,D}(\lambda + 1) - \Phi_{A,D}(\lambda)\right).$$

Now, we can give a simple iterative algorithm to find a value of $\lambda$ close to one at which $\Phi_{A,D}$ attains a sufficiently high value while keeping the magnitude of that coordinate as low as possible. At iteration $i$: if $\mathrm{Stb}(\lambda_i + 1) > \mathrm{Stb}(\lambda_i)$, then $\lambda_{i+1} := \lambda_i + 1$; otherwise, stop. This is just the most basic algorithm to solve the optimization problem $\max_{\lambda}\{\mathrm{Stb}(\lambda, n)\}$. Observe that convergence is always ensured because of the properties of the function $\mathrm{Stb}(\lambda)$.
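A minimal implementation of this one-dimensional scheme, using the logistic function as a stand-in for $\Phi_{A,D}$, could look as follows:

```python
import math

f = lambda x: 1.0 / (1.0 + math.exp(-x))  # stands in for Phi_{A,D}

def stb(lam, n):
    # Stb(lambda, n) := lambda^n * Phi(lambda) * (Phi(lambda + 1) - Phi(lambda))
    return lam ** n * f(lam) * (f(lam + 1) - f(lam))

def hill_climb(n, start=1):
    # advance while the stabilizer keeps improving; stop otherwise
    lam = start
    while stb(lam + 1, n) > stb(lam, n):
        lam += 1
    return lam

print(hill_climb(2), hill_climb(3))
```

Larger values of $n$ delay the stop, trading a slightly larger $\lambda$ for a flatter slope at the stopping point.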

Figure 3 shows how the stabilizer function, Stb, behaves in two particular cases.

**Figure 3.** Stabilizer Stb(*λ*) in dimension one. (**a**) Stabilizer for logistic function. (**b**) Stabilizer for arctangent function.

#### *3.3. The Algorithm*

There are two important features we want the algorithm to have. On one hand, we want to avoid meta-parameters; that is, we want the algorithm not to depend (strongly) on extra parameters. In state-of-the-art HPO algorithms, the user has to set as input the exact number of iterations that the algorithm must perform. On the other hand, we want the algorithm to give a good result when it is run with only a small part of the dataset as input, rather than the whole dataset. The proposed HPO algorithm exploits the consequences derived from Assumptions 1 and 2 to reach this objective in a simple way.

Since the hyper-parameters we will work with are discrete, we may assume that the hyper-parameter space is $\Gamma = \mathbb{N}^n$. Suppose that we are at the point $\lambda \in \Gamma = \mathbb{N}^n$ of the hyper-parameter space. The decision about which point the algorithm must jump to next is based on two basic rules:


$$\mathrm{stb}(\lambda) = \max_i \{ \lambda_i \} \cdot \Phi_{A,D}(\lambda) \cdot \sum_{\lambda' \in \lambda + \mathrm{Shifts}} \left(\Phi_{A,D}(\lambda') - \Phi_{A,D}(\lambda)\right) \tag{3}$$

If we are at point $\lambda$, we will transit to the point $\lambda^* \in \lambda + \mathrm{Shifts}$ if the inequality $\mathrm{stb}(\lambda^*) > \mathrm{stb}(\lambda)$ holds true and $\mathrm{stb}(\lambda^*) = \max\{\mathrm{stb}(\lambda') \mid \lambda' \in \lambda + \mathrm{Shifts}\}$.

3. Once the algorithm stops at some point $\lambda^* \in \Omega$, there is a final step at which the point $\lambda' \in \lambda^* + \mathrm{Shifts}$ with maximum $\Phi_{A,D}$ is found and given as the final output.
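Putting the rules together, a simplified sketch of the search loop (with a synthetic concave surface in place of $\Phi_{A,D}$, and omitting the caching details of the actual implementation) might read:

```python
import itertools, math

def phi(lam):
    # synthetic concave surface with an asymptote at z = 1, standing in for Phi_{A,D}
    return 1.0 - math.exp(-0.3 * sum(lam))

def neighbors(lam):
    # lambda + Shifts, with Shifts = {0,1}^n minus the zero shift
    return [tuple(l + s for l, s in zip(lam, sh))
            for sh in itertools.product((0, 1), repeat=len(lam)) if any(sh)]

def stb(lam):
    # Eq. (3): max_i{lambda_i} * Phi(lambda) * sum of increments over shifted points
    return max(lam) * phi(lam) * sum(phi(lp) - phi(lam) for lp in neighbors(lam))

def rhoaso_sketch(dim=2):
    lam = (1,) * dim
    while True:
        best = max(neighbors(lam), key=stb)  # best-scoring shifted point
        if stb(best) > stb(lam):
            lam = best                       # transition rule of step 2
        else:
            break                            # no shift improves the stabilizer: stop
    return max(neighbors(lam), key=phi)      # final correction step (rule 3)

print(rhoaso_sketch())
```

Because the increments of the concave surface decay faster than the coordinates grow, the stabilizer eventually stops improving, so the loop terminates without any user-supplied iteration budget.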

The pseudo-code of the algorithm is included in Algorithms 1–4.

**Algorithm 1** Computing stabilizer.


**Algorithm 2** Computing best neighbor.

1: **Input:** data, train\_data, test\_data, features, target
2: **Output:** best next hyper-parameters
3: **procedure** GETBESTNEIGH(Input, *λ*<sub>current</sub>)
4: Stb = ∅
5: Shifts = {0, 1}<sup>*N*</sup>
6: **for** *η* **in** Shifts **do**
7: *λ* ← *λ*<sub>current</sub> + *η*
8: Stb ← Stb ∪ {GETSTB(Input, *λ*)}
9: **end for**
10: maximum = max<sub>*λ*</sub> Stb
11: **return** maximum, max\_Stb
12: **end procedure**

#### **4. Materials and Methods**

Some experiments were carried out in order to answer the research questions formulated in the introduction:

1. RQ1: Given a dataset, how good is the performance (accuracy, time complexity, sensibility, and specificity) of a ML algorithm when RHOASo is applied?

**Algorithm 3** Last phase.


#### **Algorithm 4** RHOASo algorithm.


The analyses aim to measure the quality of the proposed algorithm and decide whether there are statistically meaningful differences between the performance of the selected HPO methods and RHOASo.

#### *4.1. ML and HPO Algorithms*

We have evaluated the efficiency of three well-known ML algorithms:


In Table 3, we give a summary of the ML algorithms we have used together with the hyper-parameters we have tuned. The search space for all hyper-parameters is in the interval [1, 50]. All hyper-parameters not being tuned are set to their default values as per scikit-learn implementation ([34]). We have used 10-fold cross-validation to assess the performance of all ML models combined with the HPO algorithms.
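As a dependency-free stand-in for this evaluation scheme (the study itself uses scikit-learn's 10-fold cross-validation), the fold construction can be sketched as follows; `k_fold_indices` is a hypothetical helper, not part of the paper's code:

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    # distribute the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    idx, start = list(range(n_samples)), 0
    for size in fold_sizes:
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(25, k=10))
print(len(folds))
```

Each sample appears in exactly one test fold, so averaging the per-fold accuracies gives the cross-validated estimate used to score a hyper-parameter configuration.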

On the other hand, Table 4 gives a summary of the HPO algorithms selected for this study.


**Table 3.** ML algorithms together with hyper-parameters we have considered.

**Table 4.** HPO algorithms used. Colors explanation: Bayesian methods in blue, decision-theoretic techniques in pink, evolutionary algorithms in brown, and other optimization algorithms in green.


#### *4.2. Datasets*

The datasets selected for the experiments are described in Table 5. The choice was motivated by different reasons: availability in public servers to verify the results, the number of instances and classes, and the type of features. There are 16 public benchmark datasets with a different number of variables (8–300), rows (4601–284,807), and classes (2–10). Since the size of the dataset in the HPO phase influences the performance of the classifier ([1]), the performances of the algorithms are analyzed with four different-sized partitions of each dataset ($P = \{P_1 = 8.3\%, P_2 = 16.6\%, P_3 = 50\%, P_4 = D_i\}$). These proportions are chosen so that the numbers of instances span different orders of magnitude. In addition, each dataset is divided into train data (80%) and validation data (20%), $D_i^{train} \cup D_i^{valid} = D_i$, and the partitioning scheme above is applied to obtain train and validation subsets with the corresponding proportions $P_j(D_i)$.
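For illustration, the subset sizes implied by this scheme can be computed as follows; `partition_sizes` is a hypothetical helper and the rounding convention is an assumption:

```python
def partition_sizes(n_rows):
    # P1..P4 proportions taken from the text; P4 is the full dataset D_i
    props = {"P1": 0.083, "P2": 0.166, "P3": 0.50, "P4": 1.0}
    sizes = {}
    for name, p in props.items():
        n = round(n_rows * p)
        n_train = round(n * 0.8)          # 80% train / 20% validation
        sizes[name] = (n_train, n - n_train)
    return sizes

print(partition_sizes(4601))  # row count of the smallest dataset in the study
```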

The transformation of features to obtain treatable datasets that can be directly input to each model is implemented manually in Python.


**Table 5.** Descriptions of datasets.

#### *4.3. Construction of the Response Variables*

In order to build response variables measuring the performance of the HPO algorithms, we apply them to each ML model $\mathcal{H}_k$ over each partition $P_j(D_i)$, obtaining a hyper-parameter configuration $\lambda^k_{i,j}$ for $\mathcal{H}_k$. Then, the learning algorithm with the obtained hyper-parameter configuration is run over $D_i^{train}$ to construct a classifier that is validated over $D_i^{valid}$. At this step, we collect the obtained accuracy. This scenario is repeated a number of times (trials), giving rise to two different experiments.


The time complexity of the whole process is stored as well. The time complexity, measured in seconds, is the sum of the time needed by the HPO algorithm to find the optimal hyper-parameter configuration and the time that the ML algorithm uses for training. Then, we create two response variables. We denote by $Acc^k_{i,j}$ the (number of trials × 1) array whose $m$-th component is the accuracy of the predictive model tested on $D_i^{valid}$ after being trained over $D_i^{train}$ with the hyper-parameters $(\lambda^k_{i,j})_m$ at the $m$-th trial. $TC^k_{i,j}$ is the analogous notation for the time complexity.

We have also collected the sensitivity and the specificity of each iteration for RHOASo to measure its performance more accurately.

Additionally, we have collected the MCC (Matthews correlation coefficient), which is defined as follows:

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. This coefficient works as a substitute metric of the accuracy for unbalanced datasets [55,56]. Since some datasets contain a certain degree of imbalance, we present our results with both indicators, accuracy and MCC.
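As a quick sanity check, Equation (4) can be implemented directly from the four confusion-matrix counts; the zero-denominator convention used here is a common choice, not one stated in the paper:

```python
import math

def mcc(tp, tn, fp, fn):
    # Eq. (4); returning 0 when a marginal is empty is a common convention
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# a majority-class predictor on a 90/10 dataset: accuracy is 0.9, yet MCC is 0,
# which is why MCC is the preferred metric on unbalanced data
print(mcc(tp=90, tn=0, fp=10, fn=0), mcc(tp=45, tn=45, fp=5, fn=5))
```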

#### *4.4. Statistical Analyses*

Our main objective is to analyze the quality of the RHOASo algorithm and compare it with other HPO algorithms. In order to analyze whether there are meaningful statistical differences among the obtained results by the HPO algorithms and RHOASo, we perform the following statistical analysis:


The following inference tests are carried out for both Experiments 1 and 2.


We draw conclusions on statistical difference for each ML algorithm and for each response variable, obtaining, thus, 2688 *p*-values. From these results, we have computed how many times we obtain a positive difference ($validity(RHOASo) > validity(H_k)$), a negative difference ($validity(RHOASo) < validity(H_k)$) or equality ($validity(RHOASo) = validity(H_k)$); see Table 6.

**Table 6.** Conditions of validity. The symbols =, >, < denote statistically meaningful equality and difference, and *Me* denotes the median.


5. Since the blue cells may be understood as being both a positive difference or negative difference, depending on the improvement that we obtain, we have reclassified the results, creating a new table, correcting these cases by the rule described in Equation (5).

$$\mathcal{V}^{*} = \begin{cases} \mathcal{V}^{+} & \text{if } \Delta(Acc_{i,j}(k)) > \Delta(TC_{i,j}(k)) \\ \mathcal{V}^{-} & \text{if } \Delta(Acc_{i,j}(k)) < \Delta(TC_{i,j}(k)) \end{cases} \tag{5}$$

where

$$\Delta(Acc_{i,j}(k)) = \frac{|\mathrm{Me}(Acc_{i,j}^{RHOASo}) - \mathrm{Me}(Acc_{i,j}^{k})|}{\min(\mathrm{Me}(Acc_{i,j}^{RHOASo}), \mathrm{Me}(Acc_{i,j}^{k}))} \tag{6}$$

and

$$\Delta(TC_{i,j}(k)) = \frac{|\mathrm{Me}(TC_{i,j}^{RHOASo}) - \mathrm{Me}(TC_{i,j}^{k})|}{\min(\mathrm{Me}(TC_{i,j}^{RHOASo}), \mathrm{Me}(TC_{i,j}^{k}))} \tag{7}$$

6. We have completed the analysis by computing the rate of each type of validity (red, yellow and green cells) as follows:

$$\mathcal{R}_{\mathcal{V}^\bullet} = \frac{\text{number of cases in the class } \mathcal{V}^\bullet}{N} \tag{8}$$

where $N$ denotes the total number of possible comparisons. Since we have performed the computations for each ML algorithm, we have $N = 448$.

7. Finally, we compute $\mathcal{R}_{\mathcal{V}^{\bullet}}$ per partition and per dataset to analyze the consistency of the results.

Note that, for studying the cases of unbalanced datasets, we have carried out the analyses described above replacing accuracy with the MCC.
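The reclassification rule of Equation (5) and the validity rate of Equation (8) can be sketched in Python as follows. This is an illustrative sketch, not the authors' analysis script; the helper names, the toy numbers, and the use of Python's `statistics.median` are our assumptions.

```python
# Hedged sketch of steps 5-6 of Section 4.4: the relative gaps of
# Equations (6)-(7), the reclassification rule of Equation (5), and the
# validity rate of Equation (8). All names and numbers are illustrative.
from statistics import median

def relative_gap(me_rhoaso, me_k):
    """Relative difference of two medians, as in Equations (6) and (7)."""
    return abs(me_rhoaso - me_k) / min(me_rhoaso, me_k)

def reclassify_blue(acc_rhoaso, acc_k, tc_rhoaso, tc_k):
    """Resolve a 'blue' case into V+ or V- by comparing relative gains."""
    d_acc = relative_gap(median(acc_rhoaso), median(acc_k))
    d_tc = relative_gap(median(tc_rhoaso), median(tc_k))
    return "V+" if d_acc > d_tc else "V-"   # rule of Equation (5)

def validity_rate(labels, target):
    """Rate of one validity class over all N comparisons, Equation (8)."""
    return labels.count(target) / len(labels)

# Toy example: a clear accuracy gain against identical time costs
label = reclassify_blue([0.90, 0.92], [0.80, 0.82], [1.0, 1.1], [1.0, 1.1])
print(label)                                            # V+
print(validity_rate(["V+", "V+", "V-", label], "V+"))   # 0.75
```

The same three helpers, applied per ML algorithm, dataset, partition and HPO competitor, would reproduce the structure of Tables 8 and 9.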

#### *4.5. Technical Details*

The analyses were carried out on high-performance computing resources: an HP ProLiant SL270s Gen8 SE with two Intel Xeon E5-2670 v2 processors @ 2.50 GHz (10 cores each), 128 GB of RAM and a 1 TB hard disk. The analysis scripts are implemented in Python.

#### **5. Results and Discussion**

This section is organized according to the research questions we have formulated in the introduction.

#### *5.1. Research Question 1: Given a Dataset, How Good Are the ML Models When RHOASo Is Applied?*

Since the behavior of RHOASo is similar in both Experiments 1 and 2, in this section, we detail the performance of RHOASo in Experiment 1, and we include a summary of the results for Experiment 2.

#### 5.1.1. Performance in Experiment 1

We can see in Figure 4 the median of the accuracy when RHOASo is applied. The median for RF is 0.92, for GB it is 0.9058, and for MLP it is 0.8182; in all of these cases, it is greater than 0.80. The accuracy achieved by RHOASo presents great stability across the partitions of the dataset, except for dataset *D*<sup>6</sup> with model RF and dataset *D*<sup>16</sup> with model GB. In the case of *D*<sup>6</sup>, the variation appears when we change from *P*<sup>3</sup> to *P*<sup>4</sup>. This dataset is the largest one in the study and has the highest rate of class imbalance, so the most appropriate metric is the MCC. As discussed later, the MCC remains stable for *D*<sup>6</sup> over all the partitions. The case of *D*<sup>16</sup> is more involved. The most frequent hyper-parameter configurations that RHOASo computes for each partition are *max. depth*: 5, *number trees*: 9 for *P*<sup>1</sup> (14 times out of 50); *max. depth*: 9, *number trees*: 3 for *P*<sup>2</sup> (50 times out of 50); *max. depth*: 3, *number trees*: 9 for *P*<sup>3</sup> (31 times out of 50); and *max. depth*: 9, *number trees*: 3 for *P*<sup>4</sup> (50 times out of 50). Since *D*<sup>16</sup> has 12 features, few instances, and unbalanced classes, the most probable explanation is that the model is overfitting the training data. This behavior also appears in the rest of the metrics for the combination of *D*<sup>16</sup> and GB. Stability is less evident when the dataset changes, although the general trend is maintained across the ML models. For instance, the models obtained with *D*<sup>10</sup> provide the worst accuracy for the three ML algorithms. Although the accuracy may not seem very high, the rest of the HPO algorithms do not achieve better results, as is outlined below. For this reason, the fit achieved by RHOASo is considered sufficient.

We can see in Figure 5 the median of the MCC (∈ [−1, 1]) when RHOASo is applied. The median for RF is 0.51, for GB it is 0.40, and for MLP it is 0.32. Since the MCC takes class imbalance into account, the results are worse than those for accuracy, particularly in datasets 3 and 12, which are highly unbalanced: the minority classes contain less than 1% of the total instances. Apart from the unbalanced datasets, the trends are similar to those observed when evaluating the accuracy.
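For reference, in the binary case the MCC behind these figures is computed directly from the four confusion-matrix counts. The helper below is our own minimal sketch, not the experiment code; for the multi-class datasets of the study, library implementations (e.g., scikit-learn's `matthews_corrcoef`) generalize this formula.

```python
# Binary MCC from confusion-matrix counts (our illustrative helper).
import math

def mcc(tp, tn, fp, fn):
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0   # common convention when undefined

print(mcc(tp=50, tn=50, fp=0, fn=0))   # 1.0: perfect classifier
print(mcc(tp=45, tn=5, fp=45, fn=5))   # 0.0: no better than chance
```

Unlike accuracy, an imbalanced dataset cannot inflate this score by majority-class guessing, which is why the medians above drop relative to Figure 4.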

As far as the time complexity is concerned, the median results are included in Figure 6. The median value for RF is 1.3028 s, and for GB it is 4.2567 s. The MLP stands out for its high computational cost, with 259.8138 s. Regarding stability, as expected, the larger the partition size used, the larger the time complexity, independently of the ML algorithm used together with RHOASo. Nevertheless, this behavior differs among the ML algorithms. It is worth noting that, in the case of RF, the increase in time complexity as the partition size grows is much smoother for most of the datasets, compared to GB and MLP.
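The time complexity of one tuning run can be captured with a simple wall-clock wrapper. The authors' exact instrumentation is not shown in this section, so the following is only a sketch of how such measurements are typically taken:

```python
# Hedged sketch: timing one tuning run with a wall-clock wrapper.
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Stand-in workload; a real run would call the HPO routine instead.
result, elapsed = timed(sum, range(1_000_000))
print(result > 0, elapsed >= 0.0)
```

Repeating this per run and taking medians over the 50 repetitions yields values directly comparable to those in Figure 6.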

Sensitivity and specificity are plotted in Figures 7 and 8. The results are stable across partitions and datasets, which achieve similar results even when using different models, except for *D*<sup>16</sup> and GB. The median sensitivity is 0.9 for GB, 0.91 for MLP, and 0.88 for RF. In contrast, specificity has a median of 0.7 for GB, 0.69 for MLP, and 0.733 for RF. The lower specificity could be caused by the class imbalance in specific datasets (see Table 5), which biases the models toward the majority class. However, *D*<sup>11</sup> has both low specificity and low sensitivity. Overall, the trends are similar to those found in the evaluation of accuracy.
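Both metrics reduce to ratios of confusion-matrix counts, which makes the majority-class bias easy to see. A minimal sketch (our own helpers, with toy counts chosen to mirror the median values above):

```python
# Sensitivity (true positive rate) and specificity (true negative rate)
# from binary confusion-matrix counts. Illustrative helpers only.
def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

# A model biased toward the positive (majority) class: it recovers most
# positives but misclassifies many negatives.
print(sensitivity(tp=90, fn=10))   # 0.9
print(specificity(tn=35, fp=15))   # 0.7
```

With imbalanced data, the denominator of specificity is small, so a handful of false positives is enough to depress it, matching the pattern in Figure 8.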

**Figure 4.** Behavior of RHOASo: accuracy. Each line represents a dataset. (**a**) Accuracy for RF. Axis X: the partition of the dataset. Axis Y: the obtained accuracy. (**b**) Accuracy for GB. Axis X: the partition of the dataset. Axis Y: the obtained accuracy. (**c**) Accuracy for MLP. Axis X: the partition of the dataset. Axis Y: the obtained accuracy.

**Figure 5.** Behavior of RHOASo: MCC. Each line represents a dataset. (**a**) MCC for RF. Axis X: the partition of the dataset. Axis Y: the obtained MCC. (**b**) MCC for GB. Axis X: the partition of the dataset. Axis Y: the obtained MCC. (**c**) MCC for MLP. Axis X: the partition of the dataset. Axis Y: the obtained MCC.


**Figure 6.** Behavior of RHOASo: time complexity. Each line represents a dataset. (**a**) Time complexity for RF. Axis X: the partition of the dataset. Axis Y: the obtained time complexity in seconds. (**b**) Time complexity for GB. Axis X: the partition of the dataset. Axis Y: the obtained time complexity in seconds. (**c**) Time complexity for MLP. Axis X: the partition of the dataset. Axis Y: the obtained time complexity in seconds.

**Figure 7.** Behavior RHOASo: sensitivity. The datasets are represented with a bar chart for each partition. (**a**) Sensitivity for RF. Axis X: the partition of the dataset. Axis Y: the obtained sensitivity. (**b**) Sensitivity for GB. Axis X: the partition of the dataset. Axis Y: the obtained sensitivity. (**c**) Sensitivity for MLP. Axis X: the partition of the dataset. Axis Y: the obtained sensitivity.

**Figure 8.** Behavior RHOASo: specificity. The datasets are represented with a bar chart for each partition. (**a**) Specificity for RF. Axis X: the partition of the dataset. Axis Y: the obtained specificity. (**b**) Specificity for GB. Axis X: the partition of the dataset. Axis Y: the obtained specificity. (**c**) Specificity for MLP. Axis X: the partition of the dataset. Axis Y: the obtained specificity.

#### 5.1.2. Performance in Experiment 2

Since the behavior of RHOASo in both experiments is the same, we include in Table 7 a summary of the medians of all metrics (without taking partitions into account) that RHOASo obtained in Experiment 2.


**Table 7.** Medians of response variables for RHOASo in Experiment 2.

*5.2. Research Question 2: How Many Iterations Does the Algorithm Need until It Stops? How Much Faster or Slower Is RHOASo Compared with the Other HPO Algorithms?*

The number of iterations that RHOASo has needed until stopping for Experiment 1 is included in Figure 9. In the case of Experiment 2, this information can be observed in Table 7.

**Figure 9.** Number of iterations per partition. Each line represents a dataset (medians). (**a**) Number of iterations RF. In axis X, the partition of the dataset. In axis Y, the number of the iterations that RHOASo has carried out. (**b**) Number of iterations GB. In axis X, the partition of the dataset. In axis Y, the number of the iterations that RHOASo has carried out. (**c**) Number of iterations MLP. In axis X, the partition of the dataset. In axis Y, the number of the iterations that RHOASo has carried out.

In Experiment 1, we can see the median of the number of iterations needed by RHOASo for each ML algorithm, each dataset and each partition. As a general result, the median number of iterations per partition (computed over all datasets) is 35 for RF, 33.5 for GB, and 34 for MLP. Taking into account that the number of iterations given as input (by default) to the rest of the HPO algorithms is 50, this implies that, on average, RHOASo needs approximately 70% of the iterations required by the other algorithms. Additionally, it stops the process by itself. As a consequence, RHOASo requires less time to obtain sufficiently good accuracy and is therefore more competitive than other algorithms. This is most significant in the case of MLP, where each iteration is highly resource consuming. There is no clear trend relating partition size and number of iterations, especially in the case of GB, where the variability in the number of iterations is greater. One could expect a greater amount of data to contribute to faster convergence, but this is not the case. Therefore, it is likely that the functional Φ*A*,*<sup>D</sup>* is not as concave as would be desirable to perform effective early stopping.
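The practical benefit of self-termination can be illustrated with a generic early-stop search loop. This is **not** the actual RHOASo criterion, which exploits the concavity of the functional Φ*A*,*D*; it is only a hedged sketch of the general mechanism of stopping before an iteration budget is exhausted:

```python
# Generic early-stop search (illustrative; NOT the RHOASo stopping rule):
# stop once the score has not improved for `patience` consecutive trials.
def early_stop_search(configs, evaluate, patience=5):
    best_cfg, best_score, stall, used = None, float("-inf"), 0, 0
    for cfg in configs:
        used += 1
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score, stall = cfg, score, 0
        else:
            stall += 1
            if stall >= patience:
                break                 # stop without exhausting the budget
    return best_cfg, best_score, used

# Toy example: the score peaks at cfg=10, so the loop stops at trial 15
# instead of spending the full budget of 50 trials.
best, score, used = early_stop_search(range(1, 51),
                                      lambda c: -(c - 10) ** 2, patience=5)
print(best, used)   # 10 15
```

Any stopping rule of this kind trades a fixed iteration budget for a data-dependent one, which is precisely why the time savings are largest for expensive models such as MLP.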

We have not compared whether RHOASo is faster or slower than the other HPO algorithms in Experiment 2, due to the very nature of the design of that experiment.

#### *5.3. Research Question 3: Are There Statistically Meaningful Differences between the Performance of RHOASo and the Other HPO Algorithms?*

We recall that we have carried out two experiments:


#### 5.3.1. Experiment 1

In Figures 10 and 11, the performances (accuracies and time complexities) that are achieved by the HPO algorithms over each dataset are shown.

However, if we want to assess whether RHOASo obtains any gain over the other HPO algorithms, we need to carry out more detailed analyses: the study of the validity of RHOASo.

The rates of validity obtained by RHOASo, compared to the rest of the HPO algorithms, are included in Table 8. Note that these computations are carried out with the accuracies and time complexities following the analyses explained in Section 4.4.


**Table 8.** Rate of validity of RHOASo vs. HPO across all *Di* (% with accuracy and time complexity).

**Figure 10.** Accuracy. (**a**) Accuracy in RF. In axis X, the dataset. In axis Y, the accuracy obtained. Each line represents a HPO algorithm. (**b**) Accuracy in GB. In axis X, the dataset. In axis Y, the accuracy obtained. Each line represents a HPO algorithm. (**c**) Accuracy in MLP. In axis X, the dataset. In axis Y, the accuracy obtained. Each line represents a HPO algorithm.

**Figure 11.** Time complexity. (**a**) Time complexity in RF. In axis X, the dataset. In axis Y, the total time obtained measured in seconds. Each line represents a HPO algorithm. (**b**) Time complexity in GB. In axis X, the dataset. In axis Y, the total time obtained measured in seconds. Each line represents a HPO algorithm. (**c**) Time complexity in MLP. In axis X, the dataset. In axis Y, the total time obtained measured in seconds. Each line represents a HPO algorithm.

On average, the class corresponding to positive statistically significant differences (green class) is higher than 50%. This can be considered a good result, since RHOASo achieves better results than the rest of the algorithms in more than half of the cases analyzed. However, there is still a high rate in the blue class. Once we have transformed the blue class (see Section 4.4), we can analyze whether RHOASo is more effective than the rest of the HPO algorithms. The results are included in Table 9, which shows that, on average, the class corresponding to positive statistically significant differences is higher than 70%.


**Table 9.** Rate of validity of RHOASo vs. HPO across all *Di* (% with accuracy and time complexity).

After confirming that RHOASo is more efficient than the rest of the HPO algorithms in 70% of the cases, the question that arises is whether there is a pattern in the 30% of cases in which it does not succeed. For example, RHOASo might fail for datasets with a certain dimensionality, or for a specific ML algorithm. Another possibility is that RHOASo always loses against the same HPO algorithm. For this reason, we study the consistency of the previous results in depth.

Since we have dealt with unbalanced datasets, such as *D*1, *D*3, *D*4, *D*<sup>6</sup> or *D*12, we have repeated the analyses, replacing accuracy with the MCC so as to avoid over-optimistic scores. In Figure 12, the MCCs achieved by the HPO algorithms over each dataset are shown.

The rates of validity obtained by RHOASo, compared to the rest of the HPO algorithms, are included in Table 10. Note that these computations are carried out with the MCCs and time complexities following the analyses explained in Section 4.4.


**Table 10.** Rate of validity of RHOASo vs. HPO across all *Di* (% with MCC and time complexity).

We can observe that RHOASo maintains its rate of gain, winning in over 70% of the cases. In Section 5.4, we study the consistency of these results as well as the situations in which RHOASo does not obtain a gain.

**Figure 12.** MCC of all HPO algorithms. (**a**) MCC in RF. In axis X, the dataset. In axis Y, the MCC obtained. Each line represents a HPO algorithm. (**b**) MCC in GB. In axis X, the dataset. In axis Y, the MCC obtained. Each line represents a HPO algorithm. (**c**) MCC in MLP. In axis X, the dataset. In axis Y, the MCC obtained. Each line represents a HPO algorithm.

#### 5.3.2. Experiment 2

In Figures 13 and 14, the performances (accuracies and time complexities) that are achieved by the HPO algorithms over each dataset are shown.

**Figure 13.** Experiment 2: Accuracy. (**a**) Accuracy in RF. In axis X, the dataset. In axis Y, the accuracy obtained. Each line represents a HPO algorithm. (**b**) Accuracy in GB. In axis X, the dataset. In axis Y, the accuracy obtained. Each line represents a HPO algorithm. (**c**) Accuracy in MLP. In axis X, the dataset. In axis Y, the accuracy obtained. Each line represents a HPO algorithm.

**Figure 14.** Experiment 2: Time complexity. (**a**) Time complexity in RF. In axis X, the dataset. In axis Y, the total time obtained measured in seconds. Each line represents a HPO algorithm. (**b**) Time complexity in GB. In axis X, the dataset. In axis Y, the total time obtained measured in seconds. Each line represents a HPO algorithm. (**c**) Time complexity in MLP. In axis X, the dataset. In axis Y, the total time obtained measured in seconds. Each line represents a HPO algorithm.

The rates of validity obtained by RHOASo compared to the rest of the HPO algorithms are included in Table 11. Note that these computations are carried out with the accuracies and time complexities following the analyses explained in Section 4.4 for Experiment 2.

**Table 11.** Experiment 2: Rate of validity of RHOASo vs. HPO across all *Di* (% with accuracy and time complexity).


In this experiment, RHOASo loses some of its advantage over the other HPO algorithms, mainly because the improvement in execution time is smaller: the number of iterations of the other algorithms is fixed to equal that of RHOASo. The average gain is 53.86%, which is lower than the 71.96% obtained when evaluating the accuracy. Nevertheless, RHOASo performs better for GB, has a slight advantage for RF and is outperformed in MLP. This fact may be related to the search strategy of RHOASo. MLP performs worse than the other models across all datasets and configurations; configurations with larger numbers of neurons tend to perform better, but the search strategy of RHOASo favors configurations with hyper-parameter values of low magnitude. Therefore, the performance when tuning MLP can be expected to be less satisfactory.

For the unbalanced datasets, we have included the analyses replacing accuracy with the MCC. In Figure 15, the MCCs achieved by the HPO algorithms over each dataset are shown.

The rates of validity obtained by RHOASo, compared to the rest of the HPO algorithms, are included in Table 12. Note that these computations are carried out with the MCCs and time complexities following the analyses explained in Section 4.4.

**Table 12.** Experiment 2: Rate of validity of RHOASo vs. HPO across all *Di* (% with MCC and time complexity).


When using MCC as the reference metric, the results follow trends similar to those of accuracy. RHOASo gains a slight advantage for RF and incurs a slight loss for MLP, but there are no significant differences.

**Figure 15.** Experiment 2: MCC of all HPO algorithms. (**a**) MCC in RF. In axis X, the dataset. In axis Y, the MCC obtained. Each line represents a HPO algorithm. (**b**) MCC in GB. In axis X, the dataset. In axis Y, the MCC obtained. Each line represents a HPO algorithm. (**c**) MCC in MLP. In axis X, the dataset. In axis Y, the MCC obtained. Each line represents a HPO algorithm.

#### *5.4. Research Question 4: Are the above Results Consistent?*

In this section, we analyze whether RHOASo achieves a significant performance improvement consistently across datasets and partitions.

#### 5.4.1. Experiment 1

The rates of validity for each partition computed with the accuracy and time complexity are included in Figure 16. The consistency of the green class is clear when we discriminate by partition. That is to say, if we only consider the partitions, RHOASo always outperforms the rest of the HPO algorithms. This may be due to the early stopping of RHOASo, which consumes less time while achieving good accuracy.


**Figure 16.** Experiment 1: consistency per partition (accuracy and time complexity). (**a**) Rate of validity per partition in RF. In axis X, the partition of the dataset. In axis Y, the rate of RHOASo for each class of validity. (**b**) Rate of validity per partition in GB. In axis X, the partition of the dataset. In axis Y, the rate of RHOASo for each class of validity. (**c**) Rate of validity per partition in MLP. In axis X, the partition of the dataset. In axis Y, the rate of RHOASo for each class of validity.

As we can see in Figure 17, the above conclusion does not hold as generally when we discriminate by dataset, except in the case of GB. Nonetheless, the results for RF and MLP are not far from being consistent (RF being closer than MLP), and a deeper analysis should be performed regarding this fact.

**Figure 17.** Experiment 1: consistency per dataset. Rate of validity with the accuracy and time complexity. (**a**) Rate of validity per dataset in RF. In axis X, the dataset. In axis Y, the rate of RHOASo for each class of validity. (**b**) Rate of validity per dataset in GB. In axis X, the dataset. In axis Y, the rate of RHOASo for each class of validity. (**c**) Rate of validity per dataset in MLP. In axis X, the dataset. In axis Y, the rate of RHOASo for each class of validity.

In the case of MLP, there is a clear trend: the datasets in which RHOASo loses are *D*1, *D*2, *D*3, and *D*4, that is, datasets with a low number of instances (<10,000) but a high number of features (>50). This failure on high-dimensional, small-sized datasets may be due to the early-stop characteristic of the algorithm. Datasets with high dimensionality and a low number of instances tend to increase the variance of the results, which contributes to creating an irregular surface in Φ*A*,*D*, possibly trapping RHOASo in a local maximum. A possible solution could be to substitute the function that maps elements from the hyper-parameter space to the performance of the trained ML model on a validation dataset with an approximated probabilistic model of that function. This suggests that combining the underlying ideas of Bayesian hyper-parameter optimization algorithms with those presented in this paper could yield an early-stop algorithm that works efficiently in high dimensions.

In the case of RF, the gain of RHOASo is not enough for datasets *D*5, *D*9, and *D*14. Unlike in the case of MLP, these datasets have no clear commonalities, so we can only hypothesize. We believe that the problem is the same as in the case of MLP: high variance in the results diminishes the effectiveness of RHOASo. However, in this case, this variance could be ascribed to the inherent randomness in the training of RF or in the cross-validation sampling.

As we can see in Figure 18, the behavior of the rates of validity for each partition remains consistent when these are computed with the MCC and time complexity.

**Figure 18.** Experiment 1: consistency per partition (MCC and time complexity). (**a**) Rate of validity per partition in RF. In axis X, the partition of the dataset. In axis Y, the rate of RHOASo for each class of validity. (**b**) Rate of validity per partition in GB. In axis X, the partition of the dataset. In axis Y, the rate of RHOASo for each class of validity. (**c**) Rate of validity per partition in MLP. In axis X, the partition of the dataset. In axis Y, the rate of RHOASo for each class of validity.

As we can see in Figure 19, the same conclusions as in the case of accuracy are obtained when we discriminate them by datasets. Nonetheless, the results concerning RF and MLP are not so far from being consistent (RF being closer than MLP), and a deeper analysis should be performed regarding this fact. We note that, in some cases, the green class is increased; this has not occurred for the red class in the unbalanced datasets.

**Figure 19.** Consistency per dataset: rate of validity with the MCC and time complexity. (**a**) Rate of validity per dataset in RF. In axis X, the dataset. In axis Y, the rate of RHOASo for each class of validity. (**b**) Rate of validity per dataset in GB. In axis X, the dataset. In axis Y, the rate of RHOASo for each class of validity. (**c**) Rate of validity per dataset in MLP. In axis X, the dataset. In axis Y, the rate of RHOASo for each class of validity.

#### 5.4.2. Experiment 2

The rates of validity for each partition computed with the accuracy and time complexity are included in Figure 20. In this experiment, RHOASo is more inconsistent, since it loses the execution-time advantage it had over the other HPO algorithms. RHOASo has a consistent advantage for all partition sizes in GB, but for RF it only outperforms the other algorithms for partitions 3 and 4, and it is inferior for MLP in all partitions, although the results improve as the size of the partition increases. This trend is also present for RF and GB. The reasons are probably the same as those discussed in Section 5.4.1: the variance in the results trapping RHOASo in local maxima.


**Figure 20.** Experiment 2: consistency per partition (accuracy and time complexity). (**a**) Rate of validity per partition in RF. In axis X, the partition of the dataset. In axis Y, the rate of RHOASo for each class of validity. (**b**) Rate of validity per partition in GB. In axis X, the partition of the dataset. In axis Y, the rate of RHOASo for each class of validity. (**c**) Rate of validity per partition in MLP. In axis X, the partition of the dataset. In axis Y, the rate of RHOASo for each class of validity.

As we can see in Figure 21, the validity per dataset is negatively affected. In RF, there is a general decrease, with only datasets 10, 12 and 15 showing a clear advantage for RHOASo. For GB, RHOASo maintains better results than the other algorithms for all datasets except 5 and 6. However, the consistency is lower than in Experiment 1. Finally, in MLP, the validity of RHOASo is the lowest among all models, being consistently outperformed in five datasets: 1, 2, 3, 4 and 13.

The datasets with the worst results for each model have no clear commonalities, so it is possible that the neutralization of the execution time advantage of RHOASo is a significant factor in the deterioration of the results.

**Figure 21.** Experiment 2: consistency per dataset. Rate of validity with the accuracy and time complexity. (**a**) Rate of validity per dataset in RF. In axis X, the dataset. In axis Y, the rate of RHOASo for each class of validity. (**b**) Rate of validity per dataset in GB. In axis X, the dataset. In axis Y, the rate of RHOASo for each class of validity. (**c**) Rate of validity per dataset in MLP. In axis X, the dataset. In axis Y, the rate of RHOASo for each class of validity.

In Figure 22, we show the results of the above analysis with the MCC and the time complexity. The results are very similar to those achieved with accuracy. The main difference is that the validity for RF improves for partitions 1 and 2, increasing the consistency of the results.


**Figure 22.** Experiment 2: consistency per partition (MCC and time complexity). (**a**) Rate of validity per partition in RF. In axis X, the partition of the dataset. In axis Y, the rate of RHOASo for each class of validity. (**b**) Rate of validity per partition in GB. In axis X, the partition of the dataset. In axis Y, the rate of RHOASo for each class of validity. (**c**) Rate of validity per partition in MLP. In axis X, the partition of the dataset. In axis Y, the rate of RHOASo for each class of validity.

In Figure 23, we show the rate of validity per dataset, using the MCC as the reference metric. As happened in Experiment 1, the trends are mostly the same as those present when evaluating the accuracy.

**Figure 23.** Experiment 2: consistency per dataset. Rate of validity with the MCC and time complexity. (**a**) Rate of validity per dataset in RF. In axis X, the dataset. In axis Y, the rate of RHOASo for each class of validity. (**b**) Rate of validity per dataset in GB. In axis X, the dataset. In axis Y, the rate of RHOASo for each class of validity. (**c**) Rate of validity per dataset in MLP. In axis X, the dataset. In axis Y, the rate of RHOASo for each class of validity.

#### **6. Conclusions**

ML provides several powerful tools for data processing that find applications in different fields. Most of the existing models have several hyper-parameters that need to be tuned and that have a noticeable impact on their performance. Therefore, HPO algorithms are essential to achieve the highest possible accuracy with minimal human intervention.

In this work, a new HPO algorithm is described as a generalization of the discrete analog of a basic iterative algorithm for obtaining the solutions to certain conditional optimization problems for the logistic function. It is shown that its performance is only weakly disturbed by changing the size of the data subset with which it is run. The algorithm shows positive statistically meaningful differences in efficiency with respect to the other HPO algorithms considered in this study. The algorithm can finish the tuning process by itself and only requires an upper bound on the number of iterations to perform. Furthermore, it is shown that, on average, it needs around 70% of the iterations needed by the other hyper-parameter optimization algorithms to achieve competitive results.

The results show that the algorithm achieves high accuracy, with similar results for all classifiers on each dataset. In addition, RHOASo can effectively use a small partition size to accelerate the HPO process without sacrificing the final accuracy of the model. Lastly, the automatic early stop ends the tuning process before reaching the fixed number of iterations (*Me* = 34), further increasing its efficiency.

Future work can be aimed at several lines:


**Author Contributions:** Conceptualization: Á.L.M.C.; methodology: Á.L.M.C. and N.D.-G.; software: Á.L.M.C. and D.E.G.; validation: all authors have contributed equally; formal analysis: N.D.-G. and D.E.G.; investigation: Á.L.M.C. and N.D.-G.; data curation: D.E.G.; writing—original draft preparation, review and editing: all authors have contributed equally; project administration and funding acquisition: N.D.-G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was partially supported by the Spanish National Cybersecurity Institute (INCIBE) under contract Art.83, key: X54.

**Data Availability Statement:** The datasets supporting this work are from previously reported studies and datasets, which are cited. The processed data are available from the corresponding author upon request.

**Acknowledgments:** The authors would like to thank the Spanish National Cybersecurity Institute (INCIBE), who partially supported this work. Additionally, in this research, the resources of the Center of Supercomputation of Castilla y León (SCAYLE) were used.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A. Additional Results**

In this appendix, we show the results concerning Experiment 1 in the case that the ML algorithms are DT and KNN. Due to the great difference in the performance of RHOASo with respect to the rest of the HPO algorithms when they are run with these ML algorithms, we have excluded the analyses from the body of the article. However, we believe that the obtained results may be of interest.

We give a brief description of DT and KNN below.


In Table A1, we include the hyper-parameters that we have tuned. We have chosen them because of their influence on the corresponding ML algorithms (see [3], Appendixes A1 and A4). Concerning the hyper-parameters of DT, we have chosen the minimum number of samples required to split an internal node (min\_samples\_split) and the minimum number of samples required at a leaf node (min\_samples\_leaf). Regarding the hyper-parameters of KNN, we have considered the number of neighbors to use for queries and *p*, the Minkowski distance parameter.

The search space for all hyper-parameters is the interval [1, 50]. All hyper-parameters not being tuned are set to their default values in the *scikit-learn* implementation ([34]). We have used 10-fold cross-validation to assess the performance of all ML models combined with the HPO algorithms.
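The evaluation of a single candidate configuration under this protocol can be sketched with scikit-learn. The synthetic dataset and the particular hyper-parameter values below are our stand-ins; only the structure (defaults elsewhere, 10-fold cross-validation, accuracy scoring) follows the protocol described above:

```python
# Hedged sketch of the appendix protocol: score one DT hyper-parameter
# configuration from the [1, 50] grid with 10-fold cross-validation,
# leaving every other hyper-parameter at its scikit-learn default.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # stand-in dataset
clf = DecisionTreeClassifier(min_samples_split=2, min_samples_leaf=1,
                             random_state=0)
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(round(scores.mean(), 3))   # median/mean CV accuracy for this config
```

An HPO algorithm then repeats this evaluation over the grid, so the cost of tuning is roughly the per-configuration cost times the number of iterations, which is where early stopping pays off.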

**Table A1.** ML algorithms together with hyper-parameters we have considered.


#### *Appendix A.1. Performance of RHOASo*

In Figures A1–A6, we plot the median of the accuracy, MCC, total time, sensitivity, specificity, and number of iterations when RHOASo is applied together with DT and KNN. The median accuracy over all datasets is 0.85 for DT and 0.78 for KNN. The accuracy achieved by RHOASo presents great stability across the partitions of the dataset, except for partition *P*<sup>4</sup> and some datasets. The median MCC is 0.58 for DT and 0.41 for KNN; again, RHOASo behaves similarly to the accuracy case. This may be because *P*<sup>4</sup> does not contribute to improving the fit of the models; see [1]. Regarding the time complexity, the median value is 1.11 s for DT and 22.53 s for KNN. The total time registered for *D*<sup>4</sup> in KNN is higher in *P*<sup>3</sup> than in *P*<sup>4</sup>. This may occur because the stabilizer of RHOASo achieves an optimum value earlier on the total dataset than on *P*<sup>3</sup>. The median sensitivity is 0.88 for DT and 0.87 for KNN. In contrast, specificity has a median of 0.87 for DT and 0.63 for KNN. These plots inherit the same trends as the graphics of the accuracy and MCC. The median number of iterations is 14 for DT and 19 for KNN. It is worth pointing out that the number of iterations for *P*<sup>3</sup> is larger than for *P*<sup>4</sup> when we work with KNN on *D*<sup>4</sup>. This is probably caused by the same reason as in the plot of the total time.

**Figure A1.** Behavior of RHOASo: accuracy. Each line represents a dataset. (**a**) Accuracy for DT. (**b**) Accuracy for KNN.

**Figure A2.** Behavior of RHOASo: MCC. Each line represents a dataset. (**a**) MCC for DT. (**b**) MCC for KNN.


**Figure A3.** Behavior of RHOASo: time complexity. Each line represents a dataset. (**a**) Time complexity for DT. (**b**) Time complexity for KNN.

**Figure A4.** Behavior of RHOASo: sensitivity. The datasets are represented with a bar chart for each partition. (**a**) Sensitivity for DT. (**b**) Sensitivity for KNN.

**Figure A5.** Behavior of RHOASo: specificity. The datasets are represented with a bar chart for each partition. (**a**) Specificity for DT. (**b**) Specificity for KNN.


**Figure A6.** Number of iterations per partition. Each line represents a dataset (medians). (**a**) Number of iterations DT. (**b**) Number of iterations KNN.

#### *Appendix A.2. Are There Statistically Meaningful Differences between the Performance of RHOASo and the Other HPO Algorithms under DT and KNN?*

In Figures A7 and A8, the performances (accuracies and time complexities) that are achieved by the HPO algorithms over each dataset are shown.

**Figure A7.** Accuracy of the HPO algorithms. (**a**) Accuracy in DT. (**b**) Accuracy in KNN.

**Figure A8.** Time complexity. (**a**) Time complexity in DT. (**b**) Time complexity in KNN.

The results of the validity of RHOASo against the rest of the HPO algorithms are included in Table A2, which shows that, on average, the class corresponding to positive statistically significant differences is higher than 90%. Note that these computations are carried out with the accuracies and time complexities through the analyses explained in Section 4.4.

**Table A2.** Rate of validity of RHOASo vs. HPO across all *Di* (% with accuracy and time complexity).


Since we have dealt with unbalanced datasets, we have repeated the analyses, replacing the accuracy metric with the MCC so as to avoid over-optimistic scores. In Figure A9, the MCCs achieved by the HPO algorithms over each dataset are shown.

**Figure A9.** MCC of all HPO algorithms. (**a**) MCC in DT. (**b**) MCC in KNN.

The rates of validity obtained by RHOASo, compared to the rest of the HPO algorithms using the MCCs and time complexities, are included in Table A3.

**Table A3.** Rate of validity of RHOASo vs. HPO across all *Di* (% with MCC and time complexity).


We can observe that the rate of gain of RHOASo exceeds 89% of the cases.

#### *Appendix A.3. Are the above Results Consistent?*

In this section, we analyze whether RHOASo achieves a significant performance improvement consistently across datasets and partitions.

The rates of validity for each partition, computed with the accuracy and time complexity, are included in Figure A10. As can be seen, RHOASo presents a much higher performance than any other HPO algorithm considered in this work, and this behavior is independent of the partition.

**Figure A10.** Experiment 1: consistency per partition (accuracy and time complexity). (**a**) Rate of validity per partition in DT. (**b**) Rate of validity per partition in KNN.

As we can see in Figure A11, the above conclusion also holds when we discriminate the rates of validity by dataset.

As we can see in Figure A12, the behavior of the rates of validity for each partition remains consistent when these are computed with the MCC and time complexity. As can be seen in Figure A13, discriminating by dataset leads to the same conclusion as in the case of accuracy.

**Figure A11.** Experiment 1: consistency per dataset. Rate of validity with the accuracy and time complexity. (**a**) Rate of validity per dataset in DT. (**b**) Rate of validity per dataset in KNN.

**Figure A12.** *Cont*.

**Figure A12.** Experiment 1: consistency per partition (MCC and time complexity). (**a**) Rate of validity per partition in DT. (**b**) Rate of validity per partition in KNN.

**Figure A13.** Consistency per dataset: rate of validity with the MCC and time complexity. (**a**) Rate of validity per dataset in DT. (**b**) Rate of validity per dataset in KNN.

#### **References**


### *Review* **Review of Metaheuristics Inspired from the Animal Kingdom**

**Elena Niculina Dragoi 1,2,\* and Vlad Dafinescu 2,3**


**Abstract:** The search for powerful optimizers has led to the development of a multitude of metaheuristic algorithms inspired from all areas. This work focuses on the animal kingdom as a source of inspiration and performs an extensive, yet not exhaustive, review of the animal inspired metaheuristics proposed in the 2006–2021 period. The review is organized considering the biological classification of living things, with a breakdown of the simulated behavior mechanisms. The centralized data indicated that 61.6% of the animal-based algorithms are inspired from vertebrates and 38.4% from invertebrates. In addition, an analysis of the mechanisms used to ensure diversity was performed. The results obtained showed that the most frequently used mechanisms belong to the niching category.

**Keywords:** metaheuristics; optimization; animal-inspired; exploration; exploitation

**Citation:** Dragoi, E.N.; Dafinescu, V. Review of Metaheuristics Inspired from the Animal Kingdom. *Mathematics* **2021**, *9*, 2335. https:// doi.org/10.3390/math9182335

Academic Editors: Alfonso Mateos Caballero, Cornelio Yáñez Márquez and Mariano Luque

Received: 6 July 2021 Accepted: 17 September 2021 Published: 21 September 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

**1. Introduction**

A metaheuristic is a high level, problem-independent framework that provides a series of steps and guidelines used to develop heuristic optimizers [1]. Nowadays, the tendency is to use the term for both the general framework and for the algorithms built based on its rules [1]. In recent years, the literature has shown an increase in the number of proposals of new optimization metaheuristics and of their improvements through step alterations, local search procedures or hybridizations [2]. For a few well-known metaheuristics, the evolution of the number of papers (journal and conference) in the IEEE library is provided in [3]. The work of Hussain et al. [2] presents a detailed distribution of types of research (basic, improvement, applications) across all metaheuristics and, in [4], a timeline of the history of a set of representative techniques is provided.

This increase has been fueled by the need to efficiently find good solutions for difficult problems, especially for those where classical techniques fail to provide acceptable results within a reasonable amount of time and resources consumed.

All of these new optimizers (as well as the existing ones) follow the principles of the No Free Lunch Theorem (NFL), which states that if "an algorithm gains in performance on one class of problems it necessarily pays for on the remaining problems" [5]. In a simplistic view, this can be interpreted as meaning that one specific algorithm cannot outperform its counterparts on all problems but only on specific types or classes of problems, and it was shown that it is theoretically impossible to have a best general-purpose optimization strategy [6] (more details about NFL and its detailed analysis can be found in [7]). Consequently, researchers will probably never be satisfied with the existing metaheuristics [6], and this gives room for the development of new algorithms, improvements and strategies.

The oldest type of metaheuristic optimizers (from the 1960s and 1970s) is represented by genetic algorithms, based on evolutionary processes; however, in their quest for better metaheuristic optimizers, researchers have turned to new sources of inspiration. Nowadays, the world of metaheuristics is large and covers ideas varying from the behavior of the very small, e.g., viruses and bacteria, to the mechanisms of galaxies. This multitude of algorithms can be somewhat overwhelming and, therefore, the objective of this paper is to identify the main directions of research, in terms of sources of inspiration, and to shed some light on the mechanisms used to generate powerful optimizers.

#### **2. Classification and Categorization**

When trying to identify the main classes of metaheuristics, various criteria can be applied. Examples include: search path, memory use, neighborhood exploration, number of solutions transferred from one iteration to the next and parallelization ability [8,9]. An extensive discussion related to the issue of classification and categorization for metaheuristics and the different schemes used can be found in [10].

In terms of categorizations, different aspects can be considered. For example, the type of candidate solutions is one of the most used criteria, and it splits the metaheuristics into: (i) individual-based, also known as single-solution, trajectory methods [1] or individualist algorithms [11] and (ii) population-based or collective algorithms [11]. In the individual-based group, a single solution is evolved. The main advantages of these methods consist in simplicity, lower computational costs and a lower number of function evaluations [11]. However, in their basic form, they can become trapped in local optima and, since there is no information sharing, as there is just one solution, issues such as isolation of optima, deceptiveness and bias of the search space need to be dealt with [12]. Examples of algorithms that belong to this class are: Simulated Annealing (SA) [13]; Tabu Search (TS) [14]; Variable Neighborhood Search (VNS) [15]; Iterated Local Search (ILS) [16], proposed before 2006; Vortex Search (VS) [9], proposed after 2006. In the case of population-based algorithms, multiple solutions are generated and improved. Distinctively from the individual-based algorithms, the population-based approaches allow some information exchange between the candidate solutions and can thus handle aspects that the individual-based approaches struggle with [12]. However, the cost of the improved performance is higher complexity and a larger number of function evaluations. The majority of metaheuristics are population-based and can themselves be classified into approaches that [17]: (i) increase the population diversity through the variation of control parameters; (ii) maintain population diversity through the replacement of individuals in the current population; (iii) include memory to store promising solutions; (iv) divide the population into subpopulations; (v) combine multiple methods, i.e., hybrid approaches.
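The individual-based idea can be made concrete with a minimal simulated annealing sketch: a single solution is perturbed each step, and worse moves are occasionally accepted to escape local optima. The objective, step size and parameter values below are purely illustrative.

```python
import math
import random

def simulated_annealing(f, x0, t0=1.0, cooling=0.95, iters=200, seed=0):
    """Individual-based search: one solution x is perturbed per iteration;
    worse moves are accepted with a temperature-dependent probability."""
    rng = random.Random(seed)
    x, fx, t = x0, f(x0), t0
    best_x, best_f = x, fx
    for _ in range(iters):
        cand = x + rng.gauss(0.0, 1.0)          # neighborhood move
        fc = f(cand)
        # Always accept improvements; accept worse moves with prob. exp(-Δ/t).
        if fc < fx or rng.random() < math.exp((fx - fc) / max(t, 1e-12)):
            x, fx = cand, fc
        if fx < best_f:
            best_x, best_f = x, fx
        t *= cooling                            # cool down: less exploration
    return best_x, best_f

# Toy 1-D objective with its minimum at x = 3.
best_x, best_f = simulated_annealing(lambda x: (x - 3.0) ** 2, x0=-5.0)
```

Note the contrast with the population-based approaches discussed above: here there is no information exchange, only a single trajectory through the search space.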

When the type of search is considered, metaheuristics can be local or global. The local search approaches tend to be more exploitative, while the global algorithms are more explorative in nature [2]. In recent years, however, the trend has been to create hybrids that combine the two types of search. The best-known examples of local search algorithms are TS, ILS and the Greedy Randomized Adaptive Search Procedure (GRASP). Differential Evolution (DE), Particle Swarm Optimization (PSO) and Genetic Algorithms (GA) are examples of global search algorithms. Although global search approaches can be hybridized to also include local procedures as a means to improve a previously proposed version, the literature presents algorithms that include this global–local search combination from the first version, e.g., the Bat Algorithm (BA) [18], the Shuffle Frog Leap Algorithm (SFL) [19] or Water Wave Optimization (WWO) [20].

When considering the source of inspiration, the majority of authors identify the metaheuristics as evolutionary and swarm intelligence techniques [11]. An extended categorization can be considered, such as the one in [21], where two more groups are included: stochastic and physical. In recent years, various sources of inspiration have been used for metaheuristics. Therefore, this categorization must be extended to include all of the new methods. As a result, this work performs an extensive literature review of the proposed approaches covering the years 2006–2021. The review is organized around the biological classification of living things (kingdom-phylum-class), and its aim is to determine the main directions of research followed in the last 15 years and to identify new potential directions. The work [22] has a similar aim but focuses on all types of sources of inspiration for metaheuristics. Taking into account the variety of aspects that can be analyzed and the number of algorithms, this work only considers the metaheuristics with a biological base.

Regarding classification of metaheuristics, the work of Stegherr et al. [10] presents a seven-layer classification system. It focuses on structure (with criteria that include discontinuances, population, local search and memory), behavior (with criteria that include the strategy to create new solutions, groups and sub-populations), search (with criteria dealing with the intensification and diversification capabilities), algorithm (with criteria including the basic components incorporated), specific features (dealing with capabilities, i.e., use of adaptive parameters), evaluation (concerning the efficiency on various types of problems) and metaheuristics (which contains the specific algorithm that corresponds to the characteristics from the previous levels). If the first six levels are viewed from a framework perspective, the metaheuristic level deals with algorithms.

#### **3. Source of Inspiration**

In order to perform the current review, the main databases searched were: ScienceDirect (https://www.sciencedirect.com/, accessed on 6 August 2021), Web of Science (https://apps.webofknowledge.com/, accessed on 6 August 2021), Google Scholar (https://scholar.google.ro/, accessed on 6 August 2021), Springer Link (https://link.springer.com/, accessed on 6 August 2021) and IEEE Xplore Digital Library (https://ieeexplore.ieee.org/Xplore/home.jsp, accessed on 6 August 2021). The terms used in the search process were "metaheuristics", "nature-inspired optimizers" and "bio-inspired algorithms". The strategy of using both nature-inspired and bio-inspired terms is related to the fact that, in many works, there is no clear distinction between the two and they are used to describe a variety of metaheuristics. Based on the identified sources, a drill-down approach (study of the references used) and a drill-up approach (study of the papers citing a specific work) were applied in order to determine additional appropriate manuscripts. For the covered period, 283 algorithms were identified. Their distribution, based on the inspiration source, is presented in Figure 1.

By analyzing the identified categories, two main groups can be distinguished: biological and non-biological sources. The biological sources include animals, plants and humans, while the non-biological sources are represented by the chemical and physical laws of nature. Therefore, broadly speaking, the optimization metaheuristics can be grouped into: (i) biologically-inspired and (ii) nature-inspired.

**Figure 1.** Distribution of newly proposed algorithms in the period 2006–2021, based on their inspiration source.

As it can be observed from Figure 1, the largest group of newly proposed metaheuristics in the considered period have animals as a source of inspiration, and this group is the main focus of the current work. Although humans, from a biological point of view, belong to the animal group, Chordata (vertebrates) phylum, in this work they were not included because they deserve a separate discussion, considering the unique ways of thinking, behaving and interacting with the environment.

In recent years, various reviews have tried to shed light on the novel approaches that are constantly developed. Examples include: (i) a comprehensive list of algorithms and the steps of a few selected approaches [23]; (ii) a detailed discussion about the main research aspects specific to the field of nature-inspired metaheuristic optimizers [2]. In most works, researchers present the names of the best-known metaheuristics and a few details about the general ideas. This work aims to provide a series of details (such as source code availability, improvements, applications and mechanisms for controlling the exploration–exploitation balance) in a systematic manner, for each algorithm considered.

#### *3.1. Vertebrates*

Most algorithms inspired by animals simulate two main general behaviors: (i) food search (foraging) and (ii) mating. For foraging, there are a number of theoretical models developed to predict the behavior of living things: the optimal foraging theory, the ideal free distribution, game theory and predator-prey models [24]. The theory of optimal foraging was developed to explain the dietary patterns and the resource use, and it states that the individuals using their energy more efficiently for finding food are favored by natural selection [25]. Foraging for food can be an individual activity (solitary foraging where each individual searches for its food) or can be a social activity (social foraging where foraging is a group behavior) [26]. The topics of social foraging include: (i) the mechanisms used by the members to find food; (ii) the manner in which the food locations are communicated to other members; (iii) the division of food between group members. The majority of foraging inspired optimization algorithms focus on the first two topics [26]. A taxonomy of foraging inspired algorithms is proposed in [27], where three main categories are identified: vertebrates (with backbone), invertebrates (without backbone) and non-neuronal (organisms that do not possess a central nervous system or brain).

Concerning the mating behavior, different theoretical models that simulate the mating mechanisms of specific species exist. For example, in birds, different strategies are used to display the quality of genes to the potential mates by showing the main physical features: color, shape of specific body parts, etc. In terms of partner combinations, five strategies are encountered: monogamy, polygyny, polyandry, parthenogenesis and promiscuity [28]. These mechanisms are included, in different forms, in the metaheuristic optimizers with the objective of improving diversity and thus increasing performance.

Of the total number of algorithms inspired from animals, those based on vertebrates represent 61.6%, with the most represented sub-groups being birds (Section 3.1.1) and mammals (Section 3.1.2).

#### 3.1.1. Birds

#### Mating Behavior

One of the best-known algorithms inspired from bird behavior is Cuckoo Search (CS) [29]. It simulates the brood parasitic behavior of some species of cuckoo and, in order to search for new solutions, it uses a Levy flight random walk (mutation based on the best solution found so far) and a biased/selective random walk (crossover between a current solution and its mutation) [30]. The Levy flight is a random process from the non-Gaussian class, whose steps are based on the Levy distribution [31]. The steps of the CS algorithm express three idealized rules: (i) each cuckoo lays an egg and places it into a randomly chosen nest; (ii) the nests with high-quality eggs will be further used in the next generations; (iii) there is a fixed number of nests and there is a probability that the host will discover the foreign egg [32]. The main disadvantage of this algorithm is the fixed value of the scaling factor (that controls the step size) [30] and, in order to improve its performance, various strategies have been applied. A list of different modifications of CS and its applications can be found in [32,33]. Inspired by the same cuckoo breeding behavior, the Cuckoo Optimization Algorithm (COA) was proposed in [34]. When comparing COA and CS, it can be observed that COA is more complex, in the sense that it contains an additional behavioral aspect represented by the immigration process. In addition, it uses the k-means clustering algorithm to identify the group that a cuckoo belongs to.
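The Levy flight random walk at the core of CS can be sketched as follows, using Mantegna's algorithm (the standard way of sampling Levy-stable step lengths); the function names, the step-size constant `alpha` and the mutation form are illustrative, not taken verbatim from [29].

```python
import math
import random

def levy_step(rng, beta=1.5):
    """One heavy-tailed step length via Mantegna's algorithm, the usual way
    of drawing Levy-flight steps in Cuckoo Search (beta is the Levy exponent)."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)

def levy_mutation(nest, best, alpha=0.01, rng=None):
    """CS-style global move: perturb a nest around the best solution so far;
    the occasional very long Levy jumps provide exploration."""
    rng = rng or random.Random(0)
    return [x + alpha * levy_step(rng) * (x - b) for x, b in zip(nest, best)]

new_nest = levy_mutation([1.0, 2.0], best=[0.0, 0.0])
```

Most steps are small (local refinement), but the heavy tail of the Levy distribution occasionally produces long jumps, which is what lets CS escape local optima without a separate restart mechanism.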

Bird Mating Optimizer (BMO) [28] uses the birds' mating process as a framework. Throughout generations, the birds (the population of solutions) apply a probabilistic method to improve the quality of their offspring. The population is divided into males and females. The males can be monogamous, polygamous or promiscuous. On the other hand, the females can be parthenogenetic or polyandrous. In BMO, five species are simulated, and each one has its specific updating pattern.

Developed to adjust the parameters of adaptive neuro-fuzzy inference systems, the Satin Bowerbird Optimizer (SBO) [35] simulates the mating behavior of bowerbirds (a close relative species of the birds-of-paradise). In each iteration, a target individual is determined through roulette wheel selection. The other individuals try to follow it, i.e., they change their position accordingly, and try to improve their strategies by mutation.

#### Food Search

The flight of eagles (like the flight behavior of many animals and insects, in fact) has the typical characteristics of the Levy flight [36]. Based on this idea, the Eagle Strategy (ES) was proposed in [36]. It simulates an idealized two-stage strategy (find and chase) [37]. The find step represents the exploration phase, realized by the Levy walk, and the chase step is an exploitation phase where an intensive local search is performed by the Firefly Algorithm (FA). ES represents a strategy and not an algorithm, and its authors indicate that different algorithms can be used at different stages of the iterations [38]. For example, in [38–40], Differential Evolution (DE) [41] performs the local search procedure that was initially handled by FA.

Chicken Swarm Optimization (CSO) is inspired by swarm behavior [42]. It mimics the flock and foraging behavior of chickens and is based on four simplified rules: (i) the swarm is comprised of several groups and each group has a dominant rooster, some hens and chicks; (ii) the selection of individuals representing each type of bird is based on the fitness value; (iii) the dominance and hen–chick relationship remains unchanged for several iterations; (iv) the search for food is performed around the dominant entity. CSO is a multi-swarm algorithm and its performance analysis showed that it can be efficiently applied to solve benchmarks and real-world problems.

The Crow Search Algorithm (CSA) [43] simulates the behavior of crows when it comes to food, i.e., storing excess food and thievery. Its main principles are: (i) crows live in flocks; (ii) each crow memorizes its hiding places; (iii) crows follow each other to steal food; (iv) crows try to protect their stashes. The algorithm includes a memory of good solutions and, since a crow's movement is performed regardless of whether the newly generated position is worse than the current one, CSA, distinctively from many metaheuristics, is not a greedy algorithm.
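The non-greedy character of CSA can be seen in a minimal 1-D sketch of one move: the position update is kept unconditionally, while only the crow's memory is updated greedily. Parameter names (`fl` for flight length, `AP` for awareness probability), the search bounds and the toy objective are our illustrative choices.

```python
import random

def csa_step(x, m, target_memory, fitness, fl=2.0, AP=0.1, rng=None):
    """One 1-D CSA move. The crow at x follows another crow's remembered
    cache (target_memory); with probability AP the followed crow notices
    and the follower is diverted to a random position. The move is kept
    even if it is worse (non-greedy), but the crow's own memory m keeps
    only the best position it has visited."""
    rng = rng or random.Random(0)
    if rng.random() >= AP:                                   # follow the target
        x_new = x + rng.random() * fl * (target_memory - x)
    else:                                                    # fooled: random move
        x_new = rng.uniform(-10.0, 10.0)
    m_new = x_new if fitness(x_new) < fitness(m) else m      # greedy memory only
    return x_new, m_new

# Toy usage on f(z) = z^2.
x_new, m_new = csa_step(x=4.0, m=4.0, target_memory=0.0, fitness=lambda z: z * z)
```

Because the position itself is never rejected, the search keeps moving through the space, while the memories play the role of the archive of good solutions mentioned above.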

Simulating the auditory-based hunting mechanism employed by owls, the Owl Search Algorithm (OSA) [44] assumes that the fitness value of each individual is correlated to the intensity of the received sound and that the search space has one global optimum.

The hummingbird's optimization algorithm (HOA) [45] focuses on the foraging processes of hummingbirds. It includes a self-searching phase, based on the individual accumulated experience (using a Levy flight mechanism), and a guide-searching phase that includes information from dominant individuals.

Simulating the social roosting and foraging behavior of ravens, in [26], the Raven Roosting Optimization (RRO) is proposed. The algorithm includes four main components: (i) the perception capability of each individual to find food; (ii) a memory related to the position of previous foraging locations; (iii) transmitting and receiving information about food locations; (iv) probabilistic movement when searching for new resources.

Based on the cooperative hunting behavior of Harris hawks, the Harris Hawks Optimization (HHO) [46] algorithm simulates a series of aspects such as prey exploration, the surprise pounce, and attack strategies. The algorithm complexity is O(population\_size × (iterations + iterations × dimensionality + 1)) and, in exploring or exploiting the search space, a series of strategies such as a diversification mechanism, a progressive selection scheme and adaptive, time-varying parameters are used.

Aquila is a very successful bird of prey located in the Northern hemisphere that represents the source of inspiration for the Aquila Optimizer (AO) [47]. The model that the AO is based on simulates four hunting methods: (i) high soar with a vertical stoop (corresponding to an expanded exploration step); (ii) contour flight with glide attack (narrowed exploration step); (iii) low flight and slow descent (expanded exploitation step); (iv) walking and prey grabbing (narrowed exploitation). The AO computational complexity is O(solution\_number × (iterations × dimensionality + 1)).

The hunting mechanisms of golden eagles (a spiral trajectory for searching food and a straight path when attacking, a tendency to cruise at the beginning of the search and to attack at the end, and the capability to easily switch between cruising and attacking) are simulated in the Golden Eagle Optimizer (GEO) [48]. The attack phase corresponds to exploitation and cruising to exploration. To extend applicability, two variants were proposed: the GEO version (single-objective) and the MOGEO version (multi-objective). The GEO computational complexity is O(population\_size × dimensionality × iterations) and that of MOGEO is O(population\_size × dimensionality × iterations × objectives × archive).

#### Movement

Based on the characteristics of geese's flight and the general PSO model, a geese-inspired hybrid PSO (Geese-PSO) was proposed in [49]. Although it does not use a completely novel metaphor and the algorithm is a hybridization, it is considered in this work because, to the authors' knowledge, the principle of following the particle ahead and the application of a unidirectional flow of information had not been used prior to the proposal of the Geese-PSO approach. In the same direction of research, Migrating Bird Optimization (MBO) [50] is inspired by the "V" flight formation of migrating birds. MBO is a neighborhood search approach where each solution is improved based on its neighbors.

The swarming behavior of passenger pigeons (the common name given to Blue Pigeons, Merne Rouck Pigeons, wandering long tailed Doves and Wood pigeons) represents the inspiration of the Pigeon Optimization Algorithm (POA) [51]. Pigeons have a specific behavior that can be simplified into a series of rules: (i) in order to enhance the search and reduce the probability of being prey for other animals, flight is performed in flocks; (ii) different flocks have their own solution for movement, which influences the shape of the group; (iii) there is communication between birds through "keck" and "tweet" calls; (iv) the behavior of other pigeons can be imitated; (v) the pigeons responding to the calls are the closest to the source of the call; (vi) in order to have a better probability of survival, each pigeon must lead. Another algorithm inspired by pigeons is Pigeon Inspired Optimization (PIO) [52]. However, in PIO the main idea is to simulate the homing behavior and the mechanisms used by a pigeon to move from a point A to a point B, i.e., orientation through magnetoreception and the sun, and recall of known landmarks close to the destination.

The migratory and attack behavior of seagulls is imitated in the Seagull Optimization Algorithm (SeOA) [53]. Several simplified rules are considered: (i) the migration is performed in groups; (ii) all the individuals travel towards the one with the best fitness; (iii) the attack follows a spiral-like movement. In addition, during the migration process, a mechanism for collision avoidance is included. The Sooty Tern Optimization Algorithm (STOA) [54] has a similar source of inspiration as the SeOA, but based on Sooty Tern seabirds. In this case, the migration behavior represents the exploration phase and the attacking behavior the exploitation one. The complexity of the STOA is O(problem\_dimension × iteration × objective\_number × population\_size × objective\_function) and the space complexity is O(objective\_number × population\_size).

In their fight to survive the harsh conditions of the polar regions, the emperor penguins use a specific strategy of huddling. This represents the main source of inspiration for the Emperor Penguin Optimizer (EPO) [55], where operations such as huddle boundary determination, temperature computation, distance determination and identification of the effective move are mathematically modeled and simulated in order to perform the optimization. The time complexity of EPO is O(k × individual\_length × iterations × dimensionality × population\_size), where the algorithm termination criteria requires O(k) time. The same mechanisms are also simulated in Emperor Penguins Colony (EPC) [56]. The main difference between the EPO and EPC consists in the manner in which the movement is realized, i.e., in EPC the individuals perform a spiral-like movement.

The foraging and navigation behaviors of African vultures are modeled in the African Vultures Optimization Algorithm (AVOA) [57], where multiple mechanisms to improve the exploration–exploitation balance were proposed: the use of a coefficient vector to switch between these phases, the use of phase shifts to prevent premature convergence and escape local optima, and the inclusion of Levy flight. The AVOA computational complexity is based on initialization, fitness evaluation and the vulture update and is O(iteration × population\_size) + O(iteration × population\_size × dimensionality).

Table 1 summarizes the algorithms briefly presented in this section and shows a series of examples of improvements and applications. In Table 1, where a link to the source code exists and unless specifically indicated, the implementation is provided by a third party. In the applications column, since it is common to test the performance of a newly proposed algorithm on a set of problems with known characteristics, the standard benchmarks are not specified.





#### 3.1.2. Mammals

#### Food Search

Based on the principles of echolocation used by bats to find food, the Bat Algorithm (BA) [18] employs several idealized rules: (i) echolocation is used for distance sensing and prey identification; (ii) the flying pattern is random, with characteristics such as velocity, pulse rate and loudness; (iii) the loudness is assumed to decrease from a large value to a (constant) minimum. The role of the pulse rate and loudness is to balance exploration and exploitation [129]. Another algorithm simulating bats is the Directed Artificial Bat Algorithm (DABA) [130]. DABA considers the individual flight of bats with no interaction between individuals, while in BA, the bat behavior is similar to that of PSO particles. Although, compared to BA, the DABA model is closer to the natural behavior, in terms of optimization performance, BA is better [131]. On the other hand, the Dynamic Virtual Bats Algorithm (DVBA) [131] has a population comprised of only two individuals, i.e., an explorer bat and an exploiter bat, that dynamically exchange their roles based on their locations.
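The PSO-like movement rule of BA can be sketched in a few lines: each bat draws a random pulse frequency and updates its velocity toward the best solution found so far. The frequency bounds and the hypothetical `bat_step` helper below are illustrative choices, not the paper's notation.

```python
import random

def bat_step(x, v, x_best, f_min=0.0, f_max=2.0, rng=None):
    """One BA position update: a random frequency scales the pull of the
    bat toward the best solution found so far (lists model n-D vectors)."""
    rng = rng or random.Random(0)
    freq = f_min + (f_max - f_min) * rng.random()          # random pulse frequency
    v = [vi + (xi - bi) * freq for vi, xi, bi in zip(v, x, x_best)]
    x = [xi + vi for xi, vi in zip(x, v)]
    return x, v

# A bat at the origin is pulled toward the current best at (1, 1).
x, v = bat_step(x=[0.0, 0.0], v=[0.0, 0.0], x_best=[1.0, 1.0])
```

In the full algorithm, the loudness and pulse rate would then gate whether a local random walk around the best solution replaces this global move.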

The echolocation mechanism is not specific to bats; other animals also use it to navigate and to find food, inspiring, e.g., Dolphin Echolocation [132]. The abbreviation given by its authors is DE; however, so as not to confuse it with Differential Evolution, Dolphin Echolocation will be denoted by DEO in this work. It mimics the manner in which sound, in the form of clicks, is used to track and aim at objects. Distinctively from the bat sonar system, which has a short range of 3–4 m, the range of the dolphin sonar varies from a few tens of meters to over a hundred meters. This aspect and the differences in environmental characteristics led to the development of totally different sonar systems, and a direct comparison between the two may be difficult.

In their search for food, sperm whales go as deep as 2000–3000 m and can stay underwater without breathing for about 90 min [133]. They are social animals, travel in groups and only the weaker specimens are attacked by predators such as orcas. This behavior was modeled in the Sperm Whale Algorithm (SWA) [133], where the population is divided into subgroups. In each cycle of breathing and feeding, the individual experiences two opposite poles (surface and bottom of the sea); however, because computing the mirror place is expensive and its influence on the search process is limited, it is applied only to the worst solutions. In order to simulate the hunting behavior of humpback whales, i.e., the bubble net feeding method, the Whale Optimization Algorithm (WOA) [134] searches for prey (the exploration phase) and then uses the shrinking encircling mechanism and the spiral updating position (the exploitation phase). A detailed review covering the multiple aspects of WOA is presented in [135].
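The two exploitation moves of the WOA (shrinking encircling and spiral position update) can be sketched as follows. This is a simplified rendering of the published update equations; the variable names are ours, and the decaying coefficient `a` (typically decreasing from 2 to 0) must be supplied by the surrounding optimization loop:

```python
import math
import random

def woa_update(x, leader, a, rng):
    """One simplified WOA position update toward the best-so-far `leader`."""
    if rng.random() < 0.5:
        # shrinking encircling mechanism: |A| contracts as `a` decays from 2 to 0
        A = a * (2 * rng.random() - 1)
        C = 2 * rng.random()
        return [li - A * abs(C * li - xi) for xi, li in zip(x, leader)]
    # spiral updating position: move along a logarithmic spiral around the leader
    t = rng.uniform(-1, 1)
    return [abs(li - xi) * math.exp(t) * math.cos(2 * math.pi * t) + li
            for xi, li in zip(x, leader)]
```

When a whale already sits on the leader position and `a` is zero, both branches leave it in place, which is the fully exploitative limit of the algorithm.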

The Grey Wolf Optimizer (GWO) [136] models the strict social dominance hierarchy and the group hunting mechanisms (tracking, chasing, approaching and attacking the prey) of grey wolves (*Canis lupus*). The complexity of GWO is O(problem\_dimension × iteration × objective\_number × population\_size × objective\_function) [54]. Similar to other bio-inspired approaches, the GWO suffers from premature convergence. The prey weight and astrophysics concepts were applied in the Astrophysics Inspired Grey Wolf Optimizer (AGWO) [137] to simultaneously improve exploration and exploitation. Although wolves live in packs and communicate over long distances by howling, they have developed unique semi-cooperative characteristics [138]. By focusing on the independent hunting ability, as opposed to the GWO, which uses a single leader to direct the search in a cooperative manner, the Wolf Search Algorithm (WSA) [138] functions with multiple leaders swarming from multiple directions towards the optimal solution.
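The GWO hierarchy can be made concrete with a sketch of its position update: each wolf averages three candidate moves, one guided by each of the alpha, beta and delta leaders. This is a simplified version of the published equations with illustrative names; the coefficient `a` decays over iterations in the full algorithm:

```python
import random

def gwo_update(x, alpha, beta, delta, a, rng):
    """Simplified GWO step: average of three leader-guided candidate moves."""
    def guided(leader):
        move = []
        for xi, li in zip(x, leader):
            A = a * (2 * rng.random() - 1)   # |A| > 1 favors exploration
            C = 2 * rng.random()
            move.append(li - A * abs(C * li - xi))
        return move
    moves = [guided(w) for w in (alpha, beta, delta)]
    return [sum(m[d] for m in moves) / 3 for d in range(len(x))]
```

With `a = 0` (the end of the run) every candidate move collapses onto its leader, so the update becomes a pure average of the three leaders, i.e., fully exploitative.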

Spider monkeys are specific to South America and their behavior falls in the category of fission–fusion social structure [139], i.e., based on the scarcity or availability of food, they split from large to smaller groups and vice versa. The algorithm that simulates this structure is called Spider Monkey Optimization (SMO) [139]. It consists of six phases and, unlike the natural system, the position of leader (local or global) is not fixed, depending instead on its ability to search for food. In addition, the optimization procedure does not include the communication tactics specific to spider monkeys. Distinctively, the individual intelligence of chimps used for group hunting is modelled into the Chimp Optimization Algorithm (ChOA), where four types of hunting are included: driving, chasing, blocking and attacking [140]. Another type of ape is represented by gorillas and, in the Artificial Gorilla Troops Optimizer (GTO), their collective life is mathematically modeled to include exploration–exploitation mechanisms [141].

Spotted hyenas have a behavior similar to that of wolves and whales, using collective action to encircle and attack the prey. Their model is used in the Spotted Hyena Optimizer (SHO) [142], which saves the best-so-far solution, simulates the encircling of prey through a circle-shaped neighborhood that can be extended to higher dimensions and controls the exploration–exploitation balance through control parameters. The time complexity of SHO is O(problem\_dimension × G × iteration × objective\_number × population\_size × objective\_function), where the time to define the groups of individuals is O(G).

In the Squirrel Search Algorithm (SSA) [109], the gliding behavior of flying squirrels when exploring different areas of a forest in search of food is simulated by considering some simplifications of the natural mechanisms: (i) a squirrel is assumed to be on one tree; (ii) in the forest there are only three types of trees: normal, oak and hickory; (iii) the region under consideration contains three oaks and one hickory tree. It is considered that the squirrel with the best fitness is positioned on a hickory tree and the next three individuals with the best fitness are on oak trees. The other individuals in the population move towards the oak or the hickory, depending on their daily energy requirements. In SSA, the seasonal changes are modeled through control parameters and influence the behavior of the individuals in the population.

#### Social Behavior

The Lion's Algorithm (LA) [143] is based on the social behavior of lions. It simulates the process of pride forming through mating, removing weak cubs, territorial defense and takeover. The population is formed of males and females and the cub population is subjected to gender grouping (through the application of k-means clustering). LA is not the only approach inspired by lions; the Lion Optimization Algorithm (LOA) [144] is also an example. Distinctively from LA, LOA includes the hunting and migration mechanisms and the mating process is based on differentiation rather than on crossover and mutation. Another lion-inspired approach is the Lion Pride Optimization (LPOA) [145].

Similar to honey bees or ant colonies, blind naked mole rats (a species specific to Africa) have a complex social behavior: (i) they live in large colonies; (ii) a queen and a reduced number of males are responsible for offspring generation; (iii) there are individuals specialized in food search and domestic activities, i.e., taking care of the nest and of the young and in protection against invaders [146]. These mechanisms, in a simplified form, are simulated in the Blind Naked Mole Rats (BNMR) algorithm [146].

Elephants are the largest walking mammals and their successful survival is influenced, among other things, by their social and behavioral structures. Adult males roam the wild solitarily, do not commit to any family and can potentially mate over thirty times a year, while female elephants form matriarchal societies that allow better protection and safe rearing of young calves. The Elephant Search Algorithm (ESA) [147] and Elephant Herding Optimization [148,149] simulate these mechanisms and perform the search.

Table 2 summarizes the algorithms briefly presented in this section and shows a series of examples for improvements and applications. The same structure and idea as in Table 1 are applied.



#### 3.1.3. Other Vertebrates

This category includes other sources of inspiration from the vertebrate group that do not belong to the bird and mammal classes.

The SailFish Optimizer (SFO) [222] is inspired by the group hunting of sailfish (*Istiophorus platypterus*), one of the fastest fish in the ocean. This mechanism of alternating attacks on schools of sardines is modeled through the use of energy-based approaches, where, at the beginning of the hunt, both the predator and the prey are energetic and not injured; however, as the hunt continues, the power of the sailfish will decrease and the sardines will become tired and have reduced awareness.

The food catching behavior of Agama lizards is modelled in the Artificial Lizard Search Optimization (ALSO) [223]. The algorithm focuses on new discoveries regarding the mechanisms for movement control through the tail during prey hunting.

The manner in which chameleons catch prey using their long and sticky tongue represents the basis for the Chameleon Swarm Algorithm [224]. The notation given by the authors of this algorithm is CSA; however, since the same notation is used to represent the Crow Search Algorithm, in this work, the Chameleon Swarm Algorithm will be indicated by the ChSA notation. The ChSA follows three main strategies for catching prey: tracking (modeled as a position update step), eye pursuing (modeled as a position update in accordance with the position of the prey) and attacking (based on tongue velocity). Distinctively from the majority of metaheuristics, which tend to have fewer than three parameters, ChSA has five parameters that help in controlling the exploration–exploitation balance.

#### 3.1.4. General

Unlike the other algorithms mentioned in this work that have a source of inspiration represented by a single animal, in the case of the general class, the metaheuristics are based on a general aspect that can be specific to multiple animals or types of animals. Examples proposed prior to 2008 include algorithms such as Genetic Algorithms (where the genetic principles of mutation and crossover are applicable to all species) and Extremal Optimization [225], based on the Bak–Sneppen mechanism, a model of co-evolution between interacting species which reproduces nontrivial features of paleontological data.

Inspired from the encircling mechanisms used by group hunters such as lions, wolves and dolphins, the Hunting Search (HuS) [226] simulates the cooperation of members to catch food. As a perfect correlation between nature and an optimization process cannot be achieved, a set of differences from the real world are taken into account: (i) in the majority of cases, the location of the optimum of a problem is not known, while, in the real world, the hunters can see the prey or sense its presence; (ii) the optimum is fixed, while, in the real world, the prey dynamically changes its position. Unlike the DEO and GWO, which emulate the specific hunting approaches used by dolphins and wolves, HuS is focused on the cooperation aspect and the repositioning during the hunt. Other approaches which simulate the food searching mechanisms include: the Backtracking Search Algorithm Optimization (BSA) [227], the Optimal Foraging Algorithm (OFA) [25], the Fish Electrolocation Optimization (FEO) [228] and the Marine Predators Algorithm (MPA) [229]. BSA is based on the return of a living creature to previously found fruitful areas. At its core, it is an evolutionary approach that, although it has a very similar structure to the other EAs, differs as follows: (i) mutation is applied to a single individual; (ii) there is a more complex crossover strategy compared with DE; (iii) it is a dual population algorithm; (iv) it has boundary control mechanisms. Distinctively from the BSA, the OFA algorithm is based on the Optimal Foraging Theory developed to explain the dietary patterns of animals. In the OFA, each foraging animal is an individual and its position represents a solution. Its time complexity is O(group\_size × dimensionality × iterations) and its space complexity is O(group\_size × dimensionality × (iterations + 1)). The FEO simulates the active and passive electrolocation mechanisms used by sharks and "elephant nose fishes" to find prey.
A series of electric waves are generated and reflected back to the fish after hitting the surrounding objects, which creates an electric image that is then analyzed. In the case of the MPA, the different strategies used for finding food and the interaction between predator and prey are modeled in different scenarios through Brownian and Levy strategies. The MPA algorithm complexity is O(iterations × (agent\_number × dimensionality + Cost\_function\_evaluation × agent\_number)).
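The Brownian and Levy step strategies mentioned above can be sampled as follows. This is an illustrative sketch, not the MPA authors' code: the Levy step uses Mantegna's method, a standard way of generating Levy-stable increments, and all names and defaults are ours:

```python
import math
import random

def levy_step(rng, beta=1.5):
    """Levy-stable step via Mantegna's method: occasional long jumps
    mixed with many short moves (heavy-tailed exploration)."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma = (num / den) ** (1 / beta)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1)
    return u / abs(v) ** (1 / beta)

def brownian_step(rng):
    """Brownian motion step: a plain unit-variance Gaussian increment."""
    return rng.gauss(0, 1)
```

The heavy tail of the Levy distribution is what lets predator-style moves occasionally escape a local region, while Brownian steps model short-range foraging.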

Another aspect specific to all species in their quest to survive is represented by the competition for food, resources or mates. Two metaheuristic optimizers based on competition were identified: Competition over Resources (COR) [230] and the Competitive Optimization Algorithm (COOA) [231]. The COR algorithm mimics the competition for food of wild animals. The groups with the best approach to storing food obtain improved scores, while the worst-performing groups starve and, after a few generations, die and are removed from the population. In the COOA approach, the competition is simulated by the Imperialist Competitive Algorithm [232] and the groups are represented by the populations of various metaheuristics.

Migration behavior is encountered in all major animal groups. Among the first bio-inspired metaheuristics that contain elements specific to migration is the Biogeography-based Optimization (BBO) [233]. However, the BBO imitates a much larger phenomenon, island biogeography, which includes both migration and mutation [234]. Another algorithm that has the migration principle at its core is the Migrating Birds Optimization (MBO) [50]. As it simulates the features of the "V" flight of birds, the MBO was included in the bird-inspired metaheuristics section. The Animals Migration Optimization (AMO) [235] simulates the animal migration model proposed by ecologists and uses two idealized assumptions: (i) the leader animal will survive to the next generation; (ii) the number of animals in the population is fixed. The algorithm has two phases: the migration process (where the individuals respect three rules: move in the same direction as the neighbors, remain close to the neighbors and avoid collision with neighbors) and population update (where some individuals leave the group and others join it).

Table 3 summarizes the algorithms briefly presented in this section and shows a series of examples for improvements and applications. The same structure and idea as in Tables 1 and 2 are applied.


#### *Mathematics* **2021**, *9*, 2335

#### *3.2. Invertebrates*

Of the total number of algorithms inspired by animals, the ones based on invertebrates represent 38.4%, with insects as the main sub-group (Section 3.2.1). As the number of algorithms inspired by other invertebrate sub-groups was small, the ones not belonging to insects were included in a separate section (Section 3.2.2).

#### 3.2.1. Insects

Although the majority of insects are solitary, several types of insects are organized in colonies or swarms [262]. As insect swarms have several desirable attributes, a high percentage of insect-inspired metaheuristic optimizers belong to the swarm intelligence class.

Swarm intelligence has two main key components: self-organization (global response through interactions among low level components that do not have a central authority) and division of labor (the tasks are performed by specialized individuals) [139,263]. It follows three basic principles: (i) separation (static collision avoidance); (ii) alignment (velocity matching); (iii) cohesion (the tendency of individuals to go towards the center of the mass of the swarm) [264].
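The three basic principles can be expressed as steering vectors computed from an individual's neighborhood. The sketch below is a generic, boids-style illustration (function and variable names are ours); the surveyed algorithms each weight and combine these vectors in their own way:

```python
def flock_rules(pos, vel, neighbors):
    """Compute the three swarm steering vectors for one 2-D individual.
    `neighbors` is a list of (position, velocity) pairs."""
    n = len(neighbors)
    # (i) separation: steer away from the positions of nearby individuals
    sep = [sum(pos[d] - p[d] for p, _ in neighbors) for d in (0, 1)]
    # (ii) alignment: match the neighbors' average velocity
    ali = [sum(v[d] for _, v in neighbors) / n - vel[d] for d in (0, 1)]
    # (iii) cohesion: steer toward the neighbors' center of mass
    coh = [sum(p[d] for p, _ in neighbors) / n - pos[d] for d in (0, 1)]
    return sep, ali, coh
```

For an individual sitting symmetrically between two neighbors, separation and cohesion cancel out and only the alignment term pulls its velocity toward the group's.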

While, in the classic swarm approaches, the individuals considered are unisex and perform virtually the same behavior, thus wasting the possibility of adding new operators [265], in the newer bio-inspired metaheuristics, researchers began to incorporate different types of individuals in the population(s), and the results obtained show an improvement of several characteristics, such as search ability and population diversity. However, the use of different operators leads to an increase in complexity and, until now, theoretical studies that can explain the influence of these operators and the context in which they are recommended have been very scarce.

#### Hymenoptera

This order includes some of the best-known social insects: wasps, bees and ants. The main characteristics of these insects are: (i) the presence of a pair of membranous wings; (ii) antennae longer than the head; (iii) complete metamorphosis.

• *Bees*

The social bees show all the characteristics of eusociality: generation overlapping, separation into fertile and infertile groups, labor division and brood care. In addition, the beehive can be considered as a self-organizing system with multiple agents [266].

In a comprehensive review regarding the algorithms inspired by honey bees, the authors identified five main characteristics that were modeled: (i) mating; (ii) foraging and communication; (iii) swarming; (iv) spatial memory and navigation; (v) division of labor [267]. However, [268] considers that, alongside mating and foraging, the third class is represented by nest-site selection process, and thus proposed the Bee Nest-Site Selection Scheme (BNSS)—a framework for designing optimization algorithms.

In addition to the algorithms presented in [267], other approaches that simulate bee behavior are: Bumblebees (B) [269], Bee Colony Inspired Algorithm (BCiA) [266] and Bumble Bee Mating Optimization (BBMO) [270]. The B algorithm is based on a simplified model of the evolution of bumblebees and can be regarded as a loose implementation of the concepts of the evolutionary model proposed by [271]. On the other hand, the BCiA focuses on the foraging behavior and the BBMO simulates the mating process.

• *Ants*

Among the first algorithms that simulate ant behavior is the Ant System [272]. However, the best-known approach is the Ant Colony Optimization (ACO), used to find the path of minimum length in a graph. To the authors' knowledge, the only other approach simulating ant behavior which is not based on the ACO is Termite-hill [273]. It is a swarm-based algorithm designed for wireless sensor networks, i.e., an on-demand and multipath routing algorithm.
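To make the pheromone mechanism concrete, the following toy sketch finds the shortest path in a small weighted digraph. This is our own simplified illustration, not any published ACO variant: ants pick edges with probability proportional to pheromone divided by edge length, pheromone then evaporates, and completed tours deposit pheromone inversely to their length:

```python
import random

def aco_shortest_path(graph, src, dst, n_ants=20, iters=30, rho=0.5, seed=0):
    """Toy ACO for shortest paths on a digraph {node: {neighbor: length}}."""
    rng = random.Random(seed)
    tau = {(u, v): 1.0 for u in graph for v in graph[u]}  # initial pheromone
    best_path, best_len = None, float("inf")
    for _ in range(iters):
        tours = []
        for _ in range(n_ants):
            node, path, visited = src, [src], {src}
            while node != dst:
                options = [v for v in graph[node] if v not in visited]
                if not options:
                    break  # dead end: abandon this ant
                weights = [tau[(node, v)] / graph[node][v] for v in options]
                node = rng.choices(options, weights)[0]
                path.append(node)
                visited.add(node)
            if node == dst:
                length = sum(graph[a][b] for a, b in zip(path, path[1:]))
                tours.append((length, path))
        for e in tau:
            tau[e] *= (1 - rho)              # evaporation
        for length, path in tours:
            for a, b in zip(path, path[1:]):
                tau[(a, b)] += 1.0 / length  # deposit: shorter tours leave more
        if tours and min(tours)[0] < best_len:
            best_len, best_path = min(tours)
    return best_path, best_len

graph = {"A": {"B": 1, "C": 4}, "B": {"C": 1, "D": 5}, "C": {"D": 1}, "D": {}}
path, length = aco_shortest_path(graph, "A", "D")
```

On this graph the reinforced route A-B-C-D (total length 3) quickly dominates the alternatives A-B-D (6) and A-C-D (5).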

#### Diptera

• *Flies*

Due to their short life-span and the ease of breeding them and providing an adequate living environment, fruit flies are widely studied in laboratory conditions. Consequently, their behavior is known in detail, and their specific mechanisms for finding food are sources of inspiration for new algorithms. To the authors' knowledge, there are two metaheuristics that simulate the fruit fly: the Fruit Fly Optimization (FOA) and the Drosophila Food Search Optimization (DFO) [274]. Similar to the DFO, the FOA is also based on the Drosophila fly, and the literature shows that there are at least two different implementations, proposed by [275] and [276].

• *Mosquitoes*

The host seeking behavior of female mosquitoes is mimicked by the Mosquito Host Seeking Algorithm (MHSA) [277]. The general idea is simple and is based on the following idealized rules: (i) the mosquito looks for carbon dioxide or some other scent; (ii) if found, the mosquito moves toward the location with the highest concentration; (iii) it descends when the heat radiating from the host is felt. The algorithm was developed specifically to solve the travelling salesman problem and has several advantages: (i) the ability to perform large-scale distributed parallel optimization; (ii) it can describe complex dynamics; (iii) it can perform multi-objective optimization; (iv) it is claimed to be independent of the initial conditions and problem size [278].

#### Lepidoptera

The insects that belong to this order have wings covered with overlapping small scales. The best-known examples include butterflies and moths.

• *Butterflies*

The Monarch Butterfly Optimization (MBO) [279] simulates the migration behavior of monarch butterflies through the use of a set of idealized rules: (i) the entire population of butterflies is located in two areas, i.e., Land1 and Land2; (ii) each offspring is generated by the migration operator applied to individuals from Land1 or Land2; (iii) once an offspring is generated, the parent butterfly dies if its fitness is worse than that of the offspring; (iv) the butterflies with the best fitness survive to the next generation. Similar to the MBO, the Monarch Migration Algorithm (MMA) [280] models the migration behavior of monarch butterflies. The main differences between the MBO and the MMA consist in the mechanisms used for movement, for new individual creation and for population size control.

If the MBO and the MMA focus on migration aspects, the Butterfly Optimizer (BO) simulates the mate-locating behavior, i.e., the perching and patrolling of male butterflies [281]. The initial BO version is developed for unconstrained optimization and is a dual population algorithm that includes male butterflies and auxiliary butterflies. The Artificial Butterfly Optimization (ABO) [282] is inspired from the same mating strategy as BO. However, the ABO is a single-population optimizer that contains two types of butterflies: sunspot and canopy, and the rules that it follows are different. In the BO, the following rules are considered: (i) the male butterflies are attracted to the highest UV/radiation object; (ii) the best perching position and the flying direction are memorized; (iii) the flying velocity is constant; (iv) the flying direction is changed if necessary [281]. On the other hand, the ABO considers the following generalized rules: (i) all male butterflies attempt to fly towards a better location (sunspot); (ii) in order to occupy a better position, the sunspot butterflies try to fly to the neighbor's sunspot; (iii) the canopy butterflies continually fly towards the sunspot butterflies to contend for the sunspot [282]. In ABO, three flight strategies are considered and their combination leads to two other variants of the algorithm.

The Butterfly Optimization Algorithm (BOA) [283] considers the foraging behavior and focuses on the smell of butterflies as the strategy used for determining the location of food or of a mating partner. In order to model this behavior, a set of idealized rules are used: (i) all butterflies emit fragrances that attract each other; (ii) the movement of the butterfly is random or towards the most fragrant butterfly; (iii) the stimulus intensity is influenced by the landscape of the objective function.
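A single BOA move following these idealized rules can be sketched as below. This is a simplified version of the published update with illustrative names and defaults; the fragrance value would normally be derived from the stimulus intensity (rule (iii)) by the surrounding loop:

```python
import random

def boa_move(x, best, fragrance, rng, switch_p=0.8):
    """Simplified BOA step: global move toward the most fragrant (best)
    butterfly with probability `switch_p`, otherwise a random local move."""
    r = rng.random()
    if rng.random() < switch_p:
        # global phase: attraction toward the best solution found so far
        return [xi + (r * r * bi - xi) * fragrance for xi, bi in zip(x, best)]
    # local phase: random movement scaled by the butterfly's own fragrance
    step = [rng.uniform(-1, 1) for _ in x]
    return [xi + r * r * si * fragrance for xi, si in zip(x, step)]
```

Note that a butterfly emitting zero fragrance does not move at all, which mirrors rule (i): attraction is entirely mediated by fragrance.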

• *Moths*

The transverse orientation navigation mechanism of moths represents the source of inspiration for the Moth Flame Optimization (MFO) [284]. The population of moths updates their position in accordance with a flame. The group of flames represents the best solutions and serves as guidance for the moths [285]. The complexity of the MFO is O(problem\_dimension × iteration × objective\_number × population\_size × objective\_function) [54]. While the MFO contains a population of moths and flames, in the case of the Moth Swarm Algorithm (MSA) [286], which is also inspired by the navigation behavior of moths, the population is formed of three groups of moths: pathfinders (with the ability to discover new areas of the search space), prospectors (that tend to wander in spiral) and onlookers (that drift directly to the solutions obtained by the prospectors). Distinctively from MFO and MSA, the Moth Search (MS) algorithm [287] considers the phototaxis and the Levy flight of the moths as a source of inspiration. In this case, the population is formed of two subpopulations. One follows the Levy movement and the other simulates the straight flight.
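The moth-around-flame movement at the core of MFO is a logarithmic spiral, sketched below with illustrative parameter names (in the full algorithm, `t` is drawn randomly and its lower bound decreases over iterations, tightening the spiral around the flame):

```python
import math

def mfo_spiral(moth, flame, t, b=1.0):
    """Logarithmic-spiral move of a moth around its assigned flame.
    t in [-1, 1] controls closeness to the flame; b sets the spiral shape."""
    return [abs(fi - mi) * math.exp(b * t) * math.cos(2 * math.pi * t) + fi
            for mi, fi in zip(moth, flame)]
```

For example, `t = -1` places the moth closest to the flame (the distance is contracted by a factor of e), while `t` near 1 corresponds to a wide, explorative loop.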

#### Orthoptera

This order includes, among others, insects such as grasshoppers, crickets and locusts. They are insects that move with great agility and have many shapes and characteristics.

The first metaheuristic optimizer that used the locust swarm metaphor is the Locust Swarm [288], a multi-optima search technique developed for non-globally convex search spaces. However, in order to identify the starting points for the search, it uses the PSO as part of its search [288], and one may argue that it is not a new metaheuristic, but a hybridization of the PSO. On the other hand, the Locust Swarm proposed in [289] emulates the interaction of a locust cooperative swarm. Since the two algorithms have the same name, in this work they are referred to as LS1 for the version proposed in [288] and LS2 for the [289] version. LS2 considers both solitary and social behavior, and consists of three parts: initialization, solitary operation and social processes [289]. The solitary phase performs the exploration of the search space, while the social phase is dedicated to exploitation.

The grasshoppers' social interactions represent the basis for the Grasshopper Optimization Algorithm (GOA) [290]. Both the larvae, which correspond to the feeding stage, and the adult form, which corresponds to the exploration stage, are considered. While, in nature, the individual evolves from larva to adult (local, then global), due to the nature of the search space, the optimization algorithm first needs to find a promising region and, after that, exploit it (global, then local).

#### Other Insects

• *Hunting*

The mechanisms used by antlions to hunt ants is simulated in the Ant Lion Optimizer (ALO) [291]. Antlions have two phases: larvae, focused on feeding, and adults, focused on mating. The ALO is based on the larvae form and, in order to perform the optimization, a series of conditions are considered: (i) a random walk imitates the movement of an ant; (ii) the random walks are affected by the traps of the antlions; (iii) the pits built by antlions (and the probability of catching ants) are proportional with the antlion fitness; (iv) the random walk range decreases adaptively; (v) the ant is caught when its fitness is worse than that of an antlion; (vi) after catching a prey, the antlion repositions and builds another pit.
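Conditions (i) and (iv) above can be sketched as a cumulative random walk that is min-max normalized into the current (adaptively shrinking) bounds, following the normalization idea of the ALO paper; function and parameter names here are ours:

```python
import random

def ant_random_walk(steps, lo, hi, rng):
    """An ant's path as a cumulative +/-1 random walk, rescaled so that it
    stays inside the current bounds [lo, hi] (which shrink over iterations)."""
    walk = [0.0]
    for _ in range(steps):
        walk.append(walk[-1] + (1.0 if rng.random() > 0.5 else -1.0))
    w_min, w_max = min(walk), max(walk)
    # min-max normalization maps the walk's extremes exactly onto [lo, hi]
    return [lo + (w - w_min) * (hi - lo) / (w_max - w_min) for w in walk]

bounded = ant_random_walk(100, -2.0, 2.0, random.Random(3))
```

Shrinking `[lo, hi]` around the best antlion over iterations reproduces the trap effect: the ants' walks are progressively confined to the pit.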

• *Mixed behavior*

Inspiration for metaheuristics comes not only from insects that are generally considered useful, e.g., bees, but also from insects that are considered pests, such as cockroaches. Among the first approaches that included the social behavior of cockroaches is the Roach Infestation Optimization (RIO) [292]. It is an adaptation of the PSO that implements three elements: finding the darkness, finding friends and finding food. Other algorithms simulating cockroach behavior are the Cockroach Swarm Optimization (CSO) [293] and the Cockroach-inspired Swarm Evolution (CSE) [294]. The paper containing the initial CSO version was retracted from the IEEE database due to violations of publication principles; however, this did not stop other researchers from using and improving CSO; Google Scholar indicates that, as of the end of February 2019, there were 32 articles citing the retracted paper. Unlike the RIO, the CSE considers competition, space endurance and migration of cockroaches, besides cooperative behavior [294].

The Dragonfly Algorithm (DA) [264] is inspired from the static (feeding) and dynamic (migratory) swarming behavior of dragonflies. In the feeding stage, the dragonflies are organized into small groups that cover a small area to hunt, through back-and-forth movement with abrupt changes in the flying path. In the dynamic stage, a large group of individuals form a swarm and migrate in one direction over long distances.

The Pity Beetle Algorithm (PBA) [295] is based on the aggregation behavior and the search mechanisms used for nesting and food finding by *Pityogenes chalcographus*, a beetle also known under the name of "six-toothed spruce bark beetle". The PBA follows three stages: searching (where the chemicals emitted by weakened trees are used to identify a suitable host), aggregation (where multiple individuals feed on the host and attract more individuals, both male and female) and anti-aggregation (which is specific to the situation when the population size surpasses a specific threshold).

Table 4 summarizes the algorithms inspired by insects and presents a series of examples for improvements and applications. The same structure and idea as in the previous tables are applied.


**Table 4.** Improvements and applications for insect-inspired metaheuristics (alphabetically sorted).



#### 3.2.2. Other Invertebrates

This group includes algorithms inspired from different invertebrates that do not belong to the insect class.

• *Arachnids*

The social behavior of spiders, i.e., communication using vibrations throughout the web, represents the source of inspiration for the Social Spider Optimization (SSO) [265]. It emulates a group of spiders that contains both males and females and applies different evolutionary operators to mimic the distinct behaviors typically found in the colony. In addition to the cooperation behavior, a mating operator, applicable only to the strongest individuals, is introduced to increase diversity. A comprehensive review of the SSO, which covers its main variants and applications, is the work of Luque–Chang et al. [374].

Distinctively from the SSO, which models the cooperative behavior and exchange of information through the web, the Social Spider Algorithm (SSA) [375] simulates the foraging behavior of social spiders. The SSA does not distinguish the individuals by sex; all the spiders share the same search operations. Compared to the SSO, the SSA is simpler, it uses a single random move operator and depends on the parameter settings to control the search [375]. A parameter sensitivity analysis (through advanced non-parametric statistical tests) indicated that a medium population, a small to medium attenuation rate, a medium crossover probability and a small mutation probability lead to good results for the majority of the problems tested [376].

• *Crustacea*

The Krill Herd Algorithm (KHA) [377] is inspired from the herding behavior of krill individuals and was developed to solve non-complex optimization problems. It is a population-based approach that uses three main ways to explore the search space: (i) movement, induced by other individuals; (ii) foraging; (iii) random diffusion. A review that covers the main improvements and applications of the KHA is [378]. Newer studies (after 2017) that use the KHA to solve specific problems are presented in Table 5.

• *Annelid worms*

The reproduction mechanisms used by earthworms are the source of inspiration for the Earthworm Optimization Algorithm (EWA) [379]. The idealized rules that the EWA follows are: (i) all the earthworms can produce offspring using only two types of reproduction; (ii) the number of genes of the offspring is the same as the parent's; (iii) the best individuals pass directly, without change, to the next generation so as to ensure that the population cannot deteriorate throughout generations.

• *Tunicata*

Tunicates are small, bio-luminescent marine invertebrates with a unique mode of jet propulsion. The movement strategy and the swarming behavior of tunicates were modelled in the Tunicate Swarm Algorithm (TSA) [380], with its performance on a set of benchmark problems being similar to that of state-of-the-art approaches. The time complexity of the TSA is O(iterations × population\_size × dimensionality × N), where N indicates the jet propulsion and swarm behaviors.


**Table 5.** Improvements and applications for other invertebrate-based metaheuristics (alphabetically sorted).

#### **4. The Exploration–Exploitation Balance**

Although based on different ideas, for all metaheuristic optimizers, the mechanisms used to simulate the optimization behaviors are similar. Generally speaking, all the algorithms in this class start with an initial population of potential solutions, usually randomly generated, which is evolved, i.e., modified by a series of mechanisms that can include, among others, selection, crossover and mutation, until a stopping criterion is satisfied. The actual strategies used to perform these steps and the mechanisms used to control the exploration–exploitation balance (EEB) influence the performance and represent the main elements that distinguish the algorithms from one another.

The research in the area of metaheuristics often mentions the exploration and exploitation aspects of the algorithms; however, these terms have never been formally defined [402]. Informally, exploration is defined as the process of visiting entirely new regions of the search space, also known as the global search ability of the algorithm, while exploitation is the process of visiting those regions of the search space within the neighborhood of previously visited points, which represents the local search ability [402]. Pure exploration leads to a decrease in precision but increases the ability of finding new, good solutions, while pure exploitation refines the existing solution and drives the search to a local optimum [403]. Because it indicates how the resources are allocated, knowing the EEB information can be useful to determine the impact of specific aspects of the algorithm [404]. The EEB can be seen from two points of view: (i) exploration and exploitation as opposing forces; (ii) exploration and exploitation as orthogonal forces [404]. However, it was shown that the opposing forces view is a special case of the orthogonal view and, thus, EEB monitoring must involve a metric for the exploration axis and one for the exploitation axis [404].

For evolutionary algorithms, it was shown that different operators, depending on the algorithm, act as exploitation or exploration procedures [1]. In population-based algorithms, the EEB is connected to the population diversity: when diversity is high, the algorithm is explorative, and when it is low, the behavior is exploitative [1]. Although a diverse population is a prerequisite rather than a guarantee for the EEB, and a good EEB can be reached through other means, e.g., fitness, using diversity is one of the simplest methods for achieving it [402]. Because the problems to be solved must be encoded into a binary or real-valued vector, a clear distinction between the genotype (the encoded structure) and the phenotype (the actual problem) must be made. In this context, diversity can be measured at the genotype level, at the phenotype level, or using complex or composite measures that combine the two.
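The genotype-level diversity mentioned above is often quantified as the average pairwise distance between the encoded individuals. The sketch below (an illustrative helper, not taken from any of the surveyed algorithms) computes this measure for a real-valued population; a high value signals explorative behavior and a low value exploitative behavior:

```python
import math
from itertools import combinations

def mean_pairwise_distance(population):
    """Genotype-level diversity: the average Euclidean distance
    over all pairs of real-valued individuals."""
    pairs = list(combinations(population, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# A clustered (exploitative) population is less diverse
# than a spread-out (explorative) one.
clustered = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
spread = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
print(mean_pairwise_distance(clustered) < mean_pairwise_distance(spread))  # True
```

Phenotype-level diversity can be measured analogously by computing distances between fitness values instead of genotypes.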

In [402], the diversity-based approaches applied to the EEB are classified as: (i) diversity maintenance, in which it is assumed that the techniques will maintain diversity per se; (ii) diversity control, where feedback from measured individual fitness and/or fitness improvement is used to direct the evolution towards exploration or exploitation; (iii) diversity learning, where a long-term history in combination with machine learning approaches is used to learn unexplored search areas; (iv) other direct approaches (Figure 2). In the case of diversity maintenance, two categories can be indicated: niching and non-niching. The niching techniques represent extensions of the algorithms to multi-modal domains [405]. One of the most comprehensive definitions of niching is given in [406]: "Niching is a two-step procedure that (a) concurrently or subsequently distributes individuals into distinct basins of attraction and (b) facilitates approximation of the corresponding (local) optimizers".
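As a concrete example of a niching technique, classic fitness sharing divides an individual's raw fitness by a niche count that grows with the number of neighbors within a radius sigma, penalizing crowded basins of attraction. A minimal sketch (illustrative names; the triangular sharing function used here is the textbook choice, not one prescribed by any particular surveyed algorithm):

```python
import math

def shared_fitness(population, fitness, sigma=1.0, alpha=1.0):
    """Fitness sharing: divide each raw fitness value by the
    individual's niche count so that crowded regions of the
    search space receive proportionally less selection pressure."""
    shared = []
    for i, xi in enumerate(population):
        niche_count = 0.0
        for xj in population:
            d = math.dist(xi, xj)
            if d < sigma:  # xj lies inside xi's niche
                niche_count += 1.0 - (d / sigma) ** alpha
        shared.append(fitness[i] / niche_count)
    return shared

# The two crowded individuals near 0.0 are penalized; the isolated
# individual at 5.0 keeps its full fitness of 1.0.
print(shared_fitness([(0.0,), (0.1,), (5.0,)], [1.0, 1.0, 1.0]))
```

Selection then operates on the shared fitness values, distributing the population across distinct basins of attraction as per the niching definition above.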

**Figure 2.** Mechanisms used to control the exploration–exploitation balance through diversity.

Table 6 shows the different mechanisms for the EEB used by the initial versions of the algorithms presented in Section 3. The subsequent modifications performed to the base versions are not considered in this table. The following notations are used: *C* indicates the controlling mechanisms, *L* the learning approaches, *OD* the other direct approaches and *H.* the hybrid techniques. *Pop.* represents the population-based techniques, *Sel.* the selection-based, *Crs.* the crossover/mutation-based, *Fit.* the fitness-based, *Rep.* the replacement-based and *Pres.* the preservation-based approaches. In Table 6, each time a specific approach is encountered in an algorithm, an x is set in the corresponding column.


**Table 6.** The diversity-based approaches used for EEB by the bio-inspired metaheuristic optimizers (alphabetically sorted).

As can be observed from Table 6, some strategies are more popular than others; the non-niching techniques based on population are the most used approaches for diversity maintenance. This can be explained by the fact that the bio-inspired metaheuristics are population-based and, therefore, the most intuitive methods consist of modifying the characteristics of the population as a means of improving performance.

#### **5. Algorithm Selection**

As can be observed, the list of algorithms, even when the source of inspiration is restricted to a single category, is extensive. In practice, the most common question is which algorithm is best suited to a specific problem and how it can be successfully applied. Unfortunately, answering this question is not an easy task since, by definition, heuristics can only provide sufficiently good solutions to an optimization problem. Thus, depending on the accepted level of precision, there can be more than one heuristic that generates acceptable solutions in terms of quality. However, performance is not the only aspect that can be taken into account [407]. The issue of algorithm selection was formalized by Rice in [408] and involves: (i) a problem space P; (ii) an algorithm space A; (iii) a mapping of P × A onto R (also known as a performance model). This implies that there must exist an extensive set of problem instances, features that describe them, and the algorithm state at any time [409]. Thus, although advances regarding algorithm selection were made [409,410], the most used strategy in the metaheuristic field is based on comparison. The work of LaTorre et al. [411] presents a series of methodological guidelines for comparing metaheuristics that involves: (i) selection of benchmarks and reference algorithms; (ii) validation of results (with statistical analysis and visualization); (iii) parameter tuning; (iv) identification of usefulness.

As a demonstration, in this work, the three most used real-world benchmark problems were selected (Table 7) and used to determine (based on already reported results) what the best performing animal-inspired metaheuristics are. The centralized results, organized from the best to the worst solution, are presented in Table 8 for the pressure vessel design, Table 9 for the welded beam design and Table 10 for the tension/compression spring design. All the considered problems are constrained minimization problems and their description can be found in the references from the column "Reported work" of Tables 8–10.
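For reference, the pressure vessel design problem of Table 8 minimizes material and welding cost over four design variables (shell thickness, head thickness, inner radius and length of the cylindrical section) subject to four inequality constraints. The sketch below uses the formulation commonly quoted in the metaheuristics literature; the function names are illustrative and the variable bounds are omitted:

```python
import math

def pressure_vessel_cost(x):
    """Objective: material, forming and welding cost.
    x = (Ts, Th, R, L): shell thickness, head thickness,
    inner radius, length of the cylindrical section."""
    ts, th, r, l = x
    return (0.6224 * ts * r * l + 1.7781 * th * r ** 2
            + 3.1661 * ts ** 2 * l + 19.84 * ts ** 2 * r)

def pressure_vessel_feasible(x):
    """All four inequality constraints g_i(x) <= 0 must hold."""
    ts, th, r, l = x
    g = [
        -ts + 0.0193 * r,     # minimum shell thickness
        -th + 0.00954 * r,    # minimum head thickness
        -math.pi * r ** 2 * l - (4.0 / 3.0) * math.pi * r ** 3 + 1296000.0,  # minimum volume
        l - 240.0,            # maximum length
    ]
    return all(gi <= 0.0 for gi in g)

design = (1.0, 0.5, 42.0, 180.0)  # an illustrative feasible design, not an optimum
print(pressure_vessel_feasible(design), round(pressure_vessel_cost(design), 2))
```

A metaheuristic explores candidate vectors x, typically handling the constraints through penalty terms added to the cost, and the solutions reported in Table 8 are the best feasible designs found.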


**Table 7.** Real-world problems characteristics.


**Table 8.** Solutions for the pressure vessel design problem.


**Table 9.** Solutions for the welded beam design.


**Table 10.** Solutions for the tension/compression spring.

In Tables 8–10, the column "Reported work" indicates the paper where the specific results were reported and where the control parameters used to obtain those results were indicated. The column "Modified version" indicates whether the specific algorithm is a modified version of the base variant. As can be observed, the majority of the algorithms at the top of the list are base variants of metaheuristics proposed in the last five years. This indicates that, for this type of constrained problem with a reduced number of parameters, the newer metaheuristics tend to perform better than the older, better known metaheuristics, as well as the classical mathematical approaches.

To determine which algorithm is best suited to all the considered engineering problems, a Condorcet-based approach was applied [432]. It is based on the idea of a voting system, the problem of determining a ranking of algorithms becoming an electoral problem. In this context, the considered algorithms represent the candidates and their solutions for each problem indicate the voters. Thus, as a majority-based method, the Condorcet algorithm determines the winner of an election as the candidate who outperforms or is equal to every other candidate in a pair-wise comparison. As not all of the algorithms considered were tested on all of the problems, the Condorcet algorithm was applied only to the metaheuristics tested on all three problems. The obtained results identified the top four metaheuristics as: EPO (45 votes), AO (42 votes), COOT (40 votes) and ChSA (34 votes). In the fifth and sixth place, tied with 32 votes, are I-ABC greedy and AVOA. As can be observed, the difference between the algorithms placed in the first three positions is relatively small (2 votes). Similarly, there is a small difference between the algorithms placed in positions four, five and six. On the other hand, the difference between places three and four is larger (6 votes), indicating that the first three algorithms, when applied to the three engineering problems considered, performed substantially better than the next three. To test whether there is a significant difference between the two groups, a paired-sample *t*-test was performed. The results obtained indicate a Pearson correlation of 0.999 and a two-tailed P(T <= t) of 0.3168. As this is higher than 0.05, the null hypothesis cannot be rejected, meaning that there are no statistically significant differences between the results provided by the best three algorithms and those provided by the next three best algorithms. Thus, it can be concluded that, although of the 17 algorithms considered EPO is the winner, all of the first six algorithms can provide similar results and can be used successfully for solving the three engineering problems considered.
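The pairwise-majority step described above can be sketched as follows. The rankings and the helper name are hypothetical; a candidate is declared the Condorcet winner when it is ranked above each rival in at least half of the per-problem rankings, i.e., it outperforms or ties every other candidate:

```python
def condorcet_winner(rankings):
    """Pairwise-majority aggregation of per-problem rankings.
    Each ranking lists the algorithms from best to worst on one
    benchmark.  Returns the candidate that beats or ties every
    other candidate pairwise, or None if no such candidate exists."""
    candidates = set(rankings[0])
    for a in candidates:
        beats_all = all(
            2 * sum(r.index(a) < r.index(b) for r in rankings) >= len(rankings)
            for b in candidates if b != a
        )
        if beats_all:
            return a
    return None

# Hypothetical per-problem rankings for three algorithms.
ranks = [
    ["EPO", "AO", "COOT"],
    ["AO", "EPO", "COOT"],
    ["EPO", "COOT", "AO"],
]
print(condorcet_winner(ranks))  # prints EPO
```

Note that a Condorcet winner need not exist: cyclic preferences across problems (A beats B, B beats C, C beats A) yield no winner, which is why vote counts rather than a single winner are reported in the text.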

#### **6. General Issues**

Eighty-five percent of the articles that propose new bio-inspired metaheuristics have a high number of citations, i.e., more than 20 per year, over a relatively short period in comparison with the norm in the area of artificial intelligence, where the average number of citations per year is around 5. This indicates that the issue of finding good optimizers that can be easily applied to solve different problems is of high interest. However, a high percentage of the research performed and of the subsequently published articles is focused on applications. In-depth analyses of the theoretical aspects that influence the performance of the different operators used and of their combinations are relatively scarce. However, researchers are trying to correct this aspect and, in recent years, a series of studies focusing on the analysis of theoretical and practical aspects was published [10,409,411,433–436].

The high number of citations was observed mostly for the algorithms for which the source code is provided or easy to find. For the majority of these highly cited metaheuristics, the research focused on two main directions: (i) improvement or hybridization and (ii) applications, usually without any analysis or motivation for the selection of a particular algorithm. However, although the rate of publishing new algorithms (and the variants proposed) is high, studies focusing on the aspects that make an algorithm successful or on the mechanisms that lead to improvements in performance are quite rare. Therefore, in order to further advance the knowledge in the area and to establish a comprehensive basis on which newer, faster and more efficient approaches can be developed and successfully applied to problems from various domains, the mechanisms and the influence of different aspects of the problem/optimizer domain must be analyzed in depth. In recent years, it was observed that the manuscripts publishing new metaheuristics contain a more detailed analysis and comparison with other algorithms. However, these studies are predominantly based on empirical observations gathered from simulations performed on a handful of benchmarks (mathematical functions such as those proposed in the CEC competitions and engineering problems such as the welded beam, pressure vessel or tension/compression spring designs). The fact that the CEC test problems were available, until recently, only in C++ and Matlab can be one explanation for the fact that the majority of these metaheuristics are implemented in Matlab.

This paper presented a comprehensive list of metaheuristics, with a focus on animals as a source of inspiration. All the studied works have a similar organization. First, a general description of the domain is presented, followed by the natural mechanisms used as sources of inspiration and a section with the implementation strategies used to simulate the natural mechanisms. In the results section, a set of problems is selected to demonstrate the strengths and weaknesses of the proposed approach. Although this seems straightforward and easy to understand, in the metaheuristics area, the main issues are related to the fact that:


It can be observed that the sources of inspiration follow the main classes identified in the biological taxonomy. Although the inspiration sources are varied and range from the behavior of simple organisms to the mechanisms used for the survival of the species by large animals, the simulation of these sources is focused on exploration and exploitation, which translates into mathematical relations that make changes to the individuals. As can be observed from Table 6, the majority of the mechanisms used to control the exploration–exploitation balance in the standard versions belong to the non-niching class and are population-based. Overall, the manner in which the mechanisms that simulate the real-life behavior of animals are implemented and the combinations used represent the main aspects that differentiate the algorithms and that make them more or less sensitive to the characteristics of the problem being solved, e.g., multi-modality, separability, etc.

Based on the aspects described above, the following potential directions of research can be identified:

• Performance measurement: the issue of performance is a complex aspect, especially taking into account that different metrics can be used. Although the tendency is to see performance as the capability to provide the best solution, other aspects such as complexity, computational resources consumed and stability can also be employed. Moreover, the identification of the best solution is usually based on experiments with the mean and standard deviation as validation criteria [440], and a standardization of all these metrics and evaluation criteria can be a further step in the development of a general framework for metaheuristics.


#### **7. Conclusions**

This work is a review that focuses on the animal-inspired metaheuristics proposed between 2006 and 2021. It was observed that, despite the rising number of critiques addressed to the entire metaheuristic community, the trend of proposing algorithms based on novel ideas and sources of inspiration does not seem to slow down considerably. In fact, it maintains the growth rate already observed a few years ago, mainly due to the large area of applicability and the popularity of both the older, more established algorithms, such as the GA, and the newer approaches, for which the tendency is to provide the source code and thus increase the ease of use.

Regarding the animal-based metaheuristics, the most used source of inspiration is represented by the vertebrates, where easily observable behaviors such as food finding and mating are mathematically modeled using various approaches. However, a closer analysis of the inspiration sources indicated that all the main branches of the biological classification are represented in the metaheuristic world. This shows that researchers are actively searching for new ideas in unusual places and are not hindered by the difficulties associated with identifying the mechanisms of the behaviors of hard to analyze sources, such as animals living in remote and difficult to reach areas. In fact, the more exotic the inspiration source and the more uncommon the behavior, the higher the probability of finding new mechanisms that can be translated into truly novel approaches.

The main directions of research that were identified focus on the proposal of new metaheuristics and their application for various types of problems and only a few studies tackle the influence of specific operators or mechanisms on performance. Better performing algorithms are always desired and using nature as a source of inspiration can lead to new advances in this field of metaheuristics. However, attention must be paid not only to the source of inspiration but also to how this inspiration is modelled and put into practice. Similarity with existing variants, performance, complexity, exploration–exploitation balance, proper comparison and use of benchmarks must also be taken into account. An in-depth analysis of all the aspects that influence the performance behavior and the relations with the characteristics of the problems being solved can benefit both the metaheuristic community and the areas where these algorithms are applied.

**Author Contributions:** E.N.D. designed the study and drafted the work. V.D. performed the literature search, and revised and completed the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by project PN-III-P4-ID-PCE no 58/2021 financed by UEFISCDI, Romania.

**Acknowledgments:** The authors want to thank Florin Leon for his valuable insights and suggestions regarding the use of animal inspired metaheuristics.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

