*Parameter Representation*

Since the tuner algorithms we consider are real-valued optimization methods, we need a proper representation of the nominal parameters of the LL-EA, i.e., the parameters that encode choices among a limited set of options. We opted to represent each nominal parameter as a real-valued vector with as many elements (genes) as there are available options: the actual choice is the one corresponding to the gene with the largest value. For instance, if the parameter to optimize is the PSO topology, we can choose between *ring*, *star* and *global* topology. Each individual in the tuner represents this setting as a three-dimensional vector whose largest element determines the topology used in the LL-EA configuration. These genes are mutated and crossed over following NSGA-II rules just like any others. Figure 3 shows how DE and PSO configurations are encoded.
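As an illustration, the argmax decoding described above can be sketched as follows (function and variable names are ours, not taken from the EMOPaT code):

```python
def decode_nominal(genes, options):
    """Return the option whose gene has the largest value (argmax decoding)."""
    best_index = max(range(len(genes)), key=lambda i: genes[i])
    return options[best_index]

# Example: a three-gene vector encoding the PSO topology choice.
topologies = ["ring", "star", "global"]
print(decode_nominal([0.21, 0.87, 0.40], topologies))  # "star"
```

Because the genes remain real-valued, the tuner's variation operators can act on them without any special handling; only the decoding step differs from that of numerical parameters.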

## **5. Experimental Evaluation**

In this section, we discuss the results of some experiments in which we optimize different performance criteria that can assess the effectiveness of an EA in solving a given optimization task. We take into consideration "classical" criteria pairs, such as solution quality vs. convergence speed, as well as experiments in which the different criteria are represented by different constraints on the available resources (e.g., different fitness evaluation budgets).

To do so, we use DE and PSO as LL-EAs and NSGA-II as EMOPaT's Tuner-EA. Table 1 shows the ranges within which we let PSO and DE parameters change in our tests. During the execution of the Tuner-EA, all values are actually normalized in the range [0, 1]; a linear scale transformation is then performed whenever a LL-EA is instantiated.
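The normalization step can be sketched as a simple linear mapping (a minimal illustration; the function name is ours):

```python
def denormalize(gene, low, high):
    """Map a tuner gene in [0, 1] back to the LL-EA parameter range
    [low, high] via the linear scale transformation described above."""
    return low + gene * (high - low)

# Example: a gene of 0.5 with a hypothetical search range of [10, 200].
print(denormalize(0.5, 10, 200))  # 105.0
```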

**Table 1.** Search ranges for the DE and PSO parameters. We chose ranges that are wider than those usually considered in the literature, to allow SEPaT and EMOPaT to "think outside the box", and possibly find unusual parameter sets.


The computational load of the meta-optimization process depends heavily on the cost of a single optimization process. If we denote by *t* the average time needed for a single run of the LL-EA (which corresponds, for the Tuner-EA, to one fitness evaluation), then the upper bound for the time *T* needed for the whole process is:

$$T = t \cdot \text{Tuner generations} \cdot \text{Tuner population size} \cdot N \tag{6}$$

since we can consider the computation time required by the tuner's search operators to be negligible with respect to a fitness evaluation. This process can be highly parallelized, since all *N* repetitions, as well as all evaluations of a population, can be run in parallel if enough resources are available. In our tests, we used an 8-core 64-bit Intel(R) Core(TM) i7 CPU running at 3.40 GHz; we chose not to parallelize the optimization process itself, preferring instead to parallelize independent runs of the tuners.
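Equation (6) is a plain product of four factors; with hypothetical figures (not taken from the paper) the bound works out as:

```python
def meta_time_bound(t, tuner_generations, tuner_pop_size, n_repetitions):
    """Upper bound of Equation (6) on the total meta-optimization time."""
    return t * tuner_generations * tuner_pop_size * n_repetitions

# Hypothetical values: t = 1 s per LL-EA run, 100 tuner generations,
# a tuner population of 64, and N = 10 repetitions.
print(meta_time_bound(1.0, 100, 64, 10))  # 64000.0 seconds
```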

EMOPaT has been tested on some functions from the CEC 2013 benchmark [38], with the only difference that the function minima were set to 0.

The code used to perform the tests is available online at http://ibislab.ce.unipr.it/software/emopat.

## *5.1. Multi-Objective Single-Function Optimization Under Different Constraints*

A multi-objective experiment can optimize different functions, the same function under different conditions, etc. Thus, optimizing a single function under different constraints can be seen as a particular case of multi-objective optimization. In this section, we report the results of tests on single-function optimization under different fitness evaluation budgets. Similar experiments can be performed evaluating the function under different conditions (e.g., different problem dimensions) or according to different quality indices, as we did in [8], where we considered two objectives (fitness and fitness evaluation budget) for a single function. With respect to that work, the main additional contribution of this section is showing how EMOPaT can be used to generalize the behavior of an EA in optimizing a function when working under different conditions. We consider the following set of quality indices:

{*QXi*}: the best result obtained after *Xi* fitness evaluations, averaged over *N* runs.

We performed four different tests considering, in each of them, one of the four functions shown in Table 2. Our objectives were the best-fitness values reached after 1000, 10,000 and 100,000 function evaluations, namely *Q1K*, *Q10K* and *Q100K*. Each test was run 10 times. In this way, we expected to favor the emergence of patterns related to the impact of a parameter when looking for "fast-converging" or "slow-converging" configurations. Table 2 summarizes the experimental setup for these experiments.
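A minimal sketch of how such quality indices can be read off a run, assuming a best-so-far fitness history is recorded per evaluation (names and data layout are our assumptions):

```python
def quality_indices(history, checkpoints=(1_000, 10_000, 100_000)):
    """history[e] holds the best fitness found after e+1 evaluations
    (non-increasing for minimization); return Q_X for each budget X."""
    return {c: history[c - 1] for c in checkpoints}

# Toy history where the best fitness after e+1 evaluations is 1/(e+1).
history = [1.0 / (e + 1) for e in range(100)]
print(quality_indices(history, checkpoints=(10, 100)))  # {10: 0.1, 100: 0.01}
```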

Firstly, we analyze the LL-EA parameter sets evolved under the different criteria. To do so, we merge the populations of the ten independent runs and, from this pool, we select, for each objective, the top 10% of the solutions. For most parameters there is a clear trend, as their values monotonically increase or decrease as the fitness evaluation budget increases (see Table 3). This result suggests that, in these cases, it may not be necessary to keep track of all possible computational budgets as in [32]: the optimal parameters for intermediate objectives may be inferred by interpolating the ones found for the objectives actually taken into consideration; consequently, developers can use this information to tune their algorithm according to their own budget constraints. Nevertheless, while this can be true for a single function, such trends are rarely consistent across different functions, preventing one from drawing more general conclusions.
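The merge-and-select step described above can be sketched as follows (illustrative code; names are ours):

```python
def top_fraction(pool, objective, fraction=0.10):
    """Return the best `fraction` of the pooled configurations under one
    objective (lower fitness = better)."""
    ranked = sorted(pool, key=objective)
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k]

# Toy pool: (configuration id, fitness) pairs for one objective.
pool = [("c%d" % i, float(20 - i)) for i in range(20)]
print(top_fraction(pool, objective=lambda c: c[1]))  # [('c19', 1.0), ('c18', 2.0)]
```

In the actual experiments this selection is repeated once per objective on the same merged pool, yielding one top-10% set per evaluation budget.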

**Table 2.** Single-function optimization with different fitness evaluation budgets. Experimental settings.


Let us analyze in more detail the results on the Rastrigin and Sphere functions (similar conclusions can be drawn for the other functions). Figures 4 and 5 show the boxplots of some parameters for the top 10% DE and PSO configurations on these two functions. When parameters are nominal, a bar chart plots the selection frequencies of each option instead. For the Sphere function, the boxplots of *Q10K* and *Q100K* are very similar, pointing out that, for this function, 10,000 evaluations are usually sufficient to reach convergence.

**Table 3.** Trends of DE and PSO parameter values versus fitness evaluations budget. Upward arrows denote parameter values increasing with the number of evaluations, downward arrows the opposite. A dash denotes no clear trend. For nominal parameters, the table reports the most frequently selected choice. If the choice changes within the top solutions for different evaluation budgets, an arrow shows the direction of this change as the budget increases.


**Figure 4.** DE parameters of the top solutions (best 10%) for the 30-dimensional Rastrigin (first row) and Sphere (second row) functions, with an available budget of 1000, 10,000, and 100,000 fitness evaluations. Bar plots indicate the normalized selection frequency. Descending/ascending trends for all parameter values are clearly visible.

To evaluate the hypothesis that intermediate budgets can be inferred from the results obtained on the objectives that have actually been optimized, one can generate new solutions in two ways:


**Figure 5.** PSO parameters of the top solutions (best 10% of the population) for the 30-dimensional Rastrigin (first row) and Sphere (second row) functions, with an available budget of 1000, 10,000, and 100,000 fitness evaluations. Descending/ascending trends for all parameter values are clearly visible.

**Figure 6.** Fitness values of all the solutions found in ten independent runs of EMOPaT for the three criteria, plotted pairwise for adjacent values of the budget. The green and red stars represent the Top Solutions for each objective, yellow circles are candidate solutions for intermediate evaluation budgets.

Table 4 shows the parameters of the best solutions found for the Rastrigin and Sphere functions for the three objectives, together with four intermediate solutions generated by the two methods. The ones indicated by *A* lie between *Q1K* and *Q10K*, and the ones indicated by *B* between *Q10K* and *Q100K*. It can be noticed that the parameter sets generated using the Pareto front differ both from the ones inferred as a weighted mean of neighboring solutions and from the top solutions. In some cases (as with DE on Sphere), these solutions use a mutation type that is never selected by the top solutions. For the nominal parameters of the inferred solutions, we chose to consider both options when the two top solutions disagreed, distinguishing the two resulting solutions by an index (e.g., *A*1 and *A*2).
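The weighted-mean inference, including the branching used when the two top solutions disagree on a nominal parameter (e.g., *A*1 and *A*2), might be sketched as follows; this is our illustrative reconstruction, not the EMOPaT code, and it covers only the "Inferred" method, not the Pareto-front extraction:

```python
def infer_intermediate(top_a, top_b, weight=0.5, nominal_keys=()):
    """Interpolate numeric parameters between two neighboring top solutions;
    branch into one candidate per option for disagreeing nominal parameters."""
    numeric = {k: (1 - weight) * top_a[k] + weight * top_b[k]
               for k in top_a if k not in nominal_keys}
    candidates = [dict(numeric)]
    for k in nominal_keys:
        if top_a[k] == top_b[k]:
            for c in candidates:
                c[k] = top_a[k]
        else:
            # Disagreement: keep both options as separate candidates.
            candidates = [dict(c, **{k: v}) for c in candidates
                          for v in (top_a[k], top_b[k])]
    return candidates

# Hypothetical DE settings: the mutation types disagree, so two candidates
# (the analogue of A1 and A2) are produced.
a = {"F": 0.5, "CR": 0.2, "mutation": "rand/1"}
b = {"F": 0.9, "CR": 0.4, "mutation": "best/1"}
print(infer_intermediate(a, b, nominal_keys=("mutation",)))
```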

Figure 7 shows the performance of the configurations considered in Table 4, averaged over 100 independent runs, for DE and PSO. The solid lines represent the Top Solutions; as expected, after 1000 evaluations (see the plots on the right) *Q1K* is the best-performing configuration, while *Q100K* is slower in the beginning but is the best at the end of the evolution. In most cases, the inferred solutions have a performance that lies between the performance of the two top solutions used as starting points. The results obtained on the Rastrigin function (first row of Figure 7) are particularly clear: in the first 1000 evaluations, *Q1K* performs best, then it is surpassed by Inferred A, followed by *Q10K*, Pareto B, and finally *Q100K*; this example seems to confirm our general hypothesis. A relevant exception is *A*2 in DE Sphere (Figure 7, last row), which performs worse than all others: since its only difference from *A*1 is the crossover type, this suggests that it is not possible to infer nominal parameters reliably unless one option is clearly prevalent.

**Table 4.** DE and PSO configurations for the same objective function with three different fitness evaluation budgets. "Top Solutions" are the best-performing sets on each objective; "Inferred" refers to the ones obtained averaging Top Solutions; "From Pareto" are extracted from the Pareto front obtained considering the objectives pairwise (see Figure 6).


Finally, to compare EMOPaT with a state-of-the-art tuner, we implemented the Flexible-Budget Method (FBM) proposed in [32]. For a fair comparison, we implemented their method using the same NSGA-II parameters used by EMOPaT, including the budget of LL-EA evaluations. The secondary criterion used to compare equally ranked individuals (see [32] for more details) is the Area Under the Curve. We performed ten independent runs of FBM, after which we selected, among all runs, the best-performing configurations after 1 K, 10 K and 100 K evaluations (matching our setting) and after 5.5 K and 55 K evaluations (for a comparison with our inferred parameter sets). Then, we performed 100 runs for each configuration and compared the results to the ones reported in Table 4 (for intermediate budgets we used the configurations called "From Pareto"), allowing the same computational budget. Table 5 shows the parameters found by FBM and Table 6 the comparison between the performance of the two methods. Except for PSO on the Sphere function, for which the results provided by EMOPaT are always better (Wilcoxon signed-rank test, *p* < 0.01), the two tuning methods obtain equivalent results with budgets of 1 K, 10 K and 100 K evaluations. With the two intermediate budgets, FBM is better three times, EMOPaT is better twice, and once they are equivalent; therefore, no significant difference between the two methods can be observed.

**Figure 7.** Average fitness versus number of fitness evaluations for configurations generated for PSO (first and second rows) and DE (third and fourth) for the 30-dimensional Rastrigin (above) and Sphere (below) functions. The plots on the right magnify the first 1000 evaluations to better compare the performance of the "fast" versions.

The results obtained by EMOPaT are also consistent with previous experimental and theoretical findings, such as the ones reported in [39], which proved that premature convergence can be avoided if:

$$F > \sqrt{\frac{1 - \frac{CR}{2}}{PopSize}}\tag{7}$$

Of the fourteen DE instances in Table 4, Sphere *Q1K* is the only one that does not respect this condition; this may be an explanation for the fact that this DE version is fast, but generally unable to reach convergence optimizing such a simple unimodal function.
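The condition of Equation (7) is straightforward to check programmatically; a minimal sketch (the function name is ours):

```python
import math

def avoids_premature_convergence(F, CR, pop_size):
    """Condition of Equation (7): F > sqrt((1 - CR/2) / PopSize)."""
    return F > math.sqrt((1 - CR / 2) / pop_size)

# A typical DE setting satisfies the bound...
print(avoids_premature_convergence(F=0.5, CR=0.9, pop_size=50))  # True
# ...while a very small F with a small population does not.
print(avoids_premature_convergence(F=0.1, CR=0.0, pop_size=20))  # False
```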


**Table 5.** DE and PSO configurations obtained by the Flexible-Budget Method [32].



## *5.2. Multi-Function Optimization*

In this section, we show how EMOPaT behaves when the goal is to obtain configurations that perform well on functions that are not included in the "training set". Following the terminology introduced by [31], we expect to find "generalist" and "specialist" versions of the EAs taken into consideration. Table 7 gives more details about this experiment.

**Table 7.** Optimization of seven different functions. Experimental settings.


We used EMOPaT to optimize all the seven functions together (repeating the test 10 times) and then we merged all results into a single collection of configurations. From this collection, we selected the best-performing configuration for each of the seven objective functions.

The next step was to select the "generalist" solutions. We consider a "generalist" solution to be a parameter set that does not perform badly on any of the objectives taken into consideration, i.e., it is never in the worst *θ* percent of the population, when ordered by any objective. Obviously, the higher *θ*, the lower the number of generalists. We decided to set the value of *θ* such that seven generalists would be selected, to match the specialists' number.
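Our reading of this selection rule can be sketched as follows, with *θ* expressed as a fraction rather than a percentage (illustrative code; minimization assumed):

```python
def select_generalists(pool, objectives, theta):
    """A generalist is never among the worst `theta` fraction of the pool
    under any objective (lower fitness = better)."""
    n = len(pool)
    worst = set()
    for obj in objectives:
        ranked = sorted(range(n), key=lambda i: obj(pool[i]))
        worst.update(ranked[int(n * (1 - theta)):])
    return [pool[i] for i in range(n) if i not in worst]

# Toy pool with two objectives: each extreme configuration is worst on one
# objective, so only the two middle configurations qualify as generalists.
pool = [{"a": 1, "b": 4}, {"a": 2, "b": 3}, {"a": 3, "b": 2}, {"a": 4, "b": 1}]
generalists = select_generalists(pool, [lambda c: c["a"], lambda c: c["b"]],
                                 theta=0.25)
print(len(generalists))  # 2
```

In the paper, *θ* is raised or lowered until exactly seven generalists survive, matching the number of specialists.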

Table 8 shows the generalists' and specialists' parameters obtained by merging the results of ten independent runs of EMOPaT. An interesting outcome worth highlighting is that, similar to the previous experiment, some of the generalists are not obtained by simply "interpolating" other results but they contain some traits that are not featured by any specialist. For instance, DE *G*0 has a smaller population than any specialist, PSO *G*3's inertia value is higher than that of all specialists.

A more standard way to infer a "generalist" configuration is to take the one with the best overall results. To do so, we consider the results of all the solutions found by EMOPaT and normalize them so that each fitness has average = 0 and standard deviation = 1; then, we select the configuration that minimizes the sum of the normalized fitnesses. In Table 8, these configurations are reported as "average".
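The normalized-average selection can be sketched like this (an illustrative reconstruction; `fitness_matrix` rows are configurations, columns are functions):

```python
import statistics

def best_average_configuration(pool, fitness_matrix):
    """Z-score normalize each function's fitnesses (mean 0, std 1), then
    return the configuration minimizing the summed normalized fitness."""
    cols = list(zip(*fitness_matrix))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against std = 0
    scores = [sum((row[j] - means[j]) / stds[j] for j in range(len(cols)))
              for row in fitness_matrix]
    return pool[min(range(len(pool)), key=scores.__getitem__)]

# Toy example: configuration "c0" has the lowest fitness on both functions.
print(best_average_configuration(["c0", "c1", "c2"],
                                 [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]))  # c0
```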

We also performed 10 meta-optimizations using SEPaT and *irace*, with the same budget allowed for EMOPaT. The parameters of SEPaT are the ones presented in Table A1, while for *irace* we used the parameters suggested by the authors. For each optimization method, the ten solutions obtained were compared using the tournament method described in Section 4 to find the best configuration, which is also reported in Table 8.

To test the parameter sets obtained, we selected seven functions from the CEC 2013 benchmark that were not used during training (namely Elliptic, Rotated Discus, Rotated Weierstrass, Griewank, Rotated Katsuura, CF5 and CF7). Table 9 shows, for each function, which configuration(s) obtained the best results. To determine the best function, we performed the Wilcoxon signed-rank test (*p* < 0.01) on all configurations pairwise. A configuration is considered to be the best if no other configuration performs significantly better on that function. The table shows that, in some cases, generalists were actually able to obtain better results on previously unseen functions than specialists.

Since the definition of a "generalist EA" implies the ability not to perform badly on any function, we also analyzed the same data from another viewpoint. Each cell (*i*, *j*) in Table 10 shows the number of test functions for which the optimizer associated with row *i* performs statistically worse than the one associated with column *j* (Wilcoxon signed-rank test, *p* < 0.01). The last column reports the sum of each row and can be considered an indicator of the generalization ability of the corresponding optimizer, with respect to the others, over the test functions.
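The construction of such a matrix can be sketched as follows, with the statistical test abstracted behind a predicate `is_worse(i, j, f)` (illustrative names; the paper uses the Wilcoxon signed-rank test at *p* < 0.01):

```python
def comparison_matrix(n_optimizers, n_functions, is_worse):
    """cell (i, j) counts the test functions on which optimizer i is
    statistically worse than optimizer j."""
    matrix = [[sum(1 for f in range(n_functions) if is_worse(i, j, f))
               for j in range(n_optimizers)]
              for i in range(n_optimizers)]
    row_sums = [sum(row) for row in matrix]  # lower sum = better generalist
    return matrix, row_sums

# Toy predicate: lower-indexed optimizers are never worse.
matrix, row_sums = comparison_matrix(3, 2, lambda i, j, f: i > j)
print(row_sums)  # [0, 2, 4]
```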

It can be observed that some of the generalists performed very well. The best optimizers for DE were the configurations obtained by SEPaT along with two generalists, *G*4 and *G*5. The first two configurations are very similar to each other (same mutation and crossover, *CR* 0.15 and *F* 0.5), as shown by the presence of statistically significant differences between them only on one function out of seven. No specialist features a similar parameter set. Regarding PSO, two of the specialists (*Scigar* and *Srosenbrock*) obtained very good results, as well as three of the generalists ( *G*0, *G*1, and *G*3). It is important to notice that most generalists evolved by EMOPaT outperform the solutions found by the other single-objective tuners used as reference, as well as the one obtained by computing a normalized average of all solutions evolved by EMOPaT ("average" in Table 8). This last configuration (which is the same as *Sackley* for PSO) was the best optimizer for three functions and the worst one (not reported) for two (Elliptic, Katsuura). This suggests that this is not the correct way of finding a configuration able to perform well on different functions.

In conclusion, we can say that EMOPaT, in a single optimization process, is able to find, at the same time, algorithm configurations that work well on a single function of interest and others that generalize to different unseen functions, while single-objective tuners need separate processes, with a consequent increase in the time spent on this operation.

**Table 8.** The seven DE and PSO best-performing configurations generated by EMOPaT for each "training" function (denoted by S, for specialist, followed by the name of the function); the seven configurations that never achieved bad results in any of them (denoted by G, for generalist); the parameter sets found by *irace* and by SEPaT; and the single generalist configuration obtained by normalizing fitness values (see text).



**Table 9.** Best-performing DE and PSO configurations on the seven test functions.

**Table 10.** Number of test functions for which the optimizer associated with the row is statistically worse than the one associated with the column. The last column reports the sum of the values in that row, measuring the optimizer performance (the lower, the better). The three best configurations are highlighted in bold.


## **6. Summary and Future Work**

In this paper, we presented some examples of the kind of information and insights into stochastic optimization algorithms that can be offered by a multi-objective meta-optimization environment. To do so, we used EMOPaT, a simple and generic multi-objective evolutionary optimization framework for tuning the parameters of an EA. EMOPaT was tested on the optimization of DE and PSO in different scenarios, showing that it is able to highlight how the parameters affect the performance of an EA in different situations, allowing one to draw generalizable conclusions when different constraints are applied to the optimization of the same function. We then tested it on different functions and showed that it not only allows one to find good configurations for the training function(s), but also to derive from those results new parameter sets that perform well on unseen problems.

We think that EMOPaT can be helpful in many applications and provide useful hints about the behavior of any metaheuristic. In [40] we showed that EMOPaT can be effective in real-world situations, by using it to tune a DE-based object recognition algorithm. In general, a basic application of EMOPaT can be summarized in the following steps, as described also in the code we made available online:


Below we report some interesting directions towards which this approach can be further expanded:


**Author Contributions:** Conceptualization, R.U. and S.C.; Investigation, R.U., L.S. and S.C.; Software, R.U.; Supervision, S.C.

**Funding:** This research received no external funding.

**Ethical Approval:** This article does not contain any studies with human participants performed by any of the authors.

**Conflicts of Interest:** The authors declare no conflict of interest.
