*Article* **The SOS Platform: Designing, Tuning and Statistically Benchmarking Optimisation Algorithms**

**Fabio Caraffini <sup>1,\*</sup> and Giovanni Iacca <sup>2</sup>**


Received: 31 March 2020; Accepted: 5 May 2020; Published: 13 May 2020

**Abstract:** We present Stochastic Optimisation Software (SOS), a Java platform facilitating the algorithmic design process and the evaluation of metaheuristic optimisation algorithms. SOS reduces the burden of coding miscellaneous methods for dealing with several bothersome and time-demanding tasks such as parameter tuning, implementation of comparison algorithms and testbed problems, collecting and processing data to display results, measuring algorithmic overhead, etc. SOS provides numerous off-the-shelf methods including: (1) customised implementations of statistical tests, such as the Wilcoxon rank-sum test and the Holm–Bonferroni procedure, for comparing the performances of optimisation algorithms and automatically generating result tables in PDF and LaTeX formats; (2) the implementation of an original advanced statistical routine for accurately comparing pairs of stochastic optimisation algorithms; (3) the implementation of a novel testbed suite for continuous optimisation, derived from the IEEE CEC 2014 benchmark, allowing for controlled activation of the rotation on each testbed function. Moreover, we briefly comment on the current state of the literature in stochastic optimisation and highlight similarities shared by modern metaheuristics inspired by nature. We argue that the vast majority of these algorithms are simply a reformulation of the same methods and that metaheuristics for optimisation should be simply treated as stochastic processes with less emphasis on the inspiring metaphor behind them.

**Keywords:** algorithmic design; metaheuristic optimisation; evolutionary computation; swarm intelligence; memetic computing; parameter tuning; fitness trend; Wilcoxon rank-sum; Holm–Bonferroni; benchmark suite

#### **1. Introduction**

Experimentalism plays a major role in all fields of metaheuristic optimisation, such as evolutionary computation (EC) and swarm intelligence (SI). It can be a time-demanding, repetitive and tedious process to fine-tune optimisers, compare them or validate new algorithmic strategies on benchmark problems. Due to their stochastic nature, and the lack of theoretical knowledge on the internal dynamics of many of these nature-inspired approaches, their algorithmic design is partially blind and often empirically performed through a series of trial-and-error phases [1]. This involves using several artificially built benchmark problems simulating characteristics present in real-world scenarios, but for which some knowledge is available. The use of such testbed problems is preferable since they usually have a more reasonable execution time than real-world applications, they can be run on any machine, be it, e.g., a modest personal computer or a high-performance computing system, and they are not subject to time and other limitations typical of real-world scenarios, unless these are purposely imposed to reproduce a specific situation. Thus, these problems are suitable choices for performing the algorithmic design phase of novel algorithms and evaluating the performances of specific operators and algorithmic variants. By design, these problems display particular characteristics reported in the corresponding technical reports in terms of, e.g., ill-conditioning, separability, m-separability, noise, modality (i.e., multimodal or unimodal function), etc., allowing also for studying the algorithmic behaviour of certain operators [2–4].

In this light, the research community benefits from platforms featuring a large number of algorithms and benchmark problems, such as the Decision Tree for Optimisation software [5], the COCO platform [6], IOHprofiler [7], jMetal [8] or PlatEMO [9]. However, the availability of code implementing complete benchmark suites and popular optimisation algorithms is only one amongst the many requirements an algorithm designer has to work efficiently and focus on algorithmic issues rather than having to deal with the implementation of ancillary software. The other functionalities that a research-oriented platform for metaheuristic optimisation should have are:


On top of that, such a platform should provide:


Additionally, the platform should be based on:


The Stochastic Optimisation Software, henceforth SOS, exhibits the aforementioned key features, thus facilitating the design, the evaluation and the use of metaheuristic optimisation algorithms. Thanks to its simple and extendible structure, SOS can be easily used by researchers to implement new algorithms and compare their performances with those of several other benchmark algorithms. The most recent SOS release, available in the Zenodo repository [17], contains over fifty ready-to-use algorithms amongst single-solution and population-based metaheuristics, such as those described in [18], all belonging to established optimisation paradigms such as evolutionary algorithms (EA) [19], swarm intelligence (SI) optimisation [20] and other hybrid schemes. Hence, it provides users with a variety of techniques to explore multiple flavours of metaheuristic/stochastic optimisation. More algorithms will be added in future releases. Researchers can benefit in several ways from having such a large availability of algorithms. Firstly, this means that comparison algorithms do not need to be implemented as they are already present and ready to be executed. Secondly, their source code is accessible and can be modified to create new variants or simply viewed as a source of inspiration for implementing other algorithms or algorithmic operators. Thirdly, they can combine the existing algorithms or embed them into new algorithms, thus creating novel hybrid methods as done, e.g., in memetic computing and hyperheuristics [3,21–24]. It must be highlighted that implementing algorithms in SOS is quite simple since most algorithmic operators are provided by the platform. Moreover, the "template" class Algorithm, which is equipped with several ancillary methods, can simply be extended to implement any metaheuristic, regardless of the optimisation family. In this light, SOS also represents an excellent pedagogical tool for teaching purposes and can be used to give guidance to students while implementing optimisation algorithms. Indeed, since 2015 the "student" version of SOS has been used to teach the master module "Computational Intelligence Optimisation" led by Dr Fabio Caraffini at De Montfort University (Leicester, UK), for which it was originally designed before being extended to the current form. Researchers willing to collaborate and add code (algorithms, real-world problems, etc.) to the platform are welcome and invited to get in touch to be added to the SOS GitHub repository (whose link is provided in Table 1). Finally, it is worth mentioning that due to the simplicity in adding problems and algorithms, SOS represents a useful tool for practitioners who might have less interest in the algorithmic design but a high need for off-the-shelf implementations of optimisation algorithms. The remainder of this article is structured as follows:



**Table 1.** Relevant links to source code and software documentation.

#### **2. The SOS Platform**

For portability reasons, SOS is coded in Java and is therefore platform-independent. To speed up the execution of extensive experiments, some benchmark suites are also available in the form of C/C++ native libraries (compiled for MS Windows, macOS and Linux machines) accessed through the Java Native Interface (JNI). By default, SOS spreads the "runs" (i.e., the optimisation processes in which one algorithm optimises one problem) over the available threads and executes them in parallel. When this is undesirable, e.g., if the experiment is executed on a laptop or personal computer and CPU power must be saved for other applications, it is possible to switch to the single-thread mode by means of the setter method setMT(boolean MT) of the class Experiment (i.e., its Boolean flag variable must be set to false). By looking at the code available in [17], one can notice that in the current release, a strong emphasis is given to real-valued box-constrained single-objective optimisation, also known as "single-objective continuous optimisation" in the EC community. Nevertheless, SOS can be used for addressing other optimisation domains such as discrete or combinatorial ones.
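The effect of such a Boolean flag can be pictured with the following minimal sketch. It does not reproduce SOS internals: the class `RunScheduler` and its methods are hypothetical and were written for this illustration only. The only idea taken from the text is that a single flag selects between one worker thread and all the available ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; this is NOT the SOS Experiment class.
class RunScheduler {
    private boolean multiThread = true; // assumed default: parallel execution

    void setMT(boolean mt) { this.multiThread = mt; }

    // One worker in single-thread mode, otherwise one per available processor.
    int workerCount() {
        return multiThread ? Runtime.getRuntime().availableProcessors() : 1;
    }

    // Executes every run and returns the results in submission order.
    List<Double> executeAll(List<Callable<Double>> runs) {
        ExecutorService pool = Executors.newFixedThreadPool(workerCount());
        try {
            List<Double> results = new ArrayList<>();
            for (Future<Double> future : pool.invokeAll(runs)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("A run failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Callable<Double>> runs = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            final double value = i;
            runs.add(() -> value); // placeholder for one optimisation run
        }
        RunScheduler scheduler = new RunScheduler();
        scheduler.setMT(false); // single-thread mode
        System.out.println(scheduler.workerCount() + " worker(s): " + scheduler.executeAll(runs));
    }
}
```

Because the runs are independent of each other, the results do not depend on the number of workers; only the wall-clock time does.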

#### *2.1. Functional Overview*

SOS is research-oriented and is designed to have an intuitive structure making it straightforward to use for both expert algorithm designers in the research community and, most importantly, for practitioners and students from different subject fields. Indeed, to execute an experiment, it is sufficient to add a class, which extends the abstract Experiment class, in the experiments folder (package). Such a class will automatically inherit the add methods, which can be used to add algorithms and problems to the experiment, and other methods to set, e.g., the allowed computational budget (if not set, by default, this is assumed to be equal to 5000 × *n* with *n* being the dimensionality of the problem) and the number of runs to be performed. Subsequently, a main method must be written to launch one or multiple experiments. To do this, the methods of the utility class RunAndStore can be imported to, e.g.,:


It must be said that the computational budget can be arbitrarily allocated for each experiment using the provided setter methods (described in the SOS documentation). These can be used to either set a multiplicative factor (which multiplies *n*, thus adjusting the budget of each function in the experiment based on its dimensionality) or set a fixed amount of functional calls. This poses limitations on the maximum number of allowed functional evaluations, but does not prevent the algorithms from stopping the search earlier. Indeed, each algorithm can be equipped with a customised "stop" criterion based, e.g., on the number of improvements on the fitness values, the use of thresholds specifying a desired fitness tolerance, an incremental improvement value, etc.
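The interplay between a maximum budget and an early "stop" criterion can be sketched as follows. This is an illustration under stated assumptions rather than SOS code: the class `BudgetSketch`, its method names and the simple one-dimensional step-halving search are all hypothetical; only the 5000 × *n* default factor and the idea of a fitness tolerance come from the text.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch; names do not mirror the SOS API.
class BudgetSketch {
    // Budget proportional to the problem dimensionality n (e.g., factor = 5000).
    static int budgetFromFactor(int factor, int n) {
        return factor * n;
    }

    // Minimises a one-dimensional function with a step-halving search and
    // returns the number of fitness evaluations actually used.
    // The search stops when the budget is exhausted OR the tolerance is met.
    static int minimise(DoubleUnaryOperator f, double x0, int maxEvaluations, double tolerance) {
        double x = x0;
        double step = 1.0;
        double best = f.applyAsDouble(x);
        int evaluations = 1;
        while (evaluations < maxEvaluations && best > tolerance) {
            boolean improved = false;
            for (double direction : new double[] {1.0, -1.0}) {
                if (evaluations >= maxEvaluations) break; // never exceed the budget
                double candidate = x + direction * step;
                double value = f.applyAsDouble(candidate);
                evaluations++;
                if (value < best) {
                    x = candidate;
                    best = value;
                    improved = true;
                    break;
                }
            }
            if (!improved) step /= 2.0; // refine the step when no move helps
        }
        return evaluations;
    }

    public static void main(String[] args) {
        int budget = budgetFromFactor(5000, 1);
        int used = minimise(x -> x * x, 3.0, budget, 1e-8);
        System.out.println("budget = " + budget + ", evaluations used = " + used);
    }
}
```

The budget acts only as an upper bound: when the tolerance is reached first, the remaining evaluations are simply left unused.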

Figure 1 shows an example of a main method used for executing a list of experiments and displaying the final outputs, i.e., tables and graphs, generated by SOS after processing the results. With reference to the latest release in [17], a large number of experiment files can be found in the experiments package, while several examples of main files (showing different configurations and options for storing and processing results in different formats) are available in the mains package. Among these examples, the most convenient one to be used as a template is the class RunExperiments, shown in Figure 1, which is located in the default package of the SOS platform. More indications on how to write an experiment class are given in Section 2.2 and graphically shown in Figure 2.

**Figure 1.** A graphical functional overview of the Stochastic Optimisation Software (SOS) platform.

To minimise the need for referring to a detailed document, SOS is equipped with a high number of self-explanatory example classes. Moreover, the clear nomenclature used for methods, variable types and class names and the user-friendly structure of the proposed platform give guidance to the users, who may not even need further indications. Nonetheless, the software documentation is available online to provide users with further information on SOS packages, classes and methods. Table 1 displays relevant links to SOS web pages, from which SOS documentation can be accessed. These web pages are kept up-to-date, and the documentation itself is constantly updated to reflect any major changes of the platform. This documentation plays a key role for those who intend to modify SOS source code, extend it or customise it to a specific need.

SOS makes it very easy to add new algorithms and new problems that are fully compatible with the platform, so as to exploit its full capabilities. For the sake of good organisation, it is suggested to place their implementation into the three provided packages: algorithms, benchmarks and applications.

The algorithms folder currently contains a vast list of optimisation algorithms for continuous optimisation, which, in many cases, can be run under different combinations of variation and/or selection operators, thus providing an even higher number of choices. This results in a good balance between established optimisation methods, such as differential evolution (DE) [1,25], simulated annealing (SA) [26], genetic algorithms (GAs), particle swarm optimisation (PSO) [27] and evolution strategies (ES) [28–30], and more modern optimisers such as single particle optimisation [31], self-adaptive algorithms [32–38] and MC/hyperheuristic approaches [3,23,24,39–41]. For a complete list, one can refer to the algorithms folder from [17] or check the online documentation. More information on the SOS algorithms is given in Section 2.3.

The benchmarks folder contains the implementation of problems taken from some established benchmark suites for optimisation. SOS implements several testbed problems and complete benchmark suites that can be added in the experiment files. A complete list of these optimisation problems and their relevant documentation is provided in Section 3. Furthermore, a novel benchmark suite is presented in Section 3.

The applications folder is where specific real-world applications should be implemented. Some examples from the literature, e.g., the code for the application described in [23], and from a benchmark suite for real-world problems [42] are available in this folder.

Following the execution of an experiment, the produced raw data can be processed by means of the classes in the package utils.resultsProcessing. These provide several functionalities, including:


The file TabGen.java, located in the Mains folder, contains the methods for generating the previously mentioned tables and graphical outputs and can be modified to customise the production of tables with different layouts and informative content, as discussed in Section 5.

It is worth stressing the fact that having a fast way to visualise results in multiple formats is a key element to facilitate research in the area of stochastic optimisation. This feature is indeed very useful since the raw data are automatically processed and the LaTeX source of the comparative tables is generated without any user intervention, together with the corresponding PDF files.

Moreover, this also accelerates the algorithmic design phase, which is often performed empirically through several trial-and-error steps. At each step, different variants of the same algorithm must be compared and tables must be generated. This is quite common in the EC field where using a different combination of variation operators [19] with different combinations of parent and survivor selection schemes [19] can lead to a variety of possible algorithmic behaviours. The same issue arises when designing MC and hyperheuristic algorithms, for which several coordination schemes must be tested to ensure high performances [3,21,22,24]. This aspect suggests that the empirical design approach can only produce algorithms tailored to the problem(s) considered during the design phase. To some extent, this issue can be considered analogous to the training process in machine learning.

In this light, the algorithmic design process is not so different from the process performed to choose the least disruptive correction method for handling infeasible solutions generated while addressing a problem [1,16,43] or from the fine-tuning process performed to find the most appropriate set of parameters for an algorithm A meant to address a specific problem P.

#### *2.2. Parameters' Fine-Tuning*

Metaheuristics for optimisation must be tuned on the problem at hand to make them achieve their full potential and return high-quality solutions. Against the original common belief that a "universal" algorithm could have been designed, theoretical achievements such as, first and foremost, the "no free lunch theorems" (NFLTs) for optimisation [47,48] highlighted the need for fine-tuning, self-adaptation and the use of problem-specific operators.

However, finding the optimal parameter configuration is an optimisation problem itself. Although meta-optimisation strategies have been proposed [49,50], these do not completely solve the problem and are debatable, since they introduce further complexity: the meta-optimiser itself might need to be fine-tuned. Furthermore, the fact that most real-world optimisation problems are black-box systems, i.e., little or no analytical information on the fitness function is available, makes it harder to devise an exact method for finding the optimal configuration of parameters.

Under these circumstances, the parameters' tuning process is necessarily performed empirically, thus resulting in a rather time-consuming and tiresome activity. SOS can be used to accelerate and facilitate this task significantly as: (1) the same algorithm can be included multiple times in the same experiment file with different parameter configurations; (2) runs can be spread over the available processors and threads to speed up the generation of results; (3) raw data are statistically analysed and results automatically displayed in compact, yet highly informative tables and graphs (see the examples provided in Section 5). Hence, the only portion of code that one needs to write to apply and tune an algorithm A on a specific real-world problem P with SOS is actually the code implementing P.

It must be pointed out that also artificially built testbed problems, as those already implemented in SOS (see Section 3), can play a role in the parameter tuning process. Indeed, studying how the algorithmic behaviour of a metaheuristic algorithm changes under different configurations of its parameters on these benchmarks can help shed light on how to perform the tuning process. By using testbed problems, whose mathematical properties (e.g., differentiability, separability, modality, ill-conditioning, etc.) are known, one can understand, e.g., the effect of the parameters in relation to these mathematical features. As a result, specific regions of the parameter space could be detected to obtain a robust general-purpose behaviour rather than a very problem-specific one, or to strengthen exploitation capabilities rather than exploration capabilities, or to minimise the rise of algorithmic structural biases, as done in [1,12,43], the generation of a high number of infeasible solutions, as done in [1,16], or premature convergence [51] and stagnation [52], as done in [11].

Most of the aforementioned studies have been performed with SOS, and their experiment files can be found in [17]. Portions of the code from three experiment files are reported in Figure 2, which helps understand how the parameter space of an algorithm can be explored in SOS.

**Figure 2.** Examples from experiment classes (original files were downloadable from [17]) for parameter tuning and algorithmic behaviour testing, i.e., the same algorithms are included in the same experiment under different parameter configurations. The fragment of code reported in (**a**) refers to an experiment performed for the studies in [4,53] located in SOS→src→experiments→rotInvStudy→CEC11.java. The file RCEC14TuningDEroe.java, from which the segment of code in (**b**) was taken, can be found in the same folder. In this case, the DE/rand/1/exp algorithm executes with three different values for the crossover rate parameter *Cr* over the 30 rotated versions of the problems of the R-CEC14 (R indicates rotation flag) benchmark suite proposed in this article in Section 3. Finally, the fragment of code in (**c**) shows how to define compactly an experiment running 12 DE variants (each with four correction strategies) with several different parameter configurations, over one problem only, to find the most appropriate triplet population size, scale factor (*F*), crossover ratio (*Cr*). The complete class file for this example is located in SOS→src→mains→ExampleTuning.java.

In Figure 2a, four DE variants are added into an experiment named ExperimentDE by using the constructor super(probDim,5000,"ExperimentDE"). Therefore, all the produced text files containing raw data will be saved in a homonym folder, located inside the SOS default results folder. From the constructor method, it can be seen that a maximum computational budget of 5000 × *n* fitness functional calls, with *n* being the dimensionality of the problem, is used. Clearly, this experiment file contains scalable testbed problems and not real-world applications since the dimensionality of the problems is not fixed and passed as an argument with the variable probDim. The number of performed runs, 30 in this case, is specified with the command setNrRuns(30) (SOS performs 100 runs if not specified otherwise). It must be noted that all the other parameters, apart from the population size, are fixed. This means that this experiment will only study the effect of varying population sizes, which are in this case equal to the problem dimension as required for the studies in [4,53], which can be read for further details. To execute this experiment with increasing problem dimensionalities, and therefore, in turn, increasing population sizes, it is sufficient to instantiate objects with increasing probDim values by calling the class constructor in the file RunExperiments.java (as shown in the example of Figure 1). It is interesting to note that, to avoid confusion, a name is assigned to each DE variant (e.g., a.setID("DErn1bin")). The assigned name will show up in the generated tables. If a name is not provided, as in the example of Figure 2c, it will be automatically assigned equal to the class name of the algorithm. Therefore, in this case, the assigned names would start with DE, followed by "-i", where "i" represents a counter incremented every time an algorithm is added to an experiment. 
In a nutshell, if names were not specified in the example of Figure 2a, tables would display DE-1, DE-2, DE-3 and DE-4, since all the algorithms in this experiment are instances of the DE class. Differently, in Figure 2b, the same DE variant, namely DE/rand/1/exp, is added four times to the experiment, with four different values for the so-called "crossover ratio" parameter *Cr*. As can be seen at the bottom of the figure, all 30 rotated problems from the benchmark suite proposed in Section 3 are added to this experiment. Hence, its purpose is to understand the impact of the *Cr* value on DE/rand/1/exp when facing different problems.

Finally, Figure 2c shows a very compact and fast way for configuring a large experiment in SOS, in which 12 DE variants are equipped with four different correction strategies for handling infeasible solutions, for a total of 48 different algorithms to tune. From the left-hand side of the figure, it can be seen that the DE variants are obtained by combining the six mutations with the two crossover operators in the DEMutations and CrossOvers arrays, respectively. The four correction strategies are in the corrections array. For details about the adopted notation for DE variants and correction strategies, one can consult the articles [1,12] and the results in [45]. For each algorithm, the explored parameter space is defined as the Cartesian product of the three sets represented by the populationSizes array, containing three population sizes, the FValues array, containing 10 equally spaced values for the so-called "scale factor" parameter *F* in the range [0.05, 2], and the CRValues array, containing five equally spaced values for *Cr* in the range [0.05, 0.99]. Hence, each one of the 48 algorithms is tuned over a set of 3 × 10 × 5 = 150 possible combinations of the three parameters. In total, this means that 48 × 150 × *Nr* = 7200 × *Nr* optimisation processes are executed, with *Nr* being the number of runs performed for each problem added to this experiment. The full code for this example, named ExampleTuning, is located in the mains package. For demonstration purposes, *Nr* is set equal to 10, and the computational budget is set to 1000 × *n*.
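The size of this tuning experiment can be verified with a short, self-contained sketch that enumerates the Cartesian product explicitly. The class `TuningGridSketch` and its methods are hypothetical and do not appear in SOS; the three population sizes are placeholder values, since the text only states how many there are.

```java
// Hypothetical sketch for counting the configurations of the tuning experiment.
class TuningGridSketch {
    // Returns `count` equally spaced values covering [min, max], endpoints included.
    static double[] linspace(double min, double max, int count) {
        double[] values = new double[count];
        for (int i = 0; i < count; i++) {
            values[i] = min + i * (max - min) / (count - 1);
        }
        return values;
    }

    // Enumerates algorithms x population sizes x F values x Cr values x runs.
    static int countProcesses(int algorithms, int[] populationSizes,
                              double[] fValues, double[] crValues, int runs) {
        int processes = 0;
        for (int a = 0; a < algorithms; a++) {
            for (int np : populationSizes) {
                for (double f : fValues) {
                    for (double cr : crValues) {
                        processes += runs; // one configuration, repeated `runs` times
                    }
                }
            }
        }
        return processes;
    }

    public static void main(String[] args) {
        int deVariants = 6 * 2;                 // six mutations x two crossovers
        int algorithms = deVariants * 4;        // four correction strategies
        int[] populationSizes = {10, 50, 100};  // placeholder values (assumption)
        double[] fValues = linspace(0.05, 2.0, 10);
        double[] crValues = linspace(0.05, 0.99, 5);
        int perProblem = countProcesses(algorithms, populationSizes, fValues, crValues, 10);
        System.out.println(algorithms + " algorithms, " + perProblem + " processes per problem");
    }
}
```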

#### *2.3. Adding New Algorithms and Problems*

The literature is currently saturated with "novel" nature-inspired optimisation metaphors claiming to mimic collaborative or individual behaviours of animals [54–61], human activities [62–64] or other natural phenomena [65–67]. The contribution made by most of these new optimisation paradigms is arguable, as in general, they are very similar to more established EAs, such as GA, DE or ES, or swarm intelligence algorithms such as PSO. Furthermore, these new metaheuristics are by definition subject to the NFLTs, and as such, it can be shown that, unless they are highly fine-tuned, they cannot outperform classic algorithms over all possible problems. It must be remarked that also amongst the most established optimisation frameworks, some similarities can be detected in support of the thesis that taxonomies based on metaphors inspiring the algorithm design could (and should) indeed be avoided. To provide some examples, it is sufficient to observe that DE mutations, being linear combinations of individuals from the population [1,25], operate in the same way as the arithmetic crossover used in real-valued GAs and other EAs [19]. Similarly, it can be observed that the lack of parent selection in SI frameworks, such as in PSO [27], is simply moved into the perturbation operator, since this requires the definition of the neighbourhood of the candidate solution to be perturbed to find its local best solution.
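The first of these observations can be made concrete with a small sketch in which both operators are expressed through the same weighted-sum routine. The class and method names are hypothetical; the formulas used are the textbook DE/rand/1 mutation, x1 + F (x2 − x3), and the arithmetic crossover, a x1 + (1 − a) x2.

```java
// Hypothetical sketch: two "different" operators built on one linear combination.
class LinearCombinationSketch {
    // Returns sum_k weights[k] * points[k], computed component by component.
    static double[] combine(double[][] points, double[] weights) {
        double[] result = new double[points[0].length];
        for (int k = 0; k < points.length; k++) {
            for (int i = 0; i < result.length; i++) {
                result[i] += weights[k] * points[k][i];
            }
        }
        return result;
    }

    // DE/rand/1 mutation: x1 + F * (x2 - x3), i.e., weights (1, F, -F).
    static double[] deRand1(double[] x1, double[] x2, double[] x3, double f) {
        return combine(new double[][] {x1, x2, x3}, new double[] {1.0, f, -f});
    }

    // Arithmetic crossover: a * x1 + (1 - a) * x2, i.e., weights (a, 1 - a).
    static double[] arithmeticCrossover(double[] x1, double[] x2, double a) {
        return combine(new double[][] {x1, x2}, new double[] {a, 1.0 - a});
    }

    public static void main(String[] args) {
        double[] mutant = deRand1(new double[] {1, 1}, new double[] {2, 0}, new double[] {0, 2}, 0.5);
        double[] child = arithmeticCrossover(new double[] {0, 0}, new double[] {4, 8}, 0.25);
        System.out.println(java.util.Arrays.toString(mutant));
        System.out.println(java.util.Arrays.toString(child));
    }
}
```

The two operators differ only in the number of individuals involved and in the weights, which is the structural similarity the text refers to.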

In this light, a scientific approach to describe a metaheuristic for optimisation is by considering it as a stochastic process returning a near-optimal solution for an optimisation problem. Regardless of the metaphor behind them, all metaheuristic methods are the expression of the same concept and aim at reaching a good balance between exploration of the fitness landscape, looking for promising basins of attraction, and their exploitation, to refine the search and find local optimal solutions (i.e., maxima or minima according to the need). This dynamic process can alternate such phases as appropriate to keep looking for the global optimum, without prematurely converging to a locally optimal solution or stagnating without being able to return any satisfactory near-optimal solution.

Hence, it makes sense to analyse the algorithmic structure of modern nature-inspired optimisation algorithms and extrapolate a common, high-level, general skeleton to use as a template for their implementation. This is done in SOS by means of the abstract class Algorithm, from which all algorithms inherit ancillary and auxiliary methods (e.g., for algorithm execution, storing the fitness trend, setting or generating the initial guess, etc.) and extend only the parts to be customised to implement the desired optimisation framework (e.g., GA, DE, a hybrid method or a completely new optimisation paradigm). Therefore, any class file extending Algorithm is already equipped with all the methods needed to add the algorithm to an experiment file and execute it as described in the previous section, while it must implement one method, named execute, which returns an object of type FTrend. To give further details on this, other than referring to the online software documentation, it is worth briefly describing the classes of the package utils listed below.


Moreover, the packages utils.algorithms and utils.random provide other useful methods helping users to implement algorithms. In particular, these packages contain:

- the class GAOp, which provides numerous mutation, parent selection, survivor selection and crossover operators from the GA literature;
- the class DEOp, which provides all the established DE mutation and crossover strategies, plus several others from the recent literature;
- the classes in aos, which implement adaptive heuristic/operator selection mechanisms from the hyperheuristic field [24].

Given the ease of adding new algorithms and the availability of numerous off-the-shelf operators, SOS is a suitable workbench to design hybrid algorithms or variants tailored to the problem at hand. In particular, the possibility of declaring variables of type Algorithm, in order to instantiate and execute optimisation processes inside another algorithm, leads to the implementation of rather complex algorithms by writing very neat code. This is evident from Figure 3, where the code of the iterated local search algorithm proposed in [73] is shown. At first glance, one can observe that the source code is clear and resembles pseudocode. Indeed, instead of implementing the sophisticated (1+1)–CMA-ES algorithm in [74], which plays the local searcher role, an Algorithm variable is instantiated by using the CMAES\_11 class available from the algorithms package and implementing the (1+1)–CMA-ES algorithm. The parameters and computational budget for this algorithm, which can be seen as an operator in this context, are passed as described in the previous sections. It can be noticed that the initial solution for each iterated search is passed to the local searcher before it is executed. This solution is generated by applying the exponential crossover operator [1] to the current best solution and a feasible randomly sampled individual. Obviously, the crossover method, as well as the method for generating an individual in the search space, are already present in SOS, and therefore, they can be simply imported without the need for writing further code. To monitor the internal dynamics of the resulting algorithm, an FTrend object can be used as shown in Figure 3.
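The nesting pattern described above can be illustrated with a self-contained sketch. None of the classes below belong to SOS: `TrendSketch`, `AlgorithmSketch`, `HillClimberSketch` and `IteratedLocalSearchSketch` are simplified stand-ins written for this example. In particular, a plain (1+1) hill climber replaces (1+1)-CMA-ES, a random coordinate replacement stands in for the exponential crossover, and the search bounds [−5, 5] are an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

// Stand-ins written for this sketch only; they are NOT the SOS classes.
class TrendSketch {
    final List<Double> values = new ArrayList<>();
    void add(double v) { values.add(v); }
    void append(TrendSketch other) { values.addAll(other.values); }
    double last() { return values.get(values.size() - 1); }
}

abstract class AlgorithmSketch {
    protected double[] initialGuess;
    protected double[] finalBest;
    void setInitialSolution(double[] x) { initialGuess = x.clone(); }
    double[] getFinalBest() { return finalBest.clone(); }
    // The only method a concrete algorithm has to provide.
    abstract TrendSketch execute(ToDoubleFunction<double[]> f, int budget);
}

// Plain (1+1) hill climber with a fixed step; a stand-in for (1+1)-CMA-ES.
class HillClimberSketch extends AlgorithmSketch {
    private final Random rng = new Random(2);
    @Override
    TrendSketch execute(ToDoubleFunction<double[]> f, int budget) {
        TrendSketch ft = new TrendSketch();
        double[] x = initialGuess.clone();
        double fx = f.applyAsDouble(x);
        ft.add(fx);
        for (int i = 1; i < budget; i++) {
            double[] y = x.clone();
            for (int j = 0; j < y.length; j++) y[j] += 0.1 * rng.nextGaussian();
            double fy = f.applyAsDouble(y);
            if (fy <= fx) { x = y; fx = fy; }
            ft.add(fx); // best value found so far in this local search
        }
        finalBest = x;
        return ft;
    }
}

// Iterated local search that uses another algorithm object as an operator.
class IteratedLocalSearchSketch extends AlgorithmSketch {
    private final Random rng = new Random(1);
    private final AlgorithmSketch local = new HillClimberSketch();
    @Override
    TrendSketch execute(ToDoubleFunction<double[]> f, int budget) {
        TrendSketch overall = new TrendSketch();
        double[] best = initialGuess.clone();
        double fBest = f.applyAsDouble(best);
        int used = 1, localBudget = 50;
        while (used + localBudget <= budget) {
            // Restart point: current best with some coordinates resampled in [-5, 5].
            double[] start = best.clone();
            for (int j = 0; j < start.length; j++) {
                if (rng.nextBoolean()) start[j] = -5.0 + 10.0 * rng.nextDouble();
            }
            local.setInitialSolution(start);
            TrendSketch ft = local.execute(f, localBudget);
            overall.append(ft); // build the overall fitness trend
            used += localBudget;
            if (ft.last() < fBest) { fBest = ft.last(); best = local.getFinalBest(); }
        }
        finalBest = best;
        return overall;
    }

    public static void main(String[] args) {
        ToDoubleFunction<double[]> sphere = x -> { double s = 0; for (double v : x) s += v * v; return s; };
        IteratedLocalSearchSketch ils = new IteratedLocalSearchSketch();
        ils.setInitialSolution(new double[] {3.0, 3.0});
        TrendSketch trend = ils.execute(sphere, 501);
        System.out.println("recorded values: " + trend.values.size());
        System.out.println("best fitness: " + sphere.applyAsDouble(ils.getFinalBest()));
    }
}
```

The outer algorithm never needs to know how the local searcher works; it only sets the starting point, calls `execute` and collects the returned trend.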

Based on similar considerations, optimisation problems are also added to SOS by following a template defined in the abstract class Problem. Indeed, regardless of their nature, i.e., black-box, grey-box, real-world or synthetic testbed, all optimisation problems share similar attributes indicating their dimensionality (i.e., the number of design variables), the boundaries delimiting the search space in which the optimal solution must be found and an optional name to describe the problem. All these fields are accessed and changed with setter and getter methods already implemented in Problem. This way, no coding is required to deal with such aspects, thus allowing SOS users to focus on writing the body of only one abstract method, f, which obviously represents the fitness function. When the execute method of an algorithm is called, a reference to a Problem variable P is passed to the algorithm, which evaluates the fitness value y with the code line y = P.f(x), with x being a candidate solution. To ensure that the return value is meaningful, algorithms inherit from the superclass Algorithm the function x = correct(x,bounds), which can be used before performing the fitness function evaluation. The latter executes the correction strategy specified when the algorithm object is instantiated (the default strategy is the toroidal correction [1]). The Problem class comes with multiple constructors so that problems can be instantiated in different ways. Usually, real-world applications have a fixed number of design variables and fixed boundaries, while benchmark functions are expected to be scalable and with adjustable search space boundaries. Thus, being able to select the most appropriate constructor is very convenient. Finally, it is worth recalling that the methods from the MatLab class can aid the implementation of a novel problem and that, if a benchmark function displaying particular features is needed, its implementation might already be available amongst those indicated in Section 3.
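A minimal sketch of this template, together with a toroidal correction, is given below. The classes `ProblemSketch`, `SphereSketch` and `CorrectionSketch` are hypothetical simplifications and not the SOS classes. The correction shown is one common formulation of the toroidal strategy, assumed here for illustration: each variable is normalised to [0, 1], only the fractional part is kept so that a solution leaving one side of the domain re-enters from the opposite side, and the result is scaled back.

```java
// Hypothetical sketch of a problem template; NOT the SOS Problem class.
abstract class ProblemSketch {
    private final int dimension;
    private final double[][] bounds; // bounds[i] = {lower, upper}

    ProblemSketch(int dimension, double lower, double upper) {
        this.dimension = dimension;
        this.bounds = new double[dimension][];
        for (int i = 0; i < dimension; i++) bounds[i] = new double[] {lower, upper};
    }

    int getDimension() { return dimension; }
    double[][] getBounds() { return bounds; }

    // The only method a concrete problem has to implement.
    abstract double f(double[] x);
}

class SphereSketch extends ProblemSketch {
    SphereSketch(int n) { super(n, -5.0, 5.0); } // bounds chosen for illustration

    @Override
    double f(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v * v;
        return sum;
    }
}

class CorrectionSketch {
    // Toroidal correction (assumed formulation): wrap each variable around its interval.
    static double[] toroidal(double[] x, double[][] bounds) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            double lower = bounds[i][0], upper = bounds[i][1];
            double t = (x[i] - lower) / (upper - lower); // normalise to [0, 1]
            t = t - Math.floor(t);                       // keep the fractional part
            y[i] = lower + t * (upper - lower);          // scale back
        }
        return y;
    }

    public static void main(String[] args) {
        ProblemSketch p = new SphereSketch(3);
        double[] x = {6.0, -7.0, 1.0};                    // first two variables are infeasible
        double[] corrected = toroidal(x, p.getBounds());
        System.out.println(java.util.Arrays.toString(corrected));
        System.out.println("fitness = " + p.f(corrected)); // evaluate only after correcting
    }
}
```

With this formulation, a variable lying exactly on the upper bound is mapped to the lower bound, which is still a feasible value.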

(**a**) Initialisation phase (**b**) Optimisation phase

**Figure 3.** Algorithmic design and implementation in SOS. This example shows portions of code from the algorithm class RI1p1CAMES, located in the algorithms package, which implements the algorithm proposed in [73]. On the left-hand side (**a**), the initialisation phase, where the object cma11, an instance of the class CMAES\_11 (which extends Algorithm), is initialised and made ready to be executed inside another class extending Algorithm. On the right-hand side (**b**), the implementation of an iterated local search method using cma11 as a local searcher. Note that the FTrend variable ft returned by cma11 is automatically appended to FT to obtain the overall fitness trend.

#### **3. Benchmarking with SOS**

Due to the difficulties in dealing with real-world black-box problems, the metaheuristic optimisation research community started developing artificially built functions [75] to:


Since the publication of the first testbed problems [75], several, ever more challenging benchmark suites have been released on an annual basis. These usually consist of heterogeneous groups of functions displaying similar features in terms of modality, separability and ill-conditioning. Despite the high number of suites in the literature, their technical reports show similarities. In particular, a set of established functions, such as the Michalewicz, Ackley, Rastrigin, De Jong and Schwefel functions, is almost always present. Hence, many of these testbeds would be basically equivalent, were it not for the recent tendency to increase the degree of difficulty by also including hybrid functions, obtained by combining or composing the aforementioned basic functions or by shifting and rotating them. Even though the utility of (often over-complicated) compositions of functions, which are sometimes as cryptic as black-box real-world problems, is arguable, these have constantly been employed to propose challenging competitions for stochastic optimisation.

To remove the burden of implementing such a high number of testbed problems, the SOS platform comes with full implementations of:

	- CEC 2005 [76];
	- CEC 2008 for Large Scale Global Optimisation (LSGO) [77];
	- CEC 2010 for Large Scale Global Optimisation (LSGO) [78];
	- CEC 2013 [79];
	- CEC 2013 for Large Scale Global Optimisation (LSGO) [80];
	- CEC 2014 [2];
	- CEC 2015 [81].

It is worth indicating that some applications from the CEC 2011 benchmark suite for real-world optimisation [42] are also available, implemented in the applications package.

#### *The R-CEC14 Benchmark Suite*

The original IEEE CEC 2014 benchmark suite [2] consisted of 30 functions for single-objective real-parameter numerical optimisation. These problems are obtained by shifting, rotating and ill-conditioning well-established functions in the fields of computational optimisation. However, only two of the 30 mathematical functions are not subject to rotation, i.e., Function Number 8 (shifted Rastrigin's function) and Function Number 10 (shifted Schwefel's function), but they do have a rotated counterpart in the benchmark suite.

For this reason, one could argue that only these four functions, i.e., Function Numbers 8, 10 and their rotated counterparts, are insufficient to draw interesting conclusions on:


Indeed, rotating a separable function does not alter its modality or ill-conditioning features, but does alter its separability. To further investigate this effect, SOS contains a redesign of the original CEC 2014 benchmark (used, for instance, in [53]), in which the rotation can be activated or deactivated by simply setting a flag (R). If R is equal to one, the rotation is active, and the functions are identical to those of CEC 2014. Otherwise, no rotation takes place. To avoid duplicates, Functions 8 and 10 are removed since they are obtainable from their rotated counterparts, i.e., Functions 9 and 11, by setting the rotation flag equal to zero. Thus, the resulting R-CEC14 suite contains 56 functions:


These problems are optimised within a given search space D defined as [−100, 100]<sup>*n*</sup>, with *n* ∈ {10, 30, 50, 100} being the admissible dimensionality values. For each problem, the corresponding *n* × *n* rotation matrices are stored in the benchmarks.problemsImplementation.CEC2014.files\_cec2014 package [17] and loaded when the rotation is active. The minimum fitness function value *f*<sub>min</sub> = *f*(*x*<sub>min</sub>), with *x*<sub>min</sub> = argmin *f*(*x*), *x* ∈ D, is shown for each problem in Table 2. More detailed information can be found in [2] or by inspecting the source code [17] and the online documentation.
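The role of the rotation flag can be illustrated with a minimal sketch, assuming a shifted Rastrigin function and a small hand-written rotation matrix. The class and method names are hypothetical, and the scaling and bias terms of the actual CEC 2014 definitions are deliberately omitted.

```java
/** Hypothetical sketch of a benchmark function with an optional rotation (not SOS code). */
final class RotationFlagSketch {
    /** Basic Rastrigin function: separable, with minimum 0 at the origin. */
    static double rastrigin(double[] z) {
        double sum = 0.0;
        for (double zi : z) {
            sum += zi * zi - 10.0 * Math.cos(2.0 * Math.PI * zi) + 10.0;
        }
        return sum;
    }

    /** Evaluates f(M (x - o)) when rotate is true (R = 1), f(x - o) otherwise (R = 0). */
    static double evaluate(double[] x, double[] shift, double[][] m, boolean rotate) {
        int n = x.length;
        double[] shifted = new double[n];
        for (int i = 0; i < n; i++) shifted[i] = x[i] - shift[i];
        if (!rotate) return rastrigin(shifted);
        double[] z = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) z[i] += m[i][j] * shifted[j];
        }
        return rastrigin(z);
    }

    public static void main(String[] args) {
        double[] shift = {1.0, -2.0};
        double c = Math.cos(Math.PI / 6), s = Math.sin(Math.PI / 6);
        double[][] m = {{c, -s}, {s, c}};   // 30-degree rotation matrix
        double[] x = {1.5, -2.0};
        System.out.println(evaluate(x, shift, m, false)); // rotation off
        System.out.println(evaluate(x, shift, m, true));  // rotation on
    }
}
```

The optimum stays at the shift vector in both cases, while the fitness landscape around it changes: with the rotation active, each design variable influences every term of the sum, so the function can no longer be optimised one variable at a time.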

**Table 2.** R-CEC14 (R stands for rotation flag) benchmark suite. Each problem can be evaluated with and without the action of the rotation. Further details on these problems and the rotation procedure are available at [2].


#### **4. Statistical Analysis with SOS**

Stochastic algorithms can be evaluated, e.g., by measuring their overhead (in SOS, the class mains.test.TestOverhead can be used for this purpose), by calculating their time and memory complexity, in terms of scalability, average performances (i.e., final fitness value returned by the algorithm, averaged over multiple runs) and, qualitatively, by visual inspection of the fitness trend graphs. However, in order to claim that an algorithm is capable of outperforming one or more competing algorithms, when tested on a specific problem, or a set of multiple problems, statistical evidence must be sought.

A review of the statistical tests to be used for analysing and comparing stochastic algorithms can be found in [13]. Amongst the suggested methods, non-parametric tests such as the Holm test [84] and the Wilcoxon signed-rank test [85] are commonly employed since results collected over multiple runs of stochastic algorithms are not necessarily normally distributed. Several recent studies adopted similar variants of these methods, namely the Wilcoxon rank-sum test and the Holm–Bonferroni test, which can now be considered quite established in the field [3,4,24,41,86].

To facilitate the use of established and advanced statistical tests for evaluating the performance of stochastic optimisation algorithms, SOS provides implementations of:


#### *4.1. The Wilcoxon Rank-Sum Test*

The Wilcoxon rank-sum test [87], also known as the Mann–Whitney U-test [88], is a non-parametric test used to understand whether two independent samples belong to populations having the same distribution. Unlike similar parametric counterparts, such as the two-sample unpaired t-test [89], it does not operate on the data themselves, but on scores, referred to as ranks, which are associated with the actual values and for which assumptions on the distribution can be made. This makes it suitable for analysing the results of stochastic algorithms, whose distributions may vary significantly over different problems and combinations of parameters, while the distributions of their ranks can be assumed to follow a normal model.

To describe the procedure implemented in SOS, let us consider two generic algorithms A and B running on a generic problem P, respectively nA and nB times (usually, nA = nB, although having an unbalanced number of runs would not prevent one from using the Wilcoxon rank-sum test). The null-hypothesis is then formulated as H0 : A = B, which means that, regardless of their working logic, the two algorithms are statistically equivalent, i.e., they are instances of the same population as the solutions provided over multiple runs are equally distributed. This is graphically shown in Figure 4. The Wilcoxon rank-sum test can then be used to produce evidence to reject H0. Rejecting H0 would instead support the so-called alternative hypothesis, which is formulated as H1 : A < B or H1 : A > B, if a one-sided (also known as one-tailed) test is performed, or H1 : A ≠ B, if a two-sided (also known as two-tailed) test is performed.

By default, SOS performs the two-tailed test. However, both the one-sided and the two-sided tests are available in the platform. Indications on how to run these tests can be found in the online documentation. Regardless of the specific variant, i.e., one- or two-sided, the null-hypothesis always has the same formulation and can be either accepted or rejected as shown in Figure 4. From the outcome of the test, conclusions can then be drawn on the comparison of the performance of the algorithms A and B. For example, if H0 : A = B is rejected in a two-sided test, indicators such as the average fitness value or the median fitness value across the performed runs can be used to understand which algorithm outperforms the other on a given problem P. This is not necessary if a one-sided variant of the test is employed.

The steps in Sections 4.1.1–4.1.5 describe the decision-making process implemented in the SOS platform in order to apply the Wilcoxon rank-sum test.

**Figure 4.** Hypothesis testing. On the top row, the test fails at rejecting the null-hypothesis (i.e., H0 is accepted) as the algorithms A and B have statistically similar distributions. On the bottom row, H0 is rejected as A and B are two different stochastic processes with statistically different distributions.

#### 4.1.1. Reference and Comparison Algorithms

To perform a statistical test, a "reference" is needed. When asked to perform the Wilcoxon rank-sum test on the results generated with an experiment E, SOS shows the list of algorithms executed in E so that the reference can be indicated. If not specified, SOS assumes that the reference algorithm is the first one added to E. Without loss of generality, let us indicate the reference algorithm with A. The remaining algorithms in E will form a set of "comparison" algorithms. If more than one comparison algorithm is present, SOS will iteratively schedule a Wilcoxon rank-sum test to compare the reference algorithm A with each comparison algorithm B, taken from such a set, for each problem P in E. The order of the comparison algorithms can be specified. This will be the order of appearance of the algorithms in the automatically generated result table, such as those shown in the graphical examples included in Section 5. By pressing "c" during the order selection process, a table will be created if the comparison set is not empty. This way, multiple, smaller result tables can be incrementally generated from a single experiment. If the "a" (i.e., all) option is used instead, a single long table for the whole experiment E will be generated. This will contain as many columns as algorithms in E and as many rows as problems in E.

#### 4.1.2. Assigning Ranks

Let us consider the totality of the observations obtained from the N = nA + nB runs. Without loss of generality, let us refer to minimisation problems, for which the lower the fitness value, the better the algorithm's performance. Observations are then sorted in ascending order to assign ranks *ri* = *i* (with *i* = 1, 2, ... , N) so that the smallest value has Rank 1 (i.e., *r*<sup>1</sup> = 1), the second smallest value has Rank 2 (i.e., *r*<sup>2</sup> = 2), and so on, until the observation with the greatest value, which has rank *r*<sup>N</sup> = N.

This process needs to be slightly modified if the dataset presents "ties", i.e., observations with identical numerical value, to which SOS assigns the same rank, computed as the average of their position indices *i* in the ordered sequence. This intermediate value better represents the distribution of the original results. To clarify this case with a numerical example, let us suppose that four observations have ordinal Ranks 4, 5, 6, and 7, but an equal fitness value. To make sure that the rank distribution is a good representation of the real distribution, these four ties will be assigned the same rank value *r<sub>j</sub>* = (4 + 5 + 6 + 7)/4 = 5.5 (*j* ∈ {4, 5, 6, 7}).
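The rank assignment with averaged ties can be sketched as follows. This is a minimal illustration with a hypothetical helper, averageRanks, and is not taken from the SOS source code.

```java
import java.util.Arrays;

/** Hypothetical helper illustrating rank assignment with averaged ties (not SOS code). */
final class RankSketch {
    /** Returns the rank of each observation; tied values share the mean of their positions. */
    static double[] averageRanks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // sort the indices by ascending fitness (minimisation)
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        double[] ranks = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            // extend the block while the next sorted value equals the current one
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) end++;
            double shared = (start + 1 + end + 1) / 2.0; // mean of the 1-based positions
            for (int k = start; k <= end; k++) ranks[order[k]] = shared;
            start = end + 1;
        }
        return ranks;
    }

    public static void main(String[] args) {
        // positions 4 to 7 in the sorted sequence hold the same fitness value
        double[] fitness = {0.1, 0.2, 0.3, 0.7, 0.7, 0.7, 0.7, 0.9};
        System.out.println(Arrays.toString(averageRanks(fitness)));
        // prints [1.0, 2.0, 3.0, 5.5, 5.5, 5.5, 5.5, 8.0]
    }
}
```

The four tied observations receive the rank 5.5 of the numerical example above, and the total of all ranks is the same as it would be without ties.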

#### 4.1.3. The Rank Distributions

Let us indicate with WA a random variable representing the rank sum associated with A. Its probability distribution is tabulated only for small sample sizes [90], i.e., N < 20, which can be very small for optimisation experiments. To overcome this limitation, SOS implements such a distribution to deal also with larger sample sizes and obtain more accurate evaluation results. To do this (see [87,90]), it is sufficient to define the normal distribution N(*μ*<sub>A</sub>, *σ*<sub>A</sub>), whose mean value *μ*<sub>A</sub> and standard deviation *σ*<sub>A</sub> are calculated as:

$$
\mu_{\mathsf{A}} = \mathsf{n}_{\mathsf{A}} \frac{\mathsf{N} + 1}{2} \quad \text{and} \quad \sigma_{\mathsf{A}} = \sqrt{\mathsf{n}_{\mathsf{A}} \mathsf{n}_{\mathsf{B}} \frac{\mathsf{N} + 1}{12}}.
$$
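As a worked example, recall that under H0 the normal approximation of the rank-sum distribution has mean nA(N + 1)/2 and standard deviation equal to the square root of nA nB (N + 1)/12. Taking nA = nB = 30 runs (an illustrative choice), so that N = 60, gives:

$$
\mu_{\mathsf{A}} = \frac{30 \cdot 61}{2} = 915 \quad \text{and} \quad \sigma_{\mathsf{A}} = \sqrt{\frac{30 \cdot 30 \cdot 61}{12}} = \sqrt{4575} \approx 67.64.
$$

A rank sum for A far below 915 therefore indicates that A tends to return lower (better) fitness values than B.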

#### 4.1.4. Wilcoxon Rank-Sum Statistic and *p*-Value

Let us fill a set *R*<sub>A</sub> with the nA ranks *r<sub>i</sub>* associated with the observations from the reference algorithm A. The so-called Wilcoxon rank-sum statistic *w*<sub>A</sub> is calculated by summing all these ranks:

$$w_{\mathsf{A}} = \sum_{r \in R_{\mathsf{A}}} r$$

and is then used to define the *p*-value. For the default case, i.e., the two-sided test with H0 : A = B versus H1 : A ≠ B, this is formulated as follows:

$$p\text{-value} = \begin{cases} 2 \cdot \text{Prob}\{\mathcal{W}_{\text{A}} \ge w_{\text{A}}\} & \text{if } w_{\text{A}} > \mu_{\text{A}} \quad \text{(i.e., } w_{\text{A}} \text{ is in the upper tail)} \\ 2 \cdot \text{Prob}\{\mathcal{W}_{\text{A}} \le w_{\text{A}}\} & \text{otherwise} \quad \text{(i.e., } w_{\text{A}} \text{ is in the lower tail)} \end{cases}$$

where the probability of falling into the tail of the distribution closest to *w*<sub>A</sub> is doubled in order to account for both the cases in which the two algorithms differ, i.e., A > B and A < B. This is not needed for the one-sided test, for which the null-hypothesis H0 : A = B is either tested against the alternative hypothesis H1 : A < B, with a corresponding *p*-value = Prob{WA ≤ *w*<sub>A</sub>}, or against the alternative hypothesis H1 : A > B, with a corresponding *p*-value = Prob{WA ≥ *w*<sub>A</sub>}. A graphical explanation of the null-hypothesis testing process is given in Figure 5.

**Figure 5.** A graphical explanation of the hypothesis testing process for a generic one-sided test with alternative hypothesis H1 : A > B and *α* = 0.05. The area under the normal distribution, highlighted in red, is equal to *α* and indicates the rejection zone. Indeed, any *x* > 0.658 would return a *p*-value (i.e., area under the distribution) lower than *α*, thus rejecting H0. Conversely, for all *x* < 0.658, the corresponding *p*-value would be greater than *α*, thus failing to reject H0. In a two-sided test, the red area on the right-hand side should have a symmetric counterpart on the left-hand side. The two resulting tails, each one casting an area of 0.025 (i.e., *α*/2 = 0.05/2, so that the total significance level is 0.05 and the corresponding confidence is 1 − *α* = 95%) would form the rejection zone.

The *p*-value provides an indication of the plausibility of H0 by calculating the probability of obtaining test results at least as extreme as those observed experimentally, assuming that H0 is correct. This probability can then be used to decide whether the null-hypothesis can be trusted to be true or must be rejected. It can be calculated by numerically integrating the distribution of the statistic for the specific test, in this case N(*μ*<sub>A</sub>, *σ*<sub>A</sub>), with *μ*<sub>A</sub> and *σ*<sub>A</sub> calculated as indicated in Section 4.1.3, and then using the appropriate definition of the *p*-value for the specific test, i.e., one- or two-sided, as described before.
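A minimal sketch of the two-sided *p*-value computation is given below. It assumes the normal approximation with mean nA(N + 1)/2 and standard deviation sqrt(nA nB (N + 1)/12), uses a closed-form polynomial approximation of the standard normal cumulative distribution (Abramowitz and Stegun, formula 26.2.17) in place of numerical integration, and applies no tie or continuity correction. All class and method names are hypothetical and do not correspond to the SOS implementation.

```java
/** Hypothetical sketch of the two-sided rank-sum p-value under the normal approximation. */
final class RankSumPValueSketch {
    /** Standard normal CDF via the Abramowitz-Stegun polynomial approximation 26.2.17. */
    static double normalCdf(double x) {
        if (x < 0) return 1.0 - normalCdf(-x);
        double t = 1.0 / (1.0 + 0.2316419 * x);
        double poly = t * (0.319381530 + t * (-0.356563782
                + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        double density = Math.exp(-0.5 * x * x) / Math.sqrt(2.0 * Math.PI);
        return 1.0 - density * poly;
    }

    /** Sums the ranks of the reference algorithm A. */
    static double rankSum(double[] ranksOfA) {
        double w = 0.0;
        for (double r : ranksOfA) w += r;
        return w;
    }

    /** Two-sided p-value: twice the probability of the tail closest to wA. */
    static double twoSidedPValue(double wA, int nA, int nB) {
        int n = nA + nB;
        double mu = nA * (n + 1) / 2.0;
        double sigma = Math.sqrt(nA * (double) nB * (n + 1) / 12.0);
        double z = (wA - mu) / sigma;
        double tail = (wA > mu) ? 1.0 - normalCdf(z) : normalCdf(z);
        return Math.min(1.0, 2.0 * tail);
    }

    public static void main(String[] args) {
        // A holds the five smallest observations out of ten (ranks 1 to 5)
        double wA = rankSum(new double[] {1, 2, 3, 4, 5});
        double p = twoSidedPValue(wA, 5, 5);
        System.out.println("wA = " + wA + ", p-value = " + p);
    }
}
```

In this illustrative case the rank sum of A lies well below its expected value under H0, so the resulting *p*-value is small.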

#### 4.1.5. Decision-Making

To test whether or not H0 is rejected, the calculated *p*-value is compared to a threshold *α*, commonly referred to as the significance level. This value represents the probability of making a so-called "Type I" error, which occurs when a true null-hypothesis is incorrectly rejected, as indicated in Table 3.

The *α* value is arbitrarily chosen. Usually, this is a small number amongst 0.10 (one Type I error chance in 10 decisions is tolerated), 0.05 (one Type I error chance in 20 decisions is tolerated), and 0.01 (one Type I error chance in 100 decisions is tolerated). It is worth mentioning that if a decision is made with a probability *α* of making a Type I error, its correctness can be trusted with a probability of 1 − *α*. This figure is referred to as the confidence level, and it is sometimes provided in the literature instead of *α*.


**Table 3.** Table of truth, also known as the confusion matrix.

By default, SOS performs the Wilcoxon rank-sum test with *α* = 0.05, which is a common value for non-parametric tests, unless differently specified before running the test. This results in a confidence level of 95%. To employ a different *α* value, a setter method is provided (see the online documentation). Other methods are also available to switch between two-sided and one-sided tests and to decide whether or not *p*-values are shown on screen.

To conclude, once the *p*-value is computed and the significance level *α* is chosen, a final decision is made by means of the following logic:


#### *4.2. The Holm–Bonferroni Test*

The Holm test is a non-parametric and sequentially-rejective procedure for multiple-hypothesis testing, which aims at rejecting one hypothesis at a time until no further rejection is possible [84].

In the optimisation field, this test can be used to compare the reference algorithm with more than one comparison algorithm over multiple problems simultaneously. This provides a different and more global view of the comparison with respect to the one provided by the Wilcoxon rank-sum test, which is instead problem-specific. In this light, the two tests complement each other, and it is suggested to always use both (or equivalent tests) to statistically analyse the results obtained with empirical experimentation.

The original Holm test was designed with the intent of guaranteeing a reasonably low family-wise error rate (FWER) [13,84], an error plaguing multiple-hypothesis testing and known to increase when the number of hypotheses increases. To further improve upon this aspect and keep the FWER low even when the number of hypotheses is high, several "corrections" for obtaining more informed *p*-values have been proposed, such as the well-established Bonferroni's correction [91].

SOS implements a simple Holm–Bonferroni test for comparing stochastic algorithms consisting of the steps described in Sections 4.2.1–4.2.4.

#### 4.2.1. Choosing the Reference Algorithm

Let us consider a generic experiment E containing NA algorithms and NP problems. When the test is performed on E, SOS will display the list of available algorithms and ask the user to select the reference algorithm. Unlike the case described in Section 4.1.1 for the Wilcoxon rank-sum test, this step can change the output of the test. Indeed, in the one-to-one comparison performed with a Wilcoxon rank-sum test, exchanging A with B would not alter the rank distributions obtained on P. Conversely, the NA − 1 statistics computed in the Holm–Bonferroni test do depend on the rank of the reference algorithm. In this light, if the goal is to test the performance of an algorithm A in E against the other algorithms, A should play the role of the reference algorithm. On the contrary, if the goal is to test which algorithm has the best overall performance in E, it could be necessary to run the test twice. The first time, a random reference is chosen. The second time, the algorithm with the highest rank must be found and selected to be the reference for a second round of the test. Once a reference algorithm is selected, SOS proceeds with the test.

#### 4.2.2. Assigning Ranks

First, the average final fitness values must be computed based on the values returned by the NA algorithms after the execution of multiple runs on the NP problems. SOS automatically collects the text files containing the information stored in FTrend objects and provides the NP final average values for each algorithm. It is worth mentioning that SOS processes this information only if it was not previously requested for another test or graphical procedure (in which case the values are retrieved and immediately returned, thus minimising overheads). Then:

	- the average score of the reference algorithm is referred to as rank *R*<sub>0</sub>;
	- the remaining NA − 1 average scores are used to sort the corresponding algorithms in descending order and constitute their ranks, which are indicated with *R<sub>i</sub>* (*i* = 1, 2, ... , NA − 1). These ranks provide a first indication of the global performance of the algorithms on E, and their order will be automatically displayed in the form of a "league" table.

#### 4.2.3. Z-Statistics and *p*-Values

For each *i*-th comparison algorithm (*i* = 1, 2, ... , NA − 1), this test requires the use of the so-called "z-value" statistic [13], calculated with the formula:

$$z_i = \frac{R_i - R_0}{\sqrt{\text{NA} \frac{\text{NA} + 1}{6 \text{NP}}}}$$

This allows for the determination of NA − 1 *p*-values, one for each comparison algorithm, by evaluating the cumulative standard normal distribution at the corresponding *z<sub>i</sub>*.
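As a worked example with purely illustrative values, consider NA = 5 algorithms, NP = 30 problems, a reference rank *R*<sub>0</sub> = 4.0 and a comparison algorithm with rank *R<sub>i</sub>* = 2.5. Assuming that the *p*-value is the standard normal cumulative distribution Φ evaluated at *z<sub>i</sub>*:

$$z_i = \frac{2.5 - 4.0}{\sqrt{5 \cdot \frac{5 + 1}{6 \cdot 30}}} = \frac{-1.5}{\sqrt{1/6}} \approx -3.67 \quad \text{and} \quad p_i = \Phi(z_i) \approx 1.2 \times 10^{-4}.$$

The larger the gap between the two ranks, the more negative the z-value and the smaller the resulting *p*-value.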

#### 4.2.4. Sequential Decision-Making

When all the previous steps have been completed, SOS proceeds with a sequential decision-making process in which A is tested against each comparison algorithm, following the order obtained as described above. A similar procedure to that presented in Section 4.1.5 for the Wilcoxon rank-sum test is therefore iterated NA − 1 times. However, the rejection threshold requires an adjustment due to the multiple-hypothesis settings. In detail, for each *i*-th comparison algorithm (*i* = 1, 2, ... , NA − 1), taken in the order above, the corresponding *p<sub>i</sub>*-value and threshold *α*/*i* are compared and a decision on the null-hypothesis is made as follows:


Finally, a table displaying the outcome of the test is automatically generated. Some examples of Holm–Bonferroni tables are shown in Section 5. It is worth remarking that SOS employs a default *α* = 0.05 also for this test. However, this can be changed as previously pointed out in Section 4.1.
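A sequential decision of this kind can be sketched as follows. The ordering (descending *p*-values) and the *α*/*i* thresholds follow the description above, whereas the stop-at-first-failure rule is that of the textbook Holm procedure and is an assumption of this sketch rather than a detail taken from the SOS source code. The class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of a Holm-style sequential decision (not the SOS implementation). */
final class HolmSketch {
    /**
     * pDescending[i - 1] is the p-value of the i-th comparison algorithm, with the
     * algorithms taken in descending order of p-value; its threshold is alpha / i.
     * Returns true where the null-hypothesis is rejected.
     */
    static boolean[] rejections(double[] pDescending, double alpha) {
        int m = pDescending.length;
        boolean[] rejected = new boolean[m];
        // start from the smallest p-value, which faces the strictest threshold alpha / m
        for (int i = m; i >= 1; i--) {
            if (pDescending[i - 1] < alpha / i) {
                rejected[i - 1] = true;
            } else {
                break; // first failure: every remaining hypothesis is retained
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        double[] p = {0.30, 0.02, 0.001}; // illustrative values only
        System.out.println(Arrays.toString(rejections(p, 0.05)));
        // prints [false, true, true]
    }
}
```

Dividing *α* by the position index makes the test stricter for the comparisons with the smallest *p*-values, which is how the procedure keeps the family-wise error rate under control.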

#### *4.3. Advanced Statistical Analysis*

The SOS platform provides a novel advanced statistical analysis procedure, referred to as the ASA procedure, for comparing couples of algorithms on a single problem. This procedure makes use of several statistical tests, and it is based on a simple workflow, displayed in Figure 6. The main rationale of ASA is that non-parametric tests are preferable if the assumption of normality cannot be made on the results' distributions, whereas parametric tests should be used if statistical evidence is found in support of such an assumption. Therefore, the ASA procedure first checks the distributions of the results of two selected algorithms, A and B, by looking for such evidence, and then applies the most appropriate tests accordingly.

**Figure 6.** ASA workflow diagram.

To launch the ASA routine, a TableAvgStdStat object must be instantiated with the useAdvancedStatistic Boolean flag set to true; see Figure 7. If such a flag is not activated, the Wilcoxon rank-sum test is performed. It should be noted that the outcomes of the two tests can differ, especially when the number of runs is lower than 100. In such a case, it is strongly recommended to use ASA for a more accurate comparison. Conversely, when a very high number of runs is available, ASA and Wilcoxon rank-sum are comparable. For this reason, the default number of runs performed by SOS is 100. However, this number of runs might be impractical when facing time-consuming real-world optimisation problems. Hence, as explained in the previous sections (see Figure 2), SOS provides a setter method for specifying the number of runs to be performed in a given experiment. The most common values used in the literature range from 30 to 60 runs. When the ASA procedure is activated, SOS performs the following steps:

	- if variances are equivalent, it is concluded that both A and B have normal distributions, which suggests the use of a two-sided T-test [94,95] with null-hypothesis H0 : A ≠ B to make a decision according to the following logic:
		- if H0 is rejected, A and B are two equivalent stochastic processes. The test terminates.
		- if the test fails to reject H0, a one-sided T-test [94,95] is performed to test if A outperforms B, i.e., H0 : A < B is rejected, or B outperforms A, i.e., H0 : A > B is rejected. The test terminates.
	- if variances are not equivalent, the Welch T-test [96,97] is used to test the null-hypothesis H0 : A ≠ B;
		- if H0 is rejected, A and B are two equivalent stochastic processes. The test terminates.
		- if the test fails to reject H0, a one-sided Welch T-test [96,97] is performed to test if A outperforms B, i.e., H0 : A < B is rejected, or B outperforms A, i.e., H0 : A > B is rejected. The test terminates.
	- if the null-hypothesis is rejected, A and B are two equivalent stochastic processes. The test terminates.
	- if the test fails to reject H0, a one-sided Wilcoxon rank-sum test is performed to test if A outperforms B, i.e., H0 : A < B is rejected, or B outperforms A, i.e., H0 : A > B is rejected. The test terminates.

It should be noted that the ASA procedure formulates the null-hypothesis as H0 : A ≠ B, while the stand-alone version of the Wilcoxon rank-sum test explained in Section 4.1 is implemented by considering H0 : A = B. This should not generate confusion. The null-hypothesis can indeed be arbitrarily chosen, and it is formulated differently here to facilitate the implementation of the test.
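The branching at the core of the ASA rationale, i.e., parametric tests when normality is supported and non-parametric tests otherwise, can be summarised with a small sketch. The enum, the method and its two Boolean inputs are hypothetical. How the normality and variance checks are carried out is not covered by this sketch.

```java
/** Hypothetical sketch of the ASA branching logic (not SOS code). */
final class AsaSelectionSketch {
    enum TestKind { STUDENT_T, WELCH_T, WILCOXON_RANK_SUM }

    /**
     * Chooses the family of tests from the outcome of the preliminary checks.
     * bothNormal: evidence of normality was found for the results of A and B.
     * equalVariances: the variances of the two samples were found to be equivalent.
     */
    static TestKind select(boolean bothNormal, boolean equalVariances) {
        if (!bothNormal) return TestKind.WILCOXON_RANK_SUM; // non-parametric branch
        return equalVariances ? TestKind.STUDENT_T : TestKind.WELCH_T;
    }

    public static void main(String[] args) {
        System.out.println(select(true, true));   // prints STUDENT_T
        System.out.println(select(true, false));  // prints WELCH_T
        System.out.println(select(false, false)); // prints WILCOXON_RANK_SUM
    }
}
```

In each branch, the selected test would first be run in its two-sided form and, depending on the outcome, followed by the corresponding one-sided variant.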

#### **5. Visual Representation of Results**

With SOS, results can be displayed in several formats. After performing an experiment, raw data are located in the SOS results folder, unless differently indicated, and can be processed straight away or merged with those from other experiments before being processed. In the results folder, data are arranged in sub-folders with self-explanatory names so that they can be easily identified. The main folder comes with the name of the experiment and contains a text file describing it, as explained in Section 2.1, as well as sub-folders whose names refer to the benchmark suite used, the specific function identifier and the dimensionality of the problem. Inside each folder, a fitness trend file is stored for each performed run. In addition to the trend of the fitness values, these files also report the variables of the best solution found.

From the raw data, SOS can extrapolate information and create graphs and tables thanks to several auxiliary methods available in the platform. The class Experiments provides some of these routines for collecting data and transforming them into more useful formats. Some of these can be seen in Figure 7, which shows the main method of the TablesGenerator class located in the default package.

**Figure 7.** An example of the TablesGenerator class containing methods for visually displaying results in several different formats.

With reference to Figure 7, let us focus first on the experiment.setTrendsFlag(true,true) method. The first Boolean value, in the example set to true, indicates that while scanning the raw data for producing tables, SOS will simultaneously save a further text file with the data needed to plot an average fitness trend graph. The second Boolean value, i.e., the error flag, also set to true in the example, indicates that the graph will show the fitness trend in terms of average error w.r.t. the known optimum, rather than average fitness. This flag is mainly activated when optimising benchmark functions, for which the optimum is usually known. The outputs will look like the examples reported in Figure 8. More details regarding the generation of the fitness trends are given in Appendix A.

**Figure 8.** Example of average error trends produced with SOS for the study in [24]. In (**a**), the full image with the caption. In (**b**), a further example of an average error trend from the same study, plotted in both linear and logarithmic scale. More examples from several other studies were gathered in [98].

The next two methods, experiment.importData() and experiment.describeExperiment(), start this process and describe it (in the generated log file) iteratively. If the first flag passed to experiment.setTrendsFlag() is not active, the importData() method will collect data from the corresponding folders and process them, but will store in memory only the information required for the generation of tables, without saving the average trends' information into plottable files. Once the raw data are loaded in memory, SOS can extract the information to be displayed in tables. For example, the commands experiment.computeAVG() and experiment.computeSTD() will lead to a table where the average fitness value (or the average error value if the error flag is active) ± the corresponding standard deviation is shown. The command experiment.computeMedian() will instead return the median fitness value (or median error depending on the error flag) amongst the available runs. In the example, the three methods are used, but this is not compulsory. It is indeed possible to call only some or none of them. Moreover, other methods not included in this example are also available, e.g., those for displaying the best or the worst run. If the methods are not called at all, they will be called automatically (if needed) when the tables are generated, as shown in the following part of the example in the figure. Conversely, if they are called, it is suggested to use experiment.deleteFinalValues() to free some memory by discarding the final results and keeping only their average values (and the corresponding standard deviations). Before freeing the memory, one may also want to save the final values to plot histograms and distributions, which might be useful in some cases. This can be done in SOS by simply activating some additional flags. For more details on these advanced features, we refer the reader to the online software documentation.
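The quantities reported in such tables can be illustrated with a minimal, self-contained sketch. The helpers below are hypothetical and independent of the SOS Experiments class. In particular, the use of the sample standard deviation (with n − 1 in the denominator) is an assumption of this sketch.

```java
import java.util.Arrays;

/** Hypothetical helpers for summarising the final values of multiple runs (not SOS code). */
final class RunSummarySketch {
    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) sum += x;
        return sum / v.length;
    }

    /** Sample standard deviation (n - 1 in the denominator); an assumption of this sketch. */
    static double std(double[] v) {
        double m = mean(v), ss = 0.0;
        for (double x : v) ss += (x - m) * (x - m);
        return Math.sqrt(ss / (v.length - 1));
    }

    /** Median of the values; the input array is left untouched. */
    static double median(double[] v) {
        double[] s = v.clone();
        Arrays.sort(s);
        int mid = s.length / 2;
        return (s.length % 2 == 1) ? s[mid] : 0.5 * (s[mid - 1] + s[mid]);
    }

    public static void main(String[] args) {
        // final error values of eight runs (illustrative numbers only)
        double[] finalErrors = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
        System.out.printf("%.3e +/- %.3e (median %.3e)%n",
                mean(finalErrors), std(finalErrors), median(finalErrors));
    }
}
```

Once these summary values are available, the per-run final values are no longer needed for the tables, which is why discarding them frees memory without affecting the reported averages.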

At this point, the statistical tests described in Section 4 can be performed to produce tables in PDF and LATEX source code formats. To structure compact, but highly informative tables, SOS provides specific classes. Let us keep following the example in Figure 7. One can see that four classes are used to produce tables for the experiment object passed as an argument to the constructor. These classes are:


Of note, these four classes are all extensions of the abstract class TableStatistics, from which they inherit several miscellaneous methods, such as the setErrorFlag() method (used to indicate if tables should display average fitness or error values), the setRefenceAlgorithm() method (used to indicate the reference algorithm) and the execute() method (which implements the specific statistical test), as well as attributes, e.g., variables to store the significance levels *α*, confidence levels *δ* = 1 − *α*, flags, etc. If customised tables are needed, they can be obtained by extending the TableStatistics superclass and using the auxiliary methods provided in such a class to implement the execute() method, where new statistical tests and/or new table layouts can be defined.

Let us now describe in detail the tables produced with the TableHolmBonferroni and TableAvgStdStat classes, shown respectively in the examples reported in Figures 9 and 10.

With reference to Figure 9, which shows the outcome of the Holm–Bonferroni test, it can be noted that SOS reports the rank of the reference algorithm in the caption (which is automatically generated). Regardless of the number of benchmark or real-world problems added in the experiment file, SOS performs the test as described in Section 4.2 and displays the comparison algorithms in descending order according to their rank. As can be seen by comparing the Rank and *p<sub>j</sub>* columns, this corresponds to a descending order also for the *p*-values. This way, algorithms in the top positions are quite likely to behave similarly to the reference algorithm (i.e., the null-hypothesis is accepted), while those occupying lower positions behave worse than the reference algorithm (i.e., the null-hypothesis is rejected). Other relevant information, such as the z-statistic values (*z<sub>j</sub>*) and the normalised significance/confidence levels (*δ*/*j*), is also reported in the table.





**Figure 9.** Two examples of TableHolmBonferroni tables. In (**a**), some results from the study in [4] obtained over the non-rotated functions of the R-CEC14 benchmark suite presented in Section 3. More examples are available in the extended results files stored in the repository [99]. In (**b**), some results from the study in [73] obtained over the functions of the CEC 2014 benchmark suite [2]. Extended results for this study are available in the repository [98].
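To make the sequential accept/reject logic behind such a table more concrete, the following is a minimal Java sketch of the textbook Holm step-down procedure applied to a set of already computed *p*-values. It is an illustration only and not the SOS implementation: the class name HolmSketch and the method holm are hypothetical, and the ordering and threshold conventions used in Section 4.2 (ranks, *zj* and *δ*/*j*) may differ in detail from the ones assumed here.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration class; it is not part of SOS.
final class HolmSketch {

    /**
     * Textbook Holm step-down procedure.
     * Returns rejected[i] == true if the null-hypothesis associated with
     * pValues[i] is rejected at the family-wise significance level alpha.
     */
    static boolean[] holm(double[] pValues, double alpha) {
        int m = pValues.length;

        // Indices of the hypotheses, sorted by increasing p-value.
        Integer[] order = new Integer[m];
        for (int i = 0; i < m; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> pValues[i]));

        boolean[] rejected = new boolean[m];
        for (int k = 0; k < m; k++) {
            // The k-th smallest p-value (0-based k) is compared with alpha / (m - k).
            double threshold = alpha / (m - k);
            if (pValues[order[k]] <= threshold) {
                rejected[order[k]] = true;
            } else {
                break; // all remaining (larger) p-values are not rejected
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        double[] p = {0.01, 0.04, 0.03, 0.005};
        System.out.println(Arrays.toString(holm(p, 0.05)));
    }
}
```

The step-down structure explains why, once one comparison algorithm is found to be statistically indistinguishable from the reference, all algorithms with larger *p*-values are treated in the same way.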

Figure 10 instead shows two examples of comparative tables obtained with the class TableAvgStdStat. At first glance, it can be noticed that the best performance (in terms of average error) is highlighted in boldface. This is obtained by setting the useBold flag equal to true, as in the example of Figure 7. Generally, the boldface option is useful as it makes the "winner" algorithm easy to spot. However, it can be turned off by simply setting useBold to false. To strengthen the validity of the displayed results, the outcome of a statistical test is also added to the table. With reference to Figure 7, if the useAdvancedStatistic flag is activated (i.e., it is equal to true), the ASA test presented in Section 4.3 is performed. Otherwise, the Wilcoxon rank-sum test, described in Section 4.1, is performed. Regardless of the employed statistical test, the same compact notation is adopted to report its outcome in the table:






**Figure 10.** Two examples of TableAvgStdStat tables based on the Wilcoxon rank-sum test (**a**) and the ASA test (**b**), respectively. In (**a**), some results from the study in [73] obtained over the functions of the CEC 2014 benchmark suite [2]. In (**b**), some results from the study in [24] obtained over three real-world problems from the CEC 2011 benchmark suite [42]. Extended results for the studies in (**a**,**b**) are available in the repository [98].

It is important to report the outcome of the statistical tests next to the average fitness/error value. For example, with reference to Figure 10a, it can be noticed that although (1+1)-CMA-ES displays the best average error value for *f*2, this is quite likely to be an isolated event, as the "=" symbol next to its value suggests that its general behaviour is actually statistically equivalent to that of the reference algorithm, i.e., RI-(1+1)-CMA-ES. Indeed, the difference between the two average error values is very small, i.e., about 10<sup>−14</sup>. This difference is mathematically in favour of the comparison algorithm, but in practice it is not sufficient to state that the comparison algorithm outperforms the reference algorithm.

Online sources with further numerical tables and graphs produced with SOS are indicated in the Supplementary Materials section of this manuscript.

#### **6. Conclusions**

This paper highlighted the need for appropriate software platforms and procedures for rigorously evaluating, comparing, tuning and studying the behaviour of stochastic optimisation algorithms. Contrary to the current research trends, which led to an inflation of "novel" nature-inspired approaches whose algorithmic behaviours are often difficult to comprehend, we argued that regardless of their inspiring metaphor, modern optimisation algorithms should be considered simply as stochastic processes, due to their randomised nature, and therefore studied accordingly. The proposed SOS platform facilitates the design of stochastic optimisation algorithms by: (1) proposing a general approach for their implementation, which stresses the fact that metaheuristic algorithms are an implementation of the same concept (i.e., convergence to near-optimal solutions by alternating exploratory and exploitative phases); (2) providing mathematical tools to compare algorithms statistically regardless of their inspiring metaphor; (3) providing tools for studying the internal dynamics of population-based algorithms. The examples reported in this article illustrated how SOS can be used to address the aforementioned points and presented several successful studies where SOS played a key role in fine-tuning algorithm parameters and studying the applicability of algorithmic components on specific classes of optimisation problems. In this light, we can conclude that SOS is not only a useful software tool for designing new algorithms, but most importantly, it is also ideal for studying the algorithmic behaviour of established optimisation frameworks.

**Supplementary Materials:** Further material can be obtained from the following open-access repositories: Raw data and extended galleries of tables and fitness trend graphs, generated with SOS for some of the studies mentioned in this article, are available at www.doi.org/10.17632/6st7grtxfr.2 and http://www.doi.org/10.17632/psp65d2nbc.1; Extensive galleries of images obtained with the joint use of SOS, for generating data, and IOHprofiler, for plotting them, are available at www.doi.org/10.17632/zdh2phb3b4.2 and www.doi.org/10.17632/cjjw6hpv9b.1.

**Author Contributions:** Writing, original draft: F.C.; Conceptualisation: F.C.; Investigation: F.C.; Data Curation: F.C.; Writing, review and editing: F.C. and G.I.; Software Development: F.C. and G.I. All authors read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Terminology**

The following terminology is used in this manuscript:


#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A. Producing the Fitness Trend**

To save a plottable file containing the average fitness trend of an algorithm over a specific problem, SOS first fills an array (of a size equal to the computational budget, by default 5000 × *n* with *n* being the problem dimensionality) with the fitness values contained in the FTrend object returned by the algorithm. This is done for each performed run. It should be noted that the FTrend object contains a sequence of values that is monotonically decreasing (or increasing, depending on the problem), i.e., each element of the array contains the best fitness value found so far. These values reproduce the optimisation trend of a single run and can be averaged with those obtained from other runs in order to get an average fitness trend. The latter is stored in another array of equal size, with each element computed as the average of the corresponding elements from each single run. Subsequently, these arrays (the single-run arrays and the average fitness array) are downsampled to 500 elements that are equispaced in terms of the fitness evaluation counter. This results in better quality graphs (such as those shown in Figure 8), which are at the same time easier to handle and read. However, the number of points plotted in the fitness trend can be changed, thus allowing for a higher or lower number of fitness values to be included in the graphs to adapt to specific needs.
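The averaging and downsampling steps described above can be sketched in a few lines of Java. This is a simplified illustration under stated assumptions and not the actual SOS code: the class FitnessTrendSketch and its methods are hypothetical, each run is assumed to be already available as a plain double array of best-so-far values of equal length, and the rounding rule used to pick the equispaced indices is one possible choice among several.

```java
import java.util.Arrays;

// Hypothetical illustration class; it is not part of SOS.
final class FitnessTrendSketch {

    /** Element-wise average of equally long best-so-far trends (one row per run). */
    static double[] averageTrend(double[][] runs) {
        int budget = runs[0].length;
        double[] avg = new double[budget];
        for (double[] run : runs) {
            for (int i = 0; i < budget; i++) {
                avg[i] += run[i];
            }
        }
        for (int i = 0; i < budget; i++) {
            avg[i] /= runs.length;
        }
        return avg;
    }

    /**
     * Keeps `points` elements taken at (approximately) equispaced positions of
     * the evaluation counter, always including the first and the last element.
     */
    static double[] downsample(double[] trend, int points) {
        if (points < 2) {
            throw new IllegalArgumentException("at least two points are required");
        }
        if (points >= trend.length) {
            return trend.clone(); // nothing to discard
        }
        double[] out = new double[points];
        for (int k = 0; k < points; k++) {
            int idx = (int) Math.round((double) k * (trend.length - 1) / (points - 1));
            out[k] = trend[idx];
        }
        return out;
    }

    public static void main(String[] args) {
        // Two toy runs with a budget of 8 fitness evaluations (minimisation).
        double[][] runs = {
            {9, 7, 7, 5, 4, 4, 2, 1},
            {8, 8, 6, 6, 3, 3, 3, 1}
        };
        double[] avg = averageTrend(runs);
        System.out.println(Arrays.toString(downsample(avg, 4)));
    }
}
```

With the default settings mentioned in the text, the same two calls would be applied to arrays of 5000 × *n* elements and reduced to 500 points.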

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
