To show how the DSCTool can be used for identifying the exploration and exploitation abilities of the compared optimization algorithms, basic variants of Differential Evolution (DE) were implemented, where the mutation strategy, scaling factor F, and crossover probability CR can be changed. The DE pseudocode is presented in Algorithm 2.
Algorithm 2 Differential Evolution Algorithm
Input: population size NP, scaling factor F, crossover probability CR; Output: best solution
1: P = Create and initialize the population of NP individuals;
2: while stopping condition(s) not true do
3:   for each individual x_i in P do
4:     Randomly select individuals from P;
5:     Create an offspring, u_i, by applying the mutation strategy with scaling factor F on the selected individuals;
6:     Update u_i using binomial crossover on x_i with probability CR;
7:     Evaluate the fitness, f(u_i);
8:     if u_i is better than x_i then
9:       Replace x_i with u_i in P;
10:    end if
11:  end for
12: end while
13: Return the individual with the best fitness as the solution;
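To make the steps of Algorithm 2 concrete, the following is a minimal, runnable Python sketch of the same loop. It assumes the widely used DE/rand/1 mutation strategy, binomial crossover, and a sphere objective purely for illustration; the function names and the choice of NumPy are ours, not part of the DSCTool or of the compared implementations.

```python
import numpy as np

def differential_evolution(f, bounds, NP=40, F=0.5, CR=0.9, max_evals=10_000, seed=None):
    """Basic DE with DE/rand/1 mutation and binomial crossover (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    D = len(bounds)
    lo, hi = np.array(bounds, dtype=float).T
    # step 1: create and initialize the population of NP individuals
    pop = lo + rng.random((NP, D)) * (hi - lo)
    fit = np.array([f(x) for x in pop])
    evals = NP
    while evals < max_evals:                       # step 2: stopping condition
        for i in range(NP):                        # step 3: for each individual
            # step 4: randomly select three distinct individuals different from i
            r1, r2, r3 = rng.choice([j for j in range(NP) if j != i], 3, replace=False)
            # step 5: mutation with scaling factor F (DE/rand/1)
            v = pop[r1] + F * (pop[r2] - pop[r3])
            # step 6: binomial crossover with probability CR
            cross = rng.random(D) < CR
            cross[rng.integers(D)] = True          # take at least one dimension from v
            u = np.clip(np.where(cross, v, pop[i]), lo, hi)
            # step 7: evaluate the fitness of the offspring
            fu = f(u)
            evals += 1
            # steps 8-10: greedy selection
            if fu <= fit[i]:
                pop[i], fit[i] = u, fu
    # step 13: return the individual with the best fitness
    best = int(np.argmin(fit))
    return pop[best], fit[best]

# usage: minimize an illustrative sphere function in 10 dimensions
if __name__ == "__main__":
    def sphere(x):
        return float(np.sum(x ** 2))
    x_best, f_best = differential_evolution(sphere, [(-5, 5)] * 10, max_evals=10_000)
    print(f_best)
```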
We would like to point out that for the purpose of this paper, it is not important which algorithms were selected/compared and which hyperparameters were chosen, but the process (i.e., the steps of the analysis) itself, which shows how one can easily obtain a better understanding of the algorithm’s performance using the DSCTool.
The variants of the DE algorithm (shown in Table 1 and defined by a randomly selected mutation strategy, F, and CR) were compared on the testbed of the 2009 Genetic and Evolutionary Computation Conference workshop on black-box optimization benchmarking (BBOB) [17]. The testbed consists of 24 single-objective, noise-free optimization problems, where for each problem the first five instances were selected. This provided us with 120 test instances. This is one of the most popular testbeds for evaluating an algorithm's performance and has been used in many articles and dedicated workshops comparing the performances of different algorithms. For these reasons, it was selected as the testbed to showcase our approach. The benchmark set consists of five types of functions that cover different aspects of the problem space and mimic different real-world scenarios (i.e., separable functions, functions with low or moderate conditioning, unimodal functions with high conditioning, multi-modal functions with adequate global structure, and multi-modal functions with weak global structure). All the functions are unconstrained (except for the range of the variables), but this is not relevant for our approach, since we work with solution locations (in a discrete or continuous search space) and their values. Here we assume that even solutions outside of the constrained space have some value (e.g., obtained using a penalty function; a minimal sketch of such a wrapper is given after this paragraph). If this is not the case, then some post-processing must be applied to the values; otherwise, our approach will not work (it requires a single value per solution). All the details about the testbed can be found at
https://coco.gforge.inria.fr. All DE variants were run 25 times on every test instance, with the population size set to 40 and a fixed problem dimension D. Since we are interested in showing the applicability of the proposed approach and not in comparing the specific algorithms (DE variants in our case), we randomly selected three DE variants and defined their parameters with no specific goal in mind.
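As noted above, solutions outside of the constrained space must still be assigned a single value before our approach can be applied. Below is a minimal sketch of one possible post-processing step; the quadratic penalty and its weight are illustrative assumptions, not a prescription of this paper.

```python
import numpy as np

def penalized(f, lower, upper, weight=1e6):
    """Wrap an objective so that out-of-range solutions still receive a single
    finite value (one possible penalty; the choice is illustrative)."""
    lower, upper = np.asarray(lower, dtype=float), np.asarray(upper, dtype=float)

    def wrapped(x):
        x = np.asarray(x, dtype=float)
        # squared distance to the feasible box, scaled by a large weight
        violation = np.sum(np.maximum(lower - x, 0) ** 2 + np.maximum(x - upper, 0) ** 2)
        return f(np.clip(x, lower, upper)) + weight * violation

    return wrapped
```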
To show how to track exploration and exploitation abilities during the optimization process, the best solutions after D*1000, D*10,000, and D*100,000 evaluations are taken.
As we have already mentioned, this analysis consists of two main phases: (i) a comparison made with regard to the obtained solution values (i.e., the DSC ranking scheme) and (ii) a comparison made with regard to the distribution of the obtained solutions in the search space (i.e., the eDSC ranking scheme).
3.1. Comparison Made with Regard to the Obtained Solutions Values (i.e., DSC Ranking Scheme)
The first step in this phase was to use the DSC ranking scheme to rank the compared algorithms according to the obtained solution values on each test instance. For this purpose, the DSCTool rank service was used. For the ranking service, the only decisions that had to be made were the selection of the statistical test (e.g., KS or AD) used to compare the one-dimensional distributions of the obtained solution values and the statistical significance level, α [18]. In our case, the AD statistical test was selected to be used by the DSC ranking scheme and a significance level of 0.05 was set. Since we performed the comparison at several time points of the optimization process, the input JSONs for the rank service had to be prepared for each time point. The JSON inputs for the rank service for all pairwise comparisons and all time points of the optimization process can be found at http://cs.ijs.si/dl/dsctool/eDSC-rank.zip. The results of calling the rank service are JSONs that contain the DSC rankings for each algorithm pair on every problem instance.
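For readers who want to script this step, the following is a hypothetical sketch of posting one of the prepared JSON inputs to the rank service using Python's requests library. The endpoint URL, the example file name, and the authentication details are placeholders; the actual values should be taken from the DSCTool documentation, not from this snippet.

```python
import json
import requests

# Placeholder endpoint; consult the DSCTool documentation for the actual URL
# and the authentication that the web service expects.
RANK_SERVICE_URL = "https://example.org/dsctool/rank"

def call_rank_service(json_path, url=RANK_SERVICE_URL, auth=None):
    """Post a prepared rank-service input JSON and return the parsed response."""
    with open(json_path) as fh:
        payload = json.load(fh)
    response = requests.post(url, json=payload, auth=auth, timeout=60)
    response.raise_for_status()
    return response.json()  # DSC rankings for one algorithm pair and time point

# usage: one of the inputs from eDSC-rank.zip (hypothetical file name)
# rankings = call_rank_service("DE1_vs_DE2_timepoint1.json")
```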
Once the DSC rankings are obtained, the next step is to analyze them using an appropriate omnibus statistical test. In the literature, there are many discussions on how to select an appropriate omnibus statistical test [19]. One benefit of using the DSCTool is that the user does not need to worry about this, since the appropriate omnibus statistical test can be selected from the result of the rank service. The DSCTool is an e-Learning tool, so all conditions for selecting an appropriate statistical test have already been checked by the tool, and the statistical knowledge required on the user's side is consequently significantly reduced.
To continue with the analysis, we used the DSCTool omnibus web service. For creating the input JSONs required by the omnibus web service, the results from the rank service were used. Since only two algorithms were compared in each pairwise comparison, the one-sided, left-tailed Wilcoxon Signed Rank test was selected, as proposed by the ranking service. The one-sided, left-tailed test was selected because we are interested in whether one algorithm performs significantly better, and not only in whether there is a statistically significant difference between their performances (i.e., the two-sided hypothesis). The input JSONs for the omnibus test can be found at http://cs.ijs.si/dl/dsctool/eDSC-rank-omnibus.zip. The omnibus test returns the mean DSC rankings for the compared algorithms and a p-value, which tells us whether or not the null hypothesis is rejected. In general, if the p-value is lower than the predefined significance level, the null hypothesis is rejected and we can conclude that there is a statistically significant difference between the algorithms' performances (i.e., the first algorithm has better performance, since we are testing a one-sided hypothesis). Otherwise, if the p-value is greater than or equal to the significance level, we conclude that there is no statistically significant difference between the compared algorithms' performances.
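The decision rule described above can be illustrated with a short SciPy sketch (an illustration of the rule, not of the DSCTool's internal implementation). The ranking vectors are placeholders to be filled with the per-instance DSC rankings obtained from the rank service.

```python
from scipy.stats import wilcoxon

ALPHA = 0.05  # significance level used in this paper

def left_tailed_wilcoxon(ranks_a, ranks_b, alpha=ALPHA):
    """Test H1: algorithm A has a lower (better) mean DSC ranking than algorithm B."""
    # alternative="less" gives the left-tailed test: the distribution of
    # (ranks_a - ranks_b) is shifted towards negative values.
    stat, p_value = wilcoxon(ranks_a, ranks_b, alternative="less")
    if p_value < alpha:
        return "A significantly outperforms B", p_value
    return "no statistically significant difference", p_value

# usage: ranks_a and ranks_b are the per-instance DSC rankings of the two
# algorithms for one time point, as contained in the rank-service output
# verdict, p = left_tailed_wilcoxon(ranks_a, ranks_b)
```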
By performing the first phase of the eDSC approach, we obtained the results for comparing the algorithms with regard to the obtained solution values using the DSC ranking scheme.
3.3. Results and Discussion
To provide more information about the exploration and exploitation abilities of the compared algorithms during the optimization process, we compared them at different time points, namely after D*1000, D*10,000, D*100,000, and D*1,000,000 FEs. In Table 2, Table 3, Table 4, and Table 5, the results of the eDSC approach are presented for each selected time point, respectively. Each table contains the results obtained by both phases of the eDSC approach (i.e., the DSC and eDSC ranking schemes), with the obtained
p-values written in brackets. The results are separated by the | sign: on the left side is the result of the comparison between the algorithms based on the solution values (i.e., the DSC ranking scheme), while on the right side is the result of the comparison based on the solution locations (i.e., the eDSC ranking scheme). The / sign indicates that the comparison is not meaningful (i.e., we cannot compare an algorithm with itself), the + sign indicates that the algorithm written in the row significantly outperforms the algorithm written in the column, and the − sign indicates that the algorithm written in the row has statistically significantly worse performance than the algorithm written in the column. This interpretation comes from the definition of the null and the alternative hypothesis of the left one-sided test (i.e., $H_0: \mu_1 \geq \mu_2$ and $H_1: \mu_1 < \mu_2$, where the $\mu$'s are the sample means of the compared algorithms).
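The mapping from the pairwise test outcomes to the table symbols can be summarized with the following illustrative helper; the symbol returned for a non-significant difference is a placeholder, since the tables themselves only define /, +, and −.

```python
def table_symbol(p_row_vs_col, p_col_vs_row, alpha=0.05):
    """Map the two left-tailed p-values of an algorithm pair onto the table
    notation used in Tables 2-5 (illustrative reconstruction)."""
    if p_row_vs_col < alpha:
        return "+"   # the row algorithm significantly outperforms the column algorithm
    if p_col_vs_row < alpha:
        return "-"   # the row algorithm is significantly worse than the column algorithm
    return "="       # no statistically significant difference (placeholder symbol)
```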
Table 2 presents the results of comparing the algorithms' performance achieved after D*1000 function evaluations (FEs). In this case, we are primarily interested in the quality of the algorithms' performance at the beginning of the optimization, or in a scenario where we have only a relatively small number of FEs at our disposal. Looking at the results, it is obvious that DE2 performs significantly better than the other two DE variants (DE1 and DE3) with regard to the solution quality and distribution, indicating superior exploration and exploitation abilities. We should point out here that we set the preferred distribution for the eDSC ranking scheme to clustered solutions. Further, DE1 has significantly better exploration and exploitation abilities than DE3, and DE3 performs significantly worse than the other two DE variants. So, if we had only D*1000 evaluations at our disposal, DE2 would be the obvious choice. But if we have more FEs at our disposal, then we do not know whether this is still the case.
If we have more FEs at our disposal, then having a more clustered distribution of solutions early in the optimization process does not necessarily guarantee high-quality solutions later on. If all runs quickly converge to the same area of the solution space, this can be seen as premature convergence. In such cases, when a multimodal search space is explored, the algorithm might be trapped in some local optimum. So, quick convergence at the beginning of the optimization process is no guarantee of a high-quality algorithm or hyperparameters in the long run. Let us look at Table 3, where the results of comparing the algorithms' performances achieved after D*10,000 FEs are presented. The DE3 algorithm remained the worst-performing algorithm, but the relation between the quality of DE1 and DE2 has changed significantly. As we can observe, the preferred distribution of solutions is now significantly better for the DE1 algorithm. So, the seemingly worse initial exploration abilities of the DE1 algorithm paid dividends, allowing it to achieve a significantly better (preferred) solution distribution. The question that arises is: did we choose the correct preferred distribution when evaluating the algorithms at D*1000 FEs? If the target budget were D*1000 FEs, then the selection was correct; however, looking at the results obtained at D*10,000 FEs, the answer is not so clear.
Now, let us look at Table 4 and see what happens at D*100,000 FEs. The comparison results stay the same, so no significant changes can be observed between the performances of the algorithms.
Finally, looking at Table 5, where the results of comparing the algorithms' performances achieved after D*1,000,000 FEs (i.e., at the end of the optimization process) are presented, it is obvious that the results have changed again. DE3 remained the worst-performing algorithm. The randomly selected hyperparameters for DE3 turned the basic DE into a low-performing algorithm (compared to DE1 and DE2) with inefficient exploration abilities, and consequently we cannot say much about its exploitation abilities. DE2 significantly worsened its performance compared to DE1 throughout the optimization process, obtaining significantly worse solutions with respect to both their values and their distribution. The selected hyperparameters worked best for scenarios with a very limited number of FEs, but with a higher number of FEs allowed, the performance deteriorated compared to DE1. Finally, DE1 concluded our analysis as the significantly better-performing algorithm, acquiring the best solutions with respect to their values and distribution, and showing the best exploration and exploitation abilities among the compared algorithms at D*1,000,000 FEs.
To get a better perception of what is happening during the runs of the compared algorithms with regard to the solution values and their distribution, a graphical representation of the results for the DSC and eDSC ranking schemes is shown in Figure 2 and Figure 3, respectively. On the x-axis, all time points are represented, while on the y-axis the respective rankings of the algorithms are presented, defined according to the results (see Table 2, Table 3, Table 4, and Table 5) of the one-sided Wilcoxon Signed Rank test between all compared algorithms (i.e., one means the best performance, while three means the worst performance).
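A plot like Figure 2 or Figure 3 can be reproduced with a short matplotlib sketch such as the one below; the rankings themselves must be filled in from Tables 2, 3, 4, and 5, as no values are hard-coded here.

```python
import matplotlib.pyplot as plt

TIME_POINTS = ["D*1000", "D*10,000", "D*100,000", "D*1,000,000"]

def plot_rankings(rankings, time_points=TIME_POINTS):
    """Plot per-time-point rankings (1 = best, 3 = worst) for each algorithm.

    `rankings` maps an algorithm name to a list of rankings, one per time
    point, derived from the Wilcoxon test results in Tables 2-5."""
    fig, ax = plt.subplots()
    for name, ranks in rankings.items():
        ax.plot(time_points, ranks, marker="o", label=name)
    ax.set_xlabel("time point (function evaluations)")
    ax.set_ylabel("ranking (1 = best, 3 = worst)")
    ax.invert_yaxis()  # show the best ranking at the top, as in Figures 2 and 3
    ax.legend()
    return fig

# usage: fill in the rankings read off from Tables 2-5, e.g.
# plot_rankings({"DE1": [...], "DE2": [...], "DE3": [...]})
```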
Finally, we can conclude that the hyperparameters of DE3 are not suitable in any scenario (i.e., at any time point), since its exploration and exploitation performance on our testbed is below that of the other two DE variants. Next, if we performed the same analysis using experiments with hyperparameters that lie in some ε-neighborhood, we could obtain further information on which of the hyperparameters contribute to improving or degrading the efficiency of the exploration and exploitation abilities. The hyperparameters of DE2 provide good initial exploration and exploitation abilities compared with the other two DE variants. In a scenario with a low number of FEs, DE2 is therefore a good starting point for further investigating parameters in some ε-neighborhood of its parameters. Such an investigation would provide us with information on how to further improve the performance in this kind of scenario. Finally, DE1 turned out to be the best-performing algorithm when a large number of FEs is available. Since this is a limited experiment designed to show how the eDSC approach can be used efficiently in combination with the DSCTool, our conclusions are limited to these three algorithms. If we are comparing different algorithms, this would be enough to allow one to focus on improving the exploration or exploitation abilities of the developed algorithm. In cases where the goal is to understand the influence of hyperparameter selection on the exploration and exploitation abilities of the algorithm, much more testing would be needed to acquire enough information.