Peer-Review Record

Improved Method for Parallelization of Evolutionary Metaheuristics

Mathematics 2020, 8(9), 1476; https://doi.org/10.3390/math8091476
by Diego Díaz, Pablo Valledor, Borja Ena, Miguel Iglesias and César Menéndez
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 3 August 2020 / Revised: 24 August 2020 / Accepted: 26 August 2020 / Published: 1 September 2020
(This article belongs to the Section Mathematics and Computer Science)

Round 1

Reviewer 1 Report

The idea is interesting but far from revolutionary. It is not based on solid mathematical arguments. The collector does not guarantee a better solution. On the contrary, it can lead to premature convergence. The exchange of information between populations seems to be beneficial for increased accuracy, but what do we do if this exchange leads to a blockage of evolution?
The algorithms are presented in the style of a programmer. I would have preferred an algorithmic style and more attention to detail. I leave aside the fact that the first algorithm (Listing 1) does not have numbered lines, although the explanations refer to pseudocode line numbers.
The experimental results are numerous but somewhat unclear. It seems that the proposed method offers some advantages, but I did not understand whether the Multistart method was implemented by the authors or is the best implementation currently available. Also, is an MPI implementation better than a CUDA or an OpenMP one? I am not convinced that the superiority of the Multiverse method is not due in part to the implementation.
The bibliography is incomplete. Relevant works in the field are missing.
Anyway, the paper has value, but it doesn't seem to be very suitable for the Mathematics journal. I have not seen any mathematical innovation to justify sending this manuscript to this journal.

Author Response

Premature convergence is an issue we have often seen in single-population versions of the algorithms. By receiving solutions from several independent populations, the collector is actually less prone to the problem (though not immune), as it has a more diverse pool of solutions. This is the main reason why we chose to build a multi-population method rather than distribute the calculations of a single population.

We implemented the Multistart method we use; it merely consists of running an independent instance of the selected algorithm at each node and selecting the best overall solution. To our knowledge there is no 'best available implementation' of this; rather, it is usually implemented ad hoc due to its extreme simplicity. The algorithms that are instantiated (GA, ACO) are the same in both Multistart and Multiverse, which we think makes sense: researchers will develop their own versions, either from scratch or using a specific framework, and the same one would be used in either Multistart or Multiverse.
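To make this concrete, here is a minimal, hypothetical sketch of such an ad hoc Multistart over MPI (not the code used in the paper), assuming mpi4py and a placeholder run_metaheuristic standing in for one independent GA/ACO instance:

```python
import random
from mpi4py import MPI

def run_metaheuristic(seed):
    """Stand-in for one independent GA/ACO instance; returns (cost, solution)."""
    rng = random.Random(seed)
    solution = [rng.random() for _ in range(10)]
    return sum(solution), solution

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank runs one fully independent instance of the chosen metaheuristic.
local_best = run_metaheuristic(seed=rank)

# Gather every rank's best result on rank 0 and keep the overall best.
all_best = comm.gather(local_best, root=0)
if rank == 0:
    cost, solution = min(all_best, key=lambda pair: pair[0])
    print("Multistart best cost:", cost)
```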

The choice of MPI follows from the objective of the algorithm: to provide a distributed version of any population-based algorithm. OpenMP only allows parallelization within a single machine, not distribution across machines, which does not achieve that purpose. CUDA or Vulkan would exploit one or several GPUs for acceleration, but in our experience the SIMD (Single Instruction, Multiple Data) parallelization model of GPUs does not fit the general case of population-based metaheuristics. In particular, GPU performance is very sensitive to branching, and beyond the logic in the algorithms themselves, additional logic is to be expected in the cost and constraint functions of the problem, which limits the acceleration that can be obtained and may even slow the computation down. There is also the low-level management of memory and CPU-GPU communication, plus the need to write the kernels for the specific problem. That is not to say that GPU acceleration cannot be useful: it can be, for specific problems whose structure lets them take advantage of the vectorized operations that GPUs excel at accelerating. But GPUs do not seem to be the best general-case option.

Reviewer 2 Report

The multiverse method is a novel method to provide a better baseline for multiple independent runs.

I notice that the multiverse method (as coded) has synchronization points (barriers), i.e. all workers take a step, all report their best, the collector waits for them all, then broadcasts the new best.

Would it make sense to allow the processes to proceed asynchronously, each worker taking a step as soon as it's ready? As each worker finishes a step it could send its best value and receive a new best value back. I'm not sure how much time is spent waiting at the barriers, but it might be avoidable.

Could the authors comment on the choice of using barriers vs. asynchronous communication? Possibly an asynchronous implementation could be mentioned in future work.

One thing I wondered about was the capping of runs at 25 (see line 272). This seems a modest number of runs given the 18-node cluster (Section 4). Could the authors comment on why this value was chosen?

The discussion for listing 1 refers to line numbers in the pseudo code, but the pseudo code has no line numbers. This is awkward. There should be line numbers as in listing 2.

Fig 5 might look better as a semilog plot.

A partial list of minor grammar issues follows:

Line 7:
instances receives -> instances receive

Line 52:
quite slower -> quite a bit slower

Line 84:
operations a value -> operations, a value

Line 190:
a given fitness levels -> a given fitness level

Line 322:
Period missing at the end of the sentence.

Author Response

The choice of synchronization barriers stems mainly from the MPI primitives we use; in particular, the collective communication operations involved (gather and broadcast) are blocking. We have played with the idea of an asynchronous approach, and if we were starting right now we might go for it, as new distributed frameworks that allow more flexibility have matured recently (especially Dask, for Python). In any case, all workers have the same load, so the timing differences should be short compared to the computation time, but it is true that it could be a further improvement.
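Purely to illustrate where the waiting occurs (a simplified sketch under our assumptions, not the paper's listing; evolve_one_step is a stand-in for one generation of the underlying GA/ACO, and only the worker-to-collector direction is shown):

```python
import random
from mpi4py import MPI

N_STEPS = 10  # illustrative constant

def evolve_one_step(rng):
    """Stand-in for one generation of the underlying GA/ACO instance."""
    solution = [rng.random() for _ in range(10)]
    return sum(solution), solution

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
COLLECTOR = 0                      # one rank plays the role of the collector
rng = random.Random(rank)

for step in range(N_STEPS):
    local_best = evolve_one_step(rng)
    # comm.gather is a blocking collective: every rank waits here until all
    # ranks have reached this point, which is the synchronization discussed.
    reported = comm.gather(local_best, root=COLLECTOR)
    if rank == COLLECTOR:
        # Stand-in for folding the received solutions into the collector's run.
        best_so_far = min(reported, key=lambda pair: pair[0])
```

An asynchronous variant along the lines the reviewer suggests would replace the collective with point-to-point messages that each worker sends and polls on its own schedule.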

As for the number of runs, we had limited access to the hardware, and the cluster is fully occupied by each run of each test problem with each algorithm, so we chose a number that could provide statistical significance without being excessive in the total use of the facilities. As it is, there are 18 test problems, 4 algorithm configurations (GA and ACO, each under both Multistart and Multiverse), and 25 runs, for a grand total of 1800 individual runs. Although the smaller problems are quite fast, some take much longer, and the full set still takes 2-3 days to run.

We have corrected most of the identified grammar issues, except for the one in line 7 (the subject there is not 'instances' but 'one of the instances'), and reviewed the paper for additional mistakes. We have also changed Figures 5 and 6 to a log scale, but there is little difference in appearance.

Round 2

Reviewer 1 Report

I am very sorry, but my previous opinion remains unchanged. The paper has value, but it doesn't seem to be very suitable for the Mathematics journal. I have not seen any mathematical innovation to justify sending this manuscript to this journal.

I am only partially satisfied with your answers.

I wrote in my first report that the collector does not guarantee a better solution; on the contrary, it can lead to premature convergence. The exchange of information between populations seems to be beneficial for increased accuracy, but what do we do if this exchange leads to a blockage of evolution? Your answer is "By receiving solutions from several independent populations, the collector is actually less prone to the problem (though not immune), as it has a more diverse pool of solutions". Maybe this is true, maybe not. I think it would have been good to make a comparison with the classic heuristic, based on a single population. I believe that in genetic algorithms the quality of the mutation and crossover operators is also very important. For ACO, the pheromone-based travel policy, the pheromone update, etc. are very important.

It is justifiable to compare two algorithms using the same hardware platform, but you should make the comparison against the best implementation. You say that "there is no best available implementation, but rather it is usually made ad hoc due to its extreme simplicity". I do not consider this to be a solid argument.

The test platform used consists of a cluster of 18 virtual machines. This does not justify choosing an MPI implementation to achieve increased acceleration. Communication on a multicomputer system is time consuming. There are many processors that exceed these capabilities (for example, the AMD Ryzen Threadripper 3990X, a 64-core, 128-thread unlocked desktop processor) and they are not very expensive. The heuristic proposed in the paper could be implemented in OpenMP on such a machine, and I am sure that the performance would be better. I also do not agree with the way you see distribution and parallelism. The proposed heuristic is of the SPMD (single program, multiple data) type. In OpenMP, algorithms that use this paradigm can be implemented easily; this possibility also exists in both CUDA and OpenCL. This is my point of view. I may be wrong.

I still consider that the bibliography is incomplete. Relevant works in the field are missing (for example: Dorigo, M., Stützle, T. (2019). Ant Colony Optimization: Overview and Recent Advances. In: Gendreau, M., Potvin, J.-Y. (eds.), Handbook of Metaheuristics. International Series in Operations Research & Management Science, vol. 272. Springer, Cham).

Author Response

In our opinion the approach matches the more numerical/computational aspects of the scope of the journal, but whether it falls within it is of course for you to decide.

On the topic of premature convergence: the communication is one-way, which means that all the instances except the collector are unaffected by it. The collector receives the best solution from each other instance, not just the best overall, which provides more diversity and therefore, if anything, helps avoid premature convergence. The comparison with the classical heuristic is implicit in the tests, as the Multistart method runs multiple independent instances of the classical heuristic (and then selects the best solution across all instances). Clearly the classical heuristics are converging on lower-quality solutions. If there is a problem with premature convergence, it can be addressed by adapting the heuristic (the GA or ACO, in the case of the tests presented) with the usual approaches: parameter tuning, pheromone matrix restart, niching, tweaking the creation of the initial population, etc. The heuristic is conceptually an input of the Multistart and Multiverse methods, and can be tuned orthogonally to the method itself. The strategies for both ACO and GA are standard choices for the TSP, although certainly not the only possible ones: the ACO uses the same travel policy as originally proposed by Dorigo, and the GA operators are a city swap for mutation and Partially-Mapped Crossover (PMX) for crossover.
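For reference, a minimal sketch of these two standard operators (hypothetical helper functions written for this reply, not the code from the paper):

```python
import random

def swap_mutation(tour, rng=random):
    """City-swap mutation: exchange two randomly chosen positions of the tour."""
    child = tour[:]
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def pmx_crossover(p1, p2, rng=random):
    """Partially-Mapped Crossover (PMX) for permutation-encoded tours."""
    n = len(p1)
    a, b = sorted(rng.sample(range(n), 2))
    child = [None] * n
    child[a:b] = p1[a:b]                 # copy a segment from parent 1
    for idx in range(a, b):              # place the displaced genes of parent 2
        gene = p2[idx]
        if gene in child[a:b]:
            continue
        pos = idx
        while a <= pos < b:              # follow the PMX mapping to a free slot
            pos = p2.index(p1[pos])
        child[pos] = gene
    for idx in range(n):                 # fill the remaining slots from parent 2
        if child[idx] is None:
            child[idx] = p2[idx]
    return child

if __name__ == "__main__":
    p1, p2 = [0, 1, 2, 3, 4, 5], [5, 3, 1, 0, 4, 2]
    print(pmx_crossover(p1, p2))         # a valid tour mixing both parents
    print(swap_mutation(p1))             # p1 with two cities exchanged
```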

Note that the comment on the availability of a 'best implementation' refers to the Multistart method, which consists of launching a number of independent runs and selecting the best of the solutions returned by each, and not to the metaheuristic (GA or ACO). As mentioned above, the metaheuristic is an input of the method.

The choice of MPI is not due to the testing hardware; it arises from the purpose of the method, which is the ability to exploit distributed systems. Even if relatively powerful single processors exist, they are still more limited than what can be achieved on a distributed cluster, where the computing power of the individual machines can be combined to solve larger and/or more complex problems. Admittedly, the specific test problems do not require that much power, but they are initial tests to ascertain the ability of the method to improve the solutions. The choice of the test platform is a matter of the best approximation to the target architecture within availability. Nonetheless, if the overall problem fits within a single machine, an OpenMP version could be developed, and it would likely improve on the timing (but MPI can also run locally, in which case communication does not incur network latency, so this is not a complete certainty). In any case, since the Multiverse method has a strictly higher communication overhead than Multistart, this speedup would only be in its favour. This also makes MPI a generic solution that can work on both types of architecture.

The issue with GPUs is not related to the computations in the Multistart and Multiverse methods, but to the metaheuristics. There are many examples in the literature of GPU acceleration of both ACO and GA, but it is achieved at a lower level (there are details on this in the text). Making this a generic method on the GPU would require implementing the metaheuristic itself on the GPU, at a global level, which is not only hard but also very likely inefficient in the general case. GPUs are actually SIMD (Single Instruction, Multiple Data) and can emulate SPMD by running every branch of each condition sequentially*; as long as the program consists of unconditional vectorized operations, or the conditions result in one branch that does something and another that does nothing (typical, e.g., of reduction operations), there is no impact. But for general programs, where multiple conditions may be checked and the different combinations of branches accumulate, this can become dominant over the acceleration gained by the parallelization. Once again, the metaheuristic is an input to the method, so GPUs may be used, if available, inside each instance to accelerate some internal calculations such as the fitness function, but this is problem-specific.

We are also aware that different approaches may be optimal for other distributed architectures, such as Hadoop/Spark-like environments or in-memory computing systems (e.g., HP's 'The Machine'). We consider that computer clusters are the most widely used for metaheuristics (Spark is mostly used for AI and ML, and in-memory computing systems are still at the prototype stage).

We are happy to include the references you provide as more up-to-date reviews of the general methods. However, their content relates to the improvement of the metaheuristics used inside the method, and is not directly relevant to the method itself. If anything, they support the choice of coarse-grained parallelization. Also note that, given the general aim of the method, we focused on broad surveys of the parallelization of metaheuristics.


[*] This is an oversimplification: it actually applies within each warp (a group of 32+ threads, depending on the specific architecture), but the overall reasoning holds. Also, this is not true for certain GPGPUs, but using those would make the method hardware-specific rather than general.

Round 3

Reviewer 1 Report

I propose that the editors request a third reviewer. I have expressed my opinions. You are good programmers, and I respect your professionalism in this field. Regarding the mathematical and algorithmic aspects, I refrain from further comment. I do not want to continue the controversy.
