## *4.1. Performance of MPI on SpiNNaker*

In Tables 1 and 2 and Figure 5 we report the performance of the MPI primitives on SpiNNaker. Table 1 shows the average execution time for 2000 iterations of the MPI\_BARRIER synchronization primitive; the execution time shows bounded growth with respect to the context size (i.e., the number of cores in use). The memory needed to store context information likewise grows slowly with the number of cores.



Table 2 shows the average execution time for the MPI\_SEND/MPI\_RECV unicast primitive. The average execution time grows linearly with the amount of data sent.

Finally, Figure 5 reports the average execution time for 2000 iterations of the MPI\_BROADCAST primitive with respect to the amount of data sent and the context size. Once again the execution time grows linearly with the amount of data sent, plus an overhead corresponding to the context-wide synchronization; its growth with respect to the number of cores is again bounded.
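For reference, average latencies of this kind are typically obtained with a simple timing loop. The sketch below is a minimal harness using standard MPI C calls (the same interface used by the MPICH baseline, and that SpinMPI exposes on SpiNNaker since the code was ported without modification); the iteration count matches the 2000 iterations quoted above, while the payload size and output format are illustrative.

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERATIONS 2000          /* averaging window used in the measurements */
#define MSG_BYTES  256           /* illustrative payload size                 */

int main(int argc, char **argv)
{
    int rank, size;
    char buf[MSG_BYTES];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(buf, 0, sizeof(buf));

    /* Average MPI_BARRIER latency over the whole context. */
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERATIONS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t_barrier = (MPI_Wtime() - t0) / ITERATIONS;

    /* Average MPI_BROADCAST latency for MSG_BYTES sent from rank 0. */
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERATIONS; i++)
        MPI_Bcast(buf, MSG_BYTES, MPI_BYTE, 0, MPI_COMM_WORLD);
    double t_bcast = (MPI_Wtime() - t0) / ITERATIONS;

    if (rank == 0)
        printf("ctx=%d  barrier=%.3e s  bcast(%d B)=%.3e s\n",
               size, t_barrier, MSG_BYTES, t_bcast);

    MPI_Finalize();
    return 0;
}
```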

**Table 2.** Performance of the MPI\_SEND/MPI\_RECV unicast primitive for different amounts of data sent on SpiNNaker.

## *4.2. Evaluation of Boyer-Moore MPI Implementation Running on SpiNNaker*

In the following, we analyse the efficiency and scalability of our optimised *Boyer-Moore* (*FED*) implementation on SpiNNaker. We compare it with the scalability achieved on a traditional multi-core CPU, using a server configuration with two Intel Xeon Silver 4114 processors, each with 10 cores and 20 threads. The *FED* algorithm is implemented in C and used to benchmark both the Server and SpiNNaker architectures. The benchmark running on the general purpose Server architecture is written in C++ and compiled with g++ 7.4.0 against the MPICH 3.3 parallel environment. The benchmark running on the SpiNNaker architecture is written in C and compiled with gcc-arm-none-eabi 5.4.1 and SpinMPI 19w19. It is important to note that, by using the SpinMPI library, we ported the *FED* code written for a standard PC to the SpiNNaker hardware without applying any adaptation or transformation to the code.

The text used for testing is the *Escherichia coli* genome, which is about 4 million symbols long and leads to an encoded text of about 1 MB in size; this text is then split into a set of about 4000 chunks, each 256 bytes long.
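These figures are consistent with a 2-bit-per-symbol encoding of the four-letter DNA alphabet (an inference from the quoted sizes, not stated explicitly above):

$$4 \times 10^6\ \text{symbols} \times 2\ \text{bit/symbol} = 10^6\ \text{B} \approx 1\ \text{MB}, \qquad \frac{10^6\ \text{B}}{256\ \text{B/chunk}} \approx 4000\ \text{chunks}$$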

There exist two types of strategies to evaluate the scalability of a problem in a parallel environment: *strong scaling*, in which the total problem size is kept fixed while the number of workers increases, and *weak scaling*, in which the problem size assigned to each worker is kept fixed as more workers are added.

The SpiNNaker platform provides a fast, core-local data memory (DTCM) of 64 kB. This memory constraint allows at most 100 *FED* patterns, totalling 40 kB in size, to be stored per node. Given this constraint, we decided to use a *weak-scaling* benchmarking strategy to scale our benchmark up to the 768 nodes available on SpiNNaker. The problem size must be calibrated in order to claim a condition of equivalence and perform a fair comparison between the different architectures; in our case, a condition of equivalence is met whenever the same *FED* execution time *t*<sub>FED</sub> is observed using a single *FED* worker. When SpinMPI is requested to match 1000 *FED* chunks against 100 *FED* patterns on a single node, a run-time of 26,970 ms is measured; the same run-time, for the MPICH implementation with 1000 *FED* chunks, is obtained when the single *FED* worker is in charge of 12,500 *FED* patterns. This preliminary assessment is needed to evaluate only the scalability features of the two architectures, without considering the difference in computing power of a single working node on the two architectures. The reason for this comparison is to put the performance of MPI on SpiNNaker in a familiar perspective, as the CPU dual-socket server is a widespread general purpose machine that supports MPI; note, however, that communication on the Xeon is networkless message passing happening entirely in RAM, while message passing on SpiNNaker makes efficient use of the board's interconnection scheme.
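In compact form, the calibration fixes the per-worker workload on each platform so that the single-worker runtimes coincide (values as measured above):

$$t\_{\text{FED}}^{\text{SpiNNaker}}(100\ \text{patterns/worker}) \approx t\_{\text{FED}}^{\text{Xeon}}(12{,}500\ \text{patterns/worker}) \approx 26{,}970\ \text{ms}$$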

A general strategy for evaluating the parallel scaling of an MPI application is to compute the scaling efficiency, which measures how effectively the application uses every node available in the parallel environment. Given an environment with *N* workers and a problem that requires *t*<sub>FED,*i*</sub> units of time to be solved with *i* workers, the *weak-scaling* efficiency *E*<sub>*N*</sub> can be measured as in Equation (1). The speed-up *S*<sub>*N*</sub> can easily be inferred from the efficiency and computed with Equation (2).

$$E\_N = \frac{t\_{\text{FED},1}}{t\_{\text{FED},N}} \tag{1}$$

$$S\_N = E\_N \cdot N \tag{2}$$
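As a concrete illustration, Equations (1) and (2) translate directly into code; the sketch below is a minimal C helper, where the runtime values are placeholders standing in for the measured *t*<sub>FED,*i*</sub> series.

```c
#include <stdio.h>

/* Weak-scaling efficiency (Eq. 1) and speed-up (Eq. 2) from measured
 * FED runtimes t_fed[i-1], obtained with i workers, i = 1..n.         */
static double weak_efficiency(const double *t_fed, int n)
{
    return t_fed[0] / t_fed[n - 1];          /* E_N = t_FED,1 / t_FED,N */
}

static double speedup(const double *t_fed, int n)
{
    return weak_efficiency(t_fed, n) * n;    /* S_N = E_N * N           */
}

int main(void)
{
    /* Illustrative runtimes in ms; real values come from the benchmark. */
    double t_fed[] = { 26970.0, 27100.0, 27350.0, 27600.0 };
    int n = (int)(sizeof(t_fed) / sizeof(t_fed[0]));

    printf("E_%d = %.3f  S_%d = %.2f\n",
           n, weak_efficiency(t_fed, n), n, speedup(t_fed, n));
    return 0;
}
```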

Figures 6 and 7 report the speed-up and efficiency of the MPI-*FED* algorithm on the Server and SpiNNaker architectures. The horizontal axis represents the number of MPI workers used; both systems were tested until saturation, with the Server reaching 40 parallel workers through Intel hyper-threading and the Spin5 board utilizing all 768 available physical cores. Tests were performed for genomes of 500, 1000 and 2000 chunks.

**Figure 6.** Comparison of Weak-scaling speed-up for MPI-FED on a general purpose architecture and on SpiNNaker.

In Figure 6 we can see how the massively parallel architecture of SpiNNaker influences the speed-up. The high number of physical cores on the machine lets the speed-up increase linearly, avoiding the discontinuities that a general-purpose processor shows at critical points when hyper-threading is activated to provide the required number of workers (note, in the graph, the inflection point at 20 MPI workers for the PC version, i.e., the point at which the maximum number of physical cores on the Xeon is reached).

In Figure 7, SpiNNaker demonstrates excellent scalability, with efficiency values close to 95% for up to 200 workers. Additionally, we can see that performance markedly improves for longer text sequences; the efficiency for 768 workers processing 2000 chunks is 87.83%. The reason is that, as the size of the data to be processed increases, the ratio of processing time to communication time in the overall algorithm increases, since the data are only sent once at the beginning of processing and then gathered at the end. The bottleneck due to communication overhead thus becomes less prevalent, and the efficiency improvement due to massive parallelism becomes more evident.

**Figure 7.** Comparison of Weak-scaling efficiency for MPI-FED on a general purpose architecture and on SpiNNaker.

By contrast, the efficiency of the Server dips much faster, dropping below 90% as soon as the requested MPI workers outnumber the physical cores. It also remains fairly constant when changing the number of chunks. This appears reasonable as, for the high-speed CPU used in the test, the computation time is very small, but it suggests that other phases of the computation such as inter-process communication and thread management have a significant impact on the efficiency of the algorithm.

As a side experiment, we evaluated the impact of the size of the *FED* buffer used to distribute data among the *MPI workers* on the measured scaling efficiency. Figure 8 shows the scaling efficiency of two experiments: the former distributes the *FED* chunks to be analysed as 1000 256-byte packets, while the latter broadcasts the same amount of data formatted as 125 2-kB packets. Figure 8 highlights that the two scaling-efficiency curves are comparable, meaning that the size of the packets used to distribute *FED* chunks among the *MPI workers* does not affect the benchmark results for the general purpose architecture.

**Figure 8.** Efficiency of the general purpose architecture for different *FED* buffer sizes.

Finally, we can compare the power efficiency of the two architectures by using estimated consumption based on the nominal values from the CPU [30] and SpiNNaker [31] data-sheets. For the Intel Xeon, we consider peak and idle powers per core of *P*<sub>peak</sub> = 11,030 mW and *P*<sub>idle</sub> = 6320 mW, and we hypothesize that the number of active physical cores (out of the available 20), *f*(*x*), can be expressed as a function of the active MPI workers *x* as *f*(*x*) = ⌈(*x* + 1)/2⌉. The term *x* + 1 rather than *x* appears because one Controller process has the task of distributing the data and patterns to the MPI workers. Based on this assumption, we assign a power consumption of *P*<sub>peak</sub> to the active cores and of *P*<sub>idle</sub> to every other core; thus the estimated power consumption with respect to the number of MPI workers *x* is *P*(*x*) = *P*<sub>peak</sub> · *f*(*x*) + *P*<sub>idle</sub> · (20 − *f*(*x*)).
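A minimal sketch of this CPU power model, using the nominal per-core figures quoted above (constant and function names are illustrative):

```c
#include <stdio.h>

#define XEON_CORES  20          /* physical cores across the two sockets */
#define P_PEAK_MW   11030.0     /* nominal peak power per core, mW       */
#define P_IDLE_MW   6320.0      /* nominal idle power per core, mW       */

/* Active physical cores for x MPI workers plus one Controller process:
 * integer form of f(x) = ceil((x + 1) / 2).                            */
static int active_cores(int x)
{
    return (x + 2) / 2;
}

/* Estimated CPU power draw, in mW, for x MPI workers. */
static double xeon_power_mw(int x)
{
    int f = active_cores(x);
    return P_PEAK_MW * f + P_IDLE_MW * (XEON_CORES - f);
}

int main(void)
{
    for (int x = 1; x <= 39; x += 19)        /* 1, 20 and 39 workers */
        printf("workers=%2d  P=%.0f mW\n", x, xeon_power_mw(x));
    return 0;
}
```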

On the other hand, for SpiNNaker we consider an Idle Power per Chip *C*<sub>idle</sub> = 360 mW, an Idle Power per Core *P*<sub>idle</sub> = 20 mW, a Peak Power per Core *P*<sub>peak</sub> = 55.56 mW, and an Off-Chip-Link power *P*<sub>link</sub> = 6.3 mW. The power estimation for SpiNNaker depends on the MPI execution context, which can be described by a pair of values (*p*, *k*), where *p* ∈ [1, 16] is the number of active processors per chip and *k* ∈ [1, 48] is the number of active chips. The power estimation formula can be expressed as a function of the number of active processors and chips as *P*(*p*, *k*) = *k* · (*C*<sub>idle</sub> + (*P*<sub>peak</sub> − *P*<sub>idle</sub>) · (*p* + 1) + *P*<sub>link</sub>) + (48 − *k*) · *C*<sub>idle</sub>, where *p* + 1 processors are counted to include the Monitor Processor on each chip. The estimated power for *x* MPI workers is then *P*(*x*) = *P*(*p*, *k*), evaluated on the context with the minimum *k* such that *p* · *k* = *x* + 1; as in the CPU case, we count *x* + 1 processes to include the Controller process.
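The corresponding sketch for SpiNNaker, with the context (*p*, *k*) chosen by a simple rounding rule that approximates the minimum-chip selection described above (constant and helper names are illustrative):

```c
#include <stdio.h>

#define CHIPS_TOTAL     48
#define CORES_PER_CHIP  16
#define C_IDLE_MW       360.0    /* idle power per chip, mW  */
#define P_IDLE_MW       20.0     /* idle power per core, mW  */
#define P_PEAK_MW       55.56    /* peak power per core, mW  */
#define P_LINK_MW       6.3      /* off-chip link power, mW  */

/* Power of an MPI context with p active processors on each of k chips;
 * p + 1 accounts for the Monitor Processor on every active chip.       */
static double spin_power_mw(int p, int k)
{
    return k * (C_IDLE_MW + (P_PEAK_MW - P_IDLE_MW) * (p + 1) + P_LINK_MW)
         + (CHIPS_TOTAL - k) * C_IDLE_MW;
}

/* Estimated power for x MPI workers (x + 1 <= 768): choose the fewest
 * chips k such that p * k covers the x workers plus the Controller.    */
static double spin_power_for_workers_mw(int x)
{
    int procs = x + 1;
    int k = (procs + CORES_PER_CHIP - 1) / CORES_PER_CHIP;  /* ceil */
    int p = (procs + k - 1) / k;                            /* ceil */
    return spin_power_mw(p, k);
}

int main(void)
{
    printf("767 workers -> %.0f mW\n", spin_power_for_workers_mw(767));
    return 0;
}
```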

Given the architectural difference between the SpiNNaker and CPU machines, it is necessary to outline a fair method to evaluate the efficiency of the algorithm's implementation. We define power efficiency as the energy consumed to align a single pattern to the reference, measured in mJ/pattern, as a function of the parallelisation effort of the given system, expressed as a percentage of its total resources. The maximum energy efficiency is obtained when all resources are in use, corresponding to a parallelisation effort of 100%. For SpiNNaker it is natural to assume that 100% utilisation occurs when all 768 cores are busy (i.e., at 767 MPI workers), corresponding to an average energy consumption of 37.3 mJ/pattern. For the CPU, we can either consider 100% utilisation to be the situation where all physical cores are active, or the one where all virtual cores are active (20 physical + 20 virtual, providing 39 MPI workers). In the first case, the estimated average energy consumption is 51 mJ/pattern, with an estimated energy saving of 27% in favour of SpiNNaker. In the second case, the energy is 43 mJ/pattern, with SpiNNaker consuming 13% less.
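One way to read the per-pattern figures, assuming the whole run draws the estimated power *P*(*x*) for its full duration (an assumption of this restatement, not stated above), is:

$$\frac{E}{\text{pattern}} = \frac{P(x) \cdot t\_{\text{FED},N}}{N\_{\text{pat}}}$$

where *N*<sub>pat</sub> is the total number of patterns aligned in the run and *t*<sub>FED,*N*</sub> is the corresponding runtime with *N* workers.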
