3.2. The Acceleration Performance of GPU Parallelization
To illustrate the performance differences between the CPU and GPU, we conducted extensive simulations using various 3D meshes, as detailed in Figure 8. This study encompassed three categories of 3D grids, each characterized by its node count. Table 5 summarizes each grid. The grids are classified as small, medium, and large based on node count: small grids consist of 4168 nodes, while large grids encompass 646,627 nodes. These diverse grids were employed to carry out simulations at various scales. The time step may vary depending on the cell model and the required precision. Our electrocardiogram research focuses on changes in action potential morphology, such as early after-depolarization (EAD), delayed after-depolarization (DAD), and re-entry phenomena, so the precision requirement is modest, and we selected a time step of 0.005 ms for our experiments. With this time step, the simulation runs efficiently without precision-related overflow or NaN errors.
The relevant simulation parameters are summarized in Table 6. We chose the CPU serial program with the improved mesh storage as the baseline for our comparative experiments, since this storage approach applies to both the CPU and GPU simulation programs. The implementation details were discussed in the preceding sections.
The simulation runtimes for the three mesh scales, using various optimization strategies on both CPU and GPU platforms, are shown in Figure 9 and Table 7. Because of its long runtimes, the adjacency-list implementation is reported only for the small mesh. The results demonstrate the significant advantage of GPUs over CPUs in cardiac simulations, owing mainly to the GPU's SIMD capabilities, which greatly accelerate the simulation process.
Furthermore, choosing the right storage method for the mesh is important because it directly affects how quickly neighboring nodes can be retrieved. Optimizing cache hit rates and streamlining access patterns to data such as ion concentrations and current intensities significantly affects simulation runtimes. Comparing the runtimes of Algo GPU_2 and Algo GPU_3 at different mesh scales shows that flattening the data structure yields substantial runtime improvements as the simulation scale increases.
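As a rough illustration, a flattened neighbor layout of the kind described above could resemble the following CSR-style sketch. This is not the paper's actual data structure; the struct and array names are hypothetical, and the point is only that all neighbor indices become one contiguous, cache-friendly array instead of a pointer-chased adjacency list.

```cpp
#include <cassert>
#include <vector>

// Hypothetical flattened (CSR-style) neighbor storage: all neighbor
// indices live in one contiguous array, and offsets[i]..offsets[i+1]
// delimits node i's neighbors. This replaces a vector-of-vectors
// adjacency list and gives contiguous, cache-friendly accesses.
struct FlatMesh {
    std::vector<int> offsets;    // size = numNodes + 1
    std::vector<int> neighbors;  // size = total neighbor entries
};

FlatMesh flatten(const std::vector<std::vector<int>>& adj) {
    FlatMesh m;
    m.offsets.push_back(0);
    for (const auto& nbrs : adj) {
        m.neighbors.insert(m.neighbors.end(), nbrs.begin(), nbrs.end());
        m.offsets.push_back(static_cast<int>(m.neighbors.size()));
    }
    return m;
}

// Sum a field over the neighbors of `node` -- the core access pattern
// of the diffusion (PDE) step.
double neighborSum(const FlatMesh& m, const std::vector<double>& v, int node) {
    double s = 0.0;
    for (int k = m.offsets[node]; k < m.offsets[node + 1]; ++k)
        s += v[m.neighbors[k]];
    return s;
}
```

Because every thread then walks a dense index range rather than following per-node pointers, global-memory accesses coalesce much better on the GPU.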
Figure 10 illustrates how runtime grows as the simulation scale expands, comparing the performance of the CPU and GPU simulation programs. The figure shows the steep, superlinear growth in runtime for the CPU sequential program: the ODE computation scales almost proportionally with the simulation size, while in the propagation phase the average computation time per node also increases with the growing number of neighboring nodes. As a result, for the large mesh, the CPU program requires an impractical duration of approximately 50,000 s to complete.
The GPU programs consistently show far higher simulation efficiency than their CPU counterparts. Several optimizations, including the improved storage approach, algorithm refinement, and asynchronous transmission, contribute significantly to this efficiency. As a result, a large-mesh simulation completes in approximately 1000 s, highlighting the substantial advantage of GPU-based simulations.
Because of differences in hardware and software environments, it is not reasonable to compare our parallel results directly with those of other studies. Differences in the underlying algorithms also make it difficult to apply others' parallel methods to our algorithm, and some of the optimization strategies presented in this paper may yield little benefit when applied to other algorithms. Reproducing others' work is therefore difficult, and a direct comparison of parallel results across studies is impractical. For these reasons, we computed the acceleration ratio using the following formula to further highlight the exceptional efficiency of GPUs:
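The formula itself appears to be missing from this copy of the text. The conventional definition of speedup, which is consistent with the "-fold" figures reported for Figure 11, would be:

```latex
S_{\mathrm{up}} = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPU}}}
```

where $T_{\mathrm{CPU}}$ is the runtime of the CPU serial baseline and $T_{\mathrm{GPU}}$ is the runtime of the GPU program under test.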
The results shown in Figure 11 support the claim that the GPU parallel program, even without any optimization, runs about ten times faster than the CPU program. With storage optimizations, the speedup increases to around 20-fold. With the improved algorithm that allows simultaneous execution of the ODE and PDE computations, it reaches approximately 40-fold. Finally, integrating pipelined asynchronous data transmission stabilizes the speedup at around 55-fold.
However, the chart also shows that as the computational scale grows, the speedup diminishes somewhat despite the various optimization strategies. This happens because most of the optimizations reduce the number of global memory accesses through efficient caching in the ODE portion, with a smaller impact on the PDE component. Moreover, as the number of cell nodes increases, it becomes impossible to assign each thread the calculations of only one node: in our GPU setup, each thread handles approximately five nodes when simulating the large mesh, while in the small mesh each thread handles only one node. This difference leads to lower performance than when each thread manages a single node.
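The per-thread workload described above is typically realized with a grid-stride loop. The sketch below shows the assignment in plain C++ for illustration only; in an actual CUDA kernel, the thread id would come from `blockIdx.x * blockDim.x + threadIdx.x` and the stride from `gridDim.x * blockDim.x`.

```cpp
#include <vector>

// Illustrative grid-stride assignment: when there are more nodes than
// threads, each thread strides through the node array and processes
// at most ceil(numNodes / threadCount) nodes. With the large mesh this
// is ~5 nodes per thread; with the small mesh it is 1.
std::vector<int> nodesForThread(int threadId, int threadCount, int numNodes) {
    std::vector<int> mine;
    for (int node = threadId; node < numNodes; node += threadCount)
        mine.push_back(node);
    return mine;
}
```

Each extra node per thread serializes work that would otherwise run concurrently, which is one source of the declining speedup at large scales.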
Another reason for this performance variation is the different strategies employed in the CPU and GPU simulations. In the CPU simulation, we skip some ODE calculations by checking whether cell nodes are at rest. However, our tests showed this approach is less effective on the GPU. The GPU's inherent Single Instruction, Multiple Data (SIMD) execution mode requires threads within the same warp to execute the same instruction simultaneously, and excessive branching disrupts this: threads within the same warp may follow different branches, degrading parallelism. As a result, while the strategy effectively avoids some ODE calculations on the CPU, on the GPU it merely leaves threads idle while others in the same warp compute. We therefore did not use this strategy on the GPU, which results in more ODE calculations per node in the GPU simulation. Although this difference shrinks as the excitation spreads, it still affects the results, especially as the simulation scale increases.
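A CPU-side resting-cell shortcut of the kind described might look like the sketch below. The resting potential, tolerance, and the placeholder ODE term are hypothetical, not the paper's actual values; the point is that the early-exit branch saves real work on a CPU, whereas under SIMD execution a "skipping" thread simply idles while its warp-mates run the full update.

```cpp
#include <cmath>
#include <vector>

// Assumed resting potential (mV) and "at rest" tolerance -- placeholder
// values for illustration only.
constexpr double kRestingV = -84.5;
constexpr double kEps = 1e-3;

bool atRest(double v) { return std::fabs(v - kRestingV) < kEps; }

// Returns how many nodes actually ran the (placeholder) ODE update.
// On a CPU the `continue` skips real work; on a GPU the same branch
// would leave lanes idle while warp-mates compute, saving little.
int updateOdes(std::vector<double>& v, double dt) {
    int updated = 0;
    for (double& vi : v) {
        if (atRest(vi)) continue;              // cheap on CPU, wasted lane on GPU
        vi += dt * (-0.1 * (vi - kRestingV));  // placeholder ODE step
        ++updated;
    }
    return updated;
}
```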
Also, as the computational scale increases, the time required to transfer results back to the CPU at each time step increases. Although we aim to mitigate this time overhead through a pipelining approach, this technique has limitations as the computational load grows.
A simple comparison of speedup ratios is not sufficient to accurately demonstrate the GPU's computational acceleration. To better characterize its performance, we calculated the average time spent per node, as displayed in Table 8. In this calculation, we excluded the time spent on data transmission and considered only the running time of the computational part.
The data in the table show that the GPU_1 program, which uses adjacency lists, has a much higher average computation time per node at large scales than at small scales, mainly due to frequent global-memory accesses. After the data-storage optimization in GPU_2, performance at large scales becomes roughly comparable to that at medium scales; a gap relative to small scales remains, but it is significantly reduced.
After implementing the enhanced algorithm, the average computation time per node remains between 0.66 and 0.69 ms across all three scales, effectively resolving the drop in cache hit rate caused by expanding the simulation scale. The GPU's computational performance thus stays consistent regardless of scale, demonstrating the effectiveness of the optimization strategies.
To show the impact of the improved data storage method and algorithm on parallel efficiency, we analyzed the time spent in each part of the simulation; the results are shown in Figure 12. As Figure 12a–c shows, in the traditional numerical solution approach, the ODE component accounts for a large share of the total simulation time, especially in programs with unoptimized data storage. With the improved storage strategy, the ODE computation time is notably reduced: for the large mesh, it drops from around 3000 s to just over 600 s. By contrast, the storage improvement has a relatively small effect on the computation time of the PDE component. The difference arises because ODE computations frequently access the many parameters of the LR model, whereas PDE computations rely only on the model's action potential.
The optimized algorithm merges the ODE and PDE calculations into a single function, eliminating the grid-wide synchronization and update steps. This yields a significant reduction in simulation runtime: the original 5000-s computation is reduced to approximately 450 s after both the algorithmic and storage optimizations.
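The fusion can be pictured as follows. This is a schematic in plain C++ on a hypothetical 1D chain of cells with placeholder ODE and diffusion terms, not the paper's actual kernel: reading from the previous step's buffer and writing to a second buffer (double buffering) lets both updates be applied per node in one pass, with no grid-wide synchronization between an ODE phase and a PDE phase.

```cpp
#include <vector>

// Schematic fused update on a hypothetical 1D chain of cells.
// Double buffering (read `prev`, write `next`) makes the per-node
// update independent, so the ionic (ODE) and diffusion (PDE) terms
// can be combined in a single pass.
void fusedStep(const std::vector<double>& prev, std::vector<double>& next,
               double dt, double d) {
    const int n = static_cast<int>(prev.size());
    for (int i = 0; i < n; ++i) {
        double left  = (i > 0)     ? prev[i - 1] : prev[i];   // no-flux boundary
        double right = (i < n - 1) ? prev[i + 1] : prev[i];
        double diffusion = d * (left - 2.0 * prev[i] + right); // PDE term
        double ionic = -0.1 * prev[i];                         // placeholder ODE term
        next[i] = prev[i] + dt * (diffusion + ionic);
    }
}
```

On the GPU, the same structure means one kernel launch per time step instead of two plus an intermediate global update, which is where the synchronization savings come from.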
The comparison in Figure 13a–c shows the time consumed by synchronous and asynchronous data transfer at different simulation scales. The corresponding tables show that, after the systematic optimizations, the program without pipelined asynchronous transfers spends only about 30% of the total simulation time on computation, with roughly 70% going to data transfer. The pipeline overlaps the transfer of the previous time step's results with the ongoing computation, which significantly improves execution efficiency. In large simulations, computation overlapped with asynchronous transfers is only marginally slower, by approximately 10%, than computation without the transfer cost, highlighting the importance of pipelined transfers.
3.6. The Scalability of the GPU Parallel Approach
To further demonstrate the scalability of the parallel method in this paper, we conducted simulations using different cell models.
Currently, there are two main types of cell models. The first is based on modeling the action potential waveform: such models often have simple expressions whose parameters lack physiological significance, reproducing the cell's action potential without any deeper representation of intracellular electrophysiological activity. The second is derived from the Hodgkin–Huxley theory and expresses electrophysiological activity by modeling individual ion channels. Compared with the former, this type of model is better suited to exploring the internal mechanisms of cells and can also simulate different cell pathologies or mutations by adjusting the relevant ion-channel parameters.
When selecting cell models, we mainly chose those based on the Hodgkin–Huxley theory, for two reasons: the first type of model has relatively simple expressions and modest computational requirements, so its simulations can be completed efficiently even on CPUs; and its applications are quite limited, since it only models the action potential waveform and cannot be used to explore the internal mechanisms of cardiac electrophysiology. After comparison, we ultimately selected three cell models, the Luo–Rudy 1991 model [28], the Stewart Zhang 2009 PF model [31], and the Takeuchi HL1 model [32], to demonstrate the scalability of the proposed method with respect to cell models. All three are derived from the Hodgkin–Huxley theory; they differ in the number of ion channels and in the expressions for each channel.
The runtimes are listed in Table 12. Despite the use of different cell models, GPU parallelization still achieved excellent acceleration: with the Takeuchi HL1 model, the CPU serial program took almost 46 h to complete the simulation, which is impractical, whereas GPU parallelization reduced this to only about 40 min. Furthermore, we calculated the cache hit rate for the different cell models. The experimental results in Table 13 indicate that, without the memory optimization strategy, the cache hit rate gradually decreases as model complexity grows and more ion channels are introduced; the additional parameters lead to more fragmented memory storage. With the memory optimization strategy, the cache hit rate remained relatively high, because parameters of the same type were still stored contiguously in memory despite their increased number. The memory optimization strategy therefore remained effective across different cell models and varying numbers of parameters.
To demonstrate the performance of our method on large-scale models, we used the "Oxford Rabbit Heart" [33], a highly detailed MR-based rabbit mesh with 4 million elements and one of the highest-resolution cardiac meshes in the world. We also performed simulations on a coarse-resolution version of the mesh with 40,000 nodes. For comparison, we simulated both meshes in Chaste [34], an open-source simulation library that supports CPU parallelization. Comparing the runtime of our method with Chaste demonstrates the efficiency of GPUs and the scalability of our method on large-scale models.
In the Chaste simulation, we used eight cores for CPU parallelization. The runtimes for the two meshes are shown in Table 14. Even with CPU parallelization, a large-scale mesh still requires a significant amount of time: for the 4-million-node mesh, Chaste took approximately 31 h to simulate 100 ms, which is unacceptable for scenarios requiring frequent parameter adjustments. In contrast, our method took only about 3 h to complete the same short-term simulation. Moreover, this large-scale simulation further demonstrates the scalability of our method.