Power Consumption Comparison of GPU Linear Solvers for Cellular Potts Model Simulations

De Luca, Pasquale; Galletti, Ardelio; Marcellino, Livia

doi:10.3390/app14167028


by Pasquale De Luca ^1,*,†, Ardelio Galletti ^2,† and Livia Marcellino ^2,†

¹ International PhD Programme/UNESCO Chair “Environment, Resources and Sustainable Development”, Department of Science and Technology, Parthenope University of Naples, Centro Direzionale Isola C4, 80143 Naples, Italy

² Department of Science and Technology, Parthenope University of Naples, Centro Direzionale Isola C4, 80143 Naples, Italy

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Appl. Sci. 2024, 14(16), 7028; https://doi.org/10.3390/app14167028 (registering DOI)

Submission received: 21 June 2024 / Revised: 1 August 2024 / Accepted: 8 August 2024 / Published: 10 August 2024

(This article belongs to the Special Issue Parallel, Distributed and Cloud Computing: Status, Prospects and Future)


Abstract:

Power consumption is a significant challenge in the sustainability of computational science. The growing energy demands of increasingly complex simulations and algorithms lead to substantial resource use, which conflicts with global sustainability goals. This paper investigates the energy efficiency of different parallel implementations of a Cellular Potts model, which models cellular behavior through Hamiltonian energy minimization techniques, leveraging modern GPU architectures. By evaluating alternative solvers, it demonstrates that specific methods can significantly enhance computational efficiency and reduce energy use compared to traditional approaches. The results confirm notable improvements in execution time and energy consumption. In particular, the experiments show a reduction in terms of power of up to 53%, providing a pathway towards more sustainable high-performance computing practices for complex biological simulations.

Keywords:

Cellular Potts model; parallel algorithms; GPU computing; energy performance profiling

1. Introduction

Energy consumption is a significant challenge in the sustainability of computational science. The growing energy demands of increasingly complex simulations and algorithms lead to substantial resource use, which conflicts with global sustainability goals. Efficient energy usage is crucial to reduce the environmental impact of computational activities, making energy optimization a key focus in advancing sustainable practices. The growing complexity of simulations, together with the computational demands of advanced algorithms and latest-generation computers, contributes to increased energy usage [1]. High-performance computing (HPC) environments are widely used in several scientific fields to reduce the long execution times that such workloads require [2,3]. In an HPC context, however, the energy issue is intensified by the large number of processors operating in parallel, which usually results in elevated energy consumption. Thus, in this sustainability scenario, studying the effects of HPC machine usage becomes crucial, and a fair analysis of the energy consumption of software and algorithms is mandatory. This paper focuses on the energy consumption of a specific set of parallel implementations of the Cellular Potts model (CPM) simulation, evaluating their impact on overall energy efficiency. In the field of cellular behavior, the CPM is a useful tool for biological cell simulation that aids the understanding of cell behavior [4,5]. CPM simulations are useful for improving the knowledge of various diseases and developing related treatments. This topic spans multiple disciplines, including molecular biology, genetics, biochemistry, and biophysics. By examining cellular behavior, the fundamental mechanisms of cellular processes can be explored in depth.

This knowledge plays a key role in identifying diseases and creating ad hoc therapies, such as those focused on cancer cell behavior. The approach of the Cellular Potts model is mainly based on the minimization of the energy of the biological system, which depends on cell–cell interactions: cell–cell adhesion, cell–substrate adhesion, and cell strain. Due to the need for accurate simulations, the size of the problem heavily impacts time and energy consumption, which motivates the use of parallel approaches. Some attempts in this direction can be found in [6], where the authors use the Message Passing Interface (MPI) in a many-CPU environment; in [7], where an approach based on software transactional memory is used for task synchronization; and in [8], where random walk-based algorithms are implemented in parallel. A further parallel implementation, which represents the starting point of this study, is presented in [9]. Here, the authors exploit the computational power of modern graphics processing units (GPUs) in a high-performance computing environment. The algorithm involves several kernels and, among them, the most demanding task is solving a large sparse linear system, which integrates cell mechanics descriptions through the finite element method (FEM) for the evaluation of the system’s biological energy. This parallel implementation uses the Jacobi iterative method as the solver, ensuring good performance in terms of execution time. These works represent the main research threads in the field of parallel computing for the CPM, and they are reported here to provide context. It is important to underline that no prior articles have addressed power consumption. In this paper, we tackle this aspect, comparing it with the most recent parallel software available. A preliminary study of the power consumption of this software was conducted in [10], showing its capability to meet low-energy requirements.
In this work, the structure of the algorithm is modified by replacing the linear solver with three other solvers: LU decomposition, the Gauss–Seidel method, and a custom-made version of the conjugate gradient (CG) method. The effects of these choices are measured in terms of both execution time and power saving. By making the Cellular Potts model simulations more energy efficient, one can run more experiments with less power. This is especially important because CPM simulations involve intensive computations. Improving energy efficiency therefore helps the environment and allows for more detailed studies of cell behavior, leading to better treatments for diseases. Table 1 summarizes the main studies and implementations in the CPM field by highlighting the main contribution of the proposed work, that is, a fair power consumption comparison of GPU linear solvers for the CPM. The rest of the paper is organized as follows: Section 2 provides the mathematical background and principles underlying the Cellular Potts model; Section 3 presents the parallel algorithm, including the use of the alternative solvers, and gives details about a new custom version of the CG method and the strategy to parallelize it; Section 4 presents experimental tests highlighting the enhancements in terms of execution time and comparing the energy savings of the proposed solvers; Section 5 discusses the findings about the proposed software by comparing it with existing software. Finally, Section 6 concludes the paper.

2. Mathematical Background

This section is devoted to the main details of the model under consideration. The description of the problem, and of the related algorithmic structure, closely follows the approach presented in [9]. From a mathematical point of view, the Cellular Potts model for biological systems can be thought of as a discrete-time Markov chain that regulates changes in a spatial lattice. This lattice is composed of a cluster of simply connected pixels whose changes are induced through the values of an energy function H. More precisely, the system’s evolution in time is characterized by probabilistic rules, applied to the lattice configuration, which model the interactions between adjacent cells in terms of cell size, cell shape, and the contact perimeter between cells. Formally, let us denote by $τ = 0, 1, \dots, N_{τ}$ a sequence of $N_{τ} + 1$ time steps, and by $Ω = \{1, 2, \dots, N\}^{2}$ a two-dimensional square lattice consisting of ${(N + 1)}^{2}$ points $P = (i, j)$. Assume that $N_{c}$ distinct cells dispersed throughout the medium cover some regions of $Ω$. The configurations of the cells through time are recorded by introducing a sequence of $N_{τ} + 1$ index matrices $L^{τ}$, where the value $L_{i, j}^{τ} \in \{0, 1, 2, \dots, N_{c}\}$ is the index of the cell that occupies the pixel at position $(i, j)$ at time step $τ$. In the usual convention:

$L_{i, j}^{τ} = 0$ indicates that pixel $(i, j)$ is occupied by the medium and not by other cells;
$L_{i, j}^{τ} = k$ for some $k \in \{1, 2, \dots, N_{c}\}$ indicates that pixel $(i, j)$ is part of the k-th cell.

The system’s evolution is completely recorded in the sequence of matrices $\{L^{τ}\}_{τ = 0, 1, \dots, N_{τ}}$. The state transition, that is, the modification of the entries of $L^{τ}$ as $τ$ grows, is governed by probabilistic rules involving cellular dynamics, such as proliferation and motility, which require the introduction of a vector field, $f_{P} \in \mathbb{R}^{2}, \forall P \in Ω$, representing the direction of cellular movement. That is, $f_{P}$ models the intended direction of cellular processes attributed to the pixel at P. For example, assuming there is only one cell in the medium, a possible representation of the evolution of $L^{τ}$ is shown in Figure 1.

Randomness is introduced at each step by a version of the Metropolis algorithm [11]. Specifically, this is a modified algorithm whose acceptance criterion depends on the values of a peculiar Hamiltonian function H, a measure of the system energy, treating the whole scheme as a random dynamical system. Recalling the description in [9], the computation of H consists in the following steps:

Initialization ( $τ = 0$ ): This first step creates the simulated environment; each lattice site is initialized to either medium or a specific cell.
Modified Metropolis algorithm ( $τ = 1, \dots, N_{τ}$ ): At each time step $τ$ , the system simulates $M \geq 1$ attempts for updating an index of one or more pixels. For each trial $l = 1, \dots, M$ , the scheme repeats these procedures:
(i)
Pixel selection: This step selects a random pixel P and one of its eight neighboring (surrounding) pixels Q;
(ii)
Hamiltonian energy change ( $Δ H$ ): If the cell index of P were replaced by the cell index of Q, the related variation in the Hamiltonian of the system would be given by

$Δ H = H^{new} - H .$

(1)

Here, H is the value at the current state, and $H^{new}$ is the value obtained if the cell index of P becomes the cell index of Q.
(iii)
Modified Metropolis criterion and state modification: To establish whether to accept the state transition of the cell index of P, the criterion uses the value of $Δ H$ . If it is negative or zero, the system undergoes the state transition; if it is positive, the transition is accepted with a Boltzmann distribution probability:

$p (Δ H) = \exp \left(- \frac{Δ H}{T}\right),$

(2)

where T plays a role analogous to temperature in thermodynamics.
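The acceptance rule in step (iii) can be illustrated with a minimal Python sketch. The function name `accept_transition`, the sample values of $Δ H$ and T, and the seeded random generator are hypothetical choices made for illustration only; they are not taken from the implementation in [9].

```python
import math
import random


def accept_transition(delta_h, temperature, rng=random):
    """Modified Metropolis criterion, Equation (2).

    A transition with delta_h <= 0 is always accepted; otherwise it is
    accepted with probability exp(-delta_h / temperature).
    """
    if delta_h <= 0:
        return True
    return rng.random() < math.exp(-delta_h / temperature)


# Empirical check (illustrative values): the acceptance frequency for a
# positive delta_h should approach exp(-delta_h / T).
rng = random.Random(0)
trials = 100_000
accepted = sum(accept_transition(1.0, 2.0, rng) for _ in range(trials))
print(accepted / trials, math.exp(-1.0 / 2.0))
```

Larger values of T make energetically unfavorable transitions more likely to be accepted, which is why T is described as playing the role of a temperature.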

The Hamiltonian is the total energy of the system, and it is given by

H = H_{a d} + H_{v} + H_{s} .

(3)

H encapsulates three different components, each of them reflecting different physical phenomena:

$H_{a d}$ denotes the contribution due to the cell adhesion energy and it is defined as the following sum:

$H_{a d} = J \cdot \sum_{P \in Ω} \sum_{Q \in N b (P)} \left(1 - δ (L_{P}, L_{Q})\right) .$

(4)

Here, $N b (P)$ is the set of neighboring positions of P, while the inner sum involves all neighbors Q by evaluating the expression $1 - δ (L_{P}, L_{Q})$ , where $δ$ is the Kronecker delta function, defined as

$δ (x, y) = \begin{cases} 1, & \text{if } x = y, \\ 0, & \text{otherwise}. \end{cases}$

The Kronecker function assigns values of 1 or 0 by checking if two neighboring cells are or are not of the same type, respectively: if they are of different types, the contribution reflects the presence of an interface between them. The parameter J quantifies the adhesion strength between different types of cells, that is, it determines how much cells adhere to each other within the tissue.
$H_{v}$ is a term measuring a cell’s area (volume in the case of 3D lattices), and it is given by

$H_{v} = λ \sum_{k} {(\frac{V_{k}}{\bar{V}} - 1)}^{2}, V_{k} = vol (C_{k}) .$

(5)

Here, $C_{k}$ and $V_{k}$ are the set of pixels of cell k and the current area of cell k, respectively. The term $\bar{V}$ represents a target volume, while $λ$ is a parameter that controls the strength of the volume constraint.
The last term,

$H_{s} = μ \sum_{P} {(Δ ε_{P})}^{2}, Δ ε_{P} = ε (P) - \bar{ε} (P)$

(6)

represents the effect of the surface tension and it stands for the mechanical strain of the system: $Δ ε_{P}$ is the gap between the current strain $ε (P)$ and the target value $\bar{ε} (P)$ . The parameter $μ$ controls the strength of the strain constraint. High accuracy is required in evaluating $H_{s}$ in order to capture the effects of mechanical forces on cell behavior.
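As an illustration of Equations (4) and (5), the following Python sketch evaluates $H_{a d}$ and $H_{v}$ on a tiny index matrix. Several choices here are assumptions made only for the example: the 4 × 4 lattice, the function names, the use of the eight surrounding pixels as $N b (P)$, the handling of boundary pixels (neighbors outside the lattice are ignored), and the exclusion of the medium (index 0) from the volume sum.

```python
def adhesion_energy(lattice, j_strength):
    """H_ad = J * sum over P and over its neighbors Q of (1 - delta(L_P, L_Q)).

    Assumption: Nb(P) is the set of the eight surrounding pixels that lie
    inside the lattice, so each mixed pair is counted once per direction.
    """
    rows, cols = len(lattice), len(lattice[0])
    mismatches = 0
    for i in range(rows):
        for j in range(cols):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    qi, qj = i + di, j + dj
                    if 0 <= qi < rows and 0 <= qj < cols:
                        mismatches += lattice[i][j] != lattice[qi][qj]
    return j_strength * mismatches


def volume_energy(lattice, lam, target_volume):
    """H_v = lambda * sum over cells k of (V_k / V_bar - 1)^2.

    Assumption: the medium (index 0) is not treated as a cell.
    """
    volumes = {}
    for row in lattice:
        for label in row:
            if label != 0:
                volumes[label] = volumes.get(label, 0) + 1
    return lam * sum((v / target_volume - 1) ** 2 for v in volumes.values())


# One 2 x 2 cell (index 1) surrounded by medium (index 0).
lattice = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(adhesion_energy(lattice, 1.0), volume_energy(lattice, 1.0, 4.0))
```

Because a single pixel update only changes the terms involving that pixel and its neighbors, a practical code would update these sums incrementally rather than recomputing them over the whole lattice as done here.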

It has to be noted that while using the Metropolis criterion the change in the Hamiltonian can be conveniently written as follows:

Δ H = Δ H_{ad} + Δ H_{v} + Δ H_{s} .

(7)

This implies that $Δ H$ can be evaluated by separately computing the change in each of its addends. Since the model proceeds one pixel at a time, these changes involve only a few terms, related to the neighboring pixels, and have a low computational cost. For the evaluation of the strain $ε$, the finite element method (FEM) is used, following the approach shown in [9]. Such a choice allows the strain to be computed by breaking down the global problem into smaller and more manageable pieces. It requires the use of a local stiffness matrix $K_{e}$, which takes into account, for each element e, the relation between the nodal displacements and the forces exerted upon them. In this case, the expression for $K_{e}$ is given by

K_{e} = \int_{Ω_{e}} B^{T} D B d Ω_{e}

(8)

The set $Ω_{e}$ refers to the specific domain associated with element e, while B is the strain–displacement matrix for quadrilateral elements with four nodes. Finally, the material matrix D is related to the plane stress conditions [12]. Denoting by K the global stiffness matrix, which is assembled from the local stiffness matrices $K_{e}$, the global mechanical behavior of the system is obtained by solving a large set of linear equations, represented as

K u = f .

(9)
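The assembly of K from the local matrices $K_{e}$ can be sketched as a scatter-add over elements. The example below is deliberately simplified and is not the paper's discretization: it assumes hypothetical one-dimensional two-node elements with local matrix $k [[1, -1], [-1, 1]]$ instead of four-node quadrilaterals, and a dense list-of-lists storage, only to show how local entries accumulate into global positions.

```python
def assemble_global_stiffness(num_nodes, elements, local_matrices):
    """Scatter-add each local matrix K_e into the global matrix K.

    elements[e] lists the global node indices of element e, in the same
    order as the rows and columns of local_matrices[e].
    """
    K = [[0.0] * num_nodes for _ in range(num_nodes)]
    for nodes, k_e in zip(elements, local_matrices):
        for a, global_a in enumerate(nodes):
            for b, global_b in enumerate(nodes):
                K[global_a][global_b] += k_e[a][b]
    return K


# Two hypothetical 1D elements sharing node 1 (illustrative stiffness k = 2).
k = 2.0
k_e = [[k, -k], [-k, k]]
elements = [(0, 1), (1, 2)]
K = assemble_global_stiffness(3, elements, [k_e, k_e])
for row in K:
    print(row)
```

The shared node receives contributions from both elements, which is why the global matrix is sparse and symmetric when each $K_{e}$ is symmetric.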

Here, u represents the vector of all displacements of all nodes, and the vector

f = {(f_{P})}_{P \in Ω}

(10)

stands for the forces $f_{P}$ acting on all pixels P. Following the description of Lemmon and Romer in [13], for each P the cumulative force $f_{P}$ is the resultant vector of all forces applied by its neighboring pixels:

f_{P} = \sum_{Q \in Nb (P)} f_{P, Q},

(11)

with $f_{P, Q}$ denoting the force applied to P by pixel Q. The stiffness matrix K has to be assembled only once, but at each time step it is employed in solving the linear system (9), which is a large-scale problem. In [9], the Jacobi iterative method, through the gesvdj routine from the cuSOLVER library, was employed as the solver. Its impact on power consumption was investigated in [10]. Here, a comparison, in terms of efficiency and energy saving, with three other linear solvers is provided: the LU decomposition, implemented through the getrf routine from cuSOLVER; the Gauss–Seidel method, implemented using the AmgX library [14]; and a custom-made version of the conjugate gradient method, which is particularly relevant since the matrix K is symmetric and positive definite. A short description of the CG method follows. Starting from an initial guess $u_{0}$, the CG method iteratively computes approximations $u_{k}$ $(k = 1, 2, \dots)$ by optimizing along mutually conjugate directions. Using the notation in (9), the process minimizes the value of the quadratic function

F (u) = \frac{1}{2} u^{T} K u - u^{T} f

(12)

over successive mutually conjugate search directions. At each iteration, the approximation $u_{k}$ is computed in the following way. First, the initial residual $r_{0} = f - K u_{0}$ corresponding to the initial guess $u_{0}$ is computed. Then, after setting $d_{0} = r_{0}$, for $k = 0, 1, \dots$ (until convergence), the following steps are iteratively executed:

(i): Compute:

$α_{k} = \frac{r_{k}^{T} r_{k}}{d_{k}^{T} K d_{k}};$

(13)
(ii): Update approximation: $u_{k + 1} = u_{k} + α_{k} d_{k}$ ;
(iii): Update residual $r_{k + 1} = r_{k} - α_{k} K d_{k},$ and the direction

$d_{k + 1} = r_{k + 1} + \frac{r_{k + 1}^{T} r_{k + 1}}{r_{k}^{T} r_{k}} d_{k} .$

(14)
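Steps (i) to (iii) can be written as a short sequential Python sketch. It is meant only to make the recurrences (13) and (14) concrete: the dense list-based storage, the stopping test on the residual norm, the tolerance, and the 2 × 2 test system are assumptions of this example, not details of the custom GPU implementation.

```python
def matvec(K, v):
    """Dense matrix-vector product K v."""
    return [sum(k_ij * v_j for k_ij, v_j in zip(row, v)) for row in K]


def dot(a, b):
    """Euclidean inner product a^T b."""
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(K, f, u0=None, tol=1e-10, max_iter=1000):
    """Solve K u = f for a symmetric positive-definite K."""
    u = list(u0) if u0 is not None else [0.0] * len(f)
    r = [f_i - ku_i for f_i, ku_i in zip(f, matvec(K, u))]  # r_0 = f - K u_0
    d = list(r)                                             # d_0 = r_0
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:  # assumed stopping rule: small residual norm
            break
        Kd = matvec(K, d)
        alpha = rr / dot(d, Kd)                              # Equation (13)
        u = [u_i + alpha * d_i for u_i, d_i in zip(u, d)]    # step (ii)
        r = [r_i - alpha * kd_i for r_i, kd_i in zip(r, Kd)]  # step (iii)
        rr_new = dot(r, r)
        d = [r_i + (rr_new / rr) * d_i for r_i, d_i in zip(r, d)]  # Equation (14)
        rr = rr_new
    return u


# Illustrative symmetric positive-definite system.
K = [[4.0, 1.0], [1.0, 3.0]]
f = [1.0, 2.0]
u = conjugate_gradient(K, f)
print(u)
```

Each iteration needs one matrix–vector product, two inner products, and three vector updates, which are the operations that a parallel version has to map onto the GPU.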

The above discussion is summarized in the following Algorithm 1, which is the base of the parallel implementations proposed in the next section.

Algorithm 1: Cellular Potts Model Pseudo-Algorithm

3. Parallel Approach

In this section, the development of a parallel variant of Algorithm 1 is presented. The initial algorithm was implemented as sequential code in C. To address the complexity of the problem more effectively, the main components of this original CPU-based implementation were parallelized. While the sequential implementation delivered promising results, the computational complexity and problem dimensions required a more efficient solution, thus leading to the design of a parallel algorithm that mirrors the fundamental operations of the sequential approach. Our implementation exploits the computational power of the GPU across all components of the algorithm. This includes the initialization of cells, the construction of the lattice grid, and the implementation of the Metropolis algorithm. From the model description in Section 2 and Algorithm 1, it is clear that the most computationally intensive process, due to its frequent repetition, is solving the linear system (9). To address this, three different solvers are proposed, LU decomposition, Gauss–Seidel, and conjugate gradient (CG), in addition to the Jacobi method described in [9]. Given the properties of the matrix K, CG is the most suitable. Hence, the focus is on the parallel implementation of the CG method, which is not included in the cuSOLVER library routines [15]. For precise control over how the algorithm is executed, all CUDA kernels are manually developed instead of using third-party libraries. This choice allows the kernels to be tailored exactly to the required specifications, optimizing performance and allowing for custom solutions.

Following the structure outlined in Algorithm 1, a parallelization of the entire process was implemented using essential GPU-native API routines provided by CUDA [16]. This enables the user to optimize each stage for high performance. In order to discuss the parallel implementation, the main operations are split into three major steps, each designed to work as follows:

STEP 0: Starting environment at time $τ = 0$. In this step, the lattice grid structure and the cells are initialized using a thread-based parallelism strategy. This approach deploys pools of threads to manage the initialization of different cell segments. More precisely, a specific segment of a cell is assigned to one thread, maximizing the effective processing power of the GPU. This ensures efficient usage of GPU capabilities for priming the simulation environment. With this approach, data transfers between the CPU and GPU are minimized, thereby reducing the overhead typically associated with such transfers. Each thread is tasked with initializing a cell segment and assigning a unique identifier that defines its starting state. Additionally, these threads play a key role in building the lattice grid, linking each grid point to its respective cell. In this way, the initialization times are notably shortened.

STEP 1: FEM priming. The creation of matrix K through the FEM discretization process for strain is a one-time operation. To accomplish this efficiently, a dedicated CUDA kernel for constructing the global stiffness matrix was developed. This involves incorporating contributions from every mesh element. This process proceeds concurrently, with individual threads assigned to handle distinct segments of elements. This specific parallelization method guarantees extensive parallelism.

STEP 2: CPM iteration. This step collects specific custom CUDA kernels for each phase of the Metropolis algorithm: to update the force vectors and to assemble them, as in (10); to solve the linear system (9); and to calculate the change in energy, as in (7). In more detail:

Updating the force vector f is essential in simulations, as it reflects the active forces affecting each pixel and its neighboring pixels. Initially, this process followed a sequential approach: beginning with the analysis of individual pixels, proceeding to identify their neighboring pixels, and concluding with the computation of specific pixel forces. While functional, this sequential method can be computationally intensive, especially in simulations involving large-scale systems. In contrast, the parallel implementation employs CUDA kernels to generate a matrix where each row corresponds to a unique pixel. This allows for simultaneous computation of forces across all pixels, eliminating the need for sequential pixel-by-pixel force calculations. By leveraging the parallel processing capabilities of GPU architectures, this approach maximizes efficiency through concurrent computations. Once forces for each pixel are asynchronously computed, they are then assigned to their respective slots in the force vector f. This asynchronous process enhances performance by overlapping computational and memory operations, leveraging a critical advantage of the CUDA environment.
In order to solve the linear system presented in Equation (9), well-known solvers available in [14,15] (such as Jacobi, LU, and Gauss–Seidel) were considered. The choice of the conjugate gradient (CG) method was driven by the aim to investigate computational time and power consumption reductions. The CG method is inherently more efficient for the kind of symmetric and positive-definite matrices typically encountered in the proposed scheme, offering faster convergence rates compared to the Jacobi method. Moreover, an advantage of the CG implementation is its fully GPU-centric design. Unlike the other methods, where employing routines from the cuSOLVER and AmgX libraries requires intermittent CPU–GPU data transfers, this approach operates entirely on the GPU. Avoiding these data transfers notably reduces the associated overhead and speeds up the overall computational process. In the proposed parallel GPU implementation of the CG method, key operations such as vector addition ( $u_{k + 1} = u_{k} + α_{k} \cdot d_{k}$ ) and subtraction ( $r_{k + 1} = r_{k} - α_{k} \cdot K \cdot d_{k}$ ), as well as matrix–vector multiplication ( $K \cdot d_{k}$ and $K \cdot u_{0}$ ), are designed ad hoc for GPU execution. Basic algebraic operations are performed using dynamic parallelism, supported by recent GPUs, which allows CUDA kernels to be nested by enabling the invocation of other kernels from already executing threads. These threads handle the following operations: vector computations are processed in a point-wise way, with each element processed independently by a dedicated thread, which speeds up the computation; the matrix–vector product is managed by distributing matrix rows across multiple threads, each one performing the dot product for its assigned row. In this way, the computations are optimized for the sparsity and structure of matrix K. In a similar way, the computation of the step size ( $α_{k}$ ) and the update of the direction vector ( $d_{k + 1}$ ) also benefit from parallelization, using reduction patterns for dot products and point-wise operations for scalar–vector multiplications and additions. These CUDA kernel-based operations ensure coalesced memory access and use shared memory to minimize latency, yielding a fully GPU-based approach. In addition, the CUDA thread configuration (blocks × threads) is defined adaptively according to the starting dimension of the $Ω$ domain.
The computation of energy changes is pivotal in the Metropolis algorithm. Three distinct CUDA kernels were developed to concurrently calculate the $Δ H_{a d}$ , $Δ H_{v}$ , and $Δ H_{s}$ components, employing an asynchronous parallel approach. Computing $Δ H_{a d}$ and $Δ H_{v}$ is straightforward and allows for rapid processing; thus, these tasks are assigned to a single thread for optimal efficiency. In contrast, computing the $Δ H_{s}$ component involves accessing the previously computed displacement vector u and force vector f, which can be more demanding on a thread. To address this, the dynamic parallelism capabilities of modern GPUs are exploited to significantly accelerate this computation. This parallel approach engages only two threads, each independently computing updates to $Δ H_{s}$ asynchronously. These threads retrieve displacement data from u and perform the necessary operations to determine the strain characteristics. Finally, one thread updates the value of $Δ H_{s}$ .
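The row-wise matrix–vector product and the reduction pattern for dot products described for the CG kernels can be mimicked sequentially in Python to show which units of work are independent. This is a sketch under stated assumptions: the compressed sparse row (CSR) layout, the function names, and the pairwise reduction order are illustrative choices, and the source does not specify the storage format used for K. Each call to `csr_row_dot` corresponds to the work that one GPU thread would perform for its assigned row.

```python
def dense_to_csr(matrix):
    """Convert a dense matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_row_dot(values, col_idx, row_ptr, vec, row):
    """Work of one thread: dot product of one sparse row with vec."""
    start, end = row_ptr[row], row_ptr[row + 1]
    return sum(values[p] * vec[col_idx[p]] for p in range(start, end))


def csr_matvec(values, col_idx, row_ptr, vec):
    """Rows are independent, so this loop could run one row per thread."""
    n_rows = len(row_ptr) - 1
    return [csr_row_dot(values, col_idx, row_ptr, vec, i) for i in range(n_rows)]


def tree_reduce_sum(items):
    """Pairwise (tree) reduction, the pattern used for parallel sums."""
    data = list(items)
    if not data:
        return 0.0
    while len(data) > 1:
        if len(data) % 2:
            data.append(0.0)  # pad so every element has a partner
        data = [data[i] + data[i + 1] for i in range(0, len(data), 2)]
    return data[0]


K = [[2.0, -2.0, 0.0], [-2.0, 4.0, -2.0], [0.0, -2.0, 2.0]]
d = [1.0, 2.0, 3.0]
csr = dense_to_csr(K)
Kd = csr_matvec(*csr, d)
d_dot_Kd = tree_reduce_sum(x * y for x, y in zip(d, Kd))
print(Kd, d_dot_Kd)
```

Storing only the nonzero entries means that each row costs work proportional to its number of nonzeros, and the tree reduction takes a logarithmic number of steps when its pairs are summed concurrently.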

The previous discussions allow us to conclude the section with Algorithm 2, which summarizes the main parallel operations.

Algorithm 2: Parallel CPM method through different linear solvers

4. Results

This section details the primary experiments conducted to validate the performance gain and power consumption savings of the developed parallel software. With the substantial increase in trials required for real-world healthcare applications, the computational demand of the Metropolis algorithm also rises significantly. The primary aim is to demonstrate the efficiency, particularly in terms of execution time reduction. Experimental tests, focusing on the most computationally intensive kernels, are presented to highlight the substantial improvements in time performance achieved through the GPU-parallelized algorithm. The following tests were conducted on the MARCONI-100 high-performance computing system provided by CINECA [17], featuring:

Dual 16-core IBM POWER9 AC922 CPUs at 3.1 GHz;
Four NVIDIA Volta V100 GPUs with NVLink 2.0 connectivity and 16 GB memory each;
256 GB of system RAM.

The input parameters for the CPM simulations were derived from an extensive literature review, ensuring the tests reflect real-world scenarios and facilitate direct comparisons with established findings. The test setup varies two critical parameters: the number of trials M, equal to the number of cells; and the number of Metropolis algorithm time steps $N_{τ}$. These parameters significantly impact the computational performance of the parallel algorithm. Moreover, the computational complexity of the CPM model increases not only due to these input parameters but also due to the solver used. The most computationally demanding part involves solving the linear system, with the complexity heavily influenced by the solver choice. Therefore, some tests adopting different solvers within the CUDA environment were carried out. The performance metrics are the mean results of 30 executions, in order to obtain reliable measurements of both execution time and power consumption.

Test 1: Execution Time Comparison of Linear System Solvers.

In this experiment, the computational kernel responsible for solving the linear system $K u = f$ is considered in isolation from the overall software execution times. This focused approach enables a detailed evaluation of the efficiency and effectiveness of each solver. Table 1 compares the execution times for all the chosen GPU-based methods:

The Jacobi method, implemented via the gesvdj routine from the cuSOLVER library;
The LU decomposition, implemented through the getrf routine from cuSOLVER;
The Gauss–Seidel method, implemented using the AmgX library;
The conjugate gradient-based custom implementation.

By varying the parameters $N_{τ}$ and M, where $N_{τ}$ is the number of time iterations and M is the number of Metropolis algorithm trials, a comparison between the different methods is presented. More precisely, the following test provides insights into the performance characteristics and computational efficiency of each solver, revealing significant improvements in execution time with the custom CG method.

Table 1 shows that the custom CG solver achieves significant execution time reductions compared to the other GPU solvers. This trend is observed across varying values of $N_{τ}$ and M, confirming that the custom CG solver gives the best performance in each experimental setup in the table.

Test 2: Power Consumption Comparison of Linear System Solvers.

Starting from the good results in execution time achieved in the previous test, the following tests compare all the methods in terms of power consumption. In detail, the power consumption data in Table 2 highlight the energy efficiency of the various GPU solvers, including Jacobi, LU, Gauss–Seidel, and the custom conjugate gradient (CG) solver, across the computational load levels given by different values of $N_{τ}$.

The custom CG solver exhibits the lowest power usage across all matrix sizes (M), indicating superior energy efficiency. For instance, at $M = 900$, the custom CG solver consumes 14.70 watts for $N_{τ} = 9000$, while the Jacobi solver requires 25.81 watts under the same conditions. This trend underscores the effectiveness of the custom CG solver in reducing power consumption, making it a preferable choice for energy-constrained environments.

Test 3: Power Consumption of the Overall Algorithm for the Gauss–Seidel and Custom CG Solvers.

Starting from the results achieved in Table 1 and Table 2, it is evident that the Gauss–Seidel and custom CG solvers are the most efficient in terms of both time and energy consumption. Building on this observation, a comparative test evaluating the execution of complete GPU software implementations is introduced. The first implementation adopts the Gauss–Seidel solver, while the second utilizes the custom CG solver. The goal is to assess the overall power consumption of these full-scale GPU solutions.

The results in Table 3 highlight that the GPU implementation using the custom CG solver (${GPU}_{C G}$) outperforms the one utilizing the Gauss–Seidel solver (${GPU}_{G S}$) in terms of power efficiency. For instance, at $M = 900$, ${GPU}_{C G}$ consumes 83.81 watts, which is lower than the 179.89 watts used by ${GPU}_{G S}$. This trend is consistent across all the tested matrix sizes.
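The relative savings implied by the power figures quoted in Tests 2 and 3 for $M = 900$ can be checked with a few lines of Python. Only the four wattage values come from the text; the function name and the choice of the higher-power solver as the baseline are assumptions of this sketch.

```python
def relative_reduction(baseline_watts, improved_watts):
    """Fraction of the baseline power saved by the improved solver."""
    return (baseline_watts - improved_watts) / baseline_watts


# Test 2, solver kernel only: Jacobi (25.81 W) versus custom CG (14.70 W).
solver_level = relative_reduction(25.81, 14.70)

# Test 3, full GPU implementation: 179.89 W versus 83.81 W for the CG-based one.
full_run = relative_reduction(179.89, 83.81)

print(f"solver kernel: {solver_level:.1%}")
print(f"full implementation: {full_run:.1%}")
```

The full-implementation value, about 53%, is consistent with the maximum power reduction stated in the abstract.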

5. Discussion

In computational simulations, balancing efficiency and power consumption is a central challenge, especially for Monte Carlo-based methods such as the Metropolis algorithm. These methods require many iterations to ensure statistical significance, which increases both computational cost and power consumption. This paper addresses these challenges by leveraging GPU acceleration to enhance performance while limiting energy usage. The simulations presented in the previous section showed that GPU acceleration significantly reduces computation time and energy consumption compared to traditional CPU-based approaches. Substantial reductions in energy consumption were achieved through optimized GPU usage, maintaining high computational efficiency without excessive energy costs.

To contextualize these contributions, the results are compared with existing studies, summarized in Table 4, highlighting the differences in focus and outcomes. Tomeu et al. [7] proposed a parallel implementation of the CPM using software transactional memory, achieving a maximum speedup of 10.76× with 12 parallel tasks, but did not address power consumption. Another study introduced a 3D parallelization scheme for FEM coupling simulations using GPU acceleration, reducing the simulation time by over 80% compared to serial computations, but it did not address energy consumption [18]. Schultheiss et al. [19] proposed a hybrid parallel framework combining CPU and GPU resources for CPM simulations, achieving appreciable reductions in computation time by distributing tasks among CPUs and GPUs. A further study presented a parallel approximate Bayesian computation sequential Monte Carlo algorithm for high-performance computing clusters, significantly improving the computational efficiency of multi-scale biological process simulations [20]. Kang et al. [21] explored GPU acceleration for the CPM to improve simulation times, reporting significant improvements but not addressing energy consumption. Feng et al. [22] investigated various parallel computing techniques to improve the efficiency of biological simulations, achieving enhanced computational efficiency but treating energy consumption only marginally.

This paper distinguishes itself by incorporating energy efficiency as a key component, ensuring that the benefits of accelerated computation do not come at the expense of increased power consumption. The optimized use of GPU resources leads to a notable decrease in power usage compared to traditional CPU-based methods, balancing performance and energy efficiency.

In conclusion, the proposed work demonstrates that GPU acceleration, through optimized computational strategies, can achieve significant improvements in both computational efficiency and energy consumption. Comparing the findings with the existing literature highlights the specific contribution of this approach, particularly in the context of Monte Carlo-based simulations. The focus on energy efficiency addresses a gap in the field: none of the studies listed in Table 4 includes an energy consumption analysis. The benefits of accelerated computation are thus obtained without incurring high energy costs, which supports the development of efficient and sustainable computational methods.

The proposed software can help researchers by enabling large-scale real-time simulations of tumor growth and angiogenesis, reducing the time required to obtain results compared to traditional experiments. This acceleration allows various tumor growth scenarios to be explored more quickly.
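
The discussion refers to both power (watts) and energy (joules); the two are related by integrating power over the run time. The following is a minimal sketch of that relation, assuming power has been sampled at known time stamps. The sample values are synthetic and purely illustrative, not measurements from this paper, and the function names are hypothetical. On NVIDIA GPUs such samples could, for example, be read through NVML, but the measurement tool is not stated in this section, so that is an assumption.

```python
def energy_joules(samples):
    """Trapezoidal integral of (time_s, power_w) samples, returned in joules."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def average_power(samples):
    """Average power in watts over the sampled interval."""
    duration = samples[-1][0] - samples[0][0]
    return energy_joules(samples) / duration


# Synthetic (time in s, power in W) samples, for illustration only.
SAMPLES = [(0.0, 10.0), (1.0, 20.0), (2.0, 20.0), (3.0, 10.0)]

print(f"energy = {energy_joules(SAMPLES):.1f} J")
print(f"average power = {average_power(SAMPLES):.2f} W")
```

Under this relation, a solver that draws less power and also finishes sooner reduces energy on both counts, which is why execution time and power are reported side by side.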

6. Conclusions

In this paper, various parallel implementations of the Cellular Potts model have been developed and evaluated to address the challenge of energy efficiency in computational biological simulations. By leveraging modern GPU architectures and exploring alternative solvers for large-scale linear systems of equations, including the proposed parallel implementation of the conjugate gradient method, we demonstrated that specific methods can significantly enhance performance compared to traditional approaches. Comprehensive tests confirm that the custom CG GPU implementation is the best choice among the tested solvers in terms of both execution time and power consumption. These results encourage us, in future work, to study and monitor the temperatures reached by the proposed implementations in comparison with existing software.
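
For readers unfamiliar with the conjugate gradient method mentioned above, the following is a minimal serial sketch of the textbook algorithm for a symmetric positive definite system. It is not the authors' custom GPU implementation; the function names, the dense list-of-lists matrix format, and the tolerance are assumptions chosen for illustration.

```python
def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product, A given as a list of rows."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A (textbook CG)."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0 initially
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break          # residual norm small enough
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small illustrative system (not taken from the paper).
solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(solution)
```

Each iteration is dominated by one matrix-vector product and a few vector updates and reductions, operations that map naturally onto GPU kernels.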

Author Contributions

All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

P.D.L., A.G. and L.M. are members of the Gruppo Nazionale Calcolo Scientifico-Istituto Nazionale di Alta Matematica (GNCS-INdAM).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CPM	Cellular Potts model
GPU	Graphics processing unit
CG	Conjugate gradient
GS	Gauss–Seidel
FEM	Finite element method
CUDA	Compute Unified Device Architecture
HPC	High-performance computing

References

1. Elmisaoui, S.; Kissami, I.; Ghidaglia, J.M. High-Performance Computing to Accelerate Large-Scale Computational Fluid Dynamics Simulations: A Comprehensive Study. In Proceedings of the International Conference on Advanced Intelligent Systems for Sustainable Development, Marrakech, Morocco, 15–17 October 2023; Springer Nature: Cham, Switzerland, 2023; pp. 352–360.
2. Laccetti, G.; Lapegna, M.; Mele, V.; Romano, D. A study on adaptive algorithms for numerical quadrature on heterogeneous GPU and multicore based systems. In Proceedings of the International Conference on Parallel Processing and Applied Mathematics PPAM 2013, Warsaw, Poland, 8–11 September 2013; Lecture Notes in Computer Science. Volume 8384 LNCS, pp. 704–713.
3. Laccetti, G.; Lapegna, M.; Mele, V.; Romano, D.; Murli, A. A double adaptive algorithm for multidimensional integration on multicore based HPC systems. Int. J. Parallel Program. 2012, 40, 397–409.
4. Glazier, J.A.; Graner, F. Simulation of the differential adhesion driven rearrangement of biological cells. Phys. Rev. E 1993, 47, 2128.
5. van Oers, R.F.; Rens, E.G.; LaValley, D.J.; Reinhart-King, C.A.; Merks, R.M. Mechanical cell-matrix feedback explains pairwise and collective endothelial cell behavior in vitro. PLoS Comput. Biol. 2014, 10, e1003774.
6. Chen, N.; Glazier, J.A.; Izaguirre, J.A.; Alber, M.S. A parallel implementation of the Cellular Potts Model for simulation of cell-based morphogenesis. Comput. Phys. Commun. 2007, 176, 670–681.
7. Tomeu, A.J.; Gámez, A.; Salguero, A.G. A parallel implementation for cellular potts model with software transactional memory. In Practical Applications of Computational Biology and Bioinformatics, Proceedings of the 13th International Conference; Springer: Cham, Switzerland, 2020; pp. 53–60.
8. Gusatto, É.; Mombach, J.C.; Cercato, F.P.; Cavalheiro, G.H. An efficient parallel algorithm to evolve simulations of the cellular Potts model. Parallel Process. Lett. 2005, 15, 199–208.
9. De Luca, P.; Galletti, A.; Marcellino, L. Towards a parallel code for cellular behavior in vitro prediction. In Proceedings of the International Conference on Numerical Computations: Theory and Algorithms, Paris, France, 14–20 June 2023; Springer International Publishing: Cham, Switzerland, 2023.
10. De Luca, P.; Galletti, A.; Marcellino, L. Energy performance profiling of a GPU-based CPM implementation. In Proceedings of the 2023 17th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), Bangkok, Thailand, 8–10 November 2023; pp. 417–421.
11. Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. Equation of state calculations by fast computing machines. J. Chem. Phys. 1953, 21, 1087–1092.
12. Zienkiewicz, O.C.; Taylor, R.L.; Zhu, J.Z. (Eds.) The Finite Element Method: Its Basis and Fundamentals, 7th ed.; Butterworth-Heinemann: Oxford, UK, 2013.
13. Lemmon, C.A.; Romer, L.H. A predictive model of cell traction forces based on cell geometry. Biophys. J. 2010, 99, L78–L80.
14. Naumov, M.; Arsaev, M.; Castonguay, P.; Cohen, J.; Demouth, J.; Eaton, J.; Layton, S.; Markovskiy, N.; Reguly, I.; Sakharnykh, N.; et al. AmgX: A library for GPU accelerated algebraic multigrid and preconditioned iterative methods. SIAM J. Sci. Comput. 2015, 37, S602–S626.
15. NVIDIA. cuSOLVER. Available online: https://docs.nvidia.com/cuda/cusolver/index.html (accessed on 19 March 2024).
16. NVIDIA. CUDA. Available online: https://www.nvidia.com/cuda (accessed on 26 July 2024).
17. CINECA. Available online: https://www.hpc.cineca.it/training/ (accessed on 27 March 2024).
18. Li, S.; Lei, L.; Hu, Y.; He, Y.; Sun, Y.; Zhou, Y. A GPU parallelization scheme for 3D agent-based simulation of in-stent restenosis. In Proceedings of the 2019 IEEE International Conference on Cyborg and Bionic Systems (CBS), Munich, Germany, 18–20 September 2019; pp. 322–327.
19. Schultheiss, K.; Kolb, M.; Toschi, L. A Hybrid Parallel Framework for the Cellular Potts Model Simulations. J. Comput. Sci. 2018, 25, 141–151.
20. Jagiella, N.; Rickert, D.; Theis, F.J.; Hasenauer, J. Parallelization and high-performance computing enables automated statistical inference of multi-scale models. Cell Syst. 2017, 4, 194–206.
21. Kang, H.W.; Lee, S.; Kim, Y. A Cellular Potts Model for Simulating Microenvironmental Influences on Tumor Progression. IEEE Trans. Biomed. Eng. 2019, 66, 1234–1245.
22. Feng, X.; Wu, H.; Zhang, J. Advanced Parallel Techniques for Biological Simulations. Comput. Struct. Biotechnol. J. 2020, 18, 132–143.
23. Madhikar, P.; Åström, J.; Westerholm, J.; Karttunen, M. CellSim3D: GPU accelerated software for simulations of cellular growth and division in three dimensions. Comput. Phys. Commun. 2018, 232, 206–213.
24. Hattne, J.; Shi, D.; Glynn, C.; Zee, C.T.; Gallagher-Jones, M.; Martynowycz, M.W.; Rodriguez, J.A.; Gonen, T. Analysis of global and site-specific radiation damage in cryo-EM. Structure 2018, 26, 759–766.

Figure 1. Lattice evolution $(L^{τ})$ involved in Cellular Potts model for some values of $τ$. White is for medium, red for the unique cell type.

Table 1. Execution time (in seconds) comparison across different GPU solvers. Here, for all simulations $N = 600$ and $N_{c} = 400$; the time iterations $N_{τ}$ and number of Metropolis trials M are varied.

M	Solver	$N_{τ}$ = 3000	$N_{τ}$ = 6000	$N_{τ}$ = 9000
150	Jacobi	79.54	129.02	253.17
	LU	72.04	96.64	191.88
	Gauss–Seidel	65.81	80.73	140.07
	Custom CG	49.55	64.92	112.83
300	Jacobi	107.79	198.84	307.56
	LU	79.30	154.92	226.06
	Gauss–Seidel	61.34	126.92	194.02
	Custom CG	52.22	86.92	130.31
450	Jacobi	182.41	348.91	523.93
	LU	135.34	278.13	464.84
	Gauss–Seidel	113.91	223.91	326.91
	Custom CG	99.10	202.81	292.98
750	Jacobi	315.97	650.14	902.49
	LU	244.95	516.01	706.93
	Gauss–Seidel	197.75	452.99	570.64
	Custom CG	141.01	386.33	442.93
900	Jacobi	450.91	991.81	1483.83
	LU	354.94	763.19	1148.09
	Gauss–Seidel	294.46	612.01	940.93
	Custom CG	254.91	513.92	750.56
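
As a reading aid for Table 1, the speedup of the custom CG solver over each other solver can be computed as the ratio of execution times. The sketch below copies two configurations from the table (the smallest and the largest); the dictionary layout and function name are illustrative choices, not part of the original software.

```python
# Execution times in seconds copied from Table 1 (N = 600, N_c = 400).
# Keys are (M, N_tau); values map solver name -> time.
TIMES = {
    (150, 3000): {"Jacobi": 79.54, "LU": 72.04,
                  "Gauss-Seidel": 65.81, "Custom CG": 49.55},
    (900, 9000): {"Jacobi": 1483.83, "LU": 1148.09,
                  "Gauss-Seidel": 940.93, "Custom CG": 750.56},
}


def speedup(row, baseline, candidate):
    """Return baseline time / candidate time (> 1 means candidate is faster)."""
    return row[baseline] / row[candidate]


for (m, n_tau), row in TIMES.items():
    for solver in ("Jacobi", "LU", "Gauss-Seidel"):
        s = speedup(row, solver, "Custom CG")
        print(f"M={m}, N_tau={n_tau}: Custom CG vs {solver}: {s:.2f}x")
```

For the largest configuration, the ratio against Jacobi is close to 2, meaning the custom CG run takes roughly half the time.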

Table 2. Power consumption (in watts) for different GPU solvers, Jacobi, LU, Gauss–Seidel, and custom CG, focusing on micro-computational kernels by varying time iteration $N_{τ}$ and number of Metropolis trials M.

M	Solver	$N_{τ}$ = 3000	$N_{τ}$ = 6000	$N_{τ}$ = 9000
150	Jacobi	2.71	3.94	5.03
	LU	2.43	3.04	3.79
	Gauss–Seidel	1.98	2.14	2.74
	Custom CG	1.39	1.54	2.22
300	Jacobi	3.73	6.04	5.93
	LU	2.73	4.64	4.39
	Gauss–Seidel	2.26	3.81	3.72
	Custom CG	1.75	2.69	2.54
450	Jacobi	6.18	9.62	10.58
	LU	4.48	7.67	9.21
	Gauss–Seidel	3.60	6.17	6.52
	Custom CG	3.01	5.14	5.81
750	Jacobi	16.01	16.48	17.08
	LU	11.79	14.29	15.41
	Gauss–Seidel	9.53	12.51	12.55
	Custom CG	7.06	10.59	10.90
900	Jacobi	18.02	22.99	25.81
	LU	14.60	20.08	22.51
	Gauss–Seidel	11.57	16.17	18.40
	Custom CG	8.39	13.51	14.70

Table 3. Power consumption comparison (in watts) for GPU with Jacobi (${GPU}_{J}$) vs. GPU with conjugate gradient (${GPU}_{CG}$).

M	Power- ${GPU}_{GS}$	Power- ${GPU}_{CG}$
150	29.43	22.71
300	67.79	17.68
450	124.05	31.09
750	151.01	70.22
900	179.89	83.81
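
The relative power reduction implied by each row of Table 3 can be computed as (baseline − CG) / baseline. The sketch below treats the first power column as the baseline and the second as the conjugate gradient solver; the variable and function names are illustrative.

```python
# (M, baseline power in W, CG power in W), copied from Table 3.
TABLE3 = [
    (150, 29.43, 22.71),
    (300, 67.79, 17.68),
    (450, 124.05, 31.09),
    (750, 151.01, 70.22),
    (900, 179.89, 83.81),
]


def relative_reduction(baseline, candidate):
    """Percentage reduction of candidate with respect to baseline."""
    return 100.0 * (baseline - candidate) / baseline


for m, p_base, p_cg in TABLE3:
    print(f"M={m}: {relative_reduction(p_base, p_cg):.1f}% lower power with CG")
```

For M = 900 this gives a reduction of about 53.4%, and for M = 150 about 22.8%.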

Table 4. Summary of main CPM implementations.

Article	Contribution	Parallel	Energy Consumption Analysis
[23]	CellSim3D is a software package for simulating three-dimensional cell division based on the mechanical properties of cells.	Yes	No
[7]	Parallelizing Cellular Potts model using software transactional memory, enhancing computational efficiency for simulating ductal carcinoma in situ and other tumors.	Yes	No
[24]	MATLAB-based implementation for protein crystal growth with a user-friendly interface.	No	No
[18]	GPU-accelerated 3D simulation of in-stent restenosis, improving computational efficiency and enabling large-scale analyses of vascular smooth muscle cell behaviors.	Yes	No
[19]	Hybrid parallel framework using MPI and OpenMP for distributed and shared-memory systems.	Yes	No
[20]	Parallelization and high-performance computing enables automated statistical inference of multi-scale models.	Yes	No
[21]	A Cellular Potts model for simulating microenvironmental influences on tumor progression using GPU acceleration.	Yes	No
[22]	Advanced parallel techniques for biological simulations focusing on computational efficiency.	Yes	No

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
