Article

Parallel Implicit Solvers for 2D Numerical Models on Structured Meshes

1 National Center for Computational Hydroscience and Engineering, University of Mississippi, Oxford, MS 38655, USA
2 Department of Civil Engineering, University of Mississippi, University, MS 38677, USA
3 Department of Geology and Geological Engineering, University of Mississippi, University, MS 38677, USA
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(14), 2184; https://doi.org/10.3390/math12142184
Submission received: 21 June 2024 / Revised: 9 July 2024 / Accepted: 9 July 2024 / Published: 12 July 2024
(This article belongs to the Special Issue Mathematical Modeling and Numerical Simulation in Fluids)

Abstract

This paper presents the parallelization of two widely used implicit numerical solvers for the solution of partial differential equations on structured meshes, namely, the ADI (Alternating-Direction Implicit) solver for tri-diagonal linear systems and the SIP (Strongly Implicit Procedure) solver for penta-diagonal systems. Both solvers were parallelized using CUDA (Compute Unified Device Architecture) Fortran on GPGPUs (general-purpose graphics processing units). The parallel ADI solver (P-ADI) is based on the Parallel Cyclic Reduction (PCR) algorithm, while the parallel SIP solver (P-SIP) uses the wave front (WF) method, which follows a diagonal-line calculation strategy. To map the solution schemes onto the hierarchical block-thread framework of CUDA on the GPU, the P-ADI solver adopts two mapping methods, one block of threads with iterations (OBM-it) and multiple blocks of threads (MBM), while the P-SIP solver also uses two mappings: a conventional mapping using the effective WF lines (WF-e), with matrix coefficients and solution variables defined on the original computational mesh, and a newly proposed mapping using the full WF mesh (WF-all), on which the matrix coefficients and solution variables are defined instead. Both the P-ADI and P-SIP solvers have been integrated into a two-dimensional (2D) hydrodynamic model, the CCHE2D (Center for Computational Hydroscience and Engineering) model, developed by the National Center for Computational Hydroscience and Engineering at the University of Mississippi. This study compares, for the first time, these two parallel solvers and their efficiency using examples and applications in complex geometries, which can provide valuable guidance for future uses of these two parallel implicit solvers in computational fluid dynamics (CFD). Both parallel solvers demonstrated higher efficiency than their serial counterparts on the CPU (central processing unit): speedup ratios of 3.73~4.98 for flow simulations and 2.166~3.648 for sediment transport simulations. In general, the P-ADI solver is faster than, but not as stable as, the P-SIP solver, and for the P-SIP solver, the newly developed WF-all mapping significantly outperforms the conventional WF-e mapping.

1. Introduction

In computational fluid dynamics (CFD) analysis, efficient computation remains a challenge for long-term, large-scale, and high-resolution simulations. Many methodologies and techniques have been developed to alleviate this problem, such as efficient numerical schemes [1,2]; model surrogating, i.e., one-dimensional (1D) models surrogating two-dimensional (2D) models [3,4] and data-driven models based on machine learning surrogating conventional physics-based models [5,6]; model coupling, i.e., 1D and 2D model coupling [7,8] and 2D and three-dimensional (3D) model coupling [9]; multi-block algorithms [10]; and parallel computing [11,12,13,14,15,16,17,18,19,20].
Since CFD models involve solving a set of highly non-linear partial differential equations (PDEs), their computing efficiency depends largely on the numerical solvers, which are designed for different computational mesh systems. Both structured and unstructured meshes are widely used in CFD analysis. For example, the CCHE2D model [21] and the Delft3D model [22] are based on structured meshes, while the MIKE 21 flow model [23], the HEC-RAS 2D model [8], the SRH-2D model [24], and the CCHE2D-hybrid model [25] use unstructured meshes. One big advantage of the structured mesh system is the availability of fast numerical solvers, although mesh generation in complex geometries is more difficult than for unstructured meshes. Among the most widely used solvers, two implicit solvers stand out: the ADI (Alternating-Direction Implicit) solver for tri-diagonal matrix systems [26,27] and the SIP (Strongly Implicit Procedure) solver [28] for penta-diagonal matrix systems.
The ADI solver is simple and easy to implement, but its implicitness and data dependency are not as strong as those of the SIP solver, so its convergence is slower. Conversely, the strong dependency of the SIP solver makes it unfriendly to parallelization. In parallel computing, implicit solvers are generally more difficult to parallelize than explicit solvers because of their strong data dependency. The ADI solver, based on the Parallel Cyclic Reduction (PCR) algorithm proposed by Hockney and Jesshope [29], has been successfully parallelized and applied to flow simulations in CFD analyses [16,17,30] and to heat conduction simulations [18,31]. Compared to the ADI solver, the SIP solver is fully implicit and thus computationally more stable; in general, it can use a larger time step than the ADI solver. Owing to its incomplete LU decomposition, the SIP solver also converges much faster than the ADI solver.
Attempts have also been made to parallelize the SIP solver. For example, Reeve et al. [32] parallelized the SIP solver on multiple CPUs (central processing units) using the MPI (Message Passing Interface) framework. Based on a dependency analysis, they proposed two parallel methods, namely, the wave front (WF) method and the Red-Black Checkerboard (RBC) method. Deserno et al. [33] analyzed in detail the effects of nodal numbering on the implementation of the SIP solver on a high-performance computing (HPC) platform with shared-memory multi-processors and proposed a diagonal processing strategy, essentially the same as the WF method. Igounet et al. [34] implemented two variants of the SIP method on the GPU by exploring the use of shared memory. From an implementation of the RBC method for the SIP solver in flow simulations, Zhang and Jia [16] recognized that the RBC is only an approximation of the SIP algorithm, so it can be used only for simple problems. Dufrechou et al. [35] designed and implemented a self-scheduling procedure to avoid the overhead of CPU–GPU synchronization implied by the so-called hyper-planes strategy (another version of the WF method).
In this study, the parallel ADI solver (P-ADI) based on the PCR algorithm and the parallel SIP solver (P-SIP) based on the WF method were implemented on the GPU using CUDA Fortran [36] and then integrated into a 2D hydrodynamic model, the CCHE2D (Center for Computational Hydroscience and Engineering) model. This study proposes a new mapping strategy for the P-SIP solver, distinguished from the conventional WF method in that the matrix coefficients and the solution variables are defined on the WF mesh and the computation is performed on the WF mesh as well. Two mapping methods for the P-ADI solver and two mapping methods for the P-SIP solver are validated and compared using examples and applications. The results demonstrate that both parallel solvers can significantly improve computing efficiency in complex geometries, such as the flow simulation in Arkansas River pool 7 with many navigation dikes, the tidal flow simulation in Lake Pontchartrain, and the sediment transport simulations with wetting-and-drying processes in the dike-lined Vistula River and the meandering East Fork River.

2. Numerical Model

2.1. Flow Model

In this study, a two-dimensional (2D) hydrodynamic model, the CCHE2D model [21], is selected for parallelization. The CCHE2D model solves the depth-integrated 2D Navier–Stokes equations as follows:
Continuity equation
\frac{\partial h}{\partial t} + \frac{\partial (hu)}{\partial x} + \frac{\partial (hv)}{\partial y} = 0
Momentum equations
\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} = -g\frac{\partial \eta}{\partial x} + \frac{1}{h}\left[\frac{\partial (h\tau_{xx})}{\partial x} + \frac{\partial (h\tau_{xy})}{\partial y}\right] + \frac{\tau_{wx} - \tau_{bx}}{\rho h} + f_{Cor} v
\frac{\partial v}{\partial t} + u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y} = -g\frac{\partial \eta}{\partial y} + \frac{1}{h}\left[\frac{\partial (h\tau_{yx})}{\partial x} + \frac{\partial (h\tau_{yy})}{\partial y}\right] + \frac{\tau_{wy} - \tau_{by}}{\rho h} - f_{Cor} u
where u and v are the depth-integrated velocity components in the x and y directions, respectively; g is the gravitational acceleration; η is the water surface elevation; ρ is the water density; h is the local water depth; f_Cor is the Coriolis parameter; τ_bx and τ_by are the shear stresses on the bed surface; and τ_wx and τ_wy are the surface wind shear stresses, given by
(\tau_{wx}, \tau_{wy}) = \rho_{air}\, c_{fa}\, \sqrt{U_w^2 + V_w^2}\,(U_w, V_w)
where c_fa is the friction coefficient at the water surface and (U_w, V_w) is the wind velocity; τ_xx, τ_xy, τ_yx, and τ_yy are the depth-integrated stresses, including both viscous and turbulent effects, which are approximated based on Boussinesq's assumption as follows:
\tau_{xx} = 2\rho(\nu + \nu_t)\frac{\partial u}{\partial x}
\tau_{xy} = \tau_{yx} = \rho(\nu + \nu_t)\left(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}\right)
\tau_{yy} = 2\rho(\nu + \nu_t)\frac{\partial v}{\partial y}
where ν is the kinematic viscosity of water.
Two zero-equation eddy viscosity models, namely, the parabolic model and the mixing length model, as well as the more sophisticated two-equation k-ε turbulence model [37], were adopted in the CCHE2D model to calculate the eddy viscosity ν_t; they are not covered in detail here.

2.2. Sediment Transport Model

The CCHE2D model can simulate the transport of non-uniform sediment mixtures with slope-effect correction, secondary-flow effects in curved channels, and avalanche effects. The following equations are solved in a manner decoupled from the flow.
Suspended-load equation [38]:
\frac{\partial (hC_k)}{\partial t} + \frac{\partial (uhC_k)}{\partial x} + \frac{\partial (vhC_k)}{\partial y} = \frac{\partial}{\partial x}\left(\varepsilon_s h \frac{\partial C_k}{\partial x}\right) + \frac{\partial}{\partial y}\left(\varepsilon_s h \frac{\partial C_k}{\partial y}\right) + \alpha\,\omega_{sk}\,(C_{*k} - C_k)
Bed-load equation [38]:
\frac{\partial (\delta_b \bar{c}_{bk})}{\partial t} + \frac{\partial (\alpha_{bx} q_{bk})}{\partial x} + \frac{\partial (\alpha_{by} q_{bk})}{\partial y} + \frac{1}{L_t}\,(q_{bk} - q_{b*k}) = 0
Bed-change equation:
(1 - p)\frac{\partial z_{bk}}{\partial t} = \alpha\,\omega_{sk}\,(C_k - C_{*k}) + \frac{q_{bk} - q_{b*k}}{L_t}
where ε_s is the eddy diffusivity of sediment; C_k is the concentration of the k-th size class, and C_*k is the corresponding suspended-load transport capacity; α is the adaptation coefficient for suspended load; ω_sk is the sediment settling velocity; α_bx = u/U and α_by = v/U are the direction cosines of bed-load movement; U is the velocity magnitude; δ_b and c̄_bk are the bed-load layer thickness and the average bed-load concentration of the k-th size class; q_bk and q_b*k are the bed-load transport rate and the bed-load transport capacity, with q_bkx = α_bx q_bk and q_bky = α_by q_bk its components in the x and y directions, respectively; L_t is the adaptation length for bed load; p is the porosity of the bed material; and z_bk is the bed elevation contributed by the k-th size class.
For multi-sized sediments, the bed-sorting calculation plays a crucial role; it is not presented here but can be found in more detail in [25]. In general, sediment transport is more complex and carries more uncertainties than flow simulation because of the empirical/semi-empirical formulas involved (e.g., transport capacity, incipient motion, settling velocity, adaptation coefficient, porosity, mobile-bed roughness, etc.), which were derived from physical laws under steady flow conditions [25].

2.3. Discretization

The CCHE2D model [21] uses the efficient element method (EEM) to discretize the governing equations on a staggered structured mesh (Figure 1a). The EEM applies a hybrid scheme of the Finite Element Method (FEM) for the momentum equations and the sediment transport equations, and the Finite Volume Method (FVM) for the continuity equation. When using the FEM for the momentum equations and the sediment transport equations, the nine-node element is used, while for the FVM for the continuity equation, the four-node element is used. Please refer to [21] for more detail.
As shown in Figure 1b, at a typical mesh node P(i,j) with its four neighboring nodes, namely, west (i − 1, j), east (i + 1, j), south (i, j − 1), and north (i, j + 1) nodes, an algebraic equation will be formed after discretization:
A_W \phi_{i-1,j} + A_E \phi_{i+1,j} + A_S \phi_{i,j-1} + A_N \phi_{i,j+1} + A_P \phi_{i,j} = S^{\phi}_{i,j}
where φ is the dependent variable defined at node (i, j), the A terms are the matrix coefficients, and S^φ_{i,j} denotes the source term.
Equation (9) can be rewritten in the form of a penta-diagonal matrix equation system, denoted by [A]{φ} = {S}, where A is the coefficient matrix assembled over the I_max × J_max mesh; it is solved implicitly in the CCHE2D model using the SIP (Strongly Implicit Procedure) solver [28].

3. CUDA Programming

Parallel computing is an important technique for improving computing efficiency through hardware. Since NVIDIA released general-purpose computing on graphics processing units (GPGPU) with the CUDA (Compute Unified Device Architecture) framework in 2006, CUDA programming on the GPU has been successfully applied in many disciplines, including CFD analysis [16,17,19,20,30,39].
In NVIDIA's CUDA architecture (NVIDIA [40]), the computing units on the GPU are organized hierarchically as blocks of threads (Figure 2). The threads of the blocks laid out in a grid process data in parallel. Basically, CUDA programming can be divided into three simple steps: (1) copy data from the host memory (CPU) to the device memory (GPU); (2) the host (CPU) calls kernels that perform parallel computing on the device (GPU); and (3) copy data from the device memory (GPU) back to the host memory (CPU).
One of the keys to successful CUDA programming is how to map the problem onto the block-thread grid. For a 2D structured mesh (I_max × J_max), an intuitive and straightforward mapping is to use I_max blocks with J_max threads or J_max blocks with I_max threads, which is equivalent to a 2D loop. However, for implicit numerical solvers, such as the ADI solver and the SIP solver, the mapping becomes more complicated.
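As a minimal illustration of this three-step pattern (and of the straightforward mapping of I_max blocks with J_max threads), the following CUDA Fortran sketch copies a flattened 2D field to the device, scales it in a kernel, and copies it back. All names (demo_kernels, scale2d, a_dev, factor) are illustrative and are not taken from the CCHE2D code.

module demo_kernels
  use cudafor
  implicit none
contains
  attributes(global) subroutine scale2d(a, jmax, factor)
    ! one block per i line, one thread per j node
    real(8)         :: a(:)      ! flattened (Imax x Jmax) field on the device
    integer, value  :: jmax
    real(8), value  :: factor
    integer :: i, j, k
    i = blockIdx%x
    j = threadIdx%x
    k = (i - 1)*jmax + j          ! 1D index of node (i, j)
    a(k) = factor*a(k)
  end subroutine scale2d
end module demo_kernels

program cuda_three_steps
  use cudafor
  use demo_kernels
  implicit none
  integer, parameter :: imax = 512, jmax = 256
  real(8)                      :: a_host(imax*jmax)
  real(8), device, allocatable :: a_dev(:)
  allocate(a_dev(imax*jmax))
  a_host = 1.0d0
  a_dev  = a_host                                    ! (1) copy host -> device
  call scale2d<<<imax, jmax>>>(a_dev, jmax, 2.0d0)   ! (2) kernel runs on the GPU
  a_host = a_dev                                     ! (3) copy device -> host
  print *, 'sample value after kernel:', a_host(1)
  deallocate(a_dev)
end program cuda_three_steps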

4. Parallel ADI Solver

When using the ADI method to solve Equation (9), a time level is split into two half steps so that in each half one spatial direction is treated implicitly and the other direction explicitly.
A_W \phi^{n+1/2}_{i-1,j} + A_E \phi^{n+1/2}_{i+1,j} + A_P \phi^{n+1/2}_{i,j} = S^{\phi}_{i,j} - A_S \phi^{n}_{i,j-1} - A_N \phi^{n}_{i,j+1} \quad (\text{in the } i \text{ direction})
A_S \phi^{n+1}_{i,j-1} + A_N \phi^{n+1}_{i,j+1} + A_P \phi^{n+1}_{i,j} = S^{\phi}_{i,j} - A_W \phi^{n+1/2}_{i-1,j} - A_E \phi^{n+1/2}_{i+1,j} \quad (\text{in the } j \text{ direction})
where n denotes the time level, and i and j denote two directions in the computational space.
Equation (10) is a tri-diagonal matrix system, which is conventionally solved by the Thomas algorithm for tri-diagonal matrices (TDMA) [26]. Rewriting Equation (10) in the form [A]{φ} = {d}, one obtains
a_{i,j}\,\phi_{i-1,j} + b_{i,j}\,\phi_{i,j} + c_{i,j}\,\phi_{i+1,j} = d_{i,j} \qquad [\,i \in (1, I_{max}),\; j \in (1, J_{max})\,]
where d is the source term and A is the tri-diagonal coefficient matrix (shown here for a single grid line in the i direction):
A = \begin{bmatrix} b_1 & c_1 & & & \\ a_2 & b_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{I_{max}-1} & b_{I_{max}-1} & c_{I_{max}-1} \\ & & & a_{I_{max}} & b_{I_{max}} \end{bmatrix}
Equation (11) will be used to illustrate the ADI method hereafter in this study.

4.1. TDMA

The TDMA has only two steps. The forward elimination step eliminates the lower diagonal:
c_1^{new} = \frac{c_1}{b_1}, \qquad c_i^{new} = \frac{c_i}{b_i - c_{i-1}^{new} a_i}
d_1^{new} = \frac{d_1}{b_1}, \qquad d_i^{new} = \frac{d_i - d_{i-1}^{new} a_i}{b_i - c_{i-1}^{new} a_i}
where the superscript “new” denotes the updated new values.
The backward substitution step is to obtain the solution:
\phi_{I_{max}} = d_{I_{max}}^{new}, \qquad \phi_i = d_i^{new} - c_i^{new}\,\phi_{i+1} \quad (i = I_{max}-1,\, I_{max}-2,\, \ldots,\, 1)
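A plain serial Fortran sketch of these two TDMA steps is given below for reference; the subroutine and array names are illustrative, and the upper diagonal c and the right-hand side d are overwritten as in the recurrences above.

subroutine tdma(n, a, b, c, d, phi)
  ! Thomas algorithm for a tri-diagonal system:
  ! a(i)*phi(i-1) + b(i)*phi(i) + c(i)*phi(i+1) = d(i)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(in)    :: a(n), b(n)     ! lower and main diagonals (a(1) unused)
  real(8), intent(inout) :: c(n), d(n)     ! upper diagonal and RHS, overwritten
  real(8), intent(out)   :: phi(n)
  integer :: i
  ! forward elimination: remove the lower diagonal
  c(1) = c(1)/b(1)
  d(1) = d(1)/b(1)
  do i = 2, n
    c(i) = c(i)/(b(i) - c(i-1)*a(i))
    d(i) = (d(i) - d(i-1)*a(i))/(b(i) - c(i-1)*a(i))
  end do
  ! backward substitution
  phi(n) = d(n)
  do i = n - 1, 1, -1
    phi(i) = d(i) - c(i)*phi(i+1)
  end do
end subroutine tdma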

4.2. Cyclic Reduction

Hockney [27] proposed an alternative algorithm for solving the tri-diagonal system, the Cyclic Reduction (CR) algorithm. Similar to the TDMA, the CR also has two steps. In the forward reduction step, odd-numbered rows are successively eliminated, reducing the system to a smaller system with half the number of unknowns until a system of two unknowns is formed. For an even-numbered row i, a new equation can be obtained by row operations on rows i − 1, i, and i + 1; thus, one obtains
a_i^{new} = -a_{i-1}\,p_1, \quad c_i^{new} = -c_{i+1}\,p_2, \quad b_i^{new} = b_i - c_{i-1}\,p_1 - a_{i+1}\,p_2, \quad d_i^{new} = d_i - d_{i-1}\,p_1 - d_{i+1}\,p_2
where p_1 = a_i / b_{i-1} and p_2 = c_i / b_{i+1}.
The backward substitution step successively updates the other half of the unknowns using previously known values:
\phi_i = \frac{d_i^{new} - a_i^{new}\,\phi_{i-1} - c_i^{new}\,\phi_{i+1}}{b_i^{new}}
As can be seen, compared to the TDMA, the CR performs more computations in the forward reduction step and is thus computationally slower. However, it is much more parallel-friendly in the sense that the TDMA exhibits stronger dependency than the CR: in the TDMA, the new values of c and d at row i depend on their adjacent new values at row i − 1, while in the CR, the new values of a, b, c, and d depend only on previously computed old values.
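For reference, the following serial Fortran sketch performs one forward-reduction sweep over the even-numbered rows using the formulas above; names are illustrative, and boundary rows (including the last row when n is even) are left untouched for simplicity.

subroutine cr_reduce_once(n, a, b, c, d)
  ! one CR forward-reduction sweep: each even row i is rewritten from rows
  ! i-1, i, i+1, so the even rows alone form a tri-diagonal system of half size
  implicit none
  integer, intent(in)    :: n
  real(8), intent(inout) :: a(n), b(n), c(n), d(n)
  real(8) :: an(n), bn(n), cn(n), dn(n), p1, p2
  integer :: i
  an = a; bn = b; cn = c; dn = d
  do i = 2, n - 1, 2
    p1 = a(i)/b(i-1)
    p2 = c(i)/b(i+1)
    an(i) = -a(i-1)*p1
    cn(i) = -c(i+1)*p2
    bn(i) = b(i) - c(i-1)*p1 - a(i+1)*p2
    dn(i) = d(i) - d(i-1)*p1 - d(i+1)*p2
  end do
  a = an; b = bn; c = cn; d = dn
end subroutine cr_reduce_once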

4.3. Parallel Cyclic Reduction

Hockney and Jesshope [29] extended the CR to the Parallel Cyclic Reduction (PCR) owing to its parallel-friendly characteristics. The PCR has only the forward reduction step, since it recursively halves the system into smaller systems.
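The following serial Fortran sketch illustrates the PCR idea: at each level, every row is combined with the rows at distance s (s = 1, 2, 4, ...), so all rows can be updated independently, and hence in parallel; after about log2(n) levels each equation is fully decoupled. Names are illustrative, out-of-range neighbours are treated as trivial rows, and the sketch assumes a(1) = 0 and c(n) = 0 as in Equation (11).

subroutine pcr_solve(n, a, b, c, d, phi)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(inout) :: a(n), b(n), c(n), d(n)
  real(8), intent(out)   :: phi(n)
  real(8) :: an(n), bn(n), cn(n), dn(n)
  real(8) :: al, bl, cl, dl, au, bu, cu, du, p1, p2
  integer :: i, s
  s = 1
  do while (s < n)
    do i = 1, n               ! every row is updated at every level (parallel on the GPU)
      if (i - s >= 1) then
        al = a(i-s); bl = b(i-s); cl = c(i-s); dl = d(i-s)
      else                    ! trivial row outside the system
        al = 0d0; bl = 1d0; cl = 0d0; dl = 0d0
      end if
      if (i + s <= n) then
        au = a(i+s); bu = b(i+s); cu = c(i+s); du = d(i+s)
      else
        au = 0d0; bu = 1d0; cu = 0d0; du = 0d0
      end if
      p1 = a(i)/bl
      p2 = c(i)/bu
      an(i) = -al*p1
      cn(i) = -cu*p2
      bn(i) = b(i) - cl*p1 - au*p2
      dn(i) = d(i) - dl*p1 - du*p2
    end do
    a = an; b = bn; c = cn; d = dn
    s = 2*s
  end do
  phi = d/b                   ! each equation is now decoupled
end subroutine pcr_solve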
Considering that in the i direction the coefficient matrix A is J_max × I_max, while in the j direction it is I_max × J_max, Zhang et al. [31] and Heng [41] adopted a simple mapping in which each block solves one tri-diagonal system and each thread calculates one node, denoted in this study as the one-block mapping (OBM) (Figure 3). Obviously, the OBM is limited by the maximum number of threads in one block, and the size of the tri-diagonal system to be solved must be less than the square or the cube of the maximum thread number in 2D and 3D, respectively.
When using the OBM mapping, in the i direction the number of blocks is J_max, the number of threads is I_max, and the required shared memory is 4 × I_max × 4 bytes for single-precision and 4 × I_max × 8 bytes for double-precision computations.
To resolve the size limitation problem, Zhang and Jia [16,17] proposed an improved OBM method that allows multiple nodes to be calculated in one thread, denoted one-block mapping with iterations (OBM-it). In the OBM-it method, each block still solves one tri-diagonal system, but each thread calculates multiple nodes. In this method, a partition strategy was proposed as follows:
m_i \times I_p \geq I_{max}
m_j \times J_p \geq J_{max}
where m_i and m_j are the numbers of partitions in the i and j directions (and also the numbers of nodes calculated in each thread), and I_p (≤ maximum thread number) and J_p (≤ maximum thread number) denote the numbers of nodes in each partition in the i and j directions, respectively.
Figure 4 shows the mapping of the OBM-it method. In the i direction, Thread_Per_Block × Block_Per_Grid = J_max × m_i × I_p ≥ J_max × I_max, while in the j direction, Block_Per_Grid × Thread_Per_Block = I_max × m_j × J_p ≥ I_max × J_max. When using the OBM-it method, in the i direction the number of blocks is m_j × J_p, the number of threads is I_p, and the required shared memory is 4 × I_p × m_i × 4 bytes for single-precision and 4 × I_p × m_i × 8 bytes for double-precision computations.
Wei et al. [18] proposed the multi-block mapping (MBM) strategy, in which the whole system is mapped onto multiple blocks (m_i × I_p × m_j × J_p). When using the MBM method, in the i direction the number of threads is I_p, but the number of blocks is (m_i, m_j × J_p), and the required shared memory is 4 × I_p × 4 bytes for single-precision and 4 × I_p × 8 bytes for double-precision computations. Compared to the OBM-it method, in the MBM method the partitions (Equation (16)) are mapped onto blocks and threads simultaneously [9].
Figure 5 shows the MBM. In the i direction, Thread_Per_Block × Block_Per_Grid = (I_p, 1) × (m_i, m_j × J_p), while in the j direction, Block_Per_Grid × Thread_Per_Block = (m_i × I_p, m_j) × (1, J_p).

4.4. Relaxation

In practice, in order to improve computational stability and enhance convergence, Zhang and Jia [16] proposed an iteration process with relaxation (0 < R_x ≤ 1) for the PCR, in which the coefficients a, b, and c are kept unchanged, but the source term d is updated with the new solution.
The OBM-it and MBM versions of the P-ADI solver are illustrated using pseudo Fortran code in Table 1 and Table 2, respectively. Note the different calculations of the block and thread numbers in these two mappings.

5. Parallel SIP Solver

5.1. SIP Solver for Penta-Diagonal System

There are four steps in the SIP algorithm, which are simply listed as follows:
The first step, the incomplete LU decomposition, factorizes matrix A approximately as A ≈ LU so that
L^W_{i,j} = \frac{A_W}{1 + r\,U^N_{i-1,j}}
L^S_{i,j} = \frac{A_S}{1 + r\,U^E_{i,j-1}}
L^P_{i,j} = A_P + r\left(L^W_{i,j} U^N_{i-1,j} + L^S_{i,j} U^E_{i,j-1}\right) - L^W_{i,j} U^E_{i-1,j} - L^S_{i,j} U^N_{i,j-1}
U^E_{i,j} = \frac{A_E - r\,L^S_{i,j} U^E_{i,j-1}}{L^P_{i,j}}
U^N_{i,j} = \frac{A_N - r\,L^W_{i,j} U^N_{i-1,j}}{L^P_{i,j}}
where r is a coefficient in the range of [0, 1] for fast convergence.
For the second step, the residual is calculated by
R_{i,j} = S^{\phi}_{i,j} - (A\phi)_{i,j}
In the forward elimination step, V is computed as
V_{i,j} = \frac{R_{i,j} - \left(L^S_{i,j} V_{i,j-1} + L^W_{i,j} V_{i-1,j}\right)}{L^P_{i,j}},
and the increment of the solution δ^n can then be computed using U δ^n = V.
The backward substitution step calculates the increment of the solution by
\delta_{i,j} = V_{i,j} - U^N_{i,j}\,\delta_{i,j+1} - U^E_{i,j}\,\delta_{i+1,j}
so that the solution can be updated via φ^(n+1) = φ^n + δ^n.
Finally, convergence is checked: if δ^n is small enough, stop; otherwise, go back to the second step.
As can be seen, the SIP algorithm exhibits strong sequential characteristics. In the LU decomposition step, Equation (17) defines the L and U matrices from the coefficients of A in a recursive way; that is, L^W, L^S, and L^P at (i, j) rely on U^N and U^E at (i − 1, j) and (i, j − 1), respectively, and the new values of U^N and U^E at (i, j) require not only L^W, L^S, and L^P at (i, j) but also their adjacent new values at (i, j − 1) and (i − 1, j). Similarly, in the forward elimination step described by Equation (19), the new value of V at (i, j) depends on its adjacent values at (i − 1, j) and (i, j − 1). This dependency is indicated in Figure 6a, running from lower to higher index values. In the backward substitution step, on the contrary, the increment calculation described in Equation (20) starts from the opposite corner, as shown in Figure 6b.
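For reference, a serial Fortran sketch of the four SIP steps described by Equations (17)-(20) is given below. The array names, the treatment of the boundary rows and columns (held fixed), and the simple absolute-residual stopping test are illustrative choices and do not reproduce the CCHE2D implementation.

subroutine sip_solve(ni, nj, AW, AE, AS, AN, AP, S, phi, r, tol, maxit)
  implicit none
  integer, intent(in)    :: ni, nj, maxit
  real(8), intent(in)    :: AW(ni,nj), AE(ni,nj), AS(ni,nj), AN(ni,nj), AP(ni,nj)
  real(8), intent(in)    :: S(ni,nj), r, tol
  real(8), intent(inout) :: phi(ni,nj)          ! boundary values held fixed
  real(8), dimension(ni,nj) :: LW, LS, LP, UN, UE, Rij, V, del
  real(8) :: res
  integer :: i, j, it
  LW = 0d0; LS = 0d0; LP = 1d0; UN = 0d0; UE = 0d0; V = 0d0; del = 0d0
  ! step 1: incomplete LU decomposition, Equation (17)
  do j = 2, nj - 1
    do i = 2, ni - 1
      LW(i,j) = AW(i,j)/(1d0 + r*UN(i-1,j))
      LS(i,j) = AS(i,j)/(1d0 + r*UE(i,j-1))
      LP(i,j) = AP(i,j) + r*(LW(i,j)*UN(i-1,j) + LS(i,j)*UE(i,j-1)) &
              - LW(i,j)*UE(i-1,j) - LS(i,j)*UN(i,j-1)
      UE(i,j) = (AE(i,j) - r*LS(i,j)*UE(i,j-1))/LP(i,j)
      UN(i,j) = (AN(i,j) - r*LW(i,j)*UN(i-1,j))/LP(i,j)
    end do
  end do
  do it = 1, maxit
    ! step 2: residual, Equation (18)
    res = 0d0
    do j = 2, nj - 1
      do i = 2, ni - 1
        Rij(i,j) = S(i,j) - (AP(i,j)*phi(i,j) + AW(i,j)*phi(i-1,j) + AE(i,j)*phi(i+1,j) &
                 + AS(i,j)*phi(i,j-1) + AN(i,j)*phi(i,j+1))
        res = res + abs(Rij(i,j))
      end do
    end do
    if (res < tol) exit
    ! step 3: forward elimination, Equation (19)
    do j = 2, nj - 1
      do i = 2, ni - 1
        V(i,j) = (Rij(i,j) - LS(i,j)*V(i,j-1) - LW(i,j)*V(i-1,j))/LP(i,j)
      end do
    end do
    ! step 4: backward substitution and solution update, Equation (20)
    do j = nj - 1, 2, -1
      do i = ni - 1, 2, -1
        del(i,j) = V(i,j) - UN(i,j)*del(i,j+1) - UE(i,j)*del(i+1,j)
        phi(i,j) = phi(i,j) + del(i,j)
      end do
    end do
  end do
end subroutine sip_solve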

5.2. Wave Front Method

As demonstrated in Figure 6, the diagonal dashed lines can be described as wave front lines, propagating from the bottom-left corner to the top-right corner for the LU decomposition and forward elimination steps of the SIP algorithm, and in the opposite direction for the backward substitution step. In other words, along the propagation direction, the mesh nodes on one diagonal line depend only on the diagonal line behind it.
This observation inspires a general parallel strategy for the SIP solver: the computation can be carried out in parallel along the wave front (WF) lines, line by line, which is exactly the core concept of the WF method proposed by Reeve et al. [32] and of the diagonal processing strategy by Deserno et al. [33].
For the WF method, the first step is to identify the WF lines. As shown in Figure 7, for a mesh of dimension I_max × J_max, the number of WF lines is N_w = I_max + J_max − 1. Along one WF line there is at least one mesh node and at most J_max nodes, including the boundary nodes. Excluding the boundary nodes, a WF line with at least one internal (non-boundary) mesh node is defined as an effective WF line; therefore, there are I_max + J_max − 3 effective WF lines, which are computed line by line on the GPU. As identified by Deserno et al. [33], along each WF line the nodal indices (i, j) satisfy i + j − 1 = current WF line number (a constant); a short sketch of this enumeration follows.
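The small Fortran sketch below enumerates the nodes of each WF line using hypothetical mesh dimensions; for each WF line w, the admissible j range follows from 1 <= i = w - j + 1 <= I_max.

program wf_lines
  implicit none
  integer, parameter :: imax = 6, jmax = 4     ! illustrative mesh dimensions
  integer :: w, i, j, nw
  nw = imax + jmax - 1                         ! total number of WF lines
  do w = 1, nw
    write(*,'(a,i0,a)', advance='no') 'WF line ', w, ':'
    do j = max(1, w - imax + 1), min(jmax, w)
      i = w - j + 1                            ! from i + j - 1 = w
      write(*,'(a,i0,a,i0,a)', advance='no') ' (', i, ',', j, ')'
    end do
    write(*,*)
  end do
end program wf_lines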
Straightforwardly, for one WF line, four kernels can be designed for the LU decomposition step, the residual calculation step, the forward elimination step, and the backward substitution step. For any variable φ (= u, v, η, C, q_b) in Equations (1)–(9), the parallel SIP solver can then be used as demonstrated by the pseudo Fortran code in Table 3. As can be seen, three of the four kernels are called iteratively by looping over all wave front lines; each of these kernels is launched once per WF line by the host (CPU), which may induce a CPU–GPU overhead problem.
To alleviate these CPU–GPU overheads, the kernels should be designed to process all WF lines on the GPU and be called only once by the CPU (Table 4). However, due to the strong dependency of the SIP solver, the calculation over all nodes must follow a fixed consecutive order of the WF lines. Moreover, according to the CUDA programming guide, the execution order of blocks within a grid and of warps within a block is undefined. These concerns raise a mapping question for the SIP solver: how to carry out the calculation in a fixed order on the block-thread framework of the GPU. Dufrechou et al. [35] identified this problem and proposed a synchronization method to alleviate it: a global array records the block number for each WF line, and in the main kernel, before computing the next WF line, each block actively waits for the corresponding entry of that global array to become zero; the entry is decremented by each thread block with a block index lower than the current WF line.

5.3. Wave Front Mesh

To fulfill the parallel strategy of the WF method, a mapping needs to be established between the nodes on the WF lines and the original computational mesh nodes. Therefore, an auxiliary WF mesh is constructed, through which the computational nodes can be mapped onto the hierarchical block-thread framework of the GPU.
The WF mesh is structured with a size of N_W × W_p. Each WF line has W_p nodes, including the computational nodes and virtual nodes outside of the domain. This WF mesh is then mapped onto the block-thread framework of the GPU as B blocks per grid and W_p threads per block. All WF lines are distributed nearly equally among the blocks, so the number of WF lines in each block is D_w = N_W/B. To enhance coalesced memory access, the number of nodes along each WF line is adjusted by the warp size (32) so that W_p = 16 × (W_p − 1)/16; in such a case, the WF mesh may contain additional virtual nodes along each WF line. When W_p is greater than the maximum thread number, iterations are applied within each thread so that more than one node is calculated per thread.
Figure 8 shows two WF meshes: one with effective WF lines only (denoted by WF-e) and the other with all WF lines (denoted by WF-all), which correspond to two different mapping strategies.
In addition to the WF-e mesh being smaller than the WF-all mesh, the key difference between WF-e and WF-all lies in the mesh on which the matrix coefficients and solution variables are defined. In WF-e, the matrix coefficients and the solution variable are defined on the original computational mesh; the mapping between the WF mesh and the original mesh is limited to the effective WF lines and is needed dynamically through an index converter (an array storing the nodal numbers of the original mesh) in each thread. In WF-e, the computation is thus essentially performed on the original mesh, which is merely mapped onto block threads through the WF mesh; the index mapping between the original mesh and the WF mesh in each thread hinders fast data retrieval and storage.
In contrast, in WF-all, the matrix coefficients and the solution variable are defined on the WF mesh instead, and the mapping between the WF mesh and the original mesh is conducted before the SIP computation. The computation is carried out on the WF mesh in each thread, and after the computation the solution is mapped back onto the original mesh.
Figure 9 shows the mapping between the WF mesh and the original mesh. As can be seen in a nine-node stencil, the east (i + 1, j) and west (i − 1, j) nodes of the two meshes overlap, but the north node of the original mesh (i, j + 1) corresponds to the northeast node of the WF mesh (i_w + 1, j_w + 1), and the south node of the original mesh (i, j − 1) corresponds to the southwest node of the WF mesh (i_w − 1, j_w − 1).
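The sketch below illustrates one consistent choice of this index mapping, taking i_w = i + j − 1 and j_w = j, which reproduces the neighbour correspondence of Figure 9 (west to (i_w − 1, j_w), east to (i_w + 1, j_w), south to (i_w − 1, j_w − 1), north to (i_w + 1, j_w + 1)); the array name om_to_wf is hypothetical, and the actual CCHE2D index converter may be organized differently.

subroutine build_wf_map(imax, jmax, om_to_wf)
  ! store, for each original-mesh node (i, j), its WF-mesh indices (iw, jw)
  implicit none
  integer, intent(in)  :: imax, jmax
  integer, intent(out) :: om_to_wf(2, imax, jmax)
  integer :: i, j
  do j = 1, jmax
    do i = 1, imax
      om_to_wf(1, i, j) = i + j - 1      ! iw: WF line number
      om_to_wf(2, i, j) = j              ! jw: position along the WF line
    end do
  end do
end subroutine build_wf_map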
Since the computation is carried out on the WF mesh, the SIP algorithm for the original mesh (Equations (17)–(20)) needs modification. Following the mapping in Figure 9, the SIP algorithm is modified for the WF mesh as follows:
LU decomposition
L^W_{i_w,j_w} = \frac{A_W}{1 + r\,U^N_{i_w-1,j_w}}
L^S_{i_w,j_w} = \frac{A_S}{1 + r\,U^E_{i_w-1,j_w-1}}
L^P_{i_w,j_w} = A_P + r\left(L^W_{i_w,j_w} U^N_{i_w-1,j_w} + L^S_{i_w,j_w} U^E_{i_w-1,j_w-1}\right) - L^W_{i_w,j_w} U^E_{i_w-1,j_w} - L^S_{i_w,j_w} U^N_{i_w-1,j_w-1}
U^E_{i_w,j_w} = \frac{A_E - r\,L^S_{i_w,j_w} U^E_{i_w-1,j_w-1}}{L^P_{i_w,j_w}}
U^N_{i_w,j_w} = \frac{A_N - r\,L^W_{i_w,j_w} U^N_{i_w-1,j_w}}{L^P_{i_w,j_w}}
Forward elimination
V_{i_w,j_w} = \frac{R_{i_w,j_w} - \left(L^S_{i_w,j_w} V_{i_w-1,j_w-1} + L^W_{i_w,j_w} V_{i_w-1,j_w}\right)}{L^P_{i_w,j_w}}
Backward substitution
\delta_{i_w,j_w} = V_{i_w,j_w} - U^N_{i_w,j_w}\,\delta_{i_w+1,j_w+1} - U^E_{i_w,j_w}\,\delta_{i_w+1,j_w}
Note the index difference between Equations (17), (19) and (20) and their counterparts, Equations (21)–(23).
For completeness, the two mappings of the P-SIP solver, including the kernel for the LU decomposition step (the kernels for the forward elimination and backward substitution steps are similar), are listed in Table 5 and Table 6 using pseudo Fortran code. As can be seen, in the WF-e kernel a dynamic mapping from the WF mesh to the original mesh (OM) is needed, while in WF-all one additional kernel is used to map the solution from the WF mesh back to the OM. This extra kernel can also be merged into the backward substitution kernel so that the solution mapping is performed simultaneously with the backward substitution step.

6. Implementation

In the serial version of the CCHE2D model, most variables are defined as 2D or 3D arrays, which is conceptually clear and simple. However, in the parallelization, to allow faster memory access on the device (GPU) and to reduce the amount of memory transfer to a minimum, one-dimensional arrays in the device memory are used to store the global variables. Specifically, for most flow variables defined in 2D arrays (I_max, J_max), a 1D index k = (i − 1) × J_max + j is used to map the 2D index (i, j); for the fractional sediment variables defined in 2D arrays (N_p, N_size) with N_p = I_max × J_max, a 1D index k = (i − 1) × N_size + j is used; and for the layered bed-composition variables defined in (N_p, N_layer), a 1D index k = (i − 1) × N_layer + j is used.
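A small illustrative helper for the first of these index maps is shown below; the analogous formulas with N_size or N_layer in place of J_max apply to the sediment and bed-composition arrays.

pure integer function flat2d(i, j, jmax) result(k)
  ! 1D device-array index of the 2D node (i, j): k = (i - 1)*Jmax + j
  implicit none
  integer, intent(in) :: i, j, jmax
  k = (i - 1)*jmax + j
end function flat2d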
The CCHE2D model consists of many modules, which make it capable of simulating flows of all regimes as well as sediment transport. It uses a time-marching solution procedure, and all modules within the time-marching loop were parallelized. Basically, the parallel computing of the CCHE2D model involves three steps, as mentioned previously: (1) in the initialization step, all host (CPU) data and device (GPU) data are declared, initialized, and loaded with input data, and the data transfers between host and device are carried out; for the P-SIP solver, the mapping between the original mesh and the WF mesh is established here: for WF-e, a global array stores the index of the original mesh along each WF line (WF → OM), while for WF-all, a global array stores the index of the WF mesh for each original mesh node (OM → WF) so that the matrix coefficients and solution variables can be defined on the WF mesh; (2) in the time-marching loop, the host calls kernels to perform computations on the GPU using the device data; and (3) after the time-marching loop finishes, the device data are copied back to the host.
In the implementations, the double-precision floating point data type was used. All the test examples and applications were run on a PC with an Intel Core i7 (2.93 GHz), 14 GB of memory, and a Tesla 2070 GPU card with 48 KB of shared memory. The PGI CUDA Fortran compiler developed by the Portland Group [36] was used to parallelize the CCHE2D model. Note that since the Portland Group was purchased by NVIDIA, the PGI CUDA Fortran used in the current parallelization is fixed at PGI Visual Fortran 12.1 with the CUDA Toolkit 4.2, which is far behind the latest CUDA Toolkit 12.3. Therefore, this parallel implementation cannot use the most recent CUDA capabilities, which may affect the overall performance of the model.

7. Example and Application

7.1. 2D Heat Diffusion

A heat-transfer problem was first used to test the P-ADI and P-SIP solvers. In a rectangular domain, the following heat diffusion equation is solved:
\rho c \frac{\partial T}{\partial t} = k\left(\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2}\right)
where T is the temperature, ρ is the density of the conducting medium, c is the specific heat, and k is the diffusion coefficient.
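As one minimal illustration (not the CCHE2D discretization itself), an implicit backward-Euler, central-difference discretization of this equation on a uniform grid produces the penta-diagonal form of Equation (9), with the coefficients sketched below; all names are illustrative.

subroutine heat_coefficients(dx, dy, dt, rho, c, kcond, Told, AW, AE, AS, AN, AP, S)
  ! coefficients of A_W*T(i-1,j) + A_E*T(i+1,j) + A_S*T(i,j-1) + A_N*T(i,j+1) + A_P*T(i,j) = S
  implicit none
  real(8), intent(in)  :: dx, dy, dt      ! grid spacings and time step
  real(8), intent(in)  :: rho, c, kcond   ! density, specific heat, diffusion coefficient
  real(8), intent(in)  :: Told            ! temperature at this node at the previous time level
  real(8), intent(out) :: AW, AE, AS, AN, AP, S
  AW = -kcond/dx**2
  AE = -kcond/dx**2
  AS = -kcond/dy**2
  AN = -kcond/dy**2
  AP =  rho*c/dt + 2d0*kcond/dx**2 + 2d0*kcond/dy**2
  S  =  rho*c/dt*Told
end subroutine heat_coefficients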
The test domain was discretized into 1000 × 1000 nodes; there are 1999 WF lines, including 1995 effective WF lines, and 1008 warp-size-adjusted nodes along each WF line. Seven solver configurations were compared: ADI-TDMA on the CPU, SIP on the original mesh (SIP-OM) on the CPU, SIP using WF lines (SIP-WF) on the CPU, P-ADI OBM-it on the GPU, P-ADI MBM on the GPU, P-SIP WF-e on the GPU, and P-SIP WF-all on the GPU.
Table 7 lists the CPU running time for these seven configurations, where “s” denotes second, and “m” denotes minute.
As can be seen, on the CPU, the ADI with the TDMA is faster than the two SIP configurations, and the SIP-WF using WF lines is the slowest, which demonstrates how strongly the mapping between the original mesh and the WF mesh affects the SIP computations. The SIP on the original mesh requires I_max × J_max iterations for each SIP step, while the WF version needs J_max × (I_max + J_max − 5) iterations; when I_max = J_max, as in the current case, the WF version needs almost double the iterations. In addition, when following the WF lines, the dynamic mapping in each iteration significantly slows down memory access.
On the GPU, correspondingly, the two P-ADI configurations are faster than the two P-SIP configurations. The WF-all is 3.67 times faster than the WF-e and 5.42 times faster than the SIP-OM; the speedup ratio of the WF-e is only 1.48 relative to the SIP-OM, while the speedup ratios of the P-ADI OBM-it and the P-ADI MBM are 7.89 and 6.96, respectively. For the P-ADI, Wei et al. [18] reported that the MBM could achieve up to about a 70-fold speedup when using the single-precision floating point data type.
In this heat-diffusion problem, with the same time step, the ADI solver is faster than the SIP solver due to its lighter computing load. Computationally, the TDMA has only two steps (two loops), while the SIP has four steps (four loops) with inner iterations to check the residual error. Therefore, the CPU version of the ADI is about twice (1.81 times) as fast as that of the SIP.
Similarly, the P-ADI solver has only two kernels, while the P-SIP solver has four kernels with inner iterations, which partially explains why the P-ADI demonstrated higher efficiency than the P-SIP. With the matrix coefficients and the solution variable defined on the WF mesh, the WF-all achieved a much higher computing speed than the WF-e, which performs the dynamic mapping between the original mesh and the WF mesh in each thread.

7.2. Flow Simulation

The P-ADI and P-SIP solvers on the GPU were integrated into the CCHE2D model and applied to two flow simulation cases: the steady flow in Arkansas River pool 7 and the unsteady flow in Lake Pontchartrain.
Table 8 and Table 9 list the CPU running time of the model using both solvers with different configurations. Note that the listed running times reflect the comprehensive performance of the whole model, including many other parallel implementations in the entire time-marching loop, not purely the solvers themselves. According to these comparisons, the P-ADI solver demonstrated higher computing efficiency than the P-SIP solver in both cases, especially in comparison with the WF-e mapping method.

7.2.1. Steady Flow Simulation in Arkansas River Pool 7

In Arkansas River pool 7 (Figure 10), there are more than 30 dikes serving to improve navigation conditions, which make the flow field complicated. Velocity profiles at six cross sections from mile 147 to mile 141.4, measured by the USGS, were selected for validating the simulated flow fields. Figure 11 compares the velocity profiles simulated by the CPU-serial model with the SIP solver and by the GPU-parallel model with the P-ADI and P-SIP solvers against the measurements. Note that the two configurations of the P-SIP solver yielded almost the same results, so they are denoted by GPU-SIP in this figure; similarly, the results from the P-ADI solver are denoted by GPU-ADI. These notations are used in the other cases hereafter. Table 10 lists the L2 error norm defined by
L_2 = \left(\frac{\sum_i (N_i - M_i)^2}{\sum_i M_i^2}\right)^{1/2}
where N denotes the numerical result and M denotes the measured one.
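A small illustrative helper computing this norm from paired simulated and measured values might look as follows.

pure real(8) function l2_norm(n, num, meas) result(l2)
  ! L2 error norm of the simulated values num(:) against the measurements meas(:)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: num(n), meas(n)
  l2 = sqrt(sum((num - meas)**2)/sum(meas**2))
end function l2_norm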
As can be seen, all the simulated profiles generally agreed with the measured ones. However, due to the dikes, the flow field is very complicated, which makes it difficult to validate those velocity profiles accurately. Although the CPU and GPU results are nearly identical, small discrepancies between the CPU model and the parallel GPU model can be observed, which is expected and is mainly due to (1) the different computing capabilities of the CPU and GPU and (2) the differences between the parallel and sequential implementations of the numerical solvers.
The P-ADI solver presented slight oscillations in the Y velocity at mile 142.7, which may be attributed to its alternating-direction implicit characteristics; it is not as robust as the fully implicit P-SIP solver.

7.2.2. Unsteady Flow Simulation in Lake Pontchartrain

Lake Pontchartrain (Figure 12), located in southeastern Louisiana, serves not only as a brackish estuary connected to the Gulf of Mexico, Lake Borgne, and Lake Maurepas but also as a flood release area protecting the city of New Orleans.
As shown in Figure 12, the tidal flow enters the lake through the Rigolets and Chef Menteur passes. The simulation results for Lake Pontchartrain on the CPU and GPU are validated using (1) the measured water surface elevation at Mandeville and Westend and (2) the measured velocity at South Lake.
Figure 13 shows the time series of the water surface elevation at two gauging stations: Mandeville in the north and Westend in the south. All simulated water level profiles captured the general trend under the tidal flow boundary conditions well. The P-SIP solver yielded results closer to the CPU results than the P-ADI solver because the serial CPU model uses the same SIP solver.
As for the velocity components at the South Lake station (Figure 14), the basic trend of the measurements was generally well captured by all simulated velocity profiles, despite their much more frequent fluctuations over time compared with the water level. Similarly, the P-SIP solver yielded results closer to the CPU results than the P-ADI solver.

7.3. Sediment Transport Simulations

7.3.1. Steady Sediment Transport in Vistula River

These two parallel solvers (P-ADI and P-SIP) were further implemented in the sediment transport module, CCHE2D-SED. The steady sediment transport simulation in the Vistula River and the unsteady sediment transport simulation in the East Fork River were used to validate and test the two parallel solvers.
The Vistula River is the largest river in Poland and drains into the Baltic Sea. As shown in Figure 15, there are 30 dikes in the study reach. In the sediment transport simulation of the Vistula River (Figure 15), there are seven size classes (Table 11). A constant flow discharge of 570 m3/s, a suspended-load concentration of 11.4 kg/m3, and a bed-load transport rate of 0.032 kg/m/s are imposed at the inlet, while a constant water surface elevation of 10.868 m is maintained at the outlet. The total simulation time is 10,000 s, and the time step is 2 s.
Figure 16 compares the simulated sediment fields (distributions of bed-load transport rate, suspended-load concentration, and bed change) between the serial version on the CPU and the parallel version on the GPU. Nearly identical results were obtained from the CPU and GPU, with only small differences attributable to their different computing capabilities.
Table 12 compares the computing efficiency of the P-ADI and P-SIP solvers for the steady sediment transport simulation in the Vistula River. Again, note that the CPU running time listed in Table 12 includes all the other parallel implementations in the entire time-marching loop. As can be seen, the P-ADI solver is about two times faster than the P-SIP solver with the WF-e mapping.

7.3.2. Unsteady Sediment Transport in East Fork River

The unsteady sediment transport simulation covers a 3.2 km reach of the East Fork River in Wyoming (Figure 17) for the period of 2–19 June 1979. Figure 18 shows the corresponding hydrograph at the inlet and stage graph at the outlet.
In this case, there are nine size classes of sediment mixtures, ranging from 0.088 mm (fine sand) to 32 mm (coarse gravel). Bed sample information for the selected reach was obtained from the USGS technical report [42]. In this simulation, the total simulation time is 18 days, and the time step is 2 s.
Figure 19 compares the simulated sediment flux at the bed trap near the outlet with the measurements. The general trend was captured well by all solvers, but the first peak was over-predicted, the second peak was under-predicted, and oscillations again occurred with the P-ADI solver. For this unsteady sediment transport simulation, more discrepancies were observed between the different solvers. The reason may lie in the complexity of unsteady multi-sized sediment transport, which is highly non-linear and depends not only on the solutions of the transport equations but also on empirical/semi-empirical formulas, which tend to create more uncertainties [25].
Table 13 lists the CPU times for all solvers. Similar to the Vistula case, the P-ADI solver also performed better than the P-SIP solver in this unsteady sediment transport simulation.
As can be seen, sediment transport simulations are in general more complicated and involve more computations than flow simulations; correspondingly, the speedup ratio decreases for both parallel solvers.

8. Conclusions

Two widely used implicit solvers for structured meshes, namely, the ADI solver and the SIP solver, were parallelized and integrated into a 2D numerical model, the CCHE2D model, for comparison. For each parallel solver, two mapping methods were implemented to map 2D problems (I_max, J_max) onto the hierarchical block-thread framework of CUDA on the GPU: the OBM-it (one-block mapping with iterations) and the MBM (multi-block mapping) for the P-ADI solver, and the WF-e mapping (matrix coefficients and solution variables defined on the original mesh) and the WF-all mapping (matrix coefficients and solution variables defined on the WF mesh) for the P-SIP solver. Based on the numerical tests, the following conclusions can be drawn:
(1)
Both the P-ADI and P-SIP solvers were validated using example cases with complex geometries and produced results essentially identical to those of the serial version on the CPU, despite the different computing capabilities of the CPU and GPU;
(2)
In general, the P-ADI solver was faster than the P-SIP solver due to its lighter computing load. The speedup ratios of the P-ADI and P-SIP solvers were up to 7.89 and 5.42, respectively, for the simple 2D heat-diffusion problem, 4.98 and 3.73 for the flow simulations, and 3.648 and 2.166 for the sediment transport simulations. With increasing problem complexity, the speedup ratio decreased;
(3)
In the P-ADI solver, the OBM-it and the MBM had similar computing efficiency, with the OBM-it slightly faster than the MBM; most of the time, the difference was less than 5%, and in the 2D heat-diffusion problem the OBM-it was 13% faster than the MBM;
(4)
In the P-SIP solver, the newly proposed WF-all was significantly faster than the conventional WF-e: 3.67 times faster in the 2D heat-diffusion problem, 1.47~2.03 times faster in the flow simulations, and 1.2 times faster in the sediment transport simulations;
(5)
The P-ADI solver was not as robust as the P-SIP solver due to its alternating-direction implicit characteristics, and it may suffer from oscillations in complicated simulations, such as unsteady sediment transport in domains with wetting-and-drying processes.

Author Contributions

Conceptualization, Y.Z.; methodology, Y.Z.; validation, Y.Z. and X.C.; writing—original draft preparation, Y.Z.; writing—review and editing, M.Z.A.-H. and X.C.; project administration, M.Z.A.-H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the U.S. Department of Agriculture, Agricultural Research Service, through the Cooperative Agreement No. 6060-13000-030-00D between the USDA-ARS National Sedimentation Laboratory (USDA-ARS-NSL) and the University of Mississippi/National Center for Computational Hydroscience and Engineering (NCCHE).

Data Availability Statement

All data in the examples and application are available upon request.

Acknowledgments

The authors would like to thank Ronald Bingner from the USDA Agricultural Research Service (ARS), National Sedimentation Laboratory (NSL) for his help and support in this study.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Nomenclature

A	matrix coefficient of the assembled algebraic equation
C_k	suspended-load concentration in volume for the k-th sediment class
C_*k	transport capacity of suspended load for the k-th sediment class
g	gravitational acceleration (m s−2)
h	water depth (m)
n	Manning's roughness coefficient (s m−1/3)
q	cross-edge discharge (m2 s−1)
q_bk	bed-load transport rate for the k-th sediment class (m2 s−1)
q_b*k	bed-load transport capacity for the k-th sediment class (m2 s−1)
u	velocity in the x direction (m s−1)
v	velocity in the y direction (m s−1)
η	water surface elevation (m)
ρ	water density (kg m−3)
φ	dependent variable
ν	kinematic viscosity (m2 s−1)
p	porosity of bed material
ω_sk	sediment settling velocity for the k-th sediment class (m s−1)
τ_b	bed shear stress (Pa)
τ_w	wind shear stress (Pa)
ε_s	eddy diffusivity of sediment (m2 s−1)

References

  1. Cea, L.; Bladé, E. A simple and efficient unstructured finite volume scheme for solving the shallow water equations in overland flow applications. Water Resour. Res. 2015, 51, 5464–5486. [Google Scholar] [CrossRef]
  2. Xia, X.; Liang, Q.; Ming, X.; Hou, J. An efficient and stable hydrodynamic model with novel source term discretization schemes for overland flow and flood simulations. Water Resour. Res. 2017, 53, 3730–3759. [Google Scholar] [CrossRef]
  3. Zhang, Y.; Al-Hamdan, M.; Bingner, R.; Chao, X.; Langendoen, E.; O’Reilly, A.; Vieria, D.A. Application of 1D Model for overland flow simulations on 2D complex domains. Adv. Water Resour. 2024, 188, 104711. [Google Scholar] [CrossRef]
  4. Zhang, Y.; Al-Hamdan, M.; Bingner, R.; Chao, X.; Langendoen, E.; Vieria, D.A. Generation of 1D Channel Networks for overland flow simulations on 2D complex domains. J. Hydrol. 2024, 628, 130560. [Google Scholar] [CrossRef]
  5. Panchigar, D.; Kar, K.; Shukla, S.; Mathew, R.M.; Chadha, U.; Selvaraj, S.K. Machine learning-based CFD simulations: A review, models, open threats, and future tactics. Neural Comput. Appl. 2022, 34, 21677–21700. [Google Scholar] [CrossRef]
  6. Lee, C.H.; Cant, S. A grid-induced and physics-informed machine learning CFD framework for turbulent flows. Flow Turbul. Combust. 2024, 112, 407–442. [Google Scholar] [CrossRef]
  7. Chen, Y.C.; Wang, Z.Y.; Liu, Z.W.; Zhu, D.J. 1D-2D Coupled Numerical Model for Shallow-Water Flows. ASCE J. Hydraul. Eng. 2012, 138, 122–132. [Google Scholar] [CrossRef]
  8. Brunner, G.W. HEC-RAS- River Analysis System 2D Modeling Users’ Manual; US Army Corps of Engineering, Institute for Water Resources, Hydrologic Engineering Center: Davis, CA, USA, 2016. [Google Scholar]
  9. Zhang, Y.; Jia, Y. Towards Efficient Modeling. In Proceedings of the EWRI World Environment & Water Resources Congress 2017, Sacramento, CA, USA, 21–25 May 2017. [Google Scholar]
  10. Zhang, Y.X.; Jia, Y.F.; Wang, S.S.Y. A Conservative Multi-block Algorithm for Two-dimensional Numerical Model. Int. J. Math. Sci. 2007, 1, 100–112. [Google Scholar]
  11. Navon, I.M.; Cai, Y. Domain decomposition and parallel processing of finite element model of the shallow water equations. Comput. Methods Appl. Mech. Eng. 1993, 106, 179–212. [Google Scholar] [CrossRef]
  12. Thibault, J.C.; Senocak, I. CUDA Implementation of a Navier-Stokes Solver on Multi-GPU Desktop Platforms for Incompressible Flows. In Proceedings of the 47th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition, Orlando, FL, USA, 5–8 January 2009. [Google Scholar]
  13. Cohen, J.M.; Molemaker, M.J. A fast double precision CFD Code using CUDA. In Proceedings of the 21st International Conference on Parallel Computational Fluid Dynamics, Parallel CFD 2009, Moffett, CA, USA, 18–22 May 2009. [Google Scholar]
  14. Corrigan, A.; Camelli, F.; Lohner, R.; Wallin, J. Running Unstructured Grid Based CFD Solvers on Modern Graphics Hardware. In Proceedings of the 19th AIAA CFD, San Antonio, TX, USA, 22–25 June 2009. [Google Scholar]
  15. Brodtkorb, A.R.; Sætra, M.L.; Altinakar, M.S. Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation. Comput. Fluids 2011, 55, 1–12. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Jia, Y. Parallelized CCHE2D Flow Model with CUDA Fortran on Graphics Process Units. Comput. Fluids 2013, 84, 359–368. [Google Scholar] [CrossRef]
  17. Zhang, Y.; Jia, Y. Parallelization of Implicit CCHE2D Model using CUDA Programming Techniques. In Proceedings of the EWRI World Environment & Water Resources Congress 2013, Cincinnati, OH, USA, 19–23 May 2013. [Google Scholar]
  18. Wei, Z.; Jang, B.; Zhang, Y.; Jia, Y. Parallelizing Alternating Direction Implicit Solver on GPUs. In Proceedings of the International Conference on Computer Science, ICCS 2013, Barcelona, Spain, 5–7 June 2013. [Google Scholar]
  19. Lacasta, A.; Morales-Hernandez, M.; Murillo, J.; García-Navarro, P. GPU implementation of the 2D shallow water equations for the simulation of rainfall/runoff events. Environ. Earth Sci. 2015, 74, 7295–7305. [Google Scholar] [CrossRef]
  20. Hou, J.M.; Kang, Y.D.; Hu, C.H.; Tong, Y.; Pan, B.Z.; Xia, J.Q. A GPU-based numerical model coupling hydrodynamical and morphological processes. Int. J. Sediment Res. 2020, 35, 386–394. [Google Scholar] [CrossRef]
  21. Jia, Y.F.; Wang, S.S.-Y. Numerical Model for channel flow and morphological changes studies. ASCE J. Hydraulic Engrg. 1999, 125, 924–933. [Google Scholar] [CrossRef]
  22. Deltares. Delft3D-FLOW. In Simulation of Multi-Dimensional Hydrodynamic Flow and Transport Phenomena, Including Sediments–User Manual; Version 3.04, rev. 11114; Deltares: Delft, The Netherlands, 2010. [Google Scholar]
  23. MIKE 21 Flow Model FM Hydrodynamic Module, Users’ Guide. 2014. Available online: https://manuals.mikepoweredbydhi.help/2017/Coast_and_Sea/MIKE_FM_HD_2D.pdf (accessed on 20 June 2024).
  24. Lai, Y. SRH-2D Users’ Manual: Sediment Transport and Mobile-Bed Modeling; US Department of the Interior, Bureau of Reclamation: Denver, CO, USA, 2020. [Google Scholar]
  25. Zhang, Y.; Al-Hamdan, M.; Wren, D. Development of a Two-Dimensional Hybrid Sediment-Transport Model. Appl. Sci. 2023, 13, 4940. [Google Scholar] [CrossRef]
  26. Thomas, L.H. Elliptic Problems in Linear Differential Equations over a Network: Watson Scientific Computing Laboratory; Columbia University: New York, NY, USA, 1949. [Google Scholar]
  27. Hockney, R.W. A fast direct solution of Poisson’s equation using Fourier analysis. J. ACM 1965, 12, 95–113. [Google Scholar] [CrossRef]
  28. Stone, H.L. Iterative solution of implicit approximations of multidimensional partial differential equations. SIAM J. Numer. Anal. 1968, 5, 87–113. [Google Scholar] [CrossRef]
  29. Hockney, R.W.; Jesshope, C.R. Parallel Computers; Adam Hilger: Bristol, UK, 1981. [Google Scholar]
  30. Sakharnykh, N. Tridiagonal solvers on the GPU and applications to fluid simulation. In Proceedings of the GPU Technology Conference, San Jose, CA, USA, 30 September–3 October 2009. [Google Scholar]
  31. Zhang, Y.; Cohen, J.; Owens, J.D. Fast Tridiagonal Solvers on the GPU. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Bangalore, India, 9–14 January 2010. [Google Scholar]
  32. Reeve, J.S.; Scurr, A.D.; Merlin, J.H. Parallel Version of Stone’s Strong Implicit Algorithm. Concurr. Comput. Pract. Exp. 2001, 13, 1049–1062. [Google Scholar] [CrossRef]
  33. Deserno, F.; Hager, G.; Brechtefeld, F.; Wellein, G. Basic Optimization Strategies for CFD-Codes; Technical Report; Regional Data Center Erlangen (RRZE): Erlangen, Germany, 2002. [Google Scholar]
  34. Igounet, P.; Alfaro, P.; Pedemonte, M.; Ezzatti, P. A GPU implementation of the SIP method. In Proceedings of the 30th International Conference of the Chilean Computer Science Society, Curico, Chile, 9–11 November 2011. [Google Scholar] [CrossRef]
  35. Dufrechou, E.; Ezzatti, P.; Usera, G. Avoiding synchronization to accelerate a CFD solver in GPU. In Proceedings of the 31st International Symposium on Computer Architecture and High-Performance Computing, Campo Grande, Brazil, 15–18 October 2019. [Google Scholar] [CrossRef]
  36. The Portland Group. CUDA Fortran Programming Guide and References; NVIDIA Corp.: Santa Clara, CA, USA, 2011. [Google Scholar]
  37. Rodi, W. Turbulence Models and Their Applications in Hydraulics, 3rd ed.; IAHR Monograph: Rotterdam, The Netherlands, 1993. [Google Scholar]
  38. Wu, W.M. Computational River Dynamics; Taylor & Francis: London, UK, 2007. [Google Scholar]
  39. Igounet, P.; Alfaro, P.; Pedemonte, M.; Ezzatti, P. GPU acceleration of the caffa3d.MB Model. In Proceedings of the ICCSA 2012, Salvador de Bahia, Brazil, 18–21 June 2012; Part IV, LNCS 7336. pp. 530–542. [Google Scholar]
  40. NVIDIA. CUDA C Programming Guide Version 3.1.1; NVIDIA Corp.: Santa Clara, CA, USA, 2010. [Google Scholar]
  41. Heng, K.S. Parallel Alternating Direction Implicit Solver for the Two-Dimensional Heat Diffusion Problem on Graphics Processing Units; National University of Singapore: Singapore, 2011. [Google Scholar]
  42. Meade, R.H.; Myrick, R.M.; Emmett, W.W. Field Data Describing the Movement and Storage of Sediment in the East Fork River, Wyoming, Part II Bed Elevations; 1979, USGS Open-File Rep. 80-1190; United States Geological Survey (USGS): Denver, CO, USA, 1980. [Google Scholar]
Figure 1. Four neighboring nodes of node Pi,j: (a) four-node element and nine-node element on staggered mesh and (b) on non-staggered mesh.
Figure 2. Grid of block threads.
Figure 3. OBM with one block thread.
Figure 4. OBM-it with one block thread with iterations.
Figure 5. MBM mapping with multiple block threads.
Figure 6. Computation dependencies for SIP solver. (a) Forward elimination step; (b) Backward substitution step.
Figure 7. Wave front lines.
Figure 8. WF mesh: (a) WF-e and (b) WF-all.
Figure 9. Mapping between WF mesh and original mesh.
Figure 10. Arkansas River pool 7 with more than 30 dikes.
Figure 11. Velocity profiles in Arkansas River pool 7.
Figure 12. Lake Pontchartrain.
Figure 13. Water surface elevation at Mandeville and Westend of Lake Pontchartrain.
Figure 14. Velocity at South Lake of Lake Pontchartrain.
Figure 15. Structured mesh for Vistula River (34 × 339).
Figure 16. Sediment fields at 167 min: (a) bed-load transport rate, (b) suspended-load concentration, and (c) bed change.
Figure 17. Selected reach of East Fork River. The pink line denotes the boundary of the study reach.
Figure 18. Hydrograph at the inlet and stage graph at the outlet.
Figure 19. Sediment discharge at the outlet.
Table 1. OBM-it of P-ADI solver.

Call Build_Matrix

Do I = 1, Iteration
   Block = Ipmi; Thread = Jp
   Call PCR_X <<<Block, Thread, 4 * Thread * 8 * mj>>> (a, b, c, d, relaxation, mj)
End Do

Do I = 1, Iter
   Block = Jpmj; Thread = Ip
   Call PCR_Y <<<Block, Thread, 4 * Thread * 8 * mi>>> (a, b, c, d, relaxation, mi)
End Do
Table 2. MBM of P-ADI solver.

Call Build_Matrix

Do I = 1, Iteration
   Block = Dim3(Ipmi, mj, 1); Thread = Dim3(1, Jp, 1)
   Call PCR_X <<<Block, Thread, 4 * Thread * 8 * mj>>> (a, b, c, d, relaxation, mj)
End Do

Do I = 1, Iter
   Block = Dim3(mi, Jpmj, 1); Thread = Dim3(Ip, 1, 1)
   Call PCR_Y <<<Block, Thread, 4 * Thread * 8 * mi>>> (a, b, c, d, relaxation, mi)
End Do
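To make the kernels launched in Tables 1 and 2 easier to follow, the listing below is a minimal CUDA Fortran sketch of one parallel cyclic reduction (PCR) solve. It is illustrative only and is not the model's PCR_X/PCR_Y kernel: it assumes a single thread block solves a single tridiagonal line of length n (a power of two, n <= 1024, with a(1) = c(n) = 0), uses statically sized shared memory rather than the dynamically sized shared memory passed at launch in Tables 1 and 2, and omits the relaxation argument and the line-by-line indexing.

module pcr_sketch
  use cudafor
  implicit none
contains
  ! One-block PCR solve of a single tridiagonal line:
  !   a(i)*x(i-1) + b(i)*x(i) + c(i)*x(i+1) = d(i),  with a(1) = c(n) = 0
  attributes(global) subroutine pcr_line(a, b, c, d, x, n)
    integer, value :: n                      ! power of two, n <= 1024
    real(8) :: a(n), b(n), c(n), d(n), x(n)  ! kernel arguments reside on the device
    real(8), shared :: sa(1024), sb(1024), sc(1024), sd(1024)
    real(8) :: k1, k2, na, nb, nc, nd
    integer :: i, im, ip, stride

    i = threadIdx%x                          ! one thread per equation
    sa(i) = a(i); sb(i) = b(i); sc(i) = c(i); sd(i) = d(i)
    call syncthreads()

    stride = 1
    do while (stride < n)
      im = i - stride
      ip = i + stride
      k1 = 0.0d0; k2 = 0.0d0
      if (im >= 1) k1 = sa(i) / sb(im)
      if (ip <= n) k2 = sc(i) / sb(ip)

      na = 0.0d0; nc = 0.0d0
      nb = sb(i); nd = sd(i)
      if (im >= 1) then                      ! eliminate the coupling to x(i - stride)
        na = -sa(im) * k1
        nb = nb - sc(im) * k1
        nd = nd - sd(im) * k1
      end if
      if (ip <= n) then                      ! eliminate the coupling to x(i + stride)
        nc = -sc(ip) * k2
        nb = nb - sa(ip) * k2
        nd = nd - sd(ip) * k2
      end if
      call syncthreads()                     ! all threads done reading old coefficients
      sa(i) = na; sb(i) = nb; sc(i) = nc; sd(i) = nd
      call syncthreads()                     ! publish the new coefficients
      stride = 2 * stride
    end do

    ! After log2(n) sweeps every equation is decoupled: sb(i)*x(i) = sd(i)
    x(i) = sd(i) / sb(i)
  end subroutine pcr_line
end module pcr_sketch

A single line would be solved with, for example, call pcr_line<<<1, n>>>(a_d, b_d, c_d, d_d, x_d, n); the model's kernels instead assign one mesh line per thread block and pass the shared-memory size at launch, as shown in Tables 1 and 2.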
Table 3. P-SIP solver with CPU–GPU overheads.

1   Do I = 1, Num_WF_Line
2      Call LU_Decomposition <<<Block_Num, Thread_Num>>>
3   End Do
4
5   J = 0
6   J = J + 1
7   Call Cal_Residual <<<Block_Num, Thread_Num>>>
8   Do I = 1, Num_WF_Line
9      Call Forward_Elimination <<<Block_Num, Thread_Num>>>
10  End Do
11
12  Do I = Num_WF_Line, 1, -1
13     Call Backward_Substitution <<<Block_Num, Thread_Num>>>
14  End Do
15
16  If (error > Threshold_value .And. J < Max_Iteration) Go To 6
Table 4. P-SIP solver with alleviated CPU–GPU overheads.

1   Call LU_Decomposition <<<Block, Thread>>>
2
3   J = 0
4   J = J + 1
5
6   Call Calculate_Residual <<<Block, Thread>>>
7
8   Call Forward_Elimination <<<Block, Thread>>>
9
10  Call Backward_Substitution <<<Block, Thread>>>
11
12  If (Residual > Threshold_value .And. J < Max_Iteration) Go To 4
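The difference between Tables 3 and 4 is where the loop over wave-front lines lives: in Table 3 the CPU launches one kernel per WF line for each forward and backward pass, whereas in Table 4 the line ordering is handled on the GPU so that each pass needs only a single launch. The sketch below shows one way such an in-kernel wave-front loop can be written in CUDA Fortran for the forward-elimination recurrence. It is an illustration under restrictive assumptions (a single thread block whose size covers the longest WF line, boundary nodes excluded from the node lists, and illustrative array names wf_start/wf_node), not the model's Forward_Elimination kernel.

module sip_forward_sketch
  use cudafor
  implicit none
contains
  ! Forward elimination R(k) = (res(k) - Ls(k)*R(k-1) - Lw(k)*R(k-Jmax))*Lp(k),
  ! with the loop over wave-front lines moved inside the kernel.  Valid only
  ! when a single block is launched and blockDim%x >= the longest WF line, so
  ! that syncthreads() enforces the line-to-line ordering.
  attributes(global) subroutine forward_all_lines(lp, ls, lw, res, r, &
                                                  wf_start, wf_node, n_lines, ntot, jmax)
    integer, value :: n_lines, ntot, jmax
    real(8) :: lp(ntot), ls(ntot), lw(ntot), res(ntot), r(ntot)
    integer :: wf_start(n_lines + 1)   ! first entry of each WF line in wf_node
    integer :: wf_node(ntot)           ! original-mesh index k of every WF node
    integer :: line, t, k, first, last

    t = threadIdx%x
    do line = 1, n_lines
      first = wf_start(line)
      last  = wf_start(line + 1) - 1
      if (first + t - 1 <= last) then
        k = wf_node(first + t - 1)
        ! Only interior nodes are listed, so k-1 and k-Jmax are valid indices
        r(k) = (res(k) - ls(k) * r(k - 1) - lw(k) * r(k - jmax)) * lp(k)
      end if
      call syncthreads()               ! the current line is finished before the next one starts
    end do
  end subroutine forward_all_lines
end module sip_forward_sketch

The listing in Table 4 goes further, also collapsing the backward substitution and the residual evaluation into single kernel launches per outer iteration.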
Table 5. (a) WF-e of P-SIP solver. (b) Kernel of LU decomposition for WF-e.

(a)
Call Build_Matrix_on_OM
Block = NWF/Dw; Thread = Wp
Call LU_Decomposition_OM<<<Block, Thread>>>
Do While (Residual > Threshold .And. I < Max_Iteration)
   Td = 256; Bk = (Total_WF_Node + Td)/Td
   Call Cal_Residual_on_OM<<<Bk, Td>>>
   Call Forward_on_OM<<<Block, Thread>>>
   Call Backward_on_OM<<<Block, Thread>>>
   I = I + 1
End Do

(b)
! Index calculation
B = BlockIdx%X; I = ThreadIdx%X         ! Block index and thread index
! Ln is the total number of WF lines; Nb is the number of blocks
Do J = (B - 1) * Ln/Nb + 1, B * Ln/Nb   ! WF lines in the current block
   Jw = (J - 1) * Wp + I                ! Index on WF mesh
   k = Map(Jw)                          ! Index on original mesh
   Lw(k) = Aw(k)/(1 + alfa * Un(k - Jmax))
   Ls(k) = An(k)/(1 + alfa * Ue(k - 1))
   P1 = alfa * Lw(k) * Un(k - Jmax)
   P2 = alfa * Ls(k) * Ue(k - 1)
   Lp(k) = 1./(Ap(k) + P1 + P2 - Lw(k) * Ue(k - Jmax) - Ls(k) * Un(k - 1))
   Un(k) = (As(k) - P1) * Lp(k)
   Ue(k) = (Ae(k) - P2) * Lp(k)
End Do
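The Map array used in Table 5(b) encodes the correspondence between the WF mesh and the original structured mesh (Figure 9). A host-side Fortran sketch of how such a map could be assembled is given below; it assumes the diagonal numbering w = i + j - 1 and the 1D node index k = (i - 1) * Jmax + j implied by the listings above, pads short wave-front lines with zero entries, and is not the model's actual preprocessing routine.

! Illustrative construction of a wave-front map: Map((w-1)*Wp + pos) returns
! the original-mesh index k of the pos-th node on wave-front line w, or 0 for
! a padding slot of a short line.
subroutine build_wf_map(imax, jmax, wp, n_lines, map)
  implicit none
  integer, intent(in)               :: imax, jmax
  integer, intent(out)              :: wp, n_lines
  integer, allocatable, intent(out) :: map(:)
  integer, allocatable :: fill(:)
  integer :: i, j, w, pos

  n_lines = imax + jmax - 1          ! number of diagonal wave-front lines
  wp      = min(imax, jmax)          ! width of the longest wave-front line
  allocate (map(n_lines * wp), fill(n_lines))
  map  = 0
  fill = 0

  do i = 1, imax
    do j = 1, jmax
      w       = i + j - 1            ! wave-front line containing node (i, j)
      fill(w) = fill(w) + 1          ! next free slot on that line
      pos     = fill(w)
      map((w - 1) * wp + pos) = (i - 1) * jmax + j
    end do
  end do
end subroutine build_wf_map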
Table 6. (a) WF-all of P-SIP solver. (b) Kernel of LU decomposition for WF-all.

(a)
Call Build_Matrix_on_OM_and_Map_to_WF
Block = NWF/Dw; Thread = Wp
Call LU_Decomposition_on_WF<<<Block, Thread>>>
Do While (Residual > Threshold .And. I < Max_Iteration)
   Td = 256; Bk = (Total_WF_Node + Td)/Td
   Call Cal_Residual_on_WF<<<Bk, Td>>>
   Call Forward_on_WF<<<Block, Thread>>>
   Call Backward_on_WF<<<Block, Thread>>>
   I = I + 1
End Do
Call Map_Solution_from_WF_to_OM<<<Bk, Td>>>

(b)
! Index calculation
B = BlockIdx%X; I = ThreadIdx%X         ! Block index and thread index
! Ln is the total number of WF lines; Nb is the number of blocks
Do J = (B - 1) * Ln/Nb + 1, B * Ln/Nb   ! WF lines in the current block
   k = (J - 1) * Wp + I                 ! Index on WF mesh
   Lw(k) = Aw(k)/(1 + alfa * Un(k - Jmax))
   Ls(k) = An(k)/(1 + alfa * Ue(k - 1 - Jmax))
   P1 = alfa * Lw(k) * Un(k - Jmax)
   P2 = alfa * Ls(k) * Ue(k - 1 - Jmax)
   Lp(k) = 1./(Ap(k) + P1 + P2 - Lw(k) * Ue(k - Jmax) - Ls(k) * Un(k - 1 - Jmax))
   Un(k) = (As(k) - P1) * Lp(k)
   Ue(k) = (Ae(k) - P2) * Lp(k)
End Do
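Compared with WF-e, the WF-all layout stores the matrix coefficients and unknowns directly in wave-front order, so an explicit gather step precedes the solve and a scatter step follows it (the Build_Matrix_on_OM_and_Map_to_WF and Map_Solution_from_WF_to_OM calls in Table 6(a)). The CUDA Fortran kernels below sketch what such one-value-per-thread remapping can look like; the kernel and array names are illustrative, not the model's routines, and padding slots of short WF lines are assumed to carry map(kw) = 0.

module wf_remap_sketch
  use cudafor
  implicit none
contains
  ! Copy one coefficient array from original-mesh order into WF order.
  attributes(global) subroutine gather_to_wf(src_om, dst_wf, map, n_wf, n_om)
    integer, value :: n_wf, n_om
    real(8) :: src_om(n_om), dst_wf(n_wf)
    integer :: map(n_wf)
    integer :: kw
    kw = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (kw <= n_wf) then
      if (map(kw) > 0) then
        dst_wf(kw) = src_om(map(kw))    ! value of the mapped original-mesh node
      else
        dst_wf(kw) = 0.0d0              ! padding slot of a short WF line
      end if
    end if
  end subroutine gather_to_wf

  ! Copy the solution from WF order back to original-mesh order.
  attributes(global) subroutine scatter_to_om(src_wf, dst_om, map, n_wf, n_om)
    integer, value :: n_wf, n_om
    real(8) :: src_wf(n_wf), dst_om(n_om)
    integer :: map(n_wf)
    integer :: kw
    kw = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (kw <= n_wf) then
      if (map(kw) > 0) dst_om(map(kw)) = src_wf(kw)
    end if
  end subroutine scatter_to_om
end module wf_remap_sketch

Both kernels could be launched with, for example, the same Bk blocks and Td threads used for the residual kernel in Table 6(a).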
Table 7. CPU time for heat-transfer problem (mesh Imax × Jmax = 1000 × 1000; simulation time 3600 s; time step 1 s).

Solver | Run Time | Speedup Ratio
ADI TDMA (CPU) | 254 s | -
SIP-OM (CPU) | 462 s | -
SIP-WF (CPU) | 48.4 min | -
P-ADI OBM-it (GPU) | 32.2 s | 7.89
P-ADI MBM (GPU) | 36.5 s | 6.96
P-SIP WF-e (GPU) | 312 s | 1.48
P-SIP WF-all (GPU) | 85.2 s | 5.42
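The speedup ratios listed here and in Tables 8, 9, 12 and 13 are consistent with the usual definition, taking the corresponding serial CPU solver as the baseline (ADI TDMA for the P-ADI variants, SIP-OM for the P-SIP variants). For the first GPU entry of Table 7:

Speedup ratio = T_CPU / T_GPU = 254 s / 32.2 s ≈ 7.89.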
Table 8. Run time for Arkansas River (mesh Imax × Jmax = 65 × 1204; simulation time 11.1 h; time step 24 s).

Solver | CPU (s) | GPU (s) | Speedup Ratio
P-ADI OBM-it | 698.6 | 140.2 | 4.98
P-ADI MBM | 698.6 | 146.7 | 4.76
P-SIP WF-e | 698.6 | 276.1 | 2.53
P-SIP WF-all | 698.6 | 169.3 | 3.73
Table 9. Run time for Lake Pontchartrain (mesh Imax × Jmax = 141 × 224; simulation time 30.3 days; time step 100 s).

Solver | CPU (min) | GPU (min) | Speedup Ratio
P-ADI OBM-it | 71 | 17.24 | 4.12
P-ADI MBM | 71 | 16.98 | 4.18
P-SIP WF-e | 71 | 47.92 | 1.48
P-SIP WF-all | 71 | 23.66 | 3.00
Table 10. L2 error norms.

Case | Location | CPU | P-SIP | P-ADI
Arkansas River Pool 7 | Mile 141.4-X | 0.223 | 0.213 | 0.240
Arkansas River Pool 7 | Mile 141.4-Y | 0.947 | 0.955 | 0.953
Arkansas River Pool 7 | Mile 142.7-X | 0.207 | 0.197 | 0.169
Arkansas River Pool 7 | Mile 142.7-Y | 0.837 | 0.702 | 0.923
Arkansas River Pool 7 | Mile 143-X | 0.178 | 0.176 | 0.195
Arkansas River Pool 7 | Mile 143-Y | 0.804 | 0.826 | 0.823
Arkansas River Pool 7 | Mile 143.3-X | 0.238 | 0.233 | 0.263
Arkansas River Pool 7 | Mile 143.3-Y | 0.954 | 0.925 | 0.960
Arkansas River Pool 7 | Mile 146.1-X | 0.258 | 0.264 | 0.275
Arkansas River Pool 7 | Mile 146.1-Y | 0.245 | 0.229 | 0.249
Arkansas River Pool 7 | Mile 147.0-X | 0.354 | 0.329 | 0.338
Arkansas River Pool 7 | Mile 147.0-Y | 0.400 | 0.375 | 0.384
Lake Pontchartrain | Mandeville | 0.134 | 0.162 | 0.189
Lake Pontchartrain | Westend | 0.262 | 0.273 | 0.321
Lake Pontchartrain | Southlake-X | 0.885 | 0.877 | 0.886
Lake Pontchartrain | Southlake-Y | 0.767 | 0.750 | 0.782
East Fork River | - | 0.483 | 0.391 | 0.400
Table 11. Size classes in Vistula River.

Size (mm) | 0.075 | 0.1 | 0.15 | 0.25 | 0.5 | 1 | 2
Fraction | 0.054 | 0.06895 | 0.26835 | 0.5465 | 0.0459 | 0.01485 | 0.00145
Table 12. Run time for steady sediment transport simulation in Vistula River (mesh Imax × Jmax = 34 × 339; simulation time 167 min; time step 2 s).

Solver | CPU (min) | GPU (min) | Speedup Ratio
P-ADI OBM-it | 12 | 3.315 | 3.620
P-ADI MBM | 12 | 3.318 | 3.617
P-SIP WF-e | 12 | 6.716 | 1.787
P-SIP WF-all | 12 | 5.54 | 2.166
Table 13. Run time for unsteady sediment transport simulation in East Fork River (mesh Imax × Jmax = 33 × 501; simulation time 18 days; time step 2 s).

Solver | CPU (h) | GPU (h) | Speedup Ratio
P-ADI OBM-it | 28.5 | 7.812 | 3.648
P-ADI MBM | 28.5 | 7.858 | 3.627
P-SIP WF-e | 28.5 | 16.997 | 1.68
P-SIP WF-all | 28.5 | 13.457 | 2.118
