### **1. Introduction**

Reservoir dams play an important role in flood control, water supply, power generation, and other functions [1], and are among the key components of hydraulic engineering systems. However, if a dam is damaged, the consequences are generally catastrophic. Excessive flooding is one of the natural causes of dam failure. Floods are the most frequent natural disasters, affecting large numbers of people and agricultural lands, causing casualties and damaging infrastructure. Increased runoff rates due to urbanization, prolonged rainfall, and insufficient river capacity are among the main causes of flooding. After the flood season began in 2020, multiple rounds of heavy rainfall hit southern China, resulting in severe floods in many places. As of 9 July 2020, flood disasters had affected 30.2 million people in 27 provinces (regions and cities), with direct economic losses of 61.79 billion yuan. In 2021, Henan, Shanxi, and other places were also hit by heavy rainstorms. Understanding the behavior of water flow in a channel is therefore critical for early flood disaster management and saving lives. This paper studies flood behavior through numerical simulation of surface water flow, and performs heterogeneous parallel processing and visualization of the simulation process.

Mathematical models are scientific or engineering models constructed using mathematical logic and mathematical language to solve practical problems. In hydraulic science, mathematical models are often used to describe fluids in different physical settings, and numerical methods are used to solve these models. What we use in
our research is the flood wave propagation dynamics equation, first proposed by Saint-Venant [2] and therefore also called the Saint-Venant (SV) equations. It consists of a continuity equation reflecting the conservation of mass and an equation of motion reflecting the conservation of momentum, and is widely used to predict surface flow parameters such as velocity, depth, or height. Nowadays, the SV equations are used to model flows in a wide variety of physical phenomena, such as overland flow, flooding, dam breaks, and tsunamis [3]. The SV equations were originally used to describe one-dimensional surface water flow; the two-dimensional form is derived from the Navier-Stokes equations [4] and is often referred to as the shallow water equations (SWE). Since the SV equations are quasi-linear hyperbolic partial differential equations (PDEs), analytical solutions are difficult to obtain. Therefore, different numerical methods have been proposed to simulate surface water flow.

Long before the advent of modern computing power, numerical schemes were already applied to the SV equations. Reinaldo and Rene [5] used the explicit MacCormack time-splitting scheme to establish a mathematical model for solving the two-dimensional SV equations; two industrial applications of the time were also presented, demonstrating the validity of the model. Later, Fiedler and Ramirez [6] used this method to simulate a discontinuous two-dimensional hydrodynamic surface flow equation (a variant of the two-dimensional SV equations) with spatially variable properties. The method was developed to model spatially variable infiltration and microtopography, and can also be used to model irrigation, tidal flat and wetland cycles, and flooding. With the development of computers and the improvement of computing performance, more and more simulations were implemented on computers. For example, Kamboh et al. [7] established a mathematical model with initial and boundary conditions using the two-dimensional Saint-Venant PDEs in order to predict and simulate flood behavior. The model was discretized and implemented in MATLAB using the common explicit finite difference method, and the resulting graphs visually show the changes of each parameter over time. Asier [8] also used the two-dimensional Saint-Venant equations to simulate precipitation and runoff events, but with the finite volume method. He developed and compared three programming approaches: sequential, multi-threaded, and many-core; the multi-threaded code was written using OpenMP and the many-core version using CUDA. He concluded that the performance of the GPU parallel version using CUDA is strongly affected by the size of the problem. He also proposed that combining MPI and GPU methods could improve computational efficiency and data capacity, but this was not implemented in the paper. In the field of heterogeneous computing, Ding et al. [9] ported, parallelized, and accelerated a solver of the one-dimensional SV equations based on MPI and the athread library, an accelerated thread library designed for the master-slave acceleration programming model of the SW26010 processor. They used MPI to parallelize across the master cores and athread to accelerate the slave cores, followed by optimizations such as SIMD (Single Instruction Multiple Data) vectorization and communication/calculation overlapping.

In addition to the above work, adaptive mesh refinement (AMR) is also an important contributor to algorithm performance. AMR is generally efficient and effective in treating problems with multiple spatial and temporal scales. It improves the quality of the solution by refining cells only where a high grid resolution is desired, thereby increasing memory efficiency and computation speed [10]. Xin Zhao [11] proposed a 3D volume-of-fluid method based on the adaptive mesh refinement technique, introducing projection methods on adaptive grids to solve the incompressible Navier-Stokes equations. In order to simulate ocean wave propagation, Michael et al. [12] proposed a method for the numerical simulation of dynamically adaptive problems on adaptive triangular grids with recursive structure. They used a grid generation process based on recursive bisection of triangles along marked edges to achieve 2D dynamically adaptive discretization.
Kevin and Frank [13] used a multilayer lattice Boltzmann model (LBM) to solve 3D wind-driven shallow water flow problems, and studied the performance of the parallel LBM for the multilayer shallow water equations in a CPU-based high-performance computing environment using OpenMP. They concluded that explicit loop control with cache optimization in the LBM provides better performance than implicit loop control in terms of execution time, speedup, and efficiency as the number of processors increases. There are many more applications of the two-dimensional Saint-Venant equations, such as [14–17].

There are already some frameworks for solving PDEs using GPUs or other accelerator devices. For example, Bhadke et al. [18] used CUDA to develop a 3D-CFD computing framework for the conduction process; they discretized the model into a three-dimensional grid and solved it using an alternating-direction implicit method. In summary, although many studies have modeled the SV equations (e.g., [19,20]), and parallelization and performance optimization have been discussed, there are few studies on large-scale Saint-Venant systems. Therefore, this research aims to build a simple two-dimensional finite difference model and use MPI, OpenMP, Pthread, and OpenCL for heterogeneous large-scale processing. In this work, we first introduce the governing equations and the calculation method of the SV equations. Then we introduce the basic implementation of our heterogeneous massively parallel computing. Finally, the parallel strategy is optimized and the performance is tested.

### **2. Governing Equations and Numerical Method**

The governing equations used in our research are as follows:

$$\frac{\partial z}{\partial t} + \frac{\partial (zu)}{\partial x} + \frac{\partial (zv)}{\partial y} = 0 \tag{1}$$

$$\frac{\partial(zu)}{\partial t} + \frac{\partial\left(zu^2 + \frac{gz^2}{2}\right)}{\partial x} + \frac{\partial(zuv)}{\partial y} = gz\left(S\_{0x} - S\_{fx}\right) \tag{2}$$

$$\frac{\partial(zv)}{\partial t} + \frac{\partial(zuv)}{\partial x} + \frac{\partial\left(zv^2 + \frac{gz^2}{2}\right)}{\partial y} = gz\left(S\_{0y} - S\_{fy}\right) \tag{3}$$

Equation (1) is derived from the conservation of mass, and Equations (2) and (3) are derived from the conservation of momentum in the *x* and *y* directions, respectively. Here, *z* refers to the elevation (depth or height) of the water flow in the open channel, *u* and *v* are the water velocities in the *x* and *y* directions, respectively, *t* is the time, and g is the acceleration of gravity. *S*<sub>0</sub> and *S*<sub>f</sub> refer to the water surface gradient and the frictional resistance. To make this system of equations more convenient to handle on a computer, we use the product rule of differentiation to expand it. We then discretize the equations using an explicit finite difference scheme, where the time and space derivatives are discretized by the following expressions:

$$\frac{\partial u}{\partial t} \approx \frac{u\_{i,j}^{k+1} - u\_{i,j}^{k-1}}{2\Delta t}, \quad \frac{\partial u}{\partial x} \approx \frac{u\_{i+1,j}^{k} - u\_{i-1,j}^{k}}{2\Delta x}, \quad \frac{\partial u}{\partial y} \approx \frac{u\_{i,j+1}^{k} - u\_{i,j-1}^{k}}{2\Delta y}$$

This is the central difference scheme; the central difference in time is also called the leapfrog scheme, whose advantage is that it enhances the stability of the calculation. The discretized equations then take the following form:

$$\frac{z\_{i,j}^{k+1} - z\_{i,j}^{k-1}}{2\Delta t} + u\_{i,j}^k \frac{z\_{i+1,j}^k - z\_{i-1,j}^k}{2\Delta x} + z\_{i,j}^k \frac{u\_{i+1,j}^k - u\_{i-1,j}^k}{2\Delta x} + v\_{i,j}^k \frac{z\_{i,j+1}^k - z\_{i,j-1}^k}{2\Delta y} + z\_{i,j}^k \frac{v\_{i,j+1}^k - v\_{i,j-1}^k}{2\Delta y} = 0 \tag{4}$$

$$\begin{aligned} u\_{i,j}^k \frac{z\_{i,j}^{k+1} - z\_{i,j}^{k-1}}{2\Delta t} + z\_{i,j}^k \frac{u\_{i,j}^{k+1} - u\_{i,j}^{k-1}}{2\Delta t} + \left(u\_{i,j}^k\right)^2 \frac{z\_{i+1,j}^k - z\_{i-1,j}^k}{2\Delta x} + 2z\_{i,j}^k u\_{i,j}^k \frac{u\_{i+1,j}^k - u\_{i-1,j}^k}{2\Delta x} + g z\_{i,j}^k \frac{z\_{i+1,j}^k - z\_{i-1,j}^k}{2\Delta x} + \\ u\_{i,j}^k v\_{i,j}^k \frac{z\_{i,j+1}^k - z\_{i,j-1}^k}{2\Delta y} + z\_{i,j}^k v\_{i,j}^k \frac{u\_{i,j+1}^k - u\_{i,j-1}^k}{2\Delta y} + z\_{i,j}^k u\_{i,j}^k \frac{v\_{i,j+1}^k - v\_{i,j-1}^k}{2\Delta y} = g z\_{i,j}^k \left(S\_{0x} - S\_{fx}\right) \end{aligned} \tag{5}$$

$$\begin{aligned} v\_{i,j}^k \frac{z\_{i,j}^{k+1} - z\_{i,j}^{k-1}}{2\Delta t} + z\_{i,j}^k \frac{v\_{i,j}^{k+1} - v\_{i,j}^{k-1}}{2\Delta t} + \left(v\_{i,j}^k\right)^2 \frac{z\_{i,j+1}^k - z\_{i,j-1}^k}{2\Delta y} + 2z\_{i,j}^k v\_{i,j}^k \frac{v\_{i,j+1}^k - v\_{i,j-1}^k}{2\Delta y} + g z\_{i,j}^k \frac{z\_{i,j+1}^k - z\_{i,j-1}^k}{2\Delta y} + \\ u\_{i,j}^k v\_{i,j}^k \frac{z\_{i+1,j}^k - z\_{i-1,j}^k}{2\Delta x} + z\_{i,j}^k v\_{i,j}^k \frac{u\_{i+1,j}^k - u\_{i-1,j}^k}{2\Delta x} + z\_{i,j}^k u\_{i,j}^k \frac{v\_{i+1,j}^k - v\_{i-1,j}^k}{2\Delta x} = g z\_{i,j}^k \left(S\_{0y} - S\_{fy}\right) \end{aligned} \tag{6}$$

Then, after rearranging terms and clearing the denominators, the three variables *z*, *u*, and *v* can be solved iteratively. As shown in Figure 1, each iteration value is computed from the results of the previous two iterations, which is a feature of the leapfrog scheme and strengthens the stability of the calculation process.
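For concreteness, a minimal sketch of the resulting update of *z* from Equation (4) is given below; the array and parameter names (`z_prev`, `z_curr`, `z_next`, `u`, `v`, the `IDX` macro, and the grid sizes) are illustrative assumptions rather than the authors' actual code:

```c
/* Minimal sketch of one leapfrog update of z from Equation (4).
 * Array names and parameters are illustrative assumptions.
 * Each array holds nx*ny values, including one ring of ghost cells. */
static void update_z(int nx, int ny, double dt, double dx, double dy,
                     const double *z_prev, const double *z_curr,
                     const double *u, const double *v, double *z_next)
{
#define IDX(i, j) ((i) * ny + (j))
    for (int i = 1; i < nx - 1; i++) {
        for (int j = 1; j < ny - 1; j++) {
            /* central differences in space at time level k */
            double dzdx = (z_curr[IDX(i + 1, j)] - z_curr[IDX(i - 1, j)]) / (2.0 * dx);
            double dudx = (u[IDX(i + 1, j)] - u[IDX(i - 1, j)]) / (2.0 * dx);
            double dzdy = (z_curr[IDX(i, j + 1)] - z_curr[IDX(i, j - 1)]) / (2.0 * dy);
            double dvdy = (v[IDX(i, j + 1)] - v[IDX(i, j - 1)]) / (2.0 * dy);
            /* leapfrog in time over 2*dt: z^{k+1} depends on z^{k-1} and level-k values */
            z_next[IDX(i, j)] = z_prev[IDX(i, j)]
                - 2.0 * dt * (u[IDX(i, j)] * dzdx + z_curr[IDX(i, j)] * dudx
                            + v[IDX(i, j)] * dzdy + z_curr[IDX(i, j)] * dvdy);
        }
    }
#undef IDX
}
```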

**Figure 1.** The values used by the iterative process.

It is also because of the stencil shown in Figure 1 that the boundary of the grid is exceeded when calculating edge values. We therefore need to introduce boundary conditions and add a ring of ghost cells around the standard grid. Balzano [21] gives a general review of existing wetting and drying (WD) methods. He gave expressions for boundary conditions at moving boundaries, which can be solved on fixed grids, adaptive grids, and moving grids based on natural coordinates; he also introduced a number of solutions and models for dealing with boundary problems based on this method, concluding with a discussion of implicit finite-difference models. Heniche et al. [22] also gave specific boundary conditions based on the WD model. They proposed two kinds of boundary conditions: solid boundaries and open boundaries. Specific conditions must be imposed when dealing with solid boundaries, while open boundaries are used to specify flow regimes. The basic idea of the boundary condition we use is to bounce particles arriving at the boundary back into the fluid [23]. To implement this boundary condition, it is only necessary to set each ghost value equal to the adjacent edge value.
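A minimal sketch of this ghost-cell update, under the same illustrative naming assumptions as above, could look as follows:

```c
/* Sketch of the bounce-back boundary condition described above: each
 * ghost cell is set equal to its adjacent interior edge value. The
 * row-major layout and names are illustrative assumptions. */
static void fill_ghost_cells(int nx, int ny, double *z)
{
#define IDX(i, j) ((i) * ny + (j))
    for (int j = 1; j < ny - 1; j++) {
        z[IDX(0, j)]      = z[IDX(1, j)];      /* top ghost row    */
        z[IDX(nx - 1, j)] = z[IDX(nx - 2, j)]; /* bottom ghost row */
    }
    for (int i = 0; i < nx; i++) {
        z[IDX(i, 0)]      = z[IDX(i, 1)];      /* left ghost column  */
        z[IDX(i, ny - 1)] = z[IDX(i, ny - 2)]; /* right ghost column */
    }
#undef IDX
}
```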

As for the initial conditions, we consider a rectangular area with no bottom stress and no wind stress. The water surface at each location is initially stationary with a height of 10 m and zero flow, i.e., *z* = 10 m, *u* = 0 m/s, *v* = 0 m/s. Then a flood wave with a maximum height of 1 m is generated at the entrance of the water to simulate the flow behavior of fluid in a single channel, through the formula:

$$z = e^{-\frac{(a-a\_0)^2 + (b-b\_0)^2}{k^2}}$$

In this formula, *a*<sub>0</sub> and *b*<sub>0</sub> are the locations of the highest point of the flood wave in the *x* and *y* directions, respectively; here *a*<sub>0</sub> = 25 and *b*<sub>0</sub> = 1 locate the center of the entrance. *k* is the initial height of the water wave, where *k* = 10 m. As for *a* and *b*, they are the coordinates of each grid node on the coordinate axes; the wave reaches its maximum value when *a* = *a*<sub>0</sub> and *b* = *b*<sub>0</sub>. The initial state of a small-scale simulation process is shown in Figure 2.
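A sketch of this initialization is shown below; the names are illustrative assumptions, and the superposition of the wave on the 10 m surface is our reading of the description rather than a detail stated in the text:

```c
/* Sketch of the initial conditions: still water at z = 10 m, zero flow,
 * and a Gaussian flood wave of maximum height 1 m centered at
 * (a0, b0) = (25, 1). Names and the additive superposition are
 * illustrative assumptions. */
#include <math.h>

static void init_fields(int nx, int ny, double *z, double *u, double *v)
{
    const double a0 = 25.0, b0 = 1.0, k = 10.0;
    for (int i = 0; i < nx; i++) {
        for (int j = 0; j < ny; j++) {
            double a = (double)i, b = (double)j;  /* grid-node coordinates */
            double wave = exp(-((a - a0) * (a - a0) + (b - b0) * (b - b0)) / (k * k));
            z[i * ny + j] = 10.0 + wave; /* assumed: wave rides on the 10 m surface */
            u[i * ny + j] = 0.0;
            v[i * ny + j] = 0.0;
        }
    }
}
```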

**Figure 2.** Schematic of a small-scale initial state.

### **3. Heterogeneous Implementation**

Our heterogeneous implementation makes the CPU and the accelerator work together: the CPU participates in the calculation while also being responsible for communication, whereas the accelerator is only responsible for calculation. We use the MPI, OpenMP, Pthread, and OpenCL runtime libraries. MPI is a high-performance, multi-process parallel programming interface, used in this paper for point-to-point communication between nodes. Both OpenMP and Pthread are thread-parallel interfaces. OpenMP is a portable API that is very convenient to use because it does not bind the program to pre-set threads; for the same reason, however, OpenMP cannot manage specific threads individually, so we also introduce Pthread. The Pthread API handles most of the behavior required by threads, including creating and terminating threads, waiting for threads to complete, and managing interactions between threads. Weighing the advantages and disadvantages of the two, we use OpenMP for simple thread parallelism, and Pthread when more complex thread operations are required. OpenCL is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems, and is widely used on parallel processors such as multi-core CPUs, GPUs [24–26], and accelerators. We use OpenCL to drive the accelerators and offload part of the computing tasks to them.
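As an example of the "simple" OpenMP case, a one-thread-per-accelerator launch might look like the following sketch; `launch_init_kernel()` is a hypothetical helper, not part of the paper's code:

```c
/* Sketch: one OpenMP thread per accelerator, each driving OpenCL.
 * launch_init_kernel() is a hypothetical helper that would enqueue a
 * kernel on the given device and wait for its completion. */
#include <omp.h>

extern void launch_init_kernel(int dev); /* assumed, defined elsewhere */

void parallel_launch(int num_devices)
{
    #pragma omp parallel for num_threads(num_devices)
    for (int dev = 0; dev < num_devices; dev++)
        launch_init_kernel(dev);
}
```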

A heterogeneous parallel framework can be used across multiple platforms, but different clusters are equipped with different hardware, so the program must be adapted accordingly. This experiment is performed on the domestic advanced computing system "SunRising-1". The specific experimental environment is shown in Table 1.

**Table 1.** SunRising-1 experimental environment.



Figure 3 shows the schematic of our heterogeneous parallel framework. We first initialize MPI and use MPI\_Type\_contiguous() to create an MPI datatype by replicating an existing datatype (for example, MPI\_INT or MPI\_DOUBLE) into contiguous locations, resulting in a contiguous datatype. The created datatype is then committed using MPI\_Type\_commit() before it can be used for communication. After entering process-level parallelism, we routinely initialize OpenCL, including obtaining the platform and device, creating the context, and so on. The next step is to implement the initial value conditions described in Section 2, a process we call "initialization". A single node of the cluster used in our experiment is equipped with 32 CPU cores and four accelerators, so for the initialization process, we enable four threads to simultaneously call OpenCL to start the kernel function. This process does not involve complicated operations, so OpenMP is used. Thread parallelism ends after the kernel function finishes running. After that, the calculation process starts. In this process, both the CPU and the accelerator participate in the calculation, so we divide the threads into two groups: one contains four threads that perform operations similar to the previous process to start the kernel functions, and the other enables the remaining 28 threads to participate in the computation. Grouping and waiting of threads are involved here, so Pthread is used (see the sketch below). After an iterative calculation is over, communication is required because of the ghost cells. Due to the large scale of the computation, a node may need to exchange data bidirectionally with four adjacent nodes (2 or 3 for nodes at the edge). We use four MPI\_Sendrecv() calls to implement this communication; MPI\_Sendrecv() combines sending a message to one process and receiving a message from another into a single call. Figure 4 shows how the four MPI\_Sendrecv() calls exchange data in the four directions. Each intermediate node must send data to its adjacent nodes and also receive data from them, which can be conveniently implemented by calling MPI\_Sendrecv().
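A condensed sketch of the two thread groups is given here; the helper functions and the way work is assigned are illustrative assumptions based on the description above, not the authors' code:

```c
/* Sketch of the two thread groups: four threads each drive one OpenCL
 * device, while the remaining 28 threads compute on the CPU. The helpers
 * are hypothetical placeholders for the corresponding work. */
#include <pthread.h>

#define NUM_DEV 4   /* accelerator-driving threads */
#define NUM_CPU 28  /* CPU compute threads         */

extern void launch_kernel_on_device(int dev); /* assumed, defined elsewhere */
extern void cpu_compute_slice(int tid);       /* assumed, defined elsewhere */

static void *device_worker(void *arg)
{
    launch_kernel_on_device(*(int *)arg); /* enqueue OpenCL kernel, wait */
    return NULL;
}

static void *cpu_worker(void *arg)
{
    cpu_compute_slice(*(int *)arg);       /* this thread's share of the grid */
    return NULL;
}

void run_iteration(void)
{
    pthread_t dev_threads[NUM_DEV], cpu_threads[NUM_CPU];
    int dev_ids[NUM_DEV], cpu_ids[NUM_CPU];

    for (int d = 0; d < NUM_DEV; d++) {
        dev_ids[d] = d;
        pthread_create(&dev_threads[d], NULL, device_worker, &dev_ids[d]);
    }
    for (int t = 0; t < NUM_CPU; t++) {
        cpu_ids[t] = t;
        pthread_create(&cpu_threads[t], NULL, cpu_worker, &cpu_ids[t]);
    }
    /* wait for both groups before the halo exchange */
    for (int d = 0; d < NUM_DEV; d++) pthread_join(dev_threads[d], NULL);
    for (int t = 0; t < NUM_CPU; t++) pthread_join(cpu_threads[t], NULL);
}
```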

**Figure 3.** Schematic of the heterogeneous parallel framework.

**Figure 4.** Schematic of data exchange.

When dealing with nodes at the edges, we introduce MPI\_PROC\_NULL, a constant that represents a dummy MPI process rank. A send or receive whose peer is MPI\_PROC\_NULL completes immediately without transferring data, which allows all MPI processes to issue the same calls regardless of their position.
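Putting the pieces together, a sketch of one halo exchange, including the datatype setup described earlier, might look as follows; the row-major layout, the neighbor variables, and the tags are illustrative assumptions:

```c
/* Sketch of the halo exchange: one MPI_Sendrecv() per direction, with
 * absent neighbors set to MPI_PROC_NULL so every rank issues the same
 * calls. Layout is row-major with one ring of ghost cells; names and
 * tags are illustrative assumptions. */
#include <mpi.h>

void exchange_halo(double *z, int nx, int ny,
                   int up, int down, int left, int right) /* ranks or MPI_PROC_NULL */
{
    MPI_Datatype row_t, col_t;
    MPI_Type_contiguous(ny, MPI_DOUBLE, &row_t);    /* one grid row         */
    MPI_Type_vector(nx, 1, ny, MPI_DOUBLE, &col_t); /* one (strided) column */
    MPI_Type_commit(&row_t);
    MPI_Type_commit(&col_t);

    /* rows: send first/last interior row, receive into ghost rows */
    MPI_Sendrecv(&z[1 * ny], 1, row_t, up, 0,
                 &z[(nx - 1) * ny], 1, row_t, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&z[(nx - 2) * ny], 1, row_t, down, 1,
                 &z[0 * ny], 1, row_t, up, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* columns: send first/last interior column, receive into ghost columns */
    MPI_Sendrecv(&z[1], 1, col_t, left, 2,
                 &z[ny - 1], 1, col_t, right, 2,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&z[ny - 2], 1, col_t, right, 3,
                 &z[0], 1, col_t, left, 3,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&row_t);
    MPI_Type_free(&col_t);
}
```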

After the communication ends, a complete iteration is finished; the iterative loop then continues until the set time is reached. Finally, we output the results and perform postprocessing to generate images, such as the visualization of the *z* values shown in Figure 5 (only the parts with numerical changes are displayed). In this way, the change of the water level in the river channel over time can be observed intuitively.

**Figure 5.** Simulation of water surface elevation at different time steps.
