1. Introduction
The lattice Boltzmann method (LBM) as a mesoscopic scale computational fluid dynamics method [
1] has been widely recognized as an effective means of dealing with complex fluid flows [
2]. For example, it has been applied to simulate the flow over an infinitely wide wedge [3], coupled conduction and radiative heat transfer in composite materials [4], and indoor airflow [5].
The advantages of the LBM include the easy handling of complex boundaries [6] and high parallelism [
7]. However, LBM faces some challenges such as numerical instability at high Reynolds numbers [
8]. In order to solve this problem, the regularized lattice Boltzmann method (RLBM) was proposed [
9]. The principle of RLBM is to perform a pre-collision step before the collision and streaming steps [
10]. The pre-collision step restores symmetries that may not be inherently satisfied during numerical simulations [
11]. RLBM maintains the high parallelism of the LBM without significantly increasing the amount of computation [
12]. Latt and Chopard [
9] introduced a regularization model based on single-relaxation-time BGK dynamics and demonstrated that the RLBM reduces the computational cost while improving accuracy and stability over the LBM. Malaspinas [13] proposed regularized boundary conditions for straight boundaries, applied to the standard D2Q9 and D3Q19 lattices. Guo [14] discussed arbitrary surface boundary conditions for compressible flows in the context of regularized models. Otomo [
15] applied regularized collision models to phase-field multiphase flow models. Basha and Nor Azwadi [
16] showed the structural similarity between the LBM and the RLBM. Like the LBM, the RLBM retains the properties of easily handling complex boundary conditions and high parallelism [
17].
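To make the pre-collision step concrete, the following is a minimal sketch of a Latt–Chopard-style regularization on a single D2Q9 node (the array layout and function names are our own illustrative choices, not code from the works cited):

```python
import numpy as np

# D2Q9 lattice velocities and weights (standard values).
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)
cs2 = 1/3  # squared lattice speed of sound

def equilibrium(rho, u):
    """Second-order equilibrium distribution for a single node."""
    cu = c @ u                       # c_i . u for each direction
    usq = u @ u
    return rho * w * (1 + cu/cs2 + cu**2/(2*cs2**2) - usq/(2*cs2))

def regularize(f):
    """Pre-collision regularization: keep of f_neq only the part carried
    by the second-order moment Pi_neq, restoring the lost symmetries."""
    rho = f.sum()
    u = (c.T @ f) / rho
    feq = equilibrium(rho, u)
    fneq = f - feq
    pi_neq = np.einsum('ia,ib,i->ab', c, c, fneq)        # Pi_neq tensor
    q = np.einsum('ia,ib->iab', c, c) - cs2*np.eye(2)    # Q_i = c_i c_i - cs2*I
    fneq_reg = w/(2*cs2**2) * np.einsum('iab,ab->i', q, pi_neq)
    return feq + fneq_reg
```

A BGK collision would then relax the regularized populations toward the equilibrium as usual; by construction, the regularization step leaves the density and momentum of the node unchanged.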
The RLBM typically uses a uniform grid for numerical calculations. Because the grid resolution cannot be adjusted flexibly, the applicability of the RLBM in some fields is limited. In numerical simulations there are regions where flow details must be captured finely, yet refining the entire grid would significantly increase the computational overhead [
18]. Therefore, it is necessary to design a method for local grid refinement. Filippova and Hanel [
19] proposed the FH method, in which grid information is stored at the vertices; the coarse and fine grids exchange information through bi-directional coupling. Adaptive mesh refinement (AMR) was subsequently proposed and effectively reduces the loss of computational accuracy [
20]. Liu et al. [
21] used the AMR method in combination with the immersion boundary lattice Boltzmann method (IB-LBM) to validate the effectiveness of the method in both 2D and 3D problems. Since the distribution functions are exchangeable on different grids, Lin and Lai [
22] proposed a single-coupling method; however, the method is subject to errors in some cases. Dupuis and Chopard [23] proposed a bi-directionally coupled local grid refinement method that resolves this error problem and improves computational efficiency. In addition, a buffer-based interpolation method between coarse and fine grids was proposed by Chen et al. [
24]. The method effectively reduces the computation time. A method for storing grid information in the center point was proposed by Rohde et al. [
25]. In this method, grid information at different grid sizes is stored at the cell centers. Storing the grid information in this way spreads it accurately and uniformly, which improves the accuracy of the numerical calculations [
26].
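As a concrete example of how such coarse–fine coupling works, when a fine layer halves the grid spacing and the time step (acoustic scaling), the relaxation time must be rescaled so that both layers share one physical viscosity. Below is a minimal sketch of the standard relation used in Dupuis–Chopard-style bi-directionally coupled refinement (the function name is our own):

```python
def fine_tau(tau_coarse, ratio=2):
    """Relaxation time on the fine grid so that coarse and fine layers
    share the same physical viscosity nu = cs^2 * (tau - 1/2) * dt,
    given dx_f = dx_c / ratio and dt_f = dt_c / ratio."""
    return 0.5 + ratio * (tau_coarse - 0.5)
```

For example, a coarse-grid relaxation time of 0.6 implies a fine-grid relaxation time of 0.7 when the spacing is halved.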
The RLBM with multi-layer grids uses a uniform grid within each layer, and the grid spacing of each layer is twice that of the next finer layer. In this method, the calculation of the equilibrium distribution functions, the collision step, the streaming step, and the computation of the macroscopic quantities are all local operations, which provides excellent conditions for the parallelization of the RLBM. Hasert et al. [
27] proposed a new parallel approach suitable for numerical simulations on large-scale clusters. A method of computational domain division in one and two dimensions was proposed by Kandhai et al. [
28]. The method effectively improves the parallel performance and achieves accurate numerical results. Numerical experiments on multiphase flow were carried out by Pan et al. [
29], and the performance differences between different domain divisions were analyzed. In engineering problems, parallel LBM computation is also applied in various flow domains, such as nuclear reactor fuel cooling [
30] and vehicle external flow field [
31]. A load-balancing method was proposed by Schornbaum et al. [
32]. The method requires that all grids be distributed to all parallel domains. A parallel algorithm was designed by Abas et al. [
33]. The algorithm is based on numerical experiments implemented on the Palabos library [
34] with the RLBM and MPI cross-node parallelism. Good results were achieved; however, the implementation details of the algorithm were not disclosed. The multi-layer grids RLBM is capable of handling large-scale CFD problems, and multi-core servers [35] and supercomputers [36] are common platforms for such problems.
These studies demonstrate that the RLBM is a stable method, that local refinement of multi-layer grids can concentrate computational effort in the regions where accuracy is most needed, and that the RLBM with multi-layer grids is well suited to parallelization. Based on these observations, this paper proposes a load balancing-based grid dividing algorithm and an MPI-based parallel algorithm for the RLBM on multi-layer grids. The grid dividing algorithm distributes the workload evenly across processes, minimizing discrepancies in computational load, while the MPI-based parallel algorithm ensures accurate and efficient numerical simulation.
The paper is structured as follows.
Section 1 is the introduction.
Section 2 introduces the RLBM and gives the RLBM evolution process for multi-layer grids.
Section 3 proposes the load balancing-based grid dividing algorithm and the MPI-based parallel algorithm for RLBM on multi-layer grids. Numerical experiments and parallel performance analysis are in
Section 4. Conclusions are in
Section 5.
3. Parallel Strategy and Algorithm
This paper describes the division of the grid in the two-dimensional case. A load balancing strategy [
34] prevents a process from going idle while waiting for data from other processes. During a simulation, all grid cells are assigned to the available processes so that no process is left idle. Assigning grids to processes effectively is essential for improving efficiency.
In the simulation scenario, the grid of each layer is described by a one-dimensional array that stores the number of grids involved in the evolution step in each column of the grid region. This one-dimensional array is partitioned according to the number of processes.
First, this paper divides each grid layer into multiple subdomains along the same direction. Subdomains with the same process number in different layers form a single computational region. As a result, data exchange between boundary grids essentially occurs only between neighboring processes. The grid division procedure is designed in Algorithm 1. The flowchart for grid division is shown in
Figure 6.
Algorithm 1 Load Balancing-Based Grid Dividing Algorithm |
Input: Number of grid layers: N; Number of processes: P; Output: File after grid division.
1: int p = 1, avg;
2: // R is the number of columns in the current layer grid.
3: // l[i][j] represents how many grids are stored in each of the columns 1 to R from layer 1 to layer N.
4: // s[j] represents how many grids are stored in columns 1 to the current column; avg = s[R] / P is the average load per process.
5: for int i = 1 to N do
6:  for int j = 1 to R do
7:   count l[i][j]; // l[i][j] is the number of grids in the j-th column of the i-th layer.
8:  end for
9: end for
10: for int i = 1 to N do
11:  for int j = 1 to R do
12:   s[j] = s[j-1] + l[i][j];
13:  end for
14: end for
15: for int i = 1 to N do
16:  for int j = 1 to R do
17:   for each grid in column j of layer i do
18:    if s[j] > p * avg then
19:     int a = s[j] - p * avg;
20:     int b = p * avg - s[j-1];
21:     if a < b then
22:      Set the process number in this column to p;
23:      p++;
24:     else
25:      p++;
26:      Set the process number in this column to p;
27:     end if
28:    else
29:     Set the process number in this column to p;
30:    end if
31:   end for
32:  end for
33: end for
|
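One plausible reading of Algorithm 1 in executable form is sketched below (the names divide_columns and col_counts and the exact tie-breaking rule are our assumptions): the per-column grid counts are accumulated into a running sum, and a column that straddles a process's quota boundary is assigned to whichever side of the boundary holds more of it.

```python
def divide_columns(col_counts, num_procs):
    """Assign consecutive grid columns to processes so that each process
    receives roughly total/num_procs grid cells.  col_counts[j] is the
    number of grid cells (summed over all layers) in column j."""
    total = sum(col_counts)
    avg = total / num_procs          # target load per process
    owner, p, running = [], 0, 0
    for n in col_counts:
        # Does this column cross the quota boundary of process p?
        if p < num_procs - 1 and running + n > (p + 1) * avg:
            a = (p + 1) * avg - running      # part of the column before the boundary
            b = running + n - (p + 1) * avg  # part of the column past the boundary
            if b <= a:
                owner.append(p)              # small overshoot: keep column on p
                p += 1
            else:
                p += 1                       # most of it lies past: give to p+1
                owner.append(p)
        else:
            owner.append(p)
        running += n
    return owner
```

For instance, four equally loaded columns split over two processes yield the assignment [0, 0, 1, 1].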
The division of the multi-layer grid is shown schematically in
Figure 7.
The three-layer grid structure of a two-dimensional flow over two circular cylinders is taken as an example in this paper. Using the multi-layer grids RLBM, all grid points within the same layer have similar computational demands. Coarse grid points undergo half the computations compared to fine grid points. The grid area in
Figure 7 consists of three distinct grid sizes. Based on the division idea in Algorithm 1, a parallel grid division was performed: grids of all resolutions are partitioned into four equal segments, and each MPI process manages one segment of each grid layer. The resulting partition is illustrated in
Figure 7. Red denotes process 0, yellow denotes process 1, purple denotes process 2, and blue denotes process 3.
The data-transfer method of the multi-layer grid is shown schematically in
Figure 8.
Under this parallel strategy, each process exchanges data only with its two neighboring processes. The parallel algorithm proposed in this paper relies on the grid division algorithm within the computational domain. After the grid is divided into sections, the data at the grid junctions comprise data for evolution and data for interpolation. The evolution data come from same-layer grid connections, namely the outermost column of grid points at the junction. The interpolation data come from the intersections of different grid layers and are used to complete the interpolation between coarse and fine grids. Both kinds of data contain information about the neighboring grids and the values of the distribution functions. Once the interpolation and evolution data have been transferred, the processes can proceed in parallel. In this section, Algorithm 2 gives an MPI-based parallel algorithm for RLBM on multi-layer grids.
To prevent deadlocks, each process performs the collision, streaming, and computation of macroscopic quantities independently within its own part of the computational domain. Neighboring processes exchange the boundary-grid information at the grid junctions and then compute, which optimizes communication efficiency. Each process first performs its local computation; finally, the results from all regions are assembled into the solution of the complete original problem.
Algorithm 2 MPI-based parallel algorithm for RLBM on multi-layer grids |
Input: Grid files after division; total number of MPI processes P; number of grid layers N; initial information of the flow field. Output: Evolution results.
1. Calculate the number of grid points in each layer of the multi-layer grids.
2. Uniformly divide the multi-layer grids into P grid regions according to Algorithm 1.
3. Allocate one grid region to each MPI process.
4. From layer N to layer 1, each process executes the RLBM based on multi-layer grids.
5. Once the grid calculations at each layer are complete, perform a synchronization operation.
6. If the exit condition has not been met, repeat steps 4 and 5.
7. Output the calculation results.
|
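The neighbor-only exchange in steps 4–5 can be pictured with a pure-Python stand-in, in which a list of NumPy blocks plays the role of the per-rank subdomains and ghost-column copies play the role of the MPI messages (all names here are illustrative):

```python
import numpy as np

def exchange_ghosts(subdomains):
    """Mimic the neighbor-only boundary exchange of the MPI algorithm:
    each subdomain receives its neighbors' outermost real columns into
    its own ghost columns.  subdomains[r] plays the role of the grid
    block owned by MPI rank r, with one ghost column on each side."""
    for r, sub in enumerate(subdomains):
        if r > 0:                        # ghost column from the left neighbor
            sub[:, 0] = subdomains[r - 1][:, -2]
        if r < len(subdomains) - 1:      # ghost column from the right neighbor
            sub[:, -1] = subdomains[r + 1][:, 1]
```

In the actual MPI implementation these copies would be non-blocking send/receive pairs between ranks r-1, r, and r+1 followed by a wait, which avoids the deadlock discussed above.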
4. Numerical Simulations
This section provides two types of numerical validation: flow over two circular cylinders in the 2D case and flow around a sphere in the 3D case. The parallel performance of 3D numerical experiments is evaluated and analyzed.
4.1. Flow over Two Circular Cylinders in a 2D Case
To demonstrate the precision of the parallel algorithm of RLBM based on multi-layer grids, the flow over two circular cylinders at different spacing ratios was simulated. Two cylinders of the same size are placed in the channel. The diameter of the cylinder is defined as
D. The Reynolds number was set the same in all scenarios of this experiment at
Re = 200. The length of the channel is 80
D and the width of the channel is 60
D. The channel’s inlet velocity is
U (
U = 0.1). The front cylinder is located 20
D from the left boundary. The back cylinder is positioned at a distance
L from the front cylinder. This experiment investigates the impact on the flow field as the position of the rear cylinder changes. The distance
L between two cylinders is taken as 1.5
D, 2
D, 3
D, and 4
D. A schematic of the simulation scenario is shown in
Figure 9.
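For reference, the relaxation time implied by such a setup follows from the standard lattice-unit relations nu = U*D/Re and tau = nu/cs^2 + 1/2 with cs^2 = 1/3; the diameter in lattice nodes used below is an assumed value, not one given in the paper:

```python
def bgk_tau(re, u=0.1, d=40.0):
    """BGK relaxation time for flow at velocity u past a cylinder of
    diameter d lattice nodes at Reynolds number re (lattice units).
    Assumes nu = u*d/re and tau = 3*nu + 1/2 (i.e., cs^2 = 1/3)."""
    nu = u * d / re          # lattice kinematic viscosity
    return 3.0 * nu + 0.5
```

At Re = 200 with U = 0.1 and an assumed 40-node diameter, this gives tau = 0.56 on the layer where the cylinder is resolved.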
The fluid region grid is set up as follows: The size of the refined two-layer grid area is
and
, respectively. Each layer’s grid size is half that of the preceding layer. Boundary grid refinement was performed around the two cylinders to capture flow details. The three-layer grid structure is shown in
Figure 10.
The flow field at different spacing ratios is analyzed as follows: as the number of evolution steps increases, the vortices between the cylinders gradually stabilize. The streamlines of the upstream cylinder closely envelop the downstream cylinder, and a symmetrical vortex pair forms between the two cylinders. Due to the interference of the upstream cylinder, the vortex behind the downstream cylinder cannot form a stable vortex pair, as shown in
Figure 11,
Figure 12,
Figure 13 and
Figure 14. The different colors of velocity contours represent different ranges of velocity values.
The vortex pair between the two cylinders can no longer remain stable and symmetrical at the spacing ratio L/D = 3, which confirms the existence of a critical spacing [
41].
In this paper, the lift coefficient C_L and the drag coefficient C_D are used to verify the accuracy of the algorithm in the numerical calculations.
Table 1 and
Table 2 demonstrate the results obtained using the algorithm of this paper in comparison with the published literature [
41,
42,
43].
Figure 15,
Figure 16,
Figure 17 and
Figure 18 show, for the different spacings of the two cylinders, the periodic variation of the lift coefficient C_L and the drag coefficient C_D of the fluid flowing past the cylinders with the time iteration steps.
In
Figure 19, the number of grids and the CPU runtime are compared for a single-layer grid and a three-layer grid. The CPU time for the three-layer grid is only one-eighth of the original, while the number of grids to be computed is one-quarter of the original. The results demonstrate that the multi-layer grids regularized lattice Boltzmann method (RLBM) achieves high accuracy and stability while substantially reducing the computational cost.
4.2. Flow around a Sphere in 3D Case
To demonstrate the precision of the parallel algorithm of the RLBM based on multi-layer grids, the flow around a sphere was simulated. A sphere is placed in a closed box. The radius of the sphere is defined as
R. The box size is
. The inlet flow velocity is
U(
U = 0.1). The setup of the simulation scene is shown in
Figure 20. A three-layer grid was used to perform the numerical calculations. The dimension of the first layer of the grid is 1/6
R. The size of the second layer of the grid region is
. The size of the third layer of the grid area is
.
In this section, experiments were conducted with four sets of Reynolds numbers of 50, 100, 150, and 200. The length
of the recirculation region is different under different conditions. The flow diagrams at different Reynolds numbers are shown in
Figure 21,
Figure 22,
Figure 23 and
Figure 24. The length
of the recirculation region under different conditions is compared with the published literature [
44,
45,
46] in
Table 3.
In this paper, the drag coefficient C_D and the length L_r of the recirculation region are used to verify the accuracy of the algorithm in the numerical calculations.
Table 3 demonstrates the results obtained using the algorithm of this paper in comparison with the published literature [
44,
45,
46].
4.3. Performance Evaluation
In this section, we evaluate the algorithm proposed in this paper in terms of both the speedup ratio and the efficiency. The scenario configuration for the simulation is the same as in
Section 4.2. The speedup ratio is defined as follows:
S_p = T_s / T_p,
where T_s is the serial execution time and T_p is the parallel execution time using p cores.
The efficiency is defined as follows:
E_p = S_p / p.
The speedup ratio and efficiency of the parallel algorithm proposed in this paper at different numbers of cores is demonstrated in
Table 4.
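These definitions are easy to apply directly; below is a small helper (the timing values are illustrative, chosen to reproduce the 16-core speedup of 9.05 reported in Table 4, not measured data):

```python
def speedup(t_serial, t_parallel):
    """Speedup ratio S_p = T_s / T_p."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """Parallel efficiency E_p = S_p / p."""
    return speedup(t_serial, t_parallel) / p

# Illustrative timings: a run that is 9.05x faster on 16 cores
# corresponds to an efficiency of about 0.57.
print(speedup(905.0, 100.0))         # 9.05
print(efficiency(905.0, 100.0, 16))  # 0.565625
```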
To facilitate the performance comparison, parallel computation was also performed with OpenMP under the same experimental conditions and scenarios. The speedup ratio and efficiency of the OpenMP implementation at different numbers of cores are demonstrated in
Table 5.
The performance of the multi-layer grids RLBM using the parallel algorithm proposed in this paper is compared with that of the same method using OpenMP multi-threaded computation. The comparison of speedup ratios is shown schematically in
Figure 25. The comparison of efficiency is shown schematically in
Figure 26.
It is clear that the parallel algorithm proposed in this paper exhibits a significant speedup for all numbers of cores. For example, with 16 cores, the algorithm achieves a speedup ratio of 9.05, whereas the speedup ratio using OpenMP is only 4.621. This shows that the proposed algorithm makes better use of parallel computing resources and achieves higher computational efficiency when scaling to more processor cores. As the number of processor cores increases, the speedup of the MPI implementation grows significantly faster than that of OpenMP, demonstrating excellent scalability.
The algorithm proposed in this paper is also significantly more efficient than OpenMP at every core count. In the two-core case, the MPI implementation is 87% efficient compared with 73% for OpenMP, and this gap widens as the number of cores increases. This shows that the proposed algorithm maintains effective load balancing and low communication overhead as the core count grows. Although the efficiency of the MPI algorithm decreases as the number of cores increases, the decrease is significantly smaller than that of the OpenMP version, and the efficiency remains at a reasonable level in larger-scale parallel computations, indicating the effectiveness of the load-balancing strategy.
These results confirm the soundness and efficiency of the proposed algorithm: it greatly reduces the computational bottleneck of individual processes and improves the overall computational efficiency. Because the algorithm communicates only between neighboring processes and exchanges only boundary-grid information, the communication volume is greatly reduced, allowing the parallel algorithm to remain highly efficient at larger scales.