2.2.1. GPU-CA Framework

The GPU-CA frame used in this work is shown in Figure 2, in which each slice has the cell state of the cross-section corresponding to this moment.

**Figure 2.** GPU-CA architecture.

In this heterogeneous framework, the CPU is responsible for allocating memory, initializing, and transferring the initial state to the global memory on the GPU via the PCIe bus. Then, all the threads on the GPU execute instructions simultaneously in singleinstruction-multiple-data (SIMD) mode until the required calculations are completed. Finally, data are transferred from GPU memory to CPU memory via the PCIe bus for post-processing. This work uses Pascal GP100 GPU (embedded with 3584 CUDA cores) as the accelerator, where millions of threads can execute instructions simultaneously, and each thread corresponds to a cell.

#### 2.2.2. Implementation by CUDA C

This work adopts CUDA as the programming model, because it allows us to execute applications on heterogeneous computing systems by simply annotating code with a small set of extensions to the C programming language. In addition, the CUDA programming model provides the exposed memory hierarchy, in which global memory is the most widely used, because it has a large memory space. This work chose global memory and, based on it, designed the structure of array (SOA) to organize data, because the SOA is more liable to maximize the efficiency of accessing the global memory.

## 1. Task scheduling scheme

CUDA provides streams to achieve concurrency between kernels, where kernels are executed sequentially in the same stream, simultaneously in different streams. To increase parallelism, this paper divides the problem into eight small tasks, each corresponding to a kernel function, as shown in Table 1.


**Table 1.** Task assignment in GPU-CA model.

All kernels are related, but at some point, some kernels are independent. For example, before kernel solute\_Dif is scheduled, kernels capture, schange, calD are independent. To solve the waiting problem among independent tasks, this paper puts all kernels into two streams based on the kernel dependency, as shown in Figure 3. It is noted that all operations in the non-default stream are non-blocking with respect to the host thread. Thus, we need to synchronize the host with operations running in a stream. The overall flow of the numerical simulation is shown in Figure 3, where kernel function solute\_Dif depends on values of *CL* and *D* calculated by calD and schange, respectively. In this case, the host has to wait until the calD and schange finish their calculation before solute\_Dif is scheduled. This work uses the cudaDeviceSynchronize(void) function to achieve synchronization between the host and operations running in a stream.

2 Parallel algorithms

**Figure 3.** The flow-chart of the program model.

Global memory can be accessed on the device from any SM throughout the application's lifetime. When multiple threads write data to the same address at the same time, there will be a 'data race' that can lead to an undefined error. The CA model in this work adopted the Moore neighbor, in which each cell has eight neighbors. Therefore, there may be more than one neighbor discharging solute into a liquid cell or capturing it at the same time. Mapping to the programming model, there are multiple threads that attempt to modify the data in the same address, introducing a 'data race'. This paper avoided this phenomenon by preventing the cell from 'actively' discharging solute to neighboring cells or capturing it. For example, the solute redistribution algorithm adopts a kernel function and a device function to turn 'active' to 'passive', as shown in Figure 4. There are two steps to complete the solute redistribution process. The pseudo-code is listed in Algorithm 1.

**Figure 4.** Avoiding data race ((**a**) store (**b**) read->change).

First, calculate how much solute will be discharged to each liquid neighbor and store it in the intermediate variables, lines 2–7 in Algorithm 1.

Then, read the data stored by the first step from the neighbors and sum it to the solute of the current cell changed, lines 8–11 in Algorithm 1.


This algorithm can avoid the 'data race' and make all cells in the calculation domain independent.

#### **3. Results and Discussion**

#### *3.1. Validated by the LGK Model*

This model was validated by comparing the steady velocity of a free dendrite crystal in an undercooled melt with the analytical value of the LGK model. This work defines the velocity of the dendrite tip through the cell size and time interval as the cell stays in the interface state, and the steady state of dendrite growth is determined as the solute at the boundary opposite the dendrite tip reaches 1.01 times the initial value [37,38]. The physical parameters used in this paper are shown in Table 2.

**Table 2.** Physical properties [39,40].


This test was performed in a square domain with a size of 400 μm × 400 μm, which is uniformly divided into 400 × 400 cells with the size of 1 μm × 1 μm. At the beginning of solidification, a nucleus with a preferred orientation of 0◦ was placed in the center of the domain.

Figure 5 shows the evolution of the transient velocity of the dendrite tip as a function of time calculated by the present CA model under the undercooling of 8K and the steady velocity calculated by the LGK model. It can be seen that tip velocity of the dendrite decreases dramatically in the transient growth period; then, the decreasing tendency becomes mild in the steady growth period. Finally, the velocity is stable at 56.33 μm/s, which is very close to the analytical value of 55.90 μm/s calculated by the LGK model.

**Figure 5.** Comparison of the steady tip velocity at a constant undercooling Δ*T* = 8K.

#### *3.2. Model Capability*

This model is applied to simulate the dendrite growth of binary alloy. Firstly, a group of dendrites of Fe–0.6C alloy are simulated to evaluate the ability to simulate the dendrites with different orientations. Secondly, it is applied to simulate multi-orient dendrites' growth of Fe–5.3Si alloy [41], and the simulated result is compared with the in situ observation. Finally, the performance of the parallel solution is analyzed by comparing the calculation time with the traditional CPU calculation.

#### 3.2.1. Single Dendrite of Fe–0.6C Alloy

A group of dendrites of Fe–0.6C with different orientations are simulated in this part, where the orientation ranges from 0◦ to 90◦ due to the four-fold symmetry of the dendrite. Simulations are also performed on a square domain with a size of 400 μm × 400 μm that is divided into 400 × 400 cells. Figure 6 shows the simulation results at 0.6 s with a constant undercooling of 8K, which shows that the present model can simulate clear dendrite morphology with not only different orientations but also second dendrite arms.

In addition, as shown in Figure 7, the primary arm of dendrites with different orientations of 0◦, 30◦, and 60◦ are almost the same length, indicating that growth velocities with different orientations are in good symmetry.

#### 3.2.2. Multi-Dendrites of Fe–5.3Si Alloy

The development of synchrotron X-ray technology has enabled the observation of solidification in metallic alloys, which provides a powerful tool to verify the model. Yasuda et al. [41] performed in situ observation of Fe–5.3Si wt.% alloy, which shows a clear morphology of dendrite, including secondary dendrite arms. To validate this model, this paper simulated multi-orient dendrites' growth of Fe–5.3Si alloy under the same experimental condition as the in situ observation, which set the nuclei positions and orientations corresponding to the in situ experimental observation before the simulation began. Figure 8a,b depict the simulated dendrite morphology and solute profile, respectively.

**Figure 6.** Solute profile and morphology of the equiaxed dendrite at the melt undercooling of 8K as the orientation is (**a**) 0◦, (**b**) 5◦, (**c**) 20◦, (**d**) 40◦, (**e**) 60◦, and (**f**) 80◦.

**Figure 7.** Velocity symmetry in different orientations.

**Figure 8.** Muti-dendrite morphology of Fe–5.3Si alloy simulated by present model (**a**) morphology, and (**b**) solute profile.

Simulation results show that the bottom dendrite is less developed than the upper because there are more dendrites at the bottom than at the upper, which agrees well with the in situ observation by synchrotron X-ray imaging [41]. For one thing, the growth of many dendrites will make the surrounding solute enriched, which will restrict the development of dendrites. For another, the more dendrites, the less space for each of them. In addition, Figure 8b shows the segregation between dendrites and within a dendrite, which satisfies well the non-equilibrium solidification theory [42]. Therefore, this model can

simulate the growth of the dendrite in the actual casting process and reveal the phenomenon of segregation.

### 3.2.3. Acceleration Performance

The block size configured when launching a kernel function means a lot to performance because of its impact on latency hiding, memory efficiency, and occupancy, etc. A group of numerical experiments were carried out to find the optimal execution configuration according to the grid and block size guidelines [43]. Further, parts of the representative numerical simulation results are shown in Table 3, which shows that the configuration of (32,4) can obtain optimal performance. Accordingly, the following numerical experiments are performed under this configuration.

**Table 3.** Time elapsed with different kernel configuration.


Speedup, the ratio between the CPU and GPU computational time, is used to evaluate the performance of the present GPU-CA model. The same numerical simulations were performed both on the Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40 GHz and Pascal GP 100 GPU. All simulations are carried out on the assumption that solidification has been started with 0.6 s, which equals 600,000 steps in the case of Δ*t* = 0.0001 s. Figure 9 shows the computation time on the CPU and GPU as well as the speedup, which shows that this model can achieve great acceleration. In addition, the speedup increases with the number of grids, which is up to 158×, but the tendency to increase becomes low from a certain point. This can be explained by limited hardware resources, and before it is fully utilized, performance will be significantly improved when the grid number is increased. Once occupancy reaches its maximum, performance may be limited by additional scheduling overhead when increasing the grid number again.

**Figure 9.** Computational performance.

## **4. Conclusions**

This paper aims to develop a high-performance CA model incorporating the decentered square capture rule to simulate dendrite growth with different orientations more efficiently. The calculation efficiency of the CA model based on GPU has been improved by the proposed parallel algorithms adapted to the many-core architecture of GPU and the task scheduling scheme using multi-stream, and it is validated by comparing the calculated value with the analytical value by the LGK model. By applying this model to simulate the single dendrite of the Fe–0.6C alloy in different orientations and the multi-orient dendrite growth of the Fe–5.3Si alloy, the following conclusions can be drawn:


This GPU-based model can greatly accelerate the calculation, which will be a promising tool in the field of studying dendrite growth during large-scale solidification.

**Author Contributions:** Methodology, writing—original draft preparation, J.W.; formal analysis, H.M. and J.Y.; funding acquisition and supervision, Z.X. and H.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was funded by National Natural Science Foundation of China, grant number No.51634002 and No.61703084 and Fundamental Research Funds for the Central Universities, grant number N224001-8.

**Data Availability Statement:** The data applied in this research are available from the authors upon request.

**Conflicts of Interest:** The authors declare no conflict of interest.
