2.1. The Numerical Methods Being Investigated
This work focuses on the heat conduction equation, which is mathematically analogous to the diffusion equation. The homogeneous version of these equations is as follows:

$$\frac{\partial f}{\partial t} = D\,\nabla^2 f,$$

which is to be used with an initial condition, and where
f is the temperature [°C, K] or, in the case of diffusion, the concentration,
t is the simulated time [s],
$\vec{r}$ is the position vector [m],
D is the diffusion coefficient [m²/s], and
$\nabla^2$ is the Laplacian operator.
As for the concentration term, one should note that the mass concentration [kg/m³], molar concentration [mol/m³], and particle count concentration [m⁻³] are all applicable as long as the same type of concentration is used consistently on both sides of the equation.
After expanding the Laplacian, we have the following equation:

$$\frac{\partial f}{\partial t} = D \left( \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2} \right).$$
In this study, the equations are applied to simple rectangular domains, and a fixed Dirichlet condition is used at the boundaries (see Equations (12) and (13) in Section 2.3).
The algorithms mentioned in Section 1 belong to the family of finite difference methods, which makes spatial and temporal discretization necessary. The spatial discretization divides the domain into a grid (with initial points Xinit, Yinit, and Zinit and endpoints Xfin, Yfin, and Zfin):

$$x_i = X_{\mathrm{init}} + i\,\Delta x, \qquad y_j = Y_{\mathrm{init}} + j\,\Delta y, \qquad z_k = Z_{\mathrm{init}} + k\,\Delta z,$$
where the spatial step sizes are

$$\Delta x = \frac{X_{\mathrm{fin}} - X_{\mathrm{init}}}{N_x - 1}, \qquad \Delta y = \frac{Y_{\mathrm{fin}} - Y_{\mathrm{init}}}{N_y - 1}, \qquad \Delta z = \frac{Z_{\mathrm{fin}} - Z_{\mathrm{init}}}{N_z - 1},$$

the index ranges are

$$i \in \{0, \dots, N_x - 1\}, \qquad j \in \{0, \dots, N_y - 1\}, \qquad k \in \{0, \dots, N_z - 1\},$$

and Nx, Ny, and Nz are the numbers of nodes along the respective axes of the spatial mesh, i.e., the data point counts. It should be mentioned that in Section 3.1, the notation Nr is used as the total number of spatial grid points, which is Nr = Nx for a 1-dimensional measurement and Nr = Nx · Ny for 2-dimensional data.
Similarly, the temporal discretization divides the time interval into smaller timesteps as follows:

$$t^n = T_{\mathrm{init}} + n\,\Delta t,$$

where the step size and index range are

$$\Delta t = \frac{T_{\mathrm{fin}} - T_{\mathrm{init}}}{N_t}, \qquad n \in \{0, \dots, N_t\}.$$
Euler’s method, which serves as a reference in this paper, uses a straightforward approach to produce the data point values of timestep n + 1 as an explicit function of the values of timestep n. Approximating the Laplacian operator with the central difference formula, the stepper formula is (for ease of understanding, in 2 dimensions) as follows:

$$f_{i,j}^{n+1} = f_{i,j}^{n} + D\,\Delta t \left( \frac{f_{i-1,j}^{n} - 2 f_{i,j}^{n} + f_{i+1,j}^{n}}{\Delta x^2} + \frac{f_{i,j-1}^{n} - 2 f_{i,j}^{n} + f_{i,j+1}^{n}}{\Delta y^2} \right),$$

where the index ranges (covering the inner, non-boundary nodes) are

$$i \in \{1, \dots, N_x - 2\}, \qquad j \in \{1, \dots, N_y - 2\}.$$
This method is susceptible to numerical instability and requires a small timestep size for accuracy; for the scheme above, the standard stability condition is $\Delta t \le \left[ 2D \left( 1/\Delta x^2 + 1/\Delta y^2 \right) \right]^{-1}$. To eliminate stability problems, the CNe method uses a convex combination of the function values to calculate the values of the next timestep. First, an average temperature value $f_{A,i,j}^{n}$ of the neighboring datapoints is calculated, and then this value is used as an ambient or terminal temperature to apply Newton’s law of heat transfer (i.e., to calculate the exponentially converging value after the timestep has elapsed):

$$f_{i,j}^{n+1} = f_{i,j}^{n}\, e^{-\Delta t / \tau} + f_{A,i,j}^{n} \left( 1 - e^{-\Delta t / \tau} \right),$$

where

$$\tau = \frac{1}{2D \left( 1/\Delta x^2 + 1/\Delta y^2 \right)}$$

is the time constant of the node and

$$f_{A,i,j}^{n} = \tau D \left( \frac{f_{i-1,j}^{n} + f_{i+1,j}^{n}}{\Delta x^2} + \frac{f_{i,j-1}^{n} + f_{i,j+1}^{n}}{\Delta y^2} \right)$$

is the asymptotic value to which the node temperature tends.
The bottleneck of this method is its relatively poor accuracy. To mitigate this problem, the CpC method combines it with a predictor–corrector approach, namely a second-order Runge–Kutta scheme. In the first stage, an approximating value is calculated at a fractional timestep p Δt using the CNe formula:

$$f_{i,j}^{\mathrm{pred}} = f_{i,j}^{n}\, e^{-p\,\Delta t / \tau} + f_{A,i,j}^{n} \left( 1 - e^{-p\,\Delta t / \tau} \right).$$

In the second stage, a linear combination of the original value and the result of the first stage is used as the terminal temperature for CNe to calculate the full-timestep value as follows:

$$f_{i,j}^{n+1} = f_{i,j}^{n}\, e^{-\Delta t / \tau} + \left[ \left( 1 - \frac{1}{2p} \right) f_{A,i,j}^{n} + \frac{1}{2p}\, f_{A,i,j}^{\mathrm{pred}} \right] \left( 1 - e^{-\Delta t / \tau} \right),$$

where 0 < p ≤ 1 is the step fraction and $f_{A,i,j}^{\mathrm{pred}}$ denotes the neighbor average computed from the first-stage values.
It is mathematically proven in [15] that the above algorithm with coefficients 1 − 1/(2p) and 1/(2p) indeed yields a second-order error term. The results of [15] also show that the accuracy does not depend significantly on the value of parameter p; hence, p = 1/2 is a good choice to simplify calculations. By doing so, the coefficient of the pre-step value becomes zero, enabling the result of the first stage to be used directly in the second stage without having to calculate the linear combination:

$$f_{i,j}^{n+1} = f_{i,j}^{n}\, e^{-\Delta t / \tau} + f_{A,i,j}^{\mathrm{pred}} \left( 1 - e^{-\Delta t / \tau} \right),$$

which yields an algorithm analogous to the midpoint method.
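To make the two-stage structure concrete, the following is a minimal serial sketch in C of one CpC timestep with p = 1/2 for the 1-dimensional case, based on the formulas above. It is not the measured implementation of this paper, and all names are illustrative.

#include <math.h>

/* One CpC timestep with p = 1/2 in 1D (serial sketch). u: current values
   (length n), upred: scratch for the half-step predictor, unew: output.
   Boundary cells u[0] and u[n-1] stay fixed (Dirichlet condition). */
void cpc_step_1d(const double *u, double *upred, double *unew,
                 int n, double D, double dx, double dt)
{
    const double tau = dx * dx / (2.0 * D);   /* node time constant */

    /* Stage 1: CNe over the fractional timestep p*dt = dt/2. */
    const double e1 = exp(-0.5 * dt / tau);
    upred[0] = u[0];
    upred[n - 1] = u[n - 1];
    for (int i = 1; i < n - 1; ++i) {
        double ua = 0.5 * (u[i - 1] + u[i + 1]);   /* neighbor average */
        upred[i] = u[i] * e1 + ua * (1.0 - e1);
    }

    /* Stage 2: full-step CNe, with the terminal temperature taken from
       the half-step (midpoint) neighbor values. */
    const double e2 = exp(-dt / tau);
    unew[0] = u[0];
    unew[n - 1] = u[n - 1];
    for (int i = 1; i < n - 1; ++i) {
        double ua = 0.5 * (upred[i - 1] + upred[i + 1]);
        unew[i] = u[i] * e2 + ua * (1.0 - e2);
    }
}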
2.2. Parallelization and Applying OpenCL
Both the CNe and CpC methods are parallelizable, and the same applies to Euler’s method. When determining whether an algorithm is parallelizable, several key criteria must be considered:
Independence of tasks (data parallelism): the extent to which tasks or operations can be performed independently of each other. If the algorithm’s tasks do not depend on the results of other tasks, they can be executed in parallel.
Task granularity: the size of the tasks being parallelized. Fine-grained tasks, which involve small and fast operations, can create overhead in managing parallel tasks, whereas coarse-grained tasks, involving larger and more complex operations, may reduce overhead but require careful balancing.
Communication overhead: the amount of data exchange or synchronization required between parallel tasks. Algorithms that require minimal communication between tasks are easier to parallelize.
Load balancing: the ability to distribute the workload evenly across all available computing units, avoiding situations where some processors are idle while others are overloaded.
Scalability: how well the algorithm’s performance improves with the addition of more processing units. A scalable algorithm might demonstrate near-linear speedup as the number of processors increases, such as in parallel matrix multiplication.
Synchronization needs: the extent to which tasks must synchronize with each other during execution.
Memory access patterns: the way an algorithm accesses memory. Algorithms with predictable and non-conflicting memory access patterns, such as the regular, sequential accesses of matrix operations, are easier to optimize for parallel execution than those with irregular access patterns that may cause contention.

In this research, a cell-by-cell parallelization is carried out, which means that each kernel invocation calculates one cell value.
In this paper, OpenCL [20] was used. The Open Computing Language is a framework for writing programs that execute across heterogeneous platforms, including CPUs, GPUs, DSPs, and FPGAs. Its primary purpose is to facilitate parallel computing by allowing developers to harness the computational power of diverse hardware architectures within a unified programming model. By leveraging parallelization, OpenCL aims to significantly enhance performance in various computational tasks, from scientific simulations to real-time data processing. OpenCL is widely used for high-performance computing [21], and its cross-platform nature [22] makes it useful in various application domains. Parallelization was used in [23] for solving nonlinear ordinary differential equations; a sparse matrix–vector multiplication is presented in [24]; parallel computing in the modeling of fractional-order state–space systems is presented in [25]; and a Smith–Waterman implementation is shown in [26].
The kernel function for Euler’s method, as well as for the CNe method, can be found in Listing 1 (in 2 dimensions, supposing the GPU supports Khronos-style 64-bit floating-point arithmetic [27]):
Listing 1: Kernel of CNe and the 1st stage of CpC.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void calc_cell_2D(
    __global const double *before, __global double *after,
    double2 coeffs, uint2 size, long4 neighbours) {
  const size_t index = get_global_id(0) + size.x * get_global_id(1);
  __global const double *old_cell = before + index;
  after[index] = (1 - 2 * (coeffs.x + coeffs.y)) * *old_cell
      + coeffs.x *
          (old_cell[neighbours.even.x] +
           old_cell[neighbours.odd.x])
      + coeffs.y *
          (old_cell[neighbours.even.y] +
           old_cell[neighbours.odd.y]);
}
where
before and after are pointers to the arrays containing the data before and after the timestep;
coeffs is a vector containing the numerical coefficients of the computation;
size is a vector containing the [Nx, Ny] numbers of datapoints;
neighbours is a vector containing the pointer differences between the current datapoint and its neighbours (introduced to simplify the handling of boundary conditions).
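The paper does not reproduce the host-side argument setup, so the following C sketch is only an illustration of one plausible arrangement. In particular, the component layout of neighbours, (+1, -1, +Nx, -Nx), is an assumption chosen so that neighbours.even = (+1, +Nx) and neighbours.odd = (-1, -Nx) match the kernel’s even/odd accesses for the x and y directions; the coeffs values shown are the Euler ones, and the CNe variant would change only those values.

#include <CL/cl.h>

/* Possible host-side argument setup for the Listing 1 kernel (sketch).
   For Euler, coeffs = (D*dt/dx^2, D*dt/dy^2); CNe uses different values
   in the same slots. */
void set_stepper_args(cl_kernel kernel, cl_mem before, cl_mem after,
                      double D, double dt, double dx, double dy,
                      cl_uint nx, cl_uint ny)
{
    cl_double2 coeffs;
    coeffs.s[0] = D * dt / (dx * dx);
    coeffs.s[1] = D * dt / (dy * dy);

    cl_uint2 size = {{ nx, ny }};
    /* even components (+1, +Nx): "next" neighbours in x and y;
       odd components (-1, -Nx): "previous" neighbours. */
    cl_long4 neighbours = {{ +1, -1, +(cl_long)nx, -(cl_long)nx }};

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &before);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &after);
    clSetKernelArg(kernel, 2, sizeof(cl_double2), &coeffs);
    clSetKernelArg(kernel, 3, sizeof(cl_uint2), &size);
    clSetKernelArg(kernel, 4, sizeof(cl_long4), &neighbours);
}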
During each timestep, the calculation uses the values of the neighboring cells. To ensure that none of these neighboring values has been updated by a parallel cell computation (i.e., that the before-step value of every neighbor is used), two memory buffers are necessary [28]. During odd timestep computations, buffer 1 is read while buffer 2 is written to; during even timestep computations, buffer 2 is read while buffer 1 is written to (Figure 1).
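This buffer swapping can be sketched on the host side as follows. This is an illustration rather than the authors’ code: it assumes both buffers were pre-loaded with the initial state (so the never-written Dirichlet boundary cells keep their values in both buffers), that the remaining kernel arguments were set beforehand (e.g., as in the sketch above), and that the boundary cells are excluded via a global work offset.

#include <CL/cl.h>

/* Double-buffered ("ping-pong") time loop, sketching the scheme of
   Figure 1: odd steps read buffer 1 and write buffer 2, even steps the
   reverse. One work-item is launched per inner cell. */
void run_timesteps(cl_command_queue queue, cl_kernel kernel,
                   cl_mem buf1, cl_mem buf2,
                   size_t nx, size_t ny, cl_uint n_timesteps)
{
    cl_mem bufs[2] = { buf1, buf2 };
    size_t offset[2] = { 1, 1 };             /* skip boundary cells */
    size_t global[2] = { nx - 2, ny - 2 };   /* inner cells only */

    for (cl_uint n = 0; n < n_timesteps; ++n) {
        cl_mem before = bufs[n % 2];         /* first step reads buffer 1 */
        cl_mem after  = bufs[(n + 1) % 2];   /* ... and writes buffer 2 */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &before);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &after);
        clEnqueueNDRangeKernel(queue, kernel, 2, offset, global,
                               NULL, 0, NULL, NULL);
    }
    clFinish(queue);   /* wait for all steps before reading results back */
}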
This is an explicit buffer parallelization technique [29]. Its advantage is that the programmer controls when and how data are moved, optimizing the data transfers to minimize latency and maximize throughput. It also allows the use of device-specific features and capabilities, potentially enabling hardware-specific optimizations [30] that are not possible with a more abstracted approach. The first stage of the CpC method can use the same kernel as shown in Listing 1, but this algorithm needs a second-stage kernel as well (Listing 2):
Listing 2: Kernel of the 2nd stage of CpC.
__kernel void stage2p05_calc_cell_2D(
    __global const double *stage0,
    __global const double *stage1,
    __global double *after,
    double2 coeffs, uint2 size, long4 neighbours) {
  const size_t index = get_global_id(0) + size.x * get_global_id(1);
  __global const double *old_cell = stage0 + index;
  __global const double *midpoint = stage1 + index;
  after[index] = (1 - 2 * (coeffs.x + coeffs.y)) * *old_cell
      + coeffs.x *
          (midpoint[neighbours.even.x] +
           midpoint[neighbours.odd.x])
      + coeffs.y *
          (midpoint[neighbours.even.y] +
           midpoint[neighbours.odd.y]);
}
where
stage0 is a pointer to the array containing the data before stage 1;
stage1 is a pointer to the array containing the result of stage 1;
after is a pointer to the array containing the data after both stages;
coeffs is a vector containing the numerical coefficients of the computation;
size is a vector containing the [Nx, Ny] numbers of datapoints;
neighbours is a vector containing the pointer differences between the current datapoint and its neighbours (introduced to simplify the handling of boundary conditions).
The CpC method also requires two memory buffers, but they are used in a different way: the computation of each timestep uses both buffers. The first stage reads buffer 1 and writes its result to buffer 2. The second stage reads both buffer 1 (stage0) and buffer 2 (stage1) and stores its result in buffer 1 (after), as shown in Figure 2.
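Under these conventions, one CpC timestep could be enqueued from the host as in the following sketch (again an illustration, not the authors’ code). An in-order command queue guarantees that stage 2 only runs after stage 1 has finished; writing after into buffer 1 is safe because each work-item reads only its own pre-step cell from stage0 and only neighbor cells from stage1.

#include <CL/cl.h>

/* One CpC timestep, following the buffer usage of Figure 2: stage 1 reads
   buffer 1 and writes buffer 2; stage 2 reads both and writes buffer 1,
   so buffer 1 always holds the completed timestep. */
void cpc_timestep(cl_command_queue queue,
                  cl_kernel stage1_kernel,   /* Listing 1, half-step coeffs */
                  cl_kernel stage2_kernel,   /* Listing 2 */
                  cl_mem buf1, cl_mem buf2,
                  size_t nx, size_t ny)
{
    size_t offset[2] = { 1, 1 };
    size_t global[2] = { nx - 2, ny - 2 };

    /* Stage 1: predictor at the fractional timestep. */
    clSetKernelArg(stage1_kernel, 0, sizeof(cl_mem), &buf1);   /* before */
    clSetKernelArg(stage1_kernel, 1, sizeof(cl_mem), &buf2);   /* after  */
    clEnqueueNDRangeKernel(queue, stage1_kernel, 2, offset, global,
                           NULL, 0, NULL, NULL);

    /* Stage 2: corrector; overwrites each cell in buffer 1 in place. */
    clSetKernelArg(stage2_kernel, 0, sizeof(cl_mem), &buf1);   /* stage0 */
    clSetKernelArg(stage2_kernel, 1, sizeof(cl_mem), &buf2);   /* stage1 */
    clSetKernelArg(stage2_kernel, 2, sizeof(cl_mem), &buf1);   /* after  */
    clEnqueueNDRangeKernel(queue, stage2_kernel, 2, offset, global,
                           NULL, 0, NULL, NULL);
}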
2.3. Initial and Boundary Conditions, Scale Parameters, and the Method of Investigation
In this study, fixed Dirichlet boundaries [31] are assumed, meaning that the initial value of the boundaries is kept constant during the calculation and no calculation is performed on these boundaries, as follows:

$$f_0^{n} = f_0^{0}, \qquad f_{N_x - 1}^{n} = f_{N_x - 1}^{0} \qquad \text{for all } n \tag{12}$$

in 1 dimension, and

$$f_{0,j}^{n} = f_{0,j}^{0}, \qquad f_{N_x - 1,j}^{n} = f_{N_x - 1,j}^{0}, \qquad f_{i,0}^{n} = f_{i,0}^{0}, \qquad f_{i,N_y - 1}^{n} = f_{i,N_y - 1}^{0} \tag{13}$$

in 2 dimensions.
To deal with the initial values, four different problems are created using two initial condition functions for both the 1- and 2-dimensional measurements. For one type of initial function, an analytical solution exists and is used as a reference. The other problem does not have an analytical solution; hence, a numerical reference solution is used instead. This reference solution is generated using the built-in ode15s function of MATLAB [32]. These initial value functions are the following (assuming Xinit = 0, Yinit = 0, and Tinit = 0):
A 1D sine with nodes at the boundaries is

$$f(x, 0) = A \sin\!\left( \frac{k \pi x}{X_{\mathrm{fin}}} \right),$$

where the amplitude A and the positive integer wavenumber k are arbitrarily chosen parameters, and the analytical solution is

$$f(x, t) = A \sin\!\left( \frac{k \pi x}{X_{\mathrm{fin}}} \right) \exp\!\left( -D \left( \frac{k \pi}{X_{\mathrm{fin}}} \right)^{2} t \right).$$
A 1D Gaussian function [33] is

$$f(x, 0) = A \exp\!\left( -\frac{(x - x_0)^2}{2 \sigma^2} \right),$$

where the amplitude A, the center x₀, and the width σ are arbitrarily chosen parameters. In this case, a numerical reference solution is used.
In 2D, the product of sinewaves for each dimension with nodes at the boundaries is

$$f(x, y, 0) = A \sin\!\left( \frac{k \pi x}{X_{\mathrm{fin}}} \right) \sin\!\left( \frac{l \pi y}{Y_{\mathrm{fin}}} \right),$$

where the amplitude A and the positive integer wavenumbers k and l are arbitrarily chosen parameters, and the analytical solution is

$$f(x, y, t) = A \sin\!\left( \frac{k \pi x}{X_{\mathrm{fin}}} \right) \sin\!\left( \frac{l \pi y}{Y_{\mathrm{fin}}} \right) \exp\!\left( -D \left[ \left( \frac{k \pi}{X_{\mathrm{fin}}} \right)^{2} + \left( \frac{l \pi}{Y_{\mathrm{fin}}} \right)^{2} \right] t \right).$$
A 2D planar wave propagating at an angle [34] is

$$f(x, y, 0) = A \sin\!\left( k_x x + k_y y \right),$$

where the amplitude A and the wave vector components k_x and k_y (which set the propagation angle) are arbitrarily chosen parameters. In this case, a numerical reference solution is used.
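As an illustration of how such an initial state can be produced on the host before being uploaded to the device, the following C sketch fills a 1D array with the sine initial condition above. The function name and the parameters A and k are illustrative stand-ins for the arbitrarily chosen parameters, not taken from the paper.

#include <math.h>

/* Fill the 1D sine initial condition f(x,0) = A*sin(k*pi*x/X_fin) on the
   grid x_i = i*dx (assuming X_init = 0). */
void init_sine_1d(double *f, int nx, double dx, double x_fin,
                  double A, int k)
{
    for (int i = 0; i < nx; ++i)
        f[i] = A * sin(k * M_PI * i * dx / x_fin);
    f[0] = 0.0;        /* enforce exact nodes at the fixed */
    f[nx - 1] = 0.0;   /* Dirichlet boundaries             */
}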
The timescales and data grids used together with these functions are listed in Table 1, Table 2, Table 3 and Table 4. During the measurement, each row of the timescale tables is combined with each row of the appropriate spatial mesh–grid table (i.e., the rows of Table 1 with the rows of Table 2, and the rows of Table 3 with the rows of Table 4) and one of the 2 appropriate initial functions. For each combination, the measurement is repeated 4 times, and the mean value along with the standard deviation of the CPU time is calculated. The maximal absolute value of the difference between the output data points and the reference solution values (mathematically speaking, the l∞ distance of the output vector and the reference vector) is considered the error of the algorithm, as follows:

$$\mathrm{Err} = \max_{i} \left| f_i^{\mathrm{out}} - f_i^{\mathrm{ref}} \right|$$

for 1-dimensional calculations and

$$\mathrm{Err} = \max_{i,j} \left| f_{i,j}^{\mathrm{out}} - f_{i,j}^{\mathrm{ref}} \right|$$

for 2-dimensional calculations. (This value is expected to be the same for all 4 repetitions, since the examined algorithms are deterministic.)
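For reference, the following C sketch computes this error measure; since the data are stored contiguously, the same function serves the 1- and 2-dimensional cases with n = Nx and n = Nx · Ny, respectively (the function and variable names are illustrative).

#include <stddef.h>
#include <math.h>

/* l-infinity distance between the computed output and the reference
   solution, i.e., the maximal absolute pointwise difference. */
double linf_error(const double *out, const double *ref, size_t n)
{
    double err = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = fabs(out[i] - ref[i]);
        if (d > err)
            err = d;
    }
    return err;
}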