### *2.1. FDTD Algorithm*

The FDTD algorithm used to calculate lightning electromagnetic fields consists of five main modules [41,42], as shown in the flowchart of Figure 3. First, all variables and parameters are set and defined in the initialization step, followed by the allocation of the memory required to store each component of the electromagnetic field. In each time step, the lightning channel source model excites the computational space; the electric (*Ex*, *Ey*, *Ez*) and magnetic (*Hx*, *Hy*, *Hz*) field components are then computed and updated, and boundary conditions are applied to avoid unwanted reflections of the electromagnetic field. These steps are repeated until the desired simulation time period is covered. The output of the FDTD algorithm is the electric and magnetic fields at each observation point within the specified time period; a minimal sketch of this loop structure is given after Figure 3.

**Figure 3.** Finite difference time domain (FDTD) algorithm flowchart.
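As a hedged illustration, the C skeleton below mirrors the five modules of the flowchart; all function names, the grid size, and the step count are hypothetical placeholders rather than the authors' implementation.

```c
/* Minimal sketch of the FDTD main loop of Figure 3; function names,
 * grid size, and step count are hypothetical placeholders. */
#include <stdlib.h>

#define NX 300
#define NY 300
#define NZ 300
#define NSTEPS 2000

static float *Ex, *Ey, *Ez, *Hx, *Hy, *Hz;          /* field components */

static void excite_channel(int step) { (void)step; } /* source model stub */
static void update_E(void) { }                       /* Ex, Ey, Ez update */
static void update_H(void) { }                       /* Hx, Hy, Hz update */
static void apply_boundary(void) { }                 /* absorbing boundary */

int main(void) {
    /* Initialization: allocate and zero the six field arrays. */
    size_t n = (size_t)NX * NY * NZ;
    Ex = calloc(n, sizeof *Ex); Ey = calloc(n, sizeof *Ey);
    Ez = calloc(n, sizeof *Ez); Hx = calloc(n, sizeof *Hx);
    Hy = calloc(n, sizeof *Hy); Hz = calloc(n, sizeof *Hz);

    /* Time stepping: excite the channel, update fields, apply boundaries. */
    for (int step = 0; step < NSTEPS; step++) {
        excite_channel(step);
        update_E();
        update_H();
        apply_boundary();
    }
    return 0;
}
```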

### *2.2. Hardware and Software Used*

For the GPU implementation of the adopted FDTD approach, a GeForce GTX 1050 Ti card and the PGI Community Edition 17.10 software were used. The PGI compiler was developed by the Portland Group and NVIDIA [43], and the PGI Community Edition includes a free license for the PGI C/C++ compilers and tools for multicore CPUs and NVIDIA Tesla GPUs. The Community Edition enables the development of performance-portable high-performance computing (HPC) applications with a uniform source code across the most widely used parallel processors and systems [44]. Details of the GPU are presented in Table 1. For comparison purposes, the calculations were also repeated on a CPU, namely an Intel® Core™ i7-6800K @ 3.40 GHz.



### *2.3. OpenACC Programming*

This section provides more elaborate details about OpenACC programming, as well as the compiler settings used to build and execute the code on the GPU.
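For reference, an OpenACC-annotated C source can be built with the PGI 17.10 toolchain using a command of the following form; the source file name is a placeholder, and the exact flags used by the authors are not reported:

```
pgcc -acc -ta=tesla -Minfo=accel -O2 fdtd.c -o fdtd
```

Here, -acc enables the OpenACC directives, -ta=tesla targets NVIDIA GPUs, and -Minfo=accel makes the compiler report how each loop was mapped onto the accelerator.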

### 2.3.1. OpenACC Data Clause

The data required for the computation are transferred to GPU memory before the parallel region in the code, and the appropriate OpenACC directives are then inserted into the main code for each loop. Since it is burdensome for a compiler to determine whether data will be needed again in the future, data are copied conservatively in case of future demand. Expressing data locality solves this problem, keeping the data in the local memory of the device for as long as they are needed. Data clauses enable the programmer to control how and when data are created on, copied to, and copied from the device. The main data clauses are briefly described as follows:

• copy: allocates space on the device, copies the data from the host at region entry, and copies them back to the host at region exit;

• copyin: allocates space on the device and copies the data from the host at region entry only;

• copyout: allocates space on the device and copies the data back to the host at region exit only;

• create: allocates space on the device without any copy between the host and the device.


In the FDTD calculations, the start and end points of the data region were positioned before the time loop (iteration) and after all modules and functions, respectively. As shown in Figure 4, the starting point of the data region in the FDTD algorithm was placed before the time loop so that the data needed for the computation were moved to GPU memory by the copy clause. The ending point was placed after the execution of all modules and functions, so that the generated output data were returned to CPU memory by the copyout clause. This arrangement avoids extra, inefficient data transfers during the calculation; the data transfer time between the GPU and the CPU is thus reduced, which remarkably reduces the overall computation time. A sketch of this placement is given after Figure 4.

**Figure 4.** Optimized data locality.
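A minimal sketch of this placement, assuming the six field arrays are already allocated on the host, is given below; the names and sizes are illustrative rather than the authors' code.

```c
/* Sketch of the data-region placement of Figure 4 (hypothetical names):
 * the six field arrays stay resident in GPU memory for the whole time
 * loop; one host-to-device copy at entry, one device-to-host at exit. */
#define N (300 * 300 * 300)
extern float Ex[], Ey[], Ez[], Hx[], Hy[], Hz[];

void fdtd_run(int nsteps)
{
    #pragma acc data copy(Ex[0:N], Ey[0:N], Ez[0:N], \
                          Hx[0:N], Hy[0:N], Hz[0:N])
    {
        for (int step = 0; step < nsteps; step++) {
            /* excite_channel(step); update_E(); update_H();
             * apply_boundary(); -- all operate on device-resident data */
        }
    }   /* results copied back to CPU memory here, once */
}
```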

### 2.3.2. Parallelizing Loops with OpenACC

Parallel and kernel regions are two different approaches to incorporate parallelism into the code provided by OpenACC:

• The kernels construct: The kernels directive marks a code region that may contain parallelism, and relies on the compiler's automatic parallelization capabilities to analyze the region, identify the loops that are safe to parallelize, and accelerate those loops. The compiler is free to select the best way of mapping the parallelism in the loops onto the hardware [38,45];

• The parallel construct: When this directive is placed on a loop, the programmer asserts that the affected loop is safe to parallelize and allows the compiler to choose how to schedule the loop iterations on the target accelerator. In this case, the programmer determines the availability of parallelism, whereas the decision on how to map that parallelism onto the accelerator depends on the compiler's knowledge of the device [38,45]. A sketch contrasting the two constructs follows this list.
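As a hedged illustration, the fragment below applies both constructs to an FDTD-style field-update loop nest; the array names, loop bounds, and update coefficient are placeholders rather than the authors' code.

```c
/* Two ways to offload an FDTD-style update loop (illustrative). */
#define NX 100
#define NY 100
#define NZ 100
extern float Ez[NX][NY][NZ], Hx[NX][NY][NZ], Hy[NX][NY][NZ];

void update_Ez_kernels(float c)
{
    /* kernels: the compiler analyzes the region and decides which
     * loops are safe to parallelize and how to map them. */
    #pragma acc kernels loop
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            for (int k = 1; k < NZ - 1; k++)
                Ez[i][j][k] += c * (Hy[i][j][k] - Hy[i-1][j][k]
                                  - Hx[i][j][k] + Hx[i][j-1][k]);
}

void update_Ez_parallel(float c)
{
    /* parallel loop: the programmer asserts the loop nest is safe to
     * parallelize; the compiler only schedules the iterations. */
    #pragma acc parallel loop collapse(3)
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            for (int k = 1; k < NZ - 1; k++)
                Ez[i][j][k] += c * (Hy[i][j][k] - Hy[i-1][j][k]
                                  - Hx[i][j][k] + Hx[i][j-1][k]);
}
```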

The OpenACC parallelization directives were applied to different parts of the code to implement parallel computing. For example, as can be seen in Figure 5, a #pragma acc kernels loop directive is placed at the beginning of each of the three loop nests that compute and update the electromagnetic field components. Table 2 presents the computation times associated with the two directives. According to Table 2, the C on CPU runtime (without optimization) is 2231.59 s, which is the benchmark time. Optimizing the C program on the CPU increased the speed by a factor of 2.43. Moreover, as can be observed in the table, the computing speed on the GPU increased by factors of 10.49 and 11.14 (runtimes of roughly 213 s and 200 s, respectively) using the #pragma acc parallel loop and #pragma acc kernels loop directives, respectively.

**Figure 5.** Mapping parallelism onto the hardware.


**Table 2.** Comparison of OpenACC directives.

### 2.3.3. Variable Storage Precision

As shown in Figure 1, a GPU is organized as a set of multiprocessors, each with its own stream processors and memory. The stream processors are fully capable of executing integer and single-precision floating-point arithmetic, while additional cores are devoted to double-precision operations [16]. The results in Table 3 reveal that the GPU computation time is reduced when the variables are of the float type. Specifically, storing the variables in single precision on the GPU increases the computing speed by up to 20.02 times relative to storing them in double precision on the serial CPU. A simple way to switch the precision of the whole code is illustrated after Table 3.

**Table 3.** Increase in computational speed as a function of the precision of the stored variables.
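One common way to switch an entire code between the two precisions, shown here purely as an illustration, is a single typedef selected at compile time:

```c
/* Illustrative precision switch: compile with -DUSE_DOUBLE to store
 * the fields in double precision; single precision is the default. */
#ifdef USE_DOUBLE
typedef double real;
#else
typedef float real;
#endif

/* All field arrays then use the chosen type, e.g.: */
extern real Ez[100][100][100];
```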


### 2.3.4. Optimization by Tuning Number of Vectors

The OpenACC execution model comprises three levels of parallelism: the gang (thread block), the worker (warp), and the vector (threads within a warp) [38].

Gangs, workers, and vectors can be set automatically by the OpenACC compiler or specified by the developer in search of the optimum configuration. Here, the number of vectors was modified in the OpenACC directives to reach the maximum occupancy and speed-up on the GPU. Tuning the vector_length clause by trying different vector lengths for the accelerator can help reach the optimal runtime, and this approach was used in this study. Figure 6 shows the runtime obtained by varying the vector length, relative to the compiler-selected default. The best performance was obtained for a vector length of 1024; the corresponding directive usage is sketched after Figure 6.

**Figure 6.** Relative runtime obtained by varying the vector length from the default value.
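In OpenACC, this tuning reduces to a single clause on the parallel construct, as in the illustrative fragment below (the arrays and loop body are placeholders):

```c
/* Illustrative vector-length tuning: the clause below fixes the vector
 * length at 1024, the value that gave the best runtime in this study. */
#define N 1000000
extern float a[], b[];

void update_tuned(void)
{
    #pragma acc parallel loop gang vector vector_length(1024)
    for (int i = 0; i < N; i++)
        a[i] += b[i];        /* placeholder field update */
}
```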

### **3. Application: Models and Hypotheses Considered for Analysis**

The geometry of the problem is shown in Figure 7. The aim is to evaluate lightning electromagnetic fields over mountainous terrain by making use of the FDTD technique. We assume a pyramidal mountain of height Hm, which is varied from 0 (flat ground case) to 1000 m, while its cross-section is assumed to have an area of A = 40,000 m<sup>2</sup>. The ground and the mountain are assumed to be perfect electric conductors (PEC), and the ground depth is hm = 100 m. The vertical electric field and the radial magnetic field components are calculated at three observation points A, B, and C, located 100, 400, and 500 m, respectively, from the lightning channel base and all on the ground surface. In the FDTD calculations, a vertical array of current sources was used to model the lightning channel [46]. The modified transmission line model with exponential decay (MTLE) [47,48] was used to model the return-stroke channel. The subsequent-stroke current waveform was represented by a sum of two Heidler functions [49], given in Equation (1):

$$i(t) = \frac{I_{01}}{\eta_1}\,\frac{\left(t/\tau_{11}\right)^{n_1}}{1+\left(t/\tau_{11}\right)^{n_1}}\,e^{-t/\tau_{21}} + \frac{I_{02}}{\eta_2}\,\frac{\left(t/\tau_{12}\right)^{n_2}}{1+\left(t/\tau_{12}\right)^{n_2}}\,e^{-t/\tau_{22}} \tag{1}$$

where I01 = 10.7 kA, τ11 = 0.25 μs, τ21 = 2.5 μs, n1 = 2, η1 = 0.639, I02 = 6.5 kA, τ12 = 2.1 μs, τ22 = 230 μs, n2 = 2, and η2 = 0.867 [49]. The current decay constant and the return-stroke speed are taken as λ = 2000 m and v = 1.5 × 10<sup>8</sup> m/s, respectively. The time increment was set to 9.5 ns. The computational domain measures 1500 × 1500 × 1500 m and is discretized with a spatial mesh of cubic 5 × 5 × 5 m cells. When the electromagnetic response of a structure is analyzed in an unbounded space, absorbing boundary conditions that suppress unwanted reflections must be applied to the planes that truncate the open space and limit the working volume. To avoid reflections at the domain boundaries, Liao's condition was adopted as the absorbing boundary condition [50]. The horizontal magnetic field (Hy) components at the observation points A, B, and C were calculated assuming different heights for the mountain; the results are presented in Figure 8.
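For reference, Equation (1) with the above parameters can be evaluated as in the following C sketch; the function names are placeholders:

```c
/* Sketch: subsequent-stroke channel-base current from Equation (1),
 * a sum of two Heidler functions with the parameters given above. */
#include <math.h>

static double heidler(double t, double I0, double eta,
                      double tau1, double tau2, double n)
{
    double x = pow(t / tau1, n);
    return (I0 / eta) * (x / (1.0 + x)) * exp(-t / tau2);
}

/* Channel-base current i(t); t in seconds, result in amperes. */
double channel_base_current(double t)
{
    if (t <= 0.0)
        return 0.0;
    return heidler(t, 10.7e3, 0.639, 0.25e-6, 2.5e-6, 2.0)
         + heidler(t, 6.5e3, 0.867, 2.1e-6, 230e-6, 2.0);
}
```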

According to Figure 8, the time delay, peak value, waveshape, and amplitude of the horizontal magnetic field (Hy) are affected by the presence of the mountain. Although the effect of the mountain on the peak field and onset time was negligible for point A, its effect on points B and C (on the right side of the hill) was significant [51]. The results presented in Figure 8 were validated for canonical structures against the simulation results obtained in [51]; the only differences are that this study used a GPU and a 1500 × 1500 × 1500 m domain, whereas a CPU processor was used in [51]. For validation purposes, Figure 9 compares the obtained results with those obtained using a 2D-FDTD method.

**Figure 7.** The simulation domain of the 3D-FDTD method and geometry for the lightning electromagnetic fields over mountainous terrain: (**a**) three-dimensional perspective, (**b**) side view, (**c**) top view.

**Figure 8.** Horizontal magnetic field (Hy) associated with a subsequent return stroke at the three considered observation points A (**a**), B (**b**), and C (**c**), for different mountain heights (Hm = 0, 200, 500, and 1000 m).

**Figure 9.** Horizontal magnetic field (Hy) on a smooth ideal ground at observation point A. The height of the mountain is set at Hm = 0 m. Results were calculated using the 2D-FDTD (solid line) and the 3D-FDTD (dashed line) methods.
