## 2.2.3. QR Decomposition

LU decomposition is more commonly used to solve power flow problems, but QR decomposition can be adopted as an alternative when leveraging GPU-based parallel processing, because LU decomposition without pivoting may suffer from significant numerical instability, as described in [27]. Although QR decomposition has greater computational complexity than LU decomposition, some studies have found it more suitable for GPU-based parallel computing owing to its high numerical stability even without pivoting [28].

In linear algebra, a QR decomposition of a complex *m* × *n* matrix *A* is represented as follows:

$$A = Q \cdot R \tag{11}$$

where *Q* and *R* indicate an *m* × *m* unitary matrix and an *m* × *n* upper triangular matrix, respectively. Several methods for implementing the QR decomposition are available, such as the Gram–Schmidt process, Householder transformations, and Givens rotations [29].
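As a concrete illustration of Equation (11), the Householder variant can be sketched in NumPy. This is a dense, serial sketch for intuition only, and the matrix values are arbitrary; GPU implementations such as the one in [28] operate on sparse matrices with far more elaborate task scheduling.

```python
import numpy as np

def householder_qr(A):
    """QR decomposition via Householder reflections (Eq. 11): A = Q @ R.

    Dense, serial sketch for illustration; not a production solver.
    """
    m, n = A.shape
    Q = np.eye(m)
    R = A.astype(float).copy()
    for k in range(min(m, n)):
        x = R[k:, k]
        # Householder vector that maps x onto a multiple of e_1.
        v = x.copy()
        v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
        nv = np.linalg.norm(v)
        if nv == 0:
            continue
        v /= nv
        # Apply the reflection I - 2 v v^T to the trailing blocks.
        R[k:, :] -= 2.0 * np.outer(v, v @ R[k:, :])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R

# Arbitrary 3 x 2 example matrix.
A = np.array([[2.0, 1.0], [1.0, 3.0], [0.0, 1.0]])
Q, R = householder_qr(A)
assert np.allclose(Q @ R, A)              # reconstruction, Eq. (11)
assert np.allclose(Q.T @ Q, np.eye(3))    # Q is orthogonal (unitary for real A)
assert np.allclose(np.tril(R, -1), 0.0)   # R is upper triangular
```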

## 2.2.4. Conjugate Gradient Method

The conjugate gradient (CG) method is designed to find numerical solutions of linear systems whose matrix is symmetric and positive definite. It is typically implemented as an iterative algorithm for analyzing sparse power systems that are too large to be handled by direct methods, such as Cholesky decomposition.

The CG method can be employed instead of LU decomposition in a PF study, and its use has been investigated extensively. In [30], a PF study based on the CG method was introduced. In [31], a PF study using the CG method was performed on GPUs. In [32], a GPU-based PF study was conducted using a combination of preconditioners and a CG solver. Preconditioners are widely used for matrix handling in general, but it is difficult to secure parallelism in a general preconditioner; therefore, a Chebyshev preconditioner is used in [32] to secure the parallelism of the preconditioning step.
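A minimal, unpreconditioned sketch of the CG iteration in NumPy illustrates why the method suits large sparse systems: each iteration requires only one matrix-vector product. The test matrix below is an arbitrary SPD example; the cited GPU studies add preconditioning and sparse storage on top of this skeleton.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A.

    Textbook CG iteration, without the preconditioning used in [32].
    """
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p           # the only matrix-vector product per iteration
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Arbitrary SPD test system.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
assert np.allclose(A @ x, b)
```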

#### **3. GPU Application Trend in Power Flow Computation**

#### *3.1. Conventional Parallel Processing in PF Studies*

Recent studies applying GPUs to PF analysis have usually built on existing parallel processing research; therefore, a review of that research is necessary to understand how GPUs are applied in PF studies. As mentioned in Section 2.2.1, PF computation using the NR method performs matrix handling during computation. In this process, a sparse linear system usually must be solved for power system analyses such as static security and transient stability analysis. Most power system applications have conventionally employed direct LU decomposition to solve such linear equations. Since this decomposition takes a large amount of time, numerous studies have examined how to perform it more quickly. Fast computation requires considerable computational resources, and parallel computing is an excellent way to secure them; hence, many attempts have been made to perform large amounts of power flow computation quickly by connecting multiple computational resources, such as computers. Theoretical attempts at parallel processing in PF studies are described in [33,34]. In the 1990s, research into the parallel processing of PF computation began in earnest. The fundamental concepts of applying parallel processing to PF computation are presented in [22,26], and a number of further studies have been performed since; most focused on matrix handling to divide the computation among the computational resources [23–25,35,36]. G. A. Ezhilarasi et al. [37] presented an effective method for performing parallel contingency analysis using high-performance computing. Also, Z. Huang [38] proposed a dynamic computational load balancing method that utilizes multiple high-performance computing machines to deliver a large number of contingency computations.

It is also essential to employ a massively large number of cores to accelerate the computation, and there have been many endeavors to enhance parallel computing performance by adopting multiple processing machines in PF computation. In [39], a commercial solver was utilized in a massively parallel computing environment to carry out the contingency analysis of a power system; to decrease the calculation time significantly, 4127 contingencies were spread across 4127 cores and computed simultaneously. I. Konstantelos et al. [40] presented a computational platform, as part of the iTesla project, for evaluating the dynamic security of an extensive power system using commercial analysis software on 10,000 cores of Intel Xeon E5-2680 processors. This project utilizes a large number of mutually interconnected CPU cores, 80,640 in total, but the maximum number of cores used simultaneously is 10,000.

#### *3.2. Advantages of GPU-Based PF Computation*

Analysis of a power system requires a large number of PF computations, and parallel processing is one method for reducing the computational time. Although the goal is fast calculation using many computational resources, this approach has obvious disadvantages in terms of the time and cost of establishing a parallel processing environment by connecting a large number of machines. Therefore, the use of GPUs, which can process many computations at once, is an attractive option for PF computation, as it overcomes the shortcomings related to physical connections between machines [41].

The LU decomposition mentioned in Section 2.2.2 is a very time-consuming task in PF studies; thus, it is a good candidate for parallel processing with GPUs. LU decomposition on GPUs has also been studied outside the field of PF, as it is used in various fields that require a large amount of computation. For example, parallel processes for solving sparse matrices in circuit simulation have been studied in [42–44], and L. Ren et al. [45] presented results that could overcome the difficulty of super-node formation for extremely sparse matrices.

Many studies have recently been conducted to reduce the computation time of LU decomposition through the use of GPUs. Whereas matrix handling operations were previously distributed across physically separate machines, recent studies have focused on how to assign these operations to GPUs for efficient calculation. Compared with physically separated computational resources, a GPU inside the same machine can exchange information at very high speed, which is a significant advantage in reducing computational time relative to past parallel processing studies. The following chapters summarize the trend of GPU-applied PF studies.

#### *3.3. GPU Application Trends in PF Studies*

GPU-based acceleration techniques are highly productive methods for significantly enhancing the performance of power flow computations, as the computational time of a sparse linear system (SLS) solution can be dramatically reduced by utilizing GPUs. Research broadly ranges from investigating conventional direct LU decomposition to using computational solvers other than LU. In [31], the basics of GPU-based parallel computing were explicitly presented, and an efficient method of using GPUs was proposed in [32]. Other attempts at GPU-accelerated SLS solution using LU decomposition appear in [46–48]. Also, in [49,50], implementations of parallel PF computation using GPUs were described in detail. Although these studies use GPUs in different ways, they share the aim of applying the GPU to the large amount of computation required for analyzing large-scale systems [51–53]. S. Hung et al. [54] showed a typical use of a GPU to solve the PF problem: they described a GPU-accelerated FDPF method and a direct linear solver using LU decomposition in detail, and summarized simulation results obtained with Matlab. G. Zhou et al. [55] described an LU decomposition solver that packages a large number of LU decomposition tasks; its speed was up to 76 times greater than that of the KLU library, a solver for sparse linear systems [56].

In [32], a GPU-based Chebyshev preconditioner integrated with a GPU-based conjugate gradient solver was proposed to speed up the SLS solution. Also, X. Li et al. [57] combined a GPU with a multifrontal method to solve sparse linear equations in power system analysis; the multifrontal method converts the factorization of a sparse matrix into a series of dense-matrix operations. Since this process requires significant computation, a reduction of the computation time using GPUs was proposed.

In addition to the typical approach based on the NR method, which requires LU decomposition, GPUs have in some cases been combined with other methods. In [50], PF computation was studied by combining GPUs with the FDPF method.

In that paper, the authors proposed a new approach using Matpower, a Matlab toolbox, to solve the FDPF problem by combining an iterative linear solver, namely a preconditioned conjugate gradient method, with an inexact Newton method, as an alternative to typical LU decomposition [58,59].

The GPU-based FDPF method proposed in [50] exhibited a speedup of up to 2.86 times on systems of over 10,000 buses compared with the CPU-based FDPF method. S. Hung et al. [60] proposed a fast batched solution for real-time optimal power flow (RTOPF) considering renewable energy penetration. A primal-dual interior point method comprising three kernels was used, and several test power systems were simulated on four platforms (regular CPU, parallel CPU, regular GPU, and batched GPU). The batched GPU method obtained good results, with a speedup of up to 56.23 compared with a regular CPU. Moreover, an unconventional study used GPUs for the computation of probabilistic power flow (PPF) rather than general PF [61]. PPF can be obtained by using Monte Carlo simulation with simple random sampling (MCS-SRS), and in [61] a GPU-accelerated algorithm was proposed that could handle large-scale MCS-SRS for PPF computation. There are also some studies focusing on CPU-GPU hybrid systems [62,63]. In [62], an advanced PF computation technique based on a CPU-GPU hybrid system, vectorized parallelization, and sparse techniques was proposed. In [63], a method for performing simultaneous parallel PF computation using a hybrid CPU-GPU system was proposed.

#### **4. Typical GPU-Based PF Studies**

#### *4.1. A Study Based on LU Decomposition*

In this section, we describe some representative studies that apply GPU-based parallel computing to PF analysis. The most time-consuming part of PF computation using the NR method is the LU decomposition component for matrix handling. Therefore, many studies have been conducted on fast parallel processing of LU decomposition parts using GPUs; this chapter summarizes the LU decomposition parallel processing elements of several related papers.

A GPU-based sparse solver that does not use dense kernels was proposed in [64]. The primary purpose of this study was the application of the parallel processing approach for circuit simulation, so it was not closely related to PF research. However, we mention this study here to explain the parallel processing of LU decomposition using a GPU.

LU decomposition is commonly performed with pivoting, i.e., with a proper permutation of the matrix. In partial pivoting, either the rows or the columns are permuted, and row permutation is generally performed. This can be expressed using the following formula.

$$P \cdot A = L \cdot U \tag{12}$$

In the case of full pivoting, both row and column pivoting are performed, as expressed by the following formula.

$$P \cdot A \cdot Q = L \cdot U \tag{13}$$

where *P* and *Q* represent permutation matrices, *L* and *U* denote the lower and upper triangular matrices, respectively, and *A* is the input matrix to be decomposed.
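Equation (12) can be illustrated with a short NumPy sketch of row-pivoted Gaussian elimination. The example matrix is hypothetical and deliberately has a zero in position (1,1), so LU without pivoting would fail at the very first step; this is the instability issue that motivates pivoting.

```python
import numpy as np

def lu_partial_pivoting(A):
    """LU with partial (row) pivoting, Eq. (12): P A = L U.

    Small dense sketch; sparse solvers add fill-reducing orderings.
    """
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    P = np.eye(n)
    for k in range(n - 1):
        # Choose the largest entry in column k as the pivot (row exchange).
        p = k + np.argmax(np.abs(U[k:, k]))
        if p != k:
            U[[k, p], :] = U[[p, k], :]
            P[[k, p], :] = P[[p, k], :]
            L[[k, p], :k] = L[[p, k], :k]
        # Eliminate the entries below the pivot.
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])
    return P, L, U

# Hypothetical matrix whose (1,1) entry is zero: naive LU would divide by 0.
A = np.array([[0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 3.0]])
P, L, U = lu_partial_pivoting(A)
assert np.allclose(P @ A, L @ U)          # Eq. (12)
assert np.allclose(np.triu(L, 1), 0.0)    # L is unit lower triangular
assert np.allclose(np.tril(U, -1), 0.0)   # U is upper triangular
```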

However, this study found that LU decomposition without pivoting is more favorable for parallel processing with GPUs; hence, a GPU-based sparse solver based on LU decomposition without pivoting was proposed. Unlike conventional direct solvers, the proposed GPU solver does not use dense kernels, so it can be used for fast matrix calculation. The framework of the proposed sparse solver is shown in Figure 3. The proposed method distributes tasks appropriately between CPUs and GPUs, and the row/column reordering performed for pivoting is replaced by a single pre-processing step on the CPU to reduce the computation time. Most of the iterations of the actual factorization take place on the GPU, which receives the CPU's initial calculations; a GPU-based left-looking algorithm, summarized in Algorithm 1, is used to factorize the matrix. The result obtained through iteration on the GPU is read back by the CPU, and the triangular equations are then solved. The proposed approach achieves, on average, a 7.02× speedup compared with sequential PARDISO. The simulation results were obtained using 16 cores of an Intel E5-2690 CPU and a single NVIDIA GTX580 GPU.

**Figure 3.** Framework of the sparse solver in [64].

#### **Algorithm 1** left-looking LU factorization (GPU-based) [45,55,64].

**INPUT** *A*: an *n* × *n* input matrix to be LU factorized

**OUTPUT** *L*, *U*: lower and upper triangular matrices factorized from *A*

```
1: while there still exist executable columns do in parallel
       // e.g., GPU thread j handles the j-th column.
2:     &c = A(:, j);   // c references the j-th column of A
       // Renew the j-th column with all its dependent columns on the left-hand side.
3:     for i = 1 : j − 1 where U(i, j) is not equal to zero do
4:         c(i + 1 : n) −= c(i) × L(i + 1 : n, i);
5:     end for
6:     Normalize the j-th column;
7: end while
```
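To make the column dependencies of Algorithm 1 concrete, here is a dense, serial Python sketch of left-looking LU without pivoting. It is an illustration under simplifying assumptions (dense storage, no GPU threads, and a diagonally dominant test matrix that is safely factorizable without pivoting), not the sparse CUDA kernel of [64].

```python
import numpy as np

def left_looking_lu(A):
    """Serial sketch of Algorithm 1: left-looking LU without pivoting.

    Columns are processed left to right; in [64] the column updates run
    as parallel GPU threads on a sparse matrix, which this sketch omits.
    """
    n = A.shape[0]
    L = np.eye(n)
    U = np.zeros((n, n))
    for j in range(n):
        c = A[:, j].astype(float).copy()
        # Renew column j with every dependent column i < j (lines 3-5 of Alg. 1).
        for i in range(j):
            # After the prior updates, c[i] equals U(i, j); a sparse code
            # would skip the structurally zero entries here.
            if c[i] != 0.0:
                c[i + 1:] -= c[i] * L[i + 1:, i]
        U[: j + 1, j] = c[: j + 1]
        # "Normalize the j-th column": divide the sub-diagonal by the pivot.
        L[j + 1:, j] = c[j + 1:] / c[j]
    return L, U

# Diagonally dominant test matrix (no pivoting needed).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
L, U = left_looking_lu(A)
assert np.allclose(L @ U, A)
assert np.allclose(np.triu(L, 1), 0.0)
assert np.allclose(np.tril(U, -1), 0.0)
```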
In [49], GPU-accelerated GS and NR methods for PF studies were analyzed. As discussed earlier, this review focuses on the acceleration of matrix handling with GPUs, so the use of the GS method is beyond the scope of this work. However, the study is valuable because it reports the time required for each computational step of the NR method; hence, the LU decomposition of the NR method is briefly summarized here. Some preliminary work on data layout, software composition, and matrix representation for parallel processing with GPUs was also presented, but it is not discussed in detail here.

Table 1 summarizes the four computational steps that consume the most execution time in an NR implementation. Both building the admittance matrix and obtaining the apparent power and line losses are unrelated to matrix handling, and the use of the GS method does not introduce any differences there. The part to be examined carefully is the solution of the sparse linear system that produces and solves the Jacobian matrix. This study demonstrated that some of the iterative solvers provided by CUDA are beneficial for single calculations but have no significant advantages in parallel processing for large-scale PF studies. Hence, the Eigen library, an open-source C++ library, is proposed as a solution [56]. In the proposed method, each linear system is solved sequentially, but parallel processing is performed by distributing multiple systems across multiple OpenMP threads.



G. Zhou et al. [55] proposed a batch-LU solution method to calculate multiple SLSs in the massive-PFs problem (MPFP). The crux of the proposed method is the simultaneous solution of multiple linear systems $A_i \cdot x_i = b_i$ ($i = 1, 2, \ldots, N$). This method is described in Algorithm 2, and its main features are as follows.

**Algorithm 2** GPU-based batch LU factorization [55].

**INPUT** *A*: an *n* × *n* input matrix to be LU factorized

**OUTPUT** *L*, *U*: lower and upper triangular matrices factorized from *A*

```
1: while there still exist executable columns do in parallel
       // Thread block j for the j-th column of batch tasks of LU factorization
2:     j ← block ID, t ← thread ID in the thread block;
3:     &c_t = A_t(:, j);
       // Renew the j-th column of all the other subtasks of LU factorization.
       ...
8: end while
```


This method can be used for both the forward and backward substitutions of LU decomposition. In the case study using the proposed method, the maximum improvement was a 76-fold speedup compared with the KLU library.
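For reference, the forward and backward substitutions that follow an LU factorization can be written as column-oriented sweeps, the form that maps naturally onto per-column parallel tasks in a batched GPU solver. This serial NumPy version uses a hypothetical 2 × 2 example.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for unit-lower-triangular L (column-oriented sweep)."""
    y = b.astype(float).copy()
    n = L.shape[0]
    for j in range(n):
        # Once y[j] is final, eliminate its contribution from later rows.
        y[j + 1:] -= y[j] * L[j + 1:, j]
    return y

def backward_substitution(U, y):
    """Solve U x = y for upper-triangular U (column-oriented sweep)."""
    x = y.astype(float).copy()
    n = U.shape[0]
    for j in range(n - 1, -1, -1):
        x[j] /= U[j, j]
        x[:j] -= x[j] * U[:j, j]
    return x

# Hypothetical factors L (unit lower) and U (upper) of a 2 x 2 system.
L = np.array([[1.0, 0.0], [0.5, 1.0]])
U = np.array([[2.0, 1.0], [0.0, 3.0]])
b = np.array([2.0, 4.0])
x = backward_substitution(U, forward_substitution(L, b))
assert np.allclose(L @ U @ x, b)
```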

#### *4.2. A Study Based on QR Decomposition*

For a system of *N* elements, an *N* − 1 static security analysis (SSA) requires performing *N* alternating-current power flows (ACPFs), which usually incurs a tremendously high computational expense for a large-scale power system. Hence, in [28], a GPU-based batch-ACPF method using a QR solver was proposed to reduce the computation time of solving the large number of PFs in *N* − 1 SSA. For clarity, the basic concept of QR decomposition is described above in Section 2.2.3, and the overall framework of the proposed batch-ACPF solution is shown in Figure 4.

**Figure 4.** Framework of the proposed method in [28].

Reference [28] proposed a GPU-accelerated contingency screening algorithm to determine the critical contingency set (CCS) required for the analysis. The proposed algorithm, described in detail in chapter VI of [28], includes dense vector operations in its execution.

Two GPU-based algorithms are used to solve the CCS **S** generated by the screening algorithm. First, batch Jacobian-matrix (batch-JM) generation is employed to produce the JM set $\{J_i \mid i \in \mathbf{S}\}$ and the power mismatch set $\{b_i \mid i \in \mathbf{S}\}$. The JM generation can be identified from the modified equations of ACPF analysis in Equation (14).

$$
\begin{bmatrix} \Delta P \\ \Delta Q \end{bmatrix}
= -J \begin{bmatrix} \Delta \theta \\ \Delta U / U \end{bmatrix}
= -\begin{bmatrix} H & N \\ M & L \end{bmatrix}
\begin{bmatrix} \Delta \theta \\ \Delta U / U \end{bmatrix} \tag{14}
$$

where Δ*P* denotes the mismatch vector of nodal active power injections, Δ*Q* the mismatch vector of nodal reactive power injections, and Δ*θ* and Δ*U* the modification vectors of voltage angle and magnitude, respectively. Also, *U* is the vector of nodal voltage magnitudes, and *H*, *N*, *M*, and *L* are the sub-matrices of the JM *J*. JM generation requires calculating these four sub-matrices. Each sub-matrix can be computed independently node by node, so the larger the power system, the greater the available parallelism.

Second, the batch-QR solver accelerates the solution of the modified equation set $\{J_i x_i = b_i \mid i \in \mathbf{S}\}$. Reference [28] showed that a conventional QR solver for a single SLS is not suitable for the GPU-accelerated algorithm, so a batch-QR solver for a massive number of SLSs was proposed to overcome this disadvantage. The purpose of the batch-QR solver is to simultaneously solve the linear systems $A_i x_i = b_i$ ($i = 1, 2, \ldots, N$). The proposed solver splits the entire calculation into single-QR subtasks, which are computed in parallel on the GPU to achieve a higher degree of parallelism; this includes coalesced memory accesses and reordering algorithms for the computation.
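A serial stand-in for the batch-QR idea can be sketched in NumPy: each system $A_i x_i = b_i$ is one independent QR subtask, which on a GPU would map to its own thread block. The batch data here are randomly generated for illustration only.

```python
import numpy as np

def batch_qr_solve(A_batch, b_batch):
    """Solve A_i x_i = b_i for i = 1..N via one QR subtask per system.

    Serial sketch of the batch idea; on a GPU the subtasks run concurrently.
    """
    xs = []
    for A, b in zip(A_batch, b_batch):
        Q, R = np.linalg.qr(A)               # A = Q R, as in Eq. (11)
        # A x = b  =>  R x = Q^T b, an upper-triangular system.
        xs.append(np.linalg.solve(R, Q.T @ b))
    return np.stack(xs)

# Randomly generated, well-conditioned batch of N systems of size n.
rng = np.random.default_rng(0)
N, n = 4, 5
A_batch = rng.standard_normal((N, n, n)) + 3.0 * np.eye(n)
b_batch = rng.standard_normal((N, n))
x = batch_qr_solve(A_batch, b_batch)
# Every system in the batch is solved: A_i x_i = b_i for all i.
assert np.allclose(np.einsum('nij,nj->ni', A_batch, x), b_batch)
```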

There are two stopping criteria for the batch-ACPF solver: either 80% of the ACPFs have converged, or the solver has reached the maximum number of iterations, which was set to 10 in [28]. GPU kernels in double-precision format are used to check for overloads: one checks the branch power flows, and the other checks the nodal voltages and reactive powers. Table 2 summarizes the simulation results for a specific case. The first solution was designated as the baseline for performance comparison, and the speedup of each SSA solution method is indicated to prove the validity of the proposed method.


**Table 2.** Performance comparison in [28]. All the other speedups were calculated against the analysis time of No. 1.

#### *4.3. Computation Time Reduction of GPU-Based Parallel Processing*

Research applying GPUs to PF studies can be categorized into studies aimed at reducing the execution time of matrix handling in a single computation and studies focused on the parallel processing of a large number of computations simultaneously. Many studies of parallel processing have suggested methods for efficiently separating matrix operations; since most of the operations are concentrated in this area, efficiency can be increased substantially if the time required for matrix handling is reduced [21]. The studies introduced in Sections 4.1 and 4.2 are typical matrix handling studies. On the other hand, some studies of the parallel processing of many computations suggest reducing the execution time by focusing on parallel processing beyond the division of matrix handling. In [63], parallel power flow using a hybrid CPU-GPU approach was studied. The NR method was used for the PF computation, and some libraries were used for the CUDA implementation. This study considered the time needed for memory allocation and other aspects of GPUs that are disadvantageous relative to CPUs, as well as the PF computation time. The hybrid CPU-GPU approach for the proposed simultaneous parallel power flow computation is shown in Figure 5.

In [63], systems with 14, 30, 57, 118, 300, 1354, 2869, and 9241 buses were used to obtain results under various conditions. Single calculations and parallel batches of 10, 50, 100, and 200 simultaneous PF computations were performed for each system, and the speedup of the proposed method was recorded. The results show that if the power system is small, the overhead of memory allocation on the GPU is relatively high for multiple simultaneous PF computations; in contrast, when a large number of PF computations are performed, the memory allocation becomes relatively negligible as the power system grows. Furthermore, concentrating the allocation at the beginning of the program reduces the bottleneck effect caused by the nature of GPU behavior. The overall results show that the use of GPUs is efficient in PF studies and can be applied to real-time operational planning or control actions in control centers that require large amounts of PF computation.

**Figure 5.** Hybrid CPU-GPU approach in [63].

## *4.4. Results and Comparison*

Studies performing parallel processing using GPUs for PF are summarized in Table 3. In addition to the papers described in detail above, a number of additional meaningful studies are included.

Table 3 shows the solvers or specific methods used in each study, together with the speedup results. The NR method is the most popular, and its time-consuming part is matrix handling; therefore, parallel processing studies using GPUs also target the matrix handling part to speed up the computation. Many studies have addressed the parallel processing of LU decomposition, which is frequently used in the NR method, and some have used QR decomposition for matrix handling. Table 3 summarizes the speedup results presented in each paper, including the reference against which they were compared, and also shows the CPU and GPU hardware specifications employed for the speedup evaluation. In some cases, the speedup is measured against an existing commercial library; in others, the performance is evaluated against the CPU alone. Since the speedup baseline differs from paper to paper, it is difficult to compare the inherent speedup values across cases; however, it is clear that GPUs aid speedup when performing PF computations.


