**1. Introduction**

In applied sciences such as computational electromagnetics, systems of partial differential equations must frequently be solved. Engineering problems involve many unknowns, and these are often reformulated as the solution of partial differential equations. Solving partial differential equations requires discretization, and discretization usually yields symmetric systems of equations. It is therefore natural to exploit symmetry when solving partial differential equations. Several studies on multi-computers have appeared. For instance, Eric Polizzi and Ahmed H. Sameh [1] contributed a spike algorithm as a parallel solver for hybrid banded equations. The algorithm first decomposes the banded system into block-tridiagonal form and then makes full use of the divide-and-conquer technique. However, as the bandwidth increases, the parallel computation becomes much more complex, leading to a decrease in parallel efficiency. Clearly, highly efficient parallel solution of banded systems is of great importance. Methods for block-tridiagonal linear equations include iterative algorithms such as the multi-splitting algorithm [2,3]. The multi-splitting algorithm (MPA) [2] can be used to solve large banded linear systems; however, it sometimes has low parallel efficiency. In [4], the authors provide a method for solving block-tridiagonal equations for which any incomplete-type preconditioner is appropriate. Based on the Galerkin principle, the parallel solution of large-scale banded equations is investigated in [5]. In [6], a parallel direct algorithm is used on multi-computers. In [7], a parallel direct method for large banded equations is presented. Preconditioners for large-scale banded equations are developed in [8–14]. The block successive over-relaxation method (BSOR) [10] can be adopted to solve large-scale systems of equations, but

its parallel efficiency varies with the choice of the optimal relaxation factor. These algorithms solve banded equations in parallel, but they do not extend to solving partial differential equations. Given a better computing environment, a highly efficient preconditioner can be implemented on multi-computers [15–23]. At the same time, Krylov subspace solvers [24–30] and preconditioners [31–38] for large-scale banded equations are commonly used, including the generalized minimal residual method (GMRES) [39]. The pseudo-elimination method with parameter k (PEk) [40] can be applied on multi-processors; however, the choice of the parameter k influences the speedup and parallel efficiency. Most of these are preconditioners for sparse linear systems or partial differential equation problems in Graphics Processing Unit (GPU) computation, and they consume considerable computational effort. A new algorithm is therefore needed that requires less calculation per iteration and attains greater speedup and higher parallel efficiency. This paper is based on the symmetry inherent in solving systems of partial differential equations. The resulting systems of equations are usually symmetric. In the process of solving them, the systems must be divided into blocks. The block equations may be symmetric or asymmetric, so this paper considers the general form of block equations. Of course, for symmetric block equations, the incomplete lower and upper factorization preconditioner (ILUP) algorithm is suitable. This paper is concerned with partial-differential equation systems of the form *Ax* = *b*. The associated iterative form *Mx*(*k*+1) = *Nx*(*k*) + *b* is used. The linear tridiagonal special form is tested on multi-processors. Then, the iterative ILUP for partial differential equation systems is examined on multiple central processing unit (CPU) cores.

The outline of the paper is as follows. Section 2 describes the decomposition strategy of the parallel algorithm. Section 3 documents the analysis of convergence. Section 4 introduces the parallel implementation of the algorithm. Section 5 presents the analysis of numerical results, including a large-scale system of equations and partial-differential equations. Finally, we conclude the paper in Section 6.

#### **2. Decomposition Strategy**

Consider large-scale band equations

$$Ax = b \tag{1}$$

that is

$$
\begin{pmatrix} A_1 & B_1 & & & \\ C_2 & A_2 & B_2 & & \\ & \ddots & \ddots & \ddots & \\ & & C_{n-1} & A_{n-1} & B_{n-1} \\ & & & C_n & A_n \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_{n-1} \\ b_n \end{pmatrix}
$$

where $A_i$, $B_i$, and $C_i$ are $d_i \times d_i$, $d_i \times d_{i+1}$, and $d_i \times d_{i-1}$ matrices, respectively, and $x_i$ and $b_i$ are the $d_i$-vectors of the unknowns and of the right-hand side.
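For concreteness, the block structure of Expression (1) can be assembled explicitly. The following sketch builds the full matrix from its blocks; the helper name `assemble_block_tridiagonal` is hypothetical, and dense NumPy arrays are used purely for illustration:

```python
import numpy as np

def assemble_block_tridiagonal(A_blocks, B_blocks, C_blocks):
    """Assemble the matrix of Expression (1) from its diagonal blocks
    A_1..A_n, superdiagonal blocks B_1..B_{n-1}, and subdiagonal blocks
    C_2..C_n (hypothetical helper; dense arrays for illustration)."""
    dims = [a.shape[0] for a in A_blocks]
    off = np.concatenate(([0], np.cumsum(dims)))    # block row offsets
    A = np.zeros((off[-1], off[-1]))
    for i, Ai in enumerate(A_blocks):
        A[off[i]:off[i+1], off[i]:off[i+1]] = Ai
    for i, Bi in enumerate(B_blocks):       # B_{i+1} sits to the right of A_{i+1}
        A[off[i]:off[i+1], off[i+1]:off[i+2]] = Bi
    for i, Ci in enumerate(C_blocks):       # C_{i+2} sits to the left of A_{i+2}
        A[off[i+1]:off[i+2], off[i]:off[i+1]] = Ci
    return A

# toy example: n = 3 blocks of size d_i = 2
I2 = np.eye(2)
A = assemble_block_tridiagonal([4 * I2, 4 * I2, 4 * I2], [-I2, -I2], [-I2, -I2])
```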

The coefficient matrix *A* can be approximately decomposed as

$$A \approx GH \tag{2}$$

Generally, suppose $n = pm$ ($m \ge 2$, $m \in Z$), where $p$ is the number of processors. Let

$$M = GH$$

where the blocks $S_i$, $L_i$, and $U_i$ of the factors $G$ and $H$ are defined by

$$
\begin{aligned}
S_i &= B_i, & i &= m(q-1)+1, \cdots, m(q-1)+m-1, & q &= 1, \cdots, p \\
S_i &= B_i A_{i+1}^{-1}, & i &= mq, & q &= 1, 2, \cdots, p-1 \\
L_i &= C_i, & i &= m(q-1)+1, & q &= 2, 3, \cdots, p \\
L_i &= C_i U_{i-1}^{-1}, & i &= m(q-1)+2, \cdots, m(q-1)+m, & q &= 1, \cdots, p \\
U_i &= A_i, & i &= m(q-1)+1, & q &= 1, \cdots, p \\
U_i &= A_i - L_i S_{i-1}, & i &= mq+2, \cdots, mq+m-1, & q &= 0, \cdots, p-1;\ i = mq+m,\ q = p-1 \\
U_i &= A_i - L_i S_{i-1} - S_i L_{i+1}, & i &= m(q-1)+m, & q &= 1, \cdots, p-1
\end{aligned} \tag{3}
$$

and $I_i$ is the $d_i \times d_i$ identity matrix, $i = 1, \cdots, n$.
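As a sanity check on the recurrences above, note that in the single-processor case $p = 1$ the boundary corrections vanish and Expression (3) reduces to the exact block LU recurrence $U_1 = A_1$, $L_i = C_i U_{i-1}^{-1}$, $U_i = A_i - L_i B_{i-1}$. A minimal dense sketch of that reduced case (`ilup_factors` is a hypothetical name, not from the paper):

```python
import numpy as np

def ilup_factors(A_blocks, B_blocks, C_blocks):
    """Factors of Expression (3) in the single-processor case p = 1,
    where the recurrence is the exact block LU factorization:
    U_1 = A_1, L_i = C_i U_{i-1}^{-1}, U_i = A_i - L_i B_{i-1}."""
    n = len(A_blocks)
    U = [A_blocks[0]]
    L = [None]                                   # L_1 is not defined
    for i in range(1, n):
        # C_blocks holds C_2..C_n, so C_blocks[i-1] is C_{i+1} (1-based)
        Li = C_blocks[i - 1] @ np.linalg.inv(U[i - 1])
        L.append(Li)
        U.append(A_blocks[i] - Li @ B_blocks[i - 1])
    return L, U
```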

Then

$$N = M - A$$

that is

$N$ is zero except for the fill-in blocks coupling neighbouring processor blocks:

$$
N_{mq,\,mq+2} = S_{mq} S_{mq+1}, \qquad N_{mq+2,\,mq} = L_{mq+2} L_{mq+1}, \qquad q = 1, \cdots, p-1,
$$

where $N_{i,j}$ denotes the block in the $i$-th block row and $j$-th block column; all other blocks are zero. Therefore, the new iterative scheme for the large-scale band system of equations is

$$GHx^{(k+1)} = Nx^{(k)} + b \tag{4}$$


where the iterative matrix is

$$T = H^{-1}G^{-1}N$$

Obviously, $GH$ is nonsingular, which is a necessary condition for the algorithm to be well defined. Owing to the structure of $G$ and $H$, the iterative algorithm parallelizes well.
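The convergence condition $\rho(T) < 1$ for $T = H^{-1}G^{-1}N$ can be checked numerically on small cases. A toy sketch, in which $G$, $H$, and $N$ are tiny illustrative stand-ins rather than factors of an actual banded problem:

```python
import numpy as np

# Toy check of the convergence criterion rho(T) < 1 for the iteration
# matrix T = H^{-1} G^{-1} N.  The matrices are illustrative stand-ins.
G = np.array([[1.0, 0.0],
              [-0.5, 1.0]])          # unit lower triangular, like G
H = np.array([[4.0, -1.0],
              [0.0, 3.5]])           # upper triangular, like H
N = np.array([[0.0, 0.5],
              [0.5, 0.0]])
T = np.linalg.inv(H) @ np.linalg.inv(G) @ N
rho = max(abs(np.linalg.eigvals(T)))
# rho < 1 here, so x^(k+1) = T x^(k) + (GH)^{-1} b converges
```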

The strategy is an ILUP algorithm. Compared with published algorithms [2,10,40], the ILUP algorithm requires fewer multiplications and additions per iteration, so it attains greater speedup and higher parallel efficiency. It is appropriate for solving large-scale systems of equations and partial-differential equations on multi-core processors.

#### **3. Analysis of Convergence**

#### *3.1. Preliminary*

Here, some notation is introduced: two definitions and one lemma are recalled.

**Definition 1.** *([39]) A real $n \times n$ matrix $A = (a_{i,j})$ with $a_{i,j} \le 0$ for all $i \neq j$ is an M-matrix if $A$ is nonsingular and $A^{-1} \ge O$.*

**Definition 2.** *([39]) Let the matrices $A$, $M$, $N$ satisfy $A = M - N$. This is a regular splitting of $A$ if $M$ is nonsingular, $M^{-1} \ge O$, and $N \ge O$.*

**Lemma 1.** *([39]) Presume $A = M - N$ is a regular splitting of $A$. Then, $A$ is nonsingular and $A^{-1} \ge O$ if and only if $\rho(M^{-1}N) < 1$.*
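Lemma 1 can be illustrated numerically: for the small M-matrix $A$ below and a regular splitting (here the Jacobi splitting $M = \mathrm{diag}(A)$, chosen purely for illustration and unrelated to the ILUP splitting), the spectral radius $\rho(M^{-1}N)$ is strictly below 1:

```python
import numpy as np

# Numerical illustration of Lemma 1: A below is an M-matrix, and the
# Jacobi splitting A = M - N with M = diag(A) is a regular splitting,
# so rho(M^{-1} N) < 1.
A = np.array([[ 4.0, -1.0,  0.0],
              [-1.0,  4.0, -1.0],
              [ 0.0, -1.0,  4.0]])
M = np.diag(np.diag(A))              # nonsingular with M^{-1} >= O
N = M - A                            # nonnegative off-diagonal part
rho = max(abs(np.linalg.eigvals(np.linalg.inv(M) @ N)))
# rho is about 0.354 < 1, consistent with the lemma
```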

#### *3.2. Proposition and Theorem*

Note that the inverse matrices below are obtained by Gaussian elimination. Firstly, from the definitions and the lemma, a proposition is obtained as follows.

**Proposition 1.** *If $A$ is an M-matrix, then the matrices $U_i$ ($i = 1, 2, 3, \cdots, n$) defined by Expression (3) satisfy $U_i^{-1} \ge O$.*

**Proof.** From Expression (3), in terms of the structure of $A$, $G$, $H$ and $M = GH$, $N = M - A$, we have

$$
\begin{aligned}
U_i &= A_i, \quad i = m(q-1)+1,\ q = 1, \cdots, p \\
U_i &= A_i - L_i S_{i-1} = A_i - C_i U_{i-1}^{-1} B_{i-1}, \quad i = m(q-1)+2, \cdots, m(q-1)+m-1,\ q = 1, \cdots, p-1; \\
&\quad i = n-m+1, \cdots, n,\ q = p \\
U_i &= A_i - L_i S_{i-1} - S_i L_{i+1}, \quad i = m(q-1)+m,\ q = 1, \cdots, p-1.
\end{aligned}
$$

As $A$ is an M-matrix, $U_i^{-1} \ge O$ for $i = m(q-1)+1$, $q = 1, \cdots, p$. Let

$$
W_i = \begin{pmatrix}
A_{(i-1)m+1} & B_{(i-1)m+1} & & \\
C_{(i-1)m+2} & A_{(i-1)m+2} & \ddots & \\
& \ddots & \ddots & B_{im-1} \\
& & C_{im} & A_{im}
\end{pmatrix};
$$

then $W_i^{-1} \ge O$. Since the block in the $m$-th block row and $m$-th block column of $W_i^{-1}$ is $U_i^{-1}$ for $i = m(q-1)+2, \cdots, m(q-1)+m-1$ and $q = 1, \cdots, p-1$, we have $U_i^{-1} \ge O$ for $i = m(q-1)+2, \cdots, m(q-1)+m-1$ and $q = 1, \cdots, p-1$. Furthermore,

$$
V_i = \begin{pmatrix}
A_{(i-1)m+1} & B_{(i-1)m+1} & & \\
C_{(i-1)m+2} & A_{(i-1)m+2} & \ddots & \\
& \ddots & \ddots & B_{(i-1)m+m} \\
& & C_{(i-1)m+m+1} & A_{(i-1)m+m+1}
\end{pmatrix}
$$

Similarly, by induction, the block in the $m$-th block row and $m$-th block column of $V_i^{-1}$ is $U_i^{-1}$ for $i = m(q-1)+m$, $q = 1, \cdots, p-1$. Therefore, $U_i^{-1} \ge O$ for $i = m(q-1)+m$, $q = 1, \cdots, p-1$. Then, we have $U_i^{-1} \ge O$ ($i = 1, \cdots, n$). □

Secondly, taking advantage of the above lemma and proposition, a theorem is given.

**Theorem 1.** *If $A$ is an M-matrix, then the approximate factorization of matrix $A$ can be represented by Expression (2), and the iterative scheme (4) converges to $x^* = A^{-1}b$.*

**Proof.** From the above proposition, the approximate factorization of matrix *A* can be represented by Expression (2).

Firstly, we prove $N \ge O$.

As $A$ is an M-matrix, $A_{im+1}^{-1} \ge O$, $B_{im+1} \le O$, $B_{im} \le O$, $C_{im+1} \le O$, and $C_{im+2} \le O$ for $i = 1, \cdots, p-1$. Hence, $B_{im} A_{im+1}^{-1} B_{im+1} \ge O$ and $C_{im+2} A_{im+1}^{-1} C_{im+1} \ge O$ for $i = 1, \cdots, p-1$. Therefore, $N \ge O$.

Secondly, we prove $M^{-1} \ge O$.

Since $M^{-1} = \widetilde{U}^{-1} \widetilde{L}^{-1}$, with

$$
\widetilde{L}^{-1} = \begin{pmatrix}
\widehat{L}_1 & -\widehat{S}_1 & & & \\
& \widehat{L}_2 & -\widehat{S}_2 & & \\
& & \ddots & \ddots & \\
& & & \widehat{L}_{p-1} & -\widehat{S}_{p-1} \\
& & & & \widehat{L}_p
\end{pmatrix}
$$

where

$$
\widehat{L}_i = \begin{pmatrix}
I_{(i-1)m+1} & & & \\
-L_{(i-1)m+2} & I_{(i-1)m+2} & & \\
& \ddots & \ddots & \\
& & -L_{im} & I_{im}
\end{pmatrix}, \ i = 1, \cdots, p, \qquad
\widehat{S}_i = \begin{pmatrix}
O \\ \vdots \\ O \\ S_{im}
\end{pmatrix}, \ i = 1, \cdots, p-1,
$$

and

$$
\widetilde{U}^{-1} = \begin{pmatrix}
\widehat{U}_1 & & & \\
\widehat{C}_2 & \widehat{U}_2 & & \\
& \ddots & \ddots & \\
& & \widehat{C}_p & \widehat{U}_p
\end{pmatrix}
$$

where

$$
\widehat{U}_i = \begin{pmatrix}
U_{(i-1)m+1}^{-1} & -U_{(i-1)m+1}^{-1} B_{(i-1)m+1} U_{(i-1)m+2}^{-1} & \cdots & (-1)^{m-1} \prod_{j=1}^{m-1} U_j^{-1} B_j \, U_{im}^{-1} \\
& \ddots & \ddots & \vdots \\
& & U_{(i-1)m+m-1}^{-1} & -U_{(i-1)m+m-1}^{-1} B_{(i-1)m+m-1} U_{im}^{-1} \\
& & & U_{im}^{-1}
\end{pmatrix},
$$

$$
\widehat{C}_i = \begin{pmatrix}
O \\ \vdots \\ O
\end{pmatrix}, \quad i = 2, \cdots, p.
$$

According to the proposition, $U_i^{-1} \ge O$ ($i = 1, \cdots, n$). Since

$$
L_{(i-1)m+j} = C_{(i-1)m+j} U_{(i-1)m+j-1}^{-1}, \quad j = 2, \cdots, m,\ i = 1, \cdots, p,
$$

we have $-L_{(i-1)m+j} \ge O$, $i = 1, \cdots, p$, $j = 2, \cdots, m$. Therefore, $\widetilde{L}^{-1} \ge O$ and $\widetilde{U}^{-1} \ge O$, so $M^{-1} \ge O$.

Finally, based on $M^{-1} \ge O$, $N \ge O$, and Lemma 1, we conclude that $\rho(M^{-1}N) < 1$. That is, this algorithm converges. □

This section shows that the condition in the theorem is a sufficient condition for convergence of the algorithm. If *A* is not an M-matrix, Algorithm (4) is sometimes convergent, as is shown in the following section (Example 1).

#### **4. Parallel Implementations**

#### *4.1. Storage Method*

For the $i$-th processor $P_i$ ($i = 1, \cdots, p$), allocate $A_{(i-1)m+j}$, $B_{(i-1)m+j}$, $C_{(i-1)m+j}$ ($i < p$, $j = 1, \cdots, m, m+1$; $i = p$, $j = 1, \cdots, m$), $b_{(i-1)m+j}$ ($j = 1, \cdots, m$), the initial vector $x^{(0)}_{(i-1)m+j}$, and the convergence tolerance $\varepsilon$.

#### *4.2. Circulating*

(1) *Gy* = *b* + *Nx*(*k*) is solved to obtain *y*.

$P_i$ ($i = 1, \cdots, p-1$) acquires $x^{(k)}_{(i+1)m+2}$ from $P_{i+1}$ and then computes $y_{(i-1)m+q}$, $q = 1, \cdots, m-1$, $i = 1, \cdots, p$, and $y_n$. $P_i$ ($i = 1, \cdots, p-1$) obtains $y_{(i+1)m+1}$ from $P_{i+1}$ and then obtains $y_{im}$, $i = 1, \cdots, p-1$.

(2) *Hx*(*k*+1) = *y* is solved to obtain *x*(*k*+1).

$P_i$ ($i = 1, \cdots, p$) computes $x^{(k+1)}_{(i-1)m+q}$ ($q = 2, \cdots, m$, $i = 1, \cdots, p$) and $x^{(k+1)}_1$. The $i$-th processor $P_i$ ($i = 2, \cdots, p$) obtains $x^{(k+1)}_{im}$ from $P_{i-1}$ and then computes $x^{(k+1)}_{(i-1)m+1}$, $i = 2, \cdots, p$.

(3) On $P_i$ ($i = 1, \cdots, p$), check whether $\|x^{(k+1)}_{(i-1)m+j} - x^{(k)}_{(i-1)m+j}\| \le \varepsilon$. Stop if this holds; otherwise, go back to step (1).
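The loop in steps (1)-(3) can be sketched serially as follows, with dense solves standing in for the distributed block-triangular sweeps; the function name and the dense $G$, $H$, $N$ arguments are illustrative, and in the real parallel implementation only the boundary blocks are exchanged between neighbouring processors:

```python
import numpy as np

def ilup_iteration_serial(G, H, N, b, x0, eps=1e-10, maxit=5000):
    """Serial sketch of the loop in Section 4.2 (illustrative name).
    Step (1) solves G y = b + N x^(k); step (2) solves H x^(k+1) = y;
    step (3) checks the stopping criterion."""
    x = x0.copy()
    for k in range(maxit):
        y = np.linalg.solve(G, b + N @ x)       # step (1)
        x_new = np.linalg.solve(H, y)           # step (2)
        if np.max(np.abs(x_new - x)) <= eps:    # step (3)
            return x_new, k + 1
        x = x_new
    return x, maxit
```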

#### **5. Results Analysis of Numerical Examples**

To test the new algorithm, results on the Inspur TS10000 cluster are given for the new algorithm and the order-2 multi-splitting algorithm [2], a well-known parallel iterative algorithm. The PEk method [40] is used for the inner iteration of the order-2 multi-splitting algorithm. Suppose $d_i = d_{i-1} = d_{i+1} = t$, $x^{(0)}_i = (0, \cdots, 0)^{T}_{t \times 1}$, and the stopping criterion is $\|x^{(k+1)} - x^{(k)}\|_\infty < \varepsilon$ with $\varepsilon = 10^{-10}$.

In the tables, P is the number of processors, l is the number of inner iterations, *k* is the parameter of the PEk method, T is the run time (in seconds), I is the number of iterations, S is the speedup, and E is the parallel efficiency (E = S/P). In the following figures, ILUP, BSOR, PEk, and MPA denote, respectively, the iterative incomplete lower and upper factorization preconditioner, the block successive over-relaxation method, the PEk method, and the multi-splitting algorithm.
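The two performance measures are related by S = T₁/T_P and E = S/P, where T₁ is the one-processor run time and T_P the run time on P processors. A tiny helper makes the computation explicit; the function name and the timing values are illustrative, not values from the tables:

```python
# Speedup and parallel efficiency as reported in the tables:
# S = T_1 / T_P and E = S / P (illustrative helper and timings).
def speedup_efficiency(t1, tp, p):
    s = t1 / tp
    return s, s / p

s, e = speedup_efficiency(t1=100.0, tp=2.5, p=64)   # s = 40.0, e = 0.625
```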

#### *5.1. Results Analysis of the Large-Scale System of Equations*

**Example 1.** *A in Expression (1) represents*

$$
A_i = \begin{pmatrix} 12 & -2 & & & \\ -3 & 12 & -2 & & \\ & \ddots & \ddots & \ddots & \\ & & -3 & 12 & -2 \\ & & & -3 & 12 \end{pmatrix}, \quad
B_i = \begin{pmatrix} 2.2 & -1.3 & & & \\ -3 & 2.2 & -1.3 & & \\ & \ddots & \ddots & \ddots & \\ & & -3 & 2.2 & -1.3 \\ & & & -3 & 2.2 \end{pmatrix},
$$

$$
C_i = \begin{pmatrix} 2 & 2 & & & \\ -1 & 2 & 2 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & 2 \\ & & & -1 & 2 \end{pmatrix}, \quad
b_i = \begin{pmatrix} (i-1)t+1 \\ (i-1)t+2 \\ \vdots \\ it-1 \\ it \end{pmatrix}_{t \times 1}, \quad i = 1, 2, \cdots, n,
$$

*where $B_n = C_1 = O$, $n = 300$, and $t = 300$. The numerical results are shown in Tables 1–5, and in Figures 1 and 2.*
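A small instance of Example 1 can be assembled directly from these definitions. The sketch below uses a much smaller $t$ and $n$ than the $n = t = 300$ of the tests, and a dense direct solve only to produce a reference solution; it also exposes the positive off-diagonal entries that make $A$ fail the M-matrix sign condition:

```python
import numpy as np

def tridiag(t, sub, diag, sup):
    """Dense t x t tridiagonal matrix with constant diagonals."""
    return (np.diag([diag] * t)
            + np.diag([sup] * (t - 1), 1)
            + np.diag([sub] * (t - 1), -1))

# Blocks of Example 1 (small t and n here, not the test size).
t, n = 4, 3
A_i = tridiag(t, -3.0, 12.0, -2.0)
B_i = tridiag(t, -3.0, 2.2, -1.3)   # positive 2.2 diagonal:
C_i = tridiag(t, -1.0, 2.0, 2.0)    # A is *not* an M-matrix

# Assemble the block-tridiagonal A (B_n = C_1 = O) and the right-hand side.
A = np.zeros((n * t, n * t))
for i in range(n):
    A[i*t:(i+1)*t, i*t:(i+1)*t] = A_i
    if i < n - 1:
        A[i*t:(i+1)*t, (i+1)*t:(i+2)*t] = B_i
        A[(i+1)*t:(i+2)*t, i*t:(i+1)*t] = C_i
b = np.arange(1.0, n * t + 1)       # b_i = ((i-1)t+1, ..., it)^T stacked
x = np.linalg.solve(A, b)           # reference solution for checking iterations
```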

The first example is not a numerical simulation of any partial differential equation (PDE); it is used to test the correctness of the iterative incomplete lower and upper factorization preconditioner algorithm, and it builds a good foundation for the second example, which concerns a PDE. The solutions to the large-scale system of equations of Example 1 obtained by the ILUP are shown in Table 1, and the details are as follows. This problem is solved with more than eight processors, and the number of iterations is 238. When increasing the number of processors, the time and the parallel efficiency both decrease. As the number of processors changes from 4 to 64, the parallel efficiency changes from 91.14% to 73.80%. All of the parallel efficiency values, being above 73%, are higher than those in published works, including Cui et al.'s [10], Zhang et al.'s [40], and Yun et al.'s [2] methods. No matter how many processors are used to calculate the problem, the final error of this example is the same: 6.897 × 10⁻¹¹.


**Table 1.** The iterative incomplete lower and upper factorization preconditioner (ILUP) for Example 1.

The results of Example 1 when using the BSOR method [10] are listed in Table 2. When more than four processors are used to solve Example 1, the number of iterations is 216. When increasing the number of processors, the time and parallel efficiency decrease. The per-iteration computation and communication cost is higher than that of the ILUP algorithm for the large-scale system of equations; hence, the speedup is smaller than that of the ILUP algorithm, and the parallel efficiency is accordingly worse. When four processors are used to solve Example 1, the parallel efficiency is 59.56%, against 91.14% for the ILUP algorithm on four processors. When increasing the number of processors, the parallel efficiency decreases to 44.81%, which is again lower than that of the ILUP algorithm.

**Table 2.** The results of the block successive over-relaxation (BSOR) method for Example 1 (ω = 2.0).


The results of Example 1 when using the PEk method published by Zhang et al. [40] are described in Table 3. When more than four processors are used to solve Example 1, the number of iterations is 227. When increasing the number of processors, the time and parallel efficiency decrease. The per-iteration computation and communication cost is higher than that of the ILUP algorithm for the large-scale system of equations; hence, the speedup is smaller than that of the ILUP algorithm, and the parallel efficiency is accordingly poorer. When four processors are used to solve Example 1, the parallel efficiency is 64.08%, against 91.14% for the ILUP algorithm on four processors. When increasing the number of processors, the parallel efficiency decreases to 44.79%, close to that of the BSOR method and lower than the ILUP algorithm's 73.80%.

**Table 3.** The results of the pseudo-elimination method with parameter k (PEk) for Example 1 (k = 1.6).


The results of Example 1 when using the multi-splitting algorithm (MPA) published by Yun et al. [2] are introduced in Table 4. As seen in Table 4, when more than four processors are used to solve Example 1, the number of iterations is 174. When increasing the number of processors, the time and parallel efficiency decrease. The per-iteration computation and communication cost is higher than that of the ILUP algorithm for the large-scale system of equations; hence, the speedup is smaller than that of the ILUP algorithm, and the parallel efficiency is accordingly poorer. When four processors are used to solve Example 1, the parallel efficiency is 55.64%, 33.50% less than that of the ILUP algorithm. When increasing the number of processors, the parallel efficiency decreases to 40.82%, about 4% less than the parallel efficiency obtained with the BSOR method and 23% lower than that of the ILUP algorithm.


**Table 4.** The solutions to the multi-splitting algorithm (MPA) used for Example 1.

This section compares the speedup and parallel efficiency performance of the ILUP algorithm with methods in other recently published works, including Cui et al.'s [10], Zhang et al.'s [40], and Yun et al.'s [2] methods. Table 5 introduces a summary and comparison of the speedup and parallel efficiency of the different methods used for Example 1 on 64 CPU cores; the ILUP results are better than those of the other works [2,10,40]. As seen in Table 5, the speedup obtained with our method for Example 1 on 64 CPU cores is 47.2324, and the parallel efficiency is 73.80%. The parallel efficiency obtained with the ILUP algorithm is about 29% higher than that obtained using the BSOR method, and 29.01% more than that obtained using the PEk method; the parallel efficiency of the BSOR method is close to that of the PEk method. The parallel efficiency is also 23% higher than that obtained using the MPA algorithm.

**Table 5.** Comparison of speedup and parallel efficiency of the different methods used for Example 1 on 64 central processing unit (CPU) cores.


Figure 1 illustrates the speedup performance of the ILUP algorithm and the other three methods for Example 1 on different numbers of CPU cores. As seen from Figure 1, when increasing the number of processors, the speedup obtained using all the methods increases. Regardless of the number of processors, the speedup obtained using the ILUP algorithm is significantly higher than that obtained using the other three methods, especially for larger numbers of processors. The speedup values obtained using the BSOR method, the PEk method, and the MPA algorithm are close to one another, particularly those of the BSOR method and the PEk method.

Figure 2 shows the parallel efficiency performance of the ILUP algorithm and the other three methods for Example 1 on different numbers of CPU cores. As seen from Figure 2, when increasing the number of processors, the parallel efficiency obtained using all the methods decreases. Regardless of the number of processors, the parallel efficiency obtained using the ILUP algorithm is much higher than that of the other three methods, remaining above 70%. The parallel efficiency values obtained using the PEk method, the BSOR method, and the MPA algorithm are lower and close to one another, especially those of the BSOR and PEk methods. In particular, when the number of processors is 64, the parallel efficiency obtained using the ILUP algorithm stays above 73%, whereas the parallel efficiencies of the BSOR method, the PEk method, and the MPA algorithm are only about 40%. The ILUP algorithm thus has a clear superiority in parallel efficiency.

**Figure 2.** The parallel efficiency values for Example 1.

#### *5.2. Results Analysis of the Partial-Differential Equations*

**Example 2.** *Given the equations*

$$
C_x \frac{\partial^2 u}{\partial x^2} + C_y \frac{\partial^2 u}{\partial y^2} + (C_1 \sin 2\pi x + C_2) \frac{\partial u}{\partial x} + (D_1 \sin 2\pi y + D_2) \frac{\partial u}{\partial y} + Eu = 0, \quad 0 \le x \le 1,\ 0 \le y \le 1,
$$

*with*

$$
u|_{x=0} = u|_{x=1} = 10 + \cos \pi y, \qquad u|_{y=0} = u|_{y=1} = 10 + \cos \pi x,
$$

*where $C_x$, $C_y$, $C_1$, $C_2$, $D_1$, $D_2$, and $E$ are constants. Let $C_x = C_y = E = 1$, $C_1 = C_2 = D_1 = D_2 = 0$, and $h = 1/101$. The results are given in Tables 6–10 and in Figures 3 and 4.*

The finite difference method is used to discretize Example 2 in the tests. We adopt second-order central difference schemes to discretize Example 2, convert the discrete equations into matrix form for numerical simulation, and, lastly, test the iterative incomplete lower and upper factorization preconditioner algorithm on different numbers of processors. The results for the partial-differential equations of Example 2 obtained using the ILUP are listed in Table 6, and the details are as follows. This problem was solved with more than four CPU cores, and the number of iterations was 560. When increasing the number of processors, the time and the parallel efficiency both decrease. When the number of processors used for solving Example 2 changes from 4 to 64, the parallel efficiency changes from 89.48% to 71.64%. All of the parallel efficiency values, being above 71%, are higher than in the published works [2,10,40]. Regardless of how many processors are used to compute Example 2, the final error of this problem is the same: 3.158 × 10⁻¹¹.

**Table 6.** The iterative incomplete lower and upper factorization preconditioner for Example 2.


The results for Example 2 obtained with the BSOR method [10] are listed in Table 7. When more than four processors are used to solve Example 2, the number of iterations is 793. When increasing the number of processors, the time and parallel efficiency decrease. The per-iteration computation and communication cost is higher than that of the ILUP algorithm; hence, the speedup is smaller than that of the ILUP algorithm, and the parallel efficiency is not as good as that of the ILUP algorithm for the partial-differential equations. When four processors are used to solve Example 2, the parallel efficiency is 86.24%, 3.24% lower than that of the ILUP algorithm for the partial-differential equations. When increasing the number of processors, the parallel efficiency decreases to 52.42%, which is less than the ILUP algorithm's 71.64%.

**Table 7.** The results of the BSOR method for Example 2 (ω = 2.0).


The results obtained for Example 2 using the PEk method [40] are given in Table 8. When more than four processors are used to solve Example 2, the number of iterations is 798. When increasing the number of processors, the time and parallel efficiency decrease. The per-iteration computation and communication cost is higher than that of the ILUP algorithm; hence, the speedup is smaller than that of the ILUP algorithm, and the parallel efficiency is poorer than that of the ILUP algorithm for the partial-differential equations. When four processors are used to solve Example 2, the parallel efficiency is 80.59%, which is 8.89% lower than that of the ILUP algorithm. When increasing the number of processors, the parallel efficiency decreases to 48.40%, which is 23.24% lower than that obtained with the ILUP algorithm.


**Table 8.** Results of the PEk method for Example 2 (k = 2.7).

The results for Example 2 obtained with the multi-splitting algorithm [2] are presented in Table 9. As seen in Table 9, when more than four processors are used to solve Example 2, the number of iterations is 838. As the number of processors increases, both the computation time and the parallel efficiency decrease. The per-iteration computation and communication costs are higher than those of the ILUP algorithm for the partial-differential equations, so the speedup is lower than that of the ILUP algorithm. Consequently, the parallel efficiency is poorer than that of the ILUP algorithm for this large-scale system of equations. With four processors, the parallel efficiency is 78.25%, which is 11.23% lower than that of the ILUP algorithm; as the number of processors increases further, the parallel efficiency falls to 46.34%, about 6% lower than that of the BSOR method, close to that of the PEk method, and 25.3% lower than that of the ILUP algorithm.


**Table 9.** Results of the multi-splitting algorithm for Example 2.

This section compares the speedup and parallel efficiency of the ILUP algorithm with the methods of other recently published works, including those of Cui et al. [10], Zhang et al. [40], and Yun et al. [2]. Table 10 summarizes and compares the speedup and parallel efficiency obtained with the different methods for Example 2 on 64 CPU cores. As seen in Table 10, the speedup of our method for Example 2 on 64 CPU cores is 45.8483 and the parallel efficiency is 71.64%, both higher than in the other published works. The parallel efficiency of the ILUP algorithm is 19.22% higher than that of the BSOR method, 23.24% higher than that of the PEk method, and 25.3% higher than that of the MPA algorithm.
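The speedup and parallel-efficiency figures quoted throughout this section follow the standard definitions S_p = T_1 / T_p and E_p = S_p / p, where T_1 is the serial run time, T_p is the run time on p processors, and p is the number of processors. As a minimal illustrative sketch (the function names are ours, not the paper's):

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S_p = T_1 / T_p."""
    return t_serial / t_parallel

def parallel_efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    """Parallel efficiency E_p = S_p / p."""
    return speedup(t_serial, t_parallel) / p

# Consistency check against the figures reported for the ILUP algorithm:
# a speedup of 45.8483 on p = 64 cores corresponds to an efficiency of
# 45.8483 / 64 ≈ 0.7164, i.e., the reported 71.64%.
e = parallel_efficiency(45.8483, 1.0, 64)   # with T_1 = 45.8483, T_64 = 1.0
print(f"{e * 100:.2f}%")                    # → 71.64%
```

These two relations also explain why efficiency declines as p grows even while speedup increases: the speedup gain per added processor shrinks relative to p.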


**Table 10.** Comparison of speedup and parallel efficiency values obtained using the different methods for Example 2 on 64 CPU cores.

Figure 3 compares the speedup of the ILUP algorithm and the other three methods for Example 2 with different numbers of CPU cores. As seen in Figure 3, the speedup of all the methods increases with the number of processors. Regardless of the number of processors, the speedup of the ILUP algorithm is much higher than that of the other three methods, particularly for larger processor counts. The speedup values of the BSOR method, the PEk method, and the MPA algorithm remain close to one another, especially those of the PEk method and the MPA algorithm. For example, with 64 processors, the speedup of the ILUP algorithm rises above 45, whereas those of the BSOR method, the PEk method, and the MPA algorithm are only about 30. The ILUP algorithm clearly has the advantage of higher speedup.

**Figure 3.** The speedup values for Example 2.

Figure 4 shows the parallel efficiency of the ILUP algorithm and the other three methods for Example 2 with different numbers of CPU cores. As seen in Figure 4, the parallel efficiency of all the methods decreases as the number of processors increases. Regardless of the number of processors, the parallel efficiency of the ILUP algorithm is much higher than that of the other three methods, remaining above 70%. As the number of processors increases, the parallel efficiency values of the BSOR method, the PEk method, and the MPA algorithm are lower and decline steadily, especially that of the MPA algorithm. In particular, with 64 processors, the parallel efficiency of the ILUP algorithm remains above 71%, whereas those of the BSOR method, the PEk method, and the MPA algorithm are only about 50%. The ILUP algorithm thus clearly delivers markedly higher parallel efficiency.

**Figure 4.** The parallel efficiency values for Example 2.

#### **6. Conclusions**

In this work, an iterative incomplete LU factorization preconditioner for partial-differential equation systems has been presented. The partial-differential equations were discretized into linear equations of the form Ax = b, and an iterative scheme was applied to the resulting linear systems. The iterative ILU preconditioners for the linear systems arising from the partial-differential equations were executed on different computation nodes with multiple CPU cores. From the numerical results in the tables and figures above, the following conclusions can be drawn:
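To make the overall solution pipeline concrete, the sketch below shows a generic serial ILU-preconditioned Krylov solve of a discretized system Ax = b using SciPy. This is an illustration of the general technique only, not the authors' parallel ILUP algorithm; the 1D Poisson test matrix and all parameter values are our own assumptions.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Hypothetical test problem: a 1D Poisson equation discretized by finite
# differences, giving a tridiagonal (banded) system Ax = b.
n = 200
A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Incomplete LU factorization used as a preconditioner M ≈ A^{-1}.
ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)
M = spla.LinearOperator((n, n), matvec=ilu.solve)

# Preconditioned GMRES iteration on the linear system.
x, info = spla.gmres(A, b, M=M)

residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
print(info, residual)  # info == 0 indicates convergence
```

In a distributed setting such as the one studied in this paper, each node would instead hold a block of the banded matrix and apply its local incomplete factorization, with communication only at the block interfaces.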


**Author Contributions:** Conceptualization, Y.-H.F. and L.-H.W.; methodology, L.-H.W.; software, Y.J.; validation, Y.J., X.-G.L., and X.-X.Y.; formal analysis, Y.-H.F.; investigation, Y.-H.F.; resources, L.-H.W.; data curation, Y.J.; writing—original draft preparation, X.-G.L. and C.-C.C.; writing—review and editing, Y.-H.F.; visualization, Y.-H.F.; supervision, Y.-H.F.; project administration, L.-H.W. and C.-C.C.; funding acquisition, X.-X.Y.

**Funding:** This research was funded by the Natural Science Foundation of Shanxi Province, China (201801D221118), the National Natural Science Foundation of China (grant nos. 11802194 and 11602157) and the Taiyuan University of Science and Technology Scientific Research Initial Funding (TYUST SRIF. 20152027, 20162037).

**Conflicts of Interest:** The authors declare no conflict of interest. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us. We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.
