**Appendix B**

Required: coefficient matrix *A*, which is a symmetric positive-definite matrix; preconditioner *M* (the diagonal matrix of the diagonal elements of *A*); initial guess *x*<sub>0</sub>; and the vector *b* associated with grid block *A*<sub>i,j</sub>

1. Initialization: Given initial guess *x*<sub>0</sub>,
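The iteration outlined above follows the standard preconditioned conjugate gradient method. Below is a minimal sketch, assuming the textbook PCG formulation with the diagonal (Jacobi) preconditioner *M* = diag(*A*) named above; NEMO's Algorithm A1 may differ in step ordering and stopping criterion.

```python
# Minimal diagonally preconditioned CG sketch (textbook formulation, not
# NEMO's exact Algorithm A1). A is a dense SPD matrix as nested lists.

def pcg(A, b, x0, tol=1e-10, max_iter=100):
    n = len(b)
    mv = lambda M_, v: [sum(M_[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))

    x = list(x0)
    r = [bi - ci for bi, ci in zip(b, mv(A, x))]   # residual r = b - A x
    z = [r[i] / A[i][i] for i in range(n)]         # apply M^{-1} (diagonal)
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        Ap = mv(A, p)                  # matrix-vector multiplication (Step 5)
        alpha = rz / dot(p, Ap)        # dot product -> one global reduction
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            break
        z = [r[i] / A[i][i] for i in range(n)]     # precondition
        rz_new = dot(r, z)             # dot product -> second global reduction
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x
```

In a distributed run, the two dot products per iteration are the two global reductions counted later in the cost model, and the MVM is the step that triggers the halo update.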


NEMO divides the global domain into blocks and distributes them to the processes. Each process computes only the evolution procedures related to the grid points in its block and maintains a halo region to exchange data with its neighbors [9]. The following assumptions were made:


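This block decomposition determines each process's local problem size *n* and its four halo-exchange neighbors. Below is a minimal sketch assuming a perfect √p × √p process grid with row-major rank ordering over an *N* × *N* domain; NEMO's actual decomposition is more general (uneven divisions, suppression of land-only subdomains).

```python
# Sketch of a square block decomposition: local block side length and the
# four halo-exchange neighbor ranks (None at the physical domain boundary).
# Assumes p is a perfect square, N divisible by sqrt(p), row-major ranks.
import math

def block_layout(rank, N, p):
    q = math.isqrt(p)                # processes per side
    assert q * q == p and N % q == 0, "sketch assumes a perfect square grid"
    n = N // q                       # local block side length
    row, col = divmod(rank, q)
    neighbors = {
        "north": rank - q if row > 0 else None,
        "south": rank + q if row < q - 1 else None,
        "west":  rank - 1 if col > 0 else None,
        "east":  rank + 1 if col < q - 1 else None,
    }
    return n, neighbors
```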
Then, for each iteration, *t*<sub>comp</sub> is the cost of the computation, *t*<sub>b</sub> is the cost of the boundary updating, and *t*<sub>g</sub> is the cost of the global reduction. Below is a detailed estimation of these components.

(1) Computational cost

For Algorithm A1, the computational cost, *t*<sub>comp</sub>, contains three parts:

• Step 5, with a total of 9*n*<sup>2</sup> operations from one matrix-vector multiplication (MVM).


$$t\_{comp} = K \left( 9n^2 + 6n^2 + 4n^2 \right) \sigma = 19 \frac{N^2}{p} \sigma K \tag{A3}$$

where *σ* is the execution time per floating-point operation. If the domain size *N* is held constant, *t*<sub>comp</sub> decreases toward a lower bound of zero as the number of processes increases.
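To make this scaling concrete, Equation (A3) can be evaluated numerically. The values of *K*, *N*, and *σ* below are hypothetical, chosen only to illustrate the 1/*p* behavior: quadrupling the process count quarters the computational cost.

```python
# Illustrative evaluation of Equation (A3); all constants are assumed values.
K = 1000        # number of iterations (assumed)
N = 3600        # global domain side length (assumed)
sigma = 1e-9    # execution time per floating-point operation, s (assumed)

def t_comp(p):
    """Computational cost of K PCG iterations on p processes, Eq. (A3)."""
    return 19 * N**2 / p * sigma * K

costs = {p: t_comp(p) for p in (16, 64, 256, 1024)}
```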

(2) Communication cost

There are two kinds of communication in the solver. The first is the boundary update. After the MVM and the nondiagonal preconditioning, a boundary update occurs in the halo region of each process, which requires one or more boundary layers to be refreshed. In NEMO, since each process covers one block and a single halo layer is used, only one boundary update is needed per iteration. The total boundary updating cost depends on the data volume of the halo regions and the network latency. When the halo size is 1, the data volume on each boundary is *n*, and there are four boundaries for each block (east, west, north, and south). The total boundary updating cost per iteration therefore satisfies:

$$t\_b = K(4\theta + 4n\varepsilon) = K\left(4\theta + \left(\frac{4N}{\sqrt{p}}\right)\varepsilon\right) \tag{A4}$$

where *θ* is the point-to-point communication latency per message and *ε* is the transfer time per floating-point datum. The boundary updating time also decreases as the number of processes increases but has a lower limit of 4*θ*. The second communication cost is due to the global reduction and can be estimated with a similar method. There are two global reductions per iteration, each of which incurs both local computing time (for the dot products) and communication latency. Considering a binomial tree, the communication latency cost is *θ* log<sub>2</sub>*p* [10]. The global reduction execution time can therefore be regarded as:

$$t\_g = K \left( 2 \frac{N^2}{p} \sigma + \theta \log\_2 p \right) \tag{A5}$$

As the process number *p* increases, the computing-time component decreases while the communication latency component increases monotonically. Combining the computational cost with the boundary updating and global reduction costs, the execution time of one step of the NEMO PCG solver can be described as:

$$t\_{\rm pcg} = \left(t\_{\rm comp} + t\_{\rm comu}\right) = \left(t\_{\rm comp} + t\_{\rm b} + t\_{\rm g}\right) = K\left[21\frac{N^2}{p}\sigma + \left(\frac{4N}{\sqrt{p}}\right)\varepsilon + (4 + \log\_2 p)\theta\right] \tag{A6}$$

where *K* is the number of iterations, which is constant regardless of how many processes are used [23].

Equation (A6) shows that the more processes are used, the less time is required for computation and boundary updating. However, the time required for the global reduction increases with the number of processes. Therefore, once the process number exceeds a certain limit, the global reduction cost becomes dominant and the total time of the NEMO PCG solver increases monotonically.
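This trade-off can be illustrated by evaluating Equation (A6) over a range of process counts. The hardware constants *σ*, *ε*, *θ* and the problem size below are hypothetical, chosen only to expose the interior minimum: the 1/*p* and 1/√p terms shrink while the log<sub>2</sub>*p* latency term grows.

```python
# Numerical illustration of Equation (A6). All constants are assumed values,
# not measurements; they only demonstrate the shape of the cost curve.
import math

K = 1000          # number of iterations (assumed)
N = 3600          # global domain side length (assumed)
sigma = 1e-9      # s per floating-point operation (assumed)
eps = 1e-8        # s per transferred floating-point datum (assumed)
theta = 1e-5      # point-to-point latency per message (assumed)

def t_pcg(p):
    """One full PCG solve on p processes, Eq. (A6)."""
    return K * (21 * N**2 / p * sigma
                + (4 * N / math.sqrt(p)) * eps
                + (4 + math.log2(p)) * theta)

procs = [4 ** k for k in range(1, 10)]   # 4, 16, ..., 262144
best = min(procs, key=t_pcg)             # process count minimizing t_pcg
```

With these assumed constants the cost first falls and then rises again, so the optimum lies strictly inside the scanned range, matching the argument above.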
