**Appendix C**


In the CA-PCG algorithm, the iterations are divided into two nested loops. In the outer loop, a global reduction is still needed to obtain the Gram matrix [13], which is then used in the inner loop. Since we used the same preconditioner as for PCG, the convergence rate does not change, and the iteration number *K* should be the same. If *s* is the inner loop count, then the outer loop count is *K*/*s*, which means that one global reduction is needed after every *s* iterations. As the inner loop count increases, the optimization efficiency improves, but the computational accuracy decreases. After multiple tests, we found that an inner loop count of eight satisfied both optimization efficiency and computational accuracy, so *s* was set to eight in this study. In the PCG algorithm, by contrast, one global reduction is needed after each iteration. Thus, the number of global reductions was reduced to 1/8 of that in the PCG algorithm. Notably, even though the global communication cost was greatly reduced, the computational cost increased compared to the PCG algorithm.
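The reduction counts described above can be illustrated with a minimal sketch that only mimics the loop structure and performs no numerical work. The function names and the value *K* = 800 are hypothetical and chosen purely for illustration; only the "one reduction per iteration" and "one reduction per outer iteration" rules come from the text.

```python
def count_reductions_pcg(K):
    """PCG: one global reduction after each of the K iterations."""
    reductions = 0
    for _ in range(K):
        reductions += 1
    return reductions


def count_reductions_capcg(K, s):
    """CA-PCG: one global reduction (Gram matrix) per outer iteration,
    followed by up to s inner iterations that need no further reduction."""
    reductions = 0
    done = 0
    while done < K:
        reductions += 1                  # outer loop: build the Gram matrix
        for _ in range(min(s, K - done)):
            done += 1                    # inner loop: reuses the Gram matrix
    return reductions


K, s = 800, 8                            # K is an illustrative value only
n_pcg = count_reductions_pcg(K)
n_capcg = count_reductions_capcg(K, s)
print(n_pcg, n_capcg, n_capcg / n_pcg)
```

With *s* = 8 the ratio of the two counts is 1/8 whenever *K* is a multiple of *s*; otherwise the last outer iteration is only partially filled and the ratio is slightly larger.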

Assuming that the complexity of PCG is *N*<sub>comp</sub>, the complexity of CA-PCG is (2*s* + 1)<sup>2</sup>*N*<sub>comp</sub>, which comes from steps 4, 10, 11, 12, 13, 14, and 16 in Algorithm A2, especially step 4. To reduce the computational cost of CA-PCG, we implemented optimizations that lower the complexity to 6*s*·*N*<sub>comp</sub>. The hotspots of CA-PCG were the basis computation in step 4 and the Gram matrix computation in step 6, while the costs of the inner iterations (steps 10 to 14) were negligible. The count of matrix-vector multiplications (MVMs) in step 4 was 2*s* − 1, approximately twice that of the corresponding PCG iterations, which could be amortized by data parallelization. The complexity of the inner products in step 6 was (2*s* + 1)<sup>2</sup> with a naive implementation, or approximately (2*s* + 1)<sup>2</sup>/2 when exploiting the symmetry of the matrices *A* and *M*<sup>−1</sup>. However, this cost was still high and could be reduced further. With the choice of the monomial basis and a diagonal matrix *M*<sup>−1</sup>, the element *G*<sub>*k*</sub>(*i*, *j*) of the Gram matrix is the product of (*p*<sub>*sk*+1</sub> or *z*<sub>*sk*+1</sub>)<sup>T</sup>, *M*(*M*<sup>−1</sup>*A*)<sup>*i*−1</sup>(*M*<sup>−1</sup>*A*)<sup>*j*−1</sup>, and (*p*<sub>*sk*+1</sub> or *z*<sub>*sk*+1</sub>). Consequently, *G*<sub>*k*</sub>(*i*<sub>1</sub>, *j*<sub>1</sub>) and *G*<sub>*k*</sub>(*i*<sub>2</sub>, *j*<sub>2</sub>) are equal whenever they lie in the same submatrix (e.g., *P*<sub>*k*</sub><sup>T</sup>*MP*<sub>*k*</sub>) and *i*<sub>1</sub> + *j*<sub>1</sub> = *i*<sub>2</sub> + *j*<sub>2</sub>. As a result, the numbers of inner products needed for the submatrices *P*<sub>*k*</sub><sup>T</sup>*MP*<sub>*k*</sub>, *P*<sub>*k*</sub><sup>T</sup>*MZ*<sub>*k*</sub>, and *Z*<sub>*k*</sub><sup>T</sup>*MZ*<sub>*k*</sub> were 2*s* + 1, 2*s*, and 2*s* − 1, respectively. In total, the number of inner products was reduced to 6*s* at the algorithmic level, compared to 2*s* in the corresponding PCG iterations.
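The index-sum property of the Gram matrix can be checked numerically on a small problem. The sketch below is not the NEMO implementation: the tridiagonal matrix, the vectors, and all function names are hypothetical. It assumes that the *p* basis has *s* + 1 columns and the *z* basis has *s* columns, which is consistent with the (2*s* + 1) × (2*s* + 1) Gram matrix and the counts 2*s* + 1, 2*s*, and 2*s* − 1 given in the text.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def monomial_basis(A, m_inv, v, ncols):
    """Columns v, (M^-1 A) v, ..., (M^-1 A)^(ncols-1) v for diagonal M^-1."""
    cols = [list(v)]
    for _ in range(ncols - 1):
        w = matvec(A, cols[-1])
        cols.append([mi * wi for mi, wi in zip(m_inv, w)])
    return cols


def gram_block(U, V, m_diag):
    """Naive block U^T M V for diagonal M: one inner product per entry."""
    return [[sum(ui * mi * vi for ui, mi, vi in zip(u, m_diag, v)) for v in V]
            for u in U]


def is_hankel(G, rel_tol=1e-9):
    """True if all entries sharing the same index sum i + j agree."""
    first = {}
    for i, row in enumerate(G):
        for j, g in enumerate(row):
            ref = first.setdefault(i + j, g)
            if abs(g - ref) > rel_tol * max(1.0, abs(ref)):
                return False
    return True


def distinct_inner_products(nrows, ncols):
    """Number of distinct index sums i + j in an nrows x ncols block."""
    return nrows + ncols - 1


# Hypothetical test problem: symmetric tridiagonal A, diagonal (Jacobi) M.
n, s = 12, 4
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = 2.0 + 0.1 * i
    if i + 1 < n:
        A[i][i + 1] = A[i + 1][i] = -1.0
m_diag = [A[i][i] for i in range(n)]
m_inv = [1.0 / d for d in m_diag]

p_vec = [1.0 + 0.5 * i for i in range(n)]
z_vec = [(-1.0) ** i / (i + 1.0) for i in range(n)]

P = monomial_basis(A, m_inv, p_vec, s + 1)   # assumed: s + 1 columns
Z = monomial_basis(A, m_inv, z_vec, s)       # assumed: s columns

blocks = {
    "PMP": gram_block(P, P, m_diag),
    "PMZ": gram_block(P, Z, m_diag),
    "ZMZ": gram_block(Z, Z, m_diag),
}
all_hankel = all(is_hankel(G) for G in blocks.values())
total = sum(distinct_inner_products(len(G), len(G[0])) for G in blocks.values())
print(all_hankel, total, 6 * s)
```

The blocks are built naively here only so that the property can be verified; an optimized version would compute one inner product per distinct index sum and copy it along each anti-diagonal.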

The details about the computational and communication cost estimation after optimization are provided here. The communication cost comes from the boundary updating in steps 5 and 17 and the global reduction cost of step 6. Thus, we can obtain the following:

$$t_{comp} = 19K \ast 6s\,\frac{N^2}{p}\sigma \tag{A7}$$

$$t_b = \left(\frac{K}{s}\right)(8\theta + 8n\varepsilon) = \left(\frac{K}{s}\right)\left(8\theta + \left(\frac{8N}{\sqrt{p}}\right)\varepsilon\right) \tag{A8}$$

$$t_g = \left(2s\frac{N^2}{p}\sigma + \theta\log_2 p\right)\left(\frac{K}{s}\right) = 2K\frac{N^2}{p}\sigma + \left(\frac{K}{s}\right)\theta\log_2 p \tag{A9}$$

$$t_{\rm capcg} = (19K \ast 6s + 2K) \frac{N^2}{p} \sigma + \left(\frac{K}{s}\right) \left(\frac{8N}{\sqrt{p}}\right) \varepsilon + \left(\frac{K}{s}\right) (8 + \log_2 p) \theta \tag{A10}$$

According to how the terms depend on the processor number *p*, we divided *t*<sub>capcg</sub> into two parts, *F*<sup>(1)</sup><sub>capcg</sub> and *F*<sup>(2)</sup><sub>capcg</sub>:

$$F_{\rm capcg}^{(1)} = (19K \ast 6s + 2K) \frac{N^2}{p} \sigma + \left(\frac{K}{s}\right) \left(\frac{8N}{\sqrt{p}}\right) \varepsilon \tag{A11}$$

$$F_{\rm capcg}^{(2)} = \left(\frac{K}{s}\right)(8 + \log_2 p)\theta \tag{A12}$$

Applying the same decomposition to Equation (A6), we can rewrite the cost of the PCG algorithm as:

$$t_{\text{pcg}} = F_{\text{pcg}}^{(1)} + F_{\text{pcg}}^{(2)}$$

where:

$$F_{\text{pcg}}^{(1)} = 21K \frac{N^2}{p} \sigma + K \left( \frac{4N}{\sqrt{p}} \right) \varepsilon \tag{A13}$$

$$F_{\text{pcg}}^{(2)} = K (4 + \log_2 p) \theta \tag{A14}$$

Here, *s* is the inner loop count. As shown in Equation (A12), the cost of global communication in CA-PCG decreases as *s* increases. However, *s* has an upper limit, which depends on the basis chosen in the algorithm. Since the monomial basis was used in this paper, *s* should not exceed 8; otherwise, the solver cannot achieve convergence regardless of the iteration number [24].

From Equations (A11) and (A13), we can determine that the computational cost of the CA-PCG algorithm is greater than that of the PCG algorithm. However, for a fixed workload, the computational cost tends toward its lower limit of zero as *p* increases. From Equations (A12) and (A14), we found that as *p* increases, the global reduction cost in CA-PCG falls to approximately 1/*s* of that of the PCG algorithm. Moreover, we can predict that there is a certain process number *p*<sub>inf</sub>. When *p* < *p*<sub>inf</sub>, the computational cost is dominant, and the performance of the CA-PCG method is even worse than that of the PCG method. However, when *p* > *p*<sub>inf</sub>, the communication bottleneck due to the global reductions becomes considerable, and the CA-PCG method becomes very effective in reducing the communication cost compared to the PCG method. Thus, the scalability is improved effectively with the CA-PCG solver.
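The existence of a crossover process number can be explored by evaluating Equations (A11) to (A14) directly. The sketch below makes several assumptions. It reads the "19*K* ∗ 6*s*" term as a product. It takes σ, θ, and ε to be the per-operation compute time, the message latency, and the per-element transfer time. All numerical values are invented for illustration and are not measurements from this study. The function names and the power-of-two scan are likewise hypothetical.

```python
import math


def f1_capcg(p, N, K, s, sigma, eps):
    """Equation (A11), reading '19K * 6s' as a product (assumption)."""
    return ((19 * K * 6 * s + 2 * K) * N**2 / p * sigma
            + (K / s) * (8 * N / math.sqrt(p)) * eps)


def f2_capcg(p, K, s, theta):
    """Equation (A12): latency part of CA-PCG."""
    return (K / s) * (8 + math.log2(p)) * theta


def f1_pcg(p, N, K, sigma, eps):
    """Equation (A13)."""
    return 21 * K * N**2 / p * sigma + K * (4 * N / math.sqrt(p)) * eps


def f2_pcg(p, K, theta):
    """Equation (A14): latency part of PCG."""
    return K * (4 + math.log2(p)) * theta


def t_capcg(p, N, K, s, sigma, theta, eps):
    return f1_capcg(p, N, K, s, sigma, eps) + f2_capcg(p, K, s, theta)


def t_pcg(p, N, K, s, sigma, theta, eps):
    # s is unused for PCG; it is accepted so both models share one signature.
    return f1_pcg(p, N, K, sigma, eps) + f2_pcg(p, K, theta)


def find_p_inf(max_exp=30, **params):
    """Smallest power of two p at which the CA-PCG model beats the PCG model."""
    for e in range(1, max_exp + 1):
        p = 2 ** e
        if t_capcg(p, **params) < t_pcg(p, **params):
            return p
    return None


# Invented machine and problem parameters (illustration only).
params = dict(N=1000, K=1000, s=8, sigma=1e-9, theta=1e-5, eps=1e-8)
p_inf = find_p_inf(**params)
print("crossover process count (powers of two only):", p_inf)
```

Because only powers of two are scanned, the returned value is an upper bracket on the crossover rather than its exact location, and it shifts strongly with the assumed ratio of θ to σ.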

Even though the CA-PCG solver can improve the performance of NEMO, its correctness and accuracy must be ensured. After optimization, the solution of the barotropic solver should be the same as, or very close to, the original solution. In both the PCG and CA-PCG solvers, the solution is an approximate value that satisfies the convergence condition (Algorithms A1 and A2). If the solution of the CA-PCG solver is correct, it should also make the PCG solver achieve convergence. Thus, we used the most direct method to verify the correctness of the result. After applying the solution of the CA-PCG solver as the initial guess of the PCG solver, we found that only one iteration was needed to reach convergence. Therefore, the solution of the new CA-PCG solver can be regarded as a correct solution of the elliptic equation in the barotropic mode of NEMO.
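The verification procedure can be sketched as a restart test: feed a candidate solution to PCG as the initial guess and count the iterations needed to converge. The example below is a simplified illustration and not the NEMO code. It uses a small, hypothetical tridiagonal system and a Jacobi preconditioner. Since CA-PCG is not implemented here, the candidate comes from a tighter-tolerance PCG run that merely stands in for a CA-PCG solution. The convergence test is placed after each update, so at least one iteration is always counted.

```python
import math


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def pcg(A, b, x0, tol, max_iter=500):
    """Jacobi-preconditioned CG; returns (solution, iterations used)."""
    m_inv = [1.0 / A[i][i] for i in range(len(b))]
    x = list(x0)
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    z = [mi * ri for mi, ri in zip(m_inv, r)]
    p = list(z)
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        pAp = dot(p, Ap)
        if pAp == 0.0:
            return x, k              # search direction vanished: x is exact
        alpha = rz / pAp
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, k              # converged after k iterations
        z = [mi * ri for mi, ri in zip(m_inv, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Hypothetical, well-conditioned test system (not the NEMO elliptic equation).
n = 30
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = 4.0
    if i + 1 < n:
        A[i][i + 1] = A[i + 1][i] = -1.0
b = [1.0] * n

# Stand-in for the CA-PCG solution: a tighter-tolerance PCG run from zero.
x_candidate, first_iters = pcg(A, b, [0.0] * n, tol=1e-12)

# Verification step: restart PCG from the candidate at the working tolerance.
_, restart_iters = pcg(A, b, x_candidate, tol=1e-8)
print(first_iters, restart_iters)
```

A restart count of one indicates that the candidate already satisfies the PCG convergence condition; a larger count would indicate that the candidate is less accurate than the working tolerance requires.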
