*2.4. CA-PCG Solver*

The scalability bottleneck of NEMO, which is mainly caused by the global reduction in the PCG solver, has rarely been studied until now. To address this problem, a communication-avoiding Krylov subspace method was implemented with PCG (CA-PCG), following Carson's work [13]. PCG requires an inner product operation in every iteration, which means a global reduction cost for each iteration. In CA-PCG, by contrast, the iterations are divided into two nested loops, and global reductions are only needed in the outer loop. Since we used the same preconditioner as for PCG, the iteration number K should not change. If *s* is the inner loop count, then the outer loop count will be K/*s*, which means one global reduction is needed after every *s* iterations. After multiple tests, the inner loop count *s* was set to 8 in this study, so the frequency of the inner product operation was reduced to once every eight iterations. As a result, the global communication cost can decrease dramatically, and the scalability at high core counts can be improved effectively. The pseudocode for the CA-PCG algorithm designed for NEMO is shown in Appendix C.
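
The reduction counts described above can be illustrated with a minimal sketch. It is not the NEMO implementation: the function names and the iteration count K = 400 are hypothetical, and it assumes, as in the text, one global reduction per PCG iteration and one per outer loop of CA-PCG. When K is not a multiple of *s*, the sketch rounds the outer loop count up.

```python
def pcg_reductions(num_iters: int) -> int:
    """Global reductions for PCG, assuming one reduction per iteration."""
    if num_iters < 0:
        raise ValueError("num_iters must be non-negative")
    return num_iters


def ca_pcg_reductions(num_iters: int, s: int) -> int:
    """Global reductions for CA-PCG, assuming one reduction per outer loop.

    Each outer loop covers s inner iterations, so the count is
    ceil(num_iters / s), computed here with integer arithmetic.
    """
    if num_iters < 0:
        raise ValueError("num_iters must be non-negative")
    if s < 1:
        raise ValueError("s must be at least 1")
    return (num_iters + s - 1) // s


if __name__ == "__main__":
    K = 400  # hypothetical iteration count, not a value from the study
    s = 8    # inner loop count used in this study
    print("PCG reductions:   ", pcg_reductions(K))
    print("CA-PCG reductions:", ca_pcg_reductions(K, s))
```

Under these assumptions the number of global reductions drops by a factor of about *s*, while the total number of iterations K stays the same.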
