#### *2.2. Barotropic Solver*

As discussed above, solving an elliptic equation is the most time-consuming part of the barotropic mode on large-scale clusters. The elliptic equation is given by [11]:

$$\left[\nabla \cdot H\nabla - \frac{1}{\tau^{2}}\right] \eta^{n+1} = \psi\left(\eta^{n}, \eta^{n-1}, \tau\right) \tag{1}$$

where *H* represents the depth of the ocean, *τ* is the time step, *η^n* is the sea surface height (SSH) at the *n*-th time step, and *ψ* is a function expressing the influence of the previous SSH states and the forcing on the next state.

At each time step, the SSH at the next time step is obtained by solving this elliptic equation. Equation (1) is discretized on a two-dimensional orthogonal curvilinear grid using a five-point stencil in NEMO and can be rearranged into a symmetric linear system *A*x = *b*. The PCG method iteratively searches for the solution x.
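To illustrate how a five-point stencil turns the operator in Equation (1) into a symmetric matrix, the sketch below assembles *A* on a small uniform grid. The grid size, unit spacing, constant depth, and homogeneous Dirichlet boundaries are simplifying assumptions for this example, not NEMO's actual curvilinear discretization.

```python
import numpy as np

def assemble_five_point(n, h_depth, tau):
    """Assemble A for the operator [div(H grad) - 1/tau^2] on an
    n x n grid with unit spacing, constant depth h_depth, and
    homogeneous Dirichlet boundaries (simplified assumptions)."""
    N = n * n
    A = np.zeros((N, N))
    idx = lambda i, j: i * n + j  # flatten 2-D grid index to 1-D
    for i in range(n):
        for j in range(n):
            k = idx(i, j)
            # center of the five-point stencil, plus the -1/tau^2 term
            A[k, k] = -4.0 * h_depth - 1.0 / tau**2
            # the four neighbors (east, west, north, south)
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    A[k, idx(ii, jj)] = h_depth
    return A

A = assemble_five_point(n=8, h_depth=100.0, tau=0.5)
print(np.allclose(A, A.T))  # the five-point system is symmetric
```

The stencil couples each grid point only to its four neighbors, which is what makes *A* sparse and symmetric and hence amenable to a conjugate-gradient solver.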

#### *2.3. NEMO PCG Solver*

The PCG used in NEMO is a modified version of the typical PCG. However, the modification does not reduce the number of global reductions (MPI\_Allreduce). Since global reductions are the main bottleneck of the NEMO model, the optimization in this study was still based on the typical PCG for simplicity. The typical PCG method (Algorithm A1) contains three major parts: computing, boundary updating, and global reduction. The computing part consists of matrix-vector multiplications (MVMs) and vector–vector multiplications (VVMs) and displays good scalability. Boundary updating is required to refresh the halo region after each MVM, but it poses no scalability issue on large-scale clusters. The global reductions required by the inner products are the key problem on a large-core-count cluster. The details of the communication costs and the performance analysis are given in Appendix B.
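A serial sketch of the typical PCG makes the three parts concrete; the comments mark where each operation would fall in a distributed run. This is the textbook algorithm with an assumed Jacobi (diagonal) preconditioner, not NEMO's actual implementation, and the test system below is an arbitrary SPD matrix chosen for illustration.

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, max_iter=500):
    """Textbook preconditioned conjugate gradient (serial sketch)."""
    x = np.zeros_like(b)
    r = b - A @ x                  # computing: MVM (halo update follows in MPI)
    z = M_inv @ r                  # computing: preconditioner apply
    p = z.copy()
    rz = r @ z                     # global reduction: inner product
    for _ in range(max_iter):
        Ap = A @ p                 # computing: MVM (halo update follows in MPI)
        alpha = rz / (p @ Ap)      # global reduction: inner product
        x += alpha * p             # computing: VVM update, no communication
        r -= alpha * Ap            # computing: VVM update, no communication
        if np.sqrt(r @ r) < tol:   # global reduction: residual norm
            break
        z = M_inv @ r              # computing: preconditioner apply
        rz_new = r @ z             # global reduction: inner product
        p = z + (rz_new / rz) * p  # computing: VVM update, no communication
        rz = rz_new
    return x

# usage: an arbitrary SPD system with a Jacobi preconditioner
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T + 50 * np.eye(50)       # symmetric positive definite
b = rng.standard_normal(50)
M_inv = np.diag(1.0 / np.diag(A))   # Jacobi (diagonal) preconditioner
x = pcg(A, b, M_inv)
print(np.allclose(A @ x, b, atol=1e-6))
```

Every inner product in the loop becomes an MPI\_Allreduce when the vectors are distributed across ranks, which is why the iteration count and the per-iteration reduction count dominate the communication cost at scale.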
