#### **3. Results**

#### *3.1. Validity of the Results Using the CA-PCG Solver*

We verified the model results obtained with the CA-PCG solver using the revised CESM-PVT method described above. Figure 4 shows the distribution of the z-score biases of the temperature variable in 101 ensemble runs with the PCG solver, together with the CA-PCG experimental result. As temperature observations are more accessible than those of other variables, and temperature is one of the fundamental physical quantities in ocean simulation, we chose the three-dimensional temperature field for this evaluation. To pass the CESM-PVT accuracy check, the z-score bias of the experiment using the CA-PCG solver should be no larger than those of the ensemble simulations. As shown in Figure 4, the z-score bias of the CA-PCG experiment fell within the distribution of the ensemble z-score biases, showing good consistency with the results simulated using the PCG solver. Overall, the model results simulated using the CA-PCG solver were accurate, and the CA-PCG solver can be used in the NEMO model in future research.
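For illustration, a minimal Python sketch of how such a z-score bias can be computed for one experiment against an ensemble is given below; the array shapes, the RMS aggregation over grid points, and the function name `zscore_bias` are our own assumptions for exposition, not the exact CESM-PVT formulation.

```python
import numpy as np

def zscore_bias(exp_field, ensemble_fields, eps=1e-12):
    """RMS z-score of an experiment field against an ensemble (illustrative).

    exp_field:       (nz, ny, nx) temperature field from the test run.
    ensemble_fields: (nens, nz, ny, nx) fields from the ensemble runs.
    """
    mu = ensemble_fields.mean(axis=0)              # ensemble mean per grid point
    sigma = ensemble_fields.std(axis=0, ddof=1)    # ensemble spread per grid point
    z = (exp_field - mu) / np.maximum(sigma, eps)  # per-point z-score
    return float(np.sqrt(np.mean(z ** 2)))         # aggregate to one scalar bias

# Illustrative usage with random data standing in for model output:
rng = np.random.default_rng(0)
ens = rng.normal(size=(101, 31, 64, 64))   # 101 ensemble members
exp = rng.normal(size=(31, 64, 64))        # the CA-PCG experiment
print(zscore_bias(exp, ens))               # ~1 for a statistically consistent run
```

Under this formulation, the test run is accepted when its bias falls within the spread of biases obtained by scoring each ensemble member the same way.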

**Figure 4.** Z-score biases of temperature in the ensemble runs (grey lines) and in the communication-avoiding Krylov subspace method with a preconditioned conjugate gradient (CA-PCG) solver experiment (black line). The dashed line represents the mean value of the z-score biases of all the ensemble runs.

#### *3.2. Performance of the NEMO Model Using the CA-PCG Solver*

Before investigating the performance of the model using the CA-PCG solver, it was important to compare the global reductions before and after optimization. As shown in Figure 5, the percentage of the total execution time spent in global reductions increased with the number of processes in the GYRE\_PCG experiment. When more than 8000 processes were used, the global reduction accounted for more than 94.5% of the total execution time, reaching 99.9% when using 24,000 processes. With the CA-PCG solver, however, this fraction was reduced to 63.4%. The ratio of the global reduction percentages in the two experiments was also calculated (Figure 6). Almost all the GYRE\_CAPCG-to-GYRE\_PCG ratios were less than 1; the exception occurred when using 400 processes, because of the increase in calculation time after optimization. When using more than 8000 processes, the ratio was approximately 0.70. Thus, the CA-PCG solver was effective in reducing the cost of global reduction, especially when scaled to more than 8000 processes.
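The fraction of runtime spent in global reductions can be measured as sketched below with mpi4py; the timer placement and the toy inner-product loop are illustrative assumptions, not the instrumentation actually used in NEMO.

```python
# Run with, e.g.: mpiexec -n 4 python measure_allreduce.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local = np.random.rand(100_000)              # this rank's share of a vector

t_start = MPI.Wtime()
t_reduce = 0.0
for _ in range(100):                         # stand-in for solver iterations
    partial = float(local @ local)           # local part of an inner product
    t0 = MPI.Wtime()
    total = comm.allreduce(partial, op=MPI.SUM)  # the MPI_Allreduce call
    t_reduce += MPI.Wtime() - t0
t_total = MPI.Wtime() - t_start

if comm.rank == 0:
    print(f"global reduction share: {100.0 * t_reduce / t_total:.1f}%")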

**Figure 5.** Percentage of the total execution time spent in global reduction (MPI\_Allreduce) in 5-km Nucleus for European Modelling of the Ocean (NEMO) experiments using the preconditioned conjugate gradient (PCG) (blue line) and CA-PCG (red line) solvers.

**Figure 6.** Ratio of the percentage of the total execution time spent in global reduction (MPI\_Allreduce) using the CA-PCG solver to that using the PCG solver.

In particular, the barotropic mode in GYRE\_PCG accounted for approximately 97–99% of the total execution time when scaled to more than 8000 processes (Figure 7). After introducing the new CA-PCG solver, this percentage decreased slightly to approximately 92–96%. Even though the reduction in the percentage does not seem notable, the execution time of the barotropic mode decreased from more than 17,000 s to less than 6000 s (figure not shown).

**Figure 7.** Speedup of 5-km NEMO experiments using the PCG (solid blue line) and CA-PCG (blue solid-star line) solvers, and percentage of the total execution time spent in the barotropic mode using the PCG (solid red line) and CA-PCG (red solid-star line) solvers.

The total execution times for different numbers of processes in both GYRE\_PCG and GYRE\_CAPCG are shown in Table 1. In the GYRE\_PCG experiment, the total execution time decreased as the number of processes increased up to 6000 processes, reaching a minimum of 1510 s, owing to the decrease in computation cost. However, it increased sharply to more than 18,000 s when using more than 10,000 processes because of the dramatic increase in global reduction. Comparing the total execution times of GYRE\_PCG and GYRE\_CAPCG, we found that the differences were not considerable with fewer than 6000 processes; in some cases, the execution time in GYRE\_CAPCG was even longer than that in GYRE\_PCG because of the additional computation required. However, as the model scaled further, especially beyond 10,000 processes, the total execution time of GYRE\_CAPCG decreased to 4000–6000 s, much shorter than that of GYRE\_PCG. After optimization, there was a 3.56–4.40-fold improvement at large scales, which effectively reduced the model simulation costs.


**Table 1.** Execution time for the high-resolution experiments.

Next, we tested the overall performance of the NEMO model. The speedup was also greatly improved after optimization, especially at large scales (Table 1). With the PCG solver, the speedup relative to 400 processes was 4.65 when using 6000 processes; however, it decreased to less than 1.0 when using more than 8000 processes, whereas in GYRE\_CAPCG the speedup remained greater than 1.0 at large scales. Comparing the speedups of GYRE\_PCG and GYRE\_CAPCG (Figure 8), we found that the ratio of GYRE\_CAPCG to GYRE\_PCG was larger than 1 when using more than 6000 processes and reached 4.61 when using more than 10,000 processes, which demonstrates a substantial improvement.
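The two metrics discussed above can be reproduced with a few lines of arithmetic, as sketched below: the speedup relative to the 400-process run, S(p) = T(400)/T(p), and the per-scale improvement factor T_PCG(p)/T_CAPCG(p). The timings used here are placeholder values chosen only to be roughly consistent with the text, not the actual Table 1 entries.

```python
# Placeholder execution times in seconds (assumed, not taken from Table 1).
t_pcg   = {400: 7000.0, 6000: 1510.0, 12000: 18500.0}
t_capcg = {400: 7200.0, 6000: 1600.0, 12000: 4500.0}

for p in sorted(t_pcg):
    s_pcg = t_pcg[400] / t_pcg[p]        # speedup relative to 400 processes
    s_capcg = t_capcg[400] / t_capcg[p]
    gain = t_pcg[p] / t_capcg[p]         # fold improvement at this scale
    print(f"{p:>6} procs: S_PCG={s_pcg:5.2f}  S_CAPCG={s_capcg:5.2f}  "
          f"T_PCG/T_CAPCG={gain:5.2f}")
```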

**Figure 8.** Ratio of the speedup of the NEMO model using the CA-PCG solver to that using the PCG solver (blue line), and ratio of the percentage of the total execution time spent in the barotropic mode using the CA-PCG solver to that using the PCG solver (red line).

#### **4. Discussion and Conclusions**

As the number of processes used has increased, the barotropic solver in the NEMO model has become the bottleneck on large-scale clusters. The most time-consuming part is solving the elliptic equations, for which the PCG method is recommended in NEMO. In this work, we developed and implemented the CA-PCG method in the model to address this problem. With the newly developed CA-PCG solver, inner product operations are required only once every eight iterations instead of once per iteration, which greatly reduces the cost of global reduction. As a result, the execution time of the barotropic mode decreased markedly, and the speedup of NEMO after optimization was better than that before optimization.
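For reference, the sketch below shows a standard PCG loop with comments marking the inner products that each require one global reduction per iteration in a distributed run. The CA-PCG reorganization that defers this communication to once every eight iterations is not reproduced here; this is a baseline illustration under our own simplifying assumptions (dense NumPy operators, Jacobi preconditioner).

```python
import numpy as np

def pcg(A, b, M_inv, x0, tol=1e-8, max_iter=500):
    """Standard preconditioned conjugate gradient on an SPD system."""
    x = x0.copy()
    r = b - A @ x
    z = M_inv @ r
    p = z.copy()
    rz = r @ z                       # inner product -> one global reduction
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)        # inner product -> one global reduction
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:  # norm -> another global reduction
            break
        z = M_inv @ r
        rz_new = r @ z               # inner product -> one global reduction
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Tiny SPD test problem (1-D Laplacian with a Jacobi preconditioner).
n = 50
A = (np.diag(2.0 * np.ones(n))
     + np.diag(-1.0 * np.ones(n - 1), 1)
     + np.diag(-1.0 * np.ones(n - 1), -1))
M_inv = np.diag(1.0 / np.diag(A))
b = np.ones(n)
x = pcg(A, b, M_inv, np.zeros(n))
print(np.linalg.norm(A @ x - b))     # residual should be ~ 0
```

In a domain-decomposed run, each `@` inner product becomes a local partial sum followed by an MPI_Allreduce, which is exactly the communication that CA-PCG batches across iterations.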

Above all, the optimization method could be useful for optimizing the NEMO model at large scales, as shown in this study. It should be noted that the main effect of the CA-PCG solver is to reduce the cost of global reduction during simulation. However, the architecture of the cluster (such as the network bandwidth) and the resolution of the experiment will affect the execution time of global reduction and can therefore influence the efficiency of the CA-PCG solver. To further evaluate the new barotropic solver's effectiveness, the optimized model should be tested on different clusters, including those with a Heterogeneous System Architecture. Moreover, the accuracy of the results after optimization was only validated using small-scale simulation because of computation and storage limitations; validation at different grid resolutions, particularly large-scale ones, is critical and should be carried out in the future. In addition, it should be noted that the distribution of the ensemble z-score biases shows a seasonal variation (Figure 4), which also needs further analysis.

The experience gained from the NEMO model could benefit other OGCMs and even the ocean components of coupled models. In the Coupled Model Intercomparison Project phase 6 (CMIP6), approximately 25 climate models use NEMO as their ocean component (https://github.com/WCRP-CMIP/CMIP6\_CVs/blob/master/CMIP6\_source\_id.json (accessed on 10 April 2021)). Climate models usually require numerous calculations and long integration times and depend on high-performance computing resources even more than stand-alone ocean models do [20]. Therefore, improving the NEMO model's performance will benefit both stand-alone ocean process simulations and the climate models that use NEMO as their ocean component.

**Author Contributions:** Methodology, validation, analysis, and writing–original draft preparation, X.Y.; algorithm coding and description, S.Z. (Shan Zhou); numerical experiments, S.Z. (Shengchang Zhou); conceptualization, supervision, funding acquisition, and writing–review and editing, Z.S.; algorithm and numerical experiments' suggestion, W.L. All authors discussed, read, edited, and approved the article. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (nos. U1806205, 41906029, and 42022042) and the China–Korea Cooperation Project on Northwestern Pacific Climate Change and its Prediction.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data used within this study are the combination of the World Ocean Atlas 2009 and the Polar Science Center Hydrographic Climatology version 3 data, which can be obtained from the National Oceanographic Data Center (NODC) and Polar Science Center, respectively.

**Acknowledgments:** The authors are grateful to Junmin Lin and Zhe Wang for developing the algorithm.

**Conflicts of Interest:** The authors declare no conflict of interest.
