**4. Conclusions**

We have successfully implemented a parallel framework based on the serial GCCOM model, using domain decomposition and parallel linear solver methods from the PETSc libraries. The model has been validated using the seamount test case. Results show that the parallel version reproduces serial results to within acceptable ranges for key variables and scalars: around 10<sup>−5</sup> for the Krylov subspace pressure solver; 10<sup>−7</sup> to 10<sup>−8</sup> for the velocities; and on the order of 32-bit machine precision for the scalars *D* and *T*.
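
The tolerances quoted above come down to a maximum relative difference between the serial and parallel fields. The sketch below is illustrative only; the array names and the small-denominator guard are hypothetical, and it is not the validation harness used in this study.

```c
/* Illustrative sketch: maximum relative difference between a serial and a
 * parallel result field, the kind of metric behind the tolerances quoted
 * above. Array names and the 1e-30 guard are hypothetical. */
#include <math.h>
#include <stddef.h>

double max_rel_diff(const double *serial, const double *parallel, size_t n)
{
    double maxdiff = 0.0;
    for (size_t i = 0; i < n; ++i) {
        /* Guard against division by (near-)zero serial values. */
        double denom = fabs(serial[i]) > 1e-30 ? fabs(serial[i]) : 1.0;
        double d = fabs(serial[i] - parallel[i]) / denom;
        if (d > maxdiff) maxdiff = d;
    }
    return maxdiff; /* e.g., ~1e-7 to 1e-8 is expected for the velocity fields */
}
```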

Performance tests show that the measured simulation run-times follow the scaling of the core PETSc framework speed test (Streams). In some cases, the GCCOM model outperformed the Streams test because the problem size is large enough to offset the communication overhead. For the experiments run in this study, we obtained a speedup of roughly 80 on 240 cores, which closely follows (or exceeds) the speedup of the PETSc Streams test. This scaling behavior also indicates that we can expect further improvement when the model is migrated to a larger system with more memory and cores.
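
For reference, the speedup and efficiency figures above follow directly from measured wall-clock times. The sketch below is illustrative, with hypothetical timings chosen to reproduce a speedup of roughly 80 on 240 cores; it is not the timing harness used in these experiments.

```c
/* Sketch: speedup and parallel efficiency from wall-clock times.
 * The run-times t1 and tp are hypothetical; with the figures reported
 * above, a speedup of ~80 on 240 cores corresponds to ~33% efficiency. */
#include <stdio.h>

int main(void)
{
    double t1 = 9600.0; /* hypothetical single-core run-time (s) */
    double tp = 120.0;  /* hypothetical run-time on p cores (s)  */
    int    p  = 240;

    double speedup    = t1 / tp;     /* ~80   */
    double efficiency = speedup / p; /* ~0.33 */

    printf("speedup = %.1f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}
```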

Utilization of the PETSc libraries has proven to be of significant benefit, but using PETSc has its pros and cons. We found considerable savings in the time needed for HPC model development, which has immense value for our small research group, but the learning curve requires significant effort. For example, representing the Arakawa-C staggered grid with hand-coded Fortran matrices and MPI communication schemes proved extremely complex; this was achieved more effectively by employing the PETSc DM and DMDA parallelization paradigm for the array distribution and linear solvers. In addition, once the migration was complete, the model scaling improved, and adding and defining new scalars and variables, or testing different solvers, was greatly simplified. We note that the development and testing of those objects was challenging, and eventually became the topic of a master's thesis project. Recently, PETSc has started offering staggered distributed arrays, DMSTAG (which represents a "staggered grid" or a structured cell complex), which is something we will explore in the future.
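
As an illustration of the DM/DMDA paradigm mentioned above, the sketch below creates a 3-D distributed array with ghost layers and performs the halo exchange that would otherwise require hand-written MPI. It is a minimal sketch, not the GCCOM setup: the grid dimensions, degrees of freedom, stencil width, and boundary types are assumptions, an Arakawa-C layout would in practice use one such array per staggered variable location (or DMSTAG), and the PetscCall error macros assume a recent PETSc release.

```c
/* Minimal DMDA sketch (hypothetical configuration, not the GCCOM grid):
 * each rank receives a local sub-block of a 3-D structured grid plus ghost
 * layers, and the halo exchange reduces to a DMGlobalToLocal call pair. */
#include <petscdmda.h>

int main(int argc, char **argv)
{
    DM  da;
    Vec u_global, u_local;

    PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

    /* 128x128x64 grid, 1 degree of freedom per point, stencil width 2,
       box stencil; PETSc picks the process layout (PETSC_DECIDE). */
    PetscCall(DMDACreate3d(PETSC_COMM_WORLD,
                           DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                           DMDA_STENCIL_BOX,
                           128, 128, 64,
                           PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                           1, 2, NULL, NULL, NULL, &da));
    PetscCall(DMSetFromOptions(da));
    PetscCall(DMSetUp(da));

    PetscCall(DMCreateGlobalVector(da, &u_global));
    PetscCall(DMCreateLocalVector(da, &u_local));

    /* Halo exchange: fill the ghosted local vector from the global vector. */
    PetscCall(DMGlobalToLocalBegin(da, u_global, INSERT_VALUES, u_local));
    PetscCall(DMGlobalToLocalEnd(da, u_global, INSERT_VALUES, u_local));

    PetscCall(VecDestroy(&u_local));
    PetscCall(VecDestroy(&u_global));
    PetscCall(DMDestroy(&da));
    PetscCall(PetscFinalize());
    return 0;
}
```

Because the process layout is resolved at run time (PETSC_DECIDE and command-line options), setup code of this kind does not change when the core count changes, which is part of the simplification noted above.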

Based on our experience, we strongly recommend PETSc as a proven alternative for obtaining scalability in complex models without the need to build a custom parallel framework. As stated above, PETSc comes with a learning curve: the migration from an MPI-based model to the current PETSc model required more than two years. However, based on the improvement in GCCOM performance, our team feels that the adoption of the PETSc framework has been worth the effort.

Additionally, we find the PETSc-based model to be portable: we recently completed a prototype migration to the SDSC Comet system. The model ran to completion, but much work remains as we explore the optimal memory and core configuration. Another motivation for moving GCCOM to a system like Comet is access to optimized parallel I/O libraries and file systems (such as parallel NetCDF and the Lustre file system).

The current version of the parallel framework can be improved in several ways. Domain decomposition can be modified to take advantage of full partitioning in all three dimensions. However, this would require changing or replacing the existing pressure-gradient algorithm, which forces a vertical-slab decomposition because of self-recurrence in the spline integration over the column. The parallel model would also benefit from parallel file input/output and improvements in memory management. We plan to explore how the adoption of exascale software systems such as ADIOS, which manages data between nodes in parallel with the computations on the nodes, will benefit the GCCOM model [53].
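
One way to express the vertical-slab constraint in a DMDA setting is to fix the process counts in two of the three directions to one, so that each rank owns complete vertical columns and the recursive spline integration never crosses a process boundary. The sketch below is a hypothetical illustration (the grid sizes, the choice of x as the partitioned axis, and the helper name are assumptions, and it is not the GCCOM source); a full 3-D partition would instead pass PETSC_DECIDE for all three process counts.

```c
/* Hypothetical sketch: a DMDA laid out as vertical slabs. Fixing the
 * process counts in y and z to 1 keeps every vertical column on a single
 * rank, which the column-wise spline integration requires. */
#include <petscdmda.h>

PetscErrorCode create_slab_dmda(PetscInt M, PetscInt N, PetscInt P, DM *da)
{
    PetscFunctionBeginUser;
    PetscCall(DMDACreate3d(PETSC_COMM_WORLD,
                           DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                           DMDA_STENCIL_BOX,
                           M, N, P,
                           PETSC_DECIDE, 1, 1, /* partition along x only */
                           1, 2, NULL, NULL, NULL, da));
    PetscCall(DMSetUp(*da));
    /* A full 3-D partition would pass PETSC_DECIDE for all three process
       counts; that is the future change discussed above. */
    PetscFunctionReturn(0);
}
```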

In conclusion, the performance tests conducted in these experiments show that the PETSc-based parallel GCCOM model satisfies several of its primary goals. The results presented in this paper show that GCCOM is a parallel, scalable, multiphysics, and multiscale model: it scales to hundreds of cores (the limit of the test system); it can capture different types of phenomena, including fluid flow, nonhydrostatic pressure, thermodynamics, and sea surface height; and it can operate over physical scales that range from 10<sup>0</sup> to 10<sup>3</sup> m.

**Author Contributions:** Conceptualization, M.V., M.G., and J.E.C.; methodology, M.V., M.G., M.P.T., and J.E.C.; software, M.V., M.G. and M.P.T.; validation, M.V., M.G.; formal analysis, M.V. and M.P.T.; investigation, M.V.; resources, J.E.C.; data curation, M.V.; writing—original draft preparation, M.V., M.G., M.P.T.; writing—review and editing, M.V., M.P.T. and M.G.; visualization, M.V.; supervision, J.E.C.; project administration, J.E.C.; funding acquisition, J.E.C.

**Funding:** This research was supported by the Computational Science Research Center (CSRC) at San Diego State University (SDSU), the SDSU Presidents Leadership Fund, the CSU Council on Ocean Affairs, Science and Technology (COAST) Grant Development Program, and the National Science Foundation (OCI 0721656, CC-NIE 1245312, MRI Grant 0922702). Portions of this work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation Grant No. ACI-1548562.

**Acknowledgments:** The authors are grateful for the work done by Neelam Patel on the DMDA PETSc framework. We also want to acknowledge helpful conversations with Ryan Walter on the internal wave, beam, and lock release experiments used for validation, Paul Choboter for his contribution to the development of the current GCCOM model, and Jame Otto for his helpful insights into parallelization and hardware characterization.

**Conflicts of Interest:** The authors declare no conflict of interest.
