#### **1. Introduction**

As computational modeling and its supporting resources become ubiquitous, complex equations can be solved numerically with increasing resolution and accuracy. At the same time, as more variables and processes are taken into account, and spatial and temporal resolutions are increased to model real field-scale events, models become more complex, yet resource efficiency remains an important requirement. Multiscale and multiphysics modeling encompasses these factors and relies on High-Performance Computing (HPC) resources and services to solve problems effectively [1]. The models used for atmospheric and ocean studies are examples of such applications.

In atmospheric and ocean studies, one of the major challenges is the simulation of coastal ocean dynamics due to the vast range of length and time scales involved: tidal processes and oceanic currents evolve over fractions of days; wavelengths span kilometers; mixing and turbulence events must be resolved at the meter or submeter scale; and time scales range from years down to minutes or seconds. Consequently, field-scale simulations need to cover both ends of the time/space spectrum while preserving accuracy, forcing the use of highly detailed grids of considerable size. The General Curvilinear Coastal Ocean Model (GCCOM) is a 3D, curvilinear, large-eddy simulation model designed specifically for high-resolution (meter-scale) simulations of coastal regions, capable of capturing nonhydrostatic flow dynamics [2].

Another major challenge is the need to solve for nonhydrostatic pressures. Ocean and climate models are generally hydrostatic, which limits the physics they are able to capture, and large scale, so that they remain computationally efficient enough to produce forecasts in reasonable run-times. Historically, these models have opted to solve the more computationally efficient, but less accurate, hydrostatic version of the Navier–Stokes equations, because simulating nonhydrostatic processes under the Boussinesq approximation is computationally intensive, requiring the solution of an elliptic problem [3–5]. However, advances in HPC systems and algorithms have enabled the use of nonhydrostatic solvers in ocean models including GCCOM [6], MITgcm [7], SUNTANS [8], FVCOM [9], SOMAR [10], and KLOCAW [11].

A differentiating feature of these models is the method used to solve the nonhydrostatic pressure: the pressure can either be solved on the physical grid after taking the divergence of the momentum equation (Boussinesq), or reconstructed from the equation of state when density is chosen as the main scalar argument. In GCCOM, the pressure is solved on a fully 3D (not 2D plus sigma coordinates) computational grid (a normalized representation of the physical grid) that is created by applying a unique, general 3D curvilinear transformation to the Laplacian problem [2]. All computations are performed on the computational grid, which increases the accuracy of the results. The curvilinear coordinates formulation constrains the computational grid to be unitary and orthogonal, thus simplifying the boundary treatment and ensuring minimal energy loss by transforming any boundary geometry to a unitary cube, which in practice reduces the application of boundaries to a plane (or edge) of this cube. In contrast, the typical approach of approximating the grid cells to the problem geometry loses energy at the boundary in proportion to the grid resolution. A detailed comparison of the functionality of GCCOM with that of other nonhydrostatic ocean models can be found in [12].

One other defining characteristic of some of these nonhydrostatic models is the HPC framework used: many use the Message Passing Interface (MPI) framework (in some cases enhanced with OpenMP or accelerated using GPU resources), employ parallel file I/O, and have large teams developing advanced multiphysics modules. These models are often nested within large, global weather prediction models, such as those used in disaster response situations, which is a long-term goal of this project. GCCOM is a modern model: written in Fortran 95, modular, and employing NetCDF to manage the data. The first parallel version of an earlier GCCOM model (UCOAM) used a customized MPI-based parallel framework that scaled to a few hundred cores [13]. The UCOAM model has also been nested inside the California ROMS global ocean model [14,15]. Having demonstrated the multiscale and multiphysics capabilities of GCCOM [12], the next phase of development requires a more advanced HPC framework in order to speed up development and testing and to allow us to work with more realistic domains and algorithms. These factors motivated the decision to migrate the model to such a framework.

One additional factor in our project is the size of the team and the level of effort needed to produce and support this type of model. The larger model projects are often optimized for specific hardware that may not be portable, and they typically require large development teams. One of the goals of the GCCOM project is to deliver a portable model that can run in a heterogeneous computing environment and is useful to smaller research teams, but that could someday scale to run inside larger models. As the GCCOM model evolved, the PETSc (Portable Extensible Toolkit for Scientific Computing) libraries were chosen to improve the scaling and efficiency of the model and to improve access to advanced mathematical libraries and tools [16–18]. The results presented in this paper justify the time and effort required to accomplish these objectives.

In this paper, the outcome of this approach is presented. The background of the model, the motivation for parallelization, the parallel approach, and the choices for the underlying software stack are described in Sections 2.1–2.4. The methodology and specifics of the test case used to validate the parallel framework are described in Section 2.5, and the validation results showing that the parallelized model produces correct results are presented in Section 3.1. The parallel performance is presented in Section 3.2, with results showing that the prototype model scales with the PETSc framework up to the maximum size of the HPC system used for these tests. Section 3.3 contains results demonstrating the multiscale and multiphysics capabilities of the model. Conclusions and future work are discussed in Section 4.

#### **2. Materials and Methods**

#### *2.1. The General Curvilinear Coastal Ocean Model (GCCOM)*

The General Curvilinear Coastal Ocean Model (GCCOM) is a coastal ocean model that solves the three-dimensional Navier–Stokes equations with the Boussinesq approximation; a Large Eddy Simulation (LES) formulation with a subgrid-scale model makes it capable of handling strongly stratified environments [12]. GCCOM features include: an embedded fully 3D curvilinear transformation, which makes it uniquely equipped to handle non-convex features in every direction, including along the vertical axis [2]; a full 3D curvilinear Laplacian operator that solves a 3D nonhydrostatic pressure equation and accurately reproduces features resulting from the interaction of currents and steep bathymetries; and the ability to calculate solutions from the sub-meter to the kilometer range in one simulation. With these key features, GCCOM has been used to simulate coastal ocean processes in multiscale simulations ranging from oceanic currents and internal waves [15] to turbulence mixing and bore formation in a single scenario [19], as well as with data assimilation capabilities [20], including thermodynamics and turbulence mixing on high-resolution grids down to meter and sub-meter scales. Additionally, a key goal of the GCCOM model is to study turbulent mixing and internal waves in the coastal region, as a way to bridge the work of regional models (e.g., ROMS [21] or POP [22]) and global models (e.g., MPAS [23], HYCOM [24]) all the way to the sea–land interface.

GCCOM was recently validated for stratified oceanographic processes in [12] and has been coupled with ROMS [15] using nested grids to obtain greater resolution in a region of interest, using approaches similar to those of other coupled ocean model systems. In [25], an overset grid method is used to couple a hydrostatic, large-scale, coastal ocean flow model (FVCOM, the unstructured grid finite volume coastal ocean model) with a nonhydrostatic model tailored for high-fidelity, unsteady, small-scale flows (SIFOM, a solver for incompressible flow on overset meshes). This method presents a way to offset the computational cost of a single, comprehensive model capable of dealing with multiple types of physics. GCCOM is similar to SIFOM in its curvilinear transformation and multiphysics capabilities, but the two differ in the basic grid layout (staggered vs. non-staggered grids) and in the numerical methods each applies to solve the Navier–Stokes equations. In addition, GCCOM development has progressed so that it does not need to couple with a large-scale model to capture multiscale processes. Instead, GCCOM includes the entire domain in the curvilinear, nonhydrostatic, multiphysics-capable region, where thermodynamics, hydrostatic and nonhydrostatic pressure, and equation-of-state density distributions are handled at the same time. In this sense, GCCOM is a model capable of supporting both multiphysics and multiscale calculations, at high computational cost. Recently, the PETSc-based GCCOM model has been coupled with SWASH [26] to simulate surface waves and overcome the limitations of the rigid lid [27]. SWASH is a numerical tool for simulating free-surface, rotational flow and transport phenomena in coastal waters; its hydrostatic component was used to capture and validate surface wave heights in a seamount test case.

GCCOM development began with [28], which introduced the 3D curvilinear coordinates transformation along with practical applications, including the Alarcon Seamount and the waters of Lake Valencia [29,30]. Later, the model was revised by [2] to add thermodynamic processes, the UNESCO Equation of State, and a simple Successive Over-Relaxation (SOR) algorithm to solve the nonhydrostatic pressure [31] on an Arakawa-C grid [32], along with an upgrade of the model to Fortran 95. In order to accelerate convergence, the authors in [33] implemented an elliptic equation solver for the pressure that used the Aggregation-based Algebraic Multigrid Library (AGMG) [34]. Additionally, an MPI-based model with the SOR solver was developed around the same time by [35]. All of these development efforts demonstrated improvements in model accuracy and speedup, but running GCCOM remained limited by the amount of time spent in the pressure solver.

After studying options available for parallelizing the pressure solver, the Portable Extensible Toolkit for Scientific Computing (PETSc) [16–18] was chosen because of its ability to handle the complexities of the Arakawa-C staggered grid, its large collection of iterative Krylov subspace methods, and its ability to interface with other similar libraries. An additional benefit is that PETSc has been selected to be part of the DOE Exascale Computing Project [36], which will allow our model to scale to very large coastal regions at high resolutions, and model developers can expect that the PETSc libraries will have long-term support.

The evolution of this effort, from an initial hybrid prototype [37] to the fully PETSc-based parallelization built on DM/DMDA domain decomposition [38], is described in Section 2.3.

#### *2.2. The PETSc Libraries*

PETSc is a modular set of libraries, data structures, and routines developed and maintained by Argonne National Laboratory and a thriving community around the world. Designed to solve scientific applications that are modeled by partial differential equations, PETSc has been shown to be particularly useful for computational fluid dynamics (CFD) problems. The PETSc libraries support a variety of numerical formulations, including finite element methods [39], finite volume methods [40], and finite difference methods, the last being the case for the GCCOM model. Its wide array of fundamental tools for scientific computing, such as linear and nonlinear solvers, parallel domain distribution, and native input/output protocols, makes it one of the strongest choices for porting a proven code into a parallel framework. PETSc also makes it possible to combine GPUs, threads, and MPI parallelism in the same model for different purposes, further optimizing code performance. As of today, PETSc is supported on most XSEDE machines and many other large-scale NSF and DOE HPC systems in the US, and it has become a fundamental tool in the scientific computing community.

The PETSc framework is designed to be used by large-scale scientific applications. In fact, it is one of the software components on the Department of Energy's Exascale Computing Roadmap [36]. Part of the exascale initiative is to develop "composable" software tools, where different PDE-based models can be directly coupled, and PETSc will be developing libraries that couple multiphysics models at scale. When properly implemented, PETSc applications will be capable of running multilevel, multidomain, multirate, and multiphysics algorithms for very large domains (billions of cells) [41,42]. Our expectation is that, by using PETSc for the parallel data distribution model and MPI communications, the parGCCOM model will be capable of scaling to very large numbers of cores and problem sizes and of supporting a wide variety of physics models via the PETSc libraries. In addition, PETSc supports OpenMP and GPU acceleration, which will be useful for application optimization.

In this paper, the strategies used to parallelize the GCCOM model using the PETSc libraries are presented, with the result that the model attains reasonable speedup and scaling on MPI systems of up to 240 processors (the maximum number of processors on the system where these tests were run), while demonstrating preservation of the solution through baseline experiments such as the stratified seamount, as seen in Section 3.1. In this manner, we want to emphasize the effort saved by using an established HPC toolkit while still preserving the unique features of our model.

#### *2.3. PETSc Development in GCCOM*

A prototype PETSc-GCCOM hybrid model was developed by [37], in which the pressure solver was parallelized and inserted into the existing model. As expected, the implementation had performance limitations, since the rest of the model was solved in serial, and the Laplacian system vectors needed to be scattered at each iteration, solved, and then gathered back to the main processor where the rest of the routines were running. The approach, although not optimal, showed promising results: the computational time of the floating point operations inside the pressure solver routine scaled with the number of processors. The total run-time, however, was dominated by the time required to transfer data between processors before and after the floating point operations. A scatter and gather call at each iteration drove up the communication time, and the decrease in computation time inside the pressure solver routine did not offset the gather-scatter increases.

To address these issues, a full parallelization strategy was developed by [38] that utilizes the PETSc DM/DMDA (Data Management for Distributed Arrays) objects [18,43]. Data Management (DM) objects are used to manage communication between the algebraic structures in PETSc (Vec and Mat) and the mesh data structures in PDE-based (or other) simulations. PETSc uses DMs to provide a simple way to employ parallelism via domain decomposition through objects known as Data Management for Distributed Arrays (DMDA) objects [43]. DMDAs control the parallel data layout and communication information, including: the portion of data belonging to each processor (the local domain); the location of neighboring processors and their own domains; and the management of the parallel communication between domains [16].
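As an illustration of what a DMDA encapsulates, the following minimal sketch (not taken from the GCCOM source; it assumes PETSc 3.8 or later with its Fortran module interface, a hypothetical 64 × 64 × 32 grid, and a stencil width of 2) creates a 3D distributed grid and queries the subdomain owned by the calling process:

```fortran
! Minimal DMDA sketch; compile as a .F90 file so the preprocessor runs.
program dmda_sketch
#include <petsc/finclude/petscdmda.h>
  use petscdmda
  implicit none
  DM             :: da
  PetscErrorCode :: ierr
  PetscInt       :: xs, ys, zs, xm, ym, zm

  call PetscInitialize(PETSC_NULL_CHARACTER, ierr)

  ! Hypothetical 64 x 64 x 32 global grid, one unknown per point, and a
  ! box stencil of width 2, so each rank also receives ghost points from
  ! its neighbors. PETSc decides how many ranks to place per direction.
  call DMDACreate3d(PETSC_COMM_WORLD,                                   &
       DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,            &
       DMDA_STENCIL_BOX, 64, 64, 32,                                    &
       PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, 1, 2,                  &
       PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, PETSC_NULL_INTEGER,      &
       da, ierr)
  call DMSetFromOptions(da, ierr)  ! honors options such as -da_processors_z
  call DMSetUp(da, ierr)

  ! Corner (xs, ys, zs) and extent (xm, ym, zm) of this rank's local
  ! domain; model loops run over exactly these ranges.
  call DMDAGetCorners(da, xs, ys, zs, xm, ym, zm, ierr)

  call DMDestroy(da, ierr)
  call PetscFinalize(ierr)
end program dmda_sketch
```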

In the GCCOM model, DMDA objects are used for all domain data decomposition and linear system parallel solving. This approach towards parallelization of the model is described below (in Section 2.4).

#### *2.4. Model Parallelization*

In this section, we describe the key aspects of the model that impact its parallelization. These include the use of DMDA objects to manage data decomposition and message passing, the location of the scalars and velocities on the Arakawa-C grid, the hydrostatic pressure-gradient force (HPGF) and its impact on the data decomposition, and the pressure calculations, which dominate the computations.

GCCOM uses finite difference approximations, applied via stencils, to solve differential equations on curvilinear grids. These grids need to be distributed across processors and updated independently, and they require special cases to handle the data communication where the local domain ends. This set of operations is referred to as domain decomposition. Throughout the GCCOM code, specific DMDAs are created to manage the layout of multidimensional arrays for the velocities (*u,v,w*) and the temperature, salinity, pressure, and density (*T,S,P,ρ*) scalars. Each of the velocity components and scalars is located at a position in the staggered grid that requires special treatment in order to be distributed and updated correctly, as described in more detail in Section 2.4.1.

Another key PETSc component utilized in the GCCOM model is the linear system solver. PETSc provides a way to apply iterative solvers for linear systems using MPI. At the same time, the most computationally intensive part of the GCCOM model involves solving a fully elliptic pressure-Poisson system. This approach is rare among CFD models because it embeds the fully 3D curvilinear transformation in the solver. Details of this Laplacian are discussed in Section 2.4.4. The strength of using the PETSc libraries to solve linear systems lies in the ability to experiment with dozens of different iterative solvers and preconditioners. Another advantage is the ability to use PETSc with well-known external packages such as Trilinos, HYPRE, and OpenCL [44–46] with minimal changes.
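As a sketch of how such a system can be handed to PETSc (the assembly of the curvilinear Laplacian is elided, `da` is a DMDA such as the one above, and the routine name is illustrative rather than GCCOM's), the solve can be driven entirely through the KSP interface:

```fortran
! Hedged sketch: solve A*phi = rhs on the layout of an existing DMDA.
subroutine solve_pressure(da, phi)
#include <petsc/finclude/petscdmda.h>
#include <petsc/finclude/petscksp.h>
  use petscdmda
  use petscksp
  implicit none
  DM, intent(in)   :: da
  Vec, intent(out) :: phi
  Mat              :: A
  Vec              :: rhs
  KSP              :: ksp
  PetscErrorCode   :: ierr

  ! Matrix and vectors preallocated with the DMDA's parallel layout.
  call DMCreateMatrix(da, A, ierr)
  call DMCreateGlobalVector(da, rhs, ierr)
  call VecDuplicate(rhs, phi, ierr)

  ! ... assemble the curvilinear Laplacian into A and the source into rhs ...

  call KSPCreate(PETSC_COMM_WORLD, ksp, ierr)
  call KSPSetOperators(ksp, A, A, ierr)
  ! Krylov method and preconditioner are selected at run time, e.g.,
  !   -ksp_type gmres -pc_type bjacobi    or    -pc_type hypre
  call KSPSetFromOptions(ksp, ierr)
  call KSPSolve(ksp, rhs, phi, ierr)

  call KSPDestroy(ksp, ierr)
  call MatDestroy(A, ierr)
  call VecDestroy(rhs, ierr)
end subroutine solve_pressure
```

Because the method and preconditioner are runtime options, experimenting with a different solver, or with an external package such as HYPRE, requires no code changes.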

These key PETSc elements, domain decomposition and linear system solvers, provide the foundations needed to develop the core parallel framework of the GCCOM model, and help to keep development effort to a minimum. This proved to be very helpful for the small development team involved in this effort.

#### 2.4.1. The Arakawa-C Grid

GCCOM defines the locations of its vectors and scalars on an Arakawa-C grid, in which the components of flow are staggered in space [32]. In the C-grid, the *u* component of velocity is located at the west and east edges of the cell, the *v* component is located at the north and south edges, and pressure and other scalars are evaluated at cell centers (see Figure 1). Similarly, the *w* component is located on the front face of the 3D cell, pointing inwards. This staggered grid arrangement becomes the first challenge of the parallel overhaul because, for the DMDA objects, all components are regarded as co-located and are referenced by their lower-left grid point.

**Figure 1.** Diagram of the Arakawa C-grid, in which the velocity components *u*, *v*, and *w* are staggered by half a grid spacing. Scalars such as salinity *S* or temperature *T* are located at the cell center.
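To make the staggering concrete, the following sketch (a hypothetical serial helper, not part of GCCOM) averages the face-centered C-grid velocity components to cell centers; the array extents reflect the extra point each component carries in its staggered direction:

```fortran
! Average Arakawa-C face velocities to cell centers (illustrative only).
! u has nx+1 points in x, v has ny+1 in y, and w has nz+1 in z.
subroutine center_velocities(nx, ny, nz, u, v, w, speed)
  implicit none
  integer, intent(in)  :: nx, ny, nz
  real(8), intent(in)  :: u(nx+1, ny, nz), v(nx, ny+1, nz), w(nx, ny, nz+1)
  real(8), intent(out) :: speed(nx, ny, nz)
  integer :: i, j, k
  real(8) :: uc, vc, wc

  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        uc = 0.5d0*(u(i,j,k) + u(i+1,j,k))  ! west/east faces
        vc = 0.5d0*(v(i,j,k) + v(i,j+1,k))  ! south/north faces
        wc = 0.5d0*(w(i,j,k) + w(i,j,k+1))  ! front/back faces
        speed(i,j,k) = sqrt(uc*uc + vc*vc + wc*wc)
      end do
    end do
  end do
end subroutine center_velocities
```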

The grid staggering results in different array *layouts* that depend on each variable's position (*u*, *v*, *w*, *p*), which in turn creates different sizes for each layout. For example, there is one more *u*-velocity point in the horizontal direction than for the *v*-velocity. Similarly, there is an extra *v*-velocity point in the vertical direction compared to the *u*-velocity. Even though the computational ranges differ for each variable in the serial model, the arrays used by the parallel model were padded so that they would all have the same dimensions. Note that this method is also used in the Regional Ocean Modeling System (ROMS) [47]. Table 1 describes the computational ranges and global sizes of each variable. The variables *gx*, *gy*, *gz* correspond to the full grid size in *x*, *y*, *z*, and the difference between the interior range and the full array size accounts for the number of ghost rows in each direction (e.g., *gx* − 2 has two ghost rows in that direction).


**Table 1.** Sizes and computational ranges of variable layouts on domain Ω = [*gx* × *gy* × *gz*] where *gx*, *gy*, *gz* are the total grid dimensions in each direction.

One of the main features of the GCCOM model is the embedded fully 3D curvilinear transformation, which allows any kind of structured grid (e.g., rectangular, sigma, curvilinear) to be solved in the computational domain through the transformation metrics (partially discussed in Section 2.4.4). The downside of this approach is a significant increase in memory allocation, since full-sized arrays need to be stored for each of the transformation metrics. In GCCOM, the metrics arrays are associated with the locations of the aforementioned variable layouts. This provides an opportunity to minimize overhead: the model arrays are grouped by variable layout and by similar function inside the code. Other GCCOM modules, such as the Sub-Grid Scale (SGS) calculations and the stratification (thermodynamics) handling, follow the same parallelization treatment. Table 2 shows the total number of DM objects (layouts) used by GCCOM and the associated variables.

**Table 2.** GCCOM Data Management Objects (DMs) and the variable layout each one uses.


Each of these DMs controls its domain distribution independently, and each spawns the full-sized arrays that become the grid on which the finite difference calculations are carried out. In total, more than one hundred full-sized arrays are used inside the model on different occasions, around 60 of which remain in active memory because they are part of the core calculations, and all of them are distributed thanks to these structures.
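A hedged sketch of this layout strategy follows; the names and the grouping of variables into degrees of freedom are illustrative assumptions, not GCCOM's exact Table 2 layout. The essential point is that every layout DMDA is created with the same padded global dimensions, so all distributed arrays share one decomposition:

```fortran
! Create two of the per-layout DMDAs on a common padded global grid
! (gx, gy, gz). Illustrative sketch; assumes PETSc >= 3.8.
subroutine create_layout_dms(gx, gy, gz, da_vel, da_scalars)
#include <petsc/finclude/petscdmda.h>
  use petscdmda
  implicit none
  PetscInt, intent(in) :: gx, gy, gz
  DM, intent(out)      :: da_vel, da_scalars
  PetscErrorCode       :: ierr

  ! Velocities (u, v, w) grouped as three degrees of freedom per point.
  ! The z-direction is kept on one processor (see Section 2.4.3).
  call DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, &
       DM_BOUNDARY_NONE, DMDA_STENCIL_BOX, gx, gy, gz,                    &
       PETSC_DECIDE, PETSC_DECIDE, 1, 3, 2,                               &
       PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, PETSC_NULL_INTEGER,        &
       da_vel, ierr)
  call DMSetUp(da_vel, ierr)

  ! Cell-centered scalars (T, S, P, rho) as four degrees of freedom.
  call DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, &
       DM_BOUNDARY_NONE, DMDA_STENCIL_BOX, gx, gy, gz,                    &
       PETSC_DECIDE, PETSC_DECIDE, 1, 4, 2,                               &
       PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, PETSC_NULL_INTEGER,        &
       da_scalars, ierr)
  call DMSetUp(da_scalars, ierr)
end subroutine create_layout_dms
```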

#### 2.4.2. Domain Decomposition

Domain decomposition refers to dividing a large domain into smaller subdomains so that each can be solved independently on its own processor. However, grid operations with carried dependencies from one grid point to the next (self-recurrence) are difficult to parallelize on distributed grids, which rules out partitioning in the affected direction. In GCCOM, two such factors prevent domain decomposition in arbitrary directions: the grid size and the calculation of the hydrostatic pressure-gradient force (HPGF). For some of the experiments in GCCOM, the number of points in the *y*-direction is the minimum of 6, used to simulate 2D processes, which makes that direction too small to partition. The parallelization of the HPGF is discussed next.

#### 2.4.3. Hydrostatic Pressure-Gradient Force

The hydrostatic pressure-gradient force (Equation (1)) is calculated as a spline reconstruction and integration along the vertical (*z*-direction), updated at every time step as a fundamental step in the main GCCOM algorithm [12]. The requirement for vertical integration enforces a recursive computation inside the loop, i.e., subsequent values depend on previously calculated vertical levels, as can be seen in Equation (5), in which the slope d_k couples adjacent vertical intervals so that each level depends on the column above it. Equations (1)–(5) describe a spline reconstruction of the hydrostatic pressure p_H over the *z*-column using a fourth-order approximation for f_k(ξ) = ∂ρ(ξ)/∂x, through a series of coefficients (Δ_k, d_k) defined by the local change in vertical coordinate h = z_{k+1} − z_k:

$$\frac{\partial p\_H}{\partial x} = \frac{\partial}{\partial x} \int\_z^0 g \rho \, d\tilde{z},\tag{1}$$

$$f(\xi) = f^{(0)} + f^{(1)}\xi + f^{(2)}\frac{\xi^2}{2} + f^{(3)}\frac{\xi^3}{6},\tag{2}$$

$$f^{(0)} = f\_k, \qquad\qquad f^{(1)} = d\_k,\tag{3}$$

$$f^{(2)} = \frac{6\Delta\_k - 2d\_{k+1} - 4d\_k}{h}, \qquad\qquad f^{(3)} = \frac{6d\_k + 6d\_{k+1} - 12\Delta\_k}{h^2},\tag{4}$$

$$h = z\_{k+1} - z\_k,\qquad\qquad \Delta\_k = \frac{f\_{k+1} - f\_k}{h},\qquad\qquad d\_k = \frac{2\Delta\_k \Delta\_{k-1}}{\Delta\_k + \Delta\_{k-1}}.\tag{5}$$

This self-recurrence lies inside the algorithm and thus cannot be automatically detected by PETSc, resulting in a crash. The solution applied is not to partition data in the *z*-direction, which is easily done in PETSc with the command-line option -da\_processors\_z 1, another strength of the library. The result is that each processor stores and calculates the pressure gradients on its respective subdomain, effectively a set of whole vertical columns. While this strategy enables the use of the HPGF, it also limits the parallelism available to solve the equations, a limitation that we will need to overcome in future development of the model.
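A minimal sketch of the recursion is shown below (a hypothetical routine; the endpoint slopes and the monotonicity guard are our assumptions, not necessarily GCCOM's exact treatment). Each slope d(k) in Equation (5) is built from the interval below it, which is why the column cannot be split across ranks in *z*:

```fortran
! Harmonic-mean slopes d(k) for the spline of Equations (2)-(5) over one
! vertical column (illustrative only).
subroutine spline_slopes(n, z, f, d)
  implicit none
  integer, intent(in)  :: n          ! number of vertical levels
  real(8), intent(in)  :: z(n), f(n) ! level heights and f_k = drho/dx
  real(8), intent(out) :: d(n)       ! slope estimates d_k
  real(8) :: delta(n-1)
  integer :: k

  do k = 1, n - 1
    delta(k) = (f(k+1) - f(k)) / (z(k+1) - z(k))  ! Delta_k of Equation (5)
  end do
  d(1) = delta(1)      ! one-sided slopes at the column endpoints
  d(n) = delta(n-1)
  do k = 2, n - 1
    ! d(k) couples intervals k-1 and k: the self-recurrence that forces
    ! whole columns onto a single rank. Zero at sign changes keeps the
    ! reconstruction monotone and avoids division by zero.
    if (delta(k)*delta(k-1) > 0.0d0) then
      d(k) = 2.0d0*delta(k)*delta(k-1) / (delta(k) + delta(k-1))
    else
      d(k) = 0.0d0
    end if
  end do
end subroutine spline_slopes
```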
