### 2.5.1. Test System

Timing runs for the test cases were primarily conducted on the Coastal Ocean Dynamics (COD) cluster at the Computational Science Research Center at San Diego State University, a Linux-based cluster whose relevant characteristics are described below.

GCCOM is a memory-intensive model: there are over 100 matrices that must be held in memory. For the "small" test case run in these experiments, with 6 × 10<sup>7</sup> cells per array, the model requires on the order of 2 × 10<sup>2</sup> GB of storage plus the run-time overhead. Consequently, the test cases used for this research were run on the large-memory nodes of the Coastal Ocean Dynamics (COD) system. The COD system hosts 12 large-memory nodes with 263 GB and 20 cores per node, for a total of 240 available cores and 12 × 263 GB ≈ 3.2 × 10<sup>3</sup> GB of memory.
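
As a rough order-of-magnitude check (an illustrative estimate only, assuming double-precision 8-byte values and taking the array count as exactly 100),

$$100\ \mathrm{arrays} \times 6 \times 10^{7}\ \mathrm{cells/array} \times 8\ \mathrm{B/cell} \approx 4.8 \times 10^{10}\ \mathrm{B} \approx 48\ \mathrm{GB},$$

which grows toward the quoted 2 × 10<sup>2</sup> GB once the "over 100" array count, solver workspace, and run-time overhead are included.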

The rest of the machine nodes were not able to run a problem of this size because of the memory requirements of the model. As mentioned before, the scope of the research reported in this paper was to validate the PETSc-based implementation of the model physics, deferring optimizations to later research. Of future interest will be to profile the memory consumed by the framework. Most of the timing data was recorded using the built-in PETSc timers, which have been shown to be fairly accurate and to have little or no overhead [16].
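
GCCOM's instrumentation relies on PETSc's built-in logging; as an illustrative sketch of that mechanism (shown here with petsc4py rather than the model's Fortran source, and with a hypothetical stage name), a named log stage can be wrapped around the region of interest and reported with the `-log_view` runtime option:

```python
# Illustrative sketch (petsc4py, not GCCOM source): timing a region with PETSc's
# built-in logging, reported through the -log_view option.
import sys
import petsc4py
petsc4py.init(sys.argv)          # forwards options such as -log_view to PETSc
from petsc4py import PETSc

stage = PETSc.Log.Stage("PressureSolve")   # hypothetical stage name
stage.push()
# ... the solver calls to be timed would go here ...
stage.pop()
# Running with "python script.py -log_view" prints per-stage wall-clock times
# and FLOP counts at exit, with essentially no instrumentation overhead.
```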

In addition, the model has been ported to the XSEDE Comet machine at the San Diego Supercomputer Center, in preparation for running larger jobs and as a test of the portability of the model [49]. Comet currently has 1944 standard nodes with 320 GB/node, plus four large-memory nodes, each with 1.5 TB of DRAM and four Haswell processors with 16 cores each. Future plans include exploring how the model performs on this type of system.

### 2.5.2. Stratified Seamount

Seamount experiments are regarded as a straightforward way to showcase an oceanographic model's capabilities and behavior. We carried out our timing and validation tests on a classical seamount, using continuous stratification with temperature ranging from 10 °C at the bottom to 12 °C at the top of the water column and the equivalent density for seawater, while keeping the salinity constant at 35. The bathymetry for this experiment is defined by Equation (16),

$$D(x, y) = L\left(-1 + a\, e^{-b(x^2 + y^2)}\right),\tag{16}$$

where *L* = 1000 m is the maximum depth and characteristic length, and the parameters *a* = 0.5 and *b* = 8 control the seamount shape. The experimental domain is (*x*, *y*, *z*) = 3.6 km × 2.8 km × 1 km. The experiment is forced externally from the east with a *u*-velocity that increases linearly over the vertical column, from 0 at the bottom to 0.01 m/s at the top.
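
Purely as an illustration (not model source), Equation (16) and the linear forcing profile can be evaluated in a few lines of NumPy; the uniform sampling below and the nondimensionalization of *x* and *y* by *L* are assumptions made for this sketch:

```python
import numpy as np

# Illustrative evaluation of the seamount bathymetry, Equation (16); x and y are
# assumed to be nondimensionalized by the characteristic length L.
L, a, b = 1000.0, 0.5, 8.0                 # depth/length scale [m], shape parameters
x = np.linspace(-1.8e3, 1.8e3, 361)        # 3.6 km east-west extent [m]
y = np.linspace(-1.4e3, 1.4e3, 281)        # 2.8 km north-south extent [m]
X, Y = np.meshgrid(x / L, y / L)           # nondimensional coordinates (assumption)
D = L * (-1.0 + a * np.exp(-b * (X**2 + Y**2)))   # -1000 m far field, -500 m at peak

# Linearly increasing eastern forcing: u = 0 at the bottom, 0.01 m/s at the surface.
z = np.linspace(-L, 0.0, 51)
u_east = 0.01 * (z + L) / L                # [m/s]
```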

The grid was created with cell clustering at the bottom, along half the domain in each horizontal direction, as seen in Figure 3; this created a 3D curvilinear grid with the point distribution given by Equations (17) and (18), where *L<sub>i</sub>* is the dimension length, *D* is the horizontal clustering position (*D* = *L<sub>i</sub>*/2), and *β* varies between {1, 5} uniformly along the vertical, with *β* = 5 at the bottom where the cell clustering is strongest and *β* = 1 at the surface where there is no clustering. This grid is based on the work of [2], expanding it to use continuous stratification and simplifying the curvilinear implementation:

$$x_i = D\left\{1 + \frac{\sinh[\beta(\xi_i - A)]}{\sinh(\beta A)}\right\},\tag{17}$$

$$A = \frac{1}{2\beta} \ln\left[\frac{1 + (e^{\beta} - 1)(D/L_i)}{1 + (e^{-\beta} - 1)(D/L_i)}\right].\tag{18}$$
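
The one-dimensional form of Equations (17) and (18) translates directly into code. The sketch below is illustrative only (not GCCOM source) and assumes a uniform computational coordinate ξ ∈ [0, 1] with clustering at the half point of the direction:

```python
import numpy as np

def stretched_points(n, L_i, beta, D=None):
    """Distribute n grid points along a direction of length L_i, clustered
    around position D via the sinh stretching of Equations (17)-(18)."""
    if D is None:
        D = 0.5 * L_i                      # clustering at the half point (assumed)
    xi = np.linspace(0.0, 1.0, n)          # uniform computational coordinate
    # Equation (18): location parameter A
    A = (1.0 / (2.0 * beta)) * np.log(
        (1.0 + (np.exp(beta) - 1.0) * (D / L_i))
        / (1.0 + (np.exp(-beta) - 1.0) * (D / L_i)))
    # Equation (17): physical point distribution
    return D * (1.0 + np.sinh(beta * (xi - A)) / np.sinh(beta * A))

# Example: 100 points along a 2.8 km direction with strong clustering (beta = 5);
# beta close to 1 gives a nearly uniform distribution, as used at the surface.
x_strong = stretched_points(100, 2800.0, beta=5.0)
```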

Three grids were generated, each with enough resolution to show strong scaling on the test cluster (see Section 3.2). The grid sizes are (*x*, *y*, *z*) = 1500 × 100 × 50 for the smallest grid, with 7.5 million cell points per variable (our lowest resolution problem); a second grid of size (*x*, *y*, *z*) = 2000 × 100 × 100, with 20 million points per variable; and a high resolution problem of size (*x*, *y*, *z*) = 3000 × 200 × 100, yielding around 60 million cell points per variable. The simulation was run for five main loop iterations, corresponding to 5 s of simulation, creating and writing a NetCDF file as output. This I/O operation happens twice and is removed from the parallel timing and performance analysis since it is not yet parallelized.

**Figure 3.** Seamount grid of size *x* = 3000, *y* = 200, *z* = 100. Point clustering can be seen along half of each horizontal direction with parameter *β* = 5, gradually decreasing toward the top to a uniform distribution (*β* = 1).

#### **3. Results and Discussion**

### *3.1. Model Validation*

This section shows how the parallelized model produces correct results for our stratified seamount experiment. Correct in this context means replicating the results obtained by the serial version of the model, which was recently validated for non-hydrostatic oceanographic applications [12]. We also discuss the strategy applied to compare the model outputs and how much the results differ as more processors are added. The comparison is presented against both the serial output and the single-processor parallel output, to give a rounded picture of how communication errors propagate in the parallel framework.

### 3.1.1. Validation Procedure

The stratified seamount experiment is run for five computational iterations, or cycles of solving Equations (6)–(10) inside the main computational loop. The goal of this exercise is to establish the consistency of the whole computational suite in the parallel framework, compared with the same set of equations being solved in serial. Each iteration represents a second of simulation time, for a total of 5 s. This validation process is carried out on the high resolution problem (3000 × 200 × 100).

NCO operators [50] were used to obtain the root-mean-square (RMS) error of the output directly from the NetCDF output files, comparing each parallel run with the single-processor serial run, as seen in Table A1, and also with the single-processor parallel run, visible in Table A2. The RMS is obtained for each main variable (*p*, *u*, *v*, *w*, *T*, *D*) by (1) subtracting each parallel output from the single-processor output, (2) applying a weighted average with respect to time in order to unify the records into a single time-averaged snapshot, and (3) obtaining the RMS error using the ncra operator. For the results, we report the maximum absolute value of the RMS, minimizing boundary errors by reading the 50th *X*-*Y* plane out of the vertical column of 100 planes.
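
For readers without NCO, the same three steps can be approximated in Python with netCDF4 and NumPy. The sketch below is illustrative only, with placeholder file names, an assumed (time, z, y, x) variable layout, and 0-based indexing that makes the 50th plane index 49:

```python
import numpy as np
from netCDF4 import Dataset

def rms_at_midplane(serial_file, parallel_file, var, k_plane=49):
    """RMS of (parallel - serial) for one variable: difference the records,
    average them in time, then take the RMS over the 50th X-Y plane."""
    with Dataset(serial_file) as fs, Dataset(parallel_file) as fp:
        # (1) subtract the reference output from the parallel output, record by record
        diff = fp.variables[var][:] - fs.variables[var][:]   # (time, z, y, x) assumed
    # (2) average over the record (time) dimension into a single snapshot
    snapshot = diff.mean(axis=0)
    # (3) RMS over the selected horizontal plane, away from the boundaries
    plane = snapshot[k_plane, :, :]
    return float(np.sqrt(np.mean(plane ** 2)))

# Example usage with placeholder file and variable names:
# for v in ("p", "u", "v", "w", "T", "D"):
#     print(v, rms_at_midplane("serial_out.nc", "parallel_np16_out.nc", v))
```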

### 3.1.2. Comparison with Serial GCCOM Model

Here, we present in table form the maximum RMS values at the half point of the vertical column for each of the parallel runs, obtained by comparing with the output of the identical experiment run with the serial GCCOM. Note that we will refer to the comparison of parallel results against serial as *vs. serial* for the rest of the document.

As can be seen, the models are in agreement for every practical purpose, with the exception of the pressure, which is the result of solving the linear system with a Krylov subspace method (and can therefore be refined) and shows the largest error, at around 10<sup>−5</sup>. The velocity errors all lie between 10<sup>−7</sup> and 10<sup>−8</sup>, and the scalars *D* and *T* are close to machine error. Additionally, from Table A1, the errors are virtually equal across all parallel runs. This tells us that the solutions we obtain from the parallel model are in very close agreement with the serial model. This comparison brings confidence in the robustness of the PETSc implementation we have achieved, which yields the same degree of error regardless of the data partitioning used. Nonetheless, communication and rounding errors exist and are affected by the number of processors used, as we explain next.
