#### 3.1.3. Error Propagation

In this section, we examine how solution errors grow with the number of processors/nodes when running the exact same experiment. These errors are chiefly a consequence of halo communication and rounding errors. The results can be seen in Table A2 and Figure 4. Note that, for the tests where we compare parallel model output for one processor vs. N processors, we will refer to the comparison as *vs. parallel*. In every case, the *vs. serial* RMS error is below 10<sup>−5</sup> and would be unable to influence the dynamics of our experiment, effectively carrying over the physics model validation obtained in [12]. In addition, the *vs. parallel* RMS error is in every case orders of magnitude smaller than the *vs. serial* error; as this is the case, we can confidently conclude that using as many as 240 processors (and presumably more) will not affect the solution through rounding or communication errors that might otherwise be introduced by a large data distribution layout. This finding brings confidence in our parallel-enabled model.
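As an illustration of the comparison metric used above, the following is a minimal sketch of an RMS error computation, assuming both solution fields have already been gathered into flat arrays of equal length (the function name and array layout are ours, not taken from the GCCOM code):

```c
#include <math.h>
#include <stddef.h>

/* Root-mean-square difference between two solution fields of length n,
 * e.g., one-processor output vs. N-processor output of the same run. */
double rms_error(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sqrt(sum / (double)n);
}
```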

In this section, we have shown that the PETSc-based parallel GCCOM framework preserves the solutions obtained by the validated serial GCCOM model for different mesh sizes of the Seamount test case. We have also shown that the communication errors PETSc introduces are small enough not to be a problem with the 240 processors/12 nodes we have used.

Finally, the trend we observe indicates that, for the communication errors (*vs. parallel* error) to catch up with the parallel framework migration error (*vs. serial* error), we would need to double the processor count with a properly sized experiment, something that would be impractical to run in the serial model. In short, we have attained a new range of problem sizes that we can solve in this new parallel framework, while carrying over the physics validation obtained in the serial version of the model.

**Figure 4.** Error comparison between serial and parallel models for one-processor output. The *vs. parallel* error is consistently orders of magnitude smaller than the *vs. serial* comparison, which in turn is small enough to carry over the physics validation of the model.

### *3.2. Model Performance*

Model performance can be assessed and measured in different ways, but in general it should improve with the number of resources allocated, up to the limit where the problem size and other factors cap or even degrade performance. This behavior is known as scalability or scaling power.

In order to analyze the parallel performance of the PETSc-based GCCOM model, it is important to first validate the results, as was done in Section 3.1. Once the model is validated and the algorithmic approach has been verified, the next step is to identify critical blocks and bottlenecks in order to determine which elements of the model can be optimized. In the case of GCCOM, key factors include problem size and resolution, time step resolution, numerical methods and solvers, file I/O, the PETSc framework, and the test cluster. The multiscale/multiphysics non-hydrostatic capabilities of the GCCOM model are demonstrated using the stratified seamount test case on different meshes and a 3D lock exchange test case. The impact and results of these factors are presented below.

#### 3.2.1. PETSc Performance

The PETSc framework has its own performance characteristics, and it effectively defines an upper limit on the performance we can expect from any model using the framework. The performance of PETSc is measured using its *streams* test, which outputs the speedup as a function of the number of cores [16]. Streams measures the communication overhead and the efficiency that is realistically attainable on a system. The test probes the machine for its maximum memory speed, providing an alternative way to measure the maximum bandwidth, speedup, and efficiency. The streams test is run as part of the PETSc installation process (typically via the `make streams` target), and its result is regarded as an upper limit on the speedup attainable on the system, limited by the memory bandwidth. More information can be found in the PETSc user guide (see [16], Section 14: Hints for Performance and Tuning).

The speedup is defined as the ratio of the runtime of the serial model, *T*<sub>1</sub>, to the time, *T<sub>n</sub>*, taken by the parallel model as a function of the number of cores, *n*. The ideal speedup of an application corresponds to perfect scaling of the serial (or base) timing with the number of processors used, i.e., *T<sub>n</sub>* = *T*<sub>1</sub>/*n*, so that *S*<sub>ideal</sub> = *n*. The measured speedup of a model is defined as *S<sub>n</sub>* = *T*<sub>1</sub>/*T<sub>n</sub>*.
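To make the definition concrete, the short sketch below computes measured speedup from a table of wall-clock timings; the timing values are hypothetical placeholders for illustration, not measurements from our runs:

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical (illustrative) timings: core counts and wall-clock seconds. */
    const int    cores[] = { 1, 20, 40, 120, 240 };
    const double T[]     = { 1000.0, 62.0, 34.0, 13.5, 7.8 };
    const int    ncases  = 5;
    const double T1      = T[0];   /* serial (base) runtime */

    for (int i = 0; i < ncases; ++i) {
        double S = T1 / T[i];      /* measured speedup S_n = T1/Tn */
        printf("n = %3d   S_n = %6.1f   S_ideal = %3d\n", cores[i], S, cores[i]);
    }
    return 0;
}
```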

The PETSc streams speedup is a diagnostic test for our system. As we will see in Section 3.2.3, it exhibits a linear trend, which indicates that the communication bandwidth capacity grows linearly across the system, in this case up to 240 processors on 12 nodes. Note also that the PETSc framework shows no sign of turning over, indicating that it is capable of scaling to a much larger number of cores.

In practice, most distributed memory applications are bounded by the memory bandwidth allocation of the system, which is measured in PETSc by the streams test. For every test performed in our analysis, we have used the streams speedup estimate as an upper bound on the speedup that can be obtained under the memory bandwidth constraint. For the GCCOM model, we have determined that, for large enough problem sizes, this speedup threshold can be surpassed, effectively offsetting this performance limit by some margin.

#### 3.2.2. Profiling the GCCOM Model

To profile the GCCOM model, we analyze the three phases that are typical of many parallel models: initialization, computation, and finalization. Initial wall-clock profiling shows that the time spent in the finalization phase is less than 1% of the wall-clock time, independent of problem size and number of cores; consequently, it will not be part of the further analysis discussed in this section. Timings also show that approximately 15–35% of total wall-clock time is spent in the initialization phase, where the PETSc arrays and objects are initialized, memory is allocated, initialization data are loaded in from files, and the curvilinear metrics are derived. The remainder of the execution time is spent in what we refer to as the "main loop", in which a set of iterative solutions to the governing equations is computed after the startup phase. In general, as the number of cores increases, the percentage of time spent in the main loop goes down, while the time spent in the initialization phase increases.
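One natural way to instrument these phases in a PETSc-based code is through PETSc's logging stages, with per-stage timings reported by running the application with `-log_view`. The sketch below is a hedged outline, not the instrumentation actually used for the timings above; the stage names are ours, and the `PetscCall` macro assumes a recent PETSc release:

```c
#include <petscsys.h>

int main(int argc, char **argv)
{
    PetscLogStage init_stage, loop_stage, final_stage;

    PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

    PetscCall(PetscLogStageRegister("Initialization", &init_stage));
    PetscCall(PetscLogStageRegister("MainLoop",       &loop_stage));
    PetscCall(PetscLogStageRegister("Finalization",   &final_stage));

    PetscCall(PetscLogStagePush(init_stage));
    /* ... allocate PETSc objects, read input files, derive metrics ... */
    PetscCall(PetscLogStagePop());

    PetscCall(PetscLogStagePush(loop_stage));
    /* ... iterate the governing equations ... */
    PetscCall(PetscLogStagePop());

    PetscCall(PetscLogStagePush(final_stage));
    /* ... write final output, free objects ... */
    PetscCall(PetscLogStagePop());

    /* Run with -log_view to obtain per-stage wall-clock breakdowns. */
    PetscCall(PetscFinalize());
    return 0;
}
```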

A likely explanation for the impact of the startup phase on scaling is the strategy employed to initialize and allocate the arrays, which was inherited from the serial model. First, the model uses serial NetCDF to read and write data, effectively loading external files onto one master node and then scattering the data across the system. Similarly, when writing output data, the whole array is gathered onto one node, and the results are then saved serially to a NetCDF file. Thus, the model is both I/O and memory bound. This is a well-known issue, and moving to parallel I/O libraries is an important next step for this model.
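The pattern just described can be sketched as follows. This simplified stand-in replaces the model's NetCDF reads and PETSc scatters with an in-memory array and a plain `MPI_Scatterv`, only to illustrate why the root rank becomes the memory and I/O bottleneck:

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1000000;                 /* illustrative global field size */
    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));

    /* Split the N values as evenly as possible across the ranks. */
    for (int r = 0, off = 0; r < size; ++r) {
        counts[r] = N / size + (r < N % size ? 1 : 0);
        displs[r] = off;
        off += counts[r];
    }

    double *global = NULL;
    if (rank == 0) {
        /* The root rank alone holds the full field: it must have enough
         * memory for the whole problem and performs all file I/O. */
        global = malloc(N * sizeof(double));
        for (int i = 0; i < N; ++i)
            global[i] = (double)i;          /* stand-in for a NetCDF read */
    }

    double *local = malloc(counts[rank] * sizeof(double));
    MPI_Scatterv(global, counts, displs, MPI_DOUBLE,
                 local, counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... compute on the local slice; writing output reverses the
     * pattern with a gather back onto rank 0 ... */

    free(local); free(counts); free(displs); free(global);
    MPI_Finalize();
    return 0;
}
```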

In order to quantify the roles that these processes play in the total run time, we measured the partitioning of the total wall-clock time as a function of the number of processors for three key functional areas: the main loop, or computational time; the I/O time; and the MPI communication time. The results can be seen in Figure 5, which shows a stacked histogram view of the functional area timings for the 3000 × 200 × 100 problem. The figure plots the percentage of time used by each component as a function of the number of cores. In this figure, we see three trends: the computational time (the bottom, blue group of data) dominates the run-time for small numbers of cores and appears to scale well; the MPI communication time increases with the number of cores, which is expected and is a function of the model and the PETSc framework; and I/O clearly impacts the run-time, with its share increasing with the number of processors. The latter would explain why the model is not scaling well overall. Stacked plots for the lower resolution grids are presented in Figure 6; there, the overtaking of the computation time by communication and I/O times is evident. These problems are too small to take real advantage of the MPI framework on the 240-processor system and are capped by the I/O and memory bandwidth speeds.

**Figure 5.** Stacked normalized plot of the time partitioning between computation, I/O operations, and estimated communication times as a function of the number of processors for the high resolution problem.

**Figure 6.** Stacked normalized plot of the time partitioning for the lower resolution grids.

As mentioned above, the primary goals of this paper are to report on advances made to the GCCOM model using the PETSc framework, to validate the results of the parallel version against the serial model (which was done in Section 3.1), and to show that the computational aspect of the model scales. Based on these goals, we will focus our scaling analysis primarily on the computational time required to solve the flow processes (knowing that we will be addressing parallel file I/O in future research). Thus, for the rest of this section, we analyze the performance of the computational time, which is calculated as follows:

$$T\_{comp} = T\_{main\_loop} - T\_{IO\_main\_loop}.\tag{19}$$
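In code, Equation (19) amounts to accumulating a separate timer around the I/O calls inside the main loop and subtracting it from the loop's total time. A minimal, self-contained sketch, assuming plain MPI timers and placeholder routines in place of the actual solver and NetCDF output:

```c
#include <mpi.h>
#include <stdio.h>

/* Placeholders standing in for the flow solve and the serial NetCDF output. */
static void compute_step(void)     { /* ... solve governing equations ... */ }
static void write_output(int step) { (void)step; /* ... write NetCDF ... */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int nsteps = 100;
    double t_io = 0.0;
    double t_loop_start = MPI_Wtime();

    for (int step = 0; step < nsteps; ++step) {
        compute_step();
        double t0 = MPI_Wtime();
        write_output(step);
        t_io += MPI_Wtime() - t0;          /* accumulate I/O time separately */
    }

    double T_main_loop = MPI_Wtime() - t_loop_start;
    double T_comp = T_main_loop - t_io;    /* Equation (19) */

    printf("T_main_loop = %.3f s, T_IO = %.3f s, T_comp = %.3f s\n",
           T_main_loop, t_io, T_comp);

    MPI_Finalize();
    return 0;
}
```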

#### 3.2.3. Parallel Performance Analysis

Ideal parallel performance is usually described as a reduction in execution time by a factor equal to the number of processors used. However, this is seldom achieved. Several factors can limit model speedup, some of which are discussed here; a more general overview can be found in several well-known textbooks [51,52]. Examples include loop calculations that cannot be parallelized because each statement depends on previous steps of the calculation, or that require collecting information from all processors before computing the next step (a self-recurrent loop), as illustrated below. In addition, the hardware of the system can impact speed, including chip memory bandwidth and the network. Despite these factors, speedup can be achieved by splitting up the work among multiple processors, which reduces the calculation time.
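As a concrete illustration of the first limitation, a loop of the following form cannot simply be split among processors, because each iteration needs the result of the previous one; the recurrence shown is generic, not taken from GCCOM:

```c
#include <stdio.h>

int main(void)
{
    double x[1000];
    x[0] = 1.0;

    /* Loop-carried dependency: iteration i reads x[i-1], so the
     * iterations must execute in order and cannot be distributed
     * without restructuring the recurrence. */
    for (int i = 1; i < 1000; ++i)
        x[i] = 0.5 * x[i - 1] + 1.0;

    printf("x[999] = %f\n", x[999]);
    return 0;
}
```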

Figure 7 compares the speedup of the PETSc streams test with the speedup of the GCCOM computational work done in the main loop, for the Seamount test cases, as a function of the number of cores. The plot shows a linear trend, which is a consequence of the maximum memory bandwidth allocation increasing with the growing number of processors. We can see that the communication bandwidth capacity grows linearly across the system, in this case up to 240 processors on 12 nodes. Interestingly, the high resolution seamount experiment speedup (3000 × 200 × 100) is consistently better than the streams test speedup. This is explained by the size of the high resolution problem benefiting from internal PETSc optimizations that occur within the DM and DMDA objects, which dynamically repartition grids and adjust the MPI communicators during the computations [43]. This is not the case for the lower resolution cases: for these, we see that the speedup trend follows the streams test closely but never surpasses the PETSc limit. In fact, the lower resolution trends are bogged down by excessive data distribution, and hence more message passing, and their speedup ends up worse than the streams test as more processors are added.
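For context, the DMDA objects referred to above are PETSc's distributed structured-grid arrays, which manage the domain decomposition and the halo regions exchanged between neighboring subdomains. A minimal sketch of declaring a grid with the high resolution problem's dimensions follows; the boundary types, stencil choice, and stencil width here are illustrative assumptions, not necessarily those used in GCCOM:

```c
#include <petscdmda.h>

int main(int argc, char **argv)
{
    DM da;

    PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

    /* 3000 x 200 x 100 global grid; PETSc chooses the processor layout
     * (PETSC_DECIDE) and manages a one-cell halo around each subdomain. */
    PetscCall(DMDACreate3d(PETSC_COMM_WORLD,
                           DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                           DMDA_STENCIL_STAR,
                           3000, 200, 100,                           /* global sizes   */
                           PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, /* proc layout    */
                           1,                                        /* dof per node   */
                           1,                                        /* stencil width  */
                           NULL, NULL, NULL, &da));
    PetscCall(DMSetFromOptions(da));
    PetscCall(DMSetUp(da));

    /* ... create vectors with DMCreateGlobalVector, run the solver ... */

    PetscCall(DMDestroy(&da));
    PetscCall(PetscFinalize());
    return 0;
}
```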

The parallel efficiency indicates how effectively an application uses an increasing number of available resources. Efficiency is defined by Equation (20):

$$E\_n = \frac{T\_1}{n \ast T\_n},\tag{20}$$

where *T*1 and *Tn* are the execution times for one and *n* processors, respectively, and *n* is the number of processors. Ideal efficiency would be the case, for example, where doubling the number of processors halves the run time. Efficiency is expected to decrease when too many processors are allocated to a specific problem size. The optimal number of processors to use, from an economical perspective, can be determined from this metric. It is important to keep in mind that the application can show speedup while still decreasing its efficiency. The most common causes for decreasing efficiency are usually related to memory bandwidth speed limits and sub-optimal domain decomposition. As we will see later in this section, when these factors are taken into account, the scaling performance behavior can be explained.
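As an illustrative numerical example, reusing the hypothetical timings from the speedup sketch in Section 3.2.1 (not measured values from our runs), with *T*<sub>1</sub> = 1000 s and *T*<sub>40</sub> = 34 s on 40 processors:

$$E\_{40} = \frac{T\_1}{40 \ast T\_{40}} = \frac{1000}{40 \ast 34} \approx 0.74,$$

i.e., roughly 74% efficiency despite a nearly 30-fold speedup, showing how an application can keep speeding up while its efficiency declines.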

Figure 8 shows the efficiency for the GCCOM seamount test cases and the PETSc streams test, calculated using Equation (20). From this, we can see that the PETSc streams efficiency levels off at around 30%. This value is typically regarded as the realistic efficiency of the system, limited by memory bandwidth; as such, we see that for the highest resolution problem (3000 × 200 × 100) we obtain better-than-streams efficiency across all nodes used, hinting at our parallelization overcoming the memory bandwidth overhead up to 240 processors. This efficiency is expected to decrease with a higher number of processors, something we do not yet see happening for the highest resolution case but which does happen for the lower resolution experiments. Once again, as we saw with the speedup chart, these problems are still too small to take real advantage of the parallelization; although they present some speedup, the efficiency of the resources allocated hardly justifies using more than a few nodes to run them.

**Figure 7.** Speedup scaling for stratified seamount experiments of different resolutions compared to the Portable, Extensible Toolkit for Scientific Computation (PETSc) streams test results for the test system [16]. The theoretical ideal is shown for reference.

**Figure 8.** Efficiency for stratified seamount experiments of different resolutions compared to the PETSc streams test results for the test system [16]. A theoretical ideal is shown for reference.

Finally, given the restrictions imposed on the model in the form of self-recurrent algorithms and the data distribution forced by the HPGF algorithm, and keeping in mind that we have left out the I/O and initialization processes (since they have not been parallelized), we regard this version of the parallel implementation as a success. The model demonstrates efficiencies that are better than the streams test estimates for our stratified seamount experiment, yet it preserves the solution to any practical threshold even when partitioning the problem across 240 processors and 12 nodes.
