*LICOM's Parallel Scheme*

The ocean is discretized into two-dimensional grid points with staggered latitude and longitude coordinates in the horizontal direction. The latitudinal direction is divided into *imt*\_*global* grid points, and the longitudinal direction is divided into *jmt*\_*global* grid points. The vertical direction is divided into *km* layers, with the ocean depth as the coordinate. Thus, the ocean is decomposed into a three-dimensional structure. On this basis, the ocean grid points are organized into grid blocks, each containing *BLCKX*×*BLCKY*×*km* grid points. Barotropic-mode communication takes place between grid blocks, while the pressure and vertical velocity calculations couple the vertical layers. As shown in Figure 2, the ocean area is decomposed into *imt*\_*global*/*BLCKX*×*jmt*\_*global*/*BLCKY* grid blocks. To achieve higher efficiency with the finite-difference method, LICOM uses MPI to parallelize and speed up the calculation. *NBLOCKS*\_*CLINIC* grid blocks are allocated to each process, and the grid points within each block are computed sequentially.

**Figure 2.** Ocean grid.
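As a minimal sketch of the decomposition arithmetic described above (the parameter names *imt*\_*global*, *jmt*\_*global*, *BLCKX*, *BLCKY*, *km*, and *NBLOCKS*\_*CLINIC* come from the text; the concrete numbers and the uniform block-to-process assignment are illustrative assumptions only):

```
program block_decomposition
  implicit none
  ! illustrative sizes only; LICOM takes these from its configuration
  integer, parameter :: imt_global = 360, jmt_global = 240   ! horizontal grid points
  integer, parameter :: blckx = 30, blcky = 30               ! grid points per block
  integer, parameter :: km = 30                              ! vertical layers
  integer, parameter :: nprocs = 32                          ! MPI processes
  integer :: nblocks_x, nblocks_y, nblocks_clinic

  nblocks_x = imt_global / blckx     ! blocks in the i direction
  nblocks_y = jmt_global / blcky     ! blocks in the j direction

  ! each process receives NBLOCKS_CLINIC blocks and computes the
  ! BLCKX x BLCKY x km points of every block sequentially
  nblocks_clinic = (nblocks_x * nblocks_y) / nprocs

  print *, 'total blocks:', nblocks_x*nblocks_y, &
           ' points per block:', blckx*blcky*km, &
           ' blocks per process:', nblocks_clinic
end program block_decomposition
```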

There is an independent array for each grid block, and the size of the array decreases as the degree of parallelism increases. Since the differential computation requires boundary values from adjacent grid blocks, frequent boundary communication is necessary. The array of each grid block is updated with the boundary values received from the logically adjacent processes in the two-dimensional decomposition. Therefore, in addition to its own data, the array of each grid block stores the boundary data of the adjacent grid blocks. The boundary data received from other grid blocks are called the ghost boundary, while a block's own boundary data are called the real boundary. LICOM employs an adjustable ghost boundary strategy, in which the width of the ghost boundary can be chosen for a specific problem. The blue squares in Figure 2 show the situation when the ghost width equals 2. Setting the ghost width to 2 instead of 1 effectively reduces the communication-to-calculation ratio, since communication is only needed once every two calculation steps.

Figure 3 illustrates the calculation procedure of a grid point in one iteration step. First, the two stripes of the ghost boundary of each grid block are updated: the real boundary stripes c and d of grid block 1 are transferred to stripes b and a of grid block 2, and, meanwhile, the real boundary stripes c and d of grid block 2 are transferred to stripes b and a of grid block 1. Then, the grid points, including the real boundaries, are updated: in grid block 1 they are updated from left to right up to stripe b, while in grid block 2 they are updated from right to left up to stripe b. Finally, both grid blocks 1 and 2 are updated up to stripe c.

**Figure 3.** Ocean grid boundary communication.
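A minimal sketch of this update order, assuming one block per process and leaving the MPI transfer itself as a placeholder (all names below are illustrative, not LICOM identifiers): with a ghost width of 2, the step right after an exchange updates one point into the ghost stripe, so the next step can be computed without any communication.

```
program ghost_demo
  implicit none
  integer, parameter :: nx = 8, ny = 8      ! interior points of one block
  integer, parameter :: nghost = 2          ! adjustable ghost width
  integer, parameter :: nsteps = 6
  real :: field(1-nghost:nx+nghost, 1-nghost:ny+nghost)
  integer :: istep

  field = 0.0
  do istep = 1, nsteps
     ! with a ghost width of 2, the boundary exchange is needed only
     ! once every two calculation steps
     if (mod(istep-1, nghost) == 0) call exchange_halo(field)
     call advance_one_step(field, istep)
  end do

contains

  subroutine exchange_halo(f)
    real, intent(inout) :: f(1-nghost:nx+nghost, 1-nghost:ny+nghost)
    ! placeholder for the MPI update of the ghost stripes from the
    ! logically adjacent processes
  end subroutine exchange_halo

  subroutine advance_one_step(f, istep)
    real, intent(inout) :: f(1-nghost:nx+nghost, 1-nghost:ny+nghost)
    integer, intent(in) :: istep
    integer :: i, j, shrink
    ! right after an exchange the update reaches one point into the
    ! ghost stripe, so the following step still has valid neighbours
    ! without any further communication
    shrink = mod(istep-1, nghost)
    do j = 2-nghost+shrink, ny+nghost-1-shrink
       do i = 2-nghost+shrink, nx+nghost-1-shrink
          f(i,j) = 0.25*(f(i-1,j)+f(i+1,j)+f(i,j-1)+f(i,j+1))  ! stands in for the real stencil
       end do
    end do
  end subroutine advance_one_step
end program ghost_demo
```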

#### **3. Optimization of Parallelization**

This section illustrates the series of optimizations that were implemented to speed up the running of LICOM. The original LICOM was parallelized by using only MPI. We utilized a hybrid of MPI and OpenMP in an optimized version of LICOM. Algorithm 1 shows an example of the implementation of OpenMP. However, due to the architecture of the supercomputers, the hybrid of MPI and OpenMP did not achieve good performance because the utilization of the MPI and OpenMP hybrid had no advantages at small scales. Thus, this optimization of the hybrid of MPI and OpenMP is only discussed, but is not included in the description of the performance tests in the following section. In the test on Tianhe III with a large number of PEs, we used the hybrid of MPI and OpenMP in order to utilize more PEs.
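As a minimal illustration of the kind of directive involved, a nested grid-point loop inside one MPI process can be threaded as follows; the subroutine and array names are assumptions for illustration, not identifiers taken from Algorithm 1:

```
subroutine threaded_update(a, imt, jmt, km)
  implicit none
  integer, intent(in) :: imt, jmt, km
  real, intent(inout) :: a(imt, jmt, km)
  integer :: i, j, k

  ! MPI still decomposes the horizontal domain into blocks; inside one
  ! process, OpenMP threads share the work of the grid-point loop
!$OMP PARALLEL DO PRIVATE(i, j)
  do k = 1, km
     do j = 1, jmt
        do i = 1, imt
           a(i, j, k) = 0.5 * a(i, j, k)   ! stands in for the real update
        end do
     end do
  end do
!$OMP END PARALLEL DO
end subroutine threaded_update
```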



#### *3.1. Optimization A: Improving the Parallelization Scheme*

We optimized the parallelization scheme of LICOM, as shown in Figure 4. Suppose *N* grid points are to be allocated to *np* processors. If *N* is divisible by *np*, the best decomposition is to allocate *n* = *N*/*np* grid points to each processor. In most cases, however, *N* is not divisible by *np*. In the original scheme, with *n* = *N*/*np* (integer division), each processor starting from processor No. 0 was allocated *n* + 1 grid points, and the last processor was allocated the remaining *N* − (*n* + 1) ∗ (*np* − 1) grid points. In the optimized scheme, *n* grid points are allocated to each processor, and the remaining *N* − *n* ∗ *np* grid points are spread evenly over the processors, one extra point per processor. In this way, the scalability of the model was considerably improved.
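A minimal sketch of the optimized allocation, with illustrative names (counts(p) holds the number of grid points given to processor p; the first *N* − *n* ∗ *np* processors each receive one extra point):

```
subroutine distribute(npoints, nprocs, counts)
  implicit none
  integer, intent(in)  :: npoints, nprocs
  integer, intent(out) :: counts(nprocs)
  integer :: base, rest, p

  base = npoints / nprocs          ! n = N / np (integer division)
  rest = mod(npoints, nprocs)      ! N - n*np leftover grid points
  do p = 1, nprocs
     counts(p) = base
     if (p <= rest) counts(p) = counts(p) + 1   ! spread the remainder evenly
  end do
end subroutine distribute
```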

We also redesigned the software structure of LICOM to modernize the code and make it more convenient to use. For example, we replaced an alterable array with a fixed array, used structured data, and extracted the important parameters. As a result, there is no need to recompile or alter the code for different experiments or processor distributions.

#### *3.2. Optimization B: Communication Optimization*

To reduce the time consumed by communication, we optimized the communication procedure. Data packing was employed to reduce the communication time, and the algorithms were improved by replacing communication with calculation.
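The data packing can be sketched as follows, assuming one ghost row on each side of a block and a single northward exchange; the field names ub, vb, and h0 follow Algorithm 4, while the routine name and the neighbour handling are illustrative assumptions:

```
subroutine exchange_packed(ub, vb, h0, imt, jmt, north, south, comm)
  use mpi
  implicit none
  integer, intent(in) :: imt, jmt, north, south, comm
  real, intent(inout) :: ub(imt,jmt), vb(imt,jmt), h0(imt,jmt)
  real :: sendbuf(3*imt), recvbuf(3*imt)
  integer :: status(MPI_STATUS_SIZE), ierr

  ! pack the northernmost computed row of each field into one buffer
  sendbuf(1:imt)        = ub(:, jmt-1)
  sendbuf(imt+1:2*imt)  = vb(:, jmt-1)
  sendbuf(2*imt+1:)     = h0(:, jmt-1)

  ! one message instead of three separate boundary exchanges
  call MPI_SENDRECV(sendbuf, 3*imt, MPI_REAL, north, 0, &
                    recvbuf, 3*imt, MPI_REAL, south, 0, &
                    comm, status, ierr)

  ! unpack the received row into the southern ghost row of each field;
  ! the exchange in the opposite direction is analogous
  ub(:, 1) = recvbuf(1:imt)
  vb(:, 1) = recvbuf(imt+1:2*imt)
  h0(:, 1) = recvbuf(2*imt+1:)
end subroutine exchange_packed
```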

(I) In high-latitude areas, the one-dimensional horizontal smoothing algorithm needs to perform smoothing more than once, and the number of smoothing passes differs from latitude to latitude. The original algorithm therefore performed the smoothing, together with its boundary communication, latitude by latitude, whereas the new algorithm performs a single smoothing sweep over all latitudes followed by one two-dimensional boundary exchange. Algorithms 2 and 3 show the representative code alterations: Algorithm 2 gives the original code, and Algorithm 3 describes the optimized version.

**Figure 4.** Optimization of the parallelization scheme.

(II) According to previous test results obtained with Intel VTune, the hotspots of LICOM are the communication functions, so optimizing the communication is of great importance. During the integration, the algorithm with the least communication was employed, and some communications were replaced by additional boundary calculations. Eliminating communication by adding some calculation is beneficial because, with a large number of PEs, the total calculation is divided among the PEs, while the cost of communication grows with the number of PEs. Algorithms 4 and 5 show the representative code alterations: the original code in Algorithm 4 was optimized into the code in Algorithm 5 to eliminate the cost of communication.

**Algorithm 3:** Communication Optimization (I-B)

```
for NCY←1 to MAX_NN do
   for j←jst to jmt do
      if NN(j)≥NCY then
         for K←1 to KK do
            for I←1 to IMT do
               Calculate XS(I);
            end
            for I←2 to IMM do
               Calculate X(I,J,K);
            end
         end
      end
   end
   Boundary Exchange 2D;
end
```
**Algorithm 4:** Communication Optimization (II-A)

```
for J←JSM to JEM do
   for I←2 to IMM do
      Calculate UB(I,J), VB(I,J), and H0(I,J) ;
   end
end
Exchange boundary ub, vb, and h0 ;
for J←JSM to JEM do
   for I←2 to IMM do
      Calculate WKA(I,J,1), WKA(I,J,2), and WORK(I,J) ;
   end
end
```
**Algorithm 5:** Communication Optimization (II-B)

```
for J←JST to JET do
   for I←1 to IMT do
       Calculate UB(I,J), VB(I,J), and H0(I,J) ;
   end
end
for J←JST to JET do
   for I←1 to IMT do
       Calculate WKA(I,J,1), WKA(I,J,2), and WORK(I,J) ;
   end
end
```
(III) In the original code (Listing 1), the grid matching information, i.e., the starting indices of each process, was gathered via communication. In the optimized code (Listing 2), it is obtained by a direct calculation. Therefore, point-to-point communication was reduced by a considerable amount.

**Listing 1.** Communication Optimization (III-A)

```
! rank 0 gathers the start indices of every other process
if ( mytid == 0 ) then
  i_start(1) = i_global(1)
  j_start(1) = j_global(1)
  do n = 1, nproc-1
     call mpi_recv ( j_start(n+1), 1, mpi_integer, n, tag_1d, &
                     mpi_comm_ocn, status, ierr )
     call mpi_recv ( i_start(n+1), 1, mpi_integer, n, tag_2d, &
                     mpi_comm_ocn, status, ierr )
  end do
else
  ! every other rank sends its own start indices to rank 0
  j_start(1) = j_global(1)
  i_start(1) = i_global(1)
  call mpi_send ( j_start(1), 1, mpi_integer, 0, tag_1d, &
                  mpi_comm_ocn, ierr )
  call mpi_send ( i_start(1), 1, mpi_integer, 0, tag_2d, &
                  mpi_comm_ocn, ierr )
end if
```
**Listing 2.** Communication Optimization (III-B)

```
! every process computes the start indices of all processes directly
! from the two-dimensional process layout, with no communication
do i = 1, nproc
   iix = mod( i-1, nx_proc )
   iiy = ( i-1-iix ) / nx_proc
   i_start(i) = iix*( imt - num_overlap ) + 1
   j_start(i) = 3
   if ( iiy /= 0 ) then
      do j = 1, iiy
         j_start(i) = j_start(i) + jmt - num_overlap
      end do
   endif
end do
```