#### *3.3. Optimization C: Calculation Optimization*

We applied a series of mature optimization methods, including vectorization, memory access optimization, instruction optimization, cache optimization, and runtime optimization, in order to fully exploit the potential performance of the stencil calculations on the CPU cluster. One example is expanding function calls inside a loop, as shown in Algorithms 6 and 7: the original DENS function in Algorithm 6 was expanded in place in the optimized version in Algorithm 7 so that the loop could be vectorized.
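Algorithms 6 and 7 themselves are given elsewhere in the paper; as an illustration only, the following C sketch shows the general idea of expanding a point-wise function inside a loop so that the compiler can vectorize it. The `dens()` interface and its coefficients are hypothetical stand-ins, not LICOM's actual equation of state (the LICOM code itself is Fortran).

```c
#include <stddef.h>

/* Hypothetical point-wise density function, a stand-in for LICOM's DENS.
 * In the Fortran original, DENS lives in a separate compilation unit, so
 * the call inside the loop can block auto-vectorization. */
static double dens(double t, double s)
{
    return 1000.0 + 0.8 * s - 0.2 * t - 0.005 * t * t; /* made-up coefficients */
}

/* Original form (cf. Algorithm 6): one function call per grid point. */
void density_original(const double *t, const double *s, double *rho, size_t n)
{
    for (size_t i = 0; i < n; i++)
        rho[i] = dens(t[i], s[i]);
}

/* Expanded form (cf. Algorithm 7): the body of dens() is written directly
 * into the loop, leaving a straight-line arithmetic kernel that the
 * compiler can vectorize (e.g., with -O3 on gcc or icc). */
void density_expanded(const double *t, const double *s,
                      double *restrict rho, size_t n)
{
    for (size_t i = 0; i < n; i++)
        rho[i] = 1000.0 + 0.8 * s[i] - 0.2 * t[i] - 0.005 * t[i] * t[i];
}
```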



#### *3.4. Optimization D: Parallel IO*

We also designed an asynchronous parallel IO method. As shown in Figure 5, an independent node was employed to conduct the IO procedures, and a separate communication group was created for the IO work. When the computing nodes needed to perform IO after a calculation step, the communication group of the computing nodes collected the data and sent it to the IO node. After receiving all of the data, the IO node began the IO procedures. Meanwhile, the nodes of the computing group continued with their calculations instead of waiting for the IO to complete. The elapsed time for IO was therefore hidden by overlapping the calculations with the IO.

**Figure 5.** Asynchronous parallel IO.
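Since the paper does not list the implementation, the following is a minimal MPI sketch of this scheme in C, assuming one dedicated IO rank and a communicator split for the computing group; the buffer layout, tags, and file format are illustrative. For brevity, each compute rank sends its data directly to the IO node, whereas LICOM first collects the data within the computing group.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define FIELD_N 4096   /* per-rank field size (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The last rank acts as the dedicated IO node; the remaining ranks
     * form the computing group with its own communicator. */
    int io_rank = size - 1;
    int is_compute = (rank != io_rank);
    MPI_Comm compute_comm;
    MPI_Comm_split(MPI_COMM_WORLD, is_compute, rank, &compute_comm);

    double *field = malloc(FIELD_N * sizeof(double));
    for (int i = 0; i < FIELD_N; i++) field[i] = rank + 0.1 * i;

    if (is_compute) {
        /* Non-blocking send: the compute rank hands its data to the IO
         * node and immediately continues with the next model step. */
        MPI_Request req;
        MPI_Isend(field, FIELD_N, MPI_DOUBLE, io_rank, 0,
                  MPI_COMM_WORLD, &req);

        /* ... the next calculation step overlaps with the IO below ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE); /* reuse the buffer only after this */
    } else {
        /* IO node: receive from every compute rank, then write to disk. */
        FILE *fp = fopen("snapshot.bin", "wb");
        for (int src = 0; src < size - 1; src++) {
            MPI_Recv(field, FIELD_N, MPI_DOUBLE, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            fwrite(field, sizeof(double), FIELD_N, fp);
        }
        fclose(fp);
    }

    free(field);
    MPI_Comm_free(&compute_comm);
    MPI_Finalize();
    return 0;
}
```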

#### **4. Parallel Performance and Application in Actual Scenarios**

Overall, we ran three sets of tests. The setup of the tests is shown in Table 1.

**Table 1.** Setup of Tests.


For the first test, we compared three versions of the code: (a) the original LICOM; (b) the semi-optimized LICOM, in which optimization methods B(I), B(III), and C were employed; and (c) the fully optimized LICOM. The hardware environments were the "Era" and "Tianhe II" supercomputers. For the second test, the fully optimized code was run on the "Tianhe III" supercomputer. For the third test, the fully optimized code was run on the "Era" supercomputer over a long period in a real scenario. For all tests on the three hardware platforms, the same test case with a global resolution of 10 km × 10 km was used, with the simulation starting in 1993. The detailed configurations, which controlled the run of LICOM and described the real scenario in our test, are listed in Table 2. The elapsed time for one model day was used as the indicator for comparison; we ran each test at least five times and took the average. The simulated years per day (SYPD), a widely used indicator of running speed, was calculated from the elapsed time of one model day.
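For reference, the conversion from elapsed time to SYPD is straightforward. Assuming a 365-day model year and writing $T_{\mathrm{day}}$ for the elapsed wall-clock seconds per model day (our notation, not the paper's):

$$\mathrm{SYPD} = \frac{86400}{365 \times T_{\mathrm{day}}}$$

For example, the 9 SYPD reported on Era below corresponds to an elapsed time of roughly 26 s per model day.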

**Table 2.** Configurations of the test cases.


#### *4.1. Performance on Era*

The computing nodes of Era used Intel Xeon E5-2680 v3 CPUs (2.5 GHz). The operating system was Linux CentOS release 6.4 (Final). The compilers were Intel composer\_xe\_2013\_sp1.0.080 with Intel MPI 4.1.3.049. In this environment, we conducted tests using up to 4800 processor cores. Because a full LICOM run is complicated, the elapsed time for simulating one model day was used as the performance indicator, together with the simulated years per day (SYPD), which is commonly used to measure the computational performance of models. Figure 6 shows the elapsed time and running speed of the three versions of LICOM, while Figure 7 shows the speedups. The speedup of each version of the code was calculated separately, with its elapsed time on 1200 PEs as the reference. For instance, the speedup of the original code on 4800 PEs over 1200 PEs was the elapsed time of the original code on 1200 PEs divided by the elapsed time of the original code on 4800 PEs, and likewise for the fully optimized code. We chose the elapsed times on 1200 PEs as references because, at a resolution of 10 km × 10 km, LICOM needs a very large memory space; enough PEs were required so that, after decomposition, the limited memory on each PE could meet LICOM's demands.
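Written compactly (our notation): with $T_v(N)$ denoting the elapsed time of code version $v$ on $N$ PEs, the speedup plotted in Figure 7 is

$$S_v(N) = \frac{T_v(1200)}{T_v(N)}$$

so each version is measured against its own 1200-PE baseline rather than against a common reference.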

As shown in Figures 6 and 7, the semi-optimized LICOM was much faster than the original LICOM, but it still scaled poorly. In contrast, the fully optimized LICOM, with the optimized decomposition scheme, showed good scalability up to 4800 processor cores, where the computing speed reached 9 SYPD. Additionally, the elapsed times of both the original and semi-optimized code on 3600 PEs were longer than those on 2400 PEs. This counterintuitive behavior did not appear in the fully optimized code, so we can infer that its cause was addressed by our optimizations; the likely causes were communication overhead and load imbalance.

**Figure 6.** Elapsed time and speed of LICOM on Era.

#### *4.2. Performance on Tianhe II*

Based on the test results on Era, we conducted tests at a larger scale on Tianhe II. The processor was an Intel Xeon E5-2692 v2 (Ivy Bridge, 2.2 GHz). The operating system was Red Hat Enterprise Linux Server release 6.5 (Santiago). The compilers were Intel composer\_xe\_2013\_sp1.2.144 with MPICH 3.1.3. In this environment, we tested the same three code versions. As shown in Figures 8 and 9, the fully optimized LICOM achieved good speedups when 9600 and 19,200 processor cores were used, and the computing speed reached 12.6 SYPD, twice the speed of the original LICOM. However, as shown in Figure 9, the counterintuitive behavior appeared again: the speedup of the semi-optimized code on 9600 PEs was smaller than that on 4800 PEs. As on Era, this was likely due to communication overhead and load imbalance. In contrast, the fully optimized code achieved good speedups on 9600 and 19,200 PEs, showing that our optimizations were effective in improving the scalability of the code.

**Figure 7.** Speedup of LICOM on Era.

**Figure 8.** Elapsed time and speed of LICOM on Tianhe II.

**Figure 9.** Speedup of LICOM on Tianhe II.

#### *4.3. Performance on Tianhe III*

We tested the fully optimized LICOM on the prototype system of the Tianhe III supercomputer, reaching up to 245,760 PEs, the sum of the CPU cores and the Matrix-2000 cores. Figure 10 shows the performance of LICOM on various numbers of PEs. The speedup in Figure 11 is the running time on 1920 PEs divided by the running time on each PE count. We chose 1920 PEs as the reference because the elapsed time on 960 PEs was too long and would have distorted the speedup diagram. Figure 11 shows that there is still room for optimization, since the speedup fell when more than 61,400 cores were used.

Comparing the speedups on the three supercomputers, Era and Tianhe II still showed good speedups at 4800 and 19,200 PEs, respectively; we did not test on more PEs there because the supercomputers' usage policies did not grant us access to larger allocations. On Tianhe III, however, the speedup degraded as more PEs were used, possibly due to communication overhead. Additionally, at similar PE counts, Era achieved the best speedup: on 4800 PEs (7680 on Tianhe III), the speedups of the fully optimized code on the three platforms were about 4.6, 3.9, and 3, respectively. A possible reason is that Era had the best-performing CPUs.

#### *4.4. Experiment on the Application of LICOM in a Real Scenario*

Moreover, we conducted a set of application tests. LICOM was run at a resolution of 10 km × 10 km, with CORE-II employed as the forcing field. The simulation period was from 1993 to 2007, and the output included temperature, salinity, sea surface height, and the 3D current field. Figure 12 (top) shows the sea surface height anomaly (the difference from the mean value) on 31 December 2007, in which eddies are clearly apparent. The middle image in Figure 12 shows the average number of eddies at every grid point: eddies occurred frequently at the western and eastern boundaries and in the Southern Ocean, and at least 50 eddies occurred in some areas. There was a considerable difference between the structures of anticyclonic and cyclonic eddies, as shown in the bottom image in Figure 12: an anticyclonic eddy had a sunken center with a temperature increase of about 2 °C, while a cyclonic eddy had a raised center with a temperature decline of about 2 °C.

**Figure 10.** Elapsed time and speed of LICOM on Tianhe III.

**Figure 11.** Speedup of LICOM on Tianhe III.

**Figure 12.** Output of the LICOM application.

In addition, we compared the results produced by the original code and the fully optimized code to check for differences. Figure 13 shows the simulated sea surface temperature and the differences between the two versions.

**Figure 13.** Correctness of simulations.
