Article

Performance Evaluation and Optimization of the Weather Research and Forecasting (WRF) Model Based on Kunpeng 920

Jian Huang, Wu Wang, Yuzhu Wang, Jinrong Jiang, Chen Yan, Lian Zhao and Yidi Bai
1 Application Development Department, Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
2 School of Information Engineering, China University of Geosciences, Beijing 100083, China
3 HPC Laboratory of Huawei Technology Co., Ltd., Hangzhou 310052, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(17), 9800; https://doi.org/10.3390/app13179800
Submission received: 30 June 2023 / Revised: 11 August 2023 / Accepted: 16 August 2023 / Published: 30 August 2023

Abstract

The Weather Research and Forecasting (WRF) model is a mesoscale numerical weather prediction system that is widely used on major high-performance server platforms. This study focuses on the performance evaluation and optimization of WRF on Huawei’s self-developed Kunpeng 920 processor platform, aiming to improve the operational efficiency of WRF. The results show that WRF scales well on the Kunpeng 920 processor, and its performance improves by 32.6% after invoking the fast math library and optimizing the domain decomposition and tile division. In terms of IO, the main optimizations are parallel IO and asynchronous IO. Ultimately, the single output time of WRF is reduced from 37.28 s in serial IO mode to 0.14 s in asynchronous IO mode, and the overall running time is reduced from 1078.80 s to 807.94 s.

1. Introduction

Climate change significantly affects human society and the natural environment [1]. As an important meteorological service, weather forecasting provides people with accurate and timely weather information to help them make correct decisions in coping with the impacts of climate change [2]. As the demand for forecast accuracy and timeliness increases, the computational workload of the WRF model and the computational resources it requires also increase [3].
In High-Performance Computing (HPC), hardware architectures differ significantly between servers, so the same program shows different performance when run on different servers [4]. As a typical HPC application, the Weather Research and Forecasting (WRF) model is often used by major chipmakers to demonstrate the strength of their high-performance servers: they port and optimize WRF on their platforms to highlight their superior performance [5,6,7,8]. Huawei’s self-developed Kunpeng 920 processor is a high-performance server chip based on the ARM architecture [9], yet research on WRF on this processor remains relatively limited.
This paper focuses on the performance evaluation and optimization of WRF on the Kunpeng 920 processor, with the primary goal of making full use of the processor’s computational resources to improve the operational efficiency of WRF. The specific research includes:
  • Introducing the architecture of Kunpeng 920 processor;
  • Analyzing the scalable performance and hotspot functions of WRF on the Kunpeng 920 processor platform;
  • Conducting MPI (Message Passing Interface) + OpenMP (Open Multi-Processing) hybrid parallel optimization experiments for WRF on Kunpeng;
  • Calling Libamath, a math library developed by ARM, to accelerate the math calculation part of WRF;
  • Optimizing compilation options, region decomposition and tile division for WRF on the Kunpeng processor platform to improve the computational efficiency of WRF;
  • Adopting asynchronous + parallel IO scheme instead of the default serial IO to improve the IO efficiency of WRF.

2. Related Work

Related work on WRF on high-performance computers is dedicated to improving the performance, accuracy, and applicability of the model, providing more powerful tools and resources for meteorological research and weather forecasting. Through continuous research and innovation, the performance of the WRF model in HPC environments continues to improve, bringing more value to scientific research and societal applications. The following summarizes work related to the porting and optimization of WRF on other high-performance computers.
Morton [10] used a 1 km-resolution case with over one billion grid points to present results of scaling evaluations on the Cray XT5 for distributed (MPI) and hybrid (MPI/OpenMP) modes of computation.
Malakar [11] studied the performance analysis and optimization of nested-domain simulations with WRF on the IBM Blue Gene/P platform. The study explored various methods to optimize the WRF-based code, resulting in a 29% reduction in the total runtime of the simulation. The optimizations include improving data access patterns, minimizing communication overhead, and balancing the computational workload across the available cores, highlighting the importance of careful optimization for achieving optimal performance of WRF-based simulations.
Christidis [12] evaluated the performance and scaling of WRF on three different parallel supercomputers: (a) a POWER 775 cluster (p7-IH, IBM POWER7, HFI interconnect), (b) a POWER 460 cluster (PureFlex IBM POWER7, dual QDR InfiniBand interconnect), and (c) an iDataPlex cluster (dx360M4 Intel Sandy Bridge, FDR14 interconnect). The study showed that WRF’s MPI + OpenMP allocation performs differently on different machines, and that the optimal nproc_x and nproc_y configurations for the domain decomposition also differ between machines.
Samuel Elliott [5] analyzed the performance of different components of the WRF model on the Xeon Phi platform to create a set of guidelines advising WRF users on how to efficiently utilize their allocations on Xeon Phi. Their research showed that symmetric execution on Xeon and Xeon Phi can be used for efficient WRF simulations, with significantly faster execution relative to running on either homogeneous architecture alone.
Tricia Balle [13] reviewed the IO methods available in the widely used Weather Research and Forecasting (WRF) model and highlighted the new quilt server + parallel NetCDF (PNETCDF_QUILT) technique, which combines an asynchronous IO (quilt) server with parallel NetCDF on the Cray XC40 platform.
Pavani Andraju [14] designed benchmarks for operational configurations of the WRF model in the Indian region based on the supercomputer of the University Grants Commission center. They performed evaluations to find the optimal computational resources required to run the WRF model in real-time workflows and considered various aspects of the scalability and IO performance of the WRF model on HPC.
R. Moreno [8] investigated the impact of different process distributions on simulation times and the reasons behind it, and suggested a better distribution algorithm than the WRF implementation on the GALGO supercomputer. He also examined the cost of reducing the wall clock time of the simulation by increasing the number of processing resources used, and quantified the corresponding loss of computing efficiency.
The above shows that many researchers have optimized WRF on different high-performance platforms. Their results show that the performance obtained with the default configuration of WRF after porting to a new HPC server is not optimal, and that the optimal configurations of WRF vary greatly between platforms. It is therefore necessary to optimize WRF on each target platform in order to obtain the best performance.
Unlike previous studies, this study focuses on the performance evaluation and optimization of WRF on the Kunpeng 920 processor platform. WRF is also accelerated using the fast math library specific to the ARM architecture, an optimization tool not available in the above studies. In addition, MPI + OpenMP hybrid parallelism, compilation option optimization, domain decomposition, tile division, and asynchronous + parallel IO are all used in this study to obtain the optimal performance of WRF on the Kunpeng 920 processor, in contrast to previous studies that focused on a single optimization method.

3. Materials

3.1. Kunpeng 920 Processor

The Kunpeng 920 is a server-class chip based on the ARMv8-A architecture [15], developed by Huawei using a 7 nm manufacturing process and the Taishan V110 microarchitecture. Its main features are high performance, low power consumption, and a number of advanced technologies. The processor provides 64 cores, organized into two super core clusters (SCCL) and one super IO cluster (SICL). Each SCCL in turn contains eight core clusters (CCL), and each CCL contains four Taishan V110 cores (as shown in Figure 1) running at a main frequency of 2.6 GHz. In terms of cache structure, the Kunpeng 920 processor adopts a multi-level cache design with L1, L2, and L3 caches (shown in Figure 2). The L1 Cache is split into an instruction Cache and a data Cache, each 64 KB in size, providing high-speed access and greatly improving processor efficiency. The L2 Cache is the next level after the L1 Cache, while the L3 Cache is shared by all cores of the processor, with a size of 64 MB and a set-associative structure.
In terms of IO, the Kunpeng 920 processor increases the number of DDR4 channels from the mainstream 6 to 8. It also integrates PCI Express and CCIX interfaces to increase memory bandwidth and capacity, and can be used to connect different kinds of devices. In terms of vectorization, the Kunpeng 920 processor supports different types of data, including word, double word fixed-point and vector types, where vector data consist of multiple data of the same type. The processor’s register width varies depending on the execution state, and both the general-purpose register file and the SIMD and floating-point register files contain registers of different widths that can be used for floating-point and vector operations [16].

3.2. Case Configuration

In this research, we used the following cases of WRF: Case A, which has a grid size of 1024 × 1024 × 33, a resolution of 4.5 km, and a time step of 27 s, and Case B, which has a grid size of 512 × 512 × 33, a resolution of 9 km, and a time step of 72 s. The simulation time was 6 h, and the results were output once per hour on average.
This paper uses WRF-ARW version 4.0 [17], with the dependency libraries NetCDF (v-4.4.1), HDF5 (v-1.12.0), Zlib (v-1.2.7), Libpng (v-1.2.50), Jasper (v-1.900.1), and Open MPI (v-4.0.1). WRF is compiled and executed in mixed-mode parallelism (MPI + OpenMP) with gcc-9.3.0, which has been verified in previous studies to deliver better compilation performance than the other options [18]. Our benchmark uses the default compilation options (-O2 -ftree-vectorize -funroll-loops) and Case B; all analyses and optimizations of WRF are measured against this benchmark.
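For reference, the Case B settings above correspond roughly to the following namelist.input fragment (a minimal sketch assuming a single domain; the variable names are standard WRF namelist entries, and the values are simply transcribed from the case description):

   &time_control
    run_hours        = 6,
    history_interval = 60,      ! write history output once per hour (minutes)
   /

   &domains
    time_step        = 72,      ! seconds
    e_we             = 512,     ! grid points, west-east
    e_sn             = 512,     ! grid points, south-north
    e_vert           = 33,      ! vertical levels
    dx               = 9000,    ! grid spacing in metres (9 km)
    dy               = 9000,
   /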

3.3. MPI + OpenMP Hybrid Parallelism

MPI is a message-passing application program interface, which is currently the main parallel model for high-performance computing. Its main feature is message passing, which allows parallel computing as long as two computer nodes can communicate; this also determines its high scalability. However, it has a disadvantage which cannot be ignored: passing messages between computer nodes relies heavily on the speed-limited interconnection network, and if there are too many messages, the interconnection network will block, slowing down the data delivery and resulting in a large communication overhead.
OpenMP is a set of compiler-directive-based schemes for designing multiprocessor programs on shared-memory parallel systems. The programmer specifies the intent by adding dedicated directives to the source code, and the compiler then parallelizes the program automatically. The most important feature of this model is its shared memory and faster data communication compared to the MPI model, but this also implies poorer scalability and no information exchange beyond the shared memory.
The OpenMP + MPI hybrid programming model provides two levels of parallelism, within and between nodes, which can take full advantage of both the shared-memory model and the message-passing model and effectively improve the performance of the system.
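As a minimal sketch of this hybrid model (illustrative code, not taken from the WRF source), a Fortran program can initialize MPI with thread support and open an OpenMP parallel region inside each process:

   program hybrid_sketch
     use mpi
     use omp_lib
     implicit none
     integer :: ierr, rank, nprocs, provided
     ! MPI_THREAD_FUNNELED: only the master thread of each process makes MPI calls
     call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
     !$OMP PARALLEL
     print '(a,i0,a,i0,a,i0)', 'process ', rank, ' of ', nprocs, &
           ', thread ', omp_get_thread_num()
     !$OMP END PARALLEL
     call MPI_Finalize(ierr)
   end program hybrid_sketch

Launched with, for example, OMP_NUM_THREADS=4 and mpirun -np 8, this corresponds to one of the process × thread combinations examined in Section 5.1.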

3.4. Libamath

Arm Performance Libraries is a set of optimized standard core math libraries designed for high-performance computing applications running on Arm processors. The libraries are compatible with various versions of GCC and offer optimized routines in both Fortran and C interfaces, including BLAS, LAPACK 3.11.0, FFT functions, and sparse linear algebra. One of the key features of the Arm Performance Libraries is the libamath library, which provides AArch64-optimized versions of various scalar functions such as exponential, logarithm, and error functions, in both single and double precision. In addition, it includes optimized single-precision sine and cosine functions. By linking to libamath ahead of libm, users can ensure that these optimized functions are used.
Libamath also provides vectorized versions (Neon and SVE) of all common math.h functions in libm, which are used whenever possible by the Arm C/C++ Compiler. The Arm Performance Libraries also include libastring, which offers optimized replacements for various string.h functions, such as memmove and memset. Finally, the Arm Performance Libraries are built with OpenMP, which allows for optimal performance in multi-processor environments.
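For illustration only, linking an application against libamath ahead of libm might look like the following fragment of a link line (e.g., in WRF’s configure.wrf); the install path is a placeholder, not a value from this paper:

   # Sketch: point the linker at the local Arm Performance Libraries installation
   # (hypothetical path) and list -lamath before -lm so that exp, log, pow, etc.
   # are resolved from libamath first.
   LDFLAGS_LOCAL = -L/opt/arm/armpl/lib -lamath -lm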

4. Methods

4.1. Scalable Performance Analysis

In parallel computing, several metrics can be used to evaluate the performance of a program after parallelization optimization to determine the effectiveness of the optimization:
Speedup: The ratio of the running time of a serial algorithm to the running time of the parallel optimised algorithm is referred to as the speedup (S). The higher the speedup, the better the parallel optimisation. For parallel programs, the running time t1 obtained with a smaller number of cores p1 is usually divided by the running time t2 obtained with a larger number of cores p2, expressed by the formula:
$S = \frac{t_1}{t_2}$
Efficiency: The speedup of the parallel optimised algorithm divided by the increase in the number of parallel processes is known as the efficiency (E). The higher the efficiency, the more effective the parallelisation. For parallel programs, the speedup S is usually divided by the ratio N of the larger core count p2 to the smaller core count p1, expressed by the following formulas:
$E = \frac{S}{N}, \qquad N = \frac{p_2}{p_1}$
Scalability is the ability to maintain performance gain when system and problem size increase. For parallel programs, suppose we run a parallel program with a fixed number of processes or threads, a fixed problem size, and an efficiency value E. Now, we increase the number of processes or threads used in the program, and if the program’s efficiency value remains E while the problem size increases proportionally, then we say that the program is scalable. A program is called strongly scalable if it can maintain a fixed efficiency when increasing the number of processes or threads without increasing the size of the problem. If the efficiency value can only be maintained by increasing the problem size by the same factor when increasing the number of processes or threads, then the program is called weakly scalable.
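As a worked example using the strong-scaling measurements for Case A reported later in Table 2 ($p_1 = 64$ cores, $t_1 = 4.49$ s; $p_2 = 512$ cores, $t_2 = 0.66$ s):

$N = \frac{512}{64} = 8, \qquad S = \frac{4.49}{0.66} \approx 6.8, \qquad E = \frac{6.8}{8} \approx 0.85,$

i.e., a parallel efficiency of about 85% when scaling from 64 to 512 cores.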

4.2. Perf

Perf is a powerful performance tuning tool in the Linux kernel. By programming the Performance Monitoring Counter (PMC) registers, Perf can provide the programmer with information about instruction cycles, instruction counts, cache misses, jump instructions, and other hardware events [19].
The following five commands are commonly used:
  • Perf list: view the performance events supported by the current hardware and software environment;
  • Perf stat: analyze the performance profile of a given program;
  • Perf top: display system/process performance statistics in real time;
  • Perf record: record the performance events of the system/process over a period of time;
  • Perf report: read the perf.data file generated by perf record, and display the analysis data.
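For illustration, a typical command-line workflow with these tools might look as follows (a sketch only: wrf.exe and the event list are illustrative, and for an MPI run Perf is usually attached to an individual rank with -p <pid> or wrapped around the launcher):

   # summarize hardware counters for a complete run
   perf stat -e cycles,instructions,cache-misses ./wrf.exe

   # sample call graphs while the program runs, then browse the hotspots
   perf record -g ./wrf.exe
   perf report -i perf.data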

4.3. Domain Decomposition and Tile Division

The entire study area in the WRF is called a domain. The subdomain obtained by dividing the domain with processes is called a patch, a process known as domain decomposition. The subdomains are assigned to each process. The patch can also be subdivided into tiles (see Figure 3), a process known as tile division.
WRF uses nproc_x to denote the number of processes in the east–west direction of the domain and nproc_y to denote the number of processes in the north–south direction. The domain decomposition must satisfy the following condition, where n is the number of processes in use:
n = nproc_x × nproc_y
Algorithm 1 summarizes the calculation process of WRF: the values within each tile are computed first; the data of each patch are then computed tile by tile; finally, the parts of each patch that need to communicate exchange data via MPI, yielding the data of the entire domain.
Algorithm 1 Pseudo-code of the data calculation process within a patch of WRF

   function compute (tile)
      for j = 1, tile_max_j
         for k = 1, tile_max_k
            for i = 1, tile_max_i
               tile[i][j][k] = do some calculations
            enddo
         enddo
      enddo
   end function

   procedure solve_em (patch)
      !$OMP PARALLEL DO
      for t = 1, numtiles
         compute(tile t of the patch)
      enddo
      !$OMP END PARALLEL DO

      if MPI PARALLEL then
         transfer halo data to the
         specified neighbouring processes
      end if
   end procedure
The number of grid points assigned to each process in the x-direction is the total number of grid points in the x-direction divided by nproc_x, and the number assigned in the y-direction is the total number of grid points in the y-direction divided by nproc_y.
Patches are the blocks of data assigned to each process after domain decomposition. Once the number of processes in use has been determined, changing the values of nproc_x and nproc_y overrides WRF’s default domain decomposition algorithm, which chooses nproc_y >= nproc_x with nproc_y − nproc_x as small as possible. The nproc_x and nproc_y values can be changed by setting the corresponding parameters in namelist.input [8,12].
In the calculation of tiles, i is the innermost loop index (see Algorithm 1). Given WRF’s domain decomposition and tile division approach, the more grid points a patch is allocated in the i-direction, the better the local memory contiguity in the i-direction of its tiles during computation (see Figure 4). Of course, WRF calculations must also account for communication and hardware architecture in addition to the cache, so allocating ever more grid points to patches in the i-direction is not always better. This explains why WRF performs differently on different computing platforms and in different cases, and motivates searching for the best decomposition combination [7].
Tile division is applied on top of patch partitioning and can be changed by setting numtiles = n in namelist.input, where n is the number of tiles into which each patch is divided. If it is not set, WRF defaults to n = number of threads used. When the tiles are small enough, WRF can load a whole tile into the cache during computation; otherwise, data may have to be reloaded multiple times. A proper tile division can therefore also improve cache access during WRF calculations.
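For illustration, a namelist.input fragment that overrides the default decomposition and tile count might read as follows (the variable names are standard WRF namelist entries in the &domains section; the values are merely examples loosely following the experiments in Section 6.2, not recommended settings):

   &domains
    nproc_x  = 3,     ! processes in the east-west direction
    nproc_y  = 24,    ! processes in the north-south direction
    numtiles = 8,     ! tiles per patch (default: one tile per OpenMP thread)
   /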

4.4. Parallel IO and Asynchronous IO

The default IO mode in WRF is serial IO, in which the master process collects data from the other processes via Gatherv, formats them, and writes them to disk. The default serial IO mode performs well when the data size is small. However, when the data size is large, the memory of the master process may become a bottleneck and the communication overhead of collecting the data is high. It is therefore necessary to use parallel IO to improve the overall performance.
Parallel IO subdivides the processes into multiple process groups, also called aggregation groups. Each process group has a master process that is responsible for receiving data from the other processes in the group, formatting the data, and then writing them to disk. This greatly reduces the communication overhead incurred in serial IO mode, where a single process collects the data of all others [6,12].
In both serial and parallel IO modes, the master process collects data through MPI’s Gatherv, formats the data, and writes them to disk; during this process, the other processes block and wait. In asynchronous IO mode, the processes are divided into computing processes and IO processes, where the computing processes are responsible for computation and the IO processes for writing data to disk. Asynchronous IO removes the waiting overhead that other processes incur while serial or parallel IO writes data, improving the overall performance.
Enabling WRF’s parallel IO involves two steps: first, install PnetCDF before compiling WRF and set the appropriate path in configure.wrf; second, before running WRF, modify the io_form_history and io_form_restart values in namelist.input in the WRF/run directory (the default value of 2 selects NetCDF-formatted output; setting the value to 11 selects the PnetCDF parallel IO mode). It should be noted that data written through parallel IO are not compressed.
The developers of WRF also provide a mechanism for setting up asynchronous IO. In the namelist_quilt section of namelist.input, we can modify the values of nio_tasks_per_group and nio_groups: nio_groups is the number of IO process groups, and nio_tasks_per_group is the number of processes in each IO group. By default, nio_tasks_per_group = 0 (asynchronous IO disabled). The number of asynchronous IO process groups and the number of processes in each group are set according to the following rules (a combined namelist.input sketch is given after the list):
  • Total number of cores = (nproc_x × nproc_y + nio_groups × nio_tasks_per_group) × threads;
  • nio_tasks_per_group < nproc_y;
  • nproc_y should preferably be an integer multiple of nio_tasks_per_group.
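As a combined illustration of the IO settings discussed above (a sketch only; the values follow the experiments in Section 5.2, with io_form_* = 11 for parallel IO and nio_groups = 1, nio_tasks_per_group = 2 for asynchronous IO, though each mode can also be enabled on its own):

   &time_control
    io_form_history     = 11,   ! 11 = PnetCDF parallel IO (2 = serial NetCDF, the default)
    io_form_restart     = 11,
   /

   &namelist_quilt
    nio_tasks_per_group = 2,    ! IO (quilt) processes per group; 0 disables asynchronous IO
    nio_groups          = 1,    ! number of IO process groups
   /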

5. Experimental Section

5.1. Performance Analysis Experiments

The experimental part of the performance analysis for the WRF includes a hybrid parallel analysis of MPI + OpenMP, scalable performance analysis, and hotspot function analysis.
(a). For the MPI + OpenMP parallel programming experiments, our main objective is to find the process × thread combination with the best performance for a given amount of resources. In this paper, we run WRF with 8, 16, 32, 64, and 128 cores. Specifically, when running WRF with 8 cores, we record the WRF runtime (simulating 6 h) for the combinations 8 processes × 1 thread, 4 processes × 2 threads, and 2 processes × 4 threads, compare them side by side, and select the combination with the best performance. The same is repeated for 16, 32, 64, and 128 cores. See Table 1 for details.
(b). For the numerical experiments exploring the scalability of WRF on the Kunpeng 920 platform, we run WRF with 64, 128, 256, and 512 cores using Case A. To avoid the impact of IO on the scalability results, we measure the runtime of a single iteration step; since the single-iteration-step runtime of WRF is relatively stable, we average the runtime of the last ten steps of a one-hour simulation (the same approach is used for the single-iteration-step timings in later experiments). See Table 2 for details.
(c). To analyse the hotspot function of WRF, we use Case B to run WRF with 30 cores and use Perf record to record the performance of WRF runtime in perf.data, while monitoring it in real time via Perf top. Finally, the data in perf.data were analysed as shown in Table 3.

5.2. Performance Optimisation Experiment

The experiments in the performance optimisation section focus on adding ARM fast maths libraries, domain decomposition, tile division, and a study involving parallel IO and asynchronous IO.
(a). During the Libamath optimisation experiments, we begin by running WRF with 20, 40, 60 and 120 cores using Case B, the benchmark compilation options, and the default namelist.input configuration, and we record the single-iteration-step running time of WRF for each core count. Because the WRF code contains a large number of mathematical calculations, such as pow, we then compile WRF with the baseline compilation options and add the dynamic library path -path_to_libamath to configure.wrf. We again run WRF with 20, 40, 60 and 120 cores and record the corresponding single-iteration-step running times. The data before and after the libamath optimisation are collated and plotted in a histogram; see Figure 5 for details.
(b). During the optimisation of the domain decomposition, we use 36 and 72 cores to run WRF in the default state, recording the single-iteration-step runtime and the total running time of the 6 h simulation. The default domain decomposition algorithm is then changed via nproc_x and nproc_y in namelist.input, and the best-performing domain decomposition is selected from several different combinations of nproc_x and nproc_y. Finally, the performance improvement is calculated by comparison with the default domain decomposition. The detailed experimental data are shown in Table 4. We also conduct domain decomposition experiments on the libamath-optimised version, running WRF with 120 cores.
(c). In the tile division optimization experiments, we build on the version already optimized with the libamath math library and the domain decomposition. For WRF running with 120 cores (32 processes * 4 threads), we first run WRF with the default tile division (numtiles = 4) and record the single-iteration-step running time; we then modify the value of numtiles in namelist.input to change the number of tiles in each process. In this paper, numtiles is set to 12, 16, 20, 24, 28 and 32, and the corresponding single-iteration-step running time is recorded for each setting.
(d). For the parallel IO experiments, we first run WRF with serial IO and record the total running time and the single IO output time. We then recompile WRF with the PnetCDF path set in configure.wrf and change the io_form_history value in namelist.input. Finally, we rerun WRF and record its total running time and single IO output time.
(e). In the asynchronous IO optimization experiment, the value of nio_groups in namelist.input is set to 1 and the value of nio_tasks_per_group is set to 2 to enable asynchronous IO. A total of 202 processes are used when WRF is run, 200 of which are computing processes and 2 are IO processes. With computation and IO running asynchronously, the total running time of WRF and the single IO output time are recorded and compared with the serial IO data.

6. Result and Discussion

6.1. Performance Analysis Experiments

This section presents the experimental results and discussion of the performance analysis and optimisation of WRF on the Kunpeng 920. It covers the MPI + OpenMP analysis, the scalability analysis, and the hotspot function analysis, as well as the experimental results for Libamath, domain decomposition, tile division, and IO optimisation.
Table 1 shows that different MPI + OpenMP combinations deliver different performance when the total number of cores is fixed. The numbers in the first row represent the total number of cores used, the first column represents the number of threads used, and the number of processes is the total number of cores divided by the number of threads. At 8, 16, 32, 64 and 128 cores, the best-performing combinations occur when the number of threads used is a multiple of four. This is mainly related to the architecture of the Kunpeng, in which every four cores form a CCL with a built-in shared cache. When the number of threads is a multiple of four, the four threads within a CCL can use the shared cache and the data transfer cost between OpenMP threads is relatively small; thus, the best performance is obtained.
As shown in Table 2, the single iteration step runtime of the WRF is 4.49 s when using 64 cores and 0.66 s when using 512 cores. Thus, the performance of the WRF is accelerated by a factor of 6.8 with respect to 64 cores at 512 cores, with a parallel efficiency of 85%. As demonstrated by the experimental data, WRF shows good scalability performance on the Kunpeng 920 processor platform.

6.2. Domain Decomposition and Tile Division Optimization

Table 3 shows the time distribution of the hotspot functions of WRF. The data show that the largest hotspot function of WRF is pow_finite, which accounts for about 14% of the total program runtime. This is mainly due to the presence of a large number of power operations in WRF; pow_finite computes the yth power of x, where x and y are floating-point numbers, and is one of the most commonly used mathematical functions.
Figure 5 shows that with libamath, the performance of WRF is improved at different core counts, where at 120 cores, the single iteration step time of WRF decreases from 0.92 s to 0.78 s, which is a 15.2% performance improvement.
As shown by the data in Table 4, in the experiments running WRF with 36 cores, the best performance is achieved when nproc_x = 3 and nproc_y = 12, with a 5.78% performance improvement over the default domain decomposition. In the experiments running WRF with 72 cores, the best performance is achieved when nproc_x = 3 and nproc_y = 24, with an 11.38% performance improvement over the default domain decomposition. These two sets of preference experiments clearly demonstrate that the default domain decomposition of WRF does not yield the best performance.
As described in Section 4.3, tile division is a finer-grained partitioning than domain decomposition. Since the tile is the smallest computational unit of WRF, studying this unit is crucial. In this paper, we use 120 cores for these experiments. Before optimization, the default tile setting numtiles = num_threads is used and the single-step running time of WRF is recorded; the value of numtiles is then varied. The single-step running time of WRF drops to 0.62 s when numtiles is set to eight; thus, the single-step time is improved by 13.89%.
The main reason for the performance improvement from tile division is that, once the tiles are divided finely enough, the data of an entire tile can reside in the cache during computation (the data size of a tile is smaller than the corresponding cache size), which indirectly improves memory access efficiency and thus the overall performance of the WRF computation.
After the WRF was optimized by adding the maths library libamath, domain decomposition and tile division, the single iteration step runtime of the WRF with 120 cores dropped from 0.92 s to 0.62 s, a 32.6% performance improvement (see Figure 6 for performance improvement).

6.3. IO Optimization

The experimental data show that when we run WRF (Case A) in parallel IO mode with 200 computing processes, the total time is reduced from 1078.80 s for serial IO to 840.89 s, a 22.05% performance improvement, and the single IO output time is reduced from 37.27 s to 13.35 s, a substantial improvement.
Parallel IO significantly reduces the time required for IO operations because, compared to serial IO, the data collection workload that was originally borne by a single master process during the Gatherv-based gathering is distributed among the master processes of the aggregation groups. As a result, the amount of data each master process needs to collect is reduced, thereby increasing the speed of the IO operations.
The experimental data show that the total running time of WRF with asynchronous IO is 807.94 s, showing a 25.11% performance improvement compared to serial IO. The overhead of writing the result data file decreases from 37.27 s to 0.14 s, and the IO speed is substantially improved; further information is shown in Figure 7 and Figure 8.
For both serial IO and parallel IO, WRF first completes its calculations, and then designated processes collect the data through global communication. During this data collection phase, other processes remain in a waiting state until the designated process completes the data gathering. On the other hand, asynchronous IO provides dedicated IO processes that solely handle IO tasks and can start data collection during the WRF computation, reducing the waiting time inherent in the data collection process and consequently decreasing the overall IO time.

7. Summary and Conclusions

In conclusion, the performance analysis shows that WRF has good scalability on the Kunpeng 920. Because every four cores of the Kunpeng 920 processor form a CCL, WRF performs better in MPI + OpenMP hybrid mode when the number of threads is a multiple of four. In the performance optimization part, libamath is first used to accelerate the mathematical calculations of WRF; second, the domain decomposition and tile division are improved to raise cache efficiency: when the i-direction of a tile is long enough, the local cache is used more effectively during tile computation, improving memory access efficiency, and when the tile data are divided small enough, a whole tile can be loaded into the faster cache, improving cache efficiency. Finally, in the IO optimization, serial IO in WRF spends too long in a single process collecting and processing large amounts of data; parallel IO alleviates this by increasing the number of processes that handle the data, and asynchronous IO overlaps computation with IO, eliminating the waiting time for data output.
The experimental data show that after invoking libamath and optimizing the domain decomposition and tile division, the performance of WRF on the Kunpeng 920 processor is improved by 32.6%. In terms of IO, the main optimizations are parallel IO and asynchronous IO. Ultimately, the single output time of WRF is reduced from 37.28 s in serial IO mode to 0.14 s in asynchronous IO mode, and the overall running time is reduced from 1078.80 s to 807.94 s.
Overall, this study demonstrates the performance of WRF on the Kunpeng 920 processor platform and verifies the feasibility of porting WRF to this high-performance server. Without increasing the computing resources, WRF is optimized and accelerated purely by software means, and the optimization effect is remarkable, providing a useful reference for researchers who run WRF on ARM-architecture servers.

Author Contributions

Conceptualization, J.H. and W.W.; Methodology, J.H., W.W. and C.Y.; Software, W.W. and C.Y.; Data curation, J.H. and Y.B.; Writing—review & editing, J.H., J.J. and L.Z.; Visualization, Y.W.; Project administration, J.J.; Funding acquisition, J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (41931183), and also by the HPC Application LAB of Huawei’s computing product line.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Harvey, J.A.; Tougeron, K.; Gols, R.; Heinen, R.; Abarca, M.; Abram, P.K.; Basset, Y.; Berg, M.; Boggs, C.; Brodeur, J.; et al. Scientists’ warning on climate change and insects. Ecol. Monogr. 2023, 93, e1553. [Google Scholar] [CrossRef]
  2. He, Z.; Pomeroy, J.W. Assessing hydrological sensitivity to future climate change over the Canadian southern boreal forest. J. Hydrol. 2023, 624, 129897. [Google Scholar] [CrossRef]
  3. Raby, J.; Brown, R.; Raby, Y. Forecast Model and Product Assessment Project User’s Guide; Technical Report; U.S. Army Research Laboratory, Battlefield Environment Division: White Sands Missile Range, NM, USA, 2011.
  4. Zhou, N.; Zhou, H.; Hoppe, D. Containerization for High Performance Computing Systems: Survey and Prospects. IEEE Trans. Softw. Eng. 2022, 49, 2722–2740. [Google Scholar] [CrossRef]
  5. Elliott, S.; Del Vento, D. Performance Analysis and Optimization of the Weather Research and Forecasting Model (WRF) on Intel Multicore and Manycore Architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Austin, TX, USA, 15–20 November 2015. [Google Scholar]
  6. Chen, Y.R.; Wang, Y.Z.; Jiang, J.R.; Hao, H.Q.; Chen, T.; Liu, C.; Center, S. Performance evaluation of weather research and forecast (WRF) model on ERA. Comput. Eng. Des. 2016, 1668–1674. [Google Scholar]
  7. Ouermi, T.; Knoll, A.; Kirby, R.M.; Berzins, M. Optimization strategies for WRF single-moment 6-class microphysics scheme (WSM6) on intel microarchitectures. In Proceedings of the 2017 Fifth International Symposium on Computing and Networking (CANDAR), Aomori, Japan, 19–22 November 2017; pp. 146–152. [Google Scholar]
  8. Moreno, R.; Arias, E.; Cazorla, D.; Pardo, J.J.; Navarro, A.; Rojo, T.; Tapiador, F.J. Analysis of a new MPI process distribution for the weather research and forecasting (WRF) model. Sci. Program. 2020, 2020, 8148373. [Google Scholar] [CrossRef]
  9. Xia, J.; Cheng, C.; Zhou, X.; Hu, Y.; Chun, P. Kunpeng 920: The first 7-nm chiplet-based 64-Core ARM SoC for cloud services. IEEE Micro 2021, 41, 67–75. [Google Scholar] [CrossRef]
  10. Morton, D.; Nudson, O.; Stephenson, C. Benchmarking and evaluation of the Weather Research and Forecasting (WRF) Model on the Cray XT5. In Proceedings of the Cray User Group 2009: Compute the Future, Atlanta, GA, USA, 4–7 May 2009. [Google Scholar]
  11. Malakar, P.; Saxena, V.; George, T.; Mittal, R.; Kumar, S.; Naim, A.G.; Husain, S.A.b.H. Performance evaluation and optimization of nested high resolution weather simulations. In Proceedings of the Euro-Par 2012 Parallel Processing: 18th International Conference, Euro-Par 2012, Rhodes Island, Greece, 27–31 August 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 805–817. [Google Scholar]
  12. Christidis, Z. Performance and scaling of WRF on three different parallel supercomputers. In Proceedings of the High Performance Computing: 30th International Conference, ISC High Performance 2015, Frankfurt, Germany, 12–16 July 2015; Springer: Cham, Switzerland, 2015; pp. 514–528. [Google Scholar]
  13. Taj, O.; Kirby, R.M.; Berzins, M. Performance Optimization Strategies for WRF Physics Schemes Used in Weather Modeling. Int. J. Netw. Comput. 2018, 8, 301–327. [Google Scholar]
  14. Andraju, P.; Kanth, A.L.; Kumari, K.V.; Vijaya Bhaskara Rao, S. Performance optimization of operational WRF model configured for Indian Monsoon Region. Earth Syst. Environ. 2019, 3, 231–239. [Google Scholar] [CrossRef]
  15. Yokoyama, D.; Schulze, B.; Borges, F.; Mc Evoy, G. The survey on ARM processors for HPC. J. Supercomput. 2019, 75, 7003–7036. [Google Scholar] [CrossRef]
  16. Chip, W. TaiShan-Based CPUs are Branded as the Kunpeng 920 Series. 2019. Available online: https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110 (accessed on 29 June 2023).
  17. Skamarock, W.C.; Klemp, J.B.; Dudhia, J.; Gill, D.O.; Liu, Z.; Berner, J.; Wang, W.; Powers, J.G.; Duda, M.G.; Barker, D.M.; et al. A Description of the Advanced Research WRF Model Version 4; NCAR Technical Notes NCAR/TN–556+STR; National Center for Atmospheric Research: Boulder, CO, USA, 2019. [Google Scholar]
  18. Aqib, M.; Fouz, F.F. The effect of parallel programming languages on the performance and energy consumption of HPC applications. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 174–179. [Google Scholar] [CrossRef]
  19. De Melo, A.C. The new linux’perf’tools. In Proceedings of the Linux Kongress: 17th International Linux System Technology Conference, Nürnberg, Germany, 21–24 September 2010; pp. 1–42. [Google Scholar]
Figure 1. Simplified diagram of Kunpeng 920 processor component structure [16].
Figure 2. The basic structure diagram of the system-on-chip of Kunpeng 920 processor.
Figure 3. Three level domain decomposition of WRF.
Figure 4. Two-dimensional process partition of different domain decomposition.
Figure 5. Single-step runtime before and after call to libamath, with different number of processes.
Figure 6. Comparison of WRF’s performance optimisation results at each stage.
Figure 7. Total running time of WRF in different IO modes.
Figure 8. WRF single-step IO times in different IO modes.
Table 1. MPI + OpenMP hybrid parallel preference experiment. Columns give the total number of cores used, rows the number of OpenMP threads per process; entries are WRF runtimes in seconds.

Threads      8        16       32       64      128
1        2442.55  1436.62  1000.40   563.53   310.9
2        2466.07  1404.12   989.75   510.37   294.41
4        2357.44  1428.20   947.21   508.35   284.83
8        2400.20  1426.14   958.42   493.82   278.64
16            -   1392.33   935.92   498.37   285.23
Table 2. Numerical experiments evaluating the strong scalability performance of WRF on the Kunpeng 920 processor platform.

Grid Size           Processes   Run-Time/s   Efficiency
1024 × 1024 × 33        64         4.49         1
1024 × 1024 × 33       128         2.32         0.968
1024 × 1024 × 33       256         1.28         0.877
1024 × 1024 × 33       512         0.66         0.850
Table 3. Hotspot function distribution of WRF.

Hot Function        Rate
pow_finite          14.13%
_iface_progress      5.36%
_worker_progress     4.34%
advect_scalar_pd     4.11%
opal_progress        3.87%
_advance_w           3.78%
_iface_progress      3.45%
_spcvmc_sw           2.81%
_advect_scalar       2.57%
other               55.58%
Table 4. Preference experiment with different domain decomposition combinations.

Processes   nproc_x   nproc_y   total_time/s   step_time/s
    36          6         6        168.211         1.55
    36          2        18        154.720         1.46
    36          3        12        150.117         1.47
    36          4         9        151.701         1.52
    72          8         9        131.728         0.97
    72          2        36        121.098         0.89
    72          3        24        120.098         0.86
    72          4        18        122.605         0.92
    72          6        12        122.650         0.93

