The original serial COCAL consists of more than 400,000 lines of Fortran 90 code, and in order to use OpenMP parallelization, many of its loops had to be rewritten in a way that efficiently utilizes OpenMP's capabilities. In particular, in nested loops, the code inside the innermost one had to be written so that it is independent between threads. For example, in a typical triple loop over the $r$ (loop 1), $\theta$ (loop 2), and $\phi$ (loop 3) coordinates, any code between loop 1 and loop 2 and/or between loop 2 and loop 3 is moved inside the innermost loop 3. The reason is to utilize the collapse clause, which collapses the multiple loops into a single one that is then divided among the multiple threads. Most of our loops were three-dimensional, but there were many that were four- or even five-dimensional, which result in large speedups through OpenMP parallelization. For example, in a binary system, Figure 1, the calculation of the surface integral on the excised sphere uses the associated Legendre functions with respect to the coordinate system at the center of that sphere. Since the spherical angle with respect to the z-axis at the center of the excised sphere depends on all coordinates $(r, \theta, \phi)$ of that particular patch, these functions become five-dimensional, as they depend on the Legendre indices $\ell$ and $m$ too.
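As a minimal sketch of this loop structure (not taken from COCAL; the routine and array names are illustrative), a perfectly nested triple loop with all work inside the innermost body can be collapsed as follows:

```fortran
subroutine compute_grid_function(nr, nt, np, rg, thg, phig, f)
  implicit none
  integer, intent(in)  :: nr, nt, np
  real(8), intent(in)  :: rg(nr), thg(nt), phig(np)
  real(8), intent(out) :: f(nr, nt, np)
  integer :: ir, it, ip
  real(8) :: x
  ! The three loops are perfectly nested (no statements between the
  ! do headers), so collapse(3) can fuse them into a single iteration
  ! space that is divided among the threads.
!$omp parallel do collapse(3) private(x)
  do ir = 1, nr          ! loop 1: radial coordinate r
    do it = 1, nt        ! loop 2: polar coordinate theta
      do ip = 1, np      ! loop 3: azimuthal coordinate phi
        x = rg(ir)*sin(thg(it))*cos(phig(ip))  ! work that would otherwise
        f(ir, it, ip) = x*x                    ! sit between the do headers
      end do                                   ! lives inside loop 3
    end do
  end do
!$omp end parallel do
end subroutine compute_grid_function
```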
Another commonly used OpenMP feature is the reduction clause, which is used to perform summations, for example in the calculation of the volume and surface integrals. In this case, the code is written so that the summation appears only inside the innermost loop, and a combination of the collapse and reduction clauses can then execute this operation across multiple threads. The reduction clause is also used in finding the maximum error in Equation (43) for every variable in each coordinate system. The private clause is used extensively for local nested-loop variables so that the multiple threads executing a parallel region compute them independently. Listing a variable as private causes each thread that executes that construct to receive a new temporary variable of the same type, so the nested loop can be performed independently in parallel.
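For instance, a volume integral and a maximum-error search over the grid might be coded along the following lines (a sketch with illustrative names, not COCAL's actual routine):

```fortran
subroutine integrate_and_error(nr, nt, np, w, fn, fo, vol, errmax)
  implicit none
  integer, intent(in)  :: nr, nt, np
  real(8), intent(in)  :: w(nr,nt,np)                 ! quadrature weights
  real(8), intent(in)  :: fn(nr,nt,np), fo(nr,nt,np)  ! new and old iterates
  real(8), intent(out) :: vol, errmax
  integer :: ir, it, ip
  vol    = 0.0d0
  errmax = 0.0d0
  ! The running sum and the max search appear only inside the innermost
  ! loop, so collapse(3) plus reduction distributes them over the threads.
!$omp parallel do collapse(3) reduction(+:vol) reduction(max:errmax)
  do ir = 1, nr
    do it = 1, nt
      do ip = 1, np
        vol    = vol + w(ir,it,ip)*fn(ir,it,ip)
        errmax = max(errmax, abs(fn(ir,it,ip) - fo(ir,it,ip)))
      end do
    end do
  end do
!$omp end parallel do
end subroutine integrate_and_error
```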
In order to quantify the speedup and efficiency of PCOCAL, we define the following measures: the speedup $S = t_1/t_N$, where $t_1$ is the wall-clock time of the serial run and $t_N$ that of a run with $N$ threads, and the efficiency $E = S/N$. The same measures evaluated for the parallelized part of the code alone are denoted $S_p$ and $E_p$.
To make sure that the parallelized PCOCAL code produces the same results as COCAL itself, we perform a pointwise check for every parallelized subroutine against the serial code and confirm that the differences between the two codes are negligible in all variables.
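Such a pointwise check can be as simple as the following sketch (variable names are illustrative):

```fortran
! Largest pointwise difference between the serial and parallel results
! for one field; evaluated for every variable of every parallelized routine.
function max_pointwise_diff(f_serial, f_parallel) result(d)
  implicit none
  real(8), intent(in) :: f_serial(:,:,:), f_parallel(:,:,:)
  real(8) :: d
  d = maxval(abs(f_parallel - f_serial))
end function max_pointwise_diff
```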
Most of the runs were performed on server A, which has 36 cores with 2 threads per core (72 threads total) and Intel(R) Xeon(R) Gold 6254 CPUs at 3.10 GHz (Intel, Santa Clara, CA, USA). We also used server B, which has 40 cores with 2 threads per core (80 threads total) and Intel(R) Xeon(R) Gold 6242R CPUs at 3.10 GHz. Both servers are dual-socket. The Intel compiler (2021.3.0 20210609) was used.
4.1. Single Rotating Neutron Star
Following the methods of Section 2.1 and Section 3.1, we test our new code in three grid resolutions, as can be seen in Table 3. Resolution H2 has 442,368 intervals in total in the $(r, \theta, \phi)$ directions, while resolutions H3 and H4 have twice and four times as many intervals in every dimension, i.e., 8 and 64 times more intervals in total, respectively. A default number of terms is used in the Legendre expansions; below, we also investigate the performance of the code with respect to this number, $L$.
In Algorithm 1, we sketch the most salient steps taken for a solution. In parentheses with green fonts, we show the percentage of time needed for a given calculation per iteration using a single core at the H3 resolution. Lines that do not have such a number indicate steps whose completion time was negligible. The rest of the time is spent on the calculation of diagnostics as well as further output. Steps 2–12 are repeated iteratively until condition 13 is fulfilled, and hence convergence to a solution is achieved. As we can see, the most time-consuming routines are the computation of the momentum constraint sources (15%) and the calculation of the remaining sources (13%), followed by the calculation of the conformal Ricci tensor (11%) and the Poisson solvers themselves (11%). The reason that the three source arrays of the momentum constraint took more time to compute than the six other source arrays was mainly the modular way in which COCAL implements various mathematical formulations. In particular, for the waveless formulation, the sources (right-hand sides of the Poisson equations) are split into two parts: (i) the part that comes from the conformally flat part of the metric and (ii) the part that comes from the nonconformal one. In this way, one has the flexibility of choosing a specific method (conformally flat vs. nonconformally flat) without rewriting code. The disadvantage is that triple loops over all the gridpoints may be repeated; therefore, speed is sacrificed for modularity. In the computation of the momentum constraint sources for the waveless formulation (which is not conformally flat), both the fluid part and the gravitational part of the source computation are performed twice, which leads to a larger time footprint than for the six other source arrays. Overall, the bottleneck for the rotating neutron star module in the waveless formulation is the computation of the sources (lines 2–8) rather than the Poisson solvers (line 9). As we will see, this is in contrast with the binary neutron star module in the conformally flat approximation, where the latter dominates over the former.
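Schematically, this splitting means that a source array is assembled in two full sweeps over the grid, one per contribution, rather than in a single fused loop (a sketch under these assumptions; the routine and array names are invented):

```fortran
subroutine assemble_source(nr, nt, np, s_flat, s_wave, src)
  implicit none
  integer, intent(in)  :: nr, nt, np
  real(8), intent(in)  :: s_flat(nr,nt,np)  ! conformally flat contribution
  real(8), intent(in)  :: s_wave(nr,nt,np)  ! nonconformal (waveless) part
  real(8), intent(out) :: src(nr,nt,np)
  integer :: ir, it, ip
  ! Sweep 1: part coming from the conformally flat part of the metric.
!$omp parallel do collapse(3)
  do ir = 1, nr
    do it = 1, nt
      do ip = 1, np
        src(ir,it,ip) = s_flat(ir,it,ip)
      end do
    end do
  end do
!$omp end parallel do
  ! Sweep 2: nonconformal correction, skipped for conformally flat runs.
  ! Keeping it as a separate sweep preserves modularity between the two
  ! formulations at the cost of revisiting every grid point.
!$omp parallel do collapse(3)
  do ir = 1, nr
    do it = 1, nt
      do ip = 1, np
        src(ir,it,ip) = src(ir,it,ip) + s_wave(ir,it,ip)
      end do
    end do
  end do
!$omp end parallel do
end subroutine assemble_source
```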
Algorithm 1 Rotating star in the waveless formalism
1:  procedure RNS
2:    Interpolate variables to SFC
3:    Compute …
4:    Compute volume sources, Equation (A1)
5:    Compute volume sources, Equation (A3)
6:    Compute volume sources, Equation (A2)
7:    Compute volume sources, Equation (A4)
8:    Compute right-hand side of Equation (A22)
9:    Solve the Poisson equations, Equation (40)
10:   Variable update, Equation (42)
11:   Use Equation (23) to compute the rest mass density
12:   Compute the error, Equation (43)
13:   if the error of Equation (43) is below tolerance then
14:     exit
In Figure 2, we plot in the upper two panels the speedup $S_p$ (top) and efficiency $E_p$ (bottom) of the parallelized part of the code (solid lines), as well as the corresponding measures $S$ and $E$ for the total code (dashed lines). As mentioned above, $S$ underestimates the real speedup, since the input/output routines were neither optimized nor minimized (for diagnostic purposes). Thus, in the discussion below, we focus on the speedup $S_p$ and efficiency $E_p$. One characteristic of the speedup is that for all resolutions it reaches a local maximum at ∼36 threads, drops slightly afterwards, and from that point continues to increase. The maximum speedup of the parallelized part of the code is 18–20 times the serial one (when ≲60 threads are used), which can be achieved with a minimum of ∼36 threads. At that point, the efficiency is ∼50%. At maximum speedup, the whole rotating neutron star code (including the serial routines that are nonparallelized, such as the calculation of the parameters in Equation (20), the constant $C$ in Equation (23), and the coordinate scaling [8,16]) is ∼12, 13, and 14 times faster than the serial analog for resolutions H2, H3, and H4 and occurs at 36, 34, and 35 threads, respectively. This can be seen in Figure 2 in the lower three panels, where blue bars signify the parallelized routines of the code, red bars the serial part of the code, and yellow bars the memory copies between arrays. From this plot, we see that the parallelized code constitutes the vast majority of the total number of subroutines and is responsible for the achieved speedup. Also, we observe that the higher the resolution (H4), the larger the total speedup, which is a promising result for high-resolution campaigns.
In all tests presented in Figure 2, we used the default number of terms $L$ in the Legendre expansion, Equation (41), and therefore in all integrals, Equation (40). It is experimentally found that this number of terms leads to accurate results without compromising the speed of the code. The same number of terms is commonly used in all nonmagnetized rotating star calculations [7,8,9,10,11,12] as well as in the binary neutron star calculations [16,17] that we will mention in the next section. On the other hand, accurate magnetized rotating neutron star or black hole-disk solutions require a larger number (≳60) of terms to be included in the Legendre expansions. In Figure 3, we plot the time of a single iteration (blue dots), as well as the time for the Poisson solvers alone (red dots), using the H3 resolution and a single core. Increasing the number of Legendre terms $L$ increases the time spent on the Poisson solvers according to the fitting function of Equation (46), which is plotted with a solid yellow line. At the same time, the time spent on the whole iteration is fitted by the same relation, Equation (46), but with different coefficients (as reported in Algorithm 1, line 9, the percentage of time spent on the Poisson solvers with respect to the whole iteration at the default $L$ is approximately 11%), and is plotted with a purple curve. In other words, for a given resolution, an increase in the number of Legendre terms affects the Poisson solvers and the whole iteration in essentially the same manner and on the same scale.
In order to evaluate our new code with respect to the number $L$, we performed the same tests using the H4 resolution for three different numbers of Legendre terms. The results are shown in the upper two panels of Figure 4 with red, blue, and yellow curves for server A. In addition, we show with a cyan curve the speedup and efficiency of the corresponding run on server B. The first conclusion is that the speedup is approximately preserved when the number of Legendre terms is increased. All runs on server A have a maximum at ∼35 threads. Beyond that point, and similarly to Figure 2, the speedup drops and then starts to increase again. This behavior is qualitatively the same for the run on server B, with the maximum now happening at 40 threads and a speedup larger than 20. Given that the server A CPUs have 36 cores while the server B CPUs have 40, we conclude that peak speedup with the minimum number of threads happens approximately at the number of cores of a CPU. The larger the number of cores, the larger the speedup will be. In the lower three panels, we plot the total performance time for the three values of $L$ at the H4 resolution on server A. We find that the maximum speedup of the total rotating neutron star code (including the routines that are nonparallelized) is ∼14, 13, and 14 times the serial analog at the H4 resolution for the three values of $L$ and occurs at 35, 36, and 35 threads, respectively. Therefore, the speedup is preserved when the number of Legendre terms is increased for the whole code as well, which is expected given that the parallelization scheme covers all of the main components of PCOCAL.
4.2. Binary Neutron Stars
Following the methods of Section 2.2 and Section 3.2, we test our new code in three resolutions, as in Table 4. Resolution E2.5 has 1,492,992 intervals in each of the three coordinate systems (two COCPs and one ARCP in Figure 1), while resolutions E3.0 and E3.5 have ∼1.3 and 2 times as many intervals in every dimension, i.e., ∼2.2 and 8 times more intervals in total, respectively. In all binary cases, we use the default number of terms in the Legendre expansions. Notice that the BH coordinate system (COCP-BH, top-left red patch in Figure 1) is now replaced by an NS coordinate system similar to the blue, top-right patch COCP-NS. Therefore, for the binary neutron/quark star case, the red patch in Figure 1 has no inner excised surface and thus no corresponding boundary conditions.
In Algorithm 2, we sketch the most important steps taken for a binary neutron star solution using a single core at the E3.0 resolution. As in Algorithm 1, we report the percentage of time needed for the completion of each step in green fonts inside parentheses; the absence of such a number means that the completion time was negligible. The rest of the time is spent on the calculation of diagnostics as well as further output. Steps 2–14 are repeated iteratively until the condition in line 15 is fulfilled. The main differences between the RNS and NSNS modules are: (i) steps 3–15 are performed in three coordinate systems (the two COCP-NS and the ARCP), as seen in Figure 1, instead of one in the RNS module; (ii) the calculation of every potential in the two COCP-NS coordinate systems is significantly more involved because of the excised sphere (the reason for the existence of the excised sphere in the COCP patches is explained in detail in [13,14,15]); (iii) the additional calculation of the velocity potential (∼4%), which is absent in the RNS module, is another important difference between the NSNS module and the RNS one; (iv) because for binary neutron stars we solve for conformally flat initial data, the source terms (Equations (A1)–(A3)) are significantly simpler, as the magenta terms are zero.
As mentioned above, surface integrals (see Equation (40)) on the excised sphere do not exist in the single coordinate system of the RNS module. When computing those integrals, as well as the surface integrals at the other patch boundaries in a given coordinate system (e.g., on COCP-NS-1), the integrands are computed using the variables of another coordinate system (e.g., from COCP-NS-2). In this way, the solutions in the different coordinate systems communicate with each other in order to achieve a smooth solution everywhere. Therefore, to compute the integrands of these surface integrals, three-dimensional interpolations from nearby points of another coordinate system are needed. In total, the Poisson solvers take ∼41% of the time of an iteration. The time of the Poisson solvers in each COCP-NS patch is ∼17%, while in the ARCP it is ∼7%. The big difference between them is due to the fact that the ARCP patch does not have an excised sphere. The surface integral on the off-center excised sphere is the most time-consuming operation (∼11%), and it involves the computation of the corresponding associated Legendre functions, which are five-dimensional arrays over the patch coordinates and the Legendre indices. In the current implementation, this time-consuming operation is performed on the fly every time such an integral is calculated, in order to have a smaller memory footprint. Note that these functions do not change during the iteration procedure and in principle need only be calculated once; however, such an array at the E3.5 resolution would be ∼9 GB.
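A sketch of the on-the-fly strategy, using the standard three-term recurrence for the associated Legendre functions (this is not COCAL's actual implementation):

```fortran
! Unnormalized associated Legendre function P_l^m(x) for |x| <= 1 and
! 0 <= m <= l, computed by the standard recurrence.  Recomputing this
! value per surface point and per (l, m) pair avoids storing the
! five-dimensional array, which at the E3.5 resolution would be ~9 GB.
pure function plgndr(l, m, x) result(p)
  implicit none
  integer, intent(in) :: l, m
  real(8), intent(in) :: x
  real(8) :: p, pmm, pmmp1, pll, somx2
  integer :: i, ll
  pmm = 1.0d0
  if (m > 0) then                         ! seed value P_m^m(x)
    somx2 = sqrt((1.0d0 - x)*(1.0d0 + x))
    do i = 1, m
      pmm = -pmm*dble(2*i - 1)*somx2
    end do
  end if
  if (l == m) then
    p = pmm
  else
    pmmp1 = x*dble(2*m + 1)*pmm           ! P_{m+1}^m(x)
    if (l == m + 1) then
      p = pmmp1
    else
      do ll = m + 2, l                    ! upward recurrence in l
        pll = (x*dble(2*ll - 1)*pmmp1 - dble(ll + m - 1)*pmm)/dble(ll - m)
        pmm = pmmp1
        pmmp1 = pll
      end do
      p = pll
    end if
  end if
end function plgndr
```

Inside the surface-integral loops, a call such as plgndr(l, m, cos(theta_prime)), with theta_prime the angle measured from the center of the excised sphere, replaces the lookup into the precomputed five-dimensional array.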
Algorithm 2 Binary neutron star
1:  procedure NSNS
2:    for all coordinate systems A do
3:      Interpolate variables from SFC
4:      Compute …
5:      Compute volume source, Equation (A1)
6:      Compute volume source, Equation (A3)
7:      Compute volume source, Equation (A2)
8:      Compute sources on the excised surface
9:      Solve the Poisson equations, Equation (40)
10:     Variable update, Equation (42)
11:     Use Equation (35) to compute the rest mass density
12:     Compute Equations (36) and (40)
13:     Compute …
14:     Variable copy between coordinate systems
15:   if the error criterion is satisfied for all A then
16:     exit
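Step 14, the variable copy between coordinate systems, can itself be threaded; a minimal sketch of one way to do this in Fortran/OpenMP (names illustrative):

```fortran
subroutine copy_between_cs(f_dst, f_src)
  implicit none
  real(8), intent(in)  :: f_src(:,:,:)
  real(8), intent(out) :: f_dst(:,:,:)
  ! The whole-array assignment is divided among the threads of the team.
!$omp parallel workshare
  f_dst = f_src
!$omp end parallel workshare
end subroutine copy_between_cs
```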
In Figure 5, we plot in the upper two panels the speedup $S_p$ (top) and efficiency $E_p$ (bottom) of the parallelized part of the binary NSNS code (solid lines), as well as the corresponding measures $S$ and $E$ for the total code (dashed lines). Similar to the RNS module, Figure 2, we find that for all resolutions the speedup increases until a certain number of threads, then exhibits a sudden small drop, from which point it continues to increase with the number of threads. One difference with respect to Figure 2 is that the speedup curves are distinct for resolutions E2.5, E3.0, and E3.5, while for the RNS module we observed a broad overlap of these curves between resolutions H2, H3, and H4. The reason for this behavior is the communication between the different coordinate systems in the binary code, which is absent in the single neutron star module. Despite that, the speedup of the parallelized code reaches values of 12–16 using just ∼36 threads. The efficiency at that point is ∼40%. This speedup can further increase if more than 60 threads are used. Another finding in Figure 5 is that for the highest resolution, E3.5, the speedup is broadly constant from ∼26 to ∼35 threads, which is due to the domination of data communication between the different grids in the calculation process. In the lower three panels of Figure 5, we also see that the maximum speedup of the total binary code (including the routines that are nonparallelized) is six, five, and five times the serial analogue for resolutions E2.5, E3.0, and E3.5 and occurs at 33, 34, and 26 threads, respectively. The distribution of times across the whole code can also be seen in these lower three panels. In the future, we will address the data communication and memory handling in order to make the binary speedup scale similarly to the RNS code. In practice, a binary neutron star in quasicircular orbit that took approximately six days of continuous computation at the E3.0 resolution with the COCAL code needs approximately one day with PCOCAL using a modest number of ∼30 threads.
The differences between PCOCAL and COCAL for a binary neutron or quark star are negligible in all coordinate systems for the gravitational and fluid variables, as well as for the constants that appear in the calculation.