1. Introduction
The Legendre transform (LT) plays an important part in many scientific applications, such as astrophysics [1], numerical weather prediction, and climate modeling [2,3]. Fast Legendre transforms have attracted considerable interest in scientific computing and numerical simulation, and much effort has gone into developing fast Legendre transform algorithms [4,5,6,7,8,9,10,11]. The practical value of these algorithms depends on whether they remain fast, stable, and accurate.
The butterfly algorithm [12,13] is an effective multilevel technique for compressing a matrix that satisfies a complementary low-rank property. It factorizes a complementary low-rank matrix K of size N × N into the product of O(log N) sparse matrices, each with O(N) nonzero entries. Hence, a dense matrix-vector multiplication can be replaced by a sequence of sparse matrix-vector multiplications requiring O(N log N) operations [14]. An LT based on the butterfly algorithm offers high accuracy, stability, and low computational complexity.
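As a minimal illustration of such a factorization, the sketch below uses the discrete Fourier matrix as a stand-in, since its radix-2 FFT is the textbook case of a product of log2(N) sparse factors with 2N nonzero entries each. It is not the interpolative-decomposition butterfly used for the Legendre transform, and all function names are hypothetical.

```python
import cmath

def fft_butterfly_factors(n):
    """Sparse factors of the radix-2 FFT for n a power of two, outermost first.

    Each factor is a list of n rows; a row is a list of (column, value) pairs.
    """
    factors = []
    m = n
    while m >= 2:
        half = m // 2
        rows = [None] * n
        for start in range(0, n, m):  # one m-by-m butterfly per diagonal block
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / m)  # twiddle factor
                top, bot = start + k, start + k + half
                rows[top] = [(top, 1.0), (bot, w)]
                rows[bot] = [(top, 1.0), (bot, -w)]
        factors.append(rows)
        m //= 2
    return factors

def apply_sparse(rows, x):
    """Sparse matrix-vector product; cost is the number of stored entries."""
    return [sum(v * x[c] for c, v in row) for row in rows]

def bit_reverse(x):
    """Reorder x by bit-reversed index (the permutation of the radix-2 FFT)."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, "0{}b".format(bits))[::-1], 2)] for i in range(n)]

def butterfly_dft(x):
    """Apply the DFT as a sequence of sparse products, innermost factor first."""
    y = bit_reverse(x)
    for rows in reversed(fft_butterfly_factors(len(x))):
        y = apply_sparse(rows, y)
    return y

def dense_dft(x):
    """Reference O(N^2) dense DFT for comparison."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]
```

For N = 16 this gives four factors with 32 stored entries each, and `butterfly_dft` agrees with `dense_dft` to rounding error.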
As one of the most widely used butterfly algorithms, Tygert’s algorithm (2010) [11] has been successfully implemented in the IFS of ECMWF [2], in YHGSM [15,16,17] of NUDT [3], and in astrophysics [1]. In numerical weather prediction and climate models, which require many spectral harmonic transforms (SHT) per time step, the precomputation is performed only once, in the first time step; the results are then stored in memory and reused in every transform. Although the precomputation of Tygert’s algorithm (2010) is slow, O(N^2) for the LT and O(N^3) for the SHT, it therefore has little impact on overall performance. However, some issues remain unresolved. The main one is the potential instability of the interpolative decomposition (ID) [18] for very high order Legendre transforms. Tygert [11] points out that the butterfly procedure may work so well for associated Legendre functions because the associated transforms are nearly weighted averages of Fourier integral operators. Nevertheless, there is no proof in the literature that the precomputation will compress the appropriate N × N matrix enough to enable its application to vectors using only O(N log N) floating-point operations (flops). Full numerical stability has been demonstrated both empirically and theoretically for the fast Fourier transform (FFT) using the butterfly algorithm, but complete and rigorous proofs for the interpolative decomposition are harder to give for the Legendre transform than for the Fourier transform.
The non-oscillatory phase function method opens up new avenues for special function transforms. The solutions of certain second-order differential equations can be accurately represented by non-oscillatory phase functions [19,20]. It has been proved that Legendre’s differential equation [21] and its generalization, Jacobi’s differential equation [22], admit a non-oscillatory phase function. Non-oscillatory phase functions can therefore be used for the expansions [22], the calculation of the roots [23], and the transforms [24] of special functions. The Jacobi transform based on non-oscillatory phase functions achieves an optimal computational complexity of O(N log^2 N / log log N) in reference [24]. However, the Legendre transform algorithm in ButterflyLab [25], which uses the interpolative butterfly factorization (IBF) [14,26] and the non-oscillatory phase function method to evaluate the Legendre polynomials [24], does not reach the accuracy of the Fourier transform using IBF. Therefore, the fast Legendre transform (FLT) based on IBF and non-oscillatory phase functions, and its extension to the associated Legendre functions, need further study.
Recently, fast Legendre transform algorithms based on the FFT have received more attention for their optimal computational complexity of O(N log^2 N / log log N). Hale and Townsend [27] first presented a fast Chebyshev-Legendre transform and then developed a non-uniform discrete cosine transform that uses a Taylor series expansion of Chebyshev polynomials about equally spaced points in the frequency domain. Building on these, Hale and Townsend [28] obtained an O(N log^2 N / log log N) Legendre transform algorithm. Fast polynomial transforms [29] based on Toeplitz and Hankel matrices are expected to accelerate the Chebyshev-Legendre transform further. Although the FFT-based LT has an attractive computational complexity, it requires many FFTs, so it becomes more efficient than an LT using DGEMV only when N is greater than or equal to 5000. Moreover, since the associated-Legendre-Vandermonde matrices are computed in the precomputation step, the comparison becomes less favorable when the FLT is used repeatedly, as in NWP, where the associated-Legendre-Vandermonde matrices need to be computed only once for many spectral harmonic transforms (SHT).
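For reference, the direct O(N^2) transform that the fast algorithms are compared against can be sketched as follows. This is a plain-Python stand-in for a DGEMV call, assuming the transform maps Legendre coefficients to values at given points; the function names are hypothetical, and the Legendre-Vandermonde entries are built with the standard three-term recurrence (n+1) P_{n+1}(x) = (2n+1) x P_n(x) - n P_{n-1}(x).

```python
def legendre_vandermonde(points, n_max):
    """Matrix with entry [k][n] = P_n(points[k]) for n = 0..n_max."""
    matrix = []
    for x in points:
        row = [1.0]                      # P_0(x) = 1
        if n_max >= 1:
            row.append(x)                # P_1(x) = x
        for n in range(1, n_max):
            # three-term recurrence for P_{n+1}
            row.append(((2 * n + 1) * x * row[n] - n * row[n - 1]) / (n + 1))
        matrix.append(row)
    return matrix

def direct_legendre_transform(coeffs, points):
    """Evaluate sum_n coeffs[n] * P_n(x_k) at every point: O(N^2) work."""
    p = legendre_vandermonde(points, len(coeffs) - 1)
    return [sum(c * v for c, v in zip(coeffs, row)) for row in p]
```

In a repeated-use setting such as the one described above, the matrix would be built once and only the matrix-vector product repeated.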
Motivated and inspired by the ongoing research in these areas, we present a theoretical method to analyze the error of LT using butterfly algorithm, and then provide a numerically stability Legendre transform algorithm based on block partitioning and butterfly algorithm. The novel aspect is the mitigation of the potential instability of LT using butterfly algorithm at a very small increase of computational cost.
3. Error Analysis of Legendre Transform using Butterfly Algorithm
The transformed Legendre nodes
can be seen as a perturbation of an equally-spaced grid
, i.e.,
Each
term can then be approximated by a truncated Taylor series expansion about
. If
is small then only a few terms in the Taylor expansion are required.
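The size of this perturbation can be checked numerically. The sketch below assumes the equally spaced grid θ*_k = (4k+3)π/(4N+2), k = 0, ..., N-1, a common choice that may differ from the grid used here, and computes the transformed Legendre nodes by Newton iteration on P_N(cos θ) = 0. The function names are hypothetical.

```python
import math

def legendre_and_prev(n, x):
    """Return (P_n(x), P_{n-1}(x)) for n >= 1 via the three-term recurrence."""
    p_prev, p = 1.0, x
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p, p_prev

def legendre_thetas(n, max_iter=50):
    """Return a list of (theta_star_k, theta_leg_k) pairs for k = 0..n-1.

    theta_star_k is the assumed equispaced grid point; theta_leg_k is the
    arccosine of the k-th Gauss-Legendre node, found by Newton's method
    started from cos(theta_star_k).
    """
    pairs = []
    for k in range(n):
        theta_star = (4 * k + 3) * math.pi / (4 * n + 2)
        x = math.cos(theta_star)
        for _ in range(max_iter):
            p, p_prev = legendre_and_prev(n, x)
            dp = n * (x * p - p_prev) / (x * x - 1.0)  # P_n'(x)
            step = p / dp
            x -= step
            if abs(step) < 1e-15:
                break
        pairs.append((theta_star, math.acos(x)))
    return pairs

def max_perturbation(n):
    """Largest |theta_leg_k - theta_star_k| over the grid."""
    return max(abs(t - s) for s, t in legendre_thetas(n))
```

By the classical bounds on the zeros of P_N(cos θ), each node lies within π/(4N+2) of this grid point, so `max_perturbation(n)` shrinks as N grows, which is what makes a short Taylor expansion sufficient.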
The Taylor series expansion of
about
can be expressed as
where
Similarly,
about
can be expressed as
where
Substituting
for
in Equation (5), one can get
The Taylor series expansion of
about
can be expressed as
According to Equation (13),
(
) can be written as
Substituting Equation (15) into Equation (14), one can obtain
Substituting Equations (17)–(20) into Equation (16), one can get
By truncating the second term in the right hand side of Equation (19), it can be approximated as
and then
Equation (24) can be expressed in the following compact form
where
So, the computation of Legendre-Vandermonde matrix can be written as
The numerical stability of the ID can be analyzed by Equation (27). Since the butterfly algorithm works well for equispaced Fourier series, the Legendre transform using the butterfly algorithm is numerically stable with an error of . When L tends to infinity, the error is .
Lemma 1. For any and [27]
Lemma 2. For any and , the error bound of Equation (24) is
Proof of Lemma 2. Finally, one can get the total upper error bound
□
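The effect of the truncation order can be illustrated on a single cosine term. The sketch below assumes a term of the form cos(ω(θ* + δ)), with ω standing in for a frequency such as n + 1/2, and compares its L-term Taylor expansion about θ* with the Lagrange remainder bound |ωδ|^L / L!. It is a generic Taylor-remainder check, not the bound of Lemma 2, and the names are hypothetical.

```python
import math

def taylor_cos(omega, theta_star, delta, terms):
    """L-term Taylor expansion of cos(omega * (theta_star + delta)) in delta.

    Uses d^l/dtheta^l cos(omega * theta) = omega^l * cos(omega * theta + l*pi/2).
    """
    total = 0.0
    for l in range(terms):
        total += ((omega * delta) ** l / math.factorial(l)
                  * math.cos(omega * theta_star + l * math.pi / 2))
    return total

def truncation_bound(omega, delta, terms):
    """Lagrange remainder bound for the expansion truncated after `terms` terms."""
    return abs(omega * delta) ** terms / math.factorial(terms)

def truncation_error(omega, theta_star, delta, terms):
    """Actual error of the truncated expansion."""
    exact = math.cos(omega * (theta_star + delta))
    return abs(exact - taylor_cos(omega, theta_star, delta, terms))
```

When |ωδ| is well below 1 the bound decays faster than geometrically in L, so a handful of terms reaches a prescribed tolerance.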
4. Legendre Transform Based on Block Partitioning and Butterfly Algorithm
In this section, we propose a Legendre transform based on block partitioning and the butterfly algorithm. The main idea is to separate the matrix
into block
and
sub-matrices
and then factorize
sub-matrices
by butterfly algorithm. The butterfly algorithm can factorize a complementary low-rank matrix of size
N × N into the product of
O(log
N) sparse matrices, each with O(
N) nonzero entries. Moreover, the butterfly algorithm works well for
as mentioned in
Section 3. Hence, the total number of nonzero entries after factorization can be kept to approximately O(N log^2 N / log log N) by controlling the number of nonzero entries in
. Finally, one can get a Legendre transform with
O(N log^2 N / log log N) computational complexity.
It can be found that the matrix
can be considered as a perturbation of matrix
from Equation (24). The block partitioning of
can be performed by using the same method as
in the paper of Hale and Townsend [
27]. Therefore, the matrix
is partitioned as
This partitioning separates the matrix
into block
and
sub-matrices
. Block
contains the columns and rows of
, which cannot be computed by using Equation (24).
where
and
where
and
.
Legendre transform can be expressed as
Nonzero entries of
can be accurately expressed by the asymptotic formula, which means that the butterfly compression to
is stable and accurate. Instead of the FFT method, the butterfly algorithm is employed to compute the matrix-vector product
. This is because the butterfly algorithm works well for
as mentioned in
Section 3. So
can be computed in
operations. By restricting
to have fewer than
nonzero entries, the matrix-vector product
can be computed in
operations. Finally, the optimal computational cost is achieved. Let
, the parameters
and
are defined as
and
, respectively. In the practical application, only parameters
,
,
and
are used to obtain information such as starting row/column index and offset for all blocks.
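The way a block-partitioned matrix-vector product is assembled from starting indices and offsets can be sketched as follows. The block layout, the tags "B" and "P", and the function names are hypothetical, and both kinds of block are applied directly here; in the method described above, the "P" blocks would instead be applied through their butterfly factorization.

```python
def dense_matvec(a, x):
    """Direct matrix-vector product."""
    return [sum(v * xj for v, xj in zip(row, x)) for row in a]

def apply_block(sub, x_part, kind):
    """Apply one block. 'B' blocks are always direct; 'P' blocks are where a
    butterfly-compressed product would be used (direct here as a placeholder)."""
    if kind not in ("B", "P"):
        raise ValueError("unknown block kind: {}".format(kind))
    return dense_matvec(sub, x_part)

def blocked_matvec(a, x, blocks):
    """Compute a @ x from disjoint blocks that together cover the matrix.

    Each block is (row_start, row_end, col_start, col_end, kind) with
    half-open index ranges.
    """
    y = [0.0] * len(a)
    for r0, r1, c0, c1, kind in blocks:
        sub = [row[c0:c1] for row in a[r0:r1]]
        part = apply_block(sub, x[c0:c1], kind)
        for i, v in enumerate(part):
            y[r0 + i] += v       # accumulate into the rows this block touches
    return y
```

As long as the blocks tile the matrix without overlap, the result equals the unpartitioned product, so the partition changes only the cost and accuracy of each piece.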
Figure 1 shows the partitioning of the Legendre-Vandermonde matrix for
N = 1024. The Legendre-Vandermonde matrix is divided into boundary (denoted by symbol
B) and internal (denoted by symbol
P) parts. The boundary parts include the elements which cannot be accurately expressed by the asymptotic formula. There are 2(
K+1) sub-matrices of
B and 2
K sub-matrices of
P. According to the symmetric or anti-symmetric property of Legendre polynomials, only
K+1 sub-matrices of
B and
K sub-matrices of
P on the top are used. Algorithm 1 summarizes the Legendre transform algorithm using the block butterfly algorithm. The direct computation part and the butterfly multiplication part cost
operations, respectively.
Algorithm 1: Block Butterfly Algorithm for Legendre Transform.
The parameters CMAX and EPS required for butterfly matrix compression are still needed in the block butterfly algorithm. CMAX is the number of columns in each sub-matrix on level 0, and EPS is the desired precision of the interpolative decomposition [3]. A dimensional threshold value DIMTHESH [3] is also needed in Legendre transform calls to activate the FLT when the wavenumber (m) is less than or equal to NSMAX-2DIMTHESH+3 (NSMAX is the truncation order). The block butterfly algorithm is equivalent to Tygert’s algorithm (2010) when no block partition is used, so two dimensional threshold values could be introduced to combine Tygert’s algorithm (2010) and the LT using DGEMM and further reduce the computational cost. To facilitate comparison with Tygert’s algorithm, only one dimensional threshold value is used, set to 200, in the rest of the paper.
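The activation rule can be written as a one-line predicate. The sketch below reads NSMAX-2DIMTHESH+3 as NSMAX - 2*DIMTHESH + 3, which is an assumption about the notation, and the function names and the truncation order in the usage note are hypothetical.

```python
def use_fast_transform(m, nsmax, dimthesh):
    """True when wavenumber m should use the FLT rather than the direct LT.

    Assumes the threshold NSMAX-2DIMTHESH+3 means nsmax - 2*dimthesh + 3.
    """
    return m <= nsmax - 2 * dimthesh + 3

def count_fast_wavenumbers(nsmax, dimthesh):
    """Number of wavenumbers m in 0..nsmax that activate the FLT."""
    return sum(1 for m in range(nsmax + 1) if use_fast_transform(m, nsmax, dimthesh))
```

With a hypothetical truncation order of 1000 and the threshold value 200 used in this paper, wavenumbers 0 through 603 would take the fast path and the remaining ones the direct path.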
5. Results
In this section, all tests are performed on the MilkyWay-2 supercomputer (see Liao et al. [31] for more details), which is installed at NUDT. Each compute node has 64 GB of memory. The CPU model is Intel(R) Xeon(R) CPU E5-2692V2 @ 2.2 GHz. Its cache hierarchy consists of a private 32 KB L1 instruction cache, a 32 KB L1 data cache, a 256 KB L2 cache, and a 30720 KB L3 cache. The ID software package developed by Martinsson et al. [32] for low-rank approximation of matrices is used to perform the interpolative decompositions in all tests. The ID package can be downloaded from Mark Tygert’s homepage [33]. Hereafter, the LT using matrix-matrix multiplication, Tygert’s algorithm (2010), and the block butterfly algorithm are referred to as LT0, LT1, and LT2, respectively.
Figure 2, Figure 3 and Figure 4 show the errors of the LT with CMAX = 64, on a log10 scale, for EPS = 1.0E-05, EPS = 1.0E-07 and EPS = 1.0E-10, respectively. The abbreviations “MAX ERR” and “RMS ERR” in Figure 2, Figure 3 and Figure 4 denote the maximum error and the root-mean-square error, respectively. Both the maximum error and the root-mean-square error of LT2 are about one order of magnitude smaller than those of LT1. The results show that the proposed method is effective in improving the accuracy of the Legendre transform using the butterfly algorithm.
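The two error measures can be computed as in the sketch below. It assumes absolute errors against a reference result, since whether the reported errors are absolute or relative is not stated here, and the function name is hypothetical.

```python
import math

def log10_errors(reference, computed):
    """Return (log10 of maximum error, log10 of root-mean-square error).

    Errors are absolute differences between the two sequences; the logarithm
    is undefined if the sequences agree exactly.
    """
    diffs = [abs(r - c) for r, c in zip(reference, computed)]
    max_err = max(diffs)
    rms_err = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return math.log10(max_err), math.log10(rms_err)
```

On this scale, an improvement of one order of magnitude corresponds to a drop of 1 in either value.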
Figure 5 shows the computational time of the different Legendre transform algorithms. The speedup and the loss speedup of LT2 with CMAX = 64 are shown in Figure 6 and Figure 7, respectively. The loss speedup, which measures the relative performance penalty, is defined as the speedup of LT1 minus the speedup of LT2, divided by the speedup of LT0. From Figure 5 and Figure 6, LT2 begins to be faster than LT0 at N = 2048, where it achieves more than 26%, 22% and 17% reduction in elapsed time for EPS = 1.0E-5, EPS = 1.0E-7 and EPS = 1.0E-10, respectively. LT2 achieves more than 17%, 63%, 75% and 86% reduction in elapsed time for N = 2048, 4096, 8192 and 16384, respectively. In Figure 7, the loss speedup of LT2 relative to LT1 is less than 21%, 11%, 7% and 4% for N = 2048, 4096, 8192 and 16384, respectively, and it decreases rapidly as N increases. According to the results of Yin [3], the potential instability of the interpolative decomposition exists only at very high order. The presented method can therefore alleviate the potential instability of the interpolative decomposition at a very small computational cost.
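The quantities in this comparison can be expressed as in the sketch below. It assumes that every speedup is measured against the elapsed time of LT0, which is an interpretation of the definition above; the function names are hypothetical, and the timings in the usage note are made up for illustration rather than measured.

```python
def speedup(t_ref, t):
    """Speedup of a run taking time t relative to a reference time t_ref."""
    return t_ref / t

def loss_speedup(t_lt0, t_lt1, t_lt2):
    """(speedup of LT1 - speedup of LT2) / speedup of LT0, all relative to LT0."""
    s0 = speedup(t_lt0, t_lt0)   # equals 1 under this convention
    return (speedup(t_lt0, t_lt1) - speedup(t_lt0, t_lt2)) / s0

def time_reduction(t_lt0, t):
    """Fractional reduction in elapsed time relative to LT0."""
    return 1.0 - t / t_lt0
```

For illustrative timings of 10 s, 4 s and 5 s for LT0, LT1 and LT2, the speedups are 2.5 and 2.0, the loss speedup is 0.5, and LT2 reduces the elapsed time by 50%.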
Figure 8 and Figure 9 show the computational time of the LT scaled by N log^3 N and N log^4 N, respectively. The computational cost of LT2 appears to be slightly higher than that of LT1. The increase is caused by the boundary blocks, which cannot be accurately expressed by the asymptotic formula, and by the internal blocks whose dimension is less than the dimensional threshold value. Although the results of LT2 are larger than those of LT1, LT2 follows a trend similar to that of LT1 in Figure 8 and Figure 9. This suggests that LT2 has the same computational complexity, O(N log^3 N), as LT1.
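The scaling test behind these two figures can be reproduced in a few lines. The sketch below assumes base-2 logarithms, which is not specified here, and uses synthetic timings generated from an exact N log^3 N model rather than measured data; the function names and the constant are hypothetical.

```python
import math

def scaled_time(n, seconds, power):
    """Elapsed time divided by N * (log2 N)^power."""
    return seconds / (n * math.log2(n) ** power)

def synthetic_time(n, constant=1e-9):
    """Made-up timing that follows constant * N * (log2 N)^3 exactly."""
    return constant * n * math.log2(n) ** 3

def scaling_table(sizes, power):
    """Scaled synthetic timings for each problem size."""
    return [scaled_time(n, synthetic_time(n), power) for n in sizes]
```

For timings that truly grow like N log^3 N, the curve scaled by N log^3 N is flat while the curve scaled by N log^4 N decreases, which is the pattern used above to infer the complexity.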
The Legendre-Vandermonde matrix is divided into boundary blocks and internal blocks. Boundary blocks, which cannot be accurately expressed by the asymptotic formula, cause instability of the interpolative decomposition and are therefore not suitable for it. Matrix-vector multiplication based on the butterfly algorithm is faster than the BLAS function DGEMV only when the matrix dimension is greater than or equal to 512, so internal blocks of lower dimension use direct matrix-vector multiplication instead of the butterfly algorithm. The nonzero elements of the boundary blocks and of the internal blocks that do not take part in the interpolative decomposition increase the computational cost compared with Tygert’s algorithm. With a reasonable partitioning, however, the theoretical computational complexity of the proposed method can reach the optimal O(N log^2 N / log log N).
6. Conclusions
In this paper, a highly accurate and stable Legendre transform algorithm is proposed. A block partitioning based on an asymptotic formula is employed to mitigate the potential instability of the Legendre transform using the butterfly algorithm. The Legendre-Vandermonde matrix is divided into one block and sub-matrices . Instead of the FFT method, the butterfly algorithm is employed to compute . Numerical results demonstrate that the proposed method improves stability by about one order of magnitude over Tygert’s algorithm (2010), while sacrificing less than 7% of the speedup for very high order (N ≥ 4096) Legendre transforms.
Although the computational time of the proposed method is slightly longer than that of Tygert’s algorithm, it has the same computational complexity, O(N log^3 N), as Tygert’s algorithm. Moreover, the proposed method is equivalent to Tygert’s algorithm when no block partition is used. In NWP applications, an additional dimensional threshold value could be introduced to include Tygert’s algorithm (2010) and further reduce the computational cost.
In the future, we will study better block partitioning methods to improve computational performance while preserving stability, and we will analyze in detail the spectral harmonic transform using the proposed method at very high resolution, especially its ability to reduce the potential numerical instability at resolution T7999.