Article

A High Accurate and Stable Legendre Transform Based on Block Partitioning and Butterfly Algorithm for NWP

College of Meteorology and Oceanography, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Mathematics 2019, 7(10), 966; https://doi.org/10.3390/math7100966
Submission received: 21 September 2019 / Revised: 4 October 2019 / Accepted: 10 October 2019 / Published: 14 October 2019
(This article belongs to the Section Mathematics and Computer Science)

Abstract

In this paper, we propose a highly accurate and stable Legendre transform algorithm, which reduces the potential instability at very high order for a very small increase in computational time. An error analysis of the interpolative decomposition for the Legendre transform is presented. By employing block partitioning of the Legendre-Vandermonde matrix together with the butterfly algorithm, a new Legendre transform algorithm is obtained with computational complexity $O(N\log^2 N/\log\log N)$ in theory and $O(N\log^3 N)$ in practical application. Numerical results are provided to demonstrate the efficiency and numerical stability of the new algorithm.

1. Introduction

The Legendre transform (LT) plays an important part in many scientific applications, such as astrophysics [1] and numerical weather prediction and climate models [2,3]. Fast Legendre transforms have attracted considerable interest in scientific computing and numerical simulation, and much effort has gone into developing fast Legendre transform algorithms [4,5,6,7,8,9,10,11]. The validity and reliability of these algorithms depend on whether they remain fast, stable and highly accurate.
The butterfly algorithm [12,13] is an effective multilevel technique for compressing a matrix that satisfies a complementary low-rank property. It factorizes a complementary low-rank matrix $K$ of size $N \times N$ into the product of $O(\log N)$ sparse matrices, each with $O(N)$ nonzero entries. Hence, a dense matrix-vector multiplication can be transformed into a sequence of sparse matrix-vector multiplications costing $O(N\log N)$ operations [14]. An LT based on the butterfly algorithm has the advantages of high accuracy, stability and low computational complexity.
As one of the most widely used butterfly algorithms, Tygert’s algorithm (2010) [11] has been successfully implemented in the IFS of ECMWF [2], in YHGSM [15,16,17] of NUDT [3] and in astrophysics [1]. In numerical weather prediction and climate models, which require the spherical harmonic transform (SHT) many times per time step, the precomputation is needed only once, in the first time step; the results are then stored in memory and reused in each transform. Thus, although the precomputation of Tygert’s algorithm (2010) is slow, $O(N^2)$ for the LT and $O(N^3)$ for the SHT, it has little impact on total performance. However, some issues remain unsolved. The main one is the potential instability of the interpolative decomposition (ID) [18] for very high order Legendre transforms. Tygert [11] points out that the reason why the butterfly procedure works so well for associated Legendre functions may be that the associated transforms are nearly weighted averages of Fourier integral operators. There is no proof in the literature that the precomputations will compress the appropriate $N \times N$ matrix enough to enable its application to vectors using only $O(N\log N)$ floating-point operations (flops). Full numerical stability has been demonstrated both empirically and theoretically for the fast Fourier transform (FFT) using the butterfly algorithm, but it is difficult to give equally complete and rigorous proofs for the interpolative decomposition of the Legendre transform.
The non-oscillatory phase function method opens up new avenues for special function transforms. The solutions of some kinds of second-order differential equations can be accurately represented by non-oscillatory phase functions [19,20]. It has been proved that Legendre’s differential equation [21] and its generalization, Jacobi’s differential equation [22], admit a non-oscillatory phase function. Non-oscillatory phase functions can therefore be used for the expansions [22], the calculation of the roots [23] and the transforms [24] of special functions. The Jacobi transform based on non-oscillatory phase functions achieves an optimal computational complexity of $O(N\log^2 N/\log\log N)$ in reference [24]. However, the Legendre transform algorithm in ButterflyLab [25], which adopts the interpolative butterfly factorization (IBF) [14,26] and the non-oscillatory phase function method to evaluate the Legendre polynomials [24], is not as accurate as the Fourier transform using IBF. Therefore, the fast Legendre transform (FLT) based on IBF and non-oscillatory phase functions, and its extension to the associated Legendre functions, need further study.
Recently, fast Legendre transform algorithms based on the FFT have attracted more attention for their optimal computational complexity $O(N\log^2 N/\log\log N)$. Hale and Townsend [27] first presented a fast Chebyshev-Legendre transform, and then developed a non-uniform discrete cosine transform that uses a Taylor series expansion of Chebyshev polynomials about equally spaced points in the frequency domain. Finally, Hale and Townsend [28] obtained an $O(N\log^2 N/\log\log N)$ Legendre transform algorithm. Fast polynomial transforms [29] based on Toeplitz and Hankel matrices have also been proposed to accelerate the Chebyshev-Legendre transform. Although the FFT-based LT has an attractive computational complexity, it requires so many FFTs that it only becomes more computationally efficient than an LT using Dgemv when $N$ is greater than or equal to 5000. Since the computation of the associated-Legendre-Vandermonde matrices is completed in the precomputation step, the comparison becomes even less favorable where the FLT is used repeatedly, as in NWP, where the associated-Legendre-Vandermonde matrices are computed only once for many spherical harmonic transforms (SHTs).
Motivated by the ongoing research in these areas, we present a theoretical method for analyzing the error of the LT using the butterfly algorithm, and then provide a numerically stable Legendre transform algorithm based on block partitioning and the butterfly algorithm. The novel aspect is the mitigation of the potential instability of the LT using the butterfly algorithm at a very small increase in computational cost.

2. Mathematical Preliminaries

In this section, we introduce the theorem that Legendre polynomials on an equally spaced grid can be expressed as a weighted linear combination of Chebyshev polynomials, and a partitioning of the Legendre-Vandermonde matrix $P_N(\underline{x}_N^{cheb})$ (with $\underline{x} = \underline{x}_N^{cheb} = \cos(\underline{\theta}_N^{cheb})$). For more details, see references [27,28].
According to Stieltjes’s theory [30], Legendre polynomials can be expressed by the following asymptotic formula as $n \to \infty$:
$$P_n(\cos\theta) = C_n \sum_{m=0}^{M-1} h_{m,n}\,\frac{\cos\left(\left(m+n+\frac{1}{2}\right)\theta - \left(m+\frac{1}{2}\right)\frac{\pi}{2}\right)}{(2\sin\theta)^{m+1/2}} + R_{M,n}(\theta) \tag{1}$$
where $\theta = \cos^{-1}x$, $\theta \in (0,\pi)$ and
$$C_n = \frac{4}{\pi}\prod_{j=1}^{n}\frac{j}{j+1/2} = \sqrt{\frac{4}{\pi}}\,\frac{\Gamma(n+1)}{\Gamma(n+3/2)} \tag{2}$$
$$h_{m,n} = \begin{cases} 1, & m = 0,\\ \prod_{j=1}^{m}\dfrac{(j-1/2)^2}{j\,(n+j+1/2)}, & m > 0. \end{cases} \tag{3}$$
The error term in Equation (1) can be bounded by
$$|R_{M,n}(\theta)| \le C_n h_{M,n}\,\frac{2}{(2\sin\theta)^{M+1/2}} \tag{4}$$
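As an illustration of Equations (1)–(4), the following is a minimal Python sketch, not the authors’ implementation, that evaluates the truncated asymptotic sum and compares it with a reference value from the standard three-term recurrence. The function names, and the sample values $n = 100$, $\theta = \pi/3$, $M = 5$ in the usage note, are assumptions chosen for illustration only.

```python
import math

def legendre_recurrence(n, x):
    """Reference value of P_n(x) from the three-term recurrence."""
    p_prev, p = 1.0, x
    if n == 0:
        return p_prev
    for k in range(1, n):
        # (k+1) P_{k+1}(x) = (2k+1) x P_k(x) - k P_{k-1}(x)
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p

def c_n(n):
    """C_n of Equation (2), evaluated through log-gamma to avoid overflow."""
    return math.sqrt(4.0 / math.pi) * math.exp(math.lgamma(n + 1) - math.lgamma(n + 1.5))

def h_mn(m, n):
    """h_{m,n} of Equation (3); the empty product gives 1 for m = 0."""
    h = 1.0
    for j in range(1, m + 1):
        h *= (j - 0.5) ** 2 / (j * (n + j + 0.5))
    return h

def legendre_asymptotic(n, theta, M):
    """First M terms of the asymptotic formula, Equation (1), without R_{M,n}."""
    total = 0.0
    for m in range(M):
        phase = (m + n + 0.5) * theta - (m + 0.5) * math.pi / 2
        total += h_mn(m, n) * math.cos(phase) / (2 * math.sin(theta)) ** (m + 0.5)
    return c_n(n) * total

def remainder_bound(n, theta, M):
    """Right-hand side of the bound in Equation (4)."""
    return c_n(n) * h_mn(M, n) * 2.0 / (2 * math.sin(theta)) ** (M + 0.5)
```

For example, `abs(legendre_asymptotic(100, math.pi / 3, 5) - legendre_recurrence(100, math.cos(math.pi / 3)))` can be compared with `remainder_bound(100, math.pi / 3, 5)`. The bound grows quickly as $\theta$ approaches 0 or $\pi$, which is why the boundary rows of the matrix are treated separately in Section 4.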
Hale and Townsend [27] rewrote Equation (1) as a weighted linear combination of Chebyshev polynomials
$$P_n(\cos\theta) = C_n \sum_{m=0}^{M-1} h_{m,n}\left(u_m(\theta)\,T_n(\sin\theta) + v_m(\theta)\,T_n(\cos\theta)\right) + R_{M,n}(\theta) \tag{5}$$
with $T_n(\cos\theta) = \cos(n\theta)$, $T_n(\sin\theta) = \sin(n\theta)$ and
$$u_m(\theta) = \frac{\sin\left(\left(m+\frac{1}{2}\right)\left(\frac{\pi}{2}-\theta\right)\right)}{(2\sin\theta)^{m+1/2}},\qquad v_m(\theta) = \frac{\cos\left(\left(m+\frac{1}{2}\right)\left(\frac{\pi}{2}-\theta\right)\right)}{(2\sin\theta)^{m+1/2}} \tag{6}$$
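The step from Equation (1) to Equation (5) is the angle-difference identity for the cosine. As a short worked derivation, write $\varphi_m = \left(m+\frac{1}{2}\right)\left(\frac{\pi}{2}-\theta\right)$ (a symbol introduced here only for brevity), so that

$$\cos\left(\left(m+n+\tfrac{1}{2}\right)\theta - \left(m+\tfrac{1}{2}\right)\tfrac{\pi}{2}\right) = \cos(n\theta - \varphi_m) = \sin(n\theta)\sin\varphi_m + \cos(n\theta)\cos\varphi_m .$$

Dividing by $(2\sin\theta)^{m+1/2}$ gives $u_m(\theta)\sin(n\theta) + v_m(\theta)\cos(n\theta)$, which is the summand of Equation (5) in the notation $T_n(\sin\theta) = \sin(n\theta)$, $T_n(\cos\theta) = \cos(n\theta)$ used in the text.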
Let $x_k^{leg} = \cos(\theta_k^{leg})$, where $\theta_0^{leg},\ldots,\theta_{N-1}^{leg}$ are the transformed Legendre nodes. Equation (5) can then be written as
$$P_n(x_k^{leg}) = C_n \sum_{m=0}^{M-1} h_{m,n}\left(u_m(\theta_k^{leg})\,T_n(\sin(\theta_k^{leg})) + v_m(\theta_k^{leg})\,T_n(\cos(\theta_k^{leg}))\right) + R_{M,n}(\theta_k^{leg}) \tag{7}$$

3. Error Analysis of Legendre Transform using Butterfly Algorithm

The transformed Legendre nodes $\theta_0^{leg},\ldots,\theta_{N-1}^{leg}$ can be seen as a perturbation of an equally spaced grid $\theta_0^{*},\ldots,\theta_{N-1}^{*}$, i.e.,
$$\theta_k^{leg} = \theta_k^{*} + \delta\theta_k,\qquad 0 \le k \le N-1 \tag{8}$$
and each $\cos(n\theta_k^{leg})$ term can then be approximated by a truncated Taylor series expansion about $\theta_k^{*}$. If $|\delta\theta_k|$ is small, then only a few terms in the Taylor expansion are required.
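The Taylor series expansion of $T_n(\cos(\theta+\delta\theta)) = \cos(n(\theta+\delta\theta))$ about $\theta \in [0,\pi]$ can be expressed as
$$\cos(n(\theta+\delta\theta)) = \cos(n\theta) + \sum_{l=1}^{\infty}\cos^{(l)}(n\theta)\,\frac{(n\,\delta\theta)^l}{l!} = \cos(n\theta) + \sum_{l=1}^{\infty}(-1)^{\lfloor (l+1)/2\rfloor}\,\Phi_l(n\theta)\,\frac{(n\,\delta\theta)^l}{l!} \tag{9}$$
where
$$\Phi_l(\theta) = \begin{cases}\cos(\theta), & l \text{ even},\\ \sin(\theta), & l \text{ odd}.\end{cases} \tag{10}$$
Similarly, $T_n(\sin(\theta+\delta\theta)) = \sin(n(\theta+\delta\theta))$ about $\theta \in [0,\pi]$ can be expressed as
$$\sin(n(\theta+\delta\theta)) = \sin(n\theta) + \sum_{l=1}^{\infty}\sin^{(l)}(n\theta)\,\frac{(n\,\delta\theta)^l}{l!} = \sin(n\theta) + \sum_{l=1}^{\infty}(-1)^{\lfloor l/2\rfloor}\,\Psi_l(n\theta)\,\frac{(n\,\delta\theta)^l}{l!} \tag{11}$$
where
$$\Psi_l(\theta) = \begin{cases}\cos(\theta), & l \text{ odd},\\ \sin(\theta), & l \text{ even}.\end{cases} \tag{12}$$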
Substituting $\theta_k^{*}$ for $\theta$ in Equation (5), one can get
$$P_n(\cos(\theta_k^{*})) = C_n\,T_n(\sin(\theta_k^{*}))\sum_{m=0}^{M-1} h_{m,n}\,u_m(\theta_k^{*}) + C_n\,T_n(\cos(\theta_k^{*}))\sum_{m=0}^{M-1} h_{m,n}\,v_m(\theta_k^{*}) + R_{M,n}(\theta_k^{*}) \tag{13}$$
The Taylor series expansion of $P_n(\cos(\theta_k^{leg}))$ about $\theta_k^{*}$ can be expressed as
$$P_n(\cos(\theta_k^{leg})) = P_n(\cos(\theta_k^{*}+\delta\theta_k)) = \sum_{l=0}^{\infty} P_n^{(l)}(\cos(\theta_k^{*}))\,\frac{(\delta\theta_k)^l}{l!} \tag{14}$$
According to Equation (13), $P_n^{(l)}(\cos(\theta_k^{*}))$ $(l > 0)$ can be written as
$$P_n^{(l)}(\cos(\theta_k^{*})) = C_n\sum_{m=0}^{M-1} h_{m,n}\left\{u_m(\theta_k^{*})\,T_n^{(l)}(\sin(\theta_k^{*})) + u_m^{(l)}(\theta_k^{*})\,T_n(\sin(\theta_k^{*}))\right\} + C_n\sum_{m=0}^{M-1} h_{m,n}\left\{v_m(\theta_k^{*})\,T_n^{(l)}(\cos(\theta_k^{*})) + v_m^{(l)}(\theta_k^{*})\,T_n(\cos(\theta_k^{*}))\right\} + R_{M,n}^{(l)}(\theta_k^{*}) \tag{15}$$
Substituting Equation (15) into Equation (14), one can obtain
$$P_n(x_k^{leg}) = C_n\sum_{m=0}^{M-1} h_{m,n}\left\{u_m(\theta_k^{*})\,T_n(\sin(\theta_k^{*})) + v_m(\theta_k^{*})\,T_n(\cos(\theta_k^{*}))\right\} + C_n\sum_{l=1}^{\infty}\frac{(\delta\theta_k)^l}{l!}\sum_{m=0}^{M-1} h_{m,n}\left\{u_m(\theta_k^{*})\,T_n^{(l)}(\sin(\theta_k^{*})) + u_m^{(l)}(\theta_k^{*})\,T_n(\sin(\theta_k^{*}))\right\} + C_n\sum_{l=1}^{\infty}\frac{(\delta\theta_k)^l}{l!}\sum_{m=0}^{M-1} h_{m,n}\left\{v_m(\theta_k^{*})\,T_n^{(l)}(\cos(\theta_k^{*})) + v_m^{(l)}(\theta_k^{*})\,T_n(\cos(\theta_k^{*}))\right\} + \sum_{l=0}^{\infty} R_{M,n}^{(l)}(\theta_k^{*})\,\frac{(\delta\theta_k)^l}{l!} \tag{16}$$
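Because
$$\sum_{l=0}^{\infty}\frac{(\delta\theta_k)^l}{l!}\,T_n(\sin(\theta_k^{*}))\sum_{m=0}^{M-1} h_{m,n}\,u_m^{(l)}(\theta_k^{*}) = T_n(\sin(\theta_k^{*}))\sum_{m=0}^{M-1} h_{m,n}\sum_{l=0}^{\infty} u_m^{(l)}(\theta_k^{*})\,\frac{(\delta\theta_k)^l}{l!} = T_n(\sin(\theta_k^{*}))\sum_{m=0}^{M-1} h_{m,n}\,u_m(\theta_k^{*}+\delta\theta_k) = T_n(\sin(\theta_k^{*}))\sum_{m=0}^{M-1} h_{m,n}\,u_m(\theta_k^{leg}) \tag{17}$$
and
$$\sum_{l=0}^{\infty}\frac{(\delta\theta_k)^l}{l!}\,T_n(\cos(\theta_k^{*}))\sum_{m=0}^{M-1} h_{m,n}\,v_m^{(l)}(\theta_k^{*}) = T_n(\cos(\theta_k^{*}))\sum_{m=0}^{M-1} h_{m,n}\sum_{l=0}^{\infty} v_m^{(l)}(\theta_k^{*})\,\frac{(\delta\theta_k)^l}{l!} = T_n(\cos(\theta_k^{*}))\sum_{m=0}^{M-1} h_{m,n}\,v_m(\theta_k^{*}+\delta\theta_k) = T_n(\cos(\theta_k^{*}))\sum_{m=0}^{M-1} h_{m,n}\,v_m(\theta_k^{leg}) \tag{18}$$
Similarly, we have
$$\sum_{l=0}^{\infty}\frac{(\delta\theta_k)^l}{l!}\sum_{m=0}^{M-1} h_{m,n}\,u_m(\theta_k^{*})\,T_n^{(l)}(\sin(\theta_k^{*})) = \sum_{m=0}^{M-1} h_{m,n}\,u_m(\theta_k^{*})\sum_{l=0}^{\infty} T_n^{(l)}(\sin(\theta_k^{*}))\,\frac{(\delta\theta_k)^l}{l!} = \sum_{m=0}^{M-1} h_{m,n}\,u_m(\theta_k^{*})\,T_n(\sin(\theta_k^{leg})) \tag{19}$$
and
$$\sum_{l=0}^{\infty}\frac{(\delta\theta_k)^l}{l!}\sum_{m=0}^{M-1} h_{m,n}\,v_m(\theta_k^{*})\,T_n^{(l)}(\cos(\theta_k^{*})) = \sum_{m=0}^{M-1} h_{m,n}\,v_m(\theta_k^{*})\sum_{l=0}^{\infty} T_n^{(l)}(\cos(\theta_k^{*}))\,\frac{(\delta\theta_k)^l}{l!} = \sum_{m=0}^{M-1} h_{m,n}\,v_m(\theta_k^{*})\,T_n(\cos(\theta_k^{leg})) \tag{20}$$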
Substituting Equations (17)–(20) into Equation (16), one can get
$$P_n(x_k^{leg}) = C_n\sum_{m=0}^{M-1} h_{m,n}\left(u_m(\theta_k^{leg})\,T_n(\sin(\theta_k^{*})) + v_m(\theta_k^{leg})\,T_n(\cos(\theta_k^{*}))\right) + C_n\sum_{m=0}^{M-1} h_{m,n}\left\{u_m(\theta_k^{*})\,T_n(\sin(\theta_k^{leg})) + v_m(\theta_k^{*})\,T_n(\cos(\theta_k^{leg}))\right\} - C_n\sum_{m=0}^{M-1} h_{m,n}\left\{u_m(\theta_k^{*})\,T_n(\sin(\theta_k^{*})) + v_m(\theta_k^{*})\,T_n(\cos(\theta_k^{*}))\right\} + R_{M,n}(\theta_k^{leg}) \tag{21}$$
Then
$$P_n(x_k^{leg}) = C_n\sum_{m=0}^{M-1} h_{m,n}\left(u_m(\theta_k^{leg})\,T_n(\sin(\theta_k^{*})) + v_m(\theta_k^{leg})\,T_n(\cos(\theta_k^{*}))\right) + C_n\sum_{m=0}^{M-1} h_{m,n}\left\{u_m(\theta_k^{*})\,T_n(\sin(\theta_k^{leg})) + v_m(\theta_k^{*})\,T_n(\cos(\theta_k^{leg}))\right\} - P_n(x_k^{*}) + R_{M,n}(\theta_k^{*}) + R_{M,n}(\theta_k^{leg}) \tag{22}$$
By truncating the second term on the right-hand side of Equation (19), it can be approximated as
$$P_n(x_k^{leg}) = C_n\sum_{m=0}^{M-1} h_{m,n}\left(u_m(\theta_k^{leg})\,T_n(\sin(\theta_k^{*})) + v_m(\theta_k^{leg})\,T_n(\cos(\theta_k^{*}))\right) + C_n\sum_{l=1}^{L}\frac{(\delta\theta_k)^l}{l!}\sum_{m=0}^{M-1} h_{m,n}\left\{u_m(\theta_k^{*})\,T_n^{(l)}(\sin(\theta_k^{*})) + v_m(\theta_k^{*})\,T_n^{(l)}(\cos(\theta_k^{*}))\right\} + R_{M,n}(\theta_k^{leg}) + R_{L,M,n,\delta\theta} \tag{23}$$
and then
$$P_n(x_k^{leg}) = C_n\sum_{m=0}^{M-1} h_{m,n}\left(u_m(\theta_k^{leg})\,T_n(\sin(\theta_k^{*})) + v_m(\theta_k^{leg})\,T_n(\cos(\theta_k^{*}))\right) + C_n\sum_{l\ \mathrm{odd}}^{L}\frac{(n\,\delta\theta_k)^l}{l!}\sum_{m=0}^{M-1} h_{m,n}\left((-1)^{\lfloor l/2\rfloor}u_m(\theta_k^{*})\,T_n(\cos\theta_k^{*}) + (-1)^{\lfloor (l+1)/2\rfloor}v_m(\theta_k^{*})\,T_n(\sin\theta_k^{*})\right) + C_n\sum_{l\ \mathrm{even}}^{L}\frac{(n\,\delta\theta_k)^l}{l!}\sum_{m=0}^{M-1} h_{m,n}\left((-1)^{\lfloor l/2\rfloor}u_m(\theta_k^{*})\,T_n(\sin\theta_k^{*}) + (-1)^{\lfloor (l+1)/2\rfloor}v_m(\theta_k^{*})\,T_n(\cos\theta_k^{*})\right) + R_{M,n}(\theta_k^{leg}) + R_{L,M,n,\delta\theta} \tag{24}$$
Equation (24) can be expressed in the following compact form
$$P_n(x_k^{leg}) \approx (U_n + V_n) + \sum_{l\ \mathrm{odd}}^{L}\frac{(n\,\delta\theta_k)^l}{l!}\left((-1)^{\lfloor l/2\rfloor}U_n^{c} + (-1)^{\lfloor (l+1)/2\rfloor}V_n^{s}\right) + \sum_{l\ \mathrm{even}}^{L}\frac{(n\,\delta\theta_k)^l}{l!}\left((-1)^{\lfloor l/2\rfloor}U_n^{s} + (-1)^{\lfloor (l+1)/2\rfloor}V_n^{c}\right) \tag{25}$$
where
$$\begin{aligned} U_n &= C_n\sum_{m=0}^{M-1} h_{m,n}\,u_m(\theta_k^{leg})\,T_n(\sin(\theta_k^{*})), & V_n &= C_n\sum_{m=0}^{M-1} h_{m,n}\,v_m(\theta_k^{leg})\,T_n(\cos(\theta_k^{*})),\\ U_n^{s} &= C_n\sum_{m=0}^{M-1} h_{m,n}\,u_m(\theta_k^{*})\,T_n(\sin(\theta_k^{*})), & V_n^{s} &= C_n\sum_{m=0}^{M-1} h_{m,n}\,v_m(\theta_k^{*})\,T_n(\sin(\theta_k^{*})),\\ U_n^{c} &= C_n\sum_{m=0}^{M-1} h_{m,n}\,u_m(\theta_k^{*})\,T_n(\cos(\theta_k^{*})), & V_n^{c} &= C_n\sum_{m=0}^{M-1} h_{m,n}\,v_m(\theta_k^{*})\,T_n(\cos(\theta_k^{*})). \end{aligned} \tag{26}$$
So, the computation of the Legendre-Vandermonde matrix can be written as
$$P_N(\underline{x}_N^{leg}) = (U_N + V_N) + \sum_{l\ \mathrm{odd}}^{L}\frac{(n\,\delta\theta_k)^l}{l!}\left((-1)^{\lfloor l/2\rfloor}U^{c} + (-1)^{\lfloor (l+1)/2\rfloor}V^{s}\right) + \sum_{l\ \mathrm{even}}^{L}\frac{(n\,\delta\theta_k)^l}{l!}\left((-1)^{\lfloor l/2\rfloor}U^{s} + (-1)^{\lfloor (l+1)/2\rfloor}V^{c}\right) + R_{total} \tag{27}$$
The numerical stability of the ID can be analyzed using Equation (27). Since the butterfly algorithm works well for equispaced Fourier series, the Legendre transform using the butterfly algorithm is numerically stable, with error $R_{total}$. When $L$ tends to infinity, the error is $R_{M,n}(\theta_k^{leg})$.
Lemma 1.
For any $L \ge 1$ and $n \ge 0$ [27]
$$R_{L,n,\delta\theta} := \max_{\theta\in[0,\pi]}\left|\cos(n(\theta+\delta\theta)) - \sum_{l=0}^{L-1}\cos^{(l)}(n\theta)\,\frac{(n\,\delta\theta)^l}{l!}\right| \le \frac{(n|\delta\theta|)^L}{L!} \tag{28}$$
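Lemma 1 can be checked numerically with the expansion of Equation (9). The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, and $\cos^{(l)}(n\theta)$ is read, as in Equation (9), as the $l$-th derivative of the cosine evaluated at $n\theta$, so each term carries the factor $(n\,\delta\theta)^l$.

```python
import math

def phi(l, t):
    """Phi_l of Equation (10): cos for even l, sin for odd l."""
    return math.cos(t) if l % 2 == 0 else math.sin(t)

def cos_taylor(n, theta, dtheta, L):
    """Terms l = 0..L-1 of the expansion of cos(n(theta + dtheta)) about theta."""
    total = 0.0
    for l in range(L):
        sign = (-1) ** ((l + 1) // 2)  # (-1)^floor((l+1)/2), the sign in Equation (9)
        total += sign * phi(l, n * theta) * (n * dtheta) ** l / math.factorial(l)
    return total

def taylor_bound(n, dtheta, L):
    """Right-hand side of Lemma 1: (n |dtheta|)^L / L!."""
    return (n * abs(dtheta)) ** L / math.factorial(L)
```

With, say, `n = 50`, `dtheta = 1e-3` and `L = 6`, the difference `abs(math.cos(n * (theta + dtheta)) - cos_taylor(n, theta, dtheta, L))` can be compared against `taylor_bound(n, dtheta, L)` on a grid of `theta` values. The bound depends on the product $n|\delta\theta|$, so the number of Taylor terms needed grows with the degree $n$ for a fixed node perturbation.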
Lemma 2.
For any $L \ge 1$ and $n \ge 0$, the error bound of Equation (24) is
$$R \le \frac{2\,C_n h_{M,n}}{L!}\left(\frac{(n|\delta\theta_k|)^L}{(2\sin\theta_k^{cheb})^{M+\frac{1}{2}}} + \frac{1}{(2\sin\theta_k^{leg})^{M+\frac{1}{2}}}\right) \tag{29}$$
Proof of Lemma 2.
$$|R_{M,L,n,\delta\theta_k}| = \left|\sum_{l=0}^{L}(-1)^{\lfloor (l+1)/2\rfloor}\,C_n\sum_{m=0}^{M-1} h_{m,n}\left(u_m(\theta_k^{*})\,\Psi_l(n\theta_k^{*}) + v_m(\theta_k^{*})\,\Phi_l(n\theta_k^{*})\right)\frac{(n\,\delta\theta_k)^l}{l!}\right| \le \frac{(n|\delta\theta_k|)^L}{L!}\,C_n\sum_{m=0}^{M-1} h_{m,n}\left|u_m(\theta_k^{*})\,\Psi_l(n\theta_k^{*}) + v_m(\theta_k^{*})\,\Phi_l(n\theta_k^{*})\right| \le C_n h_{M,n}\,\frac{2}{(2\sin\theta_k^{*})^{M+1/2}}\,\frac{(n|\delta\theta_k|)^L}{L!} \tag{30}$$
Finally, one can get the total upper error bound
$$R = \left|R_{M,L,n,\delta\theta_k} + R_{M,n}(\theta_k^{leg})\right| \le \frac{2\,C_n h_{M,n}}{L!}\left(\frac{(n|\delta\theta_k|)^L}{(2\sin\theta_k^{cheb})^{M+\frac{1}{2}}} + \frac{1}{(2\sin\theta_k^{leg})^{M+\frac{1}{2}}}\right) \tag{31}$$

4. Legendre Transform Based on Block Partitioning and Butterfly Algorithm

In this section, we propose a Legendre transform based on block partitioning and the butterfly algorithm. The main idea is to separate the matrix $P_N(\underline{x}_N^{leg})$ into a block $P_N^{REC}(\underline{x}_N^{leg})$ and $K$ sub-matrices $P_N^{(k)}(\underline{x}_N^{leg})$, and then to factorize the $K$ sub-matrices $P_N^{(k)}(\underline{x}_N^{leg})$ by the butterfly algorithm. The butterfly algorithm can factorize a complementary low-rank matrix of size $N \times N$ into the product of $O(\log N)$ sparse matrices, each with $O(N)$ nonzero entries. Moreover, the butterfly algorithm works well for $P_N^{(k)}(\underline{x}_N^{leg})$, as mentioned in Section 3. Hence, the total number of nonzero entries after factorization can be kept to approximately $O(N\log^2 N/\log\log N)$ by controlling the number of nonzero entries in $P_N^{REC}(\underline{x}_N^{leg})$. Finally, one obtains a Legendre transform with $O(N\log^2 N/\log\log N)$ computational complexity.
From Equation (24), the matrix $P_N(\underline{x}_N^{leg})$ can be considered as a perturbation of the matrix $P_N(\underline{x}_N^{cheb})$. The block partitioning of $P_N(\underline{x}_N^{leg})$ can therefore be performed by the same method as for $P_N(\underline{x}_N^{cheb})$ in the paper of Hale and Townsend [27], and the matrix $P_N(\underline{x}_N^{leg})$ is partitioned as
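$$P_N(\underline{x}_N^{leg}) = P_N^{REC}(\underline{x}_N^{leg}) + \sum_{k=1}^{K} P_N^{(k)}(\underline{x}_N^{leg}) \tag{32}$$
This partitioning separates the matrix $P_N(\underline{x}_N^{leg})$ into the block $P_N^{REC}(\underline{x}_N^{leg})$ and $K$ sub-matrices $P_N^{(k)}(\underline{x}_N^{leg})$. The block $P_N^{REC}(\underline{x}_N^{leg})$ contains the columns and rows of $P_N(\underline{x}_N^{leg})$ that cannot be computed by using Equation (24):
$$P_N^{REC}(\underline{x}_N^{leg})_{ij} = \begin{cases} P_N(\underline{x}_N^{leg})_{ij}, & 1 \le \min(i, N-i+1) \le j_M,\\ P_N(\underline{x}_N^{leg})_{ij}, & 1 \le j \le n_M,\\ 0, & \text{otherwise}, \end{cases} \tag{33}$$
where
$$n_M = \left\lfloor \frac{1}{2}\left(\frac{\varepsilon\,\pi^{3/2}\,\Gamma(M+1)}{4\,\Gamma(M+1/2)}\right)^{-\frac{1}{M+\frac{1}{2}}} \right\rfloor \tag{34}$$
and
$$j_M = \left\lfloor \frac{N+1}{\pi}\,\sin^{-1}\left(\frac{n_M}{N}\right) \right\rfloor \tag{35}$$
$$P_N^{(k)}(\underline{x}_N^{leg})_{ij} = \begin{cases} P_N(\underline{x}_N^{leg})_{ij}, & i_k \le i \le N - i_k,\ \ \alpha^{k}N \le j \le \alpha^{k-1}N,\\ 0, & \text{otherwise}, \end{cases} \tag{36}$$
where $\alpha = O(1/\log N)$ and $i_k = \left\lfloor \frac{N+1}{\pi}\,\sin^{-1}\left(\frac{n_M}{\alpha^{k}N}\right) \right\rfloor$.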
The Legendre transform can be expressed as
$$P_N(\underline{x}_N^{leg})\,\underline{c}_N^{leg} = P_N^{REC}(\underline{x}_N^{leg})\,\underline{c}_N^{leg} + \sum_{k=1}^{K} P_N^{(k)}(\underline{x}_N^{leg})\,\underline{c}_N^{leg} \tag{37}$$
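The splitting in Equation (37) relies only on linearity: the product with the full matrix is the sum of the products with the blocks and with the remainder. The sketch below illustrates this in plain Python on a small dense matrix. It is not the block butterfly algorithm itself: the blocks are applied densely where a butterfly-compressed operator would act, the remainder is taken to be every entry outside the blocks, and the function names, half-open index ranges and sample block layout are assumptions made for illustration.

```python
import math

def legendre_vandermonde(nodes, N):
    """Dense matrix with entry [i][n] = P_n(nodes[i]) for n = 0..N-1."""
    rows = []
    for x in nodes:
        row = [1.0, x][:N]
        for k in range(1, N - 1):
            # (k+1) P_{k+1}(x) = (2k+1) x P_k(x) - k P_{k-1}(x)
            row.append(((2 * k + 1) * x * row[k] - k * row[k - 1]) / (k + 1))
        rows.append(row)
    return rows

def matvec(A, c):
    """Direct dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, c)) for row in A]

def split_matvec(A, c, blocks):
    """Apply A as remainder + blocks.

    blocks: disjoint (row_lo, row_hi, col_lo, col_hi) tuples, half-open ranges.
    """
    n_rows, n_cols = len(A), len(c)
    in_block = [[False] * n_cols for _ in range(n_rows)]
    y = [0.0] * n_rows
    for r0, r1, c0, c1 in blocks:
        for i in range(r0, r1):
            acc = 0.0
            for j in range(c0, c1):
                in_block[i][j] = True
                acc += A[i][j] * c[j]
            y[i] += acc  # a compressed (butterfly) operator would be applied here
    for i in range(n_rows):
        # remainder part: every entry not covered by a block, applied directly
        y[i] += sum(A[i][j] * c[j] for j in range(n_cols) if not in_block[i][j])
    return y
```

For a small test matrix built with `legendre_vandermonde`, `split_matvec(A, c, blocks)` agrees with `matvec(A, c)` up to rounding for any disjoint choice of `blocks`; the cost advantage in the paper comes from replacing the dense block products by butterfly factorizations.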
The nonzero entries of $P_N^{(k)}(\underline{x}_N^{leg})$ can be accurately expressed by the asymptotic formula, which means that butterfly compression of $P_N^{(k)}(\underline{x}_N^{leg})$ is stable and accurate. Instead of the FFT method, the butterfly algorithm is employed to compute the matrix-vector product $P_N^{(k)}(\underline{x}_N^{leg})\,\underline{c}_N^{leg}$, because the butterfly algorithm works well for $P_N^{(k)}(\underline{x}_N^{leg})$, as mentioned in Section 3. So $\sum_{k=1}^{K} P_N^{(k)}(\underline{x}_N^{leg})\,\underline{c}_N^{leg}$ can be computed in $O(KN\log N)$ operations. By restricting $P_N^{REC}(\underline{x}_N^{leg})$ to have fewer than $O(KN\log N)$ nonzero entries, the matrix-vector product $P_N^{REC}(\underline{x}_N^{leg})\,\underline{c}_N^{leg}$ can also be computed in $O(KN\log N)$ operations. In this way, the optimal computational cost is achieved. Letting $n_m = \min(n_M, N-1)$, the parameters $\alpha$ and $K$ are defined as
$$\alpha = \begin{cases} \min\left(1/\log(N/n_m),\ 1/2\right), & \text{for small } N,\\ 1/\log(N/n_m), & \text{for large } N, \end{cases} \tag{38}$$
and $K = O(\log N/\log\log N)$, respectively. In practical application, only the parameters $N$, $n_m$, $\alpha$ and $K$ are used to obtain information such as the starting row/column index and offset for all blocks.
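To make the role of the parameters concrete, the following Python sketch computes $n_M$, $j_M$, $\alpha$, $K$ and $i_k$ from Equations (34)–(38). Several choices are assumptions, since the text does not fix them: the defaults `M = 5` and `eps = 2.2e-16`, the `small_n` flag used to select the branch of Equation (38), the use of the clamped value $n_m$ inside the arcsine, and the rule that $K$ is the largest $k$ with $\alpha^k N \ge n_m$ (the text only states $K = O(\log N/\log\log N)$).

```python
import math

def partition_parameters(N, M=5, eps=2.2e-16, small_n=True):
    """Block-partitioning parameters; defaults and the rule for K are assumptions."""
    # n_M from Equation (34)
    ratio = eps * math.pi ** 1.5 * math.gamma(M + 1) / (4.0 * math.gamma(M + 0.5))
    n_M = math.floor(0.5 * ratio ** (-1.0 / (M + 0.5)))
    n_m = min(n_M, N - 1)
    # j_M from Equation (35); n_m is used so that the arcsine argument stays <= 1
    j_M = math.floor((N + 1) / math.pi * math.asin(n_m / N))
    # alpha from Equation (38)
    alpha = 1.0 / math.log(N / n_m)
    if small_n:
        alpha = min(alpha, 0.5)
    # Assumed rule: K is the largest k with alpha^k * N >= n_m
    K = 0
    while alpha < 1.0 and alpha ** (K + 1) * N >= n_m:
        K += 1
    # i_k as defined after Equation (36), for k = 1..K
    i_k = [math.floor((N + 1) / math.pi * math.asin(n_m / (alpha ** k * N)))
           for k in range(1, K + 1)]
    return {"n_M": n_M, "n_m": n_m, "j_M": j_M, "alpha": alpha, "K": K, "i_k": i_k}
```

Calling `partition_parameters(1024)` returns the quantities from which the starting row and column indices of the blocks follow; under these assumptions the row offsets `i_k` grow with `k`, so higher-index blocks cover narrower bands of rows around the middle of the matrix.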
Figure 1 shows the partitioning of the Legendre-Vandermonde matrix for N = 1024. The matrix is divided into boundary parts (denoted by the symbol B) and internal parts (denoted by the symbol P). The boundary parts include the elements that cannot be accurately expressed by the asymptotic formula. There are 2(K+1) sub-matrices of B and 2K sub-matrices of P. Owing to the symmetric or anti-symmetric property of Legendre polynomials, only the K+1 sub-matrices of B and the K sub-matrices of P on the top are used. Algorithm 1 summarizes the Legendre transform algorithm using the block butterfly algorithm. The direct computation part and the butterfly multiplication part each cost $O(KN\log N)$ operations.
Algorithm 1: Block Butterfly Algorithm for Legendre Transform.
The parameters CMAX and EPS required for butterfly matrix compression are still needed in the block butterfly algorithm. CMAX is the number of columns in each sub-matrix on level 0, and EPS is the desired precision of the interpolative decomposition [3]. A dimensional threshold value DIMTHESH [3] is also needed in Legendre transform calls to activate the FLT when the wavenumber (m) is less than or equal to NSMAX-2DIMTHESH+3 (NSMAX is the truncation order). The block butterfly algorithm is equivalent to Tygert’s algorithm (2010) when no block partition is used, so two dimensional threshold values could be introduced to include Tygert’s algorithm (2010) and the LT using DGEMM, further reducing the computational complexity. To facilitate comparison with Tygert’s algorithm, only one dimensional threshold value is used, and it is set to 200 in the rest of the paper.

5. Results

In this section, all tests are performed on the MilkyWay-2 supercomputer (see Liao et al. [31] for more details), which is installed at NUDT. Each compute node has 64 GB of memory. The CPU model name is Intel(R) Xeon(R) CPU E5-2692V2 @2.2GHz, with a private 32 KB L1 instruction cache, a 32 KB L1 data cache, a 256 KB L2 cache and a 30,720 KB L3 cache. The ID software package developed by Martinsson et al. [32] for low-rank approximation of matrices is employed to perform the interpolative decompositions in all tests. The ID package can be downloaded from Mark Tygert’s homepage [33]. Hereafter, the LT using matrix-matrix multiplication, Tygert’s algorithm (2010) and the block butterfly algorithm are referred to as LT0, LT1 and LT2, respectively.
Figure 2, Figure 3 and Figure 4 show the errors of the LT with CMAX = 64 on a log10 scale for EPS = 1.0E-05, EPS = 1.0E-07 and EPS = 1.0E-10, respectively. The abbreviations “MAX ERR” and “RMS ERR” in these figures denote the maximum error and the root-mean-square error, respectively. Both the maximum error and the root-mean-square error of LT2 are about one order of magnitude smaller than those of LT1. The results show that the proposed method is effective in improving the accuracy of the Legendre transform using the butterfly algorithm.
Figure 5 shows the computational time for the different Legendre transform algorithms. The speedup and loss speedup of LT2 with CMAX = 64 are shown in Figure 6 and Figure 7, respectively. The loss speedup, which measures the relative performance penalty, is defined as the speedup of LT1 minus the speedup of LT2, divided by the speedup of LT0. From Figure 5 and Figure 6, LT2 begins to be faster than LT0 when N = 2048 and achieves more than 26%, 22% and 17% reduction in elapsed time for EPS = 1.0E-5, EPS = 1.0E-7 and EPS = 1.0E-10, respectively. LT2 achieves more than 17%, 63%, 75% and 86% reduction in elapsed time for N = 2048, 4096, 8192 and 16,384, respectively. In Figure 7, the loss speedup of LT2 relative to LT1 is less than 21%, 11%, 7% and 4% for N = 2048, 4096, 8192 and 16,384, respectively. Moreover, the loss speedup of LT2 relative to LT1 decreases rapidly as N increases. According to the results of Yin [3], the potential instability of the interpolative decomposition only exists at very high order. The presented method can therefore alleviate the potential instability of the interpolative decomposition at a very small computational cost.
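The loss speedup defined above is simple arithmetic on elapsed times. The sketch below spells it out in Python under one assumption that the text does not state explicitly: each speedup is measured relative to LT0, so the speedup of LT0 itself is 1. The function names are hypothetical and the timings in the usage note are made-up numbers, not results from the paper.

```python
def speedup(t_ref, t_alg):
    """Speedup of an algorithm with elapsed time t_alg relative to a reference time t_ref."""
    return t_ref / t_alg

def loss_speedup(t_lt0, t_lt1, t_lt2):
    """(speedup of LT1 - speedup of LT2) / speedup of LT0, all relative to LT0."""
    s0 = speedup(t_lt0, t_lt0)  # equals 1 under the stated assumption
    s1 = speedup(t_lt0, t_lt1)
    s2 = speedup(t_lt0, t_lt2)
    return (s1 - s2) / s0
```

For instance, with illustrative timings of 10.0 s, 5.0 s and 6.25 s for LT0, LT1 and LT2, the speedups are 2.0 and 1.6, so `loss_speedup(10.0, 5.0, 6.25)` is about 0.4.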
Figure 8 and Figure 9 show the computational time of the LT scaled by $N\log^3 N$ and $N\log^4 N$, respectively. The computational cost of LT2 appears to be slightly higher than that of LT1. The increase is caused by the boundary blocks, which cannot be accurately expressed by the asymptotic formula, and by the internal blocks whose dimension is less than the dimensional threshold value. Although the results for LT2 are larger than those for LT1, LT2 follows a similar trend to LT1 in Figure 8 and Figure 9. This means that LT2 has the same computational complexity, $O(N\log^3 N)$, as LT1.
The Legendre-Vandermonde matrix is divided into boundary blocks and internal blocks. Boundary blocks, which cannot be accurately expressed by the asymptotic formula, cause instability of the interpolative decomposition and are therefore not suitable for it. The matrix-vector multiplication based on the butterfly algorithm is faster than the BLAS function DGEMV only when the matrix dimension is greater than or equal to 512, so internal blocks of lower dimension use direct matrix-vector multiplication instead of the butterfly algorithm. The nonzero elements of the boundary blocks, and of the internal blocks that do not participate in the interpolative decomposition, increase the computational cost compared with Tygert’s algorithm. Nevertheless, through reasonable partitioning, the theoretical computational complexity of the proposed method can reach the optimal $O(N\log^2 N/\log\log N)$.

6. Conclusions

In this paper, a highly accurate and stable Legendre transform algorithm is proposed. A block partitioning based on the asymptotic formula is employed to mitigate the potential instability of the Legendre transform using the butterfly algorithm. The Legendre-Vandermonde matrix is divided into one block $P_N^{REC}(\underline{x}_N^{leg})$ and $K$ sub-matrices $P_N^{(k)}(\underline{x}_N^{leg})$. Instead of the FFT method, the butterfly algorithm is employed to compute $P_N^{(k)}(\underline{x}_N^{leg})\,\underline{c}_N^{leg}$. Numerical results demonstrate that the proposed method improves stability by about one order of magnitude relative to Tygert’s algorithm (2010), while sacrificing less than 7% of the speedup for very high order (N ≥ 4096) Legendre transforms.
Although the computational time of the proposed method is slightly larger than that of Tygert’s algorithm, it has the same computational complexity, $O(N\log^3 N)$, as Tygert’s algorithm. Moreover, the proposed method is equivalent to Tygert’s algorithm when no block partition is used. In NWP applications, an additional dimensional threshold value could be introduced to include Tygert’s algorithm (2010), further reducing the computational complexity.
In the future, we will study more optimal block partitioning methods to improve the computational performance while preserving stability, and will analyze in detail the spherical harmonic transform using the proposed method at very high resolution, especially its performance in reducing potential numerical instability at resolution T7999.

Author Contributions

Conceptualization, F.Y. and J.W.; Formal analysis, F.Y.; Funding acquisition, F.Y.; Methodology, F.Y. and J.S.; Supervision, J.S.; Validation, J.W. and J.Y.; Writing—original draft, F.Y.; Writing—review & editing, J.Y.

Funding

This research was funded by the National Natural Science Foundation of China (Grant 41705078) and partly supported by the National Natural Science Foundation of China (Grants 61379022 and 41605070).

Acknowledgments

The author acknowledges Yingzhou Li (Duke University) and Haizhao Yang (National University of Singapore) for providing ButterflyLab for reference. The author would also like to thank two anonymous reviewers for their insightful and constructive comments, which helped to improve the quality of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Seljebotn, D.S. Wavemoth—Fast spherical harmonic transforms by butterfly matrix compression. Astrophys. J. 2012, 199, 1–12.
  2. Wedi, N.P.; Hamrud, M.; Mozdzynski, G. A Fast Spherical Harmonics Transform for Global NWP and Climate Models. Mon. Weather Rev. 2013, 141, 3450–3461.
  3. Yin, F.; Wu, G.; Wu, J.; Zhao, J.; Song, J. Performance evaluation of the fast spherical harmonic transform algorithm in the yin–he global spectral model. Mon. Weather Rev. 2018, 146, 3163–3182.
  4. Suda, R.; Takami, M. A fast spherical harmonics transform algorithm. Math. Comput. 2002, 71, 703–715.
  5. Kunis, S.; Potts, D. Fast spherical Fourier algorithms. J. Comput. Appl. Math. 2003, 161, 75–98.
  6. Suda, R. Stability analysis of the fast Legendre transform algorithm based on the fast multipole method. In Proceedings of the Estonian Academy of Sciences Physics; Tallinn Book Printers Ltd.: Tallinn, Estonia, 2004; Volume 53, pp. 107–115.
  7. Suda, R. Fast Spherical Harmonic Transform Routine FLTSS Applied to the Shallow Water Test Set. Mon. Weather Rev. 2005, 133, 634–648.
  8. Rokhlin, V.; Tygert, M. Fast Algorithms for Spherical Harmonic Expansions. SIAM J. Sci. Comput. 2005, 27, 1903–1928.
  9. Tygert, M. Recurrence relations and fast algorithms. Appl. Comput. Harmon. Anal. 2006, 28, 121–128.
  10. Tygert, M. Fast algorithms for spherical harmonic expansions, II. J. Comput. Phys. 2008, 227, 4260–4279.
  11. Tygert, M. Short Note: Fast algorithms for spherical harmonic expansions, III. J. Comput. Phys. 2010, 229, 6181–6192.
  12. Michielssen, E.; Boag, A. A multilevel matrix decomposition algorithm for analyzing scattering from large structures. IEEE Trans. Antenn. Propag. 1996, 44, 1086–1093.
  13. O’Neil, M.; Woolfe, F.; Rokhlin, V. An algorithm for the rapid evaluation of special function transforms. Appl. Comput. Harmon. Anal. 2010, 28, 203–226.
  14. Li, Y.; Yang, H. Interpolative Butterfly Factorization. SIAM J. Sci. Comput. 2017, 39, A503–A531.
  15. Wu, J.P.; Zhao, J.; Song, J.Q.; Zhang, W.M. Preliminary design of dynamic framework for global non-hydrostatic spectral model. Comput. Eng. Des. 2011, 32, 3539–3543.
  16. Yang, J.; Song, J.; Wu, J.; Ren, K.; Leng, H. A high-order vertical discretization method for a semi-implicit mass-based non-hydrostatic kernel. Q. J. R. Meteorol. Soc. 2015, 141, 2880–2885.
  17. Yang, J.; Song, J.; Wu, J.; Ying, F.; Peng, J.; Leng, H. A semi-implicit deep-atmosphere spectral dynamical kernel using a hydrostatic-pressure coordinate. Q. J. R. Meteorol. Soc. 2017, 143, 2703–2713.
  18. Cheng, H.; Gimbutas, Z.; Martinsson, P.G.; Rokhlin, V. On the compression of low rank matrices. SIAM J. Sci. Comput. 2005, 26, 1389–1404.
  19. Heitman, Z.; Bremer, J.; Rokhlin, V. On the existence of nonoscillatory phase functions for second order ordinary differential equations in the high-frequency regime. J. Comput. Phys. 2015, 290, 1–27.
  20. Bremer, J.; Rokhlin, V. Improved estimates for nonoscillatory phase functions. Discrete Cont. Dyn.-Am. 2016, 36, 4101–4131.
  21. Bremer, J.; Rokhlin, V. On the nonoscillatory phase function for Legendre’s differential equation. J. Comput. Phys. 2017, 350, 326–342.
  22. Bremer, J.; Yang, H. Fast algorithms for Jacobi expansions via nonoscillatory phase functions. IMA J. Numer. Anal. 2018, arXiv:1803.03889.
  23. Glaser, A.; Liu, X.; Rokhlin, V. A fast algorithm for the calculation of the roots of special functions. SIAM J. Sci. Comput. 2019, 29, 1420–1438.
  24. Bremer, J.; Pang, Q.; Yang, H. Fast Algorithms for the Multi-dimensional Jacobi Polynomial Transform. arXiv 2019, arXiv:1901.07275.
  25. ButterflyLab. Available online: https://github.com/ButterflyLab/ButterflyLab (accessed on 14 August 2019).
  26. Candès, E.; Demanet, L.; Ying, L. A Fast Butterfly Algorithm for the Computation of Fourier Integral Operators. Multiscale Model. Simul. 2009, 7, 1727–1750.
  27. Hale, N.; Townsend, A. A fast, simple, and stable Chebyshev-Legendre transform using an asymptotic formula. SIAM J. Sci. Comput. 2014, 36, 148–167.
  28. Hale, N.; Townsend, A. A fast FFT-based discrete Legendre transform. IMA J. Numer. Anal. 2015, 36, 1670–1684.
  29. Townsend, A.; Webby, M.; Olver, S. Fast polynomial transforms based on Toeplitz and Hankel matrices. Math. Comput. 2018, 87, 1913–1934.
  30. Stieltjes, T.J. Sur les polynômes de Legendre. Ann. Fac. Sci. Toulouse 1890, 4, G1–G17.
  31. Liao, X.; Xiao, L.; Yang, C.; Lu, Y. MilkyWay-2 supercomputer: System and application. Front. Comput. Sci. 2014, 8, 345–356.
  32. Martinsson, P.G.; Rokhlin, V.; Shkolnisky, Y.; Tygert, M. ID: a software package for low rank approximation of matrices via interpolative decompositions, version 0.2. Available online: http://cims.nyu.edu/~tygert/id_doc.pdf (accessed on 4 August 2017).
  33. Mark Tygert’s Homepage. Available online: http://tygert.com/software.html (accessed on 14 August 2019).
Figure 1. Partitioning of the Legendre-Vandermonde matrix for N = 1024. Matrix B denotes the boundary parts, which cannot be accurately expressed by the asymptotic formula, while matrix P denotes the internal parts, which can. There are 2(K + 1) sub-matrices of B and 2K sub-matrices of P. Owing to the symmetry or anti-symmetry of the Legendre polynomials, only the K + 1 sub-matrices of B and the K sub-matrices of P at the top need to be considered.
Figure 2. Errors of LT in log10 form with EPS = 1.0E-05 and CMAX = 64. LT1 is the butterfly algorithm and LT2 is the proposed method. “MAX ERR” and “RMS ERR” denote the maximum error and the root-mean-square error, respectively. EPS is the desired precision of the interpolative decomposition, and CMAX is the number of columns in each sub-matrix on level 0.
Figure 3. Errors of LT in log10 form with EPS = 1.0E-07 and CMAX = 64. LT1 is the butterfly algorithm and LT2 is the proposed method. “MAX ERR” and “RMS ERR” denote the maximum error and the root-mean-square error, respectively. EPS is the desired precision of the interpolative decomposition, and CMAX is the number of columns in each sub-matrix on level 0.
Figure 4. Errors of LT in log10 form with EPS = 1.0E-10 and CMAX = 64. LT1 is the butterfly algorithm and LT2 is the proposed method. “MAX ERR” and “RMS ERR” denote the maximum error and the root-mean-square error, respectively. EPS is the desired precision of the interpolative decomposition, and CMAX is the number of columns in each sub-matrix on level 0.
Figure 5. Computational time, in seconds, of the different Legendre transform algorithms with CMAX = 64. LT0 is the algorithm using DGEMM, LT1 is the butterfly algorithm and LT2 is the proposed method. EPS is the desired precision of the interpolative decomposition, and CMAX is the number of columns in each sub-matrix on level 0.
Figure 6. Speedup of LT1 and LT2 with CMAX = 64 compared with LT0. LT0 is the algorithm using DGEMM, LT1 is the butterfly algorithm and LT2 is the proposed method. EPS is the desired precision of the interpolative decomposition, and CMAX is the number of columns in each sub-matrix on level 0.
Figure 7. Loss of speedup of LT2 with CMAX = 64 compared with LT1. LT1 is the butterfly algorithm and LT2 is the proposed method. EPS is the desired precision of the interpolative decomposition, and CMAX is the number of columns in each sub-matrix on level 0.
Figure 8. Computational time scaled by $N \log^{3} N$ with CMAX = 64. LT1 is the butterfly algorithm and LT2 is the proposed method. EPS is the desired precision of the interpolative decomposition, and CMAX is the number of columns in each sub-matrix on level 0.
Figure 9. Computational time scaled by $N \log^{4} N$ with CMAX = 64. LT1 is the butterfly algorithm and LT2 is the proposed method. EPS is the desired precision of the interpolative decomposition, and CMAX is the number of columns in each sub-matrix on level 0.

Share and Cite

MDPI and ACS Style

Yin, F.; Wu, J.; Song, J.; Yang, J. A High Accurate and Stable Legendre Transform Based on Block Partitioning and Butterfly Algorithm for NWP. Mathematics 2019, 7, 966. https://doi.org/10.3390/math7100966

