Article

Modified Fast Inverse Square Root and Square Root Approximation Algorithms: The Method of Switching Magic Constants

by Leonid V. Moroz 1, Volodymyr V. Samotyy 2,3,* and Oleh Y. Horyachyy 1
1 Information Technologies Security Department, Lviv Polytechnic National University, 79013 Lviv, Ukraine
2 Automation and Information Technologies Department, Cracow University of Technology, 31155 Cracow, Poland
3 Information Security Management Department, Lviv State University of Life Safety, 79007 Lviv, Ukraine
* Author to whom correspondence should be addressed.
Computation 2021, 9(2), 21; https://doi.org/10.3390/computation9020021
Submission received: 24 December 2020 / Revised: 26 January 2021 / Accepted: 10 February 2021 / Published: 17 February 2021
(This article belongs to the Section Computational Engineering)

Abstract: Many low-cost platforms that support floating-point arithmetic, such as microcontrollers and field-programmable gate arrays, do not include fast hardware or software methods for calculating the square root and/or reciprocal square root. Typically, such functions are implemented using direct lookup tables or polynomial approximations, with a subsequent application of the Newton–Raphson method. Other, more complex solutions include high-radix digit-recurrence and bipartite or multipartite table-based methods. In contrast, this article proposes a simple modification of the fast inverse square root method that has high accuracy and relatively low latency. Algorithms are given in C/C++ for single- and double-precision numbers in the IEEE 754 format for both square root and reciprocal square root functions. These are based on the switching of magic constants in the initial approximation, depending on the input interval of the normalized floating-point numbers, in order to minimize the maximum relative error on each subinterval after the first iteration—giving 13 correct bits of the result. Our experimental results show that the proposed algorithms provide a fairly good trade-off between accuracy and latency after two iterations for numbers of type float, and after three iterations for numbers of type double when using fused multiply–add instructions—giving almost complete accuracy.

1. Introduction

The square root ($\sqrt{x}$) and reciprocal of the square root ($1/\sqrt{x}$), also known as the inverse square root, are two relatively common functions used in digital signal processing [1,2,3,4], and are often found in many computer graphics, multimedia, and scientific applications [1]. In particular, they are important for data analysis and processing, solving systems of linear equations, computer vision, and object detection tasks. In view of this, most current processors allow the use of the appropriate software (SW) functions and multimedia hardware (HW) instructions in both single (SP) and double (DP) precision.
Let us formalize the problem of calculating the function $y_{\sqrt{x}} = x^{1/2}$ in floating-point (FP) arithmetic. We consider an input argument $x$ to be a normalized $n$-bit value with a $p$-bit mantissa, which satisfies the condition $x \geq 0$. Similarly, for the function $y_{1/\sqrt{x}} = x^{-1/2}$, $x > 0$. For many practical applications, where the input data are assumed to be obtained with some error (e.g., read from sensors) or where computation speed is preferable to accuracy (e.g., in 3D graphics and real-time computer vision), an approximate square root calculation may be sufficient. On the other hand, there is ongoing debate as to whether the $\sqrt{x}$ function should be correctly rounded for all input FP numbers according to the IEEE 754 standard.
Compilers offer a built-in sqrt(x) function for various types of FP data (float, double, and long double) [5]. Although the common SW implementation of this function provides high accuracy, it is very slow. On the other hand, HW instruction sets are specific to particular processors or microprocessors. Computers and microcontrollers can use HW floating-point units (FPUs) in order to work effectively with FP numbers and for the fast calculation of some standard mathematical functions [6,7]. Typically, instructions are available for a square root with SP or DP (float and double), and estimation instructions are available for the reciprocal square root with various accuracies (usually 12 bits in SP, although there are also intrinsics that provide 8, 14, 23, or 28 bits). For example, for SP numbers, the Intel SSE HW instruction RSQRTSS has an accuracy of 11.42 bits, the Intel AVX instructions VRSQRT14SS and VRSQRT28SS provide 14 and 23.4 correct bits, respectively, and the ARM NEON instruction FRSQRTE gives 8.25 correct bits of the result. An overview of the basic characteristics of these instructions on different modern machines can be found in [8,9,10,11]. However, similar instructions are not available for many low-cost platforms, such as microcontrollers and field-programmable gate arrays (FPGAs) [12,13]. For such devices, we require simple and efficient SW/HW methods of approximating square root functions.
Usually, the HW-implemented operation of calculating the square root in FP arithmetic requires 3–10 times more processor cycles than multiplication and addition [9]. Therefore, for such applications as video games, complex matrix operations, and real-time control systems, these functions may become a bottleneck that hampers application performance. According to the analysis of all CPU performance bottlenecks (the second table in [14]), the square root calculation function is high on the list and first among the FP operations. As a result, a large number of competing implementations of such algorithms, which make various trade-offs in terms of the accuracy and speed of the approximation, have been developed.
These elementary function algorithms—and especially those for calculating $\sqrt{x}$ and $1/\sqrt{x}$—can be divided into several classes [2,15,16,17]: digit-recurrence (shift-and-add) methods [2,16,18], iterative methods [16,19], polynomial methods [4,16,17,20], rational methods [16], table-based methods [2,21,22,23], bit-manipulation techniques [24,25,26], and their combinations.
Digit-recurrence, table-based, and bit-manipulation methods have the advantage of using only simple and fast operations, such as addition/subtraction, bit shift, and table lookup. Compared to polynomial and iterative methods, they are therefore more suitable for implementation on devices without HW FP multiplication support (e.g., calculators, some microcontrollers, and FPGAs). However, since digit-recurrence algorithms have linear convergence, they are very slow in terms of SW implementation. Tabular methods are very fast but require a large area (memory), since the size of the table grows exponentially with an increase in the precision required. This may pose a problem not only for small devices such as microcontrollers, but also to a lesser extent for FPGAs. The use of direct lookup tables (LUTs) is less practical than combining several smaller LUTs with addition and multiplication operations. Bit-manipulation techniques are fast in both HW and SW, but have very limited accuracy—about 5–6 bits [15]. They are based on the peculiarities of the in-memory binary representation formats of integer and FP numbers.
In computer systems with fast HW multiplication instructions, iterative and polynomial methods can be efficient. Iterative approaches such as Newton–Raphson (NR) and Goldschmidt’s method have quadratic convergence but require a good initial approximation—in general, polynomial, table-based, or bit-manipulation techniques are used. Polynomial methods of high order rely heavily on multiplications and need to store polynomial coefficients in memory; they also require a range reduction step. This makes them less suitable for the calculation of $\sqrt{x}$ and $1/\sqrt{x}$ than iterative methods, especially for HW implementation. Compared to the polynomial method, a rational approximation is not efficient if there is no fast FP division operation; however, for some elementary functions, such as $\tan(x)$ and $\sqrt{x}$, it can give more accurate results.
Most existing research studies on calculating the $1/\sqrt{x}$ and $\sqrt{x}$ functions have focused on HW implementation in FPGAs. They use LUTs or polynomial approximation, and if more accurate results are required, iterative methods are subsequently applied. In this paper, we consider a modification of a bit-manipulation technique called the fast inverse square root (FISR) method for the approximate calculation of these functions with high accuracy, without using large LUTs or divisions. This work proposes a novel approach that combines the FISR method with a modified NR method and uses two different magic constants for the initial approximation, depending on the input subinterval. On each of the two subintervals, the values of the magic constants are chosen to minimize the maximum relative error after the first iteration. This can be considered a fairly accurate and fast initial approximation for other iterative methods such as NR or modified NR. This method can be effectively implemented on microcontrollers and FPGAs that support FP calculations in HW. However, in this paper, we focus mostly on the fast SW implementation of the method, in particular on microcontrollers. In a HW implementation, a tiny 1-bit LUT can be used to determine the magic constant and two other parameters of the basic algorithm.
Among the better-known methods of calculating the square root and reciprocal square root [1,2,15,20], the FISR method [20,25,26,27,28,29] has recently gained increasing popularity in SW [8,20,27,29,30,31,32] and HW [3,33,34,35,36,37] applications. The algorithm was proposed for the first time in [24] but gained wider popularity through its use in the computer game Quake III Arena [27]. Its attraction lies in its very simple and rapid way of obtaining a fairly accurate initial approximation of the function $y_0 \approx 1/\sqrt{x}$—almost 5 bits—without using multiplications or a LUT. It uses integer subtraction and bit shifting, and combines these with switching between two different ways of interpreting the binary data: as an FP or an integer number. In addition, fewer hardware resources are used when this is implemented with an FPGA.
We denote by $\iota(x)$ an integer that has the same binary representation as an FP number $x$, and by $\varphi(i)$ an FP number that has the same binary representation as an integer $i$. The main idea behind FISR is as follows. If the FP number $x$, given in the IEEE 754 standard, is represented as an integer $\iota(x)$, then it can be considered a coarse, biased, and scaled approximation of the binary logarithm $\log_2(x)$. This integer is divided by two, its sign is changed, and it is then translated back into the IEEE 754 format as an FP number with the same binary representation, $y_0 = \varphi(-\iota(x)/2)$. The method introduces a magic constant $R$ to take into account the bias and reduce the approximation error. This gives an initial approximation for the function $y = 1/\sqrt{x}$, which is then further refined with the help of NR iterations.
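As an illustration, the two reinterpretations $\iota$ and $\varphi$ and the raw FISR guess can be written as follows. This is a minimal sketch in which the helper names iota and phi mirror the paper's notation, memcpy is used instead of the historical pointer casts (type punning through casts is undefined behavior in standard C++), and the classic constant from [27] is assumed:

#include <cstdint>
#include <cstring>

uint32_t iota(float x) {               // ι(x): the bits of x read as an integer
    uint32_t i;
    std::memcpy(&i, &x, sizeof i);
    return i;
}

float phi(uint32_t i) {                // φ(i): the bits of i read as a float
    float x;
    std::memcpy(&x, &i, sizeof x);
    return x;
}

float FisrGuess(float x) {             // y0 = φ(R − ι(x)/2), roughly 5 correct bits
    return phi(0x5f3759df - (iota(x) >> 1));
}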
The NR method is the most commonly used iterative method; it is characterized by a quadratic convergence rate and has the property of self-correcting errors [19]. Quadratic convergence in an iterative method means that the method roughly doubles the number of exact bits in the result after each iteration. If we apply the NR method directly to find the square root, we obtain the formula known as Heron’s formula. The disadvantage of this approach is the need to perform an FP division operation at each iteration, which is rather complex and has high latency [1,9,18]. An alternative way of calculating the square root is to use the NR method to find the root of the equation $f(y) = 1/y^2 - x = 0$ for a function $y = 1/\sqrt{x}$. This formula has the form:

$$y_{i+1} = \tfrac{1}{2} y_i \left( 3 - x y_i^2 \right). \qquad (1)$$
Using this approach, it is necessary to multiply the final result of the iteration method in Equation (1) by $x$ to get the approximate square root of the input number $x$. The main feature of iterative methods is the need to select an initial value $y_0$; as a rule, the better this initial approximation, the lower the number of subsequent iterations needed to obtain the required accuracy for the calculations. However, we will show later that this is not always the case.
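To make the convergence rate concrete, the following small sketch applies Equation (1) to $x = 2$ starting from a deliberately coarse guess of roughly FISR quality (the starting value 0.72 is illustrative); the relative error drops quadratically at each step until it reaches the double-precision floor:

#include <cmath>
#include <cstdio>

int main() {
    double x = 2.0;
    double y = 0.72;                          // coarse guess for 1/sqrt(2) = 0.7071...
    for (int i = 1; i <= 4; ++i) {
        y = 0.5 * y * (3.0 - x * y * y);      // Equation (1)
        std::printf("iteration %d: relative error = %.3e\n",
                    i, std::fabs(y * std::sqrt(x) - 1.0));
    }
    return 0;
}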
The purpose of this paper is to present a modified FISR method based on the switching of magic constants. This method is characterized by increased accuracy of calculations—13.71 correct bits after the first iteration—with low overhead compared to the known FISR-based approximation algorithms described below in Section 2. We also provide the code for the corresponding optimized algorithms for calculating the square root and reciprocal square root, which use this method for the initial approximation. These algorithms work for normalized single- and double-precision numbers in the IEEE 754 format and provide different accuracy depending on the number of iterations used.
Functions that comply with the IEEE 754 standard—assuming that the chosen rounding mode is round-to-nearest—return the FP value that is closest to the exact result (the error is less than 0.5 units in the last place, or ulp). By contrast, the proposed method keeps the numerical error of the algorithms for the float and double types within the least significant bit (1 ulp) when the fused multiply–add function is used.
The rest of the paper is organized as follows: Section 2 introduces the well-known FISR-based algorithms and basic theoretical concepts of the FISR method. In Section 3, the main idea of the proposed method of switching magic constants is presented, and the corresponding algorithms are given. Section 4 contains the experimental results on microcontrollers and a discussion. Finally, the conclusions are presented in Section 5.

2. Related Work

2.1. State-of-the-Art FISR Algorithms

FISR is most commonly used in its classic version. In this case, the initial approximation $y_0$ is calculated using the magic constants “0x5f3759df” [27] or “0x5f375a86” [25], and one or two clarifying NR iterations are performed using the standard Equation (1) [8,20,25,27,28,33,34,35,36]. As Lomont noted, the best initial guess “0x5f37642f” [25] does not guarantee maximum accuracy for the subsequent NR iterations. The theoretical analysis of Algorithm 1 in [28] mathematically confirmed the optimal values of the magic constants “0x5f37642f” and “0x5f375a86”. The code for this classic FISR algorithm in C/C++ is given below in Algorithm 1. The maximum relative error of this algorithm with two NR iterations is less than $4.74 \times 10^{-6}$. This value corresponds to an accuracy of $-\log_2(4.74 \times 10^{-6}) = 17.69$ correct bits of the result. Note that, after the first iteration, the accuracy is 9.16 bits.
Algorithm 1. The classic Lomont algorithm [25] for calculating the reciprocal square root.
1: float RcpSqrt1 (float x)
2: {
3:  float xhalf = 0.5f*x;
4:  int i = *(int*)&x;      // represent float as an integer, ι(x)
5:  i = 0x5f375a86 - (i >> 1);  // integer division by two and change of sign
6:  float y = *(float*)&i;    // represent integer as a float, φ(i)
                // initial approximation y0
7:  y = y*(1.5f - xhalf*y*y);   // first NR iteration
8:  y = y*(1.5f - xhalf*y*y);   // second NR iteration
9:  return y;
10: }
In [29], an algorithm with increased accuracy and a maximum relative error of $7.37 \times 10^{-7}$ (20.37 correct bits) was proposed. In this approach, a simple additive modification of the iterative NR formula is used, and the magic constant of the algorithm is changed, making it possible to reduce the maximum relative error by a factor of more than seven. Algorithm 2 gives the code for this algorithm. The accuracy of the RcpSqrt2 algorithm after the first iteration is 10.15 bits. Another similar algorithm from [29] has even better accuracy (20.97 bits) but requires one extra multiplication compared to Algorithms 1 and 2.
Algorithm 2. Method proposed by Walczyk et al. [29] for the reciprocal square root.
1: float RcpSqrt2 (float x)
2: {
3:  float xhalf = 0.5f*x;
4:  int i = *(int*)&x;
5:  i = 0x5f376908 - (i >> 1);
6:  float y = *(float*)&i;
7:  y = y*(1.50087896f - xhalf*y*y);
8:  y = y*(1.50000057f - xhalf*y*y);
9:  return y;
10: }
Our other recent modifications of the FISR algorithm are discussed in [26] and [32]. In both papers, besides the NR method, a modified second-order Householder method is used to improve the accuracy of the algorithms—in the former case, in the last iteration of the algorithm, and in the latter case, in the first iteration. This method has cubic convergence and requires one additional multiplication and subtraction compared to the NR method. As a result, these algorithms can already provide accuracy up to the last bit (more than 23 bits in SP) when using fused multiply–add functions. However, such extra FP operations are very expensive, especially in HW. Therefore, in this paper, we investigate an alternative way to further increase the accuracy of FISR-based algorithms, using only modified constants and NR iterations together with a split of the input interval. Moreover, in [26], we suggested a method for reducing the number of FP multiplications in the algorithm, which allows some operations on the exponent to be performed using integer subtractions.

2.2. Brief Theory of the FISR Method

Here, we briefly outline the main concepts of the basic FISR method used in the paper, which are defined in [28,29]. Suppose that we have a positive normalized FP number

$$x = (1 + m_x) \cdot 2^{E_x}. \qquad (2)$$
We consider numbers of single (binary32/type float) and double (binary64/type double) precision, according to the IEEE 754 standard. In this standard, an SP FP number $x$ is encoded by 32 bits, $n = 32$ ($n = 64$ for DP). The first bit corresponds to a sign (in our case, this bit is equal to zero), while the next eight bits (11 bits for DP) correspond to an exponent

$$E_x = \lfloor \log_2(x) \rfloor, \qquad (3)$$

which is an integer stored in a biased form. The last 23 bits, $p = 23$ ($p = 52$ for DP), encode a fractional part of the mantissa

$$m_x = x \cdot 2^{-E_x} - 1, \qquad (4)$$
$m_x \in [0, 1)$. The integer representation of this value, $\iota(x)$ (see Algorithm 1, line 4), denoted by $I_x$, is given by

$$I_x = \iota(x) = (bias + E_x + m_x) \cdot N_m, \qquad (5)$$
where $N_m = 2^{23}$, $bias = 127$ for SP and $N_m = 2^{52}$, $bias = 1023$ for DP. Then, line 5 of Algorithm 1 can be written as:

$$I_{y_0} = R - \lfloor I_x/2 \rfloor. \qquad (6)$$
The result $I_{y_0}$ of subtracting the integer $\lfloor I_x/2 \rfloor$ from the magic constant $R$ and representing the integer $I_{y_0}$ again as a float (see Algorithm 1, line 6) gives the initial (zeroth) piecewise linear approximation $y_0$ of the function $y = 1/\sqrt{x}$, where

$$y_0 = \varphi(I_{y_0}) = \left( 1 + I_{y_0}/N_m - \lfloor I_{y_0}/N_m \rfloor \right) \cdot 2^{\lfloor I_{y_0}/N_m \rfloor - bias}. \qquad (7)$$
Lines 7 and 8 of Algorithm 1 (RcpSqrt1) define the NR iterations

$$y_{i+1} = y_i (1.5 - 0.5 x y_i y_i), \quad i = 0, 1. \qquad (8)$$
As proved in [28,29], in order to find the behavior of the relative error when calculating $y_0$ over the whole range of normalized FP numbers, it is sufficient to describe it in the range $x \in [1, 4)$. The initial approximation $y_0$ has three piecewise linear subintervals in this range. According to [28], the analytical approximations that define $y_0$ can be written as:

$$y_{01} = -\tfrac{1}{4} x + 1 + \tfrac{1}{2} m_R + \tfrac{1}{4 N_m}, \quad x \in [1, 2) \qquad (9)$$
$$y_{02} = -\tfrac{1}{8} x + \tfrac{3}{4} + \tfrac{1}{2} m_R + \tfrac{1}{4 N_m}, \quad x \in [2, t) \qquad (10)$$
$$y_{03} = -\tfrac{1}{16} x + \tfrac{5}{8} + \tfrac{1}{4} m_R + \tfrac{1}{8 N_m}, \quad x \in [t, 4). \qquad (11)$$
Here, $m_R$ is the fractional part of the mantissa of the magic constant $R$, defined as:

$$m_R = R/N_m - \lfloor R/N_m \rfloor, \qquad (12)$$

and

$$t = 2 + 4 m_R + 2/N_m. \qquad (13)$$
As proved in [28], the relative error of this analytic model of the FISR method does not exceed $1/(2 N_m)$.
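This bound is easy to probe numerically. The sketch below compares the bit-level $y_0$ with the model (9)–(13) for the classic constant R = 0x5f3759df; it assumes a constant of this form (exponent field 0xBE and $m_R < 1/2$, so that the three-piece structure above holds), reads the bound as the deviation between the model and the bit-exact $y_0$, and uses a step size of our choosing:

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    const uint32_t R = 0x5f3759df;
    const double Nm = 8388608.0;                    // 2^23
    const double mR = (R & 0x7fffff) / Nm;          // Equation (12)
    const double t  = 2.0 + 4.0 * mR + 2.0 / Nm;    // Equation (13)
    const float step = 1.0f / 4096.0f;
    double worst = 0.0;
    for (float x = 1.0f; x < 4.0f; x += step) {
        uint32_t i;
        std::memcpy(&i, &x, sizeof i);
        i = R - (i >> 1);                           // bit-level y0
        float y0;
        std::memcpy(&y0, &i, sizeof y0);
        double m;                                   // analytic model, (9)-(11)
        if (x < 2.0f)    m = -x / 4.0  + 1.0   + mR / 2.0 + 1.0 / (4.0 * Nm);
        else if (x < t)  m = -x / 8.0  + 0.75  + mR / 2.0 + 1.0 / (4.0 * Nm);
        else             m = -x / 16.0 + 0.625 + mR / 4.0 + 1.0 / (8.0 * Nm);
        worst = std::fmax(worst, std::fabs(m / (double)y0 - 1.0));
    }
    std::printf("max model deviation: %.3e (bound 1/(2*Nm) = %.3e)\n",
                worst, 1.0 / (2.0 * Nm));
    return 0;
}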

3. Method of Switching Magic Constants

The main idea of the proposed method is to split the interval $x \in [1, 4)$ of the initial approximation $y_0$ (see Figure 1a), on which it has different error values, into two parts—$x \in [1, 2)$ and $x \in [2, 4)$—and to perform an approximation of the reciprocal square root function separately on these subintervals. The variable $x$, as defined in Equation (2), then has different values of the last bit of the exponent $E_x$—zero in the first case and one in the second. We split the interval only for the initial approximation $y_0$—where the magic constant $R$ is used—and for the corresponding modified first iteration of the FISR method. As shown below, this technique allows us to reduce the maximum relative error of the algorithm after the first iteration by an order of magnitude compared to Algorithm 2 (RcpSqrt2).
Let us now consider this method in more detail. We require that, on the first and second parts of the interval $x \in [1, 2)$, the relative errors of two adjacent piecewise linear initial approximations $y_{01}$ and $y_{02}$ have the same scope—the same difference between the maximum and minimum values—as shown in Figure 2a. From this, it follows that the relative errors of the approximations $y_{01}$, $x \in [1, t_1)$, and $y_{02}$, $x \in [t_1, 2)$, have a similar symmetrical nature with respect to some common value at a point $x = t_1$. We will now find the value of the magic constant that gives the corresponding equations for $y_{01}$ and $y_{02}$. To do this, we write the analytical equations for the approximations $y_{01}$ and $y_{02}$ based on Equations (9)–(13) and the results of [28,29]:

$$y_{01} = -\tfrac{1}{4} x + \tfrac{1}{2} + \tfrac{1}{2} m_R + \tfrac{1}{4 N_m}, \quad x \in [1, t_1) \qquad (14)$$
$$y_{02} = -\tfrac{1}{8} x + \tfrac{1}{2} + \tfrac{1}{4} m_R + \tfrac{1}{8 N_m}, \quad x \in [t_1, 2), \qquad (15)$$

where, in this case,

$$t_1 = 2 m_R + 1/N_m. \qquad (16)$$
Note also that the third linear approximation, which is not used in the result, has the form

$$y_{03} = -\tfrac{1}{16} x + \tfrac{3}{8} + \tfrac{1}{4} m_R + \tfrac{1}{8 N_m}, \quad x \in [2, 4). \qquad (17)$$
Hence, the expressions for the relative errors are

$$\delta_{01} = y_{01} \sqrt{x} - 1 = -\tfrac{1}{4} x^{3/2} + \tfrac{1}{2} x^{1/2} + \tfrac{1}{2} x^{1/2} m_R + \tfrac{1}{4 N_m} x^{1/2} - 1 \qquad (18)$$
$$\delta_{02} = y_{02} \sqrt{x} - 1 = -\tfrac{1}{8} x^{3/2} + \tfrac{1}{2} x^{1/2} + \tfrac{1}{4} x^{1/2} m_R + \tfrac{1}{8 N_m} x^{1/2} - 1. \qquad (19)$$

Having found the points of maxima and contact, we can determine the value of $m_R$:

$$m_R = 0.70241439342498779296875 \qquad (20)$$

and the corresponding value of the magic constant $R$ for SP numbers:

$$R_1 = \text{0x5ed9e8b7}. \qquad (21)$$
Let us now examine the interval $x \in [2, 4)$. In the same way, we require that, for the second and third parts, the relative errors of two adjacent piecewise linear initial approximations $y_{02}$ and $y_{03}$ have the same scope (see Figure 2b). In this case,

$$y_{02} = -\tfrac{1}{8} x + \tfrac{3}{4} + \tfrac{1}{2} m_R + \tfrac{1}{4 N_m}, \quad x \in [2, t_2) \qquad (22)$$
$$y_{03} = -\tfrac{1}{16} x + \tfrac{5}{8} + \tfrac{1}{4} m_R + \tfrac{1}{8 N_m}, \quad x \in [t_2, 4). \qquad (23)$$

Note that, at the same time,

$$t_2 = 2 + 4 m_R + 2/N_m \qquad (24)$$

and the first linear approximation, not relevant for the result, has the form:

$$y_{01} = -\tfrac{1}{4} x + 1 + \tfrac{1}{2} m_R + \tfrac{1}{4 N_m}, \quad x \in [1, 2). \qquad (25)$$

Hence, we find that

$$m_R = 0.20241439342498779296875 \qquad (26)$$
$$R_2 = \text{0x5f19e8b7}. \qquad (27)$$
As a result, the combined approach with magic constants $R_1$ (21) and $R_2$ (27) on two different subintervals gives us an opportunity to obtain the piecewise linear initial approximation $y_0$ on the interval $x \in [1, 4)$, as shown in Figure 1b. This can potentially offer a better approximation of the function $y = 1/\sqrt{x}$, but only after some alignment (bias at each subinterval). For comparison, both of the magic constants used in the algorithms RcpSqrt1 and RcpSqrt2 give the initial approximation described in Equations (9)–(13) (see Figure 1a). Although the basic FISR method gives a much more accurate approximation at this stage, our method has four piecewise linear sections of the approximation $y_0$ rather than three, providing higher accuracy after the first iteration (see Figure 3). This trick is possible due to the modification of the first NR iteration on each of the subintervals. In other words, we align the corresponding errors of the initial approximation as described below.
For each subinterval, we modify the first NR iteration according to Equation (1) as follows:

$$y_1 = k_1 y_0 (k_2 - x y_0 y_0), \qquad (28)$$

where $k_1$ and $k_2$ are FP constants that minimize the maximum relative error of the algorithm and depend on the value of the magic constant $R$. This modified iteration also involves four multiplications.
In order to determine the subinterval in the IEEE 754 format to which $x$ belongs—$x \in [1, 2)$ or $x \in [2, 4)$—we check the least significant bit (LSB) of the biased exponent

$$e_x = E_x + bias. \qquad (29)$$
In the SW implementation, we apply a bit mask to x . Then, we choose the magic constant and the first modified NR iteration that correspond to this value. We call this technique the method of switching magic constants or the dynamic constants (DC) method.
It should be noted that this method can be generalized to more constants. In this case, each of the indicated subintervals is further divided into two, four, eight, etc., equal parts, depending on the value of one or more most significant bits of the fractional part of the mantissa $m_x$, and the bit mask is changed accordingly (a hypothetical sketch of the selection logic is given below). However, in general, it cannot be guaranteed that such a partition will significantly improve the accuracy of the algorithm in all the parts; therefore, some parts may need to be divided further.
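To make the selection logic concrete, a hypothetical four-constant variant could index its parameters by the exponent LSB together with the most significant mantissa bit, as sketched below. Note that the table values shown are mere placeholders (the two constants of Algorithm 3, duplicated, so the sketch simply reproduces RcpSqrt31f); each entry would have to be re-optimized as in Section 3 before such a variant could be useful:

#include <cmath>
#include <cstdint>
#include <cstring>

float RcpSqrt3_4c(float x) {
    uint32_t i;
    std::memcpy(&i, &x, sizeof i);
    // idx bit 1: LSB of the biased exponent (bit 23); idx bit 0: MSB of the mantissa (bit 22)
    unsigned idx = (i >> 22) & 3u;
    // placeholders only: entries for the same exponent LSB repeat the constants of
    // Algorithm 3 and would each need separate optimization
    static const uint32_t R[4]  = {0x5f19e8fc, 0x5f19e8fc, 0x5ed9e91f, 0x5ed9e91f};
    static const float    k1[4] = {0.824218631f, 0.824218631f, 2.33124256f, 2.33124256f};
    static const float    k2[4] = {2.1499474f, 2.1499474f, 1.0749737f, 1.0749737f};
    i = R[idx] - (i >> 1);
    float y;
    std::memcpy(&y, &i, sizeof y);
    return k1[idx] * y * std::fmaf(-x, y * y, k2[idx]);
}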
The general structure of the proposed SW RcpSqrt3 algorithms for two magic constants, with the modified FISR initial approximation $y_0$, the first modified iteration (28), and a branching statement (comparison with zero) using bit masking (bitwise AND), is shown in Template 1. Here, the input argument x has type <fp_type>, which can be float, double, etc., and <int_type> is the corresponding integer type. The names of the algorithms for the reciprocal square root are constructed according to the template RcpSqrt3<iter><version><fp_type_abbr>, where <iter> is the number of iterations used in the algorithm, <version> is an optional index, and <fp_type_abbr> indicates the required FP data type. This also applies to the square root calculation algorithms, which we denote as Sqrt3. The parameters $R_1$ and $R_2$ are integer magic constants; $k_{11}$, $k_{12}$, $k_{21}$, and $k_{22}$ are FP constants, which are defined later. When implemented in HW, a small 1-bit LUT can be used to choose appropriate values for the parameters $R$, $k_1$, and $k_2$.
Template 1. Basic structure of the proposed DC algorithms for the reciprocal square root.
1: <fp_type> RcpSqrt3<iter><version><fp_type_abbr> [
      <int_type> R1, <int_type> R2,
      <fp_type> k11, <fp_type> k12,
      <fp_type> k21, <fp_type> k22 ] (<fp_type> x)
2: {
3:  <int_type> i = *(<int_type>*)&x;  // x—input argument
4:  <int_type> k = i & <ex_mask>;    // binary mask on the LSB of e_x
5:  <fp_type> y;
6:  if (k != 0) {
7:   i = R1 - (i >> 1);        // R1—first magic constant
8:   y = *(<fp_type>*)&i;      // approximation y0
9:   y = k11*y*(k12 - x*y*y);    // first modified NR iteration
10:  } else {
11:   i = R2 - (i >> 1);       // R2—second magic constant
12:   y = *(<fp_type>*)&i;      // approximation y0
13:   y = k21*y*(k22 - x*y*y);    // first modified NR iteration
14:  }                // DC initial approximation y1
15:  …                // subsequent NR or modified NR iterations
16:  return y;            // output y_<iter>
17: }
The proposed method can be thought of as a relatively accurate and fast initial guess y 1 —the DC initial approximation—for other iterative algorithms (see Template 1, lines 14–16). In this paper, we consider modified NR iterations written in a special form, with combined multiply–add operations.
In the remainder of this section, we present the final (ready-to-use) C/C++ codes for the proposed algorithms and give their errors (accuracy results).
Note that, when determining the relative errors of these algorithms (here, taking the reciprocal square root function as an example), we used the following notation for the upper and lower limits of the maximum relative error, respectively:

$$\delta_{max}^{+} = \max_{x \in [1, 4)} \left( y \sqrt{x} - 1 \right) \qquad (30)$$
$$\delta_{max}^{-} = \min_{x \in [1, 4)} \left( y \sqrt{x} - 1 \right). \qquad (31)$$

Alternatively, without taking into account the sign of the error, the maximum relative error was determined using the formula

$$\delta_{max} = \max_{x \in [1, 4)} \left| y \sqrt{x} - 1 \right| = \max \{ |\delta_{max}^{-}|, |\delta_{max}^{+}| \}. \qquad (32)$$

To iterate over all possible float values in the interval $x \in [1, 4)$, we used the nextafterf(x, d) function from the cmath library. For the case of type double, we traversed this interval with a small step—about $1 \times 10^{-12}$. As a reference implementation, we used a higher-precision sqrt(x) or sqrtl(x) function from this library. The number of accurate digits in the result—the accuracy of the algorithm—was determined in bits by the formula

$$\alpha = -\log_2(\delta_{max}). \qquad (33)$$
Error measurements for the algorithms were performed on a quad-core Intel Core i7-7700HQ processor using a GNU compiler (GCC 4.9.2) for C++ on a Windows 10 (64-bit) operating system with the following options: -std=c++11 -Os -ffp-contract=on -mfma.
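For reference, a minimal sketch of the exhaustive SP error scan we describe (here with the double-precision library sqrt as the reference and Algorithm 3, defined below, as the function under test) could look as follows:

#include <algorithm>
#include <cmath>
#include <cstdio>

float RcpSqrt31f(float x);   // the function under test (Algorithm 3)

int main() {
    double dmax_pos = 0.0, dmax_neg = 0.0;
    for (float x = 1.0f; x < 4.0f; x = std::nextafterf(x, 4.0f)) {
        double ref = 1.0 / std::sqrt((double)x);        // higher-precision reference
        double rel = (double)RcpSqrt31f(x) / ref - 1.0; // relative error y*sqrt(x) - 1
        dmax_pos = std::max(dmax_pos, rel);             // Equation (30)
        dmax_neg = std::min(dmax_neg, rel);             // Equation (31)
    }
    double dmax = std::max(dmax_pos, -dmax_neg);        // Equation (32)
    std::printf("d+ = %.6e, d- = %.6e, accuracy = %.2f bits\n",
                dmax_pos, dmax_neg, -std::log2(dmax));  // Equation (33)
    return 0;
}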

3.1. SP Reciprocal Square Root (RcpSqrt3 for Float)

3.1.1. One Iteration—The DC Initial Approximation

For the interval $x \in [1, 2)$ and the theoretically determined magic constant for SP, $R_1$ (21), the unknown theoretical coefficients $k_1$ and $k_2$ in Equation (28) that minimize the maximum relative error $\delta_{max}$ after the first iteration are

$$k_{11} = 2.3312425, \quad k_{12} = 1.07497365. \qquad (34)$$

Similarly, for $x \in [2, 4)$ and the magic constant $R_2$ (27),

$$k_{21} = 0.8242186, \quad k_{22} = 2.1499476. \qquad (35)$$

An implementation of the algorithm with these parameters shows the maxima of the relative errors

$$\delta_{max}^{+} = 7.462402 \times 10^{-5}, \quad \delta_{max}^{-} = -7.459646 \times 10^{-5}. \qquad (36)$$
If a computing platform has a fast HW or SW implementation of the fused multiply–add (fma) function, $\mathrm{fma}(a, b, c) = ab + c$, then in Template 1, iteration (28) can be written as

$$y_1 = k_1 y_0 \cdot \mathrm{fma}(-x, y_0 y_0, k_2). \qquad (37)$$

On some platforms, when implemented in HW, this function can increase both the speed and the accuracy of the algorithms. The fma operation has fewer roundings and much higher precision for the internal calculations. In the remainder of this section, unless otherwise specified, we use the fma function in all algorithms.
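The effect of the single rounding inside fma can be seen with a small sketch; the operand values below are our choices, picked so that the separately rounded product cancels exactly, and the contrast only appears when the compiler is prevented from contracting the first expression itself (e.g., with -ffp-contract=off):

#include <cmath>
#include <cstdio>

int main() {
    float a = 1.0f + 1.0f / 4096.0f;       // 1 + 2^-12, exactly representable
    float c = -(1.0f + 1.0f / 2048.0f);    // -(1 + 2^-11)
    // a*a = 1 + 2^-11 + 2^-24; rounding the product to float loses the 2^-24 term
    std::printf("a*a + c       = %g\n", a * a + c);           // prints 0
    std::printf("fmaf(a, a, c) = %g\n", std::fmaf(a, a, c));  // prints 2^-24 = 5.96e-08
    return 0;
}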
Taking into account the rounding errors and the issue of the best practical representation of the theoretical parameters in the target SP FP format, we can improve our theoretical parameters $R_1$, $R_2$, $k_{11}$, $k_{12}$, $k_{21}$, and $k_{22}$ (given in (21), (27), (34), and (35)). A brute-force optimization method was used in a certain neighborhood of the defined theoretical parameters to minimize the maximum relative error of the algorithm on each of the subintervals. This approach also includes elements of randomized multidimensional greedy optimization for a coarse search. In this case, the three parameters $R$, $k_1$, and $k_2$ are optimized simultaneously. The method is described in more detail in [38].
Algorithm 3 below gives the proposed improved RcpSqrt31f algorithm for SP numbers with one iteration. This algorithm provides slightly lower values for the maximum relative errors:

$$\delta_{max}^{+} = 7.459289 \times 10^{-5}, \quad \delta_{max}^{-} = -7.450387 \times 10^{-5}. \qquad (38)$$
Algorithm 3. Proposed RcpSqrt31f algorithm (DC initial approximation).
1: float RcpSqrt31f (float x)
2: {
3:  int i = *(int*)&x;
4:  int k = i & 0x00800000;
5:  float y;
6:  if (k != 0) {
7:   i = 0x5ed9e91f - (i >> 1);
8:   y = *(float*)&i;
9:   y = 2.33124256f*y*fmaf(-x, y*y, 1.0749737f);
10:  } else {
11:   i = 0x5f19e8fc - (i >> 1);
12:   y = *(float*)&i;
13:   y = 0.824218631f*y*fmaf(-x, y*y, 2.1499474f);
14:  }
15:  return y;
16: }
Graphs of the relative errors of the RcpSqrt2 and RcpSqrt31f algorithms after the first iteration are shown in Figure 3. Numerical experiments show that the maximum relative error of the RcpSqrt2 algorithm after the first iteration is $\delta_{max} = 8.792 \times 10^{-4}$, corresponding to 10.15 correct bits of the result, while for our algorithm, from (38), $\delta_{max} = 7.459 \times 10^{-5}$, providing 13.71 correct bits. Consequently, the error is reduced by a factor of more than 11.7.

3.1.2. Two Iterations

To increase the accuracy of the RcpSqrt31f algorithm described above, it is possible to apply an additional NR iteration over the entire range $x \in [1, 4)$. The second iteration is common to both subintervals, uses the fma function, and has the specific form

$$c_1 = x y_1, \quad r_1 = \mathrm{fma}(y_1, -c_1, k_3), \quad y_2 = \mathrm{fma}(k_4 y_1, r_1, y_1), \qquad (39)$$

where, in this case, for the SP version,

$$k_3 = 1.0, \quad k_4 = 0.5, \qquad (40)$$

which corresponds to the classical Equation (1). The use of the fma function in the second iteration in the form given in (39) is important, since it greatly improves the accuracy of the algorithm at the final stage of the calculations. However, compared to the RcpSqrt1 and RcpSqrt2 algorithms, it has four multiplications rather than three in the second iteration (a further addition is also hidden inside fma). If we write the second iteration in the same way as in Equation (37), we only get 22.68 bits of accuracy ($\delta_{max} = 1.492 \times 10^{-7}$). A full C/C++ code for two NR iterations is given in Algorithm 4. Here, we also make some corrections to the values of $R_1$, $R_2$, $k_{11}$, $k_{12}$, $k_{21}$, and $k_{22}$ (given in (21), (27), (34), and (35)) to minimize the maximum errors of the complete RcpSqrt32f algorithm, in a similar way to the approach described in Section 3.1.1. An alternative would be to modify the values of $k_3$ and $k_4$, although this is less effective for type float.
Algorithm 4. Proposed RcpSqrt32f algorithm.
1: float RcpSqrt32f (float x)
2: {
3:  float y = RcpSqrt31f [R1 = 0x5ed9dbc6, R2 = 0x5f19d200,
         k11 = 2.33124018f, k12 = 1.07497406f,
         k21 = 0.824212492f, k22 = 2.14996147f] (x);
4:  float c = x*y;
5:  float r = fmaf(y, -c, 1.0f);
6:  y = fmaf(0.5f*y, r, y);
7:  return y;
8: }
The final RcpSqrt32f algorithm has errors

$$\delta_{max}^{+} = 7.362378 \times 10^{-8}, \quad \delta_{max}^{-} = -7.754203 \times 10^{-8}, \qquad (41)$$

or 23.62 correct bits of the result out of a possible $p + 1 = 24$ for float numbers. Note that, when this algorithm has the same constants for the initial approximation as in Algorithm 3—RcpSqrt31f plus classic NR in the form given in (39) and (40)—it has an error of $\delta_{max} = 8.038 \times 10^{-8}$.

3.2. SP Square Root (Sqrt3 for Float)

We now turn to the square root calculation algorithms to find an approximation for $y = \sqrt{x}$ in SP. As noted in Section 1, these algorithms can easily be obtained from those previously described, simply by multiplying the result by the value of the input argument $x$, as in the sketch below. However, this involves an additional multiplication operation, and in our algorithms, in most cases, this can be avoided by modifying the last iteration.
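For illustration only, the naive route looks as follows (assuming RcpSqrt31f from Algorithm 3 is in scope); the fused variants derived in this subsection fold the extra multiplication into the iteration itself:

// naive square root via the reciprocal square root: one extra multiplication
// and one extra rounding compared to the fused form of Algorithm 5
float RcpSqrt31f(float x);

float SqrtNaive(float x) {
    return x * RcpSqrt31f(x);
}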

3.2.1. One Iteration—The DC Initial Approximation

For one iteration, we make the substitution $c_0 = x y_0$ in Equation (37). Then, the first iteration for the square root on each subinterval is written as

$$c_0 = x y_0, \quad y_1 = k_1 c_0 \cdot \mathrm{fma}(y_0, -c_0, k_2). \qquad (42)$$
Algorithm 5 provides the final code for Sqrt31f with optimized constants. After the first iteration, the algorithm has errors

$$\delta_{max}^{+} = 7.450372 \times 10^{-5}, \quad \delta_{max}^{-} = -7.451108 \times 10^{-5}, \qquad (43)$$

and hence it has the same level of error as the RcpSqrt31f algorithm. In addition, in Algorithm 5, the same constants can be used as in RcpSqrt31f ($\delta_{max} = 7.46 \times 10^{-5}$).
Algorithm 5. Proposed Sqrt31f algorithm (DC initial approximation).
1: float Sqrt31f (float x)
2: {
3:  int i = *(int*)&x;
4:  int k = i & 0x00800000;
5:  float y;
6:  if (k != 0) {
7:   i = 0x5ed9e893 - (i >> 1);
8:   y = *(float*)&i;
9:   float c = x*y;
10:   y = 2.33130789f*c*fmaf(y, -c, 1.07495356f);
11:  } else {
12:   i = 0x5f19e8fd - (i >> 1);
13:   y = *(float*)&i;
14:   float c = x*y;
15:   y = 0.82421863f*c*fmaf(y, -c, 2.1499474f);
16:  }
17:  return y;
18: }

3.2.2. Two Iterations

When we use two NR iterations to calculate the square root function (Sqrt32f), the structure of the algorithm does not change compared to RcpSqrt32f, and we need only modify the second iteration (see Algorithm 6, line 6). This algorithm has slightly lower accuracy than RcpSqrt32f, with

$$\delta_{max}^{+} = 8.757966 \times 10^{-8}, \quad \delta_{max}^{-} = -9.037992 \times 10^{-8}. \qquad (44)$$

This corresponds to 23.4 exact bits—or $\delta_{max} = 9.216 \times 10^{-8}$ if we do not change the constants of the DC initial approximation, i.e., use those of the RcpSqrt31f algorithm.
Algorithm 6. Proposed Sqrt32f algorithm.
1: float Sqrt32f (float x)
2: {
3:  float y = RcpSqrt31f [R1 = 0x5ed9d098, R2 = 0x5f19d352,
        k11 = 2.33139729f, k12 = 1.07492042f,
        k21 = 0.82420468f, k22 = 2.14996147f] (x);
4:  float c = x*y;
5:  float r = fmaf(y, -c, 1.0f);
6:  y = fmaf(0.5f*c, r, c);    // modified
7:  return y;
8: }

3.3. DP Reciprocal Square Root (RcpSqrt3 for Double)

3.3.1. One Iteration—The DC Initial Approximation

Similarly, this method can be applied to FP numbers of DP. In this case, the theoretically determined magic constants based on (20) and (26) are

$$R_1 = \text{0x5fdb3d16dd72c671} \qquad (45)$$
$$R_2 = \text{0x5fe33d16dd72c671}. \qquad (46)$$

The overall structure of the algorithm does not change (see Template 1), and only the data types used in the calculations are modified compared to the RcpSqrt31f algorithm. The corresponding algorithm for DP is given in Algorithm 7. Here, the parameters $R_1$, $R_2$, $k_{11}$, $k_{12}$, $k_{21}$, and $k_{22}$ are 64-bit constants after optimization. Note that the specified improved constants can be quite different from the theoretical ones, depending on the practical optimization method used. This gives the following aligned maxima in the relative errors:

$$\delta_{max}^{+} = 7.437897 \times 10^{-5}, \quad \delta_{max}^{-} = -7.437897 \times 10^{-5}, \qquad (47)$$

corresponding to an accuracy of 13.71 bits.
Algorithm 7. Proposed RcpSqrt31d algorithm (DC initial approximation).
1: double RcpSqrt31d (double x)
2: {
3:  uint64_t i = *(uint64_t*)&x;
4:  uint64_t k = i & 0x0010000000000000;
5:  double y;
6:  if (k != 0) {
7:   i = 0x5fdb3d20982e5432 - (i >> 1);
8:   y = *(double*)&i;
9:   y = 2.331242396766632*y*fma(-x, y*y, 1.074973693828754);
10:  } else {
11:   i = 0x5fe33d209e450c1b - (i >> 1);
12:   y = *(double*)&i;
13:   y = 0.824218612684476826*y*fma(-x, y*y, 2.14994745900706619);
14:  }
15:  return y;
16: }

3.3.2. Two Iterations

The RcpSqrt32d algorithm for finding the DP reciprocal square root with two iterations is shown in Algorithm 8. Note that, here, we use the same constants for the DC initial approximation as in the RcpSqrt31d algorithm. This algorithm has a second iteration in the form of (39), with changes in the following two constants:

$$k_3 = 1.000000008298416, \quad k_4 = 0.50000000057372. \qquad (48)$$

The maximum relative errors of this algorithm are

$$\delta_{max}^{+} = 4.149208 \times 10^{-9}, \quad \delta_{max}^{-} = -4.149157 \times 10^{-9} \qquad (49)$$

(27.84 correct bits), in contrast to $\delta_{max} = 7.75 \times 10^{-8}$ for SP numbers (see (41)). The modification of the last NR iteration in the form of (39) and (48) allows us to increase the accuracy of the algorithm from 26.84 bits, in the case of a classic iteration, to 27.84 bits.
Algorithm 8. Proposed RcpSqrt32d algorithm.
1: double RcpSqrt32d (double x)
2: {
3:  double y = RcpSqrt31d (x);
4:  double c = x*y;
5:  double r = fma(y, -c, 1.000000008298416);
6:  y = fma(0.50000000057372*y, r, y);
7:  return y;
8: }

3.3.3. Three Iterations

For three iterations in DP, we present two versions of the algorithm: one with fewer multiplication operations (RcpSqrt331d) and one with higher accuracy (RcpSqrt332d). The complete RcpSqrt331d algorithm is given in Algorithm 9. The errors of this algorithm have the following boundaries:

$$\delta_{max}^{+} = 1.603535 \times 10^{-16}, \quad \delta_{max}^{-} = -1.826339 \times 10^{-16} \qquad (50)$$

(52.28 correct bits). Here, we have made the substitution $mxhalf = -0.5x$ (line 4), in a similar way as in RcpSqrt1 and RcpSqrt2. This allows us to avoid one multiplication and also to reuse this value in the classic or modified NR iterations. In this case, the second and third iterations have the following form:

$$mxhalf = -0.5 x, \quad y_2 = y_1 \cdot \mathrm{fma}(mxhalf, y_1 y_1, k_3) \qquad (51)$$
$$r_2 = \mathrm{fma}(mxhalf, y_2 y_2, k_4), \quad y_3 = \mathrm{fma}(y_2, r_2, y_2), \qquad (52)$$

where

$$k_3 = 1.5000000034937999, \quad k_4 = 0.5. \qquad (53)$$

If we do not change the initial approximation constants in Algorithm 9—RcpSqrt31d plus modified and classic NR iterations in the form of (51)–(53)—we obtain 52.23 bits of accuracy ($\delta_{max} = 1.898 \times 10^{-16}$).
Algorithm 9. Proposed RcpSqrt331d (faster) algorithm.
1: double RcpSqrt331d (double x)
2: {
3:  double y = RcpSqrt31d [R1 = 0x5fdb3d14170034b6, R2 = 0x5fe33d18a2b9ef5f,
         k11 = 2.33124735553421569, k12 = 1.07497362654295614,
         k21 = 0.82421942523718461, k22 = 2.1499494964450325] (x);
4:  double mxhalf = -0.5*x;
5:  y = y*fma(mxhalf, y*y, 1.5000000034937999);
6:  double r = fma(mxhalf, y*y, 0.5);
7:  y = fma(y, r, y);
8:  return y;
9: }
On the other hand, if we do not make this substitution and write the last iteration using $c_2 = x y_2$, we obtain the algorithm RcpSqrt332d, which contains one more multiplication and an additional coefficient in the last iteration. In this case, the last two iterations of the algorithm are (see Algorithm 10, lines 4–7)

$$y_2 = y_1 \cdot \mathrm{fma}(-k_3 x, y_1 y_1, k_4) \qquad (54)$$
$$c_2 = x y_2, \quad r_2 = \mathrm{fma}(y_2, -c_2, 1.0), \quad y_3 = \mathrm{fma}(k_5 y_2, r_2, y_2), \qquad (55)$$

where

$$k_3 = 0.5000000000724769, \quad k_4 = 1.50000000394948985, \quad k_5 = 0.50000000001394973. \qquad (56)$$

Compared to RcpSqrt331d, the RcpSqrt332d algorithm has lower maximum relative errors after the third iteration,

$$\delta_{max}^{+} = 1.363926 \times 10^{-16}, \quad \delta_{max}^{-} = -1.606246 \times 10^{-16} \qquad (57)$$

(52.47 exact bits). Note that the other alternatives to this algorithm—RcpSqrt31d plus two modified NR iterations in the form of (54)–(56), and RcpSqrt32d plus a classic NR iteration in the form given in (55), where $k_5 = 0.5$—are slightly less accurate (52.44 correct bits).
Algorithm 10. Proposed RcpSqrt332d (higher accuracy) algorithm.
1: double RcpSqrt332d (double x)
2: {
3:  double y = RcpSqrt31d [R1 = 0x5fdb3d15bd0ca57e, R2 = 0x5fe33d190934572f,
         k11 = 2.3312432409377752, k12 = 1.0749736243940957,
         k21 = 0.824218531163110613, k22 = 2.1499488934465218] (x);
4:  y = y*fma(-0.5000000000724769*x, y*y, 1.50000000394948985);
5:  double c = x*y;
6:  double r = fma(y, -c, 1.0);
7:  y = fma(0.50000000001394973*y, r, y);
8:  return y;
9: }

3.4. DP Square Root (Sqrt3 for Double)

3.4.1. One and Two Iterations

In the same way as for the type float and reciprocal square root, we construct algorithms for the square root in DP using one and two iterations. These are based on the RcpSqrt31d and RcpSqrt32d algorithms. The errors of these algorithms are close to those of the reciprocal square root (see (47) and (49)).

3.4.2. Three Iterations

For the algorithm with three iterations, we cannot avoid the additional multiplication, as we did in the RcpSqrt331d algorithm described above. Hence, we present only an algorithm that has four multiplications in the third iteration, including fma operations. It is based on RcpSqrt332d, in which the last iteration (55) is modified for the square root calculation, and the corresponding parameters are optimized, as shown in Algorithm 11. After the third iteration, this algorithm has errors of

$$\delta_{max}^{+} = 1.66425 \times 10^{-16}, \quad \delta_{max}^{-} = -1.847481 \times 10^{-16} \qquad (58)$$

(52.27 correct bits). If we do not modify the constants of the DC initial approximation in Algorithm 11, the accuracy is 52.25 bits. The algorithm based on RcpSqrt32d—RcpSqrt32d plus a version of the classic NR iteration in a special form, modified for the square root—has 52.23 bits of accuracy.
Algorithm 11. Proposed Sqrt33d algorithm.
1: double Sqrt33d (double x)
2: {
3:  double y = RcpSqrt31d [R1 = 0x5fdb3d20dba7bd3c, R2 = 0x5fe33d165ce48760,
         k11 = 2.3312471012384104, k12 = 1.074974060752685,
         k21 = 0.82421918338542632, k22 = 2.1499482562039667] (x);
4:  y = y*fma(-0.50000000010988821*x, y*y, 1.5000000038700285);
5:  double c = x*y;
6:  double r = fma(y, -c, 1.0);
7:  y = fma(0.50000000001104072*c, r, c);    // modified
8:  return y;
9: }

4. Experimental Results and Discussion

Performance testing of these algorithms was conducted on a Raspberry Pi 3 Model B mini-computer and an ESP-WROOM-32 microcontroller. The Raspberry Pi is based on a quad-core 64-bit Broadcom BCM2837 SoC (1.2 GHz, 1 GB RAM) with ARM Cortex-A53 processor cores [6]. We used the GNU compiler (GCC 6.3.0) for Raspbian OS (32-bit) with the following compilation options: -std=c++11 -Os -ffp-contract=on -mfpu=neon-fp-armv8 -mcpu=cortex-a53. The 32-bit Wi-Fi module ESP-WROOM-32 (ESP-32) from Espressif Systems has two low-power Xtensa microprocessors (240 MHz, 520 KB RAM) [39]. The microcontroller was programmed via the Arduino IDE (GCC 5.2.0) with the following compilation parameters: -std=gnu++11 -Os -ffp-contract=fast. The speed (latency) of the algorithms was measured using the chrono C++ library. Depending on the platform, at least 200 tests were run, in each of which the functions were called sequentially in a single thread (core) a million or more times. The average results of these performance tests are given here.
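A minimal single-thread sketch of such a timing run is shown below; the argument sweep and the volatile sink are our choices to keep the optimizer from removing the calls:

#include <chrono>
#include <cstdio>

float RcpSqrt32f(float x);   // the function under test (Algorithm 4)

int main() {
    const int N = 1000000;
    volatile float sink = 0.0f;        // prevents the loop from being optimized away
    float x = 1.0f;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        sink = RcpSqrt32f(x);
        x += 1e-6f;                    // vary the argument slightly
    }
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / N;
    std::printf("average latency: %.1f ns per call (sink = %g)\n", ns, (float)sink);
    return 0;
}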
It should be noted that, although we chose C++ to implement the algorithms, it is worth using inline assembly code for more efficient and better-tuned performance optimization on each specific platform. However, the chosen compilation options gave fairly effective code optimization and allowed us to automatically translate the fmaf(a, b, c) and fma(a, b, c) C++ functions into the corresponding HW instructions; for the microcontroller, the compiler may even automatically replace successive SP multiplication and addition/subtraction operations with the corresponding fma HW instructions (the -ffp-contract=fast option, which enables FP expression contraction).
The accuracy and latency measurements for the reciprocal square root ($y = 1/\sqrt{x}$) and square root ($y = \sqrt{x}$) functions, in both SP and DP, are summarized in Table 1. In this table, we consider the various methods available on the mini-computer and microcontroller, including the cmath SW library functions (sqrtf(x) and sqrt(x)) [5] and the built-in NEON instructions (FRSQRTE and FRSQRTS) [11] for an approximate calculation of the reciprocal square root on the Raspberry Pi using NR iterations. We also compare the method of Walczyk et al. (the RcpSqrt2 algorithm) [29], its modification for calculation of the square root, and the proposed method of switching magic constants (the DC algorithms from Section 3). Here, both the Walczyk et al. and the DC algorithms are implemented using the fma function.
Even on platforms that do not have special HW instructions for the square root and reciprocal square root (either approximate or with full accuracy), such as the ESP-32, the C++ function sqrt(x) is available for both SP and DP numbers. Modern platforms, such as Intel or ARM, may also have the appropriate hardware FSQRT instructions. These are IEEE-compliant and ensure the full accuracy of the result (see the sqrtf(x) and sqrt(x) functions in Table 1).
Looking at the results from Table 1, it becomes obvious that the main feature of our proposed DC algorithms for the float and double types is that the algorithms RcpSqrt32f, Sqrt32f, RcpSqrt331d, RcpSqrt332d, and Sqrt33d allow the result to be obtained up to the last bit—although the 24th and 53rd bits may be wrong. At the same time, the RcpSqrt32f and RcpSqrt332d algorithms for the reciprocal square root have somewhat higher accuracy than the naive method using division.
All the platforms tested have HW-implemented multiplication, addition, and fma operations for FP numbers of SP and, except for ESP-32, DP [10,11,12]. Since the ESP microcontroller is a 32-bit system, all DP operations are performed by SW. Note that it also does not have a division instruction in SP [12], meaning that the latency of the corresponding operations is much higher.
As shown in Table 1, the proposed algorithms give significantly better performance than the library functions on the Raspberry Pi, from 3.17 to 3.62 times faster; for SP numbers on the ESP-32, they are 2.34 times faster for the reciprocal square root and approximately 1.78 times faster than the sqrtf(x) function for the square root. At the same time, on the microcontroller with the SW implementation of DP fma, the RcpSqrt331d algorithm is a little faster than the naive method using the sqrt(x) function, but has slightly lower accuracy. The RcpSqrt332d algorithm, in contrast, has higher accuracy, but worse performance. The proposed algorithms are also not efficient on the ESP-32 for calculating the square root of DP numbers (in contrast to the Raspberry Pi). However, in some cases, it is possible to improve the performance of the algorithms in DP if fewer fma functions are used in the code, as shown later (see also [26] for more details).
In ARM with NEON technology [11], the HW SP instructions FRSQRTE and FRSQRTS, for which the corresponding intrinsics are vrsqrte_f32 and vrsqrts_f32, can be used to calculate a fast approximation of the reciprocal square root function and to perform the classical NR iteration (step), respectively. The FRSQRTE instruction is based on a LUT and gives 8.25 correct bits of the result (our DC initial approximation gives 13.71 bits). A combination of these instructions on the Raspberry Pi gives poorer accuracy and latency results than the RcpSqrt32f algorithm (see Table 1). It should also be noted that the HW FSQRT and 64-bit FRSQRTE instructions of the ARMv8 (AArch64) architecture are not available on the Raspberry Pi for the specified official 32-bit OS [11].
In order to ensure a fair comparison between the proposed algorithms and other advanced FISR-based methods, we also implemented an algorithm proposed by Walczyk et al. [29] in a specific form using the fma functions. This allowed us to strike a better compromise between accuracy and speed compared to the original RcpSqrt2 algorithm (see Table 1). For the square root calculation, we used the same method that we suggest for the DC algorithms. The results show that, although these algorithms are faster, their accuracy is much lower.
A comparison of the FISR-based algorithms that also provide 23 exact bits for a float and 52 exact bits for a double is given in Table 2 for SP numbers. Note that, here, TMC denotes the method of two magic constants [26] and Ho2 denotes the approach based on the second-order Householder’s method [32] for the reciprocal square root. The relative performance of the proposed algorithms depends on the operations used, their sequence, and the characteristics of the platform. For example, the RcpSqrt32f (Algorithm 4) and InvSqrt5 [32] (Section VI) algorithms are fairly efficient on ESP-32 for SP numbers. However, to the best of our knowledge, the DC initial approximation is the only FISR-based method with one NR iteration—four multiplications and one subtraction—that provides 13 correct bits of the result for the square root and reciprocal square root functions. It can be implemented on FPGAs using a small LUT, and only one bit of the input argument (LSB of the exponent) needs to be controlled.
Figure 4 shows the results of a comparison of the Lomont [8,25], Walczyk et al. [29], and switching magic constants (DC) methods after each iteration for DP numbers—with and without the use of the fma operations. Here, we consider different ways of implementing NR iterations using the fma functions. As shown by the graphs, the accuracy is almost the same in both cases, with the sole exception of three iterations. The proposed DC algorithm (RcpSqrt331d), in most cases, has slower performance on the platforms considered here than the Lomont and Walczyk et al. algorithms (except perhaps one iteration), but is significantly superior in terms of accuracy. It allows us to obtain highly accurate results for the square root and reciprocal square root calculations for DP numbers by the third iteration. Note that, when the fma function is used, we obtain 52.28 correct bits of the result (see RcpSqrt331d in Table 1), and, otherwise, we have an accuracy of 51.52 bits (see Figure 4a). For the Raspberry Pi, the latency of the algorithms with and without fma is similar, but is slightly smaller for some algorithms when using the fma function (after the first and second iterations). We obtain similar performance results for the ESP-32 microcontroller. The latency of all the algorithms is almost the same for one iteration. However, given the above comments on HW support for FP operations of DP, it should be noted that the algorithms that do not use fma are faster in this case. Figure 4 shows that the speed of the DC algorithm for three iterations is 6092.1 ns without fma and 6828.2 ns with SW fma functions. For ESP-32, we recommend using the DP fma function only in the third iteration of the DC algorithms, in order to obtain a better compromise between accuracy and speed.
It should also be noted that the disadvantages of the FISR methods and the approximate FRSQRTE instructions in comparison with the cmath library functions and fast HW FSQRT instructions are that they generally do not work correctly with subnormal numbers and do not handle other exceptional situations (e.g., $\pm 0$ and $\pm\infty$), although this does not apply to numbers in the NaN (not a number) range. However, as described in [29], FISR-based methods can be modified to support subnormal numbers.
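Where full input coverage is needed, one hedged option is to screen out the exceptional inputs and fall back to the library, as in the following sketch (the wrapper name and the fallback path are our choices, not part of the proposed algorithms):

#include <cmath>

float RcpSqrt32f(float x);   // fast path (Algorithm 4), valid for normalized x > 0

float RcpSqrtSafe(float x) {
    if (!std::isnormal(x) || x < 0.0f)   // zeros, subnormals, Inf, NaN, negatives
        return 1.0f / std::sqrt(x);      // library path handles IEEE special cases
    return RcpSqrt32f(x);
}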

5. Conclusions

This article proposes a set of algorithms for calculating the square root and reciprocal square root of normalized FP numbers of SP and DP, using the method of switching magic constants for the initial approximation. The proposed DC initial approximation provides about 13.71 correct bits of the result; compared to the RcpSqrt2 algorithm for one iteration, the maximum relative error is reduced by a factor of 11.7. The main feature of this method is the modification of the magic constant and the subsequent NR iteration, depending on the input subinterval. It uses fast HW fma instructions and allows us to obtain results with fairly good accuracy after two iterations for numbers of type float (23.62 bits for the $1/\sqrt{x}$ function and 23.4 bits for the $\sqrt{x}$ function) and after three iterations for numbers of type double (52.47 bits for the $1/\sqrt{x}$ function and 52.27 bits for the $\sqrt{x}$ function). To achieve correct rounding, one must additionally apply a rounding-error adjustment step, e.g., using the method described in [4] for the square root. As a result, the proposed method reduces the number of iterations required without using large LUTs. It has a low overhead compared to the baseline FISR, which is widely used in many scientific and commercial applications [8,30,34,35,37], and provides a better compromise between latency and accuracy than other known algorithms that use a magic constant, particularly those of Lomont [25] and Walczyk et al. [29]. It should be noted that the proposed DC algorithms can be extended to other data formats, such as the extended, quadruple, and octuple formats [26,29].
The algorithms described here can be most useful on microcontrollers and other computer platforms that support FP computations but do not have HW-implemented FPUs or fast HW instructions available for the square root or reciprocal square root calculation, such as the ESP-WROOM-32 [12,39] or Raspberry Pi [11]. As was shown for these platforms, the proposed approximation algorithms in certain cases give a performance gain of 1.5–3.5 times compared to the library functions. They can also be straightforwardly implemented on modern FPGA platforms such as Intel Cyclone [13] and Intel Stratix [40], which have SP FP blocks for add, multiply, and fma operations.

Author Contributions

Conceptualization, L.V.M.; methodology, L.V.M. and O.Y.H.; software, O.Y.H. and L.V.M.; validation, L.V.M., V.V.S. and O.Y.H.; formal analysis, V.V.S.; investigation, O.Y.H. and L.V.M.; writing—original draft preparation, L.V.M. and O.Y.H.; writing—review and editing, O.Y.H. and V.V.S.; visualization, O.Y.H.; supervision, L.V.M.; project administration, V.V.S. and L.V.M.; funding acquisition, V.V.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable.

Acknowledgments

The authors would like to thank Andrii Malohlovets and Petro Rudyi for providing microcontrollers for testing, and Marta Romanytsia for translating the draft version of this manuscript into English.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Allie, M.C.; Lyons, R. A root of less evil [digital signal processing]. IEEE Signal Process. Mag. 2005, 22, 93–96.
2. Parhami, B. Computer Arithmetic: Algorithms and Hardware Designs; Oxford University Press: Oxford, UK, 2010; ISBN 9780195328486.
3. Hasnat, A.; Bhattacharyya, T.; Dey, A.; Halder, S.; Bhattacharjee, D. A fast FPGA based architecture for computation of square root and inverse square root. In Proceedings of the 2017 Devices for Integrated Circuit (DevIC), Kalyani, India, 23–24 March 2017; pp. 383–387.
4. Beebe, N.H.F. The Mathematical-Function Computation Handbook: Programming Using the MathCW Portable Software Library, 1st ed.; Springer International Publishing: New York, NY, USA, 2017; pp. 215–242; ISBN 978-3-319-64109-6.
5. Loosemore, S.; Stallman, R.; McGrath, R.; Oram, A.; Drepper, U. The GNU C Library Reference Manual for Version 2.31; Free Software Foundation: Boston, MA, USA, 2020. Available online: https://www.gnu.org/software/libc/manual/pdf/libc.pdf (accessed on 19 December 2020).
6. Raspberry Pi 3 Model B; RS Components: Corby, UK. Available online: https://www.alliedelec.com/m/d/4252b1ecd92888dbb9d8a39b536e7bf2.pdf (accessed on 27 May 2020).
7. Floating Point Unit Demonstration on STM32 Microcontrollers; Application Note AN4044, DocID022737 Rev 2; STMicroelectronics N.V., May 2016. Available online: https://www.st.com/resource/en/application_note/dm00047230-floating-point-unit-demonstration-on-stm32-microcontrollers-stmicroelectronics.pdf (accessed on 19 December 2020).
8. Lemaitre, F.; Couturier, B.; Lacassagne, L. Cholesky factorization on SIMD multi-core architectures. J. Syst. Arch. 2017, 79, 1–15.
9. Fog, A. Instruction Tables: Lists of Instruction Latencies, Throughputs and Micro-Operation Breakdowns for Intel, AMD and VIA CPUs; Technical University of Denmark: Lyngby, Denmark, 2020. Available online: https://www.agner.org/optimize/instruction_tables.pdf (accessed on 18 November 2020).
10. Intel 64 and IA-32 Architectures Software Developer’s Manual; Combined Volumes 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4, Order Number 325462-071US; Intel Corp.: Santa Clara, CA, USA, 2019. Available online: https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf (accessed on 19 December 2020).
11. ARM NEON Intrinsics Reference; IHI 0073B; ARM Ltd.: Cambridge, UK, 2016.
12. Xtensa Instruction Set Architecture (ISA); Reference Manual PD-09-0801-10-01; Tensilica Inc.: Santa Clara, CA, USA, 2010. Available online: https://usermanual.wiki/Document/Xtensa2020ASSEMBLER20GUIDE.1231659642/view (accessed on 19 December 2020).
13. Intel Cyclone 10 GX Device Overview; C10GX51001; Intel Corp.: Santa Clara, CA, USA, 2019. Available online: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/cyclone-10/c10gx-51001.pdf (accessed on 19 December 2020).
14. Yi, J.J.; Joshi, A.; Sendag, R.; Eeckhout, L.; Lilja, D.J. Analyzing the Processor Bottlenecks in SPEC CPU 2000. In Proceedings of the 2006 SPEC Benchmark Workshop, Austin, TX, USA, 23 January 2006.
15. Muller, J.-M. Elementary Functions and Approximate Computing. Proc. IEEE 2020, 108, 2136–2149.
16. Muller, J.-M. Elementary Functions: Algorithms and Implementation, 2nd ed.; Birkhäuser: Basel, Switzerland, 2006; ISBN 978-1-4899-7981-0.
17. Muller, J.-M.; Brunie, N.; de Dinechin, F.; Jeannerod, C.-P.; Joldes, M.; Lefèvre, V.; Melquiond, G.; Revol, N.; Torres, S. Handbook of Floating-Point Arithmetic, 2nd ed.; Birkhäuser: Basel, Switzerland, 2018; pp. 375–433; ISBN 978-3-319-76525-9.
18. Bruguera, J.D. Low Latency Floating-Point Division and Square Root Unit. IEEE Trans. Comput. 2020, 69, 274–287.
19. Cornea-Hasegan, M.A.; Golliver, R.A.; Markstein, P. Correctness proofs outline for Newton–Raphson based floating-point divide and square root algorithms. In Proceedings of the 14th IEEE Symposium on Computer Arithmetic (Cat. No. 99CB36336), Adelaide, Australia, 14–16 April 1999; pp. 96–105.
20. Eberly, D.H. GPGPU Programming for Games and Science; CRC Press: Boca Raton, FL, USA, 2015; pp. 107–122; ISBN 978-1-4665-9535-4.
21. Muller, J.-M. A Few Results on Table-Based Methods. Reliab. Comput. 1999, 5, 279–288.
22. Schulte, M.; Stine, J. Approximating elementary functions with symmetric bipartite tables. IEEE Trans. Comput. 1999, 48, 842–847.
23. De Dinechin, F.; Tisserand, A. Multipartite table methods. IEEE Trans. Comput. 2005, 54, 319–330.
24. Blinn, J. Floating-point tricks. IEEE Comput. Graph. Appl. 1997, 17, 80–84.
25. Lomont, C. Fast Inverse Square Root; Technical Report; Purdue University: West Lafayette, IN, USA, 2003. Available online: http://www.lomont.org/papers/2003/InvSqrt.pdf (accessed on 20 December 2020).
26. Horyachyy, O.; Moroz, L.; Otenko, V. Simple effective fast inverse square root algorithm with two magic constants. Int. J. Comput. 2019, 18, 461–470.
27. Quake III Arena; Id Software Inc.: Richardson, TX, USA, 1999. Available online: https://github.com/id-Software/Quake-III-Arena/blob/master/code/game/q_math.c#L552 (accessed on 20 December 2020).
28. Moroz, L.V.; Walczyk, C.J.; Hrynchyshyn, A.; Holimath, V.; Cieśliński, J.L. Fast calculation of inverse square root with the use of magic constant–analytical approach. Appl. Math. Comput. 2018, 316, 245–255.
29. Walczyk, C.J.; Moroz, L.V.; Cieśliński, J.L. Improving the Accuracy of the Fast Inverse Square Root by Modifying Newton–Raphson Corrections. Entropy 2021, 23, 86.
30. Lin, J.; Xu, Z.; Nukada, A.; Maruyama, N.; Matsuoka, S. Optimizations of Two Compute-Bound Scientific Kernels on the SW26010 Many-Core Processor. In Proceedings of the 2017 46th International Conference on Parallel Processing (ICPP), Bristol, UK, 14–17 August 2017; pp. 432–441.
31. Carlile, B.; Delamarter, G.; Kinney, P.; Marti, A.; Whitney, B. Improving Deep Learning by Inverse Square Root Linear Units (ISRLUs). arXiv 2017, arXiv:1710.09967. Available online: https://arxiv.org/pdf/1710.09967.pdf (accessed on 20 December 2020).
32. Moroz, L.; Samotyy, V.; Horyachyy, O.; Dzelendzyak, U. Algorithms for Calculating the Square Root and Inverse Square Root Based on the Second-Order Householder’s Method. In Proceedings of the 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), Metz, France, 18–21 September 2019; pp. 436–442.
33. Zafar, S.; Adapa, R. Hardware architecture design and mapping of fast inverse square root algorithm. In Proceedings of the 2014 International Conference on Advances in Electrical Engineering (ICAEE), Vellore, India, 9–11 January 2014; pp. 1–4.
34. Hänninen, T.; Janhunen, J.; Juntti, M. Novel detector implementations for 3G LTE downlink and uplink. Analog Integr. Circuits Signal Process. 2013, 78, 645–655.
35. Hsu, C.-J.; Chen, J.-L.; Chen, L.-G. An efficient hardware implementation of HON4D feature extraction for real-time action recognition. In Proceedings of the 2015 International Symposium on Consumer Electronics (ISCE), Madrid, Spain, 24–26 June 2015; pp. 1–2.
36. Hsieh, C.-H.; Chiu, Y.-F.; Shen, Y.-H.; Chu, T.-S.; Huang, Y.-H. A UWB Radar Signal Processing Platform for Real-Time Human Respiratory Feature Extraction Based on Four-Segment Linear Waveform Model. IEEE Trans. Biomed. Circuits Syst. 2015, 10, 219–230.
37. Sangeetha, D.; Deepa, P. Efficient Scale Invariant Human Detection Using Histogram of Oriented Gradients for IoT Services. In Proceedings of the 2017 30th International Conference on VLSI Design and 2017 16th International Conference on Embedded Systems (VLSID), Hyderabad, India, 7–11 January 2017; pp. 61–66.
38. Moroz, L.; Samotyy, V.; Horyachyy, O. An Effective Floating-Point Reciprocal. In Proceedings of the 2018 IEEE 4th International Symposium on Wireless Systems within the International Conferences on Intelligent Data Acquisition and Advanced Computing Systems (IDAACS-SWS), Lviv, Ukraine, 20–21 September 2018; pp. 137–141.
39. ESP32-WROOM-32 (ESP-WROOM-32) Datasheet; Version 2.4; Espressif Systems: Shanghai, China, 2018. Available online: https://www.mouser.com/datasheet/2/891/esp-wroom-32_datasheet_en-1223836.pdf (accessed on 20 December 2020).
40. Intel Stratix 10 GX/SX Device Overview; Intel Corp.: Santa Clara, CA, USA, 2020. Available online: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stratix-10/s10-overview.pdf (accessed on 20 December 2020).
Figure 1. Initial approximation y0 for the reciprocal square root function on the interval x ∈ [1, 4), obtained using the fast inverse square root (FISR) method and the modified FISR method: (a) FISR method with the magic constant of Lomont; (b) method of switching magic constants (intermediate result).
Figure 2. Alignment of the relative errors of two adjacent piecewise linear initial approximations: (a) approximations y01 and y02 for the interval x ∈ [1, 2); (b) approximations y02 and y03 for the interval x ∈ [2, 4). Here, we ignore the relative errors on other intervals.
Figure 3. Theoretical relative errors of the RcpSqrt2 (Walczyk et al.) and RcpSqrt31f (the proposed dynamic constants (DC) initial approximation) algorithms on the interval x ∈ [1, 4) after the first iteration.
Figure 4. Comparison of the accuracy of the FISR-based algorithms for double-precision numbers (Lomont, Walczyk et al., and DC methods) and their latency on the Raspberry Pi 3 and ESP-WROOM-32 platforms: (a) without the fused multiply–add (fma) operation; (b) with hardware fma instructions on the Raspberry Pi and software fma functions on the ESP-32.
Table 1. Comparison of different methods for calculating the reciprocal square root and square root for single (SP) and double (DP) precision on the Raspberry Pi mini-computer and the ESP-32 microcontroller.

| Function | Data Type | Method | δmax+ | δmax− | Accuracy (Bits) | Latency, RP 3 (ns) | Latency, ESP-32 (ns) |
|---|---|---|---|---|---|---|---|
| 1/√x | float | 1.0f/sqrtf(x) | 8.9407 × 10−8 | −8.9348 × 10−8 | 23.42 | 178.6 | – |
| | | divf(1.0f, sqrtf(x)) | 8.9407 × 10−8 | −8.9348 × 10−8 | 23.42 | – | 797.3 |
| | | Walczyk 1, 1 iter. | 8.7919 × 10−4 | −8.7921 × 10−4 | 10.15 | 25.9 | 234.9 |
| | | Walczyk 1, 2 iter. | 6.7893 × 10−7 | −6.4727 × 10−7 | 20.49 | 40.2 | 281.1 |
| | | DC 1, 1 iter. (RcpSqrt31f) | 7.4593 × 10−5 | −7.4504 × 10−5 | 13.71 | 28.5 | 248.0 |
| | | DC 1, 2 iter. (RcpSqrt32f) | 7.3624 × 10−8 | −7.7542 × 10−8 | 23.62 | 52.2 | 340.3 |
| | | FRSQRTE 2 | 3.2768 × 10−3 | −3.0354 × 10−3 | 8.25 | 19.3 | – |
| | | FRSQRTE + FRSQRTS 2, 1 iter. | 1.3127 × 10−7 | −1.6183 × 10−5 | 15.92 | 40.2 | – |
| | | FRSQRTE + FRSQRTS 2, 2 iter. | 1.6064 × 10−7 | −1.5772 × 10−7 | 22.6 | 60.4 | – |
| | double | 1.0/sqrt(x) | 1.6653 × 10−16 | −1.6653 × 10−16 | 52.42 | 199.2 | – |
| | | div(1.0, sqrt(x)) | 1.6653 × 10−16 | −1.6653 × 10−16 | 52.42 | – | 6984.8 |
| | | DC 1, 1 iter. (RcpSqrt31d) | 7.4379 × 10−5 | −7.4379 × 10−5 | 13.72 | 28.5 | 2744.1 |
| | | DC 1, 2 iter. (RcpSqrt32d) | 4.1492 × 10−9 | −4.1492 × 10−9 | 27.84 | 51.1 | 5128.4 |
| | | DC 1, 3 iter. (RcpSqrt331d) | 1.6035 × 10−16 | −1.8263 × 10−16 | 52.28 | 56.1 | 6828.2 |
| | | DC 1, 3 iter. (RcpSqrt332d) | 1.3639 × 10−16 | −1.6062 × 10−16 | 52.47 | 64.5 | 7237.8 |
| √x | float | sqrtf(x) | 5.9565 × 10−8 | −5.9605 × 10−8 | 24.00 | 172.1 | 604.3 |
| | | Walczyk 1,3, 1 iter. | 8.7919 × 10−4 | −8.7919 × 10−4 | 10.15 | 28.4 | 234.9 |
| | | Walczyk 1,3, 2 iter. | 6.8215 × 10−7 | −6.4493 × 10−7 | 20.48 | 43.4 | 302.0 |
| | | DC 1, 1 iter. (Sqrt31f) | 7.4504 × 10−5 | −7.4511 × 10−5 | 13.71 | 27.5 | 264.8 |
| | | DC 1, 2 iter. (Sqrt32f) | 8.7580 × 10−8 | −9.0380 × 10−8 | 23.40 | 47.5 | 340.3 |
| | double | sqrt(x) | 1.1102 × 10−16 | −1.1102 × 10−16 | 53.00 | 187.8 | 4403.3 |
| | | DC 1, 1 iter. | 7.4379 × 10−5 | −7.4379 × 10−5 | 13.72 | 34.1 | 2650.4 |
| | | DC 1, 2 iter. | 4.1492 × 10−9 | −4.1492 × 10−9 | 27.85 | 46.8 | 5047.6 |
| | | DC 1, 3 iter. (Sqrt33d) | 1.6643 × 10−16 | −1.8475 × 10−16 | 52.27 | 59.3 | 7151.9 |

1 Implemented with HW fma operations (except for DP on ESP-32, where SW fma is used). 2 HW instructions on the Raspberry Pi (the FRSQRTE estimate instruction is based on a LUT; the FRSQRTS instruction performs the classic NR iteration step). 3 Modified for the square root calculation.
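As a point of reference for the FRSQRTE/FRSQRTS rows above, the standard refinement pattern on ARM is a LUT-based estimate followed by FRSQRTS steps, each of which performs one classic NR iteration. A minimal sketch using ARM NEON intrinsics (assuming arm_neon.h is available and four-lane SP vectors; the function name is ours):

```c
#include <arm_neon.h>

/* Reciprocal square root of four SP values: FRSQRTE gives a LUT-based
   estimate, and each FRSQRTS step computes (3 - a*b)/2, so multiplying
   by it performs one NR iteration y <- y * (3 - x*y*y)/2. Two steps
   correspond to the "2 iter." rows in Table 1. */
float32x4_t rsqrt_neon(float32x4_t x)
{
    float32x4_t y = vrsqrteq_f32(x);                     /* FRSQRTE     */
    y = vmulq_f32(y, vrsqrtsq_f32(vmulq_f32(x, y), y));  /* 1st FRSQRTS */
    y = vmulq_f32(y, vrsqrtsq_f32(vmulq_f32(x, y), y));  /* 2nd FRSQRTS */
    return y;
}
```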
Table 2. Comparison of different FISR-based methods for calculating the SP reciprocal square root with high accuracy (number of operations by type, with differences relative to the Walczyk variant in parentheses).

| Method | FP Fma 1 | FP Mult | FP Add/Sub 2 | FP–Int Transl 3 | Int Add/Sub | Int Comp | Bit AND | Bit Shift | Total | Accuracy (Bits) |
|---|---|---|---|---|---|---|---|---|---|---|
| Walczyk (1 iter.) | 4 | 1 | – | 2 | 1 | – | – | 1 | 9 | 10.15 |
| TMC (1 iter.) | 3 (−1) | 1 | – | 3 (+1) | 2 (+1) | – | – | 1 | 10 (+1) | 10.59 |
| Ho2 (1 iter.) | 5 (+1) | 2 (+1) | – | 2 | 1 | – | – | 1 | 11 (+2) | 15.63 |
| DC (1 iter.) | 4 | 1 | – | 2 | 1 | 1 (+1) | 1 (+1) | 1 | 11 (+2) | 13.71 |
| Walczyk (3 iter.) | 2 | 8 | 2 | 2 | 1 | – | – | 1 | 16 | 23.42 |
| TMC (2 iter.) | 3 (+1) | 5 (−3) | 1 (−1) | 3 (+1) | 2 (+1) | – | – | 1 | 15 (−1) | 23.47 |
| Ho2 (2 iter.) | 2 | 7 (−1) | 2 | 2 | 1 | – | – | 1 | 15 (−1) | 23.69 |
| DC (2 iter.) | 2 | 6 (−2) | 1 (−1) | 2 | 1 | 1 (+1) | 1 (+1) | 1 | 15 (−1) | 23.62 |

1 The fma operation can be replaced with a combination of multiplication and addition/subtraction operations, but this can drastically affect the accuracy. 2 Each FP addition/subtraction can be combined with an adjacent multiplication into an fma operation. 3 Transformation ι(x) or inverse transformation φ(i).
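Footnote 1 is worth a short illustration. The pair of helper functions below (our naming, not from the article) contrasts a fused NR step with its unfused equivalent; the fused form evaluates the correction term with a single rounding, which is what preserves the final bits of accuracy reported in the last column.

```c
#include <math.h>   /* fmaf */

/* One NR step for y ~ 1/sqrt(x). The fused variant computes
   (-0.5*x*y)*y + 1.5 with a single rounding of the sum. */
static inline float nr_step_fused(float x, float y)
{
    return y * fmaf(-0.5f * x * y, y, 1.5f);
}

/* Unfused equivalent: the product and the subtraction are rounded
   separately, which can cost accuracy in the last iteration. */
static inline float nr_step_unfused(float x, float y)
{
    return y * (1.5f - 0.5f * x * y * y);
}
```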
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
