Design of Hardware IP for 128-Bit Low-Latency Arcsinh and Arccosh Functions

Chang, Junfeng; Wang, Mingjiang

doi:10.3390/electronics12224658

by Junfeng Chang ¹ and Mingjiang Wang ²,*

¹ Shenzhen Semiconductor Industry Association, Shenzhen 518052, China

² Key Laboratory for Key Technologies of IoT Terminals, Harbin Institute of Technology, Shenzhen 518055, China

* Author to whom correspondence should be addressed.

Electronics 2023, 12(22), 4658; https://doi.org/10.3390/electronics12224658

Submission received: 22 October 2023 / Revised: 13 November 2023 / Accepted: 13 November 2023 / Published: 15 November 2023

(This article belongs to the Section Circuit and Signal Processing)

Abstract:

With the rapid development of technologies like artificial intelligence, high-performance computing chips are playing an increasingly vital role. The inverse hyperbolic sine and inverse hyperbolic cosine functions are of utmost importance in fields such as image blur and robot joint control. Therefore, there is an urgent need for research into high-precision, high-performance hardware Intellectual Property (IP) for arcsinh and arccosh functions. To address this issue, this paper introduces a novel 128-bit low-latency floating-point hardware IP for arcsinh and arccosh functions, employing an enhanced Coordinate Rotation Digital Computer (CORDIC) algorithm, achieving a computation precision of 113 bits in just 32 computation cycles. This significantly enhances computational efficiency while reducing hardware implementation latency. The results indicate that, when compared to Python standard results, the calculated error of the proposed hardware IP does not exceed $8 \times 10^{-34}$. Furthermore, this paper synthesizes the completed IP using the TSMC 65 nm process, with a total IP area of 2.1056 mm². Operating at a frequency of 300 MHz, its power is 22.4 mW. Finally, hardware implementation and resource analysis are conducted and compared on a Field Programmable Gate Array (FPGA). The results show that the improved algorithm trades a slight area increase for lower latency and higher accuracy. The designed hardware IP is expected to provide a more accurate and efficient computational tool for applications like image processing, thereby advancing technological development.

Keywords:

arcsinh; arccosh; CORDIC; hardware IP design

1. Introduction

In the current digital age, precision and efficiency in computation are crucial for scientific, engineering, and technological applications. Many mathematical functions, including hyperbolic inverse sine (arcsinh) and hyperbolic inverse cosine (arccosh), find extensive use in various domains. Accurate computation of these functions is essential for tasks such as signal processing, image processing, physical modeling, and engineering design. Ref. [1] applied the arcsinh function to estimate the velocity of robotic joints, resulting in high precision and stability, effectively suppressing measurement noise and highlighting the significance of arcsinh in the field of control. Ref. [2] established a parameter-based fuzzy model using the arcsinh distribution function to estimate the blur in single-focus lenses, thereby improving estimation precision. Ref. [3] employed a more reasonable normal sinh–arcsinh model to fit the blur kernel in image algorithms, providing a more accurate characterization of the Point Spread Function’s skewness, underscoring the importance of the hyperbolic inverse sine function in image processing. Whether it is speed estimation of robots in the field of control or fuzzy calculations in image processing, when transcendental functions such as arcsinh and arccosh are applied in ASIC chips, hardware circuits are required to process the data quickly and with high precision, so it is necessary to find the most suitable calculation method.

Traditional methods for calculating arcsinh and arccosh include look-up table, Taylor series expansion, polynomial-based approaches, and stochastic computation, among others [4], and researchers have conducted a series of studies on these methods. Ref. [5] conducted research and improvements on the look-up table method, introducing variants such as FR-dLUT and FR-dLUT with variable-length encoding. The improved LUT was designed and implemented on an FPGA, resulting in a 10% increase in both area and timing. Ref. [6] also employed the look-up table method for transcendental function calculation but with fixed-point processing to accelerate computation, albeit at the cost of some precision loss. Ref. [7] introduced a variable-precision Taylor polynomial to approximate sine and cosine functions, a method that can also be applied to the calculation of arcsinh and arccosh functions. Ref. [8] used Chebyshev spline interpolation-based methods to approximate the sine function, presenting a novel strategy distinct from other methods. Ref. [9] designed a high-speed transcendental function computation unit based on polynomial approximation, capable of processing single-precision floating-point numbers. It also introduced a hybrid max-min algorithm to reduce the operand bit width while maintaining a certain level of precision, along with an improved Wallace tree structure to minimize hardware circuit area overhead. Ref. [10] employed parabolic interpolation to reduce the space occupied by look-up table and designed a hardware accelerator for transcendental function computation, validating the practicality of the accelerator using the logarithmic function. Upon analyzing the aforementioned literature, it becomes evident that table lookup-based methods incur significant area overhead in hardware implementation. The area cost would be substantial when computing double or quadruple precision floating-point arcsinh and arccosh functions. 
Although polynomial approximation-based methods have some advantages in terms of area, their precision is insufficient for high-precision applications. Given these issues, this study delves into high-precision calculations of arcsinh and arccosh functions based on the Coordinate Rotation Digital Computer (CORDIC) algorithm.

Ref. [11] initially introduced a CORDIC algorithm suitable for transcendental function computations, which has now become a commonly used technique for computing various trigonometric functions. The core idea of this algorithm is to approximate the target function value using iterative rotation operations. Due to its hardware-friendliness and scalability, CORDIC has found extensive applications in fields like digital signal processing, communication systems, and embedded systems. Researchers have also conducted numerous studies on it. In 1971, J.S. Walther unified the algorithm, demonstrating that simple modifications enable it to perform multiplication, division, sinh, cosh, and arctanh, among other functions [12]. To enhance the execution efficiency of the CORDIC algorithm, Ref. [13] introduced a fast sine and cosine wave generation algorithm based on the concepts of pipelining and multiplexing. This algorithm was implemented on an FPGA, reducing latency while minimizing design area. However, the research was based on an 8-bit CORDIC algorithm, resulting in relatively low precision. Ref. [14] proposed a single-precision floating-point THV-CORDIC algorithm that employs a pipelined structure for computing the base-2 logarithm and allows for configurable iteration counts, balancing precision and area as needed. Ref. [15] introduced a low-power Bfloat16 CORDIC algorithm with a small 7-bit significand for computed floating-point numbers. While it achieved low power and area consumption, its precision is insufficient for high-precision applications. Ref. [16] presented a cycle-optimized Radix-4 CORDIC algorithm, which required only three iterations for computing 32-bit single-precision floating-point numbers, significantly reducing computation unit latency. Ref. [17] improved the calculation of single-precision floating-point sine and cosine functions, introducing a three-stage algorithm comprising a ROM table, CORDIC rotations, and an approximation network, reducing calculation latency to 15 cycles. Ref. [18] enhanced the 64-bit double-precision CORDIC algorithm by encoding two bits of the input angle simultaneously, reducing iteration counts, and effectively lowering hardware circuit area. Ref. [19] eliminated the ROM storing arctangent values in the CORDIC algorithm and used a carry-lookahead adder to speed up computation. The improved hardware structure reduced latency by 39% when implemented on an FPGA. Ref. [20] conducted experiments with different iteration counts for CORDIC algorithms of varying bit lengths, revealing that, to reduce circuit power consumption, iteration counts can be slightly lower than the operand bit length. Ref. [21] applied an improved radix-16 CORDIC algorithm to direct digital frequency synthesis, reducing iteration counts from N to N/2+2 and significantly decreasing system latency. Ref. [22] implemented CORDIC algorithms for quadruple-precision sine and cosine functions, improving computation precision to some extent. However, it required 113 iterations per calculation, resulting in increased computation latency.

Upon a comprehensive analysis of the aforementioned references, it is evident that current research on the CORDIC algorithm primarily focuses on single-precision and double-precision floating-point operations, with very limited attention to the quadruple-precision CORDIC algorithm. Furthermore, to our knowledge, there is virtually no research dedicated to higher-precision floating-point arcsinh and arccosh computations. Consequently, this paper conducts research and improvements on the quadruple-precision CORDIC algorithm for arcsinh and arccosh functions, along with the implementation of hardware circuits. This work addresses the existing gap in high-precision CORDIC algorithms for inverse hyperbolic sine and cosine functions, meeting the demands for precision and efficiency in modern scientific and engineering applications such as image blur. The primary contributions of this paper are as follows:

An analysis of the CORDIC algorithm in the hyperbolic coordinate system is conducted, presenting a new CORDIC algorithm that synchronizes rotation factors and employs branch parallel processing. This algorithm reduces latency without sacrificing precision;
The 128-bit high-precision floating-point arcsinh and arccosh functions are decomposed, computed using an improved CORDIC algorithm, and implemented in hardware circuits;
The hardware circuits for arcsinh and arccosh functions are implemented on an FPGA and subjected to logic synthesis and resource analysis in the TSMC 65 nm process. This work offers a novel solution for the high-precision ASIC chip design of transcendental functions.

The organization of this paper is as follows: In Section 2, the decomposition of arcsinh and arccosh functions is performed, and the CORDIC algorithm in the hyperbolic coordinate system is introduced, along with discussions on its convergence region. Subsequently, improvements are made to the CORDIC algorithm, presenting strategies for synchronized rotation factor prediction and branch parallel processing. In Section 3, hardware circuit implementations are carried out for each component of the decomposed arcsinh and arccosh functions. In Section 4, the designed hardware IP is subjected to simulation verification, ASIC implementation, and logical synthesis. Additionally, performance comparisons are conducted on the FPGA platform. Finally, conclusions are presented in Section 5.

2. CORDIC Algorithm and Enhancements for Arcsinh and Arccosh Functions

2.1. Principle of CORDIC Algorithm in Hyperbolic Coordinate System

In the IEEE-754 floating-point standard, single precision, double precision, and quadruple precision floating-point representations are specified [23]. Single and double precision floating-point numbers are commonly used in situations where precision requirements are not as high. However, in highly demanding applications, quadruple precision floating-point numbers play an extremely important role. The structure of the standard quadruple precision floating-point number is consistent with the other two types and includes the sign bit S, the exponent bits E, and the mantissa bits M, as shown in Figure 1. The sign bit is the most significant bit, represented by a single 0 or 1. The exponent bits follow the sign bit and consist of 15 bits, with a range from 15’h0000 to 15’h7fff. Here, bias denotes the offset applied to the exponent. In quadruple precision floating-point numbers, the bias is 16,383, which allows for an actual exponent range of −16,383∼16,384. Finally, the last 112 bits represent the mantissa bits, and the number of mantissa bits essentially determines the precision of the floating-point number.
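The field layout described above can be illustrated with a short Python sketch. It is not the paper's hardware: the function names `split_binary128` and `significand_113` are hypothetical, and the hidden-bit handling simply follows the usual IEEE-754 convention (1 for normal numbers, 0 when the exponent field is zero).

```python
BIAS = 16383  # exponent bias of the quadruple-precision format


def split_binary128(bits: int):
    """Split a 128-bit pattern into (sign, biased exponent, 112-bit fraction)."""
    assert 0 <= bits < (1 << 128)
    sign = bits >> 127                      # 1 bit
    exponent = (bits >> 112) & 0x7FFF       # 15 bits
    fraction = bits & ((1 << 112) - 1)      # 112 bits
    return sign, exponent, fraction


def significand_113(exponent: int, fraction: int) -> int:
    """Prepend the hidden bit, giving the 113-bit significand."""
    hidden = 0 if exponent == 0 else 1      # 0 for zero/subnormal inputs
    return (hidden << 112) | fraction


# 1.0 is encoded with sign 0, exponent field equal to the bias, fraction 0.
one = BIAS << 112
s, e, m = split_binary128(one)
print(s, e - BIAS, hex(significand_113(e, m)))
```

The 113-bit significand (112 stored bits plus the hidden bit) is what the 113-bit precision target refers to.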

When implementing the CORDIC algorithm for 128-bit arcsinh and arccosh functions, the formulas for hyperbolic arcsine and arccosine must first be rewritten as shown in Equation (1), because the iterative form of the CORDIC algorithm, in either the rotational or the hyperbolic coordinate system, cannot produce the values of arcsinh and arccosh directly. This converts the problem into evaluating a logarithmic function.

$$
\begin{cases}
\operatorname{arcsinh}(x) = \ln\left(x + \sqrt{x^{2} + 1}\right) \\
\operatorname{arccosh}(x) = \ln\left(x + \sqrt{x^{2} - 1}\right)
\end{cases}
\tag{1}
$$

Similarly, since the logarithmic function cannot be obtained directly through the iterative form of the CORDIC algorithm in either the rotational or the hyperbolic coordinate system, it is rewritten once more, as shown in Equation (2), which converts the problem into evaluating the arctanh function.

$$
\ln(x) = 2 \operatorname{arctanh}\left(\frac{x - 1}{x + 1}\right)
\tag{2}
$$
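The two rewriting steps can be checked numerically. The following is a minimal double-precision sketch using Python's `math` module; the helper names are hypothetical and the code only illustrates the identities in Equations (1) and (2), not the 128-bit datapath.

```python
import math


def arcsinh_via_ln(x: float) -> float:
    """Equation (1), first line."""
    return math.log(x + math.sqrt(x * x + 1.0))


def arccosh_via_ln(x: float) -> float:
    """Equation (1), second line (requires x >= 1)."""
    return math.log(x + math.sqrt(x * x - 1.0))


def ln_via_arctanh(x: float) -> float:
    """Equation (2) (requires x > 0)."""
    return 2.0 * math.atanh((x - 1.0) / (x + 1.0))


for x in (1.5, 2.0, 7.25):
    print(x,
          arcsinh_via_ln(x) - math.asinh(x),
          arccosh_via_ln(x) - math.acosh(x),
          ln_via_arctanh(x) - math.log(x))
```

Each printed difference should be at the level of double-precision rounding error.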

For the CORDIC algorithm in the hyperbolic coordinate system, the iterative form is given by Equation (3), where $(X_{i}, Y_{i})$ is the vector after the i-th iteration, $(X_{i+1}, Y_{i+1})$ is the vector after the $(i+1)$-th iteration, and $Z_{i}$ and $Z_{i+1}$ are the angle values after the i-th and $(i+1)$-th iterations, respectively. $\rho_{i}$ is the rotation factor of the vector during the i-th approximation and takes a value in $\{-1, 1\}$, where 1 indicates a counterclockwise rotation and −1 indicates a clockwise rotation. $\theta_{i}$ is the change in angle during the i-th iteration and satisfies $\theta_{i} = \operatorname{arctanh}(2^{-i})$. As the number of iterations increases, the angle step $\theta_{i}$ shrinks, so the result becomes more precise.

$$
\begin{cases}
X_{i+1} = X_{i} + \rho_{i} Y_{i} 2^{-i} \\
Y_{i+1} = Y_{i} + \rho_{i} X_{i} 2^{-i} \\
Z_{i+1} = Z_{i} - \rho_{i} \theta_{i}
\end{cases}
\tag{3}
$$

For the vector mode in the hyperbolic coordinate system, the main idea is to rotate the initial vector $(X_{0}, Y_{0})$ step by step so that, after n iterations, the rotated vector coincides with the x-axis, that is, $Y_{n} \to 0$. At this point, the outputs of the three iterative channels $(X, Y, Z)$ are as shown in Equation (4). Here, $K_{n} = \prod_{i=1}^{n} \frac{1}{\sqrt{1 - 2^{-2i}}}$, where the index starts at $i = 1$ because the factor is undefined for $i = 0$.

$$
\begin{cases}
X_{n} = \frac{1}{K_{n}} \sqrt{X_{0}^{2} - Y_{0}^{2}} \\
Y_{n} = 0 \\
Z_{n} = Z_{0} + \operatorname{arctanh}\left(\frac{Y_{0}}{X_{0}}\right)
\end{cases}
\tag{4}
$$

According to Equation (4), giving the three channels $(X, Y, Z)$ different initial values produces different outputs after the iteration. When the initial value is $(X_{0}, Y_{0}, Z_{0}) = (x, y, 0)$, the $Z_{n}$ after the iteration is the hyperbolic arctangent value $\operatorname{arctanh}(\frac{y}{x})$. Therefore, to compute the logarithmic function in Equation (2) with the CORDIC algorithm, we can set the initial value to $(X_{0}, Y_{0}, Z_{0}) = (x + 1, x - 1, 0)$. The $Z_{n}$ after the iteration is then the hyperbolic arctangent value $\operatorname{arctanh}(\frac{x - 1}{x + 1})$. Multiplying the result by 2 gives the value of the logarithmic function $\ln(x)$.
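The vectoring-mode computation of $\ln(x)$ can be sketched in double precision as follows. This is an illustrative model, not the paper's circuit. Two assumptions are not stated in the source: the iteration index starts at 1, and iterations 4, 13, and 40 are executed twice, which is the conventional way to guarantee convergence of hyperbolic CORDIC. The names `hyperbolic_schedule` and `cordic_ln` are hypothetical.

```python
import math


def hyperbolic_schedule(n: int):
    """Iteration indices 1..n, with 4, 13 and 40 repeated (assumed schedule)."""
    indices = []
    for i in range(1, n + 1):
        indices.append(i)
        if i in (4, 13, 40):
            indices.append(i)
    return indices


def cordic_ln(x: float, n: int = 50) -> float:
    """ln(x) from Equation (3) in vectoring mode, started at (x + 1, x - 1, 0)."""
    X, Y, Z = x + 1.0, x - 1.0, 0.0
    for i in hyperbolic_schedule(n):
        rho = -1.0 if Y > 0 else 1.0        # rotate so that Y moves toward 0
        t = 2.0 ** -i
        X, Y = X + rho * Y * t, Y + rho * X * t
        Z -= rho * math.atanh(t)            # theta_i = arctanh(2^-i)
    return 2.0 * Z                          # ln(x) = 2 * arctanh((x-1)/(x+1))


print(cordic_ln(2.0), math.log(2.0))
```

In hardware the values $\operatorname{arctanh}(2^{-i})$ would come from a table and the multiplications by $2^{-i}$ would be shifts; the floating-point calls here only stand in for them.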

According to Equations (2) and (3), during the iteration process channel Z changes by $\operatorname{arctanh}(2^{-i})$ at each step. As the number of iterations approaches infinity, the range of $Z_{n}$ satisfies Equation (5). Therefore, the range of $y / x$ can be found to satisfy Equation (6).

$$
-1.1182 \leq Z_{n} = \operatorname{arctanh}\left(\frac{x - 1}{x + 1}\right) = -\sum_{i=1}^{\infty} \rho_{i} \operatorname{arctanh}(2^{-i}) \leq 1.1182
\tag{5}
$$

$$
-0.80694 \leq \frac{y}{x} \leq 0.80694
\tag{6}
$$

From this, it can be derived that when computing the $\ln(x)$ function using the CORDIC algorithm, the range of x should be restricted within $(0.11, 9.51)$.
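The bounds can be reproduced numerically with a few lines of Python. This is a sketch under an assumption the source does not state: the figure 1.1182 is obtained when iterations 4, 13, and 40 are counted twice (the conventional hyperbolic CORDIC schedule), whereas the plain sum over $i \geq 1$ is closer to 1.0555. The function name `max_angle` is hypothetical.

```python
import math


def max_angle(n: int = 60, repeats=()):
    """Sum of arctanh(2^-i) for i = 1..n, counting the listed indices twice."""
    total = 0.0
    for i in range(1, n + 1):
        theta = math.atanh(2.0 ** -i)
        total += theta * (2 if i in repeats else 1)
    return total


plain = max_angle()                             # no repeated iterations
with_repeats = max_angle(repeats=(4, 13, 40))   # assumed conventional schedule
ratio = math.tanh(with_repeats)                 # bound on |y / x|, Equation (6)
x_hi = (1.0 + ratio) / (1.0 - ratio)            # from (x - 1)/(x + 1) = ratio
x_lo = 1.0 / x_hi
print(plain, with_repeats, ratio, x_lo, x_hi)
```

Under these assumptions the bounds on x evaluate to roughly 0.107 and 9.36, so the interval quoted in the text is best read as approximate.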

2.2. Improved Low-Latency CORDIC Algorithm

In the traditional CORDIC algorithm, computing 128-bit arcsinh and arccosh functions to 113-bit precision requires 113 iteration cycles. Additionally, the rotation factor $\rho_{i}$ used in the $(i+1)$-th iteration depends entirely on where $Y_{i}$ lies relative to the x-axis after the i-th iteration, so the iterations must be executed strictly in sequence. As a result, results cannot be delivered fast enough for real-time image processing and similar applications. Therefore, this paper introduces a CORDIC algorithm with branch-parallel processing and synchronized rotation-factor prediction to reduce the number of iteration cycles and, consequently, the latency.

In Equation (3), only one iteration step is performed in each period. In order to reduce the total number of iterations in the algorithm, this paper integrates four consecutive iteration steps into one iteration. Taking channel Y as an example, the result after four iterations is shown in Equation (7), where $Y_{i+1}$, $Y_{i+2}$, $Y_{i+3}$, and $Y_{i+4}$ are the values after the $(i+1)$-th, $(i+2)$-th, $(i+3)$-th, and $(i+4)$-th iterations, respectively. $\rho_{i}$, $\rho_{i+1}$, $\rho_{i+2}$, and $\rho_{i+3}$ are the rotation factors for the four iterations.

$$
\begin{aligned}
Y_{i+1} &= Y_{i} + \rho_{i} X_{i} 2^{-i} \\
Y_{i+2} &= Y_{i+1} + \rho_{i+1} X_{i+1} 2^{-(i+1)} \\
&= Y_{i} + \rho_{i} X_{i} 2^{-i} + \rho_{i+1} X_{i} 2^{-(i+1)} + \rho_{i+1} \rho_{i} Y_{i} 2^{-(2i+1)} \\
Y_{i+3} &= Y_{i+2} + \rho_{i+2} X_{i+2} 2^{-(i+2)} \\
&= \left(Y_{i+1} + \rho_{i+1} X_{i+1} 2^{-(i+1)}\right) + \rho_{i+2} \left(X_{i+1} + \rho_{i+1} Y_{i+1} 2^{-(i+1)}\right) 2^{-(i+2)} \\
&= Y_{i+1} + \left[X_{i} + \rho_{i} Y_{i} 2^{-i} + \rho_{i+1} \left(Y_{i} + \rho_{i} X_{i} 2^{-i}\right) 2^{-(i+1)}\right] \rho_{i+2} 2^{-(i+2)} + \rho_{i+1} X_{i+1} 2^{-(i+1)} \\
Y_{i+4} &= \Big\{1 + \rho_{i+3} \rho_{i+2} \rho_{i+1} \rho_{i} 2^{-(4i+6)} + \big[16 \rho_{i+1} \rho_{i} + 8 \rho_{i+2} \rho_{i} + 4 (\rho_{i+2} \rho_{i+1} + \rho_{i+3} \rho_{i}) \\
&\qquad + 2 \rho_{i+3} \rho_{i+1} + \rho_{i+3} \rho_{i+2}\big] 2^{-(2i+5)}\Big\} \cdot Y_{i} + \big[(8 \rho_{i} + 4 \rho_{i+1} + 2 \rho_{i+2} + \rho_{i+3}) 2^{-(i+3)} \\
&\qquad + (8 \rho_{i+2} \rho_{i+1} \rho_{i} + 4 \rho_{i+3} \rho_{i+1} \rho_{i} + 2 \rho_{i+3} \rho_{i+2} \rho_{i} + \rho_{i+3} \rho_{i+2} \rho_{i+1}) 2^{-(3i+6)}\big] \cdot X_{i}
\end{aligned}
\tag{7}
$$

Similarly, the results for channel X and channel Z after four iterations are shown in Equations (8) and (9). Comparing Equations (7) and (8), it can be observed that the rotation factors to be predicted when calculating X and Y after four iterations are the same, and the two expressions have the same form with the roles of X and Y exchanged. This implies that the same circuit structure can be used for both channels in the subsequent hardware implementation.

$$
\begin{aligned}
X_{i+1} &= X_{i} + \rho_{i} Y_{i} 2^{-i} \\
X_{i+2} &= X_{i+1} + \rho_{i+1} Y_{i+1} 2^{-(i+1)} \\
&= X_{i} + \rho_{i} Y_{i} 2^{-i} + \rho_{i+1} Y_{i} 2^{-(i+1)} + \rho_{i+1} \rho_{i} X_{i} 2^{-(2i+1)} \\
X_{i+3} &= X_{i+2} + \rho_{i+2} Y_{i+2} 2^{-(i+2)} \\
&= \left(X_{i+1} + \rho_{i+1} Y_{i+1} 2^{-(i+1)}\right) + \rho_{i+2} \left(Y_{i+1} + \rho_{i+1} X_{i+1} 2^{-(i+1)}\right) 2^{-(i+2)} \\
&= X_{i+1} + \left[Y_{i} + \rho_{i} X_{i} 2^{-i} + \rho_{i+1} \left(X_{i} + \rho_{i} Y_{i} 2^{-i}\right) 2^{-(i+1)}\right] \rho_{i+2} 2^{-(i+2)} + \rho_{i+1} Y_{i+1} 2^{-(i+1)} \\
X_{i+4} &= \Big\{1 + \rho_{i+3} \rho_{i+2} \rho_{i+1} \rho_{i} 2^{-(4i+6)} + \big[16 \rho_{i+1} \rho_{i} + 8 \rho_{i+2} \rho_{i} + 4 (\rho_{i+2} \rho_{i+1} + \rho_{i+3} \rho_{i}) \\
&\qquad + 2 \rho_{i+3} \rho_{i+1} + \rho_{i+3} \rho_{i+2}\big] 2^{-(2i+5)}\Big\} \cdot X_{i} + \big[(8 \rho_{i} + 4 \rho_{i+1} + 2 \rho_{i+2} + \rho_{i+3}) 2^{-(i+3)} \\
&\qquad + (8 \rho_{i+2} \rho_{i+1} \rho_{i} + 4 \rho_{i+3} \rho_{i+1} \rho_{i} + 2 \rho_{i+3} \rho_{i+2} \rho_{i} + \rho_{i+3} \rho_{i+2} \rho_{i+1}) 2^{-(3i+6)}\big] \cdot Y_{i}
\end{aligned}
\tag{8}
$$

$$
\begin{aligned}
Z_{i+1} &= Z_{i} - \rho_{i} \theta_{i} \\
Z_{i+2} &= Z_{i+1} - \rho_{i+1} \theta_{i+1} = Z_{i} - \rho_{i} \theta_{i} - \rho_{i+1} \theta_{i+1} \\
Z_{i+3} &= Z_{i+2} - \rho_{i+2} \theta_{i+2} = Z_{i} - \rho_{i} \theta_{i} - \rho_{i+1} \theta_{i+1} - \rho_{i+2} \theta_{i+2} \\
Z_{i+4} &= Z_{i+3} - \rho_{i+3} \theta_{i+3} = Z_{i} - (\rho_{i} \theta_{i} + \rho_{i+1} \theta_{i+1} + \rho_{i+2} \theta_{i+2} + \rho_{i+3} \theta_{i+3})
\end{aligned}
\tag{9}
$$
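The merged four-step expressions for the X and Y channels can be checked against four sequential applications of Equation (3). The sketch below uses exact rational arithmetic from Python's `fractions` module; the function names `four_steps` and `merged` are hypothetical and the test values are arbitrary.

```python
from fractions import Fraction
from itertools import product


def four_steps(X, Y, rhos, i):
    """Apply Equation (3) four times, for iterations i, i+1, i+2, i+3."""
    for k, rho in enumerate(rhos):
        t = Fraction(1, 2 ** (i + k))
        X, Y = X + rho * Y * t, Y + rho * X * t
    return X, Y


def merged(X, Y, rhos, i):
    """Closed form of X_{i+4} and Y_{i+4} from Equations (7) and (8)."""
    r0, r1, r2, r3 = rhos

    def p(e):
        return Fraction(1, 2 ** e)          # 2^-e

    same = (1 + r3 * r2 * r1 * r0 * p(4 * i + 6)
            + (16 * r1 * r0 + 8 * r2 * r0 + 4 * (r2 * r1 + r3 * r0)
               + 2 * r3 * r1 + r3 * r2) * p(2 * i + 5))
    cross = ((8 * r0 + 4 * r1 + 2 * r2 + r3) * p(i + 3)
             + (8 * r2 * r1 * r0 + 4 * r3 * r1 * r0
                + 2 * r3 * r2 * r0 + r3 * r2 * r1) * p(3 * i + 6))
    return same * X + cross * Y, same * Y + cross * X


X0, Y0 = Fraction(3), Fraction(-2)
ok = all(four_steps(X0, Y0, rhos, i) == merged(X0, Y0, rhos, i)
         for i in (1, 5) for rhos in product((-1, 1), repeat=4))
print(ok)
```

Because the same two coefficients (`same` and `cross`) appear in both channels, a single coefficient generator can feed X and Y.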

From Equations (7) to (9), it can be observed that to obtain the values after four iterations at once, it is necessary to predict the four rotation factors $(\rho_{i}, \rho_{i+1}, \rho_{i+2}, \rho_{i+3})$ simultaneously. Since each factor is either −1 or 1, the combination $(\rho_{i}, \rho_{i+1}, \rho_{i+2}, \rho_{i+3})$ takes one of 16 values between $(-1, -1, -1, -1)$ and $(1, 1, 1, 1)$. The values of the rotation factors depend on whether $Y_{i}$ has crossed the x-axis after the previous iteration or is about to cross it. Therefore, in order to select the most appropriate combination $(\rho_{i}, \rho_{i+1}, \rho_{i+2}, \rho_{i+3})$ from the 16 possibilities, 16 branches of the Y channel need to be processed in parallel. After obtaining the possible results $Y_{0} \sim Y_{15}$, they are sorted by magnitude, and the $(\rho_{i}, \rho_{i+1}, \rho_{i+2}, \rho_{i+3})$ corresponding to the branch that is closest to but not across the x-axis is used as the rotation factor in this iteration. This completes the synchronized prediction of the rotation factor. Subsequently, the selected $(\rho_{i}, \rho_{i+1}, \rho_{i+2}, \rho_{i+3})$ is applied to channels X and Z, resulting in the values after this iteration, $(X_{i+4}, Y_{i+4}, Z_{i+4})$.

Considering the complexity of Equations (7) to (9), implementing hardware for all 16 branches would result in high circuit complexity and a large area. Therefore, the set of values for each rotation factor is changed from $\{-1, 1\}$ to $\{0, 1\}$. If one more rotation would make the vector cross the x-axis, the rotation factor is set to 0, indicating that no rotation is performed on the vector during this iteration. If one more rotation would not make the vector cross the x-axis, the rotation factor is set to 1, signifying a counterclockwise rotation during this iteration. Consequently, the rotation factor combination $(\rho_{i}, \rho_{i+1}, \rho_{i+2}, \rho_{i+3})$ now ranges from $(0, 0, 0, 0)$ to $(1, 1, 1, 1)$. This effectively simplifies Equations (7) to (9), and it restricts the vector $(x, y)$ to the range $y \leq 0$. Therefore, the convergence domain of the CORDIC algorithm for the logarithmic function, discussed in the previous section, must be reevaluated here.

Here is an example to illustrate the principle of the improved CORDIC algorithm. Assume that, among the 16 candidates $Y_{0} \sim Y_{15}$ corresponding to the combinations of $(\rho_{i}, \rho_{i+1}, \rho_{i+2}, \rho_{i+3})$, the $Y_{5}$ corresponding to $(0, 1, 0, 1)$ is negative and the $Y_{6}$ corresponding to $(0, 1, 1, 0)$ is positive. Then the rotation factor $(\rho_{i}, \rho_{i+1}, \rho_{i+2}, \rho_{i+3})$ selected in this iteration is $(0, 1, 0, 1)$. Subsequently, substituting $(0, 1, 0, 1)$ into Equations (8) and (9) yields the channel values $(X_{i+4}, Y_{i+4}, Z_{i+4})$ after this iteration. The same method is used for each subsequent iteration until the required accuracy is met, and the $Z_{i+4}$ after the last iteration is the final output result.

According to Equations (5) and (6) and the analysis above, when using the improved CORDIC algorithm to compute the logarithmic function, it is not only necessary for $\frac{x - 1}{x + 1}$ to fall within the range $(-0.80694, 0.80694)$, but also to ensure that $x - 1 \leq 0$. Therefore, in the improved CORDIC algorithm, the convergence domain for the logarithmic function is adjusted to $(0.11, 1)$. This is a crucial consideration when implementing hardware circuits for arcsinh and arccosh functions in the subsequent stages.

With this, the improvement of the quadruple-precision CORDIC algorithm is completed. When using the enhanced CORDIC algorithm to compute 128-bit floating-point numbers, only 32 iterations are required to achieve 113 bits of precision. This is a significant reduction in the number of iterations compared to the traditional CORDIC algorithm, leading to a substantial reduction in hardware circuit computation latency.

3. Implementation of Hardware IP for Low-Latency Arcsinh and Arccosh Functions

3.1. Top-Level Module

According to Equation (1), when calculating the arcsinh and arccosh functions, they first need to be transformed into the ln function. Therefore, when implementing the improved low-latency CORDIC algorithm in hardware, it is necessary to solve for $x + \sqrt{x^{2} + 1}$ and $x + \sqrt{x^{2} - 1}$ through a precomputation process. The overall block diagram of the hardware circuit for the arcsinh and arccosh functions is shown in Figure 2, which includes the data_preprocessing module, the preliminary_calculation module, the ln_CORDIC module, and the post_processing module. First, the data_preprocessing module decomposes the standard 128-bit floating-point input data input_num, obtaining the sign bit S, exponent part E, and mantissa part M. Simultaneously, it performs exceptional data detection on the input to determine whether further calculations are needed and returns exceptional signals asinh_exc and acosh_exc along with exceptional results asinh_exc_result and acosh_exc_result. The decomposed S, E, and M are input into the preliminary_calculation module, generating values for $x + \sqrt{x^{2} + 1}$ and $x + \sqrt{x^{2} - 1}$ as asinh_preliminary and acosh_preliminary. Subsequently, two ln_CORDIC modules use asinh_preliminary and acosh_preliminary as inputs and perform calculations using the improved CORDIC algorithm, resulting in fixed-point results for inverse hyperbolic sine and inverse hyperbolic cosine, which are asinh_round and acosh_round. Finally, the post_processing module converts the fixed-point results into standard quadruple-precision 128-bit floating-point numbers, asinh_result and acosh_result. Additionally, the exceptional signals from the data_preprocessing module determine the final outputs for the inverse hyperbolic sine, asinh_out, and inverse hyperbolic cosine values, acosh_out.

3.2. Data_Preprocessing Module

After obtaining the input data, the first step is to preprocess the data. The primary functions carried out by the preprocessing module include:

According to the format of quadruple-precision floating-point numbers in the IEEE standard, the 128-bit data is decomposed into the sign bit (out_s), the exponent part (out_e), and the mantissa part (out_m). Simultaneously, depending on the value of the exponent part, a compensatory precision bit of 1 or 0 is added in front of the mantissa part;
Based on the exponent part and mantissa part, an assessment for data anomalies is conducted, resulting in the output of exceptional results and flags.

It should be noted that, for the arccosh function, the input data should fall within the range $[1, +\infty)$. As for the arcsinh function, when the input data is small enough, the output can be considered equal to the input. In this paper, the condition $E \leq 16326$ on the exponent part of the input floating-point number is used as the criterion for this determination. During exceptional handling, the arcsinh function can yield five exceptional results. In hardware circuit implementation, the decision-making process is depicted in Figure 3. Similarly, the arccosh function can result in five exceptional scenarios, and in hardware circuit implementation, the decision-making process is illustrated in Figure 4.
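The source does not spell out why the threshold $E \leq 16326$ is sufficient. One plausible justification, offered here as an assumption rather than the authors' stated reasoning, follows from the Taylor series of arcsinh and the bias of 16,383:

```latex
\begin{aligned}
E \leq 16326 \;&\Rightarrow\; E - 16383 \leq -57 \;\Rightarrow\; |x| < 2 \cdot 2^{-57} = 2^{-56}, \\
\operatorname{arcsinh}(x) &= x - \frac{x^{3}}{6} + O(x^{5}), \\
\left|\frac{\operatorname{arcsinh}(x) - x}{x}\right| &\approx \frac{x^{2}}{6} < \frac{2^{-112}}{6} < 2^{-114}.
\end{aligned}
```

Under this reading, the relative deviation between $\operatorname{arcsinh}(x)$ and $x$ is smaller than half a unit in the last place of a 113-bit significand, so returning the input unchanged does not affect the rounded result.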

3.3. Preliminary_Calculation Module

According to Equation (1), after preprocessing the input data and before using the CORDIC algorithm to compute the logarithmic function, it is necessary to calculate $x + \sqrt{x^{2} + 1}$ and $x + \sqrt{x^{2} - 1}$. Therefore, this paper designs a Preliminary_calculation module to expedite the overall function calculation. Its structure is depicted in Figure 5. First, the Mantissa_square submodule squares the mantissa part with the added compensatory precision bit and shifts the exponent part left by one position. Subsequently, the 128_normalization module integrates the sign bit, exponent part, and mantissa part and converts them into a standard 128-bit floating-point number, yielding the value $x^{2}$. Afterward, $x^{2}$ is incremented and decremented by 1, and the two results serve as inputs to the 128_sqrt module, which calculates the square root of the input data. It selects the current output asinh_sqrt and acosh_sqrt based on the completed calculation flag signal sqrt_done. Finally, the stable outputs lk_asinh_sqrt and lk_acosh_sqrt are summed with the input 128-bit floating-point number x, resulting in the values $x + \sqrt{x^{2} + 1}$ and $x + \sqrt{x^{2} - 1}$ as asinh_preliminary and acosh_preliminary, along with a flag signal pre_done indicating whether the current precomputation is complete.

In this paper, the 128_sqrt floating-point square root calculation module is based on the Goldschmidt method, and its block diagram is presented in Figure 6. Initially, the Sqrt_preconfig module performs exceptional detection and preprocessing on the input floating-point number. It decomposes the input into the sign (input_sign), exponent (input_exp), and mantissa (input_m). Subsequently, exceptional detection is carried out, and if the input data (input) is 0, negative, infinite, or NaN, an exceptional signal (exception) is generated. Furthermore, the unbiased exponent value (exp_unbiased) is calculated based on the input_exp. Then, based on exp_unbiased, the module checks whether the exponent of the input floating-point number is even or odd, and it adjusts the mantissa part accordingly, increasing it from 113 bits to 116 bits to facilitate the grouped calculations in the subsequent Sqrt_mantissa_calculation module.

The Sqrt_mantissa_calculation module iterates the input raw_data using the Goldschmidt method, employing the following iteration equation:

$$
\begin{cases}
r_{i} = 1.5 - x_{i-1} \cdot h_{i-1} \\
x_{i} = x_{i-1} \cdot r_{i} \\
h_{i} = h_{i-1} \cdot r_{i}
\end{cases}
\tag{10}
$$

To ensure that the iterated value $x_{i}$ converges to the square root of X, the initial values for the iteration are set as shown in Equation (11) in this paper.

$$
\begin{cases}
r_{0} \approx \frac{1}{\sqrt{X}} \\
x_{0} = X \cdot r_{0} \\
h_{0} = 0.5 \cdot r_{0}
\end{cases}
\tag{11}
$$

In this context, $r_{0}$ is an approximation of $1 / \sqrt{X}$. During circuit implementation, it can be obtained through a lookup table. After n iterations, $x_{n} \approx \sqrt{X}$ is obtained. The state machine for the iteration process is depicted in Figure 7.
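Equations (10) and (11) can be exercised with a short software model. The sketch below uses Python's `decimal` module at 50 significant digits as a stand-in for the wide fixed-point datapath; the function name `goldschmidt_sqrt`, the three-digit seed, and the iteration count are illustrative assumptions, not values taken from the paper.

```python
from decimal import Decimal, getcontext


def goldschmidt_sqrt(X, r0, iterations=5, digits=50):
    """Square root of X by the Goldschmidt iteration of Equations (10) and (11)."""
    getcontext().prec = digits
    X = Decimal(X)
    r = Decimal(r0)                 # r_0: coarse estimate of 1/sqrt(X)
    x = X * r                       # x_0 = X * r_0
    h = Decimal("0.5") * r          # h_0 = 0.5 * r_0
    for _ in range(iterations):
        r = Decimal("1.5") - x * h  # r_i = 1.5 - x_{i-1} * h_{i-1}
        x = x * r                   # x_i -> sqrt(X)
        h = h * r                   # h_i -> 1 / (2 * sqrt(X))
    return x


# Seed 0.707 plays the role of a lookup-table entry for 1/sqrt(2).
root = goldschmidt_sqrt(2, "0.707")
print(root)
print(Decimal(2).sqrt())
```

Each pass roughly doubles the number of correct digits, which is why a coarse table entry and a handful of iterations are enough to reach a wide significand.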

When the current state c_state is 1’b0, if exception_in is not equal to 2’b11 (indicating an exceptional input), then it is set to 1’b1, the counter cnt_next is set to 0, and the next state n_state is 1’b0. If exception_in is equal to 2’b11 (indicating normal input), then it is set to 1’b0, the counter cnt_next is incremented by 1, and the next state n_state is 1’b1. When c_state is 1’b1, if cnt is less than or equal to 4’d4, then it is set to 1’b0, the counter cnt_next is incremented by 1, and the next state n_state is 1’b1. If cnt is greater than 4’d4, then it is set to 1’b1, the counter cnt_next is reset to zero, and the next state n_state is 1’b0. The data path during the iteration process is illustrated in Figure 8.
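The transitions described above can be written as a small software model. The text does not name the signal that is set to 1’b1 or 1’b0, so the sketch calls it `done`, which is a hypothetical name; `fsm_step` and `cycles_until_done` are likewise illustrative helpers, and the model follows only the transitions as worded in the paragraph.

```python
NORMAL = 0b11  # exception_in value that the text associates with a normal input


def fsm_step(c_state, cnt, exception_in):
    """One clock edge; returns (done, cnt_next, n_state). `done` is a hypothetical name."""
    if c_state == 0:
        if exception_in != NORMAL:      # exceptional input: finish immediately
            return 1, 0, 0
        return 0, cnt + 1, 1            # normal input: start iterating
    if cnt <= 4:                        # still iterating
        return 0, cnt + 1, 1
    return 1, 0, 0                      # iterations complete: flag and reset


def cycles_until_done(exception_in):
    """Count clock edges from the idle state until the flag is raised."""
    state, cnt, cycles = 0, 0, 0
    while True:
        done, cnt, state = fsm_step(state, cnt, exception_in)
        cycles += 1
        if done:
            return cycles


print(cycles_until_done(0b11), cycles_until_done(0b00))
```

Under this reading, a normal input raises the flag on the sixth clock edge, while an exceptional input raises it on the first.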

After the completion of the iteration, the Sqrt_normalize module rounds the 128-bit square_root and outputs the standard 128-bit floating-point value square_root_result in combination with exp_in. Simultaneously, based on the input exceptional signal exception_in, the final output results are selected and adjusted.

3.4. Ln_CORDIC Module

After the precomputation, the 128-bit floating-point values asinh_preliminary and acosh_preliminary are obtained and used as inputs to two Ln_CORDIC modules. A floating-point number contains both an exponent and a mantissa, as shown in Equation (12), where M and e are the actual values of the mantissa and the exponent. This paper first separates the two, which transforms the evaluation of the ln function into Equation (13). The exponent part is then handled by a simple multiplication, and only the $\ln(M)$ part requires the CORDIC algorithm. According to Equation (2), the hardware only needs to implement $\operatorname{arctanh}\left(\frac{M - 1}{M + 1}\right)$ with CORDIC. The hardware circuit diagram of the Ln_CORDIC module is illustrated in Figure 9. The CORDIC_pre module decomposes the input floating-point data input_num into the mantissa m, extended with additional precision bits, and the actual value of the exponent e. CORDIC_iteration then takes m as input, performs synchronized rotation factor prediction with parallel branch processing, and after the final iteration produces the Z-channel output z_out. Multiplying z_out by 2 gives the integer value of $\ln(M)$. At the same time, the exponent e is complemented and multiplied by ln2, yielding a 163-bit value ln_e_m. To match bit lengths for the subsequent addition, the integer value of $\ln(M)$ is likewise extended to 163 bits, giving ln_m_m. Next, ln_m_m and ln_e_m are truncated and rounded. Finally, the sign is determined from the value of ln_e_s, and the two parts are added to obtain the integer results of arcsinh and arccosh, along with the sign of the computed result.

V = {(- 1)}^{S} \times M \times 2^{e}

(12)

\ln(V) = \ln(M) + e \times \ln 2

(13)
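Equations (12) and (13) can be checked with a few lines of double-precision Python. The sketch assumes a positive input and uses the identity ln(M) = 2 arctanh((M - 1)/(M + 1)), which matches the doubling of z_out described above. The function name `ln_by_decomposition` is hypothetical, and `math.atanh` stands in for the CORDIC stage.

```python
import math

def ln_by_decomposition(v: float) -> float:
    """Evaluate ln(v) as ln(M) + e * ln(2) with 1 <= M < 2."""
    if v <= 0.0:
        raise ValueError("this sketch only handles positive inputs")
    m, e = math.frexp(v)       # v = m * 2**e with 0.5 <= m < 1
    M, e = 2.0 * m, e - 1      # renormalize so that 1 <= M < 2
    ln_M = 2.0 * math.atanh((M - 1.0) / (M + 1.0))
    return ln_M + e * math.log(2.0)
```

Since M lies in [1, 2), the argument of arctanh stays below 1/3, so the iterative stage only has to cover a narrow range of angles while the exponent contributes through a single multiplication.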

The circuit structure of the CORDIC_iteration module is shown in Figure 10. The output m of the CORDIC_pre module corresponds to a floating-point value in the range $(1, 2)$. Following the convergence analysis of the improved CORDIC in Section 2.2, m is right-shifted by one bit so that it satisfies the convergence range $(0.11, 1)$. To meet the precision requirements, 27 bits are appended to the end of m. Additionally, to ensure that no overflow occurs during the iterative process, 8 reserved bits are prepended to m. Consequently, all three channels $(X, Y, Z)$ process 148-bit data ($113 + 27 + 8 = 148$). The Channel Y module computes the 16 possible branches in parallel and passes the data to the rotation factor prediction module, which determines the rotation factor $\rho$ for the current iteration by comparing the values of the 16 branches. The Channel X module then calculates the value X for the current iteration based on $\rho$. Simultaneously, the Channel Z module fetches the angle values for the current iteration from the lookup table, indexed by a counter that increments by 4, and calculates the current Z based on $\rho$. The iteration controller accumulates the iteration count. With the 27 compensatory precision bits, the nominal number of iterations is $\frac{113 + 27}{4} = 35$. In actual testing, however, only 32 iterations were found to be needed to meet the precision requirements. Therefore, the iteration is considered complete once the iteration count exceeds 32, at which point the output z_out of Channel Z represents the value $\operatorname{arctanh}\left(\frac{M - 1}{M + 1}\right)$.
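For orientation, the role of the three channels can be illustrated with the conventional radix-2 hyperbolic CORDIC in vectoring mode. The sketch below is explicitly not the improved algorithm of this paper: it resolves one rotation direction per step instead of predicting a rotation factor from 16 parallel branches, it works in double precision rather than 148-bit fixed point, and the function names are hypothetical. It does show how driving Y toward zero accumulates arctanh(Y/X) in Z.

```python
import math

REPEATED = {4, 13, 40}  # steps repeated in hyperbolic CORDIC to guarantee convergence

def atanh_cordic(y: float, x: float, steps: int = 52) -> float:
    """Conventional radix-2 hyperbolic vectoring: returns atanh(y / x) for x > 0."""
    z = 0.0
    for i in range(1, steps + 1):
        for _ in range(2 if i in REPEATED else 1):
            d = -1.0 if y >= 0.0 else 1.0      # rotate so that y moves toward 0
            t = 2.0 ** -i
            x, y = x + d * y * t, y + d * x * t
            z -= d * math.atanh(t)             # angle constant, a table entry in hardware
    return z

def ln_mantissa(M: float) -> float:
    """ln(M) for 1 <= M < 2 via 2 * atanh((M - 1) / (M + 1))."""
    return 2.0 * atanh_cordic(M - 1.0, M + 1.0)
```

Each radix-2 step resolves about one bit of the result, which is why a conventional 113-bit implementation needs on the order of 113 iterations, whereas resolving four bits per iteration brings the count down to the 32 to 35 range discussed above.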

3.5. Post_Processing Module

The post_processing module converts the fixed-point results asinh_round and acosh_round of the inverse hyperbolic sine and inverse hyperbolic cosine into standard 128-bit floating-point numbers. The hardware circuit diagram is shown in Figure 11. First, leading-zero detection is performed on the input 147-bit fixed-point numbers to determine the count of leading zeros, denoted as zero_cnt. Next, asinh_round and acosh_round are left-shifted by zero_cnt bits, and the shifted values are truncated and rounded so that the first 113 bits are retained. The most significant bit is then removed, leaving the remaining 112 bits as the mantissa of the standard floating-point number. For the exponent, since the 148-bit z_out from the CORDIC_iteration module has been extended to 163 bits with the addition of 14 leading zeros, the exponent is equal to 16'h3fff + 14 - zero_cnt. The parts are then concatenated, resulting in a 128-bit floating-point value. Finally, the input signals asinh_finish and acosh_finish indicate when the current calculation is complete, yielding the final floating-point values for the inverse hyperbolic sine and inverse hyperbolic cosine.
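The normalization described above can be sketched with Python integers. Several details here are assumptions made for illustration: the function name `fixed_to_binary128`, a 163-bit word with 15 integer bits (so that `INT_BITS - 1` reproduces the `+ 14` term of the exponent formula), round-half-up as the rounding rule, and a non-negative input. The actual widths and rounding mode of the circuit may differ.

```python
WIDTH = 163      # assumed width of the extended fixed-point word
INT_BITS = 15    # assumed integer bits, giving the "+ 14" in the exponent formula
BIAS = 0x3FFF    # binary128 exponent bias

def fixed_to_binary128(n: int) -> int:
    """Pack a non-negative fixed-point word into a binary128 bit pattern."""
    if n == 0:
        return 0
    if n.bit_length() > WIDTH:
        raise ValueError("input wider than the assumed word")
    zero_cnt = WIDTH - n.bit_length()          # leading-zero detection
    shifted = n << zero_cnt                    # leading one now sits at bit WIDTH - 1
    drop = WIDTH - 113
    kept = (shifted >> drop) + ((shifted >> (drop - 1)) & 1)  # keep 113 bits, round half up
    exponent = BIAS + (INT_BITS - 1) - zero_cnt
    if kept >> 113:                            # rounding carried out of the top bit
        kept >>= 1
        exponent += 1
    mantissa = kept & ((1 << 112) - 1)         # drop the leading one
    return (exponent << 112) | mantissa
```

Under these assumptions, the fixed-point pattern for 1.0 has 14 leading zeros, so the exponent field comes out as 0x3fff with an all-zero mantissa.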

Finally, the designed circuit is implemented in the Verilog language.

4. Simulation and Experimental Results

4.1. Analysis of Calculation Errors

To analyze the correctness of the proposed low-latency arcsinh and arccosh IP in this paper, the computation results from the Bigfloat library in Python are used as the standard. In the Bigfloat library, the precision of computations can be configured, and for this study, a precision of 128 bits is employed. Ten thousand sets of 128-bit test data are randomly generated as input for the arcsinh and arccosh functions. The results obtained using Bigfloat are compared with the results obtained in the ModelSim simulation environment, and the results are plotted. The curve and error of the arcsinh function are shown in Figure 12, and the curve and error of the arccosh function are shown in Figure 13.
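A reference model of this kind can also be sketched with the Python standard library alone. The example below is a stand-in for illustration, not the Bigfloat setup used in the paper: it uses the `decimal` module at 40 significant digits and the logarithmic identities arcsinh(x) = ln(x + sqrt(x^2 + 1)) and arccosh(x) = ln(x + sqrt(x^2 - 1)). The function names are hypothetical, and the arcsinh form as written is intended for non-negative x.

```python
from decimal import Decimal, getcontext

getcontext().prec = 40  # roughly 132 bits, above the 113-bit target precision

def asinh_ref(x: Decimal) -> Decimal:
    """Reference arcsinh for x >= 0 via ln(x + sqrt(x*x + 1))."""
    return (x + (x * x + 1).sqrt()).ln()

def acosh_ref(x: Decimal) -> Decimal:
    """Reference arccosh for x >= 1 via ln(x + sqrt(x*x - 1))."""
    if x < 1:
        raise ValueError("arccosh is only defined for x >= 1")
    return (x + (x * x - 1).sqrt()).ln()
```

A simple self-check is to feed the result back through the forward function, for example confirming that (exp(y) - exp(-y)) / 2 reproduces x for y = asinh_ref(x).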

To display the error more clearly, the scatter plots of the error are redrawn in Figures 14 and 15. Among the ten thousand randomly selected sets of data, the vast majority exhibit negligible differences between the results of the designed hardware IP and those obtained using Python's Bigfloat library. Only a few test cases show minor discrepancies, with the maximum error not exceeding $8 \times 10^{-34}$. This indicates that the IP designed in this paper achieves excellent computational precision.

Table 1 presents five sets of test data and their corresponding test results for the arcsinh function, while Table 2 displays five sets of test data and their test results for the arccosh function. The results demonstrate that the calculations performed by the hardware IP are in close alignment with those of Python’s calculations. Only a few isolated data points exhibit a discrepancy of 1 bit, suggesting that the precision of the designed hardware circuit is capable of reaching at least 112 bits.
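The 1-bit discrepancies can be quantified directly from the hexadecimal patterns in Tables 1 and 2. The sketch below uses hypothetical helper names and assumes both patterns are positive, normal binary128 values with the same exponent field, so that the integer difference of the patterns equals the difference in units in the last place (ULPs).

```python
def ulp_difference(a_hex: str, b_hex: str) -> int:
    """Distance in ULPs between two same-sign, same-exponent binary128 patterns."""
    return abs(int(a_hex, 16) - int(b_hex, 16))

def ulp_size(word_hex: str) -> float:
    """Absolute size of one ULP for a normal binary128 pattern."""
    exponent = (int(word_hex, 16) >> 112) & 0x7FFF
    return 2.0 ** (exponent - 0x3FFF - 112)

# Third row of Table 1: the IP result and the Python result differ in the last digit.
ip_result = "3fff3bb9231c2956bf9ee54fd17acd2b"
py_result = "3fff3bb9231c2956bf9ee54fd17acd2c"
difference = ulp_difference(ip_result, py_result)   # 1 ULP
```

For results with exponent field 0x4001, such as the first row of Table 1, one ULP is 2^-110, or about 7.7 × 10^-34, which lies just under the reported error bound.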

Figure 16 shows the simulation results and timing diagram of the arcsinh and arccosh functions in the ModelSim simulation environment.

4.2. Analysis of Hardware Implementation Results

To analyze the power consumption and area of the improved CORDIC algorithm and the hardware circuits for the low-latency 128-bit arcsinh and arccosh functions, the circuit is synthesized in the TSMC 65 nm process using the Synopsys Design Compiler tool. The operating frequency is set at 300 MHz, and the synthesis results are shown in Table 3. The overall area of the designed hardware IP is 2.1056 mm$^{2}$, and at an operating frequency of 300 MHz, the power is 22.4 mW.

To further analyze the complexity of the improved circuit, the hardware circuits for the arcsinh and arccosh functions are implemented on a Xilinx Artix-7 series FPGA using Vivado 2018.2. The resource consumption of the entire arcsinh and arccosh design on the FPGA is shown in Table 4.

To the best of our knowledge, no other researchers have yet studied complete 128-bit arcsinh and arccosh functions. The circuit's resource consumption is therefore compared only for the improved core module, the CORDIC algorithm module. The FPGA resources consumed by the core CORDIC module of the designed IP are presented in Table 5.

Table 6 compares the resource consumption of the improved CORDIC module in this paper with CORDIC algorithms from other papers. Paper [17] enhances the 32-bit floating-point CORDIC algorithm on an FPGA, requiring 15 iterations for one computation and consuming 1082 LUTs. In comparison, the CORDIC algorithm improved in this paper roughly doubles the number of iterations but increases precision by a factor of 4.7. Because the branches are computed in parallel during each iteration, it also consumes significantly more LUTs than Paper [17]. Paper [22] implements a 128-bit quadruple-precision floating-point CORDIC algorithm in hardware on an FPGA, which requires 113 iterations for one computation. In contrast, the improved CORDIC algorithm in this paper consumes 2.99 times the hardware resources of Paper [22], but it achieves a calculation precision of 113 bits with only 32 iterations. This reduces the iteration count, and hence the computational latency, by a factor of about 3.5, making the design advantageous for applications requiring high precision and low latency.
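The ratios quoted in this comparison follow directly from the entries of Table 6, as the short Python check below shows. The dictionary layout and key names are chosen only for this illustration.

```python
# Rows of Table 6: accuracy in bits, iterations per computation, LUTs used.
table6 = {
    "paper17":    {"accuracy": 24,  "iterations": 15,  "luts": 1082},
    "paper22":    {"accuracy": 113, "iterations": 113, "luts": 11007},
    "this_paper": {"accuracy": 113, "iterations": 32,  "luts": 32904},
}

ours = table6["this_paper"]
precision_gain_vs_17 = round(ours["accuracy"] / table6["paper17"]["accuracy"], 2)        # 4.71
iteration_ratio_vs_17 = round(ours["iterations"] / table6["paper17"]["iterations"], 2)   # 2.13
lut_ratio_vs_22 = round(ours["luts"] / table6["paper22"]["luts"], 2)                     # 2.99
iteration_gain_vs_22 = round(table6["paper22"]["iterations"] / ours["iterations"], 2)    # 3.53
```

The latency figure assumes that each iteration takes a comparable amount of time in both designs, since Table 6 lists iteration counts rather than clock periods.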

5. Conclusions

A new low-latency hardware IP structure for 128-bit floating-point arcsinh and arccosh functions is proposed in this paper. The architecture improves on the traditional CORDIC algorithm by using synchronized rotation factor prediction and parallel branch processing. The iteration count for the 128-bit CORDIC algorithm is reduced from 113 to 32, significantly lowering the hardware latency of the CORDIC algorithm while maintaining a calculation precision of at least 112 bits. In addition to the improved CORDIC algorithm, this paper implements the square root operation in hardware using the Goldschmidt iteration method. Simulation results of the designed arcsinh and arccosh hardware IP are compared with the standard results in Python, showing a maximum error not exceeding $8 \times 10^{-34}$. Furthermore, logic synthesis of the circuit in the TSMC 65 nm process at 300 MHz yields a total IP area of 2.1056 mm$^{2}$ and a power consumption of 22.4 mW. Finally, the designed hardware IP is implemented on an FPGA and compared with other research. The results indicate that the improved CORDIC algorithm in this paper effectively enhances hardware implementation efficiency by trading area for speed and precision.

The new high-precision, low-latency hardware IP for arcsinh and arccosh functions, proposed in this study, addresses the existing gap in CORDIC algorithm-based computations of hyperbolic functions. This hardware IP can be employed in high-precision and high-performance DSP and GPU applications, offering the potential to provide more accurate computational tools for digital signal processing, image processing, and other fields, thereby contributing to advancements in the technology domain. In the future, this work is intended to be expanded as follows:

The designed hardware IP for arcsinh and arccosh will be further optimized, with a focus on performance improvements in terms of area and power consumption;
The algorithm’s precision will be enhanced and made configurable, enabling its integration into various chipsets for different applications;
More experiments and tests will be conducted to ensure the hardware IP’s optimal performance.

Author Contributions

Conceptualization, M.W.; methodology, J.C.; software, J.C.; validation, J.C. and M.W.; writing—original draft preparation, J.C.; project administration, J.C. and M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Plan and Technology Research Project of Shenzhen under Grant No. JSGG20200831092401003.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

Author Junfeng Chang was employed by the company Shenzhen Semiconductor Industry Association. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Wang, S.; Li, Y.; Dong, B. A novel velocity estimation method for robotic joints based on inverse hyperbolic sine tracing differential algorithm. In Proceedings of the 2017 Chinese Automation Congress, Jinan, China, 20–22 October 2017.
2. Jang, J.; Yun, J.D.; Yang, S. Modeling Non-Stationary Asymmetric Lens Blur by Normal Sinh-Arcsinh Model. IEEE Trans. Image Process. 2016, 25, 2184–2195.
3. Zhan, D.; Zeng, X.; Li, W.; Liu, Y.; Xiong, Z. Blur kernel estimation using normal sinh-arcsinh model based on simple lens system. In Proceedings of the 2017 IEEE 19th International Workshop on Multimedia Signal Processing, Luton, UK, 1–6 October 2017.
4. Tang, P.T.P. Table-lookup algorithms for elementary functions and their error analysis. In Proceedings of the 10th IEEE Symposium on Computer Arithmetic, Grenoble, France, 26–28 June 1991.
5. Gener, Y.S.; Gören, S.; Ugurdag, H.F. Lossless Look-Up Table Compression for Hardware Implementation of Transcendental Functions. In Proceedings of the 2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration, Cuzco, Peru, 6–9 October 2019.
6. Nguyen, M.X.; Dinh-Duc, A.V. Hardware-based algorithm for Sine and Cosine computations using fixed point processor. In Proceedings of the 2014 11th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, Nakhon Ratchasima, Thailand, 1–6 May 2014.
7. Brunelli, C.; Berg, H.; Guevorkian, D. Approximating sine functions using variable-precision Taylor polynomials. In Proceedings of the 2009 IEEE Workshop on Signal Processing Systems, Tampere, Finland, 7–9 October 2009.
8. Popov, B. Nonlinear best Chebyshev approximations and splines. In Proceedings of the 6th International Conference on Mathematical Methods in Electromagnetic Theory (MMET’96), Lviv, Ukraine, 10–13 September 1996.
9. Tian, Z.; Fan, F.; Zhang, J.; Ren, X.; Yang, W. High-Speed Transcendental Function Operation Unit Design. In Proceedings of the 2022 IEEE 9th International Conference on Cyber Security and Cloud Computing, Xi’an, China, 25–27 June 2022.
10. Chen, J.; Liu, X. A High-Performance Deeply Pipelined Architecture for Elementary Transcendental Function Evaluation. In Proceedings of the 2017 IEEE International Conference on Computer Design, Boston, MA, USA, 5–8 November 2017.
11. Volder, J.E. The CORDIC Trigonometric Computing Technique. IRE Trans. Electron. Comput. 1959, 8, 330–334.
12. Walther, J. A unified algorithm for elementary functions. In Proceedings of the Spring Joint Computer Conference, Atlantic City, NJ, USA, 18–20 May 1971; pp. 379–385.
13. Chinnathambi, M.; Bharanidharan, N.; Rajaram, S. FPGA implementation of fast and area efficient CORDIC algorithm. In Proceedings of the 2014 International Conference on Communication and Network Technologies, Sivakasi, India, 18–19 December 2014.
14. Chen, H.; Cheng, K.; Lu, Z.; Fu, Y.; Li, L. Hyperbolic CORDIC-Based Architecture for Computing Logarithm and Its Implementation. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 2652–2656.
15. Mishra, S.M.; Shekhawat, H.S.; Trivedi, G.; Jan, P.; Nemec, Z. Design and Implementation of a Low Power Area Efficient Bfloat16 based CORDIC Processor. In Proceedings of the 2022 32nd International Conference Radioelektronika, Kosice, Slovakia, 1–6 April 2022.
16. Vinh, T.Q.; Thanh, T.B.; Viet, D.H. FPGA Implementation of Trigonometric Function Using Loop-Optimized Radix-4 CORDIC. In Proceedings of the 2022 9th NAFOSTED Conference on Information and Computer Science, Ho Chi Minh City, Vietnam, 31 October–1 November 2022.
17. Sergiyenko, A.; Moroz, L.; Mychuda, L.; Samotyj, V. FPGA Implementation of CORDIC Algorithms for Sine and Cosine Floating-Point Calculations. In Proceedings of the 2021 11th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, Cracow, Poland, 22–25 September 2021.
18. Juang, T.B. Low Latency Angle Recoding Methods for the Higher Bit-Width Parallel CORDIC Rotator Implementations. IEEE Trans. Circuits Syst. II Express Briefs 2008, 55, 1139–1143.
19. Jain, R.K.; Sharma, V.K.; Mahapatra, K.K. A new approach for high performance and efficient design of CORDIC processor. In Proceedings of the 2012 1st International Conference on Recent Advances in Information Technology, Dhanbad, India, 15–17 March 2012.
20. Sapper, A.N.; Soares, L.; Costa, E.; Bampi, S. Exploring the combination of number of bits and number of iterations for a power-efficient fixed-point CORDIC implementation. In Proceedings of the 2017 24th IEEE International Conference on Electronics, Circuits and Systems, Batumi, Georgia, 5–8 December 2017.
21. Changela, A.; Kumar, R. A Modified Radix-16 CORDIC Algorithm-based Direct Digital Frequency Synthesizer. In Proceedings of the 2022 5th International Conference on Contemporary Computing and Informatics, Uttar Pradesh, India, 14–16 December 2022.
22. Singh, A.K.; Singh, M.K.; Ray, K.C. Design and Implementation of Quadruple Floating-Point CORDIC. In Proceedings of the 2015 IEEE International Symposium on Nanoelectronic and Information Systems, Indore, India, 21–23 December 2015.
23. IEEE Std 754-2008; IEEE Standard for Floating-Point Arithmetic. IEEE: Piscataway, NJ, USA, 2008; pp. 1–70.

Figure 1. Format for quad-precision floating-point numbers.

Figure 2. Top-level structure diagram of low-latency arcsinh and arccosh hardware circuits.

Figure 3. Flow chart of abnormal situation judgment of arcsinh function.

Figure 4. Flow chart of abnormal situation judgment of arccosh function.

Figure 5. Structural block diagram of Preliminary_calculation module.

Figure 6. Structural block diagram of 128_sqrt module.

Figure 7. State machine of Goldschmidt iterative process.

Figure 8. Data path of Sqrt_mantissa_calculation module.

Figure 9. Hardware circuit structure diagram of Ln_CORDIC module.

Figure 10. Circuit structure diagram of CORDIC_iteration module.

Figure 11. Circuit structure diagram of Post_processing module.

Figure 12. Calculation results and errors of arcsinh function.

Figure 13. Calculation results and errors of arccosh function.

Figure 14. Calculation error of arcsinh function.

Figure 15. Calculation error of arccosh function.

Figure 16. Simulation timing in ModelSim.

Table 1. Simulation results from the designed IP and standard results from Python for the arcsinh function.

Input_num	Result from IP	Results from Python
40049c48799df21f8b7899d16dd1e0b0	400128ac3236f088986b53fb2017b38e	400128ac3236f088986b53fb2017b38e
4005389b2c7c4305b0d5c634e3e576a1	400143514fa11829eb20d48d3ff7740a	400143514fa11829eb20d48d3ff7740a
3fff92125bb82fc894f0cc84262761e1	3fff3bb9231c2956bf9ee54fd17acd2b	3fff3bb9231c2956bf9ee54fd17acd2c
40036bef97b5fb817f035279b094f793	4000e8b4d979a5ed7cac058eb314f7d8	4000e8b4d979a5ed7cac058eb314f7d8
4002dea9b97c11bcb714f3f3031eaf1b	4000b323961b35968fd40c84e677a015	4000b323961b35968fd40c84e677a015

Table 2. Simulation results from the designed IP and standard results from Python for the arccosh function.

Input_num	Result from IP	Results from Python
40049c48799df21f8b7899d16dd1e0b0	400128a91c97e76cabc195e6f1b18d03	400128a91c97e76cabc195e6f1b18d03
4005389b2c7c4305b0d5c634e3e576a1	4001434ff843e3956efdaa1091cc1e68	4001434ff843e3956efdaa1091cc1e68
4001321fb3328e3c1bd5114513059bb0	40001fa26273ebb760a1997e033803c4	40001fa26273ebb760a1997e033803c4
4000367b2620532b69445b33a1e8b8eb	3fff88a40b2fb18492b5d2518d3d05ae	3fff88a40b2fb18492b5d2518d3d05af
4002b0f1a5fb60f249cc6a2dba4b6b86	4000a5f8a88db52b94dbd54c5098903f	4000a5f8a88db52b94dbd54c5098903f

Table 3. Area and power consumption in TSMC 65 nm process.

Process	Frequency	Area	Power	Max Error
TSMC 65 nm	300 MHz	2.1056 mm$^{2}$	22.4 mW	1 bit

Table 4. Resource consumption of the entire arcsinh and arccosh functions on FPGA.

Logic Utilization	Used	Available	Utilization
LUT	513,528	547,600	93.78%
Reg	6152	1,095,200	0.56%
DSP	18	2520	0.71%
F7 MUX	174,395	328,200	53.14%
F8 MUX	83,860	164,100	51.10%

Table 5. Resource consumption of CORDIC module on FPGA.

Logic Utilization	Used	Available	Utilization
LUT	32,904	547,600	6.01%
Reg	1082	1,095,200	0.10%
DSP	9	2520	0.36%
F7 MUX	441	328,200	0.13%
F8 MUX	148	164,100	0.09%

Table 6. Comparison between the improved CORDIC module in this paper and other CORDIC modules.

CORDIC	Operand Width (bits)	Accuracy (bits)	Number of Iterations	Number of LUTs Used
Paper [17]	32	24	15	1082
Paper [22]	128	113	113	11,007
This paper	128	113	32	32,904

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
