Article

Structure and Principles of Operation of a Quaternion VLSI Multiplier

by Aleksandr Cariow 1,†, Mariusz Naumowicz 2,*,† and Andrzej Handkiewicz 3

1 Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, 70-310 Szczecin, Poland
2 Faculty of Computing, Poznań University of Technology, ul. Piotrowo 3a, 60-965 Poznań, Poland
3 Faculty of Technology, The Jacob of Paradise University, 66-400 Gorzów Wielkopolski, Poland
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2024, 14(18), 8123; https://doi.org/10.3390/app14188123
Submission received: 20 July 2024 / Revised: 27 August 2024 / Accepted: 4 September 2024 / Published: 10 September 2024

Abstract

The paper presents the original structure of a processing unit for multiplying quaternions. The idea of organizing the device is based on the use of fast Hadamard transform blocks. The operation principles of such a device are described. Compared to direct quaternion multiplication, the developed algorithm significantly reduces the number of multiplication and addition operations. Hardware implementations of the developed structure, in FPGA and ASIC, are presented. The FPGA blocks were implemented in the Vivado environment. The ASICs were designed using 130 nm technology. The developed scripts in VHDL are available in the GitHub repository.

1. Introduction

The development of the theory and practice of data processing, together with the growing range and complexity of the problems being solved, stimulates the use of increasingly advanced number systems. While complex-valued data processing was considered exotic until quite recently, the practical application of hypercomplex-valued data processing was hardly discussed at all. Quaternions are now used in a wide range of scientific and engineering fields, such as deep neural networks, computer graphics and machine vision, robotics, navigation, data encoding, image and signal processing, multiresolution analysis, and graph theory (see, for example, [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]). It should be noted, however, that hypercomplex-valued arithmetic operations ultimately still come down to performing elementary operations on real numbers.
In real-valued data processing, multiplication has always been the most time-consuming operation. To reduce the computation time, developers sought to elaborate algorithms with a minimum number of multiplications. Another solution to the problem of accelerating calculations was the introduction of hardware multipliers into the data processing system: first as separate chips, and later as blocks built into the processor itself. Since even the ordinary multiplication of two complex numbers requires four multiplications and two additions of real numbers, it is the most time-consuming operation in complex-valued data processing. In the case of hypercomplex number systems, the problem of multiplication time is even more acute: 16 multiplications and 12 additions of real numbers are needed to perform a single multiplication of two quaternions. A great deal has been written about hardware real-valued multipliers—the topic has become classic [22,23,24]. Much has also been written about hardware complex-number multipliers [25,26,27,28,29,30,31,32,33,34,35,36,37]. However, only a few works address hardware multipliers for hypercomplex numbers, particularly quaternions [38,39]. Moreover, as an analysis of the known solutions shows, they do not exploit all available optimization opportunities. We want to fill this gap.

2. Short Background

A quaternion $q \in \mathbb{H}$ is a hypercomplex number which can be presented as [2]:
$$q = q_0 + i q_1 + j q_2 + k q_3,$$
where $\mathbb{H}$ is a four-dimensional vector space over the real numbers, $q_0, q_1, q_2, q_3 \in \mathbb{R}$, and $i$, $j$, and $k$ are basic imaginary units satisfying the following multiplication rules:
$$i^2 = j^2 = k^2 = -1, \quad ij = k, \quad jk = i, \quad ki = j, \quad ji = -k, \quad kj = -i, \quad ik = -j.$$
Suppose we need to compute the product of two quaternions a and b:
c = a b ,
where
$$a = a_0 + a_1 i + a_2 j + a_3 k, \quad b = b_0 + b_1 i + b_2 j + b_3 k, \quad c = c_0 + c_1 i + c_2 j + c_3 k.$$
$$\begin{aligned} a b &= (a_0 + a_1 i + a_2 j + a_3 k)(b_0 + b_1 i + b_2 j + b_3 k) \\ &= a_0 b_0 - a_1 b_1 - a_2 b_2 - a_3 b_3 + (a_0 b_1 + a_1 b_0 + a_2 b_3 - a_3 b_2)\, i \\ &\quad + (a_0 b_2 + a_2 b_0 + a_3 b_1 - a_1 b_3)\, j + (a_0 b_3 + a_3 b_0 + a_1 b_2 - a_2 b_1)\, k. \end{aligned}$$
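The component-wise expansion above maps directly to code. Below is a minimal Python sketch (ours, not the authors' implementation) of direct quaternion multiplication, using exactly the 16 real multiplications and 12 real additions counted later in the paper:

```python
def quat_mul_direct(a, b):
    # Direct quaternion product c = a*b; tuples hold (real, i, j, k) parts.
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (a0*b0 - a1*b1 - a2*b2 - a3*b3,   # real part
            a0*b1 + a1*b0 + a2*b3 - a3*b2,   # i component
            a0*b2 + a2*b0 + a3*b1 - a1*b3,   # j component
            a0*b3 + a3*b0 + a1*b2 - a2*b1)   # k component

# Sanity checks against the defining rules: i*i = -1 and i*j = k.
assert quat_mul_direct((0, 1, 0, 0), (0, 1, 0, 0)) == (-1, 0, 0, 0)
assert quat_mul_direct((0, 1, 0, 0), (0, 0, 1, 0)) == (0, 0, 0, 1)
```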
It is well known that the quaternion multiplication can be represented as a matrix–vector product:
$$y_4 = B_4 x_4,$$
where
$$B_4 = \begin{bmatrix} b_0 & -b_1 & -b_2 & -b_3 \\ b_1 & b_0 & b_3 & -b_2 \\ b_2 & -b_3 & b_0 & b_1 \\ b_3 & b_2 & -b_1 & b_0 \end{bmatrix}, \quad y_4 = [y_0, y_1, y_2, y_3]^T,\ \{y_m\} = \{c_m\},\ m = 0, 1, 2, 3, \quad x_4 = [x_0, x_1, x_2, x_3]^T,\ \{x_m\} = \{a_m\},\ m = 0, 1, 2, 3.$$
Calculating this matrix–vector product directly calls for 16 real multiplications and 12 real additions. If, to achieve maximum performance, we implement the matrix–vector multiplication in hardware under the paradigm of maximally parallel computation, then 16 multipliers and four four-input adders are required. Our goal is to reduce the required number of multipliers while maintaining performance. Below we show one way to do this. The proposed approach is based on the idea described in [40].
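As a sketch of the matrix–vector formulation (function names are ours), $B_4$ can be built from the components of $b$ and multiplied by $x_4 = [a_0, a_1, a_2, a_3]^T$; the result coincides with the direct expansion:

```python
def b4_matrix(b):
    # The 4x4 matrix B4 built from the components of quaternion b.
    b0, b1, b2, b3 = b
    return [[b0, -b1, -b2, -b3],
            [b1,  b0,  b3, -b2],
            [b2, -b3,  b0,  b1],
            [b3,  b2, -b1,  b0]]

def matvec(M, x):
    # Plain matrix-vector product: 16 multiplications, 12 additions here.
    return tuple(sum(m * xi for m, xi in zip(row, x)) for row in M)

a, b = (1, 2, 3, 4), (5, 6, 7, 8)
y = matvec(b4_matrix(b), a)
# Matches the direct expansion of a*b for these values.
assert y == (-60, 12, 30, 24)
```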

3. Algorithmic Aspect

Let us multiply the first row of $B_4$ by $(-1)$. Then, we obtain
$$\hat{B}_4 = \begin{bmatrix} -b_0 & b_1 & b_2 & b_3 \\ b_1 & b_0 & b_3 & -b_2 \\ b_2 & -b_3 & b_0 & b_1 \\ b_3 & b_2 & -b_1 & b_0 \end{bmatrix}.$$
Therefore, the procedure for multiplying quaternions can be reduced to the following equivalent expression:
$$\underbrace{\begin{bmatrix} -y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix}}_{\hat{y}_4} = \underbrace{\begin{bmatrix} -b_0 & b_1 & b_2 & b_3 \\ b_1 & b_0 & b_3 & -b_2 \\ b_2 & -b_3 & b_0 & b_1 \\ b_3 & b_2 & -b_1 & b_0 \end{bmatrix}}_{\hat{B}_4} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}.$$
This modification will eventually reduce the computational complexity of the final algorithm. Through this transformation, the modified matrix can be represented as the algebraic sum of a block-symmetric Toeplitz-type matrix $\bar{B}_4$ and a scaled sparse matrix $\underline{B}_4$, i.e., a matrix with only a few non-zero elements:
$$\hat{B}_4 = \underbrace{\begin{bmatrix} b_0 & b_1 & b_2 & b_3 \\ b_1 & b_0 & b_3 & b_2 \\ b_2 & b_3 & b_0 & b_1 \\ b_3 & b_2 & b_1 & b_0 \end{bmatrix}}_{\bar{B}_4} - 2 \underbrace{\begin{bmatrix} b_0 & 0 & 0 & 0 \\ 0 & 0 & 0 & b_2 \\ 0 & b_3 & 0 & 0 \\ 0 & 0 & b_1 & 0 \end{bmatrix}}_{\underline{B}_4},$$
$$\hat{y}_4 = (\bar{B}_4 - 2 \underline{B}_4) x_4 = \bar{B}_4 x_4 - 2 \underline{B}_4 x_4.$$
Omitting mathematical derivations, we represent the multiplication of a matrix B ¯ 4 by a vector x 4 in the form of the following matrix–vector procedure:
$$\bar{B}_4 x_4 = H_4 D_4 H_4 x_4 = \underbrace{W_4^{(0)} W_4^{(1)}}_{H_4} D_4 \underbrace{W_4^{(1)} W_4^{(0)}}_{H_4} x_4,$$
where
$$H_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}$$
is the fourth-order Hadamard matrix,
$$D_4 = \mathrm{diag}(s_0, s_1, s_2, s_3),$$
$$W_4^{(0)} = H_2 \otimes I_2 = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}, \quad W_4^{(1)} = I_2 \otimes H_2 = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix},$$
and $I_N$ is the order-$N$ identity matrix, $H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$ is the second-order Hadamard matrix, and $\otimes$, $\oplus$ denote the Kronecker product and the direct sum of two matrices, respectively.
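The factorization $H_4 = W_4^{(0)} W_4^{(1)}$ is easy to verify numerically; by the mixed-product property, $(H_2 \otimes I_2)(I_2 \otimes H_2) = H_2 \otimes H_2$. A small pure-Python sketch (helper names are ours):

```python
def kron(A, B):
    # Kronecker product of two matrices given as nested lists.
    n, m, p, q = len(A), len(A[0]), len(B), len(B[0])
    return [[A[i][j] * B[k][l] for j in range(m) for l in range(q)]
            for i in range(n) for k in range(p)]

def matmul(A, B):
    # Plain matrix-matrix product.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

H2 = [[1, 1], [1, -1]]
I2 = [[1, 0], [0, 1]]
W0 = kron(H2, I2)   # W4^(0) = H2 (x) I2
W1 = kron(I2, H2)   # W4^(1) = I2 (x) H2
H4 = kron(H2, H2)   # fourth-order Hadamard matrix

# mixed-product property: (H2 (x) I2)(I2 (x) H2) = H2 (x) H2 = H4
assert matmul(W0, W1) == H4
```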
It is simple to verify that the entries of the matrix $D_4 = \mathrm{diag}(d_4)$ may be computed by applying the following matrix–vector procedure, which minimizes the number of real additions involved:
$$d_4 = \tfrac{1}{4} H_4 b_4 = \tfrac{1}{4} W_4^{(1)} W_4^{(0)} b_4,$$
where
$$d_4 = [s_0, s_1, s_2, s_3]^T, \quad b_4 = [b_0, b_1, b_2, b_3]^T.$$
Unfortunately, there is no way to reduce the computational cost of the product $2 \underline{B}_4 x_4$; it must be calculated directly. Combining the computations for both matrices into a single procedure, we finally obtain the following:
$$y_4 = S_{4 \times 8} (H_4 \oplus I_4)\, D_8\, (H_4 \oplus I_4)\, P_{8 \times 4}^{(0)} x_4,$$
$$P_{8 \times 4}^{(0)} = 1_{2 \times 1} \otimes I_4 = \begin{bmatrix} I_4 \\ I_4 \end{bmatrix}, \quad 1_{2 \times 1} = [1, 1]^T,$$
$$S_{4 \times 8} = \begin{bmatrix} -1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & -1 \end{bmatrix}, \quad D_8 = D_4 \oplus \underline{D}_4,$$
$$\underline{D}_4 = \mathrm{diag}(s_4, s_5, s_6, s_7) = \mathrm{diag}(2 \underline{B}_4 x_4).$$
Expression (14) describes a fast algorithm for quaternion multiplication.
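The steps above can be sketched in Python as a behavioral model (function names are ours, not the authors' code; the two stages of `had4` realize the factorization $H_4 = W_4^{(0)} W_4^{(1)}$ with 8 additions). Only eight genuine multiplications appear: four on the Hadamard-domain path and four for the sparse correction; the multiplications by 1/4 and by 2 are shifts in hardware:

```python
def had4(v):
    # Stage W4(1) = I2 (x) H2: butterflies on adjacent pairs (4 additions)...
    u0, u1, u2, u3 = v[0] + v[1], v[0] - v[1], v[2] + v[3], v[2] - v[3]
    # ...then stage W4(0) = H2 (x) I2: butterflies with stride 2 (4 additions).
    return (u0 + u2, u1 + u3, u0 - u2, u1 - u3)

def quat_mul_fast(a, b):
    s = [t / 4 for t in had4(b)]   # diagonal of D4; the 1/4 is a right shift
    u = had4(a)                    # H4 * x4
    v = had4(tuple(si * ui for si, ui in zip(s, u)))  # B-bar_4 * x4: 4 multiplications
    # sparse correction 2 * B-underline_4 * x4: 4 multiplications (doubling is a left shift)
    w = (2*b[0]*a[0], 2*b[2]*a[3], 2*b[3]*a[1], 2*b[1]*a[2])
    yh = tuple(vi - wi for vi, wi in zip(v, w))
    return (-yh[0], yh[1], yh[2], yh[3])  # undo the sign flip of the first row

# Agrees with the direct expansion: (1+2i+3j+4k)(5+6i+7j+8k) = -60+12i+30j+24k.
assert quat_mul_fast((1, 2, 3, 4), (5, 6, 7, 8)) == (-60, 12, 30, 24)
```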
The data flow diagram of the processing block for calculating the elements $\{s_m\}$, $m = 0, 1, 2, 3$, is depicted in Figure 1, whereas the data flow structure of the quaternion multiplication unit is shown in Figure 2. Figure 3 shows the data flow structure of the process for calculating the elements $\{s_{m+4}\}$, $m = 0, 1, 2, 3$.
The data flow diagrams in this paper are oriented from left to right. Straight lines represent data transfer operations. Points where lines converge indicate summation, and dotted lines indicate subtraction. To avoid overcrowding the figures, we deliberately use plain lines without arrows. Circles in these figures depict multiplication by the number, constant or variable, written inside the circle. In turn, rectangles indicate matrix–vector multiplications by the matrices inscribed inside them.
Let us consider the data flow diagrams shown in Figure 2 and Figure 3. It can be seen that the implementation of the computations according to (12) and (14) requires only eight multiplications, 28 additions, four right shifts (multiplications by 1/4), and four left shifts (multiplications by 2). Since the left and right shift operations are very simple and do not require additional memory access, they are usually omitted when evaluating computational complexity. Note that here we take into account the fact that multiplying a fourth-order Hadamard matrix by a vector requires only 8 additions instead of 12. Figure 4 shows a detailed schematic view of the matrix–vector multiplication of the corresponding vectors by the factorized matrix $H_4$. This is the so-called four-point "fast Hadamard transform".
It should be noted that in applications related to quaternion-valued DSP, one of the multiplied quaternions often contains coefficients that are real constants. In this case, the coefficients can be calculated in advance and stored in the processor's memory. The arithmetic complexity of the proposed algorithm can therefore be further reduced: it then requires only eight real multiplications and 20 real additions.
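This constant-coefficient case can be sketched as a closure that performs the offline precomputation once (names are ours; `had4` implements the four-point fast Hadamard transform). Each subsequent product then costs eight multiplications and 20 additions:

```python
def had4(v):
    # Four-point fast Hadamard transform via two butterfly stages (8 additions).
    u0, u1, u2, u3 = v[0] + v[1], v[0] - v[1], v[2] + v[3], v[2] - v[3]
    return (u0 + u2, u1 + u3, u0 - u2, u1 - u3)

def make_multiplier(b):
    # Offline precomputation for a constant quaternion b (shifts not counted).
    s = [t / 4 for t in had4(b)]                    # diagonal entries s0..s3
    g = (2 * b[0], 2 * b[2], 2 * b[3], 2 * b[1])    # doubled sparse entries
    def mul(a):
        u = had4(a)                                         # 8 additions
        v = had4(tuple(si * ui for si, ui in zip(s, u)))    # 4 mults + 8 additions
        w = (g[0] * a[0], g[1] * a[3], g[2] * a[1], g[3] * a[2])  # 4 mults
        # Final adder-subtractor stage: 4 additions.
        return (-(v[0] - w[0]), v[1] - w[1], v[2] - w[2], v[3] - w[3])
    return mul

mul_by_b = make_multiplier((5, 6, 7, 8))
assert mul_by_b((1, 2, 3, 4)) == (-60, 12, 30, 24)   # matches the direct product
```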

4. Quaternion Multiplier Structure

Figure 5 shows a rough outline of the structure of a quaternion multiplier that implements the proposed approach.
The quaternion multiplier works as follows. At the beginning of the system operation, the coefficients of the multiplied quaternions, naturally ordered, are written to the corresponding buffer memory cells, as seen in Figure 5 on the left and at the bottom. The multiplicand is written to the upper-left group of registers, and the multiplier is written to the registers, which are located at the bottom of the figure.
In the next time period, the DM demultiplexers prepare and transmit data, as shown in the figure. In this case, the first half of the entries of the first data vector, using a demultiplexer located in the upper-left part, goes to the fast Hadamard transform block, and then to the first inputs of the upper half of the multiplication blocks. The reordered second half of the entries in this vector goes directly to the first inputs of the lower half of the multiplication blocks. In this case, the appropriately prepared data are fed through the lower demultiplexer to the second inputs of the lower half of the multiplication blocks (see the left half of the demultiplexer outputs). In turn, the right half of the outputs of this demultiplexer is connected to the inputs of the fast Hadamard transform block; for these outputs, the data are supplied to the second inputs of the second half of the multiplication blocks.
In the last time period, data from the outputs of the first four multiplication blocks arrive at the inputs of the upper-right fast Hadamard transform block, and then at the first inputs of the corresponding adder–subtractors. At the same time, data from the outputs of the four lower multiplication blocks arrive at the second inputs of the adder–subtractors. The result of the calculation goes into the output buffer.
As can be seen, the main working blocks of the proposed structure are three blocks implementing the four-point fast Hadamard transform algorithm, eight multipliers, and four two-input adder–subtractors. To avoid problems with data input/output, four register buffers are located at the input and output. Data are supplied to the binary multipliers via two special demultiplexers, whose operation is clear from Figure 5. The quaternion multiplier operates under the control of a master device.

5. Hardware Implementation of Quaternion Multiplication

5.1. Implementation on FPGA

The fast quaternion multiplication shown in Figure 5 was implemented on an Artix-7 xc7a100tfgg484-1 device. The project was realized using the Vivado 2016.4 tool. The multiplication was implemented in two ways: with regular and with pipelined processing. In the first case, the achieved clock frequency was 50 MHz; in the second, 250 MHz. Due to the buffering required by pipelined processing, an additional 768 flip-flops (FFs) were used, which, however, constitute only 0.61% of the device's flip-flop resources.
In the case of the direct pipelined implementation of quaternion multiplication, the multiplications were separated from the additions by inserting registers after each operation. In the case of the fast pipelined implementation, registers were added between each operation, and additional registers in some signal paths were added to synchronize the signals at the output. These changes allowed the clock frequency to be increased from 50 MHz to 250 MHz.
An important comparison criterion, apart from clock frequency and power consumption, is the number of DSP slices used. As mentioned in the previous sections, the developed algorithm reduces the number of required multiplications, which should translate directly into the number of DSPs. The operations of multiplying by 2 and dividing by 4 in Figure 5 do not involve additional DSPs because they are implemented as register shifts. For comparison, the direct algorithm resulting from the notation in Equation (5) was also implemented. According to this relationship, 16 multiplication operations are required. This algorithm was likewise implemented in two versions: pipelined processing, with a clock frequency of 250 MHz, and regular processing, with a clock frequency of 50 MHz. The number of DSP slices, which occupy the largest area on the chip, is doubled (16 DSPs). The power consumption in the case of direct multiplication is practically the same as in the case of fast multiplication.
The simulation results are summarized in the tables. Table 1 compares resource utilization and power consumption for the pipelined implementations of direct and fast quaternion multiplication. Table 2 presents a similar comparison for the regular implementations. Apart from achieving a five-times-higher clock frequency, the pipelined implementations are comparable to the regular ones in terms of resource utilization. The higher dynamic energy consumption of the pipelined implementation is associated with the higher clock frequency. In both cases, however, the direct method uses twice as many DSP slices as the fast method.

5.2. ASIC Implementation

Both quaternion multiplication operations have also been implemented as ASICs. The OpenLane environment was used for this purpose. OpenLane is an environment composed of tools for the synthesis of digital circuits, whose output is an integrated circuit layout in the form of a GDSII file. It should be mentioned that the tools comprising OpenLane are integrated with the publicly available SkyWater 130 nm technological process, which makes it possible to produce a physical layout. The entire synthesis process is automated: the environment synthesizes the design into interconnected logic cells, then places and routes the previously generated cells, each time analyzing the design using DRC and LVS checks until the synthesis assumptions are met. The starting point in this flow is the description of the system in a hardware description language.
The two versions of the multiplier considered are described in VHDL. The direct algorithm resulting from Formula (5) is available as free software in the GitHub repository http://github.com/AHandkiewicz/IIR3D (accessed on 23 July 2024), in the quaternion VHD directory. The part of this description resulting from Formula (5) is presented in Listing 1.
Listing 1. Direct implementation of quaternion multiplication
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;
 
architecture Flow of mul is
begin
process(clk,a0,a1,a2,a3,b0,b1,b2,b3)
begin
if rising_edge(clk) then
y0 <= a0*b0 - a1*b1 - a2*b2 - a3*b3;
y1 <= a0*b1 + a1*b0 + a2*b3 - a3*b2;
y2 <= a0*b2 + a2*b0 + a3*b1 - a1*b3;
y3 <= a0*b3 + a3*b0 + a1*b2 - a2*b1;
end if;
end process;
end Flow;
The automatically generated testbench for the fast quaternion multiplication design is also available as free software in the GitHub repository http://github.com/AHandkiewicz/IIR3D (accessed on 23 July 2024).
The resulting layout of this system is shown in Figure 6.
For comparison, the ASIC generated for direct multiplication is shown in Figure 7. In the first case, the system was generated in 43 min, and in the second one, the system was generated in 69 min.
The most important parameters related to the ASIC implementation of both quaternion multiplication methods are summarized in Table 3.
As can easily be seen, the fast algorithm yields over 25% savings in chip area and resource use. Identical synthesis settings were used for both designs, so the results obtained are directly comparable.

6. Conclusions

The paper presents the original structure of a processing unit for multiplying quaternions and demonstrates significant savings in resources on FPGAs and in chip area on ASICs. Direct quaternion multiplication takes 16 real multiplications and 12 real additions. The algorithm presented in the paper, based on Hadamard matrices, reduces these numbers to eight real multiplications and 20 real additions. This is reflected in the hardware implementation of the algorithm. The utilization of individual FPGA blocks is shown in Table 1 and Table 2. Resource savings of more than 50% are clearly visible, for example, in the number of digital signal processors (DSPs). The results of implementing the developed algorithm as an integrated circuit (ASIC) are presented in Table 3. In this case, too, a significant reduction in resource consumption is achieved, e.g., a chip-area reduction of more than 25%.

Author Contributions

Conceptualization, A.C.; Methodology, A.H.; Software, M.N.; Validation, A.C., M.N. and A.H.; Formal analysis, M.N. and A.H.; Writing—review & editing, A.C. and A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Poznan University of Technology grant number 0311/SBAD/0740.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vince, J. Quaternions for Computer Graphics; Springer: London, UK, 2011. [Google Scholar]
  2. Cariow, A.; Cariowa, G.; Majorkowska-Mech, D. An algorithm for quaternion-based 3D rotation. Int. J. Appl. Math. Comput. Sci. 2020, 30, 149–160. [Google Scholar] [CrossRef]
  3. Schütte, H.D.; Wenzel, J. Hypercomplex numbers in digital signal processing. In Proceedings of the ISCAS ’90, New Orleans, LA, USA, 1–3 May 1990; pp. 1557–1560. [Google Scholar]
  4. Alfsmann, D.; Göckler, H.G.; Sangwine, S.J.; Ell, T.A. Hypercomplex Algebras in Digital Signal Processing: Benefits and Drawbacks (Tutorial). In Proceedings of the EURASIP 15th European Signal Processing Conference (EUSIPCO 2007), Poznań, Poland, 3–7 September 2007; pp. 1322–1326. [Google Scholar]
  5. Bülow, T.; Sommer, G. Hypercomplex signals—A novel extension of the analytic signal to the multidimensional case. IEEE Trans. Signal Process. 2001, 49, 2844–2852. [Google Scholar] [CrossRef]
  6. Moxey, C.E.; Sangwine, S.J.; Ell, T.A. Hypercomplex correlation techniques for vector images. IEEE Trans. Signal Process. 2003, 51, 1941–1953. [Google Scholar] [CrossRef]
  7. Navarro-Moreno, J.; Ruiz-Molina, J.C.; Oya, A.; Quesada-Rubio, J.M. Detection of continuous-time quaternion signals in additive noise. EURASIP J. Adv. Signal Process. 2012, 2012, 7. [Google Scholar] [CrossRef]
  8. Mayhew, C.; Sanfelice, R.; Sheng, J.; Arcak, M.; Teel, A.R. Quaternion-based hybrid feedback for robust global attitude synchronization. IEEE Trans. Autom. Control 2012, 57, 2122–2227. [Google Scholar] [CrossRef]
  9. Le Bihan, N.; Sangwine, S.J.; Ell, T.A. Instantaneous frequency and amplitude of orthocomplex modulated signals based on quaternion Fourier transform. Signal Process. 2014, 94, 308–318. [Google Scholar] [CrossRef]
  10. Witten, B.; Shragge, J. Quaternion-based Signal Processing. In Proceedings of the New Orleans 2006 Annual Meeting, New Orleans, LA, USA, 23–27 June 2006; pp. 2862–2865. [Google Scholar]
  11. Jahanchahi, C.; Mandic, D. A class of quaternion Kalman filters. IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 533–544. [Google Scholar] [CrossRef]
  12. Barthelemy, Q.; Larue, A.; Mars, J. Sparse approximations for quaternionic signals. Adv. Appl. Clifford Algebr. 2014, 24, 383–402. [Google Scholar] [CrossRef]
  13. Karakasis, E.; Papakostas, G.; Koulouriotis, D.; Tourassis, V. A unified methodology for computing accurate quaternion color moments and moment invariants. IEEE Trans. Image Process. 2014, 23, 596–611. [Google Scholar] [CrossRef]
  14. Czaplewski, B.; Dzwonkowski, M.; Rykaczewski, R. Digital fingerprinting based on quaternion encryption scheme for gray-tone images. J. Telecommun. Inf. Technol. 2014, 2014, 3–11. [Google Scholar] [CrossRef]
  15. Wang, G.; Liu, Y.; Zhao, T. A quaternion-based switching filter for colour image denoising. Signal Process. 2014, 102, 216–225. [Google Scholar] [CrossRef]
  16. Szczȩsna, A.; Słupik, J.; Janiak, M. Motion data denoising based on the quaternion lifting scheme multiresolution transform. Mach. Graph. Vis. 2011, 20, 237–249. [Google Scholar]
  17. Bayro-Corrochano, E. Multi-resolution image analysis using the quaternion wavelet transform. Numer. Algorithms 2005, 39, 35–55. [Google Scholar] [CrossRef]
  18. Majorkowska-Mech, D.; Cariow, A. One-Dimensional Quaternion Discrete Fourier Transform and an Approach to Its Fast Computation. Electronics 2023, 12, 4974. [Google Scholar] [CrossRef]
  19. Cariow, A.; Cariowa, G. Fast Algorithms for Quaternion-Valued Convolutional Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 457–462. [Google Scholar] [CrossRef]
  20. Belardo, F.; Brunetti, M.; Coble, M.J.; Reff, N.; Skogman, H. Spectra of quaternion unit gain graphs. Linear Algebra Its Appl. 2022, 632, 15–49. [Google Scholar] [CrossRef]
  21. Kyrchei, I.I.; Treister, E.; Pelykh, O. The determinant of the Laplacian matrix of a quaternion unit gain graph. Discret. Math. 2024, 347, 113955. [Google Scholar] [CrossRef]
  22. Kidambi, S.S.; Guibaly, F.E.; Antoniou, A. Area-Efficient Multipliers for Digital Signal Processing Applications. IEEE Trans. Circuits Syst.-II 1996, 43, 90–95. [Google Scholar] [CrossRef]
  23. Immareddy, S.; Sundaramoorthy, A. A survey paper on design and implementation of multipliers for digital system applications. Artif. Intell. Rev. 2022, 55, 4575–4603. [Google Scholar]
  24. Wen, M.C.; Wang, S.J.; Lin, Y.N. Low-power Parallel Multiplier with Column Bypassing. Electron. Lett. 2005, 41, 581–583. [Google Scholar] [CrossRef]
  25. Berkeman, A.; Öwall, V.; Torkelson, M. A Low Logic Depth Complex Multiplier Using Distributed Arithmetic. IEEE J. Solid-State Circuits 2000, 35, 656–659. [Google Scholar] [CrossRef]
  26. Mahdy, Y.B.; Ali, S.A.; Shaaban, K.M. Algorithm and two efficient implementations for complex multiplier. In Proceedings of the ICECS ’99. 6th IEEE International Conference on Electronics, Circuits and Systems, Paphos, Cyprus, 5–8 September 1999; pp. 949–952. [Google Scholar]
  27. Soulas, T.; Villeger, D.; Oklobdzija, V.G. An ASIC macro cell multiplier for complex numbers. In Proceedings of the 1993 European Conference on Design Automation with the European Event in ASIC Design, Paris, France, 22–25 February 1993; pp. 589–593. [Google Scholar]
  28. Wei, B.W.Y.; Du, H.; Chen, H. A complex-number multiplier using radix-4 digits. In Proceedings of the 12th Symposium on Computer Arithmetic, Washington, DC, USA, 19–21 July 1995; pp. 84–90. [Google Scholar]
  29. Oklobdzija, V.G.; Villeger, D.; Soulas, T. An Integrated Multiplier for Complex Numbers. J. VLSI Signal Process. 1994, 7, 213–222. [Google Scholar] [CrossRef]
  30. Sansaloni, T.; Valls, J.; Parhi, K.K. Digit-Serial Complex-Number Multipliers on FPGAs. J. VLSI Signal Process. 2003, 33, 105–115. [Google Scholar] [CrossRef]
  31. Pascual, A.P.; Valls, J.; Peiro, M.M. Efficient complex-number multipliers mapped on FPGA. In Proceedings of the ICECS ’99. 6th IEEE International Conference on Electronics, Circuits and Systems (Cat. No.99EX357), Paphos, Cyprus, 5–8 September 1999; pp. 1123–1126. [Google Scholar]
  32. Kong, M.Y.; Langlois, J.M.P.; Al-Khalili, D. Efficient FPGA Implementation of Complex Multipliers using the Logarithmic Number System. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Seattle, WA, USA, 18–21 May 2008; pp. 3154–3157. [Google Scholar]
  33. Perez-Pascual, A.; Sansaloni, T.; Valls, J. FPGA based on-line complex-number multipliers. In Proceedings of the ICECS 2001. 8th IEEE International Conference on Electronics, Circuits and Systems (Cat. No.01EX483), Malta, 2–5 September 2001; pp. 1481–1484. [Google Scholar]
  34. Ismail, R.C.; Hussin, R. High Performance Complex Number Multiplier Using Booth-Wallace Algorithm. In Proceedings of the IEEE International Conference on Semiconductor Electronics, Kuala Lumpur, Malaysia, 29 October–1 December 2006; pp. 786–790. [Google Scholar]
  35. He, S.; Torkelson, M. A pipelined bit-serial complex multiplier using distributed arithmetic. In Proceedings of the 1995 IEEE International Symposium on Circuits and Systems, Seattle, WA, USA, 30 April–3 May 1995; pp. 2313–2316. [Google Scholar]
  36. Chang, Y.N.; Parhi, K.K. High-Performance Digit-Serial Complex Multiplier. IEEE Trans. Circuits Syst.-II Analog. Digit. Signal Process. 2000, 47, 570–572. [Google Scholar] [CrossRef]
  37. Paz, P.; Garrido, M. Efficient Implementation of Complex Multipliers on FPGAs Using DSP Slices. J. Signal Process. Syst. 2023, 95, 543–550. [Google Scholar] [CrossRef]
  38. Parfieniuk, M.; Petrovsky, A. Quaternion multiplier inspired by the lifting implementation of plane rotations. IEEE Trans. Circuits Syst. I 2010, 57, 2708–2717. [Google Scholar] [CrossRef]
  39. Parfieniuk, M.; Petrovsky, N.A.; Petrovsky, A.A. Rapid prototyping of quaternion multiplier: From matrix notation to FPGA based circuits. In Rapid Prototyping Technology: Principles and Functional Requirements; InTech: Rijeka, Croatia, 2011; pp. 227–246. [Google Scholar]
  40. Ţariova, G.; Ţariov, A. Algorithmic aspects of multiplication block number reduction in a two quaternion hardware multiplier. Pomiary. Autom. Kontrola 2010, 56, 688–690. [Google Scholar]
Figure 1. The data flow diagram that represents the calculation of elements $\{s_m\}$, $m = 0, 1, 2, 3$, according to expression (12).
Figure 2. The data flow diagram of the fast algorithm for the multiplication of two quaternions.
Figure 3. The data flow diagram that represents the calculation of elements $\{s_m\}$, $m = 0, 1, 2, 3$, according to expression (15).
Figure 4. The detailed schematic view of the matrix–vector multiplication of appropriate vectors by the factorized 4 × 4 Hadamard matrix.
Figure 5. The common schematic view of the quaternion multiplier.
Figure 6. Implementation of fast quaternion multiplication in an ASIC.
Figure 7. Implementation of direct quaternion multiplication in an ASIC.
Table 1. Pipeline implementation of direct and fast quaternion multiplication obtained for a 250 MHz clock.

| Resources | Direct Multiplying | Fast Multiplying | Available Resources |
|---|---|---|---|
| LUT | | 640 (1.01%) | 63,400 |
| FF | 96 (0.08%) | 768 (0.61%) | 126,800 |
| DSP | 16 (6.67%) | 8 (3.33%) | 240 |
| IO | 256 (90.18%) | 257 (90.18%) | 258 |
| BUFG | 1 (3.13%) | 1 (3.13%) | 32 |

| Energy Consumption | Direct Multiplying | Fast Multiplying |
|---|---|---|
| clocks | 0.003 W (1%) | 0.01 W (2%) |
| signals | 0.019 W (5%) | 0.037 W (10%) |
| logic | <0.001 W (<1%) | 0.016 W (4%) |
| dsp | 0.04 W (11%) | 0.02 W (5%) |
| io | 0.302 W (82%) | 0.305 W (79%) |
| dynamic (sum of the above) | 0.364 W (80%) | 0.388 W (81%) |
| static | 0.092 W (20%) | 0.092 W (19%) |
Table 2. Usual implementation of direct and fast quaternion multiplication obtained for a 50 MHz clock.

| Resources | Direct Multiplying | Fast Multiplying | Available Resources |
|---|---|---|---|
| LUT | 62 (0.1%) | 512 (0.81%) | 63,400 |
| FF | | | 126,800 |
| DSP | 17 (7.08%) | 8 (3.33%) | 240 |
| IO | 257 (90.18%) | 257 (90.18%) | 258 |
| BUFG | 1 (3.13%) | 1 (3.13%) | 32 |

| Energy Consumption | Direct Multiplying | Fast Multiplying |
|---|---|---|
| clocks | 0.001 W (1%) | 0.001 W (1%) |
| signals | 0.007 W (5%) | 0.013 W (9%) |
| logic | <0.001 W (<1%) | <0.005 W (<3%) |
| dsp | 0.016 W (11%) | 0.008 W (5%) |
| io | 0.121 W (82%) | 0.122 W (79%) |
| dynamic (sum of the above) | 0.145 W (61%) | 0.149 W (62%) |
| static | 0.091 W (39%) | 0.091 W (38%) |
Table 3. ASIC implementation of fast and direct quaternion multiplication.

| Multiplication | Fast | Direct |
|---|---|---|
| Core Area (mm²) | 0.3649850496 | 0.469800576 |
| # Cells | 19,004 | 23,813 |
| AND | 1118 | 4310 |
| NAND | 646 | 802 |
| NOR | 2344 | 1112 |
| OR | 2613 | 3143 |
| XOR | 5061 | 7018 |
| XNOR | 2444 | 2231 |
| Frequency (MHz) | 19.6078 | 19.6078 |