Article

Designing Energy-Efficient Approximate Multipliers

1 Department of Mechanical, Energy and Management Engineering, University of Calabria, 87036 Rende, Italy
2 Department of Informatics, Modeling, Electronics and System Engineering, University of Calabria, 87036 Rende, Italy
* Author to whom correspondence should be addressed.
J. Low Power Electron. Appl. 2022, 12(4), 49; https://doi.org/10.3390/jlpea12040049
Submission received: 31 August 2022 / Revised: 21 September 2022 / Accepted: 23 September 2022 / Published: 27 September 2022
(This article belongs to the Special Issue Low-Power Computation at the Edge)

Abstract

This paper proposes a novel approach to designing energy-efficient approximate multipliers for both ASICs and FPGAs. Instead of relying on approximate computational modules that implement traditional static or dynamic bit-truncation, the new strategy harnesses specific encoding logics based on bit significance and computes the approximate product by performing accurate sub-multiplications on approximated operands. The proposed platform-independent architecture exhibits an energy saving of up to 80% over the accurate counterparts and significantly lower accuracy loss than competing approximate architectures. When employed in 2D digital filters and edge detectors, the novel approximate multipliers lead to an energy consumption up to ~82% lower than the accurate counterparts, a saving up to ~2 times higher than that obtained by state-of-the-art competitors.

1. Introduction

Inspired by the observation that exact (or precise) computations are not always necessary in most modern applications, approximate computing is nowadays a widely used paradigm for designing error-resilient circuits that can trade accuracy for energy [1,2].
The topic of several papers [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] is the design of energy-efficient approximate arithmetic circuits realized either by using Application Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs). Among them, in particular, approximate integer multipliers received a great deal of attention [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]. Despite the generality of the adopted approximation logic, implementing such circuits into either ASIC or FPGA technologies may lead to quite different energy, timing, and area behavior due to the different utilization of available resources. For example, the simplest approximation strategy often adopted in ASIC designs is bit-truncation [3,4,5,6]. When applied statically [3,6], it allows the pruning of the hardware resources used to compute a pre-established number of least significant bits (LSBs) of the product. Conversely, dynamic truncation techniques [4,5] allow the energy saving to be tuned on a time-varying quality target. As an efficient alternative, the static approach proposed very recently in [7] exploits a small inner multiplier to process m-bit segments of the operands and adopts a correction technique to improve the error performance. On the contrary, the approximation approaches presented in [8,9] accumulate the partial products (PPs) with approximate circuits that save energy, introducing a reasonable accuracy loss.
Unfortunately, the approaches conceived for ASIC designs are often not effective when adopted for FPGA implementations. Indeed, in most cases, they lead to energy consumptions higher than the accurate multipliers. For this reason, alternative strategies specific to FPGA designs are proposed in [10,11,12,13,14,15,16,17,18,19]. Most of them exploit Booth’s algorithm, which is simplified by either truncating specific bits of the PPs [14] or approximating the encoding logic [15]. Others are based on modular approaches [12,16,17,18,19] that allow high-order multipliers to be implemented involving approximate low-order sub-multipliers. However, these methods are based on platform-specific optimizations that allow approximate operations to be efficiently mapped within Look-Up-Tables (LUTs), and, as a consequence, they do not perform as well when implemented in ASIC.
The above overview of the state of the art shows that none of the cited design methods has been demonstrated on both ASICs and FPGAs, or has shown the potential to be competitive on both platforms. Indeed, although designs described using the Very High-Speed Integrated Circuits Hardware Description Language (VHDL) can be synthesized and implemented on both FPGAs and ASICs, the above designs achieve energy-delay trade-offs quite far from those reached by counterparts natively optimized for a specific platform. Therefore, they do not appear to be good candidates for the platform-independent design approach that we want to propose.
With the aim of introducing a platform-independent approximate multiplication strategy, this paper presents a new technique that allows modular energy-efficient approximate multipliers to be designed. In contrast to existing approaches [8,9,10,11,12,13,14,15,16,17,18,19], the proposed multiplier provides inexact results by using unconventional accurate sub-multipliers on approximated input operands, instead of performing approximate computations. Thanks to this strategy, the novel multiplier can be coded at the VHDL level, avoiding any specific platform-dependent primitives. Therefore, it is suitable to be processed by any synthesis tool and for any hardware platform without requiring specific modifications concerning the target technology. It is worth underlining that the approaches discussed in previously published papers [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] significantly differ from the method presented here.
To demonstrate the effectiveness of the proposed strategy, experiments were performed in both the ASIC and FPGA domains. In the former case, we achieve an energy saving over the accurate baseline of up to more than 80%, which, at a comparable number of effective bits (NoEB), is quite better than that obtained by the approximate multiplier recently presented in [7]. A significant advantage in terms of energy reduction with higher NoEB is achieved with respect to the architectures described in [8,9]. The results obtained in comparison with the competitors [15,16,17,18,19] also show that, among FPGA-based implementations, the proposed strategy reaches the best energy-accuracy trade-off.
Similar to previous works, novel multipliers were included in the design of approximate 2D image filters and edge detectors. In the former application, the proposed design consumes ~82% less energy than the accurate baseline, without introducing any Structural Similarity index (SSIM) [26] degradation, thus overcoming achievements in [8]. Conversely, the edge detection tests demonstrate that the proposed multiplier allows a higher edge-detected percentage to be reached with an energy saving only 0.88% lower than [14].

2. Background and Related Works

In this section, the behavior of conventional n × m integer multipliers and some representative static approximation strategies are briefly described. To this end, let us assume, without loss of generality, that the n-bit multiplicand $A[n-1:0] = a_{n-1}, \ldots, a_0$ and the m-bit multiplier $B[m-1:0] = b_{m-1}, \ldots, b_0$ are 2′s complement numbers represented as given in (1). As is well known, the basic multiplication algorithm first computes the bitwise ANDs between the operand A and each bit of B. Then, to obtain the generic partial product $PP_j$, with j = 0, …, m−1, the result of the AND operation related to the bit $b_j$ is left shifted by j bit positions and sign extended to (n + m) bits. Finally, as shown by (2), the exact product $P_e[n+m-1:0]$ is calculated by accumulating the m computed PPs. It is important to highlight that the simpler behavior of a multiplier processing unsigned operands can be easily derived from (1) and (2) by just removing the initial minus sign.
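$$\begin{aligned} A[n-1:0] &= -a_{n-1}\cdot 2^{n-1} + \sum_{i=0}^{n-2} a_i\cdot 2^{i} \\ B[m-1:0] &= -b_{m-1}\cdot 2^{m-1} + \sum_{j=0}^{m-2} b_j\cdot 2^{j} \end{aligned} \tag{1}$$
$$P_e[n+m-1:0] = -b_{m-1}\cdot 2^{m-1}\cdot A[n-1:0] + \sum_{j=0}^{m-2} b_j\cdot 2^{j}\cdot A[n-1:0] \tag{2}$$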
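The accumulation in (2) can be sketched behaviorally as follows. This is a minimal illustration, not a hardware description; the function names `to_signed` and `shift_add_multiply` are hypothetical and introduced only for this example.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def shift_add_multiply(a, b, n, m):
    """Multiply a signed n-bit `a` by a signed m-bit `b` following (2)."""
    mask = (1 << (n + m)) - 1
    acc = 0
    for j in range(m):
        b_j = (b >> j) & 1            # j-th bit of the multiplier
        pp = (a << j) if b_j else 0   # AND with b_j, then left shift by j
        if j == m - 1:
            pp = -pp                  # the sign bit of B has weight -2^(m-1)
        acc = (acc + pp) & mask       # accumulate modulo 2^(n+m)
    return to_signed(acc, n + m)
```

For unsigned operands the negation of the last partial product would simply be dropped, which mirrors the remark about removing the initial minus sign.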
When the radix-$2^r$ Booth’s algorithm is adopted, the m bits of the signed multiplier B are split into m/r (r + 1)-bit groups, with 1-bit overlaps. An encoded digit is extracted from each group and used to generate the partial product $PP_j$ (with j = 0, …, m/r − 1) as a multiple of A. As an example, with r = 2 the generic encoded digit can assume the values 0, ±1, and ±2, whereas with r = 3 it can be equal to 0, ±1, ±2, ±3, and ±4. Each $PP_j$ is sign extended and left shifted by r × j bit positions to be aligned to the other partial products for the accumulation that furnishes the exact product $P_e[n+m-1:0]$ as given in (3). In this case, in order to treat unsigned inputs correctly, A and B must be zero extended to (n + 1) and (m + 1) bits, respectively.
$$P_e[n+m-1:0] = \sum_{j=0}^{m/r-1} 2^{r\cdot j}\cdot PP_j \tag{3}$$
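As a concrete instance of (3), the sketch below implements the standard radix-4 (r = 2) Booth recoding for a signed multiplier with an even number of bits. The helper names are hypothetical, and the code is a behavioral model of the textbook algorithm rather than of any specific circuit cited here.

```python
def booth_radix4_digits(b, m):
    """Radix-4 Booth digits of a signed m-bit multiplier `b` (m even).

    Each digit is -2*b[2j+1] + b[2j] + b[2j-1], with b[-1] = 0,
    so it lies in {-2, -1, 0, 1, 2}.
    """
    digits = []
    prev = 0  # overlap bit b[2j-1]; zero for the first group
    for j in range(m // 2):
        b0 = (b >> (2 * j)) & 1
        b1 = (b >> (2 * j + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(a, b, m):
    """Accumulate PP_j = digit_j * a, each shifted left by 2*j, as in (3)."""
    return sum((d * a) << (2 * j)
               for j, d in enumerate(booth_radix4_digits(b, m)))
```

With m = 8 only four partial products are accumulated instead of eight, which is the usual motivation for Booth recoding.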
Both the above multiplication algorithms are suitable for the modular approach that can be applied, as an example, by splitting the operands A and B into two sub-words, namely, $A_M = a_{n-1} \ldots a_{k_a}$, $A_L = a_{k_a-1} \ldots a_0$, $B_M = b_{m-1} \ldots b_{k_b}$, and $B_L = b_{k_b-1} \ldots b_0$. In this case, the product $P_e$ is calculated as shown in (4), where $P_{MM} = A_M \times B_M$, $P_{ML} = A_M \times B_L$, $P_{LM} = B_M \times A_L$, and $P_{LL} = A_L \times B_L$.
$$P_e[n+m-1:0] = 2^{k_a+k_b}\cdot P_{MM} + 2^{k_a}\cdot P_{ML} + 2^{k_b}\cdot P_{LM} + P_{LL} \tag{4}$$
It is worth noting that, in the case of signed operands, while the sub-words AM and BM still represent 2′s complement numbers, the sub-words AL and BL are unsigned numbers. This makes the management of signs information necessary to compute the sub-products PML, PLM, and PLL much simpler than what is required for calculating PMM. Obviously, the overall computation is even easier when unsigned operands are processed. Furthermore, it is easy to understand that, independent of the adopted algorithm, the modular approach could be applied recursively to compute the sub-products, as shown, for example, in [27].
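The decomposition in (4), including the signed most significant and unsigned least significant sub-words, can be checked with a short behavioral sketch. The function names are hypothetical; Python's arithmetic right shift on negative integers is used to keep the sign in the upper sub-word.

```python
def split(x, k):
    """Split x into (x_M, x_L): x_L holds the k LSBs as an unsigned value,
    x_M keeps the remaining bits as a signed (2's complement) value."""
    return x >> k, x & ((1 << k) - 1)


def modular_multiply(a, b, ka, kb):
    """Rebuild the exact product from the four sub-products of (4)."""
    a_m, a_l = split(a, ka)
    b_m, b_l = split(b, kb)
    p_mm = a_m * b_m   # signed x signed
    p_ml = a_m * b_l   # signed x unsigned
    p_lm = b_m * a_l   # signed x unsigned
    p_ll = a_l * b_l   # unsigned x unsigned
    return (p_mm << (ka + kb)) + (p_ml << ka) + (p_lm << kb) + p_ll
```

Since `x == (x_M << k) + x_L` holds for every integer, the recombination is exact for any choice of ka and kb.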
The above formulations suggest that several strategies are viable to design efficient approximate multipliers. For example, the ASIC implementations presented in [8,9] compute the PPs conventionally as the bitwise AND between A and B, and then accumulate them by means of approximate compressors. Depending on the approximation level adopted and the chosen truncation, four approximate multipliers (named 1StepFull, 1StepTrunc, 2StepFull, and 2StepTrunc) are presented in [8], each providing a different trade-off between speed, power, and accuracy. Conversely, the two approximate multipliers (called C-N and C-FULL) described in [9] differ from each other in the way they use approximate 4-2 compressors. While the architecture C-N exploits the approximate 4-2 compressors only to process the LSBs of the PPs, thus limiting the errors introduced with respect to the exact product, the design C-FULL uses the approximate 4-2 compressors on the entire PPs, thus saving more energy, but sacrificing the accuracy.
The approaches known for FPGA-based designs make use of quite different approximation logics. The architectures (called AxBM1 and AxBM2) proposed in [14] approximate the radix-8 Booth’s multiplier by exploiting new encoders on purpose designed to compute inexact PPs and to be efficiently mapped within LUT primitives. A different method is applied in [15] to design a radix-4 Booth’s approximate multiplier (called BA) by taking advantage also of a LUT-level optimization strategy. In such a case, the logic operations performed by the LUTs responsible for computing the two LSBs of the PPs are removed or approximated, thus saving energy consumption and hardware resources. Alternative ways to exploit efficiently LUT-optimized implementations are presented in [16,17,18,19]. In [16], a 4 × 2 approximate multiplier is used as the basic block to design higher-order multipliers. In such a basic block, the PPs are computed by performing the bitwise AND between the multiplicand A and the bits of B, then they are grouped two by two and added by means of the fast carry chains available in modern FPGAs [28,29]. The modular w × w multiplier, designed as described in [16], performs the generic computation by summing four w/2 × w/2 approximate sub-products using either an accurate or an approximate ternary adder, thus leading to two architectures, named CA and CC, respectively. In a similar way, the modular designs proposed in [17,18] exploit 4 × 4 and 2 × 2 sub-multipliers to implement higher order multipliers. Finally, the open-source library presented in [19] collects several 8-bit approximate circuits, including 471 8 × 8 multipliers designed using conventional multiplication structures.

3. The Novel Approximation Strategy

The architecture of the proposed multiplier is illustrated in Figure 1. It relies on a double-stage encoding logic to simplify the multiplication by minimizing the number of non-zero bits involved in the accumulation of partial products. During the first stage, the inputs $A = a_{n-1} \ldots a_0$ and $B = b_{m-1} \ldots b_0$ are split into the sub-words $A_M = a_{n-1} \ldots a_{k_a}$, $A_L = a_{k_a-1} \ldots a_0$, $B_M = b_{m-1} \ldots b_{k_b}$, and $B_L = b_{k_b-1} \ldots b_0$, with ka and kb being chosen at design time. Then, the least significant sub-words $A_L$ and $B_L$ are partitioned into non-overlapping 3-bit groups and encoded through a purpose-built method based on a backward propagation action. The encoded digits CDx are properly aligned and OR-ed in overlapped positions, thus obtaining the approximate versions of the least significant sub-words, $A_L^a$ and $B_L^a$, corresponding to the closest power of two. In the second stage, four sub-products are calculated by multiplying the sub-words $A_M$, $B_M$, $A_L^a$, and $B_L^a$. While the most significant term $P_{MM} = A_M \times B_M$ is computed through a conventional multiplier, the others are obtained by using the new radix-4 encoding logic (NR4EL), here indicated with the operator Θ, which performs an accurate multiplication on approximated input operands, having at most one non-zero partial product. That is, $P_{ML}^a = A_M \,\Theta\, B_L^a$, $P_{LM}^a = B_M \,\Theta\, A_L^a$, and $P_{LL}^a = A_L^a \,\Theta\, B_L^a$. Both the encoding logics mentioned above are detailed later. Finally, the sub-products $P_{MM}$, $P_{ML}^a$, $P_{LM}^a$, and $P_{LL}^a$ are aligned, sign extended, and accumulated to compute the final approximate product $P_a$, as given in (5).
$$P_a[n+m-1:0] = 2^{k_a+k_b}\cdot P_{MM} + 2^{k_a}\cdot P_{ML}^a + 2^{k_b}\cdot P_{LM}^a + P_{LL}^a \tag{5}$$

3.1. The New 3-Bit Encoding Logic for Least Significant Sub-Words

As illustrated in Figure 2, before being encoded with the proposed method, the unsigned sub-words $A_L$ and $B_L$ are zero extended to be treated as positive numbers. Furthermore, additional zeros are put beside the least significant positions (as schematized with the red dots in Figure 2) if needed to obtain an integer number of non-overlapping 3-bit groups. The most significant group is encoded by using the novel logic E3bMG, whereas for the less significant bit positions, the encoding rules E3bG are applied. As visible in Figure 2, such an encoding logic is based on a back-propagation action sustained by the signals Pin and Pout. As shown in the following, coded digits CDx are then aligned and OR-ed to finally furnish $A_L^a$ and $B_L^a$.
It is worth noting that the above encoding strategy introduces an approximation to the closest power of two. As detailed in the following, this property allows simplifying the logic required to compute the accurate sub-products $P_{ML}^a$, $P_{LM}^a$, and $P_{LL}^a$. In addition, it must be pointed out that, in this context, the novel E3bMG and E3bG logic perform much better than the conventional leading-one detection. To understand this, as an example, let us consider the 8-bit numbers 127 and 65. While the proposed encoding provides the approximate values 128 and 64, thus leading to an absolute error equal to 1 in both cases, the conventional technique approximates both values to 64, thus causing absolute errors equal to 63 and 1, respectively.
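The difference between the two approximations in the 127 and 65 example can be reproduced with an idealized behavioral sketch. It models only the stated outcome (rounding to the closest power of two), not the gate-level E3bMG and E3bG rules of Figure 2. The function names are hypothetical, and the tie-breaking rule for values exactly halfway between two powers of two (such as 3 or 6) is an assumption, since the text does not specify it.

```python
def leading_one(x):
    """Conventional leading-one detection: largest power of two <= x."""
    return 1 << (x.bit_length() - 1) if x > 0 else 0


def closest_power_of_two(x):
    """Power of two nearest to x (0 stays 0).

    Assumption: ties, e.g. x = 3 or x = 6, round up.
    """
    if x <= 0:
        return 0
    low = leading_one(x)
    high = low << 1
    return high if (high - x) <= (x - low) else low


for value in (127, 65):
    print(value, closest_power_of_two(value), leading_one(value))
```

Note that rounding up can produce a value one bit wider than the original sub-word (127 becomes 128), which is consistent with the zero extension applied before encoding.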

3.2. The NR4EL Multiplication

Starting from the observation that the approximate sub-words $A_L^a$ and $B_L^a$ are represented as powers of two, containing at most one non-zero bit, a further original encoding step is proposed here to exploit this property in computing the sub-products $P_{ML}^a$, $P_{LM}^a$, and $P_{LL}^a$. That is, $A_L^a$ and $B_L^a$ are split into 3-bit groups, with 1-bit overlaps, and zero extended if needed to complete the most significant group. The NR4EL summarized in Figure 3 is applied to each 3-bit group GL to obtain the corresponding partial product PP. Since the approximate sub-words contain at most one non-zero bit, the NR4EL can output just three possible values: 0, MD, and 2 × MD, with MD being the multiplicand (i.e., $A_M$ or $B_M$ or $A_L^a$). Moreover, the computations of the sub-products $P_{ML}^a$ and $P_{LM}^a$ involve at most one non-zero partial product, whereas at most one bit is asserted among all the partial products computed to calculate $P_{LL}^a$. Due to this, partial products are accumulated by simple logic ORs rather than addition circuits, so a significant energy reduction is expected with respect to conventional approaches.
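The property that a one-hot multiplier yields at most one non-zero partial product, so that an OR can replace the adder tree, can be illustrated as follows. The actual NR4EL truth table is given in Figure 3 and is not reproduced here. The digit assignment below (lower bit of the group selects MD, upper bit selects 2 × MD, overlap bit ignored) is an assumption chosen only because it is consistent with the three output values named in the text. All function names are hypothetical.

```python
def nr4el_partial_products(md, onehot, width):
    """Partial products of md times a multiplier with at most one bit set.

    Group j covers bits 2j+1 and 2j (plus the overlap bit 2j-1, which in
    this assumed table contributes nothing because the previous group
    already accounts for it).
    """
    pps = []
    for j in range((width + 1) // 2):
        lo = (onehot >> (2 * j)) & 1
        hi = (onehot >> (2 * j + 1)) & 1
        digit = 2 if hi else (1 if lo else 0)   # selects 0, MD, or 2*MD
        pps.append((digit * md) << (2 * j))
    return pps


def nr4el_multiply(md, onehot, width):
    """Combine the partial products with OR instead of addition."""
    acc = 0
    for pp in nr4el_partial_products(md, onehot, width):
        acc |= pp
    return acc
```

Because only one partial product is non-zero, the OR returns exactly `md << p` when bit p of the multiplier is set.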
It is worth noting that, due to the approximation made on the least significant bits of the input operands, the proposed NR4EL logic leads to hardware requirements quite different from those of a conventional radix-4 Booth multiplier. Just as a comparison, let us refer to the example illustrated in Figure 4, where n = m = 8 and the configuration ka = 5 and kb = 4 is used to perform the multiplication by the novel approximate strategy. The input operands A and B are first partitioned into the most significant ($A_M$, $B_M$) and least significant ($A_L$, $B_L$) parts. The latter are zero extended and encoded through the 3-bit logic shown in Section 3.1. Coded digits are aligned taking into account that their significance is dictated by the bit positions involved in the 3-bit groups from which they are calculated. The approximate values $A_L^a$ and $B_L^a$ are then obtained by simply ORing their overlapped bits. As discussed above (see Figure 1), $P_{MM}$ is computed by a full-precision conventional multiplier, whereas $P_{ML}^a$, $P_{LM}^a$, and $P_{LL}^a$ exploit the NR4EL multiplication logic. In contrast to the Booth multiplier, the proposed one, thanks to its coding strategy, allows any additional circuit for the computation of $P_{ML}^a$, $P_{LM}^a$, and $P_{LL}^a$ to be avoided, as illustrated in Figure 4b. The sub-products obtained in this way are aligned, sign extended, and summed to finally furnish the approximate product $P_a$, as shown in the last step of Figure 4, which also reports the exact product $P_e$.
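The complete flow of (5) can be summarized in an idealized software model. It assumes that the 3-bit encoding behaves as rounding to the closest power of two, with ties rounded up (an assumption), and it uses ordinary integer products in place of the NR4EL circuit. The operand values in the usage lines are chosen for illustration and are not those of Figure 4; the function names are hypothetical.

```python
def closest_power_of_two(x):
    """Idealized stand-in for the 3-bit encoding; ties round up (assumed)."""
    if x <= 0:
        return 0
    low = 1 << (x.bit_length() - 1)
    high = low << 1
    return high if (high - x) <= (x - low) else low


def approx_multiply(a, b, ka, kb):
    """Approximate product following (5) for signed operands a and b."""
    a_m, a_l = a >> ka, a & ((1 << ka) - 1)
    b_m, b_l = b >> kb, b & ((1 << kb) - 1)
    a_la = closest_power_of_two(a_l)   # approximated A_L
    b_la = closest_power_of_two(b_l)   # approximated B_L
    p_mm = a_m * b_m                   # conventional multiplier
    p_ml = a_m * b_la                  # would use NR4EL in hardware
    p_lm = b_m * a_la
    p_ll = a_la * b_la
    return (p_mm << (ka + kb)) + (p_ml << ka) + (p_lm << kb) + p_ll


# Illustrative operands with ka = 5 and kb = 4
print(approx_multiply(127, 65, 5, 4), 127 * 65)
print(approx_multiply(100, 40, 5, 4), 100 * 40)
```

In this model the result is exact whenever both least significant sub-words are already zero or a power of two, as in the second usage line.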

4. Accuracy and Implementation Results

To prove the effectiveness and the high flexibility of the proposed method, several signed n × m approximate multipliers were implemented using both ASIC and FPGA realization platforms. In the following, Newka_kb indicates a multiplier designed as described here that approximates ka LSBs of A and kb LSBs of B. This section presents results obtained for both symmetric and asymmetric designs. Performances achieved by our proposal are discussed and compared with competitors. All quality measures, namely the average error (AE), error rate (ER), normalized mean error distance (NMED), and mean relative error distance (MRED), defined as reported in [30], and the number of effective bits (NoEB), introduced in [8], have been obtained through exhaustive C++ simulations. It is worth noting that accuracy tests for multipliers with operand word lengths greater than 16 bits are excessively time consuming. Therefore, as in all the previous works [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19], only the hardware characteristics are provided for such cases.
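An exhaustive evaluation of this kind can be sketched as below, using the commonly adopted definitions of ER, MED, NMED, and MRED. Two choices are assumptions of this sketch and may differ from [30]: the NMED is normalized by the largest product magnitude of a signed n × m multiplier, 2^(n+m−2), and input pairs whose exact product is zero are skipped in the MRED. The function name `error_metrics` is hypothetical, and AE and NoEB are left out.

```python
def error_metrics(approx_mul, n, m):
    """Exhaustively compare approx_mul(a, b) with a*b over signed inputs."""
    total = errors = nonzero = 0
    sum_ed = 0
    sum_red = 0.0
    for a in range(-(1 << (n - 1)), 1 << (n - 1)):
        for b in range(-(1 << (m - 1)), 1 << (m - 1)):
            exact = a * b
            ed = abs(exact - approx_mul(a, b))   # error distance
            total += 1
            errors += ed != 0
            sum_ed += ed
            if exact != 0:                       # relative error undefined at 0
                sum_red += ed / abs(exact)
                nonzero += 1
    med = sum_ed / total
    return {
        "ER": errors / total,
        "MED": med,
        "NMED": med / (1 << (n + m - 2)),        # assumed normalization
        "MRED": sum_red / nonzero,
    }


# Usage with a toy approximate multiplier that clears the product LSB
print(error_metrics(lambda a, b: (a * b) & ~1, 4, 4))
```

The nested loop visits 2^(n+m) operand pairs, which illustrates why exhaustive tests become impractical beyond 16-bit operands.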

4.1. Design Space Exploration

It is important to note that the possibility of differently setting ka and kb represents a further degree of freedom that can be exploited to finely tune the energy and accuracy of the multiplier to the requirements of a given specific application. This property leads to a design space wider than that bounded by using ka = kb and it cannot always be obtained by other techniques, such as those based on approximate compressors. In Figure 5, the normalized energy-NMED design space for the 8 × 8 multiplier is illustrated for ka and kb varying in the range 1–6.
Just as an example, with respect to the symmetric ka = kb = 4 scenario, approximating one more bit on one operand (e.g., ka = 4 and kb = 5) leads to a 7% higher energy gain with an NMED increased by only ~0.005. On the contrary, the ka = 3 and kb = 4 configuration reduces the NMED by ~0.002 and the energy gain with respect to the precise architecture by ~3.5%. Such an analysis can be useful to optimize the parameters ka and kb for a given scenario. As an example, for the image processing applications referred to in Section 5, the configuration with ka = 2 and kb = 6 is particularly efficient, given the significantly different nature of the operands to be multiplied. However, the optimizations of ka and kb and the realization of a design framework are beyond the scope of this paper.

4.2. ASIC Implementations

For purposes of a fair comparison with state-of-the-art counterparts, 8 × 8 and 16 × 16 signed multipliers were implemented using the TSMC 40 nm CMOS 1.1 V and the ST 28 nm UTBB-FDSOI 1 V technologies. They were synthesized with the Cadence Genus™ tool version 19.11 at the minimum delay constraint, inserting registers as the driving and the loading logic, with the output flip-flops having 0.1 pF load capacitances. The energy consumption was analyzed using the Value Change Dump (VCD) files extracted for 100,000 random vectors.
Table 1 and Table 2 collect results obtained in terms of delay (D), silicon area (A), energy (E), average error (AE), error rate (ER), and number of effective bits (NoEB). The behavior of each approximate multiplier is clearly appreciable in comparison with the precise baseline versions realized with the same technology process.
The New2_6 signed design achieves an energy saving higher than 80%, with a negligible impact on the speed performance. The 2StepTrunc signed architecture [8] shows an energy saving with respect to its baseline of ~76%, and, even though it reaches an interesting delay reduction, the achieved quality level is considerably lower than that of the New2_6. On the other hand, while the C-FULL circuit [9] dissipates the same energy as the proposed one, it shows a much lower gain with respect to the baseline and achieves a NoEB lower than the New2_6. Furthermore, it must be considered that the architectures in [9] operate only on unsigned operands. The above analysis confirms the effectiveness of the proposed approach in reducing the number of non-zero bits within the tree of partial products in favor of energy efficiency. Indeed, the strategies proposed in [8,9], being, respectively, based on LSB truncation and approximate compressors, just partially simplify the adder circuits responsible for the accumulation of the partial products.
The energy gain obtained over the baseline generally deteriorates with the operands word length. From Table 2, it can be observed that the New8_8 16 × 16 signed multiplier saves ~75% of the energy, whereas [8] saves at most ~63%. Surprisingly, [9] shows a ~8% improvement in this figure. However, the quality level of the 16 × 16 New8_8 multiplier still overcomes the competitors. On the other hand, [8,9] achieve area and delay reductions remarkably higher than the new designs.
In order to evaluate how the ASIC designs trade-off energy saving (EnSv), accuracy, area, and delay, the figure of merit defined in (6) and the comprehensive cost function given in (7) are introduced.
$$FM_{ASIC} = EnSv_{\%} \times NoEB \tag{6}$$
$$CF_{ASIC} = \frac{D \times A \times E}{NoEB} \tag{7}$$
Figure 6 plots the normalized values of FMASIC (NFM) and CFASIC (NCF) and shows that the FMASIC achieved by the New2_6 circuit is 12% and 34% higher than 1StepTrunc [8] and CSSM [7], respectively. Indeed, at a comparable NoEB, the signed 8 × 8 architectures demonstrated in [7] reach a power saving ~20% lower. The graceful behavior of the proposed multiplier is confirmed by the CFASIC, which is up to 13 times lower than that of the competitors.
The FMASIC also reveals that, among the 16 × 16 designs, 1StepTrunc [8] reaches the best trade-off. However, the FMASIC of the proposed New8_8 signed multiplier is only 5% lower and up to 2.6 times higher than other competitors referenced in Table 2.

4.3. FPGA Implementations

Table 3 and Table 4 collect hardware characteristics of 8 × 8 and 16 × 16 approximate multipliers implemented on a Xilinx XC7VX330T FPGA device. Data related to competitors are extracted from the original papers.
Table 3 shows that the circuits BA and Trunc [15] achieve the lowest resource requirements and energy dissipation, respectively. Conversely, CC [16] reaches the highest speed performance. However, the above architectures are characterized by MRED values quite higher than those achieved by the multipliers designed using the strategy here proposed. Indeed, the circuit New4_4 achieves the lowest MRED. Results in Table 3 show that New4_4 and New2_6 architectures achieve the best energy-quality-delay trade-off, significantly overcoming their counterparts.
Table 4 compares a 16 × 16 architecture based on the proposed approach to the competitors AxBM1 and AxBM2 [14], and it reports the MRED, the MED, and the NMED because those metrics are used in [14]. It can be noted that the multipliers AxBM1 and AxBM2 achieve better energy-quality-delay trade-offs. However, such a result is obtained by adopting specific and strictly platform-dependent LUT-level optimizations, which prevent exploiting the AxBM1 and AxBM2 within ASIC designs as efficiently. Even without exploiting any specific optimization, the New11_11 architecture is ~12% faster than [14] and reaches a more than acceptable energy-quality behavior. As a final remark, it is worth noting that none of the competitors evaluated in Table 1, Table 2, Table 3 and Table 4 have the ability to perform well by using both ASIC and FPGA platforms.
In order to show how the operands word length and the adopted configuration affect the behavior of the novel multipliers, further implementations have been characterized. All the obtained results are summarized in Table 5 and Table 6, where the results presented in Table 3 and Table 4 are also reported, to provide a clearer picture. The former collects the achieved accuracy, whereas the latter reports the hardware characteristics in comparison with the competitors [15,16,17,18] and the accurate IP cores.
From Table 6, it is pretty evident that the LUT-optimized approximate design BA [15] is the cheapest one and often dissipates less energy than competitors. Conversely, at least one of the configurations examined for the new multiplier performs better than S1 [17], S2 [15], and S3 [18]. In fact, the amounts of LUTs required by the newly proposed 8 × 8, 12 × 12, 16 × 16, and 24 × 24 implementations are up to ~38%, ~22%, ~54%, and ~37% lower, respectively. It is also worth noting that the designs S1, S2, and S3 always utilize more LUTs than the accurate design. Furthermore, it can be seen that the amount of LUTs required by the designs CA and CC [16] rapidly grows with the operands bit-width, thus becoming higher than the novel multipliers starting from 16 × 16 implementations.
Table 6 also shows that CC implementation always leads to the lowest energy consumption. However, it must be taken into account that both the architectures CA and CC operate on unsigned inputs [16]. The energy improvement achieved by the proposed approximation strategy over the accurate counterpart increases with the operands bit-width: the ~34% energy saving reached in the case of 8 × 8 multipliers, grows to ~43%, ~56%, ~63%, and ~70% for the 12 × 12, 16 × 16, 24 × 24, and 32 × 32 implementations, respectively. The novel designs also exhibit an appreciable energy savings ranging from ~20% to 72%, with respect to the competitors S1, S2, S3, BA, and CA. Conversely, their energy consumption is comparable to AxBM1 and AxBM2 [14] (see also Table 4).

5. Case Study: Image Processing Applications

As an example of applications, the proposed approximate multipliers have been exploited in the realization of two image processing sub-systems, commonly adopted as benchmarks in similar works [7,8,9,10,11,12,13,14,15,17]: the 2D filter and the edge detector. While the former convolves the input image with a single kernel, the latter performs convolutions with two kernels that compute horizontal and vertical gradients of the input image. Both the sub-systems are based on the 8 × 8 New2_6 multiplier and receive the kernel values as external inputs. Therefore, they can support different edge detectors and filters. However, for purposes of comparison with previous works, the Sobel operator and the 2D Gaussian smoothing filters have been referenced. The energy consumption of the complete systems was analyzed with 100,000 random vectors at the maximum toggle rates, whereas the accuracy was examined using images from the USC-SIPI dataset [31] as test benches. Accuracy results discussed in the following are calculated by averaging those obtained for all the 256 × 256 and 512 × 512 images available in [31]. Sample images reported in Figure 7 show that the new approximate multipliers work well in both the referred image processing applications.
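The structure of the edge detector, two 3 × 3 kernels applied to the same image with a replaceable multiplier, can be sketched in software as follows. This is a plain behavioral model, not the implemented hardware. The gradient magnitude is approximated as |Gx| + |Gy| clipped to 255, which is an assumption of this sketch, and the `mul` parameter is a hypothetical hook through which any approximate multiplier model could be plugged in.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient


def filter3x3(image, kernel, mul):
    """Slide a 3x3 kernel over the image (no padding, no kernel flip)."""
    h, w = len(image), len(image[0])
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for y in range(h - 2):
        for x in range(w - 2):
            acc = 0
            for i in range(3):
                for j in range(3):
                    acc += mul(image[y + i][x + j], kernel[i][j])
            out[y][x] = acc
    return out


def sobel_edges(image, mul=lambda pixel, coeff: pixel * coeff):
    """Edge map as |Gx| + |Gy|, clipped to the 8-bit range (assumed)."""
    gx = filter3x3(image, SOBEL_X, mul)
    gy = filter3x3(image, SOBEL_Y, mul)
    return [[min(255, abs(a) + abs(b)) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]
```

Passing a different kernel pair to `filter3x3` yields other edge detectors, which mirrors the choice of receiving the kernel values as external inputs.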
It is worth pointing out that, in order to analyze the behavior of the designed sub-systems on different FPGA devices, they have been implemented within Xilinx VIRTEX 7 XC7VX485 and Altera CYCLONE 006YE144A7G chips. Table 7 summarizes the hardware characteristics of the implemented sub-systems at different filter sizes. Moreover, it reports the accuracy achieved when the 2D Gaussian smoothing filtering is performed, averaged over the processed test images. The Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity (SSIM) [26] quality metrics have been selected for purposes of comparison with the approximate filters presented in [15].
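For reference, the PSNR between an accurately filtered image and its approximate counterpart can be computed as in the following sketch, which uses the standard definition for 8-bit images (peak value 255). The function name and the list-of-rows image representation are assumptions of this example; SSIM is omitted because it requires a windowed computation.

```python
import math


def psnr(reference, test, peak=255):
    """Peak Signal-to-Noise Ratio in dB between two equally sized images."""
    squared_error = 0
    count = 0
    for row_ref, row_test in zip(reference, test):
        for r, t in zip(row_ref, row_test):
            squared_error += (r - t) ** 2
            count += 1
    if squared_error == 0:
        return math.inf               # identical images
    mse = squared_error / count       # mean squared error
    return 10 * math.log10(peak ** 2 / mse)
```

Averaging the value returned for each test image gives a single figure per filter size, in line with the averaging over the dataset described above.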
To provide a complete overview, the behavior of LUT-optimized accurate filters, referenced as the baselines and employing the 8 × 8 accurate IP core multiplier, is also shown. It is worth pointing out that, in terms of SSIM, the approximate filters based on the novel multipliers achieve the same behavior as the accurate implementations. Moreover, when compared to the filter based on the BA multiplier presented in [15], in terms of PSNR, the novel design exhibits an improvement ranging from ~4.8% to ~16%, achieved for the 3 × 3 and the 7 × 7 filter size, respectively. Xilinx VIRTEX 7 implementations exhibit an energy reduction with respect to the baseline that, depending on the filter size, varies between ~25% and ~32%, with an energy improvement up to ~8.5% achieved in comparison with [15]. The Altera CYCLONE implementation achieves a ~56% energy reduction over the baseline. Table 7 also shows that the approximate filters designed as proposed here are up to ~21% and ~22% faster than the baselines and the counterparts characterized in [15], respectively. Finally, it must be noted that, since the architectures proposed in [15] exploit FPGA-specific optimizations, they achieve appreciable reductions in terms of utilized logic resources, with respect to the accurate IP-based implementations.
Table 8 compares several 3 × 3 Sobel edge detectors based on 8 × 8 approximate multipliers. The energy gains and the edge detection accuracies achieved with respect to the precise baselines are reported. While AxBM2 [14] achieves the highest energy gain and the architecture in [10] obtains the best accuracy level, the proposed strategy leads to an appreciable trade-off, even though it does not exploit any specific and strictly platform-dependent LUT-level optimization. On the other hand, this is why, in contrast to the competitors, the approximation approach proposed here can also be efficiently employed in ASIC designs, as clearly visible in Table 9. The latter reports the percentage gains in terms of area, delay, and energy achieved over the accurate baselines, together with the SSIM degradations caused by several approximation techniques in Gaussian smoothing filtering. It can be observed that the proposed method significantly outperforms the competitors. It is worth highlighting that the approximation strategy presented here maintains the same accuracy achieved by the accurate baseline. Conversely, the approach exploited in [8] reduces the SSIM by up to 8%.
Additional tests demonstrated that the approximate multipliers designed as proposed here also work well when employed in 5 × 5 and 7 × 7 Gaussian smoothing filters. In fact, energy gains close to 25% and 80% are still achieved by the FPGA and ASIC designs, respectively, with a PSNR higher than 50 dB and without causing SSIM degradation.

6. Conclusions

This paper presented a novel approach to designing energy-efficient platform-independent approximate multipliers. In fact, even without exploiting specific low-level optimizations, the proposed approximate approach leads to efficient designs on both ASICs and FPGAs. This is a remarkable advantage over existing counterparts, given that, even though any design described using VHDL can be synthesized and implemented onto any realization platform, the energy-delay trade-off achieved is typically quite far from that reached by counterparts natively optimized for a specific platform.
The novel strategy directly approximates the operands received as inputs. To do this in a smart way, thus limiting the overall accuracy loss, an innovative encoding logic has been introduced. The approximation method proposed here has been applied to several signed multipliers with different operand word lengths. A thorough analysis performed in terms of accuracy metrics and hardware characteristics demonstrated that the novel approximation strategy achieves remarkable energy savings in both FPGA-based and ASIC implementations. The ASIC designs have shown that the novel approximation technique achieves the best energy improvement over the accurate baseline and outperforms several competitors in terms of NoEB.
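An accuracy analysis of this kind can be sketched in software by exhaustively comparing an approximate multiplier model against the exact product. The example below is only an illustration under stated assumptions: `trunc2_mul` is a stand-in that clears the two LSBs of each operand (the truncation described in the footnote of Table 3), not the encoding logic proposed in this paper, and the metric definitions (error rate, mean error distance, its normalization by the largest product magnitude, and mean relative error distance over non-zero products) follow commonly used conventions that may differ in detail from those adopted for the tables.

```python
from typing import Callable, Dict

def trunc2_mul(a: int, b: int) -> int:
    """Stand-in approximate multiplier: the two LSBs of each operand are cleared."""
    return (a & ~3) * (b & ~3)

def error_metrics(approx_mul: Callable[[int, int], int], bits: int = 8) -> Dict[str, float]:
    """Exhaustive error analysis over all signed `bits`-wide operand pairs."""
    lo, hi = -(1 << (bits - 1)), 1 << (bits - 1)
    max_abs_product = (1 << (bits - 1)) ** 2  # |(-2^(n-1)) * (-2^(n-1))|
    total = wrong = nonzero = 0
    sum_ed = 0
    sum_red = 0.0
    for a in range(lo, hi):
        for b in range(lo, hi):
            exact = a * b
            ed = abs(approx_mul(a, b) - exact)  # error distance
            total += 1
            if ed:
                wrong += 1
            sum_ed += ed
            if exact != 0:  # the relative error is undefined for a zero product
                sum_red += ed / abs(exact)
                nonzero += 1
    med = sum_ed / total
    return {
        "ER": 100.0 * wrong / total,    # error rate, in percent
        "MED": med,                     # mean error distance
        "NMED": med / max_abs_product,  # MED normalized to the largest product
        "MRED": sum_red / nonzero,      # mean relative error distance
    }

if __name__ == "__main__":
    for name, value in error_metrics(trunc2_mul).items():
        print(f"{name}: {value:.4g}")
```

Replacing `trunc2_mul` with a bit-accurate model of another multiplier allows the same loop to be reused for its characterization.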
The proposed technique has been applied to design approximate 2D filters and edge detectors. When implemented onto FPGA devices, the novel approximate filters exhibit an energy consumption up to ~32% lower than the optimized baselines. Moreover, the achieved energy-delay product is more than 24% lower than that of the state-of-the-art counterpart [15]. Even better behaviors have been observed for the ASIC designs, which consume more than 80% less energy than the baselines without affecting the accuracy achieved in terms of SSIM.
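The energy-delay product (EDP) comparison can be checked with the 3 × 3 VIRTEX 7 entries of Table 7. The following sketch is a plain arithmetic check; the function names are hypothetical and the numeric values are those read from Table 7 (energy in pJ, delay in ns).

```python
def edp(energy_pj: float, delay_ns: float) -> float:
    """Energy-delay product (pJ x ns)."""
    return energy_pj * delay_ns

def edp_reduction_pct(reference_edp: float, new_edp: float) -> float:
    """Percentage by which `new_edp` is lower than `reference_edp`."""
    return 100.0 * (reference_edp - new_edp) / reference_edp

# 3 x 3 filters on the VIRTEX 7 device, values read from Table 7.
edp_new = edp(98.0, 4.9)   # filter based on New2_6
edp_ba = edp(100.8, 6.3)   # filter based on the BA multiplier of [15]

gain = edp_reduction_pct(edp_ba, edp_new)
print(f"EDP reduction over [15]: {gain:.1f}%")
```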

Author Contributions

Conceptualization, S.P., P.C. and F.S.; methodology, S.P., P.C. and F.F.; software, S.P.; validation, S.P., P.C. and F.S.; formal analysis, S.P. and P.C.; writing—original draft preparation, S.P. and P.C.; writing—review and editing, S.P., P.C., F.S. and F.F.; funding acquisition, S.P. and P.C. All authors have read and agreed to the published version of the manuscript.

Funding

The activity of F.S. was funded by Ministero dell’Università e della Ricerca (PON Ricerca & Innovazione—Grant 1062_R24_INNOVAZIONE).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Alioto, M. Ultra-Low Power VLSI Circuit Design Demystified and Explained: A Tutorial. IEEE Trans. Circuits Syst. I Regul. Pap. 2012, 59, 3–29.
  2. Jiang, H.; Santiago, F.J.H.; Mo, H.; Liu, L.; Han, J. Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications. Proc. IEEE 2020, 108, 2108–2135.
  3. Chang, C.-H.; Satzoda, R.K. A low error and high performance multiplexer-based truncated multiplier. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2010, 18, 1767–1771.
  4. Frustaci, F.; Perri, S.; Corsonello, P.; Alioto, M. Approximate Multipliers with Dynamic Truncation for Energy Reduction via Graceful Quality Degradation. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 3427–3431.
  5. Hashemi, S.; Bahar, R.I.; Reda, S. DRUM: A Dynamic Range Unbiased Multiplier for approximate applications. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Austin, TX, USA, 2–6 November 2015.
  6. Narayanamoorthy, S.; Moghaddam, H.A.; Liu, Z.; Park, T.; Kim, N.S. Energy-Efficient Approximate Multiplication for Digital Signal Processing and Classification Applications. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2015, 23, 1180–1184.
  7. Strollo, A.G.M.; Napoli, E.; De Caro, D.; Petra, N.; Saggese, G.; Di Meo, G. Approximate Multipliers Using Static Segmentation: Error Analysis and Improvements. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 2449–2462.
  8. Esposito, D.; Strollo, A.G.M.; Napoli, E.; De Caro, D. Approximate Multipliers Based on New Approximate Compressors. IEEE Trans. Circuits Syst. I Regul. Pap. 2018, 65, 4169–4182.
  9. Strollo, A.G.M.; Napoli, E.; De Caro, D.; Petra, N.; Di Meo, G. Comparison and Extension of Approximate 4-2 Compressors for Low-Power Approximate Multipliers. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 67, 3021–3034.
  10. Venkatachalam, S.; Adams, E.; Lee, H.J.; Ko, S.-B. Design and analysis of area and power efficient approximate booth multipliers. IEEE Trans. Comput. 2019, 68, 1697–1703.
  11. Waris, H.; Wang, C.; Liu, W. Hybrid low radix encoding-based approximate booth multipliers. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 3367–3371.
  12. Kulkarni, P.; Gupta, P.; Ercegovac, M. Trading accuracy for power with an underdesigned multiplier architecture. In Proceedings of the 24th International Conference on VLSI Design, Chennai, India, 2–7 January 2011.
  13. Qiqieh, I.; Shafik, R.; Tarawneh, G.; Sokolov, D.; Yakovlev, A. Energy-efficient approximate multiplier design using bit significance-driven logic compression. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Lausanne, Switzerland, 27–31 March 2017.
  14. Waris, H.; Wang, C.; Liu, W.; Lombardi, F. AxBMs: Approximate Radix-8 Booth Multipliers for High-Performance FPGA-Based Accelerators. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 1566–1570.
  15. Ullah, S.; Schmidl, H.; Sahoo, S.S.; Rehman, S.; Kumar, A. Area-Optimized Accurate and Approximate Softcore Signed Multiplier Architecture. IEEE Trans. Comput. 2021, 70, 384–392.
  16. Ullah, S.; Rehman, S.; Shafique, M.; Kumar, A. High-Performance Accurate and Approximate Multipliers for FPGA-based Hardware Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 211–224.
  17. Rehman, S.; El-Harouni, W.; Shafique, M.; Kumar, A.; Henkel, J. Architectural-space exploration of approximate multipliers. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Austin, TX, USA, 7–10 November 2016.
  18. Ullah, S.; Rehman, S.; Prabakaran, B.S.; Kriebel, F.; Hanif, M.A.; Shafique, M.; Kumar, A. Area-optimized low-latency approximate multipliers for FPGA-based hardware accelerators. In Proceedings of the 55th Annual Design Automation Conference, San Francisco, CA, USA, 24–28 June 2018.
  19. Mrazek, V.; Hrbacek, R.; Vasicek, Z.; Sekanina, L. EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Lausanne, Switzerland, 27–31 March 2017.
  20. Imed, B.D. Implementation of a Fuel Estimation Algorithm Using Approximated Computing. J. Low Power Electron. Appl. 2022, 12, 17.
  21. Preatto, S.; Giannini, A.; Valente, L.; Masera, G.; Martina, M. Optimized VLSI Architecture of HEVC Fractional Pixel Interpolators with Approximate Computing. J. Low Power Electron. Appl. 2020, 10, 24.
  22. Coelho, D.F.G.; Cintra, R.J.; Bayer, F.M.; Kulasekera, S.; Madanayake, A.; Martinez, P.; Silveira, T.L.T.; Oliveira, R.S.; Dimitrov, V.S. Low-Complexity Loeffler DCT Approximations for Image and Video Coding. J. Low Power Electron. Appl. 2018, 8, 46.
  23. Balasubramanian, P.; Maskell, D.L. Hardware Optimized and Error Reduced Approximate Adder. Electronics 2019, 8, 1212.
  24. Tastan, I.; Karaca, M.; Yurdakul, A. Approximate CPU Design for IoT End-Devices with Learning Capabilities. Electronics 2020, 9, 125.
  25. Perri, S.; Spagnolo, F.; Frustaci, F.; Corsonello, P. Efficient Approximate Adders for FPGA-Based Data-Paths. Electronics 2020, 9, 1529.
  26. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
  27. Perri, S.; Corsonello, P.; Cocorullo, G. Efficient recursive multiply architecture for FPGAs. Electron. Lett. 2005, 41, 1314–1316.
  28. 7 Series FPGAs Configurable Logic Block User Guide, UG474 (v1.8), 27 September 2016. Available online: https://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf (accessed on 22 July 2022).
  29. Intel® Stratix® 10 Logic Array Blocks and Adaptive Logic Modules User Guide, UG-S10LA, 4 April 2020. Available online: https://www.intel.com/content/dam/www/programmable/us/en/pdf/literature/hb/stratix-10/ug-s10-lab.pdf (accessed on 22 July 2022).
  30. Liang, J.; Han, J.; Lombardi, F. New Metrics for the Reliability of Approximate and Probabilistic Adders. IEEE Trans. Comput. 2013, 62, 1760–1771.
  31. SIPI Image Database. 2019. Available online: http://sipi.usc.edu/database/database.php?volume=misc (accessed on 22 July 2022).
Figure 1. The top-level architecture of the proposed multiplier.
Figure 2. Partitioning and encoding logic applied to the sub-words AL and BL.
Figure 3. The novel encoding logic NR4EL.
Figure 4. An example of multiplication through the proposed approximate approach: (a) approximate A and B; (b) compute P_MM, P_ML^a, P_LM^a, P_LL^a; (c) compute the approximate product Pa.
Figure 5. The design spaces for the 8 × 8 multiplier: (a) normalized energy; (b) normalized NMED.
Figure 6. Normalized FMASIC and CFASIC of 8 × 8 signed designs (SSM [7], CSSM [7], 1StepFull [8], 1StepTrunc [8], 2StepFull [8], 2StepTrunc [8]).
Figure 7. Sample images: (a) original; (b) precise filtering; (c) approximate filtering.
Table 1. Hardware Characteristics of the 8 × 8 ASIC Designs.
Circuit | Process | D (ps) | A (µm²) | E (pJ) | AE | ER% | NoEB
Baseline [8] | 40 nm | 564 | 986 | 1.29 | PRECISE
1StepFull [8] | 40 nm | 500 | 524 | 0.81 | 42.3 | 30 | 9.47
1StepTrunc [8] | 40 nm | 500 | 310 | 0.47 | 2.3 × 10^2 | 96 | 7.89
2StepFull [8] | 40 nm | 419 | 428 | 0.72 | 8.7 × 10^2 | 84 | 5.59
2StepTrunc [8] | 40 nm | 375 | 171 | 0.3 | 1.0 × 10^3 | 99 | 5.46
Our Baseline | 40 nm | 506 | 780.4 | 0.24 | PRECISE
New2_6 | 40 nm | 519 | 529.6 | 0.04 | 0.324 | 91.4 | 6.79
Baseline [9] | 28 nm | 260 | 196 | 0.046 | PRECISE
C-N [9] | 28 nm | 248.6 | 175 | 0.041 | n.a. | 9 | 10.8
C-Full [9] | 28 nm | 216 | 155 | 0.031 | n.a. | 40 | 5.44
Our Baseline | 28 nm | 280 | 370 | 0.16 | PRECISE
New2_6 | 28 nm | 284 | 360 | 0.031 | 0.324 | 91.4 | 6.79
Table 2. Hardware Characteristics of the 16 × 16 ASIC Designs.
Circuit | Process | D (ps) | A (µm²) | E (pJ) | AE | ER% | NoEB
Baseline [8] | 40 nm | 800 | 2595 | 3.58 | PRECISE
1StepFull [8] | 40 nm | 746 | 1859 | 2.94 | 3.57 × 10^4 | 61 | 16.04
1StepTrunc [8] | 40 nm | 730 | 1002 | 1.56 | 1.45 × 10^5 | 100 | 14.66
2StepFull [8] | 40 nm | 667 | 1147 | 2.01 | 3.77 × 10^6 | 97 | 9.36
2StepTrunc [8] | 40 nm | 650 | 700 | 1.29 | 3.86 × 10^6 | 100 | 9.35
Our Baseline | 40 nm | 720 | 2362 | 7.2 | PRECISE
New8_8 | 40 nm | 737 | 1814 | 1.62 | 8867.18 | 99.87 | 10.12
Baseline [9] | 28 nm | 375 | 920 | 3.52 | PRECISE
C-N [9] | 28 nm | 363 | 821 | 2.94 | n.a. | 47 | 17.53
C-Full [9] | 28 nm | 318 | 727 | 2.11 | n.a. | 88 | 5.44
Our Baseline | 28 nm | 445 | 1016 | 4.68 | PRECISE
New8_8 | 28 nm | 446 | 849 | 1.2 | 8867.18 | 99.87 | 10.12
Table 3. Hardware Characteristics and accuracy of the 8 × 8 FPGA Designs.
Configuration#LUTsD(ns)E(pJ)AEER (%)MRED
BA [15]373.414.2285.0190.560.091
Trunc 1 [15]432.153.06149.78930.121
S2 [15]864.897.42118.87534.190.0223
CA [16]573.134.7354.198.360.0029
CC [16]561.983.551592.2680.460.13
S1 [17]924.997.11842.4486.460.362
S3 [18]815.197.41101.948.420.0121
S5 [19]1104.439.75127.1184.430.049
New4_4822.47.20.066489.691.5 × 10−4
New2_6682.133.80.32491.352.1 × 10−3
1 The two LSBs of each input operand are truncated [15].
Table 4. Hardware Characteristics and accuracy of the 16 × 16 FPGA Designs.
Configuration | #LUTs | D (ns) | E (pJ) | MRED | MED | NMED
AxBM1 [14] | 194 | 3.68 | 18.03 | 5.0 × 10^−4 | 9233.62 | 8.6 × 10^−6
AxBM2 [14] | 161 | 3.45 | 14.21 | 3.0 × 10^−4 | 7623.1 | 7.1 × 10^−6
New11_11 | 183 | 3.03 | 15.15 | 4.0 × 10^−3 | 13,194.9 | 1.2 × 10^−5
Table 5. Accuracy of the novel multipliers at various operands word lengths.
Multiplier | Configuration | AE | MRED | NMED | NoEB
8 × 8 | New4_4 | 0.0664 | 1.50 × 10^−4 | 0.009 | 8.343
8 × 8 | New5_3 | 0.476 | 5.70 × 10^−4 | 0.0127 | 7.78
8 × 8 | New2_6 | 0.324 | 2.10 × 10^−3 | 0.024 | 6.79
12 × 12 | New8_8 | 82.556 | 2.50 × 10^−3 | 0.00943 | 8.306
16 × 16 | New8_8 | 8867.18 | 8.38 × 10^−3 | 1.53 × 10^−5 | 10.118
16 × 16 | New11_11 | 82,031.17 | 3.97 × 10^−3 | 1.22 × 10^−5 | 7.1
Table 6. Hardware characteristics obtained with the VIRTEX 7 Device.
n × mConfiguration#LUTsD (ns)E (pJ)
8 × 8New4_4822.47.2
New5_3782.214.42
12 × 12New8_81772.811.2
BA [15]795.310.17
Trunc [15]1023.528.97
S1 [17]2286.9820.8
S2 [15]1896.3720.77
S3 [18]1857.1120.39
Accurate IP core1624.219.79
16 × 16New8_8 2703.420.4
New11_11 1833.0315.15
BA [15]1447.6421.15
Trunc [15]2144.114.76
S1 [17]2286.9820.8
S2 [15]3306.5920.39
S3 [18]2967.3318.58
CA [16]2454.9826.5
CC [16]2402.3816.16
Accurate IP core2864.2734.35
24 × 24New16_165653.628.8
BA [15]30110.9948.26
Trunc [15]5146.0753.97
S1 [17]8959.43101.63
S2 [15]7779.4597.48
S3 [18]6979.6992.35
Accurate IP core6275.9877.25
32 × 32New16_16 9374.9564.35
CA [16]10136.9858.84
CC [16]9923.0233.04
Accurate IP core10377.23151.83
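Table 7. Comparison of 2D Filters on FPGA Devices.
Multiplier Used | Device | Filter Size | #LUT/LE | #FFs | D (ns) | E (pJ) | PSNR | SSIM
8 × 8 New2_6 | VIRTEX 7 XC7VX485 | 3 × 3 | 664 | 164 | 4.9 | 98 | 52.9 | 1
8 × 8 New2_6 | VIRTEX 7 XC7VX485 | 5 × 5 | 1935 | 420 | 5.87 | 299.3 | 54.68 | 1
8 × 8 New2_6 | VIRTEX 7 XC7VX485 | 7 × 7 | 3781 | 804 | 6.8 | 632.4 | 60.75 | 1
8 × 8 New2_6 | CYCLONE10LP 006YE144A7G | 3 × 3 | 1118 | 164 | 14.4 | 89.57 | 52.9 | 1
8 × 8 BA [15] | VIRTEX 7 XC7VX485 | 3 × 3 | 398 | 163 | 6.3 | 100.8 | 50.5 | 0.98
8 × 8 BA [15] | VIRTEX 7 XC7VX485 | 5 × 5 | 1221 | 419 | 6.8 | 326.5 | 51.85 | 0.99
8 × 8 BA [15] | VIRTEX 7 XC7VX485 | 7 × 7 | 2411 | 803 | 7.5 | 682.5 | 52.36 | 0.99
8 × 8 Accurate IP | VIRTEX 7 XC7VX485 | 3 × 3 | 722 | 164 | 5.5 | 143 | | 1
8 × 8 Accurate IP | VIRTEX 7 XC7VX485 | 5 × 5 | 2025 | 420 | 6.9 | 414 | | 1
8 × 8 Accurate IP | VIRTEX 7 XC7VX485 | 7 × 7 | 3976 | 804 | 8.6 | 842.8 | | 1
8 × 8 Accurate IP | CYCLONE10LP 006YE144A7G | 3 × 3 | 1010 | 164 | 14 | 204.7 | | 1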
Table 8. Comparison of Approximate 3 × 3 Edge Detectors.
Multiplier Used | Energy Gain | Edge Detected
New | 25.53% | 99.21%
AxBM1 [14] | 21.25% | 97.45%
AxBM2 [14] | 26.41% | 98.45%
BA [15] | 22.47% | 98.96%
[10] | 18.6% | 99.23%
[11] | 16.55% | 98.70%
Table 9. Comparison of 3 × 3 2D Filters on ASIC.
Multiplier Used | Technology | Area Gain | Delay Gain | Energy Gain | SSIM Loss
1StepFull [8] | TSMC 40 nm | - | - | 36% | 1%
1StepTrunc [8] | TSMC 40 nm | - | - | 63% | 1.3%
2StepFull [8] | TSMC 40 nm | - | - | 44% | 7.5%
2StepTrunc [8] | TSMC 40 nm | - | - | 76% | 8%
New2_6 | TSMC 40 nm 1.1 V | 8.85% | 0% | 82.6% | 0%
New2_6 | ST UTBB-FDSOI 28 nm 1 V | 5.1% | −4% | 50.6% | 0%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Perri, S.; Spagnolo, F.; Frustaci, F.; Corsonello, P. Designing Energy-Efficient Approximate Multipliers. J. Low Power Electron. Appl. 2022, 12, 49. https://doi.org/10.3390/jlpea12040049

AMA Style

Perri S, Spagnolo F, Frustaci F, Corsonello P. Designing Energy-Efficient Approximate Multipliers. Journal of Low Power Electronics and Applications. 2022; 12(4):49. https://doi.org/10.3390/jlpea12040049

Chicago/Turabian Style

Perri, Stefania, Fanny Spagnolo, Fabio Frustaci, and Pasquale Corsonello. 2022. "Designing Energy-Efficient Approximate Multipliers" Journal of Low Power Electronics and Applications 12, no. 4: 49. https://doi.org/10.3390/jlpea12040049
