Article

Array Multipliers for High Throughput in Xilinx FPGAs with 6-Input LUTs †

by
E. George Walters
Penn State Erie, The Behrend College, Department of Electrical and Computer Engineering, 5101 Jordan Road, Erie, PA 16563, USA
† This paper is an extended version of our paper published in Walters III, E.G. Partial-Product Generation and Addition for Multiplication in FPGAs With 6-Input LUTs. In Proceedings of the 48th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 2–5 November 2014; pp. 1247–1251.
Computers 2016, 5(4), 20; https://doi.org/10.3390/computers5040020
Submission received: 19 June 2016 / Revised: 13 September 2016 / Accepted: 18 September 2016 / Published: 23 September 2016

Abstract

Multiplication is the dominant operation for many applications implemented on field-programmable gate arrays (FPGAs). Although most current FPGA families have embedded hard multipliers, soft multipliers using lookup tables (LUTs) in the logic fabric remain important. This paper presents a novel two-operand addition circuit (patent pending) that combines radix-4 partial-product generation with addition and shows how it can be used to implement two’s-complement array multipliers. The circuit is specific to modern Xilinx FPGAs that are based on a 6-input LUT architecture. Proposed pipelined multipliers use 42%–52% fewer LUTs, and some versions can be clocked up to 23% faster than delay-optimized LogiCORE IP multipliers. This allows 1.72–2.10-times as many multipliers to be implemented in the same logic fabric and potentially offers 1.86–2.58-times the throughput by increasing the clock frequency.

1. Introduction

Field-programmable gate arrays (FPGAs) are often used in signal processing systems for many applications, such as digital-signal processing (DSP), video processing and image processing. For these applications and others, computation of a sum-of-products is very common. As a result, multiplication is often the focus of efforts to reduce required resources, delay and power. For this reason, most contemporary FPGAs have embedded hard multipliers distributed throughout the fabric. Even so, soft multipliers using lookup tables (LUTs) in the configurable logic fabric remain important for high-performance designs for several reasons:
  • Flexible size and type: Embedded multiplier operands are fixed in size and type, e.g., 25 × 18 two’s complement, while LUT-based multiplier operands can be any size or type.
  • Flexible placement: The number and location of embedded multipliers are fixed, while LUT-based multipliers can be placed anywhere, and the number is limited only by the size of the reconfigurable fabric.
  • Configurable: Embedded multipliers cannot be modified, while LUT-based multipliers can use techniques, such as merged arithmetic [1] and truncated-matrix arithmetic [2,3,4,5,6], to optimize the overall system.
  • Hybrids: LUT-based multipliers are often combined with embedded multipliers to make larger multipliers.
Parandeh-Afshar and Ienne discuss the importance of this topic and present techniques to improve the performance of soft multipliers in Altera FPGAs [7]. They present radix-2 Baugh–Wooley multipliers and radix-4-modified Booth multipliers that use generalized parallel counters (GPCs) in the logic fabric to reduce the partial-product matrix to two or three rows, which are then added using a carry-propagate adder (CPA). This builds on the previous work of Parandeh-Afshar et al. using GPCs for compressor trees for the more general case of multi-operand addition [8,9,10]. In other work, Parandeh-Afshar et al. suggest modifications to the FPGA logic fabric to improve soft multiplier implementations [11,12]. Matsunaga et al. have also published work on using GPCs for multi-operand addition [13,14,15]. De Dinechin and Pasca present methods for implementing large multipliers and squarers on Xilinx FPGAs using a combination of embedded multipliers and logic-based multipliers [16]. Gao et al. present a method for implementing large multipliers that combine embedded multipliers with GPCs [17]. Brunie et al. model a generalized weighted sum of bits as a bit heap and describe techniques for summing them using a combination of embedded multipliers and GPCs on Altera and Xilinx FPGAs [18]. Kumm and Zipf present novel GPCs for Xilinx FPGAs and use them with integer linear programming (ILP) in compressor trees [19]. Mhaidat and Hamzah compare the use of GPCs to Wallace and Dadda multipliers [20]. Kumm et al. use a method similar to this work to implement softcore multipliers [21].
Unlike much of the related work, this paper is specific to the Xilinx 6-input LUT architecture found in the Spartan-6, Virtex-5, Virtex-6, 7-Series, UltraScale and perhaps future generations. The underlying generate-add unit (patent pending) can be used for many purposes, but this paper focuses on its use in general-purpose array multipliers. Previous work describes a radix-4 partial-product generate-add structure and shows how it can be used to make array and tree multipliers [22,23,24]. This paper extends that work in several ways:
  • The generate-add unit is described in greater detail, and new optimizations are presented.
  • Proposed array multipliers are described in greater detail, and new optimizations are presented.
  • All new synthesis results are given using Vivado instead of the ISE toolchain; LogiCORE IP v12.0 multipliers are compared instead of v11.2; and more operand sizes are synthesized.
The new optimizations further reduce the number of required LUTs by approximately 10% and provide approximately a 1.5-times speedup compared to [22]. The proposed multipliers are believed to be the only designs to date that produce better results than LogiCORE IP LUT-based multipliers.
The paper is organized as follows. Section 2 gives background information. Section 3 discusses related work using GPCs. Section 4 describes the proposed two-operand adder, and Section 5 describes the proposed LUT-based array multipliers. Synthesis results are discussed in Section 6, and conclusions are given in Section 7.

2. Background

This section describes the details of the Xilinx logic fabric, two-operand addition in the Xilinx logic fabric, Altera logic fabric and radix-4-modified Booth multiplication.

2.1. Xilinx Logic Fabric

The main logic resource for implementing combinational and sequential circuits in a Xilinx FPGA is the configurable logic block (CLB). Each CLB has two slices. Figure 1 is a partial diagram of a 7-Series FPGA slice. Each slice has four 6-input lookup tables (LUT6s) designated A, B, C and D. Each LUT6 is composed of two 5-input lookup tables (LUT5s) and a two-to-one multiplexer. The two LUT5s are 32 × 1 memories that share five inputs designated I5:I1. The memory values are designated M[63:32] in one LUT5 and M[31:0] in the other LUT5. The output of the M[31:0] LUT5 is designated O5. The sixth input, I6, is input to a multiplexer that selects one of the LUT5 outputs. The selected output is designated O6. The LUT6 is normally configured as either two LUT5s with five shared inputs and two outputs by connecting I6 to logic “1”, or as one LUT6 with six inputs and one output by connecting I6 to the sixth input [25,26].
A multiplexer and an XOR gate, indicated in Figure 1 as MUXCY and XORCY respectively, are associated with each LUT6. Inputs to the MUXCY associated with the A LUT6 are a select signal, $prop_i$, a first data input, $gen_i$, and a second data input, $c_i$. The output of the MUXCY, $c_{i+1}$, is connected to the MUXCY associated with the B LUT6. These connections continue through the C and D LUT6s to form a fast carry chain within the slice. The $c_{i+4}$ output of the slice, COUT, can be connected to the $c_i$ input of the next slice, CIN, to form longer carry chains. The $prop$ signal is driven by the O6 output of the corresponding LUT6. The $gen$ signal is selected by a configuration multiplexer and is either the O5 output of the corresponding LUT6 or the bypass input, which is designated AX, BX, CX or DX. The fast carry logic in a slice, which includes four MUXCYs, four XORCYs and the fast carry chain, is called a CARRY4 [26].
Two flip-flops are associated with each LUT6. One flip-flop can be used to register O5 or the bypass input. The other flip-flop can be used to register O5, O6, the bypass input, the MUXCY output or the XORCY output.
The Spartan-6, Virtex-5, Virtex-6 and UltraScale families are similar to the 7-Series. One notable difference is that the Spartan-6 family does not have fast carry chains in every column of slices.

2.2. Two-Operand Addition

Suppose $X$ and $Y$ are to be added using the Xilinx fast carry logic. For the $i$-th column of the adder, $x_i$ and $y_i$ are the bits of $X$ and $Y$, respectively; $c_i$ is the carry-in bit; $c_{i+1}$ is the carry-out bit; and $s_i$ is the sum bit. A truth table can be made for the adder; then, required values for $prop_i$ and $gen_i$ can be derived from the table.
Figure 1 shows that $s_i = prop_i \oplus c_i$, so $prop_i$ must have the same value as $s_i \oplus c_i$ to produce the correct value for the sum bit. When $prop_i = 0$, the generate signal becomes the carry out, so $gen_i$ must have the same value as the expected value of $c_{i+1}$. When $prop_i = 1$, the generate signal is not used, so it is a don’t-care. These values are given in Table 1. Next, $prop_i$ and $gen_i$ are expressed as functions of $x_i$ and $y_i$. Inspection of the truth table shows that $prop_i = x_i \oplus y_i$ and that the generate signal can be either $gen_i = x_i$ or $gen_i = y_i$.
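The propagate/generate behavior described above can be checked with a small bit-level model. This is a minimal Python sketch, not the author's implementation; the function name `carry_chain_add` and its arguments are illustrative assumptions, and `gen` is arbitrarily taken to be $x_i$.

```python
def carry_chain_add(x, y, width, c0=0):
    """Add two width-bit unsigned integers with a MUXCY/XORCY carry-chain model.

    Returns (sum_bits, carry_out). Names are illustrative, not from the paper.
    """
    carry = c0
    total = 0
    for i in range(width):
        xi = (x >> i) & 1
        yi = (y >> i) & 1
        prop = xi ^ yi                   # O6 of the LUT6 drives the MUXCY select
        gen = xi                         # y_i would work equally well
        total |= (prop ^ carry) << i     # XORCY: s_i = prop_i XOR c_i
        carry = carry if prop else gen   # MUXCY: pass c_i when prop_i = 1, else gen_i
    return total, carry
```

For example, `carry_chain_add(0b1011, 0b0110, 4)` models 11 + 6 and returns the four sum bits together with the carry out of the chain.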

2.3. Altera Logic Fabric

The multipliers proposed in this paper are specific to the Xilinx LUT6 architecture and are not applicable to Altera FPGAs. The Altera logic fabric is briefly described here to give context to related work on generalized parallel counters (GPCs).
The main logic resource for implementing combinational and sequential circuits in an Altera Stratix V FPGA is the logic array block (LAB) [27]. Each LAB in the Stratix V has ten adaptive logic modules (ALMs). The ALM has evolved, but the general functionality described in this section applies to the older Stratix II family [28] through the latest family, the Stratix 10 [29].
The capabilities of an ALM can be compared to a Xilinx LUT6 and its associated MUXCY, XORCY and flip-flops. An ALM can be configured to implement two functions of six inputs, provided that four of the inputs are common, as shown in Figure 2. By comparison, a Xilinx LUT6 can implement one function of six inputs or two functions of five shared inputs.
Each ALM includes two full adders and dedicated carry connections to implement fast addition. Figure 3 shows an Altera ALM in arithmetic mode.

2.4. Radix-4-Modified Booth Multipliers

Suppose A and B are to be multiplied. If the multiplicand, A, is an m-bit two’s-complement integer and the multiplier, B, is an n-bit two’s-complement integer, then:
$$A = -a_{m-1} \cdot 2^{m-1} + \sum_{i=0}^{m-2} a_i \cdot 2^i$$
$$B = -b_{n-1} \cdot 2^{n-1} + \sum_{j=0}^{n-2} b_j \cdot 2^j$$
MacSorley’s modified Booth recoding algorithm works for both unsigned and two’s-complement multipliers [30]. First, $b_{-1}$ is concatenated to the right of $B$ and set to “0”. For two’s-complement multipliers, $n$ must be even. If it is not, $B$ is sign extended by one bit to make $n$ even. For unsigned multipliers with odd values of $n$, $B$ is zero-extended with one “0” to make $n$ even. If $n$ is already even, $B$ is zero-extended with two “0”s.
Next, $B$ is recoded two bits at a time using overlapping groups of three bits. For each $j \in \{0, 2, 4, \ldots, n-2\}$, $b_{j+1}$, $b_j$ and $b_{j-1}$ are recoded as a radix-4 signed digit, $b_\rho$, where $\rho = j/2$ and $b_\rho = -2 b_{j+1} + b_j + b_{j-1}$. Each partial product, $P_\rho$, is $A \cdot b_\rho$. Digit recoding and partial-product selection are summarized in Table 2. Finally, the product is computed as:
$$P = \sum_{\rho=0}^{n/2-1} P_\rho \cdot 2^{2\rho} = \sum_{\rho=0}^{n/2-1} A \cdot b_\rho \cdot 2^{2\rho}.$$
If a partial product is $+A$, then the multiplicand, $A$, is selected. If a partial product is $+2A$, then the multiplicand is shifted left one bit before selection. If a partial product is $-A$ or $-2A$, then $A$ or $2A$ is subtracted by complementing each bit and adding “1” to the least significant bit (LSB). Table 3 summarizes partial-product generation for each selection. There are $m+1$ bits in the partial product to provide for a left shift of $A$, with sign extension if $A$ is not shifted. The operation bit, $op_\rho$, is set to “0” for addition or “1” for subtraction and is added to the LSB column of the partial product.
Each partial product is sign extended to the width of the multiplier in order to provide for correct addition and subtraction. Sign extension can be accomplished by complementing the sign bit, adding a “1” in the same column, and extending with constant “1”s. The constants are pre-added to reduce the number of “1”s in the matrix. Figure 4 shows the simplified partial-product matrix for a 6 × 6 multiplier [31,32,33].
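The recoding steps of this section can be sketched in Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, operands are two's-complement Python integers, and $n$ is assumed even as the text requires. It models only the arithmetic of the recoding, not the sign-extension layout of Figure 4.

```python
def booth_radix4_digits(b, n):
    """Recode an n-bit two's-complement multiplier (n even) into n/2 radix-4 digits.

    Each digit is -2*b[j+1] + b[j] + b[j-1], with b[-1] = 0, so it lies in {-2,...,2}.
    """
    if n % 2:
        raise ValueError("n must be even; sign-extend B by one bit first")
    bits = [(b >> j) & 1 for j in range(n)]  # >> on a negative int sign-extends
    digits = []
    prev = 0                                  # b_{-1} is defined as 0
    for j in range(0, n, 2):
        digits.append(-2 * bits[j + 1] + bits[j] + prev)
        prev = bits[j + 1]                    # becomes b_{j-1} of the next group
    return digits


def booth_multiply(a, b, n):
    """Sum the partial products A * digit * 4**rho."""
    return sum(a * d * 4 ** rho for rho, d in enumerate(booth_radix4_digits(b, n)))
```

For instance, the 4-bit value $-3$ (bits 1101) recodes to the digits $(+1, -1)$, since $1 + (-1) \cdot 4 = -3$.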

3. Related Work: Generalized Parallel Counters

The well-known Wallace tree [34] and Dadda [35] multipliers use full adders and half adders to reduce the partial-product matrix to two rows, which are then added using a final CPA. A full adder is sometimes called a (3;2) counter, because it adds three bits in the same column and outputs a two-bit result equal to the sum of the three bits. A GPC adds bits in one or more columns and produces an $n$-bit result equal to the sum of the bits, taking into account the weight of the columns [36]. For example, a (5,5;4) counter adds five bits in the $2^{i+1}$ column and five bits in the $2^i$ column and outputs a four-bit result equal to the weighted sum of the ten input bits. Figure 5 shows how several (5,5;4) counters could be used to reduce five rows of bits to two rows.
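As a behavioral illustration of the (5,5;4) counter described above, the following Python sketch computes the weighted sum of ten input bits. The function name and the LSB-first output order are assumptions made for this example; it says nothing about how such a counter maps onto LUTs.

```python
def gpc_5_5_4(high_bits, low_bits):
    """Behavioral model of a (5,5;4) generalized parallel counter.

    high_bits: five bits of weight 2, low_bits: five bits of weight 1.
    Returns the four result bits, least-significant first.
    """
    if len(high_bits) != 5 or len(low_bits) != 5:
        raise ValueError("a (5,5;4) counter takes five bits per column")
    value = 2 * sum(high_bits) + sum(low_bits)   # at most 2*5 + 5 = 15, fits in 4 bits
    return [(value >> k) & 1 for k in range(4)]
```

The largest possible weighted sum is 15, which is why four output bits suffice.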
Parandeh-Afshar et al. are believed to be the first to look at using GPCs implemented using LUTs to build compressor trees for multi-operand addition in FPGAs [8,9,10]. They note that modern FPGAs, such as Altera Stratix II and newer and Xilinx Virtex-5 and newer, have 6-input LUTs. Therefore, they focus on GPCs that have up to six total inputs for efficient usage of the LUTs and show that (6;3), (1,5;3), (2,3;3) and (3,3;4) counters each map to two ALMs in modern Altera FPGAs. They use a heuristic to implement multi-operand adder compressor trees with GPCs in [8], use integer linear programming (ILP) to improve the results in [9] and improve the GPCs themselves by using the ALM fast addition resources in [10]. They note that both Altera and Xilinx have efficient ternary adders, so they use GPCs to reduce the matrix to three rows. Other work on GPCs that is based on work by Parandeh-Afshar et al. presents incremental improvements or additional applications for GPCs [13,14,15,17,20]. Kumm and Zipf present two novel GPCs, (6,0,6;5) and (1,3,2,5;5), that are specific to and optimized for Xilinx FPGAs [19].

4. Proposed Two-Operand Adder

Suppose $X$ and $Y$ are to be added using the Xilinx fast carry logic. For the $i$-th column of the adder, $x_i$ and $y_i$ are the bits of $X$ and $Y$, respectively; $c_i$ is the carry-in bit; $c_{i+1}$ is the carry-out bit; and $s_i$ is the sum bit. The $prop_i$ signal must be set to $x_i \oplus y_i$, and the $gen_i$ signal can be set to either $x_i$ or $y_i$ to add $x_i$ and $y_i$ [22]. If $x_i$ and $y_i$ together are a function of five or fewer inputs, then the LUT6 can be configured as two LUT5s, generating either $x_i$ or $y_i$ at O5, routing it to $gen_i$ and generating $x_i \oplus y_i$ at O6 to drive $prop_i$. If $x_i$ and $y_i$ together are a function of six inputs, then the LUT6 can be configured to generate $x_i \oplus y_i$ at O6 to drive $prop_i$, and $x_i$ or $y_i$ can be applied to the bypass input and configured to drive the $gen_i$ input. A disadvantage of this configuration is that the bypass flip-flop cannot be used.
Normally, a LUT6 can be used to either generate a function of six inputs at O6 or to generate two functions of five inputs at O5 and O6 [25,26]. However, there are several useful cases where one function of six variables can be output at O6 and a separate function of five shared variables can be output at O5. Suppose $x_i$ is a function of one variable connected to I6 and $y_i$ is a function of five variables connected to I5:I1. The function $y_i$ is stored in M[31:0], so $y_i$ is output at O5. If $x_i$ is “0”, $y_i$ is also output at O6. If $x_i$ is “1”, the function stored in M[63:32] is output at O6. If $\overline{y_i}$ is stored in M[63:32], then $x_i \oplus y_i$ is generated at O6 and $y_i$ is generated at O5. This can be used to add $x_i$ and $y_i$ without using the bypass input when $x_i$ is a function of one variable and $y_i$ is a function of up to five variables. Figure 6 shows the connections for this configuration. This frees the bypass input to be connected to the bypass flip-flop to implement additional registers. Input I6 has the shortest delay path, and I1 has the longest [25], so this method also allows faster inputs to be used if $y_i$ is a function of fewer than five variables. The carry into the proposed adder, $c_0$, can be used to implement subtraction or to add an extra bit to the least significant column.
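The LUT6 configuration described above can be modeled in a few lines of Python. This is a sketch under assumptions: the 64-bit memory is represented as an integer with M[31:0] in the low half, the five shared inputs I5:I1 are packed into a 5-bit address, and the function names `lut6` and `make_add_init` are hypothetical.

```python
def lut6(memory, i6, addr):
    """Model a LUT6 as two LUT5s plus a 2:1 multiplexer.

    memory: 64-bit integer, M[31:0] in the low half, M[63:32] in the high half.
    addr:   5-bit value formed by the shared inputs I5:I1.
    Returns (O6, O5).
    """
    o5 = (memory >> addr) & 1             # output of the M[31:0] LUT5
    upper = (memory >> (32 + addr)) & 1   # output of the M[63:32] LUT5
    o6 = upper if i6 else o5              # I6 selects which LUT5 reaches O6
    return o6, o5


def make_add_init(y_table):
    """Build the 64-bit contents: y in M[31:0] and its complement in M[63:32]."""
    low = y_table & 0xFFFFFFFF
    high = ~low & 0xFFFFFFFF
    return (high << 32) | low
```

With $x_i$ wired to I6, O6 then equals $x_i \oplus y_i$ (the propagate signal) while O5 equals $y_i$ (the generate signal), for any 32-entry truth table `y_table`.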

5. Proposed Multipliers

This section describes how the proposed array multipliers are implemented and pipelined.

5.1. Partial-Product Selection and Generation

MacSorley’s algorithm adds zero when $(b_{2\rho+1}, b_{2\rho}, b_{2\rho-1}) = (1, 1, 1)$ by generating $P_\rho = (00 \ldots 00)$ and setting $op_\rho = 0$. In the proposed multiplier, $P_\rho = (11 \ldots 11)$ is generated, and $op_\rho$ is set to “1”. This complements each bit in $P_\rho$ and adds “1” to subtract zero [32]. With this modification, the operation bit $op_\rho = b_{2\rho+1}$, as opposed to MacSorley’s algorithm where $op_\rho$ is a function of three variables. This eliminates the logic resources and additional delay required to generate $op_\rho$ and simplifies the layout on the FPGA fabric. Table 4 shows the proposed partial-product selection (cf. Table 2), and Table 5 shows the proposed partial-product generation (cf. Table 3).
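The difference between the two selection schemes can be illustrated with a short Python sketch. It is a behavioral model under assumptions: the function names and the `proposed` flag are hypothetical, $A$ is a two's-complement Python integer of $m$ bits, and $P_\rho$ is returned as an $(m+1)$-bit pattern together with the operation bit.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def select_partial_product(a, m, b_hi, b_mid, b_lo, proposed=True):
    """Return (P, op) for one radix-4 digit; P is an (m+1)-bit pattern."""
    mask = (1 << (m + 1)) - 1
    digit = -2 * b_hi + b_mid + b_lo
    if digit == 0 and not proposed:
        return 0, 0                     # MacSorley: add zero for both 000 and 111
    pattern = (abs(digit) * a) & mask   # 0, A or 2A, sign-extended to m+1 bits
    if b_hi:
        pattern ^= mask                 # complement every bit; op supplies the +1
    return pattern, b_hi                # proposed scheme: op is simply b_hi
```

In both schemes the signed value of $P_\rho$ plus $op_\rho$ equals $A \cdot b_\rho$; only the encoding of the (1, 1, 1) case differs, which is what lets the proposed scheme take $op_\rho$ directly from $b_{2\rho+1}$.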

5.2. Combined Partial-Product Generation and Addition

Partial-product generation and addition of a second value are combined into a generate-add unit, which is the main building block of the proposed array multipliers. The arithmetic operation is shown in Figure 7. Each unit generates one radix-4 partial product, $P_\rho$, with a leading “1” and the most-significant bit (MSB) complemented to implement sign extension. The operation bit, $op_\rho$, and the $(m+1)$ MSBs of the output from the previous generate-add unit, $X_{\rho-1}$, are added to produce an accumulated sum, $X_\rho$. The two LSBs of $X_\rho$ are bits $p_{2\rho+1}$ and $p_{2\rho}$ of the final product, so they are not added in the next unit. The generate-add unit is shown in Figure 8. It is implemented using an $(m+2)$-bit proposed two-operand adder as described in Section 4, with $X_{\rho-1}$ and $P_\rho$ as the $X$ and $Y$ addends, respectively.
Bit $i$ of partial product $P_\rho$, $p_{\rho,i}$, is a function of five inputs:
$$p_{\rho,i} = f(b_{2\rho+1}, b_{2\rho}, b_{2\rho-1}, a_i, a_{i-1}).$$
The inputs for each bit, $p_{\rho,i}$, are connected to the I5:I1 inputs of a LUT6. $x_{\rho-1,i+2}$ is connected to I6 of the same LUT6. The M[31:0] LUT5 is configured to generate $p_{\rho,i}$, and the M[63:32] LUT5 is configured to generate $\overline{p_{\rho,i}}$. O6 then generates $x_{\rho-1,i+2} \oplus p_{\rho,i}$ and drives $prop_i$. O5 generates $p_{\rho,i}$ and is selected to drive $gen_i$. This is done for all of the partial-product bits except the MSB, $p_{\rho,m}$. The MSB is complemented for sign extension by generating $\overline{p_{\rho,m}}$ in the M[31:0] LUT5 and $p_{\rho,m}$ in the M[63:32] LUT5. O6 then generates $x_{\rho-1,m+2} \oplus \overline{p_{\rho,m}}$ and drives $prop_m$. O5 generates $\overline{p_{\rho,m}}$ and is selected to drive $gen_m$. The leading “1”, $2^{2\rho+m+1}$, is added by configuring the M[31:0] LUT5 to generate “1”, configuring the M[63:32] LUT5 to generate “0” and wiring “0” to I6, so that $gen_{m+1} = 1$ and $prop_{m+1} = 0 \oplus 1 = 1$. To summarize, the M[31:0] LUT5s generate the bits of $P_\rho$; the M[63:32] LUT5s generate the complement of those bits; and the bits of $X_{\rho-1}$ to be added are wired to the I6 inputs. The operation bit, $op_\rho$, is added by wiring $b_{2\rho+1}$ to $c_0$. The sum produced at the XORCY output is $x_{\rho,i}$, which is added to $p_{\rho+1,i-2}$ in the next generate-add unit.
Table 6 is the truth table for a LUT6 that generates the partial product $p_{\rho,i}$ and adds it to the bit input to I6, e.g., $x_{\rho-1,i+2}$. Note that the values for O6 are stored in M[63:32], and the values for O5 are stored in M[31:0].
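The row-by-row arithmetic of the generate-add units can be checked with a word-level Python model. This is a sketch of the arithmetic only, under stated assumptions: it does not model individual LUT6s or the carry chain, the function names are hypothetical, and the extra constant “1” at column $m$ of the first row is taken from the simplified partial-product matrix of Section 2.4.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def partial_product(a, m, b_hi, b_mid, b_lo):
    """Proposed selection: (m+1)-bit pattern P and operation bit op = b_hi."""
    mask = (1 << (m + 1)) - 1
    digit = -2 * b_hi + b_mid + b_lo
    pattern = (abs(digit) * a) & mask
    if b_hi:
        pattern ^= mask                  # complement; op adds the missing 1
    return pattern, b_hi


def array_multiply(a, b, m, n):
    """Multiply m-bit a by n-bit b (n even) using n/2 generate-add rows."""
    acc = 1 << m                         # constant "1" at column m of row 0
    product = 0
    b_lo = 0                             # b_{-1} = 0
    for rho in range(n // 2):
        b_mid = (b >> (2 * rho)) & 1
        b_hi = (b >> (2 * rho + 1)) & 1
        p, op = partial_product(a, m, b_hi, b_mid, b_lo)
        p ^= 1 << m                      # complement the MSB for sign extension
        p |= 1 << (m + 1)                # leading "1"
        acc += p + op                    # X_rho
        product |= (acc & 3) << (2 * rho)  # two LSBs are final product bits
        acc >>= 2                        # remaining MSBs feed the next row
        b_lo = b_hi
    product |= acc << n
    return to_signed(product, m + n)     # keep only the m+n product bits
```

Checking `array_multiply(a, b, 6, 6)` against `a * b` over all signed 6-bit operands is a quick way to confirm that the leading “1”s and complemented MSBs cancel modulo $2^{m+n}$.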

5.3. Optimizations for the Generate-Add Unit

The most-significant LUT, shown in Figure 8, can be simplified and eliminated. Inspection of the circuit shows that the $prop_{m+1}$ input to the MUXCY is always “1”. This means that the $gen_{m+1}$ input to the MUXCY is never used, so it is a don’t-care. This could be implemented by storing all “1”s in the M[63:32] LUT5 and wiring “1” to the I6 input, which frees the M[31:0] LUT5 to be used for another purpose. When this is done, the Xilinx tools optimize the entire LUT6 away. The Verilog models used in this work simply assign “1” to the $prop_{m+1}$ input of the CARRY4 primitive.
Pipelined array multipliers reported in previous work [22] had an interesting result for delay. 10 × 10 multipliers were slower than 12 × 12 multipliers (2.402 ns vs. 2.144 ns), and 14 × 14 multipliers were slower than 16 × 16 multipliers (2.471 ns vs. 2.160 ns). These multipliers were implemented using the generate-add structure shown in Figure 8, which requires $m+2$ LUT6s. When $m+2$ is a multiple of four, $(m+2)/4$ slices are fully utilized. Inspection of Figure 1 shows that the XORCY output and the MUXCY output are registered using the same flip-flop. When $m+2$ is a multiple of four, such as for 10 × 10 and 14 × 14 multipliers, the $x_{\rho,m+2}$ output from the MUXCY cannot be registered within the same slice because the $x_{\rho,m+1}$ output and the other XORCY outputs use all of the available flip-flops. This forces $x_{\rho,m+2}$ to be routed outside of the slice to an available flip-flop, causing the additional delay due to longer and slower interconnect.
This problem is avoided by noting that $x_{\rho,m+2} = \overline{x_{\rho,m+1}}$. The $x_{\rho,m+1}$ output is used in the next row instead of $x_{\rho,m+2}$ so that the MUXCY output does not need to be registered. Figure 9 shows the arithmetic that is performed (cf. Figure 7). The optimized generate-add unit generates $P_\rho$ with a leading “1” and the MSB complemented to implement sign extension as in the original generate-add unit. The operation bit, $op_\rho$, and the $(m+1)$ MSBs of $X_{\rho-1}$, using $\overline{x_{\rho,m+1}}$ instead of $x_{\rho,m+2}$, are added to produce an accumulated sum, $X_\rho$. The MSB of the output, $x_{\rho,m+2}$, is not needed in the next row, so it is not produced.
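The complement relation between the two most-significant outputs follows from the fact, noted earlier in this section, that the propagate input of the leading-“1” column is always “1”. A short derivation, writing $c_{m+1}$ and $c_{m+2}$ for the carry into and out of that column (carry names assumed here for illustration):

$$
\begin{aligned}
x_{\rho,m+1} &= prop_{m+1} \oplus c_{m+1} = 1 \oplus c_{m+1} = \overline{c_{m+1}}, \\
x_{\rho,m+2} &= c_{m+2} = c_{m+1} \quad \text{(the MUXCY passes the carry when } prop_{m+1} = 1\text{)}, \\
\text{so}\quad x_{\rho,m+2} &= \overline{x_{\rho,m+1}}.
\end{aligned}
$$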
The most-significant LUT6 of the optimized generate-add unit is configured differently than the other LUT6s. The MSB from the previous unit, $x_{\rho-1,m+1}$, is connected to one of the shared I5:I1 inputs, and “1” is input to I6. The M[31:0] LUT5 is configured to produce $\overline{p_{\rho,m}}$ at O5 to drive the $gen_m$ signal. The M[63:32] LUT5 is configured to produce the function $f = x_{\rho-1,m+2} \oplus \overline{p_{\rho,m}}$ at O6 to drive the $prop_m$ signal. Since $x_{\rho-1,m+2} = \overline{x_{\rho-1,m+1}}$,
$$f = \overline{x_{\rho-1,m+1}} \oplus \overline{p_{\rho,m}}.$$
Table 7 gives the truth table for the most-significant LUT6 of an optimized generate-add unit.
Figure 10 shows the optimized generate-add unit. The optimized generate-add unit uses only m + 1 LUT6s and avoids the delay of routing a MUXCY output out of a slice to be registered.

5.4. Array Structure and Pipelining

An array of $n/2$ optimized generate-add units is used to implement an $m \times n$ multiplier. Optimized generate-add units are connected in an array structure as shown in Figure 11. Each generate-add unit requires $m+1$ LUT6s, so the number of LUT6s required to implement an $m \times n$ array multiplier is:
$$\#\,\text{LUT6s} = \frac{n}{2}\,(m+1).$$
The multiplier can be pipelined to reduce cycle time and increase throughput for applications that can tolerate increased latency. Figure 12 shows the connections for optimized generate-add units in a pipelined $m \times n$ array multiplier with $n/4$ stages. The multiplier can be pipelined by placing a register after every two generate-add units to increase the maximum clock frequency with a modest increase in latency. All $m$ bits of operand $A$ and $m+2$ bits output from the second generate-add unit are registered at the end of the first stage. The three LSBs of operand $B$ are not needed after the first stage, so only $n-3$ bits are registered. The two LSBs from the output of the first generate-add unit are also registered for a total of $2m+n+1$ bits registered at the end of the first stage. In each subsequent stage, four fewer bits of $B$ are registered while four additional LSBs from generate-add units are registered, so $2m+n+1$ flip-flops are used to implement pipeline registers in each stage. There are $n/4-1$ pipeline registers, and $m+n$ flip-flops are needed to register the output, so the number of flip-flops required for an $n/4$-stage pipelined array multiplier is:
$$\#\,\text{FFs}_{n/4} = \frac{n}{4}\,(2m+n+1) - m - 1.$$
Each of the LUT6s used to implement the array multiplier has two flip-flops, so there are $\frac{n}{2}(2m+2)$ flip-flops available within the footprint of the multiplier. If $m \geq n$, there are enough flip-flops to implement an $n/4$-stage pipeline with a significant number left over for other uses. This does not imply that all flip-flops used to implement the pipeline must be placed within the footprint of the multiplier. It does imply that a large number of multipliers can be densely placed on the FPGA fabric, and there will be enough flip-flops available within the logic of the multipliers for pipelining. Other designs that use the bypass input only have one flip-flop available per LUT6 and would not have enough flip-flops available for deep pipelining. If the product is truncated or rounded, the LSBs of the generate-add units do not need to be registered, and additional flip-flops are available for other uses.
The proposed array multipliers can also be pipelined with $n/2$ stages to further increase the maximum clock frequency. This is accomplished by placing pipeline registers after every generate-add unit. As with the $n/4$-stage pipeline, this requires $2m+n+1$ bits to be registered in each stage plus $m+n$ bits for the output register, so the number of flip-flops required for an $n/2$-stage pipelined array multiplier is:
$$\#\,\text{FFs}_{n/2} = \frac{n}{2}\,(2m+n+1) - m - 1.$$
There are not enough flip-flops available within the footprint of the multiplier to implement an $n/2$-stage pipeline. Unused flip-flops in nearby logic can be used to make up the difference if available. The number of required flip-flops can be reduced by using shift-register LUTs (SRLs). A single SRL can be used to replace a number of flip-flops connected as a shift register, such as the least-significant bits of the product that are shifted through each stage. The two flip-flops associated with the SRL are available for use, so using SRLs increases the number of flip-flops available in the multiplier footprint while reducing the number that is required. When SRLs are used to replace chains of three or more flip-flops, the Vivado synthesis default, there are more than enough flip-flops within the multiplier footprint to implement the $n/2$-stage pipeline. As noted earlier, this does not imply that pipeline flip-flops must be placed within the footprint. Routing into or out of an SRL may be longer than the longest route between two flip-flops in a chain that it replaces, so it may be on the critical path and increase the delay of the multiplier.
The proposed array structure is easy to lay out. LUT6s are placed in the fabric much like a mirror image of how they are shown in the schematic of Figure 11, which simplifies routing, as well. Deeper pipelining, i.e., using $n/2$ instead of $n/4$ stages, reduces delay significantly.
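The resource formulas of this section can be evaluated with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, $n$ is assumed even, the number of stages is passed explicitly rather than derived from $n$, and SRLs are not modeled.

```python
def lut6_count(m, n):
    """LUT6s in an m x n array of optimized generate-add units: n/2 * (m + 1)."""
    return (n // 2) * (m + 1)


def ff_count(m, n, stages):
    """Flip-flops for a pipelined multiplier: stages * (2m + n + 1) - m - 1.

    Equivalent to (stages - 1) pipeline registers of 2m + n + 1 bits
    plus an (m + n)-bit output register.
    """
    return stages * (2 * m + n + 1) - m - 1


def ffs_in_footprint(m, n):
    """Flip-flops available in the footprint: two per LUT6, i.e. n/2 * (2m + 2)."""
    return (n // 2) * (2 * m + 2)
```

For a 16 × 16 multiplier these give 136 LUT6s and 272 flip-flops in the footprint; the 4-stage ($n/4$) pipeline needs 179 flip-flops, which fits, while the 8-stage ($n/2$) pipeline needs 375, which does not fit without SRLs or nearby flip-flops.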

5.5. Row 0 Generate-Add Estimation Unit

The generate-add unit in the first row, $\rho = 0$, does not have an input value $X_{-1}$ to add. The unit only needs to generate $P_0$ and add $op_0$ and $2^m$ to produce $X_0$, the input to the next generate-add unit. Figure 13 shows the arithmetic for the Row 0 generate-add unit. If a maximum absolute error of one unit in the last place (ulp) can be tolerated, the generate-add unit in the first row can be replaced with an estimation unit that uses only $(m+1)/2$ LUT6s instead of $m+1$. Figure 14 shows the Row 0 generate-add estimation unit, which produces an estimate, $\tilde{X}_0$, instead of $X_0$.
For any adjacent pair of bits in $P_0$, each bit is a function of four variables:
$$p_{0,i+1} = f(b_1, b_0, a_{i+1}, a_i)$$
$$p_{0,i} = f(b_1, b_0, a_i, a_{i-1}).$$
Together, $p_{0,i+1}$ and $p_{0,i}$ are a function of five variables,
$$(p_{0,i+1}, p_{0,i}) = f(b_1, b_0, a_{i+1}, a_i, a_{i-1}).$$
The two bits can be computed using two LUT5s in the same LUT6, generating $p_{0,i+1}$ at O6 and $p_{0,i}$ at O5. This allows $P_0$ to be generated using only $(m+1)/2$ LUT6s instead of the $m+1$ LUT6s required for a generate-add unit, but does not allow the fast carry chain to be used. Table 8 gives the truth table for a LUT6 that generates adjacent partial products $p_{0,i+1}$ and $p_{0,i}$ in the top row, Row 0.
The least-significant LUT6 can generate $p_{0,1}$ and $p_{0,0}$, but cannot properly add $op_0$ because there cannot be a carry-out to the next LUT6. One option is to discard $op_0$ and simply output $\tilde{x}_{0,1} = p_{0,1}$ and $\tilde{x}_{0,0} = p_{0,0}$. Another option is to generate $p_{0,1}$ and $p_{0,0}$, add $op_0$ and output $\tilde{x}_{0,1} = x_{0,1}$ and $\tilde{x}_{0,0} = x_{0,0}$ if there is no carry out or $\tilde{x}_{0,1} = 1$ and $\tilde{x}_{0,0} = 1$ if there is a carry out. Another option is to output a function of $p_{0,1}$, $p_{0,0}$ and $op_0$ that has a desired statistical result, such as an average error of zero.
The LUT5s that output $\tilde{x}_{0,i}$ for $m-1 \geq i \geq 2$ generate $\tilde{x}_{0,i} = p_{0,i}$. The sum of $\overline{p_{0,m}}$ and the two constant “1”s is $\overline{p_{0,m}}, p_{0,m}, p_{0,m}$. The LUT5s that output $\tilde{x}_{0,m+1}$ and $\tilde{x}_{0,m}$ generate $\tilde{x}_{0,m+1} = p_{0,m}$ and $\tilde{x}_{0,m} = p_{0,m}$. As described in Section 5.3, the generate-add unit in the second row uses $\tilde{x}_{0,m+1}$ for $\tilde{x}_{0,m+2}$ and complements it internally, so $\tilde{x}_{0,m+2}$ does not need to be generated. The only error introduced into $\tilde{X}_0$ is the error from the least-significant LUT6, so the maximum absolute error is easily constrained to 1 ulp. Although not shown in Figure 14, $p_{0,m}$ could be generated using a single LUT5 and used for $\tilde{x}_{0,m+2}$, $\tilde{x}_{0,m+1}$ and $\tilde{x}_{0,m}$.

6. Results

The proposed multipliers are compared to Xilinx LogiCORE IP v12.0 multipliers [37] for signed $(n \times n)$-bit units. Results for 6-, 8-, 10-, 12-, 14-, 16-, 20-, 24-, 32- and 64-bit operands are given for single-cycle and pipelined units. Results from other work on GPC-based tree multipliers are compared, and the differences are discussed.

6.1. Methodology

Version 2014.4 of the Xilinx Vivado Design Suite was used. Designs were synthesized with the strategy set to “Vivado Synthesis Defaults” and implemented with the strategy set to “Performance_Retiming”. The -shreg_min_size parameter was set to the default value of three to synthesize pipelined versions of the proposed multipliers using SRLs and set to 99 to synthesize versions using flip-flops only. Designs were synthesized for the Virtex-7 XC7VX330T-FFG1157 (-3 speed grade) device with a timing constraint of 1 ns on the inner clock. All results are post place-and-route.
LogiCORE IP multipliers were created using the IP Catalog in Vivado. Area-optimized and delay-optimized units were synthesized for each operand size. Structural models of the proposed multipliers were implemented in Verilog. Single-cycle versions for each multiplier were created. Pipelined versions were created for LogiCORE multipliers using the optimal number of stages specified in the IP customization dialog. Pipelined versions of the proposed designs were created using n / 4 and n / 2 stages. n / 4 -stage versions were synthesized using flip-flops only (no SRLs). Flip-flop-only designs and designs using SRLs were synthesized for n / 2 -stage versions. Input and output ports were double registered to reduce dependence on I/O placement [38]. CARRY4 primitives were placed manually using the RLOC constraint, which specifies the relative location of primitives in the FPGA fabric. Placement of LUTs was done by the tools with no constraints. Placement of flip-flops was also done by the tools, and they were not constrained to the footprint of the multiplier. A separate clock on the inner level was used to measure the delay through each multiplier.

6.2. Single-Cycle Multipliers

Table 9 and Table 10 show synthesis results for single-cycle multipliers. The total number of LUTs used and the delay in nanoseconds of each multiplier are reported. The LUT-delay product (LDP) is computed as the total number of LUTs multiplied by the delay. This is analogous to the area-delay product of a VLSI design and gives a metric for comparing overall design efficiency, with lower values indicating higher efficiency. The reciprocal of LDP gives a metric for comparing throughput.
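As a small illustration of the metric, the following Python sketch computes the LUT-delay product and normalizes one design against a baseline. The function names and the dictionary layout are illustrative assumptions, not part of the original work; the numbers are the $n = 6$ rows of Table 9 and Table 10.

```python
def lut_delay_product(luts, delay_ns):
    """LUT-delay product (LDP): total LUTs multiplied by delay in ns."""
    return luts * delay_ns


def normalize(design, baseline):
    """Normalize the LUTs, delay and LDP of a design to a baseline design."""
    ldp_design = lut_delay_product(design["luts"], design["delay_ns"])
    ldp_base = lut_delay_product(baseline["luts"], baseline["delay_ns"])
    return {
        "luts": design["luts"] / baseline["luts"],
        "delay": design["delay_ns"] / baseline["delay_ns"],
        "ldp": ldp_design / ldp_base,
    }


# n = 6 single-cycle multipliers (values taken from Table 9 and Table 10).
logicore_6 = {"luts": 40, "delay_ns": 2.581}
proposed_6 = {"luts": 21, "delay_ns": 2.649}

result = normalize(proposed_6, logicore_6)
print({name: round(value, 3) for name, value in result.items()})
```

A normalized LDP below 1 means the design does more work per LUT per nanosecond than the baseline, which is why its reciprocal can be read as a relative throughput for a fixed number of LUTs.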
The area optimization for LogiCORE IP multipliers is most effective when both operands are unsigned [38]. Signed area-optimized LogiCORE multipliers were found to use more LUTs and to have a higher LUT-delay product than delay-optimized units for each of the operand sizes tested, so delay-optimized multipliers are used as the baseline for comparison. The total number of LUTs, maximum delay and LUT-delay product for each design are normalized to the delay-optimized LogiCORE multiplier of the same size.
The proposed single-cycle designs use 47%–51% fewer LUTs than the baseline LogiCORE multipliers, which allows approximately twice as many to be implemented in the same logic fabric. They are slower than baseline multipliers, and the normalized delay generally increases as $n$ increases. For $n \le 20$, the decrease in LUTs is more significant than the increase in delay, so those units have a 12%–46% lower LUT-delay product than baseline multipliers.

6.3. Pipelined Multipliers

Table 11, Table 12, Table 13, Table 14 and Table 15 show synthesis results for pipelined multipliers. The number of pipeline stages and the number of flip-flops (FFs) are reported. The number of flip-flops includes pipeline registers and one output register, but does not include the input registers or the second set of registers used to reduce dependence on I/O placement. Values are normalized to Xilinx LogiCORE IP multipliers reported in Table 11.
Table 12 shows proposed multipliers using an $n/4$-stage pipeline and no SRLs. These versions use 47%–52% fewer LUTs than the baseline LogiCORE multipliers, which allows 1.90–2.10-times as many to be implemented in the same logic fabric. These versions use fewer flip-flops than LogiCORE multipliers. The LUTs used to implement each proposed multiplier have more associated flip-flops available for use than are used in the design because the bypass inputs are not used. These versions are generally slower than LogiCORE multipliers.
Table 13 shows proposed multipliers using an $n/2$-stage pipeline and no SRLs. These versions use 47%–52% fewer LUTs and are 0%–23% faster than the baseline LogiCORE multipliers. These versions use more flip-flops than LogiCORE multipliers, and more flip-flops than are available from the associated LUTs used in the designs. If extra flip-flops are available from nearby logic, these versions offer LUT-delay products that are 52%–61% lower than baseline LogiCORE multipliers.
Table 14 shows proposed multipliers using an $n/2$-stage pipeline and SRLs to save flip-flops. These versions use 42%–49% fewer LUTs and are 1%–22% faster than the baseline LogiCORE multipliers. These versions use fewer flip-flops than LogiCORE multipliers, and enough flip-flops are available from the associated LUTs. They have a 46%–55% lower LUT-delay product than baseline multipliers, indicating a potential 1.86–2.24-times increase in throughput for a fixed number of LUTs (the reciprocals of the largest and smallest normalized LDP values in Table 14, 0.538 and 0.446).
Table 15 shows proposed multipliers using an $n/2$-stage pipeline, SRLs and a Row 0 estimation unit instead of a generate-add unit. These versions use 45%–49% fewer LUTs and are 4%–19% faster than the baseline LogiCORE multipliers. These versions use fewer LUTs and flip-flops than versions that use a generate-add unit, but may have slightly longer delay. They have a 49%–57% lower LUT-delay product than baseline multipliers, indicating a potential 1.97–2.33-times increase in throughput for a given number of LUTs.
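The density and throughput factors quoted in this section are reciprocals of normalized values. The short Python sketch below shows the conversion using the extreme normalized LUT and LDP values from Table 15; the helper names are hypothetical and chosen only for illustration.

```python
def improvement_factor(normalized_cost):
    """A cost normalized to the baseline becomes a gain factor by reciprocal."""
    return 1.0 / normalized_cost


def percent_reduction(normalized_cost):
    """Percentage by which a normalized cost is below the baseline."""
    return (1.0 - normalized_cost) * 100.0


# Extreme normalized values read from Table 15 (n/2 stages, SRLs, estimated X_0).
norm_luts_min, norm_luts_max = 0.505, 0.550
norm_ldp_min, norm_ldp_max = 0.429, 0.508

# Factor of additional multipliers that fit in the same number of LUTs.
density = (improvement_factor(norm_luts_max), improvement_factor(norm_luts_min))

# Potential throughput factor for a fixed number of LUTs.
throughput = (improvement_factor(norm_ldp_max), improvement_factor(norm_ldp_min))

print("density gain:    %.2f to %.2f" % density)
print("throughput gain: %.2f to %.2f" % throughput)
print("LDP reduction:   %.0f%% to %.0f%%"
      % (percent_reduction(norm_ldp_max), percent_reduction(norm_ldp_min)))
```

The throughput figure is a potential gain: it assumes that the freed LUTs are filled with additional multipliers and that all of them run at the reported clock period.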

6.4. Layout

Figure 15 shows a screen capture of the implementation of a proposed 6 × 6 single-cycle array multiplier (cf. the mirror image of Figure 11). Nine slices are shown in the screen capture, and primitives in the lower-right slice are annotated (cf. Figure 1). The four MUXCYs, four XORCYs and the fast carry chain for the slice are instantiated as a single Xilinx primitive called a CARRY4. Primitives that are used are indicated by a cyan background color. Note that in Figure 11, carries propagate from the right side of the figure to the left side, so that the most-significant bit of the product is on the left side and the least-significant bit is on the right side. In the screen capture, carries propagate from the bottom of the image to the top. The two slices in the left column of slices correspond to the generate-add unit that generates $P_0$ and outputs $X_0$. The two slices in the middle column of slices correspond to the generate-add unit that generates $P_1$ and adds it to $X_0$ to output $X_1$. The two slices in the right column of slices correspond to the generate-add unit that generates $P_2$ and adds it to $X_1$ to output $X_2$. The flip-flops that are indicated as used are part of the registers used for the input and output ports to reduce dependence on I/O placement, as noted in Section 6.1.
Figure 16 shows a screen capture of the implementation of a proposed 16 × 16 pipelined multiplier, with an eight-stage pipeline using SRLs. The image on the left shows wiring from the I/O pads used for the bits of operand A to the first register for operand A. The flip-flops for the first register are generally located near the corresponding I/O pad. The wiring from the first register for operand A to the second register for operand A is not shown. Most of the flip-flops used for the second register for operand A are located near the multiplier logic at the top of the image. The image on the right shows the wiring from the second register for the output P to the I/O pads used for the bits of P. The flip-flops for the first output register for P are generally located close to the logic for the multiplier at the top of the image. This figure shows the importance of double-registering the input and output ports. If they were not double-registered, delays from long routing lines from I/O pads to the multiplier would give misleading results for the speed of a multiplier when used as part of a larger unit, such as a finite impulse response (FIR) filter that is not connected directly to I/O pads.
Figure 17 shows a screen capture of the implementation of the same 16 × 16 eight-stage pipelined multiplier shown in Figure 16. This image shows a close-up view of the multiplier logic shown at the top of the images in Figure 16. The eight generate-add units of the multiplier each occupy five slices in a column. The SRLs used in the multiplier are implemented using LUT6s in nearby slices. Most of the flip-flops in the slices are used for generate-add units, showing that the bypass inputs are indeed available and the bypass flip-flops can be used. The implementation was not constrained to use only those flip-flops. The tools implemented most of the pipeline registers using them, but left some of them unused and available while using some flip-flops in nearby slices.
Figure 18 shows a screen capture of the same multiplier shown in Figure 17, plus the wiring from the pipeline register between the third and fourth pipeline stage to the inputs of the generate-add unit in the fourth stage. It can be seen that many of the flip-flops in the pipeline register are bypass flip-flops. However, some are not, and some are not in slices occupied by generate-add units. If many of the proposed $n/2$-stage multipliers using SRLs were located next to each other, there would be enough flip-flops associated with the generate-add units and SRLs to implement all of the pipeline registers and an output register for each multiplier. However, without constraining the placement of those flip-flops, the tools would likely place some of the flip-flops for one multiplier in slices occupied by another multiplier. Further research is needed to determine if constraining flip-flops for a multiplier to the logic used to implement the same multiplier would yield any improvements in delay.

6.5. GPC-Based Tree Multipliers

Brunie et al. [18] present a data structure called a bit heap, which is similar to a BitMatrix object [4,39,40]. Bit heaps and BitMatrix objects treat a set of operands to be summed as a collection of individual weighted bits instead of a collection of operand vectors. The FloPoCo [41] arithmetic generator operates on bit heaps, applying embedded multipliers, GPCs and 3 × 3 multipliers [42] to compute the sum. FloPoCo targets Altera and Xilinx FPGAs. Kumm and Zipf present two novel GPCs specific to Xilinx FPGAs that exploit the slice structure and are more efficient than previous work in terms of the ratio of the number of bits removed from the bit heap to the number of required LUTs. They then use ILP to select GPCs to reduce a bit heap to two rows and report improvements over the previous FloPoCo heuristic [19]. Mhaidat and Hamzah [20] present results for a Xilinx Spartan-6 FPGA, which uses a 6-input LUT architecture. They report that their 32 × 32 multiplier uses 1133 LUTs, which is 2.15-times the number used by the proposed multipliers that do not use SRLs and 1.95-times the number used by the proposed $n/2$-stage pipelined multipliers that use SRLs. They do not compare their results to LogiCORE IP, so normalized results for LUTs or delay are not available for comparison to proposed multipliers.
Two Altera ALMs can be used to implement a (6;3), a (1,5;3), a (2,3;3) or a (3,3;4) counter. (6;3) and (1,5;3) counters are favored because they eliminate three partial-product bits per counter, compared to (2,3;3) and (3,3;4) counters, which only eliminate two bits per counter. In Xilinx, three LUT6s would be required to implement a (6;3) or a (1,5;3) counter, because only five inputs can be shared between the LUT5s. A (2,3;3) counter could be implemented using two LUT6s, because there are only five inputs, so LUT5s can be used. A (3,3;4) counter would require four LUT6s. (6;3) and (1,5;3) counters can be used in Altera to eliminate 1.5 bits per ALM, but they would only eliminate one bit per LUT6 in Xilinx. The differences between Xilinx and Altera are too great to assume that results for GPC-based multipliers on Xilinx FPGAs would be comparable to results for Altera FPGAs presented in other work.
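The per-resource figures above follow from counting bits. A GPC removes as many bits as it has inputs minus outputs, and its output width must be large enough to hold the maximum weighted input sum. The Python sketch below reproduces the numbers in the preceding paragraph; the tuple encoding, the helper names and the `cost` dictionary are illustrative assumptions, and the ALM and LUT6 counts are simply those stated in the text.

```python
from fractions import Fraction


def gpc_output_bits(columns):
    """Output width of a GPC.

    `columns` lists the input count per column, most-significant column
    first, e.g. (1, 5) for a (1,5;3) counter.
    """
    max_sum = 0
    for weight, count in enumerate(reversed(columns)):
        max_sum += count << weight  # each bit in this column weighs 2**weight
    return max_sum.bit_length()


def bits_eliminated(columns):
    """Bits removed from the matrix: number of inputs minus number of outputs."""
    return sum(columns) - gpc_output_bits(columns)


# Resource cost per counter as stated in the text (2 ALMs each on Altera).
cost = {
    (6,): {"alms": 2, "lut6s": 3},
    (1, 5): {"alms": 2, "lut6s": 3},
    (2, 3): {"alms": 2, "lut6s": 2},
    (3, 3): {"alms": 2, "lut6s": 4},
}

for columns, resources in cost.items():
    removed = bits_eliminated(columns)
    print(columns,
          "outputs:", gpc_output_bits(columns),
          "removed:", removed,
          "per ALM:", Fraction(removed, resources["alms"]),
          "per LUT6:", Fraction(removed, resources["lut6s"]))
```

Under these counts, the (6;3) and (1,5;3) counters remove 3/2 bits per ALM but only one bit per LUT6, which is the asymmetry the paragraph describes.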
Parandeh-Afshar et al. [7] compare LUT-based multipliers using GPCs to MegaWizard multipliers in Altera FPGAs and present the results graphically. Numerical results were estimated from their graphs and are tabulated in Table 16. Their radix-4 Booth multipliers have the best overall results: they are faster than MegaWizard multipliers, at the expense of additional LUTs for most operand sizes. The normalized LUT-delay product ranges from 0.67 to 1.08. By contrast, the proposed multipliers are significantly smaller and have a much lower LUT-delay product than Xilinx LogiCORE IP multipliers when pipelined with $n/2$ stages. This indicates that the proposed method offers a larger improvement over the vendor baseline on Xilinx than the method of [7] offers on Altera.

7. Conclusions

This paper presents a novel two-operand adder that combines radix-4 partial-product generation and addition and shows how it can be used in FPGAs based on 6-input LUTs to implement two’s-complement array multipliers. Synthesis results are compared to Xilinx LogiCORE IP multipliers. The proposed array multipliers use approximately one-half of the LUTs needed by comparable LogiCORE IP multipliers, which allows approximately twice as many to be implemented in the same logic fabric. When deeply pipelined, the proposed multipliers are also faster than LogiCORE IP multipliers in most cases. SRLs can be used so that there are more flip-flops associated with the logic of the multiplier than required for pipelining, which allows a large number of deeply pipelined multipliers to be densely placed in the FPGA fabric. If a maximum absolute error of 1 ulp is tolerable, the number of required LUTs can be reduced further. The proposed multipliers are well suited for multiply-intensive applications, such as digital-signal processing, image processing and video processing, where they can be modified further using techniques, such as merged arithmetic and truncated-matrix arithmetic, to optimize the overall system.

Acknowledgments

The Xilinx Vivado Design Suite was obtained through the Xilinx University Program.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ALM: adaptive logic module
CLB: configurable logic block
CPA: carry-propagate adder
DSP: digital signal processing
FIR: finite impulse response
FPGA: field-programmable gate array
GPC: generalized parallel counter
ILP: integer linear programming
LAB: logic array block
LDP: LUT-delay product
LUT: lookup table
LUT5: 5-input lookup table
LUT6: 6-input lookup table
LSB: least-significant bit
MSB: most-significant bit
SRL: shift register LUT
ulp: unit in the last place

References

  1. Swartzlander, E.E., Jr. Merged Arithmetic. IEEE Trans. Comput. 1980, C-29, 946–950.
  2. Schulte, M.J.; Swartzlander, E.E., Jr. Truncated Multiplication with Correction Constant. In VLSI Signal Processing VI; IEEE Press: Eindhoven, The Netherlands, 1993; pp. 388–396.
  3. King, E.J.; Swartzlander, E.E., Jr. Data-Dependent Truncation Scheme for Parallel Multipliers. In Proceedings of the 31st Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 2–5 November 1997; Volume 2, pp. 1178–1182.
  4. Walters, E.G., III; Schulte, M.J. Design Tradeoffs Using Truncated Multipliers in FIR Filter Implementations. In Proceedings of the SPIE: Advanced Signal Processing Algorithms, Architectures, and Implementations XII, Seattle, WA, USA, 6 December 2002; Volume 4791, pp. 357–368.
  5. Walters, E.G., III; Arnold, M.G.; Schulte, M.J. Using Truncated Multipliers in DCT and IDCT Hardware Accelerators. In Proceedings of the SPIE: Advanced Signal Processing Algorithms, Architectures, and Implementations XIII, San Diego, CA, USA, 24 December 2003; Volume 5205, pp. 573–584.
  6. Walters, E.G., III. Linear and Quadratic Interpolators Using Truncated-Matrix Multipliers and Squarers. Computers 2015, 4, 293–321.
  7. Parandeh-Afshar, H.; Ienne, P. Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs. In Proceedings of the 21st International Conference on Field Programmable Logic and Applications, Chania, Greece, 5–7 September 2011; pp. 225–231.
  8. Parandeh-Afshar, H.; Brisk, P.; Ienne, P. Efficient Synthesis of Compressor Trees on FPGAs. In Proceedings of the Asia and South Pacific Design Automation Conference, Seoul, Korea, 21–24 January 2008; pp. 138–143.
  9. Parandeh-Afshar, H.; Brisk, P.; Ienne, P. Improving Synthesis of Compressor Trees on FPGAs via Integer Linear Programming. In Proceedings of the Design, Automation and Test in Europe, Munich, Germany, 10–14 March 2008; pp. 1256–1261.
  10. Parandeh-Afshar, H.; Brisk, P.; Ienne, P. Exploiting Fast Carry-Chains of FPGAs for Designing Compressor Trees. In Proceedings of the 2009 International Conference on Field Programmable Logic and Applications, Prague, Czech Republic, 31 August–2 September 2009; pp. 242–249.
  11. Parandeh-Afshar, H.; Verma, A.K.; Brisk, P.; Ienne, P. Improving FPGA Performance for Carry-Save Arithmetic. IEEE Trans. VLSI Syst. 2010, 18, 578–590.
  12. Parandeh-Afshar, H.; Ienne, P. Highly Versatile DSP Blocks for Improved FPGA Arithmetic Performance. In Proceedings of the 2010 IEEE 18th Annual International Symposium on Field-Programmable Custom Computing Machines, Charlotte, NC, USA, 2–4 May 2010; pp. 229–236.
  13. Matsunaga, T.; Kimura, S.; Matsunaga, Y. Multi-Operand Adder Synthesis on FPGAs Using Generalized Parallel Counters. In Proceedings of the Asia and South Pacific Design Automation Conference, Taipei, Taiwan, 18–21 January 2010; pp. 337–342.
  14. Matsunaga, T.; Kimura, S.; Matsunaga, Y. Power and Delay Aware Synthesis of Multi-Operand Adders Targeting LUT-Based FPGAs. In Proceedings of the International Symposium on Low Power Electronics and Design, Fukuoka, Japan, 1–3 August 2011; pp. 217–222.
  15. Matsunaga, T.; Kimura, S.; Matsunaga, Y. An Exact Approach for GPC-Based Compressor Tree Synthesis. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2013, E96-A, 2553–2560.
  16. De Dinechin, F.; Pasca, B. Large Multipliers With Fewer DSP Blocks. In Proceedings of the 2009 International Conference on Field Programmable Logic and Applications, Prague, Czech Republic, 31 August–2 September 2009; pp. 250–255.
  17. Gao, S.; Al-Khalili, D.; Chabini, N. Implementation of Large Size Multipliers Using Ternary Adders and Higher Order Compressors. In Proceedings of the 21st International Conference on Microelectronics, Marrakech, Morocco, 19–22 December 2009; pp. 118–121.
  18. Brunie, N.; de Dinechin, F.; Istoan, M.; Sergent, G.; Illyes, K.; Popa, B. Arithmetic Core Generation Using Bit Heaps. In Proceedings of the 23rd International Conference on Field Programmable Logic and Applications, Porto, Portugal, 2–4 September 2013; pp. 1–8.
  19. Kumm, M.; Zipf, P. Pipelined Compressor Tree Optimization Using Integer Linear Programming. In Proceedings of the 24th International Conference on Field Programmable Logic and Applications (FPL), München, Germany, 2–4 September 2014; pp. 1–8.
  20. Mhaidat, K.M.; Hamzah, A.Y. A New Efficient Reduction Scheme to Implement Tree Multipliers on FPGAs. In Proceedings of 2014 9th International Design and Test Symposium, Algiers, Algeria, 16–18 December 2014; pp. 180–184.
  21. Kumm, M.; Abbas, S.; Zipf, P. An Efficient Softcore Multiplier Architecture for Xilinx FPGAs. In Proceedings of the 22nd IEEE Symposium on Computer Arithmetic, Lyon, France, 22–24 June 2015; pp. 18–25.
  22. Walters, E.G., III. Partial-Product Generation and Addition for Multiplication in FPGAs With 6-Input LUTs. In Proceedings of the 48th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 2–5 November 2014; pp. 1247–1251.
  23. Walters, E.G., III. Techniques and Devices for Performing Arithmetic. U.S. Patent Application 15/025,770, 25 August 2016.
  24. Walters, E.G., III. Techniques and Devices for Performing Arithmetic. U.S. Provisional Application 62/343,366, May 2016.
  25. Young, S.P.; Bauer, T.J. Programmable Integrated Circuit Providing Efficient Implementations of Arithmetic Functions. U.S. Patent 7,218,139, 15 May 2007.
  26. 7 Series FPGAs Configurable Logic Block User Guide; UG474 (v1.6); Xilinx: San Jose, CA, USA, August 2014.
  27. Stratix V Device Handbook, Volume 1: Device Interfaces and Integration; SV-5V1; Altera: San Jose, CA, USA, 2015.
  28. Stratix II Device Handbook, Volume 1; SII5V1-4.5; Altera: San Jose, CA, USA, 2011.
  29. Stratix 10 Advance Information Brief; AIB-01025; Altera: San Jose, CA, USA, 2015.
  30. MacSorley, O.L. High-speed Arithmetic in Binary Computers. Proc. IRE 1961, 49, 67–91.
  31. Koren, I. Computer Arithmetic and Algorithms; Prentice Hall: Englewood Cliffs, NJ, USA, 1993.
  32. Ercegovac, M.D.; Lang, T. Digital Arithmetic; Morgan Kaufmann: San Francisco, CA, USA, 2004.
  33. Parhami, B. Computer Arithmetic: Algorithms and Hardware Design, 2nd ed.; Oxford University Press: New York, NY, USA, 2010.
  34. Wallace, C.S. A Suggestion for a Fast Multiplier. IEEE Trans. Electron. Comput. 1964, EC-13, 14–17.
  35. Dadda, L. Some Schemes for Parallel Multipliers. Alta Frequenza 1965, 34, 349–356.
  36. Stenzel, W.J.; Kubitz, W.J.; Garcia, G.H. A Compact High-Speed Parallel Multiplication Scheme. IEEE Trans. Comput. 1977, C-26, 948–957.
  37. LogiCORE IP Multiplier v12.0 Product Guide; PG108; Xilinx: San Jose, CA, USA, 2014.
  38. LogiCORE IP Multiplier v11.2 Product Specification; DS255; Xilinx: San Jose, CA, USA, 2011.
  39. Walters, E.G., III; Schulte, M.J.; Glossner, J. Automatic VHDL Generation of Parameterized FIR Filters. In Proceedings of the Second International Samos Workshop on Systems, Architectures, Modeling and Simulation, Samos, Greece, 22–25 July 2002; pp. 210–224.
  40. Walters, E.G., III. Using Truncated-Matrix Multipliers and Squarers in High-Performance DSP Systems. Ph.D. Thesis, Lehigh University, Bethlehem, PA, USA, January 2009.
  41. De Dinechin, F.; Pasca, B. Designing Custom Arithmetic Data Paths with FloPoCo. IEEE Des. Test Comput. 2011, 28, 18–27.
  42. Kostarnov, I.; Whyte, A. Circuit Structure for Multiplying Numbers Using Look-Up Tables and Adders. U.S. Patent 8,352,532, 8 January 2013.
Figure 1. Partial diagram of a Xilinx 7-Series CLB slice.
Figure 2. Altera ALM configured as two 6-input LUTs.
Figure 3. Altera ALM configured for arithmetic.
Figure 4. Radix-4-modified Booth partial-product matrix, $m = n = 6$.
Figure 5. Matrix reduction using (5,5;4) counters.
Figure 6. Proposed two-operand adder; computing $SUM = X + Y$.
Figure 7. Arithmetic for the partial-product generation and addition operation.
Figure 8. Combined partial-product generation and addition unit.
Figure 9. Arithmetic for the optimized generate-add operation.
Figure 10. Optimized generate-add unit.
Figure 11. Array structure of proposed multiplier, $m = n = 6$.
Figure 12. Pipelined $m \times n$ array multiplier with $n/4$ stages.
Figure 13. Arithmetic for the Row 0 generate-add unit.
Figure 14. Row 0 generate-add estimation unit.
Figure 15. Implementation of the proposed 6 × 6 single-cycle array multiplier.
Figure 16. Implementation of the proposed 16 × 16 pipelined multiplier (with an eight-stage pipeline using SRLs) showing wiring from the I/O pads for operand A on the left and wiring to the I/O pads for output P on the right.
Figure 17. Implementation of the proposed 16 × 16 pipelined multiplier (with an eight-stage pipeline using SRLs) showing the usage of LUT6s, CARRY4s and flip-flops.
Figure 18. Implementation of the proposed 16 × 16 pipelined multiplier (with an eight-stage pipeline using SRLs) showing wiring from the third pipeline register to the fourth-stage generate-add unit.
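Table 1. MUXCY propagate and generate signals for addition.

| Adder input $x_i$ | Adder input $y_i$ | Adder input $c_i$ | Adder output $c_{i+1}$ | Adder output $s_i$ | MUXCY input $prop_i$ | MUXCY input $gen_i$ |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 1 | X |
| 0 | 1 | 1 | 1 | 0 | 1 | X |
| 1 | 0 | 0 | 0 | 1 | 1 | X |
| 1 | 0 | 1 | 1 | 0 | 1 | X |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 |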
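The relationship in Table 1 can be checked with a few lines of Python. This is a behavioral sketch only, not a model of the CARRY4 primitive: the function name is hypothetical, and choosing $y_i$ for the generate signal is an assumption that is valid because $gen_i$ only matters when $x_i = y_i$.

```python
from itertools import product


def muxcy_full_adder(x, y, c_in):
    """One-bit addition expressed with propagate and generate signals.

    prop selects between the incoming carry and the generate signal (the
    MUXCY behavior), and the sum is prop XOR carry-in (the XORCY behavior).
    """
    prop = x ^ y
    gen = y  # only used when prop == 0, i.e. when x == y
    c_out = c_in if prop else gen
    s = prop ^ c_in
    return c_out, s, prop, gen


# Exhaustively compare against ordinary integer addition.
for x, y, c_in in product((0, 1), repeat=3):
    c_out, s, prop, gen = muxcy_full_adder(x, y, c_in)
    assert 2 * c_out + s == x + y + c_in
    print(x, y, c_in, "->", c_out, s, "prop =", prop)
```

The don't-care entries (X) in Table 1 correspond to the rows where `prop` is 1, since the multiplexer then ignores `gen`.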
Table 2. Radix-4-modified Booth recoding and partial-product selection ($j = 2\rho$).

| $b_{j+1}$ | $b_j$ | $b_{j-1}$ | $b_\rho$ | $P_\rho$ | Comments |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | $0$ | string of “0”s |
| 0 | 0 | 1 | 1 | $+A$ | end of “1”s |
| 0 | 1 | 0 | 1 | $+A$ | a single “1” |
| 0 | 1 | 1 | 2 | $+2A$ | end of “1”s |
| 1 | 0 | 0 | $-2$ | $-2A$ | beginning of “1”s |
| 1 | 0 | 1 | $-1$ | $-A$ | a single “0” |
| 1 | 1 | 0 | $-1$ | $-A$ | beginning of “1”s |
| 1 | 1 | 1 | 0 | $0$ | string of “1”s |
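The recoding in Table 2 is equivalent to the digit formula $-2b_{j+1} + b_j + b_{j-1}$ with $b_{-1} = 0$. The Python sketch below applies that formula to an $n$-bit two's-complement operand and checks that the digits reconstruct the operand. The function names are hypothetical, and the restriction to even $n$ is a simplifying assumption of this sketch.

```python
def booth_radix4_digits(b, n):
    """Recode an n-bit two's-complement integer into radix-4 Booth digits.

    Returns the digits least-significant first; each digit is in
    {-2, -1, 0, 1, 2}.
    """
    if n % 2 != 0:
        raise ValueError("this sketch assumes an even operand width")
    # Python integers behave as infinitely sign-extended two's complement,
    # so shifting and masking also works for negative values.
    bits = [(b >> k) & 1 for k in range(n)]
    digits = []
    for rho in range(n // 2):
        j = 2 * rho
        b_prev = bits[j - 1] if j > 0 else 0  # b_{-1} is defined as 0
        digits.append(-2 * bits[j + 1] + bits[j] + b_prev)
    return digits


def booth_value(digits):
    """Value represented by radix-4 digits given least-significant first."""
    return sum(d * 4 ** rho for rho, d in enumerate(digits))


def roundtrip_ok(n):
    """Check every n-bit two's-complement value reconstructs exactly."""
    lo, hi = -(1 << (n - 1)), 1 << (n - 1)
    return all(booth_value(booth_radix4_digits(b, n)) == b for b in range(lo, hi))


print(booth_radix4_digits(-9, 6))  # [-1, 2, -1], since -1 + 2*4 - 1*16 = -9
print(roundtrip_ok(6))             # True
```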
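Table 3. Radix-4-modified Booth partial-product generation.

| $P_\rho$ | $p_{\rho,m}$ | $p_{\rho,m-1}$ | $p_{\rho,m-2}$ | $\cdots$ | $p_{\rho,2}$ | $p_{\rho,1}$ | $p_{\rho,0}$ | $op_\rho$ |
|---|---|---|---|---|---|---|---|---|
| $+0$ | 0 | 0 | 0 | $\cdots$ | 0 | 0 | 0 | 0 |
| $+A$ | $a_{m-1}$ | $a_{m-1}$ | $a_{m-2}$ | $\cdots$ | $a_2$ | $a_1$ | $a_0$ | 0 |
| $+2A$ | $a_{m-1}$ | $a_{m-2}$ | $a_{m-3}$ | $\cdots$ | $a_1$ | $a_0$ | 0 | 0 |
| $-A$ | $\overline{a_{m-1}}$ | $\overline{a_{m-1}}$ | $\overline{a_{m-2}}$ | $\cdots$ | $\overline{a_2}$ | $\overline{a_1}$ | $\overline{a_0}$ | 1 |
| $-2A$ | $\overline{a_{m-1}}$ | $\overline{a_{m-2}}$ | $\overline{a_{m-3}}$ | $\cdots$ | $\overline{a_1}$ | $\overline{a_0}$ | 1 | 1 |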
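Table 4. Proposed partial-product selection ($j = 2\rho$).

| $b_{j+1}$ | $b_j$ | $b_{j-1}$ | $b_\rho$ | $P_\rho$ |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | $0$ |
| 0 | 0 | 1 | 1 | $+A$ |
| 0 | 1 | 0 | 1 | $+A$ |
| 0 | 1 | 1 | 2 | $+2A$ |
| 1 | 0 | 0 | $-2$ | $-2A$ |
| 1 | 0 | 1 | $-1$ | $-A$ |
| 1 | 1 | 0 | $-1$ | $-A$ |
| 1 | 1 | 1 | 0 | $-0$ |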
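Table 5. Proposed partial-product generation.

| $P_\rho$ | $p_{\rho,m}$ | $p_{\rho,m-1}$ | $p_{\rho,m-2}$ | $\cdots$ | $p_{\rho,2}$ | $p_{\rho,1}$ | $p_{\rho,0}$ | $op_\rho$ |
|---|---|---|---|---|---|---|---|---|
| $+0$ | 0 | 0 | 0 | $\cdots$ | 0 | 0 | 0 | 0 |
| $+A$ | $a_{m-1}$ | $a_{m-1}$ | $a_{m-2}$ | $\cdots$ | $a_2$ | $a_1$ | $a_0$ | 0 |
| $+2A$ | $a_{m-1}$ | $a_{m-2}$ | $a_{m-3}$ | $\cdots$ | $a_1$ | $a_0$ | 0 | 0 |
| $-0$ | 1 | 1 | 1 | $\cdots$ | 1 | 1 | 1 | 1 |
| $-A$ | $\overline{a_{m-1}}$ | $\overline{a_{m-1}}$ | $\overline{a_{m-2}}$ | $\cdots$ | $\overline{a_2}$ | $\overline{a_1}$ | $\overline{a_0}$ | 1 |
| $-2A$ | $\overline{a_{m-1}}$ | $\overline{a_{m-2}}$ | $\overline{a_{m-3}}$ | $\cdots$ | $\overline{a_1}$ | $\overline{a_0}$ | 1 | 1 |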
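One way to read Table 5 is that each row, together with its $op_\rho$ bit added at the least-significant position, equals the selected multiple of $A$ modulo $2^{m+1}$. The Python sketch below builds each row from an $m$-bit two's-complement operand and checks that property exhaustively for a small width. The function names and the string labels for the selections are assumptions made for illustration; only the bit patterns come from Table 5.

```python
def partial_product_row(a, m, selection):
    """Return (row, op) for one selection of Table 5.

    `row` holds bits p_m ... p_0 as an (m+1)-bit unsigned pattern and `op`
    is the extra bit added at the least-significant position.
    """
    mask = (1 << (m + 1)) - 1
    a_ext = a & mask  # A sign-extended to m+1 bits (bit m repeats a_{m-1})
    if selection == "+0":
        return 0, 0
    if selection == "+A":
        return a_ext, 0
    if selection == "+2A":
        return (a_ext << 1) & mask, 0
    if selection == "-0":
        return mask, 1                     # all ones plus 1 wraps to zero
    if selection == "-A":
        return ~a_ext & mask, 1            # one's complement plus 1
    if selection == "-2A":
        return ~(a_ext << 1) & mask, 1     # complement of 2A, so p_0 = 1
    raise ValueError("unknown selection: " + selection)


MULTIPLE = {"+0": 0, "+A": 1, "+2A": 2, "-0": 0, "-A": -1, "-2A": -2}


def check_table5(m):
    """Check row + op == multiple * A (mod 2**(m+1)) for every m-bit A."""
    modulus = 1 << (m + 1)
    for a in range(-(1 << (m - 1)), 1 << (m - 1)):
        for selection, k in MULTIPLE.items():
            row, op = partial_product_row(a, m, selection)
            if (row + op) % modulus != (k * a) % modulus:
                return False
    return True


print(check_table5(6))
```

The $-0$ row is the only one with no counterpart in Table 3: it produces an all-ones pattern with $op_\rho = 1$, which still sums to zero modulo $2^{m+1}$.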
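Table 6. Truth table to generate $p_{\rho,i}$ and add it to the bit connected to the I6 input.

| $b_{j+1}$ (I5) | $b_j$ (I4) | $b_{j-1}$ (I3) | $a_i$ (I2) | $a_{i-1}$ (I1) | $P_\rho$ | $p_{\rho,i}$ | $\overline{p_{\rho,i}}$ (O6) | $p_{\rho,i}$ (O5) |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | $0$ | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 | $0$ | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 | 0 | $0$ | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 | 1 | $0$ | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 | $+A$ | $a_i$ | 1 | 0 |
| 0 | 0 | 1 | 0 | 1 | $+A$ | $a_i$ | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 | $+A$ | $a_i$ | 0 | 1 |
| 0 | 0 | 1 | 1 | 1 | $+A$ | $a_i$ | 0 | 1 |
| 0 | 1 | 0 | 0 | 0 | $+A$ | $a_i$ | 1 | 0 |
| 0 | 1 | 0 | 0 | 1 | $+A$ | $a_i$ | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 | $+A$ | $a_i$ | 0 | 1 |
| 0 | 1 | 0 | 1 | 1 | $+A$ | $a_i$ | 0 | 1 |
| 0 | 1 | 1 | 0 | 0 | $+2A$ | $a_{i-1}$ | 1 | 0 |
| 0 | 1 | 1 | 0 | 1 | $+2A$ | $a_{i-1}$ | 0 | 1 |
| 0 | 1 | 1 | 1 | 0 | $+2A$ | $a_{i-1}$ | 1 | 0 |
| 0 | 1 | 1 | 1 | 1 | $+2A$ | $a_{i-1}$ | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | $-2A$ | $\overline{a_{i-1}}$ | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | $-2A$ | $\overline{a_{i-1}}$ | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | $-2A$ | $\overline{a_{i-1}}$ | 0 | 1 |
| 1 | 0 | 0 | 1 | 1 | $-2A$ | $\overline{a_{i-1}}$ | 1 | 0 |
| 1 | 0 | 1 | 0 | 0 | $-A$ | $\overline{a_i}$ | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 | $-A$ | $\overline{a_i}$ | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 | $-A$ | $\overline{a_i}$ | 1 | 0 |
| 1 | 0 | 1 | 1 | 1 | $-A$ | $\overline{a_i}$ | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | $-A$ | $\overline{a_i}$ | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | $-A$ | $\overline{a_i}$ | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | $-A$ | $\overline{a_i}$ | 1 | 0 |
| 1 | 1 | 0 | 1 | 1 | $-A$ | $\overline{a_i}$ | 1 | 0 |
| 1 | 1 | 1 | 0 | 0 | $-0$ | 1 | 0 | 1 |
| 1 | 1 | 1 | 0 | 1 | $-0$ | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | $-0$ | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | $-0$ | 1 | 0 | 1 |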
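Table 7. Truth table for the most-significant LUT6 of an optimized generate-add unit.

| $b_{j+1}$ (I5) | $b_j$ (I4) | $b_{j-1}$ (I3) | $a_{m-1}$ (I2) | $x_{\rho-1,m+1}$ (I1) | $x_{\rho-1,m+2}$ | $P_\rho$ | $p_{\rho,m}$ | $f$ (O6) | $\overline{p_{\rho,m}}$ (O5) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | $0$ | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 1 | 0 | $0$ | 0 | 1 | 1 |
| 0 | 0 | 0 | 1 | 0 | 1 | $0$ | 0 | 0 | 1 |
| 0 | 0 | 0 | 1 | 1 | 0 | $0$ | 0 | 1 | 1 |
| 0 | 0 | 1 | 0 | 0 | 1 | $+A$ | $a_{m-1}$ | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 | 0 | $+A$ | $a_{m-1}$ | 1 | 1 |
| 0 | 0 | 1 | 1 | 0 | 1 | $+A$ | $a_{m-1}$ | 1 | 0 |
| 0 | 0 | 1 | 1 | 1 | 0 | $+A$ | $a_{m-1}$ | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 1 | $+A$ | $a_{m-1}$ | 0 | 1 |
| 0 | 1 | 0 | 0 | 1 | 0 | $+A$ | $a_{m-1}$ | 1 | 1 |
| 0 | 1 | 0 | 1 | 0 | 1 | $+A$ | $a_{m-1}$ | 1 | 0 |
| 0 | 1 | 0 | 1 | 1 | 0 | $+A$ | $a_{m-1}$ | 0 | 0 |
| 0 | 1 | 1 | 0 | 0 | 1 | $+2A$ | $a_{m-1}$ | 0 | 1 |
| 0 | 1 | 1 | 0 | 1 | 0 | $+2A$ | $a_{m-1}$ | 1 | 1 |
| 0 | 1 | 1 | 1 | 0 | 1 | $+2A$ | $a_{m-1}$ | 1 | 0 |
| 0 | 1 | 1 | 1 | 1 | 0 | $+2A$ | $a_{m-1}$ | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | $-2A$ | $\overline{a_{m-1}}$ | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | $-2A$ | $\overline{a_{m-1}}$ | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 | $-2A$ | $\overline{a_{m-1}}$ | 0 | 1 |
| 1 | 0 | 0 | 1 | 1 | 0 | $-2A$ | $\overline{a_{m-1}}$ | 1 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | $-A$ | $\overline{a_{m-1}}$ | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | $-A$ | $\overline{a_{m-1}}$ | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 1 | $-A$ | $\overline{a_{m-1}}$ | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 | 0 | $-A$ | $\overline{a_{m-1}}$ | 1 | 1 |
| 1 | 1 | 0 | 0 | 0 | 1 | $-A$ | $\overline{a_{m-1}}$ | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | $-A$ | $\overline{a_{m-1}}$ | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | $-A$ | $\overline{a_{m-1}}$ | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | $-A$ | $\overline{a_{m-1}}$ | 1 | 1 |
| 1 | 1 | 1 | 0 | 0 | 1 | $-0$ | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 1 | 0 | $-0$ | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | $-0$ | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | $-0$ | 1 | 0 | 0 |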
Table 8. Truth table to generate p 0 , i + 1 and p 0 , i in Row 0.
Table 8. Truth table to generate p 0 , i + 1 and p 0 , i in Row 0.
b 1 b 0 a i + 1 a i a i - 1 P 0 p 0 , i + 1 p 0 , i p 0 , i + 1 p 0 , i
I5I4I3I2I1O6O5
0000000000
0000100000
0001000000
0001100000
0010000000
0010100000
0011000000
0011100000
01000 + A a i + 1 a i 00
01001 + A a i + 1 a i 00
01010 + A a i + 1 a i 01
01011 + A a i + 1 a i 01
01100 + A a i + 1 a i 10
01101 + A a i + 1 a i 10
01110 + A a i + 1 a i 11
01111 + A a i + 1 a i 11
10000 - 2 A a i ¯ a i - 1 ¯ 11
10001 - 2 A a i ¯ a i - 1 ¯ 10
10010 - 2 A a i ¯ a i - 1 ¯ 01
10011 - 2 A a i ¯ a i - 1 ¯ 00
10100 - 2 A a i ¯ a i - 1 ¯ 11
10101 - 2 A a i ¯ a i - 1 ¯ 10
10110 - 2 A a i ¯ a i - 1 ¯ 01
10111 - 2 A a i ¯ a i - 1 ¯ 00
11000 - A a i + 1 ¯ a i ¯ 11
11001 - A a i + 1 ¯ a i ¯ 11
11010 - A a i + 1 ¯ a i ¯ 10
11011 - A a i + 1 ¯ a i ¯ 10
11100 - A a i + 1 ¯ a i ¯ 01
11101 - A a i + 1 ¯ a i ¯ 01
11110 - A a i + 1 ¯ a i ¯ 00
11111 - A a i + 1 ¯ a i ¯ 00
Table 9. Synthesis results for LogiCORE IP single-cycle multipliers.

| Type | n | Total LUTs | Delay (ns) | LDP | Normalized LUTs | Normalized delay | Normalized LDP |
|---|---|---|---|---|---|---|---|
| Xilinx | 6 | 40 | 2.581 | 103.2 | 1.000 | 1.000 | 1.000 |
| Xilinx | 8 | 72 | 2.662 | 191.7 | 1.000 | 1.000 | 1.000 |
| Xilinx | 10 | 110 | 3.533 | 388.6 | 1.000 | 1.000 | 1.000 |
| Xilinx | 12 | 158 | 3.666 | 579.2 | 1.000 | 1.000 | 1.000 |
| Xilinx | 14 | 214 | 3.728 | 797.8 | 1.000 | 1.000 | 1.000 |
| Xilinx | 16 | 280 | 3.937 | 1102.4 | 1.000 | 1.000 | 1.000 |
| Xilinx | 20 | 431 | 4.702 | 2026.6 | 1.000 | 1.000 | 1.000 |
| Xilinx | 24 | 617 | 4.885 | 3014.0 | 1.000 | 1.000 | 1.000 |
| Xilinx | 32 | 1089 | 5.514 | 6004.7 | 1.000 | 1.000 | 1.000 |
| Xilinx | 64 | 4261 | 7.259 | 30930.6 | 1.000 | 1.000 | 1.000 |
Table 10. Synthesis results for the proposed single-cycle multipliers.

| Type | n | Total LUTs | Delay (ns) | LDP | Normalized LUTs | Normalized delay | Normalized LDP |
|---|---|---|---|---|---|---|---|
| New | 6 | 21 | 2.649 | 55.6 | 0.525 | 1.026 | 0.539 |
| New | 8 | 36 | 3.594 | 129.4 | 0.500 | 1.350 | 0.675 |
| New | 10 | 55 | 4.250 | 233.8 | 0.500 | 1.203 | 0.601 |
| New | 12 | 78 | 5.248 | 409.3 | 0.494 | 1.432 | 0.707 |
| New | 14 | 105 | 5.820 | 611.1 | 0.491 | 1.561 | 0.766 |
| New | 16 | 136 | 6.875 | 935.0 | 0.486 | 1.746 | 0.848 |
| New | 20 | 210 | 8.509 | 1786.9 | 0.487 | 1.810 | 0.882 |
| New | 24 | 300 | 10.509 | 3152.7 | 0.486 | 2.151 | 1.046 |
| New | 32 | 528 | 13.956 | 7368.8 | 0.485 | 2.531 | 1.227 |
| New | 64 | 2080 | 26.323 | 54751.8 | 0.488 | 3.626 | 1.770 |
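Table 11. Synthesis results for LogiCORE IP pipelined multipliers.

| Type | n | Stages | Total LUTs | Delay (ns) | FFs | Normalized LUTs | Normalized delay | Normalized FFs | Normalized LDP |
|---|---|---|---|---|---|---|---|---|---|
| Xilinx | 6 | 3 | 40 | 1.413 | 55 | 1.000 | 1.000 | 1.000 | 1.000 |
| Xilinx | 8 | 3 | 72 | 1.518 | 81 | 1.000 | 1.000 | 1.000 | 1.000 |
| Xilinx | 10 | 4 | 113 | 1.338 | 150 | 1.000 | 1.000 | 1.000 | 1.000 |
| Xilinx | 12 | 4 | 161 | 1.416 | 192 | 1.000 | 1.000 | 1.000 | 1.000 |
| Xilinx | 14 | 4 | 217 | 1.516 | 253 | 1.000 | 1.000 | 1.000 | 1.000 |
| Xilinx | 16 | 4 | 283 | 1.506 | 305 | 1.000 | 1.000 | 1.000 | 1.000 |
| Xilinx | 20 | 5 | 440 | 1.639 | 517 | 1.000 | 1.000 | 1.000 | 1.000 |
| Xilinx | 24 | 5 | 626 | 1.694 | 694 | 1.000 | 1.000 | 1.000 | 1.000 |
| Xilinx | 32 | 5 | 1099 | 1.836 | 1154 | 1.000 | 1.000 | 1.000 | 1.000 |
| Xilinx | 64 | 6 | 4288 | 2.358 | 4418 | 1.000 | 1.000 | 1.000 | 1.000 |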
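Table 12. Synthesis results for the proposed multipliers, $n/4$-stage pipeline, no SRLs.

| Type | n | Stages | Total LUTs | Delay (ns) | FFs | Normalized LUTs | Normalized delay | Normalized FFs | Normalized LDP |
|---|---|---|---|---|---|---|---|---|---|
| New | 6 | 2 | 21 | 1.984 | 31 | 0.525 | 1.404 | 0.563 | 0.737 |
| New | 8 | 2 | 36 | 2.038 | 41 | 0.500 | 1.343 | 0.506 | 0.671 |
| New | 10 | 3 | 55 | 1.943 | 82 | 0.487 | 1.452 | 0.547 | 0.707 |
| New | 12 | 3 | 78 | 2.108 | 98 | 0.484 | 1.489 | 0.510 | 0.721 |
| New | 14 | 4 | 105 | 1.988 | 157 | 0.484 | 1.311 | 0.621 | 0.635 |
| New | 16 | 4 | 136 | 2.176 | 179 | 0.481 | 1.445 | 0.587 | 0.694 |
| New | 20 | 5 | 210 | 2.232 | 284 | 0.477 | 1.362 | 0.549 | 0.650 |
| New | 24 | 6 | 300 | 2.347 | 413 | 0.479 | 1.385 | 0.595 | 0.664 |
| New | 32 | 8 | 528 | 2.396 | 743 | 0.480 | 1.305 | 0.644 | 0.627 |
| New | 64 | 16 | 2080 | 2.855 | 3023 | 0.485 | 1.211 | 0.684 | 0.587 |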
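Table 13. Synthesis results for proposed multipliers, $n/2$-stage pipeline, no SRLs.

| Type | n | Stages | Total LUTs | Delay (ns) | FFs | Normalized LUTs | Normalized delay | Normalized FFs | Normalized LDP |
|---|---|---|---|---|---|---|---|---|---|
| New | 6 | 3 | 21 | 1.119 | 50 | 0.525 | 0.792 | 0.980 | 0.416 |
| New | 8 | 4 | 36 | 1.175 | 91 | 0.500 | 0.774 | 1.123 | 0.387 |
| New | 10 | 5 | 55 | 1.232 | 144 | 0.487 | 0.921 | 0.960 | 0.448 |
| New | 12 | 6 | 78 | 1.283 | 209 | 0.484 | 0.906 | 1.090 | 0.439 |
| New | 14 | 7 | 105 | 1.312 | 286 | 0.484 | 0.865 | 1.130 | 0.419 |
| New | 16 | 8 | 136 | 1.402 | 375 | 0.481 | 0.931 | 1.230 | 0.447 |
| New | 20 | 10 | 210 | 1.465 | 589 | 0.477 | 0.894 | 1.139 | 0.427 |
| New | 24 | 12 | 300 | 1.700 | 851 | 0.479 | 1.004 | 1.226 | 0.481 |
| New | 32 | 16 | 528 | 1.667 | 1519 | 0.480 | 0.908 | 1.316 | 0.436 |
| New | 64 | 32 | 2080 | 2.083 | 6111 | 0.485 | 0.883 | 1.383 | 0.429 |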
Table 14. Synthesis results for the proposed multipliers, $n/2$-stage pipeline, using SRLs.

| Type | n | Stages | Total LUTs | Delay (ns) | FFs | Normalized LUTs | Normalized Delay | Normalized FFs | Normalized LDP |
|---|---|---|---|---|---|---|---|---|---|
| New | 6 | 3 | 23 | 1.193 | 46 | 0.575 | 0.844 | 0.836 | 0.485 |
| New | 8 | 4 | 42 | 1.176 | 77 | 0.583 | 0.775 | 0.951 | 0.452 |
| New | 10 | 5 | 65 | 1.251 | 116 | 0.575 | 0.935 | 0.773 | 0.538 |
| New | 12 | 6 | 92 | 1.318 | 163 | 0.571 | 0.931 | 0.849 | 0.532 |
| New | 14 | 7 | 123 | 1.292 | 218 | 0.567 | 0.852 | 0.862 | 0.483 |
| New | 16 | 8 | 158 | 1.349 | 281 | 0.558 | 0.896 | 0.921 | 0.500 |
| New | 20 | 10 | 240 | 1.441 | 431 | 0.545 | 0.879 | 0.834 | 0.480 |
| New | 24 | 12 | 338 | 1.674 | 613 | 0.540 | 0.988 | 0.883 | 0.534 |
| New | 32 | 16 | 582 | 1.824 | 1073 | 0.530 | 0.993 | 0.930 | 0.526 |
| New | 64 | 32 | 2198 | 2.050 | 4193 | 0.513 | 0.869 | 0.949 | 0.446 |
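The density and throughput ranges quoted in the abstract (1.72 to 2.10 times as many multipliers, 1.86 to 2.58 times the throughput) appear consistent with taking reciprocals of the normalized LUT and LDP columns in Tables 13 and 14. The sketch below illustrates that reading; it assumes throughput scales as the number of multipliers that fit in the fabric times the achievable clock frequency, which is an interpretation rather than a formula stated here. The helper name `gains` is illustrative.

```python
def gains(norm_luts, norm_ldp):
    """Convert normalized LUTs and LDP into density and throughput ratios.

    Assumption: density is 1 / normalized LUTs, and throughput is
    density times the clock-frequency ratio, i.e. 1 / normalized LDP.
    """
    density = 1.0 / norm_luts
    throughput = 1.0 / norm_ldp
    return density, throughput


# (normalized LUTs, normalized LDP) pairs copied from Tables 13 and 14.
rows = {
    "Table 13, n=8": (0.500, 0.387),
    "Table 13, n=20": (0.477, 0.427),
    "Table 14, n=8": (0.583, 0.452),
    "Table 14, n=10": (0.575, 0.538),
}

for label, (luts, ldp) in rows.items():
    density, throughput = gains(luts, ldp)
    print(f"{label}: {density:.2f}x multipliers, {throughput:.2f}x throughput")
```

The rows were chosen because they hold the smallest and largest normalized LUT and LDP values among the $n/2$-stage results in Tables 13 and 14.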
Table 15. Synthesis results for the proposed multipliers, $n/2$-stage pipeline, using SRLs, estimated $X_0$.

| Type | n | Stages | Total LUTs | Delay (ns) | FFs | Normalized LUTs | Normalized Delay | Normalized FFs | Normalized LDP |
|---|---|---|---|---|---|---|---|---|---|
| New | 6 | 3 | 22 | 1.294 | 42 | 0.550 | 0.916 | 0.763 | 0.504 |
| New | 8 | 4 | 38 | 1.233 | 75 | 0.528 | 0.812 | 0.926 | 0.429 |
| New | 10 | 5 | 60 | 1.279 | 114 | 0.531 | 0.956 | 0.760 | 0.508 |
| New | 12 | 6 | 86 | 1.290 | 161 | 0.534 | 0.911 | 0.839 | 0.487 |
| New | 14 | 7 | 116 | 1.333 | 216 | 0.535 | 0.879 | 0.854 | 0.470 |
| New | 16 | 8 | 150 | 1.368 | 279 | 0.530 | 0.908 | 0.915 | 0.481 |
| New | 20 | 10 | 230 | 1.424 | 429 | 0.523 | 0.869 | 0.830 | 0.454 |
| New | 24 | 12 | 326 | 1.465 | 611 | 0.521 | 0.865 | 0.880 | 0.450 |
| New | 32 | 16 | 566 | 1.775 | 1071 | 0.515 | 0.967 | 0.928 | 0.498 |
| New | 64 | 32 | 2166 | 2.076 | 4191 | 0.505 | 0.880 | 0.949 | 0.445 |
Table 16. Results for GPC-based multipliers on Altera [7]. All values are normalized to Altera MegaWizard.

| n | [7] Radix-4 Booth LUTs | [7] Radix-4 Booth Delay | [7] Radix-4 Booth LDP | [7] Baugh-Wooley LUTs | [7] Baugh-Wooley Delay | [7] Baugh-Wooley LDP |
|---|---|---|---|---|---|---|
| 10 | 1.00 | 0.67 | 0.67 | 1.12 | 1.28 | 1.43 |
| 12 | 1.00 | 0.79 | 0.79 | 1.10 | 1.02 | 1.12 |
| 14 | 1.07 | 0.82 | 0.88 | 1.13 | 1.10 | 1.24 |
| 16 | 1.22 | 0.84 | 1.02 | 1.26 | 0.95 | 1.20 |
| 20 | 0.97 | 0.83 | 0.81 | 1.03 | 0.92 | 0.95 |
| 24 | 0.97 | 0.83 | 0.81 | 1.05 | 0.94 | 0.99 |
| 32 | 1.21 | 0.89 | 1.08 | 1.25 | 0.98 | 1.23 |
| 64 | 1.16 | 0.82 | 0.95 | 1.16 | 0.98 | 1.14 |

Walters, E.G. Array Multipliers for High Throughput in Xilinx FPGAs with 6-Input LUTs. Computers 2016, 5, 20. https://doi.org/10.3390/computers5040020