Article

Multiplication and Accumulation Co-Optimization for Low Complexity FIR Filter Implementation

1 School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China
2 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China
3 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(11), 1721; https://doi.org/10.3390/electronics11111721
Submission received: 8 March 2022 / Revised: 14 April 2022 / Accepted: 14 April 2022 / Published: 28 May 2022
(This article belongs to the Section Circuit and Signal Processing)

Abstract

In multiplierless finite impulse response (FIR) filters, the product accumulation block (PAB) could be the major contributor to hardware complexity, especially for high-order filters. In this paper, an optimization scheme where the constant multiplication block and the PAB are jointly optimized at the bit-level is proposed to minimize the hardware complexity. In the proposed joint optimization, the multiple constant multiplications (MCM) block is rearranged into several MCM sub-blocks. The products are summed locally before accumulation to reduce the word-length of the structural adders. It is shown that the symmetric property of linear phase FIR filters can be utilized in some cases to further reduce the complexity of the constant multiplications. Quantitative analyses are also presented to study the relationship between the optimum group size and the coefficient values as well as the filter orders. It is shown that there is no fixed optimum structure for filters with different coefficient word-lengths and filter orders, and each filter needs to be optimized specifically to achieve the minimum hardware complexity. Implementation results are presented to validate the effectiveness of the proposed method.

1. Introduction

Digital filtering is one of the most important functions in digital signal processing (DSP) systems. Finite impulse response (FIR) filters are widely used in many DSP applications due to their guaranteed stability and linear phase properties [1]. However, compared with infinite impulse response (IIR) filters, the computational complexity of FIR filters is much higher [2,3,4]. Therefore, it is crucial to reduce the complexity of FIR filters, and extensive research has been conducted over the past several decades. Since multipliers are expensive in terms of circuit area and computational delay, many works focus on the design of FIR filter coefficients. One approach is to reduce the number of non-zero filter coefficients, which is commonly referred to as sparse FIR filter design [5,6,7,8]. Since multipliers can be omitted if the corresponding coefficients are zero, the overall complexity can be reduced by increasing the sparsity of the filter. In addition, FIR filters can also be implemented using a number system other than binary for complexity reduction. A typical example is the residue numeral system (RNS)-based FIR filter implementation [9]. Moreover, some works focus on a hybrid filter structure for complexity reduction [10,11,12]. For the VLSI implementation of FIR filters with fixed coefficients, a multiplierless design is a more general technique to reduce the hardware and time complexities [13,14,15,16,17,18,19,20,21,22,23]. In the multiplierless implementation of FIR filters, the constant multiplications are generally implemented by optimized shift-add networks. The adders in the shift-add network are usually referred to as multiplication block adders. Besides multiplication block adders, another type of adder, known as structural adders [22], exists to implement the product accumulation block (PAB) of the FIR filter.
The problem of minimizing the multiplication block has been intensively studied. Various optimization techniques have been proposed [24,25,26,27,28,29,30,31] to extract the common subexpressions (partial products) in the coefficients and reuse them within the coefficient multiplication as well as across all the multiplications for a direct form filter structure (shown in Figure 1a) or a transposed direct form (TDF) filter structure (shown in Figure 1b). Most of the multiplication block design algorithms focus on the TDF filter structure, where the input sample is concurrently multiplied by all the filter coefficients, forming a multiple constant multiplication (MCM) block. With the help of MCM design algorithms, the constant multiplications in TDF FIR filters can be implemented very efficiently.
Although the optimization of structural adders in the PABs is generally ignored by most researchers, it is noted that this part can be the major contributor to the overall hardware complexity. The number of structural adders is the same as the filter order and cannot be reduced. The naturally pipelined PAB in the TDF structure is advantageous to achieving low critical path delay, but it results in an increase in hardware complexity. As product words are added one by one along the accumulation line, the word-lengths of structural adders increase monotonically. The direct form structure, on the other hand, contains a PAB without any delay element, which could be implemented by a binary adder-tree of relatively less complexity. Therefore, a hybrid structure is a proper choice to reduce the PAB complexity and subsequently the overall complexity of FIR filters without sacrificing too much delay. In [11], the matrix MCM is proposed to minimize the adder cost of constant multiplications in hybrid-form FIR filters. In [12], different types of hybrid-form structures are discussed and evaluated. However, no optimization schemes have been investigated in the existing literature to reduce the bit-level complexity of the multiplication block as well as the PAB in hybrid-form FIR filters. Moreover, the symmetric property of linear phase FIR filters has not been utilized in such optimizations.
In this paper, a design scheme where the multiplication block and the PAB are jointly optimized at bit-level is proposed to minimize the hardware complexity of the multiplierless implementation of FIR filters. In the proposed joint optimization, the MCM block is partitioned into several MCM sub-blocks. The product words are summed locally before accumulation to reduce the word-length of structural adders. In particular, when the MCM block is partitioned into two MCM sub-blocks, the symmetric property of linear phase FIR filters can be utilized by carefully grouping the product terms to further reduce the complexity of constant multiplications. An exhaustive search algorithm, which minimizes the total complexity of the constant multipliers and PAB, is proposed to determine the optimum partition of the MCM block. The relationship between the coefficient word-length as well as filter order and the optimum structure is also studied.
The rest of the paper is organized as follows. Section 2 presents the complexity reduction of the PAB. Section 3 discusses the joint optimization of the PAB and the multiplication block. The quantitative analysis of the optimum group size is presented in Section 4, implementation results are reported in Section 5, and conclusions are drawn in Section 6.

2. Complexity Reduction of the Product Accumulation Block

2.1. Complexity of FIR Filters at the Bit Level

Instead of the word-level adder count, a more accurate estimate of hardware complexity is the number of full adders. According to [32,33], the number of full adders needed to implement a shift-add operation can be estimated as
$$W_{\mathrm{MBA}_i} = W_X + \lceil \log_2 f_i \rceil - l_i, \tag{1}$$
where $W_X$ is the word-length of the filter input, $f_i$ is the generated fundamental and $l_i$ is the number of left-shifts in this shift-add operation. Since shifts can be implemented by proper wiring, the complexity of the MCM block can be measured in terms of the total number of full adders needed, which is given by
$$W_{\mathrm{MBA}} = \sum_{i=1}^{M} W_{\mathrm{MBA}_i} = \sum_{i=1}^{M_r + M_a} \left( W_X + \lceil \log_2 f_i \rceil - l_i \right). \tag{2}$$
Here, $M = M_r + M_a$ is the number of shift-add operations in the multiplication block, where $M_r$ is the number of required fundamentals that are reduced from the given coefficients and $M_a$ is the number of additional fundamentals that are inserted to implement the required fundamentals. Therefore, both $M_a$ ($M_r$ is fixed for a given coefficient set) and $W_{\mathrm{MBA}_i}$ need to be minimized to reduce the complexity of the multiplier.
In FIR filters, the PAB consists of two parts: (i) the structural adders and (ii) the registers. The number of full adders needed to implement the $i$th structural adder in an $N$th-order filter is
$$W_{\mathrm{SA}_i} = W_X + \left\lceil \log_2 \sum_{k=i}^{N} |h_k| \right\rceil - s_i, \quad \text{for } 0 \le i < N, \tag{3}$$
where $h_k$ represents the $k$th filter coefficient and $s_i$ is the number of left-shifts needed to generate $h_i$ from the corresponding fundamental. The total number of full adders needed is therefore the sum over the $N$ structural adders. The number of bit-registers needed for each register, denoted by $W_{\mathrm{Reg}_i}$, can be estimated in the same manner, except that the number of left-shifts $s_i$ does not affect the number of bit-registers. Note that in both the direct form structure and the TDF structure, the numbers of structural adders and registers are fixed.
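The estimates in Equations (1) and (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the logarithm is read with a ceiling, all coefficients are non-zero integers, and the helper names (`ceil_log2`, `left_shift_of`, `mba_full_adders`, `structural_adder_full_adders`) as well as the 4-tap coefficient list are made up for the example.

```python
def ceil_log2(x: int) -> int:
    """Exact ceil(log2(x)) for an integer x >= 1."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    return (x - 1).bit_length()


def left_shift_of(c: int) -> int:
    """Number of left-shifts s such that |c| = (odd fundamental) << s."""
    c = abs(c)
    if c == 0:
        raise ValueError("a zero coefficient has no fundamental")
    return (c & -c).bit_length() - 1


def mba_full_adders(w_x: int, fundamental: int, left_shifts: int) -> int:
    """Equation (1): full adders of a single shift-add operation."""
    return w_x + ceil_log2(fundamental) - left_shifts


def structural_adder_full_adders(w_x: int, h: list[int]) -> list[int]:
    """Equation (3) for i = 0 .. N-1, where N = len(h) - 1."""
    n = len(h) - 1
    return [
        w_x + ceil_log2(sum(abs(c) for c in h[i:])) - left_shift_of(h[i])
        for i in range(n)
    ]


h = [6, 13, 22, -28]  # hypothetical short coefficient set, not a real design
print(mba_full_adders(8, 13, 0))           # 12
print(structural_adder_full_adders(8, h))  # [14, 14, 13]
```

Summing the list returned by `structural_adder_full_adders` gives the structural-adder part of the PAB cost for the TDF structure under this model.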

2.2. Optimization of the Product Accumulation Block

In Equation (3), the first term $W_X$ and the third term $s_i$ are fixed for a tap. In the TDF structure, each product is added to the accumulation result of the previous tap. Therefore, the second term in Equation (3) needs to include the coefficients of all the previous taps to guard the range expansion. For the hybrid FIR filter structure, the MCM block can be rearranged into $L$ MCM sub-blocks. In this work, the structure has two different forms depending on the value of $L$.

2.2.1. Structure for L = 2

The filter coefficients of linear phase FIR filters are symmetric, i.e., $|h_i| = |h_{N-i}|$ for $0 \le i \le N/2$. For $L = 2$, this property of the linear phase FIR filter can be utilized by grouping the symmetrical coefficients into the same MCM sub-block [34].
Unlike the conventional hybrid-form implementation, which pairs consecutive coefficients starting from $h_0$, we can start from $h_1$ and $h_N$ and pair consecutive coefficients in both directions, i.e., $\{h_1, h_2\}$, $\{h_3, h_4\}$ and so on, and $\{h_N, h_{N-1}\}$, $\{h_{N-2}, h_{N-3}\}$ and so on, leaving $h_0$ ungrouped. Let us use the filter in Figure 2 to illustrate the grouping. As shown in Figure 2, the structure can be transformed from the TDF structure using the retiming technique. Note that the delay elements for the input in each group are drawn individually in Figure 2 for the convenience of explanation. In practical implementations, these elements can be shared.
In this structure, the products except that of $h_0$ are locally summed with their neighbours before accumulation. The adders used to sum products before accumulation are referred to as local structural adders (LSAs). Note that the combinational hardware complexity of this structure is less than that of the TDF structure because the word-length of each LSA is smaller than that of its corresponding structural adder (in the TDF structure) due to the range reduction. For example, in Figure 2, the word-length of the adder used to sum the products of $h_1$ and $h_2$ is
$$W_{h_1,h_2} = W_X + \lceil \log_2 (|h_1| + |h_2|) \rceil - \max(s_1, s_2), \tag{4}$$
while the word-length of its corresponding structural adder in the TDF form is
$$W_{h_1,h_2} = W_X + \left\lceil \log_2 \sum_{k=2}^{8} |h_k| \right\rceil - s_2. \tag{5}$$
The word-length is reduced because the range expansion of the sum is reduced from $\lceil \log_2 \sum_{k=2}^{8} |h_k| \rceil$ bits to $\lceil \log_2 (|h_1| + |h_2|) \rceil$ bits.
The structure for $L = 2$ can also be explained using the transfer function, which is given by
$$H(z) = \sum_{i=0}^{8} h_i z^{-i} = h_0 + \sum_{i=0}^{3} z^{-(2i+1)} \left( h_{2i+1} + h_{2i+2} z^{-1} \right). \tag{6}$$
It can also be expressed in matrix form as
$$H(z) = h_0 + \begin{bmatrix} z^{-1} \\ z^{-3} \\ z^{-5} \\ z^{-7} \end{bmatrix}^{T} \begin{bmatrix} h_1 & h_2 \\ h_3 & h_4 \\ h_5 & h_6 \\ h_7 & h_8 \end{bmatrix} \begin{bmatrix} 1 \\ z^{-1} \end{bmatrix}. \tag{7}$$
The constant matrix vector multiplication (CMVM) in Equation (7) can be optimized using MCM algorithms or CMVM algorithms [12]. Note that in this structure, symmetric coefficient pairs are grouped into the same MCM sub-block in the CMVM. For example, in Equation (7), the CMVM consists of two MCM sub-blocks. It can be observed that the symmetric coefficient pairs in the original filter coefficient set, for example $\{h_1, h_7\}$ and $\{h_3, h_5\}$ in the first column, are grouped into the same MCM sub-block. Since in MCMs or CMVMs, duplicated coefficients are implemented once and reused across the block, only two coefficients, $h_1$ and $h_3$, need to be implemented for the MCM sub-block of the first column, leading to reduced hardware complexity. The stand-alone coefficient in Equations (6) and (7), $h_0$, can be grouped together with the first column of the coefficient matrix.
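The grouping in Equation (6) can be checked numerically with a short behavioural model. The sketch below assumes an even filter order and an integer input sequence; the function names and the symmetric 9-tap coefficient set are hypothetical, and no shift-add optimization is modelled. It compares the grouped evaluation with a plain convolution and counts the distinct coefficient magnitudes in each column of the matrix in Equation (7).

```python
def fir_direct(h, x):
    """Reference output y[n] = sum_k h[k] * x[n - k] with zero initial state."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def fir_grouped_l2(h, x):
    """Evaluate Eq. (6): h_0 stays ungrouped; each pair (h_{2i+1}, h_{2i+2})
    is summed locally and the local sum is then delayed by 2i + 1 samples."""
    if len(h) % 2 != 1:
        raise ValueError("this sketch assumes an even filter order N")

    def xs(n):  # input sample with zero padding outside the record
        return x[n] if 0 <= n < len(x) else 0

    pairs = (len(h) - 1) // 2
    # Local sums u_i[n] = h_{2i+1} x[n] + h_{2i+2} x[n-1] (one LSA per pair).
    u = [
        [h[2 * i + 1] * xs(n) + h[2 * i + 2] * xs(n - 1) for n in range(len(x))]
        for i in range(pairs)
    ]
    y = []
    for n in range(len(x)):
        acc = h[0] * x[n]
        for i in range(pairs):
            d = n - (2 * i + 1)
            if d >= 0:
                acc += u[i][d]
        y.append(acc)
    return y


def unique_magnitudes_per_column(h):
    """Distinct |h| values in the first and second column of Eq. (7)."""
    return len({abs(c) for c in h[1::2]}), len({abs(c) for c in h[2::2]})


h = [1, 2, 3, 4, 5, 4, 3, 2, 1]          # hypothetical symmetric 9-tap filter
x = [1, -2, 3, 0, 5, 7, -1, 4, 2, 6, 1]  # arbitrary test input
print(fir_grouped_l2(h, x) == fir_direct(h, x))  # True
print(unique_magnitudes_per_column(h))           # (2, 3)
```

For this coefficient set the first column holds four coefficients but only two distinct magnitudes, which is the sharing opportunity described above.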

2.2.2. Structure for L ≠ 2

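For $L \neq 2$, the symmetric pairs of filter coefficients can no longer be grouped into the same MCM sub-block to reduce the number of unique coefficients. Therefore, the filter coefficients can simply be grouped from $h_0$ to $h_N$ with a group size of $L$. The structure for $L \neq 2$ can be transformed from the TDF structure in the same way as for $L = 2$ using retiming techniques. The transfer function $H(z)$ for $L \neq 2$ can be expressed in matrix form as
$$H(z) = \begin{bmatrix} z^{0} \\ z^{-L} \\ \vdots \\ z^{-(N-L+1)} \end{bmatrix}^{T} \begin{bmatrix} h_0 & h_1 & \cdots & h_{L-1} \\ h_L & h_{L+1} & \cdots & h_{2L-1} \\ \vdots & \vdots & \ddots & \vdots \\ h_{N-L+1} & h_{N-L+2} & \cdots & h_N \end{bmatrix} \begin{bmatrix} 1 \\ z^{-1} \\ \vdots \\ z^{-L+1} \end{bmatrix}. \tag{8}$$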
The filter structure based on Equation (8) is shown in Figure 3, which is a Type II hybrid-form structure according to the classification in [12]. The constant multiplications in Figure 3 can be designed using MCM algorithms or CMVM algorithms. Note that a binary adder tree can be used to implement the LSAs in each group to reduce the word-length of adders.
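As a quick sanity check of the matrix form in Equation (8), the sketch below builds the constant matrix for an arbitrary group size and evaluates both sides of the identity at a test point using exact rational arithmetic. Zero-padding the last row when the number of taps is not a multiple of $L$ is an assumption of this sketch, and the function names and coefficient values are illustrative only.

```python
from fractions import Fraction


def constant_matrix(h, L):
    """Rows of the constant matrix in Eq. (8): h is cut into groups of L.
    The last row is zero-padded if len(h) is not a multiple of L (assumption)."""
    rows = [list(h[r:r + L]) for r in range(0, len(h), L)]
    rows[-1] += [0] * (L - len(rows[-1]))
    return rows


def eval_grouped(h, L, z):
    """Row delays z^(-rL) times the matrix times the vector [1, z^-1, ..., z^-(L-1)]."""
    inner = [z ** (-j) for j in range(L)]
    return sum(
        z ** (-r * L) * sum(c * v for c, v in zip(row, inner))
        for r, row in enumerate(constant_matrix(h, L))
    )


def eval_direct(h, z):
    """H(z) = sum_k h_k z^-k."""
    return sum(c * z ** (-k) for k, c in enumerate(h))


h = [6, 13, 22, -28, 5, 7, 9, 11, 3]  # hypothetical 9-tap coefficient set
z = Fraction(2)                        # arbitrary non-zero evaluation point
print(all(eval_grouped(h, L, z) == eval_direct(h, z) for L in range(1, len(h) + 1)))  # True
```

Each row of `constant_matrix` corresponds to one group of products that is summed locally before the row delay is applied.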

3. Joint Optimization of the Product Accumulation Block and Multiplication Block

In this section, the generalized bit-level complexity model of LSAs is proposed. The relationship between the group size $L$ and the area complexity of the PAB and the constant multipliers is analyzed. Based on that, the joint optimization of the PAB and the multiplication block is proposed to reduce the overall combinational hardware complexity, followed by the register complexity analysis.

3.1. Bit-Level Complexity Reduction of LSAs

To estimate the number of full adders of LSAs, the shifts in products as well as sums generated by LSAs need to be defined. For the products, the shifts are simply the number of left-shift bits needed to generate the coefficients from the corresponding fundamentals. For the sums generated by the LSAs, the shifts can be recursively defined as
Definition 1.
The shift of a sum generated by an LSA is the smaller shift among the shifts of two input addends.
Let us use the LSAs shown in Figure 4 as an example. According to the definition, the shifts of the sums generated by LSA-1 and LSA-2 are $\min(2, 4) = 2$ and $\min(3, 5) = 3$, respectively. Therefore, the shift of the sum generated by LSA-3 is $\min(2, 3) = 2$.
Basically, the shift-add operations of LSAs are the same as the shift-add operations in the MCM block except that both of the operands of LSAs may be left-shifted. Therefore, all the shifts (positive odd fundamentals are shifted to generate the corresponding coefficients) should be considered in the bit-level complexity estimation of LSAs. The number of full adders needed for an LSA that sums two products is given by Equation (4). The generalized form of the number of full adders needed for LSAs can be expressed as
$$W_{\mathrm{LSA}} = W_X + \left\lceil \log_2 \sum_{h_k \in A} |h_k| \right\rceil - \max(s_a, s_b), \tag{9}$$
where $A$ is the set of coefficients corresponding to the products that the LSA is used to sum, while $s_a$ and $s_b$ are the shifts of the two inputs of the LSA.
As shown in Equation (9), the word-lengths of LSAs need only cover the range expansion in the current group of coefficients, while the word-lengths of structural adders have to cover the range expansion of all the previous taps. Therefore, the overall complexity can be reduced even when the number of adders does not change. For example, the first four coefficients of the 121-tap filter in [23] are 6, 13, 22 and −28. If $L = 4$ and $W_X = 8$, the word-lengths of the three LSAs and one structural adder shown in Figure 5a are 12-bit, 12-bit, 14-bit and 25-bit, respectively, while the word-lengths of the four structural adders of the TDF structure in Figure 5b are 23-bit, 24-bit, 25-bit and 24-bit. The overall reduction for this group is therefore 33 full adders (96 versus 63). However, it should be noted that as $i$ (the subscript of coefficient $h_i$) increases, the differences in word-length between LSAs and structural adders decrease due to reduced range expansion.
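The LSA word-lengths quoted in this example can be reproduced from Definition 1 and Equation (9). The sketch below assumes a binary adder tree that pairs neighbouring operands level by level and reads the logarithm in Equation (9) with a ceiling; the function names are illustrative. Only the coefficient magnitudes enter the estimate.

```python
def ceil_log2(x: int) -> int:
    """Exact ceil(log2(x)) for an integer x >= 1."""
    return (x - 1).bit_length()


def left_shift_of(c: int) -> int:
    """Number of left-shifts s such that |c| = (odd fundamental) << s."""
    c = abs(c)
    if c == 0:
        raise ValueError("a zero coefficient has no fundamental")
    return (c & -c).bit_length() - 1


def lsa_tree_widths(w_x: int, group: list[int]) -> list[int]:
    """Word-lengths of the LSAs of a binary adder tree over one group.

    Each node carries (sum of |h_k| it covers, shift). An LSA needs
    w_x + ceil(log2(sum |h_k|)) - max(s_a, s_b) full adders (Eq. (9)),
    and the shift of its output is min(s_a, s_b) (Definition 1).
    """
    nodes = [(abs(c), left_shift_of(c)) for c in group]
    widths = []
    while len(nodes) > 1:
        next_level = []
        for j in range(0, len(nodes) - 1, 2):
            (mag_a, s_a), (mag_b, s_b) = nodes[j], nodes[j + 1]
            widths.append(w_x + ceil_log2(mag_a + mag_b) - max(s_a, s_b))
            next_level.append((mag_a + mag_b, min(s_a, s_b)))
        if len(nodes) % 2:  # an unpaired operand moves up unchanged
            next_level.append(nodes[-1])
        nodes = next_level
    return widths


print(lsa_tree_widths(8, [6, 13, 22, -28]))  # [12, 12, 14]
```

The result matches the 12-bit, 12-bit and 14-bit LSAs stated above; the structural adder widths additionally depend on the remaining coefficients of the filter, which are not listed here.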

3.2. Relationship between L and Complexity of the PAB

In the structure for $L = 2$, every two coefficients are grouped and locally summed such that half of the structural adders in the TDF structure are reduced to LSAs with smaller word-length. Generally, the proportion of structural adders that are reduced to LSAs can be expressed as
$$P = \frac{L - 1}{L} = 1 - \frac{1}{L}. \tag{10}$$
It is obvious that P increases with the increase in group size L, i.e., more structural adders can be reduced to LSAs. Theoretically, the hardware complexity of the PAB reduces monotonically with the increase in L. However, it should be noted that the word-lengths of LSAs need to extend accordingly (with the increase in depth in the binary adder tree) to keep the range expansion of the local sums. Therefore, the complexity reduction of PAB is limited when L is larger than certain values.
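A few values of Equation (10) make the diminishing return visible. The helper name below is illustrative, and exact fractions are used only to keep the printed values readable.

```python
from fractions import Fraction


def lsa_proportion(L: int) -> Fraction:
    """Equation (10): share of structural adders that become LSAs."""
    if L < 1:
        raise ValueError("group size must be at least 1")
    return Fraction(L - 1, L)


for L in (1, 2, 4, 8, 16):
    print(L, lsa_proportion(L))
```

Going from $L = 1$ to $L = 2$ converts half of the structural adders, whereas going from $L = 8$ to $L = 16$ adds only one sixteenth.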
The reduction in the number of LSAs by sharing the common subexpressions across the rows in the constant matrix depends highly on the CMVM algorithms, and there is no clear relationship with the group size L. For a given filter coefficient set, the number of rows decreases with the increase in group size L, while the number of LSAs in each group increases. The increase in the number of rows and the number of LSAs in each group is beneficial for the sharing of LSAs. Generally, a small L value is more suitable for row-oriented CMVM algorithms that work better for matrices with more rows, and a large L value is more suitable for column-oriented CMVM algorithms that work better for matrices with more columns.

3.3. Relationship between L and Complexity of the Constant Multipliers

It is noted that the grouping of coefficients into MCM sub-blocks has a negative effect on the overall complexity of the constant multipliers (excluding the LSAs in CMVM). Unlike the LSAs (which can be shared across the rows in CMVM), the partial products of constant multiplications can only be shared within the sub-blocks where the subset of coefficients is multiplied with the same input. Partitioning the MCM block into sub-blocks reduces the number of coefficients in each sub-block, diminishing the shareability of partial products, leading to an increase in the complexity of the design. For  L = 1 , i.e., the conventional TDF structure, all the coefficients reside in a single MCM block, which maximizes the sharing of partial products. This is one of the main reasons that the TDF structure is more popular for multiplierless FIR filter implementation. With the increase in L, the overall hardware complexity of the constant multipliers increases accordingly due to the reduction in shareability of partial products.

3.4. Joint Optimization

The direct form structure and the TDF structure are the two special cases $L = N$ and $L = 1$, respectively. It has been shown that the group size $L$ affects both the complexity of the constant multipliers and that of the PAB. However, the relationship between the overall complexity and $L$ is filter-dependent, i.e., both the filter order and the filter coefficient values affect the optimum group size $L$ for a specific filter. Therefore, the total combinational hardware complexity is a function of $L$, and there is no fixed optimum structure for all filters. For a given filter, the overall combinational complexity can be expressed as
$$C_T(L) = C_{\mathrm{MBA}}(L) + C_{\mathrm{LSA}}(L) + C_{\mathrm{SA}}(L), \tag{11}$$
where $C_{\mathrm{MBA}}(L)$ is the complexity of the shift-add network of the constant multiplications, which can be estimated using the model in [33], $C_{\mathrm{LSA}}(L)$ is the complexity of all the LSAs, which can be estimated using Equation (9), and $C_{\mathrm{SA}}(L)$ is the complexity of all the SAs, which can be estimated using Equation (3). Based on Equation (11), the exhaustive search in Algorithm 1 is proposed to determine the optimum structure for the implementation of a given filter.
Algorithm 1 The exhaustive search algorithm
Input: Filter coefficients
Output: The optimum structure
1: for L = 1 to N do
2:     Compute the constant matrix
3:     *Design the CMVM block using constant multiplication algorithms
4:     Compute $C_T(L)$ by evaluating (11)
5: end for
6: $L_{\mathrm{Opt}} = \arg\min_{L} C_T(L)$
7: Partition the coefficients with a group size of $L_{\mathrm{Opt}}$
* The design of the CMVM block is independent of the algorithm. Different CMVM algorithms can be used here.
In Algorithm 1, if $L_{\mathrm{Opt}} = 2$, the structure discussed in Section 2.2.1 is used to take advantage of the symmetry property of filter coefficients; otherwise, the generalized structure in Section 2.2.2 should be used. Note that there is no constraint on the CMVM technique and any constant multiplication optimization algorithm can be used. Generally, high-order filters tend to have the least hardware complexity with a relatively larger $L$ since the PABs contribute the major part of hardware complexity, while for low-order filters, even the TDF structure ($L = 1$) can be the optimum structure since the overhead introduced by partitioning the coefficients may be larger than the benefits.
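The control flow of Algorithm 1 can be outlined as follows. This is a skeleton only: the three cost callbacks stand in for the CMVM design step and the bit-level models, and the toy cost functions in the usage lines are invented purely to exercise the loop. They are not the models of [33] or Equations (3) and (9).

```python
from typing import Callable, Sequence, Tuple

CostFn = Callable[[Sequence[int], int], int]


def exhaustive_search(
    h: Sequence[int], cost_mba: CostFn, cost_lsa: CostFn, cost_sa: CostFn
) -> Tuple[int, int]:
    """Return (L_opt, C_T(L_opt)) by evaluating Eq. (11) for L = 1 .. N.

    Ties are resolved in favour of the smaller group size.
    """
    n = len(h) - 1  # filter order N
    best_l, best_cost = 0, None
    for group_size in range(1, n + 1):
        total = (
            cost_mba(h, group_size)
            + cost_lsa(h, group_size)
            + cost_sa(h, group_size)
        )
        if best_cost is None or total < best_cost:
            best_l, best_cost = group_size, total
    return best_l, best_cost


# Toy stand-ins: multiplier cost grows with L, PAB cost shrinks with L.
toy_mba = lambda h, L: 10 * L
toy_lsa = lambda h, L: 0
toy_sa = lambda h, L: 120 // L

h = [1, 2, 3, 4, 5, 4, 3, 2, 1]  # hypothetical coefficients, N = 8
print(exhaustive_search(h, toy_mba, toy_lsa, toy_sa))  # (3, 70)
```

Selecting the minimum after all candidate group sizes have been evaluated mirrors steps 5 to 7 of the algorithm; in a real flow each callback would run the chosen constant multiplication algorithm for the given partition.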

3.5. Bit-Level Delay Analysis

The delay of the shift-add-based constant multiplication block consists of two parts: (i) the horizontal propagation delay and (ii) the vertical propagation delay. The horizontal delay, which is caused by the propagation of carries within the adders, can be simply estimated by the word-length of the final result minus the corresponding shift. The vertical delay, which is caused by the signal propagation across adder stages, can be estimated by the adder depth. A more precise estimation of the critical path delay can be found in [30]. Since the LSAs, unlike the structural adders in the TDF structure, are not registered, extra delay may be introduced.
Let us consider the computation of a group with two-stage LSAs, as shown in Figure 6. Unlike the conventional TDF structure, adder stages exist between the constant multiplication block and the structural adders. Since the range of the accumulation results of the structural adders is not affected by the partition of the MCM block, $W_{\mathrm{SA}}$ is the same as that of the TDF structure. Therefore, the horizontal delay of this tap is the same as that of the TDF structure. The vertical delay is incremented by the number of LSA adder stages. If the binary adder tree is used for LSAs, the total increment of delay can be expressed as
$$D_{\Delta} = T_v \cdot \lceil \log_2 L \rceil, \tag{12}$$
where $T_v$ is the vertical propagation delay of a full adder. Therefore, the delay increment is $\lceil \log_2 L \rceil$ vertical full adder delays instead of the delay of $\lceil \log_2 L \rceil$ word-level adders.
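The delay increment can be tabulated directly. The sketch below assumes the ceiling form of the binary-tree depth and expresses the result in units of $T_v$; the function name and the default `t_v = 1.0` are illustrative.

```python
def lsa_delay_increment(group_size: int, t_v: float = 1.0) -> float:
    """Extra vertical delay of a binary LSA tree: t_v * ceil(log2(L))."""
    if group_size < 1:
        raise ValueError("group size must be at least 1")
    return t_v * (group_size - 1).bit_length()


for L in (1, 2, 4, 6, 8):
    print(L, lsa_delay_increment(L))
```

For $L = 1$ (the TDF structure) there is no increment, and $L = 6$ costs the same three stages as $L = 8$.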

3.6. Bit-Level Register Complexity Analysis

The number of word-level registers of the structures in Figure 3 can be estimated as
$$R_w = L \cdot \left( \left\lceil \frac{N + 1}{L} \right\rceil - 1 \right) + L, \tag{13}$$
which is the same as that of TDF structures if $N + 1$ is an integer multiple of $L$. However, the register complexity at bit-level is reduced compared to the TDF structure. Let us consider the transformation from the TDF structure to the structure in Figure 7. As can be seen, the original registers (in dashed circles in Figure 7a) in the TDF structure are moved to the input of the structural adder (in dashed circles in Figure 7b) such that the range of the register input is reduced. Therefore, the overall bit-registers for each group are reduced (the registers for the input can be shared by all the groups of coefficients such that the overhead is negligible). Moreover, as $L$ increases, more registers are transferred to the input line, which has a word-length of only $W_X$. Therefore, the overall register complexity of the filter decreases with the increase in $L$. The structure with $L = N$, i.e., the direct form structure, has the least register complexity.
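The word-level register count can be evaluated with integer arithmetic. The sketch assumes that the quotient $(N + 1)/L$ is rounded up, which is what makes the count depend on whether $N + 1$ is a multiple of $L$; the function name and the order $N = 120$ are illustrative.

```python
def word_level_registers(order: int, group_size: int) -> int:
    """R_w = L * (ceil((N + 1) / L) - 1) + L for filter order N and group size L."""
    if order < 0 or group_size < 1:
        raise ValueError("order must be >= 0 and group size >= 1")
    groups = -(-(order + 1) // group_size)  # integer ceiling of (N + 1) / L
    return group_size * (groups - 1) + group_size


print(word_level_registers(120, 11))  # 121, since N + 1 = 121 is a multiple of 11
print(word_level_registers(120, 4))   # 124, the last group is only partly filled
```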

4. Quantitative Analysis of the Relationship between Optimum L and Filter Coefficient Values as Well as Filter Order

It is known that the optimum structure of a filter depends on the coefficient values and the filter order. In this section, a quantitative analysis is presented to show the relationship between the optimum group size L and coefficient values as well as filter order in the proposed joint optimization scheme. Note that the algorithms are coded in MATLAB in this work. In the design examples, the CMVM blocks are designed in two steps: (i) apply RAG-n to each MCM sub-block and use the algorithm [33] for further bit-level optimization and (ii) identify and share the length-2 common subexpressions (subexpressions with two variables) across the MCM sub-blocks. An input word-length of W X = 8 is used in all the implementations. To the best of the authors’ knowledge, this is the first time that the PAB and the multiplication block are jointly optimized at bit-level for low-complexity FIR filter implementation. The conventional TDF structure is therefore used as the reference structure for comparison. The direct form structure is not considered due to its order-dependent critical path, which makes it unsuitable for high-speed applications.
In the first experiment, we analyze the implementation of a 151-tap filter (filter A) and a 418-tap filter (filter B) [35]. The full adder complexity of the TDF structure and of the jointly optimized structure is presented in Table 1, where "CWL" in the third column is the word-length of the largest coefficient and "# Unique" in the fourth column is the number of unique fundamentals, which are reduced from the coefficients. The "# FA PAB" and "# FA MB" in the fifth and sixth columns represent the number of full adders needed for the PAB and the constant multiplication block, respectively. $L_{\mathrm{Opt}}$ is the optimum group size found by the joint optimization. As can be seen, the full adder cost of the PAB can be significantly reduced at the expense of increased constant multiplication complexity. Since the PAB contributes the major part of the hardware complexity, the overall full adder cost is reduced significantly. The reductions in hardware complexity for filter A and filter B are 9.9% and 21.0%, respectively.
Figure 8a,b show the relationship between the full adder cost and the group size L for filter A and filter B, respectively. Generally, the full adder cost of the PAB decreases with the increase in L. However, as can be seen, for filter A, the reduction in full adder cost of the PAB is almost flat for L > 2 after the significant drop from L = 1 to L = 2 , while for filter B, the reduction is notable until L > 6 . This is because the range expansion of the local accumulation results in each group diminishes word-length saving for LSAs. The order of filter B is considerably higher than that of filter A, leading to more reducible structural adders. Moreover, the larger coefficient word-length of filter A may result in a reduction in the difference between the word-length of LSAs and their corresponding structural adders in the TDF structure.
Generally, the full adder complexity of all the constant multipliers grows with the increase in $L$. However, as can be seen, both the complexity of the constant multipliers and its rate of increase are larger for filter A than for filter B. This is because the coefficient word-length of filter A is larger. Although the coefficient value is not the only factor in the complexity of constant multipliers (larger coefficients do not necessarily result in more complex constant multipliers), statistically it tends to increase that complexity. As shown in Table 1, filter A has 70 unique fundamentals that need to be implemented, while filter B has only 29, even though the order of filter A is considerably lower than that of filter B.
Therefore, it can be concluded that the total full adder cost of the filter will increase after the optimum value of L since the reduction in the PAB cannot compensate for the increase in the multiplication block, and the optimum value of L is affected by the coefficient word-length as well as the filter order.
In order to study the impact of the filter coefficient word-length, we present the complexity of four 418-tap low-pass filters with coefficient word-lengths of 12-bit, 16-bit, 20-bit and 24-bit (used for different stopband attenuations) [35]. Figure 9 shows the total full adder cost of these four filters for different $L$ values. The optimum $L$ values and the full adder reduction compared to the TDF structure are listed in Table 2. The optimum values of $L$ for the 12-bit, 16-bit, 20-bit and 24-bit filters are 6, 3, 3 and 1, respectively. It is interesting to note that a 4-bit increase in word-length from 12 to 16 bits results in a large reduction in the full adder (FA) saving. The main reason is that, for filters with a smaller word-length, the overall hardware complexity is also smaller, such that the percentage of FA reduction is larger. Moreover, in this case, the optimum $L$ also changes, which may affect the optimization of the PAB as well. As can be seen, with the increase in the coefficient word-length from 12-bit to 24-bit, the optimum value of $L$ is reduced from 6 to 1 and the full adder cost saving is reduced from 21.0% to 0, i.e., the TDF structure is the optimum structure for the 24-bit filter. Therefore, it can be concluded that for filters with a very long coefficient word-length, the TDF structure is the optimum structure in terms of complexity. This is because, as the coefficient word-length increases, the complexity of the constant multipliers grows to the point where the reduction in the PAB can no longer compensate for it.
In the third experiment, we present the optimum L of 25 different filters [35] with orders from dozens to hundreds in Table 3. As can be seen, generally high-order filters tend to have a larger optimum value of L. However, as discussed, the optimum value of L depends on both the coefficient values and the filter order. In Figure 10, we plot the optimum values of L and the percentage of full adder reduction (compared to TDF structure) with respect to “# Filter Taps/Coef WL”. As can be seen, both the optimum values of L and the percentage of full adder reduction are almost linear with “# Filter Taps/Coef WL”, i.e., high-order filters with small coefficient word-lengths tend to have large complexity reductions by the proposed joint optimization.
Another interesting finding is that for filters with orders less than 100, the optimum structure is either the TDF structure ( L = 1 ) or the structure with L = 2 . This is because for low-order filters, the proportion of the complexity of constant multipliers is much higher than that of high-order filters. Therefore, the complexity overhead introduced by the multiplication block can be significant compared to the overall complexity of the filter. In the structure for L = 2 , the symmetric property of coefficients is utilized to minimize the complexity overhead of constant multipliers.

5. Implementation Results

In this section, the implementation results are presented to demonstrate the effectiveness of the proposed method. Filter B in Table 1, i.e., the 418-tap filter with 12-bit word-length, is used as the benchmark filter to compare different methods. All the designs are implemented in Verilog HDL and synthesized using the Synopsys Design Compiler based on 40 nm CMOS technology. The input to the filter is 8-bit for all the designs.
Table 4 presents the area comparison of different implementations of the filter, where various algorithms, i.e., C1 [36], Hcub [28], MBPG [37], MINAS [38] and NRSCSE [39], are used to design the MCM block of the filter. Note that the first row of Table 4 shows the area of the MCM block only (designed using C1). As we can see, for this filter, the MCM block designed with a sophisticated algorithm consumes only a very small portion of the total chip area. Moreover, since the PAB consumes the major part of the chip area, there is little difference between filters designed using different MCM algorithms. The proposed method, in contrast, takes the PAB into consideration during the optimization, so a significant amount of area can be saved. In addition, as we can observe from the table, the area saving shrinks at the highest clock frequency (400 MHz). This is because the proposed structure introduces extra delay in exchange for lower computational complexity, so the synthesis tool has to spend more area to meet a higher clock rate.
Table 5 presents the power consumption of the different filter implementations. As we can see, similar to the area comparison, the power consumption of the filter designed using the proposed method is significantly lower than that of the other methods. The main reason, as analyzed above, is the reduction in computational complexity.
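The relative savings implied by Tables 4 and 5 can be recomputed from the reported totals. The following sketch compares the proposed design against the best (smallest) baseline at each clock frequency; comparing against the best baseline is a choice made here for illustration, and the names `AREA`, `POWER` and `saving_vs_best_baseline` are hypothetical.

```python
FREQS_MHZ = (200, 300, 400)

# Total area in um^2 (Table 4) at 200, 300 and 400 MHz.
AREA = {
    "C1": (66399, 75938, 77808),
    "Hcub": (66280, 75901, 77818),
    "MBPG": (66558, 75946, 78431),
    "MINAS": (66361, 76151, 77750),
    "NRSCSE": (66497, 74278, 77924),
    "Proposed": (59114, 65417, 71277),
}

# Power in uW (Table 5) at 200, 300 and 400 MHz.
POWER = {
    "C1": (122.9, 202.1, 262.9),
    "Hcub": (122.5, 201.1, 258.5),
    "MBPG": (121.6, 201.3, 265.5),
    "MINAS": (121.5, 204.2, 254.4),
    "NRSCSE": (122.0, 197.0, 256.8),
    "Proposed": (93.8, 154.6, 197.6),
}


def saving_vs_best_baseline(table):
    """Percent saving of 'Proposed' over the smallest baseline, per frequency."""
    savings = []
    for i in range(len(FREQS_MHZ)):
        best = min(v[i] for name, v in table.items() if name != "Proposed")
        savings.append(100.0 * (best - table["Proposed"][i]) / best)
    return savings


area_savings = saving_vs_best_baseline(AREA)
power_savings = saving_vs_best_baseline(POWER)

for f, a, p in zip(FREQS_MHZ, area_savings, power_savings):
    print(f"{f} MHz: area saving {a:.1f}%, power saving {p:.1f}%")
```

With the reported figures, the area saving falls roughly between 8% and 12% and the power saving between 21% and 23%, with the smallest area saving at 400 MHz.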

6. Conclusions

Most of the literature on multiplierless design and implementation of FIR filters concentrates on the efficient design of MCM blocks. However, the PAB, which is often ignored, can be the major contributor to hardware complexity, especially for high-order filters. A design scheme is therefore proposed to minimize the hardware complexity of multiplierless FIR filters by optimizing the multiplication block and PAB jointly at bit-level. In the proposed joint optimization, the single MCM block is rearranged into several MCM sub-blocks. The products of input and coefficient values are summed locally before being accumulated to reduce the word-lengths of structural adders. The symmetric property of linear phase FIR filters can be utilized to further reduce the complexity of constant multipliers when the MCM block is partitioned into two MCM sub-blocks. Moreover, an exhaustive search algorithm is proposed to determine the optimum partition of the MCM block. Quantitative analyses are presented to study the relationship between the optimum group size L and the coefficient values as well as the filter orders in the proposed joint optimization scheme. It is shown that there is no fixed optimum structure for filters with different coefficient word-lengths and filter orders, and each filter needs to be optimized specifically to achieve the lowest hardware complexity.
The proposed method has certain limitations. The first is the modeling of complexity and delay. In our work, complexity and delay are analyzed in terms of full adders (FAs), but modern synthesis tools optimize area and timing at a finer-grained level. Therefore, FA-based modeling may not be accurate enough to find the optimum structure. The second limitation is that, for low-order filters, the MCM part may occupy a significant portion of the overall complexity, such that the overhead introduced in the MCM part may undermine the overall benefit.

Author Contributions

Theoretical analysis, writing and editing, C.R.; conceptualization and methodology, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Shanghai Rising-Star Program under Grant 21QC1401400.

Data Availability Statement

The data are available upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Oppenheim, A.; Schafer, R. Discrete-Time Signal Processing; Prentice Hall: Upper Saddle River, NJ, USA, 2003.
  2. Liu, Q.; Lim, Y.C.; Lin, Z.; Lai, X. Design of IIR frequency-response masking filters with near linear phase using constrained optimization. In Proceedings of the 2017 IEEE International Symposium on Circuits and Systems (ISCAS), Baltimore, MD, USA, 28–31 May 2017; pp. 1–4.
  3. Wang, Y.; Ding, F.; Xu, L. Some new results of designing an IIR filter with colored noise for signal processing. Digit. Signal Process. 2018, 72, 44–58.
  4. Agrawal, N.; Kumar, A.; Bajaj, V.; Singh, G. Design of digital IIR filter: A research survey. Appl. Acoust. 2021, 172, 107669.
  5. Raju, R.; Kwan, H.K.; Jiang, A. Sparse FIR Filter Design Using Artificial Bee Colony Algorithm. In Proceedings of the 2018 IEEE 61st International Midwest Symposium on Circuits and Systems (MWSCAS), Windsor, ON, Canada, 5–8 August 2018; pp. 956–959.
  6. Wang, H.; Zhao, Z.; Zhao, L. Matrix Decomposition Based Low-Complexity FIR Filter: Further Results. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Sevilla, Spain, 10–21 October 2020; pp. 1–5.
  7. Chen, W.; Huang, M.; Ye, W.; Lou, X. Cascaded Form Sparse FIR Filter Design. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 67, 1692–1703.
  8. Xi, X.; Lou, Y. Sparse FIR Filter Design With k-Max Sparsity and Peak Error Constraints. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 1497–1501.
  9. Cardarilli, G.C.; Nunzio, L.D.; Fazzolari, R.; Nannarelli, A.; Petricca, M.; Re, M. Design Space Exploration Based Methodology for Residue Number System Digital Filters Implementation. IEEE Trans. Emerg. Top. Comput. 2022, 10, 186–198.
  10. Khoo, K.Y.; Yu, Z.; Willson, A.N. Design of optimal hybrid form FIR filter. In Proceedings of the ISCAS 2001—2001 IEEE International Symposium on Circuits and Systems (Cat. No.01CH37196), Sydney, NSW, Australia, 6–9 May 2001; Volume 2, pp. 621–624.
  11. Gustafsson, O.; Coleman, J.; Dempster, A.; Macleod, M. Low-complexity hybrid form FIR filters using matrix multiple constant multiplication. In Proceedings of the Conference Record of the Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 7–10 November 2004; Volume 1, pp. 77–80.
  12. Levent, A.; Paulo, F.; José, M. A Tutorial on Multiplierless Design of FIR Filters: Algorithms and Architectures. Circuits Syst. Signal Process. 2014, 33, 1689–1719.
  13. Coleman, J.O. Equiripple-Stopband Multiplierless FIR Filters by Chebyshev Sharpening of Two-Sample Averaging. In Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018; pp. 1–4.
  14. Sajwan, N.; Kumar, A.; Sharma, I.; Balyan, L.K. Performance of Multiplierless FIR Filter based on CSD and Binary: A Comparative Study. In Proceedings of the 2019 International Conference on Signal Processing and Communication (ICSC), Noida, India, 7–9 March 2019; pp. 217–222.
  15. Krishna, V.; Kumar, A.; Singh, G. Design of Multiplierless IFIR based Cosine Modulated Filter Bank using QPSO. In Proceedings of the 2020 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 28–30 July 2020; pp. 0715–0720.
  16. Ramamoorthy, P.; Nallasamy, V. Review of Multiplierless Fir Filter Design Based On Graph Based Optimization. In Proceedings of the 2018 Conference on Emerging Devices and Smart Systems (ICEDSS), Tiruchengode, India, 2–3 March 2018; pp. 276–279.
  17. Ye, W.B.; Yu, Y.J. Single-Stage and Cascade Design of High Order Multiplierless Linear Phase FIR Filters Using Genetic Algorithm. IEEE Trans. Circuits Syst. I 2013, 60, 2987–2997.
  18. Ye, W.B.; Yu, Y.J. Bit-Level Multiplierless FIR Filter Optimization Incorporating Sparse Filter Technique. IEEE Trans. Circuits Syst. I 2014, 61, 3206–3215.
  19. Yao, C.Y.; Hsia, W.C.; Ho, Y.H. Designing Hardware-Efficient Fixed-Point FIR Filters in an Expanding Subexpression Space. IEEE Trans. Circuits Syst. I 2014, 61, 202–212.
  20. Yao, C.Y.; Chen, H.H.; Lin, T.F.; Chien, C.J.; Hsu, C.T. A novel common-subexpression-elimination method for synthesizing fixed-point FIR filter. IEEE Trans. Circuits Syst. I 2004, 51, 2215–2221.
  21. Yu, Y.J.; Lim, Y.C. Design of linear phase FIR filters in subexpression space using mixed integer linear programming. IEEE Trans. Circuits Syst. I 2007, 54, 2330–2338.
  22. Yu, Y.J.; Shi, D.; Lim, Y.C. Design of Extrapolated Impulse Response FIR Filters with Residual Compensation in Subexpression Space. IEEE Trans. Circuits Syst. I 2009, 56, 2621–2633.
  23. Shi, D.; Yu, Y.J. Design of Linear Phase FIR Filters With High Probability of Achieving Minimum Number of Adders. IEEE Trans. Circuits Syst. I 2011, 58, 126–136.
  24. Hartley, R.I. Subexpression sharing in filters using canonic signed digit multipliers. IEEE Trans. Circuits Syst. II 1996, 43, 677–688.
  25. Pasko, R.; Schaumont, P.; Derudder, V.; Vernalde, S.; Durackova, D. A new algorithm for elimination of common subexpressions. IEEE Trans. Comput.-Aided Des. Integr. Circuit Syst. 1999, 18, 58–68.
  26. Bull, D.R.; Horrocks, D.H. Primitive operator digital filters. IEE Proc. G Circuits Devices Syst. 1991, 138, 401–412.
  27. Dempster, A.G.; Macleod, M.D. Use of minimum-adder multiplier blocks in FIR digital filters. IEEE Trans. Circuits Syst. II Analog. Digit. Signal Process. 1995, 42, 569–577.
  28. Voronenko, Y.; Puschel, M. Multiplierless multiple constant multiplication. ACM Trans. Algorithms 2007, 3, 11.
  29. Lou, X.; Yu, Y.J.; Meher, P.K. Fine-Grained Critical Path Analysis and Optimization for Area-Time Efficient Realization of Multiple Constant Multiplications. IEEE Trans. Circuits Syst. I 2015, 62, 863–872.
  30. Lou, X.; Yu, Y.J.; Meher, P.K. Lower Bound Analysis and Perturbation of Critical Path for Area-Time Efficient Multiple Constant Multiplications. IEEE Trans. Comput.-Aided Des. Integr. Circuit Syst. 2017, 34, 313–324.
  31. Lou, X.; Yu, Y.J.; Meher, P.K. Analysis and Optimization of Product-Accumulation Section for Efficient Implementation of FIR Filters. IEEE Trans. Circuits Syst. I 2016, 63, 1701–1713.
  32. Faust, M.; Chang, C.H. Low error bit width reduction for structural adders of FIR filters. In Proceedings of the 2011 20th European Conference on Circuit Theory and Design (ECCTD), Linköping, Sweden, 29–31 August 2011; pp. 713–716.
  33. Johansson, K.; Gustafsson, O.; Wanhammar, L. Bit-level optimization of shift-and-add based FIR filters. In Proceedings of the 2007 14th IEEE International Conference on Electronics, Circuits and Systems, Marrakech, Morocco, 11–14 December 2007; Volume 3, pp. 713–716.
  34. Lou, X.; Meher, P.K.; Yu, Y.; Ye, W. Novel Structure for Area-Efficient Implementation of FIR Filters. IEEE Trans. Circuits Syst. II Express Briefs 2017, 64, 1212–1216.
  35. Nanyang Technological University. FIRsuite Suite of Constant Coefficient FIR Filters; Nanyang Technological University: Singapore, 2022.
  36. Dempster, A.G.; Dimirsoy, S.S.; Kale, I. Designing multiplier blocks with low logic depth. In Proceedings of the 2002 IEEE International Symposium on Circuits and Systems, (Cat. No.02CH37353), Phoenix-Scottsdale, AZ, USA, 26–29 May 2002; Volume 5, pp. 773–776.
  37. Chang, C.H.; Chen, J.J.; Vinod, A.P. Information theoretic approach to complexity reduction of FIR filter design. IEEE Trans. Circuits Syst. I 2008, 55, 2310–2321.
  38. Aksoy, L.; Costa, E.; Flores, P.; Monteiro, J. Finding the optimal tradeoff between area and delay in multiple constant multiplications. Microprocess. Microsyst. 2011, 35, 729–741.
  39. Peiro, M.M.; Boemo, E.I.; Wanhammar, L. Design of high-speed multiplierless filters using a nonrecursive signed common subexpression algorithm. IEEE Trans. Circuits Syst. II 2002, 49, 196–203.
Figure 1. FIR filter structures: (a) direct form structure and (b) transposed direct form structure.
Figure 2. The structure for L = 2.
Figure 3. Filter structure for L > 2.
Figure 4. Bit-level complexity of LSAs.
Figure 5. LSAs and corresponding structural adders.
Figure 6. Computation of a group with two-stage LSAs.
Figure 7. Transformation from the TDF structure to the hybrid structure.
Figure 8. Full adder cost of (a) filter A and (b) filter B for different L.
Figure 9. Total FA cost of four 418-tap filters with different coefficient word-lengths for different L values.
Figure 10. Relationship between (a) optimum L and “# Filter Taps/Coef WL” and (b) % FA reduction and “# Filter Taps/Coef WL”.
Table 1. Hardware complexity of two benchmark filters.
| Filter | # Taps | CWL | # Unique | GS | # FA (PAB) | # FA (MB) | # FA (Total) | FA Saving | Delay Increment |
|---|---|---|---|---|---|---|---|---|---|
| A | 151 | 15 | 70 | L_TDF = 1 | 3479 | 690 | 4169 | - | - |
| A | 151 | 15 | 70 | L_Opt = 2 | 2975 | 781 | 3756 | 9.9% | T_v |
| B | 418 | 12 | 29 | L_TDF = 1 | 7479 | 246 | 7725 | - | - |
| B | 418 | 12 | 29 | L_Opt = 6 | 5328 | 774 | 6102 | 21.0% | 3·T_v |
CWL: coefficient word-length; PAB: product accumulation block; MB: multiplication block; T_v: vertical full adder delay; GS: group size.
Table 2. Results for 418-tap filters with different word-lengths.
| Coef WL | 12-Bit | 16-Bit | 20-Bit | 24-Bit |
|---|---|---|---|---|
| # Unique | 29 | 102 | 187 | 208 |
| L_Opt | 6 | 3 | 3 | 1 |
| % FA Reduction | 21.0% | 9.1% | 2.0% | 0% |
| Delay Increment | 3·T_v | 2·T_v | 2·T_v | 0 |
WL: word-length.
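Table 3. Results for filters with different orders.
| # Filter Taps | Coef WL | # Unique | L_Opt | % FA Reduction | Delay Increment |
|---|---|---|---|---|---|
| 30 | 10 | 28 | 2 | 1.8 | T_v |
| 32 | 7 | 10 | 1 | 0.0 | 0 |
| 34 | 11 | 10 | 1 | 0.0 | 0 |
| 36 | 8 | 5 | 2 | 8.0 | T_v |
| 37 | 8 | 5 | 2 | 12.4 | T_v |
| 49 | 9 | 11 | 2 | 10.9 | T_v |
| 59 | 10 | 14 | 2 | 7.7 | T_v |
| 60 | 13 | 26 | 1 | 0.0 | 0 |
| 61 | 15 | 30 | 2 | 6.2 | T_v |
| 63 | 10 | 17 | 2 | 7.3 | T_v |
| 67 | 15 | 28 | 2 | 9.6 | T_v |
| 80 | 15 | 36 | 1 | 0.0 | 0 |
| 105 | 9 | 10 | 2 | 17.1 | T_v |
| 108 | 9 | 23 | 3 | 5.5 | 2·T_v |
| 119 | 15 | 53 | 2 | 11.4 | T_v |
| 121 | 14 | 42 | 2 | 12.5 | T_v |
| 151 | 15 | 70 | 2 | 11.0 | T_v |
| 222 | 12 | 44 | 3 | 11.6 | 2·T_v |
| 240 | 15 | 82 | 3 | 5.3 | 2·T_v |
| 279 | 12 | 30 | 4 | 20.9 | 2·T_v |
| 418 | 12 | 29 | 6 | 26.6 | 3·T_v |
| 441 | 14 | 86 | 2 | 19.0 | T_v |
| 516 | 12 | 27 | 7 | 30.7 | 3·T_v |
| 631 | 12 | 28 | 8 | 34.8 | 3·T_v |
| 695 | 12 | 32 | 8 | 35.9 | 3·T_v |
WL: word-length.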
Table 4. Area comparison of the 418-tap filter with a 12-bit word-length (40 nm CMOS technology). Area in μm² under different clock frequencies.
| Algorithm | 200 MHz Comb | 200 MHz Ncomb | 200 MHz Total | 300 MHz Comb | 300 MHz Ncomb | 300 MHz Total | 400 MHz Comb | 400 MHz Ncomb | 400 MHz Total |
|---|---|---|---|---|---|---|---|---|---|
| MCM_only | 1356 | - | 1356 | 2097 | - | 2097 | 3427 | - | 3427 |
| C1 [36] | 32,086 | 34,313 | 66,399 | 40,577 | 35,360 | 75,938 | 42,699 | 35,108 | 77,808 |
| Hcub [28] | 31,966 | 34,314 | 66,280 | 40,553 | 35,348 | 75,901 | 42,885 | 34,932 | 77,818 |
| MBPG [37] | 32,245 | 34,313 | 66,558 | 40,594 | 35,351 | 75,946 | 43,495 | 34,936 | 78,431 |
| MINAS [38] | 32,047 | 34,313 | 66,361 | 40,804 | 35,347 | 76,151 | 43,047 | 34,702 | 77,750 |
| NRSCSE [39] | 32,184 | 34,313 | 66,497 | 38,922 | 35,356 | 74,278 | 43,109 | 34,815 | 77,924 |
| Proposed (L = 2) | 25,906 | 33,208 | 59,114 | 31,462 | 33,955 | 65,417 | 37,340 | 33,936 | 71,277 |
Comb: combinational; Ncomb: non-combinational.
Table 5. Power comparison of the 418-tap filter with a 12-bit word-length (40 nm CMOS technology). Power in μW under different clock frequencies.
| Algorithm | 200 MHz | 300 MHz | 400 MHz |
|---|---|---|---|
| C1 [36] | 122.9 | 202.1 | 262.9 |
| Hcub [28] | 122.5 | 201.1 | 258.5 |
| MBPG [37] | 121.6 | 201.3 | 265.5 |
| MINAS [38] | 121.5 | 204.2 | 254.4 |
| NRSCSE [39] | 122.0 | 197.0 | 256.8 |
| Proposed (L = 2) | 93.8 | 154.6 | 197.6 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Rao, C.; Lou, X. Multiplication and Accumulation Co-Optimization for Low Complexity FIR Filter Implementation. Electronics 2022, 11, 1721. https://doi.org/10.3390/electronics11111721
