1. Introduction
Digital filtering is one of the most important functions in digital signal processing (DSP) systems. Finite impulse response (FIR) filters are widely used in many DSP applications due to their guaranteed stability and linear phase properties [1]. However, compared with infinite impulse response (IIR) filters, the computational complexity of FIR filters is much higher [2,3,4]. Therefore, it is crucial to reduce the complexity of FIR filters, and extensive research has been conducted over the past several decades. Since multipliers are expensive in terms of circuit area and computational delay, many works focus on the design of FIR filter coefficients. One approach is to reduce the number of non-zero filter coefficients, which is commonly referred to as sparse FIR filter design [5,6,7,8]. Since multipliers can be omitted if the corresponding coefficients are zero, the overall complexity can be reduced by increasing the sparsity of the filter. In addition, FIR filters can also be implemented using a number system other than binary for complexity reduction. A typical example is the residue numeral system (RNS)-based FIR filter implementation [9]. Moreover, some works focus on a hybrid filter structure for complexity reduction [10,11,12]. For the VLSI implementation of FIR filters with fixed coefficients, multiplierless design is a more general technique to reduce the hardware and time complexities [13,14,15,16,17,18,19,20,21,22,23]. In the multiplierless implementation of FIR filters, the constant multiplications are generally implemented by optimized shift-add networks. The adders in the shift-add network are usually referred to as multiplication block adders. Besides multiplication block adders, another type of adder, known as structural adders [22], exists to implement the product accumulation block (PAB) of the FIR filter.
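As a minimal illustration of the shift-add idea described above, the following Python sketch replaces two constant multiplications with shifts and additions. The constants 7 and 45 and the function names are illustrative assumptions and are not taken from the filters studied in this paper.

```python
def times_7(x: int) -> int:
    """7*x with one shift and one subtractor: 7x = 8x - x."""
    return (x << 3) - x


def times_45(x: int) -> int:
    """45*x with two adders: 5x = 4x + x, then 45x = 8*(5x) + 5x."""
    x5 = (x << 2) + x      # first adder, reusable partial product 5x
    return (x5 << 3) + x5  # second adder


# Check the shift-add networks against ordinary multiplication.
for sample in (-3, 0, 1, 12):
    assert times_7(sample) == 7 * sample
    assert times_45(sample) == 45 * sample
```

The partial product 5x in `times_45` is the kind of intermediate term that a multiplication block can reuse when several constants share it.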
The problem of minimizing the multiplication block has been intensively studied. Various optimization techniques have been proposed [24,25,26,27,28,29,30,31] to extract the common subexpressions (partial products) in the coefficients and reuse them within the coefficient multiplication as well as across all the multiplications for a direct form filter structure (shown in Figure 1a) or a transposed direct form (TDF) filter structure (shown in Figure 1b). Most of the multiplication block design algorithms focus on the TDF filter structure, where the input sample is concurrently multiplied by all the filter coefficients, forming a multiple constant multiplication (MCM) block. With the help of MCM design algorithms, the constant multiplications in TDF FIR filters can be implemented very efficiently.
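The difference between the two structures can be sketched behaviorally in Python. The sketch below is only an illustrative model under the assumption of integer arithmetic and zero initial state; the function names and the example coefficients are hypothetical. In the TDF model, the current input sample is multiplied by all coefficients at once (the MCM block), and the products are accumulated through a chain of registered structural adders.

```python
def fir_direct(h, x):
    """Direct form: y[n] = sum_k h[k] * x[n - k]."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]
        y.append(acc)
    return y


def fir_tdf(h, x):
    """Transposed direct form with registered structural adders."""
    n_taps = len(h)
    regs = [0] * (n_taps - 1)  # registers on the accumulation line
    y = []
    for xn in x:
        products = [hk * xn for hk in h]  # MCM block: one input, all constants
        y.append(products[0] + (regs[0] if regs else 0))
        # Structural adders: regs[i] <- products[i + 1] + old regs[i + 1].
        for i in range(n_taps - 1):
            upstream = regs[i + 1] if i + 1 < n_taps - 1 else 0
            regs[i] = products[i + 1] + upstream
    return y


h_example = [3, -5, 7]
x_example = [1, 2, 3, 4, 5]
assert fir_tdf(h_example, x_example) == fir_direct(h_example, x_example)
```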
Although the optimization of structural adders in the PAB is generally ignored by most researchers, this part can be the major contributor to the overall hardware complexity. The number of structural adders is the same as the filter order and cannot be reduced. The naturally pipelined PAB in the TDF structure is advantageous for achieving a low critical path delay, but it results in an increase in hardware complexity. As product words are added one by one along the accumulation line, the word-lengths of the structural adders increase monotonically. The direct form structure, on the other hand, contains a PAB without any delay element, which can be implemented by a binary adder tree of relatively lower complexity. Therefore, a hybrid structure is a proper choice to reduce the PAB complexity, and subsequently the overall complexity of FIR filters, without sacrificing too much delay. In [11], the matrix MCM is proposed to minimize the adder cost of constant multiplications in hybrid-form FIR filters. In [12], different types of hybrid-form structures are discussed and evaluated. However, no optimization schemes have been investigated in the existing literature to reduce the bit-level complexity of the multiplication block as well as the PAB in hybrid-form FIR filters. Moreover, the symmetric property of linear phase FIR filters has not been utilized in such optimizations.
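The monotonic growth of the word-lengths along the TDF accumulation line can be illustrated with a simple worst-case bound. The sketch below assumes that accumulation starts at the last tap and that the word-length of each partial sum is the input word-length plus the ceiling of the base-2 logarithm of the accumulated coefficient magnitudes. This bound, the function name and the example coefficients are assumptions made for illustration and are not the bit-level model used in this paper.

```python
def accumulation_wordlengths(coeffs, input_wl):
    """Approximate worst-case word-length of each partial sum in a TDF PAB."""
    wordlengths = []
    magnitude_sum = 0
    for c in reversed(coeffs):  # assumed: accumulation starts at the last tap
        magnitude_sum += abs(c)
        # (s - 1).bit_length() equals ceil(log2(s)) for any integer s >= 1.
        growth = (max(magnitude_sum, 1) - 1).bit_length()
        wordlengths.append(input_wl + growth)
    return wordlengths


print(accumulation_wordlengths([1, 2, 3, 4], input_wl=8))  # [10, 11, 12, 12]
```

Because the accumulated magnitude never decreases, the estimated word-lengths never decrease along the line, which is why later structural adders are the most expensive ones.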
In this paper, a design scheme where the multiplication block and the PAB are jointly optimized at bit-level is proposed to minimize the hardware complexity of the multiplierless implementation of FIR filters. In the proposed joint optimization, the MCM block is partitioned into several MCM sub-blocks. The product words are summed locally before accumulation to reduce the word-length of structural adders. In particular, when the MCM block is partitioned into two MCM sub-blocks, the symmetric property of linear phase FIR filters can be utilized by carefully grouping the product terms to further reduce the complexity of constant multiplications. An exhaustive search algorithm, which minimizes the total complexity of the constant multipliers and PAB, is proposed to determine the optimum partition of the MCM block. The relationship between the coefficient word-length as well as filter order and the optimum structure is also studied.
The rest of the paper is organized as follows. Section 2 presents the complexity reduction of the PAB. Section 3 discusses the joint optimization of the PAB and the multiplication block. The quantitative analysis of the optimum group size is presented in Section 4, implementation results are reported in Section 5, and conclusions are drawn in Section 6.
3. Joint Optimization of the Product Accumulation Block and Multiplication Block
In this section, a generalized bit-level complexity model of LSAs is proposed. The relationship between the group size L and the area complexity of the PAB and the constant multipliers is analyzed. Based on that, the joint optimization of the PAB and the multiplication block is proposed to reduce the overall combinational hardware complexity, followed by the register complexity analysis.
3.1. Bit-Level Complexity Reduction of LSAs
To estimate the number of full adders of LSAs, the shifts of the products as well as of the sums generated by LSAs need to be defined. For the products, the shifts are simply the number of left-shift bits needed to generate the coefficients from the corresponding fundamentals. For the sums generated by the LSAs, the shifts can be defined recursively as follows.
Definition 1. The shift of a sum generated by an LSA is the smaller of the shifts of its two input addends.
Let us use the LSAs shown in
Figure 4 as an example. According to the definition, the shifts of the sums generated by LSA-1 and LSA-2 are
and
, respectively. Therefore, the shift of the sum generated by LSA-3 is
.
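Definition 1 can be sketched in Python as follows. The coefficients 12, 40, 6 and 20 are hypothetical and are not the values of Figure 4; the function names are likewise illustrative assumptions.

```python
def product_shift(coefficient: int) -> int:
    """Left shifts needed to generate |coefficient| from its odd fundamental."""
    c = abs(coefficient)
    if c == 0:
        raise ValueError("a zero coefficient has no fundamental")
    shift = 0
    while c % 2 == 0:
        c //= 2
        shift += 1
    return shift


def lsa_shift(shift_a: int, shift_b: int) -> int:
    """Definition 1: the shift of a sum is the smaller shift of its two addends."""
    return min(shift_a, shift_b)


# Hypothetical group of four coefficients summed by a two-stage LSA tree.
s1 = lsa_shift(product_shift(12), product_shift(40))  # shifts 2 and 3 -> 2
s2 = lsa_shift(product_shift(6), product_shift(20))   # shifts 1 and 2 -> 1
s3 = lsa_shift(s1, s2)                                # second stage  -> 1
assert (s1, s2, s3) == (2, 1, 1)
```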
Basically, the shift-add operations of LSAs are the same as the shift-add operations in the MCM block except that both operands of an LSA may be left-shifted. Therefore, all the shifts (positive odd fundamentals are shifted to generate the corresponding coefficients) should be considered in the bit-level complexity estimation of LSAs. The number of full adders needed for an LSA that sums two products is given by Equation (
4). The generalized form of the number of full adders needed for LSAs can be expressed as
where
A is the set of coefficients corresponding to the products that the LSA is used to sum, while
and
are the shifts of the two inputs of the LSA.
As shown in Equation (
9), the word-length of an LSA only needs to cover the range expansion in the current group of coefficients, while the word-length of a structural adder has to cover the range expansion of all the previous taps. Therefore, the overall complexity can be reduced even when the number of adders does not change. For example, the first four coefficients of the 121-tap filter in [
23] are 6,
, 22 and −28. If
and
, the word-lengths of the three LSAs and one structural adder shown in
Figure 5a are 12-bit, 12-bit, 14-bit and 25-bit, respectively, while the word-lengths of the four structural adders of the TDF structure in
Figure 5b are 23-bit, 24-bit, 25-bit and 24-bit. The overall reduction for this group can be as much as 33 full adders. However, it should be noted that as
i (the subscript of coefficient
) increases, the differences in word-length between LSAs and structural adders decrease due to reduced range expansion.
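The figure of 33 full adders in the example above follows directly from the quoted word-lengths, assuming that the full adder count of each adder is taken as equal to its word-length. A short check in Python (variable names are illustrative):

```python
# Word-lengths quoted in the text for the first group of four coefficients.
grouped_wordlengths = [12, 12, 14, 25]  # three LSAs + one structural adder
tdf_wordlengths = [23, 24, 25, 24]      # four structural adders in the TDF

# Assumption: one full adder per bit of adder word-length.
saving = sum(tdf_wordlengths) - sum(grouped_wordlengths)
print(saving)  # 96 - 63 = 33
```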
3.2. Relationship between L and Complexity of the PAB
In the structure with L = 2, every two coefficients are grouped and locally summed such that half of the structural adders in the TDF structure are reduced to LSAs with smaller word-length. Generally, the proportion of structural adders that are reduced to LSAs can be expressed as
It is obvious that P increases with the group size L, i.e., more structural adders can be reduced to LSAs. In principle, the hardware complexity of the PAB decreases monotonically as L increases. However, it should be noted that the word-lengths of the LSAs need to be extended accordingly (with the increase in depth of the binary adder tree) to accommodate the range expansion of the local sums. Therefore, the complexity reduction of the PAB becomes limited once L exceeds a certain value.
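The trend of P can be illustrated numerically. The sketch below assumes P = (L − 1)/L, which is consistent with the statement that half of the structural adders are reduced to LSAs when every two coefficients are grouped. This closed form and the function name are assumptions made for illustration, and the sketch ignores the case where the number of taps is not a multiple of L.

```python
from fractions import Fraction


def lsa_proportion(group_size: int) -> Fraction:
    """Assumed proportion of structural adders reduced to LSAs: (L - 1) / L."""
    if group_size < 1:
        raise ValueError("group size must be at least 1")
    return Fraction(group_size - 1, group_size)


for L in (1, 2, 3, 4, 8):
    print(L, lsa_proportion(L))
```

The increments shrink quickly as L grows (1/2, then 1/6, then 1/12), which agrees with the observation that the PAB saving levels off for larger group sizes.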
The reduction in the number of LSAs by sharing the common subexpressions across the rows in the constant matrix depends highly on the CMVM algorithms, and there is no clear relationship with the group size L. For a given filter coefficient set, the number of rows decreases with the increase in group size L, while the number of LSAs in each group increases. The increase in the number of rows and the number of LSAs in each group is beneficial for the sharing of LSAs. Generally, a small L value is more suitable for row-oriented CMVM algorithms that work better for matrices with more rows, and a large L value is more suitable for column-oriented CMVM algorithms that work better for matrices with more columns.
3.3. Relationship between L and Complexity of the Constant Multipliers
It is noted that the grouping of coefficients into MCM sub-blocks has a negative effect on the overall complexity of the constant multipliers (excluding the LSAs in CMVM). Unlike the LSAs (which can be shared across the rows in CMVM), the partial products of constant multiplications can only be shared within a sub-block, where the subset of coefficients is multiplied by the same input. Partitioning the MCM block into sub-blocks reduces the number of coefficients in each sub-block, which diminishes the shareability of partial products and increases the complexity of the design. For L = 1, i.e., the conventional TDF structure, all the coefficients reside in a single MCM block, which maximizes the sharing of partial products. This is one of the main reasons that the TDF structure is more popular for multiplierless FIR filter implementation. As L increases, the overall hardware complexity of the constant multipliers increases accordingly due to the reduction in shareability of partial products.
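The loss of shareability can be illustrated with a crude proxy: the number of distinct odd fundamentals that must be built, counted separately in each sub-block. The sketch below assumes that coefficient h[i] is assigned to sub-block i mod L (the coefficients that multiply the same delayed input), uses a hypothetical coefficient set, and counts fundamentals rather than adders, so it is not the bit-level model used in this paper.

```python
def odd_fundamental(c: int) -> int:
    """Strip the power-of-two factor from |c| (returns 0 for c == 0)."""
    c = abs(c)
    while c != 0 and c % 2 == 0:
        c //= 2
    return c


def fundamentals_to_build(coeffs):
    """Distinct odd fundamentals; 0 and 1 need no adder and are dropped."""
    return {odd_fundamental(c) for c in coeffs} - {0, 1}


def total_fundamentals(coeffs, group_size):
    """Sum of distinct fundamentals over the sub-blocks h[j], h[j + L], ..."""
    sub_blocks = [coeffs[j::group_size] for j in range(group_size)]
    return sum(len(fundamentals_to_build(block)) for block in sub_blocks)


h = [6, -10, 22, -28, 22, -10, 6, 3]  # hypothetical coefficients
for L in (1, 2, 4):
    print(L, total_fundamentals(h, L))  # 4, 5 and 7, respectively
```

A fundamental that appears in two different sub-blocks is counted twice, which mirrors the fact that partial products cannot be shared across sub-blocks driven by different inputs.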
3.4. Joint Optimization
The direct form structure and the TDF structure are the two special cases with L = N and L = 1, respectively. It has been shown that the group size
L affects both the complexity of the constant multipliers and the PAB. However, the relationship between the overall complexity and
L is filter-dependent, i.e., both the filter order and the filter coefficient values will affect the optimum group size
L for a specific filter. Therefore, the total combinational hardware complexity is a function of
L, and there is no fixed optimum structure for all the filters. For a given filter, the overall combinational complexity can be expressed as
where
is the complexity of the shift-add network of the constant multiplications, which can be estimated using the model in [
33],
is the complexity of all the LSAs, which can be estimated using Equation (
9), and
is the complexity of all the SAs, which can be estimated using Equation (
3). Based on Equation (
11), an exhaustive search algorithm in Algorithm 1 is proposed to determine the optimum structure for the implementation of a given filter.
Algorithm 1 The exhaustive search algorithm
Input: Filter coefficients
Output: The optimum structure
1: for L = 1 to N do
2:   Compute the constant matrix
3:   *Design the CMVM block using constant multiplication algorithms
4:   Compute the overall complexity by evaluating Equation (11)
5:
6:   Partition the coefficients with group size of
7: end for
* The design of the CMVM block is independent of the algorithm. Different CMVM algorithms can be used here.
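The search loop of Algorithm 1 can be sketched in Python as below. The sketch assumes that the overall combinational complexity for a given group size is available through a caller-supplied function `total_cost(coeffs, L)` standing in for the sum of the constant multiplier, LSA and structural adder complexities. That function, the toy cost model and all names are hypothetical placeholders, not the models used in this paper.

```python
def exhaustive_search(coeffs, total_cost):
    """Return (best_L, best_cost) over all group sizes L = 1 .. N."""
    best_L, best_cost = None, None
    for L in range(1, len(coeffs) + 1):
        cost = total_cost(coeffs, L)  # assumed to design the CMVM block and cost it
        if best_cost is None or cost < best_cost:
            best_L, best_cost = L, cost
    return best_L, best_cost


def toy_cost(coeffs, L):
    """Toy model: multiplier cost grows with L, PAB cost shrinks with L."""
    multiplier_cost = 12 * L
    pab_cost = 120 // L
    return multiplier_cost + pab_cost


coefficients = [6, -10, 22, -28, 22, -10, 6, 3]  # hypothetical
print(exhaustive_search(coefficients, toy_cost))  # (3, 76)
```

With the toy model, the minimum lies at an intermediate group size because the two cost terms move in opposite directions, which is the trade-off the exhaustive search is meant to resolve for a real filter.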
In Algorithm 1, if L = 2, the structure discussed in Section 2.2.1 is used to take advantage of the symmetry property of the filter coefficients; otherwise, the generalized structure in Section 2.2.2 should be used. Note that there is no constraint on the CMVM technique, and any constant multiplication optimization algorithm can be used. Generally, high-order filters tend to have the least hardware complexity with a relatively large L since the PAB contributes the major part of the hardware complexity, while for low-order filters, even the TDF structure (L = 1) can be the optimum structure since the overhead introduced by partitioning the coefficients may outweigh the benefits.
3.5. Bit-Level Delay Analysis
The delay of the shift-add-based constant multiplication block consists of two parts: (i) the horizontal propagation delay and (ii) the vertical propagation delay. The horizontal delay, which is caused by the propagation of carries within the adders, can be simply estimated by the word-length of the final result minus the corresponding shift. The vertical delay, which is caused by the signal propagation across adder stages, can be estimated by the adder depth. A more precise estimation of the critical path delay can be found in [
30]. Since LSAs, unlike the structural adders in the TDF structure, are not registered, extra delay may be introduced.
Let us consider the computation of a group with two-stage LSAs, as shown in
Figure 6. Unlike the conventional TDF structure, adder stages exist between the constant multiplication block and the structural adders. Since the range of the accumulation results of the structural adders is not affected by the partition of the MCM block,
is the same as that of the TDF structure. Therefore, the horizontal delay of this tap is the same as that of the TDF structure. The vertical delay is incremented by the number of LSA adder stages. If the binary adder tree is used for LSAs, the total increment of delay can be expressed as
where
is the vertical propagation delay of a full adder. Therefore, the delay increment is
vertical full adder delays instead of the delay of
word-level adders.
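The number of additional adder stages can be sketched as follows, assuming that the L products of a group are summed by a balanced binary adder tree, whose depth is the ceiling of the base-2 logarithm of L. The depth formula, the function names and the unit delay value are assumptions made for illustration.

```python
def lsa_tree_depth(group_size: int) -> int:
    """Depth of a balanced binary adder tree summing group_size products."""
    if group_size < 1:
        raise ValueError("group size must be at least 1")
    return (group_size - 1).bit_length()  # equals ceil(log2(group_size))


def vertical_delay_increment(group_size: int, t_fa: float) -> float:
    """Extra vertical delay: one full adder delay per additional LSA stage."""
    return lsa_tree_depth(group_size) * t_fa


for L in (1, 2, 4, 6):
    print(L, lsa_tree_depth(L))  # depths 0, 1, 2 and 3
```

For a group of four coefficients, the tree has two stages, matching the two-stage LSA example discussed above.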
3.6. Bit-Level Register Complexity Analysis
The number of word-level registers of the structures in
Figure 3 can be estimated as
which is the same as that of TDF structures if
is an integer multiple of
L. However, the register complexity at bit-level is reduced compared to the TDF structure. Let us consider the transformation from the TDF structure to the structure in
Figure 7. As can be seen, the original registers (in dashed circles in
Figure 7a) in the TDF structure are moved to the input of the structural adder (in dashed circles in
Figure 7b) such that the range of the register input is reduced. Therefore, the overall bit-registers for each group are reduced (the registers for the input can be shared by all the groups of coefficients such that the overhead is negligible). Moreover, as
L increases, more registers are transferred to the input line, which has a word-length of only
. Therefore, the overall register complexity of the filter decreases with the increase in
L. The structure with L = N, i.e., the direct form structure, has the least register complexity.
4. Quantitative Analysis of the Relationship between Optimum L and Filter Coefficient Values as Well as Filter Order
It is known that the optimum structure of a filter depends on the coefficient values and the filter order. In this section, a quantitative analysis is presented to show the relationship between the optimum group size
L and coefficient values as well as filter order in the proposed joint optimization scheme. Note that the algorithms are coded in MATLAB in this work. In the design examples, the CMVM blocks are designed in two steps: (i) apply RAG-n to each MCM sub-block and use the algorithm in [33] for further bit-level optimization and (ii) identify and share the length-2 common subexpressions (subexpressions with two variables) across the MCM sub-blocks. An input word-length of
is used in all the implementations. To the best of the authors’ knowledge, this is the first time that the PAB and the multiplication block are jointly optimized at bit-level for low-complexity FIR filter implementation. The conventional TDF structure is therefore used as the reference structure for comparison. The direct form structure is not considered due to its order-dependent critical path, which makes it unsuitable for high-speed applications.
In the first experiment, we analyze the implementation of a 151-tap filter and a 418-tap filter [35]. The full adder complexities of the TDF structure and the jointly optimized structure are presented in Table 1, where “CWL” in the third column is the word-length of the largest coefficient and “# Unique” in the fourth column is the number of unique fundamentals, which are reduced from the coefficients. The “# FA PAB” and “# FA MB” in the fifth and sixth columns represent the number of full adders needed for the PAB and the constant multiplication block, respectively. The
is the optimum group size found by the joint optimization. As can be seen, the full adder cost of the PAB can be significantly reduced at the expense of increased constant multiplication complexity. Since the PAB contributes the major part of the hardware complexity, the overall full adder cost is reduced significantly. The reductions in hardware complexity for filter A and filter B are 9.9% and 21.0%, respectively.
Figure 8a,b show the relationship between the full adder cost and the group size
L for filter A and filter B, respectively. Generally, the full adder cost of the PAB decreases with the increase in
L. However, as can be seen, for filter A, the reduction in full adder cost of the PAB is almost flat for
after the significant drop from
to
, while for filter B, the reduction is notable until
. This is because the range expansion of the local accumulation results in each group diminishes word-length saving for LSAs. The order of filter B is considerably higher than that of filter A, leading to more reducible structural adders. Moreover, the larger coefficient word-length of filter A may result in a reduction in the difference between the word-length of LSAs and their corresponding structural adders in the TDF structure.
Generally, the full adder complexity of all the constant multipliers grows with the increase in L. However, as can be seen, both the complexity of the constant multipliers and its rate of increase are larger for filter A than for filter B. This is because the coefficient word-length of filter A is larger. Although the coefficient value is not the only factor in the complexity of constant multipliers (larger coefficients do not necessarily result in more complex constant multipliers), statistically it tends to increase that complexity. As shown in Table 1, filter A has 70 unique fundamentals that need to be implemented, while filter B has only 29, even though the order of filter A is considerably lower than that of filter B.
Therefore, it can be concluded that the total full adder cost of the filter will increase after the optimum value of L since the reduction in the PAB cannot compensate for the increase in the multiplication block, and the optimum value of L is affected by the coefficient word-length as well as the filter order.
In order to study the impact of the filter coefficient word-length, we present the complexity of four 418-tap low-pass filters with coefficient word-lengths of 12-bit, 16-bit, 20-bit and 24-bit (used for different stopband attenuations) [
35].
Figure 9 shows the total full adder cost of these four filters for different
L values. The optimum
L values and full adder reduction compared to the TDF structure are listed in
Table 2. The optimum values of
L for 12-bit, 16-bit, 20-bit and 24-bit filters are 6, 3, 3 and 1, respectively. It is interesting to note that a 4-bit increase in word-length from 12 to 16 bits results in a large reduction in the full adder saving. The main reason is that, for filters with a smaller word-length, the overall hardware complexity is also smaller, such that the percentage of full adder reduction is larger. Moreover, in this case, the optimum
L is also changed, which may also affect the optimization of the PAB block. As can be seen, with the increase in the coefficient word-length from 12-bit to 24-bit, the optimum values of
L are reduced from 6 to 1 and the full adder cost saving is reduced from 21.0% to 0, i.e., the TDF structure is the optimum structure for the 24-bit filter. Therefore, it can be concluded that for filters with a very long coefficient word-length, the TDF structure is the optimum structure in terms of complexity. This is because, as the coefficient word-length increases, the complexity of the constant multipliers increases to the point where the reduction in the PAB can no longer compensate for it.
In the third experiment, we present the optimum
L of 25 different filters [
35] with orders from dozens to hundreds in
Table 3. As can be seen, generally high-order filters tend to have a larger optimum value of
L. However, as discussed, the optimum value of
L depends on both the coefficient values and the filter order. In
Figure 10, we plot the optimum values of
L and the percentage of full adder reduction (compared to TDF structure) with respect to “# Filter Taps/Coef WL”. As can be seen, both the optimum values of
L and the percentage of full adder reduction are almost linear with “# Filter Taps/Coef WL”, i.e., high-order filters with small coefficient word-lengths tend to have large complexity reductions by the proposed joint optimization.
Another interesting finding is that for filters with orders less than 100, the optimum structure is either the TDF structure (L = 1) or the structure with L = 2. This is because for low-order filters, the constant multipliers account for a much larger proportion of the complexity than in high-order filters. Therefore, the complexity overhead introduced by the multiplication block can be significant compared to the overall complexity of the filter. In the structure with L = 2, the symmetric property of the coefficients is utilized to minimize the complexity overhead of the constant multipliers.
5. Implementation Results
In this section, the implementation results are presented to demonstrate the effectiveness of the proposed method. Filter B in
Table 1, i.e., the 418-tap filter with 12-bit word-length, is used as the benchmark filter to compare different methods. All the designs are implemented in Verilog HDL and synthesized using the Synopsys Design Compiler with a 40 nm CMOS technology. The input to the filter is 8-bit for all the designs.
Table 4 presents the area comparison of different implementations of the filter, where various algorithms, i.e., C1 [
36], Hcub [
28], MBPG [
37], MINAS [
38] and NRSCSE [
39], are used to design the MCM block of the filter. Note that the first row of
Table 4 shows the area of the MCM block only (designed using C1). As can be seen, for this filter, the MCM block, even when designed with a sophisticated algorithm, consumes only a very small portion of the total chip area. Moreover, since the PAB consumes the major part of the chip area, there is little difference between filters designed using different MCM algorithms. With the proposed method, since the PAB is taken into consideration during the optimization, a significant amount of area can be saved. In addition, as can be observed from the table, the area saving becomes smaller as the clock frequency increases. This is because the optimization of the computational complexity introduces extra delay, so the synthesis tool has to spend more area to achieve a higher clock rate.
Table 5 presents the power consumption of different filter implementations. As can be seen, similar to the area comparison, the power consumption of the filter designed using the proposed method is significantly lower than that of the other methods. The main reason, as analyzed above, is the reduction in computational complexity.
6. Conclusions
Most of the literature on multiplierless design and implementation of FIR filters concentrates on the efficient design of MCM blocks. However, the PAB, which is often ignored, can be the major contributor to hardware complexity, especially for high-order filters. A design scheme is therefore proposed to minimize the hardware complexity of multiplierless FIR filters by optimizing the multiplication block and PAB jointly at bit-level. In the proposed joint optimization, the single MCM block is rearranged into several MCM sub-blocks. The products of input and coefficient values are summed locally before being accumulated to reduce the word-lengths of structural adders. The symmetric property of linear phase FIR filters can be utilized to further reduce the complexity of constant multipliers when the MCM block is partitioned into two MCM sub-blocks. Moreover, an exhaustive search algorithm is proposed to determine the optimum partition of the MCM block. Quantitative analyses are presented to study the relationship between the optimum group size L and the coefficient values as well as the filter orders in the proposed joint optimization scheme. It is shown that there is no fixed optimum structure for filters with different coefficient word-lengths and filter orders, and each filter needs to be optimized specifically to achieve the lowest hardware complexity.
The proposed method has certain limitations. The first is the modeling of complexity and delay. In this work, complexity and delay are analyzed in terms of full adders (FAs), but modern synthesis tools are able to optimize area and timing at a finer granularity. Therefore, FA-based modeling may not be accurate enough to find the optimum structure. Another limitation is that for low-order filters, the complexity of the MCM part may account for a significant portion of the overall complexity, such that the overhead introduced in the MCM part may undermine the overall benefits.