Optimal Realization of Distributed Arithmetic-Based MAC Adaptive FIR Filter Architecture Incorporating Radix-4 and Radix-8 Computation

James, Britto Pari; Leung, Man-Fai; Vaithiyanathan, Dhandapani; Mariammal, Karuthapandian

doi:10.3390/electronics13173551


¹ Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai 600062, India

² School of Computing and Information Science, Faculty of Science and Engineering, Anglia Ruskin University, Cambridge CB1 1PT, UK

³ National Institute of Technology Delhi, Delhi 110036, India

⁴ Madras Institute of Technology, Anna University, Chennai 600044, India

* Author to whom correspondence should be addressed.

Electronics 2024, 13(17), 3551; https://doi.org/10.3390/electronics13173551

Submission received: 12 July 2024 / Revised: 23 August 2024 / Accepted: 4 September 2024 / Published: 6 September 2024

(This article belongs to the Section Computer Science & Engineering)


Abstract

Finite impulse response (FIR) filters are widely used in critical communication and signal processing applications. Advances in technology call for application-specific designs with optimal characteristics. This work proposes an efficient distributed arithmetic adaptive FIR filter architecture (DAAFA) using radix-4 and radix-8 computation. Distributed arithmetic (DA) is extensively used to calculate a sum of products without a multiplier. The proposed fixed-point single multiply and accumulate (MAC) adaptive FIR filter is implemented with minimal design complexity. The total critical-path computation time is the sum of the delay incurred in the error calculation module and the delay involved in updating the filter weights. The critical-path computation time of the filter structure is high, which results in increased latency. In addition, the approximate radix DA multiplier is constructed using Booth recoding, a partial product formation block and a shift-based accumulation block. The approximate DA design reduces complexity and area, measured in slices, and enhances the operating speed. The partial products are created using shifters and efficient adders, which further improves the performance of the realization. The design is implemented on Xilinx and Altera devices and compared with the existing literature. The synthesis results show that the proposed design performs better in terms of complexity, slice-delay product and maximum operating speed. The proposed architecture was found to be effective in reducing area, delay and complexity. The results indicate an area reduction (slices) of about 92.03% compared to the existing design and a speed enhancement of about 90.7%. In addition, the architecture uses the least mean squares (LMS) approach, which notably improves the convergence rate.

Keywords:

ARDA multiplier; multiply and accumulate (MAC); FIR filter; low area; high speed

1. Introduction

FIR filters are generally preferred in signal processing and communication applications, including VLSI signal processing, biomedical signal processing, audio and radio frequency communication, and software radio. Fixed, non-varying coefficients are adequate when the signal characteristics are known in advance. However, a fixed filter becomes ineffective when the characteristics of the signal are unknown. An adaptive filter architecture addresses this by varying the coefficients according to the applied input. In this work, the least mean squares (LMS) algorithm is adopted to enhance the performance of the architecture. This algorithm provides reasonable convergence and reduced error. The conventional adaptive architecture is implemented using traditional multipliers and adders and is illustrated in Figure 1.

The coefficients are computed and updated from the incoming input samples. The filter output is compared with a reference signal, and the result is used to update the coefficients in each iteration. The LMS coefficient update is given in Equation (1).

w(n + 1) = w(n) + µ x(n)e(n)

(1)

where e(n) = d(n) − y(n). Here, x(n) is the input, y(n) is the output, w(n) is the filter weight (coefficient), e(n) is the error and µ is the step size.

Figure 1 depicts the traditional adaptive filter framework. x(n) and d(n) are the input signal and the desired signal, respectively. e(n) is the error component, computed by subtracting the output of each iteration from the desired signal. In adaptive filtering, the weights or filter coefficients w₀, w₁, w₂, …, w_n need to be updated to achieve the required response. Hence, in this work, the LMS algorithm is used to obtain the updated coefficients. The step size (adaptation rate) µ is chosen as 0.06.
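The update in Equation (1) can be sketched in a few lines of floating-point Python. This is only an illustration under stated assumptions, not the fixed-point hardware described in the paper: the function name `lms_identify`, the 4-tap `unknown_system`, the uniform random input and the system-identification setup are all hypothetical choices; only the update rule and µ = 0.06 come from the text.

```python
import random

def lms_identify(unknown, mu=0.06, steps=3000, seed=0):
    """Adapt FIR weights with w(n+1) = w(n) + mu * x(n) * e(n)."""
    rng = random.Random(seed)
    taps = len(unknown)
    w = [0.0] * taps   # adaptive weights, initialised to zero
    x = [0.0] * taps   # delay line, x[0] holds the newest sample
    for _ in range(steps):
        x = [rng.uniform(-1.0, 1.0)] + x[:-1]            # shift in a new input sample
        d = sum(h * xi for h, xi in zip(unknown, x))     # desired signal d(n)
        y = sum(wi * xi for wi, xi in zip(w, x))         # filter output y(n)
        e = d - y                                        # error e(n) = d(n) - y(n)
        w = [wi + mu * xi * e for wi, xi in zip(w, x)]   # Equation (1)
    return w

# Hypothetical system to identify (assumption, not taken from the paper).
unknown_system = [0.5, -0.25, 0.125, 0.0625]
weights = lms_identify(unknown_system)
```

In this noise-free setting the adapted `weights` approach `unknown_system`, which is the sense in which the error e(n) drives the coefficients towards the required response.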

In the typical filter structure, increasing the filter length increases the number of multipliers, which leads to a highly complex design. To curtail this hardware complexity, a single-MAC architecture based on time sharing is proposed in this work. The proposed LMS adaptive filter structure targets fixed-point computation and attains a reasonable performance improvement. The implementation is organized around the calculation of inner products, which contributes a significant advantage in the filter design.

2. Associated Works

Adaptive filtering has been studied extensively, and the work related to this study is surveyed in this section. An effective addition strategy was devised for inner-product computation that handles high input sampling rates by shortening the critical path and drastically reducing the adaptation time, leading to faster convergence [1]. Adaptive acoustic filtering using a HOT-based LMS echo cancellation approach was shown to cancel echo in audio signals efficiently with a considerable reduction in computations [2]. The convergence speed is enhanced by employing an affine projection algorithm in adaptive filtering [3]. Three different approaches have been used to realize adaptive LMS architectures that offer enhanced speed and reduced computational complexity [4]. The systolic approach to DA-based filter design was shown to be better in terms of area, power and delay [5]. Further, DA architectures attain good performance for large filter orders [6]. To achieve higher speed, the subexpressions in the filter design are restructured to use fewer add and shift operations [7]. An innovative DA-based filter structure employs an efficient LUT access scheme and obtains lower error than existing designs [8]. A block-based LMS filter structure incorporating DA achieves a noticeable reduction in area-delay product and saves several additional operations otherwise required for the realization [9]. Reconfigurable DA-based filter implementation has been realized using carry-save addition, a systolic method and a distributed RAM scheme [10]. It is occasionally necessary to delay the coefficient update; a study of the delayed LMS approach found that the step size of the coefficient update plays a crucial part in obtaining stable and convergent behavior [11].
A time-division multiplexing (TDM) realization of single-channel and multi-channel FIR filters achieves better performance: one multiply and add unit is shared across a variety of channels and filter taps to use resources efficiently [12]. The design of multirate FIR filters has been examined across several topologies, including polyphase, folding pipeline and systolic array architectures [13]. Distinct adder implementations provide significant performance improvement [14]. Using approximation, the 16-tap FIR NLMS design on an FPGA [15] increases the throughput, but the complexity increases with the number of taps.

Discarding selected bits of the filter coefficients significantly reduces power and area, but at the expense of reduced accuracy [16]. To carry out parallel multiplication, an approximation-based approach such as truncation is preferred [17]. Most existing approaches truncate the partial products to reduce hardware. In that case, however, all input bits are still needed for the multiplication, so the memory required to store the data is unchanged and there is no hardware reduction in terms of memory. Memory also consumes a significant amount of power for storing the data, and the area grows when larger amounts of data are stored and processed. Further, optimal processing is essential to achieve a higher throughput rate. Compared with truncating the partial products, truncating the inputs offers substantial hardware minimization, particularly for designs with addition and multiplication operations [17]. Therefore, to achieve hardware savings, truncation is applied to the inputs. The polyphase framework employs approximate radix-4 computation to attain optimal results [18]. A carry-save accumulation unit plays an important role in DA-based filter design that does not require a LUT [19]. The adaptive filter is realized with a faster convergence rate and minimal error [20]. LMS-based adaptive filtering achieves good performance characteristics [21]. A linear approximation of the tan function is adopted for the design of less complex adaptive structures [22,23]. The mapping units and the limiting mechanism suggested in [24] offer a significant reduction in area and memory usage.

The proposed framework is developed to address the aforementioned constraints, using an approximate radix DA multiplier constructed from Booth recoding, a partial product formation block and a shift-based accumulation block. This research addresses issues in existing filter designs such as high complexity, large area and low speed. The main contribution is an efficient adaptive filter architecture built around a single-MAC module. The weights (coefficients) are processed using approximate DA, which eliminates the need for the LUT approach. The single-MAC structure reduces the complexity, and pipeline registers improve the computing speed. The proposed DAAFA is realized with radix-4 and radix-8 computation. The proposed 16-tap and 32-tap filters are analyzed on Xilinx and Altera platforms.

The traditional adaptive framework requires N MAC operations, leading to a large area and a highly complex realization. Therefore, an efficient structure that resolves these issues must be designed.

In this research work, an efficient adaptive FIR filter architecture is proposed, which incorporates a time-sharing architecture based on a single multiply and add module (single-MAC). The proposed architecture employs approximate DA, which reduces the complexity to a notable extent. It uses a single-MAC unit regardless of the number of taps and is designed to carry out radix-4 and radix-8 operations. The total critical-path computation time is the sum of the delay incurred in the error calculation module and the delay involved in updating the filter weights. Since the critical path of the filter architecture has a large computation time, pipelining is introduced to improve the speed of the proposed architecture. In addition, the approximate radix DA multiplier is constructed using Booth recoding, a partial product formation block and a shift-based accumulation block.

More importantly, the LMS algorithm is incorporated in this work to construct the adaptive architecture. A conventional FIR filter requires adder, multiplier and delay units. To achieve optimal realization, an adaptive algorithm is added to the architecture; its main purpose is to update the filter coefficients iteratively.

This paper is organized as follows. The introduction to the design approaches and the survey of related filters are presented in Section 1 and Section 2. The proposed filter framework is described in detail in Section 3. The results are discussed in Section 4. Section 5 concludes the work and outlines future work.

3. Proposed Adaptive Filter Architecture (DAAFA)

An FIR filter is used to obtain a desired response. The fundamental design of FIR filters requires addition, multiplication and delay operations. In adaptive filtering, the weights are updated iteratively until the required output is achieved. Fixed-point and floating-point representations of the filter coefficients are used for designing adaptive filters. The implementation combines two major blocks: the coefficient update block and the error calculation block. The critical path is the longest path between input and output that contains no delay elements. In the traditional adaptive realization, the critical path is the total delay of the coefficient update and error calculation modules. This large delay increases latency, which limits the input sampling rate. This paper proposes a modified adaptive filter design that employs a single multiply and accumulate section to reduce the latency and enhance the design performance.

The adaptive approach provides the required updated coefficients and minimizes the error. To gain significant advantages in area, power and delay, approximate computation blocks are chosen for the fixed-point realization. Approximation is performed in the multiplication operations using truncation or rounding. The performance of the adaptive design is mainly determined by its convergence, which brings the output close to the required response. The LMS approach is used in this work with a step size (adaptation rate) of µ = 0.06, for which the designed filter works well.

The convergence time is measured by the number of iterations needed to update the coefficients. Specifically, as the number of iterations increases, the error component in the output decreases.

Error Manipulation Block

The weights are represented in two's complement form, as described in Equation (2):

w_{i}(n) = - w_{i}^{m-1}(n) 2^{m-1} + \sum_{j=0}^{m-2} w_{i}^{j}(n) 2^{j}

(2)

where w_{i}^{j}(n) denotes the jth bit of w_{i}(n), counted from the least significant bit, and m is the word length. w_{i}(n) is specified as an integer and is converted into a fixed-point representation by an appropriate shift operation. Groups of four bits of w_{i}(n) are combined using the radix-8 Booth recoding scheme, which is depicted in Table 1.
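The two's complement expansion in Equation (2) can be checked numerically. The sketch below is illustrative only; the helper names `twos_complement_bits` and `from_bits` and the 8-bit word length are assumptions made for the example, not part of the paper.

```python
def twos_complement_bits(value, width):
    """Return the bits [b_0, ..., b_{width-1}] (LSB first) of a signed integer."""
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the given width")
    # Python's >> on negative integers is an arithmetic shift, so masking with 1
    # yields the two's complement bit pattern.
    return [(value >> j) & 1 for j in range(width)]

def from_bits(bits):
    """Evaluate Equation (2): -b_msb * 2^(width-1) + sum over the lower bits of b_j * 2^j."""
    width = len(bits)
    lower = sum(b << j for j, b in enumerate(bits[:-1]))
    return -bits[-1] * (1 << (width - 1)) + lower

bits = twos_complement_bits(-3, 8)   # 8-bit word length is an assumption
restored = from_bits(bits)
```

The most significant bit carries a negative weight, which is why a single formula covers both positive and negative weights.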

The radix-4 partial products are obtained using shifters and adders, and a Wallace tree adder carries out the addition. The architecture for radix-4 computation, in which the product is formed from part of the multiplier and part of the multiplicand, is given in Figure 2.

The weights w_{i}(n) are recoded in radix-4 form, as described in Equation (3):

w_{i}(n) = \sum_{j=0}^{\lceil m/2 \rceil - 1} \left( -2 w_{i}^{2j+1}(n) + w_{i}^{2j}(n) + w_{i}^{2j-1}(n) \right) 2^{2j} = \sum_{j=0}^{\lceil m/2 \rceil - 1} \bar{w}_{i}^{j}(n) 2^{2j}

(3)

where w_{i}^{-1}(n) = 0, \bar{w}_{i}^{j}(n) = -2 w_{i}^{2j+1}(n) + w_{i}^{2j}(n) + w_{i}^{2j-1}(n) and \bar{w}_{i}^{j}(n) \in \{-2, -1, 0, 1, 2\}.
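As a sanity check on Equation (3), the following sketch recodes a signed integer into radix-4 Booth digits and rebuilds it. The function name `booth_radix4` and the word lengths used are assumptions for illustration; bits above the most significant bit are sign-extended, and the bit below the least significant bit is taken as zero, as in the definition above.

```python
def booth_radix4(value, width):
    """Radix-4 Booth digits (least significant first) of a signed integer."""
    ndigits = (width + 1) // 2             # ceil(width / 2) digits

    def bit(k):
        if k < 0:
            return 0                        # the bit at index -1 is defined as 0
        return (value >> min(k, width - 1)) & 1   # sign-extend above the MSB

    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1)
            for j in range(ndigits)]

def rebuild_radix4(digits):
    """Sum of digit_j * 2^(2j), the right-hand side of Equation (3)."""
    return sum(d * 4 ** j for j, d in enumerate(digits))

digits = booth_radix4(6, 4)        # 4-bit example value, chosen for illustration
value_back = rebuild_radix4(digits)
```

Each digit is a multiple of the multiplicand that can be formed by a shift and an optional negation, which is why the partial products need only shifters and adders.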

The architecture to perform radix-8 computation is given in Figure 3.

The weights w_{i}(n) are recoded in radix-8 form, as outlined in Equation (4):

w_{i}(n) = \sum_{j=0}^{\lceil m/3 \rceil - 1} \left( -2^{2} w_{i}^{3j+2}(n) + 2 w_{i}^{3j+1}(n) + w_{i}^{3j}(n) + w_{i}^{3j-1}(n) \right) 2^{3j} = \sum_{j=0}^{\lceil m/3 \rceil - 1} \bar{w}_{i}^{j}(n) 2^{3j}

(4)

where w_{i}^{-1}(n) = 0, \bar{w}_{i}^{j}(n) = -2^{2} w_{i}^{3j+2}(n) + 2 w_{i}^{3j+1}(n) + w_{i}^{3j}(n) + w_{i}^{3j-1}(n) and \bar{w}_{i}^{j}(n) \in \{-4, -3, -2, -1, 0, 1, 2, 3, 4\}.

If the encoded input bit width is smaller than 3\lceil m/3 \rceil, sign extension is applied. The filter output is computed from the inputs and the filter coefficients, as given in Equation (5). The partial products are computed from the input samples and the corresponding recoded weights using Equation (6).

y(n) = x(n) \cdot w(n) = δ \cdot \bar{w}(n) \cdot x(n)

(5)

where

δ = [2^{0}, 2^{3}, \dots, 2^{3 \lceil m/3 \rceil - 3}],

x(n) = [x(n), x(n-1), \dots, x(n-M+1)]^{T}

and

\bar{w}(n) = \begin{bmatrix} \bar{w}_{0}^{0}(n) & \bar{w}_{1}^{0}(n) & \cdots & \bar{w}_{M-1}^{0}(n) \\ \bar{w}_{0}^{1}(n) & \bar{w}_{1}^{1}(n) & \cdots & \bar{w}_{M-1}^{1}(n) \\ \vdots & \vdots & \ddots & \vdots \\ \bar{w}_{0}^{\lceil m/3 \rceil - 1}(n) & \bar{w}_{1}^{\lceil m/3 \rceil - 1}(n) & \cdots & \bar{w}_{M-1}^{\lceil m/3 \rceil - 1}(n) \end{bmatrix}

The jth partial product is

pp_{j}(n) = \sum_{i=0}^{M-1} \bar{w}_{i}^{j}(n) x(n-i) = \sum_{i=0}^{M-1} PP_{ij}

(6)

Let pp(n) = [pp_{0}(n), pp_{1}(n), \dots, pp_{\lceil m/3 \rceil - 1}(n)]^{T}. Then pp(n) = \bar{w}(n) \cdot x(n) and y(n) = δ \cdot pp(n).
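The output computation in Equations (5) and (6) can be sketched with integer arithmetic. This is a behavioural illustration under assumptions, not the authors' hardware: the names `booth_radix8` and `da_output`, the 8-bit word length and the sample values are hypothetical, and the recoding follows the radix-8 digit definition given above.

```python
def booth_radix8(value, width):
    """Radix-8 Booth digits (least significant first) of a signed integer."""
    ndigits = -(-width // 3)               # ceil(width / 3) digits

    def bit(k):
        if k < 0:
            return 0                        # the bit at index -1 is defined as 0
        return (value >> min(k, width - 1)) & 1   # sign extension above the MSB

    return [-4 * bit(3 * j + 2) + 2 * bit(3 * j + 1) + bit(3 * j) + bit(3 * j - 1)
            for j in range(ndigits)]

def da_output(weights, samples, width):
    """Compute y(n) = delta . pp(n) with pp_j(n) = sum_i wbar_i^j(n) * x(n - i)."""
    # Row j of the recoded weight matrix holds digit j of every weight.
    rows = list(zip(*(booth_radix8(w, width) for w in weights)))
    pp = [sum(d * x for d, x in zip(row, samples)) for row in rows]   # Equation (6)
    delta = [8 ** j for j in range(len(pp))]                          # 2^0, 2^3, ...
    return sum(dj * pj for dj, pj in zip(delta, pp))                  # Equation (5)

# Hypothetical integer weights and samples x(n), x(n-1), x(n-2), x(n-3).
example_weights = [3, -5, 7, -1]
example_samples = [10, -2, 4, 6]
y = da_output(example_weights, example_samples, width=8)
direct = sum(w * x for w, x in zip(example_weights, example_samples))
```

Here `y` matches the direct inner product `direct`, since the recoding is exact; any approximation such as input truncation would be applied on top of this and is not modelled in the sketch.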

Radix-8 Booth recoding reduces the number of partial products from m to ⌈m/3⌉, a reduction of m − ⌈m/3⌉ ≈ 2m/3. Therefore, the accumulation operations needed to evaluate the output diminish by about two thirds. In the traditional adaptive filter structure, the number of multipliers rises when the filter order is increased to achieve accurate characteristics. To overcome this, an efficient structure is devised in which a single-MAC core performs the multiplications in a time-division manner while enhancing the speed of operation.
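As a quick worked check of this count, assume a word length of m = 16 bits (a hypothetical value chosen only for illustration):

```latex
\begin{aligned}
\text{bit-serial partial products} &= m = 16,\\
\text{radix-4 Booth digits} &= \lceil m/2 \rceil = 8,\\
\text{radix-8 Booth digits} &= \lceil m/3 \rceil = 6,\\
m - \lceil m/3 \rceil &= 10 \approx \tfrac{2m}{3} \approx 10.7 .
\end{aligned}
```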

The speed of the proposed filter is enhanced by incorporating pipeline registers. Processing a filter of length L requires L clock cycles. For L = 2 and an input sample rate of 1 MSPS, the MAC processes data at 2 MSPS, one product per clock cycle. The selected input is applied to the multiplication block, and the required output is therefore obtained within two clock cycles. The proposed DAAFA realizations for radix-4 and radix-8 comprise the ARDA multiplication unit and other functional units and are shown in Figure 4 and Figure 5, respectively. A multiplexer selects the appropriate data, which are passed to the ARDA multiplication section, where the coefficients are held in the corresponding registers. The contributions of the current and past inputs are added by the accumulation unit, which is reset to zero after two clock cycles. The select lines and the accumulation are controlled by a single common counter. In the same way, a filter of length L is realized with a single-MAC core, with registers providing the enhanced speed. The error component is computed as the difference between the desired signal d(n) and the output y(n) and is fed back to the input side to update the coefficients. The error is multiplied by μ = 0.06, and the result is in turn multiplied by x_in to update the coefficients concurrently in the corresponding registers.
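The time-sharing idea can be illustrated with a behavioural model in which one multiply-accumulate is executed per clock cycle and the accumulator is cleared for every new input sample. The names `single_mac_fir` and `direct_fir`, the integer data and the L = 2 example are assumptions for illustration; pipelining and the ARDA multiplier are not modelled.

```python
def single_mac_fir(weights, inputs):
    """FIR filtering with one shared MAC: L clock cycles per input sample."""
    L = len(weights)
    delay_line = [0] * L
    outputs = []
    clock_cycles = 0
    for sample in inputs:
        delay_line = [sample] + delay_line[:-1]   # register the new input
        acc = 0                                   # accumulator cleared per output
        for select in range(L):                   # counter drives the mux select
            acc += weights[select] * delay_line[select]   # one MAC per clock
            clock_cycles += 1
        outputs.append(acc)
    return outputs, clock_cycles

def direct_fir(weights, inputs):
    """Reference direct-form FIR using one multiplier per tap."""
    padded = [0] * (len(weights) - 1) + list(inputs)
    return [sum(w * padded[n + len(weights) - 1 - k] for k, w in enumerate(weights))
            for n in range(len(inputs))]

shared_out, cycles = single_mac_fir([3, -1], [1, 2, 3, 4])
reference_out = direct_fir([3, -1], [1, 2, 3, 4])
```

With L = 2 the model spends two clock cycles per input sample, which is the trade of clock rate for multiplier count that the single-MAC structure relies on.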

The Wallace network uses adder blocks to sum the partial products in stages until only two numbers are left, which are then added. This reduces the latency of the proposed structure.
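The staged reduction can be sketched at word level with 3:2 carry-save compressors. This is a simplified illustration, not the bit-level Wallace network of the paper: the function name `carry_save_reduce` and the six non-negative example operands are assumptions.

```python
def carry_save_reduce(operands):
    """Reduce a list of non-negative integers to at most two rows using 3:2 compressors."""
    rows = list(operands)
    stages = 0
    while len(rows) > 2:
        next_rows = []
        for k in range(0, len(rows) - 2, 3):
            a, b, c = rows[k:k + 3]
            next_rows.append(a ^ b ^ c)                            # bitwise sum
            next_rows.append(((a & b) | (a & c) | (b & c)) << 1)   # shifted carries
        # Rows that do not fill a group of three pass to the next stage unchanged.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
        stages += 1
    return rows, stages

partial_products = [13, 7, 25, 4, 19, 30]   # hypothetical partial products
final_rows, stage_count = carry_save_reduce(partial_products)
total = sum(final_rows)                     # final carry-propagate addition
```

Each stage avoids carry propagation, so only the last two-operand addition needs a full carry-propagate adder.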

Realizing adaptive filters in hardware is challenging and is limited in certain applications. Although implementation on DSP processors provides flexibility and programmability, their use is limited in certain fields because of sequential processing, which restricts the opportunities for parallel processing. Moreover, the power requirement is quite high for portable designs, and the internal hardware structure is not optimized. Recently, the use of FPGAs has been expanding. The FPGA is a powerful hardware platform because it offers both programmable and dedicated hardware, which makes it attractive in several fields. Hence, in this work, the FPGA platform is used to obtain the results.

The FPGA implementation of the proposed architecture employs approximate-DA-based multipliers that can be realized using LUTs and registers. Therefore, the DSP units remain available for other tasks. In addition, whenever several multiplications are required, adopting an approximate-DA-based multiplier reduces the resources (area) significantly and greatly minimizes the logic involved in the design compared with adopting several DSP units.

4. Results and Discussion

This section presents the synthesis and analysis of the proposed DAAFA architecture and compares its performance with conventional architectures. Altera (San Jose, CA, USA) and Xilinx (San Jose, CA, USA) FPGA devices are used to synthesize the proposed DAAFA filter topologies. The Altera DE2-115 FPGA board is connected to the MATLAB/Simulink xPC Target toolbox (Natick, MA, USA), which is used for FPGA-in-the-loop (FIL) simulation and enables real-time model testing. The adder's second input is a 100 Hz sine wave (the desired signal), and its first input is a 100 Hz signal combined with a randomly generated sequence. The simulation provides time-domain curves of the desired signal, the error signal and the output signal at 100 Hz. The LMS algorithm updates the FIR filter coefficients with a step size of µ = 0.06. Analysis shows that the error is lowest when µ = 0.06 and that the result closely matches the desired value. The realization of the LMS adaptive FIR filter is thus verified using an FPGA development board and Simulink. The filter output is denoted y(n) and the error component e(n). Since the Altera DE2-115 FPGA board has a maximum target frequency of 50 MHz, an oversampling factor of (50 × 10⁶)/100 = 5 × 10⁵ is considered for FIL validation.
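A software analogue of this test setup can be sketched as follows. It is a floating-point illustration under stated assumptions rather than the FIL experiment itself: the function name `lms_denoise`, the 8 kHz sampling rate, the 8 taps and the uniform noise amplitude are hypothetical; the 100 Hz sine as desired signal, the noisy 100 Hz input and µ = 0.06 follow the description above.

```python
import math
import random

def lms_denoise(mu=0.06, taps=8, fs=8000, f0=100, n_samples=2000, seed=1):
    """Run LMS with a clean 100 Hz sine as d(n) and a noisy copy as x(n)."""
    rng = random.Random(seed)
    w = [0.0] * taps
    x = [0.0] * taps
    errors = []
    for k in range(n_samples):
        d = math.sin(2.0 * math.pi * f0 * k / fs)       # desired signal d(n)
        noisy = d + rng.uniform(-0.2, 0.2)              # sine plus random sequence
        x = [noisy] + x[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x))        # filter output y(n)
        e = d - y                                       # error e(n)
        w = [wi + mu * xi * e for wi, xi in zip(w, x)]  # LMS coefficient update
        errors.append(e)
    return errors

errors = lms_denoise()
early_mse = sum(e * e for e in errors[:200]) / 200      # during adaptation
late_mse = sum(e * e for e in errors[-200:]) / 200      # after adaptation
```

Comparing `early_mse` with `late_mse` gives a rough view of how the error shrinks as the coefficients adapt.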

4.1. Simulink Module

The Simulink model and the FIL block of the proposed structure are given in Figure 6 and Figure 7. The filter output is plotted for several step sizes (M = µ) and is illustrated in Figure 8. The error is lowest for M = µ = 0.06. Therefore, µ = 0.06 is used in the proposed adaptive architecture.

4.2. Proposed DAAFA Results and Comparison

In the proposed architecture, the time-sharing technique is implemented solely in the filter section. For speed reasons, the error manipulation block is designed using a parallel MAC architecture. Additionally, the multiply and add modules are registered to enhance the speed of the architecture. The N-tap filter is realized with a single-MAC unit and pipeline registers, which raises the operating speed. Although the resource-sharing single-MAC unit achieves significant area minimization, it marginally reduces the speed. The proposed architecture addresses this by placing pipeline registers between the multiply and add units, which maximizes the achievable sample frequency. Hence, the proposed design achieves an optimized area and enhanced speed over the other reported designs.

The proposed architecture is implemented on Xilinx and Altera platforms. For the Xilinx Virtex-5 FPGA, the maximum achievable sampling frequency depends on various factors, including the specific model, application type, architecture and resources utilized. The Virtex-5 series features DSP48E slices, which can operate at clock frequencies of up to 550 MHz. This indicates that the theoretical maximum sampling frequency for a single-stage operation using these DSP slices could be around 550 MHz. The clock management tile input frequency is about 500 MHz, and high-speed serial input-output operation supports a frequency of about 1.25 GHz. The maximum frequency is determined by the routing delay and the number of logic levels between registers. The proposed design has minimal logic levels between registers and can be clocked at more than 300 MHz. Hence, the proposed DAAFA realization achieves a maximum sampling frequency greater than 300 MHz.

The simulation result of the proposed architecture is shown in Figure 9. The input is applied at 80 ns and the output is obtained at 125 ns, after six clock cycles. Figure 10 depicts the RTL view of the proposed architecture, which shows the two major blocks: the time-sharing single-MAC unit and the LMS unit. The ARDAM unit is used in these designs to build optimized structures. Table 2 lists the complexity of the architecture in terms of multipliers and registers. The filter structure has a delay of six clock cycles. The MAC takes four clock cycles: two for input registration, one for the multiply and one for the accumulate. During the remaining two clock cycles, the output is registered and the coefficients are updated. The architecture needs N + 6 registers overall, where N is the number of taps. Regardless of the number of filter taps, a single MAC is sufficient for the filter operation because of the dedicated multiplier structure. Table 3 shows that, regardless of the number of taps, the proposed time-sharing scheme requires only two multipliers. According to the implementation findings, the proposed architecture has a faster convergence rate and reduced complexity.

FPGA devices are used to analyze the performance of the adaptive FIR filter topologies. A synthesis summary of the traditional and proposed MAC-core-based radix-4 and radix-8 filter architectures is given in Table 4. A comparative analysis was performed for 16-tap and 32-tap filters. For the proposed 16-tap radix-4 architecture, the area (slices) is reduced by about 77.12% compared to the conventional radix-4 architecture. Likewise, for the proposed 32-tap design, an area reduction of about 87% is achieved compared to the conventional design. An area reduction of about 76.35% is attained for the proposed 16-tap radix-8 structure compared to its traditional design, and the area of the 32-tap radix-8 structure is reduced by about 80.97%. The single-MAC core therefore employs fewer shift and add units, minimizing the area substantially. In addition, the operating frequency of the proposed 16-tap radix-4 and radix-8 structures is enhanced by about 70.59% and 73.92%, respectively, and the operating speed of the proposed 32-tap radix-4 and radix-8 architectures is improved by about 79.95% and 79.33% over the existing architectures.

From Table 4, it is also observed that the proposed 16-tap DAAFA structure needs 162 slices, far fewer than the 708 slices of the existing design. The maximum sampling period of the existing 16-tap design is 10.697 ns, whereas that of the proposed design is 3.146 ns. The critical path delay (CPD) is the total delay along the longest path between the source and destination points. A CPD of about 3.146 ns is attained for the proposed radix-4 16-tap realization, and about 3.168 ns for the proposed 32-tap radix-4 framework. The CPD is found to be the delay between the error output register and the multiplier used in the LMS design. The CPD of the proposed 16-tap radix-4 design is reduced by about 7.551 ns (70.59%) over the existing design, and a CPD reduction of 12.632 ns (79.95%) is achieved for the proposed 32-tap radix-4 design. In a similar manner, the radix-8 realization offers noteworthy delay reductions of about 8.28 ns (73.92%) and 11.98 ns (79.93%) over the already reported design.
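The percentages quoted for the 16-tap radix-4 design can be reproduced from the slice counts and sampling periods stated in the text. The helper name `percent_reduction` is a hypothetical choice, and the frequency is simply the reciprocal of the quoted 3.146 ns period.

```python
def percent_reduction(existing, proposed):
    """Relative saving of the proposed value over the existing one, in percent."""
    return 100.0 * (existing - proposed) / existing

area_saving = percent_reduction(708, 162)         # slices, 16-tap radix-4
delay_saving = percent_reduction(10.697, 3.146)   # sampling period in ns
max_freq_mhz = 1000.0 / 3.146                     # a period in ns gives MHz
```

The resulting frequency is consistent with the statement that the design can be clocked at more than 300 MHz.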

Hence, the proposed realization performs well in terms of complexity, speed and area. Likewise, the proposed 32-tap filter offers superior performance over the prevailing research. The results show that the proposed structure involves less hardware and faster processing units than the conventional architecture. The area and speed comparison of the conventional and proposed DAAFA designs is given in Figure 11 and Figure 12.

Table 5 gives the utilization summary of the proposed 16-tap radix-4 and radix-8 architectures compared to the design proposed in [10]. The results show that the proposed architecture requires fewer hardware resources in terms of registers, slices and LUTs. Furthermore, the implementation improves the speed of operation to a notable extent. The area (NOS) of the proposed radix-8 structure is reduced by about 6.74% compared to the reported literature, and the proposed radix-4 approach decreases the area by about 7.87% over the existing design. The number of slice registers is reduced by 16.99% and 20.39% for the proposed radix-8 and radix-4 realizations over the conventional design. The LUT count decreases by about 7.49% and 11.61% for the proposed radix-8 and radix-4 architectures compared to the conventional architecture. In addition, because of the pipeline registers in the proposed architecture, a speed enhancement of about 83.36% and 82.79% is attained for the radix-8 and radix-4 approaches. The complexity and MSF analysis is depicted in Figure 13 and Figure 14.

The proposed DAAFA radix-4 and radix-8 architectures are also compared with the existing work in terms of logic elements, as outlined in Table 6. The proposed 16-tap radix-8 architecture offers an area reduction of about 50.73%, 29.51% and 19.17% over the existing design (k = 2, 4 and 8) [21]. An area reduction of about 51.87%, 31.15% and 21.05%, respectively, is obtained for the proposed 16-tap radix-4 DAAFA compared to the conventional approach (k = 2, 4 and 8) [21]. In addition, the proposed 32-tap radix-8 DAAFA design achieves area reductions of 55.53%, 30.16% and 6.99% compared to the conventional design. Similarly, the proposed 32-tap radix-4 DAAFA structure achieves area reductions of 56.11%, 31.07% and 8.2%, respectively, compared to the already reported design [21]. Therefore, the approximation-based proposed architecture with a single-MAC core provides a more efficient realization than the existing one.

Table 7, Table 8 and Table 9 compare the synthesis summary of the proposed DAAFA architecture with several other existing filter structures; the synthesis and analysis were carried out using various Xilinx FPGA devices. The proposed DAAFA architectures achieve a substantial area reduction and speed improvement compared to the existing structures. From Table 7, it is noticeable that the proposed radix-8 design achieves area minimization of about 66.61% and 91.87% over the existing designs [20,22]. Similarly, the proposed radix-4 design offers area optimization of about 65.82% and 91.69% compared to the prevailing structures [20,22]. Likewise, the speed of the synthesized radix-8 architecture is improved by 90.99% and 93.81% compared to the existing designs [20,22]. In addition, the proposed radix-4 architecture enhances the speed by about 91.13% and 93.91% compared to the prevailing designs [20,22].

The number of four-input LUTs is also listed in Table 7, Table 8 and Table 9. From these tables, it is inferred that the proposed realization requires fewer four-input LUTs than the existing designs. Table 8 compares the synthesis summary of the proposed architectures with that of the conventional architectures when compiled for the Xilinx Virtex-5 device. From the results, it is observed that the proposed design offers area minimization of about 94.83% and 68.58% for the radix-8 implementation as well as 94.62% and 67.34% for the radix-4 implementation compared to the structures reported in [22] and [20], respectively. Moreover, the proposed design offers speed improvement of about 92.17% and 90.06% for the radix-8 realization as well as 92.68% and 90.7% for the radix-4 design over the existing works [22] and [20], respectively. Here, MSP refers to the maximum sampling period and MSF refers to the maximum sampling frequency needed to evaluate the required output.

Table 9 summarizes the various architectures compiled for the Xilinx Spartan 3E device. From this table, it is evident that the number of slices is reduced by about 91.52% and 25.63% for the radix-8 realization compared to the designs reported in [22] and [20], respectively. Also, slice reductions of about 90.74% and 18.77% are achieved for the proposed radix-4 design. Moreover, speed improvements of about 95.35% and 75.57% are attained for the radix-8 design as well as 95.92% and 78.56% for the radix-4 design compared to the existing works [22] and [20], respectively.

The suggested adaptive filter topologies synthesized with the Altera Cyclone IV device are summarized in Table 10, which shows that the number of logic elements is reduced by almost 40% for the radix-8 realization. Additionally, compared to the previously published architecture [15], the suggested radix-4 design achieves a decrease in logic elements of almost 43%. Furthermore, compared with the existing work, speed improvements of roughly 62% are obtained for the radix-8 and radix-4 designs.

The CPD is also investigated using Virtex-4, Virtex-5 and Spartan 3E devices. In the Virtex-4 realization, the CPD of the investigated radix-8 structure is reduced by about 49.479 ns (93.87%) and 32.609 ns (90.98%) over the structures in [22] and [20], respectively. Furthermore, the CPD of the radix-4 structure is reduced by about 49.53 ns (93.97%) and 32.66 ns (91.13%) over the same approaches. Likewise, in the Virtex-5-based framework, the CPD of the proposed architecture is decreased by about 36.5 ns (92.17%) and 28.09 ns (90.06%) for the radix-8 design and by about 36.7 ns (92.68%) and 28.29 ns (90.70%) for the radix-4 design with regard to the other architectures. Moreover, for Spartan 3E, CPD reductions of about 100.59 ns (95.36%) and 15.15 ns (75.56%) are achieved for the radix-8 scheme and about 101.19 ns (95.92%) and 15.75 ns (78.55%) for the radix-4 scheme compared to the other prevailing approaches. Additionally, the Altera-based implementation achieves CPD reductions of about 62.03% and 62.88% over the existing method. Hence, the proposed structure performs better than the reported approaches.
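As a worked check of how these CPD figures relate to the MSP rows of Tables 7, 8 and 9, the following assumes that each reduction is taken relative to the reference design; the symbols $T_{\text{ref}}$ and $T_{\text{prop}}$ are introduced here for illustration only.

$$\Delta T = T_{\text{ref}} - T_{\text{prop}}, \qquad \text{reduction} = 100 \times \frac{\Delta T}{T_{\text{ref}}}\,\%$$

For the Virtex-5 radix-8 design against [22], $\Delta T = 39.6 - 3.1 = 36.5$ ns and $100 \times 36.5 / 39.6 \approx 92.17\%$, which matches the value quoted above.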

The main contribution of this research work is the implementation of an efficient adaptive filter architecture using a single-MAC module, approximate DA and a pipelining scheme. Time-shared use of a single MAC in the filter realization, combined with approximate DA, reduces the complexity significantly relative to the parallel MAC-based approaches found in the literature. Moreover, the pipelining scheme improves the speed. The area (resource) and complexity metrics are lower than those of the other designs, and the speed of the proposed design is notably better.

In this research work, the performance metrics were evaluated with regard to area (resources), speed and complexity because the proposed architecture utilizes a single-MAC unit, approximate DA and pipelining. In this design, the single-MAC unit executes the operations on a time-sharing basis, which brings down the complexity to a considerable extent. Also, the approximation-based DA scheme significantly reduces the resources. Moreover, the speed is improved by the incorporation of pipeline registers. In summary, the proposed realization is superior to the prevailing designs in terms of area, speed and complexity.
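The time-sharing idea can be illustrated with a small behavioural model in which a single multiply-accumulate operation is reused once per tap for every output sample. This is a minimal sketch for intuition, not the authors' RTL; the function name `fir_single_mac` and the list-based delay line are assumptions made here.

```python
def fir_single_mac(samples, weights):
    """Behavioural FIR model that reuses one MAC operation across all taps."""
    n_taps = len(weights)
    delay_line = [0] * n_taps          # newest sample first
    outputs = []
    mac_cycles = 0
    for x in samples:
        delay_line = [x] + delay_line[:-1]
        acc = 0
        for k in range(n_taps):        # one MAC operation per "cycle"
            acc += weights[k] * delay_line[k]
            mac_cycles += 1
        outputs.append(acc)
    return outputs, mac_cycles

# Impulse response of a 3-tap filter: the output replays the weights.
outputs, mac_cycles = fir_single_mac([1, 0, 0, 0], [1, 2, 3])
print(outputs, mac_cycles)
```

In this model the hardware cost of the multiplier stays constant as the number of taps grows, while the number of MAC cycles per output sample grows with the filter length; in the proposed design, pipelining is what compensates for this on the speed side.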

5. Conclusions

This research work proposes the DAAFA architecture for radix-4 and radix-8 computation. The proposed architecture mainly employs an approximation-based design of a DA multiplier (ARDA multiplier), pipelining registers and other computational units. The coefficients are evaluated using the LMS scheme, and the step size considered here is µ = 0.06. The approximation disregards the less significant bits of the coefficients, which reduces the filter complexity. The ARDA multiplication unit involves Booth recoding, partial product creation and shift accumulation units, wherein the MAC core performs the multiply operation with the help of shift and add units. The approximation and the replacement of multiplication with a shift accumulation unit allow for the development of an architecture with a smaller area in terms of the number of slices (logic elements) and slice LUTs. Moreover, the delay is reduced significantly by the effective incorporation of registers at appropriate locations. Therefore, the speed of operation is substantially increased for the devised structure when compared with the conventional designs. The proposed DAAFA architecture is implemented for two configurations, namely radix-4 and radix-8. Additionally, the proposed architecture is implemented with 16-tap and 32-tap filters, and the synthesis and analysis are accomplished using Altera and Xilinx FPGA devices. Also, the suggested DAAFA architecture is compared with the other prevailing architectures, and the results are presented.
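For reference, the standard LMS coefficient update with the step size µ = 0.06 can be sketched in floating point as follows. This is a textbook formulation used for illustration only; it does not model the fixed-point, approximate DA datapath of the proposed design, and the system-identification setup (the impulse response `h`, the ±1 test input and the seed) is hypothetical.

```python
import random

def lms_identify(x, d, n_taps, mu=0.06):
    """Sample-by-sample LMS adaptation; returns final weights and error history."""
    w = [0.0] * n_taps                 # adaptive weights
    buf = [0.0] * n_taps               # tapped delay line, newest sample first
    errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wk * xk for wk, xk in zip(w, buf))        # filter output
        e = dn - y                                         # error signal
        w = [wk + mu * e * xk for wk, xk in zip(w, buf)]  # weight update
        errors.append(e)
    return w, errors

# Hypothetical test case: identify a 3-tap system from noiseless data.
rng = random.Random(0)
h = [0.5, -0.25, 0.125]
x = [rng.choice((-1.0, 1.0)) for _ in range(2000)]
d = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
     for n in range(len(x))]
w, errors = lms_identify(x, d, n_taps=len(h))
print([round(wk, 4) for wk in w])
```

With a noiseless desired signal and a unit-power input, the weights settle on the reference impulse response, and the error decays toward zero.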

From the synthesized results, it is inferred that, for the proposed 32-tap radix-4 design, area optimization of about 87% is achieved compared to the conventional design. Moreover, the area of the 32-tap radix-8 structure is reduced by about 80.97% compared with the conventional design approach. In addition, the operating speed of the suggested 32-tap radix-4 and radix-8 architectures is improved by about 79.94% and 79.33%, respectively, over the existing architectures. From the obtained results, it is evident that the proposed structure involves less hardware and faster processing units than the conventional architecture. In addition, the proposed framework attains area minimization of about 94.83% for the radix-8 and 94.62% for the radix-4 implementation when compared with the existing work. Likewise, the proposed structure attains speed improvement of about 92.17% for the radix-8 realization and 92.68% for the radix-4 design over the existing work. The power consumption of the proposed design is not addressed in this work; thus, the future scope of this work includes incorporating power minimization approaches. Also, the proposed structure can be used for high-performance computing and software radios. The designed structure can also be implemented using a quantum approach, which may further reduce the hardware complexity and provide superior performance, supporting the development of recent technologies.

Author Contributions

Conceptualization, B.P.J.; Software, B.P.J.; Validation, M.-F.L.; Resources, D.V.; Writing—original draft, B.P.J.; Writing—review & editing, K.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

This work is supported by Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai, India, under SEED Grant No.: VTU SEED (FY 22-23)-16.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Meher, P.K.; Park, S.Y. Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low Adaptation-Delay. IEEE Trans. Very Large Scale Integr. Syst. 2013, 22, 362–371.
2. Krishna, E.H.; Raghuram, M.; Madhav, K.V.; Reddy, K.A. Acoustic echo cancellation using a computationally efficient transform domain LMS adaptive filter. In Proceedings of the 10th International Conference on Information Science, Signal Processing and Their Applications (ISSPA 2010), Kuala Lumpur, Malaysia, 10–13 May 2010; pp. 409–412.
3. Paul, T.K.; Ogunfunmi, T. On the Convergence Behavior of the Affine Projection Algorithm for Adaptive Filters. IEEE Trans. Circuits Syst. I Regul. Pap. 2011, 58, 1813–1826.
4. Meher, P.K.; Park, S.Y. Critical-Path Analysis and Low-Complexity Implementation of the LMS Adaptive Algorithm. IEEE Trans. Circuits Syst. I Regul. Pap. 2013, 61, 778–788.
5. Meher, P.K.; Chandrasekaran, S.; Amira, A. FPGA Realization of FIR Filters by Efficient and Flexible Systolization Using Distributed Arithmetic. IEEE Trans. Signal Process. 2008, 56, 3009–3017.
6. Yoo, H.; Anderson, D.V. Hardware-efficient distributed arithmetic architecture for high-order digital filters. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, PA, USA, 23 March 2005; Volume 5, pp. v/125–v/128.
7. Mirzaei, S.; Hosangadi, A.; Kastner, R. FPGA Implementation of High Speed FIR Filters Using Add and Shift Method. In Proceedings of the 2006 International Conference on Computer Design, San Jose, CA, USA, 1–4 October 2006; pp. 308–313.
8. Guo, R.; DeBrunner, L.S. Two High-Performance Adaptive Filter Implementation Schemes Using Distributed Arithmetic. IEEE Trans. Circuits Syst. II Express Briefs 2011, 58, 600–604.
9. Mohanty, B.K.; Meher, P.K. A High-Performance Energy-Efficient Architecture for FIR Adaptive Filter Based on New Distributed Arithmetic Formulation of Block LMS Algorithm. IEEE Trans. Signal Process. 2012, 61, 921–932.
10. Meher, P.K.; Park, S.Y. High-throughput pipelined realization of adaptive FIR filter based on distributed arithmetic. In Proceedings of the 2011 IEEE/IFIP 19th International Conference on VLSI and System-on-Chip (VLSI-SoC), Hong Kong, China, 3–5 October 2011; pp. 428–433.
11. Long, G.; Ling, F.; Proakis, J.G. The LMS algorithm with delayed coefficient adaptation. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 1397–1405.
12. Vaithiyanathan, D.; James, B.P.; Mariammal, K. Comparative Study of Single MAC FIR Filter Architectures with Different Multiplication Techniques. In Proceedings of the 2023 Second International Conference on Electrical, Electronics, Information and Communication Technologies (ICEEICT), Trichirappalli, India, 5–7 April 2023; pp. 1–10.
13. Vaithiyanathan, D.; Mariammal, K.; James, B.P. Performance Analysis of Multirate Filter Structures for Signal Processing Applications. In Proceedings of the 2023 3rd International Conference on Smart Generation Computing, Communication and Networking (SMART GENCON), Bangalore, India, 29–31 December 2023; pp. 1–7.
14. Vaithiyanathan, D.; Sonar, S.M.; Parri, J.B.; Mariammal, K.; Kunaraj, K. Performance Analysis of Full Adder Circuit using Conventional and Hybrid Techniques. In Proceedings of the 2021 IEEE Madras Section Conference (MASCON), Chennai, India, 27–28 August 2021; pp. 1–7.
15. Yao, C.-Y.; Huang, Y.Z. The Design of the NLMS Adaptive Filters Using the Fast-Division Approximation With CSD Encoded Divisors. IEEE Trans. Circuits Syst. II Express Briefs 2024, 71, 2459–2463.
16. Jiang, H.; Liu, C.; Liu, L.; Lombardi, F.; Han, J. A review, classification, and comparative evaluation of approximate arithmetic circuits. ACM J. Emerg. Technol. Comput. Syst. 2017, 13, 1–34.
17. King, E.; Swartzlander, E. Data-dependent truncation scheme for parallel multipliers. In Proceedings of the Conference Record of the Thirty-First Asilomar Conference on Signals, Systems and Computers (Cat. No.97CB36136), Pacific Grove, CA, USA, 2–5 November 1997; Volume 2, pp. 1178–1182.
18. Yuvan Shankar, J.; Shakeel Mohammed, G.; Maheshwari, G.; Mariammal, K. Design of Efficient Booth Multiplier based Polyphase FIR Filters. In Proceedings of the Second International Conference on Augmented Intelligence and Sustainable Systems (ICAISS 2023), Trichy, India, 23–25 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1834–1841.
19. Jiang, H.; Han, J.; Qiao, F.; Lombardi, F. Approximate Radix-8 Booth Multipliers for Low-Power and High-Performance Operation. IEEE Trans. Comput. 2015, 65, 2638–2644.
20. Parmar, C.A.; Ramanadham, B.; Darji, A.D. FPGA Implementation of hardware efficient adaptive filter robust to impulsive noise. IET Comput. Digit. Tech. 2017, 11, 107–116.
21. Allred, D.; Yoo, H.; Krishnan, V.; Huang, W.; Anderson, D.V. LMS adaptive filters using distributed arithmetic for high throughput. IEEE Trans. Circuits Syst. I Regul. Pap. 2005, 52, 1327–1337.
22. Rosado-Muñoz, A.; Bataller-Mompean, M.; Soria-Olivas, E.; Scarante, C.; Guerrero-Martinez, J.F. FPGA implementation of an adaptive filter robust to impulsive noise: Two approaches. IEEE Trans. Ind. Electron. 2009, 58, 860–870.
23. Ghosh, S.; Meher, P.K.; Ray, D.; George, N.V. Low Complexity Design of Logistic Distance Metric Adaptive Filter for Impulsive Noise Environments. IEEE Trans. Very Large Scale Integr. Syst. 2024, 32, 1409–1413.
24. Hao, Z.; Sun, H.; Li, S.; Zeng, X.; Fan, Y. Area-Efficient Processing Elements-Based Adaptive Loop Filter Architecture With Optimized Memory for VVC. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 4231–4235.

Figure 1. Traditional adaptive framework.

Figure 2. Partial product generation of radix-4 (R4PPG).

Figure 3. Radix-8 partial product generation (R8PPG).

Figure 4. DAAFA with radix-4 architecture.

Figure 5. DAAFA with radix-8 architecture.

Figure 6. Block for the proposed framework.

Figure 7. FIL block for the proposed structure.

Figure 8. Characteristics with various M = µ.

Figure 9. Verilog result of suggested filter structure.

Figure 10. RTL representation of the suggested structure.

Figure 11. Area (number of slices) analysis of proposed architecture with the reported works.

Figure 12. Speed comparison with the reported works.

Figure 13. Complexity comparison of proposed and reported works [10].

Figure 14. Maximum sampling frequency comparison [10].

Table 1. Radix-8 Booth recoding table.

$w_{i} {(n)}^{3 j + 2}$	$w_{i} {(n)}^{3 j + 1}$	$w_{i} {(n)}^{3 j}$	$w_{i} {(n)}^{3 j - 1}$	$w_{i} {(n)}^{- j}$
0	0	0	0	0
0	0	0	1	+1
0	0	1	0	+1
0	0	1	1	+2
0	1	0	0	+2
0	1	0	1	+3
0	1	1	0	+3
0	1	1	1	+4
1	0	0	0	−4
1	0	0	1	−3
1	0	1	0	−3
1	0	1	1	−2
1	1	0	0	−2
1	1	0	1	−1
1	1	1	0	−1
1	1	1	1	0
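The recoding in Table 1 corresponds to the standard radix-8 Booth rule, in which each overlapping four-bit group yields the digit −4·b(3j+2) + 2·b(3j+1) + b(3j) + b(3j−1). The sketch below applies that rule and then forms the product with shifts and adds only. It is an exact software model written for illustration, so it does not include the approximation (discarding of low-order bits) used in the ARDA multiplier; all function names are hypothetical.

```python
def booth_radix8_digits(w, n_bits=16):
    """Recode a signed n_bits integer into radix-8 Booth digits, least significant first."""
    groups = -(-n_bits // 3)                  # ceil(n_bits / 3)
    total = 3 * groups
    u = w & ((1 << total) - 1)                # two's complement, sign-extended to `total` bits

    def bit(i):
        return (u >> i) & 1 if i >= 0 else 0  # the bit below the LSB is taken as 0

    return [-4 * bit(3 * j + 2) + 2 * bit(3 * j + 1) + bit(3 * j) + bit(3 * j - 1)
            for j in range(groups)]

def partial_product(x, digit):
    """Form digit * x for digit in {-4..4} using shifts and one addition."""
    magnitude = {0: 0, 1: x, 2: x << 1, 3: (x << 1) + x, 4: x << 2}[abs(digit)]
    return -magnitude if digit < 0 else magnitude

def booth_radix8_multiply(x, w, n_bits=16):
    """Shift-accumulate the partial products; digit j carries weight 8**j."""
    acc = 0
    for j, digit in enumerate(booth_radix8_digits(w, n_bits)):
        acc += partial_product(x, digit) << (3 * j)
    return acc

print(booth_radix8_digits(7, n_bits=6))   # digits of 7 in two groups
print(booth_radix8_multiply(123, -45))
```

The only non-trivial partial product is 3x, which costs one extra addition (2x + x); this is the usual reason radix-8 recoding trades fewer partial products for a slightly more complex partial product generator.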

Table 2. Complexity measure.

Devised Realization	Complexity
Filter taps	2	8	16	32
Multiplier	3	9	17	33
Register	8	14	22	38
Latency	6	6	6	6

Table 3. Comparison of different algorithms.

Algorithm	Multiplier	Adder	Convergence Speed
LMS	N	N	High
RMN	2N + 3	2N + 2	Low
Robust	N + 3	N + 5	Low
MRMN	N	N	High
DAAFA	2	N + 1	High
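To make the expressions in Table 3 concrete, the short sketch below evaluates the multiplier and adder counts for a given filter length N. The formulas are copied from the table; the dictionary and function names are illustrative only.

```python
# Multiplier and adder counts as functions of the filter length N (from Table 3).
OPERATOR_COUNTS = {
    "LMS":    (lambda n: n,         lambda n: n),
    "RMN":    (lambda n: 2 * n + 3, lambda n: 2 * n + 2),
    "Robust": (lambda n: n + 3,     lambda n: n + 5),
    "MRMN":   (lambda n: n,         lambda n: n),
    "DAAFA":  (lambda n: 2,         lambda n: n + 1),
}

def operator_counts(n_taps):
    """Return {name: (multipliers, adders)} for a filter with n_taps taps."""
    return {name: (mult(n_taps), add(n_taps))
            for name, (mult, add) in OPERATOR_COUNTS.items()}

for n_taps in (16, 32):
    print(n_taps, operator_counts(n_taps))
```

For N = 16, the table entries give 2 multipliers and 17 adders for DAAFA against 16 multipliers and 16 adders for the plain LMS structure.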

Table 4. Synthesis summary of the realized structure.

Performance Measures	Existing Approximate DA-Radix-4		Proposed DAAFA-Radix-4		Existing Approximate DA-Radix-8		Proposed DAAFA-Radix-8
Device	Xilinx Virtex-5 device
Filter length	16	32	16	32	16	32	16	32
Slices	708	1501	162	195	702	1498	166	285
MSP (ns)	10.697	15.8	3.146	3.168	11.20	15.1	2.92	3.12
MSF (MHz)	93.484	63.29	317.9	315.65	89.28	66.25	342.29	320.51

Table 5. Utilization summary comparison of the architectures.

Design	MSP (ns)	MSF (MHz)	NOS	SREG	SLUT	% Improvement in Slice-Delay Product
Device	XILINX VIRTEX-5 XC5VSX95T-1FF1136
Meher (2011) [10]	17.35	57	178	412	267	-
DAAFA with radix-8	2.92	342.46	166	342	247	84
DAAFA with radix-4	3.02	331.12	164	333	236	83

Table 6. Area (logic elements) analysis.

Design	Total Logic Elements
Device	Altera Stratix EP1S80F1508C6 FPGA
Filter taps	16	32
Allred (2005) [k = 2] [21]	1309	2244
Allred (2005) [k = 4] [21]	915	1429
Allred (2005) [k = 8] [21]	798	1073
DAAFA with Radix-8	645	998
DAAFA with Radix-4	630	985

Table 7. Design analysis of filter structures.

Parameters	Alfredo Rosado-Muñoz [22]	Chintan A. Parmar [20]	DAAFA-Radix-8	DAAFA-Radix-4
Device	Xilinx Virtex-4 XC4VFX12 FF6618-12 FPGA
Slices	2586	629	210	215
Four-input LUTs	4777	628	204	213
MSP (ns)	52.71	35.84	3.231	3.18
MSF (MHz)	19.16	27.90	309.50	314.46

Table 8. Synthesis summary of filter structures.

Parameters	Alfredo Rosado-Muñoz [22]	Chintan A. Parmar [20]	DAAFA-Radix-8	DAAFA-Radix-4
Device	Xilinx Virtex-5 XC5VLX30 FF324-3 FPGA
Slices	3906	643	202	210
Four-input LUTs	3513	895	197	208
MSP (ns)	39.6	31.19	3.1	2.9
MSF (MHz)	25.25	32.060	322.58	344.82

Table 9. Design results of filter structures.

Parameters	Alfredo Rosado-Muñoz [22]	Chintan A. Parmar [20]	DAAFA-Radix-8	DAAFA-Radix-4
Device	Xilinx Spartan 3E XCS500E-4 FPGA
Slices	2429	277	206	225
Four-input LUTs	4525	438	224	237
MSP (ns)	105.49	20.05	4.9	4.3
MSF (MHz)	9.48	49.863	204.08	232.55

Table 10. Synthesis results of filter structures (device: Altera Cyclone-IV).

Design	MSP (ns)	MSF (MHz)	Logic Elements	Registers	Dynamic Power (mW)
Chia-Yu Yao (2024) [15]	8.27	120.91	11,422	2531	99.04
DAAFA with radix-8	3.12	320.51	502	375	89.14
DAAFA with radix-4	3.07	325.73	504	384	91.04

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

