## 3.2.2. Module 2

Figure 11 presents the block diagram of Module 2 (i.e., the sorting buffer), which comprises six chains of shift registers (SRs) and multiplexers. The six data streams from Module 1 were routed to the six SR chains (1–6) through paths A to F, respectively. Module 2 conducted the intermediate data sorting for the radix-2/8/8 operations. For the radix-2/8 calculation, the six-stream radix-2 data from Module 1 were stored in the six SR chains and then selectively read out to Module 4 for the first BF8 operations. The calculated results were stored back into SR chain 1 through configurable path routing, in preparation for the subsequent radix-8/8 calculation. To continue the radix-8/8 operations, eight data paths were selected from SR chain 1 and output to Module 3 (CMU) for W64 multiplication and the subsequent execution of the second BF8. As shown in Figure 11, each SR chain was built from eight SR units, which were classified into two types: Type I and Type II. Both types comprised eight flip-flop stages with front multiplexers to support data shift and hold. Type I was used for SR chain 1 and provided parallel data access to the eight flip-flop stages. Type II was used for SR chains 2–6 and optionally enabled routing input from other SR units.

**Figure 11.** Block diagram of Module 2 (sorting buffer).
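The shift/hold behavior and the parallel access of a Type I SR unit described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class name `SRUnit`, its methods, and the use of a Python list for the eight flip-flop stages are hypothetical and are not taken from the design itself.

```python
class SRUnit:
    """Behavioral sketch of one eight-stage SR unit (hypothetical model)."""

    def __init__(self, depth=8):
        # stages[0] is the newest sample, stages[-1] the oldest one
        self.stages = [None] * depth

    def clock(self, data_in=None, shift=True):
        """One clock edge: shift in data_in, or hold the current contents.

        The shift flag stands in for the front multiplexer select.
        Returns the sample shifted out (None when holding).
        """
        if not shift:
            return None
        data_out = self.stages[-1]
        self.stages = [data_in] + self.stages[:-1]
        return data_out

    def parallel_read(self):
        """Parallel access to all stages (assumed to be Type I only)."""
        return list(self.stages)


unit = SRUnit()
for sample in range(8):
    unit.clock(sample)           # serial shift-in of one 8-sample section
snapshot = unit.parallel_read()  # all eight stages visible at once
```

In this model a Type II unit would simply not expose `parallel_read`, and its `data_in` could come from another SR unit instead of the preceding stage.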

Figure 12 details the sorting operations of Module 2 corresponding to the periodic 10 TS units (i.e., TS0 to TS9) in Figure 8. In Figure 12, the data of each of the six streams A, B, ..., F from Module 1 (Figure 11) are represented as eight 8-sample data sections (i.e., X0, X1, ..., X7, where X refers to A, B, ..., or F). A, B, ..., F indicate the current six streams, whereas A', B', ..., F' denote the next six streams. At TS0, Figure 12 shows that seven data sections of each stream (e.g., A0–A6 and B0–B6) had been prepared in the SR chains. From TS0 to TS5, the first BF8 operation was performed for the eight-section A, B, ..., F sequences, and the calculated results were stored back to SR chain 1. In the following TSs (i.e., TS1–TS6), the first BF8 outcomes stored in SR chain 1 for the B–F streams were sent out for the CMU and second BF8 operations. The data access schemes of SR chain 1 for the first BF8 (stored in) and the CMU/second BF8 (sent out) were alternated to satisfy the radix-8/8 data permutation [Equation (5)]. This alternation was enabled by the Type I SR (Figure 11), which provided dual serial/parallel data shifts. As shown in Figure 12, the data of the six incoming sequences, namely A', B', ..., F', were sent to Module 2 after TS3. By appropriately routing the data across the SR chains (the red dashed lines in Figures 11 and 12), the eight-section data of the A', B', ..., F' streams were regularly arranged in the SR chains in preparation for the next period at TS7/TS8/TS9. Therefore, the proposed sorting buffer scheme allowed Module 2 to efficiently access the intermediate data for radix-2/8/8 operations. Module 2 required 64 word storage elements per stream (eight SR units of eight stages each). Combined with the two 32-element RAMs of Module 1, the total number of storage elements was 128. This facilitates the use of the minimum memory size for the 128-point FFT, at the level of SDF/MDF-based designs [9,24–26].

**Figure 12.** Details of operations in TS0-TS9 (Figure 8) for Module 2 (sorting buffer).

## 3.2.3. Module 3

Module 3 (CMU) performed the intermediate W128 and W64 multiplications for the radix-2/8/8 algorithm. According to the π/4 symmetry of twiddle factors [32], W64 multiplications could be calculated using only eight factors, $W_{64}^{p} = \exp(-j2\pi p/64)$ with $p = 1, 2, \ldots, 8$, in cooperation with sweeping and sign controls [24,25]. Considering the six eight-cycle TSs (i.e., TS1–TS6) executed by the second BF8 in Figure 12, the CMU/W64 multiplication for each cycle within a single TS period could be performed using the eight-path $W_{64}^{p}$ factors, which are listed by their $p$ parameters in Table 2. Appropriate rescheduling of cycles 2, 4, and 6 (the red arrow in Table 2) could reduce conflicts in $W_{64}^{p}$ multiplication and thereby minimize the number of required $W_{64}^{p}$ constant multipliers. Additional multiplexer controls and registers were introduced to Module 2 to enable this rescheduling. For W128 multiplication, the π/4-symmetry check increased the required number of $W_{128}^{p}$ factors to sixteen ($p = 1, 2, \ldots, 16$) compared with $W_{64}^{p}$. However, the scheme in [33] can be employed to decompose $W_{128}^{p'}$ according to Equation (6) or (7), depending on whether the parameter $p'$ is even or odd. Because $W_{64}^{k}$ and $W_{64}^{k\pm1}$ could each be obtained using one of the original $W_{64}^{p}$ factors, $W_{128}^{p'}$ could be derived as a combined calculation of $W_{128}^{1\text{ or }3}$ and $W_{64}^{p}$. Therefore, only $W_{128}^{1\text{ or }3}$ constant multipliers were required to perform W128 multiplication alongside the existing $W_{64}^{p}$ constant multipliers.

$$W_{128}^{p'} = W_{128}^{2k} = W_{64}^{k}, \quad \text{when } p' \text{ is even} \tag{6}$$

$$\begin{aligned} W_{128}^{p'} = W_{128}^{2k+1} &= W_{128}^{1} W_{64}^{k} = W_{128}^{-1} W_{64}^{k+1} \\ &= W_{128}^{3} W_{64}^{k-1} = W_{128}^{-3} W_{64}^{k+2}, \quad \text{when } p' \text{ is odd} \end{aligned} \tag{7}$$
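The decomposition can be checked numerically. The following is a minimal sketch, assuming floating-point twiddle factors rather than the fixed-point constant multipliers of the hardware; the function names (`twiddle`, `w128_via_w64`, `odd_forms`, `max_decomposition_error`) are illustrative and do not appear in the design. Only the products with $W_{128}^{\pm 1}$ and $W_{128}^{3}$ are exercised.

```python
import cmath


def twiddle(n, p):
    """Return W_n^p = exp(-j*2*pi*p/n)."""
    return cmath.exp(-2j * cmath.pi * p / n)


def w128_via_w64(p):
    """Build W_128^p from a W_64 factor, plus W_128^1 when p is odd."""
    k, odd = divmod(p, 2)
    if not odd:
        return twiddle(64, k)                 # even case, Equation (6)
    return twiddle(128, 1) * twiddle(64, k)   # first product of Equation (7)


def odd_forms(k):
    """Three equivalent products for W_128^(2k+1) taken from Equation (7)."""
    return [
        twiddle(128, 1) * twiddle(64, k),
        twiddle(128, -1) * twiddle(64, k + 1),
        twiddle(128, 3) * twiddle(64, k - 1),
    ]


def max_decomposition_error():
    """Largest deviation from the directly computed W_128^p over one period."""
    return max(abs(w128_via_w64(p) - twiddle(128, p)) for p in range(128))
```

Calling `max_decomposition_error()` should return a value at the level of floating-point rounding, which indicates that a single extra $W_{128}^{1}$ (or $W_{128}^{3}$) factor on top of the $W_{64}$ factors covers every W128 exponent.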

As mentioned previously (Figures 9 and 10), six streams of data were accessed between Modules 1 and 3 for W128 multiplications. To avoid conflicts caused by the six streams accessing identical W128 factors in the same cycle, a one-sample shift was sequentially applied to the six paths of the CMU-related data accesses of the RAM banks (from Module 1, Figure 9). As shown in Figure 13, the architecture of Module 3 accepts six streams of data from Module 1 for multiplications involving W128 factors, and eight data paths from Module 2 for multiplications involving W64 factors (the two arrow lines in Figure 13). Several duplicated $W_{128}^{3}$ and $W_{64}^{4}$ constant multipliers were used to manage the residual conflicting W128 and W64 multiplications, such as in cycle 4 of Table 2, where the $W_{64}^{4}$ multiplication is performed twice. On the basis of our evaluation, the overall CMU (Module 3) covered an area equivalent to 4.18 complex multipliers (each structured as four real multipliers and two adders). This was more area-efficient than the direct approach of using six or eight complex multipliers for the six-stream or eight-path operations.


**Table 2.** List and scheduling of the eight-path $W_{64}^{p}$ factors (given as $p$) for each cycle in a given TS.

**Figure 13.** Block diagram of Module 3 (CMU).
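The π/4 symmetry that limits the CMU to eight stored $W_{64}^{p}$ factors can be sketched as follows. This is an illustration under assumptions, not the circuit's control logic: the "sweeping" control is interpreted here as a swap of the real and imaginary parts, the trivial factor $W_{64}^{0} = 1$ is added to the table for convenience, and the names `STORED` and `w64` are hypothetical.

```python
import cmath

N = 64

# Eight stored factors W_64^p for p = 1..8, plus the trivial W_64^0 = 1
# (the p = 0 entry is an assumption added for convenience).
STORED = {p: cmath.exp(-2j * cmath.pi * p / N) for p in range(1, 9)}
STORED[0] = complex(1.0, 0.0)


def w64(p):
    """Return W_64^p using only the stored factors, swaps, and sign changes."""
    p %= N
    quadrant, r = divmod(p, 16)
    if r <= 8:
        # First octant: use the stored factor directly.
        re, im = STORED[r].real, STORED[r].imag
    else:
        # Second octant: mirror about pi/4 (swap parts and negate both).
        f = STORED[16 - r]
        re, im = -f.imag, -f.real
    for _ in range(quadrant):
        # Multiplying by -j is again a swap with one sign change.
        re, im = im, -re
    return complex(re, im)
```

Every branch only selects, swaps, or negates stored values, so no multiplier beyond the eight constants is implied by the mapping itself.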

## 3.2.4. Module 4

The two BF8 units in Module 4 were used for the first and second BF8 operations, respectively. As shown in Figure 14, each BF8 unit used three radix-2 stages based on the radix-2<sup>3</sup> algorithm [9]. In each radix-2 stage, four BF2 elements were used for the eight-path BF8 operations. Moreover, one $W_{8}^{1}$ and one $W_{8}^{3}$ constant multiplier were required between radix-2 stages 2 and 3.

**Figure 14.** Block diagram of the first or second BF8 unit in Module 4.
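A three-stage BF8 of this kind can be sketched in software as a textbook radix-2<sup>3</sup> decimation-in-frequency flow. The sketch assumes natural-order input and output and places the trivial −j rotations where the standard derivation puts them; the actual wiring in Figure 14 may differ. The names `bf2`, `bf8`, and `dft8` are illustrative only.

```python
import cmath

W8_1 = cmath.exp(-2j * cmath.pi * 1 / 8)  # constant multiplier W_8^1
W8_3 = cmath.exp(-2j * cmath.pi * 3 / 8)  # constant multiplier W_8^3


def bf2(a, b):
    """Radix-2 butterfly: sum and difference of two inputs."""
    return a + b, a - b


def bf8(x):
    """8-point DFT built from three radix-2 stages of four BF2 each."""
    # Stage 1: four BF2 on the pairs (n, n + 4)
    u0, d0 = bf2(x[0], x[4])
    u1, d1 = bf2(x[1], x[5])
    u2, d2 = bf2(x[2], x[6])
    u3, d3 = bf2(x[3], x[7])
    # Trivial -j rotations before stage 2 (assumed placement)
    d2 *= -1j
    d3 *= -1j
    # Stage 2: four BF2 on the pairs (n, n + 2) within each half
    p0, q0 = bf2(u0, u2)
    p1, q1 = bf2(u1, u3)
    s0, t0 = bf2(d0, d2)
    s1, t1 = bf2(d1, d3)
    # Between stages 2 and 3: one trivial -j, one W_8^1, one W_8^3
    q1 *= -1j
    s1 *= W8_1
    t1 *= W8_3
    # Stage 3: four BF2 producing the eight outputs
    x0, x4 = bf2(p0, p1)
    x2, x6 = bf2(q0, q1)
    x1, x5 = bf2(s0, s1)
    x3, x7 = bf2(t0, t1)
    return [x0, x1, x2, x3, x4, x5, x6, x7]


def dft8(x):
    """Direct 8-point DFT used as a reference."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / 8) for n in range(8))
            for k in range(8)]
```

Comparing `bf8(x)` with `dft8(x)` for an arbitrary eight-sample input gives matching results up to rounding, with only two non-trivial constant multiplications in the whole butterfly.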

#### **4. Results and Comparison**
