#### **1. Introduction**

With the growth of household and industrial applications, wireless data access between devices and users is in increasing demand. Among the various wireless communication schemes, multiple-input, multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) technology [1] offers superior transmission performance and is widely employed in numerous standards, such as WiFi (IEEE 802.11n/ac/ax), Long Term Evolution (LTE), and advanced 5G New Radio (NR) [2–5]. In practical applications, to access data from various devices, the user side must structure its MIMO-OFDM communication according to the standard employed by each individual device. Therefore, a user-side transceiver that can perform multistandard MIMO-OFDM transmission for different devices is useful (Figure 1).

**Figure 1.** Schematic illustration of multistandard multiple-input, multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) data access applications.

For MIMO-OFDM communication between users and devices, Figure 2 shows the block diagram of a conventional *M* × *M* MIMO-OFDM transmission model. A transmitter (TX) sends *M* streams of data symbols through *M* antennas. To reduce intersymbol interference (ISI) in OFDM communication, a guard interval (GI) is applied to each symbol by copying a number of samples from the end of the symbol to its head. A receiver (RX) then uses *M* antennas to receive the signals, which arrive through the *M* × *M* MIMO channel. Figure 2 shows that the RX part contains *M* sets of radio-frequency (RF) units, analog-to-digital converters (ADCs), fast Fourier transforms (FFTs), and the subsequent equalization and demodulation blocks. For the given *N* samples in an OFDM symbol, an *M* × *M* MIMO-OFDM receiver requires *M*-stream *N*-point FFT operations, and the TX must perform *M*-stream *N*-point inverse FFT (IFFT) operations. Therefore, a set of *M*-stream *N*-point FFT (IFFT) processors is an essential component of the MIMO-OFDM system. In general, the maximum *M* and *N* values for standard MIMO-OFDM FFT are 8 and 2048, respectively [6,7], and they can be higher in 5G applications [5]. Therefore, the hardware cost and operation throughput of the required FFT/IFFT processors are significant design concerns. Moreover, for multistandard MIMO-OFDM applications, implementing an effective scheme for the multimode hardware configuration of the FFT/IFFT design is a difficult task. In this study, an FFT kernel design that can serve as a base module for reconfigurable FFT/IFFT processors in a multistandard MIMO-OFDM system is proposed. The proposed FFT kernel performs 4/5/6-stream 64/128-point FFT operations by efficiently using the GI duration with resource-sharing and operation-rescheduling schemes. Therefore, the presented FFT kernel supports area-efficient and high-throughput development of multistandard MIMO-OFDM FFT/IFFT processors.

The remainder of the paper is organized as follows. Section 2 outlines the design considerations and related literature for the FFT design. Section 3 presents the hardware architecture of the proposed FFT kernel. Section 4 reports the implementation and comparison of the proposed FFT kernel design. Finally, conclusions are presented in Section 5.

**Figure 2.** Block diagram for the conventional *M* × *M* MIMO-OFDM transmission model.
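
The GI insertion described for the transmission model can be illustrated with a few lines of Python. This is a minimal sketch only: the function names and the toy 8-sample symbol are hypothetical, and real symbols would hold complex IFFT outputs rather than integers.

```python
def add_guard_interval(symbol, gi_len):
    """Prepend the last gi_len samples of an OFDM symbol to its head."""
    if not 0 <= gi_len <= len(symbol):
        raise ValueError("gi_len must be between 0 and the symbol length")
    if gi_len == 0:
        return list(symbol)
    return list(symbol[-gi_len:]) + list(symbol)


def remove_guard_interval(rx_symbol, gi_len):
    """Drop the guard interval at the receiver before the FFT."""
    return list(rx_symbol[gi_len:])


symbol = list(range(8))               # toy 8-sample symbol (assumption)
tx = add_guard_interval(symbol, 2)    # GI of 1/4 of the symbol duration
print(tx)                             # [6, 7, 0, 1, 2, 3, 4, 5, 6, 7]
print(remove_guard_interval(tx, 2) == symbol)   # True
```

The transmitted symbol is longer than the FFT length by the GI length, which is the extra time that the proposed kernel later exploits.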

#### **2. FFT Design Considerations**

#### *2.1. Related Works for MIMO-OFDM FFT*

Various hardware architectures have been developed to perform the FFT algorithm. Among these, the memory-based [8] and pipelined [9] architectures are the two primary categories. The memory-based FFT configuration employs one or several processing elements (PEs) in cooperation with memory modules. This structure is suitable for flexible FFT operations that support various numbers of FFT streams or operation points (i.e., FFT lengths) [10–12]. However, it is not appropriate for high-throughput or low-latency FFT operations when the operability of the PEs or the memory bandwidth is limited [13–15]. Owing to this flexibility in FFT length and stream count, numerous memory-based FFT designs have been presented for MIMO-OFDM applications [15–17]. However, most of these approaches employ low-radix PEs (i.e., radix-*r*, where *r* ≤ 16), which limits throughput performance. The pipelined FFT architecture improves operation throughput and latency by using additional hardware resources [9]. The two conventional pipelined FFT categories are the multipath delay commutator (MDC) [9,18] and single-path delay-feedback (SDF) architectures [9,19].

The MDC structure features multipath (i.e., parallel) feedforward operations performed using switch control with first-in, first-out (FIFO) memory. By effectively using the parallel data paths and operation units, the multipath processing feature of MDC can be applied to the FFT computation of multiple streams [20]. Therefore, several FFT designs based on MDC structures have been presented for MIMO-OFDM applications [20–23]. Figure 3 shows the block diagram of the conventional *M*-path MDC architecture applied to MIMO-OFDM FFT of *M* streams, where the feedforward multipath data are processed using switch blocks, FIFOs, butterfly units, and multipliers. The SDF architecture provides a feedback path for FIFOs to efficiently manage the butterfly operation data at each pipeline stage [9]. To enable parallel processing for SDF configurations, the extended multipath delay-feedback (MDF) architecture [24,25] was proposed. Because the MDF scheme can process multiple input streams, MDF architectures have been studied for the multistream FFT operations required in MIMO-OFDM systems [7,25–29]. On the basis of the MDF structure using the radix-2<sup>*k*</sup> algorithm (where *k* is a positive integer) [30], an *N*-point FFT unit has log<sub>2</sub>(*N*) radix-2 operation stages, each involving a feedback-path FIFO and a butterfly 2 (BF2) unit [9,30]. Figure 4 shows a conventional *M*-path MDF FFT structure for *M* × *M* MIMO-OFDM applications. Furthermore, some previous studies [21,25,28] discuss hardware cost reduction schemes for the multipliers located on the multiple paths (Figures 3 and 4). In general, these MDC or MDF approaches [20–29] enable multipath/parallel operations associated with multiple FFT streams (i.e., the number of data paths equals the number of FFT streams).
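
The log<sub>2</sub>(*N*) stage count mentioned above can be illustrated with the textbook radix-2 delay-feedback pipeline, in which the feedback FIFO halves from one stage to the next. The sketch below assumes that textbook arrangement (decimation in frequency, one FIFO per stage and per path); it is not the specific structure of any cited design, and the function name is hypothetical.

```python
def sdf_stage_fifo_depths(n_points):
    """FIFO depth of each radix-2 delay-feedback stage, first stage first."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of 2, at least 2")
    stages = n_points.bit_length() - 1          # log2(N) BF2 stages
    return [n_points >> s for s in range(1, stages + 1)]


depths = sdf_stage_fifo_depths(128)
print(len(depths), depths)   # 7 [64, 32, 16, 8, 4, 2, 1]
print(sum(depths))           # 127 samples of FIFO per path
```

Under the same assumption, an *M*-path arrangement with one such pipeline per stream would need *M* times this FIFO capacity, which is why sharing hardware across paths matters for area.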

**Figure 3.** Block diagram of conventional *M*-path MDC fast Fourier transform (FFT) architecture for *M*-stream MIMO-OFDM.

**Figure 4.** Block diagram of conventional *M*-path MDF FFT architecture for *M*-stream MIMO-OFDM.

#### *2.2. Kernel-Based FFT for Multiple Standards*

Most MIMO-OFDM standards have variable FFT lengths and stream counts; thus, the corresponding MDC or MDF FFT designs generally employ a reconfigurable structure to support the various FFT operation modes defined by the target specification [7,20,22]. In addition, considerable research has been conducted on advanced restructuring schemes for multimode FFT processors that support multiple MIMO-OFDM standards [28,29]. For such multimode/multistandard MIMO-OFDM FFT applications, the aforementioned studies employ tailored reconfiguration methods to enable multimode FFT operations based on their individual MDC/MDF structures. Because these methods are tied to each individual structure, the approaches lack the design flexibility and convenience needed to develop a multistandard MIMO-OFDM FFT architecture.

However, among the mainstream MIMO-OFDM standards (WiFi, LTE, and 5G applications), 1- to 4-stream 128/64-point FFT serves as a base operation mode [20–23,25–28]. This suggests an approach in which a common, optimized four-stream 128/64-point FFT kernel is developed as a base module to construct specified *M*-stream *N*-point FFT/IFFT processors. Figure 5 presents this scheme for the MDF FFT architecture. Considering *M*-stream *N*-point FFT computation (*M* > 4; *N* > 128, and *N* is a power of 2), each *N*-point FFT can be performed using an *N*′-point FFT in conjunction with a 128-point FFT based on the radix-*N*′/128 FFT algorithm, where *N*′ is equal to *N*/128, and the *M*-stream FFT can be implemented in ⌈*M*/4⌉ sets of four-stream FFT, where ⌈·⌉ denotes the ceiling operation. Figure 5 shows a four-stream 128/64-point FFT kernel prepared first as a common base module. This kernel module is then integrated with a front four-stream *N*′-point FFT computation unit to complete a set of four-stream *N*-point FFT. Such a configuration (FFT kernel and *N*′-point FFT unit) can be extended to ⌈*M*/4⌉ sets to implement complete *M*-stream *N*-point FFT operations. For hardware modularization, kernel-based FFT configurations are more efficient and flexible when applied to scalable *M*-stream *N*-point FFTs for multiple MIMO-OFDM standards. The configuration in Figure 5 can be further modified and extended to integrate a conventional FFT kernel with MDC or memory-based units for the target FFT computation. For four-path 128/64-point FFT kernel designs, employing available FFT processors that meet the same FFT specification is an efficient approach [21,25]. Nevertheless, for system optimization, the specific development of a hardware-efficient FFT kernel as a common module can be advantageous.

In this study, the target FFT kernel module was designed on the basis of a modified MDF structure. In addition to the original four-stream 128/64-point FFT operations, the proposed FFT kernel used the GI duration to enable five- and six-stream FFT operations, thereby enhancing the overall throughput with improved area efficiency. The concept of GI utilization in FFT design was mentioned in [31] for the purpose of improving the operation latency. In our design, we further utilized the GI in combination with a resource-sharing scheme to increase the operation throughput. The details of the proposed design are discussed in Sections 3 and 4.
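
The kernel-based partitioning described above reduces to two small calculations: the length of the front FFT stage (*N*/128) and the number of four-stream kernel sets (the ceiling of *M*/4). The following sketch assumes a conventional four-stream 128-point kernel; the function and constant names are hypothetical.

```python
import math

KERNEL_STREAMS = 4     # streams handled by one conventional base kernel
KERNEL_POINTS = 128    # FFT length handled by one base kernel


def kernel_plan(m_streams, n_points):
    """Return (front-stage FFT length N/128, number of kernel sets)."""
    if n_points < KERNEL_POINTS or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of 2, at least 128")
    if m_streams < 1:
        raise ValueError("m_streams must be positive")
    n_front = n_points // KERNEL_POINTS
    kernel_sets = math.ceil(m_streams / KERNEL_STREAMS)
    return n_front, kernel_sets


print(kernel_plan(8, 2048))   # (16, 2)
print(kernel_plan(6, 1024))   # (8, 2)
```

With a four-stream kernel, a six-stream configuration needs two kernel sets even though the second set is only half used, which motivates a kernel that can absorb additional streams.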

**Figure 5.** Block diagram showing the hardware architecture of an *M*-stream *N*-point FFT processor using a four-stream 128-point FFT kernel.

#### **3. Proposed FFT Kernel Architecture**

#### *3.1. Algorithm and Architecture Overview*

In conventional MDC/MDF FFT schemes [20–26], the number of paths is equal to the number of FFT streams. By contrast, a mixed-radix, mixed-multipath delay-feedback (MRM2DF) architecture was developed in this study, which operates on a radix-dependent number of paths. For 128-point FFT operations, the MRM2DF design operates based on the mixed radix-2/8/8 algorithm, which can be described using Equations (1)–(5). The 128-point discrete Fourier transform (DFT) of a time-domain sequence *x*(*n*) is defined in Equation (1), where *X*(*k*) and *W*<sub>128</sub><sup>*nk*</sup> = exp(−*j*2π*nk*/128) are the DFT results and twiddle factor, respectively. When *n* and *k* are represented using Equation (2), Equation (1) can be rewritten as Equation (3), a two-step computation based on radix-2 and radix-64 operations represented as BF2 and 64-point DFT, respectively. Between the BF2 and 64-point DFT operations, 128-based (W128) twiddle factors must be multiplied with the BF2 output. Furthermore, the 64-point DFT operation in Equation (3) can be further decomposed into the radix-8/8 operation.

By using Equation (4), a 64-point DFT can be derived as Equation (5), where two-stage radix-8 operations are performed. The two radix-8 stages correspond to the first and second butterfly 8 (BF8) operations, respectively. Similarly, 64-based (W64) twiddle factor multiplication is required between the first and second BF8 operations. For 64-point FFT operations, the MRM2DF design executes the radix-8/8 algorithm of Equation (5).

$$X(k) = \sum_{n=0}^{127} x(n)\, W_{128}^{nk}, \quad k = 0, 1, \ldots, 127 \tag{1}$$

$$\begin{cases} n = 64n_1 + n_2, & n_1 = 0, 1;\ n_2 = 0, 1, \ldots, 63 \\ k = k_1 + 2k_2, & k_1 = 0, 1;\ k_2 = 0, 1, \ldots, 63 \end{cases} \tag{2}$$

$$\begin{aligned} X(2k_2 + k_1) &= \sum_{n_2=0}^{63} \sum_{n_1=0}^{1} x(64n_1 + n_2)\, W_{128}^{(64n_1 + n_2)(2k_2 + k_1)} \\ &= \sum_{n_2=0}^{63} \Bigg\{ \underbrace{\sum_{n_1=0}^{1} x(64n_1 + n_2)\, W_{2}^{n_1 k_1}\, W_{128}^{n_2 k_1}}_{\text{BF2 (radix-2) operation}} \Bigg\} W_{64}^{n_2 k_2} = \underbrace{\sum_{n_2=0}^{63} \mathrm{BF2}(k_1, n_2)\, W_{64}^{n_2 k_2}}_{\text{64-point DFT (radix-64)}} \end{aligned} \tag{3}$$

$$\begin{cases} n_2 = 8\alpha_1 + \alpha_2, & \alpha_1, \alpha_2 = 0, 1, \ldots, 7 \\ k_2 = \beta_1 + 8\beta_2, & \beta_1, \beta_2 = 0, 1, \ldots, 7 \end{cases} \tag{4}$$

$$\begin{aligned} X(2(\beta_1 + 8\beta_2) + k_1) &= \sum_{\alpha_2=0}^{7} \sum_{\alpha_1=0}^{7} \mathrm{BF2}(k_1, 8\alpha_1 + \alpha_2)\, W_{64}^{(8\alpha_1 + \alpha_2)(\beta_1 + 8\beta_2)} \\ &= \underbrace{\sum_{\alpha_2=0}^{7} \Bigg\{ \underbrace{\sum_{\alpha_1=0}^{7} \mathrm{BF2}(k_1, 8\alpha_1 + \alpha_2)\, W_{8}^{\alpha_1 \beta_1}}_{\text{1st BF8 (radix-8) operation}} \Bigg\} W_{64}^{\alpha_2 \beta_1}\, W_{8}^{\alpha_2 \beta_2}}_{\text{2nd BF8 (radix-8) operation}} \end{aligned} \tag{5}$$
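
The radix-2/8/8 decomposition can be checked numerically against a direct DFT. The sketch below follows the index maps of Equations (2) and (4) in plain Python; it models only the arithmetic, not the hardware scheduling, and all function names and the test sequence are hypothetical.

```python
import cmath


def w(n, exp):
    """Twiddle factor W_n^exp = exp(-j*2*pi*exp/n)."""
    return cmath.exp(-2j * cmath.pi * exp / n)


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[i] * w(n, i * k) for i in range(n)) for k in range(n)]


def dft64_radix8x8(y):
    """64-point DFT as two radix-8 stages with W64 twiddles in between."""
    out = [0j] * 64
    for b1 in range(8):
        # First BF8 stage, then the W64 twiddle, one value per alpha_2.
        first = [
            sum(y[8 * a1 + a2] * w(8, a1 * b1) for a1 in range(8))
            * w(64, a2 * b1)
            for a2 in range(8)
        ]
        for b2 in range(8):
            # Second BF8 stage yields output index k2 = b1 + 8*b2.
            out[b1 + 8 * b2] = sum(first[a2] * w(8, a2 * b2) for a2 in range(8))
    return out


def fft128_radix2x8x8(x):
    """128-point DFT via BF2, W128 twiddles, then a 64-point DFT."""
    out = [0j] * 128
    for k1 in range(2):
        # BF2: sum (k1 = 0) or difference (k1 = 1), then the W128 twiddle.
        bf2 = [
            (x[n2] + (-1) ** k1 * x[64 + n2]) * w(128, n2 * k1)
            for n2 in range(64)
        ]
        for k2, value in enumerate(dft64_radix8x8(bf2)):
            out[2 * k2 + k1] = value
    return out


x = [complex((3 * n) % 7, (5 * n) % 11) for n in range(128)]
err = max(abs(a - b) for a, b in zip(fft128_radix2x8x8(x), dft(x)))
print(err < 1e-9)   # True
```

The outputs are written directly to their natural index *k* = 2*k*<sub>2</sub> + *k*<sub>1</sub>, so no separate reordering step appears here; a hardware pipeline would typically handle that ordering in a buffer.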

In accordance with the radix-2/8/8 algorithm, the proposed MRM2DF kernel operated with hybrid path configurations (i.e., mixed-multipath). The hardware units associated with the radix-2 (BF2) computation allowed operations on at most six data paths (i.e., 4/5/6 paths), depending on the number of streams. By contrast, the radix-8 (BF8) hardware units allowed operations on eight paths, corresponding to the radix. A shared hardware module was employed to perform W128 and W64 twiddle factor multiplication in six and eight paths, respectively. When performing 64-point FFT, the MRM2DF kernel performed only eight-path operations based on the radix-8/8 algorithm. The features of the proposed MRM2DF structure are as follows:


(iv) Points (i)–(iii) allow significant improvements in area efficiency compared with available designs that support multipath 128-point FFT (e.g., [21,25]).

Figure 6a presents the block diagram of the proposed MRM2DF FFT kernel, which comprises four modules (Modules 1–4). The operations associated with Modules 1–4 are illustrated in Figure 6b (color-coded by module). Figure 6b shows that Module 1 performs the 4/5/6-path BF2 (radix-2) operations, and Module 4 provides the eight-path first and second BF8 (radix-8) computations.

**Figure 6.** (**a**) Block diagram of proposed MRM2DF FFT kernel; (**b**) module-associated operations.

Intermediate data transfer for the radix-2/8/8 calculations was achieved using a sorting buffer in Module 2. As mentioned previously, W128 and W64 twiddle factors must be multiplied with the outputs of the radix-2 and first radix-8 stages, respectively. Module 3 was a constant multiplication unit (CMU) that performed these twiddle factor multiplications by using a novel resource-sharing scheme. Modules 1–4 are detailed in Section 3. For 64-point FFT, only Modules 2–4 were used to perform the radix-8/8 operations.

Taking a six-stream 1024-point MDF FFT as an example, Figure 7 shows the operation flow that generates a 128-point FFT block based on the MDF processing (Figure 5). In most OFDM standards (e.g., IEEE 802.11ac for WiFi), the GI period is 1/4 or 1/8 of the symbol duration, depending on the channel conditions. In this example, six streams of 1024 data samples were accessed to perform 1024-point FFT operations, and the number of GI samples was 256 (i.e., 1/4 of the symbol duration). Figure 5 shows that this mechanism allowed the calculation of six-stream 1024-point FFT through the radix-2<sup>3</sup> stages and the 128-point FFT block. The corresponding radix-2<sup>3</sup> signal flow chart with the radix-2 (BF2) operations is shown in Figure 7 for reference. Through the three stages of MDF-based radix-2 operations (including twiddle factor multiplication), sets of 512/256/128-point FFT sequences were generated (corresponding to the colors and slashes in the signal flow chart). Finally, eight sets of six-stream sequences were generated, each comprising 128 data and 32 GI samples. For each set of 128-sample data streams, the proposed MRM2DF kernel processed six-stream 128-point FFT. Figure 8 presents the detailed timing diagram of the MRM2DF hardware operations and the signal flow chart based on the radix-2/8/8 algorithm. Including the GI duration in the time available for processing OFDM FFT is an efficient approach. However, conventional MDC/MDF FFT structures operating at the input sample rate [20–22,25,26] were generally idle during the GI clock cycles, which reduced hardware efficiency. By contrast, the proposed MRM2DF scheme used all 160 clock cycles, including the GI (Figure 8), for 128-point FFT operations.
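
The cycle budget in this example follows from simple arithmetic. The sketch below assumes one sample per clock cycle per stream and a GI that divides evenly across the 128-point blocks; the function name is hypothetical. The 1/8 case is derived under the same assumptions and is not a figure taken from the text.

```python
def kernel_block_budget(n_points, gi_divisor, kernel_points=128):
    """Split an N-point symbol plus its GI into kernel-sized blocks.

    Returns (blocks, data samples per block, GI samples per block,
    clock cycles available per block).
    """
    if n_points % kernel_points:
        raise ValueError("n_points must be a multiple of kernel_points")
    blocks = n_points // kernel_points
    gi_samples = n_points // gi_divisor      # GI = symbol / gi_divisor
    gi_per_block = gi_samples // blocks
    return blocks, kernel_points, gi_per_block, kernel_points + gi_per_block


print(kernel_block_budget(1024, 4))   # (8, 128, 32, 160)
print(kernel_block_budget(1024, 8))   # (8, 128, 16, 144)
```

For the 1/4 GI, each 128-point block therefore has 160 cycles available, of which 32 would go unused in a design that idles during the GI.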

**Figure 7.** Operation flow for the generation of 128-point FFT block based on 1024-point MDF FFT.

**Figure 8.** Detailed timing diagram for operations of the proposed MRM2DF-structured FFT kernel and its associated signal flow chart for the radix-2/8/8 algorithm.

Figure 8 illustrates that BF2 and W128 multiplications were performed concurrently for all six data streams. BF8 and W64 multiplications were performed on eight paths, stream by stream, with a time unit of eight cycles (i.e., a time slot (TS)) allotted to the operation of each stream. The radix-2/8/8 signal flow chart reveals that only half of the BF2 outputs (i.e., the BF2 subtraction terms) were multiplied by W128 twiddle factors. This observation indicates that W128 multiplication was performed only during some of the 160 cycles (A- or B-portion mcl in green or red colors). Consequently, the unoccupied duration is available for W64 multiplication by using the resource-shared CMU. Moreover, using the TS unit (eight cycles) for BF8/W64 operations per stream allowed the efficient use of the GI for the processing of additional streams. In the aforementioned example, the GI had 32 cycles (i.e., T2), and T3 was half of T2. Thus, six TS units (T4) were available for BF8/W64 operations for six-stream 128-point FFT. Similar schemes can be applied to other GI situations by evaluating the T1–T4 terms in Figure 8. Table 1 lists the T1–T4 parameters associated with various operating modes of 128-point FFT. In the 64-point FFT configuration (Figure 6), the proposed scheme provided only BF8/W64 operations based on the BF8/W64-related portions of the timing schedule (Figure 8).
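
The statement that only the BF2 subtraction terms need W128 multiplication can be verified from the twiddle exponent *n*<sub>2</sub>*k*<sub>1</sub> in Equation (3): for the sum branch (*k*<sub>1</sub> = 0) the exponent is always zero, so the twiddle factor is 1. The short sketch below counts this; the function name is hypothetical.

```python
def w128_exponents():
    """Exponent n2*k1 of W128 for every BF2 output (k1, n2)."""
    return {(k1, n2): (n2 * k1) % 128 for k1 in range(2) for n2 in range(64)}


exps = w128_exponents()
sum_branch = [e for (k1, _), e in exps.items() if k1 == 0]
diff_branch = [e for (k1, _), e in exps.items() if k1 == 1]

print(all(e == 0 for e in sum_branch))         # True: no multiplier needed
print(len(diff_branch))                        # 64 of the 128 BF2 outputs
print(sum(1 for e in diff_branch if e != 0))   # 63 nontrivial twiddles
```

Half of the BF2 outputs thus bypass the multiplier, which is the idle capacity that a shared constant multiplier can reassign to W64 multiplication.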

**Table 1.** Parameters for various operating modes.

