1. Introduction
Time–frequency (TF) analysis has increasingly become a crucial tool in the study of linear and non-linear systems, which exhibit both stationary and non-stationary signals with variations in the time and frequency domains [1,2]. Through the use of time–frequency representations, transient events can be identified, key features extracted, and patterns recognized from the signal properties. These capabilities provide a comprehensive view of a signal's time-varying frequency content, facilitating the extraction of meaningful information. This, in turn, supports informed decision-making in a wide range of applications, from biomedical diagnostics [3] to machine condition monitoring [4].
Over time, many techniques have emerged to generate time–frequency representations. The short-time Fourier transform (STFT) is the simplest and most widely used tool for this purpose [5]; however, its main drawback is its limited time–frequency resolution due to the fixed size of its analysis window. Because the STFT uses a constant window length throughout the analysis, it forces a trade-off between time resolution and frequency resolution, which can be an inconvenience in many applications. Another widely adopted approach for obtaining frequency information over time is the wavelet transform (WT) [6], whose underlying principle is that any time series can be decomposed into a self-similar series of dilations and translations of a signal called the mother wavelet. Although the WT inherently provides a representation in the state space of dilations and translations, it can effectively measure the power spectrum in a localized manner when an appropriate wavelet is chosen. This requirement, however, is one of its main disadvantages: both the decomposition level and the mother wavelet must be carefully selected to obtain a correct decomposition, which is not always possible. A more recent technique for obtaining the time–frequency representation of a signal is the Stockwell transform, also known as the S-transform (ST) [7], which has found applications across many fields. It shares similarities with the WT in terms of progressive time–frequency resolution, but its main advantage is that the ST preserves phase information. Like the WT, the ST uses a window whose length varies with frequency (a Gaussian window scaled inversely with frequency), but it remains referenced to absolute time, resulting in a more evenly distributed frequency resolution across different frequencies. The preservation of phase by the ST is particularly crucial for certain applications, such as the analysis of audio and biomedical signals [8], as it provides critical insight into signal characteristics. The absolute referencing of the ST means that the phase information corresponds to the argument of the sinusoid at zero time, which aligns with the definition of phase in the Fourier transform. This characteristic enhances its utility in precise phase analysis, making it a preferred choice in fields where phase fidelity is essential. However, a drawback of the ST lies in the redundancy of its calculation when producing a time–frequency representation. To address this issue, the Discrete Orthonormal Stockwell Transform (DOST) was proposed [9]. This method divides the time–frequency domain into regions, each represented by a single coefficient. The DOST is based on a series of orthonormal basis functions that effectively localize the Fourier spectrum of the signal, thus retaining the valuable property of the ST of preserving phase information. To improve the computational efficiency of the DOST, Wang and Orchard [10] proposed decomposing the DOST matrix. Subsequently, Battisti and Riba [11] extended the algorithm presented in [12] to compute the Stockwell coefficients using an admissible window. Both methods achieve a computational complexity of O(N log N), aligning them with the efficiency of the fast Fourier transform (FFT) algorithm. These advancements make the DOST a compelling alternative for time–frequency analysis, offering both speed and accuracy, as demonstrated by studies ranging from the analysis of cardiovascular diseases [13] and bearing fault diagnosis [14] to the classification of power quality disturbances [15].
Time–frequency representations, including the STFT, are computationally intensive and time-consuming, which limits their suitability for continuous signal monitoring; hardware implementations can effectively address these challenges. A hardware implementation integrates analytical techniques directly onto physical computing devices, such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs), rather than running them as software on general-purpose computers. This approach not only mitigates the computational load but also accelerates data processing. Such hardware solutions therefore enable the application of time–frequency analysis across a broad range of applications. By taking advantage of the inherent properties of specific hardware, computational complexity can be reduced, leading to faster computations and facilitating real-time analysis [16]. Implementing a time–frequency analysis (TFA) technique in hardware always presents a challenge; however, FPGA systems are particularly suitable for signal analysis due to their ability to perform high-speed calculations with low power consumption compared to software implementations on CPUs or GPUs [17].
Various works have proposed different TFA implementations on FPGAs. For example, in [18], a system based on the STFT was developed to extract features in the frequency domain. Another study [19] proposed a new architecture for the harmonic WT based on the discrete cosine transform. Furthermore, in [20], several parallel architectures for different FFT-based time–frequency representations were proposed, among them the ST; however, that implementation follows the classical ST algorithm, i.e., the issue of operational redundancy is not addressed, which can affect its practicality and efficiency. From this point of view, it is crucial to develop an FPGA architecture that not only implements the efficient DOST algorithm but also ensures that it can be deployed on low-cost FPGA chips.
This study introduces an FPGA architecture that implements the DOST algorithm and can be adapted, through a developed MATLAB-based app, to the number of processing points required by a specific application (64, 128, 256, 512, and 1024 points). This flexibility enables the architecture to address a wide range of applications reported in the literature, enhancing its applicability and efficiency in practical implementations. Although the proposed architecture is implemented on a Cyclone V series FPGA device from Intel Altera, featuring the 5CSEMA5F31C6N chip, other FPGA boards can also be used, as the proposed cores are not vendor-dependent. The obtained results demonstrate low resource usage (<5% of the chip) and high accuracy (root mean square error (RMSE) of 6.0155 × 10⁻³) when compared with results from floating-point processors. Additionally, to provide a complete hardware solution, the proposed DOST core has been integrated with a hybrid ARM-HPS (Advanced RISC Machine–Hard Processor System) control unit, which allows the control of different peripherals, such as communication protocols and a VGA-based display.
This paper is organized as follows. Section 2 provides an overview of the S-transform, its different variants, and how its algorithm has evolved toward a more computationally efficient implementation. Section 3 explains the hardware design of the implementation based on the DOST algorithm. In Section 4, the results of the proposed implementation are presented, and Section 5 offers our conclusions.
2. S-Transform and Its Variants
The ST method provides a multi-resolution time–frequency representation of one-dimensional (time) signals, illustrating the behavior of spectral components over time while preserving the phase information of the signal. The ST of a function h(t) is calculated by convolving h(t) with a frequency-dependent Gaussian window modulated as in the Fourier transform (FT) method, Equation (1) [7]:

S(τ, f) = ∫₋∞^∞ h(t) (|f|/√(2π)) e^(−(τ−t)²f²/2) e^(−i2πft) dt,  (1)

where t and τ are the time variable and the time translation, while f denotes the frequency variable. Essentially, the ST operates as a windowed FT, akin to the STFT, where the window width, centered at τ, varies inversely with the frequency f.
Employing the integral properties of the Gaussian function, it is possible to establish the connection between the FT of h(t), represented as H(f), and S(τ, f), in the following manner:

S(τ, f) = ∫₋∞^∞ H(α + f) e^(−2π²α²/f²) e^(i2πατ) dα,  f ≠ 0.
Finally, the original function h(t) is recovered from the ST as follows:

h(t) = ∫₋∞^∞ [ ∫₋∞^∞ S(τ, f) dτ ] e^(i2πft) df.
Conversely, the discrete version of the ST (DST) is obtained as follows [7]:

S[jT, n/(NT)] = Σ_{m=0}^{N−1} H[(m + n)/(NT)] e^(−2π²m²/n²) e^(i2πmj/N),  n ≠ 0,

where j represents the time translation index, n signifies the frequency shift, e^(−2π²m²/n²) denotes the Gaussian window expressed in the frequency domain, and H[·] represents the discrete FT of the discrete signal h[l] = h(lT) for l ranging from 0 to N − 1, with a sampling interval given by T. For the n = 0 voice, define

S[jT, 0] = (1/N) Σ_{m=0}^{N−1} h(mT),

i.e., the average of the signal.
Similar to the ST, the inverse DST is determined as follows:

h[lT] = (1/N) Σ_{n=0}^{N−1} { Σ_{j=0}^{N−1} S[jT, n/(NT)] } e^(i2πnl/N)

for a signal of size N.
Observing Equation (6), it is crucial to notice that for a time signal of length N, N² Stockwell coefficients are computed, each requiring O(N) computation time. Consequently, the computation of all N² coefficients of the ST entails a computational burden of O(N³). This considerable complexity stems from the substantial redundancy inherent in the estimated time–frequency plane, which imposes significant computational resources and thereby limits the method's effectiveness in processing large datasets.
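To make this burden concrete, the following NumPy sketch (our illustrative code, not the paper's implementation; the helper name naive_dst is ours) computes all N² DST coefficients directly, with an O(N) sum per coefficient, giving O(N³) overall:

```python
import numpy as np

def naive_dst(h):
    """Naive discrete Stockwell transform: N^2 coefficients, O(N) work each."""
    N = len(h)
    H = np.fft.fft(h)                      # spectrum of the input signal
    S = np.zeros((N, N), dtype=complex)    # rows: frequency n, columns: time j
    m = np.arange(N)
    for n in range(1, N):                  # the n = 0 voice is handled separately
        gauss = np.exp(-2 * np.pi ** 2 * m ** 2 / n ** 2)   # frequency-domain Gaussian
        for j in range(N):                 # O(N) sum for every coefficient
            S[n, j] = np.sum(H[(m + n) % N] * gauss * np.exp(2j * np.pi * m * j / N))
    S[0, :] = np.mean(h)                   # n = 0 voice: the signal mean
    return S
```

The two nested loops over n and j, each wrapping an O(N) sum, are exactly the redundancy that the DOST removes.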
With the aim of lessening the computational complexity of the DST method, Stockwell [9] introduced an enhanced method called the discrete orthonormal ST (DOST). The enhanced method relies on a set of orthonormal basis functions that localize the Fourier spectrum of the one-dimensional or time signal (1DTS), calculating the TF plane determined by the ST method without introducing redundancy into the computed information while preserving the phase attributes of the ST. Stockwell [9] defines a basis set of N orthogonal unit-length vectors, each corresponding to a distinct region in the calculated TF representation. The regions are characterized by the following three parameters: τ indicates the location in time, β represents the width of each frequency band, and ν denotes the center of that band (voice). Employing these three parameters, the kth element of a basis vector is obtained as follows:

D^[ν,β,τ][k] = (e^(−iπτ)/√β) Σ_{f=ν−β/2}^{ν+β/2−1} e^(i2πfk/N) e^(−i2πfτ/β).
Later, the inner product of the function h[k] (the analyzed 1DTS) with D^[ν,β,τ][k] gives the DOST coefficient, denoted by S, for the region corresponding to the three parameters [ν, β, τ]:

S^[ν,β,τ] = ⟨h, D^[ν,β,τ]⟩ = Σ_{k=0}^{N−1} h[k] (D^[ν,β,τ][k])*,

where * denotes complex conjugation.
To create the set of N orthogonal basis vectors from Equation (8) for k = 0, …, N − 1, the parameters ν, β, and τ must be selected appropriately. Letting the variable p index the frequency bands, Stockwell [9] defines the DOST basis vectors for the positive frequencies for each p as follows:
If p = 0, D[k][ν,β,τ] = 1 (only one basis vector);
If p = 1, D[k][ν,β,τ] = exp(−i2kπ/N) (only one basis vector);
For p = 2, 3, …, log₂ N − 1, pick ν = 2^(p−1) + 2^(p−2), β = 2^(p−1), and τ = 0, 1, …, β − 1.
By combining these basis vectors with the basis vectors for the negative frequencies, it can be demonstrated that these parameter selections yield a set of N orthogonal unit vectors, resulting in N DOST coefficients.
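As a check on these parameter selections, a short Python sketch (the helper name dost_partition is ours) enumerates the positive-frequency bands and confirms that they contribute N/2 coefficients, which double to N once the mirrored negative-frequency bands are included:

```python
import math

def dost_partition(N):
    """Enumerate (nu, beta) for the positive-frequency DOST bands of an
    N-point signal; each band contributes beta coefficients."""
    bands = [(0, 1), (1, 1)]                 # p = 0 and p = 1: single basis vectors
    for p in range(2, int(math.log2(N))):
        beta = 2 ** (p - 1)                  # width of band p
        nu = 2 ** (p - 1) + 2 ** (p - 2)     # center of band p
        bands.append((nu, beta))
    return bands

# Positive-frequency coefficient count; doubling it covers the negative bands
pos_coeffs = sum(beta for _, beta in dost_partition(64))
```

For N = 64, the band widths are 1, 1, 2, 4, 8, and 16, summing to 32 = N/2, so the full basis has exactly N = 64 vectors.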
Reordering the summation in Equation (8) yields:

S^[ν,β,τ] = (e^(iπτ)/√β) Σ_{f=ν−β/2}^{ν+β/2−1} H[f] e^(i2πfτ/β),

where the summation over f is limited to a specific band determined by the two parameters β and ν. Consequently, this summation can be depicted as the inner product between the vector H of calculated Fourier coefficients and a row of a sparse matrix.
Finally, an inverse Fourier transform, denoted by F⁻¹, is applied to the obtained subband of the FT of the 1DTS or function h[k], with the three indices shifted correctly, where τₖ, βₖ, and Ωₖ indicate the time index, the bandwidth, and the frequency band for the kth basis vector, respectively.
The DOST is distinguished by its similarity to the general Fourier family transform described in [5], with the key difference being the use of a window with rectangular properties instead of a truncated window with Gaussian properties. In particular, the fast algorithm developed in [7] can be adjusted to generate the conjugate-symmetric DOST.
Other important features are the sampling frequency, Fs, and the number of points, N, used during the computation. Although they are related to each other, Fs determines the bandwidth in Equation (13), and N determines the frequency resolution, ∆f, in Equation (14). For instance, if Fs is equal to N, the bandwidth will be given by Fs/2 or N/2; thus, for this particular case, the larger the value of N, the larger the bandwidth. The value of N can be changed to increase or decrease the bandwidth according to the application, with powers of two being the most recommendable values.
On the other hand, from an implementation point of view, the DOST algorithm consists of three main steps: first, computing the fast FT (FFT) of the input signal; second, calculating the inverse FFT (IFFT) of each region; and finally, merging the regions to produce the final DOST output. This process is illustrated in Figure 1, where the different colors, as a merely explanatory example, represent the ordered time–frequency information of each region of the signal.
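The three steps can be sketched in a few lines of NumPy (an illustrative software model under one common normalization choice, not the hardware implementation; the dyadic band widths are mirrored over the negative frequencies):

```python
import numpy as np

def dost_via_fft(h):
    """Three-step DOST sketch: FFT of the signal, per-region IFFT, merge.
    Assumes len(h) is a power of two; scaling chosen to preserve energy."""
    N = len(h)
    H = np.fft.fft(h)                                  # step 1: full-signal FFT
    pos = [1, 1] + [2 ** p for p in range(1, int(np.log2(N)) - 1)]
    widths = pos + pos[::-1]                           # dyadic bands, mirrored for negative freqs
    out, start = [], 0
    for w in widths:                                   # step 2: IFFT of each region
        band = H[start:start + w]
        out.append(np.fft.ifft(band) * np.sqrt(w))     # unit-norm per-band scaling
        start += w
    return np.concatenate(out)                         # step 3: merge the regions
```

With this scaling, the transform preserves energy up to the DFT convention (Σ|S|² = N·Σ|h|²), mirroring the orthonormality of the DOST basis.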
3. Proposed Hardware Architecture for the DOST
3.1. Flowchart for the Automatic Generator of the DOST Architecture
In order to provide a hardware solution for a wider range of applications, this work presents a configurable architecture for the DOST method based on the required number of points, N. To achieve this, the flowchart shown in Figure 2 must be followed. Firstly, the number of points, N, for the DOST core must be selected according to the application, keeping in mind that this number is related to the bandwidth; the available sizes are 64, 128, 256, 512, and 1024 points. Next, on the one hand (left path), the VHDL (Very-High-Speed Integrated Circuit Hardware Description Language)-based reconfigurable DOST cores are automatically generated by means of a developed MATLAB app; then, the DOST core is synthesized to fit the desired FPGA platform. On the other hand (right path), the codes that program the ARM-HPS processor to generate a DOST time–frequency representation are designed. Finally, the DOST core is integrated with the ARM-HPS to provide access to different peripherals, such as a VGA-based display and communication protocols. Thus, users can apply the developed hardware solution to their specific applications.
3.2. ARM-FPGA Solution
The overall structure of the proposed hybrid ARM-FPGA solution is shown in Figure 3. This solution combines the flexibility of the ARM-HPS control unit with the reconfiguration power of the FPGA. The design mainly consists of five modules: the reconfigurable FPGA unit, the ARM-HPS control unit, the on-chip memory, the SDRAM (synchronous dynamic random-access memory), and the input/output interface. The FPGA unit primarily handles the hardware implementation of the DOST algorithm and the VGA controller. The ARM processor is responsible for reordering the results obtained from the DOST to create the DOST spectrum and for sending the data to be graphed via the VGA protocol. The SDRAM stores the data to be graphed, provided by the ARM-HPS, while the on-chip memory stores the values of all the variables within the processor software and the ASCII characters to be displayed on the screen. Finally, the I/O interface unit manages data input and output to the PC using the UART protocol. The entire system is interconnected through a 32-bit AXI bus.
In the next subsection, the FPGA-based DOST architecture, i.e., the main contribution of this work, is described in detail.
3.3. FPGA-Based DOST Architecture
Figure 4 presents the proposed top-level block diagram for the DOST processor. The Ctrl_DOST block is a finite state machine (FSM) that controls the process of obtaining the DOST coefficients. Its main function is to provide the necessary parameters for the calculation of the FFT in different regions of the signal, according to the DOST algorithm described in Figure 1.
This process begins by performing the FFT of the entire signal. Subsequently, several FFTs are calculated by dividing the signal FFT coefficients into different regions. For example, to calculate the 64-point DOST, the signal is divided into regions of 16, 8, 4, 2, 1, 1, 1, 2, 4, 8, and 16 points. Each region generates an FFT matrix (butterfly diagram) with a variable number of columns and rows; therefore, the Ctrl_DOST block provides the numbers of columns and rows through the NCol and NRow signals, as well as the ADD_INITIAL signal, which indicates the start address of the region in the Dual-Port RAM where the calculated FFT coefficients will be stored.
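The per-region control parameters can be modeled in software; the sketch below (the helper name region_schedule is ours, and the dictionary keys mirror the NCol, NRow, and ADD_INITIAL signals described above) derives the butterfly-matrix shape and start address for each region:

```python
import math

def region_schedule(widths):
    """For each region of w points, the FFT butterfly matrix has w/2 rows and
    log2(w) columns; ADD_INITIAL is the region's start address in the RAM."""
    sched, addr = [], 0
    for w in widths:
        sched.append({"ADD_INITIAL": addr,
                      "NRow": max(w // 2, 1),
                      "NCol": int(math.log2(w))})   # 0 columns: a 1-point region needs no butterflies
        addr += w
    return sched
```

For the 64-point example above, the first region occupies addresses 0 to 15 and requires an 8-row, 4-column butterfly matrix.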
The Ctrl_DOST block receives the NewRegion signal, which indicates the start of the FFT calculation in a new region, and the S_IFFT signal, which marks the end of the FFT calculation of the entire input signal. Finally, using the EODOST signal, Ctrl_DOST indicates the completion of the calculation of the DOST coefficients.
The next block is Ctrl_FFT, an FSM that controls the internal process of calculating the FFT. Its main function is to manage the CounterRow and CounterColumn counters, which count the rows and columns of the FFT matrix. The ENCR and ENCC signals enable these counters, determining when they increment. An FFT matrix has N/2 rows and log₂ N columns, where N is the number of points. Every time N/2 rows are completed, the Ctrl_FFT block receives the EOCR signal and starts a new column. At the end of all columns, the Ctrl_FFT block receives the EOCC signal, indicating the end of the FFT calculation for the current region.
The Address_Generation block generates the addresses ADRA and ADRB, where the results returned by the butterfly operation are read and written. These addresses are determined by the values of the Row and Col signals coming from the CounterRow and CounterColumn counters. In turn, the block also generates the ADDW addresses used to read from the ROM_Sin and ROM_Cos lookup tables, selecting the correct twiddle-factor values depending on the position in the FFT matrix indicated by the Row and Col signals.
The two multiplexers in Figure 4 select between different input data sources and their corresponding addresses. Data can come from outside through the AXI bus via the DE signal and be stored at the ADRE address. During the calculation of the FFT matrix, the input data can also come from the butterfly results: G0 and G1 for the real part and H0 and H1 for the imaginary part. As for the addresses, they can be generated directly by the Address_Generation block or, after the FFT of the entire signal has been calculated, they can come from the Inverse and Bias block. In this case, the modified addresses are represented by the signals ADRAB and ADRBB and depend on the region of the FFT coefficients where the FFT is being recalculated for the subsequent IFFT. The upper multiplexer provides the real parts of the two data computed by the butterfly block, DI_AR and DI_BR, together with their respective addresses, ADRAR and ADRBR, to be stored in the upper Dual-Port RAM. The lower multiplexer sends the two imaginary parts, DI_AI and DI_BI, to the Conjugate block, which performs the 2's complement of the data if the IFFT is being calculated; otherwise, the data are sent directly to the lower Dual-Port RAM. The Ext_Data signal indicates whether the data source is external, and the S_IFFT signal indicates whether the IFFT of the regions is being calculated.
The Inverse and Bias block, illustrated in Figure 5, is used to adapt the memory addresses ADRA and ADRB according to the region in which the process is located. At the beginning of each new region, it performs bit inversion on the addresses. Subsequently, as each new region starts at an offset address, the addresses are adjusted by adding the value of ADD_INITIAL, thus obtaining the addresses ADRAB and ADRBB.
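This address adaptation can be sketched as follows (our illustrative model; the hypothetical helper biased_address mirrors the ADRAB/ADRBB computation):

```python
def bit_reverse(addr, bits):
    """Reverse the low `bits` bits of an address (radix-2 FFT reordering)."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (addr & 1)
        addr >>= 1
    return out

def biased_address(addr, bits, add_initial):
    """Bit-reversed address plus the region's start offset ADD_INITIAL."""
    return bit_reverse(addr, bits) + add_initial
```

For example, address 1 in an 8-point region starting at offset 16 maps to 4 + 16 = 20.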
The ROM_Cos and ROM_Sin blocks are lookup tables that store the N/2 cosine and sine values corresponding to the twiddle factor used in the butterfly operation. These values have a fixed-point Q15 precision (15 fractional bits plus 3 integer bits, 18 bits in total). This bit width was selected because the multipliers and DSP blocks on Cyclone V devices are optimized for 18 bits.
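The table contents can be generated offline; the sketch below is our illustrative code (in the actual design, the MATLAB app fills the VHDL ROMs) producing the Q15 twiddle values W_N^k = e^(−i2πk/N):

```python
import math

def twiddle_roms(N):
    """N/2 cosine and sine twiddle values in Q15 (scaled by 2^15, rounded).
    Note that +1.0 maps to 32768, which still fits in an 18-bit signed word
    thanks to the extra integer bits of the format described above."""
    cos_rom, sin_rom = [], []
    for k in range(N // 2):
        angle = -2.0 * math.pi * k / N     # forward-FFT twiddle angle
        cos_rom.append(round(math.cos(angle) * (1 << 15)))
        sin_rom.append(round(math.sin(angle) * (1 << 15)))
    return cos_rom, sin_rom
```
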
One of the most common methods of calculating the IFFT is through the forward FFT, conjugating the input and output data, as shown in Figure 6. The Conjugate block is responsible for this operation, allowing the IFFT to be calculated using the FFT according to Equation (13). The division by N = 2^i, where i is a natural number, is performed by shifting log₂ N = i places.
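In floating point, the same conjugate trick reads as follows (a NumPy sketch of the operation, not the fixed-point hardware):

```python
import numpy as np

def ifft_via_fft(X):
    """IFFT from a forward FFT: conjugate the input, run the forward FFT,
    conjugate the output, and divide by N (a right shift by log2 N in
    hardware when N is a power of two)."""
    N = len(X)
    return np.conj(np.fft.fft(np.conj(X))) / N
```
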
The butterfly processing unit is a combinational block and the heart of any FFT-based algorithm. Its function is to take the data from memory and calculate a simple two-point FFT. This operation is shown schematically in Figure 7, where A0 and A1 are the real parts and B0 and B1 are the imaginary parts of the inputs from the previous level; at its output, two complex numbers, Y0 and Y1, are obtained, made up of G0 and G1 (the real parts) and H0 and H1 (the imaginary parts). W is composed of C and S, the cosine and sine values from the lookup tables ROM_Cos and ROM_Sin, respectively.
The Butterfly block contains four 18-bit multipliers, as well as adders and subtractors. To avoid overflow, the data path is widened by an additional 5 bits to accommodate the "bit growth" that occurs as the FFT processor goes through the butterfly levels. This is critical for preserving precision, since all the calculations use signed integer arithmetic. Finally, in the radix-2 algorithm, the result of each stage is scaled down by a factor of 2, so the final output of the FFT maintains the same bit size. The results are written back to the same memory locations, since an in-place algorithm is used.
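The fixed-point butterfly can be modeled as follows (our illustrative sketch; Python integers stand in for the signed hardware words, with the >> 15 product shift matching the Q15 twiddles and the >> 1 matching the per-stage scaling described above):

```python
def butterfly(a0, b0, a1, b1, c, s):
    """Radix-2 butterfly: y0 = (x0 + W*x1)/2 and y1 = (x0 - W*x1)/2,
    where x0 = a0 + i*b0, x1 = a1 + i*b1, and W = c + i*s in Q15."""
    tr = (a1 * c - b1 * s) >> 15          # Re(W * x1), rescaled out of Q15
    ti = (a1 * s + b1 * c) >> 15          # Im(W * x1)
    g0, h0 = (a0 + tr) >> 1, (b0 + ti) >> 1   # Y0 = G0 + i*H0
    g1, h1 = (a0 - tr) >> 1, (b0 - ti) >> 1   # Y1 = G1 + i*H1
    return g0, h0, g1, h1
```
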
Finally, two Dual_Port_RAM blocks are used to store the results of each butterfly operation. Since the data being read and written are complex, with real and imaginary parts, the upper RAM block stores the real part and the lower RAM block stores the imaginary part. The input data and their respective addresses are managed by the multiplexers, depending on the source of the data. The outputs of the upper Dual-Port RAM, A0 and A1, represent the real parts of the data, while the outputs of the lower Dual-Port RAM, B0 and B1, represent the imaginary parts.