
Scalable Transformer Accelerator with Variable Systolic Array for Multiple Models in Voice Assistant Applications

Department of Semiconductor Systems Engineering, Sejong University, Seoul 05006, Republic of Korea
Author to whom correspondence should be addressed.
Electronics 2024, 13(23), 4683;
Submission received: 24 October 2024 / Revised: 21 November 2024 / Accepted: 25 November 2024 / Published: 27 November 2024
(This article belongs to the Topic Theory and Applications of High Performance Computing)


The Transformer is a deep learning model that has quickly become fundamental to natural language processing (NLP) and other machine learning tasks. Transformer hardware accelerators are usually designed for specific models, such as Bidirectional Encoder Representations from Transformers (BERT) or vision Transformer models like ViT. In this study, we propose a Scalable Transformer Accelerator Unit (STAU) that supports multiple models, enabling efficient handling of the various Transformer models used in voice assistant applications. A design centered on a Variable Systolic Array (VSA), with control and data preprocessing performed on an embedded processor, enables matrix operations of varying sizes. In addition, we propose an efficient variable structure and a row-wise data input method suited to natural language processing, where the word count changes from input to input. The proposed accelerator speeds up the text summarization, audio processing, image search, and generative AI workloads used in voice assistants.

1. Introduction

As neural networks have emerged as a major field of innovation, a growing number of applications rely on deep learning algorithms. In particular, voice assistants on mobile devices now combine several such applications, and functions such as text summarization, audio processing, image search, and generative AI are emerging [1,2]. The algorithms that enable these functions are natural language processing (NLP) algorithms. Representative examples include Bidirectional Encoder Representations from Transformers (BERT) as an encoder model, the Generative Pre-trained Transformer (GPT) as a decoder model, and the Text-to-Text Transfer Transformer (T5) as an encoder–decoder model [3]. All of these are Transformer-based, and most current NLP algorithms are designed around the Transformer.
A Transformer consists of an encoder and a decoder, each built from Multi-Head Attention (MHA) and a Feed-Forward Network (FFN). Both involve very large matrix operations [4,5,6,7,8]. Moreover, the number of parameters in large language models (LLMs) keeps growing, which lowers computational speed and power efficiency. AI hardware accelerators are therefore needed to speed up operations that benefit from parallel processing, such as matrix operations. Figure 1 illustrates the difference in processing with and without a hardware accelerator. Functions 1 and 5 exchange data between the CPU and the accelerator, while functions 2–4 are computations performed in the accelerator, where parallel processing is possible [9,10,11,12]. We therefore aim to implement a Transformer hardware accelerator that reduces the computation time and power consumption of voice assistant applications. The proposed accelerator was implemented and validated on a Field-Programmable Gate Array (FPGA). The proposed Scalable Transformer Accelerator Unit (STAU) has the following three characteristics.
First, STAU has a structure that can be extended to various models. Most existing work tailors the hardware accelerator to a specific Transformer-based model, such as BERT or a vision Transformer like ViT [4]. Such designs accelerate the target model well but may be unable to accelerate others, whereas the applications used in voice assistants require acceleration across various models. In addition, when a matrix is small, as in the $QK^{T}$ operation in MHA, existing designs must either process it in software or add a specialized hardware module. We designed the Variable Systolic Array (VSA) to be reusable, and the embedded processor preprocesses the data so that matrices smaller than the VSA structure can still be computed on it. Second, STAU has a flexible structure that takes the word count as an input, terminates computation and output early, and reduces memory stalls through row-by-row data input. Existing designs are based on a square systolic array [5,13], so the clock cycles spent on computation and output do not decrease when the word count is smaller, and they scale nonlinearly with it. The proposed accelerator uses a variable systolic array structure to fit NLP models, in which the number of words changes with each input. Furthermore, computation starts as soon as data arrive row by row, significantly reducing memory read/write time, which typically accounts for most of the processing time in an accelerator. Finally, STAU has a compact structure that does not require layer normalization, because values are quantized according to the data range. Data are quantized to FP16, and when the range is exceeded during multiplication or addition, values are adjusted so that the digits and mantissa are maintained. By omitting layer normalization, one of the critical paths in the Transformer structure, the MHA and FFN operations can be performed with only the VSA and softmax modules.
This work contributes to on-device AI by enabling faster computation across various Transformer-based models. We expect it to pave the way for more responsive voice assistants and other NLP applications on mobile platforms. The rest of the paper is organized as follows. Section 2 describes the proposed hardware architecture in detail, Section 3 describes the software algorithms running on the embedded processor, Section 4 presents the implementation results, and Section 5 presents the conclusions.

2. Hardware Architecture

2.1. Top Architecture

Figure 2 shows that the overall structure is divided into the STAU accelerator and the processor. We implemented the STAU accelerator on the xcvm1802-2msevsva2197 FPGA of the Xilinx VMK180 board and used the built-in Arm Cortex-R5 as the embedded processor. The STAU accelerator consists of the VSA, softmax, and quantization modules, while the processor is responsible for control and data preprocessing and repeatedly reuses the accelerator. The accelerator and the processor communicate through sixteen AXI4-Lite interface registers. A handshake signal controls BRAM reads and writes; we verified it by connecting it to an external LED. The design places a FIFO interface at the input of the VSA. The FIFO interface splits the inputs into rows and feeds them to the VSA after a one-clock-cycle delay, while simultaneously splitting the weights into columns and feeding them to the VSA in parallel. During this process, BRAM reads and FIFO writes occur simultaneously. Feeding data to the VSA by row gives contiguous memory access, which increases bandwidth efficiency and hardware resource utilization. The disadvantage is that the VSA must wait until an entire row of input data is ready. However, in the Q, K, and V operations of MHA and in the FFN operations, the input data stay the same and only the weights change. Because the VSA receives weights in parallel by column, it can start the next operation without waiting for the input to be reloaded. The input values sent from BRAM0 to the FIFO interface are therefore held, and only the weight values sent from BRAM1 to the FIFO interface are updated. Minimizing data movement in this way reduces memory stalls. When the input values do need to change, a reuse signal initializes the FIFO interface: the registers that count the data in every FIFO are reset to 0, and the FIFOs can then be written with the new values from BRAM. The structure is also flexible in that it takes the word count as an input and changes the clock cycles of all modules accordingly.
Figure 3a illustrates the hardware structure for Multi-Head Attention. We designed the softmax module so that the STAU accelerator handles the entire softmax operation without any computation on the processor. In addition, the processor preprocesses the data to a fixed size that matches the VSA structure. Figure 3b illustrates the data flow of the Transformer encoder. The processor minimizes overhead by instructing the next task to start before the output of the current task is complete [14]. The processor also handles ReLU and the residual connection, because these operations are not iterative and take very little time. The BERT, GPT, and ViT models use GELU rather than ReLU as the activation function of their multilayer perceptron blocks [15]. Activation functions for different models can therefore be supported by modifying only the software running on the processor.
Across the six layers of the Transformer encoder, the VSA is reused 4320 times and the softmax module 48 times. We designed the VSA module with combinational and sequential circuits to accelerate the repetitive matrix operations that form a critical path. Because the softmax module is used only in MHA and runs relatively intermittently, we designed it as a sequential circuit that activates only when it receives a valid signal, reducing static power consumption.
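To illustrate why changing the activation function needs only a software change, the processor-side step could look like the sketch below. It is written in Python for readability rather than the C that would run on the Cortex-R5, and the function names (`relu`, `gelu`, `apply_activation`, `residual_add`) are hypothetical. The only assumption about GELU is its standard erf-based definition.

```python
import math

def relu(x: float) -> float:
    """Rectified linear unit."""
    return x if x > 0.0 else 0.0

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def apply_activation(row, activation=relu):
    """Apply the chosen activation element-wise to one accelerator output row."""
    return [activation(v) for v in row]

def residual_add(skip, out):
    """Residual connection: add the block input back onto the block output."""
    return [a + b for a, b in zip(skip, out)]

if __name__ == "__main__":
    ffn_row = [-1.0, 0.0, 2.0]   # made-up values standing in for an FFN output row
    skip_row = [0.5, 0.5, 0.5]   # made-up values standing in for the block input
    # Switching models only changes which function is passed in.
    print(residual_add(skip_row, apply_activation(ffn_row, relu)))
    print(residual_add(skip_row, apply_activation(ffn_row, gelu)))
```

Passing the activation as a parameter mirrors the idea in the text: the accelerator output is unchanged, and only the small element-wise step on the processor differs between models.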

2.2. Variable Systolic Array (VSA)

The VSA is more parallel than conventional systolic arrays and has a variable structure in which latency decreases as the word count decreases [16,17]. Figure 4a illustrates a Processing Element (PE) designed to reduce clock cycles. Input data A are connected in parallel to all Multiply–Accumulate (MAC) units of the PE. Figure 4b illustrates a pipeline structure in which PEs are instantiated up to the maximum word count [18,19]. The weight values are input in parallel as B and flow from the first PE to the second over consecutive clock cycles. In other words, the structure is a hybrid 2D PE array that keeps the conventional systolic array structure for input values while parallelizing the loading of weights into the PEs [20]. It operates sequentially from PE1 to PE256, with a maximum capacity of 256 PEs. At maximum load, all PEs function simultaneously over 256 clock cycles. For the multiplication $A(n \times k) \times B(k \times m)$, a general systolic array requires $(n + m + k - 2)$ clock cycles, whereas the VSA requires $(n + k - 1)$. The VSA uses a finite state machine (FSM) to control its operations and reduce clock cycles. For example, it compares the word count with the run count to terminate computation early and to end the output process early. MACs are arranged row by row within the higher-level PE module so that PE operation can be controlled by the word count. This arrangement leaves fewer MACs idle than traditional systolic array structures. The total number of MACs equals the number of elements in the output matrix. Consequently, if sufficient area is available, more weight values can be input in parallel without increasing the clock cycles, so performance can scale with the FPGA's resource utilization. Considering the area constraints of the FPGA, the proposed VSA is designed to perform matrix multiplications of size $(n \times 512) \times (512 \times 8)$, where n denotes the word count.
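The two cycle-count expressions can be compared numerically. The sketch below reads the counts quoted above as n + m + k − 2 for a general systolic array and n + k − 1 for the VSA, and evaluates them for the (n × 512) × (512 × 8) tile. The function names are hypothetical, and the model covers a single tile only; it ignores memory traffic, tiling overhead, and FSM control cycles.

```python
def systolic_cycles(n: int, k: int, m: int) -> int:
    """Cycles for A(n x k) x B(k x m) on a general systolic array, as quoted in the text."""
    return n + m + k - 2

def vsa_cycles(n: int, k: int) -> int:
    """Cycles for the same product on the VSA; m drops out because weights load in parallel."""
    return n + k - 1

if __name__ == "__main__":
    k, m = 512, 8  # inner dimension and output columns of one VSA tile
    for n in (4, 16, 64, 256):  # word counts
        print(n, systolic_cycles(n, k, m), vsa_cycles(n, k))
```

For a fixed tile the difference between the two expressions is m − 1 cycles, and the VSA count grows linearly with the word count n.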
Assuming that hardware size is proportional to the number of MAC units, we compared the proposed VSA with existing systolic arrays [21]. We compared the clock cycles of the most critical operations, namely the Q, K, and V calculations in MHA, assuming two of the proposed units are placed in parallel. As shown in Figure 5a, the proposed VSA consumes fewer clock cycles than a square SA of equivalent hardware size for inputs of fewer than 129 words. Its cycle count also grows linearly, which makes it efficient when processing matrices of various sizes. While a non-square SA may be faster at higher word counts, the proposed unit strikes a more appropriate trade-off when hardware size is considered, as shown in Figure 5b.
The VSA is particularly well suited for large language models (LLMs) where the input word count varies. It is especially advantageous for MHA and FFN operations. In these operations, input values typically remain constant, while weight values change frequently, and matrix multiplication is performed repeatedly. The VSA is beneficial in such situations because it can receive weights in parallel. The VSA can therefore also be described as a stateful matrix multiplier, and it is ideally suited for Transformer models in which the word count varies and only the weights change frequently.

2.3. Radix-2 Softmax

Softmax operations contribute less latency than the matrix operations and normalization in MHA. Still, they lie on a critical path and need to be implemented in hardware because they are performed iteratively. If the exponential is computed with Lookup Tables (LUTs), memory cannot store every possible case [22]. According to [13], the average error of radix-2 softmax relative to exp-softmax is only 0.0022 over 20 million data samples, evaluated on four groups of data with ranges of 500, 400, 300, and 200. The data distribution of radix-2 softmax is similar to that of exp-softmax, and it likewise maps discrete data to probabilities between 0 and 1. Because the method uses powers of 2, it can be expressed with shift operations, as shown in Equation (1). We concentrated hardware complexity in the VSA and reduced the resources given to softmax, which accounts for relatively few operations. We therefore used iterative shift and addition operations to reduce resource usage and increase hardware efficiency [23].
$$f(x_i) = \frac{2^{x_i}}{\sum_{j=1}^{N} 2^{x_j}} \quad (i = 1, 2, \ldots, N) \tag{1}$$
Since softmax can be seen as a kind of normalization, the module starts by converting 16-bit floating-point values into INT8 format by rounding [24]. It operates on each row of an $n \times n$ matrix. Because the input is received serially and the row vector size varies with n, an $n \times n$ counter is used to distinguish each row dynamically. The radix-2 softmax operation requires raising 2 to the power of each element in a row and dividing by the sum of these powers over the row. A reference point is needed to perform this calculation with shift operations instead of a LUT, so a comparator first determines the maximum value of each row. Each element in the row is then subtracted from the maximum. As a result, the maximum element has its MSB set to 1 without shifting, while every other element is right-shifted by the difference between the maximum and its value. To perform the subtraction, the initial input values are written to the BRAM and read back in the subsequent stage, where the differences between the maximum and each element are computed. The results must then be accumulated to obtain the denominator of the softmax, so they are also written to the BRAM and read in the next stage for accumulation. Consequently, as shown in Figure 6, the hardware architecture requires three stages, which are implemented in a pipelined manner. Finally, the integer values obtained through division are converted back to 16-bit floating-point format to enable further matrix operations. The module takes $3n + 16$ clock cycles in total.
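The max-subtract-and-shift procedure can be sketched in software as follows. This is a minimal behavioural model, not the pipelined hardware: it assumes integer (INT8-like) inputs, and the fixed-point width `FRAC_BITS` and the function name `radix2_softmax_row` are hypothetical choices made for illustration. The final division is done in floating point here, whereas the hardware converts the result to its own 16-bit format.

```python
FRAC_BITS = 16  # hypothetical fixed-point width; 1 << FRAC_BITS represents 2**0

def radix2_softmax_row(row, frac_bits=FRAC_BITS):
    """Radix-2 softmax of one row of integers: 2**x_i / sum_j 2**x_j."""
    x_max = max(row)                      # stage 1: comparator finds the row maximum
    shifts = [x_max - x for x in row]     # stage 2: non-negative distance from the maximum
    # The maximum keeps its MSB (no shift); every other element is right-shifted
    # by its distance, which is 2**(x - x_max) in fixed point.
    numerators = [(1 << frac_bits) >> s for s in shifts]
    denominator = sum(numerators)         # stage 3: accumulate the row sum
    return [num / denominator for num in numerators]

if __name__ == "__main__":
    probs = radix2_softmax_row([3, 1, 0])
    print(probs)        # proportional to 8 : 2 : 1
    print(sum(probs))
```

Subtracting the maximum does not change the result, because the common factor 2**x_max cancels between numerator and denominator. Elements more than `frac_bits` below the maximum shift out to zero, which is the precision cost of the shift-based scheme.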

2.4. Quantization

Traditional Transformer algorithms perform calculations in 32-bit floating point. In this paper, we propose a 16-bit floating-point format tailored for Transformers [25], which differs from the traditional IEEE 754 format [26]. An analysis using MathWorks MATLAB R2023b showed that the input and weight values did not exceed $2^{7}$ and were mainly distributed between −1.0 and 1.0. In floating-point representation, a 5-bit exponent can represent magnitudes between $2^{-16}$ and $2^{15}$. Considering overflow, we propose a structure with a 1-bit sign, a 10-bit mantissa, and a 5-bit exponent, as illustrated in Figure 7. Quantization improves computational speed and memory efficiency but can cause small losses in the learned weights. In the multiplier module, input values are adjusted to lie between $2^{-8}$ and $2^{7}$ so that the product does not exceed the representable range. In the adder module, input values are calibrated to avoid the maximum and minimum values, so that accumulation can continue without overflow. This method removes the need for normalization in the final stages of MHA and FFN while maintaining accuracy within the representable range. The clock cycle count is thus significantly shortened by omitting the normalization step, which has the second-largest delay [27].
When multiplying, the exponent is added, acting like an opcode that shifts the mantissa: the 10-bit mantissas are multiplied together, and the 5-bit exponents are added together. When adding, the mantissa of the smaller value is shifted right by the difference between the exponents and then added to the mantissa of the larger value, which keeps the MSB of the mantissa at 1. The exponent remains within a valid range because the excess mantissa bits are discarded. Multiplication and addition use Wallace tree multipliers and ripple-carry adders, which are efficient for small bit widths [28,29,30,31].
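The multiply and add rules can be modelled with integer significands. The sketch below is illustrative only: the text does not give the exponent bias, rounding mode, or saturation behaviour, so the code assumes an unbiased exponent in [−16, 15], an 11-bit significand whose MSB is always 1 (the 10 stored mantissa bits plus a leading 1), truncation of discarded bits, exponent clamping on overflow, and positive operands. The names `fp_mul`, `fp_add`, `to_value`, and `clamp_exp` are hypothetical.

```python
MANT_BITS = 10               # stored mantissa bits (from the text)
SIG_BITS = MANT_BITS + 1     # assumed: significand includes an explicit leading 1
EXP_MIN, EXP_MAX = -16, 15   # range of the 5-bit exponent as stated in the text

def to_value(num):
    """Real value of a (significand, exponent) pair; significand is in [2**10, 2**11)."""
    sig, exp = num
    return sig * 2.0 ** (exp - MANT_BITS)

def clamp_exp(exp):
    """Assumed saturation: keep the exponent inside the representable range."""
    return max(EXP_MIN, min(EXP_MAX, exp))

def fp_mul(a, b):
    """Multiply significands, add exponents, renormalize, truncate low bits."""
    (sa, ea), (sb, eb) = a, b
    prod = sa * sb                       # lies in [2**20, 2**22)
    exp = ea + eb
    if prod >> (2 * MANT_BITS + 1):      # product significand reached [2, 4)
        prod >>= 1
        exp += 1
    return prod >> MANT_BITS, clamp_exp(exp)

def fp_add(a, b):
    """Align the smaller operand by the exponent difference, add, renormalize."""
    if a[1] < b[1]:
        a, b = b, a                      # make `a` the operand with the larger exponent
    (sa, ea), (sb, eb) = a, b
    total = sa + (sb >> (ea - eb))       # right shift discards excess mantissa bits
    exp = ea
    if total >> SIG_BITS:                # carry out of the significand
        total >>= 1
        exp += 1
    return total, clamp_exp(exp)

if __name__ == "__main__":
    one_and_half = (1536, 0)             # 1.5 = 1536 / 1024
    three_quarters = (1536, -1)          # 0.75
    print(to_value(fp_mul(one_and_half, one_and_half)))
    print(to_value(fp_add(one_and_half, three_quarters)))
```

In both operations the result significand is brought back into [2**10, 2**11), so the MSB stays at 1, which is the property the text relies on to keep accumulating without a separate normalization step.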

3. Software Algorithms

Transformer models contain matrices both larger and smaller than the VSA structure. Data preprocessing is therefore performed so that all data can be computed on the VSA. Zero padding is applied to data smaller than the VSA structure, and a matrix divide-and-accumulate method is used for data larger than the VSA structure.

3.1. Zero Padding

In the MHA mechanism of the Transformer, the inner dimensions (i.e., the number of columns of the first matrix and the number of rows of the second matrix) of the operations $Q \times K^{T}$ and attention score $\times V$ = attention value are smaller than the dimension the matrix multiplier handles, so these operations cannot be run on it directly. Although the padded matrices are large, matrix multiplication on the STAU is much faster than on the processor, so zero padding is applied to fit the size of the matrix multiplier [32]. Figure 8 illustrates how zero padding ensures that $A \times B$ always has the form $(n \times 512) \times (512 \times m)$, where m is a multiple of 8, the number of columns in the VSA's output. Additionally, data are stored in vector rather than array format to avoid extra memory allocation and improve compatibility with bus interfaces, giving immediate access to the next VSA inputs. Because the data are large and reside in DRAM, storing them in vector form and using the output directly as the next input uses memory bandwidth efficiently. Contiguous memory access enables DRAM burst transfers and improves cache hit rates [33]. It also reduces latency by minimizing data movement between memory and the STAU. If the VSA output can be stored sequentially without interruption, it is received directly in vector format. If not, the data are first stored in 2D arrays to prevent memory errors during processor–accelerator communication, then reshaped and stored as vectors.
In the calculation of Q, the result has size $(n \times 64)$, but the processor stores it in dynamically allocated memory of size $(n \times 512)$ and fills the remaining space with zeros. Since Q is the multiplicand of the $QK^{T}$ operation, the output is stored row-wise, and it is padded to match the input size. In the calculation of $K^{T}$ $(64 \times n)$, the column count of the resulting attention score is not necessarily a multiple of 8. To facilitate the subsequent operation, zero padding is performed up to kt_padded_col, the nearest multiple of 8 greater than n. Since $K^{T}$ is the multiplier of the $QK^{T}$ operation, the output is stored column-wise. V is zero-padded on the same principle, and the padded $QK^{T}$ is reshaped to match the size of the softmax module.
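The padding of Q and K^T into flat vectors can be sketched as follows. This is a simplified model in Python rather than the processor's C code; the helper names are hypothetical, and the demo uses a small inner dimension in place of 512. It also assumes that kt_padded_col is the smallest multiple of 8 that is greater than or equal to n, which is one reading of "nearest multiple of 8 greater than n".

```python
def next_multiple(n, base=8):
    """Smallest multiple of `base` that is >= n (assumed meaning of kt_padded_col)."""
    return -(-n // base) * base

def pad_rows(mat, width):
    """Zero-pad each row to `width` columns and store row-wise as one flat vector."""
    vec = []
    for row in mat:
        vec.extend(row)
        vec.extend([0.0] * (width - len(row)))
    return vec

def pad_kt_colwise(kt, inner, base=8):
    """Zero-pad K^T (d x n) to (inner x padded_cols) and store column-wise as a flat vector."""
    d, n = len(kt), len(kt[0])
    padded_cols = next_multiple(n, base)
    vec = []
    for c in range(padded_cols):
        for r in range(inner):
            vec.append(kt[r][c] if r < d and c < n else 0.0)
    return vec, padded_cols

if __name__ == "__main__":
    q = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # n = 3 words, head dimension 2
    kt = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # 2 x n
    inner = 4                                   # stands in for the VSA inner dimension 512
    print(pad_rows(q, inner))
    print(pad_kt_colwise(kt, inner))
```

Storing the multiplicand row-wise and the multiplier column-wise means each VSA input is a contiguous slice of the vector, which is what allows burst reads from DRAM.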
Figure 9 illustrates the method of dividing the attention value matrices computed from each head in MHA and concatenating them in vector format. The 2D array format of each head is directly concatenated to the vector format, accessing the subsequent VSA input immediately. Zero padding is employed when the internal dimensions of the multiplication are smaller than the VSA’s dimensions. However, processing becomes more complex when these dimensions exceed the VSA’s capacity.

3.2. Matrix Divide and Accumulate Method

The second matrix operation in the FFN is given by
$$F_2\,(n \times 2048) \times W_2\,(2048 \times 512) = F_3\,(n \times 512) \tag{2}$$
The following equations describe the process of obtaining matrix C:
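$$A_k = [\,a_{1,k} \;\; a_{2,k} \;\; a_{3,k} \;\; \cdots \;\; a_{n,k}\,]^{T} \tag{3}$$
$$B_k = [\,b_{k,1} \;\; b_{k,2} \;\; b_{k,3} \;\; \cdots \;\; b_{k,m}\,] \tag{4}$$
$$\sum_{k=1}^{512} A_k \times B_k = C \tag{5}$$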
According to Equations (3)–(5), matrix C is obtained by accumulating the products of column vectors from matrix A and row vectors from matrix B [34]. This concept extends beyond vectors to blocks divided into 512 elements in the following order.
  • Divide F2 into four blocks according to columns;
  • Divide W2 into four blocks according to rows;
  • Multiply in order of divided blocks;
  • Add all multiplied results.
As shown in Figure 10, $F_2$ is split into blocks of 512 columns, and $W_2$ is divided into blocks of 512 rows. Multiplying the divided blocks, $F_{2,\mathrm{block}}\,(n \times 512) \times W_{2,\mathrm{block}}\,(512 \times 512) = F_{3,\mathrm{block}}\,(n \times 512)$, produces four $F_{3,\mathrm{block}}$ matrices. Each $F_{3,\mathrm{block}}$ is treated as a layer, and the layers are stacked and added in a loop to obtain the final result.
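The divide-and-accumulate steps can be checked with a small software model. The sketch below splits the inner dimension into equal blocks, multiplies block by block, and accumulates the partial products; the function names are hypothetical, and the demo uses a block size of 2 in place of 512 so that it runs quickly.

```python
def matmul(a, b):
    """Plain matrix product of a (n x k) and b (k x m), as nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)] for row in a]

def blocked_matmul(a, b, block):
    """Split the inner dimension into `block`-wide pieces, multiply each, and accumulate."""
    n, k_total, m = len(a), len(b), len(b[0])
    assert k_total % block == 0, "inner dimension must be a multiple of the block size"
    acc = [[0] * m for _ in range(n)]
    for start in range(0, k_total, block):
        a_blk = [row[start:start + block] for row in a]   # column block of the left matrix
        b_blk = b[start:start + block]                    # matching row block of the right matrix
        partial = matmul(a_blk, b_blk)                    # one accelerator-sized product
        for i in range(n):
            for j in range(m):
                acc[i][j] += partial[i][j]                # add this layer to the running sum
    return acc

if __name__ == "__main__":
    f2 = [[1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]]   # n = 2, inner = 8
    w2 = [[r + c for c in range(3)] for r in range(8)]          # 8 x 3
    print(blocked_matmul(f2, w2, block=2) == matmul(f2, w2))
```

With an inner dimension of 2048 and a block size of 512 the loop runs four times, matching the four blocks described above.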

4. Implementation Results

4.1. Comparing Clock Cycles

Before verification, we compared clock cycles with those reported in [5]. Since that work uses a systolic array of size $n \times 64$, we used a VSA with the same number of MAC units for a fair comparison. That work excluded the $QK^{T}$ operation because its systolic array could not perform the multiplication, and it reported clock cycles for $N = 64$ (N: word count). Table 1 compares the clock cycles consumed by the two architectures in MHA and FFN for $N = 64$ (the case reported in [5]) and $N = 256$. Under the same conditions (excluding $QK^{T}$), the VSA consumes fewer clock cycles over the whole computation, including MHA and FFN. The clock cycle count decreases by 6%, so the proposed VSA computes faster. Within MHA alone, the VSA consumes more clock cycles, because its inner dimension is fixed at 512 and the attention score $\times V$ step therefore costs more. However, this simple structure allows consistent data preprocessing, which makes the $QK^{T}$ operation possible on the accelerator as well. Including the $QK^{T}$ operation is therefore expected to reduce the real latency of MHA. For the design in [5], we assumed that the clock cycles of the systolic array increase fourfold at $N = 256$ compared with $N = 64$. Relative to [5], the VSA's clock cycles decrease by 69% when the $QK^{T}$ operation is excluded and by 68% even when it is included. The VSA thus has a flexible structure in which the clock cycle count does not increase significantly even when the word count grows to 256.

4.2. Hardware Implementation Result

The Verilog HDL design was synthesized and implemented using Xilinx Vivado 2022.2, and the bitstream was loaded into Xilinx Vitis 2022.2 utilizing JTAG UART for communication and testing. To enhance the reliability of the experimental results, we compared the outcomes with those executed on the ARM Cortex-R5 processor within the VMK180 board.

4.2.1. VSA Performance with Increasing Word Count

To validate the results, we first wrote a C model of the matrix multiplier, based on the quantization algorithm implemented in Verilog HDL, and executed it on an ARM processor. The results obtained from the VSA were consistent with those from this software model. We then calculated the ratio of processing speeds between the SW and HW implementations for word counts of 4, 16, 64, and 256, and determined the relative speed increase for each word count compared with the baseline at four words. The VSA, which supports parallel processing, showed larger speed improvements as the word count increased. Figure 11 illustrates performance across three cases, each tested twice. On average, the VSA achieved up to 2.87 times faster processing at the maximum word count than at four words.

4.2.2. Transformer Encoder Performance and Accuracy

To validate these results, we optimized the encoder model in C language and executed it on an ARM processor. We compared the speed of the Transformer encoder operations designed with an embedded system and calculated the differences between adjacent columns in each result row. We then computed the mean squared error (MSE) to assess similarity [35]. We conducted the test twice, using 10 different random seeds with 10 iterations per seed. Figure 12 illustrates that processing speed and similarity improve as the word count increases due to the larger sample size. The speed increased by an average factor of 3.45 at the maximum word count, while the accuracy reached an average of 97.60 percent.
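One possible reading of the similarity measure is sketched below: take the differences between adjacent columns in each row of the reference (software) and accelerator outputs, then compute the mean squared error between the two sets of differences. The text does not spell out the exact procedure or how the MSE is converted into a percentage, so the function names and the formulation are assumptions, and the matrices in the demo are made-up values rather than measured results.

```python
def adjacent_diffs(mat):
    """Differences between neighbouring columns in every row."""
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in mat]

def mse(x, y):
    """Mean squared error between two equally shaped nested lists."""
    flat_x = [v for row in x for v in row]
    flat_y = [v for row in y for v in row]
    return sum((p - q) ** 2 for p, q in zip(flat_x, flat_y)) / len(flat_x)

def diff_mse(reference, measured):
    """MSE between the adjacent-column difference patterns of two result matrices."""
    return mse(adjacent_diffs(reference), adjacent_diffs(measured))

if __name__ == "__main__":
    reference = [[0.0, 1.0, 3.0], [2.0, 2.0, 5.0]]   # illustrative values only
    measured = [[0.0, 2.0, 3.0], [2.0, 2.0, 5.0]]
    print(diff_mse(reference, measured))
```

Because the metric compares column-to-column changes rather than absolute values, a constant offset applied to a whole row does not affect it.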

5. Conclusions

In this work, we propose a Transformer accelerator that is scalable to multiple models. The VSA receives weights in parallel and uses a row-by-row data input scheme, which allows rapid execution of the large-scale matrix multiplications central to Transformer algorithms. Compared with the traditional systolic arrays discussed in this paper, the VSA-based Transformer encoder is expected to perform better when dealing with larger word counts. Furthermore, since dynamic power is proportional to the clock cycle count, the reduced clock cycles of the proposed VSA architecture are expected to lower dynamic power accordingly. In STAU, the processor preprocesses the data, and the accelerator handles all iterative operations, such as the VSA and softmax operations. It can therefore be extended to multiple Transformer-based models by modifying the software that runs on the processor, without modifying the hardware intellectual property (IP). It can also be extended to Transformer decoders, the generative AI models that have recently come into frequent use. The STAU can serve as a high-speed accelerator for LLM-based AI algorithms. It can improve the performance of cloud server-based applications, increasing consumer satisfaction and reducing server power consumption. Specifically, the implemented STAU is expected to accelerate voice assistant applications in mobile application processors (APs), where multiple Transformer models are used, for example for call recording summaries or voice memo organization. While STAU can also accelerate encoder–decoder models, it is best suited to voice assistant applications that rely primarily on encoder models such as Conformer, Squeezeformer, Branchformer, and BERT [36,37,38,39,40], because for encoder–decoder models the processor must perform masking and fetch data stored in DRAM during the cross-attention process.

Author Contributions

Conceptualization, S.-W.C. and D.-S.K.; Methodology, S.-W.C. and D.-S.K.; Software, S.-W.C.; Validation, S.-W.C.; Formal analysis, S.-W.C.; Investigation, S.-W.C.; Writing—original draft, S.-W.C.; Writing—review & editing, S.-W.C. and D.-S.K.; Supervision, D.-S.K.; Project administration, D.-S.K. All authors have read and agreed to the published version of the manuscript.


This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2024-004380007) supervised by the IITP (Institute for Information and Communications Technology Planning and Evaluation).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.


  1. Luca, L.; Bellotti, F.; Berta, R. An Embedded End-to-End Voice Assistant. Eng. Appl. Artif. Intell. 2024, 136, 108998. [Google Scholar] [CrossRef]
  2. Chen, J.; Teo, T.T.; Kok, C.L.; Koh, Y.Y. A Novel Single-Word Speech Recognition on Embedded Systems Using a Convolution Neuron Network with Improved Out-of-Distribution Detection. Electronics 2024, 13, 530. [Google Scholar] [CrossRef]
  3. Dura, D. Design and Analysis of VLSI Architectures for Transformers. Ph.D. Thesis, Politecnico di Torino, Turin, Italy, 2022. [Google Scholar]
  4. Zhong, J.; Liu, Z.; Chen, X. Transformer-Based Models and Hardware Acceleration Analysis in Autonomous Driving: A Survey. arXiv 2023, arXiv:2304.10891. [Google Scholar] [CrossRef]
  5. Lu, S.; Wang, M.; Liang, S.; Lin, J.; Wang, Z. Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer. In Proceedings of the 2020 IEEE 33rd International System-on-Chip Conference (SOCC), Virtual Conference, 8–11 September 2020; pp. 84–89. [Google Scholar] [CrossRef]
  6. Vaswani, S. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  7. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. arXiv 2020, arXiv:1909.11942. [Google Scholar]
  8. Ham, T.J.; Jung, S.J.; Kim, S.; Oh, Y.H.; Park, Y.; Song, Y.; Park, J.; Lee, S.; Park, K.; Lee, J.W.; et al. A3: Accelerating Attention Mechanisms in Neural Networks with Approximation. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), San Diego, CA, USA, 22–26 February 2020; pp. 328–341. [Google Scholar] [CrossRef]
  9. Russell, G. The Anatomy of Hardware Accelerators for VLSI Circuit Design. Comput.-Aided Eng. J. 1989, 6, 82–91. [Google Scholar] [CrossRef]
  10. Possa, P.; Schaillie, D.; Valderrama, C. FPGA-Based Hardware Acceleration: A CPU/Accelerator Interface Exploration. In Proceedings of the 2011 18th IEEE International Conference on Electronics, Circuits, and Systems, Beirut, Lebanon, 11–14 December 2011; pp. 374–377. [Google Scholar] [CrossRef]
  11. Liu, P.; Li, S.; Ding, Q. An Energy-Efficient Accelerator Based on Hybrid CPU-FPGA Devices for Password Recovery. IEEE Trans. Comput. 2019, 68, 170–181. [Google Scholar] [CrossRef]
  12. Shi, W.; Li, X.; Yu, Z.; Overett, G. An FPGA-Based Hardware Accelerator for Traffic Sign Detection. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 25, 1362–1372. [Google Scholar] [CrossRef]
  13. Ye, W.; Zhou, X.; Zhou, J.; Chen, C.; Li, K. Accelerating Attention Mechanism on FPGAs Based on Efficient Reconfigurable Systolic Array. ACM Trans. Embed. Comput. Syst. 2023, 22, 1–22. [Google Scholar] [CrossRef]
  14. Benacer, I.; Boyer, F.-R.; Bélanger, N.; Savaria, Y. A Fast Systolic Priority Queue Architecture for a Flow-Based Traffic Manager. In Proceedings of the 2016 14th IEEE International New Circuits and Systems Conference (NEWCAS), Vancouver, BC, Canada, 26–29 June 2016; pp. 1–4. Available online: (accessed on 24 October 2024).
  15. Lee, M. GELU Activation Function in Deep Learning: A Comprehensive Mathematical Analysis and Performance. arXiv 2023, arXiv:2305.12073.
  16. Kung, H.T.; Leiserson, C.E. Systolic Arrays (for VLSI). In Sparse Matrix Proceedings 1978; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1979; Volume 1, pp. 256–282.
  17. Johnson, K.T.; Hurson, A.R.; Shirazi, B. General-Purpose Systolic Arrays. Computer 1993, 26, 20–31.
  18. Milovanovic, I.Z.; Tokic, T.I.; Milovanovic, E.I.; Stojcev, M.K. Determining the Number of Processing Elements in Systolic Arrays. Facta Univ. Ser. Math. Inf. 2000, 15, 123–132.
  19. Huang, Y.; Shen, J.; Qiao, Y.; Wen, M.; Zhang, C. MALMM: A Multi-Array Architecture for Large-Scale Matrix Multiplication on FPGA. IEICE Electron. Express 2018, 15, 20180286.
  20. Asgari, B.; Hadidi, R.; Kim, H. MEISSA: Multiplying Matrices Efficiently in a Scalable Systolic Architecture. In Proceedings of the 2020 IEEE 38th International Conference on Computer Design (ICCD), Hartford, CT, USA, 18–21 October 2020; pp. 130–137.
  21. Chen, Y.-H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE J. Solid-State Circuits 2017, 52, 127–138.
  22. Sun, Q.; Di, Z.; Lv, Z.; Song, F.; Xiang, Q.; Feng, Q.; Fan, Y.; Yu, X.; Wang, W. A High Speed SoftMax VLSI Architecture Based on Basic-Split. In Proceedings of the 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), Qingdao, China, 31 October–3 November 2018; pp. 1–3.
  23. Valls, J.; Kuhlmann, M.; Parhi, K.K. Evaluation of CORDIC Algorithms for FPGA Design. J. VLSI Signal Process. Syst. Signal Image Video Technol. 2002, 32, 207–222.
  24. Jiang, Z.; Gu, J.; Pan, D.Z. NormSoftmax: Normalizing the Input of Softmax to Accelerate and Stabilize Training. In Proceedings of the 2023 IEEE International Conference on Omni-Layer Intelligent Systems (COINS), Berlin, Germany, 23–25 July 2023; pp. 1–6.
  25. Zhen, W.; Xia, M.; Liu, B.; Ruan, X.; Gong, Y.; Yang, J.; Ge, W.; Yang, J. EERA-DNN: An Energy-Efficient Reconfigurable Architecture for DNNs with Hybrid Bit-Width and Logarithmic Multiplier. IEICE Electron. Express 2018, 15, 20180212.
  26. Kahan, W. IEEE Standard 754 for Binary Floating-Point Arithmetic. Lecture Notes on the Status of IEEE. 1996; Volume 754. Available online: (accessed on 24 October 2024).
  27. Kung, H.T.; McDanel, B.; Zhang, S.Q.; Dong, X.; Chen, C.C. Maestro: A Memory-on-Logic Architecture for Coordinated Parallel Use of Many Systolic Arrays. In Proceedings of the 2019 IEEE 30th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), New York, NY, USA, 15–17 July 2019; Volume 2160-052X, pp. 42–50.
  28. Liu, Z.; Ma, S.; Guo, Y. An Efficient Floating-Point Multiplier for Digital Signal Processors. IEICE Electron. Express 2014, 11, 20140078.
  29. Bondarenko, Y.; Nagel, M.; Blankevoort, T. Understanding and Overcoming the Challenges of Efficient Transformer Quantization. arXiv 2021, arXiv:2109.12948.
  30. Bansal, H.; Sharma, K.G.; Sharma, T. Wallace Tree Multiplier Designs: A Performance Comparison. Innov. Syst. Des. Eng. 2014, 5, 67.
  31. Vijay, V.; Sreevani, M.; Mani Rekha, E.; Moses, K.; Pittala, C.S.; Shaik, K.A.S.; Koteshwaramma, C.; Sai, R.J.; Vallabhuni, R.R. A Review on N-Bit Ripple-Carry Adder, Carry-Select Adder and Carry-Skip Adder. J. VLSI Circuits Syst. 2022, 4, 27–32.
  32. Park, S.-S.; Chung, K.-S. CONNA: Configurable Matrix Multiplication Engine for Neural Network Acceleration. Electronics 2022, 11, 2373.
  33. Zhang, C.; Sun, G.; Fang, Z.; Zhou, P.; Pan, P.; Cong, J. Caffeine: Toward Uniformed Representation and Acceleration for Deep Convolutional Neural Networks. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 38, 2072–2085.
  34. Pal, S.; Beaumont, J.; Park, D.-H.; Amarnath, A.; Feng, S.; Chakrabarti, C.; Kim, H.-S.; Blaauw, D.; Mudge, T.; Dreslinski, R. OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator. In Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Vienna, Austria, 24–28 February 2018; pp. 724–736.
  35. Dosselmann, R.; Yang, X.D. A Comprehensive Assessment of the Structural Similarity Index. Signal Image Video Process. 2011, 5, 81–91.
  36. Gulati, A.; Qin, J.; Chiu, C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-Augmented Transformer for Speech Recognition. arXiv 2020, arXiv:2005.08100.
  37. Kim, S.; Gholami, A.; Shaw, A.; Lee, N.; Mangalam, K.; Malik, J.; Mahoney, M.W.; Keutzer, K. Squeezeformer: An Efficient Transformer for Automatic Speech Recognition. arXiv 2022, arXiv:2206.00888.
  38. Peng, Y.; Dalmia, S.; Lane, I.; Watanabe, S. Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 17627–17643. Available online: (accessed on 21 November 2024).
  39. Chuang, Y.; Liu, C.; Lee, H.; Lee, L. SpeechBERT: An Audio-and-Text Jointly Learned Language Model for End-to-End Spoken Question Answering. arXiv 2020, arXiv:1910.11559.
  40. Kim, M.; Kim, G.; Lee, S.-W.; Ha, J.-W. ST-BERT: Cross-Modal Language Model Pre-Training for End-to-End Spoken Language Understanding. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7478–7482.
Figure 1. Differences in processes with and without hardware accelerators.
Figure 2. Top module architecture.
Figure 3. (a) Multi-Head Attention hardware architecture. (b) Data flow of Transformer encoder.
Figure 4. (a) Comparison of clock cycles with systolic arrays. (b) Comparison of MAC units with systolic arrays.
Figure 5. (a) Comparison of clock cycles with systolic arrays. (b) Comparison of MAC units with systolic arrays.
Figure 6. Block diagram of softmax module.
Figure 7. Quantization bit distribution method.
Figure 8. Zero-padding techniques in Multi-Head Attention.
Figure 9. Concatenating eight heads of Multi-Head Attention into vector format.
Figure 10. The layer-form accumulation method using block-by-block multiplication for the second matrix multiplication in the Feed-Forward Network.
Figure 11. Speed improvement comparison for a word count of 4.
Figure 12. Performance comparison for different word counts.
Table 1. Comparison with Lu et al.
| N | Paper | Clock Cycles | Ratio |
|---|---|---|---|
| 64 | Lu et al. [5] (excluding QK^T) | MHA: 21,344, FFN: 42,099; Total: 63,443 | |
| 64 | This paper (excluding QK^T) | MHA: 23,040, FFN: 36,864; Total: 59,904 | |
| 64 | This paper (including QK^T) | MHA: 29,184, FFN: 36,864; Total: 66,048 | |
| 256 | Lu et al. [5] (excluding QK^T) | MHA: 85,376, FFN: 168,396; Total: 253,772 (estimation) | |
| 256 | This paper (excluding QK^T) | MHA: 30,720, FFN: 49,152; Total: 79,872 | |
| 256 | This paper (including QK^T) | MHA: 32,064, FFN: 49,152; Total: 81,216 | |
Clock cycle comparison when word count is 64 and 256.
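
As a rough cross-check of Table 1, the totals can be recomputed from the per-module MHA and FFN counts and expressed relative to the Lu et al. [5] baseline at the same word count. The sketch below assumes that a ratio means the total cycles of a row divided by the total cycles of Lu et al. [5] at the same N; that definition, along with the names `TABLE_1`, `total_cycles`, and `cycle_ratios`, is illustrative and not taken from the paper.

```python
# (MHA cycles, FFN cycles) per configuration, transcribed from Table 1.
# The key names are hypothetical labels chosen for this sketch.
TABLE_1 = {
    64: {
        "lu_excl_qkt": (21_344, 42_099),
        "this_excl_qkt": (23_040, 36_864),
        "this_incl_qkt": (29_184, 36_864),
    },
    256: {
        "lu_excl_qkt": (85_376, 168_396),  # marked "(estimation)" in Table 1
        "this_excl_qkt": (30_720, 49_152),
        "this_incl_qkt": (32_064, 49_152),
    },
}


def total_cycles(mha: int, ffn: int) -> int:
    """Total clock cycles for one row: MHA plus FFN."""
    return mha + ffn


def cycle_ratios(n: int) -> dict[str, float]:
    """Ratio of each row's total to the Lu et al. total at word count n.

    Assumption: ratio = row total / baseline total, so values below 1
    mean fewer clock cycles than the baseline.
    """
    rows = TABLE_1[n]
    baseline = total_cycles(*rows["lu_excl_qkt"])
    return {name: total_cycles(*cycles) / baseline for name, cycles in rows.items()}


if __name__ == "__main__":
    for n in TABLE_1:
        for name, ratio in cycle_ratios(n).items():
            print(f"N={n:>3}  {name:<14} total={total_cycles(*TABLE_1[n][name]):>7}  ratio={ratio:.3f}")
```

Under that assumed definition, the row excluding QK^T comes out at roughly 0.94 of the baseline for N = 64 and roughly 0.31 for N = 256, which matches the trend in the table that the gap widens as the word count grows.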

MDPI and ACS Style

Chang, S.-W.; Kim, D.-S. Scalable Transformer Accelerator with Variable Systolic Array for Multiple Models in Voice Assistant Applications. Electronics 2024, 13, 4683.