Energy and Precision Evaluation of a Systolic Array Accelerator Using a Quantization Approach for Edge Computing

Sanchez-Flores, Alejandra; Fornt, Jordi; Alvarez, Lluc; Alorda-Ladaria, Bartomeu

doi:10.3390/electronics13142822

Open AccessFeature PaperArticle

Energy and Precision Evaluation of a Systolic Array Accelerator Using a Quantization Approach for Edge Computing

¹

Department of Industrial Engineering and Construction, Universitat de les Illes Balears Palma, 07122 Palma, Spain

²

Barcelona Supercomputing Center, Universitat Politècnica de Catalunya Barcelona, 08034 Barcelona, Spain

³

Balearic Islands Health Research Institute (IdISBa), 07120 Palma, Spain

⁴

Institute for Environmental Agro-Environmental Research and Water Economics (INAGEA), 07120 Palma, Spain

^*

Authors to whom correspondence should be addressed.

Electronics 2024, 13(14), 2822; https://doi.org/10.3390/electronics13142822

Submission received: 11 June 2024 / Revised: 9 July 2024 / Accepted: 15 July 2024 / Published: 18 July 2024

(This article belongs to the Special Issue Recent Advances and Challenges in IoT, Cloud and Edge Coexistence)

Download

Browse Figures

Versions Notes

Abstract

:

This paper focuses on the implementation of a neural network accelerator optimized for speed and energy efficiency, for use in embedded machine learning. Specifically, we explore power reduction at the hardware level through systolic array and low-precision data systems, including quantized approaches. We present a comprehensive analysis comparing a full precision (FP16) accelerator with a quantized (INT16) version on an FPGA. We upgraded the FP16 modules to handle INT16 values, employing data shifts to enhance value density while maintaining accuracy. Through single convolution experiments, we assess the energy consumption and error minimization. The paper’s structure includes a detailed description of the FP16 accelerator, the transition to quantization, mathematical and implementation insights, instrumentation for power measurement, and a comparative analysis of power consumption and convolution error. Our results attempt to identify a pattern in 16-bit quantization to achieve significant power savings with minimal loss of accuracy.

Keywords:

array systolic accelerator; energy consumption; embedded systems

1. Introduction

Automated learning strategies can be used in many areas; one of them is the processing of large amounts of information from sensor networks to provide accurate and reliable data, while minimizing the energy required. The design of neural network accelerators to perform convolutional operations, optimizing speed and energy consumption, is an active research field.

The issue of power reduction in embedded machine learning is important at both the software and hardware levels. At the hardware level, power reduction has been approached from different angles. In the context of accelerators, the use of systolic arrays is a particularly effective approach, as it allows for the parallel execution of the same operation with different input data, thereby generating an efficient output vector, as exemplified by the vector [1]. Furthermore, among the most frequently utilized and efficacious strategies are the use of direct memory access (DMA) memories to reduce power, due to their efficient data access [2]. Data access through protocols such as the Advanced eXtensible Interface (AXI) improves data availability between the central processing unit (CPU) and the accelerator. Quantized systems that use low-precision integer data consume less power than those that use floating point data [3]. Binarized systems, which use a single bit to store ML parameters or perform bit-wise operations, consume even less power, and some authors [4,5,6] propose a unit of bit-wise operations in the accelerator. In light of the findings of our previous studies [7,8] we have concluded that quantization represents one of the most effective approaches at 8 or 16 bits.

In this paper, we will focus on the implementation of a full-precision version of the accelerator and the design and implementation of a quantized version, on an FPGA. The differences in the behavior of some of the basic convolutions commonly used in machine learning provide opportunities to reduce power consumption by comparing the results of FP16 and INT16.

First, we use an FP16 accelerator [9], which exhibits overall optimized and energy-efficient characteristics. The proposal is to upgrade the accelerator modules to be used with INT16 values and to use a data shift to adjust the result towards the higher density values. Furthermore, we remain cognizant of the importance of accuracy. To explain the optimality conditions of the convolution operations and the variables involved, experiments are performed at the single-convolution level, observing the behavior of low energy consumption and minimum error value. The structure of the paper is as follows: Section 2 describes the FP16 accelerator and the modules involved in the transition of the quantization. Section 3 is devoted to the mathematical and implementation explanations of the quantization. Section 4 is dedicated to the instrumentation required to compute power and energy. Section 5 compares the power consumption of the two versions of the accelerator and the effect of low-precision data on the convolution error. Finally, in Section 6, we analyze whether this strategy has a positive impact on energy savings and discuss whether the unquestionable loss of accuracy in the results is worth the energy savings.

2. Accelerator Description

The optimization approach and evaluation results are obtained on an FP16 accelerator structure reported in a previous work [9]—see Figure 1a. The accelerator system consists of a systolic array (SA) that interchanges data with a DMA. The DMA controller applies data to the SA and receives the corresponding output. It manages the activation tensor (tensor A), the weight tensor (tensor B), and the pre-load value tensor (tensor C). All values of tensors are FP16.

The SA is the processing module based on a parallel scheme—see Figure 1c. It consists of an XY array of processing elements (PEs), organized as a matrix receiving input tensors A, B, and C. Each PE individually receives input data

a_{i}

,

b_{i}

, and

c_{i}

, which is contained in the corresponding tensors. The PE has a floating point (FP) processing unit that attempts to obtain the most accurate value possible for the multiplication–accumulation (MAC) operation for the convolution computation.

The final and most pertinent module for our purposes is the Partial Sum Module (PSM). Its operation is summarized in Figure 1b. Input tensor C, denoted as

C_{i}

, corresponds to the preload values passed from one convolution to the next. This tensor is stored in the input buffer C before processing starts. When the start of processing is commanded by the CPU, the MUX1 selects the contents of buffer C, which is transferred to the shift register submodule SR. The SR temporarily stores the C-tensors and then transfers them to the SA via MUX2. The MUX2 switches from zero to the SR content when there is a valid value of tensors. The C buffer reads the contents of the SR and acts as the output buffer when all operations are completed. The data type of all registers is maintained at a length of 16 bits throughout the module.

3. Quantization and Implementation

In the previous section, the operations were performed with FP16, and all the registers in the PEs and the SMP were defined as they are. To convert data to the INT16 format, the quantization process is conducted prior to the data being transferred to the accelerator.

Furthermore, it is important to consider that the inference task of our accelerator must be able to work independently from the training process. This means that the data accuracy remains at full accuracy during training, and any parameter adjustments are scheduled for the inference stage. Some authors propose the adjustment of such parameters in a post-training phase, converting them to low-precision data, with good results of accuracy [10,11,12,13]. Based on this, we propose the quantization post-training using FP16 data to be performed in the CPU to obtain quantized data for tensor A and tensor B, and tensor C (if applicable). All these are fed to the accelerator.

3.1. Description of Quantization

The proposed quantization strategy works with uniformly quantized and symmetric values, as proposed in [13,14]. Thus, it should be assumed a priori that the model data in FP16 format have a Gaussian distribution with mean µ = 0 and variance σ 2 = S (standard deviation).

In addition, it is established that, initially, any data in the interval

[{- d}_{m a x}, d_{m a x}]

are of the floating point type and their value is rounded to a

B_{10}

decimal value, as represented by Equation (1), using powers of 2 [15,16]:

B_{10} = {(β_{0} \dots β_{n} \dots)}_{2} = \sum_{n \in N} β_{n} 2^{n}

(1)

Since n can be any positive or negative integer value, Equation (1) results in a mixed integer/fractions format, as shown in Equation (2), where each

β_{n}

takes a value of 1 or 0.

(β_{n} 2^{n}) + (β_{n - 1} 2^{n - 1}) + \dots + (β_{1} 2^{1}) + (β_{0} 2^{0}) + (β_{- 1} 2^{- 1}) + (β_{- 2} 2^{- 2}) + \dots

(2)

It is worth mentioning that the numbers of fractions given by the elements with a negative power are known as dyadic numbers. Hardware design has used the dyadic system for decades to optimize processors [17,18,19]. In accelerators, previous work [20] has implemented this strategy, which ensures that the results of arithmetic operations are always kept in powers of 2, speeding up processing.

To explain quantization, let n be a given positive integer. The maximum representable value is

d_{m a x} = {(2}^{n} - 1)

and the interval of numbers is given by

[- (2^{n} - 1), (2^{n} - 1)]

. This set contains only integers, since the smallest number that can be represented is

d_{m i n} = 2^{0}

= 1, as shown in Figure 2a.

If we reduce the power n by two units, which corresponds to a shift to the right, then

d_{m a x} = {(2}^{n - 2} - 1)

. The interval is redefined as

[- (2^{n - 2} - 1), (2^{n - 2} - 1)]

, as is illustrated in Figure 2b, and

d_{m i n} = 2^{- 2} = 0.25

. Now, the set of data contains floating point numbers. Using a binary notation, the conversion can be performed by applying a shift to the right.

Roughly speaking, using n bits, the interval of numbers is defined as

[- (2^{n - m} - 1), (2^{n - m} - 1)]

, with an accuracy of

2^{- m}

. Then, the larger the bit shift on the right, the smaller the minimum value, allowing for greater precision and data density. This allows the data interval to be adjusted to include more values around the value of σ2. Figure 2c illustrates the example for m = 4.

Once the interval and precision are established, the next step is to determine the length of registers the PEs will operate on. To do this, the definition of the MAC operation is recalled. It consists of one multiplication operation and one addition operation. The product of two floating numbers,

f_{1}

and

f_{2}

, as defined in Equation (1), is given by Equation (3), as follows:

f_{1} = α_{θ} \cdot 2^{θ} and f_{2} = α_{η} \cdot 2^{η}

(3)

which is described in [15]. Then, the product of two floating numbers is

(f_{1} \cdot f_{2}) = (α_{θ} α_{η} \cdot 2^{θ + η})

(4)

For our purposes

α_{θ} = α_{η} = 1

. If θ and η are 16, then the multiplication result will be a maximum of 32 bits in view of the maximum power of the input data. In conclusion, the output register of the MAC is 34 bits when the carry derived from the addition part and the sign bit are added.

In the remainder of this section, the left-shift operation applied to the data will be referred to as the quantization, and it is illustrated in Figure 3a. The inverse operation of right-shifting of the data will be referred to as dequantization, as shown in Figure 3b.

If the dequantization leaves MSB without values, they are filled with 1’s if the number to dequantize is negative, and filled with 0’s if the value is positive. By shifting the 16 bits to a higher value position, the least significant bits are filled with 0’s.

To conclude this section, it should be emphasized that the selection of the best offset-shift for each layer is applied during the convolution process. The HW of the accelerator must be designed to handle this.

3.2. Implementation on FPGA

The following explains the changes required to update the accelerator and work with quantized values. The changes are numbered according to the order in which they were implemented.

The data ${I A}_{w}$ , ${I B}_{w}$ , and ${I C}_{w}$ are the length of the input data, $a_{i}$ (activation data), $b_{i}$ (weights), and $c_{i}$ (preload values), and are all updated to INT16.
Each PE is modified to use only integer arithmetic, as shown in Figure 4a.
The PSM, previously shown in Figure 1b, is updated, as shown in Figure 4b. The main change is the modification of the output of the PSM, the tensor $C_{o}$ . As explained in the last section, the new length of ${O C}_{w}$ is 34 bits.
$C_{o}$ is obtained after the MAC operation by using the data $a_{i}$ , $b_{i}$ , and $c_{i}$ , which are grouped into the tensors A, B, and C, respectively. For each Y row of the SA matrix, a tensor $C_{o}$ is output, then a set of $Y * ({O C}_{w} - 1)$ data (or Y number of Co tensors) are output. Since the length of ${O C}_{w}$ does not match the ${I C}_{w}$ length (16 bits), a quantization is implemented, as is shown in the left side of Figure 4b. A new control signal SEL_SHIFT is utilized to select the offset.
When the data leaves the SR block, the data length is 16 bits, and is subjected to a dequantization process to, again, maintain consistency with the length of the SA registers. See the right side of Figure 4b.

The quantization submodule is a set of Y numbers of multiplexers applied to each data Co. Figure 5 illustrates the process individually for a single data element. The data with length

{O C}_{w}

enters the module where the selection offset is applied to the MUX. Once reduced, a length tensor

{I C}_{w}

is passed to the shift register. The value of SEL_SHIFT is shared by all data. See left side of Figure 5.

The dequantization submodule comprises a set of Y demultiplexers, each dedicated to a single data element. At the output of the shift register, a data element of length

{I C}_{w}

enters the DEMUX, which has a SEL_SHIFT input for the determination of the offset for the dequantization of the data to

{O C}_{w}

. This offset is shared by all data elements leaving the SR. See right side of Figure 5.

4. Instrumentation and Measurement Methodology

After a review of the quantization literature, we found that the power or energy consumption values were not standardized. Some authors present power in units of mW [21,22] or W [23], while others present energy in units of µJ [24,25]. Even the energy efficiency values are presented in different units, such as pJ/op [26] or MAC/W [27]. The issue at hand is not the unit in and of itself. Rather, it is the lack of specification as to the methodology and instrumentation used in the measurement process that gives rise to questions as to the criteria used to evaluate the power and energy consumption, as well as the energy efficiency.

This section is organized as follows: the first subsection describes the features of the implementation board, the second subsection gives an overview of the system implemented in Vivado, and the last section is devoted to explaining the test cases.

4.1. Implementation Board Properties

The quantized accelerator is intended for use in embedded applications. We have established that an FPGA is necessary for this research. Among the FPGAs used for embedded applications, we find the Pynq series, which offers design support in Vivado Xilinx, and a Debian Linux operating system that allows running programs using Jupyter Notebook. Specifically, the Ultra96-v2 board was selected as the optimal choice. In addition, users can obtain electrical information from the Ultra96-v2 board via PMBus communication using Infineon’s USB005, a USB dongle.

Measurements can be made on the FPGA since the Pynq board is electrically divided into two sections—a processor system called PS and an FPGA section called PL, as shown in Figure 6a. The PL voltage exhibits a mean value of 0.85 V with a noise level of 0.016 Vpp. The processing consumption is more clearly reflected in the current signal, which has a standby value of 85 mA. Both signals are recorded and used for each power calculation. The mark signal is configured by the user to provide time references. See Figure 6b.

4.2. High-Level System

The acceleration system described in Figure 7 consists of a CPU provided by the Zynq Ultrascale+ microprocessor and the Sauria subsystem accelerator.

The CPU and the accelerator utilize the AXI for communication between them. The AXI Lite is employed to transmit 32-bit control and configuration commands, with the CPU acting as the master and the accelerator as the slave. The full 128-bit AXI is employed as a channel for data transmission, with the accelerator acting as the master of the channel and the CPU acting as the slave. The AXI interconnect blocks are provided by the Pynq framework, and they are responsible for managing data latency, synchronization, and arbitration. Similarly, the reset and clock signal connections are handled by the framework. The clock frequency of the entire system has been set to 25 MHz.

4.3. Instrumentation Description

To perform the required measurements of the electrical signals, the following connections must be established (see Figure 8 for an illustration of the necessary wiring):

PMBus [28] is a 400 KHz I2C that sends information from the Ultra96-v2 board’s voltage regulator. The pynq.pmbus.DataRecorder class of Python provides an interface to obtain the voltage, current, timing, and other signals from the PL part, which are sent to the host.

4.4. Tests Description

The purpose of the convolution tests is to examine the response of the accelerator under different scenarios. The block diagram in Figure 9 shows the general test sequence.

The configuration phase consists of setting the HW characteristics that must match the RTL implementation of the accelerator, defined as hyperparameters (see Table 1). These parameters include the dimensions of the systolic array of the accelerator (X and Y), the input data type, and the BRAM width.

In the initialization phase, the AXI communication channels are initialized; the parameters relating to the type of convolution are sent to the accelerator; and the activation tensors, weights, and preload are written to the DMA.

The start-processing block represents the time that the accelerator performs a convolution, from the time the CPU sends the start command until the accelerator sends the completion response.

Post-processing tasks include reading the tensor resulting from the convolution and switching the double output buffer. The reporting stage indicates the end of the convolution evaluation by giving internal accelerator information.

Tests have repetition cycles—see Figure 9. These cycles are ordered by execution levels. The first one, in the blue line, indicates the loop in which the N tests (N = 11) are executed to evaluate the accelerator response under different convolution conditions.

The purple line represents the convolution repetitions loop (R_tests = 10,000), which serves to stabilize the measured electrical signals. Only one convolution is processed in nanoseconds, which is not enough to keep the current and voltage at a measurable value. In particular, one convolution execution implies all the MAC operations, partial sums, and data transfers to memory. During the CNN inference, processing the convolution is executed on numerous occasions, potentially a hundred or more times.

The green line indicates the wait loop while the convolution is being completed. This is carried out instead of implementing an interrupt signal which, when tested, increased the time between tests.

4.5. Convolution Features for Every Test

We conducted the evaluation using 11 test cases, as shown in Table 2. To test the system under different conditions, each test is designed with different convolution parameters. The intention is to explore different convolution features.

4.6. Energy Calculation

Once we have established all the information about the system, the measurement signals, and the tests, we proceed to obtain the power and energy results.

Knowing that a single convolution is carried out in a time that is not sufficient for measurements (approx. 68 ms average sampling time), we repeat the convolution calculation block of Figure 9 up to 10,000 times to obtain a measurement of voltage and current for calculating the power value in one instant time.

On the other hand, in order to accurately identify the current corresponding to the activity of the accelerator, we notice different current levels in the measurements (Figure 10). In the first one, when the board is turned on and before the start of the test, we have a steady state with some noise. During this period of inactivity,

T_{n o p r o c}

, the mean PL current, is calculated and stored as a constant

I_{m e a n}

. In the second, during the execution of the convolution (

T_{p r o c}

),

I_{p r o c}

represents the current consumed by the process. Thus, to obtain

I_{p r o c}

, we take the difference between each instantaneous measured current I_INT and the preprocessing current

I_{m e a n}

.

I_{p r o c} (i) = I_{I N T} (i) - I_{m e a n} (i)

(5)

The power calculation is performed for each sample i:

P (i) = V_{I N T} (i) * I_{p r o c} (i)

(6)

The energy consumed by the accelerator during the processing time is obtained by:

E_{I N T} = [\sum_{T_{p r o c}} P (i) * Δ t (i)]

(7)

where

Δ t (i)

is the time increment during

T_{p r o c}

, corresponding to each sample, and has the median value of 60 ms ±15 ms. The irregular sample timing allows some aleatory peaks to be detected.

E_{I N T}

is the energy calculated for the INT version.

E_{F P}

is the energy calculated for the FP version.

To compare the energy consumption of INT and FP, we will use a percentage of energy. This ratio shows the percentage of energy consumed by the INT version compared to the FP version, using Equation (7), to obtain

E_{I N T}

and

E_{F P}

:

{P E}_{c o n s u m} (k) = 100 % * [\frac{E_{I N T}}{E_{F P}}]

(8)

5. Comparative Analysis

In order to define the type of analysis that we are going to do, we have to specify the objective of this work. In this regard, we state that our objective is to observe the behavior of energy at the convolution run level. This will allow us to identify a relationship of energy with convolution characteristics.

The experiment performed consists of running the convolution cases mentioned in the previous section. The tensor data A, B, and C are initialized with FP16 and are randomized with a normal or Gaussian distribution. In addition, different standard densities are used to keep the convolution results with solutions within the range of possible results.

Following the completion of tests conducted on FP16 and INT16, two evaluations are performed—one for accuracy and another for energy consumption.

5.1. Accuracy Evaluation

In order to obtain the error in the results, we define the

{C O}_{i d e a l}

tensor as the ideal result computed in the Jupyter environment in a PC with a 64-bit precision. The

{C O}_{F P}

tensor contains the results generated by the accelerator under the FP16 model. Furthermore, there is the tensor

{C O}_{I N T}

with the results obtained with the same accelerator using the modifications proposed by this paper. Each of these tensors stores the 11 output tensors corresponding to the 11 tests performed.

We define the relative error for the tensors of each test using FP16 as follows:

{R E}_{F P} = \frac{|{C O}_{i d e a l} - {C O}_{F P}|}{|m e a n ({C O}_{i d e a l})|}

(9)

In a similar way, the relative error for FP is defined as follows:

{R E}_{I N T} = \frac{|{C O}_{i d e a l} - {C O}_{I N T}|}{|m e a n ({C O}_{i d e a l})|}

(10)

Finally, the relative error to compare the two calculations in every test case k is defined as follows:

R E (k) = \frac{|{{R E}_{F P} (k) - R E}_{I N T} (k)|}{|{R E}_{F P} (k)|}

(11)

5.2. Energy Consumption vs. Accuracy Trade-Off

In addition, to evaluate the energy and accuracy, several scenarios have been defined. These scenarios consider the variables that could interfere with the measurements.

One of them is the quantization/dequantization scale value, which is related to the shifting of the data in the accelerator. Another is the distribution of the generated data, defined as the normal or Gaussian distribution for all the scenarios, and along with this information, the variance s² is also included—see Table 3.

For each shift value, we selected the scenarios with the lowest error. These are scenarios 3, 6, 8, and 10, which are shaded in Table 3. Figure 11 shows these scenarios, with (a) the energy plotted and (b) the error plotted.

In general, the lowest energy values are the ones that are more optimized. Accordingly, the most optimal of all is Sc8, as shown in Figure 11a.

Similarly, for the error, the lowest value is the best. Among the scenarios with the lowest error, the most accurate scenario is Sc8, as Figure 11 illustrates.

In Figure 11a,b, Sc8 is shown in green.

6. Conclusions and Discussion

This paper used the INT16 accelerator to make a direct comparison with the FP16 one. Energy and error calculations were applied directly between the two. A percentage ratio was obtained for the former, and a relative error ratio was obtained for the latter.

Although the current and voltage measurements of the FPGA were performed with low-resolution devices, the main goal was to find properties that help reduce power consumption, not to obtain accurate results. Among these characteristics, we found the following ones.

It was expected that using more shift or a higher quantization would result in more accurate results. However, we found that even if the highest quantization is 8, the best result is obtained with a quantization of 6 (Sc8).

Another expectation was that the more accurate the result, the more power is consumed, which is true to some extent, since Sc8 gives the best error result, but it is not the one which consumes the most power.

As for energy consumption, from Figure 11, we observe that some convolution configurations are more energy saving than others, especially test cases 5 and 10, which show an energy spike in any scenario, unlike test cases 0, 1, 2, 4, and 6, which remain at a low consumption.

The results could indicate that for input data defined with certain resolution, defined by the variance, there is an optimal quantization. We see that if we define a larger shift, such as shift = 8, the error is not reduced. On the contrary, it increases.

The optimality conditions that may exist between shift, accuracy, and energy will be further explored in future research. The expectation is to perform a quantified analysis of the number of bits that are involved in the different shifts and to examine if there is a relationship with the standard deviation of the input data. Furthermore, on the experimental side, it could be interesting to switch to an 8-bit accelerator and observe the same behavior.

Author Contributions

Conceptualization, A.S.-F., L.A. and B.A.-L.; Data curation, A.S.-F. and L.A.; Formal analysis, A.S.-F., J.F., L.A. and B.A.-L.; Investigation, A.S.-F., J.F., L.A. and B.A.-L.; Methodology, A.S.-F., J.F., L.A. and B.A.-L.; Project administration, B.A.-L.; Resources, B.A.-L.; Software, A.S.-F. and J.F.; Supervision, L.A. and B.A.-L.; Validation, A.S.-F., L.A. and B.A.-L.; Visualization, A.S.-F.; Writing—original draft, A.S.-F.; Writing—review and editing, J.F., L.A. and B.A.-L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was sponsored by the Grant Mexican Government F-PROMEP-01/Rev-04 SEP-23-002-A; and by the Grant TED2021—130604B-C21 funded by MCIN/AEI/10.13039/501100011033 by the “European Union Next Generation EU/PRTR”.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author/s.

Acknowledgments

The authors acknowledge the collaboration of the Barcelona Supercomputing Center in providing the initial designs and repository.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Liu, Z.G.; Whatmough, P.N.; Mattina, M. Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference. IEEE Comput. Archit. Lett. 2020, 19, 34–37. [Google Scholar] [CrossRef]
Meribout, M.; Baobaid, A.; Khaoua, M.O.; Tiwari, V.K.; Pena, J.P. State of Art IoT and Edge Embedded Systems for Real-Time Machine Vision Applications. IEEE Access 2022, 10, 58287–58301. [Google Scholar] [CrossRef]
Wu, J.; Leng, C.; Wang, Y.; Hu, Q.; Cheng, J. Quantized Convolutional Neural Networks for Mobile Devices. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; Institute of Electrical and Electronics Engineers (IEEE): Piscataway, NJ, USA, 2016; pp. 4820–4828. [Google Scholar]
Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. arXiv 2016, arXiv:1602.02830. [Google Scholar] [CrossRef]
Al Bahou, A.; Karunaratne, G.; Andri, R.; Cavigelli, L.; Benini, L. XNORBIN: A 95 TOp/s/W hardware accelerator for binary convolutional neural networks. In Proceedings of the 21st IEEE Symposium on Low-Power and High-Speed Chips and Systems, COOL Chips 2018—Proceedings, Yokohama, Japan, 18–20 April 2018; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2018; pp. 1–3. [Google Scholar] [CrossRef]
Andri, R.; Karunaratne, G.; Cavigelli, L.; Benini, L. ChewBaccaNN: A flexible 223 TOPS/W BNN accelerator. In Proceedings of the IEEE International Symposium on Circuits and Systems, Daegu, Republic of Korea, 22–28 May 2021; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2021. [Google Scholar] [CrossRef]
Sanchez-Flores, A.; Alvarez, L.; Alorda-Ladaria, B. Accelerators in Embedded Systems for Machine Learning: A RISCV View. In Proceedings of the 38th Conference on Design of Circuits and Integrated Systems (DCIS), Malaga, Spain, 15–17 November 2023; pp. 1–6. [Google Scholar]
Sanchez-Flores, A.; Alvarez, L.; Alorda-Ladaria, B. A review of CNN accelerators for embedded systems based on RISC-V. In Proceedings of the IEEE International Conference on Omni-Layer Intelligent Systems (COINS), Barcelona, Spain, 1–3 August 2022; pp. 1–6. [Google Scholar]
Fornt, J.; Fontova-Musté, P.; Caro, M.; Abella, J.; Moll, F.; Altet, J.; Studer, C. An Energy-Efficient GeMM-Based Convolution Accelerator with On-the-Fly im2col. IEEE Trans. Very Large Scale Integr. VLSI Syst. 2023, 31, 1874–1878. [Google Scholar] [CrossRef]
Cai, Y.; Yao, Z.; Dong, Z.; Gholami, A.; Mahoney, M.W.; Keutzer, K. ZeroQ: A Novel Zero Shot Quantization Framework. Available online: https://github.com/amirgholami/ZeroQ (accessed on 7 June 2024).
Choukroun, Y.; Kravchik, E.; Yang, F.; Kisilev, P. Low-bit Quantization of Neural Networks for Efficient Inference. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
Banner, R.; Nahshan, Y.; Soudry, D. Post Training 4-bit Quantization of Convolutional Networks for Rapid-Deployment. Available online: https://github.com/submission2019/cnn-quantization (accessed on 10 June 2024).
Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Kalenichenko, D. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 3009–3018. [Google Scholar]
Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv 2018, arXiv:1806.08342. [Google Scholar] [CrossRef]
Muller, J.-M.; Brisebarre, N.; de Dinechin, F.; Jeannerod, C.-P.; Lefèvre, V.; Melquiond, G.; Revol, N.; Stehlé, D.; Torres, S. Handbook of Floating-Point Arithmetic; Springer Science and Business Media LLC: Dordrecht, The Netherlands, 2010. [Google Scholar] [CrossRef]
Parhami, B. Computer Arithmetic: Algorithms and Hardware Designs; Oxford University Press: New York, NY, USA, 2010. [Google Scholar]
Proakis, D.M.J. Digital Signal Processing: Principles, Algorithms, and Applications, 4th ed.; Prentice-Hall International: London, UK, 2006. [Google Scholar]
Kumar, A.A. Fundamentals of Digital Circuits, 4th ed.; PHI Learning Pvt. Ltd.: New Delhi, India, 2016. [Google Scholar]
Vuillemin, J.E. On Circuits and Numbers. IEEE Trans. Comput. 1994, 43, 868–879. [Google Scholar] [CrossRef]
Yao, Z.; Dong, Z.; Zheng, Z.; Gholami, A.; Yu, J.; Tan, E.; Wang, L.; Huang, Q.; Wang, Y.; Mahoney, M.; et al. HAWQ-V3: Dyadic Neural Network Quantization. Int. Conf. Mach. Learn. 2021, 139, 18–24. [Google Scholar]
Flamand, E.; Rossi, D.; Conti, F.; Loi, I.; Pullini, A.; Rotenberg, F.; Benini, L. GAP-8: A RISC-V SoC for AI at the Edge of the IoT. In Proceedings of the 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors, Milan, Italy, 10–12 July 2018; pp. 1–4. [Google Scholar]
Ji, Z.; Jung, W.; Woo, J.; Sethi, K.; Lu, S.L.; Chandrakasan, A.P. CompAcc: Efficient Hardware Realization for Processing Compressed Neural Networks Using Accumulator Arrays. In Proceedings of the 2020 IEEE Asian Solid-State Circuits Conference (A-SSCC), Hiroshima, Japan, 9–11 November 2020; pp. 1–4. [Google Scholar]
Zhang, G.; Zhao, K.; Wu, B.; Sun, Y.; Sun, L.; Liang, F. A RISC-V based hardware accelerator designed for Yolo object detection system. In Proceedings of the 2019 IEEE International Conference of Intelligent Applied Systems on Engineering (ICIASE), Fuzhou, China, 26–29 April 2019; pp. 9–11. [Google Scholar]
Jia, H.; Tang, Y.; Valavi, H.; Zhang, J.; Verma, N. A Microprocessor implemented in 65 nm CMOS with Configurable and Bit-scalable Accelerator for Programmable In-memory Computing. arXiv 2018, arXiv:1811.04047. [Google Scholar]
Jia, H.; Valavi, H.; Tang, Y.; Zhang, J.; Verma, N. A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing. IEEE J. Solid-State Circuits 2020, 55, 2609–2621. [Google Scholar] [CrossRef]
Cheikh, A.; Sordillo, S.; Mastrandrea, A.; Menichelli, F.; Scotti, G.; Olivieri, M. Klessydra-T: Designing Vector Coprocessors for Multithreaded Edge-Computing Cores. IEEE Micro 2021, 41, 64–71. [Google Scholar] [CrossRef]
Burrello, A.; Garofalo, A.; Bruschi, N.; Tagliavini, G.; Rossi, D.; Conti, F. DORY: Automatic End-To-End Deployment of Real-World DNNs on Low-Cost IoT MCUs. IEEE Trans. Comput. 2021, 70, 1253–1268. [Google Scholar] [CrossRef]
Jones, M.; Summerlin, T. PMBus_v1-3. Available online: https://pmbusprod.wpenginepowered.com/wp-content/uploads/2018/07/20130912PMBus_1-3_DPF.pdf (accessed on 4 July 2024).

Figure 1. High level description of (a) FP16 accelerator, (b) PSM, and (c) systolic array 4 × 8.

Figure 2. The graph represents a typical distribution of the data for different shifting. (a) shift = 0, (b) shift = 2 and, (c) shift = 4.

Figure 3. (a) Quantization operation applied to the output register of each PE. (b) Dequantization operation applied to the input registers of each PE.

Figure 4. (a) PE version INT16. (b) PSM modified to include reducing and expanding sub-modules.

Figure 5. The quantization and dequantization submodules are implemented as a MUX for the former and a DEMUX for the latter. Note that the SR registers are shifted from position 0 to X.

Figure 6. (a) A simplified diagram of the Ultra96 v2 board, divided into PS processing sections and PL logic parts. (b) An example of plotting made in Jupyter notebook.

Figure 7. A comprehensive description of the system implemented in Vivado.

Figure 8. Connections between measuring elements.

Figure 9. Process description for one convolution calculation.

Figure 10. Different current levels measured in the PL section.

Figure 11. Plot per test for scenarios 3, 6, 8, and 10. (a) Percentage of energy consumption and (b) relative error between INT and FP for each test case.

Table 1. Hyperparameters of the accelerator.

Hyperparameter	Value
Array Shape (X, Y)	8 × 4
Arithmetic	int16
Zero Gating	Mult + Add
BRAMA width	128
BRAMB width	128
BRAMC width	128
BRAM depth (all)	2048

Table 2. Features for the 11 tests used in this work to measure power and energy.

Test Number	Memory Requirements Activations, Weights, Preload/Output	Convolution Features. Dimensions of Tensors: Activation (A), Weights (B), Preload/Output (C)
0	448, 1728, 288	A (16, 4, 14), B (24, 16, 3, 3), C (24, 2, 12)
1	283, 1176, 128	A (3, 9, 21), B (16, 3, 7, 7), C (16, 2, 8)
2	1012, 2400, 128	A (3, 15, 45), B (16, 3, 10, 10), C (16, 2, 8)
3	55, 1332, 12	A (111, 1, 1), B (24, 111, 1, 1), C (24, 1, 1)
4	108, 216, 128	A (3, 6, 12), B (16, 3, 3, 3), C (16, 2, 8)
5	5920, 288, 16	A (8, 37, 40), B (8, 8, 3, 3), C (8, 1, 4)
6	420, 600, 128	A (3, 14, 20), B (16, 3, 5, 5), C (16, 2, 8)
7	36, 36, 288	A (3, 2, 12), B (24, 3, 1, 1), C (24, 2, 12)
8	1664, 2496, 192	A (208, 2, 8), B (24, 208, 1, 1), C (24, 2, 8)
9	84, 324, 288	A (3, 4, 14), B (24, 3, 3, 3), C (24, 2, 12)
10	36, 40, 12	A (3, 4, 6), B (3, 3, 3, 3), C (3, 2, 4)

Table 3. Features of the different scenarios for testing the set of test cases.

	Sc1	Sc2	Sc3	Sc4	Sc5	Sc6	Sc7	Sc8	Sc9	Sc10	Sc11	Sc12
Shift	0	0	0	4	4	4	6	6	6	8	8	8
$Scale (2^{- s h i f t})$	1	1	1	1/16	1/16	1/16	1/64	1/64	1/64	1/256	1/256	1/256
s²	1	5	10	1	5	10	1	5	10	1	5	10
PE_consum (%)	26.7	48.7	56.7	30	37.2	37.4	47	41	31.4	57.5	39.4	23.7
RE	123	96	87	1.3 × 104	17.4	6.77	1.84	1.03	9.5	17.2	466	955
RE_min	1.54	1.7	0.17	24	0.13	0.39	0.1	0.05	0.15	0.56	5.7	7.13
RE_max	627	424	376	9.5× 104	103	25	6.27	3	34	116	2172	6392
RE_FP	0.17	0.20	0.21	0.20	0.18	0.20	0.18	0.19	0.23	0.20	0.18	0.18

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Sanchez-Flores, A.; Fornt, J.; Alvarez, L.; Alorda-Ladaria, B. Energy and Precision Evaluation of a Systolic Array Accelerator Using a Quantization Approach for Edge Computing. Electronics 2024, 13, 2822. https://doi.org/10.3390/electronics13142822

AMA Style

Sanchez-Flores A, Fornt J, Alvarez L, Alorda-Ladaria B. Energy and Precision Evaluation of a Systolic Array Accelerator Using a Quantization Approach for Edge Computing. Electronics. 2024; 13(14):2822. https://doi.org/10.3390/electronics13142822

Chicago/Turabian Style

Sanchez-Flores, Alejandra, Jordi Fornt, Lluc Alvarez, and Bartomeu Alorda-Ladaria. 2024. "Energy and Precision Evaluation of a Systolic Array Accelerator Using a Quantization Approach for Edge Computing" Electronics 13, no. 14: 2822. https://doi.org/10.3390/electronics13142822

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Energy and Precision Evaluation of a Systolic Array Accelerator Using a Quantization Approach for Edge Computing

Abstract

1. Introduction

2. Accelerator Description

3. Quantization and Implementation

3.1. Description of Quantization

3.2. Implementation on FPGA

4. Instrumentation and Measurement Methodology

4.1. Implementation Board Properties

4.2. High-Level System

4.3. Instrumentation Description

4.4. Tests Description

4.5. Convolution Features for Every Test

4.6. Energy Calculation

5. Comparative Analysis

5.1. Accuracy Evaluation

5.2. Energy Consumption vs. Accuracy Trade-Off

6. Conclusions and Discussion

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI