1. Introduction
With the advancement of semiconductor technology, emerging applications such as multimedia processing, pattern recognition, and the Internet of Things have developed rapidly. Because these applications must handle large amounts of data, energy consumption, memory limitations, and real-time constraints pose challenges to their computing systems. In the past, researchers mainly enhanced system performance by reducing transistor size and increasing integration density. However, as transistor sizes approach their physical limits, Moore's Law is gradually losing effectiveness. Therefore, researching new circuit structures and computing methods based on existing devices is becoming another important means of enhancing system performance [1]. Approximate computation is a typical representative of these new methods. It assumes that some applications can tolerate the output quality degradation caused by computation errors, so exact computation becomes unnecessary. For example, image processing that involves human perception is error-tolerant. Current research on approximate computation mainly focuses on circuits, architectures, software, and algorithms [2].
For approximate computation in circuits and architectures, the main methods include voltage scaling, circuit structure design, and system architecture design. Voltage scaling divides operations or system states into different levels of importance, supplying lower voltages to the less important ones to reduce system power consumption. Palem et al. [
3] reduced the power consumption of the circuit by dynamically adjusting the voltages of each part, allocating higher voltages to critical parts and lower voltages to non-critical ones. Chippa et al. [
4] decreased the circuit’s power consumption by maintaining high voltages for high-weight bits and reducing those for low-weight bits. Zeng et al. [
5] designed a dual-mode voltage converter that generates a low static current through the designed double clock time technique and rapidly switches voltages using pulse-width modulation (PWM) when the circuit load increases. For circuit structure design, specialized circuit structures improve operation speed by trading accuracy for performance. Behbahani et al. [
6] proposed a novel hardware solution for the approximate processing of edge detection in blurred images, adopting independent-gate fin field-effect transistor (FinFET) technology. Their simulation results show a significant 71% reduction in energy consumption relative to prior designs. Seo et al. [
7] divided the exact adder into two parts to reduce the carry delay and designed carry prediction and error recovery circuits to compensate for the accuracy loss caused by the division. Yan et al. [
8] designed a low-cost approximate full adder, which utilizes the input signal as the control signal to implement the addition logic operation. Gu et al. [
9] implemented a variable-precision approximate multiplier by using a 3-input NAND gate to generate the sum bit of the 4-2 compressor and by dynamically truncating the partial products. Mohanty et al. [
10] implemented an approximate multiplier based on hybrid encoding by incorporating a correction factor into the encoding rules and utilizing a segmented Booth encoding table. Liu et al. [
11] designed an approximate divider with a hybrid structure in which an array divider unit generates the high-order bits of the quotient, while the remaining quotient bits are generated by low-precision logarithmic dividers. Kumari et al. [
12] developed 8-bit approximate multipliers with 15 levels of accuracy using recursive, bitwise, and hybrid partial bit OR (PBO) methods. Compared to existing multipliers, their designs achieved significant performance improvements. In system-level architecture design, research on approximate computation mainly focuses on CPUs, GPUs, MCUs, and instruction sets. Japa et al. [
13] presented a novel approach that leverages timing variations in a pipelined datapath to design a processor-based physically unclonable function (PUF) for approximate computing. By employing divergent delay path selection based on intermediate error behavior, this methodology enhances PUF uniqueness compared to an unmodified datapath. Cui et al. [
14] proposed two innovative approaches: a sample average approximation (SAA) method and a heuristic solution called minimum communication cost (MinC), to optimize the task execution modes of Internet of Things (IoT) devices. Through these approaches, devices can selectively execute tasks in either exact or approximate modes. Zhang et al. [
15] designed a programmable analog computing unit (ACU), which can approximate the calculation of any two-input function through a Gaussian mixture kernel function. Sinha et al. [
16] proposed a memory-based approximate computing method that reuses previously computed outputs for identical inputs to reduce computing time and power consumption. Wei et al. [
17] introduced an approximate instruction duplication (ApproxDup) mechanism for efficient silent data corruption (SDC) detection. ApproxDup leverages approximate computing to prioritize the duplication of instructions prone to severe SDCs while relaxing the detection of less critical ones.
For approximate computation in software and algorithms, researchers optimize the algorithms themselves by using methods such as precision scaling, operation merging, and data compression to enhance computational performance. Silveira et al. [
18] decomposed the coefficients of the discrete cosine transform (DCT) into the product of a sparse matrix and a diagonal matrix, so that the multiplications could be replaced by addition and shift operations. Hu et al. [
19] designed an approximate compression algorithm to enhance the data transmission efficiency of the message passing interface (MPI), which discards the lower bits of the data by taking advantage of their continuity and predictability. Liu et al. [
20] proposed two word-length selection algorithms, area-limited design (ALD) and delay-limited design (DLD), which discard certain bit widths at different stages of fast Fourier transform (FFT) operations to enhance computational performance. Wang et al. [
21] proposed a neural network training algorithm based on bit-wise training, which trains individual bits of data according to their weights and switches between different precisions using a resistive RAM (ReRAM) accelerator. Dalloo et al. [
22] proposed an architecture for computing the approximate exponential and hyperbolic functions using a table-driven algorithm. Furthermore, by implementing the design on an FPGA, they demonstrated significant performance improvements. Meo et al. [
23] proposed a new approximation scheme for the delayed least mean square (DLMS) filter to reduce its power consumption. The coefficients of the filter were updated using gradient vectors based on the absolute value of the error signal. Li et al. [
24] optimized a JPEG hardware implementation based on approximate computing. The key techniques employed included an approximate division realized using bit-shift operators, loop perforation, and precision scaling integrated with a multiplier-less fast DCT architecture.
As mentioned above, these approximate computations are mainly implemented on ASICs, FPGAs, and MCUs. At present, with the development of the Internet of Things (IoT) and artificial intelligence (AI), AIoT devices are widely deployed, so exploring the implementation of approximate computation on AIoT processors is both important and meaningful. In this paper, we investigate the implementation of approximate computation on an AI accelerator. The rest of this paper is organized as follows. In
Section 2, we employ neural architecture search (NAS) to obtain the optimal neural network structure, which is used to approximate signal processing functions such as the FFT, DCT, infinite impulse response (IIR) filter, and finite impulse response (FIR) filter. In
Section 3, the optimal network structure is quantized, trained, and deployed to the AI accelerator on the development board. In
Section 4, we utilize the CMSIS-DSP function library on the development board to implement the same functions and compare the results with those obtained from the AI accelerator.
2. Neural Network Structure for Approximate Computation
Before deploying a neural network on the AI accelerator, it is necessary to obtain its optimal structure. NAS automatically explores the network design space and can find the best network structure, saving time compared with manual design and, in some cases, reaching a better architecture more easily. To further reduce unnecessary evaluations in the search space, we adopt the NAS framework based on sequential model-based algorithm configuration (SMAC) proposed in reference [25] to search for a network structure that performs signal processing tasks with high accuracy and low computation time. In this framework, SMAC explores the network space and uses a hardware simulator to evaluate the accuracy, speed, and power consumption of each candidate structure. This information is then fed back to the framework to guide the next search step. The process iterates until the specified constraints on accuracy, speed, and power consumption are met.
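To make the search procedure concrete, the following Python sketch outlines the shape of such a loop. It is a simplified stand-in that uses plain random sampling instead of SMAC's surrogate model, and evaluate_cost() is a hypothetical placeholder for training a candidate for 300 epochs and querying the hardware simulator; it is not the actual framework of [25].

```python
import random

# Minimal, self-contained sketch of the structure search loop, with random
# sampling standing in for SMAC's surrogate model.

MAX_NEURONS = 64           # per-layer limit used in Section 2.1
CANDIDATES_PER_ITER = 500  # configurations evaluated per search iteration

def sample_config():
    # Three fully connected hidden layers, each with 1..64 neurons.
    return tuple(random.randint(1, MAX_NEURONS) for _ in range(3))

def evaluate_cost(config):
    # Placeholder cost: real values would combine simulator-reported
    # accuracy, speed, and power via Equation (1) in Section 2.2.
    return 1.0 / sum(config) + sum(config) / (3 * MAX_NEURONS)

best = min((sample_config() for _ in range(CANDIDATES_PER_ITER)), key=evaluate_cost)
print("best structure:", best)
```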
2.1. Model Definition
We designed a multi-layer perceptron (MLP) with three hidden layers to approximate the FFT, DCT, FIR, and IIR functions. The MLP architecture comprised one input layer, three fully connected hidden layers, and one output layer, with a rectified linear unit (ReLU) serving as the activation function. During forward propagation, the input data passed through the fully connected layers and activation functions, ultimately generating the output. To mitigate overfitting, dropout layers with a retention probability of 0.5 were incorporated. For an efficient hyperparameter search, the maximum number of neurons per layer was limited to 64. In each search iteration, up to 500 neural network configurations were evaluated, and each configuration was trained for 300 epochs.
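A minimal PyTorch sketch of this architecture is given below; the hidden-layer widths and the input/output sizes shown are illustrative placeholders, since the actual values are selected by the search.

```python
import torch
import torch.nn as nn

# Sketch of the MLP described above: three fully connected hidden layers
# with ReLU activations and dropout. The hidden widths h1/h2/h3 are the
# hyperparameters explored by the search (each capped at 64 neurons).

class ApproxMLP(nn.Module):
    def __init__(self, in_features, h1, h2, h3, out_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, h1), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(h1, h2), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(h2, h3), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(h3, out_features),
        )

    def forward(self, x):
        return self.net(x)

# Example candidate with illustrative widths of 32/16/64 hidden neurons.
model = ApproxMLP(in_features=64, h1=32, h2=16, h3=64, out_features=64)
output = model(torch.randn(1, 64))  # forward pass on a dummy input
```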
2.2. Cost Evaluation
The candidate network was evaluated based on predefined criteria including accuracy, speed, power consumption, and model size. In the SMAC search architecture, nn_dataflow [26] was used to extract various statistical metrics about the network operations, including the total cost, total time, total operation count, and total bandwidth. These statistics were organized into an ordered dictionary to evaluate the network cost. For computing accuracy, if the absolute error between the sample value and the output value was below a predefined threshold, the correct count was incremented by 1. The overall accuracy is the ratio of the correct count to the total number of samples. The total cost function is calculated according to Equation (1):

$$Cost_{total} = \frac{\alpha}{P_{acc}} + \frac{\beta}{P_{speed}} \quad (1)$$

In the above formula, Pspeed and Pacc represent the speed and computing accuracy, respectively, and α and β are the weighting coefficients. Adjusting the values of α and β can guide the search process toward different optimization objectives: a higher α directs the search toward higher computational accuracy, while a higher β prioritizes faster computation. The threshold used for computing accuracy also impacts computational precision and efficiency. By appropriately relaxing the threshold according to the data precision requirements of different applications, both the computational accuracy and efficiency can be improved.
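The sketch below illustrates the accuracy metric and the cost of Equation (1) in Python; the default threshold and weighting values are illustrative choices, not values fixed by the framework.

```python
import numpy as np

# A sample counts as correct when the absolute error between the reference
# value and the network output falls below the threshold.

def computing_accuracy(reference, output, threshold=0.01):
    correct = np.sum(np.abs(reference - output) < threshold)
    return correct / reference.size

def total_cost(p_acc, p_speed, alpha=0.5, beta=0.5):
    # Equation (1): a larger alpha pushes the search toward accuracy,
    # a larger beta toward speed.
    return alpha / p_acc + beta / p_speed

# One sample within threshold, one outside: accuracy = 0.5.
acc = computing_accuracy(np.array([0.5, 0.25]), np.array([0.505, 0.3]))
```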
2.3. Model Evaluation
After training and evaluation, the best neural network structure needs to be selected. Since the network will be deployed on an AI hardware accelerator, the model size is also an important indicator. The overall merit of the network can be determined by Equation (2):

$$Merit = \gamma \cdot P_{acc} - (1 - \gamma) \cdot S_{norm} \quad (2)$$

where γ is the weight coefficient, Pacc denotes the computing accuracy on the test data, and Snorm represents the normalized model size.
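In Python, the merit of Equation (2) reduces to a one-line weighted combination, sketched below with an illustrative default for γ.

```python
# Equation (2): higher accuracy raises the merit and a larger (normalized)
# model size lowers it; gamma balances the two objectives.

def overall_merit(p_acc, s_norm, gamma=0.5):
    return gamma * p_acc - (1.0 - gamma) * s_norm
```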
Figure 1 illustrates the relationship between the model size and accuracy. In this figure, the horizontal axis represents the model size, measured in KB, and the vertical axis indicates the accuracy in percentage. The shading of the dots represents the overall merit, with lighter shades indicating higher evaluations. As shown in
Figure 1, for DCT and FIR, even with a small network size, most networks achieve high accuracy. This is primarily due to the relative simplicity of these computations, allowing small models to effectively derive output from input data. In contrast, for the FFT and IIR models, the distribution of the computational accuracy is more dispersed in small networks, but many networks still exhibit high accuracy. This dispersion can be attributed to the inherent complexity of these computations, which may not always be adequately captured by small models.
By comprehensively considering both the accuracy and size, the optimal network configurations and their corresponding accuracies are summarized in
Table 1. As shown in
Table 1, the network structures of these four functions all exhibit high computational accuracy. Among them, the FFT function achieves the highest accuracy, while the DCT function has the lowest. According to the definition of accuracy, the 96.9% accuracy of the DCT function implies that only 3.1% of the test data results are imprecise, with the error threshold for judging imprecision strictly set at 0.01. In terms of network structure, the FFT function has the largest structure, with its third hidden layer containing 64 neurons. This is likely due to the inherent computational complexity of the FFT and the complex-number operations involved in its computation. The other three network structures are relatively smaller; although up to 64 neurons per layer were allowed, they use fewer neurons in each layer.
3. Deployment on the AI Accelerator
The MAX78000 is a system-on-chip (SoC) developed by Maxim Integrated that supports deep learning neural network acceleration; the experiments in this paper use its evaluation board, the MAX78000 EvKit. Model compression and quantization techniques are employed in the AI accelerator on the MAX78000. The accelerator consists of 64 parallel processors, each of which includes a pooling unit and a convolutional kernel. Before deployment on the MAX78000, the model was trained and tested on a host computer using the standard function library provided by MAXIM.
During the model training process in Section 2, the dataset consisted of floating-point numbers within the range of [−127/128, 127/128]. However, when running on the AI accelerator, the input must consist of integers within the range of [−128, +127]. Therefore, before training for the hardware accelerator, the ai8x.normalize() function was used to quantize the training dataset.
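The snippet below sketches this step, following the usual ai8x-training pattern of composing ai8x.normalize() into the dataset transform; the act_mode_8bit flag is an assumption about the args object that ai8x.normalize() expects and should be checked against the ai8x-training version in use.

```python
import argparse

import ai8x
from torchvision import transforms

# Assumed args object: act_mode_8bit selects the accelerator's 8-bit range.
args = argparse.Namespace(act_mode_8bit=True)

train_transform = transforms.Compose([
    transforms.ToTensor(),
    ai8x.normalize(args=args),  # map inputs into the accelerator's range
])
```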
For the optimal structure obtained in Section 2, the dedicated function library provided by MAXIM was used to replace the corresponding torch.nn modules. The specific functions are determined by the MAX78000 hardware operators defined in the ai8x.py script. Table 2 shows the corresponding functions.
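The sketch below illustrates this substitution for the MLP of Section 2, assuming the fused operators defined in ai8x.py (e.g., ai8x.FusedLinearReLU in place of nn.Linear followed by nn.ReLU); the exact operator names and set_device() arguments should be verified against the ai8x.py version in use.

```python
import torch.nn as nn

import ai8x

# Assumed call: select the MAX78000 (AI85) hardware target before building
# the model so the ai8x operators apply the correct quantization behavior.
ai8x.set_device(device=85, simulate=False, round_avg=False)

class ApproxMLPAI8X(nn.Module):
    def __init__(self, in_features, h1, h2, h3, out_features):
        super().__init__()
        # Fused linear + ReLU operators replace nn.Linear/nn.ReLU pairs.
        self.fc1 = ai8x.FusedLinearReLU(in_features, h1)
        self.fc2 = ai8x.FusedLinearReLU(h1, h2)
        self.fc3 = ai8x.FusedLinearReLU(h2, h3)
        self.out = ai8x.Linear(h3, out_features)

    def forward(self, x):
        return self.out(self.fc3(self.fc2(self.fc1(x))))
```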
When configuring the learning rate scheduler, a multi-step scheduler was used, with milestones set to [60, 120, 160]. At each milestone, the learning rate was multiplied by a specified factor (set to 0.2), and the scheduler was stepped at the end of each epoch over the 300 training epochs. For weight quantization, the MAXIM development kit provides two methods: quantization-aware training and post-training quantization. In this paper, we adopted the quantization-aware training method. After training, a checkpoint file containing the best results was obtained; it mainly contains the weights and biases of the network.
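This scheduler configuration corresponds to the following PyTorch sketch, in which the optimizer choice, its initial learning rate, and the stand-in model are illustrative placeholders.

```python
import torch

model = torch.nn.Linear(64, 64)  # placeholder for the network being trained
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Multiply the learning rate by 0.2 at epochs 60, 120, and 160.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 120, 160], gamma=0.2)

for epoch in range(300):
    # ... per-batch forward pass, loss computation, optimizer.step() ...
    scheduler.step()  # update the learning rate at the end of each epoch
```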
The trained network was deployed on the MAX78000 development board using the Eclipse for MAXIM tools.
Table 3 presents the computation cost and computation time of the deep learning networks when executed on the AI accelerator of the development board. Notably, the frequency of the AI acceleration unit was synchronized with that of the main processor, the ARM Cortex-M4 on the MAX78000 board.
From the analysis of
Table 3, it can be observed that the FFT, due to its relatively complex network structure, reached the maximum number of neurons (64) in its third layer, requiring more computational time and resources. In contrast, the DCT demonstrated higher efficiency in terms of both the computational cost and time, indicating superior performance. Although the computation time for the FIR and IIR was relatively long, it remained within an acceptable range. In the network structures of both the FIR and IIR, the number of neurons in each layer was the same; however, there was a significant difference in the computational cost as shown in
Table 3. This discrepancy may be attributed to the fact that the IIR calculations involve feedback, resulting in lower utilization of the AI acceleration unit’s computational modules, with the majority of resources being consumed by data transmission and waiting.
4. Performance Evaluation
To further evaluate the performance of the approximate computation on the AI accelerator described in the previous section, the FFT, DCT, FIR, and IIR operations were also implemented on the ARM Cortex-M4 processor of the MAX78000. The Cortex-M4 is a high-performance, low-power processor with DSP extensions. The CMSIS-DSP library provided by ARM offers efficient signal processing functions for embedded systems based on the Cortex-M4.
Using the same test data, the CMSIS-DSP functions were used to implement the FFT, DCT, FIR, and IIR. Table 4 presents the computation cost and computation time achieved with the CMSIS-DSP library. As described in the previous section, the frequency of the Cortex-M4 processor is the same as that of the AI accelerator unit, so the computation time was used directly to evaluate the performance.
By comparing
Table 3 and
Table 4, it is evident that for the FFT operation, the AI accelerator achieves a computing time that is only one-sixth that of the DSP implementation. However, the computational cost increases by a factor of approximately twenty-five. This discrepancy may be attributed to the fact that the DSP function employs an optimized FFT algorithm, whereas the AI hardware accelerator relies on neural network inference based solely on input and output data and does not strictly implement the FFT algorithm. For the DCT and FIR functions, the AI accelerator reduces the computing time by approximately 10%.
For the IIR operation, the computing time of the AI hardware accelerator is significantly longer than that of the DSP. This discrepancy may be attributed to the feedback loops in the IIR algorithm, which introduce additional waiting periods. A comprehensive analysis reveals that when the AI accelerator is used for approximate computation, although the computational load increases for some functions due to network design and other factors, the computation time for the FFT, DCT, and FIR functions is significantly reduced, especially for the computationally intensive FFT. This demonstrates that the computational efficiency of the AI acceleration unit under approximate computation is considerably higher than that of the MCU. When developing AIoT applications with high real-time requirements, approximate computation based on the AI accelerator can serve as a viable alternative for completing data-intensive operations.