Research on Approximate Computation of Signal Processing Algorithms for AIoT Processors Based on Deep Learning

Liu, Yingzhe; Fu, Fangfa; Sun, Xuejian

doi:10.3390/electronics14061064

Open AccessArticle

Research on Approximate Computation of Signal Processing Algorithms for AIoT Processors Based on Deep Learning

by

Yingzhe Liu

¹,

Fangfa Fu

^2,* and

Xuejian Sun

²

¹

School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125001, China

²

Department of Microelectronics Science and Technology, Harbin Institute of Technology, Harbin 150001, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(6), 1064; https://doi.org/10.3390/electronics14061064

Submission received: 9 February 2025 / Revised: 26 February 2025 / Accepted: 6 March 2025 / Published: 7 March 2025

(This article belongs to the Special Issue The Progress in Application-Specific Integrated Circuit Design)

Download

Browse Figure

Review Reports Versions Notes

Abstract

:

In the post-Moore era, the excessive amount of information brings great challenges to the performance of computing systems. To cope with these challenges, approximate computation has developed rapidly, which enhances the system performance with minor degradation in accuracy. In this paper, we investigate the utilization of an Artificial Intelligence of Things (AIoT) processor for approximate computing. Firstly, we employed neural architecture search (NAS) to acquire the neural network structure for approximate computation, which approximates the functions of FFT, DCT, FIR, and IIR. Subsequently, based on this structure, we quantized and trained a neural network implemented on the AI accelerator of the MAX78000 development board. To evaluate the performance, we implemented the same functions using the CMSIS-DSP library. The results demonstrate that the computational efficiency of the approximate computation on the AI accelerator is significantly higher compared to traditional DSP implementations. Therefore, the approximate computation based on AIoT devices can be effectively utilized in real-time applications.

Keywords:

approximate computation; deep learning; AI accelerator

1. Introduction

With the development of semiconductor technology, emerging applications have developed rapidly, such as multimedia processing, pattern recognition, and the Internet of Things. As these applications need to handle a large amount of data, energy consumption, memory limitations, and real-time constraints pose challenges to their computing systems. In the past, researchers have mainly enhanced the system performance by reducing the transistor size and increasing integration. However, as transistor sizes gradually approach their limits, Moore’s Law is gradually becoming ineffective. Therefore, based on existing devices, researching new circuit structures and computing methods is becoming another important means to enhance system performance [1]. Approximate computation is one of the typical representatives of these new methods. It assumes that some applications can tolerate output quality degradation caused by computation errors; hence, exact computation becomes unnecessary. For example, image processing that involves human senses is error-tolerant. Current research on approximate computation mainly focuses on circuits, structures, software, and algorithms [2].

For approximate computation in circuits and structures, the specific methods mainly include voltage scaling technology, circuit structure design, and system architecture design. Voltage adjustment technology mainly divides the computing or system state into different levels, providing lower voltages for less important operations and system states to reduce system power consumption. Palem et al. [3] reduced the power consumption of the circuit by dynamically adjusting the voltages of each part, allocating higher voltages to critical parts and lower voltages to non-critical ones. Chippa et al. [4] decreased the circuit’s power consumption by maintaining high voltages for high-weight bits and reducing those for low-weight bits. Zeng et al. [5] designed a dual-mode voltage converter that generates low static current through the designed double clock time (DCT) and rapidly switches voltages using pulse-width-modulation (PWM) technology when the circuit load increases. For the circuit structure, special circuit structures are designed to improve the speed of operation by trading off between accuracy and performance. Behbahani et al. [6] proposed a novel hardware solution aimed at achieving the approximate processing of edge detection in blurred images, which adopts the independent-gate fin field-effect transistor (FinFET) technology. The simulation results show a significant 71% reduction in energy consumption relative to the prior designs. Seo et al. [7] divided the exact adder into two parts to reduce the carry delay, and carry prediction and error recovery circuits were designed to compensate for the accuracy loss caused by the division. Yan et al. [8] designed a low-cost approximate full adder, which utilizes the input signal as the control signal to implement the addition logic operation. Gu et al. [9] implemented a variable-precision approximate multiplier by using a 3-input NAND gate to generate the sum bit in the 4-2 compressor and dynamically truncate the partial products. Mohanty et al. [10] implemented an approximate multiplier based on hybrid encoding by incorporating a correction factor into the encoding rules and utilizing a segmented Booth encoding table. Liu et al. [11] designed an approximate divider. In this hybrid structure, an array divider unit was used to generate the high-order bits of the quotient, while the other quotient bits were generated by using low-precision logarithmic dividers. Kumari et al. [12] developed 8-bit approximate multipliers with 15 levels of accuracy using recursive, bitwise, and hybrid partial bit OR (PBO) methods. Compared to existing multipliers, their design achieved significant performance improvements. In system-level architecture design, research on approximate computation mainly focuses on CPUs, GPUs, MCUs, and instruction sets, etc. Japa et al. [13] presented a novel approach that leverages timing variations in a pipelined datapath to design a processor-based physically unclonable function (PUF) for approximate computing. By employing divergent delay path selection based on intermediate error behavior, this methodology enhances PUF uniqueness compared to an unmodified datapath. Cui et al. [14] proposed two innovative approaches: a sample average approximation (SAA) method and a heuristic solution called minimum communication cost (MinC), to optimize the task execution modes of Internet of Things (IoT) devices. Through these approaches, devices can selectively execute tasks in either exact or approximate modes. Zhang et al. [15] designed a programmable analog computing unit (ACU), which can approximate the calculation of any two-input function through a Gaussian mixture kernel function. Sinha et al. [16] proposed a memory-based approximate computing method that reuses the output with the same input to reduce computing time and power consumption. Wei et al. [17] introduced an approximate instruction duplication (ApproxDup) mechanism for efficient silent data corruption (SDC) detection. ApproxDup leverages approximate computing to prioritize the duplication of instructions prone to severe SDCs while relaxing the detection of less critical ones.

For approximate computation in software and algorithms, researchers optimize the algorithms themselves by using methods such as precision scaling, operation merging, and data compression to enhance computational performance. Silveira et al. [18] decomposed the coefficients of the DCT into the product of a sparse matrix and a diagonal matrix, thereby the multiplication could be replaced by addition and shift operations. Hu et al. [19] designed an approximate compression algorithm to enhance the data transmission efficiency of the message passing interface (MPI), which discards the lower bits of the data by taking advantage of their continuity and predictability. Liu et al. [20] proposed two word-length selection algorithms, area limited design (ALD) and delay limited design (DLD), which discard certain bit widths at different stages of FFT operations to enhance the computational performance. Wang et al. [21] proposed a neural network training algorithm based on bit-wise training, which trains individual bits of data according to their weights and switches between different precisions using a resistive RAM (ReRAM) accelerator. Dalloo et al. [22] proposed an architecture for computing the approximate exponential and hyperbolic functions using a table-driven algorithm. Furthermore, by implementing the design on an FPGA, they demonstrated significant performance improvements. Meo et al. [23] proposed a new approximation scheme for the delayed least mean square (DLMS) filter to reduce its power consumption. The coefficients of the filter were updated using gradient vectors based on the absolute value of the error signal. Li et al. [24] optimized JPEG hardware implementation based on approximate computing. The key techniques employed included an approximate division realized using bit-shift operators, loop perforation, and precision scaling integrated with a multiplier-less fast discrete cosine transform (DCT) architecture.

As mentioned above, when implementing these approximate computations specifically, the main approaches include ASIC, FPGA and MCU. At present, with the development of the Internet of Things (IoT) and artificial intelligence (AI), AIoT devices have been widely applied. Therefore, exploring the implementation of approximate computation on AIoT processors is extremely important and meaningful. We investigate the implementation of approximate computation on an AI accelerator in this paper. The rest of this paper is organized as follows. In Section 2, we employ NAS to acquire the optimal neural network structure, which is used to approximate signal processing functions such as FFT, DCT, IIR, and FIR. In Section 3, the optimal network structure is quantized, trained, and deployed to the AI accelerator on the development board. In Section 4, we utilize the CMSIS-DSP function library on the development board to implement the same functions and compare the results with those obtained from the AI accelerator.

2. Neural Network Structure for Approximate Computation

Before deploying a neural network on the AI accelerator, it is necessary to obtain its optimal network structure. Neural architecture search automatically searches the network space and can find the best network structure. NAS can save more time than manual efforts, and in some cases, a better network architecture can be achieved more easily. In order to further reduce unnecessary evaluations in the search space, we adopt the NAS based on the sequential model-based algorithm configuration (SMAC) proposed in reference [25] to search for a network structure that efficiently performs signal processing tasks with high accuracy and less computation time. In this framework, SMAC explores the network space and utilizes a hardware simulator to evaluate the accuracy, speed, and power consumption of each structure. Finally, this information is fed back to the framework for the next search task. This process iterates until the specified constraints on accuracy, speed, and power consumption are met.

2.1. Model Definition

We designed a multi-layer perceptron (MLP) with three hidden layers to approximate the FFT, DCT, FIR, and IIR functions. The MLP architecture comprised one input layer, three fully connected hidden layers, and one output layer, with a (rectified linear unit) ReLU serving as the activation function. During forward propagation, the input data passed through the fully connected layers and activation functions, ultimately generating the output. To mitigate overfitting, dropout layers with a retention probability of 0.5 were incorporated. For efficient hyperparameter search, the maximum number of neurons per layer was limited to 64. In each search iteration, up to 500 neural network configurations were evaluated, and each configuration was trained for 300 epochs.

2.2. Cost Evaluation

The candidate network was evaluated based on predefined criteria including accuracy, speed, power consumption, and model size. In the SMAC search architecture, nn_dataflow [26] was used to extract various statistical metrics about the network operations. These metrics included the total cost, total time, total operation count, total bandwidth, and others. These statistics were organized into an ordered dictionary to evaluate the network cost. For computing accuracy, if the absolute error between the sample value and the output value was below a predefined the threshold, the correct count was incremented by 1. The overall accuracy is the ratio of the correct count to the total number of samples. The total cost function is calculated according to Equation (1):

f = α \cdot P_{a c c} + β \cdot P_{s p e e d}

(1)

In the above formula, P_speed and P_acc represent the speed and computing accuracy, respectively.

α

and

β

are the weighting coefficients. Adjusting the values of

α

and

β

can guide the search process toward different optimization objectives. A higher

α

value directs the search toward higher computational accuracy, while a higher

β

value prioritizes faster computation speeds. The threshold of computing accuracy also impacts computational precision and efficiency. By appropriately relaxing the threshold value according to the data precision requirements of different applications, both the computational accuracy and efficiency can be improved.

2.3. Model Evaluation

After training and evaluation, the best neural network structure needs to be selected. Since it will be deployed on AI hardware accelerators, the size of the model is also an important indicator. The overall merit of the network can be determined by Equation (2):

f = ω * P_{a c c} + (1 - ω) * (1 - S_{n o r m}),

(2)

where

ω

is the weight coefficient, P_acc denotes the computing accuracy of the test data, and S_norm represents the normalized model size. Figure 1 illustrates the relationship between the model size and accuracy. In this figure, the horizontal axis represents the model size, measured in KB, and the vertical axis indicates the accuracy in percentage. The shading of the dots represents the overall merit, with lighter shades indicating higher evaluations. As shown in Figure 1, for DCT and FIR, even with a small network size, most networks achieve high accuracy. This is primarily due to the relative simplicity of these computations, allowing small models to effectively derive output from input data. In contrast, for the FFT and IIR models, the distribution of the computational accuracy is more dispersed in small networks, but many networks still exhibit high accuracy. This dispersion can be attributed to the inherent complexity of these computations, which may not always be adequately captured by small models.

By comprehensively considering both the accuracy and size, the optimal network configurations and their corresponding accuracies are summarized in Table 1. As shown in Table 1, the network structures of these four functions all exhibit high computational accuracy. Among them, the FFT function achieves the highest accuracy, while the DCT function has the lowest. According to the definition of accuracy, the 96.9% accuracy of the DCT function implies that only 3.1% of the test data results are imprecise, with the judgment threshold for these imprecise data strictly set at 0.01. In terms of network structure, the FFT function has the largest structure size, with the third hidden layer containing 64 neurons. This is likely due to the inherent computational complexity of the FFT and the involvement of complex number operations during its computation. The other three network structures are relatively smaller, and even under the constraint of a maximum of 64 neurons per layer, they utilize fewer neurons in each layer.

3. Deployment on the AI Accelerator

The MAX78000 EvKit is a system-on-chip (SoC) developed by Maxim Integrated that supports deep learning neural network acceleration. Model compression and quantization techniques are also employed in the AI accelerator on the MAX78000. The accelerator consists of 64 parallel processors, each of which includes a pooling unit and a convolutional kernel. Before the MAX78000 is deployed, the model was tested and trained on a host computer. During this process, the standard function library provided by MAXIM was used.

During the model training process in Section 2, the dataset consisted of floating-point numbers within the range of [−127/128, 127/128]. However, when running on the AI accelerator, the input must be integers within the range of [−128, +127]. Therefore, before training on the hardware accelerator, the ai8x.normalize() function was used to quantize the training dataset.

For the optimal structure obtained in Section 2, the dedicated function library provided by MAXIM was used to replace the corresponding functions in torch.nn script. The specific functions were determined by the MAX78000 hardware operators defined in ai8x.py script. Table 2 shows corresponding functions.

When configuring the learning rate scheduler, a multi-step learning rate scheduler was used, and milestones (set to [60, 120, 160]) were employed to adjust the learning rate. During the update, the learning rate was multiplied by a specified factor (set to 0.2). The learning rate was updated at the end of each epoch (set to 300). For weight coefficients, the MAXIM development kit provides two methods: quantization-aware training and post-training quantization. In this paper, we adopted the quantization-aware training method. After training, a checkpoint file containing the best results was obtained. This checkpoint file mainly contained the weights and biases of the network.

The trained network was deployed on the MAX78000 development board using the Eclipse for MAXIM tools. Table 3 presents the computation cost and computation time of deep learning networks when executed on the AI accelerator in the development board. Notably, the frequency of the AI acceleration unit was synchronized with the main processor, which was the ARM Cortex-M4 processor on the MAX78000 board.

From the analysis of Table 3, it can be observed that the FFT, due to its relatively complex network structure, reached the maximum number of neurons (64) in its third layer, requiring more computational time and resources. In contrast, the DCT demonstrated higher efficiency in terms of both the computational cost and time, indicating superior performance. Although the computation time for the FIR and IIR was relatively long, it remained within an acceptable range. In the network structures of both the FIR and IIR, the number of neurons in each layer was the same; however, there was a significant difference in the computational cost as shown in Table 3. This discrepancy may be attributed to the fact that the IIR calculations involve feedback, resulting in lower utilization of the AI acceleration unit’s computational modules, with the majority of resources being consumed by data transmission and waiting.

4. Performance Evaluation

To further evaluate the performance of the approximate computation on the AI accelerator in the previous section, the FFT, DCT, FIR, and IIR operations were implemented on the ARM Cortex-M4 processor based on the MAX78000. The Cortex-M4 is a high-performance low-power processor with DSP extension function. The CMSIS-DSP library provided by ARM can implement various efficient signal processing for embedded systems based on Cortex-M4.

By using the same test data, the CMSIS-DSP functions were used to implement the FFT, DCT, FIR and IIR. Table 4 presents the computation cost and computation time achieved through the CMSIS-DSP library. As described in the previous section, the frequency of the Cortex-M4 processor is the same as that of the AI accelerator unit, so the calculation time was directly used to evaluate the performance.

By comparing Table 3 and Table 4, it is evident that for the FFT operation, the AI accelerator achieves a computing time that is only one-sixth of the DSP implementation. However, the computational cost increases by approximately twenty-five times. This discrepancy may be attributed to the fact that the DSP function employs an optimized FFT algorithm, whereas the AI hardware accelerator relies on neural network inference based solely on input and output data, which does not strictly implement the FFT algorithm. For the DCT and FIR functions, the AI accelerator demonstrates a reduction in computing time by approximately 10%.

For IIR operations, the computing time of the AI hardware accelerators is significantly longer compared to that of the DSP. This discrepancy may be attributed to the feedback loops in the IIR algorithm, which introduce additional waiting periods. A comprehensive analysis reveals that when using the AI accelerator for approximate computations, although the computational load increases for some functions due to network design and other factors, the computation time for the FFT, DCT, and FIR functions is significantly reduced, especially for the computationally intensive FFT operations. This demonstrates that the computational efficiency of the AI acceleration unit under approximate computation is considerably improved compared to that of the MCU. When developing AIoT applications with high real-time requirements, approximate computation based on the AI accelerator can serve as a viable alternative for completing data-intensive operations with high real-time demands.

5. Conclusions

This paper investigated the approximate computation of signal processing on AIoT processor. Firstly, under the constraints, the optimal neural network structure for the approximate computation was obtained through NAS. Then, through quantization and training, the model was deployed on the MAX78000 development board. Finally, the computing results were compared with those of ARM CMSIS-DSP. The experimental results show that the AI accelerator exhibits significant advantages in computational efficiency compared to traditional MCU and is more suitable for the tasks with real-time requirements. Therefore, approximate computation based on the AI accelerator can become a promising approach for efficient signal processing.

Author Contributions

Methodology, Writing—Original Draft, Writing—review & editing, Y.L.; Conceptualization, Supervision, F.F.; Software, Formal analysis, X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

Liu, W.; Chen, K.; Wu, B.; Deng, E.; Wang, Y.; Gong, Y.; Cui, Y.; Wang, C. High-efficiency and high-security emerging computing chips: Development, challenges, and prospects. Sci. Sin. Inf. 2024, 54, 34–47. (In Chinese) [Google Scholar] [CrossRef]
Dalloo, A.M.; Jaleel Humaidi, A.; Al Mhdawi, A.K.; Al-Raweshidy, H. Approximate Computing: Concepts, Architectures, Challenges, Applications, and Future Directions. IEEE Access 2024, 12, 146022–146088. [Google Scholar] [CrossRef]
Palem, K.; Lingamneni, A. Ten Years of Building Broken Chips: The Physics and Engineering of Inexact Computing. ACM Trans. Embed. Comput. Syst. 2013, 12, 1–23. [Google Scholar] [CrossRef]
Chippa, V.K.; Mohapatra, D.; Roy, K.; Chakradhar, S.T.; Raghunathan, A. Scalable Effort Hardware Design. IEEE Trans. Very Large Scale Integr. Syst. 2014, 22, 2004–2016. [Google Scholar] [CrossRef]
Zeng, W.; Lam, C.; Che, W. A 470-nA Quiescent Current and 92.7%/94.7% Efficiency DCT/PWM Control Buck Converter With Seamless Mode Selection for IoT Application. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 67, 4085–4098. [Google Scholar] [CrossRef]
Behbahani, F.; Jooq, M.K.Q.; Moaiyeri, M.H.; Azghadi, M.R. A Novel Hardware Solution for Efficient Approximate Fuzzy Image Edge Detection. IEEE Trans. Fuzzy Syst. 2024, 32, 3199–3210. [Google Scholar] [CrossRef]
Seo, H.; Kim, Y. A Low Latency Approximate Adder Design Based on Dual Sub-Adders With Error Recovery. IEEE Trans. Emerg. Top. Comput. 2023, 11, 811–816. [Google Scholar] [CrossRef]
Yan, A.; Wei, S.; Li, Z.; Cui, J. Design of Low-Cost Approximate CMOS Full Adders. In Proceedings of the 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 21–25 May 2023; pp. 1–5. [Google Scholar]
Gu, F.; Lin, I.; Lin, J. A low-power and high-accuracy approximate multiplier with reconfigurable truncation. IEEE Access 2022, 10, 60447–60458. [Google Scholar] [CrossRef]
Mohanty, B.K. Efficient approximate multiplier design based on hybrid higher radix booth encoding. IEEE J. Emerg. Sel. Top. Circuits Syst. 2023, 13, 165–174. [Google Scholar] [CrossRef]
Liu, W.; Xu, T.; Li, J.; Wang, C.; Montuschi, P.; Lombardi, F. Design of unsigned approximate hybrid dividers based on restoring array and logarithmic dividers. IEEE Trans. Emerg. Top. Comput. 2022, 10, 339–350. [Google Scholar] [CrossRef]
Kumari, A.; Palathinkal, R.P. Design and Analysis of Energy Efficient Approximate Multipliers for Image Processing and Deep Neural Network. IEEE Trans. Circuits Syst. I Regul. Pap. 2025, 72, 854–867. [Google Scholar] [CrossRef]
Japa, A.; Miskelly, J.; Cui, Y.; O’Neill, M.; Gu, C. A Novel Methodology for Processor based PUF in Approximate Computing. In Proceedings of the 2024 IEEE International Symposium on Circuits and Systems (ISCAS), Singapore, 19–22 May 2024; pp. 1–5. [Google Scholar]
Cui, Y.; Chin, K.-W.; Soh, S.; Ros, M. Exact and Approximate Tasks Computation in IoT Networks. IEEE Internet Things J. 2024, 11, 7974–7988. [Google Scholar] [CrossRef]
Zhang, R.; Uetake, N.; Nakada, T.; Nakashima, Y. Design of Programmable Analog Calculation Unit by Implementing Support Vector Regression for Approximate Computing. IEEE Micro 2018, 38, 73–82. [Google Scholar] [CrossRef]
Sinha, S.; Zhang, W. Low-Power FPGA Design Using Memoization-Based Approximate Computing. IEEE Trans. Very Large Scale Integr. Syst. 2016, 24, 2665–2678. [Google Scholar] [CrossRef]
Wei, X.; Jiang, N.; Yue, H.; Wang, X.; Zhao, J.; Li, G.; Qiu, M. ApproxDup: Developing an Approximate Instruction Duplication Mechanism for Efficient SDC Detection in GPGPUs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 1051–1064. [Google Scholar] [CrossRef]
Silveira, T.L.T.d.; Canterle, D.R.; Coelho, D.F.G.; Coutinho, V.A.; Bayer, F.M.; Cintra, R.J. A Class of Low-Complexity DCT-Like Transforms for Image and Video Coding. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 4364–4375. [Google Scholar] [CrossRef]
Hu, Y.; Koibuchi, M. Accelerating MPI Communication Using Floating-point Compression on Lossy Interconnection Networks. In Proceedings of the 2021 IEEE 46th Conference on Local Computer Networks, Edmonton, AB, Canada, 4–7 October 2021; pp. 355–358. [Google Scholar]
Liu, W.; Liao, Q.; Qiao, F.; Xia, W.; Wang, C.; Lombardi, F. Approximate designs for Fast Fourier Transform (FFT) with application to speech recognition. IEEE Trans. Circuits Syst. I Regul. Pap. 2019, 66, 4727–4739. [Google Scholar] [CrossRef]
Wang, Y.; He, Y.; Cheng, L.; Li, H.; Li, X. A Fast Precision Tuning Solution for Always-On DNN Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 41, 1236–1248. [Google Scholar] [CrossRef]
Dalloo, A.M.; Humaidi, A.J.; Mhdawi, A.K.A.; Al-Raweshidy, H. Low-Power and Low-Latency Hardware Implementation of Approximate Hyperbolic and Exponential Functions for Embedded System Applications. IEEE Access 2024, 12, 24151–24163. [Google Scholar] [CrossRef]
Meo, G.D.; Caro, D.D.; Petra, N.; Strollo, A.G.M. A novel low-power DLMS adaptive filter with sign-magnitude learning and approximated FIR section. In Proceedings of the 2022 17th Conference on Ph.D Research in Microelectronics and Electronics (PRIME), Villasimius, Italy, 12–15 June 2022; pp. 217–220. [Google Scholar]
Li, M.-C.; Ghosh, A.; Sen, S. Approximate DCT and Quantization Techniques for Energy-Constrained Image Sensors. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2025, 44, 11–24. [Google Scholar] [CrossRef]
Chen, W.; Wang, Y.; Yang, S.; Liu, C.; Zhang, L. Towards Best-effort Approximation: Applying NAS to General-purpose Approximate Computing. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition, Grenoble, France, 9–13 March 2020; pp. 1315–1318. [Google Scholar]
Gao, M.; Yang, X.; Pu, J.; Horowitz, M.; Kozyrakis, C. TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, New York, NY, USA, 13–17 April 2019; pp. 807–820. [Google Scholar]

Figure 1. Scatter plot of evaluation results. (a) FFT; (b) FIR; (c) DCT; (d) IIR.

Table 1. The optimal network structure and its accuracy.

Function	FFT	DCT	FIR	IIR
Accuracy	99.0%	96.9%	98.1%	98.6%
Structure	1 → 8 → 8 → 64 → 2	1 → 4 → 2 → 4 → 1	1 → 2 → 2 → 2 → 1	1 → 2 → 2 → 2 → 1

Table 2. Corresponding Functions in MAXIM and torch.nn script.

Functions in the MAXIM	Corresponding Functions in the torch.nn Script
Linear	Linear
Conv1d	Conv1d
FusedConv1dReLU	Conv1d, followed by ReLU
FusedMaxPoolConv1dReLU	MaxPool1d (followed by Conv1d and ReLU)

Table 3. The performance of approximate computation on the AI accelerator.

Function	Computational Cost (ops)	Computational Time (μs)
FFT	101,376	1024
DCT	4352	640
FIR	5760	1600
IIR	23,680	1600

Table 4. The performance achieved through the CMSIS-DSP library.

Function	Computational Cost (ops)	Computational Time (μs)
FFT	4096	6201
DCT	9240	682
FIR	4224	1808
IIR	4160	346

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, Y.; Fu, F.; Sun, X. Research on Approximate Computation of Signal Processing Algorithms for AIoT Processors Based on Deep Learning. Electronics 2025, 14, 1064. https://doi.org/10.3390/electronics14061064

AMA Style

Liu Y, Fu F, Sun X. Research on Approximate Computation of Signal Processing Algorithms for AIoT Processors Based on Deep Learning. Electronics. 2025; 14(6):1064. https://doi.org/10.3390/electronics14061064

Chicago/Turabian Style

Liu, Yingzhe, Fangfa Fu, and Xuejian Sun. 2025. "Research on Approximate Computation of Signal Processing Algorithms for AIoT Processors Based on Deep Learning" Electronics 14, no. 6: 1064. https://doi.org/10.3390/electronics14061064

APA Style

Liu, Y., Fu, F., & Sun, X. (2025). Research on Approximate Computation of Signal Processing Algorithms for AIoT Processors Based on Deep Learning. Electronics, 14(6), 1064. https://doi.org/10.3390/electronics14061064

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Research on Approximate Computation of Signal Processing Algorithms for AIoT Processors Based on Deep Learning

Abstract

1. Introduction

2. Neural Network Structure for Approximate Computation

2.1. Model Definition

2.2. Cost Evaluation

2.3. Model Evaluation

3. Deployment on the AI Accelerator

4. Performance Evaluation

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI