Optimization Methods, Challenges, and Opportunities for Edge Inference: A Comprehensive Survey
Abstract
1. Introduction
2. Preliminaries
2.1. Introduction to Edge Inference
2.2. Optimizing Methods for EI
2.3. Structure of the Survey
3. Lightweight Model Design
3.1. Design by Experience
3.2. Neural Architecture Search
3.3. Proxy-Based NAS
3.4. Proxyless NAS
3.5. Summary
4. Model Compression
4.1. Model Pruning
4.1.1. Structured Pruning
4.1.2. Unstructured Pruning
4.1.3. Semi-Structured Pruning
4.2. Model Quantization
4.2.1. Post-Training Quantization
4.2.2. Quantization-Aware Training
4.3. Customized Sparse Acceleration and General Sparse Acceleration
4.3.1. Customized Sparse Acceleration
4.3.2. General Sparse Acceleration
4.4. Summary
5. Compilation Toolchain
5.1. Computational Graph Optimization
5.2. Automatic Code Generation
5.3. Summary
6. Collaborative Inference
6.1. Device–Device Collaboration
6.2. Device–Edge Collaboration
6.3. Device–Cloud Collaboration
6.4. Device–Edge–Cloud Collaboration
6.5. Summary
7. Future Research Opportunities
7.1. The Interpretability of NAS
7.2. Multi-Optimized Overlay Compression
7.3. Graphics and Computing Joint Compilation Optimization
7.4. Intelligent Task Allocation and Combination Optimization
7.5. Technology Applications and Potential Multidisciplinary Collaborations
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Al Sallab, A.A.; Yogamani, S.; Pérez, P. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2021, 23, 4909–4926.
- Dogan, Ü.; Edelbrunner, J.; Iossifidis, I. Autonomous driving: A comparison of machine learning techniques by means of the prediction of lane change behavior. In Proceedings of the IEEE International Conference on Robotics and Biomimetics, Karon Beach, Thailand, 7–11 December 2011; pp. 1837–1843.
- Bachute, M.R.; Subhedar, J.M. Autonomous driving architectures: Insights of machine learning and deep learning algorithms. Mach. Learn. Appl. 2021, 6, 100164.
- Bhavsar, K.A.; Singla, J.; Al-Otaibi, Y.D.; Song, O.Y.; Zikria, Y.B.; Bashir, A.K. Medical diagnosis using machine learning: A statistical review. Comput. Mater. Contin. 2021, 67, 107–125.
- Bhavsar, K.A.; Abugabah, A.; Singla, J.; AlZubi, A.A.; Bashir, A.K. A comprehensive review on medical diagnosis using machine learning. Comput. Mater. Contin. 2021, 67, 1997.
- Richens, J.G.; Lee, C.M.; Johri, S. Improving the accuracy of medical diagnosis with causal machine learning. Nat. Commun. 2020, 11, 3923.
- Alzoubi, A. Machine learning for intelligent energy consumption in smart homes. Int. J. Comput. Inf. Manuf. (IJCIM) 2022, 2.
- Javed, A.R.; Fahad, L.G.; Farhan, A.A.; Abbas, S.; Srivastava, G.; Parizi, R.M.; Khan, M.S. Automated cognitive health assessment in smart homes using machine learning. Sustain. Cities Soc. 2021, 65, 102572.
- Priyadarshini, I.; Sahu, S.; Kumar, R.; Taniar, D. A machine-learning ensemble model for predicting energy consumption in smart homes. Internet Things 2022, 20, 100636.
- Ullah, Z.; Al-Turjman, F.; Mostarda, L.; Gagliardi, R. Applications of artificial intelligence and machine learning in smart cities. Comput. Commun. 2020, 154, 313–323.
- França, R.P.; Monteiro, A.C.B.; Arthur, R.; Iano, Y. An overview of the machine learning applied in smart cities. In Smart Cities: A Data Analytics Perspective; Springer: Cham, Switzerland, 2021; pp. 91–111.
- Prawiyogi, A.G.; Purnama, S.; Meria, L. Smart cities using machine learning and intelligent applications. Int. Trans. Artif. Intell. 2022, 1, 102–116.
- Xu, G.; Hao, Z.; Luo, Y.; Hu, H.; An, J.; Mao, S. DeViT: Decomposing vision transformers for collaborative inference in edge devices. IEEE Trans. Mob. Comput. 2023, 23, 5917–5932.
- Li, N.; Iosifidis, A.; Zhang, Q. Distributed deep learning inference acceleration using seamless collaboration in edge computing. In Proceedings of the ICC 2022-IEEE International Conference on Communications, Seoul, Republic of Korea, 16–20 May 2022; pp. 3667–3672.
- Dhar, A.C.; Roy, A.; Biswas, S.; Islam, B. Studying the security threats of partially processed deep neural inference data in an IoT device. In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, Boston, MA, USA, 6–9 November 2022; pp. 845–846.
- Ryu, J.; Zheng, Y.; Gao, Y.; Abuadbba, A.; Kim, J.; Won, D.; Nepal, S.; Kim, H.; Wang, C. Can differential privacy practically protect collaborative deep learning inference for IoT? Wirel. Netw. 2024, 30, 4713–4733.
- Baccour, E.; Erbad, A.; Mohamed, A.; Hamdi, M.; Guizani, M. DistPrivacy: Privacy-aware distributed deep neural networks in IoT surveillance systems. In Proceedings of the GLOBECOM 2020—2020 IEEE Global Communications Conference, Taiwan, China, 7–11 December 2020; pp. 1–6.
- Zhang, R.; Jiang, H.; Geng, J.; Tian, F.; Ma, Y.; Wang, H. A high-performance dataflow-centric optimization framework for deep learning inference on the edge. J. Syst. Archit. 2024, 152, 103180.
- Zhou, A.; Yang, J.; Qi, Y.; Qiao, T.; Shi, Y.; Duan, C.; Zhao, W.; Hu, C. HGNAS: Hardware-Aware Graph Neural Architecture Search for Edge Devices. IEEE Trans. Comput. 2024, 73, 2693–2707.
- Cai, H.; Zhu, L.; Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv 2018, arXiv:1812.00332.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
- Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
- Yang, C.; Zhao, P.; Li, Y.; Niu, W.; Guan, J.; Tang, H.; Qin, M.; Ren, B.; Lin, X.; Wang, Y. Pruning parameterization with bi-level optimization for efficient semantic segmentation on the edge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15402–15412.
- Anonymous. RotPruner: Large language model pruning in rotated space. arXiv 2024, arXiv:2410.09426.
- Xiao, G.; Lin, J.; Seznec, M.; Demouth, J.; Han, S. FLATQUANT: Flatness Matters for LLM Quantization. arXiv 2024, arXiv:2410.09426.
- Liu, Z.; Zhao, C.; Fedorov, I.; Soran, B.; Choudhary, D.; Krishnamoorthi, R.; Chandra, V.; Tian, Y.; Blankevoort, T. SpinQuant: LLM Quantization with Learned Rotations. arXiv 2024, arXiv:2405.16406.
- Hsu, O.; Strange, M.; Sharma, R.; Won, J.; Olukotun, K.; Emer, J.S.; Horowitz, M.A.; Kjølstad, F. The sparse abstract machine. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vancouver, BC, Canada, 25–29 March 2023; Volume 3, pp. 710–726.
- Shi, Y.; Tang, A.; Niu, L.; Zhou, R. Sparse optimization guided pruning for neural networks. Neurocomputing 2024, 574, 127280.
- Wang, H.; Zhai, J.; Gao, M.; Ma, Z.; Tang, S.; Zheng, L.; Li, Y.; Rong, K.; Chen, Y.; Jia, Z. PET: Optimizing tensor programs with partially equivalent transformations and automated corrections. In Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), Online, 14–16 July 2021; pp. 37–54.
- Jia, Z.; Padon, O.; Thomas, J.; Warszawski, T.; Zaharia, M.; Aiken, A. TASO: Optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, Porto, Portugal, 27–30 October 2019; pp. 47–62.
- Chen, T.; Moreau, T.; Jiang, Z.; Zheng, L.; Yan, E.; Shen, H.; Cowan, M.; Wang, L.; Hu, Y.; Ceze, L.; et al. TVM: An automated End-to-End optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, USA, 8–10 October 2018; pp. 578–594.
- Zhao, J.; Li, B.; Nie, W.; Geng, Z.; Zhang, R.; Gao, X.; Cheng, B.; Wu, C.; Cheng, Y.; Li, Z.; et al. AKG: Automatic kernel generation for neural processing units using polyhedral transformations. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, Online, 20–25 June 2021; pp. 1233–1248.
- Zhang, P.; Wen, D.; Zhu, G.; Chen, Q.; Han, K.; Shi, Y. Collaborative Edge AI Inference over Cloud-RAN. IEEE Trans. Commun. 2024, 72, 5641–5656.
- Xu, Z.; Zhang, P.; Li, C.; Zhu, H.; Xu, G.; Sun, C. A Collaborative Inference Algorithm in Low-Earth-Orbit Satellite Network for Unmanned Aerial Vehicle. Drones 2023, 7, 575.
- Li, N.; Iosifidis, A.; Zhang, Q. Collaborative edge computing for distributed CNN inference acceleration using receptive field-based segmentation. Comput. Netw. 2022, 214, 109150.
- OpenAI. Available online: https://openai.com/ (accessed on 1 March 2025).
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25.
- Fang, C.; Guo, S.; Wu, W.; Lin, J.; Wang, Z.; Hsu, M.K.; Liu, L. An efficient hardware accelerator for sparse transformer neural networks. In Proceedings of the 2022 IEEE International Symposium on Circuits and Systems (ISCAS), Austin, TX, USA, 27 May–1 June 2022; pp. 2670–2674.
- Yao, Z.; Yazdani Aminabadi, R.; Zhang, M.; Wu, X.; Li, C.; He, Y. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. Adv. Neural Inf. Process. Syst. 2022, 35, 27168–27183.
- Wu, D.; Tang, Q.; Zhao, Y.; Zhang, M.; Fu, Y.; Zhang, D. EasyQuant: Post-training quantization via scale optimization. arXiv 2020, arXiv:2006.16669.
- Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for Large-Scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, 2–4 November 2016; pp. 265–283.
- ONNX. Available online: https://github.com/onnx/onnx (accessed on 1 March 2025).
- Li, M.; Zhang, X.; Guo, J.; Li, F. Cloud–Edge Collaborative Inference with Network Pruning. Electronics 2023, 12, 3598.
- Zhang, Z.; Zhao, Y.; Li, H.; Lin, C.; Liu, J. DVFO: Learning-Based DVFS for Energy-Efficient Edge-Cloud Collaborative Inference. IEEE Trans. Mob. Comput. 2024, 23, 9042–9059.
- Dai, P.; Han, B.; Li, K.; Xu, X.; Xing, H.; Liu, K. Joint Optimization of Device Placement and Model Partitioning for Cooperative DNN Inference in Heterogeneous Edge Computing. IEEE Trans. Mob. Comput. 2024, 24, 210–226.
- Zhou, Y.; Chen, S.; Wang, Y.; Huan, W. Review of research on lightweight convolutional neural networks. In Proceedings of the 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 12–14 June 2020; pp. 1713–1720.
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
- Huang, G.; Liu, S.; Van der Maaten, L.; Weinberger, K.Q. CondenseNet: An efficient DenseNet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2752–2761.
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856.
- Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2820–2828.
- Liu, H.; Simonyan, K.; Yang, Y. DARTS: Differentiable architecture search. arXiv 2018, arXiv:1806.09055.
- Liu, C.; Zoph, B.; Neumann, M.; Shlens, J.; Hua, W.; Li, L.J.; Fei-Fei, L.; Yuille, A.; Huang, J.; Murphy, K. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 19–34.
- Cai, H.; Yang, J.; Zhang, W.; Han, S.; Yu, Y. Path-level network transformation for efficient architecture search. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 678–687.
- Real, E.; Aggarwal, A.; Huang, Y.; Le, Q.V. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 4780–4789.
- Lyu, B.; Yuan, H.; Lu, L.; Zhang, Y. Resource-constrained neural architecture search on edge devices. IEEE Trans. Netw. Sci. Eng. 2021, 9, 134–142.
- Luo, X.; Liu, D.; Huai, S.; Kong, H.; Chen, H.; Liu, W. Designing efficient DNNs via hardware-aware neural architecture search and beyond. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2021, 41, 1799–1812.
- Risso, M.; Burrello, A.; Conti, F.; Lamberti, L.; Chen, Y.; Benini, L.; Macii, E.; Poncino, M.; Pagliari, D.J. Lightweight neural architecture search for temporal convolutional networks at the edge. IEEE Trans. Comput. 2022, 72, 744–758.
- Akin, B.; Gupta, S.; Long, Y.; Spiridonov, A.; Wang, Z.; White, M.; Xu, H.; Zhou, P.; Zhou, Y. Searching for efficient neural architectures for on-device ML on edge TPUs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 2667–2676.
- Frankle, J.; Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv 2018, arXiv:1803.03635.
- Cheng, Y.; Wang, D.; Zhou, P.; Zhang, T. Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Process. Mag. 2018, 35, 126–136.
- Li, G.; Ma, X.; Wang, X.; Yue, H.; Li, J.; Liu, L.; Feng, X.; Xue, J. Optimizing deep neural networks on intelligent edge accelerators via flexible-rate filter pruning. J. Syst. Archit. 2022, 124, 102431.
- Wang, H.; Ling, P.; Fan, X.; Tu, T.; Zheng, J.; Chen, H.; Jin, Y.; Chen, E. All-in-one hardware-oriented model compression for efficient multi-hardware deployment. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 12345–12359.
- Goyal, V.; Das, R.; Bertacco, V. Hardware-friendly user-specific machine learning for edge devices. ACM Trans. Embed. Comput. Syst. (TECS) 2022, 21, 1–29.
- Jiang, Y.; Wang, S.; Valls, V.; Ko, B.J.; Lee, W.H.; Leung, K.K.; Tassiulas, L. Model pruning enables efficient federated learning on edge devices. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 10374–10386.
- Kong, H.; Liu, D.; Luo, X.; Huai, S.; Subramaniam, R.; Makaya, C.; Lin, Q.; Liu, W. Towards Efficient Convolutional Neural Network for Embedded Hardware via Multi-Dimensional Pruning. In Proceedings of the 2023 60th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 9–13 July 2023; pp. 1–6.
- Yu, T.; Wu, B.; Chen, K.; Yan, C.; Liu, W. Data stream oriented fine-grained sparse CNN accelerator with efficient unstructured pruning strategy. In Proceedings of the Great Lakes Symposium on VLSI 2022, Orange County, CA, USA, 6–8 June 2022; pp. 243–248.
- Yu, Z.; Wang, Z.; Li, Y.; Gao, R.; Zhou, X.; Bommu, S.R.; Zhao, Y.; Lin, Y. Edge-LLM: Enabling efficient large language model adaptation on edge devices via unified compression and adaptive layer voting. In Proceedings of the 61st ACM/IEEE Design Automation Conference, San Francisco, CA, USA, 23–27 June 2024; pp. 1–6.
- Yin, R.; Kim, Y.; Li, Y.; Moitra, A.; Satpute, N.; Hambitzer, A.; Panda, P. Workload-balanced pruning for sparse spiking neural networks. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 2897–2907.
- Eccles, B.J.; Wong, L.; Varghese, B. Rapid deployment of DNNs for edge computing via structured pruning at initialization. In Proceedings of the 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Philadelphia, PA, USA, 6–9 May 2024; pp. 317–326.
- Joardar, B.K.; Doppa, J.R.; Li, H.; Chakrabarty, K.; Pande, P.P. ReaLPrune: ReRAM crossbar-aware lottery ticket pruning for CNNs. IEEE Trans. Emerg. Top. Comput. 2022, 11, 303–317.
- Aggarwal, S.; Binici, K.; Mitra, T. CRISP: Hybrid Structured Sparsity for Class-Aware Model Pruning. In Proceedings of the 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), Valencia, Spain, 25–27 March 2024; pp. 1–6.
- Chou, W.C.; Huang, C.W.; Huang, J.D. Hardware-friendly progressive pruning framework for CNN model compression using universal pattern sets. In Proceedings of the 2022 International Symposium on VLSI Design, Automation and Test (VLSI-DAT), Taiwan, China, 18–21 April 2022; pp. 1–4.
- Wang, J.; Yu, S.; Yuan, Z.; Yue, J.; Yuan, Z.; Liu, R.; Wang, Y.; Yang, H.; Li, X.; Liu, Y. PACA: A pattern pruning algorithm and channel-fused high PE utilization accelerator for CNNs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 5043–5056.
- Gong, Y.; Zhan, Z.; Zhao, P.; Wu, Y.; Wu, C.; Ding, C.; Jiang, W.; Qin, M.; Wang, Y. All-in-one: A highly representative DNN pruning framework for edge devices with dynamic power management. In Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, San Diego, CA, USA, 30 October–3 November 2022; pp. 1–9.
- Gao, Y.; Zhang, B.; Qi, X.; So, H.K.H. DPACS: Hardware accelerated dynamic neural network pruning through algorithm-architecture co-design. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vancouver, BC, Canada, 25–29 March 2023; Volume 2, pp. 237–251.
- Sui, X.; Lv, Q.; Zhi, L.; Zhu, B.; Yang, Y.; Zhang, Y.; Tan, Z. A hardware-friendly high-precision CNN pruning method and its FPGA implementation. Sensors 2023, 23, 824.
- Wang, Y.; Qin, Y.; Liu, L.; Wei, S.; Yin, S. SWPU: A 126.04 TFLOPS/W edge-device sparse DNN training processor with dynamic sub-structured weight pruning. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 4014–4027.
- Gale, T.; Elsen, E.; Hooker, S. The state of sparsity in deep neural networks. arXiv 2019, arXiv:1902.09574.
- Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv 2015, arXiv:1510.00149.
- Zhou, A.; Ma, Y.; Zhu, J.; Liu, J.; Zhang, Z.; Yuan, K.; Sun, W.; Li, H. Learning N:M fine-grained structured sparse neural networks from scratch. arXiv 2021, arXiv:2102.04010.
- Liu, Z.; Wang, Y.; Han, K.; Zhang, W.; Ma, S.; Gao, W. Post-training quantization for vision transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 28092–28103.
- Shang, Y.; Yuan, Z.; Xie, B.; Wu, B.; Yan, Y. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1972–1981.
- Liu, F.; Zhao, W.; He, Z.; Wang, Y.; Wang, Z.; Dai, C.; Liang, X.; Jiang, L. Improving neural network efficiency via post-training quantization with adaptive floating-point. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 10–17 October 2021; pp. 5281–5290.
- Lin, J.; Tang, J.; Tang, H.; Yang, S.; Chen, W.M.; Wang, W.C.; Xiao, G.; Dang, X.; Gan, C.; Han, S. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proc. Mach. Learn. Syst. 2024, 6, 87–100.
- Shen, X.; Dong, P.; Lu, L.; Kong, Z.; Li, Z.; Lin, M.; Wu, C.; Wang, Y. Agile-Quant: Activation-guided quantization for faster inference of LLMs on the edge. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 18944–18951.
- Liu, Z.; Oguz, B.; Zhao, C.; Chang, E.; Stock, P.; Mehdad, Y.; Shi, Y.; Krishnamoorthi, R.; Chandra, V. LLM-QAT: Data-free quantization aware training for large language models. arXiv 2023, arXiv:2305.17888.
- Zhou, Q.; Guo, S.; Qu, Z.; Guo, J.; Xu, Z.; Zhang, J.; Guo, T.; Luo, B.; Zhou, J. Octo: INT8 training with loss-aware compensation and backward quantization for tiny on-device learning. In Proceedings of the 2021 USENIX Annual Technical Conference (USENIX ATC 21), Online, 14–16 July 2021; pp. 177–191.
- Kim, D.; Lee, J.; Ham, B. Distance-aware quantization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 10–17 October 2021; pp. 5271–5280.
- Matinizadeh, S.; Mohammadhassani, A.; Pacik-Nelson, N.; Polykretis, I.; Mishra, A.; Shackleford, J.; Kandasamy, N.; Gallo, E.; Das, A. A fully-configurable digital spiking neuromorphic hardware design with variable quantization and mixed precision. In Proceedings of the 2024 IEEE 67th International Midwest Symposium on Circuits and Systems (MWSCAS), Springfield, MA, USA, 11–14 August 2024; pp. 937–941.
- Liu, X.; Wang, T.; Yang, J.; Tang, C.; Lv, J. MPQ-YOLO: Ultra low mixed-precision quantization of YOLO for edge devices deployment. Neurocomputing 2024, 574, 127210.
- Gao, T.; Guo, L.; Zhao, S.; Xu, P.; Yang, Y.; Liu, X.; Wang, S.; Zhu, S.; Zhou, D. QuantNAS: Quantization-aware Neural Architecture Search For Efficient Deployment On Mobile Device. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 1704–1713.
- Lin, J.; Zhu, L.; Chen, W.M.; Wang, W.C.; Gan, C.; Han, S. On-device training under 256 KB memory. Adv. Neural Inf. Process. Syst. 2022, 35, 22941–22954.
- Parashar, A.; Rhu, M.; Mukkara, A.; Puglielli, A.; Venkatesan, R.; Khailany, B.; Emer, J.; Keckler, S.W.; Dally, W.J. SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Comput. Archit. News 2017, 45, 27–40.
- Krishna, A.; Nudurupati, S.R.; Chandana, D.; Dwivedi, P.; van Schaik, A.; Mehendale, M.; Thakur, C.S. RAMAN: A re-configurable and sparse tinyML accelerator for inference on edge. IEEE Internet Things J. 2024, 11, 24831–24845.
- Zhang, J.F.; Lee, C.E.; Liu, C.; Shao, Y.S.; Keckler, S.W.; Zhang, Z. SNAP: An efficient sparse neural acceleration processor for unstructured sparse deep neural network inference. IEEE J. Solid-State Circuits 2020, 56, 636–647.
- Gondimalla, A.; Chesnut, N.; Thottethodi, M.; Vijaykumar, T. SparTen: A sparse tensor accelerator for convolutional neural networks. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Columbus, OH, USA, 12–16 October 2019; pp. 151–165.
- Meng, J.; Venkataramanaiah, S.K.; Zhou, C.; Hansen, P.; Whatmough, P.; Seo, J.S. FixyFPGA: Efficient FPGA accelerator for deep neural networks with high element-wise sparsity and without external memory access. In Proceedings of the 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), Dresden, Germany, 30 August–3 September 2021; pp. 9–16.
- Vasireddy, P.; Kavi, K.; Mehta, G. Sparse-T: Hardware accelerator thread for unstructured sparse data processing. In Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, San Diego, CA, USA, 30 October–3 November 2022; pp. 1–8.
- Zhang, S.; Du, Z.; Zhang, L.; Lan, H.; Liu, S.; Li, L.; Guo, Q.; Chen, T.; Chen, Y. Cambricon-X: An accelerator for sparse neural networks. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taiwan, China, 15–19 October 2016; pp. 1–12.
- Zhou, X.; Du, Z.; Guo, Q.; Liu, S.; Liu, C.; Wang, C.; Zhou, X.; Li, L.; Chen, T.; Chen, Y. Cambricon-S: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Cambridge, UK, 20–24 October 2018; pp. 15–28.
- Kjolstad, F.; Chou, S.; Lugato, D.; Kamil, S.; Amarasinghe, S. Taco: A tool to generate tensor algebra kernels. In Proceedings of the 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), Ulm, Germany, 30 October–3 November 2017; pp. 943–948.
- Zheng, N.; Lin, B.; Zhang, Q.; Ma, L.; Yang, Y.; Yang, F.; Wang, Y.; Yang, M.; Zhou, L. SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), Vancouver, BC, Canada, 11–13 July 2022; pp. 213–232.
- Ye, Z.; Lai, R.; Shao, J.; Chen, T.; Ceze, L. SparseTIR: Composable abstractions for sparse compilation in deep learning. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vancouver, BC, Canada, 25–29 March 2023; Volume 3, pp. 660–678.
- Tian, R.; Guo, L.; Li, J.; Ren, B.; Kestor, G. A high performance sparse tensor algebra compiler in MLIR. In Proceedings of the 2021 IEEE/ACM 7th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), San Diego, CA, USA, 14 November 2021; pp. 27–38.
- Liu, R.; Leng, Y.; Tian, S.; Hu, S.; Chen, C.F.; Yao, S. DynaSpa: Exploiting Spatial Sparsity for Efficient Dynamic DNN Inference on Devices. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems, Hangzhou, China, 4–7 November 2024; pp. 422–435.
- Zhang, G.; Hsu, O.; Kjolstad, F. Compilation of modular and general sparse workspaces. Proc. ACM Program. Lang. 2024, 8, 1213–1238.
- Xia, H.; Zheng, Z.; Li, Y.; Zhuang, D.; Zhou, Z.; Qiu, X.; Li, Y.; Lin, W.; Song, S.L. Flash-LLM: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv 2023, arXiv:2309.10285.
- Jia, Z.; Thomas, J.; Warszawski, T.; Gao, M.; Zaharia, M.; Aiken, A. Optimizing DNN computation with relaxed graph substitutions. Proc. Mach. Learn. Syst. 2019, 1, 27–39.
- Zheng, L.; Wang, H.; Zhai, J.; Hu, M.; Ma, Z.; Wang, T.; Huang, S.; Miao, X.; Tang, S.; Huang, K.; et al. EINNET: Optimizing tensor programs with Derivation-Based transformations. In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), San Diego, CA, USA, 10–12 July 2023; pp. 739–755.
- Niu, W.; Guan, J.; Wang, Y.; Agrawal, G.; Ren, B. DNNFusion: Accelerating deep neural networks execution with advanced operator fusion. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, Online, 20–25 June 2021; pp. 883–898.
- Zheng, S.; Chen, S.; Song, P.; Chen, R.; Li, X.; Yan, S.; Lin, D.; Leng, J.; Liang, Y. Chimera: An analytical optimizing framework for effective compute-intensive operators fusion. In Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Montreal, QC, Canada, 25 February–1 March 2023; pp. 1113–1126.
- Shi, Y.; Yang, Z.; Xue, J.; Ma, L.; Xia, Y.; Miao, Z.; Guo, Y.; Yang, F.; Zhou, L. Welder: Scheduling deep learning memory access via tile-graph. In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), San Diego, CA, USA, 10–12 July 2023; pp. 701–718.
- Wang, F.; Shen, M. Automatic Kernel Generation for Large Language Models on Deep Learning Accelerators. In Proceedings of the 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), San Francisco, CA, USA, 28 October–2 November 2023; pp. 1–9.
- Meng, J.; Zhuang, C.; Chen, P.; Wahib, M.; Schmidt, B.; Wang, X.; Lan, H.; Wu, D.; Deng, M.; Wei, Y.; et al. Automatic generation of high-performance convolution kernels on ARM CPUs for deep learning. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 2885–2899.
- Danopoulos, D.; Kachris, C.; Soudris, D. Automatic generation of FPGA kernels from open format CNN models. In Proceedings of the 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Online, 3–6 May 2020; p. 237.
- Fu, Q.; Huang, H.H. Automatic generation of high-performance inference kernels for graph neural networks on multi-core systems. In Proceedings of the 50th International Conference on Parallel Processing, Lemont, IL, USA, 9–12 August 2021; pp. 1–11.
- Zhao, X.; Chen, Z.; Shi, Y.; Wen, M.; Zhang, C. Automatic End-to-End Joint Optimization for Kernel Compilation on DSPs. In Proceedings of the 2023 60th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 9–13 July 2023; pp. 1–6.
- Zheng, L.; Jia, C.; Sun, M.; Wu, Z.; Yu, C.H.; Haj-Ali, A.; Wang, Y.; Yang, J.; Zhuo, D.; Sen, K.; et al. Ansor: Generating High-Performance tensor programs for deep learning. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), Seattle, WA, USA, 4–6 November 2020; pp. 863–879.
- Ma, L.; Xie, Z.; Yang, Z.; Xue, J.; Miao, Y.; Cui, W.; Hu, W.; Yang, F.; Zhang, L.; Zhou, L. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), Seattle, WA, USA, 4–6 November 2020; pp. 881–897.
- Zheng, B.; Jiang, Z.; Yu, C.H.; Shen, H.; Fromm, J.; Liu, Y.; Wang, Y.; Ceze, L.; Chen, T.; Pekhimenko, G. DietCode: Automatic optimization for dynamic tensor programs. Proc. Mach. Learn. Syst. 2022, 4, 848–863.
- Li, N.; Iosifidis, A.; Zhang, Q. Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing. In Proceedings of the ICC 2022-IEEE International Conference on Communications, Seoul, Republic of Korea, 16–20 May 2022; pp. 4281–4286.
- Ye, S.; Du, J.; Zeng, L.; Ou, W.; Chu, X.; Lu, Y.; Chen, X. Galaxy: A Resource-Efficient Collaborative Edge AI System for In-situ Transformer Inference. arXiv 2024, arXiv:2405.17245.
- Dong, Z.; Li, N.; Iosifidis, A.; Zhang, Q. Design and prototyping distributed CNN inference acceleration in edge computing. In Proceedings of the European Wireless 2022 (27th European Wireless Conference), Oslo, Norway, 19–21 September 2022; pp. 1–6.
- Chen, Y.; Luo, T.; Fang, W.; Xiong, N.N. EdgeCI: Distributed workload assignment and model partitioning for CNN inference on edge clusters. ACM Trans. Internet Technol. 2024, 24, 1–24.
- Malka, M.; Farhan, E.; Morgenstern, H.; Shlezinger, N. Decentralized low-latency collaborative inference via ensembles on the edge. IEEE Trans. Wirel. Commun. 2024, 24, 598–614.
- Kumazawa, S.; Yu, J.; Kawamura, K.; Van Chu, T.; Motomura, M. Toward Improving Ensemble-Based Collaborative Inference at the Edge. IEEE Access 2024, 12, 6926–6940.
- Li, G.; Liu, L.; Wang, X.; Dong, X.; Zhao, P.; Feng, X. Auto-tuning neural network quantization framework for collaborative inference between the cloud and edge. In Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, 4–7 October 2018; Proceedings, Part I 27; Springer: Cham, Switzerland, 2018; pp. 402–411.
- Hu, Y.; Xu, X.; Duan, L.; Bilal, M.; Wang, Q.; Dou, W. End-Edge Collaborative Inference of Convolutional Fuzzy Neural Networks for Big Data-Driven Internet of Things. IEEE Trans. Fuzzy Syst. 2024, 33, 203–217.
- Palena, M.; Cerquitelli, T.; Chiasserini, C.F. Edge-device collaborative computing for multi-view classification. Comput. Netw. 2024, 254, 110823.
- Li, E.; Zeng, L.; Zhou, Z.; Chen, X. Edge AI: On-demand accelerating deep neural network inference via edge computing. IEEE Trans. Wirel. Commun. 2019, 19, 447–457.
- Cui, E.; Yang, D.; Wang, H.; Zhang, W. Learning-based deep neural network inference task offloading in multi-device and multi-server collaborative edge computing. Trans. Emerg. Telecommun. Technol. 2022, 33, e4485.
- Hao, Z.; Xu, G.; Luo, Y.; Hu, H.; An, J.; Mao, S. Multi-agent collaborative inference via DNN decoupling: Intermediate feature compression and edge learning. IEEE Trans. Mob. Comput. 2022, 22, 6041–6055.
- Jankowski, M.; Gündüz, D.; Mikolajczyk, K. Adaptive Early Exiting for Collaborative Inference over Noisy Wireless Channels. In Proceedings of the 2024 IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN), Stockholm, Sweden, 5–8 May 2024; pp. 126–131.
- Li, J.; Liao, G.; Chen, L.; Chen, X. Roulette: A Semantic Privacy-Preserving Device-Edge Collaborative Inference Framework for Deep Learning Classification Tasks. IEEE Trans. Mob. Comput. 2023, 23, 5494–5510.
- Im, J.; Kwon, N.; Park, T.; Woo, J.; Lee, J.; Kim, Y. Attention-Aware Semantic Communications for Collaborative Inference. IEEE Internet Things J. 2024, 11, 37008–37020.
- Zhang, M.; Cao, J.; Shen, X.; Cui, Z. EdgeShard: Efficient LLM Inference via Collaborative Edge Computing. arXiv 2024, arXiv:2405.14371.
- Zhang, Z.; Yu, H.; Wang, F. Opt-CoInfer: Optimal collaborative inference across IoT and cloud for fast and accurate CNN inference. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 438–448.
- Zhang, W.; Zhou, H.; Mo, J.; Zhen, C.; Ji, M. Accelerated Inference of Face Detection under Edge-Cloud Collaboration. Appl. Sci. 2022, 12, 8424.
- Yan, C.; Liu, S.; Liu, H.; Peng, X.; Wang, X.; Chen, F.; Fu, L.; Mei, X. Hybrid SD: Edge-cloud collaborative inference for stable diffusion models. arXiv 2024, arXiv:2408.06646.
- Hao, Z.; Jiang, H.; Jiang, S.; Ren, J.; Cao, T. Hybrid SLM and LLM for edge-cloud collaborative inference. In Proceedings of the Workshop on Edge and Mobile Foundation Models, Tokyo, Japan, 3–7 June 2024; pp. 36–41.
- Yang, Z.; Yang, Y.; Zhao, C.; Guo, Q.; He, W.; Ji, W. PerLLM: Personalized inference scheduling with edge-cloud collaboration for diverse LLM services. arXiv 2024, arXiv:2405.14636.
- Das, A.; Ghosh, S.K.; Raha, A.; Raghunathan, V. Toward energy-efficient collaborative inference using multisystem approximations. IEEE Internet Things J. 2024, 11, 17989–18004.
- Nimi, S.T.; Arefeen, A.; Uddin, Y.S.; Lee, Y. EARLIN: Early out-of-distribution detection for resource-efficient collaborative inference. In Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2021, Bilbao, Spain, 13–17 September 2021; Proceedings, Part I 21; Springer: Cham, Switzerland, 2021; pp. 635–651.
- Liu, G.; Dai, F.; Xu, X.; Fu, X.; Dou, W.; Kumar, N.; Bilal, M. An adaptive DNN inference acceleration framework with end–edge–cloud collaborative computing. Future Gener. Comput. Syst. 2023, 140, 422–435.
- Yang, S.; Zhang, Z.; Zhao, C.; Song, X.; Guo, S.; Li, H. CNNPC: End-edge-cloud collaborative CNN inference with joint model partition and compression. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 4039–4056.
- Qi, H.; Ren, F.; Wang, L.; Jiang, P.; Wan, S.; Deng, X. Multi-compression scale DNN inference acceleration based on cloud-edge-end collaboration. ACM Trans. Embed. Comput. Syst. 2024, 23, 1–25.
- Tian, J.; Li, X.; Qin, X. Reinforcement Learning Based Collaborative Inference and Task Offloading Optimization for Cloud-Edge-End Systems. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–8.
- Pagliari, D.J.; Chiaro, R.; Macii, E.; Poncino, M. CRIME: Input-dependent collaborative inference for recurrent neural networks. IEEE Trans. Comput. 2020, 70, 1626–1639.
- Gao, Y.; Zhang, B. Semantics-Driven Cloud-Edge Collaborative Inference: A Case Study of License Plate Detection. In Proceedings of the 2023 International Conference on Image Processing, Computer Vision and Machine Learning (ICICML), Chengdu, China, 3–5 November 2023; pp. 1100–1103.
- Zhang, C.; Zheng, X.; Tao, X.; Hu, C.; Zhang, W.; Zhu, L. Distributed Collaborative Inference System in Next-Generation Networks and Communication. IEEE Trans. Cogn. Commun. Netw. 2025, early access.
- Xue, M.; Wu, H.; Li, R.; Xu, M.; Jiao, P. EosDNN: An efficient offloading scheme for DNN inference acceleration in local-edge-cloud collaborative environments. IEEE Trans. Green Commun. Netw. 2021, 6, 248–264.
- Chen, Y.; Chiaro, R.; Macii, E.; Poncino, M.; Pagliari, D.J. C-NMT: A Collaborative Inference Framework for Neural Machine Translation. In Proceedings of the 2022 IEEE International Symposium on Circuits and Systems (ISCAS), Austin, TX, USA, 27 May–1 June 2022; pp. 1512–1516.
- Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y.L.; Tan, J.; Le, Q.V.; Kurakin, A. Large-scale evolution of image classifiers. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 2902–2911.
- Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710.
- Pham, H.; Guan, M.; Zoph, B.; Le, Q.; Dean, J. Efficient neural architecture search via parameters sharing. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 4095–4104.
- Zhang, M.; Li, H.; Pan, S.; Chang, X.; Su, S. Overcoming multi-model forgetting in one-shot NAS with diversity maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 13–19 June 2020; pp. 7809–7818.
- Zhang, R.; Jiang, H.; Tian, F.; Geng, J.; Li, X.; Ma, Y.; Zhu, C.; Dong, D.; Li, X.; Wang, H. Xenos: Dataflow-centric optimization to accelerate model inference on edge devices. In Proceedings of the International Conference on Database Systems for Advanced Applications, Tianjin, China, 17–20 April 2023; pp. 535–545.
- Ho, Q.; Cipar, J.; Cui, H.; Lee, S.; Kim, J.K.; Gibbons, P.B.; Gibson, G.A.; Ganger, G.; Xing, E.P. More effective distributed ML via a stale synchronous parallel parameter server. In Proceedings of the Advances in Neural Information Processing Systems 26 (NIPS 2013), Lake Tahoe, NV, USA, 5–10 December 2013; Volume 26.
- Cui, H.; Cipar, J.; Ho, Q.; Kim, J.K.; Lee, S.; Kumar, A.; Wei, J.; Dai, W.; Ganger, G.R.; Gibbons, P.B.; et al. Exploiting bounded staleness to speed up big data analytics. In Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC 14), Philadelphia, PA, USA, 19–20 June 2014; pp. 37–48.
- Zhang, R.; Jiang, H.; Geng, J.; Ma, Y.; Zhu, C.; Wang, H. FlexPie: Accelerate Distributed Inference on Edge Devices with Flexible Combinatorial Optimization [Technical Report]. arXiv 2025, arXiv:2502.15312.
- Hou, J.; Liu, H.; Liu, Y.; Wang, Y.; Wan, P.J.; Li, X.Y. Model protection: Real-time privacy-preserving inference service for model privacy at the edge. IEEE Trans. Dependable Secur. Comput. 2021, 19, 4270–4284.
- He, Z.; Zhang, T.; Lee, R.B. Attacking and protecting data privacy in edge–cloud collaborative inference systems. IEEE Internet Things J. 2020, 8, 9706–9716.
Optimization Method | Models | Parameters | FLOPs | Top-1 Acc. | Main Technology
---|---|---|---|---|---
Lightweight | 1.0 MobileNetV1 [21] | 4.2 M | 569 M | 70.6% | DSC
Lightweight | 1.0 MobileNetV2 [47] | 3.4 M | 300 M | 71.8% | DSC
Lightweight | EfficientNet-B0 [50] | 5.3 M | 390 M | 77.1% | DSC
Lightweight | AlexNet [37] | 60.9 M | 725 M | 57.2% | GC
Lightweight | 1.0 ShuffleNetV1 (g = 3) [49] | 2.4 M | 140 M | 68.4% | GC and shuffle
Lightweight | CondenseNet-86 [48] | 0.52 M | 65 M | 74.9% | GC
Lightweight | SqueezeNet [22] | 1.20 M | 837 M | 71.1% | Squeeze and expand
Traditional | VGG19 [51] | 144 M | 19,600 M | 72.4% | Standard convolution
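
The dominant "Main Technology" in the table above is the depthwise separable convolution (DSC) used by the MobileNet family. A minimal PyTorch sketch of a DSC block follows; the channel and spatial sizes are illustrative, not taken from any specific model in the table.

```python
# Minimal sketch of a depthwise separable convolution (DSC) block: a per-channel
# 3x3 depthwise convolution followed by a 1x1 pointwise convolution, which
# replaces one dense 3x3 convolution at a fraction of the parameters and FLOPs.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 32, 56, 56)
block = DepthwiseSeparableConv(32, 64)
print(block(x).shape)  # torch.Size([1, 64, 56, 56])
```
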
Optimization Method | Model | Target Hardware | Main Technology | Key Metrics | Code Availability
---|---|---|---|---|---
Proxy-based | MnasNet [52] | Google Pixel Phone | A custom weighted product; Factorized hierarchical space; Reinforcement learning | ImageNet: Parameters (3.9 M); Flops (312 M); Accuracy (75.2%); Latency (78 ms) | ✓
Proxy-based | DARTS [53] | *Mobile setting; NVIDIA GTX 1080Ti | Differentiable architecture search; Approximate gradient calculation | ImageNet: *Parameters (4.7 M); *Flops (<600 M); *Accuracy (73.3%) | ✓
Proxy-based | PNAS [54] | *Mobile setting; GPU (unspecified) | Progressive search; Surrogate models | ImageNet: *Parameters (5.1 M); *Flops (588 M); *Accuracy (74.2%) | ✓
Proxy-based | TreeCell [55] | *Mobile setting; GPU (unspecified) | Path-level network transformation; Tree-structured architecture; Reinforcement learning | ImageNet: *Flops (588 M); *Accuracy (74.5%) | ✓
Proxy-based | AmoebaNet [56] | NVIDIA Tesla P100 | Evolutionary algorithm; Hidden state mutation; Operation mutation; Identity mutation operations | ImageNet: Parameters (86.7 M); Flops (23.1 B); Accuracy (82.8%) | ✓
Proxyless | Lyu et al. [57] | NVIDIA Jetson Nano | MobileNetV2-based search space; Reinforcement learning | ImageNet: Parameters (1.1 M); Accuracy (73.7%); Latency (28.2 ms) | ✗
Proxyless | ProxylessNAS [20] | *Google Pixel Phone; NVIDIA Tesla V100; Intel E5-2640 v4 | Binarized paths; A gradient-based method; REINFORCE-based algorithm | ImageNet: *Accuracy (74.6%); *Latency (78 ms) | ✓
Proxyless | GoldenNAS [58] | *NVIDIA Jetson Xavier; NVIDIA Quadro GV100; Intel Xeon Gold 6136 | Dynamic channel scaling; Progressive space shrinking; Evolutionary algorithm; Adaptive BN; Self-knowledge distillation | ImageNet: *Accuracy (76.2%); *Latency (52.7 ms) | ✗
Proxyless | PIT [59] | *GAP-8 RISC-V; STMicroelectronics STM32H7 | Trainable masking parameters; Regularization | PPG task: *Parameters (<5.4 K); *Flops (<293.5 K); *Accuracy (94.13%); *Latency (1.26 ms) | ✓
Proxyless | HGNAS [19] | *NVIDIA Jetson TX2; NVIDIA RTX 3080; Intel i7-8700K; Raspberry Pi 3B+ | GNN performance predictor; A fine-grained hierarchical space; A multi-stage hierarchical strategy | ModelNet40: *Parameters (1.48 M); *Accuracy (92.2%); *Latency (36.3 ms) | ✗
Proxyless | Akin et al. [60] | Google Tensor SoC | PPE service; NAS integration; GC-IBN | ImageNet: Accuracy (79%); Latency (26 ms) | ✓
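
Hardware-aware NAS entries such as MnasNet [52] fold measured latency into the search objective through a weighted product rather than a hard constraint. The sketch below illustrates that reward shape; the target latency matches the table entry, while the exponent is a representative value of the kind MnasNet derives from its accuracy/latency trade-off assumptions.

```python
# Sketch of a latency-aware NAS reward in the style of MnasNet [52]:
# reward = accuracy * (latency / target)^w, so candidates slower than the
# target are smoothly penalized instead of being rejected outright.
def nas_reward(accuracy: float, latency_ms: float,
               target_ms: float = 78.0, w: float = -0.07) -> float:
    """Weighted-product objective used by the architecture search controller."""
    return accuracy * (latency_ms / target_ms) ** w

print(nas_reward(0.752, 78.0))   # on target: reward equals raw accuracy
print(nas_reward(0.760, 156.0))  # 2x slower: slightly higher accuracy is discounted
```
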
Optimization Method | Method | Target Hardware | Main Technology | Constraint Indicators | Code Availability
---|---|---|---|---|---
Structured | FlexPruner [63] | NVIDIA Jetson TX2; NVIDIA Jetson Nano | Greedy strategy; Iterative perception | Accuracy (−1.12%); Flops (59.8%); Pruning rate (50%); Speed up (1.27×) | ✗
Structured | AHC [64] | NVIDIA Tesla V100; Intel Xeon Gold 6258R; SOPHON BM1604 | Unified hardware-aware; Multi-objective evolution | Speed up (1.84×); Parameter (61.1%); Accuracy (−1.0%) | ✓
Structured | MyML [65] | Snapdragon 855; Google TPU | Transfer learning; Bottom-up | Model size (43%); Speed up (2.93×); Accuracy (−<1%) | ✗
Structured | PruneFL [66] | Raspberry Pi 4 | Two-stage distribute | Training time (66%); Flops (66%); Accuracy (80%) | ✓
Structured | TECO [67] | NVIDIA Jetson TX2; NVIDIA Jetson Xavier | Cross-dimension evaluation; Intra-dimension evaluation | MACs (25.6%); Accuracy (73.07%) | ✓
Structured | Yang et al. [23] | Snapdragon 888; Kryo 680 Octa-core CPU | Pruning parameterization; Soft mask representation | Parameter (−3.3 M); Speed up (1.37×); Accuracy (37.5%) | ✗
Unstructured | Yu et al. [68] | Avnet Ultra96v2 | Hyperparameter introduction; Systolic array | Accuracy (−1.4%); Pruning rate (93.75%); Power (−>66%) | ✓
Unstructured | Edge-LLM [69] | NVIDIA Jetson TX2; Meta Quest Pro | Hierarchical unified compression; Layer adjustment and voting | Accuracy (+1.29%); Memory (25%) | ✓
Unstructured | u-Ticket [70] | Simulator | Workload balance | Pruning rate (98%); Hardware utilization (2×); Latency (−76.9%); Energy cost (−63.8%) | ✓
Unstructured | Reconvene [71] | Simulator | Initialization pruning | Accuracy (91.26%); Pruning rate (98%) | ✗
Unstructured | ReaLPrune [72] | Fujitsu ReRAM | ReRAM cross-bar array | Training time (5.07%); Accuracy (90.66%); Pruning rate (95.5%) | ✗
Semi-structured | CRISP [73] | GPU (unspecified) | Hybrid sparsity; Class-aware saliency scores | Accuracy (95%); Pruning rate (90%); Latency (−92.86%); Energy cost (−96.7%) | ✓
Semi-structured | Chou et al. [74] | GPU (unspecified) | General pattern set selection; Progressive pruning | Accuracy (−0.45%); Pruning rate (54.12%); MACs (55.6%) | ✗
Semi-structured | PACA [75] | SIMD PE array | Pattern pruning; Channel fusion | Accuracy (−0.87%); Speed up (5.53×) | ✗
Semi-structured | All-in-One [76] | Snapdragon Adreno 650 | Parametric pruning; Switchable thresholds | Accuracy (67%) | ✗
Semi-structured | DPACS [77] | XILINX ZCU102 | Mask generation | Accuracy (92.15%); MACs (43.8%); Speed up (1.6×) | ✓
Semi-structured | KRP [78] | XILINX XC7Z035FFG676-2I | Row-level pruning; LR tracking retraining | Accuracy (−0.8%); Pruning rate (66.7%); Resource (−50%) | ✗
Semi-structured | SWPU [79] | Customized chip | Hybrid shape and line pattern; Dynamic workload balancing | Energy cost (−73.12%); Pruning rate (50.1%); Speed up (4.69×) | ✗
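
As a concrete baseline for the structured entries above, the sketch below prunes whole convolution filters by L1 magnitude, the simplest saliency criterion; the systems in the table layer iterative retraining, hardware-aware rate selection, or pattern constraints on top of this basic idea. The helper name and keep ratio are illustrative.

```python
# Minimal sketch of L1-magnitude structured (filter-level) pruning in PyTorch:
# rank output filters by weight magnitude and rebuild a smaller Conv2d.
import torch
import torch.nn as nn

def prune_filters_l1(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    # Score each output filter by the L1 norm of its weights.
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    keep = torch.topk(scores, n_keep).indices.sort().values
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       conv.stride, conv.padding, bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned  # note: the next layer's input channels must shrink to match

conv = nn.Conv2d(16, 32, 3, padding=1)
print(prune_filters_l1(conv, 0.5))  # Conv2d(16, 16, kernel_size=(3, 3), ...)
```
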
Optimization Method | Method Name | Target Hardware | Main Technology | Constraints | Quantization Bit-Width | Code Availability
---|---|---|---|---|---|---
PTQ | Liu et al. [83] | Mobile setting | Ranking loss; Nuclear norm | Accuracy (81.29%); Memory (75%) | 4–10 | ✓
PTQ | PTQ4DM [84] | GPU (unspecified) | NDTC calibration method; MSE quantization metric | IS (+15.52); FID (−24.92); sFID (−17.36) | 8 | ✓
PTQ | EasyQuant [40] | Rockchip RK3399 | Scale optimization; ARM NEON ISA | Accuracy (68.26%); Computational cost (−33%) | 8, 7, <7 | ✗
PTQ | AFP [85] | GPU (unspecified) | Adaptive floating-point format; Bayesian optimization | Accuracy (−0.04%); MACs (10.75%); Energy cost (92.42%) | 3.9–5 | ✓
PTQ | ZeroQuant [39] | GPU (unspecified) | Kernel fusion; Layer-by-layer knowledge distillation | Speed up (2.6×); Memory (33%) | 8, 4/8 | ✓
PTQ | AWQ [86] | NVIDIA Jetson Orin Nano; NVIDIA RTX 4070 | Activation-aware weight protection; On-the-fly dequantization; Kernel fusion | Accuracy (−<0.1%); Memory (25%); Speed up (3.3×) | 4, 3, 16 | ✓
PTQ | Agile-Quant [87] | Snapdragon 870; Raspberry Pi 4B | Activation quantization strategy; TRIP matrix multiplication | PPL (6.09); Speed up (2.55×) | 4, 8 | ✗
QAT | LLM-QAT [88] | GPU (unspecified) | Data-free distillation; KV cache quantization | Accuracy (69.9%); Model size (25.9%) | 4, 6, 8 | ✗
QAT | Octo [89] | Huawei Atlas 200DK; NVIDIA Jetson Xavier | Loss-aware compensation; Parameterized range clipping | Accuracy (98.8%); Speed up (2.03×); Peak memory (29.67%) | 8 | ✓
QAT | DAQ [90] | GPU (unspecified) | Distance-aware soft rounding (DASR); Temperature controller | Accuracy (91.2%) | 1, 2, 3, 4, 32 | ✓
QAT | QUANTISENC [91] | AMD Virtex UltraScale | Variable quantization; Dynamic configuration | Accuracy (96.5%) | 1.3, 5.3 | ✓
QAT | MPQ-YOLO [92] | NVIDIA RTX 3090 | Trainable scale; Progressive strategy | Accuracy (74.7%); Model size (7.04%) | 1, 4 | ✗
QAT | QuantNAS [93] | Kirin 9000 | Batch statistics; Scale predictor | Accuracy (94.5%); Model size (30%); Latency (−40%) | 8 | ✗
QAT | Lin et al. [94] | STMicroelectronics STM32F746 | Quantization-aware scaling | Accuracy (+1.68%) | 8 | ✗
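
For orientation, the PTQ rows above all refine the same primitive: mapping float tensors to low-bit integers with a scale and zero point. The sketch below shows the plain asymmetric min/max affine scheme that methods such as EasyQuant [40] and ZeroQuant [39] improve on with better scale search and calibration; it is a baseline illustration, not any specific method's implementation.

```python
# Baseline post-training quantization sketch: asymmetric affine int8
# quantization from observed min/max, plus the matching dequantization.
import numpy as np

def quantize_affine(x: np.ndarray, num_bits: int = 8):
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)      # float step per integer
    zero_point = int(round(qmin - x.min() / scale))  # integer that maps to 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s, z = quantize_affine(w)
print("max abs error:", np.abs(dequantize(q, s, z) - w).max())  # about scale/2
```
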
Optimization Method | Method Name | Hardware Design | Main Technology | Constrained Metrics | Code Availability
---|---|---|---|---|---
Customized | SCNN [95] | Customized chip | PT-IS-CP-sparse dataflow | Speed up (2.7×); Energy cost (43.48%) | ✗
Customized | RAMAN [96] | Efinix Ti60 | Sparse processing; Reconfigurable dataflow | Throughput (13.5 GOP/s); Energy cost (136.96 mW); Peak memory (−37%) | ✗
Customized | SNAP [97] | Customized chip | Channel-first dataflow; Two-level psum reduction | Speed up (2.87×); Energy efficiency (3.61 TOPS/W) | ✗
Customized | SparTen [98] | Terasic DE2-150 | Bitmask representation; Greedy balance | Speed up (4.3×); Memory (76.92%) | ✗
Customized | FixyFPGA [99] | Intel Stratix-10 GX 10M | Fixed-weight design; Fully-pipelined activation buffering | Speed up (2.34×) | ✓
Customized | STA [38] | Intel Arria 10 SX660 SoC | Diverse matrix multiplication engine (DMME); Scalable softmax module | Energy efficiency (12.28×); MAC efficiency (51×) | ✗
Customized | Sparse-T [100] | Ibex RISC-V | Dual-version ASIC design; Metadata processing optimization | Speed up (2.1×); Energy cost (47.3%); Occupied area (30.86%) | ✗
Customized | Cambricon-X [101] | Customized chip | PE-based architecture; Indexing module (IM); Asynchronous compute | Speed up (7.23×); Throughput (544 GOP/s); Energy efficiency (6.34×); Energy cost (954 mW) | ✗
Customized | Cambricon-S [102] | Customized chip | Entropy encoding; Shared indexing | Speed up (1.71×); Energy efficiency (1.37×); Energy cost (798.55 mW) | ✗
General | Taco [103] | Intel Xeon E5-2680 v3; NVIDIA RTX 2080 Ti | Data structure abstraction; Sparse iteration space theory | Performance (14×); Correctness | ✓
General | SparTA [104] | NVIDIA RTX 2080 Ti; AMD Radeon VII; Intel Xeon Silver 4210 | Tensor-with-sparsity-attribute abstraction; Sparsity attribute propagation | Inference latency (8.4×); Memory footprint; Model accuracy | ✓
General | SparseTIR [105] | GPU (unspecified); CPU (unspecified) | Composable formats; Composable transformations | Speedup (1.52×); Memory footprint | ✓
General | Tian et al. [106] | Intel Xeon Gold 6126 | Unified tensor storage format representation | Performance (6.26×); Code quality | ✓
General | DynaSpa [107] | NVIDIA Jetson Orin; NVIDIA Jetson Xavier; Qualcomm Adreno 650 | Relaxed sparsity composition; Polyalgorithm kernel composition | Performance (4.4×); Search time; Runtime cost | ✗
General | Zhang et al. [108] | Intel Xeon E5-2640 v4 | Insert-sort-merge template; Automatic workspace insertion | Performance (27.12×); Memory usage | ✗
General | Flash-LLM [109] | GPU (unspecified) | Load-as-sparse and compute-as-dense; Software pipeline design | Performance (2.9×); Throughput (3.8×) | ✓
General | SAM [27] | Intel Xeon Silver | Core data model and dataflow blocks; Custard compiler | Generality; Performance; Hardware modeling ability | ✓
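
Most of the accelerators and compilers above exploit compressed formats so that work scales with the number of nonzeros rather than the tensor size. The sketch below shows the compressed sparse row (CSR) representation and a sparse matrix-vector product over it; it is a didactic NumPy illustration of the format, not any system's kernel.

```python
# CSR sketch: store only nonzero values plus their column indices and per-row
# offsets, then multiply by touching only the stored nonzeros.
import numpy as np

def dense_to_csr(a: np.ndarray):
    values, col_idx, row_ptr = [], [], [0]
    for row in a:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))  # prefix sums of nonzeros per row
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def csr_matvec(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):  # each row reads only its own slice of nonzeros
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = values[start:end] @ x[col_idx[start:end]]
    return y

a = np.random.randn(8, 8) * (np.random.rand(8, 8) > 0.8)  # roughly 80% zeros
v, c, r = dense_to_csr(a)
print(np.allclose(csr_matvec(v, c, r, np.ones(8)), a @ np.ones(8)))  # True
```
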
Model | Target Hardware | Main Technology | Key Metrics | Code Availability
---|---|---|---|---
TVM [31] | ARM Cortex-A53; XILINX Artix-7; ARM Mali-T860MP4; NVIDIA Titan X | Tensor expression language; Operator fusion; Data layout transformation | Inference time; Resource utilization | ✓
MetaFlow [110] | NVIDIA Tesla V100; NVIDIA Tesla P100 | Relaxed graph substitutions; Multi-dim. cost model; Graph split algorithm | Inference time; Resource utilization | ✓
TASO [30] | NVIDIA Tesla V100 | Graph substitutions; Formal verification; Joint optimization; Data layout transformation | Inference time | ✓
PET [29] | NVIDIA Tesla V100 | Partial equivalence transformation; Automated corrections | Inference time; Resource utilization | ✓
EINNET [111] | NVIDIA Tesla V100; NVIDIA Tesla A100; Intel Xeon E5-2680 v4 | Tensor algebra expression; Derivation rules | Inference time | ✓
DNNFusion [112] | Samsung Galaxy S20; Samsung Galaxy S10; Honor Magic 2 | Fusion opportunity analysis; Mathematical-based graph rewriting; Profile-driven fusion plan | Inference time; Resource utilization; Compilation time | ✓
Chimera [113] | Intel Xeon Gold 6240; NVIDIA Tesla A100; Huawei Ascend 910 | Block decomposition and reordering; Intra-block optimization | Inference time; Cache utilization | ✓
Welder [114] | NVIDIA Tesla V100; NVIDIA RTX 3090; AMD MI50 GPU; Graphcore IPU | Tile graph construction; Hierarchical scheduling; Code generation and optimization | Inference time; Memory access | ✓
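
A canonical example of the graph substitutions that systems like TASO [30] and MetaFlow [110] discover automatically is folding a BatchNorm node into the preceding convolution, which removes one operator at inference time. The PyTorch sketch below derives the fused weights from the standard BN equations; it illustrates the substitution itself, not any particular compiler's rewriting engine.

```python
# Conv+BN fusion: BN(conv(x)) = (gamma/std) * W * x + (gamma/std) * (b - mean) + beta,
# so a single convolution with rescaled weights reproduces both operators.
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1))
    b = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((b - bn.running_mean) * bn.weight / std + bn.bias)
    return fused

conv, bn = nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)
bn.eval()  # fusion is valid at inference time, with frozen running statistics
x = torch.randn(1, 3, 16, 16)
print(torch.allclose(fuse_conv_bn(conv, bn)(x), bn(conv(x)), atol=1e-5))  # True
```
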
Model | Target Hardware | Main Technology | Key Metrics | Open Source
---|---|---|---|---
Wang and Shen [115] | GPU (unspecified) | Reinforcement learning; Variance reduction | Code performance; Energy efficiency | ✓
FastConv [116] | Huawei Kunpeng 920; Snapdragon 835, 855, 888; Apple M1; Amazon Graviton2 | Winograd algorithm; Tensor transformation; C++ automatic generation | Code performance; Cache utilization | ✓
Danopoulos et al. [117] | XILINX Alveo U200 | Heterogeneous streaming; Parallel processing | Code performance | ✗
Fu and Huang [118] | Intel Xeon Gold 6126 | ACG programming model; Dataflow graph IR; Code template | Code performance; Memory consumption | ✗
Zhao et al. [119] | FT-Matrix DSP | Loop transformation; Instruction-level optimization; Reinforcement learning | Code performance | ✗
Welder [114] | NVIDIA Tesla V100; NVIDIA RTX 3090; AMD MI50 GPU; Graphcore IPU | Tile-graph abstraction; Two-step scheduling algorithm; Hardware mapping | Code performance; Memory access | ✓
Ansor [120] | Intel 18-core 8124M; NVIDIA Tesla V100; Raspberry Pi 3B+ | Hierarchical search space; Program sampling; Cost model; Gradient descent algorithm | Code performance | ✓
Rammer [121] | NVIDIA Tesla V100; AMD Radeon Instinct MI50; Graphcore IPU | rOperator abstraction; vDevice abstraction; rTask-aware DFG compiler | Code performance; Hardware utilization; Scheduling overhead | ✓
AKG [32] | Huawei Ascend 910 | Polyhedral transformations; Tiling and fusion strategies; Vectorization; Low-level synchronization | Code performance | ✓
DietCode [122] | NVIDIA Tesla T4 | Shape-universal search space; Cost model; Joint learning; Local padding optimization | Code performance; Scheduling overhead | ✓
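
The search spaces explored by auto-schedulers such as Ansor [120] and AKG [32] are built from loop transformations like tiling and reordering. The sketch below shows the effect of one such transformation on a matrix multiplication in plain NumPy; the tile size is arbitrary, whereas real systems pick it with a learned cost model.

```python
# Loop-tiling sketch: blocked matmul keeps a (tile x tile) working set resident
# in cache, the kind of program an auto-scheduler emits after its search.
import numpy as np

def matmul_tiled(A, B, tile=32):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Accumulate one tile of C from matching tiles of A and B.
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile])
    return C

A, B = np.random.randn(128, 128), np.random.randn(128, 128)
print(np.allclose(matmul_tiled(A, B), A @ B))  # True: same result, tiled schedule
```
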
Work | Supported Models | Platforms | Main Technology | Key Metrics | Code Availability
---|---|---|---|---|---
DeViT [13] | ViT; DeiT; CCT | NVIDIA Jetson Nano | Knowledge distillation | Accuracy; Latency; Energy; Power | ✗
RFS [123] | CNN | NVIDIA RTX 2080Ti; NVIDIA GTX 1080Ti; NVIDIA Jetson Xavier | Model partitioning | Accuracy; Latency; Service reliability | ✓
Galaxy [124] | DistilBert; Bert-L; GPT2-L; OPT-L; OPT-XL | NVIDIA Jetson Nano | Hybrid model parallelism; Communication optimization | Latency; Scalability | ✗
HALP [14] | VGG-16 | NVIDIA GTX 1080Ti; NVIDIA Jetson Xavier | Task partitioning | Speedup ratio; Throughput; Service reliability | ✓
HALP (extended) [125] | MobileNet-v1; VGG-16 | Raspberry Pi 4 | Task partitioning | Latency; Accuracy; Service reliability | ✗
COIN-LEO [34] | Self-built DNN | Simulation | Model partitioning; Task assignment; PPO | Throughput; Latency; Network overhead | ✗
Edge ensembles [127] | MobileNet-v2 | Unmentioned | Communication optimization; Ensemble aggregation; Vector quantization | Latency; Accuracy | ✓
[126] | VGG-16; ResNet-34 | Raspberry Pi 3B+; Raspberry Pi 4B; NVIDIA Jetson TX2 | Workload assignment | Execution time | ✗
[128] | ResNet-18 | Unmentioned | Model aggregation | Accuracy; Latency | ✗
[45] | AlexNet; VGG-19; YOLONet | Unmentioned | Model partitioning | Throughput; Execution time | ✗
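
The recurring "Model partitioning" entry above amounts to cutting a network at a layer boundary and shipping the intermediate activation to a peer. The sketch below shows the mechanics on a toy sequential model; the partition point is hypothetical, whereas systems like those in the table choose it from profiled compute and link bandwidth.

```python
# Layer-wise partitioning sketch: split a sequential model into a device-side
# head and a peer-side tail; only the intermediate feature crosses the network.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))

split = 4  # hypothetical partition point a profiler would select
head, tail = model[:split], model[split:]

x = torch.randn(1, 3, 32, 32)
feat = head(x)       # executes on device A
# ... here feat would be serialized and transmitted to device B ...
logits = tail(feat)  # executes on device B
print(torch.allclose(logits, model(x)))  # True: the split preserves the output
```
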
Work | Supported Models | Platforms | Main Technology | Key Metrics | Code Availability |
---|---|---|---|---|---|
[129] | AlexNet VGG16 ResNet-18 GoogLeNet | NVIDIA Jetson TX2 NVIDIA Titan Xp | Model partitioning | Latency Storage Accuracy | ✗ |
[35] | ResNet-50 VGG-16 | NVIDIA RTX 2080Ti NVIDIA GTX 1080Ti NVIDIA Jetson Xavier | Task partitioning | Accuracy Latency Service failure probability | ✗ |
DisCFNN [130] | CFNN-A CFNN-V CFNN-R | Intel Xeon Platinum 8352V | Model partitioning Task offloading | Utility Success rate Server utilization rate Relative transmission data size Fairness | ✗ |
Roulette [136] | LeNet ResNet18 ResNet50 | NVIDIA A100 Intel Xeon Gold 6240 | Model partitioning Differential privacy | Accuracy Attack accuracy Computing load | ✗ |
[131] | VGG-16 | NVIDIA Tesla V100 | Model partitioning Data fusion | Accuracy Latency Transmission gain Communication overhead | ✗ |
Edgent [132] | AlexNet | Intel Quad-core Processor Raspberry Pi 3 | Model partitioning Early exit | Latency Accuracy Throughput | ✗ |
LSTM-TD3 [133] | MobileNetV3-Large MobileNetV3-Small | Raspberry Pi NVIDIA GTX 960 | Task offloading | Latency Accuracy Execution time | ✗ |
MAHPPO [134] | ResNet-18 VGG-11 MobileNetV2 | NVIDIA Jetson Nano | Feature compression PPO | Compression rate Latency Energy consumption | ✓ |
[135] | VGG16 | Unmentioned | Early exit Transmision decision | Accuracy Communication savings | ✗ |
[137] | DeiT-Tiny DeiT-Small DeiT-Base | APPLE iPhone 12 NVIDIA RTX 3090 | Patch selection Communication optimization | Communication cost Accuracy | ✗ |
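
Several rows above rely on early exits: the device answers locally when its shallow head is confident and escalates to the edge only otherwise. The sketch below uses a softmax-confidence threshold in the spirit of Edgent [132]; the toy models and the 0.9 threshold are illustrative.

```python
# Early-exit sketch for device-edge inference: keep confident inputs on the
# device-side head, offload the uncertain ones to a stronger edge-side model.
import torch
import torch.nn as nn

head = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))       # on-device exit
tail = nn.Sequential(nn.Flatten(), nn.Linear(784, 256),
                     nn.ReLU(), nn.Linear(256, 10))           # edge-side model

def infer(x: torch.Tensor, conf_threshold: float = 0.9) -> torch.Tensor:
    probs = head(x).softmax(dim=-1)
    if probs.max() >= conf_threshold:
        return probs                 # confident: answer locally, skip the uplink
    return tail(x).softmax(dim=-1)   # uncertain: offload to the edge model

print(infer(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 10])
```
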
Work | Supported Models | Platforms | Main Technology | Key Metrics | Code Availability
---|---|---|---|---|---
EdgeShard [138] | Llama2-7B; Llama2-13B; Llama2-70B | NVIDIA Jetson Orin; NVIDIA Jetson Orin NX; NVIDIA RTX 3090 | Model partitioning; Pipeline execution optimization | Latency; Throughput | ✗
DVFO [44] | EfficientNetB0; ViT-B16; ResNet-18; Inception-v4; MobileNet-v2; YOLOv3-Tiny; RetinaNet; DeepSpeech | NVIDIA Jetson Nano; NVIDIA Jetson TX2; NVIDIA Jetson Xavier NX; NVIDIA Orin NX; NVIDIA AGX Orin; NVIDIA RTX 3080 | Dynamic voltage frequency scaling; Deep reinforcement learning | Latency; Energy; Accuracy | ✗
Opt-CoInfer [139] | VGG-16 | Raspberry Pi 4B; NVIDIA Tesla V100 | Model partitioning; Model compression; Optimal scheme searching | Latency; Accuracy | ✓
[43] | VGG16; ResNet18; MobileNetV1; MobileNetV2 | Raspberry Pi 3B; NVIDIA RTX 3080Ti | Model partitioning; Network pruning; Feature compression | Latency; Accuracy | ✗
[140] | CenterNet | NVIDIA Jetson Nano; NVIDIA RTX 3090; NVIDIA GTX 1060 | Model pruning; Model partitioning | Accuracy; Latency | ✗
EARLIN [145] | DenseNet; ResNet34; ResNet44; VGG16 | Intel Core i7-9750H; NVIDIA Tesla K80 | Early exit | Accuracy; Latency | ✗
Hybrid SD [141] | Stable Diffusion v1.4; BK-SDM-Small; BK-SDM-Tiny; OursTiny | NVIDIA A100; NVIDIA Tesla V100; Apple iPhone 15 Pro | Model pruning; Task offloading | Quality | ✗
DRAX [144] | AlexNet; VGG11; ResNet34 | Intel Stratix IV GX FPGA; Intel Xeon Silver 4114; Intel Neural Compute Stick 2; NVIDIA Jetson Nano; Google Edge TPU | Heuristic approximation | Energy consumption; Accuracy | ✓
[142] | Llama2-70B-chat; Llama2-7B-chat; TinyLlama-1.1B | Unmentioned | Task offloading | Accuracy; Cost | ✗
PerLLM [143] | Llama2-33B; Llama2-7B; Llama3-8B; Yi-6B; Yi-9B | Intel Xeon Silver 4214R; NVIDIA A100 | Task offloading; Resource allocation | Latency; Throughput; Energy consumption | ✗
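
The "Task offloading" decisions in the works above reduce, at their simplest, to comparing local compute time against remote compute plus transfer time. The sketch below states that latency model explicitly; every number in the example is illustrative, and production systems such as Opt-CoInfer [139] extend this with accuracy and energy terms.

```python
# Offloading-decision sketch: offload only if cloud compute plus uplink
# transfer beats local compute. Payload and bandwidth values are made up.
def should_offload(local_ms: float, cloud_ms: float,
                   payload_bytes: int, bandwidth_bps: float) -> bool:
    transfer_ms = payload_bytes * 8 / bandwidth_bps * 1000.0
    return cloud_ms + transfer_ms < local_ms

# A 150 KB feature map over a 20 Mbit/s uplink costs about 60 ms of transfer,
# so offloading still wins against a 120 ms local execution.
print(should_offload(local_ms=120.0, cloud_ms=15.0,
                     payload_bytes=150_000, bandwidth_bps=20e6))  # True
```
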
Work | Supported Models | Platforms | Main Technology | Key Metrics | Code Availability
---|---|---|---|---|---
[33] | SVM; MLP | Unmentioned | Communication optimization | Accuracy | ✗
[146] | AlexNet; ResNet-34; MobileNetV1 | Huawei Nova 7 Pro; NVIDIA MX250; NVIDIA GTX 1080Ti ×3 | Model partitioning | Latency | ✗
CRIME [150] | CoVe; a one-layer LSTM | ARM Cortex-A53; NVIDIA Jetson TX2; NVIDIA Titan Xp | Task offloading | Latency; Energy consumption | ✗
C-NMT [154] | BiLSTM; GRU RNN; "MarianMT" Transformer | NVIDIA Jetson TX2; NVIDIA Titan Xp | Task estimation; Linear mapping | Execution time | ✗
[151] | HyperLPR; YOLOv5; MTCNN | Intel Core i7-10510U; NVIDIA RTX 2080Ti | Task offloading | Latency; Throughput; Traffic; Device utilization | ✗
CNNPC [147] | MobileNet-V2; ResNet-18; SSD-VGG16 | Snapdragon 845; Snapdragon 710; NVIDIA Jetson TX2; NVIDIA Tesla P100 | Model partitioning; Model compression | Latency; Accuracy; Compression rate | ✓
MCIA [148] | ResNet-56 | Intel Core i7-9700K | Model partitioning; Deep reinforcement learning | Latency; Accuracy | ✗
[149] | AlexNet; MobileNet-v2; GoogLeNet | Unmentioned | Task offloading; Model partitioning; Resource allocation; PPO | Latency; Throughput | ✗
[152] | BERT-Base-uncased; BERT-Large-uncased; BERTweet | NVIDIA RTX 4090 | Task offloading; Early exit | Accuracy; Execution time | ✗
EosDNN [153] | AlexNet; VGG; GoogleNet; ResNet | Unmentioned | Computation offloading | Latency | ✓
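
Device-edge-cloud systems such as CNNPC [147] and EosDNN [153] must pick two partition points jointly rather than one. The brute-force search below makes that joint decision concrete for a small layer count; all per-layer costs and link delays are invented for illustration, and real schedulers replace the exhaustive loop with dynamic programming or learned policies.

```python
# Illustrative two-split search over a device-edge-cloud pipeline: layers
# [0, s1) run on device, [s1, s2) on edge, [s2, n) on cloud, with fixed
# uplink delays charged at each tier boundary.
def best_two_splits(dev_ms, edge_ms, cloud_ms, up1_ms, up2_ms):
    """dev_ms[i], edge_ms[i], cloud_ms[i]: latency of layer i on each tier."""
    n = len(dev_ms)
    best = (float("inf"), 0, 0)
    for s1 in range(n + 1):
        for s2 in range(s1, n + 1):
            total = (sum(dev_ms[:s1]) + up1_ms + sum(edge_ms[s1:s2])
                     + up2_ms + sum(cloud_ms[s2:]))
            best = min(best, (total, s1, s2))
    return best  # (total latency, device/edge split, edge/cloud split)

dev = [30, 40, 60, 90]; edge = [10, 12, 20, 30]; cloud = [2, 3, 4, 6]
print(best_two_splits(dev, edge, cloud, up1_ms=25, up2_ms=40))
```
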
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).