1. Introduction
The transformative potential of Artificial Intelligence (AI) is undeniable, with its influence permeating diverse sectors ranging from healthcare and finance to transportation and entertainment [
1,
2,
3]. Traditionally, AI applications have predominantly resided on cloud-based infrastructures or High-Performance Computing (HPC) environments [
4,
5], leveraging vast computational resources and ample memory to execute complex algorithms and process massive datasets. This type of cloud-based AI excels in scenarios requiring massive data aggregation and complex, large-scale model training. However, the centralized approach is often hampered by latency, bandwidth constraints, data-privacy concerns, and energy consumption, hindering its application in time-critical and resource-sensitive scenarios [
6,
7,
8]. For an autonomous vehicle detecting pedestrians, the time it takes to send critical decision-making data back and forth to a cloud server is simply unacceptable. In contrast, embedded AI (EAI) places intelligence directly at the point of data collection. This paradigm shift from centralized to distributed processing is not just a choice, but an inevitable trend driven by the growing demand for real-time processing, low latency, enhanced privacy, and optimized energy efficiency, giving rise to the emergence of EAI [
9].
The development of EAI can be divided into three key eras, as illustrated in
Figure 1. The core paradigm of the embryonic stage (pre-2016) was the exploration of miniaturizing cloud AI, marked by early model-compression and compact-architecture efforts such as SqueezeNet. The mobile AI explosion period (2017–2019) saw the maturation of the MobileNet series and TensorFlow Lite (TFLite), which enabled a large-scale shift of AI’s focus from the cloud to the edge and made real-time on-device inference mainstream. The TinyML era (2020–present) has seen the ultimate downscaling of AI capabilities to microcontrollers, driven by technologies such as AI-accelerated microcontrollers and lightweight Transformers, giving rise to always-on, microwatt-level intelligence [
10,
11]. In the future, EAI is heading towards the next paradigm shift, from static inference to dynamic, adaptive learning [
12,
13,
14].
EAI transfers the locus of AI computation from the cloud to the edge, embedding intelligent capabilities directly into resource-constrained devices. EAI empowers edge devices to make autonomous decisions, adapt to dynamic environments, and interact seamlessly with the physical world, all without a constant reliance on network connectivity [
15,
16,
17]. This transition has unlocked possibilities for a plethora of novel applications, including autonomous vehicles, smart sensors, wearable devices, industrial automation systems, and personalized healthcare solutions [
18,
19,
20]. The challenges entailed in designing and deploying efficient and robust EAI systems necessitate a holistic approach that considers hardware limitations, algorithmic complexity, and application-specific requirements.
Comprehensive surveys on the swiftly progressing EAI technology have also been conducted by researchers in recent years. Kodgire et al. [
21] explored the key trade-offs in EAI, especially the balance between computing efficiency and energy consumption. Their article surveys the application of EAI across different fields and discusses the core challenges it faces, emphasizing both the central trade-offs and the breadth of EAI applications. Elkhalik [
22] presents an application-driven review of AI in smart homes, examining in depth the opportunities and challenges of that specific scenario, such as resource limitations and data security. The systematic review by Peccia et al. [
23] centers on distributed inference of deep neural networks on embedded devices; it examines how complex inference tasks can be partitioned and deployed across multiple embedded devices, and accordingly concentrates on the related algorithms, deployment strategies, and challenges. Zhang et al. [
24] conducted a relatively comprehensive review of artificial intelligence technology in embedded systems. Their article focuses on the hardware platforms and core algorithms of EAI and analyzes key challenges such as resource limitations and data security in depth, but pays less attention to the software ecosystem and deployment process. Kotyal et al. [
25] reviewed the progress and challenges of AI applications from a broader perspective. They discussed the implementation of AI in various industries in a broad sense, but their core focus was not entirely on the specific constraints and technology stacks of embedded environments. Serpanos et al. [
26] outline a future roadmap for embedded AI from a macro perspective, with an emphasis on defining its scope, elucidating its vast potential in key application domains, and identifying the core challenges and opportunities that must be addressed to realize this vision, thereby providing guidance for future research and policymaking in Europe.
This survey significantly surpasses contemporaneous related works [
21,
22,
23,
24,
25,
26] in terms of comprehensiveness and systematic coverage, with a comparison of the core thematic scope presented in
Table 1. First, at the theoretical level, it introduces an innovative mathematical formalization, thereby establishing a rigorous theoretical foundation for EAI—an aspect absent in prior works that primarily relied on descriptive definitions. Second, at the technology stack level, it is the only survey that systematically covers the entire ecosystem from hardware platforms to software frameworks, complemented with detailed comparative tables that provide developers with a valuable “hardware-to-software” selection guide, whereas other reviews either focus narrowly on hardware or neglect the software ecosystem. Furthermore, in terms of core algorithmic analysis, this work not only covers conventional and classical models but also systematically presents the latest lightweight neural network architectures tailored for embedded environments, thereby demonstrating its cutting-edge orientation. Most notably, through clear deployment flowcharts and step-by-step breakdowns, it offers a thorough and practical exposition of “how to implement EAI,” providing an actionable end-to-end guideline that distinguishes it from prior works, which tend to remain at an abstract level of discussion. In summary, by virtue of its comprehensive knowledge chain, in-depth technical analysis, and strong practical guidance, this survey endeavors to serve as a highly valuable “one-stop” reference for both academic research and engineering practice in the field.
Table 1. Comparison of the core thematic scope of EAI comprehensive surveys.
Core Theme | Ours | Kodgire et al. [21] | Elkhalik [22] | Peccia et al. [23] | Zhang and Li [24] | Kotyal et al. [25] | Serpanos et al. [26] |
---|---|---|---|---|---|---|---|
Definition | ✓ | | | ✓ | | ✓ | |
Hardware Platforms | ✓ | | | | ✓ | | |
Software Frameworks and Tools | ✓ | | | | | | |
Algorithms (including lightweight models) | ✓ | | | ✓ | ✓ | ✓ | |
Deployment | ✓ | | | ✓ | | | |
Application | ✓ | ✓ | ✓ | | ✓ | ✓ | ✓ |
Challenges and Opportunities | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
The paper also integrates insights from multiple domains such as autonomous driving and smart homes, which are highlighted as key application areas for EAI [
22,
23]. Furthermore, the review proposes practical solutions for core challenges like resource constraints and data security—critical issues identified in the literature on EAI and its applications [
22,
24]. By synthesizing these diverse perspectives, this review not only enhances the analytical depth of the field but also provides a more detailed account of the opportunities within the EAI landscape, thereby offering readers a comprehensive guide to navigate this rapidly evolving domain [
24,
25,
26,
27,
28].
The remainder of this paper is organized as follows:
Section 2 systematically introduces the formal definition of EAI from both theoretical and mathematical perspectives, clarifying its core optimization objectives.
Section 3 provides an in-depth exploration of the diverse hardware platforms for EAI, ranging from microcontrollers to specialized AI chips, and presents a practical strategy for their selection.
Section 4 comprehensively reviews the software frameworks and toolchains that support EAI development, offering a valuable reference for developers in choosing the appropriate technology stack.
Section 5 details the core algorithms applicable to EAI, not only reviewing traditional machine learning and classic deep learning models but also placing a special emphasis on state-of-the-art lightweight neural network architectures.
Section 6, illustrated with a flowchart, breaks down the end-to-end deployment process—from model optimization to hardware integration—revealing the critical stages of transforming an algorithm into a product.
Section 7 vividly showcases the widespread applications of EAI in fields such as autonomous driving and the Internet of Things through a rich set of case studies.
Section 8 provides a profound analysis of the core challenges currently facing EAI and looks ahead to its future opportunities in algorithms, hardware, and applications. Finally,
Section 9 concludes the paper.
2. Definition of EAI
EAI can be defined as the deployment and execution of AI algorithms on embedded systems [
16,
17,
18]. Embedded systems are specialized computer systems meticulously designed for specific tasks. They are typically characterized by resource constraints, including limited processing power, finite memory capacity, and stringent energy-consumption budgets [
29,
30]. This definition necessitates achieving high AI performance within the operational constraints of the embedded environment, which requires a delicate balance.
Mathematically, we can formulate the objective of EAI as optimizing the performance of an AI model under resource constraints. Let $M$ denote a candidate model, $f(M)$ its performance, and $R(M)$ its resource consumption; the EAI objective can then be written as

$$\max_{M} \; f(M) \qquad \text{subject to} \qquad R(M) \le R_{\max},$$

where the objective function $f(M)$ represents the model performance metric to be maximized (e.g., accuracy, F1-score), and the constraint $R(M) \le R_{\max}$ requires that all resource consumption of the model (e.g., computational complexity, memory, energy consumption) does not exceed the limitations of the hardware platform. This formulation transforms an abstract design philosophy into a measurable and executable constrained optimization problem. This mathematical framework not only reveals the inherent trade-off between performance and resources that is unavoidable in the EAI domain but also provides a unified theoretical guide and evaluation standard for all optimization techniques, such as model compression and hardware-aware Neural Architecture Search (NAS).
The core distinction from traditional AI optimization lies in a fundamental shift in mindset: (1) Traditional AI optimization pursues unconstrained performance maximization. Its central question is, “How powerful a model can we build?” where resources are typically considered scalable; (2) EAI optimization, in contrast, seeks a Pareto-optimal solution within a fixed, insurmountable hardware “box.” Its central question is, “What is the best performance we can achieve under the given strict constraints?”
Specifically, the resources consumed by a model can be represented as a vector, such as

$$R(M) = \big( C(M), \; \mathrm{Mem}(M), \; E(M) \big),$$

where $C(M)$ denotes the model’s computational complexity (e.g., measured in floating-point operations, FLOPs), $\mathrm{Mem}(M)$ represents its memory footprint (e.g., in megabytes), and $E(M)$ is its energy consumption per inference.
Correspondingly, the hardware platform imposes maximum allowable values for these resources, which can be denoted as

$$R_{\max} = \big( C_{\max}, \; \mathrm{Mem}_{\max}, \; E_{\max} \big),$$

where $C_{\max}$, $\mathrm{Mem}_{\max}$, and $E_{\max}$ are the maximum allowable values imposed by the hardware on computational complexity (FLOPs), memory, and energy consumption, respectively. If it is necessary to simultaneously optimize performance and resources (e.g., Pareto optimization), the problem can be extended to a multi-objective form:

$$\max_{M} \; \big( f(M), \; -C(M), \; -\mathrm{Mem}(M), \; -E(M) \big) \qquad \text{subject to} \qquad R(M) \le R_{\max}.$$
In this case, it is necessary to strike a balance between performance improvement and resource reduction in order to seek Pareto-optimal solutions. For example, a common small-scale case study involves the joint optimization of pruning and quantization for a specific neural network on a target embedded device. By exploring different pruning rates (e.g., removing 30%, 50%, or 70% of redundant weights) and quantization strategies (e.g., 8-bit or 4-bit integer quantization), a series of models with varying levels of accuracy and resource costs (such as model size or inference latency) can be generated. When these models are plotted on a two-dimensional graph with accuracy on the Y-axis and model size on the X-axis, the Pareto frontier emerges as a curve consisting of optimal models. Any model on this curve is considered “Pareto-optimal,” meaning that no other configuration can achieve higher accuracy with the same or smaller model size. Developers can then select a model point along this frontier that maximizes performance under the strict resource constraints of their application. This principle underpins many automated model compression and hardware-aware NAS frameworks [
31].
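To make this concrete, the following minimal Python sketch (with purely hypothetical accuracy and model-size numbers, not measured results) filters a set of candidate pruning/quantization configurations down to the Pareto-optimal ones:

```python
# Hypothetical (accuracy, model-size) results for pruning/quantization variants
# of one network; the numbers are illustrative only, not measurements.
candidates = [
    {"name": "prune30_int8", "accuracy": 0.712, "size_mb": 2.1},
    {"name": "prune50_int8", "accuracy": 0.698, "size_mb": 1.4},
    {"name": "prune70_int8", "accuracy": 0.651, "size_mb": 0.9},
    {"name": "prune30_int4", "accuracy": 0.684, "size_mb": 1.2},
    {"name": "prune50_int4", "accuracy": 0.655, "size_mb": 0.8},
]

def pareto_frontier(models):
    """Keep only models for which no other model is both smaller and more accurate."""
    frontier = []
    for m in models:
        dominated = any(
            o is not m and o["size_mb"] <= m["size_mb"] and o["accuracy"] >= m["accuracy"]
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m["size_mb"])

for m in pareto_frontier(candidates):
    print(f'{m["name"]}: {m["accuracy"]:.3f} top-1, {m["size_mb"]} MB')
```

A developer would then pick the point on this frontier that satisfies the hardware budget, exactly as described above.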
This framework provides a mathematical foundation for model compression (e.g., pruning, quantization), NAS, and deployment in EAI. This optimization problem highlights the core challenge of EAI: maximizing AI performance while adhering to the inherent resource limitations of embedded systems. Unlike traditional AI systems residing in cloud servers or powerful workstations [
4,
5], EAI operates directly on the target device, enabling real-time inference and decision-making. Its goal is to achieve a synergy between AI capabilities and hardware constraints, ensuring that the deployed algorithms are efficient, accurate, and robust within the operational environment [
16,
17].
3. Hardware Platforms of EAI
To solve the core optimization problem articulated in the preceding section—namely, maximizing AI performance subject to stringent hardware limitations—it is imperative to first gain a deep understanding of the physical basis for these constraints: the embedded hardware platforms. Spanning a wide spectrum from low-power microcontrollers to high-performance AI Systems-on-Chip (SoC), the choice of platform necessitates a meticulous trade-off among computational power, power consumption, cost, and the complexity of the development ecosystem. This section systematically reviews the major EAI hardware platforms and puts forward a practical strategy for their selection [
32].
3.1. Major Hardware Platforms of EAI
Current mainstream EAI hardware platforms can be categorized into six types based on their chip architecture: Microcontroller Unit (MCU), Microprocessor Unit (MPU), Graphics Processing Unit (GPU), Field-Programmable Gate Array (FPGA), Application-Specific Integrated Circuit (ASIC), and AI System-on-Chip (AI SoC).
Table 2 and
Table 3 list mainstream EAI hardware platforms and compare them across multiple dimensions to facilitate assessment. These key comparison dimensions include the specific chip model, manufacturer, architecture, core processor type, AI accelerator, achievable computational power, runtime power consumption, and other key features. These factors collectively determine a specific hardware platform’s suitability and competitiveness in specific EAI applications.
3.1.1. MCU
MCUs are distinguished by their extremely low power consumption, low cost, small form factor, and robust real-time hardware characteristics [
33]. They typically integrate Central Processing Unit (CPU) (such as the ARM Cortex-M series [
34]), limited Random Access Memory (RAM), and Flash memory. In terms of AI capabilities, the constrained hardware resources of MCUs render them suitable for executing lightweight AI models, such as simple keyword spotting, sensor data anomaly detection, and basic gesture recognition [
35,
36]. Their hardware architecture is often optimized to support these AI applications, and many contemporary MCUs integrate hardware features conducive to AI computation [
33,
37]. Typical hardware representatives in the MCU domain include: STM32F series (STMicroelectronics, Geneva, Switzerland), which, with its extensive product line and integrated hardware resources, can execute optimized neural network models [
38]; i.MX RT series (NXP Semiconductors N.V., Eindhoven, Netherlands), which are crossover MCUs combining the real-time control characteristics of traditional MCUs with the application-processing capabilities approaching those of MPUs, thereby providing a hardware foundation for more demanding EAI tasks [
39]; and the ESP32 series (Espressif Systems, Shanghai, China), a low-cost MCU family with integrated Wi-Fi and Bluetooth whose hardware design also provides acceleration support for digital signal processing and basic neural network operations, often leveraged through libraries such as ESP-DL [
40].
3.1.2. MPU
MPUs are central to general-purpose computing in embedded systems, typically incorporating CPU cores based on architectures such as ARM [
41]. These cores excel at executing complex logical control and sequential tasks and are increasingly undertaking computational responsibilities for edge AI [
33,
42]. ARM-based MPUs are well-suited for scenarios where AI tasks coexist with substantial non-AI-related computations, or where stringent real-time AI requirements are not paramount, such as in smart home control centers and low-power edge computing nodes [
42]. Modern high-performance MPUs featuring ARM CPU cores (e.g., the STM32-MP13x series [
43]) often employ multi-core architectures and widely support the ARM NEON™ Single Instruction, Multiple Data (SIMD) instruction-set extension [
38]. This NEON technology can significantly accelerate common vector and matrix operations prevalent in AI algorithms, thereby enhancing the computational efficiency of certain AI models [
33,
44]. In many resource-constrained embedded systems, developers often directly leverage the existing computational resources of the MPU’s CPU cores, augmented by NEON capabilities, to efficiently execute lightweight AI models [
35]. For instance, in many ARM-based MPUs or SoCs, including the MT7986A with Cortex-A53 processor (MediaTek Inc., Hsinchu City, Taiwan) [
45] and Raspberry Pi 5 with Cortex-A76 processor (Raspberry Pi Foundation, Cambridge, United Kingdom) [
46], the CPU cores handle the primary computational tasks. Even without dedicated AI accelerators, integrated technologies like NEON often provide sufficient performance for a variety of lightweight EAI applications [
44].
Table 2. Mainstream MCU/MPU hardware platforms of EAI.
Platform Model 1 | Manufacturer | Architecture | Processor | AI Accelerator | Computational Power | Power Consumption 2 | Other Key Features |
---|---|---|---|---|---|---|---|
STM32F [38] | STMicroelectronics, Geneva, Switzerland | MCU | Cortex-M0/M3/M4/ M7/M33 etc. | ART Accelerator in some models | Tens to hundreds of DMIPS 3, several GOPS 4 in some models (with accelerator) | Low power (mA-level op., µA-level standby) | Rich peripherals, CubeMX ecosystem, Security features (integrated in some models) |
i.MX RT [39] | NXP Semiconductors N.V., Eindhoven, Netherlands | MCU | Cortex-M7, Cortex-M33 | 2D GPU (PXP), NPU (e.g., RT106F, RT117F) in some models | Hundreds to thousands of DMIPS, NPU can reach 1–2 TOPS | Medium power, performance-oriented | High-performance real-time processing, interface control, EdgeLock security subsystem |
ESP32-S3 [40] | Espressif Systems (Shanghai) Co., Ltd., Shanghai, China. | MCU | Xtensa LX6/LX7 (dual-core), RISC-V | Supports AI Vector Instructions | CPU: Up to 960 DMIPS; AI: Approx. 0.3–0.4 TOPS (8-bit integer, INT8 ) | Low power (higher when Wi-Fi/BT active) | Integrated Wi-Fi and Bluetooth, Open-source SDK (ESP-IDF), Active community, High cost-performance |
STM32-MP13x [41,42,43,44] | STMicroelectronics | MPU | ARM Cortex-A7, Cortex-M4 | No dedicated GPU; AI runs via CPU/CMSIS-NN | A7: Max ∼2000 DMIPS; M4: Max ∼200 DMIPS | <1 W | Entry-level Linux or RTOS system design |
MT7986A (Filogic 830) [45] | MediaTek Inc., Hsinchu City, Taiwan. | MPU | ARM Cortex-A53 | Hardware Network Acceleration Engine, AI Engine (approx. 0.5 TOPS) | CPU: Up to ∼16,000 DMIPS | ∼7.5 W | High-speed network interface |
Raspberry Pi 5 [46] | Raspberry Pi Ltd, Cambridge, United Kingdom. | MPU | Broadcom BCM2712 (quad-core ARM Cortex-A76) | VideoCore VII GPU | ∼60–80 GFLOPS 5 | Passive Cooling; 2.55 W (Idle); 6.66 W (Stress) 6 | Graphics performance and general-purpose computing capability |
3.1.3. GPU
GPUs possess massively parallel processing capabilities, excelling at handling matrix operations, and are suitable for accelerating deep learning algorithms [
33,
47]. They are primarily used in AI applications requiring high-performance computing and parallel processing, such as autonomous driving, video analytics, and advanced robotics [
48,
49]. Mainstream GPU vendors include NVIDIA, AMD, and ARM (Mali). NVIDIA’s GPUs are widely used in embedded systems, such as the Jetson series (NVIDIA, Santa Clara, CA, USA) [
50,
51]; their CUDA and TensorRT toolchains facilitate the development and deployment of AI applications, widely used in robotics, drones, and other fields [
52]. Through programming frameworks like CUDA or OpenCL, developers can conveniently leverage GPUs to accelerate AI model training and inference [
33,
53]. Mobile SoCs like Qualcomm’s Snapdragon series (featuring Adreno GPUs) (Qualcomm, San Diego, CA, USA) [
54] and MediaTek’s Dimensity series (often featuring ARM Mali or custom GPUs) [
55] also integrate powerful GPUs to accelerate AI applications on smartphones.
3.1.4. FPGA
The FPGA is a programmable hardware platform where hardware accelerators can be customized as needed, making it suitable for scenarios requiring customized hardware acceleration and strict latency requirements, such as high-speed data acquisition, real-time image processing, and communication systems [
56,
57]. It offers a high degree of flexibility and reconfigurability, enabling optimized data paths and parallel computation. Products such as the Zynq 7000 SoC family (AMD, Santa Clara, CA, USA) [
58] and the Arria 10 SoC FPGAs (Intel, Santa Clara, CA, USA) [
59] integrate FPGA’s programmable logic with hard-core processors, such as Cortex-A9 (Arm, Cambridge, United Kingdom), on the same chip. This makes them suitable for applications demanding both high-performance processing from the CPU and flexible, high-throughput hardware-acceleration capabilities from the FPGA fabric.
3.1.5. ASIC
ASICs are specialized chips meticulously designed for particular applications, offering the highest performance and lowest power consumption for their designated tasks [
33,
60]. The hardware logic of an ASIC is fixed during the manufacturing process, and its functionality cannot be modified once produced. Consequently, the design and manufacturing (Non-Recurring Engineering, NRE) costs for ASICs are typically high, and they offer limited flexibility compared to programmable solutions [
60]. Google’s Tensor Processing Units (TPUs), such as the edge TPU integrated in the Tensor G3 SoC (Google LLC, Mountain View, CA, USA), are prime examples of ASICs, specifically engineered to accelerate the training and inference of machine learning (ML) models, particularly within the TensorFlow framework [
61]. Atlas series AI processors (Huawei, Shenzhen, China), which employ the Da Vinci architecture, also fall into the ASIC category, designed for a range of AI workloads [
62].
3.1.6. AI SoC
The AI SoC refers to a system-on-chip that integrates hardware-acceleration units specifically designed for AI computation, such as Neural Processing Unit (NPU) or, as sometimes termed by vendors, Knowledge Processing Unit (KPU) [
33,
63]. These integrated AI-acceleration units often employ efficient parallel computing architectures, such as Systolic Arrays [
64], and are specifically optimized to perform core operations in deep learning models, including matrix multiplication and convolution [
33]. This architectural specialization enables their widespread application in various EAI scenarios, including image recognition, speech processing, and natural language understanding, playing a crucial role particularly in edge computing devices where power consumption and cost are stringent requirements [
63,
65]. For example, the RV series SoCs (e.g., RV1126/RV1109) (Rockchip, Fuzhou, China) [
66] and K230 (Kendryte, Singapore) [
67] are SoCs that integrate NPUs or KPUs to accelerate diverse edge AI computing tasks. By highly integrating modules such as CPUs, GPUs, Image Signal Processors (ISPs), and dedicated NPUs/KPUs onto a single chip, AI SoCs not only enhance AI computational efficiency but also reduce system power consumption, cost, and physical size, making them a crucial hardware foundation for promoting the popularization and implementation of AI technology across various industries [
33,
63].
Table 3. Mainstream GPU/FPGA/ASIC/AI SoC hardware platforms of EAI.
Platform Model | Manufacturer | Architecture | Processor | AI Accelerator | Computational Power | Power Consumption | Other Key Features |
---|---|---|---|---|---|---|---|
Jetson Orin NX [50] 1 | NVIDIA Corporation, Santa Clara, CA, USA. | GPU | ARM Cortex-A78AE | NVIDIA Ampere GPU, Tensor Cores | 100.0 TOPS (16 GB ver.), 70.0 TOPS (8 GB ver.) | 10–25 W | CUDA, TensorRT |
Jetson Nano [51] | NVIDIA Corp. | GPU | ARM Cortex-A57, NVIDIA Maxwell GPU | NVIDIA Maxwell GPU | 0.472 TFLOPs (16-bit floating-point, FP16)/5 TOPS (INT8, sparse) | 5–10 W (typically) | CUDA, TensorRT |
Zynq 7000 [58] | AMD (Xilinx), Santa Clara, CA, USA | FPGA | Dual-core Arm Cortex-A9 MPCore | Programmable Logic (PL) for custom accelerators | PS (A9): Max ∼5000 DMIPS | Low power (depends on design and load) | Highly customizable hardware, Real-time processing capability |
Arria 10 [59] | Intel Corp., Santa Clara, CA, USA | FPGA | Dual-core Arm Cortex-A9 (SoC versions) | FPGA fabric for custom accelerators | SoC (A9): Max ∼7500 DMIPS | Low power (depends on design and load) | Enhanced FPGA and Digital Signal Processing (DSP) capabilities |
iCE40HX1K [68] | Lattice Semiconductor, Hillsboro, OR, USA. | FPGA | iCE40 LM (FPGA family) | FPGA fabric for custom accelerators | 1280 Logic Cells | Very low power | Small form factor, Low power consumption |
Smart Fusion2 [69] | Microchip Technology Inc., Chandler, AZ, USA. | FPGA | ARM Cortex-M3 | FPGA fabric for custom accelerators | MCU (M3): ∼200 DMIPS | Low power | Secure, high-performance for comms, industrial interface, automotive markets |
Google Tensor G3 [61] | Google Inc., Mountain View, CA, USA. | ASIC | 1× Cortex-X3, 4× Cortex-A715, 4× Cortex-A510 | Mali-G715 GPU | N/A (High-end mobile SoC performance) | N/A (Mobile SoC) | Emphasizes AI functions and image processing |
Coral USB Accelerator [70] | Google Inc. | ASIC | ARM 32-bit Cortex-M0+ Microprocessor (controller) | Edge TPU | 4.0 TOPS (INT8) | 2.0 W | USB interface, Plug-and-play |
MLU220-SOM [71] | Cambricon Technologies, Beijing, China. | ASIC | 4× ARM Cortex-A55 | Cambricon MLU (Memory Logic Unit) NPU | 16.0 TOPS (INT8) | 15 W | Edge intelligent SoC module |
Atlas 200I DK A2 [62,72] | Huawei Technologies Co., Ltd., Shenzhen, China. | ASIC | 4 core @ 1.0 GHz | Ascend AI Processor | 8.0 TOPS (INT8), 4.0 TOPS (FP16) | 24 W | Software and hardware development kit |
RK3588 [66] | Fuzhou Rockchip Electronics Co., Ltd. (Rockchip), Fuzhou, China. | AI SoC | 4× Cortex-A76 + 4× Cortex-A55 | ARM Mali-G610 MC4, NPU | NPU: 6.0 TOPS | ∼8.04 W | Supports multi-camera input |
RV1126 [66] | Rockchip | AI SoC | ARM Cortex-A7, RISC-V MCU | NPU | 2.0 TOPS (NPU) | 1.5 W | Low power, Optimized for visual processing |
K230 [67] | Kendryte (Canaan Inc.), Singapore. | AI SoC | 2× C908 (RISC-V), RISC-V MCU | NPU | 1.0 TOPS (NPU) | 2.0 W | Low power, Supports TinyML |
AX650N [73] | AXERA (Axera Tech), Ningbo, China. | AI SoC | ARM Cortex-A55 | NPU | 72.0 TOPS (INT4), 18.0 TOPS (INT8) | 8.0 W | Video encoding/decoding, Image processing |
RDK X3 Module [74] | Horizon Robotics, Beijing, China. | AI SoC | ARM Cortex-A53 (Application Processors) | Dual-core Bernoulli BPU | 5.0 TOPS | 10 W | Optimized for visual processing |
In the autonomous driving sector, NPUs can process multi-channel camera streams in real-time with extremely low latency to perform complex tasks like object detection and semantic segmentation—a feat that is challenging for general-purpose CPUs or GPUs to match. In smart security, NPUs empower edge cameras with high-accuracy facial recognition and behavior-analysis capabilities, significantly reducing bandwidth and computational demands on back-end servers. Similarly, in consumer electronics and smart home devices, the proliferation of low-power NPUs has made functions like offline voice recognition and intelligent image enhancement possible, which not only improves responsiveness but also fundamentally protects user data privacy. Therefore, NPUs/KPUs are not merely hardware accelerators; they are strategic components that define the core competitiveness of the next generation of intelligent embedded products.
3.2. Hardware Platform Selection Strategy
When selecting an EAI hardware platform, it is necessary to comprehensively consider application requirements, algorithm complexity, and development cost and cycle, and to ultimately determine the most suitable solution through prototype verification and performance testing [
33,
75].
The selection strategy for EAI hardware platforms is shown in
Figure 2. Firstly, application requirements need to be clarified. This includes specific requirements for computational power, power consumption, latency, cost, security, and reliability [
76]. For example, autonomous driving has extreme demands for high computational power and low latency, while smart homes focus more on low power consumption and cost [
77]. Next, algorithm complexity needs to be evaluated. This involves the model type, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers, as well as the model size (MB) and computational complexity (FLOPs) [
78]. Lightweight models can run on CPUs, while complex models require GPU, FPGA, or ASIC acceleration [
79]. Concurrently, development cost and cycle must be considered. GPUs and AI chips have more mature development tools and community support, which can shorten the development cycle, whereas FPGAs and ASICs require more hardware design and verification work and involve considerations such as development tools, community support, software ecosystem, and Internet Protocol (IP) licensing fees [
80]. Finally, through prototype verification and performance testing, evaluate the hardware platform’s performance in specific application scenarios using benchmarks, real-world tests, power and latency tests, stress tests, and performance-analysis tools, and continuously optimize [
81,
82]. For instance, Huawei’s Atlas 200 DK can be used for prototype verification, or hardware performance demands can be reduced through model quantization and pruning techniques [
78,
83].
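As a rough illustration of this selection flow, the following Python sketch condenses the above criteria into a first-pass decision function; the platform classes and numeric thresholds are illustrative assumptions rather than fixed rules:

```python
def suggest_platform(power_budget_w, model_gflops, needs_custom_datapath=False,
                     volume_is_high=False):
    """Very coarse first-pass platform suggestion; thresholds are illustrative only."""
    if needs_custom_datapath:
        # Fixed-function, high-volume designs may justify ASIC NRE; otherwise an FPGA.
        return "ASIC" if volume_is_high else "FPGA"
    if power_budget_w < 0.1 and model_gflops < 0.05:
        return "MCU (TinyML)"            # e.g., keyword spotting, anomaly detection
    if power_budget_w < 2 and model_gflops < 1:
        return "MPU or entry-level AI SoC (NPU)"
    if model_gflops < 20:
        return "AI SoC with NPU"
    return "Embedded GPU (e.g., Jetson-class module)"

print(suggest_platform(power_budget_w=0.05, model_gflops=0.01))  # wearable -> MCU
print(suggest_platform(power_budget_w=15, model_gflops=50))      # autonomous driving -> GPU
```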
To illustrate these selection principles in a real-world context, the design of wearable devices such as smartwatches or fitness trackers clearly demonstrates why an MCU is a far more suitable choice than an MPU. Firstly, from the perspective of application requirements, wearables demand extremely low power consumption for multi-day battery life and have strict constraints on form factor; an MCU’s power draw in active and sleep modes (often in the µA to mA range) is orders of magnitude lower than an MPU’s (typically in the hundreds of mA range). Secondly, concerning algorithmic complexity, the AI tasks on these devices are typically lightweight, such as keyword spotting or activity recognition. The computational power of modern MCUs, especially those with specialized instructions (e.g., ARM Cortex-M with Helium), is sufficient for these TinyML models, whereas an MPU’s high-performance cores would lead to significant underutilization and inefficiency. Furthermore, from the viewpoint of development cost and cycle, the highly integrated nature of MCUs (with CPU, memory, and peripherals on a single chip) greatly simplifies design, reduces costs, and shrinks product size, in stark contrast to MPU-based systems that require multiple external components. Finally, regarding real-time performance, wearables need instant responsiveness, and an MCU, running on an RTOS or bare-metal, can boot instantly and provide deterministic feedback, perfectly suiting tasks like health monitoring. Taken together, an MPU is clearly “overkill” in this scenario. The selection strategy, therefore, definitively guides the choice of an MCU as the optimal solution, as it meets all stringent constraints while providing sufficient performance for the target AI tasks.
5. Algorithms of EAI
With the hardware foundation and software toolchain established, our attention naturally turns to the heart of any EAI system: the AI algorithm itself. The algorithmic choice is a primary determinant of a system’s intelligence, its functional capabilities, and its intrinsic resource demands. For embedded systems, selecting a suitable algorithm necessitates a delicate balance not only of high accuracy but also of computational efficiency, memory footprint, and robustness. This section delves into the various algorithms applicable to EAI, paying special attention to lightweight models custom-designed for resource-constrained settings, and presents a comparative analysis of the experimental outcomes of classic algorithms.
5.1. Traditional ML Algorithms
These algorithms have relatively low computational complexity and are easy to deploy on resource-constrained embedded systems.
5.1.1. Decision Trees
Decision trees [
100] perform classification or regression through a series of if-then-else rules. They are easy to understand and implement, have low computational complexity, and are suitable for handling classification and regression problems, especially when the relationships between features are unclear. Decision trees can be used for sensor data analysis, determining device status based on sensor readings; they can also be used for fault diagnosis, determining the cause of failure based on fault symptoms.
5.1.2. Support Vector Machines
Support Vector Machines (SVM) [
101] perform classification by finding an optimal hyperplane in a high-dimensional space that maximizes the margin between samples of different classes. SVMs perform well on small datasets and generalize well, making them suitable for image recognition, text classification, and similar tasks. In embedded systems, SVMs can be used to recognize simple image patterns, such as handwritten digits or simple objects.
5.1.3. K-Nearest Neighbors
K-Nearest Neighbors (KNN) [
102] is a distance-based classification algorithm. For a given sample, the KNN algorithm finds the K training samples most similar to it and determines the class of the sample by voting based on the classes of these samples. The KNN algorithm is simple and easy to understand, but its computational complexity is relatively high, especially with high-dimensional data. Therefore, the KNN algorithm is typically suitable for processing low-dimensional data, such as sensor data classification.
5.1.4. Naive Bayes
Naive Bayes [
103] is a classification algorithm based on Bayes’ theorem, which assumes that features are mutually independent. The Naive Bayes algorithm is computationally fast, requires less training data, and is suitable for text classification, spam filtering, etc. In embedded systems, Naive Bayes can be used to determine environmental status based on sensor data, for example, to identify air quality.
5.1.5. Random Forests
Random Forests [
104] are an ensemble learning method that improves prediction accuracy by combining multiple decision trees. Random Forests resist overfitting well and can handle high-dimensional data. In embedded systems, Random Forests can be used for complex sensor data analysis, such as predicting the remaining useful life of equipment.
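As a hedged illustration of how such traditional models are typically prototyped before being ported to an embedded target, the following scikit-learn sketch trains a small Random Forest on synthetic accelerometer features; in practice, the fitted model can then be converted to plain C for an MCU with third-party tools such as m2cgen, depending on the project toolchain:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-axis accelerometer windows: mean and variance per axis (6 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 1] + 0.5 * X[:, 4] > 0).astype(int)   # toy "activity" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# A small, shallow forest keeps the memory footprint modest for an MCU port.
clf = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```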
5.2. Typical Deep Learning Algorithms
Deep learning algorithms, leveraging their powerful feature-extraction and pattern-recognition capabilities, have achieved breakthroughs in multiple fields. This section reviews several landmark deep learning algorithms.
5.2.1. CNN
CNNs are the cornerstone of computer vision tasks, excelling particularly in areas such as image recognition, object detection, and video analysis. Their core idea is to automatically learn spatial hierarchical features of images through convolutional layers and reduce feature dimensionality through pooling layers, enhancing the model’s translation invariance [
2]. Classic CNN architectures like LeNet-5 [
105] laid the foundation for subsequent deeper and more complex networks (such as AlexNet [
5], VGG [
106], ResNet [
107], and Inception [
108]). Although CNNs are highly effective, they typically contain a large number of parameters and computations, posing severe challenges for direct deployment on resource-constrained embedded devices.
5.2.2. RNN
RNNs and their variants are designed to process sequential data and have widespread applications in natural language processing (e.g., machine translation, text generation), speech recognition, and time series prediction. RNNs capture temporal dependencies in sequences through their internal recurrent structure. However, original RNNs are prone to vanishing or exploding gradient problems, limiting their ability to process long sequences. Long Short-Term Memory (LSTM) networks [
109] and Gated Recurrent Units (GRU) [
110] effectively alleviate these issues by introducing gating mechanisms, becoming standard models for processing sequential data. Similar to CNNs, complex RNN models, especially those containing multiple stacked LSTM/GRU layers, can also have computational and memory requirements that exceed the capacity of embedded devices.
5.2.3. Continual Learning
Beyond static pre-trained models, an important algorithmic paradigm of EAI is On-Device Continual Learning. The objective of this framework is to enable embedded devices to incrementally learn from continuous data streams, thereby adapting to dynamic environments or providing personalized services without requiring large-scale retraining from scratch. The core algorithmic challenge lies in addressing the issue of catastrophic forgetting, where a model forgets previously acquired knowledge while learning new tasks. To tackle this challenge under the resource constraints of embedded devices, related methods often employ parameter-efficient updates (modifying only a small subset of parameters) or data-efficient rehearsal (storing only a tiny set of representative past samples) [
111,
112,
113], thereby striking a balance between acquiring new capabilities (plasticity) and retaining old knowledge (stability).
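A minimal sketch of the data-efficient rehearsal idea, assuming a fixed-capacity buffer maintained by reservoir sampling (actual published methods differ considerably in their details):

```python
import random

class RehearsalBuffer:
    """Fixed-capacity replay buffer using reservoir sampling (illustrative sketch)."""
    def __init__(self, capacity=200):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0

    def add(self, sample):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        else:
            j = random.randrange(self.seen)   # keep each sample with prob. capacity/seen
            if j < self.capacity:
                self.buffer[j] = sample

    def replay_batch(self, k=16):
        return random.sample(self.buffer, min(k, len(self.buffer)))

# During on-device training, each new-task batch is mixed with a small replay batch
# so that parameter updates do not overwrite previously learned behaviour.
```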
5.2.4. Federated Edge AI (FEAI)
Federated Edge AI (FEAI) represents a privacy-preserving distributed learning framework in which models are collaboratively trained across a large number of edge devices without the need to centralize users’ raw data. Within this framework, a central server only distributes the global model and aggregates updates uploaded from individual devices, while the actual training computation takes place locally on the devices. The primary algorithmic challenges include handling the non-independent and identically distributed (Non-IID) nature of data across devices and designing memory- and computation-efficient training algorithms for device-side execution [
114]. Recent research efforts have focused on addressing these challenges to ensure that even the most resource-constrained microcontrollers can effectively participate in federated learning, thereby enabling collaborative construction of more powerful models while preserving user privacy [
115].
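The central aggregation step of such a scheme can be sketched as a dataset-size-weighted average of client parameters, in the spirit of FedAvg; this simplified illustration ignores compression, secure aggregation, and straggler handling:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weight each client's parameters by its local dataset size (FedAvg-style)."""
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Two toy clients, each holding a one-layer "model" (a single weight matrix).
w_a = [np.array([[1.0, 2.0]])]
w_b = [np.array([[3.0, 4.0]])]
global_model = federated_average([w_a, w_b], client_sizes=[100, 300])
print(global_model[0])   # -> [[2.5 3.5]]
```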
5.2.5. Transformer Networks
Transformer Networks were initially designed for natural language-processing tasks (particularly machine translation). Their core is the Self-Attention Mechanism, which can process all elements in a sequence in parallel and capture long-range dependencies, significantly outperforming traditional RNNs [
116]. The success of Transformers quickly extended to other domains; for example, Vision Transformer (ViT) [
117] applied it to image recognition, achieving performance comparable to or even better than CNNs. However, Transformer models typically have an extremely large number of parameters and high computational complexity (especially the quadratic complexity of the self-attention mechanism), making their direct deployment on embedded devices a significant challenge.
To address this challenge, researchers have proposed several key innovations for developing tiny Transformer models:
Hybrid Architectures: An effective approach is to design hybrid models that combine the respective strengths of CNNs and Transformers. For instance, Mobile Vision Transformer (MobileViT) [
118] leverages standard convolutional layers to efficiently capture local features, which are then fed into a lightweight Transformer module to model long-range global dependencies.
Complexity Reduction: This strategy aims to reduce the computational and memory demands of the model. One method is to apply compression techniques such as knowledge distillation, where knowledge is transferred from a large pre-trained Transformer (the “teacher model”) to a much smaller student Transformer, enabling the student model to achieve high performance with significantly fewer parameters [
119]. Another method directly modifies the self-attention mechanism itself, employing techniques like linear attention, sparse attention, or kernel-based approximations to reduce the computational complexity from quadratic to linear, thereby making it more suitable for on-device deployment [
120].
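As a hedged illustration of the linear-attention idea, the following sketch follows one published kernel feature-map formulation (using φ(x) = elu(x) + 1) and omits masking and multi-head details:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """O(N) attention via kernel feature maps; q, k, v have shape (batch, seq_len, dim)."""
    q = F.elu(q) + 1          # positive feature map phi(.)
    k = F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)                    # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

x = torch.randn(2, 128, 64)
out = linear_attention(x, x, x)   # cost grows linearly with the 128-token sequence
print(out.shape)                  # torch.Size([2, 128, 64])
```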
While these aforementioned typical deep learning algorithms have achieved great success in their respective fields, they are generally designed for servers or workstations with powerful computational capabilities and ample memory. Their common characteristic is a large model scale and intensive computation. Therefore, to endow edge devices with these powerful AI capabilities, deployment on embedded devices requires model-compression and -optimization techniques such as Quantization [
78,
83], Pruning [
78,
121], and Knowledge Distillation [
122].
5.3. Lightweight Neural Network Architectures
To address the resource limitations of embedded devices, researchers have designed a series of lightweight neural network architectures. These architectures, through innovative network structure designs and efficient computational units, significantly reduce model parameter count and computational complexity while maintaining high performance.
5.3.1. MobileNet Series
MobileNetV1 [
123]: Proposed by Google, its core innovation is Depthwise Separable Convolutions, which decompose standard convolutions into Depthwise Convolution and Pointwise Convolution, drastically reducing computation and parameters (a minimal code sketch of this operation follows the series overview below).
MobileNetV2 [
124]: Built upon V1 by introducing Inverted Residuals and Linear Bottlenecks, further improving model efficiency and accuracy.
MobileNetV3 [
125]: Combined NAS technology to automatically optimize network structure and introduced the h-swish activation function and updated Squeeze-and-Excitation (SE) modules, achieving better performance under different computational resource constraints.
MobileNetV4 [
126]: builds upon the success of the previous MobileNet series by introducing novel architectural designs, such as the Universal Inverted Bottleneck (UIB) block and Mobile Multi-Head Query Attention (MQA), to further enhance the model’s performance under various latency constraints. MobileNetV4 achieves a leading balance of accuracy and efficiency across a wide range of mobile ecosystem hardware, covering application scenarios from low latency to high accuracy.
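As referenced above, the depthwise separable convolution introduced by MobileNetV1 can be sketched in a few lines of PyTorch; the channel sizes are illustrative:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

block = DepthwiseSeparableConv(32, 64)
params_std = 32 * 64 * 3 * 3        # standard 3x3 conv weights: 18,432
params_dsc = 32 * 3 * 3 + 32 * 64   # depthwise + pointwise conv weights: 2,336
print(sum(p.numel() for p in block.parameters()), params_std, params_dsc)
```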
5.3.2. ShuffleNet Series
ShuffleNetV1 [
127]: Proposed by Megvii Technology, designed for devices with extremely low computational resources. Its core elements are Pointwise Group Convolution and Channel Shuffle operations, the latter promoting information exchange between different groups of features and enhancing model performance.
ShuffleNetV2 [
128]: Further analyzed factors affecting actual model speed (such as memory access cost) and proposed better design criteria, resulting in network structures that perform better in both speed and accuracy.
5.3.3. SqueezeNet
SqueezeNet [
129] was proposed by DeepScale (later acquired by Tesla) and Stanford University, among others, with the aim of significantly reducing model parameters while maintaining AlexNet-level accuracy. Its core is the Fire module, which comprises a squeeze convolution layer (1 × 1 kernels that reduce the number of channels) and an expand convolution layer (1 × 1 and 3 × 3 kernels that increase it).
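A minimal PyTorch sketch of the Fire module’s structure; the channel sizes follow the pattern of an early SqueezeNet layer and are illustrative:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet-style Fire module: 1x1 squeeze, then parallel 1x1/3x3 expand branches."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.squeeze(x))
        return torch.cat([self.act(self.expand1x1(x)),
                          self.act(self.expand3x3(x))], dim=1)

out = Fire(96, 16, 64, 64)(torch.randn(1, 96, 55, 55))
print(out.shape)   # torch.Size([1, 128, 55, 55])
```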
5.3.4. EfficientNet Series
EfficientNetV1 [
130]: Proposed by Google, it uniformly scales network depth, width, and input image resolution using a Compound Scaling method. It also used neural architecture search to obtain an efficient baseline model B0, which was then scaled to create the B1-B7 series, achieving state-of-the-art accuracy and efficiency at the time.
EfficientNetV2 [
131]: Building on V1, it further optimized training speed and parameter efficiency by introducing Training-Aware NAS and Progressive Learning, and incorporated more efficient modules like Fused-MBConv.
5.3.5. GhostNet Series
GhostNetV1 [
132]: Proposed by Huawei Noah’s Ark Lab, it introduces the Ghost Module. This module generates intrinsic features using standard convolutions, then applies cheap linear operations to create additional “ghost” features, significantly reducing computation and parameters while maintaining accuracy.
GhostNetV2 [
133], also from Huawei Noah’s Ark Lab, enhances V1 by incorporating Decoupled Fully Connected Attention (DFCA). This allows Ghost Modules to efficiently capture long-range spatial information, boosting performance beyond V1’s local feature focus with minimal additional computational cost.
GhostNetV3 [
134] extends GhostNet’s efficiency ideas to ViTs. It applies Ghost Module principles—such as using cheap operations within ViT components—to create lightweight ViT variants with reduced complexity and parameter counts, making them suitable for edge deployment [
134].
5.3.6. MobileViT
MobileViT (Mobile Vision Transformer) [
118]: Proposed by Apple Inc. (Cupertino, CA, USA), it is one of the representative works successfully applying Transformer architecture to mobile vision tasks. It ingeniously combines convolutional modules from MobileNetV2 and self-attention modules from Transformers to design a lightweight ViT, achieving competitive performance on mobile devices.
The design of these lightweight networks provides feasible solutions for deploying advanced deep learning models on resource-constrained embedded devices, driving the rapid development of EAI.
5.4. Performance Comparison of Mainstream Lightweight Neural Networks
To gain a deeper understanding of the design philosophies and performance trade-offs of different lightweight architectures,
Table 6 summarizes the key performance metrics of mainstream lightweight neural networks, including the MobileNet series, ShuffleNet series, SqueezeNet, GhostNet, and MobileViT, on the standard image classification benchmark (ImageNet). This comparison highlights differences in Top-1 accuracy, parameters, computational cost (FLOPs), inference latency, and test hardware, offering a clear view of how each architecture balances efficiency and performance.
A comprehensive analysis of the table allows us to distill four key insights into the evolution of lightweight neural network architectures. First, the field has undergone a co-evolution of efficiency and accuracy: innovations such as depthwise separable convolutions and neural architecture search have enabled higher model performance under reduced resource consumption. Second, design principles have moved beyond merely minimizing theoretical computations (FLOPs) and entered a hardware-aware stage; as exemplified by ShuffleNetV2, accounting for practical hardware bottlenecks, such as memory access costs, has become critical for achieving real-world speed improvements. Third, the outstanding performance of MobileViT on dedicated NPUs highlights the power of algorithm–hardware co-design, demonstrating that tightly coupling model efficiency with target hardware is central to future performance breakthroughs. Finally, the overall evolution does not aim for a single optimal solution but rather constructs a Pareto frontier of performance, allowing developers to make informed trade-offs among accuracy, latency, and model size according to specific application requirements.
Table 6. Performance comparison of mainstream lightweight neural networks.
Model Architecture | Top-1 Accuracy (ImageNet) 1 | Params | FLOPs (G) 2 | Latency 3 | Test Hardware |
---|---|---|---|---|---|
SqueezeNet 4 [129] | 0.58 | 1.25 M | 0.82 | | |
MobileNetV1(1.0×) [123] | 0.71 | 4.2 M | 0.57 | 113 ms | Google Pixel 1 CPU |
MobileNetV2(1.0×) [124] | 0.72 | 3.5 M | 0.3 | 75 ms | Google Pixel 1 CPU |
MobileNetV3-Large (1.0×) [125] | 0.75 | 5.4 M | 0.22 | 51 ms | Google Pixel 1 CPU |
MobileNetV3-Small (1.0×) [125] | 0.67 | 2.5 M | 0.06 | 15 ms | Google Pixel 1 CPU |
ShuffleNetV1 (1.0×, g = 3) [127] | 0.69 | 3.4 M | 0.14 | 50 ms | Qualcomm Snapdragon 820 |
ShuffleNetV2 (1.0×) [128] | 0.69 | 2.3 M | 0.15 | 39 ms | Qualcomm Snapdragon 820 |
GhostNet (1.0×) [132] | 0.74 | 5.2 M | 0.14 | 74.5 ms | Qualcomm Snapdragon 625 |
MobileViT-XS [118] | 0.75 | 2.3 M | 0.7 | 7.28 ms | Apple A14 NPU (ANE) 5 |
6. Deployment of EAI
Following our discussion of the core components of EAI systems—hardware, software, and algorithms—we now address their integration through a systematic deployment process. This crucial phase, often unaddressed in existing surveys, is a complex engineering practice that extends beyond simple model conversion to include optimization, compilation, hardware integration, and iterative validation. This section details the end-to-end workflow for converting a model from a development environment into a stable application on a target embedded device.
As is shown in
Figure 3, a typical deployment workflow can be broken down into key stages: model training, model optimization and compression, model conversion and compilation, hardware integration and inference execution, and performance evaluation and validation.
6.1. Model Training
The starting point of the deployment process is obtaining an AI model that meets functional requirements. This is typically accomplished in environments with abundant computational resources (e.g., GPU clusters) and mature deep learning frameworks (e.g., TensorFlow [
135] or PyTorch [
86]). Developers utilize large-scale datasets to train the model, aiming to achieve high predictive accuracy or target metrics [
2]. The model produced at this stage is typically based on 32-bit floating-point (FP32) operations, possessing a large number of parameters and high computational complexity. While it performs well on servers, it often far exceeds the carrying capacity of typical embedded devices.
The core challenge in this stage is the "Train-Deploy Gap". A model trained on large-scale, high-quality datasets in a resource-rich server environment may not generalize well to the low-quality, noisy data collected by embedded devices in real-world, complex settings.
6.2. Model Optimization and Compression
This constitutes a critical step in adapting models to embedded environments. Due to the stringent constraints of embedded devices in terms of computational power, memory capacity, and power consumption, it is essential to optimize the originally trained models. The goal of this stage is to significantly reduce model size and computational complexity while preserving accuracy as much as possible. Common optimization techniques include Quantization, Pruning, and Knowledge Distillation.
To assess the independent impact of each technique, ablation studies are indispensable. It should be noted, however, that we do not need to re-conduct such experiments, as the effectiveness of these methods has already been thoroughly validated in numerous seminal works. The data presented in
Table 7 are extracted from these classic studies, with the aim of clearly illustrating the core contributions of each technique.
Quantization: Converts FP32 weights and/or activations into low-bitwidth representations (e.g., INT8, FP16), thereby exploiting more efficient integer or half-precision computational units [
83,
136,
137]. This method yields the most direct and significant improvement in inference latency. As demonstrated in ablation studies (e.g., Configuration 1), quantization can provide a 2–3× speedup on compatible hardware, at the cost of typically slight and controllable accuracy degradation (e.g., approximately 1.5% loss on ResNet-50).
Pruning: Removes redundant weights or structures (e.g., channels, filters) to produce sparse models with fewer parameters and lower computational requirements [
138]. Pruning techniques excel in maintaining accuracy, often achieving nearly lossless performance (e.g., Configuration 2 and Configuration 4, <0.5% accuracy drop) after careful fine-tuning. However, their practical impact on inference latency depends heavily on the type of pruning (structured vs. unstructured) and the degree of hardware support for sparse computation. In many cases, pruning primarily reduces the theoretical computational load (FLOPs), offering potential rather than guaranteed acceleration.
Knowledge Distillation: Trains a compact “student” model under the guidance of a larger “teacher” model, transferring knowledge to enhance the performance of the smaller network [
122,
139,
140]. Distillation does not directly alter inference latency; its core value lies in significantly boosting the upper bound of model accuracy. As shown in experiments (e.g., Configuration 3), distillation can even yield more than a 1.2% absolute accuracy gain for a strong baseline. This enables the deployment of a smaller and faster student model that, through distillation, can match or surpass the accuracy of its larger counterpart, thereby indirectly achieving low latency.
These optimization techniques can be applied independently or in combination, with the choice depending on the target hardware capabilities and the specific accuracy–latency requirements of the application. Ablation studies indicate that combined approaches often deliver the best results. For example, knowledge distillation can effectively compensate for the accuracy degradation caused by quantization and pruning (e.g., Configuration 5 [
140]), sometimes even leading to a “compressed yet stronger” model. A typical advanced optimization pipeline proceeds as follows: pruning is first applied to obtain an efficient compact architecture, knowledge distillation is then employed to restore and further enhance accuracy, and finally quantization is applied to achieve inference acceleration on the target hardware—thereby striking an optimal balance among accuracy, latency, and model size (e.g., Configuration 6 [
141]).
Table 7. Ablation studies of major model-compression techniques.
Config | Technique Combination | Model and Dataset | Baseline Acc. | Optimized Acc. | Acc. Loss/Gain | Speed Up |
---|---|---|---|---|---|---|
1 | Quantization Only [83] | ResNet-50/ImageNet | Top-1: 76.4% | Top-1: 74.9% | −1.5% | 4× (size), 2–3× (latency) |
2 | Pruning Only [138] | VGG-16/ImageNet | Top-5: 92.5% | Top-5: 92.1% | −0.4% | FLOPs reduced by 50.2% |
3 | Knowledge Distillation Only [139] | ResNet-34 (Student)/ImageNet | Top-1: 73.31% | Top-1: 74.58% | +1.27% | 1× (vs. student model) |
4 | Pruning (Automated) + Distillation (Implicit) [31] | MobileNetV1/ImageNet | Top-1: 70.9% | Top-1: 70.7% | −0.2% | FLOPs reduced by 50% (2× speedup) |
5 | Quantization + Knowledge Distillation [140] | ResNet-18 (Student)/CIFAR-100 | Top-1: 77.2% | Top-1: 78.4% | +1.2% (vs. student; surpasses FP32 teacher) | 4-bit quantization |
6 | Pruning + Quantization + (Self-)Distillation [141] | OFA (MobileNetV3)/ImageNet | — | Top-1: 79.9% | — (no traditional baseline) | Specialized pruned sub-network + INT8 quantization; latency: 7 ms (Samsung Note10) |
The primary challenge at this stage is the difficult “Accuracy-Efficiency Trade-off.” Nearly all optimization techniques, particularly quantization and pruning, introduce some degree of accuracy loss while reducing resource consumption. The greatest difficulty lies in selecting the right combination of techniques to minimize this accuracy degradation to an acceptable level while satisfying strict hardware constraints.
6.3. Model Conversion and Compilation
The goal of conversion and compilation is to generate a compact, efficient model representation that can run on the target hardware. The optimized model needs to be converted into a format that the target embedded platform’s inference engine can understand and execute. This process is not just about file format conversion; it often involves further graph optimization and code generation.
Format Conversion: This crucial step is responsible for converting the optimized model from its original training framework (e.g., TensorFlow, PyTorch) or a common interchange format like ONNX [
87] into a specific format executable by the target embedded inference engine. For example, TFLite Converter [
85] generates .tflite files; NCNN [
90] uses its conversion toolchain (e.g., onnx2ncnn) to produce its proprietary .param and .bin files; Intel’s OpenVINO [
142] creates its optimized Intermediate Representation (IR) format (.xml and .bin files) via the Model Optimizer. In addition, many hardware vendors’ Software Development Kits (SDKs) provide conversion tools that adapt models to their proprietary, hardware-acceleration-friendly formats. A brief export sketch is given below.
Graph Optimization: Inference frameworks or compilers (e.g., Apache TVM [
91]) perform platform-agnostic and platform-specific graph optimizations, such as operator fusion (merging multiple computational layers into a single execution kernel to reduce data movement and function call overhead), constant folding, and layer replacement (substituting standard operators with hardware-supported efficient ones), among others.
Code Generation: For some frameworks (e.g., TVM) or SDKs targeting specific hardware accelerators (NPUs), this stage generates executable code or libraries highly optimized for the target instruction set (e.g., ARM Neon SIMD, RISC-V Vector Extension) or hardware-acceleration units.
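For format conversion, a common first step is exporting the trained model to ONNX before invoking a target-specific converter. The sketch below assumes a PyTorch model (torchvision’s MobileNetV2 is used purely as an example) and illustrative file names; the resulting .onnx file would then be handed to a tool such as onnx2ncnn or the OpenVINO Model Optimizer.

```python
import torch
from torchvision.models import mobilenet_v2  # illustrative network

model = mobilenet_v2(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)    # fixed input shape expected on-device

torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v2.onnx",
    opset_version=13,                 # choose an opset the downstream converter supports
    input_names=["input"],
    output_names=["output"],
)

# The .onnx file can then be fed to a target-specific toolchain, for example:
#   onnx2ncnn mobilenet_v2.onnx mobilenet_v2.param mobilenet_v2.bin
```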
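Graph optimization and code generation can then be sketched with Apache TVM’s Relay front end, continuing from the exported ONNX file above. The target triple, optimization level, and cross-compiler name are assumptions for an ARM Linux board, and the exact API surface can vary between TVM releases.

```python
import onnx
import tvm
from tvm import relay

# Import the ONNX graph into Relay, declaring the (illustrative) input shape.
onnx_model = onnx.load("mobilenet_v2.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, shape={"input": (1, 3, 224, 224)})

# relay.build applies graph optimizations (operator fusion, constant folding, ...)
# and generates code for the chosen target instruction set.
target = "llvm -mtriple=aarch64-linux-gnu"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Export a shared library loadable by the TVM runtime on the device
# (the cross-compiler name is an assumption about the host toolchain).
lib.export_library("mobilenet_v2_aarch64.so", cc="aarch64-linux-gnu-g++")
```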
The most common and challenging issue in this phase is “Operator Incompatibility.” The set of operators supported by training frameworks (like PyTorch/TensorFlow) is far larger than that supported by embedded inference engines (like TFLite for microcontrollers). If a model contains an operator not supported by the target framework, the conversion process will fail, forcing developers to either redesign the model or manually implement custom operators, which significantly increases development complexity.
6.4. Hardware Integration and Inference Execution
This stage involves deploying the converted model to run on the actual embedded hardware.
Integration: The converted model files (e.g., .tflite files or weights in C/C++ array form) and the inference engine libraries (e.g., the core runtime of TFLite for MCUs [
85], ONNX Runtime Mobile, or vendor-specific inference libraries) are integrated into the firmware of the embedded project. This typically involves including the corresponding libraries and model data within the embedded build environment (e.g., based on Makefile, CMake, or IDEs such as Keil MDK and IAR Embedded Workbench). A sketch for embedding a model as a C array is given below.
Runtime Environment: The inference engine runs on an embedded operating system (e.g., Zephyr, FreeRTOS, Mbed OS) or in a bare-metal environment. The engine’s working memory must be allocated for it (typically statically, to avoid the overhead and nondeterminism of dynamic memory management, especially on MCUs [
85]).
Application Programming Interface (API) Invocation: The application code calls the APIs provided by the inference engine to load the model, preprocess the input data, execute inference (run() or a similar function), and retrieve and post-process the output results; an invocation sketch is given below.
Hardware Accelerator Utilization: If the target platform includes hardware accelerators (e.g., NPUs, GPUs), it is necessary to ensure the inference engine is configured with the correct “delegate” or “execution provider” [
89] to offload computationally intensive operations (e.g., convolutions, matrix multiplications) to the hardware accelerator for execution, fully leveraging its performance and power efficiency advantages [
117]. This often requires integrating drivers and specialized libraries provided by the hardware vendor (e.g., ARM CMSIS-NN [
143] for optimizing operations on ARM Cortex-M).
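As a concrete illustration of the integration step, on bare-metal targets without a file system the converted model is typically embedded in the firmware as a constant C array (the output of xxd -i, for example). The short Python sketch below reproduces that step with placeholder file names.

```python
# Emit a C/C++ header that embeds the .tflite flatbuffer so the firmware can
# reference the model without a file system (equivalent to `xxd -i model.tflite`).
MODEL_PATH = "model_int8.tflite"   # placeholder input file
HEADER_PATH = "model_data.h"       # placeholder output header

with open(MODEL_PATH, "rb") as f:
    data = f.read()

lines = ["// Auto-generated: model flatbuffer embedded as a constant array.",
         "// alignas(16) requires C++11 (use <stdalign.h> in C11).",
         "alignas(16) const unsigned char g_model_data[] = {"]
for i in range(0, len(data), 12):
    chunk = ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
    lines.append(f"  {chunk},")
lines += ["};", f"const unsigned int g_model_data_len = {len(data)};", ""]

with open(HEADER_PATH, "w") as f:
    f.write("\n".join(lines))
```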
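The load, preprocess, invoke, and post-process sequence described above can be prototyped with the TensorFlow Lite Python interpreter on a host machine or a Linux-class board before porting to the C++/microcontroller APIs; the model file name and the random int8 input below are placeholders for an actual quantized model and preprocessed sample.

```python
import numpy as np
import tensorflow as tf

# Load the converted model and allocate all tensors (the working buffers).
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Preprocessing: a random int8 tensor stands in for a real, scaled input.
input_data = np.random.randint(-128, 128,
                               size=tuple(input_details["shape"]),
                               dtype=np.int8)

interpreter.set_tensor(input_details["index"], input_data)
interpreter.invoke()                                    # run a single inference
output = interpreter.get_tensor(output_details["index"])

# Post-processing: for a classifier, take the arg-max class index.
predicted_class = int(np.argmax(output))
print("Predicted class:", predicted_class)
```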
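Attaching a hardware accelerator in the same Python API amounts to loading a vendor delegate and passing it to the interpreter. The Edge TPU library name and model file below are illustrative; on other platforms the GPU or NNAPI delegates, or a vendor-specific execution provider, play the same role.

```python
import tensorflow as tf

# Load a vendor-supplied delegate library (Edge TPU shown as an example).
delegate = tf.lite.experimental.load_delegate("libedgetpu.so.1")

# Operators supported by the delegate are offloaded to the accelerator;
# unsupported operators fall back to the CPU kernels.
interpreter = tf.lite.Interpreter(
    model_path="model_int8_edgetpu.tflite",   # placeholder, compiled for the accelerator
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()
```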
The central challenge here is “System-Level Resource Management and Integration.” This includes not only the precise and safe memory allocation for model weights and intermediate activations (the “Tensor Arena”) within extremely limited RAM but also the correct configuration and invocation of drivers and libraries for NPUs. Furthermore, it involves ensuring that the AI inference task can be completed deterministically within its deadline without interfering with other real-time tasks in the system (such as control loops).
6.5. Performance Evaluation and Validation
After deployment, the actual performance of the model must be comprehensively evaluated and validated on the target embedded hardware.
Benchmarking: To ensure objective and reproducible performance evaluation and enable fair comparisons across different hardware and software solutions, adopting industry-standard benchmarks is crucial. In the resource-constrained domain, MLPerf Tiny [
36] is the most prominent benchmark suite. It provides an “apples-to-apples” comparison platform for the inference latency and energy consumption of various solutions through its standardized models, datasets, and measurement rules. Benchmarking with MLPerf Tiny helps developers validate their solutions and make more informed technical decisions.
Key Metrics Measurement: The model’s inference latency (time taken for a single inference), throughput (number of inferences processed per unit time), memory footprint (peak RAM and Flash/Read-Only Memory (ROM) usage), and power consumption (energy per inference or average power) must be measured accurately; a latency-measurement sketch is given below.
Accuracy Validation: Evaluate the accuracy or other task-relevant metrics of the deployed model (after optimization and conversion) on real-world data or representative datasets to ensure it meets application requirements, and compare it with the pre-optimization accuracy to assess whether the accuracy loss due to optimization is within an acceptable range.
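A minimal host-side sketch of latency and throughput measurement with warm-up runs is shown below (the model file name and run counts are illustrative). On microcontrollers the same measurement is normally taken with a hardware timer or cycle counter, and power consumption requires external instrumentation such as a source measure unit.

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.zeros(tuple(inp["shape"]), dtype=inp["dtype"])

# Warm-up runs exclude one-time costs (lazy initialization, cache warming).
for _ in range(10):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()

# Timed runs: report mean and 95th-percentile latency plus derived throughput.
latencies_ms = []
for _ in range(200):
    start = time.perf_counter()
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

print(f"mean latency : {np.mean(latencies_ms):.2f} ms")
print(f"p95 latency  : {np.percentile(latencies_ms, 95):.2f} ms")
print(f"throughput   : {1000.0 / np.mean(latencies_ms):.1f} inferences/s")
```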
It is important to emphasize that the linear workflow depicted in
Figure 3 is an idealized model simplified for clarity. In real-world applications, this process is highly iterative and nonlinear. This is especially true during the model-optimization and -compression phase, where the order of the internal steps (such as pruning and quantization) is not fixed but depends heavily on the selected toolchain and optimization strategy, for example choosing between “pruning before quantization” and the more advanced “quantization-aware pruning”; optimization is therefore a complex mini-workflow in its own right. More importantly, the feedback loop from “performance evaluation” back to “optimization” is the essence of the entire deployment process: if the evaluation results fail to meet application requirements (e.g., excessive latency, excessive memory usage, or significant accuracy degradation), it is necessary to return to previous steps and make adjustments. This may involve trying different optimization strategies (such as adjusting quantization parameters or using different pruning rates), selecting a different lightweight model, adjusting model conversion/compilation options, or even redesigning the model architecture or considering a different hardware platform. This design-optimization-deployment-evaluation cycle may require multiple iterations to achieve the ultimate goal.
The core challenge in this final stage is achieving “Representative and Accurate Validation.” Performance metrics (such as latency and power consumption) measured on a development board can differ significantly from the final product’s performance in its actual operating environment (e.g., under varying temperatures or battery voltages). Designing test cases that accurately reflect the final deployment scenario and using professional tools to precisely measure metrics like energy consumption are critical to preventing product failures in real-world applications.
In summary, the deployment of EAI models is a complex process involving multidisciplinary knowledge, requiring meticulous engineering practices and repeated iterative optimization. Its ultimate goal is to achieve efficient and reliable on-device intelligence while satisfying the stringent constraints of embedded systems.
9. Conclusions
EAI, as a rapidly advancing interdisciplinary field, is demonstrating tremendous potential to reshape various industries. This review provides a systematic and comprehensive overview of the key aspects of EAI, encompassing its fundamental definition, heterogeneous hardware platforms, software development frameworks, core algorithms, model deployment strategies, and its increasingly diverse application scenarios.
Through in-depth analysis, this review identifies the central challenge of the field as achieving efficient AI computation under strict resource constraints. At the same time, its primary opportunity lies in seamlessly integrating intelligent capabilities into edge devices in the physical world through technological innovation. Consequently, the success of future research and practice will hinge on breakthroughs in addressing challenges such as resource constraints, model optimization, hardware acceleration, and security and privacy.
We believe that by continuously addressing these challenges and seizing the resultant opportunities, researchers and practitioners can collectively drive the evolution of EAI, enabling its more pervasive, efficient, and reliable integration from the cloud into the physical world. This progression will not only accelerate the adoption of intelligent technology across industries but will also profoundly reshape their landscapes, ultimately integrating seamlessly into our daily lives and ushering in a more intelligent era.