1. Introduction
Artificial intelligence (AI) chips are specialized alternatives to generic CPUs, often designed to handle parallel processing and equipped with their own memory and I/O systems. A key feature of AI chips is the implementation of structures tailored to handle the matrix operations common in neural networks (NNs); these structures are known as NN accelerators [
1]. In numerous applications, minimizing size and power consumption is a critical design consideration, although not always mandatory. Reducing these parameters can significantly enhance the efficiency and applicability of a technology, particularly where a small physical footprint and low power usage are crucial for overall system performance and user accessibility. For instance, Internet of Things (IoT) devices in remote locations require low power consumption for prolonged operation [
2], while wearable technologies like fitness trackers need compact, energy-efficient designs for comfort and extended use [
3]. In medical applications, such as wireless capsule endoscopy [
4], an ultra-low-power edge computing system is critical for diagnosing gastrointestinal diseases in real time.
Compared to conventional commercial solutions and programmable platforms, ASICs generally exhibit reduced area requirements and significantly lower power consumption, which are critical for extending battery life and enabling long-term operation in energy-constrained environments. While not an exact comparison, a frequently cited paper by Kuon and Rose [
5] highlights the substantial disparity between FPGAs and ASICs. Their research indicated that, on average, FPGAs were 21 times larger in area, 2.8 times slower, and consumed 9 times more power than ASICs, primarily due to the generic nature of FPGA circuitry. Recent literature supports these findings; for instance, the ASIC-based AccelTran accelerator achieved 372,000 GOPs, drastically surpassing the top FPGA implementation (Me-ViT at 2682 GOPs). Similarly, ASIC-based DTQAtten delivered an energy efficiency of 1298 GOPs/W, while FPGA-based BETA reached only 174 GOPs/W, underscoring ASICs’ significant advantages in both performance density and energy efficiency due to their specialized architectures and optimized manufacturing processes [
6].
The potential of open-source technology for ASIC development is primarily twofold: (1) cost reduction and (2) experimental flexibility. With the advent of open-source PDKs and the increasing maturity of free EDA tools, the development of production-ready ASICs without incurring software costs has become more feasible. This development facilitates broad collaboration—analogous to that observed in the software domain—among senior engineers, researchers, students, and enthusiasts, and provides unrestricted access to the field of microchip design. A notable example is [
7], where undergraduate students successfully leveraged the SkyWater 130 nm Technology (Sky130) and an open-source EDA flow to tape out a processor in a single semester. In essence, this represents an emerging area with the potential to impact both individual microchip design projects and the field as a whole.
Another noteworthy example is the Ecko project, an open-source initiative that leverages automated, community-driven methodologies for NN accelerator design. Utilizing publicly available tools, the Ecko initiative provides detailed resources, including documentation, methodologies, and practical examples for automating the transition from neural network algorithms to silicon implementation. By emphasizing transparency and accessibility, Ecko exemplifies how modern open-source projects can significantly streamline the NN-to-ASIC development workflow, thereby fostering collaboration and innovation across academia and industry [
8].
This paper aims to investigate open-source tools for ASIC design, specifically to generate the NN accelerator component of an AI chip using a high-level programming environment. The objective is to establish the foundation for future custom AI systems. This research distinguishes itself from similar work through the utilization of the FPGA tool HLS4ML, adopting an FPGA-inspired approach to ASIC AI chip design.
2. The Open-Source Landscape for NN-to-Silicon
In essence, four components are needed to create open-source ASIC NN accelerators:
A programming environment for creating an NN;
A method for creating the corresponding circuit equivalent of the NN;
A method for designing the layout of that circuit equivalent;
The choice of process node.
While these components could be managed manually, the ideal flow aims to establish a streamlined process that automates as many individual components as feasible. This approach seeks to reduce time-to-market, speed up prototyping, and ensure functionality by applying verification criteria throughout the development process. The current state-of-the-art solutions in this space utilize different tools, albeit with the same goal, to address these challenges.
2.1. The State of the Art in NN-to-Silicon Solutions
The following presents the recent open-source frameworks and workflows created by researchers to address ASIC AI chip development. First, the 2021 method, VeriGOOD-ML, is a no-human-in-the-loop methodology for generating machine-learning-optimized Verilog from a given ONNX file [
9]. It addresses various key challenges in creating hardware descriptions and layout designs for ML algorithms, covering networks ranging from small and simple to large and complex. Esmaeilzadeh et al. adopted a platform-based approach, categorizing ML algorithms into three separate groups and choosing a different approach for each. For non-DNN ML algorithms, they utilized the template-based framework TABLA by Mahajan et al. [
10], specifically designed to simplify the development process of NN accelerators for FPGAs, focusing on supervised learning algorithms (both classification and regression). For DNN algorithms, they used their own compiler, GeneSys, which translates an ONNX description into a new graph description. During this transformation, the compiler can replace the original nodes with template optimizations for common DNN structures, such as dense, convolution, and various activation layers. The result is a hardware-aware reconstruction that, for example, builds a binary tree for the L2 norm or optimizes data flow for convolutions. Finally, for small, specific ML algorithms, they employed their own compiler, Axiline, a hard-coded engine tailored to these cases. The rationale is that, for certain commonly used ML algorithms, building complex computation structures from scratch, as in the TABLA or GeneSys cases, may not be worthwhile. This method can thereby increase developer productivity in specific cases. The paper demonstrates successful layout generation for an SVM, a logistic regression algorithm, and ResNet50 as a proof of concept on the GF12LP 13 metal layer technology using all 13 layers. This achievement was accomplished using a blend of commercial and open-source automation tools for tasks such as place-and-route, CTS, and PDN generation.
Second, the 2022 method, SODA-Opt, by Agostini et al., is an open-source framework that generates NN-optimized hardware descriptions for implementation in either FPGA or ASIC designs [
11]. Similar to VeriGOOD-ML, it follows the approach of converting popular file formats (demonstrated with both TensorFlow and PyTorch) to layout through an intermediate step of converting the model from Python/ONNX to an optimized equivalent. This equivalent is then processed by an HLS tool to create Verilog, and ultimately goes through the RTL-GDSII tool OpenROAD. In contrast to VeriGOOD-ML’s platform-based design, Agostini et al. chose to design a single compiler: the
SODA compiler. This compiler translates the Python NN algorithm into efficient intermediate code using the MLIR framework found in the LLVM project, which allows code to be restructured at the compiler level. They utilized MLIR’s built-in dialects (through intermediate representations) to turn the high-level NN code into corresponding hardware-optimized equivalents. This is achieved through loop optimizations such as tiling and unrolling, and redundancy optimizations such as early alias analysis (EAA) and dead code elimination (DCE). The result is a highly optimized equivalent of the input NN algorithm that can then be passed to the open-source HLS tool Bambu (part of the PandA framework), developed at Politecnico di Milano. As an HLS tool, Bambu can synthesize both C/C++ code and MLIR-derived intermediate representations into corresponding Verilog/VHDL. Bambu comes with built-in verification tools for simulation and analysis, similar to those found in software like Xilinx’s Vivado. As a demonstration, a LeNet CNN was transformed from Python code to a layout on the open-source academic 45 nm FreePDK design kit using OpenROAD.
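To illustrate the kind of loop-level restructuring that such compilers perform, the following Python sketch contrasts a naive matrix–vector product with a tiled, partially parallelized equivalent; it is a conceptual analogue only and does not reproduce the actual MLIR passes.

```python
import numpy as np

def matvec_naive(W, x):
    # Straightforward row-by-row dot products: one long dependency chain per row.
    y = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            y[i] += W[i, j] * x[j]
    return y

def matvec_tiled(W, x, tile=4):
    # Tiled/unrolled variant: each inner tile exposes independent products that an
    # HLS tool can map onto parallel multipliers, followed by a small reduction.
    n, m = W.shape
    y = np.zeros(n)
    for i in range(n):
        partial = 0.0
        for j0 in range(0, m, tile):
            block = W[i, j0:j0 + tile] * x[j0:j0 + tile]  # 'tile' multiplies in parallel
            partial += block.sum()                        # reduction tree over the tile
        y[i] = partial
    return y

W = np.random.rand(8, 16)
x = np.random.rand(16)
assert np.allclose(matvec_naive(W, x), matvec_tiled(W, x))
```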
Third, the 2023 method, although not released in a paper, was developed by Baumgarten et al., the winners of the Efabless 2023 contest [
12]. Their project, called AI-by-AI, attempts to create as many parts of an NN accelerator as possible using AI technologies [
13]. Specifically, they used the LLM ChatGPT-4 as the primary development tool. With it, they generated an MNIST CNN in TensorFlow (Python) with 96.87% accuracy, then generated a function for reducing the CNN’s precision to 16 bits, and then translated these layers into Vivado HLS-compatible C equivalents by prompting the creation of a bare-metal CNN forward function step by step. They then fed the quantized, bare-metal C code to Vivado HLS, generated the corresponding Verilog RTL, and processed it through OpenLane (an automated flow built around OpenROAD), creating the corresponding layout on the production-ready Sky130 PDK. Using Efabless’ SoC template with full I/O (called Caravel), the entire AI chip was created.
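As a rough illustration of the precision-reduction step described above, the following sketch casts a Keras model’s weights to 16-bit floats; the helper and the stand-in model are hypothetical, not the scripts used by AI-by-AI.

```python
import numpy as np
import tensorflow as tf

def quantize_weights_fp16(model: tf.keras.Model) -> tf.keras.Model:
    """Hypothetical helper: cast every weight tensor down to float16 and back,
    emulating a 16-bit precision reduction before HLS export."""
    for layer in model.layers:
        weights = layer.get_weights()
        if weights:
            layer.set_weights([w.astype(np.float16).astype(np.float32) for w in weights])
    return model

# Tiny stand-in model; the trained MNIST CNN would be loaded here instead.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model = quantize_weights_fp16(model)
```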
2.2. FPGA as Part of NN-to-Silicon Solutions
A late 2023 survey by Ferrandi et al. that summarizes design methodologies to accelerate deep learning mentions most of the aforementioned tools [
14]. One tool mentioned for its potential, but not yet adopted by any current state-of-the-art method, is HLS4ML, an open-source FPGA tool designed for NN-to-C++ translation. It supports multiple HLS back-ends, such as the aforementioned Vivado HLS used in AI-by-AI and its successor, Vitis HLS.
The feedforward nature of CNN data flow lends itself well to efficient hardware structures, especially considering that 90% of the operations in typical NN accelerators are matrix multiplications and convolutions. As a result, developing efficient topologies for handling matrix multiplications becomes a primary focus of optimization [
15]. Multiple architectural approaches can achieve the same output from a given input, albeit with distinct hardware implementations. Consequently, many researchers advocate for a co-design strategy, where the NN’s architecture and its feasibility for hardware implementation (both FPGAs and ASICs) are given equal consideration [
16].
It is also noted that none of the aforementioned workflows experiment with newer NN developments specifically tailored for FPGAs. Considering that FPGA NN accelerators are a well-studied domain, there may be advantages found that could assist in the HLS process and the overall flow.
2.3. NN Accelerators on FPGA
The NN accelerator architecture for CNNs utilizes the FPGA by distributing the heavy use of multiplications across parallel resources [
17]. Initially, this was carried out by assigning DSPs to each node, resulting in a considerable number of floating-point multipliers for larger networks. Over time, this method has been superseded by more contemporary approaches to designing NNs for FPGAs, pioneered by Wang et al. in 2020 with LUTNet, a methodology for FPGA NN optimization, and later refined in 2023 [
18,
19].
On the FPGA, the modern optimization flow is as follows:
Hardware-aware NN model: Designing, training, testing, and pruning NNs are performed using TensorFlow, with deliberate reductions in the total number of parameters in the NN, at the cost of some accuracy.
From DNN to BNN [
17]: The optimization on the FPGA itself involves transforming the weights and biases of the NN into binarized versions, reducing computational complexity (a minimal sketch follows this list). This is accomplished through a combination of Python scripts and HLS tools.
From logic gates to LUTs: Replacing as many conventional multipliers (DSPs) as possible with LUTs has been shown to reduce circuit complexity and footprint. As a result, there is less fan-in (i.e., fewer inputs) at summation points, saving area and resources compared to using DSPs in an equivalent scenario.
Training on FPGA: An additional benefit is that LUTs can be updated on the FPGA, enabling backpropagation and allowing for hardware-specific training.
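A minimal sketch of the DNN-to-BNN step referenced above, assuming a simple sign-based binarization of trained weights (the actual LUTNet flow is considerably more involved):

```python
import numpy as np

def binarize(w: np.ndarray) -> np.ndarray:
    # Map each weight to {-1, +1}; multiplications then reduce to sign flips,
    # which is what makes LUT-based implementations attractive on FPGAs.
    return np.where(w >= 0, 1.0, -1.0)

w = np.array([[0.3, -0.7], [-0.1, 0.9]])
print(binarize(w))  # [[ 1. -1.] [-1.  1.]]
```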
For performance validation, hardware resource usage and circuit latency are the measures most commonly considered. Four key parameters are (1) reuse, i.e., whether blocks are single-use or reusable in the circuitry; (2) latency, i.e., accounting for gate delay and other factors; (3) power consumption; and (4) area.
These hardware-specific parameters lead to a trade-off between parallelism (low reuse factors are required to achieve low latency, at the cost of more hardware) and size (high reuse factors reduce area but result in higher latency). In FPGAs, the speed of memory access is a bottleneck for many NN accelerators when latency is of particular importance [
15].
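The trade-off can be made concrete with a simplified first-order model, similar in spirit to the reuse-factor concept exposed by HLS tools such as HLS4ML; the numbers below are illustrative, not measured:

```python
def dense_layer_cost(n_in: int, n_out: int, reuse_factor: int):
    """First-order estimate: a dense layer needs n_in * n_out multiplications.
    A higher reuse factor time-multiplexes each physical multiplier over more
    operations, shrinking area but stretching latency."""
    n_mults = n_in * n_out
    multipliers = n_mults // reuse_factor  # parallel hardware instances required
    latency_cycles = reuse_factor          # cycles needed to reuse each multiplier
    return multipliers, latency_cycles

for rf in (1, 4, 16):
    print(rf, dense_layer_cost(64, 32, rf))
# rf=1  -> 2048 multipliers, 1 cycle   (maximally parallel)
# rf=16 ->  128 multipliers, 16 cycles (smaller, slower)
```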
2.4. FPGA Advancement for ASIC Design
As significant efforts have been dedicated in recent years to optimizing NNs for FPGA platforms, particularly focusing on hardware-aware implementation, it raises the question of whether these advancements can be applied in an ASIC context. When an LUT is translated for ASIC implementation, it results in a circuit composed of gates. This brings up the question: what is the purpose of employing LUTs initially? Despite this transition, there are still compelling reasons to explore this approach in ASIC development.
Firstly, integrating FPGA testing into the ASIC development process can offer an additional layer of testing and validation as the resources needed to test NNs on FPGA platforms are considerably lower.
Secondly, LUTs on FPGAs are adaptable, enabling backpropagation and facilitating hardware-specific training. This makes them one level closer to representative hardware compared to CPU/GPU platforms.
Thirdly, although ASIC implementation removes the programmable aspect inherent in FPGAs, it does not preclude the use of LUT-based methodologies. With emerging advancements, such as the work by Amarú et al. in 2021 [
20], which outlines an LUT-based optimization methodology customized for ASIC synthesis rather than FPGAs (further refined in 2022 [
21]), this approach could potentially become prevalent.
2.5. Common Patterns in NN-to-Silicon Solutions
Breaking down VeriGOOD-ML, SODA-Opt, and AI-by-AI into their essential components reveals some common patterns and a conceptual framework from which novel methods can be developed (see
Figure 1).
NN in Python: While there is no standardized approach for creating neural networks, using a Python environment aligns with the preference of most developers.
Translation of Python to an IR: This step produces the intermediate representation from which HLS generates Verilog/VHDL; synthesis from C/C++ is currently the most common approach. Significant progress has been made in code conversion tools that employ libraries of code blocks, and language translation tools, including those powered by large language models such as ChatGPT, alongside various online platforms, offer efficient services in this field.
HLS: Required for utilization of the most popular RTL-GDSII suites, HLS tools often include hardware-aware optimizations and relevant testing procedures, thus introducing an additional assessment before layout creation. This step can potentially leverage existing tools for FPGA technology, enabling comprehensive verification and testing for timing/delay and resource utilization.
RTL-GDSII: The automation of routing and component placement is well explored in the field of digital circuitry. Additionally, open-source tools like Magic and KLayout are available for DRC and LVS verification. OpenROAD also allows for the extraction of parasitics, the output of LEF/DEF files, and the generation of heat maps and congestion maps for further analysis before production.
Each of these steps can be regarded as an individual component, akin to black boxes, with well-defined input/output relationships. This modular structure allows for the seamless integration of future technologies in place of existing components if necessary. These four steps highlight how separate tools designed to accomplish very specific goals can operate cohesively in this context.
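Expressed as a thin orchestration script, this black-box view looks roughly as follows; every function here is a stub standing in for the corresponding tool, not a real API:

```python
# Conceptual pipeline only: each stage is a placeholder for the real tool invocation.
def train_model():             return "keras_model"         # 1. NN in Python (e.g., TensorFlow)
def to_intermediate(model):    return f"hls4ml({model})"    # 2. Python -> C/C++ IR (e.g., HLS4ML)
def run_hls(project):          return f"verilog({project})" # 3. HLS -> Verilog/VHDL (e.g., Vivado HLS)
def run_rtl_to_gdsii(rtl):     return f"gds({rtl})"         # 4. RTL -> GDSII (e.g., OpenLane2/OpenROAD)

def build_accelerator():
    model = train_model()
    project = to_intermediate(model)
    rtl = run_hls(project)
    return run_rtl_to_gdsii(rtl)

print(build_accelerator())  # gds(verilog(hls4ml(keras_model)))
```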
3. Implementation
An experiment was conducted using a conceptual flow template with tools including TensorFlow, HLS4ML, Vivado HLS, and OpenLane2. An overview of this experiment is provided in
Figure 2. The roles of each tool in the proposed automation flow are clearly delineated as follows: TensorFlow provides a versatile Python-based platform for neural network creation, training, and evaluation, enabling rapid prototyping. HLS4ML converts these high-level neural network models into hardware-optimized C++ code suitable for synthesis. Vivado HLS translates this intermediate representation into synthesizable Verilog RTL, enabling detailed hardware-specific optimization. Finally, OpenLane2 automates the RTL-to-GDSII layout process, including placement, routing, and verification, thereby facilitating a streamlined progression from high-level models to fabrication-ready layouts. The GitHub repository [
22] contains all the necessary files to run the experimental flow and an installation guide for the tools used.
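As a condensed sketch of the HLS4ML stage of this flow (the toy model, output directory, and build options are illustrative placeholders; the full scripts are provided in the repository [22]):

```python
import tensorflow as tf
import hls4ml

# Tiny dense + activation model, mirroring the 'Debug template' (illustrative only).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16,)),
    tf.keras.layers.Dense(8),
    tf.keras.layers.Activation("relu"),
])

# Derive a per-layer HLS configuration (precision, reuse factor, etc.) from the model.
config = hls4ml.utils.config_from_keras_model(model, granularity="name")

# Emit a Vivado HLS C++ project; the output directory is a placeholder path.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    backend="Vivado",
    output_dir="hls_project",
)
hls_model.compile()          # builds a C-simulation model for functional checks
hls_model.build(synth=True)  # calls Vivado HLS to produce synthesizable Verilog RTL
```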
3.1. Python Emphasis
The technologies were selected with compatibility in mind: they all work seamlessly with any Python-compatible IDE. Vivado HLS can be invoked via its CLI from within a Python script, and OpenLane2 allows layout parameters, such as floorplanning, to be customized through Python scripting, as sketched below.
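For illustration, these two Python entry points might be driven as follows; the design name, file paths, Tcl script, and clock settings are assumptions rather than the exact values used in this experiment:

```python
import subprocess
from openlane.flows import Flow

# Invoke Vivado HLS from Python via its command line (the Tcl script name is hypothetical).
subprocess.run(["vivado_hls", "-f", "build_prj.tcl"], check=True)

# Drive OpenLane2's RTL-to-GDSII "Classic" flow directly from Python.
Classic = Flow.factory.get("Classic")
flow = Classic(
    {
        "PDK": "sky130A",                           # open-source Sky130 PDK
        "DESIGN_NAME": "nn_accelerator",            # placeholder design name
        "VERILOG_FILES": ["src/nn_accelerator.v"],  # RTL emitted by Vivado HLS
        "CLOCK_PORT": "ap_clk",
        "CLOCK_PERIOD": 25.0,                       # 40 MHz target, in ns
    },
    design_dir=".",
)
flow.start()
```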
3.2. Package Version Considerations
At the time of writing, the steps utilize the latest releases of most software packages. TensorFlow, HLS4ML, and OpenLane2 are all used at their latest versions, which are regularly updated with new features. Vivado HLS is pinned to version 2020.1, the version currently supported by HLS4ML.
3.3. NN Architecture
The flow was tested on three networks. The first tested whether it was possible to complete the flow with a single dense layer followed by an activation layer, functioning as a template example that can be run multiple times per hour for debugging (henceforth referred to as the ‘Debug template’).
Inspired by both AI-by-AI and MobileNet, two additional template examples were created:
Baseline: The NN model from AI-by-AI, using the grayscale MNIST dataset with a clock frequency of 40 MHz. The CNN from Baumgarten et al., given its detailed specifications, is used as a baseline to test whether the FPGA-optimized flow can perform at the same level on the same PDK.
Compatibility: We selected a wireless capsule endoscopy application as our target. To gauge compatibility with real-world models, we used a neural network inspired by this application, featuring separable convolution layers and batch normalization (approximated in the sketch below). This model processes 32 × 32 pixel, 3-channel images from the CIFAR10 dataset. Conceptually similar to MobileNet but with reduced complexity, this CNN model served as a practical benchmark.
The chosen architectures were selected to test both the flow’s compatibility with future NN accelerator design and whether the flow could produce a complete NN accelerator end-to-end.
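A minimal Keras sketch approximating the ‘Compatibility’ model (layer counts and filter widths are illustrative assumptions, not the exact architecture used):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative CIFAR-10 model with separable convolutions and batch normalization,
# conceptually similar to a slimmed-down MobileNet.
model = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.SeparableConv2D(16, 3, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.SeparableConv2D(32, 3, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```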
5. Discussion
While the creation of an automation flow from a high-level environment toward manufacturing an open-source NN accelerator using new advancements in FPGA technology is an ongoing effort, the experimental results demonstrate that the proposed flow effectively serves as a lightweight, plug-and-play script. This tool enables users to familiarize themselves with relevant terminology and gain practical intuition about how the software operates. Although the flow provides a simple-to-install framework encompassing the NN-to-silicon process, further development is needed to achieve manufacturing-ready results.
Using the open-source Sky130 PDK in our experiment aligns naturally with our fully open-source toolchain to ensure cost efficiency, reproducibility, and accessibility. However, it impacts design outcomes by increasing the die area and limiting the achievable frequency due to less optimized cell libraries compared to commercial alternatives. Despite these trade-offs, Sky130 facilitates community-driven enhancements and transparent design processes, making it particularly valuable for educational and research-focused ASIC development.
Regarding the suitability of HLS4ML as a resource in an ASIC flow, it offers several advantages, including high transparency, layer-by-layer customization, simplicity, elegance, and numerous quality-of-life features and verification stages. It also demonstrates reduced latency compared to bare-metal implementations. However, the experiment did not fully replicate the results available on HLS4ML’s GitHub, which features successful CNN examples on RGB datasets, indicating that additional optimization and investigation are necessary.
One of the challenges encountered was the inability to reach very high clock frequencies, which poses limitations for certain applications. Additionally, in our experience, TensorFlow’s depthwise and separable convolution layers produced undocumented errors, despite the HLS4ML documentation stating that these layers are supported. A significant bottleneck was computing power, as many flows halted due to limited system resources, making compression necessary in all cases. For future endeavors, utilizing high-performance computing resources and being prepared for flows that may take from 12 h to multiple days is advisable. More computing power would enable more definitive comparisons, allowing flows to run longer and potentially produce more optimized layouts.
Debugging code in the open-source domain presents its own set of challenges due to the occasional lack of examples and documentation. Engaging with the community is crucial; collaborating closely with key developers can help to address errors more effectively. Errors such as SIGSEGV (segmentation fault) and SIGKILL (process termination) encountered in OpenLane2 highlight the importance of community support and collaboration in resolving such issues.
In summary, the findings show that automation tools can be utilized at each stage of NN accelerator development, with multiple options available for NN development platforms, HLS, and RTL-to-GDSII tools. Even the relatively simple FPGA-inspired flow from this experiment shows promise in accelerating the development time of key blocks, although further work is needed to produce a complete, production-ready layout. Given that RTL-to-GDSII remains the most popular method for generating NN accelerators, there is significant potential for experts in Verilog and associated EDA software to drive further improvements in this landscape.
5.1. Practical Implications for Industry
From an industry perspective, the automation methodology presented in this paper has several practical advantages. By utilizing open-source tools to simplify and accelerate ASIC design, the approach makes custom neural network hardware accessible even to small companies, startups, and academic teams that might otherwise lack the resources for such developments. This democratization can significantly shorten development cycles and reduce financial risks, making innovative hardware solutions viable for specialized applications in healthcare, automotive, IoT, and consumer electronics. Moreover, the open nature of these tools promotes collaborative innovation across academia and industry, ultimately driving faster advancements in specialized AI accelerators tailored to specific market needs.
5.2. Future Directions and Recommendations
Although our primary aim was to automate NN accelerators, there remain numerous avenues for expanding and refining this work. One option is to adopt a bottom-up approach rather than the top-down approach used here (which produces a layout from a given NN). A bottom-up methodology—centered on layout constraints first and then matching the NN design to those constraints—may reveal hardware optimizations that were overlooked in our top-down flow. Likewise, optimizing NNs for hardware-aware ASIC design (e.g., using activation functions that require fewer resources, reducing precision in deep layers, or determining the required clock frequency) was not pursued here, as the focus was on a straightforward comparison with AI-by-AI’s results and default TensorFlow layers.
Additionally, the use of high-performance computing resources for both simulation and synthesis is recommended to unlock further optimizations. Longer runtimes (potentially in the range of days) and larger memory allocations can often uncover better floorplans, higher clock speeds, or more refined compression strategies. In parallel, stronger engagement with the open-source community remains essential for rapid troubleshooting and for enhancing documentation around tools such as OpenLane2. Working closely with developers can help to address undocumented segmentation faults and other stability issues, ensuring that the flow can evolve into a robust framework suitable for a broader range of NN accelerator designs.
6. Conclusions
Recent methodologies, including one fully open-source tool [
11], another predominantly utilizing open-source tools [
9], and a flow incorporating open-source APIs and software for RTL-GDSII translation [
13], demonstrate the capability to automate hardware-aware design for NN accelerators on modern architectures with digital circuitry. These methodologies offer a foundation for a robust approach. Leveraging advancements in open-source frameworks for Python ML libraries [
23], bare-metal translation/compiler code optimization, HLS, and RTL-GDSII flows [
24], it is increasingly evident that existing open-source tools can be harnessed to develop methodologies enabling developers to operate exclusively in high-level language programming environments for NNs to produce silicon equivalents.
Though the experiment conducted for this paper produced limited results, it demonstrates the possibility of fully automating the design process for common NN blocks to layout, working exclusively within a Python interface with a single button press and leveraging advancements in FPGA technology through a predominantly open-source approach. FPGA optimizations play a role in ASIC design for NN accelerators as they provide a trade-off between increased area and decreased delay by optimizing data flow and reducing redundancy. In assessing the effectiveness of HLS4ML within an ASIC flow, the evidence leans toward a negative evaluation due to challenges with higher frequencies (above 300 MHz) and TensorFlow’s depthwise and separable convolution layers not working as expected, highlighting the limitations of HLS4ML in handling certain NN architectures. While the experiment fell short in producing viable results for manufacturing, the flow emerges as an accessible tool for gaining familiarity with NN-to-silicon processes, offering ease of installation and an introduction to relevant concepts.