1. Introduction
Chip verification is essential in chip manufacturing for detecting faults and improving performance. The process takes a long time because the chip design must be verified and revised repeatedly. As the size and complexity of integrated circuit designs increase, the design verification time becomes longer and less predictable. Many users depend on commercial electronic design automation (EDA) tools from vendors such as Synopsys and Cadence. Licensed EDA tools come with several disadvantages. They are costly, and the license fees can be a burden for small companies or individuals [1]. While licensed EDA tools offer many functionalities, they may not satisfy all user requirements. They offer limited flexibility, and users cannot modify the software or add features to it. They do not disclose the underlying algorithms that perform specific functions, such as place and route. Also, users of licensed EDA tools cannot access the source code for debugging and modification [2]. Accurately verifying chips with a licensed EDA tool therefore costs users a lot of money and time.
Our research aims to save cost and time in clock tree synthesis (CTS), which is one of the most user-dependent processes in chip verification. It focuses on the CTS stage of placement and route (P&R), a time-consuming process in which elements are placed and wired to optimize performance, power, and area (PPA). The CTS process is important for delivering the clock signal quickly and accurately across the chip, which increases overall chip performance [3,4]. CTS is of paramount importance in controlling overall chip timing in the P&R process, and it determines timing convergence and power consumption. CTS also improves energy efficiency by making it possible to optimize power consumption. Implementing the clock tree with a clock distribution algorithm helps reduce the total chip power [5,6]. The clock distribution algorithm should satisfy the clock tree design rule checks for maximum transition, maximum capacitance, and maximum fanout, and it should achieve minimal skew and minimal insertion delay for the target clock tree. It adjusts the timing of the chip by inserting buffers that optimize timing closure for high-fanout signals and reduce skew. However, as the design size of the integrated circuit increases, the number of nodes required to distribute the clock signal increases significantly, which increases the calculation time. In addition, many calculations have to be performed to obtain an optimal result from a complex distribution algorithm that considers various constraints. If the clock tree implementation does not meet the conditions after RTL design, the design has to be modified, which consumes more time and cost because constraints, circuit stability, performance, and other factors must be considered.
To reduce the cost and time of the CTS process, in a prior study we designed and preconstructed a shallow clock tree based only on a register transfer level (RTL) source, without a licensed EDA tool or any other requirements [7]. That study addressed the data path from the input to the D flip-flop and performed only part of the CTS. It was also restricted in general cases, such as large and complex designs: because its Verilog sources were composed of one output and several inputs, it could not be applied to general sources, which have many outputs and inputs. In addition, the study solved the hold violation from the input to the D flip-flop, but it did not solve the hold violation of the clock path.
To remove these restrictions, we designed a CTS pre-estimation that considers all clock paths from the clock to the D flip-flops' clock sinks, using the general case of a large Verilog source before the route process.
Figure 1 shows how CTS pre-estimation is conducted using an RTL synthesis netlist in the CTS stage of the P&R process. The CTS results of a licensed EDA tool cannot be approximated closely using only the RTL synthesis netlist, which has no timing or parasitic information. To approach the CTS result of the licensed EDA tool, this study assumes that the logic has been arbitrarily placed, as if it had gone through the placement process before the route process; this state is described as pre-route. This study also introduces a random buffer insertion stage to reproduce the placement process of the licensed EDA tool. After the random buffer insertion stage, the shallow CTS process is executed to reduce clock skew and calibrate total chip timing [8]. In this paper, the shallow clock tree implementation, which optimizes PPA, increases the maximum clock frequency on the chip by about 10% compared with Qflow’s CTS result and decreases the standard deviation of the clock path delay by about 20% after shallow CTS.
The remaining parts of this paper are organized as follows. Section 2 introduces related research on CTS algorithms in the P&R process. Section 3 explains the overall process of this study, which constructs the CTS between the pre-route and route processes. We explain how random buffer insertion is carried out in the pre-route process and how the clock tree implementation algorithms preconstruct the clock tree. In Section 4, we present the details of the experiments, which are based on a specific RTL source and the Taiwan Semiconductor Manufacturing Company Limited (TSMC) 180 nm standard cell library. We analyze the experimental results, which show how closely the heuristic algorithm approaches clock tree preconstruction. We observe the difference in standard deviation before and after CTS pre-estimation and the reduced clock skew of the clock paths. We compare the maximum clock frequency of shallow CTS with the maximum clock frequency of the open-source EDA tool after the CTS. In addition, we evaluate the memory profile of the shallow CTS algorithm for different RTL sources. Finally, Section 5 presents the conclusions and explains future research.
2. Related Research
Clock tree implementation is an important task in the physical design of an integrated circuit. The clock tree delivers the synchronous signal to every sequential cell, such as a D flip-flop. In clock tree implementation, it is important to solve the clock skew problem, where clock skew is the difference in arrival time of the clock at the sequential cells. The CTS process determines the maximum clock frequency and minimum clock skew and affects the overall power of the physical design. Some earlier papers concentrated on minimizing the clock skew between the source and the sinks to achieve minimum clock skew and maximum clock frequency. Soheil Nazar introduced a synchronous minimum-skew clock tree synthesis algorithm for single flux quantum circuits, taking into account splitter delays and placement constraints. The proposed methodology advances the state of the art by incorporating splitter delays and creating a fully balanced clock tree structure in which the number of clock splitters from the clock source to all sink nodes is equal [9]. Jinwei Lu implemented a power-aware clock tree synthesizer and a power- and slew-aware clock tree synthesizer using the Elmore RC delay model [10]. They suggested using a slew-focused lookup table to convey details about the driving capacity when inserting buffers and gates. The implementation minimized power consumption and achieved minimum clock skew while satisfying the slew rate. Yici Cai focused on skew optimization under slew constraints and obstacle avoidance. This work includes a clock tree construction phase that utilizes an obstacle-aware topology generation algorithm called OBB, ensures the balanced insertion of candidate buffer positions, and employs a fast heuristic buffer insertion algorithm [11]. They integrated signal polarity requirements into the fundamental buffer insertion step, ensuring minimal impact on the final results. Overcoming the adverse effects of obstacles on skew is a significant challenge in buffered CTS, and they dedicated considerable time and effort to addressing this issue.
3. Proposed Shallow CTS Preconstruction Process
We explain how to preconstruct the clock tree from the Verilog file and propose the algorithms and functions in detail for the overall process.
Figure 2 shows the overall process proposed for shallow CTS pre-estimation. The entire process is largely divided into the RTL synthesis, pre-route, and shallow CTS processes. The open-source RTL synthesis tool Yosys is used to identify whether the RTL source is synthesizable [12]. Parser-Verilog, which is a structural Verilog parser, is used to conduct the pre-route and shallow CTS processes [13,14]. Each step is explained in detail as follows.
3.1. RTL Synthesizable Estimation
The input is an RTL synthesizable Verilog file that uses a high-level description of the design. Usually, the CTS of a licensed EDA tool proceeds from an optimized gate-level representation, parasitic information, timing information, and so on. In this study, however, we conduct the shallow CTS process at the primitive gate level to determine whether a clock tree is synthesizable, without data such as timing and parasitic information. The original Verilog source is converted to an RTL synthesizable source using Yosys, which maps the design to the standard cell library of a suitable fabrication process and modifies it to meet various constraints, such as safety and timing requirements.
3.2. Route Information Using Parsed Netlist
Before the clock skew and hold violation can be adjusted through buffer insertion, it is necessary to create a general tree from the parsed information and traverse every path. The route algorithm operates on the RTL synthesizable source and supplies the clock tree implementation algorithm, both using Parser-Verilog. Based on the parsed netlist, the route algorithm searches and stores route information from the gate with the output port to a gate with an input port. The prior study built a general tree from a Verilog source with one output port and multiple input ports [15]. In that study, we tried to solve the hold violation on the data path from the input to the D flip-flop, and we used only the existing noncyclic paths of the Verilog source. In its route process, the specific RTL source with one output and multiple inputs was represented as a 2-dimensional vector containing the route information. It was therefore difficult to apply the prior study to general cases other than special sources that have a noncyclic path with one output and multiple inputs.
Figure 3 shows a general case of a Verilog source and explains how route information is formed from the Verilog netlist structure. The synthesizable RTL source forms a general tree structure in which the topmost root node is placed on an output or clock port of the RTL source and the bottom leaf nodes are placed on the inputs or the D flip-flops’ input ports. We develop a tree structure that applies to the general case and try to solve the hold violation on the clock path from the clock to the D flip-flop. A general Verilog file has multiple input ports and multiple output ports and can be seen as several general tree structures. We look for the common rules of Verilog files that have multiple input and output ports. The parsed Verilog netlist is stored as a 3-dimensional vector that comprises several trees, each with one output and multiple inputs. Each 2-dimensional vector holds the route information from the gate with one output port to the gates with the multiple input ports. Each 1-dimensional vector stores the route information from a gate that includes one output port to a gate that includes one input port.
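The nested layout described above can be illustrated with a minimal Python sketch. It is not the paper's Parser-Verilog implementation: the `fanin` dictionary, the gate names, and the helper functions `routes_from` and `build_route_table` are hypothetical, and the netlist is assumed to be acyclic.

```python
# Hypothetical gate-level netlist: each gate maps to the gates that drive
# its inputs. Names are illustrative only and assume an acyclic netlist.
fanin = {
    "out1": ["and1"],
    "and1": ["in_a", "in_b"],
    "out2": ["or1"],
    "or1": ["in_b", "in_c"],
}

def routes_from(node, fanin):
    """Return every root-to-leaf route starting at node (a list of 1-D routes)."""
    drivers = fanin.get(node, [])
    if not drivers:  # a primary input has no drivers, so the route ends here
        return [[node]]
    return [[node] + rest for d in drivers for rest in routes_from(d, fanin)]

def build_route_table(roots, fanin):
    """Three nested levels: one tree per root, each tree a list of routes."""
    return [routes_from(root, fanin) for root in roots]

table = build_route_table(["out1", "out2"], fanin)
for tree in table:
    for route in tree:
        print(" -> ".join(route))
```

Here `table` plays the role of the 3-dimensional vector, each `tree` the 2-dimensional vector for one output, and each `route` a 1-dimensional vector from one output port to one input port.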
Algorithm 1 and Figure 4 show how the clock paths connected to the clock port are searched and traversed using a recursive DFS algorithm. The complexity of Algorithm 1 is determined by the recursive DFS. A gate G that includes the clock port, or that connects directly to the clock, is inserted into the working vector. The algorithm searches for G’s clock port and stores the gate’s clock port information. It traverses the other gates connected to that port through the recursive function and stores them. If no other gate that includes a clock port exists in the vector, the traversed path is stored as a clock path. Otherwise, G is stored to the current path and the traversal continues. The vectorized data produced by Algorithm 1 are used in the pre-route and shallow CTS processes.
Algorithm 1: Search route of clock paths using recursive DFS model
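A recursive DFS of this kind can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' Algorithm 1: the `fanout` dictionary, the `sinks` set, the cell names, and the function `search_clock_paths` are hypothetical, and the clock network is assumed to be given as a mapping from each cell to the cells it drives.

```python
# Hypothetical clock network: each node maps to the cells its output drives.
fanout = {
    "clk": ["buf0"],
    "buf0": ["buf1", "dff1"],
    "buf1": ["dff2", "dff3"],
}
# D flip-flop clock sinks, where a clock path ends (assumed names).
sinks = {"dff1", "dff2", "dff3"}

def search_clock_paths(node, fanout, sinks, path=None, found=None):
    """Collect every path from node to a clock sink using recursive DFS."""
    path = (path or []) + [node]          # copy, so sibling branches stay independent
    found = [] if found is None else found
    if node in sinks:                     # reached a flip-flop clock pin
        found.append(path)
        return found
    for nxt in fanout.get(node, []):
        if nxt not in path:               # guard against loops in the netlist
            search_clock_paths(nxt, fanout, sinks, path, found)
    return found

clock_paths = search_clock_paths("clk", fanout, sinks)
for p in clock_paths:
    print(" -> ".join(p))
```

The returned list of paths corresponds to the vectorized data that the later pre-route and shallow CTS steps consume.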
3.3. Pre-Route Process
Initially, the clock signal is in an ideal mode, in which no physical clock distribution exists, during RTL design, synthesis, and placement [16]. The clock signal in the ideal mode is assumed to reach the clock pins of all D flip-flops simultaneously. The pre-route process is an arbitrary logic placement before the real route process. After the placement process, the clock is physically connected, and clock skew occurs as each logic element is placed [17]. This study does not consider timing constraints or parasitic components such as R, C, and cell strength. In addition, it is assumed that random placement produces the worst clock skew on the clock paths after the pre-route process. The random placement process is implemented using random buffer insertion.
Random buffer insertion is essential for expressing the state in which the physical clock is connected and clock skew occurs in the placement process before the route process. If random placement is not executed, no clock skew is generated on the clock paths, because this study calculates only each path’s static delay from the delays of the standard library cells.
Figure 5 represents the difference between the clock tree after RTL synthesis and the clock tree after the pre-route process. In Figure 5a, the clock buffer is simply connected to the clock sinks of the D flip-flops. There is no clock skew because the clock is assumed to arrive at the D flip-flops’ clock ports at the same time after RTL synthesis. In contrast, Figure 5b represents the clock skew that exists after random buffers are inserted in the pre-route process. Even so, the buffers are not inserted purely at random; insertion is guided by the number of loads connected to the clock buffer. If many loads are connected to a clock buffer, placing clock buffers of greater strength reduces the clock skew but may enlarge the overall chip area. To prevent this problem, load balancing should be carried out by considering the R and C components of the front and rear stages. However, this study assumes only the standard library cell delay and excludes timing constraints and parasitic components such as R, C, and cell strength. Therefore, random buffer insertion is performed by simply determining the number of loads.
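The text does not give the exact rule that links the load count to the inserted buffers, so the Python sketch below uses an assumed one purely for illustration: for each clock path, the number of buffers inserted in front of the sink is drawn uniformly between zero and the number of loads on the driving cell. The function `random_buffer_insertion`, the cell name `BUF`, and the example paths are hypothetical.

```python
import random

def random_buffer_insertion(clock_paths, seed=0):
    """Insert a load-dependent random number of buffers before each sink.

    Assumed rule (not from the paper): the count for a path is drawn from
    0..N, where N is the number of loads on the cell driving the sink.
    Every path must contain at least a source and a sink.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    # Count the distinct loads of each driver across all clock paths.
    loads = {}
    for path in clock_paths:
        for driver, load in zip(path, path[1:]):
            loads.setdefault(driver, set()).add(load)
    routed = []
    for path in clock_paths:
        driver = path[-2]
        count = rng.randint(0, len(loads[driver]))
        routed.append(path[:-1] + ["BUF"] * count + [path[-1]])
    return routed

paths = [["clk", "buf0", "dff1"], ["clk", "buf0", "dff2"]]
pre_route = random_buffer_insertion(paths, seed=1)
for p in pre_route:
    print(" -> ".join(p))
```

Because the paths now contain different numbers of buffers, their static delays differ, which is how the pre-route step introduces clock skew for the shallow CTS to remove.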
3.4. Shallow CTS Process
The shallow CTS process follows the pre-route process, in which arbitrary logic placement and random buffer insertion generate clock skew. The pre-route process represents the state in which the physical clock is connected to all D flip-flops’ clock sinks. When the physical clock signal is applied, it arrives at the clock pins at different times. The clock skew created by this clock uncertainty is corrected through the CTS process, which adjusts the clock skew generated after the physical clock connection. The CTS recognizes the clock signal at the clock source pin and delivers it to thousands of D flip-flops. It forms a buffer tree that matches the skew of clock nets and high-fanout nets and meets design rules such as maximum capacitance and maximum transition time. The shallow CTS process is performed in Parser-Verilog using the RTL synthesizable netlist. In this process, the buffer insertion algorithm is applied to reach timing closure on the clock paths.
The clock path goes from the clock to a D flip-flop’s clock sink. The calculation of each path’s delay is based on the intrinsic delay of the standard cell library, which does not include timing specifications and size. The algorithm calculates the clock path delay from the cells along each clock path and selects the clock path with the largest delay as the worst clock path. The difference between the worst clock path delay and another clock path’s delay is defined as that path’s clock skew.
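The delay and skew definitions above can be written as a short Python sketch. The intrinsic delay values and cell names in `INTRINSIC_DELAY_PS` are hypothetical placeholders, not values from the TSMC library, and delays are kept as integer picoseconds to avoid rounding issues.

```python
import statistics

# Hypothetical intrinsic cell delays in picoseconds (placeholder values).
INTRINSIC_DELAY_PS = {"CLKBUF": 120, "BUF": 80, "INV": 50}

def path_delay(path, delays=INTRINSIC_DELAY_PS):
    """Sum the intrinsic delays of the cells on a path.

    Cells without an entry, such as the sink flip-flop, contribute zero.
    """
    return sum(delays.get(cell, 0) for cell in path)

def clock_skews(paths):
    """Return the worst path delay and each path's skew relative to it."""
    path_delays = [path_delay(p) for p in paths]
    worst = max(path_delays)
    return worst, [worst - d for d in path_delays]

paths = [
    ["CLKBUF", "DFF"],                # short clock path
    ["CLKBUF", "BUF", "BUF", "DFF"],  # longer clock path
]
worst, skews = clock_skews(paths)
spread = statistics.pstdev(path_delay(p) for p in paths)
print(worst, skews, spread)
```

The population standard deviation is included because the paper reports the spread of clock path delays before and after shallow CTS.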
Figure 6 shows how the clock skew of each clock path is resolved relative to the worst clock path. The real clock signal arrives at the clock pins of the many D flip-flops at different times. In Figure 6a, the clock path delay of D flip-flop1 is assumed to be smaller than the clock path delay of D flip-flop2. In Figure 6b, the clock skew is reduced by inserting buffers into the other clock paths, using the worst clock path delay as the reference, in order to balance clock latency [18,19]. In this study, the buffer insertion algorithm aims to reach timing closure.
Algorithm 2 and Figure 7 represent the buffer insertion on a clock path. The algorithm calculates each clock path’s delay and stores the delay information, using the vectorized data that result from Algorithm 1. The complexity of Algorithm 2 is determined by the nested for loop.
Each clock path’s delay is compared with the maximum clock path delay. If clock skew exists, that is, if there is a difference between the path’s delay and the maximum, buffer insertion is performed. The algorithm inserts buffers in order of delay, starting with the buffer that has the larger delay. For instance, assume that there is a difference between the worst clock path’s delay and another clock path’s delay, and that one buffer’s intrinsic delay is greater than another buffer’s intrinsic delay. The buffer with the greater intrinsic delay is inserted into the clock path first. If clock skew remains after that insertion, the buffer with the smaller intrinsic delay is applied to the clock path to cover the remaining clock skew.
The overall process synthesizes a clock tree from an RTL synthesizable source by inserting buffers based on the worst clock skew of the clock paths. The variation of the clock skew across the pre-route and CTS processes is illustrated in Figure 8. In Figure 8b, logic placement produces the worst clock skew through random buffer insertion in the pre-route process. After the pre-route process, it is difficult to optimize PPA and reach timing closure because the worst clock skew occurs on the clock paths. In Figure 8c, the CTS inserts buffers based on the worst clock path to solve the overall chip timing and PPA problems. The clock skew after the CTS process is reduced compared with the pre-route process.
Algorithm 2: Buffer insertion
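The largest-delay-first insertion described in this section can be sketched with a nested loop in Python. This is a minimal illustration and not the authors' Algorithm 2: the function `insert_buffers`, the path delays, and the two buffer delays are hypothetical, and all delays are integer picoseconds.

```python
def insert_buffers(path_delays, buffer_delays):
    """Greedily pad each clock path toward the worst path delay.

    For every path, buffers are tried from the largest intrinsic delay to
    the smallest, and a buffer is only added while it does not overshoot
    the worst path. Returns (inserted buffer delays, residual skew) per path.
    """
    worst = max(path_delays)
    ordered = sorted(buffer_delays, reverse=True)  # larger delay first
    plan = []
    for delay in path_delays:           # outer loop over clock paths
        skew = worst - delay
        inserted = []
        for buf in ordered:             # inner loop over buffer types
            count, skew = divmod(skew, buf)
            inserted += [buf] * count
        plan.append((inserted, skew))   # skew is now the residual
    return plan

# Hypothetical path delays and buffer intrinsic delays in picoseconds.
plan = insert_buffers([120, 280, 250], [80, 30])
for inserted, residual in plan:
    print(inserted, residual)
```

Any residual that is smaller than the smallest available buffer delay stays as remaining skew, which reflects that only intrinsic cell delays are considered in this simplified model.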
5. Conclusions
The results of the experiments show that the proposed shallow CTS algorithm is efficient for preconstructing a clock tree and checking whether the clock tree is synthesizable. The shallow CTS clearly reduced the standard deviation of the clock path delay relative to the pre-route stage, and it also achieved a higher maximum clock frequency than Qflow’s post-STA result, which is obtained after the CTS. The results verify that the shallow CTS algorithm, which focuses on clock skew, approaches zero clock skew, and that the maximum clock frequency of the shallow CTS process is higher than that of Qflow. The standard deviation of the clock path delay provides a robust way to expose weak points in the clock tree. A lower critical path delay on the chip and a lower standard deviation of the clock path delay mean a higher clock frequency and a better optimized clock tree structure. On this basis, we conclude that the time-consuming CTS process can be given a mock review in advance, pre-estimated with the heuristic algorithm and without a licensed EDA tool. When a licensed EDA tool is used, the tool cost and the long time required to check the CTS results are problems. In addition, with only the RTL design, it cannot be known before clock tree synthesis whether the design is synthesizable. The cost of iterative design increases, and design errors take a long time to detect. Before proceeding with the actual CTS process, we can perform a shallow CTS based on the RTL netlist to understand the clock tree structure and where the asynchronous clock signals are delivered. Roughly understanding the results before the CTS process makes quick corrections possible. The use of licensed EDA tools eventually entails a redundant expected cost for users.
To reduce the expected cost and obtain economic benefits, we can use shallow CTS to obtain information such as the approximate timing results and the added area before using the licensed EDA tool. Moreover, the shallow CTS results are helpful for modifying the RTL source iteratively to pass chip verification.
One limitation of our implementation is that its delay calculation is not precise. Qflow, which is an open-source EDA tool, makes elaborate delay adjustments by using the Elmore delay calculation and considers the strength, parasitic elements, and network delay of the front and rear stages. We should improve the delay calculation so that it considers parasitic elements and network delay. Future research should be devoted to developing a precise delay calculation and improving overall performance using graph neural networks and machine learning. The memory usage of the algorithm should also be considered to ensure that it uses minimal memory at runtime. In addition, future research should apply standard cell libraries from various fabrication processes.