Shallow Clock Tree Pre-Estimation for Designing Clock Tree Synthesizable Verilog RTLs

Kwon, Nayoung; Park, Daejin

doi:10.3390/electronics12204340


by Nayoung Kwon and Daejin Park *

School of Electronic and Electrical Engineering, Kyungpook National University, Daegu 41566, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2023, 12(20), 4340; https://doi.org/10.3390/electronics12204340

Submission received: 30 August 2023 / Revised: 15 October 2023 / Accepted: 16 October 2023 / Published: 19 October 2023

(This article belongs to the Section Circuit and Signal Processing)


Abstract

Clock tree synthesis (CTS) is an important process in determining overall chip timing and power consumption. Checking the clock tree with CTS is also time-consuming, and if the chip design does not match its specification, the CTS result will be wrong. Many users rely on licensed electronic design automation (EDA) tools, such as those from Synopsys and Cadence, to carry out accurate chip verification. However, with a licensed EDA tool it is difficult to change a function or to inspect the overall process in detail. If the design is wrong, the expected cost doubles, because the design must be modified and every verification process checked again. Currently, the synthesizability of the clock tree in the placement and route process cannot be checked using only RTL. The main purpose of this study is to roughly predict the CTS result in advance from an RTL source by placing temporary logic through random buffer insertion before the route process; because the CTS result is then known in advance, an incorrectly designed part can be freely modified. Experimental results show that this approach increases the inserted buffer area by about 10%, brings the standard deviation of clock skew to approximately zero after shallow CTS, and increases the clock frequency by about 10%. This paper contributes to optimizing clock tree implementation by conducting the pre-route process before using the CTS tool. Also, our approach not only minimizes resource usage but also optimizes CTS for the RTL structure. It holds considerable value in enhancing the efficiency and performance of integrated circuits.

Keywords:

chip design; clock tree synthesis (CTS); place and route (P&R); licensed electronic design automation tool (EDA); buffer insertion; clock skew

1. Introduction

Chip verification is important in chip manufacturing for detecting faults and improving performance. It takes a long time because the chip design must be verified and revised repeatedly. As the size and complexity of integrated circuit designs increase, the design verification time becomes longer and less predictable. Many users depend on commercial electronic design automation (EDA) tools, such as those from Synopsys and Cadence. Licensed EDA tools come with several disadvantages. They are costly, and the license fees can be a burden for small companies or individuals [1]. Although licensed EDA tools offer many functions, they may not satisfy all user requirements. They offer limited flexibility, and users cannot modify or add features to the software. They do not disclose the underlying algorithms that perform specific functions, such as place and route. Also, users of licensed EDA tools cannot access the source code for debugging and modification [2]. Accurately verifying chips with a licensed EDA tool therefore costs users considerable money and time.

We conduct research to save cost and time in clock tree synthesis (CTS), one of the processes with high user dependency during chip verification. This research focuses on the CTS process within placement and route (P&R), a time-consuming stage of chip verification in which elements are placed and wired to optimize performance, power, and area (PPA). The CTS process is also important for fast and accurate clock signal transmission across the chip, which increases overall chip performance [3,4]. CTS is of paramount importance in controlling overall chip timing in the P&R process, and it determines timing convergence and power consumption. CTS improves energy efficiency by providing the ability to optimize power consumption. Implementing the clock tree with a clock distribution algorithm helps reduce total chip power [5,6]. The clock distribution algorithm should satisfy the clock tree design rule checks for maximum transition, maximum capacitance, and maximum fanout, and it should achieve minimal skew and minimum insertion delay for the target clock tree. It adjusts chip timing by inserting buffers that optimize timing closure for high-fanout signals and reduce skew. However, as the design size of the integrated circuit increases, the number of nodes required to distribute the clock signal increases significantly, which increases the calculation time. In addition, many calculations are needed to obtain an optimal result that satisfies a complex distribution algorithm under various constraints. If the clock tree implementation does not meet the conditions after RTL design, the design has to be modified, which consumes more time and cost because constraints, circuit stability, performance, and other factors must all be considered.

To reduce the cost and time of the CTS process, in prior studies we designed and preconstructed a shallow clock tree based only on a register transfer level (RTL) source, without a licensed EDA tool or any other requirements [7]. Those studies addressed the data path from the input to the D flip-flop and performed only part of the CTS. They were also restricted in general cases, such as large and complex designs. Because the Verilog sources they handled had one output and several inputs, the approach is difficult to apply to general sources with many outputs and inputs. In addition, they solved the hold violation from the input to the D flip-flop, but not the hold violation of the clock path.

To remove these restrictions, we designed a CTS pre-estimation that considers all clock paths from the clock to the D flip-flops' clock sinks, using the general case of a large Verilog source, before the route process. Figure 1 shows how CTS pre-estimation is conducted on an RTL synthesis netlist relative to the CTS of the P&R process. The CTS results of a licensed EDA tool cannot be approximated using only the RTL synthesis netlist, which has no timing or parasitic information. To approximate the CTS result of the licensed EDA tool, this study assumes that the logic has been placed arbitrarily, as it would be after the placement process and before the route process; we refer to this stage as pre-route. This study also introduces a random buffer insertion stage to reproduce the placement process of the licensed EDA tool. After the random buffer insertion stage, the shallow CTS process is executed to reduce clock skew and calibrate total chip timing [8]. In this paper, the shallow clock tree implementation, which optimizes PPA, increases the maximum clock frequency of the chip by about 10% relative to Qflow's CTS result and decreases the standard deviation of clock path delay by about 20% after shallow CTS.

The remainder of this paper is organized as follows. Section 2 introduces related research on CTS algorithms in the P&R process. Section 3 explains the overall process of this study, that is, how the CTS is constructed between the pre-route and route processes. We explain how random buffer insertion is carried out in the pre-route process and how the clock tree implementation algorithms preconstruct the clock tree. In Section 4, we present the details of the experiments, which are based on specific RTL sources and the Taiwan Semiconductor Manufacturing Company Limited (TSMC) 180 nm standard cell library. We analyze the experimental results, which show how closely the heuristic algorithm preconstructs the clock tree. We observe the difference in standard deviation before and after CTS pre-estimation and the reduced clock skew of the clock paths. We compare the maximum clock frequency of shallow CTS with the maximum clock frequency of the open-source EDA tool after CTS. In addition, we evaluate the memory profile of the shallow CTS algorithm for different RTL sources. Finally, Section 5 presents the conclusions and explains future research.

2. Related Research

Clock tree implementation is an important task in the physical design of an integrated circuit. The clock tree delivers the synchronous signal to every sequential cell, such as a D flip-flop. In clock tree implementation, it is important to solve the clock skew problem, that is, the difference in arrival time of the clock at the sequential cells. The CTS process determines the maximum clock frequency and minimum clock skew and affects the overall power of the physical design. Some earlier papers concentrated on minimizing the clock skew between the source and the sinks to achieve minimum clock skew and maximum clock frequency. Soheil Nazar introduced a synchronous minimum-skew clock tree synthesis algorithm for single flux quantum circuits, taking into account splitter delays and placement constraints. The proposed methodology advances the state of the art by incorporating splitter delays and creating a fully balanced clock tree structure in which the number of clock splitters from the clock source to all sink nodes is equal [9]. Jinwei Lu implemented a power-aware clock tree synthesizer and a power- and slew-aware clock tree synthesizer using the Elmore RC delay model [10]. They suggested using a slew-focused lookup table to convey details about the driving capacity when inserting buffers and gates. The implementation minimized power consumption and achieved minimum clock skew while satisfying the slew rate. Yici Cai focused on skew optimization under slew constraints and obstacle avoidance. This includes the clock tree construction phase, which utilizes an obstacle-aware topology generation algorithm called OBB, ensures the balanced insertion of candidate buffer positions, and employs a fast heuristic buffer insertion algorithm [11]. They integrated signal polarity requirements into the fundamental buffer insertion step, ensuring minimal impact on the final results. Overcoming the adverse effects of obstacles on skew is a significant challenge in buffered CTS, and they dedicated considerable time and effort to addressing this issue.

3. Proposed Shallow CTS Preconstruction Process

We explain how to preconstruct the clock tree from the Verilog file and describe the algorithms and functions of the overall process in detail. Figure 2 shows the overall process proposed for shallow CTS pre-estimation. The process is divided into three main stages: RTL synthesis, pre-route, and shallow CTS. The open-source RTL synthesis tool Yosys is used to identify whether the RTL source is synthesizable [12]. Parser-Verilog, a structural Verilog parser, is used to conduct the pre-route and shallow CTS processes [13,14]. Each step is explained in detail below.

3.1. RTL Synthesizable Estimation

The input is an RTL synthesizable Verilog file that uses a high-level description of the design. Usually, the CTS of a licensed EDA tool proceeds from an optimized gate-level representation together with parasitic information, timing information, and so on. In this study, however, we conduct the shallow CTS process at the primitive gate level to determine whether a clock tree can be synthesized, without data such as timing and parasitic information. The original Verilog source is converted to an RTL synthesizable source using Yosys. Yosys maps the design to the standard cell library of a suitable fabrication process and modifies it to meet various constraints, such as safety and timing.

3.2. Route Information Using Parsed Netlist

Before the clock skew and hold violations can be adjusted through buffer insertion, a general tree must be created from the parsed information and every path must be traversed. The route algorithm takes the RTL synthesizable source as input and is implemented, together with the clock tree implementation algorithm, using Parser-Verilog. Based on the parsed netlist, the route algorithm searches and stores route information from the gate with the output port to a gate with the input port. Prior studies built a general tree from a Verilog source with one output port and multiple input ports [15]. They attempted to solve the hold violation on the data path from the input to the D flip-flop, and they used only the existing noncyclic paths of the Verilog source. In that route process, a specific RTL source with one output and multiple inputs was represented as a 2-dimensional vector of route information. It was therefore difficult to apply the prior approach to general cases, as opposed to special sources that have noncyclic paths with one output and multiple inputs.

Figure 3 shows a general case of the Verilog source and explains how route information is formed in the Verilog netlist structure. The synthesizable RTL source forms a general tree structure in which the topmost root node is an output or clock port of the RTL source and the bottom leaf nodes are the inputs or the D flip-flops' input ports. We develop a tree structure that applies to the general case and aim to solve the hold violation on the clock path from the clock to the D flip-flop. A general Verilog file has multiple input ports and multiple output ports, so it can be viewed as consisting of several general tree structures. We look for the rules common to Verilog files that have multiple input and output ports. The parsed Verilog netlist is stored as a 3-dimensional vector that comprises several trees, each with one output and multiple inputs. Each 2-dimensional vector holds the route information from the gate with one output port to the gates with multiple input ports. Each 1-dimensional vector stores the route information from a gate that includes one output port to a gate that includes one input port.
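The three vector levels described above can be pictured with a minimal C++ sketch. The aliases `Path`, `Tree`, and `Netlist`, the helper `countPaths`, and the requirement that every stored path is non-empty are hypothetical choices made for illustration; they are not taken from Parser-Verilog or from the authors' implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical aliases for the three vector levels described in the text.
using Path    = std::vector<std::string>;  // 1-D: one route, from a root-side gate to one leaf-side gate
using Tree    = std::vector<Path>;         // 2-D: every route that shares one output (root)
using Netlist = std::vector<Tree>;         // 3-D: one tree per output or clock port

// Count every root-to-leaf route stored in the netlist.
std::size_t countPaths(const Netlist& netlist) {
    std::size_t total = 0;
    for (const Tree& tree : netlist) {
        for (const Path& path : tree) {
            assert(!path.empty());  // assumption: a stored route holds at least one gate
            ++total;
        }
    }
    return total;
}
```

With this layout, `netlist[t][p][k]` would name the k-th gate on the p-th route of the t-th tree, so a design with several outputs is simply a list of single-output trees.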

Algorithm 1 and Figure 4 represent the search for the clock paths that are interconnected with a clock port; the clock paths are traversed using a recursive DFS algorithm. The complexity of Algorithm 1 is $O(n^{2})$ owing to the recursive DFS. A gate G that includes the clock port or connects directly to the clock is inserted into the $G_{route}$ vector. The algorithm searches G's clock port, $G_{clk}$, and stores the gate's clock port information. It traverses the other gates connected to $G_{clk}$ through the getchildrenclockGate function and stores them in $childG$. If no other gate that includes a clock port exists in the $childG$ vector, $G_{route}$ is stored in $R_{info}$. Otherwise, G is stored in $G_{route}$ and the traversal continues. The vectorized data produced by Algorithm 1 are used in the pre-route and shallow CTS processes.

Algorithm 1: Search route of clock paths using recursive DFS model
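The algorithm listing itself is not reproduced here, so the following C++ sketch illustrates one way a recursive DFS over clock paths could look. It assumes the clock network is acyclic and is given as a hypothetical adjacency map (`ClockGraph`) from a gate name to the gates it drives; `getChildrenClockGate`, `searchClockPaths`, and `findClockPaths` are illustrative names that mirror the roles of getchildrenclockGate, $G_{route}$, and $R_{info}$ in the text, not the authors' code or the Parser-Verilog API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Path = std::vector<std::string>;
// Hypothetical adjacency map: gate name -> gates it drives on the clock network.
using ClockGraph = std::map<std::string, std::vector<std::string>>;

// Stand-in for getchildrenclockGate: gates fed by `gate` on the clock network.
std::vector<std::string> getChildrenClockGate(const ClockGraph& graph,
                                              const std::string& gate) {
    auto it = graph.find(gate);
    if (it == graph.end()) return {};
    return it->second;
}

// Recursive DFS: extend the current route (the role of G_route) and store it in
// `routes` (the role of R_info) once a gate without clock children is reached.
// Assumes the clock network contains no cycles.
void searchClockPaths(const ClockGraph& graph, const std::string& gate,
                      Path& route, std::vector<Path>& routes) {
    route.push_back(gate);
    const std::vector<std::string> children = getChildrenClockGate(graph, gate);
    if (children.empty()) {
        routes.push_back(route);  // leaf reached: one complete clock path
    } else {
        for (const std::string& child : children) {
            searchClockPaths(graph, child, route, routes);
        }
    }
    route.pop_back();  // backtrack before exploring a sibling
}

// Collect every clock path that starts at the given clock port.
std::vector<Path> findClockPaths(const ClockGraph& graph, const std::string& clockPort) {
    assert(!clockPort.empty());  // a named clock port is required
    Path route;
    std::vector<Path> routes;
    searchClockPaths(graph, clockPort, route, routes);
    return routes;
}
```

For a clock port `clk` that drives one buffer with two flip-flops plus a third flip-flop directly, `findClockPaths` would return three paths, one per clock sink, in depth-first order.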

3.3. Pre-Route Process

Initially, during RTL design, synthesis, and placement, the clock signal is in an ideal mode in which no physical clock distribution exists [16]. A clock signal in the ideal mode is assumed to reach the clock pins of all D flip-flops simultaneously. The pre-route process is an arbitrary logic placement before the real route process. After the placement process, the clock is physically connected, and clock skew occurs because each logic element is placed at a different location [17]. This study does not consider timing constraints or parasitic components such as R, C, and cell strength. In addition, it is assumed that random placement in the pre-route process generates the worst clock skew on the clock paths. The random placement process is implemented using random buffer insertion.

Random buffer insertion is essential for expressing the state in which the physical clock is connected and clock skew occurs in the placement process before the route process. Without random placement, no clock skew arises on the clock paths, because this study calculates only each path's static delay from the standard library cells' delays. Figure 5 shows the difference between the clock tree after RTL synthesis and the clock tree after the pre-route process. In Figure 5a, the clock buffer is simply connected to the clock sinks of the D flip-flops. There is no clock skew, because the clock is assumed to arrive at the D flip-flops' clock ports at the same time after RTL synthesis. In contrast, Figure 5b shows the clock skew that exists after random buffers are inserted in the pre-route process. Even so, the buffers are not inserted purely at random; insertion is guided by the number of loads connected to the clock buffer. If many loads are connected to the clock buffer, placing clock buffers with greater strength reduces the clock skew but may enlarge the overall chip area. To prevent this problem, load balancing should be carried out by considering the R and C components of the front and rear stages; however, this study assumes only the standard library cell delay and excludes timing constraints and parasitic components such as R, C, and cell strength. Therefore, random buffer insertion is performed simply according to the number of loads.

3.4. Shallow CTS Process

The shallow CTS process follows the pre-route process, in which arbitrary logic placement and random buffer insertion generate clock skew. The pre-route process represents the state in which the physical clock is connected to the clock sinks of all D flip-flops. When the physical clock signal is applied, it arrives at the clock pins at different times. The clock skew created by this clock uncertainty is corrected through the CTS process, which adjusts the clock skew generated after the physical clock connection. The CTS recognizes the clock signal at the clock source pin and delivers it to thousands of D flip-flops. The CTS forms a buffer tree to match the skew of clock nets and high-fanout nets and to meet design rules such as maximum capacitance and maximum transition time. The shallow CTS process is performed in Parser-Verilog using the RTL synthesizable netlist. In this process, the buffer insertion algorithm is applied to reach timing closure on the clock path.

$$ skew_{i,j} = T_{i} - T_{j} \tag{1} $$

$$ skew^{max} = \max | T_{i} - T_{j} | \tag{2} $$

$$ skew_{i,j} = T_{i} - T_{j} + delay \approx 0 \tag{3} $$

The clock path goes from the clock to a D flip-flop's clock sink. Each path's delay is calculated from the intrinsic delay of the standard cell library, which does not include timing specifications and size. The clock path delay is calculated over the cells along each clock path, and the clock path with the largest delay is selected as the worst clock path delay. The difference between the worst clock path delay and another clock path's delay is defined as clock skew. Figure 6 shows how each clock path resolves its clock skew against the worst clock path. The real clock signal arrives at the clock pins of the many D flip-flops at different times. In Figure 6a, it is assumed that $T_{i}$, the clock path delay of D flip-flop1, is smaller than $T_{j}$, the clock path delay of D flip-flop2. In Figure 6b, to reduce clock latency, the clock skew is reduced by inserting buffers into the other clock paths based on the worst clock path delay [18,19]. In this study, the buffer insertion algorithm aims to reach timing closure.
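The delay and skew definitions above can be made concrete with a short C++ sketch. It sums assumed intrinsic cell delays along each clock path, takes the largest sum as the worst clock path delay, and reports each path's skew as its distance from that worst delay. The function names `pathDelay` and `skewToWorst`, the cell names, and the delay values are hypothetical and chosen only for illustration; they are not values from the TSMC library or the authors' code.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Path = std::vector<std::string>;

// Sum the intrinsic delay of every cell on one clock path.
// `cellDelay` maps a cell type to an assumed intrinsic delay (for example, in ps).
double pathDelay(const Path& cells, const std::map<std::string, double>& cellDelay) {
    double total = 0.0;
    for (const std::string& cell : cells) {
        total += cellDelay.at(cell);  // throws std::out_of_range for an unknown cell
    }
    return total;
}

// Skew of each path measured against the worst (largest-delay) clock path.
std::vector<double> skewToWorst(const std::vector<double>& delays) {
    assert(!delays.empty());
    const double worst = *std::max_element(delays.begin(), delays.end());
    std::vector<double> skew;
    skew.reserve(delays.size());
    for (double d : delays) {
        skew.push_back(worst - d);  // 0 for the worst path itself
    }
    return skew;
}
```

In this sketch a path through one 100 ps buffer and a path through a 100 ps and a 150 ps buffer would have delays of 100 and 250, so the first path carries a skew of 150 that buffer insertion would then need to absorb, in the sense of Equation (3).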

Algorithm 2 and Figure 7 represent the buffer insertion of a clock path. The algorithm calculates each clock path's delay and stores the delay information in $R_{delay}$, using the vectorized data that result from Algorithm 1. The complexity of Algorithm 2 is $O(n^{2})$ because of its nested for loop.

Each clock path's $R_{delay}$ is compared with the maximum $R_{delay}$; if clock skew exists, that is, if there is a difference between $R_{delay}$ and the maximum $R_{delay}$, buffer insertion is performed. The algorithm inserts buffers in descending order of delay. For instance, assume that there is a difference between the worst clock path's delay and another clock path's delay, and that the intrinsic delay of $BUFXN$ is greater than that of $BUFXM$. If clock skew remains, $BUFXN$ is inserted into the clock path. If clock skew still remains after the $BUFXN$ insertion, $BUFXM$ is inserted into the clock path to cover the remaining clock skew.

The overall process synthesizes a clock tree through buffer insertion based on the worst clock skew of the clock paths, using an RTL synthesizable source. The variation of the clock skew across the pre-route and CTS processes is illustrated in Figure 8. In the pre-route process, logic placement, modeled by random buffer insertion, produces the worst clock skew (Figure 8b). After the pre-route process, it is difficult to optimize PPA and reach timing closure because the worst clock skew occurs on the clock paths. The CTS inserts buffers based on the worst clock path to resolve the overall chip timing and PPA problems (Figure 8c). The clock skew after the CTS process is reduced compared to the skew after the pre-route process.

Algorithm 2: Buffer insertion
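Since the algorithm listing is not reproduced here, the following C++ sketch shows one plausible reading of the largest-delay-first buffer insertion described above for a single clock path. The `Buffer` struct, the function `insertBuffers`, and the rule that a buffer is inserted only while its delay fits within the remaining skew (so the padded path never exceeds the worst path) are assumptions made for illustration, not the authors' implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical buffer description: cell name and assumed intrinsic delay.
struct Buffer {
    std::string name;
    double delay;  // must be positive
};

// Greedily pad one clock path. Buffers are tried from largest to smallest
// intrinsic delay; each is inserted repeatedly while it still fits within the
// remaining skew. On return, `skew` holds the residual skew that no available
// buffer could absorb, and the result lists the inserted buffers in order.
std::vector<std::string> insertBuffers(double& skew, std::vector<Buffer> buffers) {
    std::sort(buffers.begin(), buffers.end(),
              [](const Buffer& a, const Buffer& b) { return a.delay > b.delay; });
    std::vector<std::string> inserted;
    for (const Buffer& buf : buffers) {
        assert(buf.delay > 0.0);  // a zero-delay buffer would never reduce the skew
        while (skew >= buf.delay) {
            inserted.push_back(buf.name);
            skew -= buf.delay;
        }
    }
    return inserted;
}
```

Applying such a routine to every clock path in turn, with each path's skew measured against the worst path, gives the nested loop structure that the text associates with the $O(n^{2})$ complexity. The residual skew left after the smallest buffer no longer fits is one reason the result approaches, rather than exactly reaches, zero skew.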

4. Experimental Result and Discussion

The experiments were performed on RTL sources that have many instances after RTL synthesis, in order to evaluate how effectively the method reduces clock skew on large RTL sources. We conducted the experiments using a TSMC 180 nm standard cell library. This study used relatively simple designs with 500 total gates or fewer, such as a 4-bit divider and map9v3 using an 8-bit linear feedback shift register [20]. In addition, we experimented with relatively complex designs with more than 500 total gates and 100 clock sinks. We measured the number of inserted buffers and the standard deviation in the pre-route and CTS processes. We then computed the maximum clock frequency to compare the CTS results with those of Qflow, an open-source EDA tool [21]. Finally, we evaluated the memory usage of the shallow CTS process using the algorithm proposed in this paper.

4.1. Comparison of Pre-Route and Shallow CTS Process

The shallow CTS is an important factor in power consumption because the clock is a major power consumer. Because area and power consumption are related, it matters how much of the overall chip area and power the inserted buffers account for. We computed the percentage of inserted buffers relative to the total number of instances, as shown in Figure 9 and Table 1. The clock buffers added after shallow CTS amounted to less than 10% of the total netlist, so these results do not significantly affect the chip area. The number of clock buffers added in the pre-route process affects how the shallow CTS process implements the clock tree. After RTL synthesis and before pre-route, every branch of the clock tree had the same clock path from the clock source to the sink, because the clock was still ideal and not yet physically connected. Once placement is taken into account, the clock paths are no longer all the same after the pre-route process. The clock tree becomes an asymmetric structure with different clock paths and unequal delays. Even if the number of buffers added in pre-route is small, the asymmetric clock tree structure must be replaced with a symmetric structure. Compared to the number of buffers added after pre-route, the number of buffers inserted after shallow CTS is about two times larger. This means that changing the asymmetric clock tree structure to a symmetric structure and achieving zero clock skew through the shallow CTS process has a high cost in terms of total area.

We measured the clock path delay and its standard deviation after pre-route and after CTS to evaluate the skew of each clock path, as shown in Figure 10 and Table 2. In Figure 10, the x axis represents the path from the clock to a D flip-flop, such as the clock path from the clock to D flip-flop1, and the y axis represents the standard deviation of the clock path delay. Table 2 lists the standard deviation for the pre-route and shallow CTS processes. The average clock path delay after shallow CTS is larger than the average after pre-route, because clock buffers are inserted to achieve zero clock skew. After the shallow CTS, the standard deviation of the clock path skew was reduced to nearly zero.
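As a small illustration of the metric, the C++ sketch below computes the standard deviation of a set of clock path delays. It uses the population form (dividing by the number of paths); whether the paper uses the population or the sample form is not stated, so this choice, like the function name `delayStdDev`, is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Population standard deviation of clock path delays.
// A value of 0 means every clock path has the same delay (zero skew).
double delayStdDev(const std::vector<double>& delays) {
    assert(!delays.empty());
    double mean = 0.0;
    for (double d : delays) {
        mean += d;
    }
    mean /= static_cast<double>(delays.size());

    double variance = 0.0;
    for (double d : delays) {
        variance += (d - mean) * (d - mean);
    }
    variance /= static_cast<double>(delays.size());
    return std::sqrt(variance);
}
```

Under this definition, padding every path up to the worst path delay raises the mean delay while driving the standard deviation toward zero, which matches the trend described for the pre-route and shallow CTS results.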

4.2. Comparison of EDA Tool’s CTS and Shallow CTS

Figure 11 shows how the maximum clock frequency is obtained for the pre-route and shallow CTS processes using the STA result of Qflow. The pre-route process, which is part of our proposed flow, comes before the CTS process. The open EDA tool Qflow executes its CTS in the route process. We analyzed the static timing analysis (STA) result of Qflow in order to make a quantitative comparison. The STA result includes the timing constraints, timing path identification, delay analysis, clock synchronization, and performance analysis. To compare CTS performance, we ran Qflow's back-annotation, which includes the CTS result, and Qflow's STA, which does not include the CTS result. We compared the performance of Qflow's CTS, applied to the netlist after pre-route (which has not yet undergone CTS), with the performance of the shallow CTS, based on the netlist after shallow CTS. To evaluate the proposed CTS algorithm, we aimed for an equivalent comparison against an STA that has undergone Qflow's CTS once. The netlist of the pre-route process went through back-annotation to acquire a post-STA result, which reports the final results after Qflow's CTS. This determined the post-STA maximum clock frequency for the pre-route process. The netlist of the shallow CTS process, in contrast, went through the STA that does not include Qflow's CTS result, because the proposed shallow CTS algorithm had already been applied to it.

We analyzed the maximum clock frequency after pre-route and after shallow CTS because the clock frequency is an important factor in chip performance; the results are shown in Table 3. In general, most RTL sources tend to run at a slower clock frequency after the CTS process. The frequency of the pre-route process, measured in the post-STA, is the result of Qflow's own CTS applied after the pre-route process. The frequency of the shallow CTS result, measured in STA, is the result after the shallow CTS process. The measured clock frequency is faster after the shallow CTS process.

4.3. Memory Profile

Executing the route and buffer insertion algorithms takes a long time because they retrieve all paths. The memory usage of the shallow CTS algorithm is one of the important aspects of the algorithm profile. The complexity of Algorithms 1 and 2 proposed in this paper is $O(n^{2})$. We measured the memory usage of the processes executing Algorithms 1 and 2 to evaluate the performance of shallow CTS. Valgrind is primarily used to track how a program uses memory and to monitor memory usage over time [22]. Valgrind Massif helps identify how programs use heap memory and identify memory leaks. It provides a visual representation of memory usage patterns to help developers troubleshoot memory management issues. Shallow CTS is written in C++ and relies heavily on vectors, so we measured the total heap memory consumption because of the high usage of dynamic allocation. The results are shown in Table 4. Total heap memory consumption increases with the total number of gates. Designs with 500 total gates or fewer used about 1 MB or less of memory, while designs with 10,000 total gates or more used roughly 10 MB or more. Figure 12 shows the heap memory consumption of cordic18×18×18, which has the highest total number of gates, as visualized by Valgrind Massif. Its total gate count exceeds 10,000, and it consumes about 38.1 MB. The vector allocation part consumes the most heap memory.

4.4. Discussion

We developed a shallow CTS implementation specifically designed to minimize design overhead. After the pre-route process, the reduction amounted to approximately 10% of the standard deviation, as demonstrated in Table 2. Furthermore, our algorithm showed notable improvements in the synthesis of clock paths compared to the open-source EDA tool Qflow. This indicates that our approach not only minimizes resource usage but also optimizes CTS for the RTL structure. The memory usage evaluation of the shallow CTS implementation shows that the usage is determined by the total number of gates in the RTL source. However, the memory usage of shallow CTS is not expected to have a significant impact on clock tree implementation, since EDA tools in the chip verification process are typically executed statically, ahead of time. Additionally, we anticipate another advantage in the form of reduced operating power consumption for the microcontroller unit (MCU) through the optimized implementation of the clock tree. This optimization has the potential to greatly enhance the overall power efficiency of the MCU. Overall, our shallow CTS implementation yielded promising results, including a notable reduction in the hardware area, improved clock path synthesis, and the possibility of reducing power consumption. These achievements hold considerable value in enhancing the efficiency and performance of integrated circuits.

5. Conclusions

The results of the experiments show that the proposed shallow CTS algorithm is efficient for preconstructing a clock tree and checking whether the clock tree is synthesizable. The shallow CTS not only clearly reduced the standard deviation relative to pre-route but also yielded a higher maximum clock frequency than the Qflow post-STA result, which is obtained after Qflow's CTS. The experiments verified that the shallow CTS algorithm, which focuses on clock skew, brings the results close to zero clock skew, and that the maximum clock frequency of the shallow CTS process is higher than that of Qflow. Weak points are clearly identified using a robust calculation of the clock path standard deviation. The results mean that the critical path delay on the chip decreases and the standard deviation of the clock path decreases, which increases the clock frequency and optimizes the clock tree structure. On this basis, we conclude that CTS, which is a time-consuming process, can be reviewed in advance as a mock run, pre-estimated by the heuristic algorithm without a licensed EDA tool. When a licensed EDA tool is used, there are problems with the tool cost and the long time required to check the CTS results. In addition, from the RTL design alone, it cannot be known whether the design is synthesizable before clock tree synthesis. The iterative design cost increases, and design errors take a long time to detect. Before proceeding with the actual CTS process, we can perform a shallow CTS based on the RTL netlist to understand the clock tree structure and where the asynchronous clock signals are delivered. Roughly understanding the results before the CTS process makes quick corrections possible. The use of licensed EDA tools eventually entails a redundant expected cost for users. To reduce the expected cost and obtain economic benefits, information such as the approximate timing results and the added area can be obtained through shallow CTS before the licensed EDA tool is used. Moreover, the shallow CTS results are helpful in modifying the RTL source iteratively to match chip verification.

One limitation of our implementation is that its delay calculation is not precise. Qflow, an open-source EDA tool, makes elaborate delay adjustments by using the Elmore delay calculation and by considering the strength, parasitic elements, and network delay of the front and rear ends. Our delay calculation should be improved to consider parasitic elements and network delay. Future research should be devoted to precise delay calculation and overall performance improvement using graph neural networks and machine learning. The memory usage of the algorithm should also be considered so that it uses minimal memory at runtime. In addition, future research should apply standard cell libraries from various fabrication processes.
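For readers unfamiliar with the Elmore delay mentioned above, the following is a minimal sketch of the textbook formula for an RC ladder, where each resistance is multiplied by the total capacitance downstream of it. It is not Qflow's implementation; the function name and the resistance and capacitance values are hypothetical.

```python
def elmore_delay_chain(resistances, capacitances):
    """Elmore delay at the far end of an RC ladder.

    resistances[k] is the series resistance feeding node k, and
    capacitances[k] is the capacitance to ground at node k.
    The delay is sum_k R_k * (C_k + C_{k+1} + ... + C_N).
    """
    if len(resistances) != len(capacitances):
        raise ValueError("need one capacitance per resistance")
    delay = 0
    downstream_cap = 0
    # Walk from the sink back to the driver, accumulating downstream capacitance.
    for r, c in zip(reversed(resistances), reversed(capacitances)):
        downstream_cap += c
        delay += r * downstream_cap
    return delay


# Hypothetical two-segment wire: ohms and femtofarads give femtoseconds.
delay_fs = elmore_delay_chain([100, 200], [10, 20])
print(delay_fs, "fs")  # 100*(10+20) + 200*20 = 7000 fs
```

A delay model of this kind accounts for wire parasitics along each clock path, which a fixed per-buffer delay does not.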

Author Contributions

N.K. designed, implemented, and analyzed the shallow CTS process. D.P. was responsible for conceptualization, supervision, project administration, and funding acquisition, and was the corresponding author. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the BK21 FOUR project (4199990113966), the Basic Science Research Program (NRF-2018R1A6A1A03025109, 20%), (NRF-2022R1I1A3069260) through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, and (2020M3H2A1078119) by Ministry of Science and ICT. This work was partly supported by an Institute of Information and communications Technology Planning and Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2021-0-00944, Metamorphic approach of unstructured validation/verification for analyzing binary code, 10%) and (No. 2022-0-01170, PIM Semiconductor Design Research Center, 10%) and (No. RS-2023-00228970, Development of Flexible SW-HW Conjunctive Solution for On-edge Self-supervised Learning, 10%) and (No. RS-2022-00156389, Innovative Human Resource Development for Local Intellectualization support program, 50%). The EDA tool was supported by the IC Design Education Center (IDEC), Republic of Korea.

Data Availability Statement

Not available.

Conflicts of Interest

The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

P&R	Place and route
CTS	Clock tree synthesis
RTL	Register transfer level
EDA	Electronic design automation
PPA	Performance, power, and area
STA	Static timing analysis

References

Chang, C.C.; Pan, J.; Xie, Z.; Hu, J.; Chen, Y. Rethink before Releasing your Model: ML Model Extraction Attack in EDA. In Proceedings of the 2023 28th Asia and South Pacific Design Automation Conference (ASP-DAC), Tokyo, Japan, 16–19 January 2023; pp. 252–257.
Arias, O.; Liu, Z.; Guo, X.; Jin, Y.; Wang, S. RTSEC: Automated RTL Code Augmentation for Hardware Security Enhancement. In Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), Virtual, 14–23 March 2022; pp. 596–599.
Ray, P.; Prashant, V.S.; Rao, B.P. Machine Learning Based Parameter Tuning for Performance and Power optimization of Multisource Clock Tree Synthesis. In Proceedings of the 2022 IEEE 35th International System-on-Chip Conference (SOCC), Belfast, UK, 5–8 September 2022; pp. 1–2.
Bhaskara, P.; Bharadwaja, P. A Robust CTS algorithm using the H-Tree to minimize local skews of higher frequency targets of the SOC designs. In Proceedings of the 2020 7th International Conference on Smart Structures and Systems (ICSSS), Chennai, India, 23–24 July 2020; pp. 1–5.
Lu, Y.C.; Lee, J.; Agnesina, A.; Samadi, K.; Lim, S.K. GAN-CTS: A Generative Adversarial Framework for Clock Tree Prediction and Optimization. In Proceedings of the 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Westminster, CO, USA, 4–7 November 2019; pp. 1–8.
Chong, A.B. Hybrid Multisource Clock Tree Synthesis. In Proceedings of the 2021 28th IEEE International Conference on Electronics, Circuits, and Systems (ICECS), Dubai, United Arab Emirates, 28 November–1 December 2021; pp. 1–6.
Kwon, N.; Park, D. Lightweight Buffer Insertion for Clock Tree Synthesis Visualization. In Proceedings of the 2022 International Conference on Electronics, Information, and Communication (ICEIC), Jeju, Republic of Korea, 6–9 February 2022; pp. 1–3.
Chakrabarti, P. Clock Tree Skew Minimization with Structured Routing. In Proceedings of the 2012 25th International Conference on VLSI Design, Hyderabad, India, 7–11 January 2012; pp. 233–237.
Shahsavani, S.N.; Pedram, M. A Minimum-Skew Clock Tree Synthesis Algorithm for Single Flux Quantum Logic Circuits. IEEE Trans. Appl. Supercond. 2019, 29, 1303513.
Lu, J.; Chow, W.K.; Sham, C.W. Fast Power- and Slew-Aware Gated Clock Tree Synthesis. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2012, 20, 2094–2103.
Cai, Y.; Deng, C.; Zhou, Q.; Yao, H.; Niu, F.; Sze, C.N. Obstacle-Avoiding and Slew-Constrained Clock Tree Synthesis with Efficient Buffer Insertion. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2015, 23, 142–155.
Open RTL Synthesis Tool: Yosys. Available online: https://github.com/YosysHQ/yosys (accessed on 20 July 2021).
Standalone Structural Verilog Parser: Parser-Verilog. Available online: https://github.com/OpenTimer/Parser-Verilog (accessed on 20 July 2021).
Nn, S.; Vahvale, Y.; Praveena, N.; Mamatha, A. Design and Implementation of 64-bit SRAM and CAM on Cadence and Open-source environment. Int. J. Circuits Syst. Signal Process. 2021, 15, 586–594.
Kwon, N.; Park, D. Lightweighted Shallow CTS Techniques for Checking Clock Tree Synthesizable Paths in RTL Design Time. In Proceedings of the 2022 19th International SoC Design Conference (ISOCC), Gangneung-si, Republic of Korea, 19–22 October 2022; pp. 394–395.
Na, T.; Ko, J.H.; Mukhopadhyay, S. Clock data compensation aware clock tree synthesis in digital circuits with adaptive clock generation. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, Lausanne, Switzerland, 27–31 March 2017; pp. 1504–1509.
Lin, M.; Sun, H.; Kimura, S. Power-efficient and slew-aware three dimensional gated clock tree synthesis. In Proceedings of the 2016 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), Tallinn, Estonia, 26–28 September 2016; pp. 1–6.
Liang, R.; Nath, S.; Rajaram, A.; Hu, J.; Ren, H. BufFormer: A Generative ML Framework for Scalable Buffering. In Proceedings of the 2023 28th Asia and South Pacific Design Automation Conference (ASP-DAC), Tokyo, Japan, 16–19 January 2023; pp. 264–270.
Hyun, G.; Kim, T. Flip-flop State Driven Clock Gating: Concept, Design, and Methodology. In Proceedings of the 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Westminster, CO, USA, 4–7 November 2019; pp. 1–6.
8-bit LFSR: Map9v3. Available online: http://opencircuitdesign.com/qflow/example/map9v3.v (accessed on 24 December 2018).
RTimothyEdwards. Open-Source Digital Synthesis Flow: Qflow. Available online: http://opencircuitdesign.com/qflow/ (accessed on 24 December 2018).
Linux Program Debugging and Profiling: Valgrind. Available online: https://valgrind.org/info/ (accessed on 28 April 2023).

Figure 1. Proposed research background: the correspondence between the original chip verification process and the structure proposed in this paper.

Figure 2. Proposed overall flow: the shallow clock tree synthesis process of the proposed structure in detail, including the placement in the pre-route and shallow CTS processes.

Figure 3. Vector structure: the process of creating RTL synthesis netlists and vectorizing them into general tree structures.

Figure 4. Flow chart of the recursive clock path search: how Algorithm 1 searches and routes clock paths.

Figure 5. Random buffer insertion: how clock buffers are inserted into the 2-D vectorized clock paths during random buffer insertion in the pre-route process.

Figure 6. Clock skew representation, using the D flip-flop as an example. (a) The clock path delay of D flip-flop1 is less than the $T_{j}$ delay, which is the clock path delay of D flip-flop2. (b) Clock skew is reduced by inserting buffers into the other clock paths based on the worst clock path delay.

Figure 7. Buffer insertion algorithm flow chart: how Algorithm 2 inserts clock buffers into the vectorized clock paths.

Figure 8. Difference between the pre-route and CTS processes: a comparison of the clock skew after RTL synthesis, after pre-route, and after shallow CTS.

Figure 9. Comparison of inserted buffers in the pre-route and shallow CTS processes: the share of the total gates occupied by the buffers added after the shallow CTS process.

Figure 10. Comparison of the mean and standard deviation in the pre-route and shallow CTS processes. The standard deviation of the clock skew is reduced to near zero after the shallow CTS process.

Figure 11. Comparison of Qflow STA results for the pre-route and shallow CTS processes: a quantitative comparison, made with the Qflow tool, between the CTS performance of Qflow and the shallow CTS algorithm used in this paper.

Figure 12. Memory usage of cordic18×18×18: the heap memory consumption of cordic18×18×18, which has the highest total number of gates, as visualized by Valgrind Massif.

Table 1. Comparison of the number of inserted buffers in the pre-route and shallow CTS processes.

RTL Source	Total Gate	Clock Sink	Clock Buffers after RTL Synthesis	Clock Buffers after Pre-Route	Clock Buffers after Shallow CTS
map9v3	240	33	5	2	26
divider	283	31	5	6	12
sv chip3	764	101	13	25	33
cordic8×8×8	3463	432	57	69	270
fir3×8×8	3632	148	12	14	63
fir scu	4968	202	14	25	21
iir1	5060	204	8	37	46
rs decoder1	6126	517	64	36	52
iir	10,310	219	15	36	46
cordic18×18×18	19,079	2052	161	80	1571

Table 2. Mean and standard deviation of the clock path delay after the pre-route and shallow CTS processes.

RTL Source	Mean after Pre-Route	Mean after Shallow CTS	Std Deviation ¹ after Pre-Route	Std Deviation ¹ after Shallow CTS
map9v3	0.2934 ns	0.3523 ns	0.0305 ns	0.0027 ns
divider	0.2699 ns	0.3001 ns	0.0390 ns	0.0035 ns
sv chip3	0.2723 ns	0.2966 ns	0.0345 ns	0.0019 ns
cordic8×8×8	0.2508 ns	0.2960 ns	0.0349 ns	$5.551 \times 10^{-17}$ ns
fir3×8×8	0.2688 ns	0.2999 ns	0.0391 ns	0.0034 ns
fir scu	0.2894 ns	0.2973 ns	0.0264 ns	0.0132 ns
iir1	0.3371 ns	0.3537 ns	0.0319 ns	0.0034 ns
rs decoder1	0.2949 ns	0.3023 ns	0.0239 ns	0.0021 ns
iir	0.3355 ns	0.351 ns	0.0295 ns	0.0 ns
cordic18×18×18	0.2424 ns	0.2976 ns	0.0334 ns	0.0029 ns

¹ std deviation: standard deviation.

Table 3. Comparison of STA results for the pre-route and shallow CTS processes.

RTL Source	Clock Frequency after Pre-Route	Clock Frequency after Shallow CTS
map9v3	645.218 MHz	813.366 MHz
divider	369.704 MHz	403.154 MHz
sv chip3	633.82 MHz	651.368 MHz
cordic8×8×8	522.083 MHz	524.893 MHz
fir3×8×8	299.532 MHz	291.951 MHz
fir scu	247.105 MHz	256.333 MHz
iir1	172.538 MHz	187.072 MHz
rs decoder1	184.694 MHz	195.991 MHz
iir	748.013 MHz	769.84 MHz
cordic18×18×18	387.337 MHz	421.836 MHz
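
The relative change in clock frequency for each design follows directly from the two columns of Table 3. The sketch below only reproduces that arithmetic on the tabulated values; the variable and function names are hypothetical, and the averaging at the end is one possible summary, not necessarily the one used in the paper.

```python
# Clock frequencies in MHz, copied from Table 3:
# (after pre-route, after shallow CTS).
TABLE3_MHZ = {
    "map9v3": (645.218, 813.366),
    "divider": (369.704, 403.154),
    "sv chip3": (633.82, 651.368),
    "cordic8×8×8": (522.083, 524.893),
    "fir3×8×8": (299.532, 291.951),
    "fir scu": (247.105, 256.333),
    "iir1": (172.538, 187.072),
    "rs decoder1": (184.694, 195.991),
    "iir": (748.013, 769.84),
    "cordic18×18×18": (387.337, 421.836),
}


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


changes = {name: percent_change(b, a) for name, (b, a) in TABLE3_MHZ.items()}
for name, pct in changes.items():
    print(f"{name:15s} {pct:+6.2f} %")
print(f"{'mean change':15s} {sum(changes.values()) / len(changes):+6.2f} %")
```

A positive value means the clock frequency after shallow CTS is higher than after pre-route for that design.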

Table 4. Total memory heap consumption.

RTL Source	Total Memory Heap Consumption
map9v3	812.7 KB
divider	763.9 KB
sv chip3	1.8 MB
cordic8×8×8	7.1 MB
fir3×8×8	6.3 MB
fir scu	8.6 MB
iir1	8.4 MB
rs decoder1	10.7 MB
iir	15.3 MB
cordic18×18×18	38.1 MB

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
