Timing-Driven Simulated Annealing for FPGA Placement in Neural Network Realization

Yu, Le; Guo, Baojin

doi:10.3390/electronics12173562

by Le Yu * and Baojin Guo

School of Artificial Intelligence, Beijing Technology and Business University, Beijing 100048, China

* Author to whom correspondence should be addressed.

Electronics 2023, 12(17), 3562; https://doi.org/10.3390/electronics12173562

Submission received: 1 August 2023 / Revised: 15 August 2023 / Accepted: 21 August 2023 / Published: 23 August 2023

(This article belongs to the Special Issue FPGA-Based Deep Neural Network Accelerators Using Emerging Technologies)

Abstract:

The simulated annealing algorithm is an extensively utilized heuristic method for heterogeneous FPGA placement. As the application of neural network models on FPGAs proliferates, new challenges emerge for the traditional simulated annealing algorithm in terms of timing. These challenges stem from large circuit sizes and high heterogeneity in the block proportions typical in neural networks. To address these challenges, this study introduces a timing-driven simulated annealing placement algorithm. This algorithm integrates cluster criticality identification during the cluster selection phase, which enhances the probability of high-criticality cluster selection. In the cluster movement phase, the proposed method employs an improved weighted center movement for high-criticality clusters and a random movement strategy for other clusters. Experimental evidence demonstrates that the proposed placement algorithm decreases the average wire length by 1.52% and the average critical path delay by 5.03%. This improvement in performance is achieved with a marginal increase of 5.01% in runtime, as compared to VTR8.0.

Keywords:

FPGA; EDA; placement; simulated annealing; timing-driven

1. Introduction

Field-programmable gate arrays (FPGAs) have become an attractive solution for implementing neural network models due to their ability to implement custom precision data paths and their lower development costs compared to those of custom application-specific integrated circuits (ASICs) [1]. FPGA-based acceleration for neural network applications is an emerging area of research focus. Given the expansive scale of neural network circuits, minimizing critical path delay is of paramount importance.

Heterogeneous FPGAs, which contain more programmable units and incorporate heterogeneous modules such as block RAMs (BRAMs) and digital signal processing blocks (DSPs), offer advantages over traditional FPGAs [2]. Electronic design automation (EDA) software, used for implementing circuits on FPGAs, is constantly evolving to adapt to new architectures and application requirements. Placement is a crucial step in the FPGA EDA process, as its results determine the position of blocks and affect wire length and critical path delay after routing. Simulated annealing is a widely used heuristic method for FPGA placement. However, as circuit sizes on chips continue to grow, placement time increases. By optimizing the movement process of programmable units in the simulated annealing method, it is possible to accelerate the convergence speed of FPGA placement and optimize timing, wire length, and other indicators. This has become the main direction for optimizing simulated annealing algorithms.

Ref. [3] proposed two timing-optimized movement strategies: weighted center movement and weighted boundary movement. Weighted center movement calculates the center area based on the position of the nets connected by the cluster to be moved and moves the cluster to a random position within this area. Weighted boundary movement calculates the center area based on the bounding box of the nets connected by the cluster to be moved and moves the cluster to a random position within this area. Compared to random movement, these two methods can accelerate the convergence of placement in terms of timing, but there are still some problems when applied to neural network circuits.

The implementation of neural network circuits requires a large amount of logic resources. This increases the number of blocks the placer has to process. As a result, the probability of selecting high-delay paths is reduced, which slows down timing convergence. In advanced neural networks, about 80–90% of the operations are matrix multiplication [4]. This results in a high proportion of heterogeneous blocks in the circuit, which affects the placer’s search space. It brings new challenges to placement. To address these issues, we propose a timing-driven simulated annealing algorithm. Our main contributions are as follows:

(1): At the cluster selection stage, clusters are labeled according to their criticality and the probability of selecting highly critical clusters is increased.
(2): During a direct move, an improved weighted center movement is adopted for clusters with high criticality, while a random movement strategy is used for other clusters.

2. Related Work

The two main placement algorithms that have achieved good results and are widely used in industry are analytical algorithms and simulated annealing algorithms.

Analytical algorithms were first applied in the ASIC field and have been increasingly applied to FPGAs in recent years. Based on analytical placement techniques, the FPGA placement problem is modeled as a continuous optimization problem, and the optimal solution of the placement problem is obtained by gradient descent. Analytical placement generally includes three main stages: global placement, legalization, and detailed placement [5]. Global placement calculates the optimal position of each module with wire length and other indicators as optimization goals, ignoring the nonoverlapping constraints of blocks. Then, legalization removes the overlapping parts of blocks and places them in their respective positions to minimize displacement from global placement. Detailed placement further improves the quality of the solution. Analytical placers such as GPlace [6] and UTPlaceF [7] perform well in global placement. To further optimize FPGA placement results, analytical placers usually require a detailed placement step. Detailed placement is implemented by low-temperature simulated annealing algorithms [8] or reinforcement learning algorithms [9]. With the addition of heterogeneous blocks, chip structures become more complex. The Liquid [10] placer uses an analytical algorithm for global placement and then uses simulated annealing to legalize the position of heterogeneous modules. In the commercial field, analytical placement is the main technique used in Xilinx Vivado’s placement tool [11].

In the field of simulated annealing, VPlace [12] applies an improved simulated annealing algorithm to island-style FPGAs. The T-VPlace algorithm [13] adds timing factors to VPlace, reducing path delay by an average of 30% compared to VPlace. Vorwerk et al. [14] proposed multiple movement strategies that move blocks so that placement converges faster to better-quality solutions. Compared with traditional simulated annealing algorithms, this method can significantly improve critical path delay and total wire length within the same computation time. Intel’s Quartus II placer uses more than 10 different directed moves, but the mechanism for selecting among them during placement has never been disclosed [15]. Yuan et al. [16] introduced the concept of range constraints into simulated annealing algorithms to improve movement strategies and limit exchange distances, thereby avoiding unnecessary spatial exploration and converging faster to near-optimal solutions. VTR [17], an open-source academic EDA tool, has improved the exchange strategy in the simulated annealing algorithm, which enlarges the exchange space and raises the success rate for heterogeneous modules. Elgammal et al. [3] introduced reinforcement learning into simulated annealing algorithms, reducing critical path delay by 11% and runtime by 33–50% compared to VTR8.0 [17].

Analytical methods offer high quality and fast convergence for placement problems. However, due to differences in chip architectures, it is challenging to abstract them into universal mathematical problems. Additionally, detailed placement using analytical algorithms often requires implementation via simulated annealing algorithms. For heterogeneous FPGAs, which have more complex architectures than traditional FPGA chips, simulated annealing algorithms are more suitable for placement. Nevertheless, the presence of numerous programmable units and the connection of some heterogeneous modules to many nets in heterogeneous FPGAs presents new challenges for movement strategies in simulated annealing algorithms.

3. Placement Methods

The placement algorithm presented in this paper utilizes a simulated annealing approach, as illustrated in Figure 1. Initially, the placer establishes the placement of clusters within the cluster-level netlist and sets the initial temperature and exchange radius. Subsequently, the algorithm enters the simulated annealing loop. The placer comprises two nested loops: the outer loop updates the temperature and radius while assessing whether the termination criteria have been met, whereas the inner loop selects a cluster and a position within the range specified by the placer. If the chosen position is unoccupied, the cluster is directly moved to that location. Otherwise, the selected cluster is swapped with the cluster occupying the chosen position.
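The nested-loop structure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' placer: the cost is a plain half-perimeter wire length, the cooling schedule, radius update, and every default value are hypothetical, and the helper names `hpwl`, `_apply`, and `anneal` are invented for this example.

```python
import math
import random


def hpwl(place, nets):
    """Sum of half-perimeter bounding-box lengths over all nets."""
    total = 0
    for net in nets:
        xs = [place[c][0] for c in net]
        ys = [place[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def _apply(place, occupied, c, src, dst, other):
    """Move cluster c from src to dst; if dst held `other`, swap them."""
    place[c] = dst
    occupied[dst] = c
    if other is None:
        del occupied[src]
    else:
        place[other] = src
        occupied[src] = other


def anneal(place, nets, width, height, t=5.0, t_min=0.01,
           cooling=0.9, moves=200, seed=0):
    """Toy annealing placer; all defaults are illustrative assumptions."""
    rng = random.Random(seed)
    names = list(place)
    occupied = {pos: c for c, pos in place.items()}
    rlim = max(width, height)
    cost = hpwl(place, nets)
    while t > t_min:                      # outer loop: temperature and radius
        for _ in range(moves):            # inner loop: one proposed move
            c = rng.choice(names)
            old = place[c]
            new = (min(width - 1, max(0, old[0] + rng.randint(-rlim, rlim))),
                   min(height - 1, max(0, old[1] + rng.randint(-rlim, rlim))))
            if new == old:
                continue
            other = occupied.get(new)     # None means the site is free
            _apply(place, occupied, c, old, new, other)
            new_cost = hpwl(place, nets)
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost           # accept the move or swap
            else:
                _apply(place, occupied, c, new, old, other)  # undo
        t *= cooling
        rlim = max(1, int(rlim * cooling))
    return place, cost
```

Calling `anneal` on a dictionary that maps cluster names to distinct grid sites returns the final placement and its cost. A free target site leads to a direct move, and an occupied one leads to a swap, mirroring the two cases in the text.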

3.1. Cluster Selection

After the initial placement, clusters have their initial positions on the chip. The placement tool generates a timing graph based on the clusters’ positions and estimates the delay of the netlist wires. First, the placement tool abstracts the timing graph into a slack matrix, which is expressed as follows:

$$slack(i, j) = T_{req}(j) - T_{arr}(i) - T_{del}(i, j) \tag{1}$$

In the equation, $slack(i, j)$ represents the slack time, $T_{del}(i, j)$ represents the delay between nodes $i$ and $j$, $T_{arr}(i)$ represents the earliest arrival time of the signal at node $i$, and $T_{req}(j)$ represents the latest time at which the signal can be received at node $j$. After obtaining the slack matrix, the placement tool calculates the critical path of the circuit based on the matrix and derives the criticality using Formula (2), where $D_{max}$ is the delay of the critical path and $Crit(i, j)$ is the criticality of the net.

$$Crit(i, j) = 1 - \frac{slack(i, j)}{D_{max}} \tag{2}$$
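As a worked illustration of Formulae (1) and (2), the sketch below computes arrival times, required times, slack, and criticality on a small timing graph. It assumes the graph is acyclic, that the nodes are supplied in topological order, and that $D_{max} > 0$. The function name `analyze` and the example delays are hypothetical.

```python
def analyze(order, edges):
    """Return (slack, crit, d_max) for a DAG.

    order: nodes in topological order (assumption of this sketch).
    edges: {(i, j): delay} for each connection i -> j.
    """
    # Forward pass: earliest arrival time at every node.
    t_arr = {n: 0.0 for n in order}
    for n in order:
        for (i, j), d in edges.items():
            if i == n:
                t_arr[j] = max(t_arr[j], t_arr[i] + d)
    d_max = max(t_arr.values())           # critical path delay

    # Backward pass: latest required time at every node.
    t_req = {n: d_max for n in order}
    for n in reversed(order):
        for (i, j), d in edges.items():
            if j == n:
                t_req[i] = min(t_req[i], t_req[j] - d)

    # Formula (1) and Formula (2).
    slack = {(i, j): t_req[j] - t_arr[i] - d for (i, j), d in edges.items()}
    crit = {e: 1.0 - s / d_max for e, s in slack.items()}
    return slack, crit, d_max


edges = {("a", "b"): 2.0, ("a", "c"): 1.0, ("b", "d"): 3.0, ("c", "d"): 1.0}
slack, crit, d_max = analyze(["a", "b", "c", "d"], edges)
```

In this example the path a, b, d is critical (delay 5), so its two connections have zero slack and criticality 1, while the connections through c have slack 3 and criticality 0.4.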

This simulated annealing-based placement algorithm has two ways of selecting clusters: (1) randomly selecting a cluster from the cluster-level netlist and (2) randomly selecting a cluster located on the high-delay path. The first strategy has a wider coverage range but slower convergence speed, while the second strategy optimizes for high-delay clusters and has a faster convergence speed but a smaller coverage range. In this paper, we optimize the strategy for selecting clusters.

Firstly, the clusters in the cluster-level netlist are marked with their criticality. As shown in Figure 2, a net is composed of a source and one or multiple sinks corresponding to a bounding box. Different sinks correspond to their own path delays. A cluster often connects multiple nets, located on multiple paths, and has multiple bounding boxes. For a sink, only its corresponding source position affects the path delay; for a source, the paths that affect the delay are all the sinks it connects to. The placement tool traverses all low-fanout nets currently connected to the cluster, finds the pin corresponding to the maximum delay path, and uses its timing criticality as the cluster’s criticality. In Figure 2, the timing criticality of the blue module is 0.8.

In the cluster-level netlist, the proportion of clusters on high-delay paths is low, so these clusters have a low probability of being selected during the cluster selection stage. Taking bnn as an example, Figure 3 shows how the proportion of clusters with a criticality greater than 0.9 changes during the placement process. As can be seen from the figure, such clusters account for a relatively small proportion of the netlist, mostly within 0.2%. As the circuit grows larger, the proportion of high-criticality clusters is usually smaller. In this paper, during the outer loop, the clusters are divided into two sets: one containing the clusters with a criticality greater than $crit\_sel$ and the other containing all clusters. When selecting clusters in the inner loop, a cluster is selected from the high-criticality set with a probability of $\lambda$, and a cluster is randomly selected from the whole cluster-level netlist with a probability of $(1 - \lambda)$, as shown in Formula (3), where $clusters$ are all clusters in the cluster-level netlist and $crit\_clusters$ are the clusters with high criticality in the cluster-level netlist.

$$p = \begin{cases} \lambda, & block \in crit\_clusters \\ 1 - \lambda, & block \in clusters \end{cases} \tag{3}$$
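The two-set selection of Formula (3) can be sketched as below. The function names and the fallback to the full set when no cluster exceeds the threshold are assumptions of this example, not details taken from the paper.

```python
import random


def split_critical(criticality, crit_sel):
    """Clusters whose criticality exceeds crit_sel (done once per outer loop)."""
    return [c for c, value in criticality.items() if value > crit_sel]


def select_cluster(clusters, crit_clusters, lam, rng):
    """Pick from the critical set with probability lam, else from all clusters."""
    if crit_clusters and rng.random() < lam:
        return rng.choice(crit_clusters)
    return rng.choice(clusters)


criticality = {"c0": 0.95, "c1": 0.30, "c2": 0.10, "c3": 0.92}
crit_clusters = split_critical(criticality, crit_sel=0.9)
rng = random.Random(0)
picked = select_cluster(list(criticality), crit_clusters, lam=0.5, rng=rng)
```

Because the second branch draws from all clusters, a high-criticality cluster can still be chosen there, so its overall selection probability is somewhat higher than $\lambda$ divided by the size of the critical set.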

3.2. Direct Move

The traditional placement approach selects a previously chosen cluster as the center point and employs the simulated annealing algorithm’s radius as the exchange radius. Within this radius, a random position is selected for swapping. This strategy offers the advantage of a broad search range, but its drawback lies in its lack of timing optimization due to the randomness of the exchange. The weighted boundary movement strategy considers only the boundary pin criticality of each net, disregarding the criticality of pins within the boundary. For net-based timing optimization, although all sink pins within the net are taken into account, the calculation of the center point can be easily influenced by low-delay pins.

The timing optimization strategy has a limited swapping radius, which can result in an insufficient search space when moving the entire cluster-level netlist. This can lead to the outcome being heavily influenced by the initial placement and becoming trapped in a local optimum. For modules with many pins and connections, such as DSP and RAM, the center region calculated by the timing optimization strategy is often located in the middle of the chip. When there are many DSP and RAM modules in the circuit, these modules may not be placed in suitable positions due to the limited search space during placement. Additionally, calculating the position of the central region based on the connected nets or bounding boxes of the current cluster increases computational complexity.

To address these issues, this paper proposes a delay-based movement strategy. For clusters with high path criticality, a timing-optimized movement strategy is used. For clusters with low path criticality, the cluster is used as the center point and a random position is selected around it for movement. This approach can change the direction of the search and break out of local optima when necessary. The algorithmic process is shown in Algorithm 1.

Algorithm 1 Delay-based movement strategy.

select a cluster-level module $b_i$
calculate the criticality $t_i$ of $b_i$
if $t_i > crit\_move$ then
    center$(x_m, y_m)$ = calculate_center($b_i$)
else
    center$(x_m, y_m)$ = get_coordinates($b_i$)
end if
compress the coordinates of the chip
define the center area based on the center point
randomly select a position within the center area and move $b_i$ to it
return the cluster with its position information

Clusters with a criticality greater than $crit\_move$ are defined as high-criticality clusters, while other clusters are considered low-criticality clusters. For high-criticality clusters, an improved weighted center movement method is proposed. This method finds all low-fanout nets connected to the current cluster and uses the pin positions corresponding to these nets to calculate a weighted center point. Connections with low criticality receive a weight of 0. The center point is calculated as follows.

$$X_c = \frac{\sum_{i \in in\_nets} x_i \times w_i + \sum_{i \in out\_nets} \sum_{j \in sinks(i)} x_j \times w_j}{\sum_{i \in in\_nets} w_i + \sum_{i \in out\_nets} \sum_{j \in sinks(i)} w_j} \tag{4}$$

Here, $x_i$ is the x-coordinate of the source end of net $i$, $sinks(i)$ is the set of coordinates of the sink ends of net $i$, $in\_nets$ is the set of nets for which the current cluster is a sink end, $out\_nets$ is the set of nets for which the current cluster is the source end, and $w_j$ is the weight of the connection between the moving block and pin $j$. The calculation of $Y_c$ is similar. The weight $w_j$ is given by Formula (5), where $crit(j)$ is the criticality of pin $j$.

$$w_j = \begin{cases} 0, & crit(j) < 0.2 \\ crit(j), & crit(j) \geq 0.2 \end{cases} \tag{5}$$
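A small sketch of Formulae (4) and (5) follows. The data layout (pins as `(x, y, crit)` tuples) and the choice to return `None` when every weight is zero are assumptions made for this illustration. The paper does not say how a zero total weight is handled.

```python
def pin_weight(crit, threshold=0.2):
    """Formula (5): connections below the threshold get weight 0."""
    return crit if crit >= threshold else 0.0


def weighted_center(in_net_sources, out_net_sinks):
    """Formula (4) for both axes.

    in_net_sources: [(x, y, crit)] source pins of nets that drive the cluster.
    out_net_sinks:  one list of (x, y, crit) sink pins per net driven by it.
    """
    pins = list(in_net_sources)
    for sinks in out_net_sinks:
        pins.extend(sinks)
    weights = [pin_weight(crit) for _, _, crit in pins]
    total = sum(weights)
    if total == 0.0:
        return None                       # assumption: no critical connection
    x_c = sum(x * w for (x, _, _), w in zip(pins, weights)) / total
    y_c = sum(y * w for (_, y, _), w in zip(pins, weights)) / total
    return x_c, y_c


center = weighted_center(
    in_net_sources=[(0, 0, 0.9)],
    out_net_sinks=[[(10, 0, 0.1), (4, 8, 0.3)]],
)
```

Here the sink at (10, 0) has criticality 0.1 and is ignored, so the center is pulled only by the pins with weights 0.9 and 0.3, giving approximately (1.0, 2.0).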

If the criticality of a cluster does not exceed $crit\_move$, it is considered a low-delay-path cluster in the current iteration, with a low probability of becoming a high-delay-path cluster after movement. As such, this paper sets the center point coordinates of these clusters to their current positions.

In traditional placement methods, the center range is determined based on the distribution of resources on the chip. For circuits with heterogeneous modules, different clusters may be located far apart from each other. When the exchange radius is below a certain value, the position of the cluster can become locked. To avoid this issue, this paper employs the coordinate compression method described in [17] after obtaining the center point coordinates. The placement tool compresses the coordinates and retains only clusters of the same type as the current cluster, as shown in Figure 4. This ensures that the placement tool can quickly find adjacent blocks of the same type.
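The idea behind coordinate compression can be illustrated with a short sketch. It is not the VTR implementation described in [17]. It only assumes that blocks of one type sit in a known set of columns and that the swap radius is counted in compressed steps rather than in grid units. The function names and column positions are hypothetical.

```python
from bisect import bisect_left


def nearest_index(cols, x):
    """Index of the column in sorted `cols` closest to x (ties go left)."""
    k = bisect_left(cols, x)
    if k == 0:
        return 0
    if k == len(cols):
        return len(cols) - 1
    return k if cols[k] - x < x - cols[k - 1] else k - 1


def candidate_columns(type_columns, x_center, radius):
    """Columns of one block type within `radius` steps in compressed space."""
    cols = sorted(set(type_columns))      # compressed axis: index -> real x
    k = nearest_index(cols, x_center)
    return cols[max(0, k - radius):k + radius + 1]


# DSP columns far apart on the real grid become neighbours after compression.
dsp_columns = [6, 18, 30, 42]
nearby = candidate_columns(dsp_columns, x_center=16, radius=1)
```

With a radius of 1 on the real grid no other DSP column would be reachable from x = 16, whereas in compressed space the adjacent columns at 6, 18, and 30 are all candidates.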

After coordinate compression, the placement tool extends the swap radius in all directions from the center point to form a bounding box. In the initial stage of the placement, the size of the bounding box is given by Formula (6), where $W$ is the width of the chip, $H$ is the height of the chip, $x_m$ and $y_m$ are the calculated center point coordinates, and $rlim\_dm$ is the movement radius. $x_l$, $x_r$, $y_u$, and $y_d$ are the left, right, top, and bottom coordinates of the bounding box, respectively.

$$\begin{cases} x_l = \max(0, x_m - rlim\_dm) \\ x_r = \min(W, x_m + rlim\_dm) \\ y_u = \min(H, y_m + rlim\_dm) \\ y_d = \max(0, y_m - rlim\_dm) \end{cases} \tag{6}$$
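Formula (6) amounts to clamping a square around the center point to the chip boundary. The sketch below assumes integer grid coordinates and an integer radius so that a site can be drawn with `random.randint`. The function names are illustrative.

```python
import random


def bounding_box(x_m, y_m, rlim_dm, width, height):
    """Formula (6): square of half-width rlim_dm, clamped to the chip."""
    x_l = max(0, x_m - rlim_dm)
    x_r = min(width, x_m + rlim_dm)
    y_d = max(0, y_m - rlim_dm)
    y_u = min(height, y_m + rlim_dm)
    return x_l, x_r, y_d, y_u


def random_position(box, rng):
    """Uniformly pick an integer site inside the box (bounds inclusive)."""
    x_l, x_r, y_d, y_u = box
    return rng.randint(x_l, x_r), rng.randint(y_d, y_u)


box = bounding_box(x_m=2, y_m=9, rlim_dm=3, width=10, height=10)
site = random_position(box, random.Random(0))
```

For a center at (2, 9) with radius 3 on a 10 by 10 chip, the left edge is clamped to 0 and the top edge to 10, giving x in [0, 5] and y in [6, 10].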

In the late placement stage, to limit the movement range of blocks, the position of the bounding box is determined by the current position of the block. The x-coordinates are calculated by Formulae (7) and (8), where $x_b$ is the x-coordinate of the current position of the block. The calculation method for the y-coordinates is the same. For high-criticality clusters, the movement radius $rlim\_dm$ is given by Formula (9).

$$x_l = \begin{cases} \max(0, x_b - rlim\_dm), & x_m < x_d \\ x_b, & x_m \geq x_d \end{cases} \tag{7}$$

$$x_r = \begin{cases} x_b, & x_m < x_d \\ \min(W, x_b + rlim\_dm), & x_m \geq x_d \end{cases} \tag{8}$$

$$rlim\_dm = \begin{cases} 3.0, & rlim\_dm > 3.0 \\ rlim\_dm, & rlim\_dm \leq 3.0 \end{cases} \tag{9}$$
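The late-stage rule can be sketched as a one-sided range that starts at the block and extends toward the center point. The text does not define the symbol $x_d$ used in the conditions of Formulae (7) and (8). This sketch assumes it denotes the block's current x-coordinate $x_b$, which is an interpretation rather than something the paper states. The function names are hypothetical.

```python
def clamp_radius(rlim_dm, cap=3.0):
    """Formula (9): the radius for high-criticality clusters never exceeds 3.0."""
    return min(rlim_dm, cap)


def late_stage_x_range(x_b, x_m, rlim_dm, width):
    """Formulae (7) and (8), reading the undefined x_d as x_b (assumption)."""
    if x_m < x_b:
        # Center lies to the left: search only between x_b - r and x_b.
        return max(0, x_b - rlim_dm), x_b
    # Center lies at or to the right: search between x_b and x_b + r.
    return x_b, min(width, x_b + rlim_dm)


left_case = late_stage_x_range(x_b=5, x_m=2, rlim_dm=3, width=10)
right_case = late_stage_x_range(x_b=5, x_m=8, rlim_dm=3, width=10)
```

Under this reading, a block at x = 5 with its center at x = 2 may move only within [2, 5], and with its center at x = 8 only within [5, 8].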

In order to increase the search space and reduce the amount of computation, the placer adopts a random movement strategy for clusters with low criticality. The size of the bounding box is given by Formula (6). The placer randomly selects a position within the bounding box and moves the cluster there. The cost of the swap is determined by the cost function.

3.3. Cost Function

This paper uses a cost function in simulated annealing consisting of two parts: timing cost and wire length cost. The timing cost is the estimated delay of all paths in the circuit, while the wire length cost is the estimated total wire length in the circuit. During the placement process, the cost function is expressed as a cost difference, denoted as $\Delta Cost$. The change in timing cost is represented as $\Delta Timing\_Cost$, the change in wire length cost is represented as $\Delta Wiring\_Cost$, and the weight factor $\alpha$ adjusts the balance between the timing and wire length costs.

$$\Delta Cost = \alpha \times \frac{\Delta Timing\_Cost}{Previous\_Timing\_Cost} + (1 - \alpha) \times \frac{\Delta Wiring\_Cost}{Previous\_Wiring\_Cost} \tag{10}$$
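A brief numerical illustration of Formula (10) follows. The input values and the default $\alpha = 0.5$ are hypothetical and chosen only to show how the two normalized terms combine.

```python
def delta_cost(d_timing, d_wiring, prev_timing, prev_wiring, alpha=0.5):
    """Formula (10): weighted sum of the normalized cost changes."""
    return (alpha * d_timing / prev_timing
            + (1.0 - alpha) * d_wiring / prev_wiring)


# Timing improves by 2% while wire length worsens by 1%.
example = delta_cost(d_timing=-2.0, d_wiring=10.0,
                     prev_timing=100.0, prev_wiring=1000.0, alpha=0.5)
```

With equal weighting, the 2% timing gain outweighs the 1% wire length loss and the combined change is about -0.005, so the move counts as an improvement. Normalizing by the previous costs keeps the two terms comparable even though they are measured in different units.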

Estimating the length of each wire in the network can increase computational complexity, and the results are often not ideal due to the uncertainty of routing and the ability of the network to share nodes. To address this issue, this paper uses the half-perimeter model commonly used in placement algorithms to estimate wire length. The model is expressed as Formula (11), in which $bbx(i)$ and $bby(i)$ are the boundary lengths of net $i$ on the x and y axes, and $chanx(i)$ and $chany(i)$ are the average number of channels occupied by net $i$ on the x and y axes, respectively. $cross(i)$ is the length compensation value, whose magnitude is determined by the fanout of the net. The value of $cross(i)$ is taken from [18]. It is multiplied by the bounding box of the net to better estimate the wire length of higher-fanout nets.

$$Wiring\_Cost(i) = cross(i) \times \left(\frac{bbx(i)}{chanx(i)} + \frac{bby(i)}{chany(i)}\right) \tag{11}$$
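Formula (11) can be sketched for a single net as below. Two assumptions are made: the boundary length on each axis is taken as the maximum minus the minimum pin coordinate, and $cross(i)$ is passed in by the caller, since the fanout-dependent values from [18] are not reproduced here.

```python
def wiring_cost(pins, cross=1.0, chan_x=1.0, chan_y=1.0):
    """Formula (11) for one net.

    pins:   [(x, y)] positions of the net's source and sinks.
    cross:  fanout-dependent compensation factor (supplied by the caller).
    chan_x, chan_y: average channel counts along each axis.
    """
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    bbx = max(xs) - min(xs)               # assumed definition of bbx(i)
    bby = max(ys) - min(ys)               # assumed definition of bby(i)
    return cross * (bbx / chan_x + bby / chan_y)


net_pins = [(1, 1), (4, 3), (2, 6)]
plain = wiring_cost(net_pins)
scaled = wiring_cost(net_pins, cross=1.2, chan_x=2.0, chan_y=2.5)
```

The bounding box of this net spans 3 units in x and 5 units in y, so the uncompensated estimate is 8.0. A compensation factor above 1 raises the estimate for nets whose fanout makes the half-perimeter an underestimate.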

The timing cost is determined by two parts: node delay and path criticality, as expressed in Formula (12). $con\_d(i, j)$ represents the delay between nodes $i$ and $j$, while $Crit(i, j)$ represents the criticality of the path to which nodes $i$ and $j$ belong.

$$Timing\_Cost(i, j) = con\_d(i, j) \times Crit(i, j) \tag{12}$$

If $\Delta Cost < 0$, the new solution is better than the current one and is accepted. If $\Delta Cost > 0$, the new solution is accepted with a probability of $e^{-\Delta Cost / T}$, where $T$ is the current temperature.
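The timing cost of Formula (12) and the acceptance rule can be sketched together. Summing the per-connection terms into one total and accepting a zero cost change are assumptions of this example. The function names are illustrative.

```python
import math
import random


def timing_cost(connections):
    """Sum of Formula (12) over connections given as (delay, criticality)."""
    return sum(delay * crit for delay, crit in connections)


def accept(delta, temperature, rng):
    """Accept improvements outright, worse moves with probability exp(-delta/T)."""
    if delta < 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


total = timing_cost([(2.0, 1.0), (1.0, 0.4)])
rng = random.Random(0)
improving = accept(-0.01, temperature=1.0, rng=rng)
hopeless = accept(1e9, temperature=1e-3, rng=rng)
```

At high temperature the exponential is close to 1 and most uphill moves are accepted, which lets the placer escape local optima. As the temperature falls, the same cost increase is accepted less and less often.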

4. Experimental Results

The algorithm presented in this paper was implemented and tested on a Linux system equipped with an Intel i5-1135G7 CPU and 16 GB of RAM. The target architecture used in this work is based on the k6FracN10LB_mem20K_complexDSP_customSB_22nm architecture provided by VTR, with modifications made to the distribution of DSPs and RAMs. Testing is conducted on circuits from the Koios benchmarks [19]. The Koios benchmarks are a set of deep learning benchmarks for DL-related architecture and CAD research. There are 20 designs, including 12 medium-sized benchmarks and 8 large benchmarks. The design covers a wide variety of accelerated neural networks, design sizes, implementation styles, abstraction levels, and numerical precision values.

Table 1 shows some parameters and parameter values used in this experiment, and the values are obtained through verification in VTR8.0.

This paper compares random selection with the selection strategy proposed here and tests the influence of the two parameters of the selection strategy on the critical path delay. To reduce the placement runtime, the timing graph is updated only in the outer loop. If $\lambda$ is too high, there may be overfitting to timing, which can affect the optimization of low-delay paths. If $\lambda$ is too low, the timing optimization effect is weak. In this paper, $\lambda$ is set to 0.1%, 0.5%, and 1%, respectively, to study the influence of crit1 changes on critical path delay. Figure 5 shows the influence of $crit\_sel$ and $\lambda$ changes on critical path delay; the test set is the medium-sized circuits in Koios. As can be seen from the figure, the critical path delay is optimized as $crit\_sel$ and $\lambda$ change. When $\lambda$ is 0.5, the average optimization rate of critical path delay is the highest, and better results are achieved when $crit\_sel$ is 0.9, with an optimization rate of 6.65% for critical path delay. Therefore, this paper sets $\lambda$ to 0.5 and $crit\_sel$ to 0.9.

To evaluate the effectiveness of the delay-based movement strategy, this study introduced a random movement strategy in addition to the weighted boundary and weighted center movement strategies. Specifically, clusters with high criticality were moved using either the weighted center or weighted boundary movement strategies, while other clusters were moved randomly. The results of this experiment are presented in Table 2. The data in Table 2 reflect the use of timing optimization for all clusters. All values were obtained by dividing the results of different movement strategies by the baseline and taking the arithmetic mean.

Table 2 shows that, compared with applying timing optimization strategies to all clusters, the delay-based movement strategy applies timing optimization only to some clusters and moves the other clusters randomly. This not only optimizes the critical path delay but also reduces the runtime because less computation is required. Table 3 shows that the improved weighted center movement strategy performs best in terms of critical path delay.

Table 4 compares the timing-driven simulated annealing algorithm proposed in this paper with the VTR8.0 algorithm in terms of critical path delay, total wire length, and runtime, using the entire circuit of Koios as the benchmarks. As can be seen from the table, our algorithm reduces the average total wire length by 1.52% and the average critical path delay by 5.03% at the cost of a 5.01% increase in runtime.

In Figure 6, a comparative evaluation of the critical path delay between our algorithm and VTR8.0 using the Koios benchmark is presented. Among these circuits, the bnn circuit stands out with the most pronounced reduction in critical path delay, amounting to 4.2 ns. Overall, it is observed that 75% of the total circuits exhibit a diminished critical path delay. Notably, for the majority of the benchmark circuits, the range of delay reduction lies between 0.5 and 2 ns.

Figure 7 illustrates the effectiveness of our algorithm in relation to FPGA resource utilization, with the vertical axis representing the optimization rate compared to VTR8.0. From the graph, the resource utilization of these 20 benchmark circuits spans a broad spectrum, ranging approximately from 10% to over 95%. Conversely, the optimization rates for most circuit critical path delays lie between 0.5% and 10%. Notably, both tpu_s and bnn circuits exhibit exceptional performances, showing optimization rates of 20% and 35%, respectively. The data reveal that in the majority of scenarios, our algorithm outperforms VTR8.0 in terms of reducing critical path delay.

5. Conclusions

This paper presents a timing-driven simulated annealing algorithm to address the issue of inadequate optimization for high-delay paths in traditional simulated annealing algorithms. During the cluster selection stage, clusters are first marked with timing criticality, and the proportion of high-criticality clusters is increased to raise the likelihood of selecting high-delay paths. In terms of the cluster movement strategy, the limitations of two strategies are analyzed, and a delay-based movement strategy is proposed. Clusters with different criticality levels are moved using different methods: an improved weighted center movement strategy is employed for high-criticality clusters, while a random movement strategy is used for other clusters. Experiments show that the algorithm in this paper is able to reduce runtime and critical path delay compared to the timing-optimized move strategy. Compared with the random move strategy in VTR8.0, the average wire length is reduced by 1.52% and the average critical path delay is reduced by 5.03%.

Compared to the placement algorithm of VTR8.0, the complexity of the algorithm in this paper increased, resulting in an increase of 5.01% in runtime. In future work, the runtime can be reduced by adopting parallel computing methods.

Author Contributions

Conceptualization, L.Y. and B.G.; methodology, L.Y. and B.G.; software, B.G.; validation, L.Y. and B.G., formal analysis, B.G.; writing—original draft preparation, B.G.; writing—review and editing, L.Y.; supervision, L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The benchmarks used in this paper are from VTR (Verilog-to-Routing) https://github.com/verilog-to-routing/vtr-verilog-to-routing (accessed on 30 July 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

[1] Boutros, A.; Betz, V. FPGA Architecture: Principles and Progression. IEEE Circuits Syst. Mag. 2021, 21, 4–29.
[2] Yu, L.; Guo, B.; Zhi, T.; Bai, L. Improving Seed-Based FPGA Packing with Indirect Connection for Realization of Neural Networks. Electronics 2023, 12, 2691.
[3] Elgammal, M.A.; Murray, K.E.; Betz, V. Learn to Place: FPGA Placement Using Reinforcement Learning and Directed Moves. In Proceedings of the 2020 International Conference on Field-Programmable Technology (ICFPT), Maui, HI, USA, 9–11 December 2020; pp. 85–93.
[4] Arora, A.; Wei, Z.; John, L.K. Hamamu: Specializing FPGAs for ML Applications by Adding Hard Matrix Multiplier Blocks. In Proceedings of the 2020 IEEE 31st International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Manchester, UK, 6–8 July 2020; pp. 53–60.
[5] Chen, S.C.; Chang, Y.W. FPGA placement and routing. In Proceedings of the 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Irvine, CA, USA, 13–16 November 2017; pp. 914–921.
[6] Abuowaimer, Z.; Maarouf, D.; Martin, T.; Foxcroft, J.; Gréwal, G.; Areibi, S.; Vannelli, A. GPlace3.0: Routability-Driven Analytic Placer for UltraScale FPGA Architectures. ACM Trans. Des. Autom. Electron. Syst. 2018, 23, 3244.
[7] Li, W.; Dhar, S.; Pan, D.Z. UTPlaceF: A routability-driven FPGA placer with physical and congestion aware packing. In Proceedings of the 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Austin, TX, USA, 7–10 November 2016; pp. 1–7.
[8] Gort, M.; Anderson, J.H. Analytical placement for heterogeneous FPGAs. In Proceedings of the 22nd International Conference on Field Programmable Logic and Applications (FPL), Oslo, Norway, 29–31 August 2012; pp. 143–150.
[9] Esmaeili, P.; Martin, T.; Areibi, S.; Grewal, G. Guiding FPGA Detailed Placement via Reinforcement Learning. In Proceedings of the 2022 IFIP/IEEE 30th International Conference on Very Large Scale Integration (VLSI-SoC), Patras, Greece, 3–5 October 2022; pp. 1–6.
[10] Vercruyce, D.; Vansteenkiste, E.; Stroobandt, D. Liquid: High quality scalable placement for large heterogeneous FPGAs. In Proceedings of the 2017 International Conference on Field Programmable Technology (ICFPT), Melbourne, Australia, 11–13 December 2017; pp. 17–24.
[11] Liang, T.; Chen, G.; Zhao, J.; Sinha, S.; Zhang, W. AMF-Placer: High-Performance Analytical Mixed-size Placer for FPGA. In Proceedings of the 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), Munich, Germany, 1–9 November 2021.
[12] Betz, V.; Rose, J. VPR: A new packing, placement and routing tool for FPGA research. In Proceedings of the International Conference on Field-Programmable Logic and Applications, Oxford, UK, 4–8 August 1997.
[13] Marquardt, A.; Betz, V.; Rose, J. Timing-Driven Placement. In Proceedings of the 2000 FPGAs, New York, NY, USA, 10–11 February 2000; pp. 203–213.
[14] Vorwerk, K.; Kennings, A.; Greene, J.W. Improving Simulated Annealing-Based FPGA Placement with Directed Moves. IEEE Trans.-Comput.-Aided Des. Integr. Circuits Syst. 2009, 28, 179–192.
[15] Ludwin, A.; Betz, V. Efficient and Deterministic Parallel Placement for FPGAs. ACM Trans. Des. Autom. Electron. Syst. 2011, 16, 355.
[16] Yuan, J.; Chen, J.; Wang, L.; Zhou, X.; Xia, Y.; Hu, J. ARBSA: Adaptive Range-Based Simulated Annealing for FPGA Placement. IEEE Trans.-Comput.-Aided Des. Integr. Circuits Syst. 2019, 38, 2330–2342.
[17] Murray, K.E.; Petelin, O.; Zhong, S.; Wang, J.M.; Eldafrawy, M.; Legault, J.P.; Sha, E.; Graham, A.G.; Wu, J.; Walker, M.J.P.; et al. VTR 8: High-Performance CAD and Customizable FPGA Architecture Modelling. ACM Trans. Reconfig. Technol. Syst. 2020, 13, 8617.
[18] Cheng, C.L.E. Risa: Accurate And Efficient Placement Routability Modeling. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, USA, 6–10 November 1994; pp. 690–695.
[19] Arora, A.; Boutros, A.; Rauch, D.; Rajen, A.; Borda, A.; Damghani, S.A.; Mehta, S.; Kate, S.; Patel, P.; Kent, K.B.; et al. Koios: A Deep Learning Benchmark Suite for FPGA Architecture and CAD Research. In Proceedings of the 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), Dresden, Germany, 30 August–3 September 2021; pp. 355–362.

Figure 1. Placement process.

Figure 2. Cluster and net relationship.

Figure 3. Change in proportion of high-criticality clusters.

Figure 4. Coordinate compression schematic.

Figure 5. Impact of $crit\_sel$ and $λ$ changes on critical path delay.

Figure 6. Changes in critical path delay in benchmark.

Figure 7. The impact of FPGA resource utilization on critical path delay.

Table 1. Values of parameters.

Parameter	Value
$α$	0.6
$crit\_move$	0.7

Table 2. Optimization effect of delay-based movement strategy.

Methods	Weighted Boundary + Random	Weighted Center + Random
crit_path_delay	0.9316	0.9924
wirelength	1.0515	0.9709
Runtime	0.7001	0.9749

Table 3. Optimization effect of delay-based movement strategy.

Methods	Weighted Boundary	Weighted Center	Our Improved Weighted Center
crit_path_delay	1.0000	0.9814	0.9638

Table 4. Comparison of timing-driven simulated annealing algorithm and VTR8.0.

Index	Crit_Path_Delay	Wirelength	Runtime
Ours (change relative to VTR8.0)	−5.03%	−1.52%	+5.01%

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
