1. Introduction
Network-on-chip (NoC) offers a flexible and scalable solution to satisfy the communication bandwidth for the many-core system-on-chip (SoC) [
1,
2,
3,
4,
5]. Suffering from high power cost, long network latency, and low network throughput [
2,
6,
7], the overall performance of existing NoC systems is undesirable. Many approaches have been proposed to improve NoC performance, which includes changing the NoC structure [
7,
8], dynamically changing packet injection rate [
9,
10], optimizing routing algorithms [
5,
11], and so on. Routing algorithm optimization is one of a lower cost solution [
12,
13,
14,
15,
16,
17,
18]. Routing algorithms specify the transmitted path of a packet on NoC systems, and thus can directly influence the throughput and latency of the NoC [
11,
19]. Deterministic routing algorithms are first introduced to improve the performance of NoC in the early days [
20,
21]. Unfortunately, they fail to efficiently balance the network load since they do not consider the dynamic network status during the routing process [
22]. To cope with this problem, adaptive routing algorithms are proposed [
4,
23] recently, so as to reduce the latency and increase the throughput of the NoC at a small cost [
2,
24]. By leveraging the dynamic network status during the routing processes, they have shown great potential and can be widely applied in interconnection networks.
From the perspective of congestion-aware routing, adaptive routing algorithms can be classified into three types: (1) Local congestion awareness: Its key idea is to select router output ports by only considering the congestion status of neighboring nodes [
25]. As a result, it usually destroys the global load balance, and thus cannot effectively solve the congestion problem. (2) Global congestion awareness: Its key idea is to maintain fine-grained congestion information of the network. For a given “source-destination” pair, it accurately calculates the amount of congestion on each candidate path of the packet for the minimum path range [
12,
13]. Different from local congestion awareness algorithms, it considers the global load balance during the routing process and thus can specify the best path to the destination in theory. However, it is a high-cost solution for an NoC system. The main reason is that calculating each candidate path requires a significant amount of computing time, thus increasing the network delays. (3) Regional congestion awareness: Its key idea is to collect congestion information from remote links and weigh congestion values by distance from the local node. Compared with local congestion awareness algorithms, each node in regional congestion awareness algorithms has more information to make routing decisions, thus delivering a lower network latency. Generally, considering a better load balance and lower latency of the NoC, regional congestion awareness algorithms have been proven to be an effective solution, and thus have been the most common congestion awareness routing algorithm.
However, it incurs a significant queuing latency when practitioners use existing regional congestion awareness routing algorithms to select the best path, thus leading to an undesirable performance of the NoC. There are two main reasons behind the above performance problem. First, existing regional congestion awareness algorithms make routing decisions by evaluating the redundant remote link congestion information. When the source node and target node are very close, these remote links are not in the range of the optional path from the source node to the target node. Considering that this information may be wrong, the link that has congested neighbors but is idle at the remote end is selected as the output link. Instead, it interferes with routing decisions and increases queuing delay. Second, the load of the nodes in the central area is significantly higher than the load of the edge nodes under a high traffic injection rate. Existing routing algorithms do not consider shifting the intermediate load to the edge of the network. Therefore, existing regional congestion awareness algorithms can not meet the performance requirements of the NoC yet, and it is non-trivial to optimize the regional congestion awareness algorithm further.
To this end, we propose an efficient regional congestion awareness routing algorithm, ParRouting, which aims at increasing the throughput and reducing queuing latency of NoC. First, ParRouting divides the network nodes into three priorities based on link utilization. Second, according to the nodes’ position, ParRouting sets the nodes with high and medium priority as edge nodes, and the low priority as central nodes. Finally, for edge nodes, ParRouting selects the output node by preferring the high-priority candidate link, so as to better balance the network load. For central nodes, ParRouting selects the output node by evaluating the hotspot status of the remote links and the number of candidate nodes’ free VCs, so as to avoid the congestion area and reduce queuing latency. The goal of ParRouting is to balance the traffic load and reduce the queuing latency of packets, thus improving the overall performance of the NoC systems further.
We evaluate the performance of ParRouting by comparing it with three typical routing algorithms over synthetic traffic patterns and real application patterns. Compared with exiting routing algorithms, ParRouting can better balance the traffic load and yield lower queuing latency of packets over different routing scenarios. Our experimental results indicate that ParRouting achieves a 44.9% reduction in packet average latency over SPLASH-2 water application and improves the saturated throughput by up to 28.72% over a synthetic traffic pattern for an NoC system, compared with existing routing algorithms.
The technical contribution of this paper is three-fold:
We propose an efficient regional congestion awareness routing algorithm ParRouting, which can balance the traffic load and reduce the queuing latency of packets for NoC systems;
We leverage a new congestion propagation technique, which uses packets to carry congestion information, avoiding the wiring overhead required to propagate congestion information;
We conduct comprehensive performance evaluation and analysis on ParRouting over both a synthetic traffic pattern and real application pattern by comparing ParRouting with three typical routing algorithms.
The rest of this paper is structured as follows.
Section 2 introduces the related work and motivation.
Section 3 presents the design of ParRouting.
Section 4 shows the implementation of ParRouting.
Section 5 describes the experiment setup and performance evaluation results. Finally, in
Section 6, a conclusion is drawn.
3. ParRouting Design
ParRouting is a congestion-aware and load-balance routing algorithm. It has two core ideas. One is to partition the network nodes into three priorities according to the centrality of nodes in the network, then select different routing strategies for nodes in different areas. The other is to use remote congestion information to select the output node.
3.1. Network Partition
3.1.1. Closeness Centrality
Centrality is a concept commonly used in social network analysis (SNA) to measures how much a node is in the “center” of other “point pairs” on the way. According to different application scenarios, centrality has four indicators, among which closeness centrality reflects the proximity of a point to other points. It calculates the sum of the shortest distance from a node to all other reachable nodes. After normalization, we get a number between (0, 1). The larger the number, the higher the closeness centrality of this point. A point with a high closeness centrality means that this point is relatively close to any other point, which is at the center of the network. For 2D Mesh topology, nodes are connected to a different number of neighboring nodes. There are obvious differences in the close centrality of different nodes. For 2D Torus topology, all nodes are connected to four other nodes. However, the border nodes are connected by a long link, so that different links can be assigned with different weights to calculate closeness centrality.
The standardized closeness centrality calculation formula is as follows. Closeness centrality of a node
u is reciprocal of the average shortest path distance to
u over all
reachable nodes.
where
is the shortest-path distance between
v and
u, and
n is the number of nodes that can reach
u. For 2D Mesh topology, we simply use 1 to represent the link length of every two adjacent nodes since the link lengths of adjacent nodes are the same. The closeness centrality of an
2D Mesh network is shown in
Figure 1. Each small grid in the figure represents a network node, the value in the small grid is the standardized closeness centrality of the node, and the coordinate axis represents the position of the node in the network. As seen, the closeness centrality of different nodes in the network is obviously different,
is 1.75 times of
. We will prioritize nodes based on these values in the next subsection.
3.1.2. Priority and Area Partition
Closeness centrality is a concept related to the shortest path. Most NoC routing algorithms are based on minimum routing. Therefore, closeness centrality is very suitable for predicting the congestion of NoC. Priority partition is the core idea of ParRouting. We partition the priority of the nodes according to their closeness centrality.
The red to blue in
Figure 1 represents the closeness centrality of the nodes from high to low. For a 2D Mesh network, the nodes at the center of the network have higher closeness centrality, which means that the shortest distance between these nodes and other nodes is small so that they are more likely to be chosen as output nodes calculated by minimal routing. Therefore, the nodes with a high closeness centrality should be given low priority to reducing the probability of being used, so as to reduce the load of these nodes. Accordingly, the nodes at the network’s edge have a lower closeness centrality so that these nodes should be given high priority to increase the probability of being used. That is, high closeness centrality corresponds to low priority and low closeness centrality corresponds to high priority.
We need two thresholds for the closeness centrality to divide these nodes into three priorities. The smaller one
is the threshold of high priority and medium priority, and the larger one
is the threshold of medium priority and low priority. The thresholds can be formulated as Equation (
2) [
18].
and
are needed to calculate
and
. By following the setting of [
18], we assign 0.7 to
, so as to get
. Nodes with a closeness centrality greater than
are given low priority, as the red nodes shown in
Figure 2. If
were large, the low priority nodes would be less, which would lead to a smaller central area. So that the remote-congestion-based routing strategy applied in the central area can not work as expected. On the other hand, if
were small, the central area would be bigger, the priority-based routing strategy applied in the edge area will not work efficiently.
is used to get
. Nodes with a closeness centrality less than
are given high priority, as the blue nodes shown in
Figure 2. If
were too big or too small, the distinction between high priority nodes and medium priority nodes would reduce, and the load balancing performance would not improve greatly compared to other routing algorithms. Thus, we assign 0.5 to
to calculate
. The rest nodes whose closeness centrality are greater than
and less than
are given medium priority, as the pink nodes shown in
Figure 2.
The priority has two uses. 1. Partition the network areas. We apply different routing strategies for different areas. Noticed that the high and medium priority nodes are at the edge of the network and the high priority nodes are at the center of the network. We partition the low priority nodes to the central area. For nodes in the central area, we apply remote-congestion-based routing. We partition the high and medium priority nodes to the edge area. For nodes in the edge area, we apply priority-based routing. 2. Making routing decisions. Priority-based routing strategy evaluates the candidate output node priority to select the output node. We will introduce the two routing strategies in the next subsection.
3.2. ParRouting Strategies
ParRouting is a congestion-aware and load-balance routing algorithm, which has two routing strategies for nodes in the edge area and central area respectively. If the source node is in the edge area, we apply priority-based routing otherwise, we apply remote-congestion-based routing.
Priority-based routing. If the source node is in the edge area, we make routing decisions according to the node priority. First, the priority of the two candidate output nodes are checked. Second, we must check if the two candidates have free virtual channels (VC). (1) If the two candidates are at the same priority, there are two possibilities for their free VCs: (a) They both have or do not have a free VC. (b) One has a free VC and another does not have a free VC. For situation (a), select an output node randomly and for situation (b), select one that has free VCs as the output node. (2) If the two candidates are at a different priority, there are three possibilities for thei free VCs: (a) They both have or do not have free VCs. (b) The higher priority node has a free VC. (c) The lower priority node has a free VC. For the situation (a) and (c), select the higher priority node as the output node and for situation (b), select the lower priority node as the output node.
Remote-congestion-based routing. If the source node is in the central area, we apply a two-stage routing strategy. In the first stage, calculate the number of free VC of the two candidate nodes, selecting the node with more free VC as the output node. If the two candidates have the same number of free VC, we enter the next stage. In the second stage, weigh the remote hotspot information on the two candidate nodes’ directions. If the weighted values of the two candidates are still the same, select an output node randomly. Otherwise, select the one with little remote congestion as the output node.
Adaptive routing algorithm must consider deadlock freedom. Our proposed ParRouting achieves deadlock-freedom based on Duato theory [
27]. We have two VCs (VC0 and VC1) in each router in the network. VC0 has a higher priority than VC1. After VC allocation, VC1 cannot wait for VC0, so as to destroy the cycle dependency between VCs in the network.
3.3. Detailed ParRouting Algorithm
ParRouting algorithm is an adaptive algorithm based on minimal routing, therefore, each flit has one or two candidate output nodes as the output node. As mentioned in the last subsection, flit for a “source-destination” pair is first checked its source node’s area in ParRouting. Different routing strategies are used in different areas. The detailed ParRouting for different areas are introduced in the following subsections.
3.3.1. Edge Area
The detailed process is shown in Algorithm 1. For the source node in the edge area, first calculate the candidate output nodes
and
by the minimum routing. Then check if
and
have a free VC. The node with the higher priority is first checked. If the node has at least one free VC, select the node as the output node, otherwise check if another node has a free VC. If both
and
have no free VC, select the node with a higher priority as the output node. If
and
have the same priority, check whether the two nodes both have a free VC. If only one candidate has a free VC, selecting the one as the output, otherwise select randomly from
and
.
Algorithm 1: Routing algorithm for nodes in the edge area. |
|
3.3.2. Central Area
For the source node in the central area, we apply the two-stage routing strategy mentioned before.
Figure 3 presents a detailed selection strategy for central nodes. A head flit is currently at node S, and its target node is T.
As shown in
Figure 3a,
and
are the candidate output nodes of S in the
and
directions respectively. First, calculate the number of free VC of the two nodes. Select the node with more free VCs to routing if the two numbers have a different number of free VC, otherwise enter the second stage. As shown in
Figure 3b, the red nodes
and
are the candidate output nodes for
and
respectively, their congestion status affect the routing latency directly, so giving them a high weight when calculating the remote congestion. The yellow nodes
,
, and
,
are the candidate output nodes of
and
in the
and
directions respectively. They also have a great influence on routing latency, so give them medium weight when calculating remote congestion.
and
in green are common candidate nodes of
,
, and
,
respectively. They have a small probability to be used between this “source-destination” pair, so give them a small weight. The weighting method is shown in Algorithm 2.
Algorithm 2: Routing algorithm for nodes in the central area. |
|
Algorithm 2 shows the detailed routing process for central nodes. The output directions of the two candidate nodes are denoted by
and
respectively. As shown in
Figure 4,
and
are used to represent the number of free VC associated with the candidate nodes along the
and
directions respectively.
represents the hotspot status of four nodes in the
direction;
represents the hotspot status of the four nodes in the
direction.
5. Performance Evaluation
In this section, we evaluate the proposed routing algorithm. The evaluation setup is introduced first, followed by the evaluation methodology, and finally, the performance is analyzed.
5.1. Evaluation Setup
We use the gem5 with Garnet2.0 [
28] infrastructure to evaluate the proposed routing mechanism. Gem5 provides a cycle-accurate NoC timing model. We use McPat for power consumption statistics, which is widely used for power evaluation.
Baseline network. We simulate and 2D Mesh networks respectively. In a 2D Mesh network, each node contains a router and a processing element (PE). Each router has 5 output ports, 4 of which are connected to neighboring routers, one connecting to the local PE. Each router has 2 VCs per port, and 5 flit buffers per VC. The baseline network includes 128-bit link width, minimal routing, wormhole flow control, and Duato’s deadlock avoidance theory.
Benchmarks. Both synthetic traffic patterns and application traffic patterns are considered to evaluate the network. Synthetic traffic pattern allows the most generic way to evaluate the network. In this experiment, we use five kinds of synthetic traffic patterns including bit_reverse, bit_rotation, shuffle, transpose, and uniform_random with the mix of 1-flit and 5-flit packet size. On the other hand, application traffic is obtained from real application which results in more realistic scenarios. We use SPLASH-2 to simulate in the syscall-emulation (SE) mode of gem5.
5.2. Evaluation Methodology
We compare ParRouting with three typical routing algorithms: XY (a deterministic routing), DyXY (a local congestion-aware routing), and FreeRider (a regional congestion-aware routing). Specifically, we conduct a set of experiments as follows:
We first evaluate the overall performance of ParRouting by comparing it with existing routing algorithms on a 2D Mesh network under synthetic traffic pattern, so as to check if ParRouting can select the output node with high performance for the NoC system;
Secondly, we evaluate ParRouting from the perspective of crossbar activity on an 2D Mesh network under synthetic traffic pattern, so as to verify if ParRouting can balance the traffic loading of the NoC system;
Thirdly, we evaluate the overall performance of ParRouting on an 2D Mesh network under synthetic traffic pattern, to verify the scalability of ParRouting;
Lastly, we evaluate the latency of ParRouting on an 2D Mesh network under SPLASH-2 to verify if ParRouting is effective in realistic scenarios.
5.3. Performance Analysis
We compare the performance of ParRouting with XY, DyXY, and FreeRider to evaluate ParRouting. For synthetic traffic patterns, we compare the saturated throughput of the four algorithms. For SPLASH-2, we compare the normalized average latency and power consumption of the four algorithms. The next two subsections introduce the detailed evaluation process for the two traffic patterns.
5.3.1. Synthetic Traffic Pattern
For synthetic traffic patterns, as shown in
Figure 9, the X-axis represents the traffic injection rate of the network and the Y-axis represents the average flit latency in different injection rates. To evaluate the throughput of the network, we first evaluate the algorithm in a
2D Mesh network. We start evaluation from a low traffic injection rate, then increase the injection rate gradually until the network load reaches saturation. To evaluate the scalability of ParRouting, we then evaluate the algorithms in an
2D Mesh network just like the evaluation method in the
2D Mesh network. When FreeRider is under the saturated injection rate, we compare the number of crossbar_activity of these algorithms under this injection rate to evaluate the load-balance of the network. The following are the evaluation results.
Figure 9 compares the average packet latency of the four algorithms with increasing injection rate for different traffic patterns on a
2D Mesh network. As seen in
Figure 9, XY is always the first to reach saturated. Considering that we evaluate the saturated throughput, the latency of XY in the high injection rate is meaningless to us. Therefore, we just set the average packet latency up to about 100 cycles. Compared with the classic XY routing algorithm, we can conclude the following experimental results:
(a) In bit_reverse traffic pattern, ParRouting improves the saturated throughput by up to 38.81%; (b) In shuffle traffic pattern, ParRouting improves the saturated throughput by up to 28.72%; (c) In bit_rotation traffic pattern, ParRouting improves by 20%; (d) In transpose traffic pattern, ParRouting improves by 49.95%; (e) In uniform_random traffic pattern, ParRouting improves by 8.7%.
It can be seen that ParRouting shows improvements in all synthetic traffic patterns compared with the XY routing algorithm. As the injection rate increases, the XY algorithm quickly appears congested and the network latency is greatly improved, the throughput of the network goes to saturation, while ParRouting keeps a low latency and transmit packets as normal. This is because, for XY routing, the free buffer in the network does not meet the demand, and the buffer utilization of the network is low. While ParRouting algorithm effectively avoids congestion nodes so that it reduces the possibility of congestion. However compared with FreeRider, ParRouting only improves saturated throughput by 4.37% on average under the 2D Mesh network. It is because the load-balance function will not work efficiently for a small-scale network. Thus, we compare the algorithms in an 2D Mesh network.
According to Reza Akbar [
18], the load balancing performance of the algorithm can be evaluated by comparing the variance of the crossbar activity in each router.
Table 1 lists the mean and variance of the crossbar activity of XY, FreeRider, and ParRouting. Compared with XY and FreeRider when FreeRider is under saturated injection rate. ParRouting reduces the variance of crossbar activity by 28.38% and 33.09% respectively.
Figure 10 shows the crossbar_activity hotspot correspondingly. It can be seen that the central node of XY and FreeRider is more active than ParRouting. This is because we use area-partition routing to achieve a load-balance effect. Compared with other algorithms, ParRouting increases the load on edge nodes and reduces the load on the central node of the network under a higher traffic load. To show the effectiveness of area-partition, we evaluate the performance of each algorithm on an
2D Mesh network.
Figure 11 compares the average packet latency of each algorithm on an
2D Mesh network. Similar to the
network, ParRouting shows better performance than the other three algorithms overall. We compare the saturated throughput of ParRouting and FreeRider especially. Among all the five synthetic traffic patterns, the saturated throughput of ParRouting improves about 8.33% on average compared with FreeRider. The result further proves that ParRouting is an effective load balancing routing algorithm.
5.3.2. Real Application Traffic Pattern
We evaluate the normalized average latency of ParRouting in SE mode under an
2D Mesh network.
Figure 12a,b respectively show the normalized improvement results of the average packet latency and power consumption under the SPLASH-2 benchmark. In terms of NoC latency, ParRouting is generally superior to other routing algorithms. Compared with XY, ParRouting reduces NoC latency in different applications by 13% (fft), 31.3% (lu), 11.4% (lu_non), 44.9% (water), 16% (water-spatial), and 53.4% (ocean). Even for FreeRider, ParRouting performs with a lower average latency of about 5%. In terms of NoC power consumption, ParRouting had less power consumption in all cases compared to its rivals. Compared with XY and FreeRider, the average power consumption of ParRouting in all cases is reduced by 0.641% and 0.318% respectively. Compared with XY, ParRouting had a register in the router, which may lead to more static power consumed by the NoC components. The reason of the trivial power consumption reduction is that ParRouting has lower packet latency.