1. Introduction
Modern datacenter networks have common characteristics such as high bandwidth and low latency. In addition, many typical datacenter network topologies have rich path diversity and symmetry [1,2]. Since many applications that run in datacenter networks require high throughput or short flow completion time, there has been extensive research on how to achieve those objectives through load-balancing [3].
Depending on the granularity of load-balancing, approaches are categorized into flow-level and packet-level. One of the major factors in deciding the granularity of load-balancing is the issue of packet reordering. Flow-level load-balancing is favorable to in-order delivery, but prone to low network utilization and long flow completion time, so there have been many efforts to address these issues [2,3,4,5,6,7]. Packet-level load-balancing is more effective in spreading traffic evenly than flow-level load-balancing, although it needs to address the packet reordering issue. Many proposals have been made to make packet-level load-balancing more viable by exploiting the unique characteristics of datacenter networks [8,9,10,11,12,13]. A notable approach is to exploit the topological symmetry more directly. The authors of [9] showed the effectiveness of RPS (Random Packet Spraying), which sprays packets of a flow randomly across equal-cost paths. When RPS is used in symmetric datacenter networks, the packets belonging to the same flow experience almost the same queuing delay, and consequently most of them arrive at the destination in order. The study also shows that asymmetry due to a link failure results in queue length differentials, which adversely impacts the TCP performance of RPS. DRILL [12] also exploits a key characteristic of a symmetric Clos network. DRILL employs per-packet load-balancing based on queue occupancies. To handle topological asymmetry, DRILL decomposes the network into a minimal number of components, such that the paths within each component are symmetric. DRILL assigns each flow to a component and applies per-packet load-balancing across paths inside the component.
A recent proposal for architectural fault-tolerance of data center networks [14] shows the possibility that the network topology can remain symmetric more stably. Considering this technological trend, it is important to explore how to exploit topological symmetry and the high degree of path diversity more effectively.
We propose a proactive per-packet load-balancing scheme named LBSP (Load Balancing based on Symmetric Path groups) for fat-trees. In LBSP, multiple equal-cost paths are partitioned into path groups of equal size. For each flow, packets belonging to the flow are sprayed across the paths within one path group chosen based on the destination address. A notable feature is that path groups are of equal size. Hence, symmetry holds not only between paths inside a path group, but also between path groups in terms of the number of available paths for each flow. If a link failure occurs, LBSP selects an alternative path group for the flows affected by the failure so that each flow is assigned a path group which does not contain a failed link. Consequently, even in the presence of failures, LBSP preserves, for each flow, the symmetry of the multiple paths that the packets of the flow pass through.
A strength of LBSP is that, in a normal state (with no failures), it does not have to maintain any information on the mapping of flows to paths. Only when an alternative path group needs to be used is this mapping information maintained, and only for the relevant address range. In addition, the simple rule for mapping a flow to a path group facilitates hardware implementation for speedup.
To evaluate LBSP, we simulated three-tier fat-trees and compared flow completion time between LBSP and the original packet scatter scheme. Simulation results show that LBSP is much more robust to network failures than the original packet scatter scheme.
Since LBSP forwards packets based on the destination address, traffic distribution between path groups can be uneven. To reduce the queue length differentials between path groups, we propose a solution that uses not only the destination address but also the input port number.
The rest of this paper is structured as follows. In Section 2, we explain the background and motivation. In Section 3, we describe our scheme LBSP in detail. Section 4 demonstrates the simulation results. In Section 5, we propose a solution to reduce the queue length differentials between path groups. Section 6 describes the related work. In Section 7, we conclude.
3. Load Balancing Based on Symmetric Path Groups (LBSP)
3.1. Basic Idea
The proposed scheme LBSP aims to achieve high TCP performance by exploiting a symmetric portion of redundant paths even when symmetry of the whole network is disturbed due to failures. The main features of LBSP are as follows.
Initially, LBSP partitions shortest paths into fixed, equal-sized symmetric subsets called path groups.
In a normal state without a failure, LBSP assigns a (default) path group to a flow based on the least significant bits of the destination address.
In the case of failure detection, LBSP selects an alternative path group instead of the default path group for the flows affected by the failure.
LBSP partitions multiple equal-cost paths between two hosts into more than one path group of the same size. The rationale for partitioning multiple equal-cost paths is that, in the presence of network failures, we can reduce packet reordering by forwarding packets through a portion of the multiple paths which are still symmetric. DRILL [12] also adopts the idea of partitioning multiple shortest paths between a source and destination pair into components. However, since DRILL partitions a network into a minimal number of symmetric components, the component sizes may differ. In addition, when there is no failure in a fat-tree, the whole network has only one component. In contrast, LBSP partitions the shortest paths into fixed, equal-sized symmetric path groups. This enables the topological symmetry property to be satisfied not only between paths inside a path group but also between path groups, which reduces the complexity of handling asymmetry.
For each flow, a path group is selected according to a rule based on the destination address. This allows every packet belonging to the same flow to use the same path group. For ease of explanation, we consider an 8-ary fat-tree, although our scheme scales to larger fat-trees. In the 8-ary fat-tree depicted in Figure 2, consider a packet going from Host 0 to Host 112. If no restriction is imposed on path selection, the packet can choose one of the 16 shortest upward paths, which pass through different core switches. Now we partition the links between Switch T0 and the aggregation Switches A0, A1, A2, and A3 into two link groups, indicated by the red solid lines and the red dashed lines in Figure 2. Similarly, for each of the aggregation Switches A0, A1, A2, and A3, the links between the aggregation switch and its neighbor core switches are partitioned into two link groups, represented by the blue solid lines and the blue dashed lines in Figure 2. If we combine the first partition with the second partition, as shown in Table 1, the 16 symmetric upward paths are partitioned into four path groups, each consisting of four paths. The flow from Host 0 to Host 112 is assigned one of the four path groups. For simplicity, we assign a link group to each flow based on the parity of the least significant bits of the destination address. Within a link group, packets are forwarded in a round-robin manner. This method guarantees that the flows selecting the same path group observe almost the same queue occupancies.
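As a concrete illustration, the following minimal sketch enumerates how the two parity-based partitions combine into the four path groups of Table 1 for the Host 0 to Host 112 example. It is not the authors' implementation; the switch names and the mapping of uplink port numbers to aggregation and core switches are assumptions made only for illustration.

```python
# Illustrative sketch of the Table 1 partition in the 8-ary fat-tree of Figure 2.
# Assumption: uplink port i of ToR switch T0 leads to aggregation switch Ai, and
# each aggregation switch has four core-facing uplink ports numbered 0..3.

TOR_UPLINKS = {0: "A0", 1: "A1", 2: "A2", 3: "A3"}

def path_groups():
    """Partition the 16 upward paths into four groups keyed by (LSB, 2nd-LSB) parity."""
    groups = {}
    for tor_parity in (0, 1):        # parity of the destination LSB (ToR level)
        for agg_parity in (0, 1):    # parity of the destination 2nd LSB (aggregation level)
            groups[(tor_parity, agg_parity)] = [
                (agg, agg_port)
                for tor_port, agg in TOR_UPLINKS.items() if tor_port % 2 == tor_parity
                for agg_port in range(4) if agg_port % 2 == agg_parity
            ]
    return groups

for key, paths in path_groups().items():
    print(key, paths)   # four groups, each containing four (aggregation switch, uplink) paths
```

Each flow then uses only the group selected by the two address bits, and its packets are sprayed round-robin over the four paths of that group.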
If a link failure occurs, LBSP tries to select an alternative path group which does not contain failed links. Figure 3 illustrates selecting an alternative path group. Suppose that the default path group for the flow S2 → D2 consists of the paths that pass through Switches A9 and A11. If Switch T11 detects the link failure between Switch A11 and itself, it assigns the flow the alternative link group that passes through Switches A8 and A10. Then, at Switches A8 and A10, the flow S2 → D2 is assigned one of the two link groups. Suppose that the flow S2 → D2 is assigned the same path group as the flow S1 → D1 according to the least significant bits of the destination address. Then, as indicated by the green dashed lines, the paths for the flow S2 → D2 are as follows: S2 → T11 → {A8, A10} → {C0, C2, C8, C10} → {A4, A6} → T5 → D2. It is important to note that the two flows S1 → D1 and S2 → D2 using the same path group will observe almost the same queue lengths.
3.2. Partitioning Paths and Assigning Path Groups
The practical design of LBSP was inspired by Flexible Interval Routing (FIR), which Gomez et al. used to implement a routing algorithm for fat-trees [15]. We associate each output port with a range of destination addresses. The range indicates all the destination hosts which are reachable from the output port. Since each upward output port leads to every host in a fat-tree, its range of destination addresses contains all the host addresses. In contrast, each downward output port has a limited range. Figure 4a illustrates the ranges associated with the output ports of an aggregation switch and a core switch in the fat-tree shown in Figure 2.
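As a small illustration of this range-based downward forwarding, the sketch below forwards a packet to the downward output port whose address range contains the destination. The port-to-range mapping shown is an assumption consistent with four hosts per ToR switch (e.g., T31 serving hosts 124 to 127, as in the failure example of Section 3.3.1); it is not copied from Figure 4a.

```python
# Hedged sketch of range-based downward forwarding in the spirit of FIR: each
# downward output port is associated with the range of hosts reachable through
# it. The ranges below are assumed for an aggregation switch of the last pod in
# the 8-ary fat-tree of Figure 2 (hosts 112..127, four hosts per ToR switch).

DOWNWARD_RANGES = {
    0: (112, 115),   # assumed: port 0 leads to T28
    1: (116, 119),   # assumed: port 1 leads to T29
    2: (120, 123),   # assumed: port 2 leads to T30
    3: (124, 127),   # assumed: port 3 leads to T31
}

def downward_port(dst_addr: int) -> int:
    """Return the downward port whose address range contains the destination."""
    for port, (lo, hi) in DOWNWARD_RANGES.items():
        if lo <= dst_addr <= hi:
            return port
    raise ValueError("destination is not reachable below this switch")

print(downward_port(126))   # 3: Host 126 is reached through the port toward T31
```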
In a fat-tree, once upward paths are selected, the downward paths are deterministic. In an 8-ary fat-tree, a flow from one pod to another has 16 shortest paths. LBSP partitions all 16 shortest paths into four path groups using the parity of each port number of each switch on the upward paths. Since partitioning the shortest paths in this way makes consecutive packets pass through different ports of a ToR switch, it helps reduce latency when round-robin forwarding is used.
LBSP selects a path group for each flow based on the least significant bits of the destination address. That is, for upward forwarding, each ToR switch and each aggregation switch selects uplinks based on the LSB and the second LSB of the destination address, respectively. Across the paths within a link group, packets of a flow are forwarded in a round-robin manner. Hence, flows can use increased bandwidth with a higher probability than under flow-level load balancing. For downward forwarding, each switch forwards a packet according to the destination address ranges of its output ports. Figure 4b illustrates partitioning paths and assigning path groups to source-destination pairs according to the least significant bits of the destination address. If each output port has a mask register that indicates which bit of the destination address is to be compared, then in a normal state without failures, LBSP does not have to keep information on which specific flow is assigned to which path group.
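A minimal sketch of the per-switch upward forwarding rule follows. It assumes four uplinks per switch, with even and odd port numbers forming the two link groups, and keeps a per-link-group round-robin pointer; the class and variable names are illustrative rather than the authors' implementation.

```python
# Sketch of LBSP upward forwarding at a single switch: a ToR switch selects the
# link group from the LSB of the destination address, an aggregation switch from
# the second LSB, and packets are sprayed round-robin over the selected group.

class UpwardForwarder:
    def __init__(self, level: str, num_uplinks: int = 4):
        self.bit = 0 if level == "tor" else 1   # "tor": LSB, "agg": second LSB
        self.num_uplinks = num_uplinks
        self.rr = {}                            # round-robin pointer per link group

    def select_uplink(self, dst_addr: int) -> int:
        parity = (dst_addr >> self.bit) & 1
        ports = [p for p in range(self.num_uplinks) if p % 2 == parity]
        idx = self.rr.get(parity, 0)
        self.rr[parity] = (idx + 1) % len(ports)
        return ports[idx]

tor = UpwardForwarder("tor")
# Destination 115 has LSB 1, so the link group is {1, 3} and packets alternate.
print([tor.select_uplink(115) for _ in range(4)])   # [1, 3, 1, 3]
```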
3.3. Handling Asymmetry due to Failures
3.3.1. Failure Notification
For failure detection, each switch sends hello messages to its neighbor switches. On detecting a failure, a switch floods a failure message through the network. A failure message contains information on the failure type and the address range affected by the failure. From the failure type, a switch can figure out which link group should not be selected. Table 2 summarizes the failure types.
A failure message from a switch detecting a failure of a downlink contains the address range associated with the port incident to the failed link. In contrast, a failure message from a switch detecting a failure of an uplink contains the address range of all the hosts. In the fat-tree shown in Figure 2, suppose that the link between A31 and T31 is down. The failure message sent from A31 will contain the address range [124…127] and the failure type 1. Suppose that Switch T3 receives the failure message. Then, T3 cannot select Link Group 1 for flows with a destination address in the range 124–127.
In the case of a three-tier fat-tree, the failure messages take at most three hops to reach all the aggregation switches and ToR switches that need the failure information to select a link group.
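A minimal sketch of the failure message described above follows. The field names are assumptions; only the information mentioned in the text, namely the failure type of Table 2 and the affected address range, is carried.

```python
# Hedged sketch of the failure message flooded on failure detection. The field
# names are illustrative; the message carries a failure type (Table 2) and the
# affected destination address range, from which a switch can decide which link
# group must not be selected for that range.

from dataclasses import dataclass

@dataclass(frozen=True)
class FailureMessage:
    failure_type: int   # failure type as summarized in Table 2
    addr_lo: int        # first address of the affected range
    addr_hi: int        # last address of the affected range

# Example from the text: A31 detects that its downlink to T31 is down.
downlink_failure = FailureMessage(failure_type=1, addr_lo=124, addr_hi=127)

# Example from Section 3.3.2: the uplink between A31 and core switch C15 is down.
uplink_failure = FailureMessage(failure_type=3, addr_lo=112, addr_hi=127)
```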
3.3.2. Modifying the Forwarding Tables
As previously described, when the network has no failure, upward forwarding in LBSP is performed as predefined. If a switch receives a message about a failure which affects the flows passing through it, it modifies its forwarding table so that an alternative link group is chosen for packets whose destination addresses fall within the range indicated in the failure message.
For instance, consider a flow from Host 15 to Host 115, as shown in Figure 5. Since the LSB of the destination address (115) is 1, Switch T3 forwards packets using output Ports 1 and 3. Similarly, since the second LSB of 115 is 1, Switches A1 and A3 select output Ports 1 and 3 for the flow. Suppose that the link between the aggregation Switch A31 and the core Switch C15 is down. The failure message conveys the information that the affected address range is [112…127] and the failure type is 3. This means that flows with destination addresses in the range 112–127 are affected by the failure and cannot be forwarded through Link Group 1 between an aggregation switch and a core switch. Note that failure Type 3 does not affect the forwarding tables of ToR switches. On receiving the failure message, aggregation Switches A1 and A3 modify their forwarding tables as follows.

Destination Address Range | Link Group (Output Ports)
112…127 | Link Group 0 (Ports 0, 2)

Suppose that Host 15 sends packets to Host 115 at that point. Since the LSB of 115 is 1, Switch T3 forwards packets using Ports 1 and 3 as predefined. However, Switches A1 and A3 select Ports 0 and 2 instead of Ports 1 and 3, because the modified forwarding table indicates that packets whose destination addresses are in the range [112…127] must be forwarded through output Ports 0 and 2.
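The following sketch shows how an aggregation switch might install and consult the override entry above. The data structures are assumptions, and only the behavior described in the text is modeled: an override maps a destination address range to an alternative link group, and the default parity rule applies when no override matches.

```python
# Hedged sketch of the forwarding-table modification of Section 3.3.2 at an
# aggregation switch (which uses the second LSB of the destination by default).

class AggSwitchTable:
    def __init__(self, num_uplinks: int = 4):
        self.num_uplinks = num_uplinks
        self.overrides = []            # entries of the form (lo, hi, link_group)

    def on_failure_message(self, lo: int, hi: int, alt_group: int):
        # Install an override so destinations in [lo, hi] use the alternative group.
        self.overrides.append((lo, hi, alt_group))

    def link_group_for(self, dst_addr: int) -> int:
        for lo, hi, group in self.overrides:
            if lo <= dst_addr <= hi:
                return group
        return (dst_addr >> 1) & 1     # default rule: second LSB of the destination

    def uplink_ports(self, dst_addr: int):
        parity = self.link_group_for(dst_addr)
        return [p for p in range(self.num_uplinks) if p % 2 == parity]

a1 = AggSwitchTable()
print(a1.uplink_ports(115))                     # default: 2nd LSB of 115 is 1 -> Ports 1 and 3
a1.on_failure_message(112, 127, alt_group=0)    # failure Type 3 example from the text
print(a1.uplink_ports(115))                     # override: Ports 0 and 2
```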
3.4. Discussions
LBSP is economical in that it does not require much state information and can be implemented with simple hardware. In a normal state without a failure, LBSP can forward packets upward by simply mapping the least significant bits of the destination address to a path group. The bit comparison needed can be implemented by using a mask register as in FIR [15], so that only the specific bit(s) of the destination address are compared. Only when a failure is notified are entries for the corresponding address range and link group maintained in the forwarding table. To indicate whether the forwarding table needs to be searched, the restriction register proposed for FIR in [15] can be used.
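The following sketch mimics the mask-register comparison in software; the register widths and the exact FIR register layout are not reproduced, only the idea that a port is eligible when the masked destination bits match its pattern, so no per-flow state is needed in the failure-free case.

```python
# Illustrative sketch of the mask-register idea: each upward output port stores a
# (mask, value) pair, and a port is eligible for a destination when the masked
# address bits equal the stored pattern.

def port_matches(dst_addr: int, mask: int, value: int) -> bool:
    return (dst_addr & mask) == value

# ToR switch: Ports 0 and 2 accept destinations with LSB 0, Ports 1 and 3 LSB 1.
tor_ports = {0: (0b1, 0b0), 1: (0b1, 0b1), 2: (0b1, 0b0), 3: (0b1, 0b1)}

dst = 115   # LSB is 1
eligible = [p for p, (mask, value) in tor_ports.items() if port_matches(dst, mask, value)]
print(eligible)   # [1, 3], the link group used in the Figure 5 example
```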
While we describe the LBSP scheme using 8-ary fat-trees for simplicity, LBSP performs better for a higher fan-out k. For k-ary fat-trees with k ≥ 16, we have two choices for partitioning the shortest paths: either having more path groups or having more paths in each path group. Having more path groups enhances robustness to failures because each switch has more alternative path groups. On the other hand, having more paths in a path group helps reduce the flow completion time of each flow.
The effectiveness of LBSP becomes more evident in fat-trees with high oversubscription ratios, where queues build up more. If a failure occurs in such a network, network asymmetry and queue buildup may jointly cause severe packet reordering. LBSP is also effective in the presence of a degraded link. When the packets of a TCP flow are spread evenly across multiple paths, the performance of the flow depends on the slowest path. LBSP deals with a degraded link and a disconnected link in the same way, by selecting an alternative path group.
Some might be concerned that selecting an alternative path group in the presence of a failure can significantly reduce network utilization. However, the proportion of flows that need to select an alternative path due to a link failure is not significant. For a k-ary fat-tree with k³/4 hosts, the proportion of the source-destination pairs to be rerouted due to a link failure between a ToR switch and an aggregation switch is

2 · (k/2) · (k³/4 − k/2) / [(k³/4) · (k³/4 − 1)] ≈ 4/k².
Hence, in the illustrative 8-ary fat-tree of Figure 3, if a failure occurs on the link between a ToR switch and an aggregation switch, only the flows which communicate with the four hosts connected to that ToR switch are affected, and the proportion of the source-destination pairs to be rerouted is approximately 6%. Network utilization is therefore not significantly affected.
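Under the counting assumption above (ordered source-destination pairs over the k³/4 hosts, where the affected pairs are those with exactly one endpoint among the k/2 hosts attached to the ToR switch incident to the failed link), a quick check reproduces the roughly 6% figure for k = 8.

```python
# Quick check of the rerouted-pair proportion under the stated counting assumption.

def rerouted_fraction(k: int) -> float:
    hosts = k ** 3 // 4        # number of hosts in a k-ary fat-tree
    affected_hosts = k // 2    # hosts attached to the ToR switch incident to the failed link
    affected_pairs = 2 * affected_hosts * (hosts - affected_hosts)
    total_pairs = hosts * (hosts - 1)
    return affected_pairs / total_pairs

print(f"{rerouted_fraction(8):.1%}")    # about 6.1% for the 8-ary fat-tree
print(f"{rerouted_fraction(16):.1%}")   # about 1.6%; the fraction shrinks as k grows
```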
While LBSP assigns a path group to a flow simply based on the least significant bits of the destination address, entropy can be increased by hashing several fields in the packet header to a path group, as in ECMP (Equal-Cost Multi-Path). ECMP, the de facto flow-level load-balancing scheme, distributes flows across the available multiple paths by statically hashing some fields in the packet header. Hence, ECMP does not require state information, either. It is known that ECMP works well for a large number of flows with sufficient entropy, but that it may underutilize a network when a few long flows collide on some path [6]. We expect that, by employing ECMP-like hashing and utilizing multiple symmetric paths, LBSP can mitigate the shortcomings of ECMP without causing excessive packet reordering.
4. Simulation Results
We performed simulations for an 8-ary fat-tree and a 16-ary fat-tree using ns-2. All links have a bandwidth of 1 Gbps, a queue size of 250 packets, and a propagation delay of 50 μs. At the transport layer, we used TCP Reno and set the retransmission timer to 20 ms. For tests with network failures, we simulated link failures, node failures, and degraded links. For link failure tests, we disconnected a link connecting an aggregation switch and a ToR switch. For node failure tests, we brought an aggregation switch down; a node failure causes multiple link failures simultaneously. We measured the flow completion time, which is an important metric in datacenter networks.
First, we simulated all-to-all short-lived flows of size 32 KB with a load factor of 0.8 repeatedly for about 3.3 s. In total, 128 flows were generated using a random permutation. Figure 6a shows the 99.9th percentile of the flow completion times. Neither a link failure nor a node failure affects the flow completion time of TCP flows significantly. Since the flows are short, the queues do not build up enough to cause packet reordering or packet drops. Hence, the original packet scatter scheme, which distributes packets across more paths, performs marginally better than LBSP.
To validate the functionality of LBSP for long-lived flows, we used a test scenario similar to the one used in [6], considering the limits of our simulation environment. We generated a small number of 100 MB flows from hosts on ToR switches in two specific pods transmitting to hosts on another specific ToR switch in a different pod. We simulated four flows of size 100 MB repeatedly for about 20 s. The source and destination hosts were selected as depicted in Figure 3. For the source-destination pair <S1, D1>, two flows were generated. For each of the other source-destination pairs <S2, D2> and <S3, D3>, one flow was generated. This scenario is not advantageous to LBSP because <S1, D1> and <S2, D2> were selected so that, under the link failure, the three flows from S1 and S3 take the same paths. Figure 6b shows the 99.9th percentile of flow completion times for the original packet scatter scheme and LBSP. When there is no failure in the network, both schemes perform almost equally. When a failure occurs at a link between a ToR switch and an aggregation switch, as shown in Figure 3, LBSP performs better than the original packet scatter scheme despite the disadvantageous scenario. For the original packet scatter scheme, the flow from S3 suffers the greatest performance degradation, because its packets arrive at D3 out of order due to different queuing delays. In LBSP, however, the flows seldom experience packet reordering. When an aggregation switch in the source pod of the three flows fails, the gap between the two schemes becomes even greater than that shown in Figure 6b.
We simulated one link failure in a 16-ary fat-tree. We partitioned the shortest paths into the same number of path groups as in the 8-ary fat-tree. Consequently, since each path group has more paths, the probability of packet loss due to congestion decreases. We disconnected a link connecting a ToR switch and an aggregation switch, as in the tests for the 8-ary fat-tree. We generated five flows going to one pod: four flows coming from the pod containing the disconnected link and one flow from another pod. Figure 7a shows the results. For an oversubscription ratio of 1:1, a link failure or a node failure does not significantly affect the TCP flow completion time of the original packet scatter scheme. This was also the case when we simulated an oversubscription ratio of 4:1 by reducing the bandwidth of the links connecting core switches and aggregation switches to 250 Mbps. However, when the oversubscription ratio was set to 8:1, the flow completion time of the original packet scatter scheme drastically increases, even when only one link is disconnected. In this case, there were no packet drops, but we observed that queue occupancy irregularly increased on many links connecting core switches and aggregation switches, which implies that the destination host suffered substantial packet reordering. In contrast, the flow completion time of LBSP is not affected.
We simulated asymmetry by degrading the bandwidth of a link from 1 Gbps to 100 Mbps for both an 8-ary and a 16-ary fat-tree. We set the bandwidth of one link connecting a ToR switch and an aggregation switch to 100 Mbps. Short-lived flow tests were performed using flows of size 32 KB for the 8-ary fat-tree, and long-lived flow tests using flows of size 100 MB for both the 8-ary and 16-ary fat-trees. Figure 7b shows the results for the short-lived flow tests. We can see that the effect of a degraded link on flow completion time is larger than that of a disconnected link, shown in Figure 6a. Note that the performance of LBSP is less affected by a degraded link than that of the original packet scatter scheme. Figure 7c shows the results for the long-lived flow tests. The impact of degraded links on the performance of the original packet scatter scheme is enormous. For both 8-ary and 16-ary fat-trees, the original packet scatter scheme experiences over eleven times longer flow completion times than under normal conditions. Since there are no packet drops, the performance degradation is due solely to out-of-order packets. LBSP avoids such huge performance degradation by selecting an alternative path group that does not contain the degraded link.
6. Related Work
There have been many efforts to enhance load-balancing in datacenter networks. We summarize the efforts in two categories: flow-level load-balancing and packet-level load-balancing.
ECMP (Equal-Cost Multi-Path) is a standard forwarding scheme supported by commodity switches. ECMP distributes traffic across equal-cost paths based on flow hashing. VL2 [2] uses a variant of VLB (Valiant Load Balancing) that employs random flow-level distribution to avoid packet reordering. In this variant of VLB, each host selects an intermediate (i.e., core) switch independently and randomly. This scheme may consume additional network capacity by extending path lengths, but it can improve the utilization of bisection bandwidth. Hedera [4] is a centralized flow scheduling scheme aimed at maximizing network utilization. Collecting information on flows, Hedera dynamically computes paths so that flows avoid busy links. Hedera requires a picture of the whole routing state and the traffic demands of all flows. CAAR (Congestion-Aware Adaptive foRwarding) [5] is a distributed forwarding scheme in which each flow selects the most underutilized path. A flow can be redirected to an underutilized path in mid-transmission if the selected path becomes congested. Each switch selects the next hop based on its neighbors' queue lengths, which are updated in each time slot. Similarly, FlowBender [6] reroutes flows when congestion is detected. In FlowBender, however, congestion is detected using end-to-end ECN notification, and rerouting is initiated by the end-hosts detecting congestion. In [17], the authors presented a theoretical solution to DLBP (the Dynamic Load Balancing Problem) in hybrid switching data center networks with converters. MPTCP [7] generates several subflows for a flow and assigns each subflow one of multiple paths. Since each subflow performs congestion control independently, MPTCP mitigates the problem caused by packet reordering, but it requires significant changes to the protocol stacks of end-hosts. To improve the efficacy of MPTCP for many-to-one traffic, DCMPTCP [18] has been proposed.
DeTail [8] is a cross-layer network stack designed to reduce the tail of long-lived flow completion times. At the link layer, DeTail employs flow control based on port buffer occupancies to construct a lossless fabric. At the network layer, DeTail performs per-packet adaptive load balancing: each switch dynamically picks a packet's next hop using the congestion information obtained from port buffer occupancies. Out-of-order packets are reordered at the end-hosts. The study in [9] shows the feasibility of RPS (Random Packet Spraying), in which packets of every flow are randomly assigned to one of the available shortest paths to the destination. It shows experimentally that RPS works well in symmetric topologies. To handle asymmetry due to failures, SRED (Selective RED) is proposed for use along with RPS. SRED selectively enables RED only for flows that induce queue length differentials. If a flow that cannot use all the multiple paths creates differentials in queue lengths between switches, SRED drops the packets belonging to that flow more aggressively. To this end, SRED requires a topology-aware centralized fault manager, which configures end-hosts or ToR routers to mark all packets of flows affected by a link failure so that downstream routers can apply RED only to marked packets. Fastpass [10] is a system in which a logically centralized arbiter controls each packet's timing and path. The ultimate goal of the arbiter is to keep queue lengths at almost zero. At large scale, Fastpass needs several arbiters. Presto [11] performs proactive load-balancing with a granularity in between flow-level and packet-level. Edge vSwitches break each flow into discrete units of packets, called flowcells, whose size is the maximum TCP Segmentation Offload size (64 KB). Presto employs a centralized approach: a centralized controller collects the network topology, partitions the network into a set of multiple spanning trees, and assigns each vSwitch a unique forwarding label in each spanning tree. When a failure occurs, the controller prunes the spanning trees that are affected by the failure or enforces a weighted scheduling algorithm over the spanning trees. Packet reordering is handled by the GRO (Generic Receive Offload) handler, modified to reduce computational overhead. DRILL (Distributed Randomized In-network Localized Load-balancing) [12] is a datacenter fabric for Clos networks which makes per-packet decisions at each switch based on local queue occupancies and randomized algorithms to distribute load. DRILL compares the queue lengths of two random ports and the previously least-loaded port, and sends the packet to the least loaded of these. To handle asymmetry, DRILL partitions network paths into groups, each of which is symmetric, and applies micro load balancing inside each path group. TTC (Thresholded Two-Choice) [13] has been proposed to improve downward load-balancing in fat-trees, based on the observation that DRILL [12], which balances upward traffic well, may perform poorly on downward load balancing since local path decisions are agnostic to downward congestion. Similar to DRILL, TTC also uses a two-choice algorithm. However, to improve downward load-balancing, TTC makes local routing decisions from two choices: a default path using the deterministic D-mod-k scheme [15] and a random choice. The random choice is selected only when the load of the default path exceeds that of the random choice by a threshold. However, TTC does not yet handle link failures and packet reordering. More recently, VMS (Virtual Multi-channel Scatter) [19] has been proposed. VMS is a packet-level load-balancing scheme which runs in the virtual switch layer.
7. Conclusions
We propose LBSP, a new per-packet load-balancing scheme that exploits symmetric redundant paths in fat-trees. LBSP focuses on enabling the packets of a flow to experience almost the same queueing delays even when the whole network becomes asymmetric due to network failures. To this end, LBSP partitions the equal-cost paths between two hosts into more than one path group of the same size. Each flow is assigned a path group based on the least significant bits of the destination address, and packets of the flow are forwarded across the links of the selected path group. When a failure is detected, LBSP selects an alternative path group for the flows affected by the failure so that the multiple paths that the packets of those flows pass through remain symmetric. Only when a failure occurs does LBSP maintain the state information needed to reroute the affected flows. Simulation results show that LBSP is much less affected by network failures than the original packet scatter scheme.
Since LBSP selects a path group based on the destination address, the difference between the queue lengths of different link groups can be large under heavy load. To mitigate this problem, we propose a solution that migrates a part of the traffic from the default link group to the alternative link group.