1. Introduction
With the rapid development of cloud computing, the data centers that host cloud services continue to grow in scale, making the management and operation of data center networks increasingly difficult. Efficiently solving data center network problems is therefore one of the main challenges facing network operations and maintenance personnel as well as researchers. Software-Defined Networking (SDN) enables flexible management and configuration of network devices by decoupling the control plane from the data plane [1]. The industry has widely adopted the OpenFlow protocol as the control protocol of SDN, abstracting the control plane into a controller and the forwarding plane into unified OpenFlow network devices [2]. The controller directs traffic forwarding on network devices through the OpenFlow protocol. In data center networks, Benson et al. observed that roughly 80% of flows carry less than 10 KB, which are defined as mice flows, while about 10% of flows are long-lived and carry the bulk of the data, which are defined as elephant flows [3]. The network's congestion level and performance stability are closely related to elephant flows [4]. Therefore, monitoring and scheduling elephant flows is key to solving data center network problems. Dynamic flow scheduling systems such as Hedera have been shown to significantly improve bandwidth utilization in multi-rooted tree topologies [5]. However, while Hedera showcases the benefits of dynamic routing, it lacks the fine-grained traffic classification and active polling control that are central to our approach.
Although elephant flows account for only a small fraction of total traffic, they consume a disproportionately large share of bandwidth and directly cause performance bottlenecks and congestion in data center environments. This reflects the dynamic and busy nature of data center traffic, where applications such as big data analytics, distributed storage, and real-time services frequently generate elephant flows, challenging traditional static routing and load balancing mechanisms [6]. Liu et al. [7], for instance, proposed a shim-layer mechanism to monitor TCP socket buffers, but their method remains confined to Fat-Tree structures and relies on static detection techniques, limiting its adaptability. Similarly, Hamdan et al. [8] and Bezerra et al. [9] introduced advanced detection strategies that combine terminal and network-layer cues to improve flow detection accuracy, but they do not provide frameworks for rerouting or scheduling.
The topology most widely used in data centers is the Fat-Tree structure. Because the three-layer Fat-Tree structure contains loops, the Spanning Tree Protocol (STP) must be used to remove them and avoid broadcast storms [10,11]. Some researchers have studied STP implementations based on OpenFlow; this approach makes the network converge, but it also introduces the risk of congestion on the single remaining path [12]. Based on the Fat-Tree network topology, other researchers have proposed a load balancing mechanism for SDN that integrates a “shim layer” in the terminal host to monitor the Transmission Control Protocol (TCP) socket buffer, thereby reducing the network load [7]. These works focus on monitoring and load balancing within the Fat-Tree topology but do not address the path congestion that arises from single-path routing. In contrast, we explore the potential of the Leaf-Spine architecture, which avoids the limitations of single-path routing by leveraging Equal-Cost Multi-path (ECMP) routing to distribute traffic across multiple available paths, improving bandwidth utilization and reducing congestion.
To overcome the inherent defects of single-path, three-layer structures, the flattened Leaf-Spine network structure has been proposed and studied [13,14,15]. Google researchers reviewed the progress of data center network design in depth and, through a discussion of Google's Jupiter project, revealed the potential of the Leaf-Spine architecture to achieve efficient bandwidth utilization and optimize application performance in data center networks [13]. Other scholars have traced the evolution of data center architecture over the past few decades, from the initial client-server model to the access-aggregation-core (AAC) architecture, and on to the Leaf-Spine architecture developed to meet the needs of low-latency, high-throughput server-to-server communication and load balancing. They also emulated a Leaf-Spine network environment and used machine learning methods to predict traffic from the Leaf-Spine switching layer to the servers; although these methods are theoretically feasible, they depend heavily on high-quality data [14]. Furthermore, while studies of the Leaf-Spine architecture such as those by Sultan et al. [14] and Alizadeh et al. [15] provide valuable insights into network design decisions, they do not address operational monitoring or real-time scheduling systems.
Unlike the Fat-Tree structure, in the Leaf-Spine architecture every leaf node (Leaf switch) connects directly to every spine node (Spine switch). Leaf nodes typically connect to servers, storage devices, and other terminal devices, while spine nodes are responsible for high-speed data forwarding between leaf nodes. In addition, the Leaf-Spine architecture supports ECMP [16], which distributes traffic among multiple available paths, effectively utilizing all links, improving bandwidth utilization, and reducing congestion. The multiple parallel links between leaf and spine nodes also provide redundancy, increasing network reliability and fault tolerance. Many network design problems with similar reliability and path-constraint requirements, such as those involving edge-disjoint paths and budget limits, can be modeled as edge-disjoint rooted distance-constrained minimum spanning tree (ERDCMST) problems. Arbelaez et al. [17] proposed a constraint-based parallel local search algorithm for the ERDCMST and demonstrated its effectiveness on real-world optical network topologies in Europe.
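To make the per-flow nature of ECMP concrete, the following is a minimal sketch of hash-based path selection; the hash function and field names are illustrative assumptions, not a specific switch implementation. Because every packet of a flow hashes to the same uplink, a single elephant flow cannot spread across links on its own, which is why explicit rescheduling is needed:

```python
import zlib

def ecmp_path(src_ip: str, dst_ip: str, src_port: int,
              dst_port: int, proto: int, num_paths: int) -> int:
    """Pick one of num_paths equal-cost uplinks by hashing the 5-tuple.

    Real switches hash in hardware; this sketch only illustrates that
    all packets of one flow map to the same path.
    """
    key = f"{src_ip}-{dst_ip}-{src_port}-{dst_port}-{proto}".encode()
    return zlib.crc32(key) % num_paths

# Every packet of this flow selects the same spine out of three,
# so an elephant flow stays pinned to one uplink under plain ECMP.
print(ecmp_path("10.0.0.1", "10.0.0.2", 5001, 5201, 6, num_paths=3))
```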
Recent research has significantly advanced machine learning (ML) for intelligent detection and optimization in diverse domains, including SDN and supply chain management. Several studies have explored using ML for attack detection and traffic management in SDN. For example, Rahman et al. discussed machine learning classifiers such as decision trees, random forests, and support vector machines for detecting Distributed Denial of Service (DDoS) attacks in SDN environments [18]. Such intelligent detection frameworks enhance flow behavior analysis and real-time pattern recognition, which indirectly support our approach to dynamic flow classification and congestion management in SDNs.
Furthermore, ML techniques have proven transformative in the realm of supply chain optimization. Recent studies have highlighted AI's role in supply chain disruption management, focusing on how real-time data analytics, including AI and blockchain technologies, improve operational resilience and decision-making [19]. This is particularly relevant for our approach to elephant flow scheduling in SDNs, where intelligent decision-making can optimize the allocation of network resources and improve overall system performance.
In this paper, we simulate elephant flows in a network environment using Mininet [20] and set corresponding traffic thresholds so that preset traffic types, such as elephant flows and mice flows, can be captured directly. Once an elephant flow appears in the Leaf-Spine structure, the method proposed in this paper polls and schedules it onto an equivalent path. The proposed Leaf-Spine polling scheduling method is validated against the Fat-Tree single path; the results show that it improves the utilization and stability of network equipment. Code is available at https://github.com/cmy-hhxx/el_monitor (accessed on 30 May 2025).
The main contributions of this work are summarized as follows:
1. We designed and implemented an SDN-based elephant flow monitoring system using the Ryu controller. This system classifies traffic based on duration and bandwidth.
2. We proposed a polling-based dynamic elephant flow scheduling algorithm that performs path rerouting across equal-cost multipaths in Leaf-Spine topologies, avoiding the congestion typical of traditional Fat-Tree structures.
3. We conducted simulations with Mininet and iperf to validate our approach, demonstrating that the proposed strategy achieves stable throughput (8 Mbps) and zero packet loss under load, outperforming traditional scheduling methods in stability and link utilization.
The remainder of this paper is organized as follows:
Section 2 introduces the design and implementation of our proposed elephant flow monitoring and scheduling strategy.
Section 3 presents the experimental setup, network simulation environment, and evaluation results.
Section 4 concludes the paper and discusses directions for future research.
3. Experimental Setup and Results
3.1. Experimental Setup
To verify the effectiveness of the proposed method for monitoring elephant flows, Mininet simulates the network devices and connects them to the Ryu controller, which takes over the network. With the el_monitor application developed in this work running, the pingall command is used to test network connectivity; this generates a large number of mice flows, and we observe whether el_monitor can detect them. Then h1, h4, and h5 are configured as network performance test clients (iperf clients) and h2, h3, and h6 as network performance test servers (iperf servers), with bandwidths of 8 Mbps, 4 Mbps, and 100 Kbps, respectively, corresponding to an elephant flow, a medium-sized elephant flow, and a mouse flow. We observe whether the el_monitor application works normally throughout this process. The network topology is shown in Figure 2.
To ensure consistency in the emulation, the experiments employed a Leaf-Spine topology comprising nine switches, with six leaf switches and three spine switches. This structure provided multiple equal-cost paths to support robust traffic distribution. The iperf tool generated a mix of elephant and mice flows under varying load conditions. This study identified elephant flows as those that either transferred more than 10 MB of data or maintained a sustained throughput above 1 Mbps. The Ryu SDN controller managed the network and actively polled flow statistics every 5 s to detect and respond to emerging elephant flows. Each network link operated at 10 Gbps, with latency configured at 1 millisecond between edge and leaf switches and 2 milliseconds between leaf and spine switches, closely reflecting realistic data center conditions. Each switch port maintained a queue size of 1000 packets to simulate typical buffering behavior. Every simulation lasted for 120 s to capture both transient effects and steady-state traffic dynamics.
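For reference, the topology described above can be expressed with Mininet's Python API roughly as follows. This is a minimal sketch under stated assumptions: host placement (one host per leaf), switch names, and the controller address are illustrative, and because tc-based shaping is unreliable above about 1 Gbps, the sketch uses 1000 Mbps links in place of the 10 Gbps links specified in the setup:

```python
from mininet.net import Mininet
from mininet.node import RemoteController
from mininet.link import TCLink
from mininet.topo import Topo

class LeafSpine(Topo):
    """Leaf-Spine fabric as in Section 3.1: 3 spines, 6 leaves.

    One host per leaf is an illustrative assumption."""

    def build(self, spines=3, leaves=6):
        spine_sw = [self.addSwitch(f"s{i + 1}") for i in range(spines)]
        for li in range(leaves):
            leaf = self.addSwitch(f"l{li + 1}")
            # Full leaf-to-spine mesh with 2 ms latency, as in the setup.
            for spine in spine_sw:
                self.addLink(leaf, spine, bw=1000, delay="2ms",
                             max_queue_size=1000)
            # Host attached over a 1 ms edge link.
            host = self.addHost(f"h{li + 1}")
            self.addLink(host, leaf, bw=1000, delay="1ms",
                         max_queue_size=1000)

if __name__ == "__main__":
    # Hand the network to an external Ryu controller (port 6653 assumed;
    # adjust if your Ryu instance listens on 6633).
    net = Mininet(topo=LeafSpine(), link=TCLink,
                  controller=lambda name: RemoteController(
                      name, ip="127.0.0.1", port=6653))
    net.start()
    net.pingAll()   # generates the mice flows used in the first test
    net.stop()
```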
In a fully connected fabric such as Leaf-Spine, multiple equivalent paths exist, and this paper exploits that property to schedule elephant flows. We simulate the network topology in Mininet and use the controller to issue flow tables so that the entire network converges. We then generate an elephant flow and, once it appears, analyze the OpenFlow switch flow tables through the running scheduler to observe whether the path changes; if it does, the scheduling is successful. Traditional data centers generally use the Fat-Tree structure, so to demonstrate the advantages of the scheduling system under the Leaf-Spine structure, we also use Mininet to simulate a Fat-Tree structure, connect it to the Ryu controller, and use the STP protocol to converge the network. Through this experiment we observe how many usable paths remain in the network. Since data center workloads such as migrations can occur at any time and generate elephant flows, we specifically analyze network resource utilization under the Leaf-Spine structure.
3.2. Experimental Results
Running the pingall command generates a large number of ICMP messages in the network; these flows are short-lived and small, i.e., mice flows. We then test the medium elephant flow and the elephant flow using bandwidths of 4 Mbps and 9 Mbps, respectively. The el_monitor application successfully identifies the traffic in the network and displays detailed information such as the protocol, IP addresses, and transmission rate.
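For illustration, the polling and classification logic behind el_monitor can be sketched in Ryu as below, following the pattern of Ryu's standard flow-statistics monitoring. The thresholds mirror Section 3.1, while the class name and logging format are illustrative; the released application tracks more per-flow detail:

```python
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import (MAIN_DISPATCHER, DEAD_DISPATCHER,
                                    set_ev_cls)
from ryu.lib import hub

ELEPHANT_BYTES = 10 * 1024 * 1024  # >10 MB transferred
ELEPHANT_BPS = 1_000_000           # or sustained >1 Mbps
POLL_INTERVAL = 5                  # seconds, as in Section 3.1

class ElephantMonitor(app_manager.RyuApp):
    """Simplified sketch of el_monitor's polling loop."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.datapaths = {}
        self.monitor_thread = hub.spawn(self._monitor)

    @set_ev_cls(ofp_event.EventOFPStateChange,
                [MAIN_DISPATCHER, DEAD_DISPATCHER])
    def _state_change(self, ev):
        # Track switches as they connect and disconnect.
        dp = ev.datapath
        if ev.state == MAIN_DISPATCHER:
            self.datapaths[dp.id] = dp
        elif ev.state == DEAD_DISPATCHER:
            self.datapaths.pop(dp.id, None)

    def _monitor(self):
        # Poll flow statistics from every switch every 5 s.
        while True:
            for dp in self.datapaths.values():
                dp.send_msg(dp.ofproto_parser.OFPFlowStatsRequest(dp))
            hub.sleep(POLL_INTERVAL)

    @set_ev_cls(ofp_event.EventOFPFlowStatsReply, MAIN_DISPATCHER)
    def _flow_stats_reply(self, ev):
        for stat in ev.msg.body:
            if stat.duration_sec == 0:
                continue
            # Average rate over the flow's lifetime; a fuller version
            # would diff successive polls for an instantaneous rate.
            rate_bps = stat.byte_count * 8 / stat.duration_sec
            if stat.byte_count > ELEPHANT_BYTES or rate_bps > ELEPHANT_BPS:
                self.logger.info("elephant flow: match=%s rate=%.0f bps",
                                 stat.match, rate_bps)
```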
Once an elephant flow appears in the Leaf-Spine network structure of the data center, the traffic must be redirected in time. After the experiment simulates the Leaf-Spine network structure, the scheduler “reroute.sh” is used to distribute the traffic onto equal-cost paths, thereby switching between two paths. The script implements dynamic rerouting of traffic in several steps: deleting old flow table entries (the del-flow function), printing flow table entries (the dump-flow function), and configuring path 1 and path 2 (the path_1 and path_2 functions).
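reroute.sh itself is a shell script; the sketch below mirrors its steps in Python (for consistency with the other listings) by shelling out to ovs-ofctl. The bridge name, match fields, and output ports are illustrative assumptions for the topology of Figure 2:

```python
import subprocess

def ofctl(*args: str) -> str:
    """Thin wrapper around the ovs-ofctl CLI."""
    return subprocess.run(["ovs-ofctl", *args], capture_output=True,
                          text=True, check=True).stdout

def dump_flows(bridge: str) -> str:
    # Mirrors the dump-flow step: inspect the current flow table.
    return ofctl("dump-flows", bridge)

def del_flow(bridge: str, match: str) -> None:
    # Mirrors the del-flow step: remove the elephant flow's old entries.
    ofctl("del-flows", bridge, match)

def set_path(bridge: str, match: str, out_port: int) -> None:
    # Mirrors path_1/path_2: install an entry toward one spine uplink.
    ofctl("add-flow", bridge, f"{match},actions=output:{out_port}")

# Example: move a UDP elephant flow from the first to the second spine.
MATCH = "udp,nw_src=10.0.0.1,nw_dst=10.0.0.2"
del_flow("l1", MATCH)               # clear the old path's entries
set_path("l1", MATCH, out_port=2)   # path_2: uplink toward spine 2
print(dump_flows("l1"))             # verify that the rule changed
```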
The experiment tested an elephant flow in the Fat-Tree single-path network and in the Leaf-Spine multi-path network, respectively. The traffic on the generating network interface was exported using Wireshark, and the packet loss rates were compared after filtering on the User Datagram Protocol (UDP) and the host IP. Because the Spanning Tree Protocol enforces a single path in the Fat-Tree structure, elephant flows cannot be dynamically rerouted, leaving individual links overburdened and raising the risk of network congestion.
The throughput curve of the Fat-Tree structure is shown in Figure 3. After running STP in the Fat-Tree structure, multiple network congestion events occurred during the 120 s iperf run, at roughly 65 s, 75 s, 98 s, and 105 s. When the network is congested, the packet forwarding rate drops to 0 packets/s, meaning no data packets can pass through the network until it recovers. The network performance of the single-path solution is therefore unstable.
The throughput curve of the Leaf-Spine structure is shown in Figure 4. After running the scheduler on the Leaf-Spine structure, the packet loss rate is clearly better than that of the Fat-Tree single path, and network performance stabilizes. During the 120 s test, the network rate stabilized at 8 Mbps, essentially matching the bandwidth set by iperf. Although the rate fluctuated slightly at the 35th, 40th, and 63rd seconds, varying between 7 Mbps and 8 Mbps, no congestion occurred, traffic continued to flow at a high rate, and the fluctuations had little impact on overall performance. In addition to throughput, packet loss and jitter were measured from the Wireshark logs; these measurements showed packet drops and near-zero throughput under congestion, though the detailed curves are omitted due to space constraints. Future versions of the paper will include a full set of performance indicators, including packet loss rate and jitter graphs.
4. Conclusions
This paper uses SDN to monitor and schedule elephant flows, leveraging SDN's centralized control to overcome the limitations of traditional distributed networks. An elephant flow monitoring and scheduling application is developed based on the Ryu framework. Flow statistics are collected from the OpenFlow switches every 5 s, and when the elephant flow conditions are met, flow details such as source IP, destination IP, port, and protocol are displayed on the console.
In contrast to the spanning tree method, the proposed approach enhances network utilization and analyzes the paths between switches. Experiments simulating 9 Mbps elephant flows over 120 s show that the traditional Fat-Tree spanning tree structure causes network jitter, packet loss, and congestion, whereas the Leaf-Spine solution proposed here reduces jitter, avoids congestion, and maintains stable 9 Mbps throughput. This solution can help network operators improve monitoring, performance, and equipment utilization. The current approach has limitations that need further exploration: static thresholds for elephant flow detection may not perform well in dynamic, unpredictable environments, and the round-robin rerouting strategy lacks real-time awareness of network conditions, which can lead to inefficient path utilization. Future work will focus on adaptive thresholding for more flexible detection, incorporating real-time telemetry and machine learning to improve flow prediction and path selection, and extending the evaluation to large-scale deployments for better scalability insights. Additionally, applying our approach in multi-tenant cloud environments introduces new challenges in policy isolation, dynamic resource sharing, and control overhead, which merit further investigation.
The proposed elephant flow scheduling approach not only demonstrates notable performance improvements but also offers valuable implications for existing network management protocols. By comparing with traditional mechanisms such as ECMP and OSPF, we highlight the advantages of SDN-based dynamic flow rerouting in enhancing throughput consistency and mitigating congestion. Moreover, the method can effectively complement Quality of Service frameworks by enabling real-time flow classification and adaptive path selection. Its fine-grained control capabilities also address the limitations of coarse-grained flow aggregation, offering a practical path forward for refining future network protocol designs.