This section proposes using the packets of flows forwarded to the controllers to generate or update the flow entries for elephant prediction. This completely avoids moving statistics from switches to controllers for the purpose of elephant prediction. Since the flows used to predict elephants are incomplete, simply applying the thresholds or models trained on complete datasets is not suitable. Since entry timeouts affect the number of packets forwarded to the controllers, which in turn affects the elephant model used for prediction, we propose a two-step approach to maximizing elephant prediction accuracy () and efficiency (E) while minimizing controller–switch interaction (I): (1) using LR on the complete traffic dataset to learn an elephant model that takes the selected features of a flow in TR1 as input and outputs the probability of that flow being an elephant, and (2) applying BO to find the best initial timeout (), the rate (r) at which timeouts increase, and the probability threshold () for a flow to be an elephant.
In the rest of this section, we first analyze the features of elephants and mice and the TCP and UDP elephants in TR1 to determine the features used to model the elephants. Second, we apply LR to actually model the elephants. Third, we formulate an optimization problem that finds the best to maximize and E while minimizing I. Finally, we apply BO to solve it.
4.1. Elephant Modeling Features
To determine the features of the flows used to model elephants, the flows of TR1 with more than 10,000 bytes were marked as real elephants because more than 90% of the total bandwidth usage is occupied by such flows. The cumulative distribution functions (CDFs) of the packet counts, flow size, flow duration, mean packet size, and packet inter-arrival time of all the elephants in TR1 were computed, as shown in Figure 3.
We found that over 95% of the elephants had 8+ packets, while over 92.5% of the mice had 5 or fewer packets, including 62% of them having only 1 packet. While over 70% of the mice lasted less than 0.38 s, over 95% of the elephants lasted longer than 0.38 s. Although most of the elephants had flow sizes between 10,000 and 500,000 bytes, 50% and 97% of the mice had flow sizes of less than 144 and 6,000 bytes, respectively. While 80% of the mice had a mean packet size of less than 400 bytes, over 90% of the elephants had a mean packet size greater than 400 bytes, and over 80% of the elephants had a mean packet size greater than 1,000 bytes. While over 80% of the elephants had a mean packet inter-arrival time greater than 0.2 s, over 70% of the mice had one of less than 0.2 s, and 62% of the mice had one of 0 because only 1 packet was included. These demonstrate that the elephants and mice had distinct distributions in the five features, and all these features should be used to model elephants.
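To make the five features above concrete, they can be computed from a flow's packet timestamps and sizes with a short helper. This is a sketch, not the measurement pipeline used in the paper; the function name and the flow representation are our own:

```python
from statistics import mean

def flow_features(arrivals, sizes):
    """Compute the five per-flow features used in the CDF analysis:
    packet count, flow size (total bytes), flow duration,
    mean packet size, and mean packet inter-arrival time."""
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    return {
        "packets": len(arrivals),
        "size": sum(sizes),
        "duration": arrivals[-1] - arrivals[0],
        "mean_pkt_size": mean(sizes),
        # single-packet flows have no inter-arrival gaps, hence the 0 noted above
        "mean_iat": mean(gaps) if gaps else 0.0,
    }
```

A one-packet mouse yields a mean inter-arrival time of 0, matching the 62% of mice described above.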
Since elephants can be TCP or UDP flows generated by various applications, we further analyzed the features of the TCP and UDP elephants in TR1. We found that in TR1, 63% and 37% of the elephants were TCP and UDP flows, respectively. Although the TCP and UDP elephants did not differ much in the CDF of flow size, as shown in Figure 3C’, the TCP elephants often had lower packet counts, shorter flow durations, and shorter packet inter-arrival times than the UDP elephants. As shown in Figure 3A’,B’,E’, over 50% of the TCP elephants had 19 or fewer packets, while over 70% of the UDP elephants had 19+ packets. Although over 70% of the TCP elephants lasted less than 50 s, over 70% of the UDP elephants lasted 50+ seconds. While over 70% of the TCP elephants had a mean packet inter-arrival time of less than 0.92 s, over 70% of the UDP elephants had one longer than that. The TCP and UDP elephants also varied in their mean packet sizes: while 80% of the UDP elephants had 800–1200 bytes per packet, over 80% of the TCP elephants had 1200–1500 bytes per packet.
Therefore, TR1 contained TCP and UDP elephants that differed greatly in features other than flow size. Two networks may have similar feature distributions for their TCP-only or UDP-only elephants, since TCP and UDP elephants on two networks are often generated by similar types of applications. However, the overall elephant distributions of two networks can differ widely due to differing ratios of TCP and UDP elephants in their traffic. Accordingly, an elephant model trained on TR1 may not achieve high generalization accuracy over another network. These observations motivated us to train dedicated models for TCP and UDP elephants to improve the robustness of the models.
Since we used complete packet traffic to train the elephant models but predicted elephants based on the packet traffic sampled by flow entry timeouts, the two packet traffic datasets had a large difference in packet count and mean packet inter-arrival time. Therefore, we chose the accumulated flow duration (), flow size (), and mean packet size () to model the TCP and UDP elephants and improve the robustness of the models.
4.2. Explainable Logistic Regression for Elephant Modeling
LR is a classification algorithm used to assign observations to a discrete set of classes. We chose it for its simplicity and explainability. In our case, we had two classes of flows: elephants and mice. We labeled elephants as one and mice as zero. LR generates a hyperplane, as shown in Equation (1), to separate the samples in the dataset and further transforms the output into a probability using the logistic sigmoid function, as shown in Equation (2). This probability is then mapped to elephants or mice using a threshold , where F is the set of flows in TR1 and i is a flow in the set F:
Let  be 0.5. Then, the flow i is classified as an elephant if  is greater than . We applied the built-in LR function in Python to learn  for the feature  and the parameter b. As illustrated in Equation (3), for each flow , the features of the mean packet size , flow size , and flow duration  are the inputs of LR, and , , and  are the weights of the features. We consider two scenarios: (1) training a model for all elephants and (2) training a model with submodels dedicated to TCP and UDP elephants.
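The scoring in Equations (1)–(3) can be sketched in a few lines. The weight and bias values below are placeholders for illustration only, not the trained values from Table 3:

```python
import math

def sigmoid(z):
    """Logistic sigmoid, Equation (2)."""
    return 1.0 / (1.0 + math.exp(-z))

def elephant_probability(features, weights, bias):
    """p_i = sigmoid(w . x_i + b); features are
    (mean packet size, flow size, flow duration)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def classify(features, weights, bias, threshold=0.5):
    """Flow i is an elephant iff its probability exceeds the threshold."""
    return elephant_probability(features, weights, bias) > threshold
```

In practice, the weights would be learned with a library routine such as scikit-learn's `LogisticRegression`, which is a common choice for "the built-in LR function in Python".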
As shown in Table 3, we have  and  for scenario 1. In scenario 2, we have  and  for the TCP elephants and  and  for the UDP elephants. It can be seen that the flow size feature had the highest weight in both scenarios, implying that the total bytes of a flow are key for elephant classification. Both models achieved a prediction F1 score of over 99% over TR1.
To roughly estimate the accuracy of such models over the sampled dataset, we used a fixed hard timeout of 0.00001 s to sample TR1. We set . The prediction F1 scores of both models were around 0.89. Although prediction accuracy can be measured by the recall, precision, or F1 score, we chose the F1 score because it balances recall and precision simultaneously. The two models had lower prediction F1 scores over the sampled dataset because sampling leads to packet loss, which decreases the accuracy of the model used to predict elephants. Accordingly, we needed to optimize the timeouts of the flow entries as well as the elephant model to achieve high prediction accuracy.
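The hard-timeout sampling described above can be simulated as follows. This is a simplified sketch: it ignores the controller-to-switch installation delay and assumes the entry is reinstalled at the arrival time of the packet that missed:

```python
def sample_by_hard_timeout(arrivals, timeout):
    """Return the indices of the packets of one flow that are forwarded to
    the controller under a fixed hard timeout: a packet is forwarded iff
    no live flow entry exists when it arrives."""
    forwarded, entry_start = [], None
    for j, t in enumerate(arrivals):
        if entry_start is None or t - entry_start >= timeout:
            forwarded.append(j)   # table miss: packet goes to the controller
            entry_start = t       # entry (re)installed; expires after `timeout`
    return forwarded
```

With a very small timeout such as 0.00001 s, almost every packet misses the flow table, so the controller sees nearly the complete flow; larger timeouts thin the sample and degrade the F1 score as described above.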
4.3. Optimization Problem Formulation
In our proposed approach, the prediction can be performed at any moment when the controllers receive a packet of flows forwarded by switches. Since the traffic used to predict elephants is sampled by the timeouts of the flow entries, a major challenge is to tune the elephant model and the sampling timeouts. Accordingly, the elephant model generated from the complete traffic dataset, as listed in Table 3, and the packet sampling timeouts should be optimized.
As shown in Equations (1) and (2), given the mean packet size, flow size, and duration of flow i, LR outputs the probability . By comparing  with the given threshold , the model classifies the flow i as an elephant if  and as a mouse otherwise. Since the chosen features are not significantly affected by sampling, the CDF of such features over the sampled dataset has a high probability of having a shape similar to that over the complete dataset. Therefore, we believe that simply adjusting the probability threshold , instead of the parameters  and b of the model trained on the complete dataset, can achieve high prediction accuracy over the sampled dataset. Accordingly, we formulated an optimization problem that finds the best initial timeout (), timeout increase rate (r), and probability threshold () to maximize the elephant prediction accuracy () and efficiency (E) while minimizing the network delay (L).
Given  for each flow , a flow entry starts with a hard timeout of . Each time the timeout expires, the timeout value increases to r times its current value, until the flow is classified as an elephant or reaches the end of its lifetime. When a flow is classified as an elephant, its flow entry is configured with an idle timeout of 5 s. In particular, when a flow packet is forwarded to the controllers, Equations (1) and (2) are used to compute the probability that this flow is an elephant. Let  be the label of flow i; it is one if flow i is a real elephant and zero otherwise. We let  be the label of flow i predicted by the model. Since the controllers keep making predictions for each flow as the flow accumulates,  is the final decision made by the controllers. Then, the F1 score () of the target prediction can be computed using Equation (5), where  and  are the recall and precision, which can be computed using Equations (6) and (7), respectively.
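The F1 score of Equations (5)–(7) over the final per-flow decisions can be sketched directly; the list-of-labels interface here is our own simplification:

```python
def f1_score(actual, predicted):
    """F1 over per-flow elephant labels: `actual` holds the true labels
    and `predicted` the controllers' final decisions (Equations (5)-(7))."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    recall = tp / (tp + fn) if tp + fn else 0.0        # Equation (6)
    precision = tp / (tp + fp) if tp + fp else 0.0     # Equation (7)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)  # Equation (5)
```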
The elephant prediction efficiency (E) represents how long an elephant has lived when it is correctly predicted. Let  be the cumulative time duration in which flow i is predicted to be an elephant and  be the lifetime of flow i. Then,  computes the prediction efficiency of flow i, and the overall prediction efficiency (E) is the average prediction efficiency over all real elephants that are correctly predicted, as shown in Equation (8).
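Equation (8) averages the per-flow ratio of predicted-elephant time to lifetime over the correctly predicted real elephants. A minimal sketch, with a tuple-based flow record of our own choosing:

```python
def prediction_efficiency(flows):
    """E of Equation (8): each flow is a tuple
    (is_real_elephant, predicted_elephant, time_labeled_elephant, lifetime);
    only real elephants that were correctly predicted contribute."""
    ratios = [t_pred / lifetime
              for is_elephant, predicted, t_pred, lifetime in flows
              if is_elephant and predicted]
    return sum(ratios) / len(ratios) if ratios else 0.0
```

A flow detected early spends most of its lifetime labeled as an elephant and contributes a ratio close to one, so a higher E means earlier detection.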
Given , we simply use the ratio of the total number of packets of flows forwarded to the controllers () to the total number of packets of flows forwarded to the controllers under an idle timeout of 1 s () to represent the increase in network latency (L), as formulated in Equation (9), since the greater the volume of packets forwarded to the controllers, the more forwarding latency is added to the network.
To compute , for each flow , we have a binary parameter . Let  be one if the jth packet of flow i is forwarded to the controllers and zero otherwise. If  is the time at which the jth packet of flow i arrives at the switch,  is the most recent activation time of the flow entry of flow i, and  is the current hard timeout value of the flow entry of flow i, then the jth packet of flow i is forwarded to the controllers if , and  can be calculated using Equation (10).  is the total number of packets of flow i forwarded to the controllers.  can also be calculated using Equation (10) if we let  be the last activation time of the flow entry of flow i and let  be one if  and zero otherwise:
Let S be the entire domain of . The proposed optimization problem finds the best  such that the elephant prediction inaccuracy () and the increase in network latency (L) are minimized while the efficiency (E) is maximized. Since this optimization problem has three conflicting objectives, the best solution is not unique. We therefore assign weights  to the objectives , respectively, and the proposed optimization problem can be simplified as shown in Equation (11), with the final objective function  as shown in Equation (12). All the symbols used in this paper are listed in Table 4:
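A weighted scalarization of the three objectives can be sketched as below. The exact form of Equations (11) and (12) is given in the paper's equations; the form here, w1(1 − F1) + w2(1 − E) + w3·L, is an assumption chosen so that all three terms are minimized together:

```python
def scalarized_cost(f1, efficiency, latency_increase, w=(1.0, 1.0, 1.0)):
    """Assumed single-objective form: penalize inaccuracy (1 - F1),
    inefficiency (1 - E), and the latency increase L, weighted by w."""
    w1, w2, w3 = w
    return w1 * (1.0 - f1) + w2 * (1.0 - efficiency) + w3 * latency_increase
```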
4.4. Applying Explainable Bayesian Optimization
It is time-consuming to exhaustively search the entire solution set (S) of the proposed optimization problem to find the best solution. Gradient descent-based optimization approaches can quickly converge to an optimum but are not suitable for solving the proposed problem because the problem is discrete, and its derivative cannot easily be obtained. Grid search, random search, and genetic search approaches are also unsuitable because they are either too computationally expensive or cannot generate approximate solutions of sufficient quality.
BO is a search mechanism based on BT. It performs searches efficiently and effectively. Before applying BO to solve our proposed problem, we should (1) represent the parameters as a vector; (2) define the search space; (3) formulate the objective function; and (4) calculate the cost over the objective function.
The solution is represented as . Since BO requires a continuous solution domain, we let  be a real number in , where  is a small value chosen with respect to the shortest packet inter-arrival time of the elephants in the training dataset (we let  s in TR1). We let r be a real number in [1, 5], since we preferred the timeouts to grow gradually to reduce the total number of packets of a flow forwarded to the controllers. We let  be a real number in  to adjust the probability threshold of elephant prediction without actually changing the elephant model. The objective function is , where , E, and L are formulated as shown in Equations (5), (8), and (9), and the cost is the output of the objective function.
Let A and B be two events, where , , and  refer to the likelihood, prior, and posterior probabilities, respectively. According to BT, we have  (given  to be normalized), which provides a framework to quantify the beliefs about an unknown objective function given samples that form the domain and their evaluations via the objective function. In our case, a sample refers to , and it is evaluated using the objective function (), where  also refers to the cost of . The samples and their costs were collected sequentially to form the data  that define the prior . The likelihood  is defined as the probability of observing the data D given  and keeps changing as more observations are collected.
The posterior () represents what we have learned about the objective function. It is an approximation of the objective function and can be used to estimate the cost of candidate samples that we may want to evaluate. It is a surrogate objective function that probabilistically summarizes the conditional probability of the objective function f given the available data (D), or . Here, we chose Gaussian process regression (GPR) to estimate f, since it is widely used, is capable of efficiently and effectively summarizing a large number of functions, and transitions smoothly as more observations are made available to the model. Based on this estimate, an acquisition function is used to find the samples in the search space that are most likely to pay off. As additional samples and their evaluations via the objective function are collected, they are added to the data D, and the posterior is updated. This process is repeated until the given number of iterations is exhausted. Our proposed BO algorithm is illustrated in Algorithm 1.
Algorithm 1 BO algorithm for our proposed problem
1: INPUT: the number of iterations, the set S, and the data
2: OUTPUT: the best
3: initialize D
4: compute the GP over D
5: pick a new  using the acquisition function
6: compute its cost
7: update D
8: if the number of iterations has been exhausted then
9:   go to 2
10: else
11:   go to 4
12: end if
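To make the loop of Algorithm 1 concrete, the sketch below implements it for a single parameter using a small pure-Python GPR surrogate and a lower-confidence-bound acquisition function. The acquisition choice and all names (`bayes_opt`, kernel length scale, candidate grid) are our assumptions, not the paper's implementation:

```python
import math
import random

def rbf(a, b, ls=0.2):
    """Squared-exponential kernel on scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * ls * ls))

def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(xs, ys, xq, noise=1e-4):
    """GPR posterior mean and variance at the query points xq."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    alpha = solve(K, ys)
    mean, var = [], []
    for q in xq:
        k = [rbf(q, a) for a in xs]
        v = solve(K, k)
        mean.append(sum(ki * ai for ki, ai in zip(k, alpha)))
        var.append(max(rbf(q, q) - sum(ki * vi for ki, vi in zip(k, v)), 1e-12))
    return mean, var

def bayes_opt(cost, lo, hi, n_init=3, n_iter=10, seed=0):
    """Steps 3-11 of Algorithm 1 for one continuous parameter."""
    rng = random.Random(seed)
    xs = [lo + (hi - lo) * rng.random() for _ in range(n_init)]  # initialize D
    ys = [cost(x) for x in xs]
    cand = [lo + (hi - lo) * i / 200.0 for i in range(201)]
    for _ in range(n_iter):
        mu, var = gp_posterior(xs, ys, cand)        # compute the GP over D
        # lower-confidence-bound acquisition: prefer low predicted cost
        # or high uncertainty
        scores = [m - 1.5 * math.sqrt(v) for m, v in zip(mu, var)]
        x_next = cand[min(range(len(cand)), key=scores.__getitem__)]
        xs.append(x_next)
        ys.append(cost(x_next))                     # compute its cost, update D
    best = min(range(len(xs)), key=ys.__getitem__)
    return xs[best], ys[best]
```

In practice, a library such as scikit-optimize's `gp_minimize` provides the same loop for the full three-dimensional solution vector.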
The algorithm was applied to both scenarios 1 and 2, as listed in Table 3. While one set of  was optimized for all flows in scenario 1, two dedicated sets of  were optimized for the TCP and UDP flows separately in scenario 2. The results show that  was best for scenario 1, and  was best for the TCP and UDP elephants in scenario 2. This implies that using the fixed hard timeout of 0.00001 s over TR1, or slightly increasing the timeout value to 1.047 times the current value each time the flow entries time out (until the flows are predicted to be elephants or reach the end of their lifetime), can maximize the elephant prediction accuracy and efficiency while minimizing the increase in network latency. After a flow was predicted to be an elephant, its flow entry was switched to an idle timeout of 5 s.
It should be noted that the probability threshold  was adjusted from 0.5 to 0.7/0.566. In LR, the flows with  were placed on plane h1, as shown in Figure 4. In the trained model, the flows with  were predicted to be elephants, and they were placed on top of plane h1. Adjusting  from 0.5 to 0.7/0.566 implies that the hyperplane separating the elephants and mice moved to h2. This is because the traffic sampled by the timeouts was incomplete, and a flow needs more time to accumulate packets so that the features generated from the incomplete traffic reach values similar to those based on the complete traffic. Moving hyperplane h1 to h2 corresponds to the increase in the probability threshold, as shown in Figure 4.