1. Introduction
Computation-intensive and bandwidth-hungry applications have driven a revolution in Data Center Networks (DCNs), which must support continuously growing network traffic while meeting network performance requirements [1,2]. Recent measurements by Microsoft [1] and Facebook [2] showed that the traffic exchanged between DC racks is highly skewed: a few racks exchange the majority of the overall traffic, whereas the remaining racks exchange little or no traffic at all. As a result, DC links are either underutilized or overutilized, and their uniform capacity and fixed topology prevent them from optimally meeting the workload capacity requirements. Optical Wireless Data Center Networks (OWDCNs) have emerged as an alternative to conventional wired DC networks. OWDCNs have the agility to allocate capacity where it is needed, and they offer further benefits such as fewer cables and lower maintenance overhead, power consumption and heat dissipation [1,3,4,5].
In recent works, researchers have proposed different use cases of optical wireless communication (OWC). One of the emerging OWC technologies is free-space optics (FSO), in which a modulated light beam propagates through free space without fibers. FSO therefore combines the flexibility of wireless communication with the high speed and high bandwidth of optical communication. Owing to these features, FSO has been widely used to tackle the aforementioned challenges. FireFly [4] and ProjecToR [1] are mainly designed to carry all DC workloads. F4Tele [5,6] uses OWC to build a dedicated network for management traffic. Umair et al. [7] proposed a wireless network for SDN traffic, and Zhou et al. [8] built a separate wireless network for facilities traffic.
A typical DCN has thousands of Top of Rack (ToR) switches, and the physical dimensions and processing capacity of a rack are not sufficient to install or operate thousands of transceivers to communicate with every other rack. Moreover, a limited number of FSO links cannot serve a large number of racks simultaneously. Thus, a line-of-sight (LoS) FSO link between every pair of racks is hard to build; the same challenge exists in wired networks. This holds for regular DC traffic as well as for control traffic, network management traffic and facility messages. To tackle this challenge, FireFly [4], ProjecToR [1], F4Tele [5] and others [9] establish indirect LoS FSO links on demand by exploiting emerging technologies such as ceiling mirrors, disco balls, switchable mirrors and digital micromirror devices. Although these schemes address the traffic workload challenges and improve DC communication performance, establishing an indirect LoS FSO lightpath takes a non-negligible amount of time. Studying and analyzing the delay of establishing these FSO lightpaths to carry traffic from source to destination is therefore essential.
Thus, in this work, we aim to study and analyze all factors that contribute to building an indirect LoS FSO link between DC racks. Establishing the link involves a sequence of nonuniform processing steps. Each step is performed on a different device with different processing operations and service times, starting from checking whether a lightpath exists between the source and destination and ending with steering the FSO link gears (transceivers and switching mirrors) toward the designated destination. The FSO gears are reconfigurable, and the controlling unit (CU) can change their directions both vertically and horizontally [1,4,5]. This introduces a random process composed of arbitrarily distributed random variables, which requires careful analytical and empirical study. Additionally, the process of establishing an indirect FSO link is launched upon the arrival of a new flow, i.e., a flow whose destination cannot be reached by any of the existing FSO links of its source rack, and is terminated according to the CU commands. In previous work, we studied the delay involved in SDN flow setup [10], where OpenFlow switches communicate with the SDN controller to build an end-to-end path. Normally, OpenFlow switches must be configured with the proper commands to route incoming packets to the right destination.
1.1. Motivations
Next-generation DCs are adopting FSO and Radio Frequency (RF) wireless communication to support the exponential growth of data. A huge amount of data is stored in the servers, and the number of servers reaches hundreds of thousands to accommodate and process this data simultaneously [1,2]. In terms of installation overhead and costs, an FSO network avoids the cost of building ducts and pulling cables through them. Fiber optic cables are inflexible and fragile, and they are prone to damage and cuts during construction and maintenance. In terms of scalability, FSO networks can be expanded easily by adding transceivers at the edges without modifying the network infrastructure [1,4,9]. Moreover, FSO technology offers link capacities of up to terabits per second [11].
1.2. Related Works
To overcome the weaknesses and limitations of wired DCs, researchers have attempted to reap the benefits of wireless communication technologies [9,12,13,14]. Researchers have classified DC traffic by size as large (elephant) or small (mice) flows, and by service as network management traffic or data traffic. OWDCN researchers exploited these classifications in their schemes. The following paragraphs summarize their findings, focusing on FSO-related works as they are of interest here. F4Tele [5] builds an FSO-based network dedicated to network management (NM) traffic. Rather than sending NM traffic over the same network as data traffic, the author uses FSO technology to build a dedicated network that carries it from the data racks to the NM racks.
Similarly, the authors of ProjecToR [1] exploited the DC traffic communication pattern, in which a few racks are overloaded and the majority are underutilized, in both the topology structure and traffic scheduling. To this end, they leveraged digital micromirror devices and disco balls to speed up the switching of FSO links. A digital micromirror device can direct FSO beams toward tens of thousands of directions and needs 12 µs to switch between them. The authors of [15] introduced an OWDCN solution based on nanosecond semiconductor optical amplifiers, wavelength selectors and an arrayed waveguide grating router, and thoroughly investigated it through detailed experiments and hardware. The authors of [16] proposed and evaluated an OWDCN architecture named ROTOS, based on reconfigurable optical ToR switches, in which the wavelength capacity and beam directions are configured on demand by a centralized unit.
On the other hand, other researchers attempted to build an overlay FSO network dedicated to network management traffic and encountered multiple challenges. In particular, the network management racks do not have enough physical and processing capacity to serve thousands of FSO beams. To address this, the authors of [5] proposed a traffic scheduling method compatible with the network management traffic workload. Moreover, the author of [6] attempted to reduce the number of FSO links between the data racks and the management racks. Since DCs show a skewed traffic distribution, the author shuffles the racks to regulate this distribution: the solution groups multiple racks into one cluster, and every cluster has a dedicated FSO transceiver toward the management rack. This method simplifies flow scheduling and reduces unfairness. Similarly, the authors of [17] introduced an FSO scheme for facility traffic.
1.3. Paper Objectives and Novel Contributions
The main contributions of this work can be summarized as follows:
- This paper attempts to understand and model the process of establishing a rack-to-rack (R2R) indirect LoS FSO link in wireless Data Center Networks.
- The establishment of an R2R indirect LoS FSO link involves a sequence of nonuniform processing steps. Each step is performed on a different device with different processing operations and service times. This introduces a random process composed of arbitrarily distributed random variables, which requires careful analytical and empirical study.
- According to recent data center traffic studies by Microsoft [1] and Facebook [2], DCs carry both short and long flows, as well as short-term and long-term rack-to-rack traffic directions. This article considers the variability in these traffic workloads through two scenarios. The first scenario presents a model compatible with short flows and long-term traffic directions, whereas the second scenario considers a model suited to long flows and short-term directions. Although the first scenario is more practical and easier to model, it is expected to face high power consumption and inefficient resource utilization. In the second scenario, by contrast, the R2R FSO links are terminated after flow completion, which favors power conservation and efficient utilization.
- Flows in a DC may or may not be forwarded to a destination that already has a link. The probability that a flow is forwarded to such a destination depends on the number of established FSO links S, where $0 \le S \le K$ and K is the maximum number of FSO links that can be launched from an individual rack. In the second scenario, S is a stochastic variable whose state is determined by multiple factors.
The remainder of the paper is organized as follows. Section 2 discusses the problem statement of the indirect LoS R2R FSO link setup process and where link management techniques contribute. Section 3 describes the system model mathematically under its two scenarios. Section 4 presents the mathematical analysis, including the new link setup delay, system capacity and blocking probability. Section 5 presents the performance evaluation results, and Section 6 concludes the paper.
2. Problem Statement
DCs have thousands of servers grouped into almost identical racks, i.e., every rack holds the same number of servers. The racks communicate with each other through a switch installed on top of every rack, known as the top-of-rack (ToR) switch. The ToR switches are connected through optical fiber cables to hundreds of intermediate high-speed switches. This wire-based structure faces maintenance and development challenges and clear deficiencies in allocating the optimal capacity to the forwarded flows. Recent studies of DC traffic characteristics have motivated the development of alternatives to existing wired technologies. Wired DCs have a rigid structure and a uniform distribution of communication and processing resources, which hinder them from efficiently coping with DC traffic workload requirements. In contrast, emerging wireless technologies (e.g., FSO and mmWave) have the features needed to be a superior alternative.
DC servers exchange inter-rack and intra-rack (local) traffic. Inter-rack traffic carries both data traffic (e.g., search queries) and control traffic (e.g., syslog messages). When a new flow arrives at the ToR switch, a path table is conventionally examined to determine the forwarding port and FSO link for this flow. However, this is not the end of the journey. The data center has thousands of racks while the ToR switch has a limited number of outgoing ports (FSO transceivers), which rules out building a direct link to every rack. To tackle this challenge, researchers have developed indirect LoS FSO link mechanisms and on-demand FSO link-scheduling algorithms [1,4,5]. When the ToR switch has no FSO link to serve a flow, it establishes a new one. Establishing a new FSO link involves further processing steps, which introduce extra delay and overhead. The request passes through a series of services: the ToR switch, then the control channel, then the CU, and finally the FSO link gears (mirrors and transceivers). Each service needs time to process the request. The ToR switch needs time to process the request, read its switching table and forward the request through the control channel to the CU. The control channel needs time to transmit and propagate the request, which depends on its data rate. Once the CU receives the request, it needs time to execute the optimization algorithm that finds the optimal path between the source and destination ToR switches. The CU then instructs the FSO gears, which need time to change their directions by rotating the transceivers and changing the switching-mirror states. The flow at the ToR switch waits for all of these steps to complete. The delay for setting up a new FSO link (which can range from microseconds to seconds depending on the adopted technology) is therefore the sum of these four service times: ToR processing, control-channel transmission and propagation, CU computation and FSO gear reconfiguration.
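Because the setup delay is a sum of four stage times, its mean is the sum of the stage means regardless of their distributions. The sketch below illustrates this under stated assumptions: the ToR and control-channel times are exponential (as assumed in the analysis), while the gamma law for the CU and the uniform law for the gears are illustrative stand-ins for the "general" distributions. All parameter names and values are hypothetical.

```python
import random
import statistics

def sample_setup_delay(rng, tor_rate, ch_rate, cu_shape, cu_scale, gear_lo, gear_hi):
    """Draw one R2R FSO link setup delay as the sum of the four stage times."""
    t_tor = rng.expovariate(tor_rate)            # ToR processing: exponential
    t_ch = rng.expovariate(ch_rate)              # control channel: exponential
    t_cu = rng.gammavariate(cu_shape, cu_scale)  # CU algorithm: illustrative gamma law
    t_gear = rng.uniform(gear_lo, gear_hi)       # gear steering: illustrative uniform law
    return t_tor + t_ch + t_cu + t_gear

def mean_setup_delay(tor_rate, ch_rate, cu_shape, cu_scale, gear_lo, gear_hi):
    """Closed-form mean: the mean of a sum is the sum of the stage means."""
    return 1 / tor_rate + 1 / ch_rate + cu_shape * cu_scale + (gear_lo + gear_hi) / 2

if __name__ == "__main__":
    # Hypothetical values in microseconds: 1 us ToR, 5 us channel, 100 us CU, 10-30 us gears.
    params = dict(tor_rate=1.0, ch_rate=0.2, cu_shape=2.0, cu_scale=50.0,
                  gear_lo=10.0, gear_hi=30.0)
    rng = random.Random(1)
    samples = [sample_setup_delay(rng, **params) for _ in range(100_000)]
    print(f"simulated mean {statistics.fmean(samples):.1f} us, "
          f"closed form {mean_setup_delay(**params):.1f} us")
```

The simulation is only useful beyond the mean: the same samples give the tail of the setup delay, which the closed-form mean cannot.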
3. R2R FSO Link Setup Process: Two Scenarios
The ToR switch can establish a limited number, K, of rack-to-rack (R2R) FSO links. This limitation is due to its finite processing capacity and number of outgoing ports. When a flow arrives at the ToR switch, it either finds an existing R2R FSO link to its destination or waits for the ToR switch to establish a new one. In this work, we study and model this waiting time.
To serve a new flow, the ToR switch needs to establish an R2R FSO link. This process starts at the ToR switch, which sends an R2R link establishment query to the CU. Since there are multiple ways to establish the R2R link, this creates a well-known integer linear programming problem: the multicommodity flow problem (as in routing and wavelength assignment). The CU is expected to run one of the well-known resource allocation optimization algorithms to solve the integer linear program and find the optimal selection for setting up the path (the multicommodity flow problem is NP-complete and can be solved by heuristic approaches). Accordingly, the CU has three tasks: (1) run the algorithm to find the optimal path, (2) command the gears of the selected path, including the FSO transceivers and the switching mirrors, to establish the link and (3) provide the ToR switches with the information needed to forward the flows via the right outgoing ports. Finally, the ToR switch adds a new entry about this link to its path table.
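To make task (1) concrete, here is a minimal greedy heuristic for a single request: a breadth-first search for the indirect LoS path with the fewest reflections, using only mirrors not already steered for other links. It does not solve the multicommodity integer program the CU actually faces, and the topology format, node names and `find_indirect_path` are hypothetical.

```python
from collections import deque

def find_indirect_path(los, src, dst, racks, busy_mirrors):
    """Fewest-hop indirect LoS path from rack src to rack dst.

    los: dict node -> list of nodes in line of sight (hypothetical topology).
    racks: set of rack identifiers; racks may not act as relays.
    busy_mirrors: mirrors already steered for other R2R links.
    Returns [src, mirror, ..., dst] or None if no free path exists.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:       # walk parent pointers back to src
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in los.get(node, ()):
            if nxt in parent:
                continue                  # already discovered
            if nxt != dst and (nxt in racks or nxt in busy_mirrors):
                continue                  # only free mirrors may relay the beam
            parent[nxt] = node
            queue.append(nxt)
    return None

if __name__ == "__main__":
    los = {"R1": ["M1", "M2"], "M1": ["R3"], "M2": ["M3"], "M3": ["R3"]}
    racks = {"R1", "R2", "R3"}
    print(find_indirect_path(los, "R1", "R3", racks, busy_mirrors=set()))   # ['R1', 'M1', 'R3']
    print(find_indirect_path(los, "R1", "R3", racks, busy_mirrors={"M1"}))  # ['R1', 'M2', 'M3', 'R3']
```

Processing requests one at a time like this is greedy: an early request can occupy a mirror that a later request needed, which is exactly what a joint (multicommodity) formulation avoids.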
The newly installed link can then be used by any subsequent flows going to the same destination rack; these flows do not go through the R2R FSO link setup process. How likely this is depends on the number of currently established R2R FSO links S, where $0 \le S \le K$ and K is the maximum number of FSO links that can be launched from an individual rack. The question is: how many R2R FSO links, S, exist when a flow arrives at the ToR switch? The value of S can be static, where the system is configured to always maintain K FSO links; in this scenario, S = K. Alternatively, the system can be configured to establish R2R FSO links on demand; in this scenario, S is a random value that varies with time. The configuration of the R2R FSO links therefore changes the system characteristics and, accordingly, its model.
These two configuration scenarios are expected to coexist in the same DC. In the first scenario, every ToR switch establishes and retains K FSO links at all times, regardless of their utilization. Initially, the system establishes these FSO links to random racks and then changes their directions according to the CU instructions. When a link carries no traffic, it switches to idle mode. This scenario suits small (mice) flows and long-term directions (highly utilized racks). In the second scenario, the R2R FSO links are established on demand and terminated immediately after the forwarded flows complete their transmission, so when there is no flow, the system has no established link. This scenario suits large (elephant) flows and short-term directions (lightly utilized racks). Together, the two scenarios cover the DC workload requirements described above and in [1,2].
3.1. Problem Formulation: First Scenario
The maximum capacity of the ToR switch is K FSO transceivers, which can be used to build only K FSO links. An arriving flow waits only if all K FSO links are connected to racks other than its destination. Since the system has K servers (FSO channels), as shown in Figure 1, the closest standard model is M/M/K. The main difference between the R2R FSO link setup system and M/M/K, however, is that FSO channels can be shared: a channel can be shared by all flows going to the same destination, since the FSO links have the physical-layer electronics to support multiple flows [9].
Table 1 lists the main notations.
The M/M/K model reaches the waiting state when the number of customers in the system exceeds the number of servers. In contrast, because FSO links can be shared, some flows (customers) in the presented system do not wait for other flows to complete their service. A flow enters the waiting queue only when the ToR switch has no link to its destination. How, then, does the presented system reach the waiting state? At state 0, every ToR switch has K FSO links to randomly chosen destination racks. A new flow enters the waiting state when its destination differs from all K destinations. In this case, the FSO link setup process is triggered, and one of the K FSO links is re-established toward the new destination rack. Subsequent flows are either served by these K links or request that one of them be re-established.
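For reference, the plain M/M/K baseline that this system is compared against can be quantified with the Erlang C formula, and the Lee–Longton approximation [18] used for M/G/K in this section scales that waiting time by $(1 + C_s^2)/2$. The sketch below is a textbook baseline only: it ignores FSO link sharing, and the names `lam` (arrival rate), `mu` (per-link service rate) and `cs2` (squared coefficient of variation of the service time) are ours.

```python
import math

def erlang_c(K, lam, mu):
    """Probability that an arrival has to wait in an M/M/K queue (Erlang C)."""
    a = lam / mu                      # offered load in Erlangs
    rho = a / K                       # per-server utilization
    if rho >= 1:
        raise ValueError("unstable queue: need lam < K * mu")
    top = a**K / math.factorial(K) / (1 - rho)
    bottom = sum(a**n / math.factorial(n) for n in range(K)) + top
    return top / bottom

def wq_mmk(K, lam, mu):
    """Mean waiting time in queue for M/M/K."""
    return erlang_c(K, lam, mu) / (K * mu - lam)

def wq_mgk_lee_longton(K, lam, mu, cs2):
    """Lee-Longton approximation: Wq(M/G/K) ~ (1 + Cs^2) / 2 * Wq(M/M/K)."""
    return (1 + cs2) / 2 * wq_mmk(K, lam, mu)

if __name__ == "__main__":
    K, lam, mu = 8, 6.0, 1.0          # hypothetical: 8 links, 6 flows/ms, 1 ms mean holding time
    print(f"P(wait)             = {erlang_c(K, lam, mu):.4f}")
    print(f"Wq M/M/K            = {wq_mmk(K, lam, mu):.4f} ms")
    print(f"Wq M/G/K (Cs^2 = 4) = {wq_mgk_lee_longton(K, lam, mu, 4.0):.4f} ms")
```

With `cs2 = 1` the approximation returns the M/M/K value, and for K = 1 it reproduces the exact Pollaczek–Khinchine result. For large K, computing the Erlang B recursion instead of raw factorials avoids overflow.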
To make the presented model comparable to the M/M/K model, the transient event from one state to another must be clearly defined. In the presented system, flows with the same destination share the same FSO link, and no transient event occurs for them. In contrast, a flow with a new destination triggers the R2R FSO link setup process, which constitutes a transient event. Thus, a transient event happens when a flow with a new destination rack arrives at the ToR switch.
Flows destined to racks that none of the existing links reach arrive at rate $(1-p_m)\lambda$, where $\lambda$ is the total flow arrival rate and $p_m$ is the probability that a flow matches one of the existing FSO links. This probability clearly depends on the ToR transceiver capacity (number of outgoing ports) and on the total number of racks in the DC: the ToR switch has K R2R FSO links and the DC has M racks, so $p_m$ is determined by K and M.
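As an illustration only, suppose (our assumption, not necessarily the paper's formula) that the K links point to K distinct racks and that each flow's destination is uniform over the other M − 1 racks; then $p_m = K/(M-1)$. The sketch checks this by simulation and splits a Poisson arrival rate into matching and different-destination streams, which is exact Poisson thinning only if each arrival matches independently with probability $p_m$. Function names are hypothetical.

```python
import random

def matching_probability_uniform(K, M):
    """Illustrative p_m: K links to distinct racks, destination uniform over the other M - 1."""
    if not 0 <= K <= M - 1:
        raise ValueError("need 0 <= K <= M - 1")
    return K / (M - 1)

def estimate_matching_probability(K, M, trials, seed=0):
    """Monte Carlo check of the same assumption (rack 0 is the source rack)."""
    rng = random.Random(seed)
    others = list(range(1, M))               # every rack except the source
    hits = 0
    for _ in range(trials):
        linked = set(rng.sample(others, K))  # current link destinations
        dest = rng.choice(others)            # destination of the arriving flow
        hits += dest in linked
    return hits / trials

def split_arrival_rate(lam, p_m):
    """Poisson thinning into (matching, different-destination) arrival rates."""
    return p_m * lam, (1 - p_m) * lam

if __name__ == "__main__":
    K, M = 8, 100
    print(matching_probability_uniform(K, M), estimate_matching_probability(K, M, 20_000))
    print(split_arrival_rate(10.0, matching_probability_uniform(K, M)))
```

Under this uniform assumption $p_m$ is small whenever K is much smaller than M; skewed DC traffic would raise it, because popular destinations are more likely to already have a link.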
To clarify the impact of the arrival rate of these different-destination flows on the system (λ denotes an arrival rate here, not a wavelength), assume first that the system receives only different-destination flows. In this case, a transient event happens with every arrival until all K FSO links are established and the system reaches its full capacity; subsequent arrivals must wait for other flows to complete. Assuming Poisson arrivals and exponentially distributed channel service times, this system behaves similarly to the M/M/K model. However, the waiting time includes an extra component, the R2R FSO link setup process, which turns the service-time distribution into a general distribution and the model into M/G/K. The average waiting time is
When the arrival rate of matching flows, $p_m\lambda$, is also considered, these matching flows do not wait for other flows, including the different-destination flows, or for the FSO link setup process. The system transmits them immediately over the FSO link they share with other flows to the same destination, so their time in the system is the service time of a single FSO link:
Conversely, while a different-destination flow waits for the system to find it a link, a new matching flow may arrive. This prolongs the waiting time of the different-destination flow, because the system must wait for the matching flow to complete its service before reusing its FSO link:
where, according to Lee and Longton [18], the mean waiting time of M/G/K is approximated by

$$W_q^{M/G/K} \approx \frac{1 + C_s^2}{2}\, W_q^{M/M/K},$$

where $C_s^2$ is the squared coefficient of variation of the ToR service time. In this case, the total waiting time when the matching probability, $p_m$, is considered is
3.2. Problem Formulation: Second Scenario
The mathematical model of the second scenario is similar to that of the first scenario, except in how the matching probability is obtained. Both scenarios have statistical characteristics close to M/G/K. However, the number of links is constant in the first scenario, while it is stochastic in this scenario: when a new flow arrives at the ToR switch, the number of FSO channels is always K in the first scenario but an unknown value, k, in the second. Its characteristics are as follows:
- K, as a constant, is no longer valid in this scenario because it represents the maximum capacity of the ToR switch; instead, the symbol k is used, where $k \in \mathbb{Z}$ and $0 \le k \le K$.
- k increases on a mismatch, i.e., when a flow with a different destination arrives at the ToR switch.
- k decreases when an FSO link is terminated.
Accordingly, k is considered an independent and identically distributed (i.i.d.) random variable, which affects the modeling of the matching probability, $p_m$. To calculate $p_m$, the system needs to know how many R2R FSO links exist when a flow arrives; moreover, to find the total response time of this system, we need the distribution of $p_m$. As explained above, the first factor contributing to k, and hence to the matching probability, is the arrival rate; the second is FSO link termination. An arrival can be either a matching flow or a different-destination flow. The value of k, and with it the matching probability, increases with the arrival of a different-destination flow and decreases when an FSO link is terminated. The time between any two events is called herein the inter-event time. It can be either the lifetime of an individual FSO link (the time from establishing the link until tearing it down) or the inter-arrival time between two flows.
During the inter-event time, three events can happen: (1) the arrival of a different-destination flow, which triggers the R2R FSO link setup process and increases k accordingly; (2) the arrival of a matching flow, which is forwarded through this FSO link or another existing link; and (3) the termination of this link or of one of the other existing FSO links. When there is no link, an arriving flow is necessarily a different-destination flow, and the link setup process starts. To model the relation between these events, the inter-event time is discretized into small time instants, and only one event can happen in a single instant. Discretizing time enables modeling the main factors of the matching probability with a discrete-time Markov chain (DTMC). The DTMC-based model is shown in Figure 2; in this model, one of the aforementioned events can happen at each instant.
The DTMC-based model yields $p_m$ by modeling the probabilities of these three events: first, the probability that the arriving flow is a matching flow, i.e., it is forwarded through one of the established links; second, the probability that it is a different-destination flow that matches none of the established links, which triggers the R2R FSO link setup process; and third, the probability that no arrival occurs and a link termination event happens instead. The third probability examines successive time instants until an arrival event happens or a link is terminated. Hence, the inter-event time is indeed the time between two events: when there is no arrival during it, the in-process flow completes and its FSO link terminates. This probability is therefore geometric, alternating between two outcomes in each instant (an arrival or no arrival), and the link termination event arises only when no arrival occurs during the whole inter-event time. As is standard in the literature, the number of flow arrivals within an instant is assumed to follow a Poisson distribution.
According to the DTMC model, the state probabilities are as follows:
Figure 2.
DTMC state diagram for the second scenario.
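A minimal sketch of such a chain, under assumptions that are ours rather than a transcription of Figure 2: the state is the number of links k; in each slot an arrival occurs with probability $1 - e^{-\lambda \Delta t}$ (at least one Poisson arrival), and it adds a link with probability $1 - p_m(k)$, with $p_m(k) = k/(M-1)$ as in the uniform-destination illustration; each of the k links terminates independently with probability q per slot; and at most one event happens per slot. Because the chain is birth–death, its stationary distribution follows from detailed balance, and the law of total probability then gives $p_m = \sum_k \pi_k\, p_m(k)$. All names and parameters are hypothetical.

```python
import math

def slot_arrival_prob(lam, dt):
    """Probability of at least one Poisson(lam) arrival in a slot of length dt."""
    return 1.0 - math.exp(-lam * dt)

def link_count_distribution(K, M, p_arr, q):
    """Stationary distribution of the number of links k = 0..K.

    Per slot: with probability p_arr a flow arrives; it needs a new link with
    probability 1 - k / (M - 1), so k -> k + 1 (if k < K). Each of the k links
    ends independently with probability q, so k -> k - 1 with probability k * q.
    Otherwise k stays. p_arr + K * q <= 1 keeps each row a valid distribution.
    """
    if not (0 < p_arr and 0 < q and p_arr + K * q <= 1 and K <= M - 1):
        raise ValueError("need p_arr, q > 0, p_arr + K * q <= 1 and K <= M - 1")
    pi = [1.0]
    for k in range(K):
        birth = p_arr * (1 - k / (M - 1))  # k -> k + 1
        death = (k + 1) * q                # k + 1 -> k
        pi.append(pi[-1] * birth / death)  # detailed balance of a birth-death chain
    total = sum(pi)
    return [p / total for p in pi]

def overall_matching_probability(pi, M):
    """Law of total probability with the assumed p_m(k) = k / (M - 1)."""
    return sum(p_k * k / (M - 1) for k, p_k in enumerate(pi))

if __name__ == "__main__":
    p_arr = slot_arrival_prob(lam=2.0, dt=0.01)   # hypothetical rate and slot length
    pi = link_count_distribution(K=8, M=100, p_arr=p_arr, q=0.002)
    print([round(p, 3) for p in pi])
    print(round(overall_matching_probability(pi, M=100), 4))
```

The slot length must be small enough that two events in one slot are negligible; otherwise the one-event-per-slot assumption biases the result.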
Finally, the Markov model captures the impact of the arrival rate and the FSO link lifetime in the state probabilities. Since the matching probability in this scenario increases with the number of links, the law of total probability gives $p_m$ as
4. Mathematical Analysis
In this section, we present an analytical expression for the new link setup delay experienced by all classes of links. We are interested in obtaining the mean waiting time and its second moment experienced by all flows. We first derive the mean waiting time of every component involved in the R2R FSO link setup process. The first component is the flow processing time at the ToR switch, assuming Poisson flow arrivals. The second is the southbound (control) channel between the ToR switch and the CU. The service times of these two components follow exponential distributions with different speeds. In contrast, the CU and FSO gear service times follow arbitrary distributions because they involve diverse processing services, such as executing the FSO wavelength assignment algorithm and applying its result physically to the FSO link gears. Consequently, the mathematical analysis of the CU and FSO gear setup service times in this section follows the standard derivation steps for a queue with generally distributed service times. Table 1 defines the main notations, and the following paragraphs define the remaining ones. The waiting time for the matching probability is calculated as follows:
Since a general distribution is considered for the CU service time, its average waiting time is expressed in terms of the service time of each path/link i and the residual time R. The mean residual service time R can be derived by the same kind of graphical argument as the Pollaczek–Khinchine (P-K) mean value formula, and its first moment can be expressed as
The second moment of the waiting time is derived through additional algebraic manipulation. Equation (21) is obtained by squaring both sides of (19) and taking expectations. All of its variables are known except one term, which we evaluate using the law of total expectation, E[Y] = E[E[Y|X]], to obtain
The average time that a flow f spends in the system combines the waiting time above with the average response time of the CU and subsequent services, computed with the matching probability taken into account. The total response time distribution is
To represent the process accurately, we use a different service time for each step of the R2R FSO link setup process: the service times of the ToR switch, the CU (including the control channel) and the FSO gear setup.
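As a cross-check for the CU waiting-time derivation, the textbook single-server M/G/1 results can be computed from the first three moments of the service time: the mean residual service time is $R = \lambda E[S^2]/2$, the P-K mean waiting time is $E[W] = R/(1-\rho)$, and Takács' formula gives $E[W^2] = 2E[W]^2 + \lambda E[S^3]/(3(1-\rho))$. This sketch covers only that classical case; the paper's own expressions may contain additional terms, and the function name is ours.

```python
def mg1_waiting_moments(lam, es, es2, es3):
    """Mean and second moment of the M/G/1 FCFS waiting time in queue.

    lam: Poisson arrival rate; es, es2, es3: E[S], E[S^2], E[S^3] of the service time.
    """
    rho = lam * es
    if rho >= 1:
        raise ValueError("unstable queue: need lam * E[S] < 1")
    residual = lam * es2 / 2                        # mean residual service time R
    ew = residual / (1 - rho)                       # Pollaczek-Khinchine mean
    ew2 = 2 * ew**2 + lam * es3 / (3 * (1 - rho))   # Takacs second moment
    return ew, ew2

if __name__ == "__main__":
    # Exponential service with mean 1 (E[S^2] = 2, E[S^3] = 6) at lam = 0.5, i.e., M/M/1.
    print(mg1_waiting_moments(0.5, 1.0, 2.0, 6.0))  # (1.0, 4.0)
```

The exponential case is a convenient sanity check: for M/M/1 with these values the waiting time is zero with probability 0.5 and otherwise exponential with rate 0.5, whose moments are exactly 1 and 4.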
System Capacity
In real DC networks, flow duration varies with the service. For instance, data center web-search workloads present flows with different sizes and durations compared with data-mining workloads [19]. Thus, the flow waiting time should be bounded by a service-time limit; otherwise, some newly arriving flows may wait in the setup queue for an uncertain time that exceeds their own duration. Therefore, in this part, we estimate the capacity of the system, i.e., how many flows it can handle under a given response-time quality-of-service (QoS) constraint.
As indicated in (25), the system response time of each flow is a random variable. For exponentially distributed response times, the response-time equation is
A newly arrived flow must wait for the flows ahead of it in the queue, and the sum of their response times is E. The value of E must not exceed the system QoS constraint; beyond it, the system blocks subsequent flows. The question is: what is the probability that an arriving flow is blocked? To obtain the blocking probability, we need a closed-form expression for the distribution of the system response time L. In the literature, if L is the sum of U i.i.d. exponential random variables with a constant rate parameter, its probability density function (PDF) is the Erlang distribution with shape U and that rate. However, the R2R FSO link setup process has a different exponential service rate for each of its services. According to the above mathematical analysis, the distribution of the total response time of the R2R FSO link setup process in (25) takes a shape close to a hypoexponential distribution for large U, whereas the Erlang and Gamma distributions tend toward a bell shape as U increases. Therefore, we approximate the response-time distribution by a hypoexponential distribution.
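For independent exponential stages with distinct rates $\lambda_1, \dots, \lambda_n$, the hypoexponential survival function has the closed form $P(T > t) = \sum_i \left(\prod_{j \ne i} \frac{\lambda_j}{\lambda_j - \lambda_i}\right) e^{-\lambda_i t}$, which directly gives the probability that a single setup request exceeds a response-time bound t. The sketch below evaluates it and checks it against a Monte Carlo estimate; the stage rates are hypothetical, and equal rates (the Erlang case) are deliberately rejected because the closed form divides by their difference.

```python
import math
import random

def hypoexp_survival(rates, t):
    """P(T > t) for T = sum of independent exponentials with distinct rates."""
    if len(set(rates)) != len(rates):
        raise ValueError("rates must be distinct for this closed form")
    total = 0.0
    for i, ri in enumerate(rates):
        coef = 1.0
        for j, rj in enumerate(rates):
            if j != i:
                coef *= rj / (rj - ri)
        total += coef * math.exp(-ri * t)
    return total

def blocking_probability_mc(rates, t, n=100_000, seed=0):
    """Monte Carlo estimate of P(T > t) for the same sum of exponential stages."""
    rng = random.Random(seed)
    over = sum(sum(rng.expovariate(r) for r in rates) > t for _ in range(n))
    return over / n

if __name__ == "__main__":
    # Hypothetical stage rates per microsecond: ToR, control channel, CU, FSO gears.
    rates = [1.0, 0.2, 0.01, 0.05]
    bound = 300.0
    print(hypoexp_survival(rates, bound), blocking_probability_mc(rates, bound))
```

When two stage rates are very close, the coefficients become large with alternating signs and the closed form loses precision; the Monte Carlo estimate (or a phase-type matrix exponential) is then the safer choice.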
The blocking probability is defined as the probability that an R2R FSO link setup request needs a response time exceeding the QoS time constraint. The threshold value, i.e., the QoS time constraint, is obtained from the response time distribution L explained above. It can be expressed through the maximum number of R2R FSO link setup requests that the system can handle before the QoS constraint is exceeded. This number can be approximated by the following equation, where
is the average response time,