1. Introduction
In network environments that connect many kinds of devices, such as the Internet of Things (IoT), smart factories, sensor networks, and 5G, stricter quality of service (QoS) is required than before; it is therefore important to distribute limited network resources efficiently. Currently, it is common to use simple algorithms such as strict-priority (SP) scheduling, which always transmits higher-priority packets first. However, networks keep growing, and many different types of data and devices with various performance requirements are being connected; accordingly, assigning a fixed priority to network flows is no longer sufficient. Studies on applying deep learning to networking have discussed not only its advantages and effects but also its challenges and difficulties [1,2,3,4]. Deep learning estimates correlations in data through a neural network and tends to perform better as more input data become available. It has already achieved the best reported performance in many fields. Since network problems can often be modeled as a Markov decision process (MDP), reinforcement learning based on the MDP has proven feasible [1]. In particular, deep reinforcement learning (DRL), which uses the neural networks of deep learning, has demonstrated good performance. Reinforcement learning defines a problem as an MDP and trains agents to choose optimal policies from experience; adding neural-network approximation makes it more effective than existing algorithms. Various studies report that DRL has outperformed existing algorithms [5,6,7,8,9,10]. As traffic from time-sensitive applications increases and networks are expected to become more complex, new algorithms are needed for the coming generation of networks. This motivated our study of DRL-based scheduling, which uses nonlinear optimizers that can operate across numerous network scenarios that cannot be defined linearly.
The abbreviations used in this paper are listed in Table 1.
Deep learning has also been discussed as a way to implement autonomous networking, which allows networks to operate on their own without manual network management [11]. Autonomous networking aims at self-management, including self-configuration, self-healing, self-optimizing, and self-protection. Reinforcement learning can provide the intelligence of autonomous networks because it takes appropriate actions for various situations in the network environment. Each “self” function of autonomous networking has the following meaning:
Self-configuration: the network sets itself without intervention by the administrator or management system.
Self-healing: the network automatically solves problems or adapts to a changed environment.
Self-optimizing: the network finds the optimal way for itself to achieve the network’s requirement.
Self-protection: the network automatically prepares itself to respond to potential attacks.
To improve the flexibility and intelligence of autonomous networking, SDN/NFV-based standard models, such as self-organizing networks (SON), CogNet, and SELFNET, have been developed, and data analysis through deep learning or machine learning is needed [12]. In [13,14], reinforcement learning was applied to implement the autonomy of networking. In [13], the authors surveyed the latest studies on autonomous IoT, analyzed suitable DRL algorithms, and proposed a general model. In [14], a cognitive control loop was proposed to realize autonomy.
Despite these prospects, few cases of deep learning being applied in practice to network control problems (e.g., scheduling and routing) have been reported. The major challenges in applying deep learning to networks are related to latency and generalization. Regarding latency, implementation is difficult because the network nodes have to communicate with the central controller in real time, and deep learning inference itself takes considerable time. For example, to control network congestion, the information of every single node must be delivered to the central node using a transport layer protocol [3]. Regarding generalization, the model must respond appropriately to every possible network state. It should be able to learn or respond to unobserved patterns in real time, and ways to normalize network parameters of different formats must be devised. Centralized controllers struggle to manage the resources of many network entities with a variety of capabilities, so research on training and deploying neural network models in a distributed manner is required [2].
In this study, we propose a scheduler that applies DRL in a timeslotted environment. Among the various DRL algorithms, we used the value-based double deep Q-network (DDQN), which is known to be simple, to perform well, and to be stable in discrete action spaces [13]. Previous work on deep learning applied to networking problems has demonstrated that reinforcement learning suits action selection where the state varies every timeslot, as in the network scheduling problem. This study assumes a timeslot-based scheduling environment in which clocks are synchronized and queues are assigned according to the priority of the flow. It also follows the basic assumptions of the IEEE time-sensitive networking (TSN) standard [15] regarding jitter and latency minimization for small networks [16]. The priority is basically determined by the class of the flow. We also introduce the concept of precedence used in military networks, assuming that flows in the same class may have different precedence (or priority) depending on the user’s authority [17]; i.e., flows may have the same class and similar requirements but can enter different priority queues and be scheduled differently. The estimated end-to-end delay (referred to as ET in this paper) is used to define the MDP elements because it facilitates training on a single node. A state is defined as the length of each priority queue and the estimated delay. The reward function is designed to meet the deadline of the packet. An action is the choice of the priority queue from which a packet is sent in the timeslot.
As a result, the DDQN trained on a single node outperformed the existing algorithms, SP and weighted round robin (WRR), in terms of the total reward. The existing algorithms achieved an average sum of rewards of 90 to 92%, whereas our trained model achieved 100%. In addition, we validated the model trained on a single node with the estimated delay by simulating it in a network topology. This implies that the simple state and the reward based on the estimated delay described above could be effective and feasible for a general network. The reason for training on a single node is that the node then rarely needs to communicate with a central controller, other than for deploying the initial model on each node. If the central controller knew the situation of every network node, learning performance could be superior, but this is difficult to realize because the parameters would have to be transmitted in real time. In the simulations, the ratio of packets meeting their deadlines was used as the performance indicator. Since SP transmits packets strictly by priority, service for a relatively low-priority queue is not guaranteed. WRR has the disadvantage that the weights have to be adjusted manually according to traffic patterns, such as the period of the traffic. The DDQN-based scheduler overcame these shortcomings and sent more packets within the deadline in several scenarios. In addition, to reduce the deep learning inference time incurred per timeslot, we introduced a heuristic by which the scheduler performs inference only when packets are present in two or more priority queues. The inference time could be reduced further by recording the actions for frequently observed states, which can be regarded as caching. To generalize the scheduler to actual IoT devices, it is important to consider energy efficiency as well as inference time.
Various studies have considered energy efficiency in the time-division multiple access (TDMA) scheme, which is closely related to the timeslotted environment. Recent studies have proposed methods for transmitting power to wireless sensor networks or harvesting energy. In [18], it is shown that a strict delay constraint decreases energy efficiency. The authors also proposed an algorithm for determining the throughput that increases energy efficiency when QoS guarantees are required for traffic generated from a delay-sensitive source. In [19], another study of energy efficiency in TDMA, a sleep-scheduling policy called the multiple vacation and start-up threshold policy is used to mitigate energy consumption.
The main contributions of this paper are as follows:
The shortcomings of the existing algorithms are resolved. SP guarantees high-priority traffic at the expense of low-priority traffic, and WRR requires its weights to be adjusted according to the situation. The DDQN transmits both high- and low-priority packets within the deadline with high probability and needs no manual weight adjustment.
Although the agent is trained with the estimated ET under the assumption that the end-to-end delay (referred to as E2E in this paper) is unknown, it achieved the same or higher performance than the existing algorithms in the topology, where the E2E can be measured.
A simple state, action, and reward are defined; together with the contribution above, this suggests that training the model for, and applying it to, a specific network situation may not require much effort in building the learning environment.
The scheduler is designed not only to reduce the computation time of deep learning and thereby increase energy efficiency, but also to avoid frequent information exchange with the central system. Thus, the proposed DDQN-based scheduler could be a potential solution for IoT devices.
Section 2 introduces the research referenced in this study in detail, in particular the motivation and results of reinforcement learning-based network studies, reinforcement learning itself, and TSN. Section 3 describes the model of our system and defines the MDP elements. In Section 4, we evaluate the simulation results for the existing algorithms, a single node, and a topology; we also discuss strategies to reduce inference time and briefly describe the implementation results of caching. Section 5 summarizes the proposed work and discusses the directions and challenges for future work. Traditional network schedulers operate according to a predetermined algorithm, so they cannot adequately handle changes in network state. This motivated us to study intelligent scheduling with DRL that can be used in an advanced networking environment. The goal of the study is to accelerate the introduction of automated and intelligent networking.
2. Related Works
Reinforcement learning can use convolutional neural networks (CNNs) that take images as input, as in game environments, or it can use prediction models. Network scheduling [5,7,9,20], routing [7], and resource allocation [6,8] through DRL with prediction models have been studied in various ways. In particular, in a deadline-aware environment, the rewards and states are usually defined with the goal of sending as many packets as possible within the deadline. A DQN has also been used for automobile traffic signal systems, a problem similar to deciding which packet to send. In [21], the authors proposed an intelligent traffic signal system model using real traffic data. In [22], the definition of appropriate states and rewards under traffic conditions was analyzed. The study showed that optimizing traffic signals amounts to minimizing vehicle travel time and, after training with various combinations of states and rewards, achieved optimal performance even with a simplified state and reward. In [5], the authors devised a DQN-based scheduling method for the new classes of applications and IoT device traffic expected in future mobile networks; to adapt to dynamic traffic, the scheduler applies reinforcement learning and achieved optimal IoT traffic scheduling. In [6], agents learned the optimal policy for efficiently distributing limited resources in IoT edge computing systems through a DQN. In [7], policy-based proximal policy optimization (PPO) was used to find the optimal joint scheduling and routing solution in multi-hop wireless networks, with the aim of sending as many packets as possible within the deadline. The state consists of the queues and queued packets of all nodes, and the action selects one of five heuristic algorithms in each timeslot. The learned policy outperformed the best heuristic by 74% in the training-set scenarios and by 64% in the non-training-set scenarios. The authors also showed that generalized policies learned from the datasets of all scenarios perform slightly worse than custom policies learned from individual scenarios, and that exploring a larger number of routes can reduce deadline misses. In [20], the authors presuppose an SDN environment capable of real-time telemetry and schedule flows by adjusting the pacing rate of packets in a network running an application that requires data transmission to complete within a given deadline. The purpose of that study is to maximize the number of flows satisfying their deadlines while maximizing network utilization. Compared with well-known heuristics, such as earliest deadline first (EDF) and equal partition, the reinforcement learning agents always showed the same or higher performance, and the advantage over the other algorithms grew with the network load. In [8], advantage actor–critic (A2C) is proposed as a solution to the mobile network load caused by strict QoS requirements. Two A2C models were proposed and compared; the model trained with more state information increased the packet transmission rate by 92%. In [9], DRL was used to meet the required task deadlines and thereby address the resource allocation imbalance in edge computing, which is caused by numerous devices and user movements in dynamic environments and eventually degrades system performance.
In addition, IEEE TSN aims to provide guaranteed latency in an environment where nodes are time-synchronized and slot scheduling is performed by a central entity; the key mechanism in the TSN standard is the control of gate opening. In the TSN synchronous approach [16], the output port has class-based queues, a time-aware shaper (TAS), and a strict-priority scheduler; the TAS has a gate control list (GCL) that specifies the opening or closing of the gate of each queue per timeslot. The TSN synchronous approach suits networks in which the period and type of flows are static, because it targets a deterministic service. Therefore, we propose a scheduling solution that determines the queue from which to send a packet in each timeslot in a static environment, similar to the gate control of the TSN synchronous approach. Gate control in TSN is performed through a fixed GCL, whereas in our environment each priority queue is controlled based on the state in the timeslot.
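To make the contrast concrete, the following minimal sketch compares a fixed, cyclic gate schedule with a state-dependent choice. It is a simplification under stated assumptions: a real GCL holds gate-state vectors with time intervals rather than a single queue index per slot, and the names `fixed_gcl_schedule`, `state_based_gate`, and `longest_queue` are hypothetical and not taken from the TSN standard or from this study.

```python
from itertools import cycle


def fixed_gcl_schedule(gcl, n_slots):
    """Static schedule: the open gate depends only on the slot index."""
    gates = cycle(gcl)  # the list repeats every len(gcl) timeslots
    return [next(gates) for _ in range(n_slots)]


def state_based_gate(queue_lengths, policy):
    """Dynamic schedule: the open gate depends on the observed state."""
    return policy(queue_lengths)


def longest_queue(queue_lengths):
    """Hypothetical stand-in policy: open the gate of the longest queue."""
    return max(range(len(queue_lengths)), key=lambda i: queue_lengths[i])


if __name__ == "__main__":
    # Hypothetical GCL: queue 0 is open in three of every four timeslots.
    print(fixed_gcl_schedule([0, 0, 1, 0], 6))
    print(state_based_gate((0, 3), longest_queue))
```

In the proposed scheduler, the role played here by `longest_queue` would be taken by the trained agent.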
Reinforcement Learning
Reinforcement learning is a process in which a learning agent in an environment proceeds by trial and error, initially choosing random actions, and converges to an optimal policy by learning the behaviors that yield the maximum reward. An episode runs from the beginning to the end of a simulation, and several episodes are required for the learning process. The agent and the environment exchange information at every timestep (i.e., timeslot) as learning proceeds. When the simulation ends after several timesteps, it restarts from the initial state of the next episode. The agent observes the state, which is information obtained from the environment, and selects an action. The environment then delivers a reward to the agent, and the state changes to a new one in the next timestep. This reward is also called the immediate reward because it is given right after the timestep in which the agent selects an action. Q-learning, a representative reinforcement learning algorithm, uses the Q-value, which denotes the cumulative value of a state and an action of the agent. The Q-value approximation in the DQN derivation from (1) to (4) is summarized from [23]. The Q-value considers not only the immediate reward but also all rewards to be received after timestep $t$. The immediate reward is given at timestep $t+1$, subsequent to selecting an action at timestep $t$. Thus, $r_{t+1}$ is the reward received at timestep $t+1$. To convert future values to the value at the present timestep, a discount factor between 0 and 1, denoted by $\gamma$, is used, as shown in (1).
If the selected action has the largest Q-value after timestep $t$, this is expressed as shown in (2). Thus, the Q-value formula means that the agent selects the action that maximizes the reward at each timestep from $t+1$ onward. The part of (1) that equals (2) can be replaced by (2), and the Q-value equation is therefore derived as shown in (3).
In (3), $\max_{a'} Q(s', a')$ is the maximum Q-value when the agent selects the action $a'$ in the next state $s'$; thus, it is expressed as (4).
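Equations (1) to (4) are referred to above by number only. Assuming the usual notation of [23] (state $s_t$, action $a_t$, reward $r_{t+1}$, discount factor $\gamma$, and next state $s' = s_{t+1}$), one self-consistent way to write them is sketched below; the exact form and grouping in the original derivation may differ.

```latex
\begin{align}
Q(s_t, a_t) &= r_{t+1} + \gamma \left( r_{t+2} + \gamma\, r_{t+3} + \gamma^{2} r_{t+4} + \cdots \right) \tag{1} \\
r_{t+2} + \gamma\, r_{t+3} + \gamma^{2} r_{t+4} + \cdots &= \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})
  \quad \text{(greedy actions from } t+1 \text{ onward)} \tag{2} \\
Q(s_t, a_t) &= r_{t+1} + \gamma \max_{a'} Q(s', a') \tag{3} \\
\max_{a'} Q(s', a') &= \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}), \qquad s' = s_{t+1} \tag{4}
\end{align}
```

Substituting (2) into the bracketed part of (1) gives (3), which is the recursive form used as the learning target.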
To calculate the Q-value, all rewards received after $t$ are required. In the Q-learning algorithm, a two-dimensional Q-table with states $s$ as rows and actions $a$ as columns is used to record the Q-value of every combination $(s, a)$. Many simulations are required to update each entry $Q(s, a)$ of the Q-table. As mentioned above, (3) is calculated from the immediate reward and (4), and the Q-table is updated accordingly. The combinations of states and actions can be diverse; however, the size of the Q-table and the amount of computation grow as the numbers of states and actions increase. To address this limitation, the DQN opened a new chapter in reinforcement learning with deep learning by introducing a neural network, the Q-network, to estimate the Q-value [23]. In a general deep learning task, input data are combined with the parameters of the neural network and passed through activation layers to produce output data. Since the output is a value predicted by the neural network, there is a target that the output is supposed to attain. The network is trained by back-propagating the error between the target and the prediction so that the output approaches the target. This overall procedure is referred to as deep learning. The input of the Q-network is the state $s$, and the output is the Q-value of each action $a$ in the action space. In (3), $Q(s, a)$ is calculated from the immediate reward $r$ and $\max_{a'} Q(s', a')$, and this value becomes the target of the Q-network. DQN learning is driven by the error between this target and the $Q(s, a)$ predicted by the Q-network. If $\max_{a'} Q(s', a')$ in (3) were also predicted by the Q-network, the target would change at every step. This problem does not occur in supervised learning, where the target is fixed; in reinforcement learning, however, there is no fixed target. Therefore, learning might not proceed correctly with the Q-network alone. The target network, which has the same structure as the Q-network but fixed weights, calculates $\max_{a'} Q(s', a')$ and periodically copies the weights of the Q-network; this is one of the characteristics of a double DQN (DDQN). As shown in Figure 1, copying the weights of the Q-network is referred to as a soft update.
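The role of the two networks can be sketched as follows. This is a minimal illustration in plain Python, assuming the standard double-DQN target in which the Q-network selects the next action and the target network evaluates it. The function names and the default values of `gamma` and `tau` are hypothetical placeholders, not the settings of this study.

```python
def ddqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Regression target for one transition.

    q_online_next / q_target_next: Q-values of every action in the next
    state, as predicted by the Q-network and the target network.
    """
    if done:
        # A terminal transition has no future value.
        return reward
    # The Q-network selects the greedy next action ...
    greedy = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... and the target network evaluates that action.
    return reward + gamma * q_target_next[greedy]


def soft_update(target_weights, online_weights, tau=0.01):
    """Blend Q-network weights into the target network.

    tau = 1.0 reproduces a full copy of the Q-network weights.
    """
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_weights, online_weights)]


if __name__ == "__main__":
    print(ddqn_target(1.0, False, [1.0, 2.0], [5.0, 7.0], gamma=0.5))
    print(soft_update([0.0, 2.0], [1.0, 4.0], tau=0.5))
```

Because the target network changes only when its weights are updated, the regression target stays stable between updates, which is the motivation given above.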
4. Performance Evaluation
For the performance evaluation, we implemented deep learning on the network using a discrete event-based simulation tool. SimPy, a process-based discrete-event simulation framework, was used for our experiments [25]. In our simulator, the packet generation process and the link, transport, source, and node are modularized, each with its own process. Training and tests are executed through the interaction between the agent and the TensorFlow-based DDQN. The DDQN is trained outside the simulation environment defined in SimPy, so training does not affect the packet latency. In other words, the DDQN incurs delay due to the inference and learning operations of the neural network, but the simulation environment does not take this delay into account. However, in real-world network environments, where scheduling is required in every timeslot, the delay of deep learning computation would matter. In preparation for this situation, the agent selects an action using only information about its own state, even if the states of all switch nodes in the network are not known; this allows decentralization of the DDQN scheduler. The role of the trained DDQN is to output an optimal action with the state as input; the optimal action for the current state observed at the node can be recorded in a look-up table. The parameters used to train the DDQN are summarized in Table 4 (network simulation parameters) and Table 5 (DDQN learning parameters). The structure of the DDQN is shown in Figure 4. A linear activation function was used at the final output layer, and Adam was used as the optimizer. Adam is an adaptive learning rate optimization algorithm, which is known to improve learning performance by updating weights with individual learning rates. In this study, the learning rate was determined empirically as the most effective value for loss reduction. The algorithms used for comparison are SP and WRR. SP always sends the packet in the highest-priority non-empty queue. WRR services the queues in turn in proportion to the weights allocated to each priority queue. In all simulations, including those with the DDQN, work-conserving was applied: if packets exist in only one priority queue, a packet from that queue is sent regardless of the scheduling result.
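The look-up table mentioned in this section can be sketched as a thin wrapper around the trained model. This is an illustrative assumption, not the implementation used in the study: the class name `CachedPolicy`, the tuple-valued state, and the `infer` callable (standing in for an argmax over the Q-network outputs) are all hypothetical.

```python
from typing import Callable, Dict, Tuple

# Hypothetical state encoding: discretised queue lengths and delays.
State = Tuple[int, ...]


class CachedPolicy:
    """Wrap a slow policy with a look-up table keyed by the observed state."""

    def __init__(self, infer: Callable[[State], int]):
        self._infer = infer                 # stand-in for neural-network inference
        self._table: Dict[State, int] = {}  # state -> recorded action
        self.hits = 0
        self.misses = 0

    def choose_action(self, state: State) -> int:
        if state in self._table:
            # Previously observed state: reuse the recorded action.
            self.hits += 1
            return self._table[state]
        # Unseen state: run inference once and record the result.
        self.misses += 1
        action = self._infer(state)
        self._table[state] = action
        return action


if __name__ == "__main__":
    policy = CachedPolicy(lambda state: 0 if state[0] >= state[1] else 1)
    for s in [(2, 1), (2, 1), (0, 3)]:
        print(s, policy.choose_action(s))
    print("hits:", policy.hits, "misses:", policy.misses)
```

A table of this kind only helps when the state space is small or coarsely discretised, which is an additional assumption about the state encoding.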
4.1. Existing Algorithms
In this study, existing algorithms were used to objectively evaluate the performance of the DDQN agent. As mentioned in Section 3, some packets cannot arrive within the deadline when utilization increases due to a sudden influx of packets into the link. Table 2 confirms that the existing SP algorithm has difficulty responding adaptively to such situations. Unlike SP, which unconditionally gives precedence to high priority, WRR can schedule more flexibly because it assigns a weight to each queue. Work-conserving has been applied to all algorithms, including SP, WRR, and the DDQN agent, to ensure that waiting packets are sent. For instance, if packets exist in only one of the two queues, they are sent immediately without invoking the scheduling algorithm. This not only helped meet the requirements of the packets but also reduced the running time. In WRR, the allocated weights apply only when packets exist in both queues; in other words, a weight of 3:1 means that three priority 1 packets and one priority 2 packet are transmitted per round among the slots decided by scheduling, not counting the slots decided by work-conserving. The weights are set according to the network situation; even in the same situation, the result may vary depending on the weights. Therefore, for each simulation below, an indicator of the performance according to the WRR weights is added.
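The baseline behavior described above can be sketched as follows. This is a minimal illustration under assumptions: queues are plain `deque` objects indexed from the highest priority, and the function and class names are hypothetical rather than taken from the simulator.

```python
from collections import deque


def work_conserving(queues):
    """Return the index of the only non-empty queue, or None otherwise."""
    non_empty = [i for i, q in enumerate(queues) if q]
    return non_empty[0] if len(non_empty) == 1 else None


def strict_priority(queues):
    """Pick the highest-priority (lowest-index) non-empty queue."""
    for i, q in enumerate(queues):
        if q:
            return i
    return None


class WeightedRoundRobin:
    def __init__(self, weights):
        # A 3:1 weight expands to the service pattern 0, 0, 0, 1.
        self._pattern = [i for i, w in enumerate(weights) for _ in range(w)]
        self._pos = 0

    def choose(self, queues):
        shortcut = work_conserving(queues)
        if shortcut is not None:
            return shortcut              # does not consume a WRR turn
        if not any(queues):
            return None                  # nothing to send in this timeslot
        # Advance through the pattern until a non-empty queue is found.
        for _ in range(len(self._pattern)):
            i = self._pattern[self._pos]
            self._pos = (self._pos + 1) % len(self._pattern)
            if queues[i]:
                return i
        return None


if __name__ == "__main__":
    busy = [deque(range(10)), deque(range(10))]
    wrr = WeightedRoundRobin([3, 1])
    print([wrr.choose(busy) for _ in range(8)])
```

Keeping the work-conserving shortcut outside the WRR pattern mirrors the statement that the weights apply only when packets exist in both queues.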
4.2. DDQN Training Results
Hyperparameter tuning during training confirmed that the higher the value assigned to priority 1, the better the performance. β should have less effect than α and be greater than 0, so a value of 0.01 was assigned. Figure 5 shows the learning curves of DDQN training with 20,000 episodes, DQN training with 10,000 episodes, and SP simulation over 20,000 episodes. The moving average of the DQN learning curve decreased continuously, so we stopped its training early, judging that there was no prospect of improvement. The cumulative sum of rewards obtained over the timesteps of an episode, referred to as the score, was used as the result indicator. The learning curve displays the moving average of the score with a window size of 1000 and the moving standard deviation as a band with 30% transparency. Each score was normalized to the range between 0 and 1 by scaling it with the maximum value in the episode. As ε decreased, the score of the DDQN increased, indicating that learning was progressing properly. As shown in Figure 6, the score of the DDQN exceeded that of the existing SP algorithm after 15,000 episodes. It was also confirmed that the loss gradually converges close to zero. As mentioned above, the loss function is used to optimize the neural network parameters, and we define it as the mean squared error (MSE), which measures the accuracy of the prediction. Its partial derivatives with respect to the parameters are back-propagated to update them. Hence, a decrease in the loss indicates that the accuracy has improved over the training iterations. The progress of the loss over the episodes is shown in Figure 7.
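The learning-curve statistics and the loss described above can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, the population standard deviation is used for the band, and `max_score` is taken as a given normalization constant.

```python
from statistics import mean, pstdev


def normalise(scores, max_score):
    """Scale scores to the range [0, 1] by a given maximum score."""
    return [s / max_score for s in scores]


def moving_stats(values, window):
    """Moving average and moving standard deviation over full windows."""
    means, stds = [], []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        means.append(mean(chunk))
        stds.append(pstdev(chunk))  # population standard deviation (assumed)
    return means, stds


def mse(targets, predictions):
    """Mean squared error between target and predicted Q-values."""
    return mean((t - p) ** 2 for t, p in zip(targets, predictions))


if __name__ == "__main__":
    scores = normalise([5.0, 10.0, 7.5, 10.0], max_score=10.0)
    print(moving_stats(scores, window=2))
    print(mse([1.0, 2.0], [1.0, 4.0]))
```

With the window size of 1000 stated in the text, the first plotted point would correspond to episode 1000 under this full-window convention.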
4.3. Single Node Simulations
The training was performed on a single node that has the output port structure shown in Figure 3. For evaluation, the trained agent was simulated under the same environmental variables as the existing algorithms. Figure 8 compares the trained deep learning model with the existing algorithms. All algorithms were tested 10 times so that the distribution of the results could be considered. Unlike the DDQN training environment, which had a maximum link utilization of 200% (i.e., priority 1 and priority 2 packets were each generated with a period of 1 timeslot), all packet generation periods were set to 2 timeslots in the test. The deadlines of all flows were fixed at 7 timeslots, and the number of packets of each flow was set to 40. The current delay was set according to the remaining hops, on the assumption that a packet with fewer remaining hops was sent from its source more timeslots ago. As can be seen from Figure 8, the trained DDQN achieved a normalized score of 100% even though the parameters were changed, which suggests that it adapts well to dynamic environments. In contrast, the existing algorithms averaged around 90% to 92%. This result suggests that, unlike the other algorithms, the DDQN can guarantee the deadline requirements of more packets in timeslot scheduling with priorities. SP and WRR showed similar normalized scores. WRR with a weight of 1:1 showed higher average performance than WRR with 3:1. Table 6 shows the deadline satisfaction according to the weight in the same network environment. Since the period ratio of the priorities is 1:1, the satisfaction rate is higher the closer the weight is to 1:1; however, WRR could not achieve more than 90% for priority 2. To confirm whether the deadline is guaranteed for every flow (i.e., whether the purpose of our work is accomplished), the delay for each algorithm and priority is shown in Figure 9. The gray area denotes the deadline. As we assumed that the agent uses ET, the real E2E of packets cannot be observed. The timeslot is 0.6 ms and the deadline is 4.2 ms; thus, the deadline is 7 timeslots. In the simulation environment, the queue length increases because packets arrive at a constant rate. Accordingly, packets transmitted at the beginning of the packet generation process have a small ET and packets transmitted later have a larger ET, so some fluctuation in the number of packets may occur. With the DDQN, the ET of priority 1 increased slightly while the ET of priority 2 decreased slightly, so that all packets arrived within the deadline, in contrast to the other algorithms. Therefore, the single-node simulation results indicate that the purpose of training the DDQN agent was accomplished.
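The timeslot arithmetic used in this section can be checked with a short calculation. The sketch assumes that the link transmits one packet per timeslot and that a packet meets its deadline when its delay is less than or equal to the deadline; both are assumptions made for illustration, and the function names are hypothetical.

```python
from fractions import Fraction

TIMESLOT_US = 600    # 0.6 ms timeslot, from the text
DEADLINE_US = 4200   # 4.2 ms deadline, from the text


def deadline_in_timeslots(deadline_us, timeslot_us):
    """Convert a deadline to whole timeslots using integer microseconds."""
    return deadline_us // timeslot_us


def max_link_utilization(periods_in_timeslots):
    """Sum of per-flow packet rates, assuming one packet per timeslot capacity."""
    return sum(Fraction(1, p) for p in periods_in_timeslots)


def deadline_satisfaction(delays_in_timeslots, deadline):
    """Fraction of packets whose delay does not exceed the deadline."""
    met = sum(1 for d in delays_in_timeslots if d <= deadline)
    return Fraction(met, len(delays_in_timeslots))


if __name__ == "__main__":
    print(deadline_in_timeslots(DEADLINE_US, TIMESLOT_US))  # deadline in timeslots
    print(float(max_link_utilization([1, 1])))              # training setting
    print(float(max_link_utilization([2, 2])))              # test setting
```

Integer microseconds are used instead of floating-point milliseconds so that the division is exact.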
4.4. Simulations in Network Topology
In Section 4.3, it was confirmed that the DDQN operates under various parameters in the single-node environment. To verify that the DDQN scheduler also works properly as the network grows, we implemented a well-known mesh-type network topology with nine nodes, as shown in Figure 10 [16,26], and applied the same trained model to compare the performance of the scheduling algorithms and analyze the delay. We used ET when training on a single node, but in that setting there is no way to observe whether the DDQN agent estimated the E2E correctly. Hence, we judged that simulation in a network topology is also essential. In the topology, each node delivers its state to a trained DDQN agent, which is assumed to be embedded in the node, and receives the action. Because the state defined in the MDP is local to each node, the situation of the entire network is unknown to any single node. Therefore, the key point of our work is to show through topology simulation that scheduling is performed properly with ET. If the information of other nodes were added to the state, scheduling performance could improve, but the state space would become vast, resulting in a large amount of computation, and the approach would become harder to apply as nodes are added. We conducted simulations of various scenarios with eight flows and two priorities. The eight flows are shown in Table 7, where the route is the sequence of nodes through which the flow passes. The node numbers are the same as in Figure 10. The first entry of a route is the node connected to the source, and the last entry is the node connected to the destination. Each node of the topology has two input ports and two output ports. The output port structure defined in Figure 3 is applied, which means that there are two priority queues per port. Scheduling is performed on a per-port basis and, as mentioned, the scheduling algorithm operates only when packets are queued in both priority queues at the same time. In particular, if flows defined in Table 7 share the same link, the scheduling algorithm is applied instead of work-conserving. The link from node 2 to node 5 is shared by F1 and F2, and the link from node 8 to node 5 is shared by F7 and F8. Therefore, the simulation can be conducted with only F1, F2, F7, and F8. Our proposed DDQN-based timeslot scheduling process is described in Algorithm 1.
Algorithm 1 DDQN scheduling simulation in topology
N = number of nodes in topology
P = number of priority queues in a port of a node = 2
Init priority_queues[1..P]
Init action[1..N], state[1..N], next_state[1..N], reward[1..N], timeslot
DDQN ← trained_ddqn_weights
while not done:
    timeslot += 1
    for each SourceModule do:
        SourceModule.PacketGeneration
    if timeslot is not the initial timeslot:
        for n in 1..N:
            state[n] ← next_state[n]
            if any NodeModule[n].Queues is not empty:
                action[n] ← DDQN.ChooseAction
            else:
                action[n] ← NodeModule[n].WorkConserving
    for n in 1..N:
        packet ← NodeModule[n].Scheduling(action[n])
        priority_queues[P] ← packet  # the priority of the packet is P
        NodeModule[n].Send(priority_queues[P])
        next_state[n] ← NodeModule[n].StateObserve
    if transmission terminated:
        done = True
return
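The timeslot loop of Algorithm 1 can be illustrated with a much smaller, runnable sketch. It is a simplification under stated assumptions: a single output port instead of the nine-node topology, one periodic flow per priority queue, a delay counted as the timeslots from arrival up to and including the sending slot, and a `policy` callable standing in for the trained DDQN. None of these names come from the simulator used in the study.

```python
from collections import deque


def simulate_port(policy, periods=(2, 2), deadline=7, packets_per_flow=40):
    """Timeslot loop for one output port with one flow per priority queue.

    policy(queue_lengths, head_ages) -> index of the queue to serve; it is
    only consulted when two or more queues hold packets.
    Returns the fraction of packets sent within the deadline, per queue.
    """
    queues = [deque() for _ in periods]
    generated = [0] * len(periods)
    met = [0] * len(periods)
    timeslot = 0
    while any(g < packets_per_flow for g in generated) or any(queues):
        timeslot += 1
        # Source modules: one packet per flow every `period` timeslots.
        for i, period in enumerate(periods):
            if generated[i] < packets_per_flow and (timeslot - 1) % period == 0:
                queues[i].append(timeslot)      # store the arrival timeslot
                generated[i] += 1
        non_empty = [i for i, q in enumerate(queues) if q]
        if not non_empty:
            continue                            # idle timeslot
        if len(non_empty) == 1:
            action = non_empty[0]               # work-conserving
        else:
            lengths = tuple(len(q) for q in queues)
            ages = tuple(timeslot - q[0] if q else 0 for q in queues)
            action = policy(lengths, ages)      # stand-in for the trained agent
        arrival = queues[action].popleft()
        if timeslot - arrival + 1 <= deadline:  # +1 counts the sending slot
            met[action] += 1
    return [m / packets_per_flow for m in met]


def strict_priority_policy(lengths, ages):
    """Always serve queue 0, the highest priority."""
    return 0


if __name__ == "__main__":
    print(simulate_port(strict_priority_policy, periods=(2, 2)))
    print(simulate_port(strict_priority_policy, periods=(1, 1)))
```

Replacing `strict_priority_policy` with a function that wraps a trained model would reproduce the structure of the DDQN branch; with more than two queues, the policy must return the index of a non-empty queue.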
Table 8 lists the six scenarios used in our experiments. The deadline and period are given as tuples for priority 1 and priority 2, respectively, and T denotes a timeslot. The probabilities of meeting the deadline for the DDQN, SP, and WRR in each scenario are shown in Figure 11 and Figure 12. There are scenarios in which the model trained on the single node using ET meets 100% of the deadlines, and the DDQN performs better overall. With SP, 100% of the deadlines were met for priority 1, but the deadline for priority 2 is not fully guaranteed. With WRR, if the packet generation period is longer than two slots, scheduling becomes similar to that of SP because of work-conserving. Since WRR has a weight of 3:1, it shows a low probability of meeting deadlines in an environment such as S5, where the packet generation periods are not inversely proportional to the weights. This means that when WRR is used, the weights must also be changed as the environment changes. The results demonstrate that the proposed DDQN scheduler successfully addresses these drawbacks of SP and WRR.
Table 9 shows the deadline-meeting results for different WRR weights in scenario S4. In S4, since the period of each priority is 2T, WRR achieves the highest deadline satisfaction at a weight of 1:1. Among S1 to S6, WRR performed especially poorly in S5. This shows that the appropriate WRR weight is sensitive to the maximum utilization (i.e., the packet period). The WRR results for different weights in S5 are given in
Table 10. Since S5 has periods of 1T and 5T for the two priorities, it is a high-utilization environment of up to 120%. The results show that more weight needs to be assigned to the packets with the shorter period. However, even then the WRR result is slightly lower than that of the DDQN, and WRR has the disadvantage that its weights must be reassigned whenever the utilization changes.
Figure 11 illustrates the E2E under the DDQN, SP, and WRR. It confirms that every packet scheduled by the DDQN arrived within its deadline. The left graph of
Figure 11 corresponds to flow F1 and the right graph to F2; the graphs show the E2E of 60 packets in scenario S4.
The x-axis represents the E2E in ms instead of the timeslot unit.
Figure 13 suggests why the DDQN can substantially improve performance. All of the scheduling algorithms, including the DDQN, transmitted high-priority packets within the required deadline, but the DDQN transmitted more low-priority packets than the existing algorithms. The trained DDQN agent chooses to transmit low-priority packets while delaying high-priority packets up to the limit at which they can still be transmitted within the deadline. In contrast, the SP and WRR algorithms transmit high-priority packets early even when ample time remains before their deadlines, so low-priority packets are delayed. This can also be observed in
Figure 9 and
Figure 13. Although these results demonstrate the potential of DRL for network scheduling, several challenges remain: reinforcement learning agents should generalize beyond the environment configured in the simulation, and the computational complexity should be reduced so that DRL can be applied to network equipment.
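The utilization figures quoted for S4 and S5 can be checked with a short calculation. The sketch below assumes that each flow generates one packet per period and that each packet occupies exactly one timeslot on the shared link; the function names `utilization` and `wrr_share` are illustrative and do not come from the paper.

```python
from fractions import Fraction


def utilization(periods_in_slots, slots_per_packet=1):
    """Link utilization as the sum of (slots per packet / period) over all flows."""
    return sum(Fraction(slots_per_packet, period) for period in periods_in_slots)


def wrr_share(weights):
    """Fraction of timeslots each queue receives under WRR when all queues are backlogged."""
    total = sum(weights)
    return [Fraction(weight, total) for weight in weights]


if __name__ == "__main__":
    print("S4 utilization:", float(utilization([2, 2])))  # periods 2T and 2T
    print("S5 utilization:", float(utilization([1, 5])))  # periods 1T and 5T
    print("WRR 1:1 shares:", wrr_share((1, 1)))
    print("WRR 3:1 shares:", wrr_share((3, 1)))
```

Under these assumptions, each priority in S4 demands one half of the link, which a weight of 1:1 matches exactly, whereas a weight of 3:1 leaves priority 2 only one quarter of the timeslots while both queues are backlogged. In S5 the total demand is 6/5 of the link capacity, so no static weight can serve every packet.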
4.5. Simulations in Network Topology
As mentioned in
Section 1, deep learning inference time is a problem that must be solved in order to apply DRL to networks. In this study, to reduce the inference time, inference at every timeslot is avoided through work-conserving operation, and each node acts independently without communicating with a central node. Even with this design, the action is still chosen by the agent, which takes considerable computation time. We designed the simulation so that it does not account for the deep learning inference time, but this time cannot be ignored in real situations. Therefore, we introduced a table look-up method as a simple way to reduce inference time. All states and actions observed over multiple simulation runs are stored in a look-up table, and the simulation is then performed with the table look-up method instead of the DDQN. With this method, the action for the next timeslot is decided by searching for the state in the look-up table, which avoids the DDQN computation. If a state does not exist in the look-up table, the node operates according to the SP algorithm. Because the simulation uses the same scenarios, the rate at which packets arrive within the deadline is very similar to the DDQN results in
Figure 12, but the inference time is significantly reduced, as shown in
Table 11; for example, in Scenario 1, the simulation time with the look-up table is approximately 1.4% of the DDQN time. This suggests that, once the model is trained, inference can be partially replaced by the table look-up method. It can also be expected that reinforcement learning could be introduced in small networks by distributing only simplified tables, without periodically distributing models after training on all newly observed samples at the central processing node. However, the disadvantage of this method is that the agent needs to know the possible states and actions for all scenarios, and that it can only choose one predetermined action for each state. As a solution, a policy-distillation method for reducing the computation of reinforcement learning can be considered [
27]. Policy distillation is a technique that reduces computation by transferring the policy of a teacher model, on which training is actually performed, to a small student model. Policy distillation has been shown to yield very low computation time while keeping the student’s performance close to that of the teacher. In fact, it has been discussed in [
4] that the problem of implementing deep learning in networks can be solved through policy transfer via policy distillation. In addition, in [
10], policy distillation was applied to produce decisions for real-time tasks within microseconds, so that existing congestion control algorithms in C-RAN could be replaced with DRL. In that study, the policy was transferred to a decision tree of appropriate depth that achieves performance similar to the original model with negligible inference time. The proposed DDQN-based model is trained at every timeslot, and experience replay is repeated throughout training. As a result, learning slows down and complexity increases as training progresses, owing to the accumulation of observed data samples. To address these issues, we excluded unnecessary experience and deep learning inference. Timeslot data that are not needed for training (e.g., when all state elements are 0 and no packets are generated in a timeslot) are not added to memory. In addition, when only one of Queue 1 and Queue 2 holds packets, experience replay is not required, so that experience is excluded from the replay memory. This keeps training from slowing down significantly.
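The table look-up method with SP fallback and the two replay-memory exclusion rules described above can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the state is assumed to be a tuple of numbers, action 0 is assumed to mean "serve Queue 1", and all function names (`build_lookup_table`, `lookup_action`, `should_store`, `sp_action`) are hypothetical.

```python
def sp_action(state):
    """Fallback (assumed encoding): strict priority serves Queue 1, i.e. action 0."""
    return 0


def build_lookup_table(observations):
    """Build a state -> action table from (state, action) pairs logged in DDQN runs."""
    table = {}
    for state, action in observations:
        table[tuple(state)] = action  # a later observation overwrites an earlier one
    return table


def lookup_action(table, state, fallback=sp_action):
    """Return the stored action for a known state, otherwise fall back to SP."""
    key = tuple(state)
    if key in table:
        return table[key]
    return fallback(state)


def should_store(state, packets_generated, queue_lengths):
    """Decide whether a timeslot's experience is added to the replay memory."""
    if not any(state) and packets_generated == 0:
        return False  # all state elements are 0 and no packets were generated
    if sum(1 for length in queue_lengths if length > 0) == 1:
        return False  # only one queue holds packets, so no choice was made
    return True


if __name__ == "__main__":
    table = build_lookup_table([((1, 2), 1), ((0, 3), 1)])
    print(lookup_action(table, (1, 2)))  # state seen before: stored action
    print(lookup_action(table, (9, 9)))  # unseen state: SP fallback
    print(should_store((2, 1), 0, (2, 1)))
```

A dictionary keeps each look-up at roughly constant cost regardless of table size, which is one plausible reason for the large reduction in simulation time reported above; the restriction to one predetermined action per state is visible in the overwrite inside `build_lookup_table`.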
5. Conclusions
This study proposed timeslot scheduling based on deep reinforcement learning. Reinforcement learning combined with deep learning is known to reduce computation time and improve Q-value prediction compared to Q-learning. Because lower-priority traffic also has deadlines, we hypothesized that more packets would be sent within their deadlines if the higher priority yielded a timeslot to the lower priority in appropriate states. The DDQN agent was designed to receive a reward when the ET of a packet was less than its deadline. After about 15,000 episodes, the learning curve showed that the DDQN score exceeded that of SP. In simulations in environments with lower utilization than the training environment, SP or WRR recorded a score of 90%, whereas the DDQN recorded a score of 100%. This means that the DDQN could guarantee more packet deadlines than the existing algorithms. Furthermore, when the network structure was expanded, the DDQN always performed as well as or better than the other algorithms while mitigating their weaknesses. The results also confirmed that the DDQN adjusts the E2E of packets so that deadlines are met. Reinforcement learning could provide optimal scheduling that meets the requirements of networks carrying vast amounts of data without the intervention of an administrator, and could help reach the goal of autonomous networking that responds immediately to changes in the network environment. This study has shown the feasibility of introducing deep learning into future intelligent networks. We plan to develop an agent that adapts well to more dynamic network environments, toward a generalized reinforcement learning scheduler. The essential elements of this future work are defining different priorities and classes of flows, and achieving robustness to various utilization levels and requirements.
In addition, we plan to devise ways to reduce computation, such as a policy-distillation method that reduces inference time, so that DRL can be applied to real networks. To use the deep learning modules in PyTorch or TensorFlow, we implemented our own SimPy-based simulation rather than using other network simulation frameworks. This need for custom tooling makes it difficult for deep learning to be adopted in networking. To apply Python-based deep learning models to networks, most related studies developed their own simulators, which can be used only for their own work [
28]. Therefore, simulators should be standardized so that various environments and models can be tested and compared. To advance deep learning-based networks, future efforts should be directed at improving DRL scheduling performance, which remains a challenge, as well as at building standardized simulators and benchmark datasets.
Reinforcement learning-based network scheduling is still in its early stage. Although the DDQN agent in this paper showed better performance than the existing algorithms, the following limitations remain. We also suggest future work to address them.
(1) Generalization and robustness issues: In this study, priority-based scheduling is performed in a simplified timeslot environment. However, real traffic patterns and network conditions can be more complex. We expect the scheduler’s performance to differ in more realistic environments because of such discrepancies. Therefore, in subsequent studies, the flow configurations will better reflect real-world traffic, and the network topologies and sizes will reflect real networks. We also plan to compare the results from diverse aspects and analyze the agent’s policy. In addition, various DRL models will be explored, simulated, and compared in terms of efficiency and performance.
(2) Scalability issues: Limitations on computing resources make it difficult to realize the suggested framework in real implementations. In subsequent studies, a more scalable approach will be employed, in which the central node distributes a lightweight model (obtained, for example, by policy distillation), and the other nodes use the lightweight model to reduce complexity in terms of computation and memory.
(3) Bias issues: The evaluation results shown in this work may be biased. Since the simulator was developed by the authors and was executed only in a network environment with preset conditions, it is necessary to compare the results from various aspects and analyze the agent’s behavior to ensure objectivity. It is possible to enhance the proposed framework by modifying the inference frequency and the system input/output.