1. Introduction
The purpose of combining Operational Technology (OT) and Information Technology (IT) is to build an information system for the smart factory based on industrial control network technology. The new industrial network scheme based on SDN and TSN can satisfy the stringent requirements of industrial networks for low latency, flexibility and dynamism well [1]. Due to the mobility of devices caused by flexible manufacturing, the programmable logic controllers (PLCs) of lathes and the main controllers need to be connected by a wireless network [2]. The control traffic of PLCs is sensitive to latency, and the fifth-generation-advanced (5G-A) network, which supports ultra-reliable low-latency communications (URLLC) and connects with TSN, can serve this traffic with deterministic latency well [3].
By exploiting edge intelligence (EI), the data stream can be routed directly to the nearby edge Data Center (DC) deployed in a workshop instead of being transmitted to the application system in the central DC [4]. The structure of the inner network of the smart factory is presented in Figure 1.
The factory network is divided into an access layer, a distribution layer and a core layer. The part between the vertical dotted lines a and b is the access layer, which is built from access switches and the wireless network. The part between the vertical dotted lines b and c is the distribution layer, which is a ring or mesh network built from distribution switches. The part between the vertical dotted lines c and d is the core layer, which is built from two core switches. The network switches are controlled by an SDN controller through the OpenFlow protocol.
The edge devices, such as vehicles, augmented reality (AR) glasses, robots, cameras, digital lathes and intelligent terminals, are to the left of the vertical dotted line a in Figure 1. The application system for production management is implemented in the central DC, the EI applications are implemented in the edge DC, and the edge devices are connected to the switches of the distribution layer.
A heterogeneous network, diverse traffic, mobile access and applications deployed in both cloud and edge are the characteristics of the network and traffic of a smart factory. The mobility of staff and devices requires dynamic traffic path allocation rather than static allocation. In addition, the diversity of Quality of Service (QoS) requirements makes operation and maintenance management difficult. Operators have been studying the Autonomous Driving Network (ADN), which introduces artificial intelligence (AI) into network management. ADN has attracted numerous researchers, but studies on industrial networks are rare. In fact, the industrial network needs ADN even more because of its complexity.
On the one hand, network slicing can prevent burst data from degrading the delay jitter of real-time traffic. On the other hand, network slicing can provide a solution for transmission with deterministic delay. The authors in [1] divided an SDN switch into two slices: one for highly reliable low-latency traffic, such as communication between PLCs and remote operation, and the other for delay-insensitive traffic, such as data collection and safety video monitoring.
The method proposed in [5] used hard slicing to satisfy the requirements of traffic with different QoS, which realizes the isolation of real-time and non-real-time traffic in a long term evolution (LTE) network. The authors in [2] proposed a QoS cooperative control scheme and procedure for a 5G and SDN heterogeneous network for the smart factory, and modeled the production network of a specific smart factory. Besides 5G network slicing, SDN network slicing was also proposed in [2]. Moreover, a method for dynamic allocation and global optimization of multiple-QoS traffic flow paths by linear programming was proposed.
The Full Paths Re-computation (FPR), Heuristic Full Paths Re-computation (HPR) and Partial Paths Re-computation (PPR) network stream optimization algorithms for a single bandwidth parameter in 5G SDN were proposed in [6]. Although the HPR and PPR algorithms decrease the operation delay, the allocation delay approaches 1 s when the number of users exceeds 50, which cannot satisfy the access delay requirement.
With increasing traffic volume and network scale, conventional methods for dynamic flow path allocation and global optimization require large amounts of computing resources and long computation time, and suffer from the 'curse of dimensionality'. In addition, the access delay is too long to satisfy the requirement. The authors in [2] proposed utilizing Deep Reinforcement Learning (DRL) in the factory network to realize the establishment and dynamic optimization of multi-parameter QoS flow paths.
By exploiting the DRL method, an agent is introduced into the management plane. It interacts with the SDN controller, collects network condition information, including traffic access requirements, and makes routing decisions according to its policy. According to the decision, the SDN controller generates the traffic flow path and sends it to the SDN switches in the data plane via the OpenFlow protocol to realize intelligent allocation of the traffic path [7,8,9,10].
However, the aforementioned DRL-based methods were not proposed for industrial networks and do not explicitly consider the bandwidth and delay requirements together. Therefore, these methods may not satisfy the requirements of an industrial network when the resulting delay exceeds the maximum traffic latency requirement. Recently, the study in [11] proposed a graph embedding-based DRL framework for adaptive path selection (GRL-PS), which achieved nearly optimal performance in both latency and throughput, especially in large-scale dynamic networks.
Furthermore, these DRL-based methods may not adapt well to a network topology that differs from the one used during training, because conventional neural networks cannot handle the complex graph problems of the industrial network [12]. Industrial networks are highly reliable and are often built as ring or mesh networks, which can recover from faults by themselves. When an error occurs in a link or switch node, the topological structure of the network changes, and the allocation method must adapt to this change.
When the network satisfies the traffic QoS, the maximum network traffic is used to evaluate the agent's performance. However, the reward of each traffic path allocation is hard to obtain, and only the reward corresponding to the maximum traffic of the network can be obtained once an episode is finished, which is a sparse reward problem. Therefore, a reward function needs to be designed to guide the learning of the agent.
Model-free DRL methods rely heavily on the amount of training data, which is not easy to obtain in a real network, especially in sparse traffic situations. Although the random experience replay method solves the correlation and non-stationarity problems of experience data, uniform sampling and batch-based updating lead to a scenario where valuable data are not fully used [13].
Although most research focuses on general-purpose networks, such as NSFNET, GEANT2 and GBN, there has also been research on industrial networks in recent years. The authors of [14] designed a deep federated Q-learning-based algorithm, which dynamically assigns traffic flow paths to different network slices according to QoS requirements in IIoT networks based on LoRaWAN technology. Studies on intelligent QoS flow path allocation in a 5G and SDN heterogeneous network for the smart factory are rare. Such studies should consider network reliability, network slicing and multi-service mixing in industrial situations.
2. Methodology
Ref. [15] proposes an overall architecture of an intelligent routing method based on Dueling DQN reinforcement learning, which significantly improves network throughput and effectively reduces network delay and packet loss rate. To fit the topological structure of the inner network of a smart factory and the characteristics of its traffic, we study a multiple-QoS traffic path allocation method based on DRL for the 5G and industrial SDN heterogeneous network. The traffic flow path allocation and optimization architecture is shown in Figure 2. It consists of the data plane, control plane and management plane. The function of each plane is described below.
- (1)
The data plane mainly includes the 5G user plane and the SDN switches, which are controlled by the 5G control plane and the SDN controller of the control plane, respectively. The collected states of the network are reported to the control plane. Service requests originating from terminal equipment or servers are sent to the control plane; it is the control plane that allocates the traffic flow path.
- (2)
The control plane mainly includes the 5G control plane, the SDN controller and the IWF-NEF. The IWF-NEF is responsible for cooperative QoS control between the application, the 5G network and the SDN controller. The network status, including traffic requests, is sent to the management plane by the control plane. After the action is received from the QoS policy agent, the new state and reward are sent back to the management plane as feedback.
- (3)
The management plane is a Deep Reinforcement Learning (DRL)-based policy agent, which acquires the ability of intelligent QoS flow path allocation through training and outputs an action corresponding to the QoS flow path.
Due to different production organizations, factory network topologies differ. Meanwhile, due to the high reliability requirement of an industrial network, traffic paths should be allocated properly even when some nodes fail. Since graph neural networks (GNN) have strong capabilities in modeling and optimizing graph structures, as well as in generalization [16,17,18], it is reasonable to exploit a GNN to model the network structure and realize relational reasoning and combinatorial generalization. Ref. [19] uses a GNN to forecast the end-to-end latency of an SDN, which can enhance the network's routing strategy. In this paper, the GNN is utilized in DQN to learn the network state and the allocation method based on deep learning.
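As an illustration of how a GNN can feed a Q-value estimate, the sketch below combines one mean-aggregation message-passing layer with a small readout head; the layer design, feature choices and class names are illustrative assumptions, not the exact architecture used in this paper.

```python
# Minimal sketch (not the authors' exact architecture): one message-passing
# layer embedding the switch topology, followed by a Q-value readout.
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: (N, N) adjacency with self-loops; x: (N, in_dim) node features.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        msg = adj @ x / deg            # mean aggregation over neighbours
        return torch.relu(self.linear(msg))

class QHead(nn.Module):
    """Maps the pooled graph embedding plus a candidate-path encoding to a Q-value."""
    def __init__(self, node_dim: int, path_dim: int, hidden: int = 64):
        super().__init__()
        self.gnn = SimpleGNNLayer(node_dim, hidden)
        self.out = nn.Sequential(nn.Linear(hidden + path_dim, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x, adj, path_feat):
        h = self.gnn(x, adj).mean(dim=0)          # graph-level embedding
        return self.out(torch.cat([h, path_feat]))
```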
The prioritized experience replay method can solve the problem, caused by random experience replay, of valuable data not being fully used, and it improves learning efficiency. The insight of the prioritized experience replay method [13] is to evaluate the value of data for learning based on the TD-error and to replay valuable data multiple times. To avoid the lack of diversity caused by purely TD-error-based prioritization, we utilize a probability-based prioritized experience replay method. Moreover, importance sampling is introduced to correct the bias caused by prioritized replay, and the weights generated by importance sampling are applied in the Q-learning update.
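A minimal sketch of probability-based prioritized replay with importance sampling, in the spirit of [13], is given below; the buffer layout and the hyper-parameter names alpha and beta are assumptions for illustration.

```python
# Illustrative probability-based prioritized replay with importance sampling;
# not the paper's exact implementation.
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity: int, alpha: float = 0.6, beta: float = 0.4):
        self.capacity, self.alpha, self.beta = capacity, alpha, beta
        self.data, self.priorities = [], []

    def add(self, transition, td_error: float):
        p = (abs(td_error) + 1e-6) ** self.alpha
        if len(self.data) >= self.capacity:        # drop the oldest sample
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition); self.priorities.append(p)

    def sample(self, batch_size: int):
        probs = np.array(self.priorities) / sum(self.priorities)
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        weights = (len(self.data) * probs[idx]) ** (-self.beta)
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = (abs(e) + 1e-6) ** self.alpha
```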
As for the sparse reward problem, we aim to improve the learning effect of the agent by improving the learning method or adding a new reward, including the reward shaping method [
20], designing a new reward module [
21], and adding a new reward learning method [
22,
23]. The reward shaping method can design the reward according to prior information about environment. Therefore, we design the reward function by combining the graph theory and conventional network programming and optimization to aid the agent in finding the policy quickly.
2.1. Algorithm Framework
The framework of the proposed algorithm is presented in Figure 3; it contains training and test stages. The simulation environment is composed of a traffic generator and an industrial SDN model. The enhanced DQN (E-DQN), the GNN and the prioritized experience replay method are used as the agent, the deep learning structure and the experience replay memory, respectively. The evaluation module collects information from the environment and the agent, and evaluates the performance of the agent.

In the training stage, the traffic generator generates traffic requirements according to QoS. The path set, i.e., the action set, is established by the industrial SDN model. By interacting with the policy network of the agent, an action is selected and sent to the environment. The environment calculates the reward based on the designed reward function, sends the complete tuple {s, a, r, s'} to the agent, and records it in the experience replay memory. In the proposed algorithm, E-DQN uses the prioritized experience replay method to choose valuable data for training, and the GNN is used as its neural network. The network traffic, QoS indices and reward are evaluated periodically in the training stage.
2.2. Network Model and Traffic Model
2.2.1. Factory Network Model
The factory network has its own characteristics, such as a three-layer network architecture and intelligent applications with edge-cloud collaboration. The production management application is implemented in the factory- or company-level data center (DC), while some real-time edge intelligent applications are implemented on the production site thanks to edge computing [24]. From the view of the network framework, once the access point is determined, the uplink path from the edge device to the access switch is fixed, regardless of wired or wireless access. Therefore, the key of traffic path allocation lies in the distribution layer and the core layer.
For simplification, the study objects of the network model are chosen as the distribution networks of two workshops and the core network of the factory. The topological structure of the network is presented in Figure 4, where SX denotes an application server in the central DC, and EC-SXX denotes an edge intelligent application system implemented on site, such as a main PLC or a remote operator.

{SW01, SW02} are core switches, and the remaining switches are distribution switches. The bandwidths of the links connecting to the core switches and of the links between distribution switches are 1000 Mbps and 200 Mbps, respectively.
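Such a topology can be modeled directly as a graph; the sketch below uses networkx for this. Only SW01 and SW02 are named in the text, so the distribution switch names and the link list are placeholders standing in for the topology of Figure 4.

```python
# Illustrative construction of the factory network graph with networkx.
# The distribution switch names and link set are placeholders for Figure 4.
import networkx as nx

G = nx.Graph()
core = ["SW01", "SW02"]
distribution = ["SW91", "SW92", "SW93", "SW94"]      # placeholder names

# Links to core switches: 1000 Mbps; links between distribution switches: 200 Mbps.
for sw in distribution:
    for c in core:
        G.add_edge(sw, c, capacity=1000, used=0)
G.add_edge("SW91", "SW92", capacity=200, used=0)
G.add_edge("SW93", "SW94", capacity=200, used=0)

# Node betweenness is part of the static state used later by the reward function.
betweenness = nx.betweenness_centrality(G)
print(betweenness)
```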
2.2.2. Traffic Model
As shown in Figure 4, the production management application system is usually implemented in the central DC. The production management application includes monitoring of production data, video of key places, device health, and dispatching of speech and video, which can be divided into real-time (RT) and non-real-time (NRT) service types.
The source nodes and target nodes of the traffic in the data center are as follows.
- (1)
- (2)
The edge intelligent application systems EC-S01, EC-S02, EC-S03, EC-S04, EC-S05, EC-S06 and EC-S07 handle highly real-time traffic, including control messages between PLCs, local security monitoring traffic and remote operation of vehicles, which can be divided into RT and highly reliable ultra-low-latency (URT) types. It is worth mentioning that the traffic of a workshop usually accesses the edge computing system of that workshop itself, which means the source and target nodes are divided into different groups. The group details are as follows.
Group 1:
- (1)
source node 1:
- (2)
target nodes 1:
Group 2:
- (1)
source node 2:
- (2)
target nodes 2:
The bandwidth and delay requirements of the traffic are shown in Table 1. To simplify, the delay is defined as the maximum number of nodes in the flow path. The traffic bandwidth is chosen and allocated from a predefined set.
2.2.3. Network Slicing Model
The industrial network and its switch nodes are both divided into two slices: Slice 1 for the NRT type and Slice 2 for the RT and URT types.
Figure 4 shows the typical way of slicing and connecting switches; the details are as follows.
(1) SW92 is divided into SW92_sl1 and SW92_sl2; the RT application server S5 is connected to SW92_sl2, and the NRT application servers S3 and S4 are connected to SW92_sl1. (2) SW02 is divided into SW02_sl1 and SW02_sl2, and the link between SW02 and SW92 is divided into L_sl1 and L_sl2, which connect SW92_sl1 with SW02_sl1 and SW92_sl2 with SW02_sl2, respectively.
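One possible way to derive the two slice graphs from the physical topology is sketched below, continuing the networkx example above; the equal split of link bandwidth between slices is an illustrative assumption.

```python
# Sketch of splitting the topology into two slice graphs; the bandwidth share
# per slice is an illustrative assumption, not a value from the paper.
import networkx as nx

def build_slices(G: nx.Graph, rt_share: float = 0.5):
    """Return (slice1, slice2): slice 1 carries NRT traffic, slice 2 carries RT/URT."""
    slice1, slice2 = nx.Graph(), nx.Graph()
    for u, v, attrs in G.edges(data=True):
        cap = attrs["capacity"]
        # Hard slicing: each physical link is split into two logical links.
        slice1.add_edge(f"{u}_sl1", f"{v}_sl1", capacity=cap * (1 - rt_share), used=0)
        slice2.add_edge(f"{u}_sl2", f"{v}_sl2", capacity=cap * rt_share, used=0)
    return slice1, slice2
```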
2.3. QoS Optimization Model
The topological structure of the inner network of a factory is modeled by an undirected graph G = (V, E), where V is the node set of G, representing the distribution and core switches, and E is the edge set of G, representing the links of the network. When network slicing is applied, the network is divided into multiple graphs corresponding to the slices, which can be presented as $G_{SL_i} = (V_{SL_i}, E_{SL_i})$. The slice index, nodes, links, link capacity and bandwidth, and exchanging delay of nodes are denoted as $SL_i$, $V_{SL_i}$, $E_{SL_i}$, $C_{SL_i}$, $B_{SL_i}$ and $D_{SL_i}$, respectively. To simplify, the exchanging delays of the nodes in the same slice are identical.
The source-target node pairs required by the QoS traffic are composed of arbitrary pairs taken from the source and target node sets in $G_{SL_i}$. The node pair requiring traffic is denoted as $k$. One forwarding path of node pair $k$ is $p_k$, and the set of all forwarding paths of the node pair set is $P$. The required bandwidth and the maximum delay of the traffic are denoted as $b_k$ and $d_k^{\max}$, respectively.
The optimization targets of QoS are as follows.
- (1)
The path delay of the required traffic is less than the maximum traffic delay, which is presented as $\sum_{v \in p_k} D_{SL_i} \le d_k^{\max}$.
- (2)
Maximizing the traffic carried by the network, which is presented as $\max \sum_{k} b_k x_k$, where $x_k \in \{0, 1\}$ indicates whether the traffic of node pair $k$ has been allocated a path.
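The two targets can be checked programmatically against a candidate allocation, for example as in the sketch below, which reuses the networkx graph from Section 2.2.1; the per-node delay value and the function names are illustrative assumptions.

```python
# Minimal sketch of the two optimization targets above; not the paper's implementation.
import networkx as nx

def path_feasible(G: nx.Graph, path: list, bw: int, max_delay_nodes: int,
                  node_delay: float = 1.0) -> bool:
    """Target (1): path delay (here, node count) within the limit,
    and every link has enough residual capacity for the requested bandwidth."""
    if len(path) * node_delay > max_delay_nodes:
        return False
    for u, v in zip(path, path[1:]):
        e = G[u][v]
        if e["capacity"] - e["used"] < bw:
            return False
    return True

def allocated_traffic(allocations: list) -> int:
    """Target (2): total bandwidth of all successfully allocated flows."""
    return sum(bw for _path, bw in allocations)
```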
2.4. Algorithm for Agent
2.4.1. Environment State
The network state, which can be presented as $s$, includes static and dynamic information. The topological structure of the network, the exchanging delay of nodes, the bandwidth of links and the betweenness of nodes belong to the static information. The link flows, QoS parameters and source-target nodes belong to the dynamic information.
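For illustration, one possible way to assemble such a state vector is sketched below; the feature layout and dictionary keys are assumptions, not the exact encoding used by the agent.

```python
# Sketch of assembling the environment state from static and dynamic parts.
import numpy as np
import networkx as nx

def build_state(G: nx.Graph, betweenness: dict, request: dict) -> np.ndarray:
    """Concatenate per-link residual capacity, per-node betweenness,
    and the current traffic request (src, dst, bandwidth, max delay)."""
    residual = [d["capacity"] - d["used"] for _, _, d in G.edges(data=True)]
    nodes = sorted(G.nodes())
    btw = [betweenness[n] for n in nodes]
    req = [nodes.index(request["src"]), nodes.index(request["dst"]),
           request["bandwidth"], request["max_delay"]]
    return np.array(residual + btw + req, dtype=np.float32)
```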
2.4.2. Action
The number of routing combinations for the source-target pairs required by the traffic is large, which leads to a high-dimensional action space in a real large-scale network and makes route selection difficult for the agent. Since delay is a key QoS parameter, applying the K-shortest paths can consider the delay requirement implicitly. Moreover, the dimension is reduced because the action space is only a subset of the full path set.

The choice of the K value depends on the size of the network and the number of routing combinations. The agent chooses only one path according to the environment state. The allocation of the traffic flow path is proper as long as the path delay and bandwidth satisfy the requirements.
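A sketch of building the K-shortest-path action set with networkx follows; the value of K and the hop-count metric are illustrative choices.

```python
# Sketch of building the K-shortest-path action set with networkx.
from itertools import islice
import networkx as nx

def k_shortest_paths(G: nx.Graph, src: str, dst: str, k: int = 4) -> list:
    """Return up to k loop-free paths ordered by hop count (the action set)."""
    return list(islice(nx.shortest_simple_paths(G, src, dst), k))
```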
2.4.3. Reward
The reward is defined as a function of the maximum traffic allocated by the agent on the network, which can only be calculated when an episode is finished. The mathematical form is $r_{\mathrm{final}} = f(F_{\max})$, where $F_{\max}$ denotes the maximum total traffic allocated on the network in the episode.
2.5. Design of the Reward Function with Sparse Reward
The optimization objective is to maximize the traffic flow while satisfying the delay and bandwidth requirements. Since the traffic path allocation may not be optimal before an episode finishes, the reward value of each action cannot be given directly. Therefore, we use reward shaping to guide the learning of the agent and find an optimized policy under the sparse reward. The delay requirement has to be satisfied for factory traffic; at the same time, pursuing a lower delay than required is unnecessary.
For traffic path allocation, a large betweenness of a network node means that a large number of shortest paths pass through it. If such nodes are occupied early, subsequent traffic cannot be allocated and the total traffic flow is reduced. To avoid this, a negative term based on the betweenness of the path nodes is introduced. Each proper path allocation increases the traffic flow, so the bandwidth of the allocated path is used as a positive term. The reward function is therefore designed as

$r = \omega_1 r_d + \omega_2 b_k - \omega_3 B_{p_k}$,

where $\omega_1$, $\omega_2$ and $\omega_3$ are the weights of the three terms, $r_d$ is the delay reward, $b_k$ is the bandwidth of the allocated traffic, and $B_{p_k}$ is the sum of the betweenness of the nodes of traffic path $p_k$.
If the reward function can evaluate the agent's performance from the delay and bandwidth rewards alone after traffic path allocation, the betweenness term is not needed, and $\omega_3$ is set to 0 in this situation.
Since the path delay is easy to obtain, paths that cannot satisfy the delay requirement can be removed when the initial path set is built; $\omega_1$ is set to 0 in this situation.
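An illustrative implementation of this shaped reward is given below; the weight values and the convention that the delay reward is +1 for a feasible path are assumptions.

```python
# Illustrative implementation of the shaped reward r = w1*r_d + w2*b_k - w3*B_pk;
# the weight values are placeholders, not the paper's tuned values.
def shaped_reward(path: list, bw: int, betweenness: dict, delay_ok: bool,
                  w1: float = 1.0, w2: float = 0.01, w3: float = 0.1) -> float:
    r_delay = 1.0 if delay_ok else -1.0
    b_pk = sum(betweenness[n] for n in path)   # summed betweenness of path nodes
    return w1 * r_delay + w2 * bw - w3 * b_pk
```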
2.6. Algorithm
Algorithm 1 presents the training procedure of the agent. The algorithm consists of two parts: experience data generation and training. The total number of iterations, the size of the experience memory, the number of training epochs of the Main Net, the update period of the target network parameters and the experience memory buffer are initialized first (Line 1).
Algorithm 1 The training procedure of the agent

1: initialize the algorithm parameters and the experience memory buffer
2: while the maximum number of iterations is not reached do
3:   initialize the environment (reset link capacities, compute betweenness)
4:   while NOT DONE do
5:     while unallocated traffic requirements remain do
6:       generate a traffic requirement (source, target, bandwidth, delay)
7:       get the k_shortest_path set as the action set
8:       for i = 0 to k do
9:         compute the q-value of path i with the Main Net
10:      end for
11:      choose the action with the maximum q-value
12:      execute the action in the industrial SDN model
13:      obtain the reward r and the new state s′
14:      compute the priority and weight of {s, a, r, s′}
15:      store the experience in the replay memory
16:      j = j + 1
17:    end while
18:    if the training period is reached then
19:      sample the experience buffer by priority and train the Main Net
20:      if the target-network update period is reached then
21:        update the parameters of the target network
22:      end if
23:    end if
24:  end while
25:  n_i = n_i + 1
26: end while
2.6.1. Experience Data Generation and Memory
Each iteration corresponds to one episode, which ends when a traffic path cannot satisfy the bandwidth requirement. We initialize the environment state, set the link capacities to their maxima and calculate the betweenness (Line 3).
Then, the traffic generator sends a traffic requirement, described by its source-target nodes, bandwidth and delay, to the network model. The network model calculates the K-shortest paths and forms the action set according to the source-target nodes. The simulation environment sends the network state, traffic requirement, action set, available link capacities and betweenness to the E-DQN, which calculates the q-value of each action and feeds back the action corresponding to the maximum q-value, chosen by the greedy algorithm. The fed-back action is used by the industrial SDN network model to generate the new state S′ and the reward (Lines 6–13).
The environment model sends the above experience data to the experience replay memory function of the E-DQN, which calculates the priority and weight of the experience data and stores them in the experience buffer (Lines 14–15).
When the network cannot allocate new traffic, the environment sets 'DONE' to 'True', exits the loop and starts the next episode until the maximum number of iterations is reached. Training is called periodically, every fixed number of iterations.
2.6.2. Training
We extract data from the experience buffer according to their priorities to train the Main Net of the E-DQN and then update their priorities and weights. The parameters of the Main Net are updated using the importance-sampling weights, and the parameters of the target network are updated every fixed number of iterations.
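For illustration, one E-DQN training step combining prioritized sampling with an importance-sampling-weighted loss could look like the sketch below, reusing the PrioritizedReplay buffer above; the discount factor, batch size and network objects are assumptions.

```python
# Sketch of one E-DQN training step with importance-sampling weights;
# main_net and target_net are assumed to be torch modules mapping states to q-values.
import torch

def train_step(main_net, target_net, optimizer, buffer, batch_size=32, gamma=0.95):
    idx, batch, weights = buffer.sample(batch_size)
    # Each transition is (state, action, reward, next_state, done) with plain
    # Python lists / numbers, so they can be stacked into tensors directly.
    s, a, r, s_next, done = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)
    w = torch.as_tensor(weights, dtype=torch.float32)

    q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * q_next * (1.0 - done)

    td_error = target - q
    loss = (w * td_error.pow(2)).mean()    # importance-sampling-weighted loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Feed the new TD-errors back so the priorities stay up to date.
    buffer.update_priorities(idx, td_error.detach().abs().tolist())
    return loss.item()
```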