Article

GDR: A Game Algorithm Based on Deep Reinforcement Learning for Ad Hoc Network Routing Optimization

School of Information Engineering, East China Jiaotong University, Nanchang 330013, China
*
Author to whom correspondence should be addressed.
Electronics 2022, 11(18), 2873; https://doi.org/10.3390/electronics11182873
Submission received: 7 August 2022 / Revised: 29 August 2022 / Accepted: 3 September 2022 / Published: 11 September 2022
(This article belongs to the Section Artificial Intelligence)

Abstract

Ad Hoc networks have been widely used in emergency communication tasks. To address the dynamic characteristics of Ad Hoc networks and the problems of limited node energy and unbalanced energy consumption during deployment, we propose a strategy based on game theory and deep reinforcement learning (GDR) to improve the balance of network capabilities and enhance the autonomy of the network topology. The model uses game theory to generate an adaptive topology, in which each node adjusts its power according to the average node lifetime, helping the nodes with the shortest lifetimes reduce their power and prolonging the survival time of the entire network. When the state of a node changes, reinforcement learning is used to automatically generate routing policies to improve the average end-to-end latency of the network. Experiments show that, under the condition of ensuring connectivity, GDR has a smaller residual energy variance, a longer network lifetime, and lower network delay. On average, the delay of the GDR model is 10.5% lower than that of existing methods.

1. Introduction

The Ad Hoc network is a center-less, multi-hop wireless network with self-organizing properties. There is no fixed infrastructure for the entire network; each node is mobile and can dynamically stay connected to other nodes in any way [1]. Due to the particularity of the communication mechanism and the limitation of node resources, nodes are usually powered by batteries, which are difficult to recharge. Therefore, the energy consumption of nodes must be considered. In actual deployment, nodes are randomly and densely deployed at different locations in the monitoring area. The network load becomes unbalanced because each node consumes energy at a different rate. How to improve energy utilization efficiency and ensure balanced energy consumption so that the network achieves the maximum life cycle is therefore a question that needs to be considered. At the same time, the Ad Hoc network is dynamic: nodes can move anywhere and can be turned on and off at any time, so the network topology can change at any moment. The dynamic topology makes it more difficult to guarantee a long lifetime while maintaining connectivity.
At present, some achievements have been made in research on the energy consumption management of Ad Hoc networks. The vehicular edge computing (VEC) scheme proposed in [2] reduces energy consumption by using an edge caching approach to reduce cost, treating unused nodes as static nodes with reduced computational overhead. This can be seen as adding some static nodes and changing the topology of the network to control energy consumption. In addition, to balance energy, Du Y et al. [3] proposed a topology control game algorithm (EBTG), in which a new exponent is introduced to improve the utility function so that both nodal energy balance and energy efficiency are considered. Based on this, they used non-cooperative game theory to optimize the algorithm in [4], which includes two sub-algorithms: one only guarantees single connectivity of the network, and the other guarantees dual connectivity. In contrast to other algorithms based on game theory, this algorithm can effectively extend the network lifetime and performs better in terms of robustness when guaranteeing dual connectivity. The above algorithms can effectively control the network topology, improve network performance and balance the energy consumption of the nodes, but they cannot completely guarantee full coverage of the network and the connectivity of all sensor nodes. Moreover, the average survival time of nodes and the transmission power of nodes should also be taken into consideration.
Moreover, control methods that address only the topology have many limitations in practical deployment. Because of the dynamic properties of Ad Hoc networks, topology control alone cannot adapt to the changing network environment. In recent years, the rapid advancement of reinforcement learning has opened up the possibility of solving routing engineering problems. Compared with supervised and unsupervised learning, which can only classify network traffic, reinforcement learning can directly generate routes for newly connected nodes by training intelligent agents on unlabeled data sets [5,6]. Many related studies apply reinforcement learning to routing engineering or traffic flow engineering [7,8].
The application of reinforcement learning to routing engineering has not been very effective so far, because the existing methods [9] are mainly based on neural network (NN) architectures (e.g., the Convolutional Neural Network [10] and the Recurrent Neural Network [11]), which are not appropriate for modeling information about graph structures. Changes in the topology of dynamic networks imply that the inputs and outputs of CNNs or RNNs are not fixed, and such architectures cannot generalize well over the inputs or outputs of dynamic Ad Hoc network information. Even if the topology can be represented, it is very inconvenient to store, because a change of topology means a modification of the whole representation. Graph neural networks (GNN) use graphs to represent network topology information, which can clearly characterize the Ad Hoc network structure and store it more efficiently. In [12], a new GNN-based multipath routing model is proposed to explore the complexity between links, paths and MPTCP connections on various topologies. GNN models can predict the topology and network throughput for a given network, providing support for optimal multipath routing. In [13], a packet routing strategy based on deep reinforcement learning (PRMD) was proposed to reduce packet transmission time by learning the data information of the network to forward packets. These reinforcement learning strategies do not focus on network energy consumption, which can lead to changes in the overall network topology and therefore cannot keep the network stable for a long time.
To obtain a stable network with a longer life cycle and lower delay, and to solve the problem of unbalanced energy consumption during dynamic networking, we propose an adaptive routing algorithm based on game theory and deep reinforcement learning. The method first describes the network topology and transmission parameters with a graph neural network, and then dynamically adjusts the topology using a game algorithm. In order to maintain stability when the network changes dynamically, reinforcement learning is used to generate automatic routing strategies, so that the GNN can attain approximately optimal performance without a priori information about the environment and is capable of independent exploration and optimization decisions. The contributions of our work are summarized as follows:
  • We apply game theory to the graph representation of routing engineering and pay attention to the performance of each node in the network in terms of energy consumption while ensuring network connectivity and efficiency;
  • In view of the dynamic nature of the network, we train a network with reinforcement learning to achieve a near-optimal routing policy without prior information about the environment. It is worth pointing out that our input is a graph representation that is adapted by the game model, which is less computationally expensive and faster;
  • We collect and experiment with network traffic data from the real world. The results show that the average end-to-end delay of the GDR model is 10.5% lower than that of AutoGNN [14] on average. Meanwhile, the average lifespan of the network nodes is longer, the network energy consumption is more balanced, the energy efficiency of the entire network is improved, and the network structure is more robust.
The remainder of this paper is structured as follows. Section 2 summarizes the related work. Section 3 presents the details of the GDR model. Section 4 presents the simulation and result analysis. The conclusion is presented in Section 5.

2. Related Works

Early approaches to extending the survival time of Ad Hoc networks were mainly based on graph-theoretic topology control, which constructs a minimum spanning tree of the network topology using Euclidean distance and lacks adaptive capability. Currently, more promising approaches mainly use game theory and deep reinforcement learning. This paper uses a GNN to describe the topology of the network, then uses game theory to adjust the topology according to the present state of the network, and finally uses DRL methods to determine a routing policy.
Graph Neural Networks. A graph neural network is an emerging type of network that operates on graph-structured information. Since they were introduced in [15], GNNs have made great strides in many fields, and many related works on wireless networks have been presented. In [16], the problem of downlink power control in wireless networks was considered; the authors utilize a graph neural network and unsupervised optimization methods to learn optimal power allocation decisions. In [17], the crucial interference features are extracted using graph neural networks, and a DRL framework is used to explore the optimal allocation strategy. Reference [18] characterizes diverse link characteristics and interference relationships using graphs and proposes a heterogeneous interference graph neural network that enables each link to obtain its individual transmission scheme after a limited information exchange with neighboring links. Given the ability of graphs to represent network structure, it is natural to use graphs for network structure representation.
Game-theory-based network topology adjustment. Wang [19] proposed an algorithm based on a topology control game model in which nodes adjust their power according to the average lifetime of the nodes, helping the nodes with the shortest lifetimes reduce their transmitting power and thereby extending the lifetime of the whole network. However, in this model, the dynamic adjustment of the network topology relies on a preset period or threshold, which is often difficult to define in practical applications. Aiming at the problems of existing game-theory-based topology control algorithms for wireless Ad Hoc networks, such as the unbalanced load of individual "bottleneck nodes", many redundant links, and a short life cycle, a multi-objective fusion network topology control algorithm is proposed in [20]. The weight factor of the game model in this algorithm is adjusted manually, which cannot adapt to the dynamics of the Ad Hoc network.
Deep-reinforcement-learning-based routing policy. Practices and theories driven by deep reinforcement learning have been studied for a long time. Kao [21] demonstrates the promise of applying reinforcement learning (RL) to optimize Network-on-Chip (NoC) runtime performance. A multi-agent model-free RL scheme for resource allocation is presented in [22], which mitigates interference and eliminates the need for a network model. The proposed schemes are implemented in a decentralized cooperative manner, with cognitive radios (CRs) acting as multiple agents that form a stochastic dynamic team to obtain an optimal energy-efficient resource allocation strategy. As a result, reinforcement learning techniques in multi-agent environments and distributed networks are becoming increasingly popular.

3. GDR Framework

In this section, we will describe the GDR framework in detail. In Section 3.1, we introduce how to represent Ad Hoc networks with a dynamic graph. Section 3.2 provides the topology control game algorithm and a detailed description of the optimization problem. Section 3.3 develops the routing algorithm based on the DRL framework with GNN.

3.1. Dynamic Graph Construction for Ad Hoc Network

In practice, there are many cases in which the number of neighbors per node is variable, as opposed to the fixed neighborhood size of images and text. Such data are difficult to express in any way other than as a graph, and the network topology is exactly this kind of data. This section describes how Ad Hoc networks can be represented by graphs.
To build the graph of an Ad Hoc network with N transceiver pairs, we regard the ith pair of transceivers as the ith node of the graph. Each node has a feature vector that includes environmental information and direct channel state information h_{ii}, such as the remaining energy w_i of the ith node. The feature vectors of the two directed edges between nodes v_i and v_j can include h_{ij} and h_{ji}, respectively. Figure 1 shows a construction method of a graph for an Ad Hoc network with three transceivers.
In an Ad Hoc network, each transceiver is often distinct from the others, and the information carried by the graph is diverse. The above representation is not sufficient to fully describe the network, so we define the graph as H = (V, E, P), where V represents the set of all transceiver nodes in the network, i.e., N nodes randomly deployed in a region, and the link set E represents the set of links between pairs of nodes in V. Each node can communicate with its neighboring reachable nodes. V_i stores the coordinates d_i of node i. For convenience, we treat the graph representation as undirected, i.e., E_{ij} = E_{ji}. E_i contains the propagation delay of the link, which is related to the quality of the link. P represents the set of node states, including network connectivity, transmitting power, node degree and link quality. In this paper, these specific feature values are integrated into the environment information, which will be introduced in Section 3.2 and Section 3.3.
Figure 2 depicts the graph representation of an Ad Hoc network with two state types, where v_i is the feature vector of node i, and e_{ij} is the feature vector of the edge between nodes v_i and v_j. Different colors indicate different states, and the states are stored in p_i.
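To make this representation concrete, the following sketch builds the graph H = (V, E, P) for the three-transceiver example of Figure 1 using plain Python data structures; the class and field names (AdHocGraph, residual_energy, tx_power and so on) are illustrative assumptions, not identifiers from the paper.

from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class AdHocGraph:
    # V: node id -> coordinates d_i; E: undirected link -> propagation delay;
    # P: node id -> state (residual energy, transmit power, degree, link quality).
    V: Dict[int, Tuple[float, float]] = field(default_factory=dict)
    E: Dict[Tuple[int, int], float] = field(default_factory=dict)
    P: Dict[int, dict] = field(default_factory=dict)

    def add_node(self, i, coords, residual_energy, tx_power):
        self.V[i] = coords
        self.P[i] = {"residual_energy": residual_energy, "tx_power": tx_power,
                     "degree": 0, "link_quality": {}}

    def add_link(self, i, j, delay, quality):
        self.E[tuple(sorted((i, j)))] = delay      # undirected: E_ij = E_ji
        for a, b in ((i, j), (j, i)):
            self.P[a]["degree"] += 1
            self.P[a]["link_quality"][b] = quality

# The three-transceiver network of Figure 1 (coordinates and values are made up).
g = AdHocGraph()
g.add_node(1, (0.0, 0.0), residual_energy=50.0, tx_power=0.1)
g.add_node(2, (40.0, 30.0), residual_energy=48.0, tx_power=0.1)
g.add_node(3, (80.0, 10.0), residual_energy=45.0, tx_power=0.1)
g.add_link(1, 2, delay=2.1, quality=0.45)
g.add_link(2, 3, delay=2.8, quality=0.40)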

3.2. Game Algorithm for Topology Control

Having represented the network topology in a graph, we can then use the game model to adjust the topology generation to regulate energy consumption through the nodes while ensuring network connectivity. To start with, a simple example of topological control to regulate energy consumption is considered as shown in Figure 3. Nodes A, B, C, and D are deployed within the transmission range of each other, where each node is represented by a small circle for ease of description and is assigned an integer remaining lifetime value, and the number inside the circle indicates the remaining lifetime of the node.
Figure 3a shows the topology of the nodes at a certain moment, assuming that the current average lifetime is 4 and the remaining lifetimes of nodes A, B, C and D are 7, 3, 4 and 2, respectively. Node A can reach node B with the smallest transmit power and the least energy consumption, while node B can reach node C only with the largest transmit power and the most energy consumption. If operation continues as in Figure 3a, node B will die prematurely due to excessive energy consumption. In Figure 3b, the remaining lifetime of node A is 7, which is longer than the average lifetime, so node A increases its transmitting power to help the surrounding nodes reduce their energy consumption; its communication radius therefore grows from covering only node B to covering node C. In Figure 3c, node B has a shorter-than-average lifetime, so it reduces its transmitting power to save energy and extend its lifetime while keeping the network connected; its communication radius shrinks from reaching the distant node C to reaching only its nearest neighbor. The remaining lifetime of node C equals the average lifetime, so its transmitting power remains unchanged. After the node transmitting powers are adjusted, the network remains connected and node B will not die prematurely.
In this paper, the topological control game model uses the above idea, and the details of the game model will be presented later in this section.

3.2.1. Game Model

The strategy game is defined as \theta = \{N, C, U\}, where N = \{1, 2, \ldots, n\} denotes the set of game participants, whose size equals the number of nodes in the graph. The strategy space is denoted as C = \{C_1, C_2, \ldots, C_n\}, where C_i represents the set of strategies that participant i can choose; if there are k alternative strategies for i, then C_i = \{c_i^1, c_i^2, \ldots, c_i^k\}. U = \{u_1, u_2, \ldots, u_n\} is the set of payoff values obtained by the participants after the game, and u_i(c_i, c_{-i}) denotes the payoff obtained by participant i under the strategy combination (c_i, c_{-i}), with c_i denoting the strategy chosen by participant i and c_{-i} denoting the strategies chosen by the remaining participants.

3.2.2. Utility Function

It is not easy to quantify the benefits of nodes in the complex node deployment environment of Ad Hoc networks. In order to reflect the network realistically and exactly, we consider the utility function of nodes through the following aspects.
Network connectivity. By introducing network connectivity as a parameter when a node changes its own transmit power, it is possible to ensure that the topology remains connected through the iterations of the game. Equation (1) represents the connectivity of the network:
f_i(c_i, c_{-i}) = \begin{cases} 1, & \text{connected} \\ 0, & \text{unconnected} \end{cases}   (1)
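As an illustration only, the connectivity indicator of Equation (1) could be evaluated with a standard breadth-first search over the candidate topology; the adjacency-list interface below is an assumption and not code from the paper.

from collections import deque

def connectivity_indicator(adjacency, nodes):
    """Return 1 if the topology induced by `adjacency` connects all `nodes`, else 0
    (mirrors f_i(c_i, c_-i) in Equation (1))."""
    nodes = set(nodes)
    if not nodes:
        return 1
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return 1 if nodes <= seen else 0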
Transmitting power. Depending on how densely or sparsely nodes are distributed within the communication range, the set of neighboring nodes is controlled by adjusting the node transmit power, thus achieving network load balancing. p_{ij} is defined as the transmission power that node i consumes to establish communication with node j. In Equation (2), d_{i,j} denotes the length of the communication link between node i and neighbor node j; p_i^{\min} denotes the minimum transmission power; a_f is the communication loss coefficient.
p_{ij} = d_{i,j}^{2}\, p_i^{\min}\, 10^{a_f/10}   (2)
Node degree. An appropriate node degree can effectively improve energy utilization efficiency. Consequently, the node degree probability function is added to the integrated utility function. Y represents the probability that the degree of node i is m. In Equation (3), m is the degree of node i and λ is the mean of the Poisson random variable.
Y_{(i=m)} = \frac{\lambda^{m} \exp(-\lambda)}{m!}   (3)
Link quality. In the process of data transmission, link quality has an essential influence on performance, and choosing a path with high link quality to transmit data can effectively reduce energy consumption. The received signal strength indication values in the forward and reverse directions are combined into the link quality b_{i,j}. In Equation (4), R_i and R_j are the signal reception capacities of nodes i and j.
b_{i,j} = \frac{R_i R_j}{R_i^2 + R_j^2}   (4)
Combining the above analysis, the payoff of the utility function is given by Equation (5), where α, β, η and μ are weighting factors greater than zero, constrained by α + β + η + μ = 1. p_i^{\max} is the maximum transmission power of node i, E_{oi} is the initial energy of node i, and E_{ri} is the residual energy of node i.
u_i = f_i(c_i, c_{-i}) \left[ \alpha \frac{p_i^{\max} E_{oi}}{E_{ri}} + \beta \frac{E_i}{\bar{p}_i} + \eta \frac{\lambda^{m} \exp(-\lambda)}{p_{ij}\, m!} + \mu \frac{b_{i,j}}{p_{ij}\, d_{i,j}} \right] - \alpha \frac{p_i E_{oi}}{E_{ri}}   (5)
From Equation (5), we can see that the utility function performs topology control under the premise of guaranteeing network connectivity. If f_i(c_i, c_{-i}) = 0, the bracketed gain vanishes and u_i(c_i, c_{-i}) = -\alpha p_i E_{oi}/E_{ri}, which indicates that node i can only obtain a negative payoff after the game. Combining Equations (1)–(4) with Equation (5), node energy consumption, node degree and link length are negatively related to the utility function: the higher the energy consumption and the longer the link, the lower the gain. The residual energy of neighboring nodes and the link quality are positively correlated with the utility function: the higher the residual energy of neighboring nodes and the higher the link quality, the higher the gain. Therefore, this game model not only reduces node energy consumption, but also achieves higher link quality between nodes and a lower node load.
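The sketch below evaluates a payoff of the same general shape as Equation (5), with the term grouping following the reconstruction above; it should be read as an approximation of the authors' utility, and all argument names are our own assumptions.

import math

def node_utility(connected, p_i, p_i_max, p_ij, p_bar_i, E_oi, E_ri, E_i,
                 d_ij, b_ij, m, lam, alpha, beta, eta, mu):
    """Payoff of node i in the topology-control game (sketch of Equation (5)).
    `connected` is the indicator of Equation (1); alpha + beta + eta + mu = 1."""
    degree_term = (lam ** m) * math.exp(-lam) / math.factorial(m)   # Equation (3)
    gain = (alpha * p_i_max * E_oi / E_ri          # reward scaled by remaining energy ratio
            + beta * E_i / p_bar_i                  # residual energy vs. average power
            + eta * degree_term / p_ij              # node-degree term, penalized by required power
            + mu * b_ij / (p_ij * d_ij))            # link quality, penalized by power and distance
    cost = alpha * p_i * E_oi / E_ri                # energy-consumption penalty
    return connected * gain - cost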

3.2.3. Nash Equilibrium

If node i can achieve a higher payoff with a smaller transmit power, it adjusts to that transmit power and updates its neighbor list. When no node can obtain a larger payoff by adjusting its power again, the network reaches a Nash equilibrium; that is, the selection strategy of node i is c_i^{*} = \arg\max_{c_i \in C_i} u_i(c_i, c_{-i}). The potential game phase is described in Algorithm 1, in which p_{i,j} denotes the power needed to reach node j from node i, \hat{P} denotes the power set of all nodes, which can be used to represent the topology, and S_i denotes the set of neighbor nodes.
Algorithm 1 Topology Control Game Algorithm.
Initialization:
  \hat{P} = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_n\}, n \in N
  for i in N:
    Node i broadcasts its own information with power p_max.
    if (p_{i,j} \le p_max):
      add j to S_i
      k \leftarrow |S_i|
      sort p_i^1 < p_i^2 < \cdots < p_i^k
      map S_i and p_i to \{c_i^1, c_i^2, \ldots, c_i^k\} \subseteq C_i
Adjustment:
  repeat
    for i in N:
      for each t in C_i, compute u_i(t, c_{-i})
      c_i^{*} = \arg\max_{c_i \in C_i} u_i(c_i, c_{-i})
      update S_i and p_i
    end for
  until \hat{P} no longer changes
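A condensed Python sketch of the best-response loop in Algorithm 1 is given below; it assumes each node's strategy is a candidate transmit power and that a `utility_of(i, p, profile)` callback evaluates u_i (for example, by combining the helpers sketched earlier), which is a simplification of the algorithm above.

def topology_control_game(nodes, candidate_powers, utility_of, max_rounds=100):
    """Best-response dynamics: each node repeatedly switches to the power level that
    maximizes its own utility, until the power profile no longer changes (Nash equilibrium)."""
    profile = {i: max(candidate_powers[i]) for i in nodes}     # initialization: broadcast at p_max
    for _ in range(max_rounds):
        changed = False
        for i in nodes:
            best_p = max(candidate_powers[i], key=lambda p: utility_of(i, p, profile))
            if best_p != profile[i]:
                profile[i] = best_p                            # adopt the better response
                changed = True
        if not changed:                                        # P_hat no longer changes
            break
    return profile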
After Algorithm 1, we obtain a new and more concise topology that not only reduces node energy consumption, but also has higher link quality between nodes and a lower node load. Figure 4 shows how the topology control game algorithm establishes the topology. From Figure 4a, we can see that, in the initial state of the network, the communication links constructed between nodes can cause high node energy consumption and lead to premature node death.
From Figure 4b, we can see that, after applying the potential game model, the node can complete the forwarding task without establishing communication to all neighboring nodes in the communication range. It can be concluded that the game model in the algorithm can reduce the node load to improve the link quality. At the same time, this results in smaller inputs and better computational performance for the reinforcement learning agent.

3.3. GNN-Based DRL Agent

In this section, we provide details of the deep reinforcement learning routing optimization algorithm, which is a DRL approach with GNN. First, we briefly introduce reinforcement learning and graph neural networks (GNNs). Then, we illustrate how to define the key reinforcement learning elements in the routing optimization problem. After that, we describe the process of generating actions by agents based on GNN. It includes feature extraction and policy generation. Finally, the specifics of the training process are presented. The framework of the routing optimization is illustrated in Figure 5.

3.3.1. Deep Reinforcement Learning

Since the node state and the link state of the network in the GDR model are dynamic, the network topology, routing strategy, and transmission power need to be dynamically adjusted according to the current node energy. The parameters α , β , η , and μ for topology control and energy control in the utility function of Equation (5) need to be determined and, since the network state is dynamic, manually setting these parameters for the network will not be adaptive. Therefore, a DRL model is designed here to learn the properties of the network and pave the way for optimization decisions.
DRL is suitable for achieving decision optimization in situations where the environment model is unknown. In a system of environmental state change that can be represented as a Markov process, the agent first obtains the environment state s t at each step t, and then the agent performs action a t , which causes the environmental state to change to s t + 1 . In this phase, the changed environment is evaluated by a specific standard, and the agent receives a reward r t . When this process is completed, the performance of this decision can be represented by the sum of each reward. The decision process is shown in Figure 6.
During reinforcement learning training, the parameters of the DRL model are first randomly initialized and then constantly adjusted through interaction with the environment, so as to generate near-optimal actions. Each complete sequence of state transitions in reinforcement learning is called an episode. The interaction steps between the agent and the environment in each episode can be represented as (s_t, a_t, r_t), and the return of an episode is \sum_{t=0}^{T} r_t, where T is the number of state transitions in the episode. To make the DRL model converge better, the cumulative reward is replaced with \sum_{t=0}^{T} \gamma^{t} r_t, where γ is the discount factor. The policy learned by the agent is π: O → A, which can be expressed as \pi_\theta(a_{i,t} \mid o_{i,t}).
The policy function \pi_\theta(a_{i,t} \mid o_{i,t}) gives the probability of taking each possible action given a particular state and the parameters. As with other neural network solutions, the policy function can be solved once the objective function is determined. Specifically, the objective function can be optimized by computing its gradient with respect to the parameters θ and then updating θ in the direction of gradient ascent. This gradient is expressed as:
\nabla_\theta E_\pi\left[\sum_{t=0}^{T} r_t\right] = E_\pi\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid o_{i,t})\right)\left(\sum_{t=0}^{T} r(o_{i,t}, a_{i,t})\right)\right].   (6)
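For concreteness, the snippet below computes the discounted episode return and a REINFORCE-style surrogate whose gradient matches the form of Equation (6); it is a generic PyTorch sketch under our own naming, not the authors' implementation.

import torch

def policy_gradient_loss(log_probs, rewards, gamma=0.99):
    """Surrogate loss for one episode whose gradient is the estimator of Equation (6).
    log_probs: list of log pi_theta(a_t | o_t) tensors; rewards: list of scalars r_t."""
    episode_return = sum((gamma ** t) * r for t, r in enumerate(rewards))   # sum_t gamma^t r_t
    # Ascending the gradient of E[return] corresponds to descending this negative surrogate.
    return -torch.stack(log_probs).sum() * episode_return

# Typical usage with a stochastic policy network (names are assumptions):
# dist = torch.distributions.Categorical(logits=policy_net(observation))
# action = dist.sample(); log_probs.append(dist.log_prob(action))
# loss = policy_gradient_loss(log_probs, rewards); loss.backward(); optimizer.step()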

3.3.2. Graph Neural Networks

The GNN model can extract features of the graph structure well, and it can capture the dependencies in the graph through information propagation between vertices. The GNN obtains the hidden state embedding of each vertex c ∈ C based on graph perception. Specifically, the hidden state representation of each vertex is iteratively updated by aggregating the features of the vertex's edges and neighboring vertices. The hidden state embedding of vertex c at time t + 1 is updated as follows:
c^{t+1} = f\left(\chi_c, \chi_{con(c)}, \chi_{nei(c)}, nei(c)^{t}\right),   (7)
where \chi_{con(c)} denotes the features of the edges adjacent to vertex c, f(\cdot) is the local transition function, \chi_{nei(c)} represents the features of the neighbor vertices of c, and nei(c)^{t} represents the hidden state embeddings of the neighbor vertices at time t.
In our work, a GNN-based network is built; its structure is shown in Figure 7. To extract the features of the graph representation, we use four graph convolution layers. If too few layers are stacked, the network does not learn easily: each vertex can only identify and aggregate a few neighbors, so it lacks feature information about its neighborhood. Common intuition suggests that the deeper the network, the better it performs, but in practice this is not the case here. If there are too many hidden layers, then after multi-hop propagation almost all vertices are effectively treated as adjacent to each other, whereas in reality our input is an optimized topology and the vertices of the graph should not all present highly similar representations. For the routing optimization task, it is sufficient to see only the closer nodes while still considering the global effect, and stacking four GNN layers extracts features efficiently. In addition, a fully connected layer is used to gradually adapt to route choices by adjusting its parameters, with a softmax layer as the output. In the deep reinforcement learning framework, we use this output to generate the probability distribution over actions: the softmax function converts the input vector into a new vector of values in (0, 1), which represents the probability of choosing each action.
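A minimal PyTorch sketch of this architecture is shown below: four message-passing layers, a fully connected readout and a softmax output, as described for Figure 7. The mean-aggregation rule, layer widths and action-space size are assumptions, since the paper does not report them.

import torch
import torch.nn as nn

class GNNPolicy(nn.Module):
    """Four graph layers + fully connected layer + softmax (cf. Figure 7)."""
    def __init__(self, in_dim=8, hidden_dim=32, num_actions=16):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * 4
        self.layers = nn.ModuleList([nn.Linear(dims[k], dims[k + 1]) for k in range(4)])
        self.readout = nn.Linear(hidden_dim, num_actions)

    def forward(self, x, adj):
        # x: [N, in_dim] node features; adj: [N, N] adjacency matrix with self-loops.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        norm_adj = adj / deg                          # mean aggregation over neighbors
        for layer in self.layers:
            x = torch.relu(layer(norm_adj @ x))       # aggregate neighbor features, then transform
        graph_embedding = x.mean(dim=0)               # pool node embeddings
        return torch.softmax(self.readout(graph_embedding), dim=-1)   # action probabilities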

3.3.3. DRL Framework

The DRL framework of the routing optimization is illustrated in Figure 5. We will describe each component of the reinforcement learning framework in detail.
  • Agent: Based on the “GNN+DRL” framework, a central controller is considered as an agent, which means that our work uses a centralized reinforcement learning algorithm. The agent has the ability to learn and make decisions in the interaction with the environment. The central controller is responsible for performing routing optimization for the network;
  • State: State in the reinforcement learning framework represents the information that the agent can obtain from the environment. This information is represented as a graph, including vertices, edges and global information.
In our work, the state mainly consists of the utility function U t of Section 3.2 and the distance distribution matrix D t . D t is derived from the location information carried by each node itself interacting with other nodes. Consequently, the GNN-Based DRL agent system state is defined as:
O_t = \{D_t, U_t\}.   (8)
  • Action: An action is a process of valid routing. Due to the dynamic nature of the Ad Hoc network, new incoming or outgoing nodes can cause changes in the state of the environment.
At every single step, we define the actions as the allocation of the newly accessed nodes, which can be expressed as:
A_t = \{a_1(t), a_2(t), \ldots\},   (9)
where a_i(t) denotes one candidate action. In our approach, an action carries a matrix of information about the newly accessed nodes, including the sets V and P described in Section 3.1; the edge information E is computed from the information in P and V, and the utility function value of the route is calculated from the carried utility information.
  • Rewards: Instead of following predetermined labels, the learning agent optimizes the behavior of the algorithm by continuously obtaining rewards from the external environment. The principle of the “reward” mechanism is to tell the agent how comparatively good the current action is. In other words, the reward function directs the optimization direction of the algorithm. Accordingly, if we link the design of the reward function to the optimization goal, the performance of the system will be improved, driven by the rewards.
In general, actions that satisfy all nodes’ requirements without breaking constraints and that are more stable are considered favorable and are encouraged, and the agent will receive a positive reward. This means that the maximum probability of choosing the current action should be enforced. Conversely, actions that violate constraints or cause a severe imbalance in the network are considered to be failures and are discouraged, and negative rewards are fed back to the agent. This means that the agent has more possibilities to find alternative routing optimization decisions. Consistent with the maximization of cumulative discount rewards, the overall optimization goal is achieved by continuously promoting routing optimization policies.
During the learning process, the agent continuously updates the policy, driven by the cumulative reward, until the best policy for routing optimization is learned. We optimize the loss and back-propagate the gradients through the policy network. The loss function is as follows:
L(\theta) = -\sum_{t=0}^{T} \log \pi_\theta(a_{i,t} \mid o_{i,t}) \left( \sum_{t'=t}^{T} r(o_{i,t'}, a_{i,t'}) - b_{i,t} \right).   (10)
Therefore, the total loss from each small batch is given by:
L_{total} = \frac{1}{Z} \sum_{i=1}^{Z} L(\theta).   (11)
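The following sketch computes the per-trajectory loss of Equation (10) using reward-to-go and a baseline, and averages it over a mini-batch as in Equation (11); taking the mean return as the baseline b_{i,t} is our own assumption.

import torch

def trajectory_loss(log_probs, rewards, baseline=None):
    """Loss of Equation (10): -sum_t log pi(a_t|o_t) * (reward-to-go_t - b_t)."""
    rewards_t = torch.tensor(rewards, dtype=torch.float32)
    reward_to_go = torch.flip(torch.cumsum(torch.flip(rewards_t, [0]), dim=0), [0])
    if baseline is None:
        baseline = reward_to_go.mean()                 # assumed baseline b_{i,t}
    advantage = (reward_to_go - baseline).detach()     # no gradient through the baseline
    return -(torch.stack(log_probs) * advantage).sum()

def batch_total_loss(batch):
    """Total loss of Equation (11): mean of per-trajectory losses over the batch Z."""
    return torch.stack([trajectory_loss(lp, r) for lp, r in batch]).mean()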
Algorithm 2 is the pseudo-code describing the operation of the DRL agent. We use the data from each mini-batch to perform the optimization. At the beginning, nodes are initialized randomly with arbitrary positions ϑ and utility information ξ. After that, some nodes are dropped randomly to form the set Z. At the same time, we initialize the environment by randomly setting the parameters of the GNN model and of the utility function as θ, including α, β, η and μ, where α + β + η + μ = 1. Then, we operate on each node in Z until all nodes have been traversed. For each node, we initialize the environment θ and construct the graph H = (V, E, P) of the current topology, as introduced in Section 3.1 and Section 3.2.1. We observe the state O_t from the graph (see Equation (8)), select a routing link a_{i,t} from A_t (see Equation (9)) and perform the routing link selection. After that, we use the game model to adjust the topology generation to obtain a more optimal structure, as described in Section 3.2. The performance is judged by the reward r(o_{i,t}, a_{i,t}) obtained after execution.
Algorithm 2 DRL Agent Operation.
Initialization:
  Nodes are initialized randomly with arbitrary positions ϑ and utility information ξ.
  Some nodes are dropped randomly to form the set Z.
  The parameters of the GNN model and of the utility function are initialized as θ, with α + β + η + μ = 1 in the utility function.
begin
  for i in Z do
    Initialize the environment θ
    for the current moment t do
      Construct the graph H = (V, E, P) of the current topology
      Observe the state O_t from the graph
      Select routing link a_{i,t} from A_t and perform routing link selection
      Use the game model to adjust the topology generation
      Obtain the reward r(o_{i,t}, a_{i,t})
      t = t + 1
    end for
    Calculate the loss of routing link selection L(θ)
    Calculate the total loss L_total
    Update the network parameter θ and the utility function parameters α, β, η and μ
  end for
end
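Putting the pieces together, a condensed sketch of the training loop of Algorithm 2 is given below. It relies on the policy network and loss sketches shown earlier in this section and on a hypothetical environment object whose reset/observe/step interface (including the internal call to the topology-control game) is our assumption.

import torch

def train_gdr(env, policy, optimizer, episodes=1000, steps_per_episode=50):
    """Condensed DRL training loop (cf. Algorithm 2)."""
    for _ in range(episodes):
        env.reset()                                    # random positions and utility information
        log_probs, rewards = [], []
        for _ in range(steps_per_episode):
            features, adjacency = env.observe()        # state O_t built from the graph H = (V, E, P)
            dist = torch.distributions.Categorical(policy(features, adjacency))
            action = dist.sample()                     # select a routing link a_{i,t} from A_t
            log_probs.append(dist.log_prob(action))
            rewards.append(env.step(action.item()))    # env adjusts the topology via the game model
        loss = trajectory_loss(log_probs, rewards)     # Equation (10)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                               # update theta; the utility weights would be updated similarly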

4. Simulation and Evaluation

In this section, we present experiments to evaluate our proposed routing optimization game algorithm. The experiments were conducted on Windows (CPU: Intel Core i7-7700, 3.6 GHz; GPU: NVIDIA GeForce RTX 3080 with 8 GB of graphics memory). The hyperparameters are summarized in Table 1.

4.1. Simulation

To start with, we compare our proposed algorithm (GDR) with the following methods: 1. a stochastic policy based on our proposed network structure (SP for short); 2. a CNN-based DRL approach (CNN+DRL for short); 3. AutoGNN [14], which uses a GNN within the DRL framework but does not use game theory; 4. the game model with manually modified parameters (GMM for short). Table 2 compares the different algorithms.
Figure 8a depicts the convergence process of the algorithm we proposed in this paper. As the number of iterations increases, the reward increases and becomes stable. At the beginning of the training, the reward is relatively small and the algorithm is just experimenting and not learning anything useful. After 25,000 iterations, the reward gradually stabilizes. This indicates that the agent has learned and analyzed the historical trajectory, and has mastered the specific method to automatically update the strategy and converge to the optimum in subsequent attempts. Additionally, Figure 8b illustrates the total loss during training. It is evident that after 4800 iterations, the loss is minimized. The slight fluctuations in the training process do not affect the performance of our algorithm.
Figure 9 shows the smoothed curves of the achievable data rates of the various algorithms. The achievable data rate of the proposed GDR is stable within [5500, 5800], which is the best performance. Although the convergence time of the stochastic strategy is relatively short, it is not the optimal routing optimization scheme because its achievable data rate stabilizes at [2700, 2900]. The performance of the CNN+DRL scheme is also worth noting: as seen in the figure, it fails to converge because a CNN can only process Euclidean data. GDR converges in 25,000 s and learns the optimal routing optimization strategy. Meanwhile, it is known from [14] that the reward of AutoGNN only stabilizes after 48,000 iterations, which is obviously much slower than GDR. This is because its input is the complete network topology, which is much more computationally intensive. In contrast, the input of the algorithm in this paper is the adjusted network, whose graph representation is simpler and computationally more efficient.

4.2. Evaluation

The effectiveness of topology control has been demonstrated in [3,4,19,20] and will not be repeated here. To show the superiority of dynamic topology control and deep reinforcement learning, we compare GDR (proposed in this paper) with the Energy Balance Topology control game algorithm ([3] EBTG) and the routing strategy based on deep reinforcement learning ([13] PRMD) for both residual energy and survival time.
Residual energy. The standard deviation variation of node residual energy can visualize the inter-node load. To verify the load balancing of the GDR algorithm, Figure 10 shows the comparison of the variation of the standard deviation of the residual energy of the nodes.
The standard deviation of the GDR algorithm is lower and increases more slowly than that of the PRMD and EBTG algorithms. The PRMD algorithm is concerned with energy balance, but it cannot properly take the dynamics of the nodes into account and easily misjudges the transmit power of some nodes. This tends to increase the total energy consumption of the links and makes the standard deviation of the nodes' residual energy rise faster. In contrast, the GDR algorithm can be adjusted dynamically according to the network conditions, improving the energy efficiency of the network and balancing node lifetimes.
Survival time. The ultimate goal of topology control algorithms is to extend the survival time of the network, which can be taken as the time until the first node dies. In this paper, we compare the relationship between survival time and the number of nodes for PRMD, EBTG and GDR, as shown in Figure 11.
As can be seen in Figure 11, the survival time of algorithm GDR is slightly higher than that of algorithm PRMD and EBTG as the number of nodes increases, and the life cycle of the GDR algorithm improves by about 2.6%, 12.9%, 13.6%, and 14.8%. This is due to the limitations of the manually given parameters, which cannot be adapted to the dynamic network environment. In contrast, the parameters in the GDR algorithm are updated dynamically as the network changes, and the effects of node load and node state on the network survival time are taken into account comprehensively, so the network survival time is longer.
To evaluate the whole model, we conduct experiments on network delay. To demonstrate our advantage, we train GDR instances and AutoGNN [14] instances in two network scenarios, which are random network structures with 15 and 40 nodes, respectively. It is worth mentioning that the demand lists of the GDR and AutoGNN solutions are identical and randomly generated. We conducted 200 random traffic generation experiments for each scenario to obtain representative conclusions.
Network delay. Network delay is measured by the average end-to-end delay, which is reflected as a specific score in the algorithm of this paper. Figure 12 and Figure 13 show the results of the two experiments with 15 and 40 nodes, respectively. In each box plot, the Y-axis shows the specific score, i.e., the delay calculated from the environmental information available at the time of each evaluation. From the figures, we can see that the end-to-end delay distributions do not differ much when the network has only fifteen nodes. This is mainly because the topology is simple when there are few nodes and the complexity of the graph representation is low. When the network has forty nodes, GDR significantly outperforms AutoGNN; for example, in about 75% of the experiments the performance improvement is more than 30%. This is because the graph representation simplified by the game model not only improves computational efficiency, but also allows the model to learn something useful to help decision making. In particular, from Section 4.1 we know that GDR trains roughly 1.92 times faster than AutoGNN. As a result, the GDR model is better.
There is variability in the characteristics of the data under different network structures. Without knowing the specific network, we do not have a good criterion upon which to judge whether a model is good or bad. Therefore, in this paper, we use comparison experiments to evaluate the generalization ability of our model by the score of the average reward.
Generalization ability. We evaluated the ability of our GDR model to generalize to real-world network topologies obtained from a small self-study room. From the data collected in that study room, we selected topologies with more than 10 and less than 40 nodes. In particular, we did not consider ring and star topologies. This is because, in these topologies, the number of effective candidate paths for distribution requirements is usually very limited (in many cases a node is connected to only 1–2 nodes). By filtering, we obtained 140 real network topologies with which to perform tests.
To evaluate the generalization ability of our model, we selected the best model during training for comparison experiments. On each topology, we performed 500 evaluation experiments, stored the rewards achieved by the GDR and AutoGNN routing strategies, and calculated the average values. Figure 14 shows the performance of the different models on the different topologies. The X-axis represents the different topologies, ordered according to the GDR model scores, and the Y-axis indicates the performance of our model relative to the AutoGNN model. In 80% of the cases, our model works better than AutoGNN, and the delay of the network under the GDR model is, on average, 10.5% lower than that of the existing method.

5. Conclusions

In this paper, we proposed a strategy based on game theory and reinforcement learning to improve the balance of network capabilities and enhance the autonomy of the network topology. The model uses game theory to generate an adaptive topology, adjusting each node's power according to the average lifetime, helping the nodes with the shortest lifetimes reduce their power, and prolonging the survival time of the entire network. When nodes move in and out of the network dynamically, reinforcement learning is used to automatically generate routing policies to improve the average end-to-end latency of the network. Theoretical analysis and experimental results show that, under the condition of ensuring connectivity, GDR achieves a more balanced load, a longer network lifetime and lower network delay. It reduces the average end-to-end delay of the network and exhibits greater robustness to topology changes.

Author Contributions

Conceptualization, T.H., R.W., X.L. and X.N.; Literature search, T.H. and X.L.; Methodology, T.H. and R.W.; Supervision, T.H.; Software, T.H. and X.L.; Visualization, T.H. and X.L.; Data analysis, R.W.; Writing-review & editing, R.W., X.L. and X.N. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to express thanks to the Science Research Project of Jiangxi Provincial Department of Education (under Project No. GJJ190319), the National Natural Science Foundation of China (Project No. 62067002, No. 52062016, No. 62062033), the Special 03 Project and 5G Project of Jiangxi Province (20203ABC03W07) and the Natural Science Foundation of Jiangxi Province (Project No. 20192ACBL21006) for their financial support.

Conflicts of Interest

All authors declare no conflict of interest.

References

  1. Ramanathan, R.; Rosales-Hain, R. Topology Control of Multiple Wireless Networks Using Transmit Power Adjustment. In Proceedings of the INFOCOM 2000. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies, Tel Aviv, Israel, 26–30 March 2000. [Google Scholar]
  2. Zhao, J.; Dong, P.; Ma, X.; Sun, X.; Zou, D. Mobile-aware and relay-assisted partial offloading scheme based on parked vehicles in B5G vehicular networks. Phys. Commun. 2020, 42, 101163. [Google Scholar] [CrossRef]
  3. Du, Y.; Gong, J.; Wang, Z.; Xu, N. A distributed energy-balanced topology control algorithm based on a noncooperative game for wireless sensor networks. Sensors 2018, 18, 4454. [Google Scholar] [CrossRef] [PubMed]
  4. Du, Y.; Xia, J.; Gong, J.; Hu, X. An energy-efficient and fault-tolerant topology control game algorithm for wireless sensor network. Electronics 2019, 8, 1009. [Google Scholar] [CrossRef]
  5. Sun, P.; Hu, Y.; Lan, J.; Tian, L.; Chen, M. TIDE: Time-relevant deep reinforcement learning for routing optimization. Future Gener. Comput. Syst. 2019, 99, 401–409. [Google Scholar] [CrossRef]
  6. Tiwari, P.; Zhu, H.; Pandey, H.M. DAPath: Distance-aware knowledge graph reasoning based on deep reinforcement learning. Neural Netw. 2021, 135, 1–12. [Google Scholar] [CrossRef] [PubMed]
  7. Zhao, J.; Liu, J.; Yang, L.; Ai, B.; Ni, S. Future 5G-oriented system for urban rail transit: Opportunities and challenges. China Commun. 2021, 18, 1–12. [Google Scholar] [CrossRef]
  8. Wan, G.; Pan, S.; Gong, C.; Zhou, C.; Haffari, G. Reasoning like human: Hierarchical reinforcement learning for knowledge graph reasoning. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, Yokohama, Japan, 7–15 January 2021; pp. 1926–1932. [Google Scholar]
  9. Suárez-Varela, J.; Mestres, A.; Yu, J.; Kuang, L.; Feng, H.; Cabellos-Aparicio, A.; Barlet-Ros, P. Routing in optical transport networks with deep reinforcement learning. J. Opt. Commun. Netw. 2019, 11, 547–558. [Google Scholar] [CrossRef]
  10. Albawi, S.; Mohammed, T.A.; Al-Zawi, S. Understanding of a convolutional neural network. In Proceedings of the International Conference on Engineering and Technology (ICET), Antalya, Turkey, 21–23 August 2017; pp. 1–6. [Google Scholar]
  11. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent neural network regularization. arXiv 2014, arXiv:1409.2329. [Google Scholar]
  12. Zhu, T.; Chen, X.; Chen, L.; Wang, W.; Wei, G. Gclr: Gnn-based cross layer optimization for multipath tcp by routing. IEEE Access 2020, 8, 17060–17070. [Google Scholar] [CrossRef]
  13. You, X.; Li, X.; Xu, Y.; Feng, H.; Zhao, J.; Yan, H. Toward Packet Routing with Fully-distributed Multi-agent Deep Reinforcement Learning. IEEE Trans. Syst. Man Cybern. Syst. 2019, 52, 855–868. [Google Scholar] [CrossRef]
  14. Chen, B.; Zhu, D.; Wang, Y.; Zhang, P. An Approach to Combine the Power of Deep Reinforcement Learning with a Graph Neural Network for Routing Optimization. Electronics 2022, 11, 368. [Google Scholar] [CrossRef]
  15. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80. [Google Scholar] [CrossRef] [PubMed]
  16. Naderializadeh, N.; Eisen, M.; Ribeiro, A. Wireless power control via counterfactual optimization of graph neural networks. In Proceedings of the IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Atlanta, GA, USA, 26–29 May 2020; pp. 1–5. [Google Scholar]
  17. Zhao, D.; Qin, H.; Song, B.; Han, B.; Du, X.; Guizani, M. A graph convolutional network-based deep reinforcement learning approach for resource allocation in a cognitive radio network. Sensors 2020, 20, 5216. [Google Scholar] [CrossRef]
  18. Zhang, X.; Zhao, H.; Xiong, J.; Liu, X.; Zhou, L.; Wei, J. Scalable power control/beamforming in heterogeneous wireless networks with graph neural networks. In Proceedings of the IEEE Global Communications Conference (GLOBECOM), Madrid, Spain, 7–11 December 2021; pp. 1–6. [Google Scholar]
  19. Wang, H.; Qiu, Z.; Dong, R.; Jiang, H. Energy balanced and self adaptation topology control game algorithm for wireless sensor networks. Kongzhi yu Juece/Control Decis. 2019, 34, 72–80. [Google Scholar]
  20. Yang, S.; Lian-Suo, W.; Yuan, G. Multi-Objective Fusion Ordinal Potential Game Wireless Ad Hoc Network Topology Control Algorithm. J. Beijing Univ. Posts Telecommun. 2022, 105–111. [Google Scholar]
  21. Kao, S.C.; Yang, C.H.H.; Chen, P.Y.; Ma, X.; Krishna, T. Reinforcement learning based interconnection routing for adaptive traffic optimization. In Proceedings of the 13th IEEE/ACM International Symposium on Networks-on-Chip, New York, NY, USA, 17–18 October 2019; pp. 1–2. [Google Scholar]
  22. Kaur, A.; Kumar, K. Energy-efficient resource allocation in cognitive radio networks under cooperative multi-agent model-free reinforcement learning schemes. IEEE Trans. Netw. Serv. Manag. 2020, 17, 1337–1348. [Google Scholar] [CrossRef]
Figure 1. Constructing graph for Ad Hoc network with 3 transceivers.
Figure 2. Constructing graph for Ad Hoc network with 2 state types. (a) An example of Ad Hoc network with 2 state types. (b) A graph describing the Ad Hoc network.
Figure 3. Topology control to regulate energy consumption.
Figure 4. The process of establishing and optimizing the topology. (a) Initial Network State. (b) Using the game model.
Figure 5. The deep reinforcement learning framework.
Figure 6. The Markov decision process.
Figure 7. The network structure of the GNN model.
Figure 8. The convergence process and training loss. (a) The convergence process. (b) The training loss.
Figure 9. The achievable data rates of the different algorithms.
Figure 10. Comparison of the variation of the standard deviation of the residual energy of the nodes.
Figure 11. Survival time comparison.
Figure 12. Evaluation on network with 15 nodes.
Figure 13. Evaluation on network with 40 nodes.
Figure 14. Evaluation of the proposed GDR agent in a dataset with real-world topology.
Table 1. Simulation parameters.
Parameter               | Value
Number of initial nodes | 50
Initial energy of node  | 50 J
Cell radius             | 100 m
System loss             | 1
Wavelength              | 0.1224 m
Monitoring area         | 300 × 300 m²
Node residual energy    | Poisson distributed with λ = 25
Table 2. A comparison of different algorithms.
                 | SP     | CNN+DRL   | GMM | AutoGNN [14] | GDR
Neural networks  | GNN    | CNN       |     | GNN          | GNN
DRL framework    |        |           |     |              |
Convergence time | 6000 s | ⩾60,000 s |     | 48,000 s     | 25,000 s
Optimal solution |        |           |     |              |
Scalability      |        |           |     |              |