Article

Reinforcement Learning Based Dual-UAV Trajectory Optimization for Secure Communication

1 Jiangsu Key Laboratory of Power Transmission & Distribution Equipment Technology, Hohai University, Changzhou 213022, China
2 College of Internet of Things Engineering, Hohai University, Changzhou 213022, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(9), 2008; https://doi.org/10.3390/electronics12092008
Submission received: 31 March 2023 / Revised: 24 April 2023 / Accepted: 25 April 2023 / Published: 26 April 2023
(This article belongs to the Special Issue UAV Communications)

Abstract

Unmanned aerial vehicles (UAVs) can serve as aerial base stations for ground users due to their flexibility, low cost, and other characteristics. However, because of the high flight position of UAVs, the air-to-ground (ATG) channels are usually dominated by line-of-sight (LoS) propagation, which can easily be eavesdropped on by multiple eavesdroppers. This poses a challenge to secure communication between UAVs and ground users. In this paper, we study UAV-aided secure communication in an urban scenario where a legitimate UAV Alice transmits confidential information to a legitimate ground user Bob in the presence of several surrounding eavesdroppers, while a UAV Jammer sends artificial noise to interfere with the eavesdroppers. We aim to maximize the physical layer secrecy rate of the system by jointly optimizing the trajectories of the UAVs and their transmitting power. Considering the time-varying characteristics of the channels, this problem is modeled as a Markov decision process (MDP). An improved algorithm based on double-DQN is proposed to solve this MDP problem. Simulation results show that the proposed algorithm converges quickly under different environments, and that the UAV transmitter and UAV jammer can correctly find the optimal locations to maximize the information secrecy rate. The results also show that the double-DQN (DDQN) based algorithm outperforms Q-learning and the deep Q-learning network (DQN).

1. Introduction

Due to their flexibility and low cost, UAVs are promising candidates for providing reliable data transmission services as mobile nodes in various scenarios of the next-generation (6G) network [1]. For example, UAVs can serve as communication nodes in intelligent city transportation systems for traffic management and monitoring. In emergencies that partially block or paralyze communication systems, UAVs can be deployed as mobile aerial communication nodes in the affected areas to help ground base stations provide communication [2,3]. Compared with conventional communication, which suffers from shadowing and multi-path fading, the UAV channel mainly experiences LoS propagation, which allows the UAV to provide better communication quality and larger coverage with lower energy consumption [4]. However, due to the broadcast nature of the wireless channel, the privacy and security of UAV communication are challenging [5,6]. Although traditional key encryption techniques can improve communication security, they are computationally complex and require additional resource overhead for key management. Different from traditional methods, physical layer security (PLS) uses the secrecy rate as an evaluation criterion and is considered a promising technology. Currently, the main research directions of UAV-assisted physical layer security include mmWave communication, trajectory design, resource allocation, and others [5,7,8]. Trajectory design, one of the most popular research directions, has achieved some meaningful results [9,10,11,12,13]. Trajectory design is an optimization problem that can be solved with mathematical tools such as convex optimization. However, the optimization problem is usually non-convex, which makes it difficult to solve directly. Therefore, the original optimization problem is often converted into a convex optimization problem or several convex subproblems. For example, Zhang et al. [9] proposed a model in which a UAV serves as a communication node (Alice) to transfer data to legitimate users on the ground. The aim of the model is to maximize the average secrecy rate by jointly optimizing the UAV's trajectory and the transmit power of the legitimate transmitter. They then used block coordinate descent and successive convex optimization methods to solve the problem. Cai et al. [10] likewise divided the non-convex optimization problem into two non-convex subproblems and then used the successive convex approximation (SCA) [14] algorithm and block coordinate descent to obtain a good approximate solution.
However, most conventional optimization methods can only achieve a locally optimal result. When the number of optimization variables of the UAV increases and the environment becomes more complex, the non-convex problem becomes harder to solve and the algorithm may not converge [4].
With the rapid development of artificial intelligence, techniques such as deep learning (DL) and reinforcement learning (RL) have proven effective in solving high-complexity optimization problems, such as V2V communications, UAV trajectory optimization, and resource allocation, which are difficult to solve with traditional methods [15,16,17,18,19,20,21]. In addition, RL is attracting more and more interest from scholars because it can solve complex optimization problems with a simplified model and does not require training samples as deep learning does. Zhang [15] successfully used Q-learning to optimize the trajectory of a UAV in a scenario with multiple users and one eavesdropper. Zhang et al. [17] solved the trajectory optimization problem by modeling it as a constrained Markov decision process (CMDP) using the safe-DQN model. Fu et al. [16] proposed the curiosity-driven DQN algorithm C-DQN, which converges faster, and the simulation results indicated that the system is more stable at the trajectory corners. In addition, for the trajectory optimization problem under multi-UAV cooperation, Zhang et al. [18] proposed a continuous action attention MADDPG (CAA-MADDPG) model that adopts centralized training and distributed execution, and the algorithm achieved good results in simulation.
For the optimization problem of physical layer security, it is important to maximize the information secrecy rate. At the same time, the energy consumption of UAVs must be considered because of their limited on-board energy. In addition, with multiple eavesdroppers on the ground, it may be difficult for a single transmitting UAV to guarantee secure communication in the system, so the help of a jamming UAV can effectively address this problem. Therefore, in this paper, we consider a multi-agent model. Existing multi-agent algorithms not only require each agent to maintain its own neural network model but also require an airborne computing platform to train the critic neural network with the inputs of all agents.
A system with a small number of UAVs does not need such a complex architecture, given the high cost of an airborne computing platform. Therefore, we propose an optimization algorithm for the case of a single transmitter and a single jammer. The algorithm is based on double-DQN [22]. For ease of illustration, this paper adopts a model in which two UAVs work in cooperation: one UAV (Alice) transmits information and the other UAV (Jammer) sends jamming signals. In this model, each UAV is an agent that maintains its own DDQN network and executes actions based on its observations and the information transmitted from the other UAV. The results indicate that the proposed algorithm has good scalability and performs well with different numbers of eavesdroppers (Eves). Our goal is to maximize the information secrecy rate by optimizing both the UAV trajectories and the UAV transmit powers. For the reader's convenience, the main notations used in the paper are summarized in Table 1.
The rest of this paper is organized as follows. Section 2 introduces the system model. In Section 3.1 and Section 3.2, we present the problem formulation and model the optimization problem as an MDP. In Section 3.3, we present the DDQN-based approach for multi-UAV networks. Section 4 discusses the simulation results. Finally, the paper concludes in Section 5.

2. System Model

2.1. Network Model

As shown in Figure 1, we focus on UAV-aided secure communication in an urban scenario. In this model, the UAV Alice serves as a mobile base station that provides a reliable data transmission service to the legitimate user Bob, under the assumption that there are E eavesdroppers (Eves) on the ground that eavesdrop on the information between Alice and Bob. The UAV Jammer sends artificial noise to confuse the Eves and can move as required. We suppose there are one or more mobile eavesdroppers and several stationary eavesdroppers in the system. There are two types of eavesdroppers: active and passive. The former actively sends signals to attack the legitimate user and can therefore be easily detected [23]. The latter does not send signals and only eavesdrops; its location can be detected from the local oscillator power leaked from its radio frequency (RF) front end [24]. Therefore, in this paper, we assume the UAVs can detect the positions of Bob and all Eves, and that the UAVs can communicate with each other to exchange this position information. The UAVs can thus acquire the position information of Bob and all Eves, which guarantees that the RL algorithm can learn the optimal positions. Alice and Jammer move in two separate two-dimensional planes in the air.

2.2. Mobility Model

The total flight time T is divided into τ time slots indexed by t, t = 1, 2, …, τ. In the three-dimensional coordinate system, the coordinates of Alice and Jammer at the beginning of time slot t are denoted as $s_A(t) = [x_A(t), y_A(t), z_A(t)]$ and $s_J(t) = [x_J(t), y_J(t), z_J(t)]$; $s_e(t) = [x_e(t), y_e(t), z_e(t)]$, $e \in \{1, 2, \ldots, E\}$, denotes the coordinates of Eve e. The coordinates of the stationary Bob are denoted by $s_b = [x_b, y_b, z_b]$.

2.3. Transmission Model

The channel between the UAV and the user on the ground can be regarded as an ATG channel. As shown in Figure 2, the transmission path loss suffered by the ATG channel consists of two parts: one part is the free space path loss between the UAV and the receiver on the ground, and the other part is the excessive path loss caused by the buildings and trees in the city. Therefore, the path loss of ATG per unit time slot can be denoted as [25].
$$ PL_\xi(t) = FSPL(t) + \eta_\xi(t). \tag{1} $$
In (1), $FSPL(t)$ denotes the free space path loss, $FSPL(t) = 20\log d(t) + 20\log f + 20\log(4\pi/c)$, where $d(t) = \sqrt{h(t)^2 + r(t)^2}$, $r(t)$ is the horizontal distance between the UAV and the ground receiver, $h(t)$ is the height of the UAV above the ground, $f$ denotes the carrier frequency, $c$ is the speed of light, $\eta_\xi$ is the excessive path loss, and $\xi \in \{\mathrm{NLoS}, \mathrm{LoS}\}$ refers to the propagation group. Assuming that the transmitting and receiving antennas are isotropic, the average channel path loss can be written as
$$ \overline{PL}(t) = P_{LoS}(t) \times PL_{LoS}(t) + P_{NLoS}(t) \times PL_{NLoS}(t), \tag{2} $$
where $P_{LoS}$ and $P_{NLoS}$ are the probabilities of LoS and NLoS, respectively, with $P_{NLoS} = 1 - P_{LoS}$. $P_{LoS}$ can be approximated by a simple sigmoid function [25]:
$$ P_{LoS}(t) = \frac{1}{1 + a \exp\!\left(-b(\theta - a)\right)}. \tag{3} $$
In (3), a and b are constants that depend on the specific environment (e.g., urban, suburban, or rural), and the value of θ can be calculated by
$$ \theta = \frac{180}{\pi} \arctan\!\left(\frac{z(t)}{\sqrt{\left(x(t)-x_b(t)\right)^2 + \left(y(t)-y_b(t)\right)^2}}\right). \tag{4} $$
Supposing the channel path loss between Alice and Bob is $PL_{A,B}(t)$, and the channel path losses between Alice and Eve e and between Jammer and Eve e are $PL_{A,e}(t)$ and $PL_{J,e}(t)$, $e \in \{1, 2, \ldots, E\}$, the channel path losses can be obtained by combining (1) and (2):
$$ PL_{A,B}(t) = 20\log\frac{4\pi f d_{A,B}}{c} + \eta_{LoS} P_{LoS} + \eta_{NLoS} P_{NLoS}, \tag{5} $$
$$ PL_{A(J),e}(t) = 20\log\frac{4\pi f d_{A(J),e}}{c} + \eta_{LoS} P_{LoS} + \eta_{NLoS} P_{NLoS}, \tag{6} $$
$$ PL_{J,A}(t) = 20\log\frac{4\pi f d_{J,A}}{c} + \eta_{LoS} P_{LoS} + \eta_{NLoS} P_{NLoS}. \tag{7} $$
In (5)–(7), $\eta_{LoS}$ and $\eta_{NLoS}$ are constants set according to the environment [25]; the specific values are given in Section 4. We assume that the instantaneous transmitting powers of Alice and Jammer are $p_A(t)$ and $p_J(t)$, and that the information rates of the Alice-to-Bob and Alice-to-Eve links are $R_{A,B}(t)$ and $R_{A,e}(t)$, $e \in \{1, 2, \ldots, E\}$. The achievable rates of the channels can then be given by
$$ R_{A,B}(t) = B\log_2\!\left(1 + \frac{p_A(t)/PL_{A,B}(t)}{N_0 + p_J(t)/PL_{J,A}(t)}\right), \tag{8} $$
$$ R_{A,e}(t) = B\log_2\!\left(1 + \frac{p_A(t)/PL_{A,e}(t)}{N_0 + p_J(t)/PL_{J,e}(t)}\right), \tag{9} $$
where $N_0$ is the power of the natural Gaussian noise and $B$ is the channel bandwidth. According to [26], the secrecy rate from Alice to Bob is given by the difference between the achievable rate of the legitimate channel and that of the wiretap channel:
$$ R_c(t) = \left[\frac{1}{E}\sum_{e=1}^{E}\left(R_{A,B}(t) - R_{A,e}(t)\right)\right]^{+}, \tag{10} $$
where $[x]^{+} \triangleq \max\{0, x\}$.
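To make the channel model concrete, the short Python sketch below evaluates the average ATG path loss of Equations (1)–(7) and the per-slot secrecy rate of Equations (8)–(10) for given node positions. The helper names and the explicit dBm-to-linear conversions are illustrative assumptions of our own; the environment constants follow the values used later in Section 4.

```python
import numpy as np

C_LIGHT = 3e8                     # speed of light (m/s)
FREQ = 5e9                        # carrier frequency (Section 4: 5 GHz)
A_ENV, B_ENV = 55.0, 5.0          # environment constants a, b (Section 4)
ETA_LOS, ETA_NLOS = 1.0, 20.0     # excessive path loss in dB (Section 4)
N0_DBM = -97.0                    # natural Gaussian noise power (Section 4)
BANDWIDTH = 1.0                   # normalized bandwidth, so rates are in bps/Hz

def atg_path_loss_db(tx, rx):
    """Average ATG path loss in dB between a node tx and a node rx, Eqs. (1)-(7)."""
    tx, rx = np.asarray(tx, float), np.asarray(rx, float)
    r = np.linalg.norm(tx[:2] - rx[:2])                         # horizontal distance r(t)
    h = abs(tx[2] - rx[2])                                      # height difference h(t)
    d = np.hypot(r, h)                                          # 3-D distance d(t)
    fspl = 20*np.log10(d) + 20*np.log10(FREQ) + 20*np.log10(4*np.pi/C_LIGHT)
    theta = np.degrees(np.arctan2(h, r))                        # elevation angle, Eq. (4)
    p_los = 1.0 / (1.0 + A_ENV*np.exp(-B_ENV*(theta - A_ENV)))  # LoS probability, Eq. (3)
    return fspl + ETA_LOS*p_los + ETA_NLOS*(1.0 - p_los)        # Eqs. (2), (5)-(7)

def db_to_lin(x_db):
    """Convert a dB/dBm value to linear scale."""
    return 10.0**(x_db / 10.0)

def secrecy_rate(s_A, s_J, s_b, eves, pA_dbm=10.0, pJ_dbm=20.0):
    """Per-slot secrecy rate R_c(t) of Eq. (10); eves is a list of Eve coordinates."""
    n0, pA, pJ = db_to_lin(N0_DBM), db_to_lin(pA_dbm), db_to_lin(pJ_dbm)
    pl_ab = db_to_lin(atg_path_loss_db(s_A, s_b))
    pl_ja = db_to_lin(atg_path_loss_db(s_J, s_A))               # Jammer-Alice path loss, Eq. (7)
    r_ab = BANDWIDTH*np.log2(1 + (pA/pl_ab) / (n0 + pJ/pl_ja))  # legitimate rate, Eq. (8)
    r_ae = [BANDWIDTH*np.log2(1 + (pA/db_to_lin(atg_path_loss_db(s_A, s_e)))
                              / (n0 + pJ/db_to_lin(atg_path_loss_db(s_J, s_e))))
            for s_e in eves]                                    # wiretap rates, Eq. (9)
    return max(0.0, np.mean([r_ab - r for r in r_ae]))          # Eq. (10)

# Example call (coordinates in metres, UAV heights from Section 4):
# secrecy_rate((250, 250, 15), (250, 260, 20), (280, 250, 0), [(100, 210, 0)])
```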

3. System Problem Formulation and Markov Decision Process

3.1. Problem Formulation

The goal of the model is to optimize the trajectories and transmitting powers of the UAVs within a limited space so as to maximize the information secrecy rate. The optimization problem can therefore be described as
$$ \max_{S_A, S_J, p_A, p_J} \; R_c(t) \tag{11} $$
$$ \text{s.t.} \quad X_{min} \le x_A(t), x_J(t) \le X_{max}, \tag{11a} $$
$$ Y_{min} \le y_A(t), y_J(t) \le Y_{max}, \tag{11b} $$
$$ Z_{min} \le z_A(t), z_J(t) \le Z_{max}, \tag{11c} $$
$$ P_A^{min} \le p_A(t) \le P_A^{max}, \tag{11d} $$
$$ P_J^{min} \le p_J(t) \le P_J^{max}, \tag{11e} $$
where (11a)–(11c) give the boundaries of the limited area for UAV movement, and (11d) and (11e) give constraints on the transmission power of Alice and Jammer.

3.2. Markov Decision Process

In a multi-UAV environment, the decisions of each UAV are influenced by the other UAVs. For each UAV, the optimization problem can be modeled as a Markov decision process (MDP). Therefore, the multi-UAV optimization problem can be regarded as an extension of the MDP and can be formalized by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{R}$ is the reward space, and $\mathcal{P}$ is the state-transition probability. In each time slot, agent i has state $s_i(t) \in \mathcal{S}$ and takes action $a_i(t) \in \mathcal{A}$ based on a certain policy. A new state $s_i(t+1)$ is then generated with transition probability $\mathcal{P}$. The reward that UAV i receives from the environment is defined as $r_i(t) \in \mathcal{R}$. The specific meanings of $\mathcal{S}$, $\mathcal{A}$, $\mathcal{P}$, and $\mathcal{R}$ are as follows.
  • Each UAV can be regarded as an agent, and an independent model is used to solve the MDP problem for each UAV. However, because each agent can be influenced by the other agents, from its own point of view the other agents make the environment non-stationary. Therefore, the UAVs need to communicate during flight to exchange position information; each UAV then takes an action and receives an environmental reward according to the positions of all agents. In this paper, we only consider two agents, named Alice and Jammer. The complete model structure is shown in Figure 3.
  • $\mathcal{S}$ is the state space, which includes the locations of each UAV, Bob, and the Eves. The coordinates of Alice and Jammer are $s_A(t) = [x_A(t), y_A(t), z_A(t)]$ and $s_J(t) = [x_J(t), y_J(t), z_J(t)]$; the coordinates of Bob are $s_b(t) = [x_b(t), y_b(t), z_b(t)]$; and the coordinates of Eve e are $s_e(t) = [x_e(t), y_e(t), z_e(t)]$, $e \in \{1, 2, \ldots, E\}$.
  • $\mathcal{A}$ is the action space, which contains the moving speed of the UAV at each moment and its transmitting power. To reduce the complexity of the network model, we discretize the speed into a set $V = \{v_1, v_2, \ldots, v_I\}$. The speeds of Alice and Jammer in each time slot are denoted as $V_A^i(t)$ and $V_J^i(t)$, and the transmitting powers are $p_A(t)$ and $p_J(t)$, with $p_{min} \le p_A(t) \le p_{max}$ and $p_{min} \le p_J(t) \le p_{max}$.
  • P denotes the state-transition probability. As it is difficult to predict P , we adopt the model-free RL method to address the above MDP problem.
  • $\mathcal{R}$ is the reward space. For all UAVs, the reward obtained in each time slot consists of the scenario penalty, the energy penalty, the information secrecy rate reward, and the distance reward. The information secrecy rate reward is the same for each agent. For the scenario penalty, because of the complex arrangement of tall buildings and vegetation in the city, the free flight space of the UAV is limited, so an environmental penalty is imposed when the UAV attempts to fly out of the limited area. The coordinates of UAV i are denoted as $[x_i(t), y_i(t), z_i(t)]$, $i \in \{A, J\}$. The scenario penalty $U_i$ is then
    $$ U_i = \begin{cases} -C, & x_i < X_{min} \ \text{or} \ x_i > X_{max} \\ -C, & y_i < Y_{min} \ \text{or} \ y_i > Y_{max} \\ 0, & \text{otherwise} \end{cases} \tag{12} $$
    where C is a positive constant. Since the battery capacity of the UAV is finite, it is necessary to consider the energy cost of the UAV in the absence of wireless charging. The energy consumption penalties of Alice and Jammer are defined as $E_A$ and $E_J$, and the transmitting powers of Alice and Jammer in the current time slot are denoted as $p_A$ and $p_J$. The energy penalties are then given by
    $$ E_A = -k_{E_A} p_A, \tag{13} $$
    $$ E_J = -k_{E_J} p_J, \tag{14} $$
    where $k_{E_A}$ and $k_{E_J}$ are hyperparameters. In the initial state of the model, Alice may be far away from Bob, which makes the information secrecy rate very small. Such a small reward prevents the model from learning the features of the environment effectively. To avoid this, we introduce a distance reward and define the distance rewards of Alice and Jammer as $D_A$ and $D_J$. For the Jammer, we use the average distance to all Eves, as shown in (16). The distance reward, based on the Euclidean distance from the UAV to its target, can be expressed as
    $$ D_A = -\lambda_{D_A} \left\| s_A - s_b \right\|_2, \tag{15} $$
    $$ D_J = -\frac{\lambda_{D_J}}{E} \sum_{e=1}^{E} \left\| s_J - s_e \right\|_2, \tag{16} $$
    where $\lambda_{D_A}$ and $\lambda_{D_J}$ are the distance reward factors of Alice and Jammer. Thus, the total reward for each UAV is given by
    $$ r_A = R_c + U_A + E_A + D_A, \tag{17} $$
    $$ r_J = R_c + U_J + E_J + D_J, \tag{18} $$
    where $r_A$ and $r_J$ are the rewards of Alice and Jammer, respectively. A minimal code sketch of this reward computation is given after this list.
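As a concrete illustration of the reward terms above, the following Python sketch assembles the per-slot reward of Equations (12)–(18) for one UAV. The sign conventions and the constants C and λ are illustrative assumptions consistent with the description of the penalties and rewards; only $k_E = 0.05$ is taken from Section 4.

```python
import numpy as np

def scenario_penalty(pos, x_lim, y_lim, C=10.0):
    """Eq. (12): penalty -C if the UAV tries to leave the allowed area (C is illustrative)."""
    x, y = pos[0], pos[1]
    if x < x_lim[0] or x > x_lim[1] or y < y_lim[0] or y > y_lim[1]:
        return -C
    return 0.0

def energy_penalty(p_tx, k_E=0.05):
    """Eqs. (13)-(14): penalty proportional to the transmit power (k_E = 0.05 as in Section 4)."""
    return -k_E * p_tx

def distance_reward(pos, targets, lam=0.01):
    """Eqs. (15)-(16): negative (average) Euclidean distance to the target(s); lam is illustrative."""
    targets = np.atleast_2d(np.asarray(targets, dtype=float))
    dists = np.linalg.norm(targets - np.asarray(pos, dtype=float), axis=1)
    return -lam * dists.mean()

def total_reward(secrecy, pos, p_tx, targets, x_lim=(0, 500), y_lim=(0, 500)):
    """Eqs. (17)-(18): secrecy-rate reward plus the scenario, energy, and distance terms."""
    return (secrecy
            + scenario_penalty(pos, x_lim, y_lim)
            + energy_penalty(p_tx)
            + distance_reward(pos, targets))
```

Alice would call total_reward with Bob's position as the single target (Equation (15)), while Jammer would pass the positions of all Eves (Equation (16)).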

3.3. Double DQN

The trajectory optimization problem of each UAV can be modeled as a Markov process given knowledge of the other UAVs' changing coordinates. We use a separate DDQN network for each UAV and assume that each UAV can acquire the positions of the other UAVs through real-time communication and estimate the coordinates of the Eves through its onboard camera. Before describing the DDQN algorithm, let us begin with the DQN algorithm [27]. The objective of discrete reinforcement learning is to fit a Q table, but as the state space $\mathcal{S}$ and action space $\mathcal{A}$ grow, the Q table suffers from a dimensional explosion problem. Therefore, the main idea of DQN is to use a deep neural network (DNN) $Q(s, a, w_Q)$ to approximate the Q table, where s is the state and a is the action. With the reward $r_Q$ from the environment, the target Q value $y_Q$ can be expressed as [27]
$$ y_Q = r_Q + \gamma \max_{a'} Q(s', a', w_Q), \tag{19} $$
where $\gamma \in [0, 1]$ is the discount factor, and $s'$ and $a'$ are the state and action in the next time slot. DQN maintains an experience replay buffer $M_Q$ from which samples are drawn and uses the mean square error as the loss function $Loss(w)_Q$:
$$ Loss(w)_Q = \mathbb{E}\left[\left(y_Q - Q(s, a, w_Q)\right)^2\right]. \tag{20} $$
However, using $Q(s, a, w_Q)$ directly to calculate the target Q value $y_Q$ causes $Loss(w)_Q$ to change at every time slot, in which case $Q(s, a, w_Q)$ may diverge if the environment returns a large reward. At the same time, the max operator in (19) makes the target Q value overestimated [22].
Like DQN, double DQN also uses a DNN to approximate the Q function. However, DDQN uses a target Q network to reduce overestimation, which enhances the stability of the model. Like DQN, DDQN also adopts the experience replay technique to use data efficiently. For each UAV, the prediction (online) Q network is denoted as $Q(s, a, w)$ and the target Q network is denoted as $Q'(s, a, w')$; s is the joint state of the UAV, a is the selected action, and r is the reward after executing the action, given by
$$ r = R_c + U + E + D. \tag{21} $$
After the UAVs communicate to acquire the new joint position information $s'$ in the next time slot, the target Q value y can be expressed as
$$ y = \begin{cases} r, & \text{terminal} \\ r + \gamma\, Q'\!\left(s', \arg\max_{a'} Q(s', a', w), w'\right), & \text{otherwise} \end{cases} \tag{22} $$
where $\gamma \in [0, 1]$ is the discount factor and $a'$ is the action in the next time slot. As with DQN, DDQN uses the mean square error as the loss function $Loss(w)$, with samples drawn from the experience replay buffer $M$:
$$ Loss(w) = \mathbb{E}\left[\left(y - Q(s, a, w)\right)^2\right]. \tag{23} $$
Each UAV updates the parameters $w$ of $Q(s, a, w)$ by gradient back-propagation:
$$ w \leftarrow w + \alpha\, \nabla_w Q(s, a, w)\left(y - Q(s, a, w)\right), \tag{24} $$
where $\alpha$ is the learning rate, $0 < \alpha < 1$. When the number of updates of $Q(s, a, w)$ reaches a certain value, $w$ is copied to $w'$ in $Q'(s, a, w')$. The Q and Q′ networks of different agents are distinguished by the subscript i, where $i \in \{A, J\}$. The procedure of the algorithm is illustrated in Algorithm 1.
Algorithm 1 A DDQN-based optimization algorithm for dual UAVs.
Input: episodes K, total flight time T, vector dimension n, action space $\mathcal{A}$, action noise N, batch size m, network parameter update frequency φ, hyperparameters α, γ, and ε.
Output: network parameters w, w′.
1:  Initialize w in $Q_i(s, a, w)$, $i \in \{A, J\}$, then w′ ← w.
2:  Initialize the experience replay buffer M with capacity m.
3:  for k = 1 to K do
4:    for t = 1 to T do
5:      Initialize s to the first state of the current agent;
6:      Select action a from $Q_i(s, a, w)$ using the ε-greedy method;
7:      UAV i executes action a in state s;
8:      Each UAV transmits its state to the others so that each UAV obtains the new joint state s′, the reward r by (21), and the termination flag is_end;
9:      Each UAV stores (s, a, r, s′, is_end) into its own experience replay buffer M; s ← s′;
10:     Sample m items from the replay buffer M and calculate the y values by (22);
11:     Update the parameters of $Q_i(s, a, w)$ by (23) and (24);
12:     if (t mod φ = 0) then
13:       Update the $Q'_i(s, a, w')$ parameters: w′ ← w;
14:     end if
15:     if (is_end is true) then
16:       break;
17:     end if
18:   end for
19: end for
20: return w, w′
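To make the update in Algorithm 1 concrete, the following PyTorch sketch implements one DDQN training step for a single agent, corresponding to lines 10–13 of Algorithm 1 and Equations (22)–(24). The network architecture, the tensor layout of the replay batch, and the hyperparameter values are illustrative assumptions rather than the exact implementation used in Section 4.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small fully connected Q-network mapping a joint state to one Q-value per discrete action."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def ddqn_step(q_net, q_target, optimizer, batch, gamma=0.99):
    """One DDQN update on a sampled mini-batch (s, a, r, s', done), following Eqs. (22)-(24)."""
    s, a, r, s_next, done = batch      # shapes: [m, d], [m], [m], [m, d], [m]; done in {0, 1}
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)      # Q(s, a, w)
    with torch.no_grad():
        # Select the next action with the online network, evaluate it with the target network, Eq. (22).
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        q_next = q_target(s_next).gather(1, a_star).squeeze(1)
        y = r + gamma * (1.0 - done) * q_next                        # y = r at terminal transitions
    loss = nn.functional.mse_loss(q_sa, y)                           # Eq. (23)
    optimizer.zero_grad()
    loss.backward()                                                  # gradient step, Eq. (24)
    optimizer.step()
    return loss.item()

def sync_target(q_net, q_target):
    """Copy w -> w' every phi updates (line 13 of Algorithm 1)."""
    q_target.load_state_dict(q_net.state_dict())
```

In the full dual-UAV setting, Alice and Jammer each maintain their own pair of networks and call the update step independently on their own replay buffers, as described above.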

4. Simulation Results and Analysis

In this section, we analyze the performance of the DDQN-based algorithm and discuss the simulation results under different situations. The algorithm is implemented with the PyTorch framework, and the model is trained on a server with a GTX 1080 Ti. We suppose that Alice, Bob, Jammer, and the Eves are randomly distributed in a 500 m × 500 m environment. One or more Eves move along the city roads around Bob, and the remaining Eves are on the sides of the roads and in the buildings. Because the model quantizes the speed of the UAV, speed can be converted into step length per time slot. The step length of the UAV is a hyperparameter that affects the convergence performance of the algorithm, which is discussed below. To avoid collisions between UAVs, which would complicate the model, the flight heights of Alice and Jammer are fixed at 15 m and 20 m, respectively, and the UAVs can move freely within the 500 m × 500 m area. The system carrier frequency is 5 GHz; the transmitting powers of Alice and Jammer are $p_A$ = 10 dBm and $p_J$ = 20 dBm; the power of the natural Gaussian noise in the channel is $N_0$ = −97 dBm; the coefficients in (13) and (14) are set to $k_{E_A}$ = 0.05 and $k_{E_J}$ = 0.05; and the learning rate is α = 0.001. In (5) and (6), $(\eta_{LoS}, \eta_{NLoS})$ is set to (1, 20). Using the formulas in [25], we calculate a = 55 and b = 5. The position of Bob is (280, 250, 0), near the center of the two-dimensional plane of the area. The step length of Eve is set to 1 m.
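For reference, the parameters listed above can be collected into a single configuration; the dictionary below simply mirrors the stated values (the field names are our own and are not taken from the authors' code).

```python
SIM_CONFIG = {
    "area_m": (500, 500),        # 500 m x 500 m environment
    "alt_alice_m": 15,           # fixed flight height of Alice
    "alt_jammer_m": 20,          # fixed flight height of Jammer
    "carrier_freq_hz": 5e9,      # 5 GHz
    "p_alice_dbm": 10,
    "p_jammer_dbm": 20,
    "noise_dbm": -97,
    "k_EA": 0.05,
    "k_EJ": 0.05,
    "learning_rate": 1e-3,
    "eta_los_db": 1,
    "eta_nlos_db": 20,
    "a": 55,                     # LoS probability constants derived from [25]
    "b": 5,
    "bob_position": (280, 250, 0),
    "eve_step_m": 1,
}
```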
Figure 4a,b show the reward versus episodes with a 90% confidence level for different initial positions of Alice and Jammer. We set the maximum step length of the UAV to 2 m, and there is only one moving Eve. From Figure 4a, we can see that the normalized average reward increases gradually over the episodes, and the curve fluctuates slightly around a stable value after 200 episodes for all four initial positions of the UAV. This indicates that the algorithm can reach convergence for different initial positions. It can also be seen that the closer Alice's initial position is to Bob, the fewer episodes are needed to reach convergence. The convergence speed of the model can be accelerated by appropriately increasing the distance reward factors in (15) and (16), which also reduces the fluctuation amplitude of the model in the stable state.
In Figure 4b, the trends of the four convergence curves are approximately the same, which indicates that different Jammer positions do not have much influence on the convergence performance of the model, and the model reaches convergence before 150 episodes.
Figure 5a illustrates the rewards versus episodes for different step lengths. For step lengths greater than 1 m and less than 8 m, the convergence speed is almost the same for all four groups of simulations, and the model converges at around 170 episodes. However, as the step length increases, the maximum reward decreases and the curves become less smooth. When the step length is 9 m, the model reaches a steady state with a lower maximum reward. When the step length is 0.5 m, the convergence speed of the system drops dramatically and the system converges only after 600 episodes, which is poor compared with the other step lengths. Since the model assumes that Eve moves in steps of 1 m, a too-short step length prevents the UAVs from reaching the best positions promptly. Therefore, the step length determines the stability and convergence speed of the system, and we recommend setting it between 1 m and 5 m. In Figure 5b, we compare the convergence of DDQN with the DQN and Q-learning algorithms. Because the actions in the scenario are discrete, we use value-based RL algorithms as the baseline. As can be seen from Figure 5b, the DDQN-based algorithm reaches stability before 100 episodes, whereas the Q-learning algorithm reaches stability after 150 episodes and the DQN algorithm after 200 episodes. The reward of the DDQN-based algorithm is always better than that of the other discrete RL algorithms, which indicates that the DDQN-based algorithm has good convergence performance and stability.
Consider a scenario where one Eve is moving; Figure 6a,b show the motion trajectories of Alice and Jammer with initial positions [(250, 250), (150, 150)] and [(200, 200), (400, 150)]. The red curve denotes Alice's trajectory, the black curve denotes Jammer's trajectory, and the green x denotes Bob's position. Suppose Eve moves along the street around Bob from the position (100, 210); the positions of the other Eves are shown in Figure 6a. At the beginning, Jammer and Alice approach each other quickly. After they meet, they move to the area near Bob. When Eve is close to Bob, Alice hovers around Bob on the side away from Eve to reduce Eve's achievable rate, which also increases the information secrecy rate of the system. In contrast, Jammer moves only within a 10 m × 10 m area above Bob, because Jammer needs to consider the positions of all Eves. Figure 6c shows the trajectories of the UAVs in the environment with only one moving Eve; Alice's trajectory is approximately the same as in Figure 6a,b, whereas Jammer follows the moving Eve. This illustrates that Jammer's trajectory in a multi-Eve environment is influenced by all Eves. In addition, the reward curves in Figure 6a–c are approximately the same as those in Figure 4a: the system reaches a stable state when Alice and Jammer arrive near Bob, after which Alice and Jammer adjust their positions according to the environment. The results show that different initial locations of Alice and Jammer do not affect the convergence of the algorithm. Figure 6d shows the system secrecy rate curves under different RL algorithms. It can be seen that before 200 episodes, the DDQN-based algorithm has already achieved the best system secrecy rate, approximately 4.3 bps/Hz, whereas DQN and Q-learning have not yet converged and only obtain a secrecy rate of about 2.5 bps/Hz. The system secrecy rate under the DQN and Q-learning algorithms does not reach 4.3 bps/Hz until about 300 episodes. From Figure 6d, the DDQN-based algorithm converges twice as fast as the other two algorithms, and if the number of UAVs is increased, the gap in convergence speed is expected to widen further.
Consider the scenarios with two or three moving Eves, as shown in Figure 7a,b. Figure 7a shows the motion trajectories of Alice and Jammer when two Eves move around Bob along the road. Although the number of moving Eves increases, Alice and Jammer still approach each other quickly at the beginning without getting too close to the Eves, the same as the UAV trajectories with a single Eve. However, after the system stabilizes, Jammer moves in a near-elliptical trajectory near Bob, mainly following the Eve that is closest to Bob. This phenomenon indicates that more moving Eves cause Jammer to switch its following target more frequently; thus, we introduce the average distance into the reward function to smooth the reward value in (21). Figure 7b adds one more moving Eve compared with Figure 7a. In Figure 7b, Alice's trajectory goes around the moving Eve to the left and keeps a distance from it, while Jammer's trajectory is similar to that in Figure 7a. However, Jammer is more conservative in moving around Bob and hovers in a smaller area, although its trajectory is generally similar to that in Figure 7a.

5. Conclusions

In an environment with moving Eves, we model the problem as a Markov decision process and use a DDQN-based algorithm to solve it. The main objective of this study is to optimize the UAV trajectories and transmit powers to maximize the information secrecy rate of the system. The convergence of the model is discussed for UAVs and Eves with different step lengths and initial positions. The simulation results show that the model has excellent convergence performance and a rapid convergence speed. We also compare the core double-DQN algorithm with several discrete-action RL algorithms; the results show that double-DQN has better stability and a faster convergence speed.
A future challenge is extending the algorithm to more agents. A multi-agent environment results in a larger state space, which makes the computational complexity grow exponentially. At the same time, data transmission between agents becomes more complicated as the number of agents increases, so the data transmission assignment is another issue to be considered.

Author Contributions

Conceptualization, Z.D.; data curation, Z.Q.; formal analysis, Z.Q.; funding acquisition, Z.D.; investigation, C.C.; methodology, Z.D. and Z.Q.; writing—original draft preparation, Z.Q.; writing—review and editing, Z.D. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Open Project of the Jiangsu Key Laboratory of Power Transmission & Distribution Equipment Technology under Grant No. 2022JSSPD03 and the Fundamental Research Plan in Changzhou under Grant No. CJ20220245.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available because detailed code annotations and usage instructions have not yet been completed.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tang, F.; Kawamoto, Y.; Kato, N.; Liu, J. Future intelligent and secure vehicular network toward 6G: Machine-learning approaches. Proc. IEEE 2019, 108, 292–307. [Google Scholar] [CrossRef]
  2. Zhao, N.; Lu, W.; Sheng, M.; Chen, Y.; Tang, J.; Yu, F.R.; Wong, K.K. UAV-assisted emergency networks in disasters. IEEE Wirel. Commun. 2019, 26, 45–51. [Google Scholar] [CrossRef]
  3. Cheng, F.; Zhang, S.; Li, Z.; Chen, Y.; Zhao, N.; Yu, F.R.; Leung, V.C. UAV trajectory optimization for data offloading at the edge of multiple cells. IEEE Trans. Veh. Technol. 2018, 67, 6732–6736. [Google Scholar] [CrossRef]
  4. Zhong, C.; Yao, J.; Xu, J. Secure UAV communication with cooperative jamming and trajectory control. IEEE Commun. Lett. 2018, 23, 286–289. [Google Scholar] [CrossRef]
  5. Sun, X.; Ng, D.W.K.; Ding, Z.; Xu, Y.; Zhong, Z. Physical layer security in UAV systems: Challenges and opportunities. IEEE Wirel. Commun. 2019, 26, 40–47. [Google Scholar] [CrossRef]
  6. Lu, H.; Zhang, H.; Dai, H.; Wu, W.; Wang, B. Proactive eavesdropping in UAV-aided suspicious communication systems. IEEE Trans. Veh. Technol. 2018, 68, 1993–1997. [Google Scholar] [CrossRef]
  7. Xiao, L.; Lu, X.; Xu, T.; Zhuang, W.; Dai, H. Reinforcement learning-based physical-layer authentication for controller area networks. IEEE Trans. Inf. Forensics Secur. 2021, 16, 2535–2547. [Google Scholar] [CrossRef]
  8. Mao, Q.; Hu, F.; Hao, Q. Deep learning for intelligent wireless networks: A comprehensive survey. IEEE Commun. Surv. Tutor. 2018, 20, 2595–2621. [Google Scholar] [CrossRef]
  9. Zhang, G.; Wu, Q.; Cui, M.; Zhang, R. Securing UAV communications via joint trajectory and power control. IEEE Trans. Wirel. Commun. 2019, 18, 1376–1389. [Google Scholar] [CrossRef]
  10. Cai, Y.; Wei, Z.; Li, R.; Ng, D.W.K.; Yuan, J. Joint trajectory and resource allocation design for energy-efficient secure UAV communication systems. IEEE Trans. Commun. 2020, 68, 4536–4553. [Google Scholar] [CrossRef]
  11. Wang, Y.; Chen, L.; Zhou, Y.; Liu, X.; Zhou, F.; Al-Dhahir, N. Resource allocation and trajectory design in UAV-assisted jamming wideband cognitive radio networks. IEEE Trans. Cogn. Commun. Netw. 2020, 7, 635–647. [Google Scholar] [CrossRef]
  12. Li, Y.; Zhang, R.; Zhang, J.; Yang, L. Cooperative jamming via spectrum sharing for secure UAV communications. IEEE Wirel. Commun. Lett. 2019, 9, 326–330. [Google Scholar] [CrossRef]
  13. Zhou, X.; Wu, Q.; Yan, S.; Shu, F.; Li, J. UAV-enabled secure communications: Joint trajectory and transmit power optimization. IEEE Trans. Veh. Technol. 2019, 68, 4069–4073. [Google Scholar] [CrossRef]
  14. Razaviyayn, M. Successive Convex Approximation: Analysis and Applications. Ph.D. Thesis, University of Minnesota, Minneapolis, MN, USA, 2014. [Google Scholar]
  15. Zhang, J. A Q-learning based Method for Secure UAV Communication against Malicious Eavesdropping. In Proceedings of the 2022 14th International Conference on Computer and Automation Engineering (ICCAE), Brisbane, Australia, 25–27 March 2022; pp. 168–172. [Google Scholar] [CrossRef]
  16. Fu, F.; Jiao, Q.; Yu, F.R.; Zhang, Z.; Du, J. Securing UAV-to-vehicle communications: A curiosity-driven deep Q-learning network (C-DQN) approach. In Proceedings of the 2021 IEEE International Conference on Communications Workshops (ICC Workshops), Montreal, QC, Canada, 14–23 June 2021; pp. 1–6. [Google Scholar]
  17. Zhang, Z.; Zhang, Q.; Miao, J.; Yu, F.R.; Fu, F.; Du, J.; Wu, T. Energy-efficient secure video streaming in UAV-enabled wireless networks: A safe-DQN approach. IEEE Trans. Green Commun. Netw. 2021, 5, 1892–1905. [Google Scholar] [CrossRef]
  18. Zhang, Y.; Mou, Z.; Gao, F.; Jiang, J.; Ding, R.; Han, Z. UAV-enabled secure communications by multi-agent deep reinforcement learning. IEEE Trans. Veh. Technol. 2020, 69, 11599–11611. [Google Scholar] [CrossRef]
  19. Ye, H.; Li, G.Y.; Juang, B.H.F. Deep reinforcement learning based resource allocation for V2V communications. IEEE Trans. Veh. Technol. 2019, 68, 3163–3173. [Google Scholar] [CrossRef]
  20. Deng, D.; Li, X.; Menon, V.; Piran, M.J.; Chen, H.; Jan, M.A. Learning-based joint UAV trajectory and power allocation optimization for secure IoT networks. Digital Commun. Netw. 2022, 8, 415–421. [Google Scholar] [CrossRef]
  21. Liu, C.; Zhang, Y.; Niu, G.; Jia, L.; Xiao, L.; Luan, J. Towards reinforcement learning in UAV relay for anti-jamming maritime communications. Digital Commun. Netw. 2022. [Google Scholar] [CrossRef]
  22. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  23. Mukherjee, A.; Swindlehurst, A.L. Jamming games in the MIMO wiretap channel with an active eavesdropper. IEEE Trans. Signal Process. 2012, 61, 82–91. [Google Scholar] [CrossRef]
  24. Mukherjee, A.; Swindlehurst, A.L. Detecting passive eavesdroppers in the MIMO wiretap channel. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 2809–2812. [Google Scholar]
  25. Al-Hourani, A.; Kandeepan, S.; Lardner, S. Optimal LAP altitude for maximum coverage. IEEE Wirel. Commun. Lett. 2014, 3, 569–572. [Google Scholar] [CrossRef]
  26. Bloch, M.; Barros, J.; Rodrigues, M.R.; McLaughlin, S.W. Wireless information-theoretic security. IEEE Trans. Inf. Theory 2008, 54, 2515–2534. [Google Scholar] [CrossRef]
  27. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
Figure 1. UAV-aided communication system model.
Figure 2. Air-to-ground channel propagation in an urban environment.
Figure 3. DDQN system model structure for dual UAVs.
Figure 4. (a,b) Normalized reward versus episodes with a 90% confidence level for different initial positions of Alice and Jammer.
Figure 5. (a) Normalized rewards versus episodes with a 90% confidence level for fixed initial UAV positions and different move steps, and (b) comparison of the performance of different algorithms.
Figure 6. (a,b) Trajectories of Alice and Jammer at different initial positions with a moving Eve. The initial positions of Alice and Jammer in (a,b) are [(250, 250), (150, 150)] and [(200, 200), (400, 150)]. (c) Trajectories of Alice and Jammer with only one moving Eve. (d) Curves of the secrecy rates under different RL algorithms.
Figure 7. (a) Trajectories of the UAVs with two moving Eves; the initial two-dimensional coordinates of the UAVs are [(200, 250), (250, 250)]. (b) Trajectories of the UAVs with three moving Eves; the initial two-dimensional coordinates of the UAVs are [(200, 250), (300, 250)].
Table 1. Part of notations.

Notation | Definition
T | Total flight time
τ | Number of time slots
t | Index of time slots
$PL_\xi(t)$ | Path loss of the ATG channel
$FSPL(t)$ | Free space path loss
$r(t)$ | Horizontal distance between the UAV and the ground receiver
$h(t)$ | Height of the UAV
$f$ | Carrier frequency of the system
$c$ | Speed of light
$\eta_\xi$ | Value of the excessive path loss
$\xi$ | Propagation group
$\overline{PL}(t)$ | Average channel path loss
$P_{LoS}(t)$ | Probability of LoS
$P_{NLoS}(t)$ | Probability of NLoS
$\theta$ | Angle between the UAV and the ground user
$PL_{A,B}(t)$ | Path loss between Alice and Bob
$PL_{A(J),e}(t)$ | Path loss between Alice (Jammer) and Eve
$PL_{J,A}(t)$ | Path loss between Jammer and Alice
$R_{A,B}(t)$ | Data rate of the legitimate channel
$R_{A,e}(t)$ | Data rate of the wiretap channel
$R_c(t)$ | Secrecy rate of the system
$N_0$ | Power of the natural Gaussian noise
$E$ | Number of Eves
$p_A(t)$ | Transmitting power of Alice
$p_J(t)$ | Artificial noise power generated by Jammer
$\mathcal{S}$ | State space of the MDP
$\mathcal{A}$ | Action space of the MDP
$\mathcal{P}$ | State-transition probability of the MDP
$\mathcal{R}$ | Reward space of the MDP
