Article

Multiple UAVs Path Planning Based on Deep Reinforcement Learning in Communication Denial Environment

School of Mechatronical Engineering, Beijing Institute of Technology, 5th South Zhongguancun Street, Beijing 100081, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(2), 405; https://doi.org/10.3390/math11020405
Submission received: 14 December 2022 / Revised: 6 January 2023 / Accepted: 10 January 2023 / Published: 12 January 2023

Abstract
In this paper, we propose a C51-Duel-IP (C51 Dueling DQN with Independent Policy) dynamic destination path-planning algorithm to solve the problem of autonomous navigation and avoidance of multiple Unmanned Aerial Vehicles (UAVs) in a communication denial environment. Our proposed algorithm expresses the Q function output by the Dueling network as a Q distribution, which improves the fitting ability of the Q value. We also extend the single-step temporal difference (TD) update to an N-step temporal difference, which solves the problem of inflexible updates of the single-step temporal difference. More importantly, we use an independent policy to achieve autonomous avoidance and navigation of multiple UAVs without any communication with each other. In the case of communication denial, the independent policy can achieve the consistency of multiple UAVs and avoid greedy behavior. In multiple-UAV dynamic destination scenarios, our work includes path planning when taking off from different initial positions and dynamic path planning when taking off from the same initial position. The hardware-in-the-loop (HITL) experiment results show that our C51-Duel-IP algorithm is much more robust and effective than the original Dueling-IP and DQN-IP algorithms in an urban simulation environment. Our independent policy algorithm has effects similar to those of the shared policy but with the significant advantage of running in a communication denial environment.

1. Introduction

Path planning is one of the most important problems in UAV research: finding an optimal path between a source and a destination. Multi-UAV path planning refers to a multi-objective planning method that extends single-UAV planning to collaborative task completion [1]. UAVs often encounter problems such as signals blocked by buildings in cities and the absence of network communication service in the mountains, resulting in weak communication among UAVs and between UAVs and ground operators. These problems make it impossible for UAVs to collaborate as expected. Therefore, UAV path planning has become a research hotspot and a broad concern for researchers.
Based on the information collected, path planning is divided into two categories: global path planning and local path planning [2]. Global path planning refers to knowing the environmental information in advance to find a collision-free path connecting the origin and the destination before execution. Since the environmental information must be fully known, algorithms for global path-planning problems usually use offline strategies.
Local path planning can handle both globally unknown and partially known environments. Local path planning, also called the reactive method, is used for navigation in unfamiliar environments because it can deal with high environmental uncertainty. Reactive methods are intelligent, efficient, and easy to implement.
In practice, local and global planning methods are usually combined. Global planning is updated at a low frequency, while local planning is updated for every new action the agent takes. A typical example is the Move Base package in the ROS system. The Move Base package provides an implementation of action-based path planning; its global path planning mainly uses the A* or Dijkstra algorithm, which requires a given target point to plan a global path but can only adapt to a static environment. Unpredictable obstacles are better handled by local path-planning algorithms. For example, the DWA (Dynamic Window Approach) algorithm plans the agent's real-time linear and angular velocities in each cycle so that the agent conforms to the globally optimal path as far as possible.
The Deep Reinforcement Learning [3] algorithm combines the perception ability of deep learning [4] with the decision-making ability of reinforcement learning. It can generate adaptive strategies online through the interaction between the agent and the environment and performs well in UAV path planning. The path-planning problem of the UAV is transformed into a Partially Observable Markov Decision Process (POMDP) [5], and a deep neural network is used to fit the Q value. Combined with a greedy strategy, the algorithm supports the path planning and optimization of UAVs in a complex environment. When implementing a common algorithm, researchers often simplify the UAV flight environment as follows:
  • dividing the continuous action space into a limited action range;
  • assuming that environmental information is globally known;
  • simplifying UAV to a particle;
  • using blocks with known positions instead of complicated obstacles.
Most current algorithms configure a central control terminal for all UAVs, which sends a control strategy to all UAVs. However, there is no communication between UAVs and ground stations in the communication denial environment, so each UAV must have its policy.
To solve the problem of dynamic destination path planning and avoidance of multiple UAVs without communication, we propose an end-to-end deep reinforcement learning algorithm with the image as the only input in an unknown 3D scene, based on a simulation environment built with AirSim [6]. There is no control center planning collision avoidance between UAVs; instead, each UAV responds dynamically in real time. We establish a multi-agent method with an independent policy, which can recognize obstacles and plan paths to a dynamic destination without any communication among UAVs.
The main contributions of this paper are as follows:
(1) Based on the Dueling DQN algorithm, we consider the distribution of the Q value and introduce a distributional viewpoint to increase the expressiveness of the Q value. Meanwhile, we extend from TD(0) to TD(n) to solve the problem of inflexible updates.
(2) Independent policies solve multiple UAVs' decision-making problems in communication denial environments. The image acquired by each UAV is passed directly to the onboard processor to generate the policy, without the need for a central control terminal to centrally process observations and assign behavioral policies. In the case of communication denial, our independent policy algorithm can achieve the consistency of multiple UAVs and avoid greedy behavior.
(3) The proposed C51-Duel-IP algorithm is validated on Pixhawk hardware. Unlike algorithms that are only verified by software simulation, our algorithm is also verified by hardware-in-the-loop (HITL) simulation experiments, proving its good portability.
The rest of the paper is organized as follows: Section 2 reviews relevant work in the literature. Our approach is described in Section 3. Section 4 describes the experimental environment and analyzes the experimental results. The conclusion is summarized in Section 5.

2. Background

Classical navigation methods include the artificial potential field method [7], genetic algorithm [8], particle swarm algorithm [9], ant colony algorithm [10], and cuckoo algorithm [11].
The artificial potential field method assumes an imaginary force attracting the robot to the target, making it stay away from the obstacle and reach the target area. In the artificial potential field method, path planning in 3D space is challenging, but the authors of [12] have achieved it.
A Genetic Algorithm (GA) is a computational model that simulates the natural-selection process and the genetic mechanisms of biological evolution; it searches for the optimal solution by simulating the natural evolution process. Its main feature is that it operates directly on structural objects without requiring derivatives or function continuity. It has inherent, implicit parallelism and good global optimization ability. As a probabilistic optimization method, it can automatically acquire and guide the search space without definite rules and adaptively adjust the search direction. A genetic algorithm encodes a population for the specific problem and assigns a fitness value to each member of the population according to the objective function. Individuals are selected based on their fitness values, and the algorithm passes their genes on to the next generation by mating. Mutations maintain population diversity and prevent premature convergence. Finally, the algorithm terminates when the population converges. The authors of [13] discussed navigation problems involving movable obstacles in an uncertain environment. The authors of [14] proposed an online training model to obtain the most suitable chromosome, solving the issue of agents getting stuck.
The particle swarm optimization (PSO) algorithm mimics the behavior of social animals but does not require any leader in the group to achieve its goal. The authors of [15] studied the application of particle swarm optimization in the non-deterministic navigation of UAVs, allowing them to work together to protect a large area from aerial attack. In [16], particle swarm optimization is used to solve the problem of multiple UAVs versus multiple UAVs.
The ant colony algorithm stems from ant behavior to find the shortest path from the nest to the food source. In [17], the ant colony algorithm is used to select and optimize fuzzy rules, and the navigation problem of mobile robots in an unknown dynamic environment is solved.
Reinforcement learning [18] is an adaptive, flexible approach that achieves optimal actions in the current state through repeated experiments and reward maximization. However, as system complexity increases, the learning cost of reinforcement learning increases exponentially. A deep neural network [4] can acquire image features and learn a deep understanding of images. By combining reinforcement learning with deep neural networks, strategies can be derived successfully from high-dimensional observations, which addresses the exponentially increasing learning cost of reinforcement learning. Deep reinforcement learning has performed better than humans in single-player video games: the DQN [3] algorithm outscored humans on average in Atari games, and the AlphaGo algorithm beat the world's No. 1 human Go player in May 2017.
The MS-DDQN algorithm [19] simulated air combat strategy, enabling UAVs to make autonomous air combat decisions quickly and accurately in an uncertain environment. Based on the DDPG algorithm, the MN-DDPG algorithm [20] was proposed, which can track maneuvering targets automatically and accurately in an uncertain environment. The work in [21] compared the performance of several optimizers for UAV path planning and showed that the DQN algorithm using the Adadelta optimizer achieves the best effect in the quadrotor path-planning problem. When a UAV navigates in a highly dynamic environment, the DQN algorithm cannot converge due to the large amount of interaction data; in [22], a hierarchical DQN algorithm was proposed to solve this problem. The work in [23] proposed a multi-layer path-planning algorithm based on reinforcement learning; compared with the classical Q-learning algorithm, the multi-layer Q-learning algorithm has significant advantages in collecting global and local information and dramatically improves overall performance. In [24], a simultaneous target assignment and path planning (STAPP) algorithm based on the multi-agent deep deterministic policy gradient (MADDPG) solves the target allocation and path-planning problems of multiple UAVs. Moreover, the work in [25] used the D3QN algorithm to realize real-time path planning of a UAV in areas with threats. In [26], the authors proposed a two-stage reinforcement learning approach to avoid collisions, generating time-efficient and collision-free paths under incomplete perception and handling local noisy observations with unknown noise levels well. As seen from the above review, most works simplify the environment as grids and the UAVs as particles. In addition, few studies use only image information to perceive obstacles and plan paths in unknown scenes. Meanwhile, in multiple-UAV systems, existing work mostly assumes that communication among UAVs is not restricted, which is difficult to achieve in reality. Finally, there is little literature on the case of a dynamic destination. Using images as the only perception, without communication among UAVs, in multiple-UAV dynamic path planning is still a challenge.
Reinforcement Learning (RL) [18] is a trial-and-error approach to artificial intelligence based on interaction with the environment. In RL, an agent decides and takes an action that updates the environment. After each action, the environment outputs a reward, which reveals whether the result of the action is positive or negative feedback. The goal of the agent is to maximize the cumulative return. Each interaction between an action and the environment is called a step, and an action can bring the environment to a terminal state; the set of steps from the initial state to a terminal state is called an episode. In this paper, the agent is a UAV whose environment gives a reward at every step. A collision or an arrival ends the current episode, and we also set a maximum limit of 1000 steps per episode. The actions performed by the agents are the inputs of the environment, and states and rewards are its outputs.
According to the input state, the Q-learning algorithm based on the Bellman equation obtains the Q value of each action and then selects the action according to the Q value. The algorithm uses ε-greedy decay: it selects a random action with probability ε, or the action with the maximum Q value with probability 1 − ε, which increases the exploration capability. The experience replay algorithm is used to put the tuple (s, a, r, s′) generated at each step into the buffer to increase training stability. The update of the Q value obtained according to the temporal difference method is:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta) - Q(s_t, a_t; \theta) \right)$   (1)
where γ is the discount factor, θ is the network weight, and α is the step length.
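For illustration, a minimal Python sketch of the ε-greedy selection and the one-step TD target in Equation (1); the names q_net and q_target_net and the tensor shapes are assumptions for the sketch, not the authors' implementation.

import random
import torch

def select_action(q_net, state, epsilon, n_actions):
    """Epsilon-greedy: random action with probability epsilon, else argmax Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))          # shape (1, n_actions)
    return int(q_values.argmax(dim=1).item())

def td_target(q_target_net, reward, next_state, gamma, done):
    """One-step target R_{t+1} + gamma * max_a Q(s_{t+1}, a; theta), as in Equation (1)."""
    with torch.no_grad():
        next_q = q_target_net(next_state.unsqueeze(0)).max(dim=1).values
    return reward + gamma * next_q * (1.0 - float(done))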
In [27], the authors introduced the target network and proposed the DDQN algorithm. Then, Wang et al. [28] introduced the Dueling network by modifying the DQN network structure. Specifically, two streams share an initial convolution layer and are used to estimate the state value and the advantage function A of the action, where A is given in Equation (2) and $V^{\pi}(s)$ denotes the state value function of the policy $\pi$:
$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$   (2)
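A minimal PyTorch sketch of a dueling head that combines V(s) and A(s, a) as in Equation (2); following the common formulation of [28], the advantage stream is mean-centered so that V and A remain identifiable. The layer sizes are illustrative.

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Splits a shared feature vector into state-value and advantage streams."""
    def __init__(self, feature_dim: int, n_actions: int):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)        # (batch, 1)
        a = self.advantage(features)    # (batch, n_actions)
        # Q(s,a) = V(s) + A(s,a) - mean_a A(s,a); mean-centering keeps V and A identifiable.
        return v + a - a.mean(dim=1, keepdim=True)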
The value distribution algorithm [29] considers the Q function from a distributional perspective. The authors prove that the Bellman equation can also be applied to the value distribution, as shown in Equations (3) and (4), where $Z$ and $R$ are random variables representing the return and the immediate reward, and $Q(x, a)$ is the value function. The authors propose dividing $Z$ into 51 atoms (C51), learning the probability of $Z$ at each atom, and calculating the loss function using the Kullback–Leibler (KL) divergence.
$\mathcal{T} Q(x, a) := \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E} \left[ \max_{a'} Q(x', a') \right]$   (3)
$Z(x, a) \overset{D}{=} R(x, a) + \gamma\, Z(X', A')$   (4)
The N-step temporal difference [18] integrates the advantages of Monte Carlo and temporal difference methods. It can solve the inflexibility of the single-step temporal difference by updating with the N-step return. The N-step return and the corresponding target are shown in Equations (5) and (6).
$R_t^{(n)} := \sum_{k=0}^{n-1} \gamma^k r_{t+k+1}$   (5)
$y = R_t^{(n)} + \gamma^n \max_{a \in \mathcal{A}} Q_{\bar{\theta}}(s_{t+n}, a) - Q_{\theta}(s_t, a_t)$   (6)
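A small sketch of the n-step return in Equation (5) and the resulting TD error of Equation (6), assuming the last n rewards are held in a Python list; names are illustrative.

import torch

def n_step_return(rewards, gamma):
    """R_t^(n) = sum_{k=0}^{n-1} gamma^k * r_{t+k+1}, per Equation (5)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def n_step_td_error(q_target_net, rewards, state_t_plus_n, current_q, gamma, done):
    """y = R_t^(n) + gamma^n * max_a Q_target(s_{t+n}, a) - Q_theta(s_t, a_t), per Equation (6)."""
    g = n_step_return(rewards, gamma)
    bootstrap = 0.0
    if not done:
        with torch.no_grad():
            bootstrap = q_target_net(state_t_plus_n.unsqueeze(0)).max(dim=1).values.item()
    return g + (gamma ** len(rewards)) * bootstrap - current_q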

3. Proposed Method

This section introduces the basic concepts of reinforcement learning and proposes a C51-Duel-IP (C51 Dueling DQN with an Independent Policy) dynamic destination path-planning algorithm for multiple UAVs in a communication denial environment.

3.1. C51-Duel Algorithm

To solve the problem of multi-UAV dynamic destination path planning using images in a communication-denied environment, we build on the Dueling DQN algorithm and introduce the distributional and N-step temporal difference ideas to propose the C51-Duel-IP algorithm. As shown in Algorithm 1, $S$ represents the original states, and the action value function is denoted $Q(s, a)$; its distributional counterpart in the Bellman operator is denoted $Z(s, a)$.
Algorithm 1 The C51-Duel algorithm
  Initialize replay memory $D$
  Initialize Q network $\theta$
  Initialize target Q network $\bar{\theta}$
  Initialize hyper-parameters $(\gamma, M, \varepsilon, n, N)$
  for episode = 1 to $M$ do
      Initialize n-step buffers $S, S', R, A$
      Initialize done = False
      repeat
          while not done do
              Input $S$
              Get $Q(s_{t+1}, a) = \sum_{i=0}^{N} z_i p_i(s_{t+1}, a)$    ▷ distributional Q
              Select $A$ using $\varepsilon$-greedy
              Get $S'$, $R$, done, info
          end while
          Store $\{S, S', A, R\}$ into $D$
      until n steps
      Sample a batch from $D$
      $Z(s_t, a_t) = R(s_t, a_t) + \gamma^{n} \max_{a} Z_{\bar{\theta}}(s_{t+n}, a) - Z_{\theta}(s_t, a_t)$    ▷ TD(n)
      $loss = -\sum_{i=0}^{N} m_i \log p_i(s_t, a_t)$    ▷ cross-entropy loss of C51
      if $t \bmod c == 0$ then
          Copy $\theta \rightarrow \bar{\theta}$
      end if
      $S \leftarrow S'$
  end for
At the beginning of the algorithm, the neural network and experience buffer are initialized. The action is selected according to the Q value calculated by the neural network: the image information obtained by the UAV is passed through the neural network to obtain the Q value of each action. The algorithm uses an ε-greedy strategy to increase the exploration capability, selecting a random action with probability ε and the action with the maximum Q value with probability 1 − ε. We use the experience replay algorithm [3] to save the tuple (s, a, r, s′) of states, actions, and rewards generated at each UAV step into the buffer to increase the stability of training.
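A minimal experience replay buffer in the spirit of [3], storing (s, a, r, s′, done) tuples and sampling uniform mini-batches; the default capacity of 50,000 matches Table 1, and the rest is an illustrative sketch rather than the authors' code.

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity: int = 50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)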
The Dueling network divides states into two parts after a shared neural network: state value and the advantage function, as shown in (2). The effect of the advantage function is to distinguish rewards for action from rewards for states, which is significant for collision avoidance.
Based on the Dueling DQN algorithm, we introduce a value distribution algorithm to represent Q values through probability distributions, output the probabilities at each atom, and then update the Q values. The Dueling DQN algorithm uses state values V(s) or action value functions Q(s, a) to model cumulative returns; in this modeling process, the complete distribution information is discarded. As shown in Figure 1, the C51-Duel algorithm solves this loss of distribution information by modeling the entire distribution Z(s, a) of the action value rather than its expectation.
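To make the distributional update of Figure 1 concrete, the sketch below shows the standard C51 categorical projection and cross-entropy loss from [29] for a single transition; a batched implementation would vectorize the inner loop, and the function names here are illustrative.

import torch

def c51_target_distribution(next_probs, reward, gamma_n, atoms, done):
    """Project the shifted support r + gamma^n * z back onto the fixed atoms (C51 projection).
    next_probs: (N,) probabilities of Z(s_{t+n}, a*) under the target network.
    atoms:      (N,) fixed support z_0 .. z_{N-1}.
    Returns m:  (N,) projected target probabilities used in the cross-entropy loss.
    """
    v_min, v_max = atoms[0].item(), atoms[-1].item()
    delta_z = (v_max - v_min) / (len(atoms) - 1)
    m = torch.zeros_like(next_probs)
    tz = (reward + (0.0 if done else 1.0) * gamma_n * atoms).clamp(v_min, v_max)
    b = (tz - v_min) / delta_z
    lower, upper = b.floor().long(), b.ceil().long()
    for j in range(len(atoms)):
        if lower[j] == upper[j]:                      # Tz_j falls exactly on an atom
            m[lower[j]] += next_probs[j]
        else:                                         # split mass between the two neighbours
            m[lower[j]] += next_probs[j] * (upper[j] - b[j])
            m[upper[j]] += next_probs[j] * (b[j] - lower[j])
    return m

def c51_loss(pred_probs, m):
    """Cross-entropy between the projected target m and the predicted distribution p(s_t, a_t)."""
    return -(m * pred_probs.clamp_min(1e-8).log()).sum()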
The N-step temporal difference synthesizes Monte Carlo and temporal difference, which can solve the inflexible update problem. In a single-step temporal difference, the step length determines the frequency of action changes and the execution time of the bootstrap algorithm. The N-step algorithm enables bootstrap to occur over multiple steps, which solves the inflexible problem of the single-step temporal difference method.

3.2. Multiple UAVs Independent Policy

Parameter sharing is generally used in multi-agent reinforcement learning. As shown in Figure 2a, the parameter-sharing policy takes the single-agent approach to learn one strategy for all agents. All observations obtained by the agents are combined and sent to the control terminal, which generates a strategy and sends the actions to all UAVs. This simple approach has been proven to achieve state-of-the-art performance in cooperative games [30]. However, it can lead to the curse of dimensionality as the number of agents increases, and it has limitations in dealing with heterogeneous agents.
The Centralized Training Decentralized Execution (CTDE) [31] framework shares all observations during training and uses the same critic, but each agent has its own actor, which partly overcomes the problem of unstable environments. After training, each agent makes decisions based only on its local observations. CTDE methods such as MADDPG [32] and MAPPO [33] concatenate the local observations of all agents. However, this concatenation makes the value network's input dimension too high, and training the value network becomes difficult; this problem has been mentioned in the literature [33].
Note that CTDE requires shared observations between agents during training. In contrast, parameter sharing requires shared observations between agents in real-time, which conflicts with the communication denied condition assumed in this paper. Therefore, the independent policy is used in this paper, as shown in Figure 2c. Each UAV has its observation and corresponds to its policies.
Under the independent policy, the agents do not need to upload images to a control center or receive action instructions, because each agent is independent. The onboard processor directly processes the acquired images and generates the policy, so each UAV can make decisions independently according to its own situation. There is no control center planning collision avoidance between agents; each agent responds dynamically in real time according to its own observations. This is analogous to vehicles on a road, where each driver acts according to his or her own observations; thus, an independent policy is a form of distributed intelligence.
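An illustrative sketch of the independent-policy setup: each UAV owns its own network and acts only on its own camera image, with no shared observations, messages, or central controller. The class and function names are hypothetical.

import torch

class IndependentFleet:
    """One policy network per UAV; no observation sharing, no central controller."""
    def __init__(self, make_policy, uav_ids):
        # make_policy() builds a fresh, independently trained network for each UAV.
        self.policies = {uav_id: make_policy() for uav_id in uav_ids}

    def act(self, observations: dict) -> dict:
        """observations maps uav_id -> its own 84x84x3 image tensor; returns uav_id -> action."""
        actions = {}
        for uav_id, obs in observations.items():
            with torch.no_grad():
                q = self.policies[uav_id](obs.unsqueeze(0))   # only this UAV's own image
            actions[uav_id] = int(q.argmax(dim=1).item())
        return actions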

3.3. Action Space and Observation Space

In urban environments, buildings and roads are horizontal or vertical. In this case, setting UAV actions as discrete actions can reduce the meaningless actions of the UAV. So, the action space of UAVs is discrete, with seven options: up, down, front, back, left, right, and do nothing. In the urban environment, the UAV moves front, back, left, and right about 4 m with each step and moves up and down about 1.6 m with each step, denoted as Discrete (7) in gym style.
Common vision-based obstacle avoidance algorithms often rely on both RGB and depth images: UAVs can use RGB images to identify objects and depth images to estimate the positions of obstacles or other UAVs. However, in reality, a UAV rarely carries multiple depth cameras due to its limited image-processing ability. Therefore, we only use RGB images from the first-person perspective in our experiments, making them closer to reality. A leading UAV cannot observe the UAVs behind it, and a trailing UAV must respond in real time to the movement of the UAV ahead. We set the observation of the multiple-UAV scenarios as an 84 × 84 × 3 matrix, denoted as Box (84, 84, 3) in gym style, as shown in Figure 3.
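In gym terms, the two spaces described above can be declared as follows; the concrete displacement assigned to each discrete action index is an assumption for illustration only.

import numpy as np
from gym import spaces

# Seven discrete actions: up, down, front, back, left, right, and do nothing.
action_space = spaces.Discrete(7)

# First-person RGB image, 84 x 84 x 3, 8 bits per channel.
observation_space = spaces.Box(low=0, high=255, shape=(84, 84, 3), dtype=np.uint8)

# Hypothetical mapping from action index to a body-frame displacement per step (metres, NED frame):
ACTION_TO_OFFSET = {
    0: (0.0, 0.0, -1.6),   # up (negative z is up in NED)
    1: (0.0, 0.0, +1.6),   # down
    2: (+4.0, 0.0, 0.0),   # front
    3: (-4.0, 0.0, 0.0),   # back
    4: (0.0, -4.0, 0.0),   # left
    5: (0.0, +4.0, 0.0),   # right
    6: (0.0, 0.0, 0.0),    # do nothing
}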

3.4. Reward Function

During the training process, only the RGB image obtained by each UAV is used as the observation, and there is no need to manually guide the path of each UAV. When the UAV performs the current action, the algorithm assigns different rewards according to the UAV's action and state. When the UAV collides or flies out of the city boundary, the reward is set to −20. The reward for reaching the target position is set to +20. In all other cases, the reward is based on whether the UAV flies toward its target.
$reward = \begin{cases} +20, & \text{reach the target} \\ -20, & \text{collision or outside} \\ D_{t-1} - D_t, & \text{otherwise} \end{cases}$   (7)
where $D_t$ is the distance between the UAV and the target at time step $t$, and $D_{t-1}$ is the distance at the previous time step $t-1$.
When the distance between the UAV and the target at time t is smaller than that at time t 1 , that is, the UAV is moving toward the target direction, a positive reward will be given. Otherwise, a negative reward will be given.
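A direct transcription of the reward in Equation (7), assuming the environment can report arrival, collision or boundary exit, and the current distance to the (moving) target; the function and argument names are illustrative.

def compute_reward(reached_target: bool, collided_or_outside: bool,
                   dist_prev: float, dist_now: float) -> float:
    """Reward of Equation (7): +20 on arrival, -20 on collision/boundary exit,
    otherwise the signed progress toward the target, D_{t-1} - D_t."""
    if reached_target:
        return 20.0
    if collided_or_outside:
        return -20.0
    return dist_prev - dist_now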

4. Experiments and Results

To verify the effectiveness of the proposed algorithm in a complex 3D environment, we designed an urban environment using AirSim, as shown in Figure 4. The environment covers about 40,000 m², including several tall buildings, villas, and other constructions, with roads and narrow gaps between the buildings. The urban environment model is available on the UE Marketplace. The target moves repeatedly within the green area at a speed of 0.8 m/s. During the experiment, we assume that the UAVs cannot fly over the roofs of tall buildings and that there is no communication between the UAVs and the ground station or among the UAVs. Therefore, each UAV must have independent perception and decision-making ability rather than relying on a central console.

4.1. Experimental Content

The experimental environment is a globally unknown city, where the UAVs perform real-time path planning based on visual information. Figure 1 shows the algorithm process of the UAV path planning. In the task, the UAVs use visual information to obtain the surrounding environmental information and take actions to avoid obstacles until reaching the target location. When planning a path for multiple UAVs, collisions among UAVs should also be avoided. Two simulation scenarios are designed as follows:
  • Dynamic path planning for multiple UAVs in different initial positions;
  • Dynamic avoidance and path planning for multiple UAVs in the same initial position.
In scenario 1, the three UAVs need to take off from three different positions shown in Figure 4b,c,e and reach the target position simultaneously.
In scenario 2, the three UAVs need to take off from the same position shown in Figure 4b simultaneously and reach the target position at the same time without collision.
In both scenarios, there is no communication between the three UAVs, and they do not know each other's positions; they can only perceive each other through image information. The UAVs execute their policies independently, with no central controller sending execution instructions to each UAV. In scenario 2, the three UAVs take off from the same position, so they must avoid each other while flying to the destination, which is the main difference from scenario 1. In scenario 2, each UAV is a dynamic obstacle to the others, which means that the environment is not entirely static.
We implement our algorithm on Windows with PyTorch [34] and the Ray [35] distributed machine learning framework. The simulation machine has an NVIDIA 3090 graphics card and an AMD Ryzen 9 5950X processor. In order to make the algorithm applicable to a wider range of scenarios, we only use first-person RGB images as the observation. We compare the success rates of different algorithms and focus on analyzing algorithm performance in multiple-UAV path-planning missions.

4.2. Hyper-Parameter

The image shape is 84 × 84 × 3 for scenarios 1 and 2, and a convolutional neural network is used to extract features. The convolution kernel of the first layer is 4 × 4 with a stride of 4 and a ReLU activation. The convolution kernel of the second layer is 4 × 4 with a stride of 4 and a ReLU activation. The convolution kernel of the third layer is 5 × 5 with a stride of 2 and a ReLU activation. The fully connected part contains two hidden layers, each with 256 neurons and a Tanh activation.
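A PyTorch sketch of the network described above. Only the kernel sizes, strides, and activations are given in the text, so the channel widths (32/64/64), the absence of padding, and the plain Q-value output head are assumptions; in the full model this head would be replaced by the dueling/C51 head.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Convolutional layers as described in Section 4.2; channel widths are assumed."""
    def __init__(self, n_actions: int = 7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=4), nn.ReLU(),   # 84x84 -> 21x21
            nn.Conv2d(32, 64, kernel_size=4, stride=4), nn.ReLU(),  # 21x21 -> 5x5
            nn.Conv2d(64, 64, kernel_size=5, stride=2), nn.ReLU(),  # 5x5   -> 1x1
            nn.Flatten(),
        )
        self.fc = nn.Sequential(
            nn.Linear(64, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, n_actions),   # replaced by a dueling/C51 head in the full model
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, 84, 84), pixel values scaled to [0, 1]
        return self.fc(self.conv(x))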
We evaluate the performance of the C51-Duel-IP (C51-Duel algorithm with Independent Policy) on the above tasks. Additionally, we reproduce and test the DQN-IP, Duel-IP, and C51-Duel-SP (C51-Duel algorithm with Shared Policy) algorithms. The hyper-parameter is shown in Table 1:

4.3. Experimental Results

4.3.1. Scenario 1: Path Planning of Multiple UAVs in Different Initial Positions

The three UAVs start from the positions shown in Figure 4b,c,e, respectively, and then fly to the target position shown in Figure 4a. All the algorithms execute 1200K steps. The success rate of the C51-Duel-IP algorithm converges to 0.9, while the Duel-IP algorithm only converges to 0.3 and the DQN-IP algorithm can hardly complete the task, as shown in Figure 5a. The rewards are the sum of the three UAVs' rewards, and we can see that the C51-Duel-IP rewards are better than those of the other two algorithms in Figure 5b.
As shown in Figure 5b, the average reward of the Duel-IP fluctuates significantly, sometimes even showing extreme peaks, while the average reward of the C51-Duel-IP remains stable. The reason is that when some UAVs reach the target and others do not, the UAVs close to the target repeatedly collect high rewards, which increases the total reward; however, the mission is not complete until all the UAVs reach the target. We call this greedy behavior, and the high reward is meaningless in this case. Combined with the success rate curve, we can see that the C51-Duel-IP can achieve dynamic path planning efficiently for multiple UAVs.
In most research, the individuals in a multi-agent system adjust and update their behaviors to achieve consistency through local cooperation and communication. The independent policy we propose can achieve consistency and avoid greedy behavior using only visual perception in the case of communication denial. As shown in Figure 6, the C51-Duel-IP algorithm has results similar to those of the C51-Duel-SP in terms of success rate and mean reward; that is, our independent policy algorithm has effects similar to the shared policy, but with the significant advantage of running in a communication denial environment.

4.3.2. Scenario 2: Avoidance and Path Planning of Multiple UAVs in the Same Initial Positions

For multiple UAVs with the same initial position, we require the UAVs to avoid any collision among themselves while there is no communication between UAVs in the environment, as shown in Figure 4b. Compared with scenario 1, each UAV is a dynamic obstacle to the others, increasing the complexity of the environment. In the shared policy algorithm, the central control terminal plans each evasive action of the UAV. In our independent policy, each UAV has its own policy and responds to its observations in real time, which is essential in the context of communication denial.
As shown in Figure 7a, the success rate of the C51-Duel-IP converges to 0.9. However, the success rate of the Duel-IP may not be stable, and that of the DQN-IP is always zero. Figure 7b shows the mean reward in each episode, revealing that C51-Duel-IP converges faster. The reason for the peak and stability in Figure 7b is the same as in scenario 1.
As shown in Figure 8, the C51-Duel-IP algorithm also has similar results to the C51-Duel-SP in terms of success rate and mean reward in this experiment.

4.4. Evaluation

To verify that the above results can be reproduced on hardware, we apply C51-Duel-IP to the HITL task. HITL is a simulation mode in which normal PX4 firmware runs on real flight controller hardware. This approach benefits from testing most of the actual flight code on real hardware.
We used the simple_flight mode provided by AirSim for training. simple_flight is a software flight controller built into AirSim that plays the role of the PX4 hardware controller during simulation; it integrates several tools that facilitate simulation, making it ideal for the training phase. When training is completed and the trained weights are saved, the simulation environment is connected to the real hardware so that the algorithm runs on a real microcontroller. We use a Pixhawk PX2.4.8 as the UAV autopilot. PX4 communicates with AirSim via the MAVLink protocol. The UAV model in AirSim contains the sensor information and the control model. AirSim simulates the sensor status of a real UAV flight, such as the IMU and GPS, and sends it to PX4. PX4 performs attitude estimation and dynamics solutions and then sends control information back to AirSim. The HITL architecture of AirSim is shown in Figure 9. The configuration for evaluating the UAV path-planning algorithm is shown in Figure 10.
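For evaluation, a trained policy can drive AirSim through its Python API. The snippet below sketches how a first-person RGB image might be retrieved and a velocity command issued; the vehicle name, camera index, and velocities are illustrative and not taken from the paper.

import numpy as np
import airsim

client = airsim.MultirotorClient()        # connects to the AirSim instance behind the HITL setup
client.confirmConnection()
client.enableApiControl(True, vehicle_name="UAV1")
client.armDisarm(True, vehicle_name="UAV1")
client.takeoffAsync(vehicle_name="UAV1").join()

# First-person RGB observation (resizing to 84x84 would be done afterwards).
responses = client.simGetImages(
    [airsim.ImageRequest("0", airsim.ImageType.Scene, False, False)],
    vehicle_name="UAV1",
)
img = np.frombuffer(responses[0].image_data_uint8, dtype=np.uint8)
img = img.reshape(responses[0].height, responses[0].width, -1)   # 3 or 4 channels depending on version

# Example action "front": move forward at 2 m/s for 2 s (velocities are illustrative).
client.moveByVelocityAsync(2.0, 0.0, 0.0, duration=2.0, vehicle_name="UAV1").join()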
Figure 11 shows the path-planning results for the two scenarios using the C51-Duel-IP algorithm. The results of scenario 1 are viewed from the top, and those of scenario 2 from a bird's-eye view. We can see that the C51-Duel-IP algorithm provides paths that avoid obstacles and reach the target, as shown in Figure 11a, and the three UAVs can avoid each other, as shown in Figure 11b.

5. Conclusions

This paper proposes a C51-Duel-IP dynamic destination path-planning algorithm for multiple UAVs in a communication denial environment. The algorithm uses the first-person-perspective image as the only observation input and uses an independent policy to control the UAVs: there is no central controller for collision avoidance, and the decision system is distributed on the onboard processor of each UAV. The experimental results show that each UAV responds in real time and reaches the target position successfully without collision. Our algorithm is compared with the DQN-IP, Duel-IP, and C51-Duel-SP algorithms; the experiments show that it achieves much better results than the original algorithms and is very close to the shared policy. Our algorithm can effectively perform the multi-UAV dynamic path-planning task in a 3D environment. Additionally, we verified the effectiveness of our algorithm in the HITL experiment and will test it on actual UAVs in future work.

Author Contributions

Conceptualization, Y.X. and H.D.; methodology, Y.X.; software, Y.X.; validation, K.J.; formal analysis, Y.X. and Y.W.; investigation, K.J.; resources, Y.W.; data curation, K.J.; writing—original draft preparation, Y.X. and D.W.; writing—review and editing, K.J.; visualization, Y.W.; supervision, Y.W.; project administration, Y.W. and D.W.; funding acquisition, D.W. All authors have read and agreed to the published version of the manuscript.

Funding

The work described in this paper was supported in part by Beijing Hongda Hechuang Defense Technology Research Institute Co., Ltd. (No. XM2101).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Puente-Castro, A.; Rivero, D.; Pazos, A.; Fernandez-Blanco, E. A review of artificial intelligence applied to path planning in UAV swarms. Neural Comput. Appl. 2022, 34, 153–170. [Google Scholar] [CrossRef]
  2. Patle, B.K.; Babu, L.G.; Pandey, A.; Parhi, D.; Jagadeesh, A. A review: On path planning strategies for navigation of mobile robot. Def. Technol. 2019, 15, 582–606. [Google Scholar] [CrossRef]
  3. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  4. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  5. Cassandra, A.R. A survey of POMDP applications. In Working Notes of AAAI 1998 Fall Symposium on Planning with Partially Observable Markov Decision Processes; Microelectronics and Computer Technology Corporation: Austin, TX, USA, 1998. [Google Scholar]
  6. Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. In Field and Service Robotics; Springer: Cham, Switzerland, 2017. [Google Scholar] [CrossRef] [Green Version]
  7. Khatib, O. Real-Time Obstacle Avoidance for Manipulators and Mobile Robots. Int. J. Robot. Res. 1986, 5, 90–98. [Google Scholar] [CrossRef]
  8. Holland, J.H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence; The MIT Press: Cambridge, MA, USA, 1975. [Google Scholar] [CrossRef]
  9. Eberhart, R.; Kennedy, J. A new optimizer using particle swarm theory. In Proceedings of the Mhs95 Sixth International Symposium on Micro Machine & Human Science, Nagoya, Japan, 4–6 October 1995. [Google Scholar]
  10. Dorigo, M.; Gambardella, L.M. Ant colony system: A cooperative learning approach to the traveling salesman problem. IEEE Trans. Evol. Comput. 1997, 1, 53–66. [Google Scholar] [CrossRef] [Green Version]
  11. Yang, X.S.; Deb, S. Cuckoo Search via Levy Flights. In Proceedings of the 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC), Coimbatore, India, 9–11 December 2009. [Google Scholar] [CrossRef]
  12. Cetin, O.; Zagli, I.; Yilmaz, G. Establishing Obstacle and Collision Free Communication Relay for UAVs with Artificial Potential Fields. J. Intell. Robot. Syst. 2013, 69, 361–372. [Google Scholar] [CrossRef]
  13. Shi, P.; Cui, Y.J. Dynamic Path Planning for Mobile Robot Based on Genetic Algorithm in Unknown Environment. In Proceedings of the 22nd Chinese Control and Decision Conference, Xuzhou, China, 26–28 May 2010. [Google Scholar]
  14. Kang, X.; Yong, Y.; Li, D.; Maple, C. Genetic algorithm based solution to dead-end problems in robot navigation. Int. J. Comput. Appl. Technol. 2011, 41, 177–184. [Google Scholar] [CrossRef]
  15. Banks, A.; Vincent, J.; Phalp, K. Particle swarm guidance system for autonomous unmanned aerial vehicles in an air defence role. J. Navig. 2008, 61, 9–29. [Google Scholar] [CrossRef]
  16. Li, W.; Shi, J.; Wu, Y.; Wang, Y.; Lyu, Y. A Multi-UCAV cooperative occupation method based on weapon engagement zones for beyond-visual-range air combat. Def. Technol. 2021, 18, 1006–1022. [Google Scholar] [CrossRef]
  17. Purian, F.K.; Sadeghian, E. Mobile robots path planning using ant colony optimization and Fuzzy Logic algorithms in unknown dynamic environments. In Proceedings of the 2013 International Conference on Control, Automation, Robotics and Embedded Systems (Care-2013), Jabalpur, India, 16–18 December 2013. [Google Scholar]
  18. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  19. Li, Y.; Shi, J.; Jiang, W.; Zhang, W.; Lyu, Y. Autonomous maneuver decision-making for a UCAV in short-range aerial combat based on an MS-DDQN algorithm. Def. Technol. 2021, 18, 1697–1714. [Google Scholar] [CrossRef]
  20. Li, B.; Yang, Z.; Chen, D.; Liang, S.; Ma, H. Maneuvering target tracking of UAV based on MN-DDPG and transfer learning. Def. Technol. 2021, 17, 457–466. [Google Scholar] [CrossRef]
  21. Jembre, Y.Z.; Nugroho, Y.W.; Khan, M.; Attique, M.; Paul, R.; Shah, S.; Kim, B. Evaluation of Reinforcement and Deep Learning Algorithms in Controlling Unmanned Aerial Vehicles. Appl. Sci. 2021, 11, 7240. [Google Scholar] [CrossRef]
  22. Guo, T.; Jiang, N.; Li, B.Y.; Zhu, X.; Wang, Y.; Du, W.B. UAV navigation in high dynamic environments: A deep reinforcement learning approach. Chin. J. Aeronaut. 2021, 34, 479–489. [Google Scholar] [CrossRef]
  23. Cui, Z.Y.; Wang, Y. UAV Path Planning Based on Multi-Layer Reinforcement Learning Technique. IEEE Access 2021, 9, 59486–59497. [Google Scholar] [CrossRef]
  24. Qie, H.; Shi, D.X.; Shen, T.L.; Xu, X.H.; Li, Y.; Wang, L.J. Joint Optimization of Multi-UAV Target Assignment and Path Planning Based on Multi-Agent Reinforcement Learning. IEEE Access 2019, 7, 146264–146272. [Google Scholar] [CrossRef]
  25. Yan, C.; Xiang, X.J.; Wang, C. Towards Real-Time Path Planning through Deep Reinforcement Learning for a UAV in Dynamic Environments. J. Intell. Robot. Syst. 2020, 98, 297–309. [Google Scholar] [CrossRef]
  26. Wang, D.W.; Fan, T.X.; Han, T.; Pan, J. A Two-Stage Reinforcement Learning Approach for Multi-UAV Collision Avoidance Under Imperfect Sensing. IEEE Robot. Autom. Lett. 2020, 5, 3098–3105. [Google Scholar] [CrossRef]
  27. Van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the 30th Association-for-the-Advancement-of-Artificial-Intelligence (AAAI) Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100. [Google Scholar]
  28. Wang, Z.Y.; Schaul, T.; Hessel, M.; van Hasselt, H.; Lanctot, M.; de Freitas, N. Dueling Network Architectures for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 24 June 2016; Balcan, M.F., Weinberger, K.Q., Eds.; Volume 48, pp. 1995–2003. [Google Scholar]
  29. Bellemare, M.G.; Dabney, W.; Munos, R. A Distributional Perspective on Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
  30. Terry, J.K.; Grammel, N.; Son, S.; Black, B. Parameter Sharing for Heterogeneous Agents in Multi-Agent Reinforcement Learning. In Einstieg in Deep Reinforcement Learning; Carl Hanser Verlag: Munich, Germany, 2020. [Google Scholar] [CrossRef]
  31. Li, S.; Jia, Y.; Yang, F.; Qin, Q.; Gao, H.; Zhou, Y. Collaborative Decision-Making Method for Multi-UAV Based on Multiagent Reinforcement Learning. IEEE Access 2022, 10, 91385–91396. [Google Scholar] [CrossRef]
  32. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the NeurIPS, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  33. Yu, C.; Velu, A.; Vinitsky, E.; Wang, Y.; Wu, Y. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. arXiv 2021, arXiv:2103.01955. [Google Scholar]
  34. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.M.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., D'Alche-Buc, F., Fox, E., Garnett, R., Eds.; NeurIPS: La Jolla, CA, USA; Volume 32.
  35. Moritz, P.; Nishihara, R.; Wang, S.; et al. Ray: A Distributed Framework for Emerging AI Applications. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, Carlsbad, CA, USA, 8–10 October 2018. [Google Scholar]
Figure 1. Algorithm schematic, where Q-Val is the original Dueling DQN algorithm, and yellow data stream is the C51-Duel algorithm.
Figure 2. Multi-agent policy, (a) represents shared parameters, (b) represents centralized evaluation and decentralized execution, and (c) represents independent policies.
Figure 3. Path planning for multiple UAVs using independent policy.
Figure 4. Experimental environment: (a) is the target, (b,c,e) is the initial positions of UAVs for scenario 1, (d) is the top view of the scene; the target position is restricted to the green zone. (b) is also the initial position of all UAVs for scenario 2.
Figure 5. Scenario 1 experimental results with independent policy. The horizontal axis indicates that all the UAVs select one action. We choose the sum of episode rewards contained in 5K steps divided by the number of episodes as the average reward. The success rate takes the number of successful episodes in 5K steps divided by the total number of episodes. (a) Shows the success rate, (b) Shows the mean reward in each episode.
Figure 6. Comparison of independent policies and shared policies in scenario 1: (a) shows the success rate; (b) shows the mean reward in each episode.
Figure 7. Scenario 2—experimental result with independent policy: (a) shows the success rate, (b) shows the mean reward in each episode.
Figure 8. Comparison of independent policies and shared policies in scenario 2: (a) shows the success rate; (b) shows the mean reward in each episode.
Figure 9. AirSim HITL Architecture [6].
Figure 10. HITL simulation.
Figure 11. Path-planning results obtained by C51-Duel-IP algorithm: (a) is the results of scenario 1; (b) is the results of scenario 2.
Table 1. Hyper-parameter settings of the DQN-IP, Duel-IP, C51-Duel-IP, and C51-Duel-SP algorithms.

Hyper-parameter         DQN-IP    Duel-IP   C51-Duel-IP   C51-Duel-SP
Gamma                   0.995     0.995     0.995         0.995
Target update period    1000      1000      1000          1000
Learn rate              0.0001    0.0001    0.0001        0.0001
Learn start             2000      2000      2000          2000
Buffer                  50,000    50,000    50,000        50,000
Epsilon                 50,000    50,000    50,000        50,000
Final                   0.01      0.01      0.01          0.01
Batch size              1024      1024      1024          1024
Class                   -         -         51            51
V-max                   -         -         10            10
N-step                  1         1         3             3
Dueling Net             -         True      True          True