1. Introduction
Due to the continuous development of multi-sensor data fusion and autonomous flight control technologies, Unmanned Aerial Vehicles (UAVs) have become an important tool in the field of emergency rescue. In particular, in the face of natural disasters and emergencies, UAVs are widely used in search and rescue operations [1]; for example, the 9.0-magnitude earthquake in Japan in 2011 [2], the 8.1-magnitude earthquake in Nepal in 2015 [3], and Hurricane Harvey in Texas in 2017 [4] caused massive damage and casualties, and UAVs were used to assist rescue teams in providing essential rescue services for trapped people in danger. However, due to the lack of effective path planning, some UAVs collided during flight, resulting in mission delays, wasted resources, and even secondary disasters, which posed additional risks to both rescue workers and trapped people. Therefore, effective and collision-free path planning is critical to the success of UAV rescue missions. A better path planning strategy can avoid collisions between UAVs, or between UAVs and buildings or other obstacles, improve rescue efficiency, and ensure the smooth progress of rescue operations [5].
The complexity of the rescue environment is the primary challenge of the path planning problem [6]. In general, obstacles (e.g., buildings, bridges, trees, gravel) in the disaster area are scattered and irregular, and the geographical environment is complex. Planning a collision-free flight path for UAVs therefore has practical significance for assisting rescue missions. At present, several studies [7,8,9,10] have applied classical graph-search methods (e.g., the A* algorithm, Dijkstra's algorithm, and the Bellman–Ford algorithm) to collision-free path planning. These algorithms evaluate path quality with certain rules and heuristic functions on an existing map and perform path planning by searching and optimizing to find the best path. However, some disasters may change the inherent topography of the affected area and damage buildings and infrastructure, causing unexpected changes in the task area.
For this reason, it is more practical to study the path planning problem in an uncertain task environment. In a disaster environment, the communications and navigation infrastructure may not function properly, requiring UAVs to perform tasks autonomously without a GPS signal or effective communication with ground command centers [11]; as a result, traditional path planning algorithms cannot obtain up-to-date maps and environmental information. In addition, unforeseen changes may occur in disaster environments, such as road damage, new obstacles, and personnel evacuations. These changes place higher demands on path planning and navigation algorithms, which must adapt to environmental changes in real time and quickly re-plan paths. Traditional algorithms are usually designed for specific maps and road network structures; when the map changes, they may fail to adapt to the new environment. To meet these challenges, more flexible, real-time path planning algorithms are needed, such as those based on reinforcement learning (RL). RL-based path planning algorithms offer adaptability, real-time planning, and the ability to learn. Researchers have therefore introduced RL models into path planning [12,13,14], learning flexible and feasible strategies through the agent's continuous trial and error in unknown environments.
However, training RL models requires a large amount of training data, training time, and computing resources [15]. In addition, as the complexity of the task environment increases, RL-based path planning algorithms face challenges when dealing with complex environmental constraints (such as avoiding dangerous areas or other agents, energy constraints, and time constraints), which requires the design of appropriate state representations and action selection strategies. Heuristic algorithms can provide prior knowledge for the RL model by handling these complex constraints with certain rules and heuristic functions. For example, the genetic algorithm [16] uses operations such as selection, crossover, and mutation; the particle swarm algorithm [17] updates the velocity and position of particles; and the bee colony algorithm [18] uses local search and information exchange. These algorithms then combine fitness functions and constraints to generate UAV path solutions adapted to the environment. Such heuristic algorithms have advantages in computational efficiency, data requirements, and solution speed, and they can effectively reduce path redundancy and unnecessary trial and error.
To this end, we propose a multi-UAV path planning algorithm based on a whale-inspired deep Q-network (WDQN). Specifically, we use the deep Q-network (DQN) as the framework and design a comprehensive reward function to balance three factors: path length, obstacle avoidance, and energy consumption. A deep neural network is used to approximate the Q-value function, and the comprehensive reward function evaluates and rewards the planned path in real time during training. In addition, the Whale Optimization Algorithm (WOA) is introduced when training the DQN; by balancing exploration and exploitation, it helps the DQN model explore a wider range of actions and states, enabling it to discover new and possibly better strategies. Moreover, the search strategy is dynamically adjusted according to the fitness of the solutions, helping the DQN model focus the search on promising areas of the state–action space and thereby improving convergence speed and learning efficiency. Simulation experiments show that the proposed algorithm enables UAVs to explore unknown, complex environments and plan collision-free feasible paths. Finally, compared with classic RL algorithms, the WDQN performs better in learning efficiency, path planning success rate, and path length.
The main contributions of this paper are as follows:
We combine the WOA and the DQN to propose a path planning algorithm for UAV-assisted disaster rescue. The introduced WOA significantly improves the learning efficiency of the DQN, and the designed comprehensive reward function effectively alleviates the sparse reward problem of RL algorithms in complex environments.
We train the model with increasing scene complexity to improve the robustness and generalization of the algorithm. During the training phase, we gradually increase the number of obstacles in the task scene, so that the UAV can gradually adapt to the complex environment and learn a better pathfinding strategy.
Compared with global path planning algorithms, the proposed algorithm is suitable for disaster areas with limited communication and has good scalability; that is, it can apply the experience learned during training to unknown environments. In addition, comparative experiments with classical RL algorithms show that the WDQN has better training efficiency and path planning ability.
The rest of the paper is organized as follows. In Section 2, we introduce research work related to this paper. The construction of a disaster scenario and the formulation of the rescue problem are described in Section 3. In Section 4, the WDQN-based multi-UAV path planning algorithm proposed in this paper is described in detail. In Section 5, the relevant parameters of the algorithm are introduced, and experiments are carried out to validate the proposed algorithm. Finally, the paper is summarized in Section 6.
2. Related Works
Reliable multi-UAV communication and path planning in emergency rescue scenarios are two current research hotspots [19,20,21]. On the one hand, UAV communication security is continuously advancing in the directions of encrypted communication [22], anti-jamming [23], and anti-attack [24] technologies. On the other hand, UAV path planning is shifting from methods based on known maps to the exploration of unknown environments. In this paper, we assume that the communication environment of the multi-UAV system is reliable and focus on collision-free path planning in emergency scenarios. The following reviews existing path planning algorithms from two aspects: global path planning based on a fully known environment and local path planning based on an uncertain environment [25].
2.1. Global Path Planning
Global path planning based on a completely known environment is a static planning approach. It computes the optimal path from the starting point to the end point using an algorithm operating on a known environmental map. Common global path planning algorithms include Dijkstra's algorithm, the A* algorithm, and the sampling-based Rapidly Exploring Random Tree Star (RRT*) algorithm.
Dijkstra's algorithm is widely used for solving the shortest path problem in weighted graphs and is an efficient choice, especially when the target is uncertain. However, its time and space complexity are relatively high, which makes it unsuitable for complex environments. To avoid the sequential bottleneck that may occur in Dijkstra's algorithm, Wang et al. [26] first used an extended hierarchical graph to model the dynamic grid environment, then used a matrix alignment method to perform parallel exploration in the simulated environment, and finally traced the safe path according to the navigator matrix. This method shows good performance in large-scale dynamic grid environments. The A* algorithm, an extension of Dijkstra's algorithm, is currently one of the most widely used graph traversal and path exploration techniques. The traditional A* algorithm has the advantages of simple principles and fast calculation, but as the map grows larger, its computational load increases exponentially, resulting in high memory usage, long computing times, and low exploration efficiency [27]. To this end, Guilherme et al. [28] proposed an improved A* algorithm for UAV target search and rescue, which introduced a truncation mechanism to prevent UAVs from falling into local optima. The RRT* algorithm has great advantages in search efficiency and search quality and has been successfully applied in the fields of autonomous driving and UAV navigation. Since the classic RRT* algorithm adopts a fixed step size expansion strategy, setting the step size too large or too small directly affects the search results. To this end, Wang et al. [29] designed a dynamic step size strategy: when the underground intelligent vehicle is far from obstacles, a larger step size is used to speed up convergence; when it is close to obstacles, a smaller step size is used to ensure the safety of the path. Hu et al. [30] first used a bidirectional extending random tree search strategy to plan the static path and then pruned the random tree according to changes in environmental information, reducing the computational cost of the RRT* algorithm in uniform random sampling. Compared with the classic RRT algorithm, these improved algorithms converge faster and achieve higher path planning accuracy, but they easily fall into local optima and their energy consumption is relatively high. Therefore, intelligent algorithms such as PSO and the WOA are often used to improve efficiency and quality. Dong et al. [31] used the Adaptive Whale Optimization Algorithm (AWOA) to optimize the deployment of UAVs, used a variable-length population strategy to find the optimal number of stopping points, and introduced nonlinear parameters and partial mutation rules to balance exploration and exploitation, so as to minimize the energy consumption of UAVs. Yu et al. [32] improved the PSO algorithm through simulated annealing and showed experimentally that the algorithm can plan high-quality paths for UAVs.
The performance of global path planning algorithms is highly dependent on prior knowledge of the environment. When the task environment of UAVs undergoes unpredictable changes, the paths planned by these methods are no longer reliable.
2.2. Local Path Planning
Local path planning based on an uncertain environment is a dynamic planning approach. When the environmental information is completely unknown or only partially known, this type of algorithm uses the local information obtained by the UAVs to plan a collision-free path. Classical local path planning usually relies on techniques such as the dynamic window approach (DWA), the artificial potential field method, and reinforcement learning. These algorithms are robust to environmental errors and noise.
Traditional DWA algorithms prefer to bypass the periphery of dense obstacle areas, which increases the total distance; moreover, the objective cost function fails when a "C"-shaped obstacle is encountered, so no path can be found [33]. Tan et al. [34] introduced the concept of an obstacle search angle into the traditional DWA: when a static obstacle is not within a certain angular range of the agent's forward direction, its impact on the agent's subsequent navigation is ignored, which improves the adaptability of obstacle avoidance in different scenarios. The artificial potential field method simulates the repulsive field generated by obstacles and the attractive field generated by the end point, so that the robot can avoid obstacles in the task environment and move toward the end point. This method is relatively simple in theory and operation, but it also suffers from problems such as unreachable goals and a tendency to fall into local optima [35]. To this end, Sun et al. [36] introduced a gravity barrier function into the repulsion function and further restricted the flight boundary and flight speed of the UAV, thereby avoiding local optima and improving the safety and stability of the planned route. However, the ability of methods based on the artificial potential field to deal with unknown task environments is limited.
In recent years, methods based on reinforcement learning have been widely used. The Q-learning algorithm is one of the most commonly used RL methods, in which UAVs learn the optimal action strategy by interacting with the environment. For example, Souto et al. [37] used Q-learning to control the UAV's movement toward the target, aiming to reduce its energy consumption during disaster area rescue. However, the Q-learning algorithm faces challenges when dealing with continuous state and action spaces. To solve this problem, deep reinforcement learning (DRL) has been introduced into local path planning tasks. DRL not only has the powerful perception and representation ability of deep neural networks when processing high-dimensional decision-making information but also inherits the learning mechanism of reinforcement learning through interaction with the environment, enabling real-time path planning for agents [38]. Among DRL methods, the DQN is one of the most classic. Zhang et al. [39] transformed the trajectory optimization problem into a constrained Markov decision process with a UAV as the agent and proposed a trajectory planning method based on the deep Q-network. Although the DQN is effective in obstacle avoidance and path planning, it suffers from large randomness in action selection and slow convergence during training. To this end, Huang et al. [40] proposed a Dueling DQN algorithm based on a tree sampling mechanism. In addition, Wang et al. [41] proposed a DDQN based on a greedy strategy to solve the problem of mountain rescue in complex environments.
These algorithms enable UAVs to perform tasks in uncertain environments. However, because global information is not available, the planned path may not be optimal or even feasible. In addition, the path planning problem in an unknown environment must consider multiple possible states and transition probabilities, which leads to a high computational complexity.
4. Method
In order to reduce the frequency of blind trial and error for UAVs during the search process, we design a whale-inspired deep Q-network in which the WOA is used to accumulate prior knowledge about action selection. In this section, we first briefly introduce the basic principles of the WOA and the DQN and then describe the proposed algorithm in detail.
4.1. Action Decision Based on the WOA
The WOA is an optimization algorithm inspired by the foraging behavior of whales in nature. Each UAV is regarded as a whale, and each whale has three hunting models: the encircling model, the searching model, and the bubble-net attacking model.
4.1.1. Encircling Model
At this time, the whale swims toward the optimal individual, and its position is updated as follows:

$$\mathbf{X}(t+1) = \mathbf{X}^{*}(t) - \mathbf{A} \cdot \mathbf{D}, \qquad \mathbf{D} = \left| \mathbf{C} \cdot \mathbf{X}^{*}(t) - \mathbf{X}(t) \right|,$$

with the coefficient vectors

$$\mathbf{A} = 2a \cdot \mathbf{r}_{1} - a, \qquad \mathbf{C} = 2\mathbf{r}_{2}, \qquad a = 2\left(1 - \frac{t}{T}\right),$$

where $t$ represents the current number of iterations, $\mathbf{X}^{*}(t)$ represents the whale currently in the optimal position, $\mathbf{r}_{1}$ and $\mathbf{r}_{2}$ represent random variables in the range of $[0, 1]$, respectively, and $T$ represents the maximum number of iterations. The initial value of $a$ is set to 2, and it decreases to 0 as the number of iterations increases.
4.1.2. Searching Model
At this time, the whale randomly selects an individual to approach, and its position is updated as follows:

$$\mathbf{X}(t+1) = \mathbf{X}_{rand}(t) - \mathbf{A} \cdot \mathbf{D}, \qquad \mathbf{D} = \left| \mathbf{C} \cdot \mathbf{X}_{rand}(t) - \mathbf{X}(t) \right|,$$

where $\mathbf{X}_{rand}(t)$ is the position of a whale randomly selected from the current group.
4.1.3. Bubble-Net Attacking Model
Whales also need to constantly adjust their positions when using bubble nets to drive prey. In this case, the position update formula of the whale is as follows:

$$\mathbf{X}(t+1) = \mathbf{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \mathbf{X}^{*}(t), \qquad \mathbf{D}' = \left| \mathbf{X}^{*}(t) - \mathbf{X}(t) \right|,$$

where $b$ is a constant that controls the shape of the logarithmic spiral (the default value is 1), and $l$ is a random number uniformly distributed in the range $[-1, 1]$.
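To make the three hunting models concrete, the following minimal sketch shows how a whale's position could be updated in Python. The variable names (`best_pos`, `rand_pos`) and the probability `spiral_prob` of choosing the bubble-net update are illustrative assumptions rather than details specified in the paper.

```python
import numpy as np

def woa_update(pos, best_pos, population, t, T, b=1.0, spiral_prob=0.5):
    """One WOA position update for a single whale (illustrative sketch).

    pos        : current position of the whale (1-D numpy array)
    best_pos   : position of the best whale found so far
    population : array of all whale positions, used by the searching model
    t, T       : current and maximum iteration counts
    """
    a = 2.0 * (1.0 - t / T)              # linearly decreases from 2 to 0
    r1, r2 = np.random.rand(), np.random.rand()
    A = 2.0 * a * r1 - a                 # coefficient in [-a, a]
    C = 2.0 * r2                         # coefficient in [0, 2]

    if np.random.rand() < spiral_prob:   # bubble-net attacking model
        l = np.random.uniform(-1.0, 1.0)
        D = np.abs(best_pos - pos)
        return D * np.exp(b * l) * np.cos(2.0 * np.pi * l) + best_pos
    if np.abs(A) < 1.0:                  # encircling model: move toward the best whale
        D = np.abs(C * best_pos - pos)
        return best_pos - A * D
    # searching model: move toward a randomly chosen whale
    rand_pos = population[np.random.randint(len(population))]
    D = np.abs(C * rand_pos - pos)
    return rand_pos - A * D
```

As in the standard WOA, the shrinking coefficient $a$ shifts the balance from exploration (searching model) toward exploitation (encircling and spiral models) as iterations progress.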
4.2. Path Exploration Based on DQN
RL is a machine learning method in which the agent obtains feedback and rewards from the environment by constantly trying different actions and gradually forms an optimal strategy. The interaction between the agent and the environment can be represented by the Markov decision process (MDP) tuple $(S, A, P, R, \gamma)$, where $S$ represents the state space of the agent, $A$ represents the action space of the agent, $R$ represents the immediate reward obtained by the agent after taking an action in the current state, $P$ represents the transition probability of the agent from the current state to the next state after taking an action, and $\gamma$ is a discount factor.
The core idea of the DQN is to use a deep neural network (DNN) to approximate the Q-value function, which evaluates the value of taking a specific action in a given state, thereby maximizing the discounted cumulative reward $R_{t}$ in the reinforcement learning environment:

$$R_{t} = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k},$$

where $\gamma$ represents the discount factor and $r_{t}$ represents the instant reward obtained at time $t$.
The DQN follows the basic idea of the Q-learning algorithm, which guides the agent to take the best action in a given state by learning a value function $Q(s, a)$. Unlike traditional Q-learning, the DQN uses a DNN to approximate the optimal value function. Specifically, the Q-network takes a state $s$ as input and outputs a Q-value for each action $a$. By optimizing the parameters of the Q-network, it can accurately predict the cumulative reward of performing each action in a given state; that is, as shown in Equation (7),

$$Q(s, a; \theta) \approx Q^{*}(s, a),$$

where $\theta$ represents the weights of the Q-network. The optimal Q-function, shown in Equation (8), is the maximum expected return achievable by any policy $\pi$:

$$Q^{*}(s, a) = \max_{\pi} \mathbb{E}\left[ R_{t} \mid s_{t} = s, a_{t} = a, \pi \right].$$
The function $Q^{*}(s, a)$ returns the maximum reward over all strategies and can be expressed by the Bellman equation as follows:

$$Q^{*}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \right],$$

and the target value is computed with the target Q-network as

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}),$$

where $s'$ and $a'$ represent the state and action in the next time step, respectively, $\theta^{-}$ represents the weights of the target Q-network, and $y$ represents the target value.
In the DQN, the parameters $\theta$ of the Q-network are updated by optimizing the loss function, that is, by minimizing the mean square error between the target Q-value and the current Q-value. The loss function is

$$L(\theta) = \mathbb{E}\left[ \left( y - Q(s, a; \theta) \right)^{2} \right].$$
We choose the gradient descent method to update the parameters of the Q-network so that it gradually approximates the true Q-value function. The update formula of parameter $\theta$ is

$$\theta \leftarrow \theta - \alpha \nabla_{\theta} L(\theta),$$

where $\alpha$ represents the learning rate, which controls the step size of the parameter update, $\nabla$ represents the gradient operation, and $\nabla_{\theta} L(\theta)$ represents the gradient of the loss function with respect to the network parameters. The network parameters are updated iteratively so that the Q-function gradually approaches the true Q-value function. Finally, we can obtain the appropriate action through the well-trained network.
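As a concrete illustration of the target value and loss described above, the following sketch computes both for a sampled minibatch. The use of PyTorch, the network architecture, and the tensor names (`q_net`, `target_net`, `batch`) are assumptions made for this example, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP that maps a state vector to one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Target y = r + gamma * max_a' Q(s', a'; theta^-) and the MSE loss."""
    s, a, r, s_next, done = batch                            # minibatch tensors
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a; theta)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values        # max_a' Q(s', a'; theta^-)
        y = r + gamma * (1.0 - done) * q_next                # no bootstrap at terminal states
    return nn.functional.mse_loss(q_sa, y)
```

A gradient step then follows the update rule above: `loss.backward()` followed by an optimizer step (Adam, as described in Section 4.3.3).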
4.3. The Autonomous Path Planning Strategy Based on WDQN
For the path planning task, we propose a whale-inspired DQN algorithm. The core idea of the WDQN is to use the WOA to provide the DQN with the decision-making experience required for training, improve the interaction efficiency between the agents and the environment, and accelerate the convergence of the DQN algorithm. The framework of the WDQN algorithm consists of four parts, as shown in Figure 2. First, the input of the proposed algorithm is a two-dimensional vector composed of the real-time position of each UAV and the environment information. Second, we design a comprehensive reward function that generates dynamic rewards in real time based on the environment information, which gives the UAVs good control performance. Third, each UAV is regarded as a whale, and its fitness value is calculated from the reward obtained during training; the action decisions output by the WOA are added to the replay buffer of the Q-network for training, so that the model obtains more valuable experience in the early stage of training. Finally, we increase the complexity of the environment step by step to improve the robustness and generalization of the algorithm.
4.3.1. Reward Function
The setting of the reward function is crucial to the performance of reinforcement learning algorithms. Choosing an appropriate reward function can effectively promote the convergence of the algorithm, while an inappropriate one may make it difficult for the algorithm to converge [42]. In traditional reinforcement learning algorithms, a learner is rewarded only after completing a task, and there is no reward for the preceding series of behaviors. It has been pointed out that this reward mechanism leads to the sparse reward problem in complex environments [43]. When the set of environmental states is large, the learner passes through a series of states without feedback before completing the task; since effective rewards cannot be obtained in time, the algorithm is difficult to converge. To solve this problem, we design a comprehensive reward function, as shown in Equation (14):

$$R = R_{tar} + R_{obs} + R_{ene}.$$

The reward function is divided into three parts: the target reward function $R_{tar}$, the obstacle avoidance reward function $R_{obs}$, and the energy consumption reward function $R_{ene}$, which evaluate the directionality, safety, and efficiency of the model, respectively:
(1) The target reward function $R_{tar}$ is used to guide the UAV to reach the target position quickly. It is computed from the weight $\omega_{tar}$ of the target reward function, the initial Euclidean distance $d_{0}$ between the UAV and the target node, the Euclidean distance $d_{t}$ between the UAV and the target node at time $t$, and the displacement of the UAV from time step $t-1$ to $t$.
(2) The obstacle avoidance reward function $R_{obs}$ is a negative reward that measures the safe distance between the UAV and nearby obstacles, guiding the UAV away from obstacles and ensuring the safety of the planned path. It is computed from the weight $\omega_{obs}$ of the obstacle avoidance reward function and the distance $d_{obs}$ between the UAV and nearby obstacles. If there are no obstacles around the current position of the UAV, then $R_{obs} = 0$; otherwise, $d_{obs}$ is the sum of the Euclidean distances to all surrounding obstacles.
(3) To make the UAV tend to plan shorter paths, the energy consumption reward function $R_{ene}$ is designed to measure the remaining energy; it is also a negative reward, so the faster the search, the higher the reward. It is computed from the weight $\omega_{ene}$ of the energy consumption reward function, the current energy consumption value $E_{t}$, and the initial energy value $E_{0}$.
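The sketch below shows one possible implementation of the comprehensive reward. The concrete formulas inside each term (progress toward the target, an inverse-distance obstacle penalty, a consumed-energy penalty) and the weights `w_tar`, `w_obs`, and `w_ene` are illustrative assumptions, since the paper specifies only the three components and their signs.

```python
import math

def comprehensive_reward(uav_pos, target_pos, d_init, obstacles, e_used, e_init,
                         w_tar=1.0, w_obs=0.5, w_ene=0.1, prev_dist=None):
    """R = R_tar + R_obs + R_ene (illustrative sketch, not the paper's exact formulas)."""
    d_t = math.dist(uav_pos, target_pos)

    # Target term: reward progress toward the target, normalized by the initial distance.
    progress = (prev_dist - d_t) if prev_dist is not None else (d_init - d_t)
    r_tar = w_tar * progress / max(d_init, 1e-6)

    # Obstacle term: negative reward that grows as the UAV gets closer to obstacles.
    if obstacles:
        r_obs = -w_obs * sum(1.0 / max(math.dist(uav_pos, o), 1e-6) for o in obstacles)
    else:
        r_obs = 0.0

    # Energy term: negative reward proportional to the fraction of energy consumed.
    r_ene = -w_ene * e_used / max(e_init, 1e-6)

    return r_tar + r_obs + r_ene
```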
4.3.2. Complexity of Task Scene
In this paper, the reinforcement learning model is trained with gradually increasing task-scene complexity. In order to simulate the disaster-affected scene, we assume that the task scene of the UAVs contains $N_{obs}$ obstacles (e.g., buildings, trees, and vehicles) and use a four-tuple $(x, y, l, w)$ to describe the attributes of each obstacle, where $x$ and $y$ represent the abscissa and ordinate of the obstacle center, respectively, and $l$ and $w$ represent the length and width of the obstacle, respectively. Obviously, as the number of obstacles increases, the complexity of UAV path planning increases. Therefore, we assume that the complexity $L$ of the task scene is proportional to the number of obstacles $N_{obs}$, the training scene starts at the lowest complexity level, and $N_{obs}$ is a random integer drawn from an interval determined by $L$. In the initial stage of training, we use simple and easily avoidable obstacles so that the UAV can quickly learn basic path planning techniques. As the training progresses, when the number $N_{s}$ of UAVs that find the target exceeds 80% of the total number of UAVs $N$, the complexity $L$ of the scene is upgraded, meaning that the terrain gradually becomes more complex and the passages become narrower; otherwise, the difficulty remains the same. Namely,

$$L \leftarrow \begin{cases} L + 1, & N_{s} > 0.8N, \\ L, & \text{otherwise.} \end{cases}$$
4.3.3. WDQN for Multi-UAV Path Planning
In this paper, we use the Adam algorithm to adaptively adjust the learning rate and optimize the model parameters by computing the first-order and second-order moment estimates of the gradient, so as to speed up convergence and improve the robustness and stability of the algorithm. The parameters are updated as follows:

$$\theta_{t+1} = \theta_{t} - \frac{\alpha}{\sqrt{\hat{v}_{t}} + \epsilon} \hat{m}_{t}, \qquad \hat{m}_{t} = \frac{m_{t}}{1 - \beta_{1}^{t}}, \qquad \hat{v}_{t} = \frac{v_{t}}{1 - \beta_{2}^{t}},$$

where $\hat{m}_{t}$ represents the value of the first-order moment estimator $m_{t}$ after bias correction, $\hat{v}_{t}$ represents the value of the second-order moment estimator $v_{t}$ after bias correction, and $\epsilon$ is an offset used to prevent division by zero. The moment estimators are updated as

$$m_{t} = \beta_{1} m_{t-1} + (1 - \beta_{1}) g_{t}, \qquad v_{t} = \beta_{2} v_{t-1} + (1 - \beta_{2}) g_{t}^{2},$$

where $\beta_{1}$ and $\beta_{2}$ represent the exponential decay rates of the first-order and second-order moment estimates, respectively, and $g_{t}$ is the gradient of the loss function at step $t$.
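For completeness, a compact NumPy version of this Adam update is sketched below; in practice one would use a library implementation (e.g., torch.optim.Adam), and the default hyperparameter values shown are common choices rather than the paper's settings.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, then the parameter step."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-order moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-order moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```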
In order to guide the UAV to choose actions reasonably, we use the $\epsilon$-greedy strategy to balance exploration and exploitation, thereby preventing the algorithm from falling into local optima. Decreasing $\epsilon$ gradually during training makes the Q-network perform more random exploration in the initial stage, which helps it try more actions; as training progresses, $\epsilon$ is gradually decreased so that the agent becomes more inclined to choose the optimal action for better performance. $\epsilon$ is updated at each step according to Equation (26) and is bounded from below by $\epsilon_{min}$, where $\epsilon_{min}$ represents the minimum value of the exploration rate.
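As an illustration, the snippet below implements $\epsilon$-greedy action selection with a simple multiplicative decay clipped at $\epsilon_{min}$; the decay form is an assumption, since the exact schedule of Equation (26) is not reproduced here.

```python
import random

def decay_epsilon(epsilon: float, decay: float = 0.995, eps_min: float = 0.05) -> float:
    """Multiplicative decay bounded below by eps_min (assumed form of the schedule)."""
    return max(eps_min, epsilon * decay)

def select_action(q_values, epsilon: float) -> int:
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```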
In summary, we give a detailed description of the WDQN-based multi-UAV autonomous path planning algorithm (see Algorithm 1).
Algorithm 1 Pseudocode of the WDQN

Input: Q-network weights $\theta$, maximum iterations $T$, learning rate $\alpha$, number of whales $W$, discount factor $\gamma$;
Output: The optimal policy $\pi^{*}$;
1: Initialize the Q-network $Q$ and the target Q-network $\hat{Q}$, the positions of the UAVs and the trapped people, the obstacle level $L$, and the greedy probability $\epsilon$; set $t = 0$;
2: for episode $= 1, 2, \ldots, T$ do
3:   while the target has not been reached and the step limit has not been exceeded do
4:     update $\epsilon$ according to Equation (26)
5:     draw a random number to decide how the action is selected
6:     if the greedy action is selected then
7:       select $a_{t} = \arg\max_{a} Q(s_{t}, a; \theta)$
8:     else if the WOA-guided action is selected then
9:       select $a_{t}$ according to the WOA
10:    else
11:      select a random action $a_{t}$
12:    end if
13:    execute action $a_{t}$ and observe reward $r_{t}$ and new state $s_{t+1}$
14:    store the transition $(s_{t}, a_{t}, r_{t}, s_{t+1})$ in the replay buffer
15:    set $s_{t} \leftarrow s_{t+1}$
16:    if the replay buffer is full then
17:      sample a random minibatch of transitions from the replay buffer
18:      set the target value according to Equation (10)
19:      calculate the loss according to Equation (11)
20:      calculate the gradient according to Equation (13)
21:      update the first moment estimator according to Equation (24)
22:      update the second moment estimator according to Equation (25)
23:      calculate the bias-corrected first moment estimator according to Equation (22)
24:      calculate the bias-corrected second moment estimator according to Equation (23)
25:      update the Q-network weights according to Equation (21)
26:      every $C$ steps set $\hat{Q} = Q$
27:    end if
28:  end while
29:  calculate the level of obstacles $L$ according to Equation (20)
30: end for
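To show how the pieces of Algorithm 1 fit together, the following training-loop skeleton combines the earlier sketches (it reuses `QNetwork`, `dqn_loss`, `decay_epsilon`, and `update_level` defined above). The environment interface (`env.reset`, `env.step`, `env.woa_action`), the buffer size, and the probability split between random and WOA-guided exploration are all assumptions, not details taken from the paper.

```python
import random
from collections import deque
import torch

def train_wdqn(env, state_dim, n_actions, episodes=500, gamma=0.99,
               buffer_size=10_000, batch_size=64, target_sync=100,
               epsilon=1.0, eps_min=0.05, woa_prob=0.3):
    """Skeleton of Algorithm 1; `env` is a hypothetical gym-like environment."""
    q_net = QNetwork(state_dim, n_actions)            # from the earlier DQN sketch
    target_net = QNetwork(state_dim, n_actions)
    target_net.load_state_dict(q_net.state_dict())
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)   # Adam update step
    buffer, level, step = deque(maxlen=buffer_size), 1, 0

    for episode in range(episodes):
        s, done = env.reset(level), False
        while not done:
            epsilon = decay_epsilon(epsilon, eps_min=eps_min)    # epsilon schedule
            p = random.random()
            if p > epsilon:                                      # greedy action
                a = int(q_net(torch.as_tensor(s, dtype=torch.float32)).argmax())
            elif p > epsilon * woa_prob:                         # WOA-guided action (assumed split)
                a = env.woa_action(s)                            # hypothetical helper
            else:
                a = random.randrange(n_actions)                  # random exploration
            s_next, r, done = env.step(a)
            buffer.append((s, a, r, s_next, float(done)))
            s = s_next
            step += 1
            if len(buffer) >= batch_size:
                batch = [torch.as_tensor(x, dtype=torch.float32)
                         for x in zip(*random.sample(buffer, batch_size))]
                batch[1] = batch[1].long()                       # action indices
                loss = dqn_loss(q_net, target_net, batch, gamma) # target value and MSE loss
                optimizer.zero_grad(); loss.backward(); optimizer.step()
                if step % target_sync == 0:                      # periodic target-network sync
                    target_net.load_state_dict(q_net.state_dict())
        level = update_level(level, env.num_successful_uavs, env.num_uavs)  # curriculum step
    return q_net
```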
6. Conclusions
In this paper, we study the path planning problem of multiple UAVs in emergency rescue scenarios based on the RL model and propose the WDQN-based path planning algorithm. The WDQN overcomes the limitations of traditional path planning algorithms in uncertain task scenes and improves the learning efficiency of RL models, enabling the UAV to autonomously plan a feasible collision-free path in an uncertain scene. Through continuous trial and error and studying feedback from the environment, the model gradually adapts and explores feasible collision-free paths in different scenes. In addition, we train the Q-network by gradually increasing the difficulty of obstacle avoidance, which enhances the robustness of the algorithm. In order to test the effectiveness of the proposed algorithm, we conduct experiments under different difficulty levels of obstacle avoidance, different numbers of task targets, and different scales of task areas. The results show that the proposed algorithm can successfully find targets and plan safe and feasible paths in all task scenes. Moreover, compared with several other RL algorithms, the proposed algorithm shows a better performance in terms of training efficiency, task completion rate, and average path length.
In three-dimensional space, the UAV needs to consider movement in both the horizontal and vertical directions, as well as different rotation angles. This increases the number of actions in the UAV's action set to 27, which in turn increases the complexity of the action set and enlarges the search space of the path planning algorithm. In addition, flight jitter can destabilize the UAV's path, resulting in increased energy consumption and flight risks, particularly in three-dimensional space. Consequently, in future work, we plan to propose a path planning strategy that addresses the complexity of real-world three-dimensional environments, aiming to reduce the search space, accelerate the path planning process, and generate high-quality paths.
Furthermore, UAV path planning in large-scale mission environments is also a challenging problem. First of all, in large-scale task scenarios, UAVs need to plan paths in complex environments, avoid obstacles, and take into account factors such as task priorities and time constraints. This requires path planning algorithms to efficiently handle large-scale maps and tasks while ensuring the safety and efficiency of the path. In future work, we will consider dividing the large-scale region into multiple small task regions, decomposing the complex path planning problem into multiple relatively simple sub-problems. By decomposing the problem, we will reduce the computational complexity and the difficulty of path planning and design appropriate allocation algorithms and mechanisms to achieve intra-regional and inter-regional path planning and collaboration.