Article

A Two-Stage Target Search and Tracking Method for UAV Based on Deep Reinforcement Learning

School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen 518107, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(10), 544; https://doi.org/10.3390/drones8100544
Submission received: 9 August 2024 / Revised: 25 September 2024 / Accepted: 29 September 2024 / Published: 1 October 2024

Abstract

To deal with the complexities of decision-making for unmanned aerial vehicles (UAVs) in denial environments, this paper applies deep reinforcement learning algorithms to search and rescue (SAR) tasks. It proposes a two-stage target search and tracking method for UAVs based on deep reinforcement learning, which divides SAR tasks into search and tracking stages, and the controllers for each stage are trained based on the proposed deep deterministic policy gradient with three critic networks (DDPG-3C) algorithm. Simulation experiments are carried out to evaluate the performance of each stage in a two-dimensional rectangular SAR scenario, including search, tracking, and the integrated whole stage. The experimental results show that the proposed DDPG-3C model can effectively alleviate the overestimation problem, and hence results in a faster convergence and improved performance during both the search and tracking stages. Additionally, the two-stage target search and tracking method outperforms the traditional single-stage approach, leading to a more efficient and effective decision-making ability in SAR tasks.

1. Introduction

Unmanned aerial vehicles (UAVs) are aircraft that fly autonomously without an onboard pilot and are notable for their convenience, flexibility, low cost, and wide range of applications [1]. Because no pilot is on board, UAVs can perform a variety of high-risk or otherwise inaccessible tasks in complex and changing environments, including terrain mapping, precision agriculture (PA), surveillance and reconnaissance, power line inspections, search and rescue (SAR), film shooting, aerial delivery, and military operations [2,3,4,5,6,7,8].
Many UAV tasks fundamentally rely on target tracking. For instance, in security management, it is imperative to track, encircle, and apprehend criminals when a terrorist event occurs; in such circumstances, UAVs can track the suspects and even carry light weapons to assist the police. Another vivid example is search and rescue (SAR) after a disaster. When natural disasters occur, the terrain in the affected area is complex and communication is interrupted, posing significant challenges to SAR tasks. In such scenarios, UAVs equipped with advanced sensors such as high-definition cameras and infrared thermal imagers can quickly enter these areas, conduct aerial patrols, track targets, and provide valuable information for SAR operations thanks to their aerial superiority.
With the rapid development of technology, UAVs have achieved significant improvements in intelligence, endurance, payload capacity, and resistance to interference, broadening their applicability across diverse scenarios. Despite these improvements, UAVs operating in SAR tasks often encounter unique challenges, such as infrastructure damage and severe natural conditions, which can disrupt communications. This disruption severely impacts the UAVs' ability to perceive their environment, diminishing the success rate of rescues. Furthermore, in denial environments, communication between UAVs and the ground command center is disrupted, so fast and reliable communication cannot be guaranteed. Each UAV is therefore required to complete tasks independently and intelligently under these severe conditions, achieving as high a degree of autonomy as possible.
The investigation into how unmanned aerial vehicles (UAVs) can make rapid and effective maneuvering decisions based on real-time observation information to accomplish tasks in unknown and complex environments efficiently has emerged as a significant area of research interest. In recent years, advanced methods such as model predictive control, optimization-based methods, and particle swarm optimization (PSO) algorithms have been successfully applied in UAVs. These algorithms empower UAVs to navigate and complete complex tasks within unfamiliar environments by improving route planning and decision-making processes.
Faced with denial environments without external positioning systems, Bachrach et al. [9] presented a method to enable a quadrotor helicopter equipped with a laser rangefinder to autonomously explore and map unstructured indoor environments. Mac et al. [10] introduced a novel method combining sensor fusion for localization and an improved potential field method for UAVs in obstacle avoidance tasks, alongside an optimally tuned PID controller using multi-objective particle swarm optimization. However, their methods relied heavily on onboard sensors, which may limit performance in more complex or dynamically changing environments. Rothmund et al. [11] utilized scenario-based model predictive control for inspection drones to avoid obstacles in unknown environments. Although the UAV would change speed or take detours to avoid flying in potentially dangerous areas to mitigate risk, the system was limited by its reliance on a pre-planned path. At the same time, the high computational cost of this method made it difficult to meet real-time performance requirements. Similarly, Kulathunga et al. [12] introduced an optimization-based trajectory-tracking approach for multi-rotor aerial vehicles operating in unknown environments. The method utilized a dual-planner system consisting of a global planner and a local planner. The global planner adjusted the initial reference trajectory to avoid obstacles, while the local planner focused on generating optimal control policies for following this adjusted trajectory. This approach was based on precise mathematical models, which could be a limitation in unknown environments where model parameters are often difficult to obtain accurately. Saccani et al. [13] introduced a multi-trajectory model predictive control for UAV navigation in unknown and static environments by using LiDAR for obstacle detection and path planning that balanced safety and goal achievement. To address the problem of coordinating USVs and UAVs for maritime parallel search, Li et al. [14] proposed a dynamic event-triggered control mechanism to reduce communication load and a sensor-tolerant control to handle sensor faults. This method was validated through simulation experiments, demonstrating its effectiveness in maintaining formation and achieving full coverage of the search area. However, the method depended on threshold tuning and assumed the states of the USV and UAV were known, which is difficult to implement in practice. Additionally, it did not consider obstacles, limiting its applicability in complex environments.
In recent years, with the rapid development of artificial intelligence (AI), reinforcement learning (RL), as a branch of machine learning, has achieved remarkable results thanks to its strong performance and wide application in decision-making for complex tasks such as autonomous vehicles, robot control, game strategy optimization, and financial transactions [15,16,17]. Unlike traditional optimal control methods, the agent in reinforcement learning learns a policy through interaction with the environment, aiming to maximize cumulative rewards over time. It does not necessarily require a known model of the system dynamics and adapts through trial and error, making it better suited to handling uncertainties and unknown dynamics.
Pham et al. [18] integrated a PID controller with Q-learning, a type of reinforcement learning algorithm, to manage the UAV's trajectory. This combination enabled the UAV to dynamically adjust its path based on real-time inputs, learning from the environment to enhance both navigation precision and efficiency. Li et al. [19] enhanced a UAV's capability to swiftly adapt to unpredictable target movements by integrating deep reinforcement learning (DRL) with meta-learning. This approach significantly improved tracking accuracy and efficiency, making it particularly effective in diverse scenarios such as wildlife protection and emergency aid. For aerial robots in search and rescue tasks within unknown environments, Ramezani et al. [20] introduced an energy-aware hierarchical reinforcement learning approach that utilizes a predictive energy consumption algorithm to enhance the sustainability and endurance of missions. To address the real-time challenges in the decision-making process of UAVs, Zhao et al. [21] proposed adaptive and random exploration methods, ensuring that UAVs reach their target positions via reasonable and safe paths, enhancing their operational effectiveness and safety in dynamic environments. Zhang et al. [22] utilized the deep deterministic policy gradient (DDPG) algorithm to direct UAVs toward fixed-point targets; however, their method did not address the guidance and tracking of moving targets. In response to this limitation, Rodriguez-Ramos et al. [23] developed a Gazebo-based reinforcement learning framework that effectively applied the DDPG algorithm for continuous UAV landing on a moving ship, which improved the UAV's adaptability and performance in dynamic environments. However, the initial position of the UAV was fixed in their simulation, which could not effectively reflect the autonomy of the maneuvering decision-making process. To address UAV flight control in dynamic environments with random wind turbulence, Ma et al. [24] introduced an incremental reinforcement learning (IRL)-based algorithm, which employed policy relief (PR) for exploration and significance weighting (SW) to enhance learning. This method was validated through simulations and real-world tests, demonstrating good tracking performance under wind disturbances, though it operated under the assumption of a fully known environment and did not account for collision avoidance. Xia et al. [25] proposed an end-to-end cooperative multi-agent reinforcement learning (MARL) scheme for UAV target tracking, addressing limitations such as unknown trajectories and limited flight performance. Their work modeled communication between UAVs in the swarm, improving coordination and efficiency, and enhanced tracking with an energy-saving strategy and spatial information entropy, outperforming deep reinforcement learning baselines in simulations. However, it required position information, assumed half-duplex communication, and may face difficulties in communication-denied environments.
Faced with the complexities of autonomous decision-making for unmanned aerial vehicles (UAVs) in unknown environments, and to enable UAVs to autonomously make decisions based on limited observation information to adapt to unknown settings while maintaining strong generalization capabilities, this paper applies deep reinforcement learning algorithms to UAV search and rescue (SAR) tasks. A two-stage target search and tracking method for UAVs based on deep reinforcement learning is proposed. The novelties of this paper are listed as follows.
(1)
A deep deterministic policy gradient with three critic networks (DDPG-3C) algorithm is proposed to alleviate the overestimation problem in the critic network of the traditional DDPG algorithm. This method enhances both training speed and effectiveness by adopting three critic networks and introducing the experience replay buffer mechanism.
(2)
A two-stage target search and tracking method is introduced to enhance the search success rate and tracking performance of UAVs in SAR tasks within unknown environments. This method divides SAR tasks into the search stage and tracking stage, and the controllers for each stage are trained based on the proposed DDPG-3C algorithm.
(3)
A simplified two-dimensional SAR scenario is designed to demonstrate the practical application of the proposed methods and algorithms.
By introducing the DDPG-3C algorithm, this paper effectively alleviates the overestimation problem found in the traditional DDPG algorithm, leading to faster convergence and improved decision-making ability. The proposed two-stage target search and tracking method significantly improves the efficiency of UAV operations in SAR tasks, which not only provides a more efficient search strategy but also better adaptability to the target’s movements. Additionally, the development of a simplified two-dimensional SAR simulation environment provides a solid foundation for future research into more complex 3D environments.
The structure of this paper is as follows: Section 2 describes the search and rescue task and presents a simplified model of the SAR scenario. Section 3 introduces the proposed deep deterministic policy gradient with three critic networks (DDPG-3C) and its training process. Section 4 introduces the proposed two-stage search and tracking method for SAR tasks. In Section 5, the effectiveness of the proposed DDPG-3C model and the two-stage target search and tracking method is validated through extensive experiments. Finally, conclusions are drawn in Section 6.

2. Problem Description

2.1. Scenario Description

After a natural disaster, such as an earthquake, flood, or fire, the landscape is often devastated and full of challenges. Firstly, the terrain may change due to shaking or flooding, and roads and buildings may be severely damaged, with new obstacles such as fissures, craters, and inundated areas becoming prevalent. Additionally, the post-disaster environment may pose various potential hazards, such as communication disruptions, natural gas leaks, or secondary disasters triggered by unstable ground. These factors not only make it difficult to access affected areas but also significantly exacerbate the complexity of search and rescue (SAR) operations. In response, rescue teams have deployed advanced unmanned aerial vehicles (UAVs) for SAR tasks, as shown in Figure 1. These UAVs, such as the upgraded DJI Matrice 350 RTK (DJI, Shenzhen, China) [26], are equipped with sophisticated thermal imaging and night vision technologies, enabling effective operation in diverse and challenging conditions.
The task of these UAVs is to conduct aerial patrol and rescue. Once a UAV identifies trapped individuals or signs of life, it can swiftly switch to target tracking mode, which allows the UAV to continuously monitor the target to assist survivors or transport food, medicine, and other supplies.
Due to communication disruptions in disaster areas, these UAVs often operate in denial environments where communication with the command center is limited. These UAVs must rely on their onboard systems to perceive environments and execute tasks effectively. Furthermore, debris scattered around the disaster area, such as metal fragments and other materials, can pose threats to the radar and other systems of the UAVs. Therefore, the UAVs must avoid flying over areas that might disrupt their operational capabilities when executing search and rescue tasks.

2.2. Simplified Model

Simplifying the UAV search and rescue scenario in denial environments described in Section 2.1 to a two-dimensional search and rescue (SAR) scenario yields the simplified model shown in Figure 2.
The simplified SAR scenario is represented by a two-dimensional rectangular area of length $L$ and width $W$, in which our UAV is depicted by a dark blue UAV icon and the target awaiting rescue by a blue circular icon. The green lines at the boundary of the rectangular area represent barriers. Dark green circular areas of radius $R$ represent obstacles or danger zones. The UAV is equipped with sensors that observe the velocity and direction of approaching objects; its sensing range is indicated by a red circle in Figure 2.
The positions and velocities of the UAV and the target are randomly initialized. The target moves at its initial velocity and is assumed to undergo a perfectly elastic collision and continue moving when it hits an obstacle or barrier. The UAV acts by choosing a thrust vector that is added to its current velocity. The motion model of the UAV can be written as
$$v_x(k+1) = v_x(k) + \frac{F_x}{m}\,t, \qquad v_y(k+1) = v_y(k) + \frac{F_y}{m}\,t, \tag{1}$$
where $(F_x, F_y)$ is the thrust force vector applied to the body in the horizontal $x$ and vertical $y$ coordinates, restricted to $F_x, F_y \in [-F_{\max}, F_{\max}]$; $(v_x, v_y)$ is the velocity vector in the horizontal and vertical directions, restricted to $v_x, v_y \in [-v_{\max}, v_{\max}]$; $m$ is the mass of the UAV; and $t$ is the sampling interval.
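For illustration, a minimal Python sketch of this discrete-time update with saturation on both thrust and velocity is given below; the thrust limit of 400 N is taken from the experiment setup, while the mass, speed limit, and sampling interval are placeholder assumptions rather than the paper's values.

```python
import numpy as np

M = 1.0        # UAV mass m (kg), assumed
F_MAX = 400.0  # thrust limit F_max (N), from the experiment setup
V_MAX = 50.0   # speed limit v_max (m/s), assumed
DT = 0.1       # sampling interval t (s), assumed

def step_velocity(v: np.ndarray, thrust: np.ndarray) -> np.ndarray:
    """One step of Equation (1): v(k+1) = v(k) + F/m * t, with saturation."""
    thrust = np.clip(thrust, -F_MAX, F_MAX)   # respect the actuator limits
    v_next = v + thrust / M * DT              # Euler velocity update
    return np.clip(v_next, -V_MAX, V_MAX)     # respect the speed limits
```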
The mission of the UAV is to search for the target while avoiding obstacle areas and barriers. Once a target is detected, the UAV needs to approach it as quickly as possible and track it continuously. Because the communication-denied environment prevents contact with the control center, the UAV can only make decisions independently based on the information obtained from its sensors.

2.3. State and Action Space

In the SAR scenario, the UAV needs to perceive the current situation and make decisions independently based on the information obtained from sensors. An overly complicated state space reduces the efficiency of the agent in extracting and understanding important information and weakens the generalization ability of the learning algorithm, while an overly simple state space cannot provide sufficient information for the agent to make optimal decisions. When designing the state space, information about the UAV, the target, and the obstacles is considered, and the state space is shown in Equation (2):
$$S = \left[\, \mathbf{p}_{\mathrm{UAV}},\ \mathbf{v}_{\mathrm{UAV}},\ \mathbf{d}_{\mathrm{target}},\ \mathbf{v}_{\mathrm{target}},\ \mathbf{d}_{\mathrm{obs}} \,\right], \tag{2}$$
where $\mathbf{p}_{\mathrm{UAV}}, \mathbf{v}_{\mathrm{UAV}}, \mathbf{d}_{\mathrm{target}}, \mathbf{v}_{\mathrm{target}}, \mathbf{d}_{\mathrm{obs}} \in \mathbb{R}^2$. Table 1 explains the meaning of each symbol in the state space. If an object (target, obstacle, or barrier) is not within the sensor range of the UAV, its velocity is reported as 0 and its distance as 1.
The UAV has a continuous action space represented as a two-element vector $[F_x, F_y]$, corresponding to the forces generated by the horizontal and vertical thrust. Action values must lie in the range $[-F_{\max}, F_{\max}]$, taking the maneuverability of the UAV into account.
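As a concrete but hypothetical declaration, the state and action spaces could be written in a Gymnasium-style environment as follows; apart from the 400 N thrust limit and the 800 m area used in the experiments, the bounds (speed limit, normalized distances) are assumptions of this sketch.

```python
import numpy as np
from gymnasium import spaces

F_MAX = 400.0        # thrust limit F_max (N), from the scenario description
V_MAX = 50.0         # assumed speed bound (m/s)
L, W = 800.0, 800.0  # SAR area size used in the experiments (m)

# State of Equation (2): [p_UAV, v_UAV, d_target, v_target, d_obs], each in R^2.
# Distances are assumed to be normalized, consistent with reporting 1 when
# nothing is sensed.
low = np.array([0, 0, -V_MAX, -V_MAX, -1, -1, -V_MAX, -V_MAX, -1, -1], dtype=np.float32)
high = np.array([L, W, V_MAX, V_MAX, 1, 1, V_MAX, V_MAX, 1, 1], dtype=np.float32)
observation_space = spaces.Box(low=low, high=high)

# Action: thrust components [F_x, F_y], each bounded by F_max.
action_space = spaces.Box(low=-F_MAX, high=F_MAX, shape=(2,), dtype=np.float32)
```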

3. Improved DDPG Algorithm

3.1. DDPG

Deep deterministic policy gradient (DDPG) is a reinforcement learning algorithm based on policy gradients [27] that combines ideas from the deep Q-network (DQN) and the Actor–Critic (AC) framework and can be used to solve decision-making problems in high-dimensional continuous action spaces. The main idea is to use deep neural networks (an actor network and a critic network) to approximate the behavior policy function and the state-action value function, referred to as the policy network and the value network, respectively.
The actor network is responsible for mapping states s t to actions a t :
$$a_t = \mu\left(s_t \mid \theta^{\mu}\right), \tag{3}$$
where $s_t$ is the state at time $t$ and $\theta^{\mu}$ are the parameters of the actor network. This network outputs a deterministic action for any given state, aiming to maximize future rewards. The actions it selects are based on its current policy, and during training it adjusts its parameters to improve the policy based on the gradients received from the critic network.
The critic network estimates the value of the state-action pairs:
$$Q\left(s_t, a_t \mid \varphi^{Q}\right), \tag{4}$$
where $s_t$ is the state, $a_t$ is the action, and $\varphi^{Q}$ are the parameters of the critic network. The critic network evaluates the expected return of taking a specific action in a given state according to the actor's policy.
The actor network is trained to maximize the Q-values predicted by the critic network. The policy gradient used for updating the actor’s parameters is calculated as:
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q\left(s, a \mid \varphi^{Q}\right)\Big|_{s=s_i,\, a=a_i}\ \nabla_{\theta^{\mu}} \mu\left(s \mid \theta^{\mu}\right)\Big|_{s_i}, \tag{5}$$
where $\nabla_{\theta^{\mu}} \mu\left(s \mid \theta^{\mu}\right)$ is the gradient of the policy with respect to its parameters; $\nabla_{a} Q\left(s, a \mid \varphi^{Q}\right)$ is the gradient of the action-value function with respect to the action; and $N$ is the batch size, i.e., the number of experience tuples sampled from the experience replay buffer during each training iteration.
The critic network is updated by minimizing the squared Bellman error:
$$L = \frac{1}{N} \sum_{i} \left( y_i - Q\left(s_i, a_i \mid \varphi^{Q}\right) \right)^2, \tag{6}$$
where $y_i$ is the target value, calculated from the obtained rewards and the next-state values estimated by the target critic network.

3.2. DDPG-3C

The DDPG algorithm, while widely applied to decision-making problems with continuous action spaces, is known to exhibit an overestimation bias in Q-values, which can hinder the convergence of the policy. This issue primarily results from maximization bias and delayed target network updates [28,29]. To alleviate the overestimation problem, this paper follows the idea of double Q-learning and applies it to the DDPG algorithm.
In this paper, an improved deep deterministic policy gradient with three critic networks (DDPG-3C) is proposed. The framework of the DDPG-3C decision model is shown in Figure 3.
As shown in Figure 3, DDPG-3C uses an Actor–Critic (AC) framework with eight networks: one actor network $\mu(\cdot \mid \theta^{\mu})$, one target actor network $\mu'(\cdot \mid \theta^{\mu'})$, three critic networks $Q_1(\cdot \mid \varphi^{Q_1})$, $Q_2(\cdot \mid \varphi^{Q_2})$, $Q_3(\cdot \mid \varphi^{Q_3})$, and three target critic networks $Q_1'(\cdot \mid \varphi^{Q_1'})$, $Q_2'(\cdot \mid \varphi^{Q_2'})$, $Q_3'(\cdot \mid \varphi^{Q_3'})$. The UAV is treated as an agent containing the DDPG-3C decision model. When the agent receives a state $s_i$ from the environment, the actor network outputs an action $a_i$ according to this state; the environment then transitions to the next state $s_i'$ and returns a reward $r_i$. The agent stores the tuple $(s, a, r, s')$ in the experience replay buffer and updates its critic and actor networks by sampling mini-batches from this buffer.
The actor network is responsible for mapping states to actions. It attempts to learn a policy that maximizes future expected returns:
$$\mathrm{Loss} = -J(\theta^{\mu}), \qquad J(\theta^{\mu}) = \mathbb{E}_{s \sim \rho^{\mu}}\left[ R_1 \mid s_0 = s,\ \theta^{\mu} \right], \tag{7}$$
where $R_1$ denotes the return obtained starting from state $s_0$; $\rho^{\mu}$ is the state distribution under the policy $\mu$; and $\theta^{\mu}$ are the parameters of the actor network. To optimize this policy, the gradient of $J(\theta^{\mu})$ with respect to the parameters $\theta^{\mu}$ is computed, typically via the policy gradient theorem [30], leading to the update gradient shown in Equation (5). It should be noted that when calculating the loss of the actor network, only the first critic network is used to predict the Q-value; this reduces computational cost, since not all three estimates are needed for the policy update. The parameters of the target actor network are updated using a soft update rule weighted by a factor $\tau$:
$$\theta^{\mu'} = \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}. \tag{8}$$
Unlike traditional DDPG, which uses a single critic network, DDPG-3C employs three critic networks $Q_1, Q_2, Q_3$ and three target critic networks $Q_1', Q_2', Q_3'$ to independently estimate the Q-values. The three critic networks have identical structures, each estimating the Q-value for a given state-action pair $(s, a)$. To reduce the impact of overestimation, DDPG-3C discards the highest of the three target Q-value estimates and uses the average of the remaining two as the target Q-value:
$$y_t = r_t + \frac{1}{2}\,\gamma \sum_{i=1}^{2} Q_i'\left(s', a' \mid \varphi^{Q_i'}\right), \tag{9}$$
where $r_t$ is the reward obtained from the environment at step $t$; $\gamma$ is the discount factor, used to discount future rewards to their present value; $a'$ is the action output by the target actor network for the next state $s'$; and $Q_i'\left(\cdot \mid \varphi^{Q_i'}\right)$ are the outputs of the two remaining target critic networks. The loss of the three critic networks is
$$\mathrm{Loss} = \mathrm{MSE}\!\left(Q_1\left(s, a \mid \varphi^{Q_1}\right), y_t\right) + \mathrm{MSE}\!\left(Q_2\left(s, a \mid \varphi^{Q_2}\right), y_t\right) + \mathrm{MSE}\!\left(Q_3\left(s, a \mid \varphi^{Q_3}\right), y_t\right), \tag{10}$$
where $\mathrm{MSE}$ denotes the mean squared error and $Q_i\left(s, a \mid \varphi^{Q_i}\right)$ are the outputs of the critic networks. Once the loss is computed, gradients are propagated back through each critic network to update its parameters $\varphi^{Q_i}$, leading to the following update gradient:
$$\nabla_{\varphi^{Q}} J(\varphi^{Q}) \approx \frac{1}{N} \sum_{i} \nabla_{\varphi^{Q}} \left[ \mathrm{MSE}\!\left(Q_1, y_t\right) + \mathrm{MSE}\!\left(Q_2, y_t\right) + \mathrm{MSE}\!\left(Q_3, y_t\right) \right]\Big|_{s=s_i,\, a=a_i}, \tag{11}$$
where $N$ is the batch size and $\nabla_{\varphi^{Q}}$ is the gradient operator with respect to the network parameters $\varphi^{Q}$. The parameters of the three target critic networks are updated using a soft update rule weighted by a factor $\tau$:
$$\varphi^{Q'} = \tau\,\varphi^{Q} + (1-\tau)\,\varphi^{Q'}. \tag{12}$$
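A minimal PyTorch-style sketch of the three-critic target and loss (Equations (9) and (10)) is shown below; the network interfaces, e.g. `critic(s, a)` returning a batch of Q-values, and the default discount factor are assumptions of this sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ddpg3c_critic_loss(critics, target_critics, target_actor,
                       s, a, r, s_next, gamma=0.99):
    """Critic update of DDPG-3C: discard the highest of the three target
    Q-value estimates and average the remaining two (Equations (9)-(10))."""
    with torch.no_grad():
        a_next = target_actor(s_next)                        # a' = mu'(s' | theta^mu')
        q_next = torch.stack([tc(s_next, a_next) for tc in target_critics])
        q_sorted, _ = torch.sort(q_next, dim=0)              # ascending per sample
        y = r + gamma * q_sorted[:2].mean(dim=0)              # target y_t of Equation (9)
    # Sum of the three MSE terms of Equation (10).
    return sum(F.mse_loss(c(s, a), y) for c in critics)
```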

3.3. Training Procedure of DDPG-3C

The training procedures of the proposed DDPG-3C decision algorithm are shown in Algorithm 1.
Algorithm 1: Training process of the DDPG-3C decision algorithm
Initialize: actor network parameters $\theta^{\mu}$, target actor network parameters $\theta^{\mu'}$, critic network parameters $\varphi^{Q_1}, \varphi^{Q_2}, \varphi^{Q_3}$, target critic network parameters $\varphi^{Q_1'}, \varphi^{Q_2'}, \varphi^{Q_3'}$;
Input: $\theta^{\mu}$, $\theta^{\mu'}$, $\varphi^{Q_1}, \varphi^{Q_2}, \varphi^{Q_3}$, $\varphi^{Q_1'}, \varphi^{Q_2'}, \varphi^{Q_3'}$, learning rate $\beta$, discount factor $\gamma$, soft update factor $\tau$
Output: optimal parameters $\theta^{\mu*}$, $\varphi^{Q_1*}, \varphi^{Q_2*}, \varphi^{Q_3*}$;
for episode = 1 to $M$ do
   Receive observation $s_0$;
   for $t$ = 1 to max episode length do
      Select $a_t$ with the current actor network $\mu(s_t \mid \theta^{\mu})$;
      Execute action $a_t$ and observe reward $r_t$ and new state $s_{t+1}$;
      Store $(s_t, a_t, r_t, s_{t+1})$ in the experience replay buffer $D$;
      Sample a random minibatch from $D$;
      Calculate the target Q-value as in Equation (9);
      Update the critics by minimizing the loss in Equation (10);
      Update the actor using the sampled policy gradient in Equation (5);
      Update the target network parameters as in Equations (8) and (12);
   end for
end for
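Reusing the `ddpg3c_critic_loss` sketch from Section 3.2, one inner-loop update of Algorithm 1 might look as follows in PyTorch; the optimizer objects, network modules, and hyperparameter values are assumptions of this sketch rather than the paper's code.

```python
import torch

def soft_update(target_net, net, tau):
    """Polyak averaging of Equations (8) and (12)."""
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)

def train_step(batch, actor, target_actor, critics, target_critics,
               actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One inner-loop update of Algorithm 1 on a sampled minibatch."""
    s, a, r, s_next = batch
    # Critic update: target from Equation (9), loss from Equation (10).
    critic_loss = ddpg3c_critic_loss(critics, target_critics, target_actor,
                                     s, a, r, s_next, gamma)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # Actor update (Equation (5)); only the first critic is used, as noted in Section 3.2.
    actor_loss = -critics[0](s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    # Soft updates of all target networks (Equations (8) and (12)).
    soft_update(target_actor, actor, tau)
    for tc, c in zip(target_critics, critics):
        soft_update(tc, c, tau)
```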

4. Two-Stage Target Search and Tracking Method

This paper proposes a two-stage target search and tracking method, which divides SAR tasks into a search stage and a tracking stage. In the search stage, the controller is optimized to cover broad areas and quickly find targets. Once a target is detected, the focus shifts to tracking the dynamic target, including adapting to changes in the target's speed and direction. The controllers for each stage are trained with the proposed DDPG-3C algorithm. A diagram of the proposed two-stage target search and tracking method is shown in Figure 4. At run time, if the target is not within the radar range of the UAV, the UAV makes decisions according to the model trained in the coverage scenario to search for targets; once the target is detected by the radar, the UAV switches to tracking mode.
By separating the task into two stages, the complexity of the state and action spaces for each stage is reduced, making the training process more efficient and faster to converge. Each controller can be optimized for a smaller, more specific set of tasks, which improves learning efficiency. Additionally, the two-stage method allows each stage to be adjusted and optimized independently, making it easier to adapt to environmental changes and task requirements. Moreover, if an issue arises in either the search or the tracking model, it is easier to pinpoint and fix the problem without affecting other parts of the system.
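The switching rule itself is simple; the sketch below assumes the two trained actor networks are loaded as `search_actor` and `tracking_actor`, that the environment wrapper exposes a hypothetical observation dictionary, and that the sensor range is the 200 m used in the experiments.

```python
SENSOR_RANGE = 200.0  # sensor/radar range used in the experiments (m)

def select_action(obs, search_actor, tracking_actor):
    """Two-stage policy: search until the target enters sensor range,
    then hand control to the tracking controller."""
    d = obs.get("target_distance")             # None if the target is not sensed
    if d is not None and d <= SENSOR_RANGE:
        return tracking_actor(obs["tracking_state"])  # state of Equation (16)
    return search_actor(obs["search_state"])          # state of Equation (13)
```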

4.1. Search Stage

In the search stage, the goal is to quickly and efficiently cover the entire two-dimensional rescue area using the radar while avoiding obstacles and barriers in the area. The DDPG-3C decision model described in Section 3.2 is utilized for training the UAV (agent). The structure of the critic network and actor network is depicted in Figure 5.
The training procedure is described in Algorithm 1. In the search stage, the state is set as
$$S = \left[\, \mathbf{p}_{\mathrm{UAV}},\ \mathbf{v}_{\mathrm{UAV}},\ \mathbf{d}_{\mathrm{obs}} \,\right], \tag{13}$$
where $\mathbf{p}_{\mathrm{UAV}}, \mathbf{v}_{\mathrm{UAV}}, \mathbf{d}_{\mathrm{obs}} \in \mathbb{R}^2$, and Table 1 explains the meaning of each symbol in the state space. The UAV has a continuous action space represented as a two-element vector $[F_x, F_y]$, corresponding to the forces generated by the horizontal and vertical thrust.
The reward in this search scenario comprises step rewards and terminal rewards, aiming to cover the entire two-dimensional rescue area. Each step incurs a step penalty $r_{\mathrm{step}} = -0.1$ to encourage the agent to complete the coverage task quickly. In addition, each step yields a coverage reward $r_{\mathrm{coverage}}$, so that covering new areas grants additional reward. The coverage reward is calculated as
$$r_{\mathrm{coverage}} = D_{\mathrm{sensor}} \wedge \overline{D}_{\mathrm{covered}}, \tag{14}$$
where $D_{\mathrm{sensor}}$ is the area currently covered by the UAV's sensor, and $\overline{D}_{\mathrm{covered}}$ is the complement of the previously covered area (i.e., the area that has not yet been covered). The logical AND ($\wedge$) ensures that only the area currently covered by the sensor but not previously covered is counted, i.e., the newly covered area. At the end of an episode, the agent (UAV) receives a reward $r_{\mathrm{complete}} = 300$ and the task is marked as completed when the total covered area reaches 95%. Meanwhile, if the agent (UAV) gets too close to barriers or obstacles, it is penalized with a reward $r_{\mathrm{collision}} = -200$ and the episode is terminated. The reward function is as follows:
$$R = r_{\mathrm{step}} + p \cdot r_{\mathrm{coverage}} + r_{\mathrm{complete}} + r_{\mathrm{collision}}, \tag{15}$$
where $p = 0.001$ is the reward ratio; $r_{\mathrm{step}}$ and $r_{\mathrm{coverage}}$ are step rewards; $r_{\mathrm{complete}}$ and $r_{\mathrm{collision}}$ are terminal rewards.
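An illustrative computation of the search-stage reward over a discretized coverage grid is sketched below; the boolean grid bookkeeping and the treatment of the step and collision terms as negative penalties are assumptions of this sketch.

```python
import numpy as np

P_RATIO = 0.001                                   # reward ratio p
R_STEP, R_COMPLETE, R_COLLISION = -0.1, 300.0, -200.0

def search_reward(sensor_mask: np.ndarray, covered_mask: np.ndarray,
                  collided: bool, coverage_done: bool) -> float:
    """Search-stage reward of Equation (15) on boolean grids over the SAR area."""
    newly_covered = sensor_mask & ~covered_mask    # Equation (14)
    reward = R_STEP + P_RATIO * float(newly_covered.sum())
    if coverage_done:                              # >= 95% of the area covered
        reward += R_COMPLETE
    if collided:                                   # too close to an obstacle or barrier
        reward += R_COLLISION
    return reward
```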
When deploying the decision model, only the structure and parameters of the actor network need to be loaded. The actor network receives a state from the environment; this state must be in the same format, and preprocessed in the same way, as the state of Equation (13) used during training. The action output by the actor network is then executed by the UAV in the environment. The implementation process of the search operation is shown in Figure 6.
It should be noted that there is no target in the search scenario while the decision model is being trained; the goal of the UAV is simply to cover the entire two-dimensional rescue area quickly and efficiently. During deployment, a target moves within the area and the UAV searches for it.

4.2. Tracking Stage

Once the target is identified during the search stage, the system transitions to the tracking stage. Here, the controller focuses on maintaining continuous surveillance of the target, adjusting to its movements. The DDPG-3C decision model described in Section 3.2 is utilized for training the UAV (agent). The structure of the critic network and actor network is depicted in Figure 5, which is similar to the model in the search stage.
In the tracking stage, to better adapt to the target’s dynamics and maintain consistent tracking, the state is set as
$$S = \left[\, \mathbf{p}_{\mathrm{UAV}},\ \mathbf{v}_{\mathrm{UAV}},\ \mathbf{d}_{\mathrm{target}},\ \mathbf{v}_{\mathrm{target}},\ \mathbf{d}_{\mathrm{obs}} \,\right], \tag{16}$$
where $\mathbf{p}_{\mathrm{UAV}}, \mathbf{v}_{\mathrm{UAV}}, \mathbf{d}_{\mathrm{target}}, \mathbf{v}_{\mathrm{target}}, \mathbf{d}_{\mathrm{obs}} \in \mathbb{R}^2$, and Table 1 explains the meaning of each symbol in the state space. As in the search stage, the UAV has a continuous action space represented as a two-element vector $[F_x, F_y]$.
To encourage the UAV to move closer to the target, at each step the agent receives a proximity reward $r_{\mathrm{pro}}$ based on the Euclidean distance between the UAV and the target:
$$r_{\mathrm{pro}} = \frac{5}{\left\lVert \mathbf{d}_{\mathrm{target}} \right\rVert + 0.5}, \tag{17}$$
where $\mathbf{d}_{\mathrm{target}} \in \mathbb{R}^2$ contains the horizontal and vertical distances to the target detected by the UAV's sensor. Adding 0.5 to the denominator prevents excessively high rewards and division-by-zero errors when the distance is very small. If the agent gets too close to barriers or obstacles, it receives a penalty $r_{\mathrm{collision}} = -200$ and the episode is terminated. The reward function is as follows:
$$R = r_{\mathrm{pro}} + r_{\mathrm{collision}}. \tag{18}$$
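A direct transcription of the tracking reward in Equations (17) and (18), treating the collision term as a penalty, could look as follows.

```python
import numpy as np

def tracking_reward(d_target: np.ndarray, collided: bool) -> float:
    """Tracking-stage reward of Equations (17) and (18)."""
    r_pro = 5.0 / (np.linalg.norm(d_target) + 0.5)  # +0.5 avoids division by zero
    return r_pro + (-200.0 if collided else 0.0)    # collision treated as a penalty
```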
It should be noted that during training it is assumed that the UAV can perceive the states of the environment (target, obstacles, and barriers) regardless of whether these objects are within its radar range, because the goal of this stage is to train the tracking capability of the UAV. When the trained model is deployed, the UAV switches from search mode to tracking mode once the target enters its radar range. The implementation process of the tracking operation is similar to that of the search stage.

5. Experimental Simulations

In order to validate the effectiveness of the proposed DDPG-3C model that utilizes three critic networks to alleviate the overestimation problem, the performance of the proposed model is compared with two popular reinforcement learning baselines in the two-dimensional search and rescue (SAR) scenario described in Section 2: DDPG [27] and TD3 (twin delayed deep deterministic policy gradient) [29]. At the same time, to validate the advantages of the proposed two-stage target search and tracking method, this paper compares it with the traditional single-stage reinforcement learning model in the SAR scenario.
When training the DDPG decision model, TD3 decision model, and the DDPG-3C decision model, the hyperparameters are set according to Table 2.
In the following sections, the DDPG, TD3, and DDPG-3C decision models are applied to the search and tracking stages of the SAR task and compared with the traditional single-stage method. All of these algorithms are implemented in Python and run on a desktop computer with a GeForce RTX 4090 GPU and a 13th Gen Intel Core i9-13900KS CPU.

5.1. Search Stage Simulations

In the search stage, there is no target in this scenario when the decision model is training. The designed SAR scenario includes a single UAV with a sensor range of 200 m and two obstacles, each with a radius of 50 m. Both the obstacles and the UAV are positioned at randomly generated locations within a two-dimensional rescue area that spans 0 to 800 m. The obstacles are static, while the UAV operates based on the force generated by the horizontal and vertical thrust, with a maximum action value of 400 N. The hyperparameters for the DDPG, TD3, and DDPG-3C decision models are listed in Table 2.
The training results of the DDPG-3C decision model, TD3 decision model, and the DDPG decision model in the search stage of the SAR task in denial environments are shown in Figure 7. The results are calculated using an exponential moving average (EMA) to help visualize trends more clearly by reducing noise from the raw data according to Equation (19):
$$y_t = \begin{cases} x_t, & t = 1, \\ s \cdot y_{t-1} + (1 - s) \cdot x_t, & t > 1, \end{cases} \tag{19}$$
where $s = 0.95$ is the smoothing factor; a value closer to 1 implies heavier smoothing, and a value closer to 0 implies lighter smoothing. Moreover, the detailed convergence rewards and convergence steps are listed in Table 3.
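The smoothing in Equation (19) corresponds to the following simple exponential moving average, shown here only to make the smoothing rule explicit.

```python
def ema(values, s=0.95):
    """Exponential moving average of Equation (19); s closer to 1 smooths more."""
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(s * smoothed[-1] + (1.0 - s) * x)
    return smoothed
```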
It can be seen from Figure 7a and Table 3 that the DDPG-3C model, which utilizes three critic networks, shows a reward trajectory that steadily increases and stabilizes at approximately 600, indicating robust learning and performance stability. At the same time, the DDPG-3C model achieves optimal performance more rapidly than its counterparts. In contrast, the traditional DDPG model, which uses a single critic network, reaches a lower peak reward of around 560 and takes approximately 7 million training steps to converge, showing a slower learning rate compared with the DDPG-3C model. Similarly, the TD3 model, which uses twin critic networks, stabilizes at a reward of approximately 560 but requires about 4 million training steps to converge, demonstrating slower convergence compared to DDPG-3C but faster than DDPG.
When it comes to the estimation of target Q-values, the red line in Figure 7b represents the target Q-value for the DDPG-3C model. This value is calculated by discarding the highest of the three Q-value estimates and averaging the remaining two. Figure 7b shows that the target Q-value for DDPG-3C is consistently lower than that of the traditional DDPG and TD3, indicating that the proposed decision model can effectively alleviate the problem of overestimation.
To compare the performance of these three models during the search stage, each model is tested across 10,000 episodes, and the average coverage ratio and average collision rate are calculated to evaluate effectiveness and safety. All models are initialized under the same conditions to ensure fair testing. The results are shown in Table 4. For the simulation video of the search stage, please click here (https://easylink.cc/5r3njl, accessed on 25 September 2024) to view (Supplementary Materials).
According to the data presented in Table 4, the DDPG model recorded a collision frequency of 543, meaning that collisions occurred in approximately 5.43% of the episodes. The TD3 model demonstrated an improvement with a collision frequency of 461 (4.61%). However, the DDPG-3C model demonstrated a significantly lower collision frequency with a total number of 349 collisions (3.49%). This improvement indicates that the DDPG-3C decision model is more effective at avoiding obstacles, which is safer during the search stage of SAR tasks. In terms of area coverage, the traditional DDPG model achieved a coverage rate of 87.92%, while the TD3 model achieved a slightly better coverage rate of 89.14%. However, the DDPG-3C model outperformed both, achieving a higher coverage rate of 90.17%. This indicates not only a more thorough coverage per episode but also improved efficiency in searching.

5.2. Tracking Stage Simulations

In the tracking stage, the goal is to maintain continuous surveillance of a moving target within a defined two-dimensional area spanning 0 to 800 m. The scenario includes a single UAV that has complete perception of the environment’s states. The environment also contains two static obstacles, each with a radius of 50 m, randomly placed in the area. The UAV operates based on the horizontal and vertical thrust forces, with a maximum action value of 400 N, to keep track of the target without colliding with the obstacles. By providing the UAV with full environmental awareness, the model can learn the optimal strategies for tracking and obstacle avoidance without the constraints of sensor range.
The DDPG, TD3, and DDPG-3C models are trained under the same conditions using hyperparameters listed in Table 2. Training results are analyzed using an exponential moving average (EMA) according to Equation (19) to smooth the data and highlight trends more effectively. Moreover, the detailed convergence rewards and convergence steps are listed in Table 5.
Figure 8a,b displays the reward and target Q-value trajectories for the DDPG, TD3, and DDPG-3C decision models. Similar to the results from the search stage, the DDPG-3C model, with its three critic networks, converges more quickly, doing so at about 3 million training steps and reaching a higher reward of approximately 200. In contrast, the TD3 model takes about 4 million training steps to converge, with a reward of approximately 180, and the traditional DDPG model converges at around 7 million training steps with a lower reward of roughly 160. These results suggest that DDPG-3C adapts better to the target's movements than the other models. The target Q-values for DDPG-3C, calculated by discarding the highest estimate and averaging the remaining two, are lower than those of the traditional DDPG and TD3. This provides a more conservative and accurate estimate, reducing the risk of overestimation.
To compare the performance of these three models during the tracking stage, each model is tested across 10,000 episodes. The average collision frequency and the tracking success rate, defined as the percentage of episodes in which the UAV successfully tracks and maintains the target within a range of 20 m, are calculated and compared. All models are initialized under the same conditions to ensure fair testing. The results are shown in Table 6. For the simulation video of the tracking stage, please click here (https://easylink.cc/5r3njl, accessed on 25 September 2024) to view.
According to the data presented in Table 6, collisions occurred in approximately 3.16% of the episodes when implementing the DDPG decision model in the tracking stage, while the TD3 model showed an improvement with a collision frequency of 274 (2.74%). The DDPG-3C decision model demonstrated an even lower collision frequency of 189, with a collision rate of 1.89%. In terms of tracking success, the DDPG model achieved a tracking success frequency of 9357 out of 10,000 episodes, while the TD3 model performed slightly better with a tracking success frequency of 9397 (93.97%). However, the proposed DDPG-3C decision model outperformed both, achieving a tracking success rate of 95.12%. This higher success rate for DDPG-3C suggests that the model, with its additional critic networks, provides more accurate value estimates and decision-making ability compared to DDPG and TD3.

5.3. Whole Period

In this experiment, the whole period of the SAR tasks was considered, which combines the search stage and tracking stage. Initially, the UAV operates in search mode, aiming to extensively cover a defined two-dimensional area (0 to 800 m) to detect any potential targets. During this stage, the UAV lacks target and obstacle information and relies on its sensor system to detect obstacles and other objects. Once a target enters the UAV’s radar range (within 200 m), the UAV switches to tracking mode and continuously tracks the moving target. The search stage and tracking stage use the DDPG-3C decision model trained in Section 5.1 and Section 5.2, respectively. The implementation process of the whole period is shown in Figure 4.
Figure 9 shows the experimental results of the proposed two-stage target search and tracking method for SAR tasks, which include avoiding obstacles, searching for targets, and tracking the target. In Figure 9a–c, the UAV's initial position is (700, 700), with a random initial angle and an initial speed of 0. The target's initial position is (100, 100), with a speed of 30 m/s and an initial angle of $\pi/8$. The positions of the obstacles are (200, 200) and (400, 400). In Figure 9d–f, the UAV's initial position is also (700, 700), with a random initial angle and an initial speed of 0. The target's initial position is (100, 100), with a speed of 30 m/s and an initial angle of $\pi/4$, while the obstacles are located at (200, 80) and (350, 200).
It can be seen from Figure 9a,d that the UAV is positioned randomly and continues to search the environment for potential targets. After the UAV detects a target, as illustrated in Figure 9b,e, it transitions into tracking mode. The UAV adjusts its path to approach the target while avoiding obstacles. In the tracking stage, as shown in Figure 9c,f, the UAV not only maintains its course towards the target but also adapts to changes in the target’s speed. Additionally, it consistently avoids obstacles, as demonstrated in Figure 9c. For the simulation video of the whole stage, please click here (https://easylink.cc/5r3njl, accessed on 25 September 2024) to view.
Figure 10 presents the experimental results of the proposed two-stage target search and tracking method for SAR tasks, featuring scenarios with 5, 9, and 13 obstacles, respectively. In each scenario, the UAV's initial position is (700, 700), with a random initial angle and an initial speed of 0, and the target's speed is 30 m/s. In Figure 10a, the target's initial position is (400, 500), with an angle of $17\pi/16$, and the obstacles are located at (100, 100), (100, 700), (700, 100), (700, 700), and (400, 400). In Figure 10b, four additional obstacles are added, positioned at (400, 100), (100, 400), (700, 400), and (400, 700), and the target's initial position is (400, 500), with an angle of $13\pi/16$. In Figure 10c, the number of obstacles is increased to 13, evenly distributed throughout the SAR scenario, and the target moves at a speed of 30 m/s along a predetermined trajectory.
It can be seen from the figure that in SAR scenarios with varying numbers of obstacles, the UAV successfully avoids obstacles during the search stage while covering the SAR area to locate the target. Once the UAV’s radar detects the target, it switches to tracking mode. During the tracking stage, the UAV not only effectively tracks the target but also continuously avoids obstacles. For the simulation video of the whole stage, please click here (https://easylink.cc/5r3njl, accessed on 25 September 2024) to view.
To better illustrate the effectiveness of the proposed two-stage target search and tracking method for SAR tasks, this paper compares models trained using the two-stage method with a model trained using a traditional single-stage reinforcement learning method. In the traditional single-stage SAR task, the UAV is required to search for and track the target as quickly as possible while avoiding obstacles. The traditional single-stage method also utilizes the DDPG-3C model, and its hyperparameters are similar to those of the DDPG-3C decision models in the search and tracking stages. The state is set as in Equation (2), and the reward function is as follows:
$$R = r_{\mathrm{pro}} + r_{\mathrm{search}} + r_{\mathrm{collision}}, \tag{20}$$
where $r_{\mathrm{pro}}$ is the proximity reward based on the Euclidean distance between the UAV and the target, as shown in Equation (17); $r_{\mathrm{search}} = 100$ is the reward for finding the target; and $r_{\mathrm{collision}} = -200$ is the collision penalty for getting too close to barriers or obstacles.
When comparing the performance of the two-stage target search and tracking method with the traditional single-stage method, each model was tested across 20,000 episodes. The average number of steps taken to find the target, defined as the target being detected by the UAV’s sensor, and the average steps for successfully tracking the target, defined as the UAV maintaining the target within a 20-m range, were calculated and compared. Additionally, collision rates were compared. Both models are initialized under the same conditions to maintain fairness in testing. The results are presented in Table 7. To view the comparison video of these two methods, please click here (https://easylink.cc/5r3njl, accessed on 30 September 2024).
According to the data presented in Table 7, the two-stage target search and tracking method, which divides SAR tasks into search and tracking stages, requires an average of 171.27 steps to find the target. This is significantly fewer than the traditional single-stage method, which takes an average of 313.78 steps, suggesting a more efficient search strategy owing to the specialized training in the search stage. Once the target is detected, the two-stage method takes an average of 354.54 steps to track the target and maintain it within a 20-m range, whereas the traditional single-stage method requires more steps, averaging 481.68. This means the UAV operates more effectively in dynamic environments when using the two-stage method. In terms of collision rates, the traditional single-stage method has a slightly higher collision rate of 4.51% compared with 4.07% for the two-stage method, indicating that the proposed approach is not only faster but also safer.
It should be noted that although the two-stage target search and tracking method shows better overall efficiency in finding and tracking targets than the traditional single-stage method, it offers no advantage in the number of steps taken from detecting the target to successfully tracking it. Specifically, over this interval the traditional single-stage method requires fewer steps (167.9) than the two-stage method (183.27). This can be attributed to the state transition from the search stage to the tracking stage: during this transition, the UAV's state inherited from the search stage acts like noise that the tracking controller must overcome, requiring additional steps to stabilize and effectively track the target.

6. Conclusions

To deal with the complexities of decision-making for unmanned aerial vehicles (UAVs) in unknown environments, this paper applies deep reinforcement learning algorithms to search and rescue (SAR) tasks. A two-stage target search and tracking method for UAVs based on deep reinforcement learning is proposed, which divides SAR tasks into search and tracking stages, and the controllers for each stage are trained based on the proposed deep deterministic policy gradient with three critic networks (DDPG-3C) algorithm. Simulation experiments are carried out to evaluate the performance of each stage in a two-dimensional rectangular SAR scenario, including search, tracking, and the integrated whole stage. Some concluding remarks are drawn below:
(1)
The deep deterministic policy gradient (DDPG) algorithm, while effective for continuous action spaces, exhibits overestimation bias in Q-values in the search and tracking stages of SAR tasks. This overestimation slows the convergence of the policy and leads to less reliable decision-making.
(2)
The proposed DDPG-3C model, which incorporates three critic networks, addresses the overestimation problem by removing the highest of the three Q-value estimates and using the average of the remaining two as the target Q-value. This leads to more accurate Q-value estimation, resulting in faster convergence and improved performance during both the search and tracking stages. The proposed method therefore not only provides a more efficient search strategy but also adapts better to the target's movements, improving overall efficiency.
(3)
The twin delayed deep deterministic policy gradient (TD3) algorithm, which uses twin critic networks, can also alleviate the overestimation problem to some extent and demonstrates good performance in both avoiding collisions and achieving a high coverage rate. However, the DDPG-3C model surpasses TD3 in these areas, offering even better obstacle avoidance and coverage rate, as well as more robust performance.
(4)
The two-stage target search and tracking method proposed in this paper outperforms the traditional single-stage approach. In terms of the average number of steps to find and track the target and the collision rates, the proposed method not only offers faster and safer operations but also simplifies the process by eliminating the need for an intricate reward function design. Additionally, it reduces the state vector size, resulting in more efficient and effective decision-making in SAR tasks.
Despite these improvements, the current study only considers simulations in a two-dimensional area, which limits its applicability to more complex real-world scenarios. Additionally, the UAV’s obstacle detection focuses on the nearest obstacle within the radar range when using the DDPG-3C decision model. This simplification may increase the likelihood of collisions when multiple obstacles are present within the radar’s detection range. Future work could explore the expansion of the model to consider target search and tracking in three-dimensional environments. Additionally, developing reinforcement learning methods with state embeddings could allow the state vector to better adapt to varying numbers of targets and obstacles. Furthermore, research could focus on increasing the number of UAVs and targets by using multi-agent reinforcement learning algorithms, enabling multiple UAVs to collaboratively search for and track targets.

Supplementary Materials

The simulation videos of the search stage, tracking stage, and whole stage, as well as the comparison video of the proposed two-stage target search and tracking method and the traditional single-stage method, can be downloaded at: https://easylink.cc/5r3njl (accessed on 25 September 2024).

Author Contributions

Conceptualization, M.L.; Data curation, M.L.; Formal analysis, M.L.; Investigation, M.L. and J.W.; Methodology, M.L., J.W. and K.L.; Project administration, K.L.; Resources, M.L.; Software, M.L.; Supervision, K.L.; Validation, M.L.; Visualization, M.L.; Writing—original draft, M.L.; Writing—review and editing, M.L., J.W. and K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is unavailable due to privacy or ethical restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Elmeseiry, N.; Alshaer, N.; Ismail, T. A detailed survey and future directions of unmanned aerial vehicles (uavs) with potential applications. Aerospace 2021, 8, 363. [Google Scholar] [CrossRef]
  2. Huang, Y.; Thomson, S.J.; Hoffmann, W.C.; Lan, Y.; Fritz, B.K. Development and prospect of unmanned aerial vehicle technologies for agricultural production management. Int. J. Agric. Biol. Eng. 2013, 6, 1–10. [Google Scholar]
  3. Muchiri, G.N.; Kimathi, S. A review of applications and potential applications of UAV. In Proceedings of the Sustainable Research and Innovation Conference, Pretoria, South Africa, 20–24 June 2022. [Google Scholar]
  4. Kazmi, W.; Bisgaard, M.; Garcia-Ruiz, F.; Hansen, K.D.; la Cour-Harbo, A. Adaptive surveying and early treatment of crops with a team of autonomous vehicles. In Proceedings of the 5th European Conference on Mobile Robots ECMR 2011, Örebro, Sweden, 7–9 September 2011. [Google Scholar]
  5. Marx, A.; Chou, Y.H.; Mercy, K.; Windisch, R. A lightweight, robust exploitation system for temporal Stacks of UAS data: Use case for forward-deployed military or emergency responders. Drones 2019, 3, 29. [Google Scholar] [CrossRef]
  6. Guan, S.; Zhu, Z.; Wang, G. A Review on UAV-Based Remote Sensing Technologies for Construction and Civil Applications. Drones 2022, 6, 117. [Google Scholar] [CrossRef]
  7. Merz, M.; Pedro, D.; Skliros, V.; Bergenhem, C.; Himanka, M.; Houge, T.; Matos-Carvalho, J.P.; Lundkvist, H.; Cürüklü, B.; Hamrén, R.; et al. Autonomous UAS-Based Agriculture Applications: General Overview and Relevant European Case Studies. Drones 2022, 6, 128. [Google Scholar] [CrossRef]
  8. Aslam, W. Great-power responsibility, side-effect harms and American drone strikes in Pakistan. J. Mil. Ethics 2016, 15, 143–162. [Google Scholar] [CrossRef]
  9. Bachrach, A.; He, R.; Roy, N. Autonomous Flight in Unknown Indoor Environments. Int. J. Micro Air Veh. 2009, 1, 217–228. [Google Scholar] [CrossRef]
  10. Mac, T.T.; Copot, C.; De Keyser, R.; Ionescu, C.M. The development of an autonomous navigation system with optimal control of an UAV in partly unknown indoor environment. Mechatronics 2018, 49, 187–196. [Google Scholar] [CrossRef]
  11. Rothmund, S.V.; Johansen, T.A. Risk-Based Obstacle Avoidance in Unknown Environments Using Scenario-Based Predictive Control for an Inspection Drone Equipped with Range Finding Sensors. In Proceedings of the 2019 International Conference on Unmanned Aircraft Systems (ICUAS), Atlanta, GA, USA, 11–14 June 2019. [Google Scholar]
  12. Kulathunga, G.; Hamed, H.; Devitt, D.; Klimchik, A. Optimization-Based Trajectory Tracking Approach for Multi-Rotor Aerial Vehicles in Unknown Environments. IEEE Robot. Autom. Lett. 2022, 7, 4598–4605. [Google Scholar] [CrossRef]
  13. Saccani, D.; Cecchin, L.; Fagiano, L. Multitrajectory Model Predictive Control for Safe UAV Navigation in an Unknown Environment. IEEE Trans. Control. Syst. Technol. 2023, 31, 1982–1997. [Google Scholar] [CrossRef]
  14. Li, J.; Zhang, G.; Zhang, X.; Zhang, W. Integrating dynamic event-triggered and sensor-tolerant control: Application to USV-UAVs cooperative formation system for maritime parallel search. IEEE Trans. Intell. Transp. Syst. 2024, 25, 3986–3998. [Google Scholar] [CrossRef]
  15. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
  16. Tamar, A.; Wu, Y.; Thomas, G.; Levine, S.; Abbeel, P. Value iteration networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  17. Waschneck, B.; Reichstaller, A.; Belzner, L.; Altenmüller, T.; Bauernhansl, T.; Knapp, A.; Kyek, A. Optimization of global production scheduling with deep reinforcement learning. Procedia Cirp 2018, 72, 1264–1269. [Google Scholar] [CrossRef]
  18. Pham, H.X.; La, H.M.; Feil-Seifer, D.; Nguyen, L.V. Autonomous uav navigation using reinforcement learning. arXiv 2018, arXiv:1801.05086. [Google Scholar]
  19. Li, B.; Gan, Z.; Chen, D.; Sergey Aleksandrovich, D. UAV maneuvering target tracking in uncertain environments based on deep reinforcement learning and meta-learning. Remote Sens. 2020, 12, 3789. [Google Scholar] [CrossRef]
  20. Ramezani, M.; Amiri Atashgah, M.A. Energy-Aware Hierarchical Reinforcement Learning Based on the Predictive Energy Consumption Algorithm for Search and Rescue Aerial Robots in Unknown Environments. Drones 2024, 8, 283. [Google Scholar] [CrossRef]
  21. Zhao, Y.; Zheng, Z.; Zhang, X.; Yang, L. Q learning algorithm-based UAV path learning and obstacle avoidance approach. In Proceedings of the 2017 36th Chinese Control Conference, Dalian, China, 26–28 July 2017. [Google Scholar]
  22. Zhang, K.; Li, K.; Shi, H.; Zhang, Z.; Liu, Z. Autonomous guidance maneuver control and decision-making algorithm based on deep reinforcement learning UAV route. Syst. Eng. Electron. 2020, 42, 1567–1574. [Google Scholar]
  23. Rodriguez-Ramos, A.; Sampedro, C.; Bavle, H.; De La Puente, P.; Campoy, P. A Deep Reinforcement Learning Strategy for UAV Autonomous Landing on a Moving Platform. J. Intell. Robot. Syst. 2019, 93, 351–366. [Google Scholar] [CrossRef]
  24. Ma, B.; Liu, Z.; Dang, Q.; Zhao, W.; Wang, J.; Cheng, Y.; Yuan, Z. Deep reinforcement learning of UAV tracking control under wind disturbances environments. IEEE Trans. Instrum. Meas. 2023, 72, 1–13. [Google Scholar] [CrossRef]
  25. Xia, Z.; Du, J.; Wang, J.; Jiang, C.; Ren, Y.; Li, G.; Han, Z. Multi-agent reinforcement learning aided intelligent UAV swarm for target tracking. IEEE Trans. Veh. Technol. 2021, 71, 931–945. [Google Scholar] [CrossRef]
  26. DJI. Drones Assist in Fire Rescue of Large-Scale Urban Complexes in Nanjing. Available online: https://enterprise-insights.dji.com/cn/blog/nanjing-drone-mall-fire-rescue/ (accessed on 25 July 2024).
27. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  28. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
  29. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  30. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 1999, 12, 1057–1063. [Google Scholar]
Figure 1. Drones assist in the fire rescue of large-scale urban complexes in Nanjing [26].
Figure 2. Simplified model of the SAR scenario.
Figure 3. The framework of the DDPG-3C decision model: the yellow and blue sections represent the actor and critic modules, respectively.
Figure 4. Diagram of the proposed two-stage target search and tracking method.
Figure 5. Structure of the critic network and the actor network in DDPG-3C.
Figure 6. Implementation process of the UAV in the search stage.
Figure 7. Training results of the three models in the search stage: (a) episode rewards; (b) target Q value.
Figure 8. Training results of two models in the tracking stage: (a) episode rewards; (b) target Q value.
Figure 9. Experimental results of the proposed two-stage target search and tracking method for SAR tasks. (a–c): Scenario 1; (d–f): Scenario 2. The dark blue UAV icon represents our UAV, whose action values are shown by the red bars extending from the individual motors; dark green circular areas represent obstacles or danger zones; the blue circular icon represents the target; the red circle represents the sensor range of the UAV; red and black lines represent the trajectories of the target and the UAV, respectively.
Figure 10. Experimental results of two-stage target search and tracking with varying numbers of obstacles: (a) 5 obstacles; (b) 9 obstacles; (c) 13 obstacles.
Table 1. Meaning of each dimension in the state space.
Symbol | Description | Normalized Values
p_UAV | Position of the UAV in the horizontal and vertical directions | [−1, 1] × [−1, 1]
v_UAV | Velocity of the UAV in the horizontal and vertical directions | [−1, 1] × [−1, 1]
d_target | Horizontal and vertical distances of the target detected by sensors | [0, 1] × [0, 1]
v_target | Horizontal and vertical velocities of the target detected by sensors | [0, 1] × [0, 1]
d_obs | Horizontal and vertical distances of the obstacle detected by sensors | [0, 1] × [0, 1]
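To make the state layout above concrete, here is a minimal sketch of how such a normalized observation vector could be assembled. The function name, the normalization constants (W, V_MAX, R_S), and the clip-and-scale scheme are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

# Illustrative normalization constants (assumptions, not values from the paper).
W = 1000.0      # half-width of the rectangular SAR area
V_MAX = 20.0    # maximum UAV/target speed
R_S = 150.0     # sensor detection range

def build_observation(p_uav, v_uav, d_target, v_target, d_obs):
    """Concatenate the state dimensions of Table 1 into one normalized vector."""
    obs = np.concatenate([
        np.clip(np.asarray(p_uav) / W, -1.0, 1.0),        # p_UAV in [-1, 1] x [-1, 1]
        np.clip(np.asarray(v_uav) / V_MAX, -1.0, 1.0),    # v_UAV in [-1, 1] x [-1, 1]
        np.clip(np.abs(d_target) / R_S, 0.0, 1.0),        # d_target in [0, 1] x [0, 1]
        np.clip(np.abs(v_target) / V_MAX, 0.0, 1.0),      # v_target in [0, 1] x [0, 1]
        np.clip(np.abs(d_obs) / R_S, 0.0, 1.0),           # d_obs in [0, 1] x [0, 1]
    ])
    return obs.astype(np.float32)
```

Under this sketch, dropping the two target-related pairs would leave a 6-dimensional observation while keeping all five pairs gives a 10-dimensional one, which would be consistent with the actor input sizes listed in Table 2 for the search and tracking stages.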
Table 2. Hyperparameters of the network.
Name | Value
Input vector size of the critic network | search stage: 8; tracking stage: 12; traditional single-stage: 12
Hidden layer size of the critic network | 400, 300
Output vector size of the critic network | 1
Activation function of the critic network | ReLU, ReLU
Learning rate of the critic network | 0.001
Optimizer of the critic network | Adam
Input vector size of the actor network | search stage: 6; tracking stage: 10; traditional single-stage: 10
Hidden layer size of the actor network | 400, 300
Output vector size of the actor network | 2
Activation function of the actor network | ReLU, ReLU, Tanh
Learning rate of the actor network | 0.001
Optimizer of the actor network | Adam
Train step size | 500
Batch size | 1024
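As a reading aid for the hyperparameters above, the following PyTorch sketch instantiates actor and critic networks with the listed layer sizes (400, 300), activations (ReLU/ReLU for the critic, ReLU/ReLU/Tanh for the actor), the Adam optimizer, and a learning rate of 0.001. PyTorch itself and the class/variable names are assumptions, since the paper does not state its implementation framework.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor: state -> 2D action; hidden layers 400 and 300; ReLU, ReLU, Tanh."""
    def __init__(self, state_dim):          # 6 (search) or 10 (tracking / single-stage)
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 2), nn.Tanh(),    # 2D action bounded in [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Critic: (state, action) -> scalar Q value; hidden layers 400 and 300; ReLU, ReLU."""
    def __init__(self, state_dim, action_dim=2):   # total input 8 (search) or 12 (tracking)
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Search-stage example: one actor and three critics, Adam with lr = 0.001 (Table 2).
actor = Actor(state_dim=6)
critics = [Critic(state_dim=6) for _ in range(3)]   # three critic networks, per the DDPG-3C name
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opts = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]
```

The three critic instances reflect the "three critic networks" of DDPG-3C; how their estimates are combined during the update is defined by the algorithm itself and is not reproduced in this sketch.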
Table 3. Evaluation metrics of the training process in the search stage.
Model | Convergence Rewards | Convergence Steps
DDPG | ~600 * | ~2.8 M *
TD3 | ~560 * | ~4 M *
DDPG-3C | ~560 * | ~7 M *
* The tilde (~) indicates that the values are approximate; 1 M represents 1,000,000.
Table 4. Comparison of collision rate and coverage rate in the search stage.
Model | Collision Frequency | Collision Rate | Coverage Rate
DDPG | 543 | 5.43% | 87.92%
TD3 | 461 | 4.61% | 89.14%
DDPG-3C | 349 | 3.49% | 90.17%
Table 5. Evaluation metrics of the training process in the tracking stage.
Model | Convergence Rewards | Convergence Steps
DDPG | ~200 * | ~3 M *
TD3 | ~180 * | ~4 M *
DDPG-3C | ~160 * | ~7 M *
* The tilde (~) indicates that the values are approximate; 1 M represents 1,000,000.
Table 6. Comparison of collision rate and tracking success rate in the tracking stage.
Model | Collision Frequency | Collision Rate | Tracking Success Frequency | Tracking Success Rate
DDPG | 316 | 3.16% | 9357 | 93.57%
TD3 | 274 | 2.74% | 9397 | 93.97%
DDPG-3C | 189 | 1.89% | 9512 | 95.12%
Table 7. Comparison of average steps and collision rate over the whole task period.
Method | Average Steps (Finding the Target) | Average Steps (Tracking the Target) | Collision Rate
Two-stage method | 171.27 | 354.54 | 4.07%
Traditional single-stage | 313.78 | 481.68 | 4.51%
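For context on how the two stages compared in Table 7 could be composed at run time, the sketch below switches from the search controller to the tracking controller once the target is detected. The environment interface (env.target_detected, env.step) and all names are hypothetical illustrations rather than the authors' implementation, and in practice each controller would receive its own stage-specific observation subset.

```python
def run_two_stage_episode(env, search_actor, tracking_actor, max_steps=500):
    """Two-stage rollout sketch: run the search-stage policy until the sensor
    reports the target, then hand control over to the tracking-stage policy."""
    obs = env.reset()
    steps_to_find = None
    for t in range(max_steps):
        if env.target_detected:                  # hypothetical flag set by the sensor model
            if steps_to_find is None:
                steps_to_find = t                # analogous to "average steps (finding the target)"
            action = tracking_actor(obs)         # tracking-stage controller
        else:
            action = search_actor(obs)           # search-stage controller
        obs, _reward, done, _info = env.step(action)
        if done:
            break
    return steps_to_find, t + 1
```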