1. Introduction
Unmanned aerial vehicles (UAVs) are aircraft that fly autonomously without an onboard pilot and are notable for their convenience, flexibility, low cost, and wide range of applications [1]. Because no pilot is on board, UAVs can perform a variety of high-risk or otherwise inaccessible tasks in complex and changing environments, including terrain mapping, precision agriculture (PA), surveillance and reconnaissance, power line inspections, search and rescue (SAR), film shooting, aerial delivery, and military operations [2,3,4,5,6,7,8].
In the field of UAVs, the execution of numerous tasks fundamentally relies on target tracking. For instance, in security management, it is imperative to track, encircle, and apprehend criminals when a terrorist event occurs; in such circumstances, UAVs can track the suspects and even carry light weapons to assist the police. Another vivid case is search and rescue after a disaster. When natural disasters occur, the terrain in the affected area is complex and communication is interrupted, posing significant challenges to SAR tasks. In such scenarios, UAVs equipped with advanced sensors such as high-definition cameras and infrared thermal imagers can quickly enter these areas, conduct aerial patrols, and perform target tracking, exploiting their aerial superiority to provide valuable information for SAR operations.
With the rapid development of technology, UAVs have achieved significant improvements in intelligence, endurance, payload capacity, and resistance to interference, broadening their applicability across diverse scenarios. Despite these improvements, UAVs operating in SAR tasks often encounter unique challenges, such as infrastructure damage and severe natural conditions, which can disrupt communications. This disruption severely impacts the UAVs' ability to perceive their environment and diminishes the success rate of rescues. Furthermore, in denied environments, communication between UAVs and the ground command center is disrupted, so fast and reliable communication cannot be guaranteed. Therefore, each UAV must be able to complete its tasks independently and intelligently in these severe environments, achieving the highest possible degree of autonomy.
How UAVs can make rapid and effective maneuvering decisions based on real-time observations to accomplish tasks efficiently in unknown and complex environments has therefore emerged as a significant research topic. In recent years, advanced methods such as model predictive control, optimization-based methods, and particle swarm optimization (PSO) algorithms have been successfully applied to UAVs. These algorithms empower UAVs to navigate and complete complex tasks in unfamiliar environments by improving route planning and decision-making.
Faced with denied environments lacking external positioning systems, Abraham et al. [9] presented a method that enables a quadrotor helicopter equipped with a laser rangefinder to autonomously explore and map unstructured indoor environments. Mac et al. [10] introduced a method combining sensor fusion for localization with an improved potential field method for UAV obstacle avoidance, alongside a PID controller optimally tuned using multi-objective particle swarm optimization. However, these methods relied heavily on onboard sensors, which may limit performance in more complex or dynamically changing environments. Rothmund et al. [11] utilized scenario-based model predictive control for inspection drones to avoid obstacles in unknown environments. Although the UAV would change speed or take detours to avoid flying through potentially dangerous areas, the system was limited by its reliance on a pre-planned path, and the high computational cost of the method made it difficult to guarantee real-time performance. Similarly, Kulathunga et al. [12] introduced an optimization-based trajectory-tracking approach for multi-rotor aerial vehicles operating in unknown environments. The method used a dual-planner system consisting of a global planner and a local planner: the global planner adjusted the initial reference trajectory to avoid obstacles, while the local planner generated optimal control policies for following the adjusted trajectory. This approach relied on precise mathematical models, which can be a limitation in unknown environments where model parameters are often difficult to obtain accurately. Saccani et al. [13] introduced a multi-trajectory model predictive control scheme for UAV navigation in unknown, static environments, using LiDAR for obstacle detection and path planning that balanced safety and goal achievement. To coordinate unmanned surface vehicles (USVs) and UAVs for maritime parallel search, Li et al. [14] proposed a dynamic event-triggered control mechanism to reduce communication load and a sensor-tolerant control scheme to handle sensor faults; simulation experiments demonstrated its effectiveness in maintaining formation and achieving full coverage of the search area. However, the method depended on threshold tuning and assumed that the states of the USVs and UAVs were known, which is difficult to guarantee in practice, and it did not consider obstacles, limiting its applicability in complex environments.
In recent years, with the rapid development of artificial intelligence (AI), reinforcement learning (RL), a branch of machine learning, has achieved remarkable results in decision-making for complex tasks such as autonomous driving, robot control, game strategy optimization, and financial trading [15,16,17]. Unlike traditional optimal control methods, an RL agent learns a policy through interaction with the environment, aiming to maximize the cumulative reward over time. It does not necessarily require a known model of the system dynamics and adapts through trial and error, making it better suited to handling uncertainties and unknown dynamics.
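To make this interaction scheme concrete, the following minimal Python sketch uses a toy one-dimensional environment and a random placeholder policy (both purely illustrative and not part of any cited work) to show how an agent accumulates reward through trial-and-error interaction without a model of the system dynamics.

```python
import random

class ToyEnv:
    """Illustrative 1D environment: the agent should move toward the origin."""
    def reset(self):
        self.pos = random.uniform(-10.0, 10.0)
        return self.pos  # observation

    def step(self, action):
        self.pos += action                      # apply the chosen action
        reward = -abs(self.pos)                 # closer to the origin is better
        done = abs(self.pos) < 0.1              # episode ends near the origin
        return self.pos, reward, done

def random_policy(observation):
    """Placeholder policy; an RL algorithm would improve this from experience."""
    return random.uniform(-1.0, 1.0)

env = ToyEnv()
obs, total_reward = env.reset(), 0.0
for _ in range(100):                            # one episode of interaction
    action = random_policy(obs)
    obs, reward, done = env.step(action)
    total_reward += reward                      # cumulative reward to be maximized
    if done:
        break
print(f"cumulative reward: {total_reward:.2f}")
```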
Pham et al. [18] integrated a PID controller with Q-learning, a classical reinforcement learning algorithm, to manage a UAV's trajectory; this combination enabled the UAV to adjust its path dynamically based on real-time inputs, learning from the environment to enhance both navigation precision and efficiency. Li et al. [19] enhanced a UAV's ability to adapt swiftly to unpredictable target movements by integrating deep reinforcement learning (DRL) with meta-learning, significantly improving tracking accuracy and efficiency in diverse scenarios such as wildlife protection and emergency aid. For aerial robots performing search and rescue in unknown environments, Ramezani et al. [20] introduced an energy-aware hierarchical reinforcement learning approach that uses a predictive energy consumption algorithm to enhance mission sustainability and endurance. To address the real-time challenges in UAV decision-making, Zhao et al. [21] proposed adaptive and random exploration methods, ensuring that UAVs reach their target positions via reasonable and safe paths and enhancing their effectiveness and safety in dynamic environments. Kun et al. [22] utilized the deep deterministic policy gradient (DDPG) algorithm to direct UAVs toward fixed-point targets; however, their method did not address the guidance and tracking of moving targets. In response to this limitation, Alejandro et al. [23] developed a Gazebo-based reinforcement learning framework that applied the DDPG algorithm to continuous UAV landing on a moving ship, improving the UAV's adaptability and performance in dynamic environments. However, the initial position of the UAV was fixed in their simulations, which does not fully reflect autonomous maneuvering decision-making. To address UAV flight control in dynamic environments with random wind turbulence, Ma et al. [24] introduced an incremental reinforcement learning (IRL) algorithm that employs policy relief (PR) for exploration and significance weighting (SW) to enhance learning. The method was validated through simulations and real-world tests and demonstrated good tracking performance under wind disturbances, although it assumed a fully known environment and did not account for collision avoidance. Xia et al. [25] proposed an end-to-end cooperative multi-agent reinforcement learning (MARL) scheme for UAV target tracking, addressing limitations such as unknown trajectories and limited flight performance. Their work modeled communication between the UAVs in the swarm, improving coordination and efficiency, and enhanced tracking with an energy-saving strategy and spatial information entropy, outperforming deep reinforcement learning baselines in simulations. However, it requires position information, assumes half-duplex communication, and may struggle in communication-denied environments.
Faced with the complexity of autonomous decision-making for UAVs in unknown environments, and to enable UAVs to make decisions autonomously from limited observation information while maintaining strong generalization capabilities, this paper applies deep reinforcement learning to UAV SAR tasks. A two-stage target search and tracking method for UAVs based on deep reinforcement learning is proposed. The novelties of this paper are as follows.
- (1) A deep deterministic policy gradient with three critic networks (DDPG-3C) algorithm is proposed to alleviate the overestimation problem in the critic network of the traditional DDPG algorithm. This method enhances both training speed and effectiveness by adopting three critic networks and introducing an experience replay buffer mechanism.
- (2) A two-stage target search and tracking method is introduced to enhance the search success rate and tracking performance of UAVs in SAR tasks within unknown environments. This method divides SAR tasks into a search stage and a tracking stage, and the controller for each stage is trained with the proposed DDPG-3C algorithm.
- (3) A simplified two-dimensional SAR scenario is designed to demonstrate the practical application of the proposed methods and algorithms.
By introducing the DDPG-3C algorithm, this paper effectively alleviates the overestimation problem found in the traditional DDPG algorithm, leading to faster convergence and improved decision-making ability. The proposed two-stage target search and tracking method significantly improves the efficiency of UAV operations in SAR tasks, providing not only a more efficient search strategy but also better adaptability to the target's movements. Additionally, the development of a simplified two-dimensional SAR simulation environment provides a solid foundation for future research in more complex 3D environments.
The structure of this paper is as follows: Section 2 describes the search and rescue task and presents a simplified model of the SAR scenario. Section 3 introduces the proposed deep deterministic policy gradient with three critic networks (DDPG-3C) and its training process. Section 4 presents the proposed two-stage search and tracking method for SAR tasks. In Section 5, the effectiveness of the proposed DDPG-3C model and of the two-stage target search and tracking method is validated through extensive experiments. Finally, conclusions are drawn in Section 6.
5. Experimental Simulations
To validate the effectiveness of the proposed DDPG-3C model, which uses three critic networks to alleviate the overestimation problem, its performance is compared with two popular reinforcement learning baselines, DDPG [27] and TD3 (twin delayed deep deterministic policy gradient) [29], in the two-dimensional search and rescue (SAR) scenario described in Section 2. In addition, to validate the advantages of the proposed two-stage target search and tracking method, this paper compares it with a traditional single-stage reinforcement learning model in the same SAR scenario.
When training the DDPG, TD3, and DDPG-3C decision models, the hyperparameters are set according to Table 2.
In the following subsections, the DDPG, TD3, and DDPG-3C decision models are applied to the search stage and the tracking stage of the SAR task, and the traditional single-stage method is included for comparison. All algorithms are implemented in Python and run on a desktop computer with a GeForce RTX 4090 GPU and a 13th Gen Intel Core i9-13900KS CPU.
5.1. Search Stage Simulations
In the search stage, no target is present in the scenario while the decision model is being trained. The designed SAR scenario includes a single UAV with a sensor range of 200 m and two obstacles, each with a radius of 50 m. Both the obstacles and the UAV are positioned at randomly generated locations within a two-dimensional rescue area spanning 0 to 800 m. The obstacles are static, while the UAV is driven by horizontal and vertical thrust forces with a maximum action value of 400 N. The hyperparameters for the DDPG, TD3, and DDPG-3C decision models are listed in Table 2.
The training results of the DDPG-3C, TD3, and DDPG decision models in the search stage of the SAR task in denied environments are shown in Figure 7. The results are smoothed using an exponential moving average (EMA) to help visualize trends more clearly by reducing noise from the raw data, according to Equation (19):

$\tilde{R}_t = \alpha \tilde{R}_{t-1} + (1 - \alpha) R_t$, (19)

where $R_t$ is the raw reward at training step $t$, $\tilde{R}_t$ is the smoothed reward, and $\alpha$ is the smoothing factor: a value closer to 1 implies heavier smoothing, and a value closer to 0 implies lighter smoothing. Moreover, the detailed convergence rewards and convergence steps are listed in Table 3.
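For illustration, the EMA smoothing of Equation (19) can be computed with a few lines of Python; the function name and the example reward values below are chosen only for demonstration.

```python
def exponential_moving_average(values, alpha=0.9):
    """Smooth a reward curve; a larger alpha gives heavier smoothing."""
    smoothed = []
    prev = values[0]
    for v in values:
        prev = alpha * prev + (1.0 - alpha) * v   # recursion of Equation (19)
        smoothed.append(prev)
    return smoothed

# Example: smooth a short, noisy reward sequence
raw_rewards = [520, 610, 480, 590, 650, 570, 630]
print(exponential_moving_average(raw_rewards, alpha=0.9))
```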
It can be seen from Figure 7a and Table 3 that the DDPG-3C model, which utilizes three critic networks, shows a reward trajectory that steadily increases and stabilizes at approximately 600, indicating robust learning and performance stability. At the same time, the DDPG-3C model achieves optimal performance more rapidly than its counterparts. In contrast, the traditional DDPG model, which uses a single critic network, reaches a lower peak reward of around 560 and takes approximately 7 million training steps to converge, showing a slower learning rate compared with the DDPG-3C model. Similarly, the TD3 model, which uses twin critic networks, stabilizes at a reward of approximately 560 but requires about 4 million training steps to converge, demonstrating slower convergence than DDPG-3C but faster than DDPG.
When it comes to the estimation of target Q-values, the red line in Figure 7b represents the target Q-value for the DDPG-3C model. This value is calculated by discarding the highest of the three Q-value estimates and averaging the remaining two. Figure 7b shows that the target Q-value for DDPG-3C is consistently lower than that of the traditional DDPG and TD3, indicating that the proposed decision model can effectively alleviate the problem of overestimation.
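The NumPy sketch below illustrates this target calculation; the function and variable names are illustrative rather than taken from the paper's implementation. The highest of the three critic estimates is discarded for each sample, the remaining two are averaged, and the result is folded into the usual bootstrapped target.

```python
import numpy as np

def ddpg3c_target(q1, q2, q3, rewards, dones, gamma=0.99):
    """Target Q-value in the spirit of DDPG-3C: drop the largest of the three
    critic estimates per sample, average the remaining two, and form the
    standard bootstrapped target."""
    q_stack = np.stack([q1, q2, q3], axis=0)      # shape: (3, batch)
    q_sorted = np.sort(q_stack, axis=0)           # ascending along the critic axis
    q_conservative = q_sorted[:2].mean(axis=0)    # discard the highest estimate
    return rewards + gamma * (1.0 - dones) * q_conservative

# Example with a batch of four transitions
q1 = np.array([10.0, 8.0, 5.0, 7.0])
q2 = np.array([ 9.0, 9.5, 4.0, 6.5])
q3 = np.array([12.0, 7.0, 6.0, 9.0])
rewards = np.array([1.0, 0.5, -1.0, 0.0])
dones = np.array([0.0, 0.0, 1.0, 0.0])
print(ddpg3c_target(q1, q2, q3, rewards, dones))
```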
To compare the performance of these three models during the search stage, each model is tested across 10,000 episodes. The average coverage ratio and average collision rate are calculated to evaluate their effectiveness and safety. All models are initialized under the same conditions to maintain fairness in testing. The results are shown in Table 4. For the simulation video of the search stage, please click here (https://easylink.cc/5r3njl, accessed on 25 September 2024) to view (Supplementary Materials).
According to the data presented in Table 4, the DDPG model recorded a collision frequency of 543, meaning that collisions occurred in approximately 5.43% of the episodes. The TD3 model demonstrated an improvement, with a collision frequency of 461 (4.61%). However, the DDPG-3C model achieved a significantly lower collision frequency, with a total of 349 collisions (3.49%). This improvement indicates that the DDPG-3C decision model is more effective at avoiding obstacles and is therefore safer during the search stage of SAR tasks. In terms of area coverage, the traditional DDPG model achieved a coverage rate of 87.92%, while the TD3 model achieved a slightly better coverage rate of 89.14%. However, the DDPG-3C model outperformed both, achieving a higher coverage rate of 90.17%. This indicates not only more thorough coverage per episode but also improved search efficiency.
5.2. Tracking Stage Simulations
In the tracking stage, the goal is to maintain continuous surveillance of a moving target within a defined two-dimensional area spanning 0 to 800 m. The scenario includes a single UAV that has complete perception of the environment’s states. The environment also contains two static obstacles, each with a radius of 50 m, randomly placed in the area. The UAV operates based on the horizontal and vertical thrust forces, with a maximum action value of 400 N, to keep track of the target without colliding with the obstacles. By providing the UAV with full environmental awareness, the model can learn the optimal strategies for tracking and obstacle avoidance without the constraints of sensor range.
The DDPG, TD3, and DDPG-3C models are trained under the same conditions using the hyperparameters listed in Table 2. Training results are smoothed using the exponential moving average (EMA) of Equation (19) to highlight trends more effectively. Moreover, the detailed convergence rewards and convergence steps are listed in Table 5.
Figure 8a,b displays the reward and target Q-value trajectories for the DDPG, TD3, and DDPG-3C decision models. Similar to the results from the search stage, the DDPG-3C model, with its three critic networks, converges more quickly, at about 3 million training steps, and attains a higher reward of approximately 200. In contrast, the TD3 model takes about 4 million training steps to converge, with a reward of approximately 180, and the traditional DDPG model converges at around 7 million training steps with a lower reward of roughly 160. These results suggest that DDPG-3C adapts better to the target's movements than the other models. The target Q-values for DDPG-3C, calculated by discarding the highest estimate and averaging the remaining two, are lower than those of the traditional DDPG and TD3. This offers a more conservative and accurate estimation, reducing the risk of overestimation.
To compare the performance of these three models during the tracking stage, each model is tested across 10,000 episodes. The average collision frequency and the tracking success rate, defined as the percentage of episodes in which the UAV successfully tracks and maintains the target within a range of 20 m, are calculated and compared. All models are initialized under the same conditions to maintain fairness in testing. The results are shown in Table 6. For the simulation video of the tracking stage, please click here (https://easylink.cc/5r3njl, accessed on 25 September 2024) to view.
According to the data presented in Table 6, collisions occurred in approximately 3.16% of the episodes when implementing the DDPG decision model in the tracking stage, while the TD3 model showed an improvement with a collision frequency of 274 (2.74%). The DDPG-3C decision model demonstrated an even lower collision frequency of 189, corresponding to a collision rate of 1.89%. In terms of tracking success, the DDPG model achieved a tracking success frequency of 9357 out of 10,000 episodes, while the TD3 model performed slightly better with a tracking success frequency of 9397 (93.97%). However, the proposed DDPG-3C decision model outperformed both, achieving a tracking success rate of 95.12%. This higher success rate suggests that DDPG-3C, with its additional critic networks, provides more accurate value estimates and better decision-making ability than DDPG and TD3.
5.3. Whole Period
In this experiment, the whole period of the SAR task is considered, combining the search stage and the tracking stage. Initially, the UAV operates in search mode, aiming to extensively cover a defined two-dimensional area (0 to 800 m) to detect any potential targets. During this stage, the UAV lacks target and obstacle information and relies on its sensor system to detect obstacles and other objects. Once a target enters the UAV's radar range (within 200 m), the UAV switches to tracking mode and continuously tracks the moving target. The search stage and tracking stage use the DDPG-3C decision models trained in Section 5.1 and Section 5.2, respectively. The implementation process of the whole period is shown in Figure 4.
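A minimal sketch of this mode-switching logic is given below; the policy objects, the state representation, and the distance test are illustrative assumptions rather than the paper's exact implementation, and trained DDPG-3C actors would take the place of the placeholder policies.

```python
import math

SENSOR_RANGE = 200.0  # m, the UAV's radar range assumed in the scenario

def select_action(uav_state, target_position, search_policy, tracking_policy):
    """Run the search-stage controller until the target enters radar range,
    then hand control over to the tracking-stage controller."""
    if target_position is not None:
        dx = uav_state["x"] - target_position[0]
        dy = uav_state["y"] - target_position[1]
        if math.hypot(dx, dy) <= SENSOR_RANGE:
            return tracking_policy(uav_state, target_position)  # tracking stage
    return search_policy(uav_state)                             # search stage

# Example with placeholder policies (thrust commands in newtons)
search_policy = lambda s: (400.0, 0.0)
tracking_policy = lambda s, t: (200.0, 200.0)
uav_state = {"x": 700.0, "y": 700.0}
print(select_action(uav_state, (600.0, 650.0), search_policy, tracking_policy))
```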
Figure 9 shows the experimental results of the proposed two-stage target search and tracking method for SAR tasks, which include avoiding obstacles, searching for targets, and tracking the target. In Figure 9a–c, the UAV's initial position is (700, 700), with a random initial angle and an initial speed of 0. The target's initial position is (100, 100), with a speed of 30 m/s and an initial angle of
. The positions of the obstacles are (200, 200) and (400, 400). In Figure 9d–f, the UAV's initial position is also (700, 700), with a random initial angle and an initial speed of 0. The target's initial position is (100, 100), with a speed of 30 m/s and an initial angle of
, while the obstacles are located at (200, 80) and (350, 200).
It can be seen from Figure 9a,d that the UAV starts from a random position and searches the environment for potential targets. After the UAV detects a target, as illustrated in Figure 9b,e, it transitions into tracking mode. The UAV adjusts its path to approach the target while avoiding obstacles. In the tracking stage, as shown in Figure 9c,f, the UAV not only maintains its course towards the target but also adapts to changes in the target's speed. Additionally, it consistently avoids obstacles, as demonstrated in Figure 9c. For the simulation video of the whole period, please click here (https://easylink.cc/5r3njl, accessed on 25 September 2024) to view.
Figure 10 presents the experimental results of the proposed two-stage target search and tracking method for SAR tasks, featuring scenarios with 5, 9, and 13 obstacles, respectively. In each scenario, the UAV's initial position is (700, 700), with a random initial angle and an initial speed of 0, and the target's speed is 30 m/s. In Figure 10a, the target's initial position is (400, 500), with an angle of
, and the obstacles are located at (100, 100), (100, 700), (700, 100), (700, 700), and (400, 400). In Figure 10b, four additional obstacles are added, positioned at (400, 100), (100, 400), (700, 400), and (400, 700), and the target's initial position is (400, 500), with an angle of
. In Figure 10c, the number of obstacles is increased to 13, evenly distributed throughout the SAR scenario, and the target moves at a speed of 30 m/s along a predetermined trajectory.
It can be seen from the figure that, in SAR scenarios with varying numbers of obstacles, the UAV successfully avoids obstacles during the search stage while covering the SAR area to locate the target. Once the UAV's radar detects the target, it switches to tracking mode. During the tracking stage, the UAV not only effectively tracks the target but also continuously avoids obstacles. For the simulation video of the whole period, please click here (https://easylink.cc/5r3njl, accessed on 25 September 2024) to view.
To better illustrate the effectiveness of the proposed two-stage target search and tracking method for SAR tasks, this paper compares the models trained with the two-stage method against a model trained with a traditional single-stage reinforcement learning method. In the traditional single-stage SAR task, the UAV is required to search for and track the target as quickly as possible while avoiding obstacles. The traditional single-stage method also uses the DDPG-3C model, with hyperparameters similar to those of the DDPG-3C decision models used in the search and tracking stages. The state is defined as in Equation (2). The reward function is as follows:

$r = r_{\mathrm{prox}} + r_{\mathrm{find}} + r_{\mathrm{coll}}$,

where $r_{\mathrm{prox}}$ is the proximity reward, calculated from the Euclidean distance between the UAV and the target as shown in Equation (17); $r_{\mathrm{find}}$ represents the reward for finding the target; and $r_{\mathrm{coll}}$ represents the collision reward for being too close to barriers or obstacles.
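The sketch below shows one plausible way to assemble this reward in Python; the weighting coefficient and the bonus and penalty values are placeholders chosen for illustration, not the values defined by Equation (17) or used in the paper.

```python
import math

def single_stage_reward(uav_pos, target_pos, target_found, collided,
                        distance_weight=0.01, find_bonus=10.0,
                        collision_penalty=-50.0):
    """Combine the three reward terms of the single-stage method:
    proximity, target-finding, and collision."""
    distance = math.dist(uav_pos, target_pos)        # Euclidean distance
    r_prox = -distance_weight * distance             # closer to the target is better
    r_find = find_bonus if target_found else 0.0     # bonus for detecting the target
    r_coll = collision_penalty if collided else 0.0  # penalty for getting too close
    return r_prox + r_find + r_coll

# Example: UAV at (700, 700), target at (100, 100), not yet found, no collision
print(single_stage_reward((700.0, 700.0), (100.0, 100.0), False, False))
```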
When comparing the performance of the two-stage target search and tracking method with the traditional single-stage method, each model is tested across 20,000 episodes. The average number of steps taken to find the target, defined as the target being detected by the UAV's sensor, and the average number of steps to successfully track the target, defined as the UAV maintaining the target within a 20 m range, are calculated and compared, along with the collision rates. Both models are initialized under the same conditions to maintain fairness in testing. The results are presented in Table 7. To view the comparison video of these two methods, please click here (https://easylink.cc/5r3njl, accessed on 30 September 2024).
According to the data presented in Table 7, the two-stage target search and tracking method, which divides SAR tasks into search and tracking stages, requires an average of 171.27 steps to find the target. This is significantly fewer than the traditional single-stage method, which takes an average of 313.78 steps, suggesting that the specialized training of the search stage yields a more efficient search strategy. Once the target is detected, the two-stage method takes an average of 354.54 steps to track the target and maintain it within a 20 m range, whereas the traditional single-stage method requires more steps, averaging 481.68. This means that the UAV operates more effectively in dynamic environments when using the two-stage target search and tracking method. In terms of collision rates, the traditional single-stage method has a slightly higher collision rate of 4.51%, compared with 4.07% for the two-stage method, indicating that the two-stage method is not only faster but also safer.
It should be noted that although the two-stage target search and tracking method is more efficient overall in finding and tracking targets than the traditional single-stage method, it does not hold an advantage in the number of steps taken from detecting the target to successfully tracking it. Specifically, when measuring the steps from finding the target to tracking it, the traditional single-stage method requires fewer steps (167.9) than the two-stage method (183.27). This can be attributed to the state transition from the search stage to the tracking stage: for the two models trained with the two-stage method, the UAV's state at the end of the search stage acts like noise that the tracking controller must overcome, requiring additional steps to stabilize and effectively track the target.