Article

A Long-Term Target Search Method for Unmanned Aerial Vehicles Based on Reinforcement Learning

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
* Authors to whom correspondence should be addressed.
Drones 2024, 8(10), 536; https://doi.org/10.3390/drones8100536
Submission received: 16 July 2024 / Revised: 12 September 2024 / Accepted: 27 September 2024 / Published: 30 September 2024

Abstract

Unmanned aerial vehicles (UAVs) are increasingly being employed in search operations. Deep reinforcement learning (DRL), owing to its robust self-learning and adaptive capabilities, has been extensively applied to drone search tasks. However, traditional DRL approaches often suffer from long training times, especially in long-term search missions for UAVs, where the interaction cycles between the agent and the environment are extended. This paper addresses this critical issue by introducing a novel method—temporally asynchronous grouped environment reinforcement learning (TAGRL). Our key innovation lies in recognizing that as the number of training environments increases, agents can learn knowledge from discontinuous trajectories. This insight leads to the design of grouped environments, allowing agents to explore only a limited number of steps within each interaction cycle rather than completing full sequences. Consequently, TAGRL demonstrates faster learning speeds and lower memory consumption compared to existing parallel environment learning methods. The results indicate that this framework enhances the efficiency of UAV search tasks, paving the way for more scalable and effective applications of RL in complex scenarios.

1. Introduction

Due to the low cost and flexibility of unmanned aerial vehicles (UAVs), they are frequently employed in various search missions. Depending on the type of target motion, these problems can be categorized into static target search problems and dynamic target search problems. In static target search problems, common scenarios include resource exploration in harsh environments and rescuing individuals trapped in earthquakes. In dynamic target search problems, application scenarios encompass rescuing lost tourists in forests and searching for enemy targets on patrol in combat scenarios.
Based on the differences in modeling approaches, the existing methods for UAV search can be primarily categorized into four types: traditional methods, heuristic methods, probability-based methods, and deep reinforcement learning-based methods. Traditional methods can be further divided into exact algorithms and approximation algorithms. The notable representatives of exact algorithms are integer programming algorithms and A* search algorithms [1]. Approximation algorithms include Voronoi diagram methods [2] and visibility graph-based methods [3]. Traditional methods have good interpretability and exhibit excellent performance on simple problems. However, they often lack scalability and are difficult to model and solve when dealing with complex issues. Heuristic algorithms encompass genetic algorithms [4,5], ant colony optimization [6], wolf pack algorithms [7], and particle swarm optimization [8,9]. These algorithms can be applied to search problems of various scales and are often easier to model. Probability-based methods [10,11] often exhibit superior interpretability compared to intelligent algorithms, and in some instances, they may also demonstrate enhanced performance. Nevertheless, these methods yield fixed solutions. When the scenario undergoes slight changes, the solutions obtained by these algorithms cannot automatically adapt to the variations in the scenario. In contrast, deep reinforcement learning-based algorithms can produce an intelligent agent capable of real-time decision-making, thereby possessing a better ability to adapt to dynamic scenarios.
The characteristics of the four categories of methods are presented in Table 1:
In conclusion, deep reinforcement learning methods are emerging as a focal point for researchers due to their ease of modeling, scalability to problems of varying magnitudes, and adaptability to changing scenarios. These attributes make deep reinforcement learning an increasingly attractive approach in the field.
Researchers have proposed numerous methods for guiding search using deep reinforcement learning algorithms. Imanberdiyev et al. [12] solved the problem of autonomous drone path planning in a positional environment using a model-based reinforcement learning method, while Wang et al. [13] proposed a reinforcement learning method for this problem that does not require building a planning model, investigating drone search strategies in scenarios with a limited communication spectrum. Wang et al. [14] studied the search and tracking problem for multiple drones under energy constraints and explored drone search schemes in scenarios with potential threats. Shen et al. [15] proposed an improved multi-agent reinforcement learning algorithm for the scenario of multiple UAVs searching for multiple static targets. Su et al. [16] investigated a multi-agent reinforcement learning algorithm with constraints on sensing range and communication range. Zhang et al. [17] investigated a multi-UAV search method that combines belief probability maps with an improved DDPG algorithm.
In these studies, reinforcement learning demonstrated its powerful adaptability and scalability. However, its performance is dependent on extended training times. To address this issue, researchers have proposed numerous agent–environment parallel interaction frameworks. These frameworks leverage the multi-core advantages of modern processors, transforming the original serial multi-round interactions between agents and environments into parallel single-round interactions.
For instance, Ray [18] and RLlib [19] can harness the hardware processing capabilities of distributed systems. Dalton et al. [20] demonstrated that running the Atari environment on GPUs significantly accelerates the agent’s learning speed by increasing the number of environments. Subsequently, Makoviychuk et al. [21] proposed Isaac Gym as a GPU-parallel simulation platform for reinforcement learning, serving as a new benchmark environment for reinforcement learning. Weng et al. introduced Envpool [22], which supports not only the Atari environment but also complex environments such as StarCraft. In various subfields, parallel agent–environment interaction methods have been extensively utilized. For instance, Liu et al. [23] developed a high-performance GPU-parallel underwater robot learning environment. Freeman et al. [24] proposed a differentiable physics simulation tool utilizing GPU parallel computing.
However, when addressing the long-duration search problem for UAVs, these methods still face two primary challenges. First, regardless of the number of processor cores utilized, agents still require at least one complete round of interaction with the environment. When the application scenario of drones involves searching for unknown dynamic targets, the uncertainty of target movement information can lead to lengthy search times. As a result, each round of simulation consumes considerable time, thereby slowing the learning speed of the agent. Secondly, a large number of simulated environments consume substantial GPU memory. Since all exploration data of the agents need to be stored prior to updating the agent network, both the increase in the number of environments and the prolongation of agent exploration time will lead to increased memory usage. Consequently, when addressing the problem of long-term drone search, memory constraints may prevent the creation of a sufficient number of simulated environments for agent training.
In our previous work [25], we proposed the OC-MAPPO method, which integrates optimal control with deep reinforcement learning (DRL) to address the multi-UAV cooperative search problem. The experimental results demonstrated its enhanced adaptability to dynamic environments. Additionally, we employed the environment-pool method proposed in [22], utilizing parallel environments to accelerate agent learning. During experiments, we observed that increasing the number of environments from 1 to 10 significantly improved the agent’s learning speed, as the interaction time with the environment changed from multiple rounds to a single round. Yet, further increasing the number of environments to several hundred or even several thousand did not improve the learning speed, because the data collected before each policy update originate from the same distribution, and adding environments merely yields more homogeneous data. How to overcome this challenge and accelerate DRL training by further scaling up the number of parallel training environments therefore becomes a problem of interest.
In this paper, through analysis and experimentation, we discover that in the context of UAV cooperative search tasks, it is possible to designate different time segments for different environments during the parallel process. This approach allows agents to engage in minimal forward interactions with each environment, further expediting the learning speed of the agents. Our findings suggest a novel approach to optimizing the efficiency of reinforcement learning in multi-agent UAV scenarios, potentially overcoming existing limitations in the field.
The remainder of this article is structured as follows: Section 2 delineates the problem scenario under investigation. Section 3 elucidates and analyzes extant methodologies. Section 4 expounds upon the proposed approach. Section 5 empirically validates the efficacy of the proposed method through experimental evaluation.

2. Problem Formulation

In our research, we examine a search scenario where the target is lost at a specific time point. Given the known parameters of the target’s maximum velocity, last known position, and time of loss, UAVs are deployed to locate the missing target. The primary objective of this study is to determine the optimal search trajectory for each UAV.

2.1. Scenario Description

In our study, the search scenario can be represented by Figure 1. The primary assumptions for this scenario are as follows:
  • Each UAV’s scanning radius is fixed and is an integer multiple of the unit grid edge length within the environment.
  • The targets exhibit stochastic movement patterns within the designated operational zone, demonstrating no intentional evasive behavior in response to UAVs’ reconnaissance activities.
  • UAVs communicate with one another via satellite-based systems, enabling them to share their respective positional data and reconnaissance information.

2.2. Update Model and Move Model

In [25], we employ a target probability map to represent the probability of target presence in each unit area within the task domain. The update of the target probability map is conducted in two steps:
  • Diffusion update due to target movement.
  • Bayesian update based on UAV reconnaissance.
These two steps can be described by the following formulas:

$$TPM_{i,j}^{t+1} = \frac{1}{9}\sum_{m=i-1}^{i+1}\sum_{n=j-1}^{j+1} TPM_{m,n}^{t} \tag{1}$$

$$TPM_{i,j}^{t+1} = \frac{tp \cdot TPM_{i,j}^{t}}{tp \cdot TPM_{i,j}^{t} + (1-tp)\left(1-TPM_{i,j}^{t}\right)} \tag{2}$$

In this context, $tp$ represents the probability of missed detection by the sensor at position $(i,j)$.
In the practical application of the algorithm, two scenarios are considered. The first scenario involves an environment sufficiently large that the target does not reach the boundary when the maximum detection time elapses. Alternatively, boundary conditions must be separately addressed. When the target appears at corners or edges, it exhibits different diffusion probabilities for the surrounding areas. The diffusion probability at corners changes from 1/9 to 1/4, while at edges, it shifts from 1/9 to 1/6. In the experimental results presented in this paper, we adopt the second design approach.
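To make the two-step update concrete, the following NumPy sketch implements the diffusion of Equation (1) with the boundary-aware normalization described above (1/9 interior, 1/6 edges, 1/4 corners) and the Bayesian correction of Equation (2). It is a minimal illustration; the `scanned` mask and the choice to leave unscanned cells untouched are our assumptions, not the paper’s released code.

```python
import numpy as np

def diffuse_tpm(tpm: np.ndarray) -> np.ndarray:
    """Diffusion update (Equation (1)): each cell spreads its probability mass
    uniformly over its valid 3x3 neighbourhood, so interior cells contribute 1/9,
    edge cells 1/6 and corner cells 1/4 to each neighbour."""
    h, w = tpm.shape
    out = np.zeros_like(tpm)
    for i in range(h):
        for j in range(w):
            i0, i1 = max(0, i - 1), min(h, i + 2)
            j0, j1 = max(0, j - 1), min(w, j + 2)
            n_cells = (i1 - i0) * (j1 - j0)           # 9, 6 or 4 valid cells
            out[i0:i1, j0:j1] += tpm[i, j] / n_cells  # spread the source cell's mass
    return out

def bayes_update(tpm: np.ndarray, scanned: np.ndarray, tp: float) -> np.ndarray:
    """Bayesian update (Equation (2)) after a scan reporting no detection, where
    tp is the probability of missed detection; unscanned cells keep their prior
    probability (an assumption of this sketch)."""
    post = tpm * tp / (tp * tpm + (1.0 - tp) * (1.0 - tpm))
    return np.where(scanned, post, tpm)
```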
This article primarily focuses on the rapid learning of search strategies for UAVs rather than on UAV modeling itself. Consequently, the motion model of the UAV is simplified, employing Newton’s Second Law of Motion to describe the UAV’s movement. Let $\alpha^u$ denote the orientation of drone $u$, $pos_x^u$ and $pos_y^u$ represent the position of drone $u$, $vel^u$ indicate the velocity, $a^u$ signify the acceleration, and $\omega^u$ represent the angular acceleration. Consequently, the UAV’s movement can be described as follows:

$$\dot{pos}_x^u = vel^u \cos\alpha^u, \qquad \dot{pos}_y^u = vel^u \sin\alpha^u, \qquad \dot{vel}^u = a^u, \qquad \dot{\alpha}^u = \omega^u \tag{3}$$
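A minimal Euler-integration sketch of Equation (3) is given below; the step size `dt` and the function signature are illustrative assumptions rather than the paper’s implementation.

```python
import math

def step_uav(pos_x, pos_y, vel, alpha, a, omega, dt=1.0):
    """One Euler step of the simplified kinematics in Equation (3): the position
    advances along the current heading, the speed changes with acceleration a,
    and the heading changes with angular rate omega."""
    pos_x += vel * math.cos(alpha) * dt
    pos_y += vel * math.sin(alpha) * dt
    vel += a * dt
    alpha += omega * dt
    return pos_x, pos_y, vel, alpha
```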

2.3. Reconnaissance Model

Let the sensor model of UAV $u$ be denoted by $Sensor^u(\cdot)$ and $Precise^u$, where $Sensor^u$ characterizes the reconnaissance range of the sensor and $Precise^u$ represents the reconnaissance precision of the sensor. The reconnaissance area of UAV $u$ at time $t$ is designated as $D_t^u$, and $p_t^u$ denotes the probability of UAV $u$ detecting the target at time $t$. The reconnaissance model of the UAV can be described using the following formulas:

$$D_t^u = Sensor^u(c_t^u) \tag{4}$$

$$p_t^u = \iint_{(x,y)\in D_t^u} Precise^u(c_t^u, x, y)\, TPM_{x,y}^t \, dx\, dy \tag{5}$$
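On the discretized target probability map, the integral in Equation (5) reduces to a sum over the cells covered by the sensor. The sketch below is illustrative only; `sensor_mask` and `precise` stand in for $D_t^u$ and $Precise^u$ and are assumptions about how the grid is represented.

```python
import numpy as np

def detection_probability(tpm: np.ndarray, sensor_mask: np.ndarray, precise) -> float:
    """Discrete counterpart of Equation (5): sum the target-presence probability
    over the covered cells, weighted by the sensor precision (scalar or per-cell)."""
    return float(np.sum(np.asarray(precise) * tpm * sensor_mask))
```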

2.4. Optimal Control Model Construction

In [25], we proposed an optimal control model to formalize the UAV search problem. In this model, the search objective for the unmanned aerial vehicle (UAV) is to minimize the logarithm of the product of probabilities that the target remains undetected. The mathematical formulation is as follows:
$$\min_{A} J(A(T)) = \int_0^T \sum_{u\in U} \ln\left(1 - p_t^u\right) dt \tag{6}$$

In this equation, $p_t^u$ denotes the probability of drone $u$ detecting the target at time $t$, $U$ represents the set of all drones, $T$ signifies the maximum search time, and $A(t)$ denotes the flight trajectory of a drone within the time interval $[0, t]$.

3. Preliminary and Analysis

Existing reinforcement learning methods typically involve an agent interacting with the environment over multiple episodes, denoted as C. The agent improves its behavior based on the experiences gathered. Without any parallelization, the time taken for the agent to update its policy after interacting with the environment is C × T . If multiple environments are initiated simultaneously for training, as in the OC-MAPPO method, the time required for the agent to update its policy can be reduced to T. However, in OC-MAPPO, the number of environments initiated far exceeds C, yet the time expenditure cannot be further reduced. Based on these considerations, this section reanalyzes the principles of reinforcement learning operations and proposes a new architecture to improve the training process.
When applying reinforcement learning to solve this problem, the cumulative return and single-step reward are defined as follows:
$$return = J(A(T)), \qquad R_t = J(A(t-1)) - J(A(t)) \tag{7}$$
Let $S_t$ denote the state of the UAVs at time $t$. $S_t$ encompasses the UAVs’ positions, their orientations, and the probabilistic map of target information. Let $V(S_t)$ represent the maximum expected reward attainable when the UAVs are in state $S_t$. The mathematical formulation of $V(S_t)$ is as follows:

$$V(S_t) = \min\left[\,J(A(T)) - J(A(t))\,\right] \tag{8}$$

In essence, the original problem can be reformulated as follows:

$$\min J(A(T)) = J(A(t)) + V(S_t) \tag{9}$$

If we know the form of the function $V(\cdot)$, then we can solve the original problem by simply solving the problem above, in which the agent only needs to interact with the environment for $t$ steps. This formula delineates the interactive process between the agent and the environment within the time interval $[0, t]$. The value of $V(S_t)$ can be obtained by solving the following problem:

$$V(S_t) = \min \int_t^T \sum_{u\in U} \ln\left(1 - p_\tau^u\right) d\tau \tag{10}$$

This formula corresponds to the interaction process between the agent and the environment during the time interval $[t, T]$. It can be observed that this problem shares the same structure as the original problem, thus allowing for the application of identical solution methods. In this case, one could directly employ reinforcement learning techniques to solve it, or alternatively, partition it into two intervals: from $t$ to $t_1$ and from $t_1$ to $T$. The optimal policy for the interval from $t$ to $t_1$ can then be determined by estimating $V(S_{t_1})$, ultimately yielding the value of $V(S_t)$.

This raises a question: is it feasible to concurrently execute the processes of solving (9) and (10)? Intuitively, it would seem that the second process must invariably follow the first, as reaching state $S_t$ necessarily requires first attaining state $S_{t-1}$, precluding parallel execution. However, if we initiate both processes simultaneously, with the first process remaining idle during a preparation phase while the second process allows the agent to interact with the environment $t_1$ times, we can then simultaneously complete one simulation corresponding to (9) and (10) in the subsequent interaction (assuming $t = t_1$). This methodology is delineated in Algorithm 1:
Algorithm 1 Asynchronous Exploration Algorithm
  • Warm-up stage:
  •     For $e_1$:
  •         do nothing
  •     For $e_2$:
  •         For $t$ in 1 … $t_1$:
  •             $A_t \leftarrow Actor(S_t)$
  •             $S_{t+1} = Environment(S_t, A_t)$
  •             $S_t \leftarrow S_{t+1}$
  • Train stage:
  • For each $i$ in 1 … $n$:
  •     For $e_1$, $e_2$:
  •         For $t$ in 1 … $t_1$:
  •             $A_t \leftarrow Actor(S_t)$
  •             $S_{t+1} = Environment(S_t, A_t)$
  •             $S_t \leftarrow S_{t+1}$
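The sketch below illustrates Algorithm 1 with two environments; the Gym-style `reset`/`step` interface and the `actor.act` method are assumptions used only for illustration.

```python
def asynchronous_exploration(e1, e2, actor, t1, n_updates):
    """Sketch of Algorithm 1: during warm-up e1 idles while e2 advances t1 steps,
    so that afterwards e1 covers the [0, t1] segment and e2 covers the [t1, T]
    segment of an episode in the same wall-clock cycle."""
    s1, s2 = e1.reset(), e2.reset()
    for _ in range(t1):                          # warm-up: only e2 moves forward
        s2, _, done, _ = e2.step(actor.act(s2))
        if done:
            s2 = e2.reset()
    for _ in range(n_updates):                   # train stage
        for _ in range(t1):                      # both environments advance t1 steps
            s1, _, d1, _ = e1.step(actor.act(s1))
            s2, _, d2, _ = e2.step(actor.act(s2))
            if d1:
                s1 = e1.reset()
            if d2:
                s2 = e2.reset()
        # collect the transitions gathered above and update the actor/critic here
```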
It is important to note that this approach differs from using a single process to perform T simulations. Because the agent’s policy exhibits some randomness in each interaction, and the environment itself may possess inherent stochasticity, the terminal state of the first process and the initial state of the second process may not be identical. What impact, then, does this difference have on the learning of the agent?
To illustrate this issue, we use the Proximal Policy Optimization (PPO) [26] algorithm, a prominent example from the field of reinforcement learning. The solution process involves two components: the actor and the critic. The actor is responsible for the agent’s behavioral policy $\pi$, accepting the drone’s state as input and outputting the drone’s action, i.e., $\pi(S_t) \to a_t$. The critic is responsible for fitting $V(\cdot)$. If the PPO algorithm is employed to train the actor and critic, the parameter update formulas are as follows:
For the actor:
$$L(\theta_t) = \mathbb{E}_t\left[\min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, Adv^{\pi_{\theta_{old}}}(s_t, a_t),\ \operatorname{clip}\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon\right) Adv^{\pi_{\theta_{old}}}(s_t, a_t)\right)\right], \qquad \theta_{t+1} = \theta_t + \frac{\partial L(\theta_t)}{\partial \theta_t}\, lr \tag{11}$$
In this context, $\theta$ represents the parameters of the actor network, and $lr$ denotes the learning rate. For the critic:
$$L(v_t) = \mathbb{E}_{s_t \sim \pi_\theta}\left[\left(Critic_v(s_t) - Critic_t^{target}\right)^2\right], \qquad Critic_t^{target} = r_t + \gamma\, Critic_v(s_{t+1}), \qquad v_{t+1} = v_t - \frac{\partial L(v_t)}{\partial v_t}\, lr \tag{12}$$
wherein $v$ denotes the parameters of the critic network, and $\gamma$ represents the discount factor, a commonly used parameter in reinforcement learning. In this particular problem, to ensure correspondence with the original formulation, $\gamma$ is set to 1. Let us first consider the learning process of the critic, whose fitting target is as follows:
$$Critic(S_t) - Critic(S_{t+1}) = r \tag{13}$$
Although the ultimate goal of the critic is to fit $V(\cdot)$, it first attempts to fit the expected reward under the current policy. It then improves the actor’s policy by letting the actor choose actions that generate more expected reward than the current policy. When the actor’s policy has been improved to the optimal policy, the critic network will be able to fit the true value $V(\cdot)$. From Equation (13), it can be seen that the update of the critic depends on the reward $r$ obtained from the interaction between the agent and the environment. Whether in the (9) process or the (10) process, the agent can obtain the true $r$ from its interaction with the environment, thereby improving the accuracy of the critic’s fit to the expected return $V(\cdot)$.
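As a concrete illustration of this fitting step, a minimal PyTorch-style sketch of the TD target in Equations (12) and (13) might look as follows; the critic module interface and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, s_t, r_t, s_next, gamma=1.0):
    """TD(0) target from Equations (12)-(13): the critic at s_t should equal
    r_t + gamma * critic(s_next); gamma is 1 in this problem."""
    with torch.no_grad():
        target = r_t + gamma * critic(s_next).squeeze(-1)
    return F.mse_loss(critic(s_t).squeeze(-1), target)
```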
Nonetheless, a challenge persists for (9). In the original process, the following equation holds:

$$Environment(S_t, A_t) = S_{t+1}, \qquad t \in \{1, 2, \ldots, T\} \tag{14}$$
where $Environment(\cdot)$ denotes the environmental model. However, for Algorithm 1, at $t = t_1$ it holds that $S_{t_1+1}^{e_2} \neq Environment(S_{t_1}^{e_1}, A_{t_1}^{e_1})$. This discrepancy means that $e_1$ cannot generate labels for the critic, because for each state $S_t^{e_1}$ of $e_1$, the corresponding critic label $V(S_t^{e_1})$ is computed as follows:
$$V(S_t^{e_1}) = \sum_{i=t}^{t_1} V(S_i) + \sum_{i=t_1+1}^{T} V(S_i) \tag{15}$$
Since the trajectory of $e_1$ is interrupted at $t_1$, the data for $\sum_{i=t_1+1}^{T} V(S_i)$ cannot be obtained.
Fortunately, this issue is alleviated when the agent interacts simultaneously with multiple environments. Consider the set $\mathcal{S}$, which contains all states reachable after $t_1$ time steps from the initial state $S_1$. Thus, $S_{t_1+1}^{e_2} \in \mathcal{S}$ and $Environment(S_{t_1}^{e_1}, A_{t_1}^{e_1}) \in \mathcal{S}$. Although both belong to the same set, because $\mathcal{S}$ often contains a large number of elements, the critic cannot infer $V(S_{t_1+1}^{e_1})$ solely from $V(S_{t_1+1}^{e_2})$. However, if we have a large amount of data, say $m$ samples, we can obtain their labels through the following equations:
$$V(S_{t_1+1}^{e_2}) = \sum_{i=t_1+1}^{T} V(S_i^{e_2}), \qquad V(S_{t_1+1}^{e_3}) = \sum_{i=t_1+1}^{T} V(S_i^{e_3}), \qquad \ldots, \qquad V(S_{t_1+1}^{e_m}) = \sum_{i=t_1+1}^{T} V(S_i^{e_m}) \tag{16}$$
Then, we can use these $m$ labeled data points to fit $V(s)$, $s \in \mathcal{S}$, with the critic network. The fitted critic can then be used to obtain the value of $V(S_{t_1+1}^{e_1})$:
$$\hat{V}(s) = Critic(s),\ s \in \mathcal{S}; \qquad \hat{V}(S_{t_1+1}^{e_1}) \approx Critic(S_{t_1+1}^{e_1}) \tag{17}$$
Regarding the actor, as per Equation (11), the update of the actor is related to $\pi_\theta$ and $Adv$, where $\pi_\theta$ denotes the policy parameterized by $\theta$, and $Adv(S_t, a_t)$ represents the advantage of executing action $a_t$ in state $S_t$, which can be calculated by the following equation:
$$Adv(s_t) = r_t + \gamma \cdot Critic(s_{t+1}) - Critic(s_t) \tag{18}$$
From the above equation, it is evident that the value of $Adv$ is related to the output of the critic and the actual reward $r$ obtained by the agent. When the critic can be updated normally, the actor can also be updated accordingly.
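For completeness, a PyTorch-style sketch of the clipped surrogate objective in Equation (11), taking advantages computed via Equation (18) as input; the function signature is an assumption.

```python
import torch

def ppo_actor_loss(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective of Equation (11); minimising the negative
    mean is equivalent to ascending the clipped objective."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.mean(torch.min(unclipped, clipped))
```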
Through the preceding analysis, we have demonstrated that when the number of environments is large, the interaction between the agent and the environment can be divided into two parts, the $[0, t_1]$ segment and the $[t_1, T]$ segment, which can run concurrently. After this partition, the $[t_1, T]$ process exhibits the same structure as the original problem, which suggests that if the number of environments is sufficiently large, the $[t_1, T]$ process can be further subdivided. Consequently, we propose the temporally asynchronous grouped environment reinforcement learning (TAGRL) method. This approach divides the environments into several groups; after a preparation phase lasting T, each group executes a different time segment of an interaction episode.
Let $E$ denote the total number of environments and $G$ the number of environment groups. With this design, the agent interacts with the environment $T/G$ times before each update instead of $T$ times. Furthermore, compared with existing parallel methods (such as OC-MAPPO), the TAGRL method has lower memory consumption. The dominant memory cost for both algorithms is the buffer, which stores the experience data generated during training. In the OC-MAPPO method, the data storage requirement is as follows:
$$Memory_{OC\text{-}MAPPO} = E \times T \times \left(Size(obs) \times N + Size(state)\right) \tag{19}$$
In contrast, the memory requirement for TAGRL is as follows:
$$Memory_{TAGRL} = E \times \frac{T}{G} \times \left(Size(obs) \times N + Size(state)\right) \tag{20}$$
where $N$ represents the number of agents. In the above calculations, we disregard the actions and rewards, as their memory usage is significantly smaller than that of the observations and states (the latter are multidimensional tensors while the former are vectors). It is evident that, with the same number of environments, the TAGRL method requires less memory than the OC-MAPPO method.
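The buffer-size comparison of Equations (19) and (20) can be checked with a few lines of arithmetic; the sizes used below are illustrative only and are not taken from the paper.

```python
def buffer_memory(E, T, obs_size, state_size, N, groups=1):
    """Buffer size in stored floats, following Equations (19)-(20):
    E environments, T steps per update (divided by the group count for TAGRL),
    N agents with obs_size observations plus a shared state of state_size."""
    return E * (T // groups) * (obs_size * N + state_size)

# Illustrative numbers (not from the paper): 256 envs, T = 500, 3 agents.
oc_mappo = buffer_memory(E=256, T=500, obs_size=4096, state_size=4096, N=3)
tagrl = buffer_memory(E=256, T=500, obs_size=4096, state_size=4096, N=3, groups=5)
print(oc_mappo / tagrl)  # -> 5.0, i.e., TAGRL stores 1/G of the data
```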

4. Method

Based on the preceding analysis, we propose an asynchronous parallel reinforcement learning framework.
Suppose we have $E$ environments, which we divide into $G$ groups (in Figure 2, we use $G = 5$ as an example to illustrate the method). During the preparation phase, the first four groups of environments interact with the agent $T/G$ times each, while the fifth group remains inactive. The preparation phase consumes time equivalent to one complete round of interaction. Following the preparation phase, before each update of the agent’s policy, each environment group interacts with the agent $T/G$ times. During the warm-up phase, for the $i$-th environment, the following formula determines the number of steps the agent interacts with that environment:
$$GT(i) = t_{max} \times \frac{\left\lfloor i \cdot G / E \right\rfloor}{G} \tag{21}$$
In this context, $G$ represents the number of environment groups and $E$ the total number of environments. For environments that have not terminated, the advantage value obtained by the agent is calculated as follows:
$$\delta(t) = r(t) + Critic(S_{t+1}) - Critic(S_t), \qquad Adv(t) = \delta(t) + \tau \times \gamma \times Adv(t+1) \tag{22}$$
where $r(t) + Critic(S_{t+1})$ represents the expected reward obtained by the agent after executing an action, while $Critic(S_t)$ denotes the expected reward for the agent prior to action execution. $\delta(t)$ signifies the advantage of a single-step action taken by the agent, and $Adv(t)$ represents the cumulative advantage of the agent.
For terminated environments, the advantage value obtained by the agent is calculated as follows:
$$Adv(T) = r(T) - Critic(S_T) \tag{23}$$
After determining the advantage values required for updating the agent, these can be substituted into the original formulas to continue the agent’s training process.
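A PyTorch-style sketch of this advantage computation over one T/G-step segment is given below; `lam` plays the role of $\tau$ in Equation (22), `bootstrap_value` is the critic’s estimate of the state following the segment, and the tensor layout is an assumption.

```python
import torch

def grouped_gae(rewards, values, dones, bootstrap_value, gamma=1.0, lam=0.95):
    """Advantages per Equations (22)-(23): non-terminated steps bootstrap with the
    critic's next value, terminated steps reduce to r(T) - Critic(S_T)."""
    T = rewards.shape[0]
    adv = torch.zeros_like(rewards)
    next_value = bootstrap_value              # critic estimate beyond the segment
    next_adv = torch.zeros_like(values[-1])
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        adv[t] = delta + gamma * lam * next_adv * not_done
        next_value, next_adv = values[t], adv[t]
    return adv
```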
Figure 3 below illustrates the advantages of this method and its distinctions from the original approach. The asterisk (*) denotes multiplication in the figure.
In the most common practice, an agent sequentially interacts with a single environment for n rounds, wherein the time spent on agent–environment interactions is n × T . If parallel environments are employed, utilizing more than n environments to simultaneously interact with the agent and collect data, the time spent on agent–environment interactions is reduced to T. However, adopting this approach, whether using n environments or 100 n environments, the agent–environment interaction time remains T.
In the TAGRL method, environments are divided into different groups. For instance, if there are 100 n environments, divided into 5 groups, each group contains 20 n environments. After the warm-up phase, each environment only needs to advance 0.2 T steps, thereby further reducing the interaction time between the agent and the environments.
This approach significantly enhances the efficiency of data collection and potentially accelerates the learning process. By strategically grouping environments and implementing a phased interaction strategy, TAGRL optimizes the trade-off between the breadth of environmental exploration and the depth of learning within each environment.
The following outlines the process of the TAGRL search Algorithm 2. The symbols utilized in this process are elucidated in Table 2.
Algorithm 2 TAGRL Search Algorithm
  • Initialize the parameters $\theta$ of the actor network $\pi$ using orthogonal initialization
  • Initialize the parameters $v$ of the $Critic$ network using orthogonal initialization
  • Initialize total iterations $I$
  • Let us denote $t_{max}$ as the time limit for drone search operations
  • Let us denote $E$ as the number of environments
  • Let us denote $G$ as the number of environment groups
  • Let us denote $n$ as the maximum number of simulation turns for a drone
  • Let us denote $D^u$ as the reconnaissance coverage of UAV $u$
  • Warm-up stage:
  • For each environment $e$ in 1 … $E$:
  •     Get $O_0$ from the environments
  •     Calculate $GT(e)$ by Equation (21)
  •     For each $t$ in 1 … $GT(e)$:
  •         Execute average filtering on $TPM_e$ by Equation (1)
  •         For each UAV $u$:
  •             $distribution_t^u = \pi(o_t^u, \theta)$
  •             $a_t^u \sim distribution_t^u$
  •             Update $c_t^u$ and $\alpha_t^u$ according to $a_t^u$ by Equation (3)
  •             Update $TPM_e$ by Equation (2)
  • Train stage:
  • For each $i$ in 1 … $n$:
  •     Get $S_0$, $O_0$ from the environments
  •     Set buffer $b = [\ ]$
  •     For each $t$ in 1 … $t_{max}/G$:
  •         For each environment $e$:
  •             $r_t = 0$
  •             Execute average filtering on $TPM_e$ by Equation (1)
  •             For each UAV $u$:
  •                 $distribution_t^u = \pi(O_t^u, \theta)$
  •                 $a_t^u \sim distribution_t^u$
  •                 Calculate $p_t^u$ by Equation (5)
  •                 $r_t$ += $\ln(1 - p_t^u)$
  •                 Update $TPM_e$ by Equation (2)
  •             $value_t = Critic(S_t, v)$
  •             $b$ += $[S_t, value_t, O_t, a_t, r_t]$
  •     Compute advantage estimates $Adv$ via Equations (22) and (23) on $b$
  •     Compute rewards-to-go $R$ on $b$ and normalize
  •     Compute the loss of $\pi$ using $b$, $Adv$, and Equation (11)
  •     Compute the loss of $Critic$ by Equation (12)
  •     Update $\pi$, $Critic$
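To make the overall flow concrete, the sketch below strings the pieces together. It is illustrative only: `make_env`, the Gym-style `reset`/`step` interface, `actor.act`, and the `ppo_update` helper are placeholders rather than the paper’s released code, and the warm-up offsets follow our reading of Equation (21).

```python
def tagrl_train(make_env, actor, critic, ppo_update, E=256, G=5, t_max=500, iterations=1000):
    """High-level sketch of Algorithm 2: E parallel environments in G groups,
    staggered in time so every policy update needs only t_max / G fresh steps."""
    envs = [make_env() for _ in range(E)]
    obs = [env.reset() for env in envs]

    # Warm-up: group g starts g * t_max / G steps ahead, so the groups afterwards
    # sit at evenly spaced phases of an episode.
    for i, env in enumerate(envs):
        group = i * G // E                     # group index 0 .. G-1
        for _ in range(group * (t_max // G)):
            obs[i], _, done, _ = env.step(actor.act(obs[i]))
            if done:
                obs[i] = env.reset()

    # Train: each update cycle advances every environment by only t_max / G steps.
    for _ in range(iterations):
        buffer = []
        for _ in range(t_max // G):
            for i, env in enumerate(envs):
                action = actor.act(obs[i])
                next_obs, reward, done, _ = env.step(action)
                buffer.append((obs[i], action, reward, done))
                obs[i] = env.reset() if done else next_obs
        # advantages via Equations (22)-(23), then PPO updates of actor and critic
        ppo_update(actor, critic, buffer)
```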

5. Experiment

The experimental setup for this study consisted of a single NVIDIA GeForce RTX 4090 graphics card (NVIDIA Corporation, Santa Clara, CA, USA) and an Intel Core i9-14900K processor (Intel Corporation, Santa Clara, CA, USA). The software environment utilized Python 3.11.9. Neural networks were constructed using PyTorch 2.3.1, while the unmanned aerial vehicle (UAV) search environment was implemented using Numba 0.60.0.
To validate the effectiveness of the proposed method, we conducted comparative experiments with traditional approaches. In these experiments, both our method and the conventional approach utilized 256 environments to ensure equal computational resources. Both methods employed MAPPO [27] to solve the established optimal control model (OC-MAPPO) [25]. The primary difference was that our method (TAGRL) incorporated environmental grouping and asynchronous temporal interactions, while all other settings and the network architecture remained consistent. To ensure a fair comparison, when evaluating the OC-MAPPO method, a discrete action design was implemented, wherein the UAVs’ movement patterns were limited to three options at each step: forward, left, or right. In contrast, for the experiments demonstrating UAV trajectories, we employed the aforementioned continuous action model, where the UAVs’ movement at each time point was determined by acceleration and angular acceleration. Table 3 presents the key parameter settings.
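For reference, the key settings of Table 3 can be gathered into a single configuration object; the dataclass layout below and the assignment of the two listed learning rates to the actor and critic are our assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class TAGRLConfig:
    gamma: float = 0.99           # discount factor
    gae_lambda: float = 0.95      # GAE lambda parameter
    horizon_length: int = 819
    update_epochs: int = 4
    actor_lr: float = 0.0003      # Table 3 lists two learning rates (0.0003 and 0.001);
    critic_lr: float = 0.001      # splitting them between actor and critic is an assumption
    value_loss_coef: float = 0.05
    entropy_coef: float = 0.005
    eps_clip: float = 0.2
    hidden_size: int = 128
    num_envs: int = 256           # environment number
    num_groups: int = 5           # environment group number
```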
To mitigate the impact of random factors, each experiment was repeated 30 times, and error bands were plotted. The experimental results are illustrated in Figure 4.
In Figure 4, the orange error band represents the performance of the proposed method, while the blue curve indicates the performance when only using the parallel environment. It can be observed that agents in the TAGRL method learn faster and achieve higher average rewards. Furthermore, the TAGRL method and the OC-MAPPO method utilize the same number of computing cores, but the TAGRL method requires significantly less memory. This is because TAGRL stores only one-fifth of the interaction data per environment, whereas the OC-MAPPO method needs to store all the interaction data.
Subsequently, we conducted a comparative analysis of our method’s performance under varying numbers of parallel environments.
From Figure 5, it can be observed that when the number of parallel environments is relatively low, the learning curve of the agent exhibits greater fluctuations and reaches lower reward values. This phenomenon may be attributed to the fact that during the learning process in (9), the agent requires a stable value function as a foundation. When the number of environments is limited, the agent obtains less data from each round of interaction with the environment, making it challenging to learn an accurate and stable $V(\cdot)$ function. Conversely, when the agent interacts with a large number of environments, the $V(\cdot)$ function learned by the agent more closely approximates the true expected reward, and the policy learned through the (9) process more accurately reflects the correct policy.
To validate the comparative performance of our method under different search durations, we conducted experiments for four scenarios: T = 50 , T = 100 , T = 200 , and T = 500 . Each method was executed for 300 s, with the TAGRL method consistently employing five environmental groups. The optimal reward generated during the training process was selected as the representative reward for each method. The experimental results are shown in Figure 6.
The OC-MAPPO method lacks values for the case of T = 500 and E = 512 due to memory overflow. As observed in the figure, when T = 50, the OC-MAPPO method performs comparably to TAGRL with fewer environments. However, as the number of environments increases, the OC-MAPPO method demonstrates certain advantages. With further increases in the number of environments, TAGRL again exhibits performance similar to the OC-MAPPO method. When T = 100, TAGRL shows slightly superior performance to OC-MAPPO, with TAGRL significantly outperforming OC-MAPPO when E = 32, although this may be attributed to the stochasticity inherent in reinforcement learning. At T = 200, TAGRL demonstrates a more pronounced advantage, likely due to its shorter simulation time per round, as the OC-MAPPO method can complete fewer rounds in the same time frame. When T = 500, both the memory and time advantages of TAGRL become evident. The increase in simulation environments does not significantly constrain TAGRL in terms of memory usage, whereas the OC-MAPPO method experiences memory overflow at E = 512. Due to the further reduction in completed simulation rounds, the performance of the OC-MAPPO method becomes more unstable. Overall, when the reconnaissance time for the UAVs is longer, TAGRL exhibits superior performance compared to the OC-MAPPO method.
To further validate the robustness of the algorithm, we conducted a comparative analysis of its performance under varying numbers of UAVs. We selected scenarios with 2, 3, and 5 UAVs for our comparative experiments. The results of these experiments are illustrated in Figure 7.
To validate the search strategy learned by the agent, we visualized the agent’s search trajectory and the changes in the target probability map during the search process. The drone’s search trajectory is composed of 200 discrete segments. Figure 8 below illustrates the search trajectories learned by the agent.
To better comprehend the search trajectory of UAVs, we segmented the UAVs’ route for display and selected probability maps of the target at various time points for presentation.
Figure 9 illustrates that initially, both UAVs move towards the target loss point. Subsequently, the UAV closer to the loss point begins to circle approximately around a point to the right of and behind the loss point, while the other UAV circles around the left front. Following this, the UAVs continue to circle on either side of the target loss point. During this period, Figure 10 indicates a significant decrease in the likelihood of the target appearing at or near the loss point. The red arrows in the figure indicate the forward direction of the UAVs. However, after t = 100, despite higher target probabilities in the upper and right regions of the TPM, the UAVs fail to investigate these areas promptly. This observation highlights potential areas for improvement in the current method. Increasing the training duration and expanding the network parameters could potentially enhance the UAVs’ performance.
Despite the ability of agents to autonomously learn effective search strategies in dynamic long-term search tasks, the current algorithm still has potential areas for improvement in application. First, there exists an approximation error between the discrete TPM and the continuous world. Increasing the density of the grid in the TPM can reduce this error but also increases the algorithm’s memory consumption; a potential solution may be to focus on regions of interest within the TPM while compressing other areas. Second, the current research problem does not involve adversarial strategies for the target. The design of drone search strategies in scenarios where the target intentionally employs evasive tactics also deserves further consideration. Third, the current study assumes that UAVs fly at a fixed altitude, resulting in a constant scanning radius. However, in practical applications, UAVs can adjust their altitude to change their reconnaissance radius and precision. Removing this limitation may facilitate the design of more flexible search strategies.

6. Conclusions

This paper proposes a novel temporally asynchronous grouped environment reinforcement learning method. This approach enables UAVs to rapidly acquire long-term search strategies while reducing memory requirements. Agents trained on long-term search strategies may exhibit enhanced robustness in their policies compared to those focused on short-term strategy learning. Further investigation into the impact of this methodology on the robustness of UAVs is warranted. Once UAV policies demonstrate strong resilience across temporal and two-dimensional spatial domains, it becomes feasible to extend the search scenario to a three-dimensional space, ultimately facilitating deployment on physical UAV platforms for practical applications.

Author Contributions

Conceptualization, D.W. and J.H.; methodology, D.W. and L.Z.; software, D.W.; validation, H.D., M.Y. and L.Z.; formal analysis, D.W. and L.Z.; investigation, H.D. and J.H.; resources, J.H.; data curation, J.H.; writing—original draft preparation, D.W.; writing—review and editing, L.Z.; visualization, D.W.; supervision, M.Y.; project administration, M.Y. and J.H.; funding acquisition, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, X.; Gong, D. A Comparative Study of A-star Algorithms for Search and Rescue in Perfect Maze. In Proceedings of the 2011 International Conference on Electric Information and Control Engineering, Wuhan, China, 15–17 April 2011. [Google Scholar] [CrossRef]
  2. Bhattacharya, P.; Gavrilova, M. Roadmap-Based Path Planning-Using the Voronoi Diagram for a Clearance-Based Shortest Path. IEEE Robot. Autom. Mag. 2008, 15, 58–66. [Google Scholar] [CrossRef]
  3. Zhang, H.Y.; Lin, W.M.; Chen, A.X. Path Planning for the Mobile Robot: A Review. Symmetry 2018, 10, 450. [Google Scholar] [CrossRef]
  4. Wu, Y.; Nie, M.; Ma, X.; Guo, Y.; Liu, X. Co-Evolutionary Algorithm-Based Multi-Unmanned Aerial Vehicle Cooperative Path Planning. Drones 2023, 7, 606. [Google Scholar] [CrossRef]
  5. Su, J.L.; Wang, H. An Improved Adaptive Differential Evolution Algorithm for Single Unmanned Aerial Vehicle Multitasking. Def. Technol. 2021, 17, 1967–1975. [Google Scholar] [CrossRef]
  6. Guan, W.Y.X. A New Searching Approach Using Improved Multi-Ant Colony Scheme for Multi-UAVs in Unknown Environments. IEEE Access 2019, 7, 161094–161102. [Google Scholar] [CrossRef]
  7. Yao, P.; Wang, H.; Ji, H. Multi-UAVs Tracking Target in Urban Environment by Model Predictive Control and Improved Grey Wolf Optimizer. Aerosp. Sci. Technol. 2016, 55, 131–143. [Google Scholar] [CrossRef]
  8. Li, Y.; Chen, W.; Fu, B.; Wu, Z.; Hao, L.; Yang, G. Research on Dynamic Target Search for Multi-UAV Based on Cooperative Coevolution Motion-Encoded Particle Swarm Optimization. Appl. Sci. 2024, 14, 1326. [Google Scholar] [CrossRef]
  9. Ren, Z.; Jiang, B.; Hong, X. A Cooperative Search Algorithm Based on Improved Particle Swarm Optimization Decision for UAV Swarm. In Proceedings of the 2021 IEEE 6th International Conference on Computer and Communication Systems (ICCCS), Chengdu, China, 23–26 April 2021; pp. 140–145. [Google Scholar] [CrossRef]
  10. Karimi, S.; Saghafi, F. Cooperative Aerial Search by an Innovative Optimized Map-Sharing Algorithm. Drone Syst. Appl. 2023, 12, 1–18. [Google Scholar] [CrossRef]
  11. Fan, X.; Li, H.; Chen, Y.; Dong, D. UAV Swarm Search Path Planning Method Based on Probability of Containment. Drones 2024, 8, 132. [Google Scholar] [CrossRef]
  12. Imanberdiyev, N.; Fu, C.; Kayacan, E.; Chen, I.M. Autonomous Navigation of UAV by Using Real-Time Model-Based Reinforcement Learning. In Proceedings of the 2016 14th International Conference on Control, Automation, Robotics and Vision (ICARCV), Phuket, Thailand, 13–15 November 2016; pp. 1–6. [Google Scholar]
  13. Wang, C.; Wang, J.; Zhang, X.; Zhang, X. Autonomous Navigation of UAV in Large-Scale Unknown Complex Environment with Deep Reinforcement Learning. In Proceedings of the 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Montreal, QC, Canada, 14–16 November 2017. [Google Scholar] [CrossRef]
  14. Wang, T.; Qin, R.; Chen, Y.; Snoussi, H.; Choi, C. A Reinforcement Learning Approach for UAV Target Searching and Tracking. Multimed. Tools Appl. 2019, 78, 4347–4364. [Google Scholar] [CrossRef]
  15. Shen, G.; Lei, L.; Zhang, X.; Li, Z.; Cai, S.; Zhang, L. Multi-UAV Cooperative Search Based on Reinforcement Learning with a Digital Twin Driven Training Framework. IEEE Trans. Veh. Technol. 2023, 72, 8354–8368. [Google Scholar] [CrossRef]
  16. Su, K.; Qian, F. Multi-UAV Cooperative Searching and Tracking for Moving Targets Based on Multi-Agent Reinforcement Learning. Appl. Sci. 2023, 13, 11905. [Google Scholar] [CrossRef]
  17. Zhang, B.; Lin, X.; Zhu, Y.; Tian, J.; Zhu, Z. Enhancing Multi-UAV Reconnaissance and Search through Double Critic DDPG with Belief Probability Maps. IEEE Trans. Intell. Veh. 2024, 9, 3827–3842. [Google Scholar] [CrossRef]
  18. Moritz, P.; Nishihara, R.; Wang, S.; Tumanov, A.; Liaw, R.; Liang, E.; Elibol, M.; Yang, Z.; Paul, W.; Jordan, M.I.; et al. Ray: A Distributed Framework for Emerging AI Applications. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, USA, 8–10 October 2018; pp. 561–577. [Google Scholar]
  19. Liang, E.; Liaw, R.; Nishihara, R.; Moritz, P.; Fox, R.; Goldberg, K.; Gonzalez, J.; Jordan, M.; Stoica, I. RLlib: Abstractions for Distributed Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 3053–3062. [Google Scholar]
  20. Dalton, S. Accelerating Reinforcement Learning through GPU Atari Emulation. Adv. Neural Inf. Process. Syst. 2020, 33, 19773–19782. [Google Scholar]
  21. Makoviychuk, V.; Wawrzyniak, L.; Guo, Y.; Lu, M.; Storey, K.; Macklin, M.; Hoeller, D.; Rudin, N.; Allshire, A.; Handa, A.; et al. Isaac Gym: High Performance GPU Based Physics Simulation for Robot Learning. Neurips Datasets Benchmarks 2021, 1. [Google Scholar]
  22. Weng, J.; Lin, M.; Huang, S.; Liu, B.; Makoviichuk, D.; Makoviychuk, V.; Liu, Z.; Song, Y.; Luo, T.; Jiang, Y.; et al. Envpool: A Highly Parallel Reinforcement Learning Environment Execution Engine. Adv. Neural Inf. Process. Syst. 2022, 35, 22409–22421. [Google Scholar]
  23. Liu, W.; Bai, K.; He, X.; Song, S.; Zheng, C.; Liu, X. FishGym: A High-Performance Physics-Based Simulation Framework for Underwater Robot Learning. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022. [Google Scholar] [CrossRef]
  24. Freeman, C.D.; Frey, E.; Raichuk, A.; Girgin, S.; Mordatch, I.; Bachem, O. Brax—A Differentiable Physics Engine for Large Scale Rigid Body Simulation. 2021, Volume 6. Available online: http://github.com/google/brax (accessed on 3 July 2024).
  25. Wei, D.; Zhang, L.; Liu, Q.; Chen, H.; Huang, J. UAV Swarm Cooperative Dynamic Target Search: A MAPPO-based Discrete Optimal Control Method. Drones 2024, 8, 214. [Google Scholar] [CrossRef]
  26. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  27. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624. [Google Scholar]
Figure 1. Environment model.
Figure 2. Temporally asynchronous grouped environment.
Figure 3. Method comparison.
Figure 4. Comparison between OC-MAPPO and TAGRL.
Figure 5. Comparison between different numbers of environments.
Figure 6. Comparative analysis of best return under varying reconnaissance time and parallel environment quantities.
Figure 7. Performance comparison of UAVs over number.
Figure 8. Trajectory learned by agents.
Figure 9. Segmented unmanned aerial vehicle search trajectory.
Figure 10. Change in TPM during the search process.
Table 1. Comparison of four methodological approaches.

Method | Scalability | Accuracy | Adaptability
Traditional method | Inferior | Satisfactory | Inferior
Heuristic method | Superior | High variance, dependent on iteration time | Inferior
Probability-based method | Superior | Satisfactory | Inferior
DRL-based method | Superior | High variance, dependent on training time | Superior
Table 2. Variable list.

Variable | Type | Size | Explanation
Width | Scalar | 1 | The number of environment grids in the x-axis direction
Height | Scalar | 1 | The number of environment grids in the y-axis direction
$TPM^t$ | Matrix | Height × Width | Each value of the TPM expresses the probability of a target in each environment grid at time t
$a_t^u$ | Scalar | 1 | The action of the UAV u at time t
$\alpha_t^u$ | Scalar | 1 | The orientation angle of the UAV u at time t
$\psi_{max}^u$ | Scalar | 1 | The maximum steering angle of the UAV u
$c_t^u$ | Vector | 2 | The position of the UAV u at time t
$speed^u$ | Scalar | 1 | The speed of the UAV u
$Sensor^u(\cdot)$ | Function | 1 | The sensor function of the UAV u
$Precise^u(\cdot)$ | Function | 1 | The precision function of the UAV u sensor
$D_t^u$ | Plane | 1 | The detection range of the UAV u at time t
$p_t^u$ | Scalar | 1 | The probability of the UAV u detecting the target at time t
$T$ | Scalar | 1 | The total reconnaissance time
$N_u$ | Scalar | 1 | The number of UAVs
$A$ | Tensor | $T \times N_u \times 2$ | The actions of the UAVs at each time
$return$ | Scalar | 1 | The total reward of the UAVs
$R_t$ | Scalar | 1 | The reward of the UAVs at time t
$J(\cdot)$ | Function | 1 | The objective function of the UAVs
$V(\cdot)$ | Function | 1 | The value function of the UAVs
$Critic$ | Net & Function | 1 | Fits the expected reward of the actor
$Actor$ / $\pi$ | Net & Function | 1 | The policy of the UAVs
$Critic_t^{target}$ | Scalar | 1 | The target value of the critic at time t
$Adv(\cdot)$ | Function | 1 | The advantage function of the UAVs
$S_t$ | Tuple | (TPM, $c_t$, $\alpha_t$) | The state of the UAVs at time t
$GT(\cdot)$ | Function | 1 | The maximum simulation time of a grouped environment
$O_t$ | Tuple | (TPM, $c_t$, $\alpha_t$, id: Scalar) | The observation of the UAVs at time t
Table 3. TAGRL parameters.

Parameter | Value
γ | 0.99
Horizon length | 819
Update epochs | 4
Learning rate | 0.0003
Vector environment numbers | 10
GAE lambda parameter | 0.95
Learning rate | 0.001
Value loss coefficient | 0.05
Entropy coefficient | 0.005
Eps clip | 0.2
Hidden size | 128
Environment number | 256
Environment group number | 5
