Article

UAV Swarm Cooperative Dynamic Target Search: A MAPPO-Based Discrete Optimal Control Method

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Drones 2024, 8(6), 214; https://doi.org/10.3390/drones8060214
Submission received: 29 April 2024 / Revised: 19 May 2024 / Accepted: 20 May 2024 / Published: 22 May 2024

Abstract

Unmanned aerial vehicles (UAVs) are commonly employed in pursuit and rescue missions, where the target’s trajectory is unknown. Traditional methods, such as evolutionary algorithms and ant colony optimization, can generate a search route in a given scenario. However, when the scene changes, the solution needs to be recalculated. In contrast, more advanced deep reinforcement learning methods can train an agent that can be directly applied to a similar task without recalculation. Nevertheless, several challenges arise when the agent learns how to search for unknown dynamic targets. In this search task, the rewards are random and sparse, which makes learning difficult. In addition, because the agent must adapt to various scenario settings, more interactions between the agent and the environment are required than in typical reinforcement learning tasks. These challenges increase the difficulty of training agents. To address these issues, we propose the OC-MAPPO method, which combines optimal control (OC) and Multi-Agent Proximal Policy Optimization (MAPPO) with GPU parallelization. The optimal control model provides the agent with continuous and stable rewards. Through parallelized models, the agent can interact with the environment and collect data more rapidly. Experimental results demonstrate that the proposed method helps the agent learn faster and achieves a 26.97% higher success rate than genetic algorithms.

1. Introduction

UAVs are crucial tools in various fields because of their easy deployment and flexible scheduling. They are utilized in emergency search and rescue operations [1], agricultural irrigation [2], and urban infrastructure development [3]. Researchers have investigated the efficiency of drones in various scenarios. Drones can be beneficial in human rescue missions because of their speed and ability to navigate challenging terrains [1], and they can assist in the exploration of unfamiliar and intricate environments [4]. Moreover, artificial intelligence technology enables drones to carry out duties in place of humans. Thanks to object detection algorithms, drones possess target identification capabilities comparable to those of humans [5]. Using machine learning algorithms, drones can distill the rules of numerous activities and apply them to novel tasks. These innovations have expanded the potential applications of UAVs and deepened their usage.
The use of multiple UAVs exhibits increased search efficiency and fault tolerance compared to a single UAV, as they can integrate each other’s information [6] and search different directions simultaneously. The efficient cooperation of several UAVs in searching for targets is a significant research subject that has garnered considerable attention from scholars. Through strategic organization, several UAVs can cover distinct locations simultaneously, significantly decreasing the time needed for the search. During the search of the target area, the UAVs devise their routes autonomously to optimize the overall reconnaissance coverage simultaneously.
The issue of multi-UAV cooperative search can be categorized into static target search and dynamic target search based on the motion characteristics of reconnaissance targets. In a typical static target search situation, the objective is to explore a stationary target whose initial position is uncertain. The primary aim is to optimize the entire coverage area of the UAV. Typical dynamic target search scenarios include human rescue operations and criminal pursuits, where partial information about the target’s last known location, such as the vicinity of their disappearance, is available.
Dynamic target search is more complex than static target search because the target’s movement strategy is uncertain. When a UAV moves, it must consider not only the information exchanged between drones but also the various possible movement directions of the target. Even in areas previously scanned by UAVs, the target may still appear. Furthermore, the asymmetry in the probability of the target appearing in the environment, which arises from knowledge of its initial position, requires specific algorithms that exploit this prior information effectively. In addition, because the target’s movement direction is unpredictable, it may take a long time to locate the target, which raises the upper limit on the search duration. Because of these unique attributes of the dynamic target search problem, algorithms created for static targets are often not directly applicable, and it is essential to develop new algorithms specifically for dynamic target search. Traditional offline planning methods for dynamic target search, such as evolutionary algorithms [7,8], particle swarm optimization (PSO) algorithms [9,10], and artificial potential field (APF) algorithms [11], can generate an excellent search trajectory for a drone when the target is lost by leveraging heuristic information. However, this trajectory is only applicable to the given search scenario; when the location at which the target was lost changes, a new solution must be computed, which takes additional time.
In contrast, reinforcement learning methods can train an agent to perform such tasks, and a well-trained agent can directly execute the search task when the target is lost without the need for further iterative optimization. Moreover, since the agent can receive real-time observations from the environment and then execute actions, even if unexpected situations occur during the execution of the task, such as a deviation in the initial position of the target or a deviation in the previously received information about the positions of other drones, the agent can continue to perform the task based on the latest information. Traditional methods that compute fixed trajectories cannot do this. Therefore, training an agent to perform search tasks using deep reinforcement learning algorithms may better align with practical requirements.
However, when training agents to perform a coordinated search for dynamic targets using reinforcement learning, there are still some remaining issues.
  • Firstly, owing to the uncertainty of target movement, the results of UAVs’ finding the target vary even when adopting the same strategy, which can make the rewards obtained by the drones more random. According to the literature [12], intelligent agents face greater challenges in learning within stochastic environments compared to deterministic settings.
  • Additionally, before the drones discover the target, the only observation obtained by the drones is the position information of each drone, which can make the rewards obtained by the drones sparse. Sparser rewards can also make the learning process more challenging for intelligent agents [13].
  • Finally, to adapt to diverse scenarios under different settings, the agents require more training data than is needed for a specific scenario task, which necessitates longer interaction with the environment.
These three factors may increase the difficulty of learning search tasks for agents and slow down their learning speed. Moreover, from a technical perspective, existing deep reinforcement learning methods can leverage deep learning libraries to execute the agent’s learning programs quickly on GPUs. However, when the agent interacts with the environment, it must interact with an environment running on the CPU. Compared to GPUs, CPUs have far fewer cores, which makes the interaction time between the agent and the environment a bottleneck in the algorithm’s execution. This is also a significant reason for the slow execution of reinforcement learning programs [14].
To facilitate the agent in learning search tasks, this paper constructs an optimal control model that provides stable and continuous rewards to assist the agents in learning, thereby enhancing their learning efficiency. Furthermore, by developing models that run on GPUs, algorithms can benefit from the massive number of cores available on GPUs. Unlike the process before parallelization, where intelligent agents needed to interact with the environment over multiple rounds, parallelized agents only need to interact with the environments for a single round. The interaction data can be directly transferred for learning through the DLPack protocol without additional data transfers from the CPU to the GPU, further improving the training speed of the agents. Based on the research above, this paper compares several state-of-the-art reinforcement learning methods and provides experimental validation of the work conducted.
The rest of this work is organized as follows: Section 2 outlines the current research status. Section 3 introduces the problem background, and Section 4 explains the optimal control model established. Section 5 introduces a solution method for the model based on MARL (multi-agent reinforcement learning). Section 6 presents the experimental validation of the work and discusses the results. Section 7 offers a conclusion of this article.

2. Related Works

Significant progress has been achieved in research on multi-UAV cooperative search. Cao and colleagues [15] studied a hierarchical task planning method for UAVs based on an improved particle swarm algorithm; this approach is appropriate for cases in which the number of search tasks changes dynamically. Xu and colleagues [16] proposed a multi-task allocation method for UAVs via a multi-chromosome evolutionary algorithm that considers priority constraints between tasks during allocation. Zuo et al. [17] investigated the task assignment problem for many UAVs with constrained resources.
These methods concentrate exclusively on the search issue at a higher level. However, detailed path planning or trajectory optimization is also crucial for UAVs to carry out a cooperative search. Y. Volkan Pehlivanoglu [18] combined genetic algorithms and Voronoi diagrams to propose a path-planning method for numerous UAVs in a 2D environment. Wenjian He and his colleagues [19] employed an advanced particle swarm algorithm to develop a novel path-planning technique for many UAVs in a three-dimensional environment. Liu et al. investigated methods for path planning based on a set of points of interest [20]. In addition, reinforcement learning methods have been applied to assist intelligent agents in searching for static targets [21] and mobile targets [22]. Jiaming Fan and colleagues [23] presented a path planning technique that utilizes the bidirectional APF-RRT* algorithm. Kong and colleagues [24] examined task planning and path planning simultaneously.
However, in these studies, the location of the target is usually known. In specific search scenarios, such as rescuing lost travelers, the target’s location is often uncertain and must be found quickly by UAVs. In such scenarios, Peng Yao et al. [3] used an improved model predictive control method to solve the problem of how multiple UAVs can efficiently search for unknown targets, taking into account the communication constraints between UAVs. Hassan Saadaoui et al. [25] used a local particle swarm algorithm-based information-sharing mechanism for UAVs to help them quickly find targets in unknown environments. Liang et al. [26] investigated how UAVs search for and attack targets in unknown environments. Samine Karimi and Fariborz Saghafi [27] investigated a collaborative search method based on shared maps that is suitable for cost-effective systems. Xiangyu Fan et al. [28] proposed a collaborative search method that combines containment probability and evolutionary algorithms for UAV applications. Furthermore, when multiple uncrewed aerial vehicles (UAVs) operate collaboratively, they may encounter unexpected situations, such as the loss of some UAVs. To address this issue and enhance the robustness of UAV systems, Abhishek Phadke and F. Antonio Medrano [29] investigated methods for recovering the functionality of compromised UAVs.
Traditional methods for solving target search problems in unknown environments have poor scalability and run too slowly in large map areas [30]. For large-scale search tasks, Chen et al. [31] proposed a hierarchical task planning method that decomposes the overall search problem into smaller, more manageable subtasks. Hou Yukai and others [30] proposed a target search method based on the multi-agent deep deterministic policy gradient (MADDPG) algorithm [32]. Because of the algorithm’s centralized training and distributed execution mechanism, the computational efficiency does not decrease significantly even with a large number of UAVs.
When employing reinforcement learning for unknown static target search, intelligent agents can directly interact with the environment and utilize the degree of exploration as a reward function to facilitate the learning process. But, when it comes to searching for a lost dynamic target, several new challenges arise. Firstly, because of the unknown motion strategy of the target, it is not feasible to directly establish a corresponding simulation environment. Secondly, the presence of prior information regarding the target’s lost location results in a non-uniform distribution of the target’s appearance probability, rendering the use of exploration degree as a reward function no longer viable.
To solve the issues mentioned above, this paper first models the dynamic target search problem as an optimal control problem, then regards the established optimal control problem as a simulation environment, and finally uses multi-agent proximal policy optimization (MAPPO) [33] to solve the modeled problem, thus making full use of the powerful fitting ability of neural networks and the efficient optimization ability of reinforcement learning algorithms.

3. System Model and Problem Formulation

The scenario we consider is to send multiple UAVs to find the lost target in the shortest possible time, given the known position and time of the target’s disappearance.

3.1. System Model

3.1.1. Environment Model

First, the environment is discretized into a grid. Because the time needed to find the target is not known at the start of the solution, the boundary of the environment is determined by the maximum search time and the target’s speed. The purpose of the search is to find the missing object, which is called T. The target’s lost location, the time of loss, and the initial location of the drone are known. A UAV is represented by u, and its coordinate at time t is u_c; the reconnaissance range and movement range of the UAV are integer multiples of the cell size. The state of each cell is c_{x,y}: c_{x,y} = 1 indicates that the cell at coordinates (x, y) contains the target, and c_{x,y} = 0 means the cell contains no target. We assume that the drones fully know each other’s information. The environment is depicted in Figure 1.

3.1.2. Update Model

After the target is lost, its specific position is unknown. In order to use the reconnaissance time, the initial position of the target, and the result of each reconnaissance by the UAVs, the target probability map (TPM) [34] is used to represent the search state at each moment. A target probability map is a matrix in which each element indicates the probability of the target appearing in a cell. At the initial time, the element of the TPM at the location where the target was lost is 1, and all other elements are 0. As time goes by, the area in which the target may appear grows. The target may move to any of the surrounding cells or remain stationary. Since no information about the direction of the target’s movement is known, the probability that the target moves to each of the surrounding eight cells, or stays in its current cell, at the next time step is 1/9. The probability of the target appearing in a cell at the next moment is the sum of the probabilities of the target moving into that cell from each of the nine cells (the cell itself and its eight neighbors) at the previous moment. Since each of these nine cells contributes with probability 1/9, the probability of the target appearing in a cell at the next moment is the average of the occurrence probabilities of the nine surrounding cells at the previous moment. The formal description of the TPM update due to target movement is as follows:
TPM_{i,j}^{t+1} = \frac{1}{9} \times \sum_{m=i-1}^{i+1} \sum_{n=j-1}^{j+1} TPM_{m,n}^{t}    (1)
It can be seen that this operation is equivalent to a mean filtering operation; therefore, this change of the target probability map is denoted as M(·).
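As an illustration, the following is a minimal Python/NumPy sketch (not the authors’ released code) of the motion update M(·) in Equation (1), implemented as a 3 × 3 mean filter over the TPM; the 9 × 9 grid and the initial cell are illustrative assumptions.

```python
import numpy as np

def motion_update(tpm: np.ndarray) -> np.ndarray:
    """M(.): each cell becomes the average of its 3x3 neighborhood (Equation (1))."""
    padded = np.pad(tpm, 1, mode="constant")   # zero probability outside the search area
    out = np.zeros_like(tpm)
    rows, cols = tpm.shape
    for i in range(rows):
        for j in range(cols):
            out[i, j] = padded[i:i + 3, j:j + 3].mean()
    return out

# Example: all probability mass starts at the target's last known cell.
tpm = np.zeros((9, 9))
tpm[4, 4] = 1.0
tpm = motion_update(tpm)   # after one step, 1/9 in each of the 9 central cells
```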
To increase search efficiency, the TPM needs to be updated as UAVs continue to scout the region. Because of the limited accuracy of the sensors, the TPM is updated probabilistically. The update of the TPM is conducted using a Bayesian model, which follows the sensor practices of various established robotic applications. When a UAV scans a cell (x, y) at time t, the likelihood that the target is in that cell is updated in the following manner:
TPM_{i,j}^{t+1} = \frac{TPM_{i,j}^{t} \times t_p}{t_p \times TPM_{i,j}^{t} + (1 - t_p) \times (1 - TPM_{i,j}^{t})}    (2)
where t_p indicates the sensor’s accuracy when the sensor detects a target in the cell. If the sensor fails to identify a target, 1 − t_p is substituted for t_p. Furthermore, if multiple UAVs scan the cell (x, y) concurrently, the probability is updated multiple times; the number of updates equals the number of detecting drones. This update of the target probability map is called B(·) because of the Bayesian formula used.
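For concreteness, here is a small sketch (an assumed implementation, not the paper’s code) of the Bayesian sensor update B(·) from Equation (2), applied to every cell scanned by a UAV; the grid size, footprint, and sensor accuracy are illustrative.

```python
import numpy as np

def bayes_update(tpm: np.ndarray, scanned: np.ndarray, detected: bool, tp: float) -> np.ndarray:
    """B(.): update scanned cells given a detection/no-detection report with sensor accuracy tp."""
    p = tp if detected else 1.0 - tp            # likelihood of the report given "target present"
    prior = tpm[scanned]
    posterior = prior * p / (p * prior + (1.0 - p) * (1.0 - prior))
    out = tpm.copy()
    out[scanned] = posterior
    return out

# Example: one UAV scans a 2x2 footprint and reports "no target" with a 90%-accurate sensor.
tpm = np.full((9, 9), 1.0 / 81)
mask = np.zeros((9, 9), dtype=bool)
mask[0:2, 0:2] = True
tpm = bayes_update(tpm, mask, detected=False, tp=0.9)
```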

3.1.3. Movement Model

In the problem scenario studied, in order to avoid collisions and repeated observations, the distance between drones should be greater than a certain threshold, which is represented by d. At every time step t of the search procedure, each UAV may travel to a nearby cell. Nevertheless, the UAVs’ movement options at each time step are restricted by their turning radius limitations: each UAV may only choose one of the nearby cells within the direction range [−α, α], where α represents the maximum turning angle of the drone. The mobility model of the UAV is depicted in Figure 2.
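The turning-angle constraint can be illustrated with the following sketch (assumptions: an 8-connected grid and headings stored as one of eight compass directions), which lists the neighbor cells a UAV may move to without turning more than α degrees.

```python
# 8 headings in 45-degree steps: E, NE, N, NW, W, SW, S, SE
HEADINGS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

def allowed_moves(heading_idx: int, alpha_deg: float):
    """Return the neighbor offsets reachable without turning more than alpha degrees."""
    moves = []
    for k, offset in enumerate(HEADINGS):
        turn = (k - heading_idx) % 8           # turn in 45-degree steps
        turn = turn - 8 if turn > 4 else turn  # map to the signed range [-3, 4]
        if abs(turn) * 45 <= alpha_deg:
            moves.append(offset)
    return moves

print(allowed_moves(heading_idx=0, alpha_deg=45))   # heading east -> E, NE, SE
```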

4. Optimal Control Model Construction

The search mission of the drones is to find the target in the shortest possible time, and once the target is found, the mission ends. Therefore, except for the final moment, it can be assumed that the drones did not find the target during the mission. The optimization objective is to minimize the probability that the UAVs fail to detect the target. Let us represent the collection of uncrewed aerial vehicles as U. Denote the probability that a UAV finds the target during one reconnaissance as u_p and the area covered by the UAV during that reconnaissance as D; then, the probability of the UAV failing to detect the target is 1 − u_p, and the probability of all UAVs failing to detect the target is calculated as follows:
\prod_{u \in U} (1 - u_p)    (3)
Suppose that the UAVs scout for the target N times from start to end, and the maximum reconnaissance time is T. Then the probability that the target is not found in any of the N reconnaissances is \prod_{i=1}^{N} \prod_{u \in U} (1 - u_p(i)). Taking the logarithm of this formula turns the products into sums, and because of the monotonicity of the logarithm function, it does not change the extreme point of the original formula. Taking the logarithm gives the following formula:
J = \sum_{i=1}^{N} \sum_{u \in U} \ln(1 - u_p(i))    (4)
where J represents the optimization objective. If the reconnaissance frequency of the UAV is high and the reconnaissance probability changes continuously with time, the performance index can also be written in the following continuous form:
J = \int_{0}^{T} \sum_{u \in U} \ln(1 - u_p(i)) \, dt    (5)
Here, u_p(i) is calculated as the integral of the UAV’s reconnaissance probability over the reconnaissance area D_i, where D_i is determined by the sensor model S_i and the drone’s position u_c(i). Let TPM(x, y) denote the probability of the target appearing at (x, y), and let p(x, y) represent the reconnaissance accuracy of the UAV at coordinates (x, y). Then, u_p(i) is calculated as follows:
D_i = S_i(u_c(i))    (6)
u_p(i) = \iint_{(x, y) \in D_i} p(x, y) \, TPM(x, y) \, dx \, dy    (7)
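In the discretized environment, the integral in Equation (7) reduces to a sum over the cells covered by the sensor. A minimal sketch follows, with a uniform sensor accuracy t_p standing in for p(x, y) as an illustrative assumption.

```python
import numpy as np

def detection_probability(tpm: np.ndarray, covered: np.ndarray, tp: float) -> float:
    """u_p = sum over covered cells of p(x, y) * TPM(x, y), with p(x, y) = tp inside the footprint."""
    return float(tp * tpm[covered].sum())

tpm = np.full((9, 9), 1.0 / 81)
covered = np.zeros((9, 9), dtype=bool)
covered[3:6, 3:6] = True                        # 3x3 sensor footprint centered on the UAV
u_p = detection_probability(tpm, covered, tp=0.9)
```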
Here, optimal control theory is used to design a control scheme to regulate the UAV’s reconnaissance path in such a way that the above continuous-time performance indicators are minimized to optimize the search efficiency and maximize the probability of finding the target. Here, the state variables are the position of the UAV and the target probability map. The input of the system is the action matrix A taken by UAVs. Each row a of the matrix A represents a sequence of actions taken by a drone. The transfer model of the system is as follows:
u_c(t+1) = a(t) + u_c(t)    (8)
TPM(t+1) = B(M(TPM(t)))    (9)
In this formula, B(·), which is defined in Equation (2), represents the Bayesian update of the TPM based on the reconnaissance information obtained by the uncrewed aerial vehicles (UAVs) at their new positions. The term M(·), which is defined in Equation (1), denotes the effect on the TPM of the expansion of the target’s possible activity range as time increases. During reconnaissance, the UAVs must abide by some restrictions and keep a certain distance between each other at all times to prevent collisions or repeated observations. The turning angle of each drone is limited, so a drone can only turn within a certain range of the current direction of its fuselage. Denote the steering angle of the drone as θ. Finally, the optimal control model of the problem is formulated as follows:
\min_{A} J(A) \quad \text{s.t.} \quad \min |u_c(i) - u_c(j)| \geq d, \quad -\alpha \leq \theta \leq \alpha    (10)
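Putting the pieces together, the sketch below (reusing motion_update, bayes_update, and detection_probability from the earlier sketches; the square footprint and integer cell actions are assumptions) evaluates J(A) for a candidate action matrix A under the transition model of Equations (8)–(10), returning infinity when the separation constraint is violated.

```python
import numpy as np

def evaluate_objective(A, uav_pos, tpm, tp, d_min, footprint=1):
    """Return J = sum_t sum_u ln(1 - u_p) for the trajectories induced by A (lower is better)."""
    J = 0.0
    for t in range(A.shape[1]):                        # A: (num_uavs, horizon, 2) cell offsets
        tpm = motion_update(tpm)                       # target may have moved: M(.)
        uav_pos = uav_pos + A[:, t]                    # u_c(t+1) = u_c(t) + a(t)
        diffs = uav_pos[:, None, :] - uav_pos[None, :, :]
        dist = np.linalg.norm(diffs, axis=-1) + np.eye(len(uav_pos)) * 1e9
        if dist.min() < d_min:                         # separation constraint violated
            return np.inf
        for pos in uav_pos:                            # each UAV scans its footprint: B(.)
            covered = np.zeros_like(tpm, dtype=bool)
            x, y = int(pos[0]), int(pos[1])
            covered[max(x - footprint, 0):x + footprint + 1,
                    max(y - footprint, 0):y + footprint + 1] = True
            u_p = detection_probability(tpm, covered, tp)
            J += np.log(1.0 - u_p)
            tpm = bayes_update(tpm, covered, detected=False, tp=tp)
    return J
```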

5. MARL-Based Solving Method

While the minimum principle and dynamic programming are effective for solving various optimal control problems in engineering, the proposed model is challenging because it combines continuous and discrete variables and includes nonlinear functions. Therefore, this study utilizes deep reinforcement learning to solve the optimal control model above. In this section, we discuss the MARL-based search optimization technique in three subsections. First, we introduce the MARL structure, which outlines the fundamental elements of MARL. Next, we introduce the basics of PPO (Proximal Policy Optimization) [35], the single-agent version of MAPPO. Finally, we present the MARL-based search strategy and a comprehensive overview of the algorithm’s entire operation.

5.1. MARL Structure

MARL usually refers to algorithms in which agents learn to take the best actions from their observations through appropriate reward design. When multiple agents share a common global reward, they gradually adjust their actions to achieve collaboration in order to maximize that reward. In the usual MARL setting, agents interact with the environment to learn a good strategy; in this problem, however, because the opponent’s strategy is unknown, the agents do not interact directly with the environment but with the mathematical model described above. In MARL, at time step t, each agent i takes an action a_i^t based on its observation o_i^t and receives a single-step reward r_i^t (equal to J(t) − J(t+1)) from the model. During the interaction with the model, the agent leverages the reward information to update its policy in order to obtain a higher cumulative reward. The key components of the multi-agent reinforcement learning (MARL) system are as follows:

5.1.1. Agent

Every UAV is identified as an agent inside the search system. The goal of the drone is to find a path that minimizes J.

5.1.2. State

The state variable x represents the global state, which contains the drones’ location information, angle information, and the TPM.

5.1.3. Observation

In order to distinguish each drone during the planning of drone actions, the observation not only includes state information but also includes the ID of each drone.

5.1.4. Action

Each agent i selects the best course of action a i according to the current observation. The UAV u selects the steering angle as the action in the search environment, as seen below:
a_i = \begin{cases} -1, & \text{turn } -\alpha \text{ degrees} \\ 0, & \text{go straight} \\ 1, & \text{turn } \alpha \text{ degrees} \end{cases}    (11)

5.1.5. Reward

After the UAV performs the action, the target probability map is updated accordingly, and then the UAV receives the reward r from the model. According to the established optimal control model, the UAV receives the reward at each step as follows:
r = -\sum_{u \in U} \ln(1 - u_p(i))    (12)
The multi-agent search framework incorporating optimal control models can be represented by Figure 3.

5.2. Basic PPO

The PPO algorithm uses an actor–critic design. Specifically, it uses a critic network to estimate the expected cumulative reward that an agent can earn from its current state and uses the actor to select appropriate actions for each agent. By using the output of the critic network as a reference value for the reward, the algorithm can approximately map the reward range in real-world applications from [r_min, r_max] to [−(r_min + r_max)/2, (r_min + r_max)/2], which can be further normalized. Through this mapping, the algorithm can adapt to scenarios with rewards of different magnitudes. The PPO algorithm improves the performance of agents by updating the policy through the following loss function:
L^{PPO}(\theta) = \mathbb{E}_t \left[ \min\left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} A^{\pi_{\theta_{old}}}(s_t, a_t), \ \text{clip}\left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon \right) A^{\pi_{\theta_{old}}}(s_t, a_t) \right) \right]    (13)
Here, π_θ(a_t|s_t) represents the probability of selecting action a_t in state s_t under the policy parameterized by θ, and π_θ_old(a_t|s_t) denotes the probability under the old policy, before the update.
  • A^{π_θ_old}(s_t, a_t) is the advantage function, which is computed by generalized advantage estimation (GAE) [36] and measures the benefit of taking action a_t in state s_t under the policy π_θ_old compared to the average action under the same policy.
  • ϵ is a hyperparameter that controls the clipping range to prevent the policy from updating too drastically, which helps in maintaining training stability.
The use of the clip function ensures that the ratio of the new policy probability to the old policy probability does not deviate from 1 by more than ϵ , ensuring gentle updates. This clipping mechanism acts as a regularization technique, encouraging the new policy to stay close to the old policy and mitigating the risk of making overly large updates that can result in performance degradation.
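A minimal PyTorch sketch of this clipped objective (Equation (13)) is shown below; the tensor shapes are assumptions, and the value is negated so that gradient descent maximizes the surrogate.

```python
import torch

def ppo_actor_loss(new_logp, old_logp, advantage, eps=0.2):
    """new_logp, old_logp, advantage: tensors of shape (batch,)."""
    ratio = torch.exp(new_logp - old_logp)                    # pi_theta / pi_theta_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```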
The PPO algorithm iteratively performs the following steps:
  • Collect data from the current policy π_θ_old by interacting with the environment.
  • Compute advantage estimates A^{π_θ_old}(s, a) based on the collected data (a minimal sketch of this computation is given after this list).
  • Optimize the PPO objective L^{PPO}(θ) with respect to θ for a fixed number of epochs, using stochastic gradient ascent.
  • Update the old policy by setting θ_old to the optimized values θ.
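The advantage computation in the second step above can be sketched as follows (assumed tensor shapes), following the generalized advantage estimation recurrence that is used again in Algorithm 1.

```python
import torch

def compute_gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    """rewards, values: tensors of shape (T,); next_value: V(s_T) used to bootstrap the last step."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    values = torch.cat([values, next_value.view(1)])
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:T]                            # reward-to-go targets for the critic
    return advantages, returns
```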

5.3. OC-MAPPO Target Searching Method

The MAPPO method is the multi-agent version of the PPO method. In the MAPPO approach, all agents share the same rewards and utilize a centralized value function together with localized policy functions. In the original MAPPO method, the loss function is calculated in the same way as in the PPO method. However, in this paper, we have made some modifications to the loss function. In the original formula, the denominator and numerator of π_θ(a_t|s_t)/π_θ_old(a_t|s_t) represent the probability of taking action a_t in state s_t before and after the network update, respectively. In the multi-agent version, however, the original s_t corresponds to the observation tuple (o_t^1, o_t^2, ..., o_t^n), where n is the number of agents, and the corresponding action changes from a_t to the action tuple (a_t^1, a_t^2, ..., a_t^n). Therefore, we represent the probability of taking the action tuple (a_t^1, a_t^2, ..., a_t^n) under the observation tuple (o_t^1, o_t^2, ..., o_t^n) as the product of the probabilities of each agent taking its corresponding action.
The revised actor loss function L^{MAPPO}(θ_actor) is computed as
L^{MAPPO}(\theta_{actor}) = \mathbb{E}_t \left[ \min\left( \frac{\prod_{i=1}^{N} \pi_\theta(a_t^i|o_t^i)}{\prod_{i=1}^{N} \pi_{\theta_{old}}(a_t^i|o_t^i)} A^{\pi_{\theta_{old}}}(s_t, a_t), \ \text{clip}\left( \frac{\prod_{i=1}^{N} \pi_\theta(a_t^i|o_t^i)}{\prod_{i=1}^{N} \pi_{\theta_{old}}(a_t^i|o_t^i)}, 1-\epsilon, 1+\epsilon \right) A^{\pi_{\theta_{old}}}(s_t, a_t) \right) \right]    (14)
Here, π_θ(a_t^i|o_t^i) represents the probability that agent i takes action a_t^i given its observation o_t^i under the current policy π_θ, while π_θ_old(a_t^i|o_t^i) denotes the probability under the old policy π_θ_old. The advantage function A^{π_θ_old}(s_t, a_t) estimates the advantage of taking the joint action a_t in state s_t compared to the expected return under the old policy.
The min and clip operations are used to limit the update step size and maintain stability during training. This loss function is used to update the actor network, while the critic network is updated using the mean square error.
L(\theta_{critic}) = \mathbb{E}_{s_t \sim \pi_\theta} \left[ \left( V_\theta(s_t) - V_t^{target} \right)^2 \right]    (15)
V_t^{target} = r_t + \gamma V_\theta(s_{t+1})    (16)
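The two updates above can be sketched compactly in PyTorch (assumed tensor shapes): the joint ratio in Equation (14) is the product of per-agent action probabilities, computed here as the exponential of summed per-agent log-probabilities, and the critic target follows Equations (15) and (16).

```python
import torch
import torch.nn.functional as F

def mappo_actor_loss(new_logp, old_logp, advantage, eps=0.2):
    """new_logp, old_logp: (batch, n_agents) log pi(a_t^i | o_t^i); advantage: (batch,)."""
    joint_ratio = torch.exp(new_logp.sum(dim=1) - old_logp.sum(dim=1))   # product of per-agent ratios
    unclipped = joint_ratio * advantage
    clipped = torch.clamp(joint_ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()                         # negate for gradient descent

def critic_loss(values, rewards, next_values, gamma=0.99):
    """values, rewards, next_values: tensors of shape (batch,)."""
    target = rewards + gamma * next_values.detach()    # V_t^target = r_t + gamma * V(s_{t+1})
    return F.mse_loss(values, target)
```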
Next, we introduce some additional design choices for the actor network (used to fit the policy function) and the critic network (used to fit the value function) for the problem studied. Since the input information contains the two-dimensional target probability map, a convolutional neural network is selected for data processing. At the same time, taking advantage of the high-performance parallel multi-channel processing capability of modern hardware, the incoming UAV position information is integrated into the two-dimensional data for unified processing. In order to distinguish between different drones, the ID of each drone is additionally included when inputting the information for that drone’s movement.
For the actor network, the input data are first passed through a convolutional neural network, then flattened to one dimension, and then passed through three fully connected layers. For the critic network, the input data are processed similarly. The difference is that the final layer of the actor network uses a softmax activation function to map the features to a probability vector whose dimension equals the number of optional actions, whereas the final layer of the critic network uses a tanh activation function to map the features to a single V value. The initial parameters of all networks are orthogonal matrices. The main architecture of the actor and critic networks is shown in Figure 4.
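The following PyTorch sketch illustrates this layout; the channel counts, kernel sizes, grid size, and the trunk shared between actor and critic are our own simplifying assumptions (the paper uses separate actor and critic networks with the same processing pattern).

```python
import torch
import torch.nn as nn

def ortho_init(module):
    """Orthogonal initialization for convolutional and linear layers, as described above."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.orthogonal_(module.weight)
        nn.init.zeros_(module.bias)

class ActorCritic(nn.Module):
    def __init__(self, in_channels=3, grid=32, n_actions=3, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(                       # CNN over TPM + UAV position/ID channels
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        flat = 32 * grid * grid
        self.actor = nn.Sequential(                       # three fully connected layers, softmax head
            nn.Linear(flat, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions), nn.Softmax(dim=-1),
        )
        self.critic = nn.Sequential(                      # three fully connected layers, tanh V head
            nn.Linear(flat, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),
        )
        self.apply(ortho_init)

    def forward(self, obs):                               # obs: (batch, in_channels, grid, grid)
        features = self.trunk(obs)
        return self.actor(features), self.critic(features)
```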
In order to expedite the learning process of the intelligent agents, we have developed a GPU-parallel environment that enables more efficient interaction with the agents. The parallelized interaction approach differs from the non-parallelized method as follows (a minimal sketch of the batched environment step is given after this list):
  • Prior to each update, the intelligent agent is only required to interact with each environment for a single round, as opposed to engaging in multiple rounds of interaction.
  • Each environment is configured with distinct parameters to enhance the generalization performance of the intelligent agent.
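The sketch below illustrates the batched environment step referenced above (assumptions: PyTorch tensors, one TPM per environment, and only the motion update is shown). All parallel TPMs are advanced with a single pooling call, and the resulting tensors already live on the GPU for training.

```python
import torch
import torch.nn.functional as F

class BatchedTPMEnv:
    def __init__(self, num_envs=768, grid=32, device="cuda"):
        self.device = torch.device(device if torch.cuda.is_available() else "cpu")
        self.tpm = torch.zeros(num_envs, 1, grid, grid, device=self.device)
        # each environment gets a different initial target location
        idx = torch.randint(0, grid, (num_envs, 2), device=self.device)
        envs = torch.arange(num_envs, device=self.device)
        self.tpm[envs, 0, idx[:, 0], idx[:, 1]] = 1.0

    def motion_update(self):
        # Equation (1) for every environment in a single kernel launch (zero-padded 3x3 mean)
        self.tpm = F.avg_pool2d(self.tpm, kernel_size=3, stride=1, padding=1)

env = BatchedTPMEnv()
env.motion_update()          # 768 target probability maps advanced in parallel
```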
An overview of the algorithm is depicted in Figure 5.
The UAVs work together to investigate their surroundings by changing direction at regular intervals, and the interaction data are retained for training. The actor and critic networks are updated using the historical samples. The agents then continue to interact with the environment using the updated actor network, repeating the previous stages until the training time limit is reached. Finally, the drones choose the path with the largest cumulative reward as the actual search route. The overall flow of the algorithm is illustrated in Algorithm 1.
Algorithm 1 Search Algorithm Based on MAPPO and Optimal Control
  Initialize the parameters θ_π of the actor network π using orthogonal initialization
  Initialize the parameters θ_V of the critic network V using orthogonal initialization
  Initialize the total number of iterations I
  Let t_max denote the time limit for drone search operations
  Let n denote the maximum number of simulation turns for the drones
  Let D_u denote the reconnaissance coverage of UAV u
  For each i in 1 … n:
      Set a different initial target location for each environment
      Set each environment’s drone positions and TPM
      Obtain s_0, o_0 from the environments
      Set buffer b = []
      For each t in 1 … t_max:
          For each environment e:
              r_t = 0
              Execute average filtering on TPM_e (Equation (1))
              For each UAV u:
                  p_t^u = π(o_t^u; θ_π)
                  a_t^u ∼ p_t^u
                  r_t += −ln(1 − Σ_{D_u} TPM · (1 − t_p))
              Update TPM_e by Equation (2)
              v_t = V(s_t; θ_V)
              b += [s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1}]
      Compute advantage estimates A via GAE [36] on b
      Compute the reward-to-go R on b and normalize it
      Compute the loss of π using b, A, and Equation (14)
      Compute the loss of V using the mean squared error between v and R
      Update π and V
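To show how these pieces fit together, the toy sketch below runs one OC-MAPPO update step with random tensors standing in for a real rollout buffer; it reuses mappo_actor_loss and critic_loss from the sketches above, the batch size and value-loss coefficient follow Table 1, and everything else is illustrative rather than the released implementation.

```python
import torch

batch, n_agents = 768, 3                          # one transition per parallel environment
old_logp = -torch.rand(batch, n_agents)           # log pi_old(a_t^i | o_t^i) from the rollout
new_logp = old_logp.clone().requires_grad_(True)  # stands in for the current actor's output
rewards = torch.rand(batch)                       # -sum_u ln(1 - u_p) from the optimal control model
values = torch.rand(batch, requires_grad=True)    # V(s_t) from the critic
next_values = torch.rand(batch)                   # V(s_{t+1})

# one-step advantage and critic target (what GAE reduces to for a single-step rollout)
advantages = rewards + 0.99 * next_values - values.detach()

for _ in range(4):                                # update epochs over the same buffer (Table 1)
    actor_loss = mappo_actor_loss(new_logp, old_logp, advantages)
    value_loss = critic_loss(values, rewards, next_values)
    total = actor_loss + 0.05 * value_loss        # value loss coefficient from Table 1
    total.backward()                              # a real implementation would zero grads and step an optimizer
```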

6. Experiment and Result

This section presents the experiments conducted on the proposed work from four distinct perspectives. Firstly, it examines whether the model enhances the learning speed of the agents and the accuracy of the model. Secondly, it analyzes the effects of GPU parallelization. Subsequently, it compares the performance of various reinforcement learning algorithms on this task. Finally, it verifies the adaptability of the OC-MAPPO method under different parameter settings.

6.1. Optimal Control Model Validation

To validate the effectiveness of the proposed model, a comparison was conducted between the MAPPO method combined with the optimal control model introduced in this paper and the pure MAPPO method. In the first approach, the agents interacted with the established model, and the reward obtained was the negative value of the cost function at a single time step, i.e., −Σ_{u∈U} ln(1 − u_p(i)). In the second approach, the agents directly interacted with the environment, receiving a reward of 1 when the target was detected and a reward of 0 otherwise. Apart from this, all other parameters were identical for both methods. Subsequently, 200 simulation experiments were conducted on the fully trained agent. In each experiment, the theoretical value of the agent’s discovery probability was first derived from the rewards obtained by the agent; the computation formula is 1 − e^{−R/10}, as we amplified the reward by a factor of ten during training. Then, the agent’s reconnaissance trajectory was fixed, and the target’s movement route was randomly varied 1000 times to calculate the statistical probability of the agent discovering the target. Finally, the theoretical value and the statistical probability were compared to verify the accuracy of the model. The parameters of the MAPPO method are shown in Table 1.
The experimental results are shown in Figure 6 and Figure 7. In Figure 6, the horizontal axis represents the number of training iterations of the intelligent agent, while the vertical axis represents the likelihood of the intelligent agent discovering the target. These values were obtained by averaging the results of 100 interactions between the intelligent agent and its environment. Furthermore, because of the inherent randomness in the training process of reinforcement learning, each method was run 30 times, and the corresponding error bars were calculated and are shown.
In Figure 7, the x axis represents the simulation run index, and the y axis indicates the theoretical score calculated from the reward and the score obtained from the simulation. From the graph, it can be observed that the two types of scores usually have similar values; when the agent’s simulation score decreases, the agent’s theoretical score also exhibits a corresponding decline. The two significant drops in the agent’s scores occur because the agent is still running in exploration mode, meaning that during each run there is still a certain probability of executing suboptimal actions. In practical applications, this issue can be avoided by setting the agent to a deterministic mode.
The results shown in Figure 6 and Figure 7 indicate that the optimal control model proposed in this paper is effective and efficient.

6.2. GPU Parallel Model Checkout

To validate the effectiveness of the established GPU parallel model, we conducted two comparison experiments. The first comparison was made between the OC-MAPPO method combined with the GPU-parallelized model and the non-GPU-parallelized model in a task scenario with a fixed target start position. In the first method, 3 × 256 GPU cores were utilized to update 768 mathematical models simultaneously, while the number of interaction rounds between MAPPO and the environment was reduced to one. The second method used CPU updates for 10 mathematical models with 40 interaction rounds. The second comparison was made between the OC-MAPPO method with 10 environments, which can be parallelized on the CPU, and the OC-MAPPO method with 3 × 256 environments, which requires a GPU because of the limited number of CPU cores, in scenarios with varying target start positions. The purpose of the second experiment was to compare the effect of the GPU parallel environment in a more challenging scenario. Apart from the number of parallel environments, all other parameters remained consistent with the experiments conducted in the first part.
The experimental results are shown in Figure 8 and Figure 9. Figure 8 indicates that the MAPPO method incorporating GPU parallelization demonstrated significant advantages after 20 training epochs and consistently maintained a higher discovery rate thereafter. We speculate that this can be attributed to the increased interaction between the agent and the environment. Additionally, the second subplot shows that the GPU parallelization method took only approximately half the time of the second approach. Therefore, we can conclude that the utilization of GPU parallelization significantly enhances the learning efficiency of the agent.
Figure 9 illustrates that, in a task scenario where the initial position of the target varies, employing a greater number of parallel environments enables the agent to learn more stably and efficiently. Conversely, when only 10 parallel environments are utilized, the learning process of the agent becomes highly erratic, and the learning pace is significantly diminished.
The results shown in Figure 8 and Figure 9 indicate that parallel GPU environments can significantly enhance the learning speed of the agent.

6.3. Comparison of Deep Reinforcement Learning Algorithms

Currently, the field of deep reinforcement learning is primarily divided into two branches: value-based methods and policy gradient-based methods. This experiment compares the advanced Dueling Double Deep Q-Learning (D3QN) [37,38] method from the value-based branch and the MAPPO method from the policy gradient-based branch in order to explain why the MAPPO method was chosen to solve the model in this paper. The parameters for D3QN are shown in Table 2, while the parameters for MAPPO are the same as in the previous experiment.
Figure 10 illustrates the performance comparison between D3QN and MAPPO. In the left panel, the y axis remains consistent with Figure 6, while the horizontal axis has been altered to represent the training time required by both methods because of their distinct updating mechanisms. As evident from the graph, the MAPPO approach begins to surpass the D3QN method at approximately 50 s and maintains a consistent advantage thereafter. In the right panel, the horizontal axis represents training time, and the y axis indicates the mean reward. The general trends in the right graph are essentially consistent with those in the left graph. However, in the initial stages of the left graph, the MAPPO algorithm exhibits a sudden surge in score. This discrepancy arises because the left graph represents the best performance achieved by the agents during each round of interaction with the parallel environment, whereas the right graph measures the average performance. In both panels, the MAPPO algorithm ultimately demonstrates superior performance compared to D3QN methods. Consequently, this study employs the MAPPO technique to resolve the proposed uncrewed aerial vehicle reconnaissance problem. Based on the information in Figure 10, it can be concluded that the MAPPO method demonstrates superior performance and a more stable learning process compared to the D3QN method.

6.4. Comparison of MAPPO and Offline Planning Methods

In comparison to offline planning methods, such as genetic algorithms, employing deep reinforcement learning techniques enables the pre-training of intelligent agents capable of online action, allowing the agents to maintain their ability to act even when the target location changes. This study aims to compare the proposed method with genetic algorithms to observe the performance of solutions obtained by genetic algorithms and the intelligent agents trained using the presented approach when the target location is altered. The parameters for the GA method are shown in Table 3, while the parameters for the MAPPO method remain the same as in the previous experiment. In both methods, the departure position of the drone is located at the lower left corner.
The experimental results are shown in Figure 11 and Figure 12, where the coordinate values represent the deviation of the target’s initial position from its position in the training scenario. The center point (0, 0) represents the case in which the target’s starting position during testing is the same as during training. The color of each point becomes increasingly red as the probability of detecting the target increases. Figure 11 reveals that as the x value changes from negative to positive and the y value transitions from negative to positive, the color shifts from blue to red and then back to blue, indicating that the probability of target detection initially increases and subsequently decreases. In Figure 12, when the target’s starting position coordinates are less than those of the training position, the UAVs depart from the lower left corner, so the target is closer to the UAVs; consequently, the colors in the graph become more red, signifying an increased probability of the UAVs detecting the target. Only when the target is farther away than the training position does the probability of the UAVs detecting the target decline.
In summary, the probability of the genetic algorithm’s solution detecting the target decreases rapidly when the target’s initial position changes. In contrast, the OC-MAPPO method, benefiting from a large number of interactions in parallel environments, maintains its probability of discovering the target when the target position changes, provided the target remains relatively close to the agent’s spawn location. This demonstrates the reinforcement learning algorithm’s greater robustness and broader applicability.

7. Conclusions

In this paper, we study the problem of how multiple uncrewed aerial vehicles (UAVs) can efficiently recover a lost mobile target without prior knowledge of its movement patterns. To enable the UAVs to effectively locate the target in the absence of information about its motion, we formulate an optimal control model for the problem and employ the Multi-Agent Proximal Policy Optimization (MAPPO) method to solve it. Experimental results demonstrate that the proposed mathematical model enhances the probability of the UAVs successfully finding the target. Moreover, the adopted MAPPO approach exhibits distinct advantages compared to other reinforcement learning methods and genetic algorithms. The findings of this study contribute to the development of efficient strategies for UAV-based target recovery in dynamic environments where the target’s movement is unknown.
The method proposed in this paper primarily focuses on search strategies in two-dimensional scenarios. To further enhance the proposed method, future research can explore three-dimensional scenarios, which will expand the algorithm’s scope of application. Additionally, the drone motion model considered in this paper is discrete. Investigating continuous drone motion models can improve the algorithm’s precision. Furthermore, this paper does not examine the performance of agents when the number of drones dynamically changes. Whether agents can spontaneously adjust their strategies when the drone quantity varies remains a question worthy of study. In-depth research on these issues can further optimize the search efficiency of drones.

Author Contributions

Conceptualization, D.W.; methodology, D.W.; validation, L.Z.; formal analysis, L.Z.; investigation, Q.L.; data curation, H.C.; writing—original draft, D.W.; supervision, J.H.; project administration, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jose, A.N.C.S.C. Adaptive Search Control Applied to Search and Rescue Operations Using Unmanned Aerial Vehicles (UAVs). IEEE Lat. Am. Trans. 2014, 12, 1278–1283. [Google Scholar] [CrossRef]
  2. Chen, H.; Lan, Y.; K Fritz, B.; Clint Hoffmann, W.; Liu, S. Review of Agricultural Spraying Technologies for Plant Protection Using Unmanned Aerial Vehicle (UAV). Int. J. Agric. Biol. Eng. 2021, 14, 38–49. [Google Scholar] [CrossRef]
  3. Yao, P.; Wang, H.; Ji, H. Multi-UAVs Tracking Target in Urban Environment by Model Predictive Control and Improved Grey Wolf Optimizer. Aerosp. Sci. Technol. 2016, 55, 131–143. [Google Scholar] [CrossRef]
  4. Guan, W.Y.X. A New Searching Approach Using Improved Multi-Ant Colony Scheme for Multi-UAVs in Unknown Environments. IEEE Access 2019, 7, 161094–161102. [Google Scholar] [CrossRef]
  5. Queralta, J.P.; Taipalmaa, J.; Pullinen, B.C.; Sarker, V.K.; Gia, T.N.; Tenhunen, H.; Gabbouj, M.; Raitoharju, J.; Westerlund, T. Collaborative Multi-Robot Search and Rescue: Planning, Coordination, Perception, and Active Vision. IEEE Access 2020, 8, 191617–191643. [Google Scholar] [CrossRef]
  6. Yao, P.; Wei, X. Multi-UAV Information Fusion and Cooperative Trajectory Optimization in Target Search. IEEE Syst. J. 2022, 16, 4325–4333. [Google Scholar] [CrossRef]
  7. Wu, Y.; Nie, M.; Ma, X.; Guo, Y.; Liu, X. Co-Evolutionary Algorithm-Based Multi-Unmanned Aerial Vehicle Cooperative Path Planning. Drones 2023, 7, 606. [Google Scholar] [CrossRef]
  8. Su, J.l.; Wang, H. An Improved Adaptive Differential Evolution Algorithm for Single Unmanned Aerial Vehicle Multitasking. Def. Technol. 2021, 17, 1967–1975. [Google Scholar] [CrossRef]
  9. Li, Y.; Chen, W.; Fu, B.; Wu, Z.; Hao, L.; Yang, G. Research on Dynamic Target Search for Multi-UAV Based on Cooperative Coevolution Motion-Encoded Particle Swarm Optimization. Appl. Sci. 2024, 14, 1326. [Google Scholar] [CrossRef]
  10. Zhicai, R.; Jiang, B.; Hong, X. A Cooperative Search Algorithm Based on Improved Particle Swarm Optimization Decision for UAV Swarm. In Proceedings of the 2021 IEEE 6th International Conference on Computer and Communication Systems (ICCCS), Chengdu, China, 23–26 April 2021; pp. 140–145. [Google Scholar] [CrossRef]
  11. Shao, R.; Tao, R.; Liu, Y.; Yang, Y.; Li, D.; Chen, J. UAV Cooperative Search in Dynamic Environment Based on Hybrid-Layered APF. EURASIP J. Adv. Signal Process. 2021, 2021, 101. [Google Scholar] [CrossRef]
  12. Wang, H.; Zariphopoulou, T.; Zhou, X.Y. Reinforcement Learning in Continuous Time and Space: A Stochastic Control Approach. J. Mach. Learn. Res. 2020, 21, 1–34. [Google Scholar]
  13. Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Pieter Abbeel, O.; Zaremba, W. Hindsight Experience Replay. Adv. Neural Inf. Process. Syst. 2017, 30, 5048–5058. [Google Scholar]
  14. Nair, A.; McGrew, B.; Andrychowicz, M.; Zaremba, W.; Abbeel, P. Overcoming Exploration in Reinforcement Learning with Demonstrations. In Proceedings of the 2018 IEEE international conference on robotics and automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 6292–6299. [Google Scholar] [CrossRef]
  15. Cao, L.; Tan, H.; Peng, H.; Pan, M. Multiple UAVs Hierarchical Dynamic Task Allocation Based on PSO-FSA and Decentralized Auction. In Proceedings of the 2014 IEEE International Conference on Robotics and Biomimetics (ROBIO 2014), Bali, Indonesia, 5–10 December 2014. [Google Scholar] [CrossRef]
  16. Xu, G.; Li, L.; Long, T.; Wang, Z.; Cai, M. Cooperative Multiple Task Assignment Considering Precedence Constraints Using Multi-Chromosome Encoded Genetic Algorithm. In Proceedings of the 2018 AIAA Guidance, Navigation, and Control Conference, Kissimmee, FL, USA, 8–12 January 2018. [Google Scholar]
  17. Zuo, L.H.Q. Multi-Type UAVs Cooperative Task Allocation Under Resource Constraints. IEEE Access 2018, 6, 1. [Google Scholar]
  18. Pehlivanoglu, Y.V. A New Vibrational Genetic Algorithm Enhanced with a Voronoi Diagram for Path Planning of Autonomous UAV. Aerosp. Sci. Technol. 2012, 16, 47–55. [Google Scholar] [CrossRef]
  19. He, W.; Qi, X.; Liu, L. A Novel Hybrid Particle Swarm Optimization for Multi-UAV Cooperate Path Planning. Appl. Intell. 2021, 51, 7350–7364. [Google Scholar] [CrossRef]
  20. Liu, J.; Zou, D.; Nan, X.; Xia, X.; Zhao, Z. Path Planning Algorithm for Multi-Drone Collaborative Search Based on Points of Interest. In Proceedings of the International Conference on Autonomous Unmanned Systems; Springer: Singapore, 2023; pp. 504–513. [Google Scholar]
  21. Xiao, J.; Pisutsin, P.; Feroskhan, M. Collaborative Target Search with a Visual Drone Swarm: An Adaptive Curriculum Embedded Multistage Reinforcement Learning Approach. IEEE Trans. Neural Netw. Learn. Syst. 2023, 1–15. [Google Scholar] [CrossRef] [PubMed]
  22. Su, K.; Qian, F. Multi-UAV Cooperative Searching and Tracking for Moving Targets Based on Multi-Agent Reinforcement Learning. Appl. Sci. 2023, 13, 11905. [Google Scholar] [CrossRef]
  23. Fan, J.; Chen, X.; Liang, X. UAV Trajectory Planning Based on Bi-Directional APF-RRT* Algorithm with Goal-Biased. Expert Syst. Appl. 2023, 213, 119137. [Google Scholar] [CrossRef]
  24. Kong, X.; Zhou, Y.; Li, Z.; Wang, S. Multi-UAV Simultaneous Target Assignment and Path Planning Based on Deep Reinforcement Learning in Dynamic Multiple Obstacles Environments. Front. Neurorobot. 2024, 17, 1302898. [Google Scholar] [CrossRef]
  25. Illi, H.S.E.B. Information Sharing Based on Local PSO for UAVs Cooperative Search of Moved Targets. IEEE Access 2021, 9, 134998–135011. [Google Scholar] [CrossRef]
  26. Liang, Z.; Li, Q.; Fu, G. Multi-UAV Collaborative Search and Attack Mission Decision-Making in Unknown Environments. Sensors 2023, 23, 7398. [Google Scholar] [CrossRef]
  27. Karimi, S.; Saghafi, F. Cooperative Aerial Search by an Innovative Optimized Map-Sharing Algorithm. Drone Syst. Appl. 2023, 12, 1–18. [Google Scholar] [CrossRef]
  28. Fan, X.; Li, H.; Chen, Y.; Dong, D. UAV Swarm Search Path Planning Method Based on Probability of Containment. Drones 2024, 8, 132. [Google Scholar] [CrossRef]
  29. Phadke, A.; Medrano, F.A. Increasing Operational Resiliency of UAV Swarms: An Agent-Focused Search and Rescue Framework. Aerosp. Res. Commun. 2024, 1, 12420. [Google Scholar] [CrossRef]
  30. Yang, Y.H.Z.Z.C. UAV Swarm Cooperative Target Search: A Multi-Agent Reinforcement Learning Approach. IEEE Trans. Intell. Veh. 2023, 9, 1–11. [Google Scholar] [CrossRef]
  31. Chen, J.; Xiao, K.; You, K.; Qing, X.; Ye, F.; Sun, Q. Hierarchical Task Assignment Strategy for Heterogeneous Multi-UAV System in Large-Scale Search and Rescue Scenarios. Int. J. Aerosp. Eng. 2021, 2021, 1–19. [Google Scholar] [CrossRef]
  32. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  33. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624. [Google Scholar]
  34. Millet, T.; Casbeer, D.; Mercker, T.; Bishop, J. Multi-Agent Decentralized Search of a Probability Map with Communication Constraints. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, Toronto, ON, Canada, 2–5 August 2010; p. 8424. [Google Scholar] [CrossRef]
  35. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  36. Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv 2015, arXiv:1506.02438. [Google Scholar]
  37. Van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar] [CrossRef]
  38. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling Network Architectures for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1995–2003. [Google Scholar]
Figure 1. Environment diagram.
Figure 2. UAV movement model.
Figure 3. MARL-based method overview.
Figure 4. The backbone architectures of the actor–critic networks.
Figure 5. Overview of the OC-MAPPO method.
Figure 6. MAPPO with optimal control model vs. MAPPO with environment.
Figure 7. Comparison between theoretical scores and simulation scores.
Figure 8. OC-MAPPO with GPU-parallelized model vs. CPU model in the fixed-parameter scenario.
Figure 9. OC-MAPPO under different numbers of parallel environments in the varying-parameter scenario.
Figure 10. Comparison of D3QN and MAPPO.
Figure 11. Performance of GA when the target position deviates from the set position.
Figure 12. Performance of OC-MAPPO when the target position deviates from the set position.
Table 1. MAPPO parameters.

Parameter                     Value
γ                             0.99
Horizon length                819
Update epochs                 4
Learning rate                 0.0003
Vector environment numbers    10
GAE lambda parameter          0.95
Learning rate                 0.001
Value loss coefficient        0.05
Entropy coefficient           0.005
Eps clip                      0.2
Hidden size                   128
Table 2. D3QN parameters.

Parameter                     Value
γ                             0.99
Horizon length                819 batch_size 16 environment numbers
Soft update frequency         100
Capacity                      10,000
Vector environment numbers    10
Epsilon init                  0.9
Epsilon min                   0.005
Epsilon decay                 0.9
Learning rate                 0.001
Hidden size                   128
Table 3. Genetic parameters.

Parameter                     Value
Iterations                    300
Population                    100
Parents                       20
Elites                        10
Mutation rate                 0.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Wei, D.; Zhang, L.; Liu, Q.; Chen, H.; Huang, J. UAV Swarm Cooperative Dynamic Target Search: A MAPPO-Based Discrete Optimal Control Method. Drones 2024, 8, 214. https://doi.org/10.3390/drones8060214
