1. Introduction
As early as the Stone Age, human beings learned that fighting as a group could exert a combat power far beyond the sum of individual efforts. In recent years, the formation flight of Unmanned Aerial Vehicles (UAVs) has developed rapidly and attracted widespread attention. It is a new operational concept that offers advantages in numbers, cost and intelligent coordination, and it is rapidly gaining popularity among military powers [1, 2, 3].
Due to their high flexibility, strong maneuverability, low safety risk, low cost and good robustness, UAVs will be a significant aspect of future confrontations. UAVs can make full use even of contested airspace. Increasing numbers of UAVs in that airspace will inevitably lead to flight vectors in many directions, which significantly increases the possibility of flight conflicts and seriously affects the safe flight of UAVs. Therefore, a question that must be answered is how the UAVs in a swarm can approach enemy UAVs and strike collectively through cooperative decision-making, so as to minimize losses and complete their strike missions quickly. This kind of multi-agent game research also has important practical significance [4]. The problem of multi-agent reinforcement learning was proposed as early as the last century, and stochastic games are generally used to define it mathematically. Unlike single-agent Markov decision processes, stochastic games involve multiple action spaces and reward functions, and they are used extensively, from OpenAI's training of agents in games to robotics. In industrial applications, reinforcement learning is becoming a practical component of large systems. However, most of reinforcement learning’s successes are in the single-agent domain, so multi-agent reinforcement learning must address more complex problems, such as teamwork and conflict detection and resolution [5, 6].
Conflict detection and intelligent resolution are the main problems to be solved in UAV path planning. A flight conflict is a state in which the distance between two aircraft is less than a specified minimum separation, which threatens the safety of the UAVs. Conflict detection determines whether a UAV has entered the protected area of another UAV, based on information such as UAV performance, current flight status and flight plan. Conflict resolution refers to the measures taken once a flight conflict is detected, such as replanning the trajectory so that the possible conflict is avoided.
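The distance-based definition of a flight conflict can be illustrated with a short sketch. The function name `detect_conflicts`, the two-dimensional positions and the separation value are illustrative assumptions, not part of the method described in this paper.

```python
import itertools
import math


def detect_conflicts(positions, min_separation):
    """Return index pairs (i, j) whose Euclidean distance is below min_separation."""
    conflicts = []
    for (i, p), (j, q) in itertools.combinations(enumerate(positions), 2):
        if math.dist(p, q) < min_separation:
            conflicts.append((i, j))
    return conflicts


# Three hypothetical UAV positions in a 2-D plane (arbitrary units).
uavs = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0)]
print(detect_conflicts(uavs, min_separation=0.1))
```

Only the first two UAVs are closer than the assumed minimum separation, so they form the single reported conflict pair.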
There have been many studies on UAV conflict detection, and the methods used by different researchers often differ. Detection methods for UAV conflicts can be divided into two main categories: deterministic conflict detection, based on real-time flight dynamics, meteorological information, UAV flight plans and careful consideration of navigation performance errors; and probabilistic conflict detection, which assesses the influence of uncertain factors such as weather on future tracks to determine the probability that UAVs will collide. Incomplete information is a typical feature of probabilistic conflict detection, as illustrated in Figure 1. In the figure, the black and white circles are UAVs of opposing sides, the gray circle is an obstacle and the dotted circle indicates the uncertainty of the obstacle. In this research, the uncertainties of obstacles, opponents and environment are reflected in the reward function, so it is of great significance to study UAV path planning with unknown reward functions.
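Probabilistic conflict detection can be illustrated by estimating a conflict probability from sampled obstacle positions. The sketch below assumes that the obstacle's uncertainty is an isotropic Gaussian around a nominal position; this uncertainty model, the function name and all numerical values are assumptions made for illustration only.

```python
import numpy as np


def conflict_probability(uav_pos, obstacle_mean, obstacle_std, min_separation,
                         n_samples=10_000, seed=0):
    """Monte Carlo estimate of P(distance < min_separation).

    The obstacle position is drawn from a Gaussian centred on obstacle_mean
    (an assumed uncertainty model, not taken from the paper).
    """
    rng = np.random.default_rng(seed)
    samples = rng.normal(obstacle_mean, obstacle_std, size=(n_samples, 2))
    distances = np.linalg.norm(samples - np.asarray(uav_pos), axis=1)
    return float(np.mean(distances < min_separation))


# Hypothetical scenario: obstacle nominally 0.2 units away, uncertain by 0.1.
p = conflict_probability((0.0, 0.0), (0.2, 0.0), 0.1, min_separation=0.14)
print(f"estimated conflict probability: {p:.3f}")
```

A deterministic detector would report no conflict here because the nominal distance exceeds the separation, whereas the probabilistic estimate assigns the situation a non-zero risk.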
For deterministic conflict detection, Florence Ho [7] introduced an adaptive ORCA algorithm for conflict detection and resolution (CDR) between UAVs of different service providers. The adaptive ORCA algorithm addresses practical problems inherent in deploying UAVs in shared airspace, such as navigation inaccuracy, communication overhead and flight phase. Flying multiple unmanned aircraft, or operating them in commercial airspace, increases the likelihood of a collision. B. M. Albaker [8] developed a new functional architecture for a UAV collision avoidance system, together with an algorithm that determines the collision avoidance criteria from the nominal state projection. Roberto Conde [9] proposed a conflict detection and resolution method for cooperative UAVs in shared airspace. It detects conflicts with an axis-aligned minimum bounding box algorithm, and the detected conflicts are resolved cooperatively by a genetic algorithm that modifies the UAV trajectories at minimal cost. Florence Ho [10] studied first-come, first-served (FCFS) and “batch” processing of Unmanned Aircraft System (UAS) operation requests, compared their throughput and analyzed the air traffic topology for UAV delivery. They then developed a new multi-agent path finding (MAPF) model for a pre-flight CDR method. This method supports decentralized conflict resolution, with different “agents” (here, UAS service providers) managing their own UAV operations and providing all UAVs with collision-free flight paths before takeoff [11]. These studies analyze and resolve conflicts on the basis of deterministic information. Although they achieved good experimental results, they do not fully adapt to the emergencies and uncertain information that arise in practice, so their processing ability is unstable.
For probabilistic conflict detection, Yu Wan [12] proposed a multi-UAV coordination technique based on a consensus algorithm and policy coordination. The model uses a distributed conflict detection and resolution method for human-machine formations, an improved space-time integrated conflict detection model and an improved distributed coordination token allocation strategy, and it introduces coordination damping to cope with data loss and transmission delay among UAVs sharing the same airspace at the same time. Mingrui Lao [13] proposed one algorithm for conflict detection and another for conflict resolution; together they generate all possible solutions for potential conflicts and select the best strategy for multi-threat scenarios. Chin E. Lin [14] used Automatic Dependent Surveillance-Broadcast (ADS-B) to collect aircraft data for collision avoidance. Based on flight maneuvers, they proposed a detection algorithm that creates sector ranges covering the possible changes in flight direction of UAVs and helicopters. Zhaoxuan Liu [15] first developed a conflict network to analyze pairwise conflict relationships between aircraft, in which a detected pairwise conflict is represented as an edge and the conflict severity is the weight of that edge. In addition, they designed an improved PageRank algorithm to identify the critical aircraft that are bottlenecks for system safety, and implemented centralized conflict resolution Sequence Assignment (SA) to ensure that these critical aircraft take primary responsibility for discrimination and are deconflicted first. Austin L. Smith [16] developed and implemented a collision avoidance algorithm based on an aggregate collision cone approach, with implementations ranging from a single platform that performs all collision avoidance functions independently to configurations in which collision avoidance commands are computed at a ground station. Jian Yang [17] used a geometric method to describe the conflict relationships between UAVs, considering both actual and potential conflicts, and formalized the CDR problem as a nonlinear optimization problem that minimizes maneuvering costs. Furthermore, they designed a two-layer strategy consisting of Stochastic Parallel Gradient Descent (SPGD) and an interior-point algorithm to solve the non-convex optimization problem efficiently. These researchers consider UAV cluster conflict detection and intelligent resolution under incomplete or uncertain information and solve it effectively. However, the stability and efficiency of their methods still need improvement.
Multi-UAV CDR is one of the main topics of research on UAV cooperative control systems. However, conflict avoidance requires studying the multi-agent path finding (MAPF) problem to calculate the optimal result and thereby improve escape efficiency. In a MAPF setting, agents must follow paths to their target locations without colliding with each other, usually in a distributed setting. The agents are generally treated as independent: individual payoffs are calculated first, and global payoffs are considered afterwards. Researchers in many countries have proposed optimization methods that minimize a “global indicator” or maximize a “benefit”. This paper uses an improved multi-agent deep deterministic policy gradient (MADDPG) algorithm to construct a multi-agent system and regards time-space multi-domain UAV conflict detection and intelligent resolution as a single system optimization problem. To make this system work, the two sub-problems of conflict detection and intelligent resolution are solved simultaneously [18, 19, 20, 21, 22, 23].
Researchers worldwide have conducted much research on UAV games and obtained many valuable results, but many problems remain. First, while there are many results on multi-UAV conflict detection under complete information, there are few on multi-UAV conflict detection and resolution in uncertain environments with incomplete information. Second, current intelligent algorithms for UAV conflict detection and resolution are mostly traditional path-planning algorithms. In a multi-agent environment, the strategy of each agent changes as training progresses. In the human-machine environment, the defects of slow convergence and low precision are magnified by the large number of agents involved and the resulting complexity. In addition, the applicability of these algorithms decreases as the number of UAVs increases. Finally, UAV path planning is primarily concentrated in a single centralized space. There are currently few studies that combine the time domain and the space domain, which makes it difficult for existing methods to adapt to the modern large-scale battlefield environment.
In view of the above problems, a UAV path planning strategy based on the MADDPG algorithm is proposed to realize time-space multi-domain conflict detection and intelligent resolution for UAVs without the players knowing their reward functions. At each time step, each agent chooses an action and receives a numerical value as its payoff, or perceived payoff, in the game. Unlike fictitious play and best-response dynamics, which require knowledge of the other players’ behavioral histories, our learning algorithm relaxes this assumption. In applications it is often unreasonable and unrealistic to assume that the actions of other parties can be observed, i.e., to expect complete information. Furthermore, we assume that the state space of the game and its transition laws between states are unknown. In addition, the agent does not know the action spaces of the other agents, the movement strategy and specific speed of the enemy UAVs or the locations of the threat areas. We therefore want to address how much the agent can expect to learn in this situation [24, 25, 26].
3. Improved MADDPG Algorithm
3.1. Multi-Agent Reinforcement Learning Algorithm
The continuous improvement and development of multi-agent reinforcement learning provides a new solution for multi-UAV target assignment and path planning. MADDPG performs well in multi-agent games, and target allocation and path planning are the problems that underlie such games. Both sides of the game essentially require UAVs to select “appropriate” targets to strike (or defend) and to minimize the total distance flown by the UAV formation (or to prevent the opponent from doing so). Moreover, despite the dynamic nature of the environmental information and target attributes, the MADDPG model enables the UAVs to deal with changes in the environment increasingly well as training progresses.
The multi-agent extension of the Markov decision process (MDP) with N agents is described by a five-tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ and $A$ represent the state space and action space (each agent $i$ has its own state space $S_i$ and action space $A_i$), $P$ represents the state transition probability matrix, $R$ represents the reward function and $\gamma$ is the decay coefficient of the cumulative discounted reward.
In a multi-agent system, the reward obtained by state transition depends on the joint strategy
, which is the joint decision-making strategy of all agents. The value function of each agent is as follows:
In this formula, T is the total time, t is the current simulation time and is the previous state of the environment. The ultimate goal of the multi-agent Markov game is to find the optimal joint strategy , which maximizes the cumulative expected return of the entire system.
The Bellman equations of state-value function under multi-agent are as follows:
where the expected value of the reward
R with states is
.
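For reference, these quantities are commonly written as follows. This is a standard textbook form rather than a reproduction of the paper's own notation: the symbols $\pi$ (joint strategy), $r_i^{t}$ (reward of agent $i$ at time $t$), $P$ (transition probability), $R_i$ (reward function of agent $i$) and $\gamma$ (discount factor) are assumed names, and the Bellman equation is stated for the stationary case in which the horizon $T$ is large.

```latex
% Value function of agent i under the joint strategy \pi, starting from state s
V_i^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, r_i^{t}
               \,\middle|\, s_0 = s \right]

% Bellman equation for the same value function (stationary form)
V_i^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
               \left[ R_i(s, a, s') + \gamma\, V_i^{\pi}(s') \right]
```

Here $a$ denotes the joint action of all agents, so each agent's value depends on the strategies of the others through $\pi$.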
The MADDPG algorithm used in this paper integrates our UAVs and the enemy UAVs into the same agent system for training. The essence of this method is a Markov decision process, in which the problem of multi-UAV target assignment and path planning is discretized into multiple time steps. After each step, the UAVs and the environment together are treated as a state, and each UAV observes the current environment and then takes its next action according to its policy network. However, no UAV can fully monitor the location of the enemy targets or receive comprehensive intelligence on them. Furthermore, because the values of multiple enemy UAVs are unclear, our UAVs operate and attempt to fight in a state of incomplete information.
3.2. MADDPG Algorithm
In order to solve the problem of reinforcement learning with incomplete information, we introduce the observation space of a partially observable Markov decision process (POMDP), which builds on the MDP. A POMDP is defined as a tuple whose state space, action space, transition probability, reward function and discount factor are defined as in the MDP, extended by an observation space and an observation probability. Note that the observation space is distinct from the individual observation perceived by a particular agent, for example agent 1. Agents may observe differently in the same state because of the observation probability. The MADDPG algorithm [15] is an extension of the DDPG algorithm to multi-agent reinforcement learning. It uses a “centralized training, decentralized execution” architecture, which requires additional state and action information about other agents only in the training phase; at execution time, only the agent’s own state is needed to output the policy action. The architecture of MADDPG is shown in Figure 2. Each agent has two networks: an actor network π and a critic network Q. The actor network calculates the action to be performed based on the agent’s state, and the critic network evaluates that action in order to improve the performance of the actor network. Sampling the experience pool at random when training the Q-value network breaks the correlation between samples and makes the training results more stable. During training, the actor network uses only the agent’s own observations, while the critic network also takes in information about the other agents.
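The division of information between actor and critic can be sketched with toy linear networks. Everything below (the dimensions, the random weights, the names `act` and `q_value`) is a hypothetical illustration of “centralized training, decentralized execution”, not the network architecture used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, act_dim = 3, 4, 2  # hypothetical sizes

# One linear "actor" per agent: it sees only that agent's own observation.
actor_weights = [rng.normal(size=(obs_dim, act_dim)) for _ in range(n_agents)]
# One linear "critic" per agent: it sees every agent's observation and action.
critic_in_dim = n_agents * (obs_dim + act_dim)
critic_weights = [rng.normal(size=critic_in_dim) for _ in range(n_agents)]


def act(i, own_obs):
    """Decentralized execution: agent i maps its own observation to an action."""
    return np.tanh(own_obs @ actor_weights[i])


def q_value(i, all_obs, all_actions):
    """Centralized training: agent i's critic scores the joint observation-action."""
    joint = np.concatenate([all_obs.ravel(), all_actions.ravel()])
    return float(joint @ critic_weights[i])


observations = rng.normal(size=(n_agents, obs_dim))
actions = np.stack([act(i, observations[i]) for i in range(n_agents)])
print(actions.shape, q_value(0, observations, actions))
```

At execution time only `act` is needed, which is why each UAV can fly on its own observations once training is finished.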
Agent i in the MADDPG algorithm uses a stochastic policy $\pi_i$, which should depend on the agent's history of observations. The parameters $\theta = \{\theta_1, \dots, \theta_N\}$ determine the strategies of all agents, $\pi = \{\pi_1, \dots, \pi_N\}$. The expected policy gradient for agent i can thus be obtained as:
where $o_i$ is the observation of agent i; $\chi$ represents the state of the agents, which simplifies the expression of the Q function; and $Q_i$ is the Q-value function, which uses the state $\chi$ and the actions of all agents to estimate the state-action value Q of agent i. Since each $Q_i$ is learned individually, the agents can have arbitrary reward structures, including conflicting rewards in competitive environments. D represents the experience pool, which contains a series of tuples $(\chi, \chi', a_1, \dots, a_N, r_1, \dots, r_N)$ recording the training samples of all agents, where $\chi'$ is the new state after the agents act and $r_i$ is the reward of agent i.
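For reference, the policy gradient in the original MADDPG formulation is usually written in the two forms below. The notation is an assumption chosen to match the description above ($\theta_i$ for the parameters of agent $i$'s policy $\pi_i$, $o_i$ for its observation, $\chi$ for the state information, $Q_i^{\pi}$ for its centralized critic and $D$ for the experience pool); it is not a reproduction of the paper's own equation.

```latex
% Stochastic policies
\nabla_{\theta_i} J(\theta_i) =
  \mathbb{E}_{s \sim p^{\pi},\, a_i \sim \pi_i}
  \left[ \nabla_{\theta_i} \log \pi_i(a_i \mid o_i)\,
         Q_i^{\pi}(\chi, a_1, \dots, a_N) \right]

% Deterministic policies, with samples drawn from the experience pool D
\nabla_{\theta_i} J(\theta_i) =
  \mathbb{E}_{\chi,\, a \sim D}
  \left[ \nabla_{\theta_i} \pi_i(o_i)\,
         \nabla_{a_i} Q_i^{\pi}(\chi, a_1, \dots, a_N)
         \big|_{a_i = \pi_i(o_i)} \right]
```

In both forms the critic receives the actions of all agents, while the policy being differentiated depends only on agent $i$'s own observation.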
Updating the critic network loss function can be shown as:
where
.
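In the standard MADDPG formulation, the critic of agent $i$ is trained by minimizing the squared difference between its Q value and a bootstrapped target $y = r_i + \gamma Q_i'(\chi', a_1', \dots, a_N')$, where the primed quantities come from the target networks. A minimal numerical sketch, assuming that form (the function names and the value of `gamma` are illustrative only):

```python
import numpy as np


def td_target(reward, next_q, gamma=0.95):
    """Bootstrapped target y = r_i + gamma * Q_i'(next state, next joint action)."""
    return np.asarray(reward, dtype=float) + gamma * np.asarray(next_q, dtype=float)


def critic_loss(q_values, targets):
    """Mean squared TD error over a sampled mini-batch."""
    q_values = np.asarray(q_values, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean((q_values - targets) ** 2))


# Hypothetical mini-batch of two samples for one agent.
rewards = np.array([1.0, 0.0])
next_qs = np.array([2.0, 4.0])      # target-critic values at the next state
targets = td_target(rewards, next_qs, gamma=0.5)
loss = critic_loss([2.0, 1.0], targets)
print(targets, loss)
```

The gradient of this loss with respect to the critic parameters is what an optimizer would follow; the sketch stops at the scalar loss to stay framework-independent.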
3.3. CMD-MADDPG Algorithm with Incomplete Information
Based on the MADDPG algorithm, the complex memory driver (CMD) communication mechanism is introduced so that agents can use a shared memory as their communication channel. Before performing an operation, an agent first reads the memory and then writes its response. The agent’s strategy therefore depends on both its own observation and its interpretation of the memory contents. Based on the above analysis and on relevant game theory, the MADDPG algorithm can be improved for the incomplete information scenario of UAV conflict detection as follows:
- (1) N represents the set of participants, where N = {1, …, M, …, N}, M is the number of our UAVs, and N − M is the number of enemy UAVs;
- (2) The agent state;
- (3) The probability that the enemy selects strategy S under that state;
- (4) The Q value function, which follows from the preceding definitions;
- (5)
The expected policy gradient for agent i can thus be changed to:
The critic network loss function can therefore be updated to:
where
.
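The read-then-write use of shared memory can be sketched as follows. This is a deliberately simplified illustration: the fixed mixing weight `gate`, the class `SharedMemory` and the placeholder `toy_policy` are hypothetical and stand in for learned components; they are not the CMD mechanism as implemented in this paper.

```python
import numpy as np


class SharedMemory:
    """A fixed-size vector that agents read before acting and write afterwards."""

    def __init__(self, size):
        self.state = np.zeros(size)

    def read(self):
        return self.state.copy()

    def write(self, message, gate=0.5):
        # Blend old content with the new message; `gate` is a hypothetical
        # fixed weight standing in for a learned write gate.
        self.state = (1.0 - gate) * self.state + gate * np.asarray(message)


def step_agent(observation, memory, policy):
    """Read the memory, act on (observation, memory), then write a response."""
    context = memory.read()
    action, message = policy(np.concatenate([observation, context]))
    memory.write(message)
    return action


def toy_policy(x):
    # Placeholder policy: action is the input mean, message is the first two entries.
    return float(x.mean()), x[:2]


memory = SharedMemory(size=2)
a0 = step_agent(np.array([1.0, 3.0]), memory, toy_policy)
a1 = step_agent(np.array([0.0, 0.0]), memory, toy_policy)
print(a0, a1, memory.read())
```

The second agent observes only zeros, yet its action is non-zero because it also reads what the first agent wrote, which is the sense in which the memory acts as a communication channel.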
3.4. Analysis of Reward Function
The reward function consists of a global reward and location-based local rewards. Its primary purpose is to guide each UAV to reach its dynamic target over the shortest distance while avoiding conflicts.
When multiple UAVs perform tasks, two kinds of conflict must be avoided: conflict between a UAV and a threat area, and conflict between UAVs. This paper therefore designs a collision reward function to avoid collisions: when a collision occurs, the UAV receives a negative reward.
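A collision reward of this kind can be sketched as below. The sizes 0.05 and 0.09 are the agent and threat-area sizes quoted in the simulation section, interpreted here as radii (an assumption); the penalty magnitude and the function name are hypothetical.

```python
import math

AGENT_SIZE = 0.05          # agent size from the simulation section, assumed to be a radius
THREAT_SIZE = 0.09         # threat-area size from the simulation section, assumed to be a radius
COLLISION_PENALTY = -1.0   # hypothetical magnitude, not taken from the paper


def collision_reward(uav_pos, other_uavs, threat_centers):
    """Add a negative reward for every UAV-UAV or UAV-threat overlap."""
    reward = 0.0
    for other in other_uavs:
        if math.dist(uav_pos, other) < 2 * AGENT_SIZE:
            reward += COLLISION_PENALTY
    for center in threat_centers:
        if math.dist(uav_pos, center) < AGENT_SIZE + THREAT_SIZE:
            reward += COLLISION_PENALTY
    return reward


# A UAV at the origin, one distant teammate and one threat area that it overlaps.
print(collision_reward((0.0, 0.0), [(1.0, 0.0)], [(0.1, 0.0)]))
```

Because each overlap contributes its own penalty, a UAV that collides with a teammate inside a threat area is penalized twice in this sketch.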
In order to get the UAV to the dynamic target quickly, we need to simulate and calculate its distance to the enemy’s dynamic target. We approximate the action of the dynamic target at each time step with a binomial distribution to build the corresponding action space and state space. Under the constraints designed in this paper, the distance can be advanced step by step, and at time step t the distance of the UAV from dynamic enemy target i is:
The relationship between the initial position of our UAV, its position at time step t and the position of the enemy’s dynamic target at time step t is approximately linear. The quantities in this expression are the initial distance, the distance between the enemy’s initial and current positions, the moving distance of the dynamic target, and the angle between the initial position of the UAV and its current position at step t. The distances between the UAVs are illustrated in Figure 3.
To make the UAVs sufficiently flexible in completing the task, a UAV will directly defeat an enemy UAV if it “accidentally” finds one other than its assigned target during training. In that case, the UAV originally assigned to the captured target continues with a different task: its target is recalculated and reallocated according to the enemy’s fuzzy position reward.
The pseudocode for the training flow of the UAV training algorithm is given in Algorithm 1:
Algorithm 1: CMD-MADDPG algorithm
Initialize the number of UAVs k, the number of targets m, the number of threat areas L and the critical area σ
Initialize the policy network and evaluation network of each UAV and the parameters of both networks
for episode = 1 to MaxEpisode do
    Randomly initialize the UAV, obstacle and target positions in the UAV environment
    for t = 1 to MaxStep do
        Get the environment state
        Get the action of each UAV
        Apply the joint action of all UAVs to the environment, and return each UAV’s reward, the number of collisions and the next state
        Store the sample in the experience pool
        Update the environment state
        for i = 1 to k do
            Randomly sample S samples from the experience pool to form a batch
            Compute the objective of the joint behavior function from the sampled data
            Update the policy network of UAV i with the expected policy gradient
            Update the evaluation network of UAV i with the critic loss function
        end for
        Update each UAV’s target policy network and target evaluation network in a soft-update manner
    end for
end for
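Two building blocks of the training loop above, the experience pool and the soft update of the target networks, can be sketched as follows. The class and function names, the tuple layout and the value of `tau` are illustrative assumptions rather than the implementation used in this paper.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Experience pool that keeps the most recent `capacity` transitions."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, state, actions, rewards, next_state):
        self.storage.append((state, actions, rewards, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def soft_update(target_params, online_params, tau=0.01):
    """Return tau * online + (1 - tau) * target for each parameter array."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_params, online_params)]


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(step, [0.0], [1.0], step + 1)   # dummy transitions
batch = buffer.sample(2)
new_target = soft_update([np.zeros(2)], [np.ones(2)], tau=0.1)
print(len(buffer), len(batch), new_target[0])
```

A small `tau` makes the target networks trail the online networks slowly, which is the usual reason for preferring a soft update over copying the weights outright.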
4. Simulation Results and Analysis
Based on the OpenAI platform, we used CMD-MADDPG to create an incomplete information training environment for multi-UAV conflict detection. The experiment uses several offensive UAVs and dynamic targets, and the environment contains multiple fixed threat zones that appear at random. The offensive UAVs therefore need real-time conflict detection and resolution. The offensive UAVs communicate with each other, but they do not know the moving direction of the dynamic targets or the locations of the fixed threat zones. Under these conditions, it is difficult to realize the final path planning with traditional methods. The simulated environment takes its geometric center as the origin of the Cartesian coordinate system. The agent size is 0.05, the target size is 0.07 and the threat area size is 0.09. Positions are randomly generated in each training scenario; the speed of the UAVs is set to 0.02, while the target speed changes with the time step. The experiment uses two indicators to measure the performance of CMD-MADDPG path planning for dynamic targets in two modes: the number of collisions between agents and threat areas over the training episodes (TEC) and the global reward. The dynamic target moves in a random direction, chosen from directions spaced 45° apart. Table 1 lists the hyperparameters.
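The target's movement pattern, with one possible heading every 45°, can be sketched as below together with a simple distance-based reward. The constant target speed of 0.02 and the negative-distance reward are assumptions for illustration; the paper states that the target speed varies with the time step and does not give the reward in this form.

```python
import math
import random

HEADINGS_DEG = [0, 45, 90, 135, 180, 225, 270, 315]  # one direction every 45 degrees


def move_target(position, speed, rng):
    """Advance the target one step along a randomly chosen 45-degree heading."""
    heading = math.radians(rng.choice(HEADINGS_DEG))
    return (position[0] + speed * math.cos(heading),
            position[1] + speed * math.sin(heading))


def distance_reward(uav_pos, target_pos):
    """Negative Euclidean distance: the closer the UAV, the higher the reward."""
    return -math.dist(uav_pos, target_pos)


rng = random.Random(0)
target = (0.5, 0.5)
new_target = move_target(target, speed=0.02, rng=rng)  # hypothetical constant speed
print(new_target, distance_reward((0.0, 0.0), new_target))
```

Because the heading is drawn afresh at every step, the UAV cannot extrapolate the target's path and must re-estimate the distance each time, which matches the incomplete information setting described above.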
With the CMD-MADDPG model, three UAVs are trained to plan a good path in the presence of two fixed threat areas and to reach dynamic targets moving in random directions in the shortest time and over the shortest distance. Figure 4a,b shows that as training proceeds, the effects of batch size (bs) and learning rate (LR) on the algorithm’s reward gradually diverge. In Figure 4a, the TEC gradually converges after training reaches 1000 episodes. In Figure 4b, the reward is lower before training reaches 1000 episodes because the policy is not yet well trained. Between episodes 1000 and 2000, the reward decreases abruptly and then gradually converges and stabilizes. Comparing the two figures shows that training is most effective with a learning rate of 0.005 and a batch size of 1024.
To test whether conflicts can be resolved reasonably and intelligently in the presence of threat areas, and whether the targets can be reached in the shortest time and over the shortest distance, 5, 7 and 10 UAVs are trained in this mode. The training remains effective in the multi-UAV environment. Moreover, the algorithm reduces the rate of collisions with the threat zones to 0 under incomplete information (i.e., it avoids actual collisions), the targets are captured and the reward converges. The experimental results are shown in Figure 5.
Two models, MADDPG and CMD-MADDPG, were trained for comparison. The resulting reward function is shown in Figure 6. Figure 6b shows the reward of two UAVs in each training round. The x-coordinate represents the number of training rounds, and the y-coordinate represents the cumulative reward of the two UAVs in each round. The figure shows that the reward gradually increases as training proceeds. When the number of training rounds reaches 2500, the reward curves of the two algorithms flatten and tend to converge. However, Figure 6a shows that the UAV collision rate under the MADDPG model is much higher than under the CMD-MADDPG model. Comparing the two algorithms shows that the CMD-MADDPG algorithm converges faster and more robustly than the MADDPG algorithm. As the path planning method proposed in this paper is a real-time planning method, timeliness is of great significance in the practical application of UAVs, especially in combat and reconnaissance scenarios. Therefore, the actual running time of the algorithm is critical. We ran the MADDPG and CMD-MADDPG algorithms in five experiments and recorded the time consumed by conflict detection and intelligent resolution, as shown in Table 2.
Figure 7 shows a schematic diagram of cross-domain conflict detection in a scenario with five UAVs and five threat areas. Figure 7a shows the grey threat areas and cyan dynamic targets, which are set randomly before the test. The orange circles represent our UAVs, and the number in each circle identifies the UAV. Figure 7b illustrates the training result for the initial conditions shown in Figure 7a, with the red curves representing the trajectories of the UAVs tracking their targets. The figure shows that the UAVs avoid the threat zones and reach the target positions. During this training process, UAVs 1 and 2 exchanged their cross-domain dynamic targets, as directed by the algorithm, to avoid conflicts, perform intelligent resolution and complete target capture.
It is noteworthy that the MADDPG algorithm is improved in this paper to make it applicable to incomplete information conditions. Specifically, multi-UAV path planning is realized when the reward function is unknown. The experimental results show that the proposed CMD-MADDPG algorithm improves convergence speed and accuracy.