1. Introduction
Compared with traditional manned aerial vehicles, unmanned aerial vehicles (UAVs) can be controlled autonomously or remotely, place low demands on the combat environment, offer strong battlefield survivability, and can be used to perform a variety of complex tasks [1,2]. UAVs have therefore been widely studied and applied [3]. With the continuous application of UAVs in the military field, systems built around a single UAV have gradually revealed problems of poor flexibility and low stability [4,5]. Cooperative combat using a multi-UAV system composed of multiple UAVs has become a major new research direction [6,7]. Under modern, networked battlefield conditions, an air cluster composed of multiple UAVs can continuously deliver the required strikes, forcing the enemy to expend more resources and face more fighters, thereby enhancing the overall capability and performance of military combat confrontation.
Multi-UAV cooperative penetration is one of the key issues in achieving multi-UAV cooperative combat. Multiple UAVs start from the same location or from different locations and finally arrive at the same place. At present, UAV penetration-trajectory-planning methods mainly include the A* algorithm [8], the artificial potential field method [9,10], and the RRT algorithm [11]. Most application scenarios of these methods are environments with static no-fly zones, and dynamic threats are rarely considered. The A* algorithm is a typical grid method: it rasterizes the map for planning, but the grid size strongly affects the result and is difficult to determine. The artificial potential field method easily falls into local optima, which can make the target unreachable. When a dynamic-tracking interceptor is present, the environment information becomes complicated and real-time planning is required of the UAVs; traditional algorithms therefore cannot meet the requirements.
Multi-UAV cooperative penetration generally requires multiple UAVs to coordinate in time, penetrate defenses along different trajectories, and finally reach the target area at the same time or in a specified time sequence. When the UAVs depart from different locations, the difficulty of coordination increases greatly. Multi-UAV-coordination algorithms can be obtained by adapting traditional single-UAV penetration algorithms to the multi-UAV setting. Chen used the optimal-control method to improve the artificial potential-field method to achieve multi-UAV coordination [11,12], and Kothari improved the RRT algorithm to achieve multi-UAV coordination [13]. At the same time, there are many methods for cooperative control of UAVs based on graph theory [3]. Li proposed a multi-UAV collaboration method based on graph theory [14]. Ruan proposed a multi-UAV coordination method based on multi-objective optimization [15]. The above methods all realize the coordination of multiple UAVs, but they lack research on dynamic environments and cannot adapt to the complex and dynamic battlefield environment. For environments with dynamic threats, this paper proposes a method based on deep reinforcement learning to achieve multi-UAV cooperative penetration.
Reinforcement learning is an important branch of machine learning. Its main feature is that the agent's action policy is evaluated on the basis of the rewards obtained through interaction and trial and error between the agent and the environment. Reinforcement learning is a far-sighted machine-learning method that considers long-term rewards [16,17] and is often used to solve sequential decision-making problems. It can be applied not only in static environments but also in dynamic environments whose parameters are constantly changing [16,17,18]. Research on reinforcement learning is mostly concentrated on the single-agent setting, but there is also a large body of research on reinforcement-learning algorithms for multiagent systems, as well as much research on applying reinforcement learning to UAVs. Pham successfully applied deep reinforcement learning to realize autonomous navigation of UAVs [19]. Wang also studied the autonomous navigation of UAVs in large, unknown, and complex environments based on deep reinforcement learning [20]. Wang applied reinforcement learning to target search and tracking for UAVs [21]. Based on deep reinforcement learning, Yang studied the task-scheduling problem of UAV clusters and addressed the application of reinforcement learning in multi-UAV systems [22]. A survey of the relevant literature shows that applying deep reinforcement learning to multi-UAV systems is feasible, can be used to accomplish complex multi-UAV tasks, and has great research potential.
The main work of this paper takes multiple UAVs cooperatively penetrating a dynamic-tracking interceptor as the scenario. Based on the deep reinforcement-learning DDPG algorithm, we establish an intelligent UAV model and realize cooperative penetration of the dynamic-tracking interceptor by designing reward functions related to coordination and penetration. The simulation results show that the trained multi-UAV system can achieve cooperative attack tasks from different initial locations, which demonstrates the application potential of artificial-intelligence methods such as reinforcement learning for coordinated tasks in UAV clusters.
3. Deep Deterministic Policy-Gradient Algorithm
The DDPG algorithm is a branch of reinforcement learning [5]. The basic process of reinforcement-learning training is that the agent performs an action based on the currently observed state; this action acts on the agent's training environment, which returns a reward and a new state observation. The goal of training is to maximize the final reward. Reinforcement learning does not need any hand-crafted strategies or guidance during training; it only needs the reward function that specifies the reward for the various states of the environment. This is also the only part of training that can be adjusted manually.
Figure 4 shows the basic process of reinforcement learning.
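To make this interaction loop concrete, the following minimal Python sketch shows an agent acting on the current observation, receiving a reward and a new observation from the environment, and accumulating the return. The `env` and `agent` objects and their method names are illustrative assumptions (a Gym-style interface), not the simulation code of this paper.

```python
# Minimal sketch of the reinforcement-learning interaction loop in Figure 4.
# `env` and `agent` are placeholders for a Gym-style environment and a policy.
def run_episode(env, agent):
    state = env.reset()                                  # initial state observation
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)                        # action from the current policy
        next_state, reward, done, _ = env.step(action)   # environment feedback
        agent.observe(state, action, reward, next_state, done)  # store experience
        state = next_state
        total_reward += reward                           # training maximizes this return
    return total_reward
```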
The DDPG algorithm is an actor-critic algorithm that solves the problem of applying reinforcement learning in continuous action spaces. There are two networks in the DDPG algorithm: the state-action value (critic) network, $Q(s,a|\theta^Q)$, with parameters $\theta^Q$, and the policy (actor) network, $\mu(s|\theta^\mu)$, with parameters $\theta^\mu$. At the same time, two concepts are introduced: the target network and experience replay. When the value-function network is updated, the current value function is used to fit the future state-action value. If both state-action values are produced by the same network, training is difficult to stabilize; therefore, the concept of a target network is introduced. The target network provides the future state-action value. It has the same structure as the value-function network being updated, except that it is not updated in real time but is instead updated from the value-function network after the latter has been updated to a certain extent. The policy network adopts the same scheme in DDPG. Experience replay stores the state transition $(s_t, a_t, r_t, s_{t+1})$ in a replay pool every time the agent performs an action that causes the state to change. When the value function is updated, it is not updated directly from the action of the current policy; instead, state transitions are sampled from the replay pool for the update. The advantage of such an update is that training of the networks is more efficient.
Before training, the value-function network, $Q(s,a|\theta^Q)$, and the policy network, $\mu(s|\theta^\mu)$, are first randomly initialized, and then the target networks, $Q'$ and $\mu'$, are initialized from them. It is also necessary to initialize a random action noise, $\mathcal{N}$, which aids the agent's exploration. During training, the agent selects and executes an action, $a_t = \mu(s_t|\theta^\mu) + \mathcal{N}_t$, based on the current policy network and the action noise, and receives a reward, $r_t$, and a new state observation, $s_{t+1}$, from the environment. The state transition $(s_t, a_t, r_t, s_{t+1})$ is stored in the experience replay pool. After that, $N$ state transitions $(s_i, a_i, r_i, s_{i+1})$ are randomly sampled to update the value-function network. The principle of updating the value-function network is to minimize the loss function

$$L = \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i|\theta^Q)\right)^2,$$

where $y_i = r_i + \gamma Q'\!\left(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})\,|\,\theta^{Q'}\right)$ is only related to the target networks.
Assume that the objective function of training is

$$J(\theta^\mu) = \mathbb{E}\left[\sum_{t} \gamma^{t} r_t\right],$$

where $\gamma$ is the discount factor. The policy network is updated according to the gradient of the objective function, whose mathematical expression is

$$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_{i} \nabla_{a} Q(s, a|\theta^Q)\big|_{s=s_i,\,a=\mu(s_i)} \,\nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s=s_i}.$$
After each training step, the target networks are finally updated according to

$$\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'},$$

where $\tau \ll 1$ controls how slowly the target networks track the learned networks.
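As an illustration of the update equations above, the following PyTorch-style sketch performs one DDPG training step: sample $N$ transitions from the replay pool, minimize the critic loss against the target networks, follow the deterministic policy gradient for the actor, and softly update the targets. The network modules, the replay buffer, and the hyperparameter values are assumptions for illustration, not the exact implementation used in this paper.

```python
import torch
import torch.nn.functional as F

def ddpg_update(replay, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005, batch_size=64):
    # Sample N stored transitions (s, a, r, s') from the experience replay pool.
    s, a, r, s_next, done = replay.sample(batch_size)

    # Critic update: minimize L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2, where
    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) uses only the target networks.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_target(s_next, actor_target(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend the policy gradient by minimizing -Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks: theta' <- tau*theta + (1 - tau)*theta'.
    for target, source in ((critic_target, critic), (actor_target, actor)):
        for p_t, p in zip(target.parameters(), source.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```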
To extend DDPG to a multi-UAV system, multiple actors and a single critic exist in the system. During each training step, the value-function network evaluates the policies of all UAVs in the environment, and each UAV updates its own policy network based on this evaluation and independently chooses and executes its action.
Figure 5 shows the structure of the multi-UAV DDPG algorithm.
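A minimal sketch of this structure, assuming simple fully connected networks: each UAV keeps its own actor that maps its local observation to an angular-velocity command, while one shared critic evaluates the joint state and joint action of all UAVs. The layer sizes and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Per-UAV policy: local observation -> normalized angular-velocity command."""
    def __init__(self, obs_dim, act_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, obs):
        return self.net(obs)

class CentralCritic(nn.Module):
    """Shared critic: joint state and joint action of all UAVs -> Q value."""
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(joint_obs_dim + joint_act_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, joint_obs, joint_act):
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))

# Three actors (one per UAV) share a single critic; each actor acts from its own
# observation, while the critic evaluates the joint behaviour during training.
n_uavs, obs_dim = 3, 8
actors = [Actor(obs_dim) for _ in range(n_uavs)]
critic = CentralCritic(joint_obs_dim=n_uavs * obs_dim, joint_act_dim=n_uavs)
```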
The algorithm design also requires constructing the UAVs' action space, state space, reward function, and termination conditions. In this paper, all UAVs are simulated using small-UAV models for the purpose of preliminary verification of the algorithm.
3.1. Action-Space Design
It can be seen from the motion model of the UAV that the action a UAV can perform is to change its angular velocity, so the action space of the multiple UAVs is designed as $A = \{\omega_1, \omega_2, \ldots, \omega_n\}$, the collection of the angular velocities of the $n$ UAVs.
3.2. State-Space Design
To realize the coordinated penetration of the UAVs, the state space should include each UAV's position, $(x_i, y_i)$, and speed, $v_i$, and the central position of the target area, $(x_T, y_T)$. At the same time, it is necessary to introduce an observation of the interceptor position, $(x_I, y_I)$. Therefore, the state space is set to $S = \{x_1, y_1, v_1, \ldots, x_n, y_n, v_n, x_T, y_T, x_I, y_I\}$.
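To illustrate the two designs above, the short sketch below assembles the joint action and state vectors from hypothetical UAV, target, and interceptor records; the field names and example values are assumptions, not the simulation code of this paper.

```python
import numpy as np

def build_action(omegas):
    """Joint action A = {omega_1, ..., omega_n}: one angular velocity per UAV."""
    return np.asarray(omegas, dtype=np.float32)

def build_state(uavs, target_center, interceptor_pos):
    """Joint state S: each UAV's position (x, y) and speed v, followed by the
    target-area center and the interceptor position."""
    parts = []
    for uav in uavs:
        parts.extend([uav["x"], uav["y"], uav["v"]])
    parts.extend(target_center)       # (x_T, y_T)
    parts.extend(interceptor_pos)     # (x_I, y_I)
    return np.asarray(parts, dtype=np.float32)

# Example with two UAVs starting from different locations.
uavs = [{"x": 0.0, "y": 0.0, "v": 30.0}, {"x": 100.0, "y": 0.0, "v": 30.0}]
state = build_state(uavs, target_center=(500.0, 500.0), interceptor_pos=(250.0, 300.0))
action = build_action([0.05, -0.02])
```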
3.3. Termination Conditions Design
The termination conditions are divided into four items, namely, out of bounds, collision, timeout, and successful arrival.
- a) Out of bounds: When the movement of the UAV exceeds the environmental boundary, the mission is regarded as a failure; an end signal and a failure signal are given.
- b) Collision: When the UAV is captured by the interceptor, that is, when the distance between the UAV and the interceptor falls below the capture radius, the mission is regarded as a failure; an end signal and a failure signal are given.
- c) Timeout: When the training time exceeds the maximum exercise time, the mission is regarded as a failure; an end signal and a failure signal are given.
- d) Successful arrival: When the UAV successfully reaches the target area, the mission is successful; an end signal and a success signal are given.
When any UAV in the environment reaches a termination condition, training ends for all UAVs, and a failure or success signal is given according to each UAV's distance from the target point.
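The four termination conditions can be summarized as a short per-UAV check, sketched below; the boundary, capture radius, arrival radius, and time limit are placeholder values, not the settings used in the experiments.

```python
import numpy as np

def check_termination(uav_pos, interceptor_pos, target_pos, t,
                      bound=1000.0, capture_radius=10.0,
                      arrival_radius=20.0, t_max=200.0):
    """Return (done, success) for one UAV; threshold values are illustrative."""
    x, y = uav_pos
    if not (0.0 <= x <= bound and 0.0 <= y <= bound):
        return True, False        # a) out of bounds -> mission failure
    if np.linalg.norm(np.subtract(uav_pos, interceptor_pos)) < capture_radius:
        return True, False        # b) captured by the interceptor -> mission failure
    if t > t_max:
        return True, False        # c) timeout -> mission failure
    if np.linalg.norm(np.subtract(uav_pos, target_pos)) < arrival_radius:
        return True, True         # d) reached the target area -> mission success
    return False, False
```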
3.4. Reward-Function Design
The sparse-reward problem is a common issue when designing the reward function. It affects the training process of the UAVs, prolongs training time, and can even prevent the training goal from being achieved. To better achieve the collaborative task and mitigate sparse rewards, the reward function is divided into four parts: the distance reward, $r_d$, related to the distance between the UAV and the target; the cooperative reward, $r_c$, used to constrain the relative positions of the UAVs; the mission-success reward, $r_s$; and the mission-failure reward, $r_f$. These terms are combined linearly to improve the efficiency of UAV-cluster training. One of the UAVs is taken as an example to introduce the design of the reward function.
Assume $d$ represents the distance between one UAV and the target area and $d_0$ represents the distance between the UAV's initial location and the target area.
The distance reward, $r_d$, is related to the distance from the UAV to the target area: the closer the distance, the greater the reward value. This reward is the key to whether the UAV can reach the target area. Its specific form is shown in (11).
The cooperative reward, $r_c$, is related to the coordination of the UAV cluster. Here, the difference between the farthest and the closest distance from the UAVs to the target area is selected as the coordination parameter; its specific expression is shown in (12), and for a cluster of two UAVs its distribution diagram is shown in Figure 6. In (12), $d_{\max}$ and $d_{\min}$, respectively, represent the maximum and minimum distances between the UAVs and the target area. It can be seen from the mathematical expression and the distribution diagram that the cooperative reward has two main trends: the smaller the maximum distance difference across the UAV cluster, the larger its value, which leads the UAVs towards time coordination; and the closer a UAV is to the target area, the larger its value, which guides the UAV to reach the target area.
When the UAV receives the success signal, that is, it finally reaches the target area, the mission is successful and a success reward, $r_s$, is given. The success reward is also related to the farthest distance between the UAV cluster and the target point: the closer that distance, the greater the success reward, as shown in (13).
When the UAV's mission finally fails, that is, when it is intercepted by the interceptor and fails to reach the target area, a negative failure reward, $r_f$, is given, as shown in (14).
By linearly superimposing the above reward terms, the final reward function, $R$, of the UAV is obtained, as shown in (15).
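Because Equations (11)-(15) are not reproduced in this excerpt, the sketch below only illustrates the structure of the reward described in this section: a distance term that grows as the UAV approaches the target, a cooperative term that shrinks as the spread of the cluster's distances grows, terminal success and failure terms, and their linear superposition. The functional forms, weights, and threshold values are assumptions for illustration and should not be read as the paper's Equations (11)-(15).

```python
import numpy as np

def reward(d, d0, d_all, success, failure,
           w_d=1.0, w_c=1.0, r_success=10.0, r_failure=-10.0):
    """Illustrative reward with the structure of Section 3.4 (assumed forms)."""
    # Distance reward r_d: larger as the UAV gets closer to the target area.
    r_d = w_d * (d0 - d) / d0
    # Cooperative reward r_c: larger when the spread between the farthest and
    # closest UAV (d_max - d_min) is small and when this UAV is near the target.
    d_max, d_min = float(np.max(d_all)), float(np.min(d_all))
    r_c = w_c * (d0 - d) / (d0 * (1.0 + d_max - d_min))
    # Terminal rewards: success grows as the whole cluster arrives closer together;
    # failure is a fixed negative reward.
    r_s = r_success / (1.0 + d_max) if success else 0.0
    r_f = r_failure if failure else 0.0
    # Linear superposition of the four reward terms.
    return r_d + r_c + r_s + r_f
```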
5. Conclusions
Based on the deep reinforcement-learning DDPG algorithm, this paper studies multi-UAV cooperative penetration. Based on the mission-scenario model, the UAVs' action space, state space, termination conditions, and reward functions are designed, and through training, coordinated penetration by multiple aircraft is realized. The main conclusions of this paper are as follows:
The collaborative method based on the DDPG algorithm designed in this paper can achieve coordinated penetration between UAVs. After training, the UAVs can cooperate to evade the interceptor without being intercepted. Compared with traditional algorithms, the penetration performance is stronger and the applicable environments are more complex, so the method has great application prospects.
The reinforcement-learning collaboration method in this paper realizes time coordination between UAVs on the premise that state information is exchanged among the aircraft, and the UAVs finally reach the target area at the same time from different initial locations. However, this paper does not consider factors such as communication delays or failures between UAVs, which would cause a UAV to receive incorrect information. This problem may be addressed by predicting UAV states and by information fusion; further research can be carried out in follow-up work.
This paper only considers motion in the engagement plane. In follow-up work, the motion can be extended to three dimensions; multi-UAV cooperative penetration in a 3D environment will place higher requirements on the algorithm.