The Optimal Strategies of Maneuver Decision in Air Combat of UCAV Based on the Improved TD3 Algorithm

Gao, Xianzhong; Zhang, Yue; Wang, Baolai; Leng, Zhihui; Hou, Zhongxi

doi:10.3390/drones8090501

Open AccessArticle

The Optimal Strategies of Maneuver Decision in Air Combat of UCAV Based on the Improved TD3 Algorithm

by

Xianzhong Gao

^1,*,

Yue Zhang

²,

Baolai Wang

³,

Zhihui Leng

⁴ and

Zhongxi Hou

^1,2

¹

Test Center, National University of Defense Technology, Xi’an 710106, China

²

College of Aerospace Science and Engineering, National University of Defense Technology, Changsha 410073, China

³

College of Computer, National University of Defense Technology, Changsha 410073, China

⁴

Jiangxi Hongdu Aviation Industry Group Co., Ltd., Nanchang 330096, China

^*

Author to whom correspondence should be addressed.

Drones 2024, 8(9), 501; https://doi.org/10.3390/drones8090501

Submission received: 20 August 2024 / Revised: 12 September 2024 / Accepted: 18 September 2024 / Published: 19 September 2024

(This article belongs to the Collection Drones for Security and Defense Applications)

Download

Browse Figures

Versions Notes

Abstract

:

Nowadays, unmanned aerial vehicles (UAVs) pose a significant challenge to air defense systems. Unmanned combat aerial vehicles (UCAVs) have been proven to be an effective method to counter the threat of UAVs in application. Therefore, maneuver decision-making has become the crucial technology to achieve autonomous air combat for UCAVs. In order to solve the problem of maneuver decision-making, an autonomous model of UCAVs based on the deep reinforcement learning method was proposed in this paper. Firstly, the six-degree-of-freedom (DoF) dynamic model was built in three-dimensional space, and the continuous actions of tangential overload, normal overload, and roll angle were selected as the maneuver inputs. Secondly, to improve the convergence speed for the deep reinforcement learning method, the idea of “scenario-transfer training” was introduced into the twin delayed deep deterministic (TD3) policy gradient algorithm, the results showing that the improved algorithm could cut off about 60% of the training time. Thirdly, for the “nose-to-nose turns”, which is one of the classical maneuvers for experienced pilots, the optimal maneuver generated by the proposed method was analyzed. The results showed that the maneuver strategy obtained by the proposed method was highly consistent with that made by experienced fighter pilots. This is also the first time in a public article that compared the maneuver decisions made by the deep reinforcement learning method with experienced fighter pilots. This research can provide some meaningful references to generate autonomous decision-making strategies for UCAVs.

Keywords:

unmanned combat aerial vehicles (UCAVs); maneuver decision-making; autonomous air combat; deep reinforcement learning; scenario-transfer training

1. Introduction

In recent years, the rapid advancement of unmanned aerial vehicle (UAV) technologies has led to the development of UAV swarms that exhibit superior coordination, intelligence, and autonomy compared to traditional multi-UAV systems [1]. As evidenced by global conflicts, the use of UAV swarms for saturation attacks is becoming a predominant trend in UAV operations, posing a significant threat to air defense [2].

To counter these attacks, many researchers have proposed the use of defensive unmanned combat aerial vehicle (UCAV) swarms. The results suggest that UCAV swarms offer a potential solution to the threats posed by UAV swarms [3]. While a UCAV is typically defined as an UAV used for intelligence, surveillance, target acquisition, reconnaissance, and drone strikes, this paper specifically refers to an innovative UAV that combines the advantages of multi-rotor aircraft, such as vertical take-off and landing (VTOL), with fixed-wing aircraft like long endurance and high speed cruise [4].

Currently, UCAVs are predominantly controlled either through pre-programmed codes or in real-time by operators at a ground station. The former method, however, lacks flexibility due to the incomplete nature of battlefield information and the difficulty of achieving desired outcomes under conditions of uncertain information. The latter method requires the consideration of communication stability and delay, both of which significantly impact air combat [5]. Furthermore, as the number of UCAVs increases, so does the cost of organization, coordination, and operator cooperation. Consequently, these current methods are incapable of controlling large-scale swarms of UCAVs. Therefore, autonomous maneuvering decision-making based on real-time situations has become a critical issue for UCAV swarms. This is currently one of the most important research topics in the field of UAV intelligence [6].

Achieving autonomous air combat for UCAV hinges on the successful integration of detection, decision-making, and implementation processes. Since the 1950s, numerous methods have been proposed to develop algorithms capable of autonomously conducting air combat [7]. These methods can be broadly categorized into two types: rule-based and optimization-based. The former employs expert knowledge accumulated through pilot experiences or dynamic programming approaches to formulate maneuver strategies based on varying positional contexts [8,9]. The latter transforms the air-to-air scenario into an optimization problem [10,11] that can be solved using numerical computation techniques such as particle swarm optimization (PSO) [12], minimum time search (MTS) [13], and pigeon-inspired optimization (PIO) [14], among others. However, the decision-making system designed based on expert knowledge exhibits strong subjectivity in situation assessment, target allocation, and air combat maneuver decision-making processes. Consequently, the rule-based method lacks sufficient flexibility to handle complex, dynamic, and uncertain air combat situations. Given that the dynamic model of UCAVs is intricate and typically expressed as a nonlinear differential equation, it requires significant computing resources and time for numerical computation. Furthermore, the optimization-based method faces a dimension disaster when the number of UCAVs on both sides increases significantly [15]. Additionally, accurately predicting an enemy’s intentions, tactics, equipment performance, and other information during combat is generally unattainable. These factors collectively limit the applicability of the aforementioned methods.

More recently, deep reinforcement learning methods have recently been extensively applied to air combat decision-making problems, encompassing both visual range and beyond visual range decisions [16]. On 27 June 2016, the artificial intelligence system, Alpha-Dogfight, triumphed over an experienced American air combat expert who piloted a four-generation aircraft in a simulated combat environment. This victory marked a significant milestone in the application of artificial intelligence to air combat maneuver decision-making. Subsequently, the Defense Advanced Research Projects Agency (DARPA) initiated the Air Combat Evolution (ACE) program, aimed at advancing and fostering trust in air-to-air combat autonomy. The ultimate objective of the ACE program is to conduct a live flight exercise involving full-scale aircraft [7].

Currently, numerous researchers have explored the application of deep reinforcement learning in UCAV air combat maneuver decision-making. For instance, Piao H. et al. introduced a purely end-to-end reinforcement learning approach and proposed a key air combat event reward shaping mechanism. The experimental results demonstrated that multiple valuable air combat tactical behaviors emerged progressively [17]. Kong W. et al. took into account the error of aircraft sensors and proposed a novel autonomous aerial combat maneuver strategy generation algorithm based on the state-adversarial deep deterministic policy gradient algorithm. Additionally, they proposed a reward shaping method based on the maximum entropy inverse reinforcement learning algorithm to enhance the efficiency of the aerial combat strategy generation algorithm [18]. Hu D. et al. proposed a novel dynamic quality replay method based on deep reinforcement learning to guide UCAVs in air combat to learn tactical policies from historical data efficiently, thereby reducing dependence on traditional expert knowledge [16]. Puente-Castro A. et al. developed a reinforcement learning-based system capable of calculating the optimal flight path for a swarm of UCAVs to achieve full coverage of an overflight area for tasks without establishing targets or any prior knowledge other than the given map [19]. This work provides a valuable reference for a swarm of UCAVs to carry out autonomous air combat.

However, previous research on autonomous air combat maneuver decision-making often encounters limitations due to the complexity of the combat environment, simplified assumptions, overreliance on expert knowledge, and low learning efficiency from large-scale exploration spaces. For instance, due to the intricacy of the three-dimensional space dynamic model, most researchers have assumed that UCAVs move in a two-dimensional plane [20,21,22] or in a highly simplified three-dimensional model [16]. This simplification results in the loss of significant details regarding air combat maneuvers. Research on sequential decision-making problems has also revealed that conventional deep learning methods are heavily dependent on expert knowledge and exhibit low learning efficiency when obtaining effective knowledge from large-scale exploration spaces due to the complexity of the air combat environment. Moreover, since the maneuvering of UCAVs is influenced by the search space, the discretion of actions significantly impacts the results. In refs. [23,24], the action space was discretized into seven actions on the horizontal plane, while in ref. [25], the action space was expanded to include thirty-six actions. Finally, in refs. [7,18], a model of continuous action space was established. Currently, selecting an appropriate method for deep reinforcement learning in continuous action space is a hot topic in academic research.

This study advances the field with the following contributions:

A three-dimensional UCAV dynamics model was developed for deep reinforcement learning in continuous action space, with continuous actions of tangential overload, normal overload, and roll angle as maneuver inputs, aligning UCAV maneuvers with real-world conditions.
An improved twin delayed deep deterministic (TD3) training method was proposed, integrating “scenario-transfer training” to enhance training efficiency by 60%, facilitating faster convergence of the deep reinforcement learning algorithm.
The method’s effectiveness was validated through a comparative analysis with the experienced fighter pilots’ strategies across four scenarios, marking the first literature comparison of deep reinforcement learning maneuver decisions against expert pilot maneuvers.

The remainder of this paper is organized as follows. In Section 2, the problems concerned in this work and system modeling based on the Markov decision process (MDP) are stated. Section 3 introduces the training environment for UCAV aerial combat. In Section 4, the improved TD3 algorithm based on the idea of “scenario-transfer training” is depicted in detail. Then, the simulation results regarding the maneuver of “nose-to-nose turn” are presented and discussed in Section 5. Finally, Section 6 summarizes this work and provides some suggestions and proposals for further research.

2. System Modeling Based on Markov Decision Process

2.1. The Dynamic Model of UCAVs

The dynamic model of UCAVs is modeled in the inertial coordinate system, in which the X-axis is pointed to the east, the Y-axis is pointed to the north, and the Z-axis is pointed to the sky, as shown in Figure 1 Since this paper focused on the maneuver decision algorithm of UCAVs, the torque imbalance in state transition was ignored, and our main attention was given to the relative position and velocity between two UCAVs. Therefore, the six-degree-of -freedom (6-DoF) dynamic model of UCAVs was selected to analyze the force characteristics.

During the level flight, the main forces acting on the UCAV are the propulsion, gravity, and aerodynamics, so a body coordinate system should be built. The original point of the body coordinate system is in the mass center of the UCAV, the x_b-axis is pointed to the head of UCAV, which is paralleled with the propulsion and velocity, the y_b-axis is pointed to the down of the UCAV, and the z_b-axis points to the wing of the UCAV according to the right hand rule. Transforming the inertial coordinate system O-XYZ into the body coordinate system O-x_by_bz_b, the simplified dynamic equation [26] of the UCAV can be obtained, as shown in Equation (1):

\begin{array}{l} \frac{d v}{d t} = g (n_{t} - \sin γ) \\ \frac{d γ}{d t} = \frac{g}{v} (n_{f} \cos μ - \cos γ) \\ \frac{d ψ}{d t} = \frac{g n_{f} \sin μ}{v \cos γ} \end{array}

(1)

In Equation (1), g represents the acceleration of gravity, v represents the velocity of UCAV, and the constraints of

v_{\min} \leq v \leq v_{\max}

should be satisfied. The direction of velocity can be represented by two angles: the pitch angle

γ \in [- π / 2, π / 2]

, which indicates the angle between the velocity v and the plane of OXY, the other is the yaw angle

ψ \in (- π, π]

, which indicates the angle between the projection of velocity in the plane of OXY and the X-axis. n_t and n_f are the tangential overload and normal overload, respectively, and μ is the roll angle of the UCAV.

The kinematics equation of the UCAV in the OXYZ coordinate system can be described as follows:

\begin{array}{l} \frac{d x}{d t} = v \cos γ \cos ψ \\ \frac{d y}{d t} = v \cos γ \sin ψ \\ \frac{d z}{d t} = v \sin γ \end{array}

(2)

Through the integral calculation of Equation (2), the flight path of the UCAV represented by x, y, and z in the inertial coordinate system OXYZ can be obtained. Then, the 6-DoF model of the red UCAV can be built by the combination of Equations (1) and (2). It should be noted that the dynamic equations of the blue UCAV are also defined by Equations (1) and (2), and factors such as the aerodynamic forces, moments, and stall characteristics of the aircraft during deceleration are considered.

2.2. State Definition of UCAV Based on MDP

The process of reinforcement learning is always modeled by MDP, which can be represented by a quadruple

(S, A, R, γ_{D})

, in which S is the state space, A is the action space, R is the reward, and γ_D is the discount. The state S of the UCAV in the MDP framework is defined by the dynamic equation constrained by Equations (1) and (2) in Section 2.1. According to the viewpoint of the MDP, the air combat between UCAVs can be treated as follows: the red UCAV (an agent with ability of environmental perception) takes some maneuver actions based on one of policy π under the condition of known current state S, and obtains some kinds of reward from the environment. Supposing the instant reward function that feedbacks from the environment to UCAV is

r_{t} = r_{t} (s_{t}, a_{t})

, then the cumulative rewards of the UCAV in the current state can be defined as

R_{t} = \sum_{i = t}^{T} γ_{D}^{i - t} r (s_{i}, a_{i})

, and

γ_{D} \in [0, 1]

is the discount factor. The greater the discount factor, the more the current reward is impacted by the pass reward. The ultimate goal of reinforcement learning framework based on the Markov decision process model is to allow the UCAV to learn the optimal policy π, then the UCAV can decide on the optimal action according to optional policy π to obtain the maximum reward, as shown in Figure 2.

In the aerial combat depicted in Figure 2, the red and blue aircraft are named as the red UCAV and blue UCAV, respectively. The state can be fully described by the current motion parameters, that is, the position, velocity, and attitude angle, as shown in Equation (3).

S = \{x_{r}, y_{r}, z_{r}, x_{b}, y_{b}, z_{b}, v_{r}, v_{b}, γ_{r}, γ_{b}, ψ_{r}, ψ_{b}, μ_{r}, μ_{b}\}

(3)

However, compared with the state described by the UCAVs, the state described by the relative position, velocity, and attitude of the blue UCAV to red UCAV is more intuitive to understand the situation in aerial combat, as shown in Figure 3. ρ is a vector, which represents the direction of line of sight from the red UCAV to blue UCAV, and the amplitude of ρ is D (i.e.,

‖ρ‖ = D

). α is the angle between the projection of ρ on the plane of the OXY and Y-axis, and β is the angle between ρ and the plane of OXY. p, e are the angles between the vector of v_r, v_b and vector line of sight ρ, respectively, which can be calculated as follows:

\begin{array}{l} p = \arccos \frac{ρ \cdot v_{r}}{‖ρ‖ ‖v_{r}‖} \\ e = \arccos \frac{ρ \cdot v_{b}}{‖ρ‖ ‖v_{b}‖} \end{array}

(4)

Thus, the state in Equation (3) can be rewritten as follows:

S = \{D, α, β, v_{r}, v_{b}, p, e, μ_{r}, μ_{b}\}

(5)

It can be seen from the above equation that the rewritten state not only describes the situation in air combat more intuitively, but also reduces the dimension of the state space.

In order to improve the convergence performance of deep neural networks, the state parameters in Equation (3) should be preprocessed, as shown in Table 1.

2.3. Action Definition and Reward Function

In this paper, three continuous actions were used to control the maneuver of the UCAV:

A_{c} = [n_{t c}, n_{f c}, Δ μ_{c}]

(6)

where A_c is the action set, and n_tc, n_fc, Δμ_c are the commanders of n_t, n_f, and Δμ_c, respectively. The relationship between them can be expressed as follows:

\begin{array}{l} n_{t c} = n_{t \min} + (n_{t} + 1) \frac{n_{t \max} - n_{t \min}}{2} \\ n_{f c} = n_{f \min} + (n_{f} + 1) \frac{n_{f \max} - n_{f \min}}{2} \\ Δ μ_{c} = Δ μ_{\min} + (Δ μ + 1) \frac{Δ μ_{\max} - Δ μ_{\min}}{2} \end{array}

(7)

As shown in Figure 2, n_t and n_f are the tangential overload and normal overload, respectively, and Δμ_c is the command of the differential roll angle of the UCAV. By selecting different combinations of actions, the UCAV can maneuver to the desired state, and each set of action values in Equation (6) corresponds to an action in the maneuver library of the UCAV.

The reward given by environment is based on the agent’s position with respect to its adversary, and its goal is to position the adversary with its weapons engagement zone (WEZ) [7], and most weapons systems provide the UCAV controller with an indication of the radar line of sight (LOS) when they are locked on a target [27]. Thus, as shown in Figure 3, if the distance between the red UCAV and blue UCAV is smaller than the effective shooting distance, and the LOS meets the locking requirements of the red UCAV to blue UCAV, the red UCAV will obtain the reward 1, on the contrary, if the LOS meets the locking requirements of the blue UCAV to red UCAV, the red UCAV will obtain the reward −1. Therefore, the reward function can be designed as follows:

r = \{\begin{matrix} 1 & D \leq D^{*} a n d |p| \leq p^{*} \\ - 1 & D \leq D^{*} a n d |e| \leq e^{*} \\ 0 & e l s e \end{matrix}

(8)

where D^* is the maximum distance to achieve the locking requirements and is proportional to the maximum attack distance of the weapon system on the UCAV platform. p^* is the maximum angle when the locking requirements are achieved between the velocity direction of the red UCAV and LOS angle, and e^* is the blue one. These two angles are related to the maximum attack angle of the weapon system on the UCAV platform; in this paper, D^* = 200 m,

p^{*} = e^{*} = π / 6

.

However, it is obvious that the rewards are sparse, as defined in Equation (8), since both UCAVs need to take many actions to obtain one reward. Rewards are sparse in the real world, and most of today’s reinforcement learning algorithms struggle with such sparsity. One solution to this problem is to allow the agent to create rewards for itself, thus making rewards dense and more suitable for learning [28]. In order to solve this problem, we added the so called “process rewards” to guide the process of reinforce learning for the UCAV. The “process rewards” included four parts: the angle reward r₁, distance reward r₂, height reward r₃, and velocity reward r₄. The calculation equation regarding these four rewards is defined in Equation (9).

\{\begin{matrix} r_{1} = - (p + e) / 2 π \\ r_{2} = - D / D_{\max} \\ r_{3} = - (1 - Δ h / D_{\max}) \\ r_{4} = - (v_{b} - v_{r} + (v_{\max} - v_{\min})) / 2 (v_{\max} - v_{\min}) \end{matrix}

(9)

where Δh is the height difference between the red UCAV and blue UCAV, and v_max and v_min are the maximum and minimum velocity that the UCAVs can achieve in the flight, respectively.

In summary, we obtain the comprehensive reward as shown in the following equation:

R = r + k_{1} r_{1} + k_{2} r_{2} + k_{3} r_{3} + k_{4} r_{4}

(10)

where k₁, k₂, k₃,and k₄ are the weight of the rewards, these values ranging between 0 and 1, and their sum is 1. Currently, the reward weight k_i is determined by the method of trial-and-error, so adaptive reward techniques will be adopted in future research.

During the simulation round in reinforcement learning, the UCAV selects the action according to certain policy π, and obtains the reward R from the environment, where the aim of reinforcement learning is to obtain the maximum rewards. The round ends when the number of steps per round reaches the maximum value, or when one UCAV locks the other to obtain the preset rewards continuously.

3. The Training Environment for Air Combat

3.1. The Training Method Based on TD3 Algorithm

The twin delayed deep deterministic (TD3) policy gradient algorithm is an optimization algorithm based on the deep deterministic policy gradient (DDPG) [29]. In the TD3 algorithm, in order to solve the problem of overestimation regarding the Q-value, two neural networks are introduced, and the relatively small value of the neural network is chosen as the update target [30], as shown in Figure 4.

The TD3 algorithm, as illustrated in Figure 4, employs an actor–critic framework similar to that of the DDPG algorithm. In this architecture, the actor network

μ (s |θ^{μ})

is responsible for determining the action policy, while the critic network

Q (s, a |θ^{Q})

evaluates the policy. It is important to note that these two policies are distinct. To address the issue of temporal correlation in the samples, we utilized an ‘experience replay’ mechanism. This involves storing samples from each time step in an experience pool and subsequently drawing random samples from this pool for training purposes.

To stabilize the training process, we introduced target actor and critic networks that shared the same architecture as the online networks but were updated with a delay. The training objective of the neural network was to maximize the rewards as expressed in Equation (10) through gradient descent, leading to the optimal parameters θ^μ and θ^Q after iterative convergence.

For the actor network

μ (s |θ^{μ})

, at each step i, the input s_i was processed to produce an action a_i. To enhance exploration, we added noise N_i to the output action. The noise was uncorrelated random Gaussian white noise, aimed at increasing the randomness to broaden the exploration space:

a_{i} = μ (s_{i} |θ^{μ}) + N_{i}

(11)

The state transition function (Equations (5) and (6)) uses the noisy action a_i along with the state s_i to interact with the environment. The reward r_i is computed using Equation (10), and the next state s_i+₁ is observed. This transition is stored in the experience pool.

The critic network

Q (s_{i}, a_{i} |θ^{Q})

is updated using a batch of N randomly selected samples from the experience pool. The loss function L for updating the critic network is defined as the mean square error between the network’s estimated value and the target value y_i:

L = \frac{1}{N} {\sum_{i} (y_{i} - Q (s_{i}, a_{i} |θ^{Q}))}^{2}

(12)

where the expected target value y_i is obtained by the current real reward value r_i and the multiplying of the next output value of the critic network and the discount rate y_D, as follows:

y_{i} = r_{i} + γ_{D} Q (s_{i + 1}, μ (s_{i + 1} |θ^{μ}) |θ^{Q})

(13)

The index about the optimization of actor network is denoted as J, and then the parameter of the actor network can be updated by the gradient descent, as shown in the following equation:

\nabla_{θ^{μ}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q (s, a |θ^{Q}) |_{s = s_{i}, a = μ (s_{i})} \nabla_{θ^{μ}} μ (s |θ^{μ}) |_{s = s_{i}}

(14)

The target actor and critic networks are updated with a delay, as illustrated in Figure 4, meaning that their parameters are periodically copied from the online networks after a certain number of time steps.

3.2. The Training Environment Setting

The training environment and deep reinforcement learning framework were built mainly by the module of PyTorch, the neural networks architecture in the algorithm were all-connected, and the activation function was the linear rectification function.

In the training environment, it was supposed that the performance parameters of the red UCAV and blue UCAVs were the same, as shown in Table 2.

During the training process, the red UCAV can take an action according to the deep reinforcement learning algorithm described in Section 3.1. The maneuver strategy of the blue UCAV was designed using the minimax search tree method (MST), which modeled the UCAV to follow one of the predefined candidate maneuvers, as shown in Figure 5 [31].

The candidate maneuvers of the blue UCAV included seven actions: max load factor left turn, max load factor right turn, max long acceleration, steady flight, max long deceleration, max load factor pull up, and max load factor pull over; the values of each action are set in Table 3.

In each training step, the reward is calculated by the states of the red UCAV and blue UCAVs. The training episode ends when the number of steps in one episode reaches the maximum limitation or one agent successfully locks the opponent. The parameters of the neural network and training parameters in deep learning reinforcement are labelled in Table 4.

3.3. The Training Process for Air Combat

If the blue UCAV takes the strategy of the min–max algorithm, then the strategy of the red UCAV can be trained by the deep reinforcement learning algorithm according to the process defined in Section 3.1, and the parameters defined in Table 2, Table 3 and Table 4.

After training 100,000 times, the relationship between episode reward and episode number is shown in Figure 6, where the average reward is calculated by the mean of 200 neighborhood episodes. It can be seen from Figure 6 that the episode reward ended to converge in about 20,000 episodes with the increase in episode number. This shows that the red UCAV gradually obtains a stable maneuver strategy through continuous attempts and explorations during the training process. Thus, the trained maneuver strategy allows the red UCAV to always obtain higher scores than the blue UCAV in the process of air combat.

After the actor–critic network converges, the red UCAV makes maneuver decisions according to the network trained by the TD3 reinforcement learning algorithm. The blue UCAV makes maneuver decisions based on the min–max algorithm. The flight trajectories of the UCAVs are generated during aerial combat. The input data are the position and attitude information of the aircrafts. The dynamic and kinematic equations are Equation (1) and Equation (2), respectively. In order to show the training results in a more intuitive way, the flight trajectories of the red and the blue UCAVs during aerial combat when the algorithm converged are shown in Figure 7. The red UCAV could make the correct decision after training, and could obtain position dominance after the air dogfight. A detailed analysis of the optimal strategies generated by the proposed algorithm are depicted in the next section.

To verify the adaptability of the maneuver strategy trained in this paper, we carried out Monte Carlo simulation experiments 1000 times under random initial conditions. The simulation results show that the winning rate of the red UCAV was more than 90%, which indicates that the trained maneuver strategy has widely adaptability, thus, can be used in the analysis of optimal strategies in different situations for UCAV aerial combat.

4. The Improved TD3 Algorithm Based on Scenario-Transfer Training

One of the key ideas about reinforcement learning is to accumulate experience through continuous trial and error. However, it is often not effective to directly try to solve particularly complex problems by the reinforcement learning method without a certain technique [32,33,34]. As shown in Figure 6 in Section 3, the reward value needs to be trained to steady after about 20,000 steps, and the convergence speed of the algorithm is very slow. Through the analysis of the training process, it was found that the reason why it was difficult to find the optimal strategy for the red UCAV was that the strategy of the blue UCAV controlled by the min–max strategy was too strong at the beginning of the training. Therefore, it was unable to effectively accumulate rewards for the red UCAV, and a lot of ineffective explorations were conducted with little sense such as away from the battlefield, landing, flying away, and so on. In other words, it is hard to obtain effective rewards when learning from scratch in a complex scenario. Inadequate reward accumulation leads to the slow convergence of learning and poor results.

In order to solve this problem, we proposed a scenario-transfer training (STT) method to train the strategy of the red UCAV based on the TD3 algorithm. STT was first proposed to solve the problem in that it is hard to train a good model directly [21,22]. The principle of STT is shown in Figure 8. The training for a task can be achieved by training in several simple but similar transition scenarios, in which the first one is called the source scenario and the last is called the target scenario. A reinforcement learning model can first be trained in the source scenario, then in several transition scenarios, and finally in the target scenario. The experience trained in each scenario is memorized in the model, which is used as the base model for subsequent training.

In order to validate the effectiveness of the improved algorithm based on the STT method, referring to the work of Zhang et al. [21], we conducted comprehensive experiments to illustrate the effectiveness of STT with the TD3 algorithm as the baseline in this paper.

Three training scenarios corresponding to three kinds of maneuvering strategies of the blue UCAV were designed from simple to complex, respectively. In the first case, the initial state of the blue UCAV was chosen randomly, and then kept to conduct the linear motion; this was called the simple scenario. In the second case, the initial state of the blue UCAV was also chosen randomly, but its action was randomly selected from the maneuver action library at each decision moment; this case was named the transitional scenario. In the third case, although the initial state of the blue UCAV was also chosen randomly, its action was controlled by the min–max strategy at each decision moment; this case was named the complex scenario. The step numbers that decide the scenario transfer were defined as N_s, N_t, and N_c, respectively. The training process of the TD3 algorithm improved by STT is described in Algorithm 1. Suppose that there are N red UCAVs and M blue UCAVs in the combat task. We first initialized the parameters of all models randomly, and then the parameters of the models were updated through the training.

Algorithm 1. The process of the training method for air combat based on the TD3 algorithm improved by STT.
Step 1:	Initializing parameters of online actor network and two online critic networks
Step 2:	Initializing parameters of target actor network and two target critic networks
Step 3:	Initializing the experience pool
Step 4:	for episode = i: if i < maximum episode: if i < N_s episode: ● The action a_i is chosen from the simple scenario elif i < N_t and i > N_s: ● The action a_i is chosen from the transitional scenario elif i < N_c and i > N_t: ● The action a_i is chosen from the complex scenario endif ● Getting the reward r_i and s_i+₁ by putting s_i and a_i into Equations (6) and (10) ● The quadruple $〈s_{i}, a_{i}, r_{i}, γ〉$ is stored into the experience pool ● Selecting N samples from experience pool randomly ● Computing expected reward of action a_i by evaluating the next output value of critic network, and calculating y_i by Equation (13) ● Updating the parameters of online critic network by Equation (12) ● Updating the parameters of online actor network by Equation (14) ● Updating the parameters of target actor and critic networks by that of online actor and critic networks after a delay time end if end for
Step 5:	End

The comparative analysis regarding the simulation was carried out in Python (the source code will be uploaded in GitHub when the paper is published), and the total number of training steps was set as 30,000. The method of the TD3 algorithm in Section 3.1 was used for the baseline, and the red UCAV was trained by this method from beginning to end. The method of the improved TD3 algorithm in Algorithm 1 was used as the comparison, and the red UCAV was trained in the simple, transition, and complex scenarios, and the values of N_s, N_t, N_c were set as 10,000, 20,000, and 30,000, respectively. The comparison of the average reward in different training scenarios is shown in Figure 9.

It can be seen from Figure 9 that the average reward of the designed training method in Algorithm 1 based on the STT method was higher than that of the direct training based on the TD3 algorithm. The designed training method in Algorithm 1 started to converge rapidly after 20,000 steps, and the final average reward was higher than that of the direct training. At the same time, it could be intuitively observed that the red UCAV did not learn the experience from zero after the scenario transfer. The experience in the previous scenario was inherited, as indicated in the arrow in Figure 9, and the average reward in each scenario improved rapidly compared with the method of the TD3 algorithm in Section 3.1.

In addition, since the complexity of the algorithm is strongly related to the iteration as well as the storage space and computational time, the faster the algorithm converges, the less computational time is required. The consumed times for the training of the two methods were also compared. The results showed that the training time of the method improved by STT was less, about 60% than that of the traditional TD3 algorithm, as shown in Figure 10. The main reason for this phenomenon is that the min–max strategy of the blue UCAV consumed too much computing resources, while the training scenarios such as the simple scenario and transition scenario consumed little computing resources and could provide good training results.

As shown in the arrows in Figure 9, the training scene transition was accompanied by a sharp change in the reward values. The average reward decreased sharply during the transition. Although the reward value recovered quickly after the transition, this discontinuity may have affected the stability of training.

To mitigate this problem, we proposed a ‘soft-update’ scenario optimization technique in this paper. In this method, the strategy of the blue UCAV in each scenario is probabilistically selected from a predefined strategy library instead of following the fixed strategy shown in Algorithm 1. In this way, the probability distribution for selecting the strategy is not constant; it changes during the training process and gradually becomes more inclined to complex scenarios as the training proceeds.

In the training phase including the first N_s steps, the probabilities of choosing the first and second strategies in simple scenes are represented as P_a and 1 − P_a, respectively. In the transition phase, spanning from N_s to N_t steps, the probabilities of selecting the second and third strategies are denoted as P_b and 1 − P_b, respectively. Following the N_t step, in the complex scene phase, the third strategy is consistently chosen until the N_c step. The probability model for strategy selection in each scenario is articulated as follows:

\begin{array}{l} P_{a} = 1 - \frac{s}{N_{s}} \\ P_{b} = 1 - \frac{s - N_{s}}{N_{t} - N_{s}} \end{array}

(15)

where s is the training step.

The algorithm’s efficacy can be enhanced in accordance with the definition delineated in Equation (15). Figure 11 juxtaposes the training outcomes of various methodologies.

While the convergence criterion of the reward system may be altered by this strategy, employing distinct convergence rules at various stages of the learning process can facilitate a seamless transition and meet the convergence criteria. The training method, enhanced by scenario-transfer training and the soft update defined in Equation (15), outperformed other methods in terms of both the training speed and average reward. Furthermore, it is evident from Table 5 that the convergence speed of the improved method surpassed that of the other methods. These results validate the effectiveness of the training method proposed in this paper.

5. Analysis of Optimal Maneuvering Strategy

The art of aerial combat is a delicate dance of strategy and execution, where basic fighter maneuvers (BFMs) form the bedrock of tactical mastery [27]. This concept extends to the domain of UCAVs, where BFMs are categorized into two types: primary maneuvers, which are executed independently of an adversary such as accelerations and climbs; and relative maneuvers, which are performed with respect to an opposing aircraft. This academic endeavor is dedicated to unraveling the optimal strategy for maneuver decision-making within the context of relative maneuvers.

The core objective of this research was to delve into the subtleties of relative maneuvers, with a particular emphasis on studying the “red” UCAV’s optimal strategies for engaging with the “blue” UCAV across various UCAV air combat scenarios. We were particularly intrigued by the quantification of the changes in reward, state, and control quantities for both the red and blue UCAVs under optimal maneuvering conditions.

A hallmark of air combat maneuvers is the finite number of possible initial conditions, with the nose-to-nose turn serving as a quintessential tactical maneuver. The pivotal question that arises is one that even seasoned fighter pilots grapple with: “In a nose-to-nose turn, should one execute a dive loop or a climbing loop to ensure the most advantageous firing angle?” This study sought to unravel this enigma by analyzing the state of nose-to-nose turns during aerial engagements.

Subsequent content will delve into four critical scenarios (equal advantage, height advantage, height disadvantage, and lateral difference) to analyze the optimal maneuvers determined by machine learning algorithms for two UCAVs with identical flight performance meeting head-on. These four scenarios will serve as foundational blocks for more complex scenarios, enabling a comprehensive understanding of the decision-making process of the algorithm from multiple perspectives.

Following this, we will juxtapose these maneuvers with those recommended by fighter pilots, assessing the similarity of flight trajectories and the reasonableness of the machine learning-derived maneuvers. The ultimate goal is to evaluate their practical applicability in real-world scenarios. This paper introduces a novel methodology for studying UCAV air combat theory, which holds the potential to underpin the rapid generation of air combat decision-making.

5.1. Introduction to “Nose-to-Nose Turn”

The most classic tactical maneuver in one-versus-one maneuvering is the nose-to-nose turn, in which the red and blue UCAVs are roughly evenly matched, with the same energy status and opposite speed directions approaching each other. For such a state, the experience of a manned fighter pilot is to make a descending turn after spotting the opposing fighter and before entering the weapon firing range, converting altitude to angular advantage rather than wasting extra speed. Figure 12 depicts how this tactic is used. At time “T₁”, the two fighters meet head-on at nearly the same altitude and speed. During the battle, the fighter’s energy is always valuable and challenging to obtain, so both pilots should try to gain as much energy as possible at this moment. The red UCAV should climb at the maximum rate of energy gain and run the engines at full power to gain energy as soon as possible. As the red UCAV pulls up to corner point speed, it changes to level flight and makes a sharp turn to the right at time “T₂” to gain additional flight-path separation in the lateral direction. As the target approaches, the red fighter reverses its direction and begins an offensive lead turn facing the opponent, forcing the opponent to fly past the attacker at time “T₃” to eliminate the flight-path separation and achieve a defense against this attack. Accordingly, the red UCAV changes turn direction at time “T₃” and cuts to the inside of the target aircraft that is making a horizontal turn. At time “T₄”, it changes turn direction and leads the turn, converting the resulting flight-path separation into an angular advantage. This tactic by the red UCAV will prompt the opponent to respond with a downturn so that the fighter can continue to turn low without an excessive loss of altitude relative to the target.

5.2. Initial Conditions with Equal Advantages

Based on the introduction in Section 5.1, we first simulated the nose-to-nose turn under ideal equilibrium (i.e., assuming that the UCAVs are of the same height and fly along a straight line relative to each other) and observed the coping strategy based on the method proposed in this paper. The initial states of the red and blue UCAVs are shown in Table 5.

Figure 13 shows the optimal maneuvering strategy of the nose-to-nose turn obtained by the red UCAV based on the TD3 algorithm in the case of Table 5, where Figure 13a is the optimal maneuver trajectory diagram of the nose-to-nose turn based on the TD3 algorithm, Figure 13b is the corresponding reward curve of the red and blue UCAVs, Figure 13c is the speed change curve, Figure 13d is the attitude angle change curve, and Figure 13e is the corresponding control command value curve.

The confrontation between the red and blue UCAVs commenced at “T₁”, as depicted in Figure 13a, with both UCAVs approaching head-on. As the distance between them diminished, the reward values for both parties decreased continuously, indicative of a shift from a non-threatening to a critical state, as modeled in Figure 13b. As illustrated in Figure 13c, each UCAV adopted a distinct strategy in response to the emerging threat. The red UCAV opted for deceleration, while the blue UCAV accelerated. Subsequently, at “T₂” and “T₃”, the blue UCAV increased its speed to execute a left turn, aiming to evade the red UCAV via velocity. Conversely, the red UCAV initiated a left turn to establish a posture conducive to executing a horizontal turn before turning right. This maneuvering created an appropriate arc along the trajectory, thereby increasing the lateral separation from the blue UCAV. By “T₃”, the red UCAV had accomplished its posture realignment, at which point a discrepancy in the reward functions of the red and blue UCAVs became apparent. During “T₃–T₅”, the red UCAV commenced acceleration to assume a pursuit stance, thereby narrowing the gap with the blue UCAV. As observed in Figure 13b, the cumulative reward values of the red and blue UCAVs exhibited a diverging trend. By “T₅–T₆”, the red UCAV had achieved a stable angular superiority over the blue UCAV. As the distance closed, the red UCAV secured a stable firing range and angle after regulating its speed. The discrepancy in reward values began to show a directional shift, indicating the red UCAV’s dominant position. Post “T₆”, the red UCAV was in the phase of selecting the optimal moment to initiate an attack on the blue UCAV. The differing strategies of the red and blue UCAVs were further elucidated by the variations in the attitude angle and control command depicted in Figure 13d,e. At “T₁”, the red UCAV initiated a roll angle command to adjust its attitude first, while the blue UCAV triggered a tangential acceleration command to surge toward the red UCAV before attempting an evasive left turn.

A comparison between Figure 12 and Figure 13a revealed a high degree of similarity between the optimal maneuver of a nose-to-nose turn derived from the enhanced TD3 algorithm and the optimal maneuver mode informed by combat experience. Both approaches advocated for the blue UCAV to enter an accelerated left turn state in the initial phase, while the red UCAV should initially adjust its attitude, execute a horizontal turn with a small turning radius, extend the lateral distance to gain an angular advantage, and subsequently reduce the distance to the blue UCAV via acceleration. This sequence could continuously augment the angular advantage against the blue UCAV, culminating in a stable firing advantage. This comparison also underscores that the established UCAV chase model adeptly reflected the “dogfight” dynamics in aerial combat and yielded the optimal maneuver for the red UCAV against the blue UCAV. The analysis presented herein addresses the query posed at the outset of this section: “In a head-on turn, should one execute a dive loop or a climbing loop to secure the most advantageous firing angle?” The results provided by the improved TD3 algorithm suggest that, under optimal equilibrium, the optimal maneuver is neither a dive loop nor a climbing loop, but rather a “horizontal loop.” In this scenario, the red UCAV can achieve attitude adjustment with minimal energy expenditure and secure an angular advantage.

5.3. Initial Conditions with Vertical Differences

Assuming that the red and blue UCAVs have different heights and fly along a straight line, we observed the red UCAV’s coping strategy based on the improved TD3 algorithm. We divided the cases with longitudinal differences into two types. The first as that the red side was higher than the blue side, so it had an energy advantage, which was defined as case 1. The other is that the red side was lower than the blue side. In this case, the blue side had an energy advantage, which was defined as case 2. The initial states of the red and blue UCAVs are shown in Table 6.

5.3.1. Initial Conditions with Vertical Advantages

Figure 14 shows the optimal coping maneuver strategy of the nose-to-nose turn obtained by the red UCAV based on the improved TD3 algorithm under case 1 in Table 6, where Figure 14a is the optimal maneuver trajectory diagram of the nose-to-nose turn, Figure 14b is the corresponding reward curve of the red and blue UCAVs, Figure 14c is the speed change curve, Figure 14d is the attitude angle change curve, and Figure 14e is the corresponding control command value change curve.

It can be seen from Figure 14a that the red and blue UCAVs began to fly nose-to-nose at time “T₁”. As the distance became closer, the reward values of both sides decreased, as shown in Figure 14b. This trend was highly similar to that in Figure 13b. Figure 14c shows the speed change trend where the red UCAV decelerated first while the blue UCAV accelerated. Unlike the situation in Figure 13, at times “T₂” and “T₄”, the blue UCAV accelerated to turn right and climbed upward. In contrast, the red UCAV first dived to shorten the distance with the blue UCAV and then adjusted the relative posture with the blue UCAV through the dive loop. At time “T₅”, the red side completed its posture adjustment, and the gap between the reward values of the red and blue sides expanded. At time “T₅–T₆”, the red side formed a stable angle advantage over the blue side. As the distance approached, the red side formed a stable shooting distance and angle against the blue side after adjusting its speed, thus the red side has occupied a dominant advantage. After “T₆”, the red side will choose the opportunity to fire and shoot the blue side. Comparing Figure 12, Figure 13a, and Figure 14a, it can also be seen that when the red side has a height advantage, the optimal maneuver is a dive loop.

5.3.2. Initial Conditions with Vertical Disadvantage

Figure 15 shows the optimal coping maneuver strategy of a nose-to-nose turn obtained by the red UCAV based on the improved TD3 algorithm under case 2 in Table 6, where Figure 15a is the optimal maneuver trajectory diagram of the nose-to-nose turn, Figure 15b is the corresponding reward curve of the red and blue UCAVs, Figure 15c is the speed change curve, Figure 15d is the attitude angle change curve, and Figure 15e is the corresponding control command value change curve.

As illustrated in Figure 15a, the blue UCAV initially possessed a substantial altitude advantage. This advantage granted it more flexibility to transition between potential and kinetic energy, thereby increasing its likelihood of achieving higher reward values at the outset. However, our algorithm inhibited the blue UCAV from fully capitalizing on this advantage. As the distance between the two aircraft diminished, the blue UCAV opted to accelerate into a rightward turn, while the red UCAV executed a subtle climbing cycle by turning right to modify its posture. Concurrently, as depicted in Figure 15b, it is important to note that the reward value for both UCAV consistently decreased. This trend markedly deviated from the patterns observed in Figure 13b and Figure 14b. Following the initial engagement, the reward value of the blue drone surpassed that of the red drone, primarily due to its altitude advantage. After the red drone completed its posture adjustment, the reward value of the blue UCAV experienced a rapid decline, and at time “T₄”, the reward values of the two aircraft began to invert. As shown in Figure 15c, during the period “T₄–T₆”, the red drone persistently accelerated from V_min to V_max. This action narrowed the gap with the blue drone while maintaining an advantage in pursuit. By the time “T₇”, the red drone had successfully intercepted the blue drone and established a stable advantage.

Upon comparing Figure 12, and Figure 13a, Figure 14a and Figure 15a, it becomes evident that when the red side is at a height disadvantage, the most effective maneuvering strategies are climbing and looping. Despite the cumulative reward value of the red side not being high throughout the process, the initial height advantage held by the blue side was successfully negated through optimal maneuvering.

5.4. Initial Conditions with Lateral Differences

Finally, we analyzed the nose-to-nose turn with a lateral difference. Assuming that the red and blue UCAVs had the same height, flew along a relatively straight line, and had a distance difference in the lateral direction, we observed the red UCAV’s coping strategy based on the machine learning algorithm. Because the lateral difference was symmetrical, we assumed that the Y coordinate of the red side was positive and that of the blue side was negative. The lateral distance difference between the two sides was 400 m. The initial states of the red and blue UCAVs are shown in Table 7.

Figure 16 shows the optimal coping maneuver strategy of a nose-to-nose turn obtained by the red UCAV based on the TD3 algorithm in the case of Table 7, where Figure 16a is the optimal maneuver trajectory diagram of nose-to-nose turn based on the improved TD3 algorithm, Figure 16b is the corresponding reward curve of the red and blue UCAVs, Figure 16c is the speed change curve, Figure 16d is the attitude angle change curve, and Figure 16e is the corresponding control command value change curve.

Figure 16a illustrates a scenario involving a lateral difference, where the red and blue UCAVs commenced a head-on approach at “T₁”. As the distance between them continued to close, the blue UCAV opted to execute a right turn for a climb, and concurrently, the red UCAV adjusted its flight attitude by also turning right to climb, mirroring the situation depicted in Figure 13a. Between “T₄” and “T₅”, following the completion of attitude adjustment by the red UCAV, a stable angular advantage was established, and the discrepancy in reward values between the red and blue sides started to widen. Due to the lateral separation under the initial conditions, the relative distance between the red UCAV and the blue UCAV became relatively close after the right turn. Consequently, the red UCAV could maintain an appropriate firing distance from the blue UCAV while sustaining an angular advantage without the need for excessive acceleration. As depicted in Figure 16c, in contrast to Figure 13c, Figure 14c and Figure 15c, this is the first instance where the red UCAV initiated deceleration and speed adjustment without reaching V_max. This strategic maneuver allowed the red UCAV to lock onto the blue UCAV as early as “T₅” and establish a stable advantage.

Furthermore, Figure 16a demonstrates that when the red side encountered a lateral difference, the optimal tactic was to follow the trajectory of the blue side, complete the posture adjustment with a subtle right climbing loop, and subsequently secure a lock on the blue UCAV by regulating its speed.

5.5. Summary

The present study systematically analyzed the optimal strategies for the red UCAV to respond to the blue UCAV in various tactical scenarios during a nose-to-nose turn. By comparing the results from Figure 12, Figure 13a, Figure 14a, Figure 15a, and Figure 16a, we determined that the blue UCAV’s coping strategies were consistent with the classic nose-to-nose turn tactics, which involved accelerating the turn to evade the opponent. This consistency served as a crucial starting point for our further analysis.

Our research has also provided a sound answer to the pivotal question raised in Section 5: “Should we perform a dive loop or a climbing loop to secure the most favorable firing angle during the nose-to-nose turn?” This question, which has historically been challenging for professional fighter pilots to resolve quickly, has now been addressed through the analysis of the strategies derived from our enhanced TD3 algorithm.

The strategies employed by the red UCAV were found to be closely related to the initial conditions of the engagement. The performance of our method varied across different scenarios, as evidenced by the distinct flight trajectories and reward value curves. The method’s sensitivity to changes in these conditions underscores its capability to accurately identify various tactical situations. Additionally, the trained agent’s proficiency in quickly recognizing and addressing its limitations through effective maneuvers was demonstrated.

In Section 5.2, the experimental results indicate that when the red and blue sides were in an ideal balance, the optimal coping strategy for the red UCAV was a horizontal loop. This is a strategic choice that is not typically considered in ideal scenarios by pilots. In Section 5.3, when the red side held a height advantage, the best strategy was to execute a dive loop, leveraging the height advantage to quickly adjust the posture as the opponent approached. Conversely, when the red side was at a height disadvantage, as outlined in Section 5.3, the best strategy was to follow the blue UCAV with a climbing loop. In Section 5.4, when there was a lateral difference between the red and blue sides, the red side had extra space to adjust the angle compared to the balanced state. Therefore, a significant turning radius climbing loop was used to modify its attitude.

When examining these scenarios collectively (Figure 13c, Figure 14c, Figure 15c, and Figure 16c), it became evident that the coping strategies of the red UCAV, as trained by our improved TD3 algorithm, maintained a consistent decision-making logic across different scenarios: first, adopting deceleration to adjust the attitude, and then, after obtaining the angular advantage, adjusting the distance with the blue UCAV through acceleration. This approach aligns with the experience of fighter pilots that “it is easier to change the angle than the speed”, which is a fundamental principle in the fighter community. This principle can be applied as a strategic framework for air combat and offers a valuable reference for the pilots’ air combat training.

6. Conclusions

This study has made significant advancements in the field of UCAV maneuver decision-making by leveraging deep reinforcement learning. The key contributions of this paper are as follows:

Firstly, we addressed the issue of slow convergence in the TD3 algorithm by introducing a novel “scenario-transfer training” approach. This method successfully accelerated the training process, reducing the convergence time by approximately 60%. This enhancement is crucial for practical applications where time efficiency is paramount.

Secondly, we conducted an in-depth analysis of the improved TD3 algorithm in the context of 1v1 air combat scenarios. Our analysis extended to the strategic decision-making process during the “nose-to-nose turn”, a critical maneuver in aerial combat. We also evaluated the optimal strategies for executing dive loops or climb loops to achieve the most advantageous firing angle. The strategies derived from the improved TD3 algorithm indicate a strong correlation with the initial state of the red UCAV. Importantly, the machine learning-generated optimal maneuvers aligned closely with the reference actions taken by experienced fighter pilots, affirming the reasonableness and practicality of the maneuvers produced by our algorithm.

Overall, this research not only provides a novel theoretical framework for UCAV air combat, but also offers substantial technical support for the rapid and effective generation of UCAV air combat decisions.

Future Work

While the current study has made substantial progress, it is not without its limitations, which also point to directions for future research.

Limitations:

The current model is primarily focused on 1v1 air combat scenarios. The complexity of multi-UCAV engagements and the decision-making under dynamic and uncertain environments have not been fully addressed.
The generalizability of the trained model may be limited by the scope of the training data and the diversity of the simulated combat scenarios.
The model overlooks some interference factors that may arise in real-world situations such as the duration of weapon effects and disturbances from external factors like electromagnetic interference.

Future Work:

To broaden the application of our method, future research should extend to multi-UCAV cooperative combat, considering the intricacies of decision-making and coordination among multiple UCAVs.
Enhancing the model’s generalizability through transfer learning and domain adaptation techniques could enable it to handle a wider range of combat scenarios.
Efforts to optimize the computational efficiency of the algorithm are necessary to ensure its viability for real-time combat decision-making.
Investigating the integration of the deep reinforcement learning approach with other decision-making techniques, such as rule-based and optimization-based methods, could lead to a more robust and adaptable UCAV maneuver decision-making system.
Addressing the challenges of hardware deployment is crucial. This includes advancing the integration of the algorithm into physical systems, refining experimental models, and establishing robust test environments, with the aim of validating the algorithm’s performance in real-world settings.

In conclusion, this study has laid the groundwork for further advancements in UCAV air combat theory and decision-making. The proposed method holds promise for transforming the way UCAVs engage in aerial combat, and continued research in this area will be essential for the development of future UCAV technologies and air combat strategies.

Author Contributions

X.G.: Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft preparation, writing—review and editing, visualization, supervision, project administration. Y.Z.: Conceptualization, methodology, software, validation, investigation, data curation, writing—original draft preparation, writing—review and editing, visualization. B.W.: Conceptualization, methodology, software, investigation, data curation, writing—original draft preparation, writing—review and editing, visualization. Z.L.: Software, writing—review and editing. Z.H.: Validation, resources, writing—review and editing, supervision, project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

DURC Statement

Current research is limited to the field of autonomous decision-making for unmanned combat aerial vehicles, which is beneficial for enhancing air defense capabilities and does not pose a threat to public health or national security. The authors acknowledge the dual-use potential of the research involving deep reinforcement learning and confirm that all necessary precautions have been taken to prevent potential misuse. As an ethical responsibility, we strictly adhere to relevant national and international laws regarding dual-use research of concern (DURC). We advocate for responsible deployment, ethical considerations, regulatory compliance, and transparent reporting to mitigate misuse risks and foster beneficial outcomes.

Conflicts of Interest

Z.L. declares no conflicts of interest related to this study. He has not received any research grants from any company nor has he received any speaker honoraria associated with this study. Furthermore, he does not hold stocks in any company connected to this research and has not served as a consultant or expert witness for any such entity. Finally, he is not the inventor of any patents related to this research. The other authors declare no conflicts of interest.

References

Zhou, W.; Li, J.; Liu, Z.; Shen, L. Improving multi-target cooperative tracking guidance for UAV swarms using multi-agent reinforcement learning. Chin. J. Aeronaut. 2022, 35, 100–112. [Google Scholar] [CrossRef]
Castrillo, V.U.; Manco, A.; Pascarella, D.; Gigante, G. A Review of Counter-UAS Technologies for Cooperative Defensive Teams of Drones. Drones 2022, 6, 65. [Google Scholar] [CrossRef]
Shahid, S.; Zhen, Z.; Javaid, U.; Wen, L. Offense-Defense Distributed Decision Making for Swarm vs. Swarm Confrontation While Attacking the Aircraft Carriers. Drones 2022, 6, 271. [Google Scholar] [CrossRef]
Jordan, J. The future of unmanned combat aerial vehicles: An analysis using the Three Horizons framework. Futures 2021, 134, 102848. [Google Scholar] [CrossRef]
Li, H.; Sun, Q.; Wang, M.; Liu, C.; Xie, Y.; Zhang, Y. A Baseline-Resilience Assessment Method for UAV Swarms under Heterogeneous Communication Networks. IEEE Syst. J. 2022, 16, 6107–6118. [Google Scholar] [CrossRef]
Sun, Z.; Piao, H.; Yang, Z.; Zhao, Y.; Zhan, G.; Zhou, D.; Meng, G.; Chen, H.; Chen, X.; Qu, B.; et al. Multi-agent hierarchical policy gradient for Air Combat Tactics emergence via self-play. Eng. Appl. Artif. Intell. 2021, 98, 104112. [Google Scholar] [CrossRef]
Pope, A.P.; Ide, J.S.; Micovic, D.; Diaz, H.; Rosenbluth, D.; Ritholtz, L.; Twedt, J.C.; Walker, T.T.; Alcedo, K.; Javorsek, D. Hierarchical Reinforcement Learning for Air-to-Air Combat. In Proceedings of the 2021 International Conference on Unmanned Aircraft Systems (ICUAS), Athens, Greece, 15–18 June 2021; pp. 275–284. [Google Scholar]
Wu, A.; Yang, R.; Liang, X.; Zhang, J.; Qi, D.; Wang, N. Visual Range Maneuver Decision of Unmanned Combat Aerial Vehicle Based on Fuzzy Reasoning. Int. J. Fuzzy Syst. 2022, 24, 519–536. [Google Scholar] [CrossRef]
Mohebbi, A.; Achiche, S.; Baron, L. Integrated and concurrent detailed design of a mechatronic quadrotor system using a fuzzy-based particle swarm optimization. Eng. Appl. Artif. Intell. 2019, 82, 192–206. [Google Scholar] [CrossRef]
Bianchi, D.; Borri, A.; Cappuzzo, F.; Di Gennaro, S. Quadrotor Trajectory Control Based on Energy-Optimal Reference Generator. Drones 2024, 8, 29. [Google Scholar] [CrossRef]
Michel, N.; Kong, Z.; Lin, X. Energy-Efficient UAV Trajectory Generation Based on System-Level Modeling of Multi-Physical Dynamics. In Proceedings of the 2022 American Control Conference (ACC), Atlanta, GA, USA, 8–10 June 2022; pp. 4119–4126. [Google Scholar]
Phung, M.D.; Ha, Q.P. Safety-enhanced UAV path planning with spherical vector-based particle swarm optimization. Appl. Soft Comput. 2021, 107, 107376. [Google Scholar] [CrossRef]
Pérez-Carabaza, S.; Besada-Portas, E.; López-Orozco, J.A. Minimizing the searching time of multiple targets in uncertain environments with multiple UAVs. Appl. Soft Comput. 2024, 155, 111471. [Google Scholar] [CrossRef]
Duan, H.; Zhao, J.; Deng, Y.; Shi, Y.; Ding, X. Dynamic Discrete Pigeon-Inspired Optimization for Multi-UAV Cooperative Search-Attack Mission Planning. IEEE Trans. Aerosp. Electron. Syst. 2021, 57, 706–720. [Google Scholar] [CrossRef]
Schwarzrock, J.; Zacarias, I.; Bazzan, A.L.C.; de Araujo Fernandes, R.Q.; Moreira, L.H.; de Freitas, E.P. Solving task allocation problem in multi Unmanned Aerial Vehicles systems using Swarm intelligence. Eng. Appl. Artif. Intell. 2018, 72, 10–20. [Google Scholar] [CrossRef]
Hu, D.; Yang, R.; Zhang, Y.; Yue, L.; Yan, M.; Zuo, J.; Zhao, X. Aerial combat maneuvering policy learning based on confrontation demonstrations and dynamic quality replay. Eng. Appl. Artif. Intell. 2022, 111, 104767. [Google Scholar] [CrossRef]
Piao, H.; Sun, Z.; Meng, G.; Chen, H.; Qu, B.; Lang, K.; Sun, Y.; Yang, S.; Peng, X. Beyond-Visual-Range Air Combat Tactics Auto-Generation by Reinforcement Learning. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
Kong, W.; Zhou, D.; Yang, Z.; Zhao, Y.; Zhang, K. UAV Autonomous Aerial Combat Maneuver Strategy Generation with Observation Error Based on State-Adversarial Deep Deterministic Policy Gradient and Inverse Reinforcement Learning. Electronics 2020, 9, 1121. [Google Scholar] [CrossRef]
Puente-Castro, A.; Rivero, D.; Pazos, A.; Fernandez-Blanco, E. UAV swarm path planning with reinforcement learning for field prospecting. Appl. Intell. 2022, 52, 14101–14118. [Google Scholar] [CrossRef]
Kong, W.; Zhou, D.; Yang, Z.; Zhang, K.; Zeng, L. Maneuver Strategy Generation of UCAV for within Visual Range Air Combat Based on Multi-Agent Reinforcement Learning and Target Position Prediction. Appl. Sci. 2020, 10, 5198. [Google Scholar] [CrossRef]
Zhang, G.; Li, Y.; Xu, X.; Dai, H. Efficient Training Techniques for Multi-Agent Reinforcement Learning in Combat Tasks. IEEE Access 2019, 7, 109301–109310. [Google Scholar] [CrossRef]
Zhang, G.; Li, Y.; Xu, X.; Dai, H.; Yu, H. Multiagent Reinforcement Learning for Swarm Confrontation Environments; Yu, H., Liu, J., Liu, L., Ju, Z., Liu, Y., Zhou, D., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 533–543. [Google Scholar]
Austin, F.; Carbone, G.; Falco, M.; Hinz, H.; Lewis, M. Automated Maneuvering Decisions for Air-to-Air Combat; American Institute of Aeronautics and Astronautics: Washington, DC, USA, 1987. [Google Scholar]
Yang, X.; Li, X.; Zhang, Y.; Zhang, Y. Research on UAV Air Combat Decision Making Based on DRL and Differential Games. Fire Control. Command Control 2021, 46, 71–75, 80. [Google Scholar] [CrossRef]
Hongpeng, Z.; Changqiang, H.; Yongbo, X.; Shangqin, T. Maneuver Decision of Autonomous Air Combat of Unmanned Combat Aerial Vehicle Based on Deep Neural Network. Acta Armamentarii 2020, 41, 1613–1622. [Google Scholar] [CrossRef]
Guo, J.; Wang, Z.; Lan, J.; Dong, B.; Li, R.; Yang, Q.; Zhang, J. Maneuver decision of UAV in air combat based on deterministic policy gradient. In Proceedings of the 2022 IEEE 17th International Conference on Control & Automation (ICCA), Naples, Italy, 27–30 June 2022; pp. 243–248. [Google Scholar]
Shaw, R.L. Fighter Combat: Tactics and Maneuvering; Naval Institute Press: Annapolis, MD, USA, 1985; p. 428. [Google Scholar]
Savinov, N.; Raichuk, A.; Vincent, D.; Marinier, R.; Pollefeys, M.; Lillicrap, T.; Gelly, S. Episodic Curiosity through Reachability. arXiv 2019, arXiv:1810.02274. [Google Scholar]
Hou, Y.; Hong, H.; Sun, Z.; Xu, D.; Zeng, Z. The Control Method of Twin Delayed Deep Deterministic Policy Gradient with Rebirth Mechanism to Multi-DOF Manipulator. Electronics 2021, 10, 870. [Google Scholar] [CrossRef]
Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the Machine Learning Research, New York, NY, USA, 23–24 February 2018; pp. 1587–1596. [Google Scholar]
Buşoniu, L.; Rejeb, J.B.; Lal, I.; Morărescu, I.; Daafouz, J. Optimistic minimax search for noncooperative switched control with or without dwell time. Automatica 2020, 112, 108632. [Google Scholar] [CrossRef]
Xu, D.; Qiao, P.; Dou, Y. Aggregation Transfer Learning for Multi-Agent Reinforcement learning. In Proceedings of the 2021 2nd International Conference on Big Data & Artificial Intelligence & Software Engineering (ICBASE), Zhuhai, China, 24–26 September 2021; pp. 547–551. [Google Scholar]
Faust, A.; Palunko, I.; Cruz, P.; Fierro, R.; Tapia, L. Automated aerial suspended cargo delivery through reinforcement learning. Artif. Intell. 2017, 247, 381–398. [Google Scholar] [CrossRef]
Liu, Y.; Liu, H.; Tian, Y.; Sun, C. Reinforcement learning based two-level control framework of UAV swarm for cooperative persistent surveillance in an unknown urban area. Aerosp. Sci. Technol. 2020, 98, 105671. [Google Scholar] [CrossRef]

Figure 1. The 6-DoF model of the UCAV.

Figure 2. Schematic diagram of the air combat reinforcement learning framework based on MDP.

Figure 3. The relative position, velocity, and attitude of the blue UCAV to red UCAV.

Figure 4. Diagram of the TD3 algorithm.

Figure 5. The diagram of the min–max algorithm.

Figure 6. The change of reward value in the training process.

Figure 7. The three-dimensional trajectory diagram of the aerial combat trajectory of the UCAVs controlled by our method (red) and min-max method (blue) when the algorithm converges.

Figure 8. The principle of scenario-transfer training.

Figure 9. The comparison of average reward in different training scenarios.

Figure 10. The relative time consumed by the training of the two methods.

Figure 11. The average reward of the proposed method is further enhanced through the application of soft-update STT.

Figure 12. Nose-to-nose turn maneuver strategy based on combat experience.

Figure 13. Optimal strategy for nose-to-nose turns under ideal equilibrium. (a) Optimal maneuver trajectory diagram of nose-to-nose turn. (b) The reward change curve of UCAVs. (c) The speed change curve of UCAVs. (d) The attitude angle change curve of UCAVs. (e) The control command value change curve of UCAVs.

Figure 14. Optimal strategy for nose-to-nose turn with longitudinal difference in case 1. (a) Optimal maneuver trajectory diagram of nose-to-nose turn. (b) The reward change curve of UCAVs. (c) The speed change curve of UCAVs. (d) The attitude angle change curve of UCAVs. (e) The control command value change curve of UCAVs.

Figure 15. Optimal strategy for nose-to-nose turn with longitudinal difference in case 2. (a) Optimal maneuver trajectory diagram of nose-to-nose turn. (b) The reward change curve of UCAVs. (c) The speed change curve of UCAVs. (d) The attitude angle change curve of UCAVs. (e) The control command value change curve of UCAVs.

Figure 16. Optimal strategy for nose-to-nose turn with lateral difference. (a) Optimal maneuver trajectory diagram of nose-to-nose turn. (b) The reward change curve of UCAVs. (c) The speed change curve of UCAVs. (d) The attitude angle change curve of UCAVs. (e) The control command value change curve of UCAVs.

Table 1. The preprocessed method for state parameters.

State	Definition	State	Definition	State	Definition
S₁	$\frac{D}{D_{\max}}$	S₂	$\frac{α}{π}$	S₃	$\frac{β}{π}$
S₄	$\frac{v_{r}}{v_{r \max}}$	S₅	$\frac{v_{b}}{v_{b \max}}$	S₆	$\frac{p}{π}$
S₇	$\frac{e}{π}$	S₈	$\frac{μ_{r}}{π}$	S₉	$\frac{μ_{b}}{π}$

Table 2. The performance parameters of the UCAVs.

Performance Parameters	Value
The minimum velocity	50 [m/s]
The maximum velocity	150 [m/s]
The tangential overload	[−2, 2]
The normal overload	[−5, 5]
The maximum rolling of angular velocity	π/2 [rad/s]
The maximum distance of weapon to attack	200 [m]
The maximum angle of weapon to attack	π/6 [rad]

Table 3. The candidate maneuvers of the blue UCAV and the value of each action.

Maneuvers	n_r	n_f	μ
Steady flight	0	0	0
Max long acceleration	2	1	0
Max long deceleration	−2	1	0
Max load factor pull up	0	5	0
Max load factor pull over	0	−5	0
Max load factor left turn	0	5	−π/4
Max load factor right turn	0	5	π/4

Table 4. The parameters of the neural network and training parameters in deep learning reinforcement.

The Name of Parameters	Value
Hidden layer of actor neural network h_a₁	300
Hidden layer of actor neural network h_a₂	300
Learning rate of actor neural network l_a	10⁻³
Hidden layer of critic neural network h_c₁	300
Hidden layer of critic neural network h_c₂	300
Learning rate of critic neural network l_c	10⁻³
Rate of greedy exploration ξ	0.1
Variance of random noise σ	0.2
Rate of discount γ	0.99
Parameter of soft update τ	0.01
Capacity of buffer size about experience pool	10⁶
Capacity of batch size about sampling	256

Table 5. The initial state of the red and blue UCAVs under ideal equilibrium.

	x (m)	y (m)	z (m)	v (m/s)	γ (rad)	ψ (rad)	μ (rad)
Red UCAV	−400	0	1500	100	0	π/2	0
Blue UCAV	400	0	1500	100	0	−π/2	0

Table 6. The initial states of the red and blue UCAVs with longitudinal difference.

		x (m)	z (m)	v (m/s)	ψ (rad)
Case 1	Red UCAV	−400	1700	100	π/2
Case 1	Blue UCAV	400	1300	100	−π/2
Case 2	Red UCAV	−400	1300	100	π/2
Case 2	Blue UCAV	400	1700	100	−π/2

Table 7. The initial states of the red and blue UCAVs with lateral difference.

	x (m)	y (m)	z (m)	v (m/s)	γ (rad)	ψ (rad)	μ (rad)
Red UCAV	−400	200	1500	100	0	π/2	0
Blue UCAV	400	−200	1500	100	0	−π/2	0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Gao, X.; Zhang, Y.; Wang, B.; Leng, Z.; Hou, Z. The Optimal Strategies of Maneuver Decision in Air Combat of UCAV Based on the Improved TD3 Algorithm. Drones 2024, 8, 501. https://doi.org/10.3390/drones8090501

AMA Style

Gao X, Zhang Y, Wang B, Leng Z, Hou Z. The Optimal Strategies of Maneuver Decision in Air Combat of UCAV Based on the Improved TD3 Algorithm. Drones. 2024; 8(9):501. https://doi.org/10.3390/drones8090501

Chicago/Turabian Style

Gao, Xianzhong, Yue Zhang, Baolai Wang, Zhihui Leng, and Zhongxi Hou. 2024. "The Optimal Strategies of Maneuver Decision in Air Combat of UCAV Based on the Improved TD3 Algorithm" Drones 8, no. 9: 501. https://doi.org/10.3390/drones8090501

Article Menu

The Optimal Strategies of Maneuver Decision in Air Combat of UCAV Based on the Improved TD3 Algorithm

Abstract

1. Introduction

2. System Modeling Based on Markov Decision Process

2.1. The Dynamic Model of UCAVs

2.2. State Definition of UCAV Based on MDP

2.3. Action Definition and Reward Function

3. The Training Environment for Air Combat

3.1. The Training Method Based on TD3 Algorithm

3.2. The Training Environment Setting

3.3. The Training Process for Air Combat

4. The Improved TD3 Algorithm Based on Scenario-Transfer Training

5. Analysis of Optimal Maneuvering Strategy

5.1. Introduction to “Nose-to-Nose Turn”

5.2. Initial Conditions with Equal Advantages

5.3. Initial Conditions with Vertical Differences

5.3.1. Initial Conditions with Vertical Advantages

5.3.2. Initial Conditions with Vertical Disadvantage

5.4. Initial Conditions with Lateral Differences

5.5. Summary

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

DURC Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI