Deep Reinforcement-Learning-Based Air-Combat-Maneuver Generation Framework

Mei, Junru; Li, Ge; Huang, Hesong

doi:10.3390/math12193020

Open AccessArticle

Deep Reinforcement-Learning-Based Air-Combat-Maneuver Generation Framework

by

Junru Mei

,

Ge Li

and

Hesong Huang

^*

College of Systems Engineering, National University of Defense Technology, Changsha 410073, China

^*

Author to whom correspondence should be addressed.

Mathematics 2024, 12(19), 3020; https://doi.org/10.3390/math12193020

Submission received: 11 September 2024 / Revised: 24 September 2024 / Accepted: 25 September 2024 / Published: 27 September 2024

(This article belongs to the Special Issue Artificial Intelligence and Algorithms with Their Applications)

Download

Browse Figures

Versions Notes

Abstract

With the development of unmanned aircraft and artificial intelligence technology, the future of air combat is moving towards unmanned and autonomous direction. In this paper, we introduce a new layered decision framework designed to address the six-degrees-of-freedom (6-DOF) aircraft within-visual-range (WVR) air-combat challenge. The decision-making process is divided into two layers, each of which is addressed separately using reinforcement learning (RL). The upper layer is the combat policy, which determines maneuvering instructions based on the current combat situation (such as altitude, speed, and attitude). The lower layer control policy then uses these commands to calculate the input signals from various parts of the aircraft (aileron, elevator, rudder, and throttle). Among them, the control policy is modeled as a Markov decision framework, and the combat policy is modeled as a partially observable Markov decision framework. We describe the two-layer training method in detail. For the control policy, we designed rewards based on expert knowledge to accurately and stably complete autonomous driving tasks. At the same time, for combat policy, we introduce a self-game-based course learning, allowing the agent to play against historical policies during training to improve performance. The experimental results show that the operational success rate of the proposed method against the game theory baseline reaches 85.7%. Efficiency was also outstanding, with an average 13.6% reduction in training time compared to the RL baseline.

Keywords:

air combat; deep reinforcement learning; SAC; recurrent neural network

MSC:

93-10

1. Introduction

In recent years, along with the continuous advancement of UAV technology, the Unmanned Combat Aerial Vehicle (UCAV) is exerting an increasingly crucial role on the battlefield. However, in order to fully exploit the performance of UCAVs and achieve high-intensity air combats, UCAVs must be liberated from ground control and have a high degree of autonomous decision-making. Simultaneously, the combination of deep neural networks and machine learning has triggered a new research upsurge in artificial intelligence. In 2016, AlphaGo [1] defeated the human Go champion, drawing worldwide attention. On the one hand, AlphaGo’s victory placed reinforcement learning on the historical stage, and on the other hand, it also verified the application potential of reinforcement learning in game issues. Reinforcement learning has had some successful applications in the military. In the DARPA (Defense Advanced Research Projects Agency) Subterranean Challenge, which began in 2018, competing unmanned ground vehicles used reinforcement-learning techniques to learn to navigate and perform tasks in complex environments, such as opening doors or crossing obstacles. The U.S. Army is developing a Synthetic Training Environment that incorporates RL to create adaptive virtual enemies for training purposes. These AI-controlled adversaries learn and adapt strategies based on the actions of human soldiers. The U.S. Navy has experimented with RL in coordinating drone swarms for maritime reconnaissance, where individual drones learn to work together efficiently in reconnaissance missions. In the DARPA AlphaDogfight trials in August 2020, the intelligent flight autonomous decision-making system designed by the American Heron Company successfully defeated the F-16’s flight instructor with a score of 5:0, attracting the attention of military powers worldwide. Intelligent air-combat decision-making has become a research hotspot for all military powers presently.

So far, numerous studies have been conducted on the within-visual-range (WVR) air combat of UCAVs. The mainstream intelligent air-combat decision algorithms can be roughly divided into three categories. The first method is based on game theory [2,3,4], which includes two branches: differential game and matrix game. This method has a clear mathematical form, but the solution of the differential game is more complicated, and there are more strict conditions for solving the problem. Matrix games can only obtain the optimal decision in a short time, but cannot guarantee to generate the global optimal decision sequence. The second approach is based on expert knowledge [5,6]. This method has the advantages of rigorous logical thinking, clear and reasonable interpretation decisions, and low consumption of computing resources, but it still has several shortcomings. For example, the process of transforming expert knowledge into a knowledge base is complicated, and it is difficult to establish a complete rule base. Expert knowledge comes from the experience of human pilots and cannot guarantee the global optimality of decision-making strategies. The learning performance of the expert system is poor. After the design of the expert system is completed, it is difficult to extend the learning of new knowledge. The third method is based on reinforcement learning, which has the advantages of high control precision, fast response speed, potential optimal solutions that can be searched, long-term tactical planning ability, and strong relearning ability. However, at the same time, there are also shortcomings, such as the design of the reward function being highly subjective and the training of the algorithm requires a large amount of data and a long time. Considering the advantages and disadvantages of the above methods, we decide to adopt the reinforcement-learning method to solve the WVR air-combat problem of UCAVs.

Early studies were carried out in simplified environmental and dynamic models. The current aerial combat setting can be classified as either a two-dimensional or a three-dimensional space. Some studies, such as the work in [7], ignore the altitude aspect of UAVs, effectively reducing the combat space from three dimensions to a two-dimensional surface. These researchers have utilized platforms like DCS World to simulate one-on-one aerial combat scenarios and have endowed UAVs with the ability to make intelligent maneuver decisions. In an attempt to simulate the aerial combat scenario more accurately, various scholars have delved into the study of 3D combat situations. For instance, the work in [8,9] introduced a deep reinforcement-learning (DRL) algorithm for UAVs to make intelligent decisions, leveraging virtual reality technology. They trained UAVs in a 3D virtual environment using the Unity3D platform, enabling the visualization of the combat process in three dimensions.

Numerous studies on aircraft maneuvers for air combat assume comprehensive knowledge of both one’s own aircraft and the target’s information [10,11,12,13,14]. However, in real-world scenarios, specific details such as the position, orientation, and velocity of the target can be imprecise due to limitations in sensor capabilities, measurement errors, and fluctuating weather conditions. Moreover, pursuing an opponent is not always successful, leading to situations where target information becomes unobtainable. Therefore, for an air-combat model to be applicable to actual aircraft operations, it is crucial to incorporate the constraints and inaccuracies associated with the aircraft’s integrated sensor systems.

In this paper, we propose a hierarchical DRL framework for constructing an air-combat model equipped with a high-fidelity 6-DOF dynamics model, which can conduct air combat in partially observable Markov decision process (POMDP) environments with imperfect information. The framework is segregated into two layers and solved independently using reinforcement learning. The upper layer formulates the combat policy based on the current combat situation and generates the autopilot instructions for the lower layer. The lower layer employs the control policy to directly manipulate the engine and rudder surfaces of the aircraft in accordance with the upper-layer instructions. Considering the issue of information imperfection in real-world air combat, we focus on two distinct sensor types: radar and visual sensors. Regarding the radar system, it is assumed that it operates with high accuracy and is capable of precisely detecting the opponent in the forward direction of one’s own ship. Conversely, the vision sensor can detect opponents from all directions but has a shorter detection range and is mainly utilized in close combat scenarios. Nevertheless, it is pertinent to note that neither of these sensors is infallible and can, at times, be subject to constraints and errors. Critically, these may result in situations where there is a complete lack of information about the opponent. To robustly cope with such unpredictable circumstances, we incorporate the Gate Recurrent Unit (GRU), a type of recurrent neural network, into our system architecture. Additionally, we apply the renowned soft actor–critic (SAC) algorithm to our system. This design enables us to effectively address the challenge of handling sensor errors, ensuring that our system is resilient and reliable, even in the presence of inherent limitations in the sensors’ functionality.

Furthermore, to solve the low efficiency of exploration in the early training stage, we utilize curriculum learning, a technique that restricts state space exploration. Similar to the human learning process, it is an effective tool to gradually learn from simple concepts to complex problems. It breaks down complex knowledge by providing a series of progressively more difficult learning tasks. It has been proven to reduce training time and induce reward signal feedback in environments with sparse reward, therefore balancing action and exploration.

Finally, sufficient experimentation shows that the proposed method can effectively deal with problems caused by missing information while also addressing the exploration difficulty in a high-fidelity simulation environment. Compared to the baselines, our agent has demonstrated exceptional combat performance. The contributions of this paper are as follows:

Proposal of a novel two-layer hierarchical decision framework using the upper layer to generate combat policy based on the current combat situation and the lower layer to control aircraft maneuvers according to upper layer instructions.
A decision method for partially observable environments is proposed, utilizing the learning algorithm SAC and GRU to extract temporal context information and effectively address problems arising from sensor limitations and errors.
A series of reward functions based on expert knowledge are designed to effectively enhance the efficiency and stability of early training of agents. A virtual self-game mechanism is proposed to improve combat performance by fighting against its historical policy and iterating. Experimental results demonstrate the effectiveness of this mechanism.

The rest of this article is organized as follows. In Section 2, we review our work on air-combat solutions. Section 3 describes our approach in detail, including a partially observable Markov decision process (POMDP) model for WVR air combat and a DRL decision framework. In Section 4, we conducted a series of experiments and discussed the results. Finally, we conclude Section 5.

2. Related Work

In recent years, Deep Reinforcement Learning (DRL) has emerged as a key focus in the field of artificial intelligence. It demonstrates extraordinary decision-making capabilities in games, navigation, multi-agent cooperation, autonomous driving, and Robot control [15]. For instance, Alpha Go triumphed over the world Go champion with a score of 4:1 [1]. Alpha Star defeated the StarCraft professional with a score of 10:1 [16]. OpenAI Five defeated DOTA2’s highest competition level TI champion OG team with a score of 2:0 [17]. These examples show that reinforcement learning has surpassed the average human level in decision-making speed and precision. Consequently, numerous researchers have applied DRL to air-combat scenarios and achieved remarkable accomplishments.

Yang et al. [18] utilized deep Q networks to train an agent that selects from a collection of discrete maneuvers in a customized 3D environment. Hu et al. [19] modeled one-to-one beyond-visual-range air combat by establishing the attack envelope and kill envelope of air-to-air missiles. The results validate that it can effectively represent the motion of the UAV in space and achieve intelligent decision-making. In the work of [20], a UAV dynamics model was established, where continuous throttle, angle of attack, and bank angle were taken as control variables. Different air-combat intentions, such as attack, escape, and pursuit, were designed, and an intelligent decision framework for MAV/UAV was established. The results show that the UAV can enhance the intelligent level under the guidance of MAV intention. The work in [21] presented a semi-realistic flight simulation environment known as Harfang3D Dog-Fight Sandbox for UAVs. The work in [7,18,19,22,23] established realistic air-combat environments with UAV and missile models. They used DRL algorithms such as TD3 to train the agent. Yuan et al. [24] proposed a novel heuristic deep deterministic policy gradient (DDPG) algorithm to enhance the exploration efficiency in the continuous action space. The aforementioned methods assumed 3-DOF aircraft dynamics for the air-combat task, disregarding the attack and sideslip angles by presuming that the velocity vector aligns with the nose direction. This oversimplification makes these methods unsuitable for more complex and realistic 6-DOF aircraft dynamics.

To enhance realism in simulating real-world scenarios, recent studies have utilized 6-DOF (six degrees of freedom) aircraft models for close-range air combat, referred to as WVR (within-visual-range) engagements. As the complexity of the dynamics escalates, so does the complexity of the simulation methods. The work in [25] presents a hierarchical deep reinforcement learning (DRL) policy that encompasses a high-level policy selector along with a set of specialized low-level policies, each trained to deal with different domains within the state space. Both the high-level and low-level components are trained through off-policy, maximum entropy techniques, and expert knowledge is incorporated by shaping the reward structure. This approach was conspicuously successful, achieving second place in the AlphaDogfight Trials and surpassing a human expert. While this method yields remarkable results, crafting reward functions that ensure optimal performance for subpolicy training continues to be a formidable challenge. Additionally, it demands considerable computational resources, involving tens of thousands of neurons at each layer of the neural network. The study referred to as [26] presents an alternative hierarchical framework with the aim of tackling the challenges of one-on-one WVR air combat in the context of 6-DOF dynamics. The method divides the decision-making process into inner and outer loops. However, In this work, the depiction of the missile is overly simplified. They employ a pre-defined weapon engagement zone (WEZ) and operate under the assumption that an adversary can be neutralized as soon as they are within this zone. This simplification can hardly explain the agent’s evasion tactics compared to the behavior of real-world air combat.

The neural network used by the algorithm in the above research only has a fully connected layer and cannot extract historical information, so it has poor performance in the environment of lack of information. Yang et al. [27] compared the performance of off-policy reinforcement-learning algorithms such as DDPG, TD3, and SAC with cyclic neural networks such as LSTM and GRU. The experimental results show that SAC and recurrent neural networks (RNN) have the best combined ability. Among them, the combination of SAC and GRU has the best performance, which is comparable to the combination of SAC and LSTM, but has advantages in convergence speed. Therefore, we use the SAC+GRU algorithm, which improves the SAC algorithm based on GRU networks and utilizes historical temporal data. Detailed descriptions of the air-combat scenario, the related theories, and our proposed decision-making framework are presented as follows.

3. Preliminaries

3.1. Air-Combat Scenario Description

In this paper, one-to-one (red versus blue) WVR air combat is investigated. The aircraft adopts a 6-DOF motion model, and its motion is controlled by the forces and moments generated by each control surface. Both Red and Blue use the JSBSim [28] open-source environment to simulate the aircraft model, the flight dynamics model (FDM) based on wind tunnel test data, with high fidelity. This system model is also used by the Defense Advanced Research Projects Agency (DARPA) AlphaDogfight project. Detailed equations of motion can be found in [29].

The operating frequency of the simulation environment is 60 Hz. The action space consists of continuous signal inputs from the flight control system (ailerons, elevators, rudder, and throttle). WEZ is defined as a spherical cone with a

2^{\circ}

aperture extending from the nose of an aircraft at 200–1000 m (Figure 1) with a range of 200–1000 m. Either side takes sustained damage while in the opponent’s WEZ. Damage per second in WEZ can be calculated using the following formula:

d_{w e z} = \{\begin{matrix} 0 & d > 1000 m \\ \frac{1000 - d}{800} & 200 m < d < 1000 m \\ 0 & d < 200 m \end{matrix}

(1)

where d is the distance between the two aircraft.

The engagement ends when the simulation time reaches 300 s or the health of either aircraft reaches zero. The agent can win by causing enough damage to the opponent or forcing him to break the minimum height limit of 100 m. If both sides have not cleared HP after 300 s, the engagement is considered a draw.

3.2. Partially Observable Markov Decision Processes (POMDP)

In real-world settings, an agent often can only have access to a reflection of the underlying state of the environment due to noise, occlusions, limited measurement resolutions, and so on. Under partial observability, a decision-making problem can be effectively modeled as a POMDP [30]. We can formally specify a finite-horizon POMDP by the tuple

〈S, A, T, R, Ω, O, γ〉

, where S is the state space, A is the action space,

T (s_{t + 1} ∣ s_{t}, a_{t})

is the transition probability,

R (s_{t}, a_{t}, s_{t + 1})

is the reward function,

Ω

is the observation space,

O (o_{t} ∣ s_{t + 1}, a_{t})

is the observation function, and

γ \in [0, 1]

is the discount factor. At each time period, the environment is in a certain state

s \in S

. The agent takes action

a \in A

, which results in an environment transition to state

s^{'}

with probability

T (s^{'} ∣ s, a)

. At the same time, the agent receives observation

o \in O m e g a

, which depends on the new state of the environment with probability

O (o ∣ s^{'}, a)

. Finally, the agent receives a reward r equal to

R (s, a)

. Then, repeat the process. The goal is to have the agent select the action that maximizes its expected future discount reward at each time step:

E [\sum_{t = 0}^{\infty} γ^{t} r_{t}]

. To achieve optimal performance in a partially observable Markov decision process (POMDP), an agent must base its policy on the complete history of observations and actions it has encountered thus far. However, the dimensionality of the history space grows exponentially with the length of this history. Consequently, recurrent neural networks are frequently employed to encapsulate this historical information, allowing the policy to be conditioned on the fixed-size hidden states produced by the RNN.

3.3. Maximum Entropy Reinforcement Learning

The idea of most deep reinforcement-learning methods is to find a deterministic notion of optimality, and although some random policies can increase agent exploration, most of them are heuristic, such as adding random noise or initializing a random policy with high entropy. But in some cases, we prefer to learn random policies, which can increase exploration in tasks with different goals and can also gain different combinations through pre-training. Previous studies have shown that when we consider optimal control versus probabilistic inference, random policies may be the best choice, i.e., the solution to the maximum entropy learning problem. Intuitively, we do not just want to capture a single deterministic behavior with a minimum cost (maximum cumulative expected return) but a set of behaviors with a low cost, i.e., we want to learn policies that can accomplish a task in multiple ways. If we can learn such a random policy, it can be a very good initialization policy when we want to fine-tune it to fit a more specific task.

Our goal is to learn a policy

π (a_{t} ∣ s_{t})

. The learning goal of traditional reinforcement learning is:

π_{s t d}^{*} = arg max_{π} \sum_{t} E_{(s_{t}, a_{t}) \sim ρ_{π}} [r (s_{t}, a_{t})]

(2)

The maximum entropy RL adds a policy entropy to the end of the reward term, which aims to not only maximize the reward at every moment but also combine with the policy entropy at every moment to increase the uncertainty of the policy.

π_{M a x E n t}^{*} = arg max_{π} \sum_{t} E_{(s_{t}, a_{t}) \sim ρ_{π}} [r (s_{t}, a_{t}) + α H (π (\cdot ∣ s_{t}))]

(3)

Where

α

is the temperature parameter that controls the randomness of the optimal policy,

ρ_{π}

denotes the state-action marginals of the trajectory distribution induced by the policy,

r (s_{t}, a_{t})

is the reward obtained at time t, and

H (π)

represents the entropy

E [- log π (a ∣ s)]

of the policy. This approach helps balance development and exploration. When the environment reward is high, the policy entropy is low, and the agent tends to choose the optimal action. When the environmental reward is high, the policy entropy is higher, which leads to greater exploration. If we wish to extend the maximum entropy RL to the infinite horizon problem, then it is convenient to introduce the discount factor

γ

to ensure that the expected reward and entropy are bounded. The entropy term makes the exploration range of the policy wider and eventually converges to the optimal action. Previous studies have confirmed the advantages of maximum entropy RL in the exploration of multimodal problems and learning speed [31,32].

3.4. The Soft Actor–Critic Algorithm

Soft actor–critic (SAC) [33] is an off-policy reinforcement-learning algorithm that combines maximum entropy learning with an actor–critic framework. Common reinforcement-learning algorithms tend to become more and more deterministic in the learning process, which greatly weakens the exploration ability of the algorithm in the late training and easily converges to the local optimal solution. In SAC, we want not only to maximize the reward of the environment but also to maximize the entropy of the policy, which makes the algorithm have better exploration ability and, therefore, show better performance on various tasks.

In SAC, we want to find a policy that maximizes the reward and, at the same time, maximizes the entropy of the policy’s action distribution in each state. Therefore, the reward function is expanded:

r_{s o f t} (s_{t}, a_{t}) = r (s_{t}, a_{t}) + α \underset{s_{t + 1} \sim p}{E} [H (π (\cdot ∣ s_{t + 1}))]

(4)

Where

α

is the entropy temperature coefficient, which determines the degree of emphasis on entropy maximization; p is the state transition distribution, and

r (s_{t}, a_{t})

is the expectation that performing

s_{t}

in

a_{t}

will result in a reward. By bringing (4) into the Bellman equation, we obtain the soft action value function:

Q_{s o f t} (s_{t}, a_{t}) = r (s_{t}, a_{t}) + \underset{s_{t + 1}, a_{t + 1}}{E} [Q_{s o f t} (s_{t + 1}, a_{t + 1}) - α log (π (a_{t + 1} ∣ s_{t + 1}))]

(5)

Using (5), we can calculate the

Q_{s o f t}^{π_{o l d}}

value of the current policy

π_{o l d}

. Then, solve the following optimization problems to find a new policy

π_{n e w}

that is better than the current policy:

π_{n e w} = \underset{π^{'}}{arg min D_{K L}} (π^{'} (\cdot ∣ s_{t}) ‖ \frac{exp (α^{- 1} Q_{s o f t}^{π_{o l d}} (s_{t}, \cdot))}{Z^{π_{o l d}} (s_{t})})

(6)

In (6), we first exponential

Q_{s o f t}^{π_{o l d}}

to

exp (α^{- 1} Q_{s o f t}^{π_{o l d}})

and then normalize it to the distribution

\frac{exp (α^{- 1} Q_{s o f t}^{π_{o l d}} (s_{t}, \cdot))}{Z^{π_{o l d}} (s_{t})}

, where

Z^{π_{o l d}} (s_{t}) = \int exp (α^{- 1} Q_{s o f t}^{π_{o l d}} (s_{t}, a_{t})) d a_{t}

is the normalized function. We hope to find a policy

π^{'}

in the state

s_{t}

with minimum KL-divergence between its action distribution and

\frac{exp (α^{- 1} Q_{s o f t}^{π_{o l d}} (s_{t}, \cdot))}{Z^{π_{o l d}} (s_{t})}

distribution, i.e., the distribution

π^{'} (\cdot ∣ s_{t})

and

\frac{exp (α^{- 1} Q_{s o f t}^{π_{o l d}} (s_{t}, \cdot))}{Z^{π_{o l d}} (s_{t})}

are expected to be as similar as possible.

We use

θ

and

ϕ

to represent the parameters of action value function

Q_{θ} (s, a)

and policy

π_{θ} (a ∣ s)

, respectively. Similar to TD3 [34] algorithm, we use Clip Double Q learning to mitigate overestimating so that we will study

θ_{1}

and

θ_{2}

networks, respectively. In addition, in order to ensure the stability of learning, we will also maintain two target networks,

{\hat{θ}}_{1}

and

{\hat{θ}}_{2}

.

According to (5), the training of parameter a of q can be achieved by minimizing the soft Bellman residual:

\begin{matrix} L_{Q} (θ_{i}) = \underset{(s_{t}, a_{t}, r_{t}, s_{t + 1}) \sim D}{E} [(Q_{θ_{i}} (s_{t}, a_{t}) - & (r_{t} + \underset{a_{t + 1} \sim π_{ϕ}}{E} [min_{{\hat{θ}}_{i}} Q_{{\hat{θ}}_{i}} (s_{t + 1}, a_{t + 1}) \\ - α log (π_{ϕ} (a_{t + 1} ∣ s_{t + 1})) {]))}^{2}] \end{matrix}

(7)

where D denotes the replay buffer.

Further simplifying (6), The training of parameter

ϕ

of

π_{ϕ} (a_{t + 1} ∣ s_{t + 1})

can be achieved by minimizing the following loss function:

L_{π} (ϕ) = \underset{s_{t} \sim D}{E} [\underset{a_{t} \sim π_{ϕ} (∣ t)}{E} [α log π_{ϕ} (a_{t} ∣ s_{t}) - min_{θ_{i}} Q_{θ_{i}} (s_{t + 1}, a_{t + 1})]]

(8)

The SAC algorithm sets a target entropy for each task. When the entropy of the policy is greater than the target entropy, reduce

α

to reduce the degree of maximum entropy of the algorithm, thus reducing the entropy of the current policy. When the entropy of the policy is less than the target entropy, increasing

α

increases the degree of the algorithm to maximize the entropy, thus increasing the entropy of the current policy. The loss function of design entropy is as follows:

\begin{matrix} L (α) & = \underset{s_{t} \sim D, a_{t} \sim π}{E} [- α log π_{ϕ} (a_{t} ∣ s_{t}) - α \hat{H}] \\ = α \underset{s_{t} \sim D}{E} [H (π (\cdot ∣ t)) - \hat{H}] \end{matrix}

(9)

4. Method

Due to the continuous high-dimensional state and action space, as well as the complex calculation of aircraft dynamics by JSBSim, it is difficult to directly train the combat policy of the aircraft. The idea of hierarchical reinforcement learning is to decompose a complex problem into several smaller subtasks for easy solutions. In this paper, we propose a hierarchical decision framework as shown in Figure 2. The upper level outputs maneuvering instructions based on the current battle situation (the position, attitude, and speed of the warring parties) and acts as input to the lower level. The lower level determines the output signals of each control part of the aircraft (ailerons, elevators, rudder, and throttle) according to the instructions of the upper level and the current state of the aircraft to complete the maneuver. At the same time, based on how well the control policy completes the mission, the combat policy plans its behavior more rationally. The upper and lower layers have different decision frequencies. At the micro level, the decision frequency is high because the control policy needs to track the target with high-precision and keep the aircraft stable. On the macro level, in order to avoid the trajectory shock caused by frequent policy changes and reduce the consumption of computing resources, the combat policy operates at a lower decision frequency.

The training of the overall framework is divided into two phases. In the first phase, we trained the control policy using the SAC algorithm to perform tasks such as navigation and tracking with high performance. In the second stage, we use the trained control policy to supplement the training of combat policy while introducing the self-game mechanism. This training method improves the training efficiency and improves combat performance. The specific description is as follows.

4.1. The 6-DOF Dynamic Control Policy

Reinforcement-learning algorithms interact with the environment based on the current state, move to the next state, and receive a reward, a process often modeled as MDP. In this work, JSBSim serves as a simulation environment for aircraft dynamics modeling and air-combat maneuvers. It can simulate the movement of the aircraft with high precision and return a variety of desired variables. This will generate a state space with position and velocity, etc., which we define as

s_{b a s i c}

. The definition of each state is shown in Table 1.

The control policy realizes motion control by controlling the pitch, roll, and yaw angles of the aircraft. Because the three Euler angles are coupled, the yaw angle can be controlled indirectly by the pitch angle and the roll angle. The control policy takes the pitch angle, roll angle, and velocity difference of the target as the tracking target and outputs the deflection angle instructions of ailerons, elevators, rudders, and throttle valves to directly control the aircraft. The control process can be established as A Markov decision process (MDP):

M = 〈S, A, R, P, γ〉

, where S is the state space, A is the action space, R is the reward function, P is the state transition, and

γ

is the discount factor. The details are as follows.

4.1.1. State Space

State

s_{t}

describes the control of the aircraft at time t. It can be divided into two parts. The first part is the basic state of the aircraft. It is used to describe the current state of the vehicle for attitude control. The second part describes the situation of angle and velocity tracking. It contains the tracking errors of pitch angle, roll angle, and velocity:

[Δ θ_{t}, Δ ϕ_{t}, Δ v_{t}]

, where

Δ θ_{t}, Δ ϕ_{t}, Δ v_{t}

are, respectively, the difference between and target pitch angle, target roll angle and speed. In summary, the state

s_{t}

is

[s_{b a s i c}, Δ θ_{t}, Δ ϕ_{t}, Δ v_{t}]

. All bounded variables are normalized.

4.1.2. Action Space

The t time action includes the aileron

c_{a}

, the elevator

c_{e}

, the rudder

c_{r}

deflection angle command, and the throttle

c_{t}

, which, respectively, controls the aircraft roll, pitch, yaw, and engine thrust. The bound of each variable is shown in Table 2.

4.1.3. Reward

The positive/negative feedback signal is obtained by the agent after performing an action, and the reward set is all the feedback information available to the agent. However, due to the sparsity of the reward signal, the model training may have high variance, resulting in unstable performance. Therefore, the design of reward functions is important for successful learning. For the training of control policy, we design an effective reward function:

r = β_{1} r_{t r a c k} + β_{2} r_{a c t i o n} + r_{s t a b l e} + r_{c r a s h}

(10)

The first part is the target tracking reward, which is determined by the errors between the current pitch angle, roll angle, and target angles:

r_{t r a c k} = - \sqrt{{(Δ θ)}^{2} + {(Δ ϕ)}^{2} - exp (- |Δ v|) + 1}

(11)

The second part is the action reward, which is summed up by 4 parts:

r_{i} = exp (- \frac{0.5 |c_{i}|}{k_{i}}), \begin{matrix} i = a, e, r, t \end{matrix}

(12)

where

c_{i}

represents the action value. As

|c_{i}|

increases,

r_{i}

monotonically decreases from 1 to 0. This helps provide the agent with smooth feedback on changes and reduces system fluctuations. The scale factor

|k_{i}|

can be set to an appropriate value according to the unit of measure

|c_{i}|

.

β_{1}

and

β_{2}

in (10) are constants and can be adjusted to change the effect of the corresponding reward.

The third part is related to aircraft stability. If the aircraft exceeds the limit angle of attack or sideslip angle or is below the critical speed will lose control.

r_{p o s i t i o n} = \{\begin{matrix} 1 & s t a b l e \\ 0 & i n s t a b l e \end{matrix}

(13)

If it crashes, the penalty

r_{c r a s h} = - 100

is given, and the simulation ends.

We use the SAC algorithm to train the control policy. Initialize the network parameters

ϑ_{1}

,

ϑ_{2}

, and

φ

and initialize a replay buffer

D

at the beginning of each epoch. Then, start the simulation until the replay buffer is full. We select the position signal of the random target as the tracking target and re-initialize it regularly to enhance the exploration ability of the agent. The trace data are stored in the buffer

D

. When enough data are collected, we update the parameters of the actor network and critic network according to the loss function described in Section 3.3. Through training, the control policy finally converges and obtains good tracking performance, which will be verified in the experiment. The training process can be described by Algorithm 1:

Algorithm 1: SAC-Based Control Policy Training

After the training of the control policy is complete, we freeze its parameters and encapsulate them into the decision framework. This facilitates the training of the combat policy and allows the control policy to be modularized for reuse in other missions.

4.2. The SAC+GRU-Based Combat Policy

In this section, we train the upper-layer combat policy based on the control policy introduced in the previous section. Including state space design, reward design, and action space design. As shown in Figure 2, the combat policy outputs macro commands based on the state of the aircraft and the relative position and speed of the opponent. Considering the sensor characteristics of the configuration, the simulator uses state to generate reward

r_{t}

and observations

o_{t}

. Trajectories

(o_{t}, a_{t}, r_{t})

are stored in the replay buffer, and a fixed-length sequence of trajectories is sampled for the critic network.

4.2.1. Observation Space

The combat policy observation space contains information about air combat, including information such as the position and posture of the self relative to the opponent. Figure 3 shows the basic geometry between the two aircraft and represents the components of the relative target information. Relative target information at time step t includes: Aspect Angle (

A A_{t}

), Antenna Train Angle (

A T A_{t}

), Heading Crossing Angle (

H C A_{t}

), relative distance (

d_{t}

), relative speed (

Δ V_{t} = [Δ u_{t}, Δ v_{t}, Δ w_{t}]

), ownship HP (

o H P_{t}

), and target HP (

t H P_{t}

):

Ω_{t} = \{A A_{t}, A T A_{t}, H C A_{t}, d_{t}, Δ V_{t}, o H P_{t}, t H P_{t}\}

(14)

4.2.2. Action Space

The upper layer generates the target tracking track in real time according to the combat situation and determines the optimal sailing target point according to the position, attitude, and speed vector of both sides. The output of the upper layer is the expected roll angle

{\dot{ϕ}}_{t}

, pitch angle

{\dot{θ}}_{t}

and speed

{\dot{v}}_{t}

. Its action space is:

a_{t} = ({\dot{ϕ}}_{t}, {\dot{θ}}_{t}, {\dot{v}}_{t})

(15)

4.2.3. The Design of POMDP Environment

In the real environment, it is often difficult to obtain complete target information due to the resolution limitation and noise of the sensor. The accumulation of errors in the training process will eventually lead to an overestimation of the Q value, which may result in training converging to sub-optimal policies or even training failure. In practice, RNNS can be used to model from the time series information of the target to predict the trajectory of the target beyond the detection range of the sensor and reduce the impact of noise. In this paper, we designed a POMDP environment to reflect the real environment characteristics. We used two types of sensors for POMDP environments, as shown in Figure 4:

The first is radar: fixed search 60° ahead. The search distance is far. The second is the vision sensor: the search distance is close, and the error is large, but can cover a 360-degree range.

r_{V}

and

e_{V}

are the detection distance and error rate of the vision sensor, while

r_{R}

and

e_{R}

are the detection distance and error rate of the radar sensor, respectively. When the target is within the detection range of the sensor, the error

e_{t}

in the observation space is proportional to the relative distance d. When the target is in the overlapping area of two sensors,

e_{t} = min \{e_{V}, e_{R}\}

. If the target is out of the detection range, the target-related information is set to 0.

4.2.4. Reward

To improve the exploration difficulties caused by sparse reward problems, signal delay, local optimal, and high variance, we design a series of reward functions based on expert knowledge to form dense reward signals. The reward function of combat policy can be divided into five types: shoot-down, WEZ, control zone, crash, and control stability. Details for each item are as follows:

r_{A T A} = - {\bar{A T A}}^{\frac{1}{2}}

(16)

r_{A A} = - \bar{A A}

(17)

r_{A T A}

and

r_{A A}

are related to the tracking angle of the target and penalize the agent when there is a non-zero angle.

r_{p o s i t i o n} = \{\begin{matrix} e^{- \bar{A T A}} tanh (1000 - d) & 200 m < d < 1000 m \\ - 2 e^{- \bar{A T A}} tanh (1000 - d) & d < 200 m \end{matrix}

(18)

r_{p o s i t i o n}

encourages the agent to move to the opponent’s six o’clock direction and approach the opponent at the same time to take advantage of the position. At the same time, curb the tendency to move too close so as not to overtake the opponent.

r_{w e z} = \{\begin{matrix} exp {(\frac{1000 - d}{800})}^{2} & 200 m < d < 1000 m a n d A T A < 2 ° \\ 0 & o t h e r w i s e \end{matrix}

(19)

r_{d a m a g e d} = \{\begin{matrix} - exp {(\frac{1000 - d}{800})}^{2} & 200 m < d < 1000 m a n d A T A > 178 ° \\ 0 & o t h e r w i s e \end{matrix}

(20)

r_{s h o o t}

rewards the agent for attacking an opponent in WEZ, and

r_{d a m a g e d}

punishes the agent for entering the opponent’s WEZ.

If either side is shot down within the time limit, the simulation ends, and the agent receives a reward

r_{w i n} = 100

or

r_{l o s e} = - 100

based on the outcome of the combat.

4.2.5. Network Architecture

The diagram of the actor network and critic network in this article is shown in Figure 5. The network architecture of the SAC+GRU algorithm consists of two parts: actor network and critic network. The actor network takes the observed feature vector o as input, passes through the fully connected layer FC1, and uses the ReLU activation function to enhance feature extraction. The subsequent layer is a GRU layer that enables the algorithm to extract timing information and improve decisions based on the current state. After the GRU layer is the fully connected layer FC2, which uses ReLU activation functions to process the output of the GRU layer. Finally, a full connection layer produces the average

μ

and

σ

, used to resampling from Gaussian distribution

N = (μ, σ)

. The resulting action a is obtained by the tanh activation function.

The critic network takes the observation vector o and the action vector a as inputs. The FC1 and GRU layers are the same as the actor network. The FC2 layer is a fully connected layer, and features are extracted from A using the ReLU activation function. The FC3 layer takes features from the GRU layer and the FC2 layer as input. Finally, the output layer consists of a neuron that outputs Q values to update the actor network.

4.2.6. The Self-Play Training Mechanism

In the early stages of training, the agent’s opponents are created using a rule-based approach. We build the agent’s historical policy pool

B

, record the historical policy and update it after each episode. We adopt the self-game course-learning method. When the winning rate of the agent in training exceeds 50%, we will optimize the current policy

π_{h}

by making the agent fight against its own historical policy according to the game theory method. The network parameters stored in the pool are frozen and uniformly sampled as the opponent policy in the self-game mechanism. The training process is shown in Algorithm 2. It has been shown that self-game-based methods can converge to Nash equilibrium.

The rule for opponent updates is to sample opponents from the historical policy pool according to a certain probability distribution. This distribution is inversely proportional to the winning percentage of the last 100 games against each opponent and is dynamically updated after each episode. The sampling process is designed to allow the agent to play more against more intelligent opponents and reduce redundant games to avoid performance degradation. The policy can be iterated and converged to Nash equilibrium policy by self-evolving training.

Algorithm 2: SAC+GRU Combat Policy Training

5. Experiment

This section discusses the training progress and results of the control policy and the combat policy. This experiment was performed on an Intel Core I9-13900K CPU (24 cores, 3.00 GHz), an Nvidia RTX 4090 GPU (24 GB) and 64 GB RAM. The Pytorch version is v1.10.1. The interaction frequency of the JSBSim environment is set to 60 Hz, the decision frequency of the control policy is set to 30 Hz, and the decision frequency of the combat policy is set to 10 Hz. Due to the large number of parameters to calculate in the JSBSim environment, training in control policy took 16 days, and combat policy took 19 days. We also set up a control group where the GRU layer of the combat policy was replaced with a fully connected layer, and the control policy remained the same. It takes about 22 days to train. The main parameters in the experiment are shown in Table 3.

5.1. Experiments on Control Policy

In the control policy experiment, the coefficient of

r_{t r a c k}

term and

r_{a c t i o n}

term in the reward will affect the performance of the control policy. To study the influence of the coefficient, we first limited the coefficient

β_{1}

of

r_{t r a c k}

between 0.1 and 2 and the coefficient

β_{2}

of

r_{a c t i o n}

between 0.1 and 0.5 according to experience.

We adjusted two coefficient pairs with a step size of 0.01 to train the control policy and conducted 200 random target signal tests for each control policy. If the training is convergent, the final average reward is normalized. Then, ultimately, reward the highest combination of coefficients as our control policy. The results in Figure 6 show that different combinations of coefficients lead to performance differences. When

β_{1} / β_{2}

is too large, the stability of the aircraft will be affected, and the angle of attack and sidescale angle of the aircraft will be greatly shaken, which will lead to the aircraft stall or tailspin. When

β_{1} / β_{2}

is too small, the plane’s rudder angle is too small, which reduces aircraft maneuverability.

According to the experimental results, we finally determined the coefficients

β_{1}

= 1.22 and

β_{2}

= 0.27 as our control policy. The learning curves of the control policy are shown in Figure 7, which is obtained by averaging the results of 5 repeated experiments, where the blue curve represents the average reward, and the yellow curve represents the tracking error. The tracking error is determined by the sum of roll angle, pitch angle, and speed difference. At the initial stage of training, the reward increases rapidly, and the average tracking error decreases. After about 200 episodes of training, the two curves gradually converge. The average reward per episode converges to about 5000, and the average tracking error converges to about 0.01. This shows that the control stability and tracking performance are improved with the iteration of training. The convergence of the tracking error curve to 0.01 indicates that the tracking error angle of the convergent agent is less than 1 degree and the speed difference is close to 0.

In the experiments to test the tracking performance of a control policy, we compare it with a traditional proportional integral derivative (PID) controller. It is the most widely used automatic controller. It has the advantages of simple principle, easy realization, wide application, independent control parameters, and simple selection of parameters. Details can be found in [35].

We tested the tracking performance of SAC and PID in two different scenarios: (a) sinusoidal signals and (b) Random signals. Figure 8 shows the tracking results of two control policies under two target signals. Both can stably track target course, altitude and speed. However, compared with the PID controller, the SAC control policy has advantages in overshoot, response speed, stability time and steady-state error. The control policy based on reinforcement learning can track the target signal quickly and smoothly, while the PID controller has obvious oscillation. In addition, model-free reinforcement-learning algorithms do not require complex analysis of aircraft dynamics to find superior policies that can be applied to a wider range of scenarios.

5.2. Experiments on Combat Policy

5.2.1. Evaluation of SAC+GRU Algorithm

The adversary policy in this experiment is modeled according to the method in [2], which is based on game theory and modified by us to apply to the aircraft dynamic model. We assume that the adversary has access to all our agent’s information to prevent performance degradation, limit the agent’s inefficient search of the state space, and speed up early training.

To examine the role of the GRU layer, we set up a control group trained only with SAC, with both layers having previously trained control policies. Set the battlefield range to 10 km × 10 km, and randomly initialize the positions of both sides after each combat. We examined the effect on the training of both groups of agents by varying the size of the

r_{V}

(10 km, 3 km, and 1 km). In the experiment, the noise

e_{R}

of the radar is set to Gaussian noise

N (0, {0.01}^{2})

, and the noise

e_{V}

of the vision sensor increases with the increase of distance, which is expressed as

e_{V} = \frac{\sqrt{d}}{2 r_{V}}

. Figure 9 shows the learning curves of SAC+GRU and SAC agents, where the solid line represents the average reward. When

r_{V}

= 10 km, SAC+GRU and SAC performed similarly, with SAC+GRU averaging 4.86% higher rewards than SAC but converging faster. The reason for this is that most of the time vision sensors can cover the entire battlefield, and the decision-making process in this case is similar to MDP. When

r_{V}

= 5 km, both SAC+GRU and SAC end up with a lower average reward. SAC+GRU was down 4.19%, while SAC was down 11.70%. Through visual analysis, we found that the SAC agent has a certain probability of losing the target after intersecting with the enemy aircraft, especially at higher speeds, because the turning radius is larger at this time. When

r_{V}

= 1 km, in the worst case, the opponent cannot be seen until he is in the six o’clock direction and within his WEZ. Compared to an MDP-like environment, the average reward for SAC+GRU dropped by 12.85%, while SAC dropped even more, to 37.11%. It is also observed that SAC+GRU curves converge faster, and oscillation amplitude is smaller in all scenarios. This may be because SAC is more sensitive to noise. The accumulation of errors in the training process eventually leads to an overestimation of the Q value, which makes the training effect worse.

Figure 10 shows the results of 100 battles with the trained SAC+GRU and SAC versus the game theory-based approach, respectively. SAC+GRU-based agents maintained a winning rate of 82%, 76%, and 69% in all three cases, respectively. SAC-based agents won 79%, 70%, and 55% of the time, respectively. The experimental results show that the SAC+GRU algorithm is superior to the traditional SAC algorithm in the POMDP environment because of its ability to extract historical information.

5.2.2. Evaluation of Self-Play Mechanism

In the training process of the agent, we design a course-learning method based on self-game, and the agent constantly confronts the historical policy to improve the current policy. The training generation ends when the average reward increase within 50 consecutive iterations is less than 5%. This generation policy is then stored in the history policy pool, and the parameters are frozen. When training the next generation policy, sample from the historical policy pool as described in Section 4.2.6. The first-generation opponent policy is the same as in the previous section, and the agents used in the previous section are all first-generation. Here, we use the

r_{V}

= 5 km agent because this is similar to the pilot’s perspective.

We ran a total of 5 iterations, as shown in Figure 11, with each curve representing the average reward of the current generation versus the previous generations. All learning curves grow with the number of training steps until they converge. This suggests that each generation’s policy has an advantage over the previous one. In addition, with the increase of algebra, the growth rate of the curve begins to slow down, the convergence value decreases, and the standard deviation increases. This is because the improvement of the policy allows the new policy to accumulate absolute advantages over a long period of time and magnifies the impact of wrong actions. After the training was completed, we conducted 100 tests between every two generations, the results of which are shown in Table 4. Our older generation combat policies have a higher win rate than our younger generation combat policies. This indicates that the course-learning method based on self-game can achieve a more stable policy learning process by interacting with its own past version. Statistics show that each generation has a higher victory rate in battles with the previous generation. But at the same time, as the number of iterations increases, the advantages begin to decrease until they converge to the optimal policy.

5.2.3. Behavior Analysis

After the fifth iteration, we test the performance of the agent (red) against the baseline in the three typical scenarios shown in Figure 12. They are advantage scenarios, disadvantage scenarios, and neutral scenarios. A total of 100 episodes were tested in each scenario, and the specific results are shown in Table 5.

Under favorable initial conditions, the agent can usually react quickly to the opponent’s behavior and track the opponent stably. Therefore, we mainly analyze the agent’s behavior in neutral and unfavorable scenarios. We used visual tools to analyze the two-party maneuvering trajectories stored during the test.

Figure 13 and Figure 14 show two policies for the agent under neutral initial conditions. In the first, as shown in Figure 13, the agents choose to control the turning speed after the intersection in order to maintain the energy at a high level while maneuvering in the vertical direction, further depleting the energy of the opponent. Eventually, the opponent’s mobility decreases due to low energy, and the agent has a higher speed, thus taking advantage and moving into the offensive position. In the second scenario, as shown in Figure 14, the agent chooses to convert speed into height after the horizontal intersection, thus obtaining a certain increase in mobility in the short term. At the same time, the reduction of speed reduces the turning radius so that the agent occupies a favorable position in the one-circle maneuver.

Figure 15 shows the agent’s policy under adverse initial conditions. The initial height, speed, and position of the agent are all at a disadvantage. The agent first tries to induce the opponent to overshoot by rolling scissors and barrel-rolling maneuvers while diving to recover energy. After that, the opponent chooses to pull up to avoid overshooting. At this point, the agent and the opponent enter the scissors maneuver in the vertical direction and take advantage of the smaller turning radius.

6. Conclusions

In this paper, we present a layered framework for air-combat-maneuver decision-making in a high-fidelity environment, taking into account the sensor limitations and errors in the real world. We designed a two-stage training mechanism, with the lower layer directly exporting control signals to various parts of the aircraft and the upper layer introducing a GRU network to generate maneuvering commands based on combat situation and historical information. In the training of combat policy, we also introduce the course-learning method based on self-game. The experimental results show that the control policy can track the target quickly, stably and accurately, and the effect is better than the traditional control method. The combat policy shows better performance and robustness to noise than the baseline method. We also analyze the maneuvering trajectories during the training process and demonstrate the strong learning ability of the agent. In future studies, we plan to further examine the applicability of the proposed method to real machines. In addition, future research will no longer be limited to single-aircraft confrontation, introducing high-precision radar missile models and increasing over-the-horizon air-combat scenarios.

Author Contributions

Conceptualization, J.M. and H.H.; methodology, H.H.; writing—original draft preparation, J.M. and G.L.; writing—review and editing, J.M. and H.H.; visualization, H.H. and G.L.; supervision, G.L.; project administration, J.M. and H.H.; funding acquisition, H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Hunan Natural Science Foundation under Grant 2024JJ6478.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
Austin, F.; Carbone, G.; Falco, M.; Hinz, H.; Lewis, M. Game theory for automated maneuvering during air-to-air combat. J. Guid. Control. Dyn. 1990, 13, 1143–1149. [Google Scholar] [CrossRef]
Cruz, J.B.; Simaan, M.A.; Gacic, A.; Jiang, H.; Letelliier, B.; Li, M.; Liu, Y. Game-theoretic modeling and control of a military air operation. IEEE Trans. Aerosp. Electron. Syst. 2001, 37, 1393–1405. [Google Scholar] [CrossRef]
Poropudas, J.; Virtanen, K. Game-Theoretic Validation and Analysis of Air Combat Simulation Models. IEEE Trans. Syst. Man, Cybern.-Part A Syst. Humans 2010, 40, 1057–1070. [Google Scholar] [CrossRef]
Chai, R.; Tsourdos, A.; Savvaris, A.; Xia, Y.; Chai, S. Real-Time Reentry Trajectory Planning of Hypersonic Vehicles: A Two-Step Strategy Incorporating Fuzzy Multiobjective Transcription and Deep Neural Network. IEEE Trans. Ind. Electron. 2020, 67, 6904–6915. [Google Scholar] [CrossRef]
Huang, C.; Dong, K.; Huang, H.; Tang, S. Autonomous air combat maneuver decision using Bayesian inference and moving horizon optimization. J. Syst. Eng. Electron. 2018, 29, 86–97. [Google Scholar] [CrossRef]
Qiu, X.; Yao, Z.; Tan, F.; Zhu, Z.; Lu, J.G. One-to-one Air-combat Maneuver Strategy Based on Improved TD3 Algorithm. In Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China, 6–8 November 2020. [Google Scholar] [CrossRef]
Wang, L.; Wei, H. Research on Autonomous Decision-Making of UCAV Based on Deep Reinforcement Learning. In Proceedings of the 2022 3rd Information Communication Technologies Conference (ICTC), Nanjing, China, 6–8 May 2022. [Google Scholar] [CrossRef]
Xianyong, J.; Hou, M.; Wu, G.; Ma, Z.; Tao, Z. Research on Maneuvering Decision Algorithm Based on Improved Deep Deterministic Policy Gradient. IEEE Access 2022, 10, 92426–92445. [Google Scholar] [CrossRef]
Wang, L.; Wang, J.; Liu, H.; Yue, T. Decision-Making Strategies for Close-Range Air Combat Based on Reinforcement Learning with Variable-Scale Actions. Aerospace 2023, 10, 401. [Google Scholar] [CrossRef]
Wei, Y.; Zhang, H.; Wang, Y.; Huang, C. Maneuver Decision-Making through Automatic Curriculum Reinforcement Learning without Handcrafted Reward Functions. Appl. Sci. 2023, 13, 9421. [Google Scholar] [CrossRef]
Chen, R.; Li, H.; Yan, G.; Peng, H.; Zhang, Q. Hierarchical Reinforcement Learning Framework in Geographic Coordination for Air Combat Tactical Pursuit. Entropy 2023, 25, 1409. [Google Scholar] [CrossRef]
Wang, D.; Zhang, J.; Yang, Q.; Liu, J.; Shi, G.; Zhang, Y. An Autonomous Attack Decision-Making Method Based on Hierarchical Virtual Bayesian Reinforcement Learning. IEEE Trans. Aerosp. Electron. Syst. 2024. [Google Scholar] [CrossRef]
Sun, L.; Qiu, H.; Wang, Y.; Yan, C. Autonomous UAV maneuvering decisions by refining opponent strategies. IEEE Trans. Aerosp. Electron. Syst. 2024, 60, 3454–3467. [Google Scholar] [CrossRef]
Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Process. Mag. 2017, 34, 26–38. [Google Scholar] [CrossRef]
Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar] [CrossRef]
Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Dębiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. Dota 2 with large scale deep reinforcement learning. arXiv 2019, arXiv:1912.06680. [Google Scholar]
Yang, Q.; Zhang, J.; Shi, G.; Hu, J.; Wu, Y. Maneuver Decision of UAV in Short-Range Air Combat Based on Deep Reinforcement Learning. IEEE Access 2020, 8, 363–378. [Google Scholar] [CrossRef]
Hu, D.; Yang, R.; Zuo, J.; Zhang, Z.; Wu, J.; Wang, Y. Application of Deep Reinforcement Learning in Maneuver Planning of Beyond-Visual-Range Air Combat. IEEE Access 2021, 9, 32282–32297. [Google Scholar] [CrossRef]
Li, B.; Gan, Z.; Chen, D.; Sergey Aleksandrovich, D. UAV maneuvering target tracking in uncertain environments based on deep reinforcement learning and meta-learning. Remote Sens. 2020, 12, 3789. [Google Scholar] [CrossRef]
Din, A.F.; Mir, I.; Gul, F.; Mir, S. Non-linear intelligent control design for unconventional unmanned aerial vehicle. In Proceedings of the AIAA SCITECH 2023 Forum, National Harbor, MD, USA, 23–27 January 2023; p. 1071. [Google Scholar]
Zhang, H.; Zhou, H.; Wei, Y.; Huang, C. Autonomous maneuver decision-making method based on reinforcement learning and Monte Carlo tree search. Front. Neurorobotics 2022, 16, 996412. [Google Scholar] [CrossRef]
Jiang, Y.; Yu, J.; Li, Q. A novel decision-making algorithm for beyond visual range air combat based on deep reinforcement learning. In Proceedings of the 2022 37th Youth Academic Annual Conference of Chinese Association of Automation (YAC), Beijing, China, 19–20 November 2022. [Google Scholar] [CrossRef]
Yuan, W.; Xiwen, Z.; Rong, Z.; Shangqin, T.; Huan, Z.; Wei, D. Research on UCAV Maneuvering Decision Method Based on Heuristic Reinforcement Learning. Comput. Intell. Neurosci. 2022, 2022, 1477078. [Google Scholar] [CrossRef]
Pope, A.P.; Ide, J.S.; Mićović, D.; Diaz, H.; Twedt, J.C.; Alcedo, K.; Walker, T.T.; Rosenbluth, D.; Ritholtz, L.; Javorsek, D. Hierarchical reinforcement learning for air combat at DARPA’s AlphaDogfight trials. IEEE Trans. Artif. Intell. 2022, 4, 1371–1385. [Google Scholar] [CrossRef]
Chai, J.; Chen, W.; Zhu, Y.; Yao, Z.X.; Zhao, D. A hierarchical deep reinforcement learning framework for 6-DOF UCAV air-to-air combat. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 5417–5429. [Google Scholar] [CrossRef]
Yang, Z.; Nguyen, H. Recurrent Off-policy Baselines for Memory-based Continuous Control. arXiv 2021, arXiv:cs.LG/2110.12628. [Google Scholar]
Berndt, J. JSBSim: An Open Source Flight Dynamics Model in C++. In Proceedings of the AIAA Modeling and Simulation Technologies Conference and Exhibit, Providence, RI, USA, 16–19 August 2004. [Google Scholar] [CrossRef]
Nguyen, L.T. Simulator Study of Stall/Post-Stall Characteristics of a Fighter Airplane with Relaxed Longitudinal Static Stability; National Aeronautics and Space Administration: Washington, DC, USA, 1979; Volume 12854. [Google Scholar]
Kaelbling, L.P.; Littman, M.L.; Cassandra, A.R. Planning and acting in partially observable stochastic domains. Artif. Intell. 1998, 101, 99–134. [Google Scholar] [CrossRef]
Haarnoja, T.; Tang, H.; Abbeel, P.; Levine, S. Reinforcement Learning with Deep Energy-Based Policies. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017. [Google Scholar]
Schulman, J.; Chen, X.; Abbeel, P. Equivalence Between Policy Gradients and Soft Q-Learning. arXiv 2017, arXiv:1704.06440. [Google Scholar]
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
Fujimoto, S.; Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
Stevens, B.L.; Lewis, F.L.; Johnson, E.N. Aircraft Control and Simulation: Dynamics, Controls Design, and Autonomous Systems; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]

Figure 1. Weapon Engagement Zone (WEZ).

Figure 2. The hierarchical decision frame is constructed with two layers: the upper (combat policy) and the lower (control policy).

Figure 3. Geometry of air combat.

Figure 4. POMDP environment design.

Figure 5. Network architecture of SAC+GRU.

Figure 6. Performance of control policy under different combinations of coefficients. The maximum value is obtained when

β_{1} = 1.22

and

β_{2} = 0.27

.

Figure 6. Performance of control policy under different combinations of coefficients. The maximum value is obtained when

β_{1} = 1.22

and

β_{2} = 0.27

.

Figure 7. Average episode reward and tracking error of the control policy.

Figure 8. Performance of SAC-based and PID-based control policies tracking two different target signals. (a) Sinusoidal signal. (b) Random signal.

Figure 9. Average episode reward of SAC+GRU and SAC in restricted observation space. (a)

r_{V}

= 10 km; (b)

r_{V}

= 3 km; (c)

r_{V}

= 1 km.

Figure 9. Average episode reward of SAC+GRU and SAC in restricted observation space. (a)

r_{V}

= 10 km; (b)

r_{V}

= 3 km; (c)

r_{V}

= 1 km.

Figure 10. Combat results of SAC+GRU and SAC in restricted observation space. (a)

r_{V}

= 10 km; (b)

r_{V}

= 3 km; (c)

r_{V}

= 1 km.

Figure 10. Combat results of SAC+GRU and SAC in restricted observation space. (a)

r_{V}

= 10 km; (b)

r_{V}

= 3 km; (c)

r_{V}

= 1 km.

Figure 11. Learning curves of 5 generations. Each curve represents the change in the average episode reward when the current policy battles the previous policy.

Figure 12. Testing scenarios. (a) Advantage scenario; (b) Disadvantage scenario; (c) Neutral scenario.

Figure 13. Neutral scenario. Win by high energy. (a) Beginning; (b) Maintain speed; (c) Boom and Zoom; (d) Take advantage; (e) Track.

Figure 14. Neutral scenario. Win by a faster turn. (a) Beginning; (b) Pull up; (c) One-circle maneuver; (d) Inside the opponent’s turn path; (e) Track.

Figure 15. Disadvantage scenario. Win by defense and counterattack. (a) Beginning; (b) Induce the opponent to overshoot; (c) Vertical Rolling Scissors; (d) Take advantage of the smaller turning radius; (e) Track.

Table 1. Basic state definitions of 6-DOF aircraft.

State	Definition	State	Definition
x	x-axis position	$α$	angle of attack
y	y-axis position	$β$	sideslip angle
h	altitude	$δ_{a}$	aileron position
$u, v, w$	linear velocity	$δ_{e}$	elevator position
$ϕ, θ, ψ$	roll, pitch and yaw angles	$δ_{r}$	rudder position
$p, q, r$	roll, pitch and yaw rate	$δ_{t}$	throttle

Table 2. Action and bound.

Action	Bound	Action	Bound
$c_{a}$	$[- 21.5 °, 21.5 °]$	$c_{e}$	$[- 25 °, 25 °]$
$c_{r}$	$[- 30 °, 30 °]$	$c_{t}$	$[0, 1]$

Table 3. Parameter settings.

Parameter	Setting
Discount factor $γ$	0.99
Worker number	64
Optimizer	Adam
Learning rate	0.0001
Replay buffer size	1 × 10⁶
Iteration per epoch	10
Batch size	12,800
Activation function	ReLU
Sequence length	64

Table 4. Combat results between different generations.

Red∖Blue	Gen 1	Gen 2	Gen 3	Gen 4
Gen 2	78/12/10	-	-	-
Gen 3	80/13/7	73/16/11	-	-
Gen 4	76/9/15	77/10/13	70/14/16	-
Gen 5	81/12/14	75/11/14	72/12/14	69/13/18

Table 5. Test results in three typical scenarios.

Scenarios	Win	Lose	Draw
Advantage	95%	3%	2%
Disadvantage	76%	19%	5%
Neutral	86%	8%	6%
Total	85.7%	10.0%	4.3%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Mei, J.; Li, G.; Huang, H. Deep Reinforcement-Learning-Based Air-Combat-Maneuver Generation Framework. Mathematics 2024, 12, 3020. https://doi.org/10.3390/math12193020

AMA Style

Mei J, Li G, Huang H. Deep Reinforcement-Learning-Based Air-Combat-Maneuver Generation Framework. Mathematics. 2024; 12(19):3020. https://doi.org/10.3390/math12193020

Chicago/Turabian Style

Mei, Junru, Ge Li, and Hesong Huang. 2024. "Deep Reinforcement-Learning-Based Air-Combat-Maneuver Generation Framework" Mathematics 12, no. 19: 3020. https://doi.org/10.3390/math12193020

APA Style

Mei, J., Li, G., & Huang, H. (2024). Deep Reinforcement-Learning-Based Air-Combat-Maneuver Generation Framework. Mathematics, 12(19), 3020. https://doi.org/10.3390/math12193020

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Deep Reinforcement-Learning-Based Air-Combat-Maneuver Generation Framework

Abstract

1. Introduction

2. Related Work

3. Preliminaries

3.1. Air-Combat Scenario Description

3.2. Partially Observable Markov Decision Processes (POMDP)

3.3. Maximum Entropy Reinforcement Learning

3.4. The Soft Actor–Critic Algorithm

4. Method

4.1. The 6-DOF Dynamic Control Policy

4.1.1. State Space

4.1.2. Action Space

4.1.3. Reward

4.2. The SAC+GRU-Based Combat Policy

4.2.1. Observation Space

4.2.2. Action Space

4.2.3. The Design of POMDP Environment

4.2.4. Reward

4.2.5. Network Architecture

4.2.6. The Self-Play Training Mechanism

5. Experiment

5.1. Experiments on Control Policy

5.2. Experiments on Combat Policy

5.2.1. Evaluation of SAC+GRU Algorithm

5.2.2. Evaluation of Self-Play Mechanism

5.2.3. Behavior Analysis

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI