1. Introduction
Traditional aerial games challenge the piloting skills of players, especially in close-range scenarios where quick reactions are paramount. With the continuous advancement of unmanned systems technology and information technology, unmanned aerial vehicles (UAVs) are increasingly utilized in various tasks, from data collection to simulated challenges [1]. They offer advantages such as cost-effectiveness, high maneuverability, and advanced stealth capabilities. Furthermore, the extensive use of unmanned and intelligent systems in various real-world applications underscores the future direction of aerial operations [2,3,4]. The next evolution of aerial games will likely be dominated by unmanned, autonomous, intelligent, and beyond-visual-range capabilities [5].
The core challenge of intelligent beyond-visual-range UAV aerial games revolves around the development of autonomous aerial control (AAC) capabilities. These capabilities enable UAVs to operate effectively without human intervention, achieving situational awareness through environmental sensing and analysis. However, UAVs face several key obstacles, including difficulties in reliably detecting and interpreting complex scenarios, making real-time maneuvering decisions in dynamic environments, and executing precise control commands in highly unstable flight conditions. To address these issues, our approach focuses on enhancing UAVs’ ability to autonomously assess their surroundings, process real-time data, and adapt their flight strategy accordingly. Specifically, we utilize reinforcement learning algorithms to improve decision-making and control performance, ensuring that UAVs can autonomously navigate and perform precise maneuvers. This enables UAVs to handle the challenges posed by beyond-visual-range aerial games, where rapid, accurate decision-making and control are critical.
Research methodologies related to autonomous aerial maneuver decision-making fall into three primary categories: mathematical modeling, computer-based traversal search, and intelligent algorithms [6].
Mathematical modeling methods frame typical aerial game tasks as mathematical optimization problems. The literature [7,8,9] adopts the homicidal chauffeur problem as a representative model. However, this approach comes with constraints such as planar and constant-speed conditions, limiting its practical applicability. The literature [10,11] delves into pursuit–evasion (PE) scenarios, with predefined roles for pursuer and evader. Such models may not always resonate with modern aerial games where roles can interchange dynamically. The literature [12] also examines multi-objective optimization problems, determining game outcomes based on flight data. These methods often grapple with limited applicability and challenges in deriving analytical solutions.
Computer-based traversal search methodologies discretize potential maneuver control commands, calculating optimal decisions based on situational awareness. This strategy is prevalent in decision-making processes. The literature [13,14,15,16] investigates the A* algorithm and its variants for pathfinding and obstacle navigation. Other literature [17,18] delves into the influence map method, offering a clear representation of key aerial game factors, albeit at the cost of computational complexity. Studies [19,20] explore adaptive dynamic programming (ADP), which, through data-driven techniques, determines search intervals for commands. Its practical application remains challenging due to inherent constraints.
Intelligent algorithms transform situational awareness data into maneuver decisions, employing operations like feature extraction, clustering, and fitting on extensive datasets. This category encompasses neural networks [21], fuzzy logic [22], genetic algorithms [23], reinforcement learning [24], Bayesian networks [25], and other intelligent algorithms [26,27]. Neural networks offer powerful function approximation capabilities, making them suitable for modeling complex nonlinear dynamics; however, they typically require large labeled datasets and may lack interpretability. Fuzzy logic systems excel at incorporating human-like reasoning into control systems and are effective in dealing with uncertainties, but they struggle with high-dimensional input spaces and require expert-designed rule sets. Genetic algorithms are useful for global optimization and parameter tuning in UAV control strategies, yet they are often computationally expensive and slow to converge in real-time applications. Bayesian networks provide a probabilistic framework for reasoning under uncertainty and are particularly effective in tasks involving causal inference; however, they require well-defined prior knowledge and can become intractable in complex domains. Reinforcement learning (RL) stands out by enabling agents to learn optimal control strategies directly from interaction with the environment, without requiring explicit modeling of the dynamics. RL is especially well suited to scenarios with delayed rewards and sparse feedback, making it a focal point for many researchers in the field of autonomous UAV control.
Reinforcement learning is a branch of machine learning focused on how agents learn to make decisions by interacting with an environment to achieve a specific goal. Unlike supervised learning, which relies on labeled data, RL agents learn from trial and error, receiving feedback in the form of rewards.
At the core of reinforcement learning are two entities: an agent and an environment. The agent perceives the current state of the environment, selects an action based on a policy, and receives a reward as feedback. This process repeats over time, forming a sequence of interactions that continues until the task (or episode) is completed. This loop follows the Markov decision process (MDP) framework, as illustrated in Figure 1. Through continuous interaction, the agent uses the collected data (states, actions, and rewards) to improve its policy, aiming to discover an optimal strategy that maximizes long-term cumulative rewards.
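For concreteness, the following minimal sketch shows this interaction loop using a Gym-style interface; the environment here is a generic placeholder rather than the UAV simulator used later in this paper, and the random action stands in for a learned policy.

```python
import gymnasium as gym

# A generic agent-environment interaction loop following the MDP framework:
# the agent observes a state, selects an action, and receives a reward,
# repeating until the episode terminates.
env = gym.make("Pendulum-v1")            # placeholder environment
obs, info = env.reset(seed=0)
episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()   # stand-in for a learned policy pi(a|s)
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated
env.close()
```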
Deep reinforcement learning is an advanced framework that combines the decision-making capabilities of reinforcement learning with the powerful representation learning abilities of deep neural networks. While traditional reinforcement learning struggles in high-dimensional or continuous state spaces due to the need for explicit state–action mappings, DRL overcomes these limitations by using neural networks to approximate value functions, policies, or both.
DRL methods can be broadly categorized into value-based methods and policy-based methods, where value-based methods (e.g., deep Q-networks) aim to learn the value of taking certain actions in given states and select actions by maximizing estimated values. Policy-based methods directly optimize the policy that maps states to actions, which can better handle continuous action spaces and stochastic policies.
This integration allows DRL to effectively handle tasks with large or continuous state and action spaces, improve convergence stability, and enable more scalable and generalizable learning across complex environments.
The literature [28,29] delves into close-range aerial game maneuver decision-making using deep reinforcement learning methods, yet it focuses solely on a three-degrees-of-freedom aircraft model, diverging from the more realistic six-degrees-of-freedom model employed in advanced aerial games. Furthermore, close-quarter engagements are becoming less predominant in contemporary aerial games, with beyond-visual-range engagements becoming more central. Other literature [30,31] adopts a hierarchical reinforcement learning strategy, crafting a semi-Markov decision model and training sub-policies and policy selectors to achieve autonomous UAV maneuver decision-making. This method, however, necessitates meticulous reward function design for each maneuver, complicating the reward structuring for intricate maneuvers. Moreover, it is mainly tailored for theoretical maneuver decision-making research and poses challenges for precise control. Study [32] presents an end-to-end reinforcement learning technique for beyond-visual-range aerial game decision-making, yet confronts issues like sparse rewards, prolonged training duration, and convergence obstacles. Studies [33,34] employ a two-dimensional environment for their simulations, which proves to be a hurdle in mimicking genuine game scenarios. Study [35] establishes a high-accuracy model grounded in actual UAV metrics, examines reinforcement learning strategies for dynamic target tracking, and introduces the IFHER method to transform failed flight paths into successful experiences, enhancing trajectory efficiency and expediting training convergence. However, dynamic target tracking is merely a secondary objective in beyond-visual-range aerial games. Other literature [36,37] proposes a deep reinforcement learning-oriented method for autonomous UAV evasion in game settings. Still, its specificity to certain evasion models makes it less adaptable to parameter alterations. Study [38] explores UAV waypoint navigation control using the DDPG algorithm and emulates chase–evade scenarios. Nevertheless, this method curtails roll and pitch angles during training to guarantee convergence and stability, making it apt only for waypoint navigation and not ideal for intricate and agile game environments.
In summary, intelligent unmanned beyond-visual-range aerial games face the following three main challenges:
Simulation Fidelity Issue: Much of the current research employing deep reinforcement learning algorithms is limited to simplified 2D environments or three degrees of freedom, resulting in diminished simulation precision.
Research Completeness Issue: There exists a gap in comprehensive and in-depth studies in the realm of beyond-visual-range (BVR) intelligent games.
Reward Function Design Challenge: Crafting a fitting reward function for intricate maneuvers using deep reinforcement learning presents a significant hurdle.
This paper introduces a framework for intelligent unmanned beyond-visual-range aerial games with the following primary contributions:
Addressing the Simulation Fidelity Issue: We employed the deep reinforcement learning PPO method for maneuver control and developed a 6-DOF high-fidelity UAV model and simulation environment. To tackle the intricacy of reward function design, we introduced the PHP-ROW training method. This approach speeds up training of the UAV controller for reaching the intended heading, altitude, and speed, from which action commands are formulated. It not only boosts interpretability but also smooths the transition from simulations to real-world UAVs.
Enhancing Research Completeness: A rigorous examination of the pivotal components in BVR aerial games was undertaken, leading to the formulation of a key situational vector and a performance function, constituting the situational awareness module in our framework.
Introducing a Novel Reward Function Design: We designed six maneuvers—directional navigation, altitude adjustment, objective targeting, trajectory optimization, evasive action, and agile turns. Utilizing a Bayesian network structure and expert dataset-based training, we achieved inference from situational data to maneuver actions.
A Holistic BVR Game Workflow: By amalgamating situational awareness, maneuver decision-making, and precise control, we accomplished a comprehensive BVR game sequence. The efficiency of our framework was subsequently ascertained through human-machine BVR game simulations.
The subsequent sections of this paper are structured as follows:
Section 2 outlines the problem, encompassing the six-degrees-of-freedom UAV dynamics model and the creation of situational awareness vectors.
Section 3 details the PHP-ROW method, delineating the design of the deep reinforcement learning-based maneuver control algorithm.
Section 4 elucidates the entire process and methodologies of beyond-visual-range UAV aerial games, spanning situational awareness, maneuver decision-making, and precise control.
Section 5 curates waypoint navigation tasks and human–machine BVR game scenarios, executing experiments and evaluating outcomes.
Section 6 concludes the paper, broaching potential avenues for future exploration.
2. Problem Formulation
In this section, the key concepts and problem models are introduced:
Section 2.1 establishes a six-degrees-of-freedom dynamics model to simulate the motion state of UAVs in the game environment,
Section 2.2 designs basic situational vectors and performance functions to establish an accurate situational awareness model, and
Section 2.3 implements autonomous situational awareness and tactical decision-making through the combination of fuzzy methods and Bayesian network inference.
2.1. UAV Dynamic Model
In the study of high-precision unmanned aerial vehicle (UAV) simulations for beyond-visual-range aerial games, a primary task is to construct a detailed six-degrees-of-freedom dynamics model to simulate the UAV’s motion behavior in various in-game scenarios. This model should encompass the UAV’s movement in six degrees of freedom: three translational degrees of freedom (displacement along the X, Y, and Z axes) and three rotational degrees of freedom (roll, pitch, and yaw), as depicted in
Figure 2.
To accurately simulate the UAV’s dynamic behavior within aerial game scenarios, it is essential to consider the aircraft’s physical characteristics, propulsion system, aerodynamic performance, and control system. First, a point-mass dynamics model for the aircraft will be established, describing the motion of a point mass in six degrees of freedom. Following this, the aircraft’s rotational motion, encompassing roll, pitch, and yaw, will be taken into account. This step requires considering the aircraft’s moment of inertia matrix and the torques produced to induce such rotational movements. The aircraft’s propulsion system, which includes engine thrust and torques, will be integrated, along with the impact of aerodynamic forces and moments.
Ultimately, we aim to develop a comprehensive six-degrees-of-freedom dynamics model that can simulate the UAV’s maneuvers in various in-game scenarios. This model will facilitate high-fidelity UAV flight testing within a simulation environment, evaluating its performance in beyond-visual-range aerial games.
The body-fixed coordinate system acceleration is defined as follows:

$$\begin{bmatrix} \dot{u} \\ \dot{v} \\ \dot{w} \end{bmatrix} = \begin{bmatrix} rv - qw \\ pw - ru \\ qu - pv \end{bmatrix} + \frac{1}{m}\begin{bmatrix} F_x \\ F_y \\ F_z \end{bmatrix} + \mathbf{T}_{gb}\begin{bmatrix} 0 \\ 0 \\ G \end{bmatrix}$$

where $u$, $v$, $w$ represent the velocities along the body-fixed coordinate system's $x$, $y$, $z$ axes; $p$, $q$, $r$ represent the angular velocities about those axes; $F_x$, $F_y$, $F_z$ represent the external forces acting along the body axes; $m$ denotes the mass of the aircraft; $G$ denotes the acceleration due to gravity; and $\mathbf{T}_{gb}$ represents the transformation matrix from the ground coordinate system to the body-fixed coordinate system. The body-fixed coordinate system angular acceleration is defined as follows:
$$\begin{bmatrix} \dot{p} \\ \dot{q} \\ \dot{r} \end{bmatrix} = \mathbf{I}^{-1}\left(\begin{bmatrix} M_x \\ M_y \\ M_z \end{bmatrix} - \begin{bmatrix} p \\ q \\ r \end{bmatrix} \times \mathbf{I}\begin{bmatrix} p \\ q \\ r \end{bmatrix}\right)$$

where $\mathbf{I}$ represents the moment of inertia of the aircraft, and $M_x$, $M_y$, $M_z$ denote the external moments acting about the body-fixed coordinate system's $x$, $y$, $z$ axes, respectively. Integrating the accelerations and angular accelerations over time yields the body-fixed coordinate system velocities $(u, v, w)$ and angular velocities $(p, q, r)$.
Converting body-fixed coordinate system angular velocities to Euler angular velocities can be expressed as

$$\begin{bmatrix} \dot{\phi} \\ \dot{\theta} \\ \dot{\psi} \end{bmatrix} = \mathbf{T}_{\omega}\begin{bmatrix} p \\ q \\ r \end{bmatrix},\qquad \mathbf{T}_{\omega} = \begin{bmatrix} 1 & \sin\phi\tan\theta & \cos\phi\tan\theta \\ 0 & \cos\phi & -\sin\phi \\ 0 & \sin\phi/\cos\theta & \cos\phi/\cos\theta \end{bmatrix}$$

where $\phi$, $\theta$, $\psi$ represent the roll, pitch, and yaw angles of the body-fixed coordinate system relative to the ground coordinate system, and $\mathbf{T}_{\omega}$ represents the transformation matrix from body-fixed coordinate system angular velocities to Euler angular velocities.
Converting body-fixed coordinate system velocities to ground coordinate system velocities can be defined as follows:

$$\begin{bmatrix} \dot{x}_g \\ \dot{y}_g \\ \dot{z}_g \end{bmatrix} = \mathbf{T}_{gb}^{\top}\begin{bmatrix} u \\ v \\ w \end{bmatrix}$$

where $\dot{x}_g$, $\dot{y}_g$, $\dot{z}_g$ represent the velocities along the ground coordinate system's $x$, $y$, $z$ axes, respectively. Integrating the ground coordinate system velocities yields the aircraft's position in the ground coordinate system, whereas integrating the Euler angular velocities provides the Euler angles.
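To make the model concrete, the sketch below performs one explicit-Euler integration step of the rigid-body equations in the standard notation used above. The force and moment inputs, the integration scheme, and the step size are illustrative assumptions, not the exact implementation used in this work.

```python
import numpy as np

def rotation_ground_to_body(phi, theta, psi):
    """Z-Y-X Euler rotation: ground (NED) frame -> body frame."""
    cph, sph = np.cos(phi), np.sin(phi)
    cth, sth = np.cos(theta), np.sin(theta)
    cps, sps = np.cos(psi), np.sin(psi)
    Rz = np.array([[cps, sps, 0], [-sps, cps, 0], [0, 0, 1]])
    Ry = np.array([[cth, 0, -sth], [0, 1, 0], [sth, 0, cth]])
    Rx = np.array([[1, 0, 0], [0, cph, sph], [0, -sph, cph]])
    return Rx @ Ry @ Rz

def step_6dof(state, forces, moments, m, I, G=9.81, dt=0.01):
    """One explicit-Euler step of the 6-DOF equations sketched above.
    state = (uvw, pqr, euler, pos) as NumPy arrays; forces/moments are body-frame vectors."""
    uvw, pqr, euler, pos = state
    phi, theta, psi = euler
    T_gb = rotation_ground_to_body(phi, theta, psi)

    # Translational dynamics: Coriolis term + specific force + gravity mapped into body frame
    uvw_dot = -np.cross(pqr, uvw) + forces / m + T_gb @ np.array([0.0, 0.0, G])
    # Rotational dynamics: Euler's equation with inertia matrix I
    pqr_dot = np.linalg.solve(I, moments - np.cross(pqr, I @ pqr))
    # Body rates -> Euler angle rates
    T_w = np.array([
        [1, np.sin(phi) * np.tan(theta), np.cos(phi) * np.tan(theta)],
        [0, np.cos(phi),                 -np.sin(phi)],
        [0, np.sin(phi) / np.cos(theta), np.cos(phi) / np.cos(theta)],
    ])
    euler_dot = T_w @ pqr
    # Body velocities -> ground-frame velocities
    pos_dot = T_gb.T @ uvw

    return (uvw + uvw_dot * dt, pqr + pqr_dot * dt,
            euler + euler_dot * dt, pos + pos_dot * dt)
```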
2.2. Situational Awareness Model
In beyond-visual-range aerial games, precise situational awareness and an in-depth grasp of the game environment are vital. To meet this goal, we design essential situational vectors and performance functions. These vectors and functions guide unmanned aircraft in understanding the current state of the game and offer pivotal data for intelligent decision-making.
We outline and create these key situational vectors, encompassing crucial details like the positions, velocities, altitudes, and orientations of the unmanned aircraft and other game entities. Specifically, this scenario is designed to simulate a typical airspace encounter where two UAVs, red (the ego UAV) and blue (the target or intruder UAV), interact in a shared environment. The setup captures the key relative motion parameters: the distance difference, the altitude difference, the relative heading angles, and the velocities of the two aircraft. These variables are commonly used in both air combat and collision avoidance contexts; they thoroughly portray the dynamic situation within the game and endow the unmanned aircraft with a complete understanding of the game environment, as illustrated in Figure 3.
To achieve comprehensive game situational awareness and facilitate autonomous beyond-visual-range operations for unmanned aerial vehicles (UAVs) within our aerial game environment, we have pinpointed key situational elements to formulate a performance function. The situational data include:
Binary Variables: alert signal W, target identification signal S, guidance signal G, operational boundary effectiveness F, and enemy entity action trigger L. These variables can only assume values of 0 or 1.
Performance Variables: the proximity score, altitude score, velocity score, and angular score are the four performance variables. We normalize the values of these performance variables to the range [−1, 1].
The proximity score and altitude score are used to quantify the distance between our aircraft and the target, with a lower value indicating closer spatial proximity between the two.
The velocity score describes the difference in velocity between the two aircraft and is used to quantify their dynamic relative state. These scores can effectively assess potential threats in air combat and provide a scientific basis for tactical decision-making.
The angular score describes the angular difference between the two aircraft; since the relative heading angle lies in the range [−180°, 180°], the angular score is obtained by normalizing this angle into [−1, 1].
These situational factors are pivotal for informed decision-making in autonomous beyond-visual-range UAV game scenarios, facilitating dynamic evaluation and adaptation to evolving conditions within the game environment.
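The following sketch illustrates how such normalized performance scores could be computed from raw relative-state quantities. The function name, scaling constants, and functional forms are illustrative placeholders, not the paper's exact definitions.

```python
import numpy as np

def performance_scores(d, d_max, h_own, h_tgt, v_own, v_tgt, rel_heading_deg,
                       h_scale=1000.0, v_scale=100.0):
    """Toy normalized performance scores in [-1, 1]; all forms and constants
    are illustrative placeholders."""
    proximity = 2.0 * np.clip(d / d_max, 0.0, 1.0) - 1.0                     # lower -> closer
    altitude  = 2.0 * np.clip(abs(h_own - h_tgt) / h_scale, 0.0, 1.0) - 1.0  # lower -> closer
    velocity  = float(np.tanh((v_own - v_tgt) / v_scale))                    # relative speed state
    angular   = rel_heading_deg / 180.0                                      # [-180, 180] deg -> [-1, 1]
    return proximity, altitude, velocity, angular
```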
2.3. Maneuver Control Decision Model
In this section, we introduce six long-range game maneuver actions and break them down based on three criteria: desired altitude, desired velocity, and desired heading. Next, we design the Bayesian network topology and utilize a fuzzification method for data processing to simplify the data complexity. Finally, we detail the training and inference process of the Bayesian network, achieving the transformation from the game environment to the optimal maneuver decision.
2.3.1. Game Maneuver Design
In this section, we design six game maneuver actions and conditionally decompose them based on the desired heading, desired altitude, and desired velocity. The game maneuver actions and their decomposition are presented in
Table 1.
Directional Attack Maneuver: This maneuver has relatively low maneuverability requirements. It involves directing the unmanned aircraft towards a specific direction based on provided point penetration location information to complete a directional attack mission.
Climbing Expansion Maneuver: This maneuver has moderate maneuverability requirements. It involves flying towards a target based on radar-detected target azimuth information while climbing in altitude. This helps improve the weapon’s attack envelope.
Tail Escape Maneuver: This maneuver has high maneuverability requirements. It demands the unmanned aircraft to rapidly change direction to evade incoming missiles.
Offset Guidance Maneuver: This maneuver has relatively low maneuverability requirements. It requires the radar to simultaneously illuminate both the missile and the target, placing the target at the edge of the radar’s detection range.
S Maneuver: The maneuver’s maneuverability requirements depend on the specific situation. Performing large-angle turns can also be considered a high-mobility action. S maneuver is typically used for tasks such as searching for targets within a specific range and depleting the energy of pursuing missiles. Another characteristic of the S maneuver is the turn and heading-holding times. For long-duration maneuvers and large turn radii, it can be divided into multiple heading commands at a specific granularity, forming the S maneuver.
Weapon Launch Maneuver: This maneuver requires maintaining heading, altitude, and speed. The key to this maneuver is the timing of the decision to launch weapons.
In Table 1, the decomposition variables denote, respectively, the geographical azimuth for point penetration, the azimuth angle of radar-detected targets, the geographical azimuth of incoming missiles, the geographical azimuth for guided missiles, and the desired heading for the turning control of the S maneuver. We have set the default values for the desired velocity at Mach 0.59 and the desired altitude at 2000/4000 m, but these values can be adjusted as needed.
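The sketch below illustrates the spirit of this decomposition: each maneuver is mapped to desired heading, altitude, and speed setpoints. All names, offsets, and default values are illustrative placeholders rather than the entries of Table 1.

```python
from dataclasses import dataclass

@dataclass
class Setpoint:
    heading_deg: float   # desired heading
    altitude_m: float    # desired altitude
    speed: float         # desired speed (e.g., Mach number; units are illustrative)

def decompose(maneuver: str, azimuth_deg: float,
              default_alt: float = 4000.0, default_speed: float = 0.59) -> Setpoint:
    """Illustrative maneuver-to-setpoint decomposition in the spirit of Table 1."""
    if maneuver == "directional_attack":
        return Setpoint(azimuth_deg, default_alt, default_speed)
    if maneuver == "climbing_expansion":
        return Setpoint(azimuth_deg, default_alt + 2000.0, default_speed)
    if maneuver == "tail_escape":
        return Setpoint((azimuth_deg + 180.0) % 360.0, default_alt, default_speed)
    if maneuver == "offset_guidance":
        return Setpoint(azimuth_deg + 50.0, default_alt, default_speed)  # offset angle is a placeholder
    if maneuver == "s_maneuver":
        return Setpoint(azimuth_deg, default_alt, default_speed)         # heading alternates over time
    if maneuver == "weapon_launch":
        return Setpoint(azimuth_deg, default_alt, default_speed)         # hold current setpoints
    raise ValueError(f"unknown maneuver: {maneuver}")
```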
2.3.2. Bayesian Formula and Bayesian Network
Unlike the traditional frequentist approach, Bayesianism places greater emphasis on prior information, asserting that parameters are not fixed and immutable, but instead possess a prior distribution based on past observational data. This prior distribution is not necessarily accurate and can be updated based on subsequent observational data, yielding the posterior distribution of the parameters. This approach enables a more accurate and comprehensive understanding of the parameters, shown as

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$

where $P(\theta)$ represents the prior probability of the parameter $\theta$, encapsulating our initial beliefs or knowledge about $\theta$ before any data observations; $P(D \mid \theta)$ denotes the likelihood function, indicating the probability of observing data $D$ given a specific value of the parameter $\theta$; $P(D)$ is a fixed constant representing the probability of observing the specific data $D$; and, finally, $P(\theta \mid D)$ quantifies the posterior probability of the parameter $\theta$ after observing the sample $D$, reflecting our updated beliefs regarding $\theta$ based on the observed data.
Bayesian networks are graphical models depicting probabilistic relationships among a set of variables. They have the capability to capture relationships between variables and can update beliefs about the target variable based on new data. A Bayesian network is typically represented as

$$B = \langle G, D \rangle,\qquad G = (V, E)$$

where $G$ is a directed acyclic graph (DAG) with nodes $V$ and directed edges $E$, and $D$ represents a set of local probability distributions associated with each variable in $V$.
According to the Markov assumption, we can combine the conditional distributions of the individual variables to obtain the joint probability distribution of $V$, shown as

$$P(V) = \prod_{X \in V} P\big(X \mid \mathrm{Pa}(X)\big)$$

where $\mathrm{Pa}(X)$ represents the parent nodes of $X$, and $P(X \mid \mathrm{Pa}(X))$ represents the local probability of variable $X$, which can be obtained from $D$. Consequently, $\langle G, D \rangle$ uniquely determines the joint probability distribution of $V$.
2.3.3. Tactical Maneuver Inference
In the preceding sections, we have constructed critical situational information. However, when the unmanned aircraft perceives situational data, it needs to autonomously decide which tactical maneuver to take. We design a Bayesian network topology based on the constructed situational information and the available tactical actions. To obtain the conditional probability distribution tables for the Bayesian network, we train the Bayesian network using an expert experience dataset. Ultimately, we accomplish autonomous situational awareness and tactical decision-making through a combination of fuzzy methods and Bayesian network inference.
We assume that the situational vector and advantageous variables are mutually independent and adopt a head-to-head Bayesian network topology, with the situational vector and advantageous variables serving as parent nodes of the Bayesian network and the decision variable as the unique shared child node. To simplify the problem, we employ a discrete Bayesian network, necessitating the fuzzification of the continuous advantageous variables. We assume that three of the continuous performance variables can be fuzzified as P, Z, N using Gaussian membership functions, and the remaining performance variable as L, S using Gaussian membership functions. Additionally, W, S, G, V, F can be fuzzified as Y, N using triangular membership functions.
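A minimal sketch of such fuzzification is given below; the membership-function parameters and label assignments are illustrative assumptions, not the values used in this work.

```python
import numpy as np

def gaussian_mf(x, mean, sigma):
    """Gaussian membership degree in [0, 1]."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2)

def triangular_mf(x, a, b, c):
    """Triangular membership degree with feet a, c and peak b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def fuzzify_score(x):
    """Map a continuous score in [-1, 1] to the most likely fuzzy label among
    N (negative), Z (zero), P (positive); parameters are illustrative."""
    degrees = {
        "N": gaussian_mf(x, -1.0, 0.4),
        "Z": gaussian_mf(x,  0.0, 0.4),
        "P": gaussian_mf(x,  1.0, 0.4),
    }
    return max(degrees, key=degrees.get)
```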
Next, we need to determine the conditional probability distribution tables (CPD) for this Bayesian network. CPDs can be filled in by experts or learned from data. We have constructed a training dataset for network training based on situational information obtained from simulated adversarial experiments and the corresponding maneuver actions. The training process is illustrated in
Figure 4a.
Data learning methods include maximum likelihood estimation (MLE) and Bayesian estimation. In this paper, we adopt the Bayesian estimation method. We assume a Dirichlet distribution as the prior distribution for the decision variable nodes. Our decision variable nodes encompass the six maneuver variables. Therefore, we define the distribution parameters and their prior distribution as

$$\boldsymbol{\theta} = (\theta_1, \ldots, \theta_6),\qquad \boldsymbol{\theta} \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_6)$$
Because the Dirichlet distribution is a conjugate prior for the multinomial distribution, we can utilize the collected data as samples and employ Bayesian estimation to learn the parameters. When new situational information becomes available, we can use it as posterior information to estimate the posterior probabilities of the decision variable nodes, serving as the basis for our tactical decisions. Here, we use the exact inference variable elimination method. The advantage of this method is that it can provide decision probabilities based on partial situational information. It allows us to focus on partial situational information and still reliably perform decision reasoning when only partial situational information is available. The network inference process is illustrated in
Figure 4b.
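The following self-contained sketch mimics this training-and-inference pipeline for a toy head-to-head network: the conditional distribution of the decision node is estimated with a Dirichlet (pseudo-count) prior, and parents not present in the evidence are marginalized out, so inference still works with partial situational information. The variable names, states, and the tiny dataset are illustrative placeholders.

```python
from collections import Counter, defaultdict
from itertools import product

PARENTS = {"W": ["Y", "N"], "angular": ["P", "Z", "N"]}      # toy parent nodes
DECISIONS = ["attack", "climb", "escape", "offset", "s_turn", "launch"]

def fit_cpd(samples, alpha=1.0):
    """P(decision | parents) with a symmetric Dirichlet prior (pseudo-count alpha)."""
    counts = defaultdict(Counter)
    for parent_vals, decision in samples:        # parent_vals: tuple in PARENTS order
        counts[parent_vals][decision] += 1
    cpd = {}
    for parent_vals in product(*PARENTS.values()):
        total = sum(counts[parent_vals].values()) + alpha * len(DECISIONS)
        cpd[parent_vals] = {d: (counts[parent_vals][d] + alpha) / total for d in DECISIONS}
    return cpd

def infer(cpd, evidence):
    """Posterior over decisions; unobserved parents are marginalized with
    uniform priors (a stand-in for learned parent priors)."""
    posterior = Counter()
    names = list(PARENTS)
    for parent_vals in product(*PARENTS.values()):
        if any(evidence.get(n) not in (None, v) for n, v in zip(names, parent_vals)):
            continue                              # inconsistent with the evidence
        weight = 1.0
        for n, v in zip(names, parent_vals):
            if n not in evidence:
                weight *= 1.0 / len(PARENTS[n])   # marginalize unobserved parent
        for d, p in cpd[parent_vals].items():
            posterior[d] += weight * p
    z = sum(posterior.values())
    return {d: p / z for d, p in posterior.items()}

data = [(("Y", "P"), "attack"), (("N", "N"), "escape"), (("Y", "Z"), "climb")]
cpd = fit_cpd(data)
print(infer(cpd, {"W": "Y"}))   # partial evidence: only the alert signal observed
```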
3. PHP-ROW Method and PPO Algorithm for Tactical Control
In this section, the proposed algorithm framework is introduced. First, the priority heading polling (PHP) and random observation weight (ROW) methods are introduced. Second, the PPO-clip algorithm is implemented, and a reasonable state-action space, reward function, and termination conditions are designed according to the UAV adversarial environment. Finally, a complete training architecture is established.
3.1. PHP-ROW Method
To ensure both flight stability and high maneuverability simultaneously, we have introduced a training framework that incorporates the priority heading polling (PHP) method and the random observation weight (ROW) method. The detailed description of this framework is as follows:
Priority Heading Polling (PHP): Within every 250 simulated time steps, we check whether the unmanned aircraft has reached the desired heading. If the desired heading is achieved, we set new target values for heading, altitude, and velocity, selecting these new targets randomly within a predefined range. If the desired heading is not reached, the current episode is terminated. By employing priority heading polling, we can control the length of the unmanned aircraft's flight trajectory during the early stages of training, thus promoting exploration.
Random Observation Weight (ROW): We sample a weight from a uniform distribution over [0, 1] and use it as the weight of the roll angle reward function. The weight is kept fixed within an episode and resampled only at the end of each episode. This effectively constrains the influence of the roll angle reward on the training process, helping to balance the trade-off between rapid maneuvering and flight stability.
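A compact sketch of how PHP and ROW could be combined in an episode loop is shown below; the environment interface, heading tolerance, and target ranges are illustrative assumptions rather than the actual training code.

```python
import numpy as np

POLL_INTERVAL = 250            # simulation steps between heading checks (PHP)
HEADING_TOL_DEG = 5.0          # "desired heading reached" tolerance (illustrative)

def sample_targets(rng):
    """New random heading/altitude/speed targets within a predefined range (placeholders)."""
    return {"heading": rng.uniform(0, 360),
            "altitude": rng.uniform(2000, 8000),
            "speed": rng.uniform(150, 300)}

def run_episode(env, policy, rng):
    row_weight = rng.uniform(0.0, 1.0)            # ROW: fixed within the episode
    targets = sample_targets(rng)
    obs = env.reset(targets=targets, row_weight=row_weight)   # hypothetical interface
    for step in range(1, 1001):
        obs, reward, done, info = env.step(policy(obs))
        if step % POLL_INTERVAL == 0:             # PHP: poll the heading error
            if abs(info["heading_error_deg"]) <= HEADING_TOL_DEG:
                targets = sample_targets(rng)     # reached: issue new targets
                env.set_targets(targets)
            else:
                return "terminated_by_php"        # not reached: end the episode
        if done:
            break
    return "episode_finished"
```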
The architecture of the proposed training framework is depicted in
Figure 5.
3.2. PPO-Clip Algorithm
The PPO algorithm, a state-of-the-art algorithm developed by OpenAI, is widely used in various fields, including the training of the recently popular ChatGPT. To understand the PPO algorithm, it is essential to first grasp the concept of the policy gradient. Policy gradient is a type of policy-based method that directly employs neural networks to approximate the policy function. The core update formula for the policy network parameters is

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\big[R(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau)\big]$$

In this formula, $\tau$ represents a sample trajectory of the agent in one episode, following the probability distribution defined by the policy parameters $\theta$, and $R(\tau)$ represents the cumulative discounted reward for that trajectory. The objective is to compute the gradient of the expected cumulative discounted reward with respect to the policy parameters. By performing gradient ascent on the policy parameters, we aim to improve the expected cumulative discounted reward.
In conventional policy gradient (PG) algorithms, a single trajectory sample corresponds to a single update of the policy network parameters. In contrast, the proximal policy optimization (PPO) algorithm amalgamates the merits of both the advantage actor-critic (A2C) and trust region policy optimization (TRPO) algorithms, affording the capacity to employ trajectories sampled under the old policy for multiple iterations of policy network parameter updates. This capability arises from the introduction of importance sampling, which ensures unbiasedness, and from a clipping mechanism that constrains the disparity between the two policy functions, specifically with respect to their probability ratio.
We introduce importance sampling, shown as

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[\frac{p_{\theta}(\tau)}{p_{\theta'}(\tau)}\, R(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau)\right]$$

where $\theta'$ represents policy network parameters that are close to $\theta$. We aim to perform sampling with the policy parameterized by $\theta'$ and use the sampled data to update $\theta$.
So far, we have been considering the entire trajectory. Now, we decompose the trajectory into state–action pairs $(s_t, a_t)$. Additionally, we aim to avoid blindly increasing the selection probability of the corresponding actions during updates just because the returns are positive, which might lead to a decreased probability of selecting better actions due to low sampling frequency. Therefore, we introduce a baseline, the negation of the state value function $V(s_t)$, to adjust the selection probabilities of different actions during updates. This type of return function with a baseline is referred to as the advantage function, defined as

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

Intuitively, this formula represents the difference between the estimate of the expected future return after taking the current action in a specific state and the estimate before taking the action. This difference measures the goodness of the action.
As a result, the original objective is transformed into a per-step surrogate form, in which the trajectory-level ratio is replaced by the per-step probability ratio and the return is replaced by the advantage; this is the objective that PPO actually optimizes.
Finally, we use a clip operation to constrain the policies corresponding to parameters $\theta$ and $\theta'$ from differing too much, shown as

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t\big)\Big],\qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)}$$

where $\epsilon$ serves as a hyperparameter that restricts the difference between the target policy and the behavior policy.
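For reference, a minimal PyTorch-style sketch of this clipped surrogate objective is shown below, written as a loss to be minimized; in practice it would be combined with the value and entropy losses described later.

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss (negated objective, to be minimized)."""
    ratio = torch.exp(new_logp - old_logp)        # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```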
3.3. High Performance Training Method for Tactics Control
Traditional control methods, such as rule-based logic or model predictive control, often rely on precise mathematical models of the environment and predefined decision-making logic. However, in complex and dynamic scenarios such as UAV tactical confrontation, these methods can struggle to adapt to unpredictable opponent behaviors, high-dimensional state spaces, and partial observability.
In contrast, PPO, a model-free deep reinforcement learning algorithm, offers several advantages for this task, including adaptability, stability, robustness, effective exploration, and generalization.
3.3.1. State Space
The construction of state features is crucial for training; good features can accelerate convergence and reduce the parameter space. We ultimately consider and design 13 state variables, covering the following: the difference between the desired altitude and the current altitude of the aircraft; the difference between the desired heading and the current heading of the aircraft; the difference between the desired velocity and the current velocity of the aircraft; the current altitude of the aircraft; the roll angle; the pitch angle; the aircraft's velocities along the x, y, and z axes; the aircraft's airspeed V; and the random observation weight.
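A sketch of assembling such an observation vector is given below; the attribute names are hypothetical stand-ins for the simulator's state, and the exact ordering and encoding of the 13-dimensional vector used in the paper may differ.

```python
import numpy as np

def build_observation(uav, setpoint, row_weight):
    """Assemble the control-policy observation from the listed components.
    `uav` and `setpoint` attribute names are hypothetical stand-ins."""
    return np.array([
        setpoint.altitude - uav.altitude,   # desired-vs-current altitude
        setpoint.heading - uav.heading,     # desired-vs-current heading
        setpoint.speed - uav.speed,         # desired-vs-current velocity
        uav.altitude,                       # current altitude
        uav.roll, uav.pitch,                # roll and pitch angles
        uav.vx, uav.vy, uav.vz,             # velocities along the x, y, z axes
        uav.airspeed,                       # airspeed
        row_weight,                         # random observation weight
    ], dtype=np.float32)
```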
3.3.2. Action Space
The action space consists of four control parameters: the aileron, elevator, rudder, and throttle commands.
The aileron command is normalized between −1 and 1. The ailerons are the movable control surfaces on the wings of an aircraft that control roll (or banking). A value of −1 typically corresponds to full left deflection, a value of 1 to full right deflection, and 0 to a neutral or centered position. The elevator command is normalized between −1 and 1. The elevator is a movable control surface attached to the horizontal stabilizer of an aircraft, controlling pitch. A value of −1 typically corresponds to full downward deflection (nose down), a value of 1 to full upward deflection (nose up), and 0 to a neutral position. The rudder command is normalized between −1 and 1. The rudder is a movable control surface attached to the vertical stabilizer of an aircraft, controlling yaw (or direction). A value of −1 typically corresponds to full left deflection, a value of 1 to full right deflection, and 0 to a neutral position. The throttle command is normalized between 0 and 1. The throttle controls the power output of the engine(s). A value of 0 corresponds to no power (idle), and a value of 1 corresponds to full power.
3.3.3. Reward and Termination
We expect the UAV to fly according to the desired heading, altitude, and speed. Therefore, we have defined three corresponding reward functions for heading, altitude, and velocity tracking. Each takes the form of a Gaussian function, with the desired value as the mean and the desired error tolerance as the standard deviation. Specifically, we aim to control the heading error to within a small tolerance in radians, the altitude error to within 15.24 m, and the speed error to within 20 m/s.
In addition, the UAV's roll angle determines its turning ability. A greater roll angle increases the lift force component in the radial direction, allowing for faster turns and a sharper turning radius, thereby enhancing maneuverability. Therefore, we have also designed a roll reward and introduced the random observation weight to balance turning capability and flight stability.
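The sketch below illustrates Gaussian-shaped tracking rewards of this kind, using the altitude and speed tolerances stated above; the heading tolerance and the form of the ROW-weighted roll term are illustrative placeholders.

```python
import numpy as np

def gaussian_reward(error, sigma):
    """Gaussian-shaped reward: 1 at zero error, decaying with error magnitude."""
    return np.exp(-0.5 * (error / sigma) ** 2)

def tracking_rewards(heading_err_rad, altitude_err_m, speed_err_ms,
                     roll_rad, row_weight, sigma_heading=0.1):
    r_heading  = gaussian_reward(heading_err_rad, sigma_heading)  # heading tolerance is a placeholder
    r_altitude = gaussian_reward(altitude_err_m, 15.24)           # altitude tolerance from the text
    r_speed    = gaussian_reward(speed_err_ms, 20.0)              # speed tolerance from the text
    # ROW-weighted roll term encouraging banking for turn capability (illustrative form)
    r_roll = row_weight * min(abs(roll_rad) / (np.pi / 3), 1.0)
    return r_heading + r_altitude + r_speed + r_roll
```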
When the UAV descends below the critical altitude, we impose a penalty and terminate the episode.
Additionally, we have defined other termination conditions:
Based on the PHP, the UAV reaches the polling time without having reached the desired heading;
Descending to the critical altitude of 1 km;
Exceeding the time limit of 1000 simulation steps;
Altitude exceeding 100 km;
Rotational angular velocities p, q, r exceeding a preset limit in rad/s;
Speed exceeding 50 M;
Acceleration exceeding 10 g.
3.3.4. Training Structure
We adopted an efficient training architecture and detailed design, central to which is a hybrid learning mechanism that combines orthogonal initialization with dynamic learning rate adjustments to achieve a more efficient training process and superior performance. In the initialization phase, the action network employs orthogonal initialization, setting a gain factor of 0.01 to ensure that all potential actions can be fairly explored from the outset. Simultaneously, the value network’s gain factor is set to 1, laying the groundwork for subsequent dynamic adjustments.
Furthermore, the architecture establishes a large-scale update cycle, with the number of updates determined by dividing the total time steps by the batch size. With each update, the algorithm dynamically adjusts the learning rate using a predetermined formula to accommodate the current training progress. Moreover, within the update cycle, the algorithm performs multiple steps to gather essential training data, including state observations, behavioral decisions, and reward logging. All data are properly stored during collection and utilized for the next phase of neural network training.
Following this, we employ generalized advantage estimation (GAE) to calculate advantage values, which, combined with state values, constitute an overall evaluation of actions, known as returns. Then, having acquired a series of observations, actions, advantage values, and returns, the algorithm enters a more refined update cycle. In this phase, experiential data are divided into several smaller batches, each used for independent updates to the neural network parameters.
The algorithm here adopts a hybrid strategy, comparing the performances of the old and new policies and introducing an entropy loss to encourage exploratory behavior. Ultimately, through a comprehensive consideration of the policy loss, value loss, and entropy loss, the algorithm achieves a holistic optimization of the behavioral policy. This optimization process incorporates a gradient clipping mechanism: after each mini-batch learning step, gradients are checked for size, appropriately clipped, and then applied to update the neural network's parameters. The pseudocode for tactical control training is shown in Algorithm 1; the training process is shown in Figure 6.
Table 2 shows the hyperparameter settings for training the tactical control model.
Algorithm 1 PHP-ROW PPO tactical control training algorithm.
1: Initialize parameters.
2: Set the number of updates: num_updates = total time steps / batch size
3: for update = 1 to num_updates do
4:   Anneal the learning rate according to the predetermined decay schedule
5:   for step = 1 to num_steps do
6:     Increment the global step count based on the number of parallel environments
7:     Store observations, O, and done flags, D
8:     Input O into the neural network (policy and value networks) to obtain actions, A, log probabilities, L, and estimated state values, V
9:     Store V, A, and L
10:    Step the vector environment one step to obtain the next observation, O′, reward, R, and done flag, D′
11:    Store R and update the episode reward and episode length
12:  end for
13:  Estimate the state value of O′ using the value network, calculate the advantages, Adv, using GAE, and store the sum of Adv and V as the returns, Ret
14:  Obtain a full batch of O, L, A, Adv, Ret, and V for one rollout under the current policy
15:  for epoch = 1 to K do
16:    Shuffle the experience buffer
17:    for start = 0 to batch size with step size equal to the minibatch size do
18:      Input the shuffled O and A into the neural network to compute new log probabilities, L_new, entropies, E, and new state values, V_new
19:      Calculate the ratio r = exp(L_new − L), where L_new are the new log probabilities
20:      Calculate the policy loss, L_policy, based on Adv, r, and the clipping parameter, ε
21:      Calculate the value loss, L_value, as the MSE between V_new and Ret
22:      Calculate the mean of E as the entropy loss, L_entropy
23:      Calculate the total loss: Loss = L_policy + c_v · L_value − c_ent · L_entropy, where c_ent is the entropy coefficient
24:      Update the parameters with gradient clipping
25:    end for
26:  end for
27: end for
4. Workflow of Intelligent Unmanned Aerial Vehicle Beyond-Visual-Range Adversarial Game
In this section, we propose an advanced framework for an intelligent beyond-visual-range six-degrees-of-freedom unmanned aerial vehicle adversarial game. This framework comprises four main components: tactical control, tactical decision-making, situational awareness, and proximal policy optimization training. The system structure is shown in
Figure 7.
The situational awareness component is responsible for gathering state information from the environment. We construct a utility function to build situational awareness and employ fuzzy logic techniques to discretize continuous situational vectors.
The tactical decision-making component consists of a Bayesian network and a library of tactical maneuvers. We establish training and testing datasets to train and assess the Bayesian network’s ability to make correct tactical maneuvers based on the given situation.
The PPO training component utilizes the PPO algorithm to train the UAV agent to fly according to desired headings, speeds, and altitudes. To enhance the UAV’s ability to respond to unforeseen circumstances, we introduce a random observation weight control for roll reward functions and create a parallel training framework that combines high maneuverability and stability.
The tactical control component employs the actor network trained by PPO to map observations to UAV control inputs. The complete BVR game workflow entails pre-training the Bayesian network responsible for tactical decision-making and the PPO model responsible for tactical control. The situational awareness component senses and processes situational information. The Bayesian network makes decisions on the optimal tactical actions under the current situation, which are then further decomposed into decision criteria and desired headings, speeds, and altitudes. These constitute observation inputs to the PPO actor network, which ultimately outputs UAV control inputs to execute tactical control.