1. Introduction
The pursuit–evasion game is a classic problem in the UAV field [1], with common approaches including differential game theory [2,3] and optimal control [4,5,6]. Ref. [7] explores the feasibility of evading incoming missiles at low altitude using a typical action-library method, while Ref. [8] studies evasion strategies through trajectory planning. In recent years, deep reinforcement learning algorithms, represented by Deep Deterministic Policy Gradient (DDPG) [9], Proximal Policy Optimization (PPO) [10], Soft Actor–Critic (SAC) [11], and their improvements [12,13], have achieved outstanding success in fields such as robot control [14,15] and UAV navigation [16,17]. With the development of self-play techniques [18,19], reinforcement learning algorithms have also demonstrated high performance in competitive tasks [20,21]. Deep reinforcement learning can address sequential decision making problems in complex, highly dynamic scenarios, which is characteristic of pursuit–evasion games. Ref. [22] uses the TD3 algorithm to directly generate control-surface commands for maneuvering evasion decisions. Ref. [23] implements mixed discrete–continuous action evasion decision making through an improved SAC algorithm. Ref. [24] employs hierarchical reinforcement learning based on the PPO algorithm to enhance the robustness of evasion strategies while accounting for indicators such as energy consumption. Ref. [25] proposes a curriculum learning architecture that enables reinforcement learning algorithms to quickly learn effective evasion strategies and adapt to complex situations with multiple pursuers. These methods all assume that complete information about the pursuer is known, including its flight attitude and pursuit strategy. In practical application scenarios, however, the uncertainty of this information, especially the pursuit strategy and the missile parameters, makes these methods difficult to apply.
In practical scenarios, an evader can detect information such as the relative position of a pursuer through its sensors, but the pursuer's attitude is usually not directly observable, and its pursuit strategy and parameters cannot be obtained in advance. In pursuit games, the pursuer usually adopts a specific strategy to approach the evader, among which proportional navigation guidance is the most common. To provide more comprehensive information, many scholars have studied aircraft attitude estimation and guidance-law identification [4,5,26,27,28,29,30,31]. Ref. [27] discriminates between missile guidance laws by classifying missile trajectories. Ref. [4] constructs a classifier that identifies different guidance laws in real time using Bayesian inference. Ref. [28] was the first to use an interacting multiple model filter to identify missile guidance-law parameters. Ref. [29] uses an interacting multiple model identifier to simultaneously identify the guidance-law parameters and estimate the attitude of the pursuer. Ref. [30] proposes a multi-model mechanism that identifies guidance laws with an LSTM network through deep learning. Ref. [5] provides an analytical method for identifying guidance laws by analyzing the proportional navigation law. These studies alleviate the problem of missing decision information to some extent, but they consume a large amount of computational power, and their estimation accuracy for continuously varying guidance-law parameters remains insufficient.
For reinforcement learning, the challenges stemming from incomplete information are twofold. First, the lack of the pursuit UAV's flight attitude information means that the evader's decision making problem no longer possesses the Markov property, making it difficult to train reinforcement learning algorithms. This issue can be resolved by providing complete information during training. In practical operational contexts, the attitude of the pursuer can be furnished by specialized estimators, and the impact of estimation errors on performance can be mitigated by robust reinforcement learning algorithms. Research in this domain has reached a considerable level of maturity, with common approaches encompassing zero-shot generalization [32,33], data augmentation [34,35], and world modeling techniques [36].
The second challenge arises from the unknown pursuit strategy. Reinforcement learning training is contingent upon the environment [37], and the specific pursuit strategy and its parameters are integral components of this environment. During training, because the pursuit strategy and its parameters cannot be known in advance, it is necessary to assume that the pursuer uses a certain policy (such as pure proportional guidance) in order to construct an environment in which the optimal evasion strategy can be learned. In actual use, the pursuer will almost never employ exactly the same pursuit strategy as in the training environment, leading to a significant discrepancy between the training and operational environments. As a result, the decisions made by the reinforcement learning algorithm in a given state no longer produce the expected state transitions [38]. Because of the properties of Markov decision processes, reinforcement learning algorithms focus solely on making the best decision in the current state, without regard for whether previous decisions achieved their intended effect [39]. This can cause the aircraft to deviate further and further from the originally optimal evasion route, significantly degrading decision making performance.
To address the state-information deficiency and the performance loss of reinforcement learning algorithms caused by unknown quantities such as the pursuit UAV's attitude and pursuit strategy, this paper proposes a reinforcement learning strategy adaptation algorithm based on the estimation and identification of the unknown pursuer quantities and pursuit strategy. We construct an effective reward function that guides the evader to maneuver as aggressively as possible to deplete the pursuer's energy, thereby training a strategy that evades the pursuit UAV by controlling the evasive UAV's lateral and longitudinal overloads. In addition, we build a new three-degree-of-freedom motion model of UAVs in a pursuit–evasion game by decomposing the relative motion of the pursuer and the evader into the horizontal and vertical planes, and we estimate unknown quantities such as the pursuer's velocity, acceleration, pitch angle, and heading angle through the analysis of this model. For pursuers using pure proportional guidance as the pursuit strategy, we derive an identification model for the proportional navigation constant and eliminate the zero-crossing error in the identification results with a Kalman filter. Based on the identified proportional navigation constant, we represent the relationship between the relative motion situation of the evading and pursuing UAVs and the evader's lateral and longitudinal overloads as an affine dynamic system. Combining the reinforcement learning decision commands with the identification results, this system yields an action compensation equation whose solution gives the optimal decision correction for the existing reinforcement learning strategy against an unknown pursuer. Unlike methods that add noise during training to improve the robustness of reinforcement learning algorithms, the proposed action compensation algorithm theoretically ensures the effectiveness of the optimal strategy against pursuers with different pursuit-strategy parameters.
This paper is organized as follows: Section 2 describes the process of addressing the evasive decision making problem using reinforcement learning. Section 3 introduces the relative motion model of UAVs in pursuit–evasion games and outlines the methods for estimating the unknown quantities and pursuit strategy of the pursuer. Section 4 presents the Model Reference Policy Adaptation (MRPA) algorithm. Section 5 presents the experiments, in which the algorithms proposed in Sections 2, 3, and 4 are validated through numerical simulation. Concluding remarks and a summary of our findings are given in Section 6.
4. Model Reference Policy Adaptation Algorithm
This section discusses the principle behind the failure of reinforcement learning evasion strategies when facing an unknown pursuer that differs from the training environment. Building on this, and based on the relative motion model and the affine dynamics model discussed in this paper, an action compensation scheme is proposed, aiming to reduce the loss of strategic performance when facing different pursuers.
4.1. Problem Formulation
Whether based on value functions or on direct policy search, reinforcement learning algorithms aim to find the optimal policy in a specific environment. The Bellman equation and the policy gradient theorem, shown in Equations (55) and (56), indicate that the optimal policy depends on the environment [40].
In these equations, π represents the policy, and p denotes the state transition probability of the environment; different environments have distinct state transition probabilities. When the environment changes, p changes accordingly, and the originally optimal policy no longer retains its optimality. In deep reinforcement learning, the state transition probabilities of the environment are treated as unknown. Through continuous interaction with the environment, the policy network or value function network learns information about these transition probabilities from data and fits the action value function or optimal policy on that basis. Therefore, the optimal policy of a deep reinforcement learning algorithm is still constructed for a specific environment.
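For reference, the standard forms of these two results are sketched below in generic notation (consistent with the discussion above, but not necessarily identical to this paper's Equations (55) and (56)); both depend on the transition probability p, either explicitly or through the state-visitation distribution it induces.

```latex
% Bellman optimality equation (explicit dependence on p):
Q^{*}(s,a) = \sum_{s'} p(s' \mid s,a)\left[ r(s,a,s') + \gamma \max_{a'} Q^{*}(s',a') \right]
% Policy gradient theorem (p enters through the visitation distribution d^{\pi_\theta}):
\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{s \sim d^{\pi_{\theta}},\; a \sim \pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s,a) \right]
```

Here d^{π_θ} is the discounted state-visitation distribution induced by π_θ under p; when p changes, both the fixed point Q* and the gradient direction change, so the previously optimal policy loses its optimality.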
From the perspective of state transitions, the optimal policy enables the agent to choose, with the highest probability, a particular action in a given state and transition to a specific next state; repeating this process ultimately forms a specific trajectory with the highest expected discounted cumulative reward. When the environment changes, the state transition process also changes. Since the agent cannot perceive the change in the environment, the optimal policy causes the agent to choose the same actions as in the ideal environment but fail to transition to the intended next states. Due to the characteristics of the Markov decision process, the agent only cares about choosing the best action in the current state and does not consider whether the actions chosen in previous states achieved the desired effect. Therefore, as shown in Figure 2, incorrect state transitions continue until the task terminates, and the agent's trajectory gradually deviates from the expected optimal trajectory, resulting in an expected cumulative reward lower than the optimal value. The degree of deviation and the reduction in expected cumulative reward are related to the magnitude of the environmental change.
From the above analysis, the key to preserving the performance of the optimal policy after an environmental change is to keep the agent operating along the original optimal trajectory in the changed environment, that is, to ensure that after taking the optimal action in any state, the agent achieves the same state transition as in the original environment. To achieve this without retraining the optimal policy, the optimal actions must be corrected according to the change in the environment.
4.2. MRPA Method
In engineering applications, adaptive control methods are commonly used to improve control performance when the system model is uncertain, and model reference control is one of the most popular such methods. A model reference control system mainly consists of a controller, a reference model, an adaptation law, and the controlled plant [41]. The idea is to design an adaptation law so that the deviation between the output of the uncertain plant and the output of the reference model under the same input is as small as possible [42].
Inspired by Model Reference Adaptive Control (MRAC), this section proposes a Model Reference Policy Adaptation (MRPA) algorithm. The original environment in which the optimal policy was trained serves as the reference environment. An adaptation law is then designed to adjust the actions selected by the optimal policy according to the difference in state transitions between the reference environment and the actual environment for a given state–action pair, with the aim of eliminating this difference or making it as small as possible, thereby reducing the performance loss caused by environmental changes. The workflow of the MRPA algorithm is shown in Figure 3.
The reference model utilizes an affine dynamic system of the form
In the pursuit–evasion game discussed in this paper, the state vector s and action vector a can be represented as
Combining the relative motion model of UAVs presented in Section 3.1 with Equations (12) and (13) yields:
and
where
Assuming , we find
Summarizing the above gives
By discretizing Equation (57), we obtain
From Equations (59) and (66), it can be inferred that the resulting function matrix takes the proportional navigation constant K as its parameter. When the agent selects a certain optimal action a*, the state transition equation is:
In the pursuit–evasion game, when facing a pursuer different from that of the training environment, the parameter K changes, indicating that the environment has changed. At this point, under a certain action a, the state transition equation is:
To maintain consistent state transitions, that is, to ensure that the agent achieves the same state transition as when selecting the optimal action in the original environment, we require:
Solving for the action that should be chosen in the altered environment yields:
Equation (72) is the action compensation equation, where K is obtained through the identification algorithm discussed in Section 3.3 and is updated in real time during the decision making process.
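Since the displayed equations are not reproduced above, the following is a minimal sketch of the compensation logic in generic notation, assuming a discretized affine model in which the proportional navigation constant K parameterizes the drift term f and the input matrix g (the paper's Equations (57)–(72) may differ in their exact form):

```latex
% Assumed discretized affine model (generic notation; K enters f and g):
s_{k+1} = s_k + \bigl[ f(s_k, K) + g(s_k, K)\, a_k \bigr]\,\Delta t
% Reference transition with training constant K_0 and optimal action a_k^{*}:
s_{k+1}^{\mathrm{ref}} = s_k + \bigl[ f(s_k, K_0) + g(s_k, K_0)\, a_k^{*} \bigr]\,\Delta t
% Requiring the same transition under the identified constant \hat{K}:
g(s_k, \hat{K})\, a_k = f(s_k, K_0) - f(s_k, \hat{K}) + g(s_k, K_0)\, a_k^{*}
% Least-squares solution via the Moore-Penrose pseudo-inverse:
a_k = g(s_k, \hat{K})^{+} \bigl[ f(s_k, K_0) - f(s_k, \hat{K}) + g(s_k, K_0)\, a_k^{*} \bigr]
```

Because the action dimension is lower than the state dimension, g(s_k, K̂) is a tall matrix and the equality can generally only be satisfied in the least-squares sense, consistent with the discussion of Equation (72) in Section 5.4.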
5. Experiments
In this section, we first describe the experimental setup and the training of the SAC algorithm in the pursuit–evasion game scenario. We then verify, through numerical simulation, the performance of the relative motion model, the pursuit UAV attitude estimation, and the guidance-law parameter identification algorithms proposed in this paper, and finally validate the effectiveness of the MRPA algorithm through comparative and ablation experiments.
5.1. Experimental Settings
During training of the evasive policy with the SAC algorithm, to simulate pursuers approaching from different directions and distances, we set uniform distributions for the initial velocity direction of the pursuer and the initial position of the aircraft, randomly generating different initial situations. The generation parameters for each initial state are shown in Table 1, and the hyperparameters of the SAC algorithm are listed in Table 2:
To test the performance of the relative motion model and of the pursuit UAV attitude estimation and pursuit strategy identification algorithms, this paper designs two flight trajectories, a barrel roll maneuver and a straight-line maneuver, to simulate and test the proposed methods. The control command for the barrel roll maneuver is:
where R represents the radius of the barrel roll maneuver, denotes the horizontal velocity, and T represents the rotation period of the barrel roll maneuver. Parameters of the first-order filter used in the pursuit UAV attitude estimation and of the Kalman filter used in pursuit strategy identification are given in Table 3.
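As an illustration only, the sketch below generates barrel roll overload commands from the quantities named above (radius R, horizontal velocity, rotation period T). The exact control law of the omitted equation is not reproduced here, so the decomposition into lateral and longitudinal overload channels is an assumption, and the numerical values in the usage line are placeholders rather than the paper's settings.

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2


def barrel_roll_overloads(t, radius, period):
    """Hypothetical barrel roll overload commands (a sketch, not the paper's equation).

    The maneuver traces a circle of radius `radius` about the horizontal velocity
    axis with rotation period `period`, so the required centripetal acceleration
    is a_c = radius * (2*pi/period)**2.  It is split between the lateral and
    longitudinal overload channels with a phase advancing at 2*pi/period, and
    1 g is added in the longitudinal channel to hold altitude on average.
    """
    omega = 2.0 * np.pi / period
    a_c = radius * omega ** 2
    n_lateral = (a_c / G) * np.cos(omega * t)
    n_longitudinal = (a_c / G) * np.sin(omega * t) + 1.0
    return n_lateral, n_longitudinal


# Placeholder usage: 1 s rotation period (as in Section 5.3), 2 m radius.
n_lat, n_lon = barrel_roll_overloads(t=0.25, radius=2.0, period=1.0)
```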
In the experiments of this section, the state transition model used during the training of the reinforcement learning algorithm as well as the UAV kinematics model used in the pursuit strategy identification and evasion strategy testing both operate with a simulation time step of 1 ms. The observation cycle for the pursuer’s state estimation and pursuit strategy identification algorithm is 10 ms.
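A minimal sketch of this dual-rate arrangement is given below; the callables and their signatures are placeholders introduced for illustration, not the simulation code used in this paper.

```python
SIM_DT = 0.001    # kinematics integration step: 1 ms
OBS_DT = 0.010    # estimation / identification cycle: 10 ms
STEPS_PER_OBS = round(OBS_DT / SIM_DT)   # 10 dynamics steps per observation


def run_episode(step_dynamics, update_estimate, select_action,
                initial_state, initial_action, max_steps=30_000):
    """Dual-rate loop: integrate the UAV kinematics every SIM_DT, while the
    pursuer-state estimate / pursuit-strategy identification and the evasion
    command are refreshed only every OBS_DT (all callables are placeholders)."""
    state, action = initial_state, initial_action
    for k in range(max_steps):
        if k % STEPS_PER_OBS == 0:
            estimate = update_estimate(state)         # attitude / K estimate
            action = select_action(state, estimate)   # evasion overload command
        state, done = step_dynamics(state, action, SIM_DT)   # 1 ms dynamics update
        if done:
            break
    return state
```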
5.2. Validity of Relative Motion Model
The relative motion model proposed in Section 3.1 is constructed by decomposing the motion of the pursuer and the evader into the horizontal plane and the vertical plane, with approximations used in the derivation. The experiments in this section simulate the aircraft performing barrel roll maneuvers and straight-line motion, and the model is validated by comparing the relative motion states it computes against the true states in simulation.
Figure 4 displays the trajectories of the pursuer and evader during the straight-line and barrel roll maneuvers of the aircraft. As shown in Figure 5, in both maneuvering modes the relative motion situation computed by the proposed model is almost identical to the true value. Toward the end of the engagement, there is some error in the computation of R. This error stems from the small-angle approximation used in the model, which holds only when the angle is close to zero: the model approximates the change in the yaw angle by the ratio of the relative displacement to R, and as the pursuer closes on the evader, R decreases, causing some distortion in the model. The experimental results also show that this distortion occurs only about 1.88 s before the end of the simulation, and the model otherwise maintains a high degree of accuracy.
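In generic notation (our symbols; the paper's exact variables are not reproduced here), the approximation in question and its use in the model can be written as:

```latex
\sin x \approx \tan x \approx x \quad (x \to 0), \qquad
\Delta\psi \approx \frac{\Delta d_{\perp}}{R}
```

where Δψ is the yaw-angle increment, Δd⊥ the relative displacement component used by the model, and R the relative distance; as R shrinks near intercept, the angle increment is no longer small, which produces the distortion observed at the end of the trajectories.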
5.3. Performance of Guidance Law Parameter Identification
This chapter presents the simulation results of the guidance law parameter identification. In the experiment, the velocities of the evader and pursuer used are the same as in
Table 1, with the pursuer’s velocity linearly decaying. The initial position of the aircraft is [10,000, 0, 1000], and the initial pitch angle of the pursuer is
, with the yaw angle being
. The aircraft’s barrel roll maneuver has a period of 1 s with a horizontal velocity of 100. For the straight-line motion, the pitch angle is 1 and the yaw angle is
.
To verify the adaptability of the algorithms proposed in this paper in high-dynamic environments, we conducted identification experiments on dynamically changing guidance law parameters. The variations in the proportional navigation constant include step changes, linear changes, and sinusoidal variations. The time for the step change is 8 s, the rate of change for the linearly varying proportional navigation constant is , and the period for the sinusoidal variation is .
The experimental results in Figure 6 and Figure 7 indicate that the pursuer flight-state estimation model can accurately and stably estimate the pursuer's attitude, velocity, and longitudinal and lateral accelerations. Certain errors appear at the beginning and end of the simulation: the initial error originates from the first-order low-pass filter used in the attitude estimation not yet having converged, and the final error comes from the error of the relative motion model.
Figure 8 and Figure 9 show that the proposed identification algorithm achieves sufficiently high accuracy for constant guidance-law parameters, with errors kept within approximately 0.2 and a response time of less than 0.5 s.
Figure 10, Figure 11 and Figure 12 demonstrate that the proposed method can also effectively track and identify dynamically changing guidance-law parameters. In addition, comparing the identification results before and after the Kalman filter in Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12 shows that the designed filter effectively eliminates the interference of zero-crossing errors in the identification results.
In practical application scenarios, the observation signals provided by sensors are generally not unbiased, and the observations used by the pursuit strategy identification algorithm typically contain some noise [43]. To test the performance of the identification algorithm under noisy inputs, we conducted an additional set of tests in which the input signals contained random noise. We tested proportional navigation parameters with step, linear, and sinusoidal changes, using the same parameters as in the previous experiments. Because the input signals of the identification algorithm are coupled, we added observation noise directly to the observed position coordinates of the pursuer and computed the input signals in the same manner as Equation (1), ensuring that the noise in each signal remains physically consistent. The added noise was Gaussian with a mean of 0 and a standard deviation of 20.
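A brief sketch of this noise-injection step is given below; the random seed and the example coordinates are placeholders, and only the noise statistics follow the text.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def noisy_position(p_true, sigma=20.0):
    """Add zero-mean Gaussian observation noise (std 20, as in the text)
    to the pursuer's position observation."""
    return np.asarray(p_true, dtype=float) + rng.normal(0.0, sigma, size=3)


# The identification inputs (relative distance, line-of-sight angles, ...) are
# then recomputed from the noisy positions in the same manner as Equation (1),
# so the noise propagates consistently into every coupled input signal rather
# than being added to each signal independently.
p_pursuer_noisy = noisy_position([10_000.0, 0.0, 1_000.0])
```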
The experimental results in Figure 13, Figure 14 and Figure 15 indicate that, although the noise in the input signals affects the identification results, the proposed method is still able to stably track the dynamically changing proportional navigation constant. When tracking the step change, the average error is 0.248; for the linear change, 0.249; and for the sinusoidal change, 0.217. These error levels are higher than those observed in noise-free conditions. Observation noise not only directly affects the estimation but also exacerbates the negative impact of zero-crossing errors: division by near-zero quantities amplifies the noise further, and the noise may also increase the frequency of zero-crossing events. Since the Kalman filter and the low-pass filter included in the proposed method provide a degree of noise suppression, the impact of observation noise on algorithm performance remains within an acceptable range.
5.4. Performance of the MRPA Method
This section first compares the proposed MRPA algorithm with other methods that enhance the generalization performance of reinforcement learning, to verify the superiority of MRPA in dealing with unknown pursuers. We selected two baseline methods for comparison. The first, Multi Env–SAC, is the traditional generalization approach of considering various pursuit strategies during the training phase so that the evasion strategy learns to cope with different pursuers. The second, Data Augment–SAC, is the data augmentation method proposed in Reference 1, which improves the generalization performance of reinforcement learning algorithms by applying data augmentation to the state data in the experience replay buffer; it is currently among the most widely used ways to enhance the generalization capability of reinforcement learning algorithms.
When implementing Multi Env–SAC, to enable the algorithm to learn evasion strategies against different pursuers, we added the pursuit strategy parameter K to the state information of the SAC algorithm used in Section 2, adjusting the state space to:
with all other parts remaining unchanged and K varying within the range [3, 6]. The data augmentation method proposed in Reference 1 provides various augmentation techniques for reinforcement learning with image inputs, such as adding noise, color changes, image cropping, and flipping. Since the input of the reinforcement learning algorithm used in this paper is a state vector, only noise addition is used for augmentation, adding Gaussian noise with a mean of 0 and a standard deviation of 0.05 to the normalized state data. Both baselines are trained with settings identical to those of the original method.
Figure 16 shows the training results of the three methods. After approximately 100,000 iteration steps, both the unmodified reinforcement learning algorithm and Data Augment–SAC reached a high level of evasion capability, whereas Multi Env–SAC still exhibited a low level of evasion capability after 400,000 training steps. The training results indicate that considering multiple different pursuers during the training phase greatly affects the convergence speed: the pursuit–evasion game becomes more complex, and the additional state dimensions mean that the reinforcement learning algorithm must interact with the environment far more extensively to be trained, so the required training time grows to an unacceptable level. In contrast, Data Augment–SAC only processes the state signals during training and does not significantly affect training efficiency.
To further demonstrate the advantages of the MRPA algorithm against different pursuers, we conducted 3000 independent evasion experiments for pursuer proportional navigation constants of 4, 5, and 6. In each experiment, the MRPA algorithm and the evasion strategy trained by Data Augment–SAC evaded under the same initial conditions, and their survival times were recorded, as shown in Figure 17.
The experimental results show that policy adaptation with MRPA yields longer survival times against different pursuers than the evasion strategy trained by Data Augment–SAC, demonstrating better generalization. Data Augment–SAC aims to improve the robustness of reinforcement learning algorithms to potential noise in the environment through data augmentation; such noise is typically zero-mean, whereas a change in the pursuer's strategy changes the environment's state transitions and cannot simply be modeled as noise. Consequently, data augmentation does not provide sufficient generalization in the scenarios discussed in this paper.
We conducted ablation experiments to validate the effectiveness of the MRPA algorithm. In the original environment, the reinforcement learning algorithm was trained to evade a pursuer with a proportional navigation constant , resulting in a reference policy. We then ran three tests in which the proportional navigation constants of the pursuers in the test environment were 4, 5, and 6, respectively. Each test included an experimental group, a control group, and a reference group, all of which loaded the reference policy. The experimental group evaded the test pursuer using the MRPA algorithm; the control group faced the same pursuer and initial situation as the experimental group but did not use MRPA; the reference group used the same initial situation as the experimental group but evaded the pursuer from the original environment, without MRPA. Each test comprised 3000 simulations with random initial situations, and the survival time of each group in each simulation was recorded.
Figure 18 displays typical trajectories of the evader and pursuer in the three experiments. It can be seen intuitively that the trajectory of the experimental group differs noticeably from that of the reference group, with an even greater deviation than that of the control group. This is because the MRPA algorithm corrects the UAV's decision commands so that the change in the relative position relationship between the experimental group's evader and pursuer is as close as possible to that of the reference group, as determined by the state space of the Markov decision problem in the pursuit–evasion game. The changes in relative distance, line-of-sight yaw angle, and line-of-sight pitch angle for the three typical cases shown in Figure 19 likewise show that the deviations of several key states of the experimental group from the reference group are significantly smaller than those of the control group.
The experimental results shown in Figure 20 indicate that the experimental group, using the MRPA algorithm with the reference policy, achieves better performance when evading pursuers that differ from the training environment. The reason it does not reach the performance of the reference group is that the dimension of the action vector in the pursuit–evasion game is lower than that of the state vector; consequently, the solution to Equation (72) is a least squares solution, which cannot guarantee that the state transition is completely consistent with the reference model. Using a survival time of 24 s as the threshold for evasion success, in line with the reward function's setup, Figure 21 shows that the experimental group improves on the control group by , , and in the three tests, respectively.
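To make the least-squares nature of the correction concrete, the sketch below solves the assumed compensation relation with NumPy's least-squares routine; f_ref/g_ref denote the reference-model drift and input terms evaluated with the training-time constant, f_act/g_act the same terms with the identified constant, and all names and example values are placeholders rather than the paper's code.

```python
import numpy as np


def mrpa_compensate(a_star, f_ref, g_ref, f_act, g_act):
    """Least-squares action correction (a sketch of the role of Equation (72)).

    a_star       : optimal action chosen by the trained policy, shape (m,)
    f_ref, g_ref : drift vector (n,) and input matrix (n, m) of the reference
                   model, evaluated with the training-time constant
    f_act, g_act : the same terms evaluated with the identified constant
    Because the state dimension n exceeds the action dimension m, the system
    is over-determined, so only a least-squares solution exists and the
    corrected transition cannot match the reference model exactly.
    """
    target = f_ref + g_ref @ a_star - f_act        # desired controlled increment
    a_corr, *_ = np.linalg.lstsq(g_act, target, rcond=None)
    return a_corr


# Placeholder example with state dimension 6 and action dimension 2.
rng = np.random.default_rng(seed=0)
f0, g0 = rng.normal(size=6), rng.normal(size=(6, 2))
f1, g1 = rng.normal(size=6), rng.normal(size=(6, 2))
a_corrected = mrpa_compensate(np.array([0.3, -0.1]), f0, g0, f1, g1)
```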
6. Conclusions
In response to the scenario of an unknown pursuer attacking an evader, an analytical method for estimating the pursuit UAV's flight attitude and identifying its pursuit strategy was constructed through analysis of the relative motion model of UAVs in pursuit–evasion games. Concurrently, a maneuvering evasion strategy was trained with a reinforcement learning algorithm, and a Model Reference Policy Adaptation (MRPA) algorithm was proposed to adapt it to different pursuers.
The results of the numerical simulations indicate that the pursuit UAV flight attitude estimation and pursuit strategy identification achieve high accuracy, with response times of less than 0.5 s. Moreover, the guidance-law identification model can accurately track guidance-law parameters that vary over time, with an average error of less than 2% during the stable tracking phase. Comparative and ablation experiments show that the MRPA algorithm effectively enhances the performance of evasion strategies against unknown pursuers, with an average increase in evasion success rate of 8.4%.
Although the pursuer is assumed to use a proportional navigation guidance law, the experimental results demonstrate that the guidance-law parameter identification model is effective for dynamically changing parameters, and the MRPA algorithm, which does not depend on the specific construction of the guidance law, is expected to remain effective against pursuers employing other pursuit strategies.
7. Limitations and Future Work
Although the comparative and ablation experiments demonstrate that the proposed MRPA algorithm enables reinforcement learning-trained evasion strategies to adapt effectively to different pursuers, the method still has several limitations in practical use. The first stems from observation noise: the MRPA algorithm adjusts decisions based on the results of pursuit strategy identification, and the experimental results in Section 5.3 show that, while the proposed identification method has some robustness to observation noise, noise still increases the identification error. Incorrect identification results in turn reduce the robustness of the MRPA algorithm. In addition, the modeling of the pursuit strategy has its own limitations. We use proportional navigation guidance to describe the pursuer's strategy, but real pursuers may employ a richer set of strategies, possibly including intelligent pursuit strategies trained by reinforcement learning. Although the proposed method can identify dynamically changing guidance-law parameters, this does not prove that it can represent all pursuit strategies, and when the pursuit strategy cannot be identified correctly, the performance of the MRPA algorithm will inevitably decline.
In future work, we will model the UAV pursuit–evasion game in greater detail, for example by considering more comprehensive aerodynamic constraints and potential time delays in the control quantities, which will benefit the application of our method in the real world. Furthermore, we will use more advanced deep learning methods for pursuit strategy identification: by running large-scale simulations of pursuit strategies of different types and parameters to build a dataset, we will train a deep learning model to directly identify the state transition matrix in Equation (57). This approach does not rely on specific kinematic models (e.g., Equations (14)–(16)) and may generalize better to different types of pursuit strategies. We hope these improvements will enable the MRPA algorithm to achieve better results in real-world environments.