1. Introduction
Spacecraft such as satellites, space stations, and space shuttles play an important role in both civil and military activities. They are also at risk of being intercepted in the exo-atmosphere. The pursuit-evasion game between a spacecraft and an interceptor is critical in the competition for space resources and has been widely studied in recent years. The trajectory of a spacecraft can be accurately predicted [1], since its dynamics are generally described in terms of a two-body problem. With the development of accurate sensors, guidance technology, small-sized propulsion systems, and fast servo-mechanism techniques, the Kinetic Kill Vehicle (KKV), which can be used for direct-hit kills, has superior maneuverability compared to other spacecraft. Consequently, it is not practical for a targeted spacecraft involved in the pursuit-evasion game to rely solely on orbital maneuvering.
Among the many available countermeasures, launching an Active Defense Vehicle (ADV) as a defender to intercept the incoming threat has proven to be an effective approach to compensate for the target's inferior maneuverability [2,3,4]. In an initial study [2], Boyell proposed the active defense strategy of launching a defensive missile to protect the target from a homing missile and derived approximate normalized curves of the game results, under the assumption of a constant-velocity or stationary target, from the relative motion among the three participants. The dynamic three-body framework was introduced by Rusnak in Ref. [4], inspired by the narrative of a “lady-bodyguard-bandit” situation. This framework was later transformed into the “target-interceptor-defender” (TID) three-body spacecraft active defense game scenario described in Ref. [3]. In the TID scenario, the defender aims to reduce its distance from the interceptor, while the interceptor endeavors to increase its distance from the defender and successfully intercept the target. In Refs. [3,4], Rusnak proposed a game guidance method for the TID scenario based on Multiple Objective Optimization and differential game theories. It was proven that the proposed active defense method significantly reduces the miss distance and the required acceleration level between the interceptor and the defender.
The efficacy of the active defense method has drawn increased attention to the collaborative strategy between the target and the defender in the TID scenario. Traditional methods for solving optimal strategies in this context include Optimal Control [5,6,7] and differential game theories [8,9,10]. In Ref. [7], Weiss employed Optimal Control theory to independently design the guidance for both the target and the defender, considering the influence of target maneuvers on the interceptor's effectiveness as a defender. Furthermore, in Ref. [6], collaborative game strategies for the target and defender were proposed, emphasizing their combined efforts in the TID scenario. For the multi-member TID scenario in which a single target carries two defenders against two interceptors, Ref. [5] designed a multi-member cooperative game guidance strategy that also accounts for the fuel consumption of the target and the defenders. However, Optimal-Control-based strategies rely on perfect information, demanding accurate maneuvering details of the interceptor. In contrast, Differential Game approaches require only prior knowledge rather than accurate target acceleration information, enhancing algorithm robustness [11]. In Ref. [8], optimal cooperative pursuit and evasion strategies were proposed using Pontryagin's minimum principle. A similar scenario was studied in Ref. [9] for both continuous and discrete domains using the linear-quadratic differential game method. It is worth noting that the differential game control strategies proposed in Ref. [9] address the fuel cost and saturation problems; however, they introduce computational difficulties and make the selection of weight parameters harder. A switching surface [10], designed with the zero-effort miss distance, was introduced to divide the multi-agent engagement into two one-on-one differential games, thereby achieving a balance between performance and usability. Nonetheless, using the differential game method to solve the multi-agent pursuit-evasion game problem still faces shortcomings [11,12,13]. First, it is difficult to establish a scene model of a multi-member, multi-role game because the dimension of the state increases dramatically. Second, it places high requirements on the accuracy of the prior knowledge, and the success rate of the game is low if the prior knowledge of the players cannot be obtained accurately. Third, the differential game algorithm is complicated, involving high-dimensional matrix operations, power function operations, integral calculations, etc., which places a high demand on the computational resources of the spacecraft. More on this topic can be found in [14,15,16,17,18,19,20].
With the advancement of machine learning technology, Deep Reinforcement Learning (DRL) has emerged as a promising approach for addressing active defense guidance problems. In DRL, an agent interacts with the environment and receives feedback in the form of rewards, enabling it to improve its performance and achieve specific tasks. This mechanism has led to successful applications of DRL in various decision-making domains, including robot control, MOBA games, autonomous driving, and navigation [21,22,23,24,25]. In Ref. [26], DRL was utilized to learn an adaptive homing-phase control law, accounting for sensor and actuator noise and delays. Another work [27] proposed an adaptive guidance system to address the landing problem using Reinforcement Meta-Learning, adapting agent training from one environment to another within a limited number of steps and showcasing robust policy optimization in the presence of parameter uncertainties. In the context of the TID scenario, Lau [28] demonstrated the potential of using reinforcement learning for active defense guidance, although an optimal strategy was not obtained in that preliminary investigation.
It is worth pointing out that, on the one hand, to better align with real-world engineering applications, research on guidance methods often needs to consider the presence of various information gaps and noise [29,30]. However, most of the existing optimal active defense guidance methods rely on perfect-information assumptions, leading to subpar performance when faced with unknown prior knowledge or observation noise. Additionally, these methods often struggle to meet the real-time requirements of spacecraft applications. On the other hand, the majority of reinforcement learning algorithms have been applied to non-adversarial or weakly adversarial flight missions, where mission objectives and process rewards are clear and intuitive. In the highly competitive TID game scenario, however, obtaining effective reward information becomes challenging due to the intense confrontation between agents, leading to sparse-reward problems or the “Plateau Phenomenon” [31].
Given these observations, there is a strong motivation to develop an active defense guidance method based on reinforcement learning that possesses enhanced real-time capabilities, adaptiveness, and robustness, while addressing the challenges posed by adversarial scenarios and sparse reward issues.
In this paper, we focus on the design of a cooperative active defense guidance strategy for a target spacecraft with active defense attempting to evade an interceptor in space. This TID scenario holds significant importance in the domains of space attack-defense and ballistic missile penetration. The paper begins by deriving the kinematic and first-order dynamic models of the engagement scenario. Subsequently, an intelligent cooperative active defense (ICAAI) guidance method is proposed, utilizing the twin-delayed deep deterministic policy gradient (TD3) algorithm. To address the challenge of sparse rewards, an efficient and stable convergence (ESC) training approach is introduced. Furthermore, benchmark comparisons are made against Optimal Guidance Laws (OGLs), and simulation analyses are presented to validate the performance of the proposed method.
The paper is organized as follows. In Section 2, the problem formulation is provided. In Section 3, the guidance law is developed. In Section 4, experiments are presented in which the proposed method is compared with its analytical counterpart, followed by the conclusions presented in Section 5.
2. Problem Formulation
Consider a multi-agent game with a spacecraft as the main target (T), an active defense vehicle as the defender (D), and a highly maneuverable small spacecraft as the interceptor (I). In this engagement, the interceptor chases the target, which launches the defender to protect itself by destroying the interceptor. During the endgame, all players are considered as constant-speed mass points whose trajectories can be linearized around the initial line of sight. As a consequence of trajectory linearization, the engagement, a three-dimensional process, can be simplified and analyzed in one plane. However, it should be noted that in most cases these assumptions do not affect the generality of the results [11].
A schematic view of the engagement is shown in Figure 1, where the engagement is described in a Cartesian inertial reference frame. The interceptor-target and interceptor-defender distances are denoted as $r_{IT}$ and $r_{ID}$, respectively. The players' velocities are indicated as $V_T$, $V_D$, and $V_I$, while their accelerations are represented as $a_T$, $a_D$, and $a_I$. The flight path angles of the players are defined as $\gamma_T$, $\gamma_D$, and $\gamma_I$, respectively. The lines of sight (LOS) between the players are $\mathrm{LOS}_{IT}$ and $\mathrm{LOS}_{ID}$, and the angles between the LOS and the X-axis are denoted as $\lambda_{IT}$ and $\lambda_{ID}$. The lateral displacements of each player relative to the X-axis are represented as $y_T$, $y_D$, and $y_I$, while the relative displacements between the players are defined as $y_{IT}$ and $y_{ID}$.
Considering the collective mission objectives, the target's priority is to evade the interceptor with the defender's support. Simultaneously, the interceptor aims to avoid the defender while chasing the target. Consequently, the target's guidance law strives to maximize its separation from the interceptor, while the defender's guidance law aims to drive its separation from the interceptor to zero; the interceptor's guidance law assumes the opposite roles (as depicted in Figure 1). This scenario can thus be segmented into two collision triangles: one involving the interceptor and the target, and the other involving the interceptor and the defender.
2.1. Equations of Motion
Consider the I-T collision triangle and the I-D collision triangle in a multi-agent pursuit-evasion engagement. The kinematics are expressed in polar coordinate systems attached to the target and the defender.
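In a standard polar formulation of these triangles (assuming the target and defender fly in the positive X-direction, the interceptor's flight path angle is measured from the negative X-direction, and the notation above is used), the relative kinematics take the form:
\[
\begin{aligned}
\dot{r}_{IT} &= -\left[V_I\cos\!\left(\gamma_I+\lambda_{IT}\right)+V_T\cos\!\left(\gamma_T-\lambda_{IT}\right)\right], &
\dot{\lambda}_{IT} &= \frac{V_T\sin\!\left(\gamma_T-\lambda_{IT}\right)-V_I\sin\!\left(\gamma_I+\lambda_{IT}\right)}{r_{IT}},\\
\dot{r}_{ID} &= -\left[V_I\cos\!\left(\gamma_I+\lambda_{ID}\right)+V_D\cos\!\left(\gamma_D-\lambda_{ID}\right)\right], &
\dot{\lambda}_{ID} &= \frac{V_D\sin\!\left(\gamma_D-\lambda_{ID}\right)-V_I\sin\!\left(\gamma_I+\lambda_{ID}\right)}{r_{ID}}.
\end{aligned}
\]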
Furthermore, the flight path angle dynamics can be defined for each of the players.
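Since each player moves at constant speed with its acceleration applied normal to the velocity vector, the flight path angle dynamics reduce to:
\[
\dot{\gamma}_T=\frac{a_T}{V_T},\qquad \dot{\gamma}_D=\frac{a_D}{V_D},\qquad \dot{\gamma}_I=\frac{a_I}{V_I}.
\]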
2.2. Linearized Equations of Motion
In the research context, both the LOS angles $\lambda_{IT}$, $\lambda_{ID}$ and the flight path angles $\gamma_T$, $\gamma_D$, $\gamma_I$ are small quantities, and the inter-spacecraft distances are much larger than the spacecraft velocities. Furthermore, during the terminal guidance phase, the rate of change of each spacecraft's velocity magnitude approaches zero. Therefore, the equations of motion can be linearized around the initial line of sight.
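Under these small-angle assumptions, with all accelerations taken normal to the initial LOS, the relative displacements behave as double integrators driven by the acceleration differences (a representative linearized form):
\[
\ddot{y}_{IT}=a_T-a_I,\qquad \ddot{y}_{ID}=a_D-a_I.
\]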
The dynamics of each player are assumed to be a first-order process.
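A minimal sketch of this lag model, assuming commanded accelerations $u_T$, $u_D$, $u_I$ and time constants $\tau_T$, $\tau_D$, $\tau_I$ (notation introduced here for illustration), is:
\[
\dot{a}_i=\frac{u_i-a_i}{\tau_i},\qquad i\in\{T,D,I\}.
\]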
Furthermore, the state vector of the engagement can be defined.
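One consistent choice of state vector under the notation above (the ordering is an assumption) is:
\[
\mathbf{x}=\left[\,y_{IT},\ \dot{y}_{IT},\ y_{ID},\ \dot{y}_{ID},\ a_T,\ a_D,\ a_I\,\right]^{\mathrm{T}}.
\]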
The linearized equations of motion can then be written in state-space form, with system and input matrices determined by the chosen state ordering and the first-order acceleration dynamics.
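With this state vector and the control vector $\mathbf{u}=[u_T,\,u_D,\,u_I]^{\mathrm{T}}$ of commanded accelerations (an assumed ordering), a representative realization is:
\[
\dot{\mathbf{x}}=A\mathbf{x}+B\mathbf{u},\qquad
A=\begin{bmatrix}
0&1&0&0&0&0&0\\
0&0&0&0&1&0&-1\\
0&0&0&1&0&0&0\\
0&0&0&0&0&1&-1\\
0&0&0&0&-1/\tau_T&0&0\\
0&0&0&0&0&-1/\tau_D&0\\
0&0&0&0&0&0&-1/\tau_I
\end{bmatrix},\qquad
B=\begin{bmatrix}
0&0&0\\
0&0&0\\
0&0&0\\
0&0&0\\
1/\tau_T&0&0\\
0&1/\tau_D&0\\
0&0&1/\tau_I
\end{bmatrix}.
\]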
Since the velocity of each player is assumed to be constant, the engagement can be formulated as a fixed-time process. Thus, the interception times of the two engagements can be calculated from $r_{IT_0}$, the initial relative distance between the interceptor and the target, and $r_{ID_0}$, the initial distance between the interceptor and the defender. This allows the time-to-go of each engagement to be defined, representing the expected remaining game time for the interceptor in the “Interceptor vs. Target” and “Interceptor vs. Defender” game scenarios, respectively.
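Under the near head-on closing-speed approximation implied by the linearization, these quantities can be written as (a sketch using the notation above):
\[
t_{f_{IT}}=\frac{r_{IT_0}}{V_I+V_T},\qquad t_{f_{ID}}=\frac{r_{ID_0}}{V_I+V_D},\qquad
t_{go_{IT}}=t_{f_{IT}}-t,\qquad t_{go_{ID}}=t_{f_{ID}}-t.
\]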
2.3. Zero-Effort Miss
A well-known quantity, the zero-effort miss (ZEM), is introduced in the guidance law design and the reward function design. It is obtained from the homogeneous solution of the equations of motion and is affected only by the current state and the interception time, so the ZEM of each engagement and its derivative with respect to time can be expressed in terms of the state, the time-to-go, and the first-order dynamics of the players.
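For first-order lag dynamics, a standard form of these quantities (assuming the state ordering above and defining $\psi(\theta)=e^{-\theta}+\theta-1$) is:
\[
\begin{aligned}
Z_{IT}&=y_{IT}+\dot{y}_{IT}\,t_{go_{IT}}+a_T\tau_T^{2}\,\psi\!\left(\tfrac{t_{go_{IT}}}{\tau_T}\right)-a_I\tau_I^{2}\,\psi\!\left(\tfrac{t_{go_{IT}}}{\tau_I}\right),\\
Z_{ID}&=y_{ID}+\dot{y}_{ID}\,t_{go_{ID}}+a_D\tau_D^{2}\,\psi\!\left(\tfrac{t_{go_{ID}}}{\tau_D}\right)-a_I\tau_I^{2}\,\psi\!\left(\tfrac{t_{go_{ID}}}{\tau_I}\right),\\
\dot{Z}_{IT}&=\tau_T\,\psi\!\left(\tfrac{t_{go_{IT}}}{\tau_T}\right)u_T-\tau_I\,\psi\!\left(\tfrac{t_{go_{IT}}}{\tau_I}\right)u_I,\qquad
\dot{Z}_{ID}=\tau_D\,\psi\!\left(\tfrac{t_{go_{ID}}}{\tau_D}\right)u_D-\tau_I\,\psi\!\left(\tfrac{t_{go_{ID}}}{\tau_I}\right)u_I.
\end{aligned}
\]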
2.4. Problem Statement
This research focuses on the terminal guidance task of a maneuvering target with active defense evading a homing interceptor. We design a DRL-based cooperative active defense guidance law to facilitate coordinated maneuvers between the target and the defender, enabling the target to evade the interceptor while allowing the defender to counter-intercept the incoming threat.
4. Experiments
In this section, we demonstrate the efficacy of the proposed guidance method and the effectiveness of the shaping technique through learning processes and Monte Carlo simulations. We establish benchmark comparisons by including OGLs and evaluating application requirements. To illustrate, we consider a scenario [10] involving a maneuverable small spacecraft (Interceptor, I), a defensive vehicle (Defender, D), and an evading spacecraft (Target, T), all in circular Earth orbits. Gravity effects are incorporated in the simulations. It is assumed that the interceptor has superior maneuverability and a smaller time constant compared to the target and the defender.
4.1. Optimal Pursuit and Evasion Guidance Laws
Lemma 1. The linear-quadratic optimal guidance law (LQOGL) [10] is expressed in terms of a positive constant representing the limit-collision radius between the interceptor and the defender, the maximum control force provided by the interceptor, and the gain functions K(t) and P(t), whose definitions involve nonnegative weighting constants that ensure the interceptor converges towards the target while guaranteeing its escape from the defender.
Proof. The detailed proof of similar results can be found in [10]; see Theorem 1 and the associated proof. □
Lemma 2. The standard optimal guidance law (SOGL) [45] is expressed in terms of a positive constant that defines the switching condition and is always equal to the defender kill radius.
Proof. Consider a cost function associated with the engagement. For the first case of the switching condition, the Hamiltonian of the problem is defined, the costate equation and transversality condition are derived, and the optimal interceptor controller is obtained by minimizing the Hamiltonian, which yields the interceptor guidance law in Equation (58). For the other case, a similar derivation yields the interceptor guidance law in Equation (59). Finally, the interceptor guidance scheme for evading the defender and pursuing the target, Equation (60), is obtained by combining Equations (58) and (59).
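A representative form of the combined scheme in Equation (60), assuming a bang-bang structure in which the interceptor evades the defender whenever the interceptor-defender ZEM lies within the defender kill radius $R_D$ and pursues the target otherwise (with $u_I^{\max}$ the maximum interceptor acceleration and the ZEM sign conventions used above), is:
\[
u_I^{*}(t)=
\begin{cases}
-\,u_I^{\max}\,\operatorname{sgn}\!\left(Z_{ID}(t)\right), & \left|Z_{ID}(t)\right|\le R_D,\\
\ \ \,u_I^{\max}\,\operatorname{sgn}\!\left(Z_{IT}(t)\right), & \left|Z_{ID}(t)\right|> R_D.
\end{cases}
\]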
□
4.2. Engagement Setup
In this scenario, a target carrying an active anti-interceptor is threatened by a KKV interceptor in orbit at an altitude of 500 km. The defender maintains an initial safe distance of approximately 50 m longitudinally and 10 km transversely to the target. Given that the detection range of the interceptor’s guided warhead is about 100 km, the initial transverse distance between the interceptor and the target is set at 100 km, and the initial longitudinal position is random in the range 499.8–500.2 km. In addition, the maneuverability and control response speed of the interceptor are better than those of the target and defender, and the OGL is used for guidance.
The comprehensive list of engagement parameters is shown in Table 3.
Furthermore, zero-mean Gaussian noise is added to the interceptor's line-of-sight angle, velocity, and acceleration as measured by the target and defender through a radar seeker.
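To illustrate how such measurement noise can be injected in simulation, the following is a minimal sketch; the observation fields, noise levels, and function name are assumptions for illustration rather than the implementation used in this work:

```python
import numpy as np

def noisy_interceptor_observation(true_obs, rng,
                                  sigma_los=0.05e-3,  # LOS angle noise std, rad (assumed level)
                                  sigma_vel=1.0,      # velocity noise std, m/s (assumed level)
                                  sigma_acc=0.5):     # acceleration noise std, m/s^2 (assumed level)
    """Return the interceptor observation seen by the target/defender radar seeker.

    Zero-mean Gaussian noise is added independently to the LOS angle,
    velocity, and acceleration channels of the true interceptor state.
    """
    return {
        "los_angle": true_obs["los_angle"] + rng.normal(0.0, sigma_los),
        "velocity": true_obs["velocity"] + rng.normal(0.0, sigma_vel),
        "acceleration": true_obs["acceleration"] + rng.normal(0.0, sigma_acc),
    }

# Example: corrupt one true observation with a reproducible noise draw.
rng = np.random.default_rng(0)
obs = noisy_interceptor_observation(
    {"los_angle": 0.01, "velocity": 7600.0, "acceleration": 3.0}, rng)
```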
4.3. Experiment 1: Real-Time Performance of the Guidance Policy
To verify that the proposed RL training approach ESC can improve convergence efficiency and stability, the learning processes were demonstrated using the sparse reward (SR) signal and ESC, respectively, with the same hyperparameters. During the learning process, the weights of the neural network model were stored every 100 episodes for subsequent analysis. In addition, to remove stochasticity as a confounding factor, six random seeds were set for each case. Meanwhile, the real-time performance of the optimized agent is evaluated by comparing it with the traditional OGLs.
The agents were obtained after training for 20,000 episodes, which took 12 h with 8 parallel workers on a computer equipped with a 104-core Intel Xeon Platinum 8270 CPU @ 2.70 GHz. Both the traditional methods and the proposed method are provided with the current state or observation and return the required action.
Table 4 shows the comparison of computational cost and update frequency obtained using SOGL, LQOGL, and the proposed method. It can be seen from the table that LQOGL is time-consuming due to the calculation of the Riccati function, which is the reason why it has not been applied in practice. As a proven approach, the SOGL has excellent real-time performance. The proposed method achieved an update frequency of $10^{3}$ Hz and shows great potential for on-board applications. While a variety of approaches (e.g., pruning and distillation) could compress the policy network and further improve its real-time performance, this is not the main focus of this research.
Remark 1. As shown in Equations (18) and (19), the LQOGL has to solve the Riccati differential equation. However, the experimental results show that its update frequency cannot meet the real-time requirements of spacecraft guidance. Compared to the LQOGL, the SOGL in Equation (60) does not need to solve the Riccati differential equation and has no hyperparameters. This improves both its computational efficiency and its robustness, at the cost of reduced flexibility and the occurrence of the chattering phenomenon. To reflect the practical situation, the SOGL was chosen as the OGL benchmark.
4.4. Experiment 2: Convergence and Performance of the Guidance Policy
The performance of the trained agent in the fully observable game was investigated by comparing the escape success rate of the optimized policy, obtained through Monte Carlo simulation in the fully observable (deterministic, with default engagement parameters) environment, with that of the SOGL solution.
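A minimal sketch of such a Monte Carlo evaluation loop is given below; the environment interface, episode count, and kill-radius threshold are assumptions for illustration, not the exact settings of the reported experiments:

```python
import numpy as np

def evaluate_policy(env, policy, n_episodes=1000, kill_radius=1.0):
    """Estimate the escape success rate and mean miss distance of a trained policy.

    An episode counts as a successful evasion when the final interceptor-target
    miss distance exceeds `kill_radius`.
    """
    successes, miss_distances = 0, []
    for _ in range(n_episodes):
        obs, done = env.reset(), False        # assumed gym-style interface
        while not done:
            action = policy(obs)              # deterministic trained policy
            obs, _, done, info = env.step(action)
        miss = info["miss_distance_target"]   # assumed info field
        miss_distances.append(miss)
        successes += miss > kill_radius
    return successes / n_episodes, float(np.mean(miss_distances))
```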
4.4.1. Baselines
The SOGLs for the target and the defender were considered as the OGL benchmark. Through a brief derivation similar to that in Section 3, the SOGLs for the target and the defender can be derived.
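A representative form of these laws, assuming the same bang-bang structure and ZEM sign conventions as above (with $u_T^{\max}$ and $u_D^{\max}$ denoting the maximum target and defender accelerations), is:
\[
u_T^{*}(t)=u_T^{\max}\,\operatorname{sgn}\!\left(Z_{IT}(t)\right),\qquad
u_D^{*}(t)=-\,u_D^{\max}\,\operatorname{sgn}\!\left(Z_{ID}(t)\right).
\]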
4.4.2. Convergence and Escape Success Rate
Figure 6 displays the learning curves depicting the mean accumulated reward across learning episodes for the two cases. In the ESC case, the agent's reward increased consistently throughout the training episodes, ultimately stabilizing at around 6000 after 4000 iterations. Conversely, within the sparse reward (SR) framework, the ICAAI agent encountered a plateau phenomenon during training, resulting in an unstable convergence process for the associated reward function and eventual convergence failure.
Figure 7 presents success-rate curves for target evasion over learning episodes, comparing agents trained with and without ESC. The green line denotes the OGL's deterministic-environment success rate of 83.4%. The ESC-trained agent surpassed this baseline after approximately 2700 episodes, achieving a peak performance of 99% after around 13,800 episodes. Conversely, the agent trained without ESC exhibited a gradual decline in performance after reaching a peak of 77%, indicating overfitting of the policy network during continued training. The ESC-trained agent thus converged faster and to a better local optimum. It can be inferred that the proposed ESC training approach effectively organizes exploration, addressing sparse-reward issues and showing improved learning efficiency and asymptotic performance, while also mitigating overfitting.
4.4.3. Performance Test
Figure 8 depicts the spacecraft trajectories, featuring the interceptor's actual path (blue curve) and its trajectory as observed from the target's perspective (yellow curve). Figure 9 displays the lateral acceleration profiles of each spacecraft, while Figure 10 illustrates the ZEM between the target and the interceptor and between the defender and the interceptor. The simulation results presented in Figure 11 illustrate the impact of the relative distance between the target and the defender on the game outcomes for the target.
Figure 8, Figure 9 and Figure 10 illustrate the evident cooperation between the target and the defender, which make use of relative state information. Taking the simulation results shown in these figures as an example, the miss distance between the target and the interceptor was approximately 15 m, while the defender achieved a miss distance of less than 1 m from the interceptor, confirming the effectiveness of its interception threat. Figure 9 and Figure 10 show that, within the first 16 s of the scenario, the target collaborated with the defender, executing subtle maneuvers to help intercept the interceptor. At around the 16 s mark, the interceptor perceived the threat and initiated an escape strategy. Simultaneously, the target executed an evasive maneuver in the opposite direction, utilizing its maximum maneuverability, which resulted in an increase in distance. Ultimately, the interceptor managed to evade the defender's interception attempt but failed to intercept the target in time, leading to the target's successful evasion.
In addition, the above simulation results show that the relative distance between the target and the defender directly determines the time available to the interceptor to intercept the target after evading the defender. Consequently, this distance significantly influences the game outcomes for the target, including the evasion success rate and the miss distance. Therefore, to explore its effect on the performance of ICAAI, the game results for target-defender distances ranging from 0 to 15 km are presented in Figure 11.
As evident from Figure 11, employing the ICAAI intelligent game algorithm, the target achieves success rates of no less than approximately 90% when its relative distance to the defender is less than 10 km. However, as this distance increases from 10 to 15 km, the success rate of target evasion decreases from 90% to 0%. These simulation results illustrate that a smaller relative distance leads to a higher evasion success rate. Additionally, the curve depicting the average miss distance for the target reveals that the miss distance first increases and then decreases with the target-defender distance, reaching its maximum value of approximately 50 m at a relative distance of around 5 km. This phenomenon can be attributed to the fact that, when the distance is less than 5 km, the miss distance increases with the target's evasion time, and the interceptor does not have sufficient time to alter its trajectory to intercept the target. Conversely, when the distance exceeds 5 km, the interceptor has ample time to intercept the target after evading the defender, so the miss distance decreases as the distance increases.
4.5. Experiment 3: Adaptiveness of the ICAAI Guidance
In the real-world game confrontation process, obtaining the opponent's prior knowledge, such as its maximum acceleration and time constant, is often impractical. To assess the adaptability of the proposed ICAAI guidance method compared to the OGL method under conditions of unknown opponent knowledge, several comparison conditions were designed and evaluated using Monte Carlo simulation. The adaptive capabilities of both methods were analyzed based on the game results (escape success rate and miss distance) of the target spacecraft employing the two strategies.
While the target utilized OGL guidance, it adopted fixed values of the interceptor's maximum acceleration (in g) and time constant (in s) as predictions of the interceptor's prior knowledge, while the actual values of these parameters were varied in the simulations. The simulation results are shown in Figure 12.
As depicted in Figure 12a, as the interceptor's maneuverability improves, the target's escape ability decreases for both guidance methods. However, it is evident that, when employing the ICAAI guidance, the rate of decline in the target's escape ability is significantly lower than with the OGL guidance method. Similarly, Figure 12b demonstrates that an increase in the interceptor's response speed yields a trend in the target's escape ability similar to that in Figure 12a. Specifically, when the interceptor's prior knowledge is estimated accurately, the escape abilities of the two methods are comparable. However, when the prior-knowledge error exceeds 25%, the OGL guidance leads to a reduction of over 75% in the target's escape ability, while the ICAAI guidance results in less than a 34% decrease. In conclusion, the proposed ICAAI guidance exhibits superior adaptability compared to the OGL guidance when the interceptor's prior knowledge is unknown.
Remark 2. As an analytical method, the SOGL is stable but inflexible due to its theoretical framework [46] and stringent assumptions [47]. Correspondingly, the ICAAI control strategies are flexible and can be continuously optimized. The proposed method is independent of the time constant, which means that it performs better with less prior knowledge than the OGL. Furthermore, the adaptability of the proposed method can be improved by accounting for a tolerance on the maximum interceptor acceleration.
4.6. Experiment 4: Robustness of the RL-Based Guidance Method
In addition to the unperturbed, fully observable game, the following noisy, partially observable game studies have been analyzed separately in this manuscript. The parameters used to describe the imperfect-information model defined in Section 3 are shown in Table 5. The Monte Carlo simulation method is used to obtain the escape success rate and the miss distance of the target using the proposed ICAAI guidance and the SOGL guidance under different noise conditions. The results of the Monte Carlo simulation are shown in Figure 13.
Based on the simulation results of Case 2, it was observed that the OGL method is significantly sensitive to LOS noise. In scenarios without LOS noise, the escape success rate of the proposed ICAAI guidance matched that of the OGL guidance, and in some cases the OGL method even achieved a larger miss distance. However, as the LOS noise standard deviation increased to 0.05 mrad, the success rate of the OGL method dropped to approximately 50%. Eventually, at a LOS noise standard deviation of 0.15 mrad, the target was practically unable to escape using the SOGL method, while the ICAAI guidance still maintained an escape success rate of around 80%.
Analyzing the simulation results of Cases 1 and 3, it was found that, owing to the presence of LOS noise, the target employing the OGL method exhibited reduced sensitivity to acceleration and velocity noise. Nevertheless, its escape capability remained weaker than that of the ICAAI guidance. This can be attributed to the policy network weighting the observation channels differently, leveraging the exploration mechanism of reinforcement learning (RL). Consequently, even though the agent was trained in a deterministic environment, the resulting guidance policy is robust and exhibits strong noise resistance.