Article

Cooperative Guidance Strategy for Active Spacecraft Protection from a Homing Interceptor via Deep Reinforcement Learning

1 School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen 518107, China
2 National Key Laboratory of Science and Technology on Test Physics and Numerical Mathematics, Beijing 100076, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(19), 4211; https://doi.org/10.3390/math11194211
Submission received: 29 August 2023 / Revised: 27 September 2023 / Accepted: 7 October 2023 / Published: 9 October 2023

Abstract:
The cooperative active defense guidance problem for a spacecraft with active defense is investigated in this paper. An engagement between a spacecraft, an active defense vehicle, and an interceptor is considered, where the target spacecraft with active defense attempts to evade the interceptor. Prior knowledge uncertainty and observation noise are taken into account simultaneously, both of which are critical for traditional guidance strategies such as the differential-game-based guidance method. In this setting, we propose an intelligent cooperative active defense (ICAAI) guidance strategy based on deep reinforcement learning. ICAAI effectively coordinates defender and target maneuvers to achieve successful evasion with less prior knowledge and under observation noise. Furthermore, we introduce an efficient and stable convergence (ESC) training approach employing reward shaping and curriculum learning to tackle the sparse reward problem in ICAAI training. Numerical experiments are included to demonstrate ICAAI's real-time performance, convergence, adaptiveness, and robustness through the learning process and Monte Carlo simulations. The learning process showcases improved convergence efficiency with ESC, while simulation results illustrate ICAAI's enhanced robustness and adaptiveness compared to optimal guidance laws.

1. Introduction

Spacecraft such as satellites, space stations, and space shuttles play an important role in both civil and military activities. They are also at risk of being intercepted in the exo-atmosphere. The pursuit-evasion game between a spacecraft and an interceptor is critical in the competition for space resources and has been widely studied in recent years. The trajectory of a spacecraft can be accurately predicted [1], since its dynamics are generally described in terms of a two-body problem. With the development of accurate sensors, guidance technology, small-sized propulsion systems, and fast servo-mechanism techniques, the Kinetic Kill Vehicle (KKV), which can be used for direct-hit kills, has superior maneuverability compared to other spacecraft. In other words, it is not practical for a targeted spacecraft involved in the pursuit-evasion game to rely solely on orbital maneuvering.
Among the many available countermeasures, launching an Active Defense Vehicle (ADV) as a defender to intercept the incoming threat has proven to be an effective approach to compensate for inferior target maneuverability [2,3,4]. In an initial study [2], Boyell proposed the active defense strategy of launching a defensive missile to protect the target from a homing missile and derived approximate normalized curves of the game results, under the condition of a constant or stationary target velocity, based on the relative motion among the three participants. The dynamic three-body framework was introduced by Rusnak in Ref. [4], inspired by the narrative of a "lady-bodyguard-bandit" situation. This framework was later transformed into a "target-interceptor-defender" (TID) three-body spacecraft active defense game scenario as described in Ref. [3]. In the TID scenario, the defender aims to reduce its distance from the interceptor, while the interceptor endeavors to increase its distance from the defender and successfully intercept the target. In Refs. [3,4], Rusnak proposed a game guidance method for the TID scenario based on multiple-objective optimization and differential game theories. It was proven that the proposed active defense method significantly reduces the miss distance and the required acceleration level between the interceptor and the defender.
The efficacy of the active defense method has drawn increased attention to the collaborative strategy between the target and defender in the TID scenario. Traditional methods for solving optimal strategies in this context include optimal control [5,6,7] and differential game theories [8,9,10]. In Ref. [7], Weiss employed optimal control theory to independently design the guidance for both the target and defender. This approach considered the influence of target maneuvers on the interceptor's effectiveness as a defender. Furthermore, in Ref. [6], collaborative game strategies for the target and defender were proposed, emphasizing their combined efforts in the TID scenario. Aiming at the multi-member TID scenario in which a single target carries two defenders against two interceptors, Ref. [5] designed a multi-member cooperative game guidance strategy and considered the fuel consumption of the target and defender. However, optimal-control-based strategies rely on perfect information, demanding accurate maneuvering details of the interceptor. In contrast, differential game approaches require prior knowledge instead of accurate target acceleration information, enhancing algorithm robustness [11]. In Ref. [8], optimal cooperative pursuit and evasion strategies were proposed using Pontryagin's minimum principle. A similar scenario was studied in Ref. [9] for both continuous and discrete domains using the linear–quadratic differential game method. It is worth noting that the differential game control strategies proposed in Ref. [9] address the fuel cost and saturation problems. However, they introduce computational problems and make the selection of weight parameters more difficult. A switching surface [10], designed with the zero-effort miss distance, was introduced to divide the multi-agent engagement into two one-on-one differential games, thereby achieving a balance between performance and usability. Nonetheless, using the differential game method to solve the multi-agent pursuit-evasion game problem still faces shortcomings [11,12,13]. First, it is difficult to establish a scene model of a multi-member, multi-role game because the dimension of the state increases dramatically; second, it places high demands on the accuracy of prior knowledge, and the success rate of the game is low if the prior knowledge of the players cannot be obtained accurately; third, the differential game algorithm is complicated, involving high-dimensional matrix operations, power function operations, integral calculations, etc., which place a high demand on the computational resources of the spacecraft. More on this topic can be found in [14,15,16,17,18,19,20].
With the advancement of machine learning technology, Deep Reinforcement Learning (DRL) has emerged as a promising approach for addressing active defense guidance problems. In DRL, an agent interacts with the environment and receives feedback in the form of rewards, enabling it to improve its performance and achieve specific tasks. This mechanism has led to successful applications of DRL in various decision-making domains, including robot control, MOBA games, autonomous driving, and navigation [21,22,23,24,25]. In Ref. [26], DRL was utilized to learn an adaptive homing phase control law, accounting for sensor and actuator noise and delays. Another work [27] proposed an adaptive guidance system to address the landing problem using reinforcement meta-learning, adapting agent training from one environment to another within a limited number of steps and showcasing robust policy optimization in the presence of parameter uncertainties. In the context of the TID scenario, Lau [28] demonstrated the potential of using reinforcement learning for active defense guidance, although an optimal strategy was not obtained in that preliminary investigation.
It is worth pointing out that, on one hand, to better align with real-world engineering applications, research on guidance methods often needs to consider the presence of various information gaps and noise [29,30]. However, most existing optimal active defense guidance methods rely on perfect information assumptions, leading to subpar performance when faced with unknown prior knowledge or observation noise. Additionally, these methods often struggle to meet the real-time requirements of spacecraft applications. On the other hand, the majority of reinforcement learning algorithms have been applied to non-adversarial or weakly adversarial flight missions, where mission objectives and process rewards are clear and intuitive. In the highly competitive TID game scenario, however, obtaining effective reward information becomes challenging due to the intense confrontation between agents, leading to sparse reward problems or the "plateau phenomenon" [31].
Given these observations, there is a strong motivation to develop an active defense guidance method based on reinforcement learning that possesses enhanced real-time capabilities, adaptiveness, and robustness, while addressing the challenges posed by adversarial scenarios and sparse reward issues.
In this paper, we focus on the design of a cooperative active defense guidance strategy for a target spacecraft with active defense attempting to evade an interceptor in space. This TID scenario holds significant importance in the domains of space attack-defense and ballistic missile penetration. The paper begins by deriving the kinematic and first-order dynamic models of the engagement scenario. Subsequently, an intelligent cooperative active defense (ICAAI) guidance method is proposed, utilizing the twin-delayed deep deterministic policy gradient (TD3) algorithm. To address the challenge of sparse rewards, an efficient and stable convergence (ESC) training approach is introduced. Furthermore, benchmark comparisons are made using optimal guidance laws (OGLs), and simulation analyses are presented to validate the performance of the proposed method.
The paper is organized as follows. In Section 2, the problem formulation is provided. In Section 3, the guidance law is developed. In Section 4, experiments are presented where the proposed method has been compared with its analytical counterpart, followed by the conclusions presented in Section 5.

2. Problem Formulation

Consider a multi-agent game with a spacecraft as the main target (T), an active defense vehicle as the defender (D), and a highly maneuverable small spacecraft as the interceptor (I). In this battle, the interceptor chases the target, which launches the defender to protect itself by destroying the interceptor. During the endgame, all players are considered as constant-speed mass points whose trajectories can be linearized around the initial line of sight. As a consequence of trajectory linearization, the engagement, a three-dimensional process, can be simplified and will be analyzed in one plane. However, it should be noted that in most cases these assumptions do not affect the generality of the results [11].
A schematic view of the engagement is shown in Figure 1, where XOY is a Cartesian inertial reference frame. The distances between the players are denoted as ρ_ID and ρ_IT, respectively. Each player's velocity is indicated as V_I, V_T, and V_D, while their accelerations are represented as a_I, a_T, and a_D. The flight path angles of the players are defined as ϕ_I, ϕ_T, and ϕ_D, respectively. The lines of sight (LOS) between the players are described by LOS_ID and LOS_IT, and the angles between the LOS and the X-axis are denoted as λ_ID and λ_IT. The lateral displacements of each player relative to the X-axis are represented as y_I, y_T, and y_D, while the relative displacements between the players are defined as y_IT and y_ID.
Considering the collective mission objectives, the target's priority is to evade the interceptor with the defender's support. Simultaneously, the interceptor aims to avoid the defender while chasing the target. Consequently, the target's guidance law strives to make the interceptor-target miss as large as possible, while the defender's guidance law aims to drive the interceptor-defender miss to zero. Conversely, the interceptor's guidance law assumes the opposite roles (as depicted in Figure 1). This scenario can thus be segmented into two collision triangles: one involving the interceptor and the target, and the other the interceptor and the defender.

2.1. Equations of Motion

Consider the I-T collision triangle and the I-D collision triangle in the multi-agent pursuit-evasion engagement. The kinematics are expressed using polar coordinate systems attached to the target and defender as follows:
$$\dot{\rho}_{IT} = -V_I\cos(\phi_I+\lambda_{IT}) - V_T\cos(\phi_T-\lambda_{IT}), \qquad \dot{y}_{IT} = V_I\sin\phi_I - V_T\sin\phi_T, \qquad \dot{\lambda}_{IT} = \frac{V_I\sin(\phi_I+\lambda_{IT}) - V_T\sin(\phi_T-\lambda_{IT})}{\rho_{IT}}$$
$$\dot{\rho}_{ID} = -V_I\cos(\phi_I+\lambda_{ID}) - V_D\cos(\phi_D-\lambda_{ID}), \qquad \dot{y}_{ID} = V_I\sin\phi_I - V_D\sin\phi_D, \qquad \dot{\lambda}_{ID} = \frac{V_I\sin(\phi_I+\lambda_{ID}) - V_D\sin(\phi_D-\lambda_{ID})}{\rho_{ID}}$$
Furthermore, the flight-path-angle dynamics can be defined for each of the players:
$$\dot{\phi}_i = \frac{a_i}{V_i}, \qquad i \in \{I, T, D\}$$
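For illustration only, the nonlinear kinematics and flight-path-angle dynamics above can be evaluated numerically as a right-hand-side function for one pursuer–evader pair; the sketch below is not the authors' code and assumes the reconstructed sign conventions of the equations above.

```python
import numpy as np

def engagement_rhs(rho, lam, phi_p, phi_e, V_p, V_e, a_p, a_e):
    """Right-hand side of the pair kinematics for one pursuer (p) and one evader (e).

    rho   : relative distance of the pair (rho_IT or rho_ID)
    lam   : LOS angle with respect to the X-axis (lambda_IT or lambda_ID)
    phi_* : flight path angles; V_*, a_* : speeds and lateral accelerations
    """
    rho_dot = -V_p * np.cos(phi_p + lam) - V_e * np.cos(phi_e - lam)
    y_dot = V_p * np.sin(phi_p) - V_e * np.sin(phi_e)
    lam_dot = (V_p * np.sin(phi_p + lam) - V_e * np.sin(phi_e - lam)) / rho
    phi_p_dot = a_p / V_p   # flight-path-angle dynamics
    phi_e_dot = a_e / V_e
    return rho_dot, y_dot, lam_dot, phi_p_dot, phi_e_dot
```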

2.2. Linearized Equations of Motion

In the research context, both the LOS angle λ and the flight path angle ϕ are small quantities, and the inter-spacecraft distances are much larger than the spacecraft velocities. Furthermore, during the terminal guidance phase, the rate of change in the spacecraft velocity magnitude approaches zero. Therefore, the equations of motion can be linearized around the initial line-of-sight:
$$\dot{\rho}_{IT} = -V_I\cos(\phi_I+\lambda_{IT}) - V_T\cos(\phi_T-\lambda_{IT}) \approx -(V_I + V_T)$$
$$\ddot{y}_{IT} = \frac{d}{dt}\left(V_I\sin\phi_I - V_T\sin\phi_T\right) \approx \dot{V}_I\phi_I - \dot{V}_T\phi_T + V_I\dot{\phi}_I - V_T\dot{\phi}_T = \dot{V}_I\phi_I - \dot{V}_T\phi_T + a_I - a_T \approx a_I - a_T$$
$$\dot{\lambda}_{IT} = \frac{V_I\sin(\phi_I+\lambda_{IT}) - V_T\sin(\phi_T-\lambda_{IT})}{\rho_{IT}} \approx \frac{V_I(\phi_I+\lambda_{IT}) - V_T(\phi_T-\lambda_{IT})}{\rho_{IT}} \approx 0$$
$$\dot{\rho}_{ID} = -V_I\cos(\phi_I+\lambda_{ID}) - V_D\cos(\phi_D-\lambda_{ID}) \approx -(V_I + V_D)$$
$$\ddot{y}_{ID} = \frac{d}{dt}\left(V_I\sin\phi_I - V_D\sin\phi_D\right) \approx \dot{V}_I\phi_I - \dot{V}_D\phi_D + V_I\dot{\phi}_I - V_D\dot{\phi}_D = \dot{V}_I\phi_I - \dot{V}_D\phi_D + a_I - a_D \approx a_I - a_D$$
$$\dot{\lambda}_{ID} = \frac{V_I\sin(\phi_I+\lambda_{ID}) - V_D\sin(\phi_D-\lambda_{ID})}{\rho_{ID}} \approx \frac{V_I(\phi_I+\lambda_{ID}) - V_D(\phi_D-\lambda_{ID})}{\rho_{ID}} \approx 0$$
The dynamics of each player are assumed to be a first-order process:
$$\dot{a}_i = \frac{u_i - a_i}{\tau_i}, \qquad i \in \{I, T, D\}$$
Furthermore, the variable vector can be defined as follows:
$$x = \begin{bmatrix} y_{IT} & \dot{y}_{IT} & y_{ID} & \dot{y}_{ID} & a_I & a_T & a_D \end{bmatrix}^T$$
while the linearized equations of motion in the state space form can be written as follows:
$$\dot{x} = A x + B \begin{bmatrix} u_I & u_T & u_D \end{bmatrix}^T$$
where
$$A = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & -1 \\ 0 & 0 & 0 & 0 & -1/\tau_I & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & -1/\tau_T & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & -1/\tau_D \end{bmatrix}$$
$$B = \begin{bmatrix} 0_{4\times 3} \\ B_1 \end{bmatrix}, \qquad B_1 = \begin{bmatrix} 1/\tau_I & 0 & 0 \\ 0 & 1/\tau_T & 0 \\ 0 & 0 & 1/\tau_D \end{bmatrix}$$
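As a minimal sketch (assuming the state ordering of the variable vector above and the reconstructed signs of the A matrix), the linearized model can be assembled and propagated as follows; a simple forward-Euler step is used purely for illustration.

```python
import numpy as np

def build_linear_model(tau_I, tau_T, tau_D):
    """A and B for x = [y_IT, y_IT_dot, y_ID, y_ID_dot, a_I, a_T, a_D]."""
    A = np.zeros((7, 7))
    A[0, 1] = 1.0                       # d/dt y_IT = y_IT_dot
    A[1, 4], A[1, 5] = 1.0, -1.0        # y_IT_ddot = a_I - a_T
    A[2, 3] = 1.0                       # d/dt y_ID = y_ID_dot
    A[3, 4], A[3, 6] = 1.0, -1.0        # y_ID_ddot = a_I - a_D
    A[4, 4] = -1.0 / tau_I              # first-order acceleration lags
    A[5, 5] = -1.0 / tau_T
    A[6, 6] = -1.0 / tau_D
    B = np.zeros((7, 3))
    B[4, 0] = 1.0 / tau_I               # u_I channel
    B[5, 1] = 1.0 / tau_T               # u_T channel
    B[6, 2] = 1.0 / tau_D               # u_D channel
    return A, B

def euler_step(x, u, A, B, dt):
    """One explicit Euler step of x_dot = A x + B u, with u = [u_I, u_T, u_D]."""
    return x + dt * (A @ x + B @ u)
```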
Since the velocity of each player is assumed to be constant, the engagement can be formulated as a fixed-time process. Thus, the interception time can be calculated using the following:
$$t_{f,IT} = -\rho_{IT_0}/\dot{\rho}_{IT} = \rho_{IT_0}/(V_I + V_T), \qquad t_{f,ID} = -\rho_{ID_0}/\dot{\rho}_{ID} = \rho_{ID_0}/(V_I + V_D)$$
where ρ IT 0 represents the initial relative distance between the interceptor and the target, while ρ ID 0 is the distance between the interceptor and the defender, allowing us to define the time-to-go of each engagement by
$$t_{go,IT} = t_{f,IT} - t, \qquad t_{go,ID} = t_{f,ID} - t$$
which represents the expected remaining game time for the interceptor in the “Interceptor vs. Target” and “Interceptor vs. Defender” game scenarios, respectively.

2.3. Zero-Effort Miss

A well-known zero-effort miss (ZEM) is introduced in the guidance law design and reward function design. It is obtained from the homogeneous solutions of equations of motion and is only affected by the current state and interception time. It can be calculated as follows:
$$Z_{IT}(t) = L_1\,\Phi(t, t_{f,IT})\,x(t), \qquad Z_{ID}(t) = L_2\,\Phi(t, t_{f,ID})\,x(t)$$
where
$$L_1 = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}, \qquad L_2 = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 0 \end{bmatrix}$$
Thus, the ZEM and its derivative with respect to time are given as follows:
$$Z_{IT}(t) = x_1 + t_{go,IT}\,x_2 + \tau_I^2\,\varphi(t_{go,IT}/\tau_I)\,x_5 - \tau_T^2\,\varphi(t_{go,IT}/\tau_T)\,x_6$$
$$Z_{ID}(t) = x_3 + t_{go,ID}\,x_4 + \tau_I^2\,\varphi(t_{go,ID}/\tau_I)\,x_5 - \tau_D^2\,\varphi(t_{go,ID}/\tau_D)\,x_7$$
$$\dot{Z}_{IT}(t) = \tau_I\,\varphi(t_{go,IT}/\tau_I)\,u_I - \tau_T\,\varphi(t_{go,IT}/\tau_T)\,u_T$$
$$\dot{Z}_{ID}(t) = \tau_I\,\varphi(t_{go,ID}/\tau_I)\,u_I - \tau_D\,\varphi(t_{go,ID}/\tau_D)\,u_D$$
where
$$\varphi(\chi) = e^{-\chi} + \chi - 1$$
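A small sketch of the ZEM quantities, under the reconstructed signs above (pursuer acceleration entering positively, evader acceleration negatively), could look as follows; it is illustrative, not the authors' implementation.

```python
import numpy as np

def phi_fn(chi):
    """phi(chi) = exp(-chi) + chi - 1."""
    return np.exp(-chi) + chi - 1.0

def zem(y, y_dot, a_p, a_e, tau_p, tau_e, t_go):
    """Zero-effort miss for one pursuer-evader pair (Z_IT or Z_ID)."""
    return (y + t_go * y_dot
            + tau_p**2 * phi_fn(t_go / tau_p) * a_p
            - tau_e**2 * phi_fn(t_go / tau_e) * a_e)

def zem_rate(u_p, u_e, tau_p, tau_e, t_go):
    """Time derivative of the ZEM driven by the commands u_p and u_e."""
    return tau_p * phi_fn(t_go / tau_p) * u_p - tau_e * phi_fn(t_go / tau_e) * u_e
```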

2.4. Problem Statement

This research focuses on the terminal guidance task of evading a homing interceptor for a maneuvering target with active defense. We design a cooperative active defense guidance to facilitate coordinated maneuvers between the target and the defender based on DRL. This enables the target to evade the interceptor’s interception while allowing the defender to counter-intercept the incoming threat.

3. Guidance Law Development

In this section, we develop the Intelligent Cooperative Active Defense (ICAAI) guidance strategy and design an efficient and stable convergence (ESC) training approach. The target and defender utilize ICAAI guidance, while the interceptor employs OGL. We describe the game scenario using a Markov process, present the ICAAI guidance strategy, and design an ESC training approach based on reward shaping and curriculum learning.

3.1. Markov Decision Process

The sequential decision making in which an autonomous RL agent interacts with the environment (e.g., the engagement) can be formally described as an MDP, which is required to properly set up the mathematical framework of a DRL problem. A generic time-discrete MDP can be represented as a 6-tuple $\{s, o, a, P_{sa}, \gamma, R\}$. $s_t \in S \subseteq \mathbb{R}^n$ is a vector that completely identifies the state of the system (e.g., the EOM) at time t. Generally, the complete state is not available to the agent at each time t; the decision making relies on an observation vector $o_t \in O \subseteq \mathbb{R}^m$. In the present paper, the observations are defined as an uncertain (e.g., imperfect and noisy) version of the true state, which can be written as a function $\Omega$ of the current state $s_t$. The action $a_t \in A \subseteq \mathbb{R}^l$ of the agent is given by a state-feedback policy $\pi: O \rightarrow A$, that is, $a_t = \pi(o_t)$. $P_{sa}$ is a time-discrete dynamic model describing the transition led by the state–action pair $(s_t, a_t)$. As a result, the evolution rule of the dynamic system can be described as follows:
$$s_{t+1} = P_{sa}(s_t, a_t), \qquad o_t = \Omega(s_t), \qquad a_t = \pi(o_t)$$
Since a fixed-time engagement is considered, the interaction between the agent and the environment gives rise to a trajectory Ι :
$$Ι = [\iota_1, \iota_2, \ldots, \iota_t, \ldots, \iota_{T-1}, \iota_T], \qquad \iota_t = [o_t, a_t, r_t]^T$$
where the trajectory information at each time step, ι_t, is composed of the observation o_t, action a_t, and reward signal r_t generated through the interaction between the agent and the environment.
The return received by the agent at time t along the trajectory Ι is defined as a discounted sum of rewards:
$$R_t^{Ι} = \sum_{i=t}^{T} \gamma^{\,i-t}\, r_i$$
where $\gamma \in (0, 1]$ is a discount rate determining whether the agent has a long-term vision ($\gamma = 1$) or is short-sighted ($\gamma \ll 1$).
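As a small worked example, the discounted return above can be evaluated from a recorded reward sequence by backward accumulation (a sketch with hypothetical variable names):

```python
def discounted_return(rewards, gamma):
    """Return R_t = sum_{i=t}^{T} gamma**(i - t) * r_i for rewards = [r_t, ..., r_T]."""
    R = 0.0
    for r in reversed(rewards):   # accumulate backwards to avoid explicit powers
        R = r + gamma * R
    return R

# e.g., discounted_return([0.1, 0.0, 1.0], gamma=0.99)
```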
Prior to deriving the current guidance law, we outline the key elements of the MDP: state space, action space, and observations. We present the reward design separately by highlighting a crucial aspect of the configuration.

3.1.1. Perfect Information Model

In a deterministic model, the basic assumption is that each player has perfect information about the interceptor (e.g., states, maximum acceleration, and time constant). The communication of this information between the defender and the protected target is assumed to be ideal and without delay. Thus, the state space can be identified by states, maximum acceleration, and time constant:
$$s_t = \begin{bmatrix} t & x_t & y_t & V_t & a_t & a_{\max} & \tau \end{bmatrix}^T$$
$$x_t = \begin{bmatrix} x_{t,T} & x_{t,D} & x_{t,I} \end{bmatrix}, \qquad y_t = \begin{bmatrix} y_{t,T} & y_{t,D} & y_{t,I} \end{bmatrix}$$
$$V_t = \begin{bmatrix} V_{t,T} & V_{t,D} & V_{t,I} \end{bmatrix}, \qquad a_t = \begin{bmatrix} a_{t,T} & a_{t,D} & a_{t,I} \end{bmatrix}$$
$$a_{\max} = \begin{bmatrix} a_{\max,T} & a_{\max,D} & a_{\max,I} \end{bmatrix}, \qquad \tau = \begin{bmatrix} \tau_T & \tau_D & \tau_I \end{bmatrix}$$
As with any multi-agent system, the interactions between players introduce uncertainty into the environment, which significantly affects the stability of the RL algorithm. Given the full cooperation between the defender and the target under the communication assumptions, the model learns a shared guidance law for both. This effectively mitigates environmental uncertainty and enhances model convergence. In practical application, the same trained agent is assigned to the target–defender pair, yielding the following action space:
$$\text{action} = \begin{bmatrix} u_T & u_D \end{bmatrix}$$
Since the dynamics of the scenario are formulated in Section 2.1, the state can be propagated implicitly as the linearized equation of motion presented in Equations (4)–(6).

3.1.2. Imperfect Information Model

The imperfection of information is usually due to the limitations of radar measurement and the absence of prior knowledge. However, in existing studies, perfect information is a strong assumption, which leads to implementation difficulties in practice. To address this dilemma, this paper considers information degradation. On the one hand, the interceptor is assumed to have perfect information (i.e., the relative states and maneuverability of the target and the defender). On the other hand, the observations of the target and defender are imperfect and even noise-corrupted. The observation uncertainty is modeled as observation noise and a mask on the perfect information.
$$o_t = \Omega(s_t) = \Gamma s_t \times (I + \omega_{o,t}) = \begin{bmatrix} t & x_t & y_t & V_t & a_t \end{bmatrix}^T + \begin{bmatrix} 0 & \delta x_{o,t} & \delta y_{o,t} & \delta V_{o,t} & \delta a_{o,t} \end{bmatrix}^T$$
where $\Gamma$ is the mask matrix and $\omega_{o,t}$ is the observation noise vector, which can be calculated by Equations (29)–(32).
$$\omega_{o,t} = \begin{bmatrix} 0 & \delta x_{o,t} & \delta y_{o,t} & \delta V_{o,t} & \delta a_{o,t} \end{bmatrix} \sim U(0_{13}, \Sigma) \in \mathbb{R}^{13}$$
$$\Sigma = \begin{bmatrix} 0 & 0 & 0 & \sigma_{xI} & 0 & 0 & \sigma_{yI} & 0 & 0 & \sigma_v & 0 & 0 & \sigma_a \end{bmatrix}^T$$
$$\sigma_{xI} = \cos(\sigma_{LOS} + \lambda_{IT})(\rho_{IT} + \sigma_\rho) - x_I \approx 0$$
$$\sigma_{yI} = \sin(\sigma_{LOS} + \lambda_{IT})(\rho_{IT} + \sigma_\rho) - y_I \approx \sigma_{LOS}\,\rho_{IT}$$
where $\Sigma$ represents the noise amplitude, with $\sigma_\rho$ (m), $\sigma_{LOS}$ (mrad), $\sigma_v$ (m/s), and $\sigma_a$ (m/s²) being nonnegative parameters.
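The sketch below illustrates one possible implementation of the noise amplitudes in Equations (29)–(32); the zero-mean draws and the helper name are our assumptions (the experiments in Section 4 use Gaussian seeker noise instead).

```python
import numpy as np

def interceptor_observation_noise(rho_IT, lam_IT, x_I, y_I,
                                  sigma_rho, sigma_los, sigma_v, sigma_a, rng=None):
    """Draw position, velocity, and acceleration errors for the interceptor states."""
    rng = np.random.default_rng() if rng is None else rng
    # Cartesian position-error amplitudes induced by range and LOS errors
    sigma_xI = np.cos(sigma_los + lam_IT) * (rho_IT + sigma_rho) - x_I   # approx. 0
    sigma_yI = np.sin(sigma_los + lam_IT) * (rho_IT + sigma_rho) - y_I   # approx. sigma_los * rho_IT
    # zero-mean draws within the amplitudes (an assumption on the U(0, Sigma) notation)
    dx = rng.uniform(-abs(sigma_xI), abs(sigma_xI))
    dy = rng.uniform(-abs(sigma_yI), abs(sigma_yI))
    dV = rng.uniform(-sigma_v, sigma_v)
    da = rng.uniform(-sigma_a, sigma_a)
    return dx, dy, dV, da
```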

3.2. ICAAI Guidance Law Design

In this section, we present the mathematical framework of actor–critic RL algorithms, focusing on the algorithm used in ICAAI guidance: the Twin-Delayed Deep Deterministic Policy Gradient (TD3) [32]. TD3 is an advanced deterministic policy gradient reinforcement learning algorithm. In comparison to stochastic policy gradient algorithms like Proximal Policy Optimization (PPO) [33] and Asynchronous Advantage Actor–Critic (A3C) [34], TD3 exhibits a higher resistance to converging into local optima. Furthermore, when compared to traditional deterministic policy gradient RL algorithms such as Deep Deterministic Policy Gradient (DDPG) [35], TD3 achieves superior training stability and convergence efficiency. This assertion is supported by our prior RL algorithm selection experiments, as illustrated in Figure 2.
Without loss of generality, throughout the entire section the MDP is supposed to be perfectly observable (i.e., with o_t = s_t) to conform with the standard notation of RL. However, the perfect information state s_t can be replaced by the observation o_t whenever the observations differ from the state.

3.2.1. Actor–Critic Algorithms

The RL problem's goal is to find the optimal policy $\pi_\phi$ with parameters $\phi$ that maximizes the expected return, which can be formulated as follows:
$$J(\phi) = \mathbb{E}_{\tau \sim \pi_\phi}\!\left[R_0^\tau\right] = \mathbb{E}_{\tau \sim \pi_\phi}\!\left[\sum_{i=0}^{T} \gamma^{i}\, r_i\right]$$
where $\mathbb{E}_{\tau \sim \pi_\phi}$ denotes the expectation taken over the trajectory $\tau$. In actor–critic algorithms, the policy, known as the actor, can be updated by using a deterministic policy gradient algorithm [36]:
$$\nabla_\phi J(\phi) = \mathbb{E}_{P_{sa}}\!\left[\nabla_a Q^\pi(s, a)\big|_{a = \pi(s)}\, \nabla_\phi \pi_\phi(s)\right]$$
The expected return when performing action a in state s and following π thereafter is called the critic or the value function, which can be formulated as follows:
$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi_\phi}\!\left[R_t^\tau \,\middle|\, s, a\right]$$
The value function can be learned through off-policy temporal difference learning, an update rule based on the Bellman equation, which describes the relationship between the value of the state–action pair $(s, a)$ and the value of the subsequent state–action pair $(s', a')$:
$$Q^\pi(s, a) = r + \gamma\, \mathbb{E}_{s', a'}\!\left[Q^\pi(s', a')\right]$$
In deep Q-learning [37], the value function can be estimated with a neural network approximator $Q_\theta(s, a)$ with parameters $\theta$, and the network is updated by using temporal difference learning with a secondary frozen target network $Q_{\theta'}(s, a)$ to maintain a fixed objective $U$ over multiple updates:
$$U = r + \gamma\, Q_{\theta'}(s', a'), \qquad a' = \pi_{\phi'}(s')$$
where the action $a'$ is determined by a target actor network $\pi_{\phi'}$. Generally, the loss function and update rule can be formulated as follows:
$$J(\theta) = U - Q_\theta(s, a)$$
$$\nabla_\theta J(\theta) = \left[U - Q_\theta(s, a)\right]\nabla_\theta Q_\theta(s, a)$$
The parameters of target networks are updated periodically to exactly match the parameters of the corresponding current networks, which is called delayed update. This leads to the original actor–critic method, the basic structure of which is shown in Figure 3.

3.2.2. Twin-Delayed Deep Deterministic Policy Gradient Algorithm

To address common RL issues of actor–critic algorithms (i.e., overestimation bias and the accumulation of errors), the TD3 algorithm modifies the actor–critic framework in three respects.
A novel variant of double Q-learning [38] called clipped double Q-learning is developed to limit possible overestimation. This provides the update objective of the critic:
$$U = r + \gamma \min_{i=1,2} Q_{\theta_i'}\!\left(s', \pi_{\phi_1}(s')\right)$$
The policy network is updated at a lower frequency than the value networks, which is called the delayed policy update, and a soft update approach is adopted for the target networks, which can be formulated as follows:
$$\theta' \leftarrow \kappa\,\theta + (1 - \kappa)\,\theta'$$
where κ is a proportion parameter.
Target policy smoothing regularization is adopted to alleviate the overfitting phenomenon, which can be expressed as follows:
$$y = r + \gamma\, Q_{\theta'}\!\left(s', \pi_{\phi'}(s') + \varepsilon\right)$$
where $\varepsilon$ is clipped Gaussian noise.
An overview of the TD3 algorithm is demonstrated in Figure 4.
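A condensed PyTorch-style sketch of one TD3 update step, covering the three modifications above (clipped double-Q targets, delayed policy updates with soft target updates, and target policy smoothing), is given below. The network classes, the critic call signature, the shared critic optimizer, and the hyperparameter names are our assumptions, not the authors' implementation; the mean-squared TD error is used as the standard critic loss.

```python
import torch
import torch.nn.functional as F

def td3_update(batch, actor, actor_targ, critic1, critic2, critic1_targ, critic2_targ,
               actor_opt, critic_opt, step, gamma=0.99, kappa=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2, a_max=1.0):
    """One TD3 gradient step on a sampled batch (s, a, r, s2, done of matching shapes)."""
    s, a, r, s2, done = batch

    with torch.no_grad():
        # target policy smoothing: clipped Gaussian noise on the target action
        eps = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a2 = (actor_targ(s2) + eps).clamp(-a_max, a_max)
        # clipped double Q-learning: minimum of the two target critics
        q_targ = torch.min(critic1_targ(s2, a2), critic2_targ(s2, a2))
        U = r + gamma * (1.0 - done) * q_targ

    # both critics regress to the same target U
    critic_loss = F.mse_loss(critic1(s, a), U) + F.mse_loss(critic2(s, a), U)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # delayed policy update and soft target update theta' <- kappa*theta + (1-kappa)*theta'
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for net, targ in ((actor, actor_targ), (critic1, critic1_targ), (critic2, critic2_targ)):
            for p, p_t in zip(net.parameters(), targ.parameters()):
                p_t.data.mul_(1.0 - kappa).add_(kappa * p.data)
```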

3.2.3. Implementation Details

As for the network architecture setting, the agent observations are vectors with 13 dimensions. Both the guidance policy estimation (actor) and the value function estimation (critic) consist of three fully connected layers with sizes of 64, 256, and 512, respectively, along with layer normalization. The output layer has two units for the actor, representing the unified command of the target and defender, respectively, and one unit for the critic. The activation function is ReLU for the hidden layer neurons and linear for the output layer neuron. This structure is heuristically designed and can be generalized for efficient function approximation. Deeper and wider networks are avoided for real-time performance and fast convergence.
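A sketch of the stated network architecture (three fully connected layers of 64, 256, and 512 units with layer normalization, ReLU hidden activations, and a linear output) is shown below; it is our reading of the description, not the released model.

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(64, 256, 512)):
    """Fully connected trunk with layer normalization and ReLU, linear output layer."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.LayerNorm(h), nn.ReLU()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

# actor: 13-dimensional observation -> joint [u_T, u_D] command
actor = mlp(13, 2)
# critic: (observation, action) pair -> scalar value
critic = mlp(13 + 2, 1)
```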
The hyperparameters of TD3 have been devised and validated by empirical experiments, which are reported in Table 1.

3.3. ESC Training Technique

Aiming at the sparse reward problem in the multi-agent pursuit-evasion game, an efficient and stable convergence (ESC) training approach for reinforcement learning is proposed based on reward shaping [39] and curriculum learning [40].

3.3.1. Reward Shaping

The design of the reward function is the most challenging part of solving this multi-agent pursuit-evasion game through RL, as the function has to fit an engagement with an inherently sparse reward setting. Except for the common leadership mission, the pursuit-evasion game can be formulated as a strictly competitive zero-sum game. In addition, the agent policy network weights were randomly initialized at the beginning of training, while the interceptor was deployed with optimal guidance and is sufficiently aggressive.
In [41], a shaping technique was presented as a particularly effective approach to solving sparse reward problems through a series of biological experiments. The researchers divided a difficult task into several simple units and trained the animals according to an easy-to-hard schedule. This approach requires adjusting the reward signal to cover the entire training process, followed by gradual changes in task dynamics as training progresses. In [40], researchers took this idea further and proposed curriculum learning, a type of training strategy. In this work, the shaping technique and curriculum learning were used to speed up the convergence of neural networks and to increase the stability and performance of the algorithm.
The goal of the target and the defender is to make $Z_{ID}$ converge to zero as $t \to t_{f,ID}$ while keeping $Z_{IT}$ as large as possible. On the contrary, the interceptor control law is designed to make $Z_{IT}$ converge to zero while maintaining $Z_{ID}$ as large as possible.
For this reason, a non-sparse reward function is defined in Equations (43) and (44):
$$r_{\text{medium}} = \gamma\,\Phi(s') - \Phi(s)$$
$$\Phi(s) = \left|\frac{Z_{IT}}{\alpha_1}\right|^{\beta_1} - \left|\frac{Z_{ID}}{\alpha_2}\right|^{\beta_2}$$
$$r_{\text{terminal}} = \begin{cases} \sigma, & \text{if succeed} \\ -\sigma, & \text{else} \end{cases}$$
where $\gamma$ is the discount factor of the Markov decision process and $\alpha_1$, $\alpha_2$, $\beta_1$, $\beta_2$, and $\sigma$ are positive hyperparameters.
It must be stressed that, since both the number and the maneuverability of the players completely change the environment, the hyperparameter values used in this paper may not be universal. Thus, in the following subsection, the focus will be on the applied design method instead of the specific hyperparameter values. The r_terminal term is the terminal reward signal given to the terminal behavior of the agent, which is sparse but intuitive. Situations in which the interceptor is destroyed by the defender (when t = t_f,ID) or in which the interceptor is driven away by the defender and misses the target are judged as a success. Furthermore, r_medium is a non-sparse reward function based on the difference form of a potential function Φ(s), which ensures the consistency of the optimal strategy [42,43,44]. It is important to emphasize that the design of Φ(s) relies on a fractional exponential function. This function provides a continuous reward signal for the agent's evaluation of each state. Notably, it exhibits a useful property: as the base approaches infinity, the gradient decreases to zero, and as the base approaches zero, the gradient increases without bound. This characteristic significantly aids the agent in converging towards states where the base is either large or approaches zero.
In this paper, the defined reward function carries the physical meaning of the mission: the target must escape from the interceptor, while the defender has to get close to the interceptor. The r_medium value increases as Z_ID converges to zero or as Z_IT increases. On the other hand, it decreases when Z_ID diverges or when Z_IT converges to zero.
Generally, reward normalization is beneficial to neural network convergence. However, determining the bounds of $Z_{IT}$ and $Z_{ID}$ is a complex task. For this reason, the hyperparameters $\alpha_1$, $\alpha_2$, $\beta_1$, and $\beta_2$ are tuned with the aim of scaling $r_{\text{medium}}$ close to $[-c, c]$, in which c is a positive constant. In the following step, the design of $\sigma$ is considered, which encodes the expected foresight of the agent. If the agent is expected to anticipate the terminal reward $r_{\text{terminal}}$ n steps in advance, the discounted terminal reward must be larger than the $r_{\text{medium}}$ bound. Thus, the hyperparameter $\sigma$ satisfies the following expression:
$$\sigma \ge c\,\gamma^{-n}$$
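A minimal sketch of the shaped reward, using the reconstructed form of Φ(s) above (the fractional-exponent potential is our reading of Equation (44)), is given below; the hyperparameter values are placeholders.

```python
def potential(z_IT, z_ID, alpha1, alpha2, beta1, beta2):
    """Phi(s): grows with |Z_IT| and decreases with |Z_ID| (fractional exponents beta < 1)."""
    return abs(z_IT / alpha1) ** beta1 - abs(z_ID / alpha2) ** beta2

def medium_reward(phi_s, phi_s_next, gamma):
    """Potential-based shaping term r_medium = gamma * Phi(s') - Phi(s)."""
    return gamma * phi_s_next - phi_s

def terminal_reward(success, sigma):
    """r_terminal = +sigma on a successful evasion/interception, -sigma otherwise."""
    return sigma if success else -sigma
```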

3.3.2. Curriculum Learning

After hyperparameter tuning, we enhance the training stability of the intelligent algorithm using an adaptive progressive curriculum learning approach. This method incrementally raises the training difficulty to improve agent capability and performance. The agent's training level is adaptively assessed through changes in the network loss, which determine the appropriate training difficulty. The progress indicator $v_{PG}$ is calculated as follows:
$$v_{PG} = L(x, \theta') - L(x, \theta)$$
where $L(\cdot)$ represents the network loss function, $\theta$ is the current network parameter vector, and $\theta'$ is the new parameter vector obtained after training on data x. Given a small threshold $\varepsilon$ ($1 \gg \varepsilon > 0$), when
$$\left|v_{PG}\right| < \varepsilon$$
the agent training enters the next stage. A sequence of increasingly difficult tasks is allocated to the agent, as shown in Table 2. The curriculum is divided into three stages, in which the agent is required to combat an interceptor that:
  • does not maneuver;
  • maneuvers with a square wave command;
  • is guided by the OGL.
This completes the reward shaping and curriculum design of the ESC training approach.
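The stage-switching rule can be sketched as follows: the loss change v_PG is monitored and, once it stays below the threshold ε, the next (harder) interceptor behavior is activated. Function and variable names are ours.

```python
def curriculum_stage(loss_before, loss_after, stage, eps=1e-3, n_stages=3):
    """Advance the curriculum when policy-gradient progress stalls.

    v_PG = L(x, theta_new) - L(x, theta_old); |v_PG| < eps signals that training against
    the current interceptor behavior (non-maneuvering -> square wave -> OGL) has converged.
    """
    v_pg = loss_after - loss_before
    if abs(v_pg) < eps and stage < n_stages - 1:
        stage += 1
    return stage
```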
In summary, the block diagram of ICAAI guidance strategy is shown in Figure 5.

4. Experiments

In this section, we demonstrate the efficacy of the proposed guidance method and the effectiveness of the shaping technique through learning processes and Monte Carlo simulations. We establish benchmark comparisons by including OGLs and evaluating application requirements. To illustrate, we consider a scenario [10] involving a maneuverable small spacecraft (Interceptor, I), a defensive vehicle (Defender, D), and an evading spacecraft (Target, T), all in circular Earth orbits. Gravity effects are incorporated in the simulations. It is assumed that the interceptor has superior maneuverability and a faster control response (smaller time constant) than the target and defender.

4.1. Optimal Pursuit and Evasion Guidance Laws

Lemma 1.
The linear–quadratic optimal guidance law (LQOGL) [10]:
$$u_I^* = \begin{cases} K(t)\, Z_{ID}(t)\, \dfrac{u_I^{\max}\tau_I}{\omega_1}\, \varphi\!\left(\dfrac{t_{f,ID} - t}{\tau_I}\right), & \left|Z_{ID}(t)\right| < \eta \\[2mm] P(t)\, Z_{IT}(t)\, \dfrac{u_I^{\max}\tau_I}{\xi_1}\, \varphi\!\left(\dfrac{t_{f,IT} - t}{\tau_I}\right), & \text{else} \end{cases}$$
where η is a positive constant representing the limit-collision radius between the interceptor and the defender, and u_I^max is the maximum control acceleration provided by the interceptor. Furthermore, the variables K(t) and P(t) can be defined as follows:
$$K(t) = \left\{1 - \int_t^{t_{f,ID}} \left[\frac{1}{\omega_1}\left(u_I^{\max}\tau_I\,\varphi\!\left(\frac{t_{f,ID} - t}{\tau_I}\right)\right)^2 - \frac{1}{\omega_2}\left(u_D^{\max}\tau_D\,\varphi\!\left(\frac{t_{f,ID} - t}{\tau_D}\right)\right)^2\right] dt\right\}^{-1}$$
$$P(t) = \left\{1 - \int_t^{t_{f,IT}} \left[\frac{1}{\xi_1}\left(u_I^{\max}\tau_I\,\varphi\!\left(\frac{t_{f,IT} - t}{\tau_I}\right)\right)^2 - \frac{1}{\xi_2}\left(u_T^{\max}\tau_T\,\varphi\!\left(\frac{t_{f,IT} - t}{\tau_T}\right)\right)^2\right] dt\right\}^{-1}$$
where $\omega_1$, $\omega_2$, $\xi_1$, and $\xi_2$ are nonnegative constants ensuring that the interceptor converges on the target while guaranteeing its escape from the defender.
Proof. 
The detailed proof of similar results can be found in [10]; see Theorem 1 and the associated proof. □
Lemma 2.
Standard optimal guidance law (SOGL) [45]:
$$u_I^* = \begin{cases} u_I^{\max}\,\mathrm{sgn}\!\left[Z_{ID}(t_{f,ID})\right]\mathrm{sgn}\!\left[\varphi\!\left(\dfrac{t_{f,ID} - t}{\tau_I}\right)\right], & \left|Z_{ID}(t)\right| < \eta \\[2mm] -u_I^{\max}\,\mathrm{sgn}\!\left[Z_{IT}(t_{f,IT})\right]\mathrm{sgn}\!\left[\varphi\!\left(\dfrac{t_{f,IT} - t}{\tau_I}\right)\right], & \text{else} \end{cases}$$
where η is a positive constant representing the switching condition, taken equal to the defender's kill radius.
Proof. 
Consider the following cost function:
$$J_1 = -\tfrac{1}{2}\,Z_{ID}^2(t_{f,ID}) \quad \text{for } \left|Z_{ID}(t)\right| < \eta, \qquad J_2 = \tfrac{1}{2}\,Z_{IT}^2(t_{f,IT}) \quad \text{else}$$
For J 1 , the Hamiltonian of the problem is defined as follows:
$$H_1 = \lambda_1\,\dot{Z}_{ID}(t)$$
The costate equation and transversality condition are provided by the following:
$$\dot{\lambda}_1(t) = -\frac{\partial H_1}{\partial Z_{ID}} = 0$$
$$\lambda_1(t_{f,ID}) = \frac{\partial J_1}{\partial Z_{ID}(t_{f,ID})} = -Z_{ID}(t_{f,ID})$$
The optimal interceptor controller minimizes the Hamiltonian satisfying the following:
$$u_I^* = \arg\min_{u_I}\,(H_1)$$
The interceptor guidance law can thus be obtained:
$$u_I^* = u_I^{\max}\,\mathrm{sgn}\!\left[Z_{ID}(t_{f,ID})\right]\mathrm{sgn}\!\left[\varphi\!\left(\frac{t_{f,ID} - t}{\tau_I}\right)\right]$$
For J 2 , a similar interceptor guidance law can be found:
$$u_I^* = -u_I^{\max}\,\mathrm{sgn}\!\left[Z_{IT}(t_{f,IT})\right]\mathrm{sgn}\!\left[\varphi\!\left(\frac{t_{f,IT} - t}{\tau_I}\right)\right]$$
Finally, the interceptor guidance schemes for evading the defender and pursuing the target are proposed after combining Equations (58) and (59):
$$u_I^* = \begin{cases} u_I^{\max}\,\mathrm{sgn}\!\left[Z_{ID}(t_{f,ID})\right]\mathrm{sgn}\!\left[\varphi\!\left(\dfrac{t_{f,ID} - t}{\tau_I}\right)\right], & \left|Z_{ID}(t)\right| < \eta \\[2mm] -u_I^{\max}\,\mathrm{sgn}\!\left[Z_{IT}(t_{f,IT})\right]\mathrm{sgn}\!\left[\varphi\!\left(\dfrac{t_{f,IT} - t}{\tau_I}\right)\right], & \text{else} \end{cases}$$

4.2. Engagement Setup

In this scenario, a target carrying an active anti-interceptor is threatened by a KKV interceptor in orbit at an altitude of 500 km. The defender maintains an initial safe distance of approximately 50 m longitudinally and 10 km transversely to the target. Given that the detection range of the interceptor’s guided warhead is about 100 km, the initial transverse distance between the interceptor and the target is set at 100 km, and the initial longitudinal position is random in the range 499.8–500.2 km. In addition, the maneuverability and control response speed of the interceptor are better than those of the target and defender, and the OGL is used for guidance.
The comprehensive list of engagement parameters is shown in Table 3.
Furthermore, Gaussian noise with standard deviations of σ_LOS = 1 mrad, σ_v = 0.2 m/s, and σ_a = 1 m/s² is considered in the interceptor information obtained by the target and defender through a radar seeker.

4.3. Experiment 1: Real-Time Performance of the Guidance Policy

To verify that the proposed RL training approach ESC can improve convergence efficiency and stability, the learning processes were demonstrated using the sparse reward (SR) signal and ESC, respectively, with the same hyperparameters. During the learning process, the weights of the neural network model were stored every 100 episodes for subsequent analysis. In addition, to remove stochasticity as a confounding factor, six random seeds were set for each case. Meanwhile, the real-time performance of the optimized agent is evaluated by comparing it with the traditional OGLs.
The agents were obtained after training for 20,000 episodes, which took 12 h with 8 parallel workers on a computer equipped with a 104-core Intel Xeon Platinum 8270 CPU @ 2.70 GHz. Similarly, both the traditional methods and the proposed method are provided with the current state or observation and return the required action. Table 4 shows the comparison of computational cost and update frequency obtained by using SOGL, LQOGL, and the proposed method. It can be seen from the table that LQOGL is time-consuming due to the calculation of the Riccati function, which is the reason why it has not been applied in practice. As a proven approach, the SOGL has excellent real-time performance. The proposed method achieved an update frequency of 10³ Hz and shows great potential for on-board applications. While a variety of approaches (e.g., pruning and distillation) could compress the policy network and further improve its real-time performance, this is not the main focus of this research.
Remark 1.
As shown in Equations (18) and (19), the LQOGL has to solve the Riccati differential equation. However, the experimental results show that its update frequency cannot meet the real-time requirements of spacecraft guidance. Compared to the LQOGL, the SOGL in Equation (60) does not need to solve the Riccati differential equation and has no hyperparameters. This improves both its computational efficiency and robustness at the cost of flexibility and the occurrence of the chattering phenomenon. Considering the practical situation, the SOGL was chosen as the OGL benchmark.

4.4. Experiment 2: Convergence and Performance of the Guidance Policy

The performance of the trained agent in the fully observable game was investigated by comparing the escape success rate corresponding to an optimized policy π ϕ ( s ) , obtained by performing Monte Carlo simulation in the fully observable (deterministic and with default engagement parameters) environment, with the solution of the SOGL.

4.4.1. Baselines

The SOGL for the target and the defender were considered as an OGL benchmark. Through a brief derivation similar to that in Section 3, it can be proven that the SOGLs for the target and the defender are as follows:
$$u_T = -u_T^{\max}\,\mathrm{sgn}\!\left[Z_{IT}(t_{f,IT})\right]\mathrm{sgn}\!\left[\varphi\!\left(\frac{t_{f,IT} - t}{\tau_T}\right)\right], \qquad u_D = u_D^{\max}\,\mathrm{sgn}\!\left[Z_{ID}(t_{f,ID})\right]\mathrm{sgn}\!\left[\varphi\!\left(\frac{t_{f,ID} - t}{\tau_D}\right)\right]$$
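For reference, the baseline commands can be sketched as bang-bang laws on the corresponding ZEM estimates; the signs follow the reconstruction above (the target drives Z_IT away from zero, the defender drives Z_ID to zero), and this is not the authors' implementation.

```python
import numpy as np

def sogl_target_defender(z_IT, z_ID, t_go_IT, t_go_ID,
                         u_T_max, u_D_max, tau_T, tau_D):
    """Bang-bang SOGL commands for the target (evade I) and the defender (pursue I)."""
    phi = lambda chi: np.exp(-chi) + chi - 1.0
    u_T = -u_T_max * np.sign(z_IT) * np.sign(phi(t_go_IT / tau_T))   # enlarge |Z_IT|
    u_D = u_D_max * np.sign(z_ID) * np.sign(phi(t_go_ID / tau_D))    # drive Z_ID to zero
    return u_T, u_D
```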

4.4.2. Convergence and Escape Success Rate

Figure 6 displays the learning curves depicting the mean accumulated reward across learning episodes for various scenarios. As depicted, in the ESC case, the agent’s reward consistently escalated throughout the training episodes, ultimately stabilizing at around 6000 after 4000 iterations. Conversely, within the sparse reward (SR) framework, the ICAAI encountered a plateau phenomenon during training, resulting in an unstable convergence process for the associated reward function and eventual convergence failure.
Figure 7 presents success rate curves for target evasion over learning episodes, comparing agents trained with and without ESC. The green line denotes OGL’s deterministic environment success rate of 83.4%. The ESC-trained agent surpassed the baseline by 2700 episodes, achieving a peak performance of 99% after around 13,800 episodes. Conversely, the agent without ESC exhibited a gradual decline in performance after reaching a zenith of 77%, signifying policy network overfitting during continued training. The ESC-trained agent demonstrated accelerated convergence and improved local optima. It can be inferred that the proposed ESC training approach effectively organizes exploration, addressing sparse reward issues and showcasing heightened learning efficiency and asymptotic performance. Furthermore, the proposed methodology adeptly mitigates overfitting phenomena.

4.4.3. Performance Test

Figure 8 depicts spacecraft trajectories, featuring the interceptor’s actual path (blue curve) and the observed trajectory from the target’s perspective (yellow curve). Figure 9 displays the lateral acceleration profiles for each spacecraft, while Figure 10 illustrates the ZEM measurements between the target and interceptor and between the defender and interceptor. The simulation results presented in Figure 11 reaffirm the impact of the relative distance between the target and defender dis_DT on the game outcomes for the target.
Figure 8, Figure 9 and Figure 10 illustrate the evident cooperation between the target and the defender, utilizing relative state information. Taking the simulation results at dis_DT = 10 km as an example, the miss distance between the target and the interceptor was approximately 15 m. The defender maintained a miss distance of less than 1 m from the interceptor, confirming its successful interception threat. Figure 9 and Figure 10 depict that, within 16 s of the scenario’s initiation, the target collaborated with the defender, executing subtle maneuvers to intercept the interceptor. At around the 16 s mark, the interceptor perceived the threat and initiated an escape strategy. Simultaneously, the target executed an evasive maneuver in the opposite direction, utilizing its maximum maneuverability, which resulted in an increase in distance. Ultimately, the interceptor managed to evade the defender’s interception attempt but failed to intercept the target in time, leading to the target’s successful evasion.
In addition, the above simulation results show that the relative distance between the target and defender dis_DT directly determines the time it takes for the interceptor to intercept the target after evading the defender. Consequently, dis_DT significantly influences the game outcomes for the target, including the success rate of evasion and miss distance. Therefore, to explore the effect of dis_DT on the performance of ICAAI, the game results for dis_DT ranging from 0 to 15 km are introduced in Figure 11.
As evident from Figure 11, employing the ICAAI intelligent game algorithm results in the target achieving success rates of no less than approximately 90% when the relative distance to the defender is less than 10 km. However, as dis_DT increases from 10 to 15 km, the success rate of target evasion decreases from 90% to 0%. These simulation results illustrate that a smaller relative distance leads to an increased evasion success rate. Additionally, the curve depicting the average miss distance for the target reveals that the miss distance follows a pattern of initially increasing and then decreasing with dis_DT. The miss distance reaches its maximum value of approximately 50 m around a relative distance of 5 km. The occurrence of this phenomenon can be attributed to the fact that, when dis_DT is less than 5 km, the miss distance increases with the target’s evasion time. Moreover, at this point, the interceptor has not had sufficient time to alter its trajectory to intercept the target. Conversely, when dis_DT exceeds 5 km, the interceptor has ample time to intercept the target after evading the defender. Consequently, the miss distance decreases with an increasing dis_DT.

4.5. Experiment 3: Adaptiveness of the ICAAI Guidance

In the real-world game confrontation process, obtaining the opponent’s prior knowledge, such as the maximum acceleration and time constant, is often impractical. To assess the proposed ICAAI guidance method’s superior adaptability compared to the OGL method under conditions of unknown opponent knowledge, several comparison conditions were designed and evaluated using the Monte Carlo target shooting method. The adaptive capabilities of both methods were analyzed based on the game results (escape success rate and miss distance) of the target spacecraft employing the two strategies.
While the target utilized OGL guidance, it was considered to adopt $\bar{u}_I^{\max} = 8\,g$ and $\bar{\tau}_I = 0.02$ s as its prediction of the interceptor's prior knowledge, while the actual values were $u_I^{\max} = 6$–$10\,g$ and $\tau_I = 0.002$–$0.05$ s. The simulation results are shown in Figure 12.
As depicted in Figure 12a, as the interceptor’s maneuverability improves, the target’s escape ability decreases for both guidance methods. However, it is evident that, when employing the ICAAI guidance, the rate of decline in the target’s escape ability is significantly lower compared to the OGL guidance method. Similarly, Figure 12b demonstrates that an increase in the interceptor’s response speed yields a similar trend in the target’s escape ability as in Figure 12a. Specifically, when accurately estimating the prior knowledge of the target, the escape abilities of both methods are comparable. However, when the prior knowledge error exceeds 25%, the OGL guidance leads to a reduction of over 75% in the target’s escape ability, while the ICAAI guidance results in less than a 34% decrease. In conclusion, the proposed ICAAI guidance exhibits superior adaptability compared to the OGL guidance when the interceptor’s prior knowledge is unknown.
Remark 2.
As an analytical method, the SOGL is stable but inflexible due to its theoretical framework [46] and stringent assumptions [47]. Correspondingly, the ICAAI control strategies are flexible and can be continuously optimized. The proposed method is independent of the time constant, which means that it performs better with less prior knowledge than the OGL. Furthermore, the adaptability of the proposed method can be improved by considering the tolerance of the maximum interceptor acceleration.

4.6. Experiment 4: Robustness of the RL-Based Guidance Method

In addition to the unperturbed, fully observable game, the following noisy, partially observable game studies have been analyzed separately in this manuscript. The parameters used to describe the imperfect information model defined in Section 3 are shown in Table 5. The Monte Carlo simulation method is used to obtain the escape success rate and the miss distance of the target using the proposed ICAAI guidance and SOGL guidance under different noise conditions. The results of the Monte Carlo simulation are shown in Figure 13.
Based on the simulation results of Case 2, it was observed that the OGL method exhibited significant sensitivity to LOS noise. In scenarios without LOS noise, the escape success rate of the proposed ICAAI guidance matched that of the OGL guidance, and, in some cases, the OGL method even achieved a larger miss distance. However, as the LOS noise variance increased to 0.05 mrad, the success rate of the OGL method dropped to approximately 50%. Eventually, at a LOS noise variance of 0.15 mrad, the target was practically unable to escape using the SOGL method, while the ICAAI guidance still maintained an escape success rate of around 80%.
Analyzing the simulation results of Case 1 and Case 3, it was found that due to the presence of LOS noise, the target employing the OGL method exhibited reduced sensitivity to acceleration and velocity noise. Nevertheless, its escape capability remained weaker compared to that of the ICAAI guidance. This could be attributed to the policy network propagating observation information with different weights, leveraging the exploration mechanism of reinforcement learning (RL). Consequently, training the agent in a deterministic environment resulted in a robust guidance policy with strong noise-resistant ability.

5. Conclusions

In this research, we solved the cooperative active defense guidance problem for a target with active defense attempting to evade an interceptor. Based on deep reinforcement learning algorithms, a collaborative guidance strategy termed ICAAI was formulated to enhance active spacecraft defense. Monte Carlo simulations were conducted to empirically substantiate the real-time performance, convergence, adaptiveness, and robustness of the introduced guidance strategy. The conclusions are stated as follows:
(1)
In the presence of less prior knowledge and observation noise, the proposed ICAAI guidance strategy is effective in achieving a higher success rate of target evasion by guiding the target to coordinate maneuvers with defensive spacecraft.
(2)
Utilizing a heuristic continuous reward function and an adaptive progressive curriculum learning method, we devised the ESC training approach to effectively tackle issues of low convergence efficiency and training process instability in ICAAI.
(3)
The ICAAI guidance strategy outperforms the linear–quadratic optimal guidance law (LQOGL) [10] in real-time performance. This framework also achieved an impressive update frequency of 10³ Hz, demonstrating substantial potential for onboard applications.
(4)
Simulation results confirm ICAAI’s effectiveness in reducing the relative distance between interceptor and defender, enabling successful target evasion. In contrast to traditional OGL methods, our approach exhibits enhanced robustness in noisy environments, particularly in mitigating line-of-sight (LOS) noise.

Author Contributions

Conceptualization, W.N., J.L., Z.L., P.L. and H.L.; Methodology, W.N., J.L., Z.L., P.L. and H.L.; Software, W.N.; Validation, W.N.; Investigation, W.N.; Data curation, W.N.; Writing—original draft, W.N.; Writing—review & editing, H.L.; Project administration, H.L.; Funding acquisition, J.L. and P.L. All authors have read and agreed to the published version of the manuscript.

Funding

The work described in this paper was supported by the National Natural Science Foundation of China (Grant No. 62003375). The authors fully appreciate this financial support.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature

a acceleration, m/s²
A, B state-space matrices of the linearized equations of motion
H Hamiltonian
I identity matrix
J(·) cost function
L constant vector
L⁻¹ inverse Laplace transform
LOS line-of-sight
Q(·) state–action value function
r reward signal
s state defined in the Markov decision process
o observation of the agent
t, t_go, t_f time, time-to-go, and final time, respectively, s
u guidance command, m/s²
V velocity, m/s
XOY Cartesian reference frame
x state vector of the linearized equations of motion
y lateral distance, m
Z zero-effort miss, m
α, β, σ design parameters of the reward function
ϕ flight path angle, rad
Φ transition matrix
γ discount factor
η killing radius, m
λ angle between the corresponding line-of-sight and the X-axis, rad
λ(·) Lagrange multiplier vector
μ(·) policy function
ρ relative distance between the adversaries, m
τ time constant
ω, ξ design parameters of the optimal guidance law (OGL)
I, T, D interceptor, target, and defender, respectively
max maximum
* optimal solution

References

  1. Ye, D.; Shi, M.; Sun, Z. Satellite proximate pursuit-evasion game with different thrust configurations. Aerosp. Sci. Technol. 2020, 99, 105715. [Google Scholar] [CrossRef]
  2. Boyell, R.L. Defending a moving target against missile or torpedo attack. IEEE Trans. Aerosp. Electron. Syst. 1976, AES-12, 522–526. [Google Scholar] [CrossRef]
  3. Rusnak, I. Guidance laws in defense against missile attack. In Proceedings of the 2008 IEEE 25th Convention of Electrical and Electronics Engineers in Israel, Eilat, Israel, 3–5 December 2008; pp. 090–094. [Google Scholar]
  4. Rusnak, I. The lady, the bandits and the body guards—A two team dynamic game. IFAC Proc. Vol. 2005, 38, 441–446. [Google Scholar] [CrossRef]
  5. Shalumov, V. Optimal cooperative guidance laws in a multiagent target–missile–defender engagement. J. Guid. Control Dyn. 2019, 42, 1993–2006. [Google Scholar] [CrossRef]
  6. Weiss, M.; Shima, T.; Castaneda, D.; Rusnak, I. Combined and cooperative minimum-effort guidance algorithms in an active aircraft defense scenario. J. Guid. Control Dyn. 2017, 40, 1241–1254. [Google Scholar] [CrossRef]
  7. Weiss, M.; Shima, T.; Castaneda, D.; Rusnak, I. Minimum effort intercept and evasion guidance algorithms for active aircraft defense. J. Guid. Control Dyn. 2016, 39, 2297–2311. [Google Scholar] [CrossRef]
  8. Shima, T. Optimal cooperative pursuit and evasion strategies against a homing missile. J. Guid. Control. Dyn. 2011, 34, 414–425. [Google Scholar] [CrossRef]
  9. Perelman, A.; Shima, T.; Rusnak, I. Cooperative differential games strategies for active aircraft protection from a homing missile. J. Guid. Control Dyn. 2011, 34, 761–773. [Google Scholar] [CrossRef]
  10. Liang, H.; Wang, J.; Wang, Y.; Wang, L.; Liu, P. Optimal guidance against active defense ballistic missiles via differential game strategies. Chin. J. Aeronaut. 2020, 33, 978–989. [Google Scholar] [CrossRef]
  11. Anderson, G.M. Comparison of optimal control and differential game intercept missile guidance laws. J. Guid. Control 1981, 4, 109–115. [Google Scholar] [CrossRef]
  12. Dong, J.; Zhang, X.; Jia, X. Strategies of pursuit-evasion game based on improved potential field and differential game theory for mobile robots. In Proceedings of the 2012 Second International Conference on Instrumentation, Measurement, Computer, Communication and Control, Harbin, China, 8–10 December 2012; pp. 1452–1456. [Google Scholar]
  13. Li, Z.; Wu, J.; Wu, Y.; Zheng, Y.; Li, M.; Liang, H. Real-time Guidance Strategy for Active Defense Aircraft via Deep Reinforcement Learning. In Proceedings of the NAECON 2021-IEEE National Aerospace and Electronics Conference, Dayton, OH, USA, 16–19 August 2021; pp. 177–183. [Google Scholar]
  14. Liang, H.; Li, Z.; Wu, J.; Zheng, Y.; Chu, H.; Wang, J. Optimal Guidance Laws for a Hypersonic Multiplayer Pursuit-Evasion Game Based on a Differential Game Strategy. Aerospace 2022, 9, 97. [Google Scholar] [CrossRef]
  15. Liu, F.; Dong, X.; Li, Q.; Ren, Z. Cooperative differential games guidance laws for multiple attackers against an active defense target. Chin. J. Aeronaut. 2022, 35, 374–389. [Google Scholar] [CrossRef]
  16. Weintraub, I.E.; Cobb, R.G.; Baker, W.; Pachter, M. Direct methods comparison for the active target defense scenario. In Proceedings of the AIAA Scitech 2020 Forum, Orlando, FL, USA, 6–10 January 2020; p. 0612. [Google Scholar]
  17. Shalumov, V. Cooperative online guide-launch-guide policy in a target-missile-defender engagement using deep reinforcement learning. Aerosp. Sci. Technol. 2020, 104, 105996. [Google Scholar] [CrossRef]
  18. Liang, H.; Wang, J.; Liu, J.; Liu, P. Guidance strategies for interceptor against active defense spacecraft in two-on-two engagement. Aerosp. Sci. Technol. 2020, 96, 105529. [Google Scholar] [CrossRef]
  19. Salmon, J.L.; Willey, L.C.; Casbeer, D.; Garcia, E.; Moll, A.V. Single pursuer and two cooperative evaders in the border defense differential game. J. Aerosp. Inf. Syst. 2020, 17, 229–239. [Google Scholar] [CrossRef]
  20. Harel, M.; Moshaiov, A.; Alkaher, D. Rationalizable strategies for the navigator–target–missile game. J. Guid. Control Dyn. 2020, 43, 1129–1142. [Google Scholar] [CrossRef]
  21. Miljković, Z.; Mitić, M.; Lazarević, M.; Babić, B. Neural network reinforcement learning for visual control of robot manipulators. Expert Syst. Appl. 2013, 40, 1721–1736. [Google Scholar] [CrossRef]
  22. Ye, D.; Chen, G.; Zhang, W.; Chen, S.; Yuan, B.; Liu, B.; Chen, J.; Liu, Z.; Qiu, F.; Yu, H. Towards playing full moba games with deep reinforcement learning. arXiv 2020, arXiv:2011.12692. [Google Scholar]
  23. Shalev-Shwartz, S.; Shammah, S.; Shashua, A. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv 2016, arXiv:1610.03295. [Google Scholar]
  24. Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3357–3364. [Google Scholar]
  25. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  26. Gaudet, B.; Furfaro, R.; Linares, R. Reinforcement meta-learning for angle-only intercept guidance of maneuvering targets. In Proceedings of the AIAA Scitech 2020 Forum, Orlando, FL, USA, 6–10 January 2020; p. 0609. [Google Scholar]
  27. Gaudet, B.; Linares, R.; Furfaro, R. Adaptive guidance and integrated navigation with reinforcement meta-learning. Acta Astronaut. 2020, 169, 180–190. [Google Scholar] [CrossRef]
  28. Lau, M.; Steffens, M.J.; Mavris, D.N. Closed-loop control in active target defense using machine learning. In Proceedings of the AIAA Scitech 2019 Forum, San Diego, CA, USA, 7–11 January 2019; p. 0143. [Google Scholar]
  29. Zhang, G.; Chang, T.; Wang, W.; Zhang, W. Hybrid threshold event-triggered control for sail-assisted USV via the nonlinear modified LVS guidance. Ocean Eng. 2023, 276, 114160. [Google Scholar] [CrossRef]
  30. Li, J.; Zhang, G.; Shan, Q.; Zhang, W. A novel cooperative design for USV-UAV systems: 3D mapping guidance and adaptive fuzzy control. IEEE Trans. Control Netw. Syst. 2022, 10, 564–574. [Google Scholar] [CrossRef]
  31. Ainsworth, M.; Shin, Y. Plateau phenomenon in gradient descent training of RELU networks: Explanation, quantification, and avoidance. SIAM J. Sci. Comput. 2021, 43, A3438–A3468. [Google Scholar] [CrossRef]
  32. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In PMLR, Proceedings of Machine Learning Research, Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: New York, NY, USA, 2018; Volume 80, pp. 1587–1596. [Google Scholar]
  33. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  34. Babaeizadeh, M.; Frosio, I.; Tyree, S.; Clemons, J.; Kautz, J. Reinforcement learning through asynchronous advantage actor-critic on a gpu. arXiv 2016, arXiv:1611.06256. [Google Scholar]
  35. Casas, N. Deep deterministic policy gradient for urban traffic light control. arXiv 2017, arXiv:1703.09035. [Google Scholar]
  36. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic Policy Gradient Algorithms. In PMLR, Proceedings of Machine Learning Research, Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; PMLR: New York, NY, USA, 2014; Volume 32, pp. 387–395. [Google Scholar]
  37. Fan, J.; Wang, Z.; Xie, Y.; Yang, Z. A Theoretical Analysis of Deep Q-Learning. In PMLR, Proceedings of Machine Learning Research, Proceedings of the 2nd Conference on Learning for Dynamics and Control, Online, 10–11 June 2020; PMLR: New York, NY, USA, 2020; Volume 120, pp. 486–489. [Google Scholar]
  38. Van Hasselt, H. Double Q-learning. In Advances in Neural Information Processing Systems; Curran Associates Inc.: New York, NY, USA, 2010; Volume 23. [Google Scholar]
  39. Gullapalli, V.; Barto, A.G. Shaping as a method for accelerating reinforcement learning. In Proceedings of the 1992 IEEE International Symposium on Intelligent Control, Glasgow, UK, 11–13 August 1992; pp. 554–559. [Google Scholar]
  40. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 41–48. [Google Scholar]
  41. Krueger, K.A.; Dayan, P. Flexible shaping: How learning in small steps helps. Cognition 2009, 110, 380–394. [Google Scholar] [CrossRef]
  42. Ng, A.Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. ICML 1999, 99, 278–287. [Google Scholar]
  43. Randløv, J.; Alstrøm, P. Learning to Drive a Bicycle Using Reinforcement Learning and Shaping. ICML 1998, 98, 463–471. [Google Scholar]
  44. Wiewiora, E. Potential-based shaping and Q-value initialization are equivalent. J. Artif. Intell. Res. 2003, 19, 205–208. [Google Scholar] [CrossRef]
  45. Qi, N.; Sun, Q.; Zhao, J. Evasion and pursuit guidance law against defended target. Chin. J. Aeronaut. 2017, 30, 1958–1973. [Google Scholar] [CrossRef]
  46. Ho, Y.; Bryson, A.; Baron, S. Differential games and optimal pursuit-evasion strategies. IEEE Trans. Autom. Control 1965, 10, 385–389. [Google Scholar] [CrossRef]
  47. Shinar, J.; Steinberg, D. Analysis of Optimal Evasive Maneuvers Based on a Linearized Two-Dimensional Kinematic Model. J. Aircr. 1977, 14, 795–802. [Google Scholar] [CrossRef]
Figure 1. Schematic view of the engagement.
Figure 2. Comparison of training results of various reinforcement learning algorithms.
Figure 3. Structure of the actor–critic method.
Figure 4. Structure of the TD3 algorithm.
Figure 5. Block diagram of the ICAAI guidance strategy.
Figure 6. Learning curves of the ICAAI.
Figure 7. Escape success rate.
Figure 8. Spacecraft game trajectories. (a) disDT = 5 km, (b) disDT = 10 km, (c) disDT = 15 km.
Figure 9. Lateral acceleration curve of each spacecraft. (a) disDT = 5 km, (b) disDT = 10 km, (c) disDT = 15 km.
Figure 10. ZEM curve between each spacecraft. (a) disDT = 5 km, (b) disDT = 10 km, (c) disDT = 15 km.
Figure 11. Target game results under different distances between target and defender.
Figure 12. Simulation results in situations without prior knowledge. (a) uImax = 6~10 g, τ = 0.02 s; (b) uImax = 8 g, τ = 0.05~0.002 s.
Figure 13. Simulation results in a noise-corrupted environment. (a) Case 1, (b) Case 2, (c) Case 3.
Table 1. TD3 hyperparameters.

Hyperparameter             Symbol      Value
Discount factor            γ           0.99
Learning rate              α           3 × 10^−4
Buffer size                B           5120
Batch size                 n_batch     128
Soft update coefficient    ζ           5 × 10^−3
Policy delay               n_opt       2
Train frequency            ω           6000
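For readers who want to reproduce the training setup, the sketch below shows one way the Table 1 values could be passed to an off-the-shelf TD3 implementation. The paper does not specify which implementation it used; Stable-Baselines3, the Pendulum-v1 placeholder environment, and the mapping of "train frequency" onto the library's train_freq argument are all assumptions, not the authors' code.

```python
# Minimal sketch: wiring the Table 1 hyperparameters into Stable-Baselines3 TD3.
# The environment id is a stand-in for the authors' custom TID engagement environment.
from stable_baselines3 import TD3

model = TD3(
    "MlpPolicy",
    "Pendulum-v1",        # placeholder continuous-control environment (assumption)
    gamma=0.99,           # discount factor γ
    learning_rate=3e-4,   # learning rate α
    buffer_size=5120,     # replay buffer size B
    batch_size=128,       # mini-batch size n_batch
    tau=5e-3,             # soft target-update coefficient ζ
    policy_delay=2,       # delayed policy updates n_opt
    train_freq=6000,      # train frequency ω (interpretation assumed)
    verbose=1,
)
model.learn(total_timesteps=100_000)  # illustrative training budget
```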
Table 2. Curriculum learning.

Curriculum                          Stage 1    Stage 2               Stage 3
Interceptor guidance command        None       Square wave signal    OGL
Maximum interceptor acceleration    0          8 g                   4 g/6 g/8 g
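The staged opponent schedule in Table 2 can be encoded as a small lookup that the training loop queries at the start of each episode. The sketch below is a hypothetical encoding: the stage data come from Table 2, but the function name, data layout, and per-episode sampling of the Stage 3 acceleration bound are assumptions.

```python
import random

# Hypothetical encoding of the three-stage curriculum in Table 2.
CURRICULUM = [
    {"guidance": "none",        "a_max_g": [0.0]},             # Stage 1: non-maneuvering interceptor
    {"guidance": "square_wave", "a_max_g": [8.0]},             # Stage 2: square-wave guidance command
    {"guidance": "ogl",         "a_max_g": [4.0, 6.0, 8.0]},   # Stage 3: optimal guidance law
]

def interceptor_settings(stage: int) -> dict:
    """Sample the interceptor guidance mode and acceleration bound for one episode."""
    spec = CURRICULUM[stage]
    return {"guidance": spec["guidance"],
            "a_max_g": random.choice(spec["a_max_g"])}
```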
Table 3. Engagement parameters.

Parameter                     Interceptor    Target    Defender
Horizontal location (km)      100            0         0~15
Vertical location (km)        499.8~500.2    500       500.05
Horizontal velocity (km/s)    −3             2         2
Vertical velocity (km/s)      0              0         0
Maximum acceleration (g)      8              2         6
Time constant (s)             0.02           0.1       0.05
Kill radius (m)               0.25           0.5       0.15
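Table 3 defines the episode initial conditions; a minimal sketch of how they might be packaged for an engagement simulator follows. The coordinate convention, the random draw of the interceptor's vertical position from its stated range, and placing the defender dis_DT kilometres from the target toward the interceptor are assumptions for illustration only.

```python
import numpy as np

G0 = 9.80665  # standard gravity, m/s^2

def initial_states(dis_dt_km: float = 5.0) -> dict:
    """Build per-vehicle initial states mirroring Table 3 (positions in km, velocities in km/s)."""
    rng = np.random.default_rng()
    return {
        "interceptor": {"pos_km": np.array([100.0, rng.uniform(499.8, 500.2)]),
                        "vel_kmps": np.array([-3.0, 0.0]),
                        "a_max_mps2": 8.0 * G0, "tau_s": 0.02, "kill_radius_m": 0.25},
        "target":      {"pos_km": np.array([0.0, 500.0]),
                        "vel_kmps": np.array([2.0, 0.0]),
                        "a_max_mps2": 2.0 * G0, "tau_s": 0.10, "kill_radius_m": 0.50},
        "defender":    {"pos_km": np.array([dis_dt_km, 500.05]),   # assumed placement toward the interceptor
                        "vel_kmps": np.array([2.0, 0.0]),
                        "a_max_mps2": 6.0 * G0, "tau_s": 0.05, "kill_radius_m": 0.15},
    }
```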
Table 4. Statistics of time consumption with different guidance methods.

Metric                          LQOGL    SOGL          ICAAI
Duration per 10^3 steps (s)     2.773    0.0145        0.910
Update frequency (Hz)           360      6.9 × 10^4    1.1 × 10^3
Table 5. Parameters of the different imperfect information models.

Measurement noise    Parameter        Case 1    Case 2    Case 3
LOS                  σ_LOS (mrad)     0.05      0~0.2     0.05
Velocity             σ_v (m/s)        0.2       0.2       0~0.5
Acceleration         σ_a (m/s²)       1~3       2         2
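The noise cases in Table 5 amount to adding zero-mean Gaussian noise with the listed standard deviations to the line-of-sight, velocity, and acceleration measurements before they enter the observation vector. Below is a hedged sketch for Case 2 (using the upper end of its LOS range); the function name, field names, and observation layout are assumptions rather than the authors' implementation.

```python
import numpy as np

# Hypothetical zero-mean Gaussian measurement-noise model matching Table 5, Case 2.
NOISE_CASE_2 = {
    "los_mrad": 0.2,   # 1-sigma line-of-sight angle noise (upper bound of the 0~0.2 mrad range)
    "vel_mps": 0.2,    # 1-sigma relative-velocity noise
    "acc_mps2": 2.0,   # 1-sigma acceleration-estimate noise
}

def corrupt_observation(los_rad, rel_vel_mps, acc_mps2, sigma=NOISE_CASE_2, rng=None):
    """Add independent Gaussian noise to each measured quantity."""
    rng = rng or np.random.default_rng()
    los_noisy = los_rad + rng.normal(0.0, sigma["los_mrad"] * 1e-3)  # convert mrad to rad
    vel_noisy = rel_vel_mps + rng.normal(0.0, sigma["vel_mps"])
    acc_noisy = acc_mps2 + rng.normal(0.0, sigma["acc_mps2"])
    return los_noisy, vel_noisy, acc_noisy
```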
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
