Article

A Real-Time and Optimal Hypersonic Entry Guidance Method Using Inverse Reinforcement Learning

School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
*
Author to whom correspondence should be addressed.
Aerospace 2023, 10(11), 948; https://doi.org/10.3390/aerospace10110948
Submission received: 13 October 2023 / Revised: 3 November 2023 / Accepted: 4 November 2023 / Published: 7 November 2023
(This article belongs to the Special Issue Advanced Motion Planning and Control in Aerospace Applications)

Abstract

The missions of hypersonic vehicles face the problems of highly nonlinear dynamics and complex environments, which present challenges to the intelligence and real-time performance of onboard guidance algorithms. In this paper, inverse reinforcement learning is used to address the hypersonic entry guidance problem. The state-control sample pairs and state-reward sample pairs obtained by interacting with the hypersonic entry dynamics are used to train the neural network by applying the distributed proximal policy optimization method. To overcome the sparse reward problem in hypersonic entry, a novel reward function combined with a sophisticated discriminator network is designed to continuously generate dense optimal rewards, which is the main contribution of this paper. The optimized guidance methodology can achieve good terminal accuracy and high success rates with a small number of trajectories as datasets while satisfying heating rate, overload, and dynamic pressure constraints. The proposed guidance method is applied to two typical hypersonic entry vehicles (Common Aero Vehicle-Hypersonic and Reusable Launch Vehicle) to demonstrate its feasibility and potential. Numerical simulation results validate the real-time performance and optimality of the proposed method and indicate its suitability for onboard applications in hypersonic entry flight.

1. Introduction

A hypersonic vehicle is a vehicle that traverses the atmosphere at speeds exceeding Mach 5. In recent years, gliding hypersonic vehicles have gained significant prominence due to their remarkable long-range and cross-range flight capabilities, as well as their ability to achieve high-precision targeting in both military and civilian domains [1]. However, when operating in complex flight environments characterized by factors such as heating, dynamic pressure, and overload, the system dynamics of a hypersonic vehicle become coupled, uncertain, and highly nonlinear [2]. To ensure the success of flight missions, the entry guidance algorithm for a hypersonic vehicle requires enhanced robustness and autonomy [3]; therefore, it is particularly necessary to develop online real-time trajectory optimization algorithms [4]. Nevertheless, it remains a significant challenge to design an optimal or near-optimal guidance strategy for onboard applications with guaranteed stability and real-time performance [5]. In this paper, a novel entry guidance algorithm based on inverse reinforcement learning is proposed to generate optimal or near-optimal control in real time under complex hypersonic flight environments.
The optimal guidance can be described as a trajectory optimization or an optimal control problem (OCP), which aims to optimize a performance index while satisfying complex constraints. Traditionally, OCP algorithms can be classified into two main types: indirect methods and direct methods [6,7]. Based on Pontryagin's minimum principle, indirect methods transform the OCP into a two-point boundary value problem [8]. Numerous indirect methods have been developed to solve OCPs, offering high-precision solutions [9,10,11]. However, due to the main drawbacks of indirect methods, namely convergence difficulties and the handling of path constraints, direct methods have gained broader application. Direct methods transform OCPs into finite-dimensional parameter optimization problems through discretization, which are subsequently solved using nonlinear solvers. By combining convex optimization theory and the pseudospectral method, direct methods offer advantages in real-time performance and solution accuracy and have been successfully applied to many OCPs [12,13,14]. Ref. [15] developed a two-stage trajectory optimization framework using convex optimization and the pseudospectral method to solve the hypersonic vehicle entry problem, improving computational efficiency. Additionally, a Chebyshev pseudospectral method based on differential flatness theory was applied to the hypersonic vehicle entry problem, demonstrating that the guidance algorithm can reduce the solution time for a single trajectory [16]. Unfortunately, losslessly modeling the constraints of the trajectory planning problem in a convex form is difficult, particularly for hypersonic systems with highly nonlinear dynamics and constraints. Moreover, the computational cost escalates rapidly as the number of discrete points increases, and the number of iterations becomes unpredictable when a high-precision solution is required [17,18]. Consequently, these shortcomings limit the online application of OCP algorithms.
In recent years, artificial intelligence (AI)-based guidance algorithms have gained significant attention in the aerospace field, primarily due to their real-time performance and adaptable capabilities. Ref. [19] proposed that these algorithms can be broadly classified into two implementations: supervised learning (SL) and reinforcement learning (RL). In supervised learning, neural networks are trained using extensive datasets of optimal trajectories generated by OCP algorithms. Several SL-based guidance algorithms have been proposed for onboard applications [20]. For instance, Ref. [21] presented a deep neural network (DNN)-based guidance framework for planetary landing, capable of predicting fuel-optimal controls from raw images captured by an onboard optical camera. Ref. [22] introduced a DNN-based guidance method for two-degree-of-freedom (2DOF) entry trajectory planning of hypersonic vehicles, and numerical simulations demonstrated its ability to provide stable and real-time control instructions for maximizing terminal velocity. Ref. [23] proposed a real-time DNN-based algorithm to solve the 3DOF entry problem of hypersonic vehicles, and the results showed its capability to generate optimal onboard controls. Similarly, Ref. [24] proposed a DNN-based controller to map optimal control commands from the state, and a hardware-in-the-loop (HIL) system was developed to support the real-time performance conclusion of the controller. However, both Ref. [23] and Ref. [24] required generating a large number of datasets before the training process, which is extremely costly in practical applications. Consequently, ensuring the convergence accuracy of existing SL-based algorithms for hypersonic entry problems necessitates constructing a large number of datasets to cover all scenarios, which remains a drawback of these algorithms when missions are time-sensitive.
On the other hand, reinforcement learning offers an alternative approach that does not rely on existing datasets. RL algorithms continuously update model parameters through interactions with the environment, leading to improved generalization and robustness. RL has also shown promising results in addressing aerospace problems [25,26]. In comparison to traditional guidance algorithms, RL-based guidance algorithms exhibit strong anti-disturbance capabilities and real-time performance [27,28,29,30]. Ref. [31] proposed an RL-based adaptive real-time guidance algorithm for the 3DOF entry problem of hypersonic vehicles, and numerical simulation demonstrated that the proposed algorithm achieved a higher terminal success rate compared to the Linear Quadratic Regulator (LQR) method. The convergence of RL-based algorithms heavily relies on the design of the reward function. In the implementation of Ref. [31], dense rewards were provided by tracking a human-designed guidance law, which made it challenging for the model to search for the global optimal solution. Hence, in order to generate optimal control commands, it is key to design an improved reward function for RL-based algorithms.
In hypersonic entry flight environments, the reward signal is often sparse, meaning that the agent receives a reward only after completing a mission. To address this challenge, a reward shaping function needs to be designed to provide dense rewards throughout the learning process, motivating the agent to learn continuously. However, a reasonable reward shaping function is difficult to construct manually. Fortunately, inverse reinforcement learning (IRL) is one potential solution to this problem. The IRL algorithm represents an innovative approach within the realm of RL. Diverging from traditional RL algorithms, the IRL method aims to infer a potential reward function from observed expert examples. Furthermore, IRL can be thought of as an inverse problem where the objective is to understand the motivation behind expert behavior rather than directly learning a policy.
This paper presents a novel guidance algorithm based on Inverse Reinforcement Learning (IRL) to address the guidance problem during the entry phase of hypersonic vehicles. In comparison to other AI-based algorithms and traditional optimal control algorithms explored in previous works, the proposed algorithm's controller can generate optimal actions that meet the requirements of onboard applications using only a few trajectories as datasets. To the best of our knowledge, few studies have reported the generation of optimal actions for hypersonic vehicles via a well-trained DNN-based controller supported by only a few trajectories as datasets; this concern is therefore addressed in this work. In our work, the guidance algorithm is implemented as a policy neural network updated through simulated experience gained by interacting with a simulated hypersonic entry environment. In the proposed IRL framework, a customized version of the Proximal Policy Optimization (PPO) algorithm [32] is used to optimize the policy network. In particular, a generative adversarial neural network is designed to distinguish between the agent trajectories and the optimal datasets provided by optimal control theory, which effectively addresses the sparse reward problem while maintaining optimality. It is worth noting that the optimal dataset consists of only a few trajectories. After model optimization, the policy can offer high-frequency closed-loop guidance commands for onboard applications. To fully demonstrate the applicability of the proposed algorithm, numerical simulations are conducted on two typical hypersonic vehicles: the Common Aero Vehicle-Hypersonic (CAV-H) [33] and the Reusable Launch Vehicle (RLV). The two hypersonic vehicles correspond to different flight conditions, which is sufficient to illustrate the generalization of the proposed algorithm.
This paper is structured as follows. Section 2 provides an introduction to the entry problem for hypersonic vehicles, highlighting its characteristics of highly nonlinear dynamics. The Inverse-Reinforcement-Learning-based (IRL-based) guidance method is detailed in Section 3, including the algorithm framework, reward function design, and network structures utilized in the approach. Section 4 verifies the effectiveness and optimality of the proposed algorithm by performing a number of simulations through comparisons with General Pseudo-Spectral Optimal Control Software (GPOPS) [34]. The conclusion of this paper is given in Section 5.

2. Problem Formulation

2.1. The 3DOF Dynamic Model for Hypersonic Entry

The Earth is modeled as a uniform sphere, taking into account the effects of Earth’s rotation. During the entry phase of hypersonic vehicles, the control of the vehicle is achieved through the manipulation of the bank angle and attack angle, assuming no flight sideslip angle. The dynamic model for the entry phase is formulated in a 3-degree-of-freedom (3DOF) format, and the parameters of the dynamic model used in this paper are defined within the geocentric fixed coordinate system. These parameters are further elaborated in Equations (1)–(3), and these expressions are derived from Refs. [16,35].
\[
\begin{aligned}
\frac{dr}{dt} &= v\sin\gamma \\
\frac{d\theta}{dt} &= \frac{v\cos\gamma\sin\psi}{r\cos\varphi} \\
\frac{d\varphi}{dt} &= \frac{v\cos\gamma\cos\psi}{r} \\
\frac{dv}{dt} &= -\frac{D}{m} - g\sin\gamma + \Omega_v \\
\frac{d\gamma}{dt} &= \frac{1}{v}\left[\frac{L\cos\sigma}{m} + \left(\frac{v^2}{r} - g\right)\cos\gamma\right] + \Omega_\gamma \\
\frac{d\psi}{dt} &= \frac{1}{v}\left[\frac{L\sin\sigma}{m\cos\gamma} + \frac{v^2}{r}\cos\gamma\sin\psi\tan\varphi\right] + \Omega_\psi
\end{aligned}
\tag{1}
\]
with
\[
\begin{aligned}
\Omega_v &= \omega^2 r\cos\varphi\,(\sin\gamma\cos\varphi - \cos\gamma\sin\varphi\cos\psi) \\
\Omega_\gamma &= 2\omega\cos\varphi\sin\psi + \frac{\omega^2 r\cos\varphi}{v}(\cos\gamma\cos\varphi + \sin\gamma\sin\varphi\cos\psi) \\
\Omega_\psi &= 2\omega(\cos\varphi\tan\gamma\cos\psi - \sin\varphi) + \frac{\omega^2 r}{v\cos\gamma}\sin\varphi\cos\varphi\sin\psi
\end{aligned}
\tag{2}
\]
where r represents the distance from the center of the Earth to the hypersonic vehicle, θ and φ denote the longitude and latitude, respectively, and v is the velocity of the vehicle relative to the Earth. γ is the flight-path angle, measured between the velocity vector and the local horizontal plane, ψ is the heading angle, ω is the angular velocity of the Earth's rotation, and σ is the bank angle. g is the gravitational acceleration, defined as μ/r², where μ is the gravitational constant. L and D represent the aerodynamic lift and drag, respectively, which can be expressed as:
\[
L = \frac{1}{2}\rho v^2 S_{ref} C_L, \qquad D = \frac{1}{2}\rho v^2 S_{ref} C_D
\tag{3}
\]
in which S_ref denotes the reference area of the vehicle, and the atmospheric density is given by the exponential model ρ = ρ₀ e^(−(r − R_e)/h_s), a function of altitude and the reference sea-level density. Here, ρ₀ is the reference density at sea level, R_e is the Earth's radius, and h_s is the density scale height. The lift coefficient C_L and the drag coefficient C_D are both functions of the attack angle and the Mach number.
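To make the model concrete, the following Python sketch collects Equations (1)–(3) into a single state-derivative routine. It is an illustrative implementation only: the constants and the function signature are assumptions for illustration, not the exact values or code used in this paper.

```python
import numpy as np

# Illustrative constants (placeholders, not the paper's exact values)
MU = 3.986e14        # Earth gravitational constant, m^3/s^2
OMEGA_E = 7.292e-5   # Earth rotation rate, rad/s
RHO0 = 1.225         # reference sea-level density, kg/m^3
RE = 6.378e6         # Earth radius, m
HS = 7200.0          # density scale height, m

def entry_dynamics(state, sigma, CL, CD, m, Sref):
    """Right-hand side of the rotating-Earth 3DOF entry dynamics, Eqs. (1)-(3)."""
    r, theta, phi, v, gamma, psi = state
    g = MU / r**2
    rho = RHO0 * np.exp(-(r - RE) / HS)
    q = 0.5 * rho * v**2                     # dynamic pressure
    L, D = q * Sref * CL, q * Sref * CD      # aerodynamic lift and drag

    # Earth-rotation terms, Eq. (2)
    Ov = OMEGA_E**2 * r * np.cos(phi) * (np.sin(gamma) * np.cos(phi)
         - np.cos(gamma) * np.sin(phi) * np.cos(psi))
    Og = 2 * OMEGA_E * np.cos(phi) * np.sin(psi) \
         + OMEGA_E**2 * r * np.cos(phi) / v * (np.cos(gamma) * np.cos(phi)
           + np.sin(gamma) * np.sin(phi) * np.cos(psi))
    Op = 2 * OMEGA_E * (np.cos(phi) * np.tan(gamma) * np.cos(psi) - np.sin(phi)) \
         + OMEGA_E**2 * r / (v * np.cos(gamma)) * np.sin(phi) * np.cos(phi) * np.sin(psi)

    return np.array([
        v * np.sin(gamma),
        v * np.cos(gamma) * np.sin(psi) / (r * np.cos(phi)),
        v * np.cos(gamma) * np.cos(psi) / r,
        -D / m - g * np.sin(gamma) + Ov,
        (L * np.cos(sigma) / m + (v**2 / r - g) * np.cos(gamma)) / v + Og,
        (L * np.sin(sigma) / (m * np.cos(gamma))
         + v**2 / r * np.cos(gamma) * np.sin(psi) * np.tan(phi)) / v + Op,
    ])
```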

2.2. Problem Statement

The paper addresses the trajectory planning problem for hypersonic vehicles, which can be formulated as an optimization control problem. The objective is to generate a sequence of optimal control commands that minimize a given objective function, subject to various constraints, including boundary, path, and control constraints. Using the dynamic model, a typical optimization problem for hypersonic entry vehicles can be defined as follows [35]:
\[
\begin{aligned}
\min\ & J \\
\text{s.t.}\ & \dot{x} = f(x, u) \\
& x(t_0) = x_0, \quad x(t_f) = x_f \\
& x \in [x_{\min}, x_{\max}], \quad u_{\min} \le u \le u_{\max} \\
& \dot{Q} = K_Q\, \rho^{0.5} v^{3.15} \le \dot{Q}_{\max} \\
& q = \tfrac{1}{2}\rho v^2 \le q_{\max} \\
& n = \frac{S_{ref}\, q \sqrt{C_L^2 + C_D^2}}{m g} \le n_{\max}
\end{aligned}
\tag{4}
\]
where J denotes the objective function and the dynamic system is represented by ẋ = f(x, u), with x the six-dimensional state vector [r, θ, φ, v, γ, ψ]ᵀ. Additionally, Q̇ denotes the heat rate at the stagnation point, q the dynamic pressure, and n the overload, and K_Q is a constant parameter related to the curvature radius of the vehicle. The initial state is represented as x₀ and the mission target as x_f. The boundary constraints for the states are denoted as [x_min, x_max]. It is important to note that the initial states x₀ are randomly generated to simulate actual flight conditions. The control command vector u is determined by the vehicle model and mission requirements, and its minimum and maximum values are [u_min, u_max].
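As an illustration of how the path constraints in Equation (4) can be evaluated along a trajectory, a minimal sketch is given below; the function name and arguments are assumptions, not the implementation used in this work.

```python
import numpy as np

def path_constraints_ok(rho, v, CL, CD, m, Sref, KQ,
                        Qdot_max, q_max, n_max, g0=9.81):
    """Check the heat-rate, dynamic-pressure, and overload limits of Eq. (4)."""
    Qdot = KQ * np.sqrt(rho) * v**3.15           # stagnation-point heat rate
    q = 0.5 * rho * v**2                         # dynamic pressure
    n = Sref * q * np.hypot(CL, CD) / (m * g0)   # aerodynamic overload
    return (Qdot <= Qdot_max) and (q <= q_max) and (n <= n_max)
```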

3. Inverse-Reinforcement-Learning-Based Guidance Method

In this section, the proposed framework based on the IRL method is introduced. Our framework describes a novel training process for a DNN model, which achieves high accuracy even with a few optimal trajectories as datasets. The RL problem formulation, reward function design, and the architecture of the neural network are also discussed in this section.

3.1. IRL-Based Guidance Framework

Different from traditional RL algorithms, the IRL method requires a dataset of expert demonstrations, such as optimal trajectories generated by OCP algorithms, and enables the learning of a reward shaping function from them, allowing the agent to generalize from limited data and generate dense rewards. By using IRL, the reward shaping function can be learned automatically, removing the need for manual and complex reward design. Even when the expert demonstrations do not cover a given scenario, the IRL method can still optimize the agent to learn a decent policy. Several IRL methods have been proposed, such as extracting reward functions using the maximum-margin method [36], using Gaussian processes [37], and using decision trees [38].
In this paper, the Generative Adversarial Imitation Learning (GAIL) algorithm [39] is utilized as a form of the IRL algorithms to train a DNN-based model. The proposed guidance framework based on IRL is designed to achieve effective training of the model using a small number of expert demonstrations. The process of model training is depicted in Figure 1, illustrating the steps involved in training the model.
Prior to the model training phase, expert demonstrations are generated using a sophisticated direct trajectory optimization algorithm to solve the hypersonic entry problem. It is important to note that the initial states of the trajectories in the expert demonstrations only cover a limited range. However, the subsequent numerical experiment demonstrates that even with a few trajectories, the proposed algorithm is effective and capable of achieving good guidance.
The proposed IRL-based guidance method incorporates three different neural networks: the Policy Neural Network (Actor), the Value Neural Network (Critic), and the Discriminator Neural Network. The training phase of the IRL-based method proceeds as follows: (1) The policy network interacts with a high-fidelity hypersonic entry environment, as described in Section 2, and generates trajectories online. All initial states are generated randomly and cover the entire state space of the hypersonic entry problem. (2) State-action pairs from the generated trajectories and the expert demonstrations are randomly sampled to train the discriminator network. The goal is to maximize the expert reward while minimizing the agent reward, enabling the discriminator to distinguish between expert behaviors and agent behaviors. (3) The advantage function is calculated by combining the IRL reward generated by the discriminator network, the reward computed by the reward function, and the value predicted by the value network. The advantage function is then used to optimize the policy network, enabling it to generate improved control commands.
In our IRL-based framework, we utilize the PPO algorithm, which is a popular Advantage Actor-Critic (A2C) method widely used in various complex problems. The PPO algorithm is a policy gradient algorithm based on the Trust Region Policy Optimization (TRPO) method. It dynamically adjusts the maximum update step size by constraining the Kullback–Leibler (KL) divergence between the new and old policies. The PPO objective function can be expressed as follows:
\[
\max\ J(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\, A_{\pi,t}(s,a),\ \mathrm{clip}\bigl(r_t(\theta), 1-\varepsilon, 1+\varepsilon\bigr)\, A_{\pi,t}(s,a)\right)\right]
\tag{5}
\]
where r_t(θ) = π_θ(a_t|s_t)/π_θold(a_t|s_t) is the probability ratio between the new and old policy. The advantage function is denoted by A_{π,t}(s,a), which captures the benefit of taking action a in state s. The function clip(x, 1−ε, 1+ε) limits the value of x to the range [1−ε, 1+ε], where ε is the clipping ratio.
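For reference, the clipped surrogate objective of Equation (5) can be written compactly in PyTorch as in the following sketch; this is a generic PPO loss, not the exact code of the customized PPO used here, and the tensor names are illustrative.

```python
import torch

def ppo_surrogate_loss(log_prob_new, log_prob_old, advantage, eps=0.1):
    """Clipped PPO objective of Eq. (5); returns a loss to minimize."""
    ratio = torch.exp(log_prob_new - log_prob_old)            # r_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.mean(torch.min(unclipped, clipped))          # maximize J -> minimize -J
```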
To reduce the computational cost of the model training phase, the Distributed Proximal Policy Optimization (DPPO) algorithm [40] is implemented in this paper, which utilizes a multi-process mechanism to improve exploration efficiency. The combination of IRL with DPPO is described in Algorithm 1. Following the recommendations of prior research [25,32], dynamically adjusting the clipping ratio and the policy learning rate during the model training phase can improve performance. This approach is also employed in our work, as shown in line 17 of Algorithm 1, and can be described as follows:
\[
\varepsilon =
\begin{cases}
\min(\varepsilon_{\max},\ 1.5\,\varepsilon), & \text{if } kl < \tfrac{1}{2}\, kl_{targ} \\
\max(\varepsilon_{\min},\ \tfrac{1}{1.5}\,\varepsilon), & \text{if } kl > 2\, kl_{targ}
\end{cases}
\tag{6}
\]
\[
\alpha_\theta =
\begin{cases}
\min(\alpha_{\theta,\max},\ 1.5\,\alpha_\theta), & \text{if } kl < \tfrac{1}{2}\, kl_{targ} \text{ and } \varepsilon > \tfrac{1}{2}\,\varepsilon_{\max} \\
\max(\alpha_{\theta,\min},\ \tfrac{1}{1.5}\,\alpha_\theta), & \text{if } kl > 2\, kl_{targ} \text{ and } \varepsilon < 2\,\varepsilon_{\min}
\end{cases}
\tag{7}
\]
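A minimal sketch of the adaptation rules in Equations (6) and (7) is given below; the bound values passed as defaults are illustrative assumptions, not the values used in this paper.

```python
def adapt_clip_and_lr(kl, eps, lr, kl_targ,
                      eps_min=0.01, eps_max=0.3, lr_min=1e-5, lr_max=1e-3):
    """Adjust the clipping ratio (Eq. 6) and policy learning rate (Eq. 7)
    based on the approximate KL divergence between old and new policies."""
    if kl < 0.5 * kl_targ:
        eps = min(eps_max, 1.5 * eps)
        if eps > 0.5 * eps_max:
            lr = min(lr_max, 1.5 * lr)
    elif kl > 2.0 * kl_targ:
        eps = max(eps_min, eps / 1.5)
        if eps < 2.0 * eps_min:
            lr = max(lr_min, lr / 1.5)
    return eps, lr
```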
Algorithm 1 DPPO-IRL algorithm
1. Input: Expert trajectories τ_E ∼ π_E, iteration number I, DPPO worker process number M, buffer size K,
2.    discount factor Γ, initial policy, value network, and discriminator parameters θ_0, φ_0, ω_0.
3. Generate the DPPO worker processes [worker_1, worker_2, ..., worker_M].
4. for i = 1, 2, 3, ..., I do
5.   for j = 1, 2, ..., M do
6.     Send the current policy π_{θ_i} to the process worker_j, and run the policy π_{θ_i} for K/M timesteps
7.       to collect online trajectories.
8.   end for
9.   Sample agent trajectories τ_i ∼ π_{θ_i} and expert trajectories τ_E ∼ π_E.
10.  Update the discriminator parameter from ω_i to ω_{i+1} with the gradient:
11.    E_{τ_{θ_i}}[∇_ω log(D_ω(s, a))] + E_{τ_E}[∇_ω log(1 − D_ω(s, a))]
12.  Calculate the final reward R_t as the sum of the cost generated by the cost function log(1 − D_{ω_{i+1}}(s, a))
13.    and the reward computed by the reward function.
14.  Discount the reward using the factor Γ and estimate advantages via A_{π,t} = R̂_t − V_{φ_i}(s_t).
15.  Update the policy parameter from θ_i to θ_{i+1} and the value network parameter from φ_i to φ_{i+1} using the
16.    PPO algorithm.
17.  Adjust the policy learning rate and clipping ratio according to the approximate KL divergence.
18. end for
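The discriminator update and the IRL reward term of Algorithm 1 (lines 10-13) can be sketched in PyTorch as follows. The sketch assumes a discriminator with a sigmoid output and is a simplified illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def discriminator_step(disc, opt, agent_sa, expert_sa):
    """One discriminator update (Algorithm 1, lines 10-11): push D toward 1 on
    agent state-action pairs and toward 0 on expert pairs. `disc` is assumed
    to output probabilities via a sigmoid."""
    bce = nn.BCELoss()
    d_agent = disc(agent_sa)
    d_expert = disc(expert_sa)
    loss = bce(d_agent, torch.ones_like(d_agent)) + \
           bce(d_expert, torch.zeros_like(d_expert))
    opt.zero_grad()
    loss.backward()
    opt.step()

def irl_reward(disc, sa, eps=1e-8):
    """IRL reward term log(1 - D(s, a)) (Algorithm 1, line 12): higher when the
    discriminator judges the state-action pair to be expert-like."""
    with torch.no_grad():
        return torch.log(1.0 - disc(sa) + eps)
```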

3.2. RL Problem Formulation

An episode in the training process will be terminated prematurely if the range angle increases, indicating that the vehicle has deviated from the target point. Additionally, the episode will also be terminated if the path constraints (such as heat rate, dynamic pressure, or overload) are violated. After a certain number of steps have been accumulated, the policy, value function, and discriminator network are updated once using the IRL-based method. During the model optimization, the observation is represented by a vector given in Equation (8). As mentioned in Equation (9), the action space is defined differently for various problems. For the CAV-H entry problem and the RLV entry problem, the action space consists of [generalized lift coefficient λ , bank angle σ ] and [bank angle σ ], respectively.
\[
\mathrm{obs} = [\, r \ \ \theta \ \ \varphi \ \ v \ \ \gamma \ \ \psi \,]
\tag{8}
\]
\[
\mathrm{action} =
\begin{cases}
[\lambda, \sigma] \in \mathbb{R}^2, & \text{CAV-H entry problem} \\
[\sigma] \in \mathbb{R}^1, & \text{RLV entry problem}
\end{cases}
\tag{9}
\]
The position and velocity observations used in this paper are normalized before being fed into the model, with r̄ = r/R_e and v̄ = v/√(g₀R_e). Furthermore, each element in the action space is independently normalized to the range [−1, 1] using Equation (10):
\[
u^{norm}_{(i)} = \frac{2\,\bigl(u_{(i)} - u_{(i),\min}\bigr)}{u_{(i),\max} - u_{(i),\min}} - 1, \qquad
i \in
\begin{cases}
[\lambda, \sigma], & \text{CAV-H entry problem} \\
[\sigma], & \text{RLV entry problem}
\end{cases}
\tag{10}
\]
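A minimal sketch of the state and action normalization of Equations (8)–(10) is shown below, assuming the standard non-dimensionalization v̄ = v/√(g₀R_e); the function names are illustrative.

```python
import numpy as np

RE = 6.378e6   # Earth radius, m
G0 = 9.81      # gravitational acceleration at the surface, m/s^2

def normalize_obs(obs):
    """Normalize the observation of Eq. (8): r by Re, v by sqrt(g0 * Re)."""
    r, theta, phi, v, gamma, psi = obs
    return np.array([r / RE, theta, phi, v / np.sqrt(G0 * RE), gamma, psi])

def normalize_action(u, u_min, u_max):
    """Map each control channel to [-1, 1] as in Eq. (10)."""
    u, u_min, u_max = map(np.asarray, (u, u_min, u_max))
    return 2.0 * (u - u_min) / (u_max - u_min) - 1.0
```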

3.3. Reward Function Design

In the field of hypersonic entry problems, sparse rewards are a challenge, and designing a reasonable, dense reward function has been a focal point for researchers. However, to the best of our knowledge, little literature is available on reward function design for hypersonic entry problems. One potential solution is to track a predetermined guidance law, but such a tracking design sacrifices trajectory optimality. To overcome this limitation, a novel reward function is introduced in Equation (11) that combines the discriminator network with several designed terms; a schematic sketch of how these terms are assembled is given after the list below.
\[
\begin{aligned}
r &= r_{IRL} + r_{shaping} + r_{penalty} + r_{bonus} + \eta \\
r_{shaping} &= r_{shaping}^{h} + r_{shaping}^{heat} + r_{shaping}^{pressure} + r_{shaping}^{overload} \\
r_{penalty} &= r_{penalty}^{\theta} + r_{penalty}^{\varphi} + r_{penalty}^{h} + r_{penalty}^{\gamma} + r_{penalty}^{\psi}
\end{aligned}
\tag{11}
\]
where the variables mentioned above are defined as follows:
(1)
In contrast to other classical RL methods, r I R L is a term generated by the discriminator network at each step, which provides incentives for the agent to learn an optimal policy that aligns with the expert demonstrations.
(2)
r s h a p i n g is a punishment term for undesired states when the agent approaches the boundary, such as altitude, heat rate, dynamic pressure, and overload. As shown in Equation (12), the value of r s h a p i n g is determined by an exponential function, where the punishment increases as the agent gets closer to the boundary. This method enables the agent to quickly learn the solution that does not violate the path constraints.
\[
\begin{aligned}
r_{shaping}^{h} &=
\begin{cases}
-\alpha_h \exp\!\bigl(\left| h - h_{boundary}^{\min} \right| / h_{scale}\bigr), & \text{if } h > h_{limit}^{\max} \\
-\beta_h \exp\!\bigl(\left| h - h_{boundary}^{\max} \right| / h_{scale}\bigr), & \text{if } h < h_{limit}^{\min} \\
0, & \text{otherwise}
\end{cases} \\
r_{shaping}^{heat} &=
\begin{cases}
-\alpha_{heat} \exp\!\bigl(\left| \dot{Q} - \dot{Q}_{\max} \right| / \dot{Q}_{\max}\bigr), & \text{if } \dot{Q} > \beta_{heat}\, \dot{Q}_{\max} \\
0, & \text{otherwise}
\end{cases} \\
r_{shaping}^{pressure} &=
\begin{cases}
-\alpha_{pressure} \exp\!\bigl(\left| q - q_{\max} \right| / q_{\max}\bigr), & \text{if } q > \beta_{pressure}\, q_{\max} \\
0, & \text{otherwise}
\end{cases} \\
r_{shaping}^{overload} &=
\begin{cases}
-\alpha_{overload} \exp\!\bigl(\left| n - n_{\max} \right| / n_{\max}\bigr), & \text{if } n > \beta_{overload}\, n_{\max} \\
0, & \text{otherwise}
\end{cases}
\end{aligned}
\tag{12}
\]
(3)
To continuously incentivize the agent to improve terminal accuracy, we introduce the term r p e n a l t y which measures the accuracy of the terminal state and is only provided at the end of an episode. The specific formulations of r p e n a l t y are described as follows:
\[
\begin{aligned}
r_{penalty}^{\theta} &=
\begin{cases}
-\zeta_\theta \left| \theta - \theta_{target} \right|, & \text{if done} \\
0, & \text{otherwise}
\end{cases} \\
r_{penalty}^{\varphi} &=
\begin{cases}
-\zeta_\varphi \left| \varphi - \varphi_{target} \right|, & \text{if done} \\
0, & \text{otherwise}
\end{cases} \\
r_{penalty}^{h} &=
\begin{cases}
-\zeta_h \min\!\bigl(\left| h - h_{target}^{\min} \right|, \left| h - h_{target}^{\max} \right|\bigr), & \text{if done and } h \notin [h_{target}^{\min}, h_{target}^{\max}] \\
0, & \text{otherwise}
\end{cases} \\
r_{penalty}^{\gamma} &=
\begin{cases}
-\zeta_\gamma \min\!\bigl(\left| \gamma - \gamma_{target}^{\min} \right|, \left| \gamma - \gamma_{target}^{\max} \right|\bigr), & \text{if done and } \gamma \notin [\gamma_{target}^{\min}, \gamma_{target}^{\max}] \\
0, & \text{otherwise}
\end{cases} \\
r_{penalty}^{\psi} &=
\begin{cases}
-\zeta_\psi \min\!\bigl(\left| \psi - \psi_{target}^{\min} \right|, \left| \psi - \psi_{target}^{\max} \right|\bigr), & \text{if done and } \psi \notin [\psi_{target}^{\min}, \psi_{target}^{\max}] \\
0, & \text{otherwise}
\end{cases}
\end{aligned}
\tag{13}
\]
(4)
As defined in Equation (14), r b o n u s is a bonus given at the end of an episode if the agent satisfies all terminal constraints and the range angle d x is less than the specified tolerance d x l i m .
\[
r_{bonus} =
\begin{cases}
\kappa, & \text{if done and } d_x < d_x^{\lim} \text{ and } x_f \in [x_{target}^{\min}, x_{target}^{\max}] \\
0, & \text{otherwise}
\end{cases}
\tag{14}
\]
(5)
η is a positive constant that encourages the agent to continue exploring. This is necessary because the agent might tend to terminate early if all rewards are negative.
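As a schematic illustration, the sketch below shows how the terms of Equation (11) can be assembled into a single step reward; the shaping helper is a simplified instance of Equation (12), and all function names and default values are assumptions rather than the exact implementation.

```python
import numpy as np

def total_reward(r_irl, shaping_terms, penalty_terms, done,
                 terminal_ok, range_angle, dx_lim, kappa, eta=0.01):
    """Assemble the reward of Eq. (11): IRL term + shaping + terminal penalty
    + terminal bonus + exploration constant."""
    r_shaping = sum(shaping_terms)                       # Eq. (12): path-constraint shaping
    r_penalty = sum(penalty_terms) if done else 0.0      # Eq. (13): terminal accuracy
    r_bonus = kappa if (done and terminal_ok and range_angle < dx_lim) else 0.0  # Eq. (14)
    return r_irl + r_shaping + r_penalty + r_bonus + eta

def shaping_term(value, limit, alpha, beta):
    """Simplified shaping punishment in the spirit of Eq. (12): an exponential
    penalty activated once the quantity exceeds a fraction beta of its limit."""
    return -alpha * np.exp(abs(value - limit) / limit) if value > beta * limit else 0.0
```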

3.4. Neural Network Architecture

Two opposing neural networks are required for the IRL-based algorithm: the generator and the discriminator. In the implementation described in this article, the policy plays the role of the generator and is composed of a four-layer neural network. The input to the policy is the six-dimensional state vector, and the output is a vector whose dimension depends on the action definition. Each hidden layer of the policy uses a hyperbolic tangent activation function. The value function network takes the state as input and outputs a single scalar used to estimate the advantage of taking a specific action in a given state. The discriminator network takes as input the concatenated observation-action vector. Figure 2 summarizes the three network structures.
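The following PyTorch sketch mirrors the structure of the three networks in Figure 2; the hidden-layer widths are illustrative assumptions, since only the overall architecture is described here.

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(64, 64), act=nn.Tanh, out_act=None):
    """Small fully connected network with tanh hidden activations."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), act()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    if out_act is not None:
        layers.append(out_act())
    return nn.Sequential(*layers)

obs_dim, act_dim = 6, 2                       # CAV-H case: action = [lambda, sigma]
policy = mlp(obs_dim, act_dim)                # actor: state -> action
value_fn = mlp(obs_dim, 1)                    # critic: state -> scalar value estimate
discriminator = mlp(obs_dim + act_dim, 1,     # D(s, a) -> probability
                    out_act=nn.Sigmoid)
```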

4. Experiments

First, this section provides an overview of the two vehicle models and missions considered in this study, and the characteristics and parameters of the hypersonic vehicles are described. Next, the process of generating expert demonstrations and optimizing the model is presented. Furthermore, numerical trajectories of the IRL-based guidance algorithm are given in this section, along with a comparison between the IRL-based algorithm and the GPOPS solver. It is important to note that the model optimization and all numerical experiments were performed on a personal computer with an Intel Core i9-9900 CPU @ 3.10 GHz, 16.0 GB RAM, and the Windows 10 operating system. The Python 3.7 environment with PyTorch 1.10 was used to implement the IRL-based algorithm, while the GPOPS software was executed in MATLAB.

4.1. Vehicle Model and Mission

4.1.1. CAV-H Entry Problem

Referring to the article [20], the first vehicle model used in this paper is the CAV-H, which exhibits a high lift-to-drag ratio during hypersonic entry flight. Without loss of generality, the drag coefficient C_D of the CAV-H is assumed to follow a parabolic drag polar in C_L, from which the expression for the lift-to-drag ratio can be obtained. Assuming the vehicle maintains the maximum lift-to-drag ratio, the corresponding lift and drag coefficients are defined as follows:
\[
C_L^* = \sqrt{\frac{C_{D0}}{K}}, \qquad C_D^* = 2\,C_{D0}
\tag{15}
\]
Therefore, the maximum lift-to-drag ratio E* can be expressed as E* = C_L*/C_D* = 1/(2√(K C_D0)). In this problem, the vehicle always maintains the maximum lift-to-drag ratio during flight, and the generalized lift coefficient λ = C_L/C_L* is used as the control command instead of the traditional attack angle. As a result, the lift and drag coefficients can be redefined as follows:
\[
C_L = \lambda\, C_L^*, \qquad C_D = \frac{C_L^*\,(1 + \lambda^2)}{2 E^*}
\tag{16}
\]
The generalized lift coefficient λ and bank angle σ used as the control command in this CAV-H entry problem are limited within a certain range. The parameters for the CAV-H can be found in Table 1. The initial and terminal states of the vehicle are provided in Table 2. The entry mission is to reach a target location defined by a specific longitude, latitude, and final altitude range. Generally, in hypersonic missions, there are various performance indexes that can be optimized. Due to the CAV-H’s classification as a weapon missile, the minimization of flight time is considered imperative. The objective function can be defined as min   J = t f .
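A minimal sketch of the mapping from the generalized lift coefficient λ to the aerodynamic coefficients in Equation (16), using the CAV-H values of Table 1, is given below.

```python
def cav_h_aero(lam, CL_star=0.45, E_star=3.24):
    """Lift and drag coefficients for the CAV-H under the max-L/D parameterization,
    Eq. (16): CL = lambda * CL*, CD = CL* (1 + lambda^2) / (2 E*)."""
    CL = lam * CL_star
    CD = CL_star * (1.0 + lam**2) / (2.0 * E_star)
    return CL, CD
```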

4.1.2. RLV Entry Problem

Similar to the assumption in reference [41], an RLV model is used for the numerical demonstrations in this work. The RLV is a winged-body vehicle designed for vertical takeoff and horizontal landing. The trajectory optimization in this paper adopts the approximate aerodynamic coefficient model described in reference [42], with the following expressions:
\[
\begin{aligned}
C_L &= -0.041065 + 0.016292\,\alpha + 0.0002602\,\alpha^2 \\
C_D &= 0.080505 - 0.03026\,C_L + 0.86495\,C_L^2
\end{aligned}
\tag{17}
\]
where α is in degrees and is scheduled based on the velocity profile as given below:
\[
\alpha =
\begin{cases}
40, & \text{if } v > 4570\ \mathrm{m/s} \\
40 - 0.20705\,(v - 4570)^2 / 340^2, & \text{otherwise}
\end{cases}
\tag{18}
\]
Profiles of the angle of attack and the aerodynamic coefficients are shown in Figure 3. Consequently, in the RLV entry problem of this paper, the bank angle σ is the only control command, and the bank-angle rate is limited to 10 deg/s. The parameters for the RLV can be found in Table 3. Similar to the CAV-H entry problem, the parameters of the initial and terminal points are listed in Table 4. A free final-time entry is considered in this paper. For the RLV, it is important to minimize the total heat load during entry; therefore, the objective function for the RLV mission is min J = ∫_{t0}^{tf} Q̇ dt.
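The RLV aerodynamic model of Equations (17) and (18) can be sketched as follows; the coefficient signs follow the standard reusable-launch-vehicle model of Ref. [42], and the function names are illustrative.

```python
def rlv_alpha_deg(v):
    """Angle-of-attack schedule of Eq. (18); alpha in degrees, v in m/s."""
    if v > 4570.0:
        return 40.0
    return 40.0 - 0.20705 * (v - 4570.0)**2 / 340.0**2

def rlv_aero(v):
    """Approximate RLV aerodynamic coefficients of Eq. (17)."""
    alpha = rlv_alpha_deg(v)
    CL = -0.041065 + 0.016292 * alpha + 0.0002602 * alpha**2
    CD = 0.080505 - 0.03026 * CL + 0.86495 * CL**2
    return CL, CD
```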

4.2. Expert Demonstrations Generation Strategy

The GPOPS software, which is based on a pseudospectral method, is used in this paper as the OCP solver to generate the expert demonstrations. The environment parameters used in the simulations and dataset generation are reported in Table 5.
For both the CAV-H and RLV entry problems, 50 trajectories are randomly selected. The profiles of the 50 trajectories for the CAV-H and RLV entry problem are plotted in Figure 4 and Figure 5, respectively. After the generation of the trajectories, with the aim of augmenting the dataset, the 50 trajectories were linearly interpolated at intervals of step = 1   s . It should be noted that if a well-trained network is optimized using supervised learning methods, the number of samples required would typically be two orders of magnitude larger than the dataset used in this paper [22,23,24].
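A minimal sketch of the 1 s linear resampling used to augment the expert dataset is given below; the array names and shapes are illustrative assumptions.

```python
import numpy as np

def resample_trajectory(t, X, U, step=1.0):
    """Linearly interpolate a GPOPS trajectory onto a uniform 1 s time grid.
    t: (N,) times, X: (N, 6) states, U: (N, m) controls."""
    t_new = np.arange(t[0], t[-1], step)
    X_new = np.column_stack([np.interp(t_new, t, X[:, k]) for k in range(X.shape[1])])
    U_new = np.column_stack([np.interp(t_new, t, U[:, k]) for k in range(U.shape[1])])
    return t_new, X_new, U_new
```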

4.3. Model Optimization

The initial state ranges used in the model optimization are given in Table 2 and Table 4. At the beginning of model optimization, the initial learning rates of the policy, value function, and discriminator network are set to 0.0002, 0.0025, and 0.001, respectively. All of the hyperparameters used during model optimization are listed in Table 6. These hyperparameters were carefully determined based on the mission objectives, constraints, and empirical knowledge, with the aim of rescaling the rewards of all components to sensible ranges. During the model optimization, the guidance period is set to 2.5 s and the integration period to 0.5 s. For both the CAV-H and RLV entry problems, a total of 3000 model iterations are performed, which takes approximately 15 h to complete.
After applying smoothing with a window size of 5, the reward curves and terminal range angle curves for the CAV-H and RLV entry problems are plotted in Figure 6 and Figure 7, respectively. The left y-axis represents the reward curve, while the right y-axis describes the terminal range angle curve. At the beginning of the model optimization, the agent violated the path constraints, and the terminal range angle was large. As the model optimization progressed, the control commands generated by the agent gradually became similar to the expert demonstrations, leading to a rapid increase in total rewards. Finally, the agent learned how to satisfy all constraints and to continuously receive the terminal bonus. After approximately 1500 epochs of updating, both the policy and the discriminator network reached convergence, indicating that the algorithm had effectively learned the optimal guidance strategy.
It is important to note that while the total reward curve increased continuously during model training, the reward generated by the discriminator network did not necessarily follow the same trend. This can be attributed to the limited number of trajectories in the expert demonstrations, which can introduce compounding errors in the control sequence. This observation highlights the advantage of the IRL-based algorithm over supervised learning methods, where a large amount of data is typically required to ensure model generalization. In the case of the RLV entry mission, the reward generated by the discriminator network exhibited an upward and then downward trend, which resulted in a slight overall decrease in the total reward, as evidenced in Figure 7. This indicates that the agent was able to learn a strategy different from the expert demonstrations through the IRL-based algorithm, demonstrating its ability to explore alternative solutions.

4.4. Terminal Guidance Accuracy of the IRL-Based Guidance Method

In this subsection, in order to fully evaluate the performance of the IRL-based guidance method, 1000 trajectories are used in numerical simulations. The state variables of the vehicle are randomly initialized, and real-time closed-loop guidance is performed using the IRL-based controllers introduced above. The statistics of the terminal states, tabulated in Table 7, are used to measure the performance of the IRL-based algorithm. A mission is considered successful if the trajectory satisfies all constraints and the terminal range angle error is less than a certain threshold, d_x^lim degrees. For the CAV-H mission, the threshold is set to 0.25 degrees, while for the RLV mission it is set to 0.5 degrees due to the greater difficulty of finding a viable solution. The results show that the proposed algorithm achieves a success rate of 99.6% for the CAV-H mission, with the maximum range angle well controlled below 0.27 degrees. Even for the more challenging RLV mission, the success rate is still high at 99.2%. Table 8 provides statistics for the heating rate, dynamic pressure, and overload, demonstrating that all 1000 trajectories generated by the IRL-based method strictly satisfy the path constraints. Furthermore, the terminal state distributions of the two vehicles are plotted in Figure 8 and Figure 9, respectively, providing a visual representation of the achieved performance.

4.5. Optimality Analysis and Real-Time Performance

As shown in Ref. [35] and Ref. [41], the solutions from GPOPS are typically considered the benchmark for trajectory optimality. Therefore, in this paper, the IRL-based results are compared to the GPOPS solutions to validate their optimality. With the given initial and terminal conditions, the optimization objective for the CAV-H entry problem is the minimum flight time, and for the RLV entry problem the minimum total heat load. Figure 10 and Figure 11 show sample trajectories obtained using the IRL-based controller and the GPOPS method. For the CAV-H entry problem, it can be observed that the solutions obtained from the IRL-based method are similar to the results of the GPOPS method. The profiles of the generalized lift coefficient and bank angle also exhibit the same trends. When the vehicle approaches the target point, the generalized lift coefficient of the IRL-based algorithm is smoother than that of the GPOPS method.
For the RLV entry problem, the objective function becomes a highly nonlinear integral function, which can make the problem infeasible or difficult to converge. As shown in Figure 5, the control profiles of the expert demonstration for the RLV mission exhibit high-frequency jitter, indicating the complexity of searching for an optimal solution using GPOPS methods. This complexity also brings challenges in the model learning phase, especially when working with a limited number of trajectories in this paper. From Figure 11, the bank angle profiles of the IRL-based method have a similar trend to that of GPOPS, but the control results of the IRL-based method are smoother, which is more conducive to the actual flight environment. One noteworthy item is that in order to reduce the total heat load, the IRL-based method chooses to reach the target point faster, while the GPOPS method tends to decelerate as much as possible. The terminal range angle of the IRL-based method is only 0.1459 degrees, which satisfies the required accuracy for the RLV entry problem.
In order to further analyze the closed-loop guidance performance and optimality of the intelligent controller, 50 trajectories are used to evaluate the performance and computational cost for the two vehicles. The results of the comparison with the GPOPS method are presented in Table 9 and Table 10, respectively. While the training phase of the IRL-based method requires time, once training is completed, the online guidance frequency of the controller is high. The statistics show that the IRL-based method achieves a guidance frequency of 1 s / 0.000167 s ≈ 5988 Hz, which provides potential for future online closed-loop applications. In general, the nonlinear programming solver used in GPOPS is much slower than the IRL-based method, and the solution time of GPOPS is unpredictable because good initial guesses are required for convergence. Specifically, due to the complexity of the minimum total heat load objective, the calculation time of GPOPS increases to about 28 s for the RLV entry problem. In contrast, the CPU time of the IRL-based method is only 0.167 milliseconds, demonstrating its computational advantage. Furthermore, the total heat load of the IRL-based method is only 3% higher than that of GPOPS, which further demonstrates the optimality of the IRL-based method.

5. Conclusions

In this paper, an Inverse-Reinforcement-Learning-based method for hypersonic entry problems is developed to solve highly nonlinear optimal control problems, where a discriminator network is employed to implicitly capture the optimal reward information associated with expert demonstrations. On this basis, a novel reward function is proposed to address the sparse reward dilemma and provide optimal incentives, which is the main contribution of this paper. The IRL-based method has been validated on two typical hypersonic entry vehicle missions, showcasing its generalization capability. Extensive experiments have demonstrated the effectiveness of the IRL-based method in achieving real-time and high terminal precision with a small dataset. Furthermore, the optimality of the IRL-based method has been demonstrated by numerical solutions through comparison with GPOPS, and the simulation results show that the methodology proposed in this paper is suitable for online optimal guidance and has the potential for onboard implementation in practical applications.

Author Contributions

Conceptualization, J.W. and L.S.; Methodology, J.W. and L.S.; Software, L.S.; Validation, L.S.; Formal analysis, J.W.; Investigation, L.S.; Resources, J.W. and H.C.; Data Curation, L.S.; Writing—Original Draft Preparation, L.S.; Writing—Review and Editing, J.W. and H.C.; Visualization, L.S.; Supervision, J.W.; Project administration, H.C.; Funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Basic and Applied Basic Research Project of Guangzhou Science and Technology Bureau, No. 202201011187.

Data Availability Statement

All data used during the study appear in the submitted article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, Z.; Hu, C.; Ding, C.; Liu, G.; He, B. Stochastic gradient particle swarm optimization based entry trajectory rapid planning for hypersonic glide vehicles. Aerosp. Sci. Technol. 2018, 76, 176–186. [Google Scholar] [CrossRef]
  2. Conway, B.A. A Survey of Methods Available for the Numerical Optimization of Continuous Dynamic Systems. J. Optim. Theory Appl. 2012, 152, 271–306. [Google Scholar] [CrossRef]
  3. Chai, R.; Tsourdos, A.; Savvaris, A.; Chai, S.; Xia, Y.; Philip Chen, C. Review of advanced guidance and control algorithms for space/aerospace vehicles. Prog. Aerosp. Sci. 2021, 122, 100696. [Google Scholar] [CrossRef]
  4. Ross, I.M.; Fahroo, F. Issues in the real-time computation of optimal control. Math. Comput. Model. 2006, 43, 1172–1188. [Google Scholar] [CrossRef]
  5. Wang, Z.P.; Wu, H.N.; Li, H.X. Sampled-Data Fuzzy Control for Nonlinear Coupled Parabolic PDE-ODE Systems. IEEE Trans. Cybern. 2017, 47, 2603–2615. [Google Scholar] [CrossRef]
  6. Betts, J.T. Survey of Numerical Methods for Trajectory Optimization. J. Guid. Control Dyn. 1998, 21, 193–207. [Google Scholar] [CrossRef]
  7. von Stryk, O.; Bulirsch, R. Direct and indirect methods for trajectory optimization. Ann. Oper. Res. 1992, 37, 357–373. [Google Scholar] [CrossRef]
  8. Ozimek, M.T.; Howell, K.C. Low-Thrust Transfers in the Earth-Moon System, Including Applications to Libration Point Orbits. J. Guid. Control Dyn. 2010, 33, 533–549. [Google Scholar] [CrossRef]
  9. Mansell, J.R.; Grant, M.J. Adaptive Continuation Strategy for Indirect Hypersonic Trajectory Optimization. J. Spacecr. Rocket. 2018, 55, 818–828. [Google Scholar] [CrossRef]
  10. Grant, M.J.; Braun, R.D. Rapid Indirect Trajectory Optimization for Conceptual Design of Hypersonic Missions. J. Spacecr. Rocket. 2015, 52, 177–182. [Google Scholar] [CrossRef]
  11. Tang, G.; Jiang, F.; Li, J. Fuel-Optimal Low-Thrust Trajectory Optimization Using Indirect Method and Successive Convex Programming. IEEE Trans. Aerosp. Electron. Syst. 2018, 54, 2053–2066. [Google Scholar] [CrossRef]
  12. Wang, J.; Li, H.; Chen, H. An Iterative Convex Programming Method for Rocket Landing Trajectory Optimization. J. Astronaut. Sci. 2020, 67, 1553–1574. [Google Scholar] [CrossRef]
  13. Açıkmeşe, B.; Carson, J.M.; Blackmore, L. Lossless Convexification of Nonconvex Control Bound and Pointing Constraints of the Soft Landing Optimal Control Problem. IEEE Trans. Control Syst. Technol. 2013, 21, 2104–2113. [Google Scholar] [CrossRef]
  14. Wang, J.; Cui, N.; Wei, C. Optimal Rocket Landing Guidance Using Convex Optimization and Model Predictive Control. J. Guid. Control Dyn. 2019, 42, 1078–1092. [Google Scholar] [CrossRef]
  15. Wang, J.; Cui, N.; Wei, C. Rapid trajectory optimization for hypersonic entry using convex optimization and pseudospectral method. Aircr. Eng. Aerosp. Technol. 2019, 91, 669–679. [Google Scholar] [CrossRef]
  16. Wang, J.; Liang, H.; Qi, Z.; Ye, D. Mapped Chebyshev pseudospectral methods for optimal trajectory planning of differentially flat hypersonic vehicle systems. Aerosp. Sci. Technol. 2019, 89, 420–430. [Google Scholar] [CrossRef]
  17. Yang, S.; Cui, T.; Hao, X.; Yu, D. Trajectory optimization for a ramjet-powered vehicle in ascent phase via the Gauss pseudospectral method. Aerosp. Sci. Technol. 2017, 67, 88–95. [Google Scholar] [CrossRef]
  18. Lekkas, A.M.; Roald, A.L.; Breivik, M. Online Path Planning for Surface Vehicles Exposed to Unknown Ocean Currents Using Pseudospectral Optimal Control. In Proceedings of the 10th IFAC Conference on Control Applications in Marine Systems (CAMS), Trondheim, Norway, 13–16 September 2016; Volume 49, pp. 1–7. [Google Scholar]
  19. Shirobokov, M.; Trofimov, S.; Ovchinnikov, M. Survey of machine learning techniques in spacecraft control design. Acta Astronaut. 2021, 186, 87–97. [Google Scholar] [CrossRef]
  20. Thuruthel, T.G.; Shih, B.; Laschi, C.; Tolley, M.T. Soft robot perception using embedded soft sensors and recurrent neural networks. Sci. Robot. 2019, 4, eaav1488. [Google Scholar] [CrossRef]
  21. Furfaro, R.; Bloise, I.; Orlandelli, M.; Di Lizia, P.; Topputo, F.; Linares, R. Deep learning for autonomous lunar landing. In Proceedings of the AAS/AIAA Astrodynamics Specialist Conference, Snowbird, UT, USA, 19–23 August 2018; pp. 3285–3306. [Google Scholar]
  22. Shi, Y.; Wang, Z. Onboard Generation of Optimal Trajectories for Hypersonic Vehicles Using Deep Learning. J. Spacecr. Rocket. 2021, 58, 400–414. [Google Scholar] [CrossRef]
  23. Wang, J.; Wu, Y.; Liu, M.; Yang, M.; Liang, H. A Real-Time Trajectory Optimization Method for Hypersonic Vehicles Based on a Deep Neural Network. Aerospace 2022, 9, 188. [Google Scholar] [CrossRef]
  24. Chai, R.; Tsourdos, A.; Savvaris, A.; Xia, Y.; Chai, S. Real-Time Reentry Trajectory Planning of Hypersonic Vehicles: A Two-Step Strategy Incorporating Fuzzy Multiobjective Transcription and Deep Neural Network. IEEE Trans. Ind. Electron. 2020, 67, 6904–6915. [Google Scholar] [CrossRef]
  25. Deng, T.; Huang, H.; Fang, Y.; Yan, J.; Cheng, H. Reinforcement learning-based missile terminal guidance of maneuvering targets with decoys. Chin. J. Aeronaut. 2023. [Google Scholar] [CrossRef]
  26. Wang, H.; Yang, Z.; Zhou, W.; Li, D. Online scheduling of image satellites based on neural networks and deep reinforcement learning. Chin. J. Aeronaut. 2019, 32, 1011–1019. [Google Scholar] [CrossRef]
  27. Gaudet, B.; Linares, R.; Furfaro, R. Deep reinforcement learning for six degree-of-freedom planetary landing. Adv. Space Res. 2020, 65, 1723–1741. [Google Scholar] [CrossRef]
  28. Xu, X.; Chen, Y.; Bai, C. Deep Reinforcement Learning-Based Accurate Control of Planetary Soft Landing. Sensors 2021, 21, 8161. [Google Scholar] [CrossRef] [PubMed]
  29. Li, S.; Yan, Y.; Qiao, H.; Guan, X.; Li, X. Reinforcement Learning for Computational Guidance of Launch Vehicle Upper Stage. Int. J. Aerosp. Eng. 2022, 2022, 2935929. [Google Scholar] [CrossRef]
  30. Furfaro, R.; Scorsoglio, A.; Linares, R.; Massari, M. Adaptive generalized ZEM-ZEV feedback guidance for planetary landing via a deep reinforcement learning approach. Acta Astronaut. 2020, 171, 156–171. [Google Scholar] [CrossRef]
  31. Gaudet, B.; Drozd, K.; Furfaro, R. Adaptive Approach Phase Guidance for a Hypersonic Glider via Reinforcement Meta Learning. In Proceedings of the AIAA SCITECH 2022 Forum, San Diego, CA, USA, 3–7 January 2022. [Google Scholar]
  32. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  33. Richie, G. The Common Aero Vehicle—Space delivery system of the future. In Proceedings of the Space Technology Conference and Exposition, Albuquerque, NM, USA, 28–30 September 1999. [Google Scholar]
  34. Patterson, M.A.; Rao, A.V. GPOPS-II: A MATLAB Software for Solving Multiple-Phase Optimal Control Problems Using hp-Adaptive Gaussian Quadrature Collocation Methods and Sparse Nonlinear Programming. ACM Trans. Math. Softw. 2014, 41, 1–37. [Google Scholar] [CrossRef]
  35. Wang, Z.; Grant, M.J. Constrained Trajectory Optimization for Planetary Entry via Sequential Convex Programming. J. Guid. Control Dyn. 2017, 40, 2603–2615. [Google Scholar] [CrossRef]
  36. Ng, A.Y.; Russell, S.J. Algorithms for Inverse Reinforcement Learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, San Francisco, CA, USA, 29 June–2 July 2000; pp. 663–670. [Google Scholar]
  37. Levine, S.; Popovic, Z.; Koltun, V. Nonlinear inverse reinforcement learning with gaussian processes. Adv. Neural Inf. Process. Syst. 2011, 24, 19–27. [Google Scholar]
  38. Bagnell, J.; Chestnutt, J.; Bradley, D.; Ratliff, N. Boosting Structured Prediction for Imitation Learning. In Proceedings of the Advances in Neural Information Processing Systems; Schölkopf, B., Platt, J., Hoffman, T., Eds.; MIT Press: Cambridge, MA, USA, 2006; Volume 19. [Google Scholar]
  39. Ho, J.; Ermon, S. Generative Adversarial Imitation Learning. In Proceedings of the Advances in Neural Information Processing Systems; Schölkopf, B., Platt, J., Hoffman, T., Eds.; MIT Press: Cambridge, MA, USA, 2016; Volume 29. [Google Scholar]
  40. Heess, N.; TB, D.; Sriram, S.; Lemmon, J.; Merel, J.; Wayne, G.; Tassa, Y.; Erez, T.; Wang, Z.; Eslami, S.M.A.; et al. Emergence of locomotion behaviours in rich environments. arXiv 2017, arXiv:1707.02286. [Google Scholar]
  41. Wang, Z.; Lu, Y. Improved Sequential Convex Programming Algorithms for Entry Trajectory Optimization. J. Spacecr. Rocket. 2020, 57, 1373–1386. [Google Scholar] [CrossRef]
  42. Lu, P. Entry Guidance and Trajectory Control for Reusable Launch Vehicle. J. Guid. Control Dyn. 1997, 20, 143–149. [Google Scholar] [CrossRef]
Figure 1. IRL-based guidance framework.
Figure 2. Architecture of the three neural networks. (a) Policy network; (b) value function network; (c) discriminator network.
Figure 3. Profiles of angle of attack and aerodynamic coefficients.
Figure 4. Dataset of the expert demonstrations for the CAV-H entry problem.
Figure 5. Dataset of the expert demonstrations for the RLV entry problem.
Figure 6. Optimization reward curve and range angle curve for the CAV-H entry problem.
Figure 7. Optimization reward curve and range angle curve for the RLV entry problem.
Figure 8. Statistical graph of terminal latitude, longitude, range angle, and altitude for the CAV-H entry problem.
Figure 9. Statistical graph of terminal latitude, longitude, range angle, and altitude for the RLV entry problem.
Figure 10. Comparison of sample trajectory for the CAV-H entry problem.
Figure 11. Comparison of sample trajectory for the RLV entry problem.
Table 1. The parameters of the CAV-H.

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| m (kg) | 907 | Q̇_max (kW/m²) | 2000 |
| S_ref (m²) | 0.4839 | q_max (kN/m²) | 300 |
| E* (-) | 3.24 | n_max (g₀) | 3.0 |
| C_L* (-) | 0.45 | K_Q (-) | 1.688 × 10⁻⁵ |
| λ_min (-) | 0 | σ_min (deg) | −80 |
| λ_max (-) | 2 | σ_max (deg) | 80 |
Table 2. Boundary constraints for the CAV-H entry problem.

| Boundary | h (km) | θ (deg) | φ (deg) | v (m/s) | γ (deg) | ψ (deg) |
|---|---|---|---|---|---|---|
| Initial condition | 41 ≤ h ≤ 46 | −0.5 ≤ θ ≤ 0.5 | −0.5 ≤ φ ≤ 0.5 | 5300 ≤ v ≤ 5500 | −0.5 ≤ γ ≤ 0.5 | 89.9 ≤ ψ ≤ 90.1 |
| Terminal condition | 30 ≤ h ≤ 40 | 39.3 | 20 | – | – | – |
Table 3. The parameters of the RLV.

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| m (kg) | 104,305 | Q̇_max (kW/m²) | 1800 |
| S_ref (m²) | 391.22 | q_max (kN/m²) | 20 |
| σ_min (deg) | −80 | n_max (g₀) | 3.0 |
| σ_max (deg) | 80 | K_Q (-) | 1.65 × 10⁻⁴ |
Table 4. Boundary constraints for the RLV entry problem.

| Boundary | h (km) | θ (deg) | φ (deg) | v (m/s) | γ (deg) | ψ (deg) |
|---|---|---|---|---|---|---|
| Initial condition | 99 ≤ h ≤ 101 | −0.2 ≤ θ ≤ 0.2 | −0.2 ≤ φ ≤ 0.2 | 7450 | −0.5 | 0 |
| Terminal condition | 20 ≤ h ≤ 30 | 12 | 70 | – | −20 ≤ γ ≤ 0 | 80 ≤ ψ ≤ 100 |
Table 5. Environment parameters.

| Parameter | Value |
|---|---|
| Atmosphere scale height h_s (m) | 7500 (CAV-H), 7200 (RLV) |
| Surface air density ρ₀ (kg/m³) | 1.2 (CAV-H), 1.225 (RLV) |
| Earth radius R_e (m) | 6.378 × 10⁶ |
| Gravitational acceleration at Earth radius g₀ (m/s²) | 9.81 |
Table 6. Hyperparameter settings.

| Parameter | CAV-H Entry Problem | RLV Entry Problem |
|---|---|---|
| α_i, i ∈ {h, heat, pressure, overload} | [1, 1, 1, 1] | [1, 2.5, 2.5, 5] |
| β_i, i ∈ {h, heat, pressure, overload} | [0.5, 0.98, 0.98, 0.98] | [1, 0.96, 0.96, 0.96] |
| ζ_i, i ∈ {θ, φ, h, γ, ψ} | [10, 10, 0.1, 0, 0] | [5, 5, 2, 0.5, 0.5] |
| h_i, i ∈ {boundary_min, boundary_max, limit_min, limit_max, scale} | [25, 55, 30, 50, 0.1] | [20, 120, 20, 120, 0.1] |
| κ | 150 | 100 |
| η | 0.01 | 0.01 |
| d_x^lim | 0.25 | 0.5 |
| ε | 0.1 | 0.1 |
| Γ | 0.99 | 0.99 |
| K | 32,768 | 32,768 |
| M | 6 | 6 |
| I | 3000 | 3000 |
Table 7. Terminal accuracy statistics.

| Parameter | CAV-H Min | CAV-H Mean | CAV-H Max | RLV Min | RLV Mean | RLV Max |
|---|---|---|---|---|---|---|
| Range Angle (deg) | 0.01 | 0.07 | 0.27 | 0.00 | 0.17 | 0.46 |
| Latitude (deg) | 39.14 | 39.31 | 39.51 | 11.81 | 12.05 | 12.20 |
| Longitude (deg) | 19.83 | 19.99 | 20.15 | 69.83 | 70.14 | 70.45 |
| Velocity (m/s) | 2074.24 | 2456.57 | 2755.38 | 1086.91 | 1205.92 | 1407.08 |
| Altitude (km) | 30.76 | 33.38 | 35.50 | 28.12 | 28.83 | 31.97 |
| Flight Path Angle (deg) | – | – | – | −9.04 | −5.54 | 0.03 |
| Heading Angle (deg) | – | – | – | 81.36 | 96.05 | 99.46 |
| Success Rate | | 99.6% | | | 99.2% | |
Table 8. Path constraint statistics.

| Vehicle Mission | Constraint | Mean μ | Std. Dev. σ | Max | Limit |
|---|---|---|---|---|---|
| CAV-H Entry | Heating Rate (kW/m²) | 587.57 | 60.35 | 725.43 | 2000 |
| CAV-H Entry | Dynamic Pressure (kN/m²) | 54.86 | 10.05 | 76.34 | 300 |
| CAV-H Entry | Overload (g₀) | 1.42 | 0.27 | 1.98 | 3 |
| RLV Entry | Heating Rate (kW/m²) | 1436.14 | 13.84 | 1459.8 | 1800 |
| RLV Entry | Dynamic Pressure (kN/m²) | 16.42 | 0.17 | 16.59 | 20 |
| RLV Entry | Overload (g₀) | 2.86 | 0.019 | 2.88 | 3 |
Table 9. Comparison with GPOPS for the CAV-H entry problem.

| Method | Mean Flight Time (s) | Mean CPU Time (ms) |
|---|---|---|
| IRL-based | 1264 | 0.163 |
| GPOPS | 1260 | 14,328 |
Table 10. Comparison with GPOPS for the RLV entry problem.

| Method | Mean Total Heat Load (kJ/m²) | Mean CPU Time (ms) |
|---|---|---|
| IRL-based | 1,099,644 | 0.167 |
| GPOPS | 1,064,421 | 28,234 |

Share and Cite

MDPI and ACS Style

Su, L.; Wang, J.; Chen, H. A Real-Time and Optimal Hypersonic Entry Guidance Method Using Inverse Reinforcement Learning. Aerospace 2023, 10, 948. https://doi.org/10.3390/aerospace10110948
