Article

Powered Landing Control of Reusable Rockets Based on Softmax Double DDPG

1 School of Automation Science and Electrical Engineering, Beihang University, Beijing 102206, China
2 Beijing Aerospace Automatic Control Institute, Beijing 100854, China
3 School of Astronautics, Beihang University, Beijing 102206, China
4 Shenzhen Institute of Beihang University, Shenzhen 518063, China
* Author to whom correspondence should be addressed.
Aerospace 2023, 10(7), 590; https://doi.org/10.3390/aerospace10070590
Submission received: 14 April 2023 / Revised: 12 June 2023 / Accepted: 27 June 2023 / Published: 28 June 2023
(This article belongs to the Section Astronautics & Space Science)

Abstract:
Multi-stage launch vehicles are currently the primary tool for humans to reach extraterrestrial space. The technology of recovering and reusing rockets can effectively shorten rocket launch cycles and reduce space launch costs. With the development of deep representation learning, reinforcement learning (RL) has become a robust learning framework capable of learning complex policies in high-dimensional environments. In this paper, a deep reinforcement learning-based reusable rocket landing control method is proposed. The reusable rocket landing process is modelled mathematically by considering the aerodynamic drag, thrust, gravitational force, and Earth's rotation during landing. A reward function is designed according to the rewards and penalties derived from mission accomplishment, terminal constraints, and landing performance. On this basis, the Softmax double deep deterministic policy gradient (SD3) deep RL algorithm is applied to build a robust landing controller for the reusable rocket. In the constructed simulation environment, the proposed method achieves convergent and robust control results, proving its effectiveness.

1. Introduction

With the expanding demand for launching satellites and spacecraft, expendable launch vehicles are becoming one of the main factors restricting the exploration of extraterrestrial space. Expendable launch vehicles discard critical components such as engines and navigation equipment directly after completing their missions, which raises the cost of space access [1]. To tackle this problem, reusable launch vehicles have gained widespread attention owing to their economy and flexibility [2,3,4,5].
Research on reusable launch vehicles has made continuous progress in recent years. As an extension of parachute landing technology, the parachute recovery method has been widely studied and applied to various rockets [6,7,8,9]. Although it is a simple, reliable, and economical method, the unpowered descent process makes it difficult to predict the final landing point of the rocket. In comparison, the powered landing control method has gradually become the preferred option considering the safety, controllability, and rapidity of the landing process. Academic research on powered landing generally applies offline or online trajectory optimization followed by various tracking control methods for pinpoint landing. Convex optimization is a typical approach, transforming the optimal landing problem into a convex problem from which an optimal landing trajectory can be obtained [10,11,12]. Açıkmeşe et al. [10] introduced a convexification of the control constraints that was proven to be lossless and used interior-point methods of convex optimization to obtain optimal solutions of the original non-convex optimal control problem. Blackmore [11] combined convex optimization via CVXGEN [13] with powered-divert guidance to optimize the trajectory and eliminate error. While convex-optimization-based methods can achieve optimal solutions, they require the problem to be convex and often become computationally inefficient as the complexity of the problem increases. Besides convex optimization, many other numerical optimization methods can also be applied to solve the optimal landing problem. Sagliano et al. [14] applied a multivariate pseudospectral interpolation scheme to subspaces of a database of pre-computed trajectories to generate real-time capable entry guidance solutions. Seelbinder et al. [15] introduced parametric sensitivity analysis of non-linear programs for near-optimal control sequence generation. To take advantage of both convex optimization and pseudospectral methods, Wang et al. [16] adopted a pseudospectral-improved successive convexification algorithm to solve the trajectory optimization problem. The above-mentioned methods greatly enriched the solution diversity of the optimal landing problem, but they usually require extensive calculations and accurate modelling of the environment, which are difficult to satisfy in practice.
Leveraging the continuous development and breakthroughs of deep learning technology, the idea of using deep neural networks (DNNs) in guidance, navigation, and control has been widely explored in the literature [17,18,19,20]. Among deep learning technologies, supervised learning has been extensively discussed. DNNs have been applied to asteroid landing trajectory optimization [17] and the first-stage recovery flight of reusable launch vehicles [18]. Furfaro et al. [19,20] designed a set of convolutional neural network (CNN) and long short-term memory (LSTM) architectures to handle the problem of autonomous lunar landing. Although supervised learning exhibits strong fitting abilities for optimal landing controls, acquiring sufficient data for the training process is very time-consuming [17,19]. An optimization method (e.g., GPOPS [19]) is usually employed before supervised learning to generate optimal state-control pairs as training samples (146,949 samples in [17] and 2,970,000 samples in [19]). Depending on the complexity of the optimization problem, a single sample can take anywhere from a few seconds to several hours to compute, and generating such a large number of samples consumes considerable time.
Deep reinforcement learning (DRL), on the other hand, uses trial-and-error interaction with the environment to optimize the policy instead of learning from enormous amounts of labelled data, making it useful in many fields such as chess [21], video games [22,23], investment [24], and autonomous driving [25,26]. DRL has also drawn extensive attention for reusable launch vehicles. Ciabatti et al. [27] combined the deep deterministic policy gradient (DDPG) [28] algorithm with transfer learning to tackle the problem of planetary landing in unknown extra-terrestrial environments. Cheng et al. [29] derived an interactive reinforcement learning (RL) algorithm for the fuel-optimal Moon landing problem. Furfaro et al. [30] presented an adaptive guidance algorithm based on zero-effort-miss/zero-effort-velocity (ZEM/ZEV) and the deterministic policy gradient [31]. Gaudet et al. [32] applied the principles of the deep Q network [33] in a six degrees-of-freedom Mars landing simulation environment, showing strong robustness.
In this paper, we address the control of the final landing stage of a two-stage reusable rocket using DRL algorithms. In contrast with previously applied RL algorithms, we apply the Softmax double deep deterministic policy gradient (SD3) algorithm to keep the error between the value function estimated by the neural network and the true value function within a reasonable interval. This advantage is particularly useful when facing the coupled 3-DOF translational and 3-DOF rotational motion of the rocket, as the value function is difficult to estimate accurately in such scenarios. In addition, we design a novel reward function by introducing a state-dependent Gaussian distribution. The landing result is evaluated by considering the landing position error, the landing velocity, and the landing impact related to the final acceleration. To verify the effectiveness of the proposed method, we build a three-dimensional simulation environment considering gravity, thrust, and aerodynamic drag to test the trained landing control policy. Our contributions are listed as follows:
1. We introduce RL to obtain the optimal control commands for the reusable rocket landing problem. We apply the SD3 algorithm to simultaneously predict both the optimal thrust and its nozzle angle, considering the three-axis translational and three-axis rotational dynamics of the rocket as well as environmental disturbances and uncertainties.
2. We design a novel reward function specific to the reusable rocket landing problem for training the RL models. We aim to give sufficient guidance to the rocket in terms of position, velocity, thrust, and attitude angle. We introduce Gaussian-distributed rewards for position and attitude angle to give higher rewards under conditions with a higher probability of a successful landing.

2. Materials and Methods

2.1. Mathematical Modelling of the Reusable Rocket Landing

This paper focuses on the powered landing stage of the reusable rocket landing process, where the position and attitude of the rocket are controlled by adjusting the thrust magnitude and nozzle angle. Four coordinate systems are used in this paper: the geocentric equatorial coordinate system, the launch point coordinate system, the body coordinate system of the rocket, and the velocity coordinate system of the rocket. Their definitions and relationships are derived in Appendix A.
The translational movement of the rocket is governed by the following translational dynamics equations expressed in the launch point coordinate system:
$$\dot{\mathbf{r}} = \mathbf{v}, \qquad \dot{\mathbf{v}} = \frac{\mathbf{T} + \mathbf{D}}{m} + \mathbf{g} - \mathbf{a}_e - \mathbf{a}_k, \qquad \dot{m} = -\frac{\|\mathbf{T}\|}{I_{sp}}$$
where $\mathbf{r} = [x\ y\ z]^{\mathrm{T}} \in \mathbb{R}^3$ is the position vector, $\mathbf{v} = [v_x\ v_y\ v_z]^{\mathrm{T}} \in \mathbb{R}^3$ is the velocity vector, $m$ is the mass of the rocket, $\mathbf{T}$ is the thrust vector, $\mathbf{D}$ is the aerodynamic drag vector, $\mathbf{g}$ is the acceleration of gravity, and $I_{sp}$ is the specific impulse of the rocket. We define $\boldsymbol{\omega}_e$ as the angular velocity of the Earth's rotation; then $\mathbf{a}_e = \boldsymbol{\omega}_e \times (\boldsymbol{\omega}_e \times \mathbf{r})$ is the convected (centripetal) acceleration and $\mathbf{a}_k = 2\boldsymbol{\omega}_e \times \frac{\delta \mathbf{r}}{\delta t}$ is the Coriolis acceleration.
The acceleration of gravity $\mathbf{g}$ can be decomposed into two components considering the oblateness of the Earth, i.e., the vertical component $g_r$ pointing from the Earth's centre to the rocket's centre of mass, and the horizontal component $g_m$ pointing from the rocket's centre of mass towards North:
$$g_r = -\frac{\mu}{\|\mathbf{r}\|^2}\left[1 + \frac{3}{2} J_2 \left(\frac{a_e}{\|\mathbf{r}\|}\right)^{2} \left(1 - 5\sin^{2}\Phi\right)\right], \qquad g_m = -3\frac{\mu}{\|\mathbf{r}\|^2} J_2 \left(\frac{a_e}{\|\mathbf{r}\|}\right)^{2} \sin\Phi$$
where $\mu = 3.986004 \times 10^{14}\ \mathrm{m^3/s^2}$ is the geocentric gravitational constant, $J_2 = 1.08263 \times 10^{-3}$ is the oblateness coefficient of the Earth, $a_e$ is the Earth's equatorial radius, and $\Phi$ is the latitude of the rocket.
The aerodynamic drag D is described as
$$\mathbf{D} = -\frac{1}{2}\rho\, \|\mathbf{v}\|\, S_{ref}\, C_D(Ma)\, \mathbf{v}$$
where $\rho$ is the atmospheric density, determined by the altitude, $\mathbf{v}$ is the velocity of the rocket relative to the atmosphere, $S_{ref}$ is the reference area of the rocket, and $Ma$ is the Mach number. $C_D$ is the drag coefficient, a non-linear function of the velocity $\mathbf{v}$ through the Mach number. The detailed calculation of the aerodynamic drag $\mathbf{D}$ is discussed in Appendix B.
The rotational movement of the rocket is governed by the attitude dynamics equation expressed as
$$\mathbf{I}\frac{d\boldsymbol{\omega}_b}{dt} + \boldsymbol{\omega}_b \times \left(\mathbf{I} \cdot \boldsymbol{\omega}_b\right) = \mathbf{M}$$
where $\boldsymbol{\omega}_b = [\omega_x\ \omega_y\ \omega_z]^{\mathrm{T}}$ is the angular velocity of the rocket, $\mathbf{M}$ is the external torque on the rocket, and $\mathbf{I}$ is the inertia matrix of the rocket. Since the torque caused by aerodynamic drag is relatively small compared to the torque generated by the engine, it is neglected in the following discussion. The shape of the rocket is simplified to a cylinder with radius $r$ and height $h$; therefore, the inertia matrix can be expressed as
$$\mathbf{I} = \begin{bmatrix} \frac{1}{2} m r^2 & 0 & 0 \\ 0 & \frac{1}{12} m \left(3 r^2 + h^2\right) & 0 \\ 0 & 0 & \frac{1}{12} m \left(3 r^2 + h^2\right) \end{bmatrix}.$$
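For illustration, the translational part of the model above can be evaluated with the following minimal Python sketch. This is not the simulation code used in this work: gravity is supplied as a precomputed vector (so the J2 terms of the gravity model are left to the caller), the Earth's rotation axis is assumed to be aligned with the z-axis of the frame, and the specific impulse given in seconds (Table 1) is converted into an effective exhaust velocity with the standard gravity g0.

import numpy as np

# Assumed constants for this sketch
OMEGA_E = np.array([0.0, 0.0, 7.2921e-5])   # Earth's rotation rate (rad/s), assumed along the frame z-axis
G0 = 9.80665                                # standard gravity (m/s^2), converts Isp in seconds to exhaust velocity

def translational_derivatives(r, v, m, thrust, drag, g, isp=345.0):
    """Return (r_dot, v_dot, m_dot) for position r, velocity v, mass m,
    thrust and drag force vectors, and gravitational acceleration g."""
    a_e = np.cross(OMEGA_E, np.cross(OMEGA_E, r))   # convected (centripetal) acceleration
    a_k = 2.0 * np.cross(OMEGA_E, v)                # Coriolis acceleration
    r_dot = v
    v_dot = (thrust + drag) / m + g - a_e - a_k
    m_dot = -np.linalg.norm(thrust) / (isp * G0)    # propellant mass-flow rate
    return r_dot, v_dot, m_dot

These derivatives can then be integrated with any standard scheme (e.g., fourth-order Runge-Kutta) together with the attitude dynamics above.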

2.2. SD3 Reinforcement Learning Algorithm

The RL problem can be formulated as a Markov decision process (MDP). We consider a standard RL setup consisting of an agent interacting with an environment in discrete time steps. In the powered landing problem, the agent is the rocket itself. At each time step $t$, the agent receives an observation $s_t$, takes an action $a_t$, and receives a scalar reward $r_t(s_t, a_t)$. The agent's behaviour is defined by a policy $\pi(s_t)$, which outputs the action $a_t$ based on the state $s_t$. The environment changes from one state to another according to the transition probability $p(s_{t+1} \mid s_t, a_t)$. The return $R_t$ is defined as the sum of discounted future rewards with a discount factor $\gamma \in [0, 1]$:
$$R_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r(s_i, a_i)$$
RL aims to learn a policy that maximizes the expected return from the starting distribution, $J = \mathbb{E}_{r_i, s_i \sim E,\, a_i \sim \pi}\left[R_1\right]$. Specifically, in the reusable rocket landing problem, the policy $\pi$ refers to an actor network that outputs thrust control commands based on states such as velocity and position, and the return refers to the cumulative reward given by the environment. During the learning process, a rocket is repeatedly initialized at random and made to land in a fixed landing zone under the control of policy $\pi$, and the policy is updated according to the environment-based rewards. The action-value function describes the expected return after taking an action $a_t$ in state $s_t$ and thereafter following policy $\pi$:
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_{i \ge t},\, s_{i > t} \sim E,\, a_{i > t} \sim \pi}\left[R_t \mid s_t, a_t\right]$$
According to the Bellman equation, the action-value function can be expressed as
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_t,\, s_{t+1} \sim E}\left[r_t(s_t, a_t) + \gamma \max_{a_{t+1}} Q^{\pi}(s_{t+1}, a_{t+1})\right]$$
As the first method to extend DQN (deep Q network) to continuous control spaces, DDPG is one of the standard algorithms for training an off-policy, model-free RL agent. DDPG learns both a policy and a value function using four networks: an actor network $\pi(s \mid \theta^{\pi})$, a target actor network $\pi'(s \mid \theta^{\pi'})$, a Q network $Q(s, a \mid \theta^{Q})$, and a target Q network $Q'(s, a \mid \theta^{Q'})$. The parameterized actor network learns to output the action with the highest value estimate according to the critic network via the policy gradient. The parameters of the target networks are periodically copied from the corresponding online networks to ensure the stability and convergence of training while alleviating the overestimation of Q. As an off-policy algorithm, DDPG uses a replay buffer $D$ to store past experiences and learns from them by random sampling. The Q networks are updated by minimizing the loss $L$:
$$L = \mathbb{E}_{(s, a, r, s') \sim D}\left[\left(Q(s, a \mid \theta^{Q}) - \left(r + \gamma \max_{a'} Q'(s', a' \mid \theta^{Q'})\right)\right)^{2}\right]$$
The actor network is updated by policy gradient:
$$\nabla_{\theta^{\pi}} J \approx \mathbb{E}_{s \sim D}\left[\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{a = \pi(s \mid \theta^{\pi})}\, \nabla_{\theta^{\pi}} \pi(s \mid \theta^{\pi})\right]$$
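For readers who prefer code, the two DDPG updates above can be written compactly as the following PyTorch-style sketch. It is a generic DDPG step rather than the implementation used in this paper; the network and optimizer objects in the signature are assumed to exist, and the maximization over a' in the critic target is replaced by the target actor's action, as is standard in DDPG implementations.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    """One DDPG update from a replay-buffer mini-batch (s, a, r, s', done)."""
    s, a, r, s_next, done = batch

    # Critic: regress Q(s, a) onto the bootstrapped target r + gamma * Q'(s', pi'(s'))
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e., ascend Q(s, pi(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()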
TD3 (Twin-delayed deep deterministic policy gradient) is developed based on DDPG with three main improvements: clipped double Q learning, delayed policy updates and target policy smoothing.
  • Clipped double Q
    As an extension of DQN to continuous control, DDPG inherits the overestimation problem: the estimated Q value function is larger than the actual Q value function. To alleviate this problem, TD3 maintains two Q value networks simultaneously and, in each update, uses the smaller of the two estimates to calculate the target for the value function:
    $$y(r, s', d) = r + \gamma (1 - d) \min_{i = 1, 2} Q_{\phi_{i, \mathrm{targ}}}\big(s', a'(s')\big)$$
  • Delayed policy updates
    Since an unstable value network can lead to blind updates of the policy network, the policy network should be updated at a lower frequency than the value network. In this way, the error of the target networks is reduced before each policy update, avoiding policy updates based on high-error value estimates, which can cause divergent behaviour.
  • Target policy smoothing
    Due to the deterministic policy, the update target of the Q network is more susceptible to the estimation error. TD3 introduces a regularization strategy for deep value learning in calculating the target Q, adding a small amount of random noise to the target policy and averaging over mini-batches.
    $$y = r + \gamma\, Q_{\theta'}\big(s', \pi_{\phi'}(s') + \epsilon\big), \qquad \epsilon \sim \mathrm{clip}\big(\mathcal{N}(0, \sigma), -c, c\big)$$
This update style also leads to improvements in stochastic domains with failure cases, making the estimation of Q more accurate and robust.
In this paper, we use the SD3 algorithm to alleviate the underestimation generated by directly applying the double Q-learning algorithm. A key component in TD3 is the clipped double Q-learning algorithm, which significantly reduces the overestimation but also leads to a large underestimation bias. To handle this problem, SD3 investigates the use of a Softmax operator in updating value functions in actor–critic methods for continuous control. It proposes to estimate the target value for the critic Q i by
$$y_i = r + \gamma\, T_{SD3}(s')$$
where
$$T_{SD3}(s') = \mathrm{softmax}_{\beta}\big(\hat{Q}_i(s', \cdot)\big), \qquad \hat{Q}_i(s', a') = \min\big(Q_i(s', a'; \theta_i),\ Q_i(s', a'; \theta_i')\big)$$
By applying the Boltzmann Softmax operator for value function estimation, SD3 helps to smooth the optimization landscape and reduce the underestimation bias of TD3. The SD3-based rocket recycling algorithm is shown in Algorithm 1.
Algorithm 1 SD3-based rocket recycling.
1: Initialize critic networks $Q_1$, $Q_2$ and actor networks $\pi_1$, $\pi_2$ with random parameters $\theta_1$, $\theta_2$, $\phi_1$, $\phi_2$
2: Initialize target networks $\theta_1' \leftarrow \theta_1$, $\theta_2' \leftarrow \theta_2$, $\phi_1' \leftarrow \phi_1$, $\phi_2' \leftarrow \phi_2$
3: Initialize replay buffer $B$
4: for episode $t = 1$ to $T$ do
5:    Observe a rocket state $s$
6:    Select a thruster action $a$ with exploration noise $\epsilon$
7:    Execute action $a$; observe reward $r$, new state $s'$, and done flag $d$
8:    Store the transition tuple $(s, a, r, s', d)$ in $B$
9:    while it is time for an update do
10:      Sample a mini-batch of $N$ transitions $(s, a, r, s', d)$ from $B$
11:      Sample $K$ noises $\epsilon \sim \mathcal{N}(0, \bar{\sigma})$
12:      $\hat{a}' \leftarrow \pi_i(s'; \phi_i') + \mathrm{clip}(\epsilon, -c, c)$
13:      $\hat{Q}(s', \hat{a}') \leftarrow \min_{j = 1, 2} Q_j(s', \hat{a}'; \theta_j')$
14:      $\mathrm{softmax}_{\beta}\big(\hat{Q}_i(s', \cdot)\big) \leftarrow \mathbb{E}_{\hat{a}' \sim p}\!\left[\exp\!\big(\beta \hat{Q}(s', \hat{a}')\big)\, \hat{Q}(s', \hat{a}') / p(\hat{a}')\right] \Big/\ \mathbb{E}_{\hat{a}' \sim p}\!\left[\exp\!\big(\beta \hat{Q}(s', \hat{a}')\big) / p(\hat{a}')\right]$
15:      $y_i \leftarrow r + \gamma (1 - d)\, \mathrm{softmax}_{\beta}\big(\hat{Q}_i(s', \cdot)\big)$
16:      Update the critic $\theta_i$ by minimizing the Bellman loss $\frac{1}{N}\sum_{s}\big(Q_i(s, a; \theta_i) - y_i\big)^2$
17:      Update the actor $\phi_i$ by the policy gradient $\frac{1}{N}\sum_{s} \nabla_{\phi_i}\pi(s; \phi_i)\, \nabla_a Q_i(s, a; \theta_i)\big|_{a = \pi(s; \phi_i)}$
18:      Update the target networks: $\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\theta_i'$, $\phi_i' \leftarrow \tau \phi_i + (1 - \tau)\phi_i'$
19:    end while
20: end for
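The core of Algorithm 1 is the Monte-Carlo estimate of the softmax operator in steps 11-15. The following PyTorch-style sketch shows one way to compute the critic target $y_i$; it assumes that the critics take (state, action) pairs and that actions are bounded in [-1, 1], and the hyperparameter values in the signature are illustrative defaults rather than the settings used in this work.

import torch

def sd3_target(reward, done, s_next, actor_i_target, critic1_target, critic2_target,
               gamma=0.99, beta=0.05, sigma=0.05, c=0.05, K=50):
    """Softmax critic target of Algorithm 1 (steps 11-15) for a batch of transitions."""
    N = s_next.shape[0]
    with torch.no_grad():
        # Steps 11-12: K noisy actions around the target policy, clipped to the action range
        a = actor_i_target(s_next).unsqueeze(1).expand(N, K, -1)        # (N, K, act_dim)
        noise = (torch.randn_like(a) * sigma).clamp(-c, c)
        a_hat = (a + noise).clamp(-1.0, 1.0)

        # Step 13: clipped double-Q estimate from both target critics
        s_rep = s_next.unsqueeze(1).expand(N, K, -1).reshape(N * K, -1)
        a_flat = a_hat.reshape(N * K, -1)
        q_hat = torch.min(critic1_target(s_rep, a_flat),
                          critic2_target(s_rep, a_flat)).reshape(N, K)

        # Step 14: self-normalized importance-sampling estimate of softmax_beta(Q_hat),
        # using the (unnormalized) Gaussian density of the sampled noise as the proposal;
        # the effect of clipping on that density is ignored in this sketch
        log_p = -0.5 * (noise ** 2).sum(dim=-1) / sigma ** 2             # (N, K)
        logits = beta * q_hat - log_p
        logits = logits - logits.max(dim=1, keepdim=True).values         # numerical stability
        w = torch.exp(logits)
        softmax_q = (w * q_hat).sum(dim=1, keepdim=True) / w.sum(dim=1, keepdim=True)

        # Step 15: bootstrapped critic target
        return reward + gamma * (1.0 - done) * softmax_q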

2.3. Reinforcement Learning for Rocket Landing

In this section, we set up RL for reusable rocket landing, including action space, observation space, reward function, and network structure.

2.3.1. Action Space

The rocket is equipped with an engine with a deflectable nozzle, as shown in Figure 1. The angle between $\mathbf{T}$ and the plane $O_b x_b z_b$ is defined as $\delta_y$, and the angle between the projection of $\mathbf{T}$ on the plane $O_b x_b z_b$ and the opposite direction of $O_b x_b$ is defined as $\delta_z$. Therefore, the control variables of the rocket are $\|\mathbf{T}\|$, $\delta_y$, and $\delta_z$.
To simplify the outputs of the agent for better numerical stability, the control variables are constrained to $[-1, 1]$. The actual actions taken by the rocket can then be defined as $[a_1\ a_2\ a_3]$, and their relations with the original control variables are
$$\|\mathbf{T}\| = \frac{T_{\max} - T_{\min}}{2} a_1 + \frac{T_{\max} + T_{\min}}{2}, \qquad \delta_y = \delta_{\max} a_2, \qquad \delta_z = \delta_{\max} a_3$$
The thrust vector in the body coordinate system T b can be expressed as
$$\mathbf{T}_b = \begin{bmatrix} \|\mathbf{T}\| \cos\delta_y \cos\delta_z \\ \|\mathbf{T}\| \sin\delta_y \\ \|\mathbf{T}\| \cos\delta_y \sin\delta_z \end{bmatrix}$$
Since the torque caused by aerodynamic drag is relatively small, only the torque generated by thrust is considered in this paper. The shape of the rocket is defined as a cylinder with radius $r$ and height $h$; the torque $\mathbf{M}$ can then be expressed as
$$\mathbf{M} = \begin{bmatrix} 0 \\ \frac{1}{2} h \|\mathbf{T}\| \sin\delta_y \\ \frac{1}{2} h \|\mathbf{T}\| \cos\delta_y \sin\delta_z \end{bmatrix}$$
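As a concrete illustration, the mapping from the normalized agent outputs to the physical controls and to the body-frame thrust and torque can be sketched as follows. The thrust bounds and rocket height are taken from Table 1, while the maximum nozzle deflection DELTA_MAX is an assumed placeholder, since the value of δ_max is not specified above.

import numpy as np

T_MIN, T_MAX = 250_000.0, 750_000.0   # thrust bounds (N), Table 1
DELTA_MAX = np.deg2rad(10.0)          # assumed maximum nozzle deflection (placeholder value)
H_ROCKET = 40.0                       # rocket height (m), Table 1

def action_to_controls(a):
    """Map normalized actions [a1, a2, a3] in [-1, 1] to thrust and nozzle angles."""
    a1, a2, a3 = np.clip(a, -1.0, 1.0)
    thrust = 0.5 * (T_MAX - T_MIN) * a1 + 0.5 * (T_MAX + T_MIN)
    delta_y = DELTA_MAX * a2
    delta_z = DELTA_MAX * a3
    return thrust, delta_y, delta_z

def thrust_and_torque_body(thrust, delta_y, delta_z):
    """Thrust vector and engine torque in the body frame, following the equations above."""
    T_b = thrust * np.array([np.cos(delta_y) * np.cos(delta_z),
                             np.sin(delta_y),
                             np.cos(delta_y) * np.sin(delta_z)])
    M_b = np.array([0.0,
                    0.5 * H_ROCKET * thrust * np.sin(delta_y),
                    0.5 * H_ROCKET * thrust * np.cos(delta_y) * np.sin(delta_z)])
    return T_b, M_b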

2.3.2. Observation Space

To make sure the agent receives enough information about the current states to make decisions during the reusable rocket landing process, we carefully design the information the agent can obtain from the environment, including normalized position information, velocity information, attitude angle information, and control information. Note that all the actions in the observation are those of the previous time step. Besides traditional velocity information, we introduce an extra observation, i.e., the angle $\theta_v$ between the rocket's velocity and a reference direction, to better stabilize the landing control process. The reference direction is defined to point from the rocket's centre of mass to the ideal landing position. All angles except the roll angle are normalized by sine functions, whereas the roll angle requires both sine and cosine functions to cover $[0^\circ, 360^\circ)$. The observation space at time $t$ can be expressed as
$$s_t = \left[\frac{r_x}{\|\mathbf{r}\|},\ \frac{r_y}{\|\mathbf{r}\|},\ \frac{r_z}{\|\mathbf{r}\|},\ \frac{v_x}{\|\mathbf{v}\|},\ \frac{v_y}{\|\mathbf{v}\|},\ \frac{v_z}{\|\mathbf{v}\|},\ \sin\theta_v,\ \sin(\phi + 90^\circ),\ \sin\psi,\ \cos\psi,\ \sin\gamma,\ a_1,\ a_2,\ a_3\right]_t$$
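A minimal sketch of how this 14-dimensional observation can be assembled is given below; it assumes the angles are supplied in degrees and that the previous action is stored by the environment, and it is not the preprocessing code used in this work.

import numpy as np

def build_observation(r, v, phi, psi, gamma, theta_v, prev_action):
    """Observation vector of Section 2.3.2; r and v are 3-vectors in the launch point frame."""
    r_norm = np.linalg.norm(r) + 1e-8   # small offset avoids division by zero near touchdown
    v_norm = np.linalg.norm(v) + 1e-8
    d = np.deg2rad
    return np.concatenate([
        r / r_norm,                          # normalized position
        v / v_norm,                          # normalized velocity
        [np.sin(d(theta_v)),                 # angle to the reference direction
         np.sin(d(phi + 90.0)),              # pitch relative to the -90 deg reference
         np.sin(d(psi)), np.cos(d(psi)),     # yaw terms, as in the observation above
         np.sin(d(gamma))],                  # roll term
        np.asarray(prev_action),             # [a1, a2, a3] from the previous step
    ]).astype(np.float32)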

2.3.3. Reward Function

The goal of the agent is to maximize the received reward, and a well-designed reward function can guide the agent to learn more robustly. We design a reward function comprising a successful landing reward, a failed landing penalty, a fuel consumption penalty, a location-based reward, a velocity-based reward, and an angle-based reward; a sketch of how these terms combine is given after the list below.
  • Reward for successful landing: The agent receives a reward when the rocket successfully lands at the target location. The conditions for a successful landing, denoted $C_1$, include $|x| \le x_{lim}$, $0 \le y \le y_{lim}$, $|z| \le z_{lim}$, $v_v < v_{v,lim}$, $v_h < v_{h,lim}$, $|\phi + 90^\circ| \le \phi_{lim}$, $|\psi| \le \psi_{lim}$, and $|\gamma| \le \gamma_{lim}$. The reward can be defined as follows:
    $$R_{success} = \begin{cases} \lambda_{success}, & C_1 \text{ satisfied} \\ 0, & C_1 \text{ not satisfied} \end{cases}$$
    where $\lambda_{success}$ is a positive constant hyperparameter, $x_{lim} = z_{lim} = 10$ m are the length and width of the landing zone, $y_{lim} = 0.5$ m is the upper bound of the landing height, $v_{v,lim} = 2$ m/s and $v_{h,lim} = 1$ m/s are the upper bounds of the vertical and horizontal velocity, respectively, and $\phi_{lim} = \psi_{lim} = \gamma_{lim} = 10^\circ$ are the upper bounds of the rocket's landing Euler angles. Although there is no need to limit $\psi$, it is limited in the simulation.
  • Penalty for failed landing: The agent receives a penalty if the landing fails, for example due to an excessive attitude angle or deviation from the landing zone, and the simulation is terminated. A failed landing is the exact opposite of a successful landing. The penalty can be expressed as:
    $$R_{fail} = \begin{cases} -\lambda_{fail}, & C_1 \text{ not satisfied} \\ 0, & C_1 \text{ satisfied} \end{cases}$$
  • Penalty for fuel consumption: The landing performance is evaluated by fuel consumption. The penalty aims to minimize fuel consumption and is defined by
    $$R_{fuel} = -\lambda_{fuel} \frac{\|\mathbf{T}\|}{I_{sp}}$$
  • Reward for accurate landing: The ideal landing location is set to be r = [ 0 , 0 , 0 ] . In order to encourage the rocket to land as close to the ideal landing location as possible, we construct the landing position reward function in the form of a Gaussian distribution,
    $$R_i = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{r_i^2}{2\sigma_i^2}\right), \qquad i = x, y, z$$
    where $\sigma_i$ determines the scope of the reward given; the closer the rocket is to the ideal landing location, the larger the reward it obtains. The total position reward is:
    $$R_{pos} = \frac{1}{3}\lambda_{pos} \sum_{i = x, y, z} R_i$$
    where λ p o s is a hyperparameter to adjust the size of the reward.
  • Reward for velocity direction: We construct a reward for the velocity direction to stabilize the translational flight of the rocket. The reference direction is defined to point from the rocket's centre of mass to the ideal landing location, and the angle $\theta_v$ between the current velocity and the reference direction is used as guidance for learning. The smaller $\theta_v$ is, the larger the reward the rocket obtains. The reward can be described as:
    $$R_{vel} = \lambda_{vel} \cos\theta_v$$
  • Reward for Euler angles: Similar to the reward for velocity direction, we want to stabilize the attitude motion of the rocket during landing and therefore drive the Euler angles of the rocket as close as possible to the predefined reference attitude $\phi = -90^\circ$, $\psi = 0^\circ$, and $\gamma = 0^\circ$. The reward functions can be expressed as:
    $$R_{\phi} = \frac{1}{\sqrt{2\pi}\,\sigma_{\phi}} \exp\left(-\frac{(\phi + 90^\circ)^2}{2\sigma_{\phi}^2}\right)$$
    $$R_{\psi} = \frac{1}{\sqrt{2\pi}\,\sigma_{\psi}} \exp\left(-\frac{\psi^2}{2\sigma_{\psi}^2}\right)$$
    $$R_{\gamma} = \frac{1}{\sqrt{2\pi}\,\sigma_{\gamma}} \exp\left(-\frac{\gamma^2}{2\sigma_{\gamma}^2}\right)$$
    where σ ϕ , σ ψ and σ γ limit the range of rewards given, and the total reward is:
    $$R_{angle} = \frac{1}{3}\lambda_{angle}\left(R_{\phi} + R_{\psi} + R_{\gamma}\right)$$
    where λ a n g l e is a hyperparameter to adjust the size of the reward.
    Finally, the overall reward function is given in the form of a summation
    $$R = R_{success} + R_{fail} + R_{fuel} + R_{pos} + R_{vel} + R_{angle}$$
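To make the combination of these terms concrete, the following Python sketch evaluates the overall reward for one time step. The λ and σ values in the signature are illustrative placeholders rather than the hyperparameters used in the paper, and the success/failure flags are assumed to be provided by the environment's termination logic.

import numpy as np

def gaussian_reward(x, sigma):
    """Gaussian-shaped reward: largest when x equals the reference value 0."""
    return np.exp(-x ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def step_reward(r, cos_theta_v, phi, psi, gamma, thrust, landed_ok, crashed,
                lam=dict(success=100.0, fail=100.0, fuel=1e-7, pos=1.0, vel=1.0, angle=1.0),
                sigma=dict(x=20.0, y=20.0, z=20.0, phi=10.0, psi=10.0, gamma=10.0),
                isp=345.0):
    """Overall reward of Section 2.3.3 (angles in degrees, positions in metres)."""
    R = 0.0
    if landed_ok:                                              # R_success
        R += lam['success']
    if crashed:                                                # R_fail
        R -= lam['fail']
    R -= lam['fuel'] * thrust / isp                            # R_fuel
    R += lam['pos'] / 3.0 * sum(gaussian_reward(r[i], sigma[k])
                                for i, k in enumerate(('x', 'y', 'z')))   # R_pos
    R += lam['vel'] * cos_theta_v                              # R_vel
    R += lam['angle'] / 3.0 * (gaussian_reward(phi + 90.0, sigma['phi'])
                               + gaussian_reward(psi, sigma['psi'])
                               + gaussian_reward(gamma, sigma['gamma']))  # R_angle
    return R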

2.3.4. Network Architecture

We apply the widely used actor–critic framework to construct the RL algorithm. The agent contains two types of networks: the actor network and the critic network. The actor network takes state s t as the input and outputs the action a t . The critic network takes the concatenation of state s t and action a t , and outputs the value Q ( s t , a t ) . The network architecture is shown in Figure 2.
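A minimal PyTorch sketch of such an actor-critic pair is given below for the 14-dimensional observation and 3-dimensional action of this problem; the hidden-layer sizes and activations are assumptions, and Figure 2 shows the architecture actually used.

import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, HIDDEN = 14, 3, 256   # HIDDEN is an assumed width

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, ACT_DIM), nn.Tanh(),   # actions in [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1),                    # scalar Q(s, a)
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))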

3. Results

In this section, we will demonstrate the simulation used to verify our proposed method and present the results.

3.1. Environmental Settings

The values of the simulation parameters are listed in Table 1. The overall training process is cut into episodes. In each episode, the position r x , r y , r z , velocity v x , v y , v z , and attitude angle ϕ , ψ , γ of the rocket are initialized randomly according to Table 2. The hyperparameters of the SD3 RL are shown in Table 3.

3.2. Training Results

We apply DDPG, TD3, and SD3 separately to train our agents and compare their landing control performance. Their reward curves are shown in Figure 3.
We can see from the figure that the rewards received by all three algorithms start to rise at around 3000 episodes, indicating a significant improvement in the agent's control ability. All three reward curves converge after 6000 episodes of training, meaning that all three algorithms reach a sufficient control ability. Among the three algorithms, SD3 gained the highest reward, reaching around 175.
To validate the effectiveness of the trained agents, we further select 100 initial simulation cases randomly according to Table 1 and Table 2 to conduct 100 landing tests. The landing trajectories are shown in Figure 4. According to the figure, all three algorithms are capable of guiding the rocket to approach the landing zone. However, when controlled by DDPG, the rocket failed to reach the ground successfully. When controlled by TD3, the rocket performed abnormal turning and lateral movement during the final landing stage. In comparison, the rocket controlled by SD3 managed to reach the landing zone vertically, smoothly, and accurately.
The comparison of landing points is shown in Figure 5. The mean and standard deviation of the three sets of data are given in Table 4. The DDPG agent failed to guide the rocket to land on the ground; the points shown for DDPG are the vertical projections of the rocket's final position onto the ground surface. Although the rockets controlled by the TD3 agent successfully landed on the ground, all their landing points are out of the landing zone. The landing points under SD3 control achieved the best results compared to DDPG and TD3, with no landing points located outside the landing zone. The comparison of the reward function components of the three methods is listed in Table 5. The results validate the effectiveness of the proposed SD3-based method.
Figure 6, Figure 7 and Figure 8 show the detailed landing status and control commands of the 26th trajectory controlled by the three algorithms, respectively.

4. Discussion

In RL, an optimal policy is linked to the optimal value function. When the reward function is fixed, the obtained landing policy is directly related to the estimation of the value function.
The DDPG method bootstraps its targets from a single critic using the Bellman equation, which tends to overestimate the value function. Under this circumstance, the agent is more likely to explore areas whose actual rewards are low, making the training process unstable.
The TD3 method uses two value networks with the same structure but different learned parameters to estimate the value function. It takes the smaller value as the estimation result. This operation can avoid overestimation of the value function, but introduces the problem of underestimation. Therefore, although the training process of the TD3 algorithm is more stable compared to DDPG, the convergence speed of the algorithm is slower and the algorithm converges easily to areas with lower rewards, making the agents land poorly.
In comparison, the SD3 method applies the Softmax operator to the action-value function, keeping the error between the estimated and true values within a controllable range. A better estimation of the action-value function leads to a better landing performance.

5. Conclusions

In this paper, an SD3-based RL method with a novel reward function is proposed to solve the reusable rocket landing control problem. Benefiting from the more accurate estimation of the action-value function, the proposed method outperforms the widely used DDPG and TD3 algorithms, and its effectiveness has been demonstrated through numerical simulations conducted on 100 randomly selected cases.

Author Contributions

Conceptualization, Y.L. and H.L.; methodology, H.L.; software, W.L.; validation, W.L., X.Z. and H.L.; formal analysis, X.Z.; investigation, W.L. and X.Z.; resources, Y.D.; data curation, X.Z.; writing—original draft preparation, W.L.; writing—review and editing, X.Z.; visualization, H.L.; supervision, Y.D. and Y.L.; project administration, H.L.; funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Guangdong Basic and Applied Basic Research Foundation grant number 2022A1515110837.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DDPG  Deep deterministic policy gradient
TD3   Twin-delayed deep deterministic policy gradient
SD3   Softmax double deep deterministic policy gradient

Appendix A

Four coordinate systems are used in this work, namely the geocentric equatorial coordinate system, the launch point coordinate system, the body coordinate system of the rocket and the velocity coordinate system of the rocket. The relationships between them are described below.
The geocentric equatorial coordinate system O e x e y e z e : The origin O e is located at the centre of the Earth, the x-axis points to the intersection of the prime meridian and the equatorial plane, the z-axis points to the direction of the Earth’s North pole, and the y-axis direction satisfies the requirements of the right-handed coordinate system.
The launch point coordinate system $O_o x_o y_o z_o$: The origin $O_o$ is at the launch point of the rocket, the Earth is considered an ellipsoid, the y-axis coincides with the principal normal of the ellipsoidal surface at the origin, and the x-axis points in the launch direction. In this paper, since the study concerns reusable rocket landing, the landing target point is set as the origin of the launch point coordinate system for convenience of description. The angle between the y-axis and the equatorial plane is denoted as the latitude $B$, and the longitude of the launch point is denoted as $\lambda$. In the horizontal plane, the angle between the x-axis and the direction of North along the local meridian tangent is the launch azimuth, denoted as $A$.
The body coordinate system of the rocket $O_b x_b y_b z_b$: The origin $O_b$ is at the rocket's centre of mass, and the x-axis is along the longitudinal axis of the rocket, pointing in the direction of the tail of the rocket. The z-axis is perpendicular to the main symmetry plane of the rocket and points to the right when viewed along the direction of motion. This coordinate system is used to describe the rocket's attitude.
The velocity coordinate system of the rocket $O_v x_v y_v z_v$: The origin $O_v$ is at the rocket's centre of mass, the x-axis is along the direction of the rocket's velocity, and the z-axis is perpendicular to the main plane of symmetry of the rocket and points to the right when viewed along the direction of motion. This coordinate system is used for the calculation of aerodynamic drag.
Figure A1. The relationships between the coordinate systems. (a) The rotation from the geocentric equatorial system $O_e x_e y_e z_e$ to the launch point coordinate system $O_o x_o y_o z_o$. (b) The rotation from the launch point system $O_o x_o y_o z_o$ to the body coordinate system $O_b x_b y_b z_b$. (c) The rotation from the body coordinate system $O_b x_b y_b z_b$ to the velocity coordinate system $O_v x_v y_v z_v$.
The relationships between the coordinate systems can be described using rotation matrices. The geocentric equatorial coordinate system is rotated by $-(90^\circ - \lambda)$ around the z-axis, then by $B$ around the x-axis, and finally by $-(90^\circ + A)$ around the y-axis to coincide with the launch point coordinate system. The rotation matrix $C_{eo}$ can be written as
$$C_{eo} = L_y\big(-(90^\circ + A)\big)\, L_x(B)\, L_z\big(-(90^\circ - \lambda)\big)$$
The rotation from the launch point coordinate system $O_o x_o y_o z_o$ to the body coordinate system $O_b x_b y_b z_b$ is defined by three angles: the pitch angle $\phi$, the yaw angle $\psi$, and the roll angle $\gamma$. We define the angle between the projection of $O_b x_b$ onto the $O_o x_o y_o$ plane and $O_o x_o$ as $\phi$, the angle between $O_b x_b$ and the $O_o x_o y_o$ plane as $\psi$, and the rocket's angle of rotation around $O_b x_b$ as $\gamma$. The launch point coordinate system is rotated by $\phi$ around the z-axis, then by $\psi$ around the y-axis, and finally by $\gamma$ around the x-axis to coincide with the body coordinate system. The rotation matrix $C_{ob}$ can be written as
$$C_{ob} = L_x(\gamma)\, L_y(\psi)\, L_z(\phi)$$
The rotation from the body coordinate system $O_b x_b y_b z_b$ to the velocity coordinate system $O_v x_v y_v z_v$ is defined by two angles: the angle of attack $\alpha$ and the angle of sideslip $\beta$. We define the angle between the projection of $O_v x_v$ onto the $O_b x_b y_b$ plane and $O_b x_b$ as $\alpha$, and the angle between $O_v x_v$ and the $O_b x_b y_b$ plane as $\beta$. The body coordinate system is rotated by $\alpha$ around the z-axis and then by $\beta$ around the y-axis to coincide with the velocity coordinate system. The rotation matrix $C_{bv}$ can be written as
$$C_{bv} = L_y(\beta)\, L_z(\alpha)$$
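For reference, the elementary rotation matrices and the composed transformations above can be written as the following sketch; the frame-transformation (passive rotation) convention used for L_x, L_y, and L_z here is an assumption, since the paper does not spell out their element-wise form.

import numpy as np

def L_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]])

def L_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]])

def L_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])

def C_ob(phi, psi, gamma):
    """Launch point frame -> body frame (angles in radians)."""
    return L_x(gamma) @ L_y(psi) @ L_z(phi)

def C_bv(alpha, beta):
    """Body frame -> velocity frame."""
    return L_y(beta) @ L_z(alpha)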

Appendix B

In this paper, we decompose the aerodynamic drag $\mathbf{D}$ into components along the x-, y-, and z-axes of the body coordinate system as
$$\mathbf{D} = \mathbf{X}_1 + \mathbf{Y}_1 + \mathbf{Z}_1$$
where $\mathbf{X}_1$, $\mathbf{Y}_1$, and $\mathbf{Z}_1$ are along the x-axis, y-axis, and z-axis of the body coordinate system, respectively, and can be described as
$$X_1 = \frac{1}{2} C_x \rho v^2 S_{ref} = C_x q S_{ref}, \qquad Y_1 = \frac{1}{2} C_y \rho v^2 S_{ref} = C_y q S_{ref}, \qquad Z_1 = \frac{1}{2} C_z \rho v^2 S_{ref} = C_z q S_{ref}$$
where $\rho$ is the atmospheric density, $v$ is the rocket's velocity relative to the atmosphere, $S_{ref}$ is the reference area of the rocket, $q = \frac{1}{2}\rho v^2$ is the dynamic pressure, and $C_x$, $C_y$, and $C_z$ are the axial force coefficient, normal force coefficient, and lateral force coefficient, respectively. We define the angle between $O_b x_b$ and $O_v x_v$ as the total angle of attack $\eta$, which can be expressed as
$$\cos\eta = \cos\alpha \cos\beta$$
Then the aerodynamic force coefficients $C_x$, $C_y$, and $C_z$ can be expressed as
$$C_x = C_A + C_A^{\eta^2}\eta^2, \qquad C_y = C_N^{\eta}\,\eta\,\frac{\cos\beta \sin\alpha}{\sin\eta}, \qquad C_z = C_N^{\eta}\,\eta\,\frac{\sin\beta}{\sin\eta}$$
and when $\eta = 0$,
$$C_x = C_A + C_A^{\eta^2}\eta^2, \qquad C_y = 0, \qquad C_z = 0$$
where $C_A$, $C_A^{\eta^2}$, and $C_N^{\eta}$ are aerodynamic parameters.
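A small sketch of this drag model is given below; it assumes the aerodynamic parameters $C_A$, $C_A^{\eta^2}$, and $C_N^{\eta}$ are supplied by the caller (in general they depend on the Mach number) and returns the body-frame force components.

import numpy as np

def body_frame_aero_forces(alpha, beta, rho, v, s_ref, c_a, c_a_eta2, c_n_eta):
    """Body-frame aerodynamic force components [X1, Y1, Z1] of Appendix B (angles in radians)."""
    q = 0.5 * rho * v ** 2                                               # dynamic pressure
    eta = np.arccos(np.clip(np.cos(alpha) * np.cos(beta), -1.0, 1.0))    # total angle of attack
    c_x = c_a + c_a_eta2 * eta ** 2
    if np.isclose(eta, 0.0):
        c_y, c_z = 0.0, 0.0                                              # limits at eta = 0
    else:
        c_y = c_n_eta * eta * np.cos(beta) * np.sin(alpha) / np.sin(eta)
        c_z = c_n_eta * eta * np.sin(beta) / np.sin(eta)
    return np.array([c_x, c_y, c_z]) * q * s_ref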

References

1. Cui, N.; Wu, R.; Wei, C.; Xu, D.; Zhang, L. Development and Key Technologies of Vertical Takeoff Vertical Landing Reusable Launch Vehicle. Astronaut. Syst. Eng. Technol. 2018, 2, 27–42.
2. Jo, B.U.; Ahn, J. Optimal staging of reusable launch vehicles for minimum life cycle cost. Aerosp. Sci. Technol. 2022, 127, 107703.
3. Jones, H. The recent large reduction in space launch cost. In Proceedings of the 48th International Conference on Environmental Systems, Albuquerque, NM, USA, 8–12 July 2018.
4. Sivan, K.; Pandian, S. An overview of reusable launch vehicle technology demonstrator. Curr. Sci. 2018, 114, 38–47.
5. de Mirand, A.P.; Bahu, J.M.; Louaas, E. Ariane Next, a vision for a reusable cost efficient European rocket. In Proceedings of the 8th European Conference for Aeronautics and Space Sciences (EUCASS), Madrid, Spain, 1–4 July 2019.
6. Meyerson, R.; Taylor, A. A status report on the development of the Kistler Aerospace K-1 Reusable Launch Vehicle. In Proceedings of the 16th AIAA Aerodynamic Decelerator Systems Technology Conference and Seminar, Boston, MA, USA, 21–24 May 2001; p. 2069.
7. Gardinier, D.; Taylor, A. Design and testing of the K-1 reusable launch vehicle landing system airbags. In Proceedings of the 15th Aerodynamic Decelerator Systems Technology Conference, Toulouse, France, 9–11 June 1999; p. 1757.
8. King, R.; Hengel, J.; Wolf, D. Ares I First Stage Booster Deceleration System—An Overview. In Proceedings of the 20th AIAA Aerodynamic Decelerator Systems Technology Conference and Seminar, Seattle, WA, USA, 4–7 May 2009; p. 2984.
9. Davis, S. Ares I-X Flight Test: The Future Begins Here. In Proceedings of the AIAA SPACE 2008 Conference & Exposition, San Diego, CA, USA, 9–11 September 2008; p. 7806.
10. Açıkmeşe, B.; Carson, J.M.; Blackmore, L. Lossless convexification of nonconvex control bound and pointing constraints of the soft landing optimal control problem. IEEE Trans. Control Syst. Technol. 2013, 21, 2104–2113.
11. Blackmore, L. Autonomous precision landing of space rockets. In Frontiers of Engineering: Reports on Leading-Edge Engineering from the 2016 Symposium; The Bridge: Washington, DC, USA, 2016; Volume 46, pp. 15–20.
12. Blackmore, L.; Açikmeşe, B.; Scharf, D.P. Minimum-landing-error powered-descent guidance for Mars landing using convex optimization. J. Guid. Control Dyn. 2010, 33, 1161–1171.
13. Mattingley, J.; Boyd, S. CVXGEN: A code generator for embedded convex optimization. Optim. Eng. 2012, 13, 1–27.
14. Sagliano, M.; Mooij, E.; Theil, S. Onboard trajectory generation for entry vehicles via adaptive multivariate pseudospectral interpolation. J. Guid. Control Dyn. 2016, 40, 466–476.
15. Seelbinder, D.; Büskens, C. Real-time atmospheric entry trajectory computation using parametric sensitivities. In Proceedings of the 6th International Conference on Astrodynamics Tools and Techniques, Darmstadt, Germany, 14–17 March 2016.
16. Wang, J.; Cui, N.; Wei, C. Optimal rocket landing guidance using convex optimization and model predictive control. J. Guid. Control Dyn. 2019, 42, 1078–1092.
17. Cheng, L.; Wang, Z.; Jiang, F.; Li, J. Fast generation of optimal asteroid landing trajectories using deep neural networks. IEEE Trans. Aerosp. Electron. Syst. 2019, 56, 2642–2655.
18. Song, Y.; Miao, X.; Cheng, L.; Gong, S. The feasibility criterion of fuel-optimal planetary landing using neural networks. Aerosp. Sci. Technol. 2021, 116, 106860.
19. Furfaro, R.; Bloise, I.; Orlandelli, M.; Di Lizia, P.; Topputo, F.; Linares, R. Deep learning for autonomous lunar landing. Adv. Astronaut. Sci. 2018, 167, 3285–3306.
20. Furfaro, R.; Bloise, I.; Orlandelli, M.; Di Lizia, P.; Topputo, F.; Linares, R. A recurrent deep architecture for quasi-optimal feedback guidance in planetary landing. In Proceedings of the IAA SciTech Forum on Space Flight Mechanics and Space Structures and Materials, Moscow, Russia, 13–15 November 2018; pp. 1–24.
21. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of Go without human knowledge. Nature 2017, 550, 354–359.
22. Ye, D.; Liu, Z.; Sun, M.; Shi, B.; Zhao, P.; Wu, H.; Yu, H.; Yang, S.; Wu, X.; Guo, Q.; et al. Mastering complex control in MOBA games with deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 6672–6679.
23. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354.
24. Soleymani, F.; Paquet, E. Financial portfolio optimization with online deep reinforcement learning and restricted stacked autoencoder—DeepBreath. Expert Syst. Appl. 2020, 156, 113456.
25. Palanisamy, P. Multi-agent connected autonomous driving using deep reinforcement learning. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–7.
26. Yu, C.; Wang, X.; Xu, X.; Zhang, M.; Ge, H.; Ren, J.; Sun, L.; Chen, B.; Tan, G. Distributed multiagent coordinated learning for autonomous driving in highways based on dynamic coordination graphs. IEEE Trans. Intell. Transp. Syst. 2019, 21, 735–748.
27. Ciabatti, G.; Daftry, S.; Capobianco, R. Autonomous planetary landing via deep reinforcement learning and transfer learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2031–2038.
28. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
29. Cheng, L.; Wang, Z.; Jiang, F. Real-time control for fuel-optimal Moon landing based on an interactive deep reinforcement learning algorithm. Astrodynamics 2019, 3, 375–386.
30. Furfaro, R.; Scorsoglio, A.; Linares, R.; Massari, M. Adaptive generalized ZEM-ZEV feedback guidance for planetary landing via a deep reinforcement learning approach. Acta Astronaut. 2020, 171, 156–171.
31. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 387–395.
32. Gaudet, B.; Linares, R.; Furfaro, R. Deep reinforcement learning for six degree-of-freedom planetary landing. Adv. Space Res. 2020, 65, 1723–1741.
33. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602.
Figure 1. The direction of the thrust vector $\mathbf{T}$ in the body coordinate system is determined by $\delta_y$ and $\delta_z$.
Figure 2. Network framework and architecture.
Figure 3. The reward received during training.
Figure 4. Comparison of the landing trajectories. The view on the right of each graph is the enlarged view of the final landing stage.
Figure 5. Comparison of landing points. The grey zone is the desired 10 m × 10 m landing zone. The red points lie outside of the landing zone; the blue points are within the landing zone.
Figure 6. Comparison of the landing velocities.
Figure 7. Comparison of the rocket Euler angles.
Figure 8. Comparison of the landing control commands.
Table 1. Parameters of the simulation environment.
Parameters | Values
Launch azimuth A | 190°
Latitude of launch point B | 41° N
Longitude of launch point λ | 100° E
Mass of rocket m | 35,000 kg
Radius of rocket r | 3.7 m
Height of rocket h | 40 m
Minimum thrust T_min | 250,000 N
Maximum thrust T_max | 750,000 N
Reference area of rocket S_ref | 10.75 m²
Specific impulse I_sp | 345 s
Landing zone area | 10 m × 10 m
Acceptable vertical landing velocity | [−2.5, 0] m/s
Acceptable translational landing velocity | [0, 1] m/s
Table 2. Value ranges of the initial status of the rocket.
Initial States of the Rocket | Value Ranges
X-axis component of the initial position r_x | [−500, 500] m
Y-axis component of the initial position r_y | 2500 m
Z-axis component of the initial position r_z | [−500, 500] m
X-axis component of the initial velocity v_x | [−5, 5] m/s
Y-axis component of the initial velocity v_y | −230 m/s
Z-axis component of the initial velocity v_z | [−5, 5] m/s
Initial pitch angle ϕ | [−95°, −85°]
Initial yaw angle ψ | [−5°, 5°]
Initial roll angle γ | 0°
Table 3. Hyperparameters of the SD3 RL algorithm.
Parameters | Values
Size of replay buffer |B| | 10^6
Variance of policy noise σ̄ | 0.05
Maximum noise c | 0.05
Discount factor γ | 0.99
Learning rate of actor networks | 5 × 10^−5
Learning rate of critic networks | 5 × 10^−5
Table 4. Comparison of the landing points.
Landing Point Coordinates | DDPG | TD3 | SD3
Average | (5.534, 16.955) m | (5.795, 9.316) m | (1.074, 5.133) m
Standard deviation | (11.563, 17.219) m | (8.746, 12.064) m | (4.173, 2.987) m
Best | (1.322, 1.549) m | (2.316, 0.248) m | (0.217, 0.534) m
Worst | (13.712, 46.027) m | (23.674, 25.946) m | (2.387, 9.976) m
Table 5. Comparison of the reward function components for the different algorithms.
Reward Function Components | DDPG | TD3 | SD3
R_fuel | 0.147 | 0.142 | 0.118
R_pos | 0.318 | 0.348 | 0.368
R_vel | 0.441 | 0.454 | 0.469
R_angle | 0.131 | 0.135 | 0.138

Share and Cite

MDPI and ACS Style

Li, W.; Zhang, X.; Dong, Y.; Lin, Y.; Li, H. Powered Landing Control of Reusable Rockets Based on Softmax Double DDPG. Aerospace 2023, 10, 590. https://doi.org/10.3390/aerospace10070590
