1. Introduction
In actual combat, to maximize intercept effectiveness, the interceptor system seeks to minimize the response time from the detection system's acquisition of the target to the launch of the interceptor. Moreover, as the longest phase of the combined guidance process, the trajectory design of the midcourse guidance section largely determines the terminal intercept capability of the interceptor and is a key factor in the success or failure of the intercept mission. At the same time, trajectory planning in the midcourse guidance phase is a complex non-linear problem with multiple constraints: physical constraints such as dynamic pressure, overload, and heat flux density must be considered throughout the flight while the terminal constraints are satisfied. This problem therefore places high demands on the accuracy and efficiency of the midcourse guidance trajectory optimization algorithm.
In addition to satisfying certain constraints, trajectory planning also requires finding a particular trajectory from the initial position to the target position that satisfies given performance indices [1]. A classical approach to such problems is the indirect method. In Ref [2], the analytical solution of the shooting equation for the on-line trajectory optimization problem of planetary landing is derived, and the indirect method is improved in combination with homotopy theory to achieve propellant optimality. Grant and Braun [3] design a fast trajectory optimization method by combining indirect optimization, continuation, and symbolic manipulation theories. Shen et al. [4] propose an optimization method combining the indirect method and a homotopy approach for solving impulsive trajectories. To address the difficulty of selecting initial values of the co-state variables in indirect methods, Lee et al. [5] propose a non-functional approximation/extrapolation initial guess structure for specific energy targeting problems. Ren et al. [6] simplify the dynamic model and, based on it, design a computationally efficient initial guess generator, which allows the initial values to be obtained by analytically solving a linear system of equations. For trajectory optimization, the indirect method not only offers high computational accuracy but also theoretically proves the optimality of the optimized trajectory. However, the difficulty of selecting initial values for the co-state variables has not been fully resolved, which severely limits the development and application of the indirect method.
The principle of the direct method differs from that of the indirect method: it discretizes the problem, which simplifies the solution procedure and avoids the sensitivity to initial co-state guesses in indirect methods. It has been widely applied to optimal trajectory problems [7,8,9,10]. Especially in recent years, with the significant improvement in scientific computing hardware, the pseudospectral method, a typical collocation technique among direct methods, has been widely applied to trajectory planning problems, and its theory has been improved and extended [11,12]. Zhang et al. [13] design a multi-objective globally optimal homing trajectory for a parafoil based on the Gauss pseudospectral method. Zhang et al. [14] propose an improved Radau pseudospectral method combined with deep neural networks to solve the orbital pursuit–evasion game problem. Li et al. [15] propose a hybrid optimization method based on the conjugate gradient method and pseudospectral collocation for optimal trajectory planning during rocket landing. However, pseudospectral methods struggle to achieve both high efficiency and high solution accuracy, which is a major obstacle to their application in practical engineering.
In addition, intelligent optimization algorithms are often used for trajectory planning and generation because of their ability to handle complex multi-constraint, high-dimensional optimization problems [16,17]. Zhao et al. [18] generate reliable constrained glide trajectories for hypersonic gliding vehicles by improving the pigeon-inspired optimization (PIO) algorithm. Duan et al. [19] combine the direct collocation method and the artificial bee colony algorithm to optimize the re-entry trajectory of a hypersonic aircraft. Zhou et al. [20] establish a dynamic pressure profile-based optimization model for the hypersonic vehicle trajectory optimization problem, transform it into a parameter optimization problem, and solve it with an improved particle swarm optimization algorithm. Li et al. [21] propose an improved particle swarm optimization algorithm combined with gradient search for rapid re-entry trajectory optimization of a hypersonic glider, which overcomes the loss of accuracy caused by premature convergence. Gaudet et al. [22] combine the adaptability of reinforcement learning with the fast learning ability of meta-learning and propose a missile guidance and control method based on reinforcement meta-learning. D'Ambrosio et al. [23] combine the Pontryagin maximum principle with the learning ability of neural networks and propose a fuel-optimal trajectory learning method based on Pontryagin neural networks. However, intelligent optimization algorithms are prone to falling into local optima, and how to avoid this remains a key issue in current research.
In recent years, convex optimization methods have become a powerful tool in aircraft trajectory planning research due to their fast solution speed and ability to handle constrained problems [24,25,26]. In Ref [27], the optimal guidance problem of planetary orbit insertion is transformed into a convex optimization problem through constraint convexification, linearization, and discretization, and a convex optimization algorithm based on the interior point method is proposed. Based on convex optimization, Liu et al. [28] propose a regularization technique to guarantee the exactness of the convex relaxation and solve the optimal terminal guidance problem of aerodynamically controlled missiles. Cheng et al. [29] accelerate the convex optimization solution of the ascent-phase trajectory planning problem of a launch vehicle by using the Newton–Kantorovich/pseudospectral method to iteratively compute the initial solution. In Ref [30], for the aircraft re-entry guidance problem, successive linearization and convexification techniques are used to transform the problem into a sequence of convex programming problems, and a convex optimization re-entry guidance method is designed whose sensitivity to the accuracy of the initial guess is reduced. Zhou et al. [31] improve the efficiency and accuracy of the original sequential convex programming algorithm by designing a dynamic grid point adjustment method. In Ref [32], a pseudospectral convex optimization technique combining the advantages of the pseudospectral method and convex optimization is proposed to achieve optimal trajectory planning for rocket powered descent and landing. Within the pseudospectral convex optimization framework, Sagliano et al. [33,34,35] further propose the generalized hp pseudospectral convex programming method and the Lobatto pseudospectral convex programming method, which improve the flexibility and efficiency of the optimization process. Song et al. [36] propose an adaptive dynamic descent guidance method based on multi-stage pseudospectral convex optimization, which achieves adaptive trajectory planning during rocket landing. All of the above methods convexify the aircraft motion equations and constraints in the time domain and thereby achieve trajectory optimization and generation. However, the relationship between the grid point position in the time domain and the physical position of the target is not intuitive, which makes it difficult to analyze the approximation error of the trajectory. In addition, solving trajectory optimization problems with convex optimization algorithms requires a feasible trajectory satisfying the constraints as the initial solution; otherwise, the algorithm may take a long time to find an optimal approximate solution, or it may even diverge. This problem also greatly hinders the performance improvement and application of convex optimization algorithms.
This paper addresses the problem of rapid trajectory optimization and generation in the midcourse guidance phase of an interceptor under multiple constraints. In terms of problem model processing, the motion model and multiple constraints are transformed into convex and discrete forms, allowing the optimization problem to be cast as a sequential convex programming problem that can be solved using convex optimization methods. The problem model is also transformed from the time domain to the lateral distance domain, which describes the positional relationship more intuitively. In terms of generating initial solution trajectories, this paper uses the deep deterministic policy gradient (DDPG) algorithm to train the planning and generation of interceptor midcourse guidance trajectories, obtaining high-quality initial solution trajectories that satisfy the basic guidance requirements and improving the generation speed and guidance accuracy of optimized trajectories. In terms of dynamic grid point adjustment, this paper uses the distribution of the grid point approximation error to determine which grid points to adjust and where to place them, thereby reducing the approximation error of the optimized trajectory and improving the solving efficiency of the convex optimization algorithm. The main contributions can be summarized as follows:
- (1). On the lateral plane of the three-dimensional trajectory, based on the lateral range of the interceptor, the concept of the lateral distance domain is proposed, which transforms the problem model from the time domain to the lateral distance domain, simplifies the convexification of the problem model, and facilitates the analysis of the approximation error.
- (2). Based on the characteristics of the trajectory planning model in the midcourse guidance phase, the corresponding Markov decision process (MDP) is designed, the DDPG algorithm is applied to learn the initial solution trajectory planning task, and a higher-quality initial solution trajectory is obtained.
- (3). The dynamic grid point adjustment strategy in the convex optimization algorithm is improved. During the iterative solution process, the positions of the grid points are adjusted based on the distribution of their approximate solution errors, which not only reduces the approximate solution error of the whole optimized trajectory but also improves the efficiency of the algorithm.
The sections of this paper are structured as follows. This first section briefly analyzes the state of the art in trajectory optimization generation algorithms and outlines the main contributions of this paper. In the second section, the problem of optimizing the midcourse guidance trajectory of the interceptor is described. The third section describes the convexity and discretization of the problem. In the fourth section, the fast method for generating initial solution trajectories based on the DDPG algorithm is presented. In the fifth section, the grid point dynamic adjustment method based on the approximate solution error distribution is designed. In the sixth section, the research content is simulated and verified. In the last section, the content of this paper is summarized.
4. Initial Solution Trajectories’ Rapid Generation Method Design
As a relatively mature deep reinforcement learning algorithm, the DDPG algorithm has significant advantages over other deep reinforcement learning algorithms (such as the deep Q network (DQN) and deterministic policy gradient (DPG)) in handling continuous action spaces, efficient gradient-based optimization, use of the experience replay buffer, and training stability [37]. This enables the DDPG algorithm to achieve higher performance and efficiency in complex continuous control tasks.
The quality of the initial solution trajectory is one of the key factors affecting the efficiency of the convex optimization algorithm. In order to improve the quality of the initial solution trajectory, ensure that it meets the basic guidance requirements, and then improve the search speed of the algorithm, this paper uses the DDPG algorithm in deep reinforcement learning to learn and train the problem model, and it obtains the optimal strategy to quickly generate the initial solution trajectory of the convex optimization algorithm.
4.1. Markov Decision Process Design
The interaction between the agent and the environment in the DDPG algorithm follows a Markov decision process (MDP), which mainly comprises the state set, action set, reward function, discount coefficient, and transition probabilities. This paper uses a model-free deep reinforcement learning method, so the transition probability is not required. The MDP is designed below according to the characteristics of the problem considered in this paper.
When designing the state set, it is important to include as much information as possible that helps to solve the problem and to discard information that may interfere with the decision. Based on the guidance mechanism of the interceptor, this paper defines the missile–target distance l_togo and the longitudinal and lateral plane components η_xh and η_xz of the velocity lead angle. Ignoring the influence of the earth's rotation and curvature, the simplified calculation formula is

η_xh = θ − φ_xh, η_xz = ψ_v − φ_xz,

where φ_xh and φ_xz are the longitudinal and lateral plane components of the line-of-sight angle. The composition of the state set is [h, z, x, v, θ, ψ_v, l_togo, η_xh, η_xz].
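As an illustration of how these state components could be computed under the stated simplifications, consider the following Python sketch; the sign conventions of the line-of-sight components and the helper names are assumptions, not taken from the paper.

```python
import numpy as np

def build_state(h, z, x, v, theta, psi_v, h_t, z_t, x_t):
    """Assemble the 9-dimensional MDP state (illustrative; flat-earth geometry).

    (h, z, x): interceptor altitude, lateral and downrange position
    (v, theta, psi_v): speed, flight path angle, heading angle
    (h_t, z_t, x_t): target position (assumed available to the guidance loop)
    """
    dh, dz, dx = h_t - h, z_t - z, x_t - x
    l_togo = np.sqrt(dh**2 + dz**2 + dx**2)  # missile-target distance

    # Line-of-sight angle components (longitudinal and lateral planes);
    # sign conventions here are assumed, not specified in the text.
    phi_xh = np.arctan2(dh, np.sqrt(dx**2 + dz**2))
    phi_xz = np.arctan2(-dz, dx)

    # Velocity lead angle components: velocity direction minus line of sight
    eta_xh = theta - phi_xh
    eta_xz = psi_v - phi_xz

    return np.array([h, z, x, v, theta, psi_v, l_togo, eta_xh, eta_xz])
```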
The action set can be composed of the interceptor’s guidance control inputs, namely the control variables α and σ.
The key to MDP design is the construction of the reward function. Based on the guidance purpose and mechanism, this paper uses two types of reward: the final reward R_f and the feedback reward R_Δt, where Δt is the simulation step; ω is the distance convergence threshold, whose value is set according to the accuracy requirements of the specific training task (the smaller the value, the higher the accuracy requirement); and η_xh^Δt and η_xz^Δt denote the values of η_xh and η_xz at step Δt. To encourage the interceptor to reach the terminal position as soon as possible, the final reward is made inversely proportional to the total number of simulation steps. At the same time, to make the velocity direction of the interceptor converge to the missile–target line as soon as possible, the feedback reward is designed as a negative reward. Based on actual defense operations, the interceptor adopts a frontal interception method, and the termination condition for iterative training is set as done = (|η_xh| > π/2) or (l_togo < ω). The total reward is thus R = R_f + R_Δt.
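As one possible reading of the reward design described above, the per-step reward computation could be sketched as follows; the scaling constants c_f and c_p are placeholders, not values from the paper.

```python
import numpy as np

def step_reward(eta_xh, eta_xz, l_togo, step_count, omega, c_f=1000.0, c_p=0.1):
    """Illustrative reward consistent with the design described above.

    c_f, c_p: placeholder scaling constants (not specified in the paper).
    step_count: number of simulation steps elapsed so far in the episode.
    """
    # Feedback reward: negative, drives the velocity direction toward
    # the missile-target line as fast as possible.
    r_feedback = -c_p * (abs(eta_xh) + abs(eta_xz))

    # Termination: frontal-interception failure or distance convergence.
    done = abs(eta_xh) > np.pi / 2 or l_togo < omega

    # Final reward: inversely proportional to the total number of steps,
    # granted only when the distance threshold is actually reached.
    r_final = c_f / step_count if (done and l_togo < omega) else 0.0

    return r_final + r_feedback, done
```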
The discount coefficient weights future rewards relative to the current reward and lies between zero and one. If the value is too small, the agent focuses only on the immediate next-step reward, which is not conducive to achieving long-term goals. If the value is one, all future rewards are weighted as heavily as the current one, and it is difficult to ensure that training converges.
4.2. Initial Solution Trajectory Rapid Generation
This method collects data through the interaction between the interceptor and the simulation environment and optimizes its strategy based on the obtained data. The trained strategy function constitutes the final rapid initial solution trajectory generation method. The specific model of the DDPG algorithm used in this paper and the interceptor's guidance motion model used for the agent–environment interaction can be found in Ref [37]; they are not detailed here due to space limitations.
Figure 2 shows the training framework for the trajectory planning task based on the DDPG algorithm.
The specific process of off-line algorithm training is as follows:
- (1) Initialize the network parameters and memory capacity, and start the cycle.
- (2) Set the initial state of the interceptor, randomly select the target point position, and start a single trajectory cycle.
- (3) Perform the actions, obtain the corresponding states and reward values, and store the data in the memory bank.
- (4) Randomly sample small batches of training data from the memory, update the network parameters, and complete a single trajectory cycle.
- (5) Determine whether the training task is complete. If so, proceed to the next step; if not, return to Step (2).
- (6) At the end of the cycle, output the optimal network parameters and trajectory planning strategy.
After specifying the initial and terminal conditions, the action sequence of the interceptor can be generated quickly based on the optimal network parameters trained off-line, and the state sequence, i.e., the initial solution trajectory, can be obtained by integration.
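For illustration only, the off-line training procedure in steps (1)–(6) could be organized as in the following Python sketch; the Gym-style environment `env` (wrapping the interceptor's guidance motion model) and the agent methods `select_action`, `store`, `update`, and `save` are placeholder interfaces, not the paper's implementation.

```python
def train_ddpg(agent, env, max_episodes=50_000, batch_size=2_000):
    """Off-line training loop following steps (1)-(6) above (illustrative)."""
    best_return = -float("inf")
    for episode in range(max_episodes):            # (1) outer training cycle
        state = env.reset()                        # (2) random target point
        done, ep_return = False, 0.0
        while not done:
            action = agent.select_action(state)    # alpha, sigma + exploration noise
            next_state, reward, done, _ = env.step(action)
            agent.store(state, action, reward, next_state, done)   # (3)
            # (4) minibatch update (assumed to no-op until the replay
            # buffer holds at least batch_size transitions)
            agent.update(batch_size)
            state, ep_return = next_state, ep_return + reward
        if ep_return > best_return:                # (5)-(6) keep best policy
            best_return = ep_return
            agent.save("best_policy.pt")           # hypothetical checkpoint path
    return agent
```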
5. Grid Points’ Dynamic Adjustment Method Design
In the iterative solution process of a convex optimization algorithm, the iterative solution is represented in discrete form. A reasonable design of the grid points not only improves the convergence of the algorithm but also determines, through the accuracy at each grid point, the approximate solution error of the optimized trajectory [38]. Let N be the number of grid points in the k-th iteration. The approximate solution error of the i-th grid point is defined as

e_i^k = ‖x_i^k − x̃_i^k‖,

where x_i^k represents the state vector of the k-th iteration solution at the i-th grid point, and x̃_i^k represents the corresponding integrated state vector without linearization error, expressed as

x̃_i^k = x_{i−1}^k + ∫_{l_{i−1}}^{l_i} f(x^k(l), u^k(l)) dl.
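As a minimal illustration of this error measure, the following Python sketch integrates the nonlinear dynamics over each grid interval with a single RK4 step and compares the result with the current iterate; the right-hand-side function `f(x, u)` of the motion model in the lateral distance domain is a placeholder assumption.

```python
import numpy as np

def grid_point_errors(l, X, U, f):
    """Approximate solution error e_i at each grid point (illustrative).

    l : (N,)   grid point positions in the lateral distance domain
    X : (N, n) state vectors of the current iteration solution
    U : (N, m) control vectors of the current iteration solution
    f : callable f(x, u) -> dx/dl, the nonlinear motion model (assumed)
    """
    N = len(l)
    e = np.zeros(N)
    for i in range(1, N):
        dl = l[i] - l[i - 1]
        x, u = X[i - 1], U[i - 1]
        # One RK4 step: integrate the nonlinear dynamics without
        # linearization error from grid point i-1 to grid point i.
        k1 = f(x, u)
        k2 = f(x + 0.5 * dl * k1, u)
        k3 = f(x + 0.5 * dl * k2, u)
        k4 = f(x + dl * k3, u)
        x_tilde = x + dl / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        e[i] = np.linalg.norm(X[i] - x_tilde)   # deviation at grid point i
    return e
```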
The adjustment function of the grid points is defined as follows:

χ(i) = 1 if i = 1, i = N, or e_i^k ≥ χ_nord; χ(i) = 0 if e_i^k < χ_nord,

where χ_nord represents the default threshold for the approximate solution error of the grid points. In this paper, the grid point positions at both ends are fixed; therefore, their function values remain 1 throughout the iterative solution process. When the approximate solution error of a grid point is less than the threshold, its function value is 0, indicating that the degree of non-linearity violation at this grid point is relatively small and its position can be changed. The number of grid points to be adjusted in each iteration, denoted Δχ^k, is the number of grid points whose function value is 0.
The probability density function of the adjusted grid points is defined to be proportional to the approximate solution error:

p(l_i) = e_i^k / Σ_{j=1}^{N} e_j^k. (46)

According to Equation (46), the probability density of the grid points at any position in the lateral distance domain can be obtained by interpolation. The corresponding cumulative distribution function F(l) is then calculated, and, according to the number of grid points to be adjusted, the positions with cumulative probability j/(Δχ^k + 1), j = 1, …, Δχ^k, are selected as the new grid point positions.
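Combining the adjustment function with Equation (46), the redistribution step could be sketched as follows; the linear interpolation of the cumulative distribution and the function names are assumptions made for illustration.

```python
import numpy as np

def redistribute_grid(l, e, chi_nord):
    """Move low-error interior grid points to high-error regions (illustrative)."""
    N = len(l)
    if np.sum(e) == 0:                 # degenerate case: nothing to redistribute
        return l

    # Adjustment function: endpoints fixed (value 1); interior points with
    # error below the threshold are movable (value 0).
    chi = np.ones(N, dtype=int)
    chi[1:-1] = (e[1:-1] >= chi_nord).astype(int)
    n_move = int(np.sum(chi == 0))     # number of grid points to relocate
    if n_move == 0:
        return l

    # Error-proportional density over the lateral distance domain, Eq. (46)
    p = e / np.sum(e)
    cdf = np.cumsum(p)

    # Inverse-CDF sampling at cumulative levels j/(n_move+1), j = 1..n_move
    levels = np.arange(1, n_move + 1) / (n_move + 1)
    new_points = np.interp(levels, cdf, l)

    # Keep fixed/high-error points, add the relocated ones, and re-sort
    kept = l[chi == 1]
    return np.sort(np.concatenate([kept, new_points]))
```

In this way, the total number of grid points is preserved while movable points migrate toward the high-error (highly non-linear) portions of the trajectory.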
The algorithm iteration termination condition is set as follows:

|s_i^{k+1} − s_i^k| ≤ ε for all grid points i,

where ε is the algorithm convergence threshold, a vector with the same dimension as s. The dynamic adjustment process of the grid points in the algorithm is shown in Figure 3.
6. Simulation Verification
6.1. Experimental Parameter Setting
It is assumed that the interceptor adopts a high-throw re-entry glide trajectory mode, and this paper focuses on the trajectory of the interceptor in the midcourse guidance re-entry glide phase. The initial states are set to [h0, z0, x0, v0, θ0, ψv0] = [7 × 10^4/re, 0, 0, 3 × 10^3/, −5°, 0°]. The process constraints are limited to Q_max = 1 × 10^6 J/(m²·s), p_max = 1 × 10^5 Pa, and n_max = 8 g. The control variable constraints are limited to |α| ≤ 30° and |σ| ≤ 85°. The trust region constraint radii are set to [δh, δz, δx, δv, δθ, δψv, δα, δσ] = [2 × 10^4/re, 2 × 10^3/re, 2 × 10^3/re, 500/, 20π/180, 30π/180, 10π/180, 90π/180]. The convergence thresholds are set to [εh, εz, εx, εv, εθ, εψv, εα, εσ] = [200/re, 20/re, 20/re, 50/, π/180, 5π/180, π/180, 5π/180]. The number of grid points in the iterative solution of the convex optimization algorithm is set to N = 200; the maximum number of iterations is 500; and the approximate solution error threshold is χ_nord = 1 × 10^−6. The simulation experiments are programmed in Python 3.10; the simulation data are plotted with MATLAB R2016a; and the ECOS-BB solver is used for the sequential convex programming.
The parameters of the DDPG algorithm are set as follows. The simulation step size is set to Δt = 1/; the discount factor is set to 0.99; the maximum number of training episodes is set to 5 × 10^4; the number of random training samples is set to 2 × 10^3; and the capacity of the memory bank is set to 1 × 10^6. The actor network adopts a 9-300-2 layer structure; the critic network adopts an 11-300-2 layer structure; the network parameters are optimized with the Adam optimizer; and the network learning rate is set to 0.0001.
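For reference, these settings correspond to a configuration of the following form (values transcribed from the text; the dictionary keys are illustrative):

```python
# DDPG hyperparameters as reported in Section 6.1 (key names illustrative)
ddpg_config = {
    "discount_factor": 0.99,
    "max_training_episodes": 5 * 10**4,
    "batch_size": 2 * 10**3,          # random training samples per update
    "replay_buffer_capacity": 10**6,
    "actor_layers": (9, 300, 2),      # state dim 9 -> hidden 300 -> actions (alpha, sigma)
    "critic_layers": (11, 300, 2),    # sizes as reported in the text
    "optimizer": "Adam",
    "learning_rate": 1e-4,
}
```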
6.2. The Effectiveness of Initial Solution Trajectories’ Rapid Generation Method Verification
In order to verify the effectiveness of the proposed rapid initial solution trajectory generation method based on the DDPG algorithm, this paper evaluates the convergence of DDPG training by observing the change in the average reward; the specific meaning of the average reward can be found in Ref [30], and the average reward index is set to 800. At the same time, multiple terminal positions ([h_f, z_f, x_f] = [3 × 10^4/re, 0, 3.38 × 10^5/re], [3 × 10^4/re, 2.5 × 10^4/re, 3.38 × 10^5/re], and [3 × 10^4/re, −2.5 × 10^4/re, 3.38 × 10^5/re]) are selected for the initial solution trajectory generation experiments. The simulation results are shown in Figure 4.
Figure 4a shows that the average reward of the DDPG algorithm shows an overall upward trend during the training process, and when the number of training iterations is 15,130, the average reward reaches the set training index, indicating that the DDPG algorithm can converge in the training of the initial solution generation task of the guidance trajectory.
Figure 4b shows that the three initial solution trajectories generated by this method are not only relatively smooth but also meet the basic guidance requirements. In addition, the generation time of these three initial solution trajectories is 0.15 s, 0.14 s, and 0.15 s, all of which meet the time requirements for rapid trajectory generation.
6.3. The Superiority of Initial Solution Trajectories’ Rapid Generation Method Verification
In order to verify the superior performance of the trajectory convex optimization algorithm that uses the trajectories generated by the DDPG algorithm as initial solutions, the three terminal positions from Section 6.2 are used as three scenarios, and a simulation comparison experiment is conducted against the initial solution trajectory generation algorithm from Ref [38]. The grid point strategies of both algorithms are uniformly distributed, and the simulation results are shown in Figure 5.
In Figure 5, DDPG represents the initial solution trajectory curve obtained using the DDPG algorithm; Li M. (2023) represents the initial solution trajectory curve obtained using the initial solution trajectory generation algorithm from Ref [38]; DDPG + CVX refers to the optimized trajectory curve obtained via convex optimization starting from the initial solution trajectory generated by the DDPG algorithm; and Li M. (2023) + CVX represents the optimized trajectory curve obtained via convex optimization starting from the initial solution trajectory generated by the algorithm from Ref [38].
As shown in Figure 5, the optimal trajectories obtained by the two convex optimization methods are very similar, but the guiding effect of the initial solution generated by the DDPG algorithm is significantly better than that of the method from Ref [38] in all three scenarios (see Figure 5a–f). Moreover, the optimal trajectories in all three scenarios satisfy the process constraints (see Figure 5g–i), indicating that the trajectory planned by the improved convex optimization algorithm in this paper is effective.
To effectively assess the comparative advantages and disadvantages of the two methods, this paper randomly selects terminal positions and carries out 100 Monte Carlo simulations using both approaches. The resulting data of each simulation are recorded, and their statistical averages are computed for comparison, as shown in Table 1.
As is evident in Table 1, in the Monte Carlo simulations, the DDPG + CVX algorithm in this paper requires fewer iterations to attain an optimized trajectory than the convex optimization method based on Ref [38]. Furthermore, the corresponding solving time is reduced by approximately one-fifth on average. Notably, despite achieving similar objective function values, the average terminal position error of the DDPG + CVX algorithm is smaller than that of the convex optimization method based on Ref [38]. This is attributed to the higher accuracy of the initial trajectory generated by the DDPG approach, which provides a beneficial starting point that enables the convex optimization algorithm to converge swiftly and precisely to the optimal trajectory during the iterative solution.
The simulation results show that the improved convex optimization algorithm proposed in this paper improves the efficiency and accuracy of solving the trajectory planning problem in the midcourse guidance phase.
6.4. The Effectiveness of Grid Point Dynamic Adjustment Method Verification
In order to verify the effectiveness of the dynamic grid point adjustment method based on the distribution of approximate solution errors designed in this paper, a simulation comparison with the traditional uniform grid point distribution method is performed using Scenario 2 in Section 6.3 as an example. The simulation results are shown in Figure 6. The approximate solution error of the trajectory is defined as the integration difference between the optimized trajectory and the actual trajectory [32].
As shown in Figure 6, the iterative longitudinal trajectories of both methods quickly converge to the optimal trajectory (see Figure 6a,d), but the iterative lateral trajectory of the traditional method is clearly inferior to that of the improved method (see Figure 6b,e). Compared with the traditional method, the improved method requires only four iterations, and in its final grid point distribution, the grid points at both ends of the trajectory are relatively sparse while those in the middle are relatively dense (see Figure 6c,f). This is because the trajectory turns laterally in the middle stage, increasing the overload demand, so the non-linearity of this part of the trajectory is relatively high. The improved method's grid point adjustment strategy therefore allocates more grid points to the middle stage of the trajectory.
Similar to Section 6.3, to effectively assess the comparative advantages and disadvantages of the two methods, this paper randomly selects terminal positions and carries out 100 Monte Carlo simulations using both approaches. The resulting data of each simulation are recorded, and their statistical averages are computed for comparison, as shown in Table 2.
As shown in Table 2, the improved method outperforms the traditional method in terms of both iterations and CPU time. Although the objective function values obtained by the two methods are similar, the approximate solution error of the trajectory obtained by the improved method is reduced by about half compared with the traditional method. This is because, under the designed dynamic grid point adjustment method (Equations (42)–(46)), the improved method gradually places more grid points in regions of higher non-linearity during the iterative solution process, reducing the degree of non-linearity violation of the trajectory and thus the approximate solution error of the obtained trajectory.
The simulation results show that the designed grid point dynamic adjustment method based on the approximation error distribution not only improves the optimization efficiency of the convex optimization algorithm but also greatly reduces the approximation error of the optimized trajectory, making it more conducive to the subsequent trajectory tracking processing.
6.5. The Performance of Improved Convex Optimization Method Verification
In order to verify the optimization performance of the proposed method and its ability to satisfy the constraints, a simulation comparison with the Gauss pseudospectral method (GPM) is performed using Scenario 2 in Section 6.3 as an example. The simulation results are shown in Figure 7 and Table 3.
In Figure 7, GPM represents the Gauss pseudospectral method.
As shown in Figure 7, the optimized trajectories of both methods meet the guidance requirements and are smooth (see Figure 7a), with relatively stable changes in their respective state variables (see Figure 7b–d). The changes in the angle of attack and bank angle of the trajectory optimized by the DDPG + CVX method are significantly smaller than those of the GPM method (see Figure 7e,f), indicating that the DDPG + CVX method is more conducive to subsequent trajectory tracking and control. In addition, the optimized trajectories of both methods satisfy the process constraints (see Figure 7g–i). As can be seen in Table 3, the trajectory optimization accuracy of the GPM method is slightly higher, but the accuracies of the two methods are not much different, while the optimization time and approximate solution error of the DDPG + CVX method are much smaller than those of the GPM method. The comparison of the simulation data therefore shows that the optimization performance of the two methods is approximately the same and both can satisfy the constraints of the guidance process, but the optimization efficiency of the DDPG + CVX method is significantly better.