1. Introduction
Many studies have been carried out on missiles that must handle complex mission profiles. This is not only because military demand requires overcoming complicated engagement scenarios, but also because researchers have anticipated such demands in advance. Guided missiles are generally based on proportional navigation guidance (PNG), which is widely known to be quasi-optimal for interceptor guidance [1,2,3,4]. Many guidance laws for various objectives have been derived from PNG. Zhou et al. [5,6,7,8,9] considered jamming and deception against a friendly missile. They showed a simultaneous-impact engagement profile by introducing impact time control guidance (ITCG), and pointed out the limitation of jammers, which work well only under a one-to-one correspondence between seeker and jammer. ITCG has been continuously developed, from the initial study on a planar engagement space [9] to many improved guidance laws. Studies of terminal angle constraint guidance (TACG), which constrain the approach angle in the terminal phase, have also been carried out [7,10,11,12]. These studies are based on the fact that the performance of a missile can be enhanced if it is able to strike a specific vulnerable part of the target.
Meanwhile, we also looked into studies of obstacle avoidance guidance for fixed-wing aircraft, since missiles and fixed-wing aircraft share similar dynamic properties. Ma [13] proposed a real-time obstacle avoidance method for a fixed-wing unmanned aerial vehicle (UAV) and showed good trajectory-planning performance in a three-dimensional dynamic environment using a rapidly exploring random tree (RRT). Wan [14] proposed a novel collision avoidance algorithm for cooperative fixed-wing UAVs, in which each UAV generates three possible maneuvers and predicts the corresponding planned trajectories. The algorithm examines the combinations of planned trajectories, decides whether each combination avoids collision, and activates the chosen maneuver as the collision approaches.
Recently, reinforcement learning (RL) has attracted a lot of attention for the optimization and design of guidance in various fields. Yu and Mannucci [15,16] used RL for fixed-wing UAVs to implement collision avoidance tasks and showed, through numerous simulation experiments, that the probability of collision between UAVs was reduced. There are also prior studies on missile guidance via reinforcement learning. Gaudet [17] argued that RL trained in a stochastic environment can make the guidance logic more robust, and presented an RL-based missile guidance law and framework for the homing phase via Q-learning. They also presented a framework for interceptor guidance law design that infers guidance commands from line-of-sight angles only, via proximal policy optimization (PPO) [18]. However, they dealt with a small and limited environment. Hong [19] expanded the environment to cover a whole planar space and set fair comparison conditions, presenting an RL-based missile guidance law for a wide range of environments and showing several advantageous features.
In practice, some missiles already have the ability to avoid anti-missile systems and obstacles. The Harpoon, for example, performs a sea-skimming maneuver to hide from radar detection. In the terminal phase, it pitches up into a pop-up maneuver to prevent mission failure due to counteraction by a close-in weapon system (CIWS). It can also avoid known obstacles such as friendly ships or islets by following predefined waypoints. Such guidance capabilities can raise the mission success rate, since they make it difficult for missile defense systems to respond properly.
Several algorithms for obstacle avoidance have already been suggested in the literature to guide a missile to its target in complicated environments containing mountainous areas, islets, and ships. They achieve this by following a predefined trajectory, which requires a complete map of the operation field. Such approaches obviously limit the operational environment and require too much prior information.
In this paper, we propose novel missile guidance laws using reinforcement learning that can autonomously avoid obstacles and terrain in complicated environments with limited prior information, without the need for off-line trajectory or waypoint generation. Our guidance laws operate by real-time inference with a low computational burden and are also able to estimate the probability of mission failure, which gives the missile time to abort the mission safely when failure is predicted.
This paper is organized as follows. Section 2 explains basic missile dynamics and discusses the environment model in which the guidance laws are trained and operated. In Section 3, we present details of the neural network architecture, reward function design, and training methodology. In Section 4, numerical simulations are provided and the performance of the proposed guidance laws is evaluated. Concluding remarks are given in Section 5.
3. Architecture Design and Training
Figure 7 shows the architecture of the artificial neural networks for scenarios A and B, where the left one is the actor network and the right one is the critic network. Each network is composed of nine hidden layers and each layer contains hundreds of neurons, as shown in Figure 7. All layers use hyperbolic tangent activation functions. The actor network takes 14 and 24 states as inputs for scenarios A and B, respectively, and produces 1 and 2 outputs as actions for scenarios A and B, respectively. The actions, which are the missile maneuver accelerations, are limited to the feasible range. The actions are then normalized and fed into the critic network, together with the states, to evaluate the policy. The critic network is updated with a mean square error (MSE) loss, and the policy is updated via TD3PG [23]. TD3PG stands for Twin Delayed Deep Deterministic Policy Gradient and is one of the most advanced RL algorithms. It was developed to ease a limitation of the Deep Deterministic Policy Gradient (DDPG) [24,25], which tends to overestimate state-action values.
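To make the architecture concrete, the following PyTorch sketch builds an actor with nine tanh hidden layers and a critic of the same depth. The hidden-layer width (256), the acceleration limit, and the use of twin Q-networks for the TD3PG critic are assumptions made for illustration; only the input/output dimensions and the tanh activations follow the description above.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256, n_hidden=9, out_act=None):
    """Stack of fully connected layers with tanh activations (width is an assumption)."""
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.Tanh()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, a_max):
        super().__init__()
        self.a_max = a_max                       # feasible acceleration limit (placeholder value)
        self.net = mlp(state_dim, action_dim, out_act=nn.Tanh())

    def forward(self, s):
        return self.a_max * self.net(s)          # saturate the command to the feasible range

class Critic(nn.Module):
    """Twin Q-networks, as commonly used by TD3-style algorithms."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.q1 = mlp(state_dim + action_dim, 1)
        self.q2 = mlp(state_dim + action_dim, 1)

    def forward(self, s, a_norm):
        sa = torch.cat([s, a_norm], dim=-1)      # normalized action is fed together with the state
        return self.q1(sa), self.q2(sa)

# Scenario A: 14 states, 1 acceleration command (the 20 g limit is only an example)
actor = Actor(state_dim=14, action_dim=1, a_max=20 * 9.81)
critic = Critic(state_dim=14, action_dim=1)
```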
The environments for both scenarios have the following termination conditions (a minimal sketch of these checks follows the list):
1. Collision: activated when the agent hits an object that must not be hit;
2. Escape: activated when the agent leaves the environment boundary;
3. Excess altitude: activated when the agent exceeds the predefined altitude limit (scenario B only);
4. Time over: activated when the episode exceeds the allotted time;
5. Out of sight: activated when the target leaves the field of view of the agent's seeker;
6. Hit: activated when the agent is close enough to the target.
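As referenced above, a minimal sketch of the termination checks is given below. The environment fields (e.g., hit_radius, seeker_fov) are hypothetical names chosen for illustration; the precedence rule, where a later condition overrides an earlier one, follows the tie-breaking described in Sections 3.1 and 3.2.

```python
from enum import IntEnum

class Termination(IntEnum):
    NONE = 0
    COLLISION = 1        # hit an object that must not be hit
    ESCAPE = 2           # left the environment boundary
    EXCESS_ALTITUDE = 3  # exceeded the altitude limit (scenario B only)
    TIME_OVER = 4        # episode exceeded the allotted time
    OUT_OF_SIGHT = 5     # target left the seeker's field of view
    HIT = 6              # close enough to the target

def check_termination(env, scenario_b: bool) -> Termination:
    """Return the applicable termination condition; when several hold at once,
    the condition with the largest ordinal number takes precedence."""
    result = Termination.NONE
    if env.collided:
        result = Termination.COLLISION
    if env.outside_boundary:
        result = Termination.ESCAPE
    if scenario_b and env.altitude > env.altitude_limit:
        result = Termination.EXCESS_ALTITUDE
    if env.time > env.time_limit:
        result = Termination.TIME_OVER
    if abs(env.look_angle) > env.seeker_fov / 2:
        result = Termination.OUT_OF_SIGHT
    if env.distance_to_target < env.hit_radius:
        result = Termination.HIT
    return result
```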
In the training session, the environment of each episode is randomly generated under the given constraints. This randomness makes the guidance law robust by letting the agent experience a varying environment. Further training details for each scenario will be described below.
3.1. Training Details for Scenario A
For scenario A, the agent has 14 inputs: the distance to the target R, its rate, the look angle to the target, the look angle at the previous step, the 5 beam lengths returned by the obstacle detector, and the one-step-previous values of the beam lengths. The reason for including one-step-previous values is to let the agent implicitly recognize rates. The termination rewards for each episode are shown in Table 7.
In Table 7, the rewards are expressed in terms of the initial distance to the target and the final distance to the target at which the episode terminated. If multiple termination conditions are satisfied at the same time, the condition with the largest ordinal number is applied. The rewards for termination conditions 1–5 start at −500, since we want the agent to be able to predict mission failure: a missile must never hit obstacles, which may include friendly ships. The step reward is designed as a sum of four terms, each with its own purpose. The first term minimizes the maneuver energy, and the second term drives the missile closer to the target. The third term rewards the agent more in a more difficult environment. Finally, a positive reward is accrued at every step, which encourages the agent to build a detour route when it faces obstacles.
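As the exact reward expression is not reproduced here, the sketch below only mirrors the four purposes listed above; the weights, the squared-acceleration energy term, and the use of the obstacle count as a difficulty measure are illustrative assumptions rather than the authors' coefficients.

```python
def step_reward_scenario_a(accel_cmd, dist_prev, dist, n_obstacles,
                           w_energy=1e-4, w_closing=1e-3,
                           w_difficulty=0.01, w_time=0.1):
    """Illustrative step reward for scenario A (all weights are placeholders).

    Terms, in the order described in the text:
      1. penalize maneuver energy,
      2. reward closing in on the target,
      3. reward operating in a more difficult environment,
      4. small positive per-step reward that favors detour routes.
    """
    r_energy = -w_energy * accel_cmd ** 2
    r_closing = w_closing * (dist_prev - dist)
    r_difficulty = w_difficulty * n_obstacles
    r_time = w_time
    return r_energy + r_closing + r_difficulty + r_time
```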
Figure 8 shows the learning curve of the agent during training, which illustrates that the agent learns reliably and reaches its maximum reward after about 400 episodes.
3.2. Training Details for Scenario B
For scenario B, the agent has 23 inputs: the distance to the target R, its rate, the azimuth and elevation look angles to the target, the look angles at the previous step, the attitude angles of the missile, the 15 beam lengths returned by the obstacle detector, and their one-step-previous values. Table 8 shows the termination reward for each episode.
In Table 8, the rewards are expressed in terms of the initial distance to the target and the final distance to the target at which the episode terminated. If multiple conditions are satisfied, the reward of the condition with the larger ordinal number is applied. Meanwhile, training with the termination reward alone is very inefficient: unless some step reward guides the agent toward the target, the agent needs far too many attempts to reach the sparse reward in the vast environment. We therefore designed the step reward as a sum of four terms involving the z-axis component of the missile's inertial position, the altitude of the highest peak of the mountainous terrain, and the number of peaks. The first term reduces the maneuver acceleration of the missile to suppress excessive maneuvers and save energy; the second term guides the missile's heading toward the target. The third term forces the missile to keep a low altitude, which suppresses the possibility of being detected. The fourth term provides a fixed amount of reward at each step, so that the total reward increases as the episode gets longer, helping the missile create a detour trajectory.
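With the same caveats as in scenario A, a sketch of the scenario B step reward is given below; the weights, the cosine heading measure, and the altitude normalization by the highest peak are illustrative assumptions, and the term involving the number of peaks is omitted because its exact form is not specified in the description above.

```python
import numpy as np

def step_reward_scenario_b(accel_cmd, vel, pos, target_pos, peak_altitude,
                           w_energy=1e-4, w_heading=0.5,
                           w_altitude=0.5, w_time=0.1):
    """Illustrative step reward for scenario B (all weights are placeholders).

      1. suppress excessive maneuvers (acceleration magnitude penalty),
      2. align the velocity vector with the direction to the target,
      3. keep altitude low relative to the highest peak,
      4. constant per-step reward that favors detour trajectories.
    """
    r_energy = -w_energy * np.dot(accel_cmd, accel_cmd)
    to_target = (target_pos - pos) / np.linalg.norm(target_pos - pos)
    heading = vel / np.linalg.norm(vel)
    r_heading = w_heading * np.dot(heading, to_target)     # cosine of heading error
    r_altitude = -w_altitude * (pos[2] / peak_altitude)    # z-axis inertial position
    r_time = w_time
    return r_energy + r_heading + r_altitude + r_time
```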
The learning curve in Figure 9 shows that the reward increases in a stable manner as training progresses. After around episode 3100, we lowered the learning rate to fine-tune the guidance law. After training, the missile tends to move along topographic valleys and turn its head toward the target while keeping its altitude as low as possible.
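The fine-tuning mentioned above amounts to lowering the optimizer learning rate partway through training; the sketch below assumes PyTorch-style optimizers and an example fine-tuning rate that is not taken from the paper.

```python
# Hypothetical schedule: drop the learning rate once episode 3100 is reached.
def maybe_lower_lr(episode, optimizers, fine_tune_episode=3100, fine_tune_lr=1e-5):
    """Reduce the learning rate of every given optimizer at the fine-tuning episode."""
    if episode == fine_tune_episode:
        for opt in optimizers:                    # e.g., [actor_opt, critic_opt]
            for group in opt.param_groups:
                group["lr"] = fine_tune_lr        # example value, not from the paper
```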
5. Conclusions
This paper presents novel missile guidance laws using reinforcement learning. The design processes of the guidance laws are explained in detail in terms of neural network architecture, reward function selection, and training method. The proposed guidance laws address two scenarios. For scenario A, two-dimensional obstacle avoidance, the guidance law is designed to avoid planar obstacles until the missile reaches the target. It avoids most obstacles by real-time inference of the trained networks, using limited information compared to existing algorithms with similar purposes. Meanwhile, failure can be predicted through the critic network, which is obtained naturally during the learning process; this allows the missile to take action before it causes a fatal disaster, such as hitting friendly ships. For 3D terrain avoidance, which is scenario B, an RL-based missile guidance law is designed to overcome terrain features through real-time inference. It keeps its altitude low to avoid detection by radar placed on top of the terrain while striking the target.
In summary, the proposed RL-based missile guidance laws are not only able to strike targets while avoiding obstacles and topographic features with limited information, but also able to estimate the probability of mission success, i.e., whether the mission is achievable. Numerical simulations show their effectiveness along with some inherent limitations.