#### **1. Introduction**

With the density of maritime traffic increasing, various types of marine accidents occur frequently. According to the world maritime disaster records filed by the International Maritime Organization (IMO) from 1978 to 2008, 80% of marine accidents are caused by human factors [1]. Improving the level of autonomy of ships has therefore become an urgent problem. At the same time, the environments that ships face are increasingly complex. In some cases it is not suitable for manned ships to carry out tasks on site, whereas unmanned ships are better suited to the complex, changeable, and harsh environment at sea. This requires unmanned ships to be capable of autonomous path planning and obstacle avoidance, so that they can complete tasks efficiently and enhance their overall operational capability [2].

With their strong autonomy and adaptability, unmanned ships have gradually become a new research direction pursued by industry [3]. Unmanned ships can not only perform tasks independently in dangerous sea areas, but also cooperate with manned ships to improve work efficiency. They are widely used in marine exploration, military tasks, material transportation, and other fields. Autonomous path planning is a key technology for improving the self-determination ability of unmanned ships. It must optimize the path within the safe navigation area according to navigation rules and crew experience, safely avoiding obstacles and independently planning an optimal trajectory from a known starting point to the target point. Unmanned ships often face complex and variable navigational environments, so a continuous and effective method is needed to control the trajectory of the ship during the voyage and thus ensure its safety [4].

The research directions of unmanned ships are mainly divided into autonomous path planning, navigation control, autonomous collision avoidance, and semi-autonomous task execution. As the basis and premise of autonomous navigation, autonomous path planning plays a key role in ship automation and practical application [5]. In actual navigation, ships frequently encounter other ships, which requires reasonable methods to guide them around those ships and toward the target point. It is therefore valuable to consider how to avoid dynamic obstacles during path planning. A path planning method can guide the unmanned ship to take the optimal action and avoid other obstacles; it can also partition the environment into navigable areas and obstacle areas according to obstacle information, realizing local obstacle avoidance. Many scholars have carried out research and experiments on autonomous path planning for unmanned ships. However, traditional path planning methods usually require relatively complete environmental information as prior knowledge, which is difficult to obtain in an unknown sea environment. Moreover, traditional algorithms are computationally expensive, which makes real-time behavior decision-making for ships difficult and leads to large errors in the planned path.

In recent years, artificial intelligence technology has developed rapidly. Deep Learning (DL) [6] and Reinforcement Learning (RL) [7] have achieved great success in many fields: DL has strong perceptual ability, while RL provides decision-making ability. Deep Reinforcement Learning (DRL) [8] combines the advantages of both and offers a solution to the perception and decision-making problem of complex systems. DRL can effectively handle continuous state and action spaces. It takes raw data directly as input and outputs actions to execute, realizing an end-to-end learning mode that greatly improves the efficiency and convergence of the algorithm. At present, DRL is widely used in robotic control [9], automatic driving [10–12], financial prediction [13], traffic control [14], and other fields. As artificial intelligence gradually permeates various fields, the prospect of unmanned autonomy draws closer, and unmanned ships, an important part of the transportation field, are moving toward intelligence and autonomy.

RL has attracted widespread attention in recent years; it emphasizes learning a mapping from environment states to behaviors and seeks optimal action decisions by maximizing a value function. Mnih, V et al. [15] proposed the Deep Q-Network (DQN) algorithm, which opened a new era of DRL. DQN exploits the powerful function-fitting ability of deep neural networks to avoid the huge storage space of a Q table, and uses an experience replay memory and a target network to stabilize the training process. DQN also implements an end-to-end learning method: raw data are the only input, and the output is the Q value of each action. The DQN algorithm has achieved great success on discrete actions, but it struggles with high-dimensional continuous actions. If a continuously varying action is split ever more finely, the number of actions grows exponentially with the degrees of freedom, leading to the curse of dimensionality and great training difficulty. In addition, naively discretizing the action discards important information about the structure of the action domain. The Actor-Critic (AC) algorithm [16] can handle continuous action problems and is widely used in continuous action spaces. Its network structure comprises an Actor network, which outputs the probability of each action, and a Critic network, which evaluates the output action. In this way the network parameters are continuously optimized and the optimal action strategy is obtained; however, the stochastic policy of the AC algorithm makes the network difficult to converge. Lillicrap, T.P et al. [17] proposed the Deep Deterministic Policy Gradient (DDPG) algorithm to address DRL in continuous state and action spaces. DDPG is a model-free algorithm that adopts the experience replay memory and target network of DQN. At the same time, an AC architecture based on the Deterministic Policy Gradient (DPG) makes the network output a definite action value, which allows DDPG to be applied to continuous action spaces. DDPG converges readily and can be applied to complex problems and larger network structures. Zhu, M et al. [11] proposed a framework for human-like autonomous car-following planning based on DDPG, in which unmanned cars learn from the environment through trial and error; the resulting path planning model achieved good experimental results. This research shows that DDPG can gain insight into driver behavior and can help in developing human-like autopilot algorithms and traffic flow models.

This paper proposes an autonomous path planning model for unmanned ships based on the DRL method. The essence of the model is that the agent independently finds the most efficient path through trial-and-error exploration, which is closer to human manipulation. At the same time, ship motion is treated as a continuous motion control problem, in keeping with the actual motion characteristics of a ship. These ideas are implemented as follows. Firstly, the interaction between the ship and the environment is analyzed, and a virtual computing environment for unmanned ship autonomous path planning, close to the real world, is established. Secondly, the model is defined, the network structure and parameters of the DDPG algorithm are set, and the action exploration strategy and reward function are designed. The International Regulations for Preventing Collisions at Sea (COLREGS) and crew experience are quantified, and an attraction strategy toward the target point is added, to ensure standardized navigation and prevent the algorithm from falling into local optima. Finally, historical experience data are stored in a memory pool and the neural network parameters are updated from randomly extracted samples, thereby reducing the correlation of the data and improving the learning efficiency of the algorithm. In addition, this paper combines the artificial potential field (APF) with DDPG to obtain an APF-DDPG based autonomous path planning model for unmanned ships, which has higher decision-making power and faster convergence. The model is integrated with an electronic chart platform to evaluate its validity and accuracy, and experiments are carried out under single-ship and multi-ship encounter conditions. Moreover, five sets of verification experiments with the DQN, AC, DDPG, APF-DDPG, and Q-learning [18] algorithms are designed as comparative cases. The results show that APF-DDPG has high convergence speed and planning efficiency, that the planned path better conforms to navigation rules, and that autonomous path planning of unmanned ships is realized.
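To make the APF-DDPG idea concrete, the sketch below shows one plausible way an APF-style shaping term (attraction to the target point plus repulsion from nearby obstacles) can enter a DRL reward. The gains, safety radius, and distance inputs are illustrative assumptions for this sketch, not the paper's actual reward design, which is developed in Section 3.

```python
import math

def apf_shaped_reward(ship_pos, goal_pos, obstacle_dists,
                      k_att=1.0, k_rep=0.5, safe_dist=2.0):
    """Illustrative APF-style reward: attraction toward the goal plus
    repulsion from obstacles inside a safety radius. All gains are
    assumed values, not the paper's parameters."""
    # Attractive term: reward grows as the distance to the target shrinks.
    dist_goal = math.hypot(goal_pos[0] - ship_pos[0],
                           goal_pos[1] - ship_pos[1])
    r_att = -k_att * dist_goal
    # Repulsive term: penalize obstacles that violate the safety distance.
    r_rep = 0.0
    for d in obstacle_dists:
        if d < safe_dist:
            r_rep -= k_rep * (1.0 / max(d, 1e-3) - 1.0 / safe_dist)
    return r_att + r_rep
```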

The rest of this paper is structured as follows. Section 2 reviews related work. Section 3 presents the autonomous path planning model for unmanned ships based on DRL. Section 4 presents the simulation experiments and a comparative analysis of the results. Finally, Section 5 concludes the paper.

#### **2. Related Research**

At present, research on autonomous path planning for unmanned ships has been carried out both domestically and abroad. The methods include traditional algorithms, such as the APF, the velocity obstacle method, and the A\* algorithm; intelligent algorithms, such as the ant colony optimization algorithm, the genetic algorithm, and neural network algorithms; and related DRL algorithms.

In terms of traditional algorithms, Petres, C et al. [19] used the APF to construct a virtual gravitational field that guides autonomous surface vehicles (ASV) to the target point. The ASV avoids obstacles and performs path planning in a complex navigation environment by transforming navigation-restricted areas into virtual obstacle areas. The APF has the advantages of high computational efficiency and algorithmic simplicity, but reasonable potential field parameters must be set to avoid falling into local minima. Combined with the COLREGS, the practicality and effectiveness of the algorithm can be guaranteed during autonomous path planning for unmanned ships [20]. Kuwata, Y et al. [21] adopted the velocity obstacle method together with COLREGS and presented a path planning method that allows unmanned surface vehicles (USV) to navigate safely in dynamic, cluttered environments; the experimental results show good obstacle avoidance and path planning. Campbell, S et al. [22] presented a real-time USV path planning method based on an improved A\* algorithm, integrated with a decision-making framework and combined with COLREGS. The results show that the method achieves real-time path planning for USVs in complex navigation environments. However, the A\* algorithm relies on the design of the grid map, and the size and number of grid cells directly affect the speed and accuracy of the algorithm. Xue, Y et al. [23] introduced a path planning method for unmanned ships based on the APF combined with COLREGS. The experiments showed that the method can effectively realize path search and collision avoidance in complex environments, but it has difficulty with autonomous path planning and collision avoidance in unknown, navigation-restricted environments.

In addition, many intelligent algorithms, such as genetic algorithms, ant colony optimization algorithms, and neural network algorithms, have been applied to the autonomous path planning problem of unmanned ships. For example, Vettor, R et al. [24] used an optimized genetic algorithm, seeding the initial population with environmental information, to obtain a navigation path satisfying the requirements. Lazarowska, A et al. [25] used the ant colony optimization algorithm to transform ship path planning and collision avoidance into a dynamic optimization problem, with collision risk and range loss as the objective function; the optimal planned path and collision avoidance strategy were obtained based on motion prediction of dynamic obstacles. Xin, J et al. [26] adopted an improved genetic algorithm that increases the number of offspring through multi-domain inversion. The results show that the algorithm achieves a desirable balance between path length and time cost, with a shorter optimal path, faster convergence, and better robustness. Xie, S et al. [27] proposed a predictive collision avoidance method for under-actuated surface ships based on an improved beetle antennae search (BAS) method; a predictive optimization strategy for real-time collision avoidance is established, with COLREGS as a constraint and with safety and economic cost minimized. Simulation experiments verify the effectiveness of the improved BAS method. However, such intelligent algorithms are usually computationally intensive; they are mainly used for offline global path planning or auxiliary decision making, and they are difficult to apply to real-time ship action decision problems.

In the field of intelligent ships, applying DRL to the control of unmanned ships has gradually become a new research direction. For example, Chen, C et al. [18] introduced a path planning and maneuvering method for unmanned cargo ships based on Q-learning. The method learns an action reward model and obtains the best action strategy; after enough rounds of training, the ship can find a suitable path or navigation strategy by itself. Fu, K et al. [28] proposed a ship rotation detection model based on a feature fusion pyramid network and DRL (FFPN-RL); by applying a Dueling Q network to the inclined ship detection task, autonomous guidance and docking of the ship is realized. Yang, J et al. [29] presented autonomous navigation control for unmanned ships based on the relative value iterative gradient (RVIG) algorithm and designed the ship's navigation environment using the Unity3D game engine. The simulation results showed that the unmanned ship could successfully avoid obstacles and reach its destination in a complex environment. Shen, H.Q et al. [30] offered a method based on the Dueling DQN algorithm for automatic collision avoidance of multiple ships, combining ship manoeuvrability, crew experience, and COLREGS to verify the path planning and collision avoidance capability of unmanned ships. Zhang, R.B et al. [31] proposed a behavior-based USV local path planning and obstacle avoidance method built on the on-policy Sarsa algorithm and tested it in a real marine environment. Wang, Y et al. [32] introduced a USV course tracking control scheme based on the DDPG algorithm and achieved good experimental results. Zhang, X et al. [33] proposed an adaptive navigation method for maritime autonomous surface ships based on hierarchical DRL, trained in combination with ship maneuverability and the COLREGS navigation rules; the results show that the method can effectively improve navigation safety and avoid collisions. Zhao, L et al. [34] used the proximal policy optimization (PPO) algorithm, combined with a ship motion model and navigation rules, to propose an autonomous collision avoidance model for unmanned ships in a multi-ship environment. The experimental results show that the model obtains time-efficient, collision-free paths for multiple ships and adapts well to unknown complex environments. In addition, because the human factor comprises the navigators' subjectivity and the indeterminacy of the navigation situation, actual ship control usually has a game-theoretic character [35]. DRL overcomes the shortcoming of the usual intelligent algorithms, which require a certain number of samples, and at the same time achieves smaller errors and shorter response times.

Many important autonomous path planning methods have been proposed in the field of unmanned ships. However, these methods mainly focus on small and medium-sized USVs; research on unmanned ships is relatively rare, and few studies have applied DDPG to unmanned ship path planning. This paper chooses DDPG for unmanned ship path planning because of its powerful deep neural network function-fitting ability and its better generalized learning ability. The fitting ability allows the motion of the ship to be approximated with high precision, and the learning ability enables the ship to produce action outputs with human-like characteristics in its decision behavior, which helps in further developing human-like traffic models. In addition, the outputs of algorithms such as Q-learning and DQN are discrete behaviors, which leads to discontinuous and incomplete actions. DDPG has the advantages of fast convergence and a continuous action space. Compared with traditional path planning, the method proposed in this paper yields more continuous action output and smaller decision errors when the unmanned ship is sailing.

#### **3. Construction of Autonomous Path Planning Model for Unmanned Ships Based on DRL**

#### *3.1. Deep Reinforcement Learning*

As an important field of machine learning, DL can extract high-precision feature samples from raw input data. It has greatly advanced visual object recognition, object detection, and speech recognition. DL uses deep neural networks (DNN) to automatically learn the features of high-dimensional data. Its core idea is to update the network parameters through backpropagation and thereby discover the distributed characteristics of the data during training.

RL, another field of machine learning, is widely used in robot control, game simulation, and scheduling optimization. The purpose of RL is to maximize the cumulative reward obtained by the agent during training and to learn the optimal strategy for achieving the goal. Instead of requiring a large amount of labeled training data, the agent interacts directly with the environment and evaluates the value of its actions. The principle is shown in Figure 1. The agent observes the state *s<sub>t</sub>* from the environment and takes action *a<sub>t</sub>* based on a certain policy. The environment then feeds back a reward *r<sub>t</sub>* for the executed action and moves to the new state *s<sub>t+1</sub>*. The agent uses the new state and the reward *r<sub>t+1</sub>* to select the next action, repeating this loop to maximize the long-term accumulated reward.
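The interaction loop in Figure 1 can be written compactly. The sketch below assumes a minimal environment interface (`reset`/`step`) and a placeholder `policy` function; both names are illustrative assumptions, not part of the paper.

```python
def run_episode(env, policy, max_steps=1000):
    """One agent-environment interaction episode, following Figure 1:
    observe s_t, act a_t, receive r_t, move to s_{t+1}."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                        # agent picks a_t from s_t
        next_state, reward, done = env.step(action)   # environment feedback r_t
        total_reward += reward                        # accumulate long-term return
        state = next_state                            # move on to s_{t+1}
        if done:
            break
    return total_reward
```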

DRL is an end-to-end learning method that combines DL and RL, exploiting the perception ability of DL and the decision-making ability of RL [36]. It has made great progress in continuous motion control and can effectively overcome the shortcomings of traditional path planning for unmanned ships. DDPG is a DRL algorithm suited to continuous action space problems. In its name, "deep" refers to the deep network structure, and "policy gradient" refers to the policy gradient algorithm, which randomly selects actions in the continuous action space according to the learned strategy (action distribution). "Deterministic" means the policy gradient avoids this random selection and outputs a specific action value. DDPG is gradually being applied to the transportation sector, especially to driverless cars and unmanned ships. These vehicles (buses, cars, boats, cargo ships, etc.) require continuous motion control as well as adherence to traffic rules and human operating experience to ensure driving safety. DDPG can solve these problems well with its powerful self-learning and function-fitting abilities, and it therefore has broad prospects and room for expansion in the maritime domain.

**Figure 1.** The Principle of Reinforcement Learning.

#### *3.2. The Principle of DDPG Algorithm*

#### 3.2.1. AC Algorithm

DDPG is based on the AC algorithm, whose structure is shown in Figure 2.

**Figure 2.** Actor-Critic (AC) algorithm structure.

The network structure of the AC framework includes a policy network and an evaluation network. The policy network is called the Actor network and the evaluation network is called the Critic network. The Actor network selects the actions (in DDPG, it outputs the action itself), and the Critic network evaluates the merits and demerits of the selected actions by calculating a value function.

The Actor network and the Critic network are two separate networks that share state information. The Actor network uses the state to generate an action; the environment executes the action and feeds back a reward. The Critic network uses the state and reward to estimate the value of the current action and continually adjusts its value function. Meanwhile, the Actor network updates its action strategy in the direction that improves the value of the action. In this cycle, the Critic network evaluates the action strategy via its value function, giving the Actor network a better gradient estimate and eventually yielding the optimal action strategy. The Critic's evaluation of the action strategy is very important, as it promotes the convergence and stability of the Actor network. These features ensure that the AC algorithm can obtain the optimal action strategy with low-variance gradient estimates.
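A minimal PyTorch sketch of the two networks follows, shown in the deterministic continuous-action form that DDPG later builds on (in the classic AC algorithm the Actor instead outputs action probabilities). The layer sizes and activations are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps a state to a bounded continuous action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())  # action in [-1, 1]

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Evaluation network: maps a (state, action) pair to a scalar value."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```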

#### 3.2.2. DDPG Algorithm Structure

Figure 3 shows the structure of DDPG algorithm.

**Figure 3.** Deep Deterministic Policy Gradient (DDPG) algorithm structure.

The DDPG algorithm takes the initial state information as input, and the output is the action strategy μ(*s<sub>t</sub>*) computed by the algorithm. Random noise is then added to the action strategy to obtain the final output action; this process is a typical end-to-end learning mode. When starting the task, the agent outputs an action according to the current state *s<sub>t</sub>*. To verify the validity of the output action, a reward function is designed and the action is evaluated, yielding a feedback reward *r<sub>t</sub>* from the environment: an action that helps the agent achieve the goal receives a positive reward, while a harmful one receives a negative reward. The current state, the action, the reward, and the state at the next time step (*s<sub>t</sub>*, *a<sub>t</sub>*, *r<sub>t</sub>*, *s<sub>t+1</sub>*) are then stored in the experience buffer pool. The neural network trains on this experience by randomly extracting sample data from the buffer pool, continuously adjusting the action strategy, and updating the network parameters by gradient descent, further enhancing the stability and accuracy of the algorithm.
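A hedged sketch of the experience buffer pool described above is given below; the capacity and batch size are assumed typical values, not the paper's settings.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s_t, a_t, r_t, s_{t+1}) transitions and samples random
    mini-batches, which decorrelates the training data."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experience is evicted

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        # Random extraction breaks the temporal correlation between samples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```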

To further enhance the stability and effectiveness of network training, DDPG combines the AC algorithm with DQN, making it better suited to problems with continuous state and action spaces. DDPG uses DQN's experience replay memory and target network to solve the non-convergence problem that arises when a neural network approximates the value function. DDPG divides the network structure into an online network and a target network. The online network outputs actions in real time, evaluates actions, and updates its parameters through online training; it comprises the online Actor network and the online Critic network. The target network comprises the target Actor network and the target Critic network, which are used to compute the targets for updating the value and policy networks but are not themselves trained online. The target network has the same neural network structure and initialization parameters as the online network. During training, the parameters of the target Actor and target Critic networks are updated by slow tracking (soft replace), rather than by directly copying the parameters of the online Actor and online Critic networks, which further enhances the stability of the training process. The following formulas describe how the parameters are updated.

$$\begin{cases} \theta^{Q'} = \tau \theta^Q + (1 - \tau) \theta^{Q'} \\ \theta^{\mu'} = \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'} \end{cases} \tag{1}$$

where θ<sup>μ</sup> and θ<sup>Q</sup> are the parameters of the online Actor network and the online Critic network, θ<sup>μ'</sup> and θ<sup>Q'</sup> are the parameters of the target Actor network and the target Critic network, and τ ≪ 1 is the approximation coefficient. Figure 4 is the flow chart of the DDPG algorithm.

**Figure 4.** DDPG algorithm flow chart.
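The soft replace of Equation (1) amounts to the following parameter update, sketched here for PyTorch modules; the value of τ is an assumed typical choice, not the paper's setting.

```python
import torch

@torch.no_grad()
def soft_replace(target_net, online_net, tau=0.005):
    """Soft update per Equation (1):
    theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for t_param, o_param in zip(target_net.parameters(),
                                online_net.parameters()):
        t_param.mul_(1.0 - tau).add_(tau * o_param)
```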

#### *3.3. Structure Design of Autonomous Path Planning Model for Unmanned Ships Based on DDPG Algorithm*

The DDPG algorithm combines deep learning and reinforcement learning, and this paper designs an autonomous path planning model for unmanned ships based on it. The algorithm structure mainly includes the AC algorithm, the experience replay mechanism, and neural networks. By using the AC algorithm to output and judge the ship's action strategy, the model's output action becomes progressively more accurate. The experience replay mechanism memorizes the historical information generated while the DDPG algorithm runs in the experimental environment, and the data are randomly extracted for training and learning. Because the numbers of environmental states and ship behavior states encountered in the experiment are large, neural networks are needed for fitting and generalization. The current state of the unmanned ship obtained from the environment is used as the input of the neural network, and the output is the Q value of the actions that the ship can perform in the current state. By continuously training and adjusting the neural network parameters, the model learns the optimal action strategy for the current state.
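Putting the pieces together, the sketch below shows one training step of this kind of model, reusing the `Actor`, `Critic`, `ReplayBuffer`, and `soft_replace` sketches given earlier. The discount factor, batch size, and τ are assumed typical values; the paper's concrete state representation, reward, and hyperparameters are specified in the following subsections.

```python
import torch
import torch.nn.functional as F

def train_step(actor, critic, target_actor, target_critic,
               buffer, actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG update: fit the Critic to its Bellman target, then move
    the Actor in the direction that increases the Critic's value."""
    batch = buffer.sample(batch_size=64)
    s, a, r, s2 = (torch.as_tensor(x, dtype=torch.float32)
                   for x in map(list, zip(*batch)))
    r = r.unsqueeze(-1)

    # Critic update: target y = r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        y = r + gamma * target_critic(s2, target_actor(s2))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: maximize Q(s, mu(s)) by minimizing its negative.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Slowly track the online networks with the targets (Equation (1)).
    soft_replace(target_actor, actor, tau)
    soft_replace(target_critic, critic, tau)
```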
