• Action control strategy

In terms of the action control strategy, one output action of the Actor network indicates the heading deflection *a<sub>steering</sub>*, and the *Tanh* activation function limits the output value to the range [−1, 1]. *D*<sub>max</sub> denotes the maximum deflection angle, whose value lies in [0°, 35°], and *steering* represents the actual heading deviation applied to the ship. The calculation formula is as follows:

$$\text{steering} = \text{Tanh}(a\_{\text{steering}}) \cdot D\_{\max} \tag{2}$$

The other output action of the Actor network is the ship speed increment *a<sub>shifting</sub>*, which is also processed by the *Tanh* activation function, giving an output range of [−1, 1]. *V*<sub>max</sub> is the maximum value of the ship's speed change, whose value lies in [0, 15] kn, and *shifting* is the actual change in ship speed. The calculation formula is as follows:

$$\text{shifting} = \text{Tanh}(a\_{\text{shifting}}) \cdot V\_{\max} \tag{3}$$
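As a minimal sketch, the two scaling formulas (2) and (3) can be written as follows; the function and parameter names are illustrative, not taken from the paper's code:

```python
import math

def scale_actions(a_steering, a_shifting, d_max=35.0, v_max=15.0):
    """Map raw Actor outputs to physical commands.

    a_steering, a_shifting: raw network outputs (any real value).
    d_max: maximum heading deflection in degrees (paper: 35 deg).
    v_max: maximum speed change in knots (paper: 15 kn).
    """
    steering = math.tanh(a_steering) * d_max   # Eq. (2): range [-35, 35] deg
    shifting = math.tanh(a_shifting) * v_max   # Eq. (3): range [-15, 15] kn
    return steering, shifting
```

Because *Tanh* saturates at ±1, even unbounded network outputs are squashed into the physical limits before being sent to the ship.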

• Action exploration strategy

In terms of the action exploration strategy, since DDPG is an off-policy algorithm, effective exploration of the continuous control space enables the unmanned ship to find better actions. Exploration is realized by adding random noise to the output action of the neural network. The random noise is defined as follows:

$$
\mu'(\mathbf{s}\_t) = \mu(\mathbf{s}\_t) + \mathbb{N}\_t \tag{4}
$$

where μ is the exploration strategy and *N<sub>t</sub>* is the added random noise.

The Ornstein–Uhlenbeck (OU) process is a sequentially correlated process. It is often used as the random noise in the DDPG algorithm and performs well in continuous action spaces. In this paper, it is used as the random noise of the unmanned ship's action strategy and is defined as:

$$d\mathbf{x}\_t = \theta(\mu - \mathbf{x}\_t)dt + \sigma dW\_t \tag{5}$$

In the above formula, θ is the rate at which the variable reverts to the mean, μ is the action mean value, σ is the fluctuation degree of the random process, and *W<sub>t</sub>* is the Wiener process.
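A discretized version of the OU process in Equation (5) can be sketched as below. The parameter values follow the training setup given later (μ = 0, θ = 0.6, σ = 0.3); the class name and the unit time step `dt` are assumptions for illustration:

```python
import random

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process, Eq. (5):
    x_{t+1} = x_t + theta*(mu - x_t)*dt + sigma*sqrt(dt)*N(0, 1)."""

    def __init__(self, mu=0.0, theta=0.6, sigma=0.3, dt=1.0, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = mu                      # start at the mean
        self.rng = random.Random(seed)

    def sample(self):
        # Mean-reverting drift plus Wiener-process increment.
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0))
        self.x += dx
        return self.x
```

Because successive samples are correlated, the noise produces smooth, physically plausible perturbations of the rudder and speed commands rather than independent jitter.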

#### 3.3.4. Data Preparation

Based on the ship AIS system, this paper collects the real AIS data of 50 ships on the Dalian–Yantai route from August 1 to 15, 2018. Analysis of the data yields complete ship navigation records, including longitude, latitude, speed over ground (SOG), course over ground (COG), heading (HDG), and other information. Table 1 shows the main AIS fields.

**Table 1.** Automatic identification system (AIS) data attribute table.

| ID | Data Type | Data Value | Data Source |
|----|-----------|------------|-------------|
| 1 | Longitude | 121°30.43 E | AIS |
| 2 | Latitude | 37°47.30 N | AIS |
| 3 | SOG | 17.5 kn | AIS |
| 4 | COG | 10.70° | AIS |
| 5 | HDG | 10.00° | AIS |
| 6 | DRIFT | 0.7° | AIS |

This paper mainly selects the longitude, latitude, heading, and speed of each ship as the input parameters of the model. Because the data are collected in real time, the trajectory of each ship can be clearly displayed. The total mileage is 133,550 nautical miles and the data volume is around 3 GB. The data of 20 ships are randomly selected from the 50 ships as the input data for model training. These data were normalized prior to training to speed up training and improve model accuracy.
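The normalization step is not specified in detail; a common choice, sketched here under the assumption of min-max scaling, maps each raw input field to [0, 1]:

```python
def min_max_normalize(values):
    """Scale a list of raw AIS readings (e.g. all SOG values of one
    ship) to [0, 1] by min-max normalization - an assumed, standard
    interpretation of the normalization described in the text."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: map to 0
    return [(v - lo) / (hi - lo) for v in values]
```

Each input field (longitude, latitude, heading, speed) would be normalized independently before being fed to the network.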

#### 3.3.5. Design of Reward Function

To fully explore the environment and obtain the optimal action strategy, the unmanned ship must, on the one hand, maintain its exploration ability so as to discover better actions and, on the other hand, exploit the actions it has learned to obtain more reward from the environment.

The reward function is also known as the immediate reward or reinforcement signal *R*. When an unmanned ship performs an action, the environment returns feedback that evaluates the quality of that action. The reward function is designed from the environment and the decision maker. It is usually a scalar, with positive values as reward and negative values as punishment. For the practicability of the algorithm, obeying the navigation rules during the autonomous path planning of unmanned ships is essential.

According to the COLREGS, ship encounters can generally be divided into three types: head-on situation, crossing situation, and overtaking situation. Figure 7 shows the ship encounter situations.

**Figure 7.** Ship encounter situation chart.

When visibility at sea is good, ship encounters are classified according to COLREGS, as shown in Figure 7. Region A covers (0°, 5°) and (355°, 360°) and is the head-on situation; the ship should take a steering action of more than 15° to the right. Region B is a crossing situation covering (247.5°, 355°). Region C is the overtaking situation, covering (112.5°, 247.5°); the overtaken ship does not usually take action. Region D is a crossing situation covering (5°, 112.5°); the ship should turn right. In addition, when visibility at sea is restricted, there is no division of responsibility between the stand-on vessel and the give-way vessel. For this case, COLREGS prescribes the following: for ships approaching within (0°, 90°) and (270°, 360°), the own ship turns right; for ships approaching within (90°, 180°) and (180°, 270°), the own ship turns toward the other ship. This article mainly studies the situation of good visibility at sea.
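Using the sector limits quoted above for good visibility, the encounter classification reduces to a simple relative-bearing test. The function name, label strings, and boundary handling are illustrative:

```python
def encounter_region(relative_bearing):
    """Classify an approaching target by its relative bearing
    (degrees, measured clockwise from own bow), using the sector
    limits given in the text for good visibility."""
    b = relative_bearing % 360.0
    if b < 5.0 or b > 355.0:
        return "A: head-on"                  # both turn right > 15 deg
    if 5.0 <= b <= 112.5:
        return "D: crossing (give-way)"      # target on starboard side
    if 112.5 < b <= 247.5:
        return "C: overtaking"               # own ship is being overtaken
    return "B: crossing (stand-on)"          # (247.5, 355): target to port
```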

To integrate COLREGS and the crew's experience into the model, the rules need to be mathematically quantified and converted into navigational restrictions. A navigation-limit area is set when the distance between the own ship and another ship is less than six nautical miles; otherwise it is not set. The following describes the rule conversion in the three scenarios:

1. Head-on situation. In this case, both ships should turn to the right to avoid each other. The rule is converted as follows: a navigational limit line, three ship lengths long and rotated 10° clockwise from the bow direction, is drawn at the bow of each ship. If the ship crosses the navigation limit line, the environment punishes it; specifically, the ship receives a negative reward.


For the conversion of the crew's experience, this paper adds a virtual area around obstacles, based on that experience, so that the ship can take avoiding action in advance. The obstacle area is set when the distance between the ship and the obstacle is more than 1.5 times the ship's length; otherwise it is not set. As with embedding COLREGS, the crew's experience is converted into a restricted area. To simplify the calculation and unify the criteria, the obstacle is approximated by a regular shape (a circle). The virtual obstacle area is set as follows: the center of the obstacle is the center of the circle, and the longer side of the obstacle is used as the radius. Once the ship enters the obstacle area, the environment punishes it; the penalty the ship receives is the inverse of the distance between the ship and the obstacle. The closer the ship is to the obstacle, the stronger the punishment; conversely, the farther away, the weaker. The reward changes dynamically until the ship leaves the obstacle area.

By setting the navigation restriction area, the ship's requirements for driving according to COLREGS and crew's experiences are realized, thereby constraining and guiding the ship to select correct and reasonable behavior. The navigation rules are converted into navigational restrictions through the quantitative treatment of COLREGS and crew's experiences. When the ship crosses a restricted navigation area or collides with an obstacle, it will be punished with a negative value. When the ship reaches the target point, the return value is positive. At the same time, the goal-based attraction strategy is adopted, and the reward is positively correlated with the distance between the ship and the target at the adjacent time. The reward function is designed, as follows:

$$R\_t = \begin{cases} 2, & d\_{t-goal} < D\_{g-\min} \\ -1, & d\_{t-obs} < D\_{o-\min} \\ (d\_{t'-goal} - d\_{t-goal}) - 0.1, & \text{other} \end{cases} \tag{6}$$

In the above formula, *d<sub>t−goal</sub>* is the distance between the ship and the target point at the current time *t*, *d<sub>t′−goal</sub>* is the distance between the ship and the target point at the previous time *t′*, *d<sub>t−obs</sub>* is the distance between the ship and the obstacle at the current time *t*, *D<sub>g−min</sub>* is the minimum threshold for the ship to reach the target point, and *D<sub>o−min</sub>* is the minimum dangerous-distance threshold between the ship and the obstacle.

During the training process, when the ship reaches the target range, that is, *d<sub>t−goal</sub>* < *D<sub>g−min</sub>*, the reward is set to 2. When it reaches the obstacle range, that is, *d<sub>t−obs</sub>* < *D<sub>o−min</sub>*, the reward is set to −1. In all other cases, the attraction strategy draws the ship toward the target point as quickly as possible in order to maximize the return of the action. Whether the ship is approaching the target is determined by subtracting the current distance to the target from the distance at the previous time. Since the simulation environment is an 800 × 600 pixel two-dimensional map, 1 pixel represents 10 m in the actual environment. Here, (*d<sub>t′−goal</sub>* − *d<sub>t−goal</sub>*) is normalized: the distance difference is divided by the length of the diagonal of the two-dimensional map, so the resulting reward value lies in [0, 1]. At the same time, 0.1 is subtracted at each step to reduce the number of steps needed to reach the target point and avoid redundancy in the planned path. The reward is thus calculated as (*d<sub>t′−goal</sub>* − *d<sub>t−goal</sub>*) − 0.1. The experiment consists of many training rounds. The current round ends when the ship reaches the target point. If the ship collides with an obstacle, it is stepped back and a new action is selected. The maximum number of steps per round is set to 400; when it is reached, the current round ends and the next round begins.
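Equation (6) together with the diagonal normalization described above can be sketched as follows. The thresholds `d_g_min` and `d_o_min` are illustrative values in pixels, since the paper does not state them numerically:

```python
def reward(d_goal, d_goal_prev, d_obs, d_g_min=50.0, d_o_min=30.0):
    """Reward function of Eq. (6), distances in map pixels.
    d_goal / d_goal_prev: distance to target at current / previous step;
    d_obs: distance to the nearest obstacle.
    The thresholds are illustrative, not taken from the paper."""
    diagonal = (800.0 ** 2 + 600.0 ** 2) ** 0.5  # map diagonal = 1000 px
    if d_goal < d_g_min:
        return 2.0                                # reached the target
    if d_obs < d_o_min:
        return -1.0                               # entered the obstacle range
    # Progress toward the goal, normalized to [0, 1], minus the step cost.
    return (d_goal_prev - d_goal) / diagonal - 0.1
```

The constant −0.1 per step makes every extra step costly, so shorter paths accumulate more total reward.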

## *3.4. Model Execution Process*

The path planning model of unmanned ships based on the DDPG algorithm abstracts the real, complex environment into a simple virtual environment. At the same time, the action strategy of this model is applied to the electronic chart platform environment, so as to obtain the optimal planned track of the unmanned ship and realize end-to-end learning in the real environment. Figure 8 shows the execution flow of the autonomous path planning model for unmanned ships based on the DDPG algorithm.

**Figure 8.** Execution process of unmanned ships path planning model.

The execution process of the model for unmanned ships is described, as follows:


cycle through it. If it is the end state, it indicates that the unmanned ships have completed the path planning task, and the model ends the calculation and the invocation.

In this section, first, the unmanned ship uses the obtained state information as the input to the algorithm. Second, the action strategy is obtained through training based on the navigation restrictions and the reward function. Subsequently, the parameters of the model are updated with historical experience data until the cumulative return is maximized. Finally, the ship executes the action and changes its state, and it determines whether to end the execution process by judging the current state.

#### **4. Simulation Results and Experimental Comparison**

#### *4.1. Definition of Model Input and Output Parameters*

The input parameters of the model are ship data obtained from ship navigation information fusion module, which mainly includes ship's own state, target point information, and surrounding information. The output of the model is the execution action of unmanned ship, including the deflection heading and the speed increment. Table 2 shows the input parameters names and definitions used in the model.


**Table 2.** Input state definition.

Here, the angle between the current heading of the ship and the target point is designed as an input parameter, so that the unmanned ship can be driven quickly toward the target point and the model training period is prevented from becoming too long. At the same time, the obstacle risk within a range of 1000 m is calculated and stored in the set Tracks as the decision-making basis for ship obstacle avoidance. The set of distances to obstacles is populated according to the training situation, on the premise that there is a danger of collision, i.e., that the own ship's heading extension line intersects the other ship's heading extension line. In a single-ship encounter, Tracks holds only the distance information of that ship; in a multi-ship encounter, Tracks holds the distance information of the most dangerous other ship. The most dangerous other ship is selected as follows: when the heading extension lines of several other ships intersect the own ship's heading extension line, the smallest distance between the own ship and those ships is stored in Tracks. Finally, the above input states are each normalized to [0, 1], improving the computational efficiency of the model. The heading deflection Steering and speed increment Shifting are the output actions of the model, realizing a direct conversion from input state to output action. Table 3 shows the output parameter definitions.
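The selection of the most dangerous other ship described above can be sketched as below; the `(distance, crosses_own_course)` tuple layout is an assumed representation for illustration, not the paper's data structure:

```python
def most_dangerous_distance(tracks):
    """Pick the distance to store in Tracks for a multi-ship encounter:
    among other ships whose heading extension line crosses the own
    ship's (i.e. there is a collision risk), keep the smallest distance.

    tracks: list of (distance_m, crosses_own_course) pairs.
    Returns None when no ship poses a collision risk."""
    risky = [d for d, crosses in tracks if crosses]
    return min(risky) if risky else None
```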


According to the actual navigation situation, the maximum heading deflection of the unmanned ship is set to 35°. Because the ship steers slowly, it may receive a new steering action before completing the currently required deflection angle. Therefore, the same-action accumulation method is used to calculate the heading deflection angle *d<sub>ψ</sub>*, which is obtained by Formula (7).

$$d\_{\psi} = \sum\_{i=t}^{t+k-1} a\_i \tag{7}$$

where *a<sub>i</sub>* represents the deflection angle generated by the model at time *i*, and *k* indicates that the same action is performed *k* (*k* > 0) times in succession.

In addition, since the ship's heading at the previous moment ψ*<sub>p</sub>* is known, the current heading ψ*<sub>d</sub>* can be obtained by summing the previous heading and the deflection angle. The current heading of the unmanned ship is obtained by Formula (8).

$$
\psi\_d = \psi\_p + d\_{\psi} \tag{8}
$$
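Formulas (7) and (8) can be combined into one small helper. Wrapping the result into [0°, 360°) is an added assumption for illustration, not stated in the text:

```python
def current_heading(psi_prev, deflections):
    """Eqs. (7)-(8): the heading deflection d_psi is the sum of the k
    identical deflection commands issued while the slow rudder turns,
    and the new heading is the previous heading plus that sum.
    The modulo wrap to [0, 360) is an added assumption."""
    d_psi = sum(deflections)            # Eq. (7): accumulate k commands
    return (psi_prev + d_psi) % 360.0   # Eq. (8)
```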

#### *4.2. Model Training Process*

Python and an electronic chart platform are used for model training and simulation experiments in order to verify the validity and feasibility of the unmanned ship autonomous path planning model based on the DDPG algorithm. The model is trained with the deep learning framework TensorFlow and implemented in Python, so the convergence and accuracy of training can be observed directly. The simulation environment is an electronic chart platform designed and developed with Visual Studio 2013. The state space of the simulation environment is a two-dimensional map of 800 × 600 pixels, and the range of motion of the ship is set to the size of this map; the ship is considered to have collided if it crosses the map boundary. This paper first uses the DDPG algorithm to train a global route from the start point to the end point, and then trains on encounters with dynamic obstacles (other ships) during the voyage. Since planning the global static route is not the focus of this paper, we mainly study how the ship avoids obstacles and reaches the end point.

After many experiments, a well-performing neural network structure and parameter set were designed. The Actor and Critic networks both use fully connected neural networks with two hidden layers. The Actor network hyperparameters are set as follows: the numbers of neurons in the hidden layers are 300 and 600, respectively; the learning rate is 10<sup>−4</sup>; the output-layer actions are the heading deflection and the speed increment, with different activation functions selected according to the output ranges. The heading deflection uses the *tanh* activation function with output range [−1, 1]; the speed increment uses the *sigmoid* activation function with output range [0, 1]. The Critic network hyperparameters are set as follows: the number of neurons per hidden layer is 200; the learning rate is 10<sup>−3</sup>; and the discount factor γ of the reward function is set to 0.9. In addition, the experience buffer pool size is set to 5000 and the batch size is 64. The random noise uses the method in Equation (5), with μ = 0, θ = 0.6, and σ = 0.3. The rate of the soft update is τ = 0.01. The Actor and Critic networks both use the Adam optimizer. To prevent unbounded training, the maximum number of steps per round is set to 400, for a total of 300 rounds. The parameters of the Actor network and the Critic network are updated every 1000 steps to further improve the accuracy of the model during training. Based on the above training conditions and parameters, this paper trains the path planning model of the unmanned ship. The training process and results are described as follows.
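As an illustration of the Actor structure just described (hidden layers of 300 and 600 neurons, *tanh* for the heading output and *sigmoid* for the speed output), a minimal untrained forward pass can be sketched in NumPy. The ReLU hidden activation and the weight initialization are assumptions; the paper trains the actual network with TensorFlow and Adam:

```python
import numpy as np

def make_actor(state_dim, seed=0):
    """Minimal NumPy sketch of the Actor forward pass with the layer
    sizes from the text. Weights are random here; the paper learns
    them with the Adam optimizer at a learning rate of 1e-4."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0, 0.1, (state_dim, 300)); b1 = np.zeros(300)
    w2 = rng.normal(0, 0.1, (300, 600));       b2 = np.zeros(600)
    w3 = rng.normal(0, 0.1, (600, 2));         b3 = np.zeros(2)

    def forward(state):
        h1 = np.maximum(0.0, state @ w1 + b1)      # hidden layer 1 (ReLU assumed)
        h2 = np.maximum(0.0, h1 @ w2 + b2)         # hidden layer 2
        raw = h2 @ w3 + b3
        steering = np.tanh(raw[0])                 # tanh -> [-1, 1]
        shifting = 1.0 / (1.0 + np.exp(-raw[1]))   # sigmoid -> [0, 1]
        return steering, shifting

    return forward
```

The two normalized outputs would then be rescaled to physical units by *D*<sub>max</sub> and *V*<sub>max</sub> as in Equations (2) and (3).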

Figure 9 shows the number of training steps per round during the autonomous path planning process. In roughly the first 40 rounds, the training steps per round reach the maximum, indicating that the model triggers the termination condition without completing a path or falls into a local obstacle area. Between the 45th and 110th rounds, the steps per round begin to decrease, averaging around 50, indicating that the unmanned ship learns more and more action strategies and can plan a complete path independently. After about the 115th round, the steps per round stay below 40, indicating that the unmanned ship has fully learned the optimal action strategy and faces no collision risk. Near the 275th round, the number of steps increases because the exploration strategy of the algorithm makes the unmanned ship attempt random actions.

**Figure 9.** Number of steps per turn.

The purpose of the model is to improve the reward of the action strategy through continuous interaction with the environment; the greater the cumulative reward per round, the better the learning effect. Figure 10 shows the cumulative reward of each round. In the first 40 rounds, the reward per round is low and fluctuates, indicating that the unmanned ship has not found the correct path and keeps trying new action strategies. Around the 45th round, the reward begins to increase, indicating that a path to the target point has been found. After the 115th round, the cumulative reward per round basically stays at the maximum, indicating that the unmanned ship has found the optimal action strategy. Compared with Figure 9, the trend of cumulative reward per round is consistent with the change in steps per round.

**Figure 10.** Cumulative reward per turn.

The average reward reflects the effect of the learning process and shows the degree of change in the reward more directly. Figure 11 shows the average reward of the model over every 50 rounds. As can be seen from the figure, the general trend of the average reward is upward. Around the 45th round, the growth rate of the average reward begins to slow and then gradually levels off. After the 110th round, the average reward stabilizes at a large value, indicating that the model has found the optimal action strategy.

**Figure 11.** Average reward per 50 rounds.

## *4.3. Model Integration and Simulation Experiment*

The model is tested and observed in the simulation environment in order to verify the validity and correctness of the autonomous path planning model for unmanned ships in Section 4.2. Firstly, the electronic chart platform and ship navigation rules are briefly introduced. Secondly, the model is validated in three different situations to observe whether the unmanned ship is running correctly according to the navigation rules. Finally, the other two unmanned ship path planning methods are selected as comparative experiments, and the training process and simulation results of the three methods are compared and analyzed.

#### 4.3.1. Verification Environment

It is difficult for traditional numerical simulation methods to describe the environmental information accurately, because unmanned ships operate over a wide area in a complex environment. The electronic chart [37] platform is an important navigation tool for ships and other marine vehicles. It can provide the real and complete environmental information needed in navigation, including land, ocean, water depth, obstacles, islands, etc.

In this paper, the electronic chart platform developed in C++ is used as the verification environment of the model, as shown in Figure 12. The platform has the following main features: (1) display of standard chart information at any scale; (2) use of Microsoft Foundation Classes (MFC) as a dynamic link library, providing a flexible and convenient interface; (3) setting of the motion information of the own ship and other ships, including longitude, latitude, heading, and speed; and (4) dynamic display of the tracks of all ships.

**Figure 12.** Electronic chart platform.

#### 4.3.2. Validation Results

The autonomous path planning model of unmanned ships is obtained through training, and the model is invoked by the electronic chart platform to further verify the effectiveness of the algorithm. In the electronic chart platform, the self-ship is indicated by black concentric circles and labeled Ship-1; other ships are represented by triangles and labeled Ship-i (i = 2, 3, 4, ...); the yellow circle indicates the target point of the self-ship; the purple circle indicates the target point of the other ship; static obstacles are indicated by irregular figures; and the trajectories of the ships are drawn during the movement. The motion parameters of the self-ship are defined as follows: the ship's length is *L* = 12.5 m, the ship's width is *W* = 2.1 m, the heading change range is ψ*<sub>m</sub>* = [−35°, 35°], and the speed change range is *v<sub>m</sub>* = [−15, 15] kn. All other ships adopt uniform motion parameters: length *L<sub>o</sub>* = 10.6 m and width *W<sub>o</sub>* = 1.8 m. After their motion parameters are set, they travel to the set point in uniform linear motion.

The head-on, crossing, overtaking, and multi-ship encounter situations in the ship navigation process are tested in turn to verify whether the action strategy produced by the model complies with COLREGS. The parameter information used in the experiments is introduced here uniformly. Ship-1 represents the self-ship and Ship-i (i = 2, 3, ...) represents the other ships. Heading angle denotes the initial course of the ship and Ship Speed the initial speed. The starting point and target point are expressed in latitude and longitude.

#### (1) Head-on case

Table 4 sets the motion parameters of the ships in detail. Figure 13 shows the experimental results. It can be seen from the waypoints information that the Ship-1 turned to the right according to the rules, successfully avoided the other ship, and then headed for the target position. In this case, the planned path length is 6.592 nautical miles.


**Table 4.** Setting of ship in head-on case.

**Figure 13.** Verification result of ship trajectories in the head-on case.

#### (2) Crossing case

The crossing of two ships is tested; Tables 5 and 6 show the specific ship motion parameters, and Figures 14 and 15 show the experimental results, respectively. In Figure 14, the other ship approaches from the right side; Ship-1 acts as the give-way vessel with respect to Ship-2, deflects its heading to the right according to the rule, and passes the stern of Ship-2. In this case, the planned path length is 6.663 nautical miles. In Figure 15, the other ship approaches from the left side. In this case, the planned path length is 6.462 nautical miles.

**Table 5.** Setting of two ship in crossing case 1.


**Table 6.** Setting of two ship in crossing case 2.


**Figure 14.** Verification result of ship trajectories in the crossing case 1.

**Figure 15.** Verification result of ship trajectories in the crossing case 2.

#### (3) Overtaking case

Table 7 shows the motion parameters of the two ships. From the test results and the waypoints information, Figure 16 shows that the Ship-1 is on the port side of the Ship-2, and the Ship-1 passes as the give-way vessel from the stern of the Ship-2. Ship-2 maintains heading as a stand-on vessel. In this case, the planned path length is 6.416 nautical miles.

**Table 7.** Setting of ship in overtaking case.


**Figure 16.** Verification result of ship trajectories in the overtaking case.

#### (4) Multi-ship encounter case 1

Table 8 shows the motion parameters of the three ships. Ship-1 first forms a head-on encounter with Ship-2 and then a crossing encounter with Ship-3. According to the test results and waypoint information, Ship-1 first turns right, passing the right side of Ship-2, then turns right again, passing the stern of Ship-3, and finally reaches the target point. Figure 17 shows the scenario construction of multi-ship encounter case 1, and Figure 18 shows the verification results. In this case, the planned path length is 14.436 nautical miles.


**Table 8.** Setting of ship in multi-ship encounter case 1.

**Figure 17.** Multi-ship encounter case 1 scenario construction.

**Figure 18.** Verification result of ship trajectories in multi-ship encounter case 1.

#### (5) Multi-ship encounter case 2

Table 9 shows the motion parameters of the three ships. Ship-1 and Ship-2 first form a crossing encounter situation, and then Ship-1 forms an overtaking situation with Ship-3. According to the test results and waypoint information, Ship-1 first turns right, passing the stern of Ship-2, then turns left, passing the right side of Ship-3, and finally reaches the target point. Figure 19 shows the scenario construction of multi-ship encounter case 2, and Figure 20 shows the verification results. In this case, the planned path length is 13.651 nautical miles.


**Table 9.** Setting of ship in multi ship encounter case 2.

**Figure 19.** Multi-ship encounter case 2 scenario construction.

**Figure 20.** Verification result of ship trajectories in multi-ship encounter case 2.

(6) Multi-ship encounter case 3

Table 10 shows the motion parameters of the three ships. Ship-1 and Ship-2 first form a crossing encounter situation, and then Ship-1 forms an overtaking situation with Ship-3. According to the test results and waypoint information, Ship-1 first turns left, passing the stern of Ship-2, then turns left again, passing the right side of Ship-3, and finally reaches the target point. Figure 21 shows the scenario construction of multi-ship encounter case 3, and Figure 22 shows the verification results. In this case, the planned path length is 15.264 nautical miles.

**Table 10.** Setting of ship in multi ship encounter case 3.


**Figure 21.** Multi-ship encounter case 3 scenario construction.

**Figure 22.** Verification result of ship trajectories in multi-ship encounter case 3.

#### (7) Multi-ship encounter case 4

Table 11 shows the motion parameters of the four ships. Ship-1 first forms a crossing encounter with Ship-2, then a head-on encounter with Ship-3, and finally an overtaking encounter with Ship-4. According to the test results and waypoint information, Ship-1 first turns right, passing the stern of Ship-2 and the right side of Ship-3, then turns left, passing the left side of Ship-4, and finally reaches the target point. Figure 23 shows the scenario construction of multi-ship encounter case 4, and Figure 24 shows the verification results. In this case, the planned path length is 18.713 nautical miles.

**Table 11.** Setting of ship in multi ship encounter case 4.


**Figure 23.** Multi-ship encounter case 4 scenario construction.

**Figure 24.** Verification result of ship trajectories in multi-ship encounter case 4.

The above verification tests show that an unmanned ship based on the proposed model achieves good path planning in both single-ship and multi-ship cases: it successfully avoids obstacles according to the navigation rules and finally reaches the target point. At the same time, the time and path length of the planned routes further indicate that the planned paths are shorter and more consistent with actual navigation experience.

#### *4.4. Improved Model of Autonomous Path Planning*

Although autonomous path planning based on the DDPG algorithm is implemented in Section 4.3.2, the planned path is redundant and not sufficiently smooth. Observation of the training process in Section 4.2 shows that DDPG training takes a long time, convergence is slow, and the algorithm easily falls into local iteration. This is because DDPG has no prior knowledge of the environment and can only select actions randomly in the initial learning stage; therefore, convergence is slow in complex environments. This section improves DDPG by adding the APF method, obtaining an unmanned ship path planning model based on APF-DDPG, in order to improve learning efficiency in the initial stage and speed up convergence.

The APF is a method of virtual gravitational field and repulsive field, which has been widely used in real-time path planning of robots. The basic principle is to virtualize the simulation environment, and each state point in the environment has a corresponding potential energy value. The target point generates a gravitational potential field in the virtual environment, the obstacle generates a repulsive potential field in the virtual environment, and the total field strength is obtained by superimposing the gravitational field and the repulsive field. The object approaches the target point by using the gravitational field, and the repulsion field is used to avoid the obstacle. Generally, it is calculated by the following formula:

$$U(\mathbf{s}) = U\_{a}(\mathbf{s}) + U\_{r}(\mathbf{s}) \tag{9}$$

In the above formula, *Ua*(s) is the potential energy value of the gravitational field at the point s, *Ur*(s) is the potential energy value of the repulsive field at the point s, and *U*(s) is the potential energy of the point s. *Ua*(s) and *Ur*(s) are obtained by Formulas (10) and (11), respectively.

$$U\_{a}(\mathbf{s}) = \frac{1}{2} k\_{a} \rho\_{g}^2(\mathbf{s}) \tag{10}$$

In Formula (10), *ka* is the scale factor of the gravitational field and ρ*g*(s) is the minimum distance between s and the target.

$$U\_{r}(\mathbf{s}) = \begin{cases} \frac{1}{2}k\_{r} \left(\frac{1}{\rho\_{ob}(\mathbf{s})} - \frac{1}{\rho\_{0}}\right)^{2}, & \rho\_{ob}(\mathbf{s}) < \rho\_{0} \\\ 0, & \rho\_{ob}(\mathbf{s}) \ge \rho\_{0} \end{cases} \tag{11}$$

In Formula (11), *kr* is the scale factor of the repulsion field, ρ*ob*(s) is the minimum distance between s and the obstacle, and ρ<sup>0</sup> is the influence distance of the obstacle.

For the path planning problem of unmanned ships, the immediate reward *r* can only be obtained when the destination is reached or an obstacle is encountered. This sparsity of the reward function leads to low initial learning efficiency and many iterations, producing a large number of invalid iterative search spaces, especially in large-scale unknown training environments. Therefore, the APF is constructed according to the position information of the target point and the obstacles. The potential field value of each state in the potential field then represents the maximum cumulative return *V*(*si*) of the state *si*, expressed as:

$$V(s\_i) = |\mathcal{U}(s\_i)|\tag{12}$$

In Formula (12), *U*(*si*) is the potential field value of state *si* in the virtual potential field environment, and *V*(*si*) represents the maximum cumulative return when the optimal action is taken in state *si*.
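Formulas (9)–(12) can be sketched in code as follows. This is a minimal illustration using the APF parameter values given later in this section (*ka* = 1.6, *kr* = 1.2, ρ<sup>0</sup> = 3.0); the function names are illustrative and not taken from the paper.

```python
import math

# Parameter values from Section 4.4: k_a = 1.6, k_r = 1.2, rho_0 = 3.0.
K_A, K_R, RHO_0 = 1.6, 1.2, 3.0

def attractive_potential(rho_g: float) -> float:
    """Formula (10): U_a(s) = 0.5 * k_a * rho_g(s)^2,
    where rho_g(s) is the minimum distance from state s to the target."""
    return 0.5 * K_A * rho_g ** 2

def repulsive_potential(rho_ob: float) -> float:
    """Formula (11): repulsion acts only inside the influence distance rho_0;
    rho_ob(s) is the minimum distance from state s to the obstacle."""
    if rho_ob >= RHO_0:
        return 0.0
    return 0.5 * K_R * (1.0 / rho_ob - 1.0 / RHO_0) ** 2

def total_potential(rho_g: float, rho_ob: float) -> float:
    """Formula (9): U(s) = U_a(s) + U_r(s)."""
    return attractive_potential(rho_g) + repulsive_potential(rho_ob)

def state_value(rho_g: float, rho_ob: float) -> float:
    """Formula (12): V(s_i) = |U(s_i)|, used as the maximum cumulative
    return of state s_i in the potential field environment."""
    return abs(total_potential(rho_g, rho_ob))
```

Note that the repulsive term grows without bound as ρ*ob*(s) → 0, which is what penalizes states close to obstacles, while states outside the influence distance ρ<sup>0</sup> are shaped only by the gravitational term.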

The steps of the APF-DDPG based autonomous path planning method are as follows:


When the gravitational field is added at the target point position, the unmanned ship reaches the target point faster and the selection of the action strategy becomes more stable. The experimental procedure based on the APF-DDPG method is described below.
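The paper does not spell out the exact coupling between the potential field value and the DDPG update; one common realisation, assumed here for illustration, is potential-based reward shaping, in which the sparse environment reward is augmented by the change in potential between consecutive states:

```python
GAMMA = 0.9  # discount factor, as set for all methods in Section 4.5

def shaped_reward(r: float, phi_s: float, phi_next: float) -> float:
    """Potential-based reward shaping: r' = r + gamma * Phi(s') - Phi(s).
    Phi(s) would be derived from the APF value V(s) of Formula (12);
    here it is passed in as a plain number. This coupling is an assumed
    sketch, not the paper's stated update rule."""
    return r + GAMMA * phi_next - phi_s

# A sparse reward of 0 still produces a dense learning signal whenever
# the potential changes between s and s'.
dense_signal = shaped_reward(0.0, phi_s=5.0, phi_next=6.0)
```

Under this scheme the agent receives informative feedback at every step, rather than only at the terminal states, which matches the motivation given above for adding the APF.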

The path planning experimental parameters based on the APF-DDPG are the same as the DDPG parameters in Section 4.3.2. The APF parameters are set as follows: *ka* = 1.6, *kr* = 1.2, ρ<sup>0</sup> = 3.0. In order to better compare APF-DDPG with DDPG, we choose the environments of cases (4), (5), and (7) in Section 4.3.2 as the comparison environments.

Figure 25 shows the training process of APF-DDPG. From Figure 25a, it can be seen that the number of training steps in each round begins to decline and converge in the 43rd round, and fluctuates in the subsequent training process, which is caused by the action exploration strategy of the algorithm. Figure 25b shows the reward value of each round, which begins to increase at the 43rd round and then reaches the maximum value, indicating that a better action strategy has been found at this point. Figure 25c shows the average reward value of each round, which rises rapidly from the 43rd round and then stabilizes at the maximum value.

**Figure 25.** Artificial potential field-DDPG (APF-DDPG) training process.

Compared with experimental case (4) in Section 4.3.2, the results are shown in Figures 18 and 26. The path in Figure 18 has more redundant segments and takes longer, while the path in Figure 26 is smoother, shorter, and takes less time. Similarly, compared with experimental cases (5) and (7) in Section 4.3.2, the results are shown in Figures 20, 24, 27 and 28, respectively. It can be seen that APF-DDPG is better than DDPG in both the distance and time of path planning. The comparison between DDPG and APF-DDPG shows that APF-DDPG has a better decision-making level and a faster convergence speed.

**Figure 26.** Experimental results based on APF-DDPG (a).


**Figure 27.** Experimental results based on APF-DDPG (b).

**Figure 28.** Experimental results based on APF-DDPG (c).

#### *4.5. Experimental Comparison and Analysis*

This section compares and analyzes five path planning methods: the DQN method, the AC method, the DDPG method, the Q-learning method [18], and the path planning method based on APF-DDPG. The five methods are trained to obtain the training steps, the reward, and the average reward under the same ship motion parameters and surrounding environment. The experimental process and performance data of unmanned ship autonomous path planning during the training phase are then analyzed and compared.

The five comparison methods use the same neural network structure and network parameters; the discount factor of the reward function is set to 0.9 and the experience buffer pool size is set to 5000. Among them, the Q-learning and DQN algorithms discretize the deflection heading action into 70 deflection angle values in the range [−35◦, 35◦], while the other algorithms directly output the specific action value.
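The discretization used by the Q-learning and DQN baselines can be sketched as follows. The paper does not state the spacing of the 70 angle values, so an even grid over [−35◦, 35◦] is assumed here:

```python
import numpy as np

# Assumption: the 70 discrete deflection headings for Q-learning/DQN are
# evenly spaced over [-35 deg, 35 deg]; the paper does not give the spacing.
N_ACTIONS = 70
angles = np.linspace(-35.0, 35.0, N_ACTIONS)

def action_to_angle(index: int) -> float:
    """Map a discrete action index (0 .. N_ACTIONS-1) to a deflection
    heading in degrees."""
    return float(angles[index])
```

The continuous-action methods (AC, DDPG, APF-DDPG) instead output the heading directly via the *Tanh*-scaled actor described in Section 3, which avoids the resolution limit imposed by this grid.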

Figure 29 shows the contrast experiment of autonomous path planning for unmanned ships. Figure 29a–c shows the experimental results of unmanned ship path planning based on DQN, Figure 29d–f the results based on AC, Figure 29g–i the results based on DDPG, Figure 29j–l the results based on Q-learning, and Figure 29m–o the results based on APF-DDPG. Figure 30 compares the average reward of the five methods.

**Figure 29.** Comparative experiment of different path planning methods.

**Figure 30.** Comparison of average returns of five algorithms.

(a), (d), (g), (j), and (m) in Figure 29 are the experimental results of the number of steps per round obtained using the DQN, the AC, the DDPG, the Q-learning, and the APF-DDPG, respectively. Comparing the number of execution steps per round, it can be found that the number of steps per round based on the DQN (a) starts to decrease at about the 150th round, while the number of steps per round based on the AC (d) and the DDPG (g) begins to decrease and then gradually converges around the 100th round. The number of steps per round based on the DQN does not converge to the minimum number of steps, whereas the AC and the DDPG both converge to the minimum, with the DDPG producing fewer fluctuations in the later process. In addition, the number of steps per round based on the Q-learning (j) begins to decrease around the 150th round, and the number of steps per round based on the APF-DDPG (m) begins to decrease around the 43rd round. Comparing the five methods, it can be found that the path planning method based on the APF-DDPG reaches the target point in fewer rounds and converges faster, and it is better than the other algorithms in terms of stability.

(b), (e), (h), (k), and (n) in Figure 29 are the experimental results of the reward per round obtained by the DQN, the AC, the DDPG, the Q-learning, and the APF-DDPG, respectively. The round reward based on the DQN (b) and the AC (e) starts to increase at about the 100th round and then gradually reaches the maximum reward, while the round reward based on the DDPG (h) starts to increase at about the 50th round and then gradually stabilizes at the maximum reward, indicating that the unmanned ship using DDPG finds the optimal action strategy faster. The round reward based on the DQN fluctuates continuously and does not reach the maximum reward, which indicates that the unmanned ship has not learned the optimal behavior strategy, whereas the AC reaches the maximum reward value, indicating that the optimal behavior strategy has been learned. In addition, the round reward based on Q-learning (k) remains low in the initial rounds, indicating that the correct actions have not yet been learned and the algorithm is still in the exploratory stage; the reward begins to increase gradually around the 100th round but ultimately does not reach the maximum. The round reward based on the APF-DDPG (n) begins to increase at the 43rd round and reaches the maximum reward value faster. Comparing the five experimental processes shows that the APF-DDPG achieves the maximum round reward in the least time and maintains it consistently. At the same time, the APF-DDPG is more stable than the other algorithms in the later stage, which indicates that the unmanned ship has learned the optimal behavior strategy.

(c), (f), (i), (l), and (o) in Figure 29 are the experimental results of the average reward per round obtained by the DQN, the AC, the DDPG, the Q-learning, and the APF-DDPG, respectively. Observing the average reward makes it more intuitive to judge the efficiency and accuracy of the convergence of the five algorithms. From (c), the average reward based on the DQN starts to increase at about the 110th round and tends to be stable around the 180th round. From (f), the average reward based on the AC begins to increase at the 40th round and then stabilizes around the 130th round. From (i), the average reward based on the DDPG of this paper starts to increase at about the 10th round, and a downward trend appears around the 50th round; this is because the random exploration strategy of the algorithm causes the reward of each round to fluctuate, but around the 115th round the average reward reaches the maximum. In addition, the average reward based on Q-learning (l) shows a downward trend at the beginning, indicating that it is in the exploration stage; it then increases from the 86th round but finally does not reach the maximum average reward. From (o), the average reward based on the APF-DDPG increases at the 43rd round and reaches the maximum average reward value faster, converging faster than the other algorithms. Comparing the five experimental processes shows that the APF-DDPG achieves the maximum average reward in the least time and maintains it consistently.

Figure 30 compares the average reward of the five path planning methods. It can be seen from the figure that the Q-learning and DQN methods, which are based on a discrete action space, converge slowly, starting to increase gradually in the 80th to 100th rounds and failing to reach the maximum reward value. The AC and DDPG methods begin to converge in the 10th to 50th rounds; their maximum average reward values are greater than those of the DQN and Q-learning methods, and the DDPG algorithm reaches the maximum average reward value. The average reward value of the APF-DDPG method is higher than that of the other methods at the beginning, indicating that the initial convergence rate is improved after the artificial potential field method is added. At the same time, the APF-DDPG method reaches the maximum average reward value in the 70th round. In summary, the APF-DDPG method is superior to the other methods in terms of convergence speed and stability.

To further illustrate the accuracy and effectiveness of the improved model, experimental data from the five methods were extracted and compared in terms of total iteration time, optimal decision time, convergence steps, and the number of obstacle collisions. As shown in Table 12, DQN and Q-learning, which are based on a discrete action space, have greater overhead in terms of training time and convergence speed. In contrast, AC and DDPG achieve better training time and convergence. Compared with the above methods, the decision method using APF-DDPG has less iteration time and optimal decision time; in addition, its convergence speed is improved and its trial-and-error rate is reduced. The simulation results show that the APF-DDPG has higher convergence speed and stability.


**Table 12.** Comparison of experimental data.

#### **5. Conclusions**

In traditional path planning algorithms, historical experience data cannot be reused for online training and learning, which results in low accuracy, and the planned path is redundant and not smooth. This paper proposes an unmanned ship autonomous path planning method based on the DDPG algorithm. First, the ship data are acquired from the electronic chart platform. Secondly, the model is designed and trained in combination with the COLREGS and crew experience, and validated in three classical encounter situations on the electronic chart platform. The experiments show that the unmanned ship takes the best and most reasonable actions in an unfamiliar environment, successfully completes the task of autonomous path planning, and realizes an end-to-end learning method for unmanned ships. Finally, by combining APF and DDPG, an unmanned ship autonomous path planning method based on improved DRL is proposed and compared with other classical DRL methods. The results show that the improved DRL has faster convergence speed and higher accuracy, can achieve continuous action output, and has less navigation error, which further validates the effectiveness of the proposed method. However, the ship's motion model and a real verification environment are not considered in this paper. How to incorporate the motion model of a ship in a complex sea area and then verify the method in a real environment is the focus of our next research.

**Author Contributions:** Conceptualization, X.Z. and S.G. Methodology, S.G. Software, S.G. and Y.Z. Writing—original draft preparation, X.Z., S.G., Y.D. Writing—review and editing, X.Z. and S.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is supported by the National Key R&D Program of China (Grant No. 2018YFB1601500), the National Natural Science Foundation of China (Grant No. 51679025) and the Fundamental Research Funds for the Central Universities (Grant No. 3132019313).

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
