• AC Algorithm

The AC algorithm comprises an Actor network and a Critic network, each of which consists of an online network and a target network. The online and target networks share the same structure, and both the Actor and the Critic are trained with experience replay, randomly sampling ship experience data from the experience buffer pool.

• Environment

The training environment of the autonomous path planning model for unmanned ships mainly comprises the Ship Action Controller and the Ship Navigation Information Fusion Module. When the ship executes an action, the environment feeds back a reward and the ship's new state.

• Ship Action Controller

The Ship Action Controller converts the action output by the model into a heading deflection and a speed increment, i.e., the maneuver that the unmanned ship should perform in the current state.
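As a concrete illustration, a minimal controller might scale the normalized network output (bounded to [−1, 1] by the *Tanh* layer described in Section 3.3.2) into physical commands. The limits below are hypothetical placeholders, not values from the paper.

```python
# Hypothetical sketch of the Ship Action Controller; the limits are
# illustrative assumptions, not values specified in the paper.
MAX_HEADING_DEFLECTION_DEG = 35.0  # assumed heading-deflection limit
MAX_SPEED_INCREMENT_MPS = 0.5      # assumed speed-change limit per step

def to_ship_command(action: float) -> tuple[float, float]:
    """Convert a network action in [-1, 1] to (heading deflection, speed increment)."""
    action = max(-1.0, min(1.0, action))  # clip to the Tanh output range
    heading_deflection = action * MAX_HEADING_DEFLECTION_DEG
    speed_increment = action * MAX_SPEED_INCREMENT_MPS
    return heading_deflection, speed_increment
```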

• Ship Navigation Information Fusion Module

The module receives and processes data from the global positioning system (GPS), the automatic identification system (AIS), the depth sounder, and the anemometer in real time (this information can be integrated and displayed on the electronic chart platform), and provides the state information of the unmanned ship, including the ship's position, heading, and speed, the distance between the ship and the obstacles, and the angle between the ship and the target point.
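As an illustration, the sketch below packs the fused readings into the state vector consumed by the networks; the field names, units, and ordering are assumptions made for this example, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class ShipState:
    """Fused navigation state; field names and units are illustrative assumptions."""
    latitude: float          # from GPS
    longitude: float         # from GPS
    heading_deg: float       # ship heading
    speed_mps: float         # ship speed
    target_angle_deg: float  # angle between the ship and the target point
    obstacle_dist_m: float   # distance between the ship and the nearest obstacle

    def to_vector(self) -> list[float]:
        """Flatten into the state vector s_t fed to the Actor and Critic networks."""
        return [self.latitude, self.longitude, self.heading_deg,
                self.speed_mps, self.target_angle_deg, self.obstacle_dist_m]
```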

• Experience replay memory

This module stores the state of the ship, the action, the reward, and the state at the next moment. When the experience buffer pool reaches its maximum capacity, the oldest data are replaced by new data, and the buffer pool is updated in this way.
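A minimal sketch of such a buffer is given below; using `collections.deque` with a fixed `maxlen` gives the described overwrite-the-oldest behavior automatically.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity experience buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # deque evicts the oldest item when full

    def push(self, state, action, reward, next_state):
        """Store one transition (s_t, a_t, r_t, s_{t+1})."""
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        """Draw a random minibatch for network training."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```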

The DDPG algorithm pseudocode is as follows.

#### **Algorithm 1: Pseudocode of the DDPG algorithm.**

1: Randomly initialize the Critic network $Q(s, a \mid \theta^{Q})$ and the Actor network $\mu(s \mid \theta^{\mu})$ with weight parameters $\theta^{Q}$ and $\theta^{\mu}$.
2: Initialize the target Critic network $Q'(s, a \mid \theta^{Q'})$ and the target Actor network $\mu'(s \mid \theta^{\mu'})$ with weight parameters $\theta^{Q'} \leftarrow \theta^{Q}$ and $\theta^{\mu'} \leftarrow \theta^{\mu}$.
3: Initialize the experience replay memory $D$.
4: **for** episode = 1 to $M$ **do**
5: Initialize the random process $\mathcal{N}$ of the action exploration strategy.
6: Input the initial unmanned-ship and environment observation state $s_1$: ship latitude and longitude, ship heading, ship speed, angle with the target point, distances from obstacles.
7: **for** t = 1 to $T$ **do**
8: Choose the ship heading and speed $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$ based on the current strategy $\mu(s_t)$ and the exploration noise $\mathcal{N}$.
9: Execute the output action $a_t$ to obtain the reward $r_t$ and the new state $s_{t+1}$.
10: Save the transition $(s_t, a_t, r_t, s_{t+1})$ into $D$.
11: Sample a random batch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $D$.
12: Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$.
13: Update the online Critic network by minimizing the loss: $L = \frac{1}{N} \sum_i \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2$
14: Update the online Actor network using the sampled policy gradient: $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^{Q}) \big|_{s=s_i,\, a=\mu(s_i)} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s_i}$
15: Update the target Actor network and target Critic network: $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}$, $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}$
16: **end for**
17: **end for**
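Steps 12–15 can be sketched in PyTorch as follows. This is a minimal illustration assuming Actor and Critic modules like those described in Section 3.3.2; the values of γ and τ are common defaults, not hyperparameters reported in this paper.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.001):
    """One DDPG update (steps 12-15 of Algorithm 1); gamma and tau are illustrative."""
    states, actions, rewards, next_states = batch  # tensors sampled from replay memory

    # Step 12: compute the target value y_i with the target networks.
    with torch.no_grad():
        next_actions = target_actor(next_states)
        y = rewards + gamma * target_critic(next_states, next_actions)

    # Step 13: minimize the mean-squared Bellman error on the online Critic.
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 14: sampled policy gradient -- ascend Q(s, mu(s)) w.r.t. the Actor weights,
    # implemented as gradient descent on the negated mean Q value.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 15: soft-update the target networks toward the online networks.
    for tgt, src in ((target_actor, actor), (target_critic, critic)):
        for t_param, param in zip(tgt.parameters(), src.parameters()):
            t_param.data.mul_(1.0 - tau).add_(tau * param.data)
```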

#### 3.3.2. Structural Design of AC Algorithms

The input parameters of the Actor network and the Critic network all come from the ship state data provided by the Ship Navigation Information Fusion Module. As ship state information is continuously fed in, the Actor and Critic networks repeatedly compute and update their network parameters and eventually produce better outputs: the action of the unmanned ship and the value of that action, respectively. Figure 6 shows the specific network structure of the AC algorithm.

**Figure 6.** Network structure of AC algorithms.

The Actor network contains two fully connected hidden layers with 300 and 600 neurons, respectively. The output nodes of each hidden layer are processed nonlinearly by an activation function, and the network is limited to a single output action. The Actor network takes the initial ship state information *st* as input, passes it through the two hidden layers of neurons, and applies the *ReLU* activation function to the output of each hidden layer. In the last layer of the network, the *Tanh* activation function limits the network's single output action value *at* to [−1, 1], and the Ship Action Controller then converts the network output action into an actual operation action.
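A PyTorch sketch consistent with this description is shown below; the 300 then 600 layer order follows the text, and the default state dimension of six corresponds to the quantities listed in step 6 of Algorithm 1.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor network: two hidden layers (300 and 600 units), Tanh-bounded output."""

    def __init__(self, state_dim: int = 6, action_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 300),
            nn.ReLU(),               # nonlinear processing of the first hidden layer
            nn.Linear(300, 600),
            nn.ReLU(),               # nonlinear processing of the second hidden layer
            nn.Linear(600, action_dim),
            nn.Tanh(),               # bounds the single action a_t to [-1, 1]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```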

The Critic network has the same overall structure as the Actor network, but each of its two hidden layers contains 200 neurons, and the output node of each hidden layer is likewise processed nonlinearly by an activation function. The final network result is the evaluation value of the Actor network's output action, known as the Q value. Unlike the Actor network, the Critic network takes both the initial ship state *st* and the Actor network's output action *at* as input parameters. After the two hidden layers' computation, each layer's output is processed by the *ReLU* activation function, but no activation function is applied when the final action value Q is output, so that the network yields a definite action value. This action value is used to evaluate the output action of the Actor network.
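A matching Critic sketch is given below: state and action are concatenated at the input, each 200-unit hidden layer uses *ReLU*, and the output layer is linear so the Q value is unconstrained.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Critic network: two 200-unit hidden layers; linear output gives the Q value."""

    def __init__(self, state_dim: int = 6, action_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 200),  # takes s_t and a_t together
            nn.ReLU(),
            nn.Linear(200, 200),
            nn.ReLU(),
            nn.Linear(200, 1),   # no activation: the output is the raw action value Q
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))
```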

#### 3.3.3. Action Strategy Design

In DRL, the relationship between exploitation and exploration must be handled correctly. An appropriate action exploration strategy enables the agent to try more new actions and thus avoid falling into local optima. In this paper, the action strategy of the unmanned ship mainly comprises an action control strategy and an action exploration strategy. When designing the neural network structure, random noise is added to the output action of the Actor network as the action exploration strategy, and the output action is converted into a specific execution according to its actual physical meaning as the control strategy.
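As an illustration of the exploration strategy, the sketch below adds zero-mean Gaussian noise to the Actor output and clips the result back to the valid action range. The paper states only that random noise is added, so the Gaussian form and its scale are assumptions (Ornstein–Uhlenbeck noise is another common choice in DDPG).

```python
import numpy as np

def explore_action(actor_output: float, noise_scale: float = 0.1) -> float:
    """Add exploration noise N_t to the deterministic action mu(s_t | theta_mu).

    The Gaussian form and the 0.1 scale are illustrative assumptions; the
    paper only states that random noise is added to the Actor output.
    """
    noisy = actor_output + np.random.normal(0.0, noise_scale)
    return float(np.clip(noisy, -1.0, 1.0))  # keep within the Tanh action range
```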
