**1. Introduction**

An unmanned surface vehicle (USV) is a vessel that navigates on the water under the control of an automated algorithm or a remote operator. In recent years, with the rapid development of marine science and control theory, USVs have been widely used in marine rescue and marine monitoring due to their speed and economy [1,2]. Path following is the foundation for a USV to perform tasks and also reflects the intelligence of a USV.

**Citation:** Song, L.; Xu, C.; Hao, L.; Yao, J.; Guo, R. Research on PID Parameter Tuning and Optimization Based on SAC-Auto for USV Path Following. *J. Mar. Sci. Eng.* **2022**, *10*, 1847. https://doi.org/10.3390/jmse10121847

Academic Editors: Carlos Guedes Soares and Serge Sutulo

Received: 24 October 2022; Accepted: 15 November 2022; Published: 1 December 2022

In recent years, artificial intelligence methods, especially reinforcement learning (RL) technology, have effectively improved the accuracy of heading control for USVs. Reinforcement learning is a machine learning technique in which an agent gains knowledge about a specified scenario by training on and learning from direct interactions with the environment; it can be combined with hierarchical neural networks (HNNs) in deep learning (DL) to form various deep reinforcement learning (DRL) methods, such as the deep Q-learning network (DQN) [3], deep deterministic policy gradient (DDPG) [4], asynchronous advantage actor–critic (A3C) [5], and soft actor–critic (SAC) [6]. These algorithms have achieved unprecedented success in many challenging areas, and reinforcement learning techniques are also used in the USV field. Gonzalez-Garcia et al. [7] proposed a USV guidance control method based on DDPG combined with sliding mode control; by training on the heading command, path following of the USV is realized, and the results showed improved performance and enhanced control stability. Zhao et al. [8] proposed a deep Q-learning (DQL) method based on an adaptive gradient descent function to guide USV navigation; the results showed that the algorithm performs well in reducing the complexity and improving the accuracy of path following. Wang et al. [9] proposed an optimal trajectory tracking control algorithm based on deep reinforcement learning, with which the tracking error converges; its effectiveness and superiority are demonstrated through simulation. The SAC algorithm is an RL algorithm proposed by Haarnoja et al. [6] based on the actor–critic (AC) framework. The core idea of the algorithm is that an entropy term is combined with the original reward to encourage exploration, and a behavior strategy that maximizes the entropy-augmented reward is trained. The algorithm preserves the randomness of the behavior strategy to the maximum extent and improves the agent's ability to perceive the environment. Zheng et al. [10] proposed a linear active disturbance rejection control strategy based on the SAC algorithm, which tracked both straight and circular trajectories for a USV under wave effects.

The path-following control system comprises guidance laws and controllers [11]. The line-of-sight (LOS) guidance law [12,13] has been widely used in the design of path-following controllers for USVs due to its simplicity, efficiency, and ease of implementation. The law is independent of the controller and does not depend on the mathematical model of the system. For example, in reference [14], a virtual control law based on the tracking error is used to design the guidance law; in contrast, in reference [15], the visual angle of the desired path is calculated for the USV heading controller. The path-following controller can be designed with adaptive [16,17], backstepping [18,19], sliding mode [20,21], and other control methods. However, the robustness of adaptive control methods needs to be improved; the backstepping algorithm is highly dependent on the model and needs accurate model parameters; and although sliding mode control does not require a high-accuracy model, its chattering problem is difficult to eliminate [22]. Moreover, due to the high cost of experiments, the difficulty of adjusting the control parameters, the nonlinearity of the motion model, and the uncertainty of the environment, it is difficult to guarantee the stability of the system. It is therefore difficult for these control algorithms to show their advantages in practical applications.

Therefore, it is significant to propose an adaptive control method with a simple structure, good robustness, and low requirements on model accuracy. PID control is widely used in ship heading control. Miao et al. [23] designed a self-adapting expert S-PID controller for a mini-USV to perform heading control and verified the effectiveness and reliability of the designed motion control system through experiments. Based on the LOS guidance law, Zhu et al. [24] conducted a simulation analysis of three control algorithms: incremental PID, fuzzy PID, and variable-length-ratio fuzzy PD; the results showed that the third has better anti-interference performance than the other two. Fan et al. [25] designed a track control algorithm combining LOS and fuzzy adaptive PID and conducted a real-boat test. The results show that the algorithm reduces the influence of the time-varying drift angle on track control; however, some steering overshoot and position deviation still emerged at the corners of the path.

At present, the existing algorithms are not practical for automatically optimizing PID parameters, and little research has applied the RL method to the heading control of USVs. For this reason, a PID parameter optimization method based on the SAC algorithm is proposed for USVs to achieve adaptive heading control. The remainder of this paper is organized as follows. In Section 2, the USV kinematic model based on the Abkowitz formulation is introduced and the path-following guidance system is designed; then the PID parameter optimization algorithm based on SAC is proposed. In Section 3, the simulation training process and the simulation results under three different working conditions are presented to verify the feasibility and effectiveness of the proposed algorithm. Finally, concluding remarks are provided in Section 4.

#### **2. The Design of the DRL**

In this section, the overall design of the proposed system is presented. To avoid the tedious, mechanical process of manual PID parameter tuning, an adaptive SAC-PID control method is introduced. The overall flow diagram of the proposed method is given in Figure 1.

**Figure 1.** The overall flow diagram for the proposed method.

As shown in Figure 1, the Abkowitz holistic model is established and, considering the integral saturation condition of the PID controller, the control system is based on LOS guidance with a PD controller designed to control the USV heading. To obtain reasonable PD parameters, a neural network based on the SAC algorithm is established, and the agent is trained by interacting with the abovementioned control system. The final network is used in the experimental simulation, where the required control parameters are supplied to the controller by online transmission.

#### *2.1. Dynamic Model of USV*

Since the USV is a vessel navigating on the water surface, a hydrodynamic model with three degrees of freedom can be used to describe the relationship between the USV's motion state and the propulsion, external forces, fluid forces, and corresponding torques in the local coordinate system of the USV. Ignoring the pitch, roll, and heave of the USV, the two-dimensional description of ship motion is shown in Figure 2, where $o - xy$ is the global coordinate system (Earth coordinate system), and $o_0 - x_0 y_0$ is the local coordinate system of the USV.

**Figure 2.** USV manipulation motion coordinate system. $u$, $v$, and $r$ are the velocities in the three degrees of freedom (surge, sway, and yaw) of the ship motion, respectively. $\psi$ is the heading angle.

The motion equation of the USV with three degrees of freedom can be described by the following equation:

$$\begin{cases} X = m(\dot{u} - rv - x_G r^2) \\ Y = m(\dot{v} + ru + x_G \dot{r}) \\ N = I_z \dot{r} + m x_G(\dot{v} + ru) \end{cases} \tag{1}$$

where $m$ is the total mass of the USV, $x_G$ is the longitudinal coordinate of the center of gravity, and $I_z$ is the moment of inertia about the vertical axis through the center of gravity of the USV. $X$, $Y$, and $N$ are the hydrodynamic force and moment components on the three degrees of freedom corresponding to the motions $u$, $v$, and $r$, respectively.

The integral-type Abkowitz model refers to the motion model proposed by Abkowitz, which treats the ship as a single body under force when performing the mechanical analysis. In theory, only a hydrodynamic expansion of infinite order derived from the Taylor expansion equals the real value; still, a large amount of research shows that the accuracy is sufficient for engineering applications when the expansion is truncated at third order [26,27]. The third-order Taylor series expansion model is shown in Equation (2). $\Delta u$ represents the difference between the current speed and the design speed, as described in Equation (3).

$$\begin{cases}
m(\dot{u} - rv - x_G r^2) = X_u\Delta u + X_{uu}(\Delta u)^2 + X_{uuu}(\Delta u)^3 + X_{\dot{u}}\dot{u} + X_{vv}v^2 + X_{rr}r^2 + X_{vr}vr + X_{\delta\delta}\delta^2 + X_{v\delta}v\delta + X_{r\delta}r\delta \\
m(\dot{v} + ru + x_G\dot{r}) = Y_v v + Y_{vvv}v^3 + Y_{\dot{v}}\dot{v} + Y_r r + Y_{rrr}r^3 + Y_{vrr}vr^2 + Y_{vvr}v^2 r + Y_\delta\delta + Y_{\delta\delta\delta}\delta^3 + Y_{u\delta}u\delta + Y_{v\delta\delta}v\delta^2 + Y_{vv\delta}v^2\delta + Y_{r\delta\delta}r\delta^2 \\
I_z\dot{r} + m x_G(\dot{v} + ru) = N_v v + N_{vvv}v^3 + N_{\dot{v}}\dot{v} + N_r r + N_{rrr}r^3 + N_{vrr}vr^2 + N_{vvr}v^2 r + N_\delta\delta + N_{\delta\delta\delta}\delta^3 + N_{u\delta}u\delta + N_{v\delta\delta}v\delta^2 + N_{vv\delta}v^2\delta + N_{r\delta\delta}r\delta^2
\end{cases} \tag{2}$$

$$\Delta u = u - u_0 \tag{3}$$

Moving all terms proportional to the translational accelerations $\dot{u}$, $\dot{v}$, and $\dot{r}$ in Equation (2) to the left-hand side, while the inertial force, lift force, and drag force of the fluid on the hull are moved to the right-hand side, the Abkowitz mathematical model of ship motion is obtained as Equation (4). The USV model parameters are shown in Table 1, and Table 2 contains the hydrodynamic coefficients.

$$\begin{cases}
(m - X_{\dot{u}})\dot{u} = F_X(u, v, r, \delta) \\
(m - Y_{\dot{v}})\dot{v} + (m x_G - Y_{\dot{r}})\dot{r} = F_Y(u, v, r, \delta) \\
(m x_G - N_{\dot{v}})\dot{v} + (I_z - N_{\dot{r}})\dot{r} = F_N(u, v, r, \delta)
\end{cases} \tag{4}$$

where $X_{\dot{u}}$, $Y_{\dot{v}}$, $Y_{\dot{r}}$, $N_{\dot{v}}$, and $N_{\dot{r}}$ are the acceleration-related hydrodynamic derivatives.
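As a concrete illustration, the structure of Equation (4) (inertia terms on the accelerations driven by force functions of $u$, $v$, $r$, and $\delta$) can be integrated numerically. The following sketch uses simple placeholder forces and unit inertia rather than the paper's hydrodynamic derivatives from Table 2:

```python
import math

# Sketch of integrating a 3-DOF surface-vessel model of the Equation (4)
# form: M * [u_dot, v_dot, r_dot] = F(u, v, r, delta). All numerical
# coefficients are illustrative placeholders, NOT the paper's coefficients.

def step_3dof(state, delta, dt=0.1):
    """Advance (x, y, psi, u, v, r) one Euler step under rudder angle delta."""
    x, y, psi, u, v, r = state

    # Placeholder force model: linear damping plus a rudder effect.
    Fx = -0.5 * (u - 1.0)          # surge force pulls u toward 1 m/s
    Fy = -0.8 * v + 0.3 * delta    # sway force
    Fn = -0.6 * r + 0.2 * delta    # yaw moment

    # Placeholder inertia: (m - X_udot) etc. taken as 1 for illustration.
    u += Fx * dt
    v += Fy * dt
    r += Fn * dt

    # Kinematics: rotate body-frame velocities into the Earth frame.
    x += (u * math.cos(psi) - v * math.sin(psi)) * dt
    y += (u * math.sin(psi) + v * math.cos(psi)) * dt
    psi += r * dt
    return (x, y, psi, u, v, r)

# Initial surge speed matches the paper's u0 = 1.242 m/s.
state = (0.0, 0.0, 0.0, 1.242, 0.0, 0.0)
for _ in range(100):
    state = step_3dof(state, delta=0.05)
```

A real simulation would solve the coupled linear system in the accelerations at each step using the hydrodynamic derivatives of Table 2.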

**Table 1.** Main parameters of the USV.


**Table 2.** Hydrodynamic Coefficients.




#### *2.2. Path Following*

The USV path-following control problem is defined as controlling the USV to converge to the desired path *S*. The line-of-sight (LOS) guidance law is a classical and effective navigation algorithm that does not depend on any dynamic control model and is insensitive to high-frequency white noise. Because the desired course is only related to the desired route and the real-time position of the USV, the guidance law can efficiently calculate the desired course and pass it to the controller to achieve the control goal in real time. LOS guidance algorithms can be divided into two types, based on the lookahead distance and on the enclosing circle, respectively; the former is adopted in this paper. The LOS guidance law is illustrated in Figure 3.

**Figure 3.** LOS guidance law. $o - xy$ is the global coordinate system. $o_d - x_d y_d$ is the path-fixed coordinate system. $\psi_d$ is the desired heading angle. $\varphi$ is the angle between the bow of the USV and the vertical axis of the global coordinate system. $y_e$ is the lateral (cross-track) error of path tracking. $\alpha_k$ is the tangential angle at the point $o_d$ on the desired path. $U$ is the actual velocity of the USV. $\Delta$ is the lookahead distance, usually set as an integer multiple of the ship's length; in this paper, the multiple is taken as 2.
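The lookahead-based LOS geometry of Figure 3 can be sketched as follows; the sign conventions and the numerical lookahead distance are assumptions for illustration:

```python
import math

# Sketch of lookahead-based LOS guidance for a straight path segment,
# following the geometry of Figure 3. Assumed convention: heading measured
# from the x-axis, lookahead distance Delta = 2 * ship length.

def los_desired_heading(pos, wp1, wp2, delta_los):
    """Return (psi_d, y_e): desired heading angle and cross-track error."""
    # Path tangent angle alpha_k of the segment wp1 -> wp2.
    alpha_k = math.atan2(wp2[1] - wp1[1], wp2[0] - wp1[0])
    dx, dy = pos[0] - wp1[0], pos[1] - wp1[1]
    # Cross-track error: lateral offset in the path-fixed frame.
    y_e = -dx * math.sin(alpha_k) + dy * math.cos(alpha_k)
    # Lookahead law: steer toward a point Delta ahead on the path.
    psi_d = alpha_k + math.atan2(-y_e, delta_los)
    return psi_d, y_e

# USV 5 m to the side of a path along the x-axis; Delta = 14 m (illustrative).
psi_d, y_e = los_desired_heading((0.0, 5.0), (0.0, 0.0), (100.0, 0.0), 14.0)
```

With a positive offset the law returns a negative correction angle, steering the USV back toward the path.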

The PID controller is a linear regulator that compares the desired heading angle *ψd*(*t*) with the actual heading angle *ϕ*(*t*) to form the heading angle deviation *e*(*t*):

$$e(t) = \psi\_{\mathbf{d}}(t) - \varphi(t) \tag{5}$$

The desired rudder angle can be expressed as Equation (6):

$$\delta(k) = K_p e(k) + K_i \sum_{i=0}^{k} e(i) + K_d\left(e(k) - e(k-1)\right) \tag{6}$$

Considering the integral saturation (windup) problem of the PID controller, only the PD parameters are adjusted, ensuring that the USV quickly converges to the desired track and keeps navigating within the error range. Therefore, Equation (6) reduces to Equation (7),

$$\delta = K_p e + K_d\left(e - e'\right) \tag{7}$$

where $e'$ is the error at the previous time step. The neural network is used to produce the appropriate PD parameters.
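A minimal sketch of the PD rudder law of Equation (7) might look like this; the saturation limit and the way $(K_p, K_d)$ are injected at each call are assumptions (in the paper they are produced by the SAC agent):

```python
# Sketch of the PD heading controller of Equation (7). The integral term is
# dropped to avoid integrator windup; (kp, kd) are supplied externally.
# The rudder saturation limit of ~35 degrees is an illustrative assumption.

class PDHeadingController:
    def __init__(self, max_rudder=0.61):
        self.prev_error = 0.0
        self.max_rudder = max_rudder

    def rudder(self, psi_d, psi, kp, kd):
        e = psi_d - psi                         # heading error, Equation (5)
        delta = kp * e + kd * (e - self.prev_error)
        self.prev_error = e
        # Clamp the command to the physical rudder limits.
        return max(-self.max_rudder, min(self.max_rudder, delta))

ctrl = PDHeadingController()
d1 = ctrl.rudder(psi_d=0.2, psi=0.0, kp=0.5, kd=1.0)   # first step
d2 = ctrl.rudder(psi_d=0.2, psi=0.05, kp=0.5, kd=1.0)  # error shrinking
```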

#### *2.3. SAC Algorithm*

Figure 4 describes the interaction process between the reinforcement learning agent and the environment, which is modeled as a Markov decision process (MDP) [25]. $(S, A, \rho, r)$ is the defining tuple of the MDP, in which $S$ is the set of all states in the environment, $A$ is the set of all actions, $\rho$ represents the probability density of the next state $s_{t+1} \in S$ given the current state $s_t \in S$ and the action $a_t \in A$, and $r$ is a bounded immediate payoff received each time one state transfers to another. $\rho_\pi(s_t, a_t)$ represents the state–action distribution generated by policy $\pi$. At time $t$, the agent obtains the state $s_t$ from the environment and inputs it into the policy $\pi$ to obtain the action $a_t$. The action $a_t$ is executed, the reward $r_t$ of the current step is obtained, and the agent enters the next state $s_{t+1}$. With $\gamma$ the discount factor, the total return at time $t$ can be described as Equation (8). The state value function, shown in Equation (9), evaluates the quality of the current state, while the state–action value function, shown in Equation (10), indicates whether the action made in the current state is of high quality. The transformation between the two is given by Equations (11) and (12).

$$R\_t = \sum\_{k=0}^{\infty} \gamma^k r\_{t+k} \tag{8}$$

$$V\_{\pi}(s) = E\_{\pi} \left[ \sum\_{k=0}^{\infty} \gamma^k r\_{t+k} \middle| s\_t = s \right] = E[\mathcal{R}\_t | s\_t = s] \tag{9}$$

$$Q\_{\pi}(s, a) = E[R\_t | s\_t = s, a\_t = a] \tag{10}$$

$$V\_{\pi}(s) = E[Q\_{\pi}(s, a)|s\_t = s] \tag{11}$$

$$Q_\pi(s, a) = R_{t+1} + \gamma \sum_{s_{t+1} \in S} P^a_{ss'} V_\pi(s_{t+1}) \tag{12}$$

where $P^a_{ss'} = P[s_{t+1} = s' \mid s_t = s, a_t = a]$, and $P$ is the state transition probability matrix. The optimization objective in reinforcement learning is to maximize the long-term reward $R$. According to the MDP solution process, the optimal strategy $\pi^*$ is the policy that maximizes the reward $R$, which can be described as Equation (13).

$$\pi^* = \arg\max_\pi \sum_{t=0}^{T} E_{(s_t, a_t) \sim \rho_\pi}\left[\gamma^t r(s_t, a_t)\right] \tag{13}$$
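The discounted return of Equation (8) is commonly computed by the backward recursion $R_t = r_t + \gamma R_{t+1}$; a minimal sketch:

```python
# Sketch of the discounted return of Equation (8):
# R_t = sum_k gamma^k * r_{t+k}, computed by backward recursion.

def discounted_returns(rewards, gamma=0.99):
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running   # R_t = r_t + gamma * R_{t+1}
        returns.append(running)
    return list(reversed(returns))

rets = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
# R_2 = 2, R_1 = 0 + 0.5*2 = 1, R_0 = 1 + 0.5*1 = 1.5
```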

**Figure 4.** Schematic diagram for reinforcement learning.

Compared with other strategies, the core idea of SAC is to combine an entropy term, which encourages the agent to explore, with the original reward. Thus, Equation (13) is updated to Equation (14),

$$\pi^* = \arg\max_\pi \sum_{t=0}^{T} E_{(s_t, a_t) \sim \rho_\pi}\left[\gamma^t r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t))\right] \tag{14}$$

where *α* is the entropy coefficient, which controls the weight between the entropy term and the revenue term and also influences the randomness of the strategy. H is entropy, which represents the randomization of the current policy, expressed as Equation (15).

$$H(P) = E\_{\mathbf{x} \sim \mathbf{P}}[-\log P(\mathbf{x})] \tag{15}$$

In this paper, the SAC algorithm is mainly composed of five networks, including two value networks (one V network, one target-V network), two action-value networks (Q network), and one actor network (π network). The V network is used to calculate the value of the value function. The Q network is used to calculate the value of the action-value function. The network π outputs the policy value that guides the action of the agent. The overview of the SAC system is shown in Figure 5.

**Figure 5.** SAC architecture.

#### 2.3.1. Training and Updating of the Actor Network

The strategy $\pi$ is a Gaussian distribution whose mean $\mu$ and standard deviation $\sigma$ are calculated by the neural network. Each sample of the policy $\pi_\phi(\cdot \mid s_t)$ is a function of the state $s$, the policy parameters $\phi$, and independent noise $\xi \sim N(0, 1)$, as described by Equation (16). The loss function for actor network training is given as Equation (17).

$$\tilde{a}_\phi(s, \xi) = \tanh\left(\mu_\phi(s) + \sigma_\phi(s) \odot \xi\right) \tag{16}$$

$$Loss = E_{\xi \sim N}\left[\alpha \log \pi_\phi\left(\tilde{a}_\phi(s, \xi) \mid s\right) - Q\left(s, \tilde{a}_\phi(s, \xi)\right)\right] \tag{17}$$

Compared with other RL methods, instead of selecting the action from the sampled *mini-batch* data, the actor network is reused to predict the action $\tilde{a}$ used to calculate the *Loss*. The optimization objective of the actor network can be expressed as Equation (18), and the gradient of the actor network as Equation (19).

$$J_\pi(\phi) = E_{\xi \sim N}\left[\alpha \log \pi_\phi\left(\tilde{a}_\phi(s, \xi) \mid s\right) - \min_{i=1,2} Q_{\theta_i}\left(s, \tilde{a}_\phi(s, \xi)\right)\right] \tag{18}$$

$$\nabla_\phi J_\pi(\phi) = \nabla_\phi \alpha \log\left(\pi_\phi(a_t \mid s_t)\right) + \left(\nabla_{a_t} \alpha \log\left(\pi_\phi(a_t \mid s_t)\right) - \nabla_{a_t} \min_{i=1,2} Q_{\theta_i}(s_t, a_t)\right) \nabla_\phi \tilde{a}_\phi(s, \xi) \tag{19}$$
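The reparameterized, tanh-squashed sample of Equation (16) can be sketched as follows; the tanh log-probability correction is standard for SAC implementations but is an addition not spelled out in the text above:

```python
import math, random

# Sketch of the reparameterized policy sample of Equation (16):
# a = tanh(mu + sigma * xi), xi ~ N(0, 1). In a real implementation the
# gradient flows through mu and sigma; here we reproduce only the sampling.

def sample_action(mu, log_std, rng):
    xi = rng.gauss(0.0, 1.0)              # independent standard-normal noise
    pre = mu + math.exp(log_std) * xi     # reparameterization trick
    a = math.tanh(pre)                    # squash into (-1, 1)
    # Gaussian log-density of the pre-squash sample...
    log_normal = (-0.5 * ((pre - mu) / math.exp(log_std)) ** 2
                  - log_std - 0.5 * math.log(2 * math.pi))
    # ...with the change-of-variables correction for the tanh squashing.
    log_prob = log_normal - math.log(1.0 - a * a + 1e-6)
    return a, log_prob

rng = random.Random(0)
a, lp = sample_action(mu=0.0, log_std=-1.0, rng=rng)
```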

2.3.2. Training and Updating of V Networks

As shown in Figure 5, the V network is updated with a *mini-batch* of data sampled from the experience pool. The target value of the V network combines the minimum of the state–action values $Q_1$ and $Q_2$ for the action $a_t$ taken in the current state $s_t$ with the log-probability $\log \pi(a_t \mid s_t)$ of that action. The MSE is adopted as the loss function for V network training. The objective function can be expressed as Equation (20):

$$J_V(\psi) = E_{s_t \sim D}\left[\frac{1}{2}\left(V_\psi(s_t) - E_{a_t \sim \pi_\phi}\left[\min_{i=1,2} Q_{\theta_i}(s_t, a_t) - \alpha \log \pi_\phi(a_t \mid s_t)\right]\right)^2\right] \tag{20}$$

where *ψ* is the parameter in the V network. *D* is the experience pool. *at* ∼ *πφ* means that instead of sampling from the experience pool, the actions are sampled according to the current policy. The gradient calculation formula of the V network is expressed as Equation (21).

$$\nabla_\psi J_V(\psi) = \nabla_\psi V_\psi(s_t)\left(V_\psi(s_t) - Q_\theta(s_t, a_t) + \alpha \log \pi_\phi(a_t \mid s_t)\right) \tag{21}$$

2.3.3. Training and Updating of Critic-Q Network

As shown in Figure 5, the Q network is updated with a *mini-batch* of data sampled from the experience pool. $Q' = r_t + \gamma V(s_{t+1})$ is used as the target value for state $s_t$, while $Q_1$ and $Q_2$ evaluated at the same action $a_t$ are used as the predicted values for state $s_t$. The objective function can be expressed as Equation (22):

$$J_Q(\theta) = E_{(s_t, a_t) \sim D}\left[\frac{1}{2}\left(Q_\theta(s_t, a_t) - Q'(s_t, a_t)\right)^2\right] \tag{22}$$

where $\theta$ is the parameter of the Q network. $Q'(s_t, a_t)$ is given by Equation (23):

$$Q'(s_t, a_t) = r(s_t, a_t) + \gamma E_{s_{t+1} \sim P}\left[V_{\bar{\psi}}(s_{t+1})\right] \tag{23}$$

where $\bar{\psi}$ is the parameter of the target-V network evaluated at state $s_{t+1}$. The gradient of the Q network is expressed as Equation (24).

$$\nabla_\theta J_Q(\theta) = \nabla_\theta Q_\theta(s_t, a_t)\left(Q_\theta(s_t, a_t) - r(s_t, a_t) - \gamma V_{\bar{\psi}}(s_{t+1})\right) \tag{24}$$
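The Q-network target of Equation (23) and the squared-error objective of Equation (22) can be sketched with stand-in scalars (the network outputs are replaced by plain floats here):

```python
# Sketch of the Q-network target of Equation (23), Q' = r + gamma * V_targ(s'),
# and the squared error of Equation (22). Values that would come from the
# target-V and Q networks are stand-in floats for illustration.

def q_target(r_t, v_target_next, gamma=0.99, done=False):
    # No bootstrap term when the episode terminates.
    return r_t + (0.0 if done else gamma * v_target_next)

def q_loss(q_pred, r_t, v_target_next, gamma=0.99, done=False):
    err = q_pred - q_target(r_t, v_target_next, gamma, done)
    return 0.5 * err * err

loss = q_loss(q_pred=1.0, r_t=0.5, v_target_next=1.0, gamma=0.9)
# target = 0.5 + 0.9 * 1.0 = 1.4, loss = 0.5 * (1.0 - 1.4)^2 = 0.08
```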

Leaving the entropy coefficient $\alpha$ unchanged would be unreasonable, because the constantly changing reward would negatively affect the whole training process; it is therefore necessary to adjust $\alpha$ automatically. To improve the learning speed and stability of the agent, this paper designs a neural network that adaptively adjusts the entropy coefficient $\alpha$ based on the theory of reference [26]. Specifically, when the agent enters a new solution region, where its exploration ability should be enhanced to find the best action, $\alpha$ should increase so that the agent does not become trapped in a local optimum. When the agent has almost finished exploring a solution region, where its learning ability should be improved to accumulate experience from the best actions, $\alpha$ should decrease. The optimization maximizes the expected return under a constraint on the minimum expected entropy, which can be expressed as Equation (25):

$$\begin{aligned} \max_{\pi_{0:T}} \; & E_{\rho_\pi}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right] \\ \text{s.t. } & E_{(s_t, a_t) \sim \rho_\pi}\left[-\log\left(\pi_t(a_t \mid s_t)\right)\right] \ge H_0, \; \forall t \end{aligned} \tag{25}$$

where $H_0$ is a constant representing the preset minimum entropy. To solve Equation (25), the Lagrange multiplier method is used to transform the optimization problem into a primal problem and its dual problem. The final optimization result is obtained as Equation (26).

$$\alpha_t^* = \arg\min_{\alpha_t} \sum_{t=0}^{T} E_{\rho_\pi^*}\left[-\alpha_t \log \pi_t^*(a_t \mid s_t) - \alpha_t H_0\right] \tag{26}$$

where $\rho_t^*$ denotes the state–action distribution of the optimal policy. A network for $\alpha$ is then set up, and stochastic gradient descent is performed on Equation (26), yielding Equation (27).

$$\begin{aligned} \nabla_\alpha J(\alpha) &= \nabla_\alpha E_{a_t \sim \pi_t}\left[-\alpha \log \pi_t(a_t \mid s_t) - \alpha H_0\right] \\ \alpha &\leftarrow \alpha - \eta \nabla_\alpha J(\alpha) \end{aligned} \tag{27}$$

where $a_t$ is drawn from the current policy $\pi_t(\cdot \mid s_t)$, while $s_t$ is selected from the *mini-batch*. The Adam algorithm is used for optimization, and the learning rate $lr_\alpha$ is set to 0.0001.
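The effect of the $\alpha$ update in Equation (27) can be sketched as plain gradient descent on $J(\alpha)$; averaging the gradient over a mini-batch of log-probabilities is an illustrative assumption, and the learning rate shown here is exaggerated so the effect is visible:

```python
# Sketch of the automatic entropy-coefficient update of Equation (27).
# grad of E[-alpha*log_pi - alpha*H0] w.r.t. alpha is E[-log_pi - H0]:
# alpha shrinks when the policy entropy exceeds the target H0 and grows
# when it falls below, exactly the behavior described in the text.

def update_alpha(alpha, log_probs, h0, lr=0.0001):
    grad = sum(-lp - h0 for lp in log_probs) / len(log_probs)
    return max(1e-6, alpha - lr * grad)   # keep alpha positive

# Entropy above the target -> positive gradient -> alpha decreases.
a1 = update_alpha(alpha=0.2, log_probs=[-3.0, -2.5], h0=1.0, lr=0.01)
# Entropy below the target -> negative gradient -> alpha increases.
a2 = update_alpha(alpha=0.2, log_probs=[-0.1, -0.2], h0=1.0, lr=0.01)
```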

#### 2.3.4. The Design of State, Action Space, and Reward

During path following, the USV will be disturbed by wind, waves, currents, and other environmental factors. To make the output parameters of the agent more accurate, the environmental information should be considered in the state space design as much as possible. Based on the USV model constructed in Section 2, the state space is defined as Equation (28).

$$s = \left[u, \; v, \; r, \; \varphi, \; y_e, \; \alpha_k, \; \delta, \; \dot{\delta}, \; e, \; \dot{e}\right] \tag{28}$$

Similarly, the action space is defined as $a = [K_p, K_d]$, with $K_p$ ranging over $[-0.5, 0.5]$ and $K_d$ over $[-50, 50]$. The reward function $r$ has two parts, $r_{\rm psi}$ and $r_{y_e}$, combined as in Equation (29); $r_{\rm psi}$ and $r_{y_e}$ are designed as shown in Equations (30) and (31).

$$r = r_{\rm psi} + r_{y_e} \tag{29}$$

$$r_{\rm psi} = \begin{cases} 0, & e \le 0.1 \\ -e - 0.1\dot{e}, & e > 0.1 \end{cases} \tag{30}$$

$$r_{y_e} = \begin{cases} 0, & y_e \le 1 \\ -0.1, & y_e > 1 \end{cases} \tag{31}$$
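The reward of Equations (29)-(31) can be sketched directly; taking absolute values of $e$, $\dot{e}$, and $y_e$ is an assumption made here for symmetry, since the piecewise definitions above are written for non-negative errors:

```python
# Sketch of the reward of Equations (29)-(31): a heading term penalizing
# heading error e (and its rate e_dot) beyond a 0.1 rad dead-band, plus a
# cross-track term penalizing deviations beyond 1 m. Absolute values are an
# assumption, not stated in the paper's piecewise definitions.

def reward(e, e_dot, y_e):
    r_psi = 0.0 if abs(e) <= 0.1 else -abs(e) - 0.1 * abs(e_dot)
    r_ye = 0.0 if abs(y_e) <= 1.0 else -0.1
    return r_psi + r_ye

r_on_path = reward(e=0.05, e_dot=0.0, y_e=0.5)   # inside both dead-bands
r_off_path = reward(e=0.5, e_dot=0.1, y_e=3.0)   # penalized on both terms
```

The dead-bands mean the agent receives no penalty while the USV stays close to the desired heading and path, which rewards steady tracking rather than constant corrective action.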

The designs of the actor network and the critic network are shown in Figures 6 and 7; the two share the same structure. The dimension of the input layer of the actor network is 10, the hidden part consists of two layers with 400 and 300 neurons, respectively, and the dimension of the output layer is 2. The dimension of the input layer of the critic network is 12, the hidden part likewise includes two layers with 400 and 300 neurons, and the dimension of the output layer is 1. To prevent gradient saturation and vanishing gradients, ReLU is used as the activation function of the hidden layers in both the actor and the critic, and tanh is adopted as the activation function of the output layer.
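The actor architecture described above (input 10, hidden layers of 400 and 300 ReLU units, tanh output of dimension 2) can be sketched as a pure-Python forward pass with random weights; a real implementation would build the same shape in TensorFlow, as in the paper:

```python
import math, random

# Sketch of the actor MLP: input dim 10 (the state of Equation (28)),
# hidden layers of 400 and 300 ReLU units, tanh output of dim 2 (Kp, Kd
# before scaling). Weights are random placeholders, not trained values.

def make_layer(n_in, n_out, rng):
    return [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]

def forward(layers, x):
    for i, w in enumerate(layers):
        x = [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
        last = i == len(layers) - 1
        # ReLU on hidden layers, tanh on the output layer.
        x = [math.tanh(v) if last else max(0.0, v) for v in x]
    return x

rng = random.Random(0)
actor = [make_layer(10, 400, rng), make_layer(400, 300, rng), make_layer(300, 2, rng)]
out = forward(actor, [0.1] * 10)   # two bounded outputs in (-1, 1)
```

The tanh output bounds each action component to $(-1, 1)$, which is then scaled to the $K_p$ and $K_d$ ranges given above.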

**Figure 6.** Actor network architecture.

**Figure 7.** Critic network architecture.

#### **3. Training and Simulation Results**

#### *3.1. Network Training*

The algorithm code was written in PyCharm (JetBrains, Czech Republic) with TensorFlow 2.0.1 (Google Brain, USA) as the framework, and the code was run on a computer with 8 GB of RAM. The maximum time step of each training episode was 2500, and the number of training runs was set to 1500. On average, it took 154 min to complete the training for each path-following task. The Adam optimizer, based on gradient descent, was used to learn the parameters of the deep neural network during training. To test the superiority of the SAC algorithm, training and learning based on the DDPG algorithm were performed for comparison. The hyperparameters of the agents are shown in Table 3.

In the early stages of network training, the strategy was almost random and the agent could not learn much useful experience, so the desired path was not followed well. Figure 8 shows the training process. To better highlight the average performance and fluctuation range of each algorithm, it is drawn as a mean–variance curve, with the vertical coordinate being the average return per 10 training sessions. The return was calculated after each training run, and the parameters of each network were updated $N_T$ times according to the reward. Note that the reward is the immediate return on an action taken, whereas the return is the sum of the immediate rewards over the training run.


**Table 3.** Hyperparameters of the algorithms.

**Figure 8.** Training process.

It can be seen from Figure 8 that all three algorithms converge, with SAC-auto converging faster than DDPG. The average return and mean square deviation (MSD) after training are shown in Table 4. The higher average return and smaller MSD indicate that the agent based on the SAC algorithm completes the path-following task better and has better stability.

**Table 4.** Comparison of results after training.


#### *3.2. Simulation Results*

In this section, the effectiveness of the proposed method is verified by linear and circular path-following simulations in a simulated wind and wave environment, with $u_0 = 1.242$ m/s, $v_0 = 0$ m/s, and $r_0 = 0$ rad/s at the beginning. Under the same guidance law, the three trained RL-based controllers are compared with the adaptive PID parameter controller of [28].

In this paper, in order to verify the anti-interference capability and navigation stability of the system, a disturbance in the range $[-0.2 \times 10^3, 0.2 \times 10^3]$ N was added to the transverse force, and a disturbance in the range $[-0.2 \times 10^3, 0.2 \times 10^3]$ N·m was added to the turning moment. The transverse disturbing force is shown in Figure 9, and the turning disturbing moment in Figure 10.

**Figure 9.** Transverse disturbing force.

**Figure 10.** Turning disturbing moment.

#### 3.2.1. Linear Path Following

The linear reference path was designed as the line segment between the points (20, 20) and (160, 20). The initial position of the USV is (0, 0), and its initial heading is parallel to the path. Figure 11 compares the path following achieved by the controllers mentioned above. Figures 12 and 13 compare the heading angle and rudder angle during the path-following process. The path-tracking errors of the four controllers are shown in Table 5. The controller based on the reinforcement learning algorithm takes about 2 to 3 s to complete a path-following task, which is close to the computation time reported in the literature [28], although the actual runtime should be taken as the reference.

**Figure 11.** Path following.

**Figure 12.** Heading angle.

**Figure 13.** Rudder Angle.

**Table 5.** Data comparison table of four algorithms.


It can be seen from Figure 11 that, in the presence of a disturbing force, the tracking trajectories obtained with the four controllers all finally approach the desired paths within the specified error range. The rudder angle is operated rapidly to overcome the disturbance of the waves. Compared to the adaptive controller, the RL-based controllers have less overshoot during steering and produce smoother trajectories. Compared to the DDPG controller, the SAC-auto controller performs better in both heading control and rudder maneuvering. According to Table 5, the steady-state performance of the SAC-auto controller is improved relative to the SAC controller: when the desired direction changes, the improved parameters allow the USV to adjust quickly to the desired direction and follow the desired path stably, and the average deviation of the heading angle in the steady state is limited to 0.5°. It can also be concluded from Figure 13 that the fluctuation of the rudder angle with SAC-auto is the smoothest, with a maximum fluctuation of less than 5°, indicating that rudder wear is slight and the steering gear is well protected.

The parameter curves of *K*<sub>p</sub> and *K*<sub>d</sub> output by SAC-auto are shown in Figure 14. Figure 15 shows the transverse force and turning moment of the USV. Under the control of the SAC-auto algorithm, the transverse force and turning moment fluctuate less, because the maneuvering fluctuation of the rudder angle is smaller.
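The role of the policy-supplied gains can be sketched as a PD heading law in which *K*<sub>p</sub> and *K*<sub>d</sub> are provided at each step by the SAC-auto agent. The sign convention and the 35° rudder saturation limit are assumptions for illustration, not values stated in this section.

```python
def pd_rudder_command(kp, kd, heading_err, yaw_rate, delta_max=35.0):
    """PD heading law with gains (kp, kd) supplied at each step by the
    tuning policy; the command is saturated at the actuator limit."""
    delta = kp * heading_err - kd * yaw_rate
    # Clamp to the physical rudder range [-delta_max, delta_max]
    return max(-delta_max, min(delta_max, delta))
```

A smoother gain schedule yields a smoother rudder command, which is consistent with the reduced transverse-force and turning-moment fluctuation observed in Figure 15.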

**Figure 15.** Transverse forces and turning moments.

3.2.2. Curve and Polyline Path Following

The above simulation results verify the feasibility and disturbance-rejection capability of the SAC-auto algorithm when following a straight path. To verify the performance of the algorithm on more complex desired paths, and to check whether the SAC-auto algorithm can adaptively produce appropriate PD parameters, path-following simulations for zigzag and turning paths are performed.

**Scenario 1.** The desired path of the zigzag is designed as the polyline between points (20, 20), (100, 100), (180, 20), and (260, 100). The initial position of the USV is located at (0, 0), and the initial heading is parallel to the Y-axis.

**Scenario 2.** The turning path is a circle with the point (0, 0) as the center and 40 m as the radius. The initial position of the USV is (0, 0), and the initial heading is parallel to the Y-axis.
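The two scenario paths can be set up as data in a few lines. The waypoint-switching radius of 5 m is an assumption for illustration; the paper does not specify how the guidance law advances between the zigzag legs.

```python
import math

# Scenario 1: zigzag polyline waypoints from the paper
ZIGZAG = [(20, 20), (100, 100), (180, 20), (260, 100)]

def active_segment(p, waypoints, switch_radius=5.0, idx=0):
    """Advance to the next leg once the USV is within switch_radius of the
    current target waypoint; returns the active leg endpoints and index."""
    while idx < len(waypoints) - 1 and math.dist(p, waypoints[idx + 1]) < switch_radius:
        idx += 1
    return waypoints[idx], waypoints[min(idx + 1, len(waypoints) - 1)], idx

# Scenario 2: circle of radius 40 m centered at the origin, sampled at 1° steps
CIRCLE = [(40 * math.cos(t), 40 * math.sin(t))
          for t in (2 * math.pi * k / 360 for k in range(360))]
```

Each active zigzag leg can then be fed to the straight-line LOS guidance, while the circle serves as the reference for the turning test.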

Figure 16 shows the graphs for path following, rudder angle, heading angle, and the curves of *K*<sub>p</sub> and *K*<sub>d</sub> for the different algorithms during the tracking process.

**Figure 16.** Zigzag and circle path following, heading angle, rudder angle, and *K*<sub>p</sub>, *K*<sub>d</sub> curves.

It can be seen from the simulations that the SAC-auto controller performs better in both heading control and rudder maneuvering when following zigzag and turning paths under wave disturbance. The rudder-angle fluctuation with SAC-auto is the smoothest, indicating that the SAC-auto control method performs well when following a complex desired path.

### **4. Conclusions**

The classic adaptive PID control method used for path following does not perform well under complex conditions such as following a curvilinear path or operating under wave interference. To address this issue, this paper presents a path-following control method based on SAC for PID parameter tuning. First, a 3-DOF USV dynamics model based on the Abkowitz model was established. Second, the guidance system using the line-of-sight method and the USV heading control system with the PID controller were designed. Third, the SAC algorithm was applied, and the state space, action space, and reward function were designed for RL training on the path-following scenarios, enabling the SAC agent to adjust the PID parameters adaptively and rapidly during the simulation. Finally, the algorithm was evaluated in simulation experiments of path following along straight-line, zigzag, and turning paths under wave disturbance, which verify the feasibility and robustness of the proposed method. In future research, real-world experiments should be conducted to further validate the method.

**Author Contributions:** Data curation, C.X.; funding acquisition, R.G.; methodology, L.S.; software, C.X.; visualization, C.X.; original draft, L.H.; writing—original draft, C.X. and L.S.; writing—review and editing, J.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

