Article

Autonomous Navigation Decision-Making Method for a Smart Marine Surface Vessel Based on an Improved Soft Actor–Critic Algorithm

Navigation College, Dalian Maritime University, Dalian 116026, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2023, 11(8), 1554; https://doi.org/10.3390/jmse11081554
Submission received: 4 July 2023 / Revised: 27 July 2023 / Accepted: 4 August 2023 / Published: 5 August 2023
(This article belongs to the Special Issue AI for Navigation and Path Planning of Marine Vehicles)

Abstract

In this study, an intelligent hybrid algorithm based on deep-reinforcement learning (DRL) is proposed to achieve autonomous navigation and intelligent collision avoidance for a smart marine autonomous surface vessel (SMASV). First, the kinematic model of the SMASV is used, and Rules 13 to 17 of the Convention on the International Regulations for Preventing Collisions at Sea (COLREGs) are introduced. Then, the electronic chart is rasterized and used for path planning. Next, the states, actions, and reward functions are designed, and collision-avoidance strategies are formulated. In addition, a temperature factor and a constrained loss function are used to improve the soft actor–critic (SAC) algorithm. This improvement reduces the difficulty of hyperparameter tuning and improves sampling efficiency. By comparing the improved SAC algorithm with other strategy-based DRL algorithms, it is shown that the improved SAC algorithm converges faster than the other algorithms. During the experiments, some unknown obstacles are added to the simulation environment to verify the collision-avoidance ability of the trained SMASV. Moreover, eight sea areas are randomly selected to verify the generalization ability of the intelligent-navigation system. The results show that the proposed method can plan a path for the SMASV accurately and effectively, and that the SMASV's decision-making behavior in the collision-avoidance process conforms to the COLREGs in both unknown and dynamic environments.

1. Introduction and Background

With the acceleration of globalization, maritime traffic is becoming increasingly important. Ships, however, are usually underactuated and are characterized by large tonnage, relatively weak propulsion power, and low speed. According to an International Maritime Organization (IMO) report, non-standard ship operation by the officers on watch is a significant factor behind the frequent occurrence of maritime accidents [1]. Therefore, methods for the autonomous collision avoidance of ships must be developed. In this regard, artificial intelligence has developed rapidly in recent years, and the combination of robot technology and control technology has provided a new solution to the problems of ship path planning and intelligent collision avoidance [2].
In the process of ship navigation, intelligent navigation systems face several challenges, including unknown environments and perception uncertainty. Therefore, the core of developing intelligent ship navigation lies in path planning and intelligent collision avoidance. One class of approaches builds path-planning or decision-making algorithms on mathematical or physical function models. The Avalon ship was designed by Erckens et al. [3] to cross the Atlantic Ocean autonomously. The route was planned using the A-star algorithm, and the authors considered the effect of dynamic obstacles and airflow in their study. However, the A-star algorithm required too much memory to compute an optimal solution with the available resources. Song et al. [4] proposed a smoothed A-star algorithm; the route provided by the algorithm contained no redundant waypoints and was continuous. The A-star algorithm was efficient for solving global path-planning problems, but it could not cope with unknown obstacles appearing in the environment and was therefore ineffective as a decision-making algorithm. Ding et al. [5] improved decision-making by incorporating a particle swarm algorithm into the ship-path-planning system. Mathematical models of unmanned surface vessels (USVs) and the marine environment were established in their study, and the path was optimized by particle swarm optimization. Nevertheless, the search ability of the algorithm was poor and its search accuracy was not high. Liu et al. [6] adopted an ant-colony optimization-based algorithm (ACA) and a clustering algorithm (CA) to solve the USV path-planning problem. The algorithm could automatically select an appropriate search range for different environments and smooth the planned path. However, this ant-colony-based approach was not practical for solving dynamic-obstacle problems. A rapidly exploring random tree (RRT) algorithm was improved by Zhang et al. [7], who added a dynamic hybrid step size and target attraction, thereby improving its convergence. Nevertheless, the improved RRT algorithm randomly selected actions, which was not practical for solving navigation problems in narrow waters because too much time was wasted on collision detection. Furthermore, a collision-avoidance method based on the velocity–obstacle (VO) method and improved particle swarm optimization was proposed by Xia et al. [8]. Their method transformed the decision-making problem into a multi-objective optimization problem in a continuous action space, addressing collision avoidance by establishing a ship-collision-avoidance model together with risk assessment. Lyu et al. [9] adopted a robust artificial potential field (APF) algorithm to improve the decision-making ability of ship-collision avoidance. A new repulsive potential-field function and a virtual force were designed for their experiments, and the ship domain and safety requirements were introduced before planning collision avoidance. However, adjusting the parameters of the potential-field function was challenging and, because the same parameters did not apply to all collision-avoidance situations, the generalization ability of this algorithm was weak. A collision-avoidance method based on the COLREGs and a reciprocal velocity obstacles (RVO) method was proposed by Wang et al. [10].
The distance to the closest point of approach (DCPA) and the time to the closest point of approach (TCPA) were used for collision-risk assessment in this algorithm. However, owing to the complexity of multi-ship encounters and of ocean perception, the robustness of model-based methods for solving navigation problems still needs to be improved.
Another typical method for the autonomous collision avoidance of ships is model-free reinforcement learning. This approach is independent of an established mathematical framework. Through the interaction of an agent and the environment, a neural network is updated iteratively, and a generalization-capable neural-network model is then trained. It can adapt to the complex environment and learn the best strategy through the interaction between the environment and the ship.
Chen et al. [11] applied an improved Q-learning algorithm to USV path planning. In particular, the studied USV could effectively avoid static obstacles and forbidden areas using the modified algorithm. However, traditional reinforcement-learning algorithms cannot effectively deal with continuous action-space problems. In addition, when the state space input is substantially high, a traditional algorithm can readily lead to an input dimension explosion. Thus, a deep Q-learning (DQN) algorithm, combining deep-learning image processing and decision-making, was proposed by Zhang et al. [12]. Their algorithm used a depth camera to obtain the external environment information and took the processed environment information as the state input of the DQN algorithm. However, convolutional neural networks were used for image processing, increasing the computational burden. In addition, sea conditions might have affected the image data. Therefore, the depth camera was not suitable at sea. An algorithm based on a combination of DQN and APF was proposed by Li et al. [13]. In the early training stage, the algorithm adopted an APF to guide the agent to the target position and obtain a higher reward value so that the USV could rapidly find the optimal path. A ship-collision-avoidance algorithm based on the DQN algorithm was proposed by Shen et al. [14], who proposed a simplified ship domain and introduced it into the reward function. However, the DQN algorithm had the problem of overestimation, and unmanned surface ships were prone to make action-selection errors in complex environments. A reinforcement learning training method based on electronic charts was proposed by Guo et al. [15]. In their new algorithm, automatic-identification-system (AIS) information was regarded as the state space of the neural-network input. The trained neural-network model could be used for autonomous path planning. A new multi-ship collision-avoidance algorithm was adopted by Zhao et al. [16]. In particular, the authors considered the powerful guidance ability of the line-of-sight (LOS) algorithm to assist in training the proximal policy optimization (PPO) algorithm. The experimental results showed that the algorithm had a desirable collision-avoidance effect in multi-ship encounter situations. In addition, the collision-avoidance path of each ship conformed to the COLREGs. A deep deterministic policy gradient (DDPG) algorithm, based on a priority sampling mechanism, was improved by Xu et al. [17], and the authors defined five complex reward functions. The algorithm designed the USV navigation situation model and established a geometric model of the two-ship encounter situation, according to navigational practice. A collision cone was introduced for static obstacles, and the COLREGs were monitored for dynamic ships to quantify a collision into five types. Although the algorithm achieved effective collision avoidance in complex environments with dynamic and static obstacles, the reward function was too complex and the experiment did not have a generalization ability. Do-Hyun Chun et al. [18] adopted a risk-assessment algorithm and ship domain to solve the ship-collision-avoidance problem. The algorithm needed to fully consider the DCPA and TCPA between their own ship and the target ship. However, TCPA was related to the time at which the two ships encountered. Thus, calculating the rendezvous time between the two ships during the experiment was challenging. Long et al. 
[19] proposed a collision-avoidance strategy based on the PPO algorithm. In their experiment, the distributed sensors of a multi-robot system were used, and the learned policy directly mapped raw sensor measurements to the agents' steering commands based on the motion speed; the trained network model showed good generalization.
A summary of recent literature on ship-collision avoidance is provided in Table 1.
Based on the above studies, we developed an intelligent-navigation decision-making system for a smart marine autonomous surface vessel (SMASV), based on an improved soft actor–critic (SAC) and A-star algorithm. The main contributions of this study are as follows:
  • The convergence speed of the loss function and the average reward function of the improved SAC algorithm were better than those of other popular strategy-based DRL algorithms.
  • The trained deep-neural-network model for the SMASV had a strong generalization ability, and could solve the problem of multiple SMASV encounters in complex situations.
The remainder of the paper is organized as follows: Section 2 introduces the preliminary considerations and the formulation of the problem. Section 3 introduces the proposed approach. Section 4 presents a simulation and comparative experiments. Finally, Section 5 concludes the study.

2. Preliminary Considerations and Problem Formulation

2.1. Intelligent SMASV Navigation System

Overall, the proposed intelligent SMASV navigation system comprises four components: the path planner, the sensor module, the decision-making module, and the control module [20]. The planning system refers to the path-planning process before sailing. In this study, the A-star algorithm was used to plan the global path of the SMASV, and the heuristic search was performed on the rasterized electronic chart to plan the optimal path rapidly. In particular, in the path-planning process, the planning system can only plan on an electronic chart with known information and cannot avoid unknown obstacles.
The sensor module contains two perception components: the SMASV's position information and the navigation-environment information [21]. The sensor part is mainly composed of the global navigation satellite system (GNSS), sonar, a sounder, an inertial measurement unit (IMU), a gyrocompass, marine radar, lidar, and so on.
The decision-making module is composed of a deep-neural-network (DNN) model trained with the deep-reinforcement-learning SAC algorithm. It realizes collision avoidance with respect to unknown and dynamic obstacles by processing the state information provided by the sensor module and outputting the corresponding action commands.
Finally, the control module executes the rudder angle command given by the decision system, so that the SMASV can sail safely according to the specified rudder order. The architecture of the intelligent SMASV navigation system is illustrated in Figure 1.

2.2. Construction of the Path Planner

The global path planning of the SMASV includes two parts: rasterization of the electronic chart and global path planning on the resulting grid [22]. In this study, the grid method was used to model the electronic chart. The grayscale processing of the electronic chart is shown in Figure 2.
First, the electronic chart was processed into a grayscale map, improving the chart rasterization accuracy. Then, the grayscale image was rasterized to form the final electronic chart. This chart had a simple structure, which helped in developing a simulation environment and path planning. The processing of the marine environmental information using the grid method is illustrated in Figure 2.
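As an illustration of the gridding step, the sketch below thresholds a grayscale chart array into a binary occupancy grid (1 = land/obstacle, 0 = navigable water). The cell size and threshold are illustrative assumptions, not the values used in the paper.

```python
# Minimal sketch: grayscale chart (2-D uint8 array, 0-255) -> coarse binary occupancy grid.
import numpy as np

def rasterize_chart(gray_chart: np.ndarray, cell_size: int = 4, land_threshold: int = 128) -> np.ndarray:
    """Downsample a grayscale chart into a binary occupancy grid (1 = obstacle, 0 = navigable)."""
    rows, cols = gray_chart.shape[0] // cell_size, gray_chart.shape[1] // cell_size
    grid = np.zeros((rows, cols), dtype=np.uint8)
    for i in range(rows):
        for j in range(cols):
            cell = gray_chart[i * cell_size:(i + 1) * cell_size,
                              j * cell_size:(j + 1) * cell_size]
            # Mark the cell as an obstacle if any pixel is darker than the threshold (dark = land)
            grid[i, j] = 1 if (cell < land_threshold).any() else 0
    return grid

# Example call on a random "chart" just to show the signature
demo_chart = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
occupancy = rasterize_chart(demo_chart)
```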
The A-star algorithm is a heuristic search algorithm that incorporates prior knowledge into path planning: it accounts not only for the cost of the path already traversed, but also for an estimate of the cost to the points or states not yet visited. Hence, the A-star algorithm was used in this paper to solve the shortest-path problem on the static grid, as sketched below.
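The following compact A-star sketch searches such a binary occupancy grid with an 8-connected neighborhood and a Euclidean heuristic; the move costs and tie-breaking scheme are illustrative choices.

```python
# A-star over a binary occupancy grid (0 = navigable, 1 = obstacle).
import heapq, itertools, math
import numpy as np

def a_star(grid: np.ndarray, start: tuple, goal: tuple):
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]
    h = lambda p: math.hypot(p[0] - goal[0], p[1] - goal[1])   # admissible Euclidean heuristic
    tie = itertools.count()                                     # tie-breaker for the heap
    open_set = [(h(start), next(tie), 0.0, start, None)]
    came_from, g_best = {}, {start: 0.0}
    while open_set:
        _, _, g, node, parent = heapq.heappop(open_set)
        if node in came_from:                                   # already expanded
            continue
        came_from[node] = parent
        if node == goal:                                        # reconstruct path goal -> start
            path = [node]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        for dr, dc in moves:
            nxt = (node[0] + dr, node[1] + dc)
            if not (0 <= nxt[0] < grid.shape[0] and 0 <= nxt[1] < grid.shape[1]):
                continue
            if grid[nxt] == 1:                                  # skip obstacle cells
                continue
            ng = g + math.hypot(dr, dc)
            if ng < g_best.get(nxt, float("inf")):
                g_best[nxt] = ng
                heapq.heappush(open_set, (ng + h(nxt), next(tie), ng, nxt, node))
    return None                                                 # no path found

# Usage: path = a_star(occupancy, (0, 0), (occupancy.shape[0] - 1, occupancy.shape[1] - 1))
```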

2.3. Ship-Motion Mathematical Model

The actual motion of the SMASV is complex, with six degrees of freedom (6-DOF). Moreover, overly high-dimensional and elaborate models can contain huge numbers of parameters that are challenging to estimate and difficult to solve [23]. Therefore, we simplify the 6-DOF model to a three-degrees-of-freedom (3-DOF) model. The SMASV 3-DOF model has sufficient accuracy for ship path-planning and automatic collision avoidance [24].
As shown in Figure 3, the ship motion Equation [25] is as follows:
$$\dot{\eta} = R(\psi)\,v, \qquad M\dot{v} = \tau - C(v)v - D(v)v - g(v) + \tau_w$$
where $\eta = [x, y, \psi]^T$ refers to the position coordinates and the heading angle of the ship, $v = [u, v, r]^T$ represents the 3-DOF velocity vector of the ship, $R(\psi)$ represents the rotation matrix of the ship, $g(v) = [g_u, g_v, g_r]^T$ represents the unmodeled dynamics, and $\tau_w$ is the sum of the external environmental disturbance forces. $R(\psi)$ is expressed as:
$$R(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
Matrix $C(v)$ is composed of a centripetal matrix and a Coriolis force matrix that account for the effects of fluid dynamics:
$$C(v) = C_A(v) + C_{RB}$$
$$C_A(v) = \begin{bmatrix} 0 & 0 & Y_{\dot{v}}v + Y_{\dot{r}}r \\ 0 & 0 & -X_{\dot{u}}u \\ -Y_{\dot{v}}v - Y_{\dot{r}}r & X_{\dot{u}}u & 0 \end{bmatrix}$$
$$C_{RB} = \begin{bmatrix} 0 & 0 & -m(x_g r + v) \\ 0 & 0 & mu \\ m(x_g r + v) & -mu & 0 \end{bmatrix}$$
The inertia matrix of the system includes an added-mass matrix $M_A$, which accounts for the effects of fluid mechanics, and a rigid-body mass matrix $M_{RB}$:
$$M = M_A + M_{RB}, \qquad M_A = \begin{bmatrix} -X_{\dot{u}} & 0 & 0 \\ 0 & -Y_{\dot{v}} & -Y_{\dot{r}} \\ 0 & -Y_{\dot{r}} & -N_{\dot{r}} \end{bmatrix}, \qquad M_{RB} = \begin{bmatrix} m & 0 & 0 \\ 0 & m & m x_g \\ 0 & m x_g & I_z \end{bmatrix}$$
where $X_{\dot{u}}, Y_{\dot{v}}, Y_{\dot{r}}, N_{\dot{r}}$ are the hydrodynamic added-mass derivatives, $m$ is the mass of the ship, and $I_z$ is the moment of inertia of the ship around the $z$-axis.
$D(v)$ is a nonlinear damping matrix, which can be expressed as the sum of a linear and a nonlinear damping term:
$$D(v) = D_L + D_{NL}(v)$$
$$D_L = \begin{bmatrix} -X_u & 0 & 0 \\ 0 & -Y_v & -Y_r \\ 0 & -N_v & -N_r \end{bmatrix}$$
$$D_{NL}(v) = \begin{bmatrix} -X_{|u|u}|u| & 0 & 0 \\ 0 & -Y_{|v|v}|v| & -Y_{|v|r}|v| \\ 0 & -N_{|v|v}|v| & -N_{|v|r}|v| \end{bmatrix}$$
where $Y_{(\cdot)}, X_{(\cdot)}, N_{(\cdot)}$ are the hydrodynamic derivatives, which represent the hydrodynamic forces and moments acting on the ship.
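As a concrete illustration of how the 3-DOF model above can be advanced in time, the following sketch performs one Euler-integration step of $\dot{\eta} = R(\psi)v$ and $M\dot{v} = \tau - C(v)v - D(v)v + \tau_w$. The numerical matrices are placeholders rather than the vessel's identified parameters, and the unmodeled term $g(v)$ and the nonlinear damping are omitted for brevity.

```python
# One Euler step of the 3-DOF ship model: eta = [x, y, psi], nu = [u, v, r].
import numpy as np

def rotation(psi: float) -> np.ndarray:
    return np.array([[np.cos(psi), -np.sin(psi), 0.0],
                     [np.sin(psi),  np.cos(psi), 0.0],
                     [0.0,          0.0,         1.0]])

def step(eta, nu, tau, tau_w, M, C_fn, D_fn, dt=0.1):
    """Advance the ship state (eta, nu) by one time step of length dt."""
    eta_dot = rotation(eta[2]) @ nu
    nu_dot = np.linalg.solve(M, tau - C_fn(nu) @ nu - D_fn(nu) @ nu + tau_w)
    return eta + dt * eta_dot, nu + dt * nu_dot

# Placeholder system matrices (illustrative values only, not identified ship parameters)
M_demo = np.diag([1.2e7, 1.5e7, 2.0e9])
C_demo = lambda nu: np.zeros((3, 3))
D_demo = lambda nu: np.diag([5.0e5, 8.0e5, 6.0e7])

eta0, nu0 = np.zeros(3), np.array([6.2, 0.0, 0.0])   # roughly 12 kn surge speed
eta1, nu1 = step(eta0, nu0, tau=np.zeros(3), tau_w=np.zeros(3),
                 M=M_demo, C_fn=C_demo, D_fn=D_demo)
```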

2.4. COLREGs

It is necessary to consider the COLREGs when solving the intelligent-decision problem of ships using deep-reinforcement learning [26]. The encounter between two ships is described in four cases, as specified in Rules 13 to 17 of the COLREGs [12] and shown in Figure 4; a minimal bearing-based classification sketch is given after the list below.
(a)
Head-on situation: When two vessels (OS and TS) are meeting on reciprocal or nearly reciprocal courses within an azimuth angle of (0°, 5°) or (355°, 360°), this situation should be judged as a head-on situation. Each vessel (OS and TS) should alter her course to starboard so that each will pass on the port side of the other. The head-on situation is shown in Figure 4a.
(b)
Overtaking situation: A vessel (OS) is deemed to be overtaking when coming up to another vessel (TS) from a direction within the range of (112.5°, 247.5°) abaft her beam. The OS vessel should alter her course to starboard or port to give way to the TS. The overtaking situation is shown in Figure 4b.
(c)
Crossing give-way situation: When the other vessel (TS) is located within (5°, 112.5°) of the OS's azimuth, the situation is considered a crossing give-way situation. The OS should alter her course to starboard to avoid the TS. The crossing give-way situation is shown in Figure 4c.
(d)
Crossing stand-on situation: When the other vessel (TS) is located within (247.5°, 355°) of the OS's azimuth, the situation is considered a crossing stand-on situation. The OS should maintain her course and speed while keeping a timely watch on the TS's movements. If the TS and OS come to pose a collision threat, the OS should alter her course to starboard or port to avoid collision. The crossing stand-on situation is shown in Figure 4d.
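For illustration only, the four situations above can be summarized as a mapping from the relative bearing of the TS (measured clockwise from the OS bow) to a COLREGs label, following the sector boundaries listed in (a)-(d). A full classifier would also use the relative course and speed of the two ships; the function name and boundary handling below are assumptions.

```python
# Simplified COLREGs encounter classification from the TS's relative bearing (degrees).
def classify_encounter(relative_bearing_deg: float) -> str:
    b = relative_bearing_deg % 360.0
    if b <= 5.0 or b >= 355.0:
        return "head-on"            # Rule 14: both vessels alter course to starboard
    if 5.0 < b <= 112.5:
        return "crossing give-way"  # Rule 15: OS gives way, altering to starboard
    if 112.5 < b <= 247.5:
        return "overtaking"         # Rule 13: overtaking sector abaft the beam
    return "crossing stand-on"      # Rule 17: OS stands on and monitors the TS

# Example: a TS bearing 090 degrees relative puts the OS in a give-way situation
print(classify_encounter(90.0))     # -> "crossing give-way"
```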

3. Proposed Approach

3.1. Improved SAC Algorithm

The SAC algorithm was proposed by Haarnoja et al. [27]. Note that the Boltzmann distribution of the Q function in the traditional maximum-entropy RL soft Q-learning algorithm cannot solve the action-selection problem in continuous space; the SAC algorithm overcomes this limitation and achieves good convergence.
Entropy is a measure of the randomness of a random variable. In reinforcement learning, $H(\pi(\cdot\mid s))$ denotes the randomness of policy $\pi$ in state $s$. The idea of maximum-entropy reinforcement learning is to make the policy more random while still maximizing the cumulative reward. Therefore, an entropy regularization term is added to the RL objective as follows:
$$\pi^{*} = \arg\max_{\pi}\, \mathbb{E}_{\pi}\left[\sum_{t} r(s_t, a_t) + \alpha H(\pi(\cdot\mid s_t))\right]$$
where $\alpha$ is a regularization coefficient (the temperature) that controls the importance of the entropy term.
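As a small illustration of the entropy term $\alpha H(\pi(\cdot\mid s))$ in the objective above, the snippet below evaluates it for a one-dimensional Gaussian action distribution of the kind the SAC actor outputs; the numbers are arbitrary examples.

```python
# Entropy bonus alpha * H(pi(.|s)) for a Gaussian policy.
import torch
from torch.distributions import Normal

alpha = 0.2                                                    # temperature (entropy weight)
policy = Normal(loc=torch.tensor([0.0]), scale=torch.tensor([0.5]))
entropy_bonus = alpha * policy.entropy().sum()
# Higher-variance (more random) policies receive a larger bonus, encouraging exploration.
print(entropy_bonus.item())
```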
The entropy regularization increases the exploration degree of the reinforcement learning algorithm. The larger the value of α is, the more strongly the algorithm explores, which helps to improve the speed of strategy learning. In addition, a sufficiently large value of α can reduce the possibility of agents falling into the local optima. The difference between traditional reinforcement learning and maximum entropy reinforcement learning for the ship decision-making process is shown in Figure 5.
In Figure 5, the ship has two routes to reach the target point. Obviously, Path 1 is shorter and more effective. The two peaks of the curve in Figure 5 correspond to the two paths: the high peak represents Path 1, and the low peak represents Path 2. Traditional reinforcement-learning methods typically use a policy distribution centered on the largest Q value, such as the Gaussian distribution represented by the red curve in Figure 5. In that case, the ship's exploration is completely biased toward Path 1, and Path 2 is ignored entirely. Although the agent can learn good behavior under this policy distribution, once the environment changes and Path 1 is blocked by unknown obstacles, the ship, having learned only Path 1, must learn Path 2 from scratch in order to reach the target again. Therefore, the soft Bellman equation is used:
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\left[V(s_{t+1})\right]$$
The state value function can be written as follows:
$$V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t)\right] = \mathbb{E}_{a_t \sim \pi}\left[Q(s_t, a_t)\right] + \alpha H(\pi(\cdot \mid s_t))$$
According to the soft Bellman equation, soft policy evaluation converges for a policy $\pi$ in a finite state and action space. Therefore, the policy-improvement step can be written as follows:
$$\pi_{new} = \arg\min_{\pi'} D_{KL}\!\left(\pi'(\cdot \mid s)\,\middle\|\, \frac{\exp\!\left(\tfrac{1}{\alpha} Q^{\pi_{old}}(s, \cdot)\right)}{Z^{\pi_{old}}(s)}\right)$$
This target distribution is a form of the Boltzmann distribution (an energy-based model) and assigns a non-zero probability to every action. Thus, the ship is more inclined to learn all the behaviors that can accomplish the task and can better cope with the local-optimum problem caused by environmental changes. In the SAC algorithm, the action-value function Q and the policy function π are both modeled by neural networks. When evaluating the Q value, the algorithm takes the smaller of two estimates, reducing overestimation. The loss function of the Q network is as follows:
$$L_Q(\omega) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim R}\left[\tfrac{1}{2}\left(Q_\omega(s_t, a_t) - \left(r_t + \gamma V_{\omega^-}(s_{t+1})\right)\right)^2\right] = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim R,\; a_{t+1} \sim \pi_\theta(\cdot \mid s_{t+1})}\left[\tfrac{1}{2}\left(Q_\omega(s_t, a_t) - B\right)^2\right]$$
$$B = r_t + \gamma\left(\min_{j=1,2} Q_{\omega_j^-}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1})\right)$$
where $R$ represents the replay buffer of data collected by past policies. Because SAC is an off-policy algorithm, two target Q networks are used to make training more stable. The target-network parameters are updated as follows:
$$\omega_1^- \leftarrow \tau \omega_1 + (1 - \tau)\,\omega_1^-$$
$$\omega_2^- \leftarrow \tau \omega_2 + (1 - \tau)\,\omega_2^-$$
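A PyTorch sketch of the twin-critic target $B$, the Q-network loss, and the soft (Polyak) target updates above is given below. The function and variable names are illustrative; `log_prob_next` denotes $\log\pi_\theta(a_{t+1}\mid s_{t+1})$ for an action sampled from the current policy.

```python
# Twin-critic SAC target, critic loss, and Polyak averaging of the target networks.
import torch
import torch.nn.functional as F

def critic_target(reward, not_done, q1_targ_next, q2_targ_next, log_prob_next,
                  gamma=0.99, alpha=0.2):
    """B = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    soft_q_next = torch.min(q1_targ_next, q2_targ_next) - alpha * log_prob_next
    return reward + gamma * not_done * soft_q_next

def critic_loss(q1_pred, q2_pred, target):
    # Both critics regress toward the same (detached) soft target
    return F.mse_loss(q1_pred, target.detach()) + F.mse_loss(q2_pred, target.detach())

def soft_update(target_net, online_net, tau=0.005):
    """omega_bar <- tau * omega + (1 - tau) * omega_bar for every parameter."""
    with torch.no_grad():
        for p_targ, p in zip(target_net.parameters(), online_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p.data)
```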
The loss function for the policy $\pi$ is derived from the KL divergence and is expressed as follows:
$$L_\pi(\theta) = \mathbb{E}_{s_t \sim R,\; a_t \sim \pi_\theta}\left[\alpha \log\left(\pi_\theta(a_t \mid s_t)\right) - Q_\omega(s_t, a_t)\right]$$
In the maximum-entropy reinforcement-learning algorithm, the selection of the entropy regularization coefficient is crucial, and the appropriate entropy weight differs between states. For example, when the optimal action is uncertain, the entropy weight should be slightly higher; once certain actions have been determined, the entropy weight can be reduced. The reinforcement-learning goal is therefore expressed as maximizing the expected return under the constraint that the mean policy entropy is greater than $H_0$. The loss function for $\alpha$ is then obtained as follows:
$$L(\alpha) = \mathbb{E}_{s_t \sim R,\; a_t \sim \pi_\theta}\left[-\alpha \log\left(\pi_\theta(a_t \mid s_t)\right) - \alpha H_0\right]$$
When the entropy of the policy is lower than $H_0$, minimizing this loss increases $\alpha$, which in turn increases the weight of the entropy term when the policy function $\pi$ is trained. When the entropy of the policy is higher than $H_0$, minimizing the loss reduces $\alpha$. In this way, the temperature $\alpha$, a critical parameter in policy training, is adjusted automatically rather than tuned by hand.
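The corresponding policy loss and automatic temperature adjustment can be sketched as follows; `target_entropy` plays the role of $H_0$, and all names are illustrative assumptions.

```python
# Actor loss and automatic temperature (alpha) tuning for SAC.
import torch

def actor_loss(log_prob, q_min, alpha):
    # L_pi(theta) = E[ alpha * log pi(a|s) - Q(s, a) ], with q_min = min(Q1, Q2)
    return (alpha * log_prob - q_min).mean()

def temperature_loss(log_alpha, log_prob, target_entropy):
    # L(alpha) = E[ -alpha * (log pi(a|s) + H_0) ]; log_prob is detached so only
    # alpha is trained here. When policy entropy drops below H_0 this loss pushes
    # alpha up, and when entropy exceeds H_0 it pushes alpha down.
    return -(log_alpha.exp() * (log_prob.detach() + target_entropy)).mean()

# Typical usage: keep log_alpha as a learnable scalar with its own optimizer
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
```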
Remark 1.
For the algorithm selection, we found that DRL based on value learning [28,29] suffers from an overestimation that cannot be eliminated, owing to the bootstrapping of such algorithms in the iterative process and the presence of uneven noise. The Actor–Critic (AC) algorithm [30] contains two neural networks, one interacting with the environment to select the action (the actor network) and the other evaluating the quality of the action (the critic network); the current action is then drawn by random sampling from the action probability distribution. However, it is difficult for the traditional AC algorithm to converge rapidly. The PPO algorithm [31] was therefore proposed by OpenAI in 2017. Compared with trust region policy optimization (TRPO) [32], which relies on complex techniques such as a Taylor-expansion approximation, conjugate gradients, and a line search, PPO is simpler to use, relying only on a penalty or clipping term. Nevertheless, both TRPO and PPO are on-policy algorithms, implying that their sample efficiency is low. The DDPG algorithm [33] was proposed by Lillicrap et al. Compared with a stochastic policy, the deterministic policy adopted by DDPG considers only the maximum-probability action when solving continuous-action problems; thus, the sampling cost is significantly reduced and the sampling efficiency is improved. However, the training of the DDPG algorithm is unstable, its convergence is poor, and it is sensitive to hyperparameters, so adapting the algorithm to different complex environments is challenging. Multi-agent reinforcement learning, which is based on game theory and involves different agents having cooperative or competitive relationships, has gained substantial interest recently. However, the network model trained with the multi-agent reinforcement-learning approach [34] showed poor generalization and was insufficient for dealing with challenges involving unknown obstacles. As a result, the multi-agent reinforcement-learning method and DRL based on value learning are not suitable for this study.

3.2. Construction of Neural Network

In the improved SAC algorithm, the action selection of the SMASV depends on the state of the system. The decision-making module provides the corresponding action according to the environmental information detected by the SMASV in the past period. The SMASV can obtain the location of obstacles and targets through detection technology and positioning systems. Thus, the SMASV state space and action space must be designed to achieve intelligent decision-making.
Generally, state information is used as the input of the network for the improved SAC algorithm. The states of the network can be expressed as Equation (20):
$$S_{SMASV} = \left[x_p,\; y_p,\; x_g,\; y_g,\; (\zeta_1, \zeta_2, \zeta_3, \ldots, \zeta_{24})\right]$$
where $x_p$ and $y_p$ are the two-dimensional plane coordinates of the SMASV, $x_g$ and $y_g$ are the two-dimensional plane coordinates of the target point, and $(\zeta_1, \zeta_2, \zeta_3, \ldots, \zeta_{24})$ are the 24 laser-radar vector lines.
To reduce the complexity of the calculation and of the reward function, a continuous action space was set up. Because the SMASV is assumed to alter her course to port or starboard, or to maintain her course, only through rudder steering control, the action space of the SMASV was designed as the rudder-angle range [−20°, 20°].
The target-Q functions, the evaluation-Q functions, and the policy function π were fitted by neural networks. The actor network was used to fit the policy function π; it consisted of two fully connected layers with 256 neurons each and used the ReLU activation function. The target-Q and evaluation-Q functions were fitted by four networks, each with four fully connected layers of 256 neurons and the ReLU activation function. The difference between them is that the evaluation-Q networks score the current actions, whereas the target-Q networks provide the delayed evaluations used in the learning targets. The interaction process between the decision system and the environment is shown in Figure 6.
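A minimal PyTorch sketch of these networks is shown below. The 256-unit ReLU layers and the 28-dimensional state (4 position values plus 24 radar rays) follow the text; the critic is shown with two hidden layers for brevity (the paper uses four fully connected layers), and the Gaussian actor squashes its sample with tanh and scales it to the rudder range of ±20°.

```python
# Illustrative actor and critic architectures for the improved SAC decision module.
import torch
import torch.nn as nn

STATE_DIM, ACT_DIM, MAX_RUDDER = 28, 1, 20.0

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(),
                                  nn.Linear(256, 256), nn.ReLU())
        self.mu = nn.Linear(256, ACT_DIM)
        self.log_std = nn.Linear(256, ACT_DIM)

    def forward(self, state):
        h = self.body(state)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        raw = dist.rsample()                          # reparameterized sample
        action = torch.tanh(raw) * MAX_RUDDER         # rudder angle in degrees
        # log-probability with the tanh change-of-variables correction
        log_prob = dist.log_prob(raw) - torch.log(1 - torch.tanh(raw).pow(2) + 1e-6)
        return action, log_prob.sum(dim=-1, keepdim=True)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 256), nn.ReLU(),
                               nn.Linear(256, 256), nn.ReLU(),
                               nn.Linear(256, 1))

    def forward(self, state, action):
        return self.q(torch.cat([state, action], dim=-1))
```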
The RMSprop with a momentum term was used to optimize the network training. The working principle was to dynamically adjust the learning rate of each parameter by using the first-order moment estimation and the second-order moment estimation of the gradient. The learning rate of each iteration had a specific range and made the parameters stationary.
To prevent a reduction in neural network training efficiency, memory with an experience playback function was set up to store historical action and state information. This method can improve the use of data because each sample is used multiple times. In addition, this method can reduce the correlation between continuous samples and the parameter update variance. Using historical data in the training process can enhance the efficiency of the algorithm and accelerate the convergence of the loss value.
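A simple replay-memory sketch matching this experience-playback mechanism is shown below; the capacity and batch size are illustrative.

```python
# Experience replay: store transitions once, sample random mini-batches repeatedly.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 256):
        batch = random.sample(self.buffer, batch_size)   # breaks sample correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```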

3.3. Reward Function Setting

The purpose of the reward function setting is to allow the SMASV to follow the planned path while avoiding unknown obstacles and complying with the COLREGs. The reward function is calculated as the sum of the episode reward [35]. In particular, the reward function is divided into five parts, as shown in Figure 7.
(1)
Goal-approach reward 
The goal-approach reward function guides the SMASV to the target. Both the target-point position information and the SMASV position information are considered as part of the state information input by the neural network, which can be expressed as follows:
$$R_g = -\lambda_g \sqrt{(x_p - x_g)^2 + (y_p - y_g)^2}$$
(2)
Angle reward
In the SMASV decision-making process, the yaw angle is crucial to the decision-making system. Therefore, a yaw-angle reward function was added to the reward function. If the SMASV sails in the planned direction, a positive reward is given; otherwise, a negative reward is given. The specific setting is as follows:
$$R_{yaw} = t_r \,\lambda_a\, e^{-\varepsilon\, yaw\,\sqrt{(x_p - x_g)^2 + (y_p - y_g)^2}}$$
where $yaw$ is the heading-angle error, $t_r$ is the coefficient of the heading-angle error, $\lambda_a$ is the weight of the angle reward, and $\varepsilon$ is a parameter through which the reward value changes with the distance.
(3)
Reach target point and collision reward
A high reward value is obtained when the SMASV reaches the target location. This causes the SMASV to regard the target point as the learning target. A large negative reward is obtained when the SMASV collides with an obstacle. This allows the SMASV to learn to avoid obstacles. The reward function is set as follows:
$$R_r = \begin{cases} -1000, & \text{collision} \\ 2000, & \text{goal} \end{cases}$$
(4)
Obstacle-avoidance reward
The SMASV radar, which emits 24 detection lines, can be used to detect whether there are obstacles within the radar coverage. When the SMASV radar senses the presence of an obstacle, it calculates the obstacle-avoidance reward. We set a safe distance $S_1$ within the radar radius, as shown in Figure 8. Hence, the obstacle-avoidance reward function is set as follows:
$$R_l = \begin{cases} 0, & ob > R_{radar} \\ -5, & ob < R_{radar} \ \text{and}\ ob < S_1 \\ -1, & ob < R_{radar} \ \text{and}\ ob > S_1 \end{cases}$$
where $ob$ represents the distance between the obstacle and the SMASV.
(5)
COLREGs reward
The purpose of introducing the COLREGs reward is to make the SMASV comply with the COLREGs in collision-avoidance actions during navigation. When the SMASV decides on an effective measure, complying with the COLREGs and avoiding the target ship, the SMASV obtains a positive reward. However, if the action taken by the SMASV does not conform to the COLREGs, the SMASV will not be rewarded. In particular, the reward function is set as follows:
$$R_c = \begin{cases} -\lambda_c \sqrt{(x_p - x_g)^2 + (y_p - y_g)^2}, & \text{contrary to COLREGs} \\ \lambda_c \sqrt{(x_p - x_g)^2 + (y_p - y_g)^2}, & \text{otherwise} \end{cases}$$
Remark 2.
In designing the COLREGs reward function, several works in the literature [15,16,34,36] were consulted. The SMASV alters her course to starboard to avoid the TS in the head-on, overtaking, and crossing give-way situations (Rules 13–16) of the COLREGs. Additionally, altering the SMASV's course to starboard can more effectively reduce the probability of a collision in the crossing stand-on situation (Rule 17). To make the SMASV favor starboard alterations during collision-avoidance decision-making, we introduced the COLREGs into the reward function.
The reward function equation is expressed as follows:
$$R = R_g + R_{yaw} + R_r + R_l + R_c$$
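For illustration, the sketch below combines the five reward terms in the form reconstructed above. The weights, the safe distance $S_1$, and the simplified angle term are placeholders, and the COLREGs-compliance flag is assumed to come from the encounter-classification logic.

```python
# Illustrative combined reward R = R_g + R_yaw + R_r + R_l + R_c (placeholder weights).
import math

def total_reward(pos, goal, yaw_error, collided, reached_goal,
                 obstacle_dist, radar_range, s1, colregs_compliant,
                 lam_g=1.0, t_r=1.0, lam_a=1.0, eps=0.1, lam_c=1.0):
    dist = math.hypot(pos[0] - goal[0], pos[1] - goal[1])
    r_g = -lam_g * dist                                    # goal-approach reward
    r_yaw = t_r * lam_a * math.exp(-eps * abs(yaw_error))  # simplified angle reward
    r_r = 2000.0 if reached_goal else (-1000.0 if collided else 0.0)
    if obstacle_dist > radar_range:                        # obstacle-avoidance reward
        r_l = 0.0
    elif obstacle_dist < s1:
        r_l = -5.0
    else:
        r_l = -1.0
    r_c = (lam_c if colregs_compliant else -lam_c) * dist  # COLREGs reward
    return r_g + r_yaw + r_r + r_l + r_c
```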

4. Experiment

4.1. Design of Simulation

Multiple simulation scenarios were established to verify the practicability and feasibility of the intelligent SMASV navigation system based on the SAC and A-star algorithms. PyTorch was used to construct and train the neural networks of the improved SAC algorithm. In this experiment, Gazebo and VS Code were used for co-simulation. Before sailing, the electronic chart was rasterized and the A-star path planning was performed [37]. Then, neural-network training and model-parameter debugging were performed in VS Code. Finally, a simulation environment was established in Gazebo to test the feasibility and generalization ability of the algorithm in different environments. The length of the SMASV was 116 m, the draft was 5.4 m, the beam was 18 m, the rudder area was 11.8 m², the maximum rudder angle was 35°, and the speed was 12 knots. The hydrodynamic parameters are shown in Table 2.
This experiment assumed that the SMASV sailed in still water, and some restrictions were attached to the SMASV model. First, the SMASV could not reduce her speed during navigation but could only change her course, because when a ship navigating in the ocean encounters obstacles or other target ships, course alteration is the preferable collision-avoidance action. Second, the inertia of the ship was introduced to simulate the true motion of the SMASV. Third, during the simulation, the SMASV did not know the locations of unknown obstacles or the subsequent actions of dynamic obstacles; therefore, the SMASV could only respond based on the information available to it.

4.2. Network Training

The network parameters and the algorithm parameters are listed in Table 3. The SMASV kinematics and dynamics information was generated and calculated by the environment module in the Gazebo simulation environment. This information was passed to VS Code as subscription nodes for network training and parameter updating. When the SMASV reached the target position, the episode was completed and the rewards were recorded in cumulative form; a new target point was then generated in the simulation environment and the SMASV started a new task. When the experiment reached its maximum number of steps or the SMASV collided with an obstacle, the episode was immediately stopped and the reward value was output. After the training stopped, the SMASV was placed at a new position and started a new training round.
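At a high level, the training procedure described here can be sketched as the loop below, with the Gazebo/ROS coupling abstracted behind a Gym-style `env` object and an `agent` exposing `select_action` and `update`; these interfaces are assumptions for illustration only.

```python
# High-level training loop: interact, store transitions, update the SAC agent.
def train(env, agent, buffer, episodes=2000, max_steps=1000, batch_size=256,
          start_steps=1000):
    total_steps = 0
    for ep in range(episodes):
        state, ep_reward = env.reset(), 0.0
        for _ in range(max_steps):
            action = agent.select_action(state)
            next_state, reward, done, _ = env.step(action)
            buffer.push(state, action, reward, next_state, done)
            state, ep_reward, total_steps = next_state, ep_reward + reward, total_steps + 1
            if len(buffer) >= batch_size and total_steps > start_steps:
                agent.update(buffer.sample(batch_size))   # one SAC gradient step
            if done:                                      # collision, goal, or step limit
                break
        print(f"episode {ep}: cumulative reward = {ep_reward:.1f}")
```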
Environment 0 (ENV-TRAIN) was used for training the intelligent SMASV decision-making system. The training environment is illustrated in Figure 8. Nine obstacles were present in the simulation environment, and the initial position of the SMASV was (0, 0). In the early stage of interaction with the environment, the SMASV was not able to drive toward the target point or avoid collisions; during exploration, it constantly collided with the environment boundary.
After 500 training iterations, the SMASV could successfully avoid Obstacle 1, Obstacle 2, and Obstacle 3. When the SMASV passed along the left side of Obstacle 2, the SMASV used the rudder to maintain her course. When the SMASV was 1.3 nautical miles away from Obstacle 2, the SMASV chose to alter her course to starboard 45° and then alter her course to port 5°, passing between Obstacle 2 and Obstacle 3. Owing to a lack of training experience, the SMASV collided with Obstacle 5.
After 700 training iterations, the SMASV successfully avoided the five obstacles on the left side of the simulation environment. At this time, the SMASV found that the reward value obtained by passing above Obstacle 2 was larger than that obtained by passing below it. Therefore, the SMASV changed her strategy. When the SMASV was 0.7 nautical miles away from Obstacle 2, the SMASV chose to alter her course to port 23°, then altered her course to starboard 5° and passed Obstacle 1 and Obstacle 2. When the SMASV was 1.5 nautical miles from Obstacle 4, the SMASV chose to alter her course to starboard 15° and keep that course. Then, when the SMASV was 0.6 nautical miles away from Obstacle 4, the SMASV chose to alter her course to starboard 5° to pass Obstacle 4 and Obstacle 5. The collision-avoidance process is shown in Figure 9A.
After 1000 training iterations, the SMASV could successfully avoid the four obstacles on the right side of the simulation environment. When the SMASV was 0.4 nautical miles away from Obstacle 7, the SMASV altered her course to starboard 20°. At this time, the SMASV was 0.2 nautical miles from Obstacle 7. Then, the SMASV chose to alter her course to port 17°, then alter course to starboard 5° to avoid Obstacles 8 and 9. Finally, the SMASV held her course and approached the target point. The collision-avoidance process is shown in Figure 9B.
After 1500 training iterations, the SMASV could successfully avoid all nine obstacles in the simulation environment and reach the target point.
Remark 3.
In this study, a static-obstacle environment was used to train the neural-network model. In the generalization-verification experiment, we found that the SMASV could also avoid obstacles well in the dynamic-obstacle environment, and that its collision-avoidance actions complied with the COLREGs. The reason for this is that the COLREGs were learned by the neural-network model during training and are therefore also executed, under the corresponding conditions, in a dynamic environment. The lidar-based course adjustment learned in a static environment generalizes to dynamic environments because, regardless of whether the environment is dynamic or static, the principle by which the lidar detects obstacles is the same. This shows that the trained neural-network model has a strong generalization ability. Additionally, multi-ship decision-making problems can be resolved via the proposed reinforcement-learning method.
In addition, while verifying whether the SMASV could avoid obstacles, we also had to guarantee the stability of the algorithm based on the neural-network loss-function curves. During the neural-network training process, the loss curve can directly reflect the network architecture quality. The loss values generated during the training process were stored. Then, the curve was drawn using the tensorboardX drawing tool to observe the changes in the training-loss-function value curves and adjust to the learning-rate parameters. Figure 10 shows the loss-value curves for the Target-Q1 network and the actor network in the improved SAC algorithm.
The target-Q1 network showed a large decrease in loss value at the beginning of training, indicating that the learning rate was appropriate and that gradient descent was proceeding. After 5000 iterations, the loss values converged to a stable state, and the change in the loss function was no longer as evident as in the early stages. Meanwhile, the actor network began to converge at approximately 3500 iterations, and its convergence rate was relatively stable. The network-loss-function values of the improved SAC algorithm were compared with those of the traditional DRL algorithms, as shown in Figure 11. The results showed that the improved SAC algorithm is more stable and converges more rapidly during neural-network training.

4.3. Comparison Experiment

The maximum-entropy DRL algorithm was compared with traditional DRL algorithms to verify its performance. In particular, the improved SAC algorithm was compared with other strategy-based learning algorithms, including the improved PPO [35], PPO [31], DDPG [33], and AC [30] algorithms. A comparison of the average-reward-convergence curves for the different algorithms is shown in Figure 12.
In this experiment, the maximum number of episodes was 2000. After 2000 episodes, the reward-function value of the AC algorithm still showed no convergence trend. The reward-function value of the DDPG algorithm began to converge after 1800 episodes, with an average reward value of approximately 21,400. The reward-function value of the PPO algorithm began to converge after 1200 episodes; its convergence rate was better than those of the DDPG and AC algorithms, and its average reward value was approximately 27,600. In addition, the convergence rate of the reward-function value of the improved PPO algorithm was better than that of the PPO algorithm: it began to converge after 900 episodes, with an average reward value of approximately 30,000.
After 800 episodes, the improved SAC algorithm began to converge, and the fluctuation of the reward-function value was minimal. Under the above conditions, the convergence rate of the improved SAC algorithm outperformed those of the improved PPO algorithm and the PPO algorithm by approximately 12% and 45%, respectively.

4.4. Verification Simulation

In this study, a real marine environment was considered as the environmental space for the proposed model verification. First, the grid method was used to rasterize the marine environment and, then, the rasterized environment was converted into binary values “0” and “1”, where “1” represented the obstacle, which was the forbidden navigation area and displayed as a black area in the environment model. In contrast, “0” represented the navigable area, displayed as a white area in the environment model.
The simulation environment was used to construct a navigation-simulation and model-verification environment containing unknown and dynamic obstacles, and the network model trained in the training environment was applied to this real marine environment. These obstacles were not charted in the real marine environment; therefore, to reach the target location safely and efficiently, the trained SMASV had to avoid them as unknown obstacles.
In addition, to verify the generalization ability of the proposed intelligent-SMASV navigation system, we built several simulation environments in different water areas. In this regard, we extracted eight representative water areas from the constructed simulation environments, as shown in Figure 13.
Environment 1 is the Strait of Malacca, a maritime area prone to accidents and an important strait connecting the Indian Ocean and the Pacific Ocean. Environment 2 is the Bismarck Strait, which contains many islands and reefs. Environment 3 is the Balearic Sea, located on the north side of the Mediterranean Sea, which has a narrow coast. Environment 4 is the Tyrrhenian Sea, which is connected to the Mediterranean Sea and is relatively open. Environment 5 is the Strait of Tartary, located on the northwest side of the Pacific Ocean, which is one of the world's major straits. Environment 6 is the coastal port of Djibouti, one of the modern ports of eastern Africa. Environment 7 is the Strait of Dover, the channel connecting the North Sea and the Atlantic, where shipping traffic is very heavy. Environment 8 is the Mozambique Channel, the longest strait in the world, located between Mozambique and Madagascar in southeastern Africa.
The path planning track and the sailing track of the intelligent-ship navigation system are displayed on the grid chart. The results are shown in Figure 14. The changes in the rudder angle during the ship navigation are shown in Figure 15. In this paper, an A-star algorithm was used to plan the SMASV navigation route. The blue point is the waypoint planned by the A-star algorithm, which was also defined as the virtual target point of the SMASV in different stages of the decision-making algorithm. The SMASV needs to sail to the virtual target point. If the SMASV does not reach the waypoint due to the collision-avoidance task, it will not be punished. The red line is the planned route before sailing, the blue line is the actual sailing track during the voyage, the green pentagram is the destination, and the black circle is the unknown obstacle.
To verify that the trained intelligent SMASV navigation system complies with the COLREGs, we carried out an additional experiment. The three wide waters of Environments 3, 4, and 7 were used as the COLREGs verification environments. As shown in Figure 16, the red line represents the planning path, the black line represents the decision-making track, and there are three target ships in each environment. TS01 was used to simulate the head-on case, TS02 was used to simulate the overtaking case, and TS03 was used to simulate the crossing give-way and crossing stand-on cases. The encounter angle, the minimum encounter distance, and the collision-avoidance action of the SMASV and other TSs in the encounter process are illustrated in Table 4.
In Environment 3, the SMASV and TS03 are in a crossing give-way situation. When the SMASV was 0.6 nautical miles away from TS03, the SMASV altered her course to starboard 16°. Then, the SMASV and TS02 constituted an overtaking situation. The SMASV chose to alter her course to starboard 12°, then altered her course to port 6° to avoid TS02. Finally, the SMASV and TS01 constituted a head-on situation. When the SMASV was 0.5 nautical miles away from TS01, the SMASV altered her course to starboard 14° to avoid TS01.
In Environment 4, the SMASV and TS02 are in an overtaking situation. The SMASV chose to alter her course to starboard 17°, then alter her course to port 4° to avoid TS02. Then, the SMASV and TS03 constituted a crossing stand-on situation. Because TS03 did not follow the COLREGs, the SMASV chose to alter her course to starboard 13°. Finally, the SMASV and TS01 constituted a head-on situation. When the SMASV was 0.3 nautical miles away from TS01, the SMASV altered her course to starboard 10° to avoid TS01.
In Environment 7, the SMASV and TS02 are in an overtaking situation. The SMASV chose to alter her course to starboard 15° continuously to avoid TS02. Then, the SMASV and TS03 constituted a crossing give-way situation, and the SMASV chose to alter her course to starboard 15°. Finally, the SMASV and TS01 constituted a head-on situation. When the SMASV was 0.4 nautical miles away from TS01, the SMASV altered her course to starboard 15°, then altered her course to port 7° to avoid TS01. Figure 17 shows the changes in the rudder angle during the ship navigation.
At the end of the experiment, we set up an open-water experimental environment to verify the applicability of the proposed collision-avoidance decision-making model trained by the improved SAC algorithm. In this experiment, every SMASV was equipped with the network model trained in this study. The typical three-ship and four-ship encounter processes are shown in Figure 18. The COLREGs do not specify the exact operations and collision-avoidance actions of each ship in a multi-ship encounter. Nevertheless, the network model trained in this study can handle the multi-ship encounter problem well, and each ship reached its target position safely.
Remark 4.
The training process of the neural-network model requires strong computing power and a long training time. However, applying the trained neural-network model to the SMASV does not require much computing power. In the generalization experiments, we directly used the neural-network model trained in the training environment (ENV-TRAIN) and did not need to retrain the network for the different environments. This is also why deep-reinforcement learning is well suited to solving ship-collision-avoidance problems.

5. Conclusions

In this study, an SMASV autonomous navigation decision-making method based on an improved SAC algorithm was proposed. The A-star algorithm, which can plan a safe route effectively and accurately, was used for path planning before voyages. Furthermore, a deep-reinforcement learning algorithm was used to avoid unknown and dynamic obstacles during the voyage.
The contributions of this study are as follows:
  • An improved SAC algorithm based on maximum-entropy reinforcement learning has a better convergence speed of average reward value than other traditional DRL algorithms based on strategy learning. In addition, the simulation results showed that the neural network model trained by the proposed method had a strong generalization ability, which can solve unknown-environment and dynamic-environment decision-making problems during the voyage.
  • The proposed reward function allowed the SMASV to follow the COLREGs in the decision-making process. In this study, it was proved by experiments that this process can solve the problem of multiple SMASV encounters in complex situations and assist ship navigation well, and that it has high engineering practicability.
Further efforts are needed in the following aspects. The obstacles in the training environment were modeled as cylinders, and the geometry of obstacles is crucial for simulating a real navigation environment; this should therefore be considered in future research. Moreover, the simulation environment of this study was still water, and the influence of wind and waves on the SMASV rudder angle was not considered. In future work, a robust control methodology should be designed to execute the decision-making model's commands and resist uncertain navigation-environment disturbances. Finally, we will consider deploying the trained network model for experiments on real ships.

Author Contributions

Conceptualization, W.G. and Z.C.; methodology, W.G. and Z.C.; software, Z.C. and C.Z.; writing—original draft preparation, Z.C.; writing—review and editing, W.G. and X.Z.; project administration, W.G.; funding acquisition, W.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the National Natural Science Foundation of China (No. 52171342) and by the Dalian Innovation Team Support Plan in the Key Research Field (2020RT08). The authors would like to thank the reviewers for their valuable comments.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hu, S.; Li, Z.; Xi, Y.; Gu, X.; Zhang, X. Path Analysis of Causal Factors Influencing Marine Traffic Accident via Structural Equation Numerical Modeling. J. Mar. Sci. Eng. 2019, 7, 96.
  2. Seuwou, P.; Banissi, E.; Ubakanma, G.; Sharif, M.S.; Healey, A. Actor-Network Theory as a Framework to Analyse Technology Acceptance Model's External Variables: The Case of Autonomous Vehicles. In Proceedings of the International Conference on Global Security, Safety, and Sustainability, London, UK, 18–20 January 2017.
  3. Erckens, H.; Busser, G.A.; Pradalier, C.; Siegwart, R.Y. Avalon Navigation Strategy and Trajectory Following Controller for an Autonomous Sailing Vessel. IEEE Robot. Autom. Mag. 2010, 17, 45–54.
  4. Song, R.; Liu, Y.C.; Bucknall, R. Smoothed A* algorithm for practical unmanned surface vehicle path planning. Appl. Ocean Res. 2019, 83, 9–20.
  5. Ding, F.; Zhang, Z.; Fu, M.; Wang, Y.; Wang, C. Energy-efficient Path Planning and Control Approach of USV Based on Particle Swarm Optimization. In Proceedings of the MTS/IEEE Charleston OCEANS Conference, Charleston, SC, USA, 22–25 October 2018.
  6. Liu, X.Y.; Li, Y.; Zhang, J.; Zheng, J.; Yang, C.X. Self-Adaptive Dynamic Obstacle Avoidance and Path Planning for USV Under Complex Maritime Environment. IEEE Access 2019, 7, 114945–114954.
  7. Zhang, Z.; Wu, D.F.; Gu, J.D.; Li, F.S. A Path-Planning Strategy for Unmanned Surface Vehicles Based on an Adaptive Hybrid Dynamic Stepsize and Target Attractive Force-RRT Algorithm. J. Mar. Sci. Eng. 2019, 7, 132.
  8. Xia, G.; Han, Z.; Zhao, B.; Wang, X. Local Path Planning for Unmanned Surface Vehicle Collision Avoidance Based on Modified Quantum Particle Swarm Optimization. Complexity 2020, 2020, 3095426.
  9. Lyu, H.; Yin, Y. COLREGS-Constrained Real-time Path Planning for Autonomous Ships Using Modified Artificial Potential Fields. J. Navig. 2019, 72, 588–608.
  10. Wang, X.; Liu, Z.J.; Cai, Y. The ship maneuverability based collision avoidance dynamic support system in close-quarters situation. Ocean Eng. 2017, 146, 486–497.
  11. Chen, C.; Chen, X.Q.; Ma, F.; Zeng, X.J.; Wang, J. A knowledge-free path planning approach for smart ships based on reinforcement learning. Ocean Eng. 2019, 189, 106299.
  12. Zhang, J.; Springenberg, J.T.; Boedecker, J.; Burgard, W. Deep Reinforcement Learning with Successor Features for Navigation across Similar Environments. arXiv 2016, arXiv:1612.05533v3.
  13. Li, L.; Wu, D.; Huang, Y.; Yuan, Z.M. A path planning strategy unified with a COLREGS collision avoidance function based on deep reinforcement learning and artificial potential field. Appl. Ocean Res. 2021, 113, 102759.
  14. Shen, H.; Hashimoto, H.; Matsuda, A.; Taniguchi, Y.; Terada, D.; Guo, C. Automatic collision avoidance of multiple ships based on deep Q-learning. Appl. Ocean Res. 2019, 86, 268–288.
  15. Guo, S.Y.; Zhang, X.G.; Zheng, Y.S.; Du, Y.Q. An Autonomous Path Planning Model for Unmanned Ships Based on Deep Reinforcement Learning. Sensors 2020, 20, 426.
  16. Zhao, L.; Roh, M.I.; Lee, S.J. Control Method for Path Following and Collision Avoidance of Autonomous Ship Based on Deep Reinforcement Learning. J. Mar. Sci. Technol. 2019, 27, 293–310.
  17. Xu, X.; Lu, Y.; Liu, G.; Cai, P.; Zhang, W. COLREGs-abiding hybrid collision avoidance algorithm based on deep reinforcement learning for USVs. Ocean Eng. 2022, 247, 110749.
  18. Chun, D.-H.; Roh, M.-I.; Lee, H.-W.; Ha, J.; Yu, D. Deep reinforcement learning-based collision avoidance for an autonomous ship. Ocean Eng. 2021, 234, 109216.
  19. Long, P.; Fan, T.; Liao, X.; Liu, W.; Zhang, H.; Pan, J. Towards Optimally Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 6252–6259.
  20. Guan, W.; Peng, H.W.; Zhang, X.K.; Sun, H. Ship Steering Adaptive CGS Control Based on EKF Identification Method. J. Mar. Sci. Eng. 2022, 10, 294.
  21. Cui, Z.; Guan, W.; Luo, W. Intelligent Ship Decision System Based on DDPG Algorithm. In Proceedings of the 2022 International Conference on Computer Engineering and Artificial Intelligence (ICCEAI), Shijiazhuang, China, 22–24 July 2022; pp. 700–705.
  22. Guo, S.; Zhang, X.; Du, Y.; Zheng, Y.; Cao, Z. Path Planning of Coastal Ships Based on Optimized DQN Reward Function. J. Mar. Sci. Eng. 2021, 9, 210.
  23. Guan, W.; Zhou, H.T.; Su, Z.J.; Zhang, X.K.; Zhao, C. Ship Steering Control Based on Quantum Neural Network. Complexity 2019, 2019, 3821048.
  24. Perera, L.P.; Oliveira, P.; Guedes Soares, C. System Identification of Nonlinear Vessel Steering. J. Offshore Mech. Arct. Eng. 2015, 137, 031302.
  25. Fossen, T.I. Handbook of Marine Craft Hydrodynamics and Motion Control; John Wiley & Sons: Hoboken, NJ, USA, 2011.
  26. Vagale, A.; Oucheikh, R.; Bye, R.T.; Osen, O.L.; Fossen, T.I. Path planning and collision avoidance for autonomous surface vehicles I: A review. J. Mar. Sci. Technol. 2021, 26, 1292–1306.
  27. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. arXiv 2018, arXiv:1812.05905.
  28. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602.
  29. Hasselt, H.V.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-learning. arXiv 2015, arXiv:1509.06461v3.
  30. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic Policy Gradient Algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 22–24 June 2014.
  31. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
  32. Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; Abbeel, P. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1889–1897.
  33. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
  34. Wei, G.; Kuo, W. COLREGs-Compliant Multi-Ship Collision Avoidance Based on Multi-Agent Reinforcement Learning Technique. J. Mar. Sci. Eng. 2022, 10, 1431.
  35. Guan, W.; Cui, Z.; Zhang, X. Intelligent Smart Marine Autonomous Surface Ship Decision System Based on Improved PPO Algorithm. Sensors 2022, 22, 5732.
  36. Xu, X.L.; Lu, Y.; Liu, X.C.; Zhang, W.D. Intelligent collision avoidance algorithms for USVs via deep reinforcement learning under COLREGs. Ocean Eng. 2020, 217, 107704.
  37. Guan, W.; Wang, K. Autonomous Collision Avoidance of Unmanned Surface Vehicles Based on Improved A-Star and Dynamic Window Approach Algorithms. IEEE Intell. Transp. Syst. Mag. 2023, 15, 36–50.
Figure 1. Intelligent navigation system for the SMASV: part (a) is the path planner; part (b) is the sensor module; part (c) is the decision-making module; and part (d) is the control module.
Figure 2. Electronic chart rasterization process. Subfigure (a) is the real sea environment, and subfigure (b) is the rasterized electronic chart environment.
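Figure 2 summarizes how the navigation area is converted into a grid before path planning. The snippet below is a minimal sketch of that rasterization idea, not the authors' implementation: the grid resolution `cells` and the `is_land(lat, lon)` chart-lookup callback are illustrative placeholders for data a real system would obtain from the electronic chart.

```python
import numpy as np

def rasterize_chart(lat_min, lat_max, lon_min, lon_max, cells, is_land):
    """Build a binary occupancy grid from a chart region: 1 = land/obstacle, 0 = navigable water.

    `cells` (grid resolution) and `is_land(lat, lon)` (a hypothetical electronic-chart query)
    are assumptions for illustration only.
    """
    grid = np.zeros((cells, cells), dtype=np.uint8)
    lats = np.linspace(lat_min, lat_max, cells)
    lons = np.linspace(lon_min, lon_max, cells)
    for i, lat in enumerate(lats):
        for j, lon in enumerate(lons):
            if is_land(lat, lon):
                grid[i, j] = 1  # mark the cell as non-navigable
    return grid
```

A planner can then treat every non-zero cell as blocked, which is how a rasterized environment such as the one in subfigure (b) is used downstream.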
Figure 3. 3-DOF ship-motion mathematical model.
Figure 4. Collision avoidance in the four encounter situations defined by the COLREGs: (a) head-on; (b) overtaking; (c) crossing give-way; (d) crossing stand-on.
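To make the four situations in Figure 4 concrete, the sketch below assigns an encounter type from the target ship's relative bearing alone. The sector limits (±6° for head-on, more than 22.5° abaft the beam for overtaking) are common illustrative choices rather than the thresholds used in this paper, and the comparison of relative course and speed that a full COLREGs module requires is deliberately omitted.

```python
def classify_encounter(relative_bearing_deg: float) -> str:
    """Rough COLREGs situation from the target's relative bearing
    (degrees, measured clockwise from own bow).

    Sector limits are illustrative assumptions; a complete classifier
    would also compare the two ships' courses and speeds.
    """
    b = relative_bearing_deg % 360.0
    if b <= 6.0 or b >= 354.0:
        return "head-on"            # Rule 14: both vessels alter course to starboard
    if b <= 112.5:
        return "crossing give-way"  # Rule 15/16: target on own starboard side
    if b < 247.5:
        return "overtaking"         # Rule 13: target approaching from abaft the beam
    return "crossing stand-on"      # Rule 17: target on own port side
```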
Figure 5. Differences between traditional reinforcement learning and maximum-entropy reinforcement learning. The red pentagram marks the target point, and the black cross marks a path that is not navigable.
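The contrast drawn in Figure 5 comes from the objective being optimized. Traditional reinforcement learning maximizes the expected return only, whereas the maximum-entropy formulation used by SAC [27] adds an entropy bonus weighted by the temperature factor α:

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s_t)\big) = -\mathbb{E}_{a_t \sim \pi}\big[\log \pi(a_t \mid s_t)\big].
```

Setting α = 0 recovers the traditional objective; the entropy term keeps alternative routes alive during training, which is the behavior contrasted in the figure.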
Figure 6. Interaction process between the improved SAC algorithm and the environment. The Q network contains four fully connected layers with 256 neurons each, using the ReLU activation function; the strategy (policy) network comprises two fully connected layers, each with 256 neurons, also using ReLU.
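Since the caption above fixes the layer widths, the two networks can be sketched directly; the PyTorch fragment below is only an illustration consistent with that description. The state and action dimensions (`state_dim`, `action_dim`), the decision to treat the four 256-unit layers of the Q network as hidden layers before a scalar output head, and the Gaussian mean/log-std head on the policy network are our assumptions, the last being a common SAC convention.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q network: four fully connected 256-unit layers with ReLU, as described in Figure 6."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),  # scalar Q(s, a) head (assumed arrangement)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class PolicyNetwork(nn.Module):
    """Policy (strategy) network: two fully connected 256-unit layers with ReLU."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # Gaussian output head: an assumption typical of SAC implementations.
        self.mean = nn.Linear(256, action_dim)
        self.log_std = nn.Linear(256, action_dim)

    def forward(self, state):
        h = self.body(state)
        return self.mean(h), self.log_std(h).clamp(-20.0, 2.0)
```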
Figure 7. Concept of the reward function.
Figure 8. SMASV neural-network training environment (Env-Train).
Figure 9. (A) SMASV avoiding obstacles in the Env-Train environment after 700 training iterations: subfigures (a–d) show the SMASV avoiding ob1, ob2, and ob3, and subfigures (e–h) show it avoiding ob4 and ob5. (B) SMASV avoiding obstacles in the Env-Train environment after 1000 training iterations: subfigures (a–e) show the SMASV avoiding ob6 and ob7, and subfigures (f–h) show it avoiding ob8 and ob9.
Figure 10. The loss curves for the target Q1 network and the actor network in the improved SAC algorithm.
Figure 11. The loss curves for the actor network and the critic network in the different DRL algorithms.
Figure 12. Comparison of the average-reward convergence curves of the improved SAC algorithm with the AC, DDPG, PPO, and improved PPO algorithms.
Figure 13. Electronic chart rasterization of eight water areas.
Figure 14. The planned paths and actual sailing tracks on the rasterized electronic charts.
Figure 15. Rudder angle of the SMASV in ENV1–ENV8.
Figure 16. Multi-ship encounter situation experiments with the SMASV.
Figure 17. Rudder angle of the SMASV in multi-ship encounter situations.
Figure 18. Sailing tracks in typical three-ship and four-ship encounter situations.
Table 1. Summary of literature on ship-collision avoidance.

| Research Classification | Reference | Technique | Consideration of the COLREGs | Advantages | Disadvantages |
|---|---|---|---|---|---|
| Traditional methods | [3] | A-star | No | A ship named Avalon that crosses the Atlantic Ocean automatically | Cannot avoid unknown obstacles and does not meet the COLREGs |
| | [4] | Smooth A-star | No | The route has no redundant waypoints | |
| | [5] | Particle swarm | No | Optimized path | |
| | [6] | ACA and CA | No | Algorithm has generalization ability | |
| | [7] | RRT | No | Avoids static and dynamic obstacles | Collision-avoidance effect in narrow waters is not good |
| | [8] | VO and MQPSO | Yes | Ship-collision-avoidance model and risk assessment | Cannot avoid unknown obstacles |
| | [9] | APF | Yes | New heuristic function | Repulsive potential field function is too complex to be generalized |
| | [10] | RVO | Yes | Two-ship collision avoidance compliant with the COLREGs | TCPA computation is too complex |
| Model-free DRL methods | [11] | Q-learning (RL) | No | Human experience | Cannot handle complex sea conditions or continuous action |
| | [12] | APF and DQN (DRL) | Yes | APF algorithm can guide the ship to the target point | The DQN algorithm has an overestimation problem, and errors occur during action selection |
| | [13] | DQN (DRL) | Yes | Experimental validation with three self-propelled ships | |
| | [14] | DDPG (DRL) | No | Combined with AIS data | Slow convergence speed |
| | [15] | LOS and PPO (DRL) | Yes | Two-ship collision avoidance compliant with the COLREGs | Low generalization ability |
| | [16] | DDPG (DRL) | Yes | Five complex reward functions | |
| | [17] | PPO (DRL) | Yes | New quantitative risk assessment | Slow convergence speed |
Table 2. Hydrodynamic parameters of the SMASV.

| Hydrodynamic Parameters of the Ship-Motion Model | | | | |
|---|---|---|---|---|
| $X_{\dot{u}} = 0.0075$ | $Y_{\dot{v}} = 0.1553$ | $Y_{\dot{u}} = 0.1553$ | $N_{\dot{v}} = 0.0007927$ | $N_{\dot{r}} = 0.0074$ |
| $X_{u} = 3.5655$ | $X_{uu} = 0.0697$ | $X_{uuu} = 0.0697$ | $Y_{v} = 0.3032$ | $Y_{vv} = 0.7834$ |
| $Y_{vr} = 0.2099$ | $Y_{rr} = 0.0543$ | $N_{v} = 0.0999$ | $N_{vv} = 0.1822$ | $N_{rv} = 0.1561$ |
| $N_{vr} = 0.0561$ | $N_{rr} = 0.3390$ | $I_{\dot{z}} = 0.0120$ | $Y_{r} = 0.0832$ | $N_{r} = 0.0455$ |
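If the coefficients in Table 2 need to be passed around in code, a single read-only container keeps them together; the sketch below merely mirrors the table (field names follow the usual derivative notation, with signs and magnitudes exactly as printed) and does not implement the motion model itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HydrodynamicDerivatives:
    """Hydrodynamic parameters of the SMASV, copied verbatim from Table 2."""
    X_udot: float = 0.0075
    Y_vdot: float = 0.1553
    Y_udot: float = 0.1553
    N_vdot: float = 0.0007927
    N_rdot: float = 0.0074
    X_u: float = 3.5655
    X_uu: float = 0.0697
    X_uuu: float = 0.0697
    Y_v: float = 0.3032
    Y_vv: float = 0.7834
    Y_vr: float = 0.2099
    Y_rr: float = 0.0543
    N_v: float = 0.0999
    N_vv: float = 0.1822
    N_rv: float = 0.1561
    N_vr: float = 0.0561
    N_rr: float = 0.3390
    I_zdot: float = 0.0120
    Y_r: float = 0.0832
    N_r: float = 0.0455
```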
Table 3. Experimental data and network parameters.

| Parameter | Symbol | Value |
|---|---|---|
| Discount factor | γ | 0.99 |
| Q1 network learning rate | v_lr | 0.0003 |
| Q2 network learning rate | q_soft_lr | 0.0003 |
| Actor network learning rate | p_lr | 0.0003 |
| Target reward weight | λ_g | 10.0 |
| Reward coefficient | t_r | 0.80 |
| Yaw reward weight | λ_a | 0.30 |
| Distance variation parameter | ε | 0.50 |
| COLREGs reward weight | λ_c | 1.50 |
| Safe distance (nm) | S_1 | 0.50 |
| Radar radius (nm) | R_radar | 3.50 |
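The parameters in Table 3 translate directly into a training configuration; the dictionary below simply restates the table, with key names of our own choosing, so the values can be checked against the paper at a glance.

```python
# Hyperparameters from Table 3 (key names are illustrative; values are from the table).
SAC_CONFIG = {
    "gamma": 0.99,             # discount factor
    "v_lr": 0.0003,            # Q1 network learning rate
    "q_soft_lr": 0.0003,       # Q2 network learning rate
    "p_lr": 0.0003,            # actor network learning rate
    "lambda_g": 10.0,          # target reward weight
    "t_r": 0.80,               # reward coefficient
    "lambda_a": 0.30,          # yaw reward weight
    "epsilon": 0.50,           # distance variation parameter
    "lambda_c": 1.50,          # COLREGs reward weight
    "safe_distance_nm": 0.50,  # S1
    "radar_radius_nm": 3.50,   # R_radar
}
```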
Table 4. Multi-ship encounter situation experiments.

| Experiment | Ship Number | Scenario | Angle of Encounter | Minimum Distance between the Ships |
|---|---|---|---|---|
| ENV3 | TS01 | Head-on | | 0.5 nm |
| ENV3 | TS02 | Overtaking | 150° | 0.5 nm |
| ENV3 | TS03 | Crossing give-way | 110° | 0.6 nm |
| ENV4 | TS01 | Head-on | 357° | 0.3 nm |
| ENV4 | TS02 | Overtaking | 230° | 0.5 nm |
| ENV4 | TS03 | Crossing stand-on | 280° | 0.4 nm |
| ENV7 | TS01 | Head-on | 355° | 0.4 nm |
| ENV7 | TS02 | Overtaking | 200° | 0.5 nm |
| ENV7 | TS03 | Crossing give-way | 85° | 0.5 nm |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
