Article

Navigation and Obstacle Avoidance for USV in Autonomous Buoy Inspection: A Deep Reinforcement Learning Approach

Jianhui Wang, Zhiqiang Lu, Xunjie Hong, Zeye Wu and Weihua Li
1 School of Intelligent Science and Engineering, Jinan University, Zhuhai 519000, China
2 School of International Energy, Jinan University, Zhuhai 519000, China
3 Institute of Rail Transportation, Jinan University, Zhuhai 519000, China
* Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(5), 843; https://doi.org/10.3390/jmse13050843
Submission received: 31 March 2025 / Revised: 18 April 2025 / Accepted: 23 April 2025 / Published: 24 April 2025
(This article belongs to the Special Issue The Control and Navigation of Autonomous Surface Vehicles)

Abstract

To address the challenges of manual buoy inspection, this study enhances a previously proposed Unmanned Surface Vehicle (USV) inspection system by improving its navigation and obstacle avoidance capabilities using Proximal Policy Optimization (PPO). For improved usability, the entire system adopts a fully end-to-end design, with an angular deviation weighting mechanism for stable circular navigation, a novel image-based radar encoding technique for obstacle perception, and a decoupled navigation and obstacle avoidance architecture that splits the complex task into three independently trained modules. Experiments validate that both navigation modules exhibit robustness and generalization capabilities, while the obstacle avoidance module partially achieves maneuvers compliant with the International Regulations for Preventing Collisions at Sea (COLREGs). Further tests in continuous multi-buoy inspection tasks confirm the architecture's effectiveness in integrating these modules to complete the full task.

1. Introduction

Buoys are widely utilized in meteorological forecasting, marine environment assessment, navigation safety assurance, and scientific research [1,2]. However, as equipment is exposed to harsh marine environments for extended periods, buoys encounter issues such as seawater corrosion and biofouling. Regular inspection and maintenance, therefore, become imperative to ensure the efficient and reliable operation of buoy observation systems [3].
Conventional buoy inspection methods predominantly rely on manual checks, which prove not only costly but also hazardous, particularly under adverse weather conditions. Unmanned Surface Vehicles (USVs), as novel maritime operation platforms, demonstrate significant advantages in automation, all-weather capability, and long endurance. When equipped with visual sensors, USVs can effectively detect and evaluate buoy structural integrity, surface corrosion, and abnormal orientation. Previous research [4] developed a USV system for buoy inspection that achieves dynamic target tracking during circular navigation through Kernelized Correlation Filter (KCF) algorithms, coupled with buoy health assessment using the YOLOv7 architecture, as illustrated in Figure 1. Building upon this foundation, this study introduces improvements to the USV's navigation and obstacle avoidance systems.
To achieve comprehensive detection and evaluation of buoy conditions, a USV must perform stable circumnavigation tasks. Conventional methods typically employ waypoint tracking through integrated Line-of-Sight (LOS) guidance and Proportional-Integral-Derivative (PID) control architectures [5]. These methods approximate circumnavigation trajectories by discretizing the circular path into segments, which, however, could introduce trajectory oscillations [6].
Rooted in Sutton and Barto's pioneering theoretical framework [7], reinforcement learning (RL) has enabled robotic systems to achieve adaptive navigation, dynamic obstacle avoidance, and model-free control. Recent advancements have extended these benefits to USV motion control, demonstrating promising results. Ref. [8] proposes a navigation algorithm using the Deep Q-Network (DQN) for USVs, which demonstrates tracking accuracy comparable to Improved LOS (ILOS) with PD controllers. However, its circular navigation approach still relies on pre-set waypoints and does not eliminate trajectory oscillations. Furthermore, ref. [9] proposes a DQN-based heading control method focusing on a KVLCC2 tanker. Ref. [10] addresses the high computational complexity of Q-learning in USV path tracking through Smoothly-Convergent DRL (SCDRL). While both studies demonstrate the ability to follow curved paths, their reliance on predefined paths presents challenges for dynamic path generation in multi-buoy inspection tasks. Additionally, neither approach explicitly addresses circular path following in its experiments.
As an essential motion control component of the autonomous buoy inspection USV system, a Proximal Policy Optimization (PPO)-based circular navigation algorithm is proposed in this study. By introducing an angular deviation weighting mechanism, the proposed algorithm enables smooth transitions between linear and circular motions without relying on pre-set waypoints or paths. Another control component, the goal-directed navigation algorithm, is developed as a simplified and modified version of the circular navigation algorithm.
Given the frequent presence of moving objects, such as fishing boats and cargo ships near waterways, an autonomous USV system should be capable of real-time obstacle avoidance. Traditional obstacle avoidance methods, such as the A* algorithm [11] and Dijkstra’s algorithm [12], are mostly effective in static obstacle environments and come with high computational demands, limiting their applicability in time-sensitive scenarios. Methods like the Artificial Potential Field (APF) [13], Velocity Obstacle (VO) [14,15], and Dynamic Window Approach (DWA) [16], while suitable for dynamic environments, often suffer from the risk of getting trapped in local optima [17]. Due to the strong performance of reinforcement learning in complex decision-making tasks, a few reinforcement learning-based obstacle avoidance strategies for USVs have been proposed. Ref. [18] introduces a Deep Deterministic Policy Gradient (DDPG)-based avoidance strategy adhering to COLREGs (International Regulations for Preventing Collisions at Sea), enhanced with a prioritized experience replay mechanism to improve USV learning efficiency. Similarly, ref. [19] applies convolutional neural networks (CNNs) to process temporal state information and train a DQN model, demonstrating higher avoidance success rates compared to traditional VO and APF methods. However, these approaches depend heavily on accurate motion parameters of obstacles, which are often difficult to obtain in real-world inspection tasks. Ref. [17] proposes an end-to-end obstacle avoidance solution using local grid maps by encoding the temporal dynamics of obstacles within a single image. While innovative, this method does not strictly comply with COLREGs, and the process of constructing grid maps from sensor data is not explicitly addressed. Ref. [20] explores avoidance for UAVs using radar sensors but is limited to static obstacle scenarios.
Building upon these foundations, this study designs radar data as image input and adopts the temporal coding strategy from [17]. The proposed approach achieves dynamic obstacle avoidance that partially complies with COLREGs in a fully end-to-end manner, thus bridging the gap between theoretical research and practical implementation of autonomous collision avoidance systems.
The main contributions of this study are as follows:
  • An angular deviation weighting mechanism is introduced, enabling a circular navigation algorithm for USVs that demonstrates robust adaptability across different circumnavigation radii, even when trained on a specific radius.
  • A novel generalized radar image encoding technique is developed, facilitating an end-to-end USV collision avoidance solution that relies solely on radar sensors for obstacle scenario recognition.
  • Inspired by the concept of hierarchical reinforcement learning [21], a decoupled architecture is proposed, separating navigation and obstacle avoidance for USVs. The effectiveness of this architecture is validated in autonomous multi-buoy inspection tasks.

2. Background

2.1. Kinematic Model of USV

This study establishes a three-degree-of-freedom (3-DOF) underactuated vessel mathematical model for algorithm design and validation [22]. Considering only the motion of the vessel in a two-dimensional plane, as illustrated in Figure 2, $\eta = [x, y, \psi]^T$ is defined as the vessel's position and heading angle in the world coordinate system, while $\upsilon = [u, v, r]^T$ represents the surge velocity, sway velocity, and yaw rate in the body-fixed coordinate frame. The kinematic model of the USV can be expressed as:
$$\dot{\eta} = T(\psi)\,\upsilon$$
where $T(\psi)$ is the transformation matrix from the body-fixed coordinate system to the world coordinate system, which can be expressed as:
$$T(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

2.2. Reinforcement Learning

Reinforcement Learning trains agents to optimize strategies through environment interactions using reward/penalty feedback, aiming to maximize cumulative discounted rewards. It is modeled via Markov Decision Processes (MDPs), which are defined by the state and action spaces $S, A$, the transition probability $P$, the reward function $R$, and the discount factor $\gamma$. Each timestep generates a tuple $(s_t, a_t, r_{t+1}, s_{t+1})$, where $s_t \in S$ is the observed state at timestep $t$, $a_t \in A$ is the action selected at $t$, $s_{t+1}$ is the resulting state determined by the transition probability $P(s_{t+1} \mid s_t, a_t)$, and $r_{t+1}$ is the feedback reward given by $R(s_t, a_t, s_{t+1})$.
Another critical component is the policy $\pi(a \mid s)$, which defines the probability distribution over action $a$ given state $s$. The objective of reinforcement learning is to find an optimal policy $\pi^*$ that maximizes the expected cumulative discounted reward $R_t$:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{k+t+1}$$
where the discount factor $\gamma \in [0, 1]$ determines the agent's preference for long-term versus short-term rewards. A lower $\gamma$ favors short-term gains, while a higher $\gamma$ prioritizes long-term benefits.
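As a concrete illustration of the return above, the following minimal Python sketch accumulates the discounted sum backwards over a finite episode; the reward values and $\gamma$ in the example are arbitrary and not taken from the paper's experiments.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = r_{t+1} + gamma*r_{t+2} + ... for a finite reward list,
    accumulating backwards so each step reuses the tail sum."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# Example: a single terminal reward is discounted by gamma^(T-1).
print(discounted_return([0.0, 0.0, 0.0, 10.0], gamma=0.9))  # 0.9**3 * 10 = 7.29
```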

2.3. Proximal Policy Optimization

Policy gradient methods address traditional RL’s limitations in high-dimensional continuous action spaces by directly optimizing parameterized policies π θ ( a | s ) via gradient ascent to maximize discounted rewards [23]. PPO, an improved Trust Region Policy Optimization (TRPO) variant, ensures stable training through constrained policy updates while maintaining computational efficiency [24]. This study adopts PPO with continuous action spaces as its reinforcement learning framework.
In policy gradient methods, the expected cumulative discounted reward is represented as:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$$
where $\tau$ denotes a trajectory $(s_0, a_0, r_0, \ldots, s_T)$ sampled by the policy $\pi_\theta$. The expectation is taken over trajectories sampled under the current policy. The policy parameters $\theta$ are updated via gradient ascent to maximize $J(\theta)$. The policy gradient is derived as:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t)\right]$$
where $A^\pi(s_t, a_t)$ is the advantage function, which quantifies how much better taking action $a_t$ in state $s_t$ is compared to the average performance under the policy $\pi$.
Typically, the PPO algorithm employs Generalized Advantage Estimation (GAE) as the advantage function [25]. It is defined as:
$$A_t^{GAE(\gamma,\lambda)} = \delta_t + \gamma\lambda\, A_{t+1}^{GAE(\gamma,\lambda)}, \qquad \delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$$
where the TD error $\delta_t$ serves as an unbiased but high-variance estimate of the advantage function, and $V^\pi(s_t)$ denotes the expected cumulative discounted reward an agent will receive starting from state $s_t$ and following policy $\pi$ thereafter.
A key feature of PPO is its ability to avoid resampling trajectories during training by leveraging importance sampling. However, this approach introduces a new challenge: if the policy update step is too large, the importance weight can deviate significantly from 1. To address this, PPO incorporates a clipping mechanism that limits how much the new policy is allowed to diverge from the old policy. These mechanisms lead to the following reformulated objective function:
$$J^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} A_t,\ \mathrm{clip}\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A_t\right)\right]$$
where $\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ is the importance weight, quantifying how much the current policy differs from the old one for each sampled action at each state, and $\varepsilon$ is a hyperparameter defining the clipping range, ensuring policy updates remain within the interval $[1-\varepsilon, 1+\varepsilon]$.
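A minimal PyTorch sketch of the clipped surrogate objective above, written as a loss to be minimized; the tensor names and the assumption that log-probabilities and advantages are already computed are illustrative, not the authors' implementation.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Negated clipped surrogate objective; all inputs are 1-D tensors of equal length."""
    ratio = torch.exp(log_prob_new - log_prob_old)             # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # ascent on J => descent on -J
```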

3. Methodology

3.1. Problem Formulation

As shown in Figure 3, the complete inspection task involves the following phases: (1) Navigation from the starting point to the vicinity of the first target buoy; (2) Execution of circumnavigation inspection around the buoy; (3) Proceeding to the next target buoy; (4) Repeating phases (1)–(3) until completing all buoy inspections, followed by return to the starting point. To address the complexity of this task, we decouple the process into three core subtasks:
  • Goal-directed navigation: Guides the USV from a start point to a goal point.
  • Circular navigation: Enables smooth transition into orbit radius and maintains stable circular motion around the target buoy.
  • Obstacle avoidance: Detects and evades moving obstacles during navigation, complying with COLREGs.
In multi-subtask environments, conventional reinforcement learning methods face issues with reward design complexity and model convergence due to conflicting multi-objective optimization [26]. To address these challenges, we adopt a hierarchical reinforcement learning framework and propose a decoupled architecture for navigation and obstacle avoidance, as illustrated in Figure 4.
The core concept involves independent training and deployment of these functional modules to ensure sample efficiency and reusability. Previous studies typically couple navigation with obstacle avoidance, which often results in high sample requirements and complex reward functions: sparse obstacles lead to insufficient training of the avoidance capability, while dense obstacles cause convergence difficulties and overcautious behaviors. Our approach is based on the intuition that a well-trained navigation module can provide reliable heading references, so that obstacle avoidance behaviors can be treated as localized dynamic adjustments to the reference heading. Specifically, when the vessel maintains the correct navigation direction, the independently trained obstacle avoidance module can be activated upon detecting an obstacle to adjust the current heading. Once the obstacle is avoided, the avoidance module is deactivated, and the default navigation module resumes. Based on the proposed architecture, this study independently trains three submodules (goal-directed navigation, circular navigation, and obstacle avoidance), with a decision and encoding unit determining module activation and state transformation.

3.2. Design of Navigation Modules

3.2.1. Environments

To address the environmental differences between navigation and obstacle avoidance tasks, this study designs separate environments for different modules. The goal-directed navigation module and the circular navigation module, introduced in this section, are trained in the same environment but with distinct termination criteria.
As shown in Figure 5, during environment initialization, the USV's initial position $(x_0, y_0)$, initial heading angle $\psi_0$, goal distance $r_{g0}$, and goal azimuth angle $\psi_{g0}$ are randomly assigned:
$$(x_0, y_0) \sim \mathcal{U}(-12, 12)^2, \quad \psi_0 \sim \mathcal{U}(0, 2\pi), \quad r_{g0} \sim \mathcal{U}(50, 70), \quad \psi_{g0} \sim \mathcal{U}(0, 2\pi)$$
Therefore, the goal position $(x_{goal}, y_{goal})$ is calculated as:
$$(x_{goal}, y_{goal}) = \left(x_0 + r_{g0}\cos\psi_{g0},\; y_0 + r_{g0}\sin\psi_{g0}\right)$$
For the goal-directed task, the goal serves as the center of the goal area, while for the circular navigation task, the goal is the center of the expected circumnavigation trajectory (which is also the center of the buoy). This randomized task generation enhances the algorithms’ generalization capabilities.
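The randomized initialization above can be sketched as follows; the use of NumPy and the function name are assumptions made for illustration, and the sampling ranges follow the equations above.

```python
import numpy as np

rng = np.random.default_rng()

def init_navigation_episode():
    """Sample the USV start pose and the goal position as described above."""
    x0, y0 = rng.uniform(-12.0, 12.0, size=2)     # initial position
    psi0 = rng.uniform(0.0, 2.0 * np.pi)          # initial heading
    r_g0 = rng.uniform(50.0, 70.0)                # goal distance
    psi_g0 = rng.uniform(0.0, 2.0 * np.pi)        # goal azimuth
    x_goal = x0 + r_g0 * np.cos(psi_g0)
    y_goal = y0 + r_g0 * np.sin(psi_g0)
    return (x0, y0, psi0), (x_goal, y_goal)
```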
For the goal-directed navigation task, the USV starts from $(x_0, y_0)$ and navigates toward $(x_{goal}, y_{goal})$. If the distance to the goal becomes smaller than $E_g$, the USV is considered to have reached the vicinity of the goal, and the task is completed successfully, as shown in the top-right subfigure of Figure 6.
For the circular navigation task, the USV departs from its initial position and approaches a predefined orbital radius $R_r$. It then adjusts its heading to smoothly navigate along the expected circumnavigation trajectory. As shown in the main body of Figure 6, during orbiting:
  • If the USV's deviation from the expected circumnavigation trajectory exceeds the tolerance limit $D_t$, the accumulated circumnavigation angle $\theta_{total}$ is reset to zero.
  • If the USV remains within the tolerance range, $\theta_{total}$ is continuously recorded and incremented.
Both clockwise and counterclockwise circumnavigation are considered valid. The task is deemed successful if $|\theta_{total}| \geq 2\pi - E_c$, indicating nearly a full orbit around the buoy.
If the distance between the USV and the buoy becomes smaller than $R_{buoy}$, a collision is triggered, and the episode terminates in failure.

3.2.2. Action Space

In both navigation and obstacle avoidance environments, the USV employs continuous rudder angle control as its action form. By adjusting the rudder angle, the USV can indirectly alter its navigation direction. Considering the hydrodynamic characteristics of the vessel, the effect of the rudder angle on the heading exhibits a delay and cannot instantaneously change the USV's course. The control range of the rudder angle is set to $\left[-\frac{\pi}{2}, \frac{\pi}{2}\right]$.

3.2.3. State Spaces

The design of the state space directly impacts the agent’s environmental perception capability and training efficiency. This study designs two distinct state spaces based on task complexity and goal specificity.
For navigation tasks with relatively explicit objectives, we adopt a task-oriented state representation that avoids explicit encoding of task-specific goals in the state space, thereby accelerating training convergence.
The state space for circular navigation is represented as:
$$\left(\theta_e,\; d_e,\; r_{usv},\; \theta_{total}\right)$$
$\theta_e$ represents the directed angle between the actual velocity direction of the USV and the line connecting to the goal point:
$$\theta_e = \left(\theta_v - \psi_{goal} + \pi\right) \bmod 2\pi - \pi$$
where $\theta_v = \arctan\frac{\dot{y}}{\dot{x}}$ is the velocity direction in the inertial frame and $\psi_{goal} = \arctan\frac{y_{goal} - y}{x_{goal} - x}$ is the goal azimuth relative to the USV. The difference is normalized to $[-\pi, \pi)$. This is the most critical element in the state space, directly guiding the agent's actions.
$d_e$ represents the distance between the USV and the expected circumnavigation trajectory:
$$d_e = d_g - R_r$$
where $d_g = \sqrt{(x_{goal} - x)^2 + (y_{goal} - y)^2}$ is the distance to the goal. Guided by this element, the agent learns to adjust its heading when $d_e$ is small to ensure a smooth transition to the circumnavigation state and to maintain $d_e$ within $[-D_t, D_t]$ to accumulate the circumnavigation angle.
$r_{usv} = r$ is the USV's yaw rate. Considering the hysteresis in rudder control, $r_{usv}$ reflects the USV's motion trend, preventing oversteering and ensuring stable navigation.
$\theta_{total}$ represents the accumulated circumnavigation angle. During each state update, this element is updated as follows:
$$\theta_{total}(t) = \begin{cases} \theta_{total}(t-1) + \Delta\psi_{goal}(t), & \text{if } |d_e(t)| \leq D_t \\ 0, & \text{else} \end{cases}$$
where $\Delta\psi_{goal}(t) = \left(\psi_{goal}(t) - \psi_{goal}(t-1) + \pi\right) \bmod 2\pi - \pi$ is the incremental azimuth angle at time $t$. The sign of $\theta_{total}$ provides information about the circumnavigation direction (clockwise/counterclockwise), while its absolute value indicates task progress, aiding the agent in making correct decisions.
For the goal-directed navigation, the state space is similar to that of circular navigation:
$$\left(\theta_e,\; d_g,\; r_{usv}\right)$$
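The sketch below assembles both state vectors from raw kinematics. The wrap-to-$[-\pi, \pi)$ helper, the use of arctan2 for the angles, and the function names are assumptions of this illustration rather than the authors' implementation.

```python
import numpy as np

def wrap_pi(angle):
    """Wrap an angle to [-pi, pi)."""
    return (angle + np.pi) % (2.0 * np.pi) - np.pi

def navigation_states(x, y, x_dot, y_dot, r, x_goal, y_goal, R_r, theta_total):
    theta_v = np.arctan2(y_dot, x_dot)               # velocity direction in world frame
    psi_goal = np.arctan2(y_goal - y, x_goal - x)    # azimuth of the goal
    theta_e = wrap_pi(theta_v - psi_goal)            # directed angle error
    d_g = np.hypot(x_goal - x, y_goal - y)           # distance to the goal
    d_e = d_g - R_r                                  # deviation from the orbit radius
    circular_state = np.array([theta_e, d_e, r, theta_total])
    goal_directed_state = np.array([theta_e, d_g, r])
    return circular_state, goal_directed_state
```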

3.2.4. Reward Functions

Reward functions critically influence policy optimization efficiency and solution quality as the primary feedback signal [27,28]. In this study, we categorize rewards into high-magnitude terminal rewards and auxiliary rewards. Terminal rewards, with higher magnitudes, are given at task completion, while auxiliary rewards address exploration inefficiencies caused by sparse rewards [29].
Auxiliary rewards are designed using a combination of a negative exponential function and a linear function. The exponential term provides steep gradients near the optimal behavior to accelerate late-stage training, while the linear term ensures stable early-stage progress and avoids local optima caused by vanishing gradients.
To avoid potential bias introduced by reward shaping, we assign positive rewards only when the task is successfully completed. This design ensures the agent remains focused on the main task rather than chasing short-term rewards [30].
The reward functions for the circular navigation task follow these principles: higher auxiliary rewards are given when the USV is closer to the orbital trajectory, has smaller angular errors, and accumulates larger circumnavigation angles. The reward functions are defined as:
$$r_1 = e^{-5.0|d_e|} - 1.0 - \frac{0.5|d_e|}{20}, \qquad r_2 = \frac{|\theta_{total}|}{2\pi} - 1.0, \qquad r_3 = e^{-5.0|\theta_e^{desire}|} - 1.0 - \frac{0.5|\theta_e^{desire}|}{\pi}$$
where $\theta_e^{desire}$ represents the deviation between the USV's velocity direction and the desired direction. We aim for the following behavior (see Figure 7):
  • When far outside the orbit, $\theta_e$ should approach 0.
  • When far inside the orbit, $\theta_e$ should approach $\pi$.
  • Approaching the orbital boundary, $\theta_e$ smoothly transitions to $\frac{\pi}{2}$ or $-\frac{\pi}{2}$.
To achieve this, a weighted angular error is introduced:
$$\theta_e^{desire} = (1 - \gamma_e)\,\theta_e^{f} + \gamma_e\,\theta_e^{n}$$
where $\gamma_e$ is a weight coefficient that increases as the distance to the expected circumnavigation trajectory $|d_e|$ decreases. $\theta_e^{f}$ represents the desired angular difference for large $|d_e|$ and $\theta_e^{n}$ represents the desired angular difference for small $|d_e|$. They are defined as:
$$\gamma_e = e^{-5.0|d_e|}$$
$$\theta_e^{f} = \begin{cases} |\theta_e| - 0, & \text{if } d_e > 0 \\ |\theta_e| - \pi, & \text{else} \end{cases}$$
$$\theta_e^{n} = |\theta_e| - \frac{\pi}{2}$$
The total auxiliary reward for circular navigation is:
$$r_c = r_1 + 2 r_2 + r_3$$
When the task is successfully completed, a terminal reward $r_{vc} = 1000$ is provided. If the USV collides with a buoy, a failure reward $r_{fc} = -1000$ is given.
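The angular deviation weighting and the auxiliary reward $r_c$ can be sketched as below. The coefficients follow the reconstructed equations above; evaluating the angle error through its absolute value so that both orbit directions are rewarded equally is an assumption of this sketch, not a statement of the authors' exact code.

```python
import numpy as np

def circular_auxiliary_reward(theta_e, d_e, theta_total):
    """Auxiliary reward r_c = r1 + 2*r2 + r3 with the weighted angular error."""
    gamma_e = np.exp(-5.0 * abs(d_e))                       # weight grows near the orbit
    theta_e_f = abs(theta_e) - (0.0 if d_e > 0 else np.pi)  # desired error far from the orbit
    theta_e_n = abs(theta_e) - np.pi / 2.0                  # desired error near the orbit
    theta_des = (1.0 - gamma_e) * theta_e_f + gamma_e * theta_e_n

    r1 = np.exp(-5.0 * abs(d_e)) - 1.0 - 0.5 * abs(d_e) / 20.0
    r2 = abs(theta_total) / (2.0 * np.pi) - 1.0
    r3 = np.exp(-5.0 * abs(theta_des)) - 1.0 - 0.5 * abs(theta_des) / np.pi
    return r1 + 2.0 * r2 + r3
```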
For the goal-directed navigation task, with no circumnavigation requirement ($r_2$ removed), the rewards focus on minimizing the distance $d_g$ and the angle error $\theta_e$:
$$r_4 = e^{-1.0|d_g|} - 1.0 - \frac{0.5|d_g|}{20}, \qquad r_5 = e^{-5.0|\theta_e|} - 1.0 - \frac{0.5|\theta_e|}{2\pi}$$
The total auxiliary reward for goal-directed navigation is given by:
$$r_g = r_4 + r_5$$
A terminal reward $r_{vg} = 1000$ is provided only when the USV successfully reaches the vicinity of the goal point. In this environment, there are no failure rewards.

3.2.5. Deep Neural Networks

The PPO algorithm employs an Actor-Critic network architecture to implement policy iteration optimization in continuous action space problems. Given a state, the Actor network outputs the parameters of the probability distribution over possible actions. During training, actions are determined by sampling from this distribution, while during testing, the action with the highest probability is selected. The Critic network evaluates the value of a given state, serving as $V^\pi(s_t)$ in Equation (6).
The Beta distribution is adopted to model the continuous action range. The Actor network outputs the probability distribution parameters $(\alpha, \beta)$ for the rudder angle. The sampled values fall within the interval $[0, 1]$, and are then mapped to the actual action interval $\left[-\frac{\pi}{2}, \frac{\pi}{2}\right]$ via a linear mapping function.
The neural network architecture designed for navigation tasks is illustrated in Figure 8, in which green denotes the Actor’s components, blue represents the Critic’s components, and gray indicates raw input layers. The Actor and Critic are trained independently, each taking the state vector as input. This input passes through two fully connected neural network layers, each with 128 nodes and Tanh activation functions for their outputs. Finally, the Actor outputs the action distribution parameters, while the Critic outputs the state value.
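A compact PyTorch sketch of the Actor and Critic described above; the layer sizes follow the text (two hidden layers of 128 units with Tanh), while the softplus-plus-one trick used to keep $\alpha, \beta > 1$ is a common convention assumed here rather than stated in the paper.

```python
import torch
import torch.nn as nn
from torch.distributions import Beta

class NavActor(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 128), nn.Tanh(),
            nn.Linear(128, 128), nn.Tanh())
        self.alpha_head = nn.Linear(128, 1)
        self.beta_head = nn.Linear(128, 1)

    def forward(self, state):
        h = self.body(state)
        alpha = nn.functional.softplus(self.alpha_head(h)) + 1.0
        beta = nn.functional.softplus(self.beta_head(h)) + 1.0
        return Beta(alpha, beta)                  # samples lie in (0, 1)

class NavCritic(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.Tanh(),
            nn.Linear(128, 128), nn.Tanh(),
            nn.Linear(128, 1))

    def forward(self, state):
        return self.net(state)

def to_rudder(sample):
    """Linearly map a Beta sample from (0, 1) to the rudder range [-pi/2, pi/2]."""
    return (sample - 0.5) * torch.pi
```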

3.3. Design of Obstacle Avoidance Module

3.3.1. Environment

As shown in Figure 9, during the obstacle avoidance task initialization, the USV's initial position is fixed at the origin of the world coordinate system, with its heading always oriented north, so $(x_0, y_0) = (0, 0)$ and $\psi_0 = \frac{\pi}{2}$. Obstacle vessel positions and headings are randomly selected from the four COLREGs-defined encounter scenarios: overtaking, head-on, right-crossing, and left-crossing. In addition, the obstacle vessel's speed, radius, and environmental disturbance are randomly selected within a certain range.
This fixed initialization reduces reward function design complexity (as discussed in Section 3.3.4), narrows the exploration space, and accelerates convergence, albeit at the cost of degraded generalization performance [31]. Consequently, coordinate and heading transformations are required during real-world deployment (see Section 3.4).
The USV is equipped with a radar sensor for obstacle detection. As illustrated in Figure 10, the radar emits $N_{scan}$ beams covering a 360° angular range, with each beam having a detection range of $[0, R_{radar}]$. The radar is fixed to the USV body, where the 0° beam direction aligns with the USV's bow. Obstacle vessels are modeled as rigid circular bodies with radius $R_{obstacle}$. The USV's safety zone radius is $R_{usv}$. Let the $i$-th radar beam detect a distance $d_{radar}^{i}$ at bearing angle $\theta_{radar}^{i}$ relative to the bow. The failure condition is defined as:
$$\exists\, i \in \{1, \ldots, N_{scan}\}: \quad d_{radar}^{i} < R_{usv}$$
indicating a collision if any beam distance falls below $R_{usv}$. The success condition is defined as:
$$\forall\, i \in \{1, \ldots, N_{scan}\}: \quad d_{radar}^{i} = R_{radar}$$
where all radar beams reach the maximum detection range $R_{radar}$, indicating successful obstacle avoidance.
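A minimal check of these two termination conditions on one radar scan; the array-based interface is an assumption of this sketch, while the default radii follow Table 1.

```python
import numpy as np

def episode_status(d_radar, R_usv=1.0, R_radar=10.0):
    """Classify one scan: failure if any beam is inside the safety zone,
    success if every beam reads the maximum range (no obstacle in view)."""
    if np.any(d_radar < R_usv):
        return "failure"
    if np.all(d_radar >= R_radar):
        return "success"
    return "running"
```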

3.3.2. Action Space

The action space used in the obstacle avoidance environment is identical to that in the navigation environments (see Section 3.2.2).

3.3.3. State Space

The dynamic obstacle avoidance task fundamentally differs from navigation tasks because, as a subtask of path planning, it lacks explicit objectives. Therefore, the state space design should focus on a precise quantitative representation of the USV’s motion state. Given the practical challenges of acquiring real-time velocity and heading information for moving obstacles, this study proposes leveraging radar sensing data as indirect cues. A composite state space is constructed, consisting of a high-dimensional radar data layer and a low-dimensional motion parameter layer.
For the high-dimensional radar data layer, an image-based multi-scale radar data encoding method is proposed. As shown in Figure 11b, we encode four key elements in a 256-level grayscale image through spatial and grayscale stratification: relative position of obstacle vessel, relative motion trend of obstacle vessel, USV safety radius and USV heading angle.
The implementation involves four steps: (1) Convert the raw radar polar coordinates $(d_{radar}^{i}, \theta_{radar}^{i})$ to Cartesian coordinates $(x_{radar}^{i}, y_{radar}^{i})$:
$$x_{radar}^{i} = d_{radar}^{i}\cos\left(\theta_{radar}^{i} + \psi\right), \qquad y_{radar}^{i} = d_{radar}^{i}\sin\left(\theta_{radar}^{i} + \psi\right)$$
where x r a d a r i , y r a d a r i are the coordinates of the i -th radar beam endpoint in the radar-centric coordinate system. This enables geometric alignment between radar data and the global coordinate system, eliminating heading angle ψ effects on obstacle position representation. (2) Plot radar scan points on the image and sequentially connect them to form obstacle contours. This method also generalizes across radar resolutions, as demonstrated in Figure 12, which includes scans from three different radar resolutions, showing minimal differences in the connected contours.
(3) Inspired by [17], we design an exponential decay fusion algorithm to integrate historical radar frames for motion trend analysis. Let the $t$-th radar image be a matrix $RI(t)$, with decay coefficient $\gamma_{ri}$. The historically fused radar image $RI_H(t)$ is computed as:
$$RI_H(t) = \mathrm{clip}\left(\gamma_{ri}\left(RI(t) + RI_H(t-1)\right),\ 0,\ 255\right)$$
where pixel values are clipped to $[0, 255]$ for numerical stability. (4) Finally, a safety radius (represented as a central circular region) and a real-time heading vector (a pointer in the central circular region) are overlaid onto the radar image. The entire process is illustrated in Figure 13.
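The four encoding steps can be sketched with OpenCV-style drawing as below; the 256×256 canvas, the metres-to-pixels scale, the drawing intensities, and the restriction of the contour to beams that actually hit an obstacle are assumptions made for illustration, not the authors' exact implementation.

```python
import numpy as np
import cv2  # assumed drawing backend; any raster library would do

IMG, CENTER, SCALE = 256, 128, 12.0   # canvas size and metres-to-pixels scale (assumed)

def encode_radar_frame(d_radar, theta_radar, psi, prev_fused, gamma_ri=0.8,
                       R_radar=10.0, R_usv=1.0):
    # (1) Polar to Cartesian, compensated by the heading psi.
    x = d_radar * np.cos(theta_radar + psi)
    y = d_radar * np.sin(theta_radar + psi)

    # (2) Draw the connected contour of beams that returned an obstacle hit.
    frame = np.zeros((IMG, IMG), dtype=np.uint8)
    hit = d_radar < R_radar
    pts = np.stack([CENTER + SCALE * x[hit], CENTER - SCALE * y[hit]], axis=1)
    if len(pts) >= 2:
        cv2.polylines(frame, [pts.astype(np.int32).reshape(-1, 1, 2)], False, 255, 1)

    # (3) Exponential-decay fusion with the previous fused image, clipped to [0, 255].
    fused = np.clip(gamma_ri * (frame.astype(np.float32) + prev_fused), 0, 255)

    # (4) Overlay the safety radius and a heading pointer at the centre.
    out = fused.astype(np.uint8)
    cv2.circle(out, (CENTER, CENTER), int(SCALE * R_usv), 128, 1)
    tip = (int(CENTER + SCALE * np.cos(psi)), int(CENTER - SCALE * np.sin(psi)))
    cv2.line(out, (CENTER, CENTER), tip, 128, 1)
    return out, fused   # image fed to the policy, and the history for the next step
```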
For the low-dimensional motion parameter layer, with a fixed initial position and heading, target-related encoding is unnecessary. The motion state is directly represented by raw kinematic vectors:
$$\left(x,\; y,\; \psi,\; \dot{x},\; \dot{y},\; \dot{\psi}\right)$$

3.3.4. Reward Function

Due to the decoupling of the obstacle avoidance system and the navigation system, the agent lacks heading decision information during obstacle avoidance, which may result in local optima (e.g., spinning in place to avoid obstacles). To address this, we propose a heading-lock strategy based on the first observation of obstacles. Upon initial obstacle detection, the current heading angle and coordinates of the USV are recorded as the reference direction $\theta_{ref}$ and reference position $(x_{ref}, y_{ref})$. Advancing in the reference direction is rewarded, while deviating from it is penalized. Based on this principle, the auxiliary reward components are defined as:
$$r_6 = \frac{s - s_{max}}{3.0}, \qquad r_7 = e^{-3.0|d_{lat}|} - \frac{|d_{lat}|}{1.0}, \qquad r_8 = e^{-5.0|\psi - \theta_{ref}|} - \frac{|\psi - \theta_{ref}|}{\pi/6}, \qquad r_9 = \begin{cases} +1, & \text{if follows COLREGs} \\ -1, & \text{else} \end{cases}$$
where $s$ is the distance the USV has advanced in the reference direction from the reference position and $s_{max}$ denotes the maximum distance recorded so far, updated as $s_{max}(t) = \max\left(s_{max}(t-1), s\right)$. $d_{lat}$ is the lateral offset of the USV relative to the reference path defined by $\theta_{ref}$ and $(x_{ref}, y_{ref})$. As mentioned in Section 3.3.1, the initial coordinates of the USV in this environment are fixed at the origin, the initial heading is fixed north, and the USV encounters an obstacle at the initial moment. Thus, $(x_{ref}, y_{ref}) = (0, 0)$, $\theta_{ref} = \frac{\pi}{2}$, and the remaining variables simplify to:
$$s = y, \qquad d_{lat} = x$$
$r_6$ provides incentives only when the agent continuously advances, penalizing any backward movement to discourage behaviors such as reversing to avoid obstacles. $r_7$ restricts the lateral offset distance, $r_8$ limits the deviation of the heading from the reference direction, and $r_9$ encourages the agent to follow the COLREGs rules for obstacle avoidance.
The total auxiliary reward is expressed as:
$$r_{ao} = r_6 + 0.5 r_7 + 0.5 r_8 + r_9$$
When the agent collides with an obstacle, a terminal reward $r_{fao} = -250$ is given. When the agent avoids the obstacle, the terminal reward is set to $r_{vao} = 0$. This prevents the agent from prematurely rushing away from the obstacle to obtain the terminal reward, while emphasizing the importance of advancing and adhering to the rules during obstacle avoidance.
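The heading-lock bookkeeping behind these rewards can be sketched as follows; the class interface and the projection onto the reference direction are assumptions of this illustration.

```python
import numpy as np

class HeadingLock:
    """Record the reference pose at first obstacle detection and track the
    advance s, its running maximum s_max, and the lateral offset d_lat."""
    def __init__(self, x_ref, y_ref, theta_ref):
        self.x_ref, self.y_ref, self.theta_ref = x_ref, y_ref, theta_ref
        self.s_max = 0.0

    def update(self, x, y):
        dx, dy = x - self.x_ref, y - self.y_ref
        s = dx * np.cos(self.theta_ref) + dy * np.sin(self.theta_ref)        # advance
        d_lat = -dx * np.sin(self.theta_ref) + dy * np.cos(self.theta_ref)   # lateral offset
        self.s_max = max(self.s_max, s)
        return s, self.s_max, d_lat

# With the fixed training initialization (origin, heading north = pi/2),
# update() reduces to s = y and d_lat = x up to sign convention.
```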

3.3.5. Deep Neural Network

The neural network architecture designed for the obstacle avoidance task is shown in Figure 14. In this network, a CNN is used to process the radar image input. The input undergoes sequential processing through multiple convolution-pooling layers, progressively reducing its dimension from 256 × 256 to a compact 4 × 4 × 64 feature representation. Notably, this feature extraction module is shared between the Actor and Critic components. The resulting low-dimensional image features are then concatenated with the low-dimensional motion state data. These combined features are passed through several fully connected layers of decreasing width, ultimately producing the final output. Because of the large size of the input images, the low-dimensional motion data is fed into a later fully connected layer rather than the first one, to accelerate convergence [32].
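A PyTorch sketch of the shared feature extractor and late fusion described above; the exact kernel sizes, the number of convolution-pooling stages, and the layer at which the motion vector is injected are assumptions, chosen only to reproduce the 256×256 → 4×4×64 reduction mentioned in the text.

```python
import torch
import torch.nn as nn

class AvoidanceBackbone(nn.Module):
    """Shared CNN: 1x256x256 radar image -> 64x4x4 feature map, flattened to 1024."""
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in (8, 16, 32, 32, 64, 64):          # six conv-pool stages: 256 -> 4
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
            in_ch = out_ch
        self.cnn = nn.Sequential(*layers, nn.Flatten())

    def forward(self, img):
        return self.cnn(img)

class AvoidanceHead(nn.Module):
    """Fully connected head with the low-dimensional motion vector injected late."""
    def __init__(self, motion_dim=6, out_dim=2):
        super().__init__()
        self.fc1 = nn.Sequential(nn.Linear(64 * 4 * 4, 256), nn.ReLU())
        self.fc2 = nn.Sequential(nn.Linear(256 + motion_dim, 128), nn.ReLU())
        self.out = nn.Linear(128, out_dim)              # e.g. Beta parameters or a state value

    def forward(self, img_feat, motion):
        h = self.fc1(img_feat)
        h = self.fc2(torch.cat([h, motion], dim=-1))    # late fusion of the motion state
        return self.out(h)
```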

3.4. Design of Decision and Encoding Unit

As shown in Figure 4, a critical component of this decoupled architecture, in addition to independent modules, is the decision and encoding unit (marked as “decide and encode”), which serves as the centralized coordinator of the entire system. The decision unit determines when to activate specific modules, while the encoding unit bridges the gap between independently trained modules and real-world environments by converting environmental observations into suitable module inputs. By designing tailored decisions and encoding units for different tasks, the architecture achieves versatility. For the multi-buoy inspection tasks in this study, the unit operates as follows.
For the decision unit, the agent first activates the circular navigation module, guiding the USV from the starting point to each buoy for inspection. Once all buoys are inspected, the USV initiates a return-to-home task, triggering a switch to the goal-directed module. The task concludes when the USV reaches its origin. Throughout the process, if radar detects approaching obstacles, the agent immediately switches to the obstacle avoidance module.
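The decision logic described in this paragraph can be sketched as a simple priority rule; the predicate names below are hypothetical placeholders, not part of the paper's implementation.

```python
def select_module(obstacle_in_radar_range, all_buoys_inspected):
    """Priority rule of the decision unit: avoidance preempts navigation,
    circular navigation runs until every buoy is inspected, then return home."""
    if obstacle_in_radar_range:
        return "obstacle_avoidance"
    if not all_buoys_inspected:
        return "circular_navigation"     # approach and orbit the current buoy
    return "goal_directed_navigation"    # return to the starting point
```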
For the encoding unit, it is worth noting that when switching to the obstacle avoidance module, a state transformation is required due to the fixed coordinate and heading initialization used during training. As shown in Figure 15a, a local coordinate system is established with the reference position as the origin and the reference heading as the $y$-axis. The motion vector in the world frame, $(x, y, \psi, \dot{x}, \dot{y}, \dot{\psi})$, is transformed into the new coordinate frame as $(x', y', \psi', \dot{x}', \dot{y}', \dot{\psi}')$:
$$\begin{aligned} x' &= (x - x_{ref})\cos\left(\tfrac{\pi}{2} - \theta_{ref}\right) - (y - y_{ref})\sin\left(\tfrac{\pi}{2} - \theta_{ref}\right) \\ y' &= (x - x_{ref})\sin\left(\tfrac{\pi}{2} - \theta_{ref}\right) + (y - y_{ref})\cos\left(\tfrac{\pi}{2} - \theta_{ref}\right) \\ \psi' &= \left(\psi - \theta_{ref} + \tfrac{\pi}{2}\right) \bmod 2\pi \end{aligned}$$
$$\begin{bmatrix} \dot{x}' \\ \dot{y}' \\ \dot{\psi}' \end{bmatrix} = T\left(\tfrac{\pi}{2} - \theta_{ref}\right) \begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{\psi} \end{bmatrix}$$
By substituting the transformed motion vector for the original one in Equations (25)–(27), the input is aligned with the state distribution the obstacle avoidance module was trained on.
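The encoding-unit transformation can be sketched as below; the function name and the NumPy form of the rotation $T(\cdot)$ from Section 2.1 are assumptions made for illustration.

```python
import numpy as np

def to_avoidance_frame(x, y, psi, x_dot, y_dot, psi_dot, x_ref, y_ref, theta_ref):
    """Express the USV state in the local frame whose origin is the reference
    position and whose y-axis is the reference heading, matching the fixed
    initialization used to train the obstacle avoidance module."""
    a = np.pi / 2.0 - theta_ref                      # rotation angle
    ca, sa = np.cos(a), np.sin(a)
    xp = (x - x_ref) * ca - (y - y_ref) * sa
    yp = (x - x_ref) * sa + (y - y_ref) * ca
    psip = (psi - theta_ref + np.pi / 2.0) % (2.0 * np.pi)
    xdp = x_dot * ca - y_dot * sa                    # rotate the velocity vector
    ydp = x_dot * sa + y_dot * ca
    return xp, yp, psip, xdp, ydp, psi_dot
```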

4. Training and Validation

4.1. Training Results

Based on the OpenAI Gym framework, this study developed a virtual simulation platform for USV inspection. This platform integrates real-time trajectory monitoring and radar scanning functionalities, enabling the synchronous visualization of both the USV’s navigation status and environmental perception data. For the three subtasks involved in the inspection mission, neural network models are trained independently using the PyTorch deep learning framework (version 2.5.0). All training tasks were conducted on a server equipped with an NVIDIA RTX 4060 Ti GPU, an Intel i5-12500H CPU, and 32 GB of DDR4 memory. The detailed configuration of environment interaction parameters and PPO hyperparameters is provided in Table 1 and Table 2.
At the end of each training iteration, the cumulative discounted reward obtained by the agent during that iteration is recorded. To mitigate the impact of random environment initialization, a 200-step moving average is applied. The reward curves for the three subtasks are shown in Figure 16. The results demonstrate that the reward curves for all subtasks exhibit convergence characteristics. The rewards stabilize after a certain number of iterations and eventually reach a steady-state value, indicating that the policies have converged effectively.

4.2. Validation Results

Validation scenarios are independently designed for each of the three subtasks to evaluate the effectiveness of the trained policies. Additionally, a comprehensive multi-buoy inspection scenario is proposed to test the cooperative performance of the obstacle avoidance algorithm alongside the navigation algorithms. In all scenarios, there exists a wind disturbance with a constant speed of 1.0 m/s and a direction consistently oriented to the right.

4.2.1. Validations for Goal-Directed Navigation

To validate the goal-directed navigation algorithm, this study designs two test scenarios: In the first scenario, the initial position of the USV is set 100 m away from the goal point (42.9% beyond the maximum goal distance of 70 m used during training), to validate the algorithm's generalization capability in unknown states. The second scenario places the USV adjacent to the goal point with its heading facing away from the goal, testing the agent's autonomous adjustment ability under extreme pose conditions. The test results are shown in Figure 17. The results demonstrate that the goal-directed navigation algorithm successfully reached the destination range in both scenarios. The first scenario proves that the algorithm can adjust its course and proceed toward the target even in unseen situations. The second scenario shows its capability to handle extreme USV attitudes. The smooth paths generated in both cases confirm the algorithm's excellent control performance.

4.2.2. Validations for Circular Navigation

Two scenarios have been designed for the circular navigation algorithm. In the first scenario, the initial pose of the USV is placed inside the expected circumnavigation trajectory to examine the algorithm’s path recovery capability. In the second scenario, the required orbital radius is set to 10 m (twice the training standard of 5 m) to verify the generalization ability of the algorithm. The test results are shown in Figure 18. The results demonstrate that the circular navigation algorithm can achieve smooth transitions from linear to circular paths and maintain trajectory error within the required threshold, even under task requirements that differ from the training environment, thereby validating the effectiveness of the proposed angular deviation weighting mechanism.
To highlight the advantage of the PPO-based circular navigation algorithm, a comparative experiment is conducted with the traditional LOS navigation algorithm. The parameter configurations of LOS are detailed in Table 3, while the trajectory comparison and lateral error analysis of the two algorithms are shown in Figure 19.
The experimental results indicate that, compared to the traditional LOS algorithm, the PPO-based algorithm exhibits smoother trajectories with fewer trajectory oscillations.

4.2.3. Validations for Obstacle Avoidance

For the obstacle avoidance algorithm, this study validates the performance under four typical scenarios designed in compliance with COLREGs rules.
The visualization results of obstacle avoidance trajectories are presented in Figure 20. The experimental results indicate that, except for the left-crossing scenario, the agent effectively adheres to COLREGs rules and completes avoidance maneuvers in the other three scenarios. Further analysis reveals that in the left-crossing scenario, COLREGs explicitly require the own ship to maintain its course and speed, placing the responsibility of avoidance on the obstacle vessel. However, the obstacle vessel maintains a constant heading, creating a fundamental conflict between the rule requirements and the simulation environment. Faced with this contradiction, to avoid collision risks and associated negative rewards, the agent autonomously opts to violate the rule and execute an avoidance maneuver.

4.2.4. Validation for Multi-Buoy Inspection

To investigate the collaborative performance of independently trained obstacle avoidance and navigation modules, as well as to comprehensively evaluate the autonomous control capabilities of the agent in a complete multi-buoy inspection task, we designed a multi-buoy inspection scenario. This scenario includes two buoys. The USV departs from the starting point, inspects the buoys in a predefined sequence, and returns to the origin after completing the inspection, all while dynamically avoiding obstacles that appear along the way.
The simulation result is shown in Figure 21a. In this test, obstacles are generated just outside the USV’s radar detection range during movement. The first obstacle appears during the USV’s transit to the first buoy, simulating a head-on scenario. The second obstacle appears during transit to the second buoy, simulating a right-crossing scenario where the USV navigates a channel and encounters a crossing obstacle. In both cases, the USV should steer right to avoid collisions in accordance with COLREGs. Experimental results demonstrate that the USV consistently adopted the correct right-steering strategy without exhibiting excessive avoidance maneuvers like course reversal in both encountering scenarios. This confirms that our fixed initialization training, through state transformation, maintains generalization capability.
Different modules are activated to accomplish this task depending on the task progress and obstacle detection, as detailed in Section 3.4. The module-switching workflow is illustrated in Figure 21b, where the red trajectory is generated by the obstacle avoidance module, the blue trajectory by the circular navigation module, and the yellow trajectory by the goal-directed navigation module. The results demonstrate that the module-switching process performs consistently with the theoretical expectations.

5. Conclusions

In this paper, we use reinforcement learning methods to improve the navigation and obstacle avoidance performance of the autonomous buoy inspection USV system. Due to the complexity of the multi-buoy inspection tasks, we designed a decoupled architecture for navigation and obstacle avoidance and trained separate modules for each subtask. For the circular navigation module, we proposed an angle-weighted approach and independent experiments have demonstrated that this module enables smooth transitions. For the obstacle avoidance module, we designed a radar image encoding technique, and independent experiments have shown that it can partially comply with COLREGs for obstacle avoidance. Finally, a module-switching and state-transformation component tailored for multi-buoy inspection tasks is proposed, and integrated module experiments are conducted. The results show that with this component and the modules, the entire buoy inspection process can be fully automated, thereby proving the effectiveness of the proposed architecture.
Moving forward, we will enhance the obstacle avoidance module for full COLREGs compliance. For practical implementation, our algorithm will incorporate variables like obstacle vessel dynamics (speed, heading, dimensions), environmental visibility, and external interference to improve robustness. While initially designed for inspection tasks, the architecture demonstrates strong adaptability. We plan to expand its applications through sensor-specific avoidance modules and the integration of additional navigation modules to handle more complex tasks.

Author Contributions

Conceptualization, J.W.; Formal analysis, Z.W.; Funding acquisition, Z.L. and W.L.; Investigation, X.H.; Methodology, J.W. and Z.L.; Project administration, W.L.; Software, J.W.; Supervision, Z.L., X.H. and Z.W.; Validation, J.W., X.H. and Z.W.; Visualization, J.W.; Writing—original draft, J.W.; Writing—review and editing, Z.L. and W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Innovation and Entrepreneurship Training Program For Undergraduate (202410559074), Special Funds for the Cultivation of Guangdong College Student’s Scientific and Technological Innovation (pdjh2024a050), Guangdong Innovation and Entrepreneurship Training Program for Undergraduate (S202310559038).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xu, G.; Shi, Y.; Sun, X.; Shen, W. Internet of Things in Marine Environment Monitoring: A Review. Sensors 2019, 19, 1711.
  2. Otero, P.; Hernández-Romero, Á.; Luque-Nieto, M.; Ariza, A. Underwater Positioning System Based on Drifting Buoys and Acoustic Modems. J. Mar. Sci. Eng. 2023, 11, 682.
  3. Wang, J.; Wang, Z.; Wang, Y.; Liu, S.; Li, Y. Current situation and trend of marine data buoy and monitoring network technology of China. Acta Oceanol. Sin. 2016, 35, 1–10.
  4. Lu, Z.; Li, W.; Zhang, X.; Wang, J.; Zhuang, Z.; Liu, C. Design and Testing of an Autonomous Navigation Unmanned Surface Vehicle for Buoy Inspection. J. Mar. Sci. Eng. 2024, 12, 819.
  5. Fossen, T.I.; Breivik, M.; Skjetne, R. Line-of-sight path following of underactuated marine craft. IFAC Proc. Vol. 2003, 36, 211–216.
  6. Moe, S.; Pettersen, K.Y.; Fossen, T.I.; Gravdahl, J.T. Line-of-sight curved path following for underactuated USVs and AUVs in the horizontal plane under the influence of ocean currents. In Proceedings of the 2016 24th Mediterranean Conference on Control and Automation (MED), Athens, Greece, 21–24 June 2016.
  7. Sutton, R.S.; Barto, A.G. Adaptive computation and machine learning series. In Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018.
  8. Deraj, R.; Kumar, R.S.; Alam, S.; Somayajula, A. Deep reinforcement learning based controller for ship navigation. Ocean Eng. 2023, 273, 113937.
  9. Sivaraj, S.; Rajendran, S.; Prasad, L.P. Data driven control based on Deep Q-Network algorithm for heading control and path following of a ship in calm water and waves. Ocean Eng. 2022, 259, 111802.
  10. Zhao, Y.; Qi, X.; Ma, Y.; Li, Z.; Malekian, R.; Sotelo, M.A. Path Following Optimization for an Underactuated USV Using Smoothly-Convergent Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2020, 22, 6208–6220.
  11. Singh, Y.; Sharma, S.; Sutton, R.; Hatton, D.; Khan, A. A constrained A* approach towards optimal path planning for an unmanned surface vehicle in a maritime environment containing dynamic obstacles and ocean currents. Ocean Eng. 2018, 169, 187–201.
  12. Hashali, S.D.; Yang, S.; Xiang, X. Route Planning Algorithms for Unmanned Surface Vehicles (USVs): A Comprehensive Analysis. J. Mar. Sci. Eng. 2024, 12, 382.
  13. Yan, X.; Jiang, D.; Miao, R.; Li, Y. Formation Control and Obstacle Avoidance Algorithm of a Multi-USV System Based on Virtual Structure and Artificial Potential Field. J. Mar. Sci. Eng. 2021, 9, 161.
  14. Zhang, W.; Wei, S.; Teng, Y.; Zhang, J.; Wang, X.; Yan, Z. Dynamic Obstacle Avoidance for Unmanned Underwater Vehicles Based on an Improved Velocity Obstacle Method. Sensors 2017, 17, 2742.
  15. Jo, H.-J.; Kim, S.-R.; Kim, J.-H.; Park, J.-Y. Comparison of Velocity Obstacle and Artificial Potential Field Methods for Collision Avoidance in Swarm Operation of Unmanned Surface Vehicles. J. Mar. Sci. Eng. 2022, 10, 2036.
  16. Yuan, X.; Tong, C.; He, G.; Wang, H. Unmanned Vessel Collision Avoidance Algorithm by Dynamic Window Approach Based on COLREGs Considering the Effects of the Wind and Wave. J. Mar. Sci. Eng. 2023, 11, 1831.
  17. Xia, J.; Zhu, X.; Liu, Z.; Luo, Y.; Wu, Z.; Wu, Q. Research on Collision Avoidance Algorithm of Unmanned Surface Vehicle Based on Deep Reinforcement Learning. IEEE Sens. J. 2022, 23, 11262–11273.
  18. Xu, X.; Cai, P.; Ahmed, Z.; Yellapu, V.S.; Zhang, W. Path planning and dynamic collision avoidance algorithm under COLREGs via deep reinforcement learning. Neurocomputing 2022, 468, 181–197.
  19. Xu, X.; Lu, Y.; Liu, X.; Zhang, W. Intelligent collision avoidance algorithms for USVs via deep reinforcement learning under COLREGs. Ocean Eng. 2020, 217, 107704.
  20. Zhang, S.; Li, Y.; Dong, Q. Autonomous navigation of UAV in multi-obstacle environments based on a Deep Reinforcement Learning approach. Appl. Soft Comput. 2022, 115, 108194.
  21. Hutsebaut-Buysse, M.; Mets, K.; Latré, S. Hierarchical Reinforcement Learning: A Survey and Open Research Challenges. Mach. Learn. Knowl. Extr. 2022, 4, 172–221.
  22. Mccue, L. Handbook of Marine Craft Hydrodynamics and Motion Control. J. IEEE Control Methods 2016, 36, 78–79.
  23. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems; Solla, S., Leen, T., Müller, K., Eds.; MIT Press: Cambridge, MA, USA, 1999; Available online: https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf (accessed on 27 March 2025).
  24. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
  25. Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv 2018, arXiv:1506.02438.
  26. Van Moffaert, K.; Drugan, M.M.; Nowe, A. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), Singapore, 16–19 April 2013; pp. 191–199.
  27. Ecoffet, A.; Huizinga, J.; Lehman, J.; Stanley, K.O.; Clune, J. First return, then explore. Nature 2021, 590, 580–586.
  28. Han, R.; Chen, S.; Wang, S.; Zhang, Z.; Gao, R.; Hao, Q.; Pan, J. Reinforcement Learned Distributed Multi-Robot Navigation With Reciprocal Velocity Obstacle Shaped Rewards. IEEE Robot. Autom. Lett. 2022, 7, 5896–5903.
  29. Riedmiller, M.; Hafner, R.; Lampe, T.; Neunert, M.; Degrave, J.; Van de Wiele, T.; Mnih, V.; Heess, N.; Springenberg, J.T. Learning by Playing-Solving Sparse Reward Tasks from Scratch. arXiv 2018, arXiv:1802.10567.
  30. Everitt, T.; Krakovna, V.; Orseau, L.; Hutter, M.; Legg, S. Reinforcement Learning with a Corrupted Reward Channel. arXiv 2017, arXiv:1705.08417.
  31. Packer, C.; Gao, K.; Kos, J.; Krähenbühl, P.; Koltun, V.; Song, D. Assessing Generalization in Deep Reinforcement Learning. arXiv 2019, arXiv:1810.12282.
  32. Pateriya, N.; Jain, P.; Niveditha, K.P.; Tiwari, V.; Vishwakarma, S. Deep Residual Networks for Image Recognition. Int. J. Innov. Res. Comput. Commun. Eng. 2023, 11, 10742–10747.
Figure 1. Schematic diagram of USV visual inspection.
Figure 2. The kinematic model of USV.
Figure 3. Schematic diagram of multi-buoy inspection tasks.
Figure 4. Structure diagram of decoupled navigation and obstacle avoidance system.
Figure 5. Schematic diagram of navigation environment initialization.
Figure 6. Schematic diagram of navigation tasks.
Figure 7. Schematic diagram of a smooth transition to expected circumnavigation trajectory.
Figure 8. Structure diagram of neural network architecture employed in navigation tasks.
Figure 9. Schematic diagram of obstacle avoidance environment initialization.
Figure 10. Schematic diagram of radar obstacle avoidance task.
Figure 11. (a) Schematic diagram of full observation in obstacle avoidance. (b) Schematic diagram of encoded radar observation.
Figure 12. Schematic diagram of encoded radar observations with different radar resolutions. The generated contours exhibit minimal structural divergence. (a) The radar resolution is 1.0. (b) The radar resolution is 1.5. (c) The radar resolution is 2.0.
Figure 13. Schematic diagram of radar data encoding process.
Figure 14. Structure diagram of neural network architecture employed in obstacle avoidance task.
Figure 15. Schematic diagram of state transformation. (a) The local coordinate system. (b) The transformed radar image input.
Figure 16. Cumulative discounted reward curves. (a) Circular navigation. (b) Goal-directed navigation. (c) Obstacle avoidance.
Figure 17. (a) Test result for goal-directed navigation scenario 1. (b) Test result for goal-directed navigation scenario 2.
Figure 18. (a) Test result for circular navigation scenario 1. (b) Test result for circular navigation scenario 2.
Figure 19. (a) Comparison of the trajectories. (b) Comparison of the lateral error.
Figure 20. (a) Test result for overtaking scenario. (b) Test result for head-on scenario. (c) Test result for left-crossing scenario. (d) Test result for right-crossing scenario.
Figure 21. (a) The USV and obstacle trajectories in multi-buoy inspection scenario. (b) The module switching process in multi-buoy inspection scenario.
Table 1. Settings of environment interaction parameters.

Parameter         Value
$E_g$             0.5 m
$R_r$             5.0 m
$D_t$             0.5 m
$E_c$             0.01 rad
$R_{buoy}$        0.8 m
$N_{scan}$        360
$R_{radar}$       10.0 m
$R_{obstacle}$    0.75 m
$R_{usv}$         1.0 m
$\gamma_{ri}$     0.8
Table 2. Settings of PPO hyperparameters.

Hyperparameter            Value *
Training episodes         2000/8000
Experience buffer size    2048/1024
Max timesteps             500/100
Update epochs             10/10
Learning rate             2 × 10⁻⁴/2 × 10⁻⁴
$\gamma$                  0.99/0.99
$\lambda$                 0.95/0.95
$\varepsilon$             0.2/0.1
* Value for navigation/value for obstacle avoidance.
Table 3. Settings of LOS parameters.

Parameter                   Value
Lookahead distance          10 m
Number of waypoints         24
Path following tolerance    1.0 m
$K_P$                       3.0
$K_D$                       1.8