1. Introduction
In recent years, there has been an increasing global emphasis on the importance of marine resources, coinciding with the rapid advancement of artificial intelligence. In this context of technological convergence, Unmanned Surface Vehicles (USVs) have gained considerable traction in various fields, including scientific research, ocean resource exploration, water rescue missions, and environmental initiatives [1,2,3,4]. Given the inherently complex and dynamic nature of the marine environment, effective path planning for USVs plays a crucial role in ensuring the successful execution of these tasks.
The navigation system of a USV comprises three major subsystems: environmental and navigation situation awareness, cognition-based decision making, and ship navigation control. Path planning and obstacle avoidance are fundamental challenges in constructing these subsystems [5]. A typical path planning task aims to provide collision-free navigation from the starting position to a specified target position on a given map or grid [6,7]. Currently, USV path planning and obstacle avoidance techniques can be broadly classified into traditional methods and intelligent methods. Traditional methods typically refer to deterministic approaches [8] that provide solutions following predefined rules, using fused information at each decision step. Among traditional methods, Iijima et al. [9] used a breadth-first search method to select and plan collision avoidance paths; however, their approach did not consider the influence of the navigation environment. Churkin et al. [10] attempted to establish a mathematical model for collision avoidance path planning using both continuous and discrete research methods, but the continuous method exhibited high computational complexity and was unsuitable for cases involving multiple USV encounters. In another study, Hwang et al. [11] employed fuzzy set theory to establish a knowledge base system to evaluate ship collision risk and determine collision avoidance strategies; however, their system focused solely on collision avoidance strategies, rendering the overall voyage suboptimal. Chang et al. [12] proposed a model for calculating collision avoidance paths on grid maps using a maze routing algorithm, but this approach did not account for international maritime collision avoidance rules or navigation environment conditions.
Szlapcynski et al. [13] improved the maze routing method in [12] by adding a turning penalty and a time-varying restricted area. However, the resulting path remained suboptimal because navigation environment conditions were still neglected.
Apart from the aforementioned model-based methods, a number of heuristic algorithms have also been proposed. Recently, a novel Voronoi–Visibility path planning algorithm, which integrates the advantages of a Voronoi diagram and a visibility graph, was proposed for solving the USV path planning problem in [14]. In [15], Nie et al. studied the problem of robot path planning using the Dijkstra algorithm and Ant Colony Optimization. For known environments, the path planning problem was studied in [16], which introduced geometric areas to divide obstacle avoidance zones and performed global obstacle-avoidance planning of a USV with an improved A-star algorithm. In [17], Yao et al. proposed a hierarchical architecture using the biased min-consensus method to address the path planning problem of USVs. In [18], Wu et al. investigated USV path planning by proposing a global path planning approach based on an intelligent water drops algorithm. Wei et al. [19] designed a trajectory planning unit based on the unique characteristics of USVs, reflecting their intelligent navigation. These methods demonstrate wide application prospects in the field of USVs.
However, determining the optimal obstacle-avoidance path for USVs involves a number of crucial factors, including navigation environment conditions and international maritime collision-avoidance rules. Many of these factors are abstract and qualitative, making them challenging to quantify using deterministic mathematical methods. In contrast, intelligent methods, such as Deep Reinforcement Learning (DRL) algorithms, show better efficacy in handling abstract and qualitative influencing factors, making them more suitable for USV path planning and obstacle avoidance in uncertain and time-varying ocean environments. DRL combines the feature interpretation capabilities of deep learning with the decision-making abilities of reinforcement learning, enabling direct optimal decision outputs based on high-dimensional input data. This approach constitutes an end-to-end decision control system [20,21]. Jaradat et al. [22] incorporated a predictive model into DRL, achieving high dynamic performance in convergence speed, average reward value, and other indicators through path planning experiments on aircraft carrier decks. Guan et al. [23] proposed a local path planning and behavior decision-making approach based on an improved Proximal Policy Optimization (PPO) algorithm, enabling smart USVs to reach their targets without requiring human experience. To further enhance ship path planning during navigation, Guo et al. [24] introduced a coastal ship path planning model based on the Deep Q-Network (DQN) algorithm. Prianto et al. [25] developed a path planning algorithm based on Soft Actor–Critic (SAC), allowing for multi-arm manipulator path planning.
Convolutional layers have been widely applied to the feature extraction problem of high-dimensional state tasks in DRL. Habib et al. [26] gave detailed insight into computation acceleration using stochastic gradient descent, fast convolution, and parallelism in convolutional neural networks (CNNs). Lebedev et al. [27] covered approaches based on tensor decompositions, weight quantization, weight pruning, and teacher–student learning. Krichen [28] provided a comprehensive overview of CNNs and their applications in image recognition tasks.
In this paper, a path planning algorithm for USVs with local environmental information perception in a time-varying maritime environment is proposed based on an improved Proximal Policy Optimization (PPO) algorithm. The contributions of this study can be summarized in the following three key aspects:
(1) To reflect realistic maritime environments, a grid-based environment model is constructed from real-world electronic charts to map the dynamic states of a ship and static obstacles at sea.
(2) Integration of path planning and obstacle avoidance is achieved with the proposed PPO algorithm, taking into account the sensing range of on-board sensors.
(3) To address unpredictable situations, e.g., unknown maps or moving ships in the area, we use convolutional neural networks (CNNs) for state-feature extraction in PPO. Our simulation results show that this method greatly improves the adaptability of the USV in path planning in uncharted marine environments.
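As a rough illustration of a grid-based environment model with a limited sensing range, consider the following sketch (the grid coding, reward values, and class names are our assumptions for illustration, not the authors' implementation):

```python
import numpy as np

class GridSea:
    """Minimal grid-based sea model: 0 = free water, 1 = obstacle."""

    def __init__(self, grid, start, goal, sense_radius=3):
        self.grid = np.asarray(grid)
        self.pos = np.array(start)
        self.goal = np.array(goal)
        self.r = sense_radius

    def local_observation(self):
        """Return the (2r+1)x(2r+1) patch around the USV, padded as obstacle."""
        padded = np.pad(self.grid, self.r, constant_values=1)
        x, y = self.pos + self.r
        return padded[x - self.r:x + self.r + 1, y - self.r:y + self.r + 1]

    def step(self, move):
        """Move is a (dx, dy) pair; -1 per step, -50 collision, +100 at the goal."""
        nxt = self.pos + np.array(move)
        if (nxt < 0).any() or (nxt >= self.grid.shape).any() or self.grid[tuple(nxt)] == 1:
            return self.local_observation(), -50.0, False  # collision: stay put
        self.pos = nxt
        done = (self.pos == self.goal).all()
        return self.local_observation(), 100.0 if done else -1.0, done

# Illustrative usage on a tiny obstacle-free grid.
env = GridSea([[0, 0], [0, 0]], start=(0, 0), goal=(1, 1), sense_radius=1)
obs, reward, done = env.step((1, 0))  # move one cell; not yet at the goal
```

The local patch returned by `local_observation` is the kind of sensor-limited input that the CNN-based state-feature extraction in contribution (3) would consume.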
The rest of this paper is organized as follows. The problem formulation is described in Section 2. In Section 3, a path planning algorithm based on PPO is proposed. Section 4 presents a comparative analysis of the simulation experiment process and experimental results. Finally, the conclusion and future work are given in Section 5.
4. Experiments
In this section, we provide a series of numerical simulation results to evaluate the performance of our proposed algorithm. For marine environment simulation and USV strategy training, all experiments are performed with PyTorch 2.0.1 on a desktop machine with 128 GB of memory and hardware acceleration using an NVIDIA (Santa Clara, CA, USA) GeForce RTX 3090 Ti GPU. We aim to validate the generalizability of the proposed algorithm by modifying three conditions: the endpoint coordinates, the map, and the number of training sets. We conduct simulation experiments from various aspects to verify the effectiveness of our approach. Experiment 1 focuses on a USV obstacle avoidance simulation using the algorithm proposed in this paper. In Experiment 2, we test the generalization capability of the proposed algorithm by changing the endpoint. Similarly, in Experiment 3, we explore the algorithm's generalization under different sea maps. Experiment 4 involves training an additional network model by increasing the number of maps used for training; this model is then used to assess the generalization of the proposed algorithm in the simulation environment of Experiment 2. Finally, Experiment 5 compares the performance of the proposed algorithm with other algorithms, thereby demonstrating its effectiveness.
4.1. Generalization Definition and Modeling
For reinforcement learning (RL), generalization ability refers to how well a model trained in a training environment performs, for the same task in the same domain, when verified in a test environment. In supervised learning, a predictor $f$ is trained on a training dataset, and the performance of the model is measured on a held-out testing dataset. It is often assumed that the data points in both the training and testing datasets are drawn independently and identically distributed from the same underlying distribution. The generalization gap in supervised learning for a model $f$ with training and testing data $D_{\text{train}}$, $D_{\text{test}}$ and loss function $L$ is defined as
$$\mathrm{GenGap}(f) := \mathbb{E}_{(x,y) \sim D_{\text{test}}}\big[L(f(x), y)\big] - \mathbb{E}_{(x,y) \sim D_{\text{train}}}\big[L(f(x), y)\big].$$
This gap is used as a measure of generalization; specifically, a smaller gap means a model generalizes better. Generalization refers to a class of problems rather than a specific problem. Thus, the generalization measure in RL, shown in Equation (13), takes the analogous form, with the expected return $R(\pi)$ of a policy $\pi$ evaluated on the training and testing environment distributions:
$$\mathrm{GenGap}(\pi) := \mathbb{E}_{\text{train}}\big[R(\pi)\big] - \mathbb{E}_{\text{test}}\big[R(\pi)\big]. \qquad (13)$$
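The supervised generalization gap described above can be estimated empirically; the following minimal sketch (with an illustrative toy model and squared loss, not the paper's networks) computes it:

```python
import numpy as np

def generalization_gap(model, loss, train_set, test_set):
    """GenGap(f) = mean test loss - mean train loss; smaller means better generalization."""
    train_loss = np.mean([loss(model(x), y) for x, y in train_set])
    test_loss = np.mean([loss(model(x), y) for x, y in test_set])
    return test_loss - train_loss

# Illustrative example: a "model" that doubles its input, with squared loss.
model = lambda x: 2.0 * x
sq_loss = lambda pred, target: (pred - target) ** 2
train = [(1.0, 2.0), (2.0, 4.0)]  # fitted perfectly: train loss 0
test = [(3.0, 7.0)]               # 2*3 = 6 vs. target 7: test loss 1
gap = generalization_gap(model, sq_loss, train, test)  # 1.0
```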
To discuss generalization, we need a way of talking about a collection of tasks, environments, or levels; the need for generalization emerges from the fact that we train and test the policy on different collections of tasks. To formalize the notion of a collection of tasks, we start with the Contextual Markov Decision Process (CMDP), shown in Equation (14):
$$M = \langle S, A, O, R, T, C, \phi \rangle, \qquad (14)$$
where $S$ is the underlying state space; $A$ is the action space; $O$ is the observation space; $R$ is the scalar reward function; $T$ is the Markovian transition function; $C$ is the context space; and $\phi$ is the emission or observation function. We factorize the initial state distribution as shown in Equation (15):
$$p(s_0, c) = p(c)\, p(s_0 \mid c), \qquad (15)$$
and we call $p(c)$ the context distribution. This distribution is what is used to determine the training and testing collections of levels, tasks, or environments. We now describe the class of generalization problems we focus on, using the CMDP formalism.
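The factorized sampling procedure behind the context distribution can be sketched as follows (the context names, probabilities, and start-state rule are illustrative assumptions, not values from the paper):

```python
import random

def sample_episode_start(context_probs, initial_state_given_context, rng):
    """Draw a context c ~ p(c), then an initial state s0 ~ p(s0 | c),
    following the factorized initial-state distribution of a CMDP."""
    contexts = list(context_probs)
    weights = [context_probs[c] for c in contexts]
    c = rng.choices(contexts, weights=weights)[0]
    s0 = initial_state_given_context(c, rng)
    return c, s0

# Illustrative: two "levels" (contexts), each fixing a different start cell.
p_c = {"easy_map": 0.5, "hard_map": 0.5}
start_of = lambda c, rng: (0, 0) if c == "easy_map" else (0, rng.randrange(3))
rng = random.Random(42)
c, s0 = sample_episode_start(p_c, start_of, rng)
```

Restricting which contexts `p_c` puts mass on is exactly how disjoint training and testing collections of levels are produced.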
All else being equal, the more similar the training and testing environments are, the smaller the generalization gap and the higher the test-time performance. The categorization of methods for tackling generalization in RL is shown in Figure 6.
In this paper, the generalization of the model is verified by holdout validation and data augmentation. For the holdout validation method, the map set is divided into a training set and a test set: the training set refers to the simulation data in Experiment 1, and the test set refers to the simulation data in Experiments 2 and 3. The results of Experiments 2 and 3 show that the USV can successfully reach the end point under different test sets. For the data augmentation method, more samples are introduced by transforming and expanding the training data. Experiment 4 then tests the generalization of the model under the condition of multiple training datasets, using the simulation environment of Experiment 2.
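A data augmentation step of the kind described, expanding a map set via transformations, might look like this (a sketch using rotations and mirrorings; the paper does not specify its exact transformations):

```python
import numpy as np

def augment_map(grid):
    """Generate rotated and mirrored variants of an occupancy grid,
    a simple form of data augmentation for a map training set."""
    variants = []
    g = np.asarray(grid)
    for k in range(4):
        rot = np.rot90(g, k)       # rotate by k * 90 degrees
        variants.append(rot)
        variants.append(np.fliplr(rot))  # and its mirror image
    return variants

base = np.array([[0, 1], [0, 0]])  # illustrative 2x2 occupancy grid
augmented = augment_map(base)      # 8 variants (some coincide for symmetric maps)
```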
4.2. Simulation Experiment
4.2.1. Experimental Platform Description and Training Parameters
The USV simulation environment in this paper is shown in Figure 7, which is extracted from the data illustrated in Figure 1. In this environment, the USV is represented by the blue square, the gray squares represent obstacles, and the yellow squares represent the end of the path. The USV continuously learns and explores its strategy, following the method proposed in this paper. The entire scene is reset at the end of each round or when the maximum number of time steps is exceeded in a single round. The detection radius of the radar is set to 153 m. In the simulation experiment, the performance indicators include the number of iterations used by the USV to reach the end point, the time needed for algorithm convergence, and the average reward obtained; among these, the step numbers can be read from the path diagram of each experiment. The allowable number of training time steps in this experiment is , and the PPO parameter settings are shown in Table 2. The convergence time and average reward are shown in Table 3.
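The 153 m detection radius amounts to a simple Euclidean range test; the sketch below (obstacle coordinates are illustrative, not from the experiments) shows which obstacles a radar at the origin would report:

```python
import math

def detected_obstacles(usv_pos, obstacles, detection_radius=153.0):
    """Return the obstacles within the radar's detection radius (in metres)."""
    ux, uy = usv_pos
    return [o for o in obstacles
            if math.hypot(o[0] - ux, o[1] - uy) <= detection_radius]

# Illustrative positions: 100 m, 200 m, and 150 m from the USV.
obstacles = [(100.0, 0.0), (200.0, 0.0), (90.0, 120.0)]
visible = detected_obstacles((0.0, 0.0), obstacles)  # first and third are in range
```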
4.2.2. Experimental Results and Analysis
For Experiment 1, the path trace following the proposed algorithm and the convergence tendency of the average reward are shown in Figure 8 and Figure 9, respectively. We also record snapshots of the paths taken by the USV to reach the destination in each episode, as shown in Figure 8. The average reward converges when the episode count reaches 341. During the training process, as the strategy improves, the number of steps the USV needs to reach the end point in each episode decreases. To find a better strategy, the PPO algorithm explores the unknown action space; therefore, at episode 1634, the number of steps increases.
As shown in Figure 9, as the iteration time steps increase, the average reward converges when the algorithm iterates to roughly  time steps. The final converged reward fluctuates between −111.34 and −125.55, which shows that the proposed algorithm has essentially converged.
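Convergence of the average reward is typically judged on a smoothed curve; a minimal moving-average sketch (the window size here is our assumption) illustrates how a noisy episode-reward series is reduced to the kind of curve reported above:

```python
def moving_average(rewards, window=10):
    """Smooth an episode-reward curve; convergence shows as the smoothed
    value settling inside a narrow band."""
    out = []
    for i in range(len(rewards)):
        lo = max(0, i - window + 1)          # truncated window at the start
        out.append(sum(rewards[lo:i + 1]) / (i + 1 - lo))
    return out

smoothed = moving_average([1.0, 2.0, 3.0], window=2)  # [1.0, 1.5, 2.5]
```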
In Figure 10, the vanilla PPO algorithm with zero sensing range is used for path planning in the same environment. It can be seen from the figure that PPO using convolutional layers achieves a better cumulative reward. In the absence of a sensing range, the USV is unable to handle different obstacle environments (see the near-to-end stage of the training process). The experimental results show that increasing the sensing range can greatly improve the convergence efficiency of the USV as well as its obstacle-handling capability.
In Experiment 2, the generalization capability of the proposed algorithm is verified by modifying the path ending. The starting point coordinates of the simulation environment remain unchanged, and the end point coordinates are changed from (40, 40) to (19, 44). The path diagram of the test process is shown in Figure 11, which shows that the USV successfully reaches the end point. During the test process, the PPO algorithm explores the unknown action space to find a better strategy. When the episode count reaches 77, the average reward value decreases sharply and the number of steps increases. At the 100th episode, the USV uses the fewest steps to reach the end point and receives the highest average reward. After the end point of the simulation environment is moved to more complex areas, the models trained by the proposed algorithm can still guide the USV to the end point, indicating that the proposed algorithm has a strong generalization capability.
Figure 12 shows the average reward of the proposed algorithm with sensing capability after the end point of the simulation environment is changed. The total training session is 100 episodes. As shown in Figure 12, the blue line takes only a few time steps to reach the average reward convergence value, and the average reward for testing is stable at −233.84. This means that the USV can find a safe, collision-free path in the simulation environment after the end point of the path planning is changed.
In Experiment 3, the generalization of the proposed algorithm is further verified by modifying the simulation map. The training map is changed to a new one, including the obstacles on the map and the end point coordinates. The path planning diagram for the test is shown in Figure 13. The algorithm proposed in this paper enables the USV to bypass the obstacle from above or below and successfully reach the end point. At the 77th episode, the USV takes the fewest steps to reach the end point and receives the highest average reward. As shown in Figure 14, the blue line represents the average reward convergence diagram after 100 episodes of testing; the average reward is −323.18.
In Experiment 4, we increase the size of the training map set; some of the maps are shown in Figure 15. As shown in Figure 16, the blue line represents the average reward convergence curve obtained by training with the three maps. The final average reward fluctuates between −100.28 and −168.54, indicating that the proposed algorithm essentially converges. Figure 17 shows the path diagram in testing after the training set is expanded. In some of the rounds shown therein, the USV successfully reaches the end point, which demonstrates the generalization capability of the algorithm proposed in this paper.
As shown in Figure 18, the blue line represents the average reward convergence diagram after 100 rounds of testing; the average reward is −186.84. After the training map set is changed, the resulting average reward convergence curve is shown in Figure 19, where the solid black line represents the average reward convergence graph after training on a single static map, and the red dotted line represents the average reward convergence graph after training with three static maps.
4.2.3. Comparative Experiment
Experiment 5 verifies the effectiveness of the proposed algorithm by comparing its performance with several baseline algorithms. For comparison, the USV performs obstacle avoidance tasks based on the SAC algorithm, the PPO algorithm, the DQN algorithm, and the proposed PPO algorithm with a sensing range. We compare the average reward obtained by these algorithms in the same scenario and the convergence time steps taken to reach the target position. Snapshots of the different algorithms' results are given in Figure 20, Figure 21 and Figure 22. During training, as the strategy improves, the USV becomes increasingly certain and the average reward value converges. As shown in Figure 20, when the episode count reaches 1958, the PPO algorithm explores a new action space and the number of steps increases. In Figure 21, at the 1634th episode, the USV falls into a local optimum near the end point and the number of steps increases. As shown in Figure 22, the number of steps for the USV to reach the end point in each round gradually decreases, and the final average reward converges.
Figure 23 shows the average reward curves of the four algorithms as the time steps increase in a static environment. The algorithm proposed in this paper fluctuates less in the early training process than the PPO, DQN, and SAC algorithms. As shown in Table 3, the proposed algorithm, the PPO algorithm, the DQN algorithm, and the SAC algorithm converge in 20,400, 59,400, 136,800, and 142,000 time steps, respectively. According to Figure 23, SAC fluctuates greatly before its average reward converges. The proposed algorithm takes advantage of the improved perceptual capability of PPO and accumulates higher rewards. In the late stages of training, the average reward converges to around −117.574.