1. Introduction
Covering 71% of the Earth’s surface area while being rich in material resources such as minerals, energy and organisms [1], the oceans are extremely important for both the natural ecosystem and human society. Therefore, researchers have invested much effort in developing intelligent, stable and efficient ocean exploration technologies. Among them, autonomous underwater vehicles (AUVs), as important ocean exploration tools, have been widely used because they offer better flexibility [2] and can accomplish large-scale, wide-ranging underwater ocean monitoring [3].
To enhance the robustness and adaptability of AUV motion, numerous researchers have conducted extensive work in the field of motion planning and achieved remarkable results. Motion planning research encompasses environmental modeling and motion planning algorithms. Environmental modeling requires both an environmental model and a motion model. The environmental model describes the spatial position of the AUV within the environment, as well as the corresponding attribute characteristics. Existing environmental modeling methods include the grid method [4], Voronoi diagrams [5], and inverse distance weighting (IDW) [6]. The environmental model established by the grid method is discrete, and actions can only be performed at grid vertices, resulting in systematic errors under actual conditions. The Voronoi diagram method maximizes the distance between the AUV and the obstacles, leading to suboptimal motion planning results. The IDW method, built on the grid method, employs distance-based similarity for weighted interpolation to address the errors caused by the discrete states in the grid method and establish a more realistic environmental model. The motion model defines the state and action outputs of the examined AUV. Traditional methods include the particle method and the dynamic method [7]. The particle method simplifies the AUV to a particle, while the dynamic method reflects its motion more accurately based on its dynamic performance. This paper uses the IDW and dynamic methods for environmental modeling.
In addition to environmental modeling, it is important to consider the motion planning algorithms of the AUV [8]. Motion planning algorithms are primarily used for path planning and trajectory-tracking tasks. In path planning tasks, the algorithm’s optimization objective is to obtain the optimal path to the target point. Traditional algorithms include the A* algorithm [9], the rapidly exploring random trees (RRT) algorithm [10], the rapidly exploring random trees star (RRT*) algorithm [11], particle swarm optimization [12], ant colony optimization [13], and the genetic algorithm [14]. The first three methods [9,10,11] are straightforward to implement, but their efficiency is constrained by the scale of the environmental model and decreases as the search range expands. The last three approaches are based on intelligent bionic models [12,13,14]. These algorithms have advantages over the former techniques, but their ability to extract features in more complex environments is limited, resulting in poor generalization, a tendency to fall into local optima, and difficulty in meeting practical application requirements. Because of time-varying ocean currents, the state of the ocean environment is constantly changing, making it very difficult for these algorithms to complete path planning tasks. In trajectory-tracking tasks, the algorithm’s optimization objective is to keep the AUV’s motion trajectory as consistent as possible with the target trajectory. Traditional algorithms include nonlinear model-based predictive controllers [15,16,17] and proportional–integral–derivative (PID) control [18]. Ocean currents introduce uncertainty into the AUV’s state, making it challenging for traditional algorithms to complete trajectory-tracking tasks in ocean environments.
With the development of artificial intelligence (AI), increasingly sophisticated AI technologies are being applied to AUV motion planning [19], providing more possibilities for solving underwater problems with the Internet of Things [20]. Reinforcement learning (RL), as a machine learning paradigm, learns through trial and error. The goal of an RL policy is to maximize reward values or achieve specific goals while interacting with the environment [21,22]. This training process endows RL with model-free operation, flexibility, and adaptability, giving it high potential in complex and ever-changing ocean environments. In [23], a value-based Q-learning algorithm was successfully applied to obstacle avoidance tasks. However, this method requires tables to store all state-action values, consuming considerable storage resources. Mnih et al. [24] introduced the deep Q network (DQN), which uses deep neural networks to represent state-action value functions and addresses the curse of dimensionality. Chu et al. [25] introduced the DQN algorithm into the AUV path planning task. Zhang et al. [26] applied the deep deterministic policy gradient (DDPG) [27] to AUV trajectory-tracking tasks, reducing the induced tracking errors. Du et al. [28] and Hadi et al. [29] used the twin delayed deep deterministic policy gradient (TD3) [30] as the AUV state control method to obtain better results, but the exploration process of the algorithm is slow. Xu et al. [31] and Huang et al. [32] introduced the soft actor–critic (SAC) [33] for AUV path planning, improving the algorithm’s exploration ability; however, only single tasks were considered. He et al. [34] proposed asynchronous multithreading proximal policy optimization (AMPPO) based on proximal policy optimization (PPO) [35], which achieved good results in multiple task scenarios, but the individual environments used in the experiments were relatively simple. In existing studies [25,26,28,29,31,32,34] on modeling reinforcement learning environments, real ocean current data are seldom considered, and the overestimation error caused by ocean currents is rarely taken into account, which leads to low exploration rates and poor adaptability and robustness in motion planning. To solve these problems, a reinforcement learning-based motion planning algorithm with comprehensive ocean information (RLBMPA-COI) is proposed for AUV motion planning tasks. The main contributions of this paper are as follows:
To create a realistic ocean current environment, the AUV motion model and real ocean current data are introduced into the reinforcement-learning-based environment modeling. This effectively narrows the gap between the simulations and practical applications, bringing greater practical value to AUV motion planning.
We propose the RLBMPA-COI AUV motion planning algorithm based on a real ocean environment. By incorporating local ocean current information into the objective function of the action-value network, the algorithm suppresses overestimation error interference and enhances the efficiency of motion planning. According to the influence of ocean currents, target distance, obstacles, and step count on the motion planning task, we also establish a corresponding reward function. This ensures efficient training and further improves the algorithm’s exploration ability and adaptability.
Multiple simulations, including complex obstacle and dynamic multi-objective tasks in path planning, as well as a trajectory-tracking task, are designed to verify the performance of the proposed algorithm comprehensively. Compared with state-of-the-art algorithms, the performance of RLBMPA-COI is evaluated and demonstrated through numerical results, which show efficient motion planning and high flexibility for extension to different ocean environments.
The remainder of this paper is organized as follows. Section 2 introduces the AUV motion model, the ocean current model and SAC background information. Section 3 describes the implementation details of the proposed method. Section 4 discusses a variety of experiments that verify the performance of the algorithm. Finally, Section 5 concludes our work.
3. Methods
The ocean environment is constantly changing and uncertain, because ocean currents continually affect AUV motion. To achieve improved adaptability and intelligence, the algorithm needs to adjust to changes in the ocean currents. Therefore, we designed a reinforcement learning-based motion planning algorithm with comprehensive ocean information (RLBMPA-COI) to effectively suppress the disturbances that ocean currents cause in AUV motion planning.
3.1. Marine Environmental Information Model
To balance the simplicity and realism of the environmental simulation and reduce the discrepancy between the simulated and actual environments, this study utilizes real ocean data to construct a time-varying ocean current model and improve the confidence level of the AUV motion planning environment. We focus on the South China Sea (22.25°N to 22.75°N, 115.50°E to 116°E) and use data from the National Data Center for Marine Science [37,38]. We constructed a comprehensive time-varying ocean current model using information such as temperature, wind speed, wave intensity, wave temperature, and salinity from January to June 2021. The vertical velocity of a 3D ocean current is much smaller than its longitudinal and latitudinal components and is usually ignored to reduce the complexity of the environmental simulation.
This study constructs an ocean simulation environment from the real ocean environment data using the GYM framework, a reinforcement learning environment framework developed by OpenAI that has been widely used in the field of reinforcement learning. First, a grid method is used to create a 100 × 100 grid corresponding to the study area. Then, inverse distance weighted interpolation is used to map the integrated ocean data onto the grid so that the AUV can query the changing ocean current state in a continuous space. The ocean current vector consists of the longitudinal current velocity and the latitudinal current velocity in the plane of the earth-fixed coordinate system, as shown in Figure 2, where t corresponds to the current moment. The variation in the surface currents is shown in Figure 3, where the direction of the black arrows corresponds to the direction of the currents and the heatmap corresponds to the current intensity. Notably, 10 grid lengths correspond to 0.05°.
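For illustration, the following is a minimal sketch (not the authors' code) of an inverse distance weighted lookup of the gridded current field at a continuous position; the grid shape, variable names, neighborhood size, and power parameter are assumptions.

```python
import numpy as np

def idw_current(u_grid, v_grid, x, y, power=2.0, k=4):
    """Minimal IDW lookup of the (u, v) current at a continuous position (x, y).

    u_grid, v_grid: 2D arrays (e.g., 100 x 100) holding the gridded longitudinal
    and latitudinal current components for one time step. Returns the
    interpolated (u, v) using the k nearest surrounding grid vertices.
    """
    h, w = u_grid.shape
    # Candidate grid vertices surrounding the query point.
    xs = np.clip(np.arange(int(np.floor(x)) - 1, int(np.floor(x)) + 3), 0, w - 1)
    ys = np.clip(np.arange(int(np.floor(y)) - 1, int(np.floor(y)) + 3), 0, h - 1)
    gx, gy = np.meshgrid(xs, ys)
    d = np.hypot(gx - x, gy - y).ravel()
    idx = np.argsort(d)[:k]
    gx, gy, d = gx.ravel()[idx], gy.ravel()[idx], d[idx]
    if d[0] < 1e-9:                       # query point coincides with a vertex
        return float(u_grid[gy[0], gx[0]]), float(v_grid[gy[0], gx[0]])
    wgt = 1.0 / d ** power                # inverse distance weights
    wgt /= wgt.sum()
    return float(np.sum(wgt * u_grid[gy, gx])), float(np.sum(wgt * v_grid[gy, gx]))
```

In the actual environment, a separate pair of grids would be held for each time step (or interpolated in time as well) to capture the time-varying current.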
3.2. Reinforcement Learning Environment
Based on the ocean current change model, an RL environment is designed that includes two basic modules: a state transition function and a reward function. The scale of this RL environment is a 100 × 100 grid, and all observation and output parameters are scaled by a factor of 50 relative to the grid scale.
3.2.1. State Transition Function
The state transition function describes the changes in the environment corresponding to the actions taken by the AUV at each moment. To reduce the computational complexity of the algorithm, we only consider AUV motion in the horizontal plane. Correspondingly, the pose and velocity vectors in the AUV motion model can be simplified to their horizontal-plane components. The change in the position of the AUV is affected not only by its own speed but also by the present strength of the ocean current. Therefore, the change in the state vector of the AUV can be expressed as a kinematic update that superimposes the local ocean current velocity on the AUV's own velocity.
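A minimal planar form of this kinematic update, assuming standard body-frame surge and sway velocities $u$ and $v$, heading $\psi$, yaw rate $r$, and current components $v_{cx}$, $v_{cy}$ (this notation is assumed, not taken from the original displayed equation), is:

$$
\begin{aligned}
\dot{x} &= u\cos\psi - v\sin\psi + v_{cx}(x, y, t),\\
\dot{y} &= u\sin\psi + v\cos\psi + v_{cy}(x, y, t),\\
\dot{\psi} &= r .
\end{aligned}
$$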
To meet the needs of motion planning tasks, two additional features are introduced: the distance of the AUV relative to obstacles and the distance relative to the target position. Overall, the state space of this RL environment combines the AUV state vector with these two distance features.
Additionally, we designed a continuous action space A so that the algorithm can take more precise actions. Specifically, the action output consists of the speed and the yaw angle. The range of values for the speed is related to the maximum ocean current speed, and the range of values for the yaw angle corresponds to the range of angles through which the hull can turn at the current moment, [−π/4, π/4].
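A minimal sketch of how such a state transition and action space could be wired into a GYM-style environment follows; the class name, speed bound, time step, and observation layout are assumptions rather than the authors' implementation.

```python
import numpy as np
import gym
from gym import spaces

class OceanCurrentEnvSketch(gym.Env):
    """Illustrative planar AUV environment: action = (speed, yaw angle)."""

    def __init__(self, current_fn, v_max=1.5, dt=1.0):
        # current_fn(x, y, t) -> (v_cx, v_cy); v_max is an assumed speed bound.
        self.current_fn, self.v_max, self.dt = current_fn, v_max, dt
        self.action_space = spaces.Box(
            low=np.array([0.0, -np.pi / 4], dtype=np.float32),
            high=np.array([self.v_max, np.pi / 4], dtype=np.float32))
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(5,),
                                            dtype=np.float32)
        self.reset()

    def reset(self):
        self.x, self.y, self.psi, self.t = 0.0, 0.0, 0.0, 0.0
        return self._obs()

    def step(self, action):
        speed, dpsi = float(action[0]), float(action[1])
        self.psi += dpsi                              # yaw command
        v_cx, v_cy = self.current_fn(self.x, self.y, self.t)
        # Position update: AUV velocity plus local ocean current drift.
        self.x += (speed * np.cos(self.psi) + v_cx) * self.dt
        self.y += (speed * np.sin(self.psi) + v_cy) * self.dt
        self.t += self.dt
        reward, done = 0.0, False                     # reward logic omitted here
        return self._obs(), reward, done, {}

    def _obs(self):
        v_cx, v_cy = self.current_fn(self.x, self.y, self.t)
        return np.array([self.x, self.y, self.psi, v_cx, v_cy], dtype=np.float32)
```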
3.2.2. Reward Function
In RL, the model seeks to maximize the expected reward, which in turn drives the gradient updates of the neural networks. By combining the task requirements of path planning and trajectory tracking, this paper carefully designs the corresponding reward functions.
In the path planning task, the target reward suffers from sparsity and abrupt changes. Therefore, a distance reward is designed to guide the training procedure of the algorithm; it is determined by whether the target has been reached, the distance of the AUV relative to the target, and the threshold criterion used to judge arrival at the target. To evaluate the algorithm's ability to adapt to ocean current changes, an ocean current reward is also designed.
When the AUV reaches the target point, it triggers a task completion event and receives a task completion reward. When the AUV approaches or reaches an obstacle, it triggers a collision event and receives an obstacle reward. The obstacle reward is tiered into a warning value and a collision value, judged according to the obstacle warning distance and the obstacle collision distance, and the collision penalty is also affected by whether the task has been completed.
Finally, a step reward and the total reward r are defined; the total reward combines the distance, ocean current, task completion, obstacle, and step rewards. The weight of the ocean current reward is determined by a weighting parameter together with the distance reward, and this weight can be adaptively adjusted to improve the adaptability and robustness of the algorithm to changes in ocean currents.
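As an illustration of how these components might be combined, the following is a minimal sketch; the thresholds, reward magnitudes, and the way the ocean current weight enters are placeholder assumptions, since the original equations are not reproduced here.

```python
def path_planning_reward(d_goal, d_obs, current_speed,
                         d_arrive=1.0, d_warn=3.0, d_collide=1.0,
                         k_dist=1.0, k_step=0.01, omega=0.1):
    """Illustrative path-planning reward combining distance, ocean current,
    completion, obstacle and step terms. All constants are placeholders."""
    done = d_goal <= d_arrive
    r_dist = -k_dist * d_goal              # guide the AUV toward the target
    r_current = -omega * current_speed     # penalize strong adverse currents
    r_done = 100.0 if done else 0.0        # task completion bonus
    if d_obs <= d_collide:                 # collision level
        r_obs = -100.0
    elif d_obs <= d_warn:                  # warning level
        r_obs = -10.0
    else:
        r_obs = 0.0
    r_step = -k_step                       # small per-step penalty
    total = r_dist + r_current + r_done + r_obs + r_step
    return total, done
```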
In the trajectory-tracking task, an event-triggered reward is defined to judge the overlap between the AUV trajectory and the reference trajectory. At the same time, the distance reward is modified, with the target position taken as the current reference point on the trajectory, to satisfy the requirements of trajectory tracking.
The total trajectory-tracking reward is again denoted r, and the weight coefficient of the ocean current reward is adaptively adjusted to improve tracking robustness and the algorithm's adaptability to ocean current changes.
3.3. RLBMPA-COI Algorithm
In an ocean environment, ocean currents have a significant impact on the motion state of an AUV. Therefore, a motion planning algorithm for an AUV needs to fully consider the impact of ocean currents to reduce the interference caused by their changes. To satisfy this requirement, we designed a reinforcement learning-based motion planning algorithm with comprehensive ocean information, the RLBMPA-COI algorithm. The system block diagram of this algorithm is shown in Figure 4.
Overall, the RLBMPA-COI algorithm includes a pair of state-action value networks, a pair of target state-action value networks, and a policy network, each with its own parameter vector. The RLBMPA-COI algorithm enhances exploration by introducing an action entropy term, so the optimization goal of the policy network can be expressed as a maximum-entropy objective.
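The displayed objective is not recoverable from the extracted text; since RLBMPA-COI builds on SAC [33], it presumably follows the standard maximum-entropy form, sketched here with the usual notation (states $s_t$, actions $a_t$, reward $r$, temperature $\alpha$, entropy $\mathcal{H}$):

$$
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \right].
$$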
The state-action value network is used to evaluate the current policy and is defined in terms of the soft state-value function and a discount factor. By introducing target state-action value networks to assist with training the state-action value networks, the training process is stabilized. In existing methods, a clipped state-action value function, taken as the minimum over the pair of target networks, is typically used as the target. However, this target only considers the impact of random errors on the network update and does not account for the systematic errors caused by the motion of ocean currents, which can lead to overestimation. Therefore, this paper specifically analyzes the impact of ocean current motion on the training process of the network and improves upon it.
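For reference, a sketch of the standard SAC-style definitions that these quantities presumably follow (soft Q-function, soft value, and clipped double-Q target; notation assumed):

$$
\begin{aligned}
Q(s_t, a_t) &= r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\!\left[ V(s_{t+1}) \right],\\
V(s_t) &= \mathbb{E}_{a_t \sim \pi}\!\left[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right],\\
y &= r + \gamma \Big( \min_{i = 1, 2} Q_{\bar{\theta}_i}(s', a') - \alpha \log \pi(a' \mid s') \Big), \quad a' \sim \pi(\cdot \mid s').
\end{aligned}
$$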
To attain improved generalization, the ocean current is simplified to sinusoidal functions in the longitudinal and transverse components, respectively, so that the two state-action value networks observe the state of the ocean current at their respective sampling moments. Different sampling moments result in a phase difference between the ocean currents seen by the two networks, forming a systematic overestimation error that the clipped target cannot eliminate. Assuming that the phase difference is constant, the overestimation error in the longitudinal direction can be defined as the difference between the two phase-shifted current components. Assuming instead that the sampled phase difference is uniformly distributed, the average systematic overestimation error caused by the interference of the longitudinal ocean current is obtained by taking the expectation over this distribution. Similarly, the average systematic overestimation error caused by the ocean current in the transverse direction can be expressed in the same way.
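The displayed equations are not recoverable here; an illustrative reconstruction of the kind of quantity being described, with assumed amplitude $A_x$, angular frequency $\omega$, and phase difference $\Delta$, is:

$$
v_{cx}(t) \approx A_x \sin(\omega t), \qquad
\delta_x(\Delta) = A_x\!\left[\sin(\omega t) - \sin(\omega t + \Delta)\right], \qquad
\bar{\delta}_x = \mathbb{E}_{\Delta \sim \mathcal{U}}\!\left[\,\delta_x(\Delta)\,\right],
$$

with the transverse error $\bar{\delta}_y$ defined analogously from $v_{cy}(t) \approx A_y \sin(\omega t)$.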
Due to the systematic overestimation error that ocean currents introduce into the reward function, this paper incorporates an ocean current reward weight coefficient into the experience tuple. At the same time, an ocean current weight queue is designed, which adds the sampled weight coefficients to the queue and outputs smoothed weight values. The target function is then redefined in terms of these smoothed weights.
We calculate the mean squared error (MSE) loss of the state-action value network, and its parameters are optimized with stochastic gradient descent. Finally, the state-action value network parameters are updated with an update weight, and the target network parameters are softly updated with a soft update weight.
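A minimal PyTorch-style sketch of this critic update follows. It assumes the smoothed ocean current weight simply rescales a current-related reward component stored in the experience tuple; this particular usage, the batch layout, and the `policy.sample` interface are assumptions, since the exact form of the redefined target is not recoverable from the extracted text.

```python
import torch
import torch.nn.functional as F
from collections import deque

omega_queue = deque(maxlen=256)          # queue of sampled current weights

def critic_update(batch, q1, q2, q1_targ, q2_targ, policy, q_optim,
                  alpha, gamma=0.99, tau=0.005):
    s, a, r_base, r_current, omega, s_next, done = batch

    # Smooth the ocean current weight over recently sampled experience.
    omega_queue.extend(omega.tolist())
    omega_bar = sum(omega_queue) / len(omega_queue)
    r = r_base + omega_bar * r_current   # assumed re-weighting of the current term

    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)

    # MSE loss for both state-action value networks.
    loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    q_optim.zero_grad()
    loss.backward()
    q_optim.step()

    # Soft update of the target networks.
    for net, targ in ((q1, q1_targ), (q2, q2_targ)):
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
    return loss.item()
```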
For the policy network, the loss function is defined from the expected soft state-action value and the entropy of the policy. The action is reparameterized through the neural network, and the gradient of the policy loss is computed with respect to the policy parameters, which are then updated at a fixed update frequency.
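A sketch of the standard SAC forms that these losses presumably follow (the reparameterized policy loss and the temperature loss with target entropy $\bar{\mathcal{H}}$; notation assumed):

$$
\begin{aligned}
J_\pi(\phi) &= \mathbb{E}_{s_t \sim D,\; \epsilon_t \sim \mathcal{N}}\!\left[ \alpha \log \pi_\phi\!\left(f_\phi(\epsilon_t; s_t) \mid s_t\right) - \min_{i=1,2} Q_{\theta_i}\!\left(s_t, f_\phi(\epsilon_t; s_t)\right) \right],\\
J(\alpha) &= \mathbb{E}_{a_t \sim \pi_\phi}\!\left[ -\alpha \log \pi_\phi(a_t \mid s_t) - \alpha\, \bar{\mathcal{H}} \right].
\end{aligned}
$$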
In addition, a dedicated loss function dynamically adjusts the entropy weight toward a target entropy, and the corresponding gradient is used to update the entropy weight at its own update frequency (a standard form of this temperature loss is included in the sketch above). The algorithm pseudocode is shown in Algorithm 1.
Algorithm 1: RLBMPA-COI algorithm.
1  Initialize the environment and set the number of episodes;
2  Initialize a replay buffer D with a given capacity;
3  Randomly initialize the policy network and the state-action value networks with their parameter vectors;
4  Initialize the target state-action value networks with parameter vectors copied from the state-action value networks;
5  Define the total training steps T and the number of training steps per episode;
6  foreach episode n in T do
7      Reset the step counter for the episode;
8      Clear the event trigger flags;
9      Clear the task completion flag;
10     Reset the environment, the AUV start point and the initial state;
11     while the episode step limit has not been reached and neither the collision flag nor the task completion flag is set do
12         Integrate the ocean current information and the AUV motion state;
13         Based on the AUV motion model and the current state, take an action according to the policy;
14         if the distance to an obstacle is below the collision distance then
15             Trigger the obstacle collision reward and end this episode;
16         else
17             if the distance to an obstacle is below the warning distance then
18                 Trigger the obstacle warning reward;
19             else
20                 pass;
21             end
22         end
23         if the target criterion is satisfied then
24             Trigger the target tracking (task completion) reward;
25         end
26         Obtain the next AUV state, update the individual rewards, and calculate the total reward r;
27         Integrate the state and rewards of the next moment, determine the triggering of obstacle and target events, and calculate the total reward r;
28         Store the experience tuple, including the ocean current reward weight, in D;
29         Increment the step counter;
30     end
31     Increment the episode counter;
32     if the replay buffer contains enough experience then
33         Sample a batch of experience from D;
34         Extract the ocean current reward weights into the weight queue;
35         Calculate the average ocean current weight in the queue;
36         Update the target value;
37         Update the state-action value network parameters;
38         Softly update the target state-action value network parameters;
39         Update the policy network parameters;
40         Update the entropy weight;
41     end
42 end
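To make the training flow concrete, the following is a compact Python sketch of the loop in Algorithm 1, using placeholder interfaces (`env`, `agent`, `buffer`) that stand in for the paper's implementation; all names, update methods, and hyperparameters are assumptions.

```python
def train(env, agent, buffer, num_episodes=1000, max_steps=400,
          batch_size=256, warmup=1000):
    """Sketch of the training flow in Algorithm 1 (placeholder interfaces)."""
    for episode in range(num_episodes):
        state = env.reset()                       # reset start point and state
        collided = reached = False
        step = 0
        while step < max_steps and not collided and not reached:
            # Integrate ocean current information with the AUV state and act
            # according to the current policy.
            action = agent.act(state)
            next_state, reward, done, info = env.step(action)
            collided = info.get("collision", False)
            reached = info.get("reached", False)
            # Store the experience tuple, including the current reward weight.
            buffer.add(state, action, reward, info.get("omega", 1.0),
                       next_state, float(done))
            state = next_state
            step += 1
        if len(buffer) >= max(batch_size, warmup):
            batch = buffer.sample(batch_size)
            agent.update_current_weight_queue(batch)  # smooth sampled current weights
            agent.update_critics(batch)               # state-action value networks
            agent.update_targets()                    # soft target-network update
            agent.update_policy(batch)                # policy network
            agent.update_entropy_weight(batch)        # entropy temperature
    return agent
```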