AUV Obstacle Avoidance Framework Based on Event-Triggered Reinforcement Learning

Liu, Shoufu; Ma, Chao; Juan, Rongshun

doi:10.3390/electronics13112030

Open AccessArticle

AUV Obstacle Avoidance Framework Based on Event-Triggered Reinforcement Learning

by

Shoufu Liu

,

Chao Ma

^* and

Rongshun Juan

School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China

^*

Author to whom correspondence should be addressed.

Electronics 2024, 13(11), 2030; https://doi.org/10.3390/electronics13112030

Submission received: 25 April 2024 / Revised: 13 May 2024 / Accepted: 20 May 2024 / Published: 23 May 2024

(This article belongs to the Special Issue Nonlinear Intelligent Control: Theory, Models, and Applications)

Download

Browse Figures

Versions Notes

Abstract

:

Autonomous Underwater Vehicles (AUVs), as a member of the unmanned intelligent ocean vehicle group, can replace human beings to complete dangerous tasks in the ocean. It is of great significance to apply reinforcement learning (RL) to AUVs to realize intelligent control. This paper proposes an AUV obstacle avoidance framework based on event-triggered reinforcement learning. Firstly, an environment perception model is designed to judge the relative position relationship between the AUV and all unknown obstacles and known targets. Secondly, considering that the detection range of AUVs is limited, and the proposed method needs to deal with unknown static obstacles and unknown dynamic obstacles at the same time, two different event-triggered mechanisms are designed. Soft actor–critic (SAC) with a non-policy sampling method is used. Then, improved reinforcement learning and the event-triggered mechanism are combined in this paper. Finally, a simulation experiment of the obstacle avoidance task is carried out on the Gazebo simulation platform. Results show that the proposed method can obtain higher rewards and complete tasks successfully. At the same time, the trajectory and the distance between each obstacle confirm that the AUV can reach the target well while maintaining a safe distance from static and dynamic obstacles.

Keywords:

intelligent control; complex system; artificial intelligence; reinforcement learning; autonomous underwater vehicles; obstacle avoidance

1. Introduction

Most of the Earth is covered by oceans, which contain abundant natural resources. At present, when resources are increasingly scarce, exploring and exploiting the marine environment directly or indirectly contributes to the survival and development of human society. Considering complex variations in the marine environment and possible risks to humans in manned underwater vehicles, the deployment of underwater unmanned systems is increasingly favored, which can be used at deeper depths and under extremely harsh conditions to replace humans in dangerous tasks [1]. The autonomous underwater vehicle (AUV) relies on its navigation algorithms and surrounding information to navigate autonomously [2]. At the same time, it does not involve labor fatigue, and the operation cost is very low. Relying on these extremely high operational advantages, they are highly maneuverable [1].

In recent years, with the deepening of ocean exploration, simply achieving path tracking is insufficient to complete increasingly complex underwater tasks. To finish the more demanding requirements of complex tasks, higher autonomy is demanded for the AUV. One of the most common complex tasks is obstacle avoidance, where the AUV must safely avoid all unknown static and dynamic obstacles while autonomously reaching a specified target area. The realization of autonomous collision avoidance is crucial for ensuring AUV safety and completing autonomous control in unknown underwater environments, determining the quality and efficiency of task completion. The AUV autonomous control algorithms commonly used in the past include some traditional model-based control methods, such as proportional integral derivative (PID) [3]. However, with the rise of machine learning in recent years, some intelligent algorithms do not need to know the complex model of the AUV, nor do they need to have complete prior knowledge to achieve good control effects, such as reinforcement learning (RL).

Researchers have applied many control methods to AUVs, such as PID, reverse step control [4], and sliding mode control [5]. These methods are model-based AUV control methods that need to design the relevant parameters. However, considering that saving costs and reducing weight, most AUVs are designed as underactuated systems. Therefore, the dynamic model of the underwater vehicle has strong coupling and nonlinearity. In addition, the hydrodynamic coefficients of underwater vehicles in the real ocean environment are time-varying, which makes model-based control methods a great challenge in terms of the accuracy and autonomy in the execution of tasks.

RL is a research hotspot in the field of machine learning. In recent years, reinforcement learning has been successfully applied in fields of robot planning, control, and navigation. The field aims to enable agents to continuously learn as they interact with their environment. The rationality of action selection is judged by calculating the cumulative expected return value after the action is executed in the current state. End-to-end Learning can be achieved through a combination of Deep Learning (DL). The mapping relationship between input and output is constructed directly through the model. The application of deep reinforcement learning (DRL) technology to AUV motion planning research can make use of the advantages of reinforcement learning. Through self-interactive training, a series of decision sequences considering long-term effects can be generated, which greatly improves the robustness and adaptability of AUVs to complex environments. Reinforcement learning is a machine learning method that learns optimal actions by interacting with the environment, getting reward feedback, and constantly adjusting strategies. The agent observes the state of the environment at each time step, selects an action, receives reward feedback after performing it, and then updates the strategy. Because of its interactive characteristics, its motion planning and control are usually crossed.

In this paper, to enable the AUV to simultaneously avoid unknown static and dynamic obstacles within a limited detection range, a combination of two different event-triggered mechanisms (ET) is integrated when customizing state space and reward functions. The ET framework is then combined with reinforcement learning methods to design an AUV obstacle avoidance algorithm based on event-triggered reinforcement learning. Finally, obstacle avoidance experiments in environments with unknown static obstacles and unknown dynamic obstacles are conducted on the constructed simulation platform.

The main contributions of this paper are as follows:

The environmental state perception model is designed to model the environmental state around the underwater vehicle, which can be used to perceive the environmental state characteristics.
Two different event-triggered mechanisms are used to design the state space and the reward function, respectively. The designed event-triggered mechanism can improve the obstacle avoidance performance of the AUV in the case of the limited detection range.
The event-triggered mechanism is combined with the SAC algorithm, in which a non-policy sampling method is used. An AUV obstacle avoidance framework based on the improved SAC algorithm is designed.
We conduct three sets of experiments on the built simulation platform, and the obstacles in the experiments are unknown to the AUV until they are detected by the detector. The first set of experiments is a basic single obstacle avoidance experiment to verify the feasibility of the proposed method. In order to enhance the challenge, the second set of experiments verify the effectiveness of the proposed method in this environment through multiple obstacle avoidance experiments. For the third set of experiments to further enhance the challenge, multiple dynamic obstacles are used in this paper. Multiple dynamic obstacle avoidance experiments show that the proposed method is also effective in random environments with dynamic obstacles.

2. Related Work

2.1. The Related Work of Model-Based AUV Control

In the past few decades, researchers have proposed many traditional control methods based on control theory to solve the motion control problem of underwater vehicles in various situations. PID [6] is very popular in practical applications due to its simple design, easy parameter adjustment, and low computational complexity. Wanigasekara, C. et al. [7] proposed a Delta-Sigma-based 1-bit PID controller for the obstacle avoidance tasks of AUV, which consumes less communication resources and is robust to various disturbances arising in the underwater environment. Fittery et al. [8] used a simple PD controller and controlled the heading angle of the egg-shaped AUV. Shome and Das [9] implemented PID control for “AUV-150”. In order to further reduce the computation amount and computation time of the RAFNNC, the parameters of the PID control algorithm were optimized in advance by using the improved BFO algorithm in [10]. Wang, X. et al. [11] developed an adaptive dynamical sliding mode control (ADSMC) technology to design a dynamic controller, which can effectively overcome the influence of model uncertainties. In order to overcome the uncertainty of the dynamics model and the saturation of actual control input, Yao, X. et al. [12] designed the dynamic controller by using SMC technology. Model predictive control (MPC) has also received a lot of research in the field of AUV [13]. In addition, Liu, Z. [14] proposed an obstacle restraint-model predictive control (OR-MPC) path planning algorithm with obstacle restraints for autonomous underwater vehicles.

A large number of traditional control methods have answered some problems of AUV motion control in various cases. However, the different traditional methods have different disadvantages. Traditional PID parameters need to adapt to the changes in model parameters. The dynamic model parameters of underwater vehicles are uncertain, which makes it difficult to meet the requirements of many AUV control tasks. Buffeting is difficult to eliminate in SMC. MPC requires high accuracy of the model and high computational complexity. The most important thing is that most traditional control methods are usually predefined controllers. It means that they are designed with fixed parameters and known system models. Therefore, more and more scholars focus on intelligent controllers with good learning ability and adaptive ability.

2.2. The Related Work of Control Based on Reinforcement Learning

Reinforcement learning is similar to the basic logic of human brain learning knowledge. It interacts with the environment through actions to realize continuous exploration. After receiving the corresponding feedback signals from the environment, the agent constantly adjusts its own parameter values and optimizes the next step strategy in order to maximize the discount reward value. At present, more and more scholars have applied reinforcement learning to robot motion planning, and a small number of studies have applied reinforcement learning to the obstacle avoidance task of AUV.

Kim et al. [15] proposed a method to automatically tune the scale of the dominant reward functions in reinforcement learning for quadrupedal robot locomotion. Jang et al. [16] present a novel RL-based autonomous driving system technology to effectively address the overestimation phenomenon, learning time, and sparse reward problems faced in the field of autonomous driving. In order to solve the problems of rapid path planning and effective obstacle avoidance for autonomous underwater vehicles (AUVs) in a 2D underwater environment, the paper [17] proposed a path planning algorithm based on reinforcement learning mechanism and particle swarm optimization (RMPSO). Fang et al. [18] developed a deep reinforcement learning algorithm that can directly control four rudder blades of an underwater vehicle. Carlucho et al. [19] proposed an adaptive motion planning and obstacle avoidance technique based on deep reinforcement learning for an AUV. Their research employs a twin-delayed deep deterministic policy algorithm, which is suitable for Markov processes with continuous actions. In the paper [20], the trained proximal policy optimization (PPO) network can output reasonable actions in order to control the AUV to avoid obstacles. Jiang et al. [21] also used the DDPG algorithm to control three degrees of freedom of the underwater vehicle and realized the uniform linear motion of the underwater vehicle. Other scholars have realized the control of 5-DOF AUV by improving deep RL [22]. Some researchers used the DQN algorithm and PPO algorithm to realize collision avoidance and multi-position tracking of AUV [23]. It can be seen that deep reinforcement learning can be applied in the field of intelligent autonomous control of AUV, and relevant results have been achieved.

When an AUV travels underwater, parameters such as buoyancy, weight, water resistance, and propulsion efficiency may change due to environmental changes. In the case of constantly changing parameters, the traditional controller needs to adjust its parameters frequently to adapt to changes in the environment, which may be difficult to achieve in practice, especially in the case of irregular or unknown parameter changes. RL can self-optimize strategies through continuous interaction in response to changes in the environment. At the same time, many RL algorithms are model-free; they do not require an accurate mathematical model of the environment in advance, but learn directly from the feedback of the environment. In addition, fully trained RL models are able to generalize to previously unseen environmental states and handle a variety of dynamic changes.

3. Proposed Method

For autonomous obstacle avoidance tasks, an AUV obstacle avoidance framework based on event-triggered reinforcement learning is designed. Firstly, an environment perception model is designed to judge the relative position relationship between the AUV and all unknown single or multiple obstacles and target points. Then, considering the limited detection range of the AUV, the complete state space is customized by combining the state of the AUV and the state related to the target region, and an event-triggered mechanism is designed. The purpose of this method is to enable the AUV to reach the target area well while maintaining a safe distance from obstacles.

It should be noted that the basic reinforcement learning algorithm used in this paper is soft actor–critic (SAC) [24]. Emphasizing recent experience (ERE) [25] is a non-policy sampling method is used in the SAC algorithm. At the same time, the proximal policy optimization (PPO) algorithm [26] and the SAC with prioritized experience replay (SAC-PER) [27] are used as the baseline algorithm. The simulation experiment is implemented on the robot operating system (ROS) and Gazebo simulation platform, in which the AUV model is built and the relevant coordinate system definitions, can be referred to in papers [28,29,30]. The simulation platform built is shown in Figure 1. The green line is the x-axis, the red line is the y-axis, and the vertical blue line is the z-axis.

3.1. Environment State Perception Model

To effectively obtain environmental state features, this section adopts an environment state perception model to model the surrounding environment of underwater vehicles. Considering that the main goal is to enable the AUV to reach the target area while avoiding the unknown static and dynamic obstacles, the environmental state focuses on the relative positional relationship between the AUV and detects obstacles or target areas. This includes relative distance and relative yaw angle. Figure 2 illustrates the environmental state perception model when the AUV detects dynamic obstacles. In other cases, the perception model is the same as shown in Figure 2. Additionally, the AUV has a mode of operation called constant depth navigation. It refers to the AUV’s ability to maintain navigation in a specific water depth when performing underwater tasks. Constant depth navigation allows the AUV to remain at a stable depth for tasks such as seafloor mapping, biological sampling, or hydrological data acquisition. This is critical to the accuracy and consistency of data in research and engineering applications. We can consider 2D cases, which also represent some kinds of underwater environments. The unknown obstacle and the known target area are simplified to a circular shape in this paper.

Relative distance and relative yaw angle can be calculated through the rotation matrix [31]. Considering the specific task in this paper, it can be simplified to Formula (1):

T (ψ) = [\begin{matrix} cos ψ & - sin ψ \\ sin ψ & cos ψ \end{matrix}]

(1)

where

ψ

is the yaw angle.

Based on Formula (1), we can obtain the transformation formula of the absolute and relative distance between the AUV and the detected obstacle or between the AUV and the known target region along each coordinate axis through rotation and shift operations.

In this section,

x_{r}

and

y_{r}

represent the relative distances on the x-axis and y-axis, respectively, while

Δ x

and

Δ y

represent the absolute distances on the x-axis and y-axis. When rotating around the z-axis, the values on the z-axis do not change, so only the related rotation matrices for the x and y axes are listed here. Next, we can calculate the relative distance

ρ

between the AUV and obstacles using Formula (2).

ρ = \sqrt{x_{r}^{2} + y_{r}^{2}}

(2)

We calculate the relative yaw angle

α

using Formula (3).

α = arctan 2 (y_{r}, x_{r})

(3)

The

arctan 2

returns the arctangent value of

y_{r}

and

x_{r}

within the range of

(- π, π]

.

3.2. An AUV Obstacle Avoidance Framework Based on Event-Triggered Reinforcement Learning

Considering the application of the basic reinforcement learning model to solve the AUV collision avoidance problem in an unknown underwater environment, its performance largely depends on the definition of the input state and the design of the reward function. Meanwhile, due to the limited detection range of AUVs and the need to deal with unknown static obstacles and unknown dynamic obstacles, two different event-triggered mechanisms are combined in reinforcement learning to build a framework. The relevant state space and reward function are also designed respectively. The framework can generate states related to obstacles and target areas in different scenarios. On this basis, the framework can reward or punish the behavior of the AUV. Finally, a feasible policy is learned through the interactive iteration between neural networks.

3.2.1. Design of State Space and Action Space

When applying deep reinforcement learning to complex tasks, there are generally three considerations. First, variables in the state space need to be relevant to the task. Secondly, considering that the size of each state may be different, the data can be normalized, such as using the function tanh to normalize the data into the range [−1, 1]. Third, there may be coupling relations between different variables, so it is necessary to use orthogonalization to process the data.

Considering that the target location of the task is known, its relevant central point location and radius

ϕ_{t}

are known, so the relevant mechanism of event triggering does not need to be designed here. At the same time, the current heading c of AUV which is obtained directly from the sensor, the expected heading

o_{h}

of AUV under the current state, the distance

Δ y

which is the distance between the current state of AUV and the target position along the y-axis, the distance

Δ x

which is the distance between the current state of AUV and the target position along the x-axis, the distance defined after combining the event-triggering mechanism, and the yaw angle combined with the event-triggered mechanism can be normalized and added into the state space. Therefore, the state space of the obstacle avoidance task can be expressed as:

S = \{\bar{c}, \bar{o_{h}}, \bar{Δ x}, \bar{Δ y}, \bar{ρ_{o}}, \bar{α_{o}}\}

. To fully describe the state of AUV and the environmental state within the detection range, each part of the state space S will be introduced below.

o_{h}

can be expressed as:

o_{h} = arctan (\frac{Δ y}{Δ x}) = arctan (\frac{y_{h} - y}{x_{h} - x})

(4)

where x and y represent the current location along the x-axis and the y-axis.

x_{h}

and

y_{h}

represent the target location along the x-axis and the y-axis.

In the obstacle avoidance task, the AUV in this section is designed to start from the initial state and reach the target point after avoiding the obstacle. In order to represent uniformly the states related to static and dynamic obstacles in the limited detection range of AUV, including relative distance and relative yaw angle, an event-triggering mechanism is designed and combined with the environmental state model. In the task of collision avoidance, considering the limited detection range of AUV sensors,

E_{d}

is used as a sign of event-triggered to determine whether AUV has detected an obstacle.

E_{d} = 0

indicates that AUV has not detected an obstacle, and

E_{d} = 1

indicates that AUV has detected an obstacle. This can be expressed by Formula (5):

E_{d} = \{\begin{matrix} \begin{matrix} 0, & ρ > ϕ_{o} + ρ_{max} \end{matrix} \\ \begin{matrix} 1, & ρ \leq ϕ_{o} + ρ_{max} \end{matrix} \end{matrix}

(5)

where

ρ

is the distance between the current position of AUV and the center of the obstacle,

ϕ_{o}

is the radius of the obstacle, and

ρ_{max}

is the maximum distance that can be detected by the sensor equipped with the AUV, as shown in Figure 3.

ρ_{c}

in the figure is the range where the AUV will collide with the obstacle.

ρ_{e}

indicates the range of unsafe areas. The distance defined after combining the event-triggering mechanism can then be expressed according to Equation (6):

ρ_{o} = E_{d} (ρ - ϕ_{o}) + (1 - E_{d}) ρ_{max}

(6)

Similarly, the yaw angle combined with the event-triggered mechanism can be defined as

α_{o} = E_{d} α + (1 - E_{d}) π

(7)

Formulas (6) and (7) can then be processed using normalization to obtain the normalization of distance and normalization of yaw angle:

\bar{ρ_{o}} = \frac{ρ_{o}}{ρ_{max}}

(8)

\bar{α_{o}} = \frac{α_{o}}{π}

(9)

In this task, the action space is the rudder angle value of the AUV, and the rudder angle value corresponds to the steering force on the rudder. Different rudder angle values correspond to different actions.

3.2.2. Design of Reward Function

In this section, the reward function of the obstacle avoidance task is planned and designed. A complete reward function is designed for the safe obstacle avoidance behavior of the AUV in the unknown static obstacle environment.

First of all, since the position of the target point is known, the distance

ρ_{t}

between the current position and the target position can be calculated in real time. It can also be normalized to obtian

\bar{ρ_{t}}

. The formula can be expressed as

\bar{ρ_{t}} = \frac{ρ_{t}}{\sqrt{x_{max}^{2} + y_{max}^{2}} - ϕ_{t}}

(10)

where

x_{max}^{2}

and

y_{max}^{2}

are the maximum value of the task area on each coordinate axis, and

ϕ_{t}

is the radius of the target area. This section uses the difference between the current heading angle and the expected heading angle of the AUV and the distance between the current position and the target position to define the reward function. Since the target of reinforcement learning is to maximize the long-term reward obtained, the reward function of reaching the target area can be written as

r_{t} = - k_{t} (\bar{ρ_{t}} + |\bar{c} - \bar{o_{h}}|)

(11)

The

k_{t}

in Equation (11) is the coefficient in the reward function that reaches the target area. The value chosen for

k_{t}

in this paper is 0.5. To achieve safe obstacle avoidance, it is necessary to ensure that the AUV can safely avoid the detected obstacles, that is, the distance between the AUV and the obstacle is always bigger than the safe distance. The event trigger flag

E_{r}

is designed to judge whether the current obstacle is in the safe area. Different weights are set according to the different areas where the AUV is located.

E_{r}

can be defined as

E_{r} = \{\begin{matrix} \begin{matrix} 0, & ρ > ϕ_{o} + ρ_{e} \end{matrix} \\ \begin{matrix} 1, & ρ \leq ϕ_{o} + ρ_{e} \end{matrix} \end{matrix}

(12)

The reward function designed by the relative distance and the relative yaw angle from the obstacle in the limited detection range of the AUV is as follows:

r_{o} = - k_{o} (1 - \bar{ρ_{o}} + 1 - \bar{α_{o}})

(13)

k_{o}

is the coefficient related to safety obstacle avoidance, which is composed of conventional coefficient

k_{o 1}

and event-triggered factor

k_{o 2}

. The values selected for

k_{o 1}

and

k_{o 2}

are 0.05 and 0.35, respectively. The formula is as follows:

k_{o} = k_{o 1} + E_{r} k_{o 2}

(14)

When the AUV enters an unsafe area, the coefficient is increased to strengthen the constraint. This makes the AUV more focused on learning strategies to stay away from obstacles. The first item

1 - \bar{ρ_{o}}

in Equation (13) indicates that the reward is smaller when the AUV is closer to the obstacle, and the second item

1 - \bar{α_{o}}

indicates that the reward is smaller when the AUV is moving in the direction of the obstacle. Additional parts of the reward function are designed for special moments of entering an unsafe area, colliding with an obstacle, and successfully reaching the target position:

r_{e} = \{\begin{matrix} - 5, 1 < ρ < 3 \\ - 20, ρ < 1 \\ + 15, s u c c e s s \end{matrix}

(15)

Based on the above rewards, the reward function of the AUV obstacle avoidance task can be set as follows:

r_{t o t a l} = r_{t} + r_{o} + r_{e}

(16)

4. Obstacle Avoidance Experiments Based on Reinforcement Learning

In this section, experimental results of different DRL methods on the single obstacle avoidance task, the multiple obstacle avoidance task, and the multiple-dynamic obstacle avoidance task are presented and compared. It should be noted that static obstacles may include anchored buoys, subsea equipment, and subsea terrain. Dynamic obstacles may include marine organisms, submersibles, marine debris, and mesoscale vortices. Dangerous ocean phenomena like mesoscale vortices actually move relatively slowly through the ocean. They usually move at speeds ranging from a few kilometers to a dozen kilometers per day. However, in order to set the experimental environment more harshly, we designed the dynamic obstacle movement trajectory as a fast-moving straight line.

The technical indices of the method in this paper include trajectories, cumulative rewards, variation in distance between individual algorithms and each obstacle, the mean of the minimum, the number of successes in three experiments, and the total number of steps. Trajectories intuitively demonstrate the obstacle avoidance effects of various methods. Cumulative rewards are commonly used to evaluate reinforcement learning. Confidence interval and mean performance are used to fully show the algorithm performance. The variation in distance between individual algorithms and each obstacle shows the change of distance between each algorithm and each obstacle. It can also intuitively show the obstacle avoidance effect of the algorithm. The mean of the minimum and number of successes in the three experiments shows the average obstacle avoidance effect of each algorithm in the experiments. The total number of steps shows the convergence speed of each algorithm.

4.1. Single Obstacle Avoidance Experiment

This study combines the ET framework with the PPO, SAC, SAC-PER, and SAC-ERE algorithms to create the ET-PPO, ET-SAC, ET-SAC-PER, and ET-SAC-ERE methods. In this section, four methods are applied to the single obstacle avoidance task. The purpose of this set of experiments is to verify the feasibility of the proposed algorithm in the environment with an obstacle.

As shown in Figure 4, the AUV obstacle avoidance trajectories of ET-PPO, ET-SAC, ET-SAC-PER, and ET-SAC-ERE methods are compared and shown. In this task, there are 500 episodes for each method, with a maximum of 300 steps per episode.

During the single obstacle avoidance task training, the obstacle avoidance performance using the ET-PPO (a), ET-SAC (b), ET-SAC-PER (c), and ET-SAC-ERE (d) methods is illustrated in Figure 4. The black dot represents the target location, with a center coordinate of (95, 0) and a radius of 1 m. It means that reaching within a 1 m radius around the center is considered a successful target approach. The red dot represents the obstacle model, with a center coordinate of (45, 0) and a radius of 1.5 m. The green circle represents the detection range of the agent, with a radius of 6 m. The blue circle indicates the unsafe area, with a radius of 4 m, while the red circle represents the collision area, with a radius of 2 m.

In the initial episode, all reinforcement learning methods trigger constraints that stop the training process. Figure 4a illustrates that in the single obstacle avoidance task, the ET-PPO agent fails to acquire a strategy to avoid the obstacle and reach the target within the initial 400 episodes. It frequently enters the unsafe area and collides with the obstacle. By the 500th episode, the ET-PPO agent has successfully navigated through the safe area to avoid the obstacle and ultimately reaches the target. Figure 4b illustrates that in the single obstacle avoidance task, the ET-SAC agent fails to learn a strategy to avoid the obstacle and reaches the target during the initial 200 episodes. It frequently enters the unsafe area and collids with the obstacle. Around the 300th episode, the agent starts experimenting with different routes to avoid the obstacle. However, it often exceeds the operational area, which triggers constraints that stop the training. In the 400th episode, the ET-SAC agent can consistently navigate through the safe area to avoid an obstacle and reach the target. Figure 4c illustrates that in the single obstacle avoidance task, the ET-SAC-PER agent has challenges in learning a strategy to avoid the obstacle and reach the target during the initial 400 episodes. It frequently enters the unsafe area or manages to avoid the obstacle but fails to reach the target. By the 500th episode, the ET-SAC-PER agent is able to effectively navigate through the safe area, avoid the obstacle, and reach the target. Figure 4d illustrates that in the single obstacle avoidance task, the ET-SAC-ERE agent starts to have a stable strategy around the 200th episode. The agent effectively navigates through the safe area, avoids the obstacle, and ultimately reachs the target.

According to the reward function defined in Equations (4)–(16), the learning curve for the AUV obstacle avoidance task is measured by the cumulative reward obtained in each episode. Considering that each reinforcement learning method does not have a fixed number of steps in every episode, the average reward per step is calculated for each episode. The average values from three experiments are computed, with the shaded area representing the 0.95 confidence interval and the bold line representing the mean performance, as shown in Figure 5. The average cumulative reward for ET-PPO remains consistently low, reflecting the difficulty of ET-PPO in successfully learning a strategy to complete the task during training. The ET-SAC curve exhibits significant fluctuation, with the final convergence trend slightly lower than that of ET-SAC-ERE. The ET-SAC-PER curve remains at a relatively low level but shows a significant increase near the 500th episode. ET-SAC-ERE ultimately learns a strategy that achieves a high average cumulative reward, demonstrating its ability to quickly find a stable and well-performing strategy with lower fluctuation in the average cumulative reward.

This section analyzes the distance between ET-PPO, ET-SAC, ET-SAC-PER, and ET-SAC-ERE and the obstacle during the training process for the single obstacle avoidance task, as illustrated in Figure 6. Unlike the path following task, the same reinforcement learning method may yield different paths in each episode due to variations in learning. Thus, Figure 6 only displays the distance data corresponding to Figure 4. It is evident that the closest distance between the obstacle and the ET-PPO, ET-SAC, ET-SAC-PER, and ET-SAC-ERE methods is 4.0832 m, indicating that the agents can effectively navigate through the safe area to avoid the obstacle. Since experiments in this paper have only one target point, the number of successful experiments is designed as a metric. Table 1 presents the mean of the minimum in three experiments for ET-PPO, ET-SAC, ET-SAC-PER, and ET-SAC-ERE. Success is defined as avoiding the obstacle and reaching the target point. The table also records the number of successful experiments. Overall, ET-SAC-ERE has the highest mean of the minimum, indicating that it is the most successful in avoiding obstacles and reaching the target point. In contrast, the ET-PPO, ET-SAC, and ET-SAC-PER methods have lower mean of the minimum because of entering unsafe areas or colliding with obstacles. Figure 7 also shows the total number of steps needed for each deep reinforcement learning method to converge to a stable policy. It can be seen that ET-SAC and ET-SAC-ERE converge to a fixed policy more rapidly than ET-PPO and ET-SAC-PER. On the other hand, ET-SAC-ERE achieves policy convergence 10,000 steps faster than ET-SAC. These results indicate that in the single obstacle avoidance task, ET-SAC-ERE can obtain a stable strategy with fewer steps, enabling the agent to navigate through the safe area, avoid the obstacle, and ultimately reach the target location in the shortest time.

4.2. Multiple Obstacle Avoidance Experiment

In this section, ET-PPO, ET-SAC, ET-SAC-PER, and ET-SAC-ERE methods are trained for multiple obstacle avoidance tasks. The purpose of this set of experiments is to verify the effectiveness of the proposed algorithm in a more challenging environment with multiple obstacles.

During the training process of multiple obstacle avoidance tasks, the performance of the AUV using ET-PPO (a), ET-SAC (b), ET-SAC-PER (c), and ET-SAC-ERE (d) is depicted in Figure 8. In the figure, the meaning of each circle is the same as before. In episode 0, all reinforcement learning methods trigger constraints and stop training. Figure 8a illustrates that in multiple obstacle avoidance tasks, the ET-PPO agent collides with an obstacle by the 200th episode. Then it learns strategies to navigate around all obstacles and reach the target point. Figure 8b demonstrates that in multiple obstacle avoidance tasks, the ET-SAC agent attempts to learn to avoid all obstacles starting from the 100th episode but fails to reach the target point. Figure 8c indicates that in multiple obstacle avoidance tasks, the ET-SAC-PER agent tries to start trying to cross the safe area between the obstacles to reach the target location starting from the 100th episode. However, it only managed to avoid obstacles and approach the target point in episode 400. Figure 8d shows that in the multiple obstacle avoidance task, the ET-SAC-ERE agent is capable of learning to cross the safe areas between obstacles to reach the target position by the 100th episode. By the 200th episode, it has learned a fixed strategy to effectively pass through safe areas, ultimately reaching the target position.

Using the defined reward function, the learning curve of AUV obstacle avoidance tasks is measured by the cumulative reward obtained per episode. The average reward obtained per step within an episode is also calculated. The averages of three experimental data are computed, with the shaded area representing the 0.95 confidence interval and the bold line representing the average performance, as shown in Figure 9. The average cumulative reward of ET-PPO consistently remains within a low range, with significant fluctuations in the shaded area. This is consistent with the fact that although ET-PPO performs poorly in many experiments, there is a relatively small chance that it can learn the strategy of avoiding all obstacles to reach the target point. The curves of ET-SAC, ET-SAC-PER, and ET-SAC-ERE can converge to a certain range in a certain episode. It is also consistent with the approximate performance of each method in its experimental processes. Among them, ET-SAC-ERE achieves the highest average reward. Its curve can quickly and steadily maintain a high reward range. It is evident that ET-SAC-ERE is more likely to learn effective strategies to navigate through safe areas to avoid obstacles and ultimately reach the target position in experiments.

This section analyzes the training process of multiple obstacle avoidance tasks using ET-PPO, ET-SAC, ET-SAC-PER, and ET-SAC-ERE. We depict the distance variations between the agents and obstacles as shown in Figure 10, Figure 11 and Figure 12. Similar to the single obstacle avoidance task, only the distance variation data corresponding to Figure 8 are presented. It is observed that in ET-PPO, ET-SAC, ET-SAC-PER, and ET-SAC-ERE, the closest distances to obstacle 1, obstacle 2, and obstacle 3 are 4.8705 m, 4.3699 m, and 5.2162 m, respectively. It means they are all in a safe area. The agents can effectively navigate through the safe area to avoid obstacles. However, it is noted from Figure 8 that some agents fail to reach the target. Table 2 provides the distances between ET-PPO, ET-SAC, ET-SAC-PER, and ET-SAC-ERE and obstacles in three experiments. The table also records the number of successful experiments. It can be seen that ET-SAC-ERE can reach the target point more effectively while avoiding obstacles compared to the other methods. ET-PPO successfully completes the mission once, while others enter unsafe areas or collide with obstacles. ET-SAC and ET-SAC-PER fail in all three attempts, managing to avoid the third obstacle but failing to reach the target successfully. Therefore, the mean of the minimum value of obstacle 3 is also relatively larger. The total number of steps for each deep reinforcement learning convergence to a fixed policy is recorded in Figure 13. ET-PPO, ET-SAC, and ET-SAC-PER stop training early due to the inability to converge to a successful policy, so they have fewer steps compared to ET-SAC-ERE.

4.3. Multiple Dynamic Obstacle Avoidance Experiments

In this section, the training of the ET-PPO, ET-SAC, ET-SAC-PER, and ET-SAC-ERE methods for multiple dynamic obstacle avoidance tasks is conducted. In order to further enhance the challenge of AUV operation in a complex environment, we used multiple dynamic obstacles to verify the effectiveness of the algorithm in this random environment.

Figure 14 illustrates the obstacle avoidance performance using the ET-PPO (a), ET-SAC (b), ET-SAC-PER (c), and ET-SAC-ERE (d) methods. In the figure, the meaning of each circle is the same as before. At episode 0, all reinforcement learning methods trigger the constraint and stop training. Figure 14a depicts that in the multiple dynamic obstacle avoidance task, the ET-PPO agents consistently attempt to learn strategies to navigate around obstacles but do not succeed. Figure 14b shows that ET-SAC agents begin attempting to avoid all obstacles around episode 100, briefly entering the safe area around episode 400 but fail to learn a stable strategy. It is unable to successfully complete the task. Figure 14c shows that ET-SAC-PER agents begin to attempt to avoid all obstacles around episode 100 but fail to reach the target while avoiding obstacles. Figure 14d indicates that ET-SAC-ERE agents pass through unsafe areas to avoid obstacles and reach the target around episode 100. By episode 200, the agent has learned a fixed strategy that can pass within 4 m of the obstacle and finally reach the target position. The specific distance variations between each agent and obstacle will be further detailed in the following content.

Using the defined reward function, the average reward per step for each episode is calculated. The shaded area represents the 0.95 confidence interval, and the bold line indicates the average performance, as shown in Figure 15. The average cumulative reward of ET-PPO fluctuates widely, and the overall performance is unstable and effectively poor. The curves for ET-SAC and ET-SAC-PER are relatively similar, but both fail to achieve high rewards. They can’t finish the designated tasks. At episode 150, ET-SAC-ERE manages to reach higher average rewards. Then the curve can be more smoothly maintained in a high reward range. The average value of ET-SAC-ERE is the highest among all agents, which is consistent with the fact that ET-SAC-ERE is more likely to learn effective strategies in experiments to eventually reach the target position while avoiding obstacles.

y = - 0.5 x + 44

(17)

This section records the distance between ET-PPO, ET-SAC, ET-SAC-PER, and ET-SAC-ERE and obstacles in the training process of multiple dynamic obstacle avoidance tasks, as shown in Figure 16, Figure 17 and Figure 18. It should be noted here that since obstacle 1 and obstacle 3 move dynamically, there are fluctuations in the curve. In the meantime, only the distance corresponding to Figure 14 is shown. The nearest distance between all agents and obstacle 1, obstacle 2 and obstacle 3 are 5.1249 m, 3.7988 m, and 5.9375 m, respectively. It can be seen that obstacle 1 and obstacle 3 are all in the safe area. While ET-SAC-ERE will slightly enter the unsafe area, but then quickly move away from the unsafe area to stay away from the obstacles. Although other agents are always in the safe area, it can be seen from the combination of Figure 14 that agents cannot successfully reach the target point. Table 3 shows the mean of the minimum between ET-PPO, ET-SAC, ET-SAC-PER, and ET-SAC-ERE and obstacles in the three experiments. The table also records the number of successful experiments. It can be seen that the mean minimum values of obstacle 1 and obstacle 3 are all in the safe area. However, the three experiments of ET-PPO, ET-SAC, and ET-SAC-PER fails to succeed. They cannot avoid obstacle 2 and reach the target point, so the corresponding average minimum value is also relatively larger. The average minimum value between ET-SAC-ERE and obstacle 2 is slightly less than 4, but the target point can be reached successfully. Figure 19 also records the total number of steps executed within 500 episodes of each kind of deep reinforcement learning. ET-PPO, ET-SAC, and ET-SAC-PER will touch the restrictions to stop training early because they basically cannot converge on a strategy for successfully completing the task. Therefore, the number of steps used is relatively less than ET-SAC-ERE. Experiments in this section prove that the proposed method can complete the obstacle avoidance of AUV in a challenging environment with unknown moving obstacles.

5. Conclusions

The main work of this paper includes the following parts. For the autonomous obstacle avoidance task, to enable the AUV to safely avoid all unknown obstacles and arrive at the designated target area from the starting point, this paper designs an environment perception model to perceive the relative position relationship. Then, considering the limited detection range of AUVs, an AUV obstacle avoidance framework based on event-triggered reinforcement learning is designed by using an event-triggered mechanism. Finally, obstacle avoidance experiments in the unknown obstacle environment are carried out on the simulation platform, which proves the effectiveness of our proposed method and the feasibility of reinforcement learning to realize robust artificial intelligence.

In this paper, the autonomous collision avoidance of AUVs is realized, and the performance of the proposed learning method is verified by simulation. However, for the actual AUV, there are still some limitations that are not considered, such as time delay, actuator failure, ocean currents, energy consumption, etc. In the future, these constraints will be taken into account. The constrained Markov decision process is constructed by using loss functions to improve security and robustness. In addition, we will apply the method in this paper to further test the effect in more complex scenes, such as completing obstacle avoidance tasks in an environment with more turbulent water flow or considering realizing the obstacle avoidance tasks of AUVs in 3D space.

Author Contributions

Conceptualization, S.L. and R.J.; methodology, S.L.; software, S.L. and R.J.; validation, S.L., C.M. and R.J.; formal analysis, S.L.; investigation, R.J.; resources, C.M.; data curation, R.J.; writing—original draft preparation, S.L.; writing—review and editing, S.L. and C.M.; visualization, S.L.; supervision, C.M. and R.J.; project administration, C.M.; funding acquisition, C.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (No. 62273253), and Tianjin Natural Science Foundation Key Project (No. 22JCZDJC00330).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Yu, R.; Shi, Z.; Huang, C.; Li, T.; Ma, Q. Deep reinforcement learning based optimal trajectory tracking control of autonomous underwater vehicle. In Proceedings of the 2017 36th Chinese Control Conference (CCC), Dalian, China, 26–28 July 2017; pp. 4958–4965. [Google Scholar]
Sahoo, A.; Dwivedy, S.K.; Robi, P. Advancements in the field of autonomous underwater vehicle. Ocean Eng. 2019, 181, 145–160. [Google Scholar] [CrossRef]
Yin, Q.; Shen, Y.; Li, H.; Wan, J.; Liu, F.; Kong, X.; He, B.; Yan, T. Fuzzy PID motion control based on extended state observer for AUV. In Proceedings of the 2019 IEEE Underwater Technology (UT), Kaohsiung, Taiwan, 16–19 April 2019; pp. 1–4. [Google Scholar]
Wang, J.; Wang, C.; Wei, Y.; Zhang, C. Three-dimensional path following of an underactuated AUV based on neuro-adaptive command filtered backstepping control. IEEE Access 2018, 6, 74355–74365. [Google Scholar] [CrossRef]
Zhang, G.-C.; Huang, H.; Wan, L.; Li, Y.-M.; Cao, J.; Su, Y.-M. A novel adaptive second order sliding mode path following control for a portable AUV. Ocean Eng. 2018, 151, 82–92. [Google Scholar] [CrossRef]
Khodayari, M.H.; Balochian, S. Modeling and control of autonomous underwater vehicle (AUV) in heading and depth attitude via self-adaptive fuzzy PID controller. J. Mar. Sci. Technol. 2015, 20, 559–578. [Google Scholar] [CrossRef]
Wanigasekara, C.; Torres, F.S.; Swain, A. Robust Control of Autonomous Underwater Vehicles Using Delta-Sigma-Based 1-bit Controllers. IEEE Access 2023, 11, 122821–122832. [Google Scholar] [CrossRef]
Fittery, A.; Mazumdar, A.; Lozano, M.; Asada, H.H. Omni-Egg: A smooth, spheroidal, appendage free underwater robot capable of 5 dof motions. In Proceedings of the 2012 Oceans, Hampton Roads, VA, USA, 14–19 October 2012; pp. 1–5. [Google Scholar]
Shome, S.N.; Nandy, S.; Pal, D.; Das, S.K.; Vadali, S.R.K.; Basu, J.; Ghosh, S. Development of modular shallow water AUV: Issues & trial results. J. Inst. Eng. Ser. C 2012, 93, 217–228. [Google Scholar]
Dong, Z.; Bao, T.; Zheng, M.; Yang, X.; Song, L.; Mao, Y. Heading control of unmanned marine vehicles based on an improved robust adaptive fuzzy neural network control algorithm. IEEE Access 2019, 7, 9704–9713. [Google Scholar] [CrossRef]
Wang, X.; Yao, X.; Zhang, L. Path planning under constraints and path following control of autonomous underwater vehicle with dynamical uncertainties and wave disturbances. J. Intell. Robot. Syst. 2020, 99, 891–908. [Google Scholar] [CrossRef]
Yao, X.; Wang, X.; Wang, F.; Zhang, L. Path following based on waypoints and real-time obstacle avoidance control of an autonomous underwater vehicle. Sensors 2020, 20, 795. [Google Scholar] [CrossRef]
Xu, H.; Pan, J. AUV motion planning in uncertain flow fields using bayes adaptive MDPs. IEEE Robot. Autom. Lett. 2022, 7, 5575–5582. [Google Scholar] [CrossRef]
Liu, Z.; Zhu, D.; Liu, C.; Yang, S.X. A Novel Path Planning Algorithm of AUV with Model Predictive Control. Int. J. Robot. Autom. 2022, 37. [Google Scholar] [CrossRef]
Kim, M.S.; Kim, J.S.; Park, J.H. Automated Hyperparameter Tuning in Reinforcement Learning for Quadrupedal Robot Locomotion. Electronics 2023, 13, 116. [Google Scholar] [CrossRef]
Jang, S.H.; Ahn, W.J.; Kim, Y.J.; Hong, H.G.; Pae, D.-S.; Lim, M.T. Stable and Efficient Reinforcement Learning Method for Avoidance Driving of Unmanned Vehicles. Electronics 2023, 12, 3773. [Google Scholar] [CrossRef]
Huang, H.; Jin, C. A novel particle swarm optimization algorithm based on reinforcement learning mechanism for AUV path planning. Complexity 2021, 2021, 8993173. [Google Scholar] [CrossRef]
Fang, Y.; Huang, Z.; Pu, J.; Zhang, J. AUV position tracking and trajectory control based on fast-deployed deep reinforcement learning method. Ocean Eng. 2022, 245, 110452. [Google Scholar] [CrossRef]
Hadi, B.; Khosravi, A.; Sarhadi, P. Deep reinforcement learning for adaptive path planning and control of an autonomous underwater vehicle. Appl. Ocean Res. 2022, 129, 103326. [Google Scholar] [CrossRef]
Zhu, G.; Shen, Z.; Liu, L.; Zhao, S.; Ji, F.; Ju, Z.; Sun, J. AUV dynamic obstacle avoidance method based on improved PPO algorithm. IEEE Access 2022, 10, 121340–121351. [Google Scholar] [CrossRef]
Jiang, J.; Zhang, R.; Fang, Y.; Wang, X. Research on motion attitude control of underactuated autonomous underwater vehicle based on deep reinforcement learning. In Proceedings of the 2020 3rd International Conference on Computer Information Science and Artificial Intelligence (CISAI) 2020, Hulun Buir, China, 25–27 September 2020; p. 012206. [Google Scholar]
Ariza Ramirez, W.; Leong, Z.Q.; Nguyen, H.D.; Jayasinghe, S.G. Exploration of the applicability of probabilistic inference for learning control in underactuated autonomous underwater vehicles. Auton. Robot. 2020, 44, 1121–1134. [Google Scholar] [CrossRef]
Havenstrøm, S.T.; Rasheed, A.; San, O. Deep reinforcement learning controller for 3D path following and collision avoidance by autonomous underwater vehicles. Front. Robot. AI 2021, 7, 211. [Google Scholar] [CrossRef]
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 1861. [Google Scholar]
Wang, C.; Ross, K. Boosting soft actor-critic: Emphasizing recent experience without forgetting the past. arXiv 2019, arXiv:1906.04009. [Google Scholar]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
Fang, Z.; Jiang, D.; Huang, J.; Cheng, C.; Sha, Q.; He, B.; Li, G. Autonomous underwater vehicle formation control and obstacle avoidance using multi-agent generative adversarial imitation learning. Ocean Eng. 2022, 262, 112182. [Google Scholar] [CrossRef]
Tristan, P.; Smogeli, O.; Fossen, T.; Sorensen, A.J. An overview of the marine systems simulator (MSS): A simulink toolbox for marine control systems. Model. Identif. Control 2006, 27, 259–275. [Google Scholar]
Vibhute, S. Adaptive dynamic programming based motion control of autonomous underwater vehicles. In Proceedings of the 2018 5th International Conference on Control, Decision and Information Technologies (CoDIT), Thessaloniki, Greece, 10–13 April 2018; pp. 966–971. [Google Scholar]
Zhao, Y.; Han, F.; Han, D.; Peng, X.; Zhao, W. Decision-making for the autonomous navigation of USVs based on deep reinforcement learning under IALA maritime buoyage system. Ocean Eng. 2022, 266, 112557. [Google Scholar] [CrossRef]

Figure 1. The simulation platform.

Figure 2. The relative position relationship between the underwater vehicle and a dynamic obstacle.

Figure 3. The schematic diagram of the relative position of an underwater vehicle with an obstacle triggered by an event.

Figure 4. The AUV obstacle avoidance trajectories of each algorithm.

Figure 5. The cumulative rewards for each algorithm.

Figure 6. The variation of distance between individual algorithms and an obstacle.

Figure 7. The total number of steps that each algorithm converges to a fixed policy.

Figure 8. The AUV obstacle avoidance trajectories of each algorithm.

Figure 9. The cumulative rewards for each algorithm.

Figure 10. The variation of distance between individual algorithms and obstacle 1.

Figure 11. The variation of distance between individual algorithms and obstacle 2.

Figure 12. The variation of distance between individual algorithms and obstacle 3.

Figure 13. The total number of steps that each algorithm converges to a fixed policy.

Figure 14. The AUV obstacle avoidance trajectories of each algorithm.

Figure 15. The cumulative rewards for each algorithm.

Figure 16. The variation of distance between individual algorithms and obstacle 1.

Figure 17. The variation of distance between individual algorithms and obstacle 2.

Figure 18. The variation of distance between individual algorithms and obstacle 3.

Figure 19. The total number of steps that each algorithm converges to a fixed policy.

Table 1. The mean of the minimum and number of successes in three experiments.

Method	ET-PPO	ET-SAC	ET-SAC-PER	ET-SAC-ERE
The mean of the minimum	2.1888	4.2695	4.3755	5.2183
The number of successful experiments	1	2	2	3

Table 2. The mean of the minimum and number of successes in three experiments.

Method	ET-PPO	ET-SAC	ET-SAC-PER	ET-SAC-ERE
The mean of the minimum with obstacle 1	4.1864	5.1062	5.4066	5.0187
The mean of the minimum with obstacle 2	10.0536	6.6736	6.3497	4.1996
The mean of the minimum with obstacle 3	6.6959	25.0357	13.0302	5.9966
The number of successful experiments	1	0	0	3

Table 3. The mean of the minimum and number of successes in three experiments.

Method	ET-PPO	ET-SAC	ET-SAC-PER	ET-SAC-ERE
The mean of the minimum with obstacle 1	6.8074	4.7588	5.4661	5.0234
The mean of the minimum with obstacle 2	9.1914	7.6167	6.4675	3.9317
The mean of the minimum with obstacle 3	14.9097	11.2197	19.9554	5.2736
The number of successful experiments	0	0	0	3

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, S.; Ma, C.; Juan, R. AUV Obstacle Avoidance Framework Based on Event-Triggered Reinforcement Learning. Electronics 2024, 13, 2030. https://doi.org/10.3390/electronics13112030

AMA Style

Liu S, Ma C, Juan R. AUV Obstacle Avoidance Framework Based on Event-Triggered Reinforcement Learning. Electronics. 2024; 13(11):2030. https://doi.org/10.3390/electronics13112030

Chicago/Turabian Style

Liu, Shoufu, Chao Ma, and Rongshun Juan. 2024. "AUV Obstacle Avoidance Framework Based on Event-Triggered Reinforcement Learning" Electronics 13, no. 11: 2030. https://doi.org/10.3390/electronics13112030

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

AUV Obstacle Avoidance Framework Based on Event-Triggered Reinforcement Learning

Abstract

1. Introduction

2. Related Work

2.1. The Related Work of Model-Based AUV Control

2.2. The Related Work of Control Based on Reinforcement Learning

3. Proposed Method

3.1. Environment State Perception Model

3.2. An AUV Obstacle Avoidance Framework Based on Event-Triggered Reinforcement Learning

3.2.1. Design of State Space and Action Space

3.2.2. Design of Reward Function

4. Obstacle Avoidance Experiments Based on Reinforcement Learning

4.1. Single Obstacle Avoidance Experiment

4.2. Multiple Obstacle Avoidance Experiment

4.3. Multiple Dynamic Obstacle Avoidance Experiments

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI