In this section, the construction of the SSIG objective function is presented, followed by a description of how SSIG generates predictive states of a dynamic system. Afterwards, the action–value function of a multi-agent version of SAC is detailed; it is supplied with real-time states of the other agents generated by the SSIG, enabling each agent to learn its policy independently while accounting for the policies of the other agents. Lastly, the specific definition of the reward function is given, and the training process of S4AC for multi-AUV hunting tasks is demonstrated.
3.1. Super Sampling Info-GAN
In the proposed framework, an info-GAN [44] pair is trained to generate state pairs for predicting the motion of the dynamical system. The generative network is a deep neural network that takes as input both unstructured random noise and a structured pair of consecutive states from a low-dimensional, parametrized dynamical system termed the inference transition model. In this section, the operation of the SSIG in generating the predictive state is explained.
In the classical GAN [41] framework, assume that $s \sim P_{\text{data}}(s)$ is a state sample extracted from the dataset. Deep generative models aim to train stochastic neural networks whose generated data distribution approximates $P_{\text{data}}(s)$. The GAN framework consists of a generator, $G(z)$, that maps a noise input $z \sim P_z(z)$ to a state sample, and a discriminator, $D(s)$, that maps the state sample to the probability that it was sampled from the real data instead of the generator. GAN training is optimized through a game between the generator and the discriminator:

$$\min_G \max_D V(D, G) = \mathbb{E}_{s \sim P_{\text{data}}(s)}[\log D(s)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] \tag{7}$$
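As a concrete illustration, the following minimal PyTorch sketch evaluates the value function of Equation (7) for one batch; the module sizes and the names `G`, `D`, `state_dim`, and `noise_dim` are illustrative assumptions rather than details of the original implementation.

```python
# Minimal sketch of the minimax value in Equation (7); sizes are assumptions.
import torch
import torch.nn as nn

state_dim, noise_dim = 4, 8
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))
D = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

def gan_value(real_states: torch.Tensor) -> torch.Tensor:
    """V(D, G) = E[log D(s)] + E[log(1 - D(G(z)))]."""
    z = torch.randn(real_states.size(0), noise_dim)
    fake_states = G(z)
    return torch.log(D(real_states)).mean() + torch.log(1.0 - D(fake_states)).mean()

# In training, D ascends gan_value while G descends it, using two optimizers.
```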
The noise vector $z$ in a GAN can be regarded as containing some representation of the state $s$. However, in standard GAN training there is no incentive to make this representation structured, which makes it difficult to interpret. The info-GAN [44] method mitigates this issue. Let $H(\cdot)$ denote the entropy of a random variable. The mutual information between two random variables, $I(X; Y) = H(X) - H(X \mid Y)$, measures the influence of one variable on the uncertainty of the other. The idea in info-GAN is to add to the generator input an additional "state" component $c$, and to add to the GAN objective a loss that induces maximal mutual information between the generated state and this abstract state. The info-GAN objective [44] is given by

$$\min_G \max_D V_I(D, G) = V(D, G) - \lambda\, I(c;\, G(z, c)) \tag{8}$$
where $\lambda$ is a weight parameter and $V(D, G)$ is the GAN loss function in Equation (7). This objective induces the abstract state to capture the most salient properties of the state samples. It is difficult to solve the optimization objective in Equation (8) without access to the posterior distribution $P(c \mid s)$, and a variational lower bound was proposed in [44]. An auxiliary distribution $Q(c \mid s)$ is defined to approximate the posterior $P(c \mid s)$. Then,

$$I(c;\, G(z, c)) \geq \mathbb{E}_{c \sim P(c),\, s \sim G(z, c)}[\log Q(c \mid s)] + H(c) \tag{9}$$

According to Equation (9), the info-GAN objective (8) can be optimized using stochastic gradient descent.
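A minimal sketch of how the bound in Equation (9) can be estimated in practice, assuming a categorical code $c$ with a uniform prior and a classifier head `Q`; all module names and sizes here are assumptions.

```python
# Sketch of the variational bound in Equation (9) for a categorical code c.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_codes, noise_dim, state_dim = 10, 8, 4
G = nn.Sequential(nn.Linear(noise_dim + n_codes, 64), nn.ReLU(), nn.Linear(64, state_dim))
Q = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_codes))

def mi_lower_bound(batch_size: int) -> torch.Tensor:
    c = torch.randint(0, n_codes, (batch_size,))      # c ~ uniform prior P(c)
    z = torch.randn(batch_size, noise_dim)
    x = G(torch.cat([z, F.one_hot(c, n_codes).float()], dim=1))
    log_q = F.log_softmax(Q(x), dim=1)                # log Q(c | x)
    return log_q.gather(1, c.unsqueeze(1)).mean()     # E[log Q(c|x)]; H(c) is constant
```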
In the MARL system, continuous real environment states are stored in the replay buffer, which can be regarded as a dataset $D$.
Let $s$ and $s'$ denote a pair of consecutive states sampled from the dataset $D$, and let $P_D(s, s')$ denote the probability of their appearing in the data $D$. We believe that a generative model that can accurately learn $P_D(s, s')$ has to capture the policy features that represent the movement of agents from one state to another. Following the approach outlined in [45], we modify the classic GAN networks and apply them to our setting. A vanilla GAN consists of a generator, $G(z)$, that maps a noise input $z$ to a state pair, and a discriminator, $D(s, s')$, that maps a state pair to the probability that it was sampled from the real data $D$ instead of the generator. The noise vector $z$ of the GAN can be regarded as a feature vector that contains a representation of a certain transition from $s$ to $s'$. On this basis, a generator with structured input that can be used for inferring the action policy of the agents is proposed.
Let $\mathcal{M} = (U, P(u), T_\phi)$ denote a dynamical system with a transition space $U$, which we name the abstract state set. $T_\phi(u' \mid u)$ is a parametrized, stochastic transition function, where $u, u'$ are a pair of consecutive abstract states, and $P(u)$ denotes the prior probability of an abstract state $u$. $\mathcal{M}$ is termed the implicit transition system. The generator is structured to take in a pair of consecutive abstract states $(u, u')$ in addition to the noise vector $z$. The objective function of the GAN in this case is

$$\min_{G, T_\phi} \max_D V(D, G) = \mathbb{E}_{(s, s') \sim P_D}[\log D(s, s')] + \mathbb{E}_{u \sim P(u),\, u' \sim T_\phi(u' \mid u),\, z \sim P_z(z)}[\log(1 - D(G(z, u, u')))] \tag{10}$$
where $u$ and $u'$ represent the implicit features required for inferring the dynamic system's transition model, which include information about the cooperators' motion policies, while $z$ captures less informative variations. To induce the learning of such representations, we follow the info-GAN method and add to the GAN objective a loss that induces maximal mutual information between the generated pair of states and the abstract states.
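Concretely, the pair $(u, u')$ fed to the generator is drawn from the implicit transition system; the sketch below samples such pairs and, by applying $T_\phi$ sequentially, longer abstract-state rollouts. The Gaussian prior, the fixed-variance Gaussian transition, and the dimensions are assumptions.

```python
# Sketch of sampling from the implicit transition system M = (U, P(u), T_phi).
import torch
import torch.nn as nn

u_dim = 2
T_phi = nn.Sequential(nn.Linear(u_dim, 32), nn.ReLU(), nn.Linear(32, u_dim))  # mean of T_phi(u'|u)

def rollout_abstract_states(steps: int, batch_size: int = 1):
    u = torch.randn(batch_size, u_dim)              # u_0 ~ P(u), Gaussian prior assumed
    trajectory = [u]
    for _ in range(steps):
        u = T_phi(u) + 0.1 * torch.randn_like(u)    # u_{t+1} ~ T_phi(u' | u_t)
        trajectory.append(u)
    return trajectory                               # [u_0, u_1, ..., u_steps]

u, u_next = rollout_abstract_states(steps=1)        # one (u, u') pair for the generator
```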
The Super-Sampling Info-GAN objective is proposed as

$$\min_{G, T_\phi} \max_D V_{\text{SSIG}}(D, G) = V(D, G) - \lambda\, I(u, u';\, G(z, u, u')) \tag{11}$$

where $\lambda$ is a weight parameter, and $V(D, G)$ is given by Equation (10). Intuitively, this objective enables the abstract model to capture the most significant changes that may occur within the dynamic system. Since we cannot access the posterior distribution $P(u, u' \mid s, s')$ when using an expressive generator function, it is difficult to directly optimize Equation (11). Therefore, a variational lower bound of the mutual information in (11) is derived following the info-GAN, and an auxiliary distribution $Q(u, u' \mid s, s')$ is defined to approximate the posterior $P(u, u' \mid s, s')$, similar to the derivation proposed in [44]:

$$I(u, u';\, G(z, u, u')) \geq \mathbb{E}_{u \sim P(u),\, u' \sim T_\phi(u' \mid u),\, (s, s') \sim G(z, u, u')}[\log Q(u, u' \mid s, s')] + H(u, u') \tag{12}$$
In Equation (12), $Q(u, u' \mid s, s')$ can be seen as a classifier that maps pairs of state samples to pairs of abstract states. The mutual information in (11) is not sensitive to the ordering of the random variables $u$ and $u'$, which points to a potential caveat in the optimization objective in Equation (12): we would like the random variable for the next abstract state to have the same meaning as the random variable for the current abstract state. The transition operator $T_\phi$ can then be applied sequentially to unroll the sequence of changes in the abstract state and effectively plan with the abstract state transition model $T_\phi$. The loss function is obtained by substituting Equation (12) into Equation (11):

$$\min_{G, T_\phi, Q} \max_D\; \mathcal{L}(D, G, Q) = V(D, G) - \lambda\, \mathbb{E}_{u \sim P(u),\, u' \sim T_\phi(u' \mid u),\, (s, s') \sim G(z, u, u')}[\log Q(u, u' \mid s, s')] - \lambda H(u, u') \tag{13}$$

where $H(u, u')$ is a constant. The loss in Equation (13) can be optimized effectively using stochastic gradient descent.
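A compact sketch of the loss in Equation (13), assuming a unit-variance Gaussian auxiliary distribution $Q(u, u' \mid s, s')$ whose mean is predicted by a network head; the networks, dimensions, and the value of $\lambda$ are assumptions.

```python
# Sketch of the SSIG loss in Equation (13); networks and lambda are assumptions.
import torch
import torch.nn as nn

state_dim, u_dim, noise_dim, lam = 4, 2, 8, 0.1
G = nn.Sequential(nn.Linear(2 * u_dim + noise_dim, 64), nn.ReLU(),
                  nn.Linear(64, 2 * state_dim))                   # outputs a state pair (s, s')
T_phi = nn.Sequential(nn.Linear(u_dim, 32), nn.ReLU(), nn.Linear(32, u_dim))
D_net = nn.Sequential(nn.Linear(2 * state_dim, 64), nn.ReLU(),
                      nn.Linear(64, 1), nn.Sigmoid())             # discriminator over pairs
Q_net = nn.Sequential(nn.Linear(2 * state_dim, 64), nn.ReLU(),
                      nn.Linear(64, 2 * u_dim))                   # mean of Q(u, u' | s, s')

def ssig_loss(real_pairs: torch.Tensor) -> torch.Tensor:
    n = real_pairs.size(0)
    u = torch.randn(n, u_dim)                      # u ~ P(u), Gaussian prior assumed
    u_next = T_phi(u) + 0.1 * torch.randn_like(u)  # u' ~ T_phi(u' | u)
    z = torch.randn(n, noise_dim)
    fake_pairs = G(torch.cat([u, u_next, z], dim=1))
    v_gan = (torch.log(D_net(real_pairs)).mean()
             + torch.log(1.0 - D_net(fake_pairs)).mean())         # V(D, G), Eq. (10)
    # E[log Q(u, u' | s, s')] for a unit-variance Gaussian Q, up to a constant:
    log_q = -0.5 * ((Q_net(fake_pairs) - torch.cat([u, u_next], dim=1)) ** 2).sum(dim=1).mean()
    return v_gan - lam * log_q                     # the constant H(u, u') is dropped
```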
3.3. Soft Policy Iteration
In this section, a multi-agent version of the soft-Q iteration algorithm proposed in [34] is derived. The derivation follows a logic similar to that of the vanilla Soft Actor–Critic.
Following [47], a tuple $(S, A_1, \ldots, A_n, r_1, \ldots, r_n, p, \gamma)$ is defined for an n-AUV system, where $S$ denotes the state space, $p$ denotes the distribution of the initial state, $\gamma$ is a discount factor, and $A_i$ and $r_i$ are the action space and the reward function of AUV $i$, respectively. The states transition according to $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$, where $a_t$ is the joint action, represented as $a_t = (a_t^1, \ldots, a_t^n)$. AUV $i$ selects action $a_t^i$ according to the policy $\pi_{\theta_i}$ parameterized by $\theta_i$, given a state $s_t$.
It is convenient to interpret the joint policy from the perspective of AUV $i$ such that $\pi = (\pi_i, \pi_{-i})$, where $\pi_i$ is the policy of AUV $i$, $\pi_{-i} = (\pi_1, \ldots, \pi_{i-1}, \pi_{i+1}, \ldots, \pi_n)$, and $\pi_{-i}$ is a compact representation of the joint policy of all complementary AUVs of $i$. Actions are taken simultaneously at each training step. Each AUV is presumed to discover the policy with the optimal soft action value $Q_{\text{soft}}^{i,*}$, as shown in Equation (15):

$$Q_{\text{soft}}^{i}(s_t, a_t^i, a_t^{-i}) \leftarrow r_i(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim P}\big[V_{\text{soft}}^{i}(s_{t+1})\big] \tag{15}$$

where $V_{\text{soft}}^{i}$ is updated by Equation (16):

$$V_{\text{soft}}^{i}(s_t) \leftarrow \alpha \log \int_{A_i} \exp\Big(\tfrac{1}{\alpha}\, Q_{\text{soft}}^{i}(s_t, a^i, a_t^{-i})\Big)\, \mathrm{d}a^i \tag{16}$$
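A minimal sketch of how Equations (15) and (16) can be estimated for AUV $i$, assuming a critic network over the state and the joint action, with the integral in (16) approximated by Monte-Carlo sampling of the agent's own actions; the function names and the temperature $\alpha$ are assumptions.

```python
# Sketch of the soft updates in Equations (15) and (16) for AUV i.
import math
import torch

def soft_value_i(q_net_i, state, others_actions, own_action_samples, alpha=0.2):
    """V_soft^i(s) ~= alpha * log mean_a exp(Q_soft^i(s, a^i, a^-i) / alpha),
    with the integral in Equation (16) replaced by k sampled own-actions.
    `state` and `others_actions` are single rows of shape (1, dim)."""
    k = own_action_samples.size(0)
    s = state.expand(k, -1)
    a_others = others_actions.expand(k, -1)
    q = q_net_i(torch.cat([s, own_action_samples, a_others], dim=1)).squeeze(-1)
    return alpha * (torch.logsumexp(q / alpha, dim=0) - math.log(k))

def soft_q_target_i(reward_i, next_v_i, gamma=0.99, done=False):
    """Soft Bellman target of Equation (15): r_i + gamma * V_soft^i(s')."""
    return reward_i + (0.0 if done else gamma * next_v_i)
```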
Within the soft policy iteration, each agent's action-value function serves to estimate the value of its actions under the policies of all agents. This function depends on the current state of the environment, of which all agents are a part.
Compared with the SAC iteration for a single AUV, where $V_{\text{soft}}$ can always be calculated given $Q_{\text{soft}}$ and $\pi$, the $V_{\text{soft}}$ of a multi-AUV system is closely related to the actions of the other AUVs. The method to estimate these actions, via the predictive states generated by the SSIG, was provided in the previous sections. Subsequently, each agent updates its policy via gradient ascent to maximize its action-value function. Through this approach, although each agent updates its policy independently, the behaviors of all other agents are taken into account. Thus, even though each agent maintains its own policy, the overall joint policy evolves alongside the updates to each individual agent's policy. The flowchart of the S4AC iteration in the hunting scenario is shown in Figure 5.
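For orientation, the following pseudocode-style sketch outlines one S4AC training step; the `ssig.predict_next_states` interface and the agent methods are hypothetical names standing in for the components described above.

```python
# Hypothetical outline of one S4AC training step; every name below is
# illustrative, standing in for the components described in this section.
def s4ac_step(agents, ssig, replay_buffer, batch_size=256):
    batch = replay_buffer.sample(batch_size)
    for i, agent in enumerate(agents):
        # SSIG generates predictive states of the other AUVs from consecutive
        # state pairs, standing in for their unknown policies.
        others = ssig.predict_next_states(batch.states, exclude=i)
        agent.update_critic(batch, others)  # soft Bellman backup, Eqs. (15)-(16)
        agent.update_actor(batch, others)   # gradient ascent on Q_soft^i
```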