In this section, the construction of the SSIG objective function is presented, followed by a description of how SSIG generates predictive states of a dynamic system. Afterwards, the action–value function of a multi-agent version of SAC is detailed; it is supplied with real-time states of the other agents generated by the SSIG, enabling each agent to learn its policy independently while accounting for the policies of the other agents. Lastly, the specific definition of the reward function is given, and the training process of S4AC for multi-AUV hunting tasks is demonstrated.
3.1. Super Sampling Info-GAN
In the proposed framework, an info-GAN [44] pair is trained to generate state pairs for predicting the motion of the dynamical system. The generative network is a deep neural network that takes as input both unstructured random noise and a structured pair of consecutive states from a low-dimensional, parametrized dynamical system termed the inference transition model. In this section, the operation of the SSIG in generating the predictive state is explained.
In the classical GAN [41] framework, assume that $s \sim P_{\text{data}}(s)$ is a state sample extracted from the dataset. Deep generative models aim to train stochastic neural networks whose generated data distribution approximates $P_{\text{data}}(s)$. The GAN framework consists of a generator, $G(z)$, that maps a noise input $z \sim P_z(z)$ to a state sample, and a discriminator, $D(s)$, that maps the state sample to the probability that it was sampled from the real data instead of the generator. GAN training is optimized through a game between the generator and the discriminator:

$$\min_G \max_D V(D, G) = \mathbb{E}_{s \sim P_{\text{data}}(s)}[\log D(s)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] \tag{7}$$
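As a concrete illustration, the following minimal PyTorch sketch evaluates the value function of Equation (7) for one batch; the module sizes and the names `G`, `D`, `state_dim`, and `noise_dim` are illustrative assumptions rather than details of the original implementation.

```python
# Minimal sketch of the minimax value in Equation (7); sizes are assumptions.
import torch
import torch.nn as nn

state_dim, noise_dim = 4, 8
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))
D = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

def gan_value(real_states: torch.Tensor) -> torch.Tensor:
    """V(D, G) = E[log D(s)] + E[log(1 - D(G(z)))]."""
    z = torch.randn(real_states.size(0), noise_dim)
    fake_states = G(z)
    return torch.log(D(real_states)).mean() + torch.log(1.0 - D(fake_states)).mean()

# In training, D ascends gan_value while G descends it, using two optimizers.
```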
The noise vector $z$ in a GAN can be regarded as containing some representation of the state $s$. However, in standard GAN training there is no incentive to make this representation structured, which makes it difficult to interpret. The info-GAN [44] method mitigates this issue. Let $H(\cdot)$ denote the entropy of a random variable. The mutual information between two random variables, $I(X; Y) = H(X) - H(X \mid Y)$, measures the influence of one variable on the uncertainty of the other. The idea in info-GAN is to add to the generator input an additional "state" component $c$, and to add to the GAN objective a loss that induces maximal mutual information between the generated state and this abstract state. The info-GAN objective [44] is given by

$$\min_G \max_D V_I(D, G) = V(D, G) - \lambda\, I(c;\, G(z, c)) \tag{8}$$
where $\lambda$ is a weight parameter and $V(D, G)$ is the GAN loss function in Equation (7). This objective induces the abstract state to capture the most salient properties of the state samples. It is difficult to solve the optimization objective in Equation (8) without access to the posterior distribution $P(c \mid s)$, and a variational lower bound was proposed in [44]. An auxiliary distribution $Q(c \mid s)$ is defined to approximate the posterior $P(c \mid s)$. Then,

$$I(c;\, G(z, c)) \geq \mathbb{E}_{c \sim P(c),\, s \sim G(z, c)}[\log Q(c \mid s)] + H(c) \tag{9}$$

According to Equation (9), the info-GAN objective (8) can be optimized using stochastic gradient descent.
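A minimal sketch of how the bound in Equation (9) can be estimated in practice, assuming a categorical code $c$ with a uniform prior and a classifier head `Q`; all module names and sizes here are assumptions.

```python
# Sketch of the variational bound in Equation (9) for a categorical code c.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_codes, noise_dim, state_dim = 10, 8, 4
G = nn.Sequential(nn.Linear(noise_dim + n_codes, 64), nn.ReLU(), nn.Linear(64, state_dim))
Q = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_codes))

def mi_lower_bound(batch_size: int) -> torch.Tensor:
    c = torch.randint(0, n_codes, (batch_size,))      # c ~ uniform prior P(c)
    z = torch.randn(batch_size, noise_dim)
    x = G(torch.cat([z, F.one_hot(c, n_codes).float()], dim=1))
    log_q = F.log_softmax(Q(x), dim=1)                # log Q(c | x)
    return log_q.gather(1, c.unsqueeze(1)).mean()     # E[log Q(c|x)]; H(c) is constant
```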
In the MARL system, continuous real environment states are stored in the replay buffer, which can be regarded as a dataset $D$.
Let $s$ and $s'$ denote a pair of consecutive states sampled from the dataset $D$, and let $P_D(s, s')$ denote the probability of their appearing in the data $D$. We believe that a generative model that can accurately learn $P_D(s, s')$ has to capture the policy features that represent the movement of agents from one state to another. Following the approach outlined in [45], we modify the classic GAN networks and apply them to our setting. A vanilla GAN consists of a generator, $G(z)$, that maps a noise input $z$ to a state pair, and a discriminator, $D(s, s')$, that maps a state pair to the probability that it was sampled from the real data $D$ instead of the generator. The noise vector $z$ of the GAN can be regarded as a feature vector that contains a representation of a certain transition from $s$ to $s'$. On this basis, a generator with structured input that can be used for inferring the action policy of the agents is proposed.
Let $\mathcal{M} = (U, P(u), T_\phi)$ denote a dynamical system with a transition space $U$, which we name the abstract state set. $T_\phi(u' \mid u)$ is a parametrized, stochastic transition function, where $u, u'$ are a pair of consecutive abstract states, and $P(u)$ denotes the prior probability of an abstract state $u$. $\mathcal{M}$ is termed the implicit transition system. The generator is structured to take in a pair of consecutive abstract states $(u, u')$ in addition to the noise vector $z$. The objective function of the GAN in this case is

$$\min_{G, T_\phi} \max_D V(D, G) = \mathbb{E}_{(s, s') \sim P_D}[\log D(s, s')] + \mathbb{E}_{u \sim P(u),\, u' \sim T_\phi(u' \mid u),\, z \sim P_z(z)}[\log(1 - D(G(z, u, u')))] \tag{10}$$
where $u$ and $u'$ represent the implicit features required for inferring the dynamic system's transition model, which include information about the cooperators' motion policies, while $z$ captures less informative variations. To induce the learning of such representations, we follow the info-GAN method and add to the GAN objective a loss that induces maximal mutual information between the generated pair of states and the abstract states.
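Concretely, the pair $(u, u')$ fed to the generator is drawn from the implicit transition system; the sketch below samples such pairs and, by applying $T_\phi$ sequentially, longer abstract-state rollouts. The Gaussian prior, the fixed-variance Gaussian transition, and the dimensions are assumptions.

```python
# Sketch of sampling from the implicit transition system M = (U, P(u), T_phi).
import torch
import torch.nn as nn

u_dim = 2
T_phi = nn.Sequential(nn.Linear(u_dim, 32), nn.ReLU(), nn.Linear(32, u_dim))  # mean of T_phi(u'|u)

def rollout_abstract_states(steps: int, batch_size: int = 1):
    u = torch.randn(batch_size, u_dim)              # u_0 ~ P(u), Gaussian prior assumed
    trajectory = [u]
    for _ in range(steps):
        u = T_phi(u) + 0.1 * torch.randn_like(u)    # u_{t+1} ~ T_phi(u' | u_t)
        trajectory.append(u)
    return trajectory                               # [u_0, u_1, ..., u_steps]

u, u_next = rollout_abstract_states(steps=1)        # one (u, u') pair for the generator
```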
The Super-Sampling Info-GAN objective is proposed as

$$\min_{G, T_\phi} \max_D V_{\text{SSIG}}(D, G) = V(D, G) - \lambda\, I(u, u';\, G(z, u, u')) \tag{11}$$

where $\lambda$ is a weight parameter, and $V(D, G)$ is given by Equation (10). Intuitively, this objective enables the abstract model to capture the most significant changes that may occur within the dynamic system. Since we cannot access the posterior distribution $P(u, u' \mid s, s')$ when using an expressive generator function, it is difficult to directly optimize Equation (11). Therefore, a variational lower bound of the mutual information in (11) is derived following the info-GAN, and an auxiliary distribution $Q(u, u' \mid s, s')$ is defined to approximate the posterior $P(u, u' \mid s, s')$, similar to the derivation proposed in [44]:

$$I(u, u';\, G(z, u, u')) \geq \mathbb{E}_{u \sim P(u),\, u' \sim T_\phi(u' \mid u),\, (s, s') \sim G(z, u, u')}[\log Q(u, u' \mid s, s')] + H(u, u') \tag{12}$$
In Equation (12), $Q(u, u' \mid s, s')$ can be seen as a classifier that maps pairs of state samples to pairs of abstract states. The mutual information in (11) is not sensitive to the ordering of the random variables $u$ and $u'$, which points to a potential caveat in the optimization objective in Equation (12): we would like the random variable for the next abstract state to have the same meaning as the random variable for the current abstract state. The transition operator $T_\phi$ can then be applied sequentially to unroll the sequence of changes in the abstract state and effectively plan with the abstract state transition model $T_\phi$. The loss function is obtained by substituting Equation (12) into Equation (11):

$$\min_{G, T_\phi, Q} \max_D\; \mathcal{L}(D, G, Q) = V(D, G) - \lambda\, \mathbb{E}_{u \sim P(u),\, u' \sim T_\phi(u' \mid u),\, (s, s') \sim G(z, u, u')}[\log Q(u, u' \mid s, s')] - \lambda H(u, u') \tag{13}$$

where $H(u, u')$ is a constant. The loss in Equation (13) can be optimized effectively using stochastic gradient descent.
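A compact sketch of the loss in Equation (13), assuming a unit-variance Gaussian auxiliary distribution $Q(u, u' \mid s, s')$ whose mean is predicted by a network head; the networks, dimensions, and the value of $\lambda$ are assumptions.

```python
# Sketch of the SSIG loss in Equation (13); networks and lambda are assumptions.
import torch
import torch.nn as nn

state_dim, u_dim, noise_dim, lam = 4, 2, 8, 0.1
G = nn.Sequential(nn.Linear(2 * u_dim + noise_dim, 64), nn.ReLU(),
                  nn.Linear(64, 2 * state_dim))                   # outputs a state pair (s, s')
T_phi = nn.Sequential(nn.Linear(u_dim, 32), nn.ReLU(), nn.Linear(32, u_dim))
D_net = nn.Sequential(nn.Linear(2 * state_dim, 64), nn.ReLU(),
                      nn.Linear(64, 1), nn.Sigmoid())             # discriminator over pairs
Q_net = nn.Sequential(nn.Linear(2 * state_dim, 64), nn.ReLU(),
                      nn.Linear(64, 2 * u_dim))                   # mean of Q(u, u' | s, s')

def ssig_loss(real_pairs: torch.Tensor) -> torch.Tensor:
    n = real_pairs.size(0)
    u = torch.randn(n, u_dim)                      # u ~ P(u), Gaussian prior assumed
    u_next = T_phi(u) + 0.1 * torch.randn_like(u)  # u' ~ T_phi(u' | u)
    z = torch.randn(n, noise_dim)
    fake_pairs = G(torch.cat([u, u_next, z], dim=1))
    v_gan = (torch.log(D_net(real_pairs)).mean()
             + torch.log(1.0 - D_net(fake_pairs)).mean())         # V(D, G), Eq. (10)
    # E[log Q(u, u' | s, s')] for a unit-variance Gaussian Q, up to a constant:
    log_q = -0.5 * ((Q_net(fake_pairs) - torch.cat([u, u_next], dim=1)) ** 2).sum(dim=1).mean()
    return v_gan - lam * log_q                     # the constant H(u, u') is dropped
```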
3.3. Soft Policy Iteration
In this section, a multi-agent version of the soft-Q iteration algorithm proposed in [34] is derived. The derivation follows a logic similar to that of the vanilla Soft Actor–Critic.
Following [47], a tuple $(S, A_1, \ldots, A_n, r_1, \ldots, r_n, p, \gamma)$ is defined for an n-AUV system, where $S$ denotes the state space, $p$ denotes the distribution of the initial state, $\gamma$ is a discount factor, and $A_i$ and $r_i$ are the action space and the reward function of AUV $i$, respectively. The states transition according to $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$, where $a_t$ is the joint action, represented as $a_t = (a_t^1, \ldots, a_t^n)$. AUV $i$ selects action $a_t^i$ according to the policy $\pi_{\theta_i}$ parameterized by $\theta_i$, given a state $s_t$.
It is convenient to interpret the joint policy from the perspective of AUV $i$ such that $\pi = (\pi_i, \pi_{-i})$, where $\pi_i$ is the policy of AUV $i$, $\pi_{-i} = (\pi_1, \ldots, \pi_{i-1}, \pi_{i+1}, \ldots, \pi_n)$, and $\pi_{-i}$ is a compact representation of the joint policy of all complementary AUVs of $i$. Actions are taken simultaneously at each training step. Each AUV is presumed to discover the policy with the optimal soft action value $Q_{\text{soft}}^{i,*}$, as shown in Equation (15):

$$Q_{\text{soft}}^{i}(s_t, a_t^i, a_t^{-i}) \leftarrow r_i(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim P}\big[V_{\text{soft}}^{i}(s_{t+1})\big] \tag{15}$$

where $V_{\text{soft}}^{i}$ is updated by Equation (16):

$$V_{\text{soft}}^{i}(s_t) \leftarrow \alpha \log \int_{A_i} \exp\Big(\tfrac{1}{\alpha}\, Q_{\text{soft}}^{i}(s_t, a^i, a_t^{-i})\Big)\, \mathrm{d}a^i \tag{16}$$
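A minimal sketch of how Equations (15) and (16) can be estimated for AUV $i$, assuming a critic network over the state and the joint action, with the integral in (16) approximated by Monte-Carlo sampling of the agent's own actions; the function names and the temperature $\alpha$ are assumptions.

```python
# Sketch of the soft updates in Equations (15) and (16) for AUV i.
import math
import torch

def soft_value_i(q_net_i, state, others_actions, own_action_samples, alpha=0.2):
    """V_soft^i(s) ~= alpha * log mean_a exp(Q_soft^i(s, a^i, a^-i) / alpha),
    with the integral in Equation (16) replaced by k sampled own-actions.
    `state` and `others_actions` are single rows of shape (1, dim)."""
    k = own_action_samples.size(0)
    s = state.expand(k, -1)
    a_others = others_actions.expand(k, -1)
    q = q_net_i(torch.cat([s, own_action_samples, a_others], dim=1)).squeeze(-1)
    return alpha * (torch.logsumexp(q / alpha, dim=0) - math.log(k))

def soft_q_target_i(reward_i, next_v_i, gamma=0.99, done=False):
    """Soft Bellman target of Equation (15): r_i + gamma * V_soft^i(s')."""
    return reward_i + (0.0 if done else gamma * next_v_i)
```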
Within the soft policy iteration, each agent's action-value function serves to estimate the value of its actions under the policies of all agents. This function depends on the current state of the environment, of which all agents are a part.
Compared with the SAC iteration for a single AUV, where $V_{\text{soft}}$ can always be calculated given $Q_{\text{soft}}$ and $\pi$, the $V_{\text{soft}}$ of a multi-AUV system is closely related to the actions of the other AUVs. The method to estimate these actions, via the predictive states generated by the SSIG, was provided in the previous sections. Subsequently, each agent updates its policy via gradient ascent to maximize its action-value function. Through this approach, although each agent updates its policy independently, the behaviors of all other agents are taken into account. Thus, even though each agent maintains its own policy, the overall joint policy evolves alongside the updates to each individual agent's policy. The flowchart of the S4AC iteration in the hunting scenario is shown in Figure 5.
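For orientation, the following pseudocode-style sketch outlines one S4AC training step; the `ssig.predict_next_states` interface and the agent methods are hypothetical names standing in for the components described above.

```python
# Hypothetical outline of one S4AC training step; every name below is
# illustrative, standing in for the components described in this section.
def s4ac_step(agents, ssig, replay_buffer, batch_size=256):
    batch = replay_buffer.sample(batch_size)
    for i, agent in enumerate(agents):
        # SSIG generates predictive states of the other AUVs from consecutive
        # state pairs, standing in for their unknown policies.
        others = ssig.predict_next_states(batch.states, exclude=i)
        agent.update_critic(batch, others)  # soft Bellman backup, Eqs. (15)-(16)
        agent.update_actor(batch, others)   # gradient ascent on Q_soft^i
```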