Article

Variational Information Bottleneck Regularized Deep Reinforcement Learning for Efficient Robotic Skill Adaptation

1
College of Electrical Engineering, Sichuan University, Chengdu 610065, China
2
National Key Laboratory of Special Vehicle Design and Manufacturing Integration Technology, Baotou 014031, China
*
Author to whom correspondence should be addressed.
Sensors 2023, 23(2), 762; https://doi.org/10.3390/s23020762
Submission received: 16 December 2022 / Revised: 3 January 2023 / Accepted: 5 January 2023 / Published: 9 January 2023
(This article belongs to the Special Issue Advances in Mobile Robot Perceptions, Planning, Control and Learning)

Abstract

Deep Reinforcement Learning (DRL) algorithms have been widely studied for sequential decision-making problems, and substantial progress has been achieved, especially in autonomous robotic skill learning. However, it remains difficult to deploy DRL methods in practical safety-critical robot systems, since a gap between the training and deployment environments always exists, and this issue becomes increasingly crucial in ever-changing environments. Aiming at efficient robotic skill transfer in a dynamic environment, we present a meta-reinforcement learning algorithm based on a variational information bottleneck. More specifically, during the meta-training stage, the variational information bottleneck is first applied to infer the set of basic tasks for the whole task space, and the maximum entropy regularized reinforcement learning framework is then used to learn the basic skills corresponding to those basic tasks. Once the training stage is completed, every task in the task space can be obtained as a nonlinear combination of the basic tasks; thus, the skills needed to accomplish these tasks can also be obtained by combining the basic skills. Empirical results on several highly nonlinear, high-dimensional robotic locomotion tasks show that the proposed variational information bottleneck regularized deep reinforcement learning algorithm improves sample efficiency on new tasks by 200–5000 times. Furthermore, the proposed algorithm achieves substantial asymptotic performance improvement. These results indicate that the proposed meta-reinforcement learning framework makes a significant step toward deploying DRL-based algorithms in practical robot systems.

1. Introduction

Deep reinforcement learning (DRL) algorithms have achieved outstanding success in solving complex tasks, including video games and the game of Go [1,2], manufacturing [3,4], and healthcare [5]. Generally, reinforcement learning agents learn to optimize the expected cumulative reward by interacting with the external environment in a trial-and-error manner, much like how human beings learn new skills [6,7]. Therefore, DRL has received increasingly significant attention for highly nonlinear, high-dimensional robot systems performing complicated tasks [8,9,10,11]. Impressive results have already emerged in complex robotic skill learning and control, such as navigation [12,13], quadrupedal locomotion [14,15], dexterous manipulation [16], soft robots [17,18], and many more robotic tasks [19,20].
Despite the remarkable progress in applying DRL to various robotic skill learning tasks, the implicit assumption that the training environment and the deployment environment are identical means that agents trained with DRL algorithms often over-specialize to the training environment, which prohibits adaptation to novel, unseen circumstances [21,22]. However, in modern application scenarios, robots must work in ever-changing and even unknown environments [23,24]. Intuitively, there are two approaches to this problem: the first is re-training in the new environment, which requires a huge number of samples; the second is to deploy the well-trained policy directly, which leads to severe performance degradation and may damage the physical robot system. Therefore, the problem of how to make a DRL agent learn efficiently and adapt to a dynamic environment is of great significance [25,26,27].
On the other hand, we recognize that human beings can learn new skills for solving new tasks in new environments from very limited samples [28]. The reason is that we can draw inferences about other cases from one instance, i.e., when new tasks emerge, we find related or similar cases by checking our experience, and the corresponding skills are abstracted and transferred to the new tasks, so we acquire the new skills very efficiently. This kind of human intelligence gives us significant inspiration on how to improve robot intelligence, i.e., endowing the robot with the capability of transfer learning. Many practical insights make this idea feasible. For example, the majority of robots are not designed for a specific task; a universal robot manipulator can be used to accomplish versatile tasks while sharing the same configuration and dynamics. Moreover, different tasks share the same skills; for example, both the door-opening task and the cap-unscrewing task can be decomposed into holding and rotating an object. Therefore, if we embed the mechanism of transfer learning into the DRL framework, the learning efficiency for new skills can be improved substantially.
Recently, transfer learning methods have been widely studied in the pattern recognition field [29]. The main idea is to pre-train a model on huge amounts of previously acquired data and then fine-tune the model with a few samples from the new task. In computer vision, for example, a network is typically pre-trained on the large-scale ImageNet dataset and then fine-tuned with a few task-specific images [30]. Howard et al. train a large long short-term memory model on large natural language corpora and then use the fine-tuning technique to update the model for new text tasks [31]. However, this simple pre-training technique can only deal with very limited kinds of tasks, and its performance is limited by the number of new task samples. More severely, we cannot build a robot learning database like ImageNet or large text corpora, so this method has hardly been applied to robotic skill transfer [32,33].
The meta-learning scheme, or learning to learn, aims to extract similar or common structures that are shared across various tasks [34]. Once these latent common structures are obtained, one can acquire new skills very efficiently by combining them with very few samples from the current task. Following this idea, meta-reinforcement learning [35], i.e., incorporating the meta-learning mechanism into the reinforcement learning framework, can be categorized into two classes. The first encodes the various tasks and the corresponding skills with memory-enabled neural networks, such as the neural Turing machine [36] and the long short-term memory model, and the memorized information can be further used to assist skill learning for new tasks [37,38]. In general, the encoded information could be the initialization parameters for new networks or the learning parameters for new tasks. This kind of meta-learning can realize structure extraction and transfer to some extent; however, memory-based meta-learning cannot learn the intrinsic structure explicitly, and the learning process is like a black box.
Another kind of meta-learning approach aims to embed structure learning into policy learning. The most famous algorithm is model-agnostic meta-learning (MAML), which is built on the policy gradient method and can be used in nearly all policy-gradient-based reinforcement learning frameworks [39,40]. The main idea of MAML is to obtain a set of initial parameters that are sensitive to the gradients of different tasks by formulating parameter optimization as a bi-level optimization problem. Once the initial parameters are obtained from the source training tasks, only one or a few gradient steps are needed to obtain the new policy for a new task. In comparison with the memory-based meta-learning method, MAML does not need memory-enabled neural networks and does not introduce additional model parameters. From the perspective of feature learning, MAML learns a set of task features that are widely useful for many tasks, so a high-performance policy can be obtained through only a few policy gradient steps. From the dynamical system perspective, MAML seeks a set of parameters with the greatest sensitivity to the performance of all tasks. MAML has been successfully applied to practical robot learning tasks, such as the domain adaptation issue in observing-then-imitating robot skill learning [41]. Since MAML is built on policy gradient algorithms, some issues remain in the computation of the policy gradient, such as the bias-variance problem in gradient estimation and unstable gradient computation. Liu et al. proposed the Taming MAML algorithm to deal with the bias-variance issue by constructing a surrogate objective function with a zero-expectation baseline function [42]. Rothfuss et al. proposed the ProMP algorithm by analyzing the credit assignment issue to improve gradient estimation and algorithmic stability [43]. Gupta et al. studied the exploration issue and built structured noise for better exploration, thus enabling better performance [44]. However, some issues still remain. MAML optimizes policy parameters that are most sensitive to all tasks, which means the obtained policy cannot directly accomplish any task, including the source tasks, and once the policy adapts to one specific task, the network loses the ability for further improvement. During adaptation, all parameters of the MAML policy are updated and second-order optimization is involved, which aggravates the computational burden. Most of all, there is still a lack of effective methods for sub-task extraction and representation.
Other works are inspired by how human beings adapt acquired skills to new tasks. Pastor et al. realize the generalization of relevant features among different skills with the help of the associative skill memory technique [45,46]. Rueckert et al. use the motion primitive technique to extract a group of low-dimensional control variables that can be reused [47]. Sutton et al. represent the process of task realization as a hierarchical motion sequence [48,49,50]. However, the aforementioned works still cannot deal with the problems of how to extract and represent the basic tasks and basic policies from the source task space, and how to infer the spatial-temporal combination of the basic tasks and basic policies automatically.
Therefore, we take advantage of latent space learning techniques and deep neural networks to automatically extract the basic tasks and basic policies, along with their respective combination methods. Latent space learning techniques are widely used to compress high-dimensional data into low-dimensional information structures. Lenz et al. map the high-dimensional state space into a low-dimensional latent space and learn skills in the latent space [51]. Du et al. apply a latent space to learn the dynamics of the robot and then realize online skill transfer [52]. Although principal component analysis is the most classic latent space learning technique, it struggles with high-dimensional, highly nonlinear robotic skill learning problems [53]. Tishby et al. proposed the information bottleneck technique from the perspective of information theory to extract low-dimensional structures [54]. Saxe et al. then extended it to the deep learning domain by combining it with deep neural networks [55]. Alemi et al. further proposed the variational information bottleneck to quantify inference uncertainty via variational inference theory [56]. Recent studies [57] also show that the information bottleneck principle advocates learning minimal sufficient representations, i.e., those which contain only the information needed for the downstream task. Optimal representations retain the input-output information that is relevant to learning a task while remaining parsimonious. Peng et al. [58] showed that the variational information bottleneck could be applied to imitation learning, inverse reinforcement learning, and adversarial learning for more robust performance.
Therefore, we incorporate the variational information bottleneck into the maximum entropy off-policy reinforcement learning framework to develop a novel meta-reinforcement learning scheme. The main contributions of this article are summarized as follows:
  • We develop a novel meta-reinforcement learning framework based on a variational information bottleneck. The framework consists of two stages, i.e., the meta-training stage and the meta-testing stage. The meta-training stage aims to extract the basic tasks and the corresponding basic policies; the meta-testing stage aims to efficiently infer the new policy for a new task by taking advantage of the basic tasks and basic policies.
  • The meta-training and meta-testing algorithms are presented in detail. Thus, the meta-reinforcement learning framework allows efficient robotic skill transfer learning in a dynamic environment.
  • Empirical experiments based on MuJoCo have been conducted to show the effectiveness of the proposed scheme.
The remainder of this paper is organized as follows: Section 2 formally describes the problem and introduces the variational information bottleneck theory. The novel variational information bottleneck regularized DRL framework is proposed in Section 3. In Section 4, we first formulate the transfer learning tasks based on MuJoCo and then present the empirical results. In Section 5, discussions and conclusions are presented.

2. Problem Formulation and Background

2.1. Problem Formulation

Meta-reinforcement learning involves two stages, meta-training and meta-testing. During the meta-training stage, a batch of tasks is first sampled from the source task space as the training set, and then an algorithm is employed for model training. During the meta-testing stage, a small batch of tasks is sampled from the target task space as a testing set, and the well-trained model is used to evaluate the transfer performance through only a few interactions on each task. In general, we assume that the training task set and the testing task set are sampled from the same distribution $p(\mathcal{T})$, and that the task space consists of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition function space $\mathcal{P}$, and a bounded reward function space $\mathcal{R}$, so that every task can be characterized by a Markov Decision Process (MDP). We further assume that the transition functions and reward functions are unknown to the agent; only samples can be gathered. A stochastically sampled task is $\mathcal{T} = \{p(s_0),\, p(s_{t+1} \mid s_t, a_t),\, r(s_t, a_t)\}$, where $p(s_0)$ denotes the initial state distribution, $p(s_{t+1} \mid s_t, a_t)$ denotes the (unknown) transition probability, and $r(s_t, a_t)$ denotes the reward function. Under this definition, $p(\mathcal{T})$ can describe both tasks with different transition functions, i.e., different robot dynamics, and tasks with different reward functions, i.e., different task objectives, while all robots and tasks are assumed to share the same state space and action space. Given a task distribution $p(\mathcal{T})$, we first sample $M$ tasks as the meta-training set $\mathcal{T}_i \sim p(\mathcal{T})$, $i = 1, \ldots, M$, and then optimize the latent-space-based policy $\pi(a \mid s, z)$, where $z$ denotes the latent variable, on these training tasks. Denoting $e_k^{\mathcal{T}} = (s_k, a_k, r_k, s_k')$ as one sample for task $\mathcal{T}$, $e_{1:K}^{\mathcal{T}}$ denotes a trajectory. Thus, we obtain the meta-training data sets $\mathcal{D}^{\mathcal{T}_i}$, $i = 1, \ldots, M$. Once meta-training is accomplished, we sample another $N$ tasks from the task distribution $p(\mathcal{T})$ as the meta-testing set $\mathcal{T}_j \sim p(\mathcal{T})$, $j = 1, \ldots, N$. For any test task, only a few interactions are needed before the latent space can infer the spatial-temporal combination of basic tasks; thus, the agent can adapt to the test task efficiently. The detailed variable definitions are shown in Table 1.
The latent space-based meta-reinforcement learning simultaneously learns the latent space and its corresponding policy. The latent space is obtained in a self-supervised manner and is used to differentiate the various tasks. Thus, the latent space should satisfy the following principles: (1) sufficiency, the latent variable $z$ should sufficiently characterize the task space $p(\mathcal{T})$, i.e., the mapping from the task space to the latent space should be surjective and injective; (2) parsimony, the latent space should not over-fit to the detailed states; (3) identifiability, the latent space should identify the specific task from the trajectories. Therefore, we tailor the variational information bottleneck technique to extract such a latent space. Furthermore, to achieve better training efficiency, we adopt the state-of-the-art maximum entropy off-policy actor-critic algorithm, i.e., the soft actor-critic (SAC). In the following subsections, we describe the mathematical formulation of reinforcement learning and the variational information bottleneck theory. A minimal illustration of the task-sampling protocol is given in the sketch below.
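As a minimal illustration of the sampling protocol above, the following sketch draws a meta-training set and a meta-testing set from the same task distribution $p(\mathcal{T})$. Treating each task as a target forward velocity (as in the *-Diff-Velocity benchmarks used later), as well as the velocity range and function name, are assumptions for illustration only.

```python
import numpy as np

def sample_task_sets(num_train=100, num_test=30, v_min=0.0, v_max=3.0, seed=0):
    """Draw M meta-training tasks and N meta-testing tasks from the same p(T).

    Each task here is simply a target forward velocity; the range and counts
    are illustrative assumptions, not the paper's exact benchmark settings.
    """
    rng = np.random.default_rng(seed)
    train_tasks = rng.uniform(v_min, v_max, size=num_train)   # T_i ~ p(T), i = 1, ..., M
    test_tasks = rng.uniform(v_min, v_max, size=num_test)     # T_j ~ p(T), j = 1, ..., N
    return train_tasks, test_tasks
```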

2.2. Markov Decision Process (MDP)

In general, an MDP is described by the tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, r, \gamma, \rho_0 \rangle$, where $\mathcal{S}$, $\mathcal{A}$, and $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, \infty)$ are defined as before. $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ denotes the reward function, bounded by $[\underline{r}, \bar{r}]$, $\rho_0: \mathcal{S} \to \mathbb{R}$ denotes the initial state distribution, and $\gamma \in (0, 1)$ denotes the discount factor.
In this work, we consider a stochastic policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$, and let $J(\pi)$ denote the expected discounted cumulative reward:
$$J(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right], \tag{1}$$
where $\mathbb{E}$ denotes the expectation over trajectories generated by following the policy $\pi$. The objective of RL is to find the policy that maximizes this objective function:
$$\pi^* = \arg\max_{\pi} J(\pi), \tag{2}$$
where $s_0 \sim \rho_0(s_0)$, $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$, and $a_t \sim \pi(a_t \mid s_t)$. Let $Q^{\pi}$ denote the state-action value function, $V^{\pi}$ the state value function, and $A^{\pi}$ the advantage function:
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{k=0}^{\infty} \gamma^k r(s_{t+k}, a_{t+k})\right], \tag{3}$$
$$V^{\pi}(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\left[\sum_{k=0}^{\infty} \gamma^k r(s_{t+k}, a_{t+k})\right], \tag{4}$$
$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t), \tag{5}$$
where $a_t \sim \pi(a_t \mid s_t)$ and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$ for all $t \geq 0$.
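As a minimal sketch of the objective in Equation (1), the code below estimates $J(\pi)$ by Monte Carlo averaging over reward trajectories collected by rolling out a policy; the function names are illustrative.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """sum_t gamma^t r_t for a single trajectory (the term inside the expectation of Eq. (1))."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

def estimate_objective(reward_trajectories, gamma=0.99):
    """Monte Carlo estimate of J(pi) in Eq. (1): average the discounted return
    over a set of reward sequences collected under the policy pi."""
    return float(np.mean([discounted_return(r, gamma) for r in reward_trajectories]))
```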

2.3. Maximum Entropy Actor-Critic

To balance exploration and exploitation in standard RL, a per-step entropy term is usually added to the per-step reward as $r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))$, where $\mathcal{H}(\pi(\cdot \mid s_t)) = -\int_{\mathcal{A}} \pi(a_t \mid s_t) \log \pi(a_t \mid s_t)\, \mathrm{d}a_t$ denotes the entropy of the policy at $s_t$ and $\alpha \in (0, 1)$ is a weighting term. The optimal policy is then obtained by maximizing the expected discounted reward together with its entropy:
$$\pi^*_{\mathrm{MaxEnt}} = \arg\max_{\pi} \sum_{t=0}^{\infty} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\Big[\gamma^t \big(r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\big)\Big]. \tag{6}$$
By doing so, the agent aims to maximize the cumulative reward while keeping the policy as stochastic as possible. Several previous works have studied this framework carefully and shown conceptual and practical advantages [9,10,11]: the policy is incentivized to explore more diversely while giving up obviously hopeless directions, which improves sampling efficiency and performance. In comparison with standard RL, the action-state value function is modified as
$$Q(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{(s_{t+k}, a_{t+k}) \sim \rho_{\pi}}\left[\sum_{k=1}^{\infty} \gamma^k \big(r(s_{t+k}, a_{t+k}) + \alpha \mathcal{H}(\pi(\cdot \mid s_{t+k}))\big)\right], \tag{7}$$
and the state value function as
$$V(s_t) = \alpha \log \int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q(s_t, a)\right) \mathrm{d}a, \tag{8}$$
where $\rho_{\pi}(s, a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$ and $\rho_{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\rho_{\pi}(s, a)\big]$ denote the state-action occupancy measure and the state occupancy measure, respectively. Then, according to the Bellman optimality principle, we obtain the following result.
Lemma 1. 
With the modified action-state value function and the modified state value function defined in Equations (7) and (8), respectively, they satisfy the Bellman optimality equation
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p_s}\big[V(s_{t+1})\big], \tag{9}$$
with the optimal policy given by
$$\pi^*(a_t \mid s_t) = \exp\left(\frac{1}{\alpha}\big(Q(s_t, a_t) - V(s_t)\big)\right). \tag{10}$$
Moreover, the modified value iteration given by
$$Q(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p_s}\big[V(s_{t+1})\big], \quad \forall\, (s_t, a_t), \tag{11}$$
$$V(s_t) \leftarrow \alpha \log \int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q(s_t, a)\right) \mathrm{d}a, \quad \forall\, s_t, \tag{12}$$
converges to the fixed points $Q^*(s_t, a_t)$ and $V^*(s_t)$, respectively.
The modified Bellman optimality equation in Equation (9) can be regarded as a generalization of the conventional Bellman equation, which is recovered as $\alpha \to 0$. The SQL (Soft Q-Learning) and SAC (Soft Actor-Critic) algorithms, which are built on the aforementioned value iteration, achieve state-of-the-art performance in continuous control and dexterous hand manipulation tasks by using deep neural networks to approximate the modified action-value function, the modified state value function, and the policy [10].
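The following is a minimal tabular sketch of the soft value iteration in Equations (11) and (12), assuming a small finite MDP so that the integral over actions can be replaced by a sum; it is an illustration of the principle, not the deep-network algorithm used later.

```python
import numpy as np

def soft_max(Q, alpha):
    """Numerically stable soft maximum over actions: alpha * log sum_a exp(Q(s, a) / alpha)."""
    m = Q.max(axis=1)
    return m + alpha * np.log(np.exp((Q - m[:, None]) / alpha).sum(axis=1))

def soft_value_iteration(P, r, gamma=0.99, alpha=0.2, iters=500):
    """Tabular soft value iteration for a finite MDP.

    P: transition tensor of shape (S, A, S); r: reward matrix of shape (S, A).
    """
    Q = np.zeros(r.shape)
    for _ in range(iters):
        V = soft_max(Q, alpha)               # Eq. (12), with the integral replaced by a sum
        Q = r + gamma * P @ V                # Eq. (11): soft Bellman backup
    V = soft_max(Q, alpha)
    pi = np.exp((Q - V[:, None]) / alpha)    # optimal soft policy, Eq. (10); rows sum to 1
    return Q, V, pi
```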

2.4. Variational Information Bottleneck (VIB)

The information bottleneck theory was first proposed by Tishby et al., and Alemi et al. proposed the variational information bottleneck by incorporating variational inference and deep learning techniques. For a general supervised learning task, we are given a data set $\{x_i, y_i\}$, where $x_i$ denotes the input and $y_i$ denotes the label. The information bottleneck encodes the inputs into a latent space as $E(z \mid x)$ and constrains the information flow between $x$ and $z$ using the mutual information $I(X; Z)$, defined by
$$I(Z, X) = D_{\mathrm{KL}}\big[p(Z, X) \,\|\, p(Z)\, p(X)\big] = \mathbb{E}_{X \sim p(X)}\Big[D_{\mathrm{KL}}\big[E(Z \mid X) \,\|\, p(Z)\big]\Big], \tag{13}$$
where $D_{\mathrm{KL}}$ denotes the Kullback-Leibler (KL) divergence.
Thus, we obtain the constrained optimization problem
$$J(q, E) = \min_{q, E}\; \mathbb{E}_{x, y \sim p(x, y)}\Big[\mathbb{E}_{z \sim E(z \mid x)}\big[-\log q(y \mid z)\big]\Big] \quad \text{s.t.} \quad I(X, Z) \leq I_c, \tag{14}$$
where $I_c$ denotes a user-defined information constraint, $E(z \mid x)$ denotes the latent space encoder, and $q(y \mid z)$ denotes the mapping from the latent space to the label space. By solving this constrained optimization problem, we obtain a latent space that is sufficient for the final task and parsimonious, without redundant information.
In practice, one can take samples of $D_{\mathrm{KL}}\big[E(Z \mid X) \,\|\, p(Z)\big]$ to estimate the mutual information. While $E(Z \mid X)$ is straightforward to compute, calculating $p(Z)$ requires marginalization across the entire state space $\mathcal{S}$, which is intractable for most non-trivial environments. Instead, we introduce an approximation, $q(Z) \sim \mathcal{N}(0, I)$, to replace $p(Z)$. This yields an upper bound on $I(Z, X)$:
$$\begin{aligned} \mathbb{E}_{X \sim p(X)}\Big[D_{\mathrm{KL}}\big[E(Z \mid X) \,\|\, q(Z)\big]\Big] &= \int \mathrm{d}x\, p(x) \int \mathrm{d}z\, E(z \mid x) \log \frac{E(z \mid x)}{q(z)} \\ &= \iint \mathrm{d}x\, \mathrm{d}z\, p(x)\, E(z \mid x) \log E(z \mid x) - \int \mathrm{d}z\, p(z) \log q(z) \\ &\geq \iint \mathrm{d}x\, \mathrm{d}z\, p(x)\, E(z \mid x) \log E(z \mid x) - \int \mathrm{d}z\, p(z) \log p(z) \\ &= \int \mathrm{d}x\, p(x) \int \mathrm{d}z\, E(z \mid x) \log \frac{E(z \mid x)}{p(z)} = I(Z, X), \end{aligned} \tag{15}$$
where the inequality holds because of the non-negativity of the KL divergence, i.e., $D_{\mathrm{KL}}\big[p(z) \,\|\, q(z)\big] \geq 0$.
Substituting Equation (15) into Equation (14) yields
$$\tilde{J}(q, E) = \min_{q, E}\; \mathbb{E}_{x, y \sim p(x, y)}\Big[\mathbb{E}_{z \sim E(z \mid x)}\big[-\log q(y \mid z)\big]\Big] \quad \text{s.t.} \quad \mathbb{E}_{x \sim p(x)}\Big[D_{\mathrm{KL}}\big[E(z \mid x) \,\|\, p(z)\big]\Big] \leq I_c, \tag{16}$$
where $\tilde{J}(q, E) \geq J(q, E)$ is an upper bound of the objective function. By introducing a Lagrangian multiplier $\beta$, we convert the constrained optimization problem into the unconstrained problem
$$\min_{q, E}\; \mathbb{E}_{x, y \sim p(x, y)}\Big[\mathbb{E}_{z \sim E(z \mid x)}\big[-\log q(y \mid z)\big]\Big] + \beta\Big(\mathbb{E}_{x \sim p(x)}\Big[D_{\mathrm{KL}}\big[E(z \mid x) \,\|\, p(z)\big]\Big] - I_c\Big). \tag{17}$$
By solving this problem, we obtain a sufficient and parsimonious representation of the original data distribution. Alemi et al. showed that the variational information bottleneck approach suppresses parameter over-fitting and that the obtained model is robust to adversarial attacks.
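The following is a minimal PyTorch sketch of the unconstrained VIB objective in Equation (17) for a generic classification task. The architecture, layer sizes, and the name VIBClassifier are assumptions for illustration, not the paper's exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    """Diagonal-Gaussian encoder E(z|x) plus a simple decoder q(y|z)."""
    def __init__(self, x_dim, z_dim, y_dim):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 2 * z_dim))
        self.dec = nn.Linear(z_dim, y_dim)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization trick
        return self.dec(z), mu, log_var

def vib_loss(model, x, y, beta=1e-3):
    """-E[log q(y|z)] + beta * E[KL(E(z|x) || N(0, I))], cf. Eq. (17)."""
    logits, mu, log_var = model(x)
    ce = F.cross_entropy(logits, y)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(-1).mean()
    return ce + beta * kl
```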

3. VIB Based Meta-Reinforcement Learning

3.1. Overview

The latent space-based robotic skill transfer learning framework is shown in Figure 1 and includes two stages. During the meta-training stage, we sample $M$ training tasks from the source task space; for each task, the RL agent interacts with the environment and obtains several trajectories $e_{1:K}^{\mathcal{T}}$, and we iteratively optimize the latent space encoder $E_{\omega}(z \mid e^{\mathcal{T}_m})$ and the latent-conditioned policy $\pi_{\theta}(a \mid s, z^{\mathcal{T}_m})$, which are both parameterized by deep neural networks with parameters $\omega$ and $\theta$, respectively. Once the training process converges, the two networks are fixed and reused for the test tasks. During the meta-testing stage, we sample another $N$ tasks; the latent space encoder infers the specific task from very few interaction samples, and the latent vector then informs the policy that synthesizes the final behavior. During the test stage there is no gradient update, so the agent can reuse the obtained skills, which is very efficient. Moreover, since the latent space informs the agent how to synthesize the new policy according to the real-time data sequence of a specific task, improved performance can be achieved.
Following the maximum entropy actor-critic framework, the latent-based constrained optimization problem is formulated as
$$J(\pi, E) = \max_{\pi, E}\; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\Big[\mathbb{E}_{z \sim E(z \mid e^{\mathcal{T}})}\big[R(e^{\mathcal{T}}, z) + \alpha \mathcal{H}(\pi(\cdot \mid s_t, z_t))\big]\Big] \quad \text{s.t.} \quad I(e^{\mathcal{T}}, Z) \leq I_c, \tag{18}$$
where $R(e^{\mathcal{T}}, z)$ denotes the latent-space-informed discounted return of trajectory $e^{\mathcal{T}}$.
According to the variational information bottleneck theory, introducing the variational bound yields
$$\hat{J}(\pi, E) = \max_{\pi, E}\; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\Big[\mathbb{E}_{z \sim E(z \mid e^{\mathcal{T}})}\big[R(e^{\mathcal{T}}, z) + \alpha \mathcal{H}(\pi(\cdot \mid s_t, z_t))\big]\Big] \quad \text{s.t.} \quad \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\Big[D_{\mathrm{KL}}\big[E(z \mid e^{\mathcal{T}}) \,\|\, p(z)\big]\Big] \leq I_c, \tag{19}$$
where $\hat{J}(\pi, E)$ is the surrogate objective obtained from $J(\pi, E)$ by replacing the mutual information constraint with its variational upper bound in Equation (15).
According to the maximum entropy Bellman optimality principle, the optimal action-state value function for problem (19) is
$$Q(s_t, a_t, z_t) = r(s_t, a_t, z_t) + \mathbb{E}_{(s, a) \sim \mathcal{D}^{\mathcal{T}_i},\, z \sim E(z \mid e^{\mathcal{T}})}\left[\sum_{k=1}^{\infty} \gamma^k \big(r(s_{t+k}, a_{t+k}, z_{t+k}) + \alpha \mathcal{H}(\pi(\cdot \mid s_{t+k}, z_{t+k}))\big)\right], \tag{20}$$
with the optimal state value function
$$V(s_t, z_t) = \alpha \log \int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q(s_t, a, z_t)\right) \mathrm{d}a. \tag{21}$$
According to Lemma 1, the optimal Bellman equation is
$$\mathcal{B}^{\pi} Q(s_t, a_t, z_t) = r(s_t, a_t, z_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p_s}\big[V(s_{t+1}, z_{t+1})\big], \tag{22}$$
and the optimal policy follows as
$$\pi^*(a_t \mid s_t, z_t) = \exp\left(\frac{1}{\alpha}\big(Q(s_t, a_t, z_t) - V(s_t, z_t)\big)\right). \tag{23}$$
For practical robotic skill learning problems, the state space and action space are usually very high dimensional, so we use deep neural networks to approximate the action-state value function $Q_{\phi}$, the state value function $V_{\psi}$, the policy $\pi_{\theta}$, and the latent space encoder $E_{\omega}$, with network parameters $\phi$, $\psi$, $\theta$, and $\omega$, respectively.
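To make the parameterization concrete, the sketch below shows a latent-conditioned Gaussian policy $\pi_{\theta}(a \mid s, z)$ in PyTorch, where the task latent is simply concatenated with the state. The tanh-squashing convention and the exact layer sizes (three hidden layers of 300 units, cf. Table 2) are assumptions of this sketch rather than a verbatim description of the authors' implementation.

```python
import torch
import torch.nn as nn

class LatentConditionedActor(nn.Module):
    """Gaussian policy pi_theta(a | s, z) conditioned on the task latent z."""
    def __init__(self, state_dim, latent_dim, action_dim, hidden=300):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state, z):
        h = self.body(torch.cat([state, z], dim=-1))
        std = self.log_std(h).clamp(-20, 2).exp()
        dist = torch.distributions.Normal(self.mu(h), std)
        u = dist.rsample()                                # reparameterized pre-squash action
        a = torch.tanh(u)                                 # squash into the action range
        # log pi(a | s, z) with the tanh change-of-variables correction
        log_pi = (dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)).sum(-1, keepdim=True)
        return a, log_pi
```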

3.2. Latent Space Learning

From Equation (19), the latent space model should simultaneously satisfy two requirements: maximizing the entropy-augmented return for task accomplishment and respecting the information bottleneck constraint. Therefore, the latent learning algorithm should simultaneously minimize the Bellman residual $\|\mathcal{B}^{\pi} Q - Q\|$ and the KL divergence between the latent posterior and the prior $p(z)$; the cost function is formulated as
$$J_{E_{\omega}}(\mathcal{D}, z) = \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \frac{1}{2} \sum_{(s_t, a_t, s_t') \in \mathcal{D}} \Big(Q_{\phi}(s_t, a_t, z_t) - r(s_t, a_t, z_t) - \gamma\, \mathbb{E}_{s_{t+1} \sim p,\, z_{t+1} \sim E}\big[V_{\bar{\psi}}(s_{t+1}, z_{t+1})\big]\Big)^2 + \beta\, D_{\mathrm{KL}}\big[E_{\omega}(z \mid e^{\mathcal{T}}) \,\|\, p(z)\big], \tag{24}$$
where $V_{\bar{\psi}}$ denotes the target network, introduced for algorithm stability. Taking computational feasibility into account, we implement $p(z)$ as a Gaussian distribution, i.e., $p(z) = \prod_{i=1}^{K} p(z_i) = \prod_{i=1}^{K} \mathcal{N}(0, I)$, where each factor is a unit Gaussian. Thus, $E_{\omega}$ can be modeled as a product of multivariate Gaussian factors,
$$E_{\omega}\big(z \mid e_{1:K}^{\mathcal{T}}\big) \propto \prod_{k=1}^{K} \mathcal{N}\big(\mu_{\omega}(e_k^{\mathcal{T}}),\, \sigma_{\omega}(e_k^{\mathcal{T}})\big), \tag{25}$$
where $\mu_{\omega}(e_k^{\mathcal{T}})$ and $\sigma_{\omega}(e_k^{\mathcal{T}})$ denote the mean and variance of the $k$-th factor. The latent space encoder network is updated according to
$$\omega_{t+1} \leftarrow \omega_t - \eta_{\omega}\, \hat{\nabla}_{\omega} J_{E_{\omega}}(\mathcal{D}, z), \tag{26}$$
where $\eta_{\omega}$ denotes the update rate and $\hat{\nabla}_{\omega}$ denotes the stochastic first-order gradient estimate.
With this Bayesian inference protocol, skills can be transferred from the source task space to new tasks by posterior inference. Furthermore, the intrinsic uncertainty estimate can be used to explore better ways of accomplishing a task.
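The sketch below illustrates the product-of-Gaussians encoder of Equation (25) together with the KL penalty used in Equation (24). The closed-form product rule for diagonal Gaussians (precisions add, means are precision-weighted) is standard; the per-transition network, layer sizes, and interfaces are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """E_omega(z | e_{1:K}): a product of one Gaussian factor per context transition."""
    def __init__(self, transition_dim, latent_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),
        )

    def forward(self, context):                      # context: (K, transition_dim)
        mu, log_var = self.net(context).chunk(2, dim=-1)
        prec = (-log_var).exp()                      # per-factor precisions 1 / sigma_k^2
        var = 1.0 / prec.sum(dim=0)                  # product-of-Gaussians variance
        mean = var * (prec * mu).sum(dim=0)          # precision-weighted mean
        z = mean + var.sqrt() * torch.randn_like(mean)          # reparameterized latent sample
        kl = 0.5 * (mean.pow(2) + var - var.log() - 1).sum()    # KL to N(0, I), the Eq. (24) penalty
        return z, kl
```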

3.3. VIB Based Meta-Reinforcement Learning Algorithm

According to Lemma 1, the Bellman operator in Equation (22) is a contraction mapping, i.e., the optimal action-state value function is its fixed point. The cost function for the action-state value function is therefore
$$J_{Q_{\phi}}(\mathcal{D}, z) = \frac{1}{2}\, \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}\Big[\Big(Q_{\phi}(s_t, a_t, z_t) - r(s_t, a_t, z_t) - \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[V_{\bar{\psi}}(s_{t+1}, z_{t+1})\big]\Big)^2\Big], \tag{27}$$
and the action-state value function network parameters are updated by gradient descent,
$$\phi_{t+1} \leftarrow \phi_t - \eta_{\phi}\, \hat{\nabla}_{\phi} J_{Q_{\phi}}(\mathcal{D}, z), \tag{28}$$
where $\eta_{\phi}$ denotes the learning rate and $\hat{\nabla}_{\phi}$ denotes an unbiased estimate of the gradient.
According to Equation (21), the cost function for the state value function is
$$J_{V_{\psi}}(\mathcal{D}, z) = \frac{1}{2}\, \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}\Big[\Big(V_{\psi}(s_t, z_t) - \mathbb{E}_{a_t \sim \pi_{\theta}}\big[Q_{\phi}(s_t, a_t, z_t) - \alpha \log \pi_{\theta}(a_t \mid s_t, z_t)\big]\Big)^2\Big], \tag{29}$$
where the action-state value function is evaluated at actions sampled from the current policy. The state value function network parameters are updated as
$$\psi_{t+1} \leftarrow \psi_t - \eta_{\psi}\, \hat{\nabla}_{\psi} J_{V_{\psi}}(\mathcal{D}, z), \tag{30}$$
where $\eta_{\psi}$ denotes the learning rate and $\hat{\nabla}_{\psi}$ denotes an unbiased gradient estimate.
According to Equation (23), the cost function for the policy network is
$$J_{\pi_{\theta}}(\mathcal{D}, z) = D_{\mathrm{KL}}\left[\pi_{\theta}(a_t \mid s_t, z_t) \,\Big\|\, \exp\Big(\frac{1}{\alpha} Q_{\phi}(s_t, a_t, z_t) - \log Z_{\phi}(s_t, z_t)\Big)\right], \tag{31}$$
where $Z_{\phi}(s_t, z_t)$ denotes the normalizing function, which is irrelevant to the gradient of the policy objective. The policy network parameters are thus updated as
$$\theta_{t+1} \leftarrow \theta_t - \eta_{\theta}\, \hat{\nabla}_{\theta} J_{\pi_{\theta}}(\mathcal{D}, z), \tag{32}$$
where $\eta_{\theta}$ denotes the learning rate and $\hat{\nabla}_{\theta}$ denotes an unbiased gradient estimate.
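The sketch below assembles the three per-task losses of Equations (27), (29), and (31) in PyTorch. The batch layout, the detached targets, and the assumed policy interface (returning a reparameterized action together with its log-probability) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def per_task_losses(batch, z, q_net, v_net, v_target, policy, gamma=0.99, alpha=0.2):
    """Per-task critic, value, and policy losses for one sampled mini-batch.

    batch = (s, a, r, s2) tensors from one task's replay buffer; z is the latent
    sampled from the context encoder (detached so the critics do not train it here).
    """
    s, a, r, s2 = batch
    z_b = z.detach().expand(s.shape[0], -1)       # broadcast the task latent over the batch

    # Critic loss, Eq. (27): soft Bellman residual against the target value network
    q_target = (r + gamma * v_target(s2, z_b)).detach()
    q_loss = 0.5 * F.mse_loss(q_net(s, a, z_b), q_target)

    # Value loss, Eq. (29): regress V toward E_{a~pi}[Q - alpha * log pi]
    a_new, log_pi = policy(s, z_b)
    v_backup = (q_net(s, a_new, z_b) - alpha * log_pi).detach()
    v_loss = 0.5 * F.mse_loss(v_net(s, z_b), v_backup)

    # Policy loss, Eq. (31): the KL objective reduces to E[alpha * log pi - Q] up to a constant
    pi_loss = (alpha * log_pi - q_net(s, a_new, z_b)).mean()
    return q_loss, v_loss, pi_loss
```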
Based on the aforementioned analysis, we formulate the VIB-based meta-reinforcement learning algorithm. The meta-training protocol is presented in Algorithm 1, and the meta-testing procedure is presented in Algorithm 2. We use the Adam optimizer for deep neural network training.
Algorithm 1 VIB based meta-reinforcement learning training algorithm.
Input: training data sets $\mathcal{D}$; learning rates $\eta_{\theta}, \eta_{\phi}, \eta_{\psi}, \eta_{\omega}$; meta-training task set $\{\mathcal{T}_m\}_{m=1,\ldots,M} \sim p(\mathcal{T})$.
Output: policy network $\pi_{\theta}$ and latent space encoder $E_{\omega}$.
1: Initialize the target network $\bar{\psi} \leftarrow \psi$ and the sample buffer $\mathcal{D}_m$ for each task
2: for each epoch do
3:   for each task $\mathcal{T}_m$ do
4:     Initialize the context trajectory $e^{\mathcal{T}_m} = \{\}$
5:     for $k = 1, \ldots, N$ do
6:       Infer the latent variable $z \sim E_{\omega}(z \mid e^{\mathcal{T}_m})$
7:       Let the current policy $\pi_{\theta}(a \mid s, z)$ interact with the task and add the collected samples to $\mathcal{D}_m$
8:       Update the context $e^{\mathcal{T}_m} = \{(s_j, a_j, s_j', r_j)\}_{j=1:K} \sim \mathcal{D}_m$
9:     end for
10:  end for
11:  for each training step do
12:    for each task $\mathcal{T}_m$ do
13:      Sample a context and a mini-batch from the buffer: $e^{\mathcal{T}_m}, d_m \sim \mathcal{D}_m$
14:      Infer the latent variable: $z \sim E_{\omega}(z \mid e^{\mathcal{T}_m})$
15:      Compute the action-state value cost: $J_m^Q = J_{Q_{\phi}}(d_m, z)$
16:      Compute the state value cost: $J_m^V = J_{V_{\psi}}(d_m, z)$
17:      Compute the policy cost: $J_m^{\pi} = J_{\pi_{\theta}}(d_m, z)$
18:      Compute the latent space encoder cost: $J_m^E = J_{E_{\omega}}(d_m, z)$
19:    end for
20:    Update the action-state value network: $\phi_{t+1} \leftarrow \phi_t - \eta_{\phi} \hat{\nabla}_{\phi} \sum_m J_m^Q$
21:    Update the state value network: $\psi_{t+1} \leftarrow \psi_t - \eta_{\psi} \hat{\nabla}_{\psi} \sum_m J_m^V$
22:    Update the policy network: $\theta_{t+1} \leftarrow \theta_t - \eta_{\theta} \hat{\nabla}_{\theta} \sum_m J_m^{\pi}$
23:    Update the latent space encoder network: $\omega_{t+1} \leftarrow \omega_t - \eta_{\omega} \hat{\nabla}_{\omega} \sum_m J_m^E$
24:  end for
25:  Update the target network $\bar{\psi} \leftarrow \psi$
26: end for
Algorithm 2 VIB based meta-reinforcement learning testing algorithm.
Input: meta-testing task set $\{\mathcal{T}_n\}_{n=1,\ldots,N} \sim p(\mathcal{T})$; meta-trained policy network $\pi_{\theta}$; meta-trained latent space encoder $E_{\omega}$.
1: Initialize the context trajectory $e^{\mathcal{T}} = \{\}$
2: for $k = 1, \ldots, N$ do
3:   Infer the latent variable: $z \sim E_{\omega}(z \mid e^{\mathcal{T}})$
4:   Use the current policy $\pi_{\theta}(a \mid s, z)$ to interact with the task and obtain $\mathcal{D}_k$
5:   Accumulate the context: $e^{\mathcal{T}} = e^{\mathcal{T}} \cup \mathcal{D}_k$
6:   Evaluate the empirical discounted return on the task
7: end for
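A minimal sketch of the meta-testing loop in Algorithm 2 is given below: the policy and encoder stay frozen, and adaptation happens purely through re-inferring the task latent from the accumulated context. The classic Gym-style environment interface, the helper names, and the encoder/policy call signatures (as in the sketches above) are assumptions of this illustration.

```python
import torch

@torch.no_grad()
def meta_test(env, policy, encoder, latent_dim, episodes=5):
    """Evaluate a new task by posterior inference only; no gradient steps are taken."""
    context, returns = [], []
    z = torch.zeros(latent_dim)                       # start from the prior mean of p(z) = N(0, I)
    for _ in range(episodes):
        s, done, ep_ret = env.reset(), False, 0.0
        while not done:
            a, _ = policy(torch.as_tensor(s, dtype=torch.float32), z)
            s2, r, done, _ = env.step(a.numpy())
            context.append(torch.as_tensor([*s, *a.numpy(), float(r), *s2], dtype=torch.float32))
            ep_ret, s = ep_ret + r, s2
        z, _ = encoder(torch.stack(context))          # re-infer z from all transitions collected so far
        returns.append(ep_ret)                        # empirical return of the k-th evaluation episode
    return returns
```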

4. Experiments

Our experiments aim to investigate the following questions:
  • Does the VIB-based meta-reinforcement learning algorithm realize efficient skill transfer?
  • How do the learning efficiency and asymptotic performance of the VIB-based meta-reinforcement learning algorithm compare with those of other meta-learning approaches, such as MAML and ProMP?
  • Can the VIB-based meta-reinforcement learning algorithm improve the learning performance during the training stage in comparison with other algorithms?

4.1. Experiments Configuration

To investigate the aforementioned questions, we implement the proposed algorithm on several challenging robotic locomotion tasks from the OpenAI Gym benchmark [59] and the rllab benchmark suite [21], which are implemented using MuJoCo [60], a 3D physics simulator with accurate contact modeling. More specifically, we build on four high-dimensional, highly nonlinear robots, including Walker2d, Half-Cheetah, Ant, and Humanoid, as shown in Figure 2. Walker2d is a planar robot with 7 links; the reward is given by $r(s, a) = v_x - 0.005\,\|a\|_2^2$, the episode is terminated when $z_{body} < 0.8$, $z_{body} > 2.0$, or $|\theta| > 1.0$, and the dimensions of the state space and action space are 17 and 6, respectively. Half-Cheetah is a planar biped robot with 9 links; the reward is given by $r(s, a) = v_x - 0.05\,\|a\|_2^2$, and the dimensions of the state space and action space are 17 and 6, respectively. Ant is a quadruped robot with 13 links; the reward is given by $r(s, a) = v_x - 0.005\,\|a\|_2^2 - C_{contact} + 0.05$, where the contact penalty is $C_{contact} = 5 \times 10^{-4}\,\|F_{contact}\|_2^2$ with $F_{contact}$ the contact force, the episode is terminated when $z_{body} < 0.2$ or $z_{body} > 1.0$, and the dimensions of the state space and action space are 111 and 8, respectively. Humanoid is a human-like robot with 17 joints, including the head, body trunk, two arms, and two legs; the reward is given by $r(s, a) = v_x - 5 \times 10^{-4}\,\|a\|_2^2 - C_{contact} - C_{deviation} + 0.2$, where $C_{contact} = 5 \times 10^{-6}\,\|F_{contact}\|_2^2$ and $C_{deviation} = 5 \times 10^{-3}\,(v_y^2 + v_z^2)$, the episode is terminated when $z_{body} < 0.8$ or $z_{body} > 2.0$, and the dimensions of the state space and action space are 376 and 17, respectively.
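As a concrete reading of the Walker2d specification above, the sketch below evaluates the stated reward and termination rule. The variable names (forward velocity v_x, torso height z_body, torso pitch theta) and how these quantities are read from the simulator state are assumptions for illustration.

```python
import numpy as np

def walker2d_reward_and_done(v_x, action, z_body, theta):
    """r(s, a) = v_x - 0.005 * ||a||^2; terminate outside the healthy height/pitch range."""
    reward = v_x - 0.005 * float(np.sum(np.square(action)))
    done = (z_body < 0.8) or (z_body > 2.0) or (abs(theta) > 1.0)
    return reward, done
```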
In building the source task space, we consider two classes of tasks. In the first class, the robot has the same configuration and physical parameters but performs different tasks, which is realized by setting different reward functions. This class includes Walker2d-Diff-Velocity and HalfCheetah-Diff-Velocity, where the robot moves forward at different velocities; the training task set includes 100 tasks with different velocity settings, and the testing task set includes 30 tasks with different velocity settings. It also includes Ant-Random-Goal and Humanoid-Random-Dir, where the robot moves in different directions in the 2D plane; the training task set includes 100 tasks with different directions, and the testing task set includes 30 tasks with different directions. In the second class, robots with different physical parameters realize the same task: in Walker2d-Diff-Params and Ant-Diff-Params, the robots have different physical parameters; the training task set includes 40 tasks with different physical parameters, and the testing task set includes 10 tasks with different physical parameters. To investigate the advantages of the proposed algorithm, we compare the VIB-based meta-reinforcement learning algorithm with MAML and ProMP, whose learning parameters are taken from the original papers.
Due to the universal approximation capability of DNNs, we parameterize the action-value function, the state-value function, the policy, and the latent space encoder as feed-forward DNNs with three hidden layers, each hidden layer containing 300 neurons. Table 2 lists the detailed parameters used in the algorithm for the comparative evaluation.

4.2. Comparative Study

To show the efficiency and effectiveness of the presented VIB-based off-policy actor-critic algorithm, we compare it with MAML and ProMP from two aspects, i.e., the sample efficiency during the training procedure and the asymptotic performance after the algorithm converges. To increase the statistical reliability of the results, all results are obtained by taking the empirical mean and variance over 10 random seeds. The results are shown in Figure 3, Figure 4 and Figure 5. The solid lines represent the mean, the shaded region around each solid line denotes one standard deviation, and the dotted lines denote the asymptotic performance of the different algorithms in the corresponding colors. From the results, the proposed algorithm simultaneously achieves a large improvement in sample efficiency during training and an improvement in asymptotic performance. We compute the performance improvement factor as
$$\#\mathrm{Times} = \frac{t^{\mathrm{MAML|ProMP}}_{\mathrm{ReachingAsymptotic}}}{t^{\mathrm{VIB}}_{\mathrm{ReachingAsymptotic}}}, \tag{33}$$
where $t^{\mathrm{VIB}}_{\mathrm{ReachingAsymptotic}}$ denotes the moment when the proposed VIB-based meta-RL algorithm reaches the asymptotic performance of the baseline MAML or ProMP, and $t^{\mathrm{MAML|ProMP}}_{\mathrm{ReachingAsymptotic}}$ denotes the moment when the baseline MAML or ProMP algorithm reaches its asymptotic performance. As shown in Table 3, the sample efficiency improves by at least 200 times. In particular, for the tasks Walker2d-Diff-Velocity and Humanoid-Random-Dir, a sample efficiency improvement of roughly 5000 times is achieved. The Humanoid task is especially challenging due to its extremely high-dimensional state and action spaces and its highly nonlinear dynamics. Therefore, the proposed VIB-based meta-reinforcement learning algorithm realizes efficient skill adaptation.
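A minimal sketch of how the #Times metric in Equation (33) can be evaluated from two learning curves is given below. The function name and the assumption that both curves are sampled on a common grid of environment steps (and eventually reach the baseline asymptote) are illustrative.

```python
import numpy as np

def sample_efficiency_ratio(steps, returns_vib, returns_baseline):
    """Ratio of the steps the baseline needs to reach its own asymptotic return
    to the steps the VIB-based algorithm needs to reach that same return."""
    steps = np.asarray(steps)
    returns_vib = np.asarray(returns_vib)
    returns_baseline = np.asarray(returns_baseline)
    asymptote = returns_baseline[-1]                              # baseline asymptotic performance
    t_baseline = steps[np.argmax(returns_baseline >= asymptote)]  # first step at which the baseline reaches it
    t_vib = steps[np.argmax(returns_vib >= asymptote)]            # first step at which VIB meta-RL reaches it
    return t_baseline / t_vib
```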

5. Conclusions

In this paper, we revisited the problem of efficiently adapting skills obtained with DRL methods to novel, unseen environments or tasks. Human beings can learn new skills very efficiently because we can draw inferences about other cases from one instance; acquired skills are thus adapted to new tasks and formulated as new skills. Inspired by this observation, we argued that if a robot can extract the basic tasks and the corresponding basic skills from the task space, then, when a new task is encountered, the basic tasks and basic skills can be efficiently transferred and combined into new skills. Therefore, we took advantage of variational information bottleneck techniques and developed a latent space-based meta-reinforcement learning algorithm. The empirical results on MuJoCo benchmark robotic locomotion tasks show that the variational information bottleneck-based meta-reinforcement learning algorithm realizes efficient skill learning and transfer. This work thus takes a substantial step toward implementing learning-based algorithms for practical robotic skill learning. In the future, we will apply our algorithm to more complicated tasks, such as quadrotor aerobatic flight with rapidly changing aerodynamics, and to higher-dimensional tasks, such as tasks with image input.

Author Contributions

Conceptualization, G.X. and S.D. (Songyi Dian); methodology, writing, visualization, G.X. and Z.L.; investigation, S.D. (Songyi Dian) and S.D. (Shaofeng Du); Validation, Z.L.; supervision, S.D. (Shaofeng Du). All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the Natural Science Foundation of Sichuan Province under Grants 2023NSFSC0475 and 2023NSFSC1441, the Fundamental Research Funds for the Central Universities under Grant 2022SCU12004, and the Funds for National Key Laboratory of Special Vehicle Design and Manufacturing Integration Technology under Grant GZ2022KF007.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529. [Google Scholar] [CrossRef] [PubMed]
  2. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Driessche, G.V.D.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of go with deep neural networks and tree search. Nature 2016, 529, 484. [Google Scholar] [CrossRef] [PubMed]
  3. Hou, Z.; Fei, J.; Deng, Y.; Xu, J. Data-efficient hierarchical reinforcement learning for robotic assembly control applications. IEEE Trans. Ind. Electron. 2020, 68, 11565–11575. [Google Scholar] [CrossRef]
  4. Funk, N.; Chalvatzaki, G.; Belousov, B.; Peters, J. Learn2assemble with structured representations and search for robotic architectural construction. In Proceedings of the 5th Conference on Robot Learning, PMLR, Auckland, New Zealand, 14–18 December 2022; pp. 1401–1411. [Google Scholar]
  5. Guez, A.; Vincent, R.D.; Avoli, M.; Pineau, J. Adaptive treatment of epilepsy via batch-mode reinforcement learning. AAAI 2008, 1671–1678. [Google Scholar]
  6. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  7. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436. [Google Scholar] [CrossRef]
  8. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.I.; Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; Volume 37, pp. 1889–1897. [Google Scholar]
  9. Haarnoja, T.; Tang, H.; Abbeel, P.; Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, JMLR.org, Sydney, Australia, 6–11 August 2017; pp. 1352–1361. [Google Scholar]
  10. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1856–1865. [Google Scholar]
  11. Nachum, O.; Norouzi, M.; Xu, K.; Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 2775–2785. [Google Scholar]
  12. McGuire, K.; Wagter, C.D.; Tuyls, K.; Kappen, H.; de Croon, G.C. Minimal navigation solution for a swarm of tiny flying robots to explore an unknown environment. Sci. Robot. 2019, 4, eaaw9710. [Google Scholar] [CrossRef]
  13. Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3357–3364. [Google Scholar]
  14. Hwangbo, J.; Lee, J.; Dosovitskiy, A.; Bellicoso, D.; Tsounis, V.; Koltun, V.; Hutter, M. Learning agile and dynamic motor skills for legged robots. Sci. Robot. 2019, 4, eaau5872. [Google Scholar] [CrossRef] [Green Version]
  15. Miki, T.; Lee, J.; Hwangbo, J.; Wellhausen, L.; Koltun, V.; Hutter, M. Learning robust perceptive locomotion for quadrupedal robots in the wild. Sci. Robot. 2022, 7, eabk2822. [Google Scholar] [CrossRef]
  16. Kopicki, M.S.; Belter, D.; Wyatt, J.L. Learning better generative models for dexterous, single-view grasping of novel objects. Int. J. Robot. Res. 2019, 38, 1246–1267. [Google Scholar] [CrossRef] [Green Version]
  17. Bhagat, S.; Banerjee, H.; Tse, Z.T.H.; Ren, H. Deep reinforcement learning for soft, flexible robots: Brief review with impending challenges. Robotics 2019, 8, 4. [Google Scholar] [CrossRef] [Green Version]
  18. Thuruthel, T.G.; Falotico, E.; Renda, F.; Laschi, C. Model-based reinforcement learning for closed-loop dynamic control of soft robotic manipulators. IEEE Trans. Robot. 2018, 35, 124–134. [Google Scholar] [CrossRef]
  19. Wang, C.; Zhang, Q.; Tian, Q.; Li, S.; Wang, X.; Lane, D.; Petillot, Y.; Wang, S. Learning mobile manipulation through deep reinforcement learning. Sensors 2020, 20, 939. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  20. Mahler, J.; Matl, M.; Satish, V.; Danielczuk, M.; DeRose, B.; McKinley, S.; Goldberg, K. Learning ambidextrous robot grasping policies. Sci. Robot. 2019, 4, eaau4984. [Google Scholar] [CrossRef] [PubMed]
  21. Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1329–1338. [Google Scholar]
  22. Munos, R.; Stepleton, T.; Harutyunyan, A.; Bellemare, M. Safe and efficient off-policy reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1054–1062. [Google Scholar]
  23. Kober, J.; Bagnell, J.A.; Peters, J. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 2013, 32, 1238–1274. [Google Scholar] [CrossRef] [Green Version]
  24. Deisenroth, M.P.; Neumann, G.; Peters, J. A survey on policy search for robotics. Found. Trends Robot. 2013, 2, 1–142. [Google Scholar]
  25. Dulac-Arnold, G.; Levine, N.; Mankowitz, D.J.; Li, J.; Paduraru, C.; Gowal, S.; Hester, T. Challenges of real-world reinforcement learning: Definitions, benchmarks and analysis. Mach. Learn. 2021, 110, 2419–2468. [Google Scholar] [CrossRef]
  26. Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; Meger, D. Deep reinforcement learning that matters. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  27. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
  28. Braun, D.A.; Aertsen, A.; Wolpert, D.M.; Mehring, C. Learning optimal adaptation strategies in unpredictable motor tasks. J. Neurosci. 2009, 29, 6472–6478. [Google Scholar] [CrossRef] [Green Version]
  29. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2009, 22, 1345–1359. [Google Scholar] [CrossRef]
  30. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef] [Green Version]
  31. Howard, J.; Ruder, S. Universal language model fine-tuning for text classification. arXiv 2018, arXiv:1801.06146. [Google Scholar]
  32. Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
  33. Taylor, M.E.; Stone, P. Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res. 2009, 10. [Google Scholar]
  34. Thrun, S.; Pratt, L. Learning to Learn; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  35. Yu, T.; Quillen, D.; He, Z.; Julian, R.; Hausman, K.; Finn, C.; Levine, S. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Proceedings of the Conference on Robot Learning, PMLR, Cambridge, MA, USA, 16–18 November 2020; pp. 1094–1100. [Google Scholar]
  36. Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; Lillicrap, T. Meta-learning with memory-augmented neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 1842–1850. [Google Scholar]
  37. Andrychowicz, M.; Denil, M.; Gomez, S.; Hoffman, M.W.; Pfau, D.; Schaul, T.; Shillingford, B.; Freitas, N.D. Learning to learn by gradient descent by gradient descent. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
  38. Hochreiter, S.; Younger, A.S.; Conwell, P.R. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2001; pp. 87–94. [Google Scholar]
  39. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
  40. Xu, Z.; van Hasselt, H.P.; Silver, D. Meta-gradient reinforcement learning. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar] [CrossRef]
  41. Finn, C.; Yu, T.; Zhang, T.; Abbeel, P.; Levine, S. One-shot visual imitation learning via meta-learning. In Proceedings of the Conference on Robot Learning PMLR, Mountain View, CA, USA, 13–15 November 2017; pp. 357–368. [Google Scholar]
  42. Liu, H.; Socher, R.; Xiong, C. Taming maml: Efficient unbiased meta-reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 4061–4071. [Google Scholar]
  43. Rothfuss, J.; Lee, D.; Clavera, I.; Asfour, T.; Abbee, P. Promp: Proximal meta-policy search. arXiv 2018, arXiv:1810.06784. [Google Scholar]
  44. Gupta, A.; Mendonca, R.; Liu, Y.; Abbeel, P.; Levine, S. Meta-reinforcement learning of structured exploration strategies. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar] [CrossRef]
  45. Pastor, P.; Kalakrishnan, M.; Righetti, L.; Schaal, S. Towards associative skill memories. In Proceedings of the 2012 12th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2012), Osaka, Japan, 29 November–1 December 2012; pp. 309–315. [Google Scholar]
  46. Pastor, P.; Kalakrishnan, M.; Meier, F.; Stulp, F.; Buchli, J.; Theodorou, E.; Schaal, S. From dynamic movement primitives to associative skill memories. Robot. Auton. Syst. 2013, 61, 351–361. [Google Scholar] [CrossRef]
  47. Rueckert, E.; Mundo, J.; Paraschos, A.; Peters, J.; Neumann, G. Extracting low-dimensional control variables for movement primitives. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 1511–1518. [Google Scholar]
  48. Sutton, R.S.; Precup, D.; Singh, S. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 1999, 112, 181–211. [Google Scholar] [CrossRef] [Green Version]
  49. Kulkarni, T.D.; Narasimhan, K.; Saeedi, A.; Tenenbaum, J. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  50. Mendonca, M.R.; Ziviani, A.; Barreto, A.M. Graph-based skill acquisition for reinforcement learning. ACM Comput. Surv. (CSUR) 2019, 52, 1–26. [Google Scholar] [CrossRef]
  51. Lenz, I.; Knepper, R.A.; Saxena, A. Deepmpc: Learning deep latent features for model predictive control. In Proceedings of the Robotics: Science and Systems, Rome, Italy, 13–17 July 2015. [Google Scholar]
  52. Du, S.; Krishnamurthy, A.; Jiang, N.; Agarwal, A.; Dudik, M.; Langford, J. Provably efficient rl with rich observations via latent state decoding. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 1665–1674. [Google Scholar]
  53. Candès, E.J.; Li, X.; Ma, Y.; Wright, J. Robust principal component analysis? J. Acm (JACM) 2011, 58, 1–37. [Google Scholar] [CrossRef]
  54. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv 2000, arXiv:physics/0004057. [Google Scholar]
  55. Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the information bottleneck theory of deep learning. J. Stat. Mech. Theory Exp. 2019, 2019, 124020. [Google Scholar] [CrossRef]
  56. Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv 2016, arXiv:1612.00410. [Google Scholar]
  57. Wang, H.Q.; Guo, X.; Deng, Z.H.; Lu, Y. Rethinking minimal sufficient representation in contrastive learning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; IEEE: New York, NY, USA, 2022; pp. 16041–16050. [Google Scholar]
  58. Peng, X.B.; Kanazawa, A.; Toyer, S.; Abbeel, P.; Levine, S. Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow. arXiv 2018, arXiv:1810.00821. [Google Scholar]
  59. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv 2016, arXiv:1606.01540. [Google Scholar]
  60. Todorov, E.; Erez, T.; Tassa, Y. Mujoco: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Algarve, Portugal, 7–12 October 2012; IEEE: New York, NY, USA, 2012; pp. 5026–5033. [Google Scholar]
Figure 1. Latent space-based robotic skill transfer learning framework.
Figure 2. Mujoco robot learning tasks used for skill transfer learning.
Figure 3. Performance comparisons for Walker2d (a) and Half-Cheetah (b) with different forward velocity transfer learning tasks.
Figure 4. Performance comparisons for Ant (a) and Humanoids (b) with different goal location transfer learning tasks.
Figure 5. Performance comparisons for Walker2d (a) and Ant (b) with different physical parameters transfer learning tasks.
Table 1. Definitions and descriptions of variables used for meta-reinforcement learning.

Symbol | Function | Description
$p(\mathcal{T})$ | Task distribution | Characterizes a class of tasks
$\mathcal{T}$ | Task | A specific task described by an MDP
$\mathcal{S}$ | State space | All tasks in $p(\mathcal{T})$ share the same state space
$\mathcal{A}$ | Action space | All tasks in $p(\mathcal{T})$ share the same action space
$\mathcal{P}$ | Transition function space | Includes varying transition functions, i.e., different robot dynamics
$\mathcal{R}$ | Bounded reward function space | Includes varying reward functions, i.e., different tasks
$\mathcal{T}_i \sim p(\mathcal{T})$ | Meta-training task set | $M$ tasks sampled from the source task space
$\mathcal{D}^{\mathcal{T}_i}$ | Training set | The meta-training data set
$\mathcal{T}_j \sim p(\mathcal{T})$ | Meta-testing task set | $N$ tasks sampled for testing
$\mathcal{D}^{\mathcal{T}_j}$ | Testing set | The meta-testing data set
$e_{1:K}^{\mathcal{T}}$ | Trajectory | A trajectory for task $\mathcal{T}$
Table 2. Hyperparameters for the proposed meta-reinforcement learning algorithm.

Parameter | Symbol | Value
Optimization algorithm | – | Adam
Learning rate | $\eta_{\phi}, \eta_{\psi}, \eta_{\theta}, \eta_{\omega}$ | $3 \times 10^{-4}$
Discount factor | $\gamma$ | 0.99
Entropy weighting | $1/\alpha$ | 5
Lagrange multiplier | $\beta$ | 0.1
Information constraint | $I_c$ | 1.0
Number of hidden layers | $Q$, $V$, $\pi$ | 3
Number of hidden layers | $E$ | 3
Number of neurons in each layer | $Q$, $V$, $\pi$ | 300
Number of neurons in each layer | $E$ | 200
Nonlinear activation | – | ReLU
Maximum path length | – | 200
Samples per mini-batch | $M$ | 256
Target network update frequency | $\tau$ | 1000
Table 3. Quantitative study of the proposed meta-reinforcement learning algorithm.

Task | Performance Improvement (#Times)
Walker2d-Diff-Velocity | ≈5000
HalfCheetah-Diff-Velocity | ≈4000
Ant-Forward-Back | ≈2500
Humanoid-Random-Dir | ≈5000
Walker2d-Diff-Params | ≈200
Ant-Diff-Params | ≈200