Article

Prioritized Hindsight with Dual Buffer for Meta-Reinforcement Learning

by Sofanit Wubeshet Beyene and Ji-Hyeong Han *
Department of Computer Science and Engineering, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2022, 11(24), 4192; https://doi.org/10.3390/electronics11244192
Submission received: 31 October 2022 / Revised: 7 December 2022 / Accepted: 12 December 2022 / Published: 15 December 2022
(This article belongs to the Special Issue Advanced Machine Learning for Intelligent Robotics)

Abstract:
Sharing prior knowledge across multiple robotic manipulation tasks is a challenging research topic. Although the state-of-the-art deep reinforcement learning (DRL) algorithms have shown immense success in single robotic tasks, it is still challenging to extend these algorithms to be applied directly to resolve multi-task manipulation problems. This is mostly due to the problems associated with efficient exploration in high-dimensional state and continuous action spaces. Furthermore, in multi-task scenarios, the problem of sparse reward and sample inefficiency of DRL algorithms is exacerbated. Therefore, we propose a method to increase the sample efficiency of the soft actor-critic (SAC) algorithm and extend it to a multi-task setting. The agent learns a prior policy from two structurally similar tasks and adapts the policy to a target task. We propose a prioritized hindsight with dual experience replay to improve the data storage and sampling technique, which, in turn, assists the agent in performing structured exploration that leads to sample efficiency. The proposed method separates the experience replay buffer into two buffers to contain real trajectories and hindsight trajectories to reduce the bias introduced by the hindsight trajectories in the buffer. Moreover, we utilize high-reward transitions from previous tasks to assist the network in easily adapting to the new task. We demonstrate the proposed method based on several manipulation tasks using a 7-DoF robotic arm in RLBench. The experimental results show that the proposed method outperforms vanilla SAC in both a single-task setting and multi-task setting.

1. Introduction

The use of robots in industry has been increasing immensely. Since we are entering an era in which robots will take over most physical workloads, efficient methods of training and utilizing robots are essential [1]. One possible approach is to develop robots that can continuously acquire new skills. The introduction of deep learning and advancements in computing power have led to a surge of interest in developing intelligent multi-tasking robots. However, learning new motor skills to complete tasks is challenging and requires a variety of methods. Deep reinforcement learning (DRL) has proven to be effective in training robots to perform a single task [2,3]. Even though DRL has succeeded in single-task scenarios, it is still challenging to extend a trained agent to undertake multiple tasks. If tasks are well-defined and structurally similar, then knowledge of one task can potentially help in learning a new task. In particular, successful experiences from one task can be used as a base or prior knowledge for a structurally similar task.
The challenges in multi-task learning include sparse rewards and the sample inefficiency of DRL algorithms. There have been different approaches to tackling these challenges. One approach is to learn to distill a shared representation from different but related tasks, i.e., policy distillation and actor-mimic [4,5]. Another approach is to base the learning process of the target task on the parameters of the source task, which are used as prior knowledge [6]. Most previous research has been conducted on discrete action space environments using on-policy algorithms. Thus, in this paper, we tackle continuous action space environments using off-policy algorithms, i.e., robot multi-task learning. The proposed method increases the sample efficiency of the soft actor-critic (SAC) algorithm, which is an off-policy algorithm, and extends it to a multi-task setting so that a robot can learn a new task based on previously learned tasks. We verify the proposed method by training a robot manipulator on three different tasks. Firstly, a manipulator learns a set of prior policies from two base tasks, which are reaching a target (ReachTarget) and closing a box (CloseBox); then, it adapts the prior policies to the target task, which is closing a microwave (CloseMicrowave), with a small number of samples. These three tasks have high-dimensional state spaces and sparse reward settings. Hence, the conventional SAC fails to learn due to sample inefficiency, and the sparse reward problem is aggravated. Therefore, we propose prioritized hindsight experience replay to tackle these challenges by improving the data storage and sampling technique. Moreover, the proposed method tries to remember and utilize high-reward transitions in such a way that they assist the network in easily adapting to new tasks.
This paper is organized as follows. Section 2 introduces the previous research related to the proposed method. Section 3 proposes the SAC with prioritized hindsight experience replay for multi-task learning. In Section 4, the experimental environment and results of the proposed method are discussed. Finally, the concluding remarks follow in Section 5.

2. Related Works

Robotic manipulation deals with the way robots manipulate objects in their environment, such as packaging items, grasping objects, and organizing and arranging objects. Robotic manipulation and control tasks are heavily dependent on hardware-based solutions. In order to perform a certain manipulation task, the end-effector is moved to the target position by solving the forward and inverse kinematics equations. The series of actuator positions obtained by solving these equations is stored in memory, and the robot performs manipulation tasks based on these instructions [7]. This method of robot control produces deterministic movements in a controlled environment.
Modern robots should work within unstructured and dynamic environments. Therefore, robot control algorithms should be independent of the assigned task to learn continuously and adapt to the changing environments. There are several methods for controlling and teaching robots to accomplish tasks. These include direct programming, imitation learning, and reinforcement learning. To work successfully, direct programming and imitation learning require a precisely designed environment, an exact mathematical model, and a massive amount of human intervention.
To solve these problems of the traditional approach, reinforcement-learning-based methods have been applied to robotic control and manipulation tasks [8,9,10]. Robotic control and manipulation tasks can be modeled as partially observable Markov decision processes (POMDPs) with a high-dimensional state space and a continuous action space. Several studies have been conducted to adopt DRL for this purpose. State-of-the-art algorithms such as DDPG and SAC have mainly been used in reaching-target, pick-and-place, pull-door, push-door, and insertion tasks [11,12,13]. In the following subsections, we briefly introduce several DRL approaches for multi-task robotic manipulation. Moreover, we explain the basics of off-policy reinforcement learning based on the policy gradient along with the experience sampling method, because off-policy reinforcement learning is more suitable for robotic control and manipulation tasks than on-policy reinforcement learning when we consider the characteristics of these tasks, i.e., a high-dimensional state space, a continuous action space, and the difficulty of obtaining many successful experiences.

2.1. Deep Reinforcement Learning for Multi-Task Robotics Manipulation

Researchers have attempted to accomplish lifelong learning through the acquisition of general abilities across a variety of tasks. Among these methods, multi-task reinforcement learning, transfer learning, and meta-reinforcement learning approaches are discussed in this section. Multi-task reinforcement learning is an approach that tries to achieve generalization over similar tasks. The intuition behind multi-task learning is that, by concurrently training on multiple tasks, an agent can exploit knowledge gained from one task to improve the learning of other similar tasks, which, in turn, increases the overall performance of the agent on those tasks [14,15]. In the multi-task learning approach, the training and test tasks are the same.
Another approach is transfer learning, which tries to achieve adaptability and generalization across tasks by directly transferring information from a source task to a target task. Fine-tuning on target task and transferring parameters—i.e., policy—are widely used techniques [16,17]. In the case of transfer learning, the training task, which is the source task, is different from the test task, which is the target task.
A similar but distinct approach is meta-reinforcement learning, which follows a learning-to-learn strategy [18]. This paradigm focuses on improving learning performance in addition to enhancing task performance. Improvement of learning performance is achieved by training over multiple related tasks and optimizing a meta-objective, which evaluates the prior knowledge with respect to the new data in the training batch. Finally, the agent’s performance is measured by applying the resulting policy to a new task with a small amount of training data. These three methods have shown successful results in different domains.
Applying these methods directly to generalize over robotic arm manipulation tasks has been a long-standing research challenge [19,20,21]. Even though concurrently training multiple robotic tasks allows policies to be shared among different tasks, solving the resulting optimization problem is significantly challenging. This is due to the difficulty of selecting parameters that can be used across the tasks so that gradients from the different tasks interfere positively with each other. The problem is aggravated as the number of tasks increases. To mitigate negative transfer, researchers have proposed using a separate network for each task and distilling the similarities of the gradients into a shared policy [22]. Another approach is to decompose network policies into task-specific and modular sub-policies, in which the sub-policies are selected and shared across tasks by the learned task-specific policies [23]. Furthermore, there have been attempts to reduce negative interference by directly adjusting gradients that can potentially cause conflicts, using cosine similarity to detect the conflicting gradients [24] (a minimal sketch of this check is given below). However, when task distributions are significantly broad and there is a large gradient variance within each task itself, solving the optimization problem by relying on the similarity of gradients becomes unstable. Therefore, another mechanism has been proposed that uses composition models to reduce conflicts between gradients of different tasks: the policy is decomposed into modules, and tasks are solved by re-combining the pre-defined modules. However, using pre-defined modules to alleviate negative transfer is not scalable. Hence, [23] devised a method to route through the network and use soft combinations over modules.
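As an illustration of this cosine-similarity check, the sketch below (our own simplification of the gradient-surgery idea in [24], not the reference implementation) detects a conflict between two flattened task gradients and projects one onto the normal plane of the other.

```python
import numpy as np

def resolve_conflict(g_i: np.ndarray, g_j: np.ndarray) -> np.ndarray:
    """Return g_i, projected away from g_j if the two task gradients conflict.

    A conflict is declared when the dot product (equivalently the cosine
    similarity) of the flattened gradients is negative; g_i is then projected
    onto the normal plane of g_j, as in gradient surgery [24]. Illustrative
    sketch only.
    """
    dot = float(np.dot(g_i, g_j))
    if dot < 0.0:  # negative cosine similarity -> conflicting gradients
        g_i = g_i - (dot / (np.dot(g_j, g_j) + 1e-12)) * g_j
    return g_i

# Toy usage with two conflicting 3-D gradients.
print(resolve_conflict(np.array([1.0, 0.5, -0.2]), np.array([-0.8, 0.1, 0.3])))
```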
In addition to the negative transfer of gradients, sparse rewards and the sample inefficiency of reinforcement learning algorithms also contribute to the challenges of training a robotic arm to perform multiple tasks. The sparse reward problem occurs when the reward function is zero over most of its domain and only takes positive values for a few state transitions. This presents a significant learning difficulty for DRL algorithms, since the reward is the only feedback used to decide whether the actions taken are good or bad. The problem of sparse rewards also has the potential to cause sample inefficiency. Depending on the sampling technique used, the samples drawn from the replay memory may mostly contain trajectories with zero-valued rewards. Hence, the agent will have difficulty updating its parameters due to the lack of sufficient information.
In order to mitigate the problem of sparse rewards, researchers have manually defined and shaped rewards to compensate for delayed rewards. Manually defining rewards has been used across reinforcement learning tasks; however, research has shown that this approach can easily result in sub-optimal performance in robotic manipulation tasks [25,26]. To compensate for the lack of well-designed reward functions, researchers have also used demonstrations from experts. These demonstrations are used as prior knowledge and can be in the form of rules or behavioral trajectories [27,28]. Another approach is to devise a way for the agent to be informed about the importance of steps even when there is no corresponding reward. One way to achieve this is by incorporating sub-environments, which can be considered as intermediate steps [29]. The agent is trained under multiple parallel environments by automatically assigning rewards from these sub-environments.
Therefore, in this paper, we propose a method to train a robotic arm to perform an unseen task based on prior tasks and to solve the challenges that are mentioned above by extending SAC to a multi-task setting and allowing a non-uniform sampling technique in which both trajectories with higher losses and successful samples are drawn from two different replay buffers.

2.2. Off-Policy Reinforcement Learning Algorithms and Experience Sampling

Soft actor-critic (SAC) is a model-free, off-policy reinforcement learning algorithm based on entropy regularization. Entropy defines how random the policy is when sampling and corresponds to the expected amount of information carried by the policy. Mathematically, it can be defined as $H(\pi(\cdot \mid s_t)) = -\sum_{a_t} \pi(a_t \mid s_t) \log \pi(a_t \mid s_t)$, where $\pi$ denotes the policy function, and $a_t$ and $s_t$ denote the action and state, respectively, at time t. In SAC, this entropy function is considered part of the objective function. Therefore, both the state-value and action-value functions are modified to incorporate this entropy term, and they are named the soft state-value function and the soft action-value function, respectively. Entropy regularization has been shown to assist policy optimization by increasing exploration capability, which is essential for environments with a high-dimensional state space. Higher entropy encourages exploration of the state space and prevents the policy from converging to sub-optimal values by exploiting only a few actions [30].
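For intuition, the sketch below evaluates the entropy term for a toy discrete action distribution; it is only an illustration of the definition above (SAC itself uses a Gaussian policy, whose entropy is obtained from its log-density).

```python
import numpy as np

def policy_entropy(probs) -> float:
    """H(pi(.|s_t)) = -sum_a pi(a|s_t) * log pi(a|s_t)."""
    probs = np.asarray(probs, dtype=np.float64)
    return float(-np.sum(probs * np.log(probs + 1e-12)))

# A near-uniform policy has high entropy (more exploration) ...
print(policy_entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.386
# ... while a near-deterministic policy has low entropy.
print(policy_entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.17
```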
SAC incorporates an actor-critic approach, as shown in Figure 1, in which separate neural networks are used for the policy and value functions. As can be seen from Figure 1, the actor network learns by policy gradient and chooses an action following a Gaussian distribution, while the critic network evaluates the action by calculating the value function. Therefore, SAC has the advantages of both value-based and policy-based approaches. Moreover, in standard reinforcement learning, the expected sum of rewards $\sum_{t} \mathbb{E}_{(s_t, a_t)}[r(s_t, a_t)]$ is maximized, whereas in SAC the policy is trained to maximize both the cumulative discounted reward and the entropy, as described in Equation (1).
$J(\pi) = \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t)}\left[ r(s_t, a_t) + \alpha H(\pi(\cdot \mid s_t)) \right]$    (1)
where H measures the entropy term and α is a non-negative hyper-parameter that regularizes the stochasticity of the policy. Higher values of α signify higher entropy, which, accordingly, makes the variance of the policy distribution increase. As the variance of the policy distribution increases, there will be more actions to be chosen and better exploration. Lower values of α show a deterministic aspect of the policy [31].
$V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right]$    (2)
$Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[ V(s_{t+1}) \right]$    (3)
The soft value function $V(s_t)$ described in Equation (2) evaluates how good it is to be in a particular state, whereas the Q-function $Q(s_t, a_t)$ denoted by Equation (3) evaluates the state–action pair. From Equations (2) and (3), the Q-function $Q(s_t, a_t)$ is used to train the policy, whereas the soft value $V(s_t)$ is used to estimate the soft Q value.
As shown in Figure 1, the policy, the soft state-value function, and the soft Q-function are parameterized by $\theta$, $\psi$, and $\phi$, respectively. The actor network $\pi_\theta(s, a)$ produces the mean and the covariance, which define the Gaussian policy distribution from which actions are sampled.
$J_V(\psi) = \mathbb{E}_{s_t \sim D}\left[ \tfrac{1}{2}\left( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\theta}\left[ Q_\phi(s_t, a_t) - \alpha \log \pi_\theta(a_t \mid s_t) \right] \right)^2 \right]$    (4)
During training, the soft state-value network $V_\psi$ is trained to minimize the mean squared error between its prediction and the expected value, under the entropy-augmented policy $\pi_\theta$, of the soft Q-function $Q_\phi$ minus the entropy term, computed over data sampled from the experience replay buffer D, as described in Equation (4).
$J_Q(\phi) = \mathbb{E}_{(s_t, a_t) \sim D}\left[ \tfrac{1}{2}\left( Q_\phi(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2 \right]$    (5)
The soft action-value function $Q_\phi$ is trained to minimize the objective function (the soft Bellman residual) described in Equation (5), where $\hat{Q}$ is evaluated according to Equation (3) using a copy of the value function, referred to as the target value function and approximated by $V_{\bar{\psi}}$. The target value network $V_{\bar{\psi}}$ is added to stabilize the training process.
$\pi_{new} = \arg\min_{\pi' \in \Pi} D_{KL}\!\left( \pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\!\left( Q^{\pi_{old}}(s_t, \cdot) \right)}{Z^{\pi_{old}}(s_t)} \right)$    (6)
As described in Equation (6), the policy update incorporates the Kullback–Leibler (KL) divergence $D_{KL}$ between the policy and the exponentiated Q-function. The KL divergence evaluates how one distribution differs from another; thus, it acts as a distance measure on the space of probability distributions. In Equation (6), the first distribution $\pi'(\cdot \mid s_t)$ belongs to a set of Gaussian distributions $\Pi$. The second part of the equation, which can be rewritten as $\exp\!\left(\tfrac{1}{\alpha} Q^{\pi_{old}}(s_t, \cdot)\right) / Z^{\pi_{old}}(s_t)$, is the softmax of Q. Thus, the policy parameters are optimized by training to minimize this KL divergence.
$J_\pi(\theta) = \mathbb{E}_{a_t \sim \pi}\left[ \log \pi_\theta(a_t \mid s_t) - Q_\phi(s_t, a_t) + \log Z_\phi(s_t) \right]$    (7)
The objective function described in Equation (7) aims at making the distribution of the policy proportional to the distribution of the exponentiated Q-function, which is normalized using the function Z. The partition function Z does not affect the gradient with respect to the policy; it is used only as a normalizer and depends only on the state $s_t$. Therefore, it can be discarded, and the objective function can be rewritten as Equation (8)
$J_\pi(\theta) = \mathbb{E}_{s_t \sim D}\left[ \mathbb{E}_{a_t \sim \pi_\theta}\left[ \alpha \log \pi_\theta(a_t \mid s_t) - Q_\phi(s_t, a_t) \right] \right]$    (8)
where D is the replay buffer, which contains past experiences. From Equation (8), the policy parameters can be learned by directly minimizing the expected KL divergence multiplied by the parameter α .
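To make Equations (4), (5), and (8) concrete, the following PyTorch-style sketch computes the three SAC losses for one sampled mini-batch. It is a simplified illustration under assumed network interfaces (value_net, target_value_net, twin critics q1/q2, and a policy exposing a reparameterized sample method), not the authors' implementation; terminal-state masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, value_net, target_value_net, q1, q2, policy, alpha, gamma):
    """Compute the soft value, soft Q, and policy losses (Eqs. (4), (5), (8))."""
    s, a, r, s_next = batch  # tensors sampled from the replay buffer D

    # Actions and log-probabilities from the current policy (reparameterization trick).
    a_new, log_prob = policy.sample(s)
    q_new = torch.min(q1(s, a_new), q2(s, a_new))  # clipped double-Q estimate

    # Eq. (4): value target is E[Q - alpha * log pi] under the current policy.
    v_target = (q_new - alpha * log_prob).detach()
    value_loss = F.mse_loss(value_net(s), v_target)

    # Eq. (5): soft Bellman residual; the target uses the target value network (cf. Eq. (3)).
    q_target = (r + gamma * target_value_net(s_next)).detach()
    q_loss = F.mse_loss(q1(s, a), q_target) + F.mse_loss(q2(s, a), q_target)

    # Eq. (8): policy loss, minimizing alpha * log pi - Q.
    policy_loss = (alpha * log_prob - q_new).mean()

    return value_loss, q_loss, policy_loss
```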
Although SAC has shown better results in robotic manipulation tasks due to its higher exploration capability, the problem of sparse rewards remains challenging. In single-task reinforcement learning, researchers have shown that hindsight experience replay (HER) tends to alleviate the sparse reward problem and leads to better performance, but it does not guarantee sample efficiency [32]. Hence, to address the sample inefficiency caused by uniformly sampling the buffer regardless of the significance of the goals, HER has been extended to a prioritized hindsight model for multi-goal RL in which virtual goals are generated by an attentive goal generation network [33]. Although prioritized experience replay (PER) has proven to be one of the most important components for the overall performance of deep RL systems in discrete action domains, several empirical investigations reveal that it significantly underperforms when used with actor-critic algorithms in continuous control [34].
In this paper, we propose a method to address the problems that arise from the introduction of virtual goals and from training actor-critic networks on transitions with large TD-errors.

3. Proposed Method

In this section, we explain the problem formulation and propose the SAC algorithm using prioritized hindsight for meta-reinforcement learning.

3.1. Problem Formulation

In this paper, an experiment was devised in which a 7-DoF robotic arm initially learns a single policy to perform two structurally similar tasks and then adapts this prior policy to a target task. The problem can be stated as follows. Let $S_T$ denote the set of all source tasks, with individual source tasks $s_T \in S_T$, and let $T_T$ denote the target task. Our aim is to learn an optimal policy $\pi^*$ for the target task $T_T$ by incorporating knowledge $D_{S_T}$ from $S_T$ into the target task data $D_{T_T}$.
We approach this problem in two phases. The first phase requires training the prior policy. For this purpose, ReachTarget and CloseBox tasks, which are the source tasks, are selected. This phase can be seen as a transfer-learning phase, in which the trained policy from the first task is fine-tuned onto the second one. The main target of this phase is to first learn a policy that already has an insight into the target task and also to collect samples that can assist in the learning of the target task. The second phase is the adaptation phase. In this phase, the prior policy is adapted to the CloseMicrowave task, which is the target task, with fewer interactions with the environment.
Despite being successful in game environments, directly applying state-of-the-art DRL algorithms to generalize over robotic manipulation tasks still suffers from several issues, mainly due to three basic challenges. The first is that all the tasks (i.e., ReachTarget, CloseBox, and CloseMicrowave) have high-dimensional state and continuous action spaces, which makes it difficult for agents to explore effectively. The second is that, unlike game environments in which continuous feedback is available, these tasks only provide a reward when the task is completed. Since the reward is the only feedback in RL that reflects how one approach outperforms another in terms of completing a task, when this reward is sparse it becomes challenging for the agent to determine how to improve the policy. The third is sample inefficiency. Although low sample efficiency can be alleviated by using off-policy learning with an experience replay buffer, the problem persists if the agent receives sparse rewards.

3.2. Proposed SAC Using Prioritized Hindsight for Meta-Reinforcement Learning

Learning a single policy for two tasks using the SAC off-policy reinforcement learning algorithm requires investigating the gradients communicated across tasks and the experiences used. The agent mainly learns its policy from collected experiences. Hence, the quality of the policy depends on the contents of the data and on how they are gathered and used to improve the policy. Therefore, the experience replay has to be optimized in such a way that it assists training [35]. The main challenge encountered during this phase is the sparsity of the reward, i.e., the agent receives zero reward everywhere except when the task is successfully completed. Since trajectories with zero reward do not give much insight into how to update the policy, the samples used for training are highly inefficient. Additionally, in the early stages of training, the lack of sufficient high-value training samples leads to a lower cumulative reward, which, in turn, affects the optimal policy.
To alleviate this problem, we apply HER. For each of the tasks, in addition to the state space (S) and the action space (A), we introduce a goal space (G) from which goals $s_g \in G$ are sampled. As shown in Figure 2, the proposed method architecture is summarized in the red box and the SAC is described in the green box. The flow of the diagram is indexed from (1) to (9). (1) The process starts with initializing the buffers and selecting a goal $s_g$; then, the first step is to select an action $a_t$ by sampling from the current policy distribution. (2) The observed state augmented with the goal ($s_t \| s_g$), the action taken $a_t$, the next state augmented with the goal state ($s_{t+1} \| s_g$), and the reward $r_t$ form the transition tuple $(s_t \| s_g, a_t, s_{t+1} \| s_g, r_t)$, which is stored as part of the trajectory in the prioritized replay buffer; (3) it is also copied into the regular buffer if the reward $r_t = 1$.
As depicted in Figure 2, the green box consists of two critic networks (7), one policy network (8), and one value network (6) with its corresponding target value network (9). The target network is a frozen copy of the original network that is updated infrequently. Targets allow the agent to learn from experiences with a fixed policy; hence, they stabilize training by preventing updates from being performed on a moving target. (4) During training, we roll out samples from either of the buffers. (5) The mini-batch data are then used to update the soft value parameters $\psi$, the soft Q parameters $\phi$, the policy parameters $\theta$, and finally the target value parameters $\bar{\psi}$. Therefore, the overall process alternates between gathering experience from the environment with the current policy and updating the parameters from batches sampled from a replay buffer using stochastic gradients. For each step taken, the outcome specifies whether the goal is reached from the current state or not. If the goal is not reached from the current state, we modify the trajectory with the achieved goal $s_g^a$ by pretending that the achieved goal was the original goal, forming $(s_t \| s_g^a, a_t, s_{t+1} \| s_g^a, r_t^a)$, and recalculate the reward in hindsight. This newly generated sample is then stored in a buffer based on the value of the reward. Therefore, the replay buffer contains a combination of virtually generated successful trajectories and trajectories with the original goal.
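To make the storage logic of steps (2) and (3) and the hindsight relabeling concrete, the sketch below routes one episode into the two buffers. The buffer objects, the reward function, and the concatenation layout are assumed interfaces for illustration and are not taken from the paper's released code.

```python
import numpy as np

def store_episode(trajectory, goal, prioritized_buffer, regular_buffer,
                  reward_fn, k=4):
    """Store real and hindsight transitions of one episode into the two buffers.

    `trajectory` is a list of (s, a, s_next, r) tuples and `goal` is the
    original goal state s_g; `reward_fn(s_next, goal)` returns 1 when the
    goal is reached and 0 otherwise. Buffer classes and reward_fn are
    assumed placeholder interfaces.
    """
    T = len(trajectory)
    for t, (s, a, s_next, r) in enumerate(trajectory):
        # (2) goal-augmented real transition always goes to the prioritized buffer
        prioritized_buffer.add((np.concatenate([s, goal]), a,
                                np.concatenate([s_next, goal]), r))
        # (3) successful real transitions are also copied into the regular buffer
        if r == 1:
            regular_buffer.add((np.concatenate([s, goal]), a,
                                np.concatenate([s_next, goal]), r))
        # hindsight relabeling: treat k achieved states from this episode as goals
        for idx in np.random.randint(0, T, size=k):
            achieved_goal = trajectory[idx][2]        # an achieved state s_g^a
            r_h = reward_fn(s_next, achieved_goal)    # reward recomputed in hindsight
            prioritized_buffer.add((np.concatenate([s, achieved_goal]), a,
                                    np.concatenate([s_next, achieved_goal]), r_h))
```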
Although the proposed approach addresses the sparse reward setting by generating successful trajectories with hindsight goals, it also introduces several irrelevant virtual goals. Transitions with the original goal and hindsight transitions do not have the same impact on the training process [36]. Therefore, instead of sampling transitions uniformly from the replay buffer, we propose using the proportional variant of the prioritized sampling technique to filter significant transitions. In prioritized experience replay (PER), transitions are prioritized proportionally to their temporal-difference error (TD-error) [37]. This prioritization is based on the TD-error generated from the Q-function of our SAC algorithm. Hence, we define the TD-error of the transition $(s_t \| s_g^a, a_t, s_{t+1} \| s_g^a, r_t^a)$ as the sum of the absolute TD-errors of the two instances of the critic network, as shown in Equation (9).
$|\sigma| = \sum_{i=1}^{2} \left| r + \gamma V_{\bar{\psi}}(s_{t+1} \| s_g) - Q_{\phi_i}(s_t \| s_g, a_t) \right|$    (9)
The sampling probabilities of transitions are then calculated as follows:
$P(j) = \dfrac{p_j^{\beta}}{\sum_k p_k^{\beta}}$
where $p_j = |\sigma| + \epsilon$ and $\beta$ is a hyper-parameter that controls how much the sampling is influenced by the priority value.
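The proportional prioritization above can be illustrated with a short sketch (our own minimal numpy version; an efficient implementation would normally use a sum-tree):

```python
import numpy as np

def sample_prioritized(td_errors: np.ndarray, batch_size: int,
                       beta: float = 0.5, eps: float = 1e-6) -> np.ndarray:
    """Sample transition indices with P(j) = p_j^beta / sum_k p_k^beta,
    where p_j = |sigma_j| + eps (proportional variant of PER)."""
    priorities = np.abs(td_errors) + eps
    scaled = priorities ** beta
    probs = scaled / scaled.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

# Transitions with larger TD-errors are drawn more often.
indices = sample_prioritized(np.array([0.1, 2.5, 0.05, 1.0]), batch_size=2)
```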
In the second phase of the proposed method, we adapt the learned prior optimal policy to the target task. We claim that effective storage and sampling of experiences can enhance the performance of an agent. Moreover, remembering the best experiences from one task can guide the learning of another task. Therefore, we divide the replay buffer into two. The first buffer, the prioritized buffer in Figure 2, contains both hindsight and real trajectories; samples drawn from this buffer are prioritized. The second buffer, the regular buffer with successful goals in Figure 2, contains the real successful trajectories; experiences are sampled uniformly from this buffer. Separating the buffers aims at resolving the bias introduced into the replay buffer by the hindsight experiences and at stabilizing the learning process by remembering the best experiences.
Figure 2 shows the overall process of the proposed method, and Algorithm 1 presents the detailed algorithm. In summary, we propose a prioritized hindsight algorithm with two separate buffers that optimizes experience replay. We apply this method to the SAC reinforcement learning algorithm to train a 7-DoF manipulator to perform the CloseMicrowave task using prior policies resulting from the ReachTarget and CloseBox tasks (see Algorithm 1). For further reference, see GitHub.
Algorithm 1 SAC + HER + PER with separate replay buffers
Require: Initialized replay buffers: R, E    ▹ R: prioritized buffer (real + hindsight transitions), E: buffer of successful transitions
Input: Train $\pi_\theta$ with samples from $D_{S_T}$
for epoch = 1, N do
     Sample goal $s_g$ and initial state $s_0$
     for t = 0, T − 1 do
          Choose $a_t \sim \pi_\theta(\cdot \mid s_t \| s_g)$ according to the current policy
          Execute action $a_t$ and observe $s_{t+1}$, $r_{t+1}$
     end for
     for t = 0, T − 1 do
          Store transition $(s_t \| s_g, a_t, r_t, s_{t+1} \| s_g)$ in R with priority $p_t = \max_{i<t}(p_i)$
          if $r_t = 1$ then
               Store transition $(s_t \| s_g, a_t, r_t = 1, s_{t+1} \| s_g)$ in E
          end if
          Sample additional goals using k random states in the current episode
          for 0, K − 1 do
               Recompute reward $r_t^a = r(s_t, a_t, s_g^a)$
               Store transition $(s_t \| s_g^a, a_t, r_t^a, s_{t+1} \| s_g^a)$ in R prioritized by TD-error
          end for
     end for
     Sample a buffer n, where n ∈ {R, E} with probabilities p(R) = r and p(E) = e    ▹ r and e are the sampling probabilities assigned to R and E, respectively
     if n = R then
          Sample mini-batch based on proportional priority
     else
          Sample mini-batch uniformly
     end if
     Perform one network update step on the mini-batch
end for
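As a concrete illustration of the buffer-selection step in Algorithm 1, the following sketch draws a mini-batch from the prioritized buffer R with probability r and from the success buffer E otherwise. The buffer objects and their sample methods are assumed placeholder interfaces, not the authors' released code.

```python
import random

def sample_minibatch(prioritized_buffer, regular_buffer, batch_size, r=0.6):
    """Draw one mini-batch from the dual buffers.

    With probability r the batch is drawn from the prioritized buffer R
    using proportional (TD-error-based) sampling; otherwise it is drawn
    uniformly from the regular success buffer E. The 0.6/0.4 default
    mirrors the best-performing ratio in Table 4. The buffer classes and
    their `sample_proportional` / `sample_uniform` methods are assumed
    interfaces used only for illustration.
    """
    if random.random() < r or len(regular_buffer) == 0:
        return prioritized_buffer.sample_proportional(batch_size)
    return regular_buffer.sample_uniform(batch_size)
```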

4. Experiments and Results

The proposed method was evaluated in RLBench [38], a benchmark built on V-REP and PyRep that consists of several robotic manipulation challenges and serves as a testing ground for RL agents. RLBench also provides train–test splits of tasks to support multi-task and few-shot learning. These tasks are composed of different variations from which episodes can be sampled. Among these tasks, we selected and modified the environments for ReachTarget, CloseBox, and CloseMicrowave. In each of these tasks, the scene was displayed at the center of the workspace. Upon each successful task completion, a reward was provided. The experiments were designed to train the Franka Emika Panda robot arm. The robot arm base was attached at the end corner of the table, and all objects for each of the tasks were displayed within a bounding box.

4.1. Task and Robotic Arm Description

RLBench provides a wide range of tasks with a set of demonstrations. For this experiment, three tasks were selected, each using a seven-degree-of-freedom (7-DoF) Franka Emika Panda robotic arm. Table 1 summarizes the parameters of each robot joint, and the descriptions of each task follow.
  • ReachTarget: As shown in Figure 3a, the ReachTarget task consists of the target ball, which is red, and two distractors, which are magenta and pink. In this task, the agent is required to reach the target object, i.e., the red ball.
  • CloseBox: The CloseBox task requires the robotic arm to reach for the box positioned in the bounding box and close its lid, as shown in Figure 3b.
  • CloseMicrowave: Similar to the CloseBox task, the CloseMicrowave task requires the agent to close the lid of the microwave, as shown in Figure 3c.
These tasks can be defined as MDPs, with the state space consisting of joint positions, velocities, forces, gripper pose, and target object position. The action space comprises variations in these state-space elements. As shown in Table 2, the action space for all tasks has eight dimensions. The state space for ReachTarget and CloseMicrowave has 39 dimensions, while that of the CloseBox task has only 32 dimensions. Even though CloseMicrowave and CloseBox are variants of the same set of tasks, the difference in state-space dimension arises because the microwave door joint pose is considered part of the observation.
Each task requires finding the optimal sequence of actions to position the robot arm's end-effector close to the target coordinates to complete the task effectively. At the start of each episode, the joints of the robot were reset to their initial angle positions and the goal positions were set randomly. When using only low-dimensional observations, RLBench does not provide a way to handle the case in which the target goal becomes unreachable by the robot arm. Therefore, we impose a constraint that every object must remain within the bounding box to ensure that the target object is reachable.
In the beginning, the robot arm was in the initial state $s_0$, a goal state $s_g$ was sampled, and a finite episode length T was given. Then, the robot took an action $a_t$, and $(s_{t+1}, r_{t+1})$ were produced. If the goal state was the same as the next state, a reward of +1 was provided; otherwise, the next state was set as a new goal. It was then checked whether this state had been reached from a different state in previous trajectories in hindsight. If such an event had occurred, the reward was set to +1 and the modified tuple was recorded. This process was repeated until the end of the episode. An episode terminated either when the maximum timestep was reached or when the task was completed. In either case, the episode samples were recorded in the replay buffers, which were then sampled and used to determine the optimal policy using SAC.

4.2. Results and Discussions

To verify the efficacy of the proposed method, we assessed it against two evaluation baselines. The first baseline evaluates the performance of the proposed method against vanilla SAC with a regular experience replay buffer on single tasks. The second compares SAC + HER + PER against the proposed method in a multi-task setting. We aim to show how the performance of SAC is affected by different experience replay schemes in multi-task settings.
We compared three variants of SAC: vanilla SAC, SAC + HER + PER, and the proposed method in a multi-task setting. For each of the variants, we used the same SAC code and the same neural network architecture, including activation functions, optimizer, learning rate, and replay buffer size, while varying other hyper-parameters across the experiments. In order to run SAC + HER + PER and the proposed method, we modified the three environments (ReachTarget, CloseBox, and CloseMicrowave) to include state-goal observations. Additionally, in the ReachTarget and CloseBox tasks, the positions of the targets changed every episode. We also used a different set of hyper-parameter values than the original works on SAC, HER, and PER. Table 3 lists the hyper-parameter values used in our experiments.

4.2.1. Single Task Learning Results

After modifying the tasks to incorporate the desired and achieved goals as part of the state space, the training procedure began with training each of the tasks individually. The CloseBox and CloseMicrowave tasks were trained for a total of 200 epochs, with each epoch rolling out 200 episodes of 300 steps each. On the other hand, the ReachTarget task was trained for 500 epochs, in which each epoch contained 200 episodes of 300 steps each. Actions were randomly sampled for the first 100k steps. The policies were then evaluated every 300 steps based on the average reward per episode. Each experiment was performed multiple times, and the average values were taken as the result. Figure 4 shows the results of single-task training. Contrary to the notion that the ReachTarget task is relatively easier than the other tasks, it appeared to be more challenging. This was due to randomly initializing the position of the target goal every episode. The CloseMicrowave task was also challenging to solve, as its return was lower than that of any other task at the end of training, as shown in Figure 4a. The resulting graphs were smoothed for readability. On each of the tasks, the proposed method outperformed vanilla SAC in single-task learning.

4.2.2. Multi-Task Learning Results

The next step was to evaluate the proposed method in a multi-task setting. For this purpose, we used the pre-trained policies from the ReachTarget and CloseBox tasks as a prior to train the CloseMicrowave task. The agent was trained for a total of 120 epochs, each consisting of 200 episodes of 300 steps. The number of epochs was reduced to test whether the proposed method had the capacity to generalize with a short adaptation phase. The reward was set to +1 for reaching the goal and 0 otherwise. For the first 100k steps, actions were sampled randomly. Moreover, the hyper-parameter k, which sets the number of random states to consider during the hindsight phase, was increased from 4 to 8.
During the training of SAC + HER + PER, we noted that successful transitions to the real goals, even with the goals generated by HER, occurred on average only about once every 300,000 to 500,000 steps, which is still very sparse given the dimensionality of our tasks. Furthermore, as training went on, the model's performance declined. In actor-critic networks, the actor network is updated based on the critic's evaluation of the action, which is obtained by calculating the Q-value of the given state. Hence, the more the critic knows about the states that yield better values, the better the actions the agent selects. However, studies show that in continuous action spaces where actor-critic algorithms are used, the performance of PER degrades because the actor is trained on experiences about which the critic is still uncertain. Moreover, in actor-critic training, employing uniformly sampled transitions for a fraction of the batch size is critical for stability [34].
In order to alleviate this problem, our proposed method uses two replay buffers from which experiences are rolled out according to a sampling ratio. The sampling ratio affects the distribution over the state space; hence, the ratio was determined heuristically. Table 4 shows the different sampling ratios together with the number of times the agent successfully completed the CloseMicrowave task. Since one of the buffers contains only successful trajectories, it assists exploitation, whereas the other buffer, which contains experiences across multiple tasks, serves as an exploration buffer.
Figure 5 shows the performance of SAC + HER + PER versus the proposed method on multi-task learning. The result indicates that the proposed method clearly outperformed the SAC + HER + PER in the CloseMicrowave task, which was the target task in multi-task settings. In addition to having higher return values, the proposed method also resulted in more successful attempts, as shown in Table 5. Our experimental results showed that using pre-trained policies in conjunction with an off-policy learning method assisted by a mechanism to structurally explore the environment accelerated learning performance and improved policy performance.

5. Conclusions

In this paper, we proposed a method that successfully incorporates prior experience from previous tasks to perform well on a given task. We created a task space consisting of three tasks, i.e., ReachTarget, CloseBox, and CloseMicrowave. The goal was to apply experiences learned from ReachTarget and CloseBox to the CloseMicrowave task. Through experiments, we showed that the proposed method outperformed each of the evaluation baselines. Furthermore, it has the advantage of not requiring complex reward functions to be defined. The pre-trained policies can be viewed as task-agnostic policies that could further be used to solve more challenging, structurally similar tasks. Therefore, the proposed method has the capability to leverage existing knowledge and to train agents to execute tasks more efficiently. However, the proposed method has a limitation in that the training order of the tasks highly affects its performance. Moreover, tasks with broader distributions have not been considered. In future work, we plan to define task MDPs with significant differences and to scale the proposed algorithm.

Author Contributions

Conceptualization, S.W.B. and J.-H.H.; methodology, S.W.B. and J.-H.H.; software, S.W.B.; validation, S.W.B. and J.-H.H.; formal analysis, S.W.B. and J.-H.H.; investigation, S.W.B.; resources, J.-H.H.; data curation, S.W.B.; writing—original draft preparation, S.W.B.; writing—review and editing, J.-H.H.; visualization, S.W.B.; supervision, J.-H.H.; project administration, J.-H.H.; funding acquisition, J.-H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2022-RS-2022-00156295) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Dzedzickis, A.; Subačiūtė-žemaitienė, J.; Šutinys, E.; Samukaitė-Bubnienė, U.; Bučinskas, V. Advanced applications of industrial robotics: New trends and possibilities. Appl. Sci. 2022, 12, 135. [Google Scholar] [CrossRef]
  2. Hua, J.; Zeng, L.; Li, G.; Ju, Z. Learning for a robot: Deep reinforcement learning, imitation learning, transfer learning. Sensors 2021, 21, 1278. [Google Scholar] [CrossRef] [PubMed]
  3. Gu, S.; Holly, E.; Lillicrap, T.; Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proceedings of the IEEE International Conference on Robotics and Automation, Singapore, 29 May–3 June 2017; pp. 3389–3396. [Google Scholar]
  4. Parisotto, E.; Ba, J.; Salakhutdinov, R. Actor-mimic deep multitask and transfer reinforcement learning. arXiv 2015, arXiv:1511.06342. [Google Scholar]
  5. Rusu, A.A.; Colmenarejo, S.G.; Gülçehre, Ç.; Desjardins, G.; Kirkpatrick, J.; Pascanu, R.; Mnih, V.; Kavukcuoglu, K.; Hadsell, R. Policy distillation. arXiv 2015, arXiv:1511.06295. [Google Scholar]
  6. Boutsioukis, G.; Partalas, I.; Vlahavas, I. Transfer learning in multi-agent reinforcement learning domains. In European Workshop on Reinforcement Learning; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2012; Volume 7188 LNAI, pp. 249–260. [Google Scholar] [CrossRef] [Green Version]
  7. Kormushev, P.; Calinon, S.; Caldwell, D.G. Reinforcement learning in robotics: Applications and real-world challenges. Robotics 2013, 2, 122–148. [Google Scholar] [CrossRef] [Green Version]
  8. Morales, E.F.; Zaragoza, J.H. An introduction to reinforcement learning. In Decision Theory Models for Applications in Artificial Intelligence: Concepts and Solutions; MIT Press: Cambridge, MA, USA, 2011; pp. 63–80. [Google Scholar] [CrossRef] [Green Version]
  9. Franceschetti, A.; Tosello, E.; Castaman, N.; Ghidoni, S. Robotic Arm Control and Task Training Through Deep Reinforcement Learning. Lect. Notes Netw. Syst. 2022, 412 LNNS, 532–550. [Google Scholar] [CrossRef]
  10. Kroemer, O.; Niekum, S.; Konidaris, G. A review of robot learning for manipulation: Challenges, representations, and algorithms. J. Mach. Learn. Res. 2021, 22, 1–82. [Google Scholar]
  11. Aumjaud, P.; McAuliffe, D.; Rodríguez-Lera, F.J.; Cardiff, P. Reinforcement Learning Experiments and Benchmark for Solving Robotic Reaching Tasks. Adv. Intell. Syst. Comput. 2021, 1285, 318–331. [Google Scholar] [CrossRef]
  12. Wang, D.; Jia, M.; Zhu, X.; Walters, R.; Platt, R. On-Robot Learning With Equivariant Models. In Proceedings of the 6th Annual Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2022. [Google Scholar]
  13. Liu, Y.; Gao, P.; Zheng, C.; Tian, L.; Tian, Y. A Deep Reinforcement Learning Strategy Combining Expert Experience Guidance for a Fruit-Picking Manipulator. Electronics 2022, 11, 311. [Google Scholar] [CrossRef]
  14. Varghese, N.V.; Mahmoud, Q.H. A survey of multi-task deep reinforcement learning. Electronics 2020, 9, 1363. [Google Scholar] [CrossRef]
  15. Li, H.; Liao, X.; Carin, L. Multi-task reinforcement learning in partially observable stochastic environments. J. Mach. Learn. Res. 2009, 10, 1131–1186. [Google Scholar]
  16. Zhu, Z.; Lin, K.; Jain, A.K.; Zhou, J. Transfer Learning in Deep Reinforcement Learning: A Survey. arXiv 2020, arXiv:2009.07888. [Google Scholar]
  17. Campos, V.; Sprechmann, P.; Hansen, S.; Barreto, A.; Kapturowski, S.; Vitvitskyi, A.; Badia, A.P.; Blundell, C. Beyond Fine-Tuning: Transferring Behavior in Reinforcement Learning. arXiv 2021, arXiv:2102.13515. [Google Scholar]
  18. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 3, pp. 1856–1868. [Google Scholar]
  19. Gupta, A.; Mendonca, R.; Liu, Y.X.; Abbeel, P.; Levine, S. Meta-reinforcement learning of structured exploration strategies. Adv. Neural Inf. Process. Syst. 2018, 2018, 5302–5311. [Google Scholar]
  20. Kaushik, R.; Anne, T.; Mouret, J.B. Fast online adaptation in robotics through meta-learning embeddings of simulated priors. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 25–29 October 2020; pp. 5269–5276. [Google Scholar] [CrossRef]
  21. Yu, T.; Finn, C.; Xie, A.; Dasari, S.; Zhang, T.; Abbeel, P.; Levine, S. One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning. arXiv 2018, arXiv:1802.01557. [Google Scholar]
  22. Teh, Y.W.; Bapst, V.; Czarnecki, W.M.; Quan, J.; Kirkpatrick, J.; Hadsell, R.; Heess, N.; Pascanu, R. Distral: Robust multitask reinforcement learning. Adv. Neural Inf. Process. Syst. 2017, 2017, 4497–4507. [Google Scholar]
  23. Yang, R.; Xu, H.; Wu, Y.; Wang, X. Multi-task reinforcement learning with soft modularization. Adv. Neural Inf. Process. Syst. 2020, 2020, 1–11. [Google Scholar]
  24. Yu, T.; Kumar, S.; Gupta, A.; Levine, S.; Hausman, K.; Finn, C. Gradient surgery for multi-task learning. Adv. Neural Inf. Process. Syst. 2020, 2020, 1–13. [Google Scholar]
  25. Zou, H.; Ren, T.; Yan, D.; Su, H.; Zhu, J. Reward Shaping via Meta-Learning. arXiv 2019, arXiv:1901.09330. [Google Scholar]
  26. Lanka, S.; Wu, T. ARCHER: Aggressive Rewards to Counter bias in Hindsight Experience Replay. arXiv 2018, arXiv:1809.02070. [Google Scholar]
  27. Wang, D.; Ding, B.; Feng, D. Meta Reinforcement Learning with Generative Adversarial Reward from Expert Knowledge. In Proceedings of the 2020 IEEE 3rd International Conference on Information Systems and Computer Aided Education, ICISCAE 2020, Dalian, China, 27–29 September 2020; pp. 1–7. [Google Scholar] [CrossRef]
  28. Wang, H.; Zhang, Y.; Feng, D.; Li, D.; Huang, F. BSE-MAML: Model agnostic meta-reinforcement learning via bayesian structured exploration. In Proceedings of the 2020 IEEE 13th International Conference on Services Computing, SCC 2020, Beijing, China, 7–11 November 2020; pp. 60–67. [Google Scholar] [CrossRef]
  29. Ma, C.; Li, Z.; Lin, D.; Zhang, J. Parallel Multi-Environment Shaping Algorithm for Complex Multi-step Task. Neurocomputing 2020, 402, 323–335. [Google Scholar] [CrossRef]
  30. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  31. Prianto, E.; Kim, M.; Park, J.H.; Bae, J.H.; Kim, J.S. Path planning for multi-arm manipulators using deep reinforcement learning: Soft actor–critic with hindsight experience replay. Sensors 2020, 20, 5911. [Google Scholar] [CrossRef] [PubMed]
  32. Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, P.; Zaremba, W. Hindsight Experience Replay. Adv. Neural Inf. Process. Syst. 2017, 2017, 5049–5059. [Google Scholar]
  33. Liu, P.; Bai, C.; Zhao, Y.; Bai, C.; Zhao, W.; Tang, X. Generating attentive goals for prioritized hindsight reinforcement learning. Knowl.-Based Syst. 2020, 203, 106140. [Google Scholar] [CrossRef]
  34. Saglam, B.; Mutlu, F.B.; Cicek, D.C.; Kozat, S.S. Actor Prioritized Experience Replay. arXiv 2022, arXiv:1803.00933. [Google Scholar]
  35. Zha, D.; Lai, K.H.; Zhou, K.; Hu, X. Experience replay optimization. IJCAI Int. Jt. Conf. Artif. Intell. 2019, 2019, 4243–4249. [Google Scholar] [CrossRef] [Green Version]
  36. McInroe, T.A. Sample Efficiency in Sparse Reinforcement Learning: Or Your Money Back. arXiv 2020, arXiv:2008.12693. [Google Scholar]
  37. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2016, arXiv:1511.05952. [Google Scholar]
  38. James, S.; Ma, Z.; Arrojo, D.R.; Davison, A.J. RLBench: The Robot Learning Benchmark and Learning Environment. IEEE Robot. Autom. Lett. 2020, 5, 3019–3026. [Google Scholar] [CrossRef]
Figure 1. This figure shows the overall architecture of SAC algorithm. SAC has three main components: the soft state-value function, soft Q-function, and policy.
Figure 2. This figure describes the overall architecture of the proposed method.
Figure 3. Franka Emika Panda robotic arm with 7 DoF displayed in the RLBench simulator. (a) ReachTarget with the red ball as the target object. (b) CloseBox and (c) CloseMicrowave tasks.
Figure 4. Comparison of average reward results of the proposed method versus vanilla SAC. Red and blue lines are the results of the proposed method, and orange and gray lines are the results of vanilla SAC.
Figure 5. Average reward results for SAC + HER + PER versus the proposed method in the CloseMicrowave, which is the target task in multi-task setting. The blue line indicates the proposed method, whereas the orange line indicates SAC + HER + PER.
Table 1. Parameters for each of the joints.
Degree of freedom: 7
Joint position limits [°]: Q1: −166/166, Q2: −101/101, Q3: −166/166, Q4: −176/−4, Q5: −166/166, Q6: −1/215, Q7: −166/166
Joint velocity limits [°/s]: Q1: 150, Q2: 150, Q3: 150, Q4: 150, Q5: 180, Q6: 180, Q7: 180
Table 2. Task specification.
Task: State Space Dimension / Action Space Dimension
ReachTarget: 39 / 8
CloseBox: 32 / 8
CloseMicrowave: 39 / 8
Table 3. List of hyper-parameter values used during training.
Learning rate: 0.0001
Replay memory size: 10^7
Batch size: 256
Discount factor: 0.99
SAC entropy/temperature (α): 0.005
Maximum step size: 150
PER prioritization exponent (β): 0.5
Table 4. Total successful attempts for each sampling ratio.
Sampling Ratio (n): Number of Successful Attempts
0.6/0.4: 250
0.5/0.5: 154
0.7/0.3: <75
Table 5. Number of total successful attempts.
Environment: SAC + HER + PER / Ours
CloseMicrowave: <45 / 250
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
