Article

Efficient Robot Manipulation via Reinforcement Learning with Dynamic Movement Primitives-Based Policy

1 College of Engineering, Southern University of Science and Technology, Shenzhen 518055, China
2 Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, Shenzhen 518055, China
3 University of Chinese Academy of Sciences, Beijing 100864, China
4 Shenzhen ZHAOWEI Machinery & Electronics Co., Ltd., Shenzhen 518105, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(22), 10665; https://doi.org/10.3390/app142210665
Submission received: 31 October 2024 / Revised: 15 November 2024 / Accepted: 16 November 2024 / Published: 18 November 2024
(This article belongs to the Special Issue Intelligent Robotics in the Era of Industry 5.0)

Abstract

Reinforcement learning (RL), which autonomously explores optimal control policies, has become a crucial direction for developing intelligent robots, while Dynamic Movement Primitives (DMPs) serve as a powerful tool for efficiently expressing robot trajectories. This article explores an efficient integration of RL and DMPs to enhance the learning efficiency and control performance of RL in robot manipulation tasks, focusing on the form of control actions and their smoothness. A novel approach, DDPG-DMP, is proposed to address the efficiency and feasibility issues of current RL approaches that employ DMPs to generate control actions. The proposed method naturally integrates a DMP-based policy into the actor–critic framework of the traditional RL approach Deep Deterministic Policy Gradient (DDPG) and derives the corresponding update formulas to learn the networks that decide the parameters of the DMPs. A novel inverse controller is further introduced to adaptively learn the translation from observed states into various robot control signals through DMPs, eliminating the requirement for human prior knowledge. Evaluated on five robot arm control benchmark tasks, DDPG-DMP demonstrates significant advantages in control performance, learning efficiency, and smoothness of robot actions compared to related baselines, highlighting its potential in complex robot control applications.

1. Introduction

Describing the environment as a Markov Decision Process (MDP), reinforcement learning (RL) is a promising branch of machine learning that focuses on iteratively learning tasks through direct interactions without human prior knowledge [1,2]. It has been widely implemented in both academia and industry, achieving significant success in areas such as games [3,4,5], unmanned systems [6,7,8], intelligent manufacturing [9,10,11], and robot control [12,13,14]. However, implementing the unrestricted exploration behaviors of RL and neural network-based control policies in real-world engineering scenarios is extremely costly, particularly when dealing with noise and disturbances [15], which presents substantial challenges for applying learning algorithms to control sophisticated unmanned systems [16]. As a result, developing more sample-efficient methods remains a crucial challenge in the field of RL. The machine learning community has explored various approaches to enhance the sampling efficiency of RL. These include employing entropy and relative entropy to constrain the policy update and thereby improve the agent's exploration efficiency in high-dimensional state spaces [17,18], and generating pseudo samples from learned models to boost the training process [19,20]. However, these methods still largely depend on traditional control policies that use neural networks to approximate and determine one-step actions based on the current state, lacking a mechanism for implementing efficient and smooth control trajectories. Overall, current state-of-the-art RL approaches have primarily enhanced learning capabilities by improving actor–critic network update mechanisms and building models for target tasks. Research on improving learning efficiency and control performance from the perspective of the form of control actions and their smoothness remains limited.
Unlike the direct control policy based on neural networks, Dynamic Movement Primitives (DMPs) [21,22] offer an appealing solution for designing robot manipulation trajectories while featuring high non-linearity and real-time performance. DMPs are widely used in robot control scenarios including pouring water without spillage [23,24] and deformable object manipulation [25,26,27] owing to their robustness, guaranteed convergence, and easy spatiotemporal scaling. Inspired by DMPs’ advantages in parameterizing smooth and efficient robot motion trajectories, the feasibility of adaptively learning DMP-based control policies in RL on the robot platform has been explored. As a pioneering work that integrated Path Integrals and DMPs within one RL framework, Policy Improvement with Path Integrals (PI2) first investigated the effectiveness of DMPs as the control policy in RL on a simulated quadruped robot with 12 degrees of freedom [28]. Based on this work, a DMP policy incorporating not only robot control trajectories but also contact force signal trajectories was proposed to enhance the performance of PI2, achieving superior performance on a KUKA robot arm in a real-world wood planing scenario [29]. Neural Dynamic Policies (NDPs) [30] attempted to embed DMPs into the actor networks under the modern RL’s actor–critic network framework. They employed neural network-based policies to determine the parameters of the DMP controller, which generated smooth and efficient robot control trajectories. Evaluated by several control tasks built on a simulated Kinova Jaco robot arm, NDPs outperformed Proximal Policy Optimization (PPO) [31] in learning capability and control performance. Although NDPs can be considered a state-of-the-art RL approach using the DMP-based policy, the input states and output actions of their DMP-based policy must be consistent. Otherwise, a manually designed inverse controller—which is usually costly—is required. Therefore, it is difficult to support torque control tasks when the observed states are only related to joint angles. This limitation significantly restricts NDPs’ potential applications in the field of robot control.
To address the generalization issue of NDPs in complex robot control scenarios, particularly for robot arms with multiple degrees of freedom, we propose a novel RL approach, DDPG-DMP, that combines a DMP-based control policy with the actor–critic RL framework of Deep Deterministic Policy Gradient (DDPG) [32]. Our method seamlessly integrates a DMP controller, whose parameters are determined by the actor networks, into the modern RL actor–critic structure to generate smooth and efficient robot control trajectories. DDPG-DMP develops an adaptive inverse controller that is learned through the RL process to flexibly transfer the output trajectories of the DMP controller to various robot control signals while preserving trajectory continuity. This enables the DMP-based policy to generate control action trajectories that differ from the input states. The proposed method therefore enjoys smooth control of various control signals, including torque, angle, and end-effector position, without requiring human prior knowledge. Evaluated on various robot arm simulation tasks based on torque, angle, and end-joint displacement control, DDPG-DMP achieved significant improvements in learning efficiency and control performance compared to the original DDPG without DMPs and the related work on NDPs. The relationship between DDPG-DMP and other related works is presented in Table 1, where ◯ and × indicate whether a property is present or absent, and N/A marks an inapplicable entry. The contributions of this work can be summarized as follows:
  • Developing a novel RL approach specific to robot control tasks by integrating a DMP-based policy with DDPG's off-policy actor–critic framework. It enhances the learning capability and sample efficiency compared to the related work on NDPs [30], which inherit the on-policy characteristics of PPO.
  • Designing an adaptive inverse controller that learns the hidden relationships between DMP outputs and various robot control actions. This overcomes the limitation of NDPs [30], which rely on human prior knowledge for designing inverse controllers. As a result, it significantly expands the applicability of DMPs in the RL domain.
  • Evaluated on various simulated robot arm scenarios, the proposed method achieved superior control performances compared with NDPs and the original DDPG, highlighting the potential of RL with a DMP-based policy in complex robot control applications.
Table 1. Comparison of DDPG-DMP and the related RL approaches.
| Approach | Off-Policy | Policy Type | Adaptive Inverse Controller |
| DDPG [32] | ◯ | Neural Networks | N/A |
| SAC [33] | ◯ | Neural Networks | N/A |
| TD3 [34] | ◯ | Neural Networks | N/A |
| PPO [31] | × | Neural Networks | N/A |
| NDP [30] | × | DMP | × |
| DDPG-DMP (ours) | ◯ | DMP | ◯ |
The remainder of this paper is organized as follows. The background of NDPs and actor–critic networks is introduced in Section 2. Section 3 details the proposed approach. The experimental results are presented in Section 4. The conclusion is presented in Section 5.

2. Preliminaries

2.1. Markov Decision Process

Traditional RL describes the target task as an MDP [1] with elements $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$. $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$ and $\mathcal{A} = \{a_1, a_2, \ldots, a_n\}$ denote the finite sets of states and actions. The transition from one state $s$ to another state $s'$ under a certain action $a$ follows the unknown dynamics $P_{ss'}^{a} = p(s' \mid s, a)$, and the corresponding task-related reward function is defined as $R_{ss'}^{a}$. The RL agent aims to search for an optimal policy $\pi(a \mid s)$ to maximize the reward from the current state $s_t$:
$$ r_{s_t} = \sum_{a \in \mathcal{A}} \sum_{s_{t+1} \in \mathcal{S}} \pi(a \mid s_t)\, P_{s_t s_{t+1}}^{a}\, R_{s_t s_{t+1}}^{a}. \tag{1} $$
Define a discount factor $\gamma \in (0, 1)$; the value function is defined as the expected discounted total reward from $s$:
$$ V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{s_t} \,\middle|\, s_0 = s \right]. \tag{2} $$
The maximized value function satisfies the Bellman equation:
$$ V^{*}(s) = \max_{\pi} \sum_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \pi(a \mid s)\, P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^{*}(s') \right]. \tag{3} $$
It can be translated to a function of state–action pairs $Q(\cdot)$ that is more widely used:
$$ Q^{*}(s, a) = \max_{\pi} \sum_{s' \in \mathcal{S}} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, Q^{*}(s', a') \right]. \tag{4} $$
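To make the Bellman recursion in Equation (4) concrete, the following is a minimal NumPy sketch of tabular Q-value iteration on a toy, hand-crafted MDP; the transition and reward arrays are purely illustrative and are not taken from the paper.

```python
import numpy as np

# Toy tabular MDP used only for illustration: 3 states, 2 actions.
# P[s, a, s'] are transition probabilities, R[s, a, s'] are rewards.
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
              [[0.0, 0.6, 0.4], [0.3, 0.0, 0.7]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.zeros((3, 2, 3))
R[:, :, 2] = 1.0          # reaching state 2 yields reward 1
gamma = 0.9

# Q-value iteration: repeatedly apply the Bellman optimality operator
# Q(s, a) <- sum_s' P(s'|s,a) [R(s,a,s') + gamma * max_a' Q(s', a')]
Q = np.zeros((3, 2))
for _ in range(200):
    Q = np.einsum("ijk,ijk->ij", P, R + gamma * Q.max(axis=1)[None, None, :])

print("Greedy policy:", Q.argmax(axis=1))
```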

2.2. Deep Deterministic Policy Gradient

DDPG [32] introduced actor–critic networks to separately learn the estimated Q function and the policy $\pi$ via critic networks $Q_\theta$ and actor networks $\pi_\phi$ with independent parameters $\theta$ and $\phi$. At the $t$-th step of interaction, the DDPG agent decides the control action with exploration noise $\omega$, following a Gaussian distribution or the Ornstein–Uhlenbeck process, from its actor networks: $a_t = \pi_\phi(s_t) + \omega$. After executing $a_t$, the agent observes the next-step state $s_{t+1}$ and a reward $R_{s_t s_{t+1}}^{a_t}$. The whole transition $(s_t, a_t, R_{s_t s_{t+1}}^{a_t}, s_{t+1})$ is stored in the memory replay buffer $\mathcal{D}$ as one sample.
The training procedure of DDPG was designed mainly based on that of Deep Q Networks (DQN) [36]. For a mini-batch with $N$ samples from the replay buffer $\mathcal{D}$, the critic networks are updated by minimizing the following loss function:
$$ L_\theta = \frac{1}{N} \sum_{n=1}^{N} \left[ R_{s_n s'_n}^{a_n} + \gamma\, Q_{\theta'}\big(s'_n, \pi_{\phi'}(s'_n)\big) - Q_\theta(s_n, a_n) \right]^2. \tag{5} $$
The actor networks are updated via a sampled policy gradient as follows:
$$ \nabla_\phi J \approx \frac{1}{N} \sum_{n=1}^{N} \nabla_{a_n} Q_\theta(s_n, a_n)\big|_{a_n = \pi_\phi(s_n)}\, \nabla_\phi \pi_\phi(s_n), \tag{6} $$
where $Q_{\theta'}$ and $\pi_{\phi'}$ are the target networks of the critic and actor, with parameters $\theta'$ and $\phi'$ smoothly updated from $\theta$ and $\phi$ under the control of parameter $\eta$:
$$ \theta' \leftarrow \eta\theta + (1-\eta)\theta', \qquad \phi' \leftarrow \eta\phi + (1-\eta)\phi'. \tag{7} $$
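For illustration, below is a condensed PyTorch-style sketch of one DDPG update following Equations (5)–(7). The module and optimizer objects, the batch format, and the hyperparameter values are assumptions made for the example and do not reflect the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, gamma=0.99, eta=5e-3):
    """One DDPG update step on a sampled mini-batch (illustrative sketch).

    batch: tensors (s, a, r, s_next); the network modules and optimizers
    are assumed to be standard torch.nn.Module / torch.optim objects.
    """
    s, a, r, s_next = batch

    # Critic loss, Eq. (5): regress Q_theta(s, a) onto r + gamma * Q_theta'(s', pi_phi'(s'))
    with torch.no_grad():
        target_q = r + gamma * critic_targ(s_next, actor_targ(s_next))
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update, Eq. (6): ascend the sampled deterministic policy gradient
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target updates, Eq. (7)
    with torch.no_grad():
        for p, p_t in zip(critic.parameters(), critic_targ.parameters()):
            p_t.mul_(1 - eta).add_(eta * p)
        for p, p_t in zip(actor.parameters(), actor_targ.parameters()):
            p_t.mul_(1 - eta).add_(eta * p)
```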

2.3. Dynamic Movement Primitives

DMP [21] was proposed to stably represent complex motor action trajectories that can be flexibly adjusted without manual parameter tuning in robot control. Define $\alpha_y$ and $\beta_y$ as constants, and $y$, $g$ as the state variable and the target position. One motor action trajectory is encoded by a DMP as follows:
$$ \ddot{y}(t) = \frac{\alpha_y\big(\beta_y(g - y(t)) - \dot{y}(t)\big)}{\tau} + \frac{f(x_t)}{\tau^2}, \tag{8} $$
where $\tau$ is a time constant. Let the input $x$ follow a canonical system $\dot{x} = -\alpha_x x / \tau$ with initial value $x_0 = 1$; the forcing function $f(x_t)$ is approximated by $M$ Gaussian basis functions $\psi_i(\cdot)$ and one $M \times 1$ weights vector $\mathbf{w}$:
$$ f(x_t) = \frac{\sum_{i=1}^{M} \psi_i(x_t)\, w_i\, x_t}{\sum_{j=1}^{M} \psi_j(x_t)} = \boldsymbol{\phi}(x_t)^{T} \mathbf{w}, \tag{9} $$
$$ \psi_i(x_t) = \exp\!\left( -\frac{1}{2\sigma_i^2}\,(x_t - c_i)^2 \right), \tag{10} $$
where $\sigma_i$ and $c_i$ are the parameters of the $i$-th basis function.
A DMP encodes a $T$-step trajectory $\mathbf{y} = [y(t), \dot{y}(t), \ddot{y}(t)]_{t=1:T}^{T}$ by modifying its weights vector $\mathbf{w}$. The corresponding forcing function at the $t$-th step can be calculated by rearranging Equation (8):
$$ f(x_t) = \tau^2\left( \ddot{y}(t) - \frac{\alpha_y\big(\beta_y(g - y(t)) - \dot{y}(t)\big)}{\tau} \right). \tag{11} $$
Let $\mathbf{f} = \boldsymbol{\Phi}\mathbf{w}$, with basis functions $\boldsymbol{\Phi} = [\boldsymbol{\phi}(x_1), \ldots, \boldsymbol{\phi}(x_T)]^{T}$ and $\mathbf{f} = [f(x_1), \ldots, f(x_T)]^{T}$; the weights vector $\mathbf{w}$ can then be determined via least squares regression:
$$ \mathbf{w} = (\boldsymbol{\Phi}^{T}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{T}\mathbf{f}. \tag{12} $$
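As a concrete illustration of Equations (8)–(12), the following NumPy sketch fits the weights of a one-dimensional DMP to a demonstrated trajectory via least squares. The constants, basis-function placement, and Euler integration of the canonical system are common choices assumed here rather than the paper's exact settings.

```python
import numpy as np

def dmp_basis(x, centers, widths):
    """Gaussian basis functions psi_i(x) of the canonical variable x (Eq. 10)."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) ** 2) / widths[None, :] ** 2)

def fit_dmp_weights(y, dt, g, alpha_y=25.0, beta_y=6.25, alpha_x=1.0, tau=1.0, M=30):
    """Fit DMP weights w to a demonstrated 1-D trajectory y (numpy array) via Eq. (12).

    The constants and the basis parameters are illustrative assumptions.
    """
    T = len(y)
    yd = np.gradient(y, dt)                 # numerical velocity
    ydd = np.gradient(yd, dt)               # numerical acceleration

    # Canonical system x_dot = -alpha_x * x / tau, integrated with Euler steps
    x = np.empty(T)
    x[0] = 1.0
    for t in range(1, T):
        x[t] = x[t - 1] - alpha_x * x[t - 1] / tau * dt

    # Target forcing term obtained by rearranging the transformation system (Eq. 11)
    f_target = tau ** 2 * (ydd - alpha_y * (beta_y * (g - y) - yd) / tau)

    centers = np.exp(-alpha_x * np.linspace(0, 1, M))   # spread centers along x
    widths = np.full(M, 0.5 * (centers[0] - centers[-1]) / M + 1e-3)
    psi = dmp_basis(x, centers, widths)
    Phi = psi * x[:, None] / (psi.sum(axis=1, keepdims=True) + 1e-10)  # Eq. (9)
    w, *_ = np.linalg.lstsq(Phi, f_target, rcond=None)                 # Eq. (12)
    return w, centers, widths
```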

3. Approach

3.1. Neural Dynamic Policies

An NDP [30] attempts to embed a DMP-based policy into the actor–critic structure of the traditional RL approach PPO. Its principle and workflow are illustrated in Figure 1. Receiving the input state $s_t$, the actor networks output the parameters of the DMP rather than the action signal $a_t$:
$$ [g, \mathbf{w}] = \pi_\phi(s_t), \tag{13} $$
where $g$ and $\mathbf{w}$ are the target position and the weights vector of the DMP-based policy, and $\phi$ denotes the parameters of the actor networks. Let $y_0$ represent the variables related to each action in the observed state $s_t$; assuming that the information of the previous $y$ is known (the acceleration and velocity at the initial moment can be initialized as zero and iterated), the NDP generates a trajectory of $m = 40$ steps based on the DMP dynamics in Equation (8):
$$ \ddot{y}_j = \alpha\big(\beta(g - y_{j-1}) - \dot{y}_{j-1}\big) + f(x_j), \quad \dot{y}_j = \dot{y}_{j-1} + \ddot{y}_{j-1}\, d_j, \quad y_j = y_{j-1} + \dot{y}_{j-1}\, d_j, \tag{14} $$
where $d_j = 0.025$ is set following [30]. At this point, the NDP successfully generates a continuous control action trajectory of length $m = 40$ steps based on the input state $s_t$. The trajectory $[y_j, \dot{y}_j, \ddot{y}_j]_{j=1}^{m}$ includes position, velocity, and acceleration.
An NDP employs an inverse controller to sample the position information from the generated trajectory $[y_j, \dot{y}_j, \ddot{y}_j]_{j=1}^{m}$ at equal intervals to reconstruct a five-step robot control action trajectory $\boldsymbol{\mu} = [\mu_t, \mu_{t+1}, \ldots, \mu_{t+K-1}]$, where $K = 5$, which matches the time steps of the state and action spaces of the target task. At the next step, the NDP samples five steps of exploration actions $\mathbf{a} = [a_t, a_{t+1}, \ldots, a_{t+K-1}]$ based on $\boldsymbol{\mu}$ and sequentially interacts with the external environment using these five-step actions. The five-step interactive samples $\{s_{t+k}, a_{t+k}, R_{s_{t+k} s_{t+k+1}}^{a_{t+k}}, s_{t+k+1}, v(s_{t+k})\}_{k=0}^{K-1}$ are sequentially collected and stored in the memory buffer $\mathcal{D}$. The element $v(s_t)$ is the output of the critic networks (i.e., the estimated value function), which is required in the update of PPO. Please note that an NDP's inverse controller conducts a simplified mapping that discards the velocity and acceleration information from the initial trajectory $[y_j, \dot{y}_j, \ddot{y}_j]_{j=1}^{m}$. As a result, the output control trajectory $\mathbf{a}$ only follows the discrete positions of the DMP trajectory, compromising its quality and smoothness. This characteristic hinders an NDP's capability to generate DMP trajectories of physical variables absent from the observed states, such as generating torque trajectories from joint states; such a conversion would require manually designed inverse controllers based on human prior knowledge [30], which significantly constrains an NDP's applicability in robotics.
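The following one-dimensional NumPy sketch illustrates the NDP-style generation of an m = 40-step DMP trajectory via the Euler steps of Equation (14) and the simple position-only subsampling performed by its inverse controller. The constants, basis parameters, and canonical decay are illustrative assumptions, not the NDP authors' exact choices.

```python
import numpy as np

def ndp_rollout(g, w, y0, yd0, centers, widths, alpha=25.0, beta=6.25,
                m=40, K=5, dj=0.025):
    """Sketch of NDP-style trajectory generation and its simple inverse controller.

    g, w come from the actor networks (Eq. 13); y0, yd0 are the current position
    and velocity taken from the observed state; centers/widths define the basis.
    """
    y, yd, ydd = float(y0), float(yd0), 0.0   # acceleration initialized to zero
    x = 1.0
    traj = []
    for _ in range(m):
        psi = np.exp(-0.5 * ((x - centers) ** 2) / widths ** 2)
        f = (psi * w).sum() * x / (psi.sum() + 1e-10)
        ydd_new = alpha * (beta * (g - y) - yd) + f   # Eq. (14), acceleration
        yd_new = yd + ydd * dj                        # Eq. (14), velocity (uses previous accel.)
        y_new = y + yd * dj                           # Eq. (14), position (uses previous vel.)
        y, yd, ydd = y_new, yd_new, ydd_new
        x = x * (1.0 - dj)                            # simple canonical decay (assumed)
        traj.append((y, yd, ydd))

    # NDP's simplified inverse controller: keep only positions, sampled at equal intervals
    idx = np.linspace(0, m - 1, K).astype(int)
    actions = [traj[i][0] for i in idx]
    return actions
```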
The update of the actor and critic networks in NDPs is largely similar to PPO, but instead of purely randomly sampling a mini-batch from the memory buffer $\mathcal{D}$, it selects entire $K = 5$-step continuous trajectories to ensure that the continuity of the generated trajectory is fully considered during training. For one training sample $\{s_t, a_t, R_{s_t s_{t+1}}^{a_t}, s_{t+1}, v(s_t)\}$, the loss function of the actor networks is as follows:
$$ L(\phi) = \min\Big( c_\phi \cdot A(s_t, a_t),\ \mathrm{clip}\big(c_\phi, 1-\epsilon, 1+\epsilon\big) \cdot A(s_t, a_t) \Big), \tag{15} $$
$$ c_\phi = \frac{\pi_\phi(a_t \mid s_t)}{\pi_{\phi'}(a_t \mid s_t)}, \tag{16} $$
where $c_\phi$ is the ratio between the current and previous policies, $\epsilon$ is the parameter for clipping an overlarge $c_\phi$, and $A(s_t, a_t)$ is the estimated advantage function based on the current critic networks $V_\theta$ within the whole rollout of $N$ steps:
$$ A(s_t, a_t) = \sum_{l=0}^{N} (\gamma\lambda)^{l} \left[ R_{s_{t+l} s_{t+l+1}}^{a_{t+l}} + \gamma\, V_\theta(s_{t+l+1}) - V_\theta(s_{t+l}) \right], \tag{17} $$
where $\lambda$ is the corresponding hyperparameter. The critic networks are updated using the loss function below:
$$ L(\theta) = \left( R_{s_t s_{t+1}}^{a_t} + V_\theta(s_{t+1}) - V_\theta(s_t) \right)^2. \tag{18} $$
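For reference, a minimal PyTorch sketch of the clipped surrogate and value losses in the spirit of Equations (15)–(18) is given below; the tensor names and the way advantages and value targets are precomputed along the K-step trajectories are assumptions for illustration.

```python
import torch

def ndp_ppo_losses(logp_new, logp_old, advantages, values, value_targets, eps=0.2):
    """Clipped surrogate and value losses in the spirit of Eqs. (15)-(18).

    logp_new / logp_old are log-probabilities of the stored actions under the
    current and behavior policies; advantages and value_targets are assumed to be
    precomputed along the sampled K-step trajectories.
    """
    ratio = torch.exp(logp_new - logp_old)                     # c_phi in Eq. (16)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    actor_loss = -surrogate.mean()                             # maximize Eq. (15)
    critic_loss = (value_targets - values).pow(2).mean()       # Eq. (18)
    return actor_loss, critic_loss
```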

3.2. DDPG with DMP-Based Policy and Adaptive Inverse Controller

In this subsection, we introduce the proposed method DDPG-DMP as a novel framework that integrates a DMP-based policy into the traditional actor–critic RL approach DDPG. The principle and workflow of the proposed method are illustrated in Figure 2. Compared to an NDP, which employs policy networks and value-function critic networks, the proposed method is based on the traditional actor–critic framework in which the critic networks estimate the Q function, in order to inherit the off-policy characteristic of DDPG. Given the input state $s_t$, the actor networks in DDPG-DMP output two sets of parameters, $\psi$ and $\rho$. The first set $\psi = [g, \mathbf{w}]$ is consistent with the NDP and represents the target position and the weights vector required by the DMP. The second set $\rho$ is simultaneously passed to an adaptive inverse controller as the parameters of its neural network.
Using this setup, the DMP generates an $m$-step trajectory $[y_j, \dot{y}_j, \ddot{y}_j]_{j=1}^{m}$ with position, velocity, and acceleration information based on the parameter set $\psi$ using Equation (14). The adaptive inverse controller then converts it into a discrete control trajectory $\mathbf{a} = [a_t, a_{t+1}, \ldots, a_{t+K-1}]$, taking the velocity and acceleration information into account. Compared with the inverse controller in an NDP, the adaptive inverse controller proposed in DDPG-DMP enhances the capability of the RL framework to fine-tune and translate the trajectories generated by the DMP, especially when the control variables are not included in the observed states (e.g., adaptively mapping from joint angles to torque or end-effector position). This contributes to the superior feasibility of the RL and DMP-based policy across a wide range of robot systems without requiring additional human prior knowledge. In practice, the adaptive inverse controller is implemented as an additional neuron (with a tanh activation function) at the end of the actor networks, with the weights and bias of this neuron determined by $\rho$. Unlike an NDP, where the DMP and inverse controller are processes external to the actor networks, DDPG-DMP considers $\pi_\phi$, the DMP, and the adaptive inverse controller as three parts of one entire actor network, outputting a control action trajectory of $K = 5$ steps based on the input state $s_t$.
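A PyTorch-style sketch of how such an actor could be assembled is shown below. The DMP integration routine rollout_dmp is a hypothetical external function, and the layer sizes and output scaling are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DDPGDMPActor(nn.Module):
    """Sketch of a DDPG-DMP-style actor: a backbone produces the DMP parameters
    [g, w] plus the weights rho of a one-neuron adaptive inverse controller."""

    def __init__(self, state_dim, action_dim, n_basis=30, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.dmp_head = nn.Linear(hidden, action_dim * (1 + n_basis))  # g and w per DoF
        # rho: 3 weights (for y, y_dot, y_ddot) plus 1 bias per action dimension
        self.inv_head = nn.Linear(hidden, action_dim * 4)
        self.action_dim = action_dim
        self.n_basis = n_basis

    def forward(self, state, rollout_dmp, K=5):
        h = self.backbone(state)
        dmp_params = self.dmp_head(h).view(-1, self.action_dim, 1 + self.n_basis)
        g, w = dmp_params[..., 0], dmp_params[..., 1:]

        # Assumed external differentiable DMP integration, Eq. (14):
        # returns a [B, K, action_dim, 3] trajectory of position/velocity/acceleration
        traj = rollout_dmp(state, g, w, K)

        # Adaptive inverse controller: one tanh neuron per action dimension whose
        # weights and bias are themselves outputs of the actor (rho)
        rho = self.inv_head(h).view(-1, 1, self.action_dim, 4)
        weights, bias = rho[..., :3], rho[..., 3]
        actions = torch.tanh((traj * weights).sum(dim=-1) + bias)   # [B, K, action_dim]
        return actions
```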

3.3. Update of DDPG-DMP

Consistent with DDPG, the proposed method adds exploration noise to the generated trajectory $[a_t, a_{t+1}, \ldots, a_{t+K-1}]$ and interacts with the external environment using this trajectory. The corresponding samples $\{s_{t+k}, a_{t+k}, R_{s_{t+k} s_{t+k+1}}^{a_{t+k}}, s_{t+k+1}\}_{k=0}^{K-1}$ within the $K = 5$ steps are sequentially obtained and stored in the memory buffer $\mathcal{D}$. Define the size of a mini-batch as $N$; DDPG-DMP randomly samples $N$ complete trajectories (each with $K$ steps) during its update. Denote $\{s_{t+k}, a_{t+k}, R_{s_{t+k} s_{t+k+1}}^{a_{t+k}}, s_{t+k+1}\}_{k=0}^{K-1}$ as one sampled trajectory; we only select the first-step state $s_t$ and the first step of the output control trajectory $a_t$ as the input of the critic networks $Q_\theta(\cdot)$, in order to satisfy the update formulas of the original DDPG. Therefore, for each sampled trajectory, we only use the first step's state $s_t$, the next step's state $s_{t+1}$, and the action $a_t$. To compensate for this operation's neglect of the subsequent actions in the sampled trajectory and their corresponding states, we aggregate the reward functions over all $K$ steps as follows:
$$ R_t^{\mathrm{traj}} = \sum_{k=0}^{K-1} R_{s_{t+k} s_{t+k+1}}^{a_{t+k}}. \tag{19} $$
The processed sample $[s_t, a_t, R_t^{\mathrm{traj}}, s_{t+1}]$ is finally used to update both the actor and critic networks in DDPG-DMP. The learning and update processes of the proposed method are summarized in Algorithm 1. Please note that there is one significant limitation of the proposed DDPG-DMP: beyond the first step, the remaining control actions in the $K$-step trajectory generated by the actor networks and their corresponding states are almost invisible to the agent during the update process; they only influence the accumulated reward $R_t^{\mathrm{traj}}$. This characteristic substantially restricts the learning efficiency of DDPG-DMP, and addressing it will be a crucial focus of our future work.
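A minimal sketch of this sample-processing step, under the assumption that each stored trajectory is a list of K transition tuples, is given below; the resulting tuple can then be fed to the standard DDPG update of Section 2.2.

```python
def process_trajectory(traj_samples):
    """Collapse one stored K-step trajectory into a single DDPG-style transition.

    traj_samples is assumed to be a list of K tuples (s, a, r, s_next) collected
    while executing one DMP trajectory segment (Section 3.3); names are illustrative.
    """
    s_t, a_t, _, s_next = traj_samples[0]              # first step's state, action, next state
    r_traj = sum(r for (_, _, r, _) in traj_samples)   # aggregated reward, Eq. (19)
    return s_t, a_t, r_traj, s_next                    # fed to the standard DDPG update
```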
Algorithm 1: Learning and Update Processes of the Proposed DDPG-DMP

4. Experimental Results

4.1. Simulation Settings

In this section, we report the evaluation of the proposed method DDPG-DMP using five robot arm manipulation scenarios from two different benchmark environments. The first three tasks, Thrower-v0, Pusher-v0, and Reacher-v1, were taken from OpenAI Gym (https://www.gymlibrary.dev/environments/mujoco/pusher/, accessed on 15 November 2024) and aim to control a robot to throw or push the target object, or to move the end-effector to randomly generated target positions. We further selected the PandaReach-v2 task from the Panda robot arm simulation environment panda-gym (https://github.com/qgallouedec/panda-gym, accessed on 15 November 2024), where the agent directly controls the end-effector position of a Panda robot arm to reach randomly generated target positions. To additionally demonstrate the advantages of our method in directly controlling robot joints, we extended the PandaReach-v2 task to PandaReachJoints-v2, where the agent directly controls all joints of the Panda robot arm. The state–action spaces of the five tasks and the corresponding parameters of DDPG-DMP are shown in Table 2.
Considering that all experimental environments support the standard OpenAI Gym API, the proposed method can be directly applied to any real physical robot system compliant with the OpenAI Gym API and the ROS communication API, such as the panda_robot package (https://github.com/rvl-lab-utoronto/panda_robot, accessed on 15 November 2024), which is adapted to Panda robot arm hardware, and the robo-gym package (https://github.com/jr-robotics/robo-gym, accessed on 15 November 2024), which supports UR5 robot arm hardware. Most parameters, including the neural network structure, the learning rate of the critic networks, and the target update rate, were set uniformly across tasks. We only adjusted the learning rate of the actor networks and the number of warm-up steps to meet the specific requirements of each task.
In terms of the compared baselines, we selected several widely applied RL algorithms: DDPG [32], Twin Delayed Deep Deterministic Policy Gradient (TD3) [34], and Soft Actor Critic (SAC) [33], as well as the NDP [30], which is based on a DMP policy. All compared methods were developed using PyTorch (DDPG, TD3, and SAC were implemented following open source code: https://github.com/Lizhi-sjtu/DRL-code-pytorch (accessed on 15 November 2024), and the NDP was implemented following its official code: https://github.com/shikharbahl/neural-dynamic-policies, accessed on 15 November 2024). NDPs can be seen as the state-of-the-art RL approach using a DMP policy. DDPG, TD3, and SAC were selected as traditional model-free RL approaches using the actor–critic framework. We did not consider other newer model-free RL methods as baselines because the proposed DDPG-DMP can be directly applied to any method with an actor–critic framework; in this paper, we primarily focus on the improvement in learning performance when employing the DMP-based policy compared to the NDP and to similar approaches that do not use a DMP-based policy. All experiments were conducted in five independent trials on a computational server with an Intel Xeon W2295 CPU, an NVIDIA GeForce RTX 3090 GPU, 64 GB memory, and Ubuntu 20.04 OS.
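For reference, the snippet below sketches how one of the benchmark environments can be instantiated through the standard Gym API; whether the environment IDs resolve and what the reset/step calls return depend on the installed gym and panda-gym versions, so this is a rough usage sketch rather than the paper's experimental script.

```python
import gym
import panda_gym  # importing panda_gym registers the Panda tasks with Gym

# Environment IDs follow the paper; the return signatures of reset/step vary
# across gym versions and may need adjustment for newer releases.
env = gym.make("PandaReach-v2")
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()   # placeholder for a trained DDPG-DMP policy
    obs, reward, done, info = env.step(action)
env.close()
```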
The experimental results are detailed in the following order. The learning capability of all compared methods in the OpenAI robot arm scenarios is evaluated in Section 4.2, where the learning curves are shown in Figure 3 and the average returns are compared in Table 3. The corresponding results for the Panda robot arm scenarios are presented in Figure 4 and Table 4. Figure 5, Figure 6 and Figure 7 illustrate the control behaviors of DDPG-DMP and the baselines in the OpenAI and Panda robot arm scenarios as case studies.

4.2. OpenAI Robot Arm Scenarios

The learning performances of the proposed method and other baselines in three tasks in the OpenAI Gym environment, Reacher-v1, Pusher-v0, and Thrower-v0, are analyzed in this subsection. The learning curves of all compared methods in these tasks are presented in Figure 3, where the shaded region represents the corresponding standard deviation over five independent trials with different random seeds. The final average returns obtained by each method over five trials are compared in Table 3. Both results demonstrate the significant superiority of our method in terms of both learning efficiency and control performance.
In the Reacher-v1 task, DDPG, TD3, SAC, and the proposed DDPG-DMP ultimately converged to similar average returns according to the learning curves. Regarding the convergence speed, DDPG-DMP achieved an average return of approximately −10 in about 26,000 steps, while DDPG required around 33,000 steps, indicating an 11.7% improvement for our method. As a comparison, SAC required more interaction steps to converge in this task from a relatively low initial average return due to its inherent random exploration policy. Only TD3 achieved a learning efficiency comparable to DDPG-DMP. Please also note that DDPG-DMP only updates after executing a DMP trajectory segment every K = 5 steps, while traditional RL methods like DDPG, TD3, and SAC update the agent at every step. This further showcases the significant advantage of our method in update efficiency. Although the average return calculated over five-step samples resulted in a slight disadvantage for DDPG-DMP, as shown in Table 3, this had no significant impact on the actual control performance. In contrast, the NDP, which was also based on a DMP policy, struggled to leverage DMP's smooth control advantages in this task. Its learning curve shows no significant improvement, resulting in our method outperforming the NDP by up to 52.7% in the final converged average return.
In Pusher-v0, which has larger state and action spaces, the proposed DDPG-DMP outperformed DDPG and the NDP in the final converged average return (with 43.7% and 26.7% higher average returns, respectively) while achieving the best learning efficiency; our method quickly converged to an average return above −40 within 30,000 steps. In the most challenging task, Thrower-v0, it was clearly observed that the original DDPG failed to learn a proper policy to finish this task. Other traditional RL approaches like TD3 and SAC could not maintain their average returns and finally converged to poor control performances with very high standard deviations. As a comparison, the NDP showcased the advantages of a DMP-based policy in control performance for this task, which requires high coordination between trajectory smoothness and control actions across seven degrees of freedom. It consistently achieved a suboptimal average return, significantly outperforming DDPG, SAC, and TD3. However, the NDP took about 150,000 steps to converge, which was relatively slow. In contrast, the proposed algorithm not only enhanced the advantages of the DMP-based policy in control performance but also maximized its learning efficiency. By generating smooth control action trajectories independently for all seven joints simultaneously, our method demonstrated improvements of 65.1%, 43.8%, 34.2%, and 16.2% compared to DDPG, TD3, SAC, and the NDP, respectively. Remarkably, DDPG-DMP surpassed the NDP's peak average return using only one-third of the interactions, exhibiting superior learning efficiency. The NDP's poor performance in the three torque-based tasks is reasonable: its inverse controller failed to generate proper torque trajectories when the observed states lacked torque information, and its on-policy characteristic inherited from PPO also contributed to a lower learning efficiency. In contrast, our method employed an adaptive inverse controller (implemented as one additional neuron in the actor networks) to learn the mapping from joint information to torque trajectories, demonstrating significant advantages in both control performance and learning efficiency. Furthermore, the off-policy nature inherited from DDPG contributed to superior sample efficiency compared to the NDP, which conducts on-policy learning following PPO. We should focus on the improvements of DDPG-DMP over the original DDPG, as our algorithm can be freely applied to TD3 and SAC with almost no modifications, allowing us to benefit from these algorithms' enhanced learning capabilities compared to DDPG.

4.3. Panda Robot Arm Scenarios

In this subsection, we evaluate the proposed method on more complex robot arm control scenarios: PandaReach-v2 and PandaReachJoints-v2 in the Panda robot arm simulation environment panda-gym (https://github.com/qgallouedec/panda-gym, accessed on 15 November 2024). The first task directly controls the three-dimensional position of the end-effector via internal inverse kinematics. The second task extends the first one to control the angle of each joint, further investigating the capability of the proposed method on the Panda robot arm. The reward is designed so that the end-effector reaches the randomly generated target position as closely as possible. The observation states of the two tasks were designed as a 20-dimensional vector including the position and velocity of the end-effector. The learning curves of all compared approaches are illustrated in Figure 4. Table 4 compares the final converged results of all methods. We selected the success rate as the evaluation metric following the original paper of panda-gym [37].
In the PandaReach-v2 task, DDPG struggled to learn and exhibited an unstable learning process, evidenced by a high standard deviation in the success rate. TD3 eventually reached a nearly 100% success rate, albeit with slightly slower convergence. Among the traditional RL methods, only SAC achieved quick and stable convergence. In terms of the approaches using a DMP-based policy, the NDP completely failed to learn this task. As a comparison, the proposed method converged to an almost 100% success rate with an efficiency second only to SAC. It is worth noting that while our method was slightly less efficient than the neural network policy-based SAC in this task, where directly controlling the end-effector position creates a relatively simple scenario, it demonstrated superior smoothness in robot trajectories when controlling the robot arm joint angles. We explore this advantage in the next case.
In the PandaReachJoints-v2 task, the agent attempts to move the end-effector to the randomly generated target position by controlling the seven joint angles of the Panda robot arm. This task implicitly requires learning the unknown inverse kinematics of the Panda robot, making it significantly more complex than the previous task. Such an increased challenge results in all compared methods failing to reach the target successfully within the first 50,000 steps. Among the traditional RL approaches based on a neural network policy, DDPG showed no signs of learning the task, even after about 200,000 interaction steps. TD3 slowly learned to a success rate of around 92%. SAC, despite its advantages in the previous task, underperformed TD3 in this task: it demonstrated good learning efficiency within the first 120,000 steps but became unstable afterward, finally converging to a low success rate (85%) with a high standard deviation. For the approaches using a DMP-based policy, the NDP continued to struggle in this task, while the proposed method demonstrated the significant advantages of DMP's smooth trajectories in both task success rate and learning efficiency. DDPG-DMP not only stably converged to a 99% success rate but also enjoyed markedly superior learning efficiency compared to the other baselines. These results indicate that the adaptive inverse controller is crucial for employing a DMP-based policy in RL for general robotic applications; under the proper setting in DDPG-DMP, the DMP-based policy contributes significant advantages in learning efficiency.

4.4. Case Study

In this subsection, we describe case studies conducted on the OpenAI and Panda robot arm scenarios to demonstrate the advantages of the proposed DDPG-DMP over other baselines. We focus particularly on comparing it to the original DDPG, SAC, and TD3, which employ neural network-based policies, in terms of their impact on robotic arm control behaviors. Figure 5 shows screenshots of the robot arm controlled by the trained DDPG and DDPG-DMP agents in one complete rollout (100 steps) in the Pusher-v0 task, depicting the robot's state at steps 0, 25, 50, 75, and 100 from left to right. It was clearly observed that DDPG's torque control lacked proper coordination between multiple joints, causing excessive robotic arm movement. Between steps 25 and 50, the robot end-effector pushed the object far away from the target position before reversing direction to complete the task. In contrast, the proposed DDPG-DMP generated gentler control actions, smoothly guiding the robot arm's end-effector towards the object from steps 0 to 75, then steadily pushing it to the target within the final 25 steps. Figure 6 illustrates the trajectories of joint angle positions and angular velocities for three crucial joints (wrist3, wrist4, wrist5) using DDPG and DDPG-DMP. Owing to the characteristics of the DMP, our method controlled each joint smoothly in terms of both angle and angular velocity. In contrast, the original DDPG without the DMP-based policy not only exhibited larger joint angle fluctuations but also produced severe jitter in joint angular velocities, further leading to excessively large robot arm control actions and violent swings of the end-effector position.
Furthermore, we analyzed the superiority of DDPG-DMP in directly generating joint angle control trajectories in the PandaReachJoints-v2 task compared to TD3 and SAC (DDPG and the NDP were not included in this comparison due to their poor control performance). Figure 7 compares the trajectories of all seven joint angle signals in one complete rollout for one target position after 175,000 steps of training using the aforementioned methods (joints are represented by indices 0 to 6, from the base to the joint closest to the end-effector). It was observed that in controlling the base joints (indices 0 and 1) and the shoulder joint (index 2), all three methods generated relatively stable control trajectories, while DDPG-DMP achieved relatively smoother and smaller-amplitude joint angle trajectories. For the joints responsible for controlling the first segment of the robotic arm (indices 3 and 4), TD3 and SAC exhibited noticeable jitter, negatively impacting the coordination of all seven joints. In contrast, DDPG-DMP consistently generated smooth trajectories, effectively avoiding joint jitter. For the last two joints (indices 5 and 6), our method's control trajectories demonstrated clear superiority in both an appropriate joint movement range and smooth angle changes, leading to superior control performance and a high success rate.

5. Conclusions and Discussion

In this work, we introduce DDPG-DMP, a novel and effective RL approach that employs DMPs to generate robot control actions. Unlike traditional RL methods that use neural networks to directly output control signals, our method generates proper robot control trajectories that take velocity and acceleration information into consideration, resulting in smooth and effective robot control behaviors. Compared to the related RL approach using DMPs, DDPG-DMP leverages the off-policy actor–critic framework for enhanced learning efficiency and, by proposing an adaptive inverse controller, achieves superior feasibility across a wide range of control signals (e.g., torque, joint angle, and end-effector position) without requiring additional human knowledge. The experimental results of robot manipulation tasks in the OpenAI Gym and Panda simulations demonstrated significant advantages in control performance, learning efficiency, and smoothness of robot actions over related baselines, indicating the potential of our method in the field of robot control.
However, the current DDPG-DMP has limitations in fully utilizing trajectory information: beyond accumulating the rewards over a K-step trajectory, it lacks a mechanism to fully consider the entire trajectory's states and actions (as discussed in Section 3.3). This presents a crucial direction for our future work. We also plan to explore integrating the DDPG-DMP framework with state-of-the-art model-free RL methods such as Continuous Dynamic Policy Programming (CDPP) [18], TD7 [38], and CrossQ [39]. Furthermore, incorporating DMP-based policies into model-based RL methods like Dropout-based Probabilistic Ensembles with Trajectory Sampling (DPETS) [40] and optimizing the initial trajectories of its model predictive controller offers an interesting research direction. From the perspective of engineering implementation, we aim to investigate the potential of DDPG-DMP in various real-world scenarios, including real physical systems such as Panda and UR5 robot arms, quadruped robots, and dexterous hands.

Author Contributions

Conceptualization, S.L. and Y.C. (Yunduan Cui); methodology, S.L., W.H., C.M. and Y.C. (Yunduan Cui); software, S.L., W.H. and C.M.; validation, T.S. and Y.C. (Yunduan Cui); formal analysis, S.L. and Y.C. (Yunduan Cui); investigation, S.L.; writing—original draft preparation, S.L. and Y.C. (Yunduan Cui); writing—review and editing, Y.C. (Yunduan Cui); visualization, S.L. and W.H.; supervision, Y.C. (Yidong Chen), T.S. and Y.C. (Yunduan Cui); project administration, K.X., Y.C. (Yidong Chen), T.S. and Y.C. (Yunduan Cui); funding acquisition, Y.C. (Yidong Chen), K.X. and Y.C. (Yunduan Cui). All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China under Grant 62103403; in part by the Major Program of Science and Technology of Shenzhen under Grants No. KJZD20231023100304010 and No. KJZD20231023100509018; and in part by the Key Program of the Natural Science Foundation of Shenzhen under Grant No. JCYJ20220818101406014.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Author Yidong Chen was employed by the company Shenzhen ZHAOWEI Machinery & Electronics Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DDPG	Deep Deterministic Policy Gradient
DQN	Deep Q Network
DMP	Dynamic Movement Primitive
MDP	Markov Decision Process
NDP	Neural Dynamic Policy
PI2	Policy Improvement with Path Integrals
PPO	Proximal Policy Optimization
RL	Reinforcement Learning
SAC	Soft Actor Critic
TD3	Twin Delayed Deep Deterministic Policy Gradient

References

  1. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  2. Wang, X.; Wang, S.; Liang, X.; Zhao, D.; Huang, J.; Xu, X.; Dai, B.; Miao, Q. Deep Reinforcement Learning: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 5064–5078. [Google Scholar] [CrossRef] [PubMed]
  3. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef] [PubMed]
  4. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
  5. Ye, D.; Liu, Z.; Sun, M.; Shi, B.; Zhao, P.; Wu, H.; Yu, H.; Yang, S.; Wu, X.; Guo, Q.; et al. Mastering complex control in moba games with deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 6672–6679. [Google Scholar]
  6. Masmitja, I.; Martin, M.; O’Reilly, T.; Kieft, B.; Palomeras, N.; Navarro, J.; Katija, K. Dynamic robotic tracking of underwater targets using reinforcement learning. Sci. Robot. 2023, 8, eade7811. [Google Scholar] [CrossRef]
  7. Cui, Y.; Peng, L.; Li, H. Filtered Probabilistic Model Predictive Control-based Reinforcement Learning for Unmanned Surface Vehicles. IEEE Trans. Ind. Inform. 2022, 18, 6950–6961. [Google Scholar] [CrossRef]
  8. Kaufmann, E.; Bauersfeld, L.; Loquercio, A.; Müller, M.; Koltun, V.; Scaramuzza, D. Champion-level drone racing using deep reinforcement learning. Nature 2023, 620, 982–987. [Google Scholar] [CrossRef]
  9. Zhu, L.; Cui, Y.; Takami, G.; Kanokogi, H.; Matsubara, T. Scalable reinforcement learning for plant-wide control of vinyl acetate monomer process. Control. Eng. Pract. 2020, 97, 104331. [Google Scholar] [CrossRef]
  10. Liu, B.; Akcakaya, M.; McDermott, T.E. Automated control of transactive hvacs in energy distribution systems. IEEE Trans. Smart Grid 2020, 12, 2462–2471. [Google Scholar] [CrossRef]
  11. Yuan, Z.; Zhang, Z.; Li, X.; Cui, Y.; Li, M.; Ban, X. Controlling Partially Observed Industrial System Based on Offline Reinforcement Learning—A Case Study of Paste Thickener. IEEE Trans. Ind. Inform. 2024, 1–11. [Google Scholar] [CrossRef]
  12. Tsurumine, Y.; Cui, Y.; Uchibe, E.; Matsubara, T. Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation. Robot. Auton. Syst. 2019, 112, 72–83. [Google Scholar] [CrossRef]
  13. Won, D.O.; Müller, K.R.; Lee, S.W. An adaptive deep reinforcement learning framework enables curling robots with human-like performance in real-world conditions. Sci. Robot. 2020, 5, eabb9764. [Google Scholar] [CrossRef] [PubMed]
  14. Haarnoja, T.; Moran, B.; Lever, G.; Huang, S.H.; Tirumala, D.; Humplik, J.; Wulfmeier, M.; Tunyasuvunakool, S.; Siegel, N.Y.; Hafner, R.; et al. Learning agile soccer skills for a bipedal robot with deep reinforcement learning. Sci. Robot. 2024, 9, eadi8022. [Google Scholar] [CrossRef] [PubMed]
  15. Kober, J.; Bagnell, J.A.; Peters, J. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 2013, 32, 1238–1274. [Google Scholar] [CrossRef]
  16. Tutsoy, O.; Asadi, D.; Ahmadi, K.; Nabavi-Chashmi, S.Y.; Iqbal, J. Minimum Distance and Minimum Time Optimal Path Planning With Bioinspired Machine Learning Algorithms for Faulty Unmanned Air Vehicles. IEEE Trans. Intell. Transp. Syst. 2024, 25, 9069–9077. [Google Scholar] [CrossRef]
  17. Vieillard, N.; Kozuno, T.; Scherrer, B.; Pietquin, O.; Munos, R.; Geist, M. Leverage the average: An analysis of KL regularization in reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 6–12 December 2020. [Google Scholar]
  18. Shang, Z.; Li, R.; Zheng, C.; Li, H.; Cui, Y. Relative Entropy Regularized Sample-Efficient Reinforcement Learning with Continuous Actions. IEEE Trans. Neural Netw. Learn. Syst. 2023, 1–11. [Google Scholar] [CrossRef]
  19. Janner, M.; Fu, J.; Zhang, M.; Levine, S. When to trust your model: Model-based policy optimization. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  20. Hansen, N.; Wang, X.; Su, H. Temporal Difference Learning for Model Predictive Control. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
  21. Schaal, S. Dynamic movement primitives-a framework for motor control in humans and humanoid robotics. In Adaptive Motion of Animals and Machines; Springer: Berlin/Heidelberg, Germany, 2006; pp. 261–280. [Google Scholar]
  22. Saveriano, M.; Abu-Dakka, F.J.; Kramberger, A.; Peternel, L. Dynamic movement primitives in robotics: A tutorial survey. Int. J. Robot. Res. 2023, 42, 1133–1184. [Google Scholar] [CrossRef]
  23. Nemec, B.; Ude, A. Speed profile optimization through directed explorative learning. In Proceedings of the 2014 IEEE-RAS International Conference on Humanoid Robots, Atlanta, GA, USA, 18–20 November 2014; pp. 547–553. [Google Scholar]
  24. Vuga, R.; Nemec, B.; Ude, A. Enhanced policy adaptation through directed explorative learning. Int. J. Humanoid Robot. 2015, 12, 1550028. [Google Scholar] [CrossRef]
  25. Colomé, A.; Torras, C. Dimensionality reduction for dynamic movement primitives and application to bimanual manipulation of clothes. IEEE Trans. Robot. 2018, 34, 602–615. [Google Scholar] [CrossRef]
  26. Cui, Y.; Poon, J.; Miro, J.V.; Yamazaki, K.; Sugimoto, K.; Matsubara, T. Environment-adaptive interaction primitives through visual context for human–robot motor skill learning. Auton. Robot. 2019, 43, 1225–1240. [Google Scholar] [CrossRef]
  27. Lai, Y.; Paul, G.; Cui, Y.; Matsubara, T. User intent estimation during robot learning using physical human robot interaction primitives. Auton. Robot. 2022, 46, 421–436. [Google Scholar] [CrossRef]
  28. Theodorou, E.; Buchli, J.; Schaal, S. A generalized path integral control approach to reinforcement learning. J. Mach. Learn. Res. 2010, 11, 3137–3181. [Google Scholar]
  29. Hazara, M.; Kyrki, V. Reinforcement learning for improving imitated in-contact skills. In Proceedings of the 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), Cancun, Mexico, 15–17 November 2016; pp. 194–201. [Google Scholar] [CrossRef]
  30. Bahl, S.; Mukadam, M.; Gupta, A.; Pathak, D. Neural dynamic policies for end-to-end sensorimotor learning. Adv. Neural Inf. Process. Syst. 2020, 33, 5058–5069. [Google Scholar]
  31. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  32. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  33. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning. PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
  34. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning. PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. [Google Scholar]
  35. Kamthe, S.; Deisenroth, M. Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Playa Blanca, Spain, 9 April 2018; pp. 1701–1710. [Google Scholar]
  36. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  37. Gallouédec, Q.; Cazin, N.; Dellandréa, E.; Chen, L. Panda-gym: Open-source goal-conditioned environments for robotic learning. In Proceedings of the 4th Robot Learning Workshop: Self-Supervised and Lifelong Learning at NeurIPS 2021, Virtual, 14 December 2021. [Google Scholar]
  38. Fujimoto, S.; Chang, W.D.; Smith, E.; Gu, S.S.; Precup, D.; Meger, D. For sale: State-action representation learning for deep reinforcement learning. Adv. Neural Inf. Process. Syst. 2024, 36. [Google Scholar]
  39. Bhatt, A.; Palenicek, D.; Belousov, B.; Argus, M.; Amiranashvili, A.; Brox, T.; Peters, J. CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  40. Huang, W.; Cui, Y.; Li, H.; Wu, X. Practical Probabilistic Model-Based Reinforcement Learning by Integrating Dropout Uncertainty and Trajectory Sampling. IEEE Trans. Neural Netw. Learn. Syst. 2024, 1–15. [Google Scholar] [CrossRef]
Figure 1. Principle and workflow of related work on NDP.
Figure 2. Principle and workflow of the proposed method DDPG-DMP.
Figure 3. Learning curves over three OpenAI robot arm scenarios. Curves are uniformly smoothed for visual clarity. The shaded region represents the corresponding standard deviation.
Figure 4. Learning curves over two Panda robot arm simulation scenarios. Curves are uniformly smoothed for visual clarity. The shaded region represents the corresponding standard deviation.
Figure 5. Screenshots of robot arm in the Pusher-v0 task using trained DDPG-DMP and DDPG agents during one complete rollout. The subfigures indicate the robot’s state at steps 0 ,   25 ,   50 ,   75 , and 100 from left to right.
Figure 6. Trajectories of angle position and angle velocity of joints wrist3, wrist4, and wrist5 in one complete rollout using trained DDPG and proposed DDPG-DMP agents.
Figure 7. Trajectories of seven joints’ angle position on the simulated Panda robot arm in the PandaReachJoints-v2 task during one complete rollout using trained SAC, TD3, and proposed DDPG-DMP agents.
Table 2. Common hyperparameter settings of the compared environments.
| | Reacher-v1 | Pusher-v0 | Thrower-v0 | PandaReach-v2 | PandaReachJoints-v2 |
| State Dimension | 11 | 23 | 23 | 20 | 20 |
| Action Dimension | 2 | 7 | 7 | 3 | 7 |
| Critic Learning Rate | $3 \times 10^{-4}$ | $3 \times 10^{-4}$ | $3 \times 10^{-4}$ | $3 \times 10^{-4}$ | $3 \times 10^{-4}$ |
| Actor Learning Rate | $5 \times 10^{-4}$ | $5 \times 10^{-4}$ | $5 \times 10^{-4}$ | $3 \times 10^{-4}$ | $3 \times 10^{-4}$ |
| Target Update Rate ($\tau$) | $5 \times 10^{-3}$ | $5 \times 10^{-3}$ | $5 \times 10^{-3}$ | $5 \times 10^{-3}$ | $5 \times 10^{-3}$ |
| Networks Structure | (256, 256) | (256, 256) | (256, 256) | (256, 256) | (256, 256) |
| Batch Size (J) | 256 | 256 | 256 | 256 | 256 |
| Discount Factor ($\gamma$) | 0.99 | 0.99 | 0.99 | 0.95 | 0.95 |
| Steps per Update (K) | 5 | 5 | 5 | 5 | 5 |
| Memory Buffer Size | $10^{6}$ | $10^{6}$ | $10^{6}$ | $10^{5}$ | $10^{5}$ |
| Warmup Steps | $10^{4}$ | $10^{4}$ | $10^{4}$ | 2000 | 2000 |
| Control Modes | Torque | Torque | Torque | End-Effector Position | Joint Angle |
Table 3. Average return over three OpenAI robot arm scenarios.
| | DDPG-DMP | DDPG | TD3 | SAC | NDP |
| Reacher-v1 | −7.36 ± 0.54 | −6.58 ± 0.54 | −7.89 ± 0.57 | −6.10 ± 0.61 | −15.56 ± 1.47 |
| Pusher-v0 | −33.37 ± 1.73 | −44.23 ± 8.82 | −32.06 ± 1.41 | −31.87 ± 4.01 | −46.81 ± 2.16 |
| Thrower-v0 | −62.74 ± 9.78 | −180.37 ± 24.61 | −111.52 ± 23.74 | −95.33 ± 27.72 | −74.90 ± 18.27 |
Table 4. Success rates over two Panda simulation control scenarios.
| | DDPG-DMP | DDPG | TD3 | SAC | NDP |
| PandaReach-v2 | 0.99 ± 0.00 | 0.41 ± 0.24 | 0.99 ± 0.00 | 0.99 ± 0.01 | 0.00 ± 0.00 |
| PandaReachJoints-v2 | 0.99 ± 0.01 | 0.00 ± 0.00 | 0.92 ± 0.01 | 0.85 ± 0.41 | 0.00 ± 0.00 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
