Efficient Robot Skills Learning with Weighted Near-Optimal Experiences Policy Optimization
Abstract
1. Introduction
2. Preliminary
3. Methods
3.1. Experience Scoring Algorithm
- Cumulative discounted reward. The ultimate goal of reinforcement learning is to obtain the maximum expected cumulative reward, so it is intuitive to use the cumulative discounted reward as an indicator of the quality of experience data. For a single trajectory, the greater the final cumulative reward, the better the overall performance of the episode and the more valuable it is to learn from.
- The variance of the single-step rewards. If a single-step reward is much larger than the average, it guides the network update more effectively in the positive direction; if it is much smaller than the average, it guides the update more effectively in the opposite direction. Reward information close to the average contributes little to network updates. The analogy is that people accumulate more life experience from great successes or great frustrations. However, experiences with too large a variance may lead to more radical network updates and increase the instability of training.
- Episode length. Episode length is correlated with the cumulative discounted reward, but the correlation is not strictly positive. For example, an episode in which the single-step reward is always low but which lasts for a long time can accumulate a large cumulative reward while its average single-step reward remains small. By taking episode length into account, such a trajectory is not considered valuable merely because its cumulative reward is large. (A minimal scoring sketch combining these three indicators follows this list.)
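As a concrete illustration of how these three indicators could be combined, the following Python sketch scores a single recorded trajectory. The function name `score_trajectory`, the weights `w_r`, `w_var`, and `w_len`, and the linear combination itself are assumptions made here for illustration; the paper defines its own scoring formula.

```python
import numpy as np

def score_trajectory(rewards, gamma=0.99, w_r=1.0, w_var=0.5, w_len=0.1):
    """Score one episode from its single-step rewards.

    Combines the cumulative discounted reward, the variance of the
    single-step rewards, and the episode length. The weights and the
    linear combination are illustrative, not the paper's exact formula.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    T = len(rewards)                                   # episode length
    cum_reward = float(np.sum(gamma ** np.arange(T) * rewards))  # discounted return
    reward_var = float(np.var(rewards))                # variance of single-step rewards
    # Penalize length so that long but mediocre episodes are not scored as
    # valuable merely because many small rewards accumulate.
    return w_r * cum_reward + w_var * reward_var - w_len * T
```

Under these assumed weights, a short episode of consistently high rewards scores higher than a much longer episode of uniformly small rewards, which matches the intent of the episode-length criterion above.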
3.2. Weighted Near-Optimal Experiences Policy Optimization
3.2.1. Dynamics Model Fitting
3.2.2. Linear Gaussian Controller Optimization
3.2.3. Policy Network Optimization
- (1) Initialize the linear Gaussian controller and the policy network;
- (2) Control the robot to walk with the linear Gaussian controller and record the experiences in the weighted replay buffer;
- (3) Use the experiences stored in the weighted replay buffer to update the linear Gaussian controller and the policy network in a supervised learning manner;
- (4) Check whether the cumulative reward obtained by the linear Gaussian controller has converged; if it has, skip to step (5), otherwise return to step (2);
- (5) Control the robot to walk with the policy network and record the experiences in the weighted replay buffer;
- (6) Use the PPO algorithm to update the policy network until the cumulative reward converges.

A minimal Python sketch of this two-phase loop is given after Algorithm 1 below.
Algorithm 1. WNEPO: A two-phase framework for efficient robotic skills learning
1: Initialize the linear Gaussian controller, the policy network, the dynamics model, and the weighted replay buffer D
2: Initialize the hyperparameters, including the iteration numbers K and J
3: For k = 1, ..., K:
4: Reset the environment
5: Interact with the environment M times with the linear Gaussian controller
6: Collect the resulting trajectories and update the weighted replay buffer D
7: Fit the dynamics model to D
8: For j = 1, ..., J:
9: Update the linear Gaussian controller with Equation (16)
10: If the cumulative reward of the linear Gaussian controller has not converged:
11: Optimize the policy network with Adam according to Equation (25)
12: Else:
13: Interact with the environment N times with the policy network
14: Collect the resulting trajectories and update the weighted replay buffer D
15: Randomly pick minibatch sequences from D
16: Update the policy network based on PPO
17: Update
18: End For
19: End For
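To make the control flow of Algorithm 1 concrete, here is a minimal Python skeleton of the two-phase loop. All of the objects and methods used below (`env.rollout`, `buffer.add`, `buffer.sample_weighted`, `buffer.transitions`, `dynamics.fit`, `controller.improve`, `controller.converged`, `policy.supervised_update`, `policy.ppo_update`) are hypothetical interfaces introduced only for this sketch; they are not the authors' implementation.

```python
def train_wnepo(env, controller, policy, dynamics, buffer,
                outer_iters=50, M=5, N=5, inner_iters=10):
    """Two-phase WNEPO skeleton (illustrative only, hypothetical interfaces)."""
    for _ in range(outer_iters):
        # Imitation-phase data collection: M rollouts with the linear
        # Gaussian controller, scored into the weighted replay buffer.
        for _ in range(M):
            traj = env.rollout(controller)
            buffer.add(traj, score=traj.score)
        dynamics.fit(buffer.transitions())        # fit the local dynamics model
        for _ in range(inner_iters):
            controller.improve(dynamics)          # cf. Equation (16)
            if not controller.converged():
                # Imitation phase: supervised update of the policy network
                # on weighted near-optimal experiences, cf. Equation (25).
                policy.supervised_update(buffer.sample_weighted())
            else:
                # Fine-tuning phase: N rollouts with the policy network,
                # followed by a PPO update on weighted experiences.
                for _ in range(N):
                    traj = env.rollout(policy)
                    buffer.add(traj, score=traj.score)
                policy.ppo_update(buffer.sample_weighted())
    return policy
```

The switch between the imitation phase and the PPO fine-tuning phase is driven by the convergence check on the controller's return, mirroring step (4) of the procedure above.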
4. Experiments
4.1. Experiment Setup
4.1.1. Description of the Environment
4.1.2. Parameter Specification
4.1.3. Comparison Methods
- iLQG [19]. iLQG is a typical model-based RL algorithm. When the environment dynamics are known, the optimal analytical solution can be obtained.
- GPS [6]. GPS is a state-of-the-art algorithm combining model-based RL with model-free RL.
- WE-GPS. Replaces the experience pool in GPS with a weighted replay buffer.
- WE-PPO. Replaces the replay buffer in PPO with a weighted replay buffer.
- GPS-PPO. GPS is used to update the policy network offline, and then the PPO algorithm is used to train the policy network online. The only difference between GPS-PPO and WNEPO is that GPS-PPO directly uses the online interaction data between the policy network and the environment, rather than the experiences in the weighted replay buffer, to update the policy. (A minimal weighted-buffer sketch follows this list.)
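Since several of these variants differ only in whether experiences come from a plain replay buffer or a weighted one, the following self-contained sketch shows one possible weighted replay buffer that samples trajectories in proportion to their scores. The class name, capacity, eviction rule, and sampling scheme are assumptions for illustration, not the paper's implementation.

```python
import random

class WeightedReplayBuffer:
    """Stores scored trajectories and samples them in proportion to their score."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.items = []  # list of (score, trajectory) pairs

    def add(self, trajectory, score):
        """Insert a trajectory with its score; evict the worst if over capacity."""
        self.items.append((score, trajectory))
        if len(self.items) > self.capacity:
            worst = min(range(len(self.items)), key=lambda i: self.items[i][0])
            self.items.pop(worst)

    def sample(self, k=32):
        """Sample k trajectories with probability proportional to their scores."""
        weights = [max(score, 1e-8) for score, _ in self.items]
        chosen = random.choices(self.items, weights=weights, k=k)
        return [traj for _, traj in chosen]
```

Replacing the weighted sampling with uniform sampling recovers an ordinary replay buffer, which is, for example, the only difference between PPO and WE-PPO in this comparison.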
4.2. Result
4.2.1. Evaluation of Walking Skills Learned from the Imitation Phase
4.2.2. Asymptotic Performance and Sample Efficiency
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kuindersma, S.; Deits, R.; Fallon, M.; Valenzuela, A.; Dai, H.; Permenter, F.; Koolen, T.; Marion, P.; Tedrake, R. Optimization-based locomotion planning, estimation, and control design for the atlas humanoid robot. Auton. Robot. 2016, 40, 429–455.
- Raibert, M.; Blankespoor, K.; Nelson, G.; Playter, R. BigDog, the rough-terrain quadruped robot. IFAC Proc. Vol. 2008, 41, 10822–10825.
- Miller, A.T.; Knoop, S.; Christensen, H.I.; Allen, P.K. Automatic grasp planning using shape primitives. In Proceedings of the 2003 IEEE International Conference on Robotics and Automation, Taipei, Taiwan, 14–19 September 2003; pp. 1824–1829.
- Saxena, A.; Driemeyer, J.; Ng, A.Y. Robotic grasping of novel objects using vision. Int. J. Robot. Res. 2008, 27, 157–173.
- Kober, J.; Bagnell, J.A.; Peters, J. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 2013, 32, 1238–1274.
- Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 2016, 17, 1334–1373.
- Kalashnikov, D.; Irpan, A.; Pastor, P.; Ibarz, J.; Herzog, A.; Jang, E.; Quillen, D.; Holly, E.; Kalakrishnan, M.; Vanhoucke, V.; et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Proceedings of the 2018 Conference on Robot Learning, Zürich, Switzerland, 29–31 October 2018; pp. 651–673.
- Schoettler, G.; Nair, A.; Ojea, J.A.; Levine, S. Meta-Reinforcement Learning for Robotic Industrial Insertion Tasks. arXiv 2020, arXiv:2004.14404.
- Cho, N.; Lee, S.H.; Kim, J.B.; Suh, I.H. Learning, Improving, and Generalizing Motor Skills for the Peg-in-Hole Tasks Based on Imitation Learning and Self-Learning. Appl. Sci. 2020, 10, 2719.
- Peng, X.B.; Berseth, G.; Yin, K.; van de Panne, M. DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph. 2017, 36, 1–13.
- Zhang, M.; Geng, X.; Bruce, J.; Caluwaerts, K.; Vespignani, M.; SunSpiral, V.; Abbeel, P.; Levine, S. Deep reinforcement learning for tensegrity robot locomotion. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation, Singapore, 29 May–3 June 2017; pp. 634–641.
- Liu, N.; Cai, Y.; Lu, T.; Wang, R.; Wang, S. Real–Sim–Real Transfer for Real-World Robot Control Policy Learning with Deep Reinforcement Learning. Appl. Sci. 2020, 10, 1555.
- Abbeel, P.; Coates, A.; Quigley, M.; Ng, A.Y. An application of reinforcement learning to aerobatic helicopter flight. In Proceedings of the 2006 International Conference on Neural Information Processing, Vancouver, BC, Canada, 4–7 December 2006; pp. 1–8.
- Zhang, M.; Vikram, S.; Smith, L.; Abbeel, P.; Johnson, M.; Levine, S. SOLAR: Deep structured representations for model-based reinforcement learning. In Proceedings of the 2019 International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7444–7453.
- Thuruthel, T.G.; Falotico, E.; Renda, F.; Laschi, C. Model-based reinforcement learning for closed-loop dynamic control of soft robotic manipulators. IEEE Trans. Robot. 2018, 35, 124–134.
- Clavera, I.; Rothfuss, J.; Schulman, J.; Fujita, Y.; Asfour, T.; Abbeel, P. Model-Based Reinforcement Learning via Meta-Policy Optimization. In Proceedings of the 2018 Conference on Robot Learning, Zürich, Switzerland, 29–31 October 2018; pp. 617–629.
- Asadi, K.; Misra, D.; Kim, S.; Littman, M.L. Combating the compounding-error problem with a multi-step model. arXiv 2019, arXiv:1905.13320.
- Levine, S.; Koltun, V. Learning complex neural network policies with trajectory optimization. In Proceedings of the 2014 International Conference on Machine Learning, Beijing, China, 21–24 June 2014; pp. 829–837.
- Todorov, E.; Li, W. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005 American Control Conference, Portland, OR, USA, 8–10 June 2005; pp. 300–306.
- Kajita, S.; Hirukawa, H.; Harada, K. Introduction to Humanoid Robotics; Springer: Berlin/Heidelberg, Germany, 2014.
- Heess, N.; TB, D.; Sriram, S.; Lemmon, J.; Merel, J.; Wayne, G.; Tassa, Y. Emergence of Locomotion Behaviours in Rich Environments. arXiv 2017, arXiv:1707.02286.
- Kaneko, K.; Kanehiro, F.; Kajita, S.; Yokoyama, K.; Akachi, K.; Kawasaki, T.; Ota, S. Design of prototype humanoid robotics platform for HRP. In Proceedings of the 2002 International Conference on Intelligent Robots and Systems, Lausanne, Switzerland, 30 September–4 October 2002; pp. 2431–2436.
- Choi, M.; Lee, J.; Shin, S. Planning biped locomotion using motion capture data and probabilistic roadmaps. ACM Trans. Graph. 2003, 22, 182–203.
- Taga, G. A model of the neuro-musculo-skeletal system for anticipatory adjustment of human locomotion during obstacle avoidance. Biol. Cybern. 1998, 78, 9–17.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
- Schaal, S. Is imitation learning the route to humanoid robots? Trends Cogn. Sci. 1999, 3, 233–242.
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the 2015 International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1889–1897.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
- Wang, H.; Banerjee, A. Bregman Alternating Direction Method of Multipliers. In Proceedings of the 2014 International Conference on Neural Information Processing, Montreal, QC, Canada, 8–13 December 2014; pp. 2816–2824.
- Zainuddin, Z.; Ong, P. Function approximation using artificial neural networks. WSEAS Trans. Math. 2008, 7, 333–338.
- Cambanis, S. A general approach to linear mean-square estimation problems. IEEE Trans. Inf. Theory 1973, 19, 110–114.
- Balogun, O.S.; Hubbard, M.; DeVries, J.J. Automatic control of canal flow using linear quadratic regulator theory. J. Hydraul. Eng. 1988, 114, 75–102.
- Wang, Y.; Li, T.S.; Lin, C. Backward Q-learning: The combination of Sarsa algorithm and Q-learning. Eng. Appl. Artif. Intell. 2013, 26, 2184–2193.
RL Methods | Advantages | Disadvantages
---|---|---
Model-free RL | No need for prior knowledge; strong asymptotic performance | Slow convergence; high risk of damage to the robot and environment
Model-based RL | Less online interaction, safer for the robot and environment; fast convergence | Depends on explicit models; poor asymptotic performance
GPS | All of the above advantages | May never explore the optimal space; cannot be updated after deployment
WNEPO | Same as GPS | None of the above, but an additional component is required
Parameter | Leg_Radius | Lower_Leg_Length | Upper_Leg_Length | Torso | Foot
---|---|---|---|---|---
Value | 0.75 cm | 10 cm | 10 cm | 5 × 8 × 8 cm³ | 5 × 4 × 1 cm³

Parameter | Density | Torso_Offset_x | Torso_Offset_z | Joint_Damping | Joint_Stiffness
---|---|---|---|---|---
Value | 1.05 g/cm³ | −1 cm | −2 cm | 0.001 N·s/cm | 0
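For convenience, the simulated walker's parameters above can be gathered into a single configuration object. The sketch below is a hypothetical Python dictionary that simply mirrors the table values; the key names are chosen here and do not come from the authors' simulation code.

```python
# Simulated walker parameters, mirroring the table above.
# Key names are illustrative; units are encoded in the names.
walker_params = {
    "leg_radius_cm": 0.75,
    "lower_leg_length_cm": 10.0,
    "upper_leg_length_cm": 10.0,
    "torso_size_cm": (5, 8, 8),         # 5 x 8 x 8 cm^3
    "foot_size_cm": (5, 4, 1),          # 5 x 4 x 1 cm^3
    "density_g_per_cm3": 1.05,
    "torso_offset_x_cm": -1.0,
    "torso_offset_z_cm": -2.0,
    "joint_damping_Ns_per_cm": 0.001,
    "joint_stiffness": 0.0,
}
```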
Index | iLQG | GPS | WE-GPS | PPO | WE-PPO | WNEPO | GPS-PPO |
---|---|---|---|---|---|---|---|
Distance (m) | 4.56 | 4.49 | 4.51 | 4.83 | 4.91 | 4.95 | 2.63 |
Avg. steps for task completion | 77.9 | 78.0 | 78.2 | 75.6 | 72.1 | 71.4 | Failure |
Convergence episodes (approximate) | 1100 | 1400 | 1300 | 3700 | 3300 | 1900 | Not converged |