Variational Information Bottleneck Regularized Deep Reinforcement Learning for Efficient Robotic Skill Adaptation
Abstract
1. Introduction
- We develop a novel meta-reinforcement learning framework based on a variational information bottleneck. The framework consists of two stages: a meta-training stage and a meta-testing stage. The meta-training stage extracts the basic tasks and their corresponding basic policies; the meta-testing stage efficiently infers a policy for a new task by exploiting these basic tasks and basic policies (a minimal sketch of this two-stage structure follows this list).
- The meta-training and meta-testing algorithms are presented in detail, so the framework supports efficient robotic skill transfer learning in dynamic environments.
- Empirical experiments based on MuJoCo demonstrate the effectiveness of the proposed scheme.
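The following is a minimal, illustrative sketch of the two-stage structure described in the first bullet. All names (`BasicSkill`, `meta_train`, `meta_test`, the `similarity` function) are hypothetical and are not the authors' implementation; the concrete procedures are given in Section 3.3.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class BasicSkill:
    """A basic task paired with the basic policy learned for it during meta-training."""
    task_id: str
    policy_params: Dict[str, float] = field(default_factory=dict)

def meta_train(tasks: List[str],
               learn_policy: Callable[[str], Dict[str, float]]) -> List[BasicSkill]:
    """Meta-training stage: extract basic tasks and their corresponding basic policies."""
    return [BasicSkill(task, learn_policy(task)) for task in tasks]

def meta_test(new_task: str, skills: List[BasicSkill],
              similarity: Callable[[str, str], float]) -> Dict[str, float]:
    """Meta-testing stage: start the new policy from the most related basic policy."""
    closest = max(skills, key=lambda s: similarity(s.task_id, new_task))
    return dict(closest.policy_params)  # initialization for fast adaptation on new_task
```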
2. Problem Formulation and Background
2.1. Problem Formulation
2.2. Markov Decision Process (MDP)
2.3. Maximum Entropy Actor-Critic
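For reference: the maximum-entropy objective underlying soft actor-critic (Haarnoja et al., cited in the references) augments the expected return with a policy-entropy term weighted by a temperature α. This is the standard form, stated here as background rather than as the paper's exact equation:

$$ J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \big[ r(s_t, a_t) + \alpha \, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \big] $$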
2.4. Variational Information Bottleneck (VIB)
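For reference: the information bottleneck (Tishby et al.) seeks a stochastic code Z that is maximally informative about a target Y while limiting how much it retains about the input X, and its deep variational treatment (Alemi et al.) optimizes a tractable bound on this objective; both works are cited in the references. The constrained form below matches the Lagrange multiplier β and the information constraint Ic listed in the hyperparameter table, but it is standard background rather than the paper's exact formulation:

$$ \max_{p(z \mid x)} \; I(Z; Y) \quad \text{s.t.} \quad I(Z; X) \le I_c, \qquad \text{Lagrangian form:} \quad \max_{p(z \mid x)} \; I(Z; Y) - \beta \, I(Z; X) $$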
3. VIB-Based Meta-Reinforcement Learning
3.1. Overview
3.2. Latent Space Learning
3.3. VIB-Based Meta-Reinforcement Learning Algorithm
Algorithm 1 VIB-based meta-reinforcement learning training algorithm.
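Below is a hedged, illustrative sketch of what a VIB-regularized meta-training loop of this kind could look like, assembled from the components the outline names: a maximum-entropy actor-critic, a latent task encoder, and the β/Ic constraint reported in the hyperparameter table. Function names and the update logic are placeholders, not the authors' Algorithm 1.

```python
# Illustrative stand-ins only; the real algorithm is Algorithm 1 in the paper.
import numpy as np

rng = np.random.default_rng(0)

def collect_trajectories(task, policy_params, n=10):
    """Stand-in for environment rollouts; returns dummy transition batches."""
    return rng.normal(size=(n, 4))                 # (s, a, r, s') placeholders

def encode_task(trajectories):
    """Stand-in VIB task encoder: maps context data to a latent mean and std."""
    mu = trajectories.mean(axis=0)
    std = np.full_like(mu, 0.5)
    return mu, std

def kl_to_prior(mu, std):
    """KL( N(mu, diag(std^2)) || N(0, I) ); upper-bounds the encoded information I(Z;X)."""
    return 0.5 * np.sum(std**2 + mu**2 - 1.0 - 2.0 * np.log(std))

def sac_update(policy_params, batch, z):
    """Stand-in maximum-entropy actor-critic update, conditioned on the latent z."""
    return policy_params - 1e-3 * rng.normal(size=policy_params.shape)

beta, Ic = 0.1, 1.0                                # values from the hyperparameter table
tasks = ["task_%d" % i for i in range(4)]          # meta-training task set (illustrative)
policy_params = rng.normal(size=8)

for epoch in range(3):
    for task in tasks:
        batch = collect_trajectories(task, policy_params)
        mu, std = encode_task(batch)
        z = mu + std * rng.normal(size=mu.shape)   # reparameterized task latent
        info_penalty = beta * max(0.0, kl_to_prior(mu, std) - Ic)
        policy_params = sac_update(policy_params, batch, z)
        # In the real algorithm the encoder/critic losses would include info_penalty;
        # here it only marks where the VIB constraint enters the training loop.
```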
Algorithm 2 VIB-based meta-reinforcement learning testing algorithm.
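Continuing the sketch above (and reusing its stand-in functions), a hedged reading of the testing stage: infer a latent task code from a small amount of new-task experience, act with the latent-conditioned policy, and optionally fine-tune. This is illustrative only; the exact procedure is Algorithm 2.

```python
def meta_test_adapt(new_task, policy_params, n_context=5, n_adapt_steps=10):
    """Infer a latent task code from a little new-task experience, then act/fine-tune."""
    context = collect_trajectories(new_task, policy_params, n=n_context)
    mu, std = encode_task(context)                 # posterior over the task latent
    z = mu                                         # act with the posterior mean
    for _ in range(n_adapt_steps):                 # optional few-shot fine-tuning
        batch = collect_trajectories(new_task, policy_params)
        policy_params = sac_update(policy_params, batch, z)
    return policy_params, z

adapted_params, z_new = meta_test_adapt("new-velocity-task", policy_params)
```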
4. Experiments
- Does the VIB-based meta-reinforcement learning algorithm realize efficient skill transfer?
- How do the learning efficiency and asymptotic performance of the VIB-based meta-reinforcement learning algorithm compare with those of other meta-learning approaches, such as MAML and ProMP?
- Does the VIB-based meta-reinforcement learning algorithm improve the learning performance during the training stage in comparison with other algorithms?
4.1. Experiments Configuration
4.2. Comparative Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529. [Google Scholar] [CrossRef] [PubMed]
- Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Driessche, G.V.D.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of go with deep neural networks and tree search. Nature 2016, 529, 484. [Google Scholar] [CrossRef] [PubMed]
- Hou, Z.; Fei, J.; Deng, Y.; Xu, J. Data-efficient hierarchical reinforcement learning for robotic assembly control applications. IEEE Trans. Ind. Electron. 2020, 68, 11565–11575. [Google Scholar] [CrossRef]
- Funk, N.; Chalvatzaki, G.; Belousov, B.; Peters, J. Learn2assemble with structured representations and search for robotic architectural construction. In Proceedings of the 5th Conference on Robot Learning, PMLR, Auckland, New Zealand, 14–18 December 2022; pp. 1401–1411. [Google Scholar]
- Guez, A.; Vincent, R.D.; Avoli, M.; Pineau, J. Adaptive treatment of epilepsy via batch-mode reinforcement learning. AAAI 2008, 1671–1678. [Google Scholar]
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436. [Google Scholar] [CrossRef]
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.I.; Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; Volume 37, pp. 1889–1897. [Google Scholar]
- Haarnoja, T.; Tang, H.; Abbeel, P.; Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, JMLR.org, Sydney, Australia, 6–11 August 2017; pp. 1352–1361. [Google Scholar]
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1856–1865. [Google Scholar]
- Nachum, O.; Norouzi, M.; Xu, K.; Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 2775–2785. [Google Scholar]
- McGuire, K.; Wagter, C.D.; Tuyls, K.; Kappen, H.; de Croon, G.C. Minimal navigation solution for a swarm of tiny flying robots to explore an unknown environment. Sci. Robot. 2019, 4, eaaw9710. [Google Scholar] [CrossRef]
- Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3357–3364. [Google Scholar]
- Hwangbo, J.; Lee, J.; Dosovitskiy, A.; Bellicoso, D.; Tsounis, V.; Koltun, V.; Hutter, M. Learning agile and dynamic motor skills for legged robots. Sci. Robot. 2019, 4, eaau5872. [Google Scholar] [CrossRef] [Green Version]
- Miki, T.; Lee, J.; Hwangbo, J.; Wellhausen, L.; Koltun, V.; Hutter, M. Learning robust perceptive locomotion for quadrupedal robots in the wild. Sci. Robot. 2022, 7, eabk2822. [Google Scholar] [CrossRef]
- Kopicki, M.S.; Belter, D.; Wyatt, J.L. Learning better generative models for dexterous, single-view grasping of novel objects. Int. J. Robot. Res. 2019, 38, 1246–1267. [Google Scholar] [CrossRef] [Green Version]
- Bhagat, S.; Banerjee, H.; Tse, Z.T.H.; Ren, H. Deep reinforcement learning for soft, flexible robots: Brief review with impending challenges. Robotics 2019, 8, 4. [Google Scholar] [CrossRef] [Green Version]
- Thuruthel, T.G.; Falotico, E.; Renda, F.; Laschi, C. Model-based reinforcement learning for closed-loop dynamic control of soft robotic manipulators. IEEE Trans. Robot. 2018, 35, 124–134. [Google Scholar] [CrossRef]
- Wang, C.; Zhang, Q.; Tian, Q.; Li, S.; Wang, X.; Lane, D.; Petillot, Y.; Wang, S. Learning mobile manipulation through deep reinforcement learning. Sensors 2020, 20, 939. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Mahler, J.; Matl, M.; Satish, V.; Danielczuk, M.; DeRose, B.; McKinley, S.; Goldberg, K. Learning ambidextrous robot grasping policies. Sci. Robot. 2019, 4, eaau4984. [Google Scholar] [CrossRef] [PubMed]
- Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1329–1338. [Google Scholar]
- Munos, R.; Stepleton, T.; Harutyunyan, A.; Bellemare, M. Safe and efficient off-policy reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1054–1062. [Google Scholar]
- Kober, J.; Bagnell, J.A.; Peters, J. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 2013, 32, 1238–1274. [Google Scholar] [CrossRef] [Green Version]
- Deisenroth, M.P.; Neumann, G.; Peters, J. A survey on policy search for robotics. Found. Trends Robot. 2013, 2, 1–142. [Google Scholar]
- Dulac-Arnold, G.; Levine, N.; Mankowitz, D.J.; Li, J.; Paduraru, C.; Gowal, S.; Hester, T. Challenges of real-world reinforcement learning: Definitions, benchmarks and analysis. Mach. Learn. 2021, 110, 2419–2468. [Google Scholar] [CrossRef]
- Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; Meger, D. Deep reinforcement learning that matters. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
- Braun, D.A.; Aertsen, A.; Wolpert, D.M.; Mehring, C. Learning optimal adaptation strategies in unpredictable motor tasks. J. Neurosci. 2009, 29, 6472–6478. [Google Scholar] [CrossRef] [Green Version]
- Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2009, 22, 1345–1359. [Google Scholar] [CrossRef]
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef] [Green Version]
- Howard, J.; Ruder, S. Universal language model fine-tuning for text classification. arXiv 2018, arXiv:1801.06146. [Google Scholar]
- Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
- Taylor, M.E.; Stone, P. Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res. 2009, 10. [Google Scholar]
- Thrun, S.; Pratt, L. Learning to Learn; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
- Yu, T.; Quillen, D.; He, Z.; Julian, R.; Hausman, K.; Finn, C.; Levine, S. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Proceedings of the Conference on Robot Learning, PMLR, Cambridge, MA, USA, 16–18 November 2020; pp. 1094–1100. [Google Scholar]
- Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; Lillicrap, T. Meta-learning with memory-augmented neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 1842–1850. [Google Scholar]
- Andrychowicz, M.; Denil, M.; Gomez, S.; Hoffman, M.W.; Pfau, D.; Schaul, T.; Shillingford, B.; Freitas, N.D. Learning to learn by gradient descent by gradient descent. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
- Hochreiter, S.; Younger, A.S.; Conwell, P.R. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2001; pp. 87–94. [Google Scholar]
- Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
- Xu, Z.; van Hasselt, H.P.; Silver, D. Meta-gradient reinforcement learning. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar] [CrossRef]
- Finn, C.; Yu, T.; Zhang, T.; Abbeel, P.; Levine, S. One-shot visual imitation learning via meta-learning. In Proceedings of the Conference on Robot Learning PMLR, Mountain View, CA, USA, 13–15 November 2017; pp. 357–368. [Google Scholar]
- Liu, H.; Socher, R.; Xiong, C. Taming maml: Efficient unbiased meta-reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 4061–4071. [Google Scholar]
- Rothfuss, J.; Lee, D.; Clavera, I.; Asfour, T.; Abbeel, P. Promp: Proximal meta-policy search. arXiv 2018, arXiv:1810.06784. [Google Scholar]
- Gupta, A.; Mendonca, R.; Liu, Y.; Abbeel, P.; Levine, S. Meta-reinforcement learning of structured exploration strategies. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar] [CrossRef]
- Pastor, P.; Kalakrishnan, M.; Righetti, L.; Schaal, S. Towards associative skill memories. In Proceedings of the 2012 12th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2012), Osaka, Japan, 29 November–1 December 2012; pp. 309–315. [Google Scholar]
- Pastor, P.; Kalakrishnan, M.; Meier, F.; Stulp, F.; Buchli, J.; Theodorou, E.; Schaal, S. From dynamic movement primitives to associative skill memories. Robot. Auton. Syst. 2013, 61, 351–361. [Google Scholar] [CrossRef]
- Rueckert, E.; Mundo, J.; Paraschos, A.; Peters, J.; Neumann, G. Extracting low-dimensional control variables for movement primitives. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 1511–1518. [Google Scholar]
- Sutton, R.S.; Precup, D.; Singh, S. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 1999, 112, 181–211. [Google Scholar] [CrossRef] [Green Version]
- Kulkarni, T.D.; Narasimhan, K.; Saeedi, A.; Tenenbaum, J. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
- Mendonca, M.R.; Ziviani, A.; Barreto, A.M. Graph-based skill acquisition for reinforcement learning. ACM Comput. Surv. (CSUR) 2019, 52, 1–26. [Google Scholar] [CrossRef]
- Lenz, I.; Knepper, R.A.; Saxena, A. Deepmpc: Learning deep latent features for model predictive control. In Proceedings of the Robotics: Science and Systems, Rome, Italy, 13–17 July 2015. [Google Scholar]
- Du, S.; Krishnamurthy, A.; Jiang, N.; Agarwal, A.; Dudik, M.; Langford, J. Provably efficient rl with rich observations via latent state decoding. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 1665–1674. [Google Scholar]
- Candès, E.J.; Li, X.; Ma, Y.; Wright, J. Robust principal component analysis? J. ACM (JACM) 2011, 58, 1–37. [Google Scholar] [CrossRef]
- Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv 2000, arXiv:physics/0004057. [Google Scholar]
- Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the information bottleneck theory of deep learning. J. Stat. Mech. Theory Exp. 2019, 2019, 124020. [Google Scholar] [CrossRef]
- Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv 2016, arXiv:1612.00410. [Google Scholar]
- Wang, H.Q.; Guo, X.; Deng, Z.H.; Lu, Y. Rethinking minimal sufficient representation in contrastive learning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; IEEE: New York, NY, USA, 2022; pp. 16041–16050. [Google Scholar]
- Peng, X.B.; Kanazawa, A.; Toyer, S.; Abbeel, P.; Levine, S. Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow. arXiv 2018, arXiv:1810.00821. [Google Scholar]
- Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv 2016, arXiv:1606.01540. [Google Scholar]
- Todorov, E.; Erez, T.; Tassa, Y. Mujoco: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Algarve, Portugal, 7–12 October 2012; IEEE: New York, NY, USA, 2012; pp. 5026–5033. [Google Scholar]
| Symbol | Function | Description |
|---|---|---|
| | Task distribution | Characterizes a class of tasks |
| | Task | A specific task, described by an MDP |
| | State space | All tasks share the same state space |
| | Action space | All tasks share the same action space |
| | Transition function space | Includes varying transition functions, i.e., different robot dynamics |
| | Bounded reward function space | Includes varying reward functions, i.e., different tasks |
| | Meta-training task set | M tasks sampled from the source task space |
| | Training set | The meta-training data set |
| | Meta-testing task set | N tasks sampled for testing |
| | Testing set | The meta-testing data set |
| | Trajectories | Trajectories collected for a task |
| Parameters | Symbol | Value |
|---|---|---|
| Optimization algorithm | | Adam |
| Learning rate | | |
| Discount factor | | |
| Entropy weighting | | 5 |
| Lagrange multiplier | β | 0.1 |
| Information constraint | Ic | 1.0 |
| Number of hidden layers | Q, V, π | 3 |
| | E | 3 |
| Number of neurons per layer | Q, V, π | 300 |
| | E | 200 |
| Nonlinear activation | | ReLU |
| Maximum path length | | 200 |
| Samples per mini-batch | M | 256 |
| Target network update frequency | τ | 1000 |
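For convenience, the same settings can be written as a configuration mapping. The sketch below records only the values legible in the table above (entries whose values are not shown there are omitted rather than guessed), and the key names are illustrative, not the authors' code.

```python
# Hyperparameters as reported in the table above; key names are illustrative.
config = {
    "optimizer": "Adam",
    "entropy_weighting": 5,
    "lagrange_multiplier_beta": 0.1,
    "information_constraint_Ic": 1.0,
    "hidden_layers": {"Q_V_pi": 3, "E": 3},
    "neurons_per_layer": {"Q_V_pi": 300, "E": 200},
    "activation": "ReLU",
    "max_path_length": 200,
    "minibatch_size_M": 256,
    "target_update_frequency_tau": 1000,
}
```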
| Tasks | Performance Improvement (# Times) |
|---|---|
| Walker2d-Diff-Velocity | ≈5000 |
| HalfCheetah-Diff-Velocity | ≈4000 |
| Ant-Forward-Back | ≈2500 |
| Humanoid-Random-Dir | ≈5000 |
| Walker2d-Diff-Params | ≈200 |
| Ant-Diff-Params | ≈200 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).