Balanced Domain Randomization for Safe Reinforcement Learning
Abstract
1. Introduction
- Identification of rare domains: we identify rare domains, in which policies may be undertrained, by analyzing distances between context embeddings.
- Balancing domain randomization: we propose a reweighting mechanism that assigns greater weight to rare domains during training, yielding balanced domain randomization (a minimal illustrative sketch follows this list).
- Empirical validation: our experiments demonstrate that the proposed method efficiently improves worst-case performance and enhances the robustness of RL agents, especially in novel situations.
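As a concrete illustration of the first two points, the following minimal sketch (ours, not the paper's exact formulation; the class name, the EMA update, the diagonal Mahalanobis-style distance, and hyperparameters such as `ema_rate` and `temperature` are all assumptions) shows how the distance of a context embedding from running context statistics can be turned into a training-loss weight that emphasizes rare domains:

```python
import numpy as np

class ContextRarityWeighter:
    """Illustrative sketch: track running statistics of context embeddings with an
    exponential moving average (EMA) and weight training losses by how far a
    context lies from the typical region of the embedding space."""

    def __init__(self, dim, ema_rate=0.01, temperature=1.0):
        self.mean = np.zeros(dim)       # EMA estimate of the embedding mean
        self.var = np.ones(dim)         # EMA estimate of the per-dimension variance
        self.ema_rate = ema_rate        # smoothing factor for the moving averages
        self.temperature = temperature  # controls how strongly rarity raises the weight

    def update(self, z):
        # Exponential moving average of the mean and (diagonal) variance.
        self.mean = (1 - self.ema_rate) * self.mean + self.ema_rate * z
        self.var = (1 - self.ema_rate) * self.var + self.ema_rate * (z - self.mean) ** 2

    def rarity(self, z):
        # Normalized distance from the running mean (a diagonal Mahalanobis-style
        # distance); larger values indicate rarer contexts.
        return float(np.sqrt(np.sum((z - self.mean) ** 2 / (self.var + 1e-8))))

    def loss_weight(self, z):
        # Map rarity to a bounded weight in [1, 1 + temperature): common contexts
        # stay near 1, rare contexts are emphasized.
        r = self.rarity(z)
        return 1.0 + self.temperature * r / (1.0 + r)

# Example usage with random 8-dimensional context embeddings.
weighter = ContextRarityWeighter(dim=8)
for z in np.random.default_rng(0).normal(size=(100, 8)):
    weighter.update(z)
print(weighter.loss_weight(np.full(8, 5.0)))  # a far-away (rare) context gets a weight close to 2
```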
2. Related Work
3. Problem Definition
3.1. Contextual Reinforcement Learning Framework
3.2. Learning Imbalance in Domain Randomization
4. Balanced Domain Randomization
Algorithm 1: Balanced Domain Randomization
4.1. Context Embedding
4.2. Assessing the Rarity of Contexts
4.3. Reweighting Training Loss
4.4. Context Statistics with Exponential Moving Average
5. Experiments
5.1. Setup
5.2. Evaluation Metric
5.3. Results
5.4. Analysis
5.5. Evaluation in Safety-Critical Navigation Tasks
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
RL | Reinforcement Learning
UDR | Uniform Domain Randomization
ADR | Automatic Domain Randomization
BDR | Balanced Domain Randomization
MDP | Markov Decision Process
POMDPs | Partially Observable MDPs
EMA | Exponential Moving Average
DMC | DeepMind Control
MLP | Multi-Layer Perceptron
TD3 | Twin Delayed Deep Deterministic policy gradient
References
Task | Method | Minimum Episodic Return | Mean Episodic Return (Worst 10%) | Mean Episodic Return
---|---|---|---|---
Pendulum | Baseline | −1263.97 | −1221.57 | −392.70 ± 319.62
 | ADR | −1312.10 | −1332.14 | −569.46 ± 428.23
 | BDR | −1199.37 | −1132.30 | −379.14 ± 295.16
Walker | Baseline | 539.62 | 516.80 | 878.21 ± 106.04
 | ADR | 224.65 | 220.40 | 560.88 ± 198.98
 | BDR | 673.49 | 638.59 | 885.18 ± 88.89
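In the tables here and below, the "Worst 10%" columns summarize tail performance by averaging the lowest-returning (or, for costs, highest-cost) 10% of evaluation episodes. A minimal sketch of how such worst-case statistics can be computed (the function and variable names are ours, not from the paper) follows:

```python
import numpy as np

def worst_case_summary(episodic_returns, fraction=0.1):
    """Summarize worst-case performance from per-episode returns: the single
    minimum and the mean over the worst `fraction` of episodes (for costs,
    the same idea applies with the ordering reversed)."""
    returns = np.sort(np.asarray(episodic_returns, dtype=float))  # ascending: worst first
    k = max(1, int(np.ceil(fraction * len(returns))))             # number of worst episodes
    return {
        "min_return": float(returns[0]),
        "worst_10pct_mean": float(returns[:k].mean()),
        "mean_return": float(returns.mean()),
        "std_return": float(returns.std()),
    }

# Example: summarize 100 evaluation episodes drawn from randomized domains.
rng = np.random.default_rng(0)
print(worst_case_summary(rng.normal(loc=800.0, scale=100.0, size=100)))
```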
Task | Method | Minimum Episodic Return | Mean Episodic Return (Worst 10%) | Mean Episodic Return
---|---|---|---|---
Pendulum | Baseline () | −1263.97 | −1221.57 | −392.70 ± 319.62
 |  | −1299.66 | −1244.98 | −450.21 ± 317.04
 |  | −1199.37 | −1132.30 | −379.14 ± 295.16
 |  | −1227.09 | −1134.78 | −411.02 ± 319.18
 |  | −1386.99 | −1186.47 | −464.74 ± 323.14
Walker | Baseline () | 539.62 | 516.80 | 878.21 ± 106.04
 |  | 668.35 | 633.97 | 900.73 ± 85.59
 |  | 673.49 | 638.59 | 885.18 ± 88.89
 |  | 454.86 | 455.92 | 767.93 ± 109.67
 |  | 497.80 | 567.64 | 851.82 ± 106.13
Task | Method | Minimum Episodic Return | Mean Episodic Return (Worst 10%) | Mean Episodic Return
---|---|---|---|---
Navigation | Baseline | 0.12 | −1.82 | 2.81 ± 1.37
 | BDR | 0.72 | −0.35 | 9.88 ± 4.84

Method | Maximum Episodic Cost | Mean Episodic Cost (Worst 10%) | Mean Episodic Cost
---|---|---|---
Baseline | 109.97 | 484.53 | 61.62 ± 16.96
BDR | 82.09 | 317.56 | 55.69 ± 11.09
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).