Noise-Regularized Advantage Value for Multi-Agent Reinforcement Learning
Abstract
1. Introduction
- We analyze the cause of policy overfitting in actor–critic MARL algorithms with a centralized value function; the overfitting stems from the batch sampling mechanism used during training;
- We propose two patterns of noise injection to address the policy overfitting problem, and show experimentally that noise injected into the centralized value function maintains the entropy of the agents' policies during training (a minimal entropy-tracking sketch follows this list), which alleviates information redundancy and improves performance;
- The experiments show that our method achieves performance comparable to or much better than the state-of-the-art results of current strong actor–critic MARL methods in most hard SMAC scenarios. Our code is open source and available at https://github.com/hijkzzz/noisy-mappo (accessed on 7 January 2022) for experimental verification and future work.
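The entropy preservation mentioned in the second contribution can be tracked with a few lines of PyTorch. The snippet below is only a minimal sketch, not the authors' logging code; the function name `mean_policy_entropy` and the tensor shapes are illustrative assumptions.

```python
import torch
from torch.distributions import Categorical

def mean_policy_entropy(action_logits: torch.Tensor) -> torch.Tensor:
    """Mean entropy of the agents' categorical policies.

    action_logits: (batch, n_agents, n_actions) logits from the decentralised actors.
    """
    dist = Categorical(logits=action_logits)  # one distribution per (sample, agent) pair
    return dist.entropy().mean()              # average over samples and agents

# Example with SMAC-like sizes: 32 samples, 5 agents, 9 discrete actions.
print(float(mean_policy_entropy(torch.randn(32, 5, 9))))
```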
2. Related Works
3. Preliminaries
4. Method
4.1. Policy Overfitting
4.2. Noisy Advantage Values
- The advantage noise prevents the multi-agent policies from overfitting to sampled advantage values that are skewed by estimation deviations and environmental non-stationarity (a minimal sketch of this step follows Algorithm 1 below).
- Training the agents' policies against N noisy value networks resembles policy ensembling, which can improve the generalization of the joint policy.
- The distinct noise assigned to each agent pushes the policy gradients in different directions, encouraging the agents to explore diverse high-return trajectories.
Algorithm 1: NA-MAPPO.
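The following is a minimal sketch of the advantage-noise step in NA-MAPPO, assuming independent zero-mean Gaussian noise per sampled advantage and σ = 0.05 (the value used for most maps in Appendix C); the function names and the exact placement of the noise relative to advantage normalization are our assumptions, not the released implementation.

```python
import torch

def noisy_advantages(advantages: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Add zero-mean Gaussian noise to sampled advantage estimates (NA pattern)."""
    return advantages + torch.randn_like(advantages) * sigma

def ppo_policy_loss(log_probs, old_log_probs, advantages, clip=0.2, sigma=0.05):
    """Clipped PPO surrogate evaluated on the noise-perturbed advantages."""
    adv = noisy_advantages(advantages, sigma)        # perturb A before the policy update
    ratio = torch.exp(log_probs - old_log_probs)     # importance ratio pi_new / pi_old
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv
    return -torch.min(unclipped, clipped).mean()     # maximise the surrogate, so minimise its negative
```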
5. Experiments
5.1. Testbeds
5.1.1. Non-Monotonic Matrix Game
5.1.2. SMAC
5.2. Experimental Setup
5.3. Results
5.4. Ablations
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
Appendix B
Algorithm A1: NV-MAPPO.
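For the NV pattern, where noise enters the centralized value function rather than the advantages, the sketch below shows one plausible wiring: each agent owns a noise vector that is concatenated to the global state before value estimation and resampled periodically. The noise dimension (10) and shuffle interval (100 episodes) come from the hyperparameter table in this paper; the class name, MLP sizes, and concatenation point are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class NoisyCentralizedValue(nn.Module):
    """Centralised value function conditioned on a per-agent noise vector.

    Each agent i gets its own value estimate V(s, eps_i); resampling the noise
    every `shuffle_interval` episodes makes the N heads behave like an implicit
    ensemble. Shapes and layer sizes are illustrative only.
    """

    def __init__(self, state_dim, n_agents, noise_dim=10, hidden=64,
                 shuffle_interval=100):
        super().__init__()
        self.n_agents, self.noise_dim = n_agents, noise_dim
        self.shuffle_interval = shuffle_interval
        self.episodes_seen = 0
        self.register_buffer("noise", torch.randn(n_agents, noise_dim))
        self.net = nn.Sequential(
            nn.Linear(state_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def maybe_shuffle(self):
        """Resample the per-agent noise vectors every `shuffle_interval` episodes."""
        self.episodes_seen += 1
        if self.episodes_seen % self.shuffle_interval == 0:
            self.noise.normal_()

    def forward(self, state):
        # state: (batch, state_dim) global state shared by all agents.
        batch = state.size(0)
        s = state.unsqueeze(1).expand(batch, self.n_agents, -1)
        eps = self.noise.unsqueeze(0).expand(batch, -1, -1)
        values = self.net(torch.cat([s, eps], dim=-1))   # (batch, n_agents, 1)
        return values.squeeze(-1)                        # one value per agent
```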
Appendix C
Map | PPO Epochs | Mini-Batch | Gain | Network | Stacked Frames | Noise Std (NV-MAPPO) | Noise Std (NV-MAPG) | Noise Std (NV-IPPO) | Noise Std (NA-MAPPO)
---|---|---|---|---|---|---|---|---|---
2s3z | 15 | 1 | 0.01 | rnn | 1 | 1 | 1 | 0.05 | 0.05 |
1c3s5z | 15 | 1 | 0.01 | rnn | 1 | 1 | 1 | 0.05 | 0.05 |
3s5z | 5 | 1 | 0.01 | rnn | 1 | 1 | 1 | 0.05 | 0.05 |
2s_vs_1sc | 15 | 1 | 0.01 | rnn | 1 | 1 | 1 | 0.05 | 0.05 |
3s_vs_5z | 15 | 1 | 0.01 | mlp | 4 | 1 | 1 | 1 | 0.05 |
2c_vs_64zg | 5 | 1 | 0.01 | rnn | 1 | 1 | 1 | 1 | 0.05 |
5m_vs_6m | 10 | 1 | 0.01 | rnn | 1 | 8 | 3 | 0 | 0.05 |
8m_vs_9m | 15 | 1 | 0.01 | rnn | 1 | 1 | 0.05 | 1 | 0.05 |
corridor | 5 | 1 | 0.01 | mlp | 1 | 3 | 1 | 1 | 0.06 |
MMM2 | 5 | 2 | 1 | rnn | 1 | 0 | 0.5 | 0 | 0 |
3s5z_vs_3s6z | 5 | 1 | 0.01 | rnn | 1 | 10 | 1 | 8 | 0.05 |
6h_vs_8z | 5 | 1 | 0.01 | mlp | 1 | 1 | 1 | 1 | 0.06 |
27m_vs_30m | 5 | 1 | 0.01 | rnn | 1 | 1 | 1 | 1 | 0 |
References
- Hüttenrauch, M.; Šošić, A.; Neumann, G. Guided deep reinforcement learning for swarm systems. arXiv 2017, arXiv:1709.06011.
- Kušić, K.; Ivanjko, E.; Vrbanić, F.; Gregurić, M.; Dusparic, I. Spatial-Temporal Traffic Flow Control on Motorways Using Distributed Multi-Agent Reinforcement Learning. Mathematics 2021, 9, 3081.
- Cao, Y.; Yu, W.; Ren, W.; Chen, G. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Trans. Ind. Inform. 2012, 9, 427–438.
- Samvelyan, M.; Rashid, T.; De Witt, C.S.; Farquhar, G.; Nardelli, N.; Rudner, T.G.; Hung, C.M.; Torr, P.H.; Foerster, J.; Whiteson, S. The StarCraft Multi-Agent Challenge. arXiv 2019, arXiv:1902.04043.
- Tatari, F.; Naghibi-Sistani, M.B.; Vamvoudakis, K.G. Distributed optimal synchronization control of linear networked systems under unknown dynamics. In Proceedings of the 2017 American Control Conference (ACC), Seattle, WA, USA, 24–26 May 2017; pp. 668–673.
- Vamvoudakis, K.G.; Lewis, F.L.; Hudas, G.R. Multi-agent differential graphical games: Online adaptive learning solution for synchronization with optimality. Automatica 2012, 48, 1598–1611.
- Jiao, Q.; Modares, H.; Xu, S.; Lewis, F.L.; Vamvoudakis, K.G. Multi-agent zero-sum differential graphical games for disturbance rejection in distributed control. Automatica 2016, 69, 24–34.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
- Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the ICML, Amherst, MA, USA, 27–29 June 1993; pp. 330–337.
- Oliehoek, F.A.; Spaan, M.T.; Vlassis, N. Optimal and approximate Q-value functions for decentralized POMDPs. J. Artif. Intell. Res. 2008, 32, 289–353.
- Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. Adv. Neural Inf. Process. Syst. 2017, 30, 6379–6390.
- Iqbal, S.; Sha, F. Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the ICML, Long Beach, CA, USA, 9–15 June 2019; pp. 2961–2970.
- Rashid, T.; Samvelyan, M.; Schroeder, C.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the ICML, Stockholm, Sweden, 10–15 July 2018; pp. 4295–4304.
- Ha, D.; Dai, A.; Le, Q.V. Hypernetworks. arXiv 2016, arXiv:1609.09106.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
- De Witt, C.S.; Gupta, T.; Makoviichuk, D.; Makoviychuk, V.; Torr, P.H.; Sun, M.; Whiteson, S. Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge? arXiv 2020, arXiv:2011.09533.
- Fortunato, M.; Azar, M.G.; Piot, B.; Menick, J.; Hessel, M.; Osband, I.; Graves, A.; Mnih, V.; Munos, R.; Hassabis, D.; et al. Noisy Networks for Exploration. In Proceedings of the ICLR, Vancouver, BC, Canada, 30 April–3 May 2018.
- Plappert, M.; Houthooft, R.; Dhariwal, P.; Sidor, S.; Chen, R.Y.; Chen, X.; Asfour, T.; Abbeel, P.; Andrychowicz, M. Parameter Space Noise for Exploration. In Proceedings of the ICLR, Vancouver, BC, Canada, 30 April–3 May 2018.
- Hernandez-Leal, P.; Kartal, B.; Taylor, M.E. A survey and critique of multiagent deep reinforcement learning. Auton. Agents Multi-Agent Syst. 2019, 33, 750–797.
- Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-Decomposition Networks for Cooperative Multi-Agent Learning Based on Team Reward. In Proceedings of the AAMAS, Stockholm, Sweden, 10–15 July 2018; pp. 2085–2087.
- Son, K.; Kim, D.; Kang, W.J.; Hostallero, D.E.; Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the ICML, Long Beach, CA, USA, 9–15 June 2019; pp. 5887–5896.
- Wang, J.; Ren, Z.; Liu, T.; Yu, Y.; Zhang, C. QPLEX: Duplex dueling multi-agent Q-learning. arXiv 2020, arXiv:2008.01062.
- Zhou, M.; Liu, Z.; Sui, P.; Li, Y.; Chung, Y.Y. Learning implicit credit assignment for cooperative multi-agent reinforcement learning. arXiv 2020, arXiv:2007.02529.
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the ICLR (Poster), San Juan, Puerto Rico, 2–4 May 2016.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
- Peng, B.; Rashid, T.; de Witt, C.A.S.; Kamienny, P.A.; Torr, P.H.; Böhmer, W.; Whiteson, S. FACMAC: Factored Multi-Agent Centralised Policy Gradients. arXiv 2020, arXiv:2003.06709.
- Yu, C.; Velu, A.; Vinitsky, E.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of MAPPO in cooperative, multi-agent games. arXiv 2021, arXiv:2103.01955.
- Kuba, J.G.; Chen, R.; Wen, M.; Wen, Y.; Sun, F.; Wang, J.; Yang, Y. Trust region policy optimisation in multi-agent reinforcement learning. In Proceedings of the ICLR, Virtual, 25–29 April 2022.
- Mahajan, A.; Rashid, T.; Samvelyan, M.; Whiteson, S. MAVEN: Multi-agent variational exploration. In Proceedings of the NeurIPS, Vancouver, BC, Canada, 8–14 December 2019; pp. 7613–7624.
- Wang, T.; Dong, H.; Lesser, V.; Zhang, C. ROMA: Multi-Agent Reinforcement Learning with Emergent Roles. In Proceedings of the ICML, Virtual, 13–18 July 2020; pp. 9876–9886.
- Pan, L.; Rashid, T.; Peng, B.; Huang, L.; Whiteson, S. Regularized Softmax Deep Multi-Agent Q-Learning. Adv. Neural Inf. Process. Syst. 2021, 34, 1365–1377.
- Liu, I.J.; Jain, U.; Yeh, R.A.; Schwing, A. Cooperative exploration for multi-agent deep reinforcement learning. In Proceedings of the ICML, Virtual, 18–24 July 2021; pp. 6826–6836.
- Ong, S.C.; Png, S.W.; Hsu, D.; Lee, W.S. POMDPs for robotic tasks with mixed observability. In Proceedings of the Robotics: Science and Systems, Seattle, WA, USA, 28 June–1 July 2009; Volume 5, p. 4.
- Böhmer, W.; Kurin, V.; Whiteson, S. Deep coordination graphs. In Proceedings of the ICML, Virtual, 13–18 July 2020; pp. 980–991.
- Hausknecht, M.; Stone, P. Deep recurrent Q-learning for partially observable MDPs. In Proceedings of the 2015 AAAI Fall Symposium Series, Austin, TX, USA, 25–30 January 2015.
- Hu, J.; Jiang, S.; Harding, S.A.; Wu, H.; Liao, S.W. Revisiting the Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning. arXiv 2021, arXiv:2102.03479.
(a) Payoff of matrix game 1

  | A | B | C
---|---|---|---
A | 8 | −12 | −12
B | −12 | 0 | 0
C | −12 | 0 | 0

(b) Payoff of matrix game 2

  | A | B | C
---|---|---|---
A | 12 | 0 | 10
B | 0 | 10 | 10
C | 10 | 10 | 10
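For concreteness, the two payoff tables can be checked with a short script. This is only an illustration of the coordination problem posed by the non-monotonic matrix game, not part of the authors' experiment code; the closing comment is our reading of why game 1 is hard for greedy learners.

```python
import numpy as np

# Payoff matrices from tables (a) and (b): rows and columns index the two
# agents' actions, ordered (A, B, C).
game1 = np.array([[  8, -12, -12],
                  [-12,   0,   0],
                  [-12,   0,   0]])
game2 = np.array([[12,  0, 10],
                  [ 0, 10, 10],
                  [10, 10, 10]])

actions = "ABC"
for name, payoff in [("game 1", game1), ("game 2", game2)]:
    i, j = np.unravel_index(payoff.argmax(), payoff.shape)
    print(f"{name}: optimal joint action ({actions[i]}, {actions[j]}) "
          f"with reward {payoff[i, j]}")

# In game 1 the optimum (A, A) = 8 is surrounded by -12 penalties, so an agent
# that evaluates action A against an exploring partner sees a low expected
# return and drifts toward the 0-reward block: the miscoordination this
# testbed is designed to expose.
```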
Hyperparameters | MAPPO & MAPG |
---|---|
Envs num | 8 |
Buffer length | 400 |
RNN hidden state dim | 64 |
FC layer dim | 64 |
Noise dim num | 10 |
Adam lr | |
GAE (λ) | 0.95
Entropy coef | 0.01 |
PPO clip | 0.2 |
Noise shuffle interval (episodes) | 100 |
Scenarios | Difficulty | NV-MAPPO | NA-MAPPO | MAPPO | MAPPO | NV-IPPO | IPPO |
---|---|---|---|---|---|---|---|
2s3z | Easy | 100% | 100% | 100% | 100% | 100% | 100% |
1c3s5z | Easy | 100% | 100% | 100% | 100% | 100% | 100% |
3s5z | Easy | 100% | 100% | 100% | 100% | 100% | 100% |
2s_vs_1sc | Easy | 100% | 100% | 100% | 100% | 100% | 100% |
3s_vs_5z | Hard | 100% | 100% | 100% | 98% | 100% | 100% |
2c_vs_64zg | Hard | 100% | 100% | 100% | 100% | 100% | 98% |
5m_vs_6m | Hard | 89% | 85% | 89% | 25% | 87% | 87% |
8m_vs_9m | Hard | 96% | 96% | 96% | 93% | 96% | 96% |
MMM2 | Super Hard | 96% | 96% | 90% | 96% | 86% | 86% |
3s5z_vs_3s6z | Super Hard | 87% | 72% | 84% | 56% | 96% | 82% |
6h_vs_8z | Super Hard | 91% | 90% | 88% | 15% | 94% | 84% |
corridor | Super Hard | 100% | 100% | 100% | 3% | 98% | 98% |
27m_vs_30m | Super Hard | 100% | 98% | 94% | 98% | 72% | 69% |
Avg. Score | Hard+ | 95.5% | 93.2% | 93.4% | 64.9% | 91.9% | 88.8% |