Multi-Agent Hierarchical Graph Attention Actor–Critic Reinforcement Learning
Abstract
1. Introduction
- Multi-agent interactions are effectively modelled as graphs, where agents are represented as nodes and their connections form edges through which information is exchanged. Graph attention networks encode each agent’s local observations into a single node embedding vector whose dimensionality remains constant regardless of the number of agents, yielding a fixed-size environment representation and thus offering flexibility and scalability.
- We propose a hierarchical graph attention mechanism (HGAT) that allows agents to extract information efficiently in complex environments. HGAT transforms the agents’ observations into a condensed, contextualized state representation, capturing relationships at both the individual and group levels through “inter-agent” and “inter-group” attention layers (a minimal sketch of this two-level attention follows this list). By aggregating individual and group-level relationships, agents can better “understand” dynamic changes in the environment, focus on interacting with the most relevant agents, and thus learn more “advanced” strategies.
- To validate the transferability of our method, we train it within a curriculum learning framework. With curriculum learning, agents gradually adapt to tasks with increasing numbers of agents, so strategies learned on small teams transfer effectively to larger ones (a schematic curriculum loop is also sketched below). Using this approach, we successfully transferred a five-agent line formation strategy to a new task with ten agents.
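As a rough illustration of the two-level attention described above, the following NumPy sketch condenses each group of observed entities with “inter-agent” attention and then weighs the resulting group summaries with “inter-group” attention; the projection matrices, dimensions, and the final concatenation are illustrative assumptions, not the paper’s exact parameterization.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attend(query, members, W_q, W_k, W_v):
    """Single-query scaled dot-product attention over a set of members.
    Returns a fixed-size vector no matter how many members are present."""
    q = query @ W_q                               # (d,)
    k = members @ W_k                             # (n, d)
    v = members @ W_v                             # (n, d)
    alpha = softmax(k @ q / np.sqrt(q.shape[0]))  # (n,) attention weights
    return alpha @ v                              # (d,)

def hgat_embedding(own_obs, groups, params):
    """Hierarchical attention: 'inter-agent' attention inside each entity
    group, then 'inter-group' attention over the resulting group summaries."""
    W_q1, W_k1, W_v1, W_q2, W_k2, W_v2 = params
    # "Inter-agent" level: condense each group into one summary vector.
    summaries = np.stack([attend(own_obs, g, W_q1, W_k1, W_v1) for g in groups])
    # "Inter-group" level: weigh the group summaries against each other.
    h_own = own_obs @ W_q1
    context = attend(h_own, summaries, W_q2, W_k2, W_v2)
    # Final state representation: own embedding plus hierarchical context.
    return np.concatenate([h_own, context])

# Toy usage: one focal agent observing two entity groups of different sizes.
rng = np.random.default_rng(0)
obs_dim, d = 8, 16
shapes = [(obs_dim, d)] * 3 + [(d, d)] * 3
params = [0.1 * rng.standard_normal(s) for s in shapes]
own = rng.standard_normal(obs_dim)
groups = [rng.standard_normal((3, obs_dim)), rng.standard_normal((5, obs_dim))]
print(hgat_embedding(own, groups, params).shape)  # (32,), fixed regardless of group sizes
```

Because the attended output is a weighted sum over set members, the embedding size stays fixed no matter how many agents or landmarks are observed, which is what makes the representation transferable across team sizes.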
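The curriculum itself can be as simple as a schedule over team sizes in which each stage warm-starts from the previous one. The sketch below is schematic only: `make_env` and `trainer` (with `load_params`/`train` methods) are hypothetical stand-ins rather than an API from the paper, and the stage sizes and step counts are placeholders.

```python
def curriculum_train(make_env, trainer, team_sizes=(5, 7, 10), steps_per_stage=100_000):
    """Train on progressively larger teams, reusing the learned parameters at each
    stage instead of restarting from scratch; the fixed-size HGAT representation
    is what makes this parameter reuse possible across team sizes."""
    params = None
    for n_agents in team_sizes:
        env = make_env(n_agents)            # hypothetical task factory
        if params is not None:
            trainer.load_params(params)     # warm-start from the previous stage
        params = trainer.train(env, steps=steps_per_stage)
    return params
```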
2. Related Works
3. Preliminaries
3.1. Partially Observable Markov Game (POMG)
3.2. Graph Attention Network (GAT)
4. Methods
4.1. Agents Communication
4.2. Hierarchical Graph Attention Network (HGAT)
- Step 1. Entities Clustering
- Step 2. “Inter-agent” Attention
- Step 3. “Inter-group” Attention
4.3. Multi-Agent Actor–Critic
4.4. MAHGAC Algorithm
Algorithm 1 Training Procedure for MAHGAC
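Neither the body of Section 4.3 nor Algorithm 1 is reproduced on this page. Purely for orientation, the sketch below shows the entropy-regularized TD target that actor-critic methods in this family (cf. the soft actor-critic and MAAC references) regress a centralized critic toward; the symbols, the temperature alpha, and the discount are illustrative assumptions, not the authors’ exact formulation.

```python
import numpy as np

def soft_td_target(reward, done, next_q, next_logp, gamma=0.99, alpha=0.2):
    """Entropy-regularized TD target y = r + gamma * (1 - done) * (Q' - alpha * log pi'),
    in the style of soft actor-critic; gamma and alpha are placeholder values."""
    return reward + gamma * (1.0 - done) * (next_q - alpha * next_logp)

# Toy usage on a batch of two transitions.
r = np.array([1.0, -1.0])
d = np.array([0.0, 1.0])
print(soft_td_target(r, d, next_q=np.array([5.0, 3.0]), next_logp=np.array([-1.2, -0.7])))
# -> approximately [6.1876, -1.0]
```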
5. Experiments
5.1. Experimental Settings
- Cooperative navigation: As shown in Figure 5a, the environment consists of M agents and M landmarks. The objective for each agent is to reach a distinct landmark while avoiding collisions with other agents. Each episode begins with the M agents and M landmarks randomly initialised in the environment and ends after 25 time steps. During each episode, each agent receives a reward based on its distance to the nearest landmark and incurs a penalty of −1 if it collides with another agent (a minimal reward sketch is given after the metric definitions below). Landmarks are not preassigned to agents; each agent dynamically decides which landmark to target based on environmental feedback. Ultimately, each agent occupies a unique landmark, completing the navigation task and learning collaborative strategies.
- Linear formation: As shown in Figure 5b, there are M agents and two landmarks. The agents aim to position themselves equally spaced along the line between the two landmarks. Each episode begins with the agents and landmarks randomly initialised and ends after 25 time steps. Each agent receives a reward based on the distance between its current position and its expected position along the line.
- Regular polygonal formation: As shown in Figure 5c, there are M agents and one landmark. The agents are required to arrange themselves into an M-sided regular polygonal formation with the landmark at its centre. Each episode begins with the agents and the landmark randomly initialised and ends after 25 time steps. During the episode, each agent receives a reward based on the distance between its current position and its expected position in the polygonal formation around the landmark.
- Confronting pursuit: As shown in Figure 5d, the environment consists of M pursuers and N prey. In this competitive game, the M homogeneous pursuers chase the N prey while the prey strive to escape. Because the pursuers have lower speed and acceleration than the prey, they must cooperate effectively to succeed in the pursuit. Each pursuer obtains a positive reward of +10 when it catches a prey, while the caught prey incurs a negative reward of −10. To prevent the prey from straying too far from a designated zone, they receive a negative reward if they leave this area. The environment also contains obstacles, and any agent colliding with an obstacle is penalized with a negative reward of −10.
- Success rate (S%): percentage of tasks completed during evaluation episodes (higher is better).
- Mean episode length (MEL): average length of successful episodes during evaluation (lower is better).
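For concreteness, here is a minimal sketch of the cooperative navigation reward described above, assuming the common convention of using minus the distance to the nearest landmark as the shaping term with a −1 penalty per collision; the exact coefficient on the distance term is not stated in the text above.

```python
import numpy as np

def coop_nav_reward(agent_pos, landmark_pos, other_agent_pos, collision_radius=0.1):
    """Per-step reward for one agent: minus the distance to the nearest landmark
    (assumed shaping term) minus 1 for every other agent within the collision radius."""
    nearest = np.linalg.norm(landmark_pos - agent_pos, axis=-1).min()
    collisions = np.linalg.norm(other_agent_pos - agent_pos, axis=-1) < collision_radius
    return -nearest - collisions.sum()

# Toy usage: one agent, three landmarks, two other agents (one of them colliding).
agent = np.array([0.0, 0.0])
landmarks = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
others = np.array([[0.05, 0.0], [2.0, 2.0]])
print(coop_nav_reward(agent, landmarks, others))  # -1.0 (distance) - 1 (collision) = -2.0
```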
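The two evaluation metrics above reduce to a few lines; the sketch assumes each evaluation episode is recorded as a (succeeded, length) pair.

```python
def evaluate_metrics(episodes):
    """Success rate S (%) over all evaluation episodes and mean episode length
    (MEL) over the successful episodes only, per the definitions above."""
    episodes = list(episodes)
    successful = [length for succeeded, length in episodes if succeeded]
    s_pct = 100.0 * len(successful) / len(episodes)
    mel = sum(successful) / len(successful) if successful else float("nan")
    return s_pct, mel

print(evaluate_metrics([(True, 4), (True, 6), (False, 25)]))  # approximately (66.67, 5.0)
```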
5.2. Results
5.2.1. Effectiveness
5.2.2. Scalability
6. Curriculum Learning
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Franks, N.R.; Worley, A.; Grant, K.A.; Gorman, A.R.; Vizard, V.; Plackett, H.; Doran, C.; Gamble, M.L.; Stumpe, M.C.; Sendova-Franks, A.B. Social behaviour and collective motion in plant-animal worms. Proc. R. Soc. B Biol. Sci. 2016, 283, 20152946.
- Perolat, J.; Leibo, J.Z.; Zambaldi, V.; Beattie, C.; Tuyls, K.; Graepel, T. A multi-agent reinforcement learning model of common-pool resource appropriation. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
- Oliff, H.; Liu, Y.; Kumar, M.; Williams, M.; Ryan, M. Reinforcement learning for facilitating human-robot-interaction in manufacturing. J. Manuf. Syst. 2020, 56, 326–340.
- Zhang, K.; Yang, Z.; Başar, T. Multi-agent reinforcement learning: A selective overview of theories and algorithms. In Handbook of Reinforcement Learning and Control; Springer: Berlin/Heidelberg, Germany, 2021; pp. 321–384.
- Mo, Z.; Li, W.; Fu, Y.; Ruan, K.; Di, X. CVLight: Decentralized learning for adaptive traffic signal control with connected vehicles. Transp. Res. Part C Emerg. Technol. 2022, 141, 103728.
- Farinelli, A.; Iocchi, L.; Nardi, D. Distributed on-line dynamic task assignment for multi-robot patrolling. Auton. Robot. 2017, 41, 1321–1345.
- Sui, Z.; Pu, Z.; Yi, J.; Wu, S. Formation control with collision avoidance through deep reinforcement learning using model-guided demonstration. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 2358–2372.
- Liu, L.; Luo, C.; Shen, F. Multi-agent formation control with target tracking and navigation. In Proceedings of the 2017 IEEE International Conference on Information and Automation (ICIA), Macau, China, 18–20 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 98–103.
- Ryu, H.; Shin, H.; Park, J. Cooperative and competitive biases for multi-agent reinforcement learning. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS ’21), Virtual, 3–7 May 2021; International Foundation for Autonomous Agents and Multiagent Systems: Liverpool, UK, 2021; pp. 1091–1099.
- Hahn, C.; Ritz, F.; Wikidal, P.; Phan, T.; Gabor, T.; Linnhoff-Popien, C. Foraging swarms using multi-agent reinforcement learning. In Proceedings of the ALIFE 2020: The 2020 Conference on Artificial Life, Online, 13–18 July 2020; pp. 333–340.
- Leitão, P.; Barbosa, J.; Trentesaux, D. Bio-inspired multi-agent systems for reconfigurable manufacturing systems. Eng. Appl. Artif. Intell. 2012, 25, 934–944.
- Stadler, M.; Banfi, J.; Roy, N. Approximating the value of collaborative team actions for efficient multiagent navigation in uncertain graphs. In Proceedings of the International Conference on Automated Planning and Scheduling, Prague, Czech Republic, 8–13 July 2023.
- Tassel, P.; Kovács, B.; Gebser, M.; Schekotihin, K.; Kohlenbrein, W.; Schrott-Kostwein, P. Reinforcement learning of dispatching strategies for large-scale industrial scheduling. In Proceedings of the International Conference on Automated Planning and Scheduling, Virtual, 13–24 June 2022; Volume 32, pp. 638–646.
- Xie, S.; Li, Y.; Wang, X.; Zhang, H.; Zhang, Z.; Luo, X.; Yu, H. Hierarchical relationship modeling in multi-agent reinforcement learning for mixed cooperative–competitive environments. Inf. Fusion 2024, 108, 102318.
- Tony, L.A.; Jana, S.; Varun, V.; Shorewala, S.; Vidyadhara, B.; Gadde, M.S.; Kashyap, A.; Ravichandran, R.; Krishnapuram, R.; Ghose, D. UAV collaboration for autonomous target capture. In Proceedings of the Congress on Intelligent Systems (CIS 2021); Springer: Berlin/Heidelberg, Germany, 2022; Volume 1, pp. 847–862.
- Hausman, K.; Müller, J.; Hariharan, A.; Ayanian, N.; Sukhatme, G.S. Cooperative multi-robot control for target tracking with onboard sensing. Int. J. Robot. Res. 2015, 34, 1660–1677.
- Gong, X.; Chen, W.; Chen, Z. All-aspect attack guidance law for agile missiles based on deep reinforcement learning. Aerosp. Sci. Technol. 2022, 127, 107677.
- Shalumov, V. Cooperative online guide-launch-guide policy in a target-missile-defender engagement using deep reinforcement learning. Aerosp. Sci. Technol. 2020, 104, 105996.
- Wu, J.; Huang, Z. Promoting diversity in mixed complex cooperative and competitive multi-agent environment. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23), Birmingham, UK, 21–25 October 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 4355–4359.
- Munikoti, S.; Agarwal, D.; Das, L.; Halappanavar, M.; Natarajan, B. Challenges and opportunities in deep reinforcement learning with graph neural networks: A comprehensive review of algorithms and applications. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 15051–15071.
- Agarwal, A.; Kumar, S.; Sycara, K.P. Learning transferable cooperative behavior in multi-agent teams. arXiv 2019, arXiv:1906.01202.
- Niu, Y.; Paleja, R.R.; Gombolay, M.C. Multi-agent graph-attention communication and teaming. In Proceedings of the AAMAS, Virtual, 3–7 May 2021; pp. 964–973.
- Ma, X.; Yang, Y.; Li, C.; Lu, Y.; Zhao, Q.; Jun, Y. Modeling the interaction between agents in cooperative multi-agent reinforcement learning. arXiv 2021, arXiv:2102.06042.
- Iqbal, S.; Sha, F. Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 2961–2970.
- Su, J.; Adams, S.; Beling, P.A. Counterfactual multi-agent reinforcement learning with graph convolution communication. arXiv 2020, arXiv:2004.00470.
- Liu, Y.; Wang, W.; Hu, Y.; Hao, J.; Chen, X.; Gao, Y. Multi-agent game abstraction via graph attention neural network. arXiv 2019, arXiv:1911.10715.
- Jiang, J.; Dun, C.; Huang, T.; Lu, Z. Graph convolutional reinforcement learning. arXiv 2018, arXiv:1810.09202.
- Sun, F.Y.; Kauvar, I.; Zhang, R.; Li, J.; Kochenderfer, M.J.; Wu, J.; Haber, N. Interaction modeling with multiplex attention. Adv. Neural Inf. Process. Syst. 2022, 35, 20038–20050.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11.
- Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv 2018, arXiv:1812.05905.
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Abbeel, O.P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
- Wang, X.; Chen, Y.; Zhu, W. A survey on curriculum learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4555–4576.
| Method | Cooperative Navigation (N = 3) | | Linear Formation (N = 5) | | Regular Polygonal (N = 4) | | Pursuit (N = 4) | |
|---|---|---|---|---|---|---|---|---|
| | S (%) | MEL | S (%) | MEL | S (%) | MEL | S (%) | MEL |
| MADDPG | 47.389 | 3.86 | 55.938 | 7.12 | 65.994 | 5.54 | 53.935 | 5.84 |
| G2ANet | 54.983 | 5.04 | 61.214 | 8.22 | 70.186 | 5.37 | 58.420 | 6.77 |
| DGN | 79.083 | 4.10 | 79.617 | 7.70 | 87.710 | 5.53 | 66.708 | 6.30 |
| MAAC | 85.025 | 4.39 | 90.720 | 7.23 | 91.883 | 5.45 | 78.689 | 6.87 |
| MAHGAC | 86.378 | 3.72 | 91.330 | 7.11 | 92.865 | 5.29 | 86.650 | 5.78 |
Cooperative Navigation success rate S (%) by number of agents:

| Method | N = 3 | N = 7 | N = 11 | N = 15 |
|---|---|---|---|---|
| MADDPG | 47.389 | 13.850 | - | - |
| G2ANet | 54.987 | 28.076 | - | - |
| DGN | 79.083 | 70.200 | 52.775 | 48.120 |
| MAAC | 85.025 | 83.883 | 82.092 | 80.121 |
| MAHGAC | 86.378 | 86.272 | 86.435 | 86.195 |
| Normal Training (N = 0) | | Normal Training (N = 10) | | Curriculum Learning Training (N = 5) | | Curriculum Learning Training (N = 10) | |
|---|---|---|---|---|---|---|---|
| S (%) | MEL | S (%) | MEL | S (%) | MEL | S (%) | MEL |
| 0 | 0 | 87.62 | 21.82 | 91.33 | 7.11 | 90.513 | 18.49 |