QMIX-GNN: A Graph Neural Network-Based Heterogeneous Multi-Agent Reinforcement Learning Model for Improved Collaboration and Decision-Making
Abstract
1. Introduction
- Adaptive Information Fusion: QMIX-GNN introduces a GNN-based information infusion mechanism that aggregates data from multiple agents, allowing for enhanced team perception and decision-making.
- Heterogeneous Information Processing: Our model incorporates a dedicated module that handles the diverse observation and action spaces of heterogeneous agents, facilitating seamless communication and collaboration.
- Scalability on Complex Tasks: By integrating these modules within the CTDE framework, QMIX-GNN achieves significant improvements in both performance and convergence speed on complex multi-agent coordination tasks.
2. Materials and Methods
2.1. Preliminaries
2.1.1. Reinforcement Learning
2.1.2. Multi-Agent Reinforcement Learning
- Centralized Training: Joint policies are learned using the global information (e.g., all agents’ observations).
- Decentralized Execution: Agents act independently using local observations.
- COMA [23]: Counterfactual baselines are used for credit assignment by marginalizing individual agents’ actions while keeping others fixed.
- QMIX [22]: The joint Q-value is decomposed by a monotonic mixing network; this ensures that $\frac{\partial Q_{tot}}{\partial Q_i} \ge 0$ for every agent $i$, enabling decentralized argmax operations (see the factorization summary after this list).
- QTRAN [24]: QMIX’s monotonicity constraint is relaxed by learning a transformed joint Q-value, which broadens the class of factorizable value functions.
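For reference, the conditions underlying these factorization methods can be written compactly as follows. This is the standard textbook formulation from the cited works; the notation is ours and may differ slightly from the symbols used elsewhere in this paper.

```latex
% QMIX restricts the joint value to a monotonic mixing of per-agent utilities,
% which is sufficient for the Individual-Global-Max (IGM) condition below.
Q_{tot}(\boldsymbol{\tau}, \mathbf{u})
  = f_{\text{mix}}\big(Q_1(\tau_1, u_1), \dots, Q_N(\tau_N, u_N); s\big),
\qquad \frac{\partial Q_{tot}}{\partial Q_i} \ge 0 \;\; \forall i.

% IGM: the joint greedy action decomposes into per-agent greedy actions,
% which is what enables decentralized argmax at execution time.
\arg\max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u})
  = \Big(\arg\max_{u_1} Q_1(\tau_1, u_1), \dots, \arg\max_{u_N} Q_N(\tau_N, u_N)\Big).
```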
2.1.3. Graph Neural Networks
- Graph Convolutional Networks (GCNs) [25]: Aggregate neighbor features into an output Z via the spectral convolution $Z = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X \Theta$, where $\hat{A} = A + I$ is the adjacency matrix with self-loops, $\hat{D}$ is its degree matrix, $X$ contains the node features, and $\Theta$ is the learnable weight matrix.
- GAT [16]: Computes adaptive attention weights between nodes i and j as $\alpha_{ij} = \mathrm{softmax}_j\big(\mathrm{LeakyReLU}(\mathbf{a}^{\top}[W h_i \,\|\, W h_j])\big)$ and aggregates neighbor features with these weights (a minimal attention-aggregation sketch follows this list).
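As an illustration of the attention-based aggregation described in the GAT item above, the following PyTorch sketch implements a single attention head over an agent graph. It is a minimal example under our own naming (`SimpleGATLayer`, `adj`), not the layer used in QMIX-GNN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """Minimal single-head graph attention layer in the spirit of GAT [16].

    Illustrative sketch only; names and sizes are placeholders, not the
    paper's implementation.
    """
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared linear projection
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention scoring vector

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (N, in_dim)  node features, one row per agent
        # adj: (N, N)       binary adjacency, 1 where j is a neighbor of i
        #                   (self-loops included so every agent attends to itself)
        h = self.W(x)                                      # (N, out_dim)
        N = h.size(0)
        # Score every pair [h_i || h_j] with a single-layer attention mechanism.
        h_i = h.unsqueeze(1).expand(N, N, -1)
        h_j = h.unsqueeze(0).expand(N, N, -1)
        e = F.leaky_relu(self.a(torch.cat([h_i, h_j], dim=-1)).squeeze(-1))  # (N, N)
        # Mask out non-neighbors before normalizing over j.
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)                   # attention weights alpha_ij
        return alpha @ h                                   # weighted neighbor aggregation
```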
2.2. Existing Issues
- Partial Observability Constraints: In decentralized execution, agents lack access to global state information. Existing CTDE methods (e.g., QMIX, COMA) assume that agents can implicitly coordinate through shared value functions; however, this fails when the agents exhibit heterogeneous observation spaces or roles.
- Static Interaction Modeling: Traditional value factorization approaches [24] decompose joint Q-values under the assumption of fixed monotonicity, ignoring dynamic agent relationships.
- Heterogeneous Information Fusion: Agents with diverse sensor modalities (e.g., LiDAR vs. cameras) generate multimodal data with varying structures and scales. Simple concatenation or averaging of observations can lose critical relational patterns [26], while manually designed fusion rules lack adaptability.
2.3. Proposed Method
2.3.1. Agent Network
- A projection matrix is used to map the states of heterogeneous agents into a consistent state space.
- Shared-bottom parameters accelerate the model’s learning speed and improve its generalization ability.
- A tower structure enables different types of agents to learn differentiated policies, helping them to adapt to various action spaces. A minimal sketch of this three-part structure is given after this list.
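The following PyTorch sketch illustrates one possible arrangement of these three components (per-type projection, shared bottom, per-type towers). The layer sizes, the GRU-based shared bottom, and all names here are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HeterogeneousAgentNet(nn.Module):
    """Sketch of the projection / shared-bottom / tower idea described above."""
    def __init__(self, obs_dims: dict, n_actions: dict, hidden_dim: int = 64):
        super().__init__()
        # One projection matrix per agent type, mapping raw observations of
        # different sizes into a common feature space.
        self.project = nn.ModuleDict(
            {t: nn.Linear(d, hidden_dim) for t, d in obs_dims.items()}
        )
        # Shared-bottom recurrent core, reused by every agent type.
        self.shared = nn.GRUCell(hidden_dim, hidden_dim)
        # One tower (output head) per agent type, sized to its own action space.
        self.towers = nn.ModuleDict(
            {t: nn.Linear(hidden_dim, n) for t, n in n_actions.items()}
        )

    def forward(self, obs: torch.Tensor, h: torch.Tensor, agent_type: str):
        x = torch.relu(self.project[agent_type](obs))  # align heterogeneous inputs
        h_next = self.shared(x, h)                     # shared temporal features
        q = self.towers[agent_type](h_next)            # per-type Q-values
        return q, h_next

# Example with two hypothetical agent types that have different
# observation and action spaces.
net = HeterogeneousAgentNet(obs_dims={"marine": 30, "medivac": 42},
                            n_actions={"marine": 9, "medivac": 12})
q, h = net(torch.zeros(1, 30), torch.zeros(1, 64), agent_type="marine")
```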
2.3.2. Information Infusion Network
2.3.3. Mixing Network
2.3.4. Implementations
Algorithm 1 QMIX-GNN
Input: Hyperparameters.
Output: Model parameters θ.
 1: step = 0, initialize replay buffer D, parameters θ, target parameters θ⁻ ← θ  ▹ Initialize
 2: while step < max_training_steps do
 3:   t = 0  ▹ Initialize episode
 4:   s_t ← reset the environment  ▹ Obtain the initial state
 5:   while s_t is not terminal and t < episode_limit do
 6:     o_t ← observations for all agents from the environment
 7:     G_t ← k-nearest neighbor graph over the agents  ▹ Construct the graph for the current timestep
 8:     v_t ← information infusion network(o_t, G_t)  ▹ Infuse observation information from all agents
 9:     for each agent i do
10:       ϵ = epsilon-schedule(step)  ▹ Explore-or-exploit threshold
11:       if random(0, 1) < ϵ then  ▹ Explore or exploit
12:         Q_i ← agent network(o_t^i, v_t^i)
13:         a_t^i ← argmax_a Q_i(a)
14:       else
15:         a_t^i ← random action  ▹ Random exploration
16:       end if
17:     end for
18:     get reward r_t and next state s_{t+1} from the environment
19:     D ← D ∪ {(s_t, o_t, u_t, r_t, s_{t+1})}  ▹ Save data to replay buffer
20:     t = t + 1, step = step + 1
21:     if |D| ≥ batch_size then
22:       batch b ← random batch of episodes from D
23:       for each timestep t in each episode in batch b do
24:         Q_tot ← mixing network(Q_1, …, Q_N; s_t, θ)
25:         y_t ← r_t + γ max_{u′} Q_tot(s_{t+1}, u′; θ⁻)
26:       end for
27:       L(θ) ← Σ_b (y_t − Q_tot)²
28:       compute ∇_θ L(θ)
29:       θ ← θ − α ∇_θ L(θ)  ▹ Update model parameters
30:     end if
31:     if update-interval steps have passed then
32:       θ⁻ ← θ
33:     end if
34:   end while
35: end while
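As a concrete illustration of the graph construction in line 7 of Algorithm 1, the sketch below builds a binary k-nearest-neighbor adjacency matrix from agent positions. It is a simplification under our own assumptions: the paper's graph additionally accounts for SMAC-specific features such as visual and attack radii, which are omitted here, and the function name is hypothetical.

```python
import torch

def knn_adjacency(positions: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Build a binary k-nearest-neighbor adjacency matrix from agent positions.

    positions: (N, 2) tensor of agent coordinates.
    Returns:   (N, N) adjacency with adj[i, j] = 1 if j is among i's k nearest
               neighbors (self-loops included, since each agent is its own
               nearest neighbor at distance zero).
    """
    n = positions.size(0)
    dist = torch.cdist(positions, positions)          # pairwise Euclidean distances
    k_eff = min(k + 1, n)                             # +1 to keep the self-loop
    _, idx = dist.topk(k_eff, largest=False, dim=-1)  # indices of the closest agents
    adj = torch.zeros(n, n)
    adj.scatter_(1, idx, 1.0)                         # mark the selected neighbors
    return adj

# Example: 8 agents in a 2D arena, each connected to its 5 nearest neighbors,
# matching the k = 5 setting used in the experiments.
adj = knn_adjacency(torch.rand(8, 2), k=5)
```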
3. Results
3.1. Experimental Platform
3.2. Experimental Configuration
3.3. Experimental Results
- In the 3m scenario, QTRAN performs the best, followed by QMIX-GNN and QMIX, which show similar results. COMA performs worse than the other three, with a winning rate of only around 76%, indicating comparatively poor performance.
- In the 2s3z scenario, our QMIX-GNN has a slight decrease in winning rate compared to QMIX (a 2.6% drop). The reason for this could be that the 2s3z scenario is relatively simple, allowing agents to make good decisions based on their own observations. The introduction of information fusion increases the complexity of learning, which slows down the agent’s learning process. In this scenario, QTRAN converges slowly but eventually reaches a winning rate of about 71%, while COMA struggles to converge, with a winning rate of only around 20%.
- In the 1c3s5z and mmm scenarios, QTRAN performs poorly, with a winning rate of around 30%, while COMA fails to converge. QMIX-GNN and QMIX achieve relatively high winning rates, with QMIX-GNN showing varying degrees of improvement over QMIX, suggesting that in more complex scenarios the agents need to focus more on information from other agents in order to achieve better cooperation.
- In the 3m scenario, all four models converge relatively quickly and maintain a stable winning rate, although COMA only manages to maintain a relatively low winning rate.
- In the 2s3z scenario, both QMIX and QMIX-GNN converge quickly and maintain a high winning rate, but QMIX has a more stable convergence curve with smaller fluctuations, while COMA exhibits an unstable convergence curve and performs poorly. In comparison, QTRAN converges slowly.
- In the 1c3s5z, 10m_vs_11m, mmm, and mmm2 scenarios, QTRAN converges slowly and COMA fails to converge. Both QMIX and QMIX-GNN show good convergence performance, with QMIX-GNN converging significantly faster than QMIX and demonstrating better overall training results.
3.4. Ablation Study
4. Discussion
- In simpler scenarios such as 3m and 2s3z, both QMIX-GNN and QMIX achieve strong performance, indicating that the advantages of our approach are less pronounced in low-complexity settings. However, the benefits of QMIX-GNN become more evident as task complexity increases. In medium-difficulty scenarios such as 1c3s5z and 10m_vs_11m, our method converges faster than QMIX, demonstrating superior sample efficiency. In challenging scenarios such as MMM and MMM2, QMIX-GNN consistently outperforms QMIX, underscoring its effectiveness in handling complex coordination tasks. This performance trend can be attributed to our information infusion mechanism, which enhances agent coordination by leveraging structured relational data. While this mechanism significantly improves performance in complex environments, it introduces additional computational overhead that may not be necessary in simpler tasks. Additionally, our use of a k-nearest graph based on inherent SMAC features such as visual and attack radii imposes certain constraints on QMIX-GNN’s applicability. Future work could explore adaptive strategies that dynamically adjust the degree of information infusion based on task complexity, which would enhance flexibility and reduce unnecessary learning overhead in less demanding scenarios.
- Our proposed QMIX-GNN is fundamentally reliant on the effective fusion of information across multiple agents to enable collaborative decision-making. Scenarios in which agents operate in complete isolation, such as decentralized systems lacking communication channels, are outside the primary scope of our work; nevertheless, this assumption imposes practical limits on the application of QMIX-GNN. In environments where robust inter-agent communication and information sharing cannot be guaranteed, the underlying information fusion mechanism may be ineffective, potentially leading to degraded performance. Future work could explore the design of lightweight or privacy-aware fusion strategies that would simplify the fusion architecture or incorporate privacy-preserving mechanisms while still enabling collaborative information sharing. This would help to extend the practical applicability of QMIX-GNN to a broader range of scenarios.
- Although SMAC presents certain challenges, the current infusion mechanism in QMIX-GNN is specifically designed around the unique features of SMAC, as mentioned above. As a result, while our experiments demonstrate promising performance in this context, the effectiveness of QMIX-GNN in more complex environments remains to be validated. Because SMAC does not support macro mode, extending our evaluation to such settings is left for future work. In addition, future research could broaden the applicability of QMIX-GNN by focusing on the development of a more general infusion mechanism that is adaptable to a wider range of environments.
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Full Form |
---|---|
MARL | Multi-Agent Reinforcement Learning |
GNN | Graph Neural Network |
GAT | Graph Attention Network |
GRU | Gated Recurrent Unit |
RNN | Recurrent Neural Network |
References
- Ning, Z.; Xie, L. A survey on multi-agent reinforcement learning and its application. J. Autom. Intell. 2024, 3, 73–91.
- Gronauer, S.; Diepold, K. Multi-agent deep reinforcement learning: A survey. Artif. Intell. Rev. 2021, 55, 895–943.
- Ruder, S. An Overview of Multi-Task Learning in Deep Neural Networks. arXiv 2017, arXiv:1706.05098.
- Wen, Y.; Qian, F.; Guo, W.; Zong, J.; Peng, D.; Chen, K.; Hu, G. VSP Upgoing and Downgoing Wavefield Separation: A Hybrid Model-Data Driven Approach. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5908014.
- Qian, F.; Pan, S.; Zhang, G. Tensor Computation for Seismic Data Processing: Linking Theory and Practice; Earth Systems Data and Models Series; Springer: Berlin, Germany, 2025.
- Kong, G.; Chen, F.; Yang, X.; Cheng, G.; Zhang, S.; He, W. Optimal Deception Asset Deployment in Cybersecurity: A Nash Q-Learning Approach in Multi-Agent Stochastic Games. Appl. Sci. 2024, 14, 357.
- Wilson, A.; Menzies, R.; Morarji, N.; Foster, D.; Mont, M.C.; Turkbeyler, E.; Gralewski, L. Multi-Agent Reinforcement Learning for Maritime Operational Technology Cyber Security. arXiv 2024.
- Du, J.; Yu, A.; Zhou, H.; Jiang, Q.; Bai, X. Research on Integrated Control Strategy for Highway Merging Bottlenecks Based on Collaborative Multi-Agent Reinforcement Learning. Appl. Sci. 2025, 15, 836.
- Garcia-Cantón, S.; Ruiz de Mendoza, C.; Cervelló-Pastor, C.; Sallent, S. Multi-Agent Reinforcement Learning-Based Routing and Scheduling Models in Time-Sensitive Networking for Internet of Vehicles Communications Between Transportation Field Cabinets. Appl. Sci. 2025, 15, 1122.
- Mamond, A.W.; Kundroo, M.; Yoo, S.e.; Kim, S.; Kim, T. FLDQN: Cooperative Multi-Agent Federated Reinforcement Learning for Solving Travel Time Minimization Problems in Dynamic Environments Using SUMO Simulation. Sensors 2025, 25, 911.
- Zhang, C.; Zhou, X.; Xu, C.; Wu, Z.; Liu, J.; Qi, H. Automatic Generation of Precast Concrete Component Fabrication Drawings Based on BIM and Multi-Agent Reinforcement Learning. Buildings 2025, 15, 284.
- Sharma, P.K.; Fernandez, R.; Zaroukian, E.; Dorothy, M.; Basak, A.; Asher, D.E. Survey of recent multi-agent reinforcement learning algorithms utilizing centralized training. In Proceedings of the Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications III; SPIE: Online Only, FL, USA, 2021; Volume 11746, pp. 665–676.
- Xiao, L.; Wu, X.; Wu, W.; Yang, J.; He, L. Multi-channel attentive graph convolutional network with sentiment fusion for multimodal sentiment analysis. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 4578–4582.
- Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Philip, S.Y. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4–24.
- Liu, Z.; Zhang, J.; Shi, E.; Liu, Z.; Niyato, D.; Ai, B.; Shen, X. Graph Neural Network Meets Multi-Agent Reinforcement Learning: Fundamentals, Applications, and Future Directions. IEEE Wirel. Commun. 2024, 31, 39–47.
- Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Ding, S.; Du, W.; Ding, L.; Zhang, J.; Guo, L.; An, B. Multiagent Reinforcement Learning With Graphical Mutual Information Maximization. IEEE Trans. Neural Netw. Learn. Syst. 2023, 1–10.
- Huang, J.; Su, J.; Chang, Q. Graph neural network and multi-agent reinforcement learning for machine-process-system integrated control to optimize production yield. J. Manuf. Syst. 2022, 64, 81–93.
- Leibfried, F.; Grau-Moya, J. Mutual-information regularization in Markov decision processes and actor-critic learning. In Proceedings of the Conference on Robot Learning, Graz, Austria, 9–12 July 2020; pp. 360–373.
- Oliehoek, F.A.; Amato, C.A. A Concise Introduction to Decentralized POMDPs; Springer: Berlin, Germany, 2016.
- Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
- Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 2020, 21, 1–51.
- Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
- Son, K.; Kim, D.; Kang, W.J.; Hostallero, D.E.; Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 5887–5896.
- Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
- Zhu, C.; Dastani, M.; Wang, S. A survey of multi-agent deep reinforcement learning with communication. Auton. Agents Multi-Agent Syst. 2024, 38, 4.
- Samvelyan, M.; Rashid, T.; De Witt, C.S.; Farquhar, G.; Nardelli, N.; Rudner, T.G.; Hung, C.M.; Torr, P.H.; Foerster, J.; Whiteson, S. The StarCraft Multi-Agent Challenge. arXiv 2019, arXiv:1902.04043.
Variable | Meaning |
---|---|
o | The agent’s raw observation |
 | The agent’s type |
 | The projected feature |
b | The aligned vector output of the Shared-Bottom |
 | The projection matrix |
 | The team information |
h | Attention head |
H | The number of attention heads |
N | The number of agents |
k | The number of k-nearest neighbors |
a | Agents’ actions |
u | The joint action of agents |
s | Agents’ states |
r | Agents’ rewards |
Q | Agents’ Q-value |
θ | Model parameters |
v | The feature that integrates neighbor information |
z | The concatenated features |
Name | Version |
---|---|
CPU | Intel(R) Xeon(R) Gold 6330 CPU @ 2.00 GHz |
GPU | Nvidia GeForce RTX 3090 |
OS | Ubuntu 18.04.5 LTS |
Language | Python 3.8.10 |
DL Framework | PyTorch 1.7.0 |
Scenario | Alliance | Enemy | Type | Difficulty |
---|---|---|---|---|
3m | Marine × 3 | Marine × 3 | Homogeneous | Easy |
2s3z | Stalker × 2, Zealot × 3 | Stalker × 2, Zealot × 3 | Heterogeneous | Easy |
1c3s5z | Colossus × 1, Stalker × 3, Zealot × 5 | Colossus × 1, Stalker × 3, Zealot × 5 | Heterogeneous | Medium |
10m_vs_11m | Marine × 10 | Marine × 11 | Homogeneous | Medium |
mmm | Medivac × 1, Marauder × 2, Marine × 7 | Medivac × 1, Marauder × 2, Marine × 7 | Heterogeneous | Hard |
mmm2 | Medivac × 1, Marauder × 2, Marine × 7 | Medivac × 1, Marauder × 3, Marine × 8 | Heterogeneous | Hard |
Parameter | Value |
---|---|
SMAC difficulty | 7 (Hard) |
Training episodes | 1,000,000 |
Discount factor | 0.99 |
Parameter update step | 200 |
Attention heads | 6 |
K-nearest neighbors (k) | 5 |
Replay buffer size | 3000 |
Winning Rate | 3m | 2s3z | 1c3s5z | 10m_vs_11m | mmm | mmm2 |
---|---|---|---|---|---|---|
COMA | 0.778 | 0.203 | 0.285 | 0.252 | 0.015 | 0.008 |
QTRAN | 0.991 * | 0.720 | 0.713 | 0.717 | 0.316 | 0.167 |
QMIX | 0.956 | 0.969 | 0.846 | 0.855 | 0.877 | 0.542 |
QMIX-GNN | 0.967 | 0.957 | 0.900 | 0.892 | 0.928 | 0.655 |
Comparison with QMIX | 0.011 | −0.012 | 0.054 | 0.037 | 0.051 | 0.113 |
Episodes | 3m | 2s3z | 1c3s5z | 10m_vs_11m | mmm |
---|---|---|---|---|---|
COMA | 820k | ∞ | ∞ | ∞ | ∞ |
QTRAN | 150k | ∞ | ∞ | ∞ | ∞ |
QMIX | 140k | 300k | 730k | 680k | 800k |
QMIX-GNN | 120k | 410k | 380k | 420k | 640k |
Model | QMIX | QMIX-HT | QMIX-GNN |
---|---|---|---|
Winning rate | 0.834 | 0.861 | 0.913 |
Improvement | - | 0.027 | 0.079 |