Deep Reinforcement Learning-Based Multi-Agent System with Advanced Actor–Critic Framework for Complex Environment
Abstract
1. Introduction
- We model the interactions between agents and the ViZDoom environment as a Partially Observable Markov Decision Process (POMDP), and systematically define the corresponding states, actions, and reward functions (see the sketch after this list). The reward functions serve as the key indicators for assessing how effectively each agent executes its designated specialized task.
- We propose MA-PPO, a multi-agent reinforcement learning framework based on Proximal Policy Optimization (PPO). The framework takes image observations as input and outputs joint actions composed of multiple individual actions to interact with the ViZDoom environment, achieving the predefined objectives in a manner closer to that of a human player.
- Simulation experiments show that MA-PPO achieves a 30.67% reward gain over the original PPO, and at least a 32.00% performance improvement over three other baseline algorithms, including DQN. Visual analysis shows that MA-PPO completes the task in the fewest steps, and the parameter experiments support the chosen hyperparameter settings.
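To make the first contribution concrete, the sketch below encodes per-agent action spaces and a shaped reward in the spirit of the paper's symbol table: small penalties for moving and for shooting, and a bonus for hitting the target. The class names and numeric values are illustrative assumptions, not the paper's exact definitions.

```python
from enum import IntEnum

class MoveAction(IntEnum):   # action space of the movement agent
    STAY = 0
    LEFT = 1
    RIGHT = 2

class FireAction(IntEnum):   # action space of the shooting agent
    HOLD = 0
    FIRE = 1

# Illustrative reward shaping (values are assumptions, not the paper's):
PENALTY_MOVE, PENALTY_SHOOT, REWARD_HIT = -1.0, -5.0, 100.0

def step_reward(move: MoveAction, fire: FireAction, hit_target: bool) -> float:
    """Shaped reward for one time slot: penalize motion and ammo use, reward hits."""
    r = 0.0
    if move != MoveAction.STAY:
        r += PENALTY_MOVE
    if fire == FireAction.FIRE:
        r += PENALTY_SHOOT
    if hit_target:
        r += REWARD_HIT
    return r

# The joint action passed to the environment is one action per agent.
joint_action = (MoveAction.LEFT, FireAction.FIRE)
```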
2. Related Work
3. Modeling in ViZDoom
3.1. Preliminaries
3.1.1. ViZDoom
3.1.2. Reinforcement Learning with POMDP
3.2. Multi-Agent POMDP Formulation
3.2.1. State
3.2.2. Action
3.2.3. Reward
4. Proposed Algorithm
4.1. Proximal Policy Optimization
4.2. Multi-Agent-Based RL Framework for ViZDoom
Algorithm 1 Multi-Agent Proximal Policy Optimization.
Input: state from the environment and the training hyperparameters;
Output: control strategy;
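The body of Algorithm 1 did not survive this rendering, so the following is a minimal stand-in rather than the authors' exact procedure: a standard clipped-PPO update applied per agent, assuming `actor(states)` returns a `torch.distributions` object and that `old_log_probs`, `returns`, and `advantages` were computed beforehand and detached.

```python
import torch

def ppo_update(actor, critic, optimizer, states, actions, old_log_probs,
               returns, advantages, eps_clip=0.2, k_epochs=10):
    """One round of clipped-PPO updates for a single agent (standard form)."""
    for _ in range(k_epochs):                      # repeat training K times
        log_probs = actor(states).log_prob(actions)
        ratio = torch.exp(log_probs - old_log_probs)          # r_t(theta)
        # Clipped surrogate objective; negated because optimizers minimize.
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantages
        actor_loss = -torch.min(surr1, surr2).mean()
        # Critic regression toward the target values y_t.
        critic_loss = (critic(states).squeeze(-1) - returns).pow(2).mean()
        optimizer.zero_grad()
        (actor_loss + critic_loss).backward()
        optimizer.step()
```

In the multi-agent setting, each agent would run this update on its own slice of the collected trajectory, with the joint action sent to ViZDoom assembled from the individual policies at interaction time.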
5. Results and Analysis
5.1. Experiment Setup
5.2. Performance Evaluation
5.2.1. Baseline
5.2.2. Convergence
5.3. Performance Comparison
5.3.1. Rendering
5.3.2. Parameter Experiments
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. PPO Algorithm
Algorithm A1 Proximal Policy Optimization.
Input: state from the environment and the training hyperparameters;
Output: actor network, critic network;
Appendix B. Performance Comparison with SAC and MADDPG
Appendix C. Performance Evaluation with Error Bars
Appendix D. Parameter Experiments of Learning Rate
| Symbol | Definition |
|---|---|
| $P$ | State transition probability function |
| $\gamma$ | Discount factor |
| $I$, $i$ | Set of agents, agent index |
| $T$, $t$ | Termination time, time slot |
| $R$, $r_t$ | Reward function, reward at time slot $t$ |
| | Penalty of movement and shooting |
| | Reward of hitting the target |
| $S$, $s_t$ | State space set, state at time slot $t$ |
| $A$, $a_t$ | Action space set, action taken by the agent at time slot $t$ |
| $A_i$ | Action space set of agent $i$ |
| | Stay still, move to the left, move to the right |
| | Hold a weapon without firing, fire |
| $r_t^i$, $s_t^i$, $a_t^i$ | Reward, state, or action of agent $i$ at time slot $t$ |
| $\tau$ | Trajectory |
| $y_t$ | Target value |
| $\omega$ | Parameter of the critic network |
| $V_\omega(s_t)$ | Value calculated by the critic network with parameter $\omega$ |
| $L(\omega)$ | Loss function of the critic network with parameter $\omega$ |
| $\hat{A}_t$ | Estimated value of the advantage function |
| $\delta_t$ | Measurement of the loss estimated by the critic network |
| $\theta$ | Parameter of the actor network |
| $\theta_{\text{old}}$ | Parameter of the actor network before updating |
| $\pi_\theta$ | Policy of the actor network |
| $\pi_{\theta_{\text{old}}}$ | Policy of the actor network before updating |
| $\pi_\theta(a_t \mid s_t)$ | Probability of policy $\pi_\theta$ taking action $a_t$ at state $s_t$ |
| $r_t(\theta)$ | Ratio of $\pi_\theta$ to $\pi_{\theta_{\text{old}}}$ for action $a_t$ at state $s_t$ |
| $L^{\text{CLIP}}(\theta)$ | Loss function for updating the actor network |
| $\mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)$ | Value of $r_t(\theta)$ after the clip operation |
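For reference, the quantities in this table combine in the standard PPO form (Schulman et al.); the display below is the textbook objective, assumed rather than verified to match the paper's own equations.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
\mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \right], \qquad
L(\omega) = \mathbb{E}_t\!\left[ \big( y_t - V_\omega(s_t) \big)^{2} \right].
```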
| Parameter | Description | Value |
|---|---|---|
| | Learning rate | $[1 \times 10^{-7},\ 1 \times 10^{-5}]$ |
| $\gamma$ | Discount factor for calculating the target value | 0.98 |
| | Discount factor for calculating the advantage function | 0.98 |
| | Bias between 0 and 1 | 0.95 |
| $\epsilon$ | Range of the clip operation | 0.2 |
| | Training epochs | 1000 |
| | Training steps | 2000 |
| | Repeat training times | 10 |
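The two discount-style entries above are consistent with generalized advantage estimation (GAE). The sketch below assumes the 0.98 and 0.95 values play the usual $\gamma$ and $\lambda$ roles; that mapping is an assumption, since the table's parameter symbols were not preserved.

```python
def gae_advantages(rewards, values, gamma=0.98, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    `values` must hold one more entry than `rewards` (the bootstrap value
    of the final state). The gamma/lam mapping to the paper's table is
    assumed, not confirmed.
    """
    advantages, gae = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae
        advantages.append(gae)
    advantages.reverse()
    return advantages
```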
| Scheme | Average Reward ↑ | Average Step ↓ | Ammunition (AMMO) ↑ |
|---|---|---|---|
| PPO | 56 | 35 | 47 |
| A2C | −109 | 165 | 40 |
| DQN | 36 | 55 | 47 |
| MA-A2C | 47 | 44 | 47 |
| MA-PPO (Ours) | 76 | 20 | 48 |