Maneuver Decision-Making for Autonomous Air Combat Based on FRE-PPO
Abstract
1. Introduction
2. Background
3. Method
3.1. Aircraft Model and Missile Model
3.2. Reward Design
3.3. FRE-PPO Algorithm
- Use the single path or vine procedures to collect a set of state-action pairs along with Monte Carlo estimates of their Q-values;
- By averaging over samples, construct the estimated objective and constraint of the constrained optimization problem;
- Approximately solve this constrained optimization problem to update the policy’s parameter vector θ by means of the conjugate gradient algorithm followed by a line search, which is altogether only slightly more expensive than computing the gradient itself.

Here, θ_old is the parameter of the current strategy, ρ is the probability distribution of state transition, q is the sampling distribution, and D_KL is the KL divergence.
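The sample-averaging step in the procedure above can be illustrated with a minimal sketch. It assumes a small discrete action space and tabular policies stored as dictionaries. The function name `surrogate_and_kl`, the sample layout, and the trust-region bound `delta` are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def surrogate_and_kl(samples, pi_old, pi_new):
    """Estimate the importance-weighted objective and the mean KL divergence.

    samples: list of (state, action, q_prob, q_value) tuples, where q_prob is
             the probability the sampling distribution q gave to the action and
             q_value is a Monte Carlo estimate of Q(state, action).
    pi_old, pi_new: dicts mapping state -> list of action probabilities
             (hypothetical tabular policies, used only for illustration).
    """
    objective = 0.0
    mean_kl = 0.0
    for state, action, q_prob, q_value in samples:
        # Importance-weighted Q-value: pi_new(a|s) / q(a|s) * Q(s, a)
        objective += pi_new[state][action] / q_prob * q_value
        # KL(pi_old(.|s) || pi_new(.|s)) for a discrete action distribution
        mean_kl += sum(
            p * math.log(p / r)
            for p, r in zip(pi_old[state], pi_new[state])
            if p > 0.0
        )
    n = len(samples)
    return objective / n, mean_kl / n


# Toy data: one state, two actions, sampled from q = pi_old.
pi_old = {0: [0.5, 0.5]}
pi_new = {0: [0.6, 0.4]}
samples = [(0, 0, 0.5, 1.0), (0, 1, 0.5, -1.0)]

objective, mean_kl = surrogate_and_kl(samples, pi_old, pi_new)
delta = 0.05  # hypothetical trust-region bound on the mean KL divergence
step_is_acceptable = mean_kl <= delta
```

A candidate update would be kept only if it raises the estimated objective while the mean KL divergence stays within the bound. In the procedure above, the line search after the conjugate gradient step serves this purpose.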
3.4. Air Combat Agent Training Framework Based on FRE-PPO
3.5. Air Combat State
4. Experiments and Results
4.1. Ablation Studies
4.2. Simulation Experiments
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
State | Symbol | Formula |
---|---|---|
yaw angle | | |
pitch angle | | |
velocity | v | |
altitude | z | |
distance between the two sides | d | |
launch missile | f1 | 0 or 1 |
yaw angle of the missile | | |
pitch angle of the missile | | |
distance between the missile and the other side | d1 | |
heading crossing angle | | |
launch missile from the other side | f2 | 0 or 1 |
Hyperparameter | Value |
---|---|
azimuth angle | (−π/4, π/4) |
distance | (40,000 m, 100,000 m) |
velocity | (250 m/s, 400 m/s) |
batch size | 1024 |
optimizer | Adam |
actor learning rate | 0.0002 |
critic learning rate | 0.001 |
actor architecture | (256, 256, 4) |
critic architecture | (256, 256, 1) |
activation function | tanh |
epochs | 8 |
 | 0.995 |
 | 0.98 |
 | 0.2 |
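The settings in the table above can be gathered into a single configuration object, as in the minimal sketch below. Several points are assumptions rather than statements from the table. The three unlabeled values (0.995, 0.98, 0.2) are assumed to be the discount factor, the GAE parameter, and the PPO clip range, in that order. The azimuth, distance, and velocity ranges are assumed to be initial-condition ranges sampled uniformly. The names `TrainingConfig` and `sample_initial_conditions` are hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Ranges listed in the table (assumed to be initial-condition ranges).
    azimuth_range: tuple = (-math.pi / 4, math.pi / 4)  # rad
    distance_range: tuple = (40_000.0, 100_000.0)       # m
    velocity_range: tuple = (250.0, 400.0)              # m/s
    # Optimization settings listed in the table.
    batch_size: int = 1024
    optimizer: str = "Adam"
    actor_lr: float = 2e-4
    critic_lr: float = 1e-3
    actor_layers: tuple = (256, 256, 4)
    critic_layers: tuple = (256, 256, 1)
    activation: str = "tanh"
    epochs: int = 8
    # ASSUMPTION: the table does not name these three values; the usual
    # PPO/GAE reading (discount, GAE lambda, clip range) is used here.
    gamma: float = 0.995
    gae_lambda: float = 0.98
    clip_range: float = 0.2


def sample_initial_conditions(cfg, rng):
    """Draw one initial geometry, assuming uniform sampling over each range."""
    return {
        "azimuth": rng.uniform(*cfg.azimuth_range),
        "distance": rng.uniform(*cfg.distance_range),
        "velocity": rng.uniform(*cfg.velocity_range),
    }


cfg = TrainingConfig()
initial = sample_initial_conditions(cfg, random.Random(0))
```

Keeping the values in a frozen dataclass makes it harder to change a setting by accident between ablation runs.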
Metric | GAE | DSR | FRE-U | FRE-S | FRE |
---|---|---|---|---|---|
maximum wins | 32 | 14 | 32 | 28 | 55 |
average wins | 9.20 | 3.75 | 12.45 | 10.68 | 26.22 |
average losses | 7.96 | 4.37 | 8.96 | 9.03 | 18.08 |
average draws | 90.84 | 99.88 | 86.59 | 88.29 | 63.70 |
average time per decision | 0.001 s | 0.001 s | 0.001 s | 0.001 s | 0.001 s |
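The averages in the table above can be turned into comparable rates with a short calculation, sketched below. In each column the average wins, losses, and draws sum to 108. This suggests 108 engagements per evaluation, but that is an inference from the numbers and is not stated in the table. The helper name `summarize` and the derived metric names are hypothetical.

```python
# Average outcomes per method, copied from the table.
results = {
    "GAE":   {"wins": 9.20,  "losses": 7.96,  "draws": 90.84},
    "DSR":   {"wins": 3.75,  "losses": 4.37,  "draws": 99.88},
    "FRE-U": {"wins": 12.45, "losses": 8.96,  "draws": 86.59},
    "FRE-S": {"wins": 10.68, "losses": 9.03,  "draws": 88.29},
    "FRE":   {"wins": 26.22, "losses": 18.08, "draws": 63.70},
}


def summarize(row):
    """Derive rates from the average win, loss, and draw counts."""
    total = row["wins"] + row["losses"] + row["draws"]
    decisive = row["wins"] + row["losses"]
    return {
        "total": round(total, 2),
        # Share of all engagements that ended in a win.
        "win_rate": row["wins"] / total,
        # Share of decisive (non-draw) engagements that ended in a win.
        "win_share_of_decisive": row["wins"] / decisive,
    }


summary = {name: summarize(row) for name, row in results.items()}
best_by_win_rate = max(summary, key=lambda name: summary[name]["win_rate"])
```

Reporting the win share among decisive engagements alongside the raw win rate separates how often a method forces a result from how often that result is favorable.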
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, H.; Wei, Y.; Zhou, H.; Huang, C. Maneuver Decision-Making for Autonomous Air Combat Based on FRE-PPO. Appl. Sci. 2022, 12, 10230. https://doi.org/10.3390/app122010230