Dynamic Scene Path Planning of UAVs Based on Deep Reinforcement Learning
Abstract
1. Introduction
2. Problem Description and Model
2.1. Path Planning Problem
2.2. Environmental Model
3. Methods
3.1. MDP Element Design
3.1.1. State
3.1.2. Action
3.1.3. Reward
3.2. Action Selection Policy
Algorithm 1. Action selection policy with heuristic search rules.
Input: Q-value, UAV location (x1, y1), Target location (x2, y2). Output: Action a.
1. Generate p randomly, p ∈ (0, 1)
2. if x1 < x2:
3.   if y1 > y2:
4.     a = random(1, 3, 5, 6, 7)
5.   else:
6.     a = random(0, 3, 4, 5, 7)
7.   end if
8. else:
9.   if y1 > y2:
10.    a = random(1, 2, 4, 6, 7)
11.  else:
12.    a = random(0, 2, 4, 5, 6)
13.  end if
14. end if
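A minimal Python sketch of Algorithm 1 follows. It assumes the eight discrete actions are numbered 0–7 as in the paper's action space, and that the random threshold p gates an ε-greedy choice between the heuristic exploration subsets listed above and the greedy Q-value action; the listing only spells out the exploration subsets, so the ε-greedy wrapper is our assumption.

```python
import numpy as np

def select_action(q_values, uav_pos, target_pos, epsilon=0.1):
    """Heuristic-guided epsilon-greedy action selection (sketch of Algorithm 1).

    The candidate subsets mirror Algorithm 1 and restrict random exploration
    to actions that do not move the UAV away from the target.
    """
    x1, y1 = uav_pos
    x2, y2 = target_pos

    p = np.random.rand()              # step 1: p in (0, 1)
    if p >= epsilon:                  # assumed: exploit outside the exploration branch
        return int(np.argmax(q_values))

    # Heuristic exploration: sample only from the subset oriented toward the target.
    if x1 < x2:
        candidates = [1, 3, 5, 6, 7] if y1 > y2 else [0, 3, 4, 5, 7]
    else:
        candidates = [1, 2, 4, 6, 7] if y1 > y2 else [0, 2, 4, 5, 6]
    return int(np.random.choice(candidates))
```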
3.3. Improved D3QN Algorithm
Algorithm 2. Improved D3QN considering action selection policy with heuristic rules.
1. Initialize Q network parameter w and target Q network parameter w′ = w.
2. Initialize replay memory D with capacity N; set the priority of all Sum Tree leaf nodes pj = 1.
3. for i = 1 to T do
4.   Initialize s as the first state of the UAV.
5.   while s is not a termination state:
6.     Choose an action a according to Algorithm 1.
7.     Execute action a, transfer to the next state s′, obtain the immediate reward r, and judge whether the termination flag d holds.
8.     Store transition {s, a, s′, r, d} in D, replacing the oldest tuple if D is full.
9.     Sample Nb tuples {sj, aj, s′j, rj, dj}, j = 1, 2, …, Nb, from D with probability P(j) = pj^α / Σk pk^α, and compute the importance-sampling weight of the loss function wj = (N·P(j))^(−β) / maxk wk.
10.    Compute the current target Q-value yj.
11.    Compute the loss as Equation (11) and update the Q network parameter w.
12.    Compute the TD error for each sampled tuple, δj = yj − Q(sj, aj; w), and update the priority of the corresponding Sum Tree nodes, pj = |δj|.
13.    Every N′ steps, update the target Q network parameter w′ = w.
14.    s = s′.
15. end for
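The sketch below illustrates one update of the improved D3QN loop, roughly steps 9–12 of Algorithm 2: proportional prioritized sampling, importance-sampling weights, the double-DQN target, and the priority update. It is a simplified sketch rather than the paper's implementation: the Sum Tree is replaced by a flat list, and q_online / q_target are hypothetical callables returning per-action Q-values under the online and target network parameters.

```python
import numpy as np

def per_training_step(buffer, q_online, q_target, gamma=0.96,
                      batch_size=32, alpha=0.6, beta=0.4):
    """One prioritized-replay double-DQN update (sketch of Algorithm 2, steps 9-12).

    `buffer` is a list of (s, a, s_next, r, done, priority) tuples; a real
    implementation would use a Sum Tree for O(log N) sampling.
    """
    priorities = np.array([t[5] for t in buffer], dtype=np.float64)
    probs = priorities ** alpha
    probs /= probs.sum()                               # P(j) = p_j^a / sum_k p_k^a

    idx = np.random.choice(len(buffer), batch_size, p=probs)
    weights = (len(buffer) * probs[idx]) ** (-beta)
    weights /= weights.max()                           # importance-sampling weights

    td_errors = np.zeros(batch_size)
    for k, j in enumerate(idx):
        s, a, s_next, r, done, _ = buffer[j]
        if done:
            y = r
        else:
            # Double-DQN target: online net selects the action, target net evaluates it.
            a_star = int(np.argmax(q_online(s_next)))
            y = r + gamma * q_target(s_next)[a_star]
        td_errors[k] = y - q_online(s)[a]
        buffer[j] = (s, a, s_next, r, done, abs(td_errors[k]) + 1e-6)  # priority update

    # A real implementation would minimize this weighted loss w.r.t. the online network.
    loss = np.mean(weights * td_errors ** 2)
    return loss
```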
4. Results
4.1. Training Results
4.2. Results in Static Scenario
4.3. Visualized Action Field
4.4. Results in Dynamic Scene
5. Conclusions
- (1) The UAV path planning problem is formulated as an MDP within the reinforcement learning framework. The design covers the state space, action space, and reward function, and heuristic rules are incorporated into the action exploration strategy.
- (2) An improved deep reinforcement learning approach is proposed that approximates the agent's action Q-values with a prioritized-experience-replay D3QN. The network is implemented in the TensorFlow framework and combines double Q-learning, a dueling (competitive) network architecture, and prioritized experience replay; the resulting algorithmic flow for path planning is built on this improved D3QN (a minimal network sketch follows this list).
- (3) A reinforcement learning path planning policy is trained for static scenarios and analyzed in simulation. Compared with A*, RRT, DQN, and DDQN, the proposed strategy achieves shorter paths and lower planning time. A visualized action field is used to analyze the path planning policy learned by the agent.
- (4) Building on the static scenario, further training yields a planning policy for dynamic scenarios, which is tested in typical scenes. The results show that the UAV reliably reaches the target area while avoiding the obstacle zones, demonstrating the algorithm's robust performance.
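As referenced in conclusion (2), the sketch below shows a minimal dueling Q-network in TensorFlow/Keras of the kind the improved D3QN builds on. The layer widths and optimizer settings are illustrative assumptions, not the paper's exact architecture; the learning rate simply mirrors the 0.0005 listed in the training-parameter table.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dueling_q_network(state_dim, num_actions=8):
    """Minimal dueling (competitive) Q-network sketch in TensorFlow/Keras."""
    inputs = layers.Input(shape=(state_dim,))
    x = layers.Dense(128, activation="relu")(inputs)
    x = layers.Dense(128, activation="relu")(x)

    # Two streams: scalar state value V(s) and per-action advantage A(s, a).
    value = layers.Dense(1)(layers.Dense(64, activation="relu")(x))
    advantage = layers.Dense(num_actions)(layers.Dense(64, activation="relu")(x))

    # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    q_values = layers.Lambda(
        lambda t: t[0] + t[1] - tf.reduce_mean(t[1], axis=1, keepdims=True)
    )([value, advantage])

    model = tf.keras.Model(inputs=inputs, outputs=q_values)
    model.compile(optimizer=tf.keras.optimizers.Adam(5e-4), loss="mse")
    return model
```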
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
References
Reward | State of UAV | Reward Value |
---|---|---|
r1 | In the obstacle area | rt |
r2 | In the target area | +200 |
r3 | Outside the environmental boundary | −50 |
r4 | Beyond the maximum step | −50 |
r5 | Other state | −0.5 |
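A minimal sketch of the reward design in the table above, assuming hypothetical boolean flags supplied by the environment; the obstacle penalty is kept as a parameter because the table only lists it symbolically as rt.

```python
def step_reward(in_obstacle, in_target, out_of_bounds, over_max_steps,
                obstacle_penalty=-50.0):
    """Reward shaping sketch for the table above.

    `obstacle_penalty` stands in for the table's rt term, whose exact value is
    not given here; the boolean flags are assumed to be computed by the
    environment (hypothetical helpers, not from the paper).
    """
    if in_obstacle:
        return obstacle_penalty        # r1: inside an obstacle area
    if in_target:
        return 200.0                   # r2: reached the target area
    if out_of_bounds:
        return -50.0                   # r3: left the environment boundary
    if over_max_steps:
        return -50.0                   # r4: exceeded the maximum step budget
    return -0.5                        # r5: small per-step cost in any other state
```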
Obstacle Areas | Location of Center (km) | Rmax (km) | RMmax (km) |
---|---|---|---|
Obstacle 1 | (20, 22) | 6 | 5 |
Obstacle 2 | (30, 40) | 10 | 9 |
Obstacle 3 | (40, 18) | 8 | 7 |
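For illustration, a point-in-obstacle test over the areas in the table above; treating each obstacle as a circle of radius Rmax around the listed centre is our assumption.

```python
import math

# Obstacle centres and Rmax radii from the table above (km); modelling each
# area as a circle of radius Rmax is an assumption for this sketch.
OBSTACLES = [((20, 22), 6.0), ((30, 40), 10.0), ((40, 18), 8.0)]

def in_obstacle_area(x, y, obstacles=OBSTACLES):
    """Return True if point (x, y) lies inside any circular obstacle area."""
    return any(math.hypot(x - cx, y - cy) <= r for (cx, cy), r in obstacles)
```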
Parameter | Value | Parameter | Value |
---|---|---|---|
The radius of the target area | 4 km | Target network update rate | 0.008 |
Max episodes i | 10,000 | Discount factor | 0.96 |
Max steps of UAV | 500 | Q network learning rate | 0.0005 |
Replay memory capacity N | 10,000 | Initial exploration | 1 |
Minibatch size Nb | 32 | Final exploration | 0.1 |
Number of pre-training rounds | 200 | Number of annealing steps | 2000
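The initial exploration, final exploration, and annealing count in the table suggest a linearly annealed ε schedule; the sketch below assumes per-episode linear annealing, which the table does not state explicitly.

```python
def epsilon_schedule(episode, eps_start=1.0, eps_end=0.1, anneal_episodes=2000):
    """Linearly anneal exploration from eps_start to eps_end (assumed schedule).

    Values match the table above; whether annealing is counted per episode or
    per environment step is not stated, so per-episode is assumed here.
    """
    if episode >= anneal_episodes:
        return eps_end
    return eps_start + (eps_end - eps_start) * episode / anneal_episodes
```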
Algorithm | DQN | DDQN | Improved D3QN |
---|---|---|---|
Steps of UAV | 53 | 53 | 51 |
Cumulative reward | 174.0 | 174.5 | 175.0 |
Path Number | Initial Position | UAV Step | Cumulative Reward |
---|---|---|---|
1 | (20, 5) | 43 | 179.0 |
2 | (25, 7) | 41 | 188.0 |
3 | (10, 15) | 46 | 177.5 |
4 | (25, 28) | 29 | 188.0 |
5 | (55, 5) | 43 | 179.0 |
6 | (59, 59) | 5 | 198.0 |
7 | (31, 15) | 33 | 184.0 |
8 | (10, 30) | 11 | −55.0 |
Parameter | Value | Parameter | Value
---|---|---|---
Max episodes i | 20,000 | Number of pre-training rounds | 0
Replay memory capacity N | 50,000 | Q network learning rate | 0.00005
Target network update rate | 0.0004 | |