Solving the Vehicle Routing Problem with Stochastic Travel Cost Using Deep Reinforcement Learning
Abstract
1. Introduction
2. Preliminary
2.1. VRP-STC
2.2. Reinforcement Learning Framework
3. Method
3.1. Formulation of DRL
- State: The state st = (Rt, Ct) is a partial solution of the instance G(Q, q, R) at time step t. Here, Rt (for t ≠ 0) is the set of service customers, containing all customer nodes selected up to step t; Ct is the set of candidate nodes at step t; Q is the maximum capacity of each vehicle; and q is the customer demand.
- Action: Action at indicates that at step t, the candidate node πt is chosen from the candidate node set Ct and added to the service customer set Rt.
- Transition: After taking action at, a new partial solution is obtained as the next state, i.e., st+1 = (Rt+1, Ct+1). In the updated state, Rt+1 contains πt in addition to the nodes selected so far, whereas Ct+1 consists of the candidate nodes in Ct with πt removed.
- Reward: To minimize the total cost, we define the objective value at step t as Objt = min E(cost), the minimum expected cost of the current partial solution, and the reward at step t as rt = Objt−1 − Objt, i.e., the change in the objective caused by action at.
- Policy: The policy Pθ is parameterized by θ through the GAT-AM model. At each step t, it selects a candidate node as the next service customer node until all customers have been served, yielding the final solution π = {π1, π2, …, πn} (a minimal environment sketch is given after this list).
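To make the formulation above concrete, the following is a minimal sketch of the state, transition, and reward logic. It is illustrative only: the class and helper names (VRPState, expected_cost) are hypothetical, and plain Euclidean route length is used as a stand-in for the paper's stochastic travel-cost model E(cost).

```python
import math
from dataclasses import dataclass, field

def expected_cost(route, coords):
    """Stand-in for E(cost): Euclidean length of the partial route (hypothetical)."""
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(route, route[1:]))

@dataclass
class VRPState:
    """Partial solution st = (Rt, Ct) of an instance G(Q, q, R); node 0 is the depot."""
    coords: dict                  # node id -> (x, y)
    demand: dict                  # node id -> q_i
    capacity: float               # Q, maximum vehicle capacity
    served: list = field(default_factory=lambda: [0])    # Rt, starts at the depot
    candidates: set = field(default_factory=set)          # Ct
    remaining: float = None       # remaining capacity of the current vehicle

    def __post_init__(self):
        if self.remaining is None:
            self.remaining = self.capacity
        if not self.candidates:
            self.candidates = {i for i in self.coords if i != 0}

    def step(self, node):
        """Transition st -> st+1 by serving `node`; returns the reward rt = Objt-1 - Objt."""
        obj_before = expected_cost(self.served, self.coords)
        self.served = self.served + [node]           # Rt+1 = Rt ∪ {πt}
        self.candidates = self.candidates - {node}   # Ct+1 = Ct \ {πt}
        self.remaining -= self.demand[node]
        obj_after = expected_cost(self.served, self.coords)
        return obj_before - obj_after                # negative whenever the cost grows
```

An episode then alternates policy decisions and state.step(...) calls until the candidate set is empty.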
3.2. Model
3.3. Encoder
3.4. Decoder
3.5. Algorithm
Algorithm 1. REINFORCE with Rollout Baseline

Input: number of epochs E, steps per epoch T, batch size B, significance α
Initialize θ, θBaseline ← θ
For epoch = 1, …, E do
    For step = 1, …, T do
    End For
    If Test(Pθ, PθBaseline) < α then
        θBaseline ← θ
    End If
End For
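The body of the inner step loop is not reproduced in the extracted algorithm above. As a point of reference, the following is a minimal sketch, in PyTorch style, of how a REINFORCE update with a greedy rollout baseline is commonly implemented (in the spirit of Kool et al.). The model interface model(batch, decode=...) returning per-instance tour costs and log-probabilities, the instance sampler, and the baseline-update test are assumptions, not the authors' code.

```python
import torch
from scipy import stats

def train_epoch(model, baseline_model, optimizer, sample_instances, steps, batch_size):
    """One epoch of REINFORCE with a greedy rollout baseline (sketch)."""
    for _ in range(steps):
        batch = sample_instances(batch_size)               # random VRP-STC instances (assumed sampler)
        cost, log_prob = model(batch, decode="sampling")   # sampled rollout of Pθ (assumed interface)
        with torch.no_grad():
            base_cost, _ = baseline_model(batch, decode="greedy")  # greedy baseline rollout
        advantage = (cost - base_cost).detach()
        loss = (advantage * log_prob).mean()               # REINFORCE: (L(π) − b(s)) ∇θ log Pθ(π)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def maybe_update_baseline(model, baseline_model, eval_instances, alpha=0.05):
    """Replace θBaseline with θ when the current policy is significantly better,
    mirroring the Test(Pθ, PθBaseline) < α step of Algorithm 1 (one-sided paired t-test)."""
    with torch.no_grad():
        cost, _ = model(eval_instances, decode="greedy")
        base_cost, _ = baseline_model(eval_instances, decode="greedy")
    _, p = stats.ttest_rel(cost.cpu().numpy(), base_cost.cpu().numpy())
    if cost.mean() < base_cost.mean() and p / 2 < alpha:
        baseline_model.load_state_dict(model.state_dict())
```

Here train_epoch would correspond to the inner For step loop and maybe_update_baseline to the baseline update performed at the end of each epoch.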
3.6. Stochastic Travel Costs
4. Experiments
4.1. Experimental Settings
4.2. Baseline Methods and Evaluation Metrics
4.3. Comparison Analysis
4.4. Model Convergence Performance
4.5. Visualization
5. Conclusions
- Algorithmic Refinement and Generalization: The model currently exhibits certain limitations. Enhancing the algorithm’s generalization capacity to accommodate a broader spectrum of environments and various problem scenarios represents a pivotal area for future investigation.
- Real-time Dynamic Planning: With the growing demand for practical applications, implementing real-time dynamic planning within the model to accommodate the evolving logistics demands and traffic conditions is a pressing challenge awaiting resolution.
- Multi-objective Optimization: The present model primarily focuses on minimizing the total travel cost. In the future, exploration could be extended to achieve a balance among multiple objectives, such as service level assurance and minimizing environmental impacts.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Khalil, E.; Dai, H.; Zhang, Y.; Dilkina, B.; Song, L. Learning combinatorial optimization algorithms over graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 6348–6358.
- Bengio, Y.; Lodi, A.; Prouvost, A. Machine learning for combinatorial optimization: A methodological tour d’horizon. Eur. J. Oper. Res. 2021, 290, 405–421.
- Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction. IEEE Trans. Neural Netw. 1998, 9, 1054.
- Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A. Mastering the game of Go without human knowledge. Nature 2017, 550, 354–359.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
- Karimi-Mamaghan, M.; Mohammadi, M.; Meyer, P.; Karimi-Mamaghan, A.M.; Talbi, E.-G. Machine learning at the service of meta-heuristics for solving combinatorial optimization problems: A state-of-the-art. Eur. J. Oper. Res. 2022, 296, 393–422.
- Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292.
- Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256.
- Konda, V.; Tsitsiklis, J. Actor-critic algorithms. Adv. Neural Inf. Process. Syst. 1999, 12, 1008–1014.
- Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 22–24 June 2014; pp. 387–395.
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1889–1897.
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
- Babaeizadeh, M.; Frosio, I.; Tyree, S.; Clemons, J.; Kautz, J. Reinforcement learning through asynchronous advantage actor-critic on a GPU. arXiv 2016, arXiv:1611.06256.
- Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 2016, 17, 1334–1373.
- Deng, Y.; Bao, F.; Kong, Y.; Ren, Z.; Dai, Q. Deep direct reinforcement learning for financial signal representation and trading. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 653–664.
- Zheng, G.; Zhang, F.; Zheng, Z.; Xiang, Y.; Yuan, N.J.; Xie, X.; Li, Z. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 167–176.
- Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489.
- Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 2018, 362, 1140–1144.
- Schrittwieser, J.; Antonoglou, I.; Hubert, T.; Simonyan, K.; Sifre, L.; Schmitt, S.; Guez, A.; Lockhart, E.; Hassabis, D.; Graepel, T. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 2020, 588, 604–609.
- Vinyals, O.; Fortunato, M.; Jaitly, N. Pointer networks. Adv. Neural Inf. Process. Syst. 2015, 28.
- Lu, H.; Zhang, X.; Yang, S. A learning-based iterative method for solving vehicle routing problems. In Proceedings of the International Conference on Learning Representations. Available online: https://openreview.net/forum?id=BJe1334YDH (accessed on 13 August 2024).
- Manchanda, S.; Mittal, A.; Dhawan, A.; Medya, S.; Ranu, S.; Singh, A. Learning heuristics over large graphs via deep reinforcement learning. arXiv 2019, arXiv:1903.03332.
- Mazyavkina, N.; Sviridov, S.; Ivanov, S.; Burnaev, E. Reinforcement learning for combinatorial optimization: A survey. Comput. Oper. Res. 2021, 134, 105400.
- Cappart, Q.; Chételat, D.; Khalil, E.B.; Lodi, A.; Morris, C.; Veličković, P. Combinatorial optimization and reasoning with graph neural networks. J. Mach. Learn. Res. 2023, 24, 1–61.
- Kool, W.; Van Hoof, H.; Welling, M. Attention, learn to solve routing problems! arXiv 2018, arXiv:1803.08475.
- Nowak, A.; Villar, S.; Bandeira, A.S.; Bruna, J. A note on learning algorithms for quadratic assignment with graph neural networks. Stat 2017, 1050, 22.
- Li, Z.; Chen, Q.; Koltun, V. Combinatorial optimization with graph convolutional networks and guided tree search. Adv. Neural Inf. Process. Syst. 2018, 31. Available online: https://proceedings.neurips.cc/paper_files/paper/2018/file/8d3bba7425e7c98c50f52ca1b52d3735-Paper.pdf (accessed on 13 August 2024).
- Drori, I.; Kharkar, A.; Sickinger, W.R.; Kates, B.; Ma, Q.; Ge, S.; Dolev, E.; Dietrich, B.; Williamson, D.P.; Udell, M. Learning to solve combinatorial optimization problems on real-world graphs in linear time. In Proceedings of the 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 14–17 December 2020; pp. 19–24.
- Lodi, A.; Mossina, L.; Rachelson, E. Learning to handle parameter perturbations in combinatorial optimization: An application to facility location. EURO J. Transp. Logist. 2020, 9, 100023.
- Xidias, E.; Zacharia, P.; Nearchou, A. Intelligent fleet management of autonomous vehicles for city logistics. Appl. Intell. 2022, 52, 18030–18048.
- Luo, H.; Dridi, M.; Grunder, O. A branch-price-and-cut algorithm for a time-dependent green vehicle routing problem with the consideration of traffic congestion. Comput. Ind. Eng. 2023, 177, 109093.
- Bai, R.; Chen, X.; Chen, Z.-L.; Cui, T.; Gong, S.; He, W.; Jiang, X.; Jin, H.; Jin, J.; Kendall, G. Analytics and machine learning in vehicle routing research. Int. J. Prod. Res. 2023, 61, 4–30.
- Puterman, M.L. Markov decision processes. Handb. Oper. Res. Manag. Sci. 1990, 2, 331–434.
- Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
- Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024.
- Liu, Z.; Li, X.; Khojandi, A. The flying sidekick traveling salesman problem with stochastic travel time: A reinforcement learning approach. Transp. Res. Part E Logist. Transp. Rev. 2022, 164, 102816.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Helsgaun, K. An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems. Rosk. Rosk. Univ. 2017, 12, 966–980.
- Bello, I.; Pham, H.; Le, Q.V.; Norouzi, M.; Bengio, S. Neural combinatorial optimization with reinforcement learning. arXiv 2016, arXiv:1611.09940.
| Method | Obj (VRP20) | Gap | Time | Obj (VRP50) | Gap | Time | Obj (VRP100) | Gap | Time |
|---|---|---|---|---|---|---|---|---|---|
| LKH3 | 6.14 | 0.00% | 7 h | 10.38 | 0.00% | 7 h | 15.65 | 0.00% | 13 h |
| OR Tools | 6.42 | 4.84% | – | 11.22 | 8.12% | – | 17.14 | 9.34% | – |
| AM (greedy) | 6.40 | 4.57% | 1 s | 10.98 | 5.78% | 3 s | 16.80 | 7.34% | 8 s |
| AM (sampling) | 6.25 | 2.12% | 6 m | 10.62 | 2.31% | 28 m | 16.23 | 3.72% | 2 h |
| GAT-AM (greedy) | 6.35 | 3.76% | 1 s | 10.88 | 4.82% | 2 s | 16.13 | 2.89% | 5 s |
| GAT-AM (sampling) | 6.18 | 0.98% | 5 m | 10.51 | 1.25% | 11 m | 15.89 | 1.53% | 23 m |
| Method | Obj (VRP-STC20) | Time | Obj (VRP-STC50) | Time | Obj (VRP-STC100) | Time |
|---|---|---|---|---|---|---|
| AM (greedy) | 9.54 | 2 s | 16.36 | 8 s | 25.08 | 23 s |
| AM (sampling) | 9.31 | 13 m | 15.98 | 1 h | 24.68 | 4.5 h |
| GAT-AM (greedy) | 9.34 | 2 s | 16.01 | 6 s | 24.67 | 13 s |
| GAT-AM (sampling) | 9.18 | 7 m | 15.68 | 16 m | 24.36 | 46 m |
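For reference, the Gap column in the first table reports each method's objective relative to LKH3, which is 0.00% by definition. A minimal sketch of this metric follows; it assumes Gap = (Obj − ObjLKH3)/ObjLKH3, and the published percentages may differ slightly from values recomputed from the rounded table entries because they are presumably averaged per instance before rounding.

```python
def optimality_gap(obj, obj_ref):
    """Relative gap to the reference (LKH3) objective, in percent."""
    return 100.0 * (obj - obj_ref) / obj_ref

# Example with the VRP100 column for GAT-AM (sampling):
# optimality_gap(15.89, 15.65) ≈ 1.53%, matching the table entry.
```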