Article

Advances in Q-Learning: Real-Time Optimization of Multi-Distant Transportation Systems

Ahmad Barghash and Ahmad Abuznaid
1 Department of Computer Science, German Jordanian University, Amman 11180, Jordan
2 Department of Computer Engineering, German Jordanian University, Amman 11180, Jordan
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9493; https://doi.org/10.3390/app15179493
Submission received: 22 July 2025 / Revised: 20 August 2025 / Accepted: 25 August 2025 / Published: 29 August 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Vehicle routing optimization has become a vital topic in modern transport digitalization projects. At present, no fully adapted technique offers optimal routes that cover all locations to be visited while respecting mandatory transportation constraints. This project explores the application of modern machine learning techniques to transportation problems, with a specific focus on Q-learning, which we use to address the traveling salesman and vehicle routing problems. The ability of Q-learning to find optimal solutions in dynamic environments helps overcome the vulnerabilities of traditionally used algorithms. Moreover, this project provides a comparative analysis of accuracy and speed between Q-learning and algorithms currently used in the same scope, using a set of generated routing datasets. Q-learning showed superior performance, generating solutions closest to the global optima while exhibiting high computational efficiency and fast execution even on large-scale problem instances, suggesting that it can serve as a powerful tool for optimizing transportation systems.

1. Introduction

In the world of transportation digitalization, routing is one of the hardest challenges. The most common prototypes of the routing problem are the traveling salesman problem (TSP) and vehicle routing problem (VRP), both of which can be considered as benchmarks for testing the performance of optimization algorithms [1]. The VRP goal is to find an optimal set of routes for a fleet of vehicles to serve a given set of customers, while minimizing the total distance traveled or maximizing the number of customers served. Such an aim is subject to various constraints, including capacity limitations, time windows, and vehicle availability.
The VRP is a challenging problem because it involves finding an optimal solution from a large number of possible routes and vehicle assignments. Additionally, the problem is often dynamic, where new customer demands and changes in fleet and road conditions may arise during the optimization process. Therefore, the development of efficient algorithms and techniques for solving VRP has been an active area of research in the field of operations research and logistics.
To facilitate a deep investigation of the VRP, several benchmarks, such as CVRPLIB [2], were introduced to offer routing datasets along with their optimal solutions. The datasets are labeled in terms of the number of locations to be visited (n) and the number of participating drivers (k). Figure 1 presents the CVRPLIB dataset E-n22-k4, where four drivers start from one specific location (blue) and must cover the remaining 21 locations [3]. The objective is to determine an optimal driver allocation plan that includes the route each driver must take to complete their respective tasks. It is crucial that at least one driver visits each of the 21 locations while minimizing the time and cost required for completion. The cost of each driver's task is measured using a single parameter that accounts for any potential challenges encountered during the job, such as traffic and maintenance issues, and is represented as the Euclidean distance between two points.
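As a minimal illustration of how such a Euclidean cost matrix can be built from plane coordinates, consider the sketch below. The coordinates are hypothetical and are not taken from E-n22-k4.

```python
import math

# Hypothetical (x, y) coordinates for a depot (index 0) and four customers.
locations = [(145, 215), (151, 264), (159, 261), (130, 254), (128, 252)]

def euclidean(a, b):
    """Straight-line travel cost between two locations."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Symmetric cost matrix: cost[i][j] is the Euclidean distance from i to j.
cost = [[euclidean(p, q) for q in locations] for p in locations]

print(round(cost[0][1], 2))  # travel cost between the depot and the first customer
```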
To facilitate comparison when new algorithms are introduced, the optimal path that the four drivers can take to fulfill the tasks is also provided for each routing dataset in CVRPLIB. Figure 2 shows the suggested solution for the E-n22-k4 dataset. If the problem is reduced so that only one driver is considered, the VRP converges to the famous traveling salesman problem (TSP). Optimizing solutions to this problem has been of significant interest because of its direct effect on transportation costs and time [4,5]. Therefore, collections of TSP datasets and optimal paths, such as TSPLIB [7], have become increasingly available [6] to simplify the analysis of the TSP.
Many studies have presented algorithms that target the TSP and VRP [8,9,10,11]. Other studies presented a comparative analysis in which popular techniques were compared in solving TSP in fixed environments. For example, Mukhairez et al. compared the performance of simulated annealing (SA), a genetic algorithm (GA), and ant colony optimization (ACO) in solving TSP in an environment of 30 cities [12]. The authors reported that in terms of the shortest path, ACO was the clear winner but also had the slowest execution time. SA had the fastest execution time and had comparable results to GA, whereas GA came in second place for both execution time and results. In another study, the authors favored GA over ACO in terms of speed but not in terms of accuracy [13]. Similarly, Haroun et al. compared ACO and GA and reported that GA was faster and more lightweight than ACO, whereas ACO consistently found shorter distances, especially with larger routing datasets [14]. However, other comparative studies have reported that GA outperformed ACO [15], suggesting that GA can be a contender to ACO for finding the shortest distance in TSP, which should be considered when Q-learning (QL) is introduced to solve the TSP. In this project, we used Q-learning to solve the TSP and compared its performance with that of the GA, ACO, nearest neighbor (NN), insertion heuristics, Christofides algorithm, and simulated annealing (SA). Although Q-learning has already been introduced in the scope of TSP [16], to the best of our knowledge, an intensive comparison with distinguished TSP algorithms is not yet available.

2. Materials and Methods

In this study, we tested the selected algorithms on publicly available datasets for trustworthy comparisons. To analyze the TSP, we used six routing datasets, mostly from TSPLIB (FIVE, P01, GR17, FRI26, DANTZIG42, and ATT48), all of which are reported in [6]. Next, we tested the ability of Q-learning to handle the VRP. At this level, we used the six CVRPLIB datasets listed in Table 2 [2,3]. In this section, we present a comparison between the performance of Q-learning in the TSP and VRP and that of other commonly used approaches. To obtain fair and stable comparisons, we completed all implementations in JetBrains PyCharm 2020.1.5 installed on a machine with 16 GB of RAM and an 11th-generation Intel Core i5 processor.

2.1. Ant Colony Optimization

ACO is a metaheuristic algorithm inspired by the behavior of ants in nature and has been applied to a variety of optimization problems [1,17]. To apply ACO to the TSP, ants are imagined as being initially placed at random locations on the map and then allowed to move to other locations while the paths they use are tracked. The tracked moves continue until a termination condition is satisfied, such as reaching the maximum number of iterations or finding an acceptable solution. By repeating this process, the ACO algorithm can provide a high-quality solution to the TSP, even with large-scale routing datasets [17]. Despite its many advantages, ACO can become stuck in local optima, resulting in suboptimal solutions [18].
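The following minimal sketch illustrates the procedure described above on a distance matrix; the parameter values (number of ants, evaporation rate, and so on) are illustrative defaults, not the configuration used in our experiments.

```python
import random

def aco_tsp(dist, n_ants=20, n_iter=100, alpha=1.0, beta=3.0, rho=0.5, q=1.0, seed=0):
    """Minimal ACO sketch for the symmetric TSP over a distance matrix `dist`."""
    rng = random.Random(seed)
    n = len(dist)
    pheromone = [[1.0] * n for _ in range(n)]              # uniform initial pheromone
    best_tour, best_len = None, float("inf")

    def tour_length(tour):
        return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

    for _ in range(n_iter):
        tours = []
        for _ant in range(n_ants):
            start = rng.randrange(n)                        # each ant starts at a random city
            tour, unvisited = [start], set(range(n)) - {start}
            while unvisited:
                i = tour[-1]
                # Desirability of each candidate move: pheromone^alpha * (1/distance)^beta.
                weights = [(j, (pheromone[i][j] ** alpha) *
                               ((1.0 / max(dist[i][j], 1e-12)) ** beta)) for j in unvisited]
                total = sum(w for _, w in weights)
                r, acc, chosen = rng.uniform(0.0, total), 0.0, weights[-1][0]
                for j, w in weights:                        # roulette-wheel selection
                    acc += w
                    if acc >= r:
                        chosen = j
                        break
                tour.append(chosen)
                unvisited.remove(chosen)
            tours.append(tour)

        # Evaporate pheromone, then let each ant deposit an amount
        # inversely proportional to the length of its tour.
        for i in range(n):
            for j in range(n):
                pheromone[i][j] *= (1.0 - rho)
        for tour in tours:
            length = tour_length(tour)
            if length < best_len:
                best_tour, best_len = tour, length
            for k in range(n):
                a, b = tour[k], tour[(k + 1) % n]
                pheromone[a][b] += q / length
                pheromone[b][a] += q / length

    return best_tour, best_len
```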

2.2. The Genetic Algorithm

The GA is a metaheuristic optimization algorithm inspired by the process of natural selection in genetics. The GA has been applied to a wide range of optimization problems, including the TSP [14,19]. To apply the GA to the TSP, candidate solutions are represented as chromosomes, where each chromosome encodes a route through the cities of a routing dataset, and the population is a set of such routes. The population starts from random routes, which are then ranked in terms of their fitness. Genetic operations, such as selection, crossover, and mutation, are applied repeatedly to produce the parents of the next generation. This process is repeated until a satisfactory solution is obtained. The GA has produced good solutions to the TSP, even with large-scale routing datasets. However, parameters such as the population size, mutation rate, and crossover probability must be handled carefully to ensure acceptable outcomes.
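A minimal sketch of such a GA is given below, using permutation chromosomes, tournament selection, ordered crossover, and swap mutation; the operator choices and parameter values are illustrative assumptions rather than our exact implementation.

```python
import random

def ga_tsp(dist, pop_size=100, generations=300, crossover_p=0.9, mutation_p=0.02, seed=0):
    """Minimal GA sketch for the TSP: permutation chromosomes, ordered crossover, swap mutation."""
    rng = random.Random(seed)
    n = len(dist)

    def length(tour):
        return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

    def tournament(pop, k=3):
        return min(rng.sample(pop, k), key=length)

    def ordered_crossover(p1, p2):
        a, b = sorted(rng.sample(range(n), 2))
        child = [None] * n
        child[a:b] = p1[a:b]                       # copy a slice from the first parent
        fill = [c for c in p2 if c not in child]   # keep the order of the second parent
        idx = 0
        for i in range(n):
            if child[i] is None:
                child[i] = fill[idx]
                idx += 1
        return child

    def mutate(tour):
        for i in range(n):                         # independent swap mutation per position
            if rng.random() < mutation_p:
                j = rng.randrange(n)
                tour[i], tour[j] = tour[j], tour[i]
        return tour

    population = [rng.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(generations):
        nxt = [min(population, key=length)]        # elitism: keep the best route
        while len(nxt) < pop_size:
            p1, p2 = tournament(population), tournament(population)
            child = ordered_crossover(p1, p2) if rng.random() < crossover_p else p1[:]
            nxt.append(mutate(child))
        population = nxt
    best = min(population, key=length)
    return best, length(best)
```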

2.3. Nearest Neighbor

The NN is a simple heuristic algorithm that can be used to solve the TSP [20]. It starts with an arbitrary city as the starting point and then repeatedly selects the nearest unvisited city until all cities have been visited, eventually forming a tour. NN can provide a reasonable solution to the TSP, especially for small- to medium-sized instances. However, finding the optimal solution is not guaranteed, and the quality of the obtained solution depends heavily on the order in which the cities are visited.
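The heuristic is short enough to state directly; the sketch below is a straightforward rendering of the description above on a distance matrix.

```python
def nearest_neighbor_tsp(dist, start=0):
    """Nearest-neighbor heuristic: always move to the closest unvisited city."""
    n = len(dist)
    tour, unvisited = [start], set(range(n)) - {start}
    while unvisited:
        current = tour[-1]
        nxt = min(unvisited, key=lambda j: dist[current][j])
        tour.append(nxt)
        unvisited.remove(nxt)
    # Close the tour by returning to the starting city.
    total = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
    return tour, total
```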

2.4. Insertion Heuristics

Insertion heuristics are a class of algorithms that are widely used to solve TSP [21,22]. Insertion heuristics iteratively build a feasible solution from an initial set of cities. At each iteration, a new city is added to the current solution to minimize the increase in the total cost of the solution. The insertion can be performed in various ways, such as nearest-neighbor insertion, cheapest insertion, and farthest insertion.
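As one representative of this class, the sketch below implements cheapest insertion: at each step the city whose best insertion position increases the tour cost the least is inserted there. It is an illustrative sketch, not our exact implementation.

```python
def cheapest_insertion_tsp(dist, start=0):
    """Cheapest-insertion heuristic for the TSP on a distance matrix `dist`."""
    n = len(dist)
    # Start from a two-city sub-tour: the start city and its nearest neighbor.
    first = min((j for j in range(n) if j != start), key=lambda j: dist[start][j])
    tour, remaining = [start, first], set(range(n)) - {start, first}
    while remaining:
        best = None  # (cost increase, city, insertion position)
        for city in remaining:
            for pos in range(len(tour)):
                a, b = tour[pos], tour[(pos + 1) % len(tour)]
                increase = dist[a][city] + dist[city][b] - dist[a][b]
                if best is None or increase < best[0]:
                    best = (increase, city, pos + 1)
        _, city, pos = best
        tour.insert(pos, city)
        remaining.remove(city)
    total = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
    return tour, total
```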

2.5. Christofides Algorithm

The Christofides algorithm is a heuristic algorithm that is frequently used to solve the TSP [23]. The algorithm first constructs a minimum spanning tree (MST) for a given set of cities; the MST is a tree that connects all cities with the minimum possible total edge weight. The algorithm then adds edges to the MST to create an Eulerian circuit, which is a path that visits each edge exactly once. This is achieved by computing a minimum-weight matching on the odd-degree vertices of the MST and adding those edges so that all vertices have even degrees. Finally, the algorithm traverses the Eulerian circuit, skipping previously visited vertices, to obtain a Hamiltonian circuit, which is a path that visits every vertex exactly once. However, this algorithm cannot handle asymmetric TSPs.
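The following sketch walks through these steps using networkx graph utilities (assuming a recent networkx version that provides min_weight_matching); it mirrors the description above and is not our experimental implementation.

```python
import networkx as nx

def christofides_tsp(dist):
    """Sketch of the Christofides steps for a symmetric, metric TSP."""
    n = len(dist)
    graph = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            graph.add_edge(i, j, weight=dist[i][j])

    # 1. Minimum spanning tree of the complete graph.
    mst = nx.minimum_spanning_tree(graph)

    # 2. Minimum-weight matching on the odd-degree vertices of the MST.
    odd = [v for v in mst.nodes if mst.degree(v) % 2 == 1]
    matching = nx.min_weight_matching(graph.subgraph(odd))

    # 3. Combine MST and matching into an Eulerian multigraph and take an Eulerian circuit.
    multigraph = nx.MultiGraph(mst)
    multigraph.add_edges_from(matching)
    circuit = nx.eulerian_circuit(multigraph, source=0)

    # 4. Shortcut repeated vertices to obtain a Hamiltonian tour.
    tour, seen = [], set()
    for u, _v in circuit:
        if u not in seen:
            tour.append(u)
            seen.add(u)
    total = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
    return tour, total
```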

2.6. Simulated Annealing

SA is inspired by the annealing process in metallurgy, in which a metal is heated and then slowly cooled to reduce its defects and improve its properties. Similarly, when SA is applied to the TSP [24], it starts with an initial solution and then iteratively improves it by gradually perturbing the solution, accepting changes that lead to better solutions and occasionally accepting worse solutions with a probability that decreases as the temperature drops, which helps it escape local optima. However, SA is classified among the algorithms with high computational complexity.
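A minimal SA sketch for the TSP is shown below, using a 2-opt (segment reversal) neighborhood and a geometric cooling schedule; the temperature settings are illustrative assumptions, not our tuned values.

```python
import math
import random

def sa_tsp(dist, t_start=1000.0, t_end=1e-3, cooling=0.995, seed=0):
    """Simulated annealing sketch for the TSP with a 2-opt neighborhood."""
    rng = random.Random(seed)
    n = len(dist)

    def length(tour):
        return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

    current = list(range(n))
    rng.shuffle(current)
    current_len = length(current)
    best, best_len = current[:], current_len
    t = t_start
    while t > t_end:
        # Propose a neighbor by reversing a random segment of the tour (2-opt move).
        i, j = sorted(rng.sample(range(n), 2))
        candidate = current[:i] + current[i:j + 1][::-1] + current[j + 1:]
        candidate_len = length(candidate)
        delta = candidate_len - current_len
        # Always accept improvements; accept worse tours with probability exp(-delta / t).
        if delta < 0 or rng.random() < math.exp(-delta / t):
            current, current_len = candidate, candidate_len
            if current_len < best_len:
                best, best_len = current[:], current_len
        t *= cooling                                  # geometric cooling schedule
    return best, best_len
```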

2.7. Google Or-Tools

Google has developed an open-source toolbox, OR-Tools, that incorporates several solvers. Currently, OR-Tools is often used to solve the VRP. Because it contains multiple solvers, its performance is reported to supersede that of single-algorithm approaches.
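For orientation, the sketch below shows a basic distance-minimizing VRP setup with the OR-Tools routing solver (pywrapcp and routing_enums_pb2 from the ortools package). It omits capacity and time-window constraints and is an illustrative configuration, not necessarily the one used in our experiments.

```python
from ortools.constraint_solver import pywrapcp, routing_enums_pb2

def solve_vrp(dist, num_vehicles, depot=0):
    """Sketch of a basic VRP with OR-Tools: minimize total distance, no side constraints."""
    manager = pywrapcp.RoutingIndexManager(len(dist), num_vehicles, depot)
    routing = pywrapcp.RoutingModel(manager)

    def distance_callback(from_index, to_index):
        # OR-Tools works with internal indices; map them back to node ids.
        return int(dist[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)])

    transit = routing.RegisterTransitCallback(distance_callback)
    routing.SetArcCostEvaluatorOfAllVehicles(transit)

    params = pywrapcp.DefaultRoutingSearchParameters()
    params.first_solution_strategy = routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC

    solution = routing.SolveWithParameters(params)
    if solution is None:
        return None, None

    routes = []
    for v in range(num_vehicles):
        index, route = routing.Start(v), []
        while not routing.IsEnd(index):
            route.append(manager.IndexToNode(index))
            index = solution.Value(routing.NextVar(index))
        routes.append(route + [depot])               # close each route at the depot
    return routes, solution.ObjectiveValue()
```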

2.8. Q-Learning

Reinforcement learning (RL) is a branch of machine learning that deals with agents that learn to make decisions through trial-and-error interactions with an environment. In RL, the agent learns to choose actions that maximize a long-term reward signal [1]. RL is widely used in solving the VRP [25,26,27], frequently as part of the solution [28] and even in multi-agent environments [29]. Q-learning, in turn, is an RL method in which the learning process is based on rewarding or punishing the learner for its actions, and the learner's goal is to collect as many rewards as possible while avoiding punishments. Q-learning uses a table known as a Q-table to store the expected future rewards for each state–action pair in the environment. The agent interacts with the environment by taking actions and receiving rewards, and then updates the Q-table based on the observed rewards.
The Q-learning algorithm follows the principle of the Bellman equation [1], which states that the optimal action–value function satisfies a recursive relationship. This relationship involves the current and expected future rewards that can be obtained by taking different actions in the next state. Q-Learning is presently used in solving the TSP, where the environment consists of all locations and the distances in between, the state is the current location with the predicted rewards, the action is the decision made by the driver moving to the next location, and the Q-value is the value in the Q-table that shows the final decision the driver should use to move between locations considering rewards and punishments. The Q-table is updated as in the following equation:
Q(s, a) ← Q(s, a) + α [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]
where Q(s,a) represents the expected future reward of taking action a (moving from one location to another) in state s (the current location). α has a value between 0 and 1 and represents the learning rate. r is the reward received for taking action a in state s to reach state s′. max_a′ Q(s′,a′) represents the best future reward attainable from the next state s′ over all next actions a′. γ has a value between 0 and 1 and controls how strongly the agent weighs future rewards; exploration itself is governed by a separate parameter, the exploration rate (epsilon).
The exploration rate determines how often the agent tries actions in unknown parts of the environment to gain the required information. Exploration is followed by exploitation, where the agent takes actions based on what it has already learned about the environment. The balance between exploration and exploitation is important in Q-learning, as too much exploration can lead to slow learning and poor performance, while too much exploitation can cause the agent to become stuck in suboptimal policies. In this work, we used the FRI26 dataset to find suitable parameters for our model, where the reported optimal solution reward is |937|. We found that using 0.8 for α, 0.7 for γ, 0.95 for the epsilon decay, and 300 iterations leads to a reward of |976|, which is around 96% accurate, indicating the suitability of the model configuration. Fewer iterations lead to faster performance, but this affects the overall reliability.
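As a concrete illustration, the minimal sketch below applies tabular Q-learning with the parameter values reported above (α = 0.8, γ = 0.7, epsilon decay 0.95, 300 episodes) to a TSP distance matrix. The state encoding (current city plus the set of visited cities) and the reward design (negative travel distance) are illustrative assumptions, not a description of our exact implementation.

```python
import random

def q_learning_tsp(dist, episodes=300, alpha=0.8, gamma=0.7,
                   epsilon=1.0, epsilon_decay=0.95, seed=0):
    """Tabular Q-learning sketch for the TSP. The reward is the negative travel
    distance, so maximizing reward minimizes the tour length."""
    rng = random.Random(seed)
    n = len(dist)
    q = {}                                            # Q-table: (state, action) -> value

    def q_get(state, a):
        return q.get((state, a), 0.0)

    best_tour, best_len = None, float("inf")
    for _ in range(episodes):
        current, visited = 0, frozenset([0])
        tour, tour_len = [0], 0.0
        while len(visited) < n:
            state = (current, visited)
            actions = [a for a in range(n) if a not in visited]
            if rng.random() < epsilon:                # explore: random unvisited city
                action = rng.choice(actions)
            else:                                     # exploit: best known action
                action = max(actions, key=lambda a: q_get(state, a))
            reward = -dist[current][action]
            next_visited = visited | {action}
            next_state = (action, next_visited)
            next_actions = [a for a in range(n) if a not in next_visited]
            future = max((q_get(next_state, a) for a in next_actions), default=0.0)
            # Bellman update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            q[(state, action)] = q_get(state, action) + alpha * (
                reward + gamma * future - q_get(state, action))
            tour.append(action)
            tour_len += dist[current][action]
            current, visited = action, next_visited
        tour_len += dist[current][0]                  # return to the starting city
        if tour_len < best_len:
            best_tour, best_len = tour, tour_len
        epsilon *= epsilon_decay                      # shift from exploration to exploitation
    return best_tour, best_len
```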

3. Results and Discussion

In this section, we present the results of applying the chosen techniques to six datasets, mostly from TSPLIB, and report the findings in terms of accuracy, time, and the iterations required to complete specific stages. Because any acceptable approach involves a clear tradeoff between accuracy and speed, we list the detailed results after the comparisons.
Initially, we noticed that the quickest algorithms often had below-average accuracy on large-scale datasets. For example, NN and insertion heuristics are among the top algorithms in terms of speed, but their accuracy drops noticeably as the number of locations grows. Figure 3 shows the change in the required execution time as the environment grows for each technique.
As expected, the required execution time increased as more locations were added for all algorithms except NN. The low execution time of NN, however, was accompanied by a drop in accuracy, which indicates that it is not a good choice for large-scale routing datasets. Generally, all techniques faced noticeable accuracy drops as the environment was extended, except for ACO and Q-learning, where the drop was limited, as shown in Figure 4.
At this stage, we noticed that the chosen technique should manage the tradeoff between execution time and accuracy. In applications where prompt action is required, a limited drop in accuracy may be acceptable. However, applications in which the resulting route must not be far from optimal would accept the extra time needed to reach an approximately optimal solution. Therefore, we examined the achievements of Q-learning, ACO, SA, and GA at defined iteration counts, while considering different time points for the other algorithms. Table 1 presents the detailed outcomes.
We found that 100 iterations were sufficient to achieve excellent results in Q-learning, ACO, GA, and SA in small- or mid-size environments. In large-scale environments, extra iterations help Q-learning and ACO achieve acceptable results but do not help GA or SA. On the other hand, NN, the Christofides algorithm, and insertion heuristics were granted the required time to achieve their final acceptable tour, but the accuracy dropped in large-scale environments, even with increased execution time.
One noticeable finding was how differently the algorithms dealt with the relatively small dataset P01, where four out of seven algorithms achieved excellent results, while the GA, insertion heuristics, and the Christofides algorithm performed clearly worse than they did on both smaller and larger datasets. A likely reason is that this dataset, unlike the others, is not a TSPLIB dataset and probably has a different complexity.
Finally, we tested the performance of Q-learning in VRP problems using Google OR-Tools. The OR-Tools, with its multi-solver nature, outperformed Q-learning even with multi-agent options and provided stable results as the environment extended in terms of the number of locations and drivers. Nonetheless, we expect that a better Q-Learning performance can be achieved if different exploration strategies and reward systems are considered in such environments. Table 2 presents a detailed comparison of VRPs.

4. Conclusions

In this work, we presented a detailed comparison between the performance of Q-learning and that of commonly used approaches on TSP and VRP problems. We aimed to present reasonable solutions that can have direct effects on modern-world transportation. We found that Q-learning outperformed the other approaches on TSP problems, where only ACO had comparable performance. On the other hand, Q-learning did not perform well on large-scale VRP instances, where it was outperformed by Google OR-Tools. Based on our findings, Q-learning is highly advised for TSP problems but is advisable only for small-scale VRP problems. Moreover, we expect better results if Q-learning is combined with other algorithms in the same module.

Author Contributions

This work is a result of the master's thesis of A.A. under the supervision of A.B. Conceptualization, A.B. and A.A.; methodology, A.A. and A.B.; scripting, A.A.; validation, A.A. and A.B.; writing—original draft preparation, A.B. and A.A.; writing the manuscript, A.B.; visualization, A.A.; supervision, A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
QL	Q-learning
TSP	Traveling salesman problem
VRP	Vehicle routing problem
ACO	Ant colony optimization
GA	Genetic algorithm
NN	Nearest neighbor
SA	Simulated annealing
MST	Minimum spanning tree
RL	Reinforcement learning

References

  1. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction. 1998. Available online: https://www.cambridge.org/core/journals/robotica/article/robot-learning-edited-by-jonathan-h-connell-and-sridhar-mahadevan-kluwer-boston-19931997-xii240-pp-isbn-0792393651-hardback-21800-guilders-12000-8995/737FD21CA908246DF17779E9C20B6DF6 (accessed on 28 May 2025).
  2. Uchoa, E.; Pecin, D.; Pessoa, A.; Poggi, M.; Vidal, T.; Subramanian, A. New benchmark instances for the Capacitated Vehicle Routing Problem. Eur. J. Oper. Res. 2017, 257, 845–858. [Google Scholar] [CrossRef]
  3. CVRPLIB—Plotted Instances. Available online: http://vrp.atd-lab.inf.puc-rio.br/index.php/en/plotted-instances?data=E-n22-k4 (accessed on 29 May 2025).
  4. Davendra, D. Traveling Salesman Problem: Theory; Applications. 2010. Available online: https://books.google.com/books?hl=en&lr=&id=gKWdDwAAQBAJ&oi=fnd&pg=PR11&dq=Traveling+Salesman+Problem:+Theory+and+Applications&ots=aacB087hD7&sig=y3elL3SUkXtjd_TbIwEdi0T0ix8 (accessed on 31 May 2025).
  5. Demez, H. Combinatorial Optimization: Solution Methods of Traveling Salesman Problem. Master’s Thesis, Eastern Mediterranean University, Famagusta, North Cyprus, 2013. Available online: https://i-rep.emu.edu.tr/xmlui/handle/11129/654 (accessed on 31 May 2025).
  6. TSP—Data for the Traveling Salesperson Problem. Available online: https://people.sc.fsu.edu/~jburkardt/datasets/tsp/tsp.html (accessed on 29 May 2025).
  7. Reinelt, G. TSPLIB—A Traveling Salesman Problem Library. ORSA J. Comput. 1991, 3, 376–384. [Google Scholar] [CrossRef]
  8. Lingling, W.; Qingbao, Z. An efficient approach for solving TSP: The rapidly convergent ant colony algorithm. In Proceedings of the 4th International Conference on Natural Computation, ICNC 2008, Jinan, China, 18–20 October 2008; Volume 4, pp. 448–452. [Google Scholar] [CrossRef]
  9. Mohsen, A.M. Annealing Ant Colony Optimization with Mutation Operator for Solving TSP. Comput. Intell. Neurosci. 2016, 2016, 8932896. [Google Scholar] [CrossRef] [PubMed]
  10. Hussain, A.; Muhammad, Y.S.; Sajid, M.N.; Hussain, I.; Shoukry, A.M.; Gani, S. Genetic Algorithm for Traveling Salesman Problem with Modified Cycle Crossover Operator. Comput. Intell. Neurosci. 2017, 2017, 7430125. [Google Scholar] [CrossRef] [PubMed]
  11. Ismkhan, H.; Zamanifar, K. Developing Programming Tools to Handle Traveling Salesman Problem by the Three Object-Oriented Languages. Appl. Comput. Intell. Soft Comput. 2014, 2014, 137928. [Google Scholar] [CrossRef]
  12. Mukhairez, H.H.A.; Maghari, A.Y.A. Performance Comparison of Simulated Annealing, GA and ACO Applied to TSP. Int. J. Intell. Comput. Res. 2015, 6, 647–654. [Google Scholar] [CrossRef]
  13. El Din, H.M. Comparative Analysis of Ant Colony Optimization and Genetic Algorithm in Solving the Traveling Salesman Problem; Blekinge Institute of Technology: Karlskrona, Sweden, 2021. [Google Scholar]
  14. Haroun, S.A.; Jamal, B.; Hicham, E.H. A Performance Comparison of GA and ACO Applied to TSP. Int. J. Comput. Appl. 2015, 117, 28–35. [Google Scholar] [CrossRef]
  15. Alhanjouri, M.; Alfarra, B. Ant colony versus genetic algorithm based on travelling salesman problem. Int. J. Comput. Technol. Appl. 2011, 2, 570–578. Available online: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=20d117804c246f3bcb366fd8e6962cde78e34f1b (accessed on 31 May 2025).
  16. Chen, P.; Wang, Q. Learning for multiple purposes: A Q-learning enhanced hybrid metaheuristic for parallel drone scheduling traveling salesman problem. Comput. Ind. Eng. 2024, 187, 109851. [Google Scholar] [CrossRef]
  17. Manfrin, M.; Birattari, M.; Stützle, T.; Dorigo, M. Parallel ant colony optimization for the traveling salesman problem. In Ant Colony Optimization and Swarm Intelligence: 5th International Workshop; Springer: Berlin, Heidelberg, 2006; Available online: https://link.springer.com/chapter/10.1007/11839088_20 (accessed on 31 May 2025).
  18. Gan, R.; Guo, Q.; Chang, H.; Yi, Y. Improved ant colony optimization algorithm for the traveling salesman problems. J. Syst. Eng. Electron. 2010, 21, 329–333. [Google Scholar] [CrossRef]
  19. Holland, J. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. 1992. Available online: https://books.google.com/books?hl=en&lr=&id=5EgGaBkwvWcC&oi=fnd&pg=PR7&dq=Adaptation+in+Natural+and+Artificial+Systems:+An+Introductory+Analysis+with+Applications+to+Biology,+Control,+and+Artificial+Intelligence&ots=mKjq65Knwo&sig=7UCexT89PHykaf8ooWmXKCa9XZM (accessed on 31 May 2025).
  20. Hougardy, S.; Wilde, M. On the nearest neighbor rule for the metric traveling salesman problem. Discrete Appl. Math. 2015, 195, 101–103. [Google Scholar] [CrossRef]
  21. Raymond, T.C. Heuristic Algorithm for the Traveling-Salesman Problem. IBM J. Res. Dev. 1969. Available online: https://ieeexplore.ieee.org/abstract/document/5391746/ (accessed on 31 May 2025).
  22. Ayudhya, W.; Grasman, S. A New Heuristic Algorithm for the Traveling Salesman Problem. 2005. Available online: https://search.proquest.com/openview/ab34da7735a2205dfa1664a3bf507c81/1?pq-origsite=gscholar&cbl=51908 (accessed on 31 May 2025).
  23. Genova, K.; Williamson, D.P. An Experimental Evaluation of the Best-of-Many Christofides’ Algorithm for the Traveling Salesman Problem. Algorithmica 2017, 78, 1109–1130. [Google Scholar] [CrossRef]
  24. Bayram, H.; Şahin, R. A new simulated annealing approach for travelling salesman problem. Math. Comput. Appl. 2013, 18, 313–322. [Google Scholar] [CrossRef]
  25. Wang, Y.; Sun, S.; Li, W. Hierarchical Reinforcement Learning for Vehicle Routing Problems with Time Windows. In Proceedings of the Canadian Conference on Artificial Intelligence, Vancouver, BC, Canada, 25–28 May 2021. [Google Scholar] [CrossRef]
  26. Nazari, M.; Oroojlooy, A.; Snyder, L.; Takác, M. Reinforcement Learning for Solving the Vehicle Routing Problem. In Advances in Neural Information Processing Systems; 2018. Available online: https://proceedings.neurips.cc/paper/2018/hash/9fb4651c05b2ed70fba5afe0b039a550-Abstract.html (accessed on 19 August 2025).
  27. Yan, D.; Guan, Q.; Ou, B.; Yan, B.; Cao, H. Graph-Driven Deep Reinforcement Learning for Vehicle Routing Problems with Pickup and Delivery. Appl. Sci. 2025, 15, 4776. [Google Scholar] [CrossRef]
  28. Tien, Z.C.; Qi-lee, J. Enhancing vehicle routing problem solutions through deep reinforcement learning and graph neural networks. Int. J. Enterp. Model. 2022, 16, 125–135. [Google Scholar]
  29. Singh, J.; Dhurandher, S.K.; Woungang, I.; Ngatched, T.M.N. Multi-agent Reinforcement Learning Based Approach for Vehicle Routing Problem. Lect. Notes Inst. Comput. Sci. Soc.-Inform. Telecommun. Eng. 2023, 459, 411–422. [Google Scholar] [CrossRef]
Figure 1. The CVRPLIB datasets (E-n22-k4) [3].
Figure 2. Optimal solution of dataset (E-n22-k4) [3].
Figure 3. Change in required execution time as the environment grows.
Figure 4. Accuracy change over the extended environment.
Table 1. Detailed test outcomes for all algorithms considering required iterations and execution time.

| Problem | Size | Optimal Answer | Q-Learning: Iterations | Time | Answer | Accuracy | Ant Colony: Iterations | Time | Answer | Accuracy |
| FIVE | 5 | 19 | 100 | 0.015615 | 19 | 100 | 100 | 0.25 | 19 | 100 |
| P01 | 15 | 291 | 100 | 0.0625 | 291 | 100 | 100 | 1.95 | 291 | 100 |
| GR17 | 17 | 2085 | 100 | 0.04688 | 2187 | 95.33 | 100 | 2.3 | 2153 | 96.84 |
| FRI26 | 26 | 937 | 200 | 0.15625 | 959 | 97.7 | 200 | 4.48 | 962 | 97.4 |
| DANTZIG42 | 42 | 699 | 300 | 0.42194 | 820 | 85.24 | 300 | 21.55 | 830 | 84.22 |
| ATT48 | 48 | 33,523 | 500 | 0.79745 | 38,375 | 87.36 | 500 | 28.86 | 38,624 | 86.79 |

| Problem | Size | Optimal Answer | Nearest Neighbor: Iterations | Time | Answer | Accuracy | Christofides Algorithm: Iterations | Time | Answer | Accuracy |
| FIVE | 5 | 19 | - | 66.4 × 10−6 | 21 | 90.48 | - | 13.3 × 10−4 | 23 | 82.6 |
| P01 | 15 | 291 | - | 80 × 10−6 | 291 | 100 | - | 16.5 × 10−4 | 432 | 67.36 |
| GR17 | 17 | 2085 | - | 86.5 × 10−6 | 2187 | 95.33 | - | 20 × 10−4 | 2352 | 88.65 |
| FRI26 | 26 | 937 | - | 115 × 10−6 | 1112 | 84.26 | - | 31 × 10−4 | 1094 | 85.65 |
| DANTZIG42 | 42 | 699 | - | 200 × 10−6 | 956 | 73.12 | - | 57 × 10−4 | 908 | 76.98 |
| ATT48 | 48 | 33,523 | - | 238 × 10−6 | 40,551 | 82.67 | - | 70 × 10−4 | 43,088 | 77.8 |

| Problem | Size | Optimal Answer | Simulated Annealing: Iterations | Time | Answer | Accuracy | Genetic Algorithm: Generations | Time | Answer | Accuracy |
| FIVE | 5 | 19 | 100 | 59.2 × 10−4 | 19 | 100 | 100 | 0.0883 | 19 | 100 |
| P01 | 15 | 291 | 100 | 86 × 10−4 | 291 | 100 | 100 | 0.272 | 307 | 94.79 |
| GR17 | 17 | 2085 | 100 | 73 × 10−4 | 2090 | 99.76 | 100 | 0.26632 | 2167 | 96.22 |
| FRI26 | 26 | 937 | 200 | 11 × 10−3 | 1088 | 86.12 | 200 | 1.02 | 1353 | 69.25 |
| DANTZIG42 | 42 | 699 | 300 | 10.4 × 10−3 | 919 | 76.06 | 300 | 2.805 | 1066 | 65.57 |
| ATT48 | 48 | 33,523 | 500 | 188 | 52,658 | 63.66 | 500 | 6.523 | 53,625 | 62.51 |

| Problem | Size | Optimal Answer | Insertion Heuristics: Iterations | Time | Answer | Accuracy |
| FIVE | 5 | 19 | - | 83 × 10−6 | 19 | 100 |
| P01 | 15 | 291 | - | 206 × 10−6 | 371 | 78.43 |
| GR17 | 17 | 2085 | - | 273 × 10−6 | 2382 | 87.53 |
| FRI26 | 26 | 937 | - | 733 × 10−6 | 1201 | 78.02 |
| DANTZIG42 | 42 | 699 | - | 285 × 10−4 | 895 | 78.1 |
| ATT48 | 48 | 33,523 | - | 409 × 10−4 | 42,252 | 79.34 |
Table 2. Performance of Q-learning vs. OR-Tools in VRP problems.

| Problem | Number of Locations | Number of Drivers | Optimal Answer | Q-Learning (Multi-Agent): Iterations | Answer | Accuracy (%) |
| P-n20-k2 | 20 | 2 | 216 | 100 | 223 | 96.86 |
| P-n22-k2 | 22 | 2 | 216 | 100 | 238 | 90.75 |
| E-n22-k4 | 22 | 4 | 375 | 100 | 412 | 91 |
| E-n33-k4 | 33 | 4 | 835 | 200 | 921 | 90.66 |
| P-n76-k4 | 76 | 4 | 593 | 300 | 1062 | 57.8 |
| P-n101-k4 | 101 | 4 | 681 | 500 | 1500 | 45 |

| Problem | Number of Locations | Number of Drivers | Optimal Answer | Google OR-Tools: Iterations | Answer | Accuracy (%) |
| P-n20-k2 | 20 | 2 | 216 | 100 | 216 | 100 |
| P-n22-k2 | 22 | 2 | 216 | 100 | 216 | 100 |
| E-n22-k4 | 22 | 4 | 375 | 100 | 375 | 100 |
| E-n33-k4 | 33 | 4 | 835 | 200 | 857 | 97.4 |
| P-n76-k4 | 76 | 4 | 593 | 300 | 606 | 97.8 |
| P-n101-k4 | 101 | 4 | 681 | 500 | 942 | 72.3 |
