The proposed Hybrid Reinforcement Learning-Variable Neighborhood Strategy Adaptive Search (H-RL-VaNSAS) algorithm introduces several innovative elements that distinguish it from existing methods in the field of urban bus routing optimization. By combining these innovative elements, H-RL-VaNSAS offers a flexible and effective solution for complex urban bus routing problems, providing significant improvements in solution quality and computational efficiency.
The Variable Neighborhood Strategy Adaptive Search (VaNSAS) algorithm is a novel approach in metaheuristics designed to efficiently solve complex optimization problems. It allows the algorithm to explore diverse solution areas using various search strategies, enhancing the chances of finding optimal solutions. VaNSAS consists of five key steps: track initiation, black box selection, black box operation, update of the track, and repetition of steps. By incorporating different search methods, VaNSAS can adapt to the problem at hand and improve solution quality. Overall, VaNSAS offers a flexible and effective way to tackle challenging optimization problems and achieve near-optimal results efficiently.
Integrating reinforcement learning (RL) into VaNSAS can enhance decision-making during black box selection by leveraging adaptive learning from interactions with the environment. RL’s ability to optimize based on rewards could improve solution quality and efficiency in VaNSAS by learning optimal strategies for black box selection. This integration aligns with VaNSAS’s exploration and intensification principles, offering a more adaptive and robust optimization approach that adjusts search strategies based on problem feedback. Further research is needed to explore the effectiveness of this integration in solving complex optimization problems. The stepwise explanation of the RL-VaNSAS is as follows.
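As a high-level orientation before the stepwise explanation, the sketch below outlines the overall RL-VaNSAS loop in Python. It is a minimal sketch only: the decode and evaluate functions, the improvement-box callables, the maximization assumption, and the simplified Q-value-weighted roulette selection (the full selection rule of Equation (27) also uses selection frequency, efficiency, and reward counts) are all placeholders rather than the paper's exact formulation.

```python
import random

def roulette_select(weights):
    """Roulette-wheel selection over non-negative weights."""
    shifted = [w + 1e-9 for w in weights]       # avoid an all-zero wheel
    r = random.random() * sum(shifted)
    acc = 0.0
    for idx, w in enumerate(shifted):
        acc += w
        if acc >= r:
            return idx
    return len(shifted) - 1

def rl_vansas(num_tracks, dim, improvement_boxes, decode, evaluate,
              max_iter, gamma=0.2):
    """Skeleton of RL-VaNSAS: initiate tracks, select an improvement box (IB)
    per track, apply it, update the tracks and the best solution, repeat."""
    tracks = [[random.random() for _ in range(dim)] for _ in range(num_tracks)]
    q_values = [0.0] * len(improvement_boxes)   # expected reward per IB
    best_track, best_value = None, float("-inf")

    # evaluate the initial tracks to seed the best-so-far solution
    for track in tracks:
        value = evaluate(decode(track))
        if value > best_value:
            best_track, best_value = list(track), value

    for t in range(max_iter):
        for i, track in enumerate(tracks):
            b = roulette_select(q_values)                       # black box selection
            new_track = improvement_boxes[b](track, tracks, best_track)
            value = evaluate(decode(new_track))                 # black box operation
            if value > best_value:                              # update best and track
                best_track, best_value = list(new_track), value
                q_values[b] += gamma * (1.0 - q_values[b])      # reward this IB
            tracks[i] = new_track
    return best_track, best_value
```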
3.3.1. Establishment of the Initial Set of Tracks
VaNSAS is a population-based heuristic approach. The initial set of tracks is randomly generated, with values in each position ranging from 0 to 1 (real numbers). These values represent the degree of preference or importance assigned to each destination or bus route within the track. In the initial iteration, the value in each position, such as those shown in Figure 1, is randomly selected. For example, the values 0.94, 0.72, 0.96, etc., indicate the initial preference levels for the corresponding destinations and bus routes. These values help guide the optimization process by indicating which elements to prioritize. During subsequent iterations, these values are adjusted using the improvement box (IB) mechanism, which refines the solution by enhancing the preference values based on the algorithm's feedback. This iterative adjustment ensures that the optimization process converges towards a high-quality solution by progressively improving the preference values.
Let T denote the maximum number of iterations, and t represent the current iteration. The parameter NP, which remains constant, refers to the number of tracks in a given iteration. The size of each track in our proposed problem is defined as 1 × D, where D is the sum of the number of tourist attractions and the number of allowed bus routes. For example, if there are 10 attractions and 3 allowed bus routes, D will be set to 13 positions in one track. An example of a track with 13 positions is shown in Figure 1.
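The encoding described above can be reproduced with a few lines of Python. This is a minimal sketch of the track initialization only; the variable names are illustrative.

```python
import random

NP = 10                                 # number of tracks, constant across iterations
NUM_ATTRACTIONS = 10
NUM_BUS_ROUTES = 3
D = NUM_ATTRACTIONS + NUM_BUS_ROUTES    # track length, here 1 x 13

# Each track is a vector of real-valued preference levels in [0, 1]
tracks = [[round(random.random(), 2) for _ in range(D)] for _ in range(NP)]
print(tracks[0])    # a randomly generated track analogous to the one in Figure 1
```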
The decoding procedure for transforming a track into a solution for the proposed problem involves three steps, as illustrated by the example track in Figure 1, which has a dimension of 1 × 13. If NP is set to 10, the remaining nine tracks are constructed similarly. The decoding process follows these steps:
Step 1: Data Organization. First, sort the track into structured lists to enhance route management efficiency. The values in positions 1 to 10, representing destinations, are sorted into List A, and the values in positions 11 to 13, representing the bus route indices, are sorted into List B. Additionally, create an initially empty List C for the assigned bus stops. This sorting differentiates between the destination positions and the bus route indices, facilitating efficient route planning.
Step 2: Route Planning and Execution. Begin with the first route in List B to traverse demand points up to its maximum travel capacity, constrained by time or fuel limits per route. The planning takes into account the approximate fuel consumption for each segment between attractions and the expected number of passengers at each stop. Once a bus reaches its passenger capacity or estimated fuel limit, it returns to the depot (all buses share the same depot). The assigned route is then removed from List B, and the assigned destination is transferred from List A to List C. If a destination is assigned to List C, any nearby destinations (within a 20-min walk) are also removed from List A.
Step 3: Iterative Process and Completion. Continue the routing process until all routes in List B are completed or List A is empty. This iterative approach ensures that all destinations are covered within the operational constraints.
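A simplified sketch of this decoding procedure is given below. It is an illustration under stated assumptions rather than the paper's exact implementation: the symmetric distance lookup (keyed by unordered pairs, with the depot as node 0), the fuel-based route length check, and the walkable-distance removal are modeled, while passenger capacity is ignored; all function and parameter names are ours.

```python
def decode_track(track, num_attractions, distances, depot=0,
                 walk_limit_km=5, fuel_capacity_l=60, fuel_per_km=0.3):
    """Decode a real-valued track into bus routes (simplified sketch of Steps 1-3).

    distances: dict mapping frozenset({a, b}) -> km (symmetric, depot node = 0).
    Returns the planned routes and List C (all covered destinations)."""
    # Step 1: data organization -- sort destination and route positions by value
    list_a = sorted(range(1, num_attractions + 1), key=lambda p: track[p - 1])
    list_b = sorted(range(num_attractions + 1, len(track) + 1),
                    key=lambda p: track[p - 1])
    list_c, routes = [], []
    max_km = fuel_capacity_l / fuel_per_km            # fuel-limited route length

    # Steps 2-3: route planning, repeated until routes or destinations run out
    while list_b and list_a:
        list_b.pop(0)                                 # take the next bus route
        route, km, here = [], 0.0, depot
        while list_a:
            nxt = list_a[0]
            leg = distances[frozenset({here, nxt})]
            back = distances[frozenset({nxt, depot})]
            if km + leg + back > max_km:              # must still be able to return
                break
            list_a.pop(0)
            route.append(nxt)                         # serve this stop by bus
            list_c.append(nxt)
            km += leg
            here = nxt
            # remove destinations reachable on foot from this stop
            walkable = [d for d in list_a
                        if distances[frozenset({nxt, d})] <= walk_limit_km]
            for d in walkable:
                list_a.remove(d)
                list_c.append(d)
        routes.append(route)                          # bus returns to the depot
    return routes, list_c
```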
Table 1 presents the important parameters used in the decoding example. The walking speed is assumed to be 10 km/h, and the bus travels at 60 km/h. Fuel consumption is set at 0.3 L per kilometer, with a maximum fuel capacity of 60 L per round, allowing for a maximum travel distance of 130 km.
Table 2 displays the symmetrical distances between destinations in kilometers. This symmetry ensures that the distance from point A to point B is equal to the distance from point B to point A, providing consistency and accuracy in route planning.
Step 1: Data Organization
The first step involves sorting the tracks into structured lists. List A, which represents destinations, is sorted in ascending order: [0.14 (7), 0.52 (6), 0.54 (5), 0.55 (9), 0.72 (2), 0.79 (10), 0.85 (4), 0.94 (1), 0.96 (3), 0.96 (8)]. This assigns the destinations to bus stops in the following order: Position 7 (value 0.14) as the first stop, Position 6 (value 0.52) as the second stop, Position 5 (value 0.54) as the third stop, Position 9 (value 0.55) as the fourth stop, Position 2 (value 0.72) as the fifth stop, Position 10 (value 0.79) as the sixth stop, Position 4 (value 0.85) as the seventh stop, Position 1 (value 0.94) as the eighth stop, and Positions 3 and 8 (both with value 0.96) as the ninth and tenth stops. List B, which represents bus routes, is sorted in ascending order: [0.39 (11), 0.49 (13), 0.75 (12)], assigning the bus routes as follows: Position 11 (value 0.39) as the first route, Position 13 (value 0.49) as the second route, and Position 12 (value 0.75) as the third route. List C, which represents assigned bus stops, is initially empty.
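The sorting in this example can be reproduced with a short snippet; the position-value pairs below are taken directly from the example track described above, and the tie between positions 3 and 8 is resolved by position order.

```python
# Example track (positions 1-10: destinations, 11-13: bus routes)
track = {1: 0.94, 2: 0.72, 3: 0.96, 4: 0.85, 5: 0.54, 6: 0.52, 7: 0.14,
         8: 0.96, 9: 0.55, 10: 0.79, 11: 0.39, 12: 0.75, 13: 0.49}

list_a = sorted(range(1, 11), key=lambda p: track[p])    # destinations
list_b = sorted(range(11, 14), key=lambda p: track[p])   # bus routes
list_c = []                                              # assigned bus stops

print(list_a)   # [7, 6, 5, 9, 2, 10, 4, 1, 3, 8]
print(list_b)   # [11, 13, 12]
```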
Step 2: Route Planning and Execution
First bus route (List B, first value: 0.39):
Starting with the first bus route (value 0.39), the bus begins by visiting destination 7 (value 0.14). The walkable destinations within 5 km of destination 7 are checked; from the distance table, destination 9 (13 km) is within walking distance, so destination 9 is removed from List A and added to the walkable destinations for this route. Next, the bus proceeds to destination 6 (value 0.52); no destinations are within walking distance of it. The bus then proceeds to destination 5 (value 0.54), again with no further destinations within walking distance; destination 9 (value 0.55) is skipped as a bus stop because it has already been covered on foot. These four destinations (7, 6, 5, and 9) are removed from List A and added to List C, and route 0.39 is removed from List B. The updated lists are: List A contains [0.72 (2), 0.79 (10), 0.85 (4), 0.94 (1), 0.96 (3), 0.96 (8)], List B contains [0.49 (13), 0.75 (12)], and List C contains [7, 6, 5, 9].
Second bus route (List B, next value: 0.49):
Next, the bus proceeds with the second route (value 0.49), visiting destination 2 (value 0.72). The walkable destinations within 5 km of destination 2 are checked; from the distance table, destination 10 (13 km) is within walking distance, so destination 10 is removed from List A and added to the walkable destinations for this route. The bus then proceeds to destinations 4 (value 0.85) and 1 (value 0.94). These destinations (2, 10, 4, and 1) are removed from List A and added to List C, and route 0.49 is removed from List B. The updated lists are: List A contains [0.96 (3), 0.96 (8)], List B contains [0.75 (12)], and List C contains [7, 6, 5, 9, 2, 10, 4, 1].
Third bus route (List B, last value: 0.75):
For the final bus route (value 0.75), the bus visits the remaining destinations 3 and 8 (both with value 0.96). These destinations are removed from List A and added to List C, and route 0.75 is removed from List B. The updated lists are: List A is empty, List B is empty, and List C contains [7, 6, 5, 9, 2, 10, 4, 1, 3, 8].
Explanation of Results
In this example, the initial track was decoded by sorting and assigning bus routes to demand points based on the values in the track. Each destination was assigned to a bus stop in the order determined by the sorted values, ensuring that all routes and demand points were covered within the constraints of bus capacity, travel time, and fuel limits. By checking the walkable distances from each assigned bus stop, we ensured that any nearby destinations were efficiently allocated to minimize travel time and maximize resource use.
In this analysis, according to Table 3, Route 1 serves destination 7 (Wat Pa Nanachat) as a bus stop, with destination 9 (Ubon Ratchathani Zoo) covered as a walkable destination within 13 km. Route 2 serves destination 2 (Tung Sri Muang Temple), with destination 10 (Wat Ban Na Mueang) covered as a walkable destination within 13 km. Route 3 does not have any walkable destinations.
To calculate the traveling distance for each route and the total distance for all routes, we can use the provided distance table. For Route 1 (depot-7-6-5-depot), the distance from the depot to 7 is 22 km, from 7 to 6 is 14 km, from 6 to 5 is 24 km, and from 5 back to the depot is 37 km, resulting in a total distance of 97 km. For Route 2 (depot-2-4-1-depot), the distance from the depot to 2 is 37 km, from 2 to 4 is 15 km, from 4 to 1 is 11 km, and from 1 back to the depot is 22 km, resulting in a total distance of 85 km. For Route 3 (depot-3-8-depot), the distance from the depot to 3 is 31 km, from 3 to 8 is 21 km, and from 8 back to the depot is 25 km, resulting in a total distance of 77 km. Therefore, the total distance for all routes is 97 km + 85 km + 77 km = 259 km.
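The distance arithmetic above can be reproduced with a short snippet; the leg distances are exactly those quoted in the text.

```python
# Leg distances (km) used in the worked example
legs = {
    "Route 1": [("depot", 7, 22), (7, 6, 14), (6, 5, 24), (5, "depot", 37)],
    "Route 2": [("depot", 2, 37), (2, 4, 15), (4, 1, 11), (1, "depot", 22)],
    "Route 3": [("depot", 3, 31), (3, 8, 21), (8, "depot", 25)],
}
route_km = {name: sum(km for _, _, km in segs) for name, segs in legs.items()}
total_km = sum(route_km.values())
print(route_km)   # {'Route 1': 97, 'Route 2': 85, 'Route 3': 77}
print(total_km)   # 259
```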
The calculation of the objective functions follows. For Objective Function 1, which maximizes the Resilience Index, the resilience indices for the assigned bus stops are summed. The resilience indices for Route 1 (stops 7, 6, and 5) are 6.88, 7.80, and 5.02. For Route 2 (stops 2, 4, and 1), they are 7.60, 6.06, and 5.31. For Route 3 (stops 3 and 8), they are 7.94 and 5.42. Summing these values gives a Resilience Index of 52.03. For Objective Function 2, which maximizes the Sustainability Index, the sustainability indices for the assigned bus stops are summed. The sustainability indices for Route 1 are 6.76, 6.35, and 9.22. For Route 2, they are 4.89, 6.03, and 9.09. For Route 3, they are 5.51 and 9.19. Summing these values gives a Sustainability Index of 57.04.
For Objective Function 3, which maximizes Tourist Preferences, the preference ratings for the assigned bus stops are summed. The preference ratings for Route 1 are 3.66, 4.94, and 2.35. For Route 2, they are 3.94, 5.08, and 3.90. For Route 3, they are 3.92 and 4.55. Summing these values gives a Tourist Preference Index of 32.34. For Objective Function 4, which maximizes Accessibility, the demands for the assigned bus stops are summed. The demands for Route 1 are 705, 659, and 787. For Route 2, they are 889, 776, and 866. For Route 3, they are 702 and 655. Summing these values gives an Accessibility Index of 6039.
For Objective Function 5, which minimizes Total Travel Distance, the total distance traveled is multiplied by the fuel consumption factor (0.3 L/km). The total distance for all routes is 259 km. Therefore, the total fuel consumption is 259 km × 0.3 L/km = 77.7 L. In summary, the calculations yield the following objective function values: the Resilience Index is 52.03, the Sustainability Index is 57.04, the Tourist Preference Index is 32.34, the Accessibility Index is 6039, and the Total Travel Distance results in a fuel consumption of 77.7 L.
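The five objective values can likewise be reproduced from the per-stop figures quoted above. Note that, as in the text, only the assigned bus stops enter the sums (the walkable destinations 9 and 10 are not counted); the list and key names are illustrative.

```python
# Per-stop indices from the worked example, in List C bus-stop order:
# Route 1: 7, 6, 5 | Route 2: 2, 4, 1 | Route 3: 3, 8
resilience     = [6.88, 7.80, 5.02, 7.60, 6.06, 5.31, 7.94, 5.42]
sustainability = [6.76, 6.35, 9.22, 4.89, 6.03, 9.09, 5.51, 9.19]
preference     = [3.66, 4.94, 2.35, 3.94, 5.08, 3.90, 3.92, 4.55]
demand         = [705, 659, 787, 889, 776, 866, 702, 655]

total_km, fuel_per_km = 259, 0.3
objectives = {
    "Resilience Index":         round(sum(resilience), 2),      # 52.03
    "Sustainability Index":     round(sum(sustainability), 2),  # 57.04
    "Tourist Preference Index": round(sum(preference), 2),      # 32.34
    "Accessibility Index":      sum(demand),                    # 6039
    "Fuel consumption (L)":     round(total_km * fuel_per_km, 1),  # 77.7
}
print(objectives)
```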
3.3.2. Improve the Solution of the Tracks Using the Improvement Box (Black Box)
In this step, each track independently selects its preferred improvement box from the last iteration, irrespective of other tracks. There are five improvement boxes available, represented by Equations (17) to (21). Each equation is inspired by a different metaheuristic improvement procedure. Equations (17) and (18) are inspired by the operators of the Crested Porcupine Optimizer (CPO) [36]. Equation (19) is inspired by the Krill Herd Algorithm (KHA) [37]. Equation (20) is derived from the Salp Swarm Algorithm (SSA) [38]. Finally, Equation (21) is based on the Manta Ray Foraging Algorithm (MRFO) [39].
In these equations, X_best represents the best track so far, i.e., the track that has provided the best solution from the start of the simulation run up to iteration t − 1. Whenever a new best solution is found, the best track is updated accordingly. The terms X_r1, X_r2, and X_r3 are randomly selected tracks from the available NP tracks. The term X_ij^t denotes the value in position j of track i at iteration t.
In the context of the equations inspired by various metaheuristic algorithms, several key parameters are defined. The value of π is a constant, approximately equal to 3.14159. The parameter F is a scaling factor commonly used in differential evolution and other algorithms. For these equations, F is set to 0.8. Additionally, the parameters α and β serve as weights in the Krill Herd Algorithm, with specific values of α = 1.5 and β = 1.0.
Other important parameters include λ, the scaling parameter in Dynamic Levy Flight, set to 3, and Ω, the learning rate in reinforcement learning, set to 0.2. The crossover rate in Hybrid Differential Evolution, δ, is set to 0.8. Additionally, Ψ represents a random number in the range [0, 1], η is the quantum probability amplitude in the Quantum-inspired Evolutionary Algorithm, set to 0.7, and θ, the rotation angle in the Quantum-inspired Evolutionary Algorithm, is set to 2π².
These parameters are critical in balancing the influence of different components in the optimization process, ensuring a robust and efficient search for optimal solutions. By carefully tuning these parameters, we can manage the exploration and exploitation phases of the algorithms more effectively. Proper parameter settings allow the algorithms to explore the solution space thoroughly while converging on high-quality solutions, thus enhancing the overall performance and reliability of the optimization process.
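For reference, the parameter values stated above can be collected into a single configuration. The dictionary below is only an illustrative grouping of those stated values; the key names are ours, not identifiers used by the algorithm.

```python
# Parameter values as stated in the text (illustrative configuration)
PARAMS = {
    "F": 0.8,        # scaling factor (differential-evolution style operators)
    "alpha": 1.5,    # weight in the Krill Herd Algorithm
    "beta": 1.0,     # weight in the Krill Herd Algorithm
    "lambda": 3,     # scaling parameter in Dynamic Levy Flight
    "omega": 0.2,    # learning rate in reinforcement learning
    "delta": 0.8,    # crossover rate in Hybrid Differential Evolution
    "eta": 0.7,      # quantum probability amplitude (QEA)
}
```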
The Crested Porcupine Optimizer incorporates defensive mechanisms such as sight, sound, and physical attacks; these mechanisms are adapted here to influence how the track values are updated during the search.
The Krill Herd Algorithm simulates the herding behavior of krill, incorporating movement influenced by local and global factors.
The Salp Swarm Algorithm mimics the chain foraging behavior of salps in the ocean, where the leading salp guides the swarm.
The MRFO algorithm models the foraging behaviors of manta rays, including chain foraging and cyclone foraging.
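The exact Equations (17) to (21) are given in the paper and are not reproduced here. Purely to illustrate how an improvement box maps a current track to a new one, the sketch below uses simplified, generic update rules in the spirit of the four algorithms; the functional forms are assumptions, not the paper's equations, and only the clipping to [0, 1] reflects the track encoding of Section 3.3.1.

```python
import math
import random

def clip01(x):
    """Keep track values inside the [0, 1] range used by the encoding."""
    return min(1.0, max(0.0, x))

def cpo_like(track, best, r1, r2, F=0.8):
    """Defensive-mechanism style move toward the best track (CPO-inspired sketch)."""
    return [clip01(x + F * random.random() * (b - x) + F * random.random() * (a - c))
            for x, b, a, c in zip(track, best, r1, r2)]

def kha_like(track, best, r1, alpha=1.5, beta=1.0):
    """Movement driven by global (best) and local (neighbour) attraction (KHA-inspired)."""
    return [clip01(x + alpha * random.random() * (b - x) + beta * random.random() * (a - x))
            for x, b, a in zip(track, best, r1)]

def ssa_like(track, best):
    """Leader-follower chain update around the best track (SSA-inspired sketch)."""
    c1 = 2 * math.exp(-random.random() ** 2)
    return [clip01(0.5 * (x + b) + c1 * (random.random() - 0.5))
            for x, b in zip(track, best)]

def mrfo_like(track, best, r1):
    """Chain/cyclone foraging style spiral move toward the best track (MRFO-inspired)."""
    return [clip01(x + random.random() * (b - x) +
                   math.cos(2 * math.pi * random.random()) * (b - a))
            for x, b, a in zip(track, best, r1)]
```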
Equations (17) to (26) utilize improvement procedures inspired by these metaheuristic methods to iteratively change the track values and thereby explore different areas of the search space effectively. Each equation integrates unique strategies from the respective algorithms to enhance the exploration and exploitation capabilities during optimization. To integrate the reinforcement learning concept into the probability equation for selecting each improvement box, the modified equation can be written as Equation (27). Define b as the index of the improvement box, ranging from 1 to B, where B is the maximum number of improvement boxes.
Let Q_b(t − 1) be the Q-value representing the expected reward for selecting improvement box (IB) b at iteration t − 1, and let γ be a learning rate that controls the influence of the Q-value in the selection process. The original equation uses three factors to influence the selection of improvement boxes. Historical selection frequency: this factor represents how often IB b has been selected in the past; a higher frequency indicates a preference for IB b, suggesting its effectiveness in previous iterations. (To clarify, an improvement box refers to a method used to enhance the solution of the tracks.) Efficiency: this factor is derived from the average objective function value obtained using IB b, so that a better average value corresponds to higher efficiency and makes this IB more desirable. Reward value: this factor counts how many times IB b has discovered a new best solution from iteration 1 up to the current iteration; it increases by 1 whenever IB b finds a new best solution and otherwise remains unchanged. More instances of discovering better solutions indicate better performance, thereby increasing the probability of selecting this IB.
Incorporating reinforcement learning adds the Q-value Q_b(t − 1), which captures the expected future rewards of selecting an IB. This value is learned over time based on the performance of the IB, allowing the algorithm to adaptively favor IBs that have not only performed well historically but are also expected to perform well in the future. The learning rate γ determines how strongly the Q-value influences the probability, balancing historical performance against expected future rewards. After calculating the probabilities for each improvement box using the modified equation, the roulette wheel selection procedure is employed to select the improvement box for each track in the current iteration. This method ensures that the selection process is guided by both historical performance and the learned expected rewards, enhancing the algorithm's overall optimization capability.
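Equation (27) itself is not reproduced here. The sketch below shows one plausible way to combine the three factors and the Q-value into selection probabilities and then apply roulette wheel selection; the linear combination, the scaling of the inputs, and the example numbers are assumptions for illustration only.

```python
import random

def selection_probabilities(freq, eff, reward, q, F1=10, F2=10, F3=10, gamma=0.2):
    """Probability of choosing each improvement box b, combining historical
    selection frequency, efficiency, reward count, and the learned Q-value
    (one plausible reading of Equation (27); the exact form is in the paper)."""
    scores = [F1 * n + F2 * e + F3 * r + gamma * qv
              for n, e, r, qv in zip(freq, eff, reward, q)]
    total = sum(scores) or 1.0          # guard against an all-zero wheel
    return [s / total for s in scores]

def roulette_wheel(probs):
    """Pick an improvement box index according to the given probabilities."""
    u, acc = random.random(), 0.0
    for b, p in enumerate(probs):
        acc += p
        if u <= acc:
            return b
    return len(probs) - 1

# Example with five improvement boxes (illustrative numbers only)
probs = selection_probabilities(freq=[12, 8, 10, 6, 9],
                                eff=[0.9, 1.1, 0.8, 1.2, 1.0],
                                reward=[3, 1, 2, 0, 2],
                                q=[0.4, 0.2, 0.5, 0.1, 0.3])
chosen_box = roulette_wheel(probs)
```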
To properly set the values for γ, F1, F2 and F3, we can use common practices in metaheuristic optimization and reinforcement learning. The learning rate γ controls how much new information overrides old information in reinforcement learning. According to Pitakaso et al. [40] and Nanthasamroeng et al. [41], a typical value for γ is between 0.1 and 0.3. For our scenario, γ = 0.2 is a good starting point, balancing the rate of learning from new experiences without being too sensitive to recent changes.
The scaling factors F1, F2 and F3 weigh the importance of historical selection frequency, efficiency, and optimal solution discovery, respectively. These can be set based on preliminary experiments to balance their influence. Commonly, an initial balanced approach uses equal values, such as F1 = 10, F2 = 10, and F3 = 10. These values ensure that frequently selected boxes, efficient boxes, and those that have historically found better solutions are all given appropriate consideration. These initial parameter values can be fine-tuned through systematic experimentation, adjusting them incrementally, and evaluating their impact on optimization results to achieve optimal performance for specific problems.
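The update rule for the Q-values is not spelled out in this passage; a common choice consistent with γ = 0.2 is an exponential moving average, sketched below as an assumption rather than the paper's exact rule.

```python
def update_q_value(q_old, reward, gamma=0.2):
    """Incremental Q-value update (assumed exponential-moving-average form):
    gamma = 0.2 balances new rewards against accumulated experience."""
    return q_old + gamma * (reward - q_old)

# Example: an improvement box that just found a new best solution (reward = 1)
q = 0.35
q = update_q_value(q, reward=1.0)   # 0.35 + 0.2 * (1.0 - 0.35) = 0.48
```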
The objective function used in this section is derived from the five objective functions explained in Section 3.1 and can be calculated using Equation (28), where the weights μ1 to μ5 are obtained from values randomly selected in the range 0 to 1 (real numbers) and each μl is calculated using Equation (29), where l is the index of μ and ranges from 1 to 5.
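Equations (28) and (29) are not reproduced here. A minimal sketch of the described weighting scheme follows, assuming that the five random draws are normalized into the weights μl (Equation (29)) and combined linearly with the objective values (Equation (28)); the orientation of the minimized distance objective in the example is also an assumption.

```python
import random

def scalarized_objective(objectives):
    """Combine the five objective values into one scalar using randomly drawn,
    normalized weights (a plausible reading of Equations (28) and (29))."""
    draws = [random.random() for _ in range(5)]        # random values in [0, 1]
    mu = [d / sum(draws) for d in draws]               # Equation (29)-style normalization
    return sum(m * z for m, z in zip(mu, objectives))  # Equation (28)-style weighted sum

# Example with the objective values from the worked example in Section 3.3.1
# (the fuel term is negated here so that larger scalar values are better)
value = scalarized_objective([52.03, 57.04, 32.34, 6039, -77.7])
```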
Research on parameter settings in metaheuristic optimization and reinforcement learning reveals several key insights. Li et al. [42] emphasized the dynamic adjustment of learning rate parameters (γ, Ω) to enhance deep reinforcement learning performance. Silva-Rodriguez and Li [43] explored decentralized approaches for distributed optimization problems. Zhang et al. [44] discussed parameter-based exploration methods, such as scaling factors (F, α, β), to balance exploration and exploitation. Tessari and Iacca [45] highlighted the combination of evolutionary algorithms with adaptive heuristic critic methods for continuous state and action spaces in reinforcement learning, focusing on parameters such as λ (Levy flight scaling), δ (crossover rate in DE), η (quantum probability amplitude), and θ (rotation angle in QEA). Additionally, Nanthasamroeng et al. [41] and Pitakaso et al. [40] suggest optimal parameter settings for VaNSAS, including F1, F2, and F3. A summary of these parameter settings is shown in Table 4.