Article

Solving the Vehicle Routing Problem with Stochastic Travel Cost Using Deep Reinforcement Learning

College of Civil and Transportation Engineering, Hohai University, Xikang Road, Nanjing 210024, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3242; https://doi.org/10.3390/electronics13163242
Submission received: 3 July 2024 / Revised: 2 August 2024 / Accepted: 14 August 2024 / Published: 15 August 2024

Abstract

The Vehicle Routing Problem (VRP) is a classic combinatorial optimization problem commonly encountered in the fields of transportation and logistics. This paper focuses on a variant of the VRP, namely the Vehicle Routing Problem with Stochastic Travel Cost (VRP-STC). In VRP-STC, the introduction of stochastic travel costs increases the complexity of the problem, rendering traditional algorithms unsuitable for solving it. In this paper, the GAT-AM model combining Graph Attention Networks (GAT) and multi-head Attention Mechanism (AM) is employed. The GAT-AM model uses an encoder–decoder architecture and employs a deep reinforcement learning algorithm. The GAT in the encoder learns feature representations of nodes in different subspaces, while the decoder uses multi-head AM to construct policies through both greedy and sampling decoding methods. This increases solution diversity, thereby finding high-quality solutions. The REINFORCE with Rollout Baseline algorithm is used to train the learnable parameters within the neural network. Test results show that the advantages of GAT-AM become greater as problem complexity increases, with the optimal solution generally unattainable through traditional algorithms within an acceptable timeframe.

1. Introduction

Combinatorial Optimization (CO) problems constitute a classical and intricate category of decision-making problems. These problems involve selecting a subset of components from a limited set of choices, with the overarching objective of maximizing or minimizing a predetermined objective function. Over recent years, the significant ascendancy of deep learning (DL) has markedly propelled progress in conventional domains such as image classification, speech recognition, and computer vision. This advancement can be ascribed to the profound adaptability in feature acquisition demonstrated by DL models. These models exhibit the capacity to autonomously glean intricate abstract feature representations from raw data, thus endowing them with adaptability across a myriad of complex tasks. The end-to-end learning paradigm inherent in DL engenders a streamlined trajectory for model formulation. By directly linking inputs to outputs, it obviates the necessity for elaborate feature engineering processes, thereby amplifying the efficiency and precision of data manipulation. In recent years, the ability of DL systems to tackle CO problems has been extensively demonstrated across various applications [1,2]. Despite the significant potential of DL methods in CO problems, their application faces several limitations. These models typically demand a substantial amount of labeled data for training, which is often difficult to acquire. They also require substantial computational resources, including high-performance GPUs and extensive memory capacities, leading to increased computational costs. Moreover, the generalization ability of DL models can be unstable with unseen data, affecting the reliability of real-world solutions. Additionally, the complexity of these CO problems often necessitates specific model architectures and hyperparameter tuning, complicating design and implementation [2]. DL methods are prone to getting trapped in local optima rather than finding global optima, leading to suboptimal solutions [3]. Lastly, the training duration can be lengthy, especially for complex problems, delaying practical application. Hence, it is crucial to evaluate these advantages and limitations comprehensively and consider integrating other methodologies to enhance solution effectiveness. Based on the characteristics of CO problems, reinforcement learning (RL) [4] is well suited for them because it does not require labeled data as supervised learning does. Furthermore, the performance of deep reinforcement learning (DRL) in discrete sequential decision-making problems such as AlphaGo Zero [5] and Atari [6] demonstrates its powerful learning and optimization capabilities. DRL has been employed to address CO problems such as the Traveling Salesman Problem (TSP), the Vehicle Routing Problem (VRP), and the Facility Location Problem (FLP). Encouraging outcomes have been achieved with regard to computational efficiency and generalization. While certain conventional heuristic approaches excel at resolving particular challenges, they frequently entail considerable time expenditure when tackling large-scale instances [7]. Conversely, with a trained neural network, rapid results can be generated, thereby alleviating a substantial portion of the time burden.
DRL is an approach that integrates DL and RL to tackle intractable problems. Deep Q-learning Networks [6] mark the initial revelation of DRL, which expands upon Q-learning [8] to better address high-dimensional state spaces and complex environments. By amalgamating various classical RL algorithms such as REINFORCE [9], Actor-Critic [10], and DPG [11], it has led to the emergence of many renowned DRL algorithms, including TRPO [12], DDPG [13], PPO [14], A3C [15], and others. These algorithms find application across numerous domains, including robotics [16], finance [17], recommendation systems [18], and various decision-making contexts. Another noteworthy accomplishment is observed in works such as AlphaGo [19], AlphaZero [20], and MuZero [21], which represent a series of achievements in mastering board games and electronic games.
In the realm of using DL to address CO problems, Vinyals et al. [22] were among the first to propose Pointer Networks (Ptr-Net) for this purpose. Their method utilized a sequence-to-sequence Ptr-Net supervised learning framework to tackle the TSP. Due to its reliance on supervised learning, it necessitated the creation of large labeled datasets beforehand, thereby limiting its application to larger-scale CO problems. RL has since been employed to train deep neural networks for addressing CO problems [23,24,25,26]. Inspired by the Transformer model, Kool et al. [27] proposed an attention model as a solution for multiple CO problems. It is noteworthy that their method not only surpasses all previous Ptr-Net models in performance but also closely approximates the optimal solutions achieved by solvers such as Gurobi. Lu et al. [23] introduced a framework that utilizes RL to select among various classical improvement and perturbation operators, effectively addressing the VRP with capacity constraints. Moreover, for graph-based CO problems, more sophisticated models, such as Graph Neural Networks (GNNs), have been utilized to address these challenges [24,27,28]. GNNs adopt a more global perspective by representing CO problems as graphs. By learning relationships between nodes, GNNs can automatically extract and leverage features within the spatial layout of cities. For instance, Li et al. [29] utilized Graph Convolutional Networks (GCNs) to address graph-based CO problems such as the Minimum Vertex Cover. Their method directly predicts the selection probabilities of all candidate nodes at once, rather than iteratively generating solution nodes through a step-by-step construction approach. Drori et al. [30] employed GAT to address various graph-based CO problems. They claimed that their model possesses the capability to generalize from training on random points to real city scenarios. Lodi et al. [31] first utilized machine learning methods to predict the percentage of facilities in the current solution that require adjustment. Subsequently, they integrated linear constraints associated with this percentage into a mixed-integer programming model and utilized classical solvers to derive solutions. Nevertheless, their work relies primarily on supervised learning, leveraging the robust capabilities of classical solvers to achieve significant benefits.
The inherent randomness and unpredictability of road traffic conditions call for the application of fuzzy logic concepts in urban logistics. In practical systems, travel time varies due to numerous factors like traffic congestion, weather, and road conditions, which are intrinsically challenging to quantify and measure precisely. However, these variables can be effectively described using linguistic terms grounded in human judgment, such as “good” or “bad”. In the study conducted by Xidias et al. [32], the Mamdani method was employed due to its widespread use in fuzzy logic systems. The Mamdani-type fuzzy inference system in that work utilizes two input variables—actual distance and traffic conditions—and produces one output variable, virtual distance. The actual distance reflects the length of the path between successive request points, derived from a map image. Road traffic conditions reflect the current state of the road and are closely linked to the traffic flow of autonomous vehicles. For example, traffic congestion, characterized by slower speeds, longer travel times, and increased vehicle queues, can be described as poor. The virtual distance is derived by adjusting the actual distance to account for the prevailing traffic conditions. Consequently, the virtual distance is typically greater than the actual distance, depending on the severity of the traffic conditions. Luo et al. [33] represented the traffic congestion factor using speed profiles for different departure times. The travel time function in their study is a stepwise linear continuous function, following a first-come-first-served rule. Undoubtedly, speeds vary between normal and peak hours, impacting the required travel time. This approach provides a robust method for studying the randomness of traffic conditions.
In this paper, our focus lies on the study of the Vehicle Routing Problem with Stochastic Travel Cost (VRP-STC). This problem introduces uncertainty factors into the classical VRP, thereby adding complexity to the problem. In this problem setting, vehicles are constrained not only by their capacity limits but also by the fact that the travel cost between each pair of nodes is not fixed but rather a stochastic variable. This reflects real-life traffic conditions, such as congestion, accidents, or weather-related uncertainties. Like most CO problems, the VRP-STC is NP-hard. Consequently, numerous heuristic approaches have been proposed for addressing this problem, with the majority relying on mathematical programming or metaheuristic techniques [7,34]. We propose a learnable decision-making approach wherein the current partial solution is dynamically updated based on a DRL model. This approach aims to generate feasible solutions that adhere to problem constraints. In detail, we employ an enhanced GAT to handle instances of the VRP-STC and obtain node embeddings and graph embeddings. Subsequently, we utilize a multi-head AM to guide the sequential construction of feasible solutions. We will refer to this model as the GAT-AM model. Finally, we utilize the REINFORCE with Rollout Baseline algorithm [27] to train the learnable parameters within the neural network. The main contributions of this paper are as follows. (1) We investigate the VRP with an emphasis on traffic congestion, incorporating limiting distributions to address this issue. (2) We propose a novel encoder using GAT with added residual connections for improved model performance. (3) The results of extensive experiments validate and highlight the effectiveness of the proposed GAT-AM model.

2. Preliminary

2.1. VRP-STC

The VRP-STC represents a sophisticated extension of the classical VRP paradigm by integrating stochastic elements into the calculation of travel costs. In this problem setting, vehicles are not only subject to capacity constraints but also encounter varying travel costs between each pair of nodes, characterized by stochastic variables. This dynamic reflects real-world traffic conditions, encompassing uncertainties such as congestion, accidents, or weather-related factors.
Let C = {0, 1, …, n} denote the set of nodes, where node 0 represents the depot; V = {1, 2, …, m} is the set of vehicles; Q is the maximum capacity of each vehicle; q_i is the demand of customer i (i = 1, …, n). C_ij represents the generalized cost of traveling from node i to node j, which is a stochastic variable in this context (in the classical VRP, C_ij is fixed and known in advance); x_ij^v is the binary decision variable indicating whether vehicle v travels directly from node i to node j (1 if yes, and 0 otherwise).
The aim is to minimize the total expected delivery cost, accounting for the stochastic nature of the travel costs. The objective function is formulated as follows:
\min \; \mathbb{E}\left[\sum_{v=1}^{m}\sum_{i=0}^{n}\sum_{j=0,\, j\neq i}^{n} x_{ij}^{v}\, C_{ij}\right] \qquad (1)
Subject to
\sum_{i=1}^{n} q_i\, x_{ij}^{v} \le Q, \quad \forall v \in V,\ \forall j \in C \qquad (2)

\sum_{v=1}^{m}\sum_{j=0,\, j\neq i}^{n} x_{ij}^{v} = 1, \quad \forall i \in C \setminus \{0\} \qquad (3)

\sum_{j=1}^{n} x_{0j}^{v} = \sum_{i=1}^{n} x_{i0}^{v} = 1, \quad \forall v \in V \qquad (4)

\sum_{i \in C,\, i \neq j} x_{ij}^{v} - \sum_{h \in C,\, h \neq j} x_{jh}^{v} = 0, \quad \forall v \in V,\ \forall j \in C \setminus \{0\} \qquad (5)
Constraint (2) ensures that each vehicle’s total payload does not exceed Q. Constraint (3) specifies that each customer is visited exactly once by exactly one vehicle. Constraint (4) requires each vehicle to depart from the depot and return to it after servicing all assigned customers. Constraint (5) enforces flow conservation: for each vehicle and each customer node, the vehicle leaves the node as many times as it enters it.
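To make the formulation concrete, the following minimal Python sketch evaluates a candidate set of routes against constraints (2)–(5) and computes the expected objective (1). The routing-plan representation, the helper name `evaluate_solution`, and the toy instance are illustrative assumptions, not part of the paper’s implementation.

```python
import numpy as np

def evaluate_solution(routes, demands, Q, expected_cost):
    """Check constraints (2)-(5) for a routing plan and return the expected
    objective (1).  `routes` is a list of node-index lists, one per vehicle,
    each starting and ending at the depot (node 0); `expected_cost[i, j]`
    stands for E[C_ij].  Flow conservation (5) holds by construction when
    routes are stored as consecutive node sequences."""
    n = len(demands) - 1                                   # customers 1..n; demands[0] = 0
    visited, total = [], 0.0
    for route in routes:
        assert route[0] == 0 and route[-1] == 0            # constraint (4): start/end at depot
        assert sum(demands[i] for i in route) <= Q         # constraint (2): capacity limit
        visited += [i for i in route if i != 0]
        total += sum(expected_cost[a, b] for a, b in zip(route[:-1], route[1:]))
    assert sorted(visited) == list(range(1, n + 1))        # constraint (3): each customer once
    return total                                           # objective (1), by linearity of expectation

# toy instance: depot + 4 customers, two vehicles of capacity 10
rng = np.random.default_rng(0)
coords = rng.random((5, 2))
cost = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
demands = np.array([0, 4, 3, 3, 5])
print(evaluate_solution([[0, 1, 3, 0], [0, 2, 4, 0]], demands, 10, cost))
```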

2.2. Reinforcement Learning Framework

RL is a methodology tailored for addressing sequential decision-making problems. Through systematic exploration of the environment and iterative trial and error, the agent learns to interact with the environment so as to maximize long-term rewards. As is well known, RL has five key components, as portrayed in [4]: S: State space, A: Action space, p(s′, r | s, a): State transition function, r(s, a): Reward function, γ: Discount factor. In the context of CO problems, the “state” denotes the status of an ongoing solution, which can be either a partial solution within the construction framework or a feasible solution within the improvement framework. The “action” pertains to selecting an element from the candidate set to augment the partial solution in the construction framework, or applying an operator to modify the current feasible solution in the improvement framework. When addressing CO problems within the construction framework, the state transition function is deterministic. The “reward” typically measures the difference in the objective function between two states, thus serving as an indicator of the improvement achieved by a specific action. Within the construction framework, this setup corresponds to a finite-horizon task in the Markov Decision Process (MDP) framework [35], indicating that interactions with the environment will eventually terminate.

3. Method

3.1. Formulation of DRL

We set the discount factor γ = 1. MDPs provide the mathematical framework essential for reinforcement learning problems. They enable agents to make decisions and adapt their learning in the face of uncertainty. The specific forms of the MDP elements are as follows:
  • State: The state st = (Rt, Ct) represents the partial solution of an instance G(Q, q, R) at time step t. Here, Rt (for t ≠ 0) is the set of customers already receiving service, i.e., all customer nodes chosen up to step t; Ct is the set of candidate nodes at step t; Q is the maximum capacity of each vehicle; and q is the vector of customer demands.
  • Action: Action at indicates that at step t, the candidate node πt is chosen from the candidate node set Ct and added to the service customer set Rt.
  • Transition: With the action at, a new partial solution is obtained as the following state, i.e., st+1 = (Rt+1, Ct+1). Within the updated state, Rt+1 includes πt in addition to the nodes chosen so far, whereas Ct+1 consists of the candidate nodes from Ct with πt removed.
  • Reward: To minimize the total cost, we define the value of the objective function at step t as Objt = min E(cost), and the reward at step t as rt = Objt−1 − Objt.
  • Policy: The policy pθ is parameterized by θ within the GAT-AM model. At each step t, a candidate node is automatically chosen as the next service customer node until all customer nodes are chosen, resulting in the final solution π = {π1, π2, …, πn} generated by the policy (a minimal sketch of this construction loop follows the list).
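The construction loop implied by these MDP elements can be outlined as follows. Here `policy.select`, `objective`, and `instance.n` are hypothetical placeholders (standing in for the GAT-AM model, Equation (1), and the number of customers), so this is a schematic sketch rather than the actual implementation.

```python
def construct_solution(instance, policy, objective):
    """One episode of the construction MDP: at each step the policy picks a
    candidate node, the state (R_t, C_t) is updated, and the reward is the
    decrease in the objective value."""
    R = []                                       # R_t: customers already served
    C = set(range(1, instance.n + 1))            # C_t: remaining candidate nodes
    rewards = []
    prev_obj = objective(R, instance)
    while C:                                     # finite-horizon episode: ends when C_t is empty
        pi_t = policy.select(R, C, instance)     # action a_t: choose a node from C_t
        R.append(pi_t)                           # transition: R_{t+1} = R_t ∪ {π_t}
        C.remove(pi_t)                           #             C_{t+1} = C_t \ {π_t}
        obj = objective(R, instance)
        rewards.append(prev_obj - obj)           # reward r_t = Obj_{t-1} - Obj_t
        prev_obj = obj
    return R, rewards
```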

3.2. Model

In this section, we introduce our GAT-AM model, which consists of two components: an encoder and a decoder. The encoder utilizes graph attention mechanisms, leveraging graph neural networks to parameterize and learn node representations. The decoding method is based on the learned features. We devise a multi-head AM-based customer allocation strategy and adapt it to an RL framework. The entire model structure flowchart is depicted in Figure 1.

3.3. Encoder

In GAT [36], node features serve as the input to each layer of the network, x = {x_1, x_2, …, x_N}, x_i ∈ ℝ^F, where N is the number of nodes and F is the number of features per node. Feature extraction is conducted to obtain new features x′ = {x′_1, x′_2, …, x′_N}, x′_i ∈ ℝ^{F′} (F and F′ are not necessarily of the same size) for each node, where the original input is transformed by a learnable weight matrix W ∈ ℝ^{F′ × F}. At each node, a self-attention mechanism is performed using a shared Attention Mechanism (AM) a: ℝ^{F′} × ℝ^{F′} → ℝ to compute initial attention coefficients e_ij. Here, e_ij represents the importance of node j with respect to node i, j ∈ N_i (N_i denotes the set of neighboring nodes of node i). Subsequently, we employ the softmax function to normalize them and obtain α_ij for ease of computation. The formulas are as follows:
e_{ij} = a\left(W x_i,\ W x_j\right) \qquad (6)
\alpha_{ij} = \mathrm{softmax}_j\left(e_{ij}\right) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})} \qquad (7)
The AM a typically consists of a single-layer feedforward neural network parameterized by a weight vector a ∈ ℝ^{2F′}, employing LeakyReLU for non-linear activation. In the following, T denotes the transpose operation and ‖ represents the concatenation operation. Thus, the attention coefficients for a single head (as illustrated in Figure 2a) can be represented as:
\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[W x_i \,\|\, W x_j\right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[W x_i \,\|\, W x_k\right]\right)\right)} \qquad (8)
Once the activated attention coefficients are obtained, we use these coefficients in conjunction with the corresponding node features to calculate the updated features for each node, which serve as the final output of each layer:
x_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\, W x_j\right) \qquad (9)
We perform a Skip Connection [37] by adding the features obtained through the single-head AM to the original features, integrating the information:
\tilde{x}_i = x_i' + x_i \qquad (10)
At this stage, we acquire features after the single-head AM and Skip Connection. Subsequently, we proceed with the computation of the multi-head AM to enhance stability and efficiency. The multi-head AM entails executing AM independently for each head, followed by concatenating the features obtained from each head. The computational formula is expressed as follows:
x_i'' = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k}\, W^{k} x_j\right) + \tilde{x}_i \qquad (11)
where ‖ denotes the concatenation operation, α_ij^k represents the normalized attention coefficients obtained from the k-th head, and W^k is the weight matrix corresponding to the linear transformation of the k-th head. The aggregation process of a multi-head graph attention layer is illustrated in Figure 2b. Simultaneously, we compute the mean of all node features as the aggregated feature x̄ of the output graph:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i'' \qquad (12)
Thus, the node features generated by the encoder, along with the aggregated graph features, will be fed into the decoder, as illustrated in Figure 3.
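A simplified sketch of one encoder layer is given below. PyTorch is assumed as the framework; the embedding dimension of 128, the fully connected neighborhood (every node attends to all others), and averaging the heads instead of concatenating them are simplifications made for brevity, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """One multi-head graph attention layer with a skip connection,
    loosely following Equations (6)-(11)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads, self.dim = heads, dim
        self.W = nn.Linear(dim, dim * heads, bias=False)     # W^k for all heads, stacked
        self.a = nn.Parameter(torch.empty(heads, 2 * dim))   # attention vector a per head
        nn.init.xavier_uniform_(self.a)

    def forward(self, x):                                    # x: (batch, n, dim)
        b, n, _ = x.shape
        h = self.W(x).view(b, n, self.heads, self.dim)       # W x_i for every head
        # e_ij = LeakyReLU(a^T [W x_i || W x_j]), split into a source and a target term
        src = torch.einsum('bnhd,hd->bnh', h, self.a[:, :self.dim])
        dst = torch.einsum('bnhd,hd->bnh', h, self.a[:, self.dim:])
        e = F.leaky_relu(src.unsqueeze(2) + dst.unsqueeze(1))   # (b, n, n, heads)
        alpha = torch.softmax(e, dim=2)                         # Eq. (7)/(8): normalize over j
        out = torch.einsum('bijh,bjhd->bihd', alpha, h)         # Eq. (9): weighted neighbor sum
        out = out.mean(dim=2)                                   # combine heads (concat in the paper)
        return out + x                                          # skip connection, Eq. (10)/(11)

# node features -> 3 GAT layers -> node embeddings and graph embedding x̄ (Eq. (12))
layers = nn.Sequential(*[GATLayer(128) for _ in range(3)])
x = torch.rand(2, 21, 128)                # batch of 2 instances, depot + 20 customers
node_emb = layers(x)
graph_emb = node_emb.mean(dim=1)
```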

3.4. Decoder

The decoding process in this paper follows the multi-head AM employed in the Transformer [38]. Decoding occurs sequentially. At time step t ∈ {1, …, T}, the decoder generates the node πt selected at the current time step by leveraging both the node embeddings from the encoder and the nodes πt′ selected at the preceding time steps t′ < t. Simultaneously, a designated contextual node (c) extends the graph’s information, representing the contextual connections for decoding. The decoder employs an attention sub-layer on top of the encoder to refine the information of the contextual node. Finally, using a self-attention mechanism, it computes the probability of selecting each node and iteratively determines the final path. Figure 4 illustrates the decoding process.
To respect the capacity constraints, we monitor the remaining demand δ̂_{i,t} for each node i ∈ {1, …, n} and the remaining vehicle capacity D̂_t at time step t. At t = 1, we initialize δ̂_{i,1} = δ̂_i and D̂_1 = 1. Subsequently, we update them as follows (where πt denotes the index of the node selected at decoding step t):
\hat{\delta}_{i,t+1} = \begin{cases} \max\left(0,\ \hat{\delta}_{i,t} - \hat{D}_t\right) & \text{if } \pi_t = i \\ \hat{\delta}_{i,t} & \text{if } \pi_t \neq i \end{cases} \qquad (13)

\hat{D}_{t+1} = \begin{cases} \max\left(\hat{D}_t - \hat{\delta}_{\pi_t,t},\ 0\right) & \text{if } \pi_t \neq 0 \\ 1 & \text{if } \pi_t = 0 \end{cases} \qquad (14)
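These two updates can be written in a few lines of Python; the dictionary-based state and the function name `update_state` are illustrative assumptions.

```python
def update_state(delta, D, pi_t):
    """Apply Equations (13) and (14) once: `delta` maps each customer node to
    its remaining (normalized) demand, `D` is the remaining vehicle capacity,
    and `pi_t` is the node selected at step t (node 0 is the depot)."""
    if pi_t == 0:
        return delta, 1.0                 # Eq. (14): visiting the depot restores full capacity
    old = delta[pi_t]
    delta[pi_t] = max(0.0, old - D)       # Eq. (13): demand still unmet after this visit
    D = max(D - old, 0.0)                 # Eq. (14): capacity left after serving node pi_t
    return delta, D
```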
Context embedding: At time step t, the context of the decoder comprises the last-visited node πt−1 and the remaining capacity D̂_t. When t = 1, the route starts at the depot, so the depot embedding is used in place of a previously selected node:
x_{(c)}^{(L)} = \begin{cases} \left[\bar{x}^{(L)},\ x_{\pi_{t-1}}^{(L)},\ \hat{D}_t\right] & t > 1 \\ \left[\bar{x}^{(L)},\ x_{0}^{(L)},\ \hat{D}_t\right] & t = 1 \end{cases} \qquad (15)
where [·, ·, ·] denotes the horizontal concatenation operator. The embedding of the contextual node is denoted as x_{(c)}^{(L)} (with a dimension of 3·d_x), where the superscript (L) aligns with the node embedding x_i^{(L)} (indicating the embedding of the L-th layer).
The updated embedding of the contextual node x_{(c)}^{(L+1)} is computed with the multi-head AM, where the Keys are derived from the node embeddings x_i^{(L)} and the Queries (for each head) come from the embedding of the contextual node x_{(c)}^{(L)}. For convenience, the superscript (L) is omitted hereafter:
q_{(c)} = W^{Q} x_{(c)}, \qquad k_i = W^{K} x_i \qquad (16)
Compute the relevance between the contextual node and all other nodes based on the Keys and Queries:
u_{(c)j} = \begin{cases} \dfrac{q_{(c)}^{T} k_j}{\sqrt{d_k}} & \text{if } j \neq \pi_{t'},\ \forall t' < t \\ -\infty & \text{otherwise} \end{cases} \qquad (17)
Here, d_k = d_x / M, where d_k is the dimensionality of the query/key vectors and M is the number of attention heads.
Calculate probabilities: In the last layer of the decoder, generate the probability of selecting each node. Utilize a single-head AM (M = 1) in this layer. Unlike the earlier updated contextual node embeddings, we solely employ Formula (18) to calculate node correlations, where we utilize tanh to clip the results within the range of [−C, C] (C = 10):
u_{(c)j} = \begin{cases} C \cdot \tanh\!\left(\dfrac{q_{(c)}^{T} k_j}{\sqrt{d_k}}\right) & \text{if } j \neq \pi_{t'},\ \forall t' < t \\ -\infty & \text{otherwise} \end{cases} \qquad (18)
The obtained correlation coefficients are referred to as unnormalized log-probabilities (logits). Finally, employ softmax to calculate the output probability Vector p:
p_i = p_\theta\left(\pi_t = i \mid s, \pi_{1:t-1}\right) = \frac{e^{u_{(c)i}}}{\sum_{j} e^{u_{(c)j}}} \qquad (19)
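The final decoder step, Equations (16)–(19), can be sketched as follows. The weight shapes, the boolean `visited_mask`, and the function name are assumptions used only to illustrate the tanh-clipped, masked single-head attention (PyTorch assumed).

```python
import torch

def output_probabilities(node_emb, context_emb, visited_mask, Wq, Wk, C=10.0):
    """node_emb: (n, d) final node embeddings; context_emb: (3d,) context node;
    visited_mask: (n,) bool, True for nodes already visited; Wq: (d, 3d), Wk: (d, d)."""
    q = Wq @ context_emb                               # q_(c) = W^Q x_(c)       (Eq. 16)
    k = node_emb @ Wk.T                                # k_i   = W^K x_i         (Eq. 16)
    d_k = k.shape[-1]
    u = C * torch.tanh(k @ q / d_k ** 0.5)             # clipped compatibilities (Eq. 18)
    u = u.masked_fill(visited_mask, float('-inf'))     # mask already-visited nodes
    return torch.softmax(u, dim=-1)                    # selection probabilities (Eq. 19)

# hypothetical usage with d = 128 and a depot plus 20 customers
d, n = 128, 21
p = output_probabilities(torch.rand(n, d), torch.rand(3 * d),
                         torch.zeros(n, dtype=torch.bool),
                         torch.rand(d, 3 * d), torch.rand(d, d))
```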

3.5. Algorithm

We define a solution π = (π_1, …, π_n) as a permutation of nodes, where π_t ∈ {1, …, n} and π_t ≠ π_{t′} for all t ≠ t′. Our encoder–decoder model defines a stochastic policy p_θ(π|s) for selecting a solution π given a problem instance s. This policy is decomposed and parameterized by θ:
p_\theta(\pi \mid s) = \prod_{t=1}^{n} p_\theta\left(\pi_t \mid s, \pi_{1:t-1}\right) \qquad (20)
Given an instance s, we define a probability distribution p_θ(π|s) from which we can sample a solution π. To train our model, we define the loss function L(θ|s) as the expected cost C(π) under the distribution p_θ(π|s): L(θ|s) = E_{p_θ(π|s)}[C(π)]. We employ the REINFORCE gradient estimator with a baseline b(s) to optimize L using a gradient descent algorithm:
\nabla L(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}\left[\left(C(\pi) - b(s)\right)\, \nabla \log p_\theta(\pi \mid s)\right] \qquad (21)
We propose using a rollout baseline analogous to the self-critical training method by Rennie et al. [39], with periodic updates to the baseline policy. This baseline, denoted as b(s), represents the cost of a solution derived from a deterministic greedy rollout of the policy defined by the best model to date. To stabilize the baseline as the model evolves during training, we periodically freeze the greedy rollout policy PθBaseline for a fixed number of steps per epoch, akin to the freezing of the target Q-network in DQN [6]. Stronger algorithms define stronger baselines. Hence, at the end of each epoch, we compare the current training policy with the baseline policy (using greedy decoding), and only replace the parameters θBaseline of the baseline policy when there is a significant improvement based on a paired t-test (α = 5%) across 10,000 individual (evaluation) instances. In the event of an update to the baseline policy, we sample fresh evaluation instances to mitigate the risk of overfitting. By using greedy rollout as the baseline b(s), if the sampled solution π outperforms the greedy rollout, the function C(π) − b(s) becomes negative, leading to reinforcement of the action. Conversely, if the sampled solution performs worse, the opposite effect occurs. This mechanism trains the model to enhance its greedy performance. The algorithm procedure is detailed in Algorithm 1, and the training process is illustrated in Figure 5.
Algorithm 1. REINFORCE with Rollout Baseline
Input: number of epochs E, steps per epoch T, batch size B, significance α
 Initialize θ, θ^Baseline ← θ
 For epoch = 1, …, E do
   For step = 1, …, T do
     s_i ← RandomInstance()                          ∀ i ∈ {1, …, B}
     π_i ← SampleRollout(s_i, p_θ)                   ∀ i ∈ {1, …, B}
     π_i^Baseline ← GreedyRollout(s_i, p_θ^Baseline) ∀ i ∈ {1, …, B}
     ∇L ← (1/B) Σ_i (C(π_i) − C(π_i^Baseline)) ∇_θ log p_θ(π_i)
     θ ← Adam(θ, ∇L)
   End For
   If Test(p_θ, p_θ^Baseline) < α then
     θ^Baseline ← θ
   End If
 End For
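A compact Python rendering of Algorithm 1 follows. The interfaces `model(instances, greedy=...)` (returning per-instance costs and summed log-probabilities) and `make_batch` are placeholders assumed for illustration; the one-sided paired t-test mirrors the baseline-update rule described above, not necessarily the authors’ exact code.

```python
import copy
import torch
from scipy import stats

def train(model, make_batch, epochs=100, steps=2500, batch_size=1024, alpha=0.05, lr=1e-3):
    baseline = copy.deepcopy(model)                      # θ^Baseline ← θ
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for _ in range(steps):
            s = make_batch(batch_size)
            cost, logp = model(s, greedy=False)          # SampleRollout(s_i, p_θ)
            with torch.no_grad():
                cost_bl, _ = baseline(s, greedy=True)    # GreedyRollout(s_i, p_θ^Baseline)
            loss = ((cost - cost_bl).detach() * logp).mean()   # REINFORCE with rollout baseline
            opt.zero_grad()
            loss.backward()
            opt.step()
        eval_s = make_batch(10000)                       # fresh evaluation instances
        with torch.no_grad():
            c_new, _ = model(eval_s, greedy=True)
            c_old, _ = baseline(eval_s, greedy=True)
        res = stats.ttest_rel(c_new.cpu().numpy(), c_old.cpu().numpy())
        if c_new.mean() < c_old.mean() and res.pvalue / 2 < alpha:   # one-sided paired t-test
            baseline = copy.deepcopy(model)              # θ^Baseline ← θ
    return model
```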

3.6. Stochastic Travel Costs

To integrate stochastic travel costs [40] into the model, this study introduces three distinct traffic conditions: light, normal, and heavy, represented as Λ = {0, 1, 2}. Transitions between these conditions are modeled with a 3 × 3 transition matrix, so that the current traffic state influences the likelihood of future states; for example, heavy traffic now makes heavy traffic later more likely. Additionally, we employ matrices that possess limiting distributions, indicating that, over the long term, each traffic condition appears in the road network with a fixed probability. Specifically, these matrices are designed such that rows and columns represent the light, normal, and heavy traffic states:
P_{\text{heavy}} = \begin{bmatrix} 0.3 & 0.5 & 0.2 \\ 0.2 & 0.3 & 0.5 \\ 0.1 & 0.2 & 0.7 \end{bmatrix} \qquad (22)

P_{\text{light}} = \begin{bmatrix} 0.7 & 0.2 & 0.1 \\ 0.5 & 0.3 & 0.2 \\ 0.2 & 0.5 & 0.3 \end{bmatrix} \qquad (23)
The limiting distributions of the matrices are [0.16, 0.27, 0.57] and [0.57, 0.27, 0.16], respectively. Under light traffic conditions, the vehicle travel cost between nodes is half of that under normal traffic conditions, whereas under heavy traffic conditions, it is double the cost observed in normal conditions. We incorporate stochastic information by averaging the traffic states according to the limiting distribution. For instance, the vehicle travel cost from node i to node j under heavy traffic conditions becomes:
c'_{ij} = 0.16 \times 0.5\, c_{ij} + 0.27 \times 1\, c_{ij} + 0.57 \times 2\, c_{ij} \qquad (24)
Given the existence of limiting distributions for the traffic conditions, over repeated simulations the stochastic traffic states ultimately converge towards an “average value” determined by the limiting distribution. To better align the stochastic variables with the model in this paper, we utilize the distance between two points to represent the travel cost. Therefore, the travel cost computed after obtaining the policy solution from the decoder becomes:
C'(\pi) = 0.16 \times 0.5\, C(\pi) + 0.27 \times 1\, C(\pi) + 0.57 \times 2\, C(\pi) \qquad (25)
Hence, we only need to substitute C′(π) from (25) for C(π) in (21) and proceed accordingly.
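The figures above can be reproduced with a short numerical check. The power-iteration approach used here is one standard way to obtain a limiting distribution and is an illustrative choice, not necessarily the procedure used by the authors.

```python
import numpy as np

P_heavy = np.array([[0.3, 0.5, 0.2],
                    [0.2, 0.3, 0.5],
                    [0.1, 0.2, 0.7]])        # rows/columns: light, normal, heavy

# limiting (stationary) distribution: iterate pi <- pi P until it stops changing
pi = np.ones(3) / 3
for _ in range(1000):
    pi = pi @ P_heavy
print(np.round(pi, 3))                       # close to the [0.16, 0.27, 0.57] quoted above

# expected cost multiplier: light traffic costs 0.5x, normal 1x, heavy 2x
factor = pi @ np.array([0.5, 1.0, 2.0])
print(round(float(factor), 2))               # ≈ 1.49, i.e. C'(π) ≈ 1.49 C(π) as in Eq. (25)
```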

4. Experiments

In this section, we perform experiments on VRP and VRP-STC at various scales to assess the performance of the proposed GAT-AM model. Specifically, we train a total of six models, encompassing both randomized and non-randomized variants, across different scales: n = 20, 50, 100. Node coordinates are randomly sampled from a uniform distribution within the unit square [0, 1] × [0, 1]. We conduct a comparative analysis by evaluating the results obtained from exact methods, heuristic methods, and DRL-based methods.
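A sketch of the instance generator implied by this setup is shown below. The coordinates follow the uniform unit-square sampling described above, while the integer demands in 1–9 and the size-dependent capacities (30/40/50 for n = 20/50/100) are assumptions borrowed from common practice in the DRL-for-VRP literature rather than values stated in this section.

```python
import numpy as np

def random_instance(n, seed=None):
    """Sample one instance: node 0 is the depot, nodes 1..n are customers,
    all coordinates uniform in the unit square [0, 1] x [0, 1]."""
    rng = np.random.default_rng(seed)
    coords = rng.random((n + 1, 2))
    capacity = {20: 30, 50: 40, 100: 50}.get(n, 50)          # assumed capacities
    demands = np.concatenate(([0], rng.integers(1, 10, n))) / capacity
    return coords, demands

coords, demands = random_instance(20, seed=1235)             # one VRP20 test-style instance
```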

4.1. Experimental Settings

During the training phase, we train on randomly generated training instances for 100 epochs. In each epoch, we process 2500 batches of 1024 instances. For the two-dimensional coordinates of customer nodes, we denote d = 512. Our model architecture features a single encoder with three layers of GAT, eight attention heads (M = 8), and one decoder. Additionally, we employ the Adam Optimizer [41] with a fixed learning rate of η = 10^{-3} to train the policy network. In the first epoch, we utilize a rollout baseline with an exponential decay factor β = 0.8, a practical approach for ensuring stable training [27]. During the testing phase, we generate 1000 instances for each VRP scale using the same distribution as during training, with the random seed fixed at seed = 1235. We utilize either greedy decoding or sampling decoding. In greedy decoding, we select the best action at each step, guided by the model’s output. In sampling decoding, we sample 1280 solutions randomly and report the best among them; increasing the sample size may further improve solution quality, albeit at the expense of additional computation. We follow the same procedure for VRP-STC. This maintains a clear distinction between the training and testing instances and allows us to effectively evaluate the performance of GAT-AM and all baselines across different problem scales for both VRP and VRP-STC.

4.2. Baseline Methods and Evaluation Metrics

We compare the proposed GAT-AM model, without randomness, with three baselines, including the heuristic algorithm LKH3 [42], OR Tools [43], and the DRL-based method AM [27]. We specifically compare the GAT-AM model, with randomness added, with the AM model. We will utilize Objective Obj, Gap (%), and Time as performance metrics for evaluation. Considering that the objective of all methods is to minimize the objective function, Obj signifies the optimal value of the objective function, denoting the minimal attainable objective function value associated with the optimal solution. Additionally, we consider the Obj obtained from LKH3 as the benchmark (which is consistently optimal for small-scale instances), and we evaluate the accuracy and stability of each method by analyzing the average gap (Gap). The Gap measures the relative difference between the objective function value achieved by a given method and the optimal objective function value determined by LKH3. It is computed using the following formula:
\mathrm{Gap} = \frac{1}{T} \sum_{i=1}^{T} \left( \frac{\mathrm{method\_obj}_i}{\mathrm{LKH3\_obj}_i} - 1 \right) \times 100\% \qquad (26)
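For completeness, a small helper that computes this metric from two result arrays (the names `method_obj` and `lkh3_obj` are illustrative):

```python
import numpy as np

def mean_gap(method_obj, lkh3_obj):
    """Equation (26): average relative gap (%) of a method against the LKH3
    benchmark over T test instances."""
    method_obj = np.asarray(method_obj, dtype=float)
    lkh3_obj = np.asarray(lkh3_obj, dtype=float)
    return float(np.mean(method_obj / lkh3_obj - 1.0) * 100.0)

# e.g. three instances solved 2%, 1% and 3% worse than LKH3
print(round(mean_gap([10.2, 10.1, 10.3], [10.0, 10.0, 10.0]), 2))   # 2.0
```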
Computational time is a crucial factor in our analysis. However, due to various influences, such as differences between Python3.9 and C++ GCC 11.2.0, as well as variations in hardware GPU/CPU models, comparing running times directly proves challenging. To address this, we measure “Time” as the average duration required to solve 1000 instances. All methods are trained and tested on a PC equipped with a single GPU (RTX 3090) and an Intel® Core™ i7-8700K CPU operating at 3.70 GHz.

4.3. Comparison Analysis

In Table 1, we summarize the comparison results between the GAT-AM model and three comparative methods on the non-randomized problem. In Table 2, we compare the results of the GAT-AM model with the AM model on the randomized problem. The comparison between Table 1 and Table 2 reveals that the Obj and Time for problems with added randomness are significantly higher than those without randomness. This result could be interpreted as randomized problems having higher travel costs compared to non-randomized ones. In comparison to the VRP, the computational complexity of VRP-STC is higher, leading to an increase in solving time. Table 1 clearly delineates that among traditional heuristic approaches, LKH3 consistently yields the optimal solutions across three distinct problem scales, hence establishing it as the benchmark method. It is obvious that when examining Obj and Gap, the outcomes attained by DRL-based methods closely resemble those achieved by traditional algorithms. Furthermore, in terms of computational time, methods based on DRL demonstrate the fastest resolution speed. It is noteworthy that with the escalation of problem scale, the computational time of traditional heuristic methods undergoes exponential growth, while the solving time of DRL-based methods demonstrates nearly linear progression. Moreover, it is evident from the comparison results of DRL-based methods that GAT-AM surpasses AM in both Obj and Gap, while also exhibiting shorter computational time. Through the comparison presented in Table 2, it is obvious that our GAT-AM model outperforms the AM model. The GAT demonstrates greater efficacy in addressing problems with stochastic traversal costs, leveraging its specialized network architecture tailored for graph data to unearth deeper layers of information. Following the introduction of stochastic elements, this advantage becomes even more pronounced.

4.4. Model Convergence Performance

Figure 6a–c demonstrate that the proposed model exhibits superior convergence performance across VRP problems of different scales. Compared to the AM model, the GAT-AM model achieves near-optimal results within the initial epochs. As the node number increases, the convergence rate of the AM model declines due to the exponential growth in the complexity of the graph structure. This complexity hampers the traditional Transformer architecture’s ability to capture intricate node relationships during the early training stages. In contrast, the GAT leverages an attention mechanism to automatically learn and optimize node connections. This enables the network to assign varying weights to different node pairs based on the dataset characteristics and task requirements. By utilizing continuous numerical attention weights instead of binary node connections, GAT enhances its expressive capacity. The parallel computation of attention values further contributes to the high computational efficiency of the GAT. GAT provides personalized feature representations for each node by incorporating both the features of the node itself and its neighbors during attention calculation. Moreover, GAT’s focus on capturing the local structural information of nodes is crucial for determining the optimal solution to VRPs.
As illustrated in Figure 7, the introduction of random travel costs into the model induces varying degrees of oscillation in the cost convergence curves for both the GAT-AM and AM models. The GAT-AM model exhibits noticeably poorer convergence performance during the first 40 epochs compared to the scenario without random transit costs. Nonetheless, with an increasing number of epochs, the cost curve typically shows a tendency to stabilize. This phenomenon is attributed to the model’s extensive exploration due to the introduced randomness during the initial stages, leading to significant cost fluctuations. Nevertheless, with sufficient exploration, the model identifies more optimal solutions, resulting in stabilized costs. This observation aligns with the hypothesis made when configuring the random model; with an increase in the number of training samples, the model’s final cost is expected to approach the “average value” of the random matrix.
In comparison to the AM model, the GAT-AM model consistently demonstrates superior convergence performance, with a smoother training cost curve. This indicates that the GAT-AM model effectively addresses the challenges posed by randomness. The GAT’s ability to capture relational information between nodes, coupled with the Transformer model’s capacity to autonomously learn the dependencies and sequences of paths, enables the formulation of more accurate and adaptive vehicle routing solutions. Through the modelling and learning of random travel costs, the proposed approach yields more precise and robust vehicle routing planning outcomes.

4.5. Visualization

In this subsection, we present a data simulation example to provide a more intuitive demonstration of the solution outcomes. Utilizing GAT-AM, Figure 8 and Figure 9 showcase exemplary solutions for VRP and VRP-STC with different node numbers (n = 20, 50, and 100), achieved via models incorporating sampling decoding techniques. Figure 8 illustrates the solutions to the VRP for different node numbers. The lines of different colors represent distinct vehicles, while the black squares represent depots, and circular nodes denote customers. From Figure 8, it is evident that with smaller node sizes, fewer vehicles are employed. For instance, in VRP20, only three vehicles suffice to fulfill the demands of all customers. As the node size expands, the number of vehicles gradually increases. In VRP50, six vehicles are utilized. However, the number of vehicles does not proportionally increase with the node size. In VRP100, despite doubling the number of customers compared to VRP50, the number of vehicles only increases by two. This is because each vehicle also aims to serve as many customers as possible to achieve the objective of minimizing the total number of vehicles required. Figure 9 displays the solutions to VRP-STC for different scales. The lines of different colors represent distinct vehicles. Comparing Figure 9 with the results of VRP in Figure 8, it can be observed that in both node sizes 20 and 50, the solutions for VRP-STC entail one additional vehicle compared to the VRP solutions. As the node size expands to 100, although the number of vehicles remains the same, the final vehicle is loaded with only a small amount of cargo. In this case, while the vehicles are not fully loaded, they compensate for the increased cost of using multiple vehicles by traveling shorter distances.

5. Conclusions

In this research endeavor, we present the GAT-AM model, which combines AM and DRL techniques. This model exhibits robust performance in addressing the challenges posed by both the VRP and the VRP-STC. The architecture of the GAT-AM model comprises an encoder and a decoder. The encoder utilizes GAT to effectively learn feature representations of customer nodes across diverse subspaces. Meanwhile, the decoder employs multi-head AM to refine solution construction strategies, aiming to generate a broader range of solutions and thereby explore high-quality solutions more effectively. During model training, the REINFORCE with Rollout Baseline algorithm is employed, which effectively mitigates the variance of policy gradients by incorporating a baseline. This optimization technique accelerates the learning process and enhances training efficiency. Additionally, the decoder’s use of a dedicated context mechanism and masking policy restricts the choice of subsequent nodes, ensuring that the solutions generated remain feasible. This refinement significantly augments the likelihood of the model discovering high-quality solutions within polynomial time complexity.
The comparative experiments encompassed three baseline methods for VRP across various scales and conditions. For VRP-STC, we conducted a distinct comparison between GAT-AM and AM. The experimental results indicate that, compared to existing heuristic algorithms, the proposed model exhibits a significant improvement in both training and testing speed, thereby confirming its efficiency. Notably, when dealing with large-scale instances, the proposed model can furnish solutions within a reasonable time complexity, underscoring the immense potential of DRL in addressing CO problems.
Although this paper has achieved certain accomplishments, there are still several avenues worth further exploration in solving VRPs:
  • Algorithmic Refinement and Generalization: The model currently exhibits certain limitations. Enhancing the algorithm’s generalization capacity to accommodate a broader spectrum of environments and various problem scenarios represents a pivotal area for future investigation.
  • Real-time Dynamic Planning: With the growing demand for practical applications, implementing real-time dynamic planning within the model to accommodate the evolving logistics demands and traffic conditions is a pressing challenge awaiting resolution.
  • Multi-objective Optimization: The present model primarily focuses on minimizing the total travel cost. In the future, exploration could be extended to achieve a balance among multiple objectives, such as service level assurance and minimizing environmental impacts.
In conclusion, while this paper has made some progress, there remains vast research space in the realms of deep learning and combinatorial optimization problem solving. Future work will continue to advance the integration of theory and practice, driving the intelligent upgrading of the transportation industry, and providing robust technological support and theoretical guidance for scientific research and industrial development in related fields.

Author Contributions

Conceptualization, H.C.; methodology, H.C. and X.T.; software, H.C.; validation, H.C. and G.L.; formal analysis, H.C.; investigation, X.T.; resources, P.X.; data curation, H.C. and X.T.; writing—original draft preparation, H.C. and X.T.; writing—review and editing, H.C. and X.T.; visualization, G.L.; supervision, X.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

We thank the anonymous reviewers and editor for their valuable comments on an earlier version of this paper that resulted in improved content and exposition.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

  1. Khalil, E.; Dai, H.; Zhang, Y.; Dilkina, B.; Song, L. Learning combinatorial optimization algorithms over graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 6348–6358. [Google Scholar]
  2. Bengio, Y.; Lodi, A.; Prouvost, A. Machine learning for combinatorial optimization: A methodological tour d’horizon. Eur. J. Oper. Res. 2021, 290, 405–421. [Google Scholar] [CrossRef]
  3. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
  4. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction. IEEE Trans. Neural Netw. 1998, 9, 1054. [Google Scholar] [CrossRef]
  5. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A. Mastering the game of Go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef] [PubMed]
  6. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  7. Karimi-Mamaghan, M.; Mohammadi, M.; Meyer, P.; Karimi-Mamaghan, A.M.; Talbi, E.-G. Machine learning at the service of meta-heuristics for solving combinatorial optimization problems: A state-of-the-art. Eur. J. Oper. Res. 2022, 296, 393–422. [Google Scholar] [CrossRef]
  8. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  9. Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef]
  10. Konda, V.; Tsitsiklis, J. Actor-critic algorithms. Adv. Neural Inf. Process. Syst. 1999, 12, 1008–1014. [Google Scholar]
  11. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 22–24 June 2014; pp. 387–395. [Google Scholar]
  12. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1889–1897. [Google Scholar]
  13. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  14. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  15. Babaeizadeh, M.; Frosio, I.; Tyree, S.; Clemons, J.; Kautz, J. Reinforcement learning through asynchronous advantage actor-critic on a gpu. arXiv 2016, arXiv:1611.06256. [Google Scholar]
  16. Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 2016, 17, 1334–1373. [Google Scholar]
  17. Deng, Y.; Bao, F.; Kong, Y.; Ren, Z.; Dai, Q. Deep direct reinforcement learning for financial signal representation and trading. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 653–664. [Google Scholar] [CrossRef]
  18. Zheng, G.; Zhang, F.; Zheng, Z.; Xiang, Y.; Yuan, N.J.; Xie, X.; Li, Z. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 167–176. [Google Scholar]
  19. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
  20. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 2018, 362, 1140–1144. [Google Scholar] [CrossRef]
  21. Schrittwieser, J.; Antonoglou, I.; Hubert, T.; Simonyan, K.; Sifre, L.; Schmitt, S.; Guez, A.; Lockhart, E.; Hassabis, D.; Graepel, T. Mastering atari, go, chess and shogi by planning with a learned model. Nature 2020, 588, 604–609. [Google Scholar] [CrossRef]
  22. Vinyals, O.; Fortunato, M.; Jaitly, N. Pointer networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
  23. Lu, H.; Zhang, X.; Yang, S. A learning-based iterative method for solving vehicle routing problems. In Proceedings of International Conference on Learning Representations. Available online: https://openreview.net/forum?id=BJe1334YDH (accessed on 13 August 2024).
  24. Manchanda, S.; Mittal, A.; Dhawan, A.; Medya, S.; Ranu, S.; Singh, A. Learning heuristics over large graphs via deep reinforcement learning. arXiv 2019, arXiv:1903.03332. [Google Scholar]
  25. Mazyavkina, N.; Sviridov, S.; Ivanov, S.; Burnaev, E. Reinforcement learning for combinatorial optimization: A survey. Comput. Oper. Res. 2021, 134, 105400. [Google Scholar] [CrossRef]
  26. Cappart, Q.; Chételat, D.; Khalil, E.B.; Lodi, A.; Morris, C.; Veličković, P. Combinatorial optimization and reasoning with graph neural networks. J. Mach. Learn. Res. 2023, 24, 1–61. [Google Scholar]
  27. Kool, W.; Van Hoof, H.; Welling, M. Attention, learn to solve routing problems! arXiv 2018, arXiv:1803.08475. [Google Scholar]
  28. Nowak, A.; Villar, S.; Bandeira, A.S.; Bruna, J. A note on learning algorithms for quadratic assignment with graph neural networks. Stat 2017, 1050, 22. [Google Scholar]
  29. Li, Z.; Chen, Q.; Koltun, V. Combinatorial optimization with graph convolutional networks and guided tree search. Adv. Neural Inf. Process. Syst. 2018, 31. Available online: https://proceedings.neurips.cc/paper_files/paper/2018/file/8d3bba7425e7c98c50f52ca1b52d3735-Paper.pdf (accessed on 13 August 2024).
  30. Drori, I.; Kharkar, A.; Sickinger, W.R.; Kates, B.; Ma, Q.; Ge, S.; Dolev, E.; Dietrich, B.; Williamson, D.P.; Udell, M. Learning to solve combinatorial optimization problems on real-world graphs in linear time. In Proceedings of the 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 14–17 December 2020; pp. 19–24. [Google Scholar]
  31. Lodi, A.; Mossina, L.; Rachelson, E. Learning to handle parameter perturbations in combinatorial optimization: An application to facility location. EURO J. Transp. Logist. 2020, 9, 100023. [Google Scholar] [CrossRef]
  32. Xidias, E.; Zacharia, P.; Nearchou, A. Intelligent fleet management of autonomous vehicles for city logistics. Appl. Intell. 2022, 52, 18030–18048. [Google Scholar] [CrossRef]
  33. Luo, H.; Dridi, M.; Grunder, O. A branch-price-and-cut algorithm for a time-dependent green vehicle routing problem with the consideration of traffic congestion. Comput. Ind. Eng. 2023, 177, 109093. [Google Scholar] [CrossRef]
  34. Bai, R.; Chen, X.; Chen, Z.-L.; Cui, T.; Gong, S.; He, W.; Jiang, X.; Jin, H.; Jin, J.; Kendall, G. Analytics and machine learning in vehicle routing research. Int. J. Prod. Res. 2023, 61, 4–30. [Google Scholar] [CrossRef]
  35. Puterman, M.L. Markov decision processes. Handb. Oper. Res. Manag. Sci. 1990, 2, 331–434. [Google Scholar]
  36. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  38. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  39. Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024. [Google Scholar]
  40. Liu, Z.; Li, X.; Khojandi, A. The flying sidekick traveling salesman problem with stochastic travel time: A reinforcement learning approach. Transp. Res. Part E Logist. Transp. Rev. 2022, 164, 102816. [Google Scholar] [CrossRef]
  41. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  42. Helsgaun, K. An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems. Rosk. Rosk. Univ. 2017, 12, 966–980. [Google Scholar]
  43. Bello, I.; Pham, H.; Le, Q.V.; Norouzi, M.; Bengio, S. Neural combinatorial optimization with reinforcement learning. arXiv 2016, arXiv:1611.09940. [Google Scholar]
Figure 1. GAT-AM model diagram.
Figure 2. (a): The AM a(W x_i, W x_j) employed by our model, employing a LeakyReLU activation. (b): An illustration of multi-head attention (with K = 3 heads) by node 1 on its neighborhood. Different arrow styles and colors denote independent attention computations.
Figure 3. The refined GAT encoder undergoes a feature extraction process following transformation.
Figure 4. Utilizing multi-head attention mechanisms, the decoder for VRP and VRP-STC problems processes both graph and node embeddings as inputs. At each time step t, the context includes the graph embeddings, the embeddings of the first and last (previously output) nodes of the partial solution, and the embedding representing the remaining capacity at the depot. Nodes that have already been visited are masked to prevent re-access. As both the start and end points are depot nodes, this example illustrates how to construct a solution π = (0, 3, 2, 1, 4, 0).
Figure 5. GAT-AM model training process diagram.
Figure 6. Comparison of convergence performance for VRP at different node scales.
Figure 7. Comparison of convergence performance for VRP-STC at different node scales.
Figure 8. The experimental results for the Vehicle Routing Problem (VRP) with different node sizes.
Figure 9. The experimental results for the VRP-STC with different node sizes.
Table 1. GAT-AM vs. baselines.

| Method            | VRP20 Obj | Gap   | Time | VRP50 Obj | Gap   | Time | VRP100 Obj | Gap   | Time |
|-------------------|-----------|-------|------|-----------|-------|------|------------|-------|------|
| LKH3              | 6.14      | 0.00% | 7 h  | 10.38     | 0.00% | 7 h  | 15.65      | 0.00% | 13 h |
| OR Tools          | 6.42      | 4.84% | -    | 11.22     | 8.12% | -    | 17.14      | 9.34% | -    |
| AM (greedy)       | 6.40      | 4.57% | 1 s  | 10.98     | 5.78% | 3 s  | 16.80      | 7.34% | 8 s  |
| AM (sampling)     | 6.25      | 2.12% | 6 m  | 10.62     | 2.31% | 28 m | 16.23      | 3.72% | 2 h  |
| GAT-AM (greedy)   | 6.35      | 3.76% | 1 s  | 10.88     | 4.82% | 2 s  | 16.13      | 2.89% | 5 s  |
| GAT-AM (sampling) | 6.18      | 0.98% | 5 m  | 10.51     | 1.25% | 11 m | 15.89      | 1.53% | 23 m |
Table 2. GAT-AM vs. AM.

| Method            | VRP-STC20 Obj | Time | VRP-STC50 Obj | Time | VRP-STC100 Obj | Time  |
|-------------------|---------------|------|---------------|------|----------------|-------|
| AM (greedy)       | 9.54          | 2 s  | 16.36         | 8 s  | 25.08          | 23 s  |
| AM (sampling)     | 9.31          | 13 m | 15.98         | 1 h  | 24.68          | 4.5 h |
| GAT-AM (greedy)   | 9.34          | 2 s  | 16.01         | 6 s  | 24.67          | 13 s  |
| GAT-AM (sampling) | 9.18          | 7 m  | 15.68         | 16 m | 24.36          | 46 m  |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
