Article

Graph-Driven Deep Reinforcement Learning for Vehicle Routing Problems with Pickup and Delivery

Dapeng Yan, Qingshu Guan, Bei Ou, Bowen Yan and Hui Cao
1 School of Electrical Engineering, Xi’an Jiaotong University, Xi’an 710049, China
2 State Key Laboratory of Electrical Insulation and Power Equipment, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 4776; https://doi.org/10.3390/app15094776
Submission received: 4 March 2025 / Revised: 22 April 2025 / Accepted: 23 April 2025 / Published: 25 April 2025

Abstract

Recently, the vehicle routing problem with pickup and delivery (VRP-PD) has attracted increasing interest due to its widespread applications in real-life logistics and transportation. However, existing learning-based methods often fail to fully exploit hierarchical graph structures, leading to suboptimal performance. In this study, we propose a graph-driven deep reinforcement learning (GDRL) approach that employs an encoder–decoder framework to address this shortcoming. The encoder incorporates stacked graph convolution modules (GCMs) to aggregate neighborhood information via updated edge features, producing enriched node representations for subsequent decision-making. The single-head attention decoder then applies a computationally efficient compatibility layer to sequentially determine the next node to visit. Extensive experiments demonstrate that the proposed GDRL achieves superior performance over both heuristic and learning-based baselines, reducing route length by up to 5.81% across synthetic and real-world datasets. Furthermore, GDRL also exhibits strong generalization capability across diverse problem scales and node distributions, highlighting its potential for real-world deployment.

1. Introduction

The vehicle routing problem with pickup and delivery (VRP-PD) [1] is a fundamental subfield of combinatorial optimization problems (COPs) [2] with far-reaching implications in computer science [3] and operational research [4], as well as extensive real-world applications in logistics [5], robotics [6], transportation [7], and so forth. Unlike traditional VRP variants such as the traveling salesman problem (TSP) [8] and capacitated VRP (CVRP) [9], VRP-PD requires determining optimal routes that fulfill both pickup and delivery requests while adhering to several constraints: (1) each customer node is visited exactly once; (2) each vehicle begins and ends its route at a designated depot; (3) customer demands must be fully met without exceeding vehicle capacity; (4) each pickup operation must precede its corresponding delivery. These problem-specific constraints introduce unique challenges and complexities to optimization.
In recent decades, a wide array of exact and heuristic methods has been developed to address routing problems, including the VRP-PD. Exact algorithms, including branch-and-cut [10] and its variants, provide theoretical guarantees of optimality but face prohibitive computational complexity, making them impractical for large-scale instances. In contrast, heuristic approaches, such as local neighborhood search [11], ant colony optimization [12], and simulated annealing [13], offer a more computationally efficient alternative by leveraging manually crafted rules to guide the search process. While these methods achieve a balance between solution quality and execution speed, their reliance on problem-specific rules and expert knowledge limits their adaptability, often leading to suboptimal performance in complex and dynamic real-world routing scenarios.
Deep neural networks and reinforcement learning have made remarkable strides in solving complex tasks, benefiting from the parallel computing capabilities of graphics processing units (GPUs) [14,15]. Recently, deep reinforcement learning (DRL) has gained traction as a promising approach for solving various VRPs [16,17]. By extracting patterns from large-scale data and autonomously learning decision rules during route construction, DRL-based methods have demonstrated strong performance across different routing problem variants. However, in our view, there are still several limitations to constructive DRL-based routing methods. On the one hand, prevalent methods predominantly rely on node-level features for representation learning, failing to fully exploit hierarchical graph structures, which limits routing performance. On the other hand, widely adopted decoding strategies incur substantial computational costs and lack adaptability for real-time decision-making, posing significant challenges in scaling to large problem instances. Overall, addressing these limitations is critical for further advancing DRL-based methods in combinatorial optimization.
To address these challenges, we formulate the VRP-PD as a Markov decision process (MDP) and introduce a graph-driven deep reinforcement learning (GDRL) framework for route optimization. The proposed approach follows an encoder–decoder architecture, where the encoder constructs a k-nearest neighbor (KNN) graph to represent spatial relationships among nodes. Stacked graph convolution modules (GCMs) are then employed to extract hierarchical graph features, generating informative representations for both nodes and edges. The relative connection weight between adjacent nodes is derived from edge features, which are computed based on node embeddings via GCMs. In the decoding phase, a computationally efficient compatibility layer with a single-head attention mechanism produces a probability distribution for node selection at each decision step, iteratively constructing a complete route. The entire network is optimized using an enhanced reinforcement learning algorithm, ensuring improved training stability and performance.
We conduct a comprehensive evaluation of the proposed GDRL framework on both randomly generated datasets and real-world benchmarks. Experimental results indicate that GDRL significantly outperforms a range of heuristic and learning-based methods, achieving up to a 5.81% reduction in route length. Moreover, the proposed GDRL demonstrates remarkable generalization capabilities, maintaining robust performance when applied to larger-scale problems and varying nodal distributions. These findings underscore the potential of our approach to effectively address VRP-PD challenges in real-world logistics and transportation scenarios.

2. Related Work

VRP-PD stands out as a unique branch of routing problems with broad applications in logistics and transportation. It is characterized by pickup (backhaul) nodes and delivery (linehaul) nodes, where freight is collected from customers and transported to a central depot. Depending on whether pickup operations must precede their corresponding deliveries, VRP-PD can be categorized into conventional and improved variants. In this section, we provide an overview of exact and heuristic approaches for both problem settings and examine the prevailing DRL-based methods developed for VRP-PD.

2.1. Exact and Heuristic Methods

Mainstream exact methods for VRP-PDs typically employ branch-and-cut, branch-and-bound, and their variants to systematically manage computational complexity. Baldacci et al. [18] presented a branch-and-cut approach based on a set-partitioning formulation to solve VRP-PD with time windows. Li et al. [19] addressed real-world pickup and delivery challenges by introducing a branch-and-price-and-cut algorithm with specialized ad hoc label-setting strategies, achieving high-quality solutions for instances with fewer than 100 products within an hour of computation. Hernández-Pérez et al. [20] formulated the VRP-PD as a relaxed mixed-integer programming problem and developed a branch-and-cut method based on an aggregated master model to optimize solutions for the split-demand one-commodity VRP-PD. Although exact algorithms provide strong optimality guarantees, their high computational cost renders them impractical for large-scale routing problems, limiting their applicability in real-world scenarios.
In contrast, heuristic methods exploit hand-crafted rules to guide the search process, enabling them to produce high-quality solutions for large-scale VRP-PDs. Karaoglan et al. [21] designed a two-stage simulated annealing approach for the conventional VRP-PD, aimed at minimizing the total cost of vehicle routes. Olgun et al. [22] concentrated on the green VRP with simultaneous linehaul and backhaul nodes, designing a hyper-heuristic framework that integrated adaptive local search to optimize energy consumption in oil-fueled vehicle fleets, addressing both conventional and improved VRP-PDs. Helsgaun [23] implemented an advanced neighborhood search (i.e., the LKH solver) for various routing problems, including TSP, CVRP, and VRP-PD. Zhou et al. [4] optimized electric vehicle routing in pickup and delivery logistics using an adaptive hybrid neighborhood search approach. Despite their effectiveness, heuristic methods are highly dependent on expert knowledge and problem-specific tuning, limiting their adaptability to diverse routing challenges.

2.2. DRL-Based Methods

In recent times, DRL has achieved remarkable advancements in addressing a wide range of routing and scheduling problems. A notable milestone in this field is the attention model (AM) proposed by Kool et al. [24], which leverages a transformer-based encoder–decoder framework to learn decision-making strategies for VRPs. Building upon this foundation, Kwon et al. [25] introduced policy optimization with multiple optima (POMO), an approach designed to enhance solution diversity through multiple rollouts and data augmentation, achieving state-of-the-art performance across VRPs of varying scales and node distributions. Further refining DRL-based routing models, Li et al. [26] introduced a feature embedding refiner (FER) block within a novel encoder–refiner–decoder architecture. As a model-agnostic and plug-and-play module, FER dynamically enhances node embeddings and adjusts probability distributions over a broader search space, leading to notable improvements in conventional VRP-PD solutions. Additionally, Zhang et al. [27] developed a meta-learning-based framework for multi-objective optimization, demonstrating adaptability and efficiency in addressing improved VRP-PDs and other complex routing problems. Despite these advancements, DRL-based methods for VRP-PDs remain an open research area with substantial gaps. For instance, Ghaziri and Osman [28] proposed an unsupervised competitive neural network to adaptively organize feature maps for linehaul and backhaul nodes. However, their approach did not fully leverage edge information among customer nodes, and solution quality for improved VRP-PDs remained suboptimal, underscoring the need for further exploration in this domain.

3. Problem Formulation

In this section, we first formulate the conventional and improved VRP-PD mathematically and then recast it as an MDP to enable the application of DRL.

3.1. Mathematical Formulation

A VRP-PD instance consists of a single depot and N customer nodes, comprising $N_L$ delivery nodes and $N_B$ pickup nodes. The objective is to minimize the total travel distance for a fleet of identical vehicles while satisfying the following constraints: (1) each customer node must be visited exactly once; (2) all customer demands must be fulfilled; (3) the vehicle’s load must not exceed its maximum capacity; (4) delivery nodes must be served before pickup nodes within each route. Formally, we define the VRP-PD on a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{0, 1, \ldots, N\}$ represents the set of all nodes and $\mathcal{E}$ denotes the set of edges connecting them. Based on this representation, the mathematical formulation of VRP-PD is detailed as follows:
$$
\begin{aligned}
&\min \ \sum_{k=1}^{K}\sum_{i=0}^{N}\sum_{j=0}^{N} e_{ij}^{k}\, d_{ij}^{k}, && (1)\\
\text{s.t.}\ \ &\sum_{k=1}^{K}\sum_{i=0}^{N} e_{ij}^{k} = 1, \quad \forall k \in \mathcal{K},\ j \in \mathcal{V}, && (2)\\
&\sum_{i=0}^{N} e_{ij}^{k} = \sum_{i=1}^{N} e_{ji}^{k}, \quad \forall k \in \mathcal{K},\ j \in \mathcal{V}, && (3)\\
&\sum_{k=1}^{K}\sum_{i \in \mathcal{B}}\sum_{j \in \mathcal{L}} e_{ij}^{k} = 0, && (4)\\
&\sum_{k=1}^{K}\sum_{j \in \mathcal{B}} e_{0j}^{k} = 0, && (5)\\
&\sum_{i=0}^{N} u_{ji}^{k} - \sum_{i=0}^{N} u_{ij}^{k} = q_j, \quad \forall j \neq 0, && (6)\\
&0 \le u_{ij}^{k} \le e_{ij}^{k} D, \quad \forall i, j \in \mathcal{V},\ k \in \mathcal{K}, && (7)\\
&e_{ij}^{k} \in \{0, 1\},\ u_{ij}^{k} \ge 0, \quad \forall i, j \in \mathcal{V},\ k \in \mathcal{K}, && (8)
\end{aligned}
$$
where the notations of parameters and decision variables are summarized in Table 1. The objective function, as shown in Equation (1), aims to minimize the total traversal cost across all vehicles. Constraints (2) and (3) guarantee that each node is visited exactly once and that each route is assigned to a single vehicle. Constraints (4) and (5) enforce the service order, requiring that pickup nodes be visited only after their corresponding delivery nodes. Specifically, constraint (4) prohibits a vehicle from visiting a pickup node before completing all deliveries within its route, while constraint (5) prevents a vehicle from traveling directly to a pickup node upon departing from the depot. Constraint (6) represents the load variation of the vehicle after serving different types of customer nodes. Constraint (7) enforces that the vehicle’s freight load does not exceed its maximum capacity. For the improved VRP-PD variant, constraint (4) is omitted, relaxing the strict service order and allowing greater flexibility in route planning.

3.2. Formulation as a MDP

The route construction of VRP-PD can be inherently framed as a sequential decision-making problem, making it well-suited for formulation within DRL. To this end, we frame the VRP-PD as a tailored MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$, where $\mathcal{S}$ represents the state space, $\mathcal{A}$ denotes the action space, $\mathcal{T}$ defines the state transition dynamics, and $\mathcal{R}$ corresponds to the cumulative reward. The components of this MDP are described as follows.
State: At each decision step t, the state $s_t = \big(\{v_i^t\}_{i \in \mathcal{V}}, \{u_k^t\}_{k \in \mathcal{K}}\big) \in \mathcal{S}$ encapsulates two key elements. The first component $v_i^t = (p_i, q_i^t)$ comprises the two-dimensional Euclidean coordinate and real-time demand of each customer node $i \in \mathcal{V}$. The second component $u_k^t$ represents the dynamic remaining capacity of vehicle $k \in \mathcal{K}$ at time t. Initially, the vehicle starts at its full capacity, such that $u_k^0 = D$.
Action: The action $a_t \in \mathcal{A}$ corresponds to selecting the next customer node to visit at time step t. This decision is governed by the current state $s_t$ and a policy network parameterized by $\theta$, i.e., $a_t \sim p_\theta(a_t \mid s_t)$. To ensure feasibility, nodes that have already been visited or would violate problem constraints are masked from selection. The sequence of selected nodes ultimately forms the complete routing solution for the VRP-PD.
Transition: The transition to the next state $s_{t+1}$ follows the transition rule $\mathcal{T}$, which updates the system based on the selected action $a_t$ and the current state $s_t$. Once node i is served by vehicle k at time t, the real-time demand of node i is set to 0 ($q_i^{t+1} = 0$), and the vehicle’s remaining capacity is updated as $u_k^{t+1} = u_k^t - q_i^t$. It is worth noting that the demand $q_i^t$ can take either positive or negative values, reflecting whether the node corresponds to a pickup or delivery operation.
Reward: The goal of VRP-PD is to minimize the total travel distance. Accordingly, the cumulative reward is defined as the negative of the total traversal length, $\mathcal{R} = -L(A)$, with

$$L(A) = \lVert p_{a_T} - p_{a_0} \rVert_2 + \sum_{t=0}^{T-1} \lVert p_{a_{t+1}} - p_{a_t} \rVert_2,$$
where T is the maximum execution time required to construct a complete route, and $\lVert \cdot \rVert_2$ denotes the $L_2$ norm. The immediate reward $r_t$ at time t is thus formulated as $r_t = -\lVert p_{a_t} - p_{a_{t-1}} \rVert_2$, penalizing longer travel distances between consecutive decisions.
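To make the transition and reward dynamics concrete, the following is a minimal sketch of a single MDP step, assuming NumPy arrays for coordinates and demands; names such as `step`, `coords`, and `remaining_capacity`, and the sign convention for pickup versus delivery demands, are illustrative rather than taken from the paper.

```python
import numpy as np

def step(coords, demands, remaining_capacity, current_node, action):
    """One illustrative MDP transition for the VRP-PD.

    coords:  (N+1, 2) array of node coordinates (index 0 is the depot)
    demands: (N+1,) array of real-time demands (sign distinguishes pickup/delivery;
             the exact convention here is illustrative)
    remaining_capacity: scalar load left on the active vehicle
    current_node, action: indices of the last visited node and the chosen next node
    """
    # Immediate reward is the negative travel distance between consecutive nodes.
    reward = -np.linalg.norm(coords[action] - coords[current_node])

    # Serving the node updates the vehicle's remaining capacity and zeroes the demand.
    remaining_capacity = remaining_capacity - demands[action]
    demands = demands.copy()
    demands[action] = 0.0

    return demands, remaining_capacity, action, reward
```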
Therefore, the probability of constructing a complete route solution is governed by a chain rule as follows:
$$p(A \mid s) = \prod_{t=0}^{T} p_\theta(a_t \mid s_t).$$
At each time step t, the policy network generates a probability distribution over candidate nodes, facilitating node selection via either a greedy strategy or a sampling-based rollout [24,25]. This iterative process continues until a complete solution is obtained. In the subsequent section, we introduce the proposed encoder–decoder-based policy network, designed to efficiently solve both the conventional and improved VRP-PD with high-quality solutions.

4. Methodology

In this section, we present the proposed DRL-based approach with an encoder–decoder framework for VRP-PD. We first introduce a graph convolution module as the encoder to embed node features; then, we describe the attention-based decoder for node selection. Finally, we detail the training algorithm used to optimize the policy network.

4.1. Encoder

Traditional DRL-based methods primarily rely on node features to generate high-dimensional embeddings but fail to fully exploit the hierarchical structure of graph-based representations. As a result, these approaches often lead to suboptimal routing solutions. To mitigate this limitation, the proposed encoder, as shown in Figure 1, incorporates a graph convolution module (GCM) that integrates both node and edge features, capturing interdependencies between pickup and delivery locations as well as local and global structural relationships within the problem space.
Moreover, computing pairwise connections among all nodes is computationally prohibitive, particularly for large-scale instances. To address this challenge, we construct a sparse k-nearest neighbor (KNN) graph that enables efficient edge-based message passing. Specifically, each node i exchanges edge features exclusively with its k nearest neighbors $\mathcal{N}(i)$ in Euclidean space. This design preserves critical structural information while ensuring efficient information propagation across different problem scales. The number of neighbors is set in proportion to the instance size:

$$k = \mu_k N,$$

where $\mu_k \in (0, 1)$ is a ratio that controls the scope of the KNN graph.
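For concreteness, the sparse neighbor structure can be built as in the following sketch, assuming NumPy; the helper name `build_knn_graph`, the example ratio value, and the rounding of k to an integer are our assumptions.

```python
import numpy as np

def build_knn_graph(coords, mu_k=0.2):
    """Return, for each node, the indices and distances of its k nearest neighbors.

    coords: (N+1, 2) array of node coordinates. k is a fixed fraction of the
    instance size, k = mu_k * N, rounded down to an integer (ratio is illustrative).
    """
    n = coords.shape[0]
    k = max(1, int(mu_k * (n - 1)))

    # Pairwise Euclidean distances between all nodes.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude self-loops

    # For each node, keep the k closest nodes; edge features are the distances.
    neighbors = np.argsort(dists, axis=1)[:, :k]
    edge_dists = np.take_along_axis(dists, neighbors, axis=1)
    return neighbors, edge_dists
```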
The raw node features $v_i$ and edge features $d_{ij}$ are first embedded into $d_g$-dimensional tokens via a linear projection as follows:
$$g_i^{(0)} = v_i W_g^{(0)} + b_g^{(0)}, \quad i \in \mathcal{V},$$

$$z_{ij}^{(0)} = d_{ij} W_z^{(0)} + b_z^{(0)}, \quad i \in \mathcal{V},\ j \in \mathcal{N}(i),$$
where $W_g^{(0)} \in \mathbb{R}^{2 \times d_g}$ and $W_z^{(0)}, b_g^{(0)}, b_z^{(0)} \in \mathbb{R}^{1 \times d_g}$ are learnable parameters. The initial node tokens $g_i^{(0)}$ and edge tokens $z_{ij}^{(0)}$ are processed through the GCM with L layers, where hierarchical feature extraction is performed. For each layer $l \in \{1, \ldots, L\}$, the GCM operates as a message-passing mechanism, propagating information between neighboring nodes based on edge features defined in the KNN graph.
$$g_i^{(l)} = g_i^{(l-1)} + \mathrm{ReLU}\Big(\mathrm{BN}\Big(g_i^{(l-1)} W_{g,1}^{(l-1)} + \sum_{j \in \mathcal{N}(i)} \psi_{ij}^{(l-1)} \odot g_j^{(l-1)} W_{g,2}^{(l-1)}\Big)\Big),$$

$$\psi_{ij}^{(l)} = \frac{\mathrm{Sigmoid}\big(z_{ij}^{(l-1)}\big)}{\sum_{j' \in \mathcal{N}(i)} \mathrm{Sigmoid}\big(z_{ij'}^{(l-1)}\big)},$$

$$z_{ij}^{(l)} = z_{ij}^{(l-1)} + \mathrm{ReLU}\Big(\mathrm{BN}\Big(z_{ij}^{(l-1)} W_{z,1}^{(l-1)} + g_i^{(l-1)} W_{z,2}^{(l-1)} + g_j^{(l-1)} W_{z,3}^{(l-1)}\Big)\Big), \quad j \in \mathcal{N}(i),$$

where $W_{g,1}^{(l-1)}, W_{g,2}^{(l-1)}, W_{z,1}^{(l-1)}, W_{z,2}^{(l-1)}, W_{z,3}^{(l-1)} \in \mathbb{R}^{d_g \times d_g}$ are trainable weight matrices, and $\psi_{ij}^{(l)}$ denotes the relative connection coefficient between two neighboring nodes. Nonlinear activation and normalization are applied through the rectified linear unit ($\mathrm{ReLU}(\cdot)$), batch normalization ($\mathrm{BN}(\cdot)$), and the sigmoid function ($\mathrm{Sigmoid}(\cdot)$) to enhance learning stability.
In each layer l, the node token $g_i^{(l)}$ is updated by aggregating information from its previous token $g_i^{(l-1)}$ and the edge tokens $\{z_{ij}^{(l-1)}\}_{j \in \mathcal{N}(i)}$ associated with its k nearest neighbors. Simultaneously, the edge token $z_{ij}^{(l)}$ is refined by incorporating information from its prior state $z_{ij}^{(l-1)}$ along with the node tokens $g_i^{(l-1)}$ and $g_j^{(l-1)}$. After L layers, the final output of the encoder is obtained, where the global graph representation $g_{\mathrm{graph}}$ is computed as the mean of all node embeddings $\{g_i^{(L)}\}$, encapsulating the overall structural information of the problem instance.

$$g_{\mathrm{graph}} = \frac{1}{1+N} \sum_{i \in \mathcal{V}} g_i^{(L)}.$$
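For readers who prefer code, the following PyTorch-style sketch mirrors one GCM layer under the formulation above; the module and tensor names are illustrative, and details such as the exact gating normalization and batch handling may differ from the authors' implementation.

```python
import torch
import torch.nn as nn

class GCMLayer(nn.Module):
    """One graph convolution module layer: residual node/edge updates on a KNN graph."""

    def __init__(self, d_g):
        super().__init__()
        self.W_g1 = nn.Linear(d_g, d_g, bias=False)   # self term for nodes
        self.W_g2 = nn.Linear(d_g, d_g, bias=False)   # gated neighbor aggregation
        self.W_z1 = nn.Linear(d_g, d_g, bias=False)   # edge self term
        self.W_z2 = nn.Linear(d_g, d_g, bias=False)   # source-node term for edges
        self.W_z3 = nn.Linear(d_g, d_g, bias=False)   # neighbor-node term for edges
        self.bn_g = nn.BatchNorm1d(d_g)
        self.bn_z = nn.BatchNorm1d(d_g)

    def forward(self, g, z, neighbors):
        # g: (N, d_g) node tokens, z: (N, k, d_g) edge tokens, neighbors: (N, k) indices.
        gate = torch.sigmoid(z)
        psi = gate / (gate.sum(dim=1, keepdim=True) + 1e-8)   # relative connection weights

        g_nb = g[neighbors]                                   # (N, k, d_g) neighbor tokens
        agg = (psi * self.W_g2(g_nb)).sum(dim=1)              # edge-gated aggregation
        g_new = g + torch.relu(self.bn_g(self.W_g1(g) + agg))

        edge_pre = self.W_z1(z) + self.W_z2(g)[:, None, :] + self.W_z3(g_nb)
        z_new = z + torch.relu(
            self.bn_z(edge_pre.reshape(-1, z.size(-1))).reshape(z.shape)
        )
        return g_new, z_new
```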

4.2. Decoder

At each decoding step, the decoder generates a probability distribution over candidate nodes based on the set of output node tokens $\{g_i^{(L)}\}_{i \in \mathcal{V}}$ and the global graph token $g_{\mathrm{graph}}$, both derived from the GCM-based encoder. Unlike traditional DRL methods that utilize transformer-based decoders with heavy computational complexity, we introduce an attention-driven, computationally lightweight decoder that selects nodes in a step-wise manner, improving efficiency without compromising performance.
During decoding, the model constructs a context token $c_t$ by concatenating the global graph representation $g_{\mathrm{graph}}$, the depot (starting node) token $g_{a_0}$, and the token of the last visited node $g_{a_{t-1}}$ at the current time t. This context token effectively integrates global structural information with localized node-specific features, ensuring a well-informed selection process.

$$c_t = \mathrm{Concatenate}\big(g_{\mathrm{graph}}, g_{a_0}, g_{a_{t-1}}\big).$$
To facilitate route construction, the decoder employs a single-head attention-driven compatibility layer, which computes the probability distribution over candidate nodes, guiding the next-step selection in a computationally efficient manner.
$$\delta_i^t = C \cdot \tanh\left(\frac{\big(c_t W_c^Q\big)\big(g_i W_c^K\big)^{\mathsf{T}}}{\sqrt{d_g}}\right),$$

$$p_\theta(a_t = i \mid s_t) = \frac{\exp\big(\delta_i^t\big)}{\sum_{i'} \exp\big(\delta_{i'}^t\big)},$$

where $W_c^Q$ and $W_c^K$ are trainable parameter matrices (the query projection acts on the concatenated context token), and C is a clipping coefficient that bounds the compatibility scores within $[-C, C]$. The overall decoding process is depicted in Figure 2.
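A minimal sketch of a single decoding step is given below, assuming the encoder outputs described above; the function name, the masking interface, and the tensor shapes (in particular, a query projection sized to the concatenated context) are our assumptions rather than the authors' exact implementation.

```python
import torch

def decode_step(node_tokens, graph_token, depot_token, last_token, W_Q, W_K, mask, C=10.0):
    """Single-head compatibility layer producing a distribution over candidate nodes.

    node_tokens: (N+1, d_g) final node embeddings g_i^(L)
    graph_token, depot_token, last_token: (d_g,) components of the context token
    W_Q: (3*d_g, d_g) query projection, W_K: (d_g, d_g) key projection
    mask: (N+1,) boolean tensor, True for infeasible (visited / constraint-violating) nodes
    """
    d_g = node_tokens.size(-1)
    context = torch.cat([graph_token, depot_token, last_token], dim=-1)  # (3*d_g,)

    query = context @ W_Q                      # (d_g,)
    keys = node_tokens @ W_K                   # (N+1, d_g)

    # Clipped compatibility scores, then masking and softmax over feasible nodes.
    scores = C * torch.tanh(keys @ query / d_g ** 0.5)
    scores = scores.masked_fill(mask, float('-inf'))
    probs = torch.softmax(scores, dim=-1)
    return probs
```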

4.3. Training

The proposed approach encompasses two neural networks: a policy network p θ and a baseline network b ( s ) . The policy network governs node selection by generating a probability distribution over possible actions based on the real-time state. The baseline network, which shares the same architecture as the policy network, uses a greedy rollout strategy to calculate the reward by selecting the node with the highest probability. Overall, the policy network p θ , parameterized by θ , is optimized using gradient descent with the REINFORCE algorithm as follows:
$$\mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(A \mid s)}\big[R(A)\big],$$

$$\nabla \mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(A \mid s)}\Big[\big(R(A) - b(s)\big)\, \nabla \log p_\theta(A \mid s)\Big].$$
The baseline network serves to enhance training stability and accelerate convergence. When the performance of the policy network p θ surpasses that of the baseline network b ( s ) by a significant margin, the parameters of b ( s ) are updated to match those of p θ . Through iterative training, this strategy enables the policy network to progressively refine its decision-making capability, ultimately yielding high-quality solutions for the VRP-PD.
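The training procedure can be summarized by the hedged sketch below; `rollout`, `sample_batch`, and `policy_better_than` are illustrative placeholders for the route-construction loop, the instance generator, and the check that decides when to refresh the baseline, none of which are specified in code by the paper.

```python
import copy
import torch

def train_epoch(policy, baseline, optimizer, sample_batch, n_batches=1000):
    """One REINFORCE epoch with a greedy-rollout baseline (illustrative sketch)."""
    for _ in range(n_batches):
        instances = sample_batch()                      # a batch of VRP-PD instances

        # Sampling rollout with the policy network; returns tour lengths and log-probs.
        lengths, log_probs = rollout(policy, instances, greedy=False)

        # Greedy rollout with the frozen baseline network (no gradients needed).
        with torch.no_grad():
            baseline_lengths, _ = rollout(baseline, instances, greedy=True)

        # REINFORCE: advantage-weighted log-likelihood of the sampled routes;
        # the advantage is treated as a constant when differentiating.
        advantage = (lengths - baseline_lengths).detach()
        loss = (advantage * log_probs).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # If the policy now clearly outperforms the baseline, copy its weights over.
    if policy_better_than(policy, baseline):
        baseline.load_state_dict(copy.deepcopy(policy.state_dict()))
```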

5. Experiments

In this section, we verify the effectiveness of the proposed approach on the conventional and improved VRP-PD. First, we detail the experimental settings and hyperparameters of the proposed approach. Then, we compare GDRL with a range of heuristic and DRL-based methods on problems with varying scales and distributions. Finally, we evaluate the generalization ability of the proposed approach.

5.1. Experimental Settings

5.1.1. Datasets

To evaluate the proposed approach, we follow the convention of [24,25,29] and conduct experiments on both synthetically generated datasets and real-world benchmarks. For synthetic datasets, we generate instances with 20, 50, 100, and 150 nodes, where customer node coordinates are uniformly sampled from the unit square $[0, 1] \times [0, 1]$ and the depot is located at the origin. Delivery node demands are drawn from the integer set $\{1, \ldots, 9\}$, while pickup node demands are assigned inversely to maintain a 1:1 ratio between pickup and delivery nodes. Regarding real-world benchmarks, we employ datasets from [30] to benchmark conventional VRP-PD solutions and utilize instances from [29] to evaluate the improved VRP-PD. These real-world datasets exhibit significantly different node distributions and demand structures compared to the synthetic instances.
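As a concrete illustration of this protocol, the sketch below samples one synthetic instance; the helper name, the interpretation of "inverse" pickup demands as sign-flipped copies of the paired delivery demands, and the zero depot demand are our assumptions.

```python
import numpy as np

def generate_instance(n_customers=100, rng=None):
    """Sample one synthetic VRP-PD instance (illustrative sketch).

    Half of the customers are delivery (linehaul) nodes with demands in {1, ..., 9};
    the other half are pickup (backhaul) nodes whose demands mirror them with
    opposite sign, giving a 1:1 pickup/delivery ratio.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_half = n_customers // 2

    coords = rng.uniform(0.0, 1.0, size=(n_customers, 2))    # customers in the unit square
    depot = np.zeros((1, 2))                                  # depot at the origin

    delivery_demands = rng.integers(1, 10, size=n_half)       # integers in {1, ..., 9}
    pickup_demands = -delivery_demands                        # paired opposite-sign demands
    demands = np.concatenate([[0], delivery_demands, pickup_demands])  # depot demand is 0

    return np.vstack([depot, coords]), demands
```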

5.1.2. Training and Testing

For both the conventional and improved VRP-PD, we train the model for 100 epochs, with each epoch comprising 10,000 batches of 128 instances. Training datasets consist of 10,000 independently generated instances per problem size, while evaluation is performed on a test set of 200 instances. Node tokens and edge tokens are embedded into $d_g = 128$-dimensional vectors through linear projection. The encoder comprises $L = 3$ stacked GCM layers, and the compatibility scores in the decoder are clipped within $[-10, 10]$ (i.e., $C = 10$). Model parameters are optimized using the Adam optimizer. All implementations are executed in Python 3.11.0 on an Intel Core i7-14700KF CPU and an NVIDIA GeForce RTX 4070Ti GPU (Foxconn, Houston, TX, USA).
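For quick reference, the stated hyperparameters can be collected into a single configuration, as in this illustrative snippet (the key names are ours; the values follow the text above).

```python
# Illustrative hyperparameter summary; key names are ours, values follow the text above.
CONFIG = {
    "epochs": 100,
    "batch_size": 128,
    "embedding_dim": 128,      # d_g
    "gcm_layers": 3,           # L
    "tanh_clipping": 10.0,     # C, compatibility scores clipped to [-10, 10]
    "optimizer": "Adam",
    "test_instances": 200,
}
```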

5.2. Comparison Analysis

5.2.1. Baselines

We compare the proposed GDRL with a variety of heuristic and DRL-based methods as follows:
  • CPLEX [31]: a powerful commercial optimizer for mathematical programming problems. We employ CPLEX 20.1 to solve VRP-PD and set the maximum computing time to 3600 s (1 h).
  • LKH [23]: a well-known efficient solver for TSP, CVRP, and their variants. Similarly, we adopt LKH to handle test instances with a maximum time limit of 3600 s.
  • GA [32]: an evolutionary computation-driven genetic algorithm that utilizes mutation and crossover operators to yield solutions for VRPs.
  • SA [33]: the simulated annealing approach that models the VRP-PD as a mixed-integer programming issue and delivers satisfactory results for routing.
  • ACO [34]: an ant colony optimization method leveraging swarm intelligence to search for high-quality route solutions to the conventional and improved VRP-PD.
  • ALNS [35]: an adaptive local neighborhood search algorithm for various routing problems.
  • AM [24]: an attention model that leverages a transformer-based encoder–decoder framework to tackle TSPs, CVRPs, and VRP-PDs and yield high-quality solutions.
  • POMO [25]: a DRL-based method that learns diverse multiple optimal policies and achieves state-of-the-art routing performance on various VRPs.
  • FER [26]: a learning-based approach that leverages a novel encoder–refiner–decoder framework for feature embedding and route construction. It exhibits superior solutions for classical VRPs.

5.2.2. Comparison

The comparison results among heuristic and DRL-based methods on the conventional and improved VRP-PDs are presented in Table 2 and Table 3. We report the average tour length, optimality gap, and average computational time over test instances as evaluation metrics to verify the effectiveness of baseline methods. In particular, the optimality gap is calculated as the relative difference between the average tour length L a v e r of the method and the best-known solution length L b e s t obtained by all comparative methods.
$$\mathrm{Gap} = \frac{L_{\mathrm{aver}} - L_{\mathrm{best}}}{L_{\mathrm{best}}} \times 100\%.$$
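In code, this metric reduces to a one-line helper (the function name is ours):

```python
def optimality_gap(avg_length, best_length):
    """Relative gap (%) between a method's average tour length and the best-known length."""
    return (avg_length - best_length) / best_length * 100.0
```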
Regarding the conventional VRP-PD, CPLEX achieves the optimal tour length on small-scale problems with 20 nodes. However, as the problem scale increases, the solution quality of CPLEX drops significantly within the limited computational time, especially for the problem with 150 nodes. In contrast, LKH, a powerful heuristic routing solver, exhibits the best performance across all four problem scales and therefore provides the baseline for gap calculation. Among the heuristic and DRL-based methods, GDRL attains the most competitive routing performance. Taking the conventional VRP-PD with 100 nodes as an example, GDRL achieves an average length of 17.94, which is 19.17, 17.22, 1.29, and 19.04 shorter than ACO, ALNS, POMO, and FER, respectively. For N = 150, the proposed approach obtains an optimality gap of 18.28%, outperforming the second-best model (i.e., POMO) by a reduction of 5.81 percentage points.
As for the improved VRP-PD, CPLEX provides the best tour length on problems with 20, 50, and 100 nodes, while the proposed approach achieves the best result on the problem with 150 nodes. Among the heuristic methods, ALNS delivers the strongest results for N = 20, 50, 100, and 150, but it remains inferior to the proposed GDRL, with traveling lengths longer by 4.02, 8.56, 17.22, and 17.59, respectively. In comparison with the second-best method, POMO, GDRL achieves average tour lengths of 4.68, 7.56, 11.36, and 13.06 on problems with 20, 50, 100, and 150 nodes, decreasing the optimality gap by 3.14, 5.46, 4.70, and 15.39 percentage points, respectively. Regarding solving speed, the proposed approach reduces the computational time by an order of magnitude compared with the DRL-based AM (sampling) and FER, while also achieving the best routing performance on all four problem sizes. To more intuitively demonstrate the superiority of our solutions, we visualize the detailed traversal routes of POMO and the proposed GDRL on two exemplary instances with 100 and 150 nodes, as shown in Figure 3. It can be observed that GDRL attains shorter routes and avoids inefficient pickups and deliveries. These results convincingly demonstrate the benefit of the innovative designs in GDRL.

5.3. Generalization Study

It is computationally impractical to train models from scratch for every unseen VRP-PD. Therefore, it is of great importance to evaluate the generalization ability of learning-based methods on routing problems with larger scales and different distributions. To be more specific, we apply the model trained on VRP-PD with 100 nodes to tackle three types of problems: (1) larger-scale uniform-distribution problems with 200, 250, and 300 nodes; (2) same-scale problems with different node distributions; (3) real-world benchmark datasets with varying node numbers and distributions.

5.3.1. Cross-Size Generalization

Regarding cross-size generalization, we generate 200 instances following the same uniform distribution with 200, 250, and 300 nodes, respectively, and leverage the model trained on problems with 100 nodes to tackle them. The comparison results for the conventional and improved VRP-PDs are shown in Figure 4a and Figure 5a. For both variants, the AM models with greedy and sampling rollouts perform poorly on large-scale problems since they rely on a relatively heavyweight decoder. By contrast, FER performs moderately well, but its solution quality is still inferior to POMO and the proposed GDRL. The average tour lengths of GDRL on the conventional VRP-PD with 200, 250, and 300 nodes are 25.11, 28.35, and 31.85, respectively, which are 7.37%, 7.30%, and 7.28% lower than those of POMO. Moreover, the proposed GDRL exhibits a length reduction of 5.03% in comparison with the state-of-the-art POMO on the improved VRP-PD with 300 nodes.

5.3.2. Cross-Distribution Generalization

In this case, we use the model trained on problems with uniform nodal distributions with 100 nodes to handle 200 instances following the Gaussian, gamma, and beta distributions, respectively, each with 100 nodes. The cross-distribution generalization results are presented in Figure 4b and Figure 5b. Pertaining to the conventional VRP-PD with a Gaussian distribution, the proposed GDRL attains an average tour length of 19.51, which is 63.05, 28.95, 1.34, and 15.93 lower than that of AM (Greedy), AM (Sampling), POMO, and FER, respectively. As for the improved VRP-PD, the proposed GDRL achieves an average length of 10.78, 12.34, and 13.08 on instances with Gaussian, gamma, and beta distribution, outstripping the second-best method, POMO, by a reduction of 3.90%, 6.56%, and 7.65%, respectively.

5.3.3. Generalization on Real-World Benchmarks

We also employ real-world benchmark datasets following prevailing studies [30,36] to verify the generalization ability of the proposed approach. In particular, we apply the model trained on problems with 150 nodes to tackle these instances in the benchmark dataset. As for the conventional VRP-PD, the instances are derived from the literature [30] based on CVRPLib. As shown in Table 4, GDRL attains the best routing performance among a range of heuristic and DRL-based methods. GDRL achieves a length reduction of 39.09% and 8.33% when compared with the competitive heuristic ALNS and state-of-the-art DRL-based POMO. Regarding the improved VRP-PD with benchmark instances used in [36], the proposed approach shows excellent routing performance with high effectiveness and efficiency, as illustrated in Table 5. GDRL obtains an average tour length of 28,259.75, outperforming DRL-based AM (Greedy), AM (Sampling), and POMO by 139.90%, 115.09%, and 15.38%, respectively.

5.4. Effect of Sampling Size

The proposed approach leverages a sampling rollout to facilitate exploration of the solution space, thereby leading to more satisfactory results. To evaluate the influence of sampling size, we apply the same sampling rollout to AM and POMO and report the comparison results on both the conventional and improved VRP-PDs with 100 nodes, as illustrated in Table 6 and Table 7. Specifically, we set the sampling sizes to 320, 640, 1280, and 2560 using the respective trained policy networks. As for the conventional VRP-PD, the proposed GDRL with a sampling size of 1280 attains an average tour length of 17.64, outperforming AM and POMO by 194.06% and 2.95%, respectively. Pertaining to the improved VRP-PD, GDRL outstrips POMO with gap reductions of 2.52%, 3.00%, 3.14%, and 3.17% when the sampling size is set to 320, 640, 1280, and 2560, respectively.

6. Conclusions

In this study, we concentrate on two variants of pickup and delivery problems, i.e., the conventional VRP-PD and the improved VRP-PD, which have broad applications in logistics, robotics, and so on. We propose a novel deep reinforcement learning approach, which engages an encoder–decoder framework, to tackle them. The encoder utilizes stacked GCMs to learn hierarchical graph features and generate informative tokens of nodes and edges. Afterwards, the decoder uses an attention-driven compatibility layer with lightweight computation to output a probability distribution for the next node selection. Extensive experiments on synthetic and real-world benchmark datasets demonstrate that the proposed GDRL outperforms a variety of heuristic and DRL-based algorithms on problems with different scales and node distributions.
Despite its demonstrated effectiveness, the proposed approach presents two notable limitations. First, the GDRL encoder relies on stacked graph convolutional layers to capture node and edge information. While effective, this architecture demands substantial computational resources, resulting in suboptimal runtimes for large-scale instances involving more than 500 customer nodes. Second, the current framework is specifically designed for VRP-PD scenarios and does not readily extend to more complex variants, such as those involving service time windows. Future efforts will focus on generalizing the approach to address more realistic and operationally relevant routing problems, including vehicle routing with time windows and electric vehicle routing with charging constraints.

Author Contributions

Conceptualization, D.Y. and Q.G.; methodology, Q.G.; software, B.O.; validation, D.Y., Q.G. and B.O.; formal analysis, B.Y.; investigation, B.O.; resources, H.C.; data curation, B.O.; writing—original draft preparation, Q.G.; writing—review and editing, D.Y.; visualization, B.Y.; supervision, H.C.; project administration, H.C.; funding acquisition, D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant No. 62306232, the Natural Science Basic Research Program of Shaanxi Province under Grant No. 2023-JC-QN-0662, and the State Key Laboratory of Electrical Insulation and Power Equipment under Grant No. EIPE23416.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

We thank Xi’an Jiaotong University for helping us with the Article Processing Charge for publication of the article in Open Access.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jiang, J.; Dai, Y.; Yang, F.; Ma, Z. A multi-visit flexible-docking vehicle routing problem with drones for simultaneous pickup and delivery services. Eur. J. Oper. Res. 2024, 312, 125–137.
  2. Yin, X.; Qian, Y.; Vardar, A.; Günther, M.; Müller, F.; Laleni, N.; Zhao, Z.; Jiang, Z.; Shi, Z.; Shi, Y.; et al. Ferroelectric compute-in-memory annealer for combinatorial optimization problems. Nat. Commun. 2024, 15, 2419.
  3. Guan, Q.; Cao, H.; Zhong, X.; Yan, D.; Xue, S. Hisom: Hierarchical Self-Organizing Map for Solving Multiple Traveling Salesman Problems. Networks 2025.
  4. Zhou, S.; Zhang, D.; Yuan, W.; Wang, Z.; Zhou, L.; Bell, M.G. Pickup and delivery problem with electric vehicles and time windows considering queues. Transp. Res. Part C Emerg. Technol. 2024, 167, 104829.
  5. Guan, Q.; Cao, H.; Jia, L.; Yan, D.; Chen, B. Synergetic attention-driven transformer: A deep reinforcement learning approach for vehicle routing problems. Expert Syst. Appl. 2025, 274, 126961.
  6. Li, B.; Ma, H. Double-deck multi-agent pickup and delivery: Multi-robot rearrangement in large-scale warehouses. IEEE Robot. Autom. Lett. 2023, 8, 3701–3708.
  7. Xiang, C.; Wu, Z.; Tu, J.; Huang, J. Centralized deep reinforcement learning method for dynamic multi-vehicle pickup and delivery problem with crowdshippers. IEEE Trans. Intell. Transp. Syst. 2024, 25, 9253–9267.
  8. Guan, Q.; Hong, X.; Ke, W.; Zhang, L.; Sun, G.; Gong, Y. Kohonen self-organizing map based route planning: A revisit. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; IEEE: New York, NY, USA, 2021; pp. 7969–7976.
  9. Hao, Y.; Chen, Z.; Sun, X.; Tong, L. Planning of truck platooning for road-network capacitated vehicle routing problem. Transp. Res. Part E Logist. Transp. Rev. 2025, 194, 103898.
  10. Wei, J.; Poon, M.; Zhang, Z. A branch-and-cut algorithm for the time-dependent vehicle routing problem with time windows and combinatorial auctions. Comput. Oper. Res. 2024, 172, 106807.
  11. Su, Y.; Zhang, S.; Zhang, C. A lightweight genetic algorithm with variable neighborhood search for multi-depot vehicle routing problem with time windows. Appl. Soft Comput. 2024, 161, 111789.
  12. Ma, X.; Liu, C. Improved Ant Colony Algorithm for the Split Delivery Vehicle Routing Problem. Appl. Sci. 2024, 14, 5090.
  13. Rodríguez-Esparza, E.; Masegosa, A.D.; Oliva, D.; Onieva, E. A new hyper-heuristic based on adaptive simulated annealing and reinforcement learning for the capacitated electric vehicle routing problem. Expert Syst. Appl. 2024, 252, 124197.
  14. He, H.; Meng, X.; Wang, Y.; Khajepour, A.; An, X.; Wang, R.; Sun, F. Deep reinforcement learning based energy management strategies for electrified vehicles: Recent advances and perspectives. Renew. Sustain. Energy Rev. 2024, 192, 114248.
  15. Li, X.K.; Ma, J.X.; Li, X.Y.; Hu, J.J.; Ding, C.Y.; Han, F.K.; Guo, X.M.; Tan, X.; Jin, X.M. High-efficiency reinforcement learning with hybrid architecture photonic integrated circuit. Nat. Commun. 2024, 15, 1044.
  16. Wang, Y.; Hong, X.; Wang, Y.; Zhao, J.; Sun, G.; Qin, B. Token-based deep reinforcement learning for heterogeneous VRP with service time constraints. Knowl. Based Syst. 2024, 300, 112173.
  17. Bogyrbayeva, A.; Meraliyev, M.; Mustakhov, T.; Dauletbayev, B. Machine learning to solve vehicle routing problems: A survey. IEEE Trans. Intell. Transp. Syst. 2024, 25, 4754–4772.
  18. Baldacci, R.; Bartolini, E.; Mingozzi, A. An exact algorithm for the pickup and delivery problem with time windows. Oper. Res. 2011, 59, 414–426.
  19. Li, C.; Gong, L.; Luo, Z.; Lim, A. A branch-and-price-and-cut algorithm for a pickup and delivery problem in retailing. Omega 2019, 89, 71–91.
  20. Hernández-Pérez, H.; Salazar-González, J.J. A branch-and-cut algorithm for the split-demand one-commodity pickup-and-delivery travelling salesman problem. Eur. J. Oper. Res. 2022, 297, 467–483.
  21. Karaoglan, I.; Altiparmak, F.; Kara, I.; Dengiz, B. The location-routing problem with simultaneous pickup and delivery: Formulations and a heuristic approach. Omega 2012, 40, 465–477.
  22. Olgun, B.; Koç, Ç.; Altıparmak, F. A hyper heuristic for the green vehicle routing problem with simultaneous pickup and delivery. Comput. Ind. Eng. 2021, 153, 107010.
  23. Helsgaun, K. An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems. Roskilde Rosk. Univ. 2017, 12, 966–980.
  24. Kool, W.; van Hoof, H.; Welling, M. Attention, Learn to Solve Routing Problems! In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019; pp. 1–12.
  25. Kwon, Y.D.; Choo, J.; Kim, B.; Yoon, I.; Gwon, Y.; Min, S. POMO: Policy optimization with multiple optima for reinforcement learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21188–21198.
  26. Li, J.; Ma, Y.; Cao, Z.; Wu, Y.; Song, W.; Zhang, J.; Chee, Y.M. Learning feature embedding refiner for solving vehicle routing problems. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 15279–15291.
  27. Zhang, Z.; Wu, Z.; Zhang, H.; Wang, J. Meta-learning-based deep reinforcement learning for multiobjective optimization problems. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 7978–7991.
  28. Ghaziri, H.; Osman, I.H. Self-organizing feature maps for the vehicle routing problem with backhauls. J. Sched. 2006, 9, 97–114.
  29. Wang, C.; Cao, Z.; Wu, Y.; Teng, L.; Wu, G. Deep reinforcement learning for solving vehicle routing problems with backhauls. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 4779–4793.
  30. Queiroga, E.; Frota, Y.; Sadykov, R.; Subramanian, A.; Uchoa, E.; Vidal, T. On the exact solution of vehicle routing problems with backhauls. Eur. J. Oper. Res. 2020, 287, 76–89.
  31. Nickel, S.; Steinhardt, C.; Schlenker, H.; Burkart, W.; Reuter-Oppermann, M. IBM ILOG CPLEX Optimization Studio. In Angewandte Optimierung mit IBM ILOG CPLEX Optimization Studio: Modellierung von Planungs- und Entscheidungsproblemen des Operations Research mit OPL; Springer: Berlin/Heidelberg, Germany, 2021; pp. 9–23.
  32. Tasan, A.S.; Gen, M. A genetic algorithm based approach to vehicle routing problem with simultaneous pick-up and deliveries. Comput. Ind. Eng. 2012, 62, 755–761.
  33. Wang, C.; Mu, D.; Zhao, F.; Sutherland, J.W. A parallel simulated annealing method for the vehicle routing problem with simultaneous pickup–delivery and time windows. Comput. Ind. Eng. 2015, 83, 111–122.
  34. Sayyah, M.; Larki, H.; Yousefikhoshbakht, M. Solving the vehicle routing problem with simultaneous pickup and delivery by an effective ant colony optimization. J. Ind. Eng. Manag. Stud. 2016, 3, 15–38.
  35. Chaharsooghi, S.; Momayezi, F.; Ghaffarinasab, N. An adaptive large neighborhood search heuristic for solving the reliable multiple allocation hub location problem under hub disruptions. Int. J. Ind. Eng. Comput. 2017, 8, 191–202.
  36. Achamrah, F.E. Modelling and Solving Complex Vehicle Routing Problems with Integrated Management of Shared Inventories in Supply Chains. Ph.D. Thesis, Université Paris-Saclay, Orsay, France, 2022.
Figure 1. The encoding procedure of the proposed GDRL.
Figure 2. The decoding process of the proposed GDRL.
Figure 3. Representative routing results of some exemplary instances for the improved VRP-PD. For the 100-node case, the proposed method achieves an average traversal length of 9.16 (a) compared to 11.68 by POMO (c). When scaled to 150 nodes, the proposed method maintains competitive performance with an average route length of 11.68 (b), while POMO reaches 11.81 (d). These results demonstrate the scalability and solution quality of the proposed approach across different problem sizes.
Figure 4. Generalization results on conventional VRP-PDs.
Figure 5. Generalization results on improved VRP-PDs.
Table 1. The definition of parameters and decision variables in the VRP-PD model.

Parameter | Definition
$\mathcal{G}$ | Whole graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$.
$\mathcal{V}$ | Set of all nodes.
$\mathcal{E}$ | Set of edges.
$\mathcal{L}$ | Set of delivery (linehaul) nodes.
$\mathcal{B}$ | Set of pickup (backhaul) nodes.
$\mathcal{V}'$ | Set of all nodes except the depot, i.e., $\mathcal{V}' = \mathcal{L} \cup \mathcal{B}$.
$\mathcal{K}$ | Set of all vehicles.
$D$ | Maximum capacity of the vehicle.
$q_i$ | Demand of the pickup (delivery) node $i$.
$d_{ij}^{k}$ | Traveling cost (length) for vehicle $k$ to travel from node $i$ to node $j$.
$u_{ij}^{k}$ | Remaining capacity of vehicle $k$ when traveling from node $i$ to node $j$.
$e_{ij}^{k}$ | Binary decision variable indicating whether vehicle $k$ travels from node $i$ to node $j$.
Table 2. Comparison results on conventional VRP-PDs. Each cell reports Length / Gap / Time (s).

Method | N = 20 | N = 50 | N = 100 | N = 150
CPLEX [31] | 6.59 / 0.00% / 8.04 | 14.50 / 38.36% / 3600 | 26.19 / 66.92% / 3600 | 38.07 / 89.12% / 3600
LKH [23] | 6.59 / 0.00% / 0.07 | 10.48 / 0.00% / 345.06 | 15.69 / 0.00% / 3600 | 20.13 / 0.00% / 3600
GA [32] | 15.01 / 127.77% / 34.82 | 107.40 / 924.81% / 61.82 | 324.92 / 1970.87% / 127.98 | 445.07 / 2110.98% / 210.30
SA [33] | 11.53 / 74.96% / 31.33 | 21.70 / 107.06% / 55.64 | 38.66 / 146.40% / 115.18 | 45.22 / 124.64% / 189.27
ACO [34] | 11.19 / 69.80% / 27.16 | 20.94 / 99.81% / 48.22 | 37.11 / 136.52% / 99.82 | 43.52 / 116.19% / 164.03
ALNS [35] | 10.78 / 63.58% / 0.315 | 19.99 / 90.74% / 0.43 | 35.16 / 124.09% / 1.14 | 41.40 / 105.66% / 2.13
AM (Greedy) [24] | 22.97 / 248.56% / 0.13 | 51.69 / 393.23% / 0.27 | 60.97 / 288.59% / 0.51 | 87.42 / 334.28% / 0.65
AM (Sampling) [24] | 8.41 / 27.62% / 145.07 | 18.25 / 74.14% / 913.17 | 55.12 / 251.31% / 3600 | 78.58 / 290.36% / 3600
POMO [25] | 7.77 / 17.91% / 0.16 | 12.48 / 19.08% / 0.67 | 19.23 / 22.56% / 3.48 | 24.98 / 24.09% / 9.78
FER [26] | 8.78 / 33.23% / 31.67 | 18.37 / 75.28% / 218.37 | 36.98 / 135.69% / 1003.06 | 51.33 / 105.66% / 3600
GDRL | 6.76 / 2.58% / 0.22 | 11.43 / 9.06% / 0.82 | 17.94 / 14.34% / 4.09 | 23.81 / 18.28% / 11.67
Table 3. Comparison results on improved VRP-PDs. Each cell reports Length / Gap / Time (s).

Method | N = 20 | N = 50 | N = 100 | N = 150
CPLEX [31] | 4.45 / 0.00% / 6.51 | 7.15 / 0.00% / 3600 | 10.41 / 0.00% / 3600 | 17.74 / 89.12% / 3600
LKH [23] | 5.77 / 29.66% / 0.07 | 9.72 / 35.94% / 345.06 | 14.17 / 36.12% / 3600 | 18.02 / 37.98% / 3600
GA [32] | 6.87 / 54.38% / 19.27 | 76.58 / 971.05% / 282.75 | 62.28 / 498.27% / 3600 | 152.72 / 1069.37% / 3600
SA [33] | 7.86 / 76.63% / 17.34 | 17.12 / 139.44% / 254.46 | 33.66 / 223.34% / 3600 | 43.13 / 230.25% / 3600
ACO [34] | 7.62 / 71.24% / 15.03 | 16.44 / 129.93% / 220.55 | 32.09 / 208.26% / 3600 | 41.10 / 214.70% / 3600
ALNS [35] | 7.34 / 64.94% / 0.22 | 15.60 / 118.18% / 1.30 | 30.12 / 189.34% / 1.90 | 38.55 / 195.18% / 3.71
AM (Greedy) [24] | 10.38 / 133.26% / 0.02 | 23.46 / 228.11% / 0.07 | 53.96 / 418.35% / 0.09 | 86.76 / 564.32% / 0.13
AM (Sampling) [24] | 5.01 / 12.58% / 31.11 | 8.86 / 23.92% / 69.88 | 48.90 / 369.74% / 103.07 | 73.86 / 465.54% / 193.33
POMO [25] | 4.82 / 8.31% / 0.10 | 7.95 / 11.19% / 0.17 | 11.85 / 13.83% / 0.29 | 15.07 / 15.39% / 0.45
FER [26] | 6.60 / 48.31% / 17.68 | 12.85 / 79.72% / 29.85 | 24.11 / 131.60% / 63.55 | 30.35 / 132.39% / 111.24
GDRL | 4.68 / 5.17% / 0.23 | 7.56 / 5.73% / 0.85 | 11.36 / 9.13% / 4.27 | 13.06 / 0.00% / 12.23
Table 4. Routing results on benchmark instances in conventional VRP-PDs.

Method | GA | SA | ACO | ALNS | AM (Greedy) | AM (Sampling) | POMO | GDRL
Length | 63,066.28 | 44,081.06 | 41,092.51 | 37,356.83 | 60,495.22 | 54,237.10 | 29,094.36 | 26,857.70
Time (s) | 98.87 | 88.98 | 77.12 | 13.24 | 7.38 | 885.49 | 15.52 | 17.95
Table 5. Routing results on benchmark instances of improved VRP-PDs.

Method | GA | SA | ACO | ALNS | AM (Greedy) | AM (Sampling) | POMO | GDRL
Length | 70,677.72 | 49,401.19 | 46,051.95 | 41,865.41 | 67,796.36 | 60,783.23 | 32,605.75 | 28,259.75
Time (s) | 103.61 | 93.25 | 80.82 | 13.88 | 7.73 | 927.98 | 16.26 | 18.81
Table 6. Routing results of different sampling sizes on conventional VRP-PDs. Each cell reports Length / Gap / Time (s).

Method | Sampling Size 320 | 640 | 1280 | 2560
AM | 53.69 / 200.28% / 1072.76 | 52.98 / 198.82% / 2000.35 | 51.89 / 194.16% / 3600 | 49.85 / 182.76% / 3600
POMO | 18.46 / 3.24% / 44.02 | 18.29 / 3.16% / 185.43 | 18.16 / 2.95% / 811.15 | 18.03 / 2.27% / 3600
GDRL | 17.88 / 0.00% / 57.15 | 17.73 / 0.00% / 211.79 | 17.64 / 0.00% / 1050.38 | 17.63 / 0.00% / 3600
Table 7. Routing results of different sampling sizes on improved VRP-PDs. Each cell reports Length / Gap / Time (s).

Method | Sampling Size 320 | 640 | 1280 | 2560
AM | 49.56 / 442.83% / 25.67 | 49.12 / 445.17% / 51.28 | 48.72 / 446.80% / 102.94 | 48.19 / 445.14% / 244.38
POMO | 9.36 / 2.52% / 0.56 | 9.28 / 3.00% / 0.91 | 9.19 / 3.14% / 1.46 | 9.12 / 3.17% / 2.53
GDRL | 9.13 / 0.00% / 0.72 | 9.01 / 0.00% / 1.33 | 8.91 / 0.00% / 2.05 | 8.84 / 0.00% / 3.95
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
