1. Introduction
The vehicle routing problem (VRP) was proposed by Dantzig and Ramser in 1959 [1]; it aims to find optimal routes subject to given constraints and objectives. The VRP is widely used in traffic management, logistics transportation, and other fields, and it has been studied extensively. Several variants have been proposed by applying the VRP in different environments, such as the capacitated vehicle routing problem (CVRP) and the vehicle routing problem with time windows (VRPTW). Many researchers have performed systematic literature reviews on the VRP.
Liu et al. [2] extracted data from and analyzed the literature published in 2018–2022 to address problems associated with the VRPTW. Li et al. [3] analyzed articles that solve VRPs with learning-based optimization (LBO) algorithms from various aspects, evaluated the performance of representative LBO algorithms, summarized the types of problems to which LBO is applicable, and proposed directions for improvement. Asghari et al. [4] investigated contributions related to the green VRP and proposed a classification scheme based on the variants considered in the scientific literature, providing a comprehensive and structured survey of the state of knowledge, discussing the most important features of the problems, and indicating future directions. Gunawan et al. [5] provided and categorized a list of VRP datasets. Salehi Sarbijan et al. [6] systematically reviewed and analyzed research on the VRP from 2001 to 2022, focusing on new and emerging topics. Starting from the basic VRP, Zhang et al. [7] classified VRPs according to their characteristics and practical applications, gave a unified description and mathematical model for each type of problem, and analyzed the solution methods for each type. Ni et al. [8] comprehensively analyzed and summarized the literature on VRPs from 1959 to 2022 using a knowledge graph, and classified VRP models and their solutions.
However, the complexity of the VRP increases as real-life constraints are added. The VRP is NP-hard, so finding the exact optimal solution is difficult. Methods proposed to solve the VRP therefore aim to find a near-optimal solution through an efficient computational approach.
VRPs for AGVs are solved using exact algorithms and intelligent evolutionary algorithms. The exact algorithms include the branch-and-bound method, integer linear programming, and dynamic programming, which are suitable for small-scale VRPs with simple structures. Soysal et al. [9] proposed a simulation-based restricted dynamic programming approach to solve a VRP that accounts for traffic emissions. Reihaneh [10] proposed a branch-and-price algorithm for the vehicle routing with demand allocation problem. However, exact algorithms are prone to falling into local optima or taking too long to compute when solving large-scale VRPs.
In contrast, intelligent evolutionary algorithms can find satisfactory near-optimal solutions in limited time when solving large-scale VRPs, and many researchers have therefore improved and designed such algorithms. Yong Wang et al. [11] proposed an improved multiobjective genetic algorithm with tabu search (IMOGA-TS), which combines local and global searches and improves the solutions at each iteration by TS, to solve a collaborative multidepot EV routing problem with time windows and shared charging stations (CMEVRPTW-SCS). Pierre [12] proposed a multiobjective genetic algorithm with stochastic partially optimized cyclic shift crossover to solve the multiobjective VRPTW.
A multiobjective model is often transformed into a single-objective model by weighting or other conversion methods. In the VRP, however, the overall objective is typically composed of multiple mutually exclusive objectives, so improving one objective may degrade the others. These objectives may not be linearly related and cannot always be combined into a single weighted objective. Single-objective models are therefore of limited use for the VRP.
Multiobjective optimization problems (MOPs) optimize multiple, typically mutually exclusive objectives at the same time. In recent decades, researchers have proposed many methods for solving MOPs, among which multiobjective evolutionary algorithms (MOEAs) are the mainstream. Jinlong Wang [13] proposed an enhanced algorithm based on SPEA2 (ESPEA) to solve the pickup vehicle scheduling problem with mixed steel storage (PVSP-MSS), which simultaneously optimizes the makespan of the pickup vehicles and the makespan of the steel logistics park. Yadian Geng [14] proposed an improved hyperplane-assisted evolutionary algorithm (IhpaEA) to solve the distributed hybrid flow shop scheduling problem (DHFS), which takes maximum completion time and energy consumption as objectives.
Based on their algorithmic strategies, most MOEAs can be classified into the following categories: (1) algorithms based on the dominance relationship, such as NSGA-II [15], SPEA [1], and PESA-II [16]; (2) algorithms based on decomposition, such as MOEA/D [17], MOEA/D-M2M [18], and RVEA [19]; and (3) algorithms based on performance indicators, such as IBEA [20], SMS-EMOA [21], and DNMOEA/HI [22].
These algorithms yield Pareto solutions. However, for MOPs, the number of Pareto solutions increases with the number of objectives, and in practical scenarios not all solutions are meaningful, especially for VRPs with uncertain times. Under uncertain times, the Pareto solutions most resistant to disturbance are needed, yet uncertainty in VRPs is rarely considered in these algorithms. To address this, robust optimization has been proposed, in which the researcher searches for the solutions with the strongest resistance to interference. Robust optimization [23] means that the solution and its performance remain relatively unchanged under uncertain conditions; a solution is usually evaluated under the most unfavorable uncertainty. Xia et al. [24] proposed a method that sequentially approaches the robust Pareto front (SARPF) from nominal Pareto points to solve MOPs with uncertainties. Jin et al. [25] introduced and discussed existing methods for dealing with different uncertainties and studied the relationships between different categories of uncertainty. Scheffermann et al. [26] compared NSGA-II with an improved predator–prey algorithm on the VRPTW with uncertain travel times. He et al. [27] used a robust multiobjective optimization evolutionary algorithm (RMOEA) to solve robust multiobjective optimization problems (RMOPs); it consists of two parts, multiobjective optimization using an improved NSGA-II and robust optimization that searches for a robust optimal frontier. The robust optimization in this paper is based on Monte Carlo simulation: the practical environment is simulated, and the results approach the practical situation through a large number of simulation experiments. The number of feasible Monte Carlo trials under the disturbance of the uncertainty coefficients is taken as the criterion for assessing the robustness of a solution.
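The feasible-trial criterion described above can be sketched as follows; the route encoding, the uniform disturbance model, and the example deadlines are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import random

def monte_carlo_feasible_count(travel_times, deadlines, n_trials=1000,
                               disturbance=0.2, seed=0):
    """Count how many simulated trials keep every arrival within its deadline.

    travel_times: nominal leg times along one route (depot -> s1 -> s2 -> ...).
    deadlines:    latest allowed arrival time at each station on the route.
    disturbance:  relative width of the uniform perturbation (an assumption).
    """
    rng = random.Random(seed)
    feasible = 0
    for _ in range(n_trials):
        t, ok = 0.0, True
        for leg, deadline in zip(travel_times, deadlines):
            # perturb each leg time by up to +/- disturbance (illustrative model)
            t += leg * (1.0 + rng.uniform(-disturbance, disturbance))
            if t > deadline:
                ok = False
                break
        if ok:
            feasible += 1
    return feasible

# a loose deadline should be feasible far more often than a tight one
loose = monte_carlo_feasible_count([10, 10], [15, 30])
tight = monte_carlo_feasible_count([10, 10], [10, 20])
```

A higher feasible count indicates a route that tolerates more disturbance, which is exactly the robustness criterion used later in Stage 2.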
In the VRP, the AGV operation time, the total travel distance of the AGVs, or the number of AGVs is often used as the objective, while just-in-time (JIT) material distribution is often ignored. The literature discusses the robustness of VRPs, but few papers combine robustness with JIT. JIT pursues just-in-time arrivals, and uncertain times can seriously hinder its realization, so combining JIT with robust optimization improves the possibility of achieving JIT.
To solve the above issues, in this paper, the multiobjective robust VRP model that simultaneously considers uncertainty and JIT strategy is proposed. In addition, an improved ACO algorithm with deep reinforcement learning for assembly workshops is proposed. This study aims to make the following contributions.
(1) A just-in-time-based multiobjective robust vehicle routing problem with time windows (JIT-RMOVRPTW) is constructed which simultaneously optimizes three objectives. The model takes into account the uncertain time by introducing uncertainty coefficients. By defining the robustness metric that combines uncertain time to measure the robustness of the solution, the robustness optimization is defined as a dedicated optimization objective that can filter robust solutions in a better way.
(2) The JIT-RMOVRPTW considers the two conflicting goals of JIT and robustness at the same time. In this model, the JIT strategy under uncertainty is proposed. The traditional time window is eliminated, and the JIT time, which is used as the deadline for the workstation, is proposed. The model eliminates the fixed departure time of AGVs, so that AGVs can flexibly adjust the depot time according to the JIT time to eliminate the waiting time at the workstation generated by early arrival of AGVs.
(3) A two-stage nondominated sorting ant colony algorithm with deep reinforcement learning (NSACOWDRL) is proposed to improve the robustness of the solutions by double screening. The NSACOWDRL divides the problem into two stages: a multiobjective optimization problem and a robust optimization problem. In the first stage, an initial search is performed with the robustness metric as one of the objectives. The niche preservation strategy based on reference points used in NSGA-III [28] is introduced into routing selection. The nondominance rank, the elite path strategy, and the max-min pheromone strategy are introduced into the pheromone update strategy. Furthermore, DDQN is introduced as a local search algorithm: the network is trained on the Pareto solutions, which guide its learning direction, after which the trained network participates in nondominated sorting, and a probabilistic transfer formula based on the network and the objectives is designed. In the second stage, the feasibility of the solutions is quantified through Monte Carlo simulation, and the solutions are then partitioned according to weights uniformly distributed in the solution space. Each partition is screened separately for robustness; the screened set forms the robust Pareto frontier while accounting for the diversity of solutions.
(4) The performance of the proposed algorithm is validated by comparative experiments. The paper also further discusses the effect of network structure on the performance of the algorithm.
The remainder of this paper is organized as follows: Section 2 introduces the background. Section 3 introduces the related VRP in the assembly workshop. Section 4 describes the NSACOWDRL in detail. Section 5 presents the experimental design and analysis. Finally, Section 6 gives conclusions and suggestions for future work.
3. Problem Description and Mathematical Models
3.1. Problem Description
In this paper, the multiobjective multi-AGV routing planning model JIT-RMOVRPTW is constructed. The VRP in the assembly workshop is described as follows:
Each workstation is identified by an index i; the workstations form the node set, and index 0 (as origin or destination) represents the raw material library. There are K AGVs with identical performance in the raw material library. Each AGV has a maximum load capacity Q, a fixed average travel speed, and an unlimited maximum travel distance. Materials are delivered in material carts pulled by the tail hook of an AGV; each AGV can pull at most a given number of carts, and each cart has a given maximum capacity. The assembly workshop contains a known number of workstations requiring a known number of materials in total. The correspondence between workstations and materials is known, as is the total number of material boxes required by each workstation. The position of each workstation is known, as is the distance between stations i and j. AGVs are dispatched to deliver materials to the workstations. During delivery, the transportation time is disturbed by random events such as traffic jams, so the travel time is uncertain; this disturbance time enters the model explicitly. In the JIT-RMOVRPTW, the traditional time window is replaced by a deadline by which the materials of each workstation must be delivered. The materials at each workstation must also be unloaded, and the unloading time is determined by the number of boxes delivered to the station.
The VRP problem in the assembly workshop is based on the following assumptions:
(1) There is only one raw material library in the entire workshop, and the raw material library is able to meet the material requirements of all the workstations.
(2) The material demand at each station is less than the load capacity of the AGV, and the unloading time of the AGV at each station is known.
(3) All the AGVs depart from the same raw material library and must return to the raw material library after unloading the transported material.
(4) AGV acceleration and deceleration, charging, etc., are not considered during vehicle operation.
(5) Materials are placed in standard material boxes, and the material requirements of each workstation cannot be split. The AGV transports materials through the material carts. The boxes of the materials required for the same workstation are loaded in the same material cart, which is not mixed with other workstation materials.
3.2. Mathematical Model
Based on the traditional VRPTW, the JIT-RMOVRPTW adds uncertain time constraints and designs a new metric alpha based on uncertain time as an objective to measure the robustness of the solution. In addition, the JIT-RMOVRPTW considers JIT delivery. JIT time is used as the deadline, and the AGV adjusts the departure time according to the JIT time of the workstation to achieve the objective of the JIT.
The JIT-RMOVRPTW can be formulated as follows:
The three objectives for the JIT-RMOVRPTW, each of which needs to be minimized, can be stated as follows:
Objective 1 (total travel distance):
Objective 2 (robustness metric):
Objective 3 (JIT penalty time):
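As a concrete illustration of the first objective, the total travel distance of a routing plan could be evaluated as below; the route encoding (node 0 for the raw material library) and the distance-matrix layout are assumptions for this sketch.

```python
def total_travel_distance(routes, dist):
    """Objective 1: sum of distances over every AGV tour; node 0 is the
    raw material library, where each tour must start and end."""
    total = 0.0
    for route in routes:
        tour = [0] + list(route) + [0]  # close the loop through the depot
        total += sum(dist[a][b] for a, b in zip(tour, tour[1:]))
    return total

# toy symmetric distances between the depot (0) and two workstations
dist = {0: {0: 0, 1: 4, 2: 5},
        1: {0: 4, 1: 0, 2: 3},
        2: {0: 5, 1: 3, 2: 0}}
d = total_travel_distance([[1, 2]], dist)  # 0->1->2->0 = 4 + 3 + 5 = 12
```

Splitting the stations over two AGVs ([[1], [2]]) gives 8 + 10 = 18 here, which illustrates why route structure, not just assignment, drives this objective.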
The constraints associated with the JIT-RMOVRPTW are added as follows:
Capacity constraints: The number of material boxes on each workstation must not exceed the maximum capacity of the material cart. The number of material boxes delivered by each AGV should not exceed the maximum load capacity of the AGV.
The kinds of materials required at each station and the number of material boxes of each material required by each workstation are known model parameters.
Constraints on the number of material carts: The number of delivery workstations for each AGV must not exceed the number of carts for its own material. The number of material carts of each AGV must not exceed the maximum number of material carts of the AGV.
The number of material carts pulled by each AGV is a known model parameter.
Service time: The service time for each station is determined by the number of material boxes to be delivered to that station.
The time taken to unload a single material box is a known constant.
Distribution constraints: Each workstation can be delivered only once.
AGV arrival number constraints: Only one AGV arrives at each workstation.
Material requirement constraints: Material requirements are met at each workstation.
Loop constraints: Each subloop starts at the raw material library and ends at the raw material library.
Workstation demand time: The JIT objective conflicts with the robustness objective, so a lead time is used as the safety time for the vehicle. The JIT-RMOVRPTW eliminates traditional time windows; instead, the JIT time serves as the deadline of each station (as shown in Figure 1). A slack time is also defined: as shown in Equation (17), the larger the slack time of a station, the higher its JIT requirement.
The arrival time of the AGVs:
Uncertainty constraints: The uncertain time of arriving at each station is composed of an overall random disturbance, the average distance between workstations, and a characteristic coefficient between stations i and j. One of these terms has unique characteristics, so it is discussed separately in Equation (19).
Feasibility of workstations: To measure the robustness of a solution, the metric alpha is proposed in the JIT-RMOVRPTW. Alpha is defined according to the random distribution that fits each scenario. In this paper, the disturbance is taken as a continuous triangular distribution with a given lower limit, peak location, and upper limit. Equation (21) represents the range of random arrival times at a station, and Equation (22) represents the degree of uncertainty between stations i and j. Equation (23) gives the feasibility value of each station. The maximum value of alpha is set to 2 to prevent an excessively large alpha from affecting the realization of the JIT.
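A minimal sketch of this uncertainty model: the triangular disturbance can be drawn with the standard library, and the capped feasibility metric below is a placeholder for Equations (21)–(23), with illustrative parameters.

```python
import random

def sample_disturbance(low, mode, high, rng):
    """Draw one uncertain travel-time disturbance from a triangular
    distribution with the given lower limit, peak location, and upper limit."""
    return rng.triangular(low, high, mode)  # stdlib argument order: low, high, mode

def feasibility_alpha(slack, spread):
    """Illustrative stand-in for the feasibility metric: the ratio of time
    slack to the uncertainty spread, capped at 2 so an oversized alpha
    cannot dominate the JIT objective."""
    if spread <= 0:
        return 2.0
    return min(slack / spread, 2.0)

rng = random.Random(42)
d = sample_disturbance(0.0, 1.0, 3.0, rng)  # always within [0, 3]
```

The cap at 2 mirrors the model's rule that an overly large alpha must not crowd out the JIT requirement.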
Advanced time for materials:
Penalty time:
The minimum advance time of the material required by each station served by an AGV, and the total penalty time of each AGV, are defined. Equation (28) indicates that the departure of each AGV from the raw material library is delayed so that the smallest demand time among the workstations it serves is still satisfied in time. The set of stations served by each AGV is also recorded.
Decision-making variables:
Thus, the JIT-RMOVRPTW can be summarized by three objectives defined by Equations (1)–(4), subject to constraint conditions defined by Equations (5)–(30).
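The flexible departure time used in the model (cf. the role of Equation (28)) can be sketched as follows; the nominal arrival times and deadlines are assumed inputs for this sketch.

```python
def departure_delay(arrival_times, deadlines):
    """Delay the AGV's departure so the tightest station is served just in
    time: the delay equals the smallest advance (deadline - nominal arrival)
    over the stations on the route, and is never negative."""
    advances = [dl - at for at, dl in zip(arrival_times, deadlines)]
    return max(min(advances), 0.0)

# arrivals at t=5 and t=9 with deadlines 8 and 15: the smallest advance is 3,
# so departure can be delayed by 3, removing all early-arrival waiting time
delay = departure_delay([5.0, 9.0], [8.0, 15.0])
```

If some station would already be late (negative advance), the delay collapses to zero and the lateness shows up in the JIT penalty objective instead.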
4. Proposed Algorithm
In this section, the overall NSACOWDRL process is first described. Then, the six components of NSACOWDRL are introduced, which include (1) solution construction, (2) nondominated sorting based on reference points, (3) the pheromone updating strategy, (4) local search based on the DDQN, (5) the probabilistic transfer formula, and (6) the robust optimization method. The optimization process of the NSACOWDRL is given in
Figure 2.
4.1. Framework of the NSACOWDRL
The framework of the NSACOWDRL is given in
Figure 3. The NSACOWDRL consists of two stages: fully searching the solution space to generate Pareto solutions and robust selection of Pareto solutions.
(1) Stage 1: Solution Space Characteristic Exploration and Pareto Solution Generation
Step 1: Initialize the algorithm parameters, including pheromone matrix, distance matrix, neural networks, etc. The initial number of iterations is 0.
Step 2: Obtain the set of feasible routes according to the solution construction strategy (see
Section 4.2 for details).
Step 3: The routings are stratified by nondominated sorting based on the reference points, and N routings are selected for this iteration.
Step 4: The pheromone matrix is updated according to the pheromone update strategy.
Step 5: When the iteration count reaches a specified value, the DDQN network is trained on the nondominated solutions, and N paths obtained from the trained network through action selection participate in the next iteration.
Step 6: Increment the iteration counter.
Step 7: If the maximum number of iterations is reached, output the solution set; otherwise, go to Step 2.
(2) Stage 2: Robust Multiobjective Optimization
Step 1: The Pareto solutions obtained in stage 1 are partitioned by weights that are uniformly distributed in the solution space.
Step 2: Each solution is assigned to the weight closest to this solution.
Step 3: Through Monte Carlo simulation, the number of feasible trials for each solution is obtained to quantify its feasibility.
Step 4: The set of solutions belonging to each weight is then screened according to feasibility to obtain the Pareto solutions for each weight, yielding robust solutions that take diversity into account.
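The Stage 2 steps can be sketched as follows; the Euclidean nearest-weight assignment and the keep-the-most-feasible rule per weight are simplifying assumptions for this sketch.

```python
def robust_select(solutions, weights, top_k=1):
    """Stage-2 sketch: assign each solution (objectives, feasible_count) to
    its nearest weight vector, then keep the most feasible solution(s) in
    each weight's bucket, preserving diversity across the front."""
    buckets = {i: [] for i in range(len(weights))}
    for objs, feas in solutions:
        nearest = min(range(len(weights)),
                      key=lambda i: sum((o - w) ** 2
                                        for o, w in zip(objs, weights[i])))
        buckets[nearest].append((objs, feas))
    selected = []
    for group in buckets.values():
        group.sort(key=lambda s: -s[1])  # higher Monte Carlo count first
        selected.extend(group[:top_k])
    return selected

weights = [(1.0, 0.0), (0.0, 1.0)]
sols = [((0.9, 0.1), 800), ((0.95, 0.05), 600), ((0.1, 0.9), 700)]
picked = robust_select(sols, weights)
```

Because the filter runs per weight rather than globally, a moderately feasible solution in a sparse region survives instead of being crowded out by robust solutions clustered elsewhere.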
4.2. Solution Construction
NSACOWDRL constructs solutions by imitating the behavior of ant colonies foraging for food with the following steps.
Step 1: All ants are initially located in the raw material library. Initialize the ant counter to 0.
Step 2: Each ant sets out independently.
Step 3: The ant calculates the workstations that satisfy the constraints based on the current information and obtains the set of feasible workstations.
Step 4: If the feasible set is empty and the ant does not visit all the workstations, the ant returns to the raw material library and sets out again. Otherwise, calculate the node transfer probability according to the transfer probability formula and select the next node from the feasible set via roulette. Then, update the feasible set.
Step 5: Repeat Steps 3–4 until the ant has served all workstations; record the route of the ant and increment the ant counter.
Step 6: Repeat Steps 2–5 until all ants have constructed routes. The set of feasible routes is obtained.
The pseudo-code for solution construction is as follows (Algorithm 1):
Algorithm 1: Solution Construction
Input: distance matrix; pheromone matrix; number of ants; capacity and cart constraints
Output: set of feasible routes R
1: R ← ∅; ant ← 0
2: while ant < number of ants do
3:   U ← {1, 2, …, n}  (set of undistributed workstations)
4:   depart from the raw materials warehouse: current ← 0
5:   while U ≠ ∅ do
6:     compute the set F ⊆ U of workstations that satisfy the constraints
7:     if F ≠ ∅ then
8:       select the next node j from F via roulette on the transfer probability
9:       update the remaining load and cart capacity; current ← j; remove j from U; append j to the route
10:    else
11:      return to the raw material library: current ← 0; reset the load and cart capacity
12:    end if
13:  end while
14:  append the route to R; ant ← ant + 1
15: end while
4.3. Nondominated Sorting Based on Reference Points
NSACOWDRL uses nondominated sorting based on reference points to obtain Pareto solutions with good diversity. A nondominated route is one that no other route is at least as good as on every objective while being strictly better on at least one. Searching for nondominated routes expands the search range in the solution space, which increases the probability of selecting better routes and improves the diversity of the algorithm.
In the NSACOWDRL, when half of the individuals are selected in each generation, the paths obtained from the current and previous iterations are combined and selected in a nondominated way to obtain a more elite generation; this preserves high-quality individuals while sufficiently biasing selection in favor of individuals with better nondominance ranks.
4.3.1. Nondominated Sorting
After one iteration of the NSACOWDRL, a set of routings is obtained. To improve diversity, a set of feasible routings is formed by mixing the routings of the current generation with those of the previous generation. This mixed set is then stratified by nondominated sorting via the following steps:
First, let n_p denote the number of individuals that dominate individual p. All individuals with n_p = 0 are found by pairwise comparison; they are given nondominance rank 1 and stored in the set F_1.
Then, the individuals dominated by the members of F_1 are found, and n_p is decreased by 1 for each of them. Each individual whose n_p reaches 0 is given nondominance rank 2 and stored in the set F_2.
The above operation is repeated for the remaining individuals until every individual has been assigned a nondominance rank representing the importance of its routing.
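The sorting steps above correspond to the classic fast nondominated sort; a minimal sketch for minimization objectives:

```python
def nondominated_sort(points):
    """Assign each objective vector a nondominance rank (1 = best front),
    using the n_p / S_p bookkeeping of fast nondominated sorting."""
    n = len(points)
    dominates = lambda a, b: all(x <= y for x, y in zip(a, b)) and a != b
    S = [[] for _ in range(n)]   # S[p]: individuals dominated by p
    n_dom = [0] * n              # n_p: how many individuals dominate p
    for p in range(n):
        for q in range(n):
            if dominates(points[p], points[q]):
                S[p].append(q)
            elif dominates(points[q], points[p]):
                n_dom[p] += 1
    rank = [0] * n
    front = [p for p in range(n) if n_dom[p] == 0]
    r = 1
    while front:
        nxt = []
        for p in front:
            rank[p] = r
            for q in S[p]:
                n_dom[q] -= 1
                if n_dom[q] == 0:
                    nxt.append(q)
        front, r = nxt, r + 1
    return rank

ranks = nondominated_sort([(1, 5), (2, 2), (4, 1), (3, 3), (5, 5)])
```

Here (1,5), (2,2), and (4,1) are mutually nondominated (rank 1), (3,3) is dominated only by (2,2) (rank 2), and (5,5) is dominated by all of them (rank 3).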
4.3.2. Diversity Conservation
Because the mixed set contains a large number of solutions, there are many ranks, so some individuals have very low nondominance ranks and little importance. These individuals not only reduce the running efficiency of the algorithm but also cause the pheromone changes of poor individuals to be recorded, which increases the search difficulty and may even lead to a local optimum.
To eliminate this difficulty, a routing selection strategy based on reference points is adopted. The solutions in are selected according to their impact on the diversity of the algorithm. The uniformly distributed reference points are generated through the hyperplane. The individual is selected based on the association between the solutions and reference points. This strategy uses well-distributed reference points to maintain the diversity of populations. The steps are described below:
(1) Individuals are added to a selection set in order of nondominance rank, starting from rank 1, until its size exceeds the maximum number of nondominated solutions N after the individuals of some rank L have been added. The individuals with ranks 1 to (L-1) are kept directly, and the individuals with rank L form the candidate set from which the remaining slots are filled.
(2) Establish reference points on the hyperplane:
The hyperplane has the same tilt toward all coordinate axes and an intercept of one on each axis. For a problem with M objectives, reference points are established on an (M-1)-dimensional hyperplane. For a three-objective problem, each coordinate axis is uniformly divided into H segments, and planes perpendicular to the axes are drawn through the segment points; a reference point is the intersection of such planes from different coordinate axes with the hyperplane. The number of reference points is determined by Equation (31).
Figure 4 shows the hyperplane of three objectives and reference points on the hyperplane when H is set to 5 (a detailed discussion of the reference points is given in
Section 5.4).
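The uniformly spaced reference points described above can be generated by the standard Das-Dennis construction, which is what Equation (31) counts; a sketch:

```python
from itertools import combinations

def reference_points(M, H):
    """Generate the C(H+M-1, M-1) uniformly spaced points on the unit
    simplex (the normalized hyperplane) by the Das-Dennis construction:
    choose M-1 cut positions among H+M-1 slots; the gap sizes between
    cuts, divided by H, give the coordinates."""
    points = []
    for cuts in combinations(range(H + M - 1), M - 1):
        coords, prev = [], -1
        for c in cuts:
            coords.append((c - prev - 1) / H)
            prev = c
        coords.append((H + M - 2 - prev) / H)
        points.append(tuple(coords))
    return points

pts = reference_points(3, 5)  # three objectives, H = 5, as in Figure 4
```

For M = 3 and H = 5 this yields C(7, 2) = 21 points, each summing to 1, matching the layout shown in Figure 4.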
(3) Normalize the individuals:
First, calculate the ideal point, which consists of the minimum value over all individuals in the set on each objective, and subtract the ideal point from every individual.
Then, calculate the extreme points, which have a large value in one objective and small values in the others; M objectives give M extreme points. To find the extreme point of an objective, each individual's value in that objective is divided by a weight of 1 and its other objective values are divided by a weight of 10^-6; the largest of these weighted values is recorded for each individual, and the individual with the smallest such value is the extreme point of that objective.
The (M-1)-dimensional hyperplane is formed from the M extreme points. Then, the intercepts between this hyperplane and the coordinate axes are calculated.
Figure 5 shows the two-dimensional hyperplane formed using three extreme points.
Each extreme point corresponds to one objective, and each intercept is the intersection of the hyperplane with one coordinate axis. Finally, the normalized objective value is calculated by Equation (32), which relates each individual's value in the i-th objective to the ideal point's value in that objective.
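The normalization steps above can be sketched as follows; for simplicity this sketch uses each extreme point's own axis value as the intercept instead of solving for the true hyperplane intercepts, which is an approximation.

```python
def normalize(objs, eps=1e-6):
    """Sketch of the normalization step: translate by the ideal point, find
    the extreme point of each axis with the achievement scalarizing function
    (weight 1 on the axis, eps elsewhere), and divide by the intercepts
    (approximated here by the extreme points' axis values)."""
    M = len(objs[0])
    ideal = [min(o[m] for o in objs) for m in range(M)]
    shifted = [[o[m] - ideal[m] for m in range(M)] for o in objs]
    extremes = []
    for axis in range(M):
        # ASF: the smallest worst-case weighted ratio picks the point
        # hugging this coordinate axis
        def asf(p):
            return max(p[m] / (1.0 if m == axis else eps) for m in range(M))
        extremes.append(min(shifted, key=asf))
    intercepts = [max(extremes[m][m], eps) for m in range(M)]
    return [[p[m] / intercepts[m] for m in range(M)] for p in shifted]

norm = normalize([[1.0, 9.0], [9.0, 1.0], [5.0, 5.0]])
```

After normalization every objective lies on a comparable scale, so the perpendicular-distance association in the next step is not biased toward objectives with large raw magnitudes.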
(4) Link individuals to reference points:
Each reference point is connected to the origin of the coordinates to form a reference line. The vertical distance from each individual in the candidate set to each reference line is calculated, and each individual is associated with the nearest reference line; each individual thereby corresponds to one reference point. Let ρ_j denote the number of already-selected individuals associated with reference point j.
(5) Select individuals:
The selection operation proceeds in increasing order of ρ_j. Starting from the reference point with the smallest ρ_j (choosing randomly if several reference points tie), if ρ_j = 0 and the candidate set contains individuals associated with this reference point, the individual with the smallest distance to the reference line is selected and added to the next population. If no associated individual exists in the candidate set, this reference point is excluded from the remaining operations. If ρ_j ≥ 1 and the candidate set contains associated individuals, one of them is selected at random. The above operation is repeated until the population reaches the required size.
This method acts as a niche preservation operation: it maintains the diversity of the algorithm by selecting solutions in the regions where the previous Pareto front is sparser. Because the reference points are generated on a hyperplane, the selection operation generalizes easily to many-objective problems and addresses the difficulty of maintaining the diversity of solution sets in high-dimensional objective spaces.
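The association step underlying this selection can be sketched as computing perpendicular distances to the reference lines; the two-objective reference points below are illustrative.

```python
import math

def associate(point, ref_points):
    """Associate a normalized individual with the reference line (from the
    origin through a reference point) of smallest perpendicular distance."""
    best_j, best_d = -1, math.inf
    for j, r in enumerate(ref_points):
        norm_r = math.sqrt(sum(c * c for c in r))
        # length of the point's projection onto the unit direction of r
        t = sum(p * c for p, c in zip(point, r)) / norm_r
        # perpendicular distance by Pythagoras (clamped against rounding)
        d = math.sqrt(max(sum(p * p for p in point) - t * t, 0.0))
        if d < best_d:
            best_j, best_d = j, d
    return best_j

refs = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
j = associate((0.9, 0.1), refs)  # closest to the (1, 0) direction
```

Counting how many individuals each reference line attracts gives the niche counts ρ_j used by the selection loop.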
4.4. Pheromone Update Strategy
The pheromone update strategy guides the algorithm update through a positive feedback mechanism; however, if the positive feedback mechanism is too strong, local convergence problems can easily occur, and if the mechanism is too weak, the algorithm will converge more slowly. To improve the convergence and diversity of the algorithms, this paper proposes a pheromone update strategy that combines the nondominated ranks with the elite routing strategy and the max-min pheromone update strategy. In this strategy, the pheromones of the routes are updated according to their nondominated rank, which comprehensively considers the influence of multiple objectives. Then, the elite paths with better nondominated ranks are updated twice to improve the convergence of the algorithm. Finally, to avoid the algorithm falling into a local optimum due to an excessively large difference between pheromone concentrations, the max-min pheromone strategy is adopted to limit the range of pheromone concentrations.
The update strategy is as follows:
(1) The pheromone update process is shown in Equations (33)–(35):
where ρ (0 < ρ < 1) represents the pheromone volatilization rate; the pheromone increment of path (i, j) is the sum of the pheromone increments left by all ants that traverse path (i, j) in the cycle; each ant releases its own amount of pheromone on each edge it uses; and the elite paths receive an additional pheromone increment released by the elite ants.
(2) Common ant pheromone update strategy:
The pheromone is updated for all routings according to their nondominance rank: each ant deposits an amount determined by a pheromone constant, a nondominance constant, and the nondominance rank of its routing in this iteration, so that routings with better ranks deposit more pheromone.
(3) Elite routing pheromone update strategy:
Pheromone concentrations are additionally updated for elite routings with a nondominance rank of 1 or 2.
(4) Max-min pheromone limits: Since there are several nondominance ranks, large differences in pheromone concentration tend to arise between paths after the update. An excessively large difference can cause premature stagnation of the search and make the algorithm more susceptible to local optima. To avoid this, explicit limits are imposed on the minimum and maximum pheromone concentrations in the update strategy, as shown in Equation (38); after each iteration, the pheromone concentrations must be kept within these limits.
In Equation (38), the base of the logarithm is used to control the range of pheromone concentrations, together with a limiting factor and the number of nodes.
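The combined update strategy can be sketched as follows; the deposit formula Q / (w + rank) and the double deposit for elite routes are illustrative stand-ins for Equations (33)–(38), not the paper's exact expressions.

```python
def update_pheromone(tau, routes_with_rank, rho=0.1, Q=1.0, w=1.0,
                     tau_min=0.01, tau_max=10.0):
    """Sketch of the combined strategy: evaporation, a rank-weighted deposit
    for every route, a second deposit for elite routes (rank 1 or 2), and
    max-min clamping of the final concentrations."""
    for i in tau:
        for j in tau[i]:
            tau[i][j] *= (1.0 - rho)        # evaporation
    for route, rank in routes_with_rank:
        deposit = Q / (w + rank)            # better rank -> larger deposit
        passes = 2 if rank <= 2 else 1      # elite routes deposit twice
        for a, b in zip(route, route[1:]):
            tau[a][b] += passes * deposit
    for i in tau:                           # max-min pheromone limits
        for j in tau[i]:
            tau[i][j] = min(max(tau[i][j], tau_min), tau_max)
    return tau

tau = {0: {1: 1.0}, 1: {0: 1.0}}
tau = update_pheromone(tau, [([0, 1], 1)])  # one rank-1 (elite) route 0 -> 1
```

Edge (0, 1) evaporates to 0.9 and then receives the doubled elite deposit of 1.0, while the unused edge (1, 0) only evaporates; the clamp keeps both within [tau_min, tau_max].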
4.5. Local Search Strategy
The DDQN [56] is an off-policy DRL algorithm improved from the DQN. The DDQN decouples the selection of the target Q-value action from the computation of the target Q value to address the maximization bias of the DQN, which results in overestimation.
The decoupling process is as follows:
The DDQN constructs two networks, the evaluation network and the target network; their structures are identical, but their weight parameters differ.
First, the DDQN selects the action corresponding to the maximum Q value by the evaluation network, as shown in Equation (40):
Then, the target network calculates the target Q value as shown in Equation (41):
Combining the two steps, the target Q value is given by Equation (42):
As shown in Equation (43), the DDQN adopts the mean squared error (MSE) between the estimated Q value and the target Q value as the loss function, and updates the evaluation network by backpropagation with gradient descent.
The DDQN learns from past experience offline by building a replay buffer. After a number of rounds of replay-buffer sampling, the weight parameters of the evaluation network are copied to the target network, realizing model self-learning.
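The decoupled target computation of Equations (40)-(42) and the MSE loss of Equation (43) can be sketched as follows; the function names are illustrative, but the update rule is the standard DDQN one the text describes:

```python
import numpy as np

def ddqn_targets(q_eval_next, q_target_next, rewards, gamma, dones):
    """Compute DDQN targets for a sampled minibatch.

    q_eval_next / q_target_next: (batch, n_actions) Q values for the next
    states from the evaluation and target networks, respectively.
    """
    # Step 1 (Eq. 40): the evaluation network selects the greedy action.
    best_actions = np.argmax(q_eval_next, axis=1)
    # Step 2 (Eq. 41): the target network evaluates that chosen action.
    q_next = q_target_next[np.arange(len(best_actions)), best_actions]
    # Eq. 42: bootstrapped target; terminal states keep only the reward.
    return rewards + gamma * q_next * (1.0 - dones)

def mse_loss(q_estimate, q_target):
    """Eq. 43: mean squared error between estimated and target Q values."""
    return float(np.mean((q_estimate - q_target) ** 2))
```

Because the argmax comes from the evaluation network while the value comes from the target network, a single network's overestimated action no longer inflates its own target, which is exactly the bias the decoupling removes.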
To solve the MOPs, the DDQN adopts the Pareto solution set to train the network and guide the learning direction (details are in
Section 4.5.1 and
Section 4.5.2). The pseudo-code is shown in the table below (Algorithm 2):
Algorithm 2: Double DQN in MOPs
Input: replay buffer size; epsilon; discount factor; target update interval; the Pareto frontier of the current iteration
Pre-train the evaluation network on the Pareto frontier
for each training episode do
  initialize the state sequence
  for each step do
    select an action epsilon-greedily, execute it, and receive the reward and next state
    compute the overall learning rate and correct the reward
    store the transition in the replay buffer, replacing the oldest tuple when the buffer is full
    sample a minibatch of tuples and update the evaluation network by gradient descent on the DDQN loss
    every fixed number of steps, copy the evaluation-network weights to the target network
  end
end
4.5.1. Environment Setup
The evolutionary algorithm guides the reinforcement learning for searching. The following is the process of establishing the DDQN environment and directing its learning direction:
(1) Neural Network Construction
The input to the problem in this paper is a three-dimensional vector of low dimensionality, so a function-fitting neural network is constructed, structured as a fully connected network with three hidden layers. The input layer has three nodes, and each hidden layer has 40 nodes.
The activation functions are the linear transfer function (purelin), the saturating linear transfer function (satlin), and the positive linear transfer function (poslin). Using different activation functions in different layers allows the multilayer network to approximate arbitrary functions.
The network is trained with the variable-learning-rate backpropagation algorithm (traingdx), which uses momentum to accelerate convergence and an adaptive learning rate to improve stability and accuracy, making gradient descent smoother.
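The forward pass of such a network can be sketched as below. The three transfer functions have simple definitions (purelin is the identity, satlin clips to [0, 1], poslin is a ReLU); which layer receives which activation is an assumption here, since the text only lists the three functions:

```python
import numpy as np

def poslin(x):   # positive linear transfer function (ReLU)
    return np.maximum(x, 0.0)

def satlin(x):   # saturating linear transfer function, clipped to [0, 1]
    return np.clip(x, 0.0, 1.0)

def purelin(x):  # linear transfer function (identity)
    return x

class QNetwork:
    """3 inputs -> three fully connected hidden layers of 40 nodes -> 1 output.

    The assignment of activations to layers is an assumption; the paper only
    states that purelin, satlin, and poslin are used.
    """
    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [3, 40, 40, 40, 1]
        self.weights = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes, sizes[1:])]
        self.biases = [np.zeros(b) for b in sizes[1:]]
        self.acts = [poslin, satlin, poslin, purelin]

    def forward(self, x):
        for w, b, act in zip(self.weights, self.biases, self.acts):
            x = act(x @ w + b)
        return x
```

The linear output layer (purelin) matters: Q values are unbounded regression targets, so the last layer must not saturate.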
The network structure is shown in
Figure 6.
After the initial construction of the neural network is completed, its parameters must be initialized by training. To improve the effectiveness of parameter training and the rate at which the network approximates the target, the Pareto frontier, named Hpareto, selected by nondominated sorting from the solution set G, is chosen as the input for initial training; meanwhile, the set with the largest nondominated rank is also selected. The mean squared error between the estimated Q value and the reward corresponding to Hpareto is then used as the loss function to train the initial evaluation network, as shown in Equation (44), where the reward is that obtained by Hpareto in the environment and the estimated Q value is that produced by the evaluation network.
(2) Reward Return Design
This paper studies the multiobjective VRP, in which the reward value is related to the pheromone between routes, the heuristic function, and the advance time. The rewards are shown in Equation (45):
This equation fully accounts for the multiple objectives, so that when an agent completes an action choice that changes the state, the corresponding reward also changes. This design effectively maps the multiple objectives of the study onto the rewards of deep reinforcement learning.
For the VRP, locally optimal subpaths can easily trap the algorithm in a local optimum, so this paper proposes an overall learning rate. After action selection produces a complete path covering all workstation requirements, the IGD [57] metric of this path with respect to the Pareto frontier Hpareto and the worst-ranked set is calculated; the overall learning rate is then computed according to Equation (46), and the reward is updated according to Equation (47). The overall learning rate quantifies the quality of the obtained path as a ratio of its IGD distances from the two sets, correcting the rewards of the path's subpaths so that paths with smaller IGD gain a larger advantage, whereas paths with larger IGD incur a larger disadvantage.
The structure of the DDQN algorithm is shown in
Figure 7.
4.5.2. Output Designs
The reinforcement learning results are used to improve the optimization performance of the evolutionary algorithm.
The training of the DDQN neural network relies on the replay buffer, and the solution sets obtained in adjacent iterations of the algorithm are highly similar, so the replay buffers of adjacent iterations differ little, making effective parameter optimization difficult. To improve the effectiveness of network optimization, the network is therefore trained only at specific iteration counts. For the evolutionary algorithm, the more iterations have elapsed, the better the current results and the more further iterations are needed to improve them, so network training is scheduled at generations 50, 100, and 200.
The trained network obtains 40 complete paths by action selection, which are recorded in a set that participates in nondominated sorting in the next iteration. The pseudo-code is as follows (Algorithm 3):
Algorithm 3: Route Generation by DDQN
Input: the trained evaluation network; epsilon; N: the number of routes to generate
while fewer than N complete routes have been generated do
  while the current route is incomplete do
    if the set of feasible next nodes is not empty then
      with probability epsilon select a random feasible node; otherwise select the feasible node with the maximum Q value
    else
      return to the depot and start a new subroute
    end
  end
end
Output: the set of generated routes
In the probabilistic selection of nodes, the current workstation and the candidate next workstation are used as inputs to the trained network to obtain the Q value between workstations, which participates in the probabilistic selection as shown in Equation (48), where the symbols denote the state of the current node, the next node, and the set of feasible nodes, respectively.
4.6. Node Transfer Probability Rule
NSACOWDRL uses a roulette wheel for node selection. The probability used in the roulette wheel is chosen by Equation (49), whose symbols denote, respectively, the probabilistic selection method, selection based on the probability formula, selection based on the Q table, and the set of specific iteration numbers.
In the JIT-RMOVRPTW, the traditional time windows are eliminated, so the time-window width is removed. The movement of ants is related to the pheromone concentration, path distance, waiting time, and the Q-table values. As shown in Equation (50), the probability formula of the JIT-RMOVRPTW is derived from the mathematical model of the problem.
where the symbols denote, respectively, the set of all neighboring nodes of the current node; the node that the ant is allowed to select next; the pheromone concentration of each pathway; the heuristic function, as shown in Equation (52); the Q value, as shown in Equation (48); the pheromone concentration factor; the heuristic function factor; the Q-value factor; and the waiting-time factor.
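Roulette-wheel selection over this kind of multiplicative weight can be sketched as follows. Since Equation (50) is not reproduced here, the product of the pheromone, heuristic, Q-value, and waiting-time terms raised to their factors is an assumption modeled on standard ACO transition rules; the factor names alpha, beta, gamma, delta are illustrative:

```python
import random

def select_next_node(candidates, tau, eta, q, wait,
                     alpha=1, beta=2, gamma=2, delta=2, rng=random):
    """Roulette-wheel selection over the feasible next nodes.

    tau, eta, q, wait: dicts mapping a candidate node to its pheromone,
    heuristic value, Q value, and waiting-time term. The multiplicative
    weight below is a sketch of, not a reproduction of, Equation (50).
    """
    weights = [(tau[j] ** alpha) * (eta[j] ** beta)
               * (q[j] ** gamma) * (wait[j] ** delta) for j in candidates]
    r = rng.uniform(0, sum(weights))   # spin the wheel
    acc = 0.0
    for node, w in zip(candidates, weights):
        acc += w
        if acc >= r:
            return node
    return candidates[-1]              # guard against float round-off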
4.7. Robust Multiobjective Optimization
The traditional robust selection strategy retains only the solution set with the highest feasibility. However, because the multiple objectives conflict, the other objectives of the most feasible solution are rarely close to optimal. In practical problems, moderate delays are acceptable, so this paper proposes a robust selection method based on uniformly distributed weights in the solution space combined with Monte Carlo simulation. Since the solutions in a neighborhood centered at a point in the solution space are similar to some degree, this method splits the solution space by uniform weights and converts the feasibility of each solution into a Monte Carlo feasibility count. Finally, the optimal set of solutions is found separately in the region corresponding to each weight. This strategy obtains solutions that account for robustness while guaranteeing diversity across the objectives.
Through this strategy, the more feasible solutions can be retained, and at the same time, based on the uniformly distributed weights, the solutions with lower feasibility but better other objectives are also retained, so as to provide the decision maker with more decision-making options in the face of different practical needs.
First, the solution space is partitioned by uniformly distributed weights, as shown in the
Figure 8, with different colors representing different weights. Each weight represents a section of the solution space.
Then, each solution is associated with the weight closest to it. Following this, Monte Carlo simulation is used to simulate each solution 1000 times to obtain its Monte Carlo feasibility count; the pseudo-code is shown in Algorithm 4. Each solution can then be represented together with its feasibility count.
Algorithm 4: Monte Carlo Simulation
Input: a set of routes; the travel times; the number of Monte Carlo simulations
Initialize an empty set of Monte Carlo feasibility counts
for each solution in the set of routes do
  for each simulation do
    sample the travel times under the disturbance distribution
    if the sampled routes satisfy the constraints then
      increment the feasibility count of this solution
    end
  end
end
Output: the Monte Carlo feasibility count of each solution
For each weight, if at least one solution is associated with it, the solution with the highest Monte Carlo feasibility count is found; solutions associated with the same weight whose count falls below the resulting threshold are deleted. The remaining solutions form the optimal solution set of the region represented by that weight. The final result is shown in
Figure 9: the set of robust Pareto solutions that preserves the diversity of the algorithm, where points of a given color correspond to the weight of the same color.
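The simulation-and-filter stage can be sketched as below. The feasibility test and the per-weight deletion threshold (`keep_ratio` times the group maximum) are assumptions, since the exact constraint check and threshold are defined by the model, not reproduced here:

```python
import random

def monte_carlo_feasible_times(sample_route_time, budget, n_sim=1000, rng=random):
    """Count how many of n_sim sampled scenarios keep a route feasible.

    sample_route_time: callable drawing one sampled total time for the route.
    The test (sampled time <= budget) is a simplified stand-in for the
    model's full constraint check.
    """
    return sum(1 for _ in range(n_sim) if sample_route_time(rng) <= budget)

def robust_select(solutions, weights, keep_ratio=0.8):
    """Associate each solution with its nearest weight, then keep, per weight,
    only solutions whose Monte Carlo count is close to that weight's maximum.

    solutions: list of (objective_vector, mc_count); weights: list of vectors.
    keep_ratio is an illustrative threshold, not the paper's exact rule.
    """
    def nearest(obj):
        return min(range(len(weights)),
                   key=lambda i: sum((o - w) ** 2 for o, w in zip(obj, weights[i])))
    groups = {}
    for obj, mc in solutions:
        groups.setdefault(nearest(obj), []).append((obj, mc))
    kept = []
    for sols in groups.values():
        best = max(mc for _, mc in sols)
        kept.extend(s for s in sols if s[1] >= keep_ratio * best)
    return kept
```

Because the filter is applied per weight rather than globally, a region whose best count is modest still contributes solutions, which is how diversity is preserved alongside robustness.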
5. Experiments and Analyses
The comparison experiments used to test the performance of NSACOWDRL are discussed in this section. We compare the performance of multiple multiobjective algorithms on different instances, of NSACOWDRL under different JIT models, and of NSACOWDRL under different network settings.
5.1. Experimental Setting
To test the performance of NSACOWDRL, four instances of Solomon's VRPTW benchmark, named C202, C206, RC202, and RC204, and one instance named RL, built from a real case at a manufacturer in China, are adopted; each instance is tested with three different interference coefficients.
Based on the model and the experience of the manufacturer, the following modifications were made to the instances:
(1) The company divides the distribution time of one day into several cycles, and different distribution tasks are performed in each cycle.
(2) All the workstations are available all day.
(3) The assembly workshop has standardized the delivery of materials. All the materials were delivered in standardized boxes, and all the materials required at the same station during a cycle were placed in the same carts.
(4) In this case, due to the constraints of the assembly workshop and the safety of the AGV operation, the maximum number of carts is nine.
Table 1 shows the dataset of RL.
Figure 10 shows a simple illustrative scenario to visualize the layout and process of material delivery. The dotted lines indicate the routes on which the AGVs can run. The small rectangles joined together represent workstations. The black dots indicate the stations that need to be distributed.
The NSACOWDRL was executed for 250 generations with 40 ants. The pheromone factor, the heuristic factor, γ, and δ in the node transition probability formula were set to 1, 2, 2, and 2, respectively. One parameter in the pheromone update formula was set to 0.2 and another to 10. In the DDQN, the buffer size was set to 60,000, epsilon to 0.9, gamma to 0.9, and the target update interval to 1200. The parameters selected were not necessarily optimal for each model; this paper did not strictly tune them.
5.2. Performance Metrics
In MOPs, the diversity and convergence of the algorithm are the main performance aspects to be evaluated, and a single metric can hardly measure both comprehensively. In this paper, two performance metrics, HV [1] and IGD, are adopted to evaluate the algorithm. These metrics are described below:
As shown in Equation (53), the HV (hypervolume) metric measures the diversity of an algorithm by calculating the volume of the objective space enclosed by the set of nondominated solutions and a reference point. The larger the HV, the better the diversity of the algorithm. The symbols in Equation (53) denote the Lebesgue measure, which is used to measure volume; the number of nondominated solutions; and the hypervolume formed by the reference point and the i-th solution in the solution set.
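For two objectives, the hypervolume reduces to an area and can be computed exactly by sweeping the sorted front; this small sketch (minimization assumed, reference point dominating all solutions) illustrates the metric:

```python
def hypervolume_2d(points, ref):
    """Area dominated by a 2-D minimization front with respect to ref.

    points: nondominated (f1, f2) pairs, each coordinate <= the matching
    reference coordinate. Sorting by f1 makes f2 strictly decrease along a
    nondominated front, so the dominated region decomposes into disjoint
    rectangles capped by the reference point.
    """
    pts = sorted(points)
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)   # one rectangle per point
        prev_f2 = f2
    return hv
```

For three or more objectives (as in this model) the same idea requires more elaborate sweeping or Monte Carlo estimation; the 2-D case is shown only to make the definition concrete.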
As shown in Equation (54), the IGD (inverted generational distance) metric is a comprehensive performance measure that evaluates the convergence and distribution of the algorithm by calculating the average minimum distance from each point on the true Pareto front (PF) to the set of individuals obtained by the algorithm. The smaller the IGD, the better the convergence and distribution performance. The terms in Equation (54) are the solution set obtained by the algorithm; a set of uniformly distributed reference points sampled from the Pareto front; and the Euclidean distance between a point in the reference set and a point in the solution set.
Because the true Pareto front is difficult to obtain in real scenarios, the PF is approximated by all nondominated solutions obtained by all algorithms.
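The IGD definition in Equation (54) translates directly into a few lines; this sketch works for any number of objectives:

```python
import math

def igd(reference_front, solution_set):
    """Inverted generational distance (Eq. 54): the mean, over reference
    points, of the minimum Euclidean distance to the obtained solution set."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return (sum(min(dist(r, s) for s in solution_set) for r in reference_front)
            / len(reference_front))
```

Note the direction of the average: iterating over the reference front (not the solution set) is what makes IGD penalize both poor convergence and gaps in coverage.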
In this experiment, the normalized metrics were used to unify the measures between the different objectives.
5.3. Experimental Results
Based on the dataset, experiments were carried out on the JIT-RMOVRPTW models to verify the effectiveness of NSACOWDRL. To demonstrate its superiority on multiobjective problems, NSACOWDRL is compared with NSACO, NSGA-III, NSGA-II, and MOEA/D. Twenty independent runs were used to test the performance of each algorithm in each experiment, and the results are reported in terms of HV and IGD.
5.3.1. Results of the JIT-RMOVRPTW
(1) Analysis of the Performance Metric
Table 2 and
Table 3 present the mean and variance of the HV and IGD metrics for different algorithms on the JIT-RMOVRPTW dataset with three interference coefficients, respectively. The leftmost column represents the instance numbers. The second column represents three different interference coefficients, which are triangularly distributed. “Small”, ”Middle”, and “Large” represent interference coefficients from small to large. For the two rows that follow each disturbance coefficient, the top row represents the mean of the metrics obtained after the corresponding algorithm has been run 20 times independently with this disturbance coefficient, and the bottom row represents the variance of the metrics obtained.
As shown in
Table 2 and
Table 3, NSACOWDRL performs better, with superior and more stable metrics across the instances. In comparison, on RL, RC204, and RC202, as the interference coefficient increases, the performance of NSGA-II improves while that of MOEA/D decreases; on C202 and C206, the opposite holds. Meanwhile, the performance of NSGA-III is more stable. The performance of the algorithms is strongly influenced by the instances: NSACO, NSGA-II, NSGA-III, and MOEA/D each have advantages on different instances, whereas NSACOWDRL is always superior.
To determine the differences between the algorithms on different metrics, a two-by-two comparison
p value test was carried out. As shown in
Figure 11,
Figure 12,
Figure 13,
Figure 14 and
Figure 15, the
p value is represented by the heatmap. The
X-axis is the interference coefficient, and the
p values at different interference coefficients are shown on either side of the breakpoint in the
X-axis. The upper triangle of each large square is the
p value for HV and the lower triangle is the
p value for IGD, where
p values greater than 0.05 are marked in black.
The p-value plots show that the algorithms differ significantly between instances and between interference coefficients. NSACOWDRL is significantly better than the other algorithms, especially on C202, C206, and RL, whereas on RC202 and RC204 the performance of MOEA/D is close to that of NSACOWDRL in terms of IGD. Meanwhile, as the interference coefficient increases, the variation between algorithm performances decreases on C202, C206, and RC204, whereas on RC202 and RL it fluctuates. Algorithm performance is affected by the combination of instance and interference coefficient; as the interference coefficient increases, the variation between algorithms is affected more in the RC-series instances than in the C-series.
As the result shows, the NSACOWDRL outperforms the other algorithms in terms of convergence and diversity when solving the JIT-RMOVRPTW. Furthermore, the NSACOWDRL has better generalization in solving problems for different scenarios.
5.3.2. Discussions
The results of the algorithms on the different instances show superiority of the NSACOWDRL on the proposed model. To evaluate the impact of multiple JIT strategies on the performance of the algorithm, the following three JIT strategies are compared:
Strategy 1: Consider transportation capacity and the slack time.
Strategy 2: Consider the slack time.
Strategy 3: Schedule advance time for the time window.
The results are shown in
Table 4. "Mean gap" represents the gap between the best and worst means across the interference coefficients of the same strategy. NSACOWDRL performs better in both models, and its solutions have more stable metrics. In comparison, across the models, as the interference coefficient increases, NSGA-III gradually approaches NSGA-II in convergence and finally falls below it. Meanwhile, the diversity of MOEA/D becomes progressively weaker than that of NSGA-III, and its convergence weakens toward that of NSGA-II. Faced with different JIT strategies, NSGA-II, NSGA-III, and MOEA/D each have strengths and weaknesses, while NSACOWDRL always retains its superiority.
Thus, NSGA-III, and MOEA/D are susceptible to interference coefficients and JIT strategies; however, the NSACOWDRL has better generalizability when faced with different models.
By analyzing the “mean gap”, as the interference coefficient increases, the HV and IGD metrics of NSACOWDRL of Strategy 1 have the smallest degradation, followed by Strategy 2, and the algorithm of Strategy 3 has the most degradation. Compared with other strategies, Strategy 1 can maintain the performance of the algorithm better, which verifies the superiority of Strategy 1.
5.4. Different Parameter Settings
In the previous experiments, this paper did not make any serious attempt to find optimal parameter settings for NSACOWDRL. For the DDQN, the type of neural network affects the performance of the algorithm. In this section, based on the proposed model, the neural network of the DDQN is set as a back-propagation (BP), radial basis function (RBF), general regression neural network (GRNN), and product-based network (PNN), respectively, to test the effects of different networks.
Table 5 shows the performance of the algorithm with different networks.
Figure 16 shows the
p values of the metrics. The comparison shows that PNN has the best convergence and diversity, GRNN is closest to PNN in performance, and RBF performs worst. At the same time, as the interference coefficient increases, the variability between the networks decreases.
For illustration,
Figure 17 shows the sets of Pareto solutions in the approximate PF obtained by each algorithm under different interference coefficients. The interference coefficient increases from left to right and then from top to bottom across the subplots. As the interference coefficient increases, each algorithm obtains more Pareto solutions, more widely distributed in the solution space, and the variation in distribution between algorithms decreases. The figure shows that the solutions of NSACOWDRL_PNN are distributed at all levels between (0, 1) on the axis in all subplots, while the solutions of NSACOWDRL_RBF are concentrated in two blocks. In terms of solution-space distribution, NSACOWDRL_PNN is comparable to NSACOWDRL_GRNN, and both are better than the other algorithms.
These results show that, when tuning the algorithm, choosing an appropriate network can effectively improve performance. The appropriate neural network should be selected according to the type of problem: a Convolutional Neural Network (CNN) can be used for discrete action-space problems, while a GRNN suits nonlinear problems; different problems thus require different network architectures and tuning.
6. Conclusions
This paper constructs the multiobjective AGV route-planning mathematical model JIT-RMOVRPTW for assembly workshops, which considers uncertainty in transportation time. Based on the constraints, the metric alpha, which measures the feasibility of a solution under uncertainty, is introduced into the model. To solve the problem, the two-stage NSACOWDRL is proposed. In stage 1, NSACOWDRL adopts a nondominated routing selection method with niche protection based on reference points to obtain the set of nondominated routes, ensuring the diversity of the algorithm. The nondominated rank is introduced into the pheromone update strategy, which adopts elite routing updates to protect better routings and uses the max-min pheromone strategy to prevent the algorithm from falling into local optima due to overly large gaps between pheromone concentrations. The DDQN is then introduced as a local search algorithm whose networks are trained on nondominated solution sets; the trained network generates routes that participate in the next nondominated sorting and also enters the probability formula. A state-transfer probability formula based on the objectives, the DDQN, and the constraints is proposed. In stage 2, the feasibility of the nondominated solutions obtained in stage 1 is quantified by Monte Carlo simulation as feasibility counts, and robust selection based on uniform weights in the solution space then yields the Pareto solution set. The JIT-RMOVRPTW better resolves the conflict between punctuality and robustness in uncertain environments, and the multiobjective problem is solved more effectively in the multidimensional space. NSACOWDRL complements evolutionary algorithms with deep reinforcement learning to enhance optimization capability.
The experimental results among the different algorithms under different disturbance coefficients verify the superiority of the NSACOWDRL in terms of diversity and convergence in the robust VRP problem. A comparison of the models for different JIT strategies also reveals the superiority of the proposed model. The experiments demonstrate that NSACOWDRL has better generalization in different scenarios.
The uncertainty of material requirements caused by volatile processing times, equipment failures, and other issues will be considered further. Future research can increase the complexity of the model and further optimize the material distribution scheme. The comparison experiments with different networks show that the network structure of the DRL component has a large impact on model performance; research on DRL models therefore needs to be deepened further.