1. Introduction
The vehicle routing problem (VRP) was proposed by Dantzig and Ramser in 1959 [1]; it aims to find optimal routes subject to given constraints and objectives. The VRP is widely used in traffic management, logistics transportation, and other fields, and it has been studied extensively. Several variants have been proposed by applying the VRP in different environments, such as the capacitated vehicle routing problem (CVRP) and the vehicle routing problem with time windows (VRPTW). Many researchers have performed systematic literature reviews on the VRP.
Liu et al. [2] extracted data from and analyzed the literature published in 2018–2022 to address problems associated with the VRPTW. Li et al. [3] analyzed articles that solve VRPs with learning-based optimization (LBO) algorithms from various aspects, evaluated the performance of representative LBO algorithms, summarized the types of problems to which LBO is applicable, and proposed directions for improvement. Asghari et al. [4] investigated contributions related to the green VRP and proposed a classification scheme based on the variants considered in the scientific literature, providing a comprehensive and structured survey of the state of knowledge, discussing the most important features of the problems, and indicating future directions. Gunawan et al. [5] provided and categorized a list of VRP datasets. Salehi Sarbijan et al. [6] systematically reviewed and analyzed research on the VRP from 2001 to 2022, focusing on new and emerging topics. Starting from the basic VRP, Zhang et al. [7] classified VRPs according to their characteristics and practical applications, gave a unified description and mathematical model for each type of problem, and analyzed the solution methods for each type. Ni et al. [8] comprehensively analyzed and summarized the literature on VRPs from 1959 to 2022 using a knowledge graph, and classified VRP models and their solutions.
However, the complexity of the VRP increases as real-life constraints are added. The VRP is NP-hard, so finding the exact optimal solution is difficult. Methods proposed to solve the VRP therefore aim to find a near-optimal solution through an efficient computational approach.
VRPs for AGVs are solved using exact algorithms and intelligent evolutionary algorithms. The exact algorithms include the branch-and-bound method, integer linear programming, and dynamic programming, which are suitable for small-scale VRPs with simple structures. Soysal et al. [9] proposed a simulation-based restricted dynamic programming approach to solve a VRP that accounts for traffic emissions. Reihaneh [10] proposed a branch-and-price algorithm for the vehicle routing with demand allocation problem. However, exact algorithms are prone to falling into local optima or taking too long to compute when solving large-scale VRPs.
In contrast, intelligent evolutionary algorithms can find satisfactory near-optimal solutions in limited time when solving large-scale VRPs, and many researchers have therefore improved and designed such algorithms. Yong Wang et al. [11] proposed an improved multiobjective genetic algorithm with tabu search (IMOGA-TS), which combines local and global searches and improves the solutions at each iteration by TS, to solve a collaborative multidepot EV routing problem with time windows and shared charging stations (CMEVRPTW-SCS). Pierre [12] proposed a multiobjective genetic algorithm with stochastic partially optimized cyclic shift crossover to solve the multiobjective VRPTW.
A multiobjective model is often transformed into a single-objective model by weighting or other conversion methods. In the VRP, however, the overall objective is typically composed of multiple mutually exclusive objectives, so improving one objective may degrade the others. These objectives may not be linearly related and cannot always be combined into a single weighted objective. Single-objective models are therefore of limited use for the VRP.
Multiobjective optimization problems (MOPs) optimize multiple, typically mutually exclusive objectives at the same time. In recent decades, researchers have proposed many methods for solving MOPs, among which multiobjective evolutionary algorithms (MOEAs) are the mainstream. Jinlong Wang [13] proposed an enhanced algorithm based on SPEA2 (ESPEA) to solve the pickup vehicle scheduling problem with mixed steel storage (PVSP-MSS), which simultaneously optimizes the makespan of the pickup vehicles and the makespan of the steel logistics park. Yadian Geng [14] proposed an improved hyperplane-assisted evolutionary algorithm (IhpaEA) to solve the distributed hybrid flow shop scheduling problem (DHFS), which takes maximum completion time and energy consumption as objectives.
Based on their algorithmic strategies, most MOEAs can be classified into the following categories: (1) algorithms based on the dominance relationship, such as NSGA-II [15], SPEA [1], and PESA-II [16]; (2) algorithms based on decomposition, such as MOEA/D [17], MOEA/D-M2M [18], and RVEA [19]; and (3) algorithms based on performance indicators, such as IBEA [20], SMS-EMOA [21], and DNMOEA/HI [22].
These algorithms yield Pareto solutions. However, for MOPs, the number of Pareto solutions increases with the number of objectives, and in practical scenarios not all solutions are meaningful, especially for VRPs with uncertain times. Under uncertain times, the Pareto solutions most resistant to disturbance are needed, yet uncertainty in VRPs is rarely considered in these algorithms. To address this, robust optimization has been proposed, in which the researcher searches for the solutions with the strongest resistance to interference. Robust optimization [23] means that the solution and its performance remain relatively unchanged under uncertain conditions; a solution is usually evaluated under the most unfavorable uncertainty. Xia et al. [24] proposed a method that sequentially approaches the robust Pareto front (SARPF) from nominal Pareto points to solve MOPs with uncertainties. Jin et al. [25] introduced and discussed existing methods for dealing with different uncertainties and studied the relationships between different categories of uncertainty. Scheffermann et al. [26] compared NSGA-II with an improved predator–prey algorithm on the VRPTW with uncertain travel times. He et al. [27] used a robust multiobjective optimization evolutionary algorithm (RMOEA) to solve robust multiobjective optimization problems (RMOPs); it consists of two parts, multiobjective optimization using an improved NSGA-II and robust optimization that searches for a robust optimal frontier. The robust optimization in this paper is based on Monte Carlo simulation: the practical environment is simulated, and the results approach the practical situation through a large number of simulation experiments. The number of feasible Monte Carlo trials under the disturbance of the uncertainty coefficients is taken as the criterion for assessing the robustness of a solution.
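The feasible-trial criterion described above can be sketched as follows; the route encoding, the uniform disturbance model, and the example deadlines are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import random

def monte_carlo_feasible_count(travel_times, deadlines, n_trials=1000,
                               disturbance=0.2, seed=0):
    """Count how many simulated trials keep every arrival within its deadline.

    travel_times: nominal leg times along one route (depot -> s1 -> s2 -> ...).
    deadlines:    latest allowed arrival time at each station on the route.
    disturbance:  relative width of the uniform perturbation (an assumption).
    """
    rng = random.Random(seed)
    feasible = 0
    for _ in range(n_trials):
        t, ok = 0.0, True
        for leg, deadline in zip(travel_times, deadlines):
            # perturb each leg time by up to +/- disturbance (illustrative model)
            t += leg * (1.0 + rng.uniform(-disturbance, disturbance))
            if t > deadline:
                ok = False
                break
        if ok:
            feasible += 1
    return feasible

# a loose deadline should be feasible far more often than a tight one
loose = monte_carlo_feasible_count([10, 10], [15, 30])
tight = monte_carlo_feasible_count([10, 10], [10, 20])
```

A higher feasible count indicates a route that tolerates more disturbance, which is exactly the robustness criterion used later in Stage 2.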
In the VRP, the AGV operation time, the total travel distance of the AGVs, or the number of AGVs is often used as the objective, while just-in-time (JIT) material distribution is often ignored. The literature discusses the robustness of VRPs, but few papers combine robustness with JIT. JIT pursues just-in-time arrivals, and uncertain times can seriously hinder its realization, so combining JIT with robust optimization improves the possibility of achieving JIT.
To solve the above issues, in this paper, the multiobjective robust VRP model that simultaneously considers uncertainty and JIT strategy is proposed. In addition, an improved ACO algorithm with deep reinforcement learning for assembly workshops is proposed. This study aims to make the following contributions.
(1) A just-in-time-based multiobjective robust vehicle routing problem with time windows (JIT-RMOVRPTW) is constructed which simultaneously optimizes three objectives. The model takes into account the uncertain time by introducing uncertainty coefficients. By defining the robustness metric that combines uncertain time to measure the robustness of the solution, the robustness optimization is defined as a dedicated optimization objective that can filter robust solutions in a better way.
(2) The JIT-RMOVRPTW considers the two conflicting goals of JIT and robustness at the same time. In this model, the JIT strategy under uncertainty is proposed. The traditional time window is eliminated, and the JIT time, which is used as the deadline for the workstation, is proposed. The model eliminates the fixed departure time of AGVs, so that AGVs can flexibly adjust the depot time according to the JIT time to eliminate the waiting time at the workstation generated by early arrival of AGVs.
(3) A two-stage nondominated sorting ant colony algorithm with deep reinforcement learning (NSACOWDRL) is proposed to improve the robustness of the solutions by double screening. The NSACOWDRL divides the problem into two stages: a multiobjective optimization problem and a robust optimization problem. In the first stage, an initial search is performed with the robustness metric as one of the objectives. The niche preservation strategy based on reference points used in NSGA-III [28] is introduced into routing selection. The nondominance rank, the elite path strategy, and the max-min pheromone strategy are introduced into the pheromone update strategy. Furthermore, DDQN is introduced as a local search algorithm: the network is trained on the Pareto solutions, which guide its learning direction, after which the trained network participates in nondominated sorting, and a probabilistic transfer formula based on the network and the objectives is designed. In the second stage, the feasibility of the solutions is quantified through Monte Carlo simulation, and the solutions are then partitioned according to weights uniformly distributed in the solution space. Each partition is screened separately for robustness; the screened set forms the robust Pareto frontier while accounting for the diversity of solutions.
(4) The performance of the proposed algorithm is validated by comparative experiments. The paper also further discusses the effect of network structure on the performance of the algorithm.
The remainder of this paper is organized as follows: Section 2 introduces the background. Section 3 introduces the related VRP in the assembly workshop. Section 4 describes the NSACOWDRL in detail. Section 5 presents the experimental design and analysis. Finally, Section 6 gives conclusions and suggestions for future work.
3. Problem Description and Mathematical Models
3.1. Problem Description
In this paper, the multiobjective multi-AGV routing planning model JIT-RMOVRPTW is constructed. The VRP in the assembly workshop is described as follows:
Each workstation is identified by an index i; the workstations form the node set, and index 0 (as origin or destination) represents the raw material library. There are K AGVs with identical performance in the raw material library. Each AGV has a maximum load capacity Q, a fixed average travel speed, and an unlimited maximum travel distance. Materials are delivered in material carts pulled by the tail hook of an AGV; each AGV can pull at most a given number of carts, and each cart has a given maximum capacity. The assembly workshop contains a known number of workstations requiring a known number of materials in total. The correspondence between workstations and materials is known, as is the total number of material boxes required by each workstation. The position of each workstation is known, as is the distance between stations i and j. AGVs are dispatched to deliver materials to the workstations. During delivery, the transportation time is disturbed by random events such as traffic jams, so the travel time is uncertain; this disturbance time enters the model explicitly. In the JIT-RMOVRPTW, the traditional time window is replaced by a deadline by which the materials of each workstation must be delivered. The materials at each workstation must also be unloaded, and the unloading time is determined by the number of boxes delivered to the station.
The VRP problem in the assembly workshop is based on the following assumptions:
(1) There is only one raw material library in the entire workshop, and the raw material library is able to meet the material requirements of all the workstations.
(2) The material demand at each station is less than the load capacity of the AGV, and the unloading time of the AGV at each station is known.
(3) All the AGVs depart from the same raw material library and must return to the raw material library after unloading the transported material.
(4) AGV acceleration and deceleration, charging, etc., are not considered during vehicle operation.
(5) Materials are placed in standard material boxes, and the material requirements of each workstation cannot be split. The AGV transports materials through the material carts. The boxes of the materials required for the same workstation are loaded in the same material cart, which is not mixed with other workstation materials.
3.2. Mathematical Model
Based on the traditional VRPTW, the JIT-RMOVRPTW adds uncertain time constraints and designs a new metric alpha based on uncertain time as an objective to measure the robustness of the solution. In addition, the JIT-RMOVRPTW considers JIT delivery. JIT time is used as the deadline, and the AGV adjusts the departure time according to the JIT time of the workstation to achieve the objective of the JIT.
The JIT-RMOVRPTW can be formulated as follows:
The three objectives for the JIT-RMOVRPTW, each of which needs to be minimized, can be stated as follows:
Objective 1 (total travel distance):
Objective 2 (robustness metric):
Objective 3 (JIT penalty time):
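As a concrete illustration of the first objective, the total travel distance of a routing plan could be evaluated as below; the route encoding (node 0 for the raw material library) and the distance-matrix layout are assumptions for this sketch.

```python
def total_travel_distance(routes, dist):
    """Objective 1: sum of distances over every AGV tour; node 0 is the
    raw material library, where each tour must start and end."""
    total = 0.0
    for route in routes:
        tour = [0] + list(route) + [0]  # close the loop through the depot
        total += sum(dist[a][b] for a, b in zip(tour, tour[1:]))
    return total

# toy symmetric distances between the depot (0) and two workstations
dist = {0: {0: 0, 1: 4, 2: 5},
        1: {0: 4, 1: 0, 2: 3},
        2: {0: 5, 1: 3, 2: 0}}
d = total_travel_distance([[1, 2]], dist)  # 0->1->2->0 = 4 + 3 + 5 = 12
```

Splitting the stations over two AGVs ([[1], [2]]) gives 8 + 10 = 18 here, which illustrates why route structure, not just assignment, drives this objective.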
The constraints associated with the JIT-RMOVRPTW are added as follows:
Capacity constraints: The number of material boxes on each workstation must not exceed the maximum capacity of the material cart. The number of material boxes delivered by each AGV should not exceed the maximum load capacity of the AGV.
The kinds of materials required at each station and the number of material boxes of each material required by each workstation are known model parameters.
Constraints on the number of material carts: The number of delivery workstations for each AGV must not exceed the number of carts for its own material. The number of material carts of each AGV must not exceed the maximum number of material carts of the AGV.
The number of material carts pulled by each AGV is a known model parameter.
Service time: The service time for each station is determined by the number of material boxes to be delivered to that station.
The time taken to unload a single material box is a known constant.
Distribution constraints: Each workstation can be delivered only once.
AGV arrival number constraints: Only one AGV arrives at each workstation.
Material requirement constraints: Material requirements are met at each workstation.
Loop constraints: Each subloop starts at the raw material library and ends at the raw material library.
Workstation demand time: The JIT objective conflicts with the robustness objective, so a lead time is used as the safety time for the vehicle. The JIT-RMOVRPTW eliminates traditional time windows; instead, the JIT time serves as the deadline of each station (as shown in Figure 1). A slack time is also defined: as shown in Equation (17), the larger the slack time of a station, the higher its JIT requirement.
The arrival time of the AGVs:
Uncertainty constraints: The uncertain time of arriving at each station is composed of an overall random disturbance, the average distance between workstations, and a characteristic coefficient between stations i and j. One of these terms has unique characteristics, so it is discussed separately in Equation (19).
Feasibility of workstations: To measure the robustness of a solution, the metric alpha is proposed in the JIT-RMOVRPTW. Alpha is defined according to the random distribution that fits each scenario. In this paper, the disturbance is taken as a continuous triangular distribution with a given lower limit, peak location, and upper limit. Equation (21) represents the range of random arrival times at a station, and Equation (22) represents the degree of uncertainty between stations i and j. Equation (23) gives the feasibility value of each station. The maximum value of alpha is set to 2 to prevent an excessively large alpha from affecting the realization of the JIT.
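A minimal sketch of this uncertainty model: the triangular disturbance can be drawn with the standard library, and the capped feasibility metric below is a placeholder for Equations (21)–(23), with illustrative parameters.

```python
import random

def sample_disturbance(low, mode, high, rng):
    """Draw one uncertain travel-time disturbance from a triangular
    distribution with the given lower limit, peak location, and upper limit."""
    return rng.triangular(low, high, mode)  # stdlib argument order: low, high, mode

def feasibility_alpha(slack, spread):
    """Illustrative stand-in for the feasibility metric: the ratio of time
    slack to the uncertainty spread, capped at 2 so an oversized alpha
    cannot dominate the JIT objective."""
    if spread <= 0:
        return 2.0
    return min(slack / spread, 2.0)

rng = random.Random(42)
d = sample_disturbance(0.0, 1.0, 3.0, rng)  # always within [0, 3]
```

The cap at 2 mirrors the model's rule that an overly large alpha must not crowd out the JIT requirement.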
Advanced time for materials:
Penalty time:
The minimum advance time of the material required by each station served by an AGV, and the total penalty time of each AGV, are defined. Equation (28) indicates that the departure of each AGV from the raw material library is delayed so that the smallest demand time among the workstations it serves is still satisfied in time. The set of stations served by each AGV is also recorded.
Decision-making variables:
Thus, the JIT-RMOVRPTW can be summarized by three objectives defined by Equations (1)–(4), subject to constraint conditions defined by Equations (5)–(30).
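The flexible departure time used in the model (cf. the role of Equation (28)) can be sketched as follows; the nominal arrival times and deadlines are assumed inputs for this sketch.

```python
def departure_delay(arrival_times, deadlines):
    """Delay the AGV's departure so the tightest station is served just in
    time: the delay equals the smallest advance (deadline - nominal arrival)
    over the stations on the route, and is never negative."""
    advances = [dl - at for at, dl in zip(arrival_times, deadlines)]
    return max(min(advances), 0.0)

# arrivals at t=5 and t=9 with deadlines 8 and 15: the smallest advance is 3,
# so departure can be delayed by 3, removing all early-arrival waiting time
delay = departure_delay([5.0, 9.0], [8.0, 15.0])
```

If some station would already be late (negative advance), the delay collapses to zero and the lateness shows up in the JIT penalty objective instead.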
4. Proposed Algorithm
In this section, the overall NSACOWDRL process is first described. Then, the six components of NSACOWDRL are introduced, which include (1) solution construction, (2) nondominated sorting based on reference points, (3) the pheromone updating strategy, (4) local search based on the DDQN, (5) the probabilistic transfer formula, and (6) the robust optimization method. The optimization process of the NSACOWDRL is given in
Figure 2.
4.1. Framework of the NSACOWDRL
The framework of the NSACOWDRL is given in
Figure 3. The NSACOWDRL consists of two stages: fully searching the solution space to generate Pareto solutions and robust selection of Pareto solutions.
(1) Stage 1: Solution Space Characteristic Exploration and Pareto Solution Generation
Step 1: Initialize the algorithm parameters, including pheromone matrix, distance matrix, neural networks, etc. The initial number of iterations is 0.
Step 2: Obtain the set of feasible routes according to the solution construction strategy (see
Section 4.2 for details).
Step 3: The routings are stratified by nondominated sorting based on the reference points, and N routings are selected for this iteration.
Step 4: The pheromone matrix is updated according to the pheromone update strategy.
Step 5: When the iteration count reaches a specified value, the DDQN network is trained on the nondominated solutions, and N paths obtained from the trained network through action selection participate in the next iteration.
Step 6: Increment the iteration counter.
Step 7: If the maximum number of iterations is reached, output the solution set; otherwise, go to Step 2.
(2) Stage 2: Robust Multiobjective Optimization
Step 1: The Pareto solutions obtained in stage 1 are partitioned by weights that are uniformly distributed in the solution space.
Step 2: Each solution is assigned to the weight closest to this solution.
Step 3: Through Monte Carlo simulation, the number of feasible trials for each solution is obtained to quantify its feasibility.
Step 4: The set of solutions belonging to each weight is then screened according to feasibility to obtain the Pareto solutions for each weight, yielding robust solutions that take diversity into account.
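The Stage 2 steps can be sketched as follows; the Euclidean nearest-weight assignment and the keep-the-most-feasible rule per weight are simplifying assumptions for this sketch.

```python
def robust_select(solutions, weights, top_k=1):
    """Stage-2 sketch: assign each solution (objectives, feasible_count) to
    its nearest weight vector, then keep the most feasible solution(s) in
    each weight's bucket, preserving diversity across the front."""
    buckets = {i: [] for i in range(len(weights))}
    for objs, feas in solutions:
        nearest = min(range(len(weights)),
                      key=lambda i: sum((o - w) ** 2
                                        for o, w in zip(objs, weights[i])))
        buckets[nearest].append((objs, feas))
    selected = []
    for group in buckets.values():
        group.sort(key=lambda s: -s[1])  # higher Monte Carlo count first
        selected.extend(group[:top_k])
    return selected

weights = [(1.0, 0.0), (0.0, 1.0)]
sols = [((0.9, 0.1), 800), ((0.95, 0.05), 600), ((0.1, 0.9), 700)]
picked = robust_select(sols, weights)
```

Because the filter runs per weight rather than globally, a moderately feasible solution in a sparse region survives instead of being crowded out by robust solutions clustered elsewhere.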
4.2. Solution Construction
NSACOWDRL constructs solutions by imitating the behavior of ant colonies foraging for food with the following steps.
Step 1: All ants are initially located in the raw material library. Initialize the ant counter to 0.
Step 2: Each ant sets out independently.
Step 3: The ant calculates the workstations that satisfy the constraints based on the current information and obtains the set of feasible workstations.
Step 4: If the feasible set is empty and the ant does not visit all the workstations, the ant returns to the raw material library and sets out again. Otherwise, calculate the node transfer probability according to the transfer probability formula and select the next node from the feasible set via roulette. Then, update the feasible set.
Step 5: Repeat Steps 3–4 until the ant has served all workstations; record the route of the ant and increment the ant counter.
Step 6: Repeat Steps 2–5 until all ants have constructed routes. The set of feasible routes is obtained.
The pseudo-code for solution construction is as follows (Algorithm 1):
Algorithm 1: Solution Construction
Input: distance matrix; pheromone matrix; number of ants; capacity and cart constraints
Output: set of feasible routes R
1: R ← ∅; ant ← 0
2: while ant < number of ants do
3:   U ← {1, 2, …, n}  (set of undistributed workstations)
4:   depart from the raw materials warehouse: current ← 0
5:   while U ≠ ∅ do
6:     compute the set F ⊆ U of workstations that satisfy the constraints
7:     if F ≠ ∅ then
8:       select the next node j from F via roulette on the transfer probability
9:       update the remaining load and cart capacity; current ← j; remove j from U; append j to the route
10:    else
11:      return to the raw material library: current ← 0; reset the load and cart capacity
12:    end if
13:  end while
14:  append the route to R; ant ← ant + 1
15: end while
4.3. Nondominated Sorting Based on Reference Points
NSACOWDRL uses nondominated sorting based on reference points to obtain Pareto solutions with good diversity. A nondominated route is one that no other route is at least as good as on every objective while being strictly better on at least one. Searching for nondominated routes expands the search range in the solution space, which increases the probability of selecting better routes and improves the diversity of the algorithm.
In the NSACOWDRL, when half of the individuals are selected in each generation, the paths obtained from the current and previous iterations are combined and selected in a nondominated way to obtain a more elite generation; this preserves high-quality individuals while sufficiently biasing selection in favor of individuals with better nondominance ranks.
4.3.1. Nondominated Sorting
After one iteration of the NSACOWDRL, a set of routings is obtained. To improve diversity, a set of feasible routings is formed by mixing the routings of the current generation with those of the previous generation. This mixed set is then stratified by nondominated sorting via the following steps:
First, let n_p denote the number of individuals that dominate individual p. All individuals with n_p = 0 are found by pairwise comparison; they are given nondominance rank 1 and stored in the set F_1.
Then, the individuals dominated by the members of F_1 are found, and n_p is decreased by 1 for each of them. Each individual whose n_p reaches 0 is given nondominance rank 2 and stored in the set F_2.
The above operation is repeated for the remaining individuals until every individual has been assigned a nondominance rank representing the importance of its routing.
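The sorting steps above correspond to the classic fast nondominated sort; a minimal sketch for minimization objectives:

```python
def nondominated_sort(points):
    """Assign each objective vector a nondominance rank (1 = best front),
    using the n_p / S_p bookkeeping of fast nondominated sorting."""
    n = len(points)
    dominates = lambda a, b: all(x <= y for x, y in zip(a, b)) and a != b
    S = [[] for _ in range(n)]   # S[p]: individuals dominated by p
    n_dom = [0] * n              # n_p: how many individuals dominate p
    for p in range(n):
        for q in range(n):
            if dominates(points[p], points[q]):
                S[p].append(q)
            elif dominates(points[q], points[p]):
                n_dom[p] += 1
    rank = [0] * n
    front = [p for p in range(n) if n_dom[p] == 0]
    r = 1
    while front:
        nxt = []
        for p in front:
            rank[p] = r
            for q in S[p]:
                n_dom[q] -= 1
                if n_dom[q] == 0:
                    nxt.append(q)
        front, r = nxt, r + 1
    return rank

ranks = nondominated_sort([(1, 5), (2, 2), (4, 1), (3, 3), (5, 5)])
```

Here (1,5), (2,2), and (4,1) are mutually nondominated (rank 1), (3,3) is dominated only by (2,2) (rank 2), and (5,5) is dominated by all of them (rank 3).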
4.3.2. Diversity Conservation
Because the mixed set contains a large number of solutions, there are many ranks, so some individuals have very low nondominance ranks and little importance. These individuals not only reduce the running efficiency of the algorithm but also cause the pheromone changes of poor individuals to be recorded, which increases the search difficulty and may even lead to a local optimum.
To eliminate this difficulty, a routing selection strategy based on reference points is adopted. The solutions in are selected according to their impact on the diversity of the algorithm. The uniformly distributed reference points are generated through the hyperplane. The individual is selected based on the association between the solutions and reference points. This strategy uses well-distributed reference points to maintain the diversity of populations. The steps are described below:
(1) Individuals are added to a selection set in order of nondominance rank, starting from rank 1, until its size exceeds the maximum number of nondominated solutions N after the individuals of some rank L have been added. The individuals with ranks 1 to (L-1) are kept directly, and the individuals with rank L form the candidate set from which the remaining slots are filled.
(2) Establish reference points on the hyperplane:
The hyperplane has the same tilt toward all coordinate axes and an intercept of one on each axis. For a problem with M objectives, reference points are established on an (M-1)-dimensional hyperplane. For a three-objective problem, each coordinate axis is uniformly divided into H segments, and planes perpendicular to the axes are drawn through the segment points; a reference point is the intersection of such planes from different coordinate axes with the hyperplane. The number of reference points is determined by Equation (31).
Figure 4 shows the hyperplane of three objectives and reference points on the hyperplane when H is set to 5 (a detailed discussion of the reference points is given in
Section 5.4).
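The uniformly spaced reference points described above can be generated by the standard Das-Dennis construction, which is what Equation (31) counts; a sketch:

```python
from itertools import combinations

def reference_points(M, H):
    """Generate the C(H+M-1, M-1) uniformly spaced points on the unit
    simplex (the normalized hyperplane) by the Das-Dennis construction:
    choose M-1 cut positions among H+M-1 slots; the gap sizes between
    cuts, divided by H, give the coordinates."""
    points = []
    for cuts in combinations(range(H + M - 1), M - 1):
        coords, prev = [], -1
        for c in cuts:
            coords.append((c - prev - 1) / H)
            prev = c
        coords.append((H + M - 2 - prev) / H)
        points.append(tuple(coords))
    return points

pts = reference_points(3, 5)  # three objectives, H = 5, as in Figure 4
```

For M = 3 and H = 5 this yields C(7, 2) = 21 points, each summing to 1, matching the layout shown in Figure 4.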
(3) Normalize the individuals:
First, calculate the ideal point, which consists of the minimum value over all individuals in the set on each objective, and subtract the ideal point from every individual.
Then, calculate the extreme points, which have a large value in one objective and small values in the others; M objectives give M extreme points. To find the extreme point of an objective, each individual's value in that objective is divided by a weight of 1 and its other objective values are divided by a weight of 10^-6; the largest of these weighted values is recorded for each individual, and the individual with the smallest such value is the extreme point of that objective.
The (M-1)-dimensional hyperplane is formed from the M extreme points. Then, the intercepts between this hyperplane and the coordinate axes are calculated.
Figure 5 shows the two-dimensional hyperplane formed using three extreme points.
Each extreme point corresponds to one objective, and each intercept is the intersection of the hyperplane with one coordinate axis. Finally, the normalized objective value is calculated by Equation (32), which relates each individual's value in the i-th objective to the ideal point's value in that objective.
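The normalization steps above can be sketched as follows; for simplicity this sketch uses each extreme point's own axis value as the intercept instead of solving for the true hyperplane intercepts, which is an approximation.

```python
def normalize(objs, eps=1e-6):
    """Sketch of the normalization step: translate by the ideal point, find
    the extreme point of each axis with the achievement scalarizing function
    (weight 1 on the axis, eps elsewhere), and divide by the intercepts
    (approximated here by the extreme points' axis values)."""
    M = len(objs[0])
    ideal = [min(o[m] for o in objs) for m in range(M)]
    shifted = [[o[m] - ideal[m] for m in range(M)] for o in objs]
    extremes = []
    for axis in range(M):
        # ASF: the smallest worst-case weighted ratio picks the point
        # hugging this coordinate axis
        def asf(p):
            return max(p[m] / (1.0 if m == axis else eps) for m in range(M))
        extremes.append(min(shifted, key=asf))
    intercepts = [max(extremes[m][m], eps) for m in range(M)]
    return [[p[m] / intercepts[m] for m in range(M)] for p in shifted]

norm = normalize([[1.0, 9.0], [9.0, 1.0], [5.0, 5.0]])
```

After normalization every objective lies on a comparable scale, so the perpendicular-distance association in the next step is not biased toward objectives with large raw magnitudes.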
(4) Link individuals to reference points:
Each reference point is connected to the origin of the coordinates to form a reference line. The vertical distance from each individual in the candidate set to each reference line is calculated, and each individual is associated with the nearest reference line; each individual thereby corresponds to one reference point. Let ρ_j denote the number of already-selected individuals associated with reference point j.
(5) Select individuals:
The selection operation proceeds in increasing order of ρ_j. Starting from the reference point with the smallest ρ_j (choosing randomly if several reference points tie), if ρ_j = 0 and the candidate set contains individuals associated with this reference point, the individual with the smallest distance to the reference line is selected and added to the next population. If no associated individual exists in the candidate set, this reference point is excluded from the remaining operations. If ρ_j ≥ 1 and the candidate set contains associated individuals, one of them is selected at random. The above operation is repeated until the population reaches the required size.
This method acts as a niche preservation operation: it maintains the diversity of the algorithm by selecting solutions in the regions where the previous Pareto front is sparser. Because the reference points are generated on a hyperplane, the selection operation generalizes easily to many-objective problems and addresses the difficulty of maintaining the diversity of solution sets in high-dimensional objective spaces.
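The association step underlying this selection can be sketched as computing perpendicular distances to the reference lines; the two-objective reference points below are illustrative.

```python
import math

def associate(point, ref_points):
    """Associate a normalized individual with the reference line (from the
    origin through a reference point) of smallest perpendicular distance."""
    best_j, best_d = -1, math.inf
    for j, r in enumerate(ref_points):
        norm_r = math.sqrt(sum(c * c for c in r))
        # length of the point's projection onto the unit direction of r
        t = sum(p * c for p, c in zip(point, r)) / norm_r
        # perpendicular distance by Pythagoras (clamped against rounding)
        d = math.sqrt(max(sum(p * p for p in point) - t * t, 0.0))
        if d < best_d:
            best_j, best_d = j, d
    return best_j

refs = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
j = associate((0.9, 0.1), refs)  # closest to the (1, 0) direction
```

Counting how many individuals each reference line attracts gives the niche counts ρ_j used by the selection loop.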
4.4. Pheromone Update Strategy
The pheromone update strategy guides the algorithm update through a positive feedback mechanism; however, if the positive feedback mechanism is too strong, local convergence problems can easily occur, and if the mechanism is too weak, the algorithm will converge more slowly. To improve the convergence and diversity of the algorithms, this paper proposes a pheromone update strategy that combines the nondominated ranks with the elite routing strategy and the max-min pheromone update strategy. In this strategy, the pheromones of the routes are updated according to their nondominated rank, which comprehensively considers the influence of multiple objectives. Then, the elite paths with better nondominated ranks are updated twice to improve the convergence of the algorithm. Finally, to avoid the algorithm falling into a local optimum due to an excessively large difference between pheromone concentrations, the max-min pheromone strategy is adopted to limit the range of pheromone concentrations.
The update strategy is as follows:
(1) The pheromone update process is shown in Equations (33)–(35):
where ρ (0 < ρ < 1) represents the pheromone volatilization rate; the pheromone increment of path (i, j) is the sum of the pheromone increments left by all ants that traverse path (i, j) in the cycle; each ant releases its own amount of pheromone on each edge it uses; and the elite paths receive an additional pheromone increment released by the elite ants.
(2) Common ant pheromone update strategy:
The pheromone is updated for all routings according to their nondominance rank: each ant deposits an amount determined by a pheromone constant, a nondominance constant, and the nondominance rank of its routing in this iteration, so that routings with better ranks deposit more pheromone.
(3) Elite routing pheromone update strategy:
Pheromone concentrations are additionally updated for elite routings with a nondominance rank of 1 or 2.
(4) Max-min pheromone limits: Since there are several nondominance ranks, large differences in pheromone concentration tend to arise between paths after the update. An excessively large difference can cause premature stagnation of the search and make the algorithm more susceptible to local optima. To avoid this, explicit limits are imposed on the minimum and maximum pheromone concentrations in the update strategy, as shown in Equation (38); after each iteration, the pheromone concentrations must be kept within these limits.
In Equation (38), the base of the logarithm is used to control the range of pheromone concentrations, together with a limiting factor and the number of nodes.
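The combined update strategy can be sketched as follows; the deposit formula Q / (w + rank) and the double deposit for elite routes are illustrative stand-ins for Equations (33)–(38), not the paper's exact expressions.

```python
def update_pheromone(tau, routes_with_rank, rho=0.1, Q=1.0, w=1.0,
                     tau_min=0.01, tau_max=10.0):
    """Sketch of the combined strategy: evaporation, a rank-weighted deposit
    for every route, a second deposit for elite routes (rank 1 or 2), and
    max-min clamping of the final concentrations."""
    for i in tau:
        for j in tau[i]:
            tau[i][j] *= (1.0 - rho)        # evaporation
    for route, rank in routes_with_rank:
        deposit = Q / (w + rank)            # better rank -> larger deposit
        passes = 2 if rank <= 2 else 1      # elite routes deposit twice
        for a, b in zip(route, route[1:]):
            tau[a][b] += passes * deposit
    for i in tau:                           # max-min pheromone limits
        for j in tau[i]:
            tau[i][j] = min(max(tau[i][j], tau_min), tau_max)
    return tau

tau = {0: {1: 1.0}, 1: {0: 1.0}}
tau = update_pheromone(tau, [([0, 1], 1)])  # one rank-1 (elite) route 0 -> 1
```

Edge (0, 1) evaporates to 0.9 and then receives the doubled elite deposit of 1.0, while the unused edge (1, 0) only evaporates; the clamp keeps both within [tau_min, tau_max].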
4.5. Local Search Strategy
The DDQN [56] is an off-policy DRL algorithm improved from the DQN. The DDQN decouples the selection of the target Q-value action from the computation of the target Q value to address the maximization bias of the DQN, which results in overestimation.
The decoupling process is as follows:
The DDQN constructs two networks, the evaluation network and the target network; their structures are identical, but their weight parameters differ.
First, the DDQN selects the action corresponding to the maximum Q value by the evaluation network, as shown in Equation (40):
Then, the target network calculates the target Q value as shown in Equation (41):
Combining the two steps, the target Q value is given by Equation (42):
As shown in Equation (43), the DDQN adopts the mean squared error (MSE) between the estimated Q value and the target Q value as the loss function, and updates the evaluation network by backpropagation with gradient descent.
The DDQN learns from past experience offline by building a replay buffer. After a number of rounds of replay-buffer sampling, the weight parameters of the evaluation network are copied to the target network, realizing model self-learning.
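The decoupled target computation of Equations (40)-(42) and the MSE loss of Equation (43) can be sketched as follows; the function names are illustrative, but the update rule is the standard DDQN one the text describes:

```python
import numpy as np

def ddqn_targets(q_eval_next, q_target_next, rewards, gamma, dones):
    """Compute DDQN targets for a sampled minibatch.

    q_eval_next / q_target_next: (batch, n_actions) Q values for the next
    states from the evaluation and target networks, respectively.
    """
    # Step 1 (Eq. 40): the evaluation network selects the greedy action.
    best_actions = np.argmax(q_eval_next, axis=1)
    # Step 2 (Eq. 41): the target network evaluates that chosen action.
    q_next = q_target_next[np.arange(len(best_actions)), best_actions]
    # Eq. 42: bootstrapped target; terminal states keep only the reward.
    return rewards + gamma * q_next * (1.0 - dones)

def mse_loss(q_estimate, q_target):
    """Eq. 43: mean squared error between estimated and target Q values."""
    return float(np.mean((q_estimate - q_target) ** 2))
```

Because the argmax comes from the evaluation network while the value comes from the target network, a single network's overestimated action no longer inflates its own target, which is exactly the bias the decoupling removes.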
To solve the MOPs, the DDQN adopts the Pareto solution set to train the network and guide the learning direction (details are in
Section 4.5.1 and
Section 4.5.2). The pseudo-code is shown in the table below (Algorithm 2):
Algorithm 2: Double DQN in MOPs
Input: replay buffer size; epsilon; discount factor; target update interval; the Pareto frontier of the current iteration
Pre-train the evaluation network on the Pareto frontier
for each training episode do
  initialize the state sequence
  for each step do
    select an action epsilon-greedily, execute it, and receive the reward and next state
    compute the overall learning rate and correct the reward
    store the transition in the replay buffer, replacing the oldest tuple when the buffer is full
    sample a minibatch of tuples and update the evaluation network by gradient descent on the DDQN loss
    every fixed number of steps, copy the evaluation-network weights to the target network
  end
end
4.5.1. Environment Setup
The evolutionary algorithm guides the reinforcement learning for searching. The following is the process of establishing the DDQN environment and directing its learning direction:
(1) Neural Network Construction
The input to the problem in this paper is a three-dimensional vector of low dimensionality, so a function-fitting neural network is constructed, structured as a fully connected network with three hidden layers. The input layer has three nodes, and each hidden layer has 40 nodes.
The activation functions are the linear transfer function (purelin), the saturating linear transfer function (satlin), and the positive linear transfer function (poslin). Using different activation functions in different layers allows the multilayer network to approximate arbitrary functions.
The network is trained with the variable-learning-rate backpropagation algorithm (traingdx), which uses momentum to accelerate convergence and an adaptive learning rate to improve stability and accuracy, making gradient descent smoother.
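The forward pass of such a network can be sketched as below. The three transfer functions have simple definitions (purelin is the identity, satlin clips to [0, 1], poslin is a ReLU); which layer receives which activation is an assumption here, since the text only lists the three functions:

```python
import numpy as np

def poslin(x):   # positive linear transfer function (ReLU)
    return np.maximum(x, 0.0)

def satlin(x):   # saturating linear transfer function, clipped to [0, 1]
    return np.clip(x, 0.0, 1.0)

def purelin(x):  # linear transfer function (identity)
    return x

class QNetwork:
    """3 inputs -> three fully connected hidden layers of 40 nodes -> 1 output.

    The assignment of activations to layers is an assumption; the paper only
    states that purelin, satlin, and poslin are used.
    """
    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [3, 40, 40, 40, 1]
        self.weights = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes, sizes[1:])]
        self.biases = [np.zeros(b) for b in sizes[1:]]
        self.acts = [poslin, satlin, poslin, purelin]

    def forward(self, x):
        for w, b, act in zip(self.weights, self.biases, self.acts):
            x = act(x @ w + b)
        return x
```

The linear output layer (purelin) matters: Q values are unbounded regression targets, so the last layer must not saturate.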
The network structure is shown in
Figure 6.
After the initial construction of the neural network is completed, its parameters must be initialized by training. To improve the effectiveness of parameter training and the rate at which the network approximates the target, the Pareto frontier, named Hpareto, selected by nondominated sorting from the solution set G, is chosen as the input for initial training; meanwhile, the set with the largest nondominated rank is also selected. The mean squared error between the estimated Q value and the reward corresponding to Hpareto is then used as the loss function to train the initial evaluation network, as shown in Equation (44), where the reward is that obtained by Hpareto in the environment and the estimated Q value is that produced by the evaluation network.
(2) Reward Return Design
This paper studies the multiobjective VRP, in which the reward value is related to the pheromone between routes, the heuristic function, and the advance time. The rewards are shown in Equation (45):
This equation fully accounts for the multiple objectives, so that when an agent completes an action choice that changes the state, the corresponding reward also changes. This design effectively maps the multiple objectives of the study onto the rewards of deep reinforcement learning.
For the VRP, locally optimal subpaths can easily trap the algorithm in a local optimum, so this paper proposes an overall learning rate. After action selection produces a complete path covering all workstation requirements, the IGD [57] metric of this path with respect to the Pareto frontier Hpareto and the worst-ranked set is calculated; the overall learning rate is then computed according to Equation (46), and the reward is updated according to Equation (47). The overall learning rate quantifies the quality of the obtained path as a ratio of its IGD distances from the two sets, correcting the rewards of the path's subpaths so that paths with smaller IGD gain a larger advantage, whereas paths with larger IGD incur a larger disadvantage.
The structure of the DDQN algorithm is shown in
Figure 7.
4.5.2. Output Designs
The reinforcement learning results are used to improve the optimization performance of the evolutionary algorithm.
The training of the DDQN neural network relies on the replay buffer, and the solution sets obtained in adjacent iterations of the algorithm are highly similar, so the replay buffers of adjacent iterations differ little, making effective parameter optimization difficult. To improve the effectiveness of network optimization, the network is therefore trained only at specific iteration counts. For the evolutionary algorithm, the more iterations have elapsed, the better the current results and the more further iterations are needed to improve them, so network training is scheduled at generations 50, 100, and 200.
The trained network obtains 40 complete paths by action selection, which are recorded in a set that participates in nondominated sorting in the next iteration. The pseudo-code is as follows (Algorithm 3):
Algorithm 3: Route Generation by DDQN
Input: the trained evaluation network; epsilon; N: the number of routes to generate
while fewer than N complete routes have been generated do
  while the current route is incomplete do
    if the set of feasible next nodes is not empty then
      with probability epsilon select a random feasible node; otherwise select the feasible node with the maximum Q value
    else
      return to the depot and start a new subroute
    end
  end
end
Output: the set of generated routes
In the probabilistic selection of nodes, the current workstation and the candidate next workstation are used as inputs to the trained network to obtain the Q value between workstations, which participates in the probabilistic selection as shown in Equation (48), where the symbols denote the state of the current node, the next node, and the set of feasible nodes, respectively.
4.6. Node Transfer Probability Rule
NSACOWDRL uses a roulette wheel for node selection. The probability used in the roulette wheel is chosen by Equation (49), whose symbols denote, respectively, the probabilistic selection method, selection based on the probability formula, selection based on the Q table, and the set of specific iteration numbers.
In the JIT-RMOVRPTW, the traditional time windows are eliminated, so the time-window width is removed. The movement of ants is related to the pheromone concentration, path distance, waiting time, and the Q-table values. As shown in Equation (50), the probability formula of the JIT-RMOVRPTW is derived from the mathematical model of the problem.
where the symbols denote, respectively, the set of all neighboring nodes of the current node; the node that the ant is allowed to select next; the pheromone concentration of each pathway; the heuristic function, as shown in Equation (52); the Q value, as shown in Equation (48); the pheromone concentration factor; the heuristic function factor; the Q-value factor; and the waiting-time factor.
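Roulette-wheel selection over this kind of multiplicative weight can be sketched as follows. Since Equation (50) is not reproduced here, the product of the pheromone, heuristic, Q-value, and waiting-time terms raised to their factors is an assumption modeled on standard ACO transition rules; the factor names alpha, beta, gamma, delta are illustrative:

```python
import random

def select_next_node(candidates, tau, eta, q, wait,
                     alpha=1, beta=2, gamma=2, delta=2, rng=random):
    """Roulette-wheel selection over the feasible next nodes.

    tau, eta, q, wait: dicts mapping a candidate node to its pheromone,
    heuristic value, Q value, and waiting-time term. The multiplicative
    weight below is a sketch of, not a reproduction of, Equation (50).
    """
    weights = [(tau[j] ** alpha) * (eta[j] ** beta)
               * (q[j] ** gamma) * (wait[j] ** delta) for j in candidates]
    r = rng.uniform(0, sum(weights))   # spin the wheel
    acc = 0.0
    for node, w in zip(candidates, weights):
        acc += w
        if acc >= r:
            return node
    return candidates[-1]              # guard against float round-off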
4.7. Robust Multiobjective Optimization
The traditional robust selection strategy retains only the solution set with the highest feasibility. However, because the multiple objectives conflict, the other objectives of the most feasible solution are rarely close to optimal. In practical problems, moderate delays are acceptable, so this paper proposes a robust selection method based on uniformly distributed weights in the solution space combined with Monte Carlo simulation. Since the solutions in a neighborhood centered at a point in the solution space are similar to some degree, this method splits the solution space by uniform weights and converts the feasibility of each solution into a Monte Carlo feasibility count. Finally, the optimal set of solutions is found separately in the region corresponding to each weight. This strategy obtains solutions that account for robustness while guaranteeing diversity across the objectives.
Through this strategy, the more feasible solutions can be retained, and at the same time, based on the uniformly distributed weights, the solutions with lower feasibility but better other objectives are also retained, so as to provide the decision maker with more decision-making options in the face of different practical needs.
First, the solution space is partitioned by uniformly distributed weights, as shown in the
Figure 8, with different colors representing different weights. Each weight represents a section of the solution space.
Then, each solution is associated with the weight closest to it. Following this, Monte Carlo simulation is used to simulate each solution 1000 times to obtain its Monte Carlo feasibility count; the pseudo-code is shown in Algorithm 4. Each solution can then be represented together with its feasibility count.
Algorithm 4: Monte Carlo Simulation
Input: a set of routes; the travel times; the number of Monte Carlo simulations
Initialize an empty set of Monte Carlo feasibility counts
for each solution in the set of routes do
  for each simulation do
    sample the travel times under the disturbance distribution
    if the sampled routes satisfy the constraints then
      increment the feasibility count of this solution
    end
  end
end
Output: the Monte Carlo feasibility count of each solution
For each weight, if at least one solution is associated with it, the solution with the highest Monte Carlo feasibility count is found; solutions associated with the same weight whose count falls below the resulting threshold are deleted. The remaining solutions form the optimal solution set of the region represented by that weight. The final result is shown in
Figure 9: the set of robust Pareto solutions that preserves the diversity of the algorithm, where points of a given color correspond to the weight of the same color.
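The simulation-and-filter stage can be sketched as below. The feasibility test and the per-weight deletion threshold (`keep_ratio` times the group maximum) are assumptions, since the exact constraint check and threshold are defined by the model, not reproduced here:

```python
import random

def monte_carlo_feasible_times(sample_route_time, budget, n_sim=1000, rng=random):
    """Count how many of n_sim sampled scenarios keep a route feasible.

    sample_route_time: callable drawing one sampled total time for the route.
    The test (sampled time <= budget) is a simplified stand-in for the
    model's full constraint check.
    """
    return sum(1 for _ in range(n_sim) if sample_route_time(rng) <= budget)

def robust_select(solutions, weights, keep_ratio=0.8):
    """Associate each solution with its nearest weight, then keep, per weight,
    only solutions whose Monte Carlo count is close to that weight's maximum.

    solutions: list of (objective_vector, mc_count); weights: list of vectors.
    keep_ratio is an illustrative threshold, not the paper's exact rule.
    """
    def nearest(obj):
        return min(range(len(weights)),
                   key=lambda i: sum((o - w) ** 2 for o, w in zip(obj, weights[i])))
    groups = {}
    for obj, mc in solutions:
        groups.setdefault(nearest(obj), []).append((obj, mc))
    kept = []
    for sols in groups.values():
        best = max(mc for _, mc in sols)
        kept.extend(s for s in sols if s[1] >= keep_ratio * best)
    return kept
```

Because the filter is applied per weight rather than globally, a region whose best count is modest still contributes solutions, which is how diversity is preserved alongside robustness.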
5. Experiments and Analyses
The comparison experiments used to test the performance of NSACOWDRL are discussed in this section. We compare the performance of multiple multiobjective algorithms on different instances, of NSACOWDRL under different JIT models, and of NSACOWDRL under different network settings.
5.1. Experimental Setting
To test the performance of NSACOWDRL, four instances of Solomon's VRPTW benchmark, named C202, C206, RC202, and RC204, and one instance named RL, built from a real case at a manufacturer in China, are adopted; each instance is tested with three different interference coefficients.
Based on the model and the experience of the manufacturer, the following modifications were made to the instances:
(1) The company divides the distribution time of one day into several cycles, and different distribution tasks are performed in each cycle.
(2) All the workstations are available all day.
(3) The assembly workshop has standardized the delivery of materials. All the materials were delivered in standardized boxes, and all the materials required at the same station during a cycle were placed in the same carts.
(4) In this case, due to the constraints of the assembly workshop and the safety of the AGV operation, the maximum number of carts is nine.
Table 1 shows the dataset of RL.
Figure 10 shows a simple illustrative scenario to visualize the layout and process of material delivery. The dotted lines indicate the routes on which the AGVs can run. The small rectangles joined together represent workstations. The black dots indicate the stations that need to be distributed.
The NSACOWDRL was executed for 250 generations with 40 ants. The pheromone factor, the heuristic factor, γ, and δ in the node transition probability formula were set to 1, 2, 2, and 2, respectively. One parameter in the pheromone update formula was set to 0.2 and another to 10. In the DDQN, the buffer size was set to 60,000, epsilon to 0.9, gamma to 0.9, and the target update interval to 1200. The parameters selected were not necessarily optimal for each model; this paper did not strictly tune them.
5.2. Performance Metrics
In MOPs, the diversity and convergence of the algorithm are the main performance aspects to be evaluated, and a single metric can hardly measure both comprehensively. In this paper, two performance metrics, HV [1] and IGD, are adopted to evaluate the algorithm. These metrics are described below:
As shown in Equation (53), the HV (hypervolume) metric measures the diversity of an algorithm by calculating the volume of the objective space enclosed by the set of nondominated solutions and a reference point. The larger the HV, the better the diversity of the algorithm. The symbols in Equation (53) denote the Lebesgue measure, which is used to measure volume; the number of nondominated solutions; and the hypervolume formed by the reference point and the i-th solution in the solution set.
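For two objectives, the hypervolume reduces to an area and can be computed exactly by sweeping the sorted front; this small sketch (minimization assumed, reference point dominating all solutions) illustrates the metric:

```python
def hypervolume_2d(points, ref):
    """Area dominated by a 2-D minimization front with respect to ref.

    points: nondominated (f1, f2) pairs, each coordinate <= the matching
    reference coordinate. Sorting by f1 makes f2 strictly decrease along a
    nondominated front, so the dominated region decomposes into disjoint
    rectangles capped by the reference point.
    """
    pts = sorted(points)
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)   # one rectangle per point
        prev_f2 = f2
    return hv
```

For three or more objectives (as in this model) the same idea requires more elaborate sweeping or Monte Carlo estimation; the 2-D case is shown only to make the definition concrete.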
As shown in Equation (54), the IGD (inverted generational distance) metric is a comprehensive performance measure that evaluates the convergence and distribution of the algorithm by calculating the average minimum distance from each point on the true Pareto front (PF) to the set of individuals obtained by the algorithm. The smaller the IGD, the better the convergence and distribution performance. The terms in Equation (54) are the solution set obtained by the algorithm; a set of uniformly distributed reference points sampled from the Pareto front; and the Euclidean distance between a point in the reference set and a point in the solution set.
Because the true Pareto front is difficult to obtain in real scenarios, the PF is approximated by all nondominated solutions obtained by all algorithms.
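The IGD definition in Equation (54) translates directly into a few lines; this sketch works for any number of objectives:

```python
import math

def igd(reference_front, solution_set):
    """Inverted generational distance (Eq. 54): the mean, over reference
    points, of the minimum Euclidean distance to the obtained solution set."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return (sum(min(dist(r, s) for s in solution_set) for r in reference_front)
            / len(reference_front))
```

Note the direction of the average: iterating over the reference front (not the solution set) is what makes IGD penalize both poor convergence and gaps in coverage.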
In this experiment, the normalized metrics were used to unify the measures between the different objectives.
5.3. Experimental Results
Based on the dataset, experiments were carried out on the JIT-RMOVRPTW models to verify the effectiveness of NSACOWDRL. To demonstrate its superiority on multiobjective problems, NSACOWDRL is compared with NSACO, NSGA-III, NSGA-II, and MOEA/D. Twenty independent runs were used to test the performance of each algorithm in each experiment, and the results are reported in terms of HV and IGD.
5.3.1. Results of the JIT-RMOVRPTW
(1) Analysis of the Performance Metric
Table 2 and
Table 3 present the mean and variance of the HV and IGD metrics for different algorithms on the JIT-RMOVRPTW dataset with three interference coefficients, respectively. The leftmost column represents the instance numbers. The second column represents three different interference coefficients, which are triangularly distributed. “Small”, ”Middle”, and “Large” represent interference coefficients from small to large. For the two rows that follow each disturbance coefficient, the top row represents the mean of the metrics obtained after the corresponding algorithm has been run 20 times independently with this disturbance coefficient, and the bottom row represents the variance of the metrics obtained.
As shown in
Table 2 and
Table 3, NSACOWDRL performs better, with superior and more stable metrics across the instances. In comparison, on RL, RC204, and RC202, as the interference coefficient increases, the performance of NSGA-II improves while that of MOEA/D decreases; on C202 and C206, the opposite holds. Meanwhile, the performance of NSGA-III is more stable. The performance of the algorithms is strongly influenced by the instances: NSACO, NSGA-II, NSGA-III, and MOEA/D each have advantages on different instances, whereas NSACOWDRL is always superior.
To determine the differences between the algorithms on different metrics, a two-by-two comparison
p value test was carried out. As shown in
Figure 11,
Figure 12,
Figure 13,
Figure 14 and
Figure 15, the
p value is represented by the heatmap. The
X-axis is the interference coefficient, and the
p values at different interference coefficients are shown on either side of the breakpoint in the
X-axis. The upper triangle of each large square is the
p value for HV and the lower triangle is the
p value for IGD, where
p values greater than 0.05 are marked in black.
The p-value plots show that the algorithms differ significantly between instances and between interference coefficients. NSACOWDRL is significantly better than the other algorithms, especially on C202, C206, and RL, whereas on RC202 and RC204 the performance of MOEA/D is close to that of NSACOWDRL in terms of IGD. Meanwhile, as the interference coefficient increases, the variation between algorithm performances decreases on C202, C206, and RC204, whereas on RC202 and RL it fluctuates. Algorithm performance is affected by the combination of instance and interference coefficient; as the interference coefficient increases, the variation between algorithms is affected more in the RC-series instances than in the C-series.
As the result shows, the NSACOWDRL outperforms the other algorithms in terms of convergence and diversity when solving the JIT-RMOVRPTW. Furthermore, the NSACOWDRL has better generalization in solving problems for different scenarios.
5.3.2. Discussions
The results of the algorithms on the different instances show superiority of the NSACOWDRL on the proposed model. To evaluate the impact of multiple JIT strategies on the performance of the algorithm, the following three JIT strategies are compared:
Strategy 1: Consider transportation capacity and the slack time.
Strategy 2: Consider the slack time.
Strategy 3: Schedule advance time for the time window.
The results are shown in
Table 4. "Mean gap" represents the gap between the best and worst means across the interference coefficients of the same strategy. NSACOWDRL performs better in both models, and its solutions have more stable metrics. In comparison, across the models, as the interference coefficient increases, NSGA-III gradually approaches NSGA-II in convergence and finally falls below it. Meanwhile, the diversity of MOEA/D becomes progressively weaker than that of NSGA-III, and its convergence weakens toward that of NSGA-II. Faced with different JIT strategies, NSGA-II, NSGA-III, and MOEA/D each have strengths and weaknesses, while NSACOWDRL always retains its superiority.
Thus, NSGA-III, and MOEA/D are susceptible to interference coefficients and JIT strategies; however, the NSACOWDRL has better generalizability when faced with different models.
By analyzing the “mean gap”, as the interference coefficient increases, the HV and IGD metrics of NSACOWDRL of Strategy 1 have the smallest degradation, followed by Strategy 2, and the algorithm of Strategy 3 has the most degradation. Compared with other strategies, Strategy 1 can maintain the performance of the algorithm better, which verifies the superiority of Strategy 1.
5.4. Different Parameter Settings
In the previous experiments, this paper did not make any serious attempt to find optimal parameter settings for NSACOWDRL. For the DDQN, the type of neural network affects the performance of the algorithm. In this section, based on the proposed model, the neural network of the DDQN is set as a back-propagation (BP), radial basis function (RBF), general regression neural network (GRNN), and product-based network (PNN), respectively, to test the effects of different networks.
Table 5 shows the performance of the algorithm with different networks.
Figure 16 shows the
p values of the metrics. The comparison shows that PNN has the best convergence and diversity, GRNN is closest to PNN in performance, and RBF performs worst. At the same time, as the interference coefficient increases, the variability between the networks decreases.
For illustration,
Figure 17 shows the sets of Pareto solutions in the approximate PF obtained by each algorithm under different interference coefficients. The interference coefficient increases from left to right and then from top to bottom across the subplots. As the interference coefficient increases, each algorithm obtains more Pareto solutions, more widely distributed in the solution space, and the variation in distribution between algorithms decreases. The figure shows that the solutions of NSACOWDRL_PNN are distributed at all levels between (0, 1) on the axis in all subplots, while the solutions of NSACOWDRL_RBF are concentrated in two blocks. In terms of solution-space distribution, NSACOWDRL_PNN is comparable to NSACOWDRL_GRNN, and both are better than the other algorithms.
These results show that, when tuning the algorithm, choosing an appropriate network can effectively improve performance. The appropriate neural network should be selected according to the type of problem: a Convolutional Neural Network (CNN) can be used for discrete action-space problems, while a GRNN suits nonlinear problems; different problems thus require different network architectures and tuning.
6. Conclusions
This paper constructs the multiobjective AGV route-planning mathematical model JIT-RMOVRPTW for assembly workshops, which considers uncertainty in transportation time. Based on the constraints, the metric alpha, which measures the feasibility of a solution under uncertainty, is introduced into the model. To solve the problem, the two-stage NSACOWDRL is proposed. In stage 1, NSACOWDRL adopts a nondominated routing selection method with niche protection based on reference points to obtain the set of nondominated routes, ensuring the diversity of the algorithm. The nondominated rank is introduced into the pheromone update strategy, which adopts elite routing updates to protect better routings and uses the max-min pheromone strategy to prevent the algorithm from falling into local optima due to overly large gaps between pheromone concentrations. The DDQN is then introduced as a local search algorithm whose networks are trained on nondominated solution sets; the trained network generates routes that participate in the next nondominated sorting and also enters the probability formula. A state-transfer probability formula based on the objectives, the DDQN, and the constraints is proposed. In stage 2, the feasibility of the nondominated solutions obtained in stage 1 is quantified by Monte Carlo simulation as feasibility counts, and robust selection based on uniform weights in the solution space then yields the Pareto solution set. The JIT-RMOVRPTW better resolves the conflict between punctuality and robustness in uncertain environments, and the multiobjective problem is solved more effectively in the multidimensional space. NSACOWDRL complements evolutionary algorithms with deep reinforcement learning to enhance optimization capability.
The experimental results among the different algorithms under different disturbance coefficients verify the superiority of the NSACOWDRL in terms of diversity and convergence in the robust VRP problem. A comparison of the models for different JIT strategies also reveals the superiority of the proposed model. The experiments demonstrate that NSACOWDRL has better generalization in different scenarios.
The uncertainty of material requirements caused by volatile processing times, equipment failures, and other issues will be considered further. Future research can increase the complexity of the model and further optimize the material distribution scheme. The comparison experiments with different networks show that the network structure of the DRL component has a large impact on model performance; research on DRL models therefore needs to be deepened further.